* [PATCH v6 0/2] block/dm: support bio polling
From: Mike Snitzer
Date: 2022-03-07 18:53 UTC
To: axboe
Cc: ming.lei, hch, dm-devel, linux-block

Hi,

I've rebased Ming's latest [1] on top of dm-5.18 [2] (which is based on
for-5.18/block). The end result is available in the dm-5.18-biopoll
branch [3].

These changes add bio polling support to DM. Tested with linear and
striped DM targets. IOPS improvement was ~5% on my bare-metal system
with a single Intel Optane NVMe device (561K hipri=1 vs 530K hipri=0).

Ming has seen better improvement while testing within a VM:
  dm-linear: hipri=1 vs hipri=0: 15~20% IOPS improvement
  dm-stripe: hipri=1 vs hipri=0: ~30% IOPS improvement

I'd like to merge these changes via the DM tree when the 5.18 merge
window opens. The first block patch, which adds ->poll_bio to
block_device_operations, will need review so that I can take it through
the DM tree. The reason for going through the DM tree is that some
fairly extensive changes queued in dm-5.18 build on for-5.18/block, so
I think it easiest to add the block dependency via the DM tree since DM
is the first consumer of ->poll_bio.

Thanks,
Mike

[1] https://github.com/ming1/linux/commits/my_v5.18-dm-bio-poll
[2] https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.18
[3] https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.18-biopoll
[4] https://github.com/ming1/linux/commit/c107c30e15041ac1ce672f56809961406e2a3e52

v6: Ming switched from reusing .bi_end_io to .bi_private and added a
    comment to the comment block above dm_get_bio_hlist_head, as
    suggested by Jens.
v5: removed WARN_ONs in the ->poll_bio interface patch; fixed a comment
    typo along the way (found while seeing how other
    block_device_operations are referenced in block's code comments).

Ming Lei (2):
  block: add ->poll_bio to block_device_operations
  dm: support bio polling

 block/blk-core.c       |  14 +++--
 block/genhd.c          |   4 ++
 drivers/md/dm-core.h   |   2 +
 drivers/md/dm-table.c  |  27 ++++++++++
 drivers/md/dm.c        | 143 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/blkdev.h |   2 +
 6 files changed, 184 insertions(+), 8 deletions(-)

--
2.15.0
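For context on the hipri=1 vs hipri=0 numbers above: fio's hipri option marks each read/write as high-priority polled I/O, which on Linux maps to the preadv2(2)/pwritev2(2) RWF_HIPRI flag (REQ_POLLED inside the kernel). A minimal sketch of issuing such a read from userspace follows; the file name and helper are illustrative, and an ordinary buffered temp file is used here, so the flag is merely accepted rather than actually exercising a polled queue (real polling needs a poll-capable device and typically O_DIRECT):

```python
import os
import tempfile

def hipri_read(path, length, offset=0):
    """Read `length` bytes at `offset` using preadv2(2) with RWF_HIPRI.

    RWF_HIPRI asks the kernel to poll for completion instead of waiting
    for an interrupt; on files/queues that don't support polling it is
    best-effort and the read completes normally.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(length)
        flags = getattr(os, "RWF_HIPRI", 0)  # fall back to 0 if unavailable
        n = os.preadv(fd, [buf], offset, flags)
        return bytes(buf[:n])
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"polled payload")
        path = f.name
    print(hipri_read(path, 14).decode())
    os.unlink(path)
```

This is roughly what a fio job with `rw=randread, ioengine=pvsync2, hipri=1` does per I/O, minus the random offsets and timing.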
* [PATCH v6 1/2] block: add ->poll_bio to block_device_operations
From: Mike Snitzer
Date: 2022-03-07 18:53 UTC
To: axboe
Cc: linux-block, dm-devel, hch, ming.lei

From: Ming Lei <ming.lei@redhat.com>

Prepare for supporting IO polling for bio-based drivers.

Add a ->poll_bio callback so that bio-based drivers can provide their
own logic for polling bios.

Also fix the ->submit_bio_bio typo in the comment block above
__submit_bio_noacct.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c       | 14 +++++++++-----
 block/genhd.c          |  4 ++++
 include/linux/blkdev.h |  2 ++
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 94bf37f8e61d..ce08f0aa9dfc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -708,7 +708,7 @@ static void __submit_bio(struct bio *bio)
  *
  * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
  * bio_list_on_stack[1] contains bios that were submitted before the current
- * ->submit_bio_bio, but that haven't been processed yet.
+ * ->submit_bio, but that haven't been processed yet.
  */
 static void __submit_bio_noacct(struct bio *bio)
 {
@@ -975,7 +975,7 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags)
 {
         struct request_queue *q = bdev_get_queue(bio->bi_bdev);
         blk_qc_t cookie = READ_ONCE(bio->bi_cookie);
-        int ret;
+        int ret = 0;

         if (cookie == BLK_QC_T_NONE ||
             !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
@@ -985,10 +985,14 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags)
         if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))
                 return 0;

-        if (WARN_ON_ONCE(!queue_is_mq(q)))
-                ret = 0;        /* not yet implemented, should not happen */
-        else
+        if (queue_is_mq(q)) {
                 ret = blk_mq_poll(q, cookie, iob, flags);
+        } else {
+                struct gendisk *disk = q->disk;
+
+                if (disk && disk->fops->poll_bio)
+                        ret = disk->fops->poll_bio(bio, iob, flags);
+        }
         blk_queue_exit(q);
         return ret;
 }
diff --git a/block/genhd.c b/block/genhd.c
index e351fac41bf2..1ed46a6f94f5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -410,6 +410,10 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
         struct device *ddev = disk_to_dev(disk);
         int ret;

+        /* Only makes sense for bio-based to set ->poll_bio */
+        if (queue_is_mq(disk->queue) && disk->fops->poll_bio)
+                return -EINVAL;
+
         /*
          * The disk queue should now be all set with enough information about
          * the device for the elevator code to pick an adequate default
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f757f9c2871f..51f1b1ddbed2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1455,6 +1455,8 @@ enum blk_unique_id {

 struct block_device_operations {
         void (*submit_bio)(struct bio *bio);
+        int (*poll_bio)(struct bio *bio, struct io_comp_batch *iob,
+                        unsigned int flags);
         int (*open) (struct block_device *, fmode_t);
         void (*release) (struct gendisk *, fmode_t);
         int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
--
2.15.0

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
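The bio_poll() dispatch this patch introduces can be summarized outside the kernel. Below is a toy Python model, not kernel code: the class and function names mirror the kernel's but are hypothetical stand-ins. It shows the decision the patch adds: mq queues still go through blk_mq_poll(), bio-based queues are handed to the disk's ->poll_bio if the driver registered one, and everything else polls as a no-op returning 0:

```python
BLK_QC_T_NONE = None  # stand-in for the "nothing to poll" cookie

class Disk:
    def __init__(self, poll_bio=None):
        # poll_bio models the optional block_device_operations->poll_bio hook
        self.poll_bio = poll_bio

class Queue:
    def __init__(self, is_mq, disk=None, poll_capable=True):
        self.is_mq = is_mq
        self.disk = disk
        self.poll_capable = poll_capable  # stands in for QUEUE_FLAG_POLL

def bio_poll(q, cookie, bio=None, iob=None, flags=0):
    """Return >0 if a completion was found, 0 if nothing/unsupported."""
    if cookie is BLK_QC_T_NONE or not q.poll_capable:
        return 0
    if q.is_mq:
        return 1  # stands in for blk_mq_poll(q, cookie, iob, flags)
    # bio-based queue: defer to the driver's ->poll_bio, if any
    if q.disk and q.disk.poll_bio:
        return q.disk.poll_bio(bio, iob, flags)
    return 0  # bio-based driver without ->poll_bio: nothing to do
```

The key change relative to the pre-patch code is the last branch: instead of a WARN_ON for non-mq queues, the driver hook is consulted, which is what lets DM plug in dm_poll_bio in the next patch.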
* Re: [PATCH v6 1/2] block: add ->poll_bio to block_device_operations
From: Jens Axboe
Date: 2022-03-09 1:01 UTC
To: Mike Snitzer
Cc: linux-block, dm-devel, hch, ming.lei

On 3/7/22 11:53 AM, Mike Snitzer wrote:
> From: Ming Lei <ming.lei@redhat.com>
>
> Prepare for supporting IO polling for bio-based driver.
>
> Add ->poll_bio callback so that bio-based driver can provide their own
> logic for polling bio.
>
> Also fix ->submit_bio_bio typo in comment block above
> __submit_bio_noacct.

Assuming you want to take this through the dm tree:

Reviewed-by: Jens Axboe <axboe@kernel.dk>

--
Jens Axboe
* [PATCH v6 2/2] dm: support bio polling
From: Mike Snitzer
Date: 2022-03-07 18:53 UTC
To: axboe
Cc: ming.lei, hch, dm-devel, linux-block

From: Ming Lei <ming.lei@redhat.com>

Support bio (REQ_POLLED) polling in the following approach:

1) Only support IO polling on normal READ/WRITE; other abnormal IOs
   still fall back to IRQ mode, so the target io is exactly inside the
   dm io.

2) Hold one refcount on io->io_count after submitting this dm bio with
   REQ_POLLED.

3) Support dm native bio splitting: any dm_io instance associated with
   the current bio is added to a list whose head reuses
   bio->bi_private; the original ->bi_private is restored before the
   bio is ended.

4) Implement the .poll_bio() callback: call bio_poll() on the single
   target bio inside each dm io, retrieved via the dm_io list linked
   from bio->bi_private; call dm_io_dec_pending() after the target io
   is done in .poll_bio().

5) Enable QUEUE_FLAG_POLL if all underlying queues enable
   QUEUE_FLAG_POLL, which is based on Jeffle's previous patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-core.h  |   2 +
 drivers/md/dm-table.c |  27 ++++++++++
 drivers/md/dm.c       | 143 ++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 169 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 8078b6c155ef..8cc03c0c262e 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -235,6 +235,8 @@ struct dm_io {
         bool start_io_acct:1;
         int was_accounted;
         unsigned long start_time;
+        void *data;
+        struct hlist_node node;
         spinlock_t endio_lock;
         struct dm_stats_aux stats_aux;
         /* last member of dm_target_io is 'struct bio' */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index f4ed756ab391..c0be4f60b427 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1481,6 +1481,14 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
         return &t->targets[(KEYS_PER_NODE * n) + k];
 }

+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+                                   sector_t start, sector_t len, void *data)
+{
+        struct request_queue *q = bdev_get_queue(dev->bdev);
+
+        return !test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1531,6 +1539,11 @@ static int count_device(struct dm_target *ti, struct dm_dev *dev,
         return 0;
 }

+static int dm_table_supports_poll(struct dm_table *t)
+{
+        return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2067,6 +2080,20 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
         dm_update_crypto_profile(q, t);
         disk_update_readahead(t->md->disk);

+        /*
+         * Check for request-based device is left to
+         * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+         *
+         * For bio-based device, only set QUEUE_FLAG_POLL when all
+         * underlying devices supporting polling.
+         */
+        if (__table_type_bio_based(t->type)) {
+                if (dm_table_supports_poll(t))
+                        blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+                else
+                        blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+        }
+
         return 0;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 454d39bc7745..d9111e17f0fc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -40,6 +40,13 @@
 #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE"
 #define DM_COOKIE_LENGTH 24

+/*
+ * For REQ_POLLED fs bio, this flag is set if we link mapped underlying
+ * dm_io into one list, and reuse bio->bi_private as the list head. Before
+ * ending this fs bio, we will recover its ->bi_private.
+ */
+#define REQ_DM_POLL_LIST REQ_DRV
+
 static const char *_name = DM_NAME;

 static unsigned int major = 0;
@@ -73,6 +80,7 @@ struct clone_info {
         struct dm_io *io;
         sector_t sector;
         unsigned sector_count;
+        bool submit_as_polled;
 };

 #define DM_TARGET_IO_BIO_OFFSET (offsetof(struct dm_target_io, clone))
@@ -599,6 +607,9 @@ static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti,
                 if (!clone)
                         return NULL;

+                /* REQ_DM_POLL_LIST shouldn't be inherited */
+                clone->bi_opf &= ~REQ_DM_POLL_LIST;
+
                 tio = clone_to_tio(clone);
                 tio->inside_dm_io = false;
         }
@@ -888,8 +899,15 @@ void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
                 if (unlikely(wq_has_sleeper(&md->wait)))
                         wake_up(&md->wait);

-                if (io_error == BLK_STS_DM_REQUEUE)
+                if (io_error == BLK_STS_DM_REQUEUE) {
+                        /*
+                         * Upper layer won't help us poll split bio, io->orig_bio
+                         * may only reflect a subset of the pre-split original,
+                         * so clear REQ_POLLED in case of requeue
+                         */
+                        bio->bi_opf &= ~REQ_POLLED;
                         return;
+                }

                 if (bio_is_flush_with_data(bio)) {
                         /*
@@ -1440,6 +1458,47 @@ static bool __process_abnormal_io(struct clone_info *ci, struct dm_target *ti,
         return true;
 }

+/*
+ * Reuse ->bi_private as hlist head for storing all dm_io instances
+ * associated with this bio, and this bio's bi_private needs to be
+ * stored in dm_io->data before the reuse.
+ *
+ * bio->bi_private is owned by fs or upper layer, so block layer won't
+ * touch it after splitting. Meantime it won't be changed by anyone after
+ * bio is submitted. So this reuse is safe.
+ */
+static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
+{
+        return (struct hlist_head *)&bio->bi_private;
+}
+
+static void dm_queue_poll_io(struct bio *bio, struct dm_io *io)
+{
+        struct hlist_head *head = dm_get_bio_hlist_head(bio);
+
+        if (!(bio->bi_opf & REQ_DM_POLL_LIST)) {
+                bio->bi_opf |= REQ_DM_POLL_LIST;
+                /*
+                 * Save .bi_private into dm_io, so that we can reuse
+                 * .bi_private as hlist head for storing dm_io list
+                 */
+                io->data = bio->bi_private;
+
+                INIT_HLIST_HEAD(head);
+
+                /* tell block layer to poll for completion */
+                bio->bi_cookie = ~BLK_QC_T_NONE;
+        } else {
+                /*
+                 * bio recursed due to split, reuse original poll list,
+                 * and save bio->bi_private too.
+                 */
+                io->data = hlist_entry(head->first, struct dm_io, node)->data;
+        }
+
+        hlist_add_head(&io->node, head);
+}
+
 /*
  * Select the correct strategy for processing a non-flush bio.
  */
@@ -1457,6 +1516,12 @@ static int __split_and_process_bio(struct clone_info *ci)
         if (__process_abnormal_io(ci, ti, &r))
                 return r;

+        /*
+         * Only support bio polling for normal IO, and the target io is
+         * exactly inside the dm_io instance (verified in dm_poll_dm_io)
+         */
+        ci->submit_as_polled = ci->bio->bi_opf & REQ_POLLED;
+
         len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
         clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
         __map_bio(clone);
@@ -1473,6 +1538,7 @@ static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
         ci->map = map;
         ci->io = alloc_io(md, bio);
         ci->bio = bio;
+        ci->submit_as_polled = false;
         ci->sector = bio->bi_iter.bi_sector;
         ci->sector_count = bio_sectors(bio);

@@ -1522,8 +1588,17 @@ static void dm_split_and_process_bio(struct mapped_device *md,
         if (ci.io->start_io_acct)
                 dm_start_io_acct(ci.io, NULL);

-        /* drop the extra reference count */
-        dm_io_dec_pending(ci.io, errno_to_blk_status(error));
+        /*
+         * Drop the extra reference count for non-POLLED bio, and hold one
+         * reference for POLLED bio, which will be released in dm_poll_bio
+         *
+         * Add every dm_io instance into the hlist_head which is stored in
+         * bio->bi_private, so that dm_poll_bio can poll them all.
+         */
+        if (error || !ci.submit_as_polled)
+                dm_io_dec_pending(ci.io, errno_to_blk_status(error));
+        else
+                dm_queue_poll_io(bio, ci.io);
 }

 static void dm_submit_bio(struct bio *bio)
@@ -1558,6 +1633,67 @@ static void dm_submit_bio(struct bio *bio)
         dm_put_live_table(md, srcu_idx);
 }

+static bool dm_poll_dm_io(struct dm_io *io, struct io_comp_batch *iob,
+                          unsigned int flags)
+{
+        WARN_ON_ONCE(!io->tio.inside_dm_io);
+
+        /* don't poll if the mapped io is done */
+        if (atomic_read(&io->io_count) > 1)
+                bio_poll(&io->tio.clone, iob, flags);
+
+        /* bio_poll holds the last reference */
+        return atomic_read(&io->io_count) == 1;
+}
+
+static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
+                       unsigned int flags)
+{
+        struct hlist_head *head = dm_get_bio_hlist_head(bio);
+        struct hlist_head tmp = HLIST_HEAD_INIT;
+        struct hlist_node *next;
+        struct dm_io *io;
+
+        /* Only poll normal bio which was marked as REQ_DM_POLL_LIST */
+        if (!(bio->bi_opf & REQ_DM_POLL_LIST))
+                return 0;
+
+        WARN_ON_ONCE(hlist_empty(head));
+
+        hlist_move_list(head, &tmp);
+
+        /*
+         * Restore .bi_private before possibly completing dm_io.
+         *
+         * bio_poll() is only possible once @bio has been completely
+         * submitted via submit_bio_noacct()'s depth-first submission.
+         * So there is no dm_queue_poll_io() race associated with
+         * clearing REQ_DM_POLL_LIST here.
+         */
+        bio->bi_opf &= ~REQ_DM_POLL_LIST;
+        bio->bi_private = hlist_entry(tmp.first, struct dm_io, node)->data;
+
+        hlist_for_each_entry_safe(io, next, &tmp, node) {
+                if (dm_poll_dm_io(io, iob, flags)) {
+                        hlist_del_init(&io->node);
+                        /*
+                         * clone_endio() has already occurred, so passing
+                         * error as 0 here doesn't override io->status
+                         */
+                        dm_io_dec_pending(io, 0);
+                }
+        }
+
+        /* Not done? */
+        if (!hlist_empty(&tmp)) {
+                bio->bi_opf |= REQ_DM_POLL_LIST;
+                /* Reset bio->bi_private to dm_io list head */
+                hlist_move_list(&tmp, head);
+                return 0;
+        }
+        return 1;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -2983,6 +3119,7 @@ static const struct pr_ops dm_pr_ops = {

 static const struct block_device_operations dm_blk_dops = {
         .submit_bio = dm_submit_bio,
+        .poll_bio = dm_poll_bio,
         .open = dm_blk_open,
         .release = dm_blk_close,
         .ioctl = dm_blk_ioctl,
--
2.15.0
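The bi_private bookkeeping in this patch can be modeled without kernel types. Below is a rough Python sketch, a stand-in using plain lists instead of hlists; the names mirror the patch but none of this is the real API. It shows the lifecycle: the first polled dm_io saves the caller's bi_private and turns the field into a list head, later split-off dm_io instances join the same list while copying the saved pointer, and dm_poll_bio drops completed entries and restores bi_private once the list drains:

```python
class Bio:
    def __init__(self, bi_private):
        self.bi_private = bi_private  # owned by the fs/upper layer
        self.poll_list = False        # stands in for REQ_DM_POLL_LIST

class DmIo:
    def __init__(self):
        self.data = None   # saved copy of the bio's original bi_private
        self.done = False  # True once the mapped io has completed

def dm_queue_poll_io(bio, io):
    if not bio.poll_list:
        bio.poll_list = True
        io.data = bio.bi_private  # save the caller's bi_private...
        bio.bi_private = [io]     # ...then reuse the field as the list head
    else:
        # bio recursed due to split: join the existing list and copy
        # the saved bi_private from the entry that already holds it
        io.data = bio.bi_private[0].data
        bio.bi_private.insert(0, io)

def dm_poll_bio(bio):
    """Return 1 when every queued dm_io has completed, else 0."""
    if not bio.poll_list:
        return 0
    pending = [io for io in bio.bi_private if not io.done]
    if pending:
        # some dm_io still in flight: keep only those on the list
        bio.bi_private = pending
        return 0
    # all done: restore the original bi_private before completing
    orig = bio.bi_private[0].data
    bio.poll_list = False
    bio.bi_private = orig
    return 1
```

In the kernel version the "drop a completed entry" step also calls dm_io_dec_pending() to release the extra reference taken at submission; the sketch elides reference counting to keep the save/restore of bi_private in focus.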
* [dm-devel] [PATCH v6 2/2] dm: support bio polling @ 2022-03-07 18:53 ` Mike Snitzer 0 siblings, 0 replies; 18+ messages in thread From: Mike Snitzer @ 2022-03-07 18:53 UTC (permalink / raw) To: axboe; +Cc: linux-block, dm-devel, hch, ming.lei From: Ming Lei <ming.lei@redhat.com> Support bio(REQ_POLLED) polling in the following approach: 1) only support io polling on normal READ/WRITE, and other abnormal IOs still fallback to IRQ mode, so the target io is exactly inside the dm io. 2) hold one refcnt on io->io_count after submitting this dm bio with REQ_POLLED 3) support dm native bio splitting, any dm io instance associated with current bio will be added into one list which head is bio->bi_private which will be recovered before ending this bio 4) implement .poll_bio() callback, call bio_poll() on the single target bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call dm_io_dec_pending() after the target io is done in .poll_bio() 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL, which is based on Jeffle's previous patch. 
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/md/dm-core.h | 2 + drivers/md/dm-table.c | 27 ++++++++++ drivers/md/dm.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 169 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h index 8078b6c155ef..8cc03c0c262e 100644 --- a/drivers/md/dm-core.h +++ b/drivers/md/dm-core.h @@ -235,6 +235,8 @@ struct dm_io { bool start_io_acct:1; int was_accounted; unsigned long start_time; + void *data; + struct hlist_node node; spinlock_t endio_lock; struct dm_stats_aux stats_aux; /* last member of dm_target_io is 'struct bio' */ diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index f4ed756ab391..c0be4f60b427 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -1481,6 +1481,14 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector) return &t->targets[(KEYS_PER_NODE * n) + k]; } +static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data) +{ + struct request_queue *q = bdev_get_queue(dev->bdev); + + return !test_bit(QUEUE_FLAG_POLL, &q->queue_flags); +} + /* * type->iterate_devices() should be called when the sanity check needs to * iterate and check all underlying data devices. iterate_devices() will @@ -1531,6 +1539,11 @@ static int count_device(struct dm_target *ti, struct dm_dev *dev, return 0; } +static int dm_table_supports_poll(struct dm_table *t) +{ + return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL); +} + /* * Check whether a table has no data devices attached using each * target's iterate_devices method. @@ -2067,6 +2080,20 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, dm_update_crypto_profile(q, t); disk_update_readahead(t->md->disk); + /* + * Check for request-based device is left to + * dm_mq_init_request_queue()->blk_mq_init_allocated_queue(). 
+ * + * For bio-based device, only set QUEUE_FLAG_POLL when all + * underlying devices supporting polling. + */ + if (__table_type_bio_based(t->type)) { + if (dm_table_supports_poll(t)) + blk_queue_flag_set(QUEUE_FLAG_POLL, q); + else + blk_queue_flag_clear(QUEUE_FLAG_POLL, q); + } + return 0; } diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 454d39bc7745..d9111e17f0fc 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -40,6 +40,13 @@ #define DM_COOKIE_ENV_VAR_NAME "DM_COOKIE" #define DM_COOKIE_LENGTH 24 +/* + * For REQ_POLLED fs bio, this flag is set if we link mapped underlying + * dm_io into one list, and reuse bio->bi_private as the list head. Before + * ending this fs bio, we will recover its ->bi_private. + */ +#define REQ_DM_POLL_LIST REQ_DRV + static const char *_name = DM_NAME; static unsigned int major = 0; @@ -73,6 +80,7 @@ struct clone_info { struct dm_io *io; sector_t sector; unsigned sector_count; + bool submit_as_polled; }; #define DM_TARGET_IO_BIO_OFFSET (offsetof(struct dm_target_io, clone)) @@ -599,6 +607,9 @@ static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti, if (!clone) return NULL; + /* REQ_DM_POLL_LIST shouldn't be inherited */ + clone->bi_opf &= ~REQ_DM_POLL_LIST; + tio = clone_to_tio(clone); tio->inside_dm_io = false; } @@ -888,8 +899,15 @@ void dm_io_dec_pending(struct dm_io *io, blk_status_t error) if (unlikely(wq_has_sleeper(&md->wait))) wake_up(&md->wait); - if (io_error == BLK_STS_DM_REQUEUE) + if (io_error == BLK_STS_DM_REQUEUE) { + /* + * Upper layer won't help us poll split bio, io->orig_bio + * may only reflect a subset of the pre-split original, + * so clear REQ_POLLED in case of requeue + */ + bio->bi_opf &= ~REQ_POLLED; return; + } if (bio_is_flush_with_data(bio)) { /* @@ -1440,6 +1458,47 @@ static bool __process_abnormal_io(struct clone_info *ci, struct dm_target *ti, return true; } +/* + * Reuse ->bi_private as hlist head for storing all dm_io instances + * associated with this bio, and this bio's 
bi_private needs to be + * stored in dm_io->data before the reuse. + * + * bio->bi_private is owned by fs or upper layer, so block layer won't + * touch it after splitting. Meantime it won't be changed by anyone after + * bio is submitted. So this reuse is safe. + */ +static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio) +{ + return (struct hlist_head *)&bio->bi_private; +} + +static void dm_queue_poll_io(struct bio *bio, struct dm_io *io) +{ + struct hlist_head *head = dm_get_bio_hlist_head(bio); + + if (!(bio->bi_opf & REQ_DM_POLL_LIST)) { + bio->bi_opf |= REQ_DM_POLL_LIST; + /* + * Save .bi_private into dm_io, so that we can reuse + * .bi_private as hlist head for storing dm_io list + */ + io->data = bio->bi_private; + + INIT_HLIST_HEAD(head); + + /* tell block layer to poll for completion */ + bio->bi_cookie = ~BLK_QC_T_NONE; + } else { + /* + * bio recursed due to split, reuse original poll list, + * and save bio->bi_private too. + */ + io->data = hlist_entry(head->first, struct dm_io, node)->data; + } + + hlist_add_head(&io->node, head); +} + /* * Select the correct strategy for processing a non-flush bio. 
*/ @@ -1457,6 +1516,12 @@ static int __split_and_process_bio(struct clone_info *ci) if (__process_abnormal_io(ci, ti, &r)) return r; + /* + * Only support bio polling for normal IO, and the target io is + * exactly inside the dm_io instance (verified in dm_poll_dm_io) + */ + ci->submit_as_polled = ci->bio->bi_opf & REQ_POLLED; + len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count); clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO); __map_bio(clone); @@ -1473,6 +1538,7 @@ static void init_clone_info(struct clone_info *ci, struct mapped_device *md, ci->map = map; ci->io = alloc_io(md, bio); ci->bio = bio; + ci->submit_as_polled = false; ci->sector = bio->bi_iter.bi_sector; ci->sector_count = bio_sectors(bio); @@ -1522,8 +1588,17 @@ static void dm_split_and_process_bio(struct mapped_device *md, if (ci.io->start_io_acct) dm_start_io_acct(ci.io, NULL); - /* drop the extra reference count */ - dm_io_dec_pending(ci.io, errno_to_blk_status(error)); + /* + * Drop the extra reference count for non-POLLED bio, and hold one + * reference for POLLED bio, which will be released in dm_poll_bio + * + * Add every dm_io instance into the hlist_head which is stored in + * bio->bi_private, so that dm_poll_bio can poll them all. 
+ */ + if (error || !ci.submit_as_polled) + dm_io_dec_pending(ci.io, errno_to_blk_status(error)); + else + dm_queue_poll_io(bio, ci.io); } static void dm_submit_bio(struct bio *bio) @@ -1558,6 +1633,67 @@ static void dm_submit_bio(struct bio *bio) dm_put_live_table(md, srcu_idx); } +static bool dm_poll_dm_io(struct dm_io *io, struct io_comp_batch *iob, + unsigned int flags) +{ + WARN_ON_ONCE(!io->tio.inside_dm_io); + + /* don't poll if the mapped io is done */ + if (atomic_read(&io->io_count) > 1) + bio_poll(&io->tio.clone, iob, flags); + + /* bio_poll holds the last reference */ + return atomic_read(&io->io_count) == 1; +} + +static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob, + unsigned int flags) +{ + struct hlist_head *head = dm_get_bio_hlist_head(bio); + struct hlist_head tmp = HLIST_HEAD_INIT; + struct hlist_node *next; + struct dm_io *io; + + /* Only poll normal bio which was marked as REQ_DM_POLL_LIST */ + if (!(bio->bi_opf & REQ_DM_POLL_LIST)) + return 0; + + WARN_ON_ONCE(hlist_empty(head)); + + hlist_move_list(head, &tmp); + + /* + * Restore .bi_private before possibly completing dm_io. + * + * bio_poll() is only possible once @bio has been completely + * submitted via submit_bio_noacct()'s depth-first submission. + * So there is no dm_queue_poll_io() race associated with + * clearing REQ_DM_POLL_LIST here. + */ + bio->bi_opf &= ~REQ_DM_POLL_LIST; + bio->bi_private = hlist_entry(tmp.first, struct dm_io, node)->data; + + hlist_for_each_entry_safe(io, next, &tmp, node) { + if (dm_poll_dm_io(io, iob, flags)) { + hlist_del_init(&io->node); + /* + * clone_endio() has already occurred, so passing + * error as 0 here doesn't override io->status + */ + dm_io_dec_pending(io, 0); + } + } + + /* Not done? 
*/ + if (!hlist_empty(&tmp)) { + bio->bi_opf |= REQ_DM_POLL_LIST; + /* Reset bio->bi_private to dm_io list head */ + hlist_move_list(&tmp, head); + return 0; + } + return 1; +} + /*----------------------------------------------------------------- * An IDR is used to keep track of allocated minor numbers. *---------------------------------------------------------------*/ @@ -2983,6 +3119,7 @@ static const struct pr_ops dm_pr_ops = { static const struct block_device_operations dm_blk_dops = { .submit_bio = dm_submit_bio, + .poll_bio = dm_poll_bio, .open = dm_blk_open, .release = dm_blk_close, .ioctl = dm_blk_ioctl, -- 2.15.0 -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply related [flat|nested] 18+ messages in thread
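For readers following the patch outside the kernel tree, the refcount/drain logic of dm_poll_bio()/dm_poll_dm_io() above can be modeled in a few lines of userspace code. This is an illustrative Python sketch, not the kernel API — DmIo, hw_complete and poll_bio are hypothetical stand-ins: each split gets an io object holding one extra reference for polling, and the poll loop reaps only the objects whose in-flight reference has already been dropped by completion.

```python
class DmIo:
    """Hypothetical userspace model of struct dm_io, for illustration only."""
    def __init__(self):
        # 1 ref for the clone bio in flight + 1 extra ref held for polling,
        # mirroring the extra dm_io_dec_pending() reference in the patch.
        self.io_count = 2

    def hw_complete(self):
        # Models clone_endio() dropping the clone's reference.
        self.io_count -= 1

def poll_dm_io(io):
    """Mirror dm_poll_dm_io(): done when only the polling ref remains."""
    return io.io_count == 1

def poll_bio(pending):
    """Mirror dm_poll_bio(): reap finished dm_io's, return 1 when all done."""
    still_pending = []
    for io in pending:
        if poll_dm_io(io):
            io.io_count -= 1        # models the final dm_io_dec_pending()
        else:
            still_pending.append(io)
    pending[:] = still_pending      # models restoring the hlist on bi_private
    return 1 if not pending else 0

# A bio split into three dm_io's:
ios = [DmIo() for _ in range(3)]
pending = list(ios)
assert poll_bio(pending) == 0       # nothing completed yet
ios[0].hw_complete()
ios[2].hw_complete()
assert poll_bio(pending) == 0       # two reaped, one still in flight
assert len(pending) == 1
ios[1].hw_complete()
assert poll_bio(pending) == 1       # all dm_io's reaped, bio can complete
```

The key property the sketch preserves is that a dm_io is only released from the poll path once completion has dropped its in-flight reference, so the poll side never races with clone_endio() over the final put.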
* Re: [dm-devel] [PATCH v6 2/2] dm: support bio polling 2022-03-07 18:53 ` [dm-devel] " Mike Snitzer @ 2022-03-09 1:02 ` Jens Axboe -1 siblings, 0 replies; 18+ messages in thread From: Jens Axboe @ 2022-03-09 1:02 UTC (permalink / raw) To: Mike Snitzer; +Cc: linux-block, dm-devel, hch, ming.lei On 3/7/22 11:53 AM, Mike Snitzer wrote: > From: Ming Lei <ming.lei@redhat.com> > > Support bio(REQ_POLLED) polling in the following approach: > > 1) only support io polling on normal READ/WRITE, and other abnormal IOs > still fallback to IRQ mode, so the target io is exactly inside the dm > io. > > 2) hold one refcnt on io->io_count after submitting this dm bio with > REQ_POLLED > > 3) support dm native bio splitting, any dm io instance associated with > current bio will be added into one list which head is bio->bi_private > which will be recovered before ending this bio > > 4) implement .poll_bio() callback, call bio_poll() on the single target > bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call > dm_io_dec_pending() after the target io is done in .poll_bio() > > 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL, > which is based on Jeffle's previous patch. It's not the prettiest thing in the world with the overlay on bi_private, but at least it's nicely documented now. I would encourage you to actually test this on fast storage, should make a nice difference. I can run this on a gen2 optane, it's 10x the IOPS of what it was tested on and should help better highlight where it makes a difference. If either of you would like that, then send me a fool proof recipe for what should be setup so I have a poll capable dm device. -- Jens Axboe -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dm-devel] [PATCH v6 2/2] dm: support bio polling 2022-03-09 1:02 ` Jens Axboe @ 2022-03-09 1:13 ` Ming Lei -1 siblings, 0 replies; 18+ messages in thread From: Ming Lei @ 2022-03-09 1:13 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-block, dm-devel, hch, Mike Snitzer On Tue, Mar 08, 2022 at 06:02:50PM -0700, Jens Axboe wrote: > On 3/7/22 11:53 AM, Mike Snitzer wrote: > > From: Ming Lei <ming.lei@redhat.com> > > > > Support bio(REQ_POLLED) polling in the following approach: > > > > 1) only support io polling on normal READ/WRITE, and other abnormal IOs > > still fallback to IRQ mode, so the target io is exactly inside the dm > > io. > > > > 2) hold one refcnt on io->io_count after submitting this dm bio with > > REQ_POLLED > > > > 3) support dm native bio splitting, any dm io instance associated with > > current bio will be added into one list which head is bio->bi_private > > which will be recovered before ending this bio > > > > 4) implement .poll_bio() callback, call bio_poll() on the single target > > bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call > > dm_io_dec_pending() after the target io is done in .poll_bio() > > > > 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL, > > which is based on Jeffle's previous patch. > > It's not the prettiest thing in the world with the overlay on bi_private, > but at least it's nicely documented now. > > I would encourage you to actually test this on fast storage, should make > a nice difference. I can run this on a gen2 optane, it's 10x the IOPS > of what it was tested on and should help better highlight where it > makes a difference. > > If either of you would like that, then send me a fool proof recipe for > what should be setup so I have a poll capable dm device. Follows steps for setup dm stripe over two nvmes, then run io_uring on the dm stripe dev. 1) dm_stripe.perl #!/usr/bin/perl -w # Create a striped device across any number of underlying devices. 
The device # will be called "stripe_dev" and have a chunk-size of 128k. my $chunk_size = 128 * 2; my $dev_name = "stripe_dev"; my $num_devs = @ARGV; my @devs = @ARGV; my ($min_dev_size, $stripe_dev_size, $i); if (!$num_devs) { die("Specify at least one device\n"); } $min_dev_size = `blockdev --getsz $devs[0]`; for ($i = 1; $i < $num_devs; $i++) { my $this_size = `blockdev --getsz $devs[$i]`; $min_dev_size = ($min_dev_size < $this_size) ? $min_dev_size : $this_size; } $stripe_dev_size = $min_dev_size * $num_devs; $stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs); $table = "0 $stripe_dev_size striped $num_devs $chunk_size"; for ($i = 0; $i < $num_devs; $i++) { $table .= " $devs[$i] 0"; } `echo $table | dmsetup create $dev_name`; 2) test_poll_on_dm_stripe.sh #!/bin/bash RT=40 JOBS=1 HI=1 BS=4K set -x dmsetup remove_all rmmod nvme modprobe nvme poll_queues=2 sleep 2 ./dm_stripe.perl /dev/nvme0n1 /dev/nvme1n1 sleep 1 DEV=/dev/mapper/stripe_dev echo "io_uring hipri test" fio --bs=$BS --ioengine=io_uring --fixedbufs --registerfiles \ --hipri=$HI --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \ --filename=$DEV --direct=1 --runtime=$RT --numjobs=$JOBS --rw=randread --name=test \ --group_reporting Thanks, Ming -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel ^ permalink raw reply [flat|nested] 18+ messages in thread
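The size arithmetic in dm_stripe.perl is worth spelling out: the striped device is num_devs times the smallest member, rounded down so it divides into whole stripes of chunk_size * num_devs sectors. A small Python restatement of that calculation (function name and the sample device sizes are illustrative):

```python
def stripe_dev_size(dev_sizes_sectors, chunk_size_sectors=256):
    # Mirrors dm_stripe.perl: 128k chunks are 256 512-byte sectors; the table
    # length is num_devs * min(member sizes), rounded down to a multiple of
    # chunk_size * num_devs so every stripe is complete on every member.
    n = len(dev_sizes_sectors)
    size = min(dev_sizes_sectors) * n
    return size - size % (chunk_size_sectors * n)

# e.g. two devices of 1953525168 and 1875385008 512-byte sectors:
print(stripe_dev_size([1953525168, 1875385008]))
```

The result is always a whole number of stripes, which is what dm-stripe's constructor requires for the table line the script builds.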
* Re: [PATCH v6 2/2] dm: support bio polling 2022-03-09 1:13 ` Ming Lei @ 2022-03-09 16:11 ` Jens Axboe -1 siblings, 0 replies; 18+ messages in thread From: Jens Axboe @ 2022-03-09 16:11 UTC (permalink / raw) To: Ming Lei; +Cc: Mike Snitzer, hch, dm-devel, linux-block On 3/8/22 6:13 PM, Ming Lei wrote: > On Tue, Mar 08, 2022 at 06:02:50PM -0700, Jens Axboe wrote: >> On 3/7/22 11:53 AM, Mike Snitzer wrote: >>> From: Ming Lei <ming.lei@redhat.com> >>> >>> Support bio(REQ_POLLED) polling in the following approach: >>> >>> 1) only support io polling on normal READ/WRITE, and other abnormal IOs >>> still fallback to IRQ mode, so the target io is exactly inside the dm >>> io. >>> >>> 2) hold one refcnt on io->io_count after submitting this dm bio with >>> REQ_POLLED >>> >>> 3) support dm native bio splitting, any dm io instance associated with >>> current bio will be added into one list which head is bio->bi_private >>> which will be recovered before ending this bio >>> >>> 4) implement .poll_bio() callback, call bio_poll() on the single target >>> bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call >>> dm_io_dec_pending() after the target io is done in .poll_bio() >>> >>> 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL, >>> which is based on Jeffle's previous patch. >> >> It's not the prettiest thing in the world with the overlay on bi_private, >> but at least it's nicely documented now. >> >> I would encourage you to actually test this on fast storage, should make >> a nice difference. I can run this on a gen2 optane, it's 10x the IOPS >> of what it was tested on and should help better highlight where it >> makes a difference. >> >> If either of you would like that, then send me a fool proof recipe for >> what should be setup so I have a poll capable dm device. > > Follows steps for setup dm stripe over two nvmes, then run io_uring on > the dm stripe dev. Thanks! Much easier when I don't have to figure it out... 
Setup: CPU: 12900K Drives: 2x P5800X gen2 optane (~5M IOPS each at 512b) Baseline kernel: sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 Added file /dev/dm-0 (submitter 0) polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 submitter=0, tid=1004 IOPS=2794K, BW=1364MiB/s, IOS/call=31/30, inflight=(124) IOPS=2793K, BW=1363MiB/s, IOS/call=31/31, inflight=(62) IOPS=2789K, BW=1362MiB/s, IOS/call=31/30, inflight=(124) IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(124) IOPS=2780K, BW=1357MiB/s, IOS/call=31/31, inflight=(62) IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(62) ^CExiting on signal Maximum IOPS=2794K generating about 500K ints/sec, and using 4k blocks: sudo taskset -c 10 t/io_uring -d128 -b4096 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 Added file /dev/dm-0 (submitter 0) polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 submitter=0, tid=967 IOPS=1683K, BW=6575MiB/s, IOS/call=24/24, inflight=(93) IOPS=1685K, BW=6584MiB/s, IOS/call=24/24, inflight=(124) IOPS=1686K, BW=6588MiB/s, IOS/call=24/24, inflight=(124) IOPS=1684K, BW=6581MiB/s, IOS/call=24/24, inflight=(93) IOPS=1686K, BW=6589MiB/s, IOS/call=24/24, inflight=(124) IOPS=1687K, BW=6593MiB/s, IOS/call=24/24, inflight=(128) IOPS=1687K, BW=6590MiB/s, IOS/call=24/24, inflight=(93) ^CExiting on signal Maximum IOPS=1687K which ends up being bw limited for me, because the devices aren't linked gen4. That's about 1.4M ints/sec. 
With the patched kernel, same test: sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 Added file /dev/dm-0 (submitter 0) polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 submitter=0, tid=989 IOPS=4151K, BW=2026MiB/s, IOS/call=16/15, inflight=(128) IOPS=4159K, BW=2031MiB/s, IOS/call=15/15, inflight=(128) IOPS=4193K, BW=2047MiB/s, IOS/call=15/15, inflight=(128) IOPS=4191K, BW=2046MiB/s, IOS/call=15/15, inflight=(128) IOPS=4202K, BW=2052MiB/s, IOS/call=15/15, inflight=(128) ^CExiting on signal Maximum IOPS=4202K with basically zero interrupts, and 4k: sudo taskset -c 10 t/io_uring -d128 -b4096 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 Added file /dev/dm-0 (submitter 0) polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 submitter=0, tid=1015 IOPS=1706K, BW=6666MiB/s, IOS/call=15/15, inflight=(128) IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) ^CExiting on signal Maximum IOPS=1706K again with basically zero interrupts. That's about a 50% improvement for polled IO. This is using 2 gen2 optanes, which are good for ~5M IOPS each. 
Using two threads on a single core, baseline kernel: sudo taskset -c 10,11 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n2 -R1 -X1 /dev/dm-0 Added file /dev/dm-0 (submitter 0) Added file /dev/dm-0 (submitter 1) polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 submitter=0, tid=1081 submitter=1, tid=1082 IOPS=3515K, BW=1716MiB/s, IOS/call=31/30, inflight=(124 62) IOPS=3515K, BW=1716MiB/s, IOS/call=31/31, inflight=(62 124) IOPS=3517K, BW=1717MiB/s, IOS/call=30/30, inflight=(113 124) IOPS=3517K, BW=1717MiB/s, IOS/call=31/31, inflight=(62 62) ^CExiting on signal Maximum IOPS=3517K and patched: sudo taskset -c 10,11 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n2 -R1 -X1 /dev/dm-0 Added file /dev/dm-0 (submitter 0) Added file /dev/dm-0 (submitter 1) polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 Engine=io_uring, sq_ring=128, cq_ring=128 submitter=0, tid=949 submitter=1, tid=950 IOPS=4988K, BW=2435MiB/s, IOS/call=15/15, inflight=(128 128) IOPS=4985K, BW=2434MiB/s, IOS/call=15/15, inflight=(128 128) IOPS=4970K, BW=2426MiB/s, IOS/call=15/15, inflight=(128 128) IOPS=4985K, BW=2434MiB/s, IOS/call=15/15, inflight=(128 128) ^CExiting on signal Maximum IOPS=4988K which is about a 42% improvement in IOPS. -- Jens Axboe ^ permalink raw reply [flat|nested] 18+ messages in thread
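As a quick check, the quoted gains follow directly from the Maximum IOPS figures (Python, numbers taken from the 512b runs above):

```python
def pct_gain(patched_kiops, baseline_kiops):
    # Percentage improvement of the patched kernel over baseline.
    return 100.0 * (patched_kiops - baseline_kiops) / baseline_kiops

# 512b, one submitter: 2794K baseline -> 4202K patched
print(round(pct_gain(4202, 2794)))   # -> 50
# 512b, two submitters on one core: 3517K baseline -> 4988K patched
print(round(pct_gain(4988, 3517)))   # -> 42
```

The 4k case gains far less (1687K -> 1706K) because, as noted, those runs are bandwidth-limited rather than completion-limited.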
* Re: [PATCH v6 2/2] dm: support bio polling 2022-03-09 16:11 ` [dm-devel] " Jens Axboe @ 2022-03-10 4:00 ` Ming Lei -1 siblings, 0 replies; 18+ messages in thread From: Ming Lei @ 2022-03-10 4:00 UTC (permalink / raw) To: Jens Axboe; +Cc: Mike Snitzer, hch, dm-devel, linux-block On Wed, Mar 09, 2022 at 09:11:26AM -0700, Jens Axboe wrote: > On 3/8/22 6:13 PM, Ming Lei wrote: > > On Tue, Mar 08, 2022 at 06:02:50PM -0700, Jens Axboe wrote: > >> On 3/7/22 11:53 AM, Mike Snitzer wrote: > >>> From: Ming Lei <ming.lei@redhat.com> > >>> > >>> Support bio(REQ_POLLED) polling in the following approach: > >>> > >>> 1) only support io polling on normal READ/WRITE, and other abnormal IOs > >>> still fallback to IRQ mode, so the target io is exactly inside the dm > >>> io. > >>> > >>> 2) hold one refcnt on io->io_count after submitting this dm bio with > >>> REQ_POLLED > >>> > >>> 3) support dm native bio splitting, any dm io instance associated with > >>> current bio will be added into one list which head is bio->bi_private > >>> which will be recovered before ending this bio > >>> > >>> 4) implement .poll_bio() callback, call bio_poll() on the single target > >>> bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call > >>> dm_io_dec_pending() after the target io is done in .poll_bio() > >>> > >>> 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL, > >>> which is based on Jeffle's previous patch. > >> > >> It's not the prettiest thing in the world with the overlay on bi_private, > >> but at least it's nicely documented now. > >> > >> I would encourage you to actually test this on fast storage, should make > >> a nice difference. I can run this on a gen2 optane, it's 10x the IOPS > >> of what it was tested on and should help better highlight where it > >> makes a difference. > >> > >> If either of you would like that, then send me a fool proof recipe for > >> what should be setup so I have a poll capable dm device. 
> > > > Follows steps for setup dm stripe over two nvmes, then run io_uring on > > the dm stripe dev. > > Thanks! Much easier when I don't have to figure it out... Setup: Jens, thanks for running the test! > > CPU: 12900K > Drives: 2x P5800X gen2 optane (~5M IOPS each at 512b) > > Baseline kernel: > > sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 > Added file /dev/dm-0 (submitter 0) > polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 > Engine=io_uring, sq_ring=128, cq_ring=128 > submitter=0, tid=1004 > IOPS=2794K, BW=1364MiB/s, IOS/call=31/30, inflight=(124) > IOPS=2793K, BW=1363MiB/s, IOS/call=31/31, inflight=(62) > IOPS=2789K, BW=1362MiB/s, IOS/call=31/30, inflight=(124) > IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(124) > IOPS=2780K, BW=1357MiB/s, IOS/call=31/31, inflight=(62) > IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(62) > ^CExiting on signal > Maximum IOPS=2794K > > generating about 500K ints/sec, ~5.6 IOs completed in each int averagely, looks irq coalesce is working. > and using 4k blocks: > > sudo taskset -c 10 t/io_uring -d128 -b4096 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 > Added file /dev/dm-0 (submitter 0) > polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 > Engine=io_uring, sq_ring=128, cq_ring=128 > submitter=0, tid=967 > IOPS=1683K, BW=6575MiB/s, IOS/call=24/24, inflight=(93) > IOPS=1685K, BW=6584MiB/s, IOS/call=24/24, inflight=(124) > IOPS=1686K, BW=6588MiB/s, IOS/call=24/24, inflight=(124) > IOPS=1684K, BW=6581MiB/s, IOS/call=24/24, inflight=(93) > IOPS=1686K, BW=6589MiB/s, IOS/call=24/24, inflight=(124) > IOPS=1687K, BW=6593MiB/s, IOS/call=24/24, inflight=(128) > IOPS=1687K, BW=6590MiB/s, IOS/call=24/24, inflight=(93) > ^CExiting on signal > Maximum IOPS=1687K > > which ends up being bw limited for me, because the devices aren't linked > gen4. That's about 1.4M ints/sec. Looks one interrupt just completes one IO with 4k bs, no irq coalesce any more. 
The interrupts may not run in CPU 10 I guess. > > With the patched kernel, same test: > > sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 > Added file /dev/dm-0 (submitter 0) > polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 > Engine=io_uring, sq_ring=128, cq_ring=128 > submitter=0, tid=989 > IOPS=4151K, BW=2026MiB/s, IOS/call=16/15, inflight=(128) > IOPS=4159K, BW=2031MiB/s, IOS/call=15/15, inflight=(128) > IOPS=4193K, BW=2047MiB/s, IOS/call=15/15, inflight=(128) > IOPS=4191K, BW=2046MiB/s, IOS/call=15/15, inflight=(128) > IOPS=4202K, BW=2052MiB/s, IOS/call=15/15, inflight=(128) > ^CExiting on signal > Maximum IOPS=4202K > > with basically zero interrupts, and 4k: > > sudo taskset -c 10 t/io_uring -d128 -b4096 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 > Added file /dev/dm-0 (submitter 0) > polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 > Engine=io_uring, sq_ring=128, cq_ring=128 > submitter=0, tid=1015 > IOPS=1706K, BW=6666MiB/s, IOS/call=15/15, inflight=(128) > IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) > IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) > IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) > IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128) > ^CExiting on signal > Maximum IOPS=1706K Looks improvement on 4k is small, is it caused by pcie bw limit? What is the IOPS when running the same t/io_uring on single optane directly? Thanks, Ming ^ permalink raw reply [flat|nested] 18+ messages in thread
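The coalescing estimate above is just the ratio of completions to interrupts; in Python, using Jens' reported figures:

```python
# Completions per interrupt = IOPS / interrupt rate.
iops_512b = 2794e3        # 512b baseline Maximum IOPS from Jens' run
ints_512b = 500e3         # interrupt rate Jens reported for that run
print(round(iops_512b / ints_512b, 1))   # -> 5.6 completions per interrupt

# 4k baseline: 1687K IOPS at ~1.4M ints/sec
print(round(1687e3 / 1.4e6, 1))          # -> 1.2, roughly one IO per interrupt
```

So coalescing is amortizing interrupts well at 512b but has essentially stopped helping at 4k, which is Ming's point.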
* Re: [dm-devel] [PATCH v6 2/2] dm: support bio polling @ 2022-03-10 4:00 ` Ming Lei 0 siblings, 0 replies; 18+ messages in thread From: Ming Lei @ 2022-03-10 4:00 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-block, dm-devel, hch, Mike Snitzer On Wed, Mar 09, 2022 at 09:11:26AM -0700, Jens Axboe wrote: > On 3/8/22 6:13 PM, Ming Lei wrote: > > On Tue, Mar 08, 2022 at 06:02:50PM -0700, Jens Axboe wrote: > >> On 3/7/22 11:53 AM, Mike Snitzer wrote: > >>> From: Ming Lei <ming.lei@redhat.com> > >>> > >>> Support bio(REQ_POLLED) polling in the following approach: > >>> > >>> 1) only support io polling on normal READ/WRITE, and other abnormal IOs > >>> still fallback to IRQ mode, so the target io is exactly inside the dm > >>> io. > >>> > >>> 2) hold one refcnt on io->io_count after submitting this dm bio with > >>> REQ_POLLED > >>> > >>> 3) support dm native bio splitting, any dm io instance associated with > >>> current bio will be added into one list which head is bio->bi_private > >>> which will be recovered before ending this bio > >>> > >>> 4) implement .poll_bio() callback, call bio_poll() on the single target > >>> bio inside the dm io which is retrieved via bio->bi_bio_drv_data; call > >>> dm_io_dec_pending() after the target io is done in .poll_bio() > >>> > >>> 5) enable QUEUE_FLAG_POLL if all underlying queues enable QUEUE_FLAG_POLL, > >>> which is based on Jeffle's previous patch. > >> > >> It's not the prettiest thing in the world with the overlay on bi_private, > >> but at least it's nicely documented now. > >> > >> I would encourage you to actually test this on fast storage, should make > >> a nice difference. I can run this on a gen2 optane, it's 10x the IOPS > >> of what it was tested on and should help better highlight where it > >> makes a difference. > >> > >> If either of you would like that, then send me a fool proof recipe for > >> what should be setup so I have a poll capable dm device. 
> > > > Follows steps for setup dm stripe over two nvmes, then run io_uring on > > the dm stripe dev. > > Thanks! Much easier when I don't have to figure it out... Setup: Jens, thanks for running the test! > > CPU: 12900K > Drives: 2x P5800X gen2 optane (~5M IOPS each at 512b) > > Baseline kernel: > > sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 > Added file /dev/dm-0 (submitter 0) > polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 > Engine=io_uring, sq_ring=128, cq_ring=128 > submitter=0, tid=1004 > IOPS=2794K, BW=1364MiB/s, IOS/call=31/30, inflight=(124) > IOPS=2793K, BW=1363MiB/s, IOS/call=31/31, inflight=(62) > IOPS=2789K, BW=1362MiB/s, IOS/call=31/30, inflight=(124) > IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(124) > IOPS=2780K, BW=1357MiB/s, IOS/call=31/31, inflight=(62) > IOPS=2779K, BW=1357MiB/s, IOS/call=31/31, inflight=(62) > ^CExiting on signal > Maximum IOPS=2794K > > generating about 500K ints/sec, ~5.6 IOs completed in each int averagely, looks irq coalesce is working. > and using 4k blocks: > > sudo taskset -c 10 t/io_uring -d128 -b4096 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0 > Added file /dev/dm-0 (submitter 0) > polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128 > Engine=io_uring, sq_ring=128, cq_ring=128 > submitter=0, tid=967 > IOPS=1683K, BW=6575MiB/s, IOS/call=24/24, inflight=(93) > IOPS=1685K, BW=6584MiB/s, IOS/call=24/24, inflight=(124) > IOPS=1686K, BW=6588MiB/s, IOS/call=24/24, inflight=(124) > IOPS=1684K, BW=6581MiB/s, IOS/call=24/24, inflight=(93) > IOPS=1686K, BW=6589MiB/s, IOS/call=24/24, inflight=(124) > IOPS=1687K, BW=6593MiB/s, IOS/call=24/24, inflight=(128) > IOPS=1687K, BW=6590MiB/s, IOS/call=24/24, inflight=(93) > ^CExiting on signal > Maximum IOPS=1687K > > which ends up being bw limited for me, because the devices aren't linked > gen4. That's about 1.4M ints/sec. Looks one interrupt just completes one IO with 4k bs, no irq coalesce any more. 
The interrupts may not be running on CPU 10, I guess.

> With the patched kernel, same test:
>
> sudo taskset -c 10 t/io_uring -d128 -b512 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0
> Added file /dev/dm-0 (submitter 0)
> polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> submitter=0, tid=989
> IOPS=4151K, BW=2026MiB/s, IOS/call=16/15, inflight=(128)
> IOPS=4159K, BW=2031MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=4193K, BW=2047MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=4191K, BW=2046MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=4202K, BW=2052MiB/s, IOS/call=15/15, inflight=(128)
> ^CExiting on signal
> Maximum IOPS=4202K
>
> with basically zero interrupts, and 4k:
>
> sudo taskset -c 10 t/io_uring -d128 -b4096 -s31 -c16 -p1 -F1 -B1 -n1 -R1 -X1 /dev/dm-0
> Added file /dev/dm-0 (submitter 0)
> polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> submitter=0, tid=1015
> IOPS=1706K, BW=6666MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128)
> IOPS=1704K, BW=6658MiB/s, IOS/call=15/15, inflight=(128)
> ^CExiting on signal
> Maximum IOPS=1706K

The improvement at 4k looks small; is it caused by the PCIe bandwidth
limit? What is the IOPS when running the same t/io_uring on a single
Optane directly?

Thanks,
Ming
* Re: [PATCH v6 2/2] dm: support bio polling
From: Jens Axboe @ 2022-03-10 4:06 UTC
To: Ming Lei; +Cc: Mike Snitzer, hch, dm-devel, linux-block

On 3/9/22 9:00 PM, Ming Lei wrote:
> The improvement at 4k looks small; is it caused by the PCIe bandwidth
> limit? What is the IOPS when running the same t/io_uring on a single
> Optane directly?

Yes, see what you responded to higher up: "which ends up being bw
limited for me, because the devices aren't linked gen4". Some of them
are, but the adapters are a bit janky and we often end up with gen3
links, hence the limit of ~3.2GB/sec per drive. Even with the bandwidth
limits on gen4, you're at roughly 1.5M IOPS per drive at that point, so
I would expect lower percentage-wise gains for 4k with polling.

-- 
Jens Axboe
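Jens's bandwidth-limit point can be checked roughly: at a given link bandwidth, the 4k IOPS ceiling is bandwidth divided by block size. The figures below use the ~3.2GB/s per-drive gen3 number from the thread; this is a ballpark sanity check, not a measurement.

```shell
# 4k IOPS ceiling in thousands, given link bandwidth in bytes/sec.
ceiling() { awk -v bw="$1" -v bs=4096 'BEGIN{printf "%d\n", bw/bs/1000}'; }

ceiling 3200000000    # per drive on a gen3-limited link (~3.2 GB/s)
ceiling 6400000000    # two such drives together
```

Two gen3-limited drives land in the same ballpark as the ~1.7M IOPS measured at 4k, which is consistent with the run being bandwidth-bound rather than polling-bound.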
end of thread, other threads: [~2022-03-10 4:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-07 18:53 [PATCH v6 0/2] block/dm: support bio polling  Mike Snitzer
2022-03-07 18:53 ` [dm-devel] " Mike Snitzer
2022-03-07 18:53 ` [dm-devel] [PATCH v6 1/2] block: add ->poll_bio to block_device_operations  Mike Snitzer
2022-03-07 18:53 `   Mike Snitzer
2022-03-09  1:01 ` [dm-devel] " Jens Axboe
2022-03-09  1:01 `   Jens Axboe
2022-03-07 18:53 ` [PATCH v6 2/2] dm: support bio polling  Mike Snitzer
2022-03-07 18:53 ` [dm-devel] " Mike Snitzer
2022-03-09  1:02 `   Jens Axboe
2022-03-09  1:02 `   Jens Axboe
2022-03-09  1:13 ` [dm-devel] " Ming Lei
2022-03-09  1:13 `   Ming Lei
2022-03-09 16:11 `   Jens Axboe
2022-03-09 16:11 ` [dm-devel] " Jens Axboe
2022-03-10  4:00 `   Ming Lei
2022-03-10  4:00 ` [dm-devel] " Ming Lei
2022-03-10  4:06 `   Jens Axboe
2022-03-10  4:06 ` [dm-devel] " Jens Axboe