* [PATCHSET v2 0/3] Add plug based request allocation batching
@ 2021-10-06 23:13 Jens Axboe
  2021-10-06 23:13 ` [PATCH 1/3] block: bump max plugged deferred size from 16 to 32 Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jens Axboe @ 2021-10-06 23:13 UTC (permalink / raw)
  To: linux-block; +Cc: io-uring

Hi,

Even if the caller knows that N requests will be submitted, we still
have to go all the way to tag allocation for each one. That's somewhat
inefficient, since we could instead grab as many tags as we need up
front.

This small series allows request caching in the blk_plug. We add a new
helper, blk_start_plug_nr_ios(), for passing in how many requests we
expect to submit, and then use that information to populate the cache
on the first request allocation. Subsequent allocation attempts can
then just grab a pre-allocated request out of the cache rather than
going into tag allocation again.
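
To make the flow concrete, below is a minimal, self-contained userspace
sketch of the scheme. The struct and function names (plug, start_plug(),
get_request(), alloc_tag()) are made up for illustration and do not match
the kernel code; the real version hangs the cache off struct blk_plug, is
primed via blk_start_plug_nr_ios(), and pulls its tags from sbitmap.

/*
 * Simplified userspace model of the plug request cache (hypothetical
 * names, not the in-kernel ones).  The first allocation after
 * start_plug() also allocates the rest of the expected batch and parks
 * the spares in the plug; later allocations pop from that cache instead
 * of going back into tag allocation.
 */
#include <stdio.h>
#include <stdlib.h>

struct request {
        int tag;
        struct request *next;
};

struct plug {
        struct request *cached_rq;      /* spare, pre-allocated requests */
        unsigned short nr_ios;          /* how many I/Os the caller expects */
};

/* Stand-in for sbitmap tag allocation: one "expensive" call per tag. */
static int alloc_tag(void)
{
        static int next_tag;

        return next_tag++;
}

static struct request *new_request(void)
{
        struct request *rq = malloc(sizeof(*rq));

        rq->tag = alloc_tag();
        rq->next = NULL;
        return rq;
}

/* Caller announces the expected batch size, like blk_start_plug_nr_ios(). */
static void start_plug(struct plug *plug, unsigned short nr_ios)
{
        plug->cached_rq = NULL;
        plug->nr_ios = nr_ios;
}

static struct request *get_request(struct plug *plug)
{
        struct request *rq;

        /* Fast path: hand out a request that was pre-allocated earlier. */
        if (plug->cached_rq) {
                rq = plug->cached_rq;
                plug->cached_rq = rq->next;
                return rq;
        }

        /* Slow path: allocate this request plus spares for the rest of the batch. */
        rq = new_request();
        while (plug->nr_ios > 1) {
                struct request *spare = new_request();

                spare->next = plug->cached_rq;
                plug->cached_rq = spare;
                plug->nr_ios--;
        }
        return rq;
}

int main(void)
{
        struct plug plug;
        int i;

        start_plug(&plug, 4);           /* we know 4 submissions are coming */
        for (i = 0; i < 4; i++) {
                struct request *rq = get_request(&plug);

                printf("submitting request, tag %d\n", rq->tag);
                free(rq);
        }
        return 0;
}

The sketch ignores details the real code has to handle, such as per-queue
ownership of the cached requests and returning any unused ones when the
plug is flushed; it only models the tag-batching idea.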

This brings single-core performance on my setup from:

taskset -c 0,16  t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme2n1 (submitter 1)
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=256
submitter=0, tid=1206
submitter=1, tid=1207
IOPS=5806400, BW=2835MiB/s, IOS/call=32/32, inflight=(77 32)
IOPS=5860352, BW=2861MiB/s, IOS/call=32/31, inflight=(102 128)
IOPS=5844800, BW=2853MiB/s, IOS/call=32/32, inflight=(102 32)

to:

taskset -c 0,16  t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
Added file /dev/nvme1n1 (submitter 0)
Added file /dev/nvme2n1 (submitter 1)
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=256
submitter=0, tid=1220
submitter=1, tid=1221
IOPS=6061248, BW=2959MiB/s, IOS/call=32/31, inflight=(114 82)
IOPS=6100672, BW=2978MiB/s, IOS/call=32/31, inflight=(128 128)
IOPS=6082496, BW=2969MiB/s, IOS/call=32/32, inflight=(77 38)

which is about a 4-5% improvement in single-core performance. Looking
at profiles, we're spending _less_ time in sbitmap even though we're
doing more work.
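
(As a quick check of that figure from the runs above: 6061248 / 5806400
is roughly 1.044 for the first samples and 6100672 / 5860352 is roughly
1.041 for the second, i.e. about a 4-4.5% gain in IOPS.)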

V2:
- Remove hunk from another patch in patch 2
- Make the plugging limits block layer private

-- 
Jens Axboe





Thread overview: 8+ messages
2021-10-06 23:13 [PATCHSET v2 0/3] Add plug based request allocation batching Jens Axboe
2021-10-06 23:13 ` [PATCH 1/3] block: bump max plugged deferred size from 16 to 32 Jens Axboe
2021-10-12 10:04   ` Christoph Hellwig
2021-10-06 23:13 ` [PATCH 2/3] block: pre-allocate requests if plug is started and is a batch Jens Axboe
2021-10-06 23:13 ` [PATCH 3/3] io_uring: inform block layer of how many requests we are submitting Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2021-10-06 16:35 [PATCHSET RFC 0/3] Add plug based request allocation batching Jens Axboe
2021-10-06 16:35 ` [PATCH 2/3] block: pre-allocate requests if plug is started and is a batch Jens Axboe
2021-10-06 18:24   ` Bart Van Assche
2021-10-06 18:27     ` Jens Axboe
