[PATCH v2 0/2] improve SQPOLL handling

From: Xiaoguang Wang @ 2020-11-03  6:15 UTC
  To: io-uring; +Cc: axboe, joseph.qi

The first patch addresses several issues in the current implementation:
  The prepare_to_wait() usage in __io_sq_thread() is odd. If multiple ctxs
share the same poll thread, one ctx will put the poll thread in
TASK_INTERRUPTIBLE, but if other ctxs still have work to do, we don't need to
change the task's state at all. I think we should only do that when none of
the ctxs has work to do.
  We use a round-robin strategy to let multiple ctxs share the same poll
thread, but there are various conditions in __io_sq_thread(), which seem
complicated and may affect the round-robin strategy.
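
To make the intended behaviour concrete, here is a minimal userspace sketch of
the rule described above: visit every ctx in round-robin order and allow the
poll thread to sleep only when no ctx had work. It is a toy model, not the
code from the patch; struct toy_ctx and all toy_* functions are hypothetical
names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a ring context; not the kernel's io_ring_ctx. */
struct toy_ctx {
	int pending;	/* submission entries waiting to be consumed */
};

/* Consume up to `batch` entries from one ctx; returns how many were handled. */
static int toy_submit(struct toy_ctx *ctx, int batch)
{
	int n = ctx->pending < batch ? ctx->pending : batch;

	ctx->pending -= n;
	return n;
}

/*
 * One round-robin pass over all ctxs sharing the poll thread.
 * Returns true only if no ctx had work, which is the only case in
 * which the thread would be allowed to change its state and sleep.
 */
static bool toy_poll_pass(struct toy_ctx *ctxs, size_t nr, int batch)
{
	bool all_idle = true;

	for (size_t i = 0; i < nr; i++) {
		if (toy_submit(&ctxs[i], batch) > 0)
			all_idle = false;
	}
	return all_idle;
}

/* Run passes until every ctx is idle; returns the number of busy passes. */
static int toy_poll_until_idle(struct toy_ctx *ctxs, size_t nr, int batch)
{
	int busy_passes = 0;

	while (!toy_poll_pass(ctxs, nr, batch))
		busy_passes++;
	return busy_passes;
}
```

Each ctx gets at most `batch` entries per pass, so one busy ctx cannot starve
the others, and the sleep decision is taken once per pass rather than per ctx.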

The second patch adds an IORING_SETUP_SQPOLL_PERCPU flag. Rings that have
SQPOLL enabled and are willing to be bound to the same cpu can share one poll
thread by specifying this new flag. The fio tool can integrate this feature
easily, so we can easily test multiple rings sharing the same poll thread.
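
The sharing idea can be pictured with a small userspace model: rings that name
the same cpu get the same poll-thread state, tracked by a reference count.
This is only an illustrative sketch under my own assumptions, not the kernel
implementation; struct toy_sq_data, TOY_NR_CPUS and the toy_* helpers are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 64	/* arbitrary size chosen for this sketch */

/* Hypothetical stand-in for per-thread poll state; not a kernel structure. */
struct toy_sq_data {
	int cpu;	/* cpu the poll thread is bound to */
	int refs;	/* number of rings attached to this poll thread */
};

/* One slot per cpu; refs == 0 means no shared poll thread is in use there. */
static struct toy_sq_data toy_percpu_sqd[TOY_NR_CPUS];

/* Attach a ring to the shared poll thread of `cpu`, claiming it on first use. */
static struct toy_sq_data *toy_get_percpu_sqd(int cpu)
{
	struct toy_sq_data *sqd;

	if (cpu < 0 || cpu >= TOY_NR_CPUS)
		return NULL;
	sqd = &toy_percpu_sqd[cpu];
	sqd->cpu = cpu;
	sqd->refs++;
	return sqd;
}

/* Detach a ring; the slot is free again once the last ring has left. */
static void toy_put_percpu_sqd(struct toy_sq_data *sqd)
{
	if (sqd && sqd->refs > 0)
		sqd->refs--;
}
```

Two rings calling toy_get_percpu_sqd(9) receive the same pointer, while a ring
asking for a different cpu gets a separate slot.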

TEST:
  This patch set has passed the liburing test cases.

  I also made fio support the IORING_SETUP_SQPOLL_PERCPU flag and ran some
io stress tests, with no errors or performance regressions. See the fio job
files below:

First, on an unpatched kernel, I tested a fio job file that contains only one
job with an iodepth of 128, see below:
[global]
ioengine=io_uring
sqthread_poll=1
registerfiles=1
fixedbufs=1
hipri=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=120
ramp_time=0
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
sqthread_poll_cpu=15

[job0]
cpus_allowed=5
iodepth=128
sqthread_poll_cpu=9

performance data: IOPS: 453k, avg lat: 282.37usec

Second, on an unpatched kernel, I tested a fio job file that contains 4 jobs,
each with an iodepth of 32, see below:
[global]
ioengine=io_uring
sqthread_poll=1
registerfiles=1
fixedbufs=1
hipri=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=120
ramp_time=0
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
sqthread_poll_cpu=15

[job0]
cpus_allowed=5
iodepth=32
sqthread_poll_cpu=9

[job1]
cpus_allowed=6
iodepth=32
sqthread_poll_cpu=9

[job2]
cpus_allowed=7
iodepth=32
sqthread_poll_cpu=9

[job3]
cpus_allowed=8
iodepth=32
sqthread_poll_cpu=9

performance data: IOPS: 254k, avg lat: 503.80usec, an obvious performance
drop.

Finally, on a patched kernel, I tested a fio job file that contains 4 jobs,
each with an iodepth of 32, this time with the sqthread_poll_percpu flag
enabled, see below:

[global]
ioengine=io_uring
sqthread_poll=1
registerfiles=1
fixedbufs=1
hipri=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=120
ramp_time=0
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
#sqthread_poll_cpu=15
sqthread_poll_percpu=1  # enable percpu feature

[job0]
cpus_allowed=5
iodepth=32
sqthread_poll_cpu=9

[job1]
cpus_allowed=6
iodepth=32
sqthread_poll_cpu=9

[job2]
cpus_allowed=7
iodepth=32
sqthread_poll_cpu=9

performance data: IOPS: 438k, avg lat: 291.69usec

From the above tests, we can see that IORING_SETUP_SQPOLL_PERCPU is easy to
use and shows no obvious performance regression.
Note that I didn't test IORING_SETUP_ATTACH_WQ in the above three test cases;
it's a little hard to support IORING_SETUP_ATTACH_WQ in fio.

Xiaoguang Wang (2):
  io_uring: refactor io_sq_thread() handling
  io_uring: support multiple rings to share same poll thread by
    specifying same cpu

 fs/io_uring.c                 | 289 +++++++++++++++++++---------------
 include/uapi/linux/io_uring.h |   1 +
 2 files changed, 166 insertions(+), 124 deletions(-)

-- 
2.17.2
