io-uring.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: Andres Freund <andres@anarazel.de>, io-uring@vger.kernel.org
Subject: Re: Buffered IO async context overhead
Date: Fri, 14 Feb 2020 13:13:35 -0700	[thread overview]
Message-ID: <c91551b2-9694-78cb-2aa6-bc8cccc474c3@kernel.dk> (raw)
In-Reply-To: <20200214195030.cbnr6msktdl3tqhn@alap3.anarazel.de>

On 2/14/20 12:50 PM, Andres Freund wrote:
> Hi,
> 
> For workloads where a lot of buffered IO is done, I see considerably
> higher overhead when going through io_uring than when just doing the IO
> synchronously.
> 
> This example is on a samsung 970 pro nvme (i.e. decent, but not top
> class). I've limited the amount of dirty buffers allowed to 2/4 GB to
> keep the test times in check.   This is with a relatively recent kernel,
> 33b40134e5cfbbccad7f3040d1919889537a3df7 .
> 
> fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=1
> 
> Jobs: 1 (f=1): [W(1)][100.0%][w=1049MiB/s][w=134k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46625: Fri Feb 14 11:22:05 2020
>   write: IOPS=73.4k, BW=573MiB/s (601MB/s)(55.0GiB/100001msec); 0 zone resets
>     slat (nsec): min=571, max=607779, avg=2927.18, stdev=1213.47
>     clat (nsec): min=40, max=1400.7k, avg=9947.92, stdev=2606.65
>      lat (usec): min=4, max=1403, avg=12.94, stdev= 3.28
>     clat percentiles (nsec):
>      |  1.00th=[ 4960],  5.00th=[ 6368], 10.00th=[ 8032], 20.00th=[ 8640],
>      | 30.00th=[ 9024], 40.00th=[ 9280], 50.00th=[ 9664], 60.00th=[10048],
>      | 70.00th=[10560], 80.00th=[11072], 90.00th=[11968], 95.00th=[13120],
>      | 99.00th=[21632], 99.50th=[23424], 99.90th=[26752], 99.95th=[28288],
>      | 99.99th=[35584]
>    bw (  KiB/s): min=192408, max=1031483, per=79.92%, avg=468998.16, stdev=108691.70, samples=199
>    iops        : min=24051, max=128935, avg=58624.37, stdev=13586.54, samples=199
>   lat (nsec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
>   lat (nsec)   : 1000=0.01%
>   lat (usec)   : 2=0.01%, 4=0.01%, 10=56.98%, 20=41.67%, 50=1.32%
>   lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
>   lat (msec)   : 2=0.01%
>   cpu          : usr=15.63%, sys=34.79%, ctx=7333387, majf=0, minf=41
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,7335827,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>   WRITE: bw=573MiB/s (601MB/s), 573MiB/s-573MiB/s (601MB/s-601MB/s), io=55.0GiB (60.1GB), run=100001-100001msec
> 
> Disk stats (read/write):
>   nvme0n1: ios=0/49033, merge=0/12, ticks=0/1741787, in_queue=1717299, util=20.89%
> 
> 
> During this, most of the time there are a number of tasks with decent CPU usage:
>   46627 root      20   0       0      0      0 R  99.7   0.0   1:07.74 io_wqe_worker-0
>   46625 root      20   0  756920   6708   1984 S  51.0   0.0   0:34.56 fio
>   42818 root      20   0       0      0      0 I   3.3   0.0   0:03.23 kworker/u82:6-flush-259:0
>   43338 root      20   0       0      0      0 I   1.3   0.0   0:02.06 kworker/u82:18-events_unbound
> 
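For reference, at iodepth=1 the io_uring engine here roughly boils down to a
submit-one, wait-one loop. A minimal liburing sketch of the equivalent (the
file path, sizes, and loop count are made up, error handling omitted):

/* minimal sketch of an iodepth=1 buffered write loop via liburing */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	char *buf = calloc(1, 8192);
	struct iovec iov = { .iov_base = buf, .iov_len = 8192 };
	int fd, i;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return 1;
	fd = open("/mnt/test/file", O_WRONLY | O_CREAT, 0644);	/* buffered: no O_DIRECT */
	if (fd < 0)
		return 1;

	for (i = 0; i < 100000; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_writev(sqe, fd, &iov, 1, (off_t)i * 8192);
		io_uring_submit(&ring);		/* one io_uring_enter() per 8k write */
		io_uring_wait_cqe(&ring, &cqe);	/* then wait for that single completion */
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}

Each 8k write costs an io_uring_enter() round trip and, for buffered IO, a
punt to an io-wq worker, which is where the io_wqe_worker CPU time above
comes from.
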
> 
> Comparing that to using sync:
> fio --name=t --time_based=1 --runtime=100 --ioengine=sync --rw=write --bs=8k --filesize=300G --iodepth=1
> Jobs: 1 (f=1): [W(1)][100.0%][w=1607MiB/s][w=206k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46690: Fri Feb 14 11:24:17 2020
>   write: IOPS=234k, BW=1831MiB/s (1920MB/s)(179GiB/100001msec); 0 zone resets
>     clat (nsec): min=1612, max=6649.7k, avg=3913.39, stdev=3170.65
>      lat (nsec): min=1653, max=6649.7k, avg=3955.30, stdev=3172.59
>     clat percentiles (nsec):
>      |  1.00th=[ 2512],  5.00th=[ 2544], 10.00th=[ 2576], 20.00th=[ 2640],
>      | 30.00th=[ 2672], 40.00th=[ 2768], 50.00th=[ 3376], 60.00th=[ 3792],
>      | 70.00th=[ 4192], 80.00th=[ 4832], 90.00th=[ 5792], 95.00th=[ 7008],
>      | 99.00th=[11968], 99.50th=[15552], 99.90th=[22656], 99.95th=[26496],
>      | 99.99th=[40192]
>    bw (  MiB/s): min=  989, max= 2118, per=82.26%, avg=1506.28, stdev=169.56, samples=193
>    iops        : min=126684, max=271118, avg=192803.37, stdev=21703.94, samples=193
>   lat (usec)   : 2=0.01%, 4=66.33%, 10=32.14%, 20=1.34%, 50=0.19%
>   lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 10=0.01%
>   cpu          : usr=8.35%, sys=91.55%, ctx=6183, majf=0, minf=74
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,23437892,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>   WRITE: bw=1831MiB/s (1920MB/s), 1831MiB/s-1831MiB/s (1920MB/s-1920MB/s), io=179GiB (192GB), run=100001-100001msec
> 
> Disk stats (read/write):
>   nvme0n1: ios=0/152371, merge=0/587772, ticks=0/5657236, in_queue=5580764, util=67.57%
> 
> Tasks:
>   46690 root      20   0  756912   6524   1800 R 100.0   0.0   1:24.64 fio
>    2913 root      20   0       0      0      0 S  32.2   0.0   0:09.57 kswapd1
>   45683 root      20   0       0      0      0 I  13.2   0.0   0:08.57 kworker/u82:16-events_unbound
>   42816 root      20   0       0      0      0 I  11.2   0.0   0:09.71 kworker/u82:1-flush-259:0
>   36087 root      20   0       0      0      0 I   1.6   0.0   0:01.51 kworker/27:2-xfs-conv/nvme0n1
> 
> I.e. sync did about three times the throughput of io_uring.
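
For comparison, the sync engine is essentially just one write(2) per 8k
block, roughly:

	for (i = 0; i < n; i++)
		if (write(fd, buf, 8192) != 8192)
			break;

so each block still takes the same kernel buffered-write path, just directly
from the submitting task rather than from a worker (the profiles further down
show this).
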
> 
> 
> It's possible to make io_uring perform better by batching
> submissions/completions:
> fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=64 --iodepth_batch_submit=32 --iodepth_batch_complete_min=16 --iodepth_low=16
> t: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring, iodepth=64
> fio-3.17
> Starting 1 process
> Jobs: 1 (f=1): [W(1)][100.0%][w=1382MiB/s][w=177k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46849: Fri Feb 14 11:28:13 2020
>   write: IOPS=194k, BW=1513MiB/s (1587MB/s)(148GiB/100001msec); 0 zone resets
>     slat (usec): min=4, max=455, avg=48.09, stdev=18.20
>     clat (usec): min=6, max=25435, avg=164.91, stdev=104.69
>      lat (usec): min=66, max=25512, avg=213.03, stdev=101.19
>     clat percentiles (usec):
>      |  1.00th=[   60],  5.00th=[   68], 10.00th=[   78], 20.00th=[   94],
>      | 30.00th=[  108], 40.00th=[  127], 50.00th=[  143], 60.00th=[  180],
>      | 70.00th=[  210], 80.00th=[  227], 90.00th=[  265], 95.00th=[  302],
>      | 99.00th=[  379], 99.50th=[  441], 99.90th=[  562], 99.95th=[  611],
>      | 99.99th=[  816]
>    bw (  MiB/s): min=  289, max= 1802, per=81.00%, avg=1225.72, stdev=173.92, samples=197
>    iops        : min=37086, max=230692, avg=156891.42, stdev=22261.80, samples=197
>   lat (usec)   : 10=0.01%, 50=0.09%, 100=25.74%, 250=60.50%, 500=13.46%
>   lat (usec)   : 750=0.19%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>   cpu          : usr=31.80%, sys=26.04%, ctx=793383, majf=0, minf=1301
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=66.7%, >=64=33.3%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=50.0%, 32=50.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,19369087,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>   WRITE: bw=1513MiB/s (1587MB/s), 1513MiB/s-1513MiB/s (1587MB/s-1587MB/s), io=148GiB (159GB), run=100001-100001msec
> 
> Disk stats (read/write):
>   nvme0n1: ios=0/127095, merge=0/15, ticks=0/4936953, in_queue=4873341, util=55.82%
> 
> but that's still worse than the plain sync one, and it doesn't work well
> for reads.
> 
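The batched variant amortizes the syscalls: queue a burst of SQEs, submit
them with a single io_uring_enter(), then reap completions in bulk.
Continuing the sketch from above (same ring/fd/buf; illustrative only, a real
loop would also cap the number of requests in flight and handle a full SQ
ring):

/* batched submission/completion, roughly what the fio options above ask for */
static off_t off;

static void write_batch(struct io_uring *ring, int fd, char *buf)
{
	struct iovec iov = { .iov_base = buf, .iov_len = 8192 };
	struct io_uring_cqe *cqes[32];
	unsigned i, got;

	for (i = 0; i < 32; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_writev(sqe, fd, &iov, 1, off);
		off += 8192;
	}
	io_uring_submit(ring);			/* one syscall for all 32 writes */

	/* reap whatever has completed; fio additionally waits until at least
	 * iodepth_batch_complete_min completions are available */
	got = io_uring_peek_batch_cqe(ring, cqes, 32);
	io_uring_cq_advance(ring, got);
}
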
> 
> Now, it's clear that there's some overhead in backgrounding small IOs to
> the async context, but this seems awfully high. There's still a slowdown
> even when using considerably larger IOs (128k).
> 
> For the 128k, iodepth 1 case I get:
> sync:
> -   97.77%     0.04%  fio      fio                     [.] fio_syncio_queue
>      97.73% fio_syncio_queue
>       - __libc_write
>          - __libc_write
>             - 97.50% entry_SYSCALL_64
>                - 97.31% do_syscall_64
>                   - 97.25% ksys_write
>                      - 97.10% vfs_write
>                         - 96.91% new_sync_write
>                            - 96.82% xfs_file_buffered_aio_write
>                               - 96.13% iomap_file_buffered_write
>                                  - iomap_apply
>                                     - 95.52% iomap_write_actor
>                                        + 37.66% iov_iter_copy_from_user_atomic
>                                        + 26.83% iomap_write_begin
>                                        + 26.64% iomap_write_end.isra.0
>                                          2.26% iov_iter_fault_in_readable
>                                          0.87% balance_dirty_pages_ratelimited
> and --no-children shows:
> +   36.57%  fio      [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
> +   14.31%  fio      [kernel.vmlinux]    [k] iomap_set_range_uptodate
> +    3.85%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
> +    3.34%  fio      [kernel.vmlinux]    [k] queued_spin_lock_slowpath
> +    2.94%  fio      [kernel.vmlinux]    [k] xas_load
> +    2.66%  fio      [kernel.vmlinux]    [k] __add_to_page_cache_locked
> +    2.51%  fio      [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
> +    2.28%  fio      [kernel.vmlinux]    [k] iov_iter_fault_in_readable
> +    1.87%  fio      [kernel.vmlinux]    [k] __lru_cache_add
> +    1.80%  fio      [kernel.vmlinux]    [k] __pagevec_lru_add_fn
> +    1.59%  fio      [kernel.vmlinux]    [k] node_dirty_ok
> +    1.30%  fio      [kernel.vmlinux]    [k] _raw_spin_lock_irq
> +    1.30%  fio      [kernel.vmlinux]    [k] iomap_set_page_dirty
> +    1.25%  fio      fio                 [.] get_io_u
> +    1.18%  fio      [kernel.vmlinux]    [k] xas_start
> 
> io_uring (the io_wqe_worker-0):
> -  100.00%    22.64%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_wqe_worker
>    - 77.36% io_wqe_worker
>       - 77.18% io_worker_handle_work
>          - 75.80% io_rw_async
>             - io_wq_submit_work
>                - 75.57% io_issue_sqe
>                   - io_write
>                      - 70.96% xfs_file_buffered_aio_write
>                         - 70.37% iomap_file_buffered_write
>                            - iomap_apply
>                               - 70.08% iomap_write_actor
>                                  + 30.05% iov_iter_copy_from_user_atomic
>                                  - 18.38% iomap_write_begin
>                                     - 17.64% grab_cache_page_write_begin
>                                        + 16.73% pagecache_get_page
>                                          0.51% wait_for_stable_page
>                                  - 16.96% iomap_write_end.isra.0
>                                       8.72% iomap_set_range_uptodate
>                                     + 7.35% iomap_set_page_dirty
>                                    3.45% iov_iter_fault_in_readable
>                                    0.56% balance_dirty_pages_ratelimited
>                      - 4.36% kiocb_done
>                         - 3.56% io_cqring_ev_posted
>                            + __wake_up_common_lock
>                           0.61% io_cqring_add_event
>            0.56% io_free_req
> --no-children:
> +   29.55%  io_wqe_worker-0  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
> +   22.64%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_wqe_worker
> +    8.71%  io_wqe_worker-0  [kernel.vmlinux]  [k] iomap_set_range_uptodate
> +    3.41%  io_wqe_worker-0  [kernel.vmlinux]  [k] iov_iter_fault_in_readable
> +    2.50%  io_wqe_worker-0  [kernel.vmlinux]  [k] get_page_from_freelist
> +    2.24%  io_wqe_worker-0  [kernel.vmlinux]  [k] xas_load
> +    1.82%  io_wqe_worker-0  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
> +    1.80%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
> +    1.72%  io_wqe_worker-0  [kernel.vmlinux]  [k] __add_to_page_cache_locked
> +    1.58%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irq
> +    1.29%  io_wqe_worker-0  [kernel.vmlinux]  [k] __lru_cache_add
> +    1.27%  io_wqe_worker-0  [kernel.vmlinux]  [k] __pagevec_lru_add_fn
> +    1.08%  io_wqe_worker-0  [kernel.vmlinux]  [k] xas_start
> 
> So it seems the biggest difference is the CPU time spent in io_wqe_worker
> itself, which in turn appears to mostly be:
> 
>        │       mov    %rbp,%rdi
>        │     → callq  io_worker_handle_work
>        │       mov    $0x3e7,%eax
>        │     ↓ jmp    28c
>        │     rep_nop():
>        │     }
>        │
>        │     /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
>        │     static __always_inline void rep_nop(void)
>        │     {
>        │     asm volatile("rep; nop" ::: "memory");
>  74.50 │281:   pause
>        │     io_worker_spin_for_work():
>        │     while (++i < 1000) {
>  11.15 │       sub    $0x1,%eax
>        │     ↑ je     9b
>        │     __read_once_size():
>        │     __READ_ONCE_SIZE;
>   9.53 │28c:   mov    0x8(%rbx),%rdx
>        │     io_wqe_run_queue():
>        │     if (!wq_list_empty(&wqe->work_list) &&
>        │       test   %rdx,%rdx
>   2.68 │     ↓ je     29f
>        │     io_worker_spin_for_work():
> 
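Reconstructed from that annotation (not copied from the tree), the hot loop
is essentially this: each idle worker spins for up to ~1000 iterations
polling the wqe work list with cpu_relax() before giving up, which is where
the 74% on the pause instruction comes from:

/* approximate shape of the spin-for-work path, per the annotation above */
static inline bool io_wqe_run_queue(struct io_wqe *wqe)
{
	/* the real check also looks at available worker slots */
	return !wq_list_empty(&wqe->work_list);
}

static inline void io_worker_spin_for_work(struct io_wqe *wqe)
{
	int i = 0;

	while (++i < 1000) {
		if (io_wqe_run_queue(wqe))
			break;
		cpu_relax();	/* rep; nop, the hot "pause" above */
	}
}
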
> 
> With 8k it looks similar, but more extreme:
> sync (fio itself):
> +   29.89%  fio      [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
> +   14.10%  fio      [kernel.vmlinux]    [k] iomap_set_range_uptodate
> +    2.80%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
> +    2.66%  fio      [kernel.vmlinux]    [k] queued_spin_lock_slowpath
> +    2.46%  fio      [kernel.vmlinux]    [k] xas_load
> +    1.97%  fio      [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
> +    1.94%  fio      [kernel.vmlinux]    [k] __add_to_page_cache_locked
> +    1.79%  fio      [kernel.vmlinux]    [k] iomap_set_page_dirty
> +    1.64%  fio      [kernel.vmlinux]    [k] iov_iter_fault_in_readable
> +    1.61%  fio      fio                 [.] __fio_gettime
> +    1.49%  fio      [kernel.vmlinux]    [k] __pagevec_lru_add_fn
> +    1.39%  fio      [kernel.vmlinux]    [k] __lru_cache_add
>      1.30%  fio      fio                 [.] get_io_u
> +    1.11%  fio      [kernel.vmlinux]    [k] up_write
> +    1.10%  fio      [kernel.vmlinux]    [k] node_dirty_ok
> +    1.09%  fio      [kernel.vmlinux]    [k] down_write
> 
> io_uring (io_wqe_worker-0 again):
> +   54.13%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_wqe_worker
> +   11.01%  io_wqe_worker-0  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
> +    2.70%  io_wqe_worker-0  [kernel.vmlinux]  [k] iomap_set_range_uptodate
> +    2.62%  io_wqe_worker-0  [kernel.vmlinux]  [k] iov_iter_fault_in_readable
> +    1.91%  io_wqe_worker-0  [kernel.vmlinux]  [k] __slab_free
> +    1.85%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
> +    1.66%  io_wqe_worker-0  [kernel.vmlinux]  [k] try_to_wake_up
> +    1.62%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_cqring_fill_event
> +    1.55%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_rw_async
> +    1.43%  io_wqe_worker-0  [kernel.vmlinux]  [k] __wake_up_common
> +    1.37%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irq
> +    1.20%  io_wqe_worker-0  [kernel.vmlinux]  [k] rw_verify_area
> 
> which I think is pretty clear evidence we're hitting fairly significant
> contention on the queue lock.
> 
> 
> I originally hit this in postgres, not fio, but I thought it would be
> easier to reproduce this way.  There's obviously a benefit to doing things
> in the background - but it requires odd logic around deciding when to use
> io_uring, and when not.
> 
> To be clear, none of this happens with DIO, but I don't foresee switching
> to DIO for all IO by default ever (too high demands on accurate
> configuration).

Can you try with this added?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 76cbf474c184..207daf83f209 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -620,6 +620,7 @@ static const struct io_op_def io_op_defs[] = {
 		.async_ctx		= 1,
 		.needs_mm		= 1,
 		.needs_file		= 1,
+		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
 	},
 	[IORING_OP_WRITEV] = {
@@ -634,6 +635,7 @@ static const struct io_op_def io_op_defs[] = {
 	},
 	[IORING_OP_READ_FIXED] = {
 		.needs_file		= 1,
+		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
 	},
 	[IORING_OP_WRITE_FIXED] = {
@@ -711,11 +713,13 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_READ] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
+		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
 	},
 	[IORING_OP_WRITE] = {
 		.needs_mm		= 1,
 		.needs_file		= 1,
+		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
 	},
 	[IORING_OP_FADVISE] = {
@@ -955,7 +959,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req,
 	bool do_hashed = false;
 
 	if (req->flags & REQ_F_ISREG) {
-		if (def->hash_reg_file)
+		if (!(req->kiocb->ki_flags & IOCB_DIRECT) && def->hash_reg_file)
 			do_hashed = true;
 	} else {
 		if (def->unbound_nonreg_file)
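
The idea is that buffered writes get punted to io-wq anyway and, for the same
regular file, largely serialize on the inode. Hashing them (for anything that
isn't IOCB_DIRECT) makes io-wq queue all such requests for a given file to
one worker instead of fanning them out across workers that then contend on
the wqe lock and spin for work. On the enqueue side the do_hashed result ends
up being consumed roughly like this (paraphrased sketch, not the exact code
in the tree):

	do_hashed = io_prep_async_work(req, &link);
	if (!do_hashed)
		io_wq_enqueue(ctx->io_wq, &req->work);
	else
		io_wq_enqueue_hashed(ctx->io_wq, &req->work,
					file_inode(req->file));

That should cut down on the wqe lock contention and the spinning in your
profiles; rerunning the same buffered fio jobs with the patch applied would
tell.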

-- 
Jens Axboe



Thread overview: 10+ messages
2020-02-14 19:50 Buffered IO async context overhead Andres Freund
2020-02-14 20:13 ` Jens Axboe [this message]
2020-02-14 20:31   ` Andres Freund
2020-02-14 20:49     ` Jens Axboe
2020-02-24  9:35       ` Andres Freund
2020-02-24 15:22         ` Jens Axboe
2020-03-09 20:03           ` Pavel Begunkov
2020-03-09 20:41             ` Jens Axboe
2020-03-09 21:02               ` Pavel Begunkov
2020-03-09 21:29                 ` Jens Axboe
