Re: Loophole in async page I/O

From: Jens Axboe <axboe@kernel.dk>
To: Hao_Xu <haoxu@linux.alibaba.com>,
	Matthew Wilcox <willy@infradead.org>,
	io-uring@vger.kernel.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: Loophole in async page I/O
Date: Wed, 14 Oct 2020 14:57:11 -0600	[thread overview]
Message-ID: <794bb5f3-b9c3-b3f1-df42-fe2167175d23@kernel.dk> (raw)
In-Reply-To: <22435d46-bf78-ee8d-4470-bb3bcbc20cf2@linux.alibaba.com>

On 10/14/20 2:31 PM, Hao_Xu wrote:
> Hi Jens,
> I've done some tests for the new fix code with readahead disabled from 
> userspace. Here comes some results.
> For the perf reports, since I'm new to kernel stuff, still investigating 
> on it.
> I'll keep addressing the issue which causes the difference among the 
> four perf reports(in which the  copy_user_enhanced_fast_string() catches 
> my eyes)
> 
> my environment is:
>      server: physical server
>      kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
>      fs: ext4
>      device: nvme ssd
>      fio: 3.20
> 
> I did the tests by setting and commenting the code:
>      filp->f_mode |= FMODE_BUF_RASYNC;
> in fs/ext4/file.c ext4_file_open()

You don't have to modify the kernel, if you use a newer fio then you can
essentially just add:

--force_async=1

after setting the engine to io_uring to get the same effect. Just a
heads up, as that might make it easier for you.

> the IOPS with readahead disabled from userspace is below:
> 
> with new fix code(force readahead)
> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
> 1                    10.8k                  10.3k
> 2                    21.2k                  20.1k
> 4                    41.1k                  39.1k
> 8                    76.1k                  72.2k
> 16                   133k                   126k
> 32                   169k                   147k
> 64                   176k                   160k
> 128                  (1)187k                (2)156k
> 
> now async buffered reads feature looks better in terms of IOPS,
> but it still looks similar with the async buffered reads feature in the 
> mainline code.

I'd say it looks better all around. And what you're completely
forgetting here is that when FMODE_BUF_RASYNC isn't set, then you're
using QD number of async workers to achieve that result. Hence you have
1..128 threads potentially running on that one, vs having a _single_
process running with FMODE_BUF_RASYNC.

> with mainline code(the fix code in commit c8d317aa1887 ("io_uring: fix 
> async buffered reads when readahead is disabled"))
> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
> 1                       10.9k            10.2k
> 2                       21.6k            20.2k
> 4                       41.0k            39.9k
> 8                       79.7k            75.9k
> 16                      141k             138k
> 32                      169k             237k
> 64                      190k             316k
> 128                     (3)195k          (4)315k
> 
> Considering the number in place (1)(2)(3)(4), the new fix doesn't seem 
> to fix the slow down
> but make the number (4) become number (2)

Not sure why there would be a difference between 2 and 4, that does seem
odd. I'll see if I can reproduce that. More questions below.

> the perf reports of (1)(2)(3)(4) situations are:
> (1)
>    9 # Overhead  Command  Shared Object       Symbol
>   10 # ........  .......  .................. 
> ..............................................
>   11 #
>   12     10.19%  fio      [kernel.vmlinux]    [k] 
> copy_user_enhanced_fast_string
>   13      8.53%  fio      fio                 [.] clock_thread_fn
>   14      4.67%  fio      [kernel.vmlinux]    [k] xas_load
>   15      2.18%  fio      [kernel.vmlinux]    [k] clear_page_erms
>   16      2.02%  fio      libc-2.24.so        [.] __memset_avx2_erms
>   17      1.55%  fio      [kernel.vmlinux]    [k] mutex_unlock
>   18      1.51%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
>   19      1.48%  fio      [kernel.vmlinux]    [k] native_irq_return_iret
>   20      1.48%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
>   21      1.46%  fio      [kernel.vmlinux]    [k] generic_file_buffered_read
>   22      1.45%  fio      [nvme]              [k] nvme_irq
>   23      1.25%  fio      [kernel.vmlinux]    [k] __list_del_entry_valid
>   24      1.22%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
>   25      1.15%  fio      [kernel.vmlinux]    [k] _raw_spin_lock
>   26      1.12%  fio      fio                 [.] get_io_u
>   27      0.81%  fio      [ext4]              [k] ext4_mpage_readpages
>   28      0.78%  fio      fio                 [.] fio_gettime
>   29      0.76%  fio      [kernel.vmlinux]    [k] find_get_entries
>   30      0.75%  fio      [vdso]              [.] __vdso_clock_gettime
>   31      0.73%  fio      [kernel.vmlinux]    [k] release_pages
>   32      0.68%  fio      [kernel.vmlinux]    [k] find_get_entry
>   33      0.68%  fio      fio                 [.] io_u_queued_complete
>   34      0.67%  fio      [kernel.vmlinux]    [k] io_async_buf_func
>   35      0.65%  fio      [kernel.vmlinux]    [k] io_submit_sqes

These profiles are of marginal use, as you're only profiling fio itself,
not all of the async workers that are running for !FMODE_BUF_RASYNC.

How long does the test run? It looks suspect that clock_thread_fn shows
up in the profiles at all.

And is it actually doing IO, or are you using shm/tmpfs for this test?
Isn't ext4 hosting the file? I see a lot of shmem_getpage_gfp(), makes
me a little confused.

-- 
Jens Axboe