Re: Loophole in async page I/O

From: Hao_Xu <haoxu@linux.alibaba.com>
To: Jens Axboe <axboe@kernel.dk>,
	Matthew Wilcox <willy@infradead.org>,
	io-uring@vger.kernel.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: Loophole in async page I/O
Date: Thu, 15 Oct 2020 20:17:25 +0800	[thread overview]
Message-ID: <ebc1be1f-6bea-044c-3467-d1f6c74ace11@linux.alibaba.com> (raw)
In-Reply-To: <d8e87cb0-3880-742f-9478-6d71b5406b19@linux.alibaba.com>

在 2020/10/15 下午7:27, Hao_Xu 写道:
> 在 2020/10/15 上午4:57, Jens Axboe 写道:
>> On 10/14/20 2:31 PM, Hao_Xu wrote:
>>> Hi Jens,
>>> I've done some tests for the new fix code with readahead disabled from
>>> userspace. Here comes some results.
>>> For the perf reports, since I'm new to kernel stuff, still investigating
>>> on it.
>>> I'll keep addressing the issue which causes the difference among the
>>> four perf reports(in which the  copy_user_enhanced_fast_string() catches
>>> my eyes)
>>>
>>> my environment is:
>>>       server: physical server
>>>       kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
>>>       fs: ext4
>>>       device: nvme ssd
>>>       fio: 3.20
>>>
>>> I did the tests by setting and commenting the code:
>>>       filp->f_mode |= FMODE_BUF_RASYNC;
>>> in fs/ext4/file.c ext4_file_open()
>>
>> You don't have to modify the kernel, if you use a newer fio then you can
>> essentially just add:
>>
>> --force_async=1
>>
>> after setting the engine to io_uring to get the same effect. Just a
>> heads up, as that might make it easier for you.
>>
>>> the IOPS with readahead disabled from userspace is below:
>>>
>>> with new fix code(force readahead)
>>> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
>>> 1                    10.8k                  10.3k
>>> 2                    21.2k                  20.1k
>>> 4                    41.1k                  39.1k
>>> 8                    76.1k                  72.2k
>>> 16                   133k                   126k
>>> 32                   169k                   147k
>>> 64                   176k                   160k
>>> 128                  (1)187k                (2)156k
>>>
>>> now async buffered reads feature looks better in terms of IOPS,
>>> but it still looks similar with the async buffered reads feature in the
>>> mainline code.
>>
>> I'd say it looks better all around. And what you're completely
>> forgetting here is that when FMODE_BUF_RASYNC isn't set, then you're
>> using QD number of async workers to achieve that result. Hence you have
>> 1..128 threads potentially running on that one, vs having a _single_
>> process running with FMODE_BUF_RASYNC.
> I totally agree with this, the server I use has many cpus which makes 
> the multiple async workers works exactly parallelly.
> 
>>
>>> with mainline code(the fix code in commit c8d317aa1887 ("io_uring: fix
>>> async buffered reads when readahead is disabled"))
>>> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
>>> 1                       10.9k            10.2k
>>> 2                       21.6k            20.2k
>>> 4                       41.0k            39.9k
>>> 8                       79.7k            75.9k
>>> 16                      141k             138k
>>> 32                      169k             237k
>>> 64                      190k             316k
>>> 128                     (3)195k          (4)315k
>>>
>>> Considering the number in place (1)(2)(3)(4), the new fix doesn't seem
>>> to fix the slow down
>>> but make the number (4) become number (2)
>>
>> Not sure why there would be a difference between 2 and 4, that does seem
>> odd. I'll see if I can reproduce that. More questions below.
>>
>>> the perf reports of (1)(2)(3)(4) situations are:
>>> (1)
>>>     9 # Overhead  Command  Shared Object       Symbol
>>>    10 # ........  .......  ..................
>>> ..............................................
>>>    11 #
>>>    12     10.19%  fio      [kernel.vmlinux]    [k]
>>> copy_user_enhanced_fast_string
>>>    13      8.53%  fio      fio                 [.] clock_thread_fn
>>>    14      4.67%  fio      [kernel.vmlinux]    [k] xas_load
>>>    15      2.18%  fio      [kernel.vmlinux]    [k] clear_page_erms
>>>    16      2.02%  fio      libc-2.24.so        [.] __memset_avx2_erms
>>>    17      1.55%  fio      [kernel.vmlinux]    [k] mutex_unlock
>>>    18      1.51%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
>>>    19      1.48%  fio      [kernel.vmlinux]    [k] 
>>> native_irq_return_iret
>>>    20      1.48%  fio      [kernel.vmlinux]    [k] 
>>> get_page_from_freelist
>>>    21      1.46%  fio      [kernel.vmlinux]    [k] 
>>> generic_file_buffered_read
>>>    22      1.45%  fio      [nvme]              [k] nvme_irq
>>>    23      1.25%  fio      [kernel.vmlinux]    [k] 
>>> __list_del_entry_valid
>>>    24      1.22%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
>>>    25      1.15%  fio      [kernel.vmlinux]    [k] _raw_spin_lock
>>>    26      1.12%  fio      fio                 [.] get_io_u
>>>    27      0.81%  fio      [ext4]              [k] ext4_mpage_readpages
>>>    28      0.78%  fio      fio                 [.] fio_gettime
>>>    29      0.76%  fio      [kernel.vmlinux]    [k] find_get_entries
>>>    30      0.75%  fio      [vdso]              [.] __vdso_clock_gettime
>>>    31      0.73%  fio      [kernel.vmlinux]    [k] release_pages
>>>    32      0.68%  fio      [kernel.vmlinux]    [k] find_get_entry
>>>    33      0.68%  fio      fio                 [.] io_u_queued_complete
>>>    34      0.67%  fio      [kernel.vmlinux]    [k] io_async_buf_func
>>>    35      0.65%  fio      [kernel.vmlinux]    [k] io_submit_sqes
>>
>> These profiles are of marginal use, as you're only profiling fio itself,
>> not all of the async workers that are running for !FMODE_BUF_RASYNC.
>>
> Ah, I got it. Thanks.
>> How long does the test run? It looks suspect that clock_thread_fn shows
>> up in the profiles at all.
>>
> it runs about 5 msec, randread 4G with bs=4k
Sorry, 5 seconds not 5 msec.
>> And is it actually doing IO, or are you using shm/tmpfs for this test?
>> Isn't ext4 hosting the file? I see a lot of shmem_getpage_gfp(), makes
>> me a little confused.
>>
> I'm using ext4 on real nvme ssd device. from the call stack, the 
> shm_getpage_gfp is from __memset_avx2_erms in libc.
> there are ext4 related functions in all the four reports.
> I'm doing more to check if it is my test process causing high IOPS in 
> case (4).