From: Mikulas Patocka <mpatocka@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Ira Weiny <ira.weiny@intel.com>, Jan Kara <jack@suse.cz>,
	Steven Whitehouse <swhiteho@redhat.com>,
	Eric Sandeen <esandeen@redhat.com>,
	Dave Chinner <dchinner@redhat.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Wang Jianchao <jianchao.wan9@gmail.com>,
	"Kani, Toshi" <toshi.kani@hpe.com>,
	"Norton, Scott J" <scott.norton@hpe.com>,
	"Tadakamadla, Rajesh" <rajesh.tadakamadla@hpe.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-nvdimm@lists.01.org
Subject: Re: Expense of read_iter
Date: Thu, 7 Jan 2021 13:59:01 -0500 (EST)
Message-ID: <alpine.LRH.2.02.2101071110080.30654@file01.intranet.prod.int.rdu2.redhat.com>
In-Reply-To: <20210107151125.GB5270@casper.infradead.org>



On Thu, 7 Jan 2021, Matthew Wilcox wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > I'd like to ask about this piece of code in __kernel_read:
> > 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > 		return warn_unsupported...
> > and __kernel_write:
> > 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > 		return warn_unsupported...
> > 
> > - It exits with an error if read_iter is missing, or if both read_iter 
> > and read (or write_iter and write) are present.
> > 
> > I found out that on NVFS, reading a file with the read method has 10% 
> > better performance than the read_iter method. The benchmark just reads the 
> > same 4k page over and over again - and the cost of creating and parsing 
> > the kiocb and iov_iter structures is just that high.
> 
> Which part of it is so expensive?

The read_iter path goes through much more code (code sizes of the functions involved):
vfs_read		- 0x160 bytes
new_sync_read		- 0x160 bytes
nvfs_rw_iter		- 0x100 bytes
nvfs_rw_iter_locked	- 0x4a0 bytes
iov_iter_advance	- 0x300 bytes

If we go with the "read" method, there's just:
vfs_read		- 0x160 bytes
nvfs_read		- 0x200 bytes
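
Most of that extra code is glue that builds a kiocb and a single-segment 
iov_iter on every call. For reference, the new_sync_read() shim in 
fs/read_write.c looks roughly like this (paraphrased sketch, not the 
exact source):

static ssize_t new_sync_read(struct file *filp, char __user *buf,
			     size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct kiocb kiocb;
	struct iov_iter iter;
	ssize_t ret;

	/* build a kiocb and a one-element iov_iter just for this call */
	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = ppos ? *ppos : 0;
	iov_iter_init(&iter, READ, &iov, 1, len);

	ret = call_read_iter(filp, &kiocb, &iter);
	if (ppos)
		*ppos = kiocb.ki_pos;
	return ret;
}

The plain method has the signature
	ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
so it receives the user buffer directly, and none of this setup (nor the 
iov_iter accounting inside the filesystem) is needed.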

> Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
> 
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>  
>  enum iter_type {
>         /* iter types */
> +       ITER_UBUF = 2,
>         ITER_IOVEC = 4,
>         ITER_KVEC = 8,
>         ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
>         size_t iov_offset;
>         size_t count;
>         union {
> +               void __user *buf;
>                 const struct iovec *iov;
>                 const struct kvec *kvec;
>                 const struct bio_vec *bvec;
> 
> and then doing all the appropriate changes to make that work.
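
For illustration, the user-copy fast path for such an iterator could 
collapse to one bounds check and one copy_to_user() call. This is a 
hypothetical sketch built on the ITER_UBUF type and buf field proposed 
above; nothing like it exists in the tree:

static size_t copy_to_iter_ubuf(const void *addr, size_t bytes,
				struct iov_iter *i)
{
	size_t left;

	/* clamp to what is left in the single user buffer */
	if (bytes > i->count)
		bytes = i->count;
	left = copy_to_user(i->buf + i->iov_offset, addr, bytes);
	bytes -= left;	/* copy_to_user() returns bytes NOT copied */
	/* advancing is trivial: bump one offset, no segment walking */
	i->iov_offset += bytes;
	i->count -= bytes;
	return bytes;
}

That would skip the per-segment iteration that makes iov_iter_advance() 
show up in the profiles below.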


I tried this benchmark on nvfs:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	unsigned long i;
	unsigned long l = 1UL << 38;	/* total bytes to read */
	unsigned s = 4096;		/* read one 4k page at a time */
	void *a = valloc(s);		/* page-aligned buffer */
	if (!a) perror("valloc"), exit(1);
	/* fd 0 is stdin - run with stdin redirected from a file on nvfs */
	for (i = 0; i < l; i += s) {
		if (pread(0, a, s, 0) != s) perror("pread"), exit(1);
	}
	return 0;
}
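
(The profiles below were captured with perf. The exact invocation is not 
shown here; something like "perf record ./bench < testfile" followed by 
"perf report --stdio", where "bench" is a stand-in name for the compiled 
program above, produces reports in this format.)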


Result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1049885560
#
# Overhead  Command  Shared Object     Symbol                               
# ........  .......  ................  .....................................
#
    47.32%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     7.83%  pread    [kernel.vmlinux]  [k] current_time
     6.57%  pread    [nvfs]            [k] nvfs_rw_iter_locked
     5.59%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.23%  pread    libc-2.31.so      [.] __libc_pread
     3.51%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.34%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.34%  pread    [kernel.vmlinux]  [k] vfs_read
     2.34%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.31%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.21%  pread    [nvfs]            [k] nvfs_bmap
     1.89%  pread    [kernel.vmlinux]  [k] iov_iter_advance
     1.71%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.59%  pread    [kernel.vmlinux]  [k] atime_needs_update
     1.24%  pread    [nvfs]            [k] nvfs_rw_iter
     0.94%  pread    [kernel.vmlinux]  [k] touch_atime
     0.75%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.72%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.68%  pread    [kernel.vmlinux]  [k] down_read
     0.62%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.52%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.47%  pread    [kernel.vmlinux]  [k] __fget_light
     0.46%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.42%  pread    pread             [.] main
     0.33%  pread    [kernel.vmlinux]  [k] up_read
     0.29%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.03%  pread    pread             [.] pread@plt
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: Use --symfs <dir> if your symbol files are in non-standard locations)
#



Result, using the read method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1312158116
#
# Overhead  Command  Shared Object     Symbol                               
# ........  .......  ................  .....................................
#
    60.77%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.14%  pread    [kernel.vmlinux]  [k] current_time
     3.88%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     3.55%  pread    libc-2.31.so      [.] __libc_pread
     3.04%  pread    [nvfs]            [k] nvfs_bmap
     2.84%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.71%  pread    [nvfs]            [k] nvfs_read
     2.56%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.00%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.98%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     1.77%  pread    [kernel.vmlinux]  [k] vfs_read
     1.35%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.94%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.91%  pread    [kernel.vmlinux]  [k] __fget_light
     0.83%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.70%  pread    [kernel.vmlinux]  [k] down_read
     0.70%  pread    [kernel.vmlinux]  [k] touch_atime
     0.65%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.55%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] up_read
     0.44%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.39%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.34%  pread    pread             [.] main
     0.26%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    pread             [.] pread@plt
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: To set sample time separation other than 100ms with --sort time use --time-quantum)
#


Note that if we sum the overhead of nvfs_rw_iter_locked, new_sync_read, 
iov_iter_advance and nvfs_rw_iter in the first trace, we get 12.01% 
(6.57% + 2.31% + 1.89% + 1.24%). In the second trace, nvfs_read - which 
replaces the functionality of all of these functions - consumes just 
2.71%.

That difference is the reason for the roughly 10% degradation with 
read_iter.

Mikulas

