From: Mikulas Patocka <mpatocka@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>,
Ira Weiny <ira.weiny@intel.com>, Jan Kara <jack@suse.cz>,
Steven Whitehouse <swhiteho@redhat.com>,
Eric Sandeen <esandeen@redhat.com>,
Dave Chinner <dchinner@redhat.com>,
"Theodore Ts'o" <tytso@mit.edu>,
Wang Jianchao <jianchao.wan9@gmail.com>,
"Kani, Toshi" <toshi.kani@hpe.com>,
"Norton, Scott J" <scott.norton@hpe.com>,
"Tadakamadla, Rajesh" <rajesh.tadakamadla@hpe.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvdimm@lists.01.org
Subject: Re: Expense of read_iter
Date: Thu, 7 Jan 2021 13:59:01 -0500 (EST)
Message-ID: <alpine.LRH.2.02.2101071110080.30654@file01.intranet.prod.int.rdu2.redhat.com>
In-Reply-To: <20210107151125.GB5270@casper.infradead.org>
On Thu, 7 Jan 2021, Matthew Wilcox wrote:
> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > I'd like to ask about this piece of code in __kernel_read:
> >     if (unlikely(!file->f_op->read_iter || file->f_op->read))
> >         return warn_unsupported...
> > and __kernel_write:
> >     if (unlikely(!file->f_op->write_iter || file->f_op->write))
> >         return warn_unsupported...
> >
> > - It exits with an error if both read_iter and read or write_iter and
> > write are present.
> >
> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads the
> > same 4k page over and over again - and the cost of creating and parsing
> > the kiocb and iov_iter structures is just that high.
>
> Which part of it is so expensive?
The read_iter path is much bigger:
vfs_read             - 0x160 bytes
new_sync_read        - 0x160 bytes
nvfs_rw_iter         - 0x100 bytes
nvfs_rw_iter_locked  - 0x4a0 bytes
iov_iter_advance     - 0x300 bytes

If we go with the "read" method, there's just:
vfs_read             - 0x160 bytes
nvfs_read            - 0x200 bytes
> Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
>
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>
>  enum iter_type {
>         /* iter types */
> +       ITER_UBUF = 2,
>         ITER_IOVEC = 4,
>         ITER_KVEC = 8,
>         ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
>         size_t iov_offset;
>         size_t count;
>         union {
> +               void __user *buf;
>                 const struct iovec *iov;
>                 const struct kvec *kvec;
>                 const struct bio_vec *bvec;
>
> and then doing all the appropriate changes to make that work.
I tried this benchmark on nvfs:
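#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	unsigned long i;
	unsigned long l = 1UL << 38;	/* total bytes to read */
	unsigned s = 4096;		/* one page per pread() call */
	void *a = valloc(s);		/* page-aligned buffer */

	if (!a) perror("valloc"), exit(1);
	/* Read the same 4k page at offset 0 of stdin over and over. */
	for (i = 0; i < l; i += s) {
		if (pread(0, a, s, 0) != (ssize_t)s) perror("read"), exit(1);
	}
	return 0;
}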
Result, using the read_iter method:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1049885560
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .....................................
#
    47.32%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     7.83%  pread    [kernel.vmlinux]  [k] current_time
     6.57%  pread    [nvfs]            [k] nvfs_rw_iter_locked
     5.59%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.23%  pread    libc-2.31.so      [.] __libc_pread
     3.51%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.34%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.34%  pread    [kernel.vmlinux]  [k] vfs_read
     2.34%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.31%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.21%  pread    [nvfs]            [k] nvfs_bmap
     1.89%  pread    [kernel.vmlinux]  [k] iov_iter_advance
     1.71%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.59%  pread    [kernel.vmlinux]  [k] atime_needs_update
     1.24%  pread    [nvfs]            [k] nvfs_rw_iter
     0.94%  pread    [kernel.vmlinux]  [k] touch_atime
     0.75%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.72%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.68%  pread    [kernel.vmlinux]  [k] down_read
     0.62%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.52%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.47%  pread    [kernel.vmlinux]  [k] __fget_light
     0.46%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.42%  pread    pread             [.] main
     0.33%  pread    [kernel.vmlinux]  [k] up_read
     0.29%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.03%  pread    pread             [.] pread@plt
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all
#
# (Tip: Use --symfs <dir> if your symbol files are in non-standard locations)
#
Result, using the read method:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1312158116
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .....................................
#
    60.77%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.14%  pread    [kernel.vmlinux]  [k] current_time
     3.88%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     3.55%  pread    libc-2.31.so      [.] __libc_pread
     3.04%  pread    [nvfs]            [k] nvfs_bmap
     2.84%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.71%  pread    [nvfs]            [k] nvfs_read
     2.56%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.00%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.98%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     1.77%  pread    [kernel.vmlinux]  [k] vfs_read
     1.35%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.94%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.91%  pread    [kernel.vmlinux]  [k] __fget_light
     0.83%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.70%  pread    [kernel.vmlinux]  [k] down_read
     0.70%  pread    [kernel.vmlinux]  [k] touch_atime
     0.65%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.55%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] up_read
     0.44%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.39%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.34%  pread    pread             [.] main
     0.26%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    pread             [.] pread@plt
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all
#
# (Tip: To set sample time separation other than 100ms with --sort time use --time-quantum)
#
Note that if we sum the percentages of nvfs_rw_iter_locked, new_sync_read,
iov_iter_advance and nvfs_rw_iter, we get 12.01%. In the second trace,
nvfs_read consumes just 2.71%, and it replaces the functionality of all
these functions.

That is the reason for the 10% degradation with read_iter.
Mikulas