From: Matthew Wilcox <willy@infradead.org>
To: Mikulas Patocka <mpatocka@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>,
Ira Weiny <ira.weiny@intel.com>, Jan Kara <jack@suse.cz>,
Steven Whitehouse <swhiteho@redhat.com>,
Eric Sandeen <esandeen@redhat.com>,
Dave Chinner <dchinner@redhat.com>, Theodore Ts'o <tytso@mit.edu>,
Wang Jianchao <jianchao.wan9@gmail.com>,
"Kani, Toshi" <toshi.kani@hpe.com>,
"Norton, Scott J" <scott.norton@hpe.com>,
"Tadakamadla, Rajesh" <rajesh.tadakamadla@hpe.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-nvdimm@lists.01.org
Subject: Re: Expense of read_iter
Date: Sun, 10 Jan 2021 06:13:21 +0000
Message-ID: <20210110061321.GC35215@casper.infradead.org>
In-Reply-To: <alpine.LRH.2.02.2101071110080.30654@file01.intranet.prod.int.rdu2.redhat.com>
On Thu, Jan 07, 2021 at 01:59:01PM -0500, Mikulas Patocka wrote:
> On Thu, 7 Jan 2021, Matthew Wilcox wrote:
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > I'd like to ask about this piece of code in __kernel_read:
> > > if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > > return warn_unsupported...
> > > and __kernel_write:
> > > if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > > return warn_unsupported...
> > >
> > > - It exits with an error if both read_iter and read or write_iter and
> > > write are present.
> > >
> > > I found out that on NVFS, reading a file with the read method has 10%
> > > better performance than the read_iter method. The benchmark just reads the
> > > same 4k page over and over again - and the cost of creating and parsing
> > > the kiocb and iov_iter structures is just that high.
> >
> > Which part of it is so expensive?
>
> The read_iter path is much bigger:
> vfs_read - 0x160 bytes
> new_sync_read - 0x160 bytes
> nvfs_rw_iter - 0x100 bytes
> nvfs_rw_iter_locked - 0x4a0 bytes
> iov_iter_advance - 0x300 bytes
Number of bytes in a function isn't really correlated with how expensive
a particular function is. That said, looking at new_sync_read() shows
one part that's particularly bad, init_sync_kiocb():
static inline int iocb_flags(struct file *file)
{
        int res = 0;
        if (file->f_flags & O_APPEND)
                res |= IOCB_APPEND;
     7ec:   8b 57 40                mov    0x40(%rdi),%edx
     7ef:   48 89 75 80             mov    %rsi,-0x80(%rbp)
        if (file->f_flags & O_DIRECT)
     7f3:   89 d0                   mov    %edx,%eax
     7f5:   c1 e8 06                shr    $0x6,%eax
     7f8:   83 e0 10                and    $0x10,%eax
                res |= IOCB_DIRECT;
        if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
     7fb:   89 c1                   mov    %eax,%ecx
     7fd:   81 c9 00 00 02 00       or     $0x20000,%ecx
     803:   f6 c6 40                test   $0x40,%dh
     806:   0f 45 c1                cmovne %ecx,%eax
                res |= IOCB_DSYNC;
     809:   f6 c6 10                test   $0x10,%dh
     80c:   75 18                   jne    826 <new_sync_read+0x66>
     80e:   48 8b 8f d8 00 00 00    mov    0xd8(%rdi),%rcx
     815:   48 8b 09                mov    (%rcx),%rcx
     818:   48 8b 71 28             mov    0x28(%rcx),%rsi
     81c:   f6 46 50 10             testb  $0x10,0x50(%rsi)
     820:   0f 84 e2 00 00 00       je     908 <new_sync_read+0x148>
        if (file->f_flags & __O_SYNC)
     826:   83 c8 02                or     $0x2,%eax
                res |= IOCB_SYNC;
        return res;
     829:   89 c1                   mov    %eax,%ecx
     82b:   83 c9 04                or     $0x4,%ecx
     82e:   81 e2 00 00 10 00       and    $0x100000,%edx
We could optimise this by, eg, checking for (__O_SYNC | O_DIRECT |
O_APPEND) and returning 0 if none of them are set, since they're all
pretty rare. It might be better to maintain an f_iocb_flags in the
struct file and just copy that unconditionally. We'd need to remember
to update it in fcntl(F_SETFL), but I think that's the only place.
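
To make the two ideas above concrete, here is a rough userspace sketch, not kernel code. Every sk_/SK_ name is hypothetical. The constants are local stand-ins that mirror the x86-64 values visible in the disassembly, and inode_is_sync stands in for IS_SYNC(file->f_mapping->host). The early-exit variant assumes the inode check must still be honoured, which the one-line suggestion above does not address. The cached variant assumes nothing but the F_SETFL-like path changes the inputs.

```c
#include <stdbool.h>

/* Illustrative stand-ins, NOT the kernel headers (values as seen in the
 * x86-64 disassembly above). */
#define SK_O_APPEND    0x000400u
#define SK_O_DSYNC     0x001000u
#define SK_O_DIRECT    0x004000u
#define SK_O_SYNC_BIT  0x100000u   /* stand-in for __O_SYNC */

#define SK_IOCB_DSYNC   0x00002
#define SK_IOCB_SYNC    0x00004
#define SK_IOCB_APPEND  0x00010
#define SK_IOCB_DIRECT  0x20000

/* Hypothetical, heavily reduced model of struct file. */
struct sk_file {
    unsigned int f_flags;   /* open flags */
    bool inode_is_sync;     /* stand-in for IS_SYNC(file->f_mapping->host) */
    int f_iocb_flags;       /* hypothetical cached result */
};

/* Reference version: the same four tests as iocb_flags() above. */
int sk_iocb_flags(const struct sk_file *file)
{
    int res = 0;

    if (file->f_flags & SK_O_APPEND)
        res |= SK_IOCB_APPEND;
    if (file->f_flags & SK_O_DIRECT)
        res |= SK_IOCB_DIRECT;
    if ((file->f_flags & SK_O_DSYNC) || file->inode_is_sync)
        res |= SK_IOCB_DSYNC;
    if (file->f_flags & SK_O_SYNC_BIT)
        res |= SK_IOCB_SYNC;
    return res;
}

/* Variant 1: one combined test for the rare flags, then an early exit.
 * The inode check is kept so the result matches the reference version. */
int sk_iocb_flags_fast(const struct sk_file *file)
{
    const unsigned int rare = SK_O_APPEND | SK_O_DIRECT |
                              SK_O_DSYNC | SK_O_SYNC_BIT;

    if (!(file->f_flags & rare))
        return file->inode_is_sync ? SK_IOCB_DSYNC : 0;
    return sk_iocb_flags(file);
}

/* Variant 2: recompute the cached value wherever f_flags changes.
 * This models the fcntl(F_SETFL) path; the I/O path would then just
 * copy file->f_iocb_flags. */
void sk_setfl(struct sk_file *file, unsigned int flags)
{
    file->f_flags = flags;
    file->f_iocb_flags = sk_iocb_flags(file);
}
```

With the cached variant, the per-I/O cost becomes a single load of f_iocb_flags. The sketch does not handle the inode's sync state changing after the cache is filled, so that would need separate care.
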
> If we go with the "read" method, there's just:
> vfs_read - 0x160 bytes
> nvfs_read - 0x200 bytes
>
> > Is it worth, eg adding an iov_iter
> > type that points to a single buffer instead of a single-member iov?
> 6.57% pread [nvfs] [k] nvfs_rw_iter_locked
> 2.31% pread [kernel.vmlinux] [k] new_sync_read
> 1.89% pread [kernel.vmlinux] [k] iov_iter_advance
> 1.24% pread [nvfs] [k] nvfs_rw_iter
> 0.29% pread [kernel.vmlinux] [k] iov_iter_init
>
> 2.71% pread [nvfs] [k] nvfs_read
>
> Note that if we sum the percentage of nvfs_rw_iter_locked, new_sync_read,
> iov_iter_advance, nvfs_rw_iter, we get 12.01%. On the other hand, in the
> second trace, nvfs_read consumes just 2.71% - and it replaces
> functionality of all these functions.
>
> That is the reason for that 10% degradation with read_iter.
You seem to be focusing on your argument for "let's just permit
filesystems to implement both ->read and ->read_iter". My suggestion
is that we need to optimise the ->read_iter path, but to do that we need
to know what's expensive.
nvfs_rw_iter_locked() looks very complicated. I suspect it can
be simplified. Of course new_sync_read() needs to be improved too,
as do the other functions here, but fully a third of the difference
between read() and read_iter() is the difference between nvfs_read()
and nvfs_rw_iter_locked().
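
As a rough check of the "a third" figure, the share can be worked out from the quoted profile. This sketch assumes the 12.01% sum from the quoted text, which leaves out iov_iter_init, as the read_iter total, and treats nvfs_read's 2.71% as the whole read-path cost. The function name is made up for illustration.

```c
/* Share of the total read_iter-vs-read gap that is explained by one
 * function pair. All arguments are percentages from the perf traces. */
double sk_gap_share(double iter_part, double read_part,
                    double iter_total, double read_total)
{
    return (iter_part - read_part) / (iter_total - read_total);
}
```

Calling sk_gap_share(6.57, 2.71, 12.01, 2.71) gives 3.86 / 9.30, which is a little over 0.41, so the filesystem-side functions account for somewhat more than a third of the gap under these assumptions.
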