From: Matthew Wilcox <willy@infradead.org>
To: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>,
	Andreas Gruenbacher <agruenba@redhat.com>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] Bypass filesystems for reading cached pages
Date: Sat, 20 Jun 2020 12:15:21 -0700
Message-ID: <20200620191521.GG8681@bombadil.infradead.org>
In-Reply-To: <CAOQ4uxjy6JTAQqvK9pc+xNDfzGQ3ACefTrySXtKb_OcAYQrdzw@mail.gmail.com>

On Sat, Jun 20, 2020 at 09:19:37AM +0300, Amir Goldstein wrote:
> On Fri, Jun 19, 2020 at 6:52 PM Matthew Wilcox <willy@infradead.org> wrote:
> > This patch lifts the IOCB_CACHED idea expressed by Andreas to the VFS.
> > The advantage of this patch is that we can avoid taking any filesystem
> > lock, as long as the pages being accessed are in the cache (and we don't
> > need to readahead any pages into the cache).  We also avoid an indirect
> > function call in these cases.
> 
> XFS is taking i_rwsem lock in read_iter() for a surprising reason:
> https://lore.kernel.org/linux-xfs/CAOQ4uxjpqDQP2AKA8Hrt4jDC65cTo4QdYDOKFE-C3cLxBBa6pQ@mail.gmail.com/
> In that post I claim that ocfs2 and cifs also do some work in read_iter().
> I didn't go back to check what, but it sounds like cache coherence among
> nodes.

That's out of date.  Here's POSIX-2017:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html

  "I/O is intended to be atomic to ordinary files and pipes and
  FIFOs. Atomic means that all the bytes from a single operation that
  started out together end up together, without interleaving from other
  I/O operations. It is a known attribute of terminals that this is not
  honored, and terminals are explicitly (and implicitly permanently)
  excepted, making the behavior unspecified. The behavior for other
  device types is also left unspecified, but the wording is intended to
  imply that future standards might choose to specify atomicity (or not)."

That _doesn't_ say "a read cannot observe a write in progress".  It says
"Two writes cannot interleave".  Indeed, further down in that section, it says:

  "Earlier versions of this standard allowed two very different behaviors
  with regard to the handling of interrupts. In order to minimize the
  resulting confusion, it was decided that POSIX.1-2017 should support
  only one of these behaviors. Historical practice on AT&T-derived systems
  was to have read() and write() return -1 and set errno to [EINTR] when
  interrupted after some, but not all, of the data requested had been
  transferred. However, the US Department of Commerce FIPS 151-1 and FIPS
  151-2 require the historical BSD behavior, in which read() and write()
  return the number of bytes actually transferred before the interrupt. If
  -1 is returned when any data is transferred, it is difficult to recover
  from the error on a seekable device and impossible on a non-seekable
  device. Most new implementations support this behavior. The behavior
  required by POSIX.1-2017 is to return the number of bytes transferred."

That explicitly allows for a write to be interrupted by a signal and
later resumed, allowing a read to observe a half-complete write.
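To illustrate the point from the application side, here is a minimal userspace sketch of the usual resume loop, assuming a POSIX system. The helper name write_all() is hypothetical and is not taken from the patch or from any library. Each short return from write() leaves a prefix of the data visible to a concurrent reader until the loop issues the next call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling write() until all of buf is written.
 * POSIX.1-2017 requires write() to return the number of bytes transferred
 * when it is interrupted part-way, so the caller resumes from that offset.
 */
static ssize_t write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t done = 0;

	assert(buf != NULL || len == 0);

	while (done < len) {
		ssize_t n = write(fd, p + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any byte moved */
			return -1;		/* real error; errno is set */
		}
		/*
		 * Short count: the first 'done + n' bytes are already in the
		 * file, and a reader may observe them before we resume.
		 */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Nothing in this loop is atomic as a whole; the window between two write() calls is where a reader can see the half-complete result.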

> Because if I am not mistaken, even though this change has a potential
> to improve many workloads, it may also degrade some workloads in cases
> where readahead is not properly tuned. Imagine reading a large file
> and getting only a few pages worth of data read on every syscall.
> Or did I misunderstand your patch's behavior in that case?

I think you did.  If the IOCB_CACHED read hits a readahead page,
it returns early.  Then call_read_iter() notices the read is not yet
complete, and calls ->read_iter() to finish the read.  So it's two
calls to generic_file_buffered_read() rather than one, but it's still
one syscall.
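The control flow described above can be modelled in userspace. This is a toy sketch, not the kernel code: the toy_* names, the 4-byte page size and the per-page flags are all made up for illustration, and toy_read_iter() merely stands in for a filesystem's ->read_iter(). It assumes the cached-only pass stops before a page carrying the readahead marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SZ 4u	/* toy page size, illustration only */
#define TOY_NPAGES  8u

struct toy_file {
	char data[TOY_NPAGES * TOY_PAGE_SZ];
	bool cached[TOY_NPAGES];	/* page is in the page cache */
	bool readahead[TOY_NPAGES];	/* page carries the readahead marker */
	unsigned int fs_calls;		/* times the slow path was entered */
};

/* Fast path: copy from cached pages only; stop at a miss or readahead page. */
static size_t toy_read_cached(struct toy_file *f, size_t pos,
			      char *buf, size_t len)
{
	size_t done = 0;

	while (done < len && pos + done < sizeof(f->data)) {
		size_t idx = (pos + done) / TOY_PAGE_SZ;
		size_t off = (pos + done) % TOY_PAGE_SZ;
		size_t n = TOY_PAGE_SZ - off;

		if (!f->cached[idx] || f->readahead[idx])
			break;		/* return early, read incomplete */
		if (n > len - done)
			n = len - done;
		memcpy(buf + done, f->data + pos + done, n);
		done += n;
	}
	return done;
}

/* Slow path: stands in for ->read_iter(), which may lock and do readahead. */
static size_t toy_read_iter(struct toy_file *f, size_t pos,
			    char *buf, size_t len)
{
	f->fs_calls++;
	if (len == 0 || pos >= sizeof(f->data))
		return 0;
	if (len > sizeof(f->data) - pos)
		len = sizeof(f->data) - pos;
	for (size_t i = pos / TOY_PAGE_SZ; i <= (pos + len - 1) / TOY_PAGE_SZ; i++) {
		f->cached[i] = true;	/* page is now populated */
		f->readahead[i] = false;	/* marker consumed */
	}
	memcpy(buf, f->data + pos, len);
	return len;
}

/* Dispatcher: try the cache first, finish via the slow path if needed. */
static size_t toy_call_read_iter(struct toy_file *f, size_t pos,
				 char *buf, size_t len)
{
	size_t done;

	assert(f != NULL && buf != NULL);
	done = toy_read_cached(f, pos, buf, len);
	if (done < len)
		done += toy_read_iter(f, pos + done, buf + done, len - done);
	return done;
}
```

A single call to toy_call_read_iter() corresponds to one syscall: the caller gets the whole range back even when the fast path stopped early, at the cost of one extra pass through the read code.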

