linux-block.vger.kernel.org archive mirror
From: Andres Freund <andres@anarazel.de>
To: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>, Jens Axboe <axboe@kernel.dk>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
Date: Sat, 1 Feb 2020 02:33:42 -0800
Message-ID: <20200201103342.ktxbrhaqqhbrtg7p@alap3.anarazel.de>
In-Reply-To: <C08B7F86-C3D6-47C6-AB17-6F234EA33687@fb.com>

Hi,

On 2019-12-13 01:32:10 +0000, Chris Mason wrote:
> Grepping through the code shows a wonderful assortment of helpers to
> control the cache, and RWF_UNCACHED would be both cleaner and faster
> than what we have today.  I'm on the fence about asking for
> RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without
> pitching pages, but we can talk to some service owners to see how useful
> that would be.   They can always chain a sync_file_range() in io_uring,
> but RWF_ would be lower overhead if it were a common pattern.

FWIW, for postgres something that'd allow us to do writes that

a) Don't remove pages from the pagecache if they're already there.
b) Don't delay writeback to some unpredictable point later.

   Delayed writeback causes latency issues and often under-utilizes
   write bandwidth for a while. In most cases where we write, we know
   that we're not likely to write the same page again soon.

c) Don't (except maybe temporarily) bring pages into the pagecache if
   they weren't there before.

   In the cases where the page previously wasn't in the page cache and
   we wrote it out, it's very likely to have been resident in our cache
   for long enough that the kernel caching it for the future isn't
   useful.

would be really helpful. Right now we simulate that to some degree by
doing normal buffered writes followed by sync_file_range(WRITE).
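
As a rough illustration of that pattern (a minimal sketch, not
postgres' actual code; the helper name write_and_start_writeback() is
made up for this example): do a normal buffered pwrite(), then call
sync_file_range() with SYNC_FILE_RANGE_WRITE on the same range so that
writeback starts right away. Note that this only initiates writeback;
it does not wait for it and is not a durability guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: buffered write of buf[0..len) at offset off,
 * followed by a request to start writeback of exactly that range.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int write_and_start_writeback(int fd, const void *buf,
                                     size_t len, off_t off)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL);

    /* Plain buffered write; loop to handle short writes and EINTR. */
    while (done < len) {
        ssize_t n = pwrite(fd, p + done, len - done, off + (off_t) done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {
            /* No progress; bail out rather than loop forever. */
            errno = EIO;
            return -1;
        }
        done += (size_t) n;
    }

    /*
     * Ask the kernel to start writing the dirty pages in this range
     * now, without waiting for completion and without evicting them.
     */
    return sync_file_range(fd, off, (off_t) len, SYNC_FILE_RANGE_WRITE);
}
```

A caller would invoke this once per block written out, e.g.
write_and_start_writeback(fd, page, 8192, blockno * 8192). This covers
(a) and (b) above, but not (c): the written pages still end up in the
page cache.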

For most environments, always using O_DIRECT isn't really an option for
us: we can't rely on settings being tuned well enough (i.e. on a large
enough application cache), and we want to keep supporting setups where
a large enough postgres buffer cache isn't an option because it'd
prevent putting a number of variably used database servers on one piece
of hardware.

(There are also postgres-side issues preventing us from doing O_DIRECT
performantly, partly because we so far couldn't rely on AIO, as we also
use buffered IO, but we're fixing that now.)


For us a per-request interface where we'd have to fulfill all the
requirements for O_DIRECT, but where neither reads nor writes would
cause a page to move in/out of the pagecache, would be optimal for a
good part of our IO. Especially when we still could get zero-copy IO for
the pretty common case that there's no pagecache presence for a file at
all.

That'd allow us to use zero-copy writes for the common case of a file's
data fitting entirely in our cache, and us only occasionally writing the
data out at checkpoints. And do zero-copy reads for the cases where
we know it's unnecessary for the kernel to cache (e.g. because we are
scanning a few TB of data on a machine with less memory, because we're
re-filling our cache after a restart, or because it's a maintenance
operation doing the reading). But still rely on the kernel page cache
for other reads where the kernel caching when memory is available is a
huge benefit.  Some well-tuned workloads would turn that off, to only
use O_DIRECT, but everyone else would benefit from that being the
default.

We can concoct an approximation of that behaviour with a mix of
sync_file_range() (to force writeback), RWF_NOWAIT (to see if we should
read with O_DIRECT) and mmap()/mincore()/munmap() (to determine if
writes should use O_DIRECT). But that's quite a bit of overhead.
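
To make the two probes concrete, here is a minimal sketch under stated
assumptions (the helper names probe_read_cached() and
count_resident_pages() are invented for this example, and this is not
postgres code). The first tries a preadv2() with RWF_NOWAIT, which
fails with EAGAIN when the data is not immediately available; the
second maps the range and asks mincore() how many pages are resident.
Filesystems or kernels without RWF_NOWAIT support for buffered reads
return an error instead, which the caller has to treat as "unknown".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to read without blocking.
 * Returns 1 if the read returned without blocking (data, possibly
 * partial, came from the page cache, or we hit EOF), 0 if it would
 * have had to block (EAGAIN), and -1 on any other error.
 */
static int probe_read_cached(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n;

    assert(buf != NULL);

    n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0)
        return 1;
    return errno == EAGAIN ? 0 : -1;
}

/*
 * Hypothetical helper: count page-cache-resident pages in
 * [off, off + len). off must be a multiple of the page size.
 * Returns the number of resident pages, or -1 on error.
 */
static long count_resident_pages(int fd, off_t off, size_t len)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec;
    void *map;
    long resident = 0;

    if (len == 0)
        return 0;

    /* Mapping the file does not itself fault any pages in. */
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
    if (map == MAP_FAILED)
        return -1;

    vec = malloc(npages);
    if (vec == NULL || mincore(map, len, vec) != 0) {
        free(vec);
        munmap(map, len);
        return -1;
    }

    /* The low bit of each vector entry is set for resident pages. */
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;

    free(vec);
    munmap(map, len);
    return resident;
}
```

Each probe costs at least one extra syscall per request (three for the
mmap()/mincore()/munmap() variant), which is the overhead referred to
above.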

The reason that specifying this on a per-request basis would be useful
is mainly that that would allow us to avoid having to either have two
sets of FDs, or having to turn O_DIRECT on/off with fcntl.
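
For reference, the fcntl variant looks roughly like the following
sketch (set_direct_io() is a name invented for this example). On Linux,
O_DIRECT is one of the file status flags that F_SETFL can change, but
the flag lives on the open file description, so every user of that
descriptor sees the change, and each toggle costs two syscalls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>

/*
 * Hypothetical helper: switch O_DIRECT on or off for an already open
 * file descriptor. Returns 0 on success, -1 on error (for example when
 * the filesystem does not support O_DIRECT).
 */
static int set_direct_io(int fd, int enable)
{
    int flags;

    assert(fd >= 0);

    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;

    if (enable)
        flags |= O_DIRECT;
    else
        flags &= ~O_DIRECT;

    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}
```

The alternative is to open every file twice, once with and once without
O_DIRECT, and to pick the descriptor per request, which doubles the
number of FDs to manage.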

Greetings,

Andres Freund
