From: Dave Chinner <david@fromorbit.com>
To: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Dave Chinner <dchinner@redhat.com>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	Christoph Hellwig <hch@lst.de>,
	Matthew Wilcox <willy@infradead.org>,
	Amir Goldstein <amir73il@gmail.com>, Jan Kara <jack@suse.cz>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Josef Bacik <josef@toxicpanda.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: pagecache locking (was: bcachefs status update) merged)
Date: Fri, 14 Jun 2019 09:55:24 +1000
Message-ID: <20190613235524.GK14363@dread.disaster.area>
In-Reply-To: <20190613183625.GA28171@kmo-pixel>

On Thu, Jun 13, 2019 at 02:36:25PM -0400, Kent Overstreet wrote:
> On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > > Ok, I'm totally on board with returning EDEADLOCK.
> > > 
> > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
> > > in the same address space as the file being read/written to, even if the buffer
> > > and the IO don't technically overlap?
> > 
> > I'd say that depends on the lock granularity. For a range lock,
> > we'd be able to do the IO for non-overlapping ranges. For a normal
> > mutex or rwsem, then we risk deadlock if the page fault triggers on
> > the same address space host as we already have locked for IO. That's
> > the case we currently handle with the second IO lock in XFS, ext4,
> > btrfs, etc (XFS_MMAPLOCK_* in XFS).
> > 
> > One of the reasons I'm looking at range locks for XFS is to get rid
> > of the need for this second mmap lock, as there is no reason for it
> > existing if we can lock ranges and EDEADLOCK inside page faults and
> > return errors.
> 
> My concern is that range locks are going to turn out to be both more complicated
> and heavier weight, performance wise, than the approach I've taken of just a
> single lock per address space.

That's the battle I'm fighting at the moment with them for direct
IO(*), but range locks are something I'm doing for XFS and I don't
really care if anyone else wants to use them or not.

(*) Direct IO on XFS is a pure shared lock workload, so the rwsem
scales until single atomic update cache line bouncing limits
throughput. That means I can max out my hardware at 1.6 million
random 4k read/write IOPS (a bit over 6GB/s)(**) to a single file
with a rwsem at 32 AIO+DIO dispatch threads. I've only got range
locks to about 1.1M IOPS on the same workload, though it's within a
couple of percent of a rwsem up to 16 threads...

(**) A small handful of nvme SSDs fed by AIO+DIO are /way faster/
than pmem that is emulated with RAM, let alone real pmem which is
much slower at random writes than RAM.

> Reason being range locks only help when you've got multiple operations going on
> simultaneously that don't conflict - i.e. it's really only going to be useful
> for applications that are doing buffered IO and direct IO simultaneously to the
> same file.

Yes, they do that, but that's not why I'm looking at this.  Range
locks are primarily for applications that mix multiple different
types of operations to the same file concurrently. e.g:

- fallocate and read/write() can be run concurrently if they
don't overlap, but right now we serialise them because we have no
visibility into what other operations require.

- buffered read and buffered write can run concurrently if they
don't overlap, but right now they are serialised because that's the
only way to provide POSIX atomic write vs read semantics (only XFS
provides userspace with that guarantee).

- Sub-block direct IO is serialised against all other direct IO
because we can't tell if it overlaps with some other direct IO and
so we have to take the slow but safe option - range locks solve that
problem, too.

- there's inode_dio_wait() for DIO truncate serialisation
because AIO doesn't hold inode locks across IO - range locks can be
held all the way to AIO completion so we can get rid of
inode_dio_wait() in XFS and that allows truncate/fallocate to run
concurrently with non-overlapping direct IO.

- non-overlapping range locks can be held on either side of page
faults, which gets rid of the need for the special mmap locking path
to serialise page faults against invalidation operations.

IOWs, range locks for IO solve a bunch of long term problems we have
in XFS and largely simplify the lock algorithms within the
filesystem. And it puts us on the path to introduce range locks for
extent mapping serialisation, allowing concurrent mapping lookups
and allocation within a single file. It also has the potential to
allow us to do concurrent directory modifications....
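
To make that concrete, the sort of interface I'm talking about looks
roughly like the sketch below. It's purely illustrative - the
prototype I'm benchmarking differs in the details - but it shows why
disjoint ranges never contend:

	/*
	 * Illustrative sketch only. A lock request covers the byte
	 * range [rl_start, rl_end]; requests that don't overlap never
	 * block each other.
	 */
	struct range_lock {
		loff_t			rl_start;
		loff_t			rl_end;
		struct rb_node		rl_node;	/* per-inode tree */
		struct task_struct	*rl_owner;
		bool			rl_shared;	/* read or write request */
	};

	struct range_lock_tree {
		spinlock_t		rlt_lock;	/* protects the tree */
		struct rb_root_cached	rlt_root;	/* held + waiting ranges */
	};

	static inline void
	range_lock_init(struct range_lock *rl, loff_t offset, loff_t len)
	{
		rl->rl_start = offset;
		rl->rl_end = offset + len - 1;
		RB_CLEAR_NODE(&rl->rl_node);
		rl->rl_owner = current;
	}

	/*
	 * range_lock_write() inserts rl and sleeps until no overlapping
	 * request remains in the tree; range_lock_read() only waits for
	 * overlapping write requests. That's what allows all of the
	 * non-overlapping combinations listed above to run concurrently.
	 */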

> Personally, I think that would be a pretty gross thing to do and I'm
> not particularly interested in optimizing for that case myself... but, if you
> know of applications that do depend on that I might change my opinion. If not, I
> want to try and get the simpler, one-lock-per-address space approach to work.
> 
> That said though - range locks on the page cache can be viewed as just a
> performance optimization over my approach, they've got the same semantics
> (locking a subset of the page cache vs. the entire thing). So that's a bit of a
> digression.

IO range locks are not "locking the page cache". IO range locks are
purely for managing concurrent IO state in a fine grained manner.
The page cache already has its own locking - that just needs to
nest inside IO range locks, as the IO locks are what provide the
high level exclusion from overlapping page cache operations...

> > > This would simplify things a lot and eliminate a really nasty corner case - page
> > > faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> > > readahead can pull in pages that do overlap with the dio.
> > 
> > Page cache readahead needs to be moved under the filesystem IO
> > locks. There was a recent thread about how readahead can race with
> > hole punching and other fallocate() operations because page cache
> > readahead bypasses the filesystem IO locks used to serialise page
> > cache invalidation.
> > 
> > e.g. Readahead can be directed by userspace via fadvise, so we now
> > have file->f_op->fadvise() so that filesystems can lock the inode
> > before calling generic_fadvise() such that page cache instantiation
> > and readahead dispatch can be serialised against page cache
> > invalidation. I have a patch for XFS sitting around somewhere that
> > implements the ->fadvise method.
> 
> I just puked a little in my mouth.

Yeah, it's pretty gross. But the page cache simply isn't designed to
allow atomic range operations to be performed. We haven't been able to
drag it out of the 1980s - we wrote the fs/iomap.c code so we could
do range based extent mapping for IOs rather than the horrible,
inefficient page-by-page block mapping the generic page cache code
does - that gave us a 30+% increase in buffered IO throughput
because we only do a single mapping lookup per IO rather than one
per page...
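
The shape of it, going from memory rather than from the real XFS
code (the example_* names are placeholders, and the iomap_begin
prototype is the one from this vintage of the kernel):

	/* hypothetical filesystem callback: map one extent per call */
	static int
	example_iomap_begin(
		struct inode		*inode,
		loff_t			pos,
		loff_t			length,
		unsigned		flags,
		struct iomap		*iomap)
	{
		/*
		 * Look up (or allocate) a single extent covering as much
		 * of [pos, pos + length) as possible and describe it in
		 * *iomap. The iomap code then works over everything that
		 * mapping covers before asking for another one.
		 */
		return 0;
	}

	static const struct iomap_ops example_iomap_ops = {
		.iomap_begin	= example_iomap_begin,
	};

	static ssize_t
	example_buffered_write(
		struct kiocb		*iocb,
		struct iov_iter		*from)
	{
		/* one mapping lookup per extent, not one get_block() per page */
		return iomap_file_buffered_write(iocb, from, &example_iomap_ops);
	}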

That said, the page cache is still far, far slower than direct IO,
and the gap is just getting wider and wider as nvme SSDs get faster
and faster. PCIe 4 SSDs are just going to make this even more
obvious - it's getting to the point where the only reason for having
a page cache is to support mmap() and cheap systems with spinning
rust storage.
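
Back on the ->fadvise hook: the hook itself is trivial. From memory
- and this is a sketch, not the actual patch - it's roughly:

	static int
	xfs_file_fadvise(
		struct file	*file,
		loff_t		offset,
		loff_t		len,
		int		advice)
	{
		struct xfs_inode	*ip = XFS_I(file_inode(file));
		int			lockflags = 0;
		int			ret;

		/*
		 * Only advice that instantiates page cache (readahead)
		 * needs to be serialised against hole punch, truncate
		 * and other invalidation operations.
		 */
		if (advice == POSIX_FADV_WILLNEED) {
			lockflags = XFS_IOLOCK_SHARED;
			xfs_ilock(ip, lockflags);
		}
		ret = generic_fadvise(file, offset, len, advice);
		if (lockflags)
			xfs_iunlock(ip, lockflags);
		return ret;
	}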

> > I think there are some other patches floating around to address the
> > other readahead mechanisms to only be done under filesystem IO locks,
> > but I haven't had time to dig into it any further. Readahead from
> > page faults most definitely needs to be under the MMAPLOCK at
> > least so it serialises against fallocate()...
> 
> So I think there's two different approaches we should distinguish between. We
> can either add the locking to all the top level IO paths - what you just
> described - or, the locking can be pushed down to protect _only_ adding pages to
> the page cache, which is the approach I've been taking.

I don't think just serialising adding pages is sufficient for
filesystems like XFS because of, e.g., DAX.

> I think both approaches are workable, but I do think that pushing the locking
> down to __add_to_page_cache_locked is fundamentally the better, more correct
> approach.
> 
>  - It better matches the semantics of what we're trying to do. All these
>    operations we're trying to protect - dio, fallocate, truncate - they all have
>    in common that they just want to shoot down a range of the page cache and
>    keep it from being readded. And in general, it's better to have locks that
>    protect specific data structures ("adding to this radix tree"), vs. large
>    critical sections ("the io path").

I disagree :)

The high level IO locks provide the IO concurrency policy for the
filesystem. The page cache is an internal structure for caching
pages - it is not a structure for efficiently and cleanly
implementing IO concurrency policy. That's the mistake the current
page cache architecture makes - it tries to be the central control
for all the filesystem IO (because filesystems are dumb and the page
cache knows best!) but, unfortunately, this does not provide the
semantics or functionality that all filesystems want and/or need.

Just look at the truncate detection mess we have every time we
look up and lock a page anywhere in the mm/ code - do you see any
code in all that which detects a hole punch race? Nope, you don't
because the filesystems take responsibility for serialising that
functionality. Unfortunately, we have so much legacy filesystem
cruft we'll never get rid of those truncate hacks.

That's my beef with relying on the page cache - the page cache is
rapidly becoming a legacy structure that only serves to slow modern
IO subsystems down. More and more we are going to bypass the page
cache and push/pull data via DMA directly into user buffers because
that's the only way we have enough CPU power in the systems to keep
the storage we have fully utilised.

That means, IMO, making the page cache central to solving an IO
concurrency problem is going the wrong way....

>    In bcachefs, at least for buffered IO I don't currently need any per-inode IO
>    locks, page granularity locks suffice, so I'd like to keep that - under the
>    theory that buffered IO to pages already in cache is more of a fast path than
>    faulting pages in.
> 
>  - If we go with the approach of using the filesystem IO locks, we need to be
>    really careful about auditing and adding assertions to make sure we've found
>    and fixed all the code paths that can potentially add pages to the page
>    cache. I didn't even know about the fadvise case, eesh.

Sure, but we've largely done that already. There aren't a lot of
places that add pages to the page cache.

>  - We still need to do something about the case where we're recursively faulting
>    pages back in, which means we need _something_ in place to even detect that
>    that's happening. Just trying to cover everything with the filesystem IO
>    locks isn't useful here.

Haven't we just addressed that with setting a current task lock
context and returning -EDEADLOCK?
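
i.e. something along these lines - a sketch of the idea only; the
io_lock_ctx structure, the task_struct field and the fault-to-EDEADLK
plumbing below are all hypothetical:

	/* hypothetical: records the IO range the task holds locked */
	struct io_lock_ctx {
		struct inode	*inode;
		loff_t		start;
		loff_t		end;
	};

	static vm_fault_t
	example_filemap_fault(
		struct vm_fault	*vmf)
	{
		struct inode		*inode = file_inode(vmf->vma->vm_file);
		struct io_lock_ctx	*ctx = current->io_lock_ctx; /* hypothetical field */
		loff_t			pos = (loff_t)vmf->pgoff << PAGE_SHIFT;

		/*
		 * If the fault recurses into a range this task already
		 * holds the IO lock on, fail it rather than deadlock.
		 * The IO path sees the failed copy-in/out and returns
		 * -EDEADLK to userspace.
		 */
		if (ctx && ctx->inode == inode &&
		    pos <= ctx->end && pos + PAGE_SIZE > ctx->start)
			return VM_FAULT_SIGBUS;

		return filemap_fault(vmf);
	}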

> So to summarize - if we have locking specifically for adding pages to the page
> cache, we don't need to extend the filesystem IO locks to all these places, and
> we need something at that level anyways to handle recursive faults from gup()
> anyways.
> 
> The tricky part is that there's multiple places that want to call
> add_to_page_cache() while holding this pagecache_add_lock.
> 
>  - dio -> gup(): but, you had the idea of just returning -EDEADLOCK here which I
>    like way better than my recursive locking approach.
> 
>  - the other case is truncate/fpunch/etc - they make use of buffered IO to
>    handle operations that aren't page/block aligned.  But those look a lot more
>    tractable than dio, since they're calling find_get_page()/readpage() directly
>    instead of via gup(), and possibly they could be structured to not have to
>    truncate/punch the partial page while holding the pagecache_add_lock at all
>    (but that's going to be heavily filesystem dependent).

That sounds like it is going to be messy. :(

> One other tricky thing we need is a way to write out and then evict a page
> without allowing it to be redirtied - i.e. something that combines
> filemap_write_and_wait_range() with invalidate_inode_pages2_range(). Otherwise,
> a process continuously redirtying a page is going to make truncate/dio
> operations spin trying to shoot down the page cache - in bcachefs I'm currently
> taking pagecache_add_lock in write_begin and mkwrite to prevent this, but I
> really want to get rid of that. If we can get that combined
> write_and_invalidate() operation, then I think the locking will turn out fairly
> clean.

IMO, this isn't a page cache problem - this is a filesystem
operation vs page fault serialisation issue. Why? Because the
problem exists for DAX, and it doesn't add pages to the page cache
for mappings...

i.e.  We've already solved these problems with the same high level
IO locks that solve all the truncate, hole punch, etc issues for
both the page cache and DAX operation. i.e. The MMAPLOCK prevents
the PTE being re-dirtied across the entire filesystem operation, not
just the writeback and invalidation. The XFS flush+inval code
looks like this:

	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
	xfs_flush_unmap_range(ip, offset, len);

	<do operation that requires invalidated page cache>
	<flush modifications if necessary>

	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);

With range locks it looks like this:

	range_lock_init(&rl, offset, len);
	range_lock_write(&ip->i_iolock, &rl);
	xfs_flush_unmap_range(ip, offset, len);

	<do operation that requires invalidated page cache>
	<flush modified range if necessary>

	range_unlock_write(&ip->i_iolock, &rl);
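
FWIW, xfs_flush_unmap_range() is more or less the combined
write-out-and-invalidate helper you're describing. From memory it's
roughly:

	int
	xfs_flush_unmap_range(
		struct xfs_inode	*ip,
		xfs_off_t		offset,
		xfs_off_t		len)
	{
		struct xfs_mount	*mp = ip->i_mount;
		struct inode		*inode = VFS_I(ip);
		xfs_off_t		rounding, start, end;
		int			error;

		/* wait for the completion of any pending DIOs */
		inode_dio_wait(inode);

		rounding = max_t(xfs_off_t, mp->m_sb.sb_blocksize, PAGE_SIZE);
		start = round_down(offset, rounding);
		end = round_up(offset + len, rounding) - 1;

		error = filemap_write_and_wait_range(inode->i_mapping, start, end);
		if (error)
			return error;
		truncate_pagecache_range(inode, start, end);
		return 0;
	}

The reason it doesn't have to prevent the range being redirtied
itself is that the callers hold the MMAPLOCK (or would hold the range
lock) across the whole operation, so page faults can't get back in to
dirty the range until the operation completes.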

---

In summary, I can see why the page cache add lock works well for
bcachefs, but on the other hand I can say that it really doesn't
match well for filesystems like XFS or for filesystems that
implement DAX and so may not be using the page cache at all...

Don't get me wrong - I'm not opposed to including page cache add
locking - I'm just saying that the problems it tries to address (and
ones it cannot address) are already largely solved in existing
filesystems. I suspect that if we do merge this code, whatever
locking is added would have to be optional....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
