[PATCH 0/3] Prevent ->map_pages from sleeping

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] Prevent ->map_pages from sleeping
@ 2023-02-08 14:53 Matthew Wilcox (Oracle)
  2023-02-08 14:53 ` [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper Matthew Wilcox (Oracle)
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-02-08 14:53 UTC (permalink / raw)
  To: linux-xfs, linux-afs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle)

In preparation for a larger patch series which will handle (some, easy)
page faults protected only by RCU, change the two filesystems which have
sleeping locks to not take them and hold the RCU lock around calls to
->map_page to prevent other filesystems from adding sleeping locks.

Matthew Wilcox (Oracle) (3):
  xfs: Remove xfs_filemap_map_pages() wrapper
  afs: Split afs_pagecache_valid() out of afs_validate()
  mm: Hold the RCU read lock over calls to ->map_pages

 Documentation/filesystems/locking.rst |  4 ++--
 fs/afs/file.c                         | 14 ++------------
 fs/afs/inode.c                        | 27 +++++++++++++++++++--------
 fs/afs/internal.h                     |  1 +
 fs/xfs/xfs_file.c                     | 17 +----------------
 mm/memory.c                           |  7 ++++++-
 6 files changed, 31 insertions(+), 39 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-08 14:53 [PATCH 0/3] Prevent ->map_pages from sleeping Matthew Wilcox (Oracle)
@ 2023-02-08 14:53 ` Matthew Wilcox (Oracle)
  2023-02-08 16:39   ` Darrick J. Wong
  2023-02-08 14:53 ` [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate() Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-02-08 14:53 UTC (permalink / raw)
  To: linux-xfs, linux-afs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle)

XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED
to do this, any more than it needs the XFS_MMAPLOCK_SHARED for a
read() that hits in the page cache.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_file.c | 17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 705250f9f90a..528fc538b6b9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1388,25 +1388,10 @@ xfs_filemap_pfn_mkwrite(
 	return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
 }
 
-static vm_fault_t
-xfs_filemap_map_pages(
-	struct vm_fault		*vmf,
-	pgoff_t			start_pgoff,
-	pgoff_t			end_pgoff)
-{
-	struct inode		*inode = file_inode(vmf->vma->vm_file);
-	vm_fault_t ret;
-
-	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
-	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	return ret;
-}
-
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.huge_fault	= xfs_filemap_huge_fault,
-	.map_pages	= xfs_filemap_map_pages,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_filemap_page_mkwrite,
 	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
 };
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-08 14:53 ` [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper Matthew Wilcox (Oracle)
@ 2023-02-08 16:39   ` Darrick J. Wong
  2023-02-08 17:12     ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2023-02-08 16:39 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-xfs, linux-afs, linux-fsdevel, linux-mm

On Wed, Feb 08, 2023 at 02:53:33PM +0000, Matthew Wilcox (Oracle) wrote:
> XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED
> to do this, any more than it needs the XFS_MMAPLOCK_SHARED for a
> read() that hits in the page cache.

Hmm.  From commit cd647d5651c0 ("xfs: use MMAPLOCK around
filemap_map_pages()"):

    The page faultround path ->map_pages is implemented in XFS via
    filemap_map_pages(). This function checks that pages found in page
    cache lookups have not raced with truncate based invalidation by
    checking page->mapping is correct and page->index is within EOF.

    However, we've known for a long time that this is not sufficient to
    protect against races with invalidations done by operations that do
    not change EOF. e.g. hole punching and other fallocate() based
    direct extent manipulations. The way we protect against these
    races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
    lock so they serialise against fallocate and truncate before calling
    into the filemap function that processes the fault.

    Do the same for XFS's ->map_pages implementation to close this
    potential data corruption issue.

How do we prevent faultaround from racing with fallocate and reflink
calls that operate below EOF?

--D

> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/xfs/xfs_file.c | 17 +----------------
>  1 file changed, 1 insertion(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 705250f9f90a..528fc538b6b9 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1388,25 +1388,10 @@ xfs_filemap_pfn_mkwrite(
>  	return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
>  }
>  
> -static vm_fault_t
> -xfs_filemap_map_pages(
> -	struct vm_fault		*vmf,
> -	pgoff_t			start_pgoff,
> -	pgoff_t			end_pgoff)
> -{
> -	struct inode		*inode = file_inode(vmf->vma->vm_file);
> -	vm_fault_t ret;
> -
> -	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
> -	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	return ret;
> -}
> -
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>  	.fault		= xfs_filemap_fault,
>  	.huge_fault	= xfs_filemap_huge_fault,
> -	.map_pages	= xfs_filemap_map_pages,
> +	.map_pages	= filemap_map_pages,
>  	.page_mkwrite	= xfs_filemap_page_mkwrite,
>  	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
>  };
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-08 16:39   ` Darrick J. Wong
@ 2023-02-08 17:12     ` Matthew Wilcox
  2023-02-08 21:53       ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2023-02-08 17:12 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-afs, linux-fsdevel, linux-mm, Dave Chinner

On Wed, Feb 08, 2023 at 08:39:19AM -0800, Darrick J. Wong wrote:
> On Wed, Feb 08, 2023 at 02:53:33PM +0000, Matthew Wilcox (Oracle) wrote:
> > XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED
> > to do this, any more than it needs the XFS_MMAPLOCK_SHARED for a
> > read() that hits in the page cache.
> 
> Hmm.  From commit cd647d5651c0 ("xfs: use MMAPLOCK around
> filemap_map_pages()"):
> 
>     The page faultround path ->map_pages is implemented in XFS via
>     filemap_map_pages(). This function checks that pages found in page
>     cache lookups have not raced with truncate based invalidation by
>     checking page->mapping is correct and page->index is within EOF.
> 
>     However, we've known for a long time that this is not sufficient to
>     protect against races with invalidations done by operations that do
>     not change EOF. e.g. hole punching and other fallocate() based
>     direct extent manipulations. The way we protect against these
>     races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
>     lock so they serialise against fallocate and truncate before calling
>     into the filemap function that processes the fault.
> 
>     Do the same for XFS's ->map_pages implementation to close this
>     potential data corruption issue.
> 
> How do we prevent faultaround from racing with fallocate and reflink
> calls that operate below EOF?

I don't understand the commit message.  It'd be nice to have an example
of what's insufficient about the protection.  If XFS really needs it,
it can trylock the semaphore and return 0 if it fails, falling back to
the ->fault path.  But I don't think XFS actually needs it.

The ->map_pages path trylocks the folio, checks the folio->mapping,
checks uptodate, then checks beyond EOF (not relevant to hole punch).
Then it takes the page table lock and puts the page(s) into the page
tables, unlocks the folio and moves on to the next folio.

The hole-punch path, like the truncate path, takes the folio lock,
unmaps the folio (which will take the page table lock) and removes
it from the page cache.

So what's the race?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-08 17:12     ` Matthew Wilcox
@ 2023-02-08 21:53       ` Dave Chinner
  2023-02-09  2:44         ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2023-02-08 21:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-xfs, linux-afs, linux-fsdevel, linux-mm,
	Dave Chinner

On Wed, Feb 08, 2023 at 05:12:06PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 08:39:19AM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 08, 2023 at 02:53:33PM +0000, Matthew Wilcox (Oracle) wrote:
> > > XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED
> > > to do this, any more than it needs the XFS_MMAPLOCK_SHARED for a
> > > read() that hits in the page cache.
> > 
> > Hmm.  From commit cd647d5651c0 ("xfs: use MMAPLOCK around
> > filemap_map_pages()"):
> > 
> >     The page faultround path ->map_pages is implemented in XFS via
> >     filemap_map_pages(). This function checks that pages found in page
> >     cache lookups have not raced with truncate based invalidation by
> >     checking page->mapping is correct and page->index is within EOF.
> > 
> >     However, we've known for a long time that this is not sufficient to
> >     protect against races with invalidations done by operations that do
> >     not change EOF. e.g. hole punching and other fallocate() based
> >     direct extent manipulations. The way we protect against these
> >     races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
> >     lock so they serialise against fallocate and truncate before calling
> >     into the filemap function that processes the fault.
> > 
> >     Do the same for XFS's ->map_pages implementation to close this
> >     potential data corruption issue.
> > 
> > How do we prevent faultaround from racing with fallocate and reflink
> > calls that operate below EOF?
> 
> I don't understand the commit message.  It'd be nice to have an example
> of what's insufficient about the protection.

When this change was made, "insufficient protection" was a reference
to the rather well known fact we'd been bugging MM developers about
for well over a decade (i.e. since before ->page_mkwrite existed)
that the unlocked page invalidation detection hack used everywhere
in the page cache code was broken for page invalidation within EOF.
i.e.  that cannot be correctly detected by (page->mapping == NULL &&
page->index > EOF) checks. This was a long standing problem, so
after a decade of being ignored, the MMAPLOCK was added to XFS to
serialise invalidation against page fault based operations.

At the time page faults could instantiate page cache pages whilst
invalidation operations like truncate_pagecache_range() were running
and hence page faults could be instantiating and mapping pages over
the range we are trying to invalidate. We were also finding niche
syscalls that caused data corruption due to invalidation races (e.g.
see xfs_file_fadvise() to avoid readahead vs hole punch races from
fadvise(WILLNEED) and readahead() syscalls), so I did an audit to
look for any potential interfaces that could race with invalidation.
->map_pages() being called from within the page fault code and
having a broken page->index based check for invalidation looked
suspect and potentially broken.  Hence I slapped the MMAPLOCK around
it to stop it from running while a XFS driven page cache
invalidation operation was in progress.

We work on the principle that when it comes to data corruption
vectors, it is far better to err on the side of safety than it is to
play fast and loose. fault-around is a perf optimisation, and taking
a rwsem in shared mode is not a major increase in overhead for that
path, so there was little risk of regressions in adding
serialisation just in case there was an as-yet-unknown data
corruption vector from that path.

Keep in mind this was written before the mm code handled page cache
instantiation serialisation sanely via the
mapping->invalidation_lock. The mapping->invalidation_lock solves
the same issues in a slightly different way, and it may well be that
the different implementation means that we don't need to use it in
all the places we place the MMAPLOCK in XFS originally.

> If XFS really needs it,
> it can trylock the semaphore and return 0 if it fails, falling back to
> the ->fault path.  But I don't think XFS actually needs it.
>
> The ->map_pages path trylocks the folio, checks the folio->mapping,
> checks uptodate, then checks beyond EOF (not relevant to hole punch).
> Then it takes the page table lock and puts the page(s) into the page
> tables, unlocks the folio and moves on to the next folio.
> 
> The hole-punch path, like the truncate path, takes the folio lock,
> unmaps the folio (which will take the page table lock) and removes
> it from the page cache.
> 
> So what's the race?

Hole punch is a multi-folio operation, so while we are operating on
invalidating one folio, another folio in the range we've already
invalidated could be instantiated and mapped, leaving mapped
up-to-date pages over a range we *require* the page cache to empty.

The original MMAPLOCK could not prevent the instantiation of new
page cache pages while an invalidation was running, hence we had to
block any operation from page faults that instantiated pages into
the page cache or operated on the page cache in any way while an
invalidation was being run.

The mapping->invalidation_lock solved this specific aspect of the
problem, so it's entirely possible that we don't have to care about
using MMAPLOCK for filemap_map_pages() any more. But I don't know
that for certain, I haven't had any time to investigate it in any
detail, and when it comes to data corruption vectors I'm not going
to change serialisation mechanisms without a decent amount of
investigation.

I couldn't ever convince myself there wasn't a problem hence the
comment in the commit:

"Do the same for XFS's ->map_pages implementation to close this
 potential data corruption issue."

Hence if you can explain to me how filemap_map_pages() cannot race
against invalidation without holding the mapping->invalidation_lock
without potentially leaving stale data in the page cache over the
invalidated range (this isn't an XFS specific issue!), then I don't
see a problem with removing the MMAPLOCK from this path.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-08 21:53       ` Dave Chinner
@ 2023-02-09  2:44         ` Matthew Wilcox
  2023-02-09 21:53           ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2023-02-09  2:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, linux-xfs, linux-afs, linux-fsdevel, linux-mm,
	Dave Chinner

On Thu, Feb 09, 2023 at 08:53:11AM +1100, Dave Chinner wrote:
> > If XFS really needs it,
> > it can trylock the semaphore and return 0 if it fails, falling back to
> > the ->fault path.  But I don't think XFS actually needs it.
> >
> > The ->map_pages path trylocks the folio, checks the folio->mapping,
> > checks uptodate, then checks beyond EOF (not relevant to hole punch).
> > Then it takes the page table lock and puts the page(s) into the page
> > tables, unlocks the folio and moves on to the next folio.
> > 
> > The hole-punch path, like the truncate path, takes the folio lock,
> > unmaps the folio (which will take the page table lock) and removes
> > it from the page cache.
> > 
> > So what's the race?
> 
> Hole punch is a multi-folio operation, so while we are operating on
> invalidating one folio, another folio in the range we've already
> invalidated could be instantiated and mapped, leaving mapped
> up-to-date pages over a range we *require* the page cache to empty.

Nope.  ->map_pages is defined to _not_ instantiate new pages.
If there are uptodate pages in the page cache, they can be mapped, but
missing pages will be skipped, and left to ->fault to bring in.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-09  2:44         ` Matthew Wilcox
@ 2023-02-09 21:53           ` Dave Chinner
  2023-02-09 22:34             ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2023-02-09 21:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-xfs, linux-afs, linux-fsdevel, linux-mm,
	Dave Chinner

On Thu, Feb 09, 2023 at 02:44:20AM +0000, Matthew Wilcox wrote:
> On Thu, Feb 09, 2023 at 08:53:11AM +1100, Dave Chinner wrote:
> > > If XFS really needs it,
> > > it can trylock the semaphore and return 0 if it fails, falling back to
> > > the ->fault path.  But I don't think XFS actually needs it.
> > >
> > > The ->map_pages path trylocks the folio, checks the folio->mapping,
> > > checks uptodate, then checks beyond EOF (not relevant to hole punch).
> > > Then it takes the page table lock and puts the page(s) into the page
> > > tables, unlocks the folio and moves on to the next folio.
> > > 
> > > The hole-punch path, like the truncate path, takes the folio lock,
> > > unmaps the folio (which will take the page table lock) and removes
> > > it from the page cache.
> > > 
> > > So what's the race?
> > 
> > Hole punch is a multi-folio operation, so while we are operating on
> > invalidating one folio, another folio in the range we've already
> > invalidated could be instantiated and mapped, leaving mapped
> > up-to-date pages over a range we *require* the page cache to empty.
> 
> Nope.  ->map_pages is defined to _not_ instantiate new pages.
> If there are uptodate pages in the page cache, they can be mapped, but
> missing pages will be skipped, and left to ->fault to bring in.

Sure, but *at the time this change was made* other operations could
instantiate pages whilst an invalidate was running, and then
->map_pages could also find them and map them whilst that
invalidation was still running. i.e. the race conditions that
existed before the mapping->invalidate_lock was introduced (ie. we
couldn't intercept read page faults instantiating pages in the page
cache at all) didn't require ->map_pages to instantiate the page for
it to be able to expose incorrect data to userspace when page faults
raced with an ongoing invalidation operation.

While this may not be able to happen now if everything is using the
mapping->invalidate_lock correctly (because read faults are now
intercepted before they can instatiate new page cache pages), it
doesn't mean it wasn't possible in the past.....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-09 21:53           ` Dave Chinner
@ 2023-02-09 22:34             ` Matthew Wilcox
  2023-02-09 23:59               ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2023-02-09 22:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, linux-xfs, linux-afs, linux-fsdevel, linux-mm,
	Dave Chinner

On Fri, Feb 10, 2023 at 08:53:58AM +1100, Dave Chinner wrote:
> On Thu, Feb 09, 2023 at 02:44:20AM +0000, Matthew Wilcox wrote:
> > On Thu, Feb 09, 2023 at 08:53:11AM +1100, Dave Chinner wrote:
> > > > If XFS really needs it,
> > > > it can trylock the semaphore and return 0 if it fails, falling back to
> > > > the ->fault path.  But I don't think XFS actually needs it.
> > > >
> > > > The ->map_pages path trylocks the folio, checks the folio->mapping,
> > > > checks uptodate, then checks beyond EOF (not relevant to hole punch).
> > > > Then it takes the page table lock and puts the page(s) into the page
> > > > tables, unlocks the folio and moves on to the next folio.
> > > > 
> > > > The hole-punch path, like the truncate path, takes the folio lock,
> > > > unmaps the folio (which will take the page table lock) and removes
> > > > it from the page cache.
> > > > 
> > > > So what's the race?
> > > 
> > > Hole punch is a multi-folio operation, so while we are operating on
> > > invalidating one folio, another folio in the range we've already
> > > invalidated could be instantiated and mapped, leaving mapped
> > > up-to-date pages over a range we *require* the page cache to empty.
> > 
> > Nope.  ->map_pages is defined to _not_ instantiate new pages.
> > If there are uptodate pages in the page cache, they can be mapped, but
> > missing pages will be skipped, and left to ->fault to bring in.
> 
> Sure, but *at the time this change was made* other operations could
> instantiate pages whilst an invalidate was running, and then
> ->map_pages could also find them and map them whilst that
> invalidation was still running. i.e. the race conditions that
> existed before the mapping->invalidate_lock was introduced (ie. we
> couldn't intercept read page faults instantiating pages in the page
> cache at all) didn't require ->map_pages to instantiate the page for
> it to be able to expose incorrect data to userspace when page faults
> raced with an ongoing invalidation operation.
> 
> While this may not be able to happen now if everything is using the
> mapping->invalidate_lock correctly (because read faults are now
> intercepted before they can instatiate new page cache pages), it
> doesn't mean it wasn't possible in the past.....

Sorry, still not getting it.  Here's the scenario I think you're
talking about.  We have three threads (probably in different tasks
or they may end up getting synchronized on the page table locks).

Thread 1 is calling FALLOC_FL_PUNCH_HOLE over a nice wide range.
Thread 2 has the file mmaped and takes a read page fault.
Thread 3 also has the file mmaped and also takes a read page fault.

Thread 2 calls filemap_map_pages and finds the pages gone.  It proceeds
to call xfs_filemap_fault() which calls filemap_fault() without
taking any XFS locks.  filemap_fault() kicks off some readahead which
allocates some pages & puts them in the page cache.  It calls into
xfs_vm_readahead() which calls iomap_readahead() without taking any XFS
locks.  iomap_readahead() will then call back into xfs_read_iomap_begin()
which takes the XFS_ILOCK_SHARED.

Since thread 1 is holding XFS_IOLOCK_EXCL, I presume thread 2 will
block at this point until thread 1 is done.  At this point, the page
is still not uptodate, so thread 3 will not map the page if it finds it
in >map_pages.

Or have I misunderstood XFS inode locking?  Entirely possible, it
seems quite complicated.  Nevertheless, it seems to me that if there's
locking that's missing, there's ample opportunities for XFS to take those
missing locks in the (slow) fault path, and not take them in the (fast)
map_pages path.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper
  2023-02-09 22:34             ` Matthew Wilcox
@ 2023-02-09 23:59               ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-09 23:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-xfs, linux-afs, linux-fsdevel, linux-mm,
	Dave Chinner

On Thu, Feb 09, 2023 at 10:34:21PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 10, 2023 at 08:53:58AM +1100, Dave Chinner wrote:
> > On Thu, Feb 09, 2023 at 02:44:20AM +0000, Matthew Wilcox wrote:
> > > On Thu, Feb 09, 2023 at 08:53:11AM +1100, Dave Chinner wrote:
> > > > > If XFS really needs it,
> > > > > it can trylock the semaphore and return 0 if it fails, falling back to
> > > > > the ->fault path.  But I don't think XFS actually needs it.
> > > > >
> > > > > The ->map_pages path trylocks the folio, checks the folio->mapping,
> > > > > checks uptodate, then checks beyond EOF (not relevant to hole punch).
> > > > > Then it takes the page table lock and puts the page(s) into the page
> > > > > tables, unlocks the folio and moves on to the next folio.
> > > > > 
> > > > > The hole-punch path, like the truncate path, takes the folio lock,
> > > > > unmaps the folio (which will take the page table lock) and removes
> > > > > it from the page cache.
> > > > > 
> > > > > So what's the race?
> > > > 
> > > > Hole punch is a multi-folio operation, so while we are operating on
> > > > invalidating one folio, another folio in the range we've already
> > > > invalidated could be instantiated and mapped, leaving mapped
> > > > up-to-date pages over a range we *require* the page cache to empty.
> > > 
> > > Nope.  ->map_pages is defined to _not_ instantiate new pages.
> > > If there are uptodate pages in the page cache, they can be mapped, but
> > > missing pages will be skipped, and left to ->fault to bring in.
> > 
> > Sure, but *at the time this change was made* other operations could
> > instantiate pages whilst an invalidate was running, and then
> > ->map_pages could also find them and map them whilst that
> > invalidation was still running. i.e. the race conditions that
> > existed before the mapping->invalidate_lock was introduced (ie. we
> > couldn't intercept read page faults instantiating pages in the page
> > cache at all) didn't require ->map_pages to instantiate the page for
> > it to be able to expose incorrect data to userspace when page faults
> > raced with an ongoing invalidation operation.
> > 
> > While this may not be able to happen now if everything is using the
> > mapping->invalidate_lock correctly (because read faults are now
> > intercepted before they can instatiate new page cache pages), it
> > doesn't mean it wasn't possible in the past.....
> 
> Sorry, still not getting it.  Here's the scenario I think you're
> talking about.  We have three threads (probably in different tasks
> or they may end up getting synchronized on the page table locks).
> 
> Thread 1 is calling FALLOC_FL_PUNCH_HOLE over a nice wide range.
> Thread 2 has the file mmaped and takes a read page fault.
> Thread 3 also has the file mmaped and also takes a read page fault.
> 
> Thread 2 calls filemap_map_pages and finds the pages gone.  It proceeds
> to call xfs_filemap_fault() which calls filemap_fault() without
> taking any XFS locks.  filemap_fault() kicks off some readahead which
> allocates some pages & puts them in the page cache.  It calls into
> xfs_vm_readahead() which calls iomap_readahead() without taking any XFS
> locks.  iomap_readahead() will then call back into xfs_read_iomap_begin()
> which takes the XFS_ILOCK_SHARED.
> 
> Since thread 1 is holding XFS_IOLOCK_EXCL, I presume thread 2 will
> block at this point until thread 1 is done.

No, because XFS_IOLOCK is not the same lock as XFS_ILOCK.

IOLOCK (inode->i_rwsem) and MMAPLOCK(mapping->invalidate_lock)
serialise user access to user data (i.e. page cache and direct IO).

ILOCK (xfs_inode->i_ilock) serialises access to internal XFS inode
metadata such as the extent list.

The lock ordering is IOLOCK -> MMAPLOCK -> folio lock -> ILOCK, as
described in fs/xfs/xfs_inode.c

In the case we are talking about here, operations such as fallocate
operate directly on the extent map protected by the ILOCK, so they
first have to serialise user access to the data (i.e. take the
IOLOCK, MMAPLOCK, run inode_dio_wait() to drain running IOs, run
break_layouts() to recall remote pNFS access delegations and
serialise DAX accesses, etc), then flush any remaining dirty cached
data (which may require allocation and hence taking the ILOCK) and
then (maybe) invalidate the cached data over the range that is about
to be operated on.

Only once we hold all these locks and have performed all these user
data operations whilst holding those locks can we then take the
ILOCK and directly manipulate the extent map knowing we have locked
out all avenues of user data modification whilst we modify the
extent map and change the user data contained in the file.

> At this point, the page
> is still not uptodate, so thread 3 will not map the page if it finds it
> in >map_pages.
> 
> Or have I misunderstood XFS inode locking?  Entirely possible, it
> seems quite complicated.

Yes, it is, but as every other filesystem has encountered the same
problems that XFS has been dealing with since day zero they've grown
exactly the same locking requirements. Some of these are VFS locks
(i_rwsem, invalidate_lock) or infrastructure (inode_dio_wait()), but
over the long term linux filesystems and the VFS have been trending
towards the original XFS locking model that it inherited from Irix
30 years ago, not the other way around.

> Nevertheless, it seems to me that if there's
> locking that's missing, there's ample opportunities for XFS to take those
> missing locks in the (slow) fault path, and not take them in the (fast)
> map_pages path.

If you go back to the series that introduced the
mapping->invalidate_lock and the XFS conversion to use it for the
MMAPLOCK in commit 2433480a7e1d ("xfs: Convert to use
invalidate_lock"), that's what pulled filemap_fault() out from under
the MMAPLOCK.

i.e. we used to run the entire fault path under the MMAPLOCK to try
to avoid issues with read faults instantiating new pages whilst we
were invalidating pages in the same mapping.  i.e. we used to
serialise both the read fault and write fault path entirely against
invalidation.  The conversion to the invalidate_lock drove the
locking further into the filemap_fault path, so we didn't need to
take it for read faults anymore. We still have to the take it for
write faults (i.e. ->page_mkwrite) because the page cache is already
populated and we still need to serialise ->page_mkwrite against
truncate, hole punch, etc.

Maybe we didn't need to lock the ->map_pages() path, but after
seeing data corruption issues caused by user directed speculative
page cache readahead via fadvise() and readahead() syscalls racing
with operations that need exclusive invalidation, I didn't think
that was a chance worth taking.

So, as I've already said, it's entirely possible that we don't need
the MMAPLOCK in this path anymore. All I want is a concrete
explanation of how the page fault and VFS paths now serialise
against invalidation to prevent these historic invalidation race
conditions from ever occurring again in the commit message.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate()
  2023-02-08 14:53 [PATCH 0/3] Prevent ->map_pages from sleeping Matthew Wilcox (Oracle)
  2023-02-08 14:53 ` [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper Matthew Wilcox (Oracle)
@ 2023-02-08 14:53 ` Matthew Wilcox (Oracle)
  2023-02-08 14:53 ` [PATCH 3/3] mm: Hold the RCU read lock over calls to ->map_pages Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-02-08 14:53 UTC (permalink / raw)
  To: linux-xfs, linux-afs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle)

For the map_pages() method, we need a test that does not sleep.  The page
fault handler will continue to call the fault() method where we can
sleep and do the full revalidation there.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/afs/file.c     | 14 ++------------
 fs/afs/inode.c    | 27 +++++++++++++++++++--------
 fs/afs/internal.h |  1 +
 3 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 68d6d5dc608d..719b31374879 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -569,20 +569,10 @@ static void afs_vm_close(struct vm_area_struct *vma)
 static vm_fault_t afs_vm_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
 	struct afs_vnode *vnode = AFS_FS_I(file_inode(vmf->vma->vm_file));
-	struct afs_file *af = vmf->vma->vm_file->private_data;
 
-	switch (afs_validate(vnode, af->key)) {
-	case 0:
+	if (afs_pagecache_valid(vnode))
 		return filemap_map_pages(vmf, start_pgoff, end_pgoff);
-	case -ENOMEM:
-		return VM_FAULT_OOM;
-	case -EINTR:
-	case -ERESTARTSYS:
-		return VM_FAULT_RETRY;
-	case -ESTALE:
-	default:
-		return VM_FAULT_SIGBUS;
-	}
+	return 0;
 }
 
 static ssize_t afs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 0167e96e5198..b1bdffd5e888 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -667,6 +667,24 @@ bool afs_check_validity(struct afs_vnode *vnode)
 	return false;
 }
 
+/*
+ * Returns true if the pagecache is still valid.  Does not sleep.
+ */
+bool afs_pagecache_valid(struct afs_vnode *vnode)
+{
+	if (unlikely(test_bit(AFS_VNODE_DELETED, &vnode->flags))) {
+		if (vnode->netfs.inode.i_nlink)
+			clear_nlink(&vnode->netfs.inode);
+		return true;
+	}
+
+	if (test_bit(AFS_VNODE_CB_PROMISED, &vnode->flags) &&
+	    afs_check_validity(vnode))
+		return true;
+
+	return false;
+}
+
 /*
  * validate a vnode/inode
  * - there are several things we need to check
@@ -684,14 +702,7 @@ int afs_validate(struct afs_vnode *vnode, struct key *key)
 	       vnode->fid.vid, vnode->fid.vnode, vnode->flags,
 	       key_serial(key));
 
-	if (unlikely(test_bit(AFS_VNODE_DELETED, &vnode->flags))) {
-		if (vnode->netfs.inode.i_nlink)
-			clear_nlink(&vnode->netfs.inode);
-		goto valid;
-	}
-
-	if (test_bit(AFS_VNODE_CB_PROMISED, &vnode->flags) &&
-	    afs_check_validity(vnode))
+	if (afs_pagecache_valid(vnode))
 		goto valid;
 
 	down_write(&vnode->validate_lock);
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index ad8523d0d038..5c95df6621f9 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -1171,6 +1171,7 @@ extern struct inode *afs_iget(struct afs_operation *, struct afs_vnode_param *);
 extern struct inode *afs_root_iget(struct super_block *, struct key *);
 extern bool afs_check_validity(struct afs_vnode *);
 extern int afs_validate(struct afs_vnode *, struct key *);
+bool afs_pagecache_valid(struct afs_vnode *);
 extern int afs_getattr(struct mnt_idmap *idmap, const struct path *,
 		       struct kstat *, u32, unsigned int);
 extern int afs_setattr(struct mnt_idmap *idmap, struct dentry *, struct iattr *);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] mm: Hold the RCU read lock over calls to ->map_pages
  2023-02-08 14:53 [PATCH 0/3] Prevent ->map_pages from sleeping Matthew Wilcox (Oracle)
  2023-02-08 14:53 ` [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper Matthew Wilcox (Oracle)
  2023-02-08 14:53 ` [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate() Matthew Wilcox (Oracle)
@ 2023-02-08 14:53 ` Matthew Wilcox (Oracle)
  2023-02-09 14:28 ` [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate() David Howells
  2023-02-09 14:49 ` David Howells
  4 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-02-08 14:53 UTC (permalink / raw)
  To: linux-xfs, linux-afs, linux-fsdevel, linux-mm; +Cc: Matthew Wilcox (Oracle)

Prevent filesystems from doing things which sleep in their map_pages
method.  This is in preparation for a pagefault path protected only
by RCU.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst | 4 ++--
 mm/memory.c                           | 7 ++++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 922886fefb7f..8a80390446ba 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -645,7 +645,7 @@ ops		mmap_lock	PageLocked(page)
 open:		yes
 close:		yes
 fault:		yes		can return with page locked
-map_pages:	yes
+map_pages:	read
 page_mkwrite:	yes		can return with page locked
 pfn_mkwrite:	yes
 access:		yes
@@ -661,7 +661,7 @@ locked. The VM will unlock the page.
 
 ->map_pages() is called when VM asks to map easy accessible pages.
 Filesystem should find and map pages associated with offsets from "start_pgoff"
-till "end_pgoff". ->map_pages() is called with page table locked and must
+till "end_pgoff". ->map_pages() is called with the RCU lock held and must
 not block.  If it's not possible to reach a page without blocking,
 filesystem should skip it. Filesystem should use set_pte_range() to setup
 page table entry. Pointer to entry associated with the page is passed in
diff --git a/mm/memory.c b/mm/memory.c
index f1cf47b7e3bb..77577a1d09ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4442,6 +4442,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	unsigned long address = vmf->address, nr_pages, mask;
 	pgoff_t start_pgoff = vmf->pgoff;
 	pgoff_t end_pgoff;
+	vm_fault_t ret;
 	int off;
 
 	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
@@ -4467,7 +4468,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 			return VM_FAULT_OOM;
 	}
 
-	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
+	rcu_read_lock();
+	ret = vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 /* Return true if we should do read fault-around, false otherwise */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate()
  2023-02-08 14:53 [PATCH 0/3] Prevent ->map_pages from sleeping Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2023-02-08 14:53 ` [PATCH 3/3] mm: Hold the RCU read lock over calls to ->map_pages Matthew Wilcox (Oracle)
@ 2023-02-09 14:28 ` David Howells
  2023-02-09 14:56   ` Matthew Wilcox
  2023-02-09 14:49 ` David Howells
  4 siblings, 1 reply; 14+ messages in thread
From: David Howells @ 2023-02-09 14:28 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: dhowells, linux-xfs, linux-afs, linux-fsdevel, linux-mm

Matthew Wilcox (Oracle) <willy@infradead.org> wrote:

>  extern int afs_getattr(struct mnt_idmap *idmap, const struct path *,
>  		       struct kstat *, u32, unsigned int);
>  extern int afs_setattr(struct mnt_idmap *idmap, struct dentry *, struct iattr *);

This doesn't apply to linus/master.  I'm guessing it's based on something
else?

David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate()
  2023-02-09 14:28 ` [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate() David Howells
@ 2023-02-09 14:56   ` Matthew Wilcox
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2023-02-09 14:56 UTC (permalink / raw)
  To: David Howells; +Cc: linux-xfs, linux-afs, linux-fsdevel, linux-mm

On Thu, Feb 09, 2023 at 02:28:25PM +0000, David Howells wrote:
> Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> 
> >  extern int afs_getattr(struct mnt_idmap *idmap, const struct path *,
> >  		       struct kstat *, u32, unsigned int);
> >  extern int afs_setattr(struct mnt_idmap *idmap, struct dentry *, struct iattr *);
> 
> This doesn't apply to linus/master.  I'm guessing it's based on something
> else?

next-20230203.  I'm guessing it's "fs: port xattr to mnt_idmap" that's
causing merge conflicts for you?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate()
  2023-02-08 14:53 [PATCH 0/3] Prevent ->map_pages from sleeping Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2023-02-09 14:28 ` [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate() David Howells
@ 2023-02-09 14:49 ` David Howells
  4 siblings, 0 replies; 14+ messages in thread
From: David Howells @ 2023-02-09 14:49 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: dhowells, linux-xfs, linux-afs, linux-fsdevel, linux-mm

Matthew Wilcox (Oracle) <willy@infradead.org> wrote:

> For the map_pages() method, we need a test that does not sleep.  The page
> fault handler will continue to call the fault() method where we can
> sleep and do the full revalidation there.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-02-10  0:00 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-08 14:53 [PATCH 0/3] Prevent ->map_pages from sleeping Matthew Wilcox (Oracle)
2023-02-08 14:53 ` [PATCH 1/3] xfs: Remove xfs_filemap_map_pages() wrapper Matthew Wilcox (Oracle)
2023-02-08 16:39   ` Darrick J. Wong
2023-02-08 17:12     ` Matthew Wilcox
2023-02-08 21:53       ` Dave Chinner
2023-02-09  2:44         ` Matthew Wilcox
2023-02-09 21:53           ` Dave Chinner
2023-02-09 22:34             ` Matthew Wilcox
2023-02-09 23:59               ` Dave Chinner
2023-02-08 14:53 ` [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate() Matthew Wilcox (Oracle)
2023-02-08 14:53 ` [PATCH 3/3] mm: Hold the RCU read lock over calls to ->map_pages Matthew Wilcox (Oracle)
2023-02-09 14:28 ` [PATCH 2/3] afs: Split afs_pagecache_valid() out of afs_validate() David Howells
2023-02-09 14:56   ` Matthew Wilcox
2023-02-09 14:49 ` David Howells

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.