linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
@ 2020-06-04 20:23 Matthew Wilcox
  2020-06-04 22:57 ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-06-04 20:23 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel
  Cc: Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Test generic/019 often results in:

WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70

Since this can happen due to a storage error, we should not WARN for it.
Just return -EIO, which will be converted to a SIGBUS for the hapless
task attempting to write to the page that we can't read.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 89e21961d1ad..ae6c5e38f0e8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1066,7 +1066,8 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 			return ret;
 		block_commit_write(page, 0, length);
 	} else {
-		WARN_ON_ONCE(!PageUptodate(page));
+		if (!PageUptodate(page))
+			return -EIO;
 		iomap_page_create(inode, page);
 		set_page_dirty(page);
 	}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-04 20:23 [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite Matthew Wilcox
@ 2020-06-04 22:57 ` Dave Chinner
  2020-06-04 23:05   ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-06-04 22:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Test generic/019 often results in:
> 
> WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70
> 
> Since this can happen due to a storage error, we should not WARN for it.
> Just return -EIO, which will be converted to a SIGBUS for the hapless
> task attempting to write to the page that we can't read.

Why didn't the "read" part of the fault which had the EIO error fail
the page fault? i.e. why are we waiting until deep inside the write
fault path to error out on a failed page read?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-04 22:57 ` Dave Chinner
@ 2020-06-04 23:05   ` Matthew Wilcox
  2020-06-04 23:30     ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-06-04 23:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote:
> On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Test generic/019 often results in:
> > 
> > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70
> > 
> > Since this can happen due to a storage error, we should not WARN for it.
> > Just return -EIO, which will be converted to a SIGBUS for the hapless
> > task attempting to write to the page that we can't read.
> 
> Why didn't the "read" part of the fault which had the EIO error fail
> the page fault? i.e. why are we waiting until deep inside the write
> fault path to error out on a failed page read?

I have a hypothesis that I don't know how to verify.

First the task does a load from the page and we put a read-only PTE in
the page tables.  Then it writes to the page using write().  The page
gets written back, but hits an error in iomap_writepage_map()
which calls ClearPageUptodate().  Then the task with it mapped attempts
to store to it.

I haven't dug through what generic/019 does, so I don't know how plausible
this is.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-04 23:05   ` Matthew Wilcox
@ 2020-06-04 23:30     ` Dave Chinner
  2020-06-04 23:50       ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-06-04 23:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Thu, Jun 04, 2020 at 04:05:19PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote:
> > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > Test generic/019 often results in:
> > > 
> > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70
> > > 
> > > Since this can happen due to a storage error, we should not WARN for it.
> > > Just return -EIO, which will be converted to a SIGBUS for the hapless
> > > task attempting to write to the page that we can't read.
> > 
> > Why didn't the "read" part of the fault which had the EIO error fail
> > the page fault? i.e. why are we waiting until deep inside the write
> > fault path to error out on a failed page read?
> 
> I have a hypothesis that I don't know how to verify.
> 
> First the task does a load from the page and we put a read-only PTE in
> the page tables.  Then it writes to the page using write().  The page
> gets written back, but hits an error in iomap_writepage_map()
> which calls ClearPageUptodate().  Then the task with it mapped attempts
> to store to it.

Sure, but that's not really what I was asking: why isn't this
!uptodate state caught before the page fault code calls
->page_mkwrite? The page fault code has a reference to the page,
after all, and in a couple of paths it even has the page locked.

What I'm trying to understand is why this needs to be fixed inside
->page_mkwrite, because AFAICT if we have to fix this in the iomap
code, we have the same "we got handed a bad page by the page fault"
problem in every single ->page_mkwrite implementation....

> I haven't dug through what generic/019 does, so I don't know how plausible
> this is.

# Run fsstress and fio(dio/aio and mmap) and simulate disk failure
# check filesystem consistency at the end.

I don't think it is mixing buffered writes and mmap writes on the
same file via fio. Maybe fsstress is triggering that, but the
fsstress workload is single threaded so, again, I'm not sure it will
do this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-04 23:30     ` Dave Chinner
@ 2020-06-04 23:50       ` Matthew Wilcox
  2020-06-05  0:31         ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-06-04 23:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Fri, Jun 05, 2020 at 09:30:53AM +1000, Dave Chinner wrote:
> On Thu, Jun 04, 2020 at 04:05:19PM -0700, Matthew Wilcox wrote:
> > On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote:
> > > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > > 
> > > > Test generic/019 often results in:
> > > > 
> > > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70
> > > > 
> > > > Since this can happen due to a storage error, we should not WARN for it.
> > > > Just return -EIO, which will be converted to a SIGBUS for the hapless
> > > > task attempting to write to the page that we can't read.
> > > 
> > > Why didn't the "read" part of the fault which had the EIO error fail
> > > the page fault? i.e. why are we waiting until deep inside the write
> > > fault path to error out on a failed page read?
> > 
> > I have a hypothesis that I don't know how to verify.
> > 
> > First the task does a load from the page and we put a read-only PTE in
> > the page tables.  Then it writes to the page using write().  The page
> > gets written back, but hits an error in iomap_writepage_map()
> > which calls ClearPageUptodate().  Then the task with it mapped attempts
> > to store to it.
> 
> Sure, but that's not really what I was asking: why isn't this
> !uptodate state caught before the page fault code calls
> ->page_mkwrite? The page fault code has a reference to the page,
> after all, and in a couple of paths it even has the page locked.

If there's already a PTE present, then the page fault code doesn't
check the uptodate bit.  Here's the path I'm looking at:

do_wp_page()
 -> vm_normal_page()
 -> wp_page_shared()
     -> do_page_mkwrite()

I don't see anything in there that checked Uptodate.

> What I'm trying to understand is why this needs to be fixed inside
> ->page_mkwrite, because AFAICT if we have to fix this in the iomap
> code, we have the same "we got handed a bad page by the page fault"
> problem in every single ->page_mkwrite implementation....

I think the iomap code is the only filesystem which clears PageUptodate
on errors.  I know we've argued about whether that's appropriate or not
in the past.

> > I haven't dug through what generic/019 does, so I don't know how plausible
> > this is.
> 
> # Run fsstress and fio(dio/aio and mmap) and simulate disk failure
> # check filesystem consistency at the end.
> 
> I don't think it is mixing buffered writes and mmap writes on the
> same file via fio. Maybe fsstress is triggering that, but the
> fsstress workload is single threaded so, again, I'm not sure it will
> do this.

Maybe that's not how we end up with a read-only PTE in the process's
page tables.  Perhaps it starts out with a store, then on an fsync()
we mark it read-only, then try to do another store.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-04 23:50       ` Matthew Wilcox
@ 2020-06-05  0:31         ` Dave Chinner
  2020-06-05  2:24           ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-06-05  0:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 05, 2020 at 09:30:53AM +1000, Dave Chinner wrote:
> > On Thu, Jun 04, 2020 at 04:05:19PM -0700, Matthew Wilcox wrote:
> > > On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote:
> > > > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote:
> > > > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > > > 
> > > > > Test generic/019 often results in:
> > > > > 
> > > > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70
> > > > > 
> > > > > Since this can happen due to a storage error, we should not WARN for it.
> > > > > Just return -EIO, which will be converted to a SIGBUS for the hapless
> > > > > task attempting to write to the page that we can't read.
> > > > 
> > > > Why didn't the "read" part of the fault which had the EIO error fail
> > > > the page fault? i.e. why are we waiting until deep inside the write
> > > > fault path to error out on a failed page read?
> > > 
> > > I have a hypothesis that I don't know how to verify.
> > > 
> > > First the task does a load from the page and we put a read-only PTE in
> > > the page tables.  Then it writes to the page using write().  The page
> > > gets written back, but hits an error in iomap_writepage_map()
> > > which calls ClearPageUptodate().  Then the task with it mapped attempts
> > > to store to it.
> > 
> > Sure, but that's not really what I was asking: why isn't this
> > !uptodate state caught before the page fault code calls
> > ->page_mkwrite? The page fault code has a reference to the page,
> > after all, and in a couple of paths it even has the page locked.
> 
> If there's already a PTE present, then the page fault code doesn't
> check the uptodate bit.  Here's the path I'm looking at:
> 
> do_wp_page()
>  -> vm_normal_page()
>  -> wp_page_shared()
>      -> do_page_mkwrite()
> 
> I don't see anything in there that checked Uptodate.

Yup, exactly the code I was looking at when I asked this question.
The kernel has invalidated the contents of a page, yet we still have
it mapped into userspace as containing valid contents, and we don't
check it at all when userspace generates a protection fault on the
page?

> > What I'm trying to understand is why this needs to be fixed inside
> > ->page_mkwrite, because AFAICT if we have to fix this in the iomap
> > code, we have the same "we got handed a bad page by the page fault"
> > problem in every single ->page_mkwrite implementation....
> 
> I think the iomap code is the only filesystem which clears PageUptodate
> on errors. 

I don't think you looked very hard. A quick scan shows at least
btrfs, f2fs, hostfs, jffs2, reiserfs, vboxfs and anything using the
iomap path will call ClearPageUptodate() on a write IO error. Some
of them also call SetPageError(), too. The above fault path doesn't
check for that, either. AFAICT, It doesn't check for any adverse
page state at all that has occurred since the last time the PTE
changed state, and it doesn't check the mapping for error state,
either. It just hands ->page_mkwrite() a completely unchecked page.

Which again, is kinda what I'm getting at here: why should all
the ->page_mkwrite implementations be forced to check that the page
handed to it is in a valid state?

> I know we've argued about whether that's appropriate or not
> in the past.

Sure, but that's largely irrelevant here. The fact is that we are
taking a fault on a user page marked by the kernel with invalid
contents, and it's not being caught by the common page fault code
path. Ignore where the page contents are marked invalid, and instead
ask "what should a write fault do on a page with invalid contents?".

e.g. Maybe it actually read the contents of the page from disk
again, rather than have page_mkwrite fail and end up killing the
application?

Really, all I want is for all filesystems to have the same behaviour
in response to the same page state error conditions. Requiring every
filesystem to detect and handle the error deep in it's IO path is
not going to result in consistent behaviour....

> > > I haven't dug through what generic/019 does, so I don't know how plausible
> > > this is.
> > 
> > # Run fsstress and fio(dio/aio and mmap) and simulate disk failure
> > # check filesystem consistency at the end.
> > 
> > I don't think it is mixing buffered writes and mmap writes on the
> > same file via fio. Maybe fsstress is triggering that, but the
> > fsstress workload is single threaded so, again, I'm not sure it will
> > do this.
> 
> Maybe that's not how we end up with a read-only PTE in the process's
> page tables.  Perhaps it starts out with a store, then on an fsync()
> we mark it read-only, then try to do another store.

True. Normal background writeback will mark pages clean, so it
probably doesn't need fsync at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-05  0:31         ` Dave Chinner
@ 2020-06-05  2:24           ` Matthew Wilcox
  2020-06-05  3:07             ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-06-05  2:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote:
> On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote:
> > > Sure, but that's not really what I was asking: why isn't this
> > > !uptodate state caught before the page fault code calls
> > > ->page_mkwrite? The page fault code has a reference to the page,
> > > after all, and in a couple of paths it even has the page locked.
> > 
> > If there's already a PTE present, then the page fault code doesn't
> > check the uptodate bit.  Here's the path I'm looking at:
> > 
> > do_wp_page()
> >  -> vm_normal_page()
> >  -> wp_page_shared()
> >      -> do_page_mkwrite()
> > 
> > I don't see anything in there that checked Uptodate.
> 
> Yup, exactly the code I was looking at when I asked this question.
> The kernel has invalidated the contents of a page, yet we still have
> it mapped into userspace as containing valid contents, and we don't
> check it at all when userspace generates a protection fault on the
> page?

Right.  The iomap error path only clears PageUptodate.  It doesn't go
to the effort of unmapping the page from userspace, so userspace has a
read-only view of a !Uptodate page.

> > I think the iomap code is the only filesystem which clears PageUptodate
> > on errors. 
> 
> I don't think you looked very hard. A quick scan shows at least
> btrfs, f2fs, hostfs, jffs2, reiserfs, vboxfs and anything using the
> iomap path will call ClearPageUptodate() on a write IO error.

I'll give you btrfs and jffs2, but I don't think it's true for f2fs.
The only other filesystem using the iomap bufferd IO paths today
is zonefs, afaik.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-05  2:24           ` Matthew Wilcox
@ 2020-06-05  3:07             ` Dave Chinner
  2020-06-05 12:48               ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-06-05  3:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote:
> > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote:
> > > > Sure, but that's not really what I was asking: why isn't this
> > > > !uptodate state caught before the page fault code calls
> > > > ->page_mkwrite? The page fault code has a reference to the page,
> > > > after all, and in a couple of paths it even has the page locked.
> > > 
> > > If there's already a PTE present, then the page fault code doesn't
> > > check the uptodate bit.  Here's the path I'm looking at:
> > > 
> > > do_wp_page()
> > >  -> vm_normal_page()
> > >  -> wp_page_shared()
> > >      -> do_page_mkwrite()
> > > 
> > > I don't see anything in there that checked Uptodate.
> > 
> > Yup, exactly the code I was looking at when I asked this question.
> > The kernel has invalidated the contents of a page, yet we still have
> > it mapped into userspace as containing valid contents, and we don't
> > check it at all when userspace generates a protection fault on the
> > page?
> 
> Right.  The iomap error path only clears PageUptodate.  It doesn't go
> to the effort of unmapping the page from userspace, so userspace has a
> read-only view of a !Uptodate page.

Hmmm - did you miss the ->discard_page() callout just before we call
ClearPageUptodate() on error in iomap_writepage_map()? That results
in XFS calling iomap_invalidatepage() on the page, which ....

/me sighs as he realises that ->invalidatepage doesn't actually
invalidate page mappings but only clears the page dirty state and
releases filesystem references to the page.

Yay. We leave -invalidated page cache pages- mapped into userspace,
and page faults on those pages don't catch access to invalidated
pages.

Geez, we really suck at this whole software thing, don't we?

It's not clear to me that we can actually unmap those pages safely
in a race free manner from this code - can we actually do that from
the page writeback path?

> > > I think the iomap code is the only filesystem which clears PageUptodate
> > > on errors. 
> > 
> > I don't think you looked very hard. A quick scan shows at least
> > btrfs, f2fs, hostfs, jffs2, reiserfs, vboxfs and anything using the
> > iomap path will call ClearPageUptodate() on a write IO error.
> 
> I'll give you btrfs and jffs2, but I don't think it's true for f2fs.
> The only other filesystem using the iomap bufferd IO paths today
> is zonefs, afaik.

gfs2 as well.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-05  3:07             ` Dave Chinner
@ 2020-06-05 12:48               ` Matthew Wilcox
  2020-06-05 16:18                 ` Darrick J. Wong
  2020-06-05 21:48                 ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-06-05 12:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Fri, Jun 05, 2020 at 01:07:58PM +1000, Dave Chinner wrote:
> On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote:
> > On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote:
> > > > > Sure, but that's not really what I was asking: why isn't this
> > > > > !uptodate state caught before the page fault code calls
> > > > > ->page_mkwrite? The page fault code has a reference to the page,
> > > > > after all, and in a couple of paths it even has the page locked.
> > > > 
> > > > If there's already a PTE present, then the page fault code doesn't
> > > > check the uptodate bit.  Here's the path I'm looking at:
> > > > 
> > > > do_wp_page()
> > > >  -> vm_normal_page()
> > > >  -> wp_page_shared()
> > > >      -> do_page_mkwrite()
> > > > 
> > > > I don't see anything in there that checked Uptodate.
> > > 
> > > Yup, exactly the code I was looking at when I asked this question.
> > > The kernel has invalidated the contents of a page, yet we still have
> > > it mapped into userspace as containing valid contents, and we don't
> > > check it at all when userspace generates a protection fault on the
> > > page?
> > 
> > Right.  The iomap error path only clears PageUptodate.  It doesn't go
> > to the effort of unmapping the page from userspace, so userspace has a
> > read-only view of a !Uptodate page.
> 
> Hmmm - did you miss the ->discard_page() callout just before we call
> ClearPageUptodate() on error in iomap_writepage_map()? That results
> in XFS calling iomap_invalidatepage() on the page, which ....

... I don't think that's the interesting path.  I mean, that's
the submission path, and usually we discover errors in the completion
path, not the submission path.

> /me sighs as he realises that ->invalidatepage doesn't actually
> invalidate page mappings but only clears the page dirty state and
> releases filesystem references to the page.
> 
> Yay. We leave -invalidated page cache pages- mapped into userspace,
> and page faults on those pages don't catch access to invalidated
> pages.

More than that ... by clearing Uptodate, you're trying to prevent
future reads to this page from succeeding without verifying the data
is still on storage, but a task that had it mapped before can still
load the data that was written but never made it to storage.
So at some point it'll teleport backwards when another task has a
successful read().  Funfunfun.

> Geez, we really suck at this whole software thing, don't we?

Certainly at handling errors ...

> It's not clear to me that we can actually unmap those pages safely
> in a race free manner from this code - can we actually do that from
> the page writeback path?

I don't see why it can't be done from the submission path.
unmap_mapping_range() calls i_mmap_lock_write(), which is
down_write(i_mmap_rwsem) in drag.  There might be a lock ordering
issue there, although lockdep should find it pretty quickly.

The bigger problem is the completion path.  We're in softirq context,
so that will have to punt to a thread that can take mutexes.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-05 12:48               ` Matthew Wilcox
@ 2020-06-05 16:18                 ` Darrick J. Wong
  2020-06-05 21:48                 ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2020-06-05 16:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel, linux-kernel

On Fri, Jun 05, 2020 at 05:48:26AM -0700, Matthew Wilcox wrote:
> On Fri, Jun 05, 2020 at 01:07:58PM +1000, Dave Chinner wrote:
> > On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote:
> > > On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote:
> > > > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote:
> > > > > > Sure, but that's not really what I was asking: why isn't this
> > > > > > !uptodate state caught before the page fault code calls
> > > > > > ->page_mkwrite? The page fault code has a reference to the page,
> > > > > > after all, and in a couple of paths it even has the page locked.
> > > > > 
> > > > > If there's already a PTE present, then the page fault code doesn't
> > > > > check the uptodate bit.  Here's the path I'm looking at:
> > > > > 
> > > > > do_wp_page()
> > > > >  -> vm_normal_page()
> > > > >  -> wp_page_shared()
> > > > >      -> do_page_mkwrite()
> > > > > 
> > > > > I don't see anything in there that checked Uptodate.
> > > > 
> > > > Yup, exactly the code I was looking at when I asked this question.
> > > > The kernel has invalidated the contents of a page, yet we still have
> > > > it mapped into userspace as containing valid contents, and we don't
> > > > check it at all when userspace generates a protection fault on the
> > > > page?
> > > 
> > > Right.  The iomap error path only clears PageUptodate.  It doesn't go
> > > to the effort of unmapping the page from userspace, so userspace has a
> > > read-only view of a !Uptodate page.
> > 
> > Hmmm - did you miss the ->discard_page() callout just before we call
> > ClearPageUptodate() on error in iomap_writepage_map()? That results
> > in XFS calling iomap_invalidatepage() on the page, which ....
> 
> ... I don't think that's the interesting path.  I mean, that's
> the submission path, and usually we discover errors in the completion
> path, not the submission path.
> 
> > /me sighs as he realises that ->invalidatepage doesn't actually
> > invalidate page mappings but only clears the page dirty state and
> > releases filesystem references to the page.

<nod> Yes, we have preserved the old feebleness.

I've long felt that we should leave the page dirty and retry the write,
but that was objectionable because we could run out of memory and
reclaim will stall and OOM on pages it can't clean if IO is still
broken.

I can't remember the exact reasons for leaving a /clean/ page in memory,
but I think it had to do with preserving mmap'd page contents long
enough that a program could redirty the mapping <bluh bluh race
conditions><this is glitchy><blarghallthisisstupid>.

> > Yay. We leave -invalidated page cache pages- mapped into userspace,
> > and page faults on those pages don't catch access to invalidated
> > pages.
> 
> More than that ... by clearing Uptodate, you're trying to prevent
> future reads to this page from succeeding without verifying the data
> is still on storage, but a task that had it mapped before can still
> load the data that was written but never made it to storage.
> So at some point it'll teleport backwards when another task has a
> successful read().  Funfunfun.

Let's just invalidate the mapping and see if anyone complains. :D

> > Geez, we really suck at this whole software thing, don't we?
> 
> Certainly at handling errors ...
> 
> > It's not clear to me that we can actually unmap those pages safely
> > in a race free manner from this code - can we actually do that from
> > the page writeback path?
> 
> I don't see why it can't be done from the submission path.
> unmap_mapping_range() calls i_mmap_lock_write(), which is
> down_write(i_mmap_rwsem) in drag.  There might be a lock ordering
> issue there, although lockdep should find it pretty quickly.
> 
> The bigger problem is the completion path.  We're in softirq context,
> so that will have to punt to a thread that can take mutexes.

<nod> It's more workqueue punting, but I guess at least errors ought to
be infrequent.

--D

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-05 12:48               ` Matthew Wilcox
  2020-06-05 16:18                 ` Darrick J. Wong
@ 2020-06-05 21:48                 ` Dave Chinner
  2020-06-05 22:49                   ` Matthew Wilcox
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-06-05 21:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Fri, Jun 05, 2020 at 05:48:26AM -0700, Matthew Wilcox wrote:
> On Fri, Jun 05, 2020 at 01:07:58PM +1000, Dave Chinner wrote:
> > On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote:
> > > On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote:
> > > > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote:
> > > > > > Sure, but that's not really what I was asking: why isn't this
> > > > > > !uptodate state caught before the page fault code calls
> > > > > > ->page_mkwrite? The page fault code has a reference to the page,
> > > > > > after all, and in a couple of paths it even has the page locked.
> > > > > 
> > > > > If there's already a PTE present, then the page fault code doesn't
> > > > > check the uptodate bit.  Here's the path I'm looking at:
> > > > > 
> > > > > do_wp_page()
> > > > >  -> vm_normal_page()
> > > > >  -> wp_page_shared()
> > > > >      -> do_page_mkwrite()
> > > > > 
> > > > > I don't see anything in there that checked Uptodate.
> > > > 
> > > > Yup, exactly the code I was looking at when I asked this question.
> > > > The kernel has invalidated the contents of a page, yet we still have
> > > > it mapped into userspace as containing valid contents, and we don't
> > > > check it at all when userspace generates a protection fault on the
> > > > page?
> > > 
> > > Right.  The iomap error path only clears PageUptodate.  It doesn't go
> > > to the effort of unmapping the page from userspace, so userspace has a
> > > read-only view of a !Uptodate page.
> > 
> > Hmmm - did you miss the ->discard_page() callout just before we call
> > ClearPageUptodate() on error in iomap_writepage_map()? That results
> > in XFS calling iomap_invalidatepage() on the page, which ....
> 
> ... I don't think that's the interesting path.  I mean, that's
> the submission path, and usually we discover errors in the completion
> path, not the submission path.

Where in the iomap write IO completion path do we call
ClearPageUptodate()?

I mean, it ends up in iomap_finish_page_writeback(), which does:

static void
iomap_finish_page_writeback(struct inode *inode, struct page *page,
                int error)
{
        struct iomap_page *iop = to_iomap_page(page);

        if (error) {
                SetPageError(page);
                mapping_set_error(inode->i_mapping, -EIO);
        }

        WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
        WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);

        if (!iop || atomic_dec_and_test(&iop->write_count))
                end_page_writeback(page);
}

I mean, we call SetPageError() and tag the mapping, but we most
certainly don't clear the PageUptodate state here.

So AFAICT, the -only- places that iomap clears the uptodate state on
a page is on -read- errors and on write submission failures.

If it's a read error, the page fault should already be failing. If
it's on submission, we invalidate it as we currently do and punch
out the user mappings, and then when userspace refaults it can be
killed by a read IO failure.

But I just don't see how this problem results from errors reported
at IO completion.

This comes back to my original, underlying worry about the fragility
of the page fault path: the page fault path is not even checking for
PageError during faults, and I'm betting that almost no
->page_mkwrite implementation is checking it, either....

> > It's not clear to me that we can actually unmap those pages safely
> > in a race free manner from this code - can we actually do that from
> > the page writeback path?
> 
> I don't see why it can't be done from the submission path.
> unmap_mapping_range() calls i_mmap_lock_write(), which is
> down_write(i_mmap_rwsem) in drag.  There might be a lock ordering
> issue there, although lockdep should find it pretty quickly.
> 
> The bigger problem is the completion path.  We're in softirq context,
> so that will have to punt to a thread that can take mutexes.

Punt to workqueue if we aren't already in a workqueue context -
for a lot of writes on XFS we already will be running completion in
a workqueue context....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite
  2020-06-05 21:48                 ` Dave Chinner
@ 2020-06-05 22:49                   ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-06-05 22:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-kernel

On Sat, Jun 06, 2020 at 07:48:41AM +1000, Dave Chinner wrote:
> On Fri, Jun 05, 2020 at 05:48:26AM -0700, Matthew Wilcox wrote:
> > ... I don't think that's the interesting path.  I mean, that's
> > the submission path, and usually we discover errors in the completion
> > path, not the submission path.
> 
> Where in the iomap write IO completion path do we call
> ClearPageUptodate()?

Oh, I misread.  You're right, I was looking at the read completion path.

So, this is also inconsistent.  We clear PageUptodate on errors we
discover during submission, but not for errors we discover during
completion.  That doesn't make sense.

> This comes back to my original, underlying worry about the fragility
> of the page fault path: the page fault path is not even checking for
> PageError during faults, and I'm betting that almost no
> ->page_mkwrite implementation is checking it, either....

I think it's a reasonable assumption that user page tables should never
contain a PTE for a page which is !Uptodate.  Otherwise the user can
read stale data.

> > I don't see why it can't be done from the submission path.
> > unmap_mapping_range() calls i_mmap_lock_write(), which is
> > down_write(i_mmap_rwsem) in drag.  There might be a lock ordering
> > issue there, although lockdep should find it pretty quickly.
> > 
> > The bigger problem is the completion path.  We're in softirq context,
> > so that will have to punt to a thread that can take mutexes.
> 
> Punt to workqueue if we aren't already in a workqueue context -
> for a lot of writes on XFS we already will be running completion in
> a workqueue context....

Yep.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-06-05 22:49 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-04 20:23 [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite Matthew Wilcox
2020-06-04 22:57 ` Dave Chinner
2020-06-04 23:05   ` Matthew Wilcox
2020-06-04 23:30     ` Dave Chinner
2020-06-04 23:50       ` Matthew Wilcox
2020-06-05  0:31         ` Dave Chinner
2020-06-05  2:24           ` Matthew Wilcox
2020-06-05  3:07             ` Dave Chinner
2020-06-05 12:48               ` Matthew Wilcox
2020-06-05 16:18                 ` Darrick J. Wong
2020-06-05 21:48                 ` Dave Chinner
2020-06-05 22:49                   ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).