* [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite @ 2020-06-04 20:23 Matthew Wilcox 2020-06-04 22:57 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Matthew Wilcox @ 2020-06-04 20:23 UTC (permalink / raw) To: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel Cc: Matthew Wilcox (Oracle) From: "Matthew Wilcox (Oracle)" <willy@infradead.org> Test generic/019 often results in: WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70 Since this can happen due to a storage error, we should not WARN for it. Just return -EIO, which will be converted to a SIGBUS for the hapless task attempting to write to the page that we can't read. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> --- fs/iomap/buffered-io.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 89e21961d1ad..ae6c5e38f0e8 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1066,7 +1066,8 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length, return ret; block_commit_write(page, 0, length); } else { - WARN_ON_ONCE(!PageUptodate(page)); + if (!PageUptodate(page)) + return -EIO; iomap_page_create(inode, page); set_page_dirty(page); } -- 2.26.2 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-04 20:23 [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite Matthew Wilcox @ 2020-06-04 22:57 ` Dave Chinner 2020-06-04 23:05 ` Matthew Wilcox 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-06-04 22:57 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > Test generic/019 often results in: > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70 > > Since this can happen due to a storage error, we should not WARN for it. > Just return -EIO, which will be converted to a SIGBUS for the hapless > task attempting to write to the page that we can't read. Why didn't the "read" part of the fault which had the EIO error fail the page fault? i.e. why are we waiting until deep inside the write fault path to error out on a failed page read? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-04 22:57 ` Dave Chinner @ 2020-06-04 23:05 ` Matthew Wilcox 2020-06-04 23:30 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Matthew Wilcox @ 2020-06-04 23:05 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote: > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote: > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > > > Test generic/019 often results in: > > > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70 > > > > Since this can happen due to a storage error, we should not WARN for it. > > Just return -EIO, which will be converted to a SIGBUS for the hapless > > task attempting to write to the page that we can't read. > > Why didn't the "read" part of the fault which had the EIO error fail > the page fault? i.e. why are we waiting until deep inside the write > fault path to error out on a failed page read? I have a hypothesis that I don't know how to verify. First the task does a load from the page and we put a read-only PTE in the page tables. Then it writes to the page using write(). The page gets written back, but hits an error in iomap_writepage_map() which calls ClearPageUptodate(). Then the task with it mapped attempts to store to it. I haven't dug through what generic/019 does, so I don't know how plausible this is. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-04 23:05 ` Matthew Wilcox @ 2020-06-04 23:30 ` Dave Chinner 2020-06-04 23:50 ` Matthew Wilcox 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-06-04 23:30 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Thu, Jun 04, 2020 at 04:05:19PM -0700, Matthew Wilcox wrote: > On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote: > > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > > > > > Test generic/019 often results in: > > > > > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70 > > > > > > Since this can happen due to a storage error, we should not WARN for it. > > > Just return -EIO, which will be converted to a SIGBUS for the hapless > > > task attempting to write to the page that we can't read. > > > > Why didn't the "read" part of the fault which had the EIO error fail > > the page fault? i.e. why are we waiting until deep inside the write > > fault path to error out on a failed page read? > > I have a hypothesis that I don't know how to verify. > > First the task does a load from the page and we put a read-only PTE in > the page tables. Then it writes to the page using write(). The page > gets written back, but hits an error in iomap_writepage_map() > which calls ClearPageUptodate(). Then the task with it mapped attempts > to store to it. Sure, but that's not really what I was asking: why isn't this !uptodate state caught before the page fault code calls ->page_mkwrite? The page fault code has a reference to the page, after all, and in a couple of paths it even has the page locked. What I'm trying to understand is why this needs to be fixed inside ->page_mkwrite, because AFAICT if we have to fix this in the iomap code, we have the same "we got handed a bad page by the page fault" problem in every single ->page_mkwrite implementation.... > I haven't dug through what generic/019 does, so I don't know how plausible > this is. # Run fsstress and fio(dio/aio and mmap) and simulate disk failure # check filesystem consistency at the end. I don't think it is mixing buffered writes and mmap writes on the same file via fio. Maybe fsstress is triggering that, but the fsstress workload is single threaded so, again, I'm not sure it will do this. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-04 23:30 ` Dave Chinner @ 2020-06-04 23:50 ` Matthew Wilcox 2020-06-05 0:31 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Matthew Wilcox @ 2020-06-04 23:50 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Fri, Jun 05, 2020 at 09:30:53AM +1000, Dave Chinner wrote: > On Thu, Jun 04, 2020 at 04:05:19PM -0700, Matthew Wilcox wrote: > > On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote: > > > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote: > > > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > > > > > > > Test generic/019 often results in: > > > > > > > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70 > > > > > > > > Since this can happen due to a storage error, we should not WARN for it. > > > > Just return -EIO, which will be converted to a SIGBUS for the hapless > > > > task attempting to write to the page that we can't read. > > > > > > Why didn't the "read" part of the fault which had the EIO error fail > > > the page fault? i.e. why are we waiting until deep inside the write > > > fault path to error out on a failed page read? > > > > I have a hypothesis that I don't know how to verify. > > > > First the task does a load from the page and we put a read-only PTE in > > the page tables. Then it writes to the page using write(). The page > > gets written back, but hits an error in iomap_writepage_map() > > which calls ClearPageUptodate(). Then the task with it mapped attempts > > to store to it. > > Sure, but that's not really what I was asking: why isn't this > !uptodate state caught before the page fault code calls > ->page_mkwrite? The page fault code has a reference to the page, > after all, and in a couple of paths it even has the page locked. If there's already a PTE present, then the page fault code doesn't check the uptodate bit. Here's the path I'm looking at: do_wp_page() -> vm_normal_page() -> wp_page_shared() -> do_page_mkwrite() I don't see anything in there that checked Uptodate. > What I'm trying to understand is why this needs to be fixed inside > ->page_mkwrite, because AFAICT if we have to fix this in the iomap > code, we have the same "we got handed a bad page by the page fault" > problem in every single ->page_mkwrite implementation.... I think the iomap code is the only filesystem which clears PageUptodate on errors. I know we've argued about whether that's appropriate or not in the past. > > I haven't dug through what generic/019 does, so I don't know how plausible > > this is. > > # Run fsstress and fio(dio/aio and mmap) and simulate disk failure > # check filesystem consistency at the end. > > I don't think it is mixing buffered writes and mmap writes on the > same file via fio. Maybe fsstress is triggering that, but the > fsstress workload is single threaded so, again, I'm not sure it will > do this. Maybe that's not how we end up with a read-only PTE in the process's page tables. Perhaps it starts out with a store, then on an fsync() we mark it read-only, then try to do another store. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-04 23:50 ` Matthew Wilcox @ 2020-06-05 0:31 ` Dave Chinner 2020-06-05 2:24 ` Matthew Wilcox 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-06-05 0:31 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote: > On Fri, Jun 05, 2020 at 09:30:53AM +1000, Dave Chinner wrote: > > On Thu, Jun 04, 2020 at 04:05:19PM -0700, Matthew Wilcox wrote: > > > On Fri, Jun 05, 2020 at 08:57:26AM +1000, Dave Chinner wrote: > > > > On Thu, Jun 04, 2020 at 01:23:40PM -0700, Matthew Wilcox wrote: > > > > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > > > > > > > > > Test generic/019 often results in: > > > > > > > > > > WARNING: at fs/iomap/buffered-io.c:1069 iomap_page_mkwrite_actor+0x57/0x70 > > > > > > > > > > Since this can happen due to a storage error, we should not WARN for it. > > > > > Just return -EIO, which will be converted to a SIGBUS for the hapless > > > > > task attempting to write to the page that we can't read. > > > > > > > > Why didn't the "read" part of the fault which had the EIO error fail > > > > the page fault? i.e. why are we waiting until deep inside the write > > > > fault path to error out on a failed page read? > > > > > > I have a hypothesis that I don't know how to verify. > > > > > > First the task does a load from the page and we put a read-only PTE in > > > the page tables. Then it writes to the page using write(). The page > > > gets written back, but hits an error in iomap_writepage_map() > > > which calls ClearPageUptodate(). Then the task with it mapped attempts > > > to store to it. > > > > Sure, but that's not really what I was asking: why isn't this > > !uptodate state caught before the page fault code calls > > ->page_mkwrite? The page fault code has a reference to the page, > > after all, and in a couple of paths it even has the page locked. > > If there's already a PTE present, then the page fault code doesn't > check the uptodate bit. Here's the path I'm looking at: > > do_wp_page() > -> vm_normal_page() > -> wp_page_shared() > -> do_page_mkwrite() > > I don't see anything in there that checked Uptodate. Yup, exactly the code I was looking at when I asked this question. The kernel has invalidated the contents of a page, yet we still have it mapped into userspace as containing valid contents, and we don't check it at all when userspace generates a protection fault on the page? > > What I'm trying to understand is why this needs to be fixed inside > > ->page_mkwrite, because AFAICT if we have to fix this in the iomap > > code, we have the same "we got handed a bad page by the page fault" > > problem in every single ->page_mkwrite implementation.... > > I think the iomap code is the only filesystem which clears PageUptodate > on errors. I don't think you looked very hard. A quick scan shows at least btrfs, f2fs, hostfs, jffs2, reiserfs, vboxfs and anything using the iomap path will call ClearPageUptodate() on a write IO error. Some of them also call SetPageError(), too. The above fault path doesn't check for that, either. AFAICT, It doesn't check for any adverse page state at all that has occurred since the last time the PTE changed state, and it doesn't check the mapping for error state, either. It just hands ->page_mkwrite() a completely unchecked page. Which again, is kinda what I'm getting at here: why should all the ->page_mkwrite implementations be forced to check that the page handed to it is in a valid state? > I know we've argued about whether that's appropriate or not > in the past. Sure, but that's largely irrelevant here. The fact is that we are taking a fault on a user page marked by the kernel with invalid contents, and it's not being caught by the common page fault code path. Ignore where the page contents are marked invalid, and instead ask "what should a write fault do on a page with invalid contents?". e.g. Maybe it actually read the contents of the page from disk again, rather than have page_mkwrite fail and end up killing the application? Really, all I want is for all filesystems to have the same behaviour in response to the same page state error conditions. Requiring every filesystem to detect and handle the error deep in it's IO path is not going to result in consistent behaviour.... > > > I haven't dug through what generic/019 does, so I don't know how plausible > > > this is. > > > > # Run fsstress and fio(dio/aio and mmap) and simulate disk failure > > # check filesystem consistency at the end. > > > > I don't think it is mixing buffered writes and mmap writes on the > > same file via fio. Maybe fsstress is triggering that, but the > > fsstress workload is single threaded so, again, I'm not sure it will > > do this. > > Maybe that's not how we end up with a read-only PTE in the process's > page tables. Perhaps it starts out with a store, then on an fsync() > we mark it read-only, then try to do another store. True. Normal background writeback will mark pages clean, so it probably doesn't need fsync at all. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-05 0:31 ` Dave Chinner @ 2020-06-05 2:24 ` Matthew Wilcox 2020-06-05 3:07 ` Dave Chinner 0 siblings, 1 reply; 12+ messages in thread From: Matthew Wilcox @ 2020-06-05 2:24 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote: > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote: > > > Sure, but that's not really what I was asking: why isn't this > > > !uptodate state caught before the page fault code calls > > > ->page_mkwrite? The page fault code has a reference to the page, > > > after all, and in a couple of paths it even has the page locked. > > > > If there's already a PTE present, then the page fault code doesn't > > check the uptodate bit. Here's the path I'm looking at: > > > > do_wp_page() > > -> vm_normal_page() > > -> wp_page_shared() > > -> do_page_mkwrite() > > > > I don't see anything in there that checked Uptodate. > > Yup, exactly the code I was looking at when I asked this question. > The kernel has invalidated the contents of a page, yet we still have > it mapped into userspace as containing valid contents, and we don't > check it at all when userspace generates a protection fault on the > page? Right. The iomap error path only clears PageUptodate. It doesn't go to the effort of unmapping the page from userspace, so userspace has a read-only view of a !Uptodate page. > > I think the iomap code is the only filesystem which clears PageUptodate > > on errors. > > I don't think you looked very hard. A quick scan shows at least > btrfs, f2fs, hostfs, jffs2, reiserfs, vboxfs and anything using the > iomap path will call ClearPageUptodate() on a write IO error. I'll give you btrfs and jffs2, but I don't think it's true for f2fs. The only other filesystem using the iomap bufferd IO paths today is zonefs, afaik. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-05 2:24 ` Matthew Wilcox @ 2020-06-05 3:07 ` Dave Chinner 2020-06-05 12:48 ` Matthew Wilcox 0 siblings, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-06-05 3:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote: > On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote: > > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote: > > > > Sure, but that's not really what I was asking: why isn't this > > > > !uptodate state caught before the page fault code calls > > > > ->page_mkwrite? The page fault code has a reference to the page, > > > > after all, and in a couple of paths it even has the page locked. > > > > > > If there's already a PTE present, then the page fault code doesn't > > > check the uptodate bit. Here's the path I'm looking at: > > > > > > do_wp_page() > > > -> vm_normal_page() > > > -> wp_page_shared() > > > -> do_page_mkwrite() > > > > > > I don't see anything in there that checked Uptodate. > > > > Yup, exactly the code I was looking at when I asked this question. > > The kernel has invalidated the contents of a page, yet we still have > > it mapped into userspace as containing valid contents, and we don't > > check it at all when userspace generates a protection fault on the > > page? > > Right. The iomap error path only clears PageUptodate. It doesn't go > to the effort of unmapping the page from userspace, so userspace has a > read-only view of a !Uptodate page. Hmmm - did you miss the ->discard_page() callout just before we call ClearPageUptodate() on error in iomap_writepage_map()? That results in XFS calling iomap_invalidatepage() on the page, which .... /me sighs as he realises that ->invalidatepage doesn't actually invalidate page mappings but only clears the page dirty state and releases filesystem references to the page. Yay. We leave -invalidated page cache pages- mapped into userspace, and page faults on those pages don't catch access to invalidated pages. Geez, we really suck at this whole software thing, don't we? It's not clear to me that we can actually unmap those pages safely in a race free manner from this code - can we actually do that from the page writeback path? > > > I think the iomap code is the only filesystem which clears PageUptodate > > > on errors. > > > > I don't think you looked very hard. A quick scan shows at least > > btrfs, f2fs, hostfs, jffs2, reiserfs, vboxfs and anything using the > > iomap path will call ClearPageUptodate() on a write IO error. > > I'll give you btrfs and jffs2, but I don't think it's true for f2fs. > The only other filesystem using the iomap bufferd IO paths today > is zonefs, afaik. gfs2 as well. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-05 3:07 ` Dave Chinner @ 2020-06-05 12:48 ` Matthew Wilcox 2020-06-05 16:18 ` Darrick J. Wong 2020-06-05 21:48 ` Dave Chinner 0 siblings, 2 replies; 12+ messages in thread From: Matthew Wilcox @ 2020-06-05 12:48 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Fri, Jun 05, 2020 at 01:07:58PM +1000, Dave Chinner wrote: > On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote: > > On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote: > > > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote: > > > > > Sure, but that's not really what I was asking: why isn't this > > > > > !uptodate state caught before the page fault code calls > > > > > ->page_mkwrite? The page fault code has a reference to the page, > > > > > after all, and in a couple of paths it even has the page locked. > > > > > > > > If there's already a PTE present, then the page fault code doesn't > > > > check the uptodate bit. Here's the path I'm looking at: > > > > > > > > do_wp_page() > > > > -> vm_normal_page() > > > > -> wp_page_shared() > > > > -> do_page_mkwrite() > > > > > > > > I don't see anything in there that checked Uptodate. > > > > > > Yup, exactly the code I was looking at when I asked this question. > > > The kernel has invalidated the contents of a page, yet we still have > > > it mapped into userspace as containing valid contents, and we don't > > > check it at all when userspace generates a protection fault on the > > > page? > > > > Right. The iomap error path only clears PageUptodate. It doesn't go > > to the effort of unmapping the page from userspace, so userspace has a > > read-only view of a !Uptodate page. > > Hmmm - did you miss the ->discard_page() callout just before we call > ClearPageUptodate() on error in iomap_writepage_map()? That results > in XFS calling iomap_invalidatepage() on the page, which .... ... I don't think that's the interesting path. I mean, that's the submission path, and usually we discover errors in the completion path, not the submission path. > /me sighs as he realises that ->invalidatepage doesn't actually > invalidate page mappings but only clears the page dirty state and > releases filesystem references to the page. > > Yay. We leave -invalidated page cache pages- mapped into userspace, > and page faults on those pages don't catch access to invalidated > pages. More than that ... by clearing Uptodate, you're trying to prevent future reads to this page from succeeding without verifying the data is still on storage, but a task that had it mapped before can still load the data that was written but never made it to storage. So at some point it'll teleport backwards when another task has a successful read(). Funfunfun. > Geez, we really suck at this whole software thing, don't we? Certainly at handling errors ... > It's not clear to me that we can actually unmap those pages safely > in a race free manner from this code - can we actually do that from > the page writeback path? I don't see why it can't be done from the submission path. unmap_mapping_range() calls i_mmap_lock_write(), which is down_write(i_mmap_rwsem) in drag. There might be a lock ordering issue there, although lockdep should find it pretty quickly. The bigger problem is the completion path. We're in softirq context, so that will have to punt to a thread that can take mutexes. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-05 12:48 ` Matthew Wilcox @ 2020-06-05 16:18 ` Darrick J. Wong 2020-06-05 21:48 ` Dave Chinner 1 sibling, 0 replies; 12+ messages in thread From: Darrick J. Wong @ 2020-06-05 16:18 UTC (permalink / raw) To: Matthew Wilcox Cc: Dave Chinner, Christoph Hellwig, linux-xfs, linux-fsdevel, linux-kernel On Fri, Jun 05, 2020 at 05:48:26AM -0700, Matthew Wilcox wrote: > On Fri, Jun 05, 2020 at 01:07:58PM +1000, Dave Chinner wrote: > > On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote: > > > On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote: > > > > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote: > > > > > > Sure, but that's not really what I was asking: why isn't this > > > > > > !uptodate state caught before the page fault code calls > > > > > > ->page_mkwrite? The page fault code has a reference to the page, > > > > > > after all, and in a couple of paths it even has the page locked. > > > > > > > > > > If there's already a PTE present, then the page fault code doesn't > > > > > check the uptodate bit. Here's the path I'm looking at: > > > > > > > > > > do_wp_page() > > > > > -> vm_normal_page() > > > > > -> wp_page_shared() > > > > > -> do_page_mkwrite() > > > > > > > > > > I don't see anything in there that checked Uptodate. > > > > > > > > Yup, exactly the code I was looking at when I asked this question. > > > > The kernel has invalidated the contents of a page, yet we still have > > > > it mapped into userspace as containing valid contents, and we don't > > > > check it at all when userspace generates a protection fault on the > > > > page? > > > > > > Right. The iomap error path only clears PageUptodate. It doesn't go > > > to the effort of unmapping the page from userspace, so userspace has a > > > read-only view of a !Uptodate page. > > > > Hmmm - did you miss the ->discard_page() callout just before we call > > ClearPageUptodate() on error in iomap_writepage_map()? That results > > in XFS calling iomap_invalidatepage() on the page, which .... > > ... I don't think that's the interesting path. I mean, that's > the submission path, and usually we discover errors in the completion > path, not the submission path. > > > /me sighs as he realises that ->invalidatepage doesn't actually > > invalidate page mappings but only clears the page dirty state and > > releases filesystem references to the page. <nod> Yes, we have preserved the old feebleness. I've long felt that we should leave the page dirty and retry the write, but that was objectionable because we could run out of memory and reclaim will stall and OOM on pages it can't clean if IO is still broken. I can't remember the exact reasons for leaving a /clean/ page in memory, but I think it had to do with preserving mmap'd page contents long enough that a program could redirty the mapping <bluh bluh race conditions><this is glitchy><blarghallthisisstupid>. > > Yay. We leave -invalidated page cache pages- mapped into userspace, > > and page faults on those pages don't catch access to invalidated > > pages. > > More than that ... by clearing Uptodate, you're trying to prevent > future reads to this page from succeeding without verifying the data > is still on storage, but a task that had it mapped before can still > load the data that was written but never made it to storage. > So at some point it'll teleport backwards when another task has a > successful read(). Funfunfun. Let's just invalidate the mapping and see if anyone complains. :D > > Geez, we really suck at this whole software thing, don't we? > > Certainly at handling errors ... > > > It's not clear to me that we can actually unmap those pages safely > > in a race free manner from this code - can we actually do that from > > the page writeback path? > > I don't see why it can't be done from the submission path. > unmap_mapping_range() calls i_mmap_lock_write(), which is > down_write(i_mmap_rwsem) in drag. There might be a lock ordering > issue there, although lockdep should find it pretty quickly. > > The bigger problem is the completion path. We're in softirq context, > so that will have to punt to a thread that can take mutexes. <nod> It's more workqueue punting, but I guess at least errors ought to be infrequent. --D ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-05 12:48 ` Matthew Wilcox 2020-06-05 16:18 ` Darrick J. Wong @ 2020-06-05 21:48 ` Dave Chinner 2020-06-05 22:49 ` Matthew Wilcox 1 sibling, 1 reply; 12+ messages in thread From: Dave Chinner @ 2020-06-05 21:48 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Fri, Jun 05, 2020 at 05:48:26AM -0700, Matthew Wilcox wrote: > On Fri, Jun 05, 2020 at 01:07:58PM +1000, Dave Chinner wrote: > > On Thu, Jun 04, 2020 at 07:24:51PM -0700, Matthew Wilcox wrote: > > > On Fri, Jun 05, 2020 at 10:31:59AM +1000, Dave Chinner wrote: > > > > On Thu, Jun 04, 2020 at 04:50:50PM -0700, Matthew Wilcox wrote: > > > > > > Sure, but that's not really what I was asking: why isn't this > > > > > > !uptodate state caught before the page fault code calls > > > > > > ->page_mkwrite? The page fault code has a reference to the page, > > > > > > after all, and in a couple of paths it even has the page locked. > > > > > > > > > > If there's already a PTE present, then the page fault code doesn't > > > > > check the uptodate bit. Here's the path I'm looking at: > > > > > > > > > > do_wp_page() > > > > > -> vm_normal_page() > > > > > -> wp_page_shared() > > > > > -> do_page_mkwrite() > > > > > > > > > > I don't see anything in there that checked Uptodate. > > > > > > > > Yup, exactly the code I was looking at when I asked this question. > > > > The kernel has invalidated the contents of a page, yet we still have > > > > it mapped into userspace as containing valid contents, and we don't > > > > check it at all when userspace generates a protection fault on the > > > > page? > > > > > > Right. The iomap error path only clears PageUptodate. It doesn't go > > > to the effort of unmapping the page from userspace, so userspace has a > > > read-only view of a !Uptodate page. > > > > Hmmm - did you miss the ->discard_page() callout just before we call > > ClearPageUptodate() on error in iomap_writepage_map()? That results > > in XFS calling iomap_invalidatepage() on the page, which .... > > ... I don't think that's the interesting path. I mean, that's > the submission path, and usually we discover errors in the completion > path, not the submission path. Where in the iomap write IO completion path do we call ClearPageUptodate()? I mean, it ends up in iomap_finish_page_writeback(), which does: static void iomap_finish_page_writeback(struct inode *inode, struct page *page, int error) { struct iomap_page *iop = to_iomap_page(page); if (error) { SetPageError(page); mapping_set_error(inode->i_mapping, -EIO); } WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop); WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0); if (!iop || atomic_dec_and_test(&iop->write_count)) end_page_writeback(page); } I mean, we call SetPageError() and tag the mapping, but we most certainly don't clear the PageUptodate state here. So AFAICT, the -only- places that iomap clears the uptodate state on a page is on -read- errors and on write submission failures. If it's a read error, the page fault should already be failing. If it's on submission, we invalidate it as we currently do and punch out the user mappings, and then when userspace refaults it can be killed by a read IO failure. But I just don't see how this problem results from errors reported at IO completion. This comes back to my original, underlying worry about the fragility of the page fault path: the page fault path is not even checking for PageError during faults, and I'm betting that almost no ->page_mkwrite implementation is checking it, either.... > > It's not clear to me that we can actually unmap those pages safely > > in a race free manner from this code - can we actually do that from > > the page writeback path? > > I don't see why it can't be done from the submission path. > unmap_mapping_range() calls i_mmap_lock_write(), which is > down_write(i_mmap_rwsem) in drag. There might be a lock ordering > issue there, although lockdep should find it pretty quickly. > > The bigger problem is the completion path. We're in softirq context, > so that will have to punt to a thread that can take mutexes. Punt to workqueue if we aren't already in a workqueue context - for a lot of writes on XFS we already will be running completion in a workqueue context.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite 2020-06-05 21:48 ` Dave Chinner @ 2020-06-05 22:49 ` Matthew Wilcox 0 siblings, 0 replies; 12+ messages in thread From: Matthew Wilcox @ 2020-06-05 22:49 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel On Sat, Jun 06, 2020 at 07:48:41AM +1000, Dave Chinner wrote: > On Fri, Jun 05, 2020 at 05:48:26AM -0700, Matthew Wilcox wrote: > > ... I don't think that's the interesting path. I mean, that's > > the submission path, and usually we discover errors in the completion > > path, not the submission path. > > Where in the iomap write IO completion path do we call > ClearPageUptodate()? Oh, I misread. You're right, I was looking at the read completion path. So, this is also inconsistent. We clear PageUptodate on errors we discover during submission, but not for errors we discover during completion. That doesn't make sense. > This comes back to my original, underlying worry about the fragility > of the page fault path: the page fault path is not even checking for > PageError during faults, and I'm betting that almost no > ->page_mkwrite implementation is checking it, either.... I think it's a reasonable assumption that user page tables should never contain a PTE for a page which is !Uptodate. Otherwise the user can read stale data. > > I don't see why it can't be done from the submission path. > > unmap_mapping_range() calls i_mmap_lock_write(), which is > > down_write(i_mmap_rwsem) in drag. There might be a lock ordering > > issue there, although lockdep should find it pretty quickly. > > > > The bigger problem is the completion path. We're in softirq context, > > so that will have to punt to a thread that can take mutexes. > > Punt to workqueue if we aren't already in a workqueue context - > for a lot of writes on XFS we already will be running completion in > a workqueue context.... Yep. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-06-05 22:49 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-06-04 20:23 [PATCH] iomap: Handle I/O errors gracefully in page_mkwrite Matthew Wilcox 2020-06-04 22:57 ` Dave Chinner 2020-06-04 23:05 ` Matthew Wilcox 2020-06-04 23:30 ` Dave Chinner 2020-06-04 23:50 ` Matthew Wilcox 2020-06-05 0:31 ` Dave Chinner 2020-06-05 2:24 ` Matthew Wilcox 2020-06-05 3:07 ` Dave Chinner 2020-06-05 12:48 ` Matthew Wilcox 2020-06-05 16:18 ` Darrick J. Wong 2020-06-05 21:48 ` Dave Chinner 2020-06-05 22:49 ` Matthew Wilcox
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).