linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] misc iomap/xfs writeback fixes
@ 2020-10-29 13:23 Brian Foster
  2020-10-29 13:23 ` [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Brian Foster @ 2020-10-29 13:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

Hi all,

Patch 1 is actually a repost of the original fix I posted[1] for the
truncate down zeroing issue. Patch 2 has some minor tweaks based on
feedback on v1 from Christoph. Patch 3 is new and fixes up some of the
remaining broken iomap writepage error handling logic (also discussed in
the v1 thread). Thoughts, reviews, flames appreciated.

Brian 

v2:
- Repost original XFS truncate down post-EOF zeroing fix.
- Pass file offset to iomap ->discard_page() callback.
- Add patch 3 to fix up iomap writepage error handling.
v1: https://lore.kernel.org/linux-xfs/20201026182019.1547662-1-bfoster@redhat.com/

[1] https://lore.kernel.org/linux-xfs/20201007143509.669729-1-bfoster@redhat.com/

Brian Foster (3):
  xfs: flush new eof page on truncate to avoid post-eof corruption
  iomap: support partial page discard on writeback block mapping failure
  iomap: clean up writeback state logic on writepage error

 fs/iomap/buffered-io.c | 30 ++++++++++--------------------
 fs/xfs/xfs_aops.c      | 13 +++++++------
 fs/xfs/xfs_iops.c      | 10 ++++++++++
 include/linux/iomap.h  |  2 +-
 4 files changed, 28 insertions(+), 27 deletions(-)

-- 
2.25.4


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption
  2020-10-29 13:23 [PATCH v2 0/3] misc iomap/xfs writeback fixes Brian Foster
@ 2020-10-29 13:23 ` Brian Foster
  2020-10-29 15:04   ` Christoph Hellwig
                     ` (2 more replies)
  2020-10-29 13:23 ` [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure Brian Foster
  2020-10-29 13:23 ` [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error Brian Foster
  2 siblings, 3 replies; 17+ messages in thread
From: Brian Foster @ 2020-10-29 13:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

It is possible to expose non-zeroed post-EOF data in XFS if the new
EOF page is dirty, backed by an unwritten block and the truncate
happens to race with writeback. iomap_truncate_page() will not zero
the post-EOF portion of the page if the underlying block is
unwritten. The subsequent call to truncate_setsize() will, but
doesn't dirty the page. Therefore, if writeback happens to complete
after iomap_truncate_page() (so it still sees the unwritten block)
but before truncate_setsize(), the cached page becomes inconsistent
with the on-disk block. A mapped read after the associated page is
reclaimed or invalidated exposes non-zero post-EOF data.

For example, consider the following sequence when run on a kernel
modified to explicitly flush the new EOF page within the race
window:

$ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
$ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
  ...
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400:  00 00 00 00 00 00 00 00  ........
$ umount /mnt/; mount <dev> /mnt/
$ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
00000400:  cd cd cd cd cd cd cd cd  ........

Update xfs_setattr_size() to explicitly flush the new EOF page prior
to the page truncate to ensure iomap has the latest state of the
underlying block.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iops.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 5e165456da68..1414ab79eacf 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -911,6 +911,16 @@ xfs_setattr_size(
 		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
 				&did_zeroing, &xfs_buffered_write_iomap_ops);
 	} else {
+		/*
+		 * iomap won't detect a dirty page over an unwritten block (or a
+		 * cow block over a hole) and subsequently skips zeroing the
+		 * newly post-EOF portion of the page. Flush the new EOF to
+		 * convert the block before the pagecache truncate.
+		 */
+		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
+						     newsize);
+		if (error)
+			return error;
 		error = iomap_truncate_page(inode, newsize, &did_zeroing,
 				&xfs_buffered_write_iomap_ops);
 	}
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 13:23 [PATCH v2 0/3] misc iomap/xfs writeback fixes Brian Foster
  2020-10-29 13:23 ` [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption Brian Foster
@ 2020-10-29 13:23 ` Brian Foster
  2020-10-29 15:06   ` Christoph Hellwig
                     ` (2 more replies)
  2020-10-29 13:23 ` [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error Brian Foster
  2 siblings, 3 replies; 17+ messages in thread
From: Brian Foster @ 2020-10-29 13:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

iomap writeback mapping failure only calls into ->discard_page() if
the current page has not been added to the ioend. Accordingly, the
XFS callback assumes a full page discard and invalidation. This is
problematic for sub-page block size filesystems where some portion
of a page might have been mapped successfully before a failure to
map a delalloc block occurs. ->discard_page() is not called in that
error scenario and the bio is explicitly failed by iomap via the
error return from ->prepare_ioend(). As a result, the filesystem
leaks delalloc blocks and corrupts the filesystem block counters.

Since XFS is the only user of ->discard_page(), tweak the semantics
to invoke the callback unconditionally on mapping errors and provide
the file offset that failed to map. Update xfs_discard_page() to
discard the corresponding portion of the file and pass the range
along to iomap_invalidatepage(). The latter already properly handles
both full and sub-page scenarios by not changing any iomap or page
state on sub-page invalidations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 15 ++++++++-------
 fs/xfs/xfs_aops.c      | 13 +++++++------
 include/linux/iomap.h  |  2 +-
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..d1f04eabc7e4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1412,14 +1412,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * appropriately.
 	 */
 	if (unlikely(error)) {
+		/*
+		 * Let the filesystem know what portion of the current page
+		 * failed to map. If the page wasn't been added to ioend, it
+		 * won't be affected by I/O completion and we must unlock it
+		 * now.
+		 */
+		if (wpc->ops->discard_page)
+			wpc->ops->discard_page(page, file_offset);
 		if (!count) {
-			/*
-			 * If the current page hasn't been added to ioend, it
-			 * won't be affected by I/O completions and we must
-			 * discard and unlock it right here.
-			 */
-			if (wpc->ops->discard_page)
-				wpc->ops->discard_page(page);
 			ClearPageUptodate(page);
 			unlock_page(page);
 			goto done;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..46920c530b20 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -527,13 +527,14 @@ xfs_prepare_ioend(
  */
 static void
 xfs_discard_page(
-	struct page		*page)
+	struct page		*page,
+	loff_t			fileoff)
 {
 	struct inode		*inode = page->mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
-	loff_t			offset = page_offset(page);
-	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
+	unsigned int		pageoff = offset_in_page(fileoff);
+	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, fileoff);
 	int			error;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
@@ -541,14 +542,14 @@ xfs_discard_page(
 
 	xfs_alert_ratelimited(mp,
 		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
-			page, ip->i_ino, offset);
+			page, ip->i_ino, fileoff);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			(PAGE_SIZE - pageoff) / i_blocksize(inode));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
-	iomap_invalidatepage(page, 0, PAGE_SIZE);
+	iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
 }
 
 static const struct iomap_writeback_ops xfs_writeback_ops = {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..36e0ab19210a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -220,7 +220,7 @@ struct iomap_writeback_ops {
 	 * Optional, allows the file system to discard state on a page where
 	 * we failed to submit any I/O.
 	 */
-	void (*discard_page)(struct page *page);
+	void (*discard_page)(struct page *page, loff_t fileoff);
 };
 
 struct iomap_writepage_ctx {
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error
  2020-10-29 13:23 [PATCH v2 0/3] misc iomap/xfs writeback fixes Brian Foster
  2020-10-29 13:23 ` [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption Brian Foster
  2020-10-29 13:23 ` [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure Brian Foster
@ 2020-10-29 13:23 ` Brian Foster
  2020-10-29 15:11   ` Christoph Hellwig
                     ` (2 more replies)
  2 siblings, 3 replies; 17+ messages in thread
From: Brian Foster @ 2020-10-29 13:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

The iomap writepage error handling logic is a mash of old and
slightly broken XFS writepage logic. When keepwrite writeback state
tracking was introduced in XFS in commit 0d085a529b42 ("xfs: ensure
WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an
additional cluster writeback context that scanned ahead of
->writepage() to process dirty pages over the current ->writepage()
extent mapping. This context expected a dirty page and required
retention of the TOWRITE tag on partial page processing so the
higher level writeback context would revisit the page (in contrast
to ->writepage(), which passes a page with the dirty bit already
cleared).

The cluster writeback mechanism was eventually removed and some of
the error handling logic folded into the primary writeback path in
commit 150d5be09ce4 ("xfs: remove xfs_cancel_ioend"). This patch
accidentally conflated the two contexts by using the keepwrite logic
in ->writepage() without accounting for the fact that the page is
not dirty. Further, the keepwrite logic has no practical effect on
the core ->writepage() caller (write_cache_pages()) because it never
revisits a page in the current function invocation.

Technically, the page should be redirtied for the keepwrite logic to
have any effect. Otherwise, write_cache_pages() may find the tagged
page but will skip it since it is clean. Even if the page was
redirtied, however, there is still no practical effect to keepwrite
since write_cache_pages() does not wrap around within a single
invocation of the function. Therefore, the dirty page would simply
end up retagged on the next writeback sequence over the associated
range.

All that being said, none of this really matters because redirtying
a partially processed page introduces a potential infinite redirty
-> writeback failure loop that deviates from the current design
principle of clearing the dirty state on writepage failure to avoid
building up too much dirty, unreclaimable memory on the system.
Therefore, drop the spurious keepwrite usage and dirty state
clearing logic from iomap_writepage_map(), treat the partially
processed page the same as a fully processed page, and let the
imminent ioend failure clean up the writeback state.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d1f04eabc7e4..e3a4568f6c2e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1404,6 +1404,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
 	WARN_ON_ONCE(!PageLocked(page));
 	WARN_ON_ONCE(PageWriteback(page));
+	WARN_ON_ONCE(PageDirty(page));
 
 	/*
 	 * We cannot cancel the ioend directly here on error.  We may have
@@ -1425,21 +1426,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 			unlock_page(page);
 			goto done;
 		}
-
-		/*
-		 * If the page was not fully cleaned, we need to ensure that the
-		 * higher layers come back to it correctly.  That means we need
-		 * to keep the page dirty, and for WB_SYNC_ALL writeback we need
-		 * to ensure the PAGECACHE_TAG_TOWRITE index mark is not removed
-		 * so another attempt to write this page in this writeback sweep
-		 * will be made.
-		 */
-		set_page_writeback_keepwrite(page);
-	} else {
-		clear_page_dirty_for_io(page);
-		set_page_writeback(page);
 	}
 
+	set_page_writeback(page);
 	unlock_page(page);
 
 	/*
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption
  2020-10-29 13:23 ` [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption Brian Foster
@ 2020-10-29 15:04   ` Christoph Hellwig
  2020-10-29 21:44   ` Darrick J. Wong
  2020-10-30 23:23   ` Allison Henderson
  2 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2020-10-29 15:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 13:23 ` [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure Brian Foster
@ 2020-10-29 15:06   ` Christoph Hellwig
  2020-10-29 15:27   ` Darrick J. Wong
  2020-10-29 16:33   ` [PATCH v2.1 " Brian Foster
  2 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2020-10-29 15:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error
  2020-10-29 13:23 ` [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error Brian Foster
@ 2020-10-29 15:11   ` Christoph Hellwig
  2020-10-29 21:48   ` Darrick J. Wong
  2020-10-30 23:23   ` Allison Henderson
  2 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2020-10-29 15:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 13:23 ` [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure Brian Foster
  2020-10-29 15:06   ` Christoph Hellwig
@ 2020-10-29 15:27   ` Darrick J. Wong
  2020-10-29 16:07     ` Brian Foster
  2020-10-29 16:33   ` [PATCH v2.1 " Brian Foster
  2 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-10-29 15:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 29, 2020 at 09:23:24AM -0400, Brian Foster wrote:
> iomap writeback mapping failure only calls into ->discard_page() if
> the current page has not been added to the ioend. Accordingly, the
> XFS callback assumes a full page discard and invalidation. This is
> problematic for sub-page block size filesystems where some portion
> of a page might have been mapped successfully before a failure to
> map a delalloc block occurs. ->discard_page() is not called in that
> error scenario and the bio is explicitly failed by iomap via the
> error return from ->prepare_ioend(). As a result, the filesystem
> leaks delalloc blocks and corrupts the filesystem block counters.
> 
> Since XFS is the only user of ->discard_page(), tweak the semantics
> to invoke the callback unconditionally on mapping errors and provide
> the file offset that failed to map. Update xfs_discard_page() to
> discard the corresponding portion of the file and pass the range
> along to iomap_invalidatepage(). The latter already properly handles
> both full and sub-page scenarios by not changing any iomap or page
> state on sub-page invalidations.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 15 ++++++++-------
>  fs/xfs/xfs_aops.c      | 13 +++++++------
>  include/linux/iomap.h  |  2 +-
>  3 files changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index bcfc288dba3f..d1f04eabc7e4 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1412,14 +1412,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * appropriately.
>  	 */
>  	if (unlikely(error)) {
> +		/*
> +		 * Let the filesystem know what portion of the current page
> +		 * failed to map. If the page wasn't been added to ioend, it
> +		 * won't be affected by I/O completion and we must unlock it
> +		 * now.
> +		 */
> +		if (wpc->ops->discard_page)
> +			wpc->ops->discard_page(page, file_offset);
>  		if (!count) {
> -			/*
> -			 * If the current page hasn't been added to ioend, it
> -			 * won't be affected by I/O completions and we must
> -			 * discard and unlock it right here.
> -			 */
> -			if (wpc->ops->discard_page)
> -				wpc->ops->discard_page(page);
>  			ClearPageUptodate(page);
>  			unlock_page(page);
>  			goto done;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index b35611882ff9..46920c530b20 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -527,13 +527,14 @@ xfs_prepare_ioend(
>   */
>  static void
>  xfs_discard_page(
> -	struct page		*page)
> +	struct page		*page,
> +	loff_t			fileoff)
>  {
>  	struct inode		*inode = page->mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> -	loff_t			offset = page_offset(page);
> -	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
> +	unsigned int		pageoff = offset_in_page(fileoff);
> +	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, fileoff);
>  	int			error;
>  
>  	if (XFS_FORCED_SHUTDOWN(mp))
> @@ -541,14 +542,14 @@ xfs_discard_page(
>  
>  	xfs_alert_ratelimited(mp,
>  		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
> -			page, ip->i_ino, offset);
> +			page, ip->i_ino, fileoff);
>  
>  	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -			PAGE_SIZE / i_blocksize(inode));
> +			(PAGE_SIZE - pageoff) / i_blocksize(inode));

Er... could you rebase this against 5.10-rc1, please?  willy changed
that line to not use PAGE_SIZE directly.

I /think/ the way to resolve the merge conflict here is to change this
last argument to:

(i_blocks_per_page(page) - pageoff) / i_blocksize(inode)

--D

>  	if (error && !XFS_FORCED_SHUTDOWN(mp))
>  		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
>  out_invalidate:
> -	iomap_invalidatepage(page, 0, PAGE_SIZE);
> +	iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
>  }
>  
>  static const struct iomap_writeback_ops xfs_writeback_ops = {
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 4d1d3c3469e9..36e0ab19210a 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -220,7 +220,7 @@ struct iomap_writeback_ops {
>  	 * Optional, allows the file system to discard state on a page where
>  	 * we failed to submit any I/O.
>  	 */
> -	void (*discard_page)(struct page *page);
> +	void (*discard_page)(struct page *page, loff_t fileoff);
>  };
>  
>  struct iomap_writepage_ctx {
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 15:27   ` Darrick J. Wong
@ 2020-10-29 16:07     ` Brian Foster
  2020-10-29 16:12       ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2020-10-29 16:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 29, 2020 at 08:27:18AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 29, 2020 at 09:23:24AM -0400, Brian Foster wrote:
> > iomap writeback mapping failure only calls into ->discard_page() if
> > the current page has not been added to the ioend. Accordingly, the
> > XFS callback assumes a full page discard and invalidation. This is
> > problematic for sub-page block size filesystems where some portion
> > of a page might have been mapped successfully before a failure to
> > map a delalloc block occurs. ->discard_page() is not called in that
> > error scenario and the bio is explicitly failed by iomap via the
> > error return from ->prepare_ioend(). As a result, the filesystem
> > leaks delalloc blocks and corrupts the filesystem block counters.
> > 
> > Since XFS is the only user of ->discard_page(), tweak the semantics
> > to invoke the callback unconditionally on mapping errors and provide
> > the file offset that failed to map. Update xfs_discard_page() to
> > discard the corresponding portion of the file and pass the range
> > along to iomap_invalidatepage(). The latter already properly handles
> > both full and sub-page scenarios by not changing any iomap or page
> > state on sub-page invalidations.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/iomap/buffered-io.c | 15 ++++++++-------
> >  fs/xfs/xfs_aops.c      | 13 +++++++------
> >  include/linux/iomap.h  |  2 +-
> >  3 files changed, 16 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index bcfc288dba3f..d1f04eabc7e4 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1412,14 +1412,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  	 * appropriately.
> >  	 */
> >  	if (unlikely(error)) {
> > +		/*
> > +		 * Let the filesystem know what portion of the current page
> > +		 * failed to map. If the page wasn't been added to ioend, it
> > +		 * won't be affected by I/O completion and we must unlock it
> > +		 * now.
> > +		 */
> > +		if (wpc->ops->discard_page)
> > +			wpc->ops->discard_page(page, file_offset);
> >  		if (!count) {
> > -			/*
> > -			 * If the current page hasn't been added to ioend, it
> > -			 * won't be affected by I/O completions and we must
> > -			 * discard and unlock it right here.
> > -			 */
> > -			if (wpc->ops->discard_page)
> > -				wpc->ops->discard_page(page);
> >  			ClearPageUptodate(page);
> >  			unlock_page(page);
> >  			goto done;
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index b35611882ff9..46920c530b20 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -527,13 +527,14 @@ xfs_prepare_ioend(
> >   */
> >  static void
> >  xfs_discard_page(
> > -	struct page		*page)
> > +	struct page		*page,
> > +	loff_t			fileoff)
> >  {
> >  	struct inode		*inode = page->mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> > -	loff_t			offset = page_offset(page);
> > -	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
> > +	unsigned int		pageoff = offset_in_page(fileoff);
> > +	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, fileoff);
> >  	int			error;
> >  
> >  	if (XFS_FORCED_SHUTDOWN(mp))
> > @@ -541,14 +542,14 @@ xfs_discard_page(
> >  
> >  	xfs_alert_ratelimited(mp,
> >  		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
> > -			page, ip->i_ino, offset);
> > +			page, ip->i_ino, fileoff);
> >  
> >  	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> > -			PAGE_SIZE / i_blocksize(inode));
> > +			(PAGE_SIZE - pageoff) / i_blocksize(inode));
> 
> Er... could you rebase this against 5.10-rc1, please?  willy changed
> that line to not use PAGE_SIZE directly.
> 

Sure.. (note that there's still a PAGE_SIZE usage in the
iomap_invalidatepage() call).

> I /think/ the way to resolve the merge conflict here is to change this
> last argument to:
> 
> (i_blocks_per_page(page) - pageoff) / i_blocksize(inode)
> 

Hmm... pageoff is bytes so that doesn't look quite right. How about
something like this?

	...
	unsigned int            pageoff = offset_in_page(fileoff);
	xfs_fileoff_t           pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);

	...
        error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
				i_blocks_per_page(inode, page) - pageoff_fsb);
	...

Brian

> --D
> 
> >  	if (error && !XFS_FORCED_SHUTDOWN(mp))
> >  		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
> >  out_invalidate:
> > -	iomap_invalidatepage(page, 0, PAGE_SIZE);
> > +	iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
> >  }
> >  
> >  static const struct iomap_writeback_ops xfs_writeback_ops = {
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 4d1d3c3469e9..36e0ab19210a 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -220,7 +220,7 @@ struct iomap_writeback_ops {
> >  	 * Optional, allows the file system to discard state on a page where
> >  	 * we failed to submit any I/O.
> >  	 */
> > -	void (*discard_page)(struct page *page);
> > +	void (*discard_page)(struct page *page, loff_t fileoff);
> >  };
> >  
> >  struct iomap_writepage_ctx {
> > -- 
> > 2.25.4
> > 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 16:07     ` Brian Foster
@ 2020-10-29 16:12       ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-10-29 16:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 29, 2020 at 12:07:32PM -0400, Brian Foster wrote:
> On Thu, Oct 29, 2020 at 08:27:18AM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 29, 2020 at 09:23:24AM -0400, Brian Foster wrote:
> > > iomap writeback mapping failure only calls into ->discard_page() if
> > > the current page has not been added to the ioend. Accordingly, the
> > > XFS callback assumes a full page discard and invalidation. This is
> > > problematic for sub-page block size filesystems where some portion
> > > of a page might have been mapped successfully before a failure to
> > > map a delalloc block occurs. ->discard_page() is not called in that
> > > error scenario and the bio is explicitly failed by iomap via the
> > > error return from ->prepare_ioend(). As a result, the filesystem
> > > leaks delalloc blocks and corrupts the filesystem block counters.
> > > 
> > > Since XFS is the only user of ->discard_page(), tweak the semantics
> > > to invoke the callback unconditionally on mapping errors and provide
> > > the file offset that failed to map. Update xfs_discard_page() to
> > > discard the corresponding portion of the file and pass the range
> > > along to iomap_invalidatepage(). The latter already properly handles
> > > both full and sub-page scenarios by not changing any iomap or page
> > > state on sub-page invalidations.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/iomap/buffered-io.c | 15 ++++++++-------
> > >  fs/xfs/xfs_aops.c      | 13 +++++++------
> > >  include/linux/iomap.h  |  2 +-
> > >  3 files changed, 16 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index bcfc288dba3f..d1f04eabc7e4 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1412,14 +1412,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> > >  	 * appropriately.
> > >  	 */
> > >  	if (unlikely(error)) {
> > > +		/*
> > > +		 * Let the filesystem know what portion of the current page
> > > +		 * failed to map. If the page wasn't been added to ioend, it
> > > +		 * won't be affected by I/O completion and we must unlock it
> > > +		 * now.
> > > +		 */
> > > +		if (wpc->ops->discard_page)
> > > +			wpc->ops->discard_page(page, file_offset);
> > >  		if (!count) {
> > > -			/*
> > > -			 * If the current page hasn't been added to ioend, it
> > > -			 * won't be affected by I/O completions and we must
> > > -			 * discard and unlock it right here.
> > > -			 */
> > > -			if (wpc->ops->discard_page)
> > > -				wpc->ops->discard_page(page);
> > >  			ClearPageUptodate(page);
> > >  			unlock_page(page);
> > >  			goto done;
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index b35611882ff9..46920c530b20 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -527,13 +527,14 @@ xfs_prepare_ioend(
> > >   */
> > >  static void
> > >  xfs_discard_page(
> > > -	struct page		*page)
> > > +	struct page		*page,
> > > +	loff_t			fileoff)
> > >  {
> > >  	struct inode		*inode = page->mapping->host;
> > >  	struct xfs_inode	*ip = XFS_I(inode);
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > > -	loff_t			offset = page_offset(page);
> > > -	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
> > > +	unsigned int		pageoff = offset_in_page(fileoff);
> > > +	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, fileoff);
> > >  	int			error;
> > >  
> > >  	if (XFS_FORCED_SHUTDOWN(mp))
> > > @@ -541,14 +542,14 @@ xfs_discard_page(
> > >  
> > >  	xfs_alert_ratelimited(mp,
> > >  		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
> > > -			page, ip->i_ino, offset);
> > > +			page, ip->i_ino, fileoff);
> > >  
> > >  	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> > > -			PAGE_SIZE / i_blocksize(inode));
> > > +			(PAGE_SIZE - pageoff) / i_blocksize(inode));
> > 
> > Er... could you rebase this against 5.10-rc1, please?  willy changed
> > that line to not use PAGE_SIZE directly.
> > 
> 
> Sure.. (note that there's still a PAGE_SIZE usage in the
> iomap_invalidatepage() call).

<nod> I think he has more for 5.11, but in the meantime it's just fixing
the merge conflicts. :)

> > I /think/ the way to resolve the merge conflict here is to change this
> > last argument to:
> > 
> > (i_blocks_per_page(page) - pageoff) / i_blocksize(inode)
> > 
> 
> Hmm... pageoff is bytes so that doesn't look quite right. How about
> something like this?
> 
> 	...
> 	unsigned int            pageoff = offset_in_page(fileoff);
> 	xfs_fileoff_t           pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
> 
> 	...
>         error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> 				i_blocks_per_page(inode, page) - pageoff_fsb);
> 	...

You could probably combine the two variables, but otherwise that looks
fine to me.

--D

> 
> Brian
> 
> > --D
> > 
> > >  	if (error && !XFS_FORCED_SHUTDOWN(mp))
> > >  		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
> > >  out_invalidate:
> > > -	iomap_invalidatepage(page, 0, PAGE_SIZE);
> > > +	iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
> > >  }
> > >  
> > >  static const struct iomap_writeback_ops xfs_writeback_ops = {
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 4d1d3c3469e9..36e0ab19210a 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -220,7 +220,7 @@ struct iomap_writeback_ops {
> > >  	 * Optional, allows the file system to discard state on a page where
> > >  	 * we failed to submit any I/O.
> > >  	 */
> > > -	void (*discard_page)(struct page *page);
> > > +	void (*discard_page)(struct page *page, loff_t fileoff);
> > >  };
> > >  
> > >  struct iomap_writepage_ctx {
> > > -- 
> > > 2.25.4
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2.1 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 13:23 ` [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure Brian Foster
  2020-10-29 15:06   ` Christoph Hellwig
  2020-10-29 15:27   ` Darrick J. Wong
@ 2020-10-29 16:33   ` Brian Foster
  2020-10-29 21:45     ` Darrick J. Wong
  2020-10-30 23:23     ` Allison Henderson
  2 siblings, 2 replies; 17+ messages in thread
From: Brian Foster @ 2020-10-29 16:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

iomap writeback mapping failure only calls into ->discard_page() if
the current page has not been added to the ioend. Accordingly, the
XFS callback assumes a full page discard and invalidation. This is
problematic for sub-page block size filesystems where some portion
of a page might have been mapped successfully before a failure to
map a delalloc block occurs. ->discard_page() is not called in that
error scenario and the bio is explicitly failed by iomap via the
error return from ->prepare_ioend(). As a result, the filesystem
leaks delalloc blocks and corrupts the filesystem block counters.

Since XFS is the only user of ->discard_page(), tweak the semantics
to invoke the callback unconditionally on mapping errors and provide
the file offset that failed to map. Update xfs_discard_page() to
discard the corresponding portion of the file and pass the range
along to iomap_invalidatepage(). The latter already properly handles
both full and sub-page scenarios by not changing any iomap or page
state on sub-page invalidations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---

v2.1:
- Rebased to v5.10-rc1.

 fs/iomap/buffered-io.c | 15 ++++++++-------
 fs/xfs/xfs_aops.c      | 14 ++++++++------
 include/linux/iomap.h  |  2 +-
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8180061b9e16..e4ea1f9f94d0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1382,14 +1382,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * appropriately.
 	 */
 	if (unlikely(error)) {
+		/*
+		 * Let the filesystem know what portion of the current page
+		 * failed to map. If the page wasn't been added to ioend, it
+		 * won't be affected by I/O completion and we must unlock it
+		 * now.
+		 */
+		if (wpc->ops->discard_page)
+			wpc->ops->discard_page(page, file_offset);
 		if (!count) {
-			/*
-			 * If the current page hasn't been added to ioend, it
-			 * won't be affected by I/O completions and we must
-			 * discard and unlock it right here.
-			 */
-			if (wpc->ops->discard_page)
-				wpc->ops->discard_page(page);
 			ClearPageUptodate(page);
 			unlock_page(page);
 			goto done;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 55d126d4e096..5bf37afae5e9 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -527,13 +527,15 @@ xfs_prepare_ioend(
  */
 static void
 xfs_discard_page(
-	struct page		*page)
+	struct page		*page,
+	loff_t			fileoff)
 {
 	struct inode		*inode = page->mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
-	loff_t			offset = page_offset(page);
-	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
+	unsigned int		pageoff = offset_in_page(fileoff);
+	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, fileoff);
+	xfs_fileoff_t		pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
 	int			error;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
@@ -541,14 +543,14 @@ xfs_discard_page(
 
 	xfs_alert_ratelimited(mp,
 		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
-			page, ip->i_ino, offset);
+			page, ip->i_ino, fileoff);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			i_blocks_per_page(inode, page));
+			i_blocks_per_page(inode, page) - pageoff_fsb);
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
-	iomap_invalidatepage(page, 0, PAGE_SIZE);
+	iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
 }
 
 static const struct iomap_writeback_ops xfs_writeback_ops = {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 172b3397a1a3..5bd3cac4df9c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -221,7 +221,7 @@ struct iomap_writeback_ops {
 	 * Optional, allows the file system to discard state on a page where
 	 * we failed to submit any I/O.
 	 */
-	void (*discard_page)(struct page *page);
+	void (*discard_page)(struct page *page, loff_t fileoff);
 };
 
 struct iomap_writepage_ctx {
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption
  2020-10-29 13:23 ` [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption Brian Foster
  2020-10-29 15:04   ` Christoph Hellwig
@ 2020-10-29 21:44   ` Darrick J. Wong
  2020-10-30 23:23   ` Allison Henderson
  2 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-10-29 21:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 29, 2020 at 09:23:23AM -0400, Brian Foster wrote:
> It is possible to expose non-zeroed post-EOF data in XFS if the new
> EOF page is dirty, backed by an unwritten block and the truncate
> happens to race with writeback. iomap_truncate_page() will not zero
> the post-EOF portion of the page if the underlying block is
> unwritten. The subsequent call to truncate_setsize() will, but
> doesn't dirty the page. Therefore, if writeback happens to complete
> after iomap_truncate_page() (so it still sees the unwritten block)
> but before truncate_setsize(), the cached page becomes inconsistent
> with the on-disk block. A mapped read after the associated page is
> reclaimed or invalidated exposes non-zero post-EOF data.
> 
> For example, consider the following sequence when run on a kernel
> modified to explicitly flush the new EOF page within the race
> window:
> 
> $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
> $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
>   ...
> $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> 00000400:  00 00 00 00 00 00 00 00  ........
> $ umount /mnt/; mount <dev> /mnt/
> $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> 00000400:  cd cd cd cd cd cd cd cd  ........
> 
> Update xfs_setattr_size() to explicitly flush the new EOF page prior
> to the page truncate to ensure iomap has the latest state of the
> underlying block.
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Seems reasonable,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_iops.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 5e165456da68..1414ab79eacf 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -911,6 +911,16 @@ xfs_setattr_size(
>  		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
>  				&did_zeroing, &xfs_buffered_write_iomap_ops);
>  	} else {
> +		/*
> +		 * iomap won't detect a dirty page over an unwritten block (or a
> +		 * cow block over a hole) and subsequently skips zeroing the
> +		 * newly post-EOF portion of the page. Flush the new EOF to
> +		 * convert the block before the pagecache truncate.
> +		 */
> +		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> +						     newsize);
> +		if (error)
> +			return error;
>  		error = iomap_truncate_page(inode, newsize, &did_zeroing,
>  				&xfs_buffered_write_iomap_ops);
>  	}
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2.1 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 16:33   ` [PATCH v2.1 " Brian Foster
@ 2020-10-29 21:45     ` Darrick J. Wong
  2020-10-30 23:23     ` Allison Henderson
  1 sibling, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-10-29 21:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 29, 2020 at 12:33:13PM -0400, Brian Foster wrote:
> iomap writeback mapping failure only calls into ->discard_page() if
> the current page has not been added to the ioend. Accordingly, the
> XFS callback assumes a full page discard and invalidation. This is
> problematic for sub-page block size filesystems where some portion
> of a page might have been mapped successfully before a failure to
> map a delalloc block occurs. ->discard_page() is not called in that
> error scenario and the bio is explicitly failed by iomap via the
> error return from ->prepare_ioend(). As a result, the filesystem
> leaks delalloc blocks and corrupts the filesystem block counters.
> 
> Since XFS is the only user of ->discard_page(), tweak the semantics
> to invoke the callback unconditionally on mapping errors and provide
> the file offset that failed to map. Update xfs_discard_page() to
> discard the corresponding portion of the file and pass the range
> along to iomap_invalidatepage(). The latter already properly handles
> both full and sub-page scenarios by not changing any iomap or page
> state on sub-page invalidations.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks for the rebase,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> 
> v2.1:
> - Rebased to v5.10-rc1.
> 
>  fs/iomap/buffered-io.c | 15 ++++++++-------
>  fs/xfs/xfs_aops.c      | 14 ++++++++------
>  include/linux/iomap.h  |  2 +-
>  3 files changed, 17 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8180061b9e16..e4ea1f9f94d0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1382,14 +1382,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * appropriately.
>  	 */
>  	if (unlikely(error)) {
> +		/*
> +		 * Let the filesystem know what portion of the current page
> +		 * failed to map. If the page wasn't been added to ioend, it
> +		 * won't be affected by I/O completion and we must unlock it
> +		 * now.
> +		 */
> +		if (wpc->ops->discard_page)
> +			wpc->ops->discard_page(page, file_offset);
>  		if (!count) {
> -			/*
> -			 * If the current page hasn't been added to ioend, it
> -			 * won't be affected by I/O completions and we must
> -			 * discard and unlock it right here.
> -			 */
> -			if (wpc->ops->discard_page)
> -				wpc->ops->discard_page(page);
>  			ClearPageUptodate(page);
>  			unlock_page(page);
>  			goto done;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 55d126d4e096..5bf37afae5e9 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -527,13 +527,15 @@ xfs_prepare_ioend(
>   */
>  static void
>  xfs_discard_page(
> -	struct page		*page)
> +	struct page		*page,
> +	loff_t			fileoff)
>  {
>  	struct inode		*inode = page->mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> -	loff_t			offset = page_offset(page);
> -	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
> +	unsigned int		pageoff = offset_in_page(fileoff);
> +	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, fileoff);
> +	xfs_fileoff_t		pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
>  	int			error;
>  
>  	if (XFS_FORCED_SHUTDOWN(mp))
> @@ -541,14 +543,14 @@ xfs_discard_page(
>  
>  	xfs_alert_ratelimited(mp,
>  		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
> -			page, ip->i_ino, offset);
> +			page, ip->i_ino, fileoff);
>  
>  	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -			i_blocks_per_page(inode, page));
> +			i_blocks_per_page(inode, page) - pageoff_fsb);
>  	if (error && !XFS_FORCED_SHUTDOWN(mp))
>  		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
>  out_invalidate:
> -	iomap_invalidatepage(page, 0, PAGE_SIZE);
> +	iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
>  }
>  
>  static const struct iomap_writeback_ops xfs_writeback_ops = {
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 172b3397a1a3..5bd3cac4df9c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -221,7 +221,7 @@ struct iomap_writeback_ops {
>  	 * Optional, allows the file system to discard state on a page where
>  	 * we failed to submit any I/O.
>  	 */
> -	void (*discard_page)(struct page *page);
> +	void (*discard_page)(struct page *page, loff_t fileoff);
>  };
>  
>  struct iomap_writepage_ctx {
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error
  2020-10-29 13:23 ` [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error Brian Foster
  2020-10-29 15:11   ` Christoph Hellwig
@ 2020-10-29 21:48   ` Darrick J. Wong
  2020-10-30 23:23   ` Allison Henderson
  2 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2020-10-29 21:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Thu, Oct 29, 2020 at 09:23:25AM -0400, Brian Foster wrote:
> The iomap writepage error handling logic is a mash of old and
> slightly broken XFS writepage logic. When keepwrite writeback state
> tracking was introduced in XFS in commit 0d085a529b42 ("xfs: ensure
> WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an
> additional cluster writeback context that scanned ahead of
> ->writepage() to process dirty pages over the current ->writepage()
> extent mapping. This context expected a dirty page and required
> retention of the TOWRITE tag on partial page processing so the
> higher level writeback context would revisit the page (in contrast
> to ->writepage(), which passes a page with the dirty bit already
> cleared).
> 
> The cluster writeback mechanism was eventually removed and some of
> the error handling logic folded into the primary writeback path in
> commit 150d5be09ce4 ("xfs: remove xfs_cancel_ioend"). This patch
> accidentally conflated the two contexts by using the keepwrite logic
> in ->writepage() without accounting for the fact that the page is
> not dirty. Further, the keepwrite logic has no practical effect on
> the core ->writepage() caller (write_cache_pages()) because it never
> revisits a page in the current function invocation.
> 
> Technically, the page should be redirtied for the keepwrite logic to
> have any effect. Otherwise, write_cache_pages() may find the tagged
> page but will skip it since it is clean. Even if the page was
> redirtied, however, there is still no practical effect to keepwrite
> since write_cache_pages() does not wrap around within a single
> invocation of the function. Therefore, the dirty page would simply
> end up retagged on the next writeback sequence over the associated
> range.
> 
> All that being said, none of this really matters because redirtying
> a partially processed page introduces a potential infinite redirty
> -> writeback failure loop that deviates from the current design
> principle of clearing the dirty state on writepage failure to avoid
> building up too much dirty, unreclaimable memory on the system.
> Therefore, drop the spurious keepwrite usage and dirty state
> clearing logic from iomap_writepage_map(), treat the partially
> processed page the same as a fully processed page, and let the
> imminent ioend failure clean up the writeback state.

...and run away before ext4 tries to port itself to buffered iomap,
since it's the only other user of keepwrite.  Not sure why it ends up in
a state where it's doing writeback to a hole(?!)

> Signed-off-by: Brian Foster <bfoster@redhat.com>

Anyway this seems sensible to me...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap/buffered-io.c | 15 ++-------------
>  1 file changed, 2 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d1f04eabc7e4..e3a4568f6c2e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1404,6 +1404,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
>  	WARN_ON_ONCE(!PageLocked(page));
>  	WARN_ON_ONCE(PageWriteback(page));
> +	WARN_ON_ONCE(PageDirty(page));
>  
>  	/*
>  	 * We cannot cancel the ioend directly here on error.  We may have
> @@ -1425,21 +1426,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  			unlock_page(page);
>  			goto done;
>  		}
> -
> -		/*
> -		 * If the page was not fully cleaned, we need to ensure that the
> -		 * higher layers come back to it correctly.  That means we need
> -		 * to keep the page dirty, and for WB_SYNC_ALL writeback we need
> -		 * to ensure the PAGECACHE_TAG_TOWRITE index mark is not removed
> -		 * so another attempt to write this page in this writeback sweep
> -		 * will be made.
> -		 */
> -		set_page_writeback_keepwrite(page);
> -	} else {
> -		clear_page_dirty_for_io(page);
> -		set_page_writeback(page);
>  	}
>  
> +	set_page_writeback(page);
>  	unlock_page(page);
>  
>  	/*
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2.1 2/3] iomap: support partial page discard on writeback block mapping failure
  2020-10-29 16:33   ` [PATCH v2.1 " Brian Foster
  2020-10-29 21:45     ` Darrick J. Wong
@ 2020-10-30 23:23     ` Allison Henderson
  1 sibling, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2020-10-30 23:23 UTC (permalink / raw)
  To: Brian Foster, linux-fsdevel; +Cc: linux-xfs



On 10/29/20 9:33 AM, Brian Foster wrote:
> iomap writeback mapping failure only calls into ->discard_page() if
> the current page has not been added to the ioend. Accordingly, the
> XFS callback assumes a full page discard and invalidation. This is
> problematic for sub-page block size filesystems where some portion
> of a page might have been mapped successfully before a failure to
> map a delalloc block occurs. ->discard_page() is not called in that
> error scenario and the bio is explicitly failed by iomap via the
> error return from ->prepare_ioend(). As a result, the filesystem
> leaks delalloc blocks and corrupts the filesystem block counters.
> 
> Since XFS is the only user of ->discard_page(), tweak the semantics
> to invoke the callback unconditionally on mapping errors and provide
> the file offset that failed to map. Update xfs_discard_page() to
> discard the corresponding portion of the file and pass the range
> along to iomap_invalidatepage(). The latter already properly handles
> both full and sub-page scenarios by not changing any iomap or page
> state on sub-page invalidations.
> 
Looks ok to me
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> 
> v2.1:
> - Rebased to v5.10-rc1.
> 
>   fs/iomap/buffered-io.c | 15 ++++++++-------
>   fs/xfs/xfs_aops.c      | 14 ++++++++------
>   include/linux/iomap.h  |  2 +-
>   3 files changed, 17 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8180061b9e16..e4ea1f9f94d0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1382,14 +1382,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   	 * appropriately.
>   	 */
>   	if (unlikely(error)) {
> +		/*
> +		 * Let the filesystem know what portion of the current page
> +		 * failed to map. If the page wasn't been added to ioend, it
> +		 * won't be affected by I/O completion and we must unlock it
> +		 * now.
> +		 */
> +		if (wpc->ops->discard_page)
> +			wpc->ops->discard_page(page, file_offset);
>   		if (!count) {
> -			/*
> -			 * If the current page hasn't been added to ioend, it
> -			 * won't be affected by I/O completions and we must
> -			 * discard and unlock it right here.
> -			 */
> -			if (wpc->ops->discard_page)
> -				wpc->ops->discard_page(page);
>   			ClearPageUptodate(page);
>   			unlock_page(page);
>   			goto done;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 55d126d4e096..5bf37afae5e9 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -527,13 +527,15 @@ xfs_prepare_ioend(
>    */
>   static void
>   xfs_discard_page(
> -	struct page		*page)
> +	struct page		*page,
> +	loff_t			fileoff)
>   {
>   	struct inode		*inode = page->mapping->host;
>   	struct xfs_inode	*ip = XFS_I(inode);
>   	struct xfs_mount	*mp = ip->i_mount;
> -	loff_t			offset = page_offset(page);
> -	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, offset);
> +	unsigned int		pageoff = offset_in_page(fileoff);
> +	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, fileoff);
> +	xfs_fileoff_t		pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
>   	int			error;
>   
>   	if (XFS_FORCED_SHUTDOWN(mp))
> @@ -541,14 +543,14 @@ xfs_discard_page(
>   
>   	xfs_alert_ratelimited(mp,
>   		"page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
> -			page, ip->i_ino, offset);
> +			page, ip->i_ino, fileoff);
>   
>   	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> -			i_blocks_per_page(inode, page));
> +			i_blocks_per_page(inode, page) - pageoff_fsb);
>   	if (error && !XFS_FORCED_SHUTDOWN(mp))
>   		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
>   out_invalidate:
> -	iomap_invalidatepage(page, 0, PAGE_SIZE);
> +	iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
>   }
>   
>   static const struct iomap_writeback_ops xfs_writeback_ops = {
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 172b3397a1a3..5bd3cac4df9c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -221,7 +221,7 @@ struct iomap_writeback_ops {
>   	 * Optional, allows the file system to discard state on a page where
>   	 * we failed to submit any I/O.
>   	 */
> -	void (*discard_page)(struct page *page);
> +	void (*discard_page)(struct page *page, loff_t fileoff);
>   };
>   
>   struct iomap_writepage_ctx {
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption
  2020-10-29 13:23 ` [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption Brian Foster
  2020-10-29 15:04   ` Christoph Hellwig
  2020-10-29 21:44   ` Darrick J. Wong
@ 2020-10-30 23:23   ` Allison Henderson
  2 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2020-10-30 23:23 UTC (permalink / raw)
  To: Brian Foster, linux-fsdevel; +Cc: linux-xfs



On 10/29/20 6:23 AM, Brian Foster wrote:
> It is possible to expose non-zeroed post-EOF data in XFS if the new
> EOF page is dirty, backed by an unwritten block and the truncate
> happens to race with writeback. iomap_truncate_page() will not zero
> the post-EOF portion of the page if the underlying block is
> unwritten. The subsequent call to truncate_setsize() will, but
> doesn't dirty the page. Therefore, if writeback happens to complete
> after iomap_truncate_page() (so it still sees the unwritten block)
> but before truncate_setsize(), the cached page becomes inconsistent
> with the on-disk block. A mapped read after the associated page is
> reclaimed or invalidated exposes non-zero post-EOF data.
> 
> For example, consider the following sequence when run on a kernel
> modified to explicitly flush the new EOF page within the race
> window:
> 
> $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
> $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
>    ...
> $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> 00000400:  00 00 00 00 00 00 00 00  ........
> $ umount /mnt/; mount <dev> /mnt/
> $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
> 00000400:  cd cd cd cd cd cd cd cd  ........
> 
> Update xfs_setattr_size() to explicitly flush the new EOF page prior
> to the page truncate to ensure iomap has the latest state of the
> underlying block.
> 
Ok, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_iops.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 5e165456da68..1414ab79eacf 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -911,6 +911,16 @@ xfs_setattr_size(
>   		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
>   				&did_zeroing, &xfs_buffered_write_iomap_ops);
>   	} else {
> +		/*
> +		 * iomap won't detect a dirty page over an unwritten block (or a
> +		 * cow block over a hole) and subsequently skips zeroing the
> +		 * newly post-EOF portion of the page. Flush the new EOF to
> +		 * convert the block before the pagecache truncate.
> +		 */
> +		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> +						     newsize);
> +		if (error)
> +			return error;
>   		error = iomap_truncate_page(inode, newsize, &did_zeroing,
>   				&xfs_buffered_write_iomap_ops);
>   	}
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error
  2020-10-29 13:23 ` [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error Brian Foster
  2020-10-29 15:11   ` Christoph Hellwig
  2020-10-29 21:48   ` Darrick J. Wong
@ 2020-10-30 23:23   ` Allison Henderson
  2 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2020-10-30 23:23 UTC (permalink / raw)
  To: Brian Foster, linux-fsdevel; +Cc: linux-xfs



On 10/29/20 6:23 AM, Brian Foster wrote:
> The iomap writepage error handling logic is a mash of old and
> slightly broken XFS writepage logic. When keepwrite writeback state
> tracking was introduced in XFS in commit 0d085a529b42 ("xfs: ensure
> WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an
> additional cluster writeback context that scanned ahead of
> ->writepage() to process dirty pages over the current ->writepage()
> extent mapping. This context expected a dirty page and required
> retention of the TOWRITE tag on partial page processing so the
> higher level writeback context would revisit the page (in contrast
> to ->writepage(), which passes a page with the dirty bit already
> cleared).
> 
> The cluster writeback mechanism was eventually removed and some of
> the error handling logic folded into the primary writeback path in
> commit 150d5be09ce4 ("xfs: remove xfs_cancel_ioend"). This patch
> accidentally conflated the two contexts by using the keepwrite logic
> in ->writepage() without accounting for the fact that the page is
> not dirty. Further, the keepwrite logic has no practical effect on
> the core ->writepage() caller (write_cache_pages()) because it never
> revisits a page in the current function invocation.
> 
> Technically, the page should be redirtied for the keepwrite logic to
> have any effect. Otherwise, write_cache_pages() may find the tagged
> page but will skip it since it is clean. Even if the page was
> redirtied, however, there is still no practical effect to keepwrite
> since write_cache_pages() does not wrap around within a single
> invocation of the function. Therefore, the dirty page would simply
> end up retagged on the next writeback sequence over the associated
> range.
> 
> All that being said, none of this really matters because redirtying
> a partially processed page introduces a potential infinite redirty
> -> writeback failure loop that deviates from the current design
> principle of clearing the dirty state on writepage failure to avoid
> building up too much dirty, unreclaimable memory on the system.
> Therefore, drop the spurious keepwrite usage and dirty state
> clearing logic from iomap_writepage_map(), treat the partially
> processed page the same as a fully processed page, and let the
> imminent ioend failure clean up the writeback state.
> 
Ok, thanks for all the explaining.  Makes sense :-)
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/iomap/buffered-io.c | 15 ++-------------
>   1 file changed, 2 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d1f04eabc7e4..e3a4568f6c2e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1404,6 +1404,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   	WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
>   	WARN_ON_ONCE(!PageLocked(page));
>   	WARN_ON_ONCE(PageWriteback(page));
> +	WARN_ON_ONCE(PageDirty(page));
>   
>   	/*
>   	 * We cannot cancel the ioend directly here on error.  We may have
> @@ -1425,21 +1426,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   			unlock_page(page);
>   			goto done;
>   		}
> -
> -		/*
> -		 * If the page was not fully cleaned, we need to ensure that the
> -		 * higher layers come back to it correctly.  That means we need
> -		 * to keep the page dirty, and for WB_SYNC_ALL writeback we need
> -		 * to ensure the PAGECACHE_TAG_TOWRITE index mark is not removed
> -		 * so another attempt to write this page in this writeback sweep
> -		 * will be made.
> -		 */
> -		set_page_writeback_keepwrite(page);
> -	} else {
> -		clear_page_dirty_for_io(page);
> -		set_page_writeback(page);
>   	}
>   
> +	set_page_writeback(page);
>   	unlock_page(page);
>   
>   	/*
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-10-30 23:24 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-29 13:23 [PATCH v2 0/3] misc iomap/xfs writeback fixes Brian Foster
2020-10-29 13:23 ` [PATCH v2 1/3] xfs: flush new eof page on truncate to avoid post-eof corruption Brian Foster
2020-10-29 15:04   ` Christoph Hellwig
2020-10-29 21:44   ` Darrick J. Wong
2020-10-30 23:23   ` Allison Henderson
2020-10-29 13:23 ` [PATCH v2 2/3] iomap: support partial page discard on writeback block mapping failure Brian Foster
2020-10-29 15:06   ` Christoph Hellwig
2020-10-29 15:27   ` Darrick J. Wong
2020-10-29 16:07     ` Brian Foster
2020-10-29 16:12       ` Darrick J. Wong
2020-10-29 16:33   ` [PATCH v2.1 " Brian Foster
2020-10-29 21:45     ` Darrick J. Wong
2020-10-30 23:23     ` Allison Henderson
2020-10-29 13:23 ` [PATCH v2 3/3] iomap: clean up writeback state logic on writepage error Brian Foster
2020-10-29 15:11   ` Christoph Hellwig
2020-10-29 21:48   ` Darrick J. Wong
2020-10-30 23:23   ` Allison Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).