All of lore.kernel.org
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 1/7] xfs: write page faults in iomap are not buffered writes
Date: Wed, 2 Nov 2022 09:12:05 -0700	[thread overview]
Message-ID: <Y2KW1Y0kKvXtZDVr@magnolia> (raw)
In-Reply-To: <20221101003412.3842572-2-david@fromorbit.com>

On Tue, Nov 01, 2022 at 11:34:06AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
> we mark the iomap as IOMAP_F_NEW so that the write context
> understands that it allocated the delalloc region.
> 
> If we then fail that buffered write, xfs_buffered_write_iomap_end()
> checks for the IOMAP_F_NEW flag and if it is set, it punches out
> the unused delalloc region that was allocated for the write.
> 
> The assumption this code makes is that all buffered write operations
> that can allocate space are run under an exclusive lock (i_rwsem).
> This is an invalid assumption: page faults in mmap()d regions call
> through this same function pair to map the file range being faulted
> and this runs only holding the inode->i_mapping->invalidate_lock in
> shared mode.
> 
> IOWs, we can have races between page faults and write() calls that
> fail the nested page cache write operation that result in data loss.
> That is, the failing iomap_end call will punch out the data that
> the other racing iomap iteration brought into the page cache. This
> can be reproduced with generic/34[46] if we arbitrarily fail page
> cache copy-in operations from write() syscalls.
> 
> Code analysis tells us that the iomap_page_mkwrite() function holds
> the already instantiated and uptodate folio locked across the iomap
> mapping iterations. Hence the folio cannot be removed from memory
> whilst we are mapping the range it covers, and as such we do not
> care if the mapping changes state underneath the iomap iteration
> loop:
> 
> 1. if the folio is not already dirty, there are no writeback races
>    possible.
> 2. if we allocated the mapping (delalloc or unwritten), the folio
>    cannot already be dirty. See #1.
> 3. If the folio is already dirty, it must be up to date. As we hold
>    it locked, it cannot be reclaimed from memory. Hence we always
>    have valid data in the page cache while iterating the mapping.
> 4. Valid data in the page cache can exist when the underlying
>    mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
>    change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
>    change the data in the page - it only affects actions if we are
>    initialising a new page. Hence #3 applies and we don't care
>    about these extent map transitions racing with
>    iomap_page_mkwrite().
> 5. iomap_page_mkwrite() checks for page invalidation races
>    (truncate, hole punch, etc) after it locks the folio. We also
>    hold the mapping->invalidate_lock here, and hence the mapping
>    cannot change due to extent removal operations while we are
>    iterating the folio.
> 
> As such, filesystems that don't use bufferheads will never fail
> the iomap_folio_mkwrite_iter() operation on the current mapping,
> regardless of whether the iomap should be considered stale.
> 
> Further, the range we are asked to iterate is limited to the range
> inside EOF that the folio spans. Hence, for XFS, we will only map
> the exact range we are asked for, and we will only do speculative
> preallocation with delalloc if we are mapping a hole at the EOF
> page. The iterator will consume the entire range of the folio that
> is within EOF, and anything beyond the EOF block cannot be accessed.
> We never need to truncate this post-EOF speculative prealloc away in
> the context of the iomap_page_mkwrite() iterator because if it
> remains unused we'll remove it when the last reference to the inode
> goes away.

Why /do/ we need to trim the delalloc reservations after a failed
write(), anyway?  I gather it's because we don't want to end up with a
clean page backed by a delalloc reservation because writeback will never
get run to convert that reservation into real space, which means we've
leaked the reservation until someone dirties the page?

Ah.  Inode eviction also complains about inodes with delalloc
reservations.  The blockgc code only touches cow fork mappings and
post-eof blocks, which means it doesn't look for these dead/orphaned
delalloc reservations either.

But you're right that (non-bh) page_mkwrite won't ever fail, so
->iomap_end isn't necessary.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> Hence we don't actually need an .iomap_end() cleanup/error handling
> path at all for iomap_page_mkwrite() for XFS. This means we can
> separate the page fault processing from the complexity of the
> .iomap_end() processing in the buffered write path. This also means
> that the buffered write path will be able to take the
> mapping->invalidate_lock as necessary.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c  | 2 +-
>  fs/xfs/xfs_iomap.c | 9 +++++++++
>  fs/xfs/xfs_iomap.h | 1 +
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c6c80265c0b2..fee471ca9737 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1324,7 +1324,7 @@ __xfs_filemap_fault(
>  		if (write_fault) {
>  			xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  			ret = iomap_page_mkwrite(vmf,
> -					&xfs_buffered_write_iomap_ops);
> +					&xfs_page_mkwrite_iomap_ops);
>  			xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  		} else {
>  			ret = filemap_fault(vmf);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 07da03976ec1..5cea069a38b4 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1187,6 +1187,15 @@ const struct iomap_ops xfs_buffered_write_iomap_ops = {
>  	.iomap_end		= xfs_buffered_write_iomap_end,
>  };
>  
> +/*
> + * iomap_page_mkwrite() will never fail in a way that requires delalloc extents
> + * that it allocated to be revoked. Hence we do not need an .iomap_end method
> + * for this operation.
> + */
> +const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
> +	.iomap_begin		= xfs_buffered_write_iomap_begin,
> +};
> +
>  static int
>  xfs_read_iomap_begin(
>  	struct inode		*inode,
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index c782e8c0479c..0f62ab633040 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -47,6 +47,7 @@ xfs_aligned_fsb_count(
>  }
>  
>  extern const struct iomap_ops xfs_buffered_write_iomap_ops;
> +extern const struct iomap_ops xfs_page_mkwrite_iomap_ops;
>  extern const struct iomap_ops xfs_direct_write_iomap_ops;
>  extern const struct iomap_ops xfs_read_iomap_ops;
>  extern const struct iomap_ops xfs_seek_iomap_ops;
> -- 
> 2.37.2
> 


Thread overview: 42+ messages
2022-11-01  0:34 xfs, iomap: fix data corrupton due to stale cached iomaps Dave Chinner
2022-11-01  0:34 ` [PATCH 1/7] xfs: write page faults in iomap are not buffered writes Dave Chinner
2022-11-02  7:17   ` Christoph Hellwig
2022-11-02 16:12   ` Darrick J. Wong [this message]
2022-11-02 21:11     ` Dave Chinner
2022-11-01  0:34 ` [PATCH 2/7] xfs: punching delalloc extents on write failure is racy Dave Chinner
2022-11-02  7:18   ` Christoph Hellwig
2022-11-02 16:22   ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 3/7] xfs: use byte ranges for write cleanup ranges Dave Chinner
2022-11-02  7:20   ` Christoph Hellwig
2022-11-02 16:32   ` Darrick J. Wong
2022-11-04  5:40     ` Dave Chinner
2022-11-07 23:53       ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 4/7] xfs: buffered write failure should not truncate the page cache Dave Chinner
2022-11-01 11:57   ` kernel test robot
2022-11-02  7:24   ` Christoph Hellwig
2022-11-02 20:57     ` Dave Chinner
2022-11-02 16:41   ` Darrick J. Wong
2022-11-02 21:04     ` Dave Chinner
2022-11-02 22:26       ` Darrick J. Wong
2022-11-04  8:08   ` Christoph Hellwig
2022-11-04 23:10     ` Dave Chinner
2022-11-07 23:48       ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 5/7] iomap: write iomap validity checks Dave Chinner
2022-11-02  8:36   ` Christoph Hellwig
2022-11-02 16:43     ` Darrick J. Wong
2022-11-02 16:58       ` Darrick J. Wong
2022-11-03  0:35         ` Dave Chinner
2022-11-04  8:12           ` Christoph Hellwig
2022-11-02 16:57   ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 6/7] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
2022-11-01  9:15   ` kernel test robot
2022-11-02  8:41   ` Christoph Hellwig
2022-11-02 21:39     ` Dave Chinner
2022-11-04  8:14       ` Christoph Hellwig
2022-11-02 17:19   ` Darrick J. Wong
2022-11-02 22:36     ` Dave Chinner
2022-11-08  0:00       ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 7/7] xfs: drop write error injection is unfixable, remove it Dave Chinner
2022-11-01  3:39 ` xfs, iomap: fix data corrupton due to stale cached iomaps Darrick J. Wong
2022-11-01  4:21   ` Dave Chinner
2022-11-02 17:23     ` Darrick J. Wong
