From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
Date: Mon, 14 Jan 2019 08:49:05 +1100
Message-ID: <20190113214905.GB4205@dastard>
In-Reply-To: <20190111123032.31538-4-bfoster@redhat.com>

On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
> 
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
> 
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
> 
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 8 ++++++--
>  fs/xfs/xfs_iomap.c | 4 ++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..33a1be5df99f 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>  struct xfs_writepage_ctx {
>  	struct xfs_bmbt_irec    imap;
>  	unsigned int		io_type;
> +	unsigned int		data_seq;
>  	unsigned int		cow_seq;
>  	struct xfs_ioend	*ioend;
>  };
> @@ -347,7 +348,8 @@ xfs_map_blocks(
>  	 * out that ensures that we always see the current value.
>  	 */
>  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
>  	if (imap_valid &&
>  	    (!xfs_inode_has_cow_data(ip) ||
>  	     wpc->io_type == XFS_IO_COW ||

I suspect this next "if (imap_valid) ..." logic needs to be updated,
too. i.e. the next line is checking if the cow_seq has not changed.

i.e. I think wrapping this up in a helper (again!) might make more
sense:

static bool
xfs_imap_valid(
	struct xfs_inode	*ip,
	struct xfs_writepage_ctx *wpc,
	xfs_fileoff_t		offset_fsb)
{
	if (offset_fsb < wpc->imap.br_startoff)
		return false;
	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
		return false;
	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (!xfs_inode_has_cow_data(ip))
		return true;
	if (wpc->io_type != XFS_IO_COW)
		return true;
	if (wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;
	return true;
}

and then put the shutdown check before we check the map for validity
(i.e. don't continue to write to the cached map after a shutdown has
been triggered):

	if (XFS_FORCED_SHUTDOWN(mp))
		return -EIO;

	if (xfs_imap_valid(ip, wpc, offset_fsb))
		return 0;


> @@ -417,6 +419,7 @@ xfs_map_blocks(
>  	 */
>  	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
>  		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
> +	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
>  	if (imap.br_startoff > offset_fsb) {
> @@ -454,7 +457,8 @@ xfs_map_blocks(
>  	return 0;
>  allocate_blocks:
>  	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> -			&wpc->cow_seq);
> +			whichfork == XFS_COW_FORK ?
> +					 &wpc->cow_seq : &wpc->data_seq);
>  	if (error)
>  		return error;
>  	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 27c93b5f029d..0401e33d4e8f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
>  	int		whichfork,
>  	xfs_off_t	offset,
>  	xfs_bmbt_irec_t *imap,
> -	unsigned int	*cow_seq)
> +	unsigned int	*seq)
>  {
>  	xfs_mount_t	*mp = ip->i_mount;
>  	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
>  				goto error0;
>  
>  			if (whichfork == XFS_COW_FORK)
> -				*cow_seq = READ_ONCE(ifp->if_seq);
> +				*seq = READ_ONCE(ifp->if_seq);
>  			xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		}

One of the things that limits xfs_iomap_write_allocate() efficiency
is the mitigation for races against truncate, i.e. the huge comment that
starts:

	       /*
		* it is possible that the extents have changed since
		* we did the read call as we dropped the ilock for a
		* while. We have to be careful about truncates or hole
		* punchs here - we are not allowed to allocate
		* non-delalloc blocks here.
....

Now that we can detect that the extents have changed in the data
fork, we can go back to allocating multiple extents per
xfs_bmapi_write() call by doing a sequence number check after we
lock the inode. If the sequence number does not match what was
passed in or returned from the previous loop, we return -EAGAIN.
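
i.e. something like this right after we retake the ilock in the
allocation loop (completely untested sketch, and I'm going from memory
on the cancel label name):

	xfs_ilock(ip, XFS_ILOCK_EXCL);

	/*
	 * If the fork has changed while we were unlocked, the caller's
	 * cached mapping is stale. Punt straight back to writeback
	 * instead of trying to trim the map down to a single extent.
	 */
	if (*seq != READ_ONCE(ifp->if_seq)) {
		error = -EAGAIN;
		goto trans_cancel;	/* existing cancel/unlock path */
	}

	/* fork unchanged - safe to allocate the whole mapping in one go */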

Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
handled correctly by xfs_map_blocks() anymore. i.e. it just returns
the error, which can lead to discarding the page rather than checking
to see if there was a valid map allocated. I think there's some
followup work here (another patch series). :/
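
e.g. something along these lines for the allocate_blocks case
(hand-waving, untested - just to show what I mean by checking the
returned map before tossing the page):

	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
			whichfork == XFS_COW_FORK ?
					&wpc->cow_seq : &wpc->data_seq);
	if (error == -EAGAIN &&
	    imap.br_startoff <= offset_fsb &&
	    offset_fsb < imap.br_startoff + imap.br_blockcount) {
		/*
		 * The fork changed underneath us but the map we got back
		 * still covers this page, so keep going with it rather
		 * than failing the writeback and discarding the page.
		 */
		error = 0;
	}
	if (error)
		return error;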

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
