From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
Date: Mon, 14 Jan 2019 08:49:05 +1100
Message-ID: <20190113214905.GB4205@dastard>
In-Reply-To: <20190111123032.31538-4-bfoster@redhat.com>

On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
> 
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
> 
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
> 
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 8 ++++++--
>  fs/xfs/xfs_iomap.c | 4 ++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..33a1be5df99f 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>  struct xfs_writepage_ctx {
>  	struct xfs_bmbt_irec    imap;
>  	unsigned int		io_type;
> +	unsigned int		data_seq;
>  	unsigned int		cow_seq;
>  	struct xfs_ioend	*ioend;
>  };
> @@ -347,7 +348,8 @@ xfs_map_blocks(
>  	 * out that ensures that we always see the current value.
>  	 */
>  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
>  	if (imap_valid &&
>  	    (!xfs_inode_has_cow_data(ip) ||
>  	     wpc->io_type == XFS_IO_COW ||

I suspect this next "if (imap_valid) ..." logic needs to be updated,
too. i.e. the next line is checking if the cow_seq has not changed.

i.e. I think wrapping this up in a helper (again!) might make more
sense:

static bool
xfs_imap_valid(
	struct xfs_inode	*ip,
	struct xfs_writepage_ctx *wpc,
	xfs_fileoff_t		offset_fsb)
{
	if (offset_fsb < wpc->imap.br_startoff)
		return false;
	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
		return false;
	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (!xfs_inode_has_cow_data(ip))
		return true;
	if (wpc->io_type != XFS_IO_COW)
		return true;
	if (wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;
	return true;
}

and then put the shutdown check before we check the map for validity
(i.e. don't continue to write to the cached map after a shutdown has
been triggered):

	if (XFS_FORCED_SHUTDOWN(mp))
		return -EIO;

	if (xfs_imap_valid(ip, wpc, offset_fsb))
		return 0;


> @@ -417,6 +419,7 @@ xfs_map_blocks(
>  	 */
>  	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
>  		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
> +	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
>  	if (imap.br_startoff > offset_fsb) {
> @@ -454,7 +457,8 @@ xfs_map_blocks(
>  	return 0;
>  allocate_blocks:
>  	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> -			&wpc->cow_seq);
> +			whichfork == XFS_COW_FORK ?
> +					 &wpc->cow_seq : &wpc->data_seq);
>  	if (error)
>  		return error;
>  	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 27c93b5f029d..0401e33d4e8f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
>  	int		whichfork,
>  	xfs_off_t	offset,
>  	xfs_bmbt_irec_t *imap,
> -	unsigned int	*cow_seq)
> +	unsigned int	*seq)
>  {
>  	xfs_mount_t	*mp = ip->i_mount;
>  	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
>  				goto error0;
>  
>  			if (whichfork == XFS_COW_FORK)
> -				*cow_seq = READ_ONCE(ifp->if_seq);
> +				*seq = READ_ONCE(ifp->if_seq);
>  			xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		}

One of the things that limits xfs_iomap_write_allocate() efficiency
is the mitigation for races against truncate, i.e. the huge comment that
starts:

	       /*
		* it is possible that the extents have changed since
		* we did the read call as we dropped the ilock for a
		* while. We have to be careful about truncates or hole
		* punchs here - we are not allowed to allocate
		* non-delalloc blocks here.
....

Now that we can detect that the extents have changed in the data
fork, we can go back to allocating multiple extents per
xfs_bmapi_write() call by doing a sequence number check after we
lock the inode. If the sequence number does not match what was
passed in or returned from the previous loop, we return -EAGAIN.
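
i.e. something like this right after we retake the ilock in the
allocation loop (completely untested sketch, and I'm going from memory
on the cancel label name):

	xfs_ilock(ip, XFS_ILOCK_EXCL);

	/*
	 * If the fork has changed while we were unlocked, the caller's
	 * cached mapping is stale. Punt straight back to writeback
	 * instead of trying to trim the map down to a single extent.
	 */
	if (*seq != READ_ONCE(ifp->if_seq)) {
		error = -EAGAIN;
		goto trans_cancel;	/* existing cancel/unlock path */
	}

	/* fork unchanged - safe to allocate the whole mapping in one go */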

Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
handled correctly by xfs_map_blocks() anymore. i.e. it just returns
the error, which can lead to discarding the page rather than checking
to see if there was a valid map allocated. I think there's some
followup work here (another patch series). :/
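
e.g. something along these lines for the allocate_blocks case
(hand-waving, untested - just to show what I mean by checking the
returned map before tossing the page):

	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
			whichfork == XFS_COW_FORK ?
					&wpc->cow_seq : &wpc->data_seq);
	if (error == -EAGAIN &&
	    imap.br_startoff <= offset_fsb &&
	    offset_fsb < imap.br_startoff + imap.br_blockcount) {
		/*
		 * The fork changed underneath us but the map we got back
		 * still covers this page, so keep going with it rather
		 * than failing the writeback and discarding the page.
		 */
		error = 0;
	}
	if (error)
		return error;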

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
