Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten

From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org, hch@infradead.org
Subject: Re: [PATCH 1/2] xfs: force writes to delalloc regions to unwritten
Date: Mon, 20 Jan 2020 07:49:25 +1100
Message-ID: <20200119204925.GC9407@dread.disaster.area>
In-Reply-To: <157915535059.2406747.264640456606868955.stgit@magnolia>

On Wed, Jan 15, 2020 at 10:15:50PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When writing to a delalloc region in the data fork, commit the new
> allocations (of the da reservation) as unwritten so that the mappings
> are only marked written once writeback completes successfully.  This
> fixes the problem of stale data exposure if the system goes down during
> targeted writeback of a specific region of a file, as tested by
> generic/042.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4544732d09a5..220ea1dc67ab 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4190,17 +4190,7 @@ xfs_bmapi_allocate(
>  	bma->got.br_blockcount = bma->length;
>  	bma->got.br_state = XFS_EXT_NORM;
>  
> -	/*
> -	 * In the data fork, a wasdelay extent has been initialized, so
> -	 * shouldn't be flagged as unwritten.
> -	 *
> -	 * For the cow fork, however, we convert delalloc reservations
> -	 * (extents allocated for speculative preallocation) to
> -	 * allocated unwritten extents, and only convert the unwritten
> -	 * extents to real extents when we're about to write the data.
> -	 */
> -	if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) &&
> -	    (bma->flags & XFS_BMAPI_PREALLOC))
> +	if (bma->flags & XFS_BMAPI_PREALLOC)
>  		bma->got.br_state = XFS_EXT_UNWRITTEN;
>  
>  	if (bma->wasdel)
> @@ -4608,8 +4598,24 @@ xfs_bmapi_convert_delalloc(
>  	bma.offset = bma.got.br_startoff;
>  	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
>  	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
> +
> +	/*
> +	 * When we're converting the delalloc reservations backing dirty pages
> +	 * in the page cache, we must be careful about how we create the new
> +	 * extents:
> +	 *
> +	 * New CoW fork extents are created unwritten, turned into real extents
> +	 * when we're about to write the data to disk, and mapped into the data
> +	 * fork after the write finishes.  End of story.
> +	 *
> +	 * New data fork extents must be mapped in as unwritten and converted
> +	 * to real extents after the write succeeds to avoid exposing stale
> +	 * disk contents if we crash.
> +	 */
>  	if (whichfork == XFS_COW_FORK)
>  		bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> +	else
> +		bma.flags = XFS_BMAPI_PREALLOC;

	bma.flags = XFS_BMAPI_PREALLOC;
	if (whichfork == XFS_COW_FORK)
		bma.flags |= XFS_BMAPI_COWFORK;
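
For anyone who wants to convince themselves that the two forms set the
same flags, here is a standalone userspace sketch, not kernel code. The
DEMO_* flag values, the enum and the function names are stand-ins made
up for illustration; the real definitions live under fs/xfs/libxfs/.

```c
#include <assert.h>

/* Stand-in values for illustration only; NOT the kernel's definitions. */
#define DEMO_BMAPI_PREALLOC	0x1u
#define DEMO_BMAPI_COWFORK	0x2u

enum demo_fork { DEMO_DATA_FORK, DEMO_COW_FORK };

/* Form used in the patch: if/else, each branch assigns the full flag set. */
static inline unsigned int demo_flags_patch(enum demo_fork whichfork)
{
	unsigned int flags;

	if (whichfork == DEMO_COW_FORK)
		flags = DEMO_BMAPI_COWFORK | DEMO_BMAPI_PREALLOC;
	else
		flags = DEMO_BMAPI_PREALLOC;
	return flags;
}

/* Suggested form: PREALLOC is unconditional, COWFORK is OR-ed in on top. */
static inline unsigned int demo_flags_suggested(enum demo_fork whichfork)
{
	unsigned int flags = DEMO_BMAPI_PREALLOC;

	if (whichfork == DEMO_COW_FORK)
		flags |= DEMO_BMAPI_COWFORK;
	return flags;
}

/* Both forms must agree for every fork value handled above. */
static inline void demo_check_equivalence(void)
{
	assert(demo_flags_patch(DEMO_DATA_FORK) ==
	       demo_flags_suggested(DEMO_DATA_FORK));
	assert(demo_flags_patch(DEMO_COW_FORK) ==
	       demo_flags_suggested(DEMO_COW_FORK));
}
```

The second form makes it obvious at a glance that PREALLOC is now set
for every delalloc conversion, which is the whole point of the patch.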

However, I'm still not convinced that this is the right/best
solution to the problem. It is the easiest, yes, but the downside
on fast/high-IOPS storage and/or under low memory conditions has
the potential to be extremely significant.

I suspect that heavy users of buffered O_DSYNC writes into sparse
files are going to notice this the most - there are databases out
there that work this way. And I suspect that most of the workloads
that use buffered O_DSYNC IO heavily won't see this change for years
as enterprise upgrade cycles are notoriously slow.
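
For concreteness, the access pattern in question looks roughly like the
userspace sketch below. The function name, the buffer limit and the
parameters are hypothetical, chosen only for illustration. Every write
lands in a hole of a sparse file and has to reach stable storage before
pwrite() returns, so with this patch each one would presumably go
through an unwritten allocation followed by a conversion at I/O
completion.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical illustration: issue `count` buffered O_DSYNC writes of
 * `bufsize` bytes each, spaced `stride` bytes apart, so that (for
 * stride > bufsize) each write lands in a hole of a sparse file.
 * Returns 0 on success, -1 on error.
 */
static inline int demo_sparse_dsync_writes(const char *path, int count,
					   size_t bufsize, off_t stride)
{
	char buf[4096];
	int fd, i;

	assert(path != NULL);
	if (bufsize > sizeof(buf))
		return -1;
	memset(buf, 0xab, bufsize);

	/* No O_DIRECT: the data goes through the page cache (buffered IO). */
	fd = open(path, O_CREAT | O_TRUNC | O_WRONLY | O_DSYNC, 0600);
	if (fd < 0)
		return -1;

	for (i = 0; i < count; i++) {
		/* O_DSYNC: returns only once this range is on stable storage. */
		if (pwrite(fd, buf, bufsize, (off_t)i * stride) !=
		    (ssize_t)bufsize) {
			close(fd);
			return -1;
		}
	}
	return close(fd);
}
```

Something like demo_sparse_dsync_writes("testfile", 1000, 4096, 1 << 20)
on a scratch filesystem, timed before and after the patch, would be one
way to get a first idea of the per-write cost.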

IOWs, all I see this change doing is kicking the can down the road
and guaranteeing that we'll still have to solve this stale data
exposure problem more efficiently in the future. And instead of
doing it now when we have the time and freedom to do the work, it
will have to be done urgently under high priority escalation
pressures...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com