Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing

From: Dave Chinner <david@fromorbit.com>
To: John Garry <john.g.garry@oracle.com>
Cc: djwong@kernel.org, hch@lst.de, viro@zeniv.linux.org.uk,
	brauner@kernel.org, jack@suse.cz, chandan.babu@oracle.com,
	axboe@kernel.dk, martin.petersen@oracle.com,
	linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, tytso@mit.edu, jbongio@google.com,
	ojaswin@linux.ibm.com, ritesh.list@gmail.com,
	linux-block@vger.kernel.org
Subject: Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing
Date: Thu, 7 Mar 2024 09:00:19 +1100	[thread overview]
Message-ID: <Zejnc+M32wRIutNZ@dread.disaster.area> (raw)
In-Reply-To: <20240304130428.13026-9-john.g.garry@oracle.com>

On Mon, Mar 04, 2024 at 01:04:22PM +0000, John Garry wrote:
> Set iomap->extent_shift when sub-extent zeroing is required.
> 
> We treat a sub-extent write same as an unaligned write, so we can leverage
> the existing sub-FSblock unaligned write support, i.e. try a shared lock
> with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
> lock.
> 
> In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c  | 28 +++++++++++++++-------------
>  fs/xfs/xfs_iomap.c | 15 +++++++++++++--
>  2 files changed, 28 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e33e5e13b95f..d0bd9d5f596c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -617,18 +617,19 @@ xfs_file_dio_write_aligned(
>   * Handle block unaligned direct I/O writes
>   *
>   * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
> - * them to be done in parallel with reads and other direct I/O writes.  However,
> - * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
> - * to do sub-block zeroing and that requires serialisation against other direct
> - * I/O to the same block.  In this case we need to serialise the submission of
> - * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
> - * In the case where sub-block zeroing is not required, we can do concurrent
> - * sub-block dios to the same block successfully.
> + * them to be done in parallel with reads and other direct I/O writes.
> + * However if the I/O is not aligned to filesystem blocks/extent, the direct
> + * I/O layer may need to do sub-block/extent zeroing and that requires
> + * serialisation against other direct I/O to the same block/extent.  In this
> + * case we need to serialise the submission of the unaligned I/O so that we
> + * don't get racing block/extent zeroing in the dio layer.
> + * In the case where sub-block/extent zeroing is not required, we can do
> + * concurrent sub-block/extent dios to the same block/extent successfully.
>   *
>   * Optimistically submit the I/O using the shared lock first, but use the
>   * IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
> - * if block allocation or partial block zeroing would be required.  In that case
> - * we try again with the exclusive lock.
> + * if block/extent allocation or partial block/extent zeroing would be
> + * required.  In that case we try again with the exclusive lock.
>   */
>  static noinline ssize_t
>  xfs_file_dio_write_unaligned(
> @@ -643,9 +644,9 @@ xfs_file_dio_write_unaligned(
>  	ssize_t			ret;
>  
>  	/*
> -	 * Extending writes need exclusivity because of the sub-block zeroing
> -	 * that the DIO code always does for partial tail blocks beyond EOF, so
> -	 * don't even bother trying the fast path in this case.
> +	 * Extending writes need exclusivity because of the sub-block/extent
> +	 * zeroing that the DIO code always does for partial tail blocks
> +	 * beyond EOF, so don't even bother trying the fast path in this case.
>  	 */
>  	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>  		if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -709,13 +710,14 @@ xfs_file_dio_write(
>  	struct iov_iter		*from)
>  {
>  	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
>  	size_t			count = iov_iter_count(from);
>  
>  	/* direct I/O must be aligned to device logical sector size */
>  	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
> -	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
> +	if ((iocb->ki_pos | count) & (XFS_FSB_TO_B(mp, xfs_get_extsz(ip)) - 1))
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from);
>  }
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 70fe873951f3..88cc20bb19c9 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -98,6 +98,7 @@ xfs_bmbt_to_iomap(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
>  
>  	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
>  		return xfs_alert_fsblock_zero(ip, imap);
> @@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
>  
>  	iomap->validity_cookie = sequence_cookie;
>  	iomap->folio_ops = &xfs_iomap_folio_ops;
> +	if (extsz > 1)
> +		iomap->extent_shift = ffs(extsz) - 1;

	iomap->extent_size = mp->m_bsize;
	if (xfs_inode_has_force_align(ip))
		iomap->extent_size *= ip->i_extsize;

> @@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
>  	xfs_fsize_t	i_size;
>  	uint		resblks;
>  	int		error;
> +	xfs_extlen_t	extsz = xfs_get_extsz(ip);
>  
>  	trace_xfs_unwritten_convert(ip, offset, count);
>  
> -	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> -	count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> +	if (extsz > 1) {
> +		xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
> +
> +		offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
> +		count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
> +	} else {
> +		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +		count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> +	}

I don't think this is correct. We should only be converting the
extent when the entire range has had data written to it. If we are
doing unaligned writes, we end up running 3 separate unwritten
conversion transactions - the leading zeroing, the data written and
the trailing zeroing.

This will end up converting the entire range to written when the
leading zeroing completes, exposing stale data until the data and
trailing zeroing completes.

Concurrent reads (both DIO and buffered) can see this stale data
while the write is in progress, leading to a mechanism where a user
can issue sub-atomic write range IO and concurrent overlapping reads
to read arbitrary stale data from the disk just before it is
overwritten.

I suspect the only way to fix this for sub-force-aligned DIo writes
if for iomap_dio_bio_iter() to chain the zeroing and data bios so
the entire range gets a single completion run on it instead of three
separate sub-aligned extent IO completions. We only need to do this
in the zeroing case - this is already the DIo slow path, so
additional submission overhead is not an issue. It would, however,
reduce completion overhead and latency, as we only need to run a
single extent conversion instead of 3, so chaining the bios on
aligned writes may well be a net win...

Thoughts? Christoph probably needs to weigh in on this one...

-Dave.
-- 
Dave Chinner
david@fromorbit.com