linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: John Garry <john.g.garry@oracle.com>
Cc: djwong@kernel.org, hch@lst.de, viro@zeniv.linux.org.uk,
	brauner@kernel.org, jack@suse.cz, chandan.babu@oracle.com,
	axboe@kernel.dk, martin.petersen@oracle.com,
	linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, tytso@mit.edu, jbongio@google.com,
	ojaswin@linux.ibm.com, ritesh.list@gmail.com,
	linux-block@vger.kernel.org
Subject: Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing
Date: Thu, 7 Mar 2024 09:00:19 +1100	[thread overview]
Message-ID: <Zejnc+M32wRIutNZ@dread.disaster.area> (raw)
In-Reply-To: <20240304130428.13026-9-john.g.garry@oracle.com>

On Mon, Mar 04, 2024 at 01:04:22PM +0000, John Garry wrote:
> Set iomap->extent_shift when sub-extent zeroing is required.
> 
> We treat a sub-extent write same as an unaligned write, so we can leverage
> the existing sub-FSblock unaligned write support, i.e. try a shared lock
> with IOMAP_DIO_OVERWRITE_ONLY flag, if this fails then try the exclusive
> lock.
> 
> In xfs_iomap_write_unwritten(), FSB calcs are now based on the extsize.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c  | 28 +++++++++++++++-------------
>  fs/xfs/xfs_iomap.c | 15 +++++++++++++--
>  2 files changed, 28 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e33e5e13b95f..d0bd9d5f596c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -617,18 +617,19 @@ xfs_file_dio_write_aligned(
>   * Handle block unaligned direct I/O writes
>   *
>   * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
> - * them to be done in parallel with reads and other direct I/O writes.  However,
> - * if the I/O is not aligned to filesystem blocks, the direct I/O layer may need
> - * to do sub-block zeroing and that requires serialisation against other direct
> - * I/O to the same block.  In this case we need to serialise the submission of
> - * the unaligned I/O so that we don't get racing block zeroing in the dio layer.
> - * In the case where sub-block zeroing is not required, we can do concurrent
> - * sub-block dios to the same block successfully.
> + * them to be done in parallel with reads and other direct I/O writes.
> + * However if the I/O is not aligned to filesystem blocks/extent, the direct
> + * I/O layer may need to do sub-block/extent zeroing and that requires
> + * serialisation against other direct I/O to the same block/extent.  In this
> + * case we need to serialise the submission of the unaligned I/O so that we
> + * don't get racing block/extent zeroing in the dio layer.
> + * In the case where sub-block/extent zeroing is not required, we can do
> + * concurrent sub-block/extent dios to the same block/extent successfully.
>   *
>   * Optimistically submit the I/O using the shared lock first, but use the
>   * IOMAP_DIO_OVERWRITE_ONLY flag to tell the lower layers to return -EAGAIN
> - * if block allocation or partial block zeroing would be required.  In that case
> - * we try again with the exclusive lock.
> + * if block/extent allocation or partial block/extent zeroing would be
> + * required.  In that case we try again with the exclusive lock.
>   */
>  static noinline ssize_t
>  xfs_file_dio_write_unaligned(
> @@ -643,9 +644,9 @@ xfs_file_dio_write_unaligned(
>  	ssize_t			ret;
>  
>  	/*
> -	 * Extending writes need exclusivity because of the sub-block zeroing
> -	 * that the DIO code always does for partial tail blocks beyond EOF, so
> -	 * don't even bother trying the fast path in this case.
> +	 * Extending writes need exclusivity because of the sub-block/extent
> +	 * zeroing that the DIO code always does for partial tail blocks
> +	 * beyond EOF, so don't even bother trying the fast path in this case.
>  	 */
>  	if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>  		if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -709,13 +710,14 @@ xfs_file_dio_write(
>  	struct iov_iter		*from)
>  {
>  	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
>  	size_t			count = iov_iter_count(from);
>  
>  	/* direct I/O must be aligned to device logical sector size */
>  	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
> -	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
> +	if ((iocb->ki_pos | count) & (XFS_FSB_TO_B(mp, xfs_get_extsz(ip)) - 1))
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from);
>  }
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 70fe873951f3..88cc20bb19c9 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -98,6 +98,7 @@ xfs_bmbt_to_iomap(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
>  
>  	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
>  		return xfs_alert_fsblock_zero(ip, imap);
> @@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
>  
>  	iomap->validity_cookie = sequence_cookie;
>  	iomap->folio_ops = &xfs_iomap_folio_ops;
> +	if (extsz > 1)
> +		iomap->extent_shift = ffs(extsz) - 1;

	iomap->extent_size = mp->m_bsize;
	if (xfs_inode_has_force_align(ip))
		iomap->extent_size *= ip->i_extsize;

> @@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
>  	xfs_fsize_t	i_size;
>  	uint		resblks;
>  	int		error;
> +	xfs_extlen_t	extsz = xfs_get_extsz(ip);
>  
>  	trace_xfs_unwritten_convert(ip, offset, count);
>  
> -	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> -	count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> +	if (extsz > 1) {
> +		xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
> +
> +		offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
> +		count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
> +	} else {
> +		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +		count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
> +	}

I don't think this is correct. We should only be converting the
extent when the entire range has had data written to it. If we are
doing unaligned writes, we end up running 3 separate unwritten
conversion transactions - the leading zeroing, the data written and
the trailing zeroing.

This will end up converting the entire range to written when the
leading zeroing completes, exposing stale data until the data and
trailing zeroing completes.

Concurrent reads (both DIO and buffered) can see this stale data
while the write is in progress, leading to a mechanism where a user
can issue sub-atomic write range IO and concurrent overlapping reads
to read arbitrary stale data from the disk just before it is
overwritten.

I suspect the only way to fix this for sub-force-aligned DIo writes
if for iomap_dio_bio_iter() to chain the zeroing and data bios so
the entire range gets a single completion run on it instead of three
separate sub-aligned extent IO completions. We only need to do this
in the zeroing case - this is already the DIo slow path, so
additional submission overhead is not an issue. It would, however,
reduce completion overhead and latency, as we only need to run a
single extent conversion instead of 3, so chaining the bios on
aligned writes may well be a net win...

Thoughts? Christoph probably needs to weigh in on this one...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2024-03-06 22:00 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-04 13:04 [PATCH v2 00/14] block atomic writes for XFS John Garry
2024-03-04 13:04 ` [PATCH v2 01/14] block: Add blk_validate_atomic_write_op_size() John Garry
2024-03-04 13:04 ` [PATCH v2 02/14] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
2024-03-04 22:15   ` Dave Chinner
2024-03-05 13:36     ` John Garry
2024-03-04 13:04 ` [PATCH v2 03/14] fs: xfs: Introduce FORCEALIGN inode flag John Garry
2024-03-04 13:04 ` [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
2024-03-05  0:44   ` Dave Chinner
2024-03-05 15:22     ` John Garry
2024-03-05 22:18       ` Dave Chinner
2024-03-06  9:41         ` John Garry
2024-03-04 13:04 ` [PATCH v2 05/14] fs: xfs: Enable file data forcealign feature John Garry
2024-03-04 13:04 ` [PATCH v2 06/14] fs: xfs: Do not free EOF blocks for forcealign John Garry
2024-03-06 21:07   ` Dave Chinner
2024-03-07 11:38     ` John Garry
2024-03-04 13:04 ` [PATCH v2 07/14] fs: iomap: Sub-extent zeroing John Garry
2024-03-06 21:14   ` Dave Chinner
2024-03-07 11:51     ` John Garry
2024-03-04 13:04 ` [PATCH v2 08/14] fs: xfs: " John Garry
2024-03-06 22:00   ` Dave Chinner [this message]
2024-03-07 12:57     ` John Garry
2024-03-04 13:04 ` [PATCH v2 09/14] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
2024-03-04 13:04 ` [PATCH v2 10/14] fs: iomap: Atomic write support John Garry
2024-03-04 13:04 ` [PATCH v2 11/14] fs: xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
2024-03-06 21:43   ` Dave Chinner
2024-03-07 12:42     ` John Garry
2024-03-04 13:04 ` [PATCH v2 12/14] fs: xfs: Support atomic write for statx John Garry
2024-03-06 21:31   ` Dave Chinner
2024-03-07 10:35     ` John Garry
2024-03-04 13:04 ` [PATCH v2 13/14] fs: xfs: Validate atomic writes John Garry
2024-03-06 21:22   ` Dave Chinner
2024-03-07 10:19     ` John Garry
2024-03-04 13:04 ` [PATCH v2 14/14] fs: xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
2024-03-06 21:33   ` Dave Chinner
2024-03-07 11:55     ` John Garry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zejnc+M32wRIutNZ@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=axboe@kernel.dk \
    --cc=brauner@kernel.org \
    --cc=chandan.babu@oracle.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jbongio@google.com \
    --cc=john.g.garry@oracle.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).