All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: properly serialise fallocate against AIO+DIO
@ 2019-10-29  3:48 Dave Chinner
  2019-10-29  4:02 ` Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Dave Chinner @ 2019-10-29  3:48 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

AIO+DIO can extend the file size on IO completion, and it holds
no inode locks while the IO is in flight. Therefore, a race
condition exists in file size updates if we do something like this:

aio-thread			fallocate-thread

lock inode
submit IO beyond inode->i_size
unlock inode
.....
				lock inode
				break layouts
				if (off + len > inode->i_size)
					new_size = off + len
				.....
				inode_dio_wait()
				<blocks>
.....
completes
inode->i_size updated
inode_dio_done()
....
				<wakes>
				<does stuff no long beyond EOF>
				if (new_size)
					xfs_vn_setattr(inode, new_size)


Yup, that attempt to extend the file size in the fallocate code
turns into a truncate - it removes the whatever the aio write
allocated and put to disk, and reduced the inode size back down to
where the fallocate operation ends.

Fundamentally, xfs_file_fallocate()  not compatible with racing
AIO+DIO completions, so we need to move the inode_dio_wait() call
up to where the lock the inode and break the layouts.

Secondly, storing the inode size and then using it unchecked without
holding the ILOCK is not safe; we can only do such a thing if we've
locked out and drained all IO and other modification operations,
which we don't do initially in xfs_file_fallocate.

It should be noted that some of the fallocate operations are
compound operations - they are made up of multiple manipulations
that may zero data, and so we may need to flush and invalidate the
file multiple times during an operation. However, we only need to
lock out IO and other space manipulation operations once, as that
lockout is maintained until the entire fallocate operation has been
completed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c |  8 +-------
 fs/xfs/xfs_file.c      | 30 ++++++++++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c     |  1 +
 3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fb31d7d6701e..dea68308fb64 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1040,6 +1040,7 @@ xfs_unmap_extent(
 	goto out_unlock;
 }
 
+/* Caller must first wait for the completion of any pending DIOs if required. */
 int
 xfs_flush_unmap_range(
 	struct xfs_inode	*ip,
@@ -1051,9 +1052,6 @@ xfs_flush_unmap_range(
 	xfs_off_t		rounding, start, end;
 	int			error;
 
-	/* wait for the completion of any pending DIOs */
-	inode_dio_wait(inode);
-
 	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
 	start = round_down(offset, rounding);
 	end = round_up(offset + len, rounding) - 1;
@@ -1085,10 +1083,6 @@ xfs_free_file_space(
 	if (len <= 0)	/* if nothing being freed */
 		return 0;
 
-	error = xfs_flush_unmap_range(ip, offset, len);
-	if (error)
-		return error;
-
 	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
 	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 525b29b99116..865543e41fb4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -817,6 +817,36 @@ xfs_file_fallocate(
 	if (error)
 		goto out_unlock;
 
+	/*
+	 * Must wait for all AIO to complete before we continue as AIO can
+	 * change the file size on completion without holding any locks we
+	 * currently hold. We must do this first because AIO can update both
+	 * the on disk and in memory inode sizes, and the operations that follow
+	 * require the in-memory size to be fully up-to-date.
+	 */
+	inode_dio_wait(inode);
+
+	/*
+	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
+	 * the cached range over the first operation we are about to run.
+	 *
+	 * We care about zero and collapse here because they both run a hole
+	 * punch over the range first. Because that can zero data, and the range
+	 * of invalidation for the shift operations is much larger, we still do
+	 * the required flush for collapse in xfs_prepare_shift().
+	 *
+	 * Insert has the same range requirements as collapse, and we extend the
+	 * file first which can zero data. Hence insert has the same
+	 * flush/invalidate requirements as collapse and so they are both
+	 * handled at the right time by xfs_prepare_shift().
+	 */
+	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
+		    FALLOC_FL_COLLAPSE_RANGE)) {
+		error = xfs_flush_unmap_range(ip, offset, len);
+		if (error)
+			goto out_unlock;
+	}
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 287f83eb791f..800f07044636 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -623,6 +623,7 @@ xfs_ioc_space(
 	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
 	if (error)
 		goto out_unlock;
+	inode_dio_wait(inode);
 
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29  3:48 [PATCH] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
@ 2019-10-29  4:02 ` Dave Chinner
  2019-10-29  4:19 ` Darrick J. Wong
  2019-10-29 20:23 ` [RFC PATCH] generic: test race between appending AIO DIO and fallocate Darrick J. Wong
  2 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2019-10-29  4:02 UTC (permalink / raw)
  To: linux-xfs

On Tue, Oct 29, 2019 at 02:48:50PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> AIO+DIO can extend the file size on IO completion, and it holds
> no inode locks while the IO is in flight. Therefore, a race
> condition exists in file size updates if we do something like this:
> 
> aio-thread			fallocate-thread
> 
> lock inode
> submit IO beyond inode->i_size
> unlock inode
> .....
> 				lock inode
> 				break layouts
> 				if (off + len > inode->i_size)
> 					new_size = off + len
> 				.....
> 				inode_dio_wait()
> 				<blocks>
> .....
> completes
> inode->i_size updated
> inode_dio_done()
> ....
> 				<wakes>
> 				<does stuff no long beyond EOF>
> 				if (new_size)
> 					xfs_vn_setattr(inode, new_size)
> 
> 
> Yup, that attempt to extend the file size in the fallocate code
> turns into a truncate - it removes the whatever the aio write
> allocated and put to disk, and reduced the inode size back down to
> where the fallocate operation ends.
> 
> Fundamentally, xfs_file_fallocate()  not compatible with racing
> AIO+DIO completions, so we need to move the inode_dio_wait() call
> up to where the lock the inode and break the layouts.
> 
> Secondly, storing the inode size and then using it unchecked without
> holding the ILOCK is not safe; we can only do such a thing if we've
> locked out and drained all IO and other modification operations,
> which we don't do initially in xfs_file_fallocate.
> 
> It should be noted that some of the fallocate operations are
> compound operations - they are made up of multiple manipulations
> that may zero data, and so we may need to flush and invalidate the
> file multiple times during an operation. However, we only need to
> lock out IO and other space manipulation operations once, as that
> lockout is maintained until the entire fallocate operation has been
> completed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Just a note for anyone considering backporting this patch - I
intentionally did not mark it for stable kernel backports because it
is not just a simple backport.

i.e. while it may apply cleanly to older kernels, this is based on
the current xfs for-next tree and so is based on Christoph's
xfs_ioc_space() redirection to fallocate() patch set. That means the
changes to xfs_ioc_space() are tiny and trivial. Backporting this to
kernels that don't have Christoph's patch set will require adding
all the flush/invalidation calls that I added to
xfs_file_fallocate().

i.e. all the older XFS_IOC_UNRESVSP, XFS_IOC_ZERO_RANGE, etc
interfaces look to have the same serialisation problem against
AIO-DIO writes, and so older kernels will need them fixed as well.
That code is not in this patch.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29  3:48 [PATCH] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
  2019-10-29  4:02 ` Dave Chinner
@ 2019-10-29  4:19 ` Darrick J. Wong
  2019-10-29  4:41   ` Dave Chinner
  2019-10-29 20:23 ` [RFC PATCH] generic: test race between appending AIO DIO and fallocate Darrick J. Wong
  2 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2019-10-29  4:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 29, 2019 at 02:48:50PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> AIO+DIO can extend the file size on IO completion, and it holds
> no inode locks while the IO is in flight. Therefore, a race
> condition exists in file size updates if we do something like this:
> 
> aio-thread			fallocate-thread
> 
> lock inode
> submit IO beyond inode->i_size
> unlock inode
> .....
> 				lock inode
> 				break layouts
> 				if (off + len > inode->i_size)
> 					new_size = off + len
> 				.....
> 				inode_dio_wait()
> 				<blocks>
> .....
> completes
> inode->i_size updated
> inode_dio_done()
> ....
> 				<wakes>
> 				<does stuff no long beyond EOF>
> 				if (new_size)
> 					xfs_vn_setattr(inode, new_size)
> 
> 
> Yup, that attempt to extend the file size in the fallocate code
> turns into a truncate - it removes the whatever the aio write
> allocated and put to disk, and reduced the inode size back down to
> where the fallocate operation ends.
> 
> Fundamentally, xfs_file_fallocate()  not compatible with racing
> AIO+DIO completions, so we need to move the inode_dio_wait() call
> up to where the lock the inode and break the layouts.
> 
> Secondly, storing the inode size and then using it unchecked without
> holding the ILOCK is not safe; we can only do such a thing if we've
> locked out and drained all IO and other modification operations,
> which we don't do initially in xfs_file_fallocate.
> 
> It should be noted that some of the fallocate operations are
> compound operations - they are made up of multiple manipulations
> that may zero data, and so we may need to flush and invalidate the
> file multiple times during an operation. However, we only need to
> lock out IO and other space manipulation operations once, as that
> lockout is maintained until the entire fallocate operation has been
> completed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c |  8 +-------
>  fs/xfs/xfs_file.c      | 30 ++++++++++++++++++++++++++++++
>  fs/xfs/xfs_ioctl.c     |  1 +
>  3 files changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index fb31d7d6701e..dea68308fb64 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1040,6 +1040,7 @@ xfs_unmap_extent(
>  	goto out_unlock;
>  }
>  
> +/* Caller must first wait for the completion of any pending DIOs if required. */
>  int
>  xfs_flush_unmap_range(
>  	struct xfs_inode	*ip,
> @@ -1051,9 +1052,6 @@ xfs_flush_unmap_range(
>  	xfs_off_t		rounding, start, end;
>  	int			error;
>  
> -	/* wait for the completion of any pending DIOs */
> -	inode_dio_wait(inode);

Does xfs_reflink_remap_prep still need this function to call inode_dio_wait
before zapping the page cache prior to reflinking into an existing file?

> -
>  	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
>  	start = round_down(offset, rounding);
>  	end = round_up(offset + len, rounding) - 1;
> @@ -1085,10 +1083,6 @@ xfs_free_file_space(
>  	if (len <= 0)	/* if nothing being freed */
>  		return 0;
>  
> -	error = xfs_flush_unmap_range(ip, offset, len);
> -	if (error)
> -		return error;
> -
>  	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
>  	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 525b29b99116..865543e41fb4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -817,6 +817,36 @@ xfs_file_fallocate(
>  	if (error)
>  		goto out_unlock;
>  
> +	/*
> +	 * Must wait for all AIO to complete before we continue as AIO can
> +	 * change the file size on completion without holding any locks we
> +	 * currently hold. We must do this first because AIO can update both
> +	 * the on disk and in memory inode sizes, and the operations that follow
> +	 * require the in-memory size to be fully up-to-date.
> +	 */
> +	inode_dio_wait(inode);
> +
> +	/*
> +	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
> +	 * the cached range over the first operation we are about to run.
> +	 *
> +	 * We care about zero and collapse here because they both run a hole
> +	 * punch over the range first. Because that can zero data, and the range
> +	 * of invalidation for the shift operations is much larger, we still do
> +	 * the required flush for collapse in xfs_prepare_shift().
> +	 *
> +	 * Insert has the same range requirements as collapse, and we extend the
> +	 * file first which can zero data. Hence insert has the same
> +	 * flush/invalidate requirements as collapse and so they are both
> +	 * handled at the right time by xfs_prepare_shift().
> +	 */
> +	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> +		    FALLOC_FL_COLLAPSE_RANGE)) {

Er... "Insert has the same requirements as collapse", but we don't test
for that here?  Also ... xfs_prepare_shift handles flushing for both
collapse and insert range, but we still have to flush here for collapse?

<confused but suspecting this has something to do with the fact that we
only do insert range after updating the isize?>

I think the third paragraph of the comment is just confusing me more.
Does the following describe what's going on?

"Insert range has the same range [should this be "page cache flushing"?]
requirements as collapse.  Because we can zero data as part of extending
the file size, we skip the flush here and let the flush in
xfs_prepare_shift take care of invalidating the page cache." ?

--D

> +		error = xfs_flush_unmap_range(ip, offset, len);
> +		if (error)
> +			goto out_unlock;
> +	}
> +
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		error = xfs_free_file_space(ip, offset, len);
>  		if (error)
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 287f83eb791f..800f07044636 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -623,6 +623,7 @@ xfs_ioc_space(
>  	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
>  	if (error)
>  		goto out_unlock;
> +	inode_dio_wait(inode);
>  
>  	switch (bf->l_whence) {
>  	case 0: /*SEEK_SET*/
> -- 
> 2.24.0.rc0
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29  4:19 ` Darrick J. Wong
@ 2019-10-29  4:41   ` Dave Chinner
  2019-10-29 10:03     ` Brian Foster
  2019-10-29 20:22     ` Darrick J. Wong
  0 siblings, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2019-10-29  4:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Oct 28, 2019 at 09:19:08PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 29, 2019 at 02:48:50PM +1100, Dave Chinner wrote:
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1040,6 +1040,7 @@ xfs_unmap_extent(
> >  	goto out_unlock;
> >  }
> >  
> > +/* Caller must first wait for the completion of any pending DIOs if required. */
> >  int
> >  xfs_flush_unmap_range(
> >  	struct xfs_inode	*ip,
> > @@ -1051,9 +1052,6 @@ xfs_flush_unmap_range(
> >  	xfs_off_t		rounding, start, end;
> >  	int			error;
> >  
> > -	/* wait for the completion of any pending DIOs */
> > -	inode_dio_wait(inode);
> 
> Does xfs_reflink_remap_prep still need this function to call inode_dio_wait
> before zapping the page cache prior to reflinking into an existing file?

No, because that is done in generic_remap_file_range_prep() after we
have locked the inodes and broken leases in
xfs_reflink_remap_prep().

> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 525b29b99116..865543e41fb4 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -817,6 +817,36 @@ xfs_file_fallocate(
> >  	if (error)
> >  		goto out_unlock;
> >  
> > +	/*
> > +	 * Must wait for all AIO to complete before we continue as AIO can
> > +	 * change the file size on completion without holding any locks we
> > +	 * currently hold. We must do this first because AIO can update both
> > +	 * the on disk and in memory inode sizes, and the operations that follow
> > +	 * require the in-memory size to be fully up-to-date.
> > +	 */
> > +	inode_dio_wait(inode);
> > +
> > +	/*
> > +	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
> > +	 * the cached range over the first operation we are about to run.
> > +	 *
> > +	 * We care about zero and collapse here because they both run a hole
> > +	 * punch over the range first. Because that can zero data, and the range
> > +	 * of invalidation for the shift operations is much larger, we still do
> > +	 * the required flush for collapse in xfs_prepare_shift().
> > +	 *
> > +	 * Insert has the same range requirements as collapse, and we extend the
> > +	 * file first which can zero data. Hence insert has the same
> > +	 * flush/invalidate requirements as collapse and so they are both
> > +	 * handled at the right time by xfs_prepare_shift().
> > +	 */
> > +	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> > +		    FALLOC_FL_COLLAPSE_RANGE)) {
> 
> Er... "Insert has the same requirements as collapse", but we don't test
> for that here?  Also ... xfs_prepare_shift handles flushing for both
> collapse and insert range, but we still have to flush here for collapse?
>
> <confused but suspecting this has something to do with the fact that we
> only do insert range after updating the isize?>

Yes, exactly.

The flush for collapse here is for the hole punch part of collapse,
before we start shifting extents. insert does not hole punch, so it
doesn't need flushing here but it still needs flush/inval before
shifting. i.e.:

collapse				insert

flush_unmap(off, len)
punch hole(off, len)			extends EOF
  writes zeros around (off,len)		  writes zeros around EOF
collapse(off, len)			insert(off, len)
  flush_unmap(off, EOF)			  flush_unmap(off, EOF)
  shift extents down			  shift extents up

So once we start the actual extent shift operation (up or down)
the flush/unmap requirements are identical.

> I think the third paragraph of the comment is just confusing me more.
> Does the following describe what's going on?
> 
> "Insert range has the same range [should this be "page cache flushing"?]
> requirements as collapse.  Because we can zero data as part of extending
> the file size, we skip the flush here and let the flush in
> xfs_prepare_shift take care of invalidating the page cache." ?

It's a bit better - that's kinda what I was trying to describe - but
I'll try to reword it more clearly after I've let it settle in my
head for a little while....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29  4:41   ` Dave Chinner
@ 2019-10-29 10:03     ` Brian Foster
  2019-10-29 20:22       ` Darrick J. Wong
  2019-10-29 22:29       ` Dave Chinner
  2019-10-29 20:22     ` Darrick J. Wong
  1 sibling, 2 replies; 9+ messages in thread
From: Brian Foster @ 2019-10-29 10:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Tue, Oct 29, 2019 at 03:41:33PM +1100, Dave Chinner wrote:
> On Mon, Oct 28, 2019 at 09:19:08PM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 29, 2019 at 02:48:50PM +1100, Dave Chinner wrote:
...
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 525b29b99116..865543e41fb4 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -817,6 +817,36 @@ xfs_file_fallocate(
> > >  	if (error)
> > >  		goto out_unlock;
> > >  
...
> > > +	/*
> > > +	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
> > > +	 * the cached range over the first operation we are about to run.
> > > +	 *
> > > +	 * We care about zero and collapse here because they both run a hole
> > > +	 * punch over the range first. Because that can zero data, and the range
> > > +	 * of invalidation for the shift operations is much larger, we still do
> > > +	 * the required flush for collapse in xfs_prepare_shift().
> > > +	 *
> > > +	 * Insert has the same range requirements as collapse, and we extend the
> > > +	 * file first which can zero data. Hence insert has the same
> > > +	 * flush/invalidate requirements as collapse and so they are both
> > > +	 * handled at the right time by xfs_prepare_shift().
> > > +	 */
> > > +	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> > > +		    FALLOC_FL_COLLAPSE_RANGE)) {
> > 
> > Er... "Insert has the same requirements as collapse", but we don't test
> > for that here?  Also ... xfs_prepare_shift handles flushing for both
> > collapse and insert range, but we still have to flush here for collapse?
> >
> > <confused but suspecting this has something to do with the fact that we
> > only do insert range after updating the isize?>
> 
> Yes, exactly.
> 
> The flush for collapse here is for the hole punch part of collapse,
> before we start shifting extents. insert does not hole punch, so it
> doesn't need flushing here but it still needs flush/inval before
> shifting. i.e.:
> 
> collapse				insert
> 
> flush_unmap(off, len)
> punch hole(off, len)			extends EOF
>   writes zeros around (off,len)		  writes zeros around EOF
> collapse(off, len)			insert(off, len)
>   flush_unmap(off, EOF)			  flush_unmap(off, EOF)
>   shift extents down			  shift extents up
> 
> So once we start the actual extent shift operation (up or down)
> the flush/unmap requirements are identical.
> 
> > I think the third paragraph of the comment is just confusing me more.
> > Does the following describe what's going on?
> > 
> > "Insert range has the same range [should this be "page cache flushing"?]
> > requirements as collapse.  Because we can zero data as part of extending
> > the file size, we skip the flush here and let the flush in
> > xfs_prepare_shift take care of invalidating the page cache." ?
> 
> It's a bit better - that's kinda what I was trying to describe - but
> I'll try to reword it more clearly after I've let it settle in my
> head for a little while....
> 

I agree the comment is a little confusing because to me, it's just
describing a bit too much for its context. I.e., I read the comment and
have to go look at other code to make sure I grok the comment rather
than the comment helping me grok the code it's associated with.

FWIW, I find something like the following a bit more clear/concise on
the whole:

        /*
+        * Once AIO and DIO has drained we flush and (if necessary) invalidate
+        * the cached range over the first operation we are about to run. We
+        * include zero and collapse here because they both start with a hole
+        * punch over the target range. Insert and collapse both invalidate the
+        * broader range affected by the shift in xfs_prepare_shift().
         */

... because it points out why we special case collapse here, and that
otherwise the prepare shift code is responsible for the rest. Just my
.02 and otherwise the patch looks good to me.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29 10:03     ` Brian Foster
@ 2019-10-29 20:22       ` Darrick J. Wong
  2019-10-29 22:29       ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-10-29 20:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Tue, Oct 29, 2019 at 06:03:42AM -0400, Brian Foster wrote:
> On Tue, Oct 29, 2019 at 03:41:33PM +1100, Dave Chinner wrote:
> > On Mon, Oct 28, 2019 at 09:19:08PM -0700, Darrick J. Wong wrote:
> > > On Tue, Oct 29, 2019 at 02:48:50PM +1100, Dave Chinner wrote:
> ...
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index 525b29b99116..865543e41fb4 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -817,6 +817,36 @@ xfs_file_fallocate(
> > > >  	if (error)
> > > >  		goto out_unlock;
> > > >  
> ...
> > > > +	/*
> > > > +	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
> > > > +	 * the cached range over the first operation we are about to run.
> > > > +	 *
> > > > +	 * We care about zero and collapse here because they both run a hole
> > > > +	 * punch over the range first. Because that can zero data, and the range
> > > > +	 * of invalidation for the shift operations is much larger, we still do
> > > > +	 * the required flush for collapse in xfs_prepare_shift().
> > > > +	 *
> > > > +	 * Insert has the same range requirements as collapse, and we extend the
> > > > +	 * file first which can zero data. Hence insert has the same
> > > > +	 * flush/invalidate requirements as collapse and so they are both
> > > > +	 * handled at the right time by xfs_prepare_shift().
> > > > +	 */
> > > > +	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> > > > +		    FALLOC_FL_COLLAPSE_RANGE)) {
> > > 
> > > Er... "Insert has the same requirements as collapse", but we don't test
> > > for that here?  Also ... xfs_prepare_shift handles flushing for both
> > > collapse and insert range, but we still have to flush here for collapse?
> > >
> > > <confused but suspecting this has something to do with the fact that we
> > > only do insert range after updating the isize?>
> > 
> > Yes, exactly.
> > 
> > The flush for collapse here is for the hole punch part of collapse,
> > before we start shifting extents. insert does not hole punch, so it
> > doesn't need flushing here but it still needs flush/inval before
> > shifting. i.e.:
> > 
> > collapse				insert
> > 
> > flush_unmap(off, len)
> > punch hole(off, len)			extends EOF
> >   writes zeros around (off,len)		  writes zeros around EOF
> > collapse(off, len)			insert(off, len)
> >   flush_unmap(off, EOF)			  flush_unmap(off, EOF)
> >   shift extents down			  shift extents up
> > 
> > So once we start the actual extent shift operation (up or down)
> > the flush/unmap requirements are identical.
> > 
> > > I think the third paragraph of the comment is just confusing me more.
> > > Does the following describe what's going on?
> > > 
> > > "Insert range has the same range [should this be "page cache flushing"?]
> > > requirements as collapse.  Because we can zero data as part of extending
> > > the file size, we skip the flush here and let the flush in
> > > xfs_prepare_shift take care of invalidating the page cache." ?
> > 
> > It's a bit better - that's kinda what I was trying to describe - but
> > I'll try to reword it more clearly after I've let it settle in my
> > head for a little while....
> > 
> 
> I agree the comment is a little confusing because to me, it's just
> describing a bit too much for its context. I.e., I read the comment and
> have to go look at other code to make sure I grok the comment rather
> than the comment helping me grok the code it's associated with.
> 
> FWIW, I find something like the following a bit more clear/concise on
> the whole:
> 
>         /*
> +        * Once AIO and DIO has drained we flush and (if necessary) invalidate
> +        * the cached range over the first operation we are about to run. We
> +        * include zero and collapse here because they both start with a hole
> +        * punch over the target range. Insert and collapse both invalidate the
> +        * broader range affected by the shift in xfs_prepare_shift().
>          */
> 
> ... because it points out why we special case collapse here, and that
> otherwise the prepare shift code is responsible for the rest. Just my
> .02 and otherwise the patch looks good to me.

I like that version better too.

--D

> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29  4:41   ` Dave Chinner
  2019-10-29 10:03     ` Brian Foster
@ 2019-10-29 20:22     ` Darrick J. Wong
  1 sibling, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-10-29 20:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 29, 2019 at 03:41:33PM +1100, Dave Chinner wrote:
> On Mon, Oct 28, 2019 at 09:19:08PM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 29, 2019 at 02:48:50PM +1100, Dave Chinner wrote:
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -1040,6 +1040,7 @@ xfs_unmap_extent(
> > >  	goto out_unlock;
> > >  }
> > >  
> > > +/* Caller must first wait for the completion of any pending DIOs if required. */
> > >  int
> > >  xfs_flush_unmap_range(
> > >  	struct xfs_inode	*ip,
> > > @@ -1051,9 +1052,6 @@ xfs_flush_unmap_range(
> > >  	xfs_off_t		rounding, start, end;
> > >  	int			error;
> > >  
> > > -	/* wait for the completion of any pending DIOs */
> > > -	inode_dio_wait(inode);
> > 
> > Does xfs_reflink_remap_prep still need this function to call inode_dio_wait
> > before zapping the page cache prior to reflinking into an existing file?
> 
> No, because that is done in generic_remap_file_range_prep() after we
> have locked the inodes and broken leases in
> xfs_reflink_remap_prep().

Heh, ok.  The rest mostly looks ok to me, but I'll wait for the v2.

--D

> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 525b29b99116..865543e41fb4 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -817,6 +817,36 @@ xfs_file_fallocate(
> > >  	if (error)
> > >  		goto out_unlock;
> > >  
> > > +	/*
> > > +	 * Must wait for all AIO to complete before we continue as AIO can
> > > +	 * change the file size on completion without holding any locks we
> > > +	 * currently hold. We must do this first because AIO can update both
> > > +	 * the on disk and in memory inode sizes, and the operations that follow
> > > +	 * require the in-memory size to be fully up-to-date.
> > > +	 */
> > > +	inode_dio_wait(inode);
> > > +
> > > +	/*
> > > +	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
> > > +	 * the cached range over the first operation we are about to run.
> > > +	 *
> > > +	 * We care about zero and collapse here because they both run a hole
> > > +	 * punch over the range first. Because that can zero data, and the range
> > > +	 * of invalidation for the shift operations is much larger, we still do
> > > +	 * the required flush for collapse in xfs_prepare_shift().
> > > +	 *
> > > +	 * Insert has the same range requirements as collapse, and we extend the
> > > +	 * file first which can zero data. Hence insert has the same
> > > +	 * flush/invalidate requirements as collapse and so they are both
> > > +	 * handled at the right time by xfs_prepare_shift().
> > > +	 */
> > > +	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> > > +		    FALLOC_FL_COLLAPSE_RANGE)) {
> > 
> > Er... "Insert has the same requirements as collapse", but we don't test
> > for that here?  Also ... xfs_prepare_shift handles flushing for both
> > collapse and insert range, but we still have to flush here for collapse?
> >
> > <confused but suspecting this has something to do with the fact that we
> > only do insert range after updating the isize?>
> 
> Yes, exactly.
> 
> The flush for collapse here is for the hole punch part of collapse,
> before we start shifting extents. insert does not hole punch, so it
> doesn't need flushing here but it still needs flush/inval before
> shifting. i.e.:
> 
> collapse				insert
> 
> flush_unmap(off, len)
> punch hole(off, len)			extends EOF
>   writes zeros around (off,len)		  writes zeros around EOF
> collapse(off, len)			insert(off, len)
>   flush_unmap(off, EOF)			  flush_unmap(off, EOF)
>   shift extents down			  shift extents up
> 
> So once we start the actual extent shift operation (up or down)
> the flush/unmap requirements are identical.
> 
> > I think the third paragraph of the comment is just confusing me more.
> > Does the following describe what's going on?
> > 
> > "Insert range has the same range [should this be "page cache flushing"?]
> > requirements as collapse.  Because we can zero data as part of extending
> > the file size, we skip the flush here and let the flush in
> > xfs_prepare_shift take care of invalidating the page cache." ?
> 
> It's a bit better - that's kinda what I was trying to describe - but
> I'll try to reword it more clearly after I've let it settle in my
> head for a little while....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH] generic: test race between appending AIO DIO and fallocate
  2019-10-29  3:48 [PATCH] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
  2019-10-29  4:02 ` Dave Chinner
  2019-10-29  4:19 ` Darrick J. Wong
@ 2019-10-29 20:23 ` Darrick J. Wong
  2 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-10-29 20:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Brian Foster, Eric Sandeen, fstests

From: Darrick J. Wong <darrick.wong@oracle.com>

Dave Chinner reports[1] that an appending AIO DIO write to the second
block of a zero-length file and an fallocate request to the first block
of the same file can race to set isize, with the user-visible end result
that the file size is set incorrectly to one block long.  Write a small
test to reproduce the results.

[1] https://lore.kernel.org/linux-xfs/20191029100342.GA41131@bfoster/T/

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../aio-dio-append-write-fallocate-race.c          |  212 ++++++++++++++++++++
 tests/generic/722                                  |   44 ++++
 tests/generic/722.out                              |    2 
 tests/generic/group                                |    1 
 4 files changed, 259 insertions(+)
 create mode 100644 src/aio-dio-regress/aio-dio-append-write-fallocate-race.c
 create mode 100755 tests/generic/722
 create mode 100644 tests/generic/722.out

diff --git a/src/aio-dio-regress/aio-dio-append-write-fallocate-race.c b/src/aio-dio-regress/aio-dio-append-write-fallocate-race.c
new file mode 100644
index 00000000..091b047d
--- /dev/null
+++ b/src/aio-dio-regress/aio-dio-append-write-fallocate-race.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: GPL-2.0-or-newer
+/*
+ * Copyright (c) 2019 Oracle.
+ * All Rights Reserved.
+ *
+ * Race appending aio dio and fallocate to make sure we get the correct file
+ * size afterwards.
+ */
+#include <stdio.h>
+#include <pthread.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <libaio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <limits.h>
+
+static int fd;
+static int blocksize;
+
+static void *
+falloc_thread(
+	void		*p)
+{
+	int		ret;
+
+	ret = fallocate(fd, 0, 0, blocksize);
+	if (ret)
+		perror("falloc");
+
+	return NULL;
+}
+
+static int
+test(
+	const char	*fname,
+	unsigned int	iteration,
+	unsigned int	*passed)
+{
+	struct stat	sbuf;
+	pthread_t	thread;
+	io_context_t	ioctx = 0;
+	struct iocb	iocb;
+	struct iocb	*iocbp = &iocb;
+	struct io_event	event;
+	char		*buf;
+	bool		wait_thread = false;
+	int		ret;
+
+	/* Truncate file, allocate resources for doing IO. */
+	fd = open(fname, O_DIRECT | O_RDWR | O_TRUNC | O_CREAT, 0644);
+	if (fd < 0) {
+		perror(fname);
+		return -1;
+	}
+
+	ret = fstat(fd, &sbuf);
+	if (ret) {
+		perror(fname);
+		goto out;
+	}
+	blocksize = sbuf.st_blksize;
+
+	ret = posix_memalign((void **)&buf, blocksize, blocksize);
+	if (ret) {
+		errno = ret;
+		perror("buffer");
+		goto out;
+	}
+	memset(buf, 'X', blocksize);
+	memset(&event, 0, sizeof(event));
+
+	ret = io_queue_init(1, &ioctx);
+	if (ret) {
+		errno = -ret;
+		perror("io_queue_init");
+		goto out_buf;
+	}
+
+	/*
+	 * Set ourselves up to race fallocate(0..blocksize) with aio dio
+	 * pwrite(blocksize..blocksize * 2).  This /should/ give us a file
+	 * with length (2 * blocksize).
+	 */
+	io_prep_pwrite(&iocb, fd, buf, blocksize, blocksize);
+
+	ret = pthread_create(&thread, NULL, falloc_thread, NULL);
+	if (ret) {
+		errno = ret;
+		perror("pthread");
+		goto out_io;
+	}
+	wait_thread = true;
+
+	ret = io_submit(ioctx, 1, &iocbp);
+	if (ret != 1) {
+		errno = -ret;
+		perror("io_submit");
+		goto out_join;
+	}
+
+	ret = io_getevents(ioctx, 1, 1, &event, NULL);
+	if (ret != 1) {
+		errno = -ret;
+		perror("io_getevents");
+		goto out_join;
+	}
+
+	if (event.res < 0) {
+		errno = -event.res;
+		perror("io_event.res");
+		goto out_join;
+	}
+
+	if (event.res2 < 0) {
+		errno = -event.res2;
+		perror("io_event.res2");
+		goto out_join;
+	}
+
+	wait_thread = false;
+	ret = pthread_join(thread, NULL);
+	if (ret) {
+		errno = ret;
+		perror("join");
+		goto out_io;
+	}
+
+	/* Make sure we actually got a file of size (2 * blocksize). */
+	ret = fstat(fd, &sbuf);
+	if (ret) {
+		perror(fname);
+		goto out_buf;
+	}
+
+	if (sbuf.st_size != 2 * blocksize) {
+		fprintf(stderr, "[%u]: sbuf.st_size=%llu, expected %llu.\n",
+				iteration,
+				(unsigned long long)sbuf.st_size,
+				(unsigned long long)2 * blocksize);
+	} else {
+		printf("[%u]: passed.\n", iteration);
+		(*passed)++;
+	}
+
+out_join:
+	if (wait_thread) {
+		ret = pthread_join(thread, NULL);
+		if (ret) {
+			errno = ret;
+			perror("join");
+			goto out_io;
+		}
+	}
+out_io:
+	ret = io_queue_release(ioctx);
+	if (ret) {
+		errno = -ret;
+		perror("io_queue_release");
+	}
+
+out_buf:
+	free(buf);
+out:
+	ret = close(fd);
+	fd = -1;
+	if (ret) {
+		perror("close");
+		return -1;
+	}
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int		ret;
+	long		l;
+	unsigned int	i;
+	unsigned int	passed = 0;
+
+	if (argc != 3) {
+		printf("Usage: %s filename iterations\n", argv[0]);
+		return 1;
+	}
+
+	errno = 0;
+	l = strtol(argv[2], NULL, 0);
+	if (errno) {
+		perror(argv[2]);
+		return 1;
+	}
+	if (l < 1 || l > UINT_MAX) {
+		fprintf(stderr, "%ld: must be between 1 and %u.\n",
+				l, UINT_MAX);
+		return 1;
+	}
+
+	for (i = 0; i < l; i++) {
+		ret = test(argv[1], i, &passed);
+		if (ret)
+			return 1;
+	}
+
+	printf("pass rate: %u/%u (%.2f%%)\n", passed, i, 100.0 * passed / i);
+
+	return 0;
+}
diff --git a/tests/generic/722 b/tests/generic/722
new file mode 100755
index 00000000..307e3b77
--- /dev/null
+++ b/tests/generic/722
@@ -0,0 +1,44 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2019, Oracle and/or its affiliates.  All Rights Reserved.
+#
+# FS QA Test No. 722
+#
+# Race an appending aio dio write to the second block of a file while
+# simultaneously fallocating to the first block.  Make sure that we end up
+# with a two-block file.
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1    # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.* $testfile
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+
+# real QA test starts here
+_supported_os Linux
+_supported_fs generic
+_require_aiodio "aiocp"
+_require_test
+AIO_TEST=$here/src/aio-dio-regress/aio-dio-append-write-fallocate-race
+
+rm -f $seqres.full
+
+testfile=$TEST_DIR/test-$seq
+$AIO_TEST $testfile 100 >> $seqres.full
+
+echo Silence is golden.
+# success, all done
+status=0
+exit
diff --git a/tests/generic/722.out b/tests/generic/722.out
new file mode 100644
index 00000000..8621a87d
--- /dev/null
+++ b/tests/generic/722.out
@@ -0,0 +1,2 @@
+QA output created by 722
+Silence is golden.
diff --git a/tests/generic/group b/tests/generic/group
index a252f424..20df0964 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -591,3 +591,4 @@
 716 dangerous_norepair
 720 dangerous_norepair
 721 auto quick atime
+722 auto quick rw falloc

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29 10:03     ` Brian Foster
  2019-10-29 20:22       ` Darrick J. Wong
@ 2019-10-29 22:29       ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2019-10-29 22:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Tue, Oct 29, 2019 at 06:03:42AM -0400, Brian Foster wrote:
> On Tue, Oct 29, 2019 at 03:41:33PM +1100, Dave Chinner wrote:
> 
> FWIW, I find something like the following a bit more clear/concise on
> the whole:
> 
>         /*
> +        * Once AIO and DIO has drained we flush and (if necessary) invalidate
> +        * the cached range over the first operation we are about to run. We
> +        * include zero and collapse here because they both start with a hole
> +        * punch over the target range. Insert and collapse both invalidate the
> +        * broader range affected by the shift in xfs_prepare_shift().
>          */

Yup, much better. Thanks! :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-10-29 22:30 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-29  3:48 [PATCH] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
2019-10-29  4:02 ` Dave Chinner
2019-10-29  4:19 ` Darrick J. Wong
2019-10-29  4:41   ` Dave Chinner
2019-10-29 10:03     ` Brian Foster
2019-10-29 20:22       ` Darrick J. Wong
2019-10-29 22:29       ` Dave Chinner
2019-10-29 20:22     ` Darrick J. Wong
2019-10-29 20:23 ` [RFC PATCH] generic: test race between appending AIO DIO and fallocate Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.