linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
@ 2019-10-29 22:37 Dave Chinner
  2019-10-29 23:33 ` Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Dave Chinner @ 2019-10-29 22:37 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

AIO+DIO can extend the file size on IO completion, and it holds
no inode locks while the IO is in flight. Therefore, a race
condition exists in file size updates if we do something like this:

aio-thread			fallocate-thread

lock inode
submit IO beyond inode->i_size
unlock inode
.....
				lock inode
				break layouts
				if (off + len > inode->i_size)
					new_size = off + len
				.....
				inode_dio_wait()
				<blocks>
.....
completes
inode->i_size updated
inode_dio_done()
....
				<wakes>
				<does stuff no long beyond EOF>
				if (new_size)
					xfs_vn_setattr(inode, new_size)


Yup, that attempt to extend the file size in the fallocate code
turns into a truncate - it removes the whatever the aio write
allocated and put to disk, and reduced the inode size back down to
where the fallocate operation ends.

Fundamentally, xfs_file_fallocate()  not compatible with racing
AIO+DIO completions, so we need to move the inode_dio_wait() call
up to where the lock the inode and break the layouts.

Secondly, storing the inode size and then using it unchecked without
holding the ILOCK is not safe; we can only do such a thing if we've
locked out and drained all IO and other modification operations,
which we don't do initially in xfs_file_fallocate.

It should be noted that some of the fallocate operations are
compound operations - they are made up of multiple manipulations
that may zero data, and so we may need to flush and invalidate the
file multiple times during an operation. However, we only need to
lock out IO and other space manipulation operations once, as that
lockout is maintained until the entire fallocate operation has been
completed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c |  8 +-------
 fs/xfs/xfs_file.c      | 23 +++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c     |  1 +
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fb31d7d6701e..dea68308fb64 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1040,6 +1040,7 @@ xfs_unmap_extent(
 	goto out_unlock;
 }
 
+/* Caller must first wait for the completion of any pending DIOs if required. */
 int
 xfs_flush_unmap_range(
 	struct xfs_inode	*ip,
@@ -1051,9 +1052,6 @@ xfs_flush_unmap_range(
 	xfs_off_t		rounding, start, end;
 	int			error;
 
-	/* wait for the completion of any pending DIOs */
-	inode_dio_wait(inode);
-
 	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
 	start = round_down(offset, rounding);
 	end = round_up(offset + len, rounding) - 1;
@@ -1085,10 +1083,6 @@ xfs_free_file_space(
 	if (len <= 0)	/* if nothing being freed */
 		return 0;
 
-	error = xfs_flush_unmap_range(ip, offset, len);
-	if (error)
-		return error;
-
 	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
 	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 525b29b99116..46fc5629369b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -817,6 +817,29 @@ xfs_file_fallocate(
 	if (error)
 		goto out_unlock;
 
+	/*
+	 * Must wait for all AIO to complete before we continue as AIO can
+	 * change the file size on completion without holding any locks we
+	 * currently hold. We must do this first because AIO can update both
+	 * the on disk and in memory inode sizes, and the operations that follow
+	 * require the in-memory size to be fully up-to-date.
+	 */
+	inode_dio_wait(inode);
+
+	/*
+	 * Now that AIO and DIO has drained we can flush and (if necessary)
+	 * invalidate the cached range over the first operation we are about to
+	 * run. We include zero and collapse here because they both start with a
+	 * hole punch over the target range. Insert and collapse both invalidate
+	 * the broader range affected by the shift in xfs_prepare_shift().
+	 */
+	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
+		    FALLOC_FL_COLLAPSE_RANGE)) {
+		error = xfs_flush_unmap_range(ip, offset, len);
+		if (error)
+			goto out_unlock;
+	}
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 287f83eb791f..800f07044636 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -623,6 +623,7 @@ xfs_ioc_space(
 	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
 	if (error)
 		goto out_unlock;
+	inode_dio_wait(inode);
 
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
-- 
2.24.0.rc0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29 22:37 [PATCH V2] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
@ 2019-10-29 23:33 ` Darrick J. Wong
  2019-10-29 23:54   ` Dave Chinner
  2019-10-30 12:46 ` Brian Foster
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2019-10-29 23:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 30, 2019 at 09:37:52AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> AIO+DIO can extend the file size on IO completion, and it holds
> no inode locks while the IO is in flight. Therefore, a race
> condition exists in file size updates if we do something like this:
> 
> aio-thread			fallocate-thread
> 
> lock inode
> submit IO beyond inode->i_size
> unlock inode
> .....
> 				lock inode
> 				break layouts
> 				if (off + len > inode->i_size)
> 					new_size = off + len
> 				.....
> 				inode_dio_wait()
> 				<blocks>
> .....
> completes
> inode->i_size updated
> inode_dio_done()
> ....
> 				<wakes>
> 				<does stuff no long beyond EOF>
> 				if (new_size)
> 					xfs_vn_setattr(inode, new_size)
> 
> 
> Yup, that attempt to extend the file size in the fallocate code
> turns into a truncate - it removes the whatever the aio write
> allocated and put to disk, and reduced the inode size back down to
> where the fallocate operation ends.
> 
> Fundamentally, xfs_file_fallocate()  not compatible with racing
> AIO+DIO completions, so we need to move the inode_dio_wait() call
> up to where the lock the inode and break the layouts.
> 
> Secondly, storing the inode size and then using it unchecked without
> holding the ILOCK is not safe; we can only do such a thing if we've
> locked out and drained all IO and other modification operations,
> which we don't do initially in xfs_file_fallocate.
> 
> It should be noted that some of the fallocate operations are
> compound operations - they are made up of multiple manipulations
> that may zero data, and so we may need to flush and invalidate the
> file multiple times during an operation. However, we only need to
> lock out IO and other space manipulation operations once, as that
> lockout is maintained until the entire fallocate operation has been
> completed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks reasonable to me; what do you think of my regression test?

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_bmap_util.c |  8 +-------
>  fs/xfs/xfs_file.c      | 23 +++++++++++++++++++++++
>  fs/xfs/xfs_ioctl.c     |  1 +
>  3 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index fb31d7d6701e..dea68308fb64 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1040,6 +1040,7 @@ xfs_unmap_extent(
>  	goto out_unlock;
>  }
>  
> +/* Caller must first wait for the completion of any pending DIOs if required. */
>  int
>  xfs_flush_unmap_range(
>  	struct xfs_inode	*ip,
> @@ -1051,9 +1052,6 @@ xfs_flush_unmap_range(
>  	xfs_off_t		rounding, start, end;
>  	int			error;
>  
> -	/* wait for the completion of any pending DIOs */
> -	inode_dio_wait(inode);
> -
>  	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
>  	start = round_down(offset, rounding);
>  	end = round_up(offset + len, rounding) - 1;
> @@ -1085,10 +1083,6 @@ xfs_free_file_space(
>  	if (len <= 0)	/* if nothing being freed */
>  		return 0;
>  
> -	error = xfs_flush_unmap_range(ip, offset, len);
> -	if (error)
> -		return error;
> -
>  	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
>  	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 525b29b99116..46fc5629369b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -817,6 +817,29 @@ xfs_file_fallocate(
>  	if (error)
>  		goto out_unlock;
>  
> +	/*
> +	 * Must wait for all AIO to complete before we continue as AIO can
> +	 * change the file size on completion without holding any locks we
> +	 * currently hold. We must do this first because AIO can update both
> +	 * the on disk and in memory inode sizes, and the operations that follow
> +	 * require the in-memory size to be fully up-to-date.
> +	 */
> +	inode_dio_wait(inode);
> +
> +	/*
> +	 * Now that AIO and DIO has drained we can flush and (if necessary)
> +	 * invalidate the cached range over the first operation we are about to
> +	 * run. We include zero and collapse here because they both start with a
> +	 * hole punch over the target range. Insert and collapse both invalidate
> +	 * the broader range affected by the shift in xfs_prepare_shift().
> +	 */
> +	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> +		    FALLOC_FL_COLLAPSE_RANGE)) {
> +		error = xfs_flush_unmap_range(ip, offset, len);
> +		if (error)
> +			goto out_unlock;
> +	}
> +
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		error = xfs_free_file_space(ip, offset, len);
>  		if (error)
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 287f83eb791f..800f07044636 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -623,6 +623,7 @@ xfs_ioc_space(
>  	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
>  	if (error)
>  		goto out_unlock;
> +	inode_dio_wait(inode);
>  
>  	switch (bf->l_whence) {
>  	case 0: /*SEEK_SET*/
> -- 
> 2.24.0.rc0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29 23:33 ` Darrick J. Wong
@ 2019-10-29 23:54   ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2019-10-29 23:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 29, 2019 at 04:33:37PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 30, 2019 at 09:37:52AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > AIO+DIO can extend the file size on IO completion, and it holds
> > no inode locks while the IO is in flight. Therefore, a race
> > condition exists in file size updates if we do something like this:
> > 
> > aio-thread			fallocate-thread
> > 
> > lock inode
> > submit IO beyond inode->i_size
> > unlock inode
> > .....
> > 				lock inode
> > 				break layouts
> > 				if (off + len > inode->i_size)
> > 					new_size = off + len
> > 				.....
> > 				inode_dio_wait()
> > 				<blocks>
> > .....
> > completes
> > inode->i_size updated
> > inode_dio_done()
> > ....
> > 				<wakes>
> > 				<does stuff no long beyond EOF>
> > 				if (new_size)
> > 					xfs_vn_setattr(inode, new_size)
> > 
> > 
> > Yup, that attempt to extend the file size in the fallocate code
> > turns into a truncate - it removes the whatever the aio write
> > allocated and put to disk, and reduced the inode size back down to
> > where the fallocate operation ends.
> > 
> > Fundamentally, xfs_file_fallocate()  not compatible with racing
> > AIO+DIO completions, so we need to move the inode_dio_wait() call
> > up to where the lock the inode and break the layouts.
> > 
> > Secondly, storing the inode size and then using it unchecked without
> > holding the ILOCK is not safe; we can only do such a thing if we've
> > locked out and drained all IO and other modification operations,
> > which we don't do initially in xfs_file_fallocate.
> > 
> > It should be noted that some of the fallocate operations are
> > compound operations - they are made up of multiple manipulations
> > that may zero data, and so we may need to flush and invalidate the
> > file multiple times during an operation. However, we only need to
> > lock out IO and other space manipulation operations once, as that
> > lockout is maintained until the entire fallocate operation has been
> > completed.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Looks reasonable to me; what do you think of my regression test?

Looks reasonable at a first glance. Not much different what I was
using to test this patch. I haven't looked in more detail than that
yet...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29 22:37 [PATCH V2] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
  2019-10-29 23:33 ` Darrick J. Wong
@ 2019-10-30 12:46 ` Brian Foster
  2019-10-30 14:26 ` Christoph Hellwig
  2019-11-20  2:38 ` Darrick J. Wong
  3 siblings, 0 replies; 8+ messages in thread
From: Brian Foster @ 2019-10-30 12:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 30, 2019 at 09:37:52AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> AIO+DIO can extend the file size on IO completion, and it holds
> no inode locks while the IO is in flight. Therefore, a race
> condition exists in file size updates if we do something like this:
> 
> aio-thread			fallocate-thread
> 
> lock inode
> submit IO beyond inode->i_size
> unlock inode
> .....
> 				lock inode
> 				break layouts
> 				if (off + len > inode->i_size)
> 					new_size = off + len
> 				.....
> 				inode_dio_wait()
> 				<blocks>
> .....
> completes
> inode->i_size updated
> inode_dio_done()
> ....
> 				<wakes>
> 				<does stuff no long beyond EOF>
> 				if (new_size)
> 					xfs_vn_setattr(inode, new_size)
> 
> 
> Yup, that attempt to extend the file size in the fallocate code
> turns into a truncate - it removes the whatever the aio write
> allocated and put to disk, and reduced the inode size back down to
> where the fallocate operation ends.
> 
> Fundamentally, xfs_file_fallocate()  not compatible with racing
> AIO+DIO completions, so we need to move the inode_dio_wait() call
> up to where the lock the inode and break the layouts.
> 
> Secondly, storing the inode size and then using it unchecked without
> holding the ILOCK is not safe; we can only do such a thing if we've
> locked out and drained all IO and other modification operations,
> which we don't do initially in xfs_file_fallocate.
> 
> It should be noted that some of the fallocate operations are
> compound operations - they are made up of multiple manipulations
> that may zero data, and so we may need to flush and invalidate the
> file multiple times during an operation. However, we only need to
> lock out IO and other space manipulation operations once, as that
> lockout is maintained until the entire fallocate operation has been
> completed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_bmap_util.c |  8 +-------
>  fs/xfs/xfs_file.c      | 23 +++++++++++++++++++++++
>  fs/xfs/xfs_ioctl.c     |  1 +
>  3 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index fb31d7d6701e..dea68308fb64 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1040,6 +1040,7 @@ xfs_unmap_extent(
>  	goto out_unlock;
>  }
>  
> +/* Caller must first wait for the completion of any pending DIOs if required. */
>  int
>  xfs_flush_unmap_range(
>  	struct xfs_inode	*ip,
> @@ -1051,9 +1052,6 @@ xfs_flush_unmap_range(
>  	xfs_off_t		rounding, start, end;
>  	int			error;
>  
> -	/* wait for the completion of any pending DIOs */
> -	inode_dio_wait(inode);
> -
>  	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
>  	start = round_down(offset, rounding);
>  	end = round_up(offset + len, rounding) - 1;
> @@ -1085,10 +1083,6 @@ xfs_free_file_space(
>  	if (len <= 0)	/* if nothing being freed */
>  		return 0;
>  
> -	error = xfs_flush_unmap_range(ip, offset, len);
> -	if (error)
> -		return error;
> -
>  	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
>  	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 525b29b99116..46fc5629369b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -817,6 +817,29 @@ xfs_file_fallocate(
>  	if (error)
>  		goto out_unlock;
>  
> +	/*
> +	 * Must wait for all AIO to complete before we continue as AIO can
> +	 * change the file size on completion without holding any locks we
> +	 * currently hold. We must do this first because AIO can update both
> +	 * the on disk and in memory inode sizes, and the operations that follow
> +	 * require the in-memory size to be fully up-to-date.
> +	 */
> +	inode_dio_wait(inode);
> +
> +	/*
> +	 * Now that AIO and DIO has drained we can flush and (if necessary)
> +	 * invalidate the cached range over the first operation we are about to
> +	 * run. We include zero and collapse here because they both start with a
> +	 * hole punch over the target range. Insert and collapse both invalidate
> +	 * the broader range affected by the shift in xfs_prepare_shift().
> +	 */
> +	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> +		    FALLOC_FL_COLLAPSE_RANGE)) {
> +		error = xfs_flush_unmap_range(ip, offset, len);
> +		if (error)
> +			goto out_unlock;
> +	}
> +
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		error = xfs_free_file_space(ip, offset, len);
>  		if (error)
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 287f83eb791f..800f07044636 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -623,6 +623,7 @@ xfs_ioc_space(
>  	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
>  	if (error)
>  		goto out_unlock;
> +	inode_dio_wait(inode);
>  
>  	switch (bf->l_whence) {
>  	case 0: /*SEEK_SET*/
> -- 
> 2.24.0.rc0
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29 22:37 [PATCH V2] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
  2019-10-29 23:33 ` Darrick J. Wong
  2019-10-30 12:46 ` Brian Foster
@ 2019-10-30 14:26 ` Christoph Hellwig
  2019-10-31 21:06   ` Dave Chinner
  2019-11-20  2:38 ` Darrick J. Wong
  3 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2019-10-30 14:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Btw, I've been pondering multiple times if we can kill off i_dio_count
again, at least for iomap users.  I've added in the request of Thomas
how want to kill non-owner rw_semaphore unlocks.  But it turns out those
were needed in other placeѕ and have been added back at least partially.
I'll try to just use those again when I find some time, which should
simplify a lot of the mess we around waiting for direct I/O.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
  2019-10-30 14:26 ` Christoph Hellwig
@ 2019-10-31 21:06   ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2019-10-31 21:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Oct 30, 2019 at 07:26:22AM -0700, Christoph Hellwig wrote:
> Looks good,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Btw, I've been pondering multiple times if we can kill off i_dio_count
> again, at least for iomap users.  I've added in the request of Thomas
> how want to kill non-owner rw_semaphore unlocks.  But it turns out those
> were needed in other placeѕ and have been added back at least partially.
> I'll try to just use those again when I find some time, which should
> simplify a lot of the mess we around waiting for direct I/O.

I kinda planned to kill off inode_dio_wait() for XFS via range
locks. i.e. hold the range lock all the way to DIO completion, then
release it there. This allows things like fallocate, truncate, EOF
extension, etc to run concurrently with all IO and not require
serialisation of the IO in progress to perform extent and size
modification operations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
  2019-10-29 22:37 [PATCH V2] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
                   ` (2 preceding siblings ...)
  2019-10-30 14:26 ` Christoph Hellwig
@ 2019-11-20  2:38 ` Darrick J. Wong
  2019-11-21  6:50   ` Dave Chinner
  3 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2019-11-20  2:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hmm, do you think this is a reasonable backport of this patch for
pre-5.5?

--D

---------------------

From: Dave Chinner <dchinner@redhat.com>
Subject: [PATCH] xfs: properly serialise fallocate against AIO+DIO

AIO+DIO can extend the file size on IO completion, and it holds
no inode locks while the IO is in flight. Therefore, a race
condition exists in file size updates if we do something like this:

aio-thread			fallocate-thread

lock inode
submit IO beyond inode->i_size
unlock inode
.....
				lock inode
				break layouts
				if (off + len > inode->i_size)
					new_size = off + len
				.....
				inode_dio_wait()
				<blocks>
.....
completes
inode->i_size updated
inode_dio_done()
....
				<wakes>
				<does stuff no long beyond EOF>
				if (new_size)
					xfs_vn_setattr(inode, new_size)


Yup, that attempt to extend the file size in the fallocate code
turns into a truncate - it removes the whatever the aio write
allocated and put to disk, and reduced the inode size back down to
where the fallocate operation ends.

Fundamentally, xfs_file_fallocate()  not compatible with racing
AIO+DIO completions, so we need to move the inode_dio_wait() call
up to where the lock the inode and break the layouts.

Secondly, storing the inode size and then using it unchecked without
holding the ILOCK is not safe; we can only do such a thing if we've
locked out and drained all IO and other modification operations,
which we don't do initially in xfs_file_fallocate.

It should be noted that some of the fallocate operations are
compound operations - they are made up of multiple manipulations
that may zero data, and so we may need to flush and invalidate the
file multiple times during an operation. However, we only need to
lock out IO and other space manipulation operations once, as that
lockout is maintained until the entire fallocate operation has been
completed.

Cc: <stable@vger.kernel.org> # 5.4.x
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: include the chunks necessary for the old space ioctls]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |    8 +-------
 fs/xfs/xfs_file.c      |   23 +++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c     |   25 +++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2dd49fb6794e..615ff1e7c7be 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -929,6 +929,7 @@ xfs_unmap_extent(
 	goto out_unlock;
 }
 
+/* Caller must first wait for the completion of any pending DIOs if required. */
 int
 xfs_flush_unmap_range(
 	struct xfs_inode	*ip,
@@ -940,9 +941,6 @@ xfs_flush_unmap_range(
 	xfs_off_t		rounding, start, end;
 	int			error;
 
-	/* wait for the completion of any pending DIOs */
-	inode_dio_wait(inode);
-
 	rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_SIZE);
 	start = round_down(offset, rounding);
 	end = round_up(offset + len, rounding) - 1;
@@ -974,10 +972,6 @@ xfs_free_file_space(
 	if (len <= 0)	/* if nothing being freed */
 		return 0;
 
-	error = xfs_flush_unmap_range(ip, offset, len);
-	if (error)
-		return error;
-
 	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
 	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1ffb179f35d2..d94e9823cd9f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -818,6 +818,29 @@ xfs_file_fallocate(
 	if (error)
 		goto out_unlock;
 
+	/*
+	 * Must wait for all AIO to complete before we continue as AIO can
+	 * change the file size on completion without holding any locks we
+	 * currently hold. We must do this first because AIO can update both
+	 * the on disk and in memory inode sizes, and the operations that follow
+	 * require the in-memory size to be fully up-to-date.
+	 */
+	inode_dio_wait(inode);
+
+	/*
+	 * Now that AIO and DIO has drained we can flush and (if necessary)
+	 * invalidate the cached range over the first operation we are about to
+	 * run. We include zero and collapse here because they both start with a
+	 * hole punch over the target range. Insert and collapse both invalidate
+	 * the broader range affected by the shift in xfs_prepare_shift().
+	 */
+	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
+		    FALLOC_FL_COLLAPSE_RANGE)) {
+		error = xfs_flush_unmap_range(ip, offset, len);
+		if (error)
+			goto out_unlock;
+	}
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d58f0d6a699e..3fb501c5efcc 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -665,6 +665,31 @@ xfs_ioc_space(
 		goto out_unlock;
 	}
 
+	/*
+	 * Must wait for all AIO to complete before we continue as AIO can
+	 * change the file size on completion without holding any locks we
+	 * currently hold. We must do this first because AIO can update both
+	 * the on disk and in memory inode sizes, and the operations that follow
+	 * require the in-memory size to be fully up-to-date.
+	 */
+	inode_dio_wait(inode);
+
+	/*
+	 * Now that AIO and DIO has drained we can flush and (if necessary)
+	 * invalidate the cached range over the first operation we are about to
+	 * run. We include zero range here because it starts with a hole punch
+	 * over the target range.
+	 */
+	switch (cmd) {
+	case XFS_IOC_ZERO_RANGE:
+	case XFS_IOC_UNRESVSP:
+	case XFS_IOC_UNRESVSP64:
+		error = xfs_flush_unmap_range(ip, bf->l_start, bf->l_len);
+		if (error)
+			goto out_unlock;
+		break;
+	}
+
 	switch (cmd) {
 	case XFS_IOC_ZERO_RANGE:
 		flags |= XFS_PREALLOC_SET;

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V2] xfs: properly serialise fallocate against AIO+DIO
  2019-11-20  2:38 ` Darrick J. Wong
@ 2019-11-21  6:50   ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2019-11-21  6:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Nov 19, 2019 at 06:38:21PM -0800, Darrick J. Wong wrote:
> Hmm, do you think this is a reasonable backport of this patch for
> pre-5.5?

Seems reasonable to me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-11-21  6:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-29 22:37 [PATCH V2] xfs: properly serialise fallocate against AIO+DIO Dave Chinner
2019-10-29 23:33 ` Darrick J. Wong
2019-10-29 23:54   ` Dave Chinner
2019-10-30 12:46 ` Brian Foster
2019-10-30 14:26 ` Christoph Hellwig
2019-10-31 21:06   ` Dave Chinner
2019-11-20  2:38 ` Darrick J. Wong
2019-11-21  6:50   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).