linux-xfs.vger.kernel.org archive mirror
* [PATCH 0/3] xfs: hold ilock across insert and collapse range
@ 2019-12-13 17:12 Brian Foster
  2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw)
  To: linux-xfs

Hi all,

This is a followup to the recent bugfix I sent on collapse range.
Dave suggested that insert/collapse should probably be atomic wrt
the ilock, so this series reworks both operations appropriately. This
survives a couple fstests runs and I'll be running an fsx test over
the weekend...

Brian

Brian Foster (3):
  xfs: open code insert range extent split helper
  xfs: rework insert range into an atomic operation
  xfs: rework collapse range into an atomic operation

 fs/xfs/libxfs/xfs_bmap.c | 32 ++--------------------
 fs/xfs/libxfs/xfs_bmap.h |  3 ++-
 fs/xfs/xfs_bmap_util.c   | 57 ++++++++++++++++++++++------------------
 3 files changed, 36 insertions(+), 56 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/3] xfs: open code insert range extent split helper
  2019-12-13 17:12 [PATCH 0/3] xfs: hold ilock across insert and collapse range Brian Foster
@ 2019-12-13 17:12 ` Brian Foster
  2019-12-17 17:02   ` Allison Collins
                     ` (2 more replies)
  2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster
  2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster
  2 siblings, 3 replies; 23+ messages in thread
From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw)
  To: linux-xfs

The insert range operation currently splits the extent at the target
offset in a separate transaction and lock cycle from the one that
shifts extents. In preparation for reworking insert range into an
atomic operation, lift the code into the caller so it can be easily
condensed to a single rolling transaction and lock cycle and
eliminate the helper. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------
 fs/xfs/libxfs/xfs_bmap.h |  3 ++-
 fs/xfs/xfs_bmap_util.c   | 14 +++++++++++++-
 3 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a9ad1f991ba3..2bba0f983e4f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents(
  * @split_fsb is a block where the extents is split.  If split_fsb lies in a
  * hole or the first block of extents, just return 0.
  */
-STATIC int
-xfs_bmap_split_extent_at(
+int
+xfs_bmap_split_extent(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		split_fsb)
@@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at(
 	return error;
 }
 
-int
-xfs_bmap_split_extent(
-	struct xfs_inode        *ip,
-	xfs_fileoff_t           split_fsb)
-{
-	struct xfs_mount        *mp = ip->i_mount;
-	struct xfs_trans        *tp;
-	int                     error;
-
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
-			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
-	if (error)
-		return error;
-
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
-
-	error = xfs_bmap_split_extent_at(tp, ip, split_fsb);
-	if (error)
-		goto out;
-
-	return xfs_trans_commit(tp);
-
-out:
-	xfs_trans_cancel(tp);
-	return error;
-}
-
 /* Deferred mapping is only for real extents in the data fork. */
 static bool
 xfs_bmap_is_update_needed(
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 14d25e0b7d9c..f3259ad5c22c 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -222,7 +222,8 @@ int	xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off,
 int	xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb,
 		bool *done, xfs_fileoff_t stop_fsb);
-int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
+int	xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip,
+		xfs_fileoff_t split_offset);
 int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
 		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
 		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2efd78a9719e..829ab1a804c9 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1139,7 +1139,19 @@ xfs_insert_file_space(
 	 * is not the starting block of extent, we need to split the extent at
 	 * stop_fsb.
 	 */
-	error = xfs_bmap_split_extent(ip, stop_fsb);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
+			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+
+	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
+	if (error)
+		goto out_trans_cancel;
+
+	error = xfs_trans_commit(tp);
 	if (error)
 		return error;
 
-- 
2.20.1



* [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-13 17:12 [PATCH 0/3] xfs: hold ilock across insert and collapse range Brian Foster
  2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster
@ 2019-12-13 17:12 ` Brian Foster
  2019-12-17 17:04   ` Allison Collins
                     ` (2 more replies)
  2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster
  2 siblings, 3 replies; 23+ messages in thread
From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw)
  To: linux-xfs

The insert range operation uses a unique transaction and ilock cycle
for the extent split and each extent shift iteration of the overall
operation. While this works, it risks racing with other
operations in subtle ways such as COW writeback modifying an extent
tree in the middle of a shift operation.

To avoid this problem, make insert range atomic with respect to
ilock. Hold the ilock across the entire operation, replace the
individual transactions with a single rolling transaction sequence
and relog the inode to keep it moving in the log. This guarantees
that nothing else can change the extent mapping of an inode while
an insert range operation is in progress.
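
As an illustration of the control-flow change (not part of the patch), the
old per-iteration lock cycling versus the reworked single lock cycle with
transaction rolls can be modeled in a small userspace sketch. All mock_*
names below are stand-ins I made up for the kernel APIs referenced above
(xfs_ilock, xfs_trans_roll_inode, xfs_bmap_insert_extents), not real XFS
interfaces:

```c
#include <stdbool.h>

/* Mock inode/transaction state: counts lock cycles and trans rolls. */
struct mock_inode { int lock_cycles; int extents_left; };
struct mock_trans { int rolls; };

static void mock_ilock(struct mock_inode *ip)   { ip->lock_cycles++; }
static void mock_iunlock(struct mock_inode *ip) { (void)ip; }

/* xfs_bmap_insert_extents() analogue: shift one batch, set *done at end. */
static int mock_insert_extents(struct mock_inode *ip, bool *done)
{
	if (--ip->extents_left <= 0)
		*done = true;
	return 0;
}

/*
 * Reworked flow: the lock is taken once and held across the whole
 * operation; between extent shift batches the transaction is rolled
 * (mock_trans.rolls) rather than committed and reallocated.
 */
static int mock_insert_file_space(struct mock_inode *ip, struct mock_trans *tp)
{
	bool done = false;

	mock_ilock(ip);			/* held across the whole operation */
	do {
		tp->rolls++;		/* xfs_trans_roll_inode() analogue */
		int error = mock_insert_extents(ip, &done);
		if (error) {
			mock_iunlock(ip);
			return error;
		}
	} while (!done);
	mock_iunlock(ip);
	return 0;
}
```

For five extent batches this takes exactly one lock cycle and five rolls,
whereas the old loop would have taken five lock cycles.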

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 829ab1a804c9..555c8b49a223 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1134,47 +1134,41 @@ xfs_insert_file_space(
 	if (error)
 		return error;
 
-	/*
-	 * The extent shifting code works on extent granularity. So, if stop_fsb
-	 * is not the starting block of extent, we need to split the extent at
-	 * stop_fsb.
-	 */
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
 			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
 	if (error)
 		return error;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
 
+	/*
+	 * The extent shifting code works on extent granularity. So, if stop_fsb
+	 * is not the starting block of extent, we need to split the extent at
+	 * stop_fsb.
+	 */
 	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
 	if (error)
 		goto out_trans_cancel;
 
-	error = xfs_trans_commit(tp);
-	if (error)
-		return error;
-
-	while (!error && !done) {
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
-					&tp);
+	do {
+		error = xfs_trans_roll_inode(&tp, ip);
 		if (error)
-			break;
+			goto out_trans_cancel;
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
 				&done, stop_fsb);
 		if (error)
 			goto out_trans_cancel;
+	} while (!done);
 
-		error = xfs_trans_commit(tp);
-	}
-
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 
 out_trans_cancel:
 	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
-- 
2.20.1



* [PATCH 3/3] xfs: rework collapse range into an atomic operation
  2019-12-13 17:12 [PATCH 0/3] xfs: hold ilock across insert and collapse range Brian Foster
  2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster
  2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster
@ 2019-12-13 17:12 ` Brian Foster
  2019-12-18  2:39   ` Darrick J. Wong
  2019-12-24 11:29   ` Christoph Hellwig
  2 siblings, 2 replies; 23+ messages in thread
From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw)
  To: linux-xfs

The collapse range operation uses a unique transaction and ilock
cycle for the hole punch and each extent shift iteration of the
overall operation. While the hole punch is safe as a separate
operation due to the iolock, cycling the ilock after each extent
shift is risky, similar to insert range.

To avoid this problem, make collapse range atomic with respect to
ilock. Hold the ilock across the entire operation, replace the
individual transactions with a single rolling transaction sequence
and finish dfops on each iteration to perform pending frees and roll
the transaction. Remove the unnecessary quota reservation as
collapse range can only ever merge extents (and thus remove extent
records and potentially free bmap blocks). The dfops call
automatically relogs the inode to keep it moving in the log. This
guarantees that nothing else can change the extent mapping of an
inode while a collapse range operation is in progress.
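
The dfops-per-iteration shape described above can be sketched with a small
userspace model (illustrative only; every mock_* name is an assumption
standing in for the kernel APIs named above, not a real XFS interface):

```c
#include <stdbool.h>

/* Mock state: extents left to shift, queued frees, transaction rolls. */
struct mock_ctx {
	int extents;		/* extents left to collapse */
	int pending_frees;	/* dfops queued but not yet finished */
	int rolls;		/* transaction rolls performed */
};

/* xfs_bmap_collapse_extents() analogue: shift one extent, queue a free. */
static int mock_collapse_extents(struct mock_ctx *c, bool *done)
{
	c->pending_frees++;
	if (--c->extents == 0)
		*done = true;
	return 0;
}

/* xfs_defer_finish() analogue: process pending frees, roll the trans. */
static int mock_defer_finish(struct mock_ctx *c)
{
	c->pending_frees = 0;
	c->rolls++;
	return 0;
}

/*
 * Reworked flow: each iteration with work remaining finishes deferred
 * frees and rolls; the final iteration breaks out and its pending work
 * rides along with the single commit at the end.
 */
static int mock_collapse_file_space(struct mock_ctx *c)
{
	bool done = false;

	while (!done) {
		if (mock_collapse_extents(c, &done))
			return -1;
		if (done)
			break;
		if (mock_defer_finish(c))
			return -1;
	}
	/* xfs_trans_commit() analogue: last pending frees go with commit */
	c->pending_frees = 0;
	return 0;
}
```

For three extents this rolls twice, with the third extent's free committed
by the final commit rather than a separate defer-finish pass.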

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 555c8b49a223..1c34a34997ca 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1050,7 +1050,6 @@ xfs_collapse_file_space(
 	int			error;
 	xfs_fileoff_t		next_fsb = XFS_B_TO_FSB(mp, offset + len);
 	xfs_fileoff_t		shift_fsb = XFS_B_TO_FSB(mp, len);
-	uint			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
 	bool			done = false;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
@@ -1066,32 +1065,34 @@ xfs_collapse_file_space(
 	if (error)
 		return error;
 
-	while (!error && !done) {
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
-					&tp);
-		if (error)
-			break;
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
+	if (error)
+		return error;
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot,
-				ip->i_gdquot, ip->i_pdquot, resblks, 0,
-				XFS_QMOPT_RES_REGBLKS);
-		if (error)
-			goto out_trans_cancel;
-		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
 
+	while (!done) {
 		error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb,
 				&done);
 		if (error)
 			goto out_trans_cancel;
+		if (done)
+			break;
 
-		error = xfs_trans_commit(tp);
+		/* finish any deferred frees and roll the transaction */
+		error = xfs_defer_finish(&tp);
+		if (error)
+			goto out_trans_cancel;
 	}
 
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 
 out_trans_cancel:
 	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
-- 
2.20.1



* Re: [PATCH 1/3] xfs: open code insert range extent split helper
  2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster
@ 2019-12-17 17:02   ` Allison Collins
  2019-12-17 22:21     ` Darrick J. Wong
  2019-12-18  2:16   ` Darrick J. Wong
  2019-12-24 11:18   ` Christoph Hellwig
  2 siblings, 1 reply; 23+ messages in thread
From: Allison Collins @ 2019-12-17 17:02 UTC (permalink / raw)
  To: Brian Foster, linux-xfs

On 12/13/19 10:12 AM, Brian Foster wrote:
> The insert range operation currently splits the extent at the target
> offset in a separate transaction and lock cycle from the one that
> shifts extents. In preparation for reworking insert range into an
> atomic operation, lift the code into the caller so it can be easily
> condensed to a single rolling transaction and lock cycle and
> eliminate the helper. No functional changes.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok to me.

Reviewed by: Allison Collins <allison.henderson@oracle.com>
> ---
>   fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------
>   fs/xfs/libxfs/xfs_bmap.h |  3 ++-
>   fs/xfs/xfs_bmap_util.c   | 14 +++++++++++++-
>   3 files changed, 17 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index a9ad1f991ba3..2bba0f983e4f 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents(
>    * @split_fsb is a block where the extents is split.  If split_fsb lies in a
>    * hole or the first block of extents, just return 0.
>    */
> -STATIC int
> -xfs_bmap_split_extent_at(
> +int
> +xfs_bmap_split_extent(
>   	struct xfs_trans	*tp,
>   	struct xfs_inode	*ip,
>   	xfs_fileoff_t		split_fsb)
> @@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at(
>   	return error;
>   }
>   
> -int
> -xfs_bmap_split_extent(
> -	struct xfs_inode        *ip,
> -	xfs_fileoff_t           split_fsb)
> -{
> -	struct xfs_mount        *mp = ip->i_mount;
> -	struct xfs_trans        *tp;
> -	int                     error;
> -
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> -			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> -	if (error)
> -		return error;
> -
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> -
> -	error = xfs_bmap_split_extent_at(tp, ip, split_fsb);
> -	if (error)
> -		goto out;
> -
> -	return xfs_trans_commit(tp);
> -
> -out:
> -	xfs_trans_cancel(tp);
> -	return error;
> -}
> -
>   /* Deferred mapping is only for real extents in the data fork. */
>   static bool
>   xfs_bmap_is_update_needed(
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 14d25e0b7d9c..f3259ad5c22c 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -222,7 +222,8 @@ int	xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off,
>   int	xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip,
>   		xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb,
>   		bool *done, xfs_fileoff_t stop_fsb);
> -int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
> +int	xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip,
> +		xfs_fileoff_t split_offset);
>   int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
>   		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
>   		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 2efd78a9719e..829ab1a804c9 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1139,7 +1139,19 @@ xfs_insert_file_space(
>   	 * is not the starting block of extent, we need to split the extent at
>   	 * stop_fsb.
>   	 */
> -	error = xfs_bmap_split_extent(ip, stop_fsb);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> +			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +
> +	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> +	if (error)
> +		goto out_trans_cancel;
> +
> +	error = xfs_trans_commit(tp);
>   	if (error)
>   		return error;
>   
> 


* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster
@ 2019-12-17 17:04   ` Allison Collins
  2019-12-18  2:37   ` Darrick J. Wong
  2019-12-24 11:26   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Allison Collins @ 2019-12-17 17:04 UTC (permalink / raw)
  To: Brian Foster, linux-xfs

On 12/13/19 10:12 AM, Brian Foster wrote:
> The insert range operation uses a unique transaction and ilock cycle
> for the extent split and each extent shift iteration of the overall
> operation. While this works, it risks racing with other
> operations in subtle ways such as COW writeback modifying an extent
> tree in the middle of a shift operation.
> 
> To avoid this problem, make insert range atomic with respect to
> ilock. Hold the ilock across the entire operation, replace the
> individual transactions with a single rolling transaction sequence
> and relog the inode to keep it moving in the log. This guarantees
> that nothing else can change the extent mapping of an inode while
> an insert range operation is in progress.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Ok, this looks good to me.  Thanks!

Reviewed-by: Allison Collins <allison.henderson@oracle.com>
> ---
>   fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
>   1 file changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 829ab1a804c9..555c8b49a223 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
>   	if (error)
>   		return error;
>   
> -	/*
> -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> -	 * is not the starting block of extent, we need to split the extent at
> -	 * stop_fsb.
> -	 */
>   	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
>   			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
>   	if (error)
>   		return error;
>   
>   	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
>   
> +	/*
> +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> +	 * is not the starting block of extent, we need to split the extent at
> +	 * stop_fsb.
> +	 */
>   	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
>   	if (error)
>   		goto out_trans_cancel;
>   
> -	error = xfs_trans_commit(tp);
> -	if (error)
> -		return error;
> -
> -	while (!error && !done) {
> -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> -					&tp);
> +	do {
> +		error = xfs_trans_roll_inode(&tp, ip);
>   		if (error)
> -			break;
> +			goto out_trans_cancel;
>   
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>   		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
>   				&done, stop_fsb);
>   		if (error)
>   			goto out_trans_cancel;
> +	} while (!done);
>   
> -		error = xfs_trans_commit(tp);
> -	}
> -
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   	return error;
>   
>   out_trans_cancel:
>   	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   	return error;
>   }
>   
> 


* Re: [PATCH 1/3] xfs: open code insert range extent split helper
  2019-12-17 17:02   ` Allison Collins
@ 2019-12-17 22:21     ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-17 22:21 UTC (permalink / raw)
  To: Allison Collins; +Cc: Brian Foster, linux-xfs

On Tue, Dec 17, 2019 at 10:02:42AM -0700, Allison Collins wrote:
> On 12/13/19 10:12 AM, Brian Foster wrote:
> > The insert range operation currently splits the extent at the target
> > offset in a separate transaction and lock cycle from the one that
> > shifts extents. In preparation for reworking insert range into an
> > atomic operation, lift the code into the caller so it can be easily
> > condensed to a single rolling transaction and lock cycle and
> > eliminate the helper. No functional changes.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Looks ok to me.
> 
> Reviewed by: Allison Collins <allison.henderson@oracle.com>

"Reviewed-by", with dash?

--D

> > ---
> >   fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------
> >   fs/xfs/libxfs/xfs_bmap.h |  3 ++-
> >   fs/xfs/xfs_bmap_util.c   | 14 +++++++++++++-
> >   3 files changed, 17 insertions(+), 32 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index a9ad1f991ba3..2bba0f983e4f 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents(
> >    * @split_fsb is a block where the extents is split.  If split_fsb lies in a
> >    * hole or the first block of extents, just return 0.
> >    */
> > -STATIC int
> > -xfs_bmap_split_extent_at(
> > +int
> > +xfs_bmap_split_extent(
> >   	struct xfs_trans	*tp,
> >   	struct xfs_inode	*ip,
> >   	xfs_fileoff_t		split_fsb)
> > @@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at(
> >   	return error;
> >   }
> > -int
> > -xfs_bmap_split_extent(
> > -	struct xfs_inode        *ip,
> > -	xfs_fileoff_t           split_fsb)
> > -{
> > -	struct xfs_mount        *mp = ip->i_mount;
> > -	struct xfs_trans        *tp;
> > -	int                     error;
> > -
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > -			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > -	if (error)
> > -		return error;
> > -
> > -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > -
> > -	error = xfs_bmap_split_extent_at(tp, ip, split_fsb);
> > -	if (error)
> > -		goto out;
> > -
> > -	return xfs_trans_commit(tp);
> > -
> > -out:
> > -	xfs_trans_cancel(tp);
> > -	return error;
> > -}
> > -
> >   /* Deferred mapping is only for real extents in the data fork. */
> >   static bool
> >   xfs_bmap_is_update_needed(
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 14d25e0b7d9c..f3259ad5c22c 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -222,7 +222,8 @@ int	xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off,
> >   int	xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip,
> >   		xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb,
> >   		bool *done, xfs_fileoff_t stop_fsb);
> > -int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
> > +int	xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip,
> > +		xfs_fileoff_t split_offset);
> >   int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
> >   		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
> >   		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 2efd78a9719e..829ab1a804c9 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1139,7 +1139,19 @@ xfs_insert_file_space(
> >   	 * is not the starting block of extent, we need to split the extent at
> >   	 * stop_fsb.
> >   	 */
> > -	error = xfs_bmap_split_extent(ip, stop_fsb);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > +			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > +	if (error)
> > +		return error;
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > +
> > +	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > +	if (error)
> > +		goto out_trans_cancel;
> > +
> > +	error = xfs_trans_commit(tp);
> >   	if (error)
> >   		return error;
> > 


* Re: [PATCH 1/3] xfs: open code insert range extent split helper
  2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster
  2019-12-17 17:02   ` Allison Collins
@ 2019-12-18  2:16   ` Darrick J. Wong
  2019-12-24 11:18   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-18  2:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Dec 13, 2019 at 12:12:56PM -0500, Brian Foster wrote:
> The insert range operation currently splits the extent at the target
> offset in a separate transaction and lock cycle from the one that
> shifts extents. In preparation for reworking insert range into an
> atomic operation, lift the code into the caller so it can be easily
> condensed to a single rolling transaction and lock cycle and
> eliminate the helper. No functional changes.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------
>  fs/xfs/libxfs/xfs_bmap.h |  3 ++-
>  fs/xfs/xfs_bmap_util.c   | 14 +++++++++++++-
>  3 files changed, 17 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index a9ad1f991ba3..2bba0f983e4f 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents(
>   * @split_fsb is a block where the extents is split.  If split_fsb lies in a
>   * hole or the first block of extents, just return 0.
>   */
> -STATIC int
> -xfs_bmap_split_extent_at(
> +int
> +xfs_bmap_split_extent(
>  	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	xfs_fileoff_t		split_fsb)
> @@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at(
>  	return error;
>  }
>  
> -int
> -xfs_bmap_split_extent(
> -	struct xfs_inode        *ip,
> -	xfs_fileoff_t           split_fsb)
> -{
> -	struct xfs_mount        *mp = ip->i_mount;
> -	struct xfs_trans        *tp;
> -	int                     error;
> -
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> -			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> -	if (error)
> -		return error;
> -
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> -
> -	error = xfs_bmap_split_extent_at(tp, ip, split_fsb);
> -	if (error)
> -		goto out;
> -
> -	return xfs_trans_commit(tp);
> -
> -out:
> -	xfs_trans_cancel(tp);
> -	return error;
> -}
> -
>  /* Deferred mapping is only for real extents in the data fork. */
>  static bool
>  xfs_bmap_is_update_needed(
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 14d25e0b7d9c..f3259ad5c22c 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -222,7 +222,8 @@ int	xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off,
>  int	xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb,
>  		bool *done, xfs_fileoff_t stop_fsb);
> -int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
> +int	xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip,
> +		xfs_fileoff_t split_offset);
>  int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
>  		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
>  		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 2efd78a9719e..829ab1a804c9 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1139,7 +1139,19 @@ xfs_insert_file_space(
>  	 * is not the starting block of extent, we need to split the extent at
>  	 * stop_fsb.
>  	 */
> -	error = xfs_bmap_split_extent(ip, stop_fsb);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> +			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +
> +	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> +	if (error)
> +		goto out_trans_cancel;
> +
> +	error = xfs_trans_commit(tp);
>  	if (error)
>  		return error;
>  
> -- 
> 2.20.1
> 


* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster
  2019-12-17 17:04   ` Allison Collins
@ 2019-12-18  2:37   ` Darrick J. Wong
  2019-12-18 12:10     ` Brian Foster
  2019-12-24 11:26   ` Christoph Hellwig
  2 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-18  2:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> The insert range operation uses a unique transaction and ilock cycle
> for the extent split and each extent shift iteration of the overall
> operation. While this works, it risks racing with other
> operations in subtle ways such as COW writeback modifying an extent
> tree in the middle of a shift operation.
> 
> To avoid this problem, make insert range atomic with respect to
> ilock. Hold the ilock across the entire operation, replace the
> individual transactions with a single rolling transaction sequence
> and relog the inode to keep it moving in the log. This guarantees
> that nothing else can change the extent mapping of an inode while
> an insert range operation is in progress.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
>  1 file changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 829ab1a804c9..555c8b49a223 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
>  	if (error)
>  		return error;
>  
> -	/*
> -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> -	 * is not the starting block of extent, we need to split the extent at
> -	 * stop_fsb.
> -	 */
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
>  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
>  	if (error)
>  		return error;
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
>  
> +	/*
> +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> +	 * is not the starting block of extent, we need to split the extent at
> +	 * stop_fsb.
> +	 */
>  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	error = xfs_trans_commit(tp);
> -	if (error)
> -		return error;
> -
> -	while (!error && !done) {
> -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> -					&tp);

I'm a little concerned about the livelock potential here, if there are a lot of
other threads that have eaten all the transaction reservation and are trying to
get our ILOCK, while at the same time this thread has the ILOCK and is trying
to roll the transaction to move another extent, having already rolled the
transaction more than logcount times.

I think the extent shifting loop starts with the highest offset mapping and
shifts it up and continues in order of decreasing offset until it gets to
@stop_fsb, correct?

Can we use "alloc trans; ilock; move; commit" for every extent higher than the
one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
tr_write pre-reserves enough space to that the roll won't need to get more,
which would eliminate that potential problem, I think.

--D

> +	do {
> +		error = xfs_trans_roll_inode(&tp, ip);
>  		if (error)
> -			break;
> +			goto out_trans_cancel;
>  
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
>  				&done, stop_fsb);
>  		if (error)
>  			goto out_trans_cancel;
> +	} while (!done);
>  
> -		error = xfs_trans_commit(tp);
> -	}
> -
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  }
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation
  2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster
@ 2019-12-18  2:39   ` Darrick J. Wong
  2019-12-18 12:11     ` Brian Foster
  2019-12-24 11:29   ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-18  2:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote:
> The collapse range operation uses a unique transaction and ilock
> cycle for the hole punch and each extent shift iteration of the
> overall operation. While the hole punch is safe as a separate
> operation due to the iolock, cycling the ilock after each extent
> shift is risky, similar to insert range.

It is?  I thought collapse range was safe because we started by punching
out the doomed range and shifting downwards, which eliminates the
problems that come with two adjacent mappings that could be combined?

<confused?>

--D

> To avoid this problem, make collapse range atomic with respect to
> ilock. Hold the ilock across the entire operation, replace the
> individual transactions with a single rolling transaction sequence
> and finish dfops on each iteration to perform pending frees and roll
> the transaction. Remove the unnecessary quota reservation as
> collapse range can only ever merge extents (and thus remove extent
> records and potentially free bmap blocks). The dfops call
> automatically relogs the inode to keep it moving in the log. This
> guarantees that nothing else can change the extent mapping of an
> inode while a collapse range operation is in progress.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++--------------
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 555c8b49a223..1c34a34997ca 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1050,7 +1050,6 @@ xfs_collapse_file_space(
>  	int			error;
>  	xfs_fileoff_t		next_fsb = XFS_B_TO_FSB(mp, offset + len);
>  	xfs_fileoff_t		shift_fsb = XFS_B_TO_FSB(mp, len);
> -	uint			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
>  	bool			done = false;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> @@ -1066,32 +1065,34 @@ xfs_collapse_file_space(
>  	if (error)
>  		return error;
>  
> -	while (!error && !done) {
> -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
> -					&tp);
> -		if (error)
> -			break;
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> +	if (error)
> +		return error;
>  
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot,
> -				ip->i_gdquot, ip->i_pdquot, resblks, 0,
> -				XFS_QMOPT_RES_REGBLKS);
> -		if (error)
> -			goto out_trans_cancel;
> -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
>  
> +	while (!done) {
>  		error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb,
>  				&done);
>  		if (error)
>  			goto out_trans_cancel;
> +		if (done)
> +			break;
>  
> -		error = xfs_trans_commit(tp);
> +		/* finish any deferred frees and roll the transaction */
> +		error = xfs_defer_finish(&tp);
> +		if (error)
> +			goto out_trans_cancel;
>  	}
>  
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  }
>  
> -- 
> 2.20.1
> 


* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-18  2:37   ` Darrick J. Wong
@ 2019-12-18 12:10     ` Brian Foster
  2019-12-18 21:15       ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2019-12-18 12:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > The insert range operation uses a unique transaction and ilock cycle
> > for the extent split and each extent shift iteration of the overall
> > operation. While this works, it risks racing with other
> > operations in subtle ways such as COW writeback modifying an extent
> > tree in the middle of a shift operation.
> > 
> > To avoid this problem, make insert range atomic with respect to
> > ilock. Hold the ilock across the entire operation, replace the
> > individual transactions with a single rolling transaction sequence
> > and relog the inode to keep it moving in the log. This guarantees
> > that nothing else can change the extent mapping of an inode while
> > an insert range operation is in progress.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> >  1 file changed, 13 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 829ab1a804c9..555c8b49a223 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> >  	if (error)
> >  		return error;
> >  
> > -	/*
> > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > -	 * is not the starting block of extent, we need to split the extent at
> > -	 * stop_fsb.
> > -	 */
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> >  	if (error)
> >  		return error;
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, 0);
> >  
> > +	/*
> > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > +	 * is not the starting block of extent, we need to split the extent at
> > +	 * stop_fsb.
> > +	 */
> >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> >  	if (error)
> >  		goto out_trans_cancel;
> >  
> > -	error = xfs_trans_commit(tp);
> > -	if (error)
> > -		return error;
> > -
> > -	while (!error && !done) {
> > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > -					&tp);
> 
> I'm a little concerned about the livelock potential here, if there are a lot of
> other threads that have eaten all the transaction reservation and are trying to
> get our ILOCK, while at the same time this thread has the ILOCK and is trying
> to roll the transaction to move another extent, having already rolled the
> transaction more than logcount times.
> 

My understanding is that the regrant mechanism is intended to deal with
that scenario. Even after the initial (res * logcount) reservation is
consumed, a regrant is different from an initial reservation in that the
reservation head is unconditionally updated with a new reservation unit.
We do wait on the write head in regrant, but IIUC that should be limited
to the pool of already allocated transactions (and is woken before the
reserve head waiters IIRC). I suppose something like this might be
possible in theory if we were blocked on regrant and the entirety of
remaining log space was consumed by transactions waiting on our ilock,
but I think that is highly unlikely since we also hold the iolock here.

Also note that this approach is based on the current truncate algorithm,
which is probably a better barometer of potential for this kind of issue
as it is a less specialized operation. I'd argue that if this is safe
enough for truncate, it should be safe enough for range shifting.
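For reference, the truncate pattern being compared against looks roughly like
this (a simplified sketch from memory of xfs_itruncate_extents_flags(), not
the literal code):

```c
/*
 * Sketch: the ILOCK is held across the entire loop; each iteration
 * unmaps a batch of extents and then finishes the deferred frees,
 * which rolls the transaction and regrants log reservation.
 */
xfs_ilock(ip, XFS_ILOCK_EXCL);
xfs_trans_ijoin(tp, ip, 0);
while (unmap_len > 0) {
	error = xfs_bunmapi(tp, ip, first_unmap_block, unmap_len,
			flags, XFS_ITRUNC_MAX_EXTENTS, &done);
	if (error)
		goto out;
	/* free the blocks just unmapped and roll the transaction */
	error = xfs_defer_finish(&tp);
	if (error)
		goto out;
}
```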

> I think the extent shifting loop starts with the highest offset mapping and
> shifts it up and continues in order of decreasing offset until it gets to
>  @stop_fsb, correct?
> 

Yep, for insert range at least.

> Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > tr_write pre-reserves enough space so that the roll won't need to get more,
> which would eliminate that potential problem, I think.
> 

We'd have to reorder the extent split for that kind of approach, which I
think you've noted in the sequence above, as the race window is between
the split and subsequent shift. Otherwise I think that would work.

That said, I'd prefer not to introduce the extra complexity and
functional variance unless it were absolutely necessary, and it's not
clear to me that it is. If it is, we'd probably have seen similar issues
in truncate and should target a fix there before worrying about range
shift.

Brian

> --D
> 
> > +	do {
> > +		error = xfs_trans_roll_inode(&tp, ip);
> >  		if (error)
> > -			break;
> > +			goto out_trans_cancel;
> >  
> > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> >  				&done, stop_fsb);
> >  		if (error)
> >  			goto out_trans_cancel;
> > +	} while (!done);
> >  
> > -		error = xfs_trans_commit(tp);
> > -	}
> > -
> > +	error = xfs_trans_commit(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	return error;
> >  
> >  out_trans_cancel:
> >  	xfs_trans_cancel(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	return error;
> >  }
> >  
> > -- 
> > 2.20.1
> > 
> 



* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation
  2019-12-18  2:39   ` Darrick J. Wong
@ 2019-12-18 12:11     ` Brian Foster
  2019-12-18 21:19       ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2019-12-18 12:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Dec 17, 2019 at 06:39:58PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote:
> > The collapse range operation uses a unique transaction and ilock
> > cycle for the hole punch and each extent shift iteration of the
> > overall operation. While the hole punch is safe as a separate
> > operation due to the iolock, cycling the ilock after each extent
> > shift is risky, similar to insert range.
> 
> It is?  I thought collapse range was safe because we started by punching
> out the doomed range and shifting downwards, which eliminates the
> problems that come with two adjacent mappings that could be combined?
> 
> <confused?>
> 

This is somewhat vague wording. I don't mean to say the same bug is
possible on collapse. Indeed, I don't think that it is. What I mean to
say is just that cycling the ilock generally opens the operation up to
concurrent extent changes, similar to the behavior that resulted in the
insert range bug and against the general design principle of the
operation (as implied by the iolock hold -> flush -> unmap preparation
sequence).

IOW, it seems to me that a similar behavior is possible on collapse, it
just might occur after an extent has been shifted into its new target
range rather than before. That wouldn't be a corruption/bug because it
doesn't change the semantics of the shift operation or the content of
the file, but it's subtle and arguably a misbehavior and/or landmine.

Brian

> --D
> 
> > To avoid this problem, make collapse range atomic with respect to
> > ilock. Hold the ilock across the entire operation, replace the
> > individual transactions with a single rolling transaction sequence
> > and finish dfops on each iteration to perform pending frees and roll
> > the transaction. Remove the unnecessary quota reservation as
> > collapse range can only ever merge extents (and thus remove extent
> > records and potentially free bmap blocks). The dfops call
> > automatically relogs the inode to keep it moving in the log. This
> > guarantees that nothing else can change the extent mapping of an
> > inode while a collapse range operation is in progress.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++--------------
> >  1 file changed, 15 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 555c8b49a223..1c34a34997ca 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1050,7 +1050,6 @@ xfs_collapse_file_space(
> >  	int			error;
> >  	xfs_fileoff_t		next_fsb = XFS_B_TO_FSB(mp, offset + len);
> >  	xfs_fileoff_t		shift_fsb = XFS_B_TO_FSB(mp, len);
> > -	uint			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> >  	bool			done = false;
> >  
> >  	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> > @@ -1066,32 +1065,34 @@ xfs_collapse_file_space(
> >  	if (error)
> >  		return error;
> >  
> > -	while (!error && !done) {
> > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
> > -					&tp);
> > -		if (error)
> > -			break;
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> > +	if (error)
> > +		return error;
> >  
> > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -		error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot,
> > -				ip->i_gdquot, ip->i_pdquot, resblks, 0,
> > -				XFS_QMOPT_RES_REGBLKS);
> > -		if (error)
> > -			goto out_trans_cancel;
> > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, 0);
> >  
> > +	while (!done) {
> >  		error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb,
> >  				&done);
> >  		if (error)
> >  			goto out_trans_cancel;
> > +		if (done)
> > +			break;
> >  
> > -		error = xfs_trans_commit(tp);
> > +		/* finish any deferred frees and roll the transaction */
> > +		error = xfs_defer_finish(&tp);
> > +		if (error)
> > +			goto out_trans_cancel;
> >  	}
> >  
> > +	error = xfs_trans_commit(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	return error;
> >  
> >  out_trans_cancel:
> >  	xfs_trans_cancel(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	return error;
> >  }
> >  
> > -- 
> > 2.20.1
> > 
> 



* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-18 12:10     ` Brian Foster
@ 2019-12-18 21:15       ` Darrick J. Wong
  2019-12-19 11:55         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-18 21:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > The insert range operation uses a unique transaction and ilock cycle
> > > for the extent split and each extent shift iteration of the overall
> > > operation. While this works, it risks racing with other
> > > operations in subtle ways such as COW writeback modifying an extent
> > > tree in the middle of a shift operation.
> > > 
> > > To avoid this problem, make insert range atomic with respect to
> > > ilock. Hold the ilock across the entire operation, replace the
> > > individual transactions with a single rolling transaction sequence
> > > and relog the inode to keep it moving in the log. This guarantees
> > > that nothing else can change the extent mapping of an inode while
> > > an insert range operation is in progress.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index 829ab1a804c9..555c8b49a223 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > >  	if (error)
> > >  		return error;
> > >  
> > > -	/*
> > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > -	 * is not the starting block of extent, we need to split the extent at
> > > -	 * stop_fsb.
> > > -	 */
> > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > >  	if (error)
> > >  		return error;
> > >  
> > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > >  
> > > +	/*
> > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > +	 * is not the starting block of extent, we need to split the extent at
> > > +	 * stop_fsb.
> > > +	 */
> > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > >  	if (error)
> > >  		goto out_trans_cancel;
> > >  
> > > -	error = xfs_trans_commit(tp);
> > > -	if (error)
> > > -		return error;
> > > -
> > > -	while (!error && !done) {
> > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > -					&tp);
> > 
> > I'm a little concerned about the livelock potential here, if there are a lot of
> > other threads that have eaten all the transaction reservation and are trying to
> > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > to roll the transaction to move another extent, having already rolled the
> > transaction more than logcount times.
> > 
> 
> My understanding is that the regrant mechanism is intended to deal with
> that scenario. Even after the initial (res * logcount) reservation is
> consumed, a regrant is different from an initial reservation in that the
> reservation head is unconditionally updated with a new reservation unit.
> We do wait on the write head in regrant, but IIUC that should be limited
> to the pool of already allocated transactions (and is woken before the
> reserve head waiters IIRC). I suppose something like this might be
> possible in theory if we were blocked on regrant and the entirety of
> remaining log space was consumed by transactions waiting on our ilock,
> but I think that is highly unlikely since we also hold the iolock here.

True.  The only time I saw this happen was with buffered COW writeback
completions (which hold no locks other than the ILOCK), which should have been
fixed by the patch I made to put all the writeback items to a single
inode queue and run /one/ worker thread to process them all.  So maybe
my fears are unfounded nowadays. :)

The only other place I can think of that does a lot of transaction
rolling on a single inode is online repair, and it always holds all
three exclusive locks.

> Also note that this approach is based on the current truncate algorithm,
> which is probably a better barometer of potential for this kind of issue
> as it is a less specialized operation. I'd argue that if this is safe
> enough for truncate, it should be safe enough for range shifting.

Hehehe.

> > I think the extent shifting loop starts with the highest offset mapping and
> > shifts it up and continues in order of decreasing offset until it gets to
> >  @stop_fsb, correct?
> > 
> 
> Yep, for insert range at least.
> 
> > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > tr_write pre-reserves enough space so that the roll won't need to get more,
> > which would eliminate that potential problem, I think.
> > 
> 
> We'd have to reorder the extent split for that kind of approach, which I
> think you've noted in the sequence above, as the race window is between
> the split and subsequent shift. Otherwise I think that would work.
> 
> That said, I'd prefer not to introduce the extra complexity and
> functional variance unless it were absolutely necessary, and it's not
> clear to me that it is. If it is, we'd probably have seen similar issues
> in truncate and should target a fix there before worrying about range
> shift.

Ok.  Looking back through lore I don't see any complaints about insert
range, so I guess it's fine.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> Brian
> 
> > --D
> > 
> > > +	do {
> > > +		error = xfs_trans_roll_inode(&tp, ip);
> > >  		if (error)
> > > -			break;
> > > +			goto out_trans_cancel;
> > >  
> > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > >  				&done, stop_fsb);
> > >  		if (error)
> > >  			goto out_trans_cancel;
> > > +	} while (!done);
> > >  
> > > -		error = xfs_trans_commit(tp);
> > > -	}
> > > -
> > > +	error = xfs_trans_commit(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  	return error;
> > >  
> > >  out_trans_cancel:
> > >  	xfs_trans_cancel(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  	return error;
> > >  }
> > >  
> > > -- 
> > > 2.20.1
> > > 
> > 
> 


* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation
  2019-12-18 12:11     ` Brian Foster
@ 2019-12-18 21:19       ` Darrick J. Wong
  2019-12-19 11:56         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-18 21:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Dec 18, 2019 at 07:11:20AM -0500, Brian Foster wrote:
> On Tue, Dec 17, 2019 at 06:39:58PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote:
> > > The collapse range operation uses a unique transaction and ilock
> > > cycle for the hole punch and each extent shift iteration of the
> > > overall operation. While the hole punch is safe as a separate
> > > operation due to the iolock, cycling the ilock after each extent
> > > shift is risky, similar to insert range.
> > 
> > It is?  I thought collapse range was safe because we started by punching
> > out the doomed range and shifting downwards, which eliminates the
> > problems that come with two adjacent mappings that could be combined?
> > 
> > <confused?>
> > 
> 
> This is somewhat vague wording. I don't mean to say the same bug is
> possible on collapse. Indeed, I don't think that it is. What I mean to
> say is just that cycling the ilock generally opens the operation up to
> concurrent extent changes, similar to the behavior that resulted in the
> insert range bug and against the general design principle of the
> operation (as implied by the iolock hold -> flush -> unmap preparation
> sequence).

Oh, ok, you're merely trying to prevent anyone from seeing the inode
metadata while we're in the middle of a collapse-range operation.  I
wonder then if we need to take a look at the remap range operations, but
oh wow is that a gnarly mess of inode locking. :)

> IOW, it seems to me that a similar behavior is possible on collapse, it
> just might occur after an extent has been shifted into its new target
> range rather than before. That wouldn't be a corruption/bug because it
> doesn't change the semantics of the shift operation or the content of
> the file, but it's subtle and arguably a misbehavior and/or landmine.

<nod> Ok, just making sure. :)

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> Brian
> 
> > --D
> > 
> > > To avoid this problem, make collapse range atomic with respect to
> > > ilock. Hold the ilock across the entire operation, replace the
> > > individual transactions with a single rolling transaction sequence
> > > and finish dfops on each iteration to perform pending frees and roll
> > > the transaction. Remove the unnecessary quota reservation as
> > > collapse range can only ever merge extents (and thus remove extent
> > > records and potentially free bmap blocks). The dfops call
> > > automatically relogs the inode to keep it moving in the log. This
> > > guarantees that nothing else can change the extent mapping of an
> > > inode while a collapse range operation is in progress.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++--------------
> > >  1 file changed, 15 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index 555c8b49a223..1c34a34997ca 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -1050,7 +1050,6 @@ xfs_collapse_file_space(
> > >  	int			error;
> > >  	xfs_fileoff_t		next_fsb = XFS_B_TO_FSB(mp, offset + len);
> > >  	xfs_fileoff_t		shift_fsb = XFS_B_TO_FSB(mp, len);
> > > -	uint			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> > >  	bool			done = false;
> > >  
> > >  	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> > > @@ -1066,32 +1065,34 @@ xfs_collapse_file_space(
> > >  	if (error)
> > >  		return error;
> > >  
> > > -	while (!error && !done) {
> > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
> > > -					&tp);
> > > -		if (error)
> > > -			break;
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> > > +	if (error)
> > > +		return error;
> > >  
> > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -		error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot,
> > > -				ip->i_gdquot, ip->i_pdquot, resblks, 0,
> > > -				XFS_QMOPT_RES_REGBLKS);
> > > -		if (error)
> > > -			goto out_trans_cancel;
> > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > >  
> > > +	while (!done) {
> > >  		error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb,
> > >  				&done);
> > >  		if (error)
> > >  			goto out_trans_cancel;
> > > +		if (done)
> > > +			break;
> > >  
> > > -		error = xfs_trans_commit(tp);
> > > +		/* finish any deferred frees and roll the transaction */
> > > +		error = xfs_defer_finish(&tp);
> > > +		if (error)
> > > +			goto out_trans_cancel;
> > >  	}
> > >  
> > > +	error = xfs_trans_commit(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  	return error;
> > >  
> > >  out_trans_cancel:
> > >  	xfs_trans_cancel(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  	return error;
> > >  }
> > >  
> > > -- 
> > > 2.20.1
> > > 
> > 
> 


* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-18 21:15       ` Darrick J. Wong
@ 2019-12-19 11:55         ` Brian Foster
  2019-12-20 20:17           ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2019-12-19 11:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > > The insert range operation uses a unique transaction and ilock cycle
> > > > for the extent split and each extent shift iteration of the overall
> > > > operation. While this works, it risks racing with other
> > > > operations in subtle ways such as COW writeback modifying an extent
> > > > tree in the middle of a shift operation.
> > > > 
> > > > To avoid this problem, make insert range atomic with respect to
> > > > ilock. Hold the ilock across the entire operation, replace the
> > > > individual transactions with a single rolling transaction sequence
> > > > and relog the inode to keep it moving in the log. This guarantees
> > > > that nothing else can change the extent mapping of an inode while
> > > > an insert range operation is in progress.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > index 829ab1a804c9..555c8b49a223 100644
> > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > > >  	if (error)
> > > >  		return error;
> > > >  
> > > > -	/*
> > > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > -	 * is not the starting block of extent, we need to split the extent at
> > > > -	 * stop_fsb.
> > > > -	 */
> > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > > >  	if (error)
> > > >  		return error;
> > > >  
> > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > >  
> > > > +	/*
> > > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > +	 * is not the starting block of extent, we need to split the extent at
> > > > +	 * stop_fsb.
> > > > +	 */
> > > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > > >  	if (error)
> > > >  		goto out_trans_cancel;
> > > >  
> > > > -	error = xfs_trans_commit(tp);
> > > > -	if (error)
> > > > -		return error;
> > > > -
> > > > -	while (!error && !done) {
> > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > > -					&tp);
> > > 
> > > I'm a little concerned about the livelock potential here, if there are a lot of
> > > other threads that have eaten all the transaction reservation and are trying to
> > > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > > to roll the transaction to move another extent, having already rolled the
> > > transaction more than logcount times.
> > > 
> > 
> > My understanding is that the regrant mechanism is intended to deal with
> > that scenario. Even after the initial (res * logcount) reservation is
> > consumed, a regrant is different from an initial reservation in that the
> > reservation head is unconditionally updated with a new reservation unit.
> > We do wait on the write head in regrant, but IIUC that should be limited
> > to the pool of already allocated transactions (and is woken before the
> > reserve head waiters IIRC). I suppose something like this might be
> > possible in theory if we were blocked on regrant and the entirety of
> > remaining log space was consumed by transactions waiting on our ilock,
> > but I think that is highly unlikely since we also hold the iolock here.
> 
> True.  The only time I saw this happen was with buffered COW writeback
> completions (which hold no locks other than the ILOCK), which should have been
> fixed by the patch I made to put all the writeback items to a single
> inode queue and run /one/ worker thread to process them all.  So maybe
> my fears are unfounded nowadays. :)
> 

Ok. If I'm following correctly, that goes back to this[1] commit. The
commit log describes a situation where we basically have unbound,
concurrent attempts to do COW remappings on writeback completion.
That leads to heavy ilock contention and the potential for deadlocks
between log reservation and inode locks. Out of curiosity, was that old
scheme bad enough to reproduce that kind of deadlock or was the deadlock
description based on code analysis?

I am a bit curious how "deadlock avoidant" the current transaction
rolling/regrant mechanism is intended to be on its own vs. how much
responsibility is borne by the implementation of certain operations
(like truncate holding IOLOCK and assuming that sufficiently restricts
object access). ISTM that it's technically possible for this lockup
state to occur, but it relies on something unusual like the above where
we allow enough unbound (open reservation -> lock) concurrency on a
single metadata object to exhaust all log space.

[1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues")

> The only other place I can think of that does a lot of transaction
> rolling on a single inode is online repair, and it always holds all
> three exclusive locks.
> 

So I guess that should similarly hold off transaction reservation from
userspace intended to modify the particular inode, provided the XFS
internals play nice (re: your workqueue fix above).

> > Also note that this approach is based on the current truncate algorithm,
> > which is probably a better barometer of potential for this kind of issue
> > as it is a less specialized operation. I'd argue that if this is safe
> > enough for truncate, it should be safe enough for range shifting.
> 
> Hehehe.
> 
> > > I think the extent shifting loop starts with the highest offset mapping and
> > > shifts it up and continues in order of decreasing offset until it gets to
> > >  @stop_fsb, correct?
> > > 
> > 
> > Yep, for insert range at least.
> > 
> > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > > tr_write pre-reserves enough space so that the roll won't need to get more,
> > > which would eliminate that potential problem, I think.
> > > 
> > 
> > We'd have to reorder the extent split for that kind of approach, which I
> > think you've noted in the sequence above, as the race window is between
> > the split and subsequent shift. Otherwise I think that would work.
> > 
> > That said, I'd prefer not to introduce the extra complexity and
> > functional variance unless it were absolutely necessary, and it's not
> > clear to me that it is. If it is, we'd probably have seen similar issues
> > in truncate and should target a fix there before worrying about range
> > shift.
> 
> Ok.  Looking back through lore I don't see any complaints about insert
> range, so I guess it's fine.
> 

Ok, thanks.

Brian

> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > +	do {
> > > > +		error = xfs_trans_roll_inode(&tp, ip);
> > > >  		if (error)
> > > > -			break;
> > > > +			goto out_trans_cancel;
> > > >  
> > > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > > >  				&done, stop_fsb);
> > > >  		if (error)
> > > >  			goto out_trans_cancel;
> > > > +	} while (!done);
> > > >  
> > > > -		error = xfs_trans_commit(tp);
> > > > -	}
> > > > -
> > > > +	error = xfs_trans_commit(tp);
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  	return error;
> > > >  
> > > >  out_trans_cancel:
> > > >  	xfs_trans_cancel(tp);
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  	return error;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.20.1
> > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation
  2019-12-18 21:19       ` Darrick J. Wong
@ 2019-12-19 11:56         ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2019-12-19 11:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Dec 18, 2019 at 01:19:05PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 18, 2019 at 07:11:20AM -0500, Brian Foster wrote:
> > On Tue, Dec 17, 2019 at 06:39:58PM -0800, Darrick J. Wong wrote:
> > > On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote:
> > > > The collapse range operation uses a unique transaction and ilock
> > > > cycle for the hole punch and each extent shift iteration of the
> > > > overall operation. While the hole punch is safe as a separate
> > > > operation due to the iolock, cycling the ilock after each extent
> > > > shift is risky similar to insert range.
> > > 
> > > It is?  I thought collapse range was safe because we started by punching
> > > out the doomed range and shifting downwards, which eliminates the
> > > problems that come with two adjacent mappings that could be combined?
> > > 
> > > <confused?>
> > > 
> > 
> > This is somewhat vague wording. I don't mean to say the same bug is
> > possible on collapse. Indeed, I don't think that it is. What I mean to
> > say is just that cycling the ilock generally opens the operation up to
> > concurrent extent changes, similar to the behavior that resulted in the
> > insert range bug and against the general design principle of the
> > operation (as implied by the iolock hold -> flush -> unmap preparation
> > sequence).
> 
> Oh, ok, you're merely trying to prevent anyone from seeing the inode
> metadata while we're in the middle of a collapse-range operation.  I
> wonder then if we need to take a look at the remap range operations, but
> oh wow is that a gnarly mess of inode locking. :)
> 

Yeah, that's actually a much better way to put it. ;) Feel free to tweak
the commit log along those lines if my current description is too
vague/confusing. Thanks for the reviews..

Brian

> > IOW, it seems to me that a similar behavior is possible on collapse, it
> > just might occur after an extent has been shifted into its new target
> > range rather than before. That wouldn't be a corruption/bug because it
> > doesn't change the semantics of the shift operation or the content of
> > the file, but it's subtle and arguably a misbehavior and/or landmine.
> 
> <nod> Ok, just making sure. :)
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > To avoid this problem, make collapse range atomic with respect to
> > > > ilock. Hold the ilock across the entire operation, replace the
> > > > individual transactions with a single rolling transaction sequence
> > > > and finish dfops on each iteration to perform pending frees and roll
> > > > the transaction. Remove the unnecessary quota reservation as
> > > > collapse range can only ever merge extents (and thus remove extent
> > > > records and potentially free bmap blocks). The dfops call
> > > > automatically relogs the inode to keep it moving in the log. This
> > > > guarantees that nothing else can change the extent mapping of an
> > > > inode while a collapse range operation is in progress.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++--------------
> > > >  1 file changed, 15 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > index 555c8b49a223..1c34a34997ca 100644
> > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > @@ -1050,7 +1050,6 @@ xfs_collapse_file_space(
> > > >  	int			error;
> > > >  	xfs_fileoff_t		next_fsb = XFS_B_TO_FSB(mp, offset + len);
> > > >  	xfs_fileoff_t		shift_fsb = XFS_B_TO_FSB(mp, len);
> > > > -	uint			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> > > >  	bool			done = false;
> > > >  
> > > >  	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> > > > @@ -1066,32 +1065,34 @@ xfs_collapse_file_space(
> > > >  	if (error)
> > > >  		return error;
> > > >  
> > > > -	while (!error && !done) {
> > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
> > > > -					&tp);
> > > > -		if (error)
> > > > -			break;
> > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> > > > +	if (error)
> > > > +		return error;
> > > >  
> > > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > -		error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot,
> > > > -				ip->i_gdquot, ip->i_pdquot, resblks, 0,
> > > > -				XFS_QMOPT_RES_REGBLKS);
> > > > -		if (error)
> > > > -			goto out_trans_cancel;
> > > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > >  
> > > > +	while (!done) {
> > > >  		error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb,
> > > >  				&done);
> > > >  		if (error)
> > > >  			goto out_trans_cancel;
> > > > +		if (done)
> > > > +			break;
> > > >  
> > > > -		error = xfs_trans_commit(tp);
> > > > +		/* finish any deferred frees and roll the transaction */
> > > > +		error = xfs_defer_finish(&tp);
> > > > +		if (error)
> > > > +			goto out_trans_cancel;
> > > >  	}
> > > >  
> > > > +	error = xfs_trans_commit(tp);
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  	return error;
> > > >  
> > > >  out_trans_cancel:
> > > >  	xfs_trans_cancel(tp);
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  	return error;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.20.1
> > > > 
> > > 
> > 
> 



* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-19 11:55         ` Brian Foster
@ 2019-12-20 20:17           ` Darrick J. Wong
  2019-12-23 12:12             ` Brian Foster
  2019-12-24 11:28             ` Christoph Hellwig
  0 siblings, 2 replies; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-20 20:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Dec 19, 2019 at 06:55:50AM -0500, Brian Foster wrote:
> On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> > > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > > > The insert range operation uses a unique transaction and ilock cycle
> > > > > for the extent split and each extent shift iteration of the overall
> > > > > operation. While this works, it risks racing with other
> > > > > operations in subtle ways such as COW writeback modifying an extent
> > > > > tree in the middle of a shift operation.
> > > > > 
> > > > > To avoid this problem, make insert range atomic with respect to
> > > > > ilock. Hold the ilock across the entire operation, replace the
> > > > > individual transactions with a single rolling transaction sequence
> > > > > and relog the inode to keep it moving in the log. This guarantees
> > > > > that nothing else can change the extent mapping of an inode while
> > > > > an insert range operation is in progress.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > > > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > > index 829ab1a804c9..555c8b49a223 100644
> > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > > > >  	if (error)
> > > > >  		return error;
> > > > >  
> > > > > -	/*
> > > > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > -	 * is not the starting block of extent, we need to split the extent at
> > > > > -	 * stop_fsb.
> > > > > -	 */
> > > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > > > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > > > >  	if (error)
> > > > >  		return error;
> > > > >  
> > > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > >  
> > > > > +	/*
> > > > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > +	 * is not the starting block of extent, we need to split the extent at
> > > > > +	 * stop_fsb.
> > > > > +	 */
> > > > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > > > >  	if (error)
> > > > >  		goto out_trans_cancel;
> > > > >  
> > > > > -	error = xfs_trans_commit(tp);
> > > > > -	if (error)
> > > > > -		return error;
> > > > > -
> > > > > -	while (!error && !done) {
> > > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > > > -					&tp);
> > > > 
> > > > I'm a little concerned about the livelock potential here, if there are a lot of
> > > > other threads that have eaten all the transaction reservation and are trying to
> > > > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > > > to roll the transaction to move another extent, having already rolled the
> > > > transaction more than logcount times.
> > > > 
> > > 
> > > My understanding is that the regrant mechanism is intended to deal with
> > > that scenario. Even after the initial (res * logcount) reservation is
> > > consumed, a regrant is different from an initial reservation in that the
> > > reservation head is unconditionally updated with a new reservation unit.
> > > We do wait on the write head in regrant, but IIUC that should be limited
> > > to the pool of already allocated transactions (and is woken before the
> > > reserve head waiters IIRC). I suppose something like this might be
> > > possible in theory if we were blocked on regrant and the entirety of
> > > remaining log space was consumed by transactions waiting on our ilock,
> > > but I think that is highly unlikely since we also hold the iolock here.
> > 
> > True.  The only time I saw this happen was with buffered COW writeback
> > completions (which hold no locks other than the ILOCK), which should have been
> > fixed by the patch I made to put all the writeback items to a single
> > inode queue and run /one/ worker thread to process them all.  So maybe
> > my fears are unfounded nowadays. :)
> > 
> 
> Ok. If I'm following correctly, that goes back to this[1] commit. The
> commit log describes a situation where we basically have unbounded,
> concurrent attempts to do COW remappings on writeback completion.
> That leads to heavy ilock contention and the potential for deadlocks
> between log reservation and inode locks. Out of curiosity, was that old
> scheme bad enough to reproduce that kind of deadlock or was the deadlock
> description based on code analysis?

It was bad enough to cause an internal complaint about log livelock. :(

I think directio completions might suffer from the same class of problem
though, since we allow concurrent dio writes and dio doesn't do any of
the ioend batching that we do with buffered write ioends.

I think fixing directio is going to be a tougher nut to crack because it
means that in the non-overwrite case we can't really do the ioend
completion processing from the dio submitter thread.

It might also be nice to find a way to unify the ioend paths since they
both do "convert unwritten and do cow remapping" on the entire range,
and diverge only once that's done.

> I am a bit curious how "deadlock avoidant" the current transaction
> rolling/regrant mechanism is intended to be on its own vs. how much
> responsibility is borne by the implementation of certain operations
> (like truncate holding IOLOCK and assuming that sufficiently restricts
> object access). ISTM that it's technically possible for this lockup
> state to occur, but it relies on something unusual like the above where
> we allow enough unbounded (open reservation -> lock) concurrency on a
> single metadata object to exhaust all log space.
> 
> [1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues")

<nod> So far I haven't seen anyone in practice managing to hit a
thread-happy directio cow remap deadlock, though I think triggering it
should be as simple as turning on alwayscow and coaxing an aio stress
tester into issuing a lot of random writes.

> > The only other place I can think of that does a lot of transaction
> > rolling on a single inode is online repair, and it always holds all
> > three exclusive locks.
> > 
> 
> So I guess that should similarly hold off transaction reservation from
> userspace intended to modify the particular inode, provided the XFS
> internals play nice (re: your workqueue fix above).

Yeah.  ISTR the scrub setup functions flush dirty pages and drain
directios before taking the ILOCK, so it should never be picking any
fights with writeback. :)

> > > Also note that this approach is based on the current truncate algorithm,
> > > which is probably a better barometer of potential for this kind of issue
> > > as it is a less specialized operation. I'd argue that if this is safe
> > > enough for truncate, it should be safe enough for range shifting.
> > 
> > Hehehe.
> > 
> > > > I think the extent shifting loop starts with the highest offset mapping and
> > > > shifts it up and continues in order of decreasing offset until it gets to
> > > >  @stop_fsb, correct?
> > > > 
> > > 
> > > Yep, for insert range at least.
> > > 
> > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > > > tr_write pre-reserves enough space so that the roll won't need to get more,
> > > > which would eliminate that potential problem, I think.
> > > > 
> > > 
> > > We'd have to reorder the extent split for that kind of approach, which I
> > > think you've noted in the sequence above, as the race window is between
> > > the split and subsequent shift. Otherwise I think that would work.
> > > 
> > > That said, I'd prefer not to introduce the extra complexity and
> > > functional variance unless it were absolutely necessary, and it's not
> > > clear to me that it is. If it is, we'd probably have seen similar issues
> > > in truncate and should target a fix there before worrying about range
> > > shift.
> > 
> > Ok.  Looking back through lore I don't see any complaints about insert
> > range, so I guess it's fine.
> > 
> 
> Ok, thanks.
> 
> Brian
> 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > +	do {
> > > > > +		error = xfs_trans_roll_inode(&tp, ip);
> > > > >  		if (error)
> > > > > -			break;
> > > > > +			goto out_trans_cancel;
> > > > >  
> > > > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > > > >  				&done, stop_fsb);
> > > > >  		if (error)
> > > > >  			goto out_trans_cancel;
> > > > > +	} while (!done);
> > > > >  
> > > > > -		error = xfs_trans_commit(tp);
> > > > > -	}
> > > > > -
> > > > > +	error = xfs_trans_commit(tp);
> > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >  	return error;
> > > > >  
> > > > >  out_trans_cancel:
> > > > >  	xfs_trans_cancel(tp);
> > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >  	return error;
> > > > >  }
> > > > >  
> > > > > -- 
> > > > > 2.20.1
> > > > > 
> > > > 
> > > 
> > 
> 


* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-20 20:17           ` Darrick J. Wong
@ 2019-12-23 12:12             ` Brian Foster
  2019-12-24 11:28             ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Brian Foster @ 2019-12-23 12:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 19, 2019 at 06:55:50AM -0500, Brian Foster wrote:
> > On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote:
> > > On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> > > > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > > > > The insert range operation uses a unique transaction and ilock cycle
> > > > > > for the extent split and each extent shift iteration of the overall
> > > > > > operation. While this works, it risks racing with other
> > > > > > operations in subtle ways such as COW writeback modifying an extent
> > > > > > tree in the middle of a shift operation.
> > > > > > 
> > > > > > To avoid this problem, make insert range atomic with respect to
> > > > > > ilock. Hold the ilock across the entire operation, replace the
> > > > > > individual transactions with a single rolling transaction sequence
> > > > > > and relog the inode to keep it moving in the log. This guarantees
> > > > > > that nothing else can change the extent mapping of an inode while
> > > > > > an insert range operation is in progress.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > > > > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > > > index 829ab1a804c9..555c8b49a223 100644
> > > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > > > > >  	if (error)
> > > > > >  		return error;
> > > > > >  
> > > > > > -	/*
> > > > > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > > -	 * is not the starting block of extent, we need to split the extent at
> > > > > > -	 * stop_fsb.
> > > > > > -	 */
> > > > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > > > > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > > > > >  	if (error)
> > > > > >  		return error;
> > > > > >  
> > > > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > > +	 * is not the starting block of extent, we need to split the extent at
> > > > > > +	 * stop_fsb.
> > > > > > +	 */
> > > > > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > > > > >  	if (error)
> > > > > >  		goto out_trans_cancel;
> > > > > >  
> > > > > > -	error = xfs_trans_commit(tp);
> > > > > > -	if (error)
> > > > > > -		return error;
> > > > > > -
> > > > > > -	while (!error && !done) {
> > > > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > > > > -					&tp);
> > > > > 
> > > > > I'm a little concerned about the livelock potential here, if there are a lot of
> > > > > other threads that have eaten all the transaction reservation and are trying to
> > > > > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > > > > to roll the transaction to move another extent, having already rolled the
> > > > > transaction more than logcount times.
> > > > > 
> > > > 
> > > > My understanding is that the regrant mechanism is intended to deal with
> > > > that scenario. Even after the initial (res * logcount) reservation is
> > > > consumed, a regrant is different from an initial reservation in that the
> > > > reservation head is unconditionally updated with a new reservation unit.
> > > > We do wait on the write head in regrant, but IIUC that should be limited
> > > > to the pool of already allocated transactions (and is woken before the
> > > > reserve head waiters IIRC). I suppose something like this might be
> > > > possible in theory if we were blocked on regrant and the entirety of
> > > > remaining log space was consumed by transactions waiting on our ilock,
> > > > but I think that is highly unlikely since we also hold the iolock here.
> > > 
> > > True.  The only time I saw this happen was with buffered COW writeback
> > > completions (which hold no locks other than the ILOCK), which should have been
> > > fixed by the patch I made to put all the writeback items to a single
> > > inode queue and run /one/ worker thread to process them all.  So maybe
> > > my fears are unfounded nowadays. :)
> > > 
> > 
> > Ok. If I'm following correctly, that goes back to this[1] commit. The
> > commit log describes a situation where we basically have unbounded,
> > concurrent attempts to do COW remappings on writeback completion.
> > That leads to heavy ilock contention and the potential for deadlocks
> > between log reservation and inode locks. Out of curiosity, was that old
> > scheme bad enough to reproduce that kind of deadlock or was the deadlock
> > description based on code analysis?
> 
> It was bad enough to cause an internal complaint about log livelock. :(
> 

Interesting..

> I think directio completions might suffer from the same class of problem
> though, since we allow concurrent dio writes and dio doesn't do any of
> the ioend batching that we do with buffered write ioends.
> 

Indeed, that makes sense since it sounds like a similar pattern.

> I think fixing directio is going to be a tougher nut to crack because it
> means that in the non-overwrite case we can't really do the ioend
> completion processing from the dio submitter thread.
> 

Hm.. yeah, but it might be a little strange to have highly concurrent
dio writers to the same inode, right? IOW, it seems like the risk there
depends more on the foolishness of userspace than on our backend
implementation deciding to allocate more tasks in the event of
contention. That isn't to say we shouldn't think about ways to protect
ourselves from the former case..

> It might also be nice to find a way to unify the ioend paths since they
> both do "convert unwritten and do cow remapping" on the entire range,
> and diverge only once that's done.
> 
> > I am a bit curious how "deadlock avoidant" the current transaction
> > rolling/regrant mechanism is intended to be on its own vs. how much
> > responsibility is borne by the implementation of certain operations
> > (like truncate holding IOLOCK and assuming that sufficiently restricts
> > object access). ISTM that it's technically possible for this lockup
> > state to occur, but it relies on something unusual like the above where
> > we allow enough unbounded (open reservation -> lock) concurrency on a
> > single metadata object to exhaust all log space.
> > 
> > [1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues")
> 
> <nod> So far I haven't seen anyone in practice managing to hit a
> thread-happy directio cow remap deadlock, though I think triggering it
> should be as simple as turning on alwayscow and coaxing an aio stress
> tester into issuing a lot of random writes.
> 
> > > The only other place I can think of that does a lot of transaction
> > > rolling on a single inode is online repair, and it always holds all
> > > three exclusive locks.
> > > 
> > 
> > So I guess that should similarly hold off transaction reservation from
> > userspace intended to modify the particular inode, provided the XFS
> > internals play nice (re: your workqueue fix above).
> 
> Yeah.  ISTR the scrub setup functions flush dirty pages and drain
> directios before taking the ILOCK, so it should never be picking any
> fights with writeback. :)
> 

That seems to be the right thing to do, at least for now. It doesn't
look like truncate currently waits on dio. Extent shifting might, by
virtue of eofblocks trimming, I guess, but that isn't guaranteed.
Regardless, I think either operation could wait on dio without a problem
given that they already isolate the inode for I/O, force writeback, etc.
I suspect it's just not been a likely enough situation to have a heavy
enough I/O workload concurrent with truncating (or shifting) a file to
see this problem in traditional (i.e. unlike the "always COW" mode)
codepaths.
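
For illustration, here is a sketch of what explicitly draining dio in
the shift (or truncate) path could look like, mirroring the scrub setup
behavior mentioned above. The helpers named (xfs_ilock, inode_dio_wait,
filemap_write_and_wait) are real kernel functions, but this exact
sequence and the out_iolock label are an illustration of the idea, not a
patch from this thread:

```c
	/* shut out new I/O submitters */
	xfs_ilock(ip, XFS_IOLOCK_EXCL);

	/* wait for in-flight direct I/O to drain */
	inode_dio_wait(VFS_I(ip));

	/* flush dirty pagecache so writeback completions can't race us */
	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
	if (error)
		goto out_iolock;

	/* now hold the ILOCK for the duration of the extent shift */
	xfs_ilock(ip, XFS_ILOCK_EXCL);
```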

Brian

> > > > Also note that this approach is based on the current truncate algorithm,
> > > > which is probably a better barometer of potential for this kind of issue
> > > > as it is a less specialized operation. I'd argue that if this is safe
> > > > enough for truncate, it should be safe enough for range shifting.
> > > 
> > > Hehehe.
> > > 
> > > > > I think the extent shifting loop starts with the highest offset mapping and
> > > > > shifts it up and continues in order of decreasing offset until it gets to
> > > > >  @stop_fsb, correct?
> > > > > 
> > > > 
> > > > Yep, for insert range at least.
> > > > 
> > > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > > > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > > > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > > > > tr_write pre-reserves enough space so that the roll won't need to get more,
> > > > > which would eliminate that potential problem, I think.
> > > > > 
> > > > 
> > > > We'd have to reorder the extent split for that kind of approach, which I
> > > > think you've noted in the sequence above, as the race window is between
> > > > the split and subsequent shift. Otherwise I think that would work.
> > > > 
> > > > That said, I'd prefer not to introduce the extra complexity and
> > > > functional variance unless it were absolutely necessary, and it's not
> > > > clear to me that it is. If it is, we'd probably have seen similar issues
> > > > in truncate and should target a fix there before worrying about range
> > > > shift.
> > > 
> > > Ok.  Looking back through lore I don't see any complaints about insert
> > > range, so I guess it's fine.
> > > 
> > 
> > Ok, thanks.
> > 
> > Brian
> > 
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > +	do {
> > > > > > +		error = xfs_trans_roll_inode(&tp, ip);
> > > > > >  		if (error)
> > > > > > -			break;
> > > > > > +			goto out_trans_cancel;
> > > > > >  
> > > > > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > > > > >  				&done, stop_fsb);
> > > > > >  		if (error)
> > > > > >  			goto out_trans_cancel;
> > > > > > +	} while (!done);
> > > > > >  
> > > > > > -		error = xfs_trans_commit(tp);
> > > > > > -	}
> > > > > > -
> > > > > > +	error = xfs_trans_commit(tp);
> > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > >  	return error;
> > > > > >  
> > > > > >  out_trans_cancel:
> > > > > >  	xfs_trans_cancel(tp);
> > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > >  	return error;
> > > > > >  }
> > > > > >  
> > > > > > -- 
> > > > > > 2.20.1
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 



* Re: [PATCH 1/3] xfs: open code insert range extent split helper
  2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster
  2019-12-17 17:02   ` Allison Collins
  2019-12-18  2:16   ` Darrick J. Wong
@ 2019-12-24 11:18   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2019-12-24 11:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Dec 13, 2019 at 12:12:56PM -0500, Brian Foster wrote:
> The insert range operation currently splits the extent at the target
> offset in a separate transaction and lock cycle from the one that
> shifts extents. In preparation for reworking insert range into an
> atomic operation, lift the code into the caller so it can be easily
> condensed to a single rolling transaction and lock cycle and
> eliminate the helper. No functional changes.

The helper already is rather questionable as-is, so even without looking
at the following patches:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster
  2019-12-17 17:04   ` Allison Collins
  2019-12-18  2:37   ` Darrick J. Wong
@ 2019-12-24 11:26   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2019-12-24 11:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> The insert range operation uses a unique transaction and ilock cycle
> for the extent split and each extent shift iteration of the overall
> operation. While this works, it risks racing with other
> operations in subtle ways such as COW writeback modifying an extent
> tree in the middle of a shift operation.
> 
> To avoid this problem, make insert range atomic with respect to
> ilock. Hold the ilock across the entire operation, replace the
> individual transactions with a single rolling transaction sequence
> and relog the inode to keep it moving in the log. This guarantees
> that nothing else can change the extent mapping of an inode while
> an insert range operation is in progress.

This looks good, and similar to our usual truncate sequence:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-20 20:17           ` Darrick J. Wong
  2019-12-23 12:12             ` Brian Foster
@ 2019-12-24 11:28             ` Christoph Hellwig
  2019-12-24 16:45               ` Darrick J. Wong
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2019-12-24 11:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote:
> I think directio completions might suffer from the same class of problem
> though, since we allow concurrent dio writes and dio doesn't do any of
> the ioend batching that we do with buffered write ioends.

OTOH direct I/O completions are per-I/O, and not per-extent like
buffered I/O completions.  Moreover for the case where we don't update
i_size and don't need a separate log force (overwrites without O_SYNC
or using fua) we could actually avoid the workqueue entirely with just
a little work.

> It might also be nice to find a way to unify the ioend paths since they
> both do "convert unwritten and do cow remapping" on the entire range,
> and diverge only once that's done.

They were common a while ago and it was a complete mess.  That is why
I split them.


* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation
  2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster
  2019-12-18  2:39   ` Darrick J. Wong
@ 2019-12-24 11:29   ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2019-12-24 11:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Looks good modulo any commit log issues:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation
  2019-12-24 11:28             ` Christoph Hellwig
@ 2019-12-24 16:45               ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2019-12-24 16:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Tue, Dec 24, 2019 at 03:28:45AM -0800, Christoph Hellwig wrote:
> On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote:
> > I think directio completions might suffer from the same class of problem
> > though, since we allow concurrent dio writes and dio doesn't do any of
> > the ioend batching that we do with buffered write ioends.
> 
> OTOH direct I/O completions are per-I/O, and not per-extent like
> buffered I/O completions.  Moreover for the case where we don't update
> i_size and don't need a separate log force (overwrites without O_SYNC
> or using fua) we could actually avoid the workqueue entirely with just
> a little work.

<nod>

> > It might also be nice to find a way to unify the ioend paths since they
> > both do "convert unwritten and do cow remapping" on the entire range,
> > and diverge only once that's done.
> 
> They were common a while ago and it was a complete mess.  That is why
> I split them.

And I couldn't figure out a sane way to make them work together so I
guess it's just as well. :)

--D

end of thread, other threads:[~2019-12-24 16:47 UTC | newest]

Thread overview: 23+ messages
2019-12-13 17:12 [PATCH 0/3] xfs: hold ilock across insert and collapse range Brian Foster
2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster
2019-12-17 17:02   ` Allison Collins
2019-12-17 22:21     ` Darrick J. Wong
2019-12-18  2:16   ` Darrick J. Wong
2019-12-24 11:18   ` Christoph Hellwig
2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster
2019-12-17 17:04   ` Allison Collins
2019-12-18  2:37   ` Darrick J. Wong
2019-12-18 12:10     ` Brian Foster
2019-12-18 21:15       ` Darrick J. Wong
2019-12-19 11:55         ` Brian Foster
2019-12-20 20:17           ` Darrick J. Wong
2019-12-23 12:12             ` Brian Foster
2019-12-24 11:28             ` Christoph Hellwig
2019-12-24 16:45               ` Darrick J. Wong
2019-12-24 11:26   ` Christoph Hellwig
2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster
2019-12-18  2:39   ` Darrick J. Wong
2019-12-18 12:11     ` Brian Foster
2019-12-18 21:19       ` Darrick J. Wong
2019-12-19 11:56         ` Brian Foster
2019-12-24 11:29   ` Christoph Hellwig
