linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag
@ 2019-08-27  2:54 kaixuxia
  2019-08-29  1:27 ` kaixuxia
  0 siblings, 1 reply; 6+ messages in thread
From: kaixuxia @ 2019-08-27  2:54 UTC (permalink / raw)
  To: linux-xfs
  Cc: Darrick J. Wong, Dave Chinner, Brian Foster, newtongao,
	jasperwang, xiakaixu1987

When performing rename operation with RENAME_WHITEOUT flag, we will
hold AGF lock to allocate or free extents in manipulating the dirents
firstly, and then doing the xfs_iunlink_remove() call last to hold
AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.

The big problem here is that we have an ordering constraint on AGF
and AGI locking - inode allocation locks the AGI, then can allocate
a new extent for new inodes, locking the AGF after the AGI. Hence
the ordering that is imposed by other parts of the code is AGI before
AGF. So we get an ABBA deadlock between the AGI and AGF here.

Process A:
Call trace:
 ? __schedule+0x2bd/0x620
 schedule+0x33/0x90
 schedule_timeout+0x17d/0x290
 __down_common+0xef/0x125
 ? xfs_buf_find+0x215/0x6c0 [xfs]
 down+0x3b/0x50
 xfs_buf_lock+0x34/0xf0 [xfs]
 xfs_buf_find+0x215/0x6c0 [xfs]
 xfs_buf_get_map+0x37/0x230 [xfs]
 xfs_buf_read_map+0x29/0x190 [xfs]
 xfs_trans_read_buf_map+0x13d/0x520 [xfs]
 xfs_read_agf+0xa6/0x180 [xfs]
 ? schedule_timeout+0x17d/0x290
 xfs_alloc_read_agf+0x52/0x1f0 [xfs]
 xfs_alloc_fix_freelist+0x432/0x590 [xfs]
 ? down+0x3b/0x50
 ? xfs_buf_lock+0x34/0xf0 [xfs]
 ? xfs_buf_find+0x215/0x6c0 [xfs]
 xfs_alloc_vextent+0x301/0x6c0 [xfs]
 xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
 ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
 xfs_dialloc+0x116/0x290 [xfs]
 xfs_ialloc+0x6d/0x5e0 [xfs]
 ? xfs_log_reserve+0x165/0x280 [xfs]
 xfs_dir_ialloc+0x8c/0x240 [xfs]
 xfs_create+0x35a/0x610 [xfs]
 xfs_generic_create+0x1f1/0x2f0 [xfs]
 ...

Process B:
Call trace:
 ? __schedule+0x2bd/0x620
 ? xfs_bmapi_allocate+0x245/0x380 [xfs]
 schedule+0x33/0x90
 schedule_timeout+0x17d/0x290
 ? xfs_buf_find+0x1fd/0x6c0 [xfs]
 __down_common+0xef/0x125
 ? xfs_buf_get_map+0x37/0x230 [xfs]
 ? xfs_buf_find+0x215/0x6c0 [xfs]
 down+0x3b/0x50
 xfs_buf_lock+0x34/0xf0 [xfs]
 xfs_buf_find+0x215/0x6c0 [xfs]
 xfs_buf_get_map+0x37/0x230 [xfs]
 xfs_buf_read_map+0x29/0x190 [xfs]
 xfs_trans_read_buf_map+0x13d/0x520 [xfs]
 xfs_read_agi+0xa8/0x160 [xfs]
 xfs_iunlink_remove+0x6f/0x2a0 [xfs]
 ? current_time+0x46/0x80
 ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
 xfs_rename+0x57a/0xae0 [xfs]
 xfs_vn_rename+0xe4/0x150 [xfs]
 ...

In this patch we move the xfs_iunlink_remove() call to
before acquiring the AGF lock to preserve correct AGI/AGF locking
order.

Signed-off-by: kaixuxia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c | 83 +++++++++++++++++++++++++++---------------------------
 1 file changed, 42 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6467d5e..8ffd44f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3282,7 +3282,8 @@ struct xfs_iunlink {
 					spaceres);
 
 	/*
-	 * Set up the target.
+	 * Check for expected errors before we dirty the transaction
+	 * so we can return an error without a transaction abort.
 	 */
 	if (target_ip == NULL) {
 		/*
@@ -3294,6 +3295,46 @@ struct xfs_iunlink {
 			if (error)
 				goto out_trans_cancel;
 		}
+	} else {
+		/*
+		 * If target exists and it's a directory, check that whether
+		 * it can be destroyed.
+		 */
+		if (S_ISDIR(VFS_I(target_ip)->i_mode) &&
+		    (!xfs_dir_isempty(target_ip) ||
+		     (VFS_I(target_ip)->i_nlink > 2))) {
+			error = -EEXIST;
+			goto out_trans_cancel;
+		}
+	}
+
+	/*
+	 * Directory entry creation below may acquire the AGF. Remove
+	 * the whiteout from the unlinked list first to preserve correct
+	 * AGI/AGF locking order. This dirties the transaction so failures
+	 * after this point will abort and log recovery will clean up the
+	 * mess.
+	 *
+	 * For whiteouts, we need to bump the link count on the whiteout
+	 * inode. After this point, we have a real link, clear the tmpfile
+	 * state flag from the inode so it doesn't accidentally get misused
+	 * in future.
+	 */
+	if (wip) {
+		ASSERT(VFS_I(wip)->i_nlink == 0);
+		error = xfs_iunlink_remove(tp, wip);
+		if (error)
+			goto out_trans_cancel;
+
+		xfs_bumplink(tp, wip);
+		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
+		VFS_I(wip)->i_state &= ~I_LINKABLE;
+	}
+
+	/*
+	 * Set up the target.
+	 */
+	if (target_ip == NULL) {
 		/*
 		 * If target does not exist and the rename crosses
 		 * directories, adjust the target directory link count
@@ -3312,22 +3353,6 @@ struct xfs_iunlink {
 		}
 	} else { /* target_ip != NULL */
 		/*
-		 * If target exists and it's a directory, check that both
-		 * target and source are directories and that target can be
-		 * destroyed, or that neither is a directory.
-		 */
-		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
-			/*
-			 * Make sure target dir is empty.
-			 */
-			if (!(xfs_dir_isempty(target_ip)) ||
-			    (VFS_I(target_ip)->i_nlink > 2)) {
-				error = -EEXIST;
-				goto out_trans_cancel;
-			}
-		}
-
-		/*
 		 * Link the source inode under the target name.
 		 * If the source inode is a directory and we are moving
 		 * it across directories, its ".." entry will be
@@ -3417,30 +3442,6 @@ struct xfs_iunlink {
 	if (error)
 		goto out_trans_cancel;
 
-	/*
-	 * For whiteouts, we need to bump the link count on the whiteout inode.
-	 * This means that failures all the way up to this point leave the inode
-	 * on the unlinked list and so cleanup is a simple matter of dropping
-	 * the remaining reference to it. If we fail here after bumping the link
-	 * count, we're shutting down the filesystem so we'll never see the
-	 * intermediate state on disk.
-	 */
-	if (wip) {
-		ASSERT(VFS_I(wip)->i_nlink == 0);
-		xfs_bumplink(tp, wip);
-		error = xfs_iunlink_remove(tp, wip);
-		if (error)
-			goto out_trans_cancel;
-		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
-
-		/*
-		 * Now we have a real link, clear the "I'm a tmpfile" state
-		 * flag from the inode so it doesn't accidentally get misused in
-		 * future.
-		 */
-		VFS_I(wip)->i_state &= ~I_LINKABLE;
-	}
-
 	xfs_trans_ichgtime(tp, src_dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, src_dp, XFS_ILOG_CORE);
 	if (new_parent)
-- 
1.8.3.1

-- 
kaixuxia

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag
  2019-08-27  2:54 [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag kaixuxia
@ 2019-08-29  1:27 ` kaixuxia
  2019-08-29  3:04   ` Eric Sandeen
  0 siblings, 1 reply; 6+ messages in thread
From: kaixuxia @ 2019-08-29  1:27 UTC (permalink / raw)
  To: linux-xfs, Darrick J. Wong
  Cc: Dave Chinner, Brian Foster, newtongao, jasperwang

ping...
Because there isn't this patch in the latest xfs-for-next branch 
update...


On 2019/8/27 10:54, kaixuxia wrote:
> When performing rename operation with RENAME_WHITEOUT flag, we will
> hold AGF lock to allocate or free extents in manipulating the dirents
> firstly, and then doing the xfs_iunlink_remove() call last to hold
> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
> 
> The big problem here is that we have an ordering constraint on AGF
> and AGI locking - inode allocation locks the AGI, then can allocate
> a new extent for new inodes, locking the AGF after the AGI. Hence
> the ordering that is imposed by other parts of the code is AGI before
> AGF. So we get an ABBA deadlock between the AGI and AGF here.
> 
> Process A:
> Call trace:
>  ? __schedule+0x2bd/0x620
>  schedule+0x33/0x90
>  schedule_timeout+0x17d/0x290
>  __down_common+0xef/0x125
>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>  down+0x3b/0x50
>  xfs_buf_lock+0x34/0xf0 [xfs]
>  xfs_buf_find+0x215/0x6c0 [xfs]
>  xfs_buf_get_map+0x37/0x230 [xfs]
>  xfs_buf_read_map+0x29/0x190 [xfs]
>  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
>  xfs_read_agf+0xa6/0x180 [xfs]
>  ? schedule_timeout+0x17d/0x290
>  xfs_alloc_read_agf+0x52/0x1f0 [xfs]
>  xfs_alloc_fix_freelist+0x432/0x590 [xfs]
>  ? down+0x3b/0x50
>  ? xfs_buf_lock+0x34/0xf0 [xfs]
>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>  xfs_alloc_vextent+0x301/0x6c0 [xfs]
>  xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
>  ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
>  xfs_dialloc+0x116/0x290 [xfs]
>  xfs_ialloc+0x6d/0x5e0 [xfs]
>  ? xfs_log_reserve+0x165/0x280 [xfs]
>  xfs_dir_ialloc+0x8c/0x240 [xfs]
>  xfs_create+0x35a/0x610 [xfs]
>  xfs_generic_create+0x1f1/0x2f0 [xfs]
>  ...
> 
> Process B:
> Call trace:
>  ? __schedule+0x2bd/0x620
>  ? xfs_bmapi_allocate+0x245/0x380 [xfs]
>  schedule+0x33/0x90
>  schedule_timeout+0x17d/0x290
>  ? xfs_buf_find+0x1fd/0x6c0 [xfs]
>  __down_common+0xef/0x125
>  ? xfs_buf_get_map+0x37/0x230 [xfs]
>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>  down+0x3b/0x50
>  xfs_buf_lock+0x34/0xf0 [xfs]
>  xfs_buf_find+0x215/0x6c0 [xfs]
>  xfs_buf_get_map+0x37/0x230 [xfs]
>  xfs_buf_read_map+0x29/0x190 [xfs]
>  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
>  xfs_read_agi+0xa8/0x160 [xfs]
>  xfs_iunlink_remove+0x6f/0x2a0 [xfs]
>  ? current_time+0x46/0x80
>  ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
>  xfs_rename+0x57a/0xae0 [xfs]
>  xfs_vn_rename+0xe4/0x150 [xfs]
>  ...
> 
> In this patch we move the xfs_iunlink_remove() call to
> before acquiring the AGF lock to preserve correct AGI/AGF locking
> order.
> 
> Signed-off-by: kaixuxia <kaixuxia@tencent.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 83 +++++++++++++++++++++++++++---------------------------
>  1 file changed, 42 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 6467d5e..8ffd44f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3282,7 +3282,8 @@ struct xfs_iunlink {
>  					spaceres);
>  
>  	/*
> -	 * Set up the target.
> +	 * Check for expected errors before we dirty the transaction
> +	 * so we can return an error without a transaction abort.
>  	 */
>  	if (target_ip == NULL) {
>  		/*
> @@ -3294,6 +3295,46 @@ struct xfs_iunlink {
>  			if (error)
>  				goto out_trans_cancel;
>  		}
> +	} else {
> +		/*
> +		 * If target exists and it's a directory, check that whether
> +		 * it can be destroyed.
> +		 */
> +		if (S_ISDIR(VFS_I(target_ip)->i_mode) &&
> +		    (!xfs_dir_isempty(target_ip) ||
> +		     (VFS_I(target_ip)->i_nlink > 2))) {
> +			error = -EEXIST;
> +			goto out_trans_cancel;
> +		}
> +	}
> +
> +	/*
> +	 * Directory entry creation below may acquire the AGF. Remove
> +	 * the whiteout from the unlinked list first to preserve correct
> +	 * AGI/AGF locking order. This dirties the transaction so failures
> +	 * after this point will abort and log recovery will clean up the
> +	 * mess.
> +	 *
> +	 * For whiteouts, we need to bump the link count on the whiteout
> +	 * inode. After this point, we have a real link, clear the tmpfile
> +	 * state flag from the inode so it doesn't accidentally get misused
> +	 * in future.
> +	 */
> +	if (wip) {
> +		ASSERT(VFS_I(wip)->i_nlink == 0);
> +		error = xfs_iunlink_remove(tp, wip);
> +		if (error)
> +			goto out_trans_cancel;
> +
> +		xfs_bumplink(tp, wip);
> +		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
> +		VFS_I(wip)->i_state &= ~I_LINKABLE;
> +	}
> +
> +	/*
> +	 * Set up the target.
> +	 */
> +	if (target_ip == NULL) {
>  		/*
>  		 * If target does not exist and the rename crosses
>  		 * directories, adjust the target directory link count
> @@ -3312,22 +3353,6 @@ struct xfs_iunlink {
>  		}
>  	} else { /* target_ip != NULL */
>  		/*
> -		 * If target exists and it's a directory, check that both
> -		 * target and source are directories and that target can be
> -		 * destroyed, or that neither is a directory.
> -		 */
> -		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
> -			/*
> -			 * Make sure target dir is empty.
> -			 */
> -			if (!(xfs_dir_isempty(target_ip)) ||
> -			    (VFS_I(target_ip)->i_nlink > 2)) {
> -				error = -EEXIST;
> -				goto out_trans_cancel;
> -			}
> -		}
> -
> -		/*
>  		 * Link the source inode under the target name.
>  		 * If the source inode is a directory and we are moving
>  		 * it across directories, its ".." entry will be
> @@ -3417,30 +3442,6 @@ struct xfs_iunlink {
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	/*
> -	 * For whiteouts, we need to bump the link count on the whiteout inode.
> -	 * This means that failures all the way up to this point leave the inode
> -	 * on the unlinked list and so cleanup is a simple matter of dropping
> -	 * the remaining reference to it. If we fail here after bumping the link
> -	 * count, we're shutting down the filesystem so we'll never see the
> -	 * intermediate state on disk.
> -	 */
> -	if (wip) {
> -		ASSERT(VFS_I(wip)->i_nlink == 0);
> -		xfs_bumplink(tp, wip);
> -		error = xfs_iunlink_remove(tp, wip);
> -		if (error)
> -			goto out_trans_cancel;
> -		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
> -
> -		/*
> -		 * Now we have a real link, clear the "I'm a tmpfile" state
> -		 * flag from the inode so it doesn't accidentally get misused in
> -		 * future.
> -		 */
> -		VFS_I(wip)->i_state &= ~I_LINKABLE;
> -	}
> -
>  	xfs_trans_ichgtime(tp, src_dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
>  	xfs_trans_log_inode(tp, src_dp, XFS_ILOG_CORE);
>  	if (new_parent)
> 

-- 
kaixuxia

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag
  2019-08-29  1:27 ` kaixuxia
@ 2019-08-29  3:04   ` Eric Sandeen
  2019-08-29  3:20     ` kaixuxia
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Sandeen @ 2019-08-29  3:04 UTC (permalink / raw)
  To: kaixuxia, linux-xfs, Darrick J. Wong
  Cc: Dave Chinner, Brian Foster, newtongao, jasperwang

On 8/28/19 8:27 PM, kaixuxia wrote:
> ping...
> Because there isn't this patch in the latest xfs-for-next branch 
> update...

1) V3 appears to have no changes from V2.  Why was V3 sent?

2) Neither version has a reviewed-by yet, Darrick has questions outstanding
   AFAICT, they may need to be answered prior to review and merge.

-Eric
 
> 
> On 2019/8/27 10:54, kaixuxia wrote:
>> When performing rename operation with RENAME_WHITEOUT flag, we will
>> hold AGF lock to allocate or free extents in manipulating the dirents
>> firstly, and then doing the xfs_iunlink_remove() call last to hold
>> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
>>
>> The big problem here is that we have an ordering constraint on AGF
>> and AGI locking - inode allocation locks the AGI, then can allocate
>> a new extent for new inodes, locking the AGF after the AGI. Hence
>> the ordering that is imposed by other parts of the code is AGI before
>> AGF. So we get an ABBA deadlock between the AGI and AGF here.
>>
>> Process A:
>> Call trace:
>>  ? __schedule+0x2bd/0x620
>>  schedule+0x33/0x90
>>  schedule_timeout+0x17d/0x290
>>  __down_common+0xef/0x125
>>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>>  down+0x3b/0x50
>>  xfs_buf_lock+0x34/0xf0 [xfs]
>>  xfs_buf_find+0x215/0x6c0 [xfs]
>>  xfs_buf_get_map+0x37/0x230 [xfs]
>>  xfs_buf_read_map+0x29/0x190 [xfs]
>>  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
>>  xfs_read_agf+0xa6/0x180 [xfs]
>>  ? schedule_timeout+0x17d/0x290
>>  xfs_alloc_read_agf+0x52/0x1f0 [xfs]
>>  xfs_alloc_fix_freelist+0x432/0x590 [xfs]
>>  ? down+0x3b/0x50
>>  ? xfs_buf_lock+0x34/0xf0 [xfs]
>>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>>  xfs_alloc_vextent+0x301/0x6c0 [xfs]
>>  xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
>>  ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
>>  xfs_dialloc+0x116/0x290 [xfs]
>>  xfs_ialloc+0x6d/0x5e0 [xfs]
>>  ? xfs_log_reserve+0x165/0x280 [xfs]
>>  xfs_dir_ialloc+0x8c/0x240 [xfs]
>>  xfs_create+0x35a/0x610 [xfs]
>>  xfs_generic_create+0x1f1/0x2f0 [xfs]
>>  ...
>>
>> Process B:
>> Call trace:
>>  ? __schedule+0x2bd/0x620
>>  ? xfs_bmapi_allocate+0x245/0x380 [xfs]
>>  schedule+0x33/0x90
>>  schedule_timeout+0x17d/0x290
>>  ? xfs_buf_find+0x1fd/0x6c0 [xfs]
>>  __down_common+0xef/0x125
>>  ? xfs_buf_get_map+0x37/0x230 [xfs]
>>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>>  down+0x3b/0x50
>>  xfs_buf_lock+0x34/0xf0 [xfs]
>>  xfs_buf_find+0x215/0x6c0 [xfs]
>>  xfs_buf_get_map+0x37/0x230 [xfs]
>>  xfs_buf_read_map+0x29/0x190 [xfs]
>>  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
>>  xfs_read_agi+0xa8/0x160 [xfs]
>>  xfs_iunlink_remove+0x6f/0x2a0 [xfs]
>>  ? current_time+0x46/0x80
>>  ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
>>  xfs_rename+0x57a/0xae0 [xfs]
>>  xfs_vn_rename+0xe4/0x150 [xfs]
>>  ...
>>
>> In this patch we move the xfs_iunlink_remove() call to
>> before acquiring the AGF lock to preserve correct AGI/AGF locking
>> order.
>>
>> Signed-off-by: kaixuxia <kaixuxia@tencent.com>
>> Reviewed-by: Brian Foster <bfoster@redhat.com>
>> ---
>>  fs/xfs/xfs_inode.c | 83 +++++++++++++++++++++++++++---------------------------
>>  1 file changed, 42 insertions(+), 41 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 6467d5e..8ffd44f 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -3282,7 +3282,8 @@ struct xfs_iunlink {
>>  					spaceres);
>>  
>>  	/*
>> -	 * Set up the target.
>> +	 * Check for expected errors before we dirty the transaction
>> +	 * so we can return an error without a transaction abort.
>>  	 */
>>  	if (target_ip == NULL) {
>>  		/*
>> @@ -3294,6 +3295,46 @@ struct xfs_iunlink {
>>  			if (error)
>>  				goto out_trans_cancel;
>>  		}
>> +	} else {
>> +		/*
>> +		 * If target exists and it's a directory, check that whether
>> +		 * it can be destroyed.
>> +		 */
>> +		if (S_ISDIR(VFS_I(target_ip)->i_mode) &&
>> +		    (!xfs_dir_isempty(target_ip) ||
>> +		     (VFS_I(target_ip)->i_nlink > 2))) {
>> +			error = -EEXIST;
>> +			goto out_trans_cancel;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Directory entry creation below may acquire the AGF. Remove
>> +	 * the whiteout from the unlinked list first to preserve correct
>> +	 * AGI/AGF locking order. This dirties the transaction so failures
>> +	 * after this point will abort and log recovery will clean up the
>> +	 * mess.
>> +	 *
>> +	 * For whiteouts, we need to bump the link count on the whiteout
>> +	 * inode. After this point, we have a real link, clear the tmpfile
>> +	 * state flag from the inode so it doesn't accidentally get misused
>> +	 * in future.
>> +	 */
>> +	if (wip) {
>> +		ASSERT(VFS_I(wip)->i_nlink == 0);
>> +		error = xfs_iunlink_remove(tp, wip);
>> +		if (error)
>> +			goto out_trans_cancel;
>> +
>> +		xfs_bumplink(tp, wip);
>> +		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
>> +		VFS_I(wip)->i_state &= ~I_LINKABLE;
>> +	}
>> +
>> +	/*
>> +	 * Set up the target.
>> +	 */
>> +	if (target_ip == NULL) {
>>  		/*
>>  		 * If target does not exist and the rename crosses
>>  		 * directories, adjust the target directory link count
>> @@ -3312,22 +3353,6 @@ struct xfs_iunlink {
>>  		}
>>  	} else { /* target_ip != NULL */
>>  		/*
>> -		 * If target exists and it's a directory, check that both
>> -		 * target and source are directories and that target can be
>> -		 * destroyed, or that neither is a directory.
>> -		 */
>> -		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
>> -			/*
>> -			 * Make sure target dir is empty.
>> -			 */
>> -			if (!(xfs_dir_isempty(target_ip)) ||
>> -			    (VFS_I(target_ip)->i_nlink > 2)) {
>> -				error = -EEXIST;
>> -				goto out_trans_cancel;
>> -			}
>> -		}
>> -
>> -		/*
>>  		 * Link the source inode under the target name.
>>  		 * If the source inode is a directory and we are moving
>>  		 * it across directories, its ".." entry will be
>> @@ -3417,30 +3442,6 @@ struct xfs_iunlink {
>>  	if (error)
>>  		goto out_trans_cancel;
>>  
>> -	/*
>> -	 * For whiteouts, we need to bump the link count on the whiteout inode.
>> -	 * This means that failures all the way up to this point leave the inode
>> -	 * on the unlinked list and so cleanup is a simple matter of dropping
>> -	 * the remaining reference to it. If we fail here after bumping the link
>> -	 * count, we're shutting down the filesystem so we'll never see the
>> -	 * intermediate state on disk.
>> -	 */
>> -	if (wip) {
>> -		ASSERT(VFS_I(wip)->i_nlink == 0);
>> -		xfs_bumplink(tp, wip);
>> -		error = xfs_iunlink_remove(tp, wip);
>> -		if (error)
>> -			goto out_trans_cancel;
>> -		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
>> -
>> -		/*
>> -		 * Now we have a real link, clear the "I'm a tmpfile" state
>> -		 * flag from the inode so it doesn't accidentally get misused in
>> -		 * future.
>> -		 */
>> -		VFS_I(wip)->i_state &= ~I_LINKABLE;
>> -	}
>> -
>>  	xfs_trans_ichgtime(tp, src_dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
>>  	xfs_trans_log_inode(tp, src_dp, XFS_ILOG_CORE);
>>  	if (new_parent)
>>
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag
  2019-08-29  3:04   ` Eric Sandeen
@ 2019-08-29  3:20     ` kaixuxia
  2019-08-29  3:22       ` Eric Sandeen
  2019-08-29  5:35       ` Dave Chinner
  0 siblings, 2 replies; 6+ messages in thread
From: kaixuxia @ 2019-08-29  3:20 UTC (permalink / raw)
  To: Eric Sandeen, linux-xfs, Darrick J. Wong
  Cc: Dave Chinner, Brian Foster, newtongao, jasperwang



On 2019/8/29 11:04, Eric Sandeen wrote:
> On 8/28/19 8:27 PM, kaixuxia wrote:
>> ping...
>> Because there isn't this patch in the latest xfs-for-next branch 
>> update...
> 
> 1) V3 appears to have no changes from V2.  Why was V3 sent?
> 
The V3 subject has been changed to distinguish from another droplink 
deadlock patch that will be sent soon. This V3 patch aim to fix the 
rename whiteout deadlock.

> 2) Neither version has a reviewed-by yet, Darrick has questions outstanding
>    AFAICT, they may need to be answered prior to review and merge.

Maybe it is mainly about the reproducer that will be put into xfstests,
I am trying to reduce the reproduce time and will send it when done.

> 
> -Eric
>  
>>
>> On 2019/8/27 10:54, kaixuxia wrote:
>>> When performing rename operation with RENAME_WHITEOUT flag, we will
>>> hold AGF lock to allocate or free extents in manipulating the dirents
>>> firstly, and then doing the xfs_iunlink_remove() call last to hold
>>> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
>>>
>>> The big problem here is that we have an ordering constraint on AGF
>>> and AGI locking - inode allocation locks the AGI, then can allocate
>>> a new extent for new inodes, locking the AGF after the AGI. Hence
>>> the ordering that is imposed by other parts of the code is AGI before
>>> AGF. So we get an ABBA deadlock between the AGI and AGF here.
>>>
>>> Process A:
>>> Call trace:
>>>  ? __schedule+0x2bd/0x620
>>>  schedule+0x33/0x90
>>>  schedule_timeout+0x17d/0x290
>>>  __down_common+0xef/0x125
>>>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>>>  down+0x3b/0x50
>>>  xfs_buf_lock+0x34/0xf0 [xfs]
>>>  xfs_buf_find+0x215/0x6c0 [xfs]
>>>  xfs_buf_get_map+0x37/0x230 [xfs]
>>>  xfs_buf_read_map+0x29/0x190 [xfs]
>>>  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
>>>  xfs_read_agf+0xa6/0x180 [xfs]
>>>  ? schedule_timeout+0x17d/0x290
>>>  xfs_alloc_read_agf+0x52/0x1f0 [xfs]
>>>  xfs_alloc_fix_freelist+0x432/0x590 [xfs]
>>>  ? down+0x3b/0x50
>>>  ? xfs_buf_lock+0x34/0xf0 [xfs]
>>>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>>>  xfs_alloc_vextent+0x301/0x6c0 [xfs]
>>>  xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
>>>  ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
>>>  xfs_dialloc+0x116/0x290 [xfs]
>>>  xfs_ialloc+0x6d/0x5e0 [xfs]
>>>  ? xfs_log_reserve+0x165/0x280 [xfs]
>>>  xfs_dir_ialloc+0x8c/0x240 [xfs]
>>>  xfs_create+0x35a/0x610 [xfs]
>>>  xfs_generic_create+0x1f1/0x2f0 [xfs]
>>>  ...
>>>
>>> Process B:
>>> Call trace:
>>>  ? __schedule+0x2bd/0x620
>>>  ? xfs_bmapi_allocate+0x245/0x380 [xfs]
>>>  schedule+0x33/0x90
>>>  schedule_timeout+0x17d/0x290
>>>  ? xfs_buf_find+0x1fd/0x6c0 [xfs]
>>>  __down_common+0xef/0x125
>>>  ? xfs_buf_get_map+0x37/0x230 [xfs]
>>>  ? xfs_buf_find+0x215/0x6c0 [xfs]
>>>  down+0x3b/0x50
>>>  xfs_buf_lock+0x34/0xf0 [xfs]
>>>  xfs_buf_find+0x215/0x6c0 [xfs]
>>>  xfs_buf_get_map+0x37/0x230 [xfs]
>>>  xfs_buf_read_map+0x29/0x190 [xfs]
>>>  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
>>>  xfs_read_agi+0xa8/0x160 [xfs]
>>>  xfs_iunlink_remove+0x6f/0x2a0 [xfs]
>>>  ? current_time+0x46/0x80
>>>  ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
>>>  xfs_rename+0x57a/0xae0 [xfs]
>>>  xfs_vn_rename+0xe4/0x150 [xfs]
>>>  ...
>>>
>>> In this patch we move the xfs_iunlink_remove() call to
>>> before acquiring the AGF lock to preserve correct AGI/AGF locking
>>> order.
>>>
>>> Signed-off-by: kaixuxia <kaixuxia@tencent.com>
>>> Reviewed-by: Brian Foster <bfoster@redhat.com>
>>> ---
>>>  fs/xfs/xfs_inode.c | 83 +++++++++++++++++++++++++++---------------------------
>>>  1 file changed, 42 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>>> index 6467d5e..8ffd44f 100644
>>> --- a/fs/xfs/xfs_inode.c
>>> +++ b/fs/xfs/xfs_inode.c
>>> @@ -3282,7 +3282,8 @@ struct xfs_iunlink {
>>>  					spaceres);
>>>  
>>>  	/*
>>> -	 * Set up the target.
>>> +	 * Check for expected errors before we dirty the transaction
>>> +	 * so we can return an error without a transaction abort.
>>>  	 */
>>>  	if (target_ip == NULL) {
>>>  		/*
>>> @@ -3294,6 +3295,46 @@ struct xfs_iunlink {
>>>  			if (error)
>>>  				goto out_trans_cancel;
>>>  		}
>>> +	} else {
>>> +		/*
>>> +		 * If target exists and it's a directory, check that whether
>>> +		 * it can be destroyed.
>>> +		 */
>>> +		if (S_ISDIR(VFS_I(target_ip)->i_mode) &&
>>> +		    (!xfs_dir_isempty(target_ip) ||
>>> +		     (VFS_I(target_ip)->i_nlink > 2))) {
>>> +			error = -EEXIST;
>>> +			goto out_trans_cancel;
>>> +		}
>>> +	}
>>> +
>>> +	/*
>>> +	 * Directory entry creation below may acquire the AGF. Remove
>>> +	 * the whiteout from the unlinked list first to preserve correct
>>> +	 * AGI/AGF locking order. This dirties the transaction so failures
>>> +	 * after this point will abort and log recovery will clean up the
>>> +	 * mess.
>>> +	 *
>>> +	 * For whiteouts, we need to bump the link count on the whiteout
>>> +	 * inode. After this point, we have a real link, clear the tmpfile
>>> +	 * state flag from the inode so it doesn't accidentally get misused
>>> +	 * in future.
>>> +	 */
>>> +	if (wip) {
>>> +		ASSERT(VFS_I(wip)->i_nlink == 0);
>>> +		error = xfs_iunlink_remove(tp, wip);
>>> +		if (error)
>>> +			goto out_trans_cancel;
>>> +
>>> +		xfs_bumplink(tp, wip);
>>> +		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
>>> +		VFS_I(wip)->i_state &= ~I_LINKABLE;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Set up the target.
>>> +	 */
>>> +	if (target_ip == NULL) {
>>>  		/*
>>>  		 * If target does not exist and the rename crosses
>>>  		 * directories, adjust the target directory link count
>>> @@ -3312,22 +3353,6 @@ struct xfs_iunlink {
>>>  		}
>>>  	} else { /* target_ip != NULL */
>>>  		/*
>>> -		 * If target exists and it's a directory, check that both
>>> -		 * target and source are directories and that target can be
>>> -		 * destroyed, or that neither is a directory.
>>> -		 */
>>> -		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
>>> -			/*
>>> -			 * Make sure target dir is empty.
>>> -			 */
>>> -			if (!(xfs_dir_isempty(target_ip)) ||
>>> -			    (VFS_I(target_ip)->i_nlink > 2)) {
>>> -				error = -EEXIST;
>>> -				goto out_trans_cancel;
>>> -			}
>>> -		}
>>> -
>>> -		/*
>>>  		 * Link the source inode under the target name.
>>>  		 * If the source inode is a directory and we are moving
>>>  		 * it across directories, its ".." entry will be
>>> @@ -3417,30 +3442,6 @@ struct xfs_iunlink {
>>>  	if (error)
>>>  		goto out_trans_cancel;
>>>  
>>> -	/*
>>> -	 * For whiteouts, we need to bump the link count on the whiteout inode.
>>> -	 * This means that failures all the way up to this point leave the inode
>>> -	 * on the unlinked list and so cleanup is a simple matter of dropping
>>> -	 * the remaining reference to it. If we fail here after bumping the link
>>> -	 * count, we're shutting down the filesystem so we'll never see the
>>> -	 * intermediate state on disk.
>>> -	 */
>>> -	if (wip) {
>>> -		ASSERT(VFS_I(wip)->i_nlink == 0);
>>> -		xfs_bumplink(tp, wip);
>>> -		error = xfs_iunlink_remove(tp, wip);
>>> -		if (error)
>>> -			goto out_trans_cancel;
>>> -		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
>>> -
>>> -		/*
>>> -		 * Now we have a real link, clear the "I'm a tmpfile" state
>>> -		 * flag from the inode so it doesn't accidentally get misused in
>>> -		 * future.
>>> -		 */
>>> -		VFS_I(wip)->i_state &= ~I_LINKABLE;
>>> -	}
>>> -
>>>  	xfs_trans_ichgtime(tp, src_dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
>>>  	xfs_trans_log_inode(tp, src_dp, XFS_ILOG_CORE);
>>>  	if (new_parent)
>>>
>>

-- 
kaixuxia

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag
  2019-08-29  3:20     ` kaixuxia
@ 2019-08-29  3:22       ` Eric Sandeen
  2019-08-29  5:35       ` Dave Chinner
  1 sibling, 0 replies; 6+ messages in thread
From: Eric Sandeen @ 2019-08-29  3:22 UTC (permalink / raw)
  To: kaixuxia, linux-xfs, Darrick J. Wong
  Cc: Dave Chinner, Brian Foster, newtongao, jasperwang

On 8/28/19 10:20 PM, kaixuxia wrote:
> 
> 
> On 2019/8/29 11:04, Eric Sandeen wrote:
>> On 8/28/19 8:27 PM, kaixuxia wrote:
>>> ping...
>>> Because there isn't this patch in the latest xfs-for-next branch 
>>> update...
>>
>> 1) V3 appears to have no changes from V2.  Why was V3 sent?
>>
> The V3 subject has been changed to distinguish from another droplink 
> deadlock patch that will be sent soon. This V3 patch aim to fix the 
> rename whiteout deadlock.

Ah I missed the subject change, sorry.

FWIW, adding a changelog below the "---" helps people know what changed
without needing to look too hard :)

>> 2) Neither version has a reviewed-by yet, Darrick has questions outstanding
>>    AFAICT, they may need to be answered prior to review and merge.
> 
> Maybe it is mainly about the reproducer that will be put into xfstests,
> I am trying to reduce the reproduce time and will send it when done.

I see.

-Eric



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag
  2019-08-29  3:20     ` kaixuxia
  2019-08-29  3:22       ` Eric Sandeen
@ 2019-08-29  5:35       ` Dave Chinner
  1 sibling, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2019-08-29  5:35 UTC (permalink / raw)
  To: kaixuxia
  Cc: Eric Sandeen, linux-xfs, Darrick J. Wong, Brian Foster,
	newtongao, jasperwang

On Thu, Aug 29, 2019 at 11:20:37AM +0800, kaixuxia wrote:
> 
> 
> On 2019/8/29 11:04, Eric Sandeen wrote:
> > On 8/28/19 8:27 PM, kaixuxia wrote:
> >> ping...
> >> Because there isn't this patch in the latest xfs-for-next branch 
> >> update...
> > 
> > 1) V3 appears to have no changes from V2.  Why was V3 sent?
> > 
> The V3 subject has been changed to distinguish from another droplink 
> deadlock patch that will be sent soon. This V3 patch aim to fix the 
> rename whiteout deadlock.

And now the subject line for the patch is far too long.  it is
almost twice as long as the old one, doesn't fit on one line without
wrapping, and really doesn't tell us anything much extra. If you
really must say it's a whiteout issue, then:

xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT

is sufficient. It's a summary - the above contains the problem, the
objects involved and the operation that needs to be fixed. Nothing
more is necessary. It needs to fit into ~60 characters so that
when you run git log --one-line the output can be pasted directly
into email without wrapping....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-08-29  5:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-27  2:54 [PATCH v3] xfs: Fix ABBA deadlock between AGI and AGF when performing rename() with RENAME_WHITEOUT flag kaixuxia
2019-08-29  1:27 ` kaixuxia
2019-08-29  3:04   ` Eric Sandeen
2019-08-29  3:20     ` kaixuxia
2019-08-29  3:22       ` Eric Sandeen
2019-08-29  5:35       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).