All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
@ 2019-08-19 13:06 kaixuxia
  2019-08-19 15:13 ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: kaixuxia @ 2019-08-19 13:06 UTC (permalink / raw)
  To: linux-xfs
  Cc: Darrick J. Wong, Brian Foster, Dave Chinner, newtongao,
	jasperwang, xiakaixu1987

When performing rename operation with RENAME_WHITEOUT flag, we will
hold AGF lock to allocate or free extents in manipulating the dirents
firstly, and then doing the xfs_iunlink_remove() call last to hold
AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.

The big problem here is that we have an ordering constraint on AGF
and AGI locking - inode allocation locks the AGI, then can allocate
a new extent for new inodes, locking the AGF after the AGI. Hence
the ordering that is imposed by other parts of the code is AGI before
AGF. So we get the ABBA agi&agf deadlock here.

Process A:
Call trace:
  ? __schedule+0x2bd/0x620
  schedule+0x33/0x90
  schedule_timeout+0x17d/0x290
  __down_common+0xef/0x125
  ? xfs_buf_find+0x215/0x6c0 [xfs]
  down+0x3b/0x50
  xfs_buf_lock+0x34/0xf0 [xfs]
  xfs_buf_find+0x215/0x6c0 [xfs]
  xfs_buf_get_map+0x37/0x230 [xfs]
  xfs_buf_read_map+0x29/0x190 [xfs]
  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
  xfs_read_agf+0xa6/0x180 [xfs]
  ? schedule_timeout+0x17d/0x290
  xfs_alloc_read_agf+0x52/0x1f0 [xfs]
  xfs_alloc_fix_freelist+0x432/0x590 [xfs]
  ? down+0x3b/0x50
  ? xfs_buf_lock+0x34/0xf0 [xfs]
  ? xfs_buf_find+0x215/0x6c0 [xfs]
  xfs_alloc_vextent+0x301/0x6c0 [xfs]
  xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
  ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
  xfs_dialloc+0x116/0x290 [xfs]
  xfs_ialloc+0x6d/0x5e0 [xfs]
  ? xfs_log_reserve+0x165/0x280 [xfs]
  xfs_dir_ialloc+0x8c/0x240 [xfs]
  xfs_create+0x35a/0x610 [xfs]
  xfs_generic_create+0x1f1/0x2f0 [xfs]
  ...

Process B:
Call trace:
  ? __schedule+0x2bd/0x620
  ? xfs_bmapi_allocate+0x245/0x380 [xfs]
  schedule+0x33/0x90
  schedule_timeout+0x17d/0x290
  ? xfs_buf_find+0x1fd/0x6c0 [xfs]
  __down_common+0xef/0x125
  ? xfs_buf_get_map+0x37/0x230 [xfs]
  ? xfs_buf_find+0x215/0x6c0 [xfs]
  down+0x3b/0x50
  xfs_buf_lock+0x34/0xf0 [xfs]
  xfs_buf_find+0x215/0x6c0 [xfs]
  xfs_buf_get_map+0x37/0x230 [xfs]
  xfs_buf_read_map+0x29/0x190 [xfs]
  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
  xfs_read_agi+0xa8/0x160 [xfs]
  xfs_iunlink_remove+0x6f/0x2a0 [xfs]
  ? current_time+0x46/0x80
  ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
  xfs_rename+0x57a/0xae0 [xfs]
  xfs_vn_rename+0xe4/0x150 [xfs]
  ...

In this patch we move the xfs_iunlink_remove() call to between
xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink
_remove() firstly, we remove the AGI/AGF lock inversion problem.

Signed-off-by: kaixuxia <kaixuxia@tencent.com>
---
  fs/xfs/xfs_inode.c | 20 +++++++++++++++++---
  1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6467d5e..48691f2 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3294,6 +3294,18 @@ struct xfs_iunlink {
  			if (error)
  				goto out_trans_cancel;
  		}
+
+		/*
+		 * Handle the whiteout inode and acquire the AGI lock, so
+		 * fix the AGI/AGF lock inversion problem.
+		 */
+		if (wip) {
+			ASSERT(VFS_I(wip)->i_nlink == 0);
+			error = xfs_iunlink_remove(tp, wip);
+			if (error)
+				goto out_trans_cancel;
+		}
+
  		/*
  		 * If target does not exist and the rename crosses
  		 * directories, adjust the target directory link count
@@ -3428,9 +3440,11 @@ struct xfs_iunlink {
  	if (wip) {
  		ASSERT(VFS_I(wip)->i_nlink == 0);
  		xfs_bumplink(tp, wip);
-		error = xfs_iunlink_remove(tp, wip);
-		if (error)
-			goto out_trans_cancel;
+		if (target_ip != NULL) {
+			error = xfs_iunlink_remove(tp, wip);
+			if (error)
+				goto out_trans_cancel;
+		}
  		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);

  		/*
-- 
1.8.3.1

-- 
kaixuxia

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-19 13:06 [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag kaixuxia
@ 2019-08-19 15:13 ` Brian Foster
  2019-08-19 22:12   ` Dave Chinner
  2019-08-20  6:45   ` kaixuxia
  0 siblings, 2 replies; 11+ messages in thread
From: Brian Foster @ 2019-08-19 15:13 UTC (permalink / raw)
  To: kaixuxia; +Cc: linux-xfs, Darrick J. Wong, Dave Chinner, newtongao, jasperwang

On Mon, Aug 19, 2019 at 09:06:39PM +0800, kaixuxia wrote:
> When performing rename operation with RENAME_WHITEOUT flag, we will
> hold AGF lock to allocate or free extents in manipulating the dirents
> firstly, and then doing the xfs_iunlink_remove() call last to hold
> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
> 
> The big problem here is that we have an ordering constraint on AGF
> and AGI locking - inode allocation locks the AGI, then can allocate
> a new extent for new inodes, locking the AGF after the AGI. Hence
> the ordering that is imposed by other parts of the code is AGI before
> AGF. So we get the ABBA agi&agf deadlock here.
> 
...
> 
> In this patch we move the xfs_iunlink_remove() call to between
> xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink
> _remove() firstly, we remove the AGI/AGF lock inversion problem.
> 
> Signed-off-by: kaixuxia <kaixuxia@tencent.com>
> ---
>  fs/xfs/xfs_inode.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 6467d5e..48691f2 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3294,6 +3294,18 @@ struct xfs_iunlink {
>  			if (error)
>  				goto out_trans_cancel;
>  		}
> +
> +		/*
> +		 * Handle the whiteout inode and acquire the AGI lock, so
> +		 * fix the AGI/AGF lock inversion problem.
> +		 */

The comment could be a little more specific. For example:

"Directory entry creation may acquire the AGF. Remove the whiteout from
the unlinked list first to preserve correct AGI/AGF locking order."

> +		if (wip) {
> +			ASSERT(VFS_I(wip)->i_nlink == 0);
> +			error = xfs_iunlink_remove(tp, wip);
> +			if (error)
> +				goto out_trans_cancel;
> +		}
> +
>  		/*
>  		 * If target does not exist and the rename crosses
>  		 * directories, adjust the target directory link count
> @@ -3428,9 +3440,11 @@ struct xfs_iunlink {
>  	if (wip) {
>  		ASSERT(VFS_I(wip)->i_nlink == 0);
>  		xfs_bumplink(tp, wip);
> -		error = xfs_iunlink_remove(tp, wip);
> -		if (error)
> -			goto out_trans_cancel;
> +		if (target_ip != NULL) {
> +			error = xfs_iunlink_remove(tp, wip);
> +			if (error)
> +				goto out_trans_cancel;
> +		}

The comment above this hunk needs to be updated. I'm also not a big fan
of this factoring of doing the removal in the if branch above and then
encoding the else logic down here. It might be cleaner and more
consistent to have a call in each branch of the if/else above.

FWIW, I'm also curious if this could be cleaned up further by pulling
the -ENOSPC/-EEXIST checks out of the earlier branch, following that
with the whiteout removal, and then doing the dir_create/replace. For
example, something like:

	/* error checks before we dirty the transaction */
	if (!target_ip && !spaceres) {
		error = xfs_dir_canenter();
		...
	} else if (S_ISDIR() && !(empty || nlink > 2))
		error = -EEXIST;
		...
	}

	if (wip) {
		...
		xfs_iunlink_remove();
	}

	if (!target_ip) {
		xfs_dir_create();
		...
	} else {
		xfs_dir_replace();
		...
	}

... but that may not be any cleaner..? It could also be done as a
followup cleanup patch as well.

Brian

>  		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
> 
>  		/*
> -- 
> 1.8.3.1
> 
> -- 
> kaixuxia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-19 15:13 ` Brian Foster
@ 2019-08-19 22:12   ` Dave Chinner
  2019-08-20  6:45   ` kaixuxia
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2019-08-19 22:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang

On Mon, Aug 19, 2019 at 11:13:35AM -0400, Brian Foster wrote:
> On Mon, Aug 19, 2019 at 09:06:39PM +0800, kaixuxia wrote:
> > When performing rename operation with RENAME_WHITEOUT flag, we will
> > hold AGF lock to allocate or free extents in manipulating the dirents
> > firstly, and then doing the xfs_iunlink_remove() call last to hold
> > AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
> > 
> > The big problem here is that we have an ordering constraint on AGF
> > and AGI locking - inode allocation locks the AGI, then can allocate
> > a new extent for new inodes, locking the AGF after the AGI. Hence
> > the ordering that is imposed by other parts of the code is AGI before
> > AGF. So we get the ABBA agi&agf deadlock here.
> > 
> ...
> > 
> > In this patch we move the xfs_iunlink_remove() call to between
> > xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink
> > _remove() firstly, we remove the AGI/AGF lock inversion problem.
> > 
> > Signed-off-by: kaixuxia <kaixuxia@tencent.com>
> > ---
> >  fs/xfs/xfs_inode.c | 20 +++++++++++++++++---
> >  1 file changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 6467d5e..48691f2 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3294,6 +3294,18 @@ struct xfs_iunlink {
> >  			if (error)
> >  				goto out_trans_cancel;
> >  		}
> > +
> > +		/*
> > +		 * Handle the whiteout inode and acquire the AGI lock, so
> > +		 * fix the AGI/AGF lock inversion problem.
> > +		 */
> 
> The comment could be a little more specific. For example:
> 
> "Directory entry creation may acquire the AGF. Remove the whiteout from
> the unlinked list first to preserve correct AGI/AGF locking order."
> 
> > +		if (wip) {
> > +			ASSERT(VFS_I(wip)->i_nlink == 0);
> > +			error = xfs_iunlink_remove(tp, wip);
> > +			if (error)
> > +				goto out_trans_cancel;
> > +		}
> > +
> >  		/*
> >  		 * If target does not exist and the rename crosses
> >  		 * directories, adjust the target directory link count
> > @@ -3428,9 +3440,11 @@ struct xfs_iunlink {
> >  	if (wip) {
> >  		ASSERT(VFS_I(wip)->i_nlink == 0);
> >  		xfs_bumplink(tp, wip);
> > -		error = xfs_iunlink_remove(tp, wip);
> > -		if (error)
> > -			goto out_trans_cancel;
> > +		if (target_ip != NULL) {
> > +			error = xfs_iunlink_remove(tp, wip);
> > +			if (error)
> > +				goto out_trans_cancel;
> > +		}
> 
> The comment above this hunk needs to be updated. I'm also not a big fan
> of this factoring of doing the removal in the if branch above and then
> encoding the else logic down here. It might be cleaner and more
> consistent to have a call in each branch of the if/else above.
> 
> FWIW, I'm also curious if this could be cleaned up further by pulling
> the -ENOSPC/-EEXIST checks out of the earlier branch, following that
> with the whiteout removal, and then doing the dir_create/replace. For
> example, something like:
> 
> 	/* error checks before we dirty the transaction */
> 	if (!target_ip && !spaceres) {
> 		error = xfs_dir_canenter();
> 		...
> 	} else if (S_ISDIR() && !(empty || nlink > 2))
> 		error = -EEXIST;
> 		...
> 	}
> 
> 	if (wip) {
> 		...
> 		xfs_iunlink_remove();
> 	}
> 
> 	if (!target_ip) {
> 		xfs_dir_create();
> 		...
> 	} else {
> 		xfs_dir_replace();
> 		...
> 	}
> 
> ... but that may not be any cleaner..? It could also be done as a
> followup cleanup patch as well.

Makes sense, but the more I look at this, the more I think we should
get rid of on-disk whiteout inodes altogether and finally make use
of XFS_DIR3_FT_WHT in the dirent file type.

The whiteout inode is a nasty VFS level hack to support whiteouts on
filesystems that have no way of storing whiteouts natively. i.e. a
whiteout inode is a chardev with a mode of 0 and a device number of
0. We only need to present that magic indoe to userspace, we don't
need to actually store it on disk.  IOWs, we could get rid of this
whole problem with XFS_DIR3_FT_WHT.

XFS_DIR3_FT_WHT means we no longer need to allocate a tmpfile for
the whiteout, we don't have xfs_iunlink_remove calls in rename, and
all we have to do is a xfs_dir_replace() call to zero out the inode
number and change the ftype in the dirent.

It will need changes to the xfs_lookup code to instantiate the
special chardev inode in memory when a whiteout is found, and
probably checks to ensure we never dirty such an inode. Giving it an
invalid inode number will be sufficient to prevent it from being
written to disk.

It's a bit more work, but I think killing on-disk whiteout inode is
the eventual solution we want here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-19 15:13 ` Brian Foster
  2019-08-19 22:12   ` Dave Chinner
@ 2019-08-20  6:45   ` kaixuxia
  2019-08-20  8:07     ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: kaixuxia @ 2019-08-20  6:45 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-xfs, Darrick J. Wong, Dave Chinner, newtongao, jasperwang

On 2019/8/19 23:13, Brian Foster wrote:
> On Mon, Aug 19, 2019 at 09:06:39PM +0800, kaixuxia wrote:
>> When performing rename operation with RENAME_WHITEOUT flag, we will
>> hold AGF lock to allocate or free extents in manipulating the dirents
>> firstly, and then doing the xfs_iunlink_remove() call last to hold
>> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
>>
>> The big problem here is that we have an ordering constraint on AGF
>> and AGI locking - inode allocation locks the AGI, then can allocate
>> a new extent for new inodes, locking the AGF after the AGI. Hence
>> the ordering that is imposed by other parts of the code is AGI before
>> AGF. So we get the ABBA agi&agf deadlock here.
>>
> ...
>>
>> In this patch we move the xfs_iunlink_remove() call to between
>> xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink
>> _remove() firstly, we remove the AGI/AGF lock inversion problem.
>>
>> Signed-off-by: kaixuxia <kaixuxia@tencent.com>
>> ---
>>   fs/xfs/xfs_inode.c | 20 +++++++++++++++++---
>>   1 file changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 6467d5e..48691f2 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -3294,6 +3294,18 @@ struct xfs_iunlink {
>>   			if (error)
>>   				goto out_trans_cancel;
>>   		}
>> +
>> +		/*
>> +		 * Handle the whiteout inode and acquire the AGI lock, so
>> +		 * fix the AGI/AGF lock inversion problem.
>> +		 */
> 
> The comment could be a little more specific. For example:
> 
> "Directory entry creation may acquire the AGF. Remove the whiteout from
> the unlinked list first to preserve correct AGI/AGF locking order."

Right, makes more sense.

> 
>> +		if (wip) {
>> +			ASSERT(VFS_I(wip)->i_nlink == 0);
>> +			error = xfs_iunlink_remove(tp, wip);
>> +			if (error)
>> +				goto out_trans_cancel;
>> +		}
>> +
>>   		/*
>>   		 * If target does not exist and the rename crosses
>>   		 * directories, adjust the target directory link count
>> @@ -3428,9 +3440,11 @@ struct xfs_iunlink {
>>   	if (wip) {
>>   		ASSERT(VFS_I(wip)->i_nlink == 0);
>>   		xfs_bumplink(tp, wip);
>> -		error = xfs_iunlink_remove(tp, wip);
>> -		if (error)
>> -			goto out_trans_cancel;
>> +		if (target_ip != NULL) {
>> +			error = xfs_iunlink_remove(tp, wip);
>> +			if (error)
>> +				goto out_trans_cancel;
>> +		}
> 
> The comment above this hunk needs to be updated. I'm also not a big fan
> of this factoring of doing the removal in the if branch above and then
> encoding the else logic down here. It might be cleaner and more
> consistent to have a call in each branch of the if/else above.
> 
> FWIW, I'm also curious if this could be cleaned up further by pulling
> the -ENOSPC/-EEXIST checks out of the earlier branch, following that
> with the whiteout removal, and then doing the dir_create/replace. For
> example, something like:
> 
> 	/* error checks before we dirty the transaction */
> 	if (!target_ip && !spaceres) {
> 		error = xfs_dir_canenter();
> 		...
> 	} else if (S_ISDIR() && !(empty || nlink > 2))
> 		error = -EEXIST;
> 		...
> 	}
> 
> 	if (wip) {
> 		...
> 		xfs_iunlink_remove();
> 	}
> 
> 	if (!target_ip) {
> 		xfs_dir_create();
> 		...
> 	} else {
> 		xfs_dir_replace();
> 		...
> 	}
> 
> ... but that may not be any cleaner..? It could also be done as a
> followup cleanup patch as well.

Yep, it is cleaner that making the whole check before the transaction
becomes dirty, just return the error code if check failed and
the filesystem is clean.

Dave gave another solution in the other subthread that using
XFS_DIR3_FT_WHT, it's a bit more work for this bug, include
refactoring the xfs_rename() and xfs_lookup(), not sure whether
it's worth the complex changes for this bug.

> 
> Brian
> 
>>   		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
>>
>>   		/*
>> -- 
>> 1.8.3.1
>>
>> -- 
>> kaixuxia

-- 
kaixuxia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-20  6:45   ` kaixuxia
@ 2019-08-20  8:07     ` Dave Chinner
  2019-08-20  8:53       ` kaixuxia
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2019-08-20  8:07 UTC (permalink / raw)
  To: kaixuxia; +Cc: Brian Foster, linux-xfs, Darrick J. Wong, newtongao, jasperwang

On Tue, Aug 20, 2019 at 02:45:36PM +0800, kaixuxia wrote:
> On 2019/8/19 23:13, Brian Foster wrote:
> > 	/* error checks before we dirty the transaction */
> > 	if (!target_ip && !spaceres) {
> > 		error = xfs_dir_canenter();
> > 		...
> > 	} else if (S_ISDIR() && !(empty || nlink > 2))
> > 		error = -EEXIST;
> > 		...
> > 	}
> > 
> > 	if (wip) {
> > 		...
> > 		xfs_iunlink_remove();
> > 	}
> > 
> > 	if (!target_ip) {
> > 		xfs_dir_create();
> > 		...
> > 	} else {
> > 		xfs_dir_replace();
> > 		...
> > 	}
> > 
> > ... but that may not be any cleaner..? It could also be done as a
> > followup cleanup patch as well.
> 
> Yep, it is cleaner that making the whole check before the transaction
> becomes dirty, just return the error code if check failed and
> the filesystem is clean.

*nod*

> Dave gave another solution in the other subthread that using
> XFS_DIR3_FT_WHT, it's a bit more work for this bug, include
> refactoring the xfs_rename() and xfs_lookup(), not sure whether
> it's worth the complex changes for this bug.

It's not necessary to fix the bug, but it's somethign we should
be looking to do because it makes whiteout handling a lot more
efficient - it's just dirent modifications at that point, no inodes
are necessary.

This is how I always intended to handle whiteouts - it's just
another thing on the "we need to fix" list....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-20  8:07     ` Dave Chinner
@ 2019-08-20  8:53       ` kaixuxia
  2019-08-20 10:51         ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: kaixuxia @ 2019-08-20  8:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, linux-xfs, Darrick J. Wong, newtongao, jasperwang



On 2019/8/20 16:07, Dave Chinner wrote:
> On Tue, Aug 20, 2019 at 02:45:36PM +0800, kaixuxia wrote:
>> On 2019/8/19 23:13, Brian Foster wrote:
>>> 	/* error checks before we dirty the transaction */
>>> 	if (!target_ip && !spaceres) {
>>> 		error = xfs_dir_canenter();
>>> 		...
>>> 	} else if (S_ISDIR() && !(empty || nlink > 2))
>>> 		error = -EEXIST;
>>> 		...
>>> 	}
>>>
>>> 	if (wip) {
>>> 		...
>>> 		xfs_iunlink_remove();
>>> 	}
>>>
>>> 	if (!target_ip) {
>>> 		xfs_dir_create();
>>> 		...
>>> 	} else {
>>> 		xfs_dir_replace();
>>> 		...
>>> 	}
>>>
>>> ... but that may not be any cleaner..? It could also be done as a
>>> followup cleanup patch as well.
>>
>> Yep, it is cleaner that making the whole check before the transaction
>> becomes dirty, just return the error code if check failed and
>> the filesystem is clean.
> 
> *nod*
> 
>> Dave gave another solution in the other subthread that using
>> XFS_DIR3_FT_WHT, it's a bit more work for this bug, include
>> refactoring the xfs_rename() and xfs_lookup(), not sure whether
>> it's worth the complex changes for this bug.
> 
> It's not necessary to fix the bug, but it's somethign we should
> be looking to do because it makes whiteout handling a lot more
> efficient - it's just dirent modifications at that point, no inodes
> are necessary.
> 
> This is how I always intended to handle whiteouts - it's just
> another thing on the "we need to fix" list....

Right, it is more efficient because there is no need to store it on disk,
and it will improve performance just like the async deferred operations.
Maybe it is on the roadmap, so I'm not sure whether I should send the V3
patch to address Brian's comments. Maybe we can choose the V3 patch first,
and then the whiteout improvement could be done as the followup patch
in future...

> 
> Cheers,
> 
> Dave.
> 

-- 
kaixuxia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-20  8:53       ` kaixuxia
@ 2019-08-20 10:51         ` Brian Foster
  2019-08-20 11:23           ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Brian Foster @ 2019-08-20 10:51 UTC (permalink / raw)
  To: kaixuxia; +Cc: Dave Chinner, linux-xfs, Darrick J. Wong, newtongao, jasperwang

On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote:
> 
> 
> On 2019/8/20 16:07, Dave Chinner wrote:
> > On Tue, Aug 20, 2019 at 02:45:36PM +0800, kaixuxia wrote:
> > > On 2019/8/19 23:13, Brian Foster wrote:
> > > > 	/* error checks before we dirty the transaction */
> > > > 	if (!target_ip && !spaceres) {
> > > > 		error = xfs_dir_canenter();
> > > > 		...
> > > > 	} else if (S_ISDIR() && !(empty || nlink > 2))
> > > > 		error = -EEXIST;
> > > > 		...
> > > > 	}
> > > > 
> > > > 	if (wip) {
> > > > 		...
> > > > 		xfs_iunlink_remove();
> > > > 	}
> > > > 
> > > > 	if (!target_ip) {
> > > > 		xfs_dir_create();
> > > > 		...
> > > > 	} else {
> > > > 		xfs_dir_replace();
> > > > 		...
> > > > 	}
> > > > 
> > > > ... but that may not be any cleaner..? It could also be done as a
> > > > followup cleanup patch as well.
> > > 
> > > Yep, it is cleaner that making the whole check before the transaction
> > > becomes dirty, just return the error code if check failed and
> > > the filesystem is clean.
> > 
> > *nod*
> > 
> > > Dave gave another solution in the other subthread that using
> > > XFS_DIR3_FT_WHT, it's a bit more work for this bug, include
> > > refactoring the xfs_rename() and xfs_lookup(), not sure whether
> > > it's worth the complex changes for this bug.
> > 

Yeah, I wasn't aware of that option. What Dave describes wrt to
replacing the on-disk whiteout inode with a dirent + in-core variant
sounds like the clear best option to me over the ones previously
discussed.

> > It's not necessary to fix the bug, but it's somethign we should
> > be looking to do because it makes whiteout handling a lot more
> > efficient - it's just dirent modifications at that point, no inodes
> > are necessary.
> > 
> > This is how I always intended to handle whiteouts - it's just
> > another thing on the "we need to fix" list....
> 
> Right, it is more efficient because there is no need to store it on disk,
> and it will improve performance just like the async deferred operations.
> Maybe it is on the roadmap, so I'm not sure whether I should send the V3
> patch to address Brian's comments. Maybe we can choose the V3 patch first,
> and then the whiteout improvement could be done as the followup patch
> in future...
> 

I agree. I think a two step process makes sense because we may want a
backportable fix around for the locking bug that doesn't depend on
replacing the implementation.

FWIW if we do take that approach, then IMO it's worth reconsidering the
1-2 liner I originally proposed to fix the locking. It's slightly hacky,
but really all three options are hacky in slightly different ways. The
flipside is it's trivial to implement, review and backport and now would
be removed shortly thereafter when we replace the on-disk whiteout with
the in-core fake whiteout thing. Just my .02 though..

Brian

> > 
> > Cheers,
> > 
> > Dave.
> > 
> 
> -- 
> kaixuxia

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-20 10:51         ` Brian Foster
@ 2019-08-20 11:23           ` Dave Chinner
  2019-08-20 12:23             ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2019-08-20 11:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang

On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote:
> On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote:
> FWIW if we do take that approach, then IMO it's worth reconsidering the
> 1-2 liner I originally proposed to fix the locking. It's slightly hacky,
> but really all three options are hacky in slightly different ways. The
> flipside is it's trivial to implement, review and backport and now would
> be removed shortly thereafter when we replace the on-disk whiteout with
> the in-core fake whiteout thing. Just my .02 though..

We've got to keep the existing whiteout method around for,
essentially, forever, because we have to support kernels that don't
do in-memory translations of DT_WHT to a magic chardev inode and
vice versa (i.e. via mknod). IOWs, we'll need a feature bit to
indicate that we actually have DT_WHT based whiteouts on disk.

So we may as well fix this properly now by restructuring the code as
we will still have to maintain this functionality for a long time to
come.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-20 11:23           ` Dave Chinner
@ 2019-08-20 12:23             ` Brian Foster
  2019-08-20 22:13               ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Brian Foster @ 2019-08-20 12:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang

On Tue, Aug 20, 2019 at 09:23:04PM +1000, Dave Chinner wrote:
> On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote:
> > On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote:
> > FWIW if we do take that approach, then IMO it's worth reconsidering the
> > 1-2 liner I originally proposed to fix the locking. It's slightly hacky,
> > but really all three options are hacky in slightly different ways. The
> > flipside is it's trivial to implement, review and backport and now would
> > be removed shortly thereafter when we replace the on-disk whiteout with
> > the in-core fake whiteout thing. Just my .02 though..
> 
> We've got to keep the existing whiteout method around for,
> essentially, forever, because we have to support kernels that don't
> do in-memory translations of DT_WHT to a magic chardev inode and
> vice versa (i.e. via mknod). IOWs, we'll need a feature bit to
> indicate that we actually have DT_WHT based whiteouts on disk.
> 

I'm not quite following (probably just because I'm not terribly familiar
with the use case). If current kernels know how to fake up whiteout
inodes in memory based on a dentry, why do we need to continue to create
new on-disk whiteout inodes just because a filesystem might already have
such inodes on disk? Wouldn't the old format whiteouts just continue to
work as expected without any extra handling?

I can see needing a feature bit to restrict a filesystem from being used
on an unsupported, older kernel, but is there a reason we wouldn't just
enable that by default anyways?

Brian

> So we may as well fix this properly now by restructuring the code as
> we will still have to maintain this functionality for a long time to
> come.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-20 12:23             ` Brian Foster
@ 2019-08-20 22:13               ` Dave Chinner
  2019-08-21 11:25                 ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2019-08-20 22:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang

On Tue, Aug 20, 2019 at 08:23:00AM -0400, Brian Foster wrote:
> On Tue, Aug 20, 2019 at 09:23:04PM +1000, Dave Chinner wrote:
> > On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote:
> > > On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote:
> > > FWIW if we do take that approach, then IMO it's worth reconsidering the
> > > 1-2 liner I originally proposed to fix the locking. It's slightly hacky,
> > > but really all three options are hacky in slightly different ways. The
> > > flipside is it's trivial to implement, review and backport and now would
> > > be removed shortly thereafter when we replace the on-disk whiteout with
> > > the in-core fake whiteout thing. Just my .02 though..
> > 
> > We've got to keep the existing whiteout method around for,
> > essentially, forever, because we have to support kernels that don't
> > do in-memory translations of DT_WHT to a magic chardev inode and
> > vice versa (i.e. via mknod). IOWs, we'll need a feature bit to
> > indicate that we actually have DT_WHT based whiteouts on disk.
> > 
> 
> I'm not quite following (probably just because I'm not terribly familiar
> with the use case). If current kernels know how to fake up whiteout
> inodes in memory based on a dentry, why do we need to continue to create
> new on-disk whiteout inodes just because a filesystem might already have
> such inodes on disk?

We don't, unless there's a chance the filesystem will be booted
again on an older kernel. So it's a one-way conversion: once we
start using DT_WHT based whiteouts, there is no going back.

> Wouldn't the old format whiteouts just continue to
> work as expected without any extra handling?

Yes, but that's not the problem. It's forwards compatibility that
matters here.  i.e. upgrade a kernel, something doesn't work, roll
back to older kernel. Everything should work if the feature set on
the filesystem is unchanged, even though the filesystem was used
with a newer kernel.

If the newer kernel has done something on disk that the older kernel
does not understand (e.g. using DT_WHT based whiteouts), then the
newer kernel must mark the filesystem with a feature bit to prevent
the older kernel from doing the wrong thing with the format it
doesn't exect to see. Mis-parsing a whiteout and not applying it
correctly is a stale data exposure security bug, so we have to be
careful here.

Existing kernels don't know how to take a DT_WHT whiteout and fake
up a magical chardev inode. Hence DT_WHT whiteouts will not work
correctly on current kernels, even though we have DT_WHT in the dir
ftype feature (and hence available on both v4 and v5 filesystems).
i.e. the issue here is the VFS has defined the on-disk whiteout
format, and we want the on-disk storage from chardev inodes to
DT_WHT. Hence we need a feature bit because we are changing the
method of storing whiteouts on disk, even though we
-technically- aren't changing the _XFS_ on-disk format.

Think of it as though DT_WHT support doesn't exist in XFS, and now
we are going to add it...

> I can see needing a feature bit to restrict a filesystem from being used
> on an unsupported, older kernel, but is there a reason we wouldn't just
> enable that by default anyways?

Rule #1: Never enable new on-disk feature bits by default

:)

Yes, in future we can decide to automatically enable the feature bit
in the kernel code, but that's not something we can do until kernel
support for the DT_WHT based whiteouts is already widespread as it's
not a backwards compatible change.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag
  2019-08-20 22:13               ` Dave Chinner
@ 2019-08-21 11:25                 ` Brian Foster
  0 siblings, 0 replies; 11+ messages in thread
From: Brian Foster @ 2019-08-21 11:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang

On Wed, Aug 21, 2019 at 08:13:37AM +1000, Dave Chinner wrote:
> On Tue, Aug 20, 2019 at 08:23:00AM -0400, Brian Foster wrote:
> > On Tue, Aug 20, 2019 at 09:23:04PM +1000, Dave Chinner wrote:
> > > On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote:
> > > > On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote:
> > > > FWIW if we do take that approach, then IMO it's worth reconsidering the
> > > > 1-2 liner I originally proposed to fix the locking. It's slightly hacky,
> > > > but really all three options are hacky in slightly different ways. The
> > > > flipside is it's trivial to implement, review and backport and now would
> > > > be removed shortly thereafter when we replace the on-disk whiteout with
> > > > the in-core fake whiteout thing. Just my .02 though..
> > > 
> > > We've got to keep the existing whiteout method around for,
> > > essentially, forever, because we have to support kernels that don't
> > > do in-memory translations of DT_WHT to a magic chardev inode and
> > > vice versa (i.e. via mknod). IOWs, we'll need a feature bit to
> > > indicate that we actually have DT_WHT based whiteouts on disk.
> > > 
> > 
> > I'm not quite following (probably just because I'm not terribly familiar
> > with the use case). If current kernels know how to fake up whiteout
> > inodes in memory based on a dentry, why do we need to continue to create
> > new on-disk whiteout inodes just because a filesystem might already have
> > such inodes on disk?
> 
> We don't, unless there's a chance the filesystem will be booted
> again on an older kernel. So it's a one-way conversion: once we
> start using DT_WHT based whiteouts, there is no going back.
> 

Ok. I think I had the (wrong) impression that all we had to do was
create the whiteout dentry and the inode magic would happen elsewhere in
overlayfs or something, but in reality overlayfs only cares about the
magic inode and DT_WHT basically just saves us from allocating a
physical inode.

> > Wouldn't the old format whiteouts just continue to
> > work as expected without any extra handling?
> 
> Yes, but that's not the problem. It's forwards compatibility that
> matters here.  i.e. upgrade a kernel, something doesn't work, roll
> back to older kernel. Everything should work if the feature set on
> the filesystem is unchanged, even though the filesystem was used
> with a newer kernel.
> 
> If the newer kernel has done something on disk that the older kernel
> does not understand (e.g. using DT_WHT based whiteouts), then the
> newer kernel must mark the filesystem with a feature bit to prevent
> the older kernel from doing the wrong thing with the format it
> doesn't exect to see. Mis-parsing a whiteout and not applying it
> correctly is a stale data exposure security bug, so we have to be
> careful here.
> 
> Existing kernels don't know how to take a DT_WHT whiteout and fake
> up a magical chardev inode. Hence DT_WHT whiteouts will not work
> correctly on current kernels, even though we have DT_WHT in the dir
> ftype feature (and hence available on both v4 and v5 filesystems).
> i.e. the issue here is the VFS has defined the on-disk whiteout
> format, and we want the on-disk storage from chardev inodes to
> DT_WHT. Hence we need a feature bit because we are changing the
> method of storing whiteouts on disk, even though we
> -technically- aren't changing the _XFS_ on-disk format.
> 
> Think of it as though DT_WHT support doesn't exist in XFS, and now
> we are going to add it...
> 

Yep, makes sense.

> > I can see needing a feature bit to restrict a filesystem from being used
> > on an unsupported, older kernel, but is there a reason we wouldn't just
> > enable that by default anyways?
> 
> Rule #1: Never enable new on-disk feature bits by default
> 

Heh, Ok. That all seems proper enough reason to keep the existing rename
whiteout code around for a while. Thanks.

Brian

> :)
> 
> Yes, in future we can decide to automatically enable the feature bit
> in the kernel code, but that's not something we can do until kernel
> support for the DT_WHT based whiteouts is already widespread as it's
> not a backwards compatible change.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-08-21 11:25 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-19 13:06 [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag kaixuxia
2019-08-19 15:13 ` Brian Foster
2019-08-19 22:12   ` Dave Chinner
2019-08-20  6:45   ` kaixuxia
2019-08-20  8:07     ` Dave Chinner
2019-08-20  8:53       ` kaixuxia
2019-08-20 10:51         ` Brian Foster
2019-08-20 11:23           ` Dave Chinner
2019-08-20 12:23             ` Brian Foster
2019-08-20 22:13               ` Dave Chinner
2019-08-21 11:25                 ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.