* [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag @ 2019-08-19 13:06 kaixuxia 2019-08-19 15:13 ` Brian Foster 0 siblings, 1 reply; 11+ messages in thread From: kaixuxia @ 2019-08-19 13:06 UTC (permalink / raw) To: linux-xfs Cc: Darrick J. Wong, Brian Foster, Dave Chinner, newtongao, jasperwang, xiakaixu1987 When performing rename operation with RENAME_WHITEOUT flag, we will hold AGF lock to allocate or free extents in manipulating the dirents firstly, and then doing the xfs_iunlink_remove() call last to hold AGI lock to modify the tmpfile info, so we the lock order AGI->AGF. The big problem here is that we have an ordering constraint on AGF and AGI locking - inode allocation locks the AGI, then can allocate a new extent for new inodes, locking the AGF after the AGI. Hence the ordering that is imposed by other parts of the code is AGI before AGF. So we get the ABBA agi&agf deadlock here. Process A: Call trace: ? __schedule+0x2bd/0x620 schedule+0x33/0x90 schedule_timeout+0x17d/0x290 __down_common+0xef/0x125 ? xfs_buf_find+0x215/0x6c0 [xfs] down+0x3b/0x50 xfs_buf_lock+0x34/0xf0 [xfs] xfs_buf_find+0x215/0x6c0 [xfs] xfs_buf_get_map+0x37/0x230 [xfs] xfs_buf_read_map+0x29/0x190 [xfs] xfs_trans_read_buf_map+0x13d/0x520 [xfs] xfs_read_agf+0xa6/0x180 [xfs] ? schedule_timeout+0x17d/0x290 xfs_alloc_read_agf+0x52/0x1f0 [xfs] xfs_alloc_fix_freelist+0x432/0x590 [xfs] ? down+0x3b/0x50 ? xfs_buf_lock+0x34/0xf0 [xfs] ? xfs_buf_find+0x215/0x6c0 [xfs] xfs_alloc_vextent+0x301/0x6c0 [xfs] xfs_ialloc_ag_alloc+0x182/0x700 [xfs] ? _xfs_trans_bjoin+0x72/0xf0 [xfs] xfs_dialloc+0x116/0x290 [xfs] xfs_ialloc+0x6d/0x5e0 [xfs] ? xfs_log_reserve+0x165/0x280 [xfs] xfs_dir_ialloc+0x8c/0x240 [xfs] xfs_create+0x35a/0x610 [xfs] xfs_generic_create+0x1f1/0x2f0 [xfs] ... Process B: Call trace: ? __schedule+0x2bd/0x620 ? xfs_bmapi_allocate+0x245/0x380 [xfs] schedule+0x33/0x90 schedule_timeout+0x17d/0x290 ? xfs_buf_find+0x1fd/0x6c0 [xfs] __down_common+0xef/0x125 ? xfs_buf_get_map+0x37/0x230 [xfs] ? xfs_buf_find+0x215/0x6c0 [xfs] down+0x3b/0x50 xfs_buf_lock+0x34/0xf0 [xfs] xfs_buf_find+0x215/0x6c0 [xfs] xfs_buf_get_map+0x37/0x230 [xfs] xfs_buf_read_map+0x29/0x190 [xfs] xfs_trans_read_buf_map+0x13d/0x520 [xfs] xfs_read_agi+0xa8/0x160 [xfs] xfs_iunlink_remove+0x6f/0x2a0 [xfs] ? current_time+0x46/0x80 ? xfs_trans_ichgtime+0x39/0xb0 [xfs] xfs_rename+0x57a/0xae0 [xfs] xfs_vn_rename+0xe4/0x150 [xfs] ... In this patch we move the xfs_iunlink_remove() call to between xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink _remove() firstly, we remove the AGI/AGF lock inversion problem. Signed-off-by: kaixuxia <kaixuxia@tencent.com> --- fs/xfs/xfs_inode.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 6467d5e..48691f2 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3294,6 +3294,18 @@ struct xfs_iunlink { if (error) goto out_trans_cancel; } + + /* + * Handle the whiteout inode and acquire the AGI lock, so + * fix the AGI/AGF lock inversion problem. + */ + if (wip) { + ASSERT(VFS_I(wip)->i_nlink == 0); + error = xfs_iunlink_remove(tp, wip); + if (error) + goto out_trans_cancel; + } + /* * If target does not exist and the rename crosses * directories, adjust the target directory link count @@ -3428,9 +3440,11 @@ struct xfs_iunlink { if (wip) { ASSERT(VFS_I(wip)->i_nlink == 0); xfs_bumplink(tp, wip); - error = xfs_iunlink_remove(tp, wip); - if (error) - goto out_trans_cancel; + if (target_ip != NULL) { + error = xfs_iunlink_remove(tp, wip); + if (error) + goto out_trans_cancel; + } xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE); /* -- 1.8.3.1 -- kaixuxia ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-19 13:06 [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag kaixuxia @ 2019-08-19 15:13 ` Brian Foster 2019-08-19 22:12 ` Dave Chinner 2019-08-20 6:45 ` kaixuxia 0 siblings, 2 replies; 11+ messages in thread From: Brian Foster @ 2019-08-19 15:13 UTC (permalink / raw) To: kaixuxia; +Cc: linux-xfs, Darrick J. Wong, Dave Chinner, newtongao, jasperwang On Mon, Aug 19, 2019 at 09:06:39PM +0800, kaixuxia wrote: > When performing rename operation with RENAME_WHITEOUT flag, we will > hold AGF lock to allocate or free extents in manipulating the dirents > firstly, and then doing the xfs_iunlink_remove() call last to hold > AGI lock to modify the tmpfile info, so we the lock order AGI->AGF. > > The big problem here is that we have an ordering constraint on AGF > and AGI locking - inode allocation locks the AGI, then can allocate > a new extent for new inodes, locking the AGF after the AGI. Hence > the ordering that is imposed by other parts of the code is AGI before > AGF. So we get the ABBA agi&agf deadlock here. > ... > > In this patch we move the xfs_iunlink_remove() call to between > xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink > _remove() firstly, we remove the AGI/AGF lock inversion problem. > > Signed-off-by: kaixuxia <kaixuxia@tencent.com> > --- > fs/xfs/xfs_inode.c | 20 +++++++++++++++++--- > 1 file changed, 17 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 6467d5e..48691f2 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -3294,6 +3294,18 @@ struct xfs_iunlink { > if (error) > goto out_trans_cancel; > } > + > + /* > + * Handle the whiteout inode and acquire the AGI lock, so > + * fix the AGI/AGF lock inversion problem. > + */ The comment could be a little more specific. For example: "Directory entry creation may acquire the AGF. Remove the whiteout from the unlinked list first to preserve correct AGI/AGF locking order." > + if (wip) { > + ASSERT(VFS_I(wip)->i_nlink == 0); > + error = xfs_iunlink_remove(tp, wip); > + if (error) > + goto out_trans_cancel; > + } > + > /* > * If target does not exist and the rename crosses > * directories, adjust the target directory link count > @@ -3428,9 +3440,11 @@ struct xfs_iunlink { > if (wip) { > ASSERT(VFS_I(wip)->i_nlink == 0); > xfs_bumplink(tp, wip); > - error = xfs_iunlink_remove(tp, wip); > - if (error) > - goto out_trans_cancel; > + if (target_ip != NULL) { > + error = xfs_iunlink_remove(tp, wip); > + if (error) > + goto out_trans_cancel; > + } The comment above this hunk needs to be updated. I'm also not a big fan of this factoring of doing the removal in the if branch above and then encoding the else logic down here. It might be cleaner and more consistent to have a call in each branch of the if/else above. FWIW, I'm also curious if this could be cleaned up further by pulling the -ENOSPC/-EEXIST checks out of the earlier branch, following that with the whiteout removal, and then doing the dir_create/replace. For example, something like: /* error checks before we dirty the transaction */ if (!target_ip && !spaceres) { error = xfs_dir_canenter(); ... } else if (S_ISDIR() && !(empty || nlink > 2)) error = -EEXIST; ... } if (wip) { ... xfs_iunlink_remove(); } if (!target_ip) { xfs_dir_create(); ... } else { xfs_dir_replace(); ... } ... but that may not be any cleaner..? It could also be done as a followup cleanup patch as well. Brian > xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE); > > /* > -- > 1.8.3.1 > > -- > kaixuxia ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-19 15:13 ` Brian Foster @ 2019-08-19 22:12 ` Dave Chinner 2019-08-20 6:45 ` kaixuxia 1 sibling, 0 replies; 11+ messages in thread From: Dave Chinner @ 2019-08-19 22:12 UTC (permalink / raw) To: Brian Foster; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang On Mon, Aug 19, 2019 at 11:13:35AM -0400, Brian Foster wrote: > On Mon, Aug 19, 2019 at 09:06:39PM +0800, kaixuxia wrote: > > When performing rename operation with RENAME_WHITEOUT flag, we will > > hold AGF lock to allocate or free extents in manipulating the dirents > > firstly, and then doing the xfs_iunlink_remove() call last to hold > > AGI lock to modify the tmpfile info, so we the lock order AGI->AGF. > > > > The big problem here is that we have an ordering constraint on AGF > > and AGI locking - inode allocation locks the AGI, then can allocate > > a new extent for new inodes, locking the AGF after the AGI. Hence > > the ordering that is imposed by other parts of the code is AGI before > > AGF. So we get the ABBA agi&agf deadlock here. > > > ... > > > > In this patch we move the xfs_iunlink_remove() call to between > > xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink > > _remove() firstly, we remove the AGI/AGF lock inversion problem. > > > > Signed-off-by: kaixuxia <kaixuxia@tencent.com> > > --- > > fs/xfs/xfs_inode.c | 20 +++++++++++++++++--- > > 1 file changed, 17 insertions(+), 3 deletions(-) > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > index 6467d5e..48691f2 100644 > > --- a/fs/xfs/xfs_inode.c > > +++ b/fs/xfs/xfs_inode.c > > @@ -3294,6 +3294,18 @@ struct xfs_iunlink { > > if (error) > > goto out_trans_cancel; > > } > > + > > + /* > > + * Handle the whiteout inode and acquire the AGI lock, so > > + * fix the AGI/AGF lock inversion problem. > > + */ > > The comment could be a little more specific. For example: > > "Directory entry creation may acquire the AGF. Remove the whiteout from > the unlinked list first to preserve correct AGI/AGF locking order." > > > + if (wip) { > > + ASSERT(VFS_I(wip)->i_nlink == 0); > > + error = xfs_iunlink_remove(tp, wip); > > + if (error) > > + goto out_trans_cancel; > > + } > > + > > /* > > * If target does not exist and the rename crosses > > * directories, adjust the target directory link count > > @@ -3428,9 +3440,11 @@ struct xfs_iunlink { > > if (wip) { > > ASSERT(VFS_I(wip)->i_nlink == 0); > > xfs_bumplink(tp, wip); > > - error = xfs_iunlink_remove(tp, wip); > > - if (error) > > - goto out_trans_cancel; > > + if (target_ip != NULL) { > > + error = xfs_iunlink_remove(tp, wip); > > + if (error) > > + goto out_trans_cancel; > > + } > > The comment above this hunk needs to be updated. I'm also not a big fan > of this factoring of doing the removal in the if branch above and then > encoding the else logic down here. It might be cleaner and more > consistent to have a call in each branch of the if/else above. > > FWIW, I'm also curious if this could be cleaned up further by pulling > the -ENOSPC/-EEXIST checks out of the earlier branch, following that > with the whiteout removal, and then doing the dir_create/replace. For > example, something like: > > /* error checks before we dirty the transaction */ > if (!target_ip && !spaceres) { > error = xfs_dir_canenter(); > ... > } else if (S_ISDIR() && !(empty || nlink > 2)) > error = -EEXIST; > ... > } > > if (wip) { > ... > xfs_iunlink_remove(); > } > > if (!target_ip) { > xfs_dir_create(); > ... > } else { > xfs_dir_replace(); > ... > } > > ... but that may not be any cleaner..? It could also be done as a > followup cleanup patch as well. Makes sense, but the more I look at this, the more I think we should get rid of on-disk whiteout inodes altogether and finally make use of XFS_DIR3_FT_WHT in the dirent file type. The whiteout inode is a nasty VFS level hack to support whiteouts on filesystems that have no way of storing whiteouts natively. i.e. a whiteout inode is a chardev with a mode of 0 and a device number of 0. We only need to present that magic indoe to userspace, we don't need to actually store it on disk. IOWs, we could get rid of this whole problem with XFS_DIR3_FT_WHT. XFS_DIR3_FT_WHT means we no longer need to allocate a tmpfile for the whiteout, we don't have xfs_iunlink_remove calls in rename, and all we have to do is a xfs_dir_replace() call to zero out the inode number and change the ftype in the dirent. It will need changes to the xfs_lookup code to instantiate the special chardev inode in memory when a whiteout is found, and probably checks to ensure we never dirty such an inode. Giving it an invalid inode number will be sufficient to prevent it from being written to disk. It's a bit more work, but I think killing on-disk whiteout inode is the eventual solution we want here... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-19 15:13 ` Brian Foster 2019-08-19 22:12 ` Dave Chinner @ 2019-08-20 6:45 ` kaixuxia 2019-08-20 8:07 ` Dave Chinner 1 sibling, 1 reply; 11+ messages in thread From: kaixuxia @ 2019-08-20 6:45 UTC (permalink / raw) To: Brian Foster Cc: linux-xfs, Darrick J. Wong, Dave Chinner, newtongao, jasperwang On 2019/8/19 23:13, Brian Foster wrote: > On Mon, Aug 19, 2019 at 09:06:39PM +0800, kaixuxia wrote: >> When performing rename operation with RENAME_WHITEOUT flag, we will >> hold AGF lock to allocate or free extents in manipulating the dirents >> firstly, and then doing the xfs_iunlink_remove() call last to hold >> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF. >> >> The big problem here is that we have an ordering constraint on AGF >> and AGI locking - inode allocation locks the AGI, then can allocate >> a new extent for new inodes, locking the AGF after the AGI. Hence >> the ordering that is imposed by other parts of the code is AGI before >> AGF. So we get the ABBA agi&agf deadlock here. >> > ... >> >> In this patch we move the xfs_iunlink_remove() call to between >> xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink >> _remove() firstly, we remove the AGI/AGF lock inversion problem. >> >> Signed-off-by: kaixuxia <kaixuxia@tencent.com> >> --- >> fs/xfs/xfs_inode.c | 20 +++++++++++++++++--- >> 1 file changed, 17 insertions(+), 3 deletions(-) >> >> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c >> index 6467d5e..48691f2 100644 >> --- a/fs/xfs/xfs_inode.c >> +++ b/fs/xfs/xfs_inode.c >> @@ -3294,6 +3294,18 @@ struct xfs_iunlink { >> if (error) >> goto out_trans_cancel; >> } >> + >> + /* >> + * Handle the whiteout inode and acquire the AGI lock, so >> + * fix the AGI/AGF lock inversion problem. >> + */ > > The comment could be a little more specific. For example: > > "Directory entry creation may acquire the AGF. Remove the whiteout from > the unlinked list first to preserve correct AGI/AGF locking order." Right, makes more sense. > >> + if (wip) { >> + ASSERT(VFS_I(wip)->i_nlink == 0); >> + error = xfs_iunlink_remove(tp, wip); >> + if (error) >> + goto out_trans_cancel; >> + } >> + >> /* >> * If target does not exist and the rename crosses >> * directories, adjust the target directory link count >> @@ -3428,9 +3440,11 @@ struct xfs_iunlink { >> if (wip) { >> ASSERT(VFS_I(wip)->i_nlink == 0); >> xfs_bumplink(tp, wip); >> - error = xfs_iunlink_remove(tp, wip); >> - if (error) >> - goto out_trans_cancel; >> + if (target_ip != NULL) { >> + error = xfs_iunlink_remove(tp, wip); >> + if (error) >> + goto out_trans_cancel; >> + } > > The comment above this hunk needs to be updated. I'm also not a big fan > of this factoring of doing the removal in the if branch above and then > encoding the else logic down here. It might be cleaner and more > consistent to have a call in each branch of the if/else above. > > FWIW, I'm also curious if this could be cleaned up further by pulling > the -ENOSPC/-EEXIST checks out of the earlier branch, following that > with the whiteout removal, and then doing the dir_create/replace. For > example, something like: > > /* error checks before we dirty the transaction */ > if (!target_ip && !spaceres) { > error = xfs_dir_canenter(); > ... > } else if (S_ISDIR() && !(empty || nlink > 2)) > error = -EEXIST; > ... > } > > if (wip) { > ... > xfs_iunlink_remove(); > } > > if (!target_ip) { > xfs_dir_create(); > ... > } else { > xfs_dir_replace(); > ... > } > > ... but that may not be any cleaner..? It could also be done as a > followup cleanup patch as well. Yep, it is cleaner that making the whole check before the transaction becomes dirty, just return the error code if check failed and the filesystem is clean. Dave gave another solution in the other subthread that using XFS_DIR3_FT_WHT, it's a bit more work for this bug, include refactoring the xfs_rename() and xfs_lookup(), not sure whether it's worth the complex changes for this bug. > > Brian > >> xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE); >> >> /* >> -- >> 1.8.3.1 >> >> -- >> kaixuxia -- kaixuxia ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-20 6:45 ` kaixuxia @ 2019-08-20 8:07 ` Dave Chinner 2019-08-20 8:53 ` kaixuxia 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2019-08-20 8:07 UTC (permalink / raw) To: kaixuxia; +Cc: Brian Foster, linux-xfs, Darrick J. Wong, newtongao, jasperwang On Tue, Aug 20, 2019 at 02:45:36PM +0800, kaixuxia wrote: > On 2019/8/19 23:13, Brian Foster wrote: > > /* error checks before we dirty the transaction */ > > if (!target_ip && !spaceres) { > > error = xfs_dir_canenter(); > > ... > > } else if (S_ISDIR() && !(empty || nlink > 2)) > > error = -EEXIST; > > ... > > } > > > > if (wip) { > > ... > > xfs_iunlink_remove(); > > } > > > > if (!target_ip) { > > xfs_dir_create(); > > ... > > } else { > > xfs_dir_replace(); > > ... > > } > > > > ... but that may not be any cleaner..? It could also be done as a > > followup cleanup patch as well. > > Yep, it is cleaner that making the whole check before the transaction > becomes dirty, just return the error code if check failed and > the filesystem is clean. *nod* > Dave gave another solution in the other subthread that using > XFS_DIR3_FT_WHT, it's a bit more work for this bug, include > refactoring the xfs_rename() and xfs_lookup(), not sure whether > it's worth the complex changes for this bug. It's not necessary to fix the bug, but it's somethign we should be looking to do because it makes whiteout handling a lot more efficient - it's just dirent modifications at that point, no inodes are necessary. This is how I always intended to handle whiteouts - it's just another thing on the "we need to fix" list.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-20 8:07 ` Dave Chinner @ 2019-08-20 8:53 ` kaixuxia 2019-08-20 10:51 ` Brian Foster 0 siblings, 1 reply; 11+ messages in thread From: kaixuxia @ 2019-08-20 8:53 UTC (permalink / raw) To: Dave Chinner Cc: Brian Foster, linux-xfs, Darrick J. Wong, newtongao, jasperwang On 2019/8/20 16:07, Dave Chinner wrote: > On Tue, Aug 20, 2019 at 02:45:36PM +0800, kaixuxia wrote: >> On 2019/8/19 23:13, Brian Foster wrote: >>> /* error checks before we dirty the transaction */ >>> if (!target_ip && !spaceres) { >>> error = xfs_dir_canenter(); >>> ... >>> } else if (S_ISDIR() && !(empty || nlink > 2)) >>> error = -EEXIST; >>> ... >>> } >>> >>> if (wip) { >>> ... >>> xfs_iunlink_remove(); >>> } >>> >>> if (!target_ip) { >>> xfs_dir_create(); >>> ... >>> } else { >>> xfs_dir_replace(); >>> ... >>> } >>> >>> ... but that may not be any cleaner..? It could also be done as a >>> followup cleanup patch as well. >> >> Yep, it is cleaner that making the whole check before the transaction >> becomes dirty, just return the error code if check failed and >> the filesystem is clean. > > *nod* > >> Dave gave another solution in the other subthread that using >> XFS_DIR3_FT_WHT, it's a bit more work for this bug, include >> refactoring the xfs_rename() and xfs_lookup(), not sure whether >> it's worth the complex changes for this bug. > > It's not necessary to fix the bug, but it's somethign we should > be looking to do because it makes whiteout handling a lot more > efficient - it's just dirent modifications at that point, no inodes > are necessary. > > This is how I always intended to handle whiteouts - it's just > another thing on the "we need to fix" list.... Right, it is more efficient because there is no need to store it on disk, and it will improve performance just like the async deferred operations. Maybe it is on the roadmap, so I'm not sure whether I should send the V3 patch to address Brian's comments. Maybe we can choose the V3 patch first, and then the whiteout improvement could be done as the followup patch in future... > > Cheers, > > Dave. > -- kaixuxia ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-20 8:53 ` kaixuxia @ 2019-08-20 10:51 ` Brian Foster 2019-08-20 11:23 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Brian Foster @ 2019-08-20 10:51 UTC (permalink / raw) To: kaixuxia; +Cc: Dave Chinner, linux-xfs, Darrick J. Wong, newtongao, jasperwang On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote: > > > On 2019/8/20 16:07, Dave Chinner wrote: > > On Tue, Aug 20, 2019 at 02:45:36PM +0800, kaixuxia wrote: > > > On 2019/8/19 23:13, Brian Foster wrote: > > > > /* error checks before we dirty the transaction */ > > > > if (!target_ip && !spaceres) { > > > > error = xfs_dir_canenter(); > > > > ... > > > > } else if (S_ISDIR() && !(empty || nlink > 2)) > > > > error = -EEXIST; > > > > ... > > > > } > > > > > > > > if (wip) { > > > > ... > > > > xfs_iunlink_remove(); > > > > } > > > > > > > > if (!target_ip) { > > > > xfs_dir_create(); > > > > ... > > > > } else { > > > > xfs_dir_replace(); > > > > ... > > > > } > > > > > > > > ... but that may not be any cleaner..? It could also be done as a > > > > followup cleanup patch as well. > > > > > > Yep, it is cleaner that making the whole check before the transaction > > > becomes dirty, just return the error code if check failed and > > > the filesystem is clean. > > > > *nod* > > > > > Dave gave another solution in the other subthread that using > > > XFS_DIR3_FT_WHT, it's a bit more work for this bug, include > > > refactoring the xfs_rename() and xfs_lookup(), not sure whether > > > it's worth the complex changes for this bug. > > Yeah, I wasn't aware of that option. What Dave describes wrt to replacing the on-disk whiteout inode with a dirent + in-core variant sounds like the clear best option to me over the ones previously discussed. > > It's not necessary to fix the bug, but it's somethign we should > > be looking to do because it makes whiteout handling a lot more > > efficient - it's just dirent modifications at that point, no inodes > > are necessary. > > > > This is how I always intended to handle whiteouts - it's just > > another thing on the "we need to fix" list.... > > Right, it is more efficient because there is no need to store it on disk, > and it will improve performance just like the async deferred operations. > Maybe it is on the roadmap, so I'm not sure whether I should send the V3 > patch to address Brian's comments. Maybe we can choose the V3 patch first, > and then the whiteout improvement could be done as the followup patch > in future... > I agree. I think a two step process makes sense because we may want a backportable fix around for the locking bug that doesn't depend on replacing the implementation. FWIW if we do take that approach, then IMO it's worth reconsidering the 1-2 liner I originally proposed to fix the locking. It's slightly hacky, but really all three options are hacky in slightly different ways. The flipside is it's trivial to implement, review and backport and now would be removed shortly thereafter when we replace the on-disk whiteout with the in-core fake whiteout thing. Just my .02 though.. Brian > > > > Cheers, > > > > Dave. > > > > -- > kaixuxia ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-20 10:51 ` Brian Foster @ 2019-08-20 11:23 ` Dave Chinner 2019-08-20 12:23 ` Brian Foster 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2019-08-20 11:23 UTC (permalink / raw) To: Brian Foster; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote: > On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote: > FWIW if we do take that approach, then IMO it's worth reconsidering the > 1-2 liner I originally proposed to fix the locking. It's slightly hacky, > but really all three options are hacky in slightly different ways. The > flipside is it's trivial to implement, review and backport and now would > be removed shortly thereafter when we replace the on-disk whiteout with > the in-core fake whiteout thing. Just my .02 though.. We've got to keep the existing whiteout method around for, essentially, forever, because we have to support kernels that don't do in-memory translations of DT_WHT to a magic chardev inode and vice versa (i.e. via mknod). IOWs, we'll need a feature bit to indicate that we actually have DT_WHT based whiteouts on disk. So we may as well fix this properly now by restructuring the code as we will still have to maintain this functionality for a long time to come. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-20 11:23 ` Dave Chinner @ 2019-08-20 12:23 ` Brian Foster 2019-08-20 22:13 ` Dave Chinner 0 siblings, 1 reply; 11+ messages in thread From: Brian Foster @ 2019-08-20 12:23 UTC (permalink / raw) To: Dave Chinner; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang On Tue, Aug 20, 2019 at 09:23:04PM +1000, Dave Chinner wrote: > On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote: > > On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote: > > FWIW if we do take that approach, then IMO it's worth reconsidering the > > 1-2 liner I originally proposed to fix the locking. It's slightly hacky, > > but really all three options are hacky in slightly different ways. The > > flipside is it's trivial to implement, review and backport and now would > > be removed shortly thereafter when we replace the on-disk whiteout with > > the in-core fake whiteout thing. Just my .02 though.. > > We've got to keep the existing whiteout method around for, > essentially, forever, because we have to support kernels that don't > do in-memory translations of DT_WHT to a magic chardev inode and > vice versa (i.e. via mknod). IOWs, we'll need a feature bit to > indicate that we actually have DT_WHT based whiteouts on disk. > I'm not quite following (probably just because I'm not terribly familiar with the use case). If current kernels know how to fake up whiteout inodes in memory based on a dentry, why do we need to continue to create new on-disk whiteout inodes just because a filesystem might already have such inodes on disk? Wouldn't the old format whiteouts just continue to work as expected without any extra handling? I can see needing a feature bit to restrict a filesystem from being used on an unsupported, older kernel, but is there a reason we wouldn't just enable that by default anyways? Brian > So we may as well fix this properly now by restructuring the code as > we will still have to maintain this functionality for a long time to > come. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-20 12:23 ` Brian Foster @ 2019-08-20 22:13 ` Dave Chinner 2019-08-21 11:25 ` Brian Foster 0 siblings, 1 reply; 11+ messages in thread From: Dave Chinner @ 2019-08-20 22:13 UTC (permalink / raw) To: Brian Foster; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang On Tue, Aug 20, 2019 at 08:23:00AM -0400, Brian Foster wrote: > On Tue, Aug 20, 2019 at 09:23:04PM +1000, Dave Chinner wrote: > > On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote: > > > On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote: > > > FWIW if we do take that approach, then IMO it's worth reconsidering the > > > 1-2 liner I originally proposed to fix the locking. It's slightly hacky, > > > but really all three options are hacky in slightly different ways. The > > > flipside is it's trivial to implement, review and backport and now would > > > be removed shortly thereafter when we replace the on-disk whiteout with > > > the in-core fake whiteout thing. Just my .02 though.. > > > > We've got to keep the existing whiteout method around for, > > essentially, forever, because we have to support kernels that don't > > do in-memory translations of DT_WHT to a magic chardev inode and > > vice versa (i.e. via mknod). IOWs, we'll need a feature bit to > > indicate that we actually have DT_WHT based whiteouts on disk. > > > > I'm not quite following (probably just because I'm not terribly familiar > with the use case). If current kernels know how to fake up whiteout > inodes in memory based on a dentry, why do we need to continue to create > new on-disk whiteout inodes just because a filesystem might already have > such inodes on disk? We don't, unless there's a chance the filesystem will be booted again on an older kernel. So it's a one-way conversion: once we start using DT_WHT based whiteouts, there is no going back. > Wouldn't the old format whiteouts just continue to > work as expected without any extra handling? Yes, but that's not the problem. It's forwards compatibility that matters here. i.e. upgrade a kernel, something doesn't work, roll back to older kernel. Everything should work if the feature set on the filesystem is unchanged, even though the filesystem was used with a newer kernel. If the newer kernel has done something on disk that the older kernel does not understand (e.g. using DT_WHT based whiteouts), then the newer kernel must mark the filesystem with a feature bit to prevent the older kernel from doing the wrong thing with the format it doesn't exect to see. Mis-parsing a whiteout and not applying it correctly is a stale data exposure security bug, so we have to be careful here. Existing kernels don't know how to take a DT_WHT whiteout and fake up a magical chardev inode. Hence DT_WHT whiteouts will not work correctly on current kernels, even though we have DT_WHT in the dir ftype feature (and hence available on both v4 and v5 filesystems). i.e. the issue here is the VFS has defined the on-disk whiteout format, and we want the on-disk storage from chardev inodes to DT_WHT. Hence we need a feature bit because we are changing the method of storing whiteouts on disk, even though we -technically- aren't changing the _XFS_ on-disk format. Think of it as though DT_WHT support doesn't exist in XFS, and now we are going to add it... > I can see needing a feature bit to restrict a filesystem from being used > on an unsupported, older kernel, but is there a reason we wouldn't just > enable that by default anyways? Rule #1: Never enable new on-disk feature bits by default :) Yes, in future we can decide to automatically enable the feature bit in the kernel code, but that's not something we can do until kernel support for the DT_WHT based whiteouts is already widespread as it's not a backwards compatible change. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag 2019-08-20 22:13 ` Dave Chinner @ 2019-08-21 11:25 ` Brian Foster 0 siblings, 0 replies; 11+ messages in thread From: Brian Foster @ 2019-08-21 11:25 UTC (permalink / raw) To: Dave Chinner; +Cc: kaixuxia, linux-xfs, Darrick J. Wong, newtongao, jasperwang On Wed, Aug 21, 2019 at 08:13:37AM +1000, Dave Chinner wrote: > On Tue, Aug 20, 2019 at 08:23:00AM -0400, Brian Foster wrote: > > On Tue, Aug 20, 2019 at 09:23:04PM +1000, Dave Chinner wrote: > > > On Tue, Aug 20, 2019 at 06:51:01AM -0400, Brian Foster wrote: > > > > On Tue, Aug 20, 2019 at 04:53:22PM +0800, kaixuxia wrote: > > > > FWIW if we do take that approach, then IMO it's worth reconsidering the > > > > 1-2 liner I originally proposed to fix the locking. It's slightly hacky, > > > > but really all three options are hacky in slightly different ways. The > > > > flipside is it's trivial to implement, review and backport and now would > > > > be removed shortly thereafter when we replace the on-disk whiteout with > > > > the in-core fake whiteout thing. Just my .02 though.. > > > > > > We've got to keep the existing whiteout method around for, > > > essentially, forever, because we have to support kernels that don't > > > do in-memory translations of DT_WHT to a magic chardev inode and > > > vice versa (i.e. via mknod). IOWs, we'll need a feature bit to > > > indicate that we actually have DT_WHT based whiteouts on disk. > > > > > > > I'm not quite following (probably just because I'm not terribly familiar > > with the use case). If current kernels know how to fake up whiteout > > inodes in memory based on a dentry, why do we need to continue to create > > new on-disk whiteout inodes just because a filesystem might already have > > such inodes on disk? > > We don't, unless there's a chance the filesystem will be booted > again on an older kernel. So it's a one-way conversion: once we > start using DT_WHT based whiteouts, there is no going back. > Ok. I think I had the (wrong) impression that all we had to do was create the whiteout dentry and the inode magic would happen elsewhere in overlayfs or something, but in reality overlayfs only cares about the magic inode and DT_WHT basically just saves us from allocating a physical inode. > > Wouldn't the old format whiteouts just continue to > > work as expected without any extra handling? > > Yes, but that's not the problem. It's forwards compatibility that > matters here. i.e. upgrade a kernel, something doesn't work, roll > back to older kernel. Everything should work if the feature set on > the filesystem is unchanged, even though the filesystem was used > with a newer kernel. > > If the newer kernel has done something on disk that the older kernel > does not understand (e.g. using DT_WHT based whiteouts), then the > newer kernel must mark the filesystem with a feature bit to prevent > the older kernel from doing the wrong thing with the format it > doesn't exect to see. Mis-parsing a whiteout and not applying it > correctly is a stale data exposure security bug, so we have to be > careful here. > > Existing kernels don't know how to take a DT_WHT whiteout and fake > up a magical chardev inode. Hence DT_WHT whiteouts will not work > correctly on current kernels, even though we have DT_WHT in the dir > ftype feature (and hence available on both v4 and v5 filesystems). > i.e. the issue here is the VFS has defined the on-disk whiteout > format, and we want the on-disk storage from chardev inodes to > DT_WHT. Hence we need a feature bit because we are changing the > method of storing whiteouts on disk, even though we > -technically- aren't changing the _XFS_ on-disk format. > > Think of it as though DT_WHT support doesn't exist in XFS, and now > we are going to add it... > Yep, makes sense. > > I can see needing a feature bit to restrict a filesystem from being used > > on an unsupported, older kernel, but is there a reason we wouldn't just > > enable that by default anyways? > > Rule #1: Never enable new on-disk feature bits by default > Heh, Ok. That all seems proper enough reason to keep the existing rename whiteout code around for a while. Thanks. Brian > :) > > Yes, in future we can decide to automatically enable the feature bit > in the kernel code, but that's not something we can do until kernel > support for the DT_WHT based whiteouts is already widespread as it's > not a backwards compatible change. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-08-21 11:25 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-19 13:06 [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag kaixuxia 2019-08-19 15:13 ` Brian Foster 2019-08-19 22:12 ` Dave Chinner 2019-08-20 6:45 ` kaixuxia 2019-08-20 8:07 ` Dave Chinner 2019-08-20 8:53 ` kaixuxia 2019-08-20 10:51 ` Brian Foster 2019-08-20 11:23 ` Dave Chinner 2019-08-20 12:23 ` Brian Foster 2019-08-20 22:13 ` Dave Chinner 2019-08-21 11:25 ` Brian Foster
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.