From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 061B8C83007 for ; Tue, 28 Apr 2020 18:47:52 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D4DDB20B80 for ; Tue, 28 Apr 2020 18:47:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588099671; bh=PBa8/sAUJ1kaH7h+CieQcrHixtRkOD8ZqlXUDuc/O9I=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=FS7BazXQs8outAWH8CZLQ/7nI34MqJPYTFfzr2fj/JbmLzci8nhC3mAigndeLX1PF e2iDrs/00uzj702mmU23MTYCEvN2HdgohFGJCKSUe8KesmpBmXvyj3LK6OF2jnWNZS W3/S4aDwKs3XsiZBKUVtl+rTodLxuPSX0BZgadWQ= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731200AbgD1Srv (ORCPT ); Tue, 28 Apr 2020 14:47:51 -0400 Received: from mail.kernel.org ([198.145.29.99]:33178 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731008AbgD1Slg (ORCPT ); Tue, 28 Apr 2020 14:41:36 -0400 Received: from localhost (83-86-89-107.cable.dynamic.v4.ziggo.nl [83.86.89.107]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 4951020B80; Tue, 28 Apr 2020 18:41:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588099295; bh=PBa8/sAUJ1kaH7h+CieQcrHixtRkOD8ZqlXUDuc/O9I=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=xFTiriSilKZH++i8x1w9RZ90FQS3W/mRrr+Xts25Ec3QgF58XAGMbj3q/WjbeXwk9 MfZVEhcmxE3sQ3cREyTqRjR8R93esKGdjEOKHr7FKAa//nZ47MJyNjIo2+gQ0LYy1Y MJv0QkSpSugU1J5NoXh+M6lIgGvMWqykHqT7az34= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, kaixuxia , Brian Foster , "Darrick J. Wong" , Suraj Jitindar Singh Subject: [PATCH 4.19 130/131] xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT Date: Tue, 28 Apr 2020 20:25:42 +0200 Message-Id: <20200428182241.693131606@linuxfoundation.org> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200428182224.822179290@linuxfoundation.org> References: <20200428182224.822179290@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: kaixuxia commit bc56ad8c74b8588685c2875de0df8ab6974828ef upstream. When performing rename operation with RENAME_WHITEOUT flag, we will hold AGF lock to allocate or free extents in manipulating the dirents firstly, and then doing the xfs_iunlink_remove() call last to hold AGI lock to modify the tmpfile info, so we the lock order AGI->AGF. The big problem here is that we have an ordering constraint on AGF and AGI locking - inode allocation locks the AGI, then can allocate a new extent for new inodes, locking the AGF after the AGI. Hence the ordering that is imposed by other parts of the code is AGI before AGF. So we get an ABBA deadlock between the AGI and AGF here. Process A: Call trace: ? __schedule+0x2bd/0x620 schedule+0x33/0x90 schedule_timeout+0x17d/0x290 __down_common+0xef/0x125 ? xfs_buf_find+0x215/0x6c0 [xfs] down+0x3b/0x50 xfs_buf_lock+0x34/0xf0 [xfs] xfs_buf_find+0x215/0x6c0 [xfs] xfs_buf_get_map+0x37/0x230 [xfs] xfs_buf_read_map+0x29/0x190 [xfs] xfs_trans_read_buf_map+0x13d/0x520 [xfs] xfs_read_agf+0xa6/0x180 [xfs] ? schedule_timeout+0x17d/0x290 xfs_alloc_read_agf+0x52/0x1f0 [xfs] xfs_alloc_fix_freelist+0x432/0x590 [xfs] ? down+0x3b/0x50 ? xfs_buf_lock+0x34/0xf0 [xfs] ? xfs_buf_find+0x215/0x6c0 [xfs] xfs_alloc_vextent+0x301/0x6c0 [xfs] xfs_ialloc_ag_alloc+0x182/0x700 [xfs] ? _xfs_trans_bjoin+0x72/0xf0 [xfs] xfs_dialloc+0x116/0x290 [xfs] xfs_ialloc+0x6d/0x5e0 [xfs] ? xfs_log_reserve+0x165/0x280 [xfs] xfs_dir_ialloc+0x8c/0x240 [xfs] xfs_create+0x35a/0x610 [xfs] xfs_generic_create+0x1f1/0x2f0 [xfs] ... Process B: Call trace: ? __schedule+0x2bd/0x620 ? xfs_bmapi_allocate+0x245/0x380 [xfs] schedule+0x33/0x90 schedule_timeout+0x17d/0x290 ? xfs_buf_find+0x1fd/0x6c0 [xfs] __down_common+0xef/0x125 ? xfs_buf_get_map+0x37/0x230 [xfs] ? xfs_buf_find+0x215/0x6c0 [xfs] down+0x3b/0x50 xfs_buf_lock+0x34/0xf0 [xfs] xfs_buf_find+0x215/0x6c0 [xfs] xfs_buf_get_map+0x37/0x230 [xfs] xfs_buf_read_map+0x29/0x190 [xfs] xfs_trans_read_buf_map+0x13d/0x520 [xfs] xfs_read_agi+0xa8/0x160 [xfs] xfs_iunlink_remove+0x6f/0x2a0 [xfs] ? current_time+0x46/0x80 ? xfs_trans_ichgtime+0x39/0xb0 [xfs] xfs_rename+0x57a/0xae0 [xfs] xfs_vn_rename+0xe4/0x150 [xfs] ... In this patch we move the xfs_iunlink_remove() call to before acquiring the AGF lock to preserve correct AGI/AGF locking order. [Minor massage required due to upstream change making xfs_bumplink() a void function where as in the 4.19.y tree the return value is checked, even though it is always zero. Only change was to the last code block removed by the patch. Functionally equivalent to upstream.] Signed-off-by: kaixuxia Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Suraj Jitindar Singh Signed-off-by: Greg Kroah-Hartman --- fs/xfs/xfs_inode.c | 85 ++++++++++++++++++++++++++--------------------------- 1 file changed, 42 insertions(+), 43 deletions(-) --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2949,7 +2949,8 @@ xfs_rename( spaceres); /* - * Set up the target. + * Check for expected errors before we dirty the transaction + * so we can return an error without a transaction abort. */ if (target_ip == NULL) { /* @@ -2961,6 +2962,46 @@ xfs_rename( if (error) goto out_trans_cancel; } + } else { + /* + * If target exists and it's a directory, check that whether + * it can be destroyed. + */ + if (S_ISDIR(VFS_I(target_ip)->i_mode) && + (!xfs_dir_isempty(target_ip) || + (VFS_I(target_ip)->i_nlink > 2))) { + error = -EEXIST; + goto out_trans_cancel; + } + } + + /* + * Directory entry creation below may acquire the AGF. Remove + * the whiteout from the unlinked list first to preserve correct + * AGI/AGF locking order. This dirties the transaction so failures + * after this point will abort and log recovery will clean up the + * mess. + * + * For whiteouts, we need to bump the link count on the whiteout + * inode. After this point, we have a real link, clear the tmpfile + * state flag from the inode so it doesn't accidentally get misused + * in future. + */ + if (wip) { + ASSERT(VFS_I(wip)->i_nlink == 0); + error = xfs_iunlink_remove(tp, wip); + if (error) + goto out_trans_cancel; + + xfs_bumplink(tp, wip); + xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE); + VFS_I(wip)->i_state &= ~I_LINKABLE; + } + + /* + * Set up the target. + */ + if (target_ip == NULL) { /* * If target does not exist and the rename crosses * directories, adjust the target directory link count @@ -2981,22 +3022,6 @@ xfs_rename( } } else { /* target_ip != NULL */ /* - * If target exists and it's a directory, check that both - * target and source are directories and that target can be - * destroyed, or that neither is a directory. - */ - if (S_ISDIR(VFS_I(target_ip)->i_mode)) { - /* - * Make sure target dir is empty. - */ - if (!(xfs_dir_isempty(target_ip)) || - (VFS_I(target_ip)->i_nlink > 2)) { - error = -EEXIST; - goto out_trans_cancel; - } - } - - /* * Link the source inode under the target name. * If the source inode is a directory and we are moving * it across directories, its ".." entry will be @@ -3086,32 +3111,6 @@ xfs_rename( if (error) goto out_trans_cancel; - /* - * For whiteouts, we need to bump the link count on the whiteout inode. - * This means that failures all the way up to this point leave the inode - * on the unlinked list and so cleanup is a simple matter of dropping - * the remaining reference to it. If we fail here after bumping the link - * count, we're shutting down the filesystem so we'll never see the - * intermediate state on disk. - */ - if (wip) { - ASSERT(VFS_I(wip)->i_nlink == 0); - error = xfs_bumplink(tp, wip); - if (error) - goto out_trans_cancel; - error = xfs_iunlink_remove(tp, wip); - if (error) - goto out_trans_cancel; - xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE); - - /* - * Now we have a real link, clear the "I'm a tmpfile" state - * flag from the inode so it doesn't accidentally get misused in - * future. - */ - VFS_I(wip)->i_state &= ~I_LINKABLE; - } - xfs_trans_ichgtime(tp, src_dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG); xfs_trans_log_inode(tp, src_dp, XFS_ILOG_CORE); if (new_parent)