From: Chandan Babu R
To: Dave Chinner
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 7/7] xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
Date: Fri, 18 Mar 2022 17:00:52 +0530
Message-ID: <874k3vcxar.fsf@debian-BULLSEYE-live-builder-AMD64>
In-Reply-To: <20220317053907.164160-8-david@fromorbit.com>
References: <20220317053907.164160-1-david@fromorbit.com>
 <20220317053907.164160-8-david@fromorbit.com>

On 17 Mar 2022 at 11:09, Dave Chinner wrote:
> From: Dave Chinner
>
> I've been chasing a recent resurgence in generic/388 recovery
> failure and/or corruption events. The events have largely been
> uninitialised inode chunks being tripped over in log recovery
> such as:
>
> XFS (pmem1): User initiated shutdown received.
> pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
> XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem.
> XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
> XFS (pmem1): Unmounting Filesystem
> XFS (pmem1): Mounting V5 Filesystem
> XFS (pmem1): Starting recovery (logdev: internal)
> XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
> XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
> XFS (pmem1): Unmount and run xfs_repair
> XFS (pmem1): First 128 bytes of corrupted metadata buffer:
> 00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> 00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> 00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> 00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> 00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> 00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> 00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> 00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 ................
> XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
> XFS (pmem1): log mount/recovery failed: error -117
> XFS (pmem1): log mount failed
>
> There have been isolated random other issues, too - xfs_repair fails
> because it finds some corruption in symlink blocks, rmap
> inconsistencies, etc - but they are nowhere near as common as the
> uninitialised inode chunk failure.
>
> The problem has clearly happened at runtime before recovery has run;
> I can see the ICREATE log item in the log shortly before the
> actively recovered range of the log. This means the ICREATE was
> definitely created and written to the log, but for some reason the
> tail of the log has been moved past the ordered buffer log item that
> tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
> log moving past the ICREATE log item before the inode chunk buffer
> is written to disk.
>
> Tracing the fsstress processes that are running when the filesystem
> shut down immediately pin-pointed the problem:
>
> user shutdown marks xfs_mount as shutdown
>
> godown-213341 [008] 6398.022871: console: [ 6397.915392] XFS (pmem1): User initiated shutdown received.
> .....
>
> aild tries to push ordered inode cluster buffer
>
> xfsaild/pmem1-213314 [001] 6398.022974: xfs_buf_trylock: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
> xfsaild/pmem1-213314 [001] 6398.022976: xfs_ilock_nowait: dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
>
> xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
> calls xfs_iflush_abort() to kill writeback of the inode.
> Inode is removed from AIL, drops cluster buffer reference.
>
> xfsaild/pmem1-213314 [001] 6398.022977: xfs_ail_delete: dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
> xfsaild/pmem1-213314 [001] 6398.022978: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
>
> .....
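
It is worth spelling out why the xfs_ail_delete above is the dangerous
step: the log tail is pinned by the oldest item in the AIL, so removing
that item lets the tail jump forward to the next oldest item. A
simplified sketch of the relationship (paraphrased, not the verbatim
code in fs/xfs/xfs_trans_ail.c):

    /*
     * Simplified sketch: the log tail tracks the oldest (lowest LSN)
     * item in the AIL. Deleting that item - e.g. by aborting writeback
     * during a shutdown - lets the tail advance to the next oldest
     * item, even though the aborted item's changes were never written
     * back.
     */
    static xfs_lsn_t
    ail_tail_lsn_sketch(struct xfs_ail *ailp)
    {
            struct xfs_log_item *lip = xfs_ail_min(ailp); /* oldest */

            return lip ? lip->li_lsn : 0;  /* 0 means the AIL is empty */
    }

Once the buffer item pinning the ICREATE transaction is deleted,
nothing stops a later CIL push from writing that advanced tail to
disk, which is exactly what the rest of the trace shows.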
>
> All inodes on cluster buffer are aborted, then the cluster buffer
> itself is aborted and removed from the AIL *without writeback*:
>
> xfsaild/pmem1-213314 [001] 6398.023011: xfs_buf_error_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
> xfsaild/pmem1-213314 [001] 6398.023012: xfs_ail_delete: dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
>
> The inode buffer was at 7/20344 when it was removed from the AIL.
>
> xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_item_relse: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
> xfsaild/pmem1-213314 [001] 6398.023012: xfs_buf_rele: dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
>
> .....
>
> Userspace is still running, doing stuff. An fsstress process runs
> syncfs() or sync() and we end up in sync_fs_one_sb() which issues
> a log force. This pushes on the CIL:
>
> fsstress-213322 [001] 6398.024430: xfs_fs_sync_fs: dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
> fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
> fsstress-213322 [001] 6398.024430: xfs_log_force: dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
> <...>-194402 [001] 6398.024467: kmem_alloc: size 176 flags 0x14 caller xlog_cil_push_work+0x9f
>
> And the CIL fills up iclogs with pending changes. This picks up
> the current tail from the AIL:
>
> <...>-194402 [001] 6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags caller xlog_write+0x149
> <...>-194402 [001] 6398.024498: xlog_iclog_switch: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags caller xlog_state_get_iclog_space+0x37e
> <...>-194402 [001] 6398.024521: xlog_iclog_release: dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags caller xlog_write+0x5f9
> <...>-194402 [001] 6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
>
> And it moves the tail of the log to 7/21000 from 7/20344. This
> *moves the tail of the log beyond the ICREATE transaction* that was
> at 7/20344 and pinned by the inode cluster buffer that was cancelled
> above.
>
> ....
>
> godown-213341 [008] 6398.027005: xfs_force_shutdown: dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
> godown-213341 [008] 6398.027022: console: [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
> godown-213341 [008] 6398.030551: console: [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
>
> And finally the log itself is now shut down, stopping all further
> writes to the log. But this is too late to prevent the corruption
> that moving the tail of the log forwards after we start cancelling
> writeback causes.
>
> The fundamental problem here is that we are using the wrong shutdown
> checks for log items. We've long conflated mount shutdown with log
> shutdown state, and I started separating that recently with the
> atomic shutdown state changes in commit b36d4651e165 ("xfs: make
> forced shutdown processing atomic").
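
To make the distinction concrete: after that series, the two predicates
test different state words, roughly as below. This is paraphrased from
fs/xfs/xfs_mount.h and fs/xfs/xfs_log_priv.h as of this patchset,
simplified rather than quoted verbatim:

    /* Mount-wide state: set as soon as a forced shutdown is requested. */
    static inline bool
    xfs_is_shutdown(struct xfs_mount *mp)
    {
            return test_bit(XFS_OPSTATE_SHUTDOWN, &mp->m_opstate);
    }

    /* Journal state: set only once the log itself stops accepting writes. */
    static inline bool
    xlog_is_shutdown(struct xlog *log)
    {
            return test_bit(XLOG_IO_ERROR, &log->l_opstate);
    }

The window between the first bit and the second bit being set is
exactly where the trace above shows writeback being cancelled while
the CIL could still push and move the on-disk log tail.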
> The changes in that commit
> series are directly responsible for being able to diagnose this
> issue because it clearly separated mount shutdown from log
> shutdown.
>
> Essentially, once we start cancelling writeback of log items and
> removing them from the AIL because the filesystem is shut down, we
> *cannot* update the journal because we may have cancelled the items
> that pin the tail of the log. That moves the tail of the log
> forwards without having written the metadata back, hence we have
> corrupt in-memory state and writing to the journal propagates that
> to the on-disk state.
>
> What commit b36d4651e165 makes clear is that log item state needs to
> change relative to log shutdown, not mount shutdown. IOWs, anything
> that aborts metadata writeback needs to check log shutdown state
> because log items directly affect log consistency. Having them check
> mount shutdown state introduces the above race condition where we
> cancel metadata writeback before the log shuts down.
>
> To fix this, this patch works through all log items and converts
> shutdown checks to use xlog_is_shutdown() rather than
> xfs_is_shutdown(), so that we don't start aborting metadata
> writeback before we shut off journal writes.
>
> AFAICT, this race condition is a zero day IO error handling bug in
> XFS that dates back to the introduction of XLOG_IO_ERROR,
> XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
>

All the invocations of xfs_is_shutdown() that were replaced by
xlog_is_shutdown() are correct. The analysis posted above also looks
correct.

Reviewed-by: Chandan Babu R

> Signed-off-by: Dave Chinner
> ---
>  fs/xfs/xfs_buf.c        | 40 ++++++++++++++++++++++++++++++++--------
>  fs/xfs/xfs_icache.c     | 10 +++++++++-
>  fs/xfs/xfs_inode.c      | 15 +++++++++++++--
>  fs/xfs/xfs_inode_item.c | 12 ++++++++++++
>  fs/xfs/xfs_qm.c         |  8 ++++----
>  5 files changed, 70 insertions(+), 15 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 8867f143598e..3617d9d2bc73 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
>  #include "xfs_log_recover.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
> @@ -813,7 +814,15 @@ xfs_buf_read_map(
>           * buffer.
>           */
>          if (error) {
> -                if (!xfs_is_shutdown(target->bt_mount))
> +                /*
> +                 * Check against log shutdown for error reporting because
> +                 * metadata writeback may require a read first and we need to
> +                 * report errors in metadata writeback until the log is shut
> +                 * down. High level transaction read functions already check
> +                 * against mount shutdown, anyway, so we only need to be
> +                 * concerned about low level IO interactions here.
> +                 */
> +                if (!xlog_is_shutdown(target->bt_mount->m_log))
>                          xfs_buf_ioerror_alert(bp, fa);
>
>                  bp->b_flags &= ~XBF_DONE;
> @@ -1177,10 +1186,10 @@ xfs_buf_ioend_handle_error(
>          struct xfs_error_cfg    *cfg;
>
>          /*
> -         * If we've already decided to shutdown the filesystem because of I/O
> -         * errors, there's no point in giving this a retry.
> +         * If we've already shutdown the journal because of I/O errors, there's
> +         * no point in giving this a retry.
> */ > - if (xfs_is_shutdown(mp)) > + if (xlog_is_shutdown(mp->m_log)) > goto out_stale; > > xfs_buf_ioerror_alert_ratelimited(bp); > @@ -1593,8 +1602,23 @@ __xfs_buf_submit( > > ASSERT(!(bp->b_flags & _XBF_DELWRI_Q)); > > - /* on shutdown we stale and complete the buffer immediately */ > - if (xfs_is_shutdown(bp->b_mount)) { > + /* > + * On log shutdown we stale and complete the buffer immediately. We can > + * be called to read the superblock before the log has been set up, so > + * be careful checking the log state. > + * > + * Checking the mount shutdown state here can result in the log tail > + * moving inappropriately on disk as the log may not yet be shut down. > + * i.e. failing this buffer on mount shutdown can remove it from the AIL > + * and move the tail of the log forwards without having written this > + * buffer to disk. This corrupts the log tail state in memory, and > + * because the log may not be shut down yet, it can then be propagated > + * to disk before the log is shutdown. Hence we check log shutdown > + * state here rather than mount state to avoid corrupting the log tail > + * on shutdown. > + */ > + if (bp->b_mount->m_log && > + xlog_is_shutdown(bp->b_mount->m_log)) { > xfs_buf_ioend_fail(bp); > return -EIO; > } > @@ -1808,10 +1832,10 @@ xfs_buftarg_drain( > * If one or more failed buffers were freed, that means dirty metadata > * was thrown away. This should only ever happen after I/O completion > * handling has elevated I/O error(s) to permanent failures and shuts > - * down the fs. > + * down the journal. > */ > if (write_fail) { > - ASSERT(xfs_is_shutdown(btp->bt_mount)); > + ASSERT(xlog_is_shutdown(btp->bt_mount->m_log)); > xfs_alert(btp->bt_mount, > "Please run xfs_repair to determine the extent of the problem."); > } > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 9644f938990c..4148cdf7ce4a 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -23,6 +23,7 @@ > #include "xfs_reflink.h" > #include "xfs_ialloc.h" > #include "xfs_ag.h" > +#include "xfs_log_priv.h" > > #include > > @@ -873,7 +874,14 @@ xfs_reclaim_inode( > if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING)) > goto out_iunlock; > > - if (xfs_is_shutdown(ip->i_mount)) { > + /* > + * Check for log shutdown because aborting the inode can move the log > + * tail and corrupt in memory state. This is fine if the log is shut > + * down, but if the log is still active and only the mount is shut down > + * then the in-memory log tail movement caused by the abort can be > + * incorrectly propagated to disk. > + */ > + if (xlog_is_shutdown(ip->i_mount->m_log)) { > xfs_iunpin_wait(ip); > xfs_iflush_abort(ip); > goto reclaim; > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 04bf467b1090..aab55a06ece7 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -35,6 +35,7 @@ > #include "xfs_bmap_btree.h" > #include "xfs_reflink.h" > #include "xfs_ag.h" > +#include "xfs_log_priv.h" > > struct kmem_cache *xfs_inode_cache; > > @@ -3659,7 +3660,7 @@ xfs_iflush_cluster( > * AIL, leaving a dirty/unpinned inode attached to the buffer > * that otherwise looks like it should be flushed. > */ > - if (xfs_is_shutdown(mp)) { > + if (xlog_is_shutdown(mp->m_log)) { > xfs_iunpin_wait(ip); > xfs_iflush_abort(ip); > xfs_iunlock(ip, XFS_ILOCK_SHARED); > @@ -3685,9 +3686,19 @@ xfs_iflush_cluster( > } > > if (error) { > + /* > + * Shutdown first so we kill the log before we release this > + * buffer. 
> +                 * buffer. If it is an INODE_ALLOC buffer and pins the tail
> +                 * of the log, failing it before the _log_ is shut down can
> +                 * result in the log tail being moved forward in the journal
> +                 * on disk because log writes can still be taking place. Hence
> +                 * unpinning the tail will allow the ICREATE intent to be
> +                 * removed from the log and recovery will fail with
> +                 * uninitialised inode cluster buffers.
> +                 */
> +                xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>                  bp->b_flags |= XBF_ASYNC;
>                  xfs_buf_ioend_fail(bp);
> -                xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>                  return error;
>          }
>
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 90d8e591baf8..11158fa81a09 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -17,6 +17,7 @@
>  #include "xfs_trans_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_error.h"
>
>  #include <linux/iversion.h>
>
> @@ -720,6 +721,17 @@ xfs_iflush_ail_updates(
>                  if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
>                          continue;
>
> +                /*
> +                 * dgc: Not sure how this happens, but it happens very
> +                 * occasionally via generic/388. xfs_iflush_abort() also
> +                 * silently handles this same "under writeback but not in AIL at
> +                 * shutdown" condition via xfs_trans_ail_delete().
> +                 */
> +                if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
> +                        ASSERT(xlog_is_shutdown(lip->li_log));
> +                        continue;
> +                }
> +
>                  lsn = xfs_ail_delete_one(ailp, lip);
>                  if (!tail_lsn && lsn)
>                          tail_lsn = lsn;
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 32ac8d9c8940..f165d1a3de1d 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -25,6 +25,7 @@
>  #include "xfs_error.h"
>  #include "xfs_ag.h"
>  #include "xfs_ialloc.h"
> +#include "xfs_log_priv.h"
>
>  /*
>   * The global quota manager. There is only one of these for the entire
> @@ -121,8 +122,7 @@ xfs_qm_dqpurge(
>          struct xfs_dquot        *dqp,
>          void                    *data)
>  {
> -        struct xfs_mount        *mp = dqp->q_mount;
> -        struct xfs_quotainfo    *qi = mp->m_quotainfo;
> +        struct xfs_quotainfo    *qi = dqp->q_mount->m_quotainfo;
>          int                     error = -EAGAIN;
>
>          xfs_dqlock(dqp);
> @@ -157,7 +157,7 @@ xfs_qm_dqpurge(
>          }
>
>          ASSERT(atomic_read(&dqp->q_pincount) == 0);
> -        ASSERT(xfs_is_shutdown(mp) ||
> +        ASSERT(xlog_is_shutdown(dqp->q_logitem.qli_item.li_log) ||
>                 !test_bit(XFS_LI_IN_AIL, &dqp->q_logitem.qli_item.li_flags));
>
>          xfs_dqfunlock(dqp);
> @@ -172,7 +172,7 @@ xfs_qm_dqpurge(
>           */
>          ASSERT(!list_empty(&dqp->q_lru));
>          list_lru_del(&qi->qi_lru, &dqp->q_lru);
> -        XFS_STATS_DEC(mp, xs_qm_dquot_unused);
> +        XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);
>
>          xfs_qm_dqdestroy(dqp);
>          return 0;

--
chandan
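
PS: one follow-up thought on the __xfs_buf_submit() hunk. Because the
superblock is read before the log exists, the check there has to be
open-coded as "bp->b_mount->m_log && xlog_is_shutdown(...)". If more
callers grow the same pattern, a helper might be worthwhile; this is
purely a hypothetical sketch, the name and placement are mine:

    /*
     * Hypothetical helper, not part of this patch: a log shutdown check
     * that is safe to call before the log has been allocated. A log
     * that does not exist yet cannot have been shut down.
     */
    static inline bool
    xfs_log_maybe_shutdown(struct xfs_mount *mp)
    {
            return mp->m_log && xlog_is_shutdown(mp->m_log);
    }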