* [PATCH 0/2] xfs: more shutdown/recovery fixes
@ 2022-03-21  1:23 Dave Chinner
  2022-03-21  1:23 ` [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dave Chinner @ 2022-03-21  1:23 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

These two patches are followups to my recent series of
shutdown/recovery fixes. The cluster buffer lock patch addresses a
race condition that started to show up regularly once the fixes in
the previous series were done - it is a regression from the async
inode reclaim work that was done almost 2 years ago now.

The second patch is something I'm really surprised has taken this
long to uncover. There is a check in intent recovery/cancellation
that checks that there are no intent items in the AIL after the
first non-intent item is found. This behaviour was correct back when
we only had standalone intent items (i.e. EFI/EFD), but when we
started to chain complex operations by intents, the recovery of an
incomplete intent can log and commit new intents and they can end up
in the AIL before log recovery is complete and finished processing
the deferred items. Hence the ASSERT() check that no intents
exist in the AIL after the first non-intent item is simply invalid.

With these two patches, I'm back to being able to run hundreds of
cycles of g/388 or -g recoveryloop without seeing any failures.

-Dave.



* [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock
  2022-03-21  1:23 [PATCH 0/2] xfs: more shutdown/recovery fixes Dave Chinner
@ 2022-03-21  1:23 ` Dave Chinner
  2022-03-21 21:56   ` Darrick J. Wong
  2022-03-21  1:23 ` [PATCH 2/2] xfs: shutdown in intent recovery has non-intent items in the AIL Dave Chinner
  2022-03-22  2:35 ` [PATCH 0/2] xfs: more shutdown/recovery fixes Darrick J. Wong
  2 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-03-21  1:23 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Most buffer io list operations are run with the bp->b_lock held, but
xfs_iflush_abort() can be called without the buffer lock being held
resulting in inodes being removed from the buffer list while other
list operations are occurring. This causes problems with corrupted
bp->b_io_list inode lists during filesystem shutdown, leading to
traversals that never end, double removals from the AIL, etc.

Fix this by passing the buffer to xfs_iflush_abort() if we have
it locked. If the inode is attached to the buffer, we're going to
have to remove it from the buffer list and we'd have to get the
buffer off the inode log item to do that anyway.

If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
then we can determine if the inode has a log item and if it is
attached to a buffer before we do anything else. If it does have an
attached buffer, we can lock it safely (because the inode has a
reference to it) and then perform the inode abort.
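
In outline, the new xfs_iflush_shutdown_abort() does the following
(condensed from the patch below; the IFLUSHING clearing and the
raced-removal case are omitted):

	spin_lock(&iip->ili_lock);
	bp = iip->ili_item.li_buf;
	if (!bp) {
		/* not attached to a cluster buffer, plain abort is safe */
		spin_unlock(&iip->ili_lock);
		xfs_iflush_abort(ip);
		return;
	}
	/* take a reference so bp can't be freed while we sleep on its lock */
	xfs_buf_hold(bp);
	spin_unlock(&iip->ili_lock);
	xfs_buf_lock(bp);

	spin_lock(&iip->ili_lock);
	/* recheck li_buf now that we hold the buffer lock */
	...
	spin_unlock(&iip->ili_lock);
	xfs_iflush_abort(ip);
	xfs_buf_relse(bp);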

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c     |   2 +-
 fs/xfs/xfs_inode.c      |   2 +-
 fs/xfs/xfs_inode_item.c | 156 ++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_inode_item.h |   1 +
 4 files changed, 122 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4148cdf7ce4a..6c7267451b82 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -883,7 +883,7 @@ xfs_reclaim_inode(
 	 */
 	if (xlog_is_shutdown(ip->i_mount->m_log)) {
 		xfs_iunpin_wait(ip);
-		xfs_iflush_abort(ip);
+		xfs_iflush_shutdown_abort(ip);
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip))
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index aab55a06ece7..07be0992321c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3612,7 +3612,7 @@ xfs_iflush_cluster(
 
 	/*
 	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
-	 * can remove itself from the list.
+	 * will remove itself from the list.
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = (struct xfs_inode_log_item *)lip;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 11158fa81a09..79504b721ffe 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -721,17 +721,6 @@ xfs_iflush_ail_updates(
 		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
 			continue;
 
-		/*
-		 * dgc: Not sure how this happens, but it happens very
-		 * occassionaly via generic/388.  xfs_iflush_abort() also
-		 * silently handles this same "under writeback but not in AIL at
-		 * shutdown" condition via xfs_trans_ail_delete().
-		 */
-		if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
-			ASSERT(xlog_is_shutdown(lip->li_log));
-			continue;
-		}
-
 		lsn = xfs_ail_delete_one(ailp, lip);
 		if (!tail_lsn && lsn)
 			tail_lsn = lsn;
@@ -834,46 +823,139 @@ xfs_buf_inode_io_fail(
 }
 
 /*
- * This is the inode flushing abort routine.  It is called when
- * the filesystem is shutting down to clean up the inode state.  It is
- * responsible for removing the inode item from the AIL if it has not been
- * re-logged and clearing the inode's flush state.
+ * Clear the inode logging fields so no more flushes are attempted.  If we are
+ * on a buffer list, it is now safe to remove it because the buffer is
+ * guaranteed to be locked. The caller will drop the reference to the buffer
+ * the log item held.
+ */
+static void
+xfs_iflush_abort_clean(
+	struct xfs_inode_log_item *iip)
+{
+	iip->ili_last_fields = 0;
+	iip->ili_fields = 0;
+	iip->ili_fsync_fields = 0;
+	iip->ili_flush_lsn = 0;
+	iip->ili_item.li_buf = NULL;
+	list_del_init(&iip->ili_item.li_bio_list);
+}
+
+/*
+ * Abort flushing the inode from a context holding the cluster buffer locked.
+ *
+ * This is the normal runtime method of aborting writeback of an inode that is
+ * attached to a cluster buffer. It occurs when the inode and the backing
+ * cluster buffer have been freed (i.e. inode is XFS_ISTALE), or when cluster
+ * flushing or buffer IO completion encounters a log shutdown situation.
+ *
+ * If we need to abort inode writeback and we don't already hold the buffer
+ * locked, call xfs_iflush_shutdown_abort() instead as this should only ever be
+ * necessary in a shutdown situation.
  */
 void
 xfs_iflush_abort(
 	struct xfs_inode	*ip)
 {
 	struct xfs_inode_log_item *iip = ip->i_itemp;
-	struct xfs_buf		*bp = NULL;
+	struct xfs_buf		*bp;
 
-	if (iip) {
-		/*
-		 * Clear the failed bit before removing the item from the AIL so
-		 * xfs_trans_ail_delete() doesn't try to clear and release the
-		 * buffer attached to the log item before we are done with it.
-		 */
-		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
-		xfs_trans_ail_delete(&iip->ili_item, 0);
+	if (!iip) {
+		/* clean inode, nothing to do */
+		xfs_iflags_clear(ip, XFS_IFLUSHING);
+		return;
+	}
+
+	/*
+	 * Capture the associated buffer and lock it if the caller didn't
+	 * pass us the locked buffer to begin with.
+	 */
+	spin_lock(&iip->ili_lock);
+	bp = iip->ili_item.li_buf;
+	xfs_iflush_abort_clean(iip);
+	spin_unlock(&iip->ili_lock);
+
+	/*
+	 * Clear the failed bit before removing the item from the AIL so
+	 * xfs_trans_ail_delete() doesn't try to clear and release the buffer
+	 * attached to the log item before we are done with it.
+	 */
+	clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
+	xfs_trans_ail_delete(&iip->ili_item, 0);
+
+	xfs_iflags_clear(ip, XFS_IFLUSHING);
+
+	/* we can now release the buffer reference the inode log item held. */
+	if (bp)
+		xfs_buf_rele(bp);
+}
 
+/*
+ * Abort an inode flush in the case of a shutdown filesystem. This can be called
+ * from anywhere with just an inode reference and does not require holding the
+ * inode cluster buffer locked. If the inode is attached to a cluster buffer,
+ * it will grab and lock it safely, then abort the inode flush.
+ */
+void
+xfs_iflush_shutdown_abort(
+	struct xfs_inode	*ip)
+{
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct xfs_buf		*bp;
+
+	if (!iip) {
+		/* clean inode, nothing to do */
+		xfs_iflags_clear(ip, XFS_IFLUSHING);
+		return;
+	}
+
+	spin_lock(&iip->ili_lock);
+	bp = iip->ili_item.li_buf;
+	if (!bp) {
+		spin_unlock(&iip->ili_lock);
+		xfs_iflush_abort(ip);
+		return;
+	}
+
+	/*
+	 * We have to take a reference to the buffer so that it doesn't get
+	 * freed when we drop the ili_lock and then wait to lock the buffer.
+	 * We'll clean up the extra reference after we pick up the ili_lock
+	 * again.
+	 */
+	xfs_buf_hold(bp);
+	spin_unlock(&iip->ili_lock);
+	xfs_buf_lock(bp);
+
+	spin_lock(&iip->ili_lock);
+	if (!iip->ili_item.li_buf) {
 		/*
-		 * Clear the inode logging fields so no more flushes are
-		 * attempted.
+		 * Raced with another removal, hold the only reference
+		 * to bp now. Inode should not be in the AIL now, so just clean
+		 * up and return;
 		 */
-		spin_lock(&iip->ili_lock);
-		iip->ili_last_fields = 0;
-		iip->ili_fields = 0;
-		iip->ili_fsync_fields = 0;
-		iip->ili_flush_lsn = 0;
-		bp = iip->ili_item.li_buf;
-		iip->ili_item.li_buf = NULL;
-		list_del_init(&iip->ili_item.li_bio_list);
+		ASSERT(list_empty(&iip->ili_item.li_bio_list));
+		ASSERT(!test_bit(XFS_LI_IN_AIL, &iip->ili_item.li_flags));
+		xfs_iflush_abort_clean(iip);
 		spin_unlock(&iip->ili_lock);
+		xfs_iflags_clear(ip, XFS_IFLUSHING);
+		xfs_buf_relse(bp);
+		return;
 	}
-	xfs_iflags_clear(ip, XFS_IFLUSHING);
-	if (bp)
-		xfs_buf_rele(bp);
+
+	/*
+	 * Got two references to bp. The first will get droped by 
+	 * xfs_iflush_abort() when the item is removed from the buffer list, but
+	 * we can't drop our reference until _abort() returns because we have to
+	 * unlock the buffer as well. Hence we abort and then unlock and release
+	 * our reference to the buffer.
+	 */
+	ASSERT(iip->ili_item.li_buf == bp);
+	spin_unlock(&iip->ili_lock);
+	xfs_iflush_abort(ip);
+	xfs_buf_relse(bp);
 }
 
+
 /*
  * convert an xfs_inode_log_format struct from the old 32 bit version
  * (which can have different field alignments) to the native 64 bit version
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 1a302000d604..bbd836a44ff0 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -44,6 +44,7 @@ static inline int xfs_inode_clean(struct xfs_inode *ip)
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
 extern void xfs_inode_item_destroy(struct xfs_inode *);
 extern void xfs_iflush_abort(struct xfs_inode *);
+extern void xfs_iflush_shutdown_abort(struct xfs_inode *);
 extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
 					 struct xfs_inode_log_format *);
 
-- 
2.35.1



* [PATCH 2/2] xfs: shutdown in intent recovery has non-intent items in the AIL
  2022-03-21  1:23 [PATCH 0/2] xfs: more shutdown/recovery fixes Dave Chinner
  2022-03-21  1:23 ` [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock Dave Chinner
@ 2022-03-21  1:23 ` Dave Chinner
  2022-03-21 21:52   ` Darrick J. Wong
  2022-03-22  2:35 ` [PATCH 0/2] xfs: more shutdown/recovery fixes Darrick J. Wong
  2 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-03-21  1:23 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

generic/388 triggered a failure in RUI recovery due to a corrupted
btree record and the system then locked up hard due to a subsequent
assert failure while holding a spinlock cancelling intents:

 XFS (pmem1): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_trans.c:964).  Shutting down filesystem.
 XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
 XFS: Assertion failed: !xlog_item_is_intent(lip), file: fs/xfs/xfs_log_recover.c, line: 2632
 Call Trace:
  <TASK>
  xlog_recover_cancel_intents.isra.0+0xd1/0x120
  xlog_recover_finish+0xb9/0x110
  xfs_log_mount_finish+0x15a/0x1e0
  xfs_mountfs+0x540/0x910
  xfs_fs_fill_super+0x476/0x830
  get_tree_bdev+0x171/0x270
  ? xfs_init_fs_context+0x1e0/0x1e0
  xfs_fs_get_tree+0x15/0x20
  vfs_get_tree+0x24/0xc0
  path_mount+0x304/0xba0
  ? putname+0x55/0x60
  __x64_sys_mount+0x108/0x140
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Essentially, there's dirty metadata in the AIL from intent recovery
transactions, so when we go to cancel the remaining intents we assume
that all objects after the first non-intent log item in the AIL are
not intents.

This is not true. Intent recovery can log new intents to continue
the operations the original intent could not complete in a single
transaction. The new intents are committed before they are deferred,
which means if the CIL commits in the background they will get
inserted into the AIL at the head.

Hence if we shut down the filesystem while processing intent
recovery, the AIL may have new intents active at the current head.
Hence this check:

                /*
                 * We're done when we see something other than an intent.
                 * There should be no intents left in the AIL now.
                 */
                if (!xlog_item_is_intent(lip)) {
#ifdef DEBUG
                        for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
                                ASSERT(!xlog_item_is_intent(lip));
#endif
                        break;
                }

in both xlog_recover_process_intents() and
xlog_recover_cancel_intents() is simply not valid. It was valid back
when we only had EFI/EFD intents and didn't chain intents, but it
hasn't been valid ever since intent recovery could create and commit
new intents.

Given that crashing the mount task like this pretty much prevents
diagnosing what went wrong that led to the initial failure that
triggered intent cancellation, just remove the checks altogether.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 50 ++++++++++++++--------------------------
 1 file changed, 17 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 96c997ed2ec8..7758a6706b8c 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2519,21 +2519,22 @@ xlog_abort_defer_ops(
 		xfs_defer_ops_capture_free(mp, dfc);
 	}
 }
+
 /*
  * When this is called, all of the log intent items which did not have
- * corresponding log done items should be in the AIL.  What we do now
- * is update the data structures associated with each one.
+ * corresponding log done items should be in the AIL.  What we do now is update
+ * the data structures associated with each one.
  *
- * Since we process the log intent items in normal transactions, they
- * will be removed at some point after the commit.  This prevents us
- * from just walking down the list processing each one.  We'll use a
- * flag in the intent item to skip those that we've already processed
- * and use the AIL iteration mechanism's generation count to try to
- * speed this up at least a bit.
+ * Since we process the log intent items in normal transactions, they will be
+ * removed at some point after the commit.  This prevents us from just walking
+ * down the list processing each one.  We'll use a flag in the intent item to
+ * skip those that we've already processed and use the AIL iteration mechanism's
+ * generation count to try to speed this up at least a bit.
  *
- * When we start, we know that the intents are the only things in the
- * AIL.  As we process them, however, other items are added to the
- * AIL.
+ * When we start, we know that the intents are the only things in the AIL. As we
+ * process them, however, other items are added to the AIL. Hence we know we
+ * have started recovery on all the pending intents when we find an non-intent
+ * item in the AIL.
  */
 STATIC int
 xlog_recover_process_intents(
@@ -2556,17 +2557,8 @@ xlog_recover_process_intents(
 	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
 	     lip != NULL;
 	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
-		/*
-		 * We're done when we see something other than an intent.
-		 * There should be no intents left in the AIL now.
-		 */
-		if (!xlog_item_is_intent(lip)) {
-#ifdef DEBUG
-			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
-				ASSERT(!xlog_item_is_intent(lip));
-#endif
+		if (!xlog_item_is_intent(lip))
 			break;
-		}
 
 		/*
 		 * We should never see a redo item with a LSN higher than
@@ -2607,8 +2599,9 @@ xlog_recover_process_intents(
 }
 
 /*
- * A cancel occurs when the mount has failed and we're bailing out.
- * Release all pending log intent items so they don't pin the AIL.
+ * A cancel occurs when the mount has failed and we're bailing out.  Release all
+ * pending log intent items that we haven't started recovery on so they don't
+ * pin the AIL.
  */
 STATIC void
 xlog_recover_cancel_intents(
@@ -2622,17 +2615,8 @@ xlog_recover_cancel_intents(
 	spin_lock(&ailp->ail_lock);
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
 	while (lip != NULL) {
-		/*
-		 * We're done when we see something other than an intent.
-		 * There should be no intents left in the AIL now.
-		 */
-		if (!xlog_item_is_intent(lip)) {
-#ifdef DEBUG
-			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
-				ASSERT(!xlog_item_is_intent(lip));
-#endif
+		if (!xlog_item_is_intent(lip))
 			break;
-		}
 
 		spin_unlock(&ailp->ail_lock);
 		lip->li_ops->iop_release(lip);
-- 
2.35.1



* Re: [PATCH 2/2] xfs: shutdown in intent recovery has non-intent items in the AIL
  2022-03-21  1:23 ` [PATCH 2/2] xfs: shutdown in intent recovery has non-intent items in the AIL Dave Chinner
@ 2022-03-21 21:52   ` Darrick J. Wong
  2022-03-21 22:41     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-03-21 21:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 21, 2022 at 12:23:29PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic/388 triggered a failure in RUI recovery due to a corrupted
> btree record and the system then locked up hard due to a subsequent
> assert failure while holding a spinlock cancelling intents:
> 
>  XFS (pmem1): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_trans.c:964).  Shutting down filesystem.
>  XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
>  XFS: Assertion failed: !xlog_item_is_intent(lip), file: fs/xfs/xfs_log_recover.c, line: 2632
>  Call Trace:
>   <TASK>
>   xlog_recover_cancel_intents.isra.0+0xd1/0x120
>   xlog_recover_finish+0xb9/0x110
>   xfs_log_mount_finish+0x15a/0x1e0
>   xfs_mountfs+0x540/0x910
>   xfs_fs_fill_super+0x476/0x830
>   get_tree_bdev+0x171/0x270
>   ? xfs_init_fs_context+0x1e0/0x1e0
>   xfs_fs_get_tree+0x15/0x20
>   vfs_get_tree+0x24/0xc0
>   path_mount+0x304/0xba0
>   ? putname+0x55/0x60
>   __x64_sys_mount+0x108/0x140
>   do_syscall_64+0x35/0x80
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Essentially, there's dirty metadata in the AIL from intent recovery
> transactions, so when we go to cancel the remaining intents we assume
> that all objects after the first non-intent log item in the AIL are
> not intents.
> 
> This is not true. Intent recovery can log new intents to continue
> the operations the original intent could not complete in a single
> transaction. The new intents are committed before they are deferred,
> which means if the CIL commits in the background they will get
> inserted into the AIL at the head.

Like you, I wonder how I never hit this.  Maybe I've never hit a
corrupted rmap btree record during recovery?

So I guess what we're tripping over is a sequence of items in the AIL
that looks something like this?

0. <recovered non intent items>
1. <recovered intent item>
2. <new non-intent item>
3. <new intent items>

So we speed along the AIL list, dealing with the <0> items until we get
to <1>.  We recover <1>, which generates <2> and <3>.  Next, the
debugging code thinks we've hit the end of the list of recovered items,
and therefore that it can keep walking the AIL and will only find
items like <2>.  Unfortunately, it finds the new intent <3> and trips?

> Hence if we shut down the filesystem while processing intent
> recovery, the AIL may have new intents active at the current head.
> Hence this check:
> 
>                 /*
>                  * We're done when we see something other than an intent.
>                  * There should be no intents left in the AIL now.
>                  */
>                 if (!xlog_item_is_intent(lip)) {
> #ifdef DEBUG
>                         for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
>                                 ASSERT(!xlog_item_is_intent(lip));
> #endif
>                         break;
>                 }
> 
> in both xlog_recover_process_intents() and
> xlog_recover_cancel_intents() is simply not valid. It was valid back
> when we only had EFI/EFD intents and didn't chain intents, but it
> hasn't been valid ever since intent recovery could create and commit
> new intents.
> 
> Given that crashing the mount task like this pretty much prevents
> diagnosing what went wrong that led to the initial failure that
> triggered intent cancellation, just remove the checks altogether.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Actually, I bet I have hit this while fuzz testing online repair, but
then decided that the better fix would be to make online fsck more
careful about not crashing the fs in the first place.

Assuming the answer to my question above is 'yes', then

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_recover.c | 50 ++++++++++++++--------------------------
>  1 file changed, 17 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 96c997ed2ec8..7758a6706b8c 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2519,21 +2519,22 @@ xlog_abort_defer_ops(
>  		xfs_defer_ops_capture_free(mp, dfc);
>  	}
>  }
> +
>  /*
>   * When this is called, all of the log intent items which did not have
> - * corresponding log done items should be in the AIL.  What we do now
> - * is update the data structures associated with each one.
> + * corresponding log done items should be in the AIL.  What we do now is update
> + * the data structures associated with each one.
>   *
> - * Since we process the log intent items in normal transactions, they
> - * will be removed at some point after the commit.  This prevents us
> - * from just walking down the list processing each one.  We'll use a
> - * flag in the intent item to skip those that we've already processed
> - * and use the AIL iteration mechanism's generation count to try to
> - * speed this up at least a bit.
> + * Since we process the log intent items in normal transactions, they will be
> + * removed at some point after the commit.  This prevents us from just walking
> + * down the list processing each one.  We'll use a flag in the intent item to
> + * skip those that we've already processed and use the AIL iteration mechanism's
> + * generation count to try to speed this up at least a bit.
>   *
> - * When we start, we know that the intents are the only things in the
> - * AIL.  As we process them, however, other items are added to the
> - * AIL.
> + * When we start, we know that the intents are the only things in the AIL. As we
> + * process them, however, other items are added to the AIL. Hence we know we
> + * have started recovery on all the pending intents when we find an non-intent
> + * item in the AIL.
>   */
>  STATIC int
>  xlog_recover_process_intents(
> @@ -2556,17 +2557,8 @@ xlog_recover_process_intents(
>  	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>  	     lip != NULL;
>  	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> -		/*
> -		 * We're done when we see something other than an intent.
> -		 * There should be no intents left in the AIL now.
> -		 */
> -		if (!xlog_item_is_intent(lip)) {
> -#ifdef DEBUG
> -			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> -				ASSERT(!xlog_item_is_intent(lip));
> -#endif
> +		if (!xlog_item_is_intent(lip))
>  			break;
> -		}
>  
>  		/*
>  		 * We should never see a redo item with a LSN higher than
> @@ -2607,8 +2599,9 @@ xlog_recover_process_intents(
>  }
>  
>  /*
> - * A cancel occurs when the mount has failed and we're bailing out.
> - * Release all pending log intent items so they don't pin the AIL.
> + * A cancel occurs when the mount has failed and we're bailing out.  Release all
> + * pending log intent items that we haven't started recovery on so they don't
> + * pin the AIL.
>   */
>  STATIC void
>  xlog_recover_cancel_intents(
> @@ -2622,17 +2615,8 @@ xlog_recover_cancel_intents(
>  	spin_lock(&ailp->ail_lock);
>  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>  	while (lip != NULL) {
> -		/*
> -		 * We're done when we see something other than an intent.
> -		 * There should be no intents left in the AIL now.
> -		 */
> -		if (!xlog_item_is_intent(lip)) {
> -#ifdef DEBUG
> -			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> -				ASSERT(!xlog_item_is_intent(lip));
> -#endif
> +		if (!xlog_item_is_intent(lip))
>  			break;
> -		}
>  
>  		spin_unlock(&ailp->ail_lock);
>  		lip->li_ops->iop_release(lip);
> -- 
> 2.35.1
> 


* Re: [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock
  2022-03-21  1:23 ` [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock Dave Chinner
@ 2022-03-21 21:56   ` Darrick J. Wong
  2022-03-21 22:43     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-03-21 21:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 21, 2022 at 12:23:28PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Most buffer io list operations are run with the bp->b_lock held, but
> xfs_iflush_abort() can be called without the buffer lock being held
> resulting in inodes being removed from the buffer list while other
> list operations are occurring. This causes problems with corrupted
> bp->b_io_list inode lists during filesystem shutdown, leading to
> traversals that never end, double removals from the AIL, etc.
> 
> Fix this by passing the buffer to xfs_iflush_abort() if we have
> it locked. If the inode is attached to the buffer, we're going to
> have to remove it from the buffer list and we'd have to get the
> buffer off the inode log item to do that anyway.
> 
> If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
> then we can determine if the inode has a log item and if it is
> attached to a buffer before we do anything else. If it does have an
> attached buffer, we can lock it safely (because the inode has a
> reference to it) and then perform the inode abort.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c     |   2 +-
>  fs/xfs/xfs_inode.c      |   2 +-
>  fs/xfs/xfs_inode_item.c | 156 ++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_inode_item.h |   1 +
>  4 files changed, 122 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 4148cdf7ce4a..6c7267451b82 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -883,7 +883,7 @@ xfs_reclaim_inode(
>  	 */
>  	if (xlog_is_shutdown(ip->i_mount->m_log)) {
>  		xfs_iunpin_wait(ip);
> -		xfs_iflush_abort(ip);
> +		xfs_iflush_shutdown_abort(ip);
>  		goto reclaim;
>  	}
>  	if (xfs_ipincount(ip))
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index aab55a06ece7..07be0992321c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3612,7 +3612,7 @@ xfs_iflush_cluster(
>  
>  	/*
>  	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
> -	 * can remove itself from the list.
> +	 * will remove itself from the list.
>  	 */
>  	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
>  		iip = (struct xfs_inode_log_item *)lip;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 11158fa81a09..79504b721ffe 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -721,17 +721,6 @@ xfs_iflush_ail_updates(
>  		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
>  			continue;
>  
> -		/*
> -		 * dgc: Not sure how this happens, but it happens very
> -		 * occassionaly via generic/388.  xfs_iflush_abort() also
> -		 * silently handles this same "under writeback but not in AIL at
> -		 * shutdown" condition via xfs_trans_ail_delete().
> -		 */
> -		if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) {
> -			ASSERT(xlog_is_shutdown(lip->li_log));
> -			continue;
> -		}
> -
>  		lsn = xfs_ail_delete_one(ailp, lip);
>  		if (!tail_lsn && lsn)
>  			tail_lsn = lsn;
> @@ -834,46 +823,139 @@ xfs_buf_inode_io_fail(
>  }
>  
>  /*
> - * This is the inode flushing abort routine.  It is called when
> - * the filesystem is shutting down to clean up the inode state.  It is
> - * responsible for removing the inode item from the AIL if it has not been
> - * re-logged and clearing the inode's flush state.
> + * Clear the inode logging fields so no more flushes are attempted.  If we are
> + * on a buffer list, it is now safe to remove it because the buffer is
> + * guaranteed to be locked. The caller will drop the reference to the buffer
> + * the log item held.
> + */
> +static void
> +xfs_iflush_abort_clean(
> +	struct xfs_inode_log_item *iip)
> +{
> +	iip->ili_last_fields = 0;
> +	iip->ili_fields = 0;
> +	iip->ili_fsync_fields = 0;
> +	iip->ili_flush_lsn = 0;
> +	iip->ili_item.li_buf = NULL;
> +	list_del_init(&iip->ili_item.li_bio_list);
> +}
> +
> +/*
> + * Abort flushing the inode from a context holding the cluster buffer locked.
> + *
> + * This is the normal runtime method of aborting writeback of an inode that is
> + * attached to a cluster buffer. It occurs when the inode and the backing
> + * cluster buffer have been freed (i.e. inode is XFS_ISTALE), or when cluster
> + * flushing or buffer IO completion encounters a log shutdown situation.
> + *
> + * If we need to abort inode writeback and we don't already hold the buffer
> + * locked, call xfs_iflush_shutdown_abort() instead as this should only ever be
> + * necessary in a shutdown situation.
>   */
>  void
>  xfs_iflush_abort(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_inode_log_item *iip = ip->i_itemp;
> -	struct xfs_buf		*bp = NULL;
> +	struct xfs_buf		*bp;
>  
> -	if (iip) {
> -		/*
> -		 * Clear the failed bit before removing the item from the AIL so
> -		 * xfs_trans_ail_delete() doesn't try to clear and release the
> -		 * buffer attached to the log item before we are done with it.
> -		 */
> -		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
> -		xfs_trans_ail_delete(&iip->ili_item, 0);
> +	if (!iip) {
> +		/* clean inode, nothing to do */
> +		xfs_iflags_clear(ip, XFS_IFLUSHING);
> +		return;
> +	}
> +
> +	/*
> +	 * Capture the associated buffer and lock it if the caller didn't
> +	 * pass us the locked buffer to begin with.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	bp = iip->ili_item.li_buf;
> +	xfs_iflush_abort_clean(iip);
> +	spin_unlock(&iip->ili_lock);

Is the comment here incorrect?  The _shutdown_abort variant will go
ahead and lock the buffer, but this function does not do that...?

> +
> +	/*
> +	 * Clear the failed bit before removing the item from the AIL so
> +	 * xfs_trans_ail_delete() doesn't try to clear and release the buffer
> +	 * attached to the log item before we are done with it.
> +	 */
> +	clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
> +	xfs_trans_ail_delete(&iip->ili_item, 0);
> +
> +	xfs_iflags_clear(ip, XFS_IFLUSHING);
> +
> +	/* we can now release the buffer reference the inode log item held. */
> +	if (bp)
> +		xfs_buf_rele(bp);
> +}
>  
> +/*
> + * Abort an inode flush in the case of a shutdown filesystem. This can be called
> + * from anywhere with just an inode reference and does not require holding the
> + * inode cluster buffer locked. If the inode is attached to a cluster buffer,
> + * it will grab and lock it safely, then abort the inode flush.
> + */
> +void
> +xfs_iflush_shutdown_abort(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +	struct xfs_buf		*bp;
> +
> +	if (!iip) {
> +		/* clean inode, nothing to do */
> +		xfs_iflags_clear(ip, XFS_IFLUSHING);
> +		return;
> +	}
> +
> +	spin_lock(&iip->ili_lock);
> +	bp = iip->ili_item.li_buf;
> +	if (!bp) {
> +		spin_unlock(&iip->ili_lock);
> +		xfs_iflush_abort(ip);
> +		return;
> +	}
> +
> +	/*
> +	 * We have to take a reference to the buffer so that it doesn't get
> +	 * freed when we drop the ili_lock and then wait to lock the buffer.
> +	 * We'll clean up the extra reference after we pick up the ili_lock
> +	 * again.
> +	 */
> +	xfs_buf_hold(bp);
> +	spin_unlock(&iip->ili_lock);
> +	xfs_buf_lock(bp);
> +
> +	spin_lock(&iip->ili_lock);
> +	if (!iip->ili_item.li_buf) {
>  		/*
> -		 * Clear the inode logging fields so no more flushes are
> -		 * attempted.
> +		 * Raced with another removal, hold the only reference
> +		 * to bp now. Inode should not be in the AIL now, so just clean
> +		 * up and return;
>  		 */
> -		spin_lock(&iip->ili_lock);
> -		iip->ili_last_fields = 0;
> -		iip->ili_fields = 0;
> -		iip->ili_fsync_fields = 0;
> -		iip->ili_flush_lsn = 0;
> -		bp = iip->ili_item.li_buf;
> -		iip->ili_item.li_buf = NULL;
> -		list_del_init(&iip->ili_item.li_bio_list);
> +		ASSERT(list_empty(&iip->ili_item.li_bio_list));
> +		ASSERT(!test_bit(XFS_LI_IN_AIL, &iip->ili_item.li_flags));
> +		xfs_iflush_abort_clean(iip);
>  		spin_unlock(&iip->ili_lock);
> +		xfs_iflags_clear(ip, XFS_IFLUSHING);
> +		xfs_buf_relse(bp);
> +		return;
>  	}
> -	xfs_iflags_clear(ip, XFS_IFLUSHING);
> -	if (bp)
> -		xfs_buf_rele(bp);
> +
> +	/*
> +	 * Got two references to bp. The first will get droped by 

"The first will get dropped by..." (spelling and stgit nagging me about
trailing whitespace)

> +	 * xfs_iflush_abort() when the item is removed from the buffer list, but
> +	 * we can't drop our reference until _abort() returns because we have to
> +	 * unlock the buffer as well. Hence we abort and then unlock and release
> +	 * our reference to the buffer.

...and presumably xfs_iflush_abort will drop the other bp reference at
some point after where we unlocked the inode item, locked the (held)
buffer, and relocked the inode item?

--D

> +	 */
> +	ASSERT(iip->ili_item.li_buf == bp);
> +	spin_unlock(&iip->ili_lock);
> +	xfs_iflush_abort(ip);
> +	xfs_buf_relse(bp);
>  }
>  
> +
>  /*
>   * convert an xfs_inode_log_format struct from the old 32 bit version
>   * (which can have different field alignments) to the native 64 bit version
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 1a302000d604..bbd836a44ff0 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -44,6 +44,7 @@ static inline int xfs_inode_clean(struct xfs_inode *ip)
>  extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
>  extern void xfs_inode_item_destroy(struct xfs_inode *);
>  extern void xfs_iflush_abort(struct xfs_inode *);
> +extern void xfs_iflush_shutdown_abort(struct xfs_inode *);
>  extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
>  					 struct xfs_inode_log_format *);
>  
> -- 
> 2.35.1
> 


* Re: [PATCH 2/2] xfs: shutdown in intent recovery has non-intent items in the AIL
  2022-03-21 21:52   ` Darrick J. Wong
@ 2022-03-21 22:41     ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2022-03-21 22:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 21, 2022 at 02:52:36PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 21, 2022 at 12:23:29PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > generic/388 triggered a failure in RUI recovery due to a corrupted
> > btree record and the system then locked up hard due to a subsequent
> > assert failure while holding a spinlock cancelling intents:
> > 
> >  XFS (pmem1): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_trans.c:964).  Shutting down filesystem.
> >  XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
> >  XFS: Assertion failed: !xlog_item_is_intent(lip), file: fs/xfs/xfs_log_recover.c, line: 2632
> >  Call Trace:
> >   <TASK>
> >   xlog_recover_cancel_intents.isra.0+0xd1/0x120
> >   xlog_recover_finish+0xb9/0x110
> >   xfs_log_mount_finish+0x15a/0x1e0
> >   xfs_mountfs+0x540/0x910
> >   xfs_fs_fill_super+0x476/0x830
> >   get_tree_bdev+0x171/0x270
> >   ? xfs_init_fs_context+0x1e0/0x1e0
> >   xfs_fs_get_tree+0x15/0x20
> >   vfs_get_tree+0x24/0xc0
> >   path_mount+0x304/0xba0
> >   ? putname+0x55/0x60
> >   __x64_sys_mount+0x108/0x140
> >   do_syscall_64+0x35/0x80
> >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > Essentially, there's dirty metadata in the AIL from intent recovery
> > transactions, so when we go to cancel the remaining intents we assume
> > that all objects after the first non-intent log item in the AIL are
> > not intents.
> > 
> > This is not true. Intent recovery can log new intents to continue
> > the operations the original intent could not complete in a single
> > transaction. The new intents are committed before they are deferred,
> > which means if the CIL commits in the background they will get
> > inserted into the AIL at the head.
> 
> Like you, I wonder how I never hit this.  Maybe I've never hit a
> corrupted rmap btree record during recovery?
> 
> So I guess what we're tripping over is a sequence of items in the AIL
> that looks something like this?
> 
> 0. <recovered non intent items>
> 1. <recovered intent item>
> 2. <new non-intent item>
> 3. <new intent items>

Mostly correct, but not quite. At the end of the first phase of
recovery (i.e. reading the log and extracting/recovering individual
items), the AIL looks like this:

0. <incomplete recovered intents>

It has nothing else in it, just the intents that need recovery.

In the second phase of recovery, we start processing those intents,
walking the above list and calling ->recover on them. IIUC,
recovering a BUI can create CUI and RUI intents, recovering a CUI
can create both RUI and EFI intents, and recovering a RUI can create
EFI intents.

Now, processing these newly created chained intents is deferred
after committing them to the *CIL*, so only the initial intents in 0
above are replayed because we check the AIL contents again. If there
is enough space in the journal and CIL to aggregate all these new
intents and objects, they will never reach the AIL before the
initial intent recovery has completed, and the check will never
fail.  Hence I think the actual logged contents check is rarely, if
ever, run properly because nothing has reached the AIL in normal
cases.

However, if the CIL -ever- commits during intent processing, then
the AIL check is completely invalid - the AIL can contain buffers,
inodes, new intents, etc in any order:

0. <incomplete recovered intents>
1. <non intent items>
2. <new chained/deferred intents>
3. <non intent items>
4. <new chained/deferred intents>
5. <non intent items>
....

> So we speed along the AIL list, dealing with the <0> items until we get
> to <1>.  We recover <1>, which generates <2> and <3>.  Next, the
> debugging code thinks we've hit the end of the list of recovered items,
> and therefore that it can keep walking the AIL and will only find
> items like <2>.  Unfortunately, it finds the new intent <3> and trips?

Effectively, yes.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock
  2022-03-21 21:56   ` Darrick J. Wong
@ 2022-03-21 22:43     ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2022-03-21 22:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 21, 2022 at 02:56:20PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 21, 2022 at 12:23:28PM +1100, Dave Chinner wrote:
> > +
> > +	/*
> > +	 * Capture the associated buffer and lock it if the caller didn't
> > +	 * pass us the locked buffer to begin with.
> > +	 */
> > +	spin_lock(&iip->ili_lock);
> > +	bp = iip->ili_item.li_buf;
> > +	xfs_iflush_abort_clean(iip);
> > +	spin_unlock(&iip->ili_lock);
> 
> Is the comment here incorrect?  The _shutdown_abort variant will go
> ahead and lock the buffer, but this function does not do that...?

Ah, stale comment from before I refactored it into separate
functions. I'll clean it up....

> > -	xfs_iflags_clear(ip, XFS_IFLUSHING);
> > -	if (bp)
> > -		xfs_buf_rele(bp);
> > +
> > +	/*
> > +	 * Got two references to bp. The first will get droped by 
> 
> "The first will get dropped by..." (spelling and stgit nagging me about
> trailing whitespace)
> 
> > +	 * xfs_iflush_abort() when the item is removed from the buffer list, but
> > +	 * we can't drop our reference until _abort() returns because we have to
> > +	 * unlock the buffer as well. Hence we abort and then unlock and release
> > +	 * our reference to the buffer.
> 
> ...and presumably xfs_iflush_abort will drop the other bp reference at
> some point after where we unlocked the inode item, locked the (held)
> buffer, and relocked the inode item?

Yes, xfs_iflush_abort() will drop the other buffer reference when it
removes the inode from the buffer item list.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/2] xfs: more shutdown/recovery fixes
  2022-03-21  1:23 [PATCH 0/2] xfs: more shutdown/recovery fixes Dave Chinner
  2022-03-21  1:23 ` [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock Dave Chinner
  2022-03-21  1:23 ` [PATCH 2/2] xfs: shutdown in intent recovery has non-intent items in the AIL Dave Chinner
@ 2022-03-22  2:35 ` Darrick J. Wong
  2022-03-22  3:26   ` Dave Chinner
  2 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2022-03-22  2:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Mar 21, 2022 at 12:23:27PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> These two patches are followups to my recent series of
> shutdown/recovery fixes. The cluster buffer lock patch addresses a
> race condition that started to show up regularly once the fixes in
> the previous series were done - it is a regression from the async
> inode reclaim work that was done almost 2 years ago now.
> 
> The second patch is something I'm really surprised has taken this
> long to uncover. There is a check in intent recovery/cancellation
> that checks that there are no intent items in the AIL after the
> first non-intent item is found. This behaviour was correct back when
> we only had standalone intent items (i.e. EFI/EFD), but when we
> started to chain complex operations by intents, the recovery of an
> incomplete intent can log and commit new intents and they can end up
> in the AIL before log recovery is complete and finished processing
> the deferred items. Hence the ASSERT() check that no intents
> exist in the AIL after the first non-intent item is simply invalid.
> 
> With these two patches, I'm back to being able to run hundreds of
> cycles of g/388 or -g recoveryloop without seeing any failures.

Hmm, with this series applied to current for-next
+ your other log patches
+ my xfs_reserve_blocks infinite loop fix and cleanup series

I see this while running fstests on an all-defaults v5 filesystem.
I /think/ it's crashing in the "xfs_buf_ispinned(bp)" right below the
"ASSERT(iip->ili_item.li_buf);" in xfs_inode_item_push.

[ 1294.636121] run fstests generic/251 at 2022-03-21 16:06:27
[ 1295.168287] XFS (sda): Mounting V5 Filesystem
[ 1295.180601] XFS (sda): Ending clean mount
[ 1295.471352] XFS (sdf): Mounting V5 Filesystem
[ 1295.482082] XFS (sdf): Ending clean mount
[ 1295.483612] XFS (sdf): Quotacheck needed: Please wait.
[ 1295.487265] XFS (sdf): Quotacheck: Done.
[ 1376.010962] XFS (sdf): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
[ 1386.006736] XFS: Assertion failed: iip->ili_item.li_buf, file: fs/xfs/xfs_inode_item.c, line: 547
[ 1386.007571] ------------[ cut here ]------------
[ 1386.007991] WARNING: CPU: 0 PID: 419577 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
[ 1386.008837] Modules linked in: btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress ext4 mbcache jbd2 dm_flakey dm_snapshot dm_bufio dm_zero xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip xt_REDIRECT xt_set ip_set_hash_net ip_set_hash_mac ip_set iptable_nat nf_nat nf_conntrack nfnetlink nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bfq pvpanic_mmio pvpanic sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
[ 1386.012759] CPU: 0 PID: 419577 Comm: xfsaild/sdf Not tainted 5.17.0-rc6-djwx #rc6 38f9b0b98565b531708bd3bd34525d3eec280440
[ 1386.013718] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 1386.014493] RIP: 0010:assfail+0x3c/0x40 [xfs]
[ 1386.015098] Code: 90 9c 45 a0 e8 81 f9 ff ff 8a 1d d3 28 0c 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 60 81 4f a0 e8 b3 4d fd e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
[ 1386.016706] RSP: 0018:ffffc9000530be18 EFLAGS: 00010246
[ 1386.017172] RAX: 0000000000000000 RBX: ffff888063ace300 RCX: 0000000000000000
[ 1386.017801] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa0449eb0
[ 1386.018435] RBP: ffff88805fa897c0 R08: 0000000000000000 R09: 000000000000000a
[ 1386.019059] R10: 000000000000000a R11: f000000000000000 R12: ffff88800ad06f40
[ 1386.019685] R13: ffff88800ad06f68 R14: 00000016000191e9 R15: 0000000000000000
[ 1386.020395] FS:  0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[ 1386.021172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1386.021680] CR2: 0000564ba4fc9968 CR3: 000000002852c002 CR4: 00000000001706b0
[ 1386.022313] Call Trace:
[ 1386.022555]  <TASK>
[ 1386.022763]  xfs_inode_item_push+0xf4/0x140 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
[ 1386.023579]  xfsaild+0x402/0xc30 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
[ 1386.024367]  ? __set_cpus_allowed_ptr_locked+0xe0/0x1a0
[ 1386.024841]  ? preempt_count_add+0x73/0xa0
[ 1386.025255]  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
[ 1386.026163]  kthread+0xea/0x110
[ 1386.026487]  ? kthread_complete_and_exit+0x20/0x20
[ 1386.026924]  ret_from_fork+0x1f/0x30
[ 1386.027261]  </TASK>
[ 1386.027475] ---[ end trace 0000000000000000 ]---
[ 1386.027889] BUG: kernel NULL pointer dereference, address: 000000000000017c
[ 1386.028498] #PF: supervisor read access in kernel mode
[ 1386.028955] #PF: error_code(0x0000) - not-present page
[ 1386.029419] PGD 0 P4D 0 
[ 1386.029663] Oops: 0000 [#1] PREEMPT SMP
[ 1386.030012] CPU: 0 PID: 419577 Comm: xfsaild/sdf Tainted: G        W         5.17.0-rc6-djwx #rc6 38f9b0b98565b531708bd3bd34525d3eec280440
[ 1386.031100] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 1386.031855] RIP: 0010:xfs_inode_item_push+0x3d/0x140 [xfs]
[ 1386.032435] Code: 8b 7f 48 48 89 fb 48 8b af 90 00 00 00 4d 85 ff 0f 84 b1 00 00 00 8b 85 c8 00 00 00 41 bc 01 00 00 00 85 c0 0f 85 87 00 00 00 <41> 8b 87 7c 01 00 00 85 c0 75 7c f6 85 f8 00 00 00 02 75 73 4c 8d
[ 1386.034005] RSP: 0018:ffffc9000530be28 EFLAGS: 00010246
[ 1386.034569] RAX: 0000000000000000 RBX: ffff888063ace360 RCX: 0000000000000000
[ 1386.035230] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa0449eb0
[ 1386.035845] RBP: ffff88805fa897c0 R08: 0000000000000000 R09: 000000000000000a
[ 1386.036462] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000001
[ 1386.037069] R13: ffff88800ad06f68 R14: 00000016000191e9 R15: 0000000000000000
[ 1386.037722] FS:  0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[ 1386.038504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1386.038998] CR2: 000000000000017c CR3: 000000002852c002 CR4: 00000000001706b0
[ 1386.039616] Call Trace:
[ 1386.039849]  <TASK>
[ 1386.040055]  xfsaild+0x402/0xc30 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
[ 1386.040800]  ? __set_cpus_allowed_ptr_locked+0xe0/0x1a0
[ 1386.041267]  ? preempt_count_add+0x73/0xa0
[ 1386.041634]  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
[ 1386.042487]  kthread+0xea/0x110
[ 1386.042777]  ? kthread_complete_and_exit+0x20/0x20
[ 1386.043207]  ret_from_fork+0x1f/0x30
[ 1386.043584]  </TASK>
[ 1386.043795] Modules linked in: btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress ext4 mbcache jbd2 dm_flakey dm_snapshot dm_bufio dm_zero xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip xt_REDIRECT xt_set ip_set_hash_net ip_set_hash_mac ip_set iptable_nat nf_nat nf_conntrack nfnetlink nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bfq pvpanic_mmio pvpanic sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
[ 1386.047627] Dumping ftrace buffer:
[ 1386.047940]    (ftrace buffer empty)
[ 1386.048267] CR2: 000000000000017c
[ 1386.048569] ---[ end trace 0000000000000000 ]---
[ 1386.049097] RIP: 0010:xfs_inode_item_push+0x3d/0x140 [xfs]
[ 1386.049704] Code: 8b 7f 48 48 89 fb 48 8b af 90 00 00 00 4d 85 ff 0f 84 b1 00 00 00 8b 85 c8 00 00 00 41 bc 01 00 00 00 85 c0 0f 85 87 00 00 00 <41> 8b 87 7c 01 00 00 85 c0 75 7c f6 85 f8 00 00 00 02 75 73 4c 8d
[ 1386.051276] RSP: 0018:ffffc9000530be28 EFLAGS: 00010246
[ 1386.051732] RAX: 0000000000000000 RBX: ffff888063ace360 RCX: 0000000000000000
[ 1386.052338] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa0449eb0
[ 1386.052945] RBP: ffff88805fa897c0 R08: 0000000000000000 R09: 000000000000000a
[ 1386.053555] R10: 000000000000000a R11: f000000000000000 R12: 0000000000000001
[ 1386.054228] R13: ffff88800ad06f68 R14: 00000016000191e9 R15: 0000000000000000
[ 1386.054855] FS:  0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[ 1386.055535] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1386.056034] CR2: 000000000000017c CR3: 000000002852c002 CR4: 00000000001706b0
[ 1386.056653] note: xfsaild/sdf[419577] exited with preempt_count 1

--D


* Re: [PATCH 0/2] xfs: more shutdown/recovery fixes
  2022-03-22  2:35 ` [PATCH 0/2] xfs: more shutdown/recovery fixes Darrick J. Wong
@ 2022-03-22  3:26   ` Dave Chinner
  2022-03-22 16:23     ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2022-03-22  3:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Mar 21, 2022 at 07:35:28PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 21, 2022 at 12:23:27PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > These two patches are followups to my recent series of
> > shutdown/recovery fixes. The cluster buffer lock patch addresses a
> > race condition that started to show up regularly once the fixes in
> > the previous series were done - it is a regression from the async
> > inode reclaim work that was done almost 2 years ago now.
> > 
> > The second patch is something I'm really surprised has taken this
> > long to uncover. There is a check in intent recovery/cancellation
> > that checks that there are no intent items in the AIL after the
> > first non-intent item is found. This behaviour was correct back when
> > we only had standalone intent items (i.e. EFI/EFD), but when we
> > started to chain complex operations by intents, the recovery of an
> > incomplete intent can log and commit new intents and they can end up
> > in the AIL before log recovery is complete and finished processing
> > the deferred items. Hence the ASSERT() check that no intents
> > exist in the AIL after the first non-intent item is simply invalid.
> > 
> > With these two patches, I'm back to being able to run hundreds of
> > cycles of g/388 or -g recoveryloop without seeing any failures.
> 
> Hmm, with this series applied to current for-next
> + your other log patches
> + my xfs_reserve_blocks infinite loop fix and cleanup series
> 
> I see this while running fstests on an all-defaults v5 filesystem.
> I /think/ it's crashing in the "xfs_buf_ispinned(bp)" right below the
> "ASSERT(iip->ili_item.li_buf);" in xfs_inode_item_push.

Oh, because you have CONFIG_XFS_ASSERT_FATAL=n, or whatever the
config thingy is to keep going even when asserts fail...

> [ 1294.636121] run fstests generic/251 at 2022-03-21 16:06:27
> [ 1295.168287] XFS (sda): Mounting V5 Filesystem
> [ 1295.180601] XFS (sda): Ending clean mount
> [ 1295.471352] XFS (sdf): Mounting V5 Filesystem
> [ 1295.482082] XFS (sdf): Ending clean mount
> [ 1295.483612] XFS (sdf): Quotacheck needed: Please wait.
> [ 1295.487265] XFS (sdf): Quotacheck: Done.
> [ 1376.010962] XFS (sdf): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> [ 1386.006736] XFS: Assertion failed: iip->ili_item.li_buf, file: fs/xfs/xfs_inode_item.c, line: 547
> [ 1386.007571] ------------[ cut here ]------------
> [ 1386.007991] WARNING: CPU: 0 PID: 419577 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
> [ 1386.008837] Modules linked in: btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress ext4 mbcache jbd2 dm_flakey dm_snapshot dm_bufio dm_zero xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip xt_REDIRECT xt_set ip_set_hash_net ip_set_hash_mac ip_set iptable_nat nf_nat nf_conntrack nfnetlink nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bfq pvpanic_mmio pvpanic sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
> [ 1386.012759] CPU: 0 PID: 419577 Comm: xfsaild/sdf Not tainted 5.17.0-rc6-djwx #rc6 38f9b0b98565b531708bd3bd34525d3eec280440
> [ 1386.013718] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> [ 1386.014493] RIP: 0010:assfail+0x3c/0x40 [xfs]
> [ 1386.015098] Code: 90 9c 45 a0 e8 81 f9 ff ff 8a 1d d3 28 0c 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 60 81 4f a0 e8 b3 4d fd e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
> [ 1386.016706] RSP: 0018:ffffc9000530be18 EFLAGS: 00010246
> [ 1386.017172] RAX: 0000000000000000 RBX: ffff888063ace300 RCX: 0000000000000000
> [ 1386.017801] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa0449eb0
> [ 1386.018435] RBP: ffff88805fa897c0 R08: 0000000000000000 R09: 000000000000000a
> [ 1386.019059] R10: 000000000000000a R11: f000000000000000 R12: ffff88800ad06f40
> [ 1386.019685] R13: ffff88800ad06f68 R14: 00000016000191e9 R15: 0000000000000000
> [ 1386.020395] FS:  0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
> [ 1386.021172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1386.021680] CR2: 0000564ba4fc9968 CR3: 000000002852c002 CR4: 00000000001706b0
> [ 1386.022313] Call Trace:
> [ 1386.022555]  <TASK>
> [ 1386.022763]  xfs_inode_item_push+0xf4/0x140 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> [ 1386.023579]  xfsaild+0x402/0xc30 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> [ 1386.024367]  ? __set_cpus_allowed_ptr_locked+0xe0/0x1a0
> [ 1386.024841]  ? preempt_count_add+0x73/0xa0
> [ 1386.025255]  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> [ 1386.026163]  kthread+0xea/0x110
> [ 1386.026487]  ? kthread_complete_and_exit+0x20/0x20
> [ 1386.026924]  ret_from_fork+0x1f/0x30
> [ 1386.027261]  </TASK>

I haven't seen that particular failure - that's likely an
interaction with stale inodes and buffers on inode cluster buffer
freeing. On the surface it doesn't seem to be related to the
shutdown changes, but it could be related to the fact that we
are accessing the inode log item without holding either the buffer
locked or the item locked.

Hence this could be racing with a journal IO completion for a freed
inode cluster buffer that runs inode IO completion directly, which
then calls xfs_iflush_abort() because XFS_ISTALE is set on all the
inodes.

Yeah.

I changed xfs_iflush_abort() to clean up the inode log item state
before I removed it from the AIL, thinking that this was safe to
do because anything that uses the iip->ili_item.li_buf should be
checking it before using it.....

.... and that is exactly what xfs_inode_item_push() is _not_ doing.

OK, I know how this is happening, I'll have to rework the first
patch in this series to address this.
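
As a rough idea of the shape of that rework - a sketch only, not the
actual patch, and the real fix may end up checking XFS_ISTALE, taking
the ili_lock or holding a buffer reference instead:

	STATIC uint
	xfs_inode_item_push(
		struct xfs_log_item	*lip,
		struct list_head	*buffer_list)
	{
		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
		struct xfs_inode	*ip = iip->ili_inode;
		struct xfs_buf		*bp = lip->li_buf;

		/*
		 * No attached cluster buffer means an abort is tearing the
		 * item down under us. Back off instead of dereferencing a
		 * buffer that is going away; the abort will remove the item
		 * from the AIL for us.
		 */
		if (!bp || xfs_iflags_test(ip, XFS_ISTALE))
			return XFS_ITEM_PINNED;

		if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp))
			return XFS_ITEM_PINNED;

		/* ... rest of the push path unchanged ... */
	}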

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] xfs: more shutdown/recovery fixes
  2022-03-22  3:26   ` Dave Chinner
@ 2022-03-22 16:23     ` Darrick J. Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2022-03-22 16:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 22, 2022 at 02:26:49PM +1100, Dave Chinner wrote:
> On Mon, Mar 21, 2022 at 07:35:28PM -0700, Darrick J. Wong wrote:
> > On Mon, Mar 21, 2022 at 12:23:27PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > These two patches are followups to my recent series of
> > > shutdown/recovery fixes. The cluster buffer lock patch addresses a
> > > race condition that started to show up regularly once the fixes in
> > > the previous series were done - it is a regression from the async
> > > inode reclaim work that was done almost 2 years ago now.
> > > 
> > > The second patch is something I'm really surprised has taken this
> > > long to uncover. There is a check in intent recovery/cancellation
> > > that checks that there are no intent items in the AIL after the
> > > first non-intent item is found. This behaviour was correct back when
> > > we only had standalone intent items (i.e. EFI/EFD), but when we
> > > started to chain complex operations by intents, the recovery of an
> > > incomplete intent can log and commit new intents and they can end up
> > > in the AIL before log recovery is complete and finished processing
> > > the deferred items. Hence the ASSERT() check that no intents
> > > exist in the AIL after the first non-intent item is simply invalid.
> > > 
> > > With these two patches, I'm back to being able to run hundreds of
> > > cycles of g/388 or -g recoveryloop without seeing any failures.
> > 
> > Hmm, with this series applied to current for-next
> > + your other log patches
> > + my xfs_reserve_blocks infinite loop fix and cleanup series
> > 
> > I see this while running fstests on an all-defaults v5 filesystem.
> > I /think/ it's crashing in the "xfs_buf_ispinned(bp)" right below the
> > "ASSERT(iip->ili_item.li_buf);" in xfs_inode_item_push.
> 
> Oh, because you have CONFIG_XFS_ASSERT_FATAL=n, or whatever the
> config thingy is to keep going even when asserts fail...

Yes, because I often have ftrace turned on when I'm debugging things,
and it's /much/ easier to let trace-cmd continue to slurp data out of
the ring buffer (even if I have to manually kill the fstest) than dig
through 10000+ lines of serial console log after the fact.

I always enable the printk tracepoint so the traces and the console
output stay in sync too.
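
(For the record, the incantation is roughly along these lines - the
event list is just what I happen to use, adjust to taste:

  $ trace-cmd record -e 'xfs:*' -e printk:console ./check generic/251

trace-cmd report then shows the console messages interleaved with the
XFS tracepoints, so the assert output lines up with the trace data.)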

> > [ 1294.636121] run fstests generic/251 at 2022-03-21 16:06:27
> > [ 1295.168287] XFS (sda): Mounting V5 Filesystem
> > [ 1295.180601] XFS (sda): Ending clean mount
> > [ 1295.471352] XFS (sdf): Mounting V5 Filesystem
> > [ 1295.482082] XFS (sdf): Ending clean mount
> > [ 1295.483612] XFS (sdf): Quotacheck needed: Please wait.
> > [ 1295.487265] XFS (sdf): Quotacheck: Done.
> > [ 1376.010962] XFS (sdf): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> > [ 1386.006736] XFS: Assertion failed: iip->ili_item.li_buf, file: fs/xfs/xfs_inode_item.c, line: 547
> > [ 1386.007571] ------------[ cut here ]------------
> > [ 1386.007991] WARNING: CPU: 0 PID: 419577 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
> > [ 1386.008837] Modules linked in: btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress ext4 mbcache jbd2 dm_flakey dm_snapshot dm_bufio dm_zero xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip xt_REDIRECT xt_set ip_set_hash_net ip_set_hash_mac ip_set iptable_nat nf_nat nf_conntrack nfnetlink nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bfq pvpanic_mmio pvpanic sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
> > [ 1386.012759] CPU: 0 PID: 419577 Comm: xfsaild/sdf Not tainted 5.17.0-rc6-djwx #rc6 38f9b0b98565b531708bd3bd34525d3eec280440
> > [ 1386.013718] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> > [ 1386.014493] RIP: 0010:assfail+0x3c/0x40 [xfs]
> > [ 1386.015098] Code: 90 9c 45 a0 e8 81 f9 ff ff 8a 1d d3 28 0c 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 60 81 4f a0 e8 b3 4d fd e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
> > [ 1386.016706] RSP: 0018:ffffc9000530be18 EFLAGS: 00010246
> > [ 1386.017172] RAX: 0000000000000000 RBX: ffff888063ace300 RCX: 0000000000000000
> > [ 1386.017801] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa0449eb0
> > [ 1386.018435] RBP: ffff88805fa897c0 R08: 0000000000000000 R09: 000000000000000a
> > [ 1386.019059] R10: 000000000000000a R11: f000000000000000 R12: ffff88800ad06f40
> > [ 1386.019685] R13: ffff88800ad06f68 R14: 00000016000191e9 R15: 0000000000000000
> > [ 1386.020395] FS:  0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
> > [ 1386.021172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1386.021680] CR2: 0000564ba4fc9968 CR3: 000000002852c002 CR4: 00000000001706b0
> > [ 1386.022313] Call Trace:
> > [ 1386.022555]  <TASK>
> > [ 1386.022763]  xfs_inode_item_push+0xf4/0x140 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> > [ 1386.023579]  xfsaild+0x402/0xc30 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> > [ 1386.024367]  ? __set_cpus_allowed_ptr_locked+0xe0/0x1a0
> > [ 1386.024841]  ? preempt_count_add+0x73/0xa0
> > [ 1386.025255]  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> > [ 1386.026163]  kthread+0xea/0x110
> > [ 1386.026487]  ? kthread_complete_and_exit+0x20/0x20
> > [ 1386.026924]  ret_from_fork+0x1f/0x30
> > [ 1386.027261]  </TASK>
> 
> I haven't seen that particular failure - that's likely an
> interaction with stale inodes and buffers on inode cluster buffer
> freeing. On the surface it doesn't seem to be related to the
> shutdown changes, but it could be related to the fact that we
> are accessing the inode log item without holding either the buffer
> locked or the item locked.
> 
> Hence this could be racing with a journal IO completion for a freed
> inode cluster buffer that runs inode IO completion directly, which
> then calls xfs_iflush_abort() because XFS_ISTALE is set on all the
> inodes.
> 
> Yeah.
> 
> I changed xfs_iflush_abort() to clean up the inode log item state
> before I removed it from the AIL, thinking that this was safe to
> do because anything that uses iip->ili_item.li_buf should be
> checking it before using it.....
> 
> .... and that is exactly what xfs_inode_item_push() is _not_ doing.

Aha, that's sort of what I was guessing.

> OK, I know how this is happening, I'll have to rework the first
> patch in this series to address this.

Cool. :)

--D

> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-03-22 16:23 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-21  1:23 [PATCH 0/2] xfs: more shutdown/recovery fixes Dave Chinner
2022-03-21  1:23 ` [PATCH 1/2] xfs: aborting inodes on shutdown may need buffer lock Dave Chinner
2022-03-21 21:56   ` Darrick J. Wong
2022-03-21 22:43     ` Dave Chinner
2022-03-21  1:23 ` [PATCH 2/2] xfs: shutdown in intent recovery has non-intent items in the AIL Dave Chinner
2022-03-21 21:52   ` Darrick J. Wong
2022-03-21 22:41     ` Dave Chinner
2022-03-22  2:35 ` [PATCH 0/2] xfs: more shutdown/recovery fixes Darrick J. Wong
2022-03-22  3:26   ` Dave Chinner
2022-03-22 16:23     ` Darrick J. Wong
