From: Catherine Hoang <catherine.hoang@oracle.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 6.6 CANDIDATE v1 14/21] xfs: fix internal error from AGFL exhaustion
Date: Tue, 30 Jan 2024 15:44:12 -0800
Message-ID: <20240130234419.45896-15-catherine.hoang@oracle.com>
In-Reply-To: <20240130234419.45896-1-catherine.hoang@oracle.com>

From: Omar Sandoval <osandov@fb.com>

commit f63a5b3769ad7659da4c0420751d78958ab97675 upstream.

We've been seeing XFS errors like the following:

XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
...
Call Trace:
 xfs_corruption_error+0x94/0xa0
 xfs_btree_insert+0x221/0x280
 xfs_alloc_fixup_trees+0x104/0x3e0
 xfs_alloc_ag_vextent_size+0x667/0x820
 xfs_alloc_fix_freelist+0x5d9/0x750
 xfs_free_extent_fix_freelist+0x65/0xa0
 __xfs_free_extent+0x57/0x180
...

This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
xfs_btree_insrec() fails.

After converting this into a panic and dissecting the core dump, I found
that xfs_btree_insrec() is failing because it's trying to split a leaf
node in the cntbt when the AG free list is empty. In particular, it's
failing to get a block from the AGFL _while trying to refill the AGFL_.

If a single operation splits every level of the bnobt and the cntbt (and
the rmapbt if it is enabled) at once, the free list will be empty. Then,
when the next operation tries to refill the free list, it allocates
space. If the allocation does not use a full extent, it will need to
insert records for the remaining space in the bnobt and cntbt. And if
those new records go in full leaves, the leaves (and potentially more
nodes up to the old root) need to be split.

Fix it by accounting for the additional splits that may be required to
refill the free list in the calculation for the minimum free list size.
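
For illustration only, the arithmetic of the old and new reservations
can be sketched in standalone C. This is not the kernel code: the
helper names (min_uint, old_reservation, new_reservation,
sketch_min_freelist) are hypothetical, and the tree heights and maximum
levels are plain parameters rather than values read from the mount or
per-AG structures.

```c
#include <stdbool.h>

/* Hypothetical helper, not a kernel function: smaller of two values. */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Per-btree reservation before the fix: enough blocks for one
 * full-height split, capped at the maximum btree height.
 */
static unsigned int old_reservation(unsigned int levels,
				    unsigned int maxlevels)
{
	return min_uint(levels + 1, maxlevels);
}

/*
 * Per-btree reservation after the fix: 2 * new height - 2, which also
 * covers the splits that can happen while refilling the free list.
 */
static unsigned int new_reservation(unsigned int levels,
				    unsigned int maxlevels)
{
	return min_uint(levels + 1, maxlevels) * 2 - 2;
}

/*
 * Sum of the post-fix reservations for the bnobt and cntbt, plus the
 * rmapbt when it is enabled. All arguments are assumptions supplied by
 * the caller for the sake of the example.
 */
static unsigned int sketch_min_freelist(unsigned int bno_levels,
					unsigned int cnt_levels,
					unsigned int rmap_levels,
					unsigned int alloc_maxlevels,
					unsigned int rmap_maxlevels,
					bool has_rmapbt)
{
	unsigned int min_free;

	min_free = new_reservation(bno_levels, alloc_maxlevels);
	min_free += new_reservation(cnt_levels, alloc_maxlevels);
	if (has_rmapbt)
		min_free += new_reservation(rmap_levels, rmap_maxlevels);
	return min_free;
}
```

Assuming a maximum height of 5, a two-level btree reserved 3 blocks
before and reserves 4 now, while a btree already at the maximum height
goes from 5 to 8. A single-level btree is unchanged at 2, since there
are no levels under the old root to split again.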

P.S. As far as I can tell, this bug has existed for a long time -- maybe
back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
...") in April 1994! It requires a very unlucky sequence of events, and
in fact we didn't hit it until a particular sparse mmap workload updated
from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
exposed by some other change in allocation or writeback patterns. It's
also much less likely to be hit with the rmapbt enabled, since that
increases the minimum free list size and is unlikely to split at the
same time as the bnobt and cntbt.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 3069194527dd..100ab5931b31 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2275,16 +2275,37 @@ xfs_alloc_min_freelist(
 
 	ASSERT(mp->m_alloc_maxlevels > 0);
 
+	/*
+	 * For a btree shorter than the maximum height, the worst case is that
+	 * every level gets split and a new level is added, then while inserting
+	 * another entry to refill the AGFL, every level under the old root gets
+	 * split again. This is:
+	 *
+	 *   (full height split reservation) + (AGFL refill split height)
+	 * = (current height + 1) + (current height - 1)
+	 * = (new height) + (new height - 2)
+	 * = 2 * new height - 2
+	 *
+	 * For a btree of maximum height, the worst case is that every level
+	 * under the root gets split, then while inserting another entry to
+	 * refill the AGFL, every level under the root gets split again. This is
+	 * also:
+	 *
+	 *   2 * (current height - 1)
+	 * = 2 * (new height - 1)
+	 * = 2 * new height - 2
+	 */
+
 	/* space needed by-bno freespace btree */
 	min_free = min_t(unsigned int, levels[XFS_BTNUM_BNOi] + 1,
-				       mp->m_alloc_maxlevels);
+				       mp->m_alloc_maxlevels) * 2 - 2;
 	/* space needed by-size freespace btree */
 	min_free += min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
-				       mp->m_alloc_maxlevels);
+				       mp->m_alloc_maxlevels) * 2 - 2;
 	/* space needed reverse mapping used space btree */
 	if (xfs_has_rmapbt(mp))
 		min_free += min_t(unsigned int, levels[XFS_BTNUM_RMAPi] + 1,
-						mp->m_rmap_maxlevels);
+						mp->m_rmap_maxlevels) * 2 - 2;
 
 	return min_free;
 }
-- 
2.39.3
