* [PATCH 0/5 V2] xfs: various fixes
From: Dave Chinner @ 2023-06-20  0:20 UTC
  To: linux-xfs

Hi folks,

This is a wrap-up of version 2 of all the fixes I have recently
pushed out for review.

The first patch fixes an AIL ordering problem identified when testing
patches 2-4 in this series. This patch only addresses the AIL ordering
problem that was found; it does not address any other corner cases
in intent recovery that may be exposed by patches 2-4.

Patches 2-4 allow partial handling of multi-extent EFIs during
runtime operation and during journal recovery. This is needed because
we attempt to process all the extents in an EFI in a single
transaction, and we can deadlock during AGFL fixup if the transaction
context holds the only free space in the AG in a busy extent.

This patchset uncovered a problem where log recovery would run off
the end of the list of intents to recover and into intents that
require deferred processing. This was caused by the ordering issue
fixed in patch 1.

This patchset does not attempt to address the end-of-intent-recovery
detection issue - fixing it raises other problems with intent
recovery beyond loop termination. Solving those problems requires
more thought, and the issue can largely be avoided by the first patch
in the series. As it is, CUI recovery has been vulnerable to these
intent recovery loop termination issues for several years, but we
don't have any evidence that systems have tripped over this, so fully
fixing the loop termination is less urgent than fixing the AIL bulk
insertion ordering problem that exposed it.

Finally, patch 5 addresses journal geometry validation issues. It
makes all geometry mismatches hard failures for V4 and V5
filesystems, except for the log size being too small on V4
filesystems. This addresses the problem on V4 filesystems where we'd
fail to perform the remaining validation on the geometry once we'd
detected that the log was too small or too large.

This all passed fstests on v4 and v5 filesystems without
regressions.

-Dave.

---

Version 2:
- patch 1
  - rewritten commit message
- patch 2
  - uint32_t alloc_flags conversion pushed all the way down into
    xfs_extent_busy_flush
- patch 3
  - Updated comment for xfs_efd_from_efi()
  - fixed whitespace damage
  - check error return from xfs_free_extent_later()
- patch 5
  - update error message for max LSU size check
  - fix whitespace damage
  - clean up error handling in xfs_log_mount() changes




* [PATCH 1/5] xfs: don't reverse order of items in bulk AIL insertion
From: Dave Chinner @ 2023-06-20  0:20 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

XFS has strict metadata ordering requirements. One of the things it
does is maintain the commit order of items from transaction commit
through the CIL and into the AIL. That is, if a transaction logs
item A before item B in a modification, then they will be inserted
into the CIL in the order {A, B}. These items are then written into
the iclog during checkpointing in the order {A, B}. When the
checkpoint commits, they are supposed to be inserted into the AIL in
the order {A, B}, and when they are pushed from the AIL, they are
pushed in the order {A, B}.

If we crash, log recovery then replays the two items from the
checkpoint in the order {A, B}, resulting in the objects the items
apply to being queued for writeback at the end of the checkpoint
in the order {A, B}. This means recovery behaves the same way as the
runtime code.

In places, we have subtle dependencies on this ordering being
maintained. One of these places is intent recovery from the log. It
assumes that recovering an intent will result in a non-intent object
being the first thing that is modified in the recovery transaction,
and so when the transaction commits and the journal flushes, the
first object inserted into the AIL beyond the intent recovery range
will be a non-intent item. It uses the transition from intent items
to non-intent items to stop the recovery pass.
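
The termination rule reads roughly like the following sketch. The
types and helpers here are illustrative stand-ins, not the actual
recovery code:

	#include <stdbool.h>
	#include <stddef.h>

	struct log_item {
		struct log_item	*next;
		bool		is_intent;
	};

	/*
	 * Walk items in AIL (commit) order; recovery stops at the
	 * first non-intent item beyond the start of the recovery
	 * range.
	 */
	static struct log_item *
	recover_intents(struct log_item *lip)
	{
		for (; lip != NULL; lip = lip->next) {
			if (!lip->is_intent)
				break;
			/* ... recover this intent ... */
		}
		return lip;	/* first non-intent item ends the pass */
	}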

A recent log recovery issue indicated that an intent was appearing
as the first item in the AIL beyond the recovery range, hence
breaking the existing end-of-recovery detection.

Tracing indicated insertion of the items into the AIL was apparently
occurring in the right order (the intent was last in the commit item
list), but the intent was appearing first in the AIL. IOWs, the
order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk
insertion was reversing the order of the items in the batch of items
being inserted.
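
The effect is easy to demonstrate outside the kernel. Here is a
minimal standalone demo of the two list primitives (reimplemented in
userspace for illustration - this is not the kernel list code):

	#include <stdio.h>

	struct node { struct node *prev, *next; int id; };

	static void list_init(struct node *h)
	{
		h->prev = h->next = h;
	}

	/* Insert at the head, like the kernel's list_add(). */
	static void list_add(struct node *n, struct node *h)
	{
		n->next = h->next;
		n->prev = h;
		h->next->prev = n;
		h->next = n;
	}

	/* Insert at the tail, like the kernel's list_add_tail(). */
	static void list_add_tail(struct node *n, struct node *h)
	{
		n->prev = h->prev;
		n->next = h;
		h->prev->next = n;
		h->prev = n;
	}

	int main(void)
	{
		struct node head, items[4];
		struct node *p;
		int i;

		/* Building a batch with list_add() reverses it. */
		list_init(&head);
		for (i = 0; i < 4; i++) {
			items[i].id = i;
			list_add(&items[i], &head);
		}
		printf("list_add:     ");
		for (p = head.next; p != &head; p = p->next)
			printf(" %d", p->id);		/* 3 2 1 0 */

		/* Building it with list_add_tail() preserves order. */
		list_init(&head);
		for (i = 0; i < 4; i++)
			list_add_tail(&items[i], &head);
		printf("\nlist_add_tail:");
		for (p = head.next; p != &head; p = p->next)
			printf(" %d", p->id);		/* 0 1 2 3 */
		printf("\n");
		return 0;
	}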

Lucky for us, all the items fed to bulk insertion have the same LSN,
so the reversal of order does not affect the log head/tail tracking
that is based on the contents of the AIL. It only impacts on code
that has implicit, subtle dependencies on object order, and AFAICT
only the intent recovery loop is impacted by it.

Make sure bulk AIL insertion does not reorder items incorrectly.

Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans_ail.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..1098452e7f95 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk(
 			trace_xfs_ail_insert(lip, 0, lsn);
 		}
 		lip->li_lsn = lsn;
-		list_add(&lip->li_ail, &tmp);
+		list_add_tail(&lip->li_ail, &tmp);
 	}
 
 	if (!list_empty(&tmp))
-- 
2.40.1



* [PATCH 2/5] xfs: pass alloc flags through to xfs_extent_busy_flush()
From: Dave Chinner @ 2023-06-20  0:20 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To avoid blocking in xfs_extent_busy_flush() when freeing extents
and the only busy extents are held by the current transaction, we
need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
into xfs_extent_busy_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
 fs/xfs/libxfs/xfs_alloc.h |  2 +-
 fs/xfs/xfs_extent_busy.c  |  3 +-
 fs/xfs/xfs_extent_busy.h  |  2 +-
 4 files changed, 56 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c20fe99405d8..11bd0a1756a1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
  */
 STATIC int
 xfs_alloc_ag_vextent_near(
-	struct xfs_alloc_arg	*args)
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
 	struct xfs_alloc_cur	acur = {};
 	int			error;		/* error code */
@@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near(
 		if (acur.busy) {
 			trace_xfs_alloc_near_busy(args);
 			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen);
+					      acur.busy_gen, alloc_flags);
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near(
  * and of the form k * prod + mod unless there's nothing that large.
  * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
  */
-STATIC int				/* error */
+static int
 xfs_alloc_ag_vextent_size(
-	xfs_alloc_arg_t	*args)		/* allocation argument structure */
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
-	struct xfs_agf	*agf = args->agbp->b_addr;
-	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
-	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
-	int		error;		/* error result */
-	xfs_agblock_t	fbno;		/* start of found freespace */
-	xfs_extlen_t	flen;		/* length of found freespace */
-	int		i;		/* temp status variable */
-	xfs_agblock_t	rbno;		/* returned block number */
-	xfs_extlen_t	rlen;		/* length of returned extent */
-	bool		busy;
-	unsigned	busy_gen;
+	struct xfs_agf		*agf = args->agbp->b_addr;
+	struct xfs_btree_cur	*bno_cur;
+	struct xfs_btree_cur	*cnt_cur;
+	xfs_agblock_t		fbno;		/* start of found freespace */
+	xfs_extlen_t		flen;		/* length of found freespace */
+	xfs_agblock_t		rbno;		/* returned block number */
+	xfs_extlen_t		rlen;		/* length of returned extent */
+	bool			busy;
+	unsigned		busy_gen;
+	int			error;
+	int			i;
 
 restart:
 	/*
@@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size(
 				xfs_btree_del_cursor(cnt_cur,
 						     XFS_BTREE_NOERROR);
 				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp,
-							args->pag, busy_gen);
+				xfs_extent_busy_flush(args->mp, args->pag,
+						busy_gen, alloc_flags);
 				goto restart;
 			}
 		}
@@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size(
 		if (busy) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
+			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
+					alloc_flags);
 			goto restart;
 		}
 		goto out_nominleft;
@@ -2568,7 +2571,7 @@ xfs_exact_minlen_extent_available(
 int			/* error */
 xfs_alloc_fix_freelist(
 	struct xfs_alloc_arg	*args,	/* allocation argument structure */
-	int			flags)	/* XFS_ALLOC_FLAG_... */
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*pag = args->pag;
@@ -2584,7 +2587,7 @@ xfs_alloc_fix_freelist(
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 
 	if (!xfs_perag_initialised_agf(pag)) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2600,13 +2603,13 @@ xfs_alloc_fix_freelist(
 	 */
 	if (xfs_perag_prefers_metadata(pag) &&
 	    (args->datatype & XFS_ALLOC_USERDATA) &&
-	    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
-		ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
+	    (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
+		ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
 		goto out_agbp_relse;
 	}
 
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags |
+	if (!xfs_alloc_space_available(args, need, alloc_flags |
 			XFS_ALLOC_FLAG_CHECK))
 		goto out_agbp_relse;
 
@@ -2615,7 +2618,7 @@ xfs_alloc_fix_freelist(
 	 * Can fail if we're not blocking on locks, and it's held.
 	 */
 	if (!agbp) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2630,7 +2633,7 @@ xfs_alloc_fix_freelist(
 
 	/* If there isn't enough total space or single-extent, reject it. */
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags))
+	if (!xfs_alloc_space_available(args, need, alloc_flags))
 		goto out_agbp_relse;
 
 #ifdef DEBUG
@@ -2668,11 +2671,12 @@ xfs_alloc_fix_freelist(
 	 */
 	memset(&targs, 0, sizeof(targs));
 	/* struct copy below */
-	if (flags & XFS_ALLOC_FLAG_NORMAP)
+	if (alloc_flags & XFS_ALLOC_FLAG_NORMAP)
 		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
 	else
 		targs.oinfo = XFS_RMAP_OINFO_AG;
-	while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
+	while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) &&
+			pag->pagf_flcount > need) {
 		error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
 		if (error)
 			goto out_agbp_relse;
@@ -2700,7 +2704,7 @@ xfs_alloc_fix_freelist(
 		targs.resv = XFS_AG_RESV_AGFL;
 
 		/* Allocate as many blocks as possible at once. */
-		error = xfs_alloc_ag_vextent_size(&targs);
+		error = xfs_alloc_ag_vextent_size(&targs, alloc_flags);
 		if (error)
 			goto out_agflbp_relse;
 
@@ -2710,7 +2714,7 @@ xfs_alloc_fix_freelist(
 		 * on a completely full ag.
 		 */
 		if (targs.agbno == NULLAGBLOCK) {
-			if (flags & XFS_ALLOC_FLAG_FREEING)
+			if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
 				break;
 			goto out_agflbp_relse;
 		}
@@ -3226,7 +3230,7 @@ xfs_alloc_vextent_check_args(
 static int
 xfs_alloc_vextent_prepare_ag(
 	struct xfs_alloc_arg	*args,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	bool			need_pag = !args->pag;
 	int			error;
@@ -3235,7 +3239,7 @@ xfs_alloc_vextent_prepare_ag(
 		args->pag = xfs_perag_get(args->mp, args->agno);
 
 	args->agbp = NULL;
-	error = xfs_alloc_fix_freelist(args, flags);
+	error = xfs_alloc_fix_freelist(args, alloc_flags);
 	if (error) {
 		trace_xfs_alloc_vextent_nofix(args);
 		if (need_pag)
@@ -3357,6 +3361,7 @@ xfs_alloc_vextent_this_ag(
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	ASSERT(args->pag != NULL);
@@ -3375,9 +3380,9 @@ xfs_alloc_vextent_this_ag(
 		return error;
 	}
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_size(args);
+		error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, false);
 }
@@ -3406,20 +3411,20 @@ xfs_alloc_vextent_iterate_ags(
 	xfs_agnumber_t		minimum_agno,
 	xfs_agnumber_t		start_agno,
 	xfs_agblock_t		target_agbno,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		restart_agno = minimum_agno;
 	xfs_agnumber_t		agno;
 	int			error = 0;
 
-	if (flags & XFS_ALLOC_FLAG_TRYLOCK)
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)
 		restart_agno = 0;
 restart:
 	for_each_perag_wrap_range(mp, start_agno, restart_agno,
 			mp->m_sb.sb_agcount, agno, args->pag) {
 		args->agno = agno;
-		error = xfs_alloc_vextent_prepare_ag(args, flags);
+		error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 		if (error)
 			break;
 		if (!args->agbp) {
@@ -3433,10 +3438,10 @@ xfs_alloc_vextent_iterate_ags(
 		 */
 		if (args->agno == start_agno && target_agbno) {
 			args->agbno = target_agbno;
-			error = xfs_alloc_ag_vextent_near(args);
+			error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 		} else {
 			args->agbno = 0;
-			error = xfs_alloc_ag_vextent_size(args);
+			error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 		}
 		break;
 	}
@@ -3453,8 +3458,8 @@ xfs_alloc_vextent_iterate_ags(
 	 * constraining flags by the caller, drop them and retry the allocation
 	 * without any constraints being set.
 	 */
-	if (flags) {
-		flags = 0;
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) {
+		alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK;
 		restart_agno = minimum_agno;
 		goto restart;
 	}
@@ -3482,6 +3487,7 @@ xfs_alloc_vextent_start_ag(
 	xfs_agnumber_t		start_agno;
 	xfs_agnumber_t		rotorstep = xfs_rotorstep;
 	bool			bump_rotor = false;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3508,7 +3514,7 @@ xfs_alloc_vextent_start_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 
 	if (bump_rotor) {
 		if (args->agno == start_agno)
@@ -3535,6 +3541,7 @@ xfs_alloc_vextent_first_ag(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	xfs_agnumber_t		start_agno;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3553,7 +3560,7 @@ xfs_alloc_vextent_first_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), 0);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, true);
 }
 
@@ -3606,6 +3613,7 @@ xfs_alloc_vextent_near_bno(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	bool			needs_perag = args->pag == NULL;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	if (!needs_perag)
@@ -3626,9 +3634,9 @@ xfs_alloc_vextent_near_bno(
 	if (needs_perag)
 		args->pag = xfs_perag_grab(mp, args->agno);
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_near(args);
+		error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag);
 }
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 85ac470be0da..d1aa7c63eafe 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp,
 		struct xfs_buf **bpp);
 int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t,
 			struct xfs_buf *, struct xfs_owner_info *);
-int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
+int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags);
 int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag,
 		struct xfs_buf **agbp);
 
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index f3d328e4a440..5f44936eae1c 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -571,7 +571,8 @@ void
 xfs_extent_busy_flush(
 	struct xfs_mount	*mp,
 	struct xfs_perag	*pag,
-	unsigned		busy_gen)
+	unsigned		busy_gen,
+	uint32_t		alloc_flags)
 {
 	DEFINE_WAIT		(wait);
 	int			error;
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 4a118131059f..7a82c689bfa4 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 
 void
 xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen);
+	unsigned busy_gen, uint32_t alloc_flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);
-- 
2.40.1



* [PATCH 3/5] xfs: allow extent free intents to be retried
From: Dave Chinner @ 2023-06-20  0:20 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Extent freeing needs to be able to avoid a busy extent deadlock
when the transaction itself holds the only busy extents in the
allocation group. This may occur if we have an EFI that contains
multiple extents to be freed, and freeing the second extent
requires the space released by the first extent free to expand the
AGFL. If we block on the busy extent at this point, we deadlock.

We hold a dirty transaction that contains an entire atomic extent
free operation within it, so if we can abort the extent free
operation and commit the progress that we've made, the busy extent
can be resolved by a log force. Hence we can restart the aborted
extent free with a new transaction and continue to make
progress without risking deadlocks.

To enable this, we need the EFI processing code to be able to handle
an -EAGAIN error to tell it to commit the current transaction and
retry again. This mechanism is already built into the defer ops
processing (used by the refcount btree modification intents), so
there's relatively little handling we need to add to the EFI code to
enable this.
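
The resulting retry flow looks roughly like this standalone sketch
(toy types and functions for illustration only - the real flow runs
through the defer ops infrastructure):

	#include <errno.h>
	#include <stdio.h>

	struct efi { int next, nextents; };

	/* Toy stand-in: extent 1 hits the busy extent deadlock once. */
	static int tries;
	static int free_one_extent(struct efi *efi, int idx)
	{
		if (idx == 1 && tries++ == 0)
			return -EAGAIN;
		printf("freed extent %d\n", idx);
		return 0;
	}

	static int finish_efi(struct efi *efi)
	{
		while (efi->next < efi->nextents) {
			int error = free_one_extent(efi, efi->next);

			/* Commit progress, requeue the remainder. */
			if (error == -EAGAIN)
				return -EAGAIN;
			if (error)
				return error;
			efi->next++;
		}
		return 0;
	}

	int main(void)
	{
		struct efi efi = { .next = 0, .nextents = 3 };

		while (finish_efi(&efi) == -EAGAIN)
			printf("rolling transaction, requeueing from extent %d\n",
			       efi.next);
		return 0;
	}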

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_extfree_item.c | 72 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f9e36b810663..840df1f66a1a 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -336,6 +336,34 @@ xfs_trans_get_efd(
 	return efdp;
 }
 
+/*
+ * Fill the EFD with all extents from the EFI when we need to roll the
+ * transaction and continue with a new EFI.
+ *
+ * This simply copies all the extents in the EFI to the EFD rather than make
+ * assumptions about which extents in the EFI have already been processed. We
+ * currently keep the xefi list in the same order as the EFI extent list, but
+ * that may not always be the case. Copying everything avoids leaving a landmine
+ * where we fail to cancel all the extents in an EFI if the xefi list is
+ * processed in a different order to the extents in the EFI.
+ */
+static void
+xfs_efd_from_efi(
+	struct xfs_efd_log_item	*efdp)
+{
+	struct xfs_efi_log_item *efip = efdp->efd_efip;
+	uint                    i;
+
+	ASSERT(efip->efi_format.efi_nextents > 0);
+	ASSERT(efdp->efd_next_extent < efip->efi_format.efi_nextents);
+
+	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+	       efdp->efd_format.efd_extents[i] =
+		       efip->efi_format.efi_extents[i];
+	}
+	efdp->efd_next_extent = efip->efi_format.efi_nextents;
+}
+
 /*
  * Free an extent and log it to the EFD. Note that the transaction is marked
  * dirty regardless of whether the extent free succeeds or fails to support the
@@ -378,6 +406,17 @@ xfs_trans_free_extent(
 	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
 
+	/*
+	 * If we need a new transaction to make progress, the caller will log a
+	 * new EFI with the current contents. It will also log an EFD to cancel
+	 * the existing EFI, and so we need to copy all the unprocessed extents
+	 * in this EFI to the EFD so this works correctly.
+	 */
+	if (error == -EAGAIN) {
+		xfs_efd_from_efi(efdp);
+		return error;
+	}
+
 	next_extent = efdp->efd_next_extent;
 	ASSERT(next_extent < efdp->efd_format.efd_nextents);
 	extp = &(efdp->efd_format.efd_extents[next_extent]);
@@ -495,6 +534,13 @@ xfs_extent_free_finish_item(
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
 
+	/*
+	 * Don't free the XEFI if we need a new transaction to complete
+	 * processing of it.
+	 */
+	if (error == -EAGAIN)
+		return error;
+
 	xfs_extent_free_put_group(xefi);
 	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
@@ -620,6 +666,7 @@ xfs_efi_item_recover(
 	struct xfs_trans		*tp;
 	int				i;
 	int				error = 0;
+	bool				requeue_only = false;
 
 	/*
 	 * First check the validity of the extents described by the
@@ -652,9 +699,28 @@ xfs_efi_item_recover(
 		fake.xefi_startblock = extp->ext_start;
 		fake.xefi_blockcount = extp->ext_len;
 
-		xfs_extent_free_get_group(mp, &fake);
-		error = xfs_trans_free_extent(tp, efdp, &fake);
-		xfs_extent_free_put_group(&fake);
+		if (!requeue_only) {
+			xfs_extent_free_get_group(mp, &fake);
+			error = xfs_trans_free_extent(tp, efdp, &fake);
+			xfs_extent_free_put_group(&fake);
+		}
+
+		/*
+		 * If we can't free the extent without potentially deadlocking,
+		 * requeue the rest of the extents to a new EFI so that they get
+		 * run again later with a new transaction context.
+		 */
+		if (error == -EAGAIN || requeue_only) {
+			error = xfs_free_extent_later(tp, fake.xefi_startblock,
+					fake.xefi_blockcount,
+					&XFS_RMAP_OINFO_ANY_OWNER);
+			if (!error) {
+				requeue_only = true;
+				error = 0;
+				continue;
+			}
+		}
+
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					extp, sizeof(*extp));
-- 
2.40.1



* [PATCH 4/5] xfs: don't block in busy flushing when freeing extents
From: Dave Chinner @ 2023-06-20  0:20 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

If the current transaction holds a busy extent and we are trying to
allocate a new extent to fix up the free list, we can deadlock if
the AG is entirely empty except for the busy extent held by the
transaction.

This can occur at runtime processing an XEFI with multiple extents
in this path:

__schedule+0x22f at ffffffff81f75e8f
schedule+0x46 at ffffffff81f76366
xfs_extent_busy_flush+0x69 at ffffffff81477d99
xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
__xfs_free_extent+0x99 at ffffffff81419499
xfs_trans_free_extent+0x3e at ffffffff814a6fee
xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
xfs_defer_finish+0x11 at ffffffff814417e1
xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
xfs_inactive_truncate+0xb9 at ffffffff8148bb89
xfs_inactive+0x227 at ffffffff8148c4f7
xfs_fs_destroy_inode+0xb8 at ffffffff81496898
destroy_inode+0x3b at ffffffff8127d2ab
do_unlinkat+0x1d1 at ffffffff81270df1
do_syscall_64+0x40 at ffffffff81f6b5f0
entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c

This can also happen in log recovery when processing an EFI
with multiple extents through this path:

context_switch() kernel/sched/core.c:3881
__schedule() kernel/sched/core.c:5111
schedule() kernel/sched/core.c:5186
xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
xfs_log_mount_finish() fs/xfs/xfs_log.c:764
xfs_mountfs() fs/xfs/xfs_mount.c:978
xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
mount_bdev() fs/super.c:1417
xfs_fs_mount() fs/xfs/xfs_super.c:1985
legacy_get_tree() fs/fs_context.c:647
vfs_get_tree() fs/super.c:1547
do_new_mount() fs/namespace.c:2843
do_mount() fs/namespace.c:3163
ksys_mount() fs/namespace.c:3372
__do_sys_mount() fs/namespace.c:3386
__se_sys_mount() fs/namespace.c:3383
__x64_sys_mount() fs/namespace.c:3383
do_syscall_64() arch/x86/entry/common.c:296
entry_SYSCALL_64() arch/x86/entry/entry_64.S:180

To avoid this deadlock, we should not block in
xfs_extent_busy_flush() if we hold a busy extent in the current
transaction.

Now that the EFI processing code can handle requeuing a partially
completed EFI, we can detect this situation in
xfs_extent_busy_flush() and return -EAGAIN rather than going to
sleep forever. The -EAGAIN gets propagated back out to the
xfs_trans_free_extent() context, where the EFD is populated and the
transaction is rolled, thereby moving the busy extents into the CIL.

At this point, we can retry the extent free operation again with a
clean transaction. If we hit the same "all free extents are busy"
situation when trying to fix up the free list, we can safely call
xfs_extent_busy_flush() and wait for the busy extents to resolve
and wake us. Once they do, the allocation search can make progress
again and we can fix up the free list.
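
The decision ladder this adds boils down to the following sketch.
The flag names follow this patch; everything else is a simplified
stand-in for the real logic in xfs_extent_busy_flush():

	#include <errno.h>
	#include <stdbool.h>

	#define XFS_ALLOC_FLAG_FREEING	(1U << 1)
	#define XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)

	/*
	 * Returns 1 when it is safe to sleep waiting for busy extents
	 * to resolve, 0 when the caller should simply retry the
	 * allocation, and -EAGAIN when the caller must commit its own
	 * busy extents before retrying.
	 */
	static int
	busy_flush_decision(bool trans_holds_busy, bool progress_made,
			unsigned int alloc_flags)
	{
		if (!trans_holds_busy)
			return 1;	/* log force resolves everything */
		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
			return 0;	/* one quick retry before blocking */
		if (progress_made)
			return 0;	/* other busy extents resolved */
		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
			return -EAGAIN;	/* would deadlock: commit first */
		return 1;	/* allocation path: waiting is safe */
	}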

This deadlock was first reported by Chandan in mid-2021, but I
couldn't make myself understood during review, and didn't have time
to fix it myself.

It was reported again in March 2023, and again I have found myself
unable to explain the complexities of the solution needed during
review.

As such, I don't have hours more time to waste trying to get the
fix written the way it needs to be written, so I'm just doing it
myself. This patchset is largely based on Wengang Wang's last patch,
but with all the unnecessary stuff removed, split up into multiple
patches and cleaned up somewhat.

Reported-by: Chandan Babu R <chandanrlinux@gmail.com>
Reported-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++-----------
 fs/xfs/libxfs/xfs_alloc.h | 11 ++++---
 fs/xfs/xfs_extent_busy.c  | 33 ++++++++++++++++---
 fs/xfs/xfs_extent_busy.h  |  6 ++--
 4 files changed, 88 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 11bd0a1756a1..7c675aae0a0f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near(
 	if (args->agbno > args->max_agbno)
 		args->agbno = args->max_agbno;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	len = 0;
 
@@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near(
 	 */
 	if (!acur.len) {
 		if (acur.busy) {
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_near_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen, alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					acur.busy_gen, alloc_flags);
+			if (error)
+				goto out;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size(
 	int			error;
 	int			i;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	/*
 	 * Allocate and initialize a cursor for the by-size btree.
@@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size(
 			error = xfs_btree_increment(cnt_cur, 0, &i);
 			if (error)
 				goto error0;
-			if (i == 0) {
-				/*
-				 * Our only valid extents must have been busy.
-				 * Make it unbusy by forcing the log out and
-				 * retrying.
-				 */
-				xfs_btree_del_cursor(cnt_cur,
-						     XFS_BTREE_NOERROR);
-				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp, args->pag,
-						busy_gen, alloc_flags);
-				goto restart;
-			}
+			if (i)
+				continue;
+
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
+			trace_xfs_alloc_size_busy(args);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			goto restart;
 		}
 	}
 
@@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size(
 	args->len = rlen;
 	if (rlen < args->minlen) {
 		if (busy) {
-			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
-					alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			goto restart;
 		}
 		goto out_nominleft;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index d1aa7c63eafe..f267842e36ba 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
 /*
  * Flags for xfs_alloc_fix_freelist.
  */
-#define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
-#define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
-#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
-#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
-#define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYLOCK	(1U << 0)  /* use trylock for buffer locking */
+#define	XFS_ALLOC_FLAG_FREEING	(1U << 1)  /* indicate caller is freeing extents*/
+#define	XFS_ALLOC_FLAG_NORMAP	(1U << 2)  /* don't modify the rmapbt */
+#define	XFS_ALLOC_FLAG_NOSHRINK	(1U << 3)  /* don't shrink the freelist */
+#define	XFS_ALLOC_FLAG_CHECK	(1U << 4)  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)  /* don't wait in busy extent flush */
 
 /*
  * Argument structure for xfs_alloc routines.
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 5f44936eae1c..7c2fdc71e42d 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -566,10 +566,21 @@ xfs_extent_busy_clear(
 
 /*
  * Flush out all busy extents for this AG.
+ *
+ * If the current transaction is holding busy extents, the caller may not want
+ * to wait for committed busy extents to resolve. If we are being told just to
+ * try a flush or progress has been made since we last skipped a busy extent,
+ * return immediately to allow the caller to try again.
+ *
+ * If we are freeing extents, we might actually be holding the only free extents
+ * in the transaction busy list and the log force won't resolve that situation.
+ * In this case, we must return -EAGAIN to avoid a deadlock by informing the
+ * caller it needs to commit the busy extents it holds before retrying the
+ * extent free operation.
  */
-void
+int
 xfs_extent_busy_flush(
-	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
 	struct xfs_perag	*pag,
 	unsigned		busy_gen,
 	uint32_t		alloc_flags)
@@ -577,10 +588,23 @@ xfs_extent_busy_flush(
 	DEFINE_WAIT		(wait);
 	int			error;
 
-	error = xfs_log_force(mp, XFS_LOG_SYNC);
+	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
 	if (error)
-		return;
+		return error;
 
+	/* Avoid deadlocks on uncommitted busy extents. */
+	if (!list_empty(&tp->t_busy)) {
+		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
+			return 0;
+
+		if (busy_gen != READ_ONCE(pag->pagb_gen))
+			return 0;
+
+		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
+			return -EAGAIN;
+	}
+
+	/* Wait for committed busy extents to resolve. */
 	do {
 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
 		if  (busy_gen != READ_ONCE(pag->pagb_gen))
@@ -589,6 +613,7 @@ xfs_extent_busy_flush(
 	} while (1);
 
 	finish_wait(&pag->pagb_wait, &wait);
+	return 0;
 }
 
 void
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 7a82c689bfa4..c37bf87e6781 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -51,9 +51,9 @@ bool
 xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 		xfs_extlen_t *len, unsigned *busy_gen);
 
-void
-xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen, uint32_t alloc_flags);
+int
+xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
+		unsigned busy_gen, uint32_t alloc_flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);
-- 
2.40.1



* [PATCH 5/5] xfs: journal geometry is not properly bounds checked
From: Dave Chinner @ 2023-06-20  0:20 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

If the journal geometry results in a sector or log stripe unit
validation problem, it indicates that we cannot set the log up to
safely write to the journal. In these cases, we must abort the
mount because the corruption needs external intervention to resolve.
Similarly, a journal that is too large cannot be written to safely
either, so we shouldn't allow those geometries to mount.

If the log is too small, we risk having transaction reservations
overrunning the available log space and the system hanging waiting
for space it can never provide. This is purely a runtime hang issue,
not a corruption issue as per the first cases listed above. We abort
mounts if the log is too small on V5 filesystems, but we must allow
V4 filesystems to mount because, historically, there was no log size
validity checking and so some systems may still be out there with
undersized logs.

The problem is that on V4 filesystems, when we discover a log
geometry problem, we skip all the remaining checks and then allow
the log to continue mounting. This means that if one of the log size
checks fails, we skip the log stripe unit check. i.e. we allow the
mount because a "non-fatal" geometry constraint is violated, and then
fail to check the hard-fail geometry constraints that should fail the
mount.

Move all these fatal checks to the superblock verifier, and add a
new check for the two log sector size geometry variables having the
same values. This will prevent any attempt to mount a log that has
invalid or inconsistent geometries long before we attempt to mount
the log.
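
The new sector size consistency rule amounts to the following check
(an illustrative sketch using the superblock field names; this is
not the verifier code itself):

	#include <stdbool.h>
	#include <stdint.h>

	static bool
	log_sector_geometry_valid(bool has_sector_bit,
			uint16_t sb_logsectsize, uint8_t sb_logsectlog)
	{
		/* Byte and log2 forms must describe the same size. */
		if (has_sector_bit)
			return sb_logsectsize == (1U << sb_logsectlog);

		/* Without the feature bit, both fields must be zero. */
		return sb_logsectsize == 0 && sb_logsectlog == 0;
	}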

However, for the minimum log size checks, we can only do that once
we've set up the log and calculated all the iclog sizes and
roundoffs. Hence this check needs to remain in the log mount code
after the log has been initialised. It is also the only case where we
should allow a V4 filesystem to continue running, so leave that
handling in place, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_sb.c | 56 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_log.c       | 47 +++++++++++------------------------
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index ba0f17bc1dc0..5e174685a77c 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -412,7 +412,6 @@ xfs_validate_sb_common(
 	    sbp->sb_inodelog < XFS_DINODE_MIN_LOG			||
 	    sbp->sb_inodelog > XFS_DINODE_MAX_LOG			||
 	    sbp->sb_inodesize != (1 << sbp->sb_inodelog)		||
-	    sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE			||
 	    sbp->sb_inopblock != howmany(sbp->sb_blocksize,sbp->sb_inodesize) ||
 	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) < XFS_MIN_AG_BYTES	||
 	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) > XFS_MAX_AG_BYTES	||
@@ -430,6 +429,61 @@ xfs_validate_sb_common(
 		return -EFSCORRUPTED;
 	}
 
+	/*
+	 * Logs that are too large are not supported at all. Reject them
+	 * outright. Logs that are too small are tolerated on v4 filesystems,
+	 * but we can only check that when mounting the log. Hence we skip
+	 * those checks here.
+	 */
+	if (sbp->sb_logblocks > XFS_MAX_LOG_BLOCKS) {
+		xfs_notice(mp,
+		"Log size 0x%x blocks too large, maximum size is 0x%llx blocks",
+			 sbp->sb_logblocks, XFS_MAX_LOG_BLOCKS);
+		return -EFSCORRUPTED;
+	}
+
+	if (XFS_FSB_TO_B(mp, sbp->sb_logblocks) > XFS_MAX_LOG_BYTES) {
+		xfs_warn(mp,
+		"log size 0x%llx bytes too large, maximum size is 0x%llx bytes",
+			 XFS_FSB_TO_B(mp, sbp->sb_logblocks),
+			 XFS_MAX_LOG_BYTES);
+		return -EFSCORRUPTED;
+	}
+
+	/*
+	 * Do not allow filesystems with corrupted log sector or stripe units to
+	 * be mounted. We cannot safely size the iclogs or write to the log if
+	 * the log stripe unit is not valid.
+	 */
+	if (sbp->sb_versionnum & XFS_SB_VERSION_SECTORBIT) {
+		if (sbp->sb_logsectsize != (1U << sbp->sb_logsectlog)) {
+			xfs_notice(mp,
+			"log sector size in bytes/log2 (0x%x/0x%x) must match",
+				sbp->sb_logsectsize, 1U << sbp->sb_logsectlog);
+			return -EFSCORRUPTED;
+		}
+	} else if (sbp->sb_logsectsize || sbp->sb_logsectlog) {
+		xfs_notice(mp,
+		"log sector size in bytes/log2 (0x%x/0x%x) are not zero",
+			sbp->sb_logsectsize, sbp->sb_logsectlog);
+		return -EFSCORRUPTED;
+	}
+
+	if (sbp->sb_logsunit > 1) {
+		if (sbp->sb_logsunit % sbp->sb_blocksize) {
+			xfs_notice(mp,
+		"log stripe unit 0x%x bytes must be a multiple of block size",
+				sbp->sb_logsunit);
+			return -EFSCORRUPTED;
+		}
+		if (sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE) {
+			xfs_notice(mp,
+		"log stripe unit 0x%x bytes over maximum size (0x%x bytes)",
+				sbp->sb_logsunit, XLOG_MAX_RECORD_BSIZE);
+			return -EFSCORRUPTED;
+		}
+	}
+
 	/* Validate the realtime geometry; stolen from xfs_repair */
 	if (sbp->sb_rextsize * sbp->sb_blocksize > XFS_MAX_RTEXTSIZE ||
 	    sbp->sb_rextsize * sbp->sb_blocksize < XFS_MIN_RTEXTSIZE) {
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fc61cc024023..79004d193e54 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -639,7 +639,6 @@ xfs_log_mount(
 	int		num_bblks)
 {
 	struct xlog	*log;
-	bool		fatal = xfs_has_crc(mp);
 	int		error = 0;
 	int		min_logfsbs;
 
@@ -663,53 +662,37 @@ xfs_log_mount(
 	mp->m_log = log;
 
 	/*
-	 * Validate the given log space and drop a critical message via syslog
-	 * if the log size is too small that would lead to some unexpected
-	 * situations in transaction log space reservation stage.
+	 * Now that we have set up the log and its internal geometry
+	 * parameters, we can validate the given log space and drop a critical
+	 * message via syslog if the log size is too small. A log that is too
+	 * small can lead to unexpected situations in the transaction log space
+	 * reservation stage. The superblock verifier has already validated all
+	 * the other log geometry constraints, so we don't have to check those
+	 * here.
 	 *
-	 * Note: we can't just reject the mount if the validation fails.  This
-	 * would mean that people would have to downgrade their kernel just to
-	 * remedy the situation as there is no way to grow the log (short of
-	 * black magic surgery with xfs_db).
+	 * Note: For v4 filesystems, we can't just reject the mount if the
+	 * validation fails.  This would mean that people would have to
+	 * downgrade their kernel just to remedy the situation as there is no
+	 * way to grow the log (short of black magic surgery with xfs_db).
 	 *
-	 * We can, however, reject mounts for CRC format filesystems, as the
+	 * We can, however, reject mounts for V5 format filesystems, as the
 	 * mkfs binary being used to make the filesystem should never create a
 	 * filesystem with a log that is too small.
 	 */
 	min_logfsbs = xfs_log_calc_minimum_size(mp);
-
 	if (mp->m_sb.sb_logblocks < min_logfsbs) {
 		xfs_warn(mp,
 		"Log size %d blocks too small, minimum size is %d blocks",
 			 mp->m_sb.sb_logblocks, min_logfsbs);
-		error = -EINVAL;
-	} else if (mp->m_sb.sb_logblocks > XFS_MAX_LOG_BLOCKS) {
-		xfs_warn(mp,
-		"Log size %d blocks too large, maximum size is %lld blocks",
-			 mp->m_sb.sb_logblocks, XFS_MAX_LOG_BLOCKS);
-		error = -EINVAL;
-	} else if (XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks) > XFS_MAX_LOG_BYTES) {
-		xfs_warn(mp,
-		"log size %lld bytes too large, maximum size is %lld bytes",
-			 XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks),
-			 XFS_MAX_LOG_BYTES);
-		error = -EINVAL;
-	} else if (mp->m_sb.sb_logsunit > 1 &&
-		   mp->m_sb.sb_logsunit % mp->m_sb.sb_blocksize) {
-		xfs_warn(mp,
-		"log stripe unit %u bytes must be a multiple of block size",
-			 mp->m_sb.sb_logsunit);
-		error = -EINVAL;
-		fatal = true;
-	}
-	if (error) {
+
 		/*
 		 * Log check errors are always fatal on v5; or whenever bad
 		 * metadata leads to a crash.
 		 */
-		if (fatal) {
+		if (xfs_has_crc(mp)) {
 			xfs_crit(mp, "AAIEEE! Log failed size checks. Abort!");
 			ASSERT(0);
+			error = -EINVAL;
 			goto out_free_log;
 		}
 		xfs_crit(mp, "Log size out of supported range.");
-- 
2.40.1



* Re: [PATCH 4/5] xfs: don't block in busy flushing when freeing extents
From: Chandan Babu R @ 2023-06-20 14:53 UTC
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> If the current transaction holds a busy extent and we are trying to
> allocate a new extent to fix up the free list, we can deadlock if
> the AG is entirely empty except for the busy extent held by the
> transaction.
....
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index 5f44936eae1c..7c2fdc71e42d 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -566,10 +566,21 @@ xfs_extent_busy_clear(
>  
>  /*
>   * Flush out all busy extents for this AG.
> + *
> + * If the current transaction is holding busy extents, the caller may not want
> + * to wait for committed busy extents to resolve. If we are being told just to
> + * try a flush or progress has been made since we last skipped a busy extent,
> + * return immediately to allow the caller to try again.
> + *
> + * If we are freeing extents, we might actually be holding the only free extents
> + * in the transaction busy list and the log force won't resolve that situation.
> + * In this case, we must return -EAGAIN to avoid a deadlock by informing the
> + * caller it needs to commit the busy extents it holds before retrying the
> + * extent free operation.
>   */
> -void
> +int
>  xfs_extent_busy_flush(
> -	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
>  	struct xfs_perag	*pag,
>  	unsigned		busy_gen,
>  	uint32_t		alloc_flags)
> @@ -577,10 +588,23 @@ xfs_extent_busy_flush(
>  	DEFINE_WAIT		(wait);
>  	int			error;
>  
> -	error = xfs_log_force(mp, XFS_LOG_SYNC);
> +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>  	if (error)
> -		return;
> +		return error;
>  
> +	/* Avoid deadlocks on uncommitted busy extents. */
> +	if (!list_empty(&tp->t_busy)) {
> +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
> +			return 0;
> +
> +		if (busy_gen != READ_ONCE(pag->pagb_gen))
> +			return 0;
> +
> +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
> +			return -EAGAIN;
> +	}

In the case where a task is freeing an ondisk inode, an ifree transaction can
invoke __xfs_inobt_free_block() twice: once to free the inobt's leaf block and
a second time to free its immediate parent block.

The first call to __xfs_inobt_free_block() adds the freed extent into the
transaction's busy list and also into the per-ag rb tree that tracks busy
extents. Freeing the second inobt block could then lead to the following
sequence of function calls,

__xfs_free_extent() => xfs_free_extent_fix_freelist() =>
xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size()

Here, xfs_extent_busy_flush() is invoked with the XFS_ALLOC_FLAG_TRYFLUSH
flag set. xfs_extent_busy_flush() flushes the CIL and returns to
xfs_alloc_ag_vextent_size() since XFS_ALLOC_FLAG_TRYFLUSH was set.
xfs_alloc_ag_vextent_size() then invokes xfs_extent_busy_flush() once
again, this time without the XFS_ALLOC_FLAG_TRYFLUSH flag set. Now
xfs_extent_busy_flush() returns -EAGAIN since the XFS_ALLOC_FLAG_FREEING
flag is set.

This will ultimately cause xfs_inactive_ifree() to cancel a dirty
transaction and shut down the filesystem.

Please note that the above narrative could be incorrect. However, I have
browsed the source code multiple times and have yet to find any evidence
to the contrary.
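
To make the two back-to-back flush calls concrete, here is a minimal
userspace model of the early-return logic this hunk adds (a toy sketch,
not kernel code; busy_flush_model() and its boolean parameters are
hypothetical stand-ins for the transaction and per-ag state):

	#include <errno.h>
	#include <stdbool.h>

	#define XFS_ALLOC_FLAG_FREEING	(1U << 1)
	#define XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)

	/* Models the early returns added to xfs_extent_busy_flush(). */
	static int
	busy_flush_model(
		bool		trans_holds_busy, /* !list_empty(&tp->t_busy) */
		bool		progress_made,	  /* busy_gen != pag->pagb_gen */
		unsigned int	alloc_flags)
	{
		/* the sync log force is elided from this model */
		if (trans_holds_busy) {
			if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
				return 0;	/* first call: flush only */
			if (progress_made)
				return 0;	/* other busy extents resolved */
			if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
				return -EAGAIN;	/* commit needed to progress */
		}
		/* otherwise wait for committed busy extents to resolve */
		return 0;
	}

In the scenario above, the first call has TRYFLUSH set and returns 0; the
caller clears TRYFLUSH and retries, no other busy extents have resolved,
and FREEING is set - hence the -EAGAIN on the second call.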

> +
> +	/* Wait for committed busy extents to resolve. */
>  	do {
>  		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>  		if  (busy_gen != READ_ONCE(pag->pagb_gen))
> @@ -589,6 +613,7 @@ xfs_extent_busy_flush(
>  	} while (1);
>  
>  	finish_wait(&pag->pagb_wait, &wait);
> +	return 0;
>  }
>  
>  void
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 7a82c689bfa4..c37bf87e6781 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -51,9 +51,9 @@ bool
>  xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>  		xfs_extlen_t *len, unsigned *busy_gen);
>  
> -void
> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> -	unsigned busy_gen, uint32_t alloc_flags);
> +int
> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
> +		unsigned busy_gen, uint32_t alloc_flags);
>  
>  void
>  xfs_extent_busy_wait_all(struct xfs_mount *mp);


-- 
chandan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 4/5] xfs: don't block in busy flushing when freeing extents
  2023-06-20 14:53   ` Chandan Babu R
@ 2023-06-20 22:18     ` Dave Chinner
  2023-06-22  3:59       ` Chandan Babu R
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2023-06-20 22:18 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs

On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote:
> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > If the current transaction holds a busy extent and we are trying to
> > allocate a new extent to fix up the free list, we can deadlock if
> > the AG is entirely empty except for the busy extent held by the
> > transaction.
....
> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush(
> >  	DEFINE_WAIT		(wait);
> >  	int			error;
> >  
> > -	error = xfs_log_force(mp, XFS_LOG_SYNC);
> > +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> >  	if (error)
> > -		return;
> > +		return error;
> >  
> > +	/* Avoid deadlocks on uncommitted busy extents. */
> > +	if (!list_empty(&tp->t_busy)) {
> > +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
> > +			return 0;
> > +
> > +		if (busy_gen != READ_ONCE(pag->pagb_gen))
> > +			return 0;
> > +
> > +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
> > +			return -EAGAIN;
> > +	}
> 
> In the case where a task is freeing an ondisk inode, an ifree transaction can
> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and
> the next call to free its immediate parent block.
> 
> The first call to __xfs_inobt_free_block() adds the freed extent into the
> transaction's busy list and also into the per-ag rb tree tracking the busy
> extent. Freeing the second inobt block could lead to the following sequence of
> function calls,
> 
> __xfs_free_extent() => xfs_free_extent_fix_freelist() =>
> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size()

Yes, I think you might be right. I checked inode chunks - they are
freed from this path via:

xfs_ifree
  xfs_difree
    xfs_difree_inobt
      xfs_difree_inobt_chunk
        xfs_free_extent_later
	  <queues an XEFI for deferred freeing>

And I didn't think about the inobt blocks themselves because freeing
an inode can require allocation of finobt blocks and hence there's a
transaction reservation for block allocation on finobt enabled
filesystems. i.e. freeing can't proceed unless there is some amount
of free blocks available, and that's why the finobt has an amount of
per-ag space reserved for it.

Hence, for finobt enabled filesystems, I don't think we can ever get
down to a completely empty AG and an AGFL that needs refilling from
the inode path - the metadata reserve doesn't allow the AG to be
completely emptied in the way that is needed for this bug to
manifest.

Yes, I think it is still possible for all the free space to be busy,
and so when online discard is enabled we need to do the busy wait
after the log force to avoid that. However, for non-discard
filesystems the sync log force is all that is needed to resolve busy
extents outside the current transaction, so this wouldn't be an
issue for the current patchset.

I suspect that is why I haven't seen issues on v5 filesystems,
though I also haven't seen issues on v4 filesystems, which have
neither the finobt per-ag metadata reservation nor the space
reservation at transaction reservation time. I know that the fstests
enospc group
is exercising the busy flush code, but I doubt that it was exercised
through the inode btree block freeing path...

I note that the refcount btree block freeing path also calls
xfs_free_extent(). This might be OK, because refcount btree updates
get called from deferred intent processing, and hence the EAGAIN
will trigger a transaction roll and retry correctly.
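
For reference, that roll-and-retry behaviour looks roughly like this in
the deferred-ops loop (simplified pseudocode; intents_remain() and
finish_one_intent() are hypothetical stand-ins for the xfs_defer
machinery, and the xfs_defer_trans_roll() call is used loosely):

	int
	defer_finish_model(
		struct xfs_trans	**tp)
	{
		int			error;

		while (intents_remain(*tp)) {
			error = finish_one_intent(*tp); /* may free extents */
			if (error == -EAGAIN) {
				/*
				 * The free ran into uncommitted busy
				 * extents held by this transaction. Roll
				 * the transaction to commit them, then
				 * retry the same intent.
				 */
				error = xfs_defer_trans_roll(tp);
				if (error)
					return error;
				continue;
			}
			if (error)
				return error;
		}
		return 0;
	}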

I suspect, however, that both of these paths should simply call
xfs_free_extent_later() to queue an XEFI for deferred processing,
and that takes the entire extent freeing path out from under the
btree operations. 
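
As an illustration only, the inobt side of that conversion might look
something like this (a hypothetical sketch assuming an
xfs_free_extent_later(tp, bno, len, oinfo) signature; the real
preparation patch may differ, and how the per-AG reservation type
propagates to the deferred free is left open here):

	STATIC int
	__xfs_inobt_free_block(
		struct xfs_btree_cur	*cur,
		struct xfs_buf		*bp,
		enum xfs_ag_resv_type	resv)
	{
		xfs_fsblock_t		fsbno;

		xfs_inobt_mod_blockcount(cur, -1);
		fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
		/*
		 * Queue an XEFI for deferred processing instead of
		 * freeing the extent directly; resv handling elided.
		 */
		return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
				&XFS_RMAP_OINFO_INOBT);
	}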

I'll look into that. Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 4/5] xfs: don't block in busy flushing when freeing extents
  2023-06-20 22:18     ` Dave Chinner
@ 2023-06-22  3:59       ` Chandan Babu R
  2023-06-22  5:28         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Chandan Babu R @ 2023-06-22  3:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 21, 2023 at 08:18:01 AM +1000, Dave Chinner wrote:
> On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote:
>> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote:
>> > From: Dave Chinner <dchinner@redhat.com>
>> >
>> > If the current transaction holds a busy extent and we are trying to
>> > allocate a new extent to fix up the free list, we can deadlock if
>> > the AG is entirely empty except for the busy extent held by the
>> > transaction.
> ....
>> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush(
>> >  	DEFINE_WAIT		(wait);
>> >  	int			error;
>> >  
>> > -	error = xfs_log_force(mp, XFS_LOG_SYNC);
>> > +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> >  	if (error)
>> > -		return;
>> > +		return error;
>> >  
>> > +	/* Avoid deadlocks on uncommitted busy extents. */
>> > +	if (!list_empty(&tp->t_busy)) {
>> > +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
>> > +			return 0;
>> > +
>> > +		if (busy_gen != READ_ONCE(pag->pagb_gen))
>> > +			return 0;
>> > +
>> > +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
>> > +			return -EAGAIN;
>> > +	}
>> 
>> In the case where a task is freeing an ondisk inode, an ifree transaction can
>> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and
>> the next call to free its immediate parent block.
>> 
>> The first call to __xfs_inobt_free_block() adds the freed extent into the
>> transaction's busy list and also into the per-ag rb tree tracking the busy
>> extent. Freeing the second inobt block could lead to the following sequence of
>> function calls,
>> 
>> __xfs_free_extent() => xfs_free_extent_fix_freelist() =>
>> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size()
>
> Yes, I think you might be right. I checked inode chunks - they are
> freed from this path via:
>
> xfs_ifree
>   xfs_difree
>     xfs_difree_inobt
>       xfs_difree_inobt_chunk
>         xfs_free_extent_later
> 	  <queues an XEFI for deferred freeing>
>
> And I didn't think about the inobt blocks themselves because freeing
> an inode can require allocation of finobt blocks and hence there's a
> transaction reservation for block allocation on finobt enabled
> filesystems. i.e. freeing can't proceed unless there is some amount
> of free blocks available, and that's why the finobt has an amount of
> per-ag space reserved for it.
>
> Hence, for finobt enabled filesystems, I don't think we can ever get
> down to a completely empty AG and an AGFL that needs refilling from
> the inode path - the metadata reserve doesn't allow the AG to be
> completely emptied in the way that is needed for this bug to
> manifest.
>
> Yes, I think it is still possible for all the free space to be busy,
> and so when online discard is enabled we need to do the busy wait
> after the log force to avoid that. However, for non-discard
> filesystems the sync log force is all that is needed to resolve busy
> extents outside the current transaction, so this wouldn't be an
> issue for the current patchset.

Are you planning to post a new version of this patchset which would solve
the possible cancellation of a dirty transaction when freeing inobt
blocks?  If not, I will spend some time reviewing the current version of
the patchset.

>
> I suspect that is why I haven't seen issues on v5 filesystems,
> though I also haven't seen issues on v4 filesystems that don't have
> the finobt per-ag metadata reservation nor the space reservation at
> transaction reservation time. I know that the fstests enospc group
> is exercising the busy flush code, but I doubt that it was exercised
> through the inode btree block freeing path...
>
> I note that the refcount btree block freeing path also call
> xfs_free_extent(). This might be OK, because refcount btree updates
> get called from deferred intent processing, and hence the EAGAIN
> will trigger a transaction roll and retry correctly.
>
> I suspect, however, that both of these paths should simply call
> xfs_free_extent_later() to queue an XEFI for deferred processing,
> and that takes the entire extent freeing path out from under the
> btree operations. 
>
> I'll look into that. Thanks!

-- 
chandan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 4/5] xfs: don't block in busy flushing when freeing extents
  2023-06-22  3:59       ` Chandan Babu R
@ 2023-06-22  5:28         ` Dave Chinner
  2023-06-22  5:36           ` Chandan Babu R
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2023-06-22  5:28 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs

On Thu, Jun 22, 2023 at 09:29:46AM +0530, Chandan Babu R wrote:
> On Wed, Jun 21, 2023 at 08:18:01 AM +1000, Dave Chinner wrote:
> > On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote:
> >> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote:
> >> > From: Dave Chinner <dchinner@redhat.com>
> >> >
> >> > If the current transaction holds a busy extent and we are trying to
> >> > allocate a new extent to fix up the free list, we can deadlock if
> >> > the AG is entirely empty except for the busy extent held by the
> >> > transaction.
> > ....
> >> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush(
> >> >  	DEFINE_WAIT		(wait);
> >> >  	int			error;
> >> >  
> >> > -	error = xfs_log_force(mp, XFS_LOG_SYNC);
> >> > +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> >> >  	if (error)
> >> > -		return;
> >> > +		return error;
> >> >  
> >> > +	/* Avoid deadlocks on uncommitted busy extents. */
> >> > +	if (!list_empty(&tp->t_busy)) {
> >> > +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
> >> > +			return 0;
> >> > +
> >> > +		if (busy_gen != READ_ONCE(pag->pagb_gen))
> >> > +			return 0;
> >> > +
> >> > +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
> >> > +			return -EAGAIN;
> >> > +	}
> >> 
> >> In the case where a task is freeing an ondisk inode, an ifree transaction can
> >> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and
> >> the next call to free its immediate parent block.
> >> 
> >> The first call to __xfs_inobt_free_block() adds the freed extent into the
> >> transaction's busy list and also into the per-ag rb tree tracking the busy
> >> extent. Freeing the second inobt block could lead to the following sequence of
> >> function calls,
> >> 
> >> __xfs_free_extent() => xfs_free_extent_fix_freelist() =>
> >> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size()
> >
> > Yes, I think you might be right. I checked inode chunks - they are
> > freed from this path via:
> >
> > xfs_ifree
> >   xfs_difree
> >     xfs_difree_inobt
> >       xfs_difree_inobt_chunk
> >         xfs_free_extent_later
> > 	  <queues an XEFI for deferred freeing>
> >
> > And I didn't think about the inobt blocks themselves because freeing
> > an inode can require allocation of finobt blocks and hence there's a
> > transaction reservation for block allocation on finobt enabled
> > filesystems. i.e. freeing can't proceed unless there is some amount
> > of free blocks available, and that's why the finobt has an amount of
> > per-ag space reserved for it.
> >
> > Hence, for finobt enabled filesystems, I don't think we can ever get
> > down to a completely empty AG and an AGFL that needs refilling from
> > the inode path - the metadata reserve doesn't allow the AG to be
> > completely emptied in the way that is needed for this bug to
> > manifest.
> >
> > Yes, I think it is still possible for all the free space to be busy,
> > and so when online discard is enabled we need to do the busy wait
> > after the log force to avoid that. However, for non-discard
> > filesystems the sync log force is all that is needed to resolve busy
> > extents outside the current transaction, so this wouldn't be an
> > issue for the current patchset.
> 
> Are you planning to post a new version of this patchset which would solve the
> possible cancellation of dirty transaction during freeing inobt blocks?  If
> not, I will spend some time to review the current version of the patchset.

I'm working on moving all the direct calls to xfs_free_extent to use
xfs_free_extent_later(). It will be a totally separate preparation
patch for the series, so everything else in the patchset should
largely remain unchanged.

I haven't finished the patch or tested it yet, but to give you an
idea of what and why, the commit message currently reads:

   xfs: use deferred frees for btree block freeing
   
   Btrees that aren't freespace management trees use the normal extent
   allocation and freeing routines for their blocks. Hence when a btree
   block is freed, a direct call to xfs_free_extent() is made and the
   extent is immediately freed. This puts the free space management
   btrees under this path, so we are stacking btrees on
   btrees in the call stack. The inobt, finobt and refcount btrees
   all do this.
   
   However, the bmap btree does not do this - it calls
   xfs_free_extent_later() to defer the extent free operation via an
   XEFI and hence it gets processed in deferred operation processing
   during the commit of the primary transaction (i.e. via intent
   chaining).
   
   We need to change xfs_free_extent() to behave in a non-blocking
   manner so that we can avoid deadlocks with busy extents near ENOSPC
   in transactions that free multiple extents. Inserting or removing a
   record from a btree can cause a multi-level tree merge operation and
   that will free multiple blocks from the btree in a single
   transaction. i.e. we can call xfs_free_extent() multiple times, and
   hence the btree manipulation transaction is vulnerable to this busy
   extent deadlock vector.
   
   To fix this, convert all the remaining callers of xfs_free_extent()
   to use xfs_free_extent_later() to queue XEFIs and hence defer
   processing of the extent frees to a context that can be safely
   restarted if a deadlock condition is detected.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 4/5] xfs: don't block in busy flushing when freeing extents
  2023-06-22  5:28         ` Dave Chinner
@ 2023-06-22  5:36           ` Chandan Babu R
  0 siblings, 0 replies; 11+ messages in thread
From: Chandan Babu R @ 2023-06-22  5:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 22, 2023 at 03:28:28 PM +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 09:29:46AM +0530, Chandan Babu R wrote:
>> On Wed, Jun 21, 2023 at 08:18:01 AM +1000, Dave Chinner wrote:
>> > On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote:
>> >> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote:
>> >> > From: Dave Chinner <dchinner@redhat.com>
>> >> >
>> >> > If the current transaction holds a busy extent and we are trying to
>> >> > allocate a new extent to fix up the free list, we can deadlock if
>> >> > the AG is entirely empty except for the busy extent held by the
>> >> > transaction.
>> > ....
>> >> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush(
>> >> >  	DEFINE_WAIT		(wait);
>> >> >  	int			error;
>> >> >  
>> >> > -	error = xfs_log_force(mp, XFS_LOG_SYNC);
>> >> > +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> >> >  	if (error)
>> >> > -		return;
>> >> > +		return error;
>> >> >  
>> >> > +	/* Avoid deadlocks on uncommitted busy extents. */
>> >> > +	if (!list_empty(&tp->t_busy)) {
>> >> > +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
>> >> > +			return 0;
>> >> > +
>> >> > +		if (busy_gen != READ_ONCE(pag->pagb_gen))
>> >> > +			return 0;
>> >> > +
>> >> > +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
>> >> > +			return -EAGAIN;
>> >> > +	}
>> >> 
>> >> In the case where a task is freeing an ondisk inode, an ifree transaction can
>> >> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and
>> >> the next call to free its immediate parent block.
>> >> 
>> >> The first call to __xfs_inobt_free_block() adds the freed extent into the
>> >> transaction's busy list and also into the per-ag rb tree tracking the busy
>> >> extent. Freeing the second inobt block could lead to the following sequence of
>> >> function calls,
>> >> 
>> >> __xfs_free_extent() => xfs_free_extent_fix_freelist() =>
>> >> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size()
>> >
>> > Yes, I think you might be right. I checked inode chunks - they are
>> > freed from this path via:
>> >
>> > xfs_ifree
>> >   xfs_difree
>> >     xfs_difree_inobt
>> >       xfs_difree_inobt_chunk
>> >         xfs_free_extent_later
>> > 	  <queues an XEFI for deferred freeing>
>> >
>> > And I didn't think about the inobt blocks themselves because freeing
>> > an inode can require allocation of finobt blocks and hence there's a
>> > transaction reservation for block allocation on finobt enabled
>> > filesystems. i.e. freeing can't proceed unless there is some amount
>> > of free blocks available, and that's why the finobt has an amount of
>> > per-ag space reserved for it.
>> >
>> > Hence, for finobt enabled filesystems, I don't think we can ever get
>> > down to a completely empty AG and an AGFL that needs refilling from
>> > the inode path - the metadata reserve doesn't allow the AG to be
>> > completely emptied in the way that is needed for this bug to
>> > manifest.
>> >
>> > Yes, I think it is still possible for all the free space to be busy,
>> > and so when online discard is enabled we need to do the busy wait
>> > after the log force to avoid that. However, for non-discard
>> > filesystems the sync log force is all that is needed to resolve busy
>> > extents outside the current transaction, so this wouldn't be an
>> > issue for the current patchset.
>> 
>> Are you planning to post a new version of this patchset which would solve the
>> possible cancellation of dirty transaction during freeing inobt blocks?  If
>> not, I will spend some time to review the current version of the patchset.
>
> I'm working on moving all the direct calls to xfs_free_extent to use
> xfs_free_extent_later(). It will be a totally separate preparation
> patch for the series, so everything else in the patchset should
> largely remain unchanged.
>
> I haven't finished the patch or tested it yet, but to give you an
> idea of what and why, the commit message currently reads:
>
>    xfs: use deferred frees for btree block freeing
>    
>    Btrees that aren't freespace management trees use the normal extent
>    allocation and freeing routines for their blocks. Hence when a btree
>    block is freed, a direct call to xfs_free_extent() is made and the
>    extent is immediately freed. This puts the entire free space
>    management btrees under this path, so we are stacking btrees on
>    btrees in the call stack. The inobt, finobt and refcount btrees
>    all do this.
>    
>    However, the bmap btree does not do this - it calls
>    xfs_free_extent_later() to defer the extent free operation via an
>    XEFI and hence it gets processed in deferred operation processing
>    during the commit of the primary transaction (i.e. via intent
>    chaining).
>    
>    We need to change xfs_free_extent() to behave in a non-blocking
>    manner so that we can avoid deadlocks with busy extents near ENOSPC
>    in transactions that free multiple extents. Inserting or removing a
>    record from a btree can cause a multi-level tree merge operation and
>    that will free multiple blocks from the btree in a single
>    transaction. i.e. we can call xfs_free_extent() multiple times, and
>    hence the btree manipulation transaction is vulnerable to this busy
>    extent deadlock vector.
>    
>    To fix this, convert all the remaining callers of xfs_free_extent()
>    to use xfs_free_extent_later() to queue XEFIs and hence defer
>    processing of the extent frees to a context that can be safely
>    restarted if a deadlock condition is detected.
>

Ok. In that case, I will defer review until the new version of the patchset is
posted. Thanks for working on this.

-- 
chandan

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-06-22  5:38 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-20  0:20 [PATCH 0/5 V2] xfs: various fixes Dave Chinner
2023-06-20  0:20 ` [PATCH 1/5] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
2023-06-20  0:20 ` [PATCH 2/5] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
2023-06-20  0:20 ` [PATCH 3/5] xfs: allow extent free intents to be retried Dave Chinner
2023-06-20  0:20 ` [PATCH 4/5] xfs: don't block in busy flushing when freeing extents Dave Chinner
2023-06-20 14:53   ` Chandan Babu R
2023-06-20 22:18     ` Dave Chinner
2023-06-22  3:59       ` Chandan Babu R
2023-06-22  5:28         ` Dave Chinner
2023-06-22  5:36           ` Chandan Babu R
2023-06-20  0:20 ` [PATCH 5/5] xfs: journal geometry is not properly bounds checked Dave Chinner
