[PATCH 0/3] xfs: fix xfs_extent_busy_flush() deadlock in EFI processing

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] xfs: fix xfs_extent_busy_flush() deadlock in EFI processing
@ 2023-06-15  1:41 Dave Chinner
  2023-06-15  1:41 ` [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Dave Chinner @ 2023-06-15  1:41 UTC (permalink / raw)
  To: linux-xfs; +Cc: chandanrlinux, wen.gang.wang

Hi folks,

This patchset is largely a rework of the patch that Wengang posted
here:

https://lore.kernel.org/linux-xfs/20230519171829.4108-1-wen.gang.wang@oracle.com/

Review has run aground because I simply don't have the time to
explain every deep, subtle corner case on every issue that is raised
before progress can be made fixing this issue. Indeed, this is the
second attempt to get this bug fixed that has run aground like this.

Hence I've decided that it's less time and effort to just take what
we have, split it, clean it up, fixed it up, remove all the
unnecessary bits and run it through testing and push it back out for
further review.

I split the alloc flags out as a separate variable that is passed
down the stack; I renamed them all "alloc_flags" so that it's clear
that the flags being used and passed between the functions are the
XFS_ALLOC_FLAG* flag values. If we want to, it will now be trivial
to pull these back into the struct xfs_alloc_arg if we so desire -
there is no mix-and-match of args->flags and function paramter flags
to confuse the issue as was in the original patch.

The changes to xfs_extent_busy_flush() mean that it can now return
-EFSCORRUPTED or -EIO from xfs_log_force(), not just -EAGAIN to
indicate the transaction must be committed before the allocation is
retried. This has been exercised by recoveryloop testing and it
appears that the new corruption/error detection conditions do not
introduce any new failures.

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15  1:41 [PATCH 0/3] xfs: fix xfs_extent_busy_flush() deadlock in EFI processing Dave Chinner
@ 2023-06-15  1:41 ` Dave Chinner
  2023-06-15  3:32   ` Darrick J. Wong
  2023-06-15 21:57   ` Wengang Wang
  2023-06-15  1:42 ` [PATCH 2/3] xfs: allow extent free intents to be retried Dave Chinner
  2023-06-15  1:42 ` [PATCH 3/3] xfs: don't block in busy flushing when freeing extents Dave Chinner
  2 siblings, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2023-06-15  1:41 UTC (permalink / raw)
  To: linux-xfs; +Cc: chandanrlinux, wen.gang.wang

From: Dave Chinner <dchinner@redhat.com>

To avoid blocking in xfs_extent_busy_flush() when freeing extents
and the only busy extents are held by the current transaction, we
need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
into xfs_extent_busy_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
 fs/xfs/libxfs/xfs_alloc.h |  2 +-
 fs/xfs/xfs_extent_busy.c  |  3 +-
 fs/xfs/xfs_extent_busy.h  |  2 +-
 4 files changed, 56 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c20fe99405d8..11bd0a1756a1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
  */
 STATIC int
 xfs_alloc_ag_vextent_near(
-	struct xfs_alloc_arg	*args)
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
 	struct xfs_alloc_cur	acur = {};
 	int			error;		/* error code */
@@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near(
 		if (acur.busy) {
 			trace_xfs_alloc_near_busy(args);
 			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen);
+					      acur.busy_gen, alloc_flags);
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near(
  * and of the form k * prod + mod unless there's nothing that large.
  * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
  */
-STATIC int				/* error */
+static int
 xfs_alloc_ag_vextent_size(
-	xfs_alloc_arg_t	*args)		/* allocation argument structure */
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
-	struct xfs_agf	*agf = args->agbp->b_addr;
-	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
-	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
-	int		error;		/* error result */
-	xfs_agblock_t	fbno;		/* start of found freespace */
-	xfs_extlen_t	flen;		/* length of found freespace */
-	int		i;		/* temp status variable */
-	xfs_agblock_t	rbno;		/* returned block number */
-	xfs_extlen_t	rlen;		/* length of returned extent */
-	bool		busy;
-	unsigned	busy_gen;
+	struct xfs_agf		*agf = args->agbp->b_addr;
+	struct xfs_btree_cur	*bno_cur;
+	struct xfs_btree_cur	*cnt_cur;
+	xfs_agblock_t		fbno;		/* start of found freespace */
+	xfs_extlen_t		flen;		/* length of found freespace */
+	xfs_agblock_t		rbno;		/* returned block number */
+	xfs_extlen_t		rlen;		/* length of returned extent */
+	bool			busy;
+	unsigned		busy_gen;
+	int			error;
+	int			i;
 
 restart:
 	/*
@@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size(
 				xfs_btree_del_cursor(cnt_cur,
 						     XFS_BTREE_NOERROR);
 				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp,
-							args->pag, busy_gen);
+				xfs_extent_busy_flush(args->mp, args->pag,
+						busy_gen, alloc_flags);
 				goto restart;
 			}
 		}
@@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size(
 		if (busy) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
+			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
+					alloc_flags);
 			goto restart;
 		}
 		goto out_nominleft;
@@ -2568,7 +2571,7 @@ xfs_exact_minlen_extent_available(
 int			/* error */
 xfs_alloc_fix_freelist(
 	struct xfs_alloc_arg	*args,	/* allocation argument structure */
-	int			flags)	/* XFS_ALLOC_FLAG_... */
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*pag = args->pag;
@@ -2584,7 +2587,7 @@ xfs_alloc_fix_freelist(
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 
 	if (!xfs_perag_initialised_agf(pag)) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2600,13 +2603,13 @@ xfs_alloc_fix_freelist(
 	 */
 	if (xfs_perag_prefers_metadata(pag) &&
 	    (args->datatype & XFS_ALLOC_USERDATA) &&
-	    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
-		ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
+	    (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
+		ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
 		goto out_agbp_relse;
 	}
 
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags |
+	if (!xfs_alloc_space_available(args, need, alloc_flags |
 			XFS_ALLOC_FLAG_CHECK))
 		goto out_agbp_relse;
 
@@ -2615,7 +2618,7 @@ xfs_alloc_fix_freelist(
 	 * Can fail if we're not blocking on locks, and it's held.
 	 */
 	if (!agbp) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2630,7 +2633,7 @@ xfs_alloc_fix_freelist(
 
 	/* If there isn't enough total space or single-extent, reject it. */
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags))
+	if (!xfs_alloc_space_available(args, need, alloc_flags))
 		goto out_agbp_relse;
 
 #ifdef DEBUG
@@ -2668,11 +2671,12 @@ xfs_alloc_fix_freelist(
 	 */
 	memset(&targs, 0, sizeof(targs));
 	/* struct copy below */
-	if (flags & XFS_ALLOC_FLAG_NORMAP)
+	if (alloc_flags & XFS_ALLOC_FLAG_NORMAP)
 		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
 	else
 		targs.oinfo = XFS_RMAP_OINFO_AG;
-	while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
+	while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) &&
+			pag->pagf_flcount > need) {
 		error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
 		if (error)
 			goto out_agbp_relse;
@@ -2700,7 +2704,7 @@ xfs_alloc_fix_freelist(
 		targs.resv = XFS_AG_RESV_AGFL;
 
 		/* Allocate as many blocks as possible at once. */
-		error = xfs_alloc_ag_vextent_size(&targs);
+		error = xfs_alloc_ag_vextent_size(&targs, alloc_flags);
 		if (error)
 			goto out_agflbp_relse;
 
@@ -2710,7 +2714,7 @@ xfs_alloc_fix_freelist(
 		 * on a completely full ag.
 		 */
 		if (targs.agbno == NULLAGBLOCK) {
-			if (flags & XFS_ALLOC_FLAG_FREEING)
+			if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
 				break;
 			goto out_agflbp_relse;
 		}
@@ -3226,7 +3230,7 @@ xfs_alloc_vextent_check_args(
 static int
 xfs_alloc_vextent_prepare_ag(
 	struct xfs_alloc_arg	*args,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	bool			need_pag = !args->pag;
 	int			error;
@@ -3235,7 +3239,7 @@ xfs_alloc_vextent_prepare_ag(
 		args->pag = xfs_perag_get(args->mp, args->agno);
 
 	args->agbp = NULL;
-	error = xfs_alloc_fix_freelist(args, flags);
+	error = xfs_alloc_fix_freelist(args, alloc_flags);
 	if (error) {
 		trace_xfs_alloc_vextent_nofix(args);
 		if (need_pag)
@@ -3357,6 +3361,7 @@ xfs_alloc_vextent_this_ag(
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	ASSERT(args->pag != NULL);
@@ -3375,9 +3380,9 @@ xfs_alloc_vextent_this_ag(
 		return error;
 	}
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_size(args);
+		error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, false);
 }
@@ -3406,20 +3411,20 @@ xfs_alloc_vextent_iterate_ags(
 	xfs_agnumber_t		minimum_agno,
 	xfs_agnumber_t		start_agno,
 	xfs_agblock_t		target_agbno,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		restart_agno = minimum_agno;
 	xfs_agnumber_t		agno;
 	int			error = 0;
 
-	if (flags & XFS_ALLOC_FLAG_TRYLOCK)
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)
 		restart_agno = 0;
 restart:
 	for_each_perag_wrap_range(mp, start_agno, restart_agno,
 			mp->m_sb.sb_agcount, agno, args->pag) {
 		args->agno = agno;
-		error = xfs_alloc_vextent_prepare_ag(args, flags);
+		error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 		if (error)
 			break;
 		if (!args->agbp) {
@@ -3433,10 +3438,10 @@ xfs_alloc_vextent_iterate_ags(
 		 */
 		if (args->agno == start_agno && target_agbno) {
 			args->agbno = target_agbno;
-			error = xfs_alloc_ag_vextent_near(args);
+			error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 		} else {
 			args->agbno = 0;
-			error = xfs_alloc_ag_vextent_size(args);
+			error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 		}
 		break;
 	}
@@ -3453,8 +3458,8 @@ xfs_alloc_vextent_iterate_ags(
 	 * constraining flags by the caller, drop them and retry the allocation
 	 * without any constraints being set.
 	 */
-	if (flags) {
-		flags = 0;
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) {
+		alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK;
 		restart_agno = minimum_agno;
 		goto restart;
 	}
@@ -3482,6 +3487,7 @@ xfs_alloc_vextent_start_ag(
 	xfs_agnumber_t		start_agno;
 	xfs_agnumber_t		rotorstep = xfs_rotorstep;
 	bool			bump_rotor = false;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3508,7 +3514,7 @@ xfs_alloc_vextent_start_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 
 	if (bump_rotor) {
 		if (args->agno == start_agno)
@@ -3535,6 +3541,7 @@ xfs_alloc_vextent_first_ag(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	xfs_agnumber_t		start_agno;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3553,7 +3560,7 @@ xfs_alloc_vextent_first_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), 0);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, true);
 }
 
@@ -3606,6 +3613,7 @@ xfs_alloc_vextent_near_bno(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	bool			needs_perag = args->pag == NULL;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	if (!needs_perag)
@@ -3626,9 +3634,9 @@ xfs_alloc_vextent_near_bno(
 	if (needs_perag)
 		args->pag = xfs_perag_grab(mp, args->agno);
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_near(args);
+		error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag);
 }
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 85ac470be0da..d1aa7c63eafe 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp,
 		struct xfs_buf **bpp);
 int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t,
 			struct xfs_buf *, struct xfs_owner_info *);
-int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
+int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags);
 int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag,
 		struct xfs_buf **agbp);
 
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index f3d328e4a440..874798949625 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -571,7 +571,8 @@ void
 xfs_extent_busy_flush(
 	struct xfs_mount	*mp,
 	struct xfs_perag	*pag,
-	unsigned		busy_gen)
+	unsigned		busy_gen,
+	unsigned int		flags)
 {
 	DEFINE_WAIT		(wait);
 	int			error;
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 4a118131059f..79ef5a2d7758 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 
 void
 xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen);
+	unsigned busy_gen, unsigned int flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/3] xfs: allow extent free intents to be retried
  2023-06-15  1:41 [PATCH 0/3] xfs: fix xfs_extent_busy_flush() deadlock in EFI processing Dave Chinner
  2023-06-15  1:41 ` [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
@ 2023-06-15  1:42 ` Dave Chinner
  2023-06-15  3:38   ` Darrick J. Wong
  2023-06-15  1:42 ` [PATCH 3/3] xfs: don't block in busy flushing when freeing extents Dave Chinner
  2 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-15  1:42 UTC (permalink / raw)
  To: linux-xfs; +Cc: chandanrlinux, wen.gang.wang

From: Dave Chinner <dchinner@redhat.com>

Extent freeing neeeds to be able to avoid a busy extent deadlock
when the transaction itself holds the only busy extents in the
allocation group. This may occur if we have an EFI that contains
multiple extents to be freed, and the freeing the second intent
requires the space the first extent free released to expand the
AGFL. If we block on the busy extent at this point, we deadlock.

We hold a dirty transaction that contains a entire atomic extent
free operations within it, so if we can abort the extent free
operation and commit the progress that we've made, the busy extent
can be resolved by a log force. Hence we can restart the aborted
extent free with a new transaction and continue to make
progress without risking deadlocks.

To enable this, we need the EFI processing code to be able to handle
an -EAGAIN error to tell it to commit the current transaction and
retry again. This mechanism is already built into the defer ops
processing (used bythe refcount btree modification intents), so
there's relatively little handling we need to add to the EFI code to
enable this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_extfree_item.c | 64 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f9e36b810663..3b33d27efdce 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -336,6 +336,29 @@ xfs_trans_get_efd(
 	return efdp;
 }
 
+/*
+ * Fill the EFD with all extents from the EFI when we need to roll the
+ * transaction and continue with a new EFI.
+ */
+static void
+xfs_efd_from_efi(
+	struct xfs_efd_log_item	*efdp)
+{
+	struct xfs_efi_log_item *efip = efdp->efd_efip;
+	uint                    i;
+
+	ASSERT(efip->efi_format.efi_nextents > 0);
+
+	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
+		return;
+
+	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+	       efdp->efd_format.efd_extents[i] =
+		       efip->efi_format.efi_extents[i];
+	}
+	efdp->efd_next_extent = efip->efi_format.efi_nextents;
+}
+
 /*
  * Free an extent and log it to the EFD. Note that the transaction is marked
  * dirty regardless of whether the extent free succeeds or fails to support the
@@ -378,6 +401,17 @@ xfs_trans_free_extent(
 	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
 
+	/*
+	 * If we need a new transaction to make progress, the caller will log a
+	 * new EFI with the current contents. It will also log an EFD to cancel
+	 * the existing EFI, and so we need to copy all the unprocessed extents
+	 * in this EFI to the EFD so this works correctly.
+	 */
+	if (error == -EAGAIN) {
+		xfs_efd_from_efi(efdp);
+		return error;
+	}
+
 	next_extent = efdp->efd_next_extent;
 	ASSERT(next_extent < efdp->efd_format.efd_nextents);
 	extp = &(efdp->efd_format.efd_extents[next_extent]);
@@ -495,6 +529,13 @@ xfs_extent_free_finish_item(
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
 
+	/*
+	 * Don't free the XEFI if we need a new transaction to complete
+	 * processing of it.
+	 */
+	if (error == -EAGAIN)
+		return error;
+
 	xfs_extent_free_put_group(xefi);
 	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
@@ -620,6 +661,7 @@ xfs_efi_item_recover(
 	struct xfs_trans		*tp;
 	int				i;
 	int				error = 0;
+	bool				requeue_only = false;
 
 	/*
 	 * First check the validity of the extents described by the
@@ -652,9 +694,25 @@ xfs_efi_item_recover(
 		fake.xefi_startblock = extp->ext_start;
 		fake.xefi_blockcount = extp->ext_len;
 
-		xfs_extent_free_get_group(mp, &fake);
-		error = xfs_trans_free_extent(tp, efdp, &fake);
-		xfs_extent_free_put_group(&fake);
+		if (!requeue_only) {
+			xfs_extent_free_get_group(mp, &fake);
+			error = xfs_trans_free_extent(tp, efdp, &fake);
+			xfs_extent_free_put_group(&fake);
+		}
+
+		/*
+		 * If we can't free the extent without potentially deadlocking,
+		 * requeue the rest of the extents to a new so that they get
+		 * run again later with a new transaction context.
+		 */
+		if (error == -EAGAIN || requeue_only) {
+			xfs_free_extent_later(tp, fake.xefi_startblock,
+				fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
+			requeue_only = true;
+			error = 0;
+			continue;
+		};
+
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					extp, sizeof(*extp));
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/3] xfs: don't block in busy flushing when freeing extents
  2023-06-15  1:41 [PATCH 0/3] xfs: fix xfs_extent_busy_flush() deadlock in EFI processing Dave Chinner
  2023-06-15  1:41 ` [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
  2023-06-15  1:42 ` [PATCH 2/3] xfs: allow extent free intents to be retried Dave Chinner
@ 2023-06-15  1:42 ` Dave Chinner
  2023-06-15  3:40   ` Darrick J. Wong
  2 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-15  1:42 UTC (permalink / raw)
  To: linux-xfs; +Cc: chandanrlinux, wen.gang.wang

From: Dave Chinner <dchinner@redhat.com>

If the current transaction holds a busy extent and we are trying to
allocate a new extent to fix up the free list, we can deadlock if
the AG is entirely empty except for the busy extent held by the
transaction.

This can occur at runtime processing an XEFI with multiple extents
in this path:

__schedule+0x22f at ffffffff81f75e8f
schedule+0x46 at ffffffff81f76366
xfs_extent_busy_flush+0x69 at ffffffff81477d99
xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
__xfs_free_extent+0x99 at ffffffff81419499
xfs_trans_free_extent+0x3e at ffffffff814a6fee
xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
xfs_defer_finish+0x11 at ffffffff814417e1
xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
xfs_inactive_truncate+0xb9 at ffffffff8148bb89
xfs_inactive+0x227 at ffffffff8148c4f7
xfs_fs_destroy_inode+0xb8 at ffffffff81496898
destroy_inode+0x3b at ffffffff8127d2ab
do_unlinkat+0x1d1 at ffffffff81270df1
do_syscall_64+0x40 at ffffffff81f6b5f0
entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c

This can also happen in log recovery when processing an EFI
with multiple extents through this path:

context_switch() kernel/sched/core.c:3881
__schedule() kernel/sched/core.c:5111
schedule() kernel/sched/core.c:5186
xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
xfs_log_mount_finish() fs/xfs/xfs_log.c:764
xfs_mountfs() fs/xfs/xfs_mount.c:978
xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
mount_bdev() fs/super.c:1417
xfs_fs_mount() fs/xfs/xfs_super.c:1985
legacy_get_tree() fs/fs_context.c:647
vfs_get_tree() fs/super.c:1547
do_new_mount() fs/namespace.c:2843
do_mount() fs/namespace.c:3163
ksys_mount() fs/namespace.c:3372
__do_sys_mount() fs/namespace.c:3386
__se_sys_mount() fs/namespace.c:3383
__x64_sys_mount() fs/namespace.c:3383
do_syscall_64() arch/x86/entry/common.c:296
entry_SYSCALL_64() arch/x86/entry/entry_64.S:180

To avoid this deadlock, we should not block in
xfs_extent_busy_flush() if we hold a busy extent in the current
transaction.

Now that the EFI processing code can handle requeuing a partially
completed EFI, we can detect this situation in
xfs_extent_busy_flush() and return -EAGAIN rather than going to
sleep forever. The -EAGAIN get propagated back out to the
xfs_trans_free_extent() context, where the EFD is populated and the
transaction is rolled, thereby moving the busy extents into the CIL.

At this point, we can retry the extent free operation again with a
clean transaction. If we hit the same "all free extents are busy"
situation when trying to fix up the free list, we can safely call
xfs_extent_busy_flush() and wait for the busy extents to resolve
and wake us. At this point, the allocation search can make progress
again and we can fix up the free list.

This deadlock was first reported by Chandan in mid-2021, but I
couldn't make myself understood during review, and didn't have time
to fix it myself.

It was reported again in March 2023, and again I have found myself
unable to explain the complexities of the solution needed during
review.

As such, I don't have hours more time to waste trying to get the
fix written the way it needs to be written, so I'm just doing it
myself. This patchset is largely based on Wengang Wang's last patch,
but with all the unnecessary stuff removed, split up into multiple
patches and cleaned up somewhat.

Reported-by: Chandan Babu R <chandanrlinux@gmail.com>
Reported-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++-----------
 fs/xfs/libxfs/xfs_alloc.h | 11 ++++---
 fs/xfs/xfs_extent_busy.c  | 35 +++++++++++++++++---
 fs/xfs/xfs_extent_busy.h  |  6 ++--
 4 files changed, 89 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 11bd0a1756a1..7c675aae0a0f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near(
 	if (args->agbno > args->max_agbno)
 		args->agbno = args->max_agbno;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	len = 0;
 
@@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near(
 	 */
 	if (!acur.len) {
 		if (acur.busy) {
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_near_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen, alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					acur.busy_gen, alloc_flags);
+			if (error)
+				goto out;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size(
 	int			error;
 	int			i;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	/*
 	 * Allocate and initialize a cursor for the by-size btree.
@@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size(
 			error = xfs_btree_increment(cnt_cur, 0, &i);
 			if (error)
 				goto error0;
-			if (i == 0) {
-				/*
-				 * Our only valid extents must have been busy.
-				 * Make it unbusy by forcing the log out and
-				 * retrying.
-				 */
-				xfs_btree_del_cursor(cnt_cur,
-						     XFS_BTREE_NOERROR);
-				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp, args->pag,
-						busy_gen, alloc_flags);
-				goto restart;
-			}
+			if (i)
+				continue;
+
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
+			trace_xfs_alloc_size_busy(args);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			goto restart;
 		}
 	}
 
@@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size(
 	args->len = rlen;
 	if (rlen < args->minlen) {
 		if (busy) {
-			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
-					alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			goto restart;
 		}
 		goto out_nominleft;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index d1aa7c63eafe..f267842e36ba 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
 /*
  * Flags for xfs_alloc_fix_freelist.
  */
-#define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
-#define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
-#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
-#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
-#define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYLOCK	(1U << 0)  /* use trylock for buffer locking */
+#define	XFS_ALLOC_FLAG_FREEING	(1U << 1)  /* indicate caller is freeing extents*/
+#define	XFS_ALLOC_FLAG_NORMAP	(1U << 2)  /* don't modify the rmapbt */
+#define	XFS_ALLOC_FLAG_NOSHRINK	(1U << 3)  /* don't shrink the freelist */
+#define	XFS_ALLOC_FLAG_CHECK	(1U << 4)  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)  /* don't wait in busy extent flush */
 
 /*
  * Argument structure for xfs_alloc routines.
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 874798949625..7c2fdc71e42d 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -566,21 +566,45 @@ xfs_extent_busy_clear(
 
 /*
  * Flush out all busy extents for this AG.
+ *
+ * If the current transaction is holding busy extents, the caller may not want
+ * to wait for committed busy extents to resolve. If we are being told just to
+ * try a flush or progress has been made since we last skipped a busy extent,
+ * return immediately to allow the caller to try again.
+ *
+ * If we are freeing extents, we might actually be holding the only free extents
+ * in the transaction busy list and the log force won't resolve that situation.
+ * In this case, we must return -EAGAIN to avoid a deadlock by informing the
+ * caller it needs to commit the busy extents it holds before retrying the
+ * extent free operation.
  */
-void
+int
 xfs_extent_busy_flush(
-	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
 	struct xfs_perag	*pag,
 	unsigned		busy_gen,
-	unsigned int		flags)
+	uint32_t		alloc_flags)
 {
 	DEFINE_WAIT		(wait);
 	int			error;
 
-	error = xfs_log_force(mp, XFS_LOG_SYNC);
+	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
 	if (error)
-		return;
+		return error;
 
+	/* Avoid deadlocks on uncommitted busy extents. */
+	if (!list_empty(&tp->t_busy)) {
+		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
+			return 0;
+
+		if (busy_gen != READ_ONCE(pag->pagb_gen))
+			return 0;
+
+		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
+			return -EAGAIN;
+	}
+
+	/* Wait for committed busy extents to resolve. */
 	do {
 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
 		if  (busy_gen != READ_ONCE(pag->pagb_gen))
@@ -589,6 +613,7 @@ xfs_extent_busy_flush(
 	} while (1);
 
 	finish_wait(&pag->pagb_wait, &wait);
+	return 0;
 }
 
 void
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 79ef5a2d7758..2104548d6695 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -51,9 +51,9 @@ bool
 xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 		xfs_extlen_t *len, unsigned *busy_gen);
 
-void
-xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen, unsigned int flags);
+int
+xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
+	unsigned busy_gen, uint32_t flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15  1:41 ` [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
@ 2023-06-15  3:32   ` Darrick J. Wong
  2023-06-15  3:48     ` Dave Chinner
  2023-06-15 21:57   ` Wengang Wang
  1 sibling, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2023-06-15  3:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux, wen.gang.wang

On Thu, Jun 15, 2023 at 11:41:59AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To avoid blocking in xfs_extent_busy_flush() when freeing extents
> and the only busy extents are held by the current transaction, we
> need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
> into xfs_extent_busy_flush().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
>  fs/xfs/libxfs/xfs_alloc.h |  2 +-
>  fs/xfs/xfs_extent_busy.c  |  3 +-
>  fs/xfs/xfs_extent_busy.h  |  2 +-
>  4 files changed, 56 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index c20fe99405d8..11bd0a1756a1 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>   */
>  STATIC int
>  xfs_alloc_ag_vextent_near(
> -	struct xfs_alloc_arg	*args)
> +	struct xfs_alloc_arg	*args,
> +	uint32_t		alloc_flags)

Is some bot going to complain about the uint32_t here vs the unsigned
int in xfs_extent_busy_flush?

Just to check my grokking here -- it's the ALLOC_FLAG_FREEING from
xfs_free_extent_fix_freelist that we need to pass all the way to the
bottom of the allocator?

If the answers are 'no' and 'yes', then
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


>  {
>  	struct xfs_alloc_cur	acur = {};
>  	int			error;		/* error code */
> @@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near(
>  		if (acur.busy) {
>  			trace_xfs_alloc_near_busy(args);
>  			xfs_extent_busy_flush(args->mp, args->pag,
> -					      acur.busy_gen);
> +					      acur.busy_gen, alloc_flags);
>  			goto restart;
>  		}
>  		trace_xfs_alloc_size_neither(args);
> @@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near(
>   * and of the form k * prod + mod unless there's nothing that large.
>   * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
>   */
> -STATIC int				/* error */
> +static int
>  xfs_alloc_ag_vextent_size(
> -	xfs_alloc_arg_t	*args)		/* allocation argument structure */
> +	struct xfs_alloc_arg	*args,
> +	uint32_t		alloc_flags)
>  {
> -	struct xfs_agf	*agf = args->agbp->b_addr;
> -	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
> -	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
> -	int		error;		/* error result */
> -	xfs_agblock_t	fbno;		/* start of found freespace */
> -	xfs_extlen_t	flen;		/* length of found freespace */
> -	int		i;		/* temp status variable */
> -	xfs_agblock_t	rbno;		/* returned block number */
> -	xfs_extlen_t	rlen;		/* length of returned extent */
> -	bool		busy;
> -	unsigned	busy_gen;
> +	struct xfs_agf		*agf = args->agbp->b_addr;
> +	struct xfs_btree_cur	*bno_cur;
> +	struct xfs_btree_cur	*cnt_cur;
> +	xfs_agblock_t		fbno;		/* start of found freespace */
> +	xfs_extlen_t		flen;		/* length of found freespace */
> +	xfs_agblock_t		rbno;		/* returned block number */
> +	xfs_extlen_t		rlen;		/* length of returned extent */
> +	bool			busy;
> +	unsigned		busy_gen;
> +	int			error;
> +	int			i;
>  
>  restart:
>  	/*
> @@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size(
>  				xfs_btree_del_cursor(cnt_cur,
>  						     XFS_BTREE_NOERROR);
>  				trace_xfs_alloc_size_busy(args);
> -				xfs_extent_busy_flush(args->mp,
> -							args->pag, busy_gen);
> +				xfs_extent_busy_flush(args->mp, args->pag,
> +						busy_gen, alloc_flags);
>  				goto restart;
>  			}
>  		}
> @@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size(
>  		if (busy) {
>  			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>  			trace_xfs_alloc_size_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> +			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
> +					alloc_flags);
>  			goto restart;
>  		}
>  		goto out_nominleft;
> @@ -2568,7 +2571,7 @@ xfs_exact_minlen_extent_available(
>  int			/* error */
>  xfs_alloc_fix_freelist(
>  	struct xfs_alloc_arg	*args,	/* allocation argument structure */
> -	int			flags)	/* XFS_ALLOC_FLAG_... */
> +	uint32_t		alloc_flags)
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	struct xfs_perag	*pag = args->pag;
> @@ -2584,7 +2587,7 @@ xfs_alloc_fix_freelist(
>  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
>  
>  	if (!xfs_perag_initialised_agf(pag)) {
> -		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
> +		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
>  		if (error) {
>  			/* Couldn't lock the AGF so skip this AG. */
>  			if (error == -EAGAIN)
> @@ -2600,13 +2603,13 @@ xfs_alloc_fix_freelist(
>  	 */
>  	if (xfs_perag_prefers_metadata(pag) &&
>  	    (args->datatype & XFS_ALLOC_USERDATA) &&
> -	    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
> -		ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
> +	    (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
> +		ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
>  		goto out_agbp_relse;
>  	}
>  
>  	need = xfs_alloc_min_freelist(mp, pag);
> -	if (!xfs_alloc_space_available(args, need, flags |
> +	if (!xfs_alloc_space_available(args, need, alloc_flags |
>  			XFS_ALLOC_FLAG_CHECK))
>  		goto out_agbp_relse;
>  
> @@ -2615,7 +2618,7 @@ xfs_alloc_fix_freelist(
>  	 * Can fail if we're not blocking on locks, and it's held.
>  	 */
>  	if (!agbp) {
> -		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
> +		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
>  		if (error) {
>  			/* Couldn't lock the AGF so skip this AG. */
>  			if (error == -EAGAIN)
> @@ -2630,7 +2633,7 @@ xfs_alloc_fix_freelist(
>  
>  	/* If there isn't enough total space or single-extent, reject it. */
>  	need = xfs_alloc_min_freelist(mp, pag);
> -	if (!xfs_alloc_space_available(args, need, flags))
> +	if (!xfs_alloc_space_available(args, need, alloc_flags))
>  		goto out_agbp_relse;
>  
>  #ifdef DEBUG
> @@ -2668,11 +2671,12 @@ xfs_alloc_fix_freelist(
>  	 */
>  	memset(&targs, 0, sizeof(targs));
>  	/* struct copy below */
> -	if (flags & XFS_ALLOC_FLAG_NORMAP)
> +	if (alloc_flags & XFS_ALLOC_FLAG_NORMAP)
>  		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
>  	else
>  		targs.oinfo = XFS_RMAP_OINFO_AG;
> -	while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
> +	while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) &&
> +			pag->pagf_flcount > need) {
>  		error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
>  		if (error)
>  			goto out_agbp_relse;
> @@ -2700,7 +2704,7 @@ xfs_alloc_fix_freelist(
>  		targs.resv = XFS_AG_RESV_AGFL;
>  
>  		/* Allocate as many blocks as possible at once. */
> -		error = xfs_alloc_ag_vextent_size(&targs);
> +		error = xfs_alloc_ag_vextent_size(&targs, alloc_flags);
>  		if (error)
>  			goto out_agflbp_relse;
>  
> @@ -2710,7 +2714,7 @@ xfs_alloc_fix_freelist(
>  		 * on a completely full ag.
>  		 */
>  		if (targs.agbno == NULLAGBLOCK) {
> -			if (flags & XFS_ALLOC_FLAG_FREEING)
> +			if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
>  				break;
>  			goto out_agflbp_relse;
>  		}
> @@ -3226,7 +3230,7 @@ xfs_alloc_vextent_check_args(
>  static int
>  xfs_alloc_vextent_prepare_ag(
>  	struct xfs_alloc_arg	*args,
> -	uint32_t		flags)
> +	uint32_t		alloc_flags)
>  {
>  	bool			need_pag = !args->pag;
>  	int			error;
> @@ -3235,7 +3239,7 @@ xfs_alloc_vextent_prepare_ag(
>  		args->pag = xfs_perag_get(args->mp, args->agno);
>  
>  	args->agbp = NULL;
> -	error = xfs_alloc_fix_freelist(args, flags);
> +	error = xfs_alloc_fix_freelist(args, alloc_flags);
>  	if (error) {
>  		trace_xfs_alloc_vextent_nofix(args);
>  		if (need_pag)
> @@ -3357,6 +3361,7 @@ xfs_alloc_vextent_this_ag(
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		minimum_agno;
> +	uint32_t		alloc_flags = 0;
>  	int			error;
>  
>  	ASSERT(args->pag != NULL);
> @@ -3375,9 +3380,9 @@ xfs_alloc_vextent_this_ag(
>  		return error;
>  	}
>  
> -	error = xfs_alloc_vextent_prepare_ag(args, 0);
> +	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
>  	if (!error && args->agbp)
> -		error = xfs_alloc_ag_vextent_size(args);
> +		error = xfs_alloc_ag_vextent_size(args, alloc_flags);
>  
>  	return xfs_alloc_vextent_finish(args, minimum_agno, error, false);
>  }
> @@ -3406,20 +3411,20 @@ xfs_alloc_vextent_iterate_ags(
>  	xfs_agnumber_t		minimum_agno,
>  	xfs_agnumber_t		start_agno,
>  	xfs_agblock_t		target_agbno,
> -	uint32_t		flags)
> +	uint32_t		alloc_flags)
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		restart_agno = minimum_agno;
>  	xfs_agnumber_t		agno;
>  	int			error = 0;
>  
> -	if (flags & XFS_ALLOC_FLAG_TRYLOCK)
> +	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)
>  		restart_agno = 0;
>  restart:
>  	for_each_perag_wrap_range(mp, start_agno, restart_agno,
>  			mp->m_sb.sb_agcount, agno, args->pag) {
>  		args->agno = agno;
> -		error = xfs_alloc_vextent_prepare_ag(args, flags);
> +		error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
>  		if (error)
>  			break;
>  		if (!args->agbp) {
> @@ -3433,10 +3438,10 @@ xfs_alloc_vextent_iterate_ags(
>  		 */
>  		if (args->agno == start_agno && target_agbno) {
>  			args->agbno = target_agbno;
> -			error = xfs_alloc_ag_vextent_near(args);
> +			error = xfs_alloc_ag_vextent_near(args, alloc_flags);
>  		} else {
>  			args->agbno = 0;
> -			error = xfs_alloc_ag_vextent_size(args);
> +			error = xfs_alloc_ag_vextent_size(args, alloc_flags);
>  		}
>  		break;
>  	}
> @@ -3453,8 +3458,8 @@ xfs_alloc_vextent_iterate_ags(
>  	 * constraining flags by the caller, drop them and retry the allocation
>  	 * without any constraints being set.
>  	 */
> -	if (flags) {
> -		flags = 0;
> +	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) {
> +		alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK;
>  		restart_agno = minimum_agno;
>  		goto restart;
>  	}
> @@ -3482,6 +3487,7 @@ xfs_alloc_vextent_start_ag(
>  	xfs_agnumber_t		start_agno;
>  	xfs_agnumber_t		rotorstep = xfs_rotorstep;
>  	bool			bump_rotor = false;
> +	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
>  	int			error;
>  
>  	ASSERT(args->pag == NULL);
> @@ -3508,7 +3514,7 @@ xfs_alloc_vextent_start_ag(
>  
>  	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
>  	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
> -			XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK);
> +			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
>  
>  	if (bump_rotor) {
>  		if (args->agno == start_agno)
> @@ -3535,6 +3541,7 @@ xfs_alloc_vextent_first_ag(
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		minimum_agno;
>  	xfs_agnumber_t		start_agno;
> +	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
>  	int			error;
>  
>  	ASSERT(args->pag == NULL);
> @@ -3553,7 +3560,7 @@ xfs_alloc_vextent_first_ag(
>  
>  	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
>  	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
> -			XFS_FSB_TO_AGBNO(mp, target), 0);
> +			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
>  	return xfs_alloc_vextent_finish(args, minimum_agno, error, true);
>  }
>  
> @@ -3606,6 +3613,7 @@ xfs_alloc_vextent_near_bno(
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		minimum_agno;
>  	bool			needs_perag = args->pag == NULL;
> +	uint32_t		alloc_flags = 0;
>  	int			error;
>  
>  	if (!needs_perag)
> @@ -3626,9 +3634,9 @@ xfs_alloc_vextent_near_bno(
>  	if (needs_perag)
>  		args->pag = xfs_perag_grab(mp, args->agno);
>  
> -	error = xfs_alloc_vextent_prepare_ag(args, 0);
> +	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
>  	if (!error && args->agbp)
> -		error = xfs_alloc_ag_vextent_near(args);
> +		error = xfs_alloc_ag_vextent_near(args, alloc_flags);
>  
>  	return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag);
>  }
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 85ac470be0da..d1aa7c63eafe 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp,
>  		struct xfs_buf **bpp);
>  int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t,
>  			struct xfs_buf *, struct xfs_owner_info *);
> -int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
> +int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags);
>  int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag,
>  		struct xfs_buf **agbp);
>  
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index f3d328e4a440..874798949625 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -571,7 +571,8 @@ void
>  xfs_extent_busy_flush(
>  	struct xfs_mount	*mp,
>  	struct xfs_perag	*pag,
> -	unsigned		busy_gen)
> +	unsigned		busy_gen,
> +	unsigned int		flags)
>  {
>  	DEFINE_WAIT		(wait);
>  	int			error;
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 4a118131059f..79ef5a2d7758 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>  
>  void
>  xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> -	unsigned busy_gen);
> +	unsigned busy_gen, unsigned int flags);
>  
>  void
>  xfs_extent_busy_wait_all(struct xfs_mount *mp);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] xfs: allow extent free intents to be retried
  2023-06-15  1:42 ` [PATCH 2/3] xfs: allow extent free intents to be retried Dave Chinner
@ 2023-06-15  3:38   ` Darrick J. Wong
  2023-06-15  3:57     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2023-06-15  3:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux, wen.gang.wang

On Thu, Jun 15, 2023 at 11:42:00AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Extent freeing neeeds to be able to avoid a busy extent deadlock
> when the transaction itself holds the only busy extents in the
> allocation group. This may occur if we have an EFI that contains
> multiple extents to be freed, and the freeing the second intent
> requires the space the first extent free released to expand the
> AGFL. If we block on the busy extent at this point, we deadlock.
> 
> We hold a dirty transaction that contains a entire atomic extent
> free operations within it, so if we can abort the extent free
> operation and commit the progress that we've made, the busy extent
> can be resolved by a log force. Hence we can restart the aborted
> extent free with a new transaction and continue to make
> progress without risking deadlocks.
> 
> To enable this, we need the EFI processing code to be able to handle
> an -EAGAIN error to tell it to commit the current transaction and
> retry again. This mechanism is already built into the defer ops
> processing (used bythe refcount btree modification intents), so
> there's relatively little handling we need to add to the EFI code to
> enable this.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_extfree_item.c | 64 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index f9e36b810663..3b33d27efdce 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -336,6 +336,29 @@ xfs_trans_get_efd(
>  	return efdp;
>  }
>  
> +/*
> + * Fill the EFD with all extents from the EFI when we need to roll the
> + * transaction and continue with a new EFI.
> + */
> +static void
> +xfs_efd_from_efi(
> +	struct xfs_efd_log_item	*efdp)
> +{
> +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> +	uint                    i;
> +
> +	ASSERT(efip->efi_format.efi_nextents > 0);
> +
> +	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> +		return;
> +
> +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> +	       efdp->efd_format.efd_extents[i] =
> +		       efip->efi_format.efi_extents[i];
> +	}
> +	efdp->efd_next_extent = efip->efi_format.efi_nextents;

Odd question -- if we managed to free half the extents mentioned in an
EFI before hitting -EAGAIN, then efdp->efd_next_extent should already be
half of efip->efi_format.efi_nextents, right?

I suppose it's duplicative (or maybe just careful) to recopy the entire
thing... but maybe that doesn't even really matter since no modern xlog
code actually pays attention to what's in the EFD aside from the ID
number.


> +}
> +
>  /*
>   * Free an extent and log it to the EFD. Note that the transaction is marked
>   * dirty regardless of whether the extent free succeeds or fails to support the
> @@ -378,6 +401,17 @@ xfs_trans_free_extent(
>  	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
>  	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
>  
> +	/*
> +	 * If we need a new transaction to make progress, the caller will log a
> +	 * new EFI with the current contents. It will also log an EFD to cancel
> +	 * the existing EFI, and so we need to copy all the unprocessed extents
> +	 * in this EFI to the EFD so this works correctly.
> +	 */
> +	if (error == -EAGAIN) {
> +		xfs_efd_from_efi(efdp);
> +		return error;
> +	}
> +
>  	next_extent = efdp->efd_next_extent;
>  	ASSERT(next_extent < efdp->efd_format.efd_nextents);
>  	extp = &(efdp->efd_format.efd_extents[next_extent]);
> @@ -495,6 +529,13 @@ xfs_extent_free_finish_item(
>  
>  	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>  
> +	/*
> +	 * Don't free the XEFI if we need a new transaction to complete
> +	 * processing of it.
> +	 */
> +	if (error == -EAGAIN)
> +		return error;
> +
>  	xfs_extent_free_put_group(xefi);
>  	kmem_cache_free(xfs_extfree_item_cache, xefi);
>  	return error;
> @@ -620,6 +661,7 @@ xfs_efi_item_recover(
>  	struct xfs_trans		*tp;
>  	int				i;
>  	int				error = 0;
> +	bool				requeue_only = false;
>  
>  	/*
>  	 * First check the validity of the extents described by the
> @@ -652,9 +694,25 @@ xfs_efi_item_recover(
>  		fake.xefi_startblock = extp->ext_start;
>  		fake.xefi_blockcount = extp->ext_len;
>  
> -		xfs_extent_free_get_group(mp, &fake);
> -		error = xfs_trans_free_extent(tp, efdp, &fake);
> -		xfs_extent_free_put_group(&fake);
> +		if (!requeue_only) {
> +			xfs_extent_free_get_group(mp, &fake);
> +			error = xfs_trans_free_extent(tp, efdp, &fake);
> +			xfs_extent_free_put_group(&fake);
> +		}
> +
> +		/*
> +		 * If we can't free the extent without potentially deadlocking,
> +		 * requeue the rest of the extents to a new so that they get
> +		 * run again later with a new transaction context.
> +		 */
> +		if (error == -EAGAIN || requeue_only) {
> +			xfs_free_extent_later(tp, fake.xefi_startblock,
> +				fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);

Shouldn't we check the return value of xfs_free_extent_later now?
I think we already did that above, but since you just plumbed in the
extra checks, we ought to use it. :)

(Also the indenting here isn't the usual two tabs)

Aside from that everything looks in order here.

--D

> +			requeue_only = true;
> +			error = 0;
> +			continue;
> +		};
> +
>  		if (error == -EFSCORRUPTED)
>  			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>  					extp, sizeof(*extp));
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] xfs: don't block in busy flushing when freeing extents
  2023-06-15  1:42 ` [PATCH 3/3] xfs: don't block in busy flushing when freeing extents Dave Chinner
@ 2023-06-15  3:40   ` Darrick J. Wong
  0 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2023-06-15  3:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux, wen.gang.wang

On Thu, Jun 15, 2023 at 11:42:01AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If the current transaction holds a busy extent and we are trying to
> allocate a new extent to fix up the free list, we can deadlock if
> the AG is entirely empty except for the busy extent held by the
> transaction.
> 
> This can occur at runtime processing an XEFI with multiple extents
> in this path:
> 
> __schedule+0x22f at ffffffff81f75e8f
> schedule+0x46 at ffffffff81f76366
> xfs_extent_busy_flush+0x69 at ffffffff81477d99
> xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
> xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
> xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
> xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
> __xfs_free_extent+0x99 at ffffffff81419499
> xfs_trans_free_extent+0x3e at ffffffff814a6fee
> xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
> xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
> xfs_defer_finish+0x11 at ffffffff814417e1
> xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
> xfs_inactive_truncate+0xb9 at ffffffff8148bb89
> xfs_inactive+0x227 at ffffffff8148c4f7
> xfs_fs_destroy_inode+0xb8 at ffffffff81496898
> destroy_inode+0x3b at ffffffff8127d2ab
> do_unlinkat+0x1d1 at ffffffff81270df1
> do_syscall_64+0x40 at ffffffff81f6b5f0
> entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c
> 
> This can also happen in log recovery when processing an EFI
> with multiple extents through this path:
> 
> context_switch() kernel/sched/core.c:3881
> __schedule() kernel/sched/core.c:5111
> schedule() kernel/sched/core.c:5186
> xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
> xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
> xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
> xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
> xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
> __xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
> xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
> xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
> xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
> xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
> xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
> xfs_log_mount_finish() fs/xfs/xfs_log.c:764
> xfs_mountfs() fs/xfs/xfs_mount.c:978
> xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
> mount_bdev() fs/super.c:1417
> xfs_fs_mount() fs/xfs/xfs_super.c:1985
> legacy_get_tree() fs/fs_context.c:647
> vfs_get_tree() fs/super.c:1547
> do_new_mount() fs/namespace.c:2843
> do_mount() fs/namespace.c:3163
> ksys_mount() fs/namespace.c:3372
> __do_sys_mount() fs/namespace.c:3386
> __se_sys_mount() fs/namespace.c:3383
> __x64_sys_mount() fs/namespace.c:3383
> do_syscall_64() arch/x86/entry/common.c:296
> entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
> 
> To avoid this deadlock, we should not block in
> xfs_extent_busy_flush() if we hold a busy extent in the current
> transaction.
> 
> Now that the EFI processing code can handle requeuing a partially
> completed EFI, we can detect this situation in
> xfs_extent_busy_flush() and return -EAGAIN rather than going to
> sleep forever. The -EAGAIN get propagated back out to the
> xfs_trans_free_extent() context, where the EFD is populated and the
> transaction is rolled, thereby moving the busy extents into the CIL.
> 
> At this point, we can retry the extent free operation again with a
> clean transaction. If we hit the same "all free extents are busy"
> situation when trying to fix up the free list, we can safely call
> xfs_extent_busy_flush() and wait for the busy extents to resolve
> and wake us. At this point, the allocation search can make progress
> again and we can fix up the free list.
> 
> This deadlock was first reported by Chandan in mid-2021, but I
> couldn't make myself understood during review, and didn't have time
> to fix it myself.
> 
> It was reported again in March 2023, and again I have found myself
> unable to explain the complexities of the solution needed during
> review.
> 
> As such, I don't have hours more time to waste trying to get the
> fix written the way it needs to be written, so I'm just doing it
> myself. This patchset is largely based on Wengang Wang's last patch,
> but with all the unnecessary stuff removed, split up into multiple
> patches and cleaned up somewhat.

Thank you for taking this over!

> Reported-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reported-by: Wengang Wang <wen.gang.wang@oracle.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++-----------
>  fs/xfs/libxfs/xfs_alloc.h | 11 ++++---
>  fs/xfs/xfs_extent_busy.c  | 35 +++++++++++++++++---
>  fs/xfs/xfs_extent_busy.h  |  6 ++--
>  4 files changed, 89 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 11bd0a1756a1..7c675aae0a0f 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near(
>  	if (args->agbno > args->max_agbno)
>  		args->agbno = args->max_agbno;
>  
> +	/* Retry once quickly if we find busy extents before blocking. */
> +	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
>  restart:
>  	len = 0;
>  
> @@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near(
>  	 */
>  	if (!acur.len) {
>  		if (acur.busy) {
> +			/*
> +			 * Our only valid extents must have been busy. Flush and
> +			 * retry the allocation again. If we get an -EAGAIN
> +			 * error, we're being told that a deadlock was avoided
> +			 * and the current transaction needs committing before
> +			 * the allocation can be retried.
> +			 */
>  			trace_xfs_alloc_near_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag,
> -					      acur.busy_gen, alloc_flags);
> +			error = xfs_extent_busy_flush(args->tp, args->pag,
> +					acur.busy_gen, alloc_flags);
> +			if (error)
> +				goto out;
> +
> +			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
>  			goto restart;
>  		}
>  		trace_xfs_alloc_size_neither(args);
> @@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size(
>  	int			error;
>  	int			i;
>  
> +	/* Retry once quickly if we find busy extents before blocking. */
> +	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
>  restart:
>  	/*
>  	 * Allocate and initialize a cursor for the by-size btree.
> @@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size(
>  			error = xfs_btree_increment(cnt_cur, 0, &i);
>  			if (error)
>  				goto error0;
> -			if (i == 0) {
> -				/*
> -				 * Our only valid extents must have been busy.
> -				 * Make it unbusy by forcing the log out and
> -				 * retrying.
> -				 */
> -				xfs_btree_del_cursor(cnt_cur,
> -						     XFS_BTREE_NOERROR);
> -				trace_xfs_alloc_size_busy(args);
> -				xfs_extent_busy_flush(args->mp, args->pag,
> -						busy_gen, alloc_flags);
> -				goto restart;
> -			}
> +			if (i)
> +				continue;
> +
> +			/*
> +			 * Our only valid extents must have been busy. Flush and
> +			 * retry the allocation again. If we get an -EAGAIN
> +			 * error, we're being told that a deadlock was avoided
> +			 * and the current transaction needs committing before
> +			 * the allocation can be retried.
> +			 */
> +			trace_xfs_alloc_size_busy(args);
> +			error = xfs_extent_busy_flush(args->tp, args->pag,
> +					busy_gen, alloc_flags);
> +			if (error)
> +				goto error0;
> +
> +			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
> +			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> +			goto restart;
>  		}
>  	}
>  
> @@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size(
>  	args->len = rlen;
>  	if (rlen < args->minlen) {
>  		if (busy) {
> -			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> +			/*
> +			 * Our only valid extents must have been busy. Flush and
> +			 * retry the allocation again. If we get an -EAGAIN
> +			 * error, we're being told that a deadlock was avoided
> +			 * and the current transaction needs committing before
> +			 * the allocation can be retried.
> +			 */
>  			trace_xfs_alloc_size_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
> -					alloc_flags);
> +			error = xfs_extent_busy_flush(args->tp, args->pag,
> +					busy_gen, alloc_flags);
> +			if (error)
> +				goto error0;
> +
> +			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
> +			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>  			goto restart;
>  		}
>  		goto out_nominleft;
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index d1aa7c63eafe..f267842e36ba 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
>  /*
>   * Flags for xfs_alloc_fix_freelist.
>   */
> -#define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
> -#define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
> -#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
> -#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
> -#define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
> +#define	XFS_ALLOC_FLAG_TRYLOCK	(1U << 0)  /* use trylock for buffer locking */
> +#define	XFS_ALLOC_FLAG_FREEING	(1U << 1)  /* indicate caller is freeing extents*/
> +#define	XFS_ALLOC_FLAG_NORMAP	(1U << 2)  /* don't modify the rmapbt */
> +#define	XFS_ALLOC_FLAG_NOSHRINK	(1U << 3)  /* don't shrink the freelist */
> +#define	XFS_ALLOC_FLAG_CHECK	(1U << 4)  /* test only, don't modify args */
> +#define	XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)  /* don't wait in busy extent flush */
>  
>  /*
>   * Argument structure for xfs_alloc routines.
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index 874798949625..7c2fdc71e42d 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -566,21 +566,45 @@ xfs_extent_busy_clear(
>  
>  /*
>   * Flush out all busy extents for this AG.
> + *
> + * If the current transaction is holding busy extents, the caller may not want
> + * to wait for committed busy extents to resolve. If we are being told just to
> + * try a flush or progress has been made since we last skipped a busy extent,
> + * return immediately to allow the caller to try again.
> + *
> + * If we are freeing extents, we might actually be holding the only free extents
> + * in the transaction busy list and the log force won't resolve that situation.
> + * In this case, we must return -EAGAIN to avoid a deadlock by informing the
> + * caller it needs to commit the busy extents it holds before retrying the
> + * extent free operation.
>   */
> -void
> +int
>  xfs_extent_busy_flush(
> -	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
>  	struct xfs_perag	*pag,
>  	unsigned		busy_gen,
> -	unsigned int		flags)
> +	uint32_t		alloc_flags)
>  {
>  	DEFINE_WAIT		(wait);
>  	int			error;
>  
> -	error = xfs_log_force(mp, XFS_LOG_SYNC);
> +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>  	if (error)
> -		return;
> +		return error;
>  
> +	/* Avoid deadlocks on uncommitted busy extents. */
> +	if (!list_empty(&tp->t_busy)) {
> +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
> +			return 0;
> +
> +		if (busy_gen != READ_ONCE(pag->pagb_gen))
> +			return 0;
> +
> +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
> +			return -EAGAIN;
> +	}
> +
> +	/* Wait for committed busy extents to resolve. */
>  	do {
>  		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>  		if  (busy_gen != READ_ONCE(pag->pagb_gen))
> @@ -589,6 +613,7 @@ xfs_extent_busy_flush(
>  	} while (1);
>  
>  	finish_wait(&pag->pagb_wait, &wait);
> +	return 0;
>  }
>  
>  void
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 79ef5a2d7758..2104548d6695 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -51,9 +51,9 @@ bool
>  xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>  		xfs_extlen_t *len, unsigned *busy_gen);
>  
> -void
> -xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> -	unsigned busy_gen, unsigned int flags);
> +int
> +xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
> +	unsigned busy_gen, uint32_t flags);

Aha, you /did/ change this by the end of the series.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  
>  void
>  xfs_extent_busy_wait_all(struct xfs_mount *mp);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15  3:32   ` Darrick J. Wong
@ 2023-06-15  3:48     ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2023-06-15  3:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandanrlinux, wen.gang.wang

On Wed, Jun 14, 2023 at 08:32:35PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 15, 2023 at 11:41:59AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To avoid blocking in xfs_extent_busy_flush() when freeing extents
> > and the only busy extents are held by the current transaction, we
> > need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
> > into xfs_extent_busy_flush().
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
> >  fs/xfs/libxfs/xfs_alloc.h |  2 +-
> >  fs/xfs/xfs_extent_busy.c  |  3 +-
> >  fs/xfs/xfs_extent_busy.h  |  2 +-
> >  4 files changed, 56 insertions(+), 47 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index c20fe99405d8..11bd0a1756a1 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
> >   */
> >  STATIC int
> >  xfs_alloc_ag_vextent_near(
> > -	struct xfs_alloc_arg	*args)
> > +	struct xfs_alloc_arg	*args,
> > +	uint32_t		alloc_flags)
> 
> Is some bot going to complain about the uint32_t here vs the unsigned
> int in xfs_extent_busy_flush?

Huh. I thought I fixed that to use uint32_t all the way through.

Oooh, I fixed that in a later patch. Oops, my bad, I'll update it.

> Just to check my grokking here -- it's the ALLOC_FLAG_FREEING from
> xfs_free_extent_fix_freelist that we need to pass all the way to the
> bottom of the allocator?

Yes. It needs to propagate through xfs_alloc_fix_freelist() into
freelist block allocation...

> If the answers are 'no' and 'yes', then
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] xfs: allow extent free intents to be retried
  2023-06-15  3:38   ` Darrick J. Wong
@ 2023-06-15  3:57     ` Dave Chinner
  2023-06-15 14:41       ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-15  3:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandanrlinux, wen.gang.wang

On Wed, Jun 14, 2023 at 08:38:37PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 15, 2023 at 11:42:00AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Extent freeing neeeds to be able to avoid a busy extent deadlock
> > when the transaction itself holds the only busy extents in the
> > allocation group. This may occur if we have an EFI that contains
> > multiple extents to be freed, and the freeing the second intent
> > requires the space the first extent free released to expand the
> > AGFL. If we block on the busy extent at this point, we deadlock.
> > 
> > We hold a dirty transaction that contains a entire atomic extent
> > free operations within it, so if we can abort the extent free
> > operation and commit the progress that we've made, the busy extent
> > can be resolved by a log force. Hence we can restart the aborted
> > extent free with a new transaction and continue to make
> > progress without risking deadlocks.
> > 
> > To enable this, we need the EFI processing code to be able to handle
> > an -EAGAIN error to tell it to commit the current transaction and
> > retry again. This mechanism is already built into the defer ops
> > processing (used bythe refcount btree modification intents), so
> > there's relatively little handling we need to add to the EFI code to
> > enable this.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_extfree_item.c | 64 +++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 61 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> > index f9e36b810663..3b33d27efdce 100644
> > --- a/fs/xfs/xfs_extfree_item.c
> > +++ b/fs/xfs/xfs_extfree_item.c
> > @@ -336,6 +336,29 @@ xfs_trans_get_efd(
> >  	return efdp;
> >  }
> >  
> > +/*
> > + * Fill the EFD with all extents from the EFI when we need to roll the
> > + * transaction and continue with a new EFI.
> > + */
> > +static void
> > +xfs_efd_from_efi(
> > +	struct xfs_efd_log_item	*efdp)
> > +{
> > +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> > +	uint                    i;
> > +
> > +	ASSERT(efip->efi_format.efi_nextents > 0);
> > +
> > +	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> > +		return;
> > +
> > +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> > +	       efdp->efd_format.efd_extents[i] =
> > +		       efip->efi_format.efi_extents[i];
> > +	}
> > +	efdp->efd_next_extent = efip->efi_format.efi_nextents;
> 
> Odd question -- if we managed to free half the extents mentioned in an
> EFI before hitting -EAGAIN, then efdp->efd_next_extent should already be
> half of efip->efi_format.efi_nextents, right?

Yes, on success we normally update the EFD with the extent we just
completed and move the EFI/EFD cursors forwards.

> I suppose it's duplicative (or maybe just careful) to recopy the entire
> thing... but maybe that doesn't even really matter since no modern xlog
> code actually pays attention to what's in the EFD aside from the ID
> number.

*nod*

On second thoughts, now that you've questioned this behaviour, I
need to go back and check my assumptions about what the intent
creation is doing vs the current EFI vs the XEFI we are processing.
The new EFI we log shouldn't have the extents we've completed in it,
just the ones we haven't run, and I need to make sure that is
actually what is happening here.

> > @@ -652,9 +694,25 @@ xfs_efi_item_recover(
> >  		fake.xefi_startblock = extp->ext_start;
> >  		fake.xefi_blockcount = extp->ext_len;
> >  
> > -		xfs_extent_free_get_group(mp, &fake);
> > -		error = xfs_trans_free_extent(tp, efdp, &fake);
> > -		xfs_extent_free_put_group(&fake);
> > +		if (!requeue_only) {
> > +			xfs_extent_free_get_group(mp, &fake);
> > +			error = xfs_trans_free_extent(tp, efdp, &fake);
> > +			xfs_extent_free_put_group(&fake);
> > +		}
> > +
> > +		/*
> > +		 * If we can't free the extent without potentially deadlocking,
> > +		 * requeue the rest of the extents to a new so that they get
> > +		 * run again later with a new transaction context.
> > +		 */
> > +		if (error == -EAGAIN || requeue_only) {
> > +			xfs_free_extent_later(tp, fake.xefi_startblock,
> > +				fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
> 
> Shouldn't we check the return value of xfs_free_extent_later now?
> I think we already did that above, but since you just plumbed in the
> extra checks, we ought to use it. :)

Oh, right, my cscope tree needs updating, so I was thinking it is
still a void function.

> (Also the indenting here isn't the usual two tabs)

I'll fix that too.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] xfs: allow extent free intents to be retried
  2023-06-15  3:57     ` Dave Chinner
@ 2023-06-15 14:41       ` Darrick J. Wong
  2023-06-15 22:21         ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2023-06-15 14:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux, wen.gang.wang

On Thu, Jun 15, 2023 at 01:57:22PM +1000, Dave Chinner wrote:
> On Wed, Jun 14, 2023 at 08:38:37PM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 15, 2023 at 11:42:00AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Extent freeing neeeds to be able to avoid a busy extent deadlock
> > > when the transaction itself holds the only busy extents in the
> > > allocation group. This may occur if we have an EFI that contains
> > > multiple extents to be freed, and the freeing the second intent
> > > requires the space the first extent free released to expand the
> > > AGFL. If we block on the busy extent at this point, we deadlock.
> > > 
> > > We hold a dirty transaction that contains a entire atomic extent
> > > free operations within it, so if we can abort the extent free
> > > operation and commit the progress that we've made, the busy extent
> > > can be resolved by a log force. Hence we can restart the aborted
> > > extent free with a new transaction and continue to make
> > > progress without risking deadlocks.
> > > 
> > > To enable this, we need the EFI processing code to be able to handle
> > > an -EAGAIN error to tell it to commit the current transaction and
> > > retry again. This mechanism is already built into the defer ops
> > > processing (used bythe refcount btree modification intents), so
> > > there's relatively little handling we need to add to the EFI code to
> > > enable this.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_extfree_item.c | 64 +++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 61 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> > > index f9e36b810663..3b33d27efdce 100644
> > > --- a/fs/xfs/xfs_extfree_item.c
> > > +++ b/fs/xfs/xfs_extfree_item.c
> > > @@ -336,6 +336,29 @@ xfs_trans_get_efd(
> > >  	return efdp;
> > >  }
> > >  
> > > +/*
> > > + * Fill the EFD with all extents from the EFI when we need to roll the
> > > + * transaction and continue with a new EFI.
> > > + */
> > > +static void
> > > +xfs_efd_from_efi(
> > > +	struct xfs_efd_log_item	*efdp)
> > > +{
> > > +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> > > +	uint                    i;
> > > +
> > > +	ASSERT(efip->efi_format.efi_nextents > 0);
> > > +
> > > +	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> > > +		return;
> > > +
> > > +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> > > +	       efdp->efd_format.efd_extents[i] =
> > > +		       efip->efi_format.efi_extents[i];
> > > +	}
> > > +	efdp->efd_next_extent = efip->efi_format.efi_nextents;
> > 
> > Odd question -- if we managed to free half the extents mentioned in an
> > EFI before hitting -EAGAIN, then efdp->efd_next_extent should already be
> > half of efip->efi_format.efi_nextents, right?
> 
> Yes, on success we normally update the EFD with the extent we just
> completed and move the EFI/EFD cursors forwards.
> 
> > I suppose it's duplicative (or maybe just careful) to recopy the entire
> > thing... but maybe that doesn't even really matter since no modern xlog
> > code actually pays attention to what's in the EFD aside from the ID
> > number.
> 
> *nod*
> 
> On second thoughts, now that you've questioned this behaviour, I
> need to go back and check my assumptions about what the intent
> creation is doing vs the current EFI vs the XEFI we are processing.
> The new EFI we log shouldn't have the extents we've completed in it,
> just the ones we haven't run, and I need to make sure that is
> actually what is happening here.

That shouldn't be happening -- each of the xfs_free_extent_later calls
below adds new incore EFIs to an xfs_defer_pending.dfp_work list and
each xfs_defer_pending itself gets added to xfs_trans.t_dfops.  The
defer_capture_and_commit function will turn the xfs_defer_pending into a
new EFI log item with the queued dfp_work items attached.

IOWs, as long as you don't call xfs_free_extent_later on any of the
xefi_startblock/blockcount pairs where xfs_trans_free_extent returned 0,
your assumptions are correct.

The code presented in this patch is correct.

--D

> > > @@ -652,9 +694,25 @@ xfs_efi_item_recover(
> > >  		fake.xefi_startblock = extp->ext_start;
> > >  		fake.xefi_blockcount = extp->ext_len;
> > >  
> > > -		xfs_extent_free_get_group(mp, &fake);
> > > -		error = xfs_trans_free_extent(tp, efdp, &fake);
> > > -		xfs_extent_free_put_group(&fake);
> > > +		if (!requeue_only) {
> > > +			xfs_extent_free_get_group(mp, &fake);
> > > +			error = xfs_trans_free_extent(tp, efdp, &fake);
> > > +			xfs_extent_free_put_group(&fake);
> > > +		}
> > > +
> > > +		/*
> > > +		 * If we can't free the extent without potentially deadlocking,
> > > +		 * requeue the rest of the extents to a new so that they get
> > > +		 * run again later with a new transaction context.
> > > +		 */
> > > +		if (error == -EAGAIN || requeue_only) {
> > > +			xfs_free_extent_later(tp, fake.xefi_startblock,
> > > +				fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
> > 
> > Shouldn't we check the return value of xfs_free_extent_later now?
> > I think we already did that above, but since you just plumbed in the
> > extra checks, we ought to use it. :)
> 
> Oh, right, my cscope tree needs updating, so I was thinking it is
> still a void function.
> 
> > (Also the indenting here isn't the usual two tabs)
> 
> I'll fix that too.
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15  1:41 ` [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
  2023-06-15  3:32   ` Darrick J. Wong
@ 2023-06-15 21:57   ` Wengang Wang
  2023-06-15 22:14     ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Wengang Wang @ 2023-06-15 21:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux

Hi Dave,

I got this error when applying this patch to the newest code:
(I tried on both MacOS and OL8, it’s the same result)

$ patch -p1 <~/tmp/Dave1.txt
patching file 'fs/xfs/libxfs/xfs_alloc.c'
patch: **** malformed patch at line 27:  */

Looking at the patch to see the line numbers:

 22 diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
 23 index c20fe99405d8..11bd0a1756a1 100644
 24 --- a/fs/xfs/libxfs/xfs_alloc.c
 25 +++ b/fs/xfs/libxfs/xfs_alloc.c
 26 @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
 27  */
 28 STATIC int
 29 xfs_alloc_ag_vextent_near(
 30 -   struct xfs_alloc_arg    *args)
 31 +   struct xfs_alloc_arg    *args,

thanks,
wengang

> On Jun 14, 2023, at 6:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> To avoid blocking in xfs_extent_busy_flush() when freeing extents
> and the only busy extents are held by the current transaction, we
> need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
> into xfs_extent_busy_flush().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
> fs/xfs/libxfs/xfs_alloc.h |  2 +-
> fs/xfs/xfs_extent_busy.c  |  3 +-
> fs/xfs/xfs_extent_busy.h  |  2 +-
> 4 files changed, 56 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index c20fe99405d8..11bd0a1756a1 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>  */
> STATIC int
> xfs_alloc_ag_vextent_near(
> - struct xfs_alloc_arg *args)
> + struct xfs_alloc_arg *args,
> + uint32_t alloc_flags)
> {
> struct xfs_alloc_cur acur = {};
> int error; /* error code */
> @@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near(
> if (acur.busy) {
> trace_xfs_alloc_near_busy(args);
> xfs_extent_busy_flush(args->mp, args->pag,
> -      acur.busy_gen);
> +      acur.busy_gen, alloc_flags);
> goto restart;
> }
> trace_xfs_alloc_size_neither(args);
> @@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near(
>  * and of the form k * prod + mod unless there's nothing that large.
>  * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
>  */
> -STATIC int /* error */
> +static int
> xfs_alloc_ag_vextent_size(
> - xfs_alloc_arg_t *args) /* allocation argument structure */
> + struct xfs_alloc_arg *args,
> + uint32_t alloc_flags)
> {
> - struct xfs_agf *agf = args->agbp->b_addr;
> - struct xfs_btree_cur *bno_cur; /* cursor for bno btree */
> - struct xfs_btree_cur *cnt_cur; /* cursor for cnt btree */
> - int error; /* error result */
> - xfs_agblock_t fbno; /* start of found freespace */
> - xfs_extlen_t flen; /* length of found freespace */
> - int i; /* temp status variable */
> - xfs_agblock_t rbno; /* returned block number */
> - xfs_extlen_t rlen; /* length of returned extent */
> - bool busy;
> - unsigned busy_gen;
> + struct xfs_agf *agf = args->agbp->b_addr;
> + struct xfs_btree_cur *bno_cur;
> + struct xfs_btree_cur *cnt_cur;
> + xfs_agblock_t fbno; /* start of found freespace */
> + xfs_extlen_t flen; /* length of found freespace */
> + xfs_agblock_t rbno; /* returned block number */
> + xfs_extlen_t rlen; /* length of returned extent */
> + bool busy;
> + unsigned busy_gen;
> + int error;
> + int i;
> 
> restart:
> /*
> @@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size(
> xfs_btree_del_cursor(cnt_cur,
>     XFS_BTREE_NOERROR);
> trace_xfs_alloc_size_busy(args);
> - xfs_extent_busy_flush(args->mp,
> - args->pag, busy_gen);
> + xfs_extent_busy_flush(args->mp, args->pag,
> + busy_gen, alloc_flags);
> goto restart;
> }
> }
> @@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size(
> if (busy) {
> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> trace_xfs_alloc_size_busy(args);
> - xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> + xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
> + alloc_flags);
> goto restart;
> }
> goto out_nominleft;
> @@ -2568,7 +2571,7 @@ xfs_exact_minlen_extent_available(
> int /* error */
> xfs_alloc_fix_freelist(
> struct xfs_alloc_arg *args, /* allocation argument structure */
> - int flags) /* XFS_ALLOC_FLAG_... */
> + uint32_t alloc_flags)
> {
> struct xfs_mount *mp = args->mp;
> struct xfs_perag *pag = args->pag;
> @@ -2584,7 +2587,7 @@ xfs_alloc_fix_freelist(
> ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> 
> if (!xfs_perag_initialised_agf(pag)) {
> - error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
> + error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
> if (error) {
> /* Couldn't lock the AGF so skip this AG. */
> if (error == -EAGAIN)
> @@ -2600,13 +2603,13 @@ xfs_alloc_fix_freelist(
> */
> if (xfs_perag_prefers_metadata(pag) &&
>    (args->datatype & XFS_ALLOC_USERDATA) &&
> -    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
> - ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
> +    (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
> + ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
> goto out_agbp_relse;
> }
> 
> need = xfs_alloc_min_freelist(mp, pag);
> - if (!xfs_alloc_space_available(args, need, flags |
> + if (!xfs_alloc_space_available(args, need, alloc_flags |
> XFS_ALLOC_FLAG_CHECK))
> goto out_agbp_relse;
> 
> @@ -2615,7 +2618,7 @@ xfs_alloc_fix_freelist(
> * Can fail if we're not blocking on locks, and it's held.
> */
> if (!agbp) {
> - error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
> + error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
> if (error) {
> /* Couldn't lock the AGF so skip this AG. */
> if (error == -EAGAIN)
> @@ -2630,7 +2633,7 @@ xfs_alloc_fix_freelist(
> 
> /* If there isn't enough total space or single-extent, reject it. */
> need = xfs_alloc_min_freelist(mp, pag);
> - if (!xfs_alloc_space_available(args, need, flags))
> + if (!xfs_alloc_space_available(args, need, alloc_flags))
> goto out_agbp_relse;
> 
> #ifdef DEBUG
> @@ -2668,11 +2671,12 @@ xfs_alloc_fix_freelist(
> */
> memset(&targs, 0, sizeof(targs));
> /* struct copy below */
> - if (flags & XFS_ALLOC_FLAG_NORMAP)
> + if (alloc_flags & XFS_ALLOC_FLAG_NORMAP)
> targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
> else
> targs.oinfo = XFS_RMAP_OINFO_AG;
> - while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
> + while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) &&
> + pag->pagf_flcount > need) {
> error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
> if (error)
> goto out_agbp_relse;
> @@ -2700,7 +2704,7 @@ xfs_alloc_fix_freelist(
> targs.resv = XFS_AG_RESV_AGFL;
> 
> /* Allocate as many blocks as possible at once. */
> - error = xfs_alloc_ag_vextent_size(&targs);
> + error = xfs_alloc_ag_vextent_size(&targs, alloc_flags);
> if (error)
> goto out_agflbp_relse;
> 
> @@ -2710,7 +2714,7 @@ xfs_alloc_fix_freelist(
> * on a completely full ag.
> */
> if (targs.agbno == NULLAGBLOCK) {
> - if (flags & XFS_ALLOC_FLAG_FREEING)
> + if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
> break;
> goto out_agflbp_relse;
> }
> @@ -3226,7 +3230,7 @@ xfs_alloc_vextent_check_args(
> static int
> xfs_alloc_vextent_prepare_ag(
> struct xfs_alloc_arg *args,
> - uint32_t flags)
> + uint32_t alloc_flags)
> {
> bool need_pag = !args->pag;
> int error;
> @@ -3235,7 +3239,7 @@ xfs_alloc_vextent_prepare_ag(
> args->pag = xfs_perag_get(args->mp, args->agno);
> 
> args->agbp = NULL;
> - error = xfs_alloc_fix_freelist(args, flags);
> + error = xfs_alloc_fix_freelist(args, alloc_flags);
> if (error) {
> trace_xfs_alloc_vextent_nofix(args);
> if (need_pag)
> @@ -3357,6 +3361,7 @@ xfs_alloc_vextent_this_ag(
> {
> struct xfs_mount *mp = args->mp;
> xfs_agnumber_t minimum_agno;
> + uint32_t alloc_flags = 0;
> int error;
> 
> ASSERT(args->pag != NULL);
> @@ -3375,9 +3380,9 @@ xfs_alloc_vextent_this_ag(
> return error;
> }
> 
> - error = xfs_alloc_vextent_prepare_ag(args, 0);
> + error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
> if (!error && args->agbp)
> - error = xfs_alloc_ag_vextent_size(args);
> + error = xfs_alloc_ag_vextent_size(args, alloc_flags);
> 
> return xfs_alloc_vextent_finish(args, minimum_agno, error, false);
> }
> @@ -3406,20 +3411,20 @@ xfs_alloc_vextent_iterate_ags(
> xfs_agnumber_t minimum_agno,
> xfs_agnumber_t start_agno,
> xfs_agblock_t target_agbno,
> - uint32_t flags)
> + uint32_t alloc_flags)
> {
> struct xfs_mount *mp = args->mp;
> xfs_agnumber_t restart_agno = minimum_agno;
> xfs_agnumber_t agno;
> int error = 0;
> 
> - if (flags & XFS_ALLOC_FLAG_TRYLOCK)
> + if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)
> restart_agno = 0;
> restart:
> for_each_perag_wrap_range(mp, start_agno, restart_agno,
> mp->m_sb.sb_agcount, agno, args->pag) {
> args->agno = agno;
> - error = xfs_alloc_vextent_prepare_ag(args, flags);
> + error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
> if (error)
> break;
> if (!args->agbp) {
> @@ -3433,10 +3438,10 @@ xfs_alloc_vextent_iterate_ags(
> */
> if (args->agno == start_agno && target_agbno) {
> args->agbno = target_agbno;
> - error = xfs_alloc_ag_vextent_near(args);
> + error = xfs_alloc_ag_vextent_near(args, alloc_flags);
> } else {
> args->agbno = 0;
> - error = xfs_alloc_ag_vextent_size(args);
> + error = xfs_alloc_ag_vextent_size(args, alloc_flags);
> }
> break;
> }
> @@ -3453,8 +3458,8 @@ xfs_alloc_vextent_iterate_ags(
> * constraining flags by the caller, drop them and retry the allocation
> * without any constraints being set.
> */
> - if (flags) {
> - flags = 0;
> + if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) {
> + alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK;
> restart_agno = minimum_agno;
> goto restart;
> }
> @@ -3482,6 +3487,7 @@ xfs_alloc_vextent_start_ag(
> xfs_agnumber_t start_agno;
> xfs_agnumber_t rotorstep = xfs_rotorstep;
> bool bump_rotor = false;
> + uint32_t alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
> int error;
> 
> ASSERT(args->pag == NULL);
> @@ -3508,7 +3514,7 @@ xfs_alloc_vextent_start_ag(
> 
> start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
> error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
> - XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK);
> + XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
> 
> if (bump_rotor) {
> if (args->agno == start_agno)
> @@ -3535,6 +3541,7 @@ xfs_alloc_vextent_first_ag(
> struct xfs_mount *mp = args->mp;
> xfs_agnumber_t minimum_agno;
> xfs_agnumber_t start_agno;
> + uint32_t alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
> int error;
> 
> ASSERT(args->pag == NULL);
> @@ -3553,7 +3560,7 @@ xfs_alloc_vextent_first_ag(
> 
> start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
> error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
> - XFS_FSB_TO_AGBNO(mp, target), 0);
> + XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
> return xfs_alloc_vextent_finish(args, minimum_agno, error, true);
> }
> 
> @@ -3606,6 +3613,7 @@ xfs_alloc_vextent_near_bno(
> struct xfs_mount *mp = args->mp;
> xfs_agnumber_t minimum_agno;
> bool needs_perag = args->pag == NULL;
> + uint32_t alloc_flags = 0;
> int error;
> 
> if (!needs_perag)
> @@ -3626,9 +3634,9 @@ xfs_alloc_vextent_near_bno(
> if (needs_perag)
> args->pag = xfs_perag_grab(mp, args->agno);
> 
> - error = xfs_alloc_vextent_prepare_ag(args, 0);
> + error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
> if (!error && args->agbp)
> - error = xfs_alloc_ag_vextent_near(args);
> + error = xfs_alloc_ag_vextent_near(args, alloc_flags);
> 
> return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag);
> }
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 85ac470be0da..d1aa7c63eafe 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp,
> struct xfs_buf **bpp);
> int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t,
> struct xfs_buf *, struct xfs_owner_info *);
> -int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
> +int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags);
> int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag,
> struct xfs_buf **agbp);
> 
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index f3d328e4a440..874798949625 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -571,7 +571,8 @@ void
> xfs_extent_busy_flush(
> struct xfs_mount *mp,
> struct xfs_perag *pag,
> - unsigned busy_gen)
> + unsigned busy_gen,
> + unsigned int flags)
> {
> DEFINE_WAIT (wait);
> int error;
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 4a118131059f..79ef5a2d7758 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
> 
> void
> xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> - unsigned busy_gen);
> + unsigned busy_gen, unsigned int flags);
> 
> void
> xfs_extent_busy_wait_all(struct xfs_mount *mp);
> -- 
> 2.40.1
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15 21:57   ` Wengang Wang
@ 2023-06-15 22:14     ` Dave Chinner
  2023-06-15 22:31       ` Wengang Wang
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-15 22:14 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs, chandanrlinux

On Thu, Jun 15, 2023 at 09:57:18PM +0000, Wengang Wang wrote:
> Hi Dave,
> 
> I got this error when applying this patch to the newest code:
> (I tried on both MacOS and OL8, it’s the same result)
> 
> $ patch -p1 <~/tmp/Dave1.txt
> patching file 'fs/xfs/libxfs/xfs_alloc.c'
> patch: **** malformed patch at line 27:  */

Whoa. I haven't use patch like this for a decade. :)

The way all the cool kids do this now is apply the entire series to
directly to a git tree branch with b4:

$ b4 am -o - 20230615014201.3171380-2-david@fromorbit.com | git am

(b4 shazam merges the b4 am and git am operations into the one
command, IIRC, but that trick isn't automatic for me yet :)
> Looking at the patch to see the line numbers:
> 
>  22 diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>  23 index c20fe99405d8..11bd0a1756a1 100644
>  24 --- a/fs/xfs/libxfs/xfs_alloc.c
>  25 +++ b/fs/xfs/libxfs/xfs_alloc.c
>  26 @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>  27  */
>  28 STATIC int
>  29 xfs_alloc_ag_vextent_near(

Yup, however you saved the patch to a file stripped the leading
spaces from all the lines in the patch.

If you look at the raw email on lore it has the correct leading
spaces in the patch.

https://lore.kernel.org/linux-xfs/20230615014201.3171380-2-david@fromorbit.com/raw

These sorts of patching problems go away when you use tools like b4
to pull the patches directly from the mailing list...

> > On Jun 14, 2023, at 6:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To avoid blocking in xfs_extent_busy_flush() when freeing extents
> > and the only busy extents are held by the current transaction, we
> > need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
> > into xfs_extent_busy_flush().
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
> > fs/xfs/libxfs/xfs_alloc.h |  2 +-
> > fs/xfs/xfs_extent_busy.c  |  3 +-
> > fs/xfs/xfs_extent_busy.h  |  2 +-
> > 4 files changed, 56 insertions(+), 47 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index c20fe99405d8..11bd0a1756a1 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
> >  */
> > STATIC int
> > xfs_alloc_ag_vextent_near(
> > - struct xfs_alloc_arg *args)
> > + struct xfs_alloc_arg *args,
> > + uint32_t alloc_flags)
> > {
> > struct xfs_alloc_cur acur = {};
> > int error; /* error code */

This indicates the problem is likely to be your mail program,
because the quoting it has done here has completely mangled all the
whitespace in the patch....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] xfs: allow extent free intents to be retried
  2023-06-15 14:41       ` Darrick J. Wong
@ 2023-06-15 22:21         ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2023-06-15 22:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandanrlinux, wen.gang.wang

On Thu, Jun 15, 2023 at 07:41:50AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 15, 2023 at 01:57:22PM +1000, Dave Chinner wrote:
> > On Wed, Jun 14, 2023 at 08:38:37PM -0700, Darrick J. Wong wrote:
> > > On Thu, Jun 15, 2023 at 11:42:00AM +1000, Dave Chinner wrote:
> > > > +/*
> > > > + * Fill the EFD with all extents from the EFI when we need to roll the
> > > > + * transaction and continue with a new EFI.
> > > > + */
> > > > +static void
> > > > +xfs_efd_from_efi(
> > > > +	struct xfs_efd_log_item	*efdp)
> > > > +{
> > > > +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> > > > +	uint                    i;
> > > > +
> > > > +	ASSERT(efip->efi_format.efi_nextents > 0);
> > > > +
> > > > +	if (efdp->efd_next_extent == efip->efi_format.efi_nextents)
> > > > +		return;
> > > > +
> > > > +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> > > > +	       efdp->efd_format.efd_extents[i] =
> > > > +		       efip->efi_format.efi_extents[i];
> > > > +	}
> > > > +	efdp->efd_next_extent = efip->efi_format.efi_nextents;
> > > 
> > > Odd question -- if we managed to free half the extents mentioned in an
> > > EFI before hitting -EAGAIN, then efdp->efd_next_extent should already be
> > > half of efip->efi_format.efi_nextents, right?
> > 
> > Yes, on success we normally update the EFD with the extent we just
> > completed and move the EFI/EFD cursors forwards.
> > 
> > > I suppose it's duplicative (or maybe just careful) to recopy the entire
> > > thing... but maybe that doesn't even really matter since no modern xlog
> > > code actually pays attention to what's in the EFD aside from the ID
> > > number.
> > 
> > *nod*
> > 
> > On second thoughts, now that you've questioned this behaviour, I
> > need to go back and check my assumptions about what the intent
> > creation is doing vs the current EFI vs the XEFI we are processing.
> > The new EFI we log shouldn't have the extents we've completed in it,
> > just the ones we haven't run, and I need to make sure that is
> > actually what is happening here.
> 
> That shouldn't be happening -- each of the xfs_free_extent_later calls
> below adds new incore EFIs to an xfs_defer_pending.dfp_work list and
> each xfs_defer_pending itself gets added to xfs_trans.t_dfops.  The
> defer_capture_and_commit function will turn the xfs_defer_pending into a
> new EFI log item with the queued dfp_work items attached.

Yup, I came to that conclusion when I went back over it again last
night. I added a few comments to the function about the methods
we might make to optimise it (i.e. only fill out from
efdp->efd_next_extent to efip->efi_format.efi_nextents) but also the
assumptions they rely on (xefis are always ordered in the same order
the efi extents are ordered) and the landmines that changing xefi
order processing might leave. Hence it's best just to copy the
entire EFI into the EFD and avoid all the possible silent corruption
problems that out-of-order xefi processing might cause...

> IOWs, as long as you don't call xfs_free_extent_later on any of the
> xefi_startblock/blockcount pairs where xfs_trans_free_extent returned 0,
> your assumptions are correct.
> 
> The code presented in this patch is correct.

Thanks for the double-check, I'll get the updated patches out in a
short while...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15 22:14     ` Dave Chinner
@ 2023-06-15 22:31       ` Wengang Wang
  2023-06-15 23:09         ` Wengang Wang
  0 siblings, 1 reply; 29+ messages in thread
From: Wengang Wang @ 2023-06-15 22:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 15, 2023, at 3:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Jun 15, 2023 at 09:57:18PM +0000, Wengang Wang wrote:
>> Hi Dave,
>> 
>> I got this error when applying this patch to the newest code:
>> (I tried on both MacOS and OL8, it’s the same result)
>> 
>> $ patch -p1 <~/tmp/Dave1.txt
>> patching file 'fs/xfs/libxfs/xfs_alloc.c'
>> patch: **** malformed patch at line 27:  */
> 
> Whoa. I haven't use patch like this for a decade. :)
> 
> The way all the cool kids do this now is apply the entire series to
> directly to a git tree branch with b4:
> 
> $ b4 am -o - 20230615014201.3171380-2-david@fromorbit.com | git am
> 
> (b4 shazam merges the b4 am and git am operations into the one
> command, IIRC, but that trick isn't automatic for me yet :)
>> Looking at the patch to see the line numbers:
>> 
>> 22 diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> 23 index c20fe99405d8..11bd0a1756a1 100644
>> 24 --- a/fs/xfs/libxfs/xfs_alloc.c
>> 25 +++ b/fs/xfs/libxfs/xfs_alloc.c
>> 26 @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>> 27  */
>> 28 STATIC int
>> 29 xfs_alloc_ag_vextent_near(
> 
> Yup, however you saved the patch to a file stripped the leading
> spaces from all the lines in the patch.
> 
> If you look at the raw email on lore it has the correct leading
> spaces in the patch.
> 
> https://lore.kernel.org/linux-xfs/20230615014201.3171380-2-david@fromorbit.com/raw
> 
> These sorts of patching problems go away when you use tools like b4
> to pull the patches directly from the mailing list...
> 
>>> On Jun 14, 2023, at 6:41 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> From: Dave Chinner <dchinner@redhat.com>
>>> 
>>> To avoid blocking in xfs_extent_busy_flush() when freeing extents
>>> and the only busy extents are held by the current transaction, we
>>> need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
>>> into xfs_extent_busy_flush().
>>> 
>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>> ---
>>> fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
>>> fs/xfs/libxfs/xfs_alloc.h |  2 +-
>>> fs/xfs/xfs_extent_busy.c  |  3 +-
>>> fs/xfs/xfs_extent_busy.h  |  2 +-
>>> 4 files changed, 56 insertions(+), 47 deletions(-)
>>> 
>>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>>> index c20fe99405d8..11bd0a1756a1 100644
>>> --- a/fs/xfs/libxfs/xfs_alloc.c
>>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>>> @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>>> */
>>> STATIC int
>>> xfs_alloc_ag_vextent_near(
>>> - struct xfs_alloc_arg *args)
>>> + struct xfs_alloc_arg *args,
>>> + uint32_t alloc_flags)
>>> {
>>> struct xfs_alloc_cur acur = {};
>>> int error; /* error code */
> 
> This indicates the problem is likely to be your mail program,
> because the quoting it has done here has completely mangled all the
> whitespace in the patch....
> 

Right, the .txt was not raw.
Now I got the raw formats and they applied to upstream master. Let me run log recover with my metadump, will report result later.

thanks,
wengang

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15 22:31       ` Wengang Wang
@ 2023-06-15 23:09         ` Wengang Wang
  2023-06-15 23:33           ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Wengang Wang @ 2023-06-15 23:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux

When mounting the problematic metadump with the patches, I see the following reported.

For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
=================================================
[   67.212496] loop: module loaded
[   67.214732] loop0: detected capacity change from 0 to 629137408
[   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
[   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
[   72.241546] XFS (loop0): Starting recovery (logdev: internal)
[   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
[   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8
[   92.250869] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
[   92.252301] Call Trace:
[   92.252784]  <TASK>
[   92.253166]  dump_stack_lvl+0x4c/0x70
[   92.253866]  dump_stack+0x14/0x20
[   92.254475]  xfs_corruption_error+0x94/0xa0 [xfs]
[   92.255571]  ? xfs_free_ag_extent+0x3f6/0x870 [xfs]
[   92.256595]  xfs_free_ag_extent+0x428/0x870 [xfs]
[   92.257609]  ? xfs_free_ag_extent+0x3f6/0x870 [xfs]
[   92.258638]  __xfs_free_extent+0x95/0x170 [xfs]
[   92.259618]  xfs_trans_free_extent+0x97/0x1c0 [xfs]
[   92.260672]  xfs_extent_free_finish_item+0x18/0x50 [xfs]
[   92.261823]  xfs_defer_finish_noroll+0x18e/0x6c0 [xfs]
[   92.262892]  __xfs_trans_commit+0x241/0x360 [xfs]
[   92.263900]  ? kfree+0x7c/0x120
[   92.264506]  xfs_trans_commit+0x14/0x20 [xfs]
[   92.265506]  xlog_recover_process_intents.isra.28+0x1db/0x2d0 [xfs]
[   92.266803]  xlog_recover_finish+0x35/0x300 [xfs]
[   92.267848]  ? __queue_delayed_work+0x5e/0x90
[   92.268634]  xfs_log_mount_finish+0xe8/0x170 [xfs]
[   92.269677]  xfs_mountfs+0x51a/0x900 [xfs]
[   92.270608]  xfs_fs_fill_super+0x4df/0x980 [xfs]
[   92.271642]  get_tree_bdev+0x1a3/0x280
[   92.272287]  ? __pfx_xfs_fs_fill_super+0x10/0x10 [xfs]
[   92.273415]  xfs_fs_get_tree+0x19/0x20 [xfs]
[   92.274374]  vfs_get_tree+0x2b/0xd0
[   92.275006]  ? capable+0x1d/0x30
[   92.275599]  path_mount+0x709/0xa40
[   92.276202]  ? putname+0x5d/0x70
[   92.276808]  do_mount+0x84/0xa0
[   92.277364]  __x64_sys_mount+0xd1/0xf0
[   92.278107]  do_syscall_64+0x3b/0x90
[   92.278779]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   92.279703] RIP: 0033:0x7fdc8683a26e
[   92.280801] Code: 48 8b 0d 1d 4c 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ea 4b 38 00 f7 d8 64 89 01 48
[   92.284752] RSP: 002b:00007ffc213711d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[   92.286460] RAX: ffffffffffffffda RBX: 0000561748023b40 RCX: 00007fdc8683a26e
[   92.288048] RDX: 0000561748032f70 RSI: 0000561748025a20 RDI: 000056174802b010
[   92.289665] RBP: 00007fdc87e58184 R08: 0000000000000000 R09: 0000000000000000
[   92.291315] R10: 00000000c0ed0000 R11: 0000000000000246 R12: 0000000000000000
[   92.292898] R13: 00000000c0ed0000 R14: 000056174802b010 R15: 0000561748032f70
[   92.294505]  </TASK>
[   92.295351] XFS (loop0): Corruption detected. Unmount and run xfs_repair
[   92.297073] XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_defer_finish_noroll+0x406/0x6c0 [xfs] (fs/xfs/libxfs/xfs_defer.c:575).  Shutting down filesystem.
[   92.300686] XFS (loop0): Please unmount the filesystem and rectify the problem(s)
[   92.302372] XFS (loop0): Failed to recover intents
[   92.303571] XFS (loop0): Ending recovery (logdev: internal)


I think that’s because that the same EFI record was going to be freed again
by xfs_extent_free_finish_item() after it already got freed by xfs_efi_item_recover().
I was trying to fix above issue in my previous patch by checking the intent
log item’s lsn and avoid running iop_recover() in xlog_recover_process_intents().

Now I am thinking if we can pass a flag, say XFS_EFI_PROCESSED, from
xfs_efi_item_recover() after it processed that record to the xfs_efi_log_item
memory structure somehow. In xfs_extent_free_finish_item(), we skip to process
that xfs_efi_log_item on seeing XFS_EFI_PROCESSED and return OK. By that
we can avoid the double free.

thanks,
wengang

> On Jun 15, 2023, at 3:31 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Jun 15, 2023, at 3:14 PM, Dave Chinner <david@fromorbit.com> wrote:
>> 
>> On Thu, Jun 15, 2023 at 09:57:18PM +0000, Wengang Wang wrote:
>>> Hi Dave,
>>> 
>>> I got this error when applying this patch to the newest code:
>>> (I tried on both MacOS and OL8, it’s the same result)
>>> 
>>> $ patch -p1 <~/tmp/Dave1.txt
>>> patching file 'fs/xfs/libxfs/xfs_alloc.c'
>>> patch: **** malformed patch at line 27:  */
>> 
>> Whoa. I haven't use patch like this for a decade. :)
>> 
>> The way all the cool kids do this now is apply the entire series to
>> directly to a git tree branch with b4:
>> 
>> $ b4 am -o - 20230615014201.3171380-2-david@fromorbit.com | git am
>> 
>> (b4 shazam merges the b4 am and git am operations into the one
>> command, IIRC, but that trick isn't automatic for me yet :)
>>> Looking at the patch to see the line numbers:
>>> 
>>> 22 diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>>> 23 index c20fe99405d8..11bd0a1756a1 100644
>>> 24 --- a/fs/xfs/libxfs/xfs_alloc.c
>>> 25 +++ b/fs/xfs/libxfs/xfs_alloc.c
>>> 26 @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>>> 27  */
>>> 28 STATIC int
>>> 29 xfs_alloc_ag_vextent_near(
>> 
>> Yup, however you saved the patch to a file stripped the leading
>> spaces from all the lines in the patch.
>> 
>> If you look at the raw email on lore it has the correct leading
>> spaces in the patch.
>> 
>> https://lore.kernel.org/linux-xfs/20230615014201.3171380-2-david@fromorbit.com/raw
>> 
>> These sorts of patching problems go away when you use tools like b4
>> to pull the patches directly from the mailing list...
>> 
>>>> On Jun 14, 2023, at 6:41 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>> 
>>>> From: Dave Chinner <dchinner@redhat.com>
>>>> 
>>>> To avoid blocking in xfs_extent_busy_flush() when freeing extents
>>>> and the only busy extents are held by the current transaction, we
>>>> need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
>>>> into xfs_extent_busy_flush().
>>>> 
>>>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>>>> ---
>>>> fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
>>>> fs/xfs/libxfs/xfs_alloc.h |  2 +-
>>>> fs/xfs/xfs_extent_busy.c  |  3 +-
>>>> fs/xfs/xfs_extent_busy.h  |  2 +-
>>>> 4 files changed, 56 insertions(+), 47 deletions(-)
>>>> 
>>>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>>>> index c20fe99405d8..11bd0a1756a1 100644
>>>> --- a/fs/xfs/libxfs/xfs_alloc.c
>>>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>>>> @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>>>> */
>>>> STATIC int
>>>> xfs_alloc_ag_vextent_near(
>>>> - struct xfs_alloc_arg *args)
>>>> + struct xfs_alloc_arg *args,
>>>> + uint32_t alloc_flags)
>>>> {
>>>> struct xfs_alloc_cur acur = {};
>>>> int error; /* error code */
>> 
>> This indicates the problem is likely to be your mail program,
>> because the quoting it has done here has completely mangled all the
>> whitespace in the patch....
>> 
> 
> Right, the .txt was not raw.
> Now I got the raw formats and they applied to upstream master. Let me run log recover with my metadump, will report result later.
> 
> thanks,
> wengang
> 
>> Cheers,
>> 
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15 23:09         ` Wengang Wang
@ 2023-06-15 23:33           ` Dave Chinner
  2023-06-15 23:51             ` Wengang Wang
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-15 23:33 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs, chandanrlinux

On Thu, Jun 15, 2023 at 11:09:41PM +0000, Wengang Wang wrote:
> When mounting the problematic metadump with the patches, I see the following reported.
> 
> For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
> =================================================
> [   67.212496] loop: module loaded
> [   67.214732] loop0: detected capacity change from 0 to 629137408
> [   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
> [   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
> [   72.241546] XFS (loop0): Starting recovery (logdev: internal)
> [   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
> [   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8

What is the test you are running? Please describe how you reproduced
this failure - a reproducer script would be the best thing here.

Does the test fail on a v5 filesytsem?

> I think that’s because that the same EFI record was going to be freed again
> by xfs_extent_free_finish_item() after it already got freed by xfs_efi_item_recover().
> I was trying to fix above issue in my previous patch by checking the intent
> log item’s lsn and avoid running iop_recover() in xlog_recover_process_intents().
> 
> Now I am thinking if we can pass a flag, say XFS_EFI_PROCESSED, from
> xfs_efi_item_recover() after it processed that record to the xfs_efi_log_item
> memory structure somehow. In xfs_extent_free_finish_item(), we skip to process
> that xfs_efi_log_item on seeing XFS_EFI_PROCESSED and return OK. By that
> we can avoid the double free.

I'm not really interested in speculation of the cause or the fix at
this point. I want to know how the problem is triggered so I can
work out exactly what caused it, along with why we don't have
coverage of this specific failure case in fstests already.

Indeed, if you have a script that is reproducing this, please turn
it into a fstests test so it becomes a regression test that is
always run...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15 23:33           ` Dave Chinner
@ 2023-06-15 23:51             ` Wengang Wang
  2023-06-16  0:17               ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Wengang Wang @ 2023-06-15 23:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 15, 2023, at 4:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Jun 15, 2023 at 11:09:41PM +0000, Wengang Wang wrote:
>> When mounting the problematic metadump with the patches, I see the following reported.
>> 
>> For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
>> =================================================
>> [   67.212496] loop: module loaded
>> [   67.214732] loop0: detected capacity change from 0 to 629137408
>> [   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
>> [   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
>> [   72.241546] XFS (loop0): Starting recovery (logdev: internal)
>> [   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
>> [   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8
> 
> What is the test you are running? Please describe how you reproduced
> this failure - a reproducer script would be the best thing here.

I was mounting a (copy of) V4 metadump from customer.

> 
> Does the test fail on a v5 filesytsem?

N/A.

> 
>> I think that’s because that the same EFI record was going to be freed again
>> by xfs_extent_free_finish_item() after it already got freed by xfs_efi_item_recover().
>> I was trying to fix above issue in my previous patch by checking the intent
>> log item’s lsn and avoid running iop_recover() in xlog_recover_process_intents().
>> 
>> Now I am thinking if we can pass a flag, say XFS_EFI_PROCESSED, from
>> xfs_efi_item_recover() after it processed that record to the xfs_efi_log_item
>> memory structure somehow. In xfs_extent_free_finish_item(), we skip to process
>> that xfs_efi_log_item on seeing XFS_EFI_PROCESSED and return OK. By that
>> we can avoid the double free.
> 
> I'm not really interested in speculation of the cause or the fix at
> this point. I want to know how the problem is triggered so I can
> work out exactly what caused it, along with why we don't have
> coverage of this specific failure case in fstests already.
> 

I get to know the cause by adding additional debug log along with my previous patch. 

> Indeed, if you have a script that is reproducing this, please turn
> it into a fstests test so it becomes a regression test that is
> always run...
> 

So far I don’t have such a script. Though I can try that, I am not sure if I can finish it shortly.
I am wondering what if we won’t have a stable reproducer soon?

thanks,
wengang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-15 23:51             ` Wengang Wang
@ 2023-06-16  0:17               ` Dave Chinner
  2023-06-16  0:42                 ` Wengang Wang
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-16  0:17 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs, chandanrlinux

On Thu, Jun 15, 2023 at 11:51:09PM +0000, Wengang Wang wrote:
> 
> 
> > On Jun 15, 2023, at 4:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Thu, Jun 15, 2023 at 11:09:41PM +0000, Wengang Wang wrote:
> >> When mounting the problematic metadump with the patches, I see the following reported.
> >> 
> >> For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
> >> =================================================
> >> [   67.212496] loop: module loaded
> >> [   67.214732] loop0: detected capacity change from 0 to 629137408
> >> [   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
> >> [   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
> >> [   72.241546] XFS (loop0): Starting recovery (logdev: internal)
> >> [   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
> >> [   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8
> > 
> > What is the test you are running? Please describe how you reproduced
> > this failure - a reproducer script would be the best thing here.
> 
> I was mounting a (copy of) V4 metadump from customer.

Is the metadump obfuscated? Can I get a copy of it via a private,
secure channel?

> > Does the test fail on a v5 filesytsem?
> 
> N/A.
> 
> > 
> >> I think that’s because that the same EFI record was going to be freed again
> >> by xfs_extent_free_finish_item() after it already got freed by xfs_efi_item_recover().

How is this happening? Where (and why) are we defering an extent we
have successfully freed into a new xefi that we create a new intent
for and then defer?

Can you post the debug output and analysis that lead you to this
observation? I certainly can't see how this can happen from looking
at the code

> >> I was trying to fix above issue in my previous patch by checking the intent
> >> log item’s lsn and avoid running iop_recover() in xlog_recover_process_intents().
> >> 
> >> Now I am thinking if we can pass a flag, say XFS_EFI_PROCESSED, from
> >> xfs_efi_item_recover() after it processed that record to the xfs_efi_log_item
> >> memory structure somehow. In xfs_extent_free_finish_item(), we skip to process
> >> that xfs_efi_log_item on seeing XFS_EFI_PROCESSED and return OK. By that
> >> we can avoid the double free.
> > 
> > I'm not really interested in speculation of the cause or the fix at
> > this point. I want to know how the problem is triggered so I can
> > work out exactly what caused it, along with why we don't have
> > coverage of this specific failure case in fstests already.
> > 
> 
> I get to know the cause by adding additional debug log along with
> my previous patch.

Can you please post that debug and analysis, rather than just a
stack trace that is completely lacking in context? Nothing can be
inferred from a stack trace, and what you are saying is occurring
does not match what the code should actually be doing. So I need to
actually look at what is happening in detail to work out where this
mismatch is coming from....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16  0:17               ` Dave Chinner
@ 2023-06-16  0:42                 ` Wengang Wang
  2023-06-16  4:27                   ` Wengang Wang
  2023-06-22  1:15                   ` Wengang Wang
  0 siblings, 2 replies; 29+ messages in thread
From: Wengang Wang @ 2023-06-16  0:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Jun 15, 2023 at 11:51:09PM +0000, Wengang Wang wrote:
>> 
>> 
>>> On Jun 15, 2023, at 4:33 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Thu, Jun 15, 2023 at 11:09:41PM +0000, Wengang Wang wrote:
>>>> When mounting the problematic metadump with the patches, I see the following reported.
>>>> 
>>>> For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
>>>> =================================================
>>>> [   67.212496] loop: module loaded
>>>> [   67.214732] loop0: detected capacity change from 0 to 629137408
>>>> [   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
>>>> [   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
>>>> [   72.241546] XFS (loop0): Starting recovery (logdev: internal)
>>>> [   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
>>>> [   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8
>>> 
>>> What is the test you are running? Please describe how you reproduced
>>> this failure - a reproducer script would be the best thing here.
>> 
>> I was mounting a (copy of) V4 metadump from customer.
> 
> Is the metadump obfuscated? Can I get a copy of it via a private,
> secure channel?

I am OK to give you a copy after I get approvement for that.

> 
>>> Does the test fail on a v5 filesytsem?
>> 
>> N/A.
>> 
>>> 
>>>> I think that’s because that the same EFI record was going to be freed again
>>>> by xfs_extent_free_finish_item() after it already got freed by xfs_efi_item_recover().
> 
> How is this happening? Where (and why) are we defering an extent we
> have successfully freed into a new xefi that we create a new intent
> for and then defer?
> 
> Can you post the debug output and analysis that lead you to this
> observation? I certainly can't see how this can happen from looking
> at the code
> 
>>>> I was trying to fix above issue in my previous patch by checking the intent
>>>> log item’s lsn and avoid running iop_recover() in xlog_recover_process_intents().
>>>> 
>>>> Now I am thinking if we can pass a flag, say XFS_EFI_PROCESSED, from
>>>> xfs_efi_item_recover() after it processed that record to the xfs_efi_log_item
>>>> memory structure somehow. In xfs_extent_free_finish_item(), we skip to process
>>>> that xfs_efi_log_item on seeing XFS_EFI_PROCESSED and return OK. By that
>>>> we can avoid the double free.
>>> 
>>> I'm not really interested in speculation of the cause or the fix at
>>> this point. I want to know how the problem is triggered so I can
>>> work out exactly what caused it, along with why we don't have
>>> coverage of this specific failure case in fstests already.
>>> 
>> 
>> I get to know the cause by adding additional debug log along with
>> my previous patch.
> 
> Can you please post that debug and analysis, rather than just a
> stack trace that is completely lacking in context? Nothing can be
> inferred from a stack trace, and what you are saying is occurring
> does not match what the code should actually be doing. So I need to
> actually look at what is happening in detail to work out where this
> mismatch is coming from....

The debug patch was based on my previous patch, I will rework the debug patch
basing on yours. I will share you the debug patch, output and my analysis later. 

thanks,
wengang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16  0:42                 ` Wengang Wang
@ 2023-06-16  4:27                   ` Wengang Wang
  2023-06-16  5:04                     ` Wengang Wang
  2023-06-16  7:36                     ` Dave Chinner
  2023-06-22  1:15                   ` Wengang Wang
  1 sibling, 2 replies; 29+ messages in thread
From: Wengang Wang @ 2023-06-16  4:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>> 
>> On Thu, Jun 15, 2023 at 11:51:09PM +0000, Wengang Wang wrote:
>>> 
>>> 
>>>> On Jun 15, 2023, at 4:33 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>> 
>>>> On Thu, Jun 15, 2023 at 11:09:41PM +0000, Wengang Wang wrote:
>>>>> When mounting the problematic metadump with the patches, I see the following reported.
>>>>> 
>>>>> For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
>>>>> =================================================
>>>>> [   67.212496] loop: module loaded
>>>>> [   67.214732] loop0: detected capacity change from 0 to 629137408
>>>>> [   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
>>>>> [   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
>>>>> [   72.241546] XFS (loop0): Starting recovery (logdev: internal)
>>>>> [   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
>>>>> [   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8
>>>> 
>>>> What is the test you are running? Please describe how you reproduced
>>>> this failure - a reproducer script would be the best thing here.
>>> 
>>> I was mounting a (copy of) V4 metadump from customer.
>> 
>> Is the metadump obfuscated? Can I get a copy of it via a private,
>> secure channel?
> 
> I am OK to give you a copy after I get approvement for that.
> 
>> 
>>>> Does the test fail on a v5 filesytsem?
>>> 
>>> N/A.
>>> 
>>>> 
>>>>> I think that’s because that the same EFI record was going to be freed again
>>>>> by xfs_extent_free_finish_item() after it already got freed by xfs_efi_item_recover().
>> 
>> How is this happening? Where (and why) are we defering an extent we
>> have successfully freed into a new xefi that we create a new intent
>> for and then defer?
>> 
>> Can you post the debug output and analysis that lead you to this
>> observation? I certainly can't see how this can happen from looking
>> at the code
>> 
>>>>> I was trying to fix above issue in my previous patch by checking the intent
>>>>> log item’s lsn and avoid running iop_recover() in xlog_recover_process_intents().
>>>>> 
>>>>> Now I am thinking if we can pass a flag, say XFS_EFI_PROCESSED, from
>>>>> xfs_efi_item_recover() after it processed that record to the xfs_efi_log_item
>>>>> memory structure somehow. In xfs_extent_free_finish_item(), we skip to process
>>>>> that xfs_efi_log_item on seeing XFS_EFI_PROCESSED and return OK. By that
>>>>> we can avoid the double free.
>>>> 
>>>> I'm not really interested in speculation of the cause or the fix at
>>>> this point. I want to know how the problem is triggered so I can
>>>> work out exactly what caused it, along with why we don't have
>>>> coverage of this specific failure case in fstests already.
>>>> 
>>> 
>>> I get to know the cause by adding additional debug log along with
>>> my previous patch.
>> 
>> Can you please post that debug and analysis, rather than just a
>> stack trace that is completely lacking in context? Nothing can be
>> inferred from a stack trace, and what you are saying is occurring
>> does not match what the code should actually be doing. So I need to
>> actually look at what is happening in detail to work out where this
>> mismatch is coming from....
> 
> The debug patch was based on my previous patch, I will rework the debug patch
> basing on yours. I will share you the debug patch, output and my analysis later. 
> 

My analysis:
It’s problem of double free. The first free is from xfs_efi_item_recover(), the
second free is from xfs_extent_free_finish_item().
Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
The double free problem begins when the -EAGAIN is returned.

1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
   operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
   see line 6.
3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
   And the deferred options is moved to capture_list. see line 9. The call path is:
   xfs_efi_item_recover()
     xfs_defer_ops_capture_and_commit(tp, capture_list)
       xfs_defer_ops_capture()
         xfs_defer_create_intents()
           xfs_defer_create_intent()
             ops->create_intent() —>   xfs_extent_free_create_intent()
4. The new EFI is committed with current transaction. see line 10.
5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
7. The mount process (log recover) continue work with other intents. line 21 to line 35.
8. The mount process (log recover) come to process the new EFI that was added to AIL
   at step 6. I guess the previous log items, which were added to AIL together with the
   EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
   iteration in xlog_recover_process_intents().  see line 36.
9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
   That’s the first free. see line 37.
10. The AIL intents iteration is done. It begins to process the capture_list.
11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
    added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
    and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
    So, it’s a double free issue.

ftrace log output:
 1            mount-4557    [003] ...1.   160.746048: xlog_recover_process_intents: last_lsn: 1b0000515f
 2            mount-4557    [003] ...1.   160.746049: xlog_recover_process_intents: processing intent 00000000e190b7a0 lsn=1a000067b4
 3            mount-4557    [003] .....   160.746059: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x2543cc4, 0x30)
 4            mount-4557    [003] .....   160.746089: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441ca, 0x30)
 5            mount-4557    [003] .....   160.746095: xfs_efi_item_recover: -EAGAIN for EFI record (0x25441ca, 0x30)
 6            mount-4557    [003] .....   160.746096: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441ca, 0x30) to trans 00000000e503eaf8
 7            mount-4557    [003] .....   160.746097: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441b0, 0x1)
 8            mount-4557    [003] .....   160.746098: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441b0, 0x1) to trans 00000000e503eaf8
 9            mount-4557    [003] .....   160.746099: xfs_extent_free_create_intent: new EFI 000000002027a359 to trans 00000000e503eaf8
10            mount-4557    [003] .....   160.746100: xfs_defer_ops_capture_and_commit: committing trans: 00000000e503eaf8
11            mount-4557    [003] ...1.   160.746109: xlog_recover_process_intents: processing intent 00000000def04d9f lsn=1a000069c4
12            mount-4557    [003] .....   160.746110: xfs_efi_item_recover: recover EFI 00000000def04d9f (0x22ad0b9, 0x1)
13    kworker/42:1H-502     [042] ...1.   161.025882: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
14    kworker/42:1H-502     [042] ...1.   161.025883: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
15    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 000000002190f497 to AIL
16    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 0000000043992ff9 to AIL
17    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 000000003554d2a3 to AIL
18    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
19    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000ee3155a9 to AIL
20    kworker/42:1H-502     [042] ...1.   161.025886: xfs_trans_ail_update_bulk: adding lip 000000002027a359 to AIL
21            mount-4557    [003] ...1.   161.025937: xlog_recover_process_intents: processing intent 000000005baced87 lsn=1a00006e20
22            mount-4557    [003] .....   161.025945: xfs_efi_item_recover: recover EFI 000000005baced87 (0x254425f, 0x2c)
23            mount-4557    [003] .....   161.025952: xfs_efi_item_recover: recover EFI 000000005baced87 (0x2544350, 0x14)
24            mount-4557    [003] ...1.   161.025961: xlog_recover_process_intents: processing intent 0000000032c6e417 lsn=1a000072aa
25            mount-4557    [003] .....   161.025962: xfs_efi_item_recover: recover EFI 0000000032c6e417 (0x20e30a4, 0x1)
26            mount-4557    [003] ...1.   161.025972: xlog_recover_process_intents: processing intent 000000001ce29d3a lsn=1a00007c5c
27            mount-4557    [003] .....   161.025973: xfs_efi_item_recover: recover EFI 000000001ce29d3a (0x20d2ec8, 0x1)
28            mount-4557    [003] ...1.   161.025980: xlog_recover_process_intents: processing intent 0000000021ef376d lsn=1a0000abfb
29            mount-4557    [003] .....   161.025981: xfs_efi_item_recover: recover EFI 0000000021ef376d (0x23552bd, 0x1)
30            mount-4557    [003] ...1.   161.025988: xlog_recover_process_intents: processing intent 00000000e1d79fc3 lsn=1a00011100
31            mount-4557    [003] .....   161.025989: xfs_efi_item_recover: recover EFI 00000000e1d79fc3 (0x258bf61, 0x1)
32            mount-4557    [003] ...1.   161.025996: xlog_recover_process_intents: processing intent 00000000a1474aa7 lsn=1a00017075
33            mount-4557    [003] .....   161.025998: xfs_efi_item_recover: recover EFI 00000000a1474aa7 (0x2028b96, 0x1)
34            mount-4557    [003] ...1.   161.026005: xlog_recover_process_intents: processing intent 000000005ae385b9 lsn=1a0001810e
35            mount-4557    [003] .....   161.026006: xfs_efi_item_recover: recover EFI 000000005ae385b9 (0x231579e, 0x1)
36            mount-4557    [003] ...1.   161.026013: xlog_recover_process_intents: processing intent 000000002027a359 lsn=1b0000515f
37            mount-4557    [003] .....   161.026014: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441ca, 0x30)
38            mount-4557    [003] .....   161.026019: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441b0, 0x1)
39    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
40    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 000000006c56fcb9 to AIL
41    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
42    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
43            mount-4557    [003] .N...   161.076277: xfs_extent_free_finish_item: failed, efi: 000000002027a359, (0x25441ca, 0x30)


The debug patch:
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
  2 index 7c675aae0a0f..0986af74012f 100644
  3 --- a/fs/xfs/libxfs/xfs_alloc.c
  4 +++ b/fs/xfs/libxfs/xfs_alloc.c
  5 @@ -2495,6 +2495,10 @@ xfs_defer_agfl_block(
  6         return 0;
  7  }
  8
  9 +bool should_debug(struct xfs_mount *mp)
 10 +{
 11 +       return strcmp(mp->m_super->s_id, "loop0") == 0;
 12 +}
 13  /*
 14   * Add the extent to the list of extents to be free at transaction end.
 15   * The list is maintained sorted (by block number).
 16 @@ -2551,6 +2555,9 @@ __xfs_free_extent_later(
 17                         XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
 18
 19         xfs_extent_free_get_group(mp, xefi);
 20 +       if (should_debug(mp))
 21 +               trace_printk("adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x%llx, 0x%x) to trans %p\n",
 22 +               xefi->xefi_startblock, xefi->xefi_blockcount, tp);
 23         xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list);
 24         return 0;
 25  }
 26 diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
 27 index bcfb6a4203cd..e4a4f39bdd7f 100644
 28 --- a/fs/xfs/libxfs/xfs_defer.c
 29 +++ b/fs/xfs/libxfs/xfs_defer.c
 30 @@ -342,6 +342,7 @@ xfs_defer_restore_resources(
 31         }
 32  }
 33
 34 +extern bool should_debug(struct xfs_mount *mp);
 35  /* Roll a transaction so we can do some deferred op processing. */
 36  STATIC int
 37  xfs_defer_trans_roll(
 38 @@ -799,6 +800,8 @@ xfs_defer_ops_capture_and_commit(
 39         if (!dfc)
 40                 return xfs_trans_commit(tp);
 41
 42 +       if (should_debug(mp))
 43 +               trace_printk("committing trans: %p\n", tp);
 44         /* Commit the transaction and add the capture structure to the list. */
 45         error = xfs_trans_commit(tp);
 46         if (error) {
 47 diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
 48 index 3b33d27efdce..2c19010d8e5b 100644
 49 --- a/fs/xfs/xfs_extfree_item.c
 50 +++ b/fs/xfs/xfs_extfree_item.c
 51 @@ -463,6 +463,7 @@ xfs_extent_free_log_item(
 52         extp->ext_len = xefi->xefi_blockcount;
 53  }
 54
55 +extern bool should_debug(struct xfs_mount *mp);
 56  static struct xfs_log_item *
 57  xfs_extent_free_create_intent(
 58         struct xfs_trans                *tp,
 59 @@ -476,6 +477,8 @@ xfs_extent_free_create_intent(
 60
 61         ASSERT(count > 0);
 62
 63 +       if (should_debug(mp))
 64 +               trace_printk("new EFI %p to trans %p\n", efip, tp);
 65         xfs_trans_add_item(tp, &efip->efi_item);
 66         if (sort)
 67                 list_sort(mp, items, xfs_extent_free_diff_items);
 68 @@ -524,6 +527,7 @@ xfs_extent_free_finish_item(
 69  {
 70         struct xfs_extent_free_item     *xefi;
 71         int                             error;
 72 +       struct xfs_efi_log_item         *efip = EFD_ITEM(done)->efd_efip;
 73
 74         xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 75
 76 @@ -536,6 +540,9 @@ xfs_extent_free_finish_item(
 77         if (error == -EAGAIN)
 78                 return error;
 79
 80 +       if (error && should_debug(tp->t_mountp))
 81 +               trace_printk("failed, efi: %p, (0x%llx, 0x%x)\n", efip, xefi->xefi_startblock, xefi->xefi_blockcount);
 82 +
 83         xfs_extent_free_put_group(xefi);
 84         kmem_cache_free(xfs_extfree_item_cache, xefi);
 85         return error;
 86 @@ -694,6 +701,9 @@ xfs_efi_item_recover(
 87                 fake.xefi_startblock = extp->ext_start;
 88                 fake.xefi_blockcount = extp->ext_len;
 89
 90 +               if (should_debug(mp))
 91 +                       trace_printk("recover EFI %p (0x%llx, 0x%x)\n", efip, fake.xefi_startblock, fake.xefi_blockcount);
 92 +
 93                 if (!requeue_only) {
 94                         xfs_extent_free_get_group(mp, &fake);
 95                         error = xfs_trans_free_extent(tp, efdp, &fake);
 96 @@ -706,6 +716,8 @@ xfs_efi_item_recover(
 97                  * run again later with a new transaction context.
 98                  */
 99                 if (error == -EAGAIN || requeue_only) {
100 +                       if (error == -EAGAIN && should_debug(mp))
101 +                               trace_printk("-EAGAIN for EFI record (0x%llx, 0x%x)\n", fake.xefi_startblock, fake.xefi_blockcount);
102                         xfs_free_extent_later(tp, fake.xefi_startblock,
103                                 fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
104                         requeue_only = true;
105 diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
106 index 82c81d20459d..a40ed9a2b70a 100644
107 --- a/fs/xfs/xfs_log_recover.c
108 +++ b/fs/xfs/xfs_log_recover.c
109 @@ -2515,6 +2515,7 @@ xlog_abort_defer_ops(
110         }
111  }
112
113 +extern bool should_debug(struct xfs_mount *mp);
114  /*
115   * When this is called, all of the log intent items which did not have
116   * corresponding log done items should be in the AIL.  What we do now is update
117 @@ -2540,15 +2541,16 @@ xlog_recover_process_intents(
118         struct xfs_log_item     *lip;
119         struct xfs_ail          *ailp;
120         int                     error = 0;
121 -#if defined(DEBUG) || defined(XFS_WARN)
122         xfs_lsn_t               last_lsn;
123 -#endif
124 +       int                     debug = 0;
125
126         ailp = log->l_ailp;
127         spin_lock(&ailp->ail_lock);
128 -#if defined(DEBUG) || defined(XFS_WARN)
129         last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
130 -#endif
131 +       if (should_debug(log->l_mp)) {
132 +               debug = 1;
133 +               trace_printk("last_lsn: %llx\n", last_lsn);
134 +       }
135         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
136              lip != NULL;
137              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
138 @@ -2563,7 +2565,8 @@ xlog_recover_process_intents(
139                  * of recovery.
140                  */
141                 ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
142 -
143 +               if (debug)
144 +                       trace_printk("processing intent %p lsn=%llx\n", lip, lip->li_lsn);
145                 /*
146                  * NOTE: If your intent processing routine can create more
147                  * deferred ops, you /must/ attach them to the capture list in
148 diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
149 index 8c0bfc9a33b1..42b50b4e1c47 100644
150 --- a/fs/xfs/xfs_trans.c
151 +++ b/fs/xfs/xfs_trans.c
152 @@ -1125,6 +1125,7 @@ xfs_trans_cancel(
153         xfs_trans_free(tp);
154  }
155
156 +extern bool should_debug(struct xfs_mount *mp);
157  /*
158   * Roll from one trans in the sequence of PERMANENT transactions to
159   * the next: permanent transactions are only flushed out when
160 diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
161 index 7d4109af193e..013889b18846 100644
162 --- a/fs/xfs/xfs_trans_ail.c
163 +++ b/fs/xfs/xfs_trans_ail.c
164 @@ -769,6 +769,7 @@ xfs_ail_update_finish(
165         xfs_log_space_wake(log->l_mp);
166  }
167
168 +extern bool should_debug(struct xfs_mount *mp);
169  /*
170   * xfs_trans_ail_update - bulk AIL insertion operation.
171   *
172 @@ -801,9 +802,10 @@ xfs_trans_ail_update_bulk(
173  {
174         struct xfs_log_item     *mlip;
175         xfs_lsn_t               tail_lsn = 0;
176 -       int                     i;
177 +       int                     i, debug;
178         LIST_HEAD(tmp);
179
180 +       debug = should_debug(ailp->ail_log->l_mp);
181         ASSERT(nr_items > 0);           /* Not required, but true. */
182         mlip = xfs_ail_min(ailp);
183
184 @@ -823,6 +825,8 @@ xfs_trans_ail_update_bulk(
185                         trace_xfs_ail_insert(lip, 0, lsn);
186                 }
187                 lip->li_lsn = lsn;
188 +               if (debug)
189 +                       trace_printk("adding lip %p to AIL\n", lip);
190                 list_add(&lip->li_ail, &tmp);
191         }
192


thanks,
wengang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16  4:27                   ` Wengang Wang
@ 2023-06-16  5:04                     ` Wengang Wang
  2023-06-16  7:36                     ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Wengang Wang @ 2023-06-16  5:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 15, 2023, at 9:27 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>> 
>> 
>> 
>>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>>> 
>>> On Thu, Jun 15, 2023 at 11:51:09PM +0000, Wengang Wang wrote:
>>>> 
>>>> 
>>>>> On Jun 15, 2023, at 4:33 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> 
>>>>> On Thu, Jun 15, 2023 at 11:09:41PM +0000, Wengang Wang wrote:
>>>>>> When mounting the problematic metadump with the patches, I see the following reported.
>>>>>> 
>>>>>> For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
>>>>>> =================================================
>>>>>> [   67.212496] loop: module loaded
>>>>>> [   67.214732] loop0: detected capacity change from 0 to 629137408
>>>>>> [   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
>>>>>> [   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
>>>>>> [   72.241546] XFS (loop0): Starting recovery (logdev: internal)
>>>>>> [   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
>>>>>> [   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8
>>>>> 
>>>>> What is the test you are running? Please describe how you reproduced
>>>>> this failure - a reproducer script would be the best thing here.
>>>> 
>>>> I was mounting a (copy of) V4 metadump from customer.
>>> 
>>> Is the metadump obfuscated? Can I get a copy of it via a private,
>>> secure channel?
>> 
>> I am OK to give you a copy after I get approvement for that.
>> 
>>> 
>>>>> Does the test fail on a v5 filesytsem?
>>>> 
>>>> N/A.
>>>> 
>>>>> 
>>>>>> I think that’s because that the same EFI record was going to be freed again
>>>>>> by xfs_extent_free_finish_item() after it already got freed by xfs_efi_item_recover().
>>> 
>>> How is this happening? Where (and why) are we defering an extent we
>>> have successfully freed into a new xefi that we create a new intent
>>> for and then defer?
>>> 
>>> Can you post the debug output and analysis that lead you to this
>>> observation? I certainly can't see how this can happen from looking
>>> at the code
>>> 
>>>>>> I was trying to fix above issue in my previous patch by checking the intent
>>>>>> log item’s lsn and avoid running iop_recover() in xlog_recover_process_intents().
>>>>>> 
>>>>>> Now I am thinking if we can pass a flag, say XFS_EFI_PROCESSED, from
>>>>>> xfs_efi_item_recover() after it processed that record to the xfs_efi_log_item
>>>>>> memory structure somehow. In xfs_extent_free_finish_item(), we skip to process
>>>>>> that xfs_efi_log_item on seeing XFS_EFI_PROCESSED and return OK. By that
>>>>>> we can avoid the double free.
>>>>> 
>>>>> I'm not really interested in speculation of the cause or the fix at
>>>>> this point. I want to know how the problem is triggered so I can
>>>>> work out exactly what caused it, along with why we don't have
>>>>> coverage of this specific failure case in fstests already.
>>>>> 
>>>> 
>>>> I get to know the cause by adding additional debug log along with
>>>> my previous patch.
>>> 
>>> Can you please post that debug and analysis, rather than just a
>>> stack trace that is completely lacking in context? Nothing can be
>>> inferred from a stack trace, and what you are saying is occurring
>>> does not match what the code should actually be doing. So I need to
>>> actually look at what is happening in detail to work out where this
>>> mismatch is coming from....
>> 
>> The debug patch was based on my previous patch, I will rework the debug patch
>> basing on yours. I will share you the debug patch, output and my analysis later. 
>> 
> 
> My analysis:
> It’s problem of double free. The first free is from xfs_efi_item_recover(), the
> second free is from xfs_extent_free_finish_item().
> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
> The double free problem begins when the -EAGAIN is returned.
> 
> 1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
>   operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
>   see line 6.
> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
>   And the deferred options is moved to capture_list. see line 9. The call path is:
>   xfs_efi_item_recover()
>     xfs_defer_ops_capture_and_commit(tp, capture_list)
>       xfs_defer_ops_capture()
>         xfs_defer_create_intents()
>           xfs_defer_create_intent()
>             ops->create_intent() —>   xfs_extent_free_create_intent()
> 4. The new EFI is committed with current transaction. see line 10.
> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
> 7. The mount process (log recover) continue work with other intents. line 21 to line 35.
> 8. The mount process (log recover) come to process the new EFI that was added to AIL
>   at step 6. I guess the previous log items, which were added to AIL together with the
>   EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
>   iteration in xlog_recover_process_intents().  see line 36.

Checking the code again in xfs_trans_ail_update_bulk(),
Every log items are added to the temporary list as head.  The new EFI was the last one
of those which were added to the temporary list. As a result, the new EFI is finally the
first one among those log items in the temp list to be added to AIL. That’s why we see
the new EFI before other newly added.

thanks,
wengang 

> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
>   That’s the first free. see line 37.
> 10. The AIL intents iteration is done. It begins to process the capture_list.
> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
>    added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
>    and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
>    So, it’s a double free issue.
> 
> ftrace log output:
> 1            mount-4557    [003] ...1.   160.746048: xlog_recover_process_intents: last_lsn: 1b0000515f
> 2            mount-4557    [003] ...1.   160.746049: xlog_recover_process_intents: processing intent 00000000e190b7a0 lsn=1a000067b4
> 3            mount-4557    [003] .....   160.746059: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x2543cc4, 0x30)
> 4            mount-4557    [003] .....   160.746089: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441ca, 0x30)
> 5            mount-4557    [003] .....   160.746095: xfs_efi_item_recover: -EAGAIN for EFI record (0x25441ca, 0x30)
> 6            mount-4557    [003] .....   160.746096: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441ca, 0x30) to trans 00000000e503eaf8
> 7            mount-4557    [003] .....   160.746097: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441b0, 0x1)
> 8            mount-4557    [003] .....   160.746098: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441b0, 0x1) to trans 00000000e503eaf8
> 9            mount-4557    [003] .....   160.746099: xfs_extent_free_create_intent: new EFI 000000002027a359 to trans 00000000e503eaf8
> 10            mount-4557    [003] .....   160.746100: xfs_defer_ops_capture_and_commit: committing trans: 00000000e503eaf8
> 11            mount-4557    [003] ...1.   160.746109: xlog_recover_process_intents: processing intent 00000000def04d9f lsn=1a000069c4
> 12            mount-4557    [003] .....   160.746110: xfs_efi_item_recover: recover EFI 00000000def04d9f (0x22ad0b9, 0x1)
> 13    kworker/42:1H-502     [042] ...1.   161.025882: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
> 14    kworker/42:1H-502     [042] ...1.   161.025883: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
> 15    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 000000002190f497 to AIL
> 16    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 0000000043992ff9 to AIL
> 17    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 000000003554d2a3 to AIL
> 18    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
> 19    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000ee3155a9 to AIL
> 20    kworker/42:1H-502     [042] ...1.   161.025886: xfs_trans_ail_update_bulk: adding lip 000000002027a359 to AIL
> 21            mount-4557    [003] ...1.   161.025937: xlog_recover_process_intents: processing intent 000000005baced87 lsn=1a00006e20
> 22            mount-4557    [003] .....   161.025945: xfs_efi_item_recover: recover EFI 000000005baced87 (0x254425f, 0x2c)
> 23            mount-4557    [003] .....   161.025952: xfs_efi_item_recover: recover EFI 000000005baced87 (0x2544350, 0x14)
> 24            mount-4557    [003] ...1.   161.025961: xlog_recover_process_intents: processing intent 0000000032c6e417 lsn=1a000072aa
> 25            mount-4557    [003] .....   161.025962: xfs_efi_item_recover: recover EFI 0000000032c6e417 (0x20e30a4, 0x1)
> 26            mount-4557    [003] ...1.   161.025972: xlog_recover_process_intents: processing intent 000000001ce29d3a lsn=1a00007c5c
> 27            mount-4557    [003] .....   161.025973: xfs_efi_item_recover: recover EFI 000000001ce29d3a (0x20d2ec8, 0x1)
> 28            mount-4557    [003] ...1.   161.025980: xlog_recover_process_intents: processing intent 0000000021ef376d lsn=1a0000abfb
> 29            mount-4557    [003] .....   161.025981: xfs_efi_item_recover: recover EFI 0000000021ef376d (0x23552bd, 0x1)
> 30            mount-4557    [003] ...1.   161.025988: xlog_recover_process_intents: processing intent 00000000e1d79fc3 lsn=1a00011100
> 31            mount-4557    [003] .....   161.025989: xfs_efi_item_recover: recover EFI 00000000e1d79fc3 (0x258bf61, 0x1)
> 32            mount-4557    [003] ...1.   161.025996: xlog_recover_process_intents: processing intent 00000000a1474aa7 lsn=1a00017075
> 33            mount-4557    [003] .....   161.025998: xfs_efi_item_recover: recover EFI 00000000a1474aa7 (0x2028b96, 0x1)
> 34            mount-4557    [003] ...1.   161.026005: xlog_recover_process_intents: processing intent 000000005ae385b9 lsn=1a0001810e
> 35            mount-4557    [003] .....   161.026006: xfs_efi_item_recover: recover EFI 000000005ae385b9 (0x231579e, 0x1)
> 36            mount-4557    [003] ...1.   161.026013: xlog_recover_process_intents: processing intent 000000002027a359 lsn=1b0000515f
> 37            mount-4557    [003] .....   161.026014: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441ca, 0x30)
> 38            mount-4557    [003] .....   161.026019: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441b0, 0x1)
> 39    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
> 40    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 000000006c56fcb9 to AIL
> 41    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
> 42    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
> 43            mount-4557    [003] .N...   161.076277: xfs_extent_free_finish_item: failed, efi: 000000002027a359, (0x25441ca, 0x30)
> 
> 
> The debug patch:
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>  2 index 7c675aae0a0f..0986af74012f 100644
>  3 --- a/fs/xfs/libxfs/xfs_alloc.c
>  4 +++ b/fs/xfs/libxfs/xfs_alloc.c
>  5 @@ -2495,6 +2495,10 @@ xfs_defer_agfl_block(
>  6         return 0;
>  7  }
>  8
>  9 +bool should_debug(struct xfs_mount *mp)
> 10 +{
> 11 +       return strcmp(mp->m_super->s_id, "loop0") == 0;
> 12 +}
> 13  /*
> 14   * Add the extent to the list of extents to be free at transaction end.
> 15   * The list is maintained sorted (by block number).
> 16 @@ -2551,6 +2555,9 @@ __xfs_free_extent_later(
> 17                         XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
> 18
> 19         xfs_extent_free_get_group(mp, xefi);
> 20 +       if (should_debug(mp))
> 21 +               trace_printk("adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x%llx, 0x%x) to trans %p\n",
> 22 +               xefi->xefi_startblock, xefi->xefi_blockcount, tp);
> 23         xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list);
> 24         return 0;
> 25  }
> 26 diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> 27 index bcfb6a4203cd..e4a4f39bdd7f 100644
> 28 --- a/fs/xfs/libxfs/xfs_defer.c
> 29 +++ b/fs/xfs/libxfs/xfs_defer.c
> 30 @@ -342,6 +342,7 @@ xfs_defer_restore_resources(
> 31         }
> 32  }
> 33
> 34 +extern bool should_debug(struct xfs_mount *mp);
> 35  /* Roll a transaction so we can do some deferred op processing. */
> 36  STATIC int
> 37  xfs_defer_trans_roll(
> 38 @@ -799,6 +800,8 @@ xfs_defer_ops_capture_and_commit(
> 39         if (!dfc)
> 40                 return xfs_trans_commit(tp);
> 41
> 42 +       if (should_debug(mp))
> 43 +               trace_printk("committing trans: %p\n", tp);
> 44         /* Commit the transaction and add the capture structure to the list. */
> 45         error = xfs_trans_commit(tp);
> 46         if (error) {
> 47 diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> 48 index 3b33d27efdce..2c19010d8e5b 100644
> 49 --- a/fs/xfs/xfs_extfree_item.c
> 50 +++ b/fs/xfs/xfs_extfree_item.c
> 51 @@ -463,6 +463,7 @@ xfs_extent_free_log_item(
> 52         extp->ext_len = xefi->xefi_blockcount;
> 53  }
> 54
> 55 +extern bool should_debug(struct xfs_mount *mp);
> 56  static struct xfs_log_item *
> 57  xfs_extent_free_create_intent(
> 58         struct xfs_trans                *tp,
> 59 @@ -476,6 +477,8 @@ xfs_extent_free_create_intent(
> 60
> 61         ASSERT(count > 0);
> 62
> 63 +       if (should_debug(mp))
> 64 +               trace_printk("new EFI %p to trans %p\n", efip, tp);
> 65         xfs_trans_add_item(tp, &efip->efi_item);
> 66         if (sort)
> 67                 list_sort(mp, items, xfs_extent_free_diff_items);
> 68 @@ -524,6 +527,7 @@ xfs_extent_free_finish_item(
> 69  {
> 70         struct xfs_extent_free_item     *xefi;
> 71         int                             error;
> 72 +       struct xfs_efi_log_item         *efip = EFD_ITEM(done)->efd_efip;
> 73
> 74         xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
> 75
> 76 @@ -536,6 +540,9 @@ xfs_extent_free_finish_item(
> 77         if (error == -EAGAIN)
> 78                 return error;
> 79
> 80 +       if (error && should_debug(tp->t_mountp))
> 81 +               trace_printk("failed, efi: %p, (0x%llx, 0x%x)\n", efip, xefi->xefi_startblock, xefi->xefi_blockcount);
> 82 +
> 83         xfs_extent_free_put_group(xefi);
> 84         kmem_cache_free(xfs_extfree_item_cache, xefi);
> 85         return error;
> 86 @@ -694,6 +701,9 @@ xfs_efi_item_recover(
> 87                 fake.xefi_startblock = extp->ext_start;
> 88                 fake.xefi_blockcount = extp->ext_len;
> 89
> 90 +               if (should_debug(mp))
> 91 +                       trace_printk("recover EFI %p (0x%llx, 0x%x)\n", efip, fake.xefi_startblock, fake.xefi_blockcount);
> 92 +
> 93                 if (!requeue_only) {
> 94                         xfs_extent_free_get_group(mp, &fake);
> 95                         error = xfs_trans_free_extent(tp, efdp, &fake);
> 96 @@ -706,6 +716,8 @@ xfs_efi_item_recover(
> 97                  * run again later with a new transaction context.
> 98                  */
> 99                 if (error == -EAGAIN || requeue_only) {
> 100 +                       if (error == -EAGAIN && should_debug(mp))
> 101 +                               trace_printk("-EAGAIN for EFI record (0x%llx, 0x%x)\n", fake.xefi_startblock, fake.xefi_blockcount);
> 102                         xfs_free_extent_later(tp, fake.xefi_startblock,
> 103                                 fake.xefi_blockcount, &XFS_RMAP_OINFO_ANY_OWNER);
> 104                         requeue_only = true;
> 105 diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> 106 index 82c81d20459d..a40ed9a2b70a 100644
> 107 --- a/fs/xfs/xfs_log_recover.c
> 108 +++ b/fs/xfs/xfs_log_recover.c
> 109 @@ -2515,6 +2515,7 @@ xlog_abort_defer_ops(
> 110         }
> 111  }
> 112
> 113 +extern bool should_debug(struct xfs_mount *mp);
> 114  /*
> 115   * When this is called, all of the log intent items which did not have
> 116   * corresponding log done items should be in the AIL.  What we do now is update
> 117 @@ -2540,15 +2541,16 @@ xlog_recover_process_intents(
> 118         struct xfs_log_item     *lip;
> 119         struct xfs_ail          *ailp;
> 120         int                     error = 0;
> 121 -#if defined(DEBUG) || defined(XFS_WARN)
> 122         xfs_lsn_t               last_lsn;
> 123 -#endif
> 124 +       int                     debug = 0;
> 125
> 126         ailp = log->l_ailp;
> 127         spin_lock(&ailp->ail_lock);
> 128 -#if defined(DEBUG) || defined(XFS_WARN)
> 129         last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> 130 -#endif
> 131 +       if (should_debug(log->l_mp)) {
> 132 +               debug = 1;
> 133 +               trace_printk("last_lsn: %llx\n", last_lsn);
> 134 +       }
> 135         for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> 136              lip != NULL;
> 137              lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> 138 @@ -2563,7 +2565,8 @@ xlog_recover_process_intents(
> 139                  * of recovery.
> 140                  */
> 141                 ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
> 142 -
> 143 +               if (debug)
> 144 +                       trace_printk("processing intent %p lsn=%llx\n", lip, lip->li_lsn);
> 145                 /*
> 146                  * NOTE: If your intent processing routine can create more
> 147                  * deferred ops, you /must/ attach them to the capture list in
> 148 diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> 149 index 8c0bfc9a33b1..42b50b4e1c47 100644
> 150 --- a/fs/xfs/xfs_trans.c
> 151 +++ b/fs/xfs/xfs_trans.c
> 152 @@ -1125,6 +1125,7 @@ xfs_trans_cancel(
> 153         xfs_trans_free(tp);
> 154  }
> 155
> 156 +extern bool should_debug(struct xfs_mount *mp);
> 157  /*
> 158   * Roll from one trans in the sequence of PERMANENT transactions to
> 159   * the next: permanent transactions are only flushed out when
> 160 diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> 161 index 7d4109af193e..013889b18846 100644
> 162 --- a/fs/xfs/xfs_trans_ail.c
> 163 +++ b/fs/xfs/xfs_trans_ail.c
> 164 @@ -769,6 +769,7 @@ xfs_ail_update_finish(
> 165         xfs_log_space_wake(log->l_mp);
> 166  }
> 167
> 168 +extern bool should_debug(struct xfs_mount *mp);
> 169  /*
> 170   * xfs_trans_ail_update - bulk AIL insertion operation.
> 171   *
> 172 @@ -801,9 +802,10 @@ xfs_trans_ail_update_bulk(
> 173  {
> 174         struct xfs_log_item     *mlip;
> 175         xfs_lsn_t               tail_lsn = 0;
> 176 -       int                     i;
> 177 +       int                     i, debug;
> 178         LIST_HEAD(tmp);
> 179
> 180 +       debug = should_debug(ailp->ail_log->l_mp);
> 181         ASSERT(nr_items > 0);           /* Not required, but true. */
> 182         mlip = xfs_ail_min(ailp);
> 183
> 184 @@ -823,6 +825,8 @@ xfs_trans_ail_update_bulk(
> 185                         trace_xfs_ail_insert(lip, 0, lsn);
> 186                 }
> 187                 lip->li_lsn = lsn;
> 188 +               if (debug)
> 189 +                       trace_printk("adding lip %p to AIL\n", lip);
> 190                 list_add(&lip->li_ail, &tmp);
> 191         }
> 192
> 
> 
> thanks,
> wengang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16  4:27                   ` Wengang Wang
  2023-06-16  5:04                     ` Wengang Wang
@ 2023-06-16  7:36                     ` Dave Chinner
  2023-06-16 17:43                       ` Wengang Wang
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-16  7:36 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs, chandanrlinux

On Fri, Jun 16, 2023 at 04:27:32AM +0000, Wengang Wang wrote:
> > On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> >> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> Can you please post that debug and analysis, rather than just a
> >> stack trace that is completely lacking in context? Nothing can be
> >> inferred from a stack trace, and what you are saying is occurring
> >> does not match what the code should actually be doing. So I need to
> >> actually look at what is happening in detail to work out where this
> >> mismatch is coming from....
> > 
> > The debug patch was based on my previous patch, I will rework the debug patch
> > basing on yours. I will share you the debug patch, output and my analysis later. 
> > 
> 
> My analysis:
> It’s problem of double free. The first free is from xfs_efi_item_recover(), the
> second free is from xfs_extent_free_finish_item().
> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
> The double free problem begins when the -EAGAIN is returned.
> 
> 1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
>    operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
>    see line 6.
> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
>    And the deferred options is moved to capture_list. see line 9. The call path is:
>    xfs_efi_item_recover()
>      xfs_defer_ops_capture_and_commit(tp, capture_list)
>        xfs_defer_ops_capture()
>          xfs_defer_create_intents()
>            xfs_defer_create_intent()
>              ops->create_intent() —>   xfs_extent_free_create_intent()
> 4. The new EFI is committed with current transaction. see line 10.
> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
> 7. The mount process (log recover) continue work with other intents. line 21 to line 35.
> 8. The mount process (log recover) come to process the new EFI that was added to AIL
>    at step 6. I guess the previous log items, which were added to AIL together with the
>    EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
>    iteration in xlog_recover_process_intents().  see line 36.
> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
>    That’s the first free. see line 37.
> 10. The AIL intents iteration is done. It begins to process the capture_list.
> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
>     added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
>     and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
>     So, it’s a double free issue.

Yes, this much I understand.

As I've been saying all along, there's something else wrong here,
because we should never encounter that new EFI in the recovery loop.
The recovery loop should see some other type of item and abort
before it sees that EFI.

So, on to the raw debug output.

> ftrace log output:
>  1            mount-4557    [003] ...1.   160.746048: xlog_recover_process_intents: last_lsn: 1b0000515f
>  2            mount-4557    [003] ...1.   160.746049: xlog_recover_process_intents: processing intent 00000000e190b7a0 lsn=1a000067b4

lsn of this EFI is 1a000067b.

>  3            mount-4557    [003] .....   160.746059: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x2543cc4, 0x30)
>  4            mount-4557    [003] .....   160.746089: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441ca, 0x30)
>  5            mount-4557    [003] .....   160.746095: xfs_efi_item_recover: -EAGAIN for EFI record (0x25441ca, 0x30)
>  6            mount-4557    [003] .....   160.746096: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441ca, 0x30) to trans 00000000e503eaf8
>  7            mount-4557    [003] .....   160.746097: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441b0, 0x1)
>  8            mount-4557    [003] .....   160.746098: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441b0, 0x1) to trans 00000000e503eaf8
>  9            mount-4557    [003] .....   160.746099: xfs_extent_free_create_intent: new EFI 000000002027a359 to trans 00000000e503eaf8
> 10            mount-4557    [003] .....   160.746100: xfs_defer_ops_capture_and_commit: committing trans: 00000000e503eaf8

Ah, XFS_DEFER_OPS_TYPE_AGFL_FREE? where did that come from? That op
type only occurs from xfs_agfl_defer_free(), and the problem here is
that -EAGAIN comes from when we are doing allocation for the AGFL.
We are not going anywhere near the code that should set that xefi op
type.

Oh, you hard coded that in the trace_printk().

Hmmm, and none of the actual xfs trace points that tell us what is
actually going on.

Ok. trace-cmd is your friend - you can get rid of that "should
debug" stuff in the debug code...

First, create a tmpfs filesystem, cd into it so trace-cmd will dump
it's tracing data without creating xfs trace events.

then run:

# trace-cmd record -e xfs\* -e printk
....

and leave it running.

In another shell, run:

# mount test.img /mnt/test
<hang or oops>

Go back to your trace-cmd shell and ctrl-c it. Wait for it to dump
all the trace events to the trace.dat file. Then run:

# trace-cmd report > trace.txt

the trace.txt file should have the entire mount trace in it, all the
xfs events, the trace-printk output and the console output. now you
can filter everything with grep, sed, awk, python, perl, any way you
like.

Anyway, back to the printk trace. the extents in the EFI are:

1. (0x2543cc4, 0x30)
2. (0x25441ca, 0x30)
3. (0x25441b0, 0x1)

Extent #1 was freed.

Extent #2 was not freed, it got -EAGAIN. It was deferred.
Extent #3 was deferred.

The commit was done, we have a new EFI 000000002027a359 created with
extents #2 and #3.

> 11            mount-4557    [003] ...1.   160.746109: xlog_recover_process_intents: processing intent 00000000def04d9f lsn=1a000069c4
> 12            mount-4557    [003] .....   160.746110: xfs_efi_item_recover: recover EFI 00000000def04d9f (0x22ad0b9, 0x1)
> 13    kworker/42:1H-502     [042] ...1.   161.025882: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
> 14    kworker/42:1H-502     [042] ...1.   161.025883: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
> 15    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 000000002190f497 to AIL
> 16    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 0000000043992ff9 to AIL
> 17    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 000000003554d2a3 to AIL
> 18    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
> 19    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000ee3155a9 to AIL
> 20    kworker/42:1H-502     [042] ...1.   161.025886: xfs_trans_ail_update_bulk: adding lip 000000002027a359 to AIL

And here we have what appears to be journal IO completion adding a
bunch of log items to the AIL. The last of those log items is
000000002027a359, which is the new EFI. Ok, so we should have at
least half a dozen items in the AIL before the new EFI. All good
here.

> 21            mount-4557    [003] ...1.   161.025937: xlog_recover_process_intents: processing intent 000000005baced87 lsn=1a00006e20
> 22            mount-4557    [003] .....   161.025945: xfs_efi_item_recover: recover EFI 000000005baced87 (0x254425f, 0x2c)
> 23            mount-4557    [003] .....   161.025952: xfs_efi_item_recover: recover EFI 000000005baced87 (0x2544350, 0x14)
> 24            mount-4557    [003] ...1.   161.025961: xlog_recover_process_intents: processing intent 0000000032c6e417 lsn=1a000072aa
> 25            mount-4557    [003] .....   161.025962: xfs_efi_item_recover: recover EFI 0000000032c6e417 (0x20e30a4, 0x1)
> 26            mount-4557    [003] ...1.   161.025972: xlog_recover_process_intents: processing intent 000000001ce29d3a lsn=1a00007c5c
> 27            mount-4557    [003] .....   161.025973: xfs_efi_item_recover: recover EFI 000000001ce29d3a (0x20d2ec8, 0x1)
> 28            mount-4557    [003] ...1.   161.025980: xlog_recover_process_intents: processing intent 0000000021ef376d lsn=1a0000abfb
> 29            mount-4557    [003] .....   161.025981: xfs_efi_item_recover: recover EFI 0000000021ef376d (0x23552bd, 0x1)
> 30            mount-4557    [003] ...1.   161.025988: xlog_recover_process_intents: processing intent 00000000e1d79fc3 lsn=1a00011100
> 31            mount-4557    [003] .....   161.025989: xfs_efi_item_recover: recover EFI 00000000e1d79fc3 (0x258bf61, 0x1)
> 32            mount-4557    [003] ...1.   161.025996: xlog_recover_process_intents: processing intent 00000000a1474aa7 lsn=1a00017075
> 33            mount-4557    [003] .....   161.025998: xfs_efi_item_recover: recover EFI 00000000a1474aa7 (0x2028b96, 0x1)
> 34            mount-4557    [003] ...1.   161.026005: xlog_recover_process_intents: processing intent 000000005ae385b9 lsn=1a0001810e
> 35            mount-4557    [003] .....   161.026006: xfs_efi_item_recover: recover EFI 000000005ae385b9 (0x231579e, 0x1)

lsn is 1a0001810e, so this must be the last of the pending recovery
intents.  Now we've finished all the pending intents that needed
recovery and....

> 36            mount-4557    [003] ...1.   161.026013: xlog_recover_process_intents: processing intent 000000002027a359 lsn=1b0000515f

... we immediately find the EFI at the new head of the AIL @
lsn=1b0000515f.

So what happened to all the items added to the AIL before this item?
We clearly see them being processed in the trace above, why didn't
we trip over one of them here and abort the recovery loop?

> 37            mount-4557    [003] .....   161.026014: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441ca, 0x30)
> 38            mount-4557    [003] .....   161.026019: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441b0, 0x1)
> 39    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
> 40    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 000000006c56fcb9 to AIL
> 41    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
> 42    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
> 43            mount-4557    [003] .N...   161.076277: xfs_extent_free_finish_item: failed, efi: 000000002027a359, (0x25441ca, 0x30)

Yeah, so that is the -third- attempt to free the extent, not the
second.

The trace makes it pretty obvious what is happening: there's an
ordering problem in bulk AIL insertion.

Can you test the patch below on top of the three in this patchset?

-Dave
-- 
Dave Chinner
david@fromorbit.com


xfs: don't reverse order of items in bulk AIL insertion

From: Dave Chinner <dchinner@redhat.com>

Log recovery depends on the items appearing in the AIL in the same
order that they are committed into the CIL. Unfortunately, the bulk
AIL insertion has been reversing the order of the items it pulls off
the CIL at checkpoint completion. This results in intents that are
created at the end of a commit being placed into the AIL before all
the other items that were modified in the commit are placed in the
AIL.

This can cause intent recovery from the log to fail to detect the
end of initial recovery correctly.

Lucky for us, all the items fed to bulk insertion have the same LSN,
otherwise this would have screwed up both the head and tail lsn
tracking in the log.

Make sure bulk AIL insertion does not reorder items.

Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans_ail.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..1098452e7f95 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk(
 			trace_xfs_ail_insert(lip, 0, lsn);
 		}
 		lip->li_lsn = lsn;
-		list_add(&lip->li_ail, &tmp);
+		list_add_tail(&lip->li_ail, &tmp);
 	}
 
 	if (!list_empty(&tmp))

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16  7:36                     ` Dave Chinner
@ 2023-06-16 17:43                       ` Wengang Wang
  2023-06-16 22:29                         ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Wengang Wang @ 2023-06-16 17:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 16, 2023, at 12:36 AM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Jun 16, 2023 at 04:27:32AM +0000, Wengang Wang wrote:
>>> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>>>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>> Can you please post that debug and analysis, rather than just a
>>>> stack trace that is completely lacking in context? Nothing can be
>>>> inferred from a stack trace, and what you are saying is occurring
>>>> does not match what the code should actually be doing. So I need to
>>>> actually look at what is happening in detail to work out where this
>>>> mismatch is coming from....
>>> 
>>> The debug patch was based on my previous patch, I will rework the debug patch
>>> basing on yours. I will share you the debug patch, output and my analysis later. 
>>> 
>> 
>> My analysis:
>> It’s problem of double free. The first free is from xfs_efi_item_recover(), the
>> second free is from xfs_extent_free_finish_item().
>> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
>> The double free problem begins when the -EAGAIN is returned.
>> 
>> 1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
>> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
>>   operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
>>   see line 6.
>> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
>>   And the deferred options is moved to capture_list. see line 9. The call path is:
>>   xfs_efi_item_recover()
>>     xfs_defer_ops_capture_and_commit(tp, capture_list)
>>       xfs_defer_ops_capture()
>>         xfs_defer_create_intents()
>>           xfs_defer_create_intent()
>>             ops->create_intent() —>   xfs_extent_free_create_intent()
>> 4. The new EFI is committed with current transaction. see line 10.
>> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
>> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
>> 7. The mount process (log recover) continue work with other intents. line 21 to line 35.
>> 8. The mount process (log recover) come to process the new EFI that was added to AIL
>>   at step 6. I guess the previous log items, which were added to AIL together with the
>>   EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
>>   iteration in xlog_recover_process_intents().  see line 36.
>> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
>>   That’s the first free. see line 37.
>> 10. The AIL intents iteration is done. It begins to process the capture_list.
>> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
>>    added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
>>    and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
>>    So, it’s a double free issue.
> 
> Yes, this much I understand.
> 
> As I've been saying all along, there's something else wrong here,
> because we should never encounter that new EFI in the recovery loop.
> The recovery loop should see some other type of item and abort
> before it sees that EFI.

Maybe you are right. But here is my concern, its possible that:
Even the were other types of items placed in AIL before that new EFI
(after your change list_add() -> list_add_tail()), suppose this scenario:

# suppose there were many log intents after the Original EFI (which needs retry)
   to process. 
1.  Non EFI type log items (say Item A to Item C) were added to AIL.
2.  That new EFI log item, Item D was added to AIL.
3.  Now the recovery process the remaining log intents. it takes long enough
     to let step 4 happen. 
4.  During the time that the mount (log recover) process was processing the
     remaining intents (suppose that needs long enough time), Item A, Item B
     and Item C (on AIL) were processed by xfs_aild (iop_push()) and them was
     removed from AIL later.  Possible?
5.  In the recovery intents loop, it comes to the first new log item, that is Item D,
     the new EFI.  (Item A to Item C didn’t come before Item D because they were
     removed in step 4).
6.  the records in the new EFI was freed by xfs_efi_item_recover()
7.  the same records in the same new EFI was freed by xfs_extent_free_finish_item()
     again.

If above is possible, the problem still exists even you make sure the new log
items in original order when they are placed to AIL.
 

> 
> So, on to the raw debug output.
> 
>> ftrace log output:
>> 1            mount-4557    [003] ...1.   160.746048: xlog_recover_process_intents: last_lsn: 1b0000515f
>> 2            mount-4557    [003] ...1.   160.746049: xlog_recover_process_intents: processing intent 00000000e190b7a0 lsn=1a000067b4
> 
> lsn of this EFI is 1a000067b.
> 
>> 3            mount-4557    [003] .....   160.746059: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x2543cc4, 0x30)
>> 4            mount-4557    [003] .....   160.746089: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441ca, 0x30)
>> 5            mount-4557    [003] .....   160.746095: xfs_efi_item_recover: -EAGAIN for EFI record (0x25441ca, 0x30)
>> 6            mount-4557    [003] .....   160.746096: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441ca, 0x30) to trans 00000000e503eaf8
>> 7            mount-4557    [003] .....   160.746097: xfs_efi_item_recover: recover EFI 00000000e190b7a0 (0x25441b0, 0x1)
>> 8            mount-4557    [003] .....   160.746098: __xfs_free_extent_later: adding deferred op XFS_DEFER_OPS_TYPE_AGFL_FREE (0x25441b0, 0x1) to trans 00000000e503eaf8
>> 9            mount-4557    [003] .....   160.746099: xfs_extent_free_create_intent: new EFI 000000002027a359 to trans 00000000e503eaf8
>> 10            mount-4557    [003] .....   160.746100: xfs_defer_ops_capture_and_commit: committing trans: 00000000e503eaf8
> 
> Ah, XFS_DEFER_OPS_TYPE_AGFL_FREE? where did that come from? That op
> type only occurs from xfs_agfl_defer_free(), and the problem here is
> that -EAGAIN comes from when we are doing allocation for the AGFL.
> We are not going anywhere near the code that should set that xefi op
> type.
> 
> Oh, you hard coded that in the trace_printk().

Yes, I copied it wrong, it should be XFS_DEFER_OPS_TYPE_FREE.

> 
> Hmmm, and none of the actual xfs trace points that tell us what is
> actually going on.
> 
> Ok. trace-cmd is your friend - you can get rid of that "should
> debug" stuff in the debug code...
> 
> First, create a tmpfs filesystem, cd into it so trace-cmd will dump
> it's tracing data without creating xfs trace events.
> 
> then run:
> 
> # trace-cmd record -e xfs\* -e printk
> ....
> 
> and leave it running.
> 
> In another shell, run:
> 
> # mount test.img /mnt/test
> <hang or oops>
> 
> Go back to your trace-cmd shell and ctrl-c it. Wait for it to dump
> all the trace events to the trace.dat file. Then run:
> 
> # trace-cmd report > trace.txt
> 
> the trace.txt file should have the entire mount trace in it, all the
> xfs events, the trace-printk output and the console output. now you
> can filter everything with grep, sed, awk, python, perl, any way you
> like.
> 

Good to know. Well, seems it can’t tell us the -EAGAIN thing? We don’t
have trace even in every function for every case.

> Anyway, back to the printk trace. the extents in the EFI are:
> 
> 1. (0x2543cc4, 0x30)
> 2. (0x25441ca, 0x30)
> 3. (0x25441b0, 0x1)
> 
> Extent #1 was freed.
> 
> Extent #2 was not freed, it got -EAGAIN. It was deferred.
> Extent #3 was deferred.
> 
> The commit was done, we have a new EFI 000000002027a359 created with
> extents #2 and #3.

Yes, correct.

> 
>> 11            mount-4557    [003] ...1.   160.746109: xlog_recover_process_intents: processing intent 00000000def04d9f lsn=1a000069c4
>> 12            mount-4557    [003] .....   160.746110: xfs_efi_item_recover: recover EFI 00000000def04d9f (0x22ad0b9, 0x1)
>> 13    kworker/42:1H-502     [042] ...1.   161.025882: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
>> 14    kworker/42:1H-502     [042] ...1.   161.025883: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
>> 15    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 000000002190f497 to AIL
>> 16    kworker/42:1H-502     [042] ...1.   161.025884: xfs_trans_ail_update_bulk: adding lip 0000000043992ff9 to AIL
>> 17    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 000000003554d2a3 to AIL
>> 18    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
>> 19    kworker/42:1H-502     [042] ...1.   161.025885: xfs_trans_ail_update_bulk: adding lip 00000000ee3155a9 to AIL
>> 20    kworker/42:1H-502     [042] ...1.   161.025886: xfs_trans_ail_update_bulk: adding lip 000000002027a359 to AIL
> 
> And here we have what appears to be journal IO completion adding a
> bunch of log items to the AIL. The last of those log items is
> 000000002027a359, which is the new EFI. Ok, so we should have at
> least half a dozen items in the AIL before the new EFI. All good
> here.

yes.

> 
>> 21            mount-4557    [003] ...1.   161.025937: xlog_recover_process_intents: processing intent 000000005baced87 lsn=1a00006e20
>> 22            mount-4557    [003] .....   161.025945: xfs_efi_item_recover: recover EFI 000000005baced87 (0x254425f, 0x2c)
>> 23            mount-4557    [003] .....   161.025952: xfs_efi_item_recover: recover EFI 000000005baced87 (0x2544350, 0x14)
>> 24            mount-4557    [003] ...1.   161.025961: xlog_recover_process_intents: processing intent 0000000032c6e417 lsn=1a000072aa
>> 25            mount-4557    [003] .....   161.025962: xfs_efi_item_recover: recover EFI 0000000032c6e417 (0x20e30a4, 0x1)
>> 26            mount-4557    [003] ...1.   161.025972: xlog_recover_process_intents: processing intent 000000001ce29d3a lsn=1a00007c5c
>> 27            mount-4557    [003] .....   161.025973: xfs_efi_item_recover: recover EFI 000000001ce29d3a (0x20d2ec8, 0x1)
>> 28            mount-4557    [003] ...1.   161.025980: xlog_recover_process_intents: processing intent 0000000021ef376d lsn=1a0000abfb
>> 29            mount-4557    [003] .....   161.025981: xfs_efi_item_recover: recover EFI 0000000021ef376d (0x23552bd, 0x1)
>> 30            mount-4557    [003] ...1.   161.025988: xlog_recover_process_intents: processing intent 00000000e1d79fc3 lsn=1a00011100
>> 31            mount-4557    [003] .....   161.025989: xfs_efi_item_recover: recover EFI 00000000e1d79fc3 (0x258bf61, 0x1)
>> 32            mount-4557    [003] ...1.   161.025996: xlog_recover_process_intents: processing intent 00000000a1474aa7 lsn=1a00017075
>> 33            mount-4557    [003] .....   161.025998: xfs_efi_item_recover: recover EFI 00000000a1474aa7 (0x2028b96, 0x1)
>> 34            mount-4557    [003] ...1.   161.026005: xlog_recover_process_intents: processing intent 000000005ae385b9 lsn=1a0001810e
>> 35            mount-4557    [003] .....   161.026006: xfs_efi_item_recover: recover EFI 000000005ae385b9 (0x231579e, 0x1)
> 
> lsn is 1a0001810e, so this must be the last of the pending recovery
> intents.  Now we've finished all the pending intents that needed
> recovery and....
> 
>> 36            mount-4557    [003] ...1.   161.026013: xlog_recover_process_intents: processing intent 000000002027a359 lsn=1b0000515f
> 
> ... we immediately find the EFI at the new head of the AIL @
> lsn=1b0000515f.
> 
> So what happened to all the items added to the AIL before this item?
> We clearly see them being processed in the trace above, why didn't
> we trip over one of them here and abort the recovery loop?

I think you got the reason later.  list_add() -> list_add_tail().

> 
>> 37            mount-4557    [003] .....   161.026014: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441ca, 0x30)
>> 38            mount-4557    [003] .....   161.026019: xfs_efi_item_recover: recover EFI 000000002027a359 (0x25441b0, 0x1)
>> 39    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 00000000cc1ccecb to AIL
>> 40    kworker/42:1H-502     [042] ...1.   161.027123: xfs_trans_ail_update_bulk: adding lip 000000006c56fcb9 to AIL
>> 41    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000823e9198 to AIL
>> 42    kworker/42:1H-502     [042] ...1.   161.027124: xfs_trans_ail_update_bulk: adding lip 00000000b8a3774a to AIL
>> 43            mount-4557    [003] .N...   161.076277: xfs_extent_free_finish_item: failed, efi: 000000002027a359, (0x25441ca, 0x30)
> 
> Yeah, so that is the -third- attempt to free the extent, not the
> second.

If you say ‘attempt’, yes.  When I said the ‘second', I meant the real free.

> 
> The trace makes it pretty obvious what is happening: there's an
> ordering problem in bulk AIL insertion.
> 
> Can you test the patch below on top of the three in this patchset?

This maybe work for this specific mounting, but it looks a bit tricky to me.
Please see my comment above. I meant the below patch maybe can ‘fix’ the issue
occurred to this specific xfs volume(metadump), but will is it sure to fix all cases?

I don’t understand that why only the new intent which stand at the first place of all
the new log items in AIL needs to processed in the AIL loop, but no for others.
That’s sounds strange to me. And why we can’t just stop/break before the new log
item no matter what it is as new intents are processed safely with the deferred
option list. 

thanks,
wengang
> 
> -Dave
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> xfs: don't reverse order of items in bulk AIL insertion
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Log recovery depends on the items appearing in the AIL in the same
> order that they are committed into the CIL. Unfortunately, the bulk
> AIL insertion has been reversing the order of the items it pulls off
> the CIL at checkpoint completion. This results in intents that are
> created at the end of a commit being placed into the AIL before all
> the other items that were modified in the commit are placed in the
> AIL.
> 
> This can cause intent recovery from the log to fail to detect the
> end of initial recovery correctly.
> 
> Lucky for us, all the items fed to bulk insertion have the same LSN,
> otherwise this would have screwed up both the head and tail lsn
> tracking in the log.
> 
> Make sure bulk AIL insertion does not reorder items.
> 
> Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_trans_ail.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 7d4109af193e..1098452e7f95 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk(
> trace_xfs_ail_insert(lip, 0, lsn);
> }
> lip->li_lsn = lsn;
> - list_add(&lip->li_ail, &tmp);
> + list_add_tail(&lip->li_ail, &tmp);
> }
> 
> if (!list_empty(&tmp))


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16 17:43                       ` Wengang Wang
@ 2023-06-16 22:29                         ` Dave Chinner
  2023-06-16 22:53                           ` Wengang Wang
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-16 22:29 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs, chandanrlinux

On Fri, Jun 16, 2023 at 05:43:42PM +0000, Wengang Wang wrote:
> > On Jun 16, 2023, at 12:36 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Jun 16, 2023 at 04:27:32AM +0000, Wengang Wang wrote:
> >>> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> >>>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>>> Can you please post that debug and analysis, rather than just a
> >>>> stack trace that is completely lacking in context? Nothing can be
> >>>> inferred from a stack trace, and what you are saying is occurring
> >>>> does not match what the code should actually be doing. So I need to
> >>>> actually look at what is happening in detail to work out where this
> >>>> mismatch is coming from....
> >>> 
> >>> The debug patch was based on my previous patch, I will rework the debug patch
> >>> basing on yours. I will share you the debug patch, output and my analysis later. 
> >>> 
> >> 
> >> My analysis:
> >> It’s problem of double free. The first free is from xfs_efi_item_recover(), the
> >> second free is from xfs_extent_free_finish_item().
> >> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
> >> The double free problem begins when the -EAGAIN is returned.
> >> 
> >> 1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
> >> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
> >>   operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
> >>   see line 6.
> >> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
> >>   And the deferred options is moved to capture_list. see line 9. The call path is:
> >>   xfs_efi_item_recover()
> >>     xfs_defer_ops_capture_and_commit(tp, capture_list)
> >>       xfs_defer_ops_capture()
> >>         xfs_defer_create_intents()
> >>           xfs_defer_create_intent()
> >>             ops->create_intent() —>   xfs_extent_free_create_intent()
> >> 4. The new EFI is committed with current transaction. see line 10.
> >> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
> >> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
> >> 7. The mount process (log recover) continue work with other intents. line 21 to line 35.
> >> 8. The mount process (log recover) come to process the new EFI that was added to AIL
> >>   at step 6. I guess the previous log items, which were added to AIL together with the
> >>   EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
> >>   iteration in xlog_recover_process_intents().  see line 36.
> >> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
> >>   That’s the first free. see line 37.
> >> 10. The AIL intents iteration is done. It begins to process the capture_list.
> >> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
> >>    added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
> >>    and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
> >>    So, it’s a double free issue.
> > 
> > Yes, this much I understand.
> > 
> > As I've been saying all along, there's something else wrong here,
> > because we should never encounter that new EFI in the recovery loop.
> > The recovery loop should see some other type of item and abort
> > before it sees that EFI.
> 
> Maybe you are right. But here is my concern, its possible that:
> Even the were other types of items placed in AIL before that new EFI
> (after your change list_add() -> list_add_tail()), suppose this scenario:
> 
> # suppose there were many log intents after the Original EFI (which needs retry)
>    to process. 
> 1.  Non EFI type log items (say Item A to Item C) were added to AIL.
> 2.  That new EFI log item, Item D was added to AIL.
> 3.  Now the recovery process the remaining log intents. it takes long enough
>      to let step 4 happen. 
> 4.  During the time that the mount (log recover) process was processing the
>      remaining intents (suppose that needs long enough time), Item A, Item B
>      and Item C (on AIL) were processed by xfs_aild (iop_push()) and them was
>      removed from AIL later.  Possible?
> 5.  In the recovery intents loop, it comes to the first new log item, that is Item D,
>      the new EFI.  (Item A to Item C didn’t come before Item D because they were
>      removed in step 4).
> 6.  the records in the new EFI was freed by xfs_efi_item_recover()
> 7.  the same records in the same new EFI was freed by xfs_extent_free_finish_item()
>      again.
> 
> If above is possible, the problem still exists even you make sure the new log
> items in original order when they are placed to AIL.

Yes, I *know this can happen*.

What I've been trying to understand is *how the bug occurred in the
first place*. 

Root cause analysis comes first. Second is confirming the analysis
is correct. Only then do we really start thinking about how to fix
it properly.

We are at the second step: I need *empirical confirmation* that this
ordering problem is the root cause of the behaviour that was
observed.

Stop trying to fix the problem before we understand how it happened!

> > Go back to your trace-cmd shell and ctrl-c it. Wait for it to dump
> > all the trace events to the trace.dat file. Then run:
> > 
> > # trace-cmd report > trace.txt
> > 
> > the trace.txt file should have the entire mount trace in it, all the
> > xfs events, the trace-printk output and the console output. now you
> > can filter everything with grep, sed, awk, python, perl, any way you
> > like.
> > 
> 
> Good to know. Well, seems it can’t tell us the -EAGAIN thing? We don’t
> have trace even in every function for every case.

We don't need an explicit trace for that. If we hit that code path
in recovery, we will see a xfs_bmap_free_deferred trace for an
extent from recovery, followed by a xfs_alloc_size_busy/xfs_alloc_near_busy
event followed immediately by a xfs_bmap_free_defer event for the
extent we were trying to free.

That tells us an EAGAIN was received and the extent free was
deferred.

All the tracepoints we need to tell us what is happening in this
path, one just needs to connect the dots correctly.

> > The trace makes it pretty obvious what is happening: there's an
> > ordering problem in bulk AIL insertion.
> > 
> > Can you test the patch below on top of the three in this patchset?
> 
> This maybe work for this specific mounting, but it looks a bit tricky to me.
> Please see my comment above. I meant the below patch maybe can ‘fix’ the issue
> occurred to this specific xfs volume(metadump), but will is it sure to fix all cases?

I don't care about "other cases" right now. All I'm trying to do is
confirm that this is the *root cause* of *this problem*.

I am aware that this is a test patch, and that I wrote it purely to
confirm my hypothesis. I am aware that it doesn't fix everything and
I am aware that I am not *trying to fix everything* with it. I need
confirmation that my root cause analysis is correct before I spend
any time on working out how to fix the problem completely.

So, can you please just test the patch and see if the problem is
fixed?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16 22:29                         ` Dave Chinner
@ 2023-06-16 22:53                           ` Wengang Wang
  2023-06-16 23:14                             ` Wengang Wang
  0 siblings, 1 reply; 29+ messages in thread
From: Wengang Wang @ 2023-06-16 22:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 16, 2023, at 3:29 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Jun 16, 2023 at 05:43:42PM +0000, Wengang Wang wrote:
>>> On Jun 16, 2023, at 12:36 AM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Fri, Jun 16, 2023 at 04:27:32AM +0000, Wengang Wang wrote:
>>>>> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>>>>>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>> Can you please post that debug and analysis, rather than just a
>>>>>> stack trace that is completely lacking in context? Nothing can be
>>>>>> inferred from a stack trace, and what you are saying is occurring
>>>>>> does not match what the code should actually be doing. So I need to
>>>>>> actually look at what is happening in detail to work out where this
>>>>>> mismatch is coming from....
>>>>> 
>>>>> The debug patch was based on my previous patch, I will rework the debug patch
>>>>> basing on yours. I will share you the debug patch, output and my analysis later. 
>>>>> 
>>>> 
>>>> My analysis:
>>>> It’s problem of double free. The first free is from xfs_efi_item_recover(), the
>>>> second free is from xfs_extent_free_finish_item().
>>>> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
>>>> The double free problem begins when the -EAGAIN is returned.
>>>> 
>>>> 1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
>>>> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
>>>>  operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
>>>>  see line 6.
>>>> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
>>>>  And the deferred options is moved to capture_list. see line 9. The call path is:
>>>>  xfs_efi_item_recover()
>>>>    xfs_defer_ops_capture_and_commit(tp, capture_list)
>>>>      xfs_defer_ops_capture()
>>>>        xfs_defer_create_intents()
>>>>          xfs_defer_create_intent()
>>>>            ops->create_intent() —>   xfs_extent_free_create_intent()
>>>> 4. The new EFI is committed with current transaction. see line 10.
>>>> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
>>>> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
>>>> 7. The mount process (log recover) continue work with other intents. line 21 to line 35.
>>>> 8. The mount process (log recover) come to process the new EFI that was added to AIL
>>>>  at step 6. I guess the previous log items, which were added to AIL together with the
>>>>  EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
>>>>  iteration in xlog_recover_process_intents().  see line 36.
>>>> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
>>>>  That’s the first free. see line 37.
>>>> 10. The AIL intents iteration is done. It begins to process the capture_list.
>>>> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
>>>>   added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
>>>>   and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
>>>>   So, it’s a double free issue.
>>> 
>>> Yes, this much I understand.
>>> 
>>> As I've been saying all along, there's something else wrong here,
>>> because we should never encounter that new EFI in the recovery loop.
>>> The recovery loop should see some other type of item and abort
>>> before it sees that EFI.
>> 
>> Maybe you are right. But here is my concern, its possible that:
>> Even the were other types of items placed in AIL before that new EFI
>> (after your change list_add() -> list_add_tail()), suppose this scenario:
>> 
>> # suppose there were many log intents after the Original EFI (which needs retry)
>>   to process. 
>> 1.  Non EFI type log items (say Item A to Item C) were added to AIL.
>> 2.  That new EFI log item, Item D was added to AIL.
>> 3.  Now the recovery process the remaining log intents. it takes long enough
>>     to let step 4 happen. 
>> 4.  During the time that the mount (log recover) process was processing the
>>     remaining intents (suppose that needs long enough time), Item A, Item B
>>     and Item C (on AIL) were processed by xfs_aild (iop_push()) and them was
>>     removed from AIL later.  Possible?
>> 5.  In the recovery intents loop, it comes to the first new log item, that is Item D,
>>     the new EFI.  (Item A to Item C didn’t come before Item D because they were
>>     removed in step 4).
>> 6.  the records in the new EFI was freed by xfs_efi_item_recover()
>> 7.  the same records in the same new EFI was freed by xfs_extent_free_finish_item()
>>     again.
>> 
>> If above is possible, the problem still exists even you make sure the new log
>> items in original order when they are placed to AIL.
> 
> Yes, I *know this can happen*.
> 
> What I've been trying to understand is *how the bug occurred in the
> first place*. 
> 
> Root cause analysis comes first. Second is confirming the analysis
> is correct. Only then do we really start thinking about how to fix
> it properly.
> 
> We are at the second step: I need *empirical confirmation* that this
> ordering problem is the root cause of the behaviour that was
> observed.
> 
> Stop trying to fix the problem before we understand how it happened!
> 
>>> Go back to your trace-cmd shell and ctrl-c it. Wait for it to dump
>>> all the trace events to the trace.dat file. Then run:
>>> 
>>> # trace-cmd report > trace.txt
>>> 
>>> the trace.txt file should have the entire mount trace in it, all the
>>> xfs events, the trace-printk output and the console output. now you
>>> can filter everything with grep, sed, awk, python, perl, any way you
>>> like.
>>> 
>> 
>> Good to know. Well, seems it can’t tell us the -EAGAIN thing? We don’t
>> have trace even in every function for every case.
> 
> We don't need an explicit trace for that. If we hit that code path
> in recovery, we will see a xfs_bmap_free_deferred trace for an
> extent from recovery, followed by a xfs_alloc_size_busy/xfs_alloc_near_busy
> event followed immediately by a xfs_bmap_free_defer event for the
> extent we were trying to free.
> 
> That tells us an EAGAIN was received and the extent free was
> deferred.
> 
> All the tracepoints we need to tell us what is happening in this
> path, one just needs to connect the dots correctly.
> 
>>> The trace makes it pretty obvious what is happening: there's an
>>> ordering problem in bulk AIL insertion.
>>> 
>>> Can you test the patch below on top of the three in this patchset?
>> 
>> This maybe work for this specific mounting, but it looks a bit tricky to me.
>> Please see my comment above. I meant the below patch maybe can ‘fix’ the issue
>> occurred to this specific xfs volume(metadump), but will is it sure to fix all cases?
> 
> I don't care about "other cases" right now. All I'm trying to do is
> confirm that this is the *root cause* of *this problem*.
> 
> I am aware that this is a test patch, and that I wrote it purely to
> confirm my hypothesis. I am aware that it doesn't fix everything and
> I am aware that I am not *trying to fix everything* with it. I need
> confirmation that my root cause analysis is correct before I spend
> any time on working out how to fix the problem completely.
> 
> So, can you please just test the patch and see if the problem is
> fixed?

Then OK, I will test it and report back.

thanks,
wengang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16 22:53                           ` Wengang Wang
@ 2023-06-16 23:14                             ` Wengang Wang
  2023-06-17  0:47                               ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Wengang Wang @ 2023-06-16 23:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux

>> 
>> So, can you please just test the patch and see if the problem is
>> fixed?
> 
> Then OK, I will test it and report back.
> 

Log recover ran successfully with the test patch.

thanks,
wengang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16 23:14                             ` Wengang Wang
@ 2023-06-17  0:47                               ` Dave Chinner
  2023-06-20 16:56                                 ` Wengang Wang
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2023-06-17  0:47 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-xfs, chandanrlinux

On Fri, Jun 16, 2023 at 11:14:51PM +0000, Wengang Wang wrote:
> >> 
> >> So, can you please just test the patch and see if the problem is
> >> fixed?
> > 
> > Then OK, I will test it and report back.
> > 
> 
> Log recover ran successfully with the test patch.

Thank you.

I apologise if I sound annoyed. I don't mean to be, and if I am it's
not aimed at you. I've been sick all week and I'm pretty much at my
wits end. I don't want to fight just to get a test run, I just want
an answer to the question I'm asking. I don't want everything to a
battle and far more difficult than it should be.

But I'm sick and exhausted, and so I'm not caring about my tone as
much as I should. For that I apologise, and I thank you for testing
the patch to confirm that we now understand what the root cause of
the problem is.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-17  0:47                               ` Dave Chinner
@ 2023-06-20 16:56                                 ` Wengang Wang
  0 siblings, 0 replies; 29+ messages in thread
From: Wengang Wang @ 2023-06-20 16:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 16, 2023, at 5:47 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Jun 16, 2023 at 11:14:51PM +0000, Wengang Wang wrote:
>>>> 
>>>> So, can you please just test the patch and see if the problem is
>>>> fixed?
>>> 
>>> Then OK, I will test it and report back.
>>> 
>> 
>> Log recover ran successfully with the test patch.
> 
> Thank you.
> 
> I apologise if I sound annoyed. I don't mean to be, and if I am it's
> not aimed at you. I've been sick all week and I'm pretty much at my
> wits end. I don't want to fight just to get a test run, I just want
> an answer to the question I'm asking. I don't want everything to a
> battle and far more difficult than it should be.
> 
> But I'm sick and exhausted, and so I'm not caring about my tone as
> much as I should. For that I apologise, and I thank you for testing
> the patch to confirm that we now understand what the root cause of
> the problem is.
> 

No worries Dave!

I didn’t describe the problem its self clearly in previous conversations but
was instead trying to prove it at source code level and made some confusions.
It would be better to raise the problem with my analysis as soon as I found
it.  Well, finally we now understood the root cause. Thank you too!

And hope you haven't suffered too much pain with the sickness.

thanks,
wengang 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-16  0:42                 ` Wengang Wang
  2023-06-16  4:27                   ` Wengang Wang
@ 2023-06-22  1:15                   ` Wengang Wang
  1 sibling, 0 replies; 29+ messages in thread
From: Wengang Wang @ 2023-06-22  1:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandanrlinux



> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:
>> 
>> On Thu, Jun 15, 2023 at 11:51:09PM +0000, Wengang Wang wrote:
>>> 
>>> 
>>>> On Jun 15, 2023, at 4:33 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>> 
>>>> On Thu, Jun 15, 2023 at 11:09:41PM +0000, Wengang Wang wrote:
>>>>> When mounting the problematic metadump with the patches, I see the following reported.
>>>>> 
>>>>> For more information about troubleshooting your instance using a console connection, see the documentation: https://docs.cloud.oracle.com/en-us/iaas/Content/Compute/References/serialconsole.htm#four
>>>>> =================================================
>>>>> [   67.212496] loop: module loaded
>>>>> [   67.214732] loop0: detected capacity change from 0 to 629137408
>>>>> [   67.247542] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
>>>>> [   67.249257] XFS (loop0): Mounting V4 Filesystem af755a98-5f62-421d-aa81-2db7bffd2c40
>>>>> [   72.241546] XFS (loop0): Starting recovery (logdev: internal)
>>>>> [   92.218256] XFS (loop0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f6/0x870 [xfs]
>>>>> [   92.249802] CPU: 1 PID: 4201 Comm: mount Not tainted 6.4.0-rc6 #8
>>>> 
>>>> What is the test you are running? Please describe how you reproduced
>>>> this failure - a reproducer script would be the best thing here.
>>> 
>>> I was mounting a (copy of) V4 metadump from customer.
>> 
>> Is the metadump obfuscated? Can I get a copy of it via a private,
>> secure channel?
> 
> I am OK to give you a copy after I get approvement for that.

Dave, I tried a lot of times to reproduce his issue, but only reproduced it once.
So I don’t have a script to reproduce it stably. And I won't work more on
reproducing it.  I saved the small (1GiB) XFS volume (on which log recover hang).
Let me know if you still have the interest to get a copy.

thanks,
wengang


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-06-22  1:15 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-15  1:41 [PATCH 0/3] xfs: fix xfs_extent_busy_flush() deadlock in EFI processing Dave Chinner
2023-06-15  1:41 ` [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
2023-06-15  3:32   ` Darrick J. Wong
2023-06-15  3:48     ` Dave Chinner
2023-06-15 21:57   ` Wengang Wang
2023-06-15 22:14     ` Dave Chinner
2023-06-15 22:31       ` Wengang Wang
2023-06-15 23:09         ` Wengang Wang
2023-06-15 23:33           ` Dave Chinner
2023-06-15 23:51             ` Wengang Wang
2023-06-16  0:17               ` Dave Chinner
2023-06-16  0:42                 ` Wengang Wang
2023-06-16  4:27                   ` Wengang Wang
2023-06-16  5:04                     ` Wengang Wang
2023-06-16  7:36                     ` Dave Chinner
2023-06-16 17:43                       ` Wengang Wang
2023-06-16 22:29                         ` Dave Chinner
2023-06-16 22:53                           ` Wengang Wang
2023-06-16 23:14                             ` Wengang Wang
2023-06-17  0:47                               ` Dave Chinner
2023-06-20 16:56                                 ` Wengang Wang
2023-06-22  1:15                   ` Wengang Wang
2023-06-15  1:42 ` [PATCH 2/3] xfs: allow extent free intents to be retried Dave Chinner
2023-06-15  3:38   ` Darrick J. Wong
2023-06-15  3:57     ` Dave Chinner
2023-06-15 14:41       ` Darrick J. Wong
2023-06-15 22:21         ` Dave Chinner
2023-06-15  1:42 ` [PATCH 3/3] xfs: don't block in busy flushing when freeing extents Dave Chinner
2023-06-15  3:40   ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.