All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] xfs: fix how we deal with new intents during recovery
@ 2020-09-17  3:29 Darrick J. Wong
  2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-17  3:29 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, david

Hi all,

This second series of log fixes dates back to an earlier discussion that
Dave and I had about the weird way that log recovery works w.r.t. intent
items.  The current code juggles nested transactions so that it can
siphon off new deferred items for later; this we replace with a new
dfops freezer that captures the log reservation type and remaining block
reservation so that we finish the new deferred items with the same
transaction context as we would have had the system not gone down.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-recovery-intent-chaining
---
 fs/xfs/libxfs/xfs_defer.c       |   66 ++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_defer.h       |   22 ++++++++
 fs/xfs/libxfs/xfs_log_recover.h |    4 +
 fs/xfs/xfs_bmap_item.c          |   16 +-----
 fs/xfs/xfs_extfree_item.c       |    7 +--
 fs/xfs/xfs_log_recover.c        |  105 ++++++++++++++++++++++-----------------
 fs/xfs/xfs_refcount_item.c      |   16 +-----
 fs/xfs/xfs_rmap_item.c          |    7 +--
 fs/xfs/xfs_trans.h              |    4 +
 9 files changed, 160 insertions(+), 87 deletions(-)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-17  3:29 [PATCH 0/3] xfs: fix how we deal with new intents during recovery Darrick J. Wong
@ 2020-09-17  3:29 ` Darrick J. Wong
  2020-09-23  7:48   ` Christoph Hellwig
  2020-09-24  6:02   ` [PATCH v2 " Darrick J. Wong
  2020-09-17  3:29 ` [PATCH 2/3] xfs: xfs_defer_capture should absorb remaining block reservation Darrick J. Wong
  2020-09-17  3:29 ` [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation Darrick J. Wong
  2 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-17  3:29 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, david

From: Darrick J. Wong <darrick.wong@oracle.com>

When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items.  Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items.  After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent.  Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c       |   59 +++++++++++++++++--
 fs/xfs/libxfs/xfs_defer.h       |   20 ++++++
 fs/xfs/libxfs/xfs_log_recover.h |    4 +
 fs/xfs/xfs_bmap_item.c          |   16 +----
 fs/xfs/xfs_extfree_item.c       |    7 +-
 fs/xfs/xfs_log_recover.c        |  121 ++++++++++++++++++++++++---------------
 fs/xfs/xfs_refcount_item.c      |   16 +----
 fs/xfs/xfs_rmap_item.c          |    7 +-
 fs/xfs/xfs_trans.h              |    4 +
 9 files changed, 168 insertions(+), 86 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 29e9762f3b77..e4a6928c8566 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -563,14 +563,59 @@ xfs_defer_move(
  *
  * Create and log intent items for all the work that we're capturing so that we
  * can be assured that the items will get replayed if the system goes down
- * before log recovery gets a chance to finish the work it put off.  Then we
- * move the chain from stp to dtp.
+ * before log recovery gets a chance to finish the work it put off.  The entire
+ * deferred ops state is transferred to the capture structure and the
+ * transaction is committed.
+ *
+ * Note that the capture state is passed up to the caller and must be freed
+ * even if the transaction commit returns error.
  */
-void
+int
 xfs_defer_capture(
-	struct xfs_trans	*dtp,
-	struct xfs_trans	*stp)
+	struct xfs_trans		*tp,
+	struct xfs_defer_capture	**dfcp)
 {
-	xfs_defer_create_intents(stp);
-	xfs_defer_move(dtp, stp);
+	struct xfs_defer_capture	*dfc = NULL;
+
+	if (!list_empty(&tp->t_dfops)) {
+		dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
+
+		xfs_defer_create_intents(tp);
+
+		INIT_LIST_HEAD(&dfc->dfc_list);
+		INIT_LIST_HEAD(&dfc->dfc_dfops);
+
+		/* Move the dfops chain and transaction state to the freezer. */
+		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
+		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+		xfs_defer_reset(tp);
+	}
+
+	*dfcp = dfc;
+	return xfs_trans_commit(tp);
+}
+
+/* Attach a chain of captured deferred ops to a new transaction. */
+void
+xfs_defer_continue(
+	struct xfs_defer_capture	*dfc,
+	struct xfs_trans		*tp)
+{
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/* Move captured dfops chain and state to the transaction. */
+	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
+	tp->t_flags |= dfc->dfc_tpflags;
+	dfc->dfc_tpflags = 0;
+}
+
+/* Release all resources that we used to capture deferred ops. */
+void
+xfs_defer_capture_free(
+	struct xfs_mount		*mp,
+	struct xfs_defer_capture	*dfc)
+{
+	xfs_defer_cancel_list(mp, &dfc->dfc_dfops);
+	kmem_free(dfc);
 }
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 3164199162b6..d61ef0a750ab 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -8,6 +8,7 @@
 
 struct xfs_btree_cur;
 struct xfs_defer_op_type;
+struct xfs_defer_capture;
 
 /*
  * Header for deferred operation list.
@@ -63,10 +64,27 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 
+/*
+ * Deferred operation freezer.  This structure enables a dfops user to detach
+ * the chain of deferred operations from a transaction so that they can be
+ * continued later.
+ */
+struct xfs_defer_capture {
+	/* List of other freezer heads. */
+	struct list_head	dfc_list;
+
+	/* Deferred ops state saved from the transaction. */
+	struct list_head	dfc_dfops;
+	unsigned int		dfc_tpflags;
+};
+
 /*
  * Functions to capture a chain of deferred operations and continue them later.
  * This doesn't normally happen except log recovery.
  */
-void xfs_defer_capture(struct xfs_trans *dtp, struct xfs_trans *stp);
+int xfs_defer_capture(struct xfs_trans *tp, struct xfs_defer_capture **dfcp);
+void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp);
+void xfs_defer_capture_free(struct xfs_mount *mp,
+		struct xfs_defer_capture *dfc);
 
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 641132d0e39d..c3563c5c033c 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -6,6 +6,8 @@
 #ifndef	__XFS_LOG_RECOVER_H__
 #define __XFS_LOG_RECOVER_H__
 
+struct xfs_defer_capture;
+
 /*
  * Each log item type (XFS_LI_*) gets its own xlog_recover_item_ops to
  * define how recovery should work for that type of log item.
@@ -125,5 +127,7 @@ void xlog_recover_iodone(struct xfs_buf *bp);
 
 void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id);
+int xlog_recover_trans_commit(struct xfs_trans *tp,
+		struct xfs_defer_capture **dfcp);
 
 #endif	/* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 598f713831c9..0a0904a7650a 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -424,13 +424,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 STATIC int
 xfs_bui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_map_extent		*bmap;
 	struct xfs_bud_log_item		*budp;
 	xfs_fsblock_t			startblock_fsb;
@@ -486,12 +486,7 @@ xfs_bui_item_recover(
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	budp = xfs_trans_get_bud(tp, buip);
 
 	/* Grab the inode. */
@@ -539,15 +534,12 @@ xfs_bui_item_recover(
 		xfs_bmap_unmap_extent(tp, ip, &irec);
 	}
 
-	xfs_defer_capture(parent_tp, tp);
-	error = xfs_trans_commit(tp);
+	error = xlog_recover_trans_commit(tp, dfcp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_irele(ip);
-
 	return error;
 
 err_inode:
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	if (ip) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 6cb8cd11072a..f544ec007c12 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -585,10 +585,10 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
 STATIC int
 xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_efd_log_item		*efdp;
 	struct xfs_trans		*tp;
 	struct xfs_extent		*extp;
@@ -633,8 +633,7 @@ xfs_efi_item_recover(
 
 	}
 
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dfcp);
 
 abort_error:
 	xfs_trans_cancel(tp);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e2ec91b2d0f4..0c90c1326860 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1791,6 +1791,18 @@ xlog_recover_release_intent(
 	spin_unlock(&ailp->ail_lock);
 }
 
+/*
+ * Freeze any deferred ops and commit the transaction.  This is the last step
+ * needed to finish a log intent item that we recovered from the log.
+ */
+int
+xlog_recover_trans_commit(
+	struct xfs_trans		*tp,
+	struct xfs_defer_capture	**dfcp)
+{
+	return xfs_defer_capture(tp, dfcp);
+}
+
 /******************************************************************************
  *
  *		Log recover routines
@@ -2467,38 +2479,64 @@ xlog_recover_process_data(
 	return 0;
 }
 
+static void
+xlog_cancel_defer_ops(
+	struct xfs_mount	*mp,
+	struct list_head	*dfops_freezers)
+{
+	struct xfs_defer_capture *dfc, *next;
+
+	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
+		list_del_init(&dfc->dfc_list);
+		xfs_defer_capture_free(mp, dfc);
+	}
+}
+
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
-	struct xfs_trans	*parent_tp)
+	struct xfs_mount	*mp,
+	struct list_head	*dfops_freezers)
 {
-	struct xfs_mount	*mp = parent_tp->t_mountp;
+	struct xfs_defer_capture *dfc, *next;
 	struct xfs_trans	*tp;
 	int64_t			freeblks;
-	uint			resblks;
-	int			error;
+	uint64_t		resblks;
+	int			error = 0;
 
-	/*
-	 * We're finishing the defer_ops that accumulated as a result of
-	 * recovering unfinished intent items during log recovery.  We
-	 * reserve an itruncate transaction because it is the largest
-	 * permanent transaction type.  Since we're the only user of the fs
-	 * right now, take 93% (15/16) of the available free blocks.  Use
-	 * weird math to avoid a 64-bit division.
-	 */
-	freeblks = percpu_counter_sum(&mp->m_fdblocks);
-	if (freeblks <= 0)
-		return -ENOSPC;
-	resblks = min_t(int64_t, UINT_MAX, freeblks);
-	resblks = (resblks * 15) >> 4;
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
-			0, XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-	/* transfer all collected dfops to this transaction */
-	xfs_defer_move(tp, parent_tp);
+	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
+		/*
+		 * We're finishing the defer_ops that accumulated as a result
+		 * of recovering unfinished intent items during log recovery.
+		 * We reserve an itruncate transaction because it is the
+		 * largest permanent transaction type.  Since we're the only
+		 * user of the fs right now, take 93% (15/16) of the available
+		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 */
+		freeblks = percpu_counter_sum(&mp->m_fdblocks);
+		if (freeblks <= 0) {
+			error = -ENOSPC;
+			break;
+		}
 
-	return xfs_trans_commit(tp);
+		resblks = min_t(uint64_t, UINT_MAX, freeblks);
+		resblks = (resblks * 15) >> 4;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, XFS_TRANS_RESERVE, &tp);
+		if (error)
+			break;
+
+		/* transfer all collected dfops to this transaction */
+		list_del_init(&dfc->dfc_list);
+		xfs_defer_continue(dfc, tp);
+
+		error = xfs_trans_commit(tp);
+		xfs_defer_capture_free(mp, dfc);
+		if (error)
+			break;
+	}
+
+	return error;
 }
 
 /* Is this log item a deferred action intent? */
@@ -2528,28 +2566,16 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_trans	*parent_tp;
+	LIST_HEAD(dfops_freezers);
 	struct xfs_ail_cursor	cur;
+	struct xfs_defer_capture *freezer = NULL;
 	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
-	int			error;
+	int			error = 0;
 #if defined(DEBUG) || defined(XFS_WARN)
 	xfs_lsn_t		last_lsn;
 #endif
 
-	/*
-	 * The intent recovery handlers commit transactions to complete recovery
-	 * for individual intents, but any new deferred operations that are
-	 * queued during that process are held off until the very end. The
-	 * purpose of this transaction is to serve as a container for deferred
-	 * operations. Each intent recovery handler must transfer dfops here
-	 * before its local transaction commits, and we'll finish the entire
-	 * list below.
-	 */
-	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
-	if (error)
-		return error;
-
 	ailp = log->l_ailp;
 	spin_lock(&ailp->ail_lock);
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
@@ -2578,26 +2604,31 @@ xlog_recover_process_intents(
 
 		/*
 		 * NOTE: If your intent processing routine can create more
-		 * deferred ops, you /must/ attach them to the transaction in
+		 * deferred ops, you /must/ attach them to the freezer in
 		 * this routine or else those subsequent intents will get
 		 * replayed in the wrong order!
 		 */
 		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
 			spin_unlock(&ailp->ail_lock);
-			error = lip->li_ops->iop_recover(lip, parent_tp);
+			error = lip->li_ops->iop_recover(lip, &freezer);
 			spin_lock(&ailp->ail_lock);
 		}
+		if (freezer) {
+			list_add_tail(&freezer->dfc_list, &dfops_freezers);
+			freezer = NULL;
+		}
 		if (error)
-			goto out;
+			break;
+
 		lip = xfs_trans_ail_cursor_next(ailp, &cur);
 	}
-out:
+
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
 	if (!error)
-		error = xlog_finish_defer_ops(parent_tp);
-	xfs_trans_cancel(parent_tp);
+		error = xlog_finish_defer_ops(log->l_mp, &dfops_freezers);
 
+	xlog_cancel_defer_ops(log->l_mp, &dfops_freezers);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 492d80a0b406..30e579b5857d 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -424,7 +424,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 STATIC int
 xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
@@ -432,7 +432,7 @@ xfs_cui_item_recover(
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	xfs_fsblock_t			new_fsb;
 	xfs_extlen_t			new_len;
@@ -493,12 +493,7 @@ xfs_cui_item_recover(
 			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	cudp = xfs_trans_get_cud(tp, cuip);
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
@@ -555,13 +550,10 @@ xfs_cui_item_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_capture(parent_tp, tp);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dfcp);
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index dc5b0753cd51..34dd6b02bdc8 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -467,14 +467,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
 STATIC int
 xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_map_extent		*rmap;
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	enum xfs_rmap_intent_type	type;
 	xfs_exntst_t			state;
@@ -573,8 +573,7 @@ xfs_rui_item_recover(
 	}
 
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dfcp);
 
 abort_error:
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index b752501818d2..c88240961aa1 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -26,6 +26,7 @@ struct xfs_cui_log_item;
 struct xfs_cud_log_item;
 struct xfs_bui_log_item;
 struct xfs_bud_log_item;
+struct xfs_defer_capture;
 
 struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -74,7 +75,8 @@ struct xfs_item_ops {
 	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
-	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
+	int (*iop_recover)(struct xfs_log_item *lip,
+			   struct xfs_defer_capture **dfcp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
 


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] xfs: xfs_defer_capture should absorb remaining block reservation
  2020-09-17  3:29 [PATCH 0/3] xfs: fix how we deal with new intents during recovery Darrick J. Wong
  2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
@ 2020-09-17  3:29 ` Darrick J. Wong
  2020-09-23  7:49   ` Christoph Hellwig
  2020-09-24  6:04   ` [PATCH v2 " Darrick J. Wong
  2020-09-17  3:29 ` [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation Darrick J. Wong
  2 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-17  3:29 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, david

From: Darrick J. Wong <darrick.wong@oracle.com>

When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should absorb the remaining block reservation so
that when we continue the dfops chain, we still have those blocks to
use.

This adds the requirement that every log intent item recovery function
must be careful to reserve enough blocks to handle both itself and all
defer ops that it can queue.  On the other hand, this enables us to do
away with the handwaving block estimation nonsense that was going on in
xlog_finish_defer_ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |    4 ++++
 fs/xfs/libxfs/xfs_defer.h |    1 +
 fs/xfs/xfs_log_recover.c  |   20 +-------------------
 3 files changed, 6 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index e4a6928c8566..d1db97ccad51 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -588,6 +588,8 @@ xfs_defer_capture(
 		/* Move the dfops chain and transaction state to the freezer. */
 		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
 		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+		dfc->dfc_blkres = tp->t_blk_res - tp->t_blk_res_used;
+		tp->t_blk_res = tp->t_blk_res_used;
 		xfs_defer_reset(tp);
 	}
 
@@ -608,6 +610,8 @@ xfs_defer_continue(
 	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
 	tp->t_flags |= dfc->dfc_tpflags;
 	dfc->dfc_tpflags = 0;
+	tp->t_blk_res += dfc->dfc_blkres;
+	dfc->dfc_blkres = 0;
 }
 
 /* Release all resources that we used to capture deferred ops. */
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index d61ef0a750ab..fea610f5d3e1 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -76,6 +76,7 @@ struct xfs_defer_capture {
 	/* Deferred ops state saved from the transaction. */
 	struct list_head	dfc_dfops;
 	unsigned int		dfc_tpflags;
+	unsigned int		dfc_blkres;
 };
 
 /*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 0c90c1326860..5f664e52c542 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2500,28 +2500,10 @@ xlog_finish_defer_ops(
 {
 	struct xfs_defer_capture *dfc, *next;
 	struct xfs_trans	*tp;
-	int64_t			freeblks;
-	uint64_t		resblks;
 	int			error = 0;
 
 	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
-		/*
-		 * We're finishing the defer_ops that accumulated as a result
-		 * of recovering unfinished intent items during log recovery.
-		 * We reserve an itruncate transaction because it is the
-		 * largest permanent transaction type.  Since we're the only
-		 * user of the fs right now, take 93% (15/16) of the available
-		 * free blocks.  Use weird math to avoid a 64-bit division.
-		 */
-		freeblks = percpu_counter_sum(&mp->m_fdblocks);
-		if (freeblks <= 0) {
-			error = -ENOSPC;
-			break;
-		}
-
-		resblks = min_t(uint64_t, UINT_MAX, freeblks);
-		resblks = (resblks * 15) >> 4;
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0,
 				0, XFS_TRANS_RESERVE, &tp);
 		if (error)
 			break;


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation
  2020-09-17  3:29 [PATCH 0/3] xfs: fix how we deal with new intents during recovery Darrick J. Wong
  2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
  2020-09-17  3:29 ` [PATCH 2/3] xfs: xfs_defer_capture should absorb remaining block reservation Darrick J. Wong
@ 2020-09-17  3:29 ` Darrick J. Wong
  2020-09-23  7:49   ` Christoph Hellwig
  2020-09-24  6:03   ` [PATCH v2 " Darrick J. Wong
  2 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-17  3:29 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, david

From: Darrick J. Wong <darrick.wong@oracle.com>

When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the transaction reservation type
from the old transaction so that when we continue the dfops chain, we
still use the same reservation parameters.

This avoids a potential failure vector by ensuring that we never ask for
more log reservation space than we would have asked for had the system
not gone down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |    3 +++
 fs/xfs/libxfs/xfs_defer.h |    1 +
 fs/xfs/xfs_log_recover.c  |    4 ++--
 3 files changed, 6 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index d1db97ccad51..edfb3b9856c8 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -590,6 +590,9 @@ xfs_defer_capture(
 		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
 		dfc->dfc_blkres = tp->t_blk_res - tp->t_blk_res_used;
 		tp->t_blk_res = tp->t_blk_res_used;
+		dfc->dfc_tres.tr_logres = tp->t_log_res;
+		dfc->dfc_tres.tr_logcount = tp->t_log_count;
+		dfc->dfc_tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
 		xfs_defer_reset(tp);
 	}
 
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index fea610f5d3e1..a788855aef69 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -77,6 +77,7 @@ struct xfs_defer_capture {
 	struct list_head	dfc_dfops;
 	unsigned int		dfc_tpflags;
 	unsigned int		dfc_blkres;
+	struct xfs_trans_res	dfc_tres;
 };
 
 /*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 5f664e52c542..f5748c8f5157 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2503,8 +2503,8 @@ xlog_finish_defer_ops(
 	int			error = 0;
 
 	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0,
-				0, XFS_TRANS_RESERVE, &tp);
+		error = xfs_trans_alloc(mp, &dfc->dfc_tres, 0, 0,
+				XFS_TRANS_RESERVE, &tp);
 		if (error)
 			break;
 


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
@ 2020-09-23  7:48   ` Christoph Hellwig
  2020-09-23 15:46     ` Darrick J. Wong
  2020-09-24  6:02   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2020-09-23  7:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Sep 16, 2020 at 08:29:07PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items.  Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
> 
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items.  After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
> 
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
> 
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent.  Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.

Wow, that is a massive change, but I really like the direction.


> +int
>  xfs_defer_capture(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_capture	**dfcp)
>  {
> +	struct xfs_defer_capture	*dfc = NULL;
> +
> +	if (!list_empty(&tp->t_dfops)) {
> +		dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
> +
> +		xfs_defer_create_intents(tp);
> +
> +		INIT_LIST_HEAD(&dfc->dfc_list);
> +		INIT_LIST_HEAD(&dfc->dfc_dfops);
> +
> +		/* Move the dfops chain and transaction state to the freezer. */
> +		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
> +		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> +		xfs_defer_reset(tp);
> +	}
> +
> +	*dfcp = dfc;
> +	return xfs_trans_commit(tp);

Don't we need to free the structure if xfs_trans_commit fails?

Wouldn't a saner calling convention would to pass the list_head into
->iop_recover and this function, and just add it to the list here
instead of passing it around as an output pointer argument.

> +/*
> + * Freeze any deferred ops and commit the transaction.  This is the last step
> + * needed to finish a log intent item that we recovered from the log.
> + */
> +int
> +xlog_recover_trans_commit(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_capture	**dfcp)
> +{
> +	return xfs_defer_capture(tp, dfcp);
> +}

I don't think this wrapper is very helpful.  I think a direct call
to xfs_defer_capture captures the intent better than pretending it just
is another commit.

> +
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> +	struct xfs_mount	*mp,
> +	struct list_head	*dfops_freezers)
>  {
> +	struct xfs_defer_capture *dfc, *next;
>  	struct xfs_trans	*tp;
>  	int64_t			freeblks;
> +	uint64_t		resblks;
> +	int			error = 0;
>  
> +	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
> +		/*
> +		 * We're finishing the defer_ops that accumulated as a result
> +		 * of recovering unfinished intent items during log recovery.
> +		 * We reserve an itruncate transaction because it is the
> +		 * largest permanent transaction type.  Since we're the only
> +		 * user of the fs right now, take 93% (15/16) of the available
> +		 * free blocks.  Use weird math to avoid a 64-bit division.
> +		 */
> +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> +		if (freeblks <= 0) {
> +			error = -ENOSPC;
> +			break;
> +		}
>  
> +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> +		resblks = (resblks * 15) >> 4;
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> +				0, XFS_TRANS_RESERVE, &tp);

This isn't actually new, but how did we end up with the truncate
reservation here?

> +		if (error)
> +			break;
> +
> +		/* transfer all collected dfops to this transaction */
> +		list_del_init(&dfc->dfc_list);
> +		xfs_defer_continue(dfc, tp);
> +
> +		error = xfs_trans_commit(tp);
> +		xfs_defer_capture_free(mp, dfc);
> +		if (error)
> +			break;

I'd just do direct returns and return 0 outside the loop, but that is
just nitpicking.

>  /* Is this log item a deferred action intent? */
> @@ -2528,28 +2566,16 @@ STATIC int
>  xlog_recover_process_intents(
>  	struct xlog		*log)
>  {
> +	LIST_HEAD(dfops_freezers);
>  	struct xfs_ail_cursor	cur;
> +	struct xfs_defer_capture *freezer = NULL;
>  	struct xfs_log_item	*lip;
>  	struct xfs_ail		*ailp;
> +	int			error = 0;
>  #if defined(DEBUG) || defined(XFS_WARN)
>  	xfs_lsn_t		last_lsn;
>  #endif
>  
>  	ailp = log->l_ailp;
>  	spin_lock(&ailp->ail_lock);
>  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> @@ -2578,26 +2604,31 @@ xlog_recover_process_intents(
>  
>  		/*
>  		 * NOTE: If your intent processing routine can create more
> +		 * deferred ops, you /must/ attach them to the freezer in
>  		 * this routine or else those subsequent intents will get
>  		 * replayed in the wrong order!
>  		 */
>  		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
>  			spin_unlock(&ailp->ail_lock);
> +			error = lip->li_ops->iop_recover(lip, &freezer);
>  			spin_lock(&ailp->ail_lock);
>  		}
> +		if (freezer) {
> +			list_add_tail(&freezer->dfc_list, &dfops_freezers);
> +			freezer = NULL;
> +		}
>  		if (error)
> +			break;
> +
>  		lip = xfs_trans_ail_cursor_next(ailp, &cur);

The code flow looks really very odd (though correct) to me, what do you
think of something like:

  	spin_lock(&ailp->ail_lock);
	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
	     lip;
	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
			continue;

		spin_unlock(&ailp->ail_lock);
		error = lip->li_ops->iop_recover(lip, &dfops_freezers);
		spin_lock(&ailp->ail_lock);
		if (error)
			break;
	}
  	xfs_trans_ail_cursor_done(&cur);
  	spin_unlock(&ailp->ail_lock);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] xfs: xfs_defer_capture should absorb remaining block reservation
  2020-09-17  3:29 ` [PATCH 2/3] xfs: xfs_defer_capture should absorb remaining block reservation Darrick J. Wong
@ 2020-09-23  7:49   ` Christoph Hellwig
  2020-09-24  6:04   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2020-09-23  7:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Sep 16, 2020 at 08:29:13PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When xfs_defer_capture extracts the deferred ops and transaction state
> from a transaction, it should absorb the remaining block reservation so
> that when we continue the dfops chain, we still have those blocks to
> use.
> 
> This adds the requirement that every log intent item recovery function
> must be careful to reserve enough blocks to handle both itself and all
> defer ops that it can queue.  On the other hand, this enables us to do
> away with the handwaving block estimation nonsense that was going on in
> xlog_finish_defer_ops.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation
  2020-09-17  3:29 ` [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation Darrick J. Wong
@ 2020-09-23  7:49   ` Christoph Hellwig
  2020-09-23 15:22     ` Darrick J. Wong
  2020-09-24  6:03   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2020-09-23  7:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

>  	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
> -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0,
> -				0, XFS_TRANS_RESERVE, &tp);
> +		error = xfs_trans_alloc(mp, &dfc->dfc_tres, 0, 0,
> +				XFS_TRANS_RESERVE, &tp);

... and this fixes the weird itruncate thing.  Nice!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation
  2020-09-23  7:49   ` Christoph Hellwig
@ 2020-09-23 15:22     ` Darrick J. Wong
  0 siblings, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-23 15:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david

On Wed, Sep 23, 2020 at 08:49:53AM +0100, Christoph Hellwig wrote:
> >  	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
> > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0,
> > -				0, XFS_TRANS_RESERVE, &tp);
> > +		error = xfs_trans_alloc(mp, &dfc->dfc_tres, 0, 0,
> > +				XFS_TRANS_RESERVE, &tp);
> 
> ... and this fixes the weird itruncate thing.  Nice!

Yep, Dave was also complaining about how weird the itruncate thing was.

--D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-23  7:48   ` Christoph Hellwig
@ 2020-09-23 15:46     ` Darrick J. Wong
  0 siblings, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-23 15:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david

On Wed, Sep 23, 2020 at 08:48:16AM +0100, Christoph Hellwig wrote:
> On Wed, Sep 16, 2020 at 08:29:07PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we replay unfinished intent items that have been recovered from the
> > log, it's possible that the replay will cause the creation of more
> > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > recovery should replay deferred ops in order"), later work items have an
> > implicit ordering dependency on earlier work items.  Therefore, recovery
> > must replay the items (both recovered and created) in the same order
> > that they would have been during normal operation.
> > 
> > For log recovery, we enforce this ordering by using an empty transaction
> > to collect deferred ops that get created in the process of recovering a
> > log intent item to prevent them from being committed before the rest of
> > the recovered intent items.  After we finish committing all the
> > recovered log items, we allocate a transaction with an enormous block
> > reservation, splice our huge list of created deferred ops into that
> > transaction, and commit it, thereby finishing all those ops.
> > 
> > This is /really/ hokey -- it's the one place in XFS where we allow
> > nested transactions; the splicing of the defer ops list is is inelegant
> > and has to be done twice per recovery function; and the broken way we
> > handle inode pointers and block reservations cause subtle use-after-free
> > and allocator problems that will be fixed by this patch and the two
> > patches after it.
> > 
> > Therefore, replace the hokey empty transaction with a structure designed
> > to capture each chain of deferred ops that are created as part of
> > recovering a single unfinished log intent.  Finally, refactor the loop
> > that replays those chains to do so using one transaction per chain.
> 
> Wow, that is a massive change, but I really like the direction.
> 
> 
> > +int
> >  xfs_defer_capture(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_capture	**dfcp)
> >  {
> > +	struct xfs_defer_capture	*dfc = NULL;
> > +
> > +	if (!list_empty(&tp->t_dfops)) {
> > +		dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
> > +
> > +		xfs_defer_create_intents(tp);
> > +
> > +		INIT_LIST_HEAD(&dfc->dfc_list);
> > +		INIT_LIST_HEAD(&dfc->dfc_dfops);
> > +
> > +		/* Move the dfops chain and transaction state to the freezer. */
> > +		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
> > +		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> > +		xfs_defer_reset(tp);
> > +	}
> > +
> > +	*dfcp = dfc;
> > +	return xfs_trans_commit(tp);
> 
> Don't we need to free the structure if xfs_trans_commit fails?

They'll all get passed up to xlog_recover_process_intents (even with the
error code) and cancelled by xlog_cancel_defer_ops.

But you're right, this should maintain the principle of not creating
more state for callers to deal with when we're just going to error out
anyway.

> Wouldn't a saner calling convention would to pass the list_head into
> ->iop_recover and this function, and just add it to the list here
> instead of passing it around as an output pointer argument.

Sure, that works too.  I sort of hate passing around generic list_head
but I could say the same of double pointers. :)

> > +/*
> > + * Freeze any deferred ops and commit the transaction.  This is the last step
> > + * needed to finish a log intent item that we recovered from the log.
> > + */
> > +int
> > +xlog_recover_trans_commit(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_capture	**dfcp)
> > +{
> > +	return xfs_defer_capture(tp, dfcp);
> > +}
> 
> I don't think this wrapper is very helpful.  I think a direct call
> to xfs_defer_capture captures the intent better than pretending it just
> is another commit.

It's not particularly enlightening right now, but later (in the bmap
use-after-free series) I'll enlarge this helper to be able to stash
inodes and whatnot.  The end result is that all the log intent item
recovery functions can call this at the bottom and it'll clean up the
transaction and inodes that were used in recovery.

(Or hide them away for later.)

> > +
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*dfops_freezers)
> >  {
> > +	struct xfs_defer_capture *dfc, *next;
> >  	struct xfs_trans	*tp;
> >  	int64_t			freeblks;
> > +	uint64_t		resblks;
> > +	int			error = 0;
> >  
> > +	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
> > +		/*
> > +		 * We're finishing the defer_ops that accumulated as a result
> > +		 * of recovering unfinished intent items during log recovery.
> > +		 * We reserve an itruncate transaction because it is the
> > +		 * largest permanent transaction type.  Since we're the only
> > +		 * user of the fs right now, take 93% (15/16) of the available
> > +		 * free blocks.  Use weird math to avoid a 64-bit division.
> > +		 */
> > +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > +		if (freeblks <= 0) {
> > +			error = -ENOSPC;
> > +			break;
> > +		}
> >  
> > +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> > +		resblks = (resblks * 15) >> 4;
> > +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > +				0, XFS_TRANS_RESERVE, &tp);
> 
> This isn't actually new, but how did we end up with the truncate
> reservation here?

itruncate is the largest reservation, so it ended up here (and a few
other places) as a standin for "big enough to avoid failing log recovery
on a reservation overflow".

> > +		if (error)
> > +			break;
> > +
> > +		/* transfer all collected dfops to this transaction */
> > +		list_del_init(&dfc->dfc_list);
> > +		xfs_defer_continue(dfc, tp);
> > +
> > +		error = xfs_trans_commit(tp);
> > +		xfs_defer_capture_free(mp, dfc);
> > +		if (error)
> > +			break;
> 
> I'd just do direct returns and return 0 outside the loop, but that is
> just nitpicking.

Ok.

> >  /* Is this log item a deferred action intent? */
> > @@ -2528,28 +2566,16 @@ STATIC int
> >  xlog_recover_process_intents(
> >  	struct xlog		*log)
> >  {
> > +	LIST_HEAD(dfops_freezers);
> >  	struct xfs_ail_cursor	cur;
> > +	struct xfs_defer_capture *freezer = NULL;
> >  	struct xfs_log_item	*lip;
> >  	struct xfs_ail		*ailp;
> > +	int			error = 0;
> >  #if defined(DEBUG) || defined(XFS_WARN)
> >  	xfs_lsn_t		last_lsn;
> >  #endif
> >  
> >  	ailp = log->l_ailp;
> >  	spin_lock(&ailp->ail_lock);
> >  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > @@ -2578,26 +2604,31 @@ xlog_recover_process_intents(
> >  
> >  		/*
> >  		 * NOTE: If your intent processing routine can create more
> > +		 * deferred ops, you /must/ attach them to the freezer in
> >  		 * this routine or else those subsequent intents will get
> >  		 * replayed in the wrong order!
> >  		 */
> >  		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
> >  			spin_unlock(&ailp->ail_lock);
> > +			error = lip->li_ops->iop_recover(lip, &freezer);
> >  			spin_lock(&ailp->ail_lock);
> >  		}
> > +		if (freezer) {
> > +			list_add_tail(&freezer->dfc_list, &dfops_freezers);
> > +			freezer = NULL;
> > +		}
> >  		if (error)
> > +			break;
> > +
> >  		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> 
> The code flow looks really very odd (though correct) to me, what do you
> think of something like:
> 
>   	spin_lock(&ailp->ail_lock);
> 	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> 	     lip;
> 	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> 		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
> 			continue;
> 
> 		spin_unlock(&ailp->ail_lock);
> 		error = lip->li_ops->iop_recover(lip, &dfops_freezers);
> 		spin_lock(&ailp->ail_lock);
> 		if (error)
> 			break;
> 	}
>   	xfs_trans_ail_cursor_done(&cur);
>   	spin_unlock(&ailp->ail_lock);

Yeah, I like that better.  Thanks for the suggestions.

--D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
  2020-09-23  7:48   ` Christoph Hellwig
@ 2020-09-24  6:02   ` Darrick J. Wong
  2020-09-25 12:21     ` Brian Foster
  1 sibling, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-24  6:02 UTC (permalink / raw)
  To: linux-xfs, david, hch

From: Darrick J. Wong <darrick.wong@oracle.com>

When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items.  Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items.  After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent.  Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: rework the api so we pass around the capture list instead of pointer
pointer weirdness
---
 fs/xfs/libxfs/xfs_defer.c       |   73 ++++++++++++++++++++--
 fs/xfs/libxfs/xfs_defer.h       |   20 ++++++
 fs/xfs/libxfs/xfs_log_recover.h |    2 +
 fs/xfs/xfs_bmap_item.c          |   16 +----
 fs/xfs/xfs_extfree_item.c       |    7 +-
 fs/xfs/xfs_log_recover.c        |  131 +++++++++++++++++++++++----------------
 fs/xfs/xfs_refcount_item.c      |   16 +----
 fs/xfs/xfs_rmap_item.c          |    7 +-
 fs/xfs/xfs_trans.h              |    3 +
 9 files changed, 181 insertions(+), 94 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 29e9762f3b77..0cb4af0c5c10 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -563,14 +563,73 @@ xfs_defer_move(
  *
  * Create and log intent items for all the work that we're capturing so that we
  * can be assured that the items will get replayed if the system goes down
- * before log recovery gets a chance to finish the work it put off.  Then we
- * move the chain from stp to dtp.
+ * before log recovery gets a chance to finish the work it put off.  The entire
+ * deferred ops state is transferred to the capture structure and the
+ * transaction is committed.
+ *
+ * Note that the capture state is passed up to the caller and must be freed
+ * even if the transaction commit returns error.
  */
-void
+int
 xfs_defer_capture(
-	struct xfs_trans	*dtp,
-	struct xfs_trans	*stp)
+	struct xfs_trans		*tp,
+	struct list_head		*capture_list)
 {
-	xfs_defer_create_intents(stp);
-	xfs_defer_move(dtp, stp);
+	struct xfs_defer_capture	*dfc;
+	struct xfs_mount		*mp = tp->t_mountp;
+	int				error;
+
+	if (list_empty(&tp->t_dfops))
+		return xfs_trans_commit(tp);
+
+	/* Create an object to capture the defer ops. */
+	dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
+
+	xfs_defer_create_intents(tp);
+
+	INIT_LIST_HEAD(&dfc->dfc_list);
+	INIT_LIST_HEAD(&dfc->dfc_dfops);
+
+	/* Move the dfops chain and transaction state to the freezer. */
+	list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
+	dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+	xfs_defer_reset(tp);
+
+	/*
+	 * Commit the transaction.  If that fails, clean up the defer ops and
+	 * the dfc that we just created.  Otherwise, add the dfc to the list.
+	 */
+	error = xfs_trans_commit(tp);
+	if (error) {
+		xfs_defer_capture_free(mp, dfc);
+		return error;
+	}
+
+	list_add_tail(&dfc->dfc_list, capture_list);
+	return 0;
+}
+
+/* Attach a chain of captured deferred ops to a new transaction. */
+void
+xfs_defer_continue(
+	struct xfs_defer_capture	*dfc,
+	struct xfs_trans		*tp)
+{
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/* Move captured dfops chain and state to the transaction. */
+	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
+	tp->t_flags |= dfc->dfc_tpflags;
+	dfc->dfc_tpflags = 0;
+}
+
+/* Release all resources that we used to capture deferred ops. */
+void
+xfs_defer_capture_free(
+	struct xfs_mount		*mp,
+	struct xfs_defer_capture	*dfc)
+{
+	xfs_defer_cancel_list(mp, &dfc->dfc_dfops);
+	kmem_free(dfc);
 }
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 3164199162b6..366d08d99e11 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -8,6 +8,7 @@
 
 struct xfs_btree_cur;
 struct xfs_defer_op_type;
+struct xfs_defer_capture;
 
 /*
  * Header for deferred operation list.
@@ -63,10 +64,27 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 
+/*
+ * Deferred operation freezer.  This structure enables a dfops user to detach
+ * the chain of deferred operations from a transaction so that they can be
+ * continued later.
+ */
+struct xfs_defer_capture {
+	/* List of other freezer heads. */
+	struct list_head	dfc_list;
+
+	/* Deferred ops state saved from the transaction. */
+	struct list_head	dfc_dfops;
+	unsigned int		dfc_tpflags;
+};
+
 /*
  * Functions to capture a chain of deferred operations and continue them later.
  * This doesn't normally happen except log recovery.
  */
-void xfs_defer_capture(struct xfs_trans *dtp, struct xfs_trans *stp);
+int xfs_defer_capture(struct xfs_trans *tp, struct list_head *capture_list);
+void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp);
+void xfs_defer_capture_free(struct xfs_mount *mp,
+		struct xfs_defer_capture *dfc);
 
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 641132d0e39d..4f752096f7c7 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -125,5 +125,7 @@ void xlog_recover_iodone(struct xfs_buf *bp);
 
 void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id);
+int xlog_recover_trans_commit(struct xfs_trans *tp,
+		struct list_head *capture_list);
 
 #endif	/* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index b04ebcd78316..b73f0a0890a2 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -424,13 +424,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 STATIC int
 xfs_bui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct list_head		*capture_list)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_map_extent		*bmap;
 	struct xfs_bud_log_item		*budp;
 	xfs_fsblock_t			startblock_fsb;
@@ -478,12 +478,7 @@ xfs_bui_item_recover(
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	budp = xfs_trans_get_bud(tp, buip);
 
 	/* Grab the inode. */
@@ -531,15 +526,12 @@ xfs_bui_item_recover(
 		xfs_bmap_unmap_extent(tp, ip, &irec);
 	}
 
-	xfs_defer_capture(parent_tp, tp);
-	error = xfs_trans_commit(tp);
+	error = xlog_recover_trans_commit(tp, capture_list);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_irele(ip);
-
 	return error;
 
 err_inode:
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	if (ip) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 9093d2e7afdf..be0186875566 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -585,10 +585,10 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
 STATIC int
 xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct list_head		*capture_list)
 {
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_efd_log_item		*efdp;
 	struct xfs_trans		*tp;
 	struct xfs_extent		*extp;
@@ -627,8 +627,7 @@ xfs_efi_item_recover(
 
 	}
 
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, capture_list);
 
 abort_error:
 	xfs_trans_cancel(tp);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e2ec91b2d0f4..79be1f02d1b4 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1791,6 +1791,19 @@ xlog_recover_release_intent(
 	spin_unlock(&ailp->ail_lock);
 }
 
+/*
+ * Freeze any deferred ops and commit the transaction.  This is the last step
+ * needed to finish a log intent item that we recovered from the log, and will
+ * take care of releasing all the relevant resources.
+ */
+int
+xlog_recover_trans_commit(
+	struct xfs_trans		*tp,
+	struct list_head		*capture_list)
+{
+	return xfs_defer_capture(tp, capture_list);
+}
+
 /******************************************************************************
  *
  *		Log recover routines
@@ -2467,38 +2480,62 @@ xlog_recover_process_data(
 	return 0;
 }
 
+static void
+xlog_cancel_defer_ops(
+	struct xfs_mount	*mp,
+	struct list_head	*capture_list)
+{
+	struct xfs_defer_capture *dfc, *next;
+
+	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
+		list_del_init(&dfc->dfc_list);
+		xfs_defer_capture_free(mp, dfc);
+	}
+}
+
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
-	struct xfs_trans	*parent_tp)
+	struct xfs_mount	*mp,
+	struct list_head	*capture_list)
 {
-	struct xfs_mount	*mp = parent_tp->t_mountp;
+	struct xfs_defer_capture *dfc, *next;
 	struct xfs_trans	*tp;
 	int64_t			freeblks;
-	uint			resblks;
-	int			error;
+	uint64_t		resblks;
+	int			error = 0;
 
-	/*
-	 * We're finishing the defer_ops that accumulated as a result of
-	 * recovering unfinished intent items during log recovery.  We
-	 * reserve an itruncate transaction because it is the largest
-	 * permanent transaction type.  Since we're the only user of the fs
-	 * right now, take 93% (15/16) of the available free blocks.  Use
-	 * weird math to avoid a 64-bit division.
-	 */
-	freeblks = percpu_counter_sum(&mp->m_fdblocks);
-	if (freeblks <= 0)
-		return -ENOSPC;
-	resblks = min_t(int64_t, UINT_MAX, freeblks);
-	resblks = (resblks * 15) >> 4;
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
-			0, XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-	/* transfer all collected dfops to this transaction */
-	xfs_defer_move(tp, parent_tp);
+	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
+		/*
+		 * We're finishing the defer_ops that accumulated as a result
+		 * of recovering unfinished intent items during log recovery.
+		 * We reserve an itruncate transaction because it is the
+		 * largest permanent transaction type.  Since we're the only
+		 * user of the fs right now, take 93% (15/16) of the available
+		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 */
+		freeblks = percpu_counter_sum(&mp->m_fdblocks);
+		if (freeblks <= 0)
+			return -ENOSPC;
 
-	return xfs_trans_commit(tp);
+		resblks = min_t(uint64_t, UINT_MAX, freeblks);
+		resblks = (resblks * 15) >> 4;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, XFS_TRANS_RESERVE, &tp);
+		if (error)
+			return error;
+
+		/* transfer all collected dfops to this transaction */
+		list_del_init(&dfc->dfc_list);
+		xfs_defer_continue(dfc, tp);
+
+		error = xfs_trans_commit(tp);
+		xfs_defer_capture_free(mp, dfc);
+		if (error)
+			return error;
+	}
+
+	return 0;
 }
 
 /* Is this log item a deferred action intent? */
@@ -2528,35 +2565,23 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_trans	*parent_tp;
+	LIST_HEAD(capture_list);
 	struct xfs_ail_cursor	cur;
 	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
-	int			error;
+	int			error = 0;
 #if defined(DEBUG) || defined(XFS_WARN)
 	xfs_lsn_t		last_lsn;
 #endif
 
-	/*
-	 * The intent recovery handlers commit transactions to complete recovery
-	 * for individual intents, but any new deferred operations that are
-	 * queued during that process are held off until the very end. The
-	 * purpose of this transaction is to serve as a container for deferred
-	 * operations. Each intent recovery handler must transfer dfops here
-	 * before its local transaction commits, and we'll finish the entire
-	 * list below.
-	 */
-	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
-	if (error)
-		return error;
-
 	ailp = log->l_ailp;
 	spin_lock(&ailp->ail_lock);
-	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
 #if defined(DEBUG) || defined(XFS_WARN)
 	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
 #endif
-	while (lip != NULL) {
+	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	     lip != NULL;
+	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
 		/*
 		 * We're done when we see something other than an intent.
 		 * There should be no intents left in the AIL now.
@@ -2576,28 +2601,28 @@ xlog_recover_process_intents(
 		 */
 		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
 
+		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
+			continue;
+
 		/*
 		 * NOTE: If your intent processing routine can create more
-		 * deferred ops, you /must/ attach them to the transaction in
-		 * this routine or else those subsequent intents will get
+		 * deferred ops, you /must/ attach them to the capture list in
+		 * the recover routine or else those subsequent intents will be
 		 * replayed in the wrong order!
 		 */
-		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
-			spin_unlock(&ailp->ail_lock);
-			error = lip->li_ops->iop_recover(lip, parent_tp);
-			spin_lock(&ailp->ail_lock);
-		}
+		spin_unlock(&ailp->ail_lock);
+		error = lip->li_ops->iop_recover(lip, &capture_list);
+		spin_lock(&ailp->ail_lock);
 		if (error)
-			goto out;
-		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			break;
 	}
-out:
+
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
 	if (!error)
-		error = xlog_finish_defer_ops(parent_tp);
-	xfs_trans_cancel(parent_tp);
+		error = xlog_finish_defer_ops(log->l_mp, &capture_list);
 
+	xlog_cancel_defer_ops(log->l_mp, &capture_list);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 3e34b7662361..7a57b4de9ee7 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -424,7 +424,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 STATIC int
 xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct list_head		*capture_list)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
@@ -432,7 +432,7 @@ xfs_cui_item_recover(
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	xfs_fsblock_t			new_fsb;
 	xfs_extlen_t			new_len;
@@ -487,12 +487,7 @@ xfs_cui_item_recover(
 			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	cudp = xfs_trans_get_cud(tp, cuip);
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
@@ -549,13 +544,10 @@ xfs_cui_item_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_capture(parent_tp, tp);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, capture_list);
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index e38ec5d736be..16c7a6385c3f 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -467,14 +467,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
 STATIC int
 xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct list_head		*capture_list)
 {
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_map_extent		*rmap;
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	enum xfs_rmap_intent_type	type;
 	xfs_exntst_t			state;
@@ -567,8 +567,7 @@ xfs_rui_item_recover(
 	}
 
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, capture_list);
 
 abort_error:
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index b752501818d2..b68272666ce1 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -74,7 +74,8 @@ struct xfs_item_ops {
 	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
-	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
+	int (*iop_recover)(struct xfs_log_item *lip,
+			   struct list_head *capture_list);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation
  2020-09-17  3:29 ` [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation Darrick J. Wong
  2020-09-23  7:49   ` Christoph Hellwig
@ 2020-09-24  6:03   ` Darrick J. Wong
  1 sibling, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-24  6:03 UTC (permalink / raw)
  To: linux-xfs, david, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the transaction reservation type
from the old transaction so that when we continue the dfops chain, we
still use the same reservation parameters.

This avoids a potential failure vector by ensuring that we never ask for
more log reservation space than we would have asked for had the system
not gone down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: rebased on v2 of patch 1
---
 fs/xfs/libxfs/xfs_defer.c |    3 +++
 fs/xfs/libxfs/xfs_defer.h |    1 +
 fs/xfs/xfs_log_recover.c  |    4 ++--
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 8bd01952c955..b693d2c42c27 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -595,6 +595,9 @@ xfs_defer_capture(
 	dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
 	dfc->dfc_blkres = tp->t_blk_res - tp->t_blk_res_used;
 	tp->t_blk_res = tp->t_blk_res_used;
+	dfc->dfc_tres.tr_logres = tp->t_log_res;
+	dfc->dfc_tres.tr_logcount = tp->t_log_count;
+	dfc->dfc_tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
 	xfs_defer_reset(tp);
 
 	/*
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 59d126aeb31c..254d48e6e4dc 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -77,6 +77,7 @@ struct xfs_defer_capture {
 	struct list_head	dfc_dfops;
 	unsigned int		dfc_tpflags;
 	unsigned int		dfc_blkres;
+	struct xfs_trans_res	dfc_tres;
 };
 
 /*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 777b60073ff7..ab9825ab14d5 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2504,8 +2504,8 @@ xlog_finish_defer_ops(
 	int			error = 0;
 
 	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0,
-				0, XFS_TRANS_RESERVE, &tp);
+		error = xfs_trans_alloc(mp, &dfc->dfc_tres, 0, 0,
+				XFS_TRANS_RESERVE, &tp);
 		if (error)
 			return error;
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/3] xfs: xfs_defer_capture should absorb remaining block reservation
  2020-09-17  3:29 ` [PATCH 2/3] xfs: xfs_defer_capture should absorb remaining block reservation Darrick J. Wong
  2020-09-23  7:49   ` Christoph Hellwig
@ 2020-09-24  6:04   ` Darrick J. Wong
  1 sibling, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-24  6:04 UTC (permalink / raw)
  To: linux-xfs, david

From: Darrick J. Wong <darrick.wong@oracle.com>

When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should absorb the remaining block reservation so
that when we continue the dfops chain, we still have those blocks to
use.

This adds the requirement that every log intent item recovery function
must be careful to reserve enough blocks to handle both itself and all
defer ops that it can queue.  On the other hand, this enables us to do
away with the handwaving block estimation nonsense that was going on in
xlog_finish_defer_ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v2: trivial rebase of patch 1
---
 fs/xfs/libxfs/xfs_defer.c |    4 ++++
 fs/xfs/libxfs/xfs_defer.h |    1 +
 fs/xfs/xfs_log_recover.c  |   18 +-----------------
 3 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 0cb4af0c5c10..8bd01952c955 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -593,6 +593,8 @@ xfs_defer_capture(
 	/* Move the dfops chain and transaction state to the freezer. */
 	list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
 	dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+	dfc->dfc_blkres = tp->t_blk_res - tp->t_blk_res_used;
+	tp->t_blk_res = tp->t_blk_res_used;
 	xfs_defer_reset(tp);
 
 	/*
@@ -622,6 +624,8 @@ xfs_defer_continue(
 	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
 	tp->t_flags |= dfc->dfc_tpflags;
 	dfc->dfc_tpflags = 0;
+	tp->t_blk_res += dfc->dfc_blkres;
+	dfc->dfc_blkres = 0;
 }
 
 /* Release all resources that we used to capture deferred ops. */
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 366d08d99e11..59d126aeb31c 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -76,6 +76,7 @@ struct xfs_defer_capture {
 	/* Deferred ops state saved from the transaction. */
 	struct list_head	dfc_dfops;
 	unsigned int		dfc_tpflags;
+	unsigned int		dfc_blkres;
 };
 
 /*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 79be1f02d1b4..777b60073ff7 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2501,26 +2501,10 @@ xlog_finish_defer_ops(
 {
 	struct xfs_defer_capture *dfc, *next;
 	struct xfs_trans	*tp;
-	int64_t			freeblks;
-	uint64_t		resblks;
 	int			error = 0;
 
 	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
-		/*
-		 * We're finishing the defer_ops that accumulated as a result
-		 * of recovering unfinished intent items during log recovery.
-		 * We reserve an itruncate transaction because it is the
-		 * largest permanent transaction type.  Since we're the only
-		 * user of the fs right now, take 93% (15/16) of the available
-		 * free blocks.  Use weird math to avoid a 64-bit division.
-		 */
-		freeblks = percpu_counter_sum(&mp->m_fdblocks);
-		if (freeblks <= 0)
-			return -ENOSPC;
-
-		resblks = min_t(uint64_t, UINT_MAX, freeblks);
-		resblks = (resblks * 15) >> 4;
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0,
 				0, XFS_TRANS_RESERVE, &tp);
 		if (error)
 			return error;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-24  6:02   ` [PATCH v2 " Darrick J. Wong
@ 2020-09-25 12:21     ` Brian Foster
  2020-09-25 19:19       ` Darrick J. Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2020-09-25 12:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, hch

On Wed, Sep 23, 2020 at 11:02:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items.  Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
> 
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items.  After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
> 
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
> 
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent.  Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: rework the api so we pass around the capture list instead of pointer
> pointer weirdness
> ---
>  fs/xfs/libxfs/xfs_defer.c       |   73 ++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_defer.h       |   20 ++++++
>  fs/xfs/libxfs/xfs_log_recover.h |    2 +
>  fs/xfs/xfs_bmap_item.c          |   16 +----
>  fs/xfs/xfs_extfree_item.c       |    7 +-
>  fs/xfs/xfs_log_recover.c        |  131 +++++++++++++++++++++++----------------
>  fs/xfs/xfs_refcount_item.c      |   16 +----
>  fs/xfs/xfs_rmap_item.c          |    7 +-
>  fs/xfs/xfs_trans.h              |    3 +
>  9 files changed, 181 insertions(+), 94 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index 29e9762f3b77..0cb4af0c5c10 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -563,14 +563,73 @@ xfs_defer_move(
>   *
>   * Create and log intent items for all the work that we're capturing so that we
>   * can be assured that the items will get replayed if the system goes down
> - * before log recovery gets a chance to finish the work it put off.  Then we
> - * move the chain from stp to dtp.
> + * before log recovery gets a chance to finish the work it put off.  The entire
> + * deferred ops state is transferred to the capture structure and the
> + * transaction is committed.
> + *
> + * Note that the capture state is passed up to the caller and must be freed
> + * even if the transaction commit returns error.
>   */
> -void
> +int
>  xfs_defer_capture(
> -	struct xfs_trans	*dtp,
> -	struct xfs_trans	*stp)
> +	struct xfs_trans		*tp,
> +	struct list_head		*capture_list)
>  {
> -	xfs_defer_create_intents(stp);
> -	xfs_defer_move(dtp, stp);
> +	struct xfs_defer_capture	*dfc;
> +	struct xfs_mount		*mp = tp->t_mountp;
> +	int				error;
> +
> +	if (list_empty(&tp->t_dfops))
> +		return xfs_trans_commit(tp);
> +
> +	/* Create an object to capture the defer ops. */
> +	dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
> +
> +	xfs_defer_create_intents(tp);
> +
> +	INIT_LIST_HEAD(&dfc->dfc_list);
> +	INIT_LIST_HEAD(&dfc->dfc_dfops);

Maybe move this right after the allocation..?

> +
> +	/* Move the dfops chain and transaction state to the freezer. */

"freezer" or "capture?"

> +	list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
> +	dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> +	xfs_defer_reset(tp);

Seems all this does now is clear the LOWMODE flag. Any real need for
such a helper?

> +
> +	/*
> +	 * Commit the transaction.  If that fails, clean up the defer ops and
> +	 * the dfc that we just created.  Otherwise, add the dfc to the list.
> +	 */
> +	error = xfs_trans_commit(tp);
> +	if (error) {
> +		xfs_defer_capture_free(mp, dfc);
> +		return error;
> +	}

This is subjective I suppose, but from an interface standpoint I find it
a little odd that a "defer capture" function commits a transaction. I'd
expect that to be a separate action and for this function just to do
whatever is necessary with the dfops as it relates to the transaction
(particularly since this also appears to end up wrapped in a
'_trans_commit()' helper that could presumably commit the transaction)..

I also just realized the patch from the git repo doesn't match this one
and this one doesn't apply cleanly to for-next. Which is the right
version?

Brian

> +
> +	list_add_tail(&dfc->dfc_list, capture_list);
> +	return 0;
> +}
> +
> +/* Attach a chain of captured deferred ops to a new transaction. */
> +void
> +xfs_defer_continue(
> +	struct xfs_defer_capture	*dfc,
> +	struct xfs_trans		*tp)
> +{
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	/* Move captured dfops chain and state to the transaction. */
> +	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
> +	tp->t_flags |= dfc->dfc_tpflags;
> +	dfc->dfc_tpflags = 0;
> +}
> +
> +/* Release all resources that we used to capture deferred ops. */
> +void
> +xfs_defer_capture_free(
> +	struct xfs_mount		*mp,
> +	struct xfs_defer_capture	*dfc)
> +{
> +	xfs_defer_cancel_list(mp, &dfc->dfc_dfops);
> +	kmem_free(dfc);
>  }
> diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> index 3164199162b6..366d08d99e11 100644
> --- a/fs/xfs/libxfs/xfs_defer.h
> +++ b/fs/xfs/libxfs/xfs_defer.h
> @@ -8,6 +8,7 @@
>  
>  struct xfs_btree_cur;
>  struct xfs_defer_op_type;
> +struct xfs_defer_capture;
>  
>  /*
>   * Header for deferred operation list.
> @@ -63,10 +64,27 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
>  extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
>  extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
>  
> +/*
> + * Deferred operation freezer.  This structure enables a dfops user to detach
> + * the chain of deferred operations from a transaction so that they can be
> + * continued later.
> + */
> +struct xfs_defer_capture {
> +	/* List of other freezer heads. */
> +	struct list_head	dfc_list;
> +
> +	/* Deferred ops state saved from the transaction. */
> +	struct list_head	dfc_dfops;
> +	unsigned int		dfc_tpflags;
> +};
> +
>  /*
>   * Functions to capture a chain of deferred operations and continue them later.
>   * This doesn't normally happen except log recovery.
>   */
> -void xfs_defer_capture(struct xfs_trans *dtp, struct xfs_trans *stp);
> +int xfs_defer_capture(struct xfs_trans *tp, struct list_head *capture_list);
> +void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp);
> +void xfs_defer_capture_free(struct xfs_mount *mp,
> +		struct xfs_defer_capture *dfc);
>  
>  #endif /* __XFS_DEFER_H__ */
> diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
> index 641132d0e39d..4f752096f7c7 100644
> --- a/fs/xfs/libxfs/xfs_log_recover.h
> +++ b/fs/xfs/libxfs/xfs_log_recover.h
> @@ -125,5 +125,7 @@ void xlog_recover_iodone(struct xfs_buf *bp);
>  
>  void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
>  		uint64_t intent_id);
> +int xlog_recover_trans_commit(struct xfs_trans *tp,
> +		struct list_head *capture_list);
>  
>  #endif	/* __XFS_LOG_RECOVER_H__ */
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index b04ebcd78316..b73f0a0890a2 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -424,13 +424,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
>  STATIC int
>  xfs_bui_item_recover(
>  	struct xfs_log_item		*lip,
> -	struct xfs_trans		*parent_tp)
> +	struct list_head		*capture_list)
>  {
>  	struct xfs_bmbt_irec		irec;
>  	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
>  	struct xfs_trans		*tp;
>  	struct xfs_inode		*ip = NULL;
> -	struct xfs_mount		*mp = parent_tp->t_mountp;
> +	struct xfs_mount		*mp = lip->li_mountp;
>  	struct xfs_map_extent		*bmap;
>  	struct xfs_bud_log_item		*budp;
>  	xfs_fsblock_t			startblock_fsb;
> @@ -478,12 +478,7 @@ xfs_bui_item_recover(
>  			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
>  	if (error)
>  		return error;
> -	/*
> -	 * Recovery stashes all deferred ops during intent processing and
> -	 * finishes them on completion. Transfer current dfops state to this
> -	 * transaction and transfer the result back before we return.
> -	 */
> -	xfs_defer_move(tp, parent_tp);
> +
>  	budp = xfs_trans_get_bud(tp, buip);
>  
>  	/* Grab the inode. */
> @@ -531,15 +526,12 @@ xfs_bui_item_recover(
>  		xfs_bmap_unmap_extent(tp, ip, &irec);
>  	}
>  
> -	xfs_defer_capture(parent_tp, tp);
> -	error = xfs_trans_commit(tp);
> +	error = xlog_recover_trans_commit(tp, capture_list);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	xfs_irele(ip);
> -
>  	return error;
>  
>  err_inode:
> -	xfs_defer_move(parent_tp, tp);
>  	xfs_trans_cancel(tp);
>  	if (ip) {
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 9093d2e7afdf..be0186875566 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -585,10 +585,10 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
>  STATIC int
>  xfs_efi_item_recover(
>  	struct xfs_log_item		*lip,
> -	struct xfs_trans		*parent_tp)
> +	struct list_head		*capture_list)
>  {
>  	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
> -	struct xfs_mount		*mp = parent_tp->t_mountp;
> +	struct xfs_mount		*mp = lip->li_mountp;
>  	struct xfs_efd_log_item		*efdp;
>  	struct xfs_trans		*tp;
>  	struct xfs_extent		*extp;
> @@ -627,8 +627,7 @@ xfs_efi_item_recover(
>  
>  	}
>  
> -	error = xfs_trans_commit(tp);
> -	return error;
> +	return xlog_recover_trans_commit(tp, capture_list);
>  
>  abort_error:
>  	xfs_trans_cancel(tp);
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index e2ec91b2d0f4..79be1f02d1b4 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -1791,6 +1791,19 @@ xlog_recover_release_intent(
>  	spin_unlock(&ailp->ail_lock);
>  }
>  
> +/*
> + * Freeze any deferred ops and commit the transaction.  This is the last step
> + * needed to finish a log intent item that we recovered from the log, and will
> + * take care of releasing all the relevant resources.
> + */
> +int
> +xlog_recover_trans_commit(
> +	struct xfs_trans		*tp,
> +	struct list_head		*capture_list)
> +{
> +	return xfs_defer_capture(tp, capture_list);
> +}
> +
>  /******************************************************************************
>   *
>   *		Log recover routines
> @@ -2467,38 +2480,62 @@ xlog_recover_process_data(
>  	return 0;
>  }
>  
> +static void
> +xlog_cancel_defer_ops(
> +	struct xfs_mount	*mp,
> +	struct list_head	*capture_list)
> +{
> +	struct xfs_defer_capture *dfc, *next;
> +
> +	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> +		list_del_init(&dfc->dfc_list);
> +		xfs_defer_capture_free(mp, dfc);
> +	}
> +}
> +
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> -	struct xfs_trans	*parent_tp)
> +	struct xfs_mount	*mp,
> +	struct list_head	*capture_list)
>  {
> -	struct xfs_mount	*mp = parent_tp->t_mountp;
> +	struct xfs_defer_capture *dfc, *next;
>  	struct xfs_trans	*tp;
>  	int64_t			freeblks;
> -	uint			resblks;
> -	int			error;
> +	uint64_t		resblks;
> +	int			error = 0;
>  
> -	/*
> -	 * We're finishing the defer_ops that accumulated as a result of
> -	 * recovering unfinished intent items during log recovery.  We
> -	 * reserve an itruncate transaction because it is the largest
> -	 * permanent transaction type.  Since we're the only user of the fs
> -	 * right now, take 93% (15/16) of the available free blocks.  Use
> -	 * weird math to avoid a 64-bit division.
> -	 */
> -	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> -	if (freeblks <= 0)
> -		return -ENOSPC;
> -	resblks = min_t(int64_t, UINT_MAX, freeblks);
> -	resblks = (resblks * 15) >> 4;
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> -			0, XFS_TRANS_RESERVE, &tp);
> -	if (error)
> -		return error;
> -	/* transfer all collected dfops to this transaction */
> -	xfs_defer_move(tp, parent_tp);
> +	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> +		/*
> +		 * We're finishing the defer_ops that accumulated as a result
> +		 * of recovering unfinished intent items during log recovery.
> +		 * We reserve an itruncate transaction because it is the
> +		 * largest permanent transaction type.  Since we're the only
> +		 * user of the fs right now, take 93% (15/16) of the available
> +		 * free blocks.  Use weird math to avoid a 64-bit division.
> +		 */
> +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> +		if (freeblks <= 0)
> +			return -ENOSPC;
>  
> -	return xfs_trans_commit(tp);
> +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> +		resblks = (resblks * 15) >> 4;
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> +				0, XFS_TRANS_RESERVE, &tp);
> +		if (error)
> +			return error;
> +
> +		/* transfer all collected dfops to this transaction */
> +		list_del_init(&dfc->dfc_list);
> +		xfs_defer_continue(dfc, tp);
> +
> +		error = xfs_trans_commit(tp);
> +		xfs_defer_capture_free(mp, dfc);
> +		if (error)
> +			return error;
> +	}
> +
> +	return 0;
>  }
>  
>  /* Is this log item a deferred action intent? */
> @@ -2528,35 +2565,23 @@ STATIC int
>  xlog_recover_process_intents(
>  	struct xlog		*log)
>  {
> -	struct xfs_trans	*parent_tp;
> +	LIST_HEAD(capture_list);
>  	struct xfs_ail_cursor	cur;
>  	struct xfs_log_item	*lip;
>  	struct xfs_ail		*ailp;
> -	int			error;
> +	int			error = 0;
>  #if defined(DEBUG) || defined(XFS_WARN)
>  	xfs_lsn_t		last_lsn;
>  #endif
>  
> -	/*
> -	 * The intent recovery handlers commit transactions to complete recovery
> -	 * for individual intents, but any new deferred operations that are
> -	 * queued during that process are held off until the very end. The
> -	 * purpose of this transaction is to serve as a container for deferred
> -	 * operations. Each intent recovery handler must transfer dfops here
> -	 * before its local transaction commits, and we'll finish the entire
> -	 * list below.
> -	 */
> -	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
> -	if (error)
> -		return error;
> -
>  	ailp = log->l_ailp;
>  	spin_lock(&ailp->ail_lock);
> -	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>  #if defined(DEBUG) || defined(XFS_WARN)
>  	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
>  #endif
> -	while (lip != NULL) {
> +	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> +	     lip != NULL;
> +	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
>  		/*
>  		 * We're done when we see something other than an intent.
>  		 * There should be no intents left in the AIL now.
> @@ -2576,28 +2601,28 @@ xlog_recover_process_intents(
>  		 */
>  		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
>  
> +		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
> +			continue;
> +
>  		/*
>  		 * NOTE: If your intent processing routine can create more
> -		 * deferred ops, you /must/ attach them to the transaction in
> -		 * this routine or else those subsequent intents will get
> +		 * deferred ops, you /must/ attach them to the capture list in
> +		 * the recover routine or else those subsequent intents will be
>  		 * replayed in the wrong order!
>  		 */
> -		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
> -			spin_unlock(&ailp->ail_lock);
> -			error = lip->li_ops->iop_recover(lip, parent_tp);
> -			spin_lock(&ailp->ail_lock);
> -		}
> +		spin_unlock(&ailp->ail_lock);
> +		error = lip->li_ops->iop_recover(lip, &capture_list);
> +		spin_lock(&ailp->ail_lock);
>  		if (error)
> -			goto out;
> -		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +			break;
>  	}
> -out:
> +
>  	xfs_trans_ail_cursor_done(&cur);
>  	spin_unlock(&ailp->ail_lock);
>  	if (!error)
> -		error = xlog_finish_defer_ops(parent_tp);
> -	xfs_trans_cancel(parent_tp);
> +		error = xlog_finish_defer_ops(log->l_mp, &capture_list);
>  
> +	xlog_cancel_defer_ops(log->l_mp, &capture_list);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 3e34b7662361..7a57b4de9ee7 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -424,7 +424,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
>  STATIC int
>  xfs_cui_item_recover(
>  	struct xfs_log_item		*lip,
> -	struct xfs_trans		*parent_tp)
> +	struct list_head		*capture_list)
>  {
>  	struct xfs_bmbt_irec		irec;
>  	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
> @@ -432,7 +432,7 @@ xfs_cui_item_recover(
>  	struct xfs_cud_log_item		*cudp;
>  	struct xfs_trans		*tp;
>  	struct xfs_btree_cur		*rcur = NULL;
> -	struct xfs_mount		*mp = parent_tp->t_mountp;
> +	struct xfs_mount		*mp = lip->li_mountp;
>  	xfs_fsblock_t			startblock_fsb;
>  	xfs_fsblock_t			new_fsb;
>  	xfs_extlen_t			new_len;
> @@ -487,12 +487,7 @@ xfs_cui_item_recover(
>  			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
>  	if (error)
>  		return error;
> -	/*
> -	 * Recovery stashes all deferred ops during intent processing and
> -	 * finishes them on completion. Transfer current dfops state to this
> -	 * transaction and transfer the result back before we return.
> -	 */
> -	xfs_defer_move(tp, parent_tp);
> +
>  	cudp = xfs_trans_get_cud(tp, cuip);
>  
>  	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> @@ -549,13 +544,10 @@ xfs_cui_item_recover(
>  	}
>  
>  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> -	xfs_defer_capture(parent_tp, tp);
> -	error = xfs_trans_commit(tp);
> -	return error;
> +	return xlog_recover_trans_commit(tp, capture_list);
>  
>  abort_error:
>  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> -	xfs_defer_move(parent_tp, tp);
>  	xfs_trans_cancel(tp);
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index e38ec5d736be..16c7a6385c3f 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -467,14 +467,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
>  STATIC int
>  xfs_rui_item_recover(
>  	struct xfs_log_item		*lip,
> -	struct xfs_trans		*parent_tp)
> +	struct list_head		*capture_list)
>  {
>  	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
>  	struct xfs_map_extent		*rmap;
>  	struct xfs_rud_log_item		*rudp;
>  	struct xfs_trans		*tp;
>  	struct xfs_btree_cur		*rcur = NULL;
> -	struct xfs_mount		*mp = parent_tp->t_mountp;
> +	struct xfs_mount		*mp = lip->li_mountp;
>  	xfs_fsblock_t			startblock_fsb;
>  	enum xfs_rmap_intent_type	type;
>  	xfs_exntst_t			state;
> @@ -567,8 +567,7 @@ xfs_rui_item_recover(
>  	}
>  
>  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> -	error = xfs_trans_commit(tp);
> -	return error;
> +	return xlog_recover_trans_commit(tp, capture_list);
>  
>  abort_error:
>  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index b752501818d2..b68272666ce1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -74,7 +74,8 @@ struct xfs_item_ops {
>  	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
>  	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> -	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
> +	int (*iop_recover)(struct xfs_log_item *lip,
> +			   struct list_head *capture_list);
>  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
>  };
>  
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-25 12:21     ` Brian Foster
@ 2020-09-25 19:19       ` Darrick J. Wong
  0 siblings, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2020-09-25 19:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, hch

On Fri, Sep 25, 2020 at 08:21:46AM -0400, Brian Foster wrote:
> On Wed, Sep 23, 2020 at 11:02:06PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we replay unfinished intent items that have been recovered from the
> > log, it's possible that the replay will cause the creation of more
> > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > recovery should replay deferred ops in order"), later work items have an
> > implicit ordering dependency on earlier work items.  Therefore, recovery
> > must replay the items (both recovered and created) in the same order
> > that they would have been during normal operation.
> > 
> > For log recovery, we enforce this ordering by using an empty transaction
> > to collect deferred ops that get created in the process of recovering a
> > log intent item to prevent them from being committed before the rest of
> > the recovered intent items.  After we finish committing all the
> > recovered log items, we allocate a transaction with an enormous block
> > reservation, splice our huge list of created deferred ops into that
> > transaction, and commit it, thereby finishing all those ops.
> > 
> > This is /really/ hokey -- it's the one place in XFS where we allow
> > nested transactions; the splicing of the defer ops list is is inelegant
> > and has to be done twice per recovery function; and the broken way we
> > handle inode pointers and block reservations cause subtle use-after-free
> > and allocator problems that will be fixed by this patch and the two
> > patches after it.
> > 
> > Therefore, replace the hokey empty transaction with a structure designed
> > to capture each chain of deferred ops that are created as part of
> > recovering a single unfinished log intent.  Finally, refactor the loop
> > that replays those chains to do so using one transaction per chain.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: rework the api so we pass around the capture list instead of pointer
> > pointer weirdness
> > ---
> >  fs/xfs/libxfs/xfs_defer.c       |   73 ++++++++++++++++++++--
> >  fs/xfs/libxfs/xfs_defer.h       |   20 ++++++
> >  fs/xfs/libxfs/xfs_log_recover.h |    2 +
> >  fs/xfs/xfs_bmap_item.c          |   16 +----
> >  fs/xfs/xfs_extfree_item.c       |    7 +-
> >  fs/xfs/xfs_log_recover.c        |  131 +++++++++++++++++++++++----------------
> >  fs/xfs/xfs_refcount_item.c      |   16 +----
> >  fs/xfs/xfs_rmap_item.c          |    7 +-
> >  fs/xfs/xfs_trans.h              |    3 +
> >  9 files changed, 181 insertions(+), 94 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 29e9762f3b77..0cb4af0c5c10 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -563,14 +563,73 @@ xfs_defer_move(
> >   *
> >   * Create and log intent items for all the work that we're capturing so that we
> >   * can be assured that the items will get replayed if the system goes down
> > - * before log recovery gets a chance to finish the work it put off.  Then we
> > - * move the chain from stp to dtp.
> > + * before log recovery gets a chance to finish the work it put off.  The entire
> > + * deferred ops state is transferred to the capture structure and the
> > + * transaction is committed.
> > + *
> > + * Note that the capture state is passed up to the caller and must be freed
> > + * even if the transaction commit returns error.
> >   */
> > -void
> > +int
> >  xfs_defer_capture(
> > -	struct xfs_trans	*dtp,
> > -	struct xfs_trans	*stp)
> > +	struct xfs_trans		*tp,
> > +	struct list_head		*capture_list)
> >  {
> > -	xfs_defer_create_intents(stp);
> > -	xfs_defer_move(dtp, stp);
> > +	struct xfs_defer_capture	*dfc;
> > +	struct xfs_mount		*mp = tp->t_mountp;
> > +	int				error;
> > +
> > +	if (list_empty(&tp->t_dfops))
> > +		return xfs_trans_commit(tp);
> > +
> > +	/* Create an object to capture the defer ops. */
> > +	dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
> > +
> > +	xfs_defer_create_intents(tp);
> > +
> > +	INIT_LIST_HEAD(&dfc->dfc_list);
> > +	INIT_LIST_HEAD(&dfc->dfc_dfops);
> 
> Maybe move this right after the allocation..?

Fixed.

> > +
> > +	/* Move the dfops chain and transaction state to the freezer. */
> 
> "freezer" or "capture?"

"...to the capture struct."

> > +	list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
> > +	dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> > +	xfs_defer_reset(tp);
> 
> Seems all this does now is clear the LOWMODE flag. Any real need for
> such a helper?

Probably not.  I guess I could remove that as a prep patch.

> > +
> > +	/*
> > +	 * Commit the transaction.  If that fails, clean up the defer ops and
> > +	 * the dfc that we just created.  Otherwise, add the dfc to the list.
> > +	 */
> > +	error = xfs_trans_commit(tp);
> > +	if (error) {
> > +		xfs_defer_capture_free(mp, dfc);
> > +		return error;
> > +	}
> 
> This is subjective I suppose, but from an interface standpoint I find it
> a little odd that a "defer capture" function commits a transaction. I'd
> expect that to be a separate action and for this function just to do
> whatever is necessary with the dfops as it relates to the transaction
> (particularly since this also appears to end up wrapped in a
> '_trans_commit()' helper that could presumably commit the transaction)..

Hmm, yeah.  Maybe I took Christoph too literally when he said that I
should pass the capture list further down the call stack.  I could make
xfs_defer_capture return the *dfc pointer, which puts
xlog_recover_trans_commit_inodes back in charge of committing the
transaction and (later in the "fix bui recovery UAF" patch).  That
function would then be in charge of unlocking the inodes and whatnot.
The ownership change in the inodes makes things murky.

> I also just realized the patch from the git repo doesn't match this one
> and this one doesn't apply cleanly to for-next. Which is the right
> version?

Ugh... so usually I push my dev branch, then send patches, but there was
something broken another 100 patches down in my dev branch a couple of
nights ago so I didn't push, but sent a new revision of these patches
out anyway.

Maybe it's just time to copy these patches into a for-5.10 branch based
on the pile of patches I already have collected for for-next, and post
that, though that means that whatever rewrites now force me to change
three branches at once... :/

--D

> 
> Brian
> 
> > +
> > +	list_add_tail(&dfc->dfc_list, capture_list);
> > +	return 0;
> > +}
> > +
> > +/* Attach a chain of captured deferred ops to a new transaction. */
> > +void
> > +xfs_defer_continue(
> > +	struct xfs_defer_capture	*dfc,
> > +	struct xfs_trans		*tp)
> > +{
> > +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> > +
> > +	/* Move captured dfops chain and state to the transaction. */
> > +	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
> > +	tp->t_flags |= dfc->dfc_tpflags;
> > +	dfc->dfc_tpflags = 0;
> > +}
> > +
> > +/* Release all resources that we used to capture deferred ops. */
> > +void
> > +xfs_defer_capture_free(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_defer_capture	*dfc)
> > +{
> > +	xfs_defer_cancel_list(mp, &dfc->dfc_dfops);
> > +	kmem_free(dfc);
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > index 3164199162b6..366d08d99e11 100644
> > --- a/fs/xfs/libxfs/xfs_defer.h
> > +++ b/fs/xfs/libxfs/xfs_defer.h
> > @@ -8,6 +8,7 @@
> >  
> >  struct xfs_btree_cur;
> >  struct xfs_defer_op_type;
> > +struct xfs_defer_capture;
> >  
> >  /*
> >   * Header for deferred operation list.
> > @@ -63,10 +64,27 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
> >  extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
> >  extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
> >  
> > +/*
> > + * Deferred operation freezer.  This structure enables a dfops user to detach
> > + * the chain of deferred operations from a transaction so that they can be
> > + * continued later.
> > + */
> > +struct xfs_defer_capture {
> > +	/* List of other freezer heads. */
> > +	struct list_head	dfc_list;
> > +
> > +	/* Deferred ops state saved from the transaction. */
> > +	struct list_head	dfc_dfops;
> > +	unsigned int		dfc_tpflags;
> > +};
> > +
> >  /*
> >   * Functions to capture a chain of deferred operations and continue them later.
> >   * This doesn't normally happen except log recovery.
> >   */
> > -void xfs_defer_capture(struct xfs_trans *dtp, struct xfs_trans *stp);
> > +int xfs_defer_capture(struct xfs_trans *tp, struct list_head *capture_list);
> > +void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp);
> > +void xfs_defer_capture_free(struct xfs_mount *mp,
> > +		struct xfs_defer_capture *dfc);
> >  
> >  #endif /* __XFS_DEFER_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
> > index 641132d0e39d..4f752096f7c7 100644
> > --- a/fs/xfs/libxfs/xfs_log_recover.h
> > +++ b/fs/xfs/libxfs/xfs_log_recover.h
> > @@ -125,5 +125,7 @@ void xlog_recover_iodone(struct xfs_buf *bp);
> >  
> >  void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
> >  		uint64_t intent_id);
> > +int xlog_recover_trans_commit(struct xfs_trans *tp,
> > +		struct list_head *capture_list);
> >  
> >  #endif	/* __XFS_LOG_RECOVER_H__ */
> > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > index b04ebcd78316..b73f0a0890a2 100644
> > --- a/fs/xfs/xfs_bmap_item.c
> > +++ b/fs/xfs/xfs_bmap_item.c
> > @@ -424,13 +424,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
> >  STATIC int
> >  xfs_bui_item_recover(
> >  	struct xfs_log_item		*lip,
> > -	struct xfs_trans		*parent_tp)
> > +	struct list_head		*capture_list)
> >  {
> >  	struct xfs_bmbt_irec		irec;
> >  	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
> >  	struct xfs_trans		*tp;
> >  	struct xfs_inode		*ip = NULL;
> > -	struct xfs_mount		*mp = parent_tp->t_mountp;
> > +	struct xfs_mount		*mp = lip->li_mountp;
> >  	struct xfs_map_extent		*bmap;
> >  	struct xfs_bud_log_item		*budp;
> >  	xfs_fsblock_t			startblock_fsb;
> > @@ -478,12 +478,7 @@ xfs_bui_item_recover(
> >  			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
> >  	if (error)
> >  		return error;
> > -	/*
> > -	 * Recovery stashes all deferred ops during intent processing and
> > -	 * finishes them on completion. Transfer current dfops state to this
> > -	 * transaction and transfer the result back before we return.
> > -	 */
> > -	xfs_defer_move(tp, parent_tp);
> > +
> >  	budp = xfs_trans_get_bud(tp, buip);
> >  
> >  	/* Grab the inode. */
> > @@ -531,15 +526,12 @@ xfs_bui_item_recover(
> >  		xfs_bmap_unmap_extent(tp, ip, &irec);
> >  	}
> >  
> > -	xfs_defer_capture(parent_tp, tp);
> > -	error = xfs_trans_commit(tp);
> > +	error = xlog_recover_trans_commit(tp, capture_list);
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	xfs_irele(ip);
> > -
> >  	return error;
> >  
> >  err_inode:
> > -	xfs_defer_move(parent_tp, tp);
> >  	xfs_trans_cancel(tp);
> >  	if (ip) {
> >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> > index 9093d2e7afdf..be0186875566 100644
> > --- a/fs/xfs/xfs_extfree_item.c
> > +++ b/fs/xfs/xfs_extfree_item.c
> > @@ -585,10 +585,10 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
> >  STATIC int
> >  xfs_efi_item_recover(
> >  	struct xfs_log_item		*lip,
> > -	struct xfs_trans		*parent_tp)
> > +	struct list_head		*capture_list)
> >  {
> >  	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
> > -	struct xfs_mount		*mp = parent_tp->t_mountp;
> > +	struct xfs_mount		*mp = lip->li_mountp;
> >  	struct xfs_efd_log_item		*efdp;
> >  	struct xfs_trans		*tp;
> >  	struct xfs_extent		*extp;
> > @@ -627,8 +627,7 @@ xfs_efi_item_recover(
> >  
> >  	}
> >  
> > -	error = xfs_trans_commit(tp);
> > -	return error;
> > +	return xlog_recover_trans_commit(tp, capture_list);
> >  
> >  abort_error:
> >  	xfs_trans_cancel(tp);
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index e2ec91b2d0f4..79be1f02d1b4 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -1791,6 +1791,19 @@ xlog_recover_release_intent(
> >  	spin_unlock(&ailp->ail_lock);
> >  }
> >  
> > +/*
> > + * Freeze any deferred ops and commit the transaction.  This is the last step
> > + * needed to finish a log intent item that we recovered from the log, and will
> > + * take care of releasing all the relevant resources.
> > + */
> > +int
> > +xlog_recover_trans_commit(
> > +	struct xfs_trans		*tp,
> > +	struct list_head		*capture_list)
> > +{
> > +	return xfs_defer_capture(tp, capture_list);
> > +}
> > +
> >  /******************************************************************************
> >   *
> >   *		Log recover routines
> > @@ -2467,38 +2480,62 @@ xlog_recover_process_data(
> >  	return 0;
> >  }
> >  
> > +static void
> > +xlog_cancel_defer_ops(
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*capture_list)
> > +{
> > +	struct xfs_defer_capture *dfc, *next;
> > +
> > +	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> > +		list_del_init(&dfc->dfc_list);
> > +		xfs_defer_capture_free(mp, dfc);
> > +	}
> > +}
> > +
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > -	struct xfs_trans	*parent_tp)
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*capture_list)
> >  {
> > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > +	struct xfs_defer_capture *dfc, *next;
> >  	struct xfs_trans	*tp;
> >  	int64_t			freeblks;
> > -	uint			resblks;
> > -	int			error;
> > +	uint64_t		resblks;
> > +	int			error = 0;
> >  
> > -	/*
> > -	 * We're finishing the defer_ops that accumulated as a result of
> > -	 * recovering unfinished intent items during log recovery.  We
> > -	 * reserve an itruncate transaction because it is the largest
> > -	 * permanent transaction type.  Since we're the only user of the fs
> > -	 * right now, take 93% (15/16) of the available free blocks.  Use
> > -	 * weird math to avoid a 64-bit division.
> > -	 */
> > -	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > -	if (freeblks <= 0)
> > -		return -ENOSPC;
> > -	resblks = min_t(int64_t, UINT_MAX, freeblks);
> > -	resblks = (resblks * 15) >> 4;
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > -			0, XFS_TRANS_RESERVE, &tp);
> > -	if (error)
> > -		return error;
> > -	/* transfer all collected dfops to this transaction */
> > -	xfs_defer_move(tp, parent_tp);
> > +	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> > +		/*
> > +		 * We're finishing the defer_ops that accumulated as a result
> > +		 * of recovering unfinished intent items during log recovery.
> > +		 * We reserve an itruncate transaction because it is the
> > +		 * largest permanent transaction type.  Since we're the only
> > +		 * user of the fs right now, take 93% (15/16) of the available
> > +		 * free blocks.  Use weird math to avoid a 64-bit division.
> > +		 */
> > +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > +		if (freeblks <= 0)
> > +			return -ENOSPC;
> >  
> > -	return xfs_trans_commit(tp);
> > +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> > +		resblks = (resblks * 15) >> 4;
> > +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > +				0, XFS_TRANS_RESERVE, &tp);
> > +		if (error)
> > +			return error;
> > +
> > +		/* transfer all collected dfops to this transaction */
> > +		list_del_init(&dfc->dfc_list);
> > +		xfs_defer_continue(dfc, tp);
> > +
> > +		error = xfs_trans_commit(tp);
> > +		xfs_defer_capture_free(mp, dfc);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	return 0;
> >  }
> >  
> >  /* Is this log item a deferred action intent? */
> > @@ -2528,35 +2565,23 @@ STATIC int
> >  xlog_recover_process_intents(
> >  	struct xlog		*log)
> >  {
> > -	struct xfs_trans	*parent_tp;
> > +	LIST_HEAD(capture_list);
> >  	struct xfs_ail_cursor	cur;
> >  	struct xfs_log_item	*lip;
> >  	struct xfs_ail		*ailp;
> > -	int			error;
> > +	int			error = 0;
> >  #if defined(DEBUG) || defined(XFS_WARN)
> >  	xfs_lsn_t		last_lsn;
> >  #endif
> >  
> > -	/*
> > -	 * The intent recovery handlers commit transactions to complete recovery
> > -	 * for individual intents, but any new deferred operations that are
> > -	 * queued during that process are held off until the very end. The
> > -	 * purpose of this transaction is to serve as a container for deferred
> > -	 * operations. Each intent recovery handler must transfer dfops here
> > -	 * before its local transaction commits, and we'll finish the entire
> > -	 * list below.
> > -	 */
> > -	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
> > -	if (error)
> > -		return error;
> > -
> >  	ailp = log->l_ailp;
> >  	spin_lock(&ailp->ail_lock);
> > -	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> >  #if defined(DEBUG) || defined(XFS_WARN)
> >  	last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
> >  #endif
> > -	while (lip != NULL) {
> > +	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > +	     lip != NULL;
> > +	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> >  		/*
> >  		 * We're done when we see something other than an intent.
> >  		 * There should be no intents left in the AIL now.
> > @@ -2576,28 +2601,28 @@ xlog_recover_process_intents(
> >  		 */
> >  		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
> >  
> > +		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
> > +			continue;
> > +
> >  		/*
> >  		 * NOTE: If your intent processing routine can create more
> > -		 * deferred ops, you /must/ attach them to the transaction in
> > -		 * this routine or else those subsequent intents will get
> > +		 * deferred ops, you /must/ attach them to the capture list in
> > +		 * the recover routine or else those subsequent intents will be
> >  		 * replayed in the wrong order!
> >  		 */
> > -		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
> > -			spin_unlock(&ailp->ail_lock);
> > -			error = lip->li_ops->iop_recover(lip, parent_tp);
> > -			spin_lock(&ailp->ail_lock);
> > -		}
> > +		spin_unlock(&ailp->ail_lock);
> > +		error = lip->li_ops->iop_recover(lip, &capture_list);
> > +		spin_lock(&ailp->ail_lock);
> >  		if (error)
> > -			goto out;
> > -		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +			break;
> >  	}
> > -out:
> > +
> >  	xfs_trans_ail_cursor_done(&cur);
> >  	spin_unlock(&ailp->ail_lock);
> >  	if (!error)
> > -		error = xlog_finish_defer_ops(parent_tp);
> > -	xfs_trans_cancel(parent_tp);
> > +		error = xlog_finish_defer_ops(log->l_mp, &capture_list);
> >  
> > +	xlog_cancel_defer_ops(log->l_mp, &capture_list);
> >  	return error;
> >  }
> >  
> > diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> > index 3e34b7662361..7a57b4de9ee7 100644
> > --- a/fs/xfs/xfs_refcount_item.c
> > +++ b/fs/xfs/xfs_refcount_item.c
> > @@ -424,7 +424,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
> >  STATIC int
> >  xfs_cui_item_recover(
> >  	struct xfs_log_item		*lip,
> > -	struct xfs_trans		*parent_tp)
> > +	struct list_head		*capture_list)
> >  {
> >  	struct xfs_bmbt_irec		irec;
> >  	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
> > @@ -432,7 +432,7 @@ xfs_cui_item_recover(
> >  	struct xfs_cud_log_item		*cudp;
> >  	struct xfs_trans		*tp;
> >  	struct xfs_btree_cur		*rcur = NULL;
> > -	struct xfs_mount		*mp = parent_tp->t_mountp;
> > +	struct xfs_mount		*mp = lip->li_mountp;
> >  	xfs_fsblock_t			startblock_fsb;
> >  	xfs_fsblock_t			new_fsb;
> >  	xfs_extlen_t			new_len;
> > @@ -487,12 +487,7 @@ xfs_cui_item_recover(
> >  			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
> >  	if (error)
> >  		return error;
> > -	/*
> > -	 * Recovery stashes all deferred ops during intent processing and
> > -	 * finishes them on completion. Transfer current dfops state to this
> > -	 * transaction and transfer the result back before we return.
> > -	 */
> > -	xfs_defer_move(tp, parent_tp);
> > +
> >  	cudp = xfs_trans_get_cud(tp, cuip);
> >  
> >  	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> > @@ -549,13 +544,10 @@ xfs_cui_item_recover(
> >  	}
> >  
> >  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > -	xfs_defer_capture(parent_tp, tp);
> > -	error = xfs_trans_commit(tp);
> > -	return error;
> > +	return xlog_recover_trans_commit(tp, capture_list);
> >  
> >  abort_error:
> >  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > -	xfs_defer_move(parent_tp, tp);
> >  	xfs_trans_cancel(tp);
> >  	return error;
> >  }
> > diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> > index e38ec5d736be..16c7a6385c3f 100644
> > --- a/fs/xfs/xfs_rmap_item.c
> > +++ b/fs/xfs/xfs_rmap_item.c
> > @@ -467,14 +467,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
> >  STATIC int
> >  xfs_rui_item_recover(
> >  	struct xfs_log_item		*lip,
> > -	struct xfs_trans		*parent_tp)
> > +	struct list_head		*capture_list)
> >  {
> >  	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
> >  	struct xfs_map_extent		*rmap;
> >  	struct xfs_rud_log_item		*rudp;
> >  	struct xfs_trans		*tp;
> >  	struct xfs_btree_cur		*rcur = NULL;
> > -	struct xfs_mount		*mp = parent_tp->t_mountp;
> > +	struct xfs_mount		*mp = lip->li_mountp;
> >  	xfs_fsblock_t			startblock_fsb;
> >  	enum xfs_rmap_intent_type	type;
> >  	xfs_exntst_t			state;
> > @@ -567,8 +567,7 @@ xfs_rui_item_recover(
> >  	}
> >  
> >  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > -	error = xfs_trans_commit(tp);
> > -	return error;
> > +	return xlog_recover_trans_commit(tp, capture_list);
> >  
> >  abort_error:
> >  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index b752501818d2..b68272666ce1 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -74,7 +74,8 @@ struct xfs_item_ops {
> >  	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> >  	void (*iop_release)(struct xfs_log_item *);
> >  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> > -	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
> > +	int (*iop_recover)(struct xfs_log_item *lip,
> > +			   struct list_head *capture_list);
> >  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
> >  };
> >  
> > 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-09-25 23:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-17  3:29 [PATCH 0/3] xfs: fix how we deal with new intents during recovery Darrick J. Wong
2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
2020-09-23  7:48   ` Christoph Hellwig
2020-09-23 15:46     ` Darrick J. Wong
2020-09-24  6:02   ` [PATCH v2 " Darrick J. Wong
2020-09-25 12:21     ` Brian Foster
2020-09-25 19:19       ` Darrick J. Wong
2020-09-17  3:29 ` [PATCH 2/3] xfs: xfs_defer_capture should absorb remaining block reservation Darrick J. Wong
2020-09-23  7:49   ` Christoph Hellwig
2020-09-24  6:04   ` [PATCH v2 " Darrick J. Wong
2020-09-17  3:29 ` [PATCH 3/3] xfs: xfs_defer_capture should absorb remaining transaction reservation Darrick J. Wong
2020-09-23  7:49   ` Christoph Hellwig
2020-09-23 15:22     ` Darrick J. Wong
2020-09-24  6:03   ` [PATCH v2 " Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.