All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] xfs: fix inode use-after-free during log recovery
@ 2020-04-22  2:08 Darrick J. Wong
  2020-04-22  2:08 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-22  2:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

Fix a use-after-free during log recovery of deferred operations by
creating explicit freeze and thaw mechanisms for deferred ops that were
created while processing intent items that were recovered from the log.
While we're at it, fix all the bogosity around how we gather up log
intents during recovery and actually commit them to the filesystem.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-log-recovery

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=fix-log-recovery

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-04-22  2:08 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
@ 2020-04-22  2:08 ` Darrick J. Wong
  2020-04-24 14:02   ` Brian Foster
  2020-04-22  2:08 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
  2020-04-22  2:08 ` [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes Darrick J. Wong
  2 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-22  2:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items.  Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items.  After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent.  Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c       |  100 ++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_defer.h       |   25 ++++++++
 fs/xfs/libxfs/xfs_log_recover.h |    6 ++
 fs/xfs/xfs_bmap_item.c          |   19 ++----
 fs/xfs/xfs_extfree_item.c       |    4 +
 fs/xfs/xfs_log_recover.c        |  122 +++++++++++++++++++++++++--------------
 fs/xfs/xfs_refcount_item.c      |   20 ++----
 fs/xfs/xfs_rmap_item.c          |   11 ++--
 8 files changed, 228 insertions(+), 79 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 22557527cfdb..33e0f246e181 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -552,3 +552,103 @@ xfs_defer_move(
 
 	xfs_defer_reset(stp);
 }
+
+/*
+ * Freeze a chain of deferred ops that are attached to a transaction.  The
+ * entire deferred ops state is transferred to the freezer, and each dfops
+ * item will be prepared for freezing.
+ */
+int
+xfs_defer_freeze(
+	struct xfs_trans		*tp,
+	struct xfs_defer_freezer	**dffp)
+{
+	struct xfs_defer_freezer	*dff;
+	struct xfs_defer_pending	*dfp;
+	int				error;
+
+	*dffp = NULL;
+	if (list_empty(&tp->t_dfops))
+		return 0;
+
+	dff = kmem_zalloc(sizeof(*dff), KM_NOFS);
+	if (!dff)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&dff->dff_list);
+	INIT_LIST_HEAD(&dff->dff_dfops);
+
+	/* Freeze all of the dfops items attached to the transaction. */
+	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
+		const struct xfs_defer_op_type *ops;
+		struct list_head	*li;
+
+		ops = defer_op_types[dfp->dfp_type];
+		if (!ops->freeze_item)
+			continue;
+
+		list_for_each(li, &dfp->dfp_work) {
+			error = ops->freeze_item(dff, li);
+			if (error)
+				break;
+		}
+		if (error)
+			break;
+	}
+	if (error) {
+		kmem_free(dff);
+		return error;
+	}
+
+	/* Move all the dfops items to the freezer. */
+	list_splice_init(&tp->t_dfops, &dff->dff_dfops);
+	dff->dff_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+	xfs_defer_reset(tp);
+
+	*dffp = dff;
+	return 0;
+}
+
+/* Thaw a chain of deferred ops that are attached to a transaction. */
+int
+xfs_defer_thaw(
+	struct xfs_defer_freezer	*dff,
+	struct xfs_trans		*tp)
+{
+	struct xfs_defer_pending	*dfp;
+	int				error;
+
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+
+	/* Thaw each of the items. */
+	list_for_each_entry(dfp, &dff->dff_dfops, dfp_list) {
+		const struct xfs_defer_op_type *ops;
+		struct list_head	*li;
+
+		ops = defer_op_types[dfp->dfp_type];
+		if (!ops->thaw_item)
+			continue;
+
+		list_for_each(li, &dfp->dfp_work) {
+			error = ops->thaw_item(dff, li);
+			if (error)
+				return error;
+		}
+	}
+
+	/* Add the dfops items to the transaction. */
+	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
+	tp->t_flags |= dff->dff_tpflags;
+
+	return 0;
+}
+
+/* Release a deferred op freezer and all resources associated with it. */
+void
+xfs_defer_freeezer_finish(
+	struct xfs_mount		*mp,
+	struct xfs_defer_freezer	*dff)
+{
+	xfs_defer_cancel_list(mp, &dff->dff_dfops);
+	kmem_free(dff);
+}
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 7c28d7608ac6..90e05f6af53c 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -7,6 +7,7 @@
 #define	__XFS_DEFER_H__
 
 struct xfs_defer_op_type;
+struct xfs_defer_freezer;
 
 /*
  * Header for deferred operation list.
@@ -53,6 +54,10 @@ struct xfs_defer_op_type {
 	void *(*create_intent)(struct xfs_trans *, uint);
 	void (*log_item)(struct xfs_trans *, void *, struct list_head *);
 	unsigned int		max_items;
+	int (*freeze_item)(struct xfs_defer_freezer *freezer,
+			struct list_head *item);
+	int (*thaw_item)(struct xfs_defer_freezer *freezer,
+			struct list_head *item);
 };
 
 extern const struct xfs_defer_op_type xfs_bmap_update_defer_type;
@@ -61,4 +66,24 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 
+/*
+ * Deferred operation freezer.  This structure enables a dfops user to detach
+ * the chain of deferred operations from a transaction so that they can be
+ * run later.
+ */
+struct xfs_defer_freezer {
+	/* List of other freezer heads. */
+	struct list_head	dff_list;
+
+	/* Deferred ops state saved from the transaction. */
+	struct list_head	dff_dfops;
+	unsigned int		dff_tpflags;
+};
+
+/* Functions to freeze a chain of deferred operations for later. */
+int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
+int xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
+void xfs_defer_freeezer_finish(struct xfs_mount *mp,
+		struct xfs_defer_freezer *dff);
+
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 5c37940386d6..b36ccaa5465b 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -6,6 +6,8 @@
 #ifndef	__XFS_LOG_RECOVER_H__
 #define __XFS_LOG_RECOVER_H__
 
+struct xfs_defer_freezer;
+
 /*
  * Each log item type (XFS_LI_*) gets its own xlog_recover_item_type to
  * define how recovery should work for that type of log item.
@@ -131,7 +133,7 @@ typedef int (*xlog_recover_intent_fn)(struct xlog *xlog,
 typedef int (*xlog_recover_done_fn)(struct xlog *xlog,
 		struct xlog_recover_item *item);
 typedef int (*xlog_recover_process_intent_fn)(struct xlog *log,
-		struct xfs_trans *tp, struct xfs_log_item *lip);
+		struct xfs_defer_freezer **dffp, struct xfs_log_item *lip);
 typedef void (*xlog_recover_cancel_intent_fn)(struct xfs_log_item *lip);
 
 struct xlog_recover_intent_type {
@@ -174,5 +176,7 @@ void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id, xlog_recover_release_intent_fn fn);
 void xlog_recover_insert_ail(struct xlog *log, struct xfs_log_item *lip,
 		xfs_lsn_t lsn);
+int xlog_recover_trans_commit(struct xfs_trans *tp,
+		struct xfs_defer_freezer **dffp);
 
 #endif	/* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 53160172c36b..5c22a902d8ca 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -433,7 +433,8 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
  */
 STATIC int
 xfs_bui_recover(
-	struct xfs_trans		*parent_tp,
+	struct xfs_mount		*mp,
+	struct xfs_defer_freezer	**dffp,
 	struct xfs_bui_log_item		*buip)
 {
 	int				error = 0;
@@ -450,7 +451,6 @@ xfs_bui_recover(
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
 	struct xfs_bmbt_irec		irec;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
 
 	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
 
@@ -499,12 +499,7 @@ xfs_bui_recover(
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	budp = xfs_trans_get_bud(tp, buip);
 
 	/* Grab the inode. */
@@ -549,15 +544,13 @@ xfs_bui_recover(
 	}
 
 	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
-	xfs_defer_move(parent_tp, tp);
-	error = xfs_trans_commit(tp);
+	error = xlog_recover_trans_commit(tp, dffp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_irele(ip);
 
 	return error;
 
 err_inode:
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	if (ip) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -678,7 +671,7 @@ xlog_recover_bud(
 STATIC int
 xlog_recover_process_bui(
 	struct xlog			*log,
-	struct xfs_trans		*parent_tp,
+	struct xfs_defer_freezer	**dffp,
 	struct xfs_log_item		*lip)
 {
 	struct xfs_ail			*ailp = log->l_ailp;
@@ -692,7 +685,7 @@ xlog_recover_process_bui(
 		return 0;
 
 	spin_unlock(&ailp->ail_lock);
-	error = xfs_bui_recover(parent_tp, buip);
+	error = xfs_bui_recover(log->l_mp, dffp, buip);
 	spin_lock(&ailp->ail_lock);
 
 	return error;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index a15ede29244a..5675062ad436 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -739,7 +739,7 @@ xlog_recover_efd(
 STATIC int
 xlog_recover_process_efi(
 	struct xlog			*log,
-	struct xfs_trans		*tp,
+	struct xfs_defer_freezer	**dffp,
 	struct xfs_log_item		*lip)
 {
 	struct xfs_ail			*ailp = log->l_ailp;
@@ -753,7 +753,7 @@ xlog_recover_process_efi(
 		return 0;
 
 	spin_unlock(&ailp->ail_lock);
-	error = xfs_efi_recover(tp->t_mountp, efip);
+	error = xfs_efi_recover(log->l_mp, efip);
 	spin_lock(&ailp->ail_lock);
 
 	return error;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 5a20a95c5dad..e9b3e901d009 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1814,6 +1814,26 @@ xlog_recover_insert_ail(
 	xfs_trans_ail_update(log->l_ailp, lip, lsn);
 }
 
+/*
+ * Freeze any deferred ops and commit the transaction.  This is the last step
+ * needed to finish a log intent item that we recovered from the log.
+ */
+int
+xlog_recover_trans_commit(
+	struct xfs_trans		*tp,
+	struct xfs_defer_freezer	**dffp)
+{
+	int				error;
+
+	error = xfs_defer_freeze(tp, dffp);
+	if (error) {
+		xfs_trans_cancel(tp);
+		return error;
+	}
+
+	return xfs_trans_commit(tp);
+}
+
 static inline const struct xlog_recover_intent_type *
 xlog_intent_for_type(
 	unsigned short			type)
@@ -2652,35 +2672,59 @@ xlog_recover_process_data(
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
-	struct xfs_trans	*parent_tp)
+	struct xfs_mount	*mp,
+	struct list_head	*dfops_freezers)
 {
-	struct xfs_mount	*mp = parent_tp->t_mountp;
+	struct xfs_defer_freezer *dff, *next;
 	struct xfs_trans	*tp;
 	int64_t			freeblks;
 	uint			resblks;
-	int			error;
+	int			error = 0;
 
-	/*
-	 * We're finishing the defer_ops that accumulated as a result of
-	 * recovering unfinished intent items during log recovery.  We
-	 * reserve an itruncate transaction because it is the largest
-	 * permanent transaction type.  Since we're the only user of the fs
-	 * right now, take 93% (15/16) of the available free blocks.  Use
-	 * weird math to avoid a 64-bit division.
-	 */
-	freeblks = percpu_counter_sum(&mp->m_fdblocks);
-	if (freeblks <= 0)
-		return -ENOSPC;
-	resblks = min_t(int64_t, UINT_MAX, freeblks);
-	resblks = (resblks * 15) >> 4;
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
-			0, XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-	/* transfer all collected dfops to this transaction */
-	xfs_defer_move(tp, parent_tp);
+	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		/*
+		 * We're finishing the defer_ops that accumulated as a result
+		 * of recovering unfinished intent items during log recovery.
+		 * We reserve an itruncate transaction because it is the
+		 * largest permanent transaction type.  Since we're the only
+		 * user of the fs right now, take 93% (15/16) of the available
+		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 */
+		freeblks = percpu_counter_sum(&mp->m_fdblocks);
+		if (freeblks <= 0) {
+			error = -ENOSPC;
+			break;
+		}
 
-	return xfs_trans_commit(tp);
+		resblks = min_t(int64_t, UINT_MAX, freeblks);
+		resblks = (resblks * 15) >> 4;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, XFS_TRANS_RESERVE, &tp);
+		if (error)
+			break;
+
+		/* transfer all collected dfops to this transaction */
+		list_del_init(&dff->dff_list);
+		error = xfs_defer_thaw(dff, tp);
+		if (error) {
+			xfs_trans_cancel(tp);
+			xfs_defer_freeezer_finish(mp, dff);
+			break;
+		}
+
+		error = xfs_trans_commit(tp);
+		xfs_defer_freeezer_finish(mp, dff);
+		if (error)
+			break;
+	}
+
+	/* Kill any remaining freezers. */
+	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		list_del_init(&dff->dff_list);
+		xfs_defer_freeezer_finish(mp, dff);
+	}
+
+	return error;
 }
 
 /*
@@ -2703,8 +2747,9 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_trans	*parent_tp;
+	LIST_HEAD(dfops_freezers);
 	struct xfs_ail_cursor	cur;
+	struct xfs_defer_freezer *freezer = NULL;
 	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
 	int			error;
@@ -2712,19 +2757,6 @@ xlog_recover_process_intents(
 	xfs_lsn_t		last_lsn;
 #endif
 
-	/*
-	 * The intent recovery handlers commit transactions to complete recovery
-	 * for individual intents, but any new deferred operations that are
-	 * queued during that process are held off until the very end. The
-	 * purpose of this transaction is to serve as a container for deferred
-	 * operations. Each intent recovery handler must transfer dfops here
-	 * before its local transaction commits, and we'll finish the entire
-	 * list below.
-	 */
-	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
-	if (error)
-		return error;
-
 	ailp = log->l_ailp;
 	spin_lock(&ailp->ail_lock);
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
@@ -2756,23 +2788,27 @@ xlog_recover_process_intents(
 
 		/*
 		 * NOTE: If your intent processing routine can create more
-		 * deferred ops, you /must/ attach them to the transaction in
+		 * deferred ops, you /must/ attach them to the freezer in
 		 * this routine or else those subsequent intents will get
 		 * replayed in the wrong order!
 		 */
-		error = type->process_intent(log, parent_tp, lip);
+		error = type->process_intent(log, &freezer, lip);
 		if (error)
 			goto out;
+		if (freezer) {
+			list_add_tail(&freezer->dff_list, &dfops_freezers);
+			freezer = NULL;
+		}
+
 		lip = xfs_trans_ail_cursor_next(ailp, &cur);
 	}
 out:
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
-	if (!error)
-		error = xlog_finish_defer_ops(parent_tp);
-	xfs_trans_cancel(parent_tp);
+	if (error)
+		return error;
 
-	return error;
+	return xlog_finish_defer_ops(log->l_mp, &dfops_freezers);
 }
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 01a393727a1e..18b1fbc276fc 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -447,7 +447,8 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
  */
 STATIC int
 xfs_cui_recover(
-	struct xfs_trans		*parent_tp,
+	struct xfs_mount		*mp,
+	struct xfs_defer_freezer	**dffp,
 	struct xfs_cui_log_item		*cuip)
 {
 	int				i;
@@ -464,7 +465,6 @@ xfs_cui_recover(
 	xfs_extlen_t			new_len;
 	struct xfs_bmbt_irec		irec;
 	bool				requeue_only = false;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
 
 	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
 
@@ -519,12 +519,7 @@ xfs_cui_recover(
 			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	cudp = xfs_trans_get_cud(tp, cuip);
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
@@ -582,13 +577,10 @@ xfs_cui_recover(
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
 	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
-	xfs_defer_move(parent_tp, tp);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	return error;
 }
@@ -701,7 +693,7 @@ xlog_recover_cud(
 STATIC int
 xlog_recover_process_cui(
 	struct xlog			*log,
-	struct xfs_trans		*parent_tp,
+	struct xfs_defer_freezer	**dffp,
 	struct xfs_log_item		*lip)
 {
 	struct xfs_ail			*ailp = log->l_ailp;
@@ -715,7 +707,7 @@ xlog_recover_process_cui(
 		return 0;
 
 	spin_unlock(&ailp->ail_lock);
-	error = xfs_cui_recover(parent_tp, cuip);
+	error = xfs_cui_recover(log->l_mp, dffp, cuip);
 	spin_lock(&ailp->ail_lock);
 
 	return error;
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 69a2d23eedda..7291fac7c6d1 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -491,7 +491,8 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
  */
 STATIC int
 xfs_rui_recover(
-	struct xfs_trans		*parent_tp,
+	struct xfs_mount		*mp,
+	struct xfs_defer_freezer	**dffp,
 	struct xfs_rui_log_item		*ruip)
 {
 	int				i;
@@ -505,7 +506,6 @@ xfs_rui_recover(
 	xfs_exntst_t			state;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
 
 	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
 
@@ -601,8 +601,7 @@ xfs_rui_recover(
 
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
 	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
@@ -691,7 +690,7 @@ xlog_recover_rud(
 STATIC int
 xlog_recover_process_rui(
 	struct xlog			*log,
-	struct xfs_trans		*parent_tp,
+	struct xfs_defer_freezer	**dffp,
 	struct xfs_log_item		*lip)
 {
 	struct xfs_ail			*ailp = log->l_ailp;
@@ -705,7 +704,7 @@ xlog_recover_process_rui(
 		return 0;
 
 	spin_unlock(&ailp->ail_lock);
-	error = xfs_rui_recover(parent_tp, ruip);
+	error = xfs_rui_recover(log->l_mp, dffp, ruip);
 	spin_lock(&ailp->ail_lock);
 
 	return error;


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-22  2:08 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
  2020-04-22  2:08 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
@ 2020-04-22  2:08 ` Darrick J. Wong
  2020-04-24 14:04   ` Brian Foster
  2020-04-22  2:08 ` [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes Darrick J. Wong
  2 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-22  2:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

On filesystems that support them, bmap intent log items can be used to
change mappings in inode data or attr forks.  However, if the bmbt must
expand, the enormous block reservations that we make for finishing
chains of deferred log items become a liability because the bmbt block
allocator sets minleft to the transaction reservation and there probably
aren't any AGs in the filesystem that have that much free space.

Whereas previously we would reserve 93% of the free blocks in the
filesystem, now we only want to reserve 7/8ths of the free space in the
least full AG, and no more than half of the usable blocks in an AG.  In
theory we shouldn't run out of space because (prior to the unclean
shutdown) all of the in-progress transactions successfully reserved the
worst case number of disk blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e9b3e901d009..a416b028b320 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2669,6 +2669,44 @@ xlog_recover_process_data(
 	return 0;
 }
 
+/*
+ * Estimate a block reservation for a log recovery transaction.  Since we run
+ * separate transactions for each chain of deferred ops that get created as a
+ * result of recovering unfinished log intent items, we must be careful not to
+ * reserve so many blocks that block allocations fail because we can't satisfy
+ * the minleft requirements (e.g. for bmbt blocks).
+ */
+static int
+xlog_estimate_recovery_resblks(
+	struct xfs_mount	*mp,
+	unsigned int		*resblks)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	unsigned int		free = 0;
+	int			error;
+
+	/* Don't use more than 7/8th of the free space in the least full AG. */
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		unsigned int	ag_free;
+
+		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
+		if (error)
+			return error;
+		pag = xfs_perag_get(mp, agno);
+		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
+		free = max(free, (ag_free * 7) / 8);
+		xfs_perag_put(pag);
+	}
+
+	/* Don't try to reserve more than half the usable AG blocks. */
+	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
+	if (*resblks == 0)
+		return -ENOSPC;
+
+	return 0;
+}
+
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
@@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
 {
 	struct xfs_defer_freezer *dff, *next;
 	struct xfs_trans	*tp;
-	int64_t			freeblks;
 	uint			resblks;
 	int			error = 0;
 
 	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		error = xlog_estimate_recovery_resblks(mp, &resblks);
+		if (error)
+			break;
+
 		/*
 		 * We're finishing the defer_ops that accumulated as a result
 		 * of recovering unfinished intent items during log recovery.
 		 * We reserve an itruncate transaction because it is the
-		 * largest permanent transaction type.  Since we're the only
-		 * user of the fs right now, take 93% (15/16) of the available
-		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 * largest permanent transaction type.
 		 */
-		freeblks = percpu_counter_sum(&mp->m_fdblocks);
-		if (freeblks <= 0) {
-			error = -ENOSPC;
-			break;
-		}
-
-		resblks = min_t(int64_t, UINT_MAX, freeblks);
-		resblks = (resblks * 15) >> 4;
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
 				0, XFS_TRANS_RESERVE, &tp);
 		if (error)


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-22  2:08 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
  2020-04-22  2:08 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
  2020-04-22  2:08 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
@ 2020-04-22  2:08 ` Darrick J. Wong
  2020-04-25 19:01   ` Christoph Hellwig
  2 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-22  2:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Make it so that the deferred operations freezer can save inode numbers
when we freeze the dfops chain, and turn them into pointers to incore
inodes when we thaw the dfops chain to finish them.  Next, add dfops
item freeze and thaw functions to the BUI/BUD items so that they can
take advantage of this new feature.  This fixes a UAF bug in the
deferred bunmapi code because xfs_bui_recover can schedule another BUI
to continue unmapping but drops the inode pointer immediately
afterwards.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.h  |    5 +++-
 fs/xfs/libxfs/xfs_defer.c |   56 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_defer.h |   18 ++++++++++++++
 fs/xfs/xfs_bmap_item.c    |   41 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.c       |   49 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 167 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index f3259ad5c22c..58f6c14f9100 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -253,7 +253,10 @@ enum xfs_bmap_intent_type {
 struct xfs_bmap_intent {
 	struct list_head			bi_list;
 	enum xfs_bmap_intent_type		bi_type;
-	struct xfs_inode			*bi_owner;
+	union {
+		struct xfs_inode		*bi_owner;
+		xfs_ino_t			bi_owner_ino;
+	};
 	int					bi_whichfork;
 	struct xfs_bmbt_irec			bi_bmap;
 };
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 33e0f246e181..1cab95cef399 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -16,6 +16,7 @@
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
 #include "xfs_trace.h"
+#include "xfs_icache.h"
 
 /*
  * Deferred Operations in XFS
@@ -565,6 +566,7 @@ xfs_defer_freeze(
 {
 	struct xfs_defer_freezer	*dff;
 	struct xfs_defer_pending	*dfp;
+	unsigned int			i;
 	int				error;
 
 	*dffp = NULL;
@@ -577,6 +579,8 @@ xfs_defer_freeze(
 
 	INIT_LIST_HEAD(&dff->dff_list);
 	INIT_LIST_HEAD(&dff->dff_dfops);
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++)
+		dff->dff_ino[i] = NULLFSINO;
 
 	/* Freeze all of the dfops items attached to the transaction. */
 	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
@@ -620,6 +624,11 @@ xfs_defer_thaw(
 
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 
+	/* Grab all the inodes we wanted. */
+	error = xfs_defer_freezer_iget(dff, tp);
+	if (error)
+		return error;
+
 	/* Thaw each of the items. */
 	list_for_each_entry(dfp, &dff->dff_dfops, dfp_list) {
 		const struct xfs_defer_op_type *ops;
@@ -632,7 +641,7 @@ xfs_defer_thaw(
 		list_for_each(li, &dfp->dfp_work) {
 			error = ops->thaw_item(dff, li);
 			if (error)
-				return error;
+				goto out_irele;
 		}
 	}
 
@@ -641,6 +650,9 @@ xfs_defer_thaw(
 	tp->t_flags |= dff->dff_tpflags;
 
 	return 0;
+out_irele:
+	xfs_defer_freezer_irele(dff);
+	return error;
 }
 
 /* Release a deferred op freezer and all resources associated with it. */
@@ -650,5 +662,47 @@ xfs_defer_freeezer_finish(
 	struct xfs_defer_freezer	*dff)
 {
 	xfs_defer_cancel_list(mp, &dff->dff_dfops);
+	xfs_defer_freezer_irele(dff);
 	kmem_free(dff);
 }
+
+/* Attach an inode to this deferred ops freezer. */
+int
+xfs_defer_freezer_ijoin(
+	struct xfs_defer_freezer	*dff,
+	struct xfs_inode		*ip)
+{
+	unsigned int			i;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
+		if (dff->dff_ino[i] == NULLFSINO)
+			break;
+		if (dff->dff_ino[i] == ip->i_ino)
+			return 0;
+	}
+
+	if (i == XFS_DEFER_FREEZER_INODES) {
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	dff->dff_ino[i] = ip->i_ino;
+	return 0;
+}
+
+/* Find an incore inode that has been attached to the freezer. */
+struct xfs_inode *
+xfs_defer_freezer_igrab(
+	struct xfs_defer_freezer	*dff,
+	xfs_ino_t			ino)
+{
+	unsigned int			i;
+
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++)
+		if (dff->dff_ino[i] == ino)
+			return dff->dff_inodes[i];
+
+	return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 90e05f6af53c..e64b577a9b95 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -78,6 +78,16 @@ struct xfs_defer_freezer {
 	/* Deferred ops state saved from the transaction. */
 	struct list_head	dff_dfops;
 	unsigned int		dff_tpflags;
+
+	/*
+	 * Inodes to hold when we want to finish the deferred work items.
+	 * dfops freezer functions should set dff_ino.  xfs_defer_thaw will
+	 * fill out the dff_inodes array, from which the dfops thaw functions
+	 * can pick up the new inode pointers.
+	 */
+#define XFS_DEFER_FREEZER_INODES	2
+	xfs_ino_t		dff_ino[XFS_DEFER_FREEZER_INODES];
+	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
 };
 
 /* Functions to freeze a chain of deferred operations for later. */
@@ -85,5 +95,13 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
 int xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
 void xfs_defer_freeezer_finish(struct xfs_mount *mp,
 		struct xfs_defer_freezer *dff);
+int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
+		struct xfs_inode *ip);
+struct xfs_inode *xfs_defer_freezer_igrab(struct xfs_defer_freezer *dff,
+		xfs_ino_t ino);
+
+/* These functions must be provided by the xfs implementation. */
+void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
+int xfs_defer_freezer_iget(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
 
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 5c22a902d8ca..267351fbea67 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -416,6 +416,45 @@ xfs_bmap_update_cancel_item(
 	kmem_free(bmap);
 }
 
+/* Prepare a deferred bmap item for freezing by detaching the inode. */
+STATIC int
+xfs_bmap_freeze_item(
+	struct xfs_defer_freezer	*freezer,
+	struct list_head		*item)
+{
+	struct xfs_bmap_intent		*bmap;
+	struct xfs_inode		*ip;
+	int				error;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+
+	ip = bmap->bi_owner;
+	error = xfs_defer_freezer_ijoin(freezer, ip);
+	if (error)
+		return error;
+	bmap->bi_owner_ino = ip->i_ino;
+
+	return 0;
+}
+
+/* Thaw a deferred bmap item by reattaching the inode. */
+STATIC int
+xfs_bmap_thaw_item(
+	struct xfs_defer_freezer	*freezer,
+	struct list_head		*item)
+{
+	struct xfs_bmap_intent		*bmap;
+	struct xfs_inode		*ip;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+	ip = xfs_defer_freezer_igrab(freezer, bmap->bi_owner_ino);
+	if (!ip)
+		return -EFSCORRUPTED;
+
+	bmap->bi_owner = ip;
+	return 0;
+}
+
 const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 	.max_items	= XFS_BUI_MAX_FAST_EXTENTS,
 	.diff_items	= xfs_bmap_update_diff_items,
@@ -425,6 +464,8 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 	.create_done	= xfs_bmap_update_create_done,
 	.finish_item	= xfs_bmap_update_finish_item,
 	.cancel_item	= xfs_bmap_update_cancel_item,
+	.freeze_item	= xfs_bmap_freeze_item,
+	.thaw_item	= xfs_bmap_thaw_item,
 };
 
 /*
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8bf1d15be3f6..895dffc80acc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -12,6 +12,7 @@
 #include "xfs_sb.h"
 #include "xfs_mount.h"
 #include "xfs_inode.h"
+#include "xfs_defer.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_inode_item.h"
@@ -1849,3 +1850,51 @@ xfs_start_block_reaping(
 	xfs_queue_eofblocks(mp);
 	xfs_queue_cowblocks(mp);
 }
+
+/* Release all the inode resources attached to this freezer. */
+void
+xfs_defer_freezer_irele(
+	struct xfs_defer_freezer	*dff)
+{
+	unsigned int			i;
+
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
+		if (dff->dff_inodes[i]) {
+			xfs_iunlock(dff->dff_inodes[i], XFS_ILOCK_EXCL);
+			xfs_irele(dff->dff_inodes[i]);
+			dff->dff_inodes[i] = NULL;
+		}
+	}
+}
+
+/* Attach inodes to this freezer. */
+int
+xfs_defer_freezer_iget(
+	struct xfs_defer_freezer	*dff,
+	struct xfs_trans		*tp)
+{
+	unsigned int			i;
+	int				error;
+
+	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
+		if (dff->dff_ino[i] == NULLFSINO)
+			continue;
+		error = xfs_iget(tp->t_mountp, tp, dff->dff_ino[i], 0, 0,
+				&dff->dff_inodes[i]);
+		if (error) {
+			xfs_defer_freezer_irele(dff);
+			return error;
+		}
+	}
+	if (dff->dff_inodes[1]) {
+		xfs_lock_two_inodes(dff->dff_inodes[0], XFS_ILOCK_EXCL,
+				    dff->dff_inodes[1], XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, dff->dff_inodes[0], 0);
+		xfs_trans_ijoin(tp, dff->dff_inodes[1], 0);
+	} else if (dff->dff_inodes[0]) {
+		xfs_ilock(dff->dff_inodes[0], XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, dff->dff_inodes[0], 0);
+	}
+
+	return 0;
+}


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-04-22  2:08 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
@ 2020-04-24 14:02   ` Brian Foster
  2020-04-28 22:28     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2020-04-24 14:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 21, 2020 at 07:08:14PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items.  Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
> 
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items.  After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
> 
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
> 
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent.  Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Just high level thoughts on a first look...

>  fs/xfs/libxfs/xfs_defer.c       |  100 ++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_defer.h       |   25 ++++++++
>  fs/xfs/libxfs/xfs_log_recover.h |    6 ++
>  fs/xfs/xfs_bmap_item.c          |   19 ++----
>  fs/xfs/xfs_extfree_item.c       |    4 +
>  fs/xfs/xfs_log_recover.c        |  122 +++++++++++++++++++++++++--------------
>  fs/xfs/xfs_refcount_item.c      |   20 ++----
>  fs/xfs/xfs_rmap_item.c          |   11 ++--
>  8 files changed, 228 insertions(+), 79 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index 22557527cfdb..33e0f246e181 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -552,3 +552,103 @@ xfs_defer_move(
>  
>  	xfs_defer_reset(stp);
>  }
> +
> +/*
> + * Freeze a chain of deferred ops that are attached to a transaction.  The
> + * entire deferred ops state is transferred to the freezer, and each dfops
> + * item will be prepared for freezing.
> + */
> +int
> +xfs_defer_freeze(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_freezer	**dffp)
> +{
> +	struct xfs_defer_freezer	*dff;
> +	struct xfs_defer_pending	*dfp;
> +	int				error;
> +
> +	*dffp = NULL;
> +	if (list_empty(&tp->t_dfops))
> +		return 0;
> +
> +	dff = kmem_zalloc(sizeof(*dff), KM_NOFS);
> +	if (!dff)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&dff->dff_list);
> +	INIT_LIST_HEAD(&dff->dff_dfops);
> +
> +	/* Freeze all of the dfops items attached to the transaction. */
> +	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
> +		const struct xfs_defer_op_type *ops;
> +		struct list_head	*li;
> +
> +		ops = defer_op_types[dfp->dfp_type];
> +		if (!ops->freeze_item)
> +			continue;
> +
> +		list_for_each(li, &dfp->dfp_work) {
> +			error = ops->freeze_item(dff, li);
> +			if (error)
> +				break;
> +		}
> +		if (error)
> +			break;
> +	}
> +	if (error) {
> +		kmem_free(dff);
> +		return error;
> +	}
> +
> +	/* Move all the dfops items to the freezer. */
> +	list_splice_init(&tp->t_dfops, &dff->dff_dfops);
> +	dff->dff_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> +	xfs_defer_reset(tp);
> +
> +	*dffp = dff;
> +	return 0;
> +}
> +
> +/* Thaw a chain of deferred ops that are attached to a transaction. */
> +int
> +xfs_defer_thaw(
> +	struct xfs_defer_freezer	*dff,
> +	struct xfs_trans		*tp)
> +{

A little confused by the freeze/thaw naming because I associate that
with filesystem freeze. I don't have a better suggestion atm though so
I'll go with it for now..

Also I find switching the parameters around between freeze/thaw to be
annoying, but that's just me. ;P

> +	struct xfs_defer_pending	*dfp;
> +	int				error;
> +
> +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +
> +	/* Thaw each of the items. */
> +	list_for_each_entry(dfp, &dff->dff_dfops, dfp_list) {
> +		const struct xfs_defer_op_type *ops;
> +		struct list_head	*li;
> +
> +		ops = defer_op_types[dfp->dfp_type];
> +		if (!ops->thaw_item)
> +			continue;
> +
> +		list_for_each(li, &dfp->dfp_work) {
> +			error = ops->thaw_item(dff, li);
> +			if (error)
> +				return error;
> +		}

Hmm.. so the purpose of the freeze_item() and thaw_item() callbacks is
not clear to me from this patch because they don't appear to be used
yet. Is this patch still intended to be functional before these
callbacks are implemented? If so, it might make sense to leave these
parts out and emphasize that this patch is laying foundation for this
new aggregation approach (trans per dfops chain) and otherwise plumbing
in this new freezer container thing. Then a subsequent patch can
introduce the freeze/thaw bits and emphasize what problem that actually
fixes..

> +	}
> +
> +	/* Add the dfops items to the transaction. */
> +	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> +	tp->t_flags |= dff->dff_tpflags;
> +
> +	return 0;
> +}
> +
> +/* Release a deferred op freezer and all resources associated with it. */
> +void
> +xfs_defer_freeezer_finish(
> +	struct xfs_mount		*mp,
> +	struct xfs_defer_freezer	*dff)
> +{
> +	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> +	kmem_free(dff);
> +}
...
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 53160172c36b..5c22a902d8ca 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
...
> @@ -499,12 +499,7 @@ xfs_bui_recover(
>  			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
>  	if (error)
>  		return error;
> -	/*
> -	 * Recovery stashes all deferred ops during intent processing and
> -	 * finishes them on completion. Transfer current dfops state to this
> -	 * transaction and transfer the result back before we return.
> -	 */
> -	xfs_defer_move(tp, parent_tp);
> +

Is there any particular reason current code moves the list back and
forth like this, or is that just an implementation detail that allows us
to continue to aggregate dfops in order with our little xfs_defer_move()
helper?

Brian

>  	budp = xfs_trans_get_bud(tp, buip);
>  
>  	/* Grab the inode. */
> @@ -549,15 +544,13 @@ xfs_bui_recover(
>  	}
>  
>  	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> -	xfs_defer_move(parent_tp, tp);
> -	error = xfs_trans_commit(tp);
> +	error = xlog_recover_trans_commit(tp, dffp);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	xfs_irele(ip);
>  
>  	return error;
>  
>  err_inode:
> -	xfs_defer_move(parent_tp, tp);
>  	xfs_trans_cancel(tp);
>  	if (ip) {
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -678,7 +671,7 @@ xlog_recover_bud(
>  STATIC int
>  xlog_recover_process_bui(
>  	struct xlog			*log,
> -	struct xfs_trans		*parent_tp,
> +	struct xfs_defer_freezer	**dffp,
>  	struct xfs_log_item		*lip)
>  {
>  	struct xfs_ail			*ailp = log->l_ailp;
> @@ -692,7 +685,7 @@ xlog_recover_process_bui(
>  		return 0;
>  
>  	spin_unlock(&ailp->ail_lock);
> -	error = xfs_bui_recover(parent_tp, buip);
> +	error = xfs_bui_recover(log->l_mp, dffp, buip);
>  	spin_lock(&ailp->ail_lock);
>  
>  	return error;
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index a15ede29244a..5675062ad436 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -739,7 +739,7 @@ xlog_recover_efd(
>  STATIC int
>  xlog_recover_process_efi(
>  	struct xlog			*log,
> -	struct xfs_trans		*tp,
> +	struct xfs_defer_freezer	**dffp,
>  	struct xfs_log_item		*lip)
>  {
>  	struct xfs_ail			*ailp = log->l_ailp;
> @@ -753,7 +753,7 @@ xlog_recover_process_efi(
>  		return 0;
>  
>  	spin_unlock(&ailp->ail_lock);
> -	error = xfs_efi_recover(tp->t_mountp, efip);
> +	error = xfs_efi_recover(log->l_mp, efip);
>  	spin_lock(&ailp->ail_lock);
>  
>  	return error;
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 5a20a95c5dad..e9b3e901d009 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -1814,6 +1814,26 @@ xlog_recover_insert_ail(
>  	xfs_trans_ail_update(log->l_ailp, lip, lsn);
>  }
>  
> +/*
> + * Freeze any deferred ops and commit the transaction.  This is the last step
> + * needed to finish a log intent item that we recovered from the log.
> + */
> +int
> +xlog_recover_trans_commit(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_freezer	**dffp)
> +{
> +	int				error;
> +
> +	error = xfs_defer_freeze(tp, dffp);
> +	if (error) {
> +		xfs_trans_cancel(tp);
> +		return error;
> +	}
> +
> +	return xfs_trans_commit(tp);
> +}
> +
>  static inline const struct xlog_recover_intent_type *
>  xlog_intent_for_type(
>  	unsigned short			type)
> @@ -2652,35 +2672,59 @@ xlog_recover_process_data(
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> -	struct xfs_trans	*parent_tp)
> +	struct xfs_mount	*mp,
> +	struct list_head	*dfops_freezers)
>  {
> -	struct xfs_mount	*mp = parent_tp->t_mountp;
> +	struct xfs_defer_freezer *dff, *next;
>  	struct xfs_trans	*tp;
>  	int64_t			freeblks;
>  	uint			resblks;
> -	int			error;
> +	int			error = 0;
>  
> -	/*
> -	 * We're finishing the defer_ops that accumulated as a result of
> -	 * recovering unfinished intent items during log recovery.  We
> -	 * reserve an itruncate transaction because it is the largest
> -	 * permanent transaction type.  Since we're the only user of the fs
> -	 * right now, take 93% (15/16) of the available free blocks.  Use
> -	 * weird math to avoid a 64-bit division.
> -	 */
> -	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> -	if (freeblks <= 0)
> -		return -ENOSPC;
> -	resblks = min_t(int64_t, UINT_MAX, freeblks);
> -	resblks = (resblks * 15) >> 4;
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> -			0, XFS_TRANS_RESERVE, &tp);
> -	if (error)
> -		return error;
> -	/* transfer all collected dfops to this transaction */
> -	xfs_defer_move(tp, parent_tp);
> +	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> +		/*
> +		 * We're finishing the defer_ops that accumulated as a result
> +		 * of recovering unfinished intent items during log recovery.
> +		 * We reserve an itruncate transaction because it is the
> +		 * largest permanent transaction type.  Since we're the only
> +		 * user of the fs right now, take 93% (15/16) of the available
> +		 * free blocks.  Use weird math to avoid a 64-bit division.
> +		 */
> +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> +		if (freeblks <= 0) {
> +			error = -ENOSPC;
> +			break;
> +		}
>  
> -	return xfs_trans_commit(tp);
> +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> +		resblks = (resblks * 15) >> 4;
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> +				0, XFS_TRANS_RESERVE, &tp);
> +		if (error)
> +			break;
> +
> +		/* transfer all collected dfops to this transaction */
> +		list_del_init(&dff->dff_list);
> +		error = xfs_defer_thaw(dff, tp);
> +		if (error) {
> +			xfs_trans_cancel(tp);
> +			xfs_defer_freeezer_finish(mp, dff);
> +			break;
> +		}
> +
> +		error = xfs_trans_commit(tp);
> +		xfs_defer_freeezer_finish(mp, dff);
> +		if (error)
> +			break;
> +	}
> +
> +	/* Kill any remaining freezers. */
> +	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> +		list_del_init(&dff->dff_list);
> +		xfs_defer_freeezer_finish(mp, dff);
> +	}
> +
> +	return error;
>  }
>  
>  /*
> @@ -2703,8 +2747,9 @@ STATIC int
>  xlog_recover_process_intents(
>  	struct xlog		*log)
>  {
> -	struct xfs_trans	*parent_tp;
> +	LIST_HEAD(dfops_freezers);
>  	struct xfs_ail_cursor	cur;
> +	struct xfs_defer_freezer *freezer = NULL;
>  	struct xfs_log_item	*lip;
>  	struct xfs_ail		*ailp;
>  	int			error;
> @@ -2712,19 +2757,6 @@ xlog_recover_process_intents(
>  	xfs_lsn_t		last_lsn;
>  #endif
>  
> -	/*
> -	 * The intent recovery handlers commit transactions to complete recovery
> -	 * for individual intents, but any new deferred operations that are
> -	 * queued during that process are held off until the very end. The
> -	 * purpose of this transaction is to serve as a container for deferred
> -	 * operations. Each intent recovery handler must transfer dfops here
> -	 * before its local transaction commits, and we'll finish the entire
> -	 * list below.
> -	 */
> -	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
> -	if (error)
> -		return error;
> -
>  	ailp = log->l_ailp;
>  	spin_lock(&ailp->ail_lock);
>  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> @@ -2756,23 +2788,27 @@ xlog_recover_process_intents(
>  
>  		/*
>  		 * NOTE: If your intent processing routine can create more
> -		 * deferred ops, you /must/ attach them to the transaction in
> +		 * deferred ops, you /must/ attach them to the freezer in
>  		 * this routine or else those subsequent intents will get
>  		 * replayed in the wrong order!
>  		 */
> -		error = type->process_intent(log, parent_tp, lip);
> +		error = type->process_intent(log, &freezer, lip);
>  		if (error)
>  			goto out;
> +		if (freezer) {
> +			list_add_tail(&freezer->dff_list, &dfops_freezers);
> +			freezer = NULL;
> +		}
> +
>  		lip = xfs_trans_ail_cursor_next(ailp, &cur);
>  	}
>  out:
>  	xfs_trans_ail_cursor_done(&cur);
>  	spin_unlock(&ailp->ail_lock);
> -	if (!error)
> -		error = xlog_finish_defer_ops(parent_tp);
> -	xfs_trans_cancel(parent_tp);
> +	if (error)
> +		return error;
>  
> -	return error;
> +	return xlog_finish_defer_ops(log->l_mp, &dfops_freezers);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 01a393727a1e..18b1fbc276fc 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -447,7 +447,8 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
>   */
>  STATIC int
>  xfs_cui_recover(
> -	struct xfs_trans		*parent_tp,
> +	struct xfs_mount		*mp,
> +	struct xfs_defer_freezer	**dffp,
>  	struct xfs_cui_log_item		*cuip)
>  {
>  	int				i;
> @@ -464,7 +465,6 @@ xfs_cui_recover(
>  	xfs_extlen_t			new_len;
>  	struct xfs_bmbt_irec		irec;
>  	bool				requeue_only = false;
> -	struct xfs_mount		*mp = parent_tp->t_mountp;
>  
>  	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
>  
> @@ -519,12 +519,7 @@ xfs_cui_recover(
>  			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
>  	if (error)
>  		return error;
> -	/*
> -	 * Recovery stashes all deferred ops during intent processing and
> -	 * finishes them on completion. Transfer current dfops state to this
> -	 * transaction and transfer the result back before we return.
> -	 */
> -	xfs_defer_move(tp, parent_tp);
> +
>  	cudp = xfs_trans_get_cud(tp, cuip);
>  
>  	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> @@ -582,13 +577,10 @@ xfs_cui_recover(
>  
>  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
>  	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
> -	xfs_defer_move(parent_tp, tp);
> -	error = xfs_trans_commit(tp);
> -	return error;
> +	return xlog_recover_trans_commit(tp, dffp);
>  
>  abort_error:
>  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> -	xfs_defer_move(parent_tp, tp);
>  	xfs_trans_cancel(tp);
>  	return error;
>  }
> @@ -701,7 +693,7 @@ xlog_recover_cud(
>  STATIC int
>  xlog_recover_process_cui(
>  	struct xlog			*log,
> -	struct xfs_trans		*parent_tp,
> +	struct xfs_defer_freezer	**dffp,
>  	struct xfs_log_item		*lip)
>  {
>  	struct xfs_ail			*ailp = log->l_ailp;
> @@ -715,7 +707,7 @@ xlog_recover_process_cui(
>  		return 0;
>  
>  	spin_unlock(&ailp->ail_lock);
> -	error = xfs_cui_recover(parent_tp, cuip);
> +	error = xfs_cui_recover(log->l_mp, dffp, cuip);
>  	spin_lock(&ailp->ail_lock);
>  
>  	return error;
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index 69a2d23eedda..7291fac7c6d1 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -491,7 +491,8 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
>   */
>  STATIC int
>  xfs_rui_recover(
> -	struct xfs_trans		*parent_tp,
> +	struct xfs_mount		*mp,
> +	struct xfs_defer_freezer	**dffp,
>  	struct xfs_rui_log_item		*ruip)
>  {
>  	int				i;
> @@ -505,7 +506,6 @@ xfs_rui_recover(
>  	xfs_exntst_t			state;
>  	struct xfs_trans		*tp;
>  	struct xfs_btree_cur		*rcur = NULL;
> -	struct xfs_mount		*mp = parent_tp->t_mountp;
>  
>  	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
>  
> @@ -601,8 +601,7 @@ xfs_rui_recover(
>  
>  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
>  	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> -	error = xfs_trans_commit(tp);
> -	return error;
> +	return xlog_recover_trans_commit(tp, dffp);
>  
>  abort_error:
>  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> @@ -691,7 +690,7 @@ xlog_recover_rud(
>  STATIC int
>  xlog_recover_process_rui(
>  	struct xlog			*log,
> -	struct xfs_trans		*parent_tp,
> +	struct xfs_defer_freezer	**dffp,
>  	struct xfs_log_item		*lip)
>  {
>  	struct xfs_ail			*ailp = log->l_ailp;
> @@ -705,7 +704,7 @@ xlog_recover_process_rui(
>  		return 0;
>  
>  	spin_unlock(&ailp->ail_lock);
> -	error = xfs_rui_recover(parent_tp, ruip);
> +	error = xfs_rui_recover(log->l_mp, dffp, ruip);
>  	spin_lock(&ailp->ail_lock);
>  
>  	return error;
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-22  2:08 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
@ 2020-04-24 14:04   ` Brian Foster
  2020-04-28 22:22     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2020-04-24 14:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 21, 2020 at 07:08:20PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> On filesystems that support them, bmap intent log items can be used to
> change mappings in inode data or attr forks.  However, if the bmbt must
> expand, the enormous block reservations that we make for finishing
> chains of deferred log items become a liability because the bmbt block
> allocator sets minleft to the transaction reservation and there probably
> aren't any AGs in the filesystem that have that much free space.
> 
> Whereas previously we would reserve 93% of the free blocks in the
> filesystem, now we only want to reserve 7/8ths of the free space in the
> least full AG, and no more than half of the usable blocks in an AG.  In
> theory we shouldn't run out of space because (prior to the unclean
> shutdown) all of the in-progress transactions successfully reserved the
> worst case number of disk blocks.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 43 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index e9b3e901d009..a416b028b320 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2669,6 +2669,44 @@ xlog_recover_process_data(
>  	return 0;
>  }
>  
> +/*
> + * Estimate a block reservation for a log recovery transaction.  Since we run
> + * separate transactions for each chain of deferred ops that get created as a
> + * result of recovering unfinished log intent items, we must be careful not to
> + * reserve so many blocks that block allocations fail because we can't satisfy
> + * the minleft requirements (e.g. for bmbt blocks).
> + */
> +static int
> +xlog_estimate_recovery_resblks(
> +	struct xfs_mount	*mp,
> +	unsigned int		*resblks)
> +{
> +	struct xfs_perag	*pag;
> +	xfs_agnumber_t		agno;
> +	unsigned int		free = 0;
> +	int			error;
> +
> +	/* Don't use more than 7/8th of the free space in the least full AG. */
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		unsigned int	ag_free;
> +
> +		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
> +		if (error)
> +			return error;
> +		pag = xfs_perag_get(mp, agno);
> +		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
> +		free = max(free, (ag_free * 7) / 8);
> +		xfs_perag_put(pag);
> +	}
> +

Somewhat unfortunate that we have to iterate all AGs for each chain. I'm
wondering if that has any effect on a large recovery on fs' with an
inordinate AG count. Have you tested under those particular conditions?
I suppose it's possible the recovery is slow enough that this won't
matter...

Also, perhaps not caused by this patch but does this
outsized/manufactured reservation have the effect of artificially
steering allocations to a particular AG if one happens to be notably
larger than the rest?

Brian

> +	/* Don't try to reserve more than half the usable AG blocks. */
> +	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
> +	if (*resblks == 0)
> +		return -ENOSPC;
> +
> +	return 0;
> +}
> +
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> @@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
>  {
>  	struct xfs_defer_freezer *dff, *next;
>  	struct xfs_trans	*tp;
> -	int64_t			freeblks;
>  	uint			resblks;
>  	int			error = 0;
>  
>  	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> +		error = xlog_estimate_recovery_resblks(mp, &resblks);
> +		if (error)
> +			break;
> +
>  		/*
>  		 * We're finishing the defer_ops that accumulated as a result
>  		 * of recovering unfinished intent items during log recovery.
>  		 * We reserve an itruncate transaction because it is the
> -		 * largest permanent transaction type.  Since we're the only
> -		 * user of the fs right now, take 93% (15/16) of the available
> -		 * free blocks.  Use weird math to avoid a 64-bit division.
> +		 * largest permanent transaction type.
>  		 */
> -		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> -		if (freeblks <= 0) {
> -			error = -ENOSPC;
> -			break;
> -		}
> -
> -		resblks = min_t(int64_t, UINT_MAX, freeblks);
> -		resblks = (resblks * 15) >> 4;
>  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
>  				0, XFS_TRANS_RESERVE, &tp);
>  		if (error)
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-22  2:08 ` [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes Darrick J. Wong
@ 2020-04-25 19:01   ` Christoph Hellwig
  2020-04-27 11:37     ` Brian Foster
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-04-25 19:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 21, 2020 at 07:08:26PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Make it so that the deferred operations freezer can save inode numbers
> when we freeze the dfops chain, and turn them into pointers to incore
> inodes when we thaw the dfops chain to finish them.  Next, add dfops
> item freeze and thaw functions to the BUI/BUD items so that they can
> take advantage of this new feature.  This fixes a UAF bug in the
> deferred bunmapi code because xfs_bui_recover can schedule another BUI
> to continue unmapping but drops the inode pointer immediately
> afterwards.

I'm only looking over this the first time, but why can't we just keep
inode reference around during reocvery instead of this fairly
complicated scheme to save the ino and then look it up again?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-25 19:01   ` Christoph Hellwig
@ 2020-04-27 11:37     ` Brian Foster
  2020-04-28 22:17       ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2020-04-27 11:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Sat, Apr 25, 2020 at 12:01:37PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 21, 2020 at 07:08:26PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Make it so that the deferred operations freezer can save inode numbers
> > when we freeze the dfops chain, and turn them into pointers to incore
> > inodes when we thaw the dfops chain to finish them.  Next, add dfops
> > item freeze and thaw functions to the BUI/BUD items so that they can
> > take advantage of this new feature.  This fixes a UAF bug in the
> > deferred bunmapi code because xfs_bui_recover can schedule another BUI
> > to continue unmapping but drops the inode pointer immediately
> > afterwards.
> 
> I'm only looking over this the first time, but why can't we just keep
> inode reference around during reocvery instead of this fairly
> complicated scheme to save the ino and then look it up again?
> 

I'm also a little confused about the use after free in the first place.
Doesn't xfs_bui_recover() look up the inode itself, or is the issue that
xfs_bui_recover() is fine but we might get into
xfs_bmap_update_finish_item() sometime later on the same inode without
any reference? If the latter, similarly to Christoph I wonder if we
really could/should grab a reference on the inode for the intent itself,
even though that might not be necessary outside of recovery.

Either way, more details about the problem being fixed in the commit log
would be helpful.

Brian


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-27 11:37     ` Brian Foster
@ 2020-04-28 22:17       ` Darrick J. Wong
  2020-04-29 11:38         ` Brian Foster
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-28 22:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Mon, Apr 27, 2020 at 07:37:52AM -0400, Brian Foster wrote:
> On Sat, Apr 25, 2020 at 12:01:37PM -0700, Christoph Hellwig wrote:
> > On Tue, Apr 21, 2020 at 07:08:26PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Make it so that the deferred operations freezer can save inode numbers
> > > when we freeze the dfops chain, and turn them into pointers to incore
> > > inodes when we thaw the dfops chain to finish them.  Next, add dfops
> > > item freeze and thaw functions to the BUI/BUD items so that they can
> > > take advantage of this new feature.  This fixes a UAF bug in the
> > > deferred bunmapi code because xfs_bui_recover can schedule another BUI
> > > to continue unmapping but drops the inode pointer immediately
> > > afterwards.
> > 
> > I'm only looking over this the first time, but why can't we just keep
> > inode reference around during reocvery instead of this fairly
> > complicated scheme to save the ino and then look it up again?
> > 
> 
> I'm also a little confused about the use after free in the first place.
> Doesn't xfs_bui_recover() look up the inode itself, or is the issue that
> xfs_bui_recover() is fine but we might get into
> xfs_bmap_update_finish_item() sometime later on the same inode without
> any reference?

The second.  In practice it doesn't seem to trigger on the existing
code, but the combination of atomic extent swap + fsstress + shutdown
testing was enough to push it over the edge once due to reclaim.

> If the latter, similarly to Christoph I wonder if we
> really could/should grab a reference on the inode for the intent itself,
> even though that might not be necessary outside of recovery.

Outside of recovery we don't have the UAF problem because there's always
something (usually the VFS dentry cache, but sometimes an explicit iget)
that hold a reference to the inode for the duration of the transaction
and dfops processing.

One could just hang on to all incore inodes until the end of recovery
like Christoph says, but the downside of doing it that way is that now
we require enough memory to maintain all that incore state vs. only
needing enough for the incore inodes involved in a particular dfops
chain.  That isn't a huge deal now, but I was looking ahead to atomic
extent swaps.

(And, yeah, I should put that series on the list now...)

> Either way, more details about the problem being fixed in the commit log
> would be helpful.

<nod>

--D

> Brian
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-24 14:04   ` Brian Foster
@ 2020-04-28 22:22     ` Darrick J. Wong
  2020-05-27 22:39       ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-28 22:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Apr 24, 2020 at 10:04:08AM -0400, Brian Foster wrote:
> On Tue, Apr 21, 2020 at 07:08:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > On filesystems that support them, bmap intent log items can be used to
> > change mappings in inode data or attr forks.  However, if the bmbt must
> > expand, the enormous block reservations that we make for finishing
> > chains of deferred log items become a liability because the bmbt block
> > allocator sets minleft to the transaction reservation and there probably
> > aren't any AGs in the filesystem that have that much free space.
> > 
> > Whereas previously we would reserve 93% of the free blocks in the
> > filesystem, now we only want to reserve 7/8ths of the free space in the
> > least full AG, and no more than half of the usable blocks in an AG.  In
> > theory we shouldn't run out of space because (prior to the unclean
> > shutdown) all of the in-progress transactions successfully reserved the
> > worst case number of disk blocks.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 43 insertions(+), 12 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index e9b3e901d009..a416b028b320 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2669,6 +2669,44 @@ xlog_recover_process_data(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Estimate a block reservation for a log recovery transaction.  Since we run
> > + * separate transactions for each chain of deferred ops that get created as a
> > + * result of recovering unfinished log intent items, we must be careful not to
> > + * reserve so many blocks that block allocations fail because we can't satisfy
> > + * the minleft requirements (e.g. for bmbt blocks).
> > + */
> > +static int
> > +xlog_estimate_recovery_resblks(
> > +	struct xfs_mount	*mp,
> > +	unsigned int		*resblks)
> > +{
> > +	struct xfs_perag	*pag;
> > +	xfs_agnumber_t		agno;
> > +	unsigned int		free = 0;
> > +	int			error;
> > +
> > +	/* Don't use more than 7/8th of the free space in the least full AG. */
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		unsigned int	ag_free;
> > +
> > +		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
> > +		if (error)
> > +			return error;
> > +		pag = xfs_perag_get(mp, agno);
> > +		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
> > +		free = max(free, (ag_free * 7) / 8);
> > +		xfs_perag_put(pag);
> > +	}
> > +
> 
> Somewhat unfortunate that we have to iterate all AGs for each chain. I'm
> wondering if that has any effect on a large recovery on fs' with an
> inordinate AG count. Have you tested under those particular conditions?
> I suppose it's possible the recovery is slow enough that this won't
> matter...

I admit I haven't actually looked at that.  A more precise way to handle
this would be for each intent recovery function to store its own worst
case resblks estimation in the recovery freezer so that we'd be using
roughly the same space requirements as the pre-crash transaction, but
that's also a lot more complicated than this kludge.

> Also, perhaps not caused by this patch but does this
> outsized/manufactured reservation have the effect of artificially
> steering allocations to a particular AG if one happens to be notably
> larger than the rest?

It tends to steer allocations towards whichever AGs were less full at
the start of each transaction.  Were we to shift to scanning the AGs
once for the entire recovery cycle, I think I'd probably pick a smaller
ratio.

--D

> Brian
> 
> > +	/* Don't try to reserve more than half the usable AG blocks. */
> > +	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
> > +	if (*resblks == 0)
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +}
> > +
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > @@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
> >  {
> >  	struct xfs_defer_freezer *dff, *next;
> >  	struct xfs_trans	*tp;
> > -	int64_t			freeblks;
> >  	uint			resblks;
> >  	int			error = 0;
> >  
> >  	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> > +		error = xlog_estimate_recovery_resblks(mp, &resblks);
> > +		if (error)
> > +			break;
> > +
> >  		/*
> >  		 * We're finishing the defer_ops that accumulated as a result
> >  		 * of recovering unfinished intent items during log recovery.
> >  		 * We reserve an itruncate transaction because it is the
> > -		 * largest permanent transaction type.  Since we're the only
> > -		 * user of the fs right now, take 93% (15/16) of the available
> > -		 * free blocks.  Use weird math to avoid a 64-bit division.
> > +		 * largest permanent transaction type.
> >  		 */
> > -		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > -		if (freeblks <= 0) {
> > -			error = -ENOSPC;
> > -			break;
> > -		}
> > -
> > -		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > -		resblks = (resblks * 15) >> 4;
> >  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> >  				0, XFS_TRANS_RESERVE, &tp);
> >  		if (error)
> > 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-04-24 14:02   ` Brian Foster
@ 2020-04-28 22:28     ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-28 22:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Apr 24, 2020 at 10:02:22AM -0400, Brian Foster wrote:
> On Tue, Apr 21, 2020 at 07:08:14PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we replay unfinished intent items that have been recovered from the
> > log, it's possible that the replay will cause the creation of more
> > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > recovery should replay deferred ops in order"), later work items have an
> > implicit ordering dependency on earlier work items.  Therefore, recovery
> > must replay the items (both recovered and created) in the same order
> > that they would have been during normal operation.
> > 
> > For log recovery, we enforce this ordering by using an empty transaction
> > to collect deferred ops that get created in the process of recovering a
> > log intent item to prevent them from being committed before the rest of
> > the recovered intent items.  After we finish committing all the
> > recovered log items, we allocate a transaction with an enormous block
> > reservation, splice our huge list of created deferred ops into that
> > transaction, and commit it, thereby finishing all those ops.
> > 
> > This is /really/ hokey -- it's the one place in XFS where we allow
> > nested transactions; the splicing of the defer ops list is is inelegant
> > and has to be done twice per recovery function; and the broken way we
> > handle inode pointers and block reservations cause subtle use-after-free
> > and allocator problems that will be fixed by this patch and the two
> > patches after it.
> > 
> > Therefore, replace the hokey empty transaction with a structure designed
> > to capture each chain of deferred ops that are created as part of
> > recovering a single unfinished log intent.  Finally, refactor the loop
> > that replays those chains to do so using one transaction per chain.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Just high level thoughts on a first look...
> 
> >  fs/xfs/libxfs/xfs_defer.c       |  100 ++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_defer.h       |   25 ++++++++
> >  fs/xfs/libxfs/xfs_log_recover.h |    6 ++
> >  fs/xfs/xfs_bmap_item.c          |   19 ++----
> >  fs/xfs/xfs_extfree_item.c       |    4 +
> >  fs/xfs/xfs_log_recover.c        |  122 +++++++++++++++++++++++++--------------
> >  fs/xfs/xfs_refcount_item.c      |   20 ++----
> >  fs/xfs/xfs_rmap_item.c          |   11 ++--
> >  8 files changed, 228 insertions(+), 79 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 22557527cfdb..33e0f246e181 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -552,3 +552,103 @@ xfs_defer_move(
> >  
> >  	xfs_defer_reset(stp);
> >  }
> > +
> > +/*
> > + * Freeze a chain of deferred ops that are attached to a transaction.  The
> > + * entire deferred ops state is transferred to the freezer, and each dfops
> > + * item will be prepared for freezing.
> > + */
> > +int
> > +xfs_defer_freeze(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_freezer	**dffp)
> > +{
> > +	struct xfs_defer_freezer	*dff;
> > +	struct xfs_defer_pending	*dfp;
> > +	int				error;
> > +
> > +	*dffp = NULL;
> > +	if (list_empty(&tp->t_dfops))
> > +		return 0;
> > +
> > +	dff = kmem_zalloc(sizeof(*dff), KM_NOFS);
> > +	if (!dff)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&dff->dff_list);
> > +	INIT_LIST_HEAD(&dff->dff_dfops);
> > +
> > +	/* Freeze all of the dfops items attached to the transaction. */
> > +	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
> > +		const struct xfs_defer_op_type *ops;
> > +		struct list_head	*li;
> > +
> > +		ops = defer_op_types[dfp->dfp_type];
> > +		if (!ops->freeze_item)
> > +			continue;
> > +
> > +		list_for_each(li, &dfp->dfp_work) {
> > +			error = ops->freeze_item(dff, li);
> > +			if (error)
> > +				break;
> > +		}
> > +		if (error)
> > +			break;
> > +	}
> > +	if (error) {
> > +		kmem_free(dff);
> > +		return error;
> > +	}
> > +
> > +	/* Move all the dfops items to the freezer. */
> > +	list_splice_init(&tp->t_dfops, &dff->dff_dfops);
> > +	dff->dff_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> > +	xfs_defer_reset(tp);
> > +
> > +	*dffp = dff;
> > +	return 0;
> > +}
> > +
> > +/* Thaw a chain of deferred ops that are attached to a transaction. */
> > +int
> > +xfs_defer_thaw(
> > +	struct xfs_defer_freezer	*dff,
> > +	struct xfs_trans		*tp)
> > +{
> 
> A little confused by the freeze/thaw naming because I associate that
> with filesystem freeze. I don't have a better suggestion atm though so
> I'll go with it for now..

I don't have a better name for them either.

> Also I find switching the parameters around between freeze/thaw to be
> annoying, but that's just me. ;P
> 
> > +	struct xfs_defer_pending	*dfp;
> > +	int				error;
> > +
> > +	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +
> > +	/* Thaw each of the items. */
> > +	list_for_each_entry(dfp, &dff->dff_dfops, dfp_list) {
> > +		const struct xfs_defer_op_type *ops;
> > +		struct list_head	*li;
> > +
> > +		ops = defer_op_types[dfp->dfp_type];
> > +		if (!ops->thaw_item)
> > +			continue;
> > +
> > +		list_for_each(li, &dfp->dfp_work) {
> > +			error = ops->thaw_item(dff, li);
> > +			if (error)
> > +				return error;
> > +		}
> 
> Hmm.. so the purpose of the freeze_item() and thaw_item() callbacks is
> not clear to me from this patch because they don't appear to be used
> yet. Is this patch still intended to be functional before these
> callbacks are implemented? If so, it might make sense to leave these
> parts out and emphasize that this patch is laying foundation for this
> new aggregation approach (trans per dfops chain) and otherwise plumbing
> in this new freezer container thing. Then a subsequent patch can
> introduce the freeze/thaw bits and emphasize what problem that actually
> fixes..

Ehh, the thaw/freeze part could be moved to the next patch so that it
would be clearer that the 'freeze' functions can stash a breadcrumb
pointing to incore state to reduce memory requirements, and that the
'thaw' functions turn those crumbs back into incore structures.

> 
> > +	}
> > +
> > +	/* Add the dfops items to the transaction. */
> > +	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > +	tp->t_flags |= dff->dff_tpflags;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Release a deferred op freezer and all resources associated with it. */
> > +void
> > +xfs_defer_freeezer_finish(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_defer_freezer	*dff)
> > +{
> > +	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > +	kmem_free(dff);
> > +}
> ...
> > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > index 53160172c36b..5c22a902d8ca 100644
> > --- a/fs/xfs/xfs_bmap_item.c
> > +++ b/fs/xfs/xfs_bmap_item.c
> ...
> > @@ -499,12 +499,7 @@ xfs_bui_recover(
> >  			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
> >  	if (error)
> >  		return error;
> > -	/*
> > -	 * Recovery stashes all deferred ops during intent processing and
> > -	 * finishes them on completion. Transfer current dfops state to this
> > -	 * transaction and transfer the result back before we return.
> > -	 */
> > -	xfs_defer_move(tp, parent_tp);
> > +
> 
> Is there any particular reason current code moves the list back and
> forth like this, or is that just an implementation detail that allows us
> to continue to aggregate dfops in order with our little xfs_defer_move()
> helper?

I think I wrote the original code to do that because I thought that we
could occasionally merge adjacent dfops work items from different
recovery chains.  AFAICT that rarely happens, and worse yet it doesn't
remain true to the sequencing that would have happened if the fs hadn't
gone down.

--D

> Brian
> 
> >  	budp = xfs_trans_get_bud(tp, buip);
> >  
> >  	/* Grab the inode. */
> > @@ -549,15 +544,13 @@ xfs_bui_recover(
> >  	}
> >  
> >  	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> > -	xfs_defer_move(parent_tp, tp);
> > -	error = xfs_trans_commit(tp);
> > +	error = xlog_recover_trans_commit(tp, dffp);
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	xfs_irele(ip);
> >  
> >  	return error;
> >  
> >  err_inode:
> > -	xfs_defer_move(parent_tp, tp);
> >  	xfs_trans_cancel(tp);
> >  	if (ip) {
> >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > @@ -678,7 +671,7 @@ xlog_recover_bud(
> >  STATIC int
> >  xlog_recover_process_bui(
> >  	struct xlog			*log,
> > -	struct xfs_trans		*parent_tp,
> > +	struct xfs_defer_freezer	**dffp,
> >  	struct xfs_log_item		*lip)
> >  {
> >  	struct xfs_ail			*ailp = log->l_ailp;
> > @@ -692,7 +685,7 @@ xlog_recover_process_bui(
> >  		return 0;
> >  
> >  	spin_unlock(&ailp->ail_lock);
> > -	error = xfs_bui_recover(parent_tp, buip);
> > +	error = xfs_bui_recover(log->l_mp, dffp, buip);
> >  	spin_lock(&ailp->ail_lock);
> >  
> >  	return error;
> > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> > index a15ede29244a..5675062ad436 100644
> > --- a/fs/xfs/xfs_extfree_item.c
> > +++ b/fs/xfs/xfs_extfree_item.c
> > @@ -739,7 +739,7 @@ xlog_recover_efd(
> >  STATIC int
> >  xlog_recover_process_efi(
> >  	struct xlog			*log,
> > -	struct xfs_trans		*tp,
> > +	struct xfs_defer_freezer	**dffp,
> >  	struct xfs_log_item		*lip)
> >  {
> >  	struct xfs_ail			*ailp = log->l_ailp;
> > @@ -753,7 +753,7 @@ xlog_recover_process_efi(
> >  		return 0;
> >  
> >  	spin_unlock(&ailp->ail_lock);
> > -	error = xfs_efi_recover(tp->t_mountp, efip);
> > +	error = xfs_efi_recover(log->l_mp, efip);
> >  	spin_lock(&ailp->ail_lock);
> >  
> >  	return error;
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 5a20a95c5dad..e9b3e901d009 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -1814,6 +1814,26 @@ xlog_recover_insert_ail(
> >  	xfs_trans_ail_update(log->l_ailp, lip, lsn);
> >  }
> >  
> > +/*
> > + * Freeze any deferred ops and commit the transaction.  This is the last step
> > + * needed to finish a log intent item that we recovered from the log.
> > + */
> > +int
> > +xlog_recover_trans_commit(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_freezer	**dffp)
> > +{
> > +	int				error;
> > +
> > +	error = xfs_defer_freeze(tp, dffp);
> > +	if (error) {
> > +		xfs_trans_cancel(tp);
> > +		return error;
> > +	}
> > +
> > +	return xfs_trans_commit(tp);
> > +}
> > +
> >  static inline const struct xlog_recover_intent_type *
> >  xlog_intent_for_type(
> >  	unsigned short			type)
> > @@ -2652,35 +2672,59 @@ xlog_recover_process_data(
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > -	struct xfs_trans	*parent_tp)
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*dfops_freezers)
> >  {
> > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > +	struct xfs_defer_freezer *dff, *next;
> >  	struct xfs_trans	*tp;
> >  	int64_t			freeblks;
> >  	uint			resblks;
> > -	int			error;
> > +	int			error = 0;
> >  
> > -	/*
> > -	 * We're finishing the defer_ops that accumulated as a result of
> > -	 * recovering unfinished intent items during log recovery.  We
> > -	 * reserve an itruncate transaction because it is the largest
> > -	 * permanent transaction type.  Since we're the only user of the fs
> > -	 * right now, take 93% (15/16) of the available free blocks.  Use
> > -	 * weird math to avoid a 64-bit division.
> > -	 */
> > -	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > -	if (freeblks <= 0)
> > -		return -ENOSPC;
> > -	resblks = min_t(int64_t, UINT_MAX, freeblks);
> > -	resblks = (resblks * 15) >> 4;
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > -			0, XFS_TRANS_RESERVE, &tp);
> > -	if (error)
> > -		return error;
> > -	/* transfer all collected dfops to this transaction */
> > -	xfs_defer_move(tp, parent_tp);
> > +	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> > +		/*
> > +		 * We're finishing the defer_ops that accumulated as a result
> > +		 * of recovering unfinished intent items during log recovery.
> > +		 * We reserve an itruncate transaction because it is the
> > +		 * largest permanent transaction type.  Since we're the only
> > +		 * user of the fs right now, take 93% (15/16) of the available
> > +		 * free blocks.  Use weird math to avoid a 64-bit division.
> > +		 */
> > +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > +		if (freeblks <= 0) {
> > +			error = -ENOSPC;
> > +			break;
> > +		}
> >  
> > -	return xfs_trans_commit(tp);
> > +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > +		resblks = (resblks * 15) >> 4;
> > +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > +				0, XFS_TRANS_RESERVE, &tp);
> > +		if (error)
> > +			break;
> > +
> > +		/* transfer all collected dfops to this transaction */
> > +		list_del_init(&dff->dff_list);
> > +		error = xfs_defer_thaw(dff, tp);
> > +		if (error) {
> > +			xfs_trans_cancel(tp);
> > +			xfs_defer_freeezer_finish(mp, dff);
> > +			break;
> > +		}
> > +
> > +		error = xfs_trans_commit(tp);
> > +		xfs_defer_freeezer_finish(mp, dff);
> > +		if (error)
> > +			break;
> > +	}
> > +
> > +	/* Kill any remaining freezers. */
> > +	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> > +		list_del_init(&dff->dff_list);
> > +		xfs_defer_freeezer_finish(mp, dff);
> > +	}
> > +
> > +	return error;
> >  }
> >  
> >  /*
> > @@ -2703,8 +2747,9 @@ STATIC int
> >  xlog_recover_process_intents(
> >  	struct xlog		*log)
> >  {
> > -	struct xfs_trans	*parent_tp;
> > +	LIST_HEAD(dfops_freezers);
> >  	struct xfs_ail_cursor	cur;
> > +	struct xfs_defer_freezer *freezer = NULL;
> >  	struct xfs_log_item	*lip;
> >  	struct xfs_ail		*ailp;
> >  	int			error;
> > @@ -2712,19 +2757,6 @@ xlog_recover_process_intents(
> >  	xfs_lsn_t		last_lsn;
> >  #endif
> >  
> > -	/*
> > -	 * The intent recovery handlers commit transactions to complete recovery
> > -	 * for individual intents, but any new deferred operations that are
> > -	 * queued during that process are held off until the very end. The
> > -	 * purpose of this transaction is to serve as a container for deferred
> > -	 * operations. Each intent recovery handler must transfer dfops here
> > -	 * before its local transaction commits, and we'll finish the entire
> > -	 * list below.
> > -	 */
> > -	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
> > -	if (error)
> > -		return error;
> > -
> >  	ailp = log->l_ailp;
> >  	spin_lock(&ailp->ail_lock);
> >  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > @@ -2756,23 +2788,27 @@ xlog_recover_process_intents(
> >  
> >  		/*
> >  		 * NOTE: If your intent processing routine can create more
> > -		 * deferred ops, you /must/ attach them to the transaction in
> > +		 * deferred ops, you /must/ attach them to the freezer in
> >  		 * this routine or else those subsequent intents will get
> >  		 * replayed in the wrong order!
> >  		 */
> > -		error = type->process_intent(log, parent_tp, lip);
> > +		error = type->process_intent(log, &freezer, lip);
> >  		if (error)
> >  			goto out;
> > +		if (freezer) {
> > +			list_add_tail(&freezer->dff_list, &dfops_freezers);
> > +			freezer = NULL;
> > +		}
> > +
> >  		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> >  	}
> >  out:
> >  	xfs_trans_ail_cursor_done(&cur);
> >  	spin_unlock(&ailp->ail_lock);
> > -	if (!error)
> > -		error = xlog_finish_defer_ops(parent_tp);
> > -	xfs_trans_cancel(parent_tp);
> > +	if (error)
> > +		return error;
> >  
> > -	return error;
> > +	return xlog_finish_defer_ops(log->l_mp, &dfops_freezers);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> > index 01a393727a1e..18b1fbc276fc 100644
> > --- a/fs/xfs/xfs_refcount_item.c
> > +++ b/fs/xfs/xfs_refcount_item.c
> > @@ -447,7 +447,8 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
> >   */
> >  STATIC int
> >  xfs_cui_recover(
> > -	struct xfs_trans		*parent_tp,
> > +	struct xfs_mount		*mp,
> > +	struct xfs_defer_freezer	**dffp,
> >  	struct xfs_cui_log_item		*cuip)
> >  {
> >  	int				i;
> > @@ -464,7 +465,6 @@ xfs_cui_recover(
> >  	xfs_extlen_t			new_len;
> >  	struct xfs_bmbt_irec		irec;
> >  	bool				requeue_only = false;
> > -	struct xfs_mount		*mp = parent_tp->t_mountp;
> >  
> >  	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
> >  
> > @@ -519,12 +519,7 @@ xfs_cui_recover(
> >  			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
> >  	if (error)
> >  		return error;
> > -	/*
> > -	 * Recovery stashes all deferred ops during intent processing and
> > -	 * finishes them on completion. Transfer current dfops state to this
> > -	 * transaction and transfer the result back before we return.
> > -	 */
> > -	xfs_defer_move(tp, parent_tp);
> > +
> >  	cudp = xfs_trans_get_cud(tp, cuip);
> >  
> >  	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> > @@ -582,13 +577,10 @@ xfs_cui_recover(
> >  
> >  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> >  	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
> > -	xfs_defer_move(parent_tp, tp);
> > -	error = xfs_trans_commit(tp);
> > -	return error;
> > +	return xlog_recover_trans_commit(tp, dffp);
> >  
> >  abort_error:
> >  	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > -	xfs_defer_move(parent_tp, tp);
> >  	xfs_trans_cancel(tp);
> >  	return error;
> >  }
> > @@ -701,7 +693,7 @@ xlog_recover_cud(
> >  STATIC int
> >  xlog_recover_process_cui(
> >  	struct xlog			*log,
> > -	struct xfs_trans		*parent_tp,
> > +	struct xfs_defer_freezer	**dffp,
> >  	struct xfs_log_item		*lip)
> >  {
> >  	struct xfs_ail			*ailp = log->l_ailp;
> > @@ -715,7 +707,7 @@ xlog_recover_process_cui(
> >  		return 0;
> >  
> >  	spin_unlock(&ailp->ail_lock);
> > -	error = xfs_cui_recover(parent_tp, cuip);
> > +	error = xfs_cui_recover(log->l_mp, dffp, cuip);
> >  	spin_lock(&ailp->ail_lock);
> >  
> >  	return error;
> > diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> > index 69a2d23eedda..7291fac7c6d1 100644
> > --- a/fs/xfs/xfs_rmap_item.c
> > +++ b/fs/xfs/xfs_rmap_item.c
> > @@ -491,7 +491,8 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
> >   */
> >  STATIC int
> >  xfs_rui_recover(
> > -	struct xfs_trans		*parent_tp,
> > +	struct xfs_mount		*mp,
> > +	struct xfs_defer_freezer	**dffp,
> >  	struct xfs_rui_log_item		*ruip)
> >  {
> >  	int				i;
> > @@ -505,7 +506,6 @@ xfs_rui_recover(
> >  	xfs_exntst_t			state;
> >  	struct xfs_trans		*tp;
> >  	struct xfs_btree_cur		*rcur = NULL;
> > -	struct xfs_mount		*mp = parent_tp->t_mountp;
> >  
> >  	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
> >  
> > @@ -601,8 +601,7 @@ xfs_rui_recover(
> >  
> >  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> >  	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> > -	error = xfs_trans_commit(tp);
> > -	return error;
> > +	return xlog_recover_trans_commit(tp, dffp);
> >  
> >  abort_error:
> >  	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > @@ -691,7 +690,7 @@ xlog_recover_rud(
> >  STATIC int
> >  xlog_recover_process_rui(
> >  	struct xlog			*log,
> > -	struct xfs_trans		*parent_tp,
> > +	struct xfs_defer_freezer	**dffp,
> >  	struct xfs_log_item		*lip)
> >  {
> >  	struct xfs_ail			*ailp = log->l_ailp;
> > @@ -705,7 +704,7 @@ xlog_recover_process_rui(
> >  		return 0;
> >  
> >  	spin_unlock(&ailp->ail_lock);
> > -	error = xfs_rui_recover(parent_tp, ruip);
> > +	error = xfs_rui_recover(log->l_mp, dffp, ruip);
> >  	spin_lock(&ailp->ail_lock);
> >  
> >  	return error;
> > 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-28 22:17       ` Darrick J. Wong
@ 2020-04-29 11:38         ` Brian Foster
  2020-04-29 11:48           ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Foster @ 2020-04-29 11:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Tue, Apr 28, 2020 at 03:17:47PM -0700, Darrick J. Wong wrote:
> On Mon, Apr 27, 2020 at 07:37:52AM -0400, Brian Foster wrote:
> > On Sat, Apr 25, 2020 at 12:01:37PM -0700, Christoph Hellwig wrote:
> > > On Tue, Apr 21, 2020 at 07:08:26PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Make it so that the deferred operations freezer can save inode numbers
> > > > when we freeze the dfops chain, and turn them into pointers to incore
> > > > inodes when we thaw the dfops chain to finish them.  Next, add dfops
> > > > item freeze and thaw functions to the BUI/BUD items so that they can
> > > > take advantage of this new feature.  This fixes a UAF bug in the
> > > > deferred bunmapi code because xfs_bui_recover can schedule another BUI
> > > > to continue unmapping but drops the inode pointer immediately
> > > > afterwards.
> > > 
> > > I'm only looking over this the first time, but why can't we just keep
> > > inode reference around during reocvery instead of this fairly
> > > complicated scheme to save the ino and then look it up again?
> > > 
> > 
> > I'm also a little confused about the use after free in the first place.
> > Doesn't xfs_bui_recover() look up the inode itself, or is the issue that
> > xfs_bui_recover() is fine but we might get into
> > xfs_bmap_update_finish_item() sometime later on the same inode without
> > any reference?
> 
> The second.  In practice it doesn't seem to trigger on the existing
> code, but the combination of atomic extent swap + fsstress + shutdown
> testing was enough to push it over the edge once due to reclaim.
> 
> > If the latter, similarly to Christoph I wonder if we
> > really could/should grab a reference on the inode for the intent itself,
> > even though that might not be necessary outside of recovery.
> 
> Outside of recovery we don't have the UAF problem because there's always
> something (usually the VFS dentry cache, but sometimes an explicit iget)
> that hold a reference to the inode for the duration of the transaction
> and dfops processing.
> 

Right, that's what I figured.

> One could just hang on to all incore inodes until the end of recovery
> like Christoph says, but the downside of doing it that way is that now
> we require enough memory to maintain all that incore state vs. only
> needing enough for the incore inodes involved in a particular dfops
> chain.  That isn't a huge deal now, but I was looking ahead to atomic
> extent swaps.
> 

What I was thinking above was tying the reference to the lifetime of the
intents associated with the inode, not necessarily the full lifetime of
recovery. It's not immediately clear to me if that indirectly leads to a
similar chain of in-core inodes due to unusual ordering of dfops chains
during recovery; ISTM that would mean a deviation from the typical
runtime dfops ordering, but perhaps I'm missing something...

That aside, based on your description above it seems we currently rely
on this icache retention behavior for recovery anyways, otherwise we'd
hit this use after free and probably have user reports. That suggests to
me that holding a reference is a logical next step, at least as a bug
fix patch to provide a more practical solution for stable/distro
kernels. For example, if we just associated an iget()/iput() with the
assignment of the xfs_bmap_intent->bi_owner field (and the eventual free
of the intent structure), would that technically solve the inode use
after free problem?

BTW, I also wonder about the viability of changing ->bi_owner to an
xfs_ino_t instead of a direct pointer, but that might be more
involved than just adding a reference to the existing scheme...

Brian

> (And, yeah, I should put that series on the list now...)
> 
> > Either way, more details about the problem being fixed in the commit log
> > would be helpful.
> 
> <nod>
> 
> --D
> 
> > Brian
> > 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-29 11:38         ` Brian Foster
@ 2020-04-29 11:48           ` Christoph Hellwig
  2020-04-29 14:28             ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-04-29 11:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, Christoph Hellwig, linux-xfs

On Wed, Apr 29, 2020 at 07:38:03AM -0400, Brian Foster wrote:
> That aside, based on your description above it seems we currently rely
> on this icache retention behavior for recovery anyways, otherwise we'd
> hit this use after free and probably have user reports. That suggests to
> me that holding a reference is a logical next step, at least as a bug
> fix patch to provide a more practical solution for stable/distro
> kernels. For example, if we just associated an iget()/iput() with the
> assignment of the xfs_bmap_intent->bi_owner field (and the eventual free
> of the intent structure), would that technically solve the inode use
> after free problem?

Yes, that's what I thought.

> 
> BTW, I also wonder about the viability of changing ->bi_owner to an
> xfs_ino_t instead of a direct pointer, but that might be more
> involved than just adding a reference to the existing scheme...

It is actually pretty easy, but I'm not sure if hitting the icache for
every finished bmap item is all that desirable.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-29 11:48           ` Christoph Hellwig
@ 2020-04-29 14:28             ` Darrick J. Wong
  2020-04-29 14:55               ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-29 14:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Wed, Apr 29, 2020 at 04:48:19AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 29, 2020 at 07:38:03AM -0400, Brian Foster wrote:
> > That aside, based on your description above it seems we currently rely
> > on this icache retention behavior for recovery anyways, otherwise we'd
> > hit this use after free and probably have user reports. That suggests to
> > me that holding a reference is a logical next step, at least as a bug
> > fix patch to provide a more practical solution for stable/distro
> > kernels. For example, if we just associated an iget()/iput() with the
> > assignment of the xfs_bmap_intent->bi_owner field (and the eventual free
> > of the intent structure), would that technically solve the inode use
> > after free problem?
> 
> Yes, that's what I thought.
> 
> > 
> > BTW, I also wonder about the viability of changing ->bi_owner to an
> > xfs_ino_t instead of a direct pointer, but that might be more
> > involved than just adding a reference to the existing scheme...
> 
> It is actually pretty easy, but I'm not sure if hitting the icache for
> every finished bmap item is all that desirable.

It came with a noticeable (~2%) slowdown on a swapext-heavy fsstress
run, which was my motivation for this (somewhat clunky) system for
avoiding all that overhead except for recovery.

Hmm.  Actually now that I think harder about it, the bmap item is
completely incore and fields are selectively copied to the log item.
This means that regular IO could set bi_owner = <some inode number> and
bi_ip = <the incore inode>.  Recovery IO can set bi_owner but leave
bi_ip NULL, and then the bmap item replay can iget as needed.  Now we
don't need this freeze/thaw thing at all.

--D

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-29 14:28             ` Darrick J. Wong
@ 2020-04-29 14:55               ` Christoph Hellwig
  2020-04-29 23:58                 ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-04-29 14:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Brian Foster, linux-xfs

On Wed, Apr 29, 2020 at 07:28:07AM -0700, Darrick J. Wong wrote:
> Hmm.  Actually now that I think harder about it, the bmap item is
> completely incore and fields are selectively copied to the log item.
> This means that regular IO could set bi_owner = <some inode number> and
> bi_ip = <the incore inode>.  Recovery IO can set bi_owner but leave
> bi_ip NULL, and then the bmap item replay can iget as needed.  Now we
> don't need this freeze/thaw thing at all.

Yes, that sounds pretty reasonable.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-29 14:55               ` Christoph Hellwig
@ 2020-04-29 23:58                 ` Darrick J. Wong
  2020-05-01 17:09                   ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-04-29 23:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Wed, Apr 29, 2020 at 07:55:57AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 29, 2020 at 07:28:07AM -0700, Darrick J. Wong wrote:
> > Hmm.  Actually now that I think harder about it, the bmap item is
> > completely incore and fields are selectively copied to the log item.
> > This means that regular IO could set bi_owner = <some inode number> and
> > bi_ip = <the incore inode>.  Recovery IO can set bi_owner but leave
> > bi_ip NULL, and then the bmap item replay can iget as needed.  Now we
> > don't need this freeze/thaw thing at all.
> 
> Yes, that sounds pretty reasonable.

OTOH I was talking to Dave and we decided to try the simpler solution of
retaining the inode reference unless someone can show that the more
complex machinery is required.

--D

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes
  2020-04-29 23:58                 ` Darrick J. Wong
@ 2020-05-01 17:09                   ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2020-05-01 17:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Brian Foster, linux-xfs

On Wed, Apr 29, 2020 at 04:58:19PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 29, 2020 at 07:55:57AM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 29, 2020 at 07:28:07AM -0700, Darrick J. Wong wrote:
> > > Hmm.  Actually now that I think harder about it, the bmap item is
> > > completely incore and fields are selectively copied to the log item.
> > > This means that regular IO could set bi_owner = <some inode number> and
> > > bi_ip = <the incore inode>.  Recovery IO can set bi_owner but leave
> > > bi_ip NULL, and then the bmap item replay can iget as needed.  Now we
> > > don't need this freeze/thaw thing at all.
> > 
> > Yes, that sounds pretty reasonable.
> 
> OTOH I was talking to Dave and we decided to try the simpler solution of
> retaining the inode reference unless someone can show that the more
> complex machinery is required.

Sounds good.  And we only need to do it when in recovery, this should
save a few atomic ops during normal operations.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] xfs: reduce log recovery transaction block reservations
  2020-04-28 22:22     ` Darrick J. Wong
@ 2020-05-27 22:39       ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2020-05-27 22:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Apr 28, 2020 at 03:22:47PM -0700, Darrick J. Wong wrote:
> On Fri, Apr 24, 2020 at 10:04:08AM -0400, Brian Foster wrote:
> > On Tue, Apr 21, 2020 at 07:08:20PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > On filesystems that support them, bmap intent log items can be used to
> > > change mappings in inode data or attr forks.  However, if the bmbt must
> > > expand, the enormous block reservations that we make for finishing
> > > chains of deferred log items become a liability because the bmbt block
> > > allocator sets minleft to the transaction reservation and there probably
> > > aren't any AGs in the filesystem that have that much free space.
> > > 
> > > Whereas previously we would reserve 93% of the free blocks in the
> > > filesystem, now we only want to reserve 7/8ths of the free space in the
> > > least full AG, and no more than half of the usable blocks in an AG.  In
> > > theory we shouldn't run out of space because (prior to the unclean
> > > shutdown) all of the in-progress transactions successfully reserved the
> > > worst case number of disk blocks.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_log_recover.c |   55 ++++++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 43 insertions(+), 12 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index e9b3e901d009..a416b028b320 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -2669,6 +2669,44 @@ xlog_recover_process_data(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Estimate a block reservation for a log recovery transaction.  Since we run
> > > + * separate transactions for each chain of deferred ops that get created as a
> > > + * result of recovering unfinished log intent items, we must be careful not to
> > > + * reserve so many blocks that block allocations fail because we can't satisfy
> > > + * the minleft requirements (e.g. for bmbt blocks).
> > > + */
> > > +static int
> > > +xlog_estimate_recovery_resblks(
> > > +	struct xfs_mount	*mp,
> > > +	unsigned int		*resblks)
> > > +{
> > > +	struct xfs_perag	*pag;
> > > +	xfs_agnumber_t		agno;
> > > +	unsigned int		free = 0;
> > > +	int			error;
> > > +
> > > +	/* Don't use more than 7/8th of the free space in the least full AG. */
> > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > +		unsigned int	ag_free;
> > > +
> > > +		error = xfs_alloc_pagf_init(mp, NULL, agno, 0);
> > > +		if (error)
> > > +			return error;
> > > +		pag = xfs_perag_get(mp, agno);
> > > +		ag_free = pag->pagf_freeblks + pag->pagf_flcount;
> > > +		free = max(free, (ag_free * 7) / 8);
> > > +		xfs_perag_put(pag);
> > > +	}
> > > +
> > 
> > Somewhat unfortunate that we have to iterate all AGs for each chain. I'm
> > wondering if that has any effect on a large recovery on fs' with an
> > inordinate AG count. Have you tested under those particular conditions?
> > I suppose it's possible the recovery is slow enough that this won't
> > matter...
> 
> I admit I haven't actually looked at that.  A more precise way to handle
> this would be for each intent recovery function to store its own worst
> case resblks estimation in the recovery freezer so that we'd be using
> roughly the same space requirements as the pre-crash transaction, but
> that's also a lot more complicated than this kludge.

I've been testing a new patch that rips out all of this in favor of
stealing the unused block reservation from the transaction that the log
item recovery function creates instead of this hokey AG iteration mess.
The results look promising, and I think I'll do this instead.

Granted, this now means that log intent item recovery must be very
careful to reserve enough blocks for the entire chain of operations that
could be created.  This shouldn't be too heavy a lift since we don't
have that many intent types yet, and at least at first glance the
resblks we ask for now looked ok to me.

--D

> > Also, perhaps not caused by this patch but does this
> > outsized/manufactured reservation have the effect of artificially
> > steering allocations to a particular AG if one happens to be notably
> > larger than the rest?
> 
> It tends to steer allocations towards whichever AGs were less full at
> the start of each transaction.  Were we to shift to scanning the AGs
> once for the entire recovery cycle, I think I'd probably pick a smaller
> ratio.
> 
> --D
> 
> > Brian
> > 
> > > +	/* Don't try to reserve more than half the usable AG blocks. */
> > > +	*resblks = min(free, xfs_alloc_ag_max_usable(mp) / 2);
> > > +	if (*resblks == 0)
> > > +		return -ENOSPC;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  /* Take all the collected deferred ops and finish them in order. */
> > >  static int
> > >  xlog_finish_defer_ops(
> > > @@ -2677,27 +2715,20 @@ xlog_finish_defer_ops(
> > >  {
> > >  	struct xfs_defer_freezer *dff, *next;
> > >  	struct xfs_trans	*tp;
> > > -	int64_t			freeblks;
> > >  	uint			resblks;
> > >  	int			error = 0;
> > >  
> > >  	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
> > > +		error = xlog_estimate_recovery_resblks(mp, &resblks);
> > > +		if (error)
> > > +			break;
> > > +
> > >  		/*
> > >  		 * We're finishing the defer_ops that accumulated as a result
> > >  		 * of recovering unfinished intent items during log recovery.
> > >  		 * We reserve an itruncate transaction because it is the
> > > -		 * largest permanent transaction type.  Since we're the only
> > > -		 * user of the fs right now, take 93% (15/16) of the available
> > > -		 * free blocks.  Use weird math to avoid a 64-bit division.
> > > +		 * largest permanent transaction type.
> > >  		 */
> > > -		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > > -		if (freeblks <= 0) {
> > > -			error = -ENOSPC;
> > > -			break;
> > > -		}
> > > -
> > > -		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > > -		resblks = (resblks * 15) >> 4;
> > >  		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > >  				0, XFS_TRANS_RESERVE, &tp);
> > >  		if (error)
> > > 
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-23  7:48   ` Christoph Hellwig
@ 2020-09-23 15:46     ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2020-09-23 15:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david

On Wed, Sep 23, 2020 at 08:48:16AM +0100, Christoph Hellwig wrote:
> On Wed, Sep 16, 2020 at 08:29:07PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we replay unfinished intent items that have been recovered from the
> > log, it's possible that the replay will cause the creation of more
> > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > recovery should replay deferred ops in order"), later work items have an
> > implicit ordering dependency on earlier work items.  Therefore, recovery
> > must replay the items (both recovered and created) in the same order
> > that they would have been during normal operation.
> > 
> > For log recovery, we enforce this ordering by using an empty transaction
> > to collect deferred ops that get created in the process of recovering a
> > log intent item to prevent them from being committed before the rest of
> > the recovered intent items.  After we finish committing all the
> > recovered log items, we allocate a transaction with an enormous block
> > reservation, splice our huge list of created deferred ops into that
> > transaction, and commit it, thereby finishing all those ops.
> > 
> > This is /really/ hokey -- it's the one place in XFS where we allow
> > nested transactions; the splicing of the defer ops list is is inelegant
> > and has to be done twice per recovery function; and the broken way we
> > handle inode pointers and block reservations cause subtle use-after-free
> > and allocator problems that will be fixed by this patch and the two
> > patches after it.
> > 
> > Therefore, replace the hokey empty transaction with a structure designed
> > to capture each chain of deferred ops that are created as part of
> > recovering a single unfinished log intent.  Finally, refactor the loop
> > that replays those chains to do so using one transaction per chain.
> 
> Wow, that is a massive change, but I really like the direction.
> 
> 
> > +int
> >  xfs_defer_capture(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_capture	**dfcp)
> >  {
> > +	struct xfs_defer_capture	*dfc = NULL;
> > +
> > +	if (!list_empty(&tp->t_dfops)) {
> > +		dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
> > +
> > +		xfs_defer_create_intents(tp);
> > +
> > +		INIT_LIST_HEAD(&dfc->dfc_list);
> > +		INIT_LIST_HEAD(&dfc->dfc_dfops);
> > +
> > +		/* Move the dfops chain and transaction state to the freezer. */
> > +		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
> > +		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> > +		xfs_defer_reset(tp);
> > +	}
> > +
> > +	*dfcp = dfc;
> > +	return xfs_trans_commit(tp);
> 
> Don't we need to free the structure if xfs_trans_commit fails?

They'll all get passed up to xlog_recover_process_intents (even with the
error code) and cancelled by xlog_cancel_defer_ops.

But you're right, this should maintain the principle of not creating
more state for callers to deal with when we're just going to error out
anyway.

> Wouldn't a saner calling convention would to pass the list_head into
> ->iop_recover and this function, and just add it to the list here
> instead of passing it around as an output pointer argument.

Sure, that works too.  I sort of hate passing around generic list_head
but I could say the same of double pointers. :)

> > +/*
> > + * Freeze any deferred ops and commit the transaction.  This is the last step
> > + * needed to finish a log intent item that we recovered from the log.
> > + */
> > +int
> > +xlog_recover_trans_commit(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_capture	**dfcp)
> > +{
> > +	return xfs_defer_capture(tp, dfcp);
> > +}
> 
> I don't think this wrapper is very helpful.  I think a direct call
> to xfs_defer_capture captures the intent better than pretending it just
> is another commit.

It's not particularly enlightening right now, but later (in the bmap
use-after-free series) I'll enlarge this helper to be able to stash
inodes and whatnot.  The end result is that all the log intent item
recovery functions can call this at the bottom and it'll clean up the
transaction and inodes that were used in recovery.

(Or hide them away for later.)

> > +
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*dfops_freezers)
> >  {
> > +	struct xfs_defer_capture *dfc, *next;
> >  	struct xfs_trans	*tp;
> >  	int64_t			freeblks;
> > +	uint64_t		resblks;
> > +	int			error = 0;
> >  
> > +	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
> > +		/*
> > +		 * We're finishing the defer_ops that accumulated as a result
> > +		 * of recovering unfinished intent items during log recovery.
> > +		 * We reserve an itruncate transaction because it is the
> > +		 * largest permanent transaction type.  Since we're the only
> > +		 * user of the fs right now, take 93% (15/16) of the available
> > +		 * free blocks.  Use weird math to avoid a 64-bit division.
> > +		 */
> > +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > +		if (freeblks <= 0) {
> > +			error = -ENOSPC;
> > +			break;
> > +		}
> >  
> > +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> > +		resblks = (resblks * 15) >> 4;
> > +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > +				0, XFS_TRANS_RESERVE, &tp);
> 
> This isn't actually new, but how did we end up with the truncate
> reservation here?

itruncate is the largest reservation, so it ended up here (and a few
other places) as a standin for "big enough to avoid failing log recovery
on a reservation overflow".

> > +		if (error)
> > +			break;
> > +
> > +		/* transfer all collected dfops to this transaction */
> > +		list_del_init(&dfc->dfc_list);
> > +		xfs_defer_continue(dfc, tp);
> > +
> > +		error = xfs_trans_commit(tp);
> > +		xfs_defer_capture_free(mp, dfc);
> > +		if (error)
> > +			break;
> 
> I'd just do direct returns and return 0 outside the loop, but that is
> just nitpicking.

Ok.

> >  /* Is this log item a deferred action intent? */
> > @@ -2528,28 +2566,16 @@ STATIC int
> >  xlog_recover_process_intents(
> >  	struct xlog		*log)
> >  {
> > +	LIST_HEAD(dfops_freezers);
> >  	struct xfs_ail_cursor	cur;
> > +	struct xfs_defer_capture *freezer = NULL;
> >  	struct xfs_log_item	*lip;
> >  	struct xfs_ail		*ailp;
> > +	int			error = 0;
> >  #if defined(DEBUG) || defined(XFS_WARN)
> >  	xfs_lsn_t		last_lsn;
> >  #endif
> >  
> >  	ailp = log->l_ailp;
> >  	spin_lock(&ailp->ail_lock);
> >  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > @@ -2578,26 +2604,31 @@ xlog_recover_process_intents(
> >  
> >  		/*
> >  		 * NOTE: If your intent processing routine can create more
> > +		 * deferred ops, you /must/ attach them to the freezer in
> >  		 * this routine or else those subsequent intents will get
> >  		 * replayed in the wrong order!
> >  		 */
> >  		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
> >  			spin_unlock(&ailp->ail_lock);
> > +			error = lip->li_ops->iop_recover(lip, &freezer);
> >  			spin_lock(&ailp->ail_lock);
> >  		}
> > +		if (freezer) {
> > +			list_add_tail(&freezer->dfc_list, &dfops_freezers);
> > +			freezer = NULL;
> > +		}
> >  		if (error)
> > +			break;
> > +
> >  		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> 
> The code flow looks really very odd (though correct) to me, what do you
> think of something like:
> 
>   	spin_lock(&ailp->ail_lock);
> 	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> 	     lip;
> 	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> 		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
> 			continue;
> 
> 		spin_unlock(&ailp->ail_lock);
> 		error = lip->li_ops->iop_recover(lip, &dfops_freezers);
> 		spin_lock(&ailp->ail_lock);
> 		if (error)
> 			break;
> 	}
>   	xfs_trans_ail_cursor_done(&cur);
>   	spin_unlock(&ailp->ail_lock);

Yeah, I like that better.  Thanks for the suggestions.

--D

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
@ 2020-09-23  7:48   ` Christoph Hellwig
  2020-09-23 15:46     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-09-23  7:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Sep 16, 2020 at 08:29:07PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items.  Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
> 
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items.  After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
> 
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
> 
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent.  Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.

Wow, that is a massive change, but I really like the direction.


> +int
>  xfs_defer_capture(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_capture	**dfcp)
>  {
> +	struct xfs_defer_capture	*dfc = NULL;
> +
> +	if (!list_empty(&tp->t_dfops)) {
> +		dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
> +
> +		xfs_defer_create_intents(tp);
> +
> +		INIT_LIST_HEAD(&dfc->dfc_list);
> +		INIT_LIST_HEAD(&dfc->dfc_dfops);
> +
> +		/* Move the dfops chain and transaction state to the freezer. */
> +		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
> +		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
> +		xfs_defer_reset(tp);
> +	}
> +
> +	*dfcp = dfc;
> +	return xfs_trans_commit(tp);

Don't we need to free the structure if xfs_trans_commit fails?

Wouldn't a saner calling convention would to pass the list_head into
->iop_recover and this function, and just add it to the list here
instead of passing it around as an output pointer argument.

> +/*
> + * Freeze any deferred ops and commit the transaction.  This is the last step
> + * needed to finish a log intent item that we recovered from the log.
> + */
> +int
> +xlog_recover_trans_commit(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_capture	**dfcp)
> +{
> +	return xfs_defer_capture(tp, dfcp);
> +}

I don't think this wrapper is very helpful.  I think a direct call
to xfs_defer_capture captures the intent better than pretending it just
is another commit.

> +
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> +	struct xfs_mount	*mp,
> +	struct list_head	*dfops_freezers)
>  {
> +	struct xfs_defer_capture *dfc, *next;
>  	struct xfs_trans	*tp;
>  	int64_t			freeblks;
> +	uint64_t		resblks;
> +	int			error = 0;
>  
> +	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
> +		/*
> +		 * We're finishing the defer_ops that accumulated as a result
> +		 * of recovering unfinished intent items during log recovery.
> +		 * We reserve an itruncate transaction because it is the
> +		 * largest permanent transaction type.  Since we're the only
> +		 * user of the fs right now, take 93% (15/16) of the available
> +		 * free blocks.  Use weird math to avoid a 64-bit division.
> +		 */
> +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> +		if (freeblks <= 0) {
> +			error = -ENOSPC;
> +			break;
> +		}
>  
> +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> +		resblks = (resblks * 15) >> 4;
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> +				0, XFS_TRANS_RESERVE, &tp);

This isn't actually new, but how did we end up with the truncate
reservation here?

> +		if (error)
> +			break;
> +
> +		/* transfer all collected dfops to this transaction */
> +		list_del_init(&dfc->dfc_list);
> +		xfs_defer_continue(dfc, tp);
> +
> +		error = xfs_trans_commit(tp);
> +		xfs_defer_capture_free(mp, dfc);
> +		if (error)
> +			break;

I'd just do direct returns and return 0 outside the loop, but that is
just nitpicking.

>  /* Is this log item a deferred action intent? */
> @@ -2528,28 +2566,16 @@ STATIC int
>  xlog_recover_process_intents(
>  	struct xlog		*log)
>  {
> +	LIST_HEAD(dfops_freezers);
>  	struct xfs_ail_cursor	cur;
> +	struct xfs_defer_capture *freezer = NULL;
>  	struct xfs_log_item	*lip;
>  	struct xfs_ail		*ailp;
> +	int			error = 0;
>  #if defined(DEBUG) || defined(XFS_WARN)
>  	xfs_lsn_t		last_lsn;
>  #endif
>  
>  	ailp = log->l_ailp;
>  	spin_lock(&ailp->ail_lock);
>  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> @@ -2578,26 +2604,31 @@ xlog_recover_process_intents(
>  
>  		/*
>  		 * NOTE: If your intent processing routine can create more
> +		 * deferred ops, you /must/ attach them to the freezer in
>  		 * this routine or else those subsequent intents will get
>  		 * replayed in the wrong order!
>  		 */
>  		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
>  			spin_unlock(&ailp->ail_lock);
> +			error = lip->li_ops->iop_recover(lip, &freezer);
>  			spin_lock(&ailp->ail_lock);
>  		}
> +		if (freezer) {
> +			list_add_tail(&freezer->dfc_list, &dfops_freezers);
> +			freezer = NULL;
> +		}
>  		if (error)
> +			break;
> +
>  		lip = xfs_trans_ail_cursor_next(ailp, &cur);

The code flow looks really very odd (though correct) to me, what do you
think of something like:

  	spin_lock(&ailp->ail_lock);
	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
	     lip;
	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
			continue;

		spin_unlock(&ailp->ail_lock);
		error = lip->li_ops->iop_recover(lip, &dfops_freezers);
		spin_lock(&ailp->ail_lock);
		if (error)
			break;
	}
  	xfs_trans_ail_cursor_done(&cur);
  	spin_unlock(&ailp->ail_lock);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-09-17  3:29 [PATCH 0/3] xfs: fix how we deal with new intents during recovery Darrick J. Wong
@ 2020-09-17  3:29 ` Darrick J. Wong
  2020-09-23  7:48   ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-09-17  3:29 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, david

From: Darrick J. Wong <darrick.wong@oracle.com>

When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items.  Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items.  After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent.  Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c       |   59 +++++++++++++++++--
 fs/xfs/libxfs/xfs_defer.h       |   20 ++++++
 fs/xfs/libxfs/xfs_log_recover.h |    4 +
 fs/xfs/xfs_bmap_item.c          |   16 +----
 fs/xfs/xfs_extfree_item.c       |    7 +-
 fs/xfs/xfs_log_recover.c        |  121 ++++++++++++++++++++++++---------------
 fs/xfs/xfs_refcount_item.c      |   16 +----
 fs/xfs/xfs_rmap_item.c          |    7 +-
 fs/xfs/xfs_trans.h              |    4 +
 9 files changed, 168 insertions(+), 86 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 29e9762f3b77..e4a6928c8566 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -563,14 +563,59 @@ xfs_defer_move(
  *
  * Create and log intent items for all the work that we're capturing so that we
  * can be assured that the items will get replayed if the system goes down
- * before log recovery gets a chance to finish the work it put off.  Then we
- * move the chain from stp to dtp.
+ * before log recovery gets a chance to finish the work it put off.  The entire
+ * deferred ops state is transferred to the capture structure and the
+ * transaction is committed.
+ *
+ * Note that the capture state is passed up to the caller and must be freed
+ * even if the transaction commit returns error.
  */
-void
+int
 xfs_defer_capture(
-	struct xfs_trans	*dtp,
-	struct xfs_trans	*stp)
+	struct xfs_trans		*tp,
+	struct xfs_defer_capture	**dfcp)
 {
-	xfs_defer_create_intents(stp);
-	xfs_defer_move(dtp, stp);
+	struct xfs_defer_capture	*dfc = NULL;
+
+	if (!list_empty(&tp->t_dfops)) {
+		dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS);
+
+		xfs_defer_create_intents(tp);
+
+		INIT_LIST_HEAD(&dfc->dfc_list);
+		INIT_LIST_HEAD(&dfc->dfc_dfops);
+
+		/* Move the dfops chain and transaction state to the freezer. */
+		list_splice_init(&tp->t_dfops, &dfc->dfc_dfops);
+		dfc->dfc_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+		xfs_defer_reset(tp);
+	}
+
+	*dfcp = dfc;
+	return xfs_trans_commit(tp);
+}
+
+/* Attach a chain of captured deferred ops to a new transaction. */
+void
+xfs_defer_continue(
+	struct xfs_defer_capture	*dfc,
+	struct xfs_trans		*tp)
+{
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/* Move captured dfops chain and state to the transaction. */
+	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
+	tp->t_flags |= dfc->dfc_tpflags;
+	dfc->dfc_tpflags = 0;
+}
+
+/* Release all resources that we used to capture deferred ops. */
+void
+xfs_defer_capture_free(
+	struct xfs_mount		*mp,
+	struct xfs_defer_capture	*dfc)
+{
+	xfs_defer_cancel_list(mp, &dfc->dfc_dfops);
+	kmem_free(dfc);
 }
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 3164199162b6..d61ef0a750ab 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -8,6 +8,7 @@
 
 struct xfs_btree_cur;
 struct xfs_defer_op_type;
+struct xfs_defer_capture;
 
 /*
  * Header for deferred operation list.
@@ -63,10 +64,27 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 
+/*
+ * Deferred operation freezer.  This structure enables a dfops user to detach
+ * the chain of deferred operations from a transaction so that they can be
+ * continued later.
+ */
+struct xfs_defer_capture {
+	/* List of other freezer heads. */
+	struct list_head	dfc_list;
+
+	/* Deferred ops state saved from the transaction. */
+	struct list_head	dfc_dfops;
+	unsigned int		dfc_tpflags;
+};
+
 /*
  * Functions to capture a chain of deferred operations and continue them later.
  * This doesn't normally happen except log recovery.
  */
-void xfs_defer_capture(struct xfs_trans *dtp, struct xfs_trans *stp);
+int xfs_defer_capture(struct xfs_trans *tp, struct xfs_defer_capture **dfcp);
+void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp);
+void xfs_defer_capture_free(struct xfs_mount *mp,
+		struct xfs_defer_capture *dfc);
 
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 641132d0e39d..c3563c5c033c 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -6,6 +6,8 @@
 #ifndef	__XFS_LOG_RECOVER_H__
 #define __XFS_LOG_RECOVER_H__
 
+struct xfs_defer_capture;
+
 /*
  * Each log item type (XFS_LI_*) gets its own xlog_recover_item_ops to
  * define how recovery should work for that type of log item.
@@ -125,5 +127,7 @@ void xlog_recover_iodone(struct xfs_buf *bp);
 
 void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id);
+int xlog_recover_trans_commit(struct xfs_trans *tp,
+		struct xfs_defer_capture **dfcp);
 
 #endif	/* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 598f713831c9..0a0904a7650a 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -424,13 +424,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 STATIC int
 xfs_bui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_map_extent		*bmap;
 	struct xfs_bud_log_item		*budp;
 	xfs_fsblock_t			startblock_fsb;
@@ -486,12 +486,7 @@ xfs_bui_item_recover(
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	budp = xfs_trans_get_bud(tp, buip);
 
 	/* Grab the inode. */
@@ -539,15 +534,12 @@ xfs_bui_item_recover(
 		xfs_bmap_unmap_extent(tp, ip, &irec);
 	}
 
-	xfs_defer_capture(parent_tp, tp);
-	error = xfs_trans_commit(tp);
+	error = xlog_recover_trans_commit(tp, dfcp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_irele(ip);
-
 	return error;
 
 err_inode:
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	if (ip) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 6cb8cd11072a..f544ec007c12 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -585,10 +585,10 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
 STATIC int
 xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_efd_log_item		*efdp;
 	struct xfs_trans		*tp;
 	struct xfs_extent		*extp;
@@ -633,8 +633,7 @@ xfs_efi_item_recover(
 
 	}
 
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dfcp);
 
 abort_error:
 	xfs_trans_cancel(tp);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e2ec91b2d0f4..0c90c1326860 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1791,6 +1791,18 @@ xlog_recover_release_intent(
 	spin_unlock(&ailp->ail_lock);
 }
 
+/*
+ * Freeze any deferred ops and commit the transaction.  This is the last step
+ * needed to finish a log intent item that we recovered from the log.
+ */
+int
+xlog_recover_trans_commit(
+	struct xfs_trans		*tp,
+	struct xfs_defer_capture	**dfcp)
+{
+	return xfs_defer_capture(tp, dfcp);
+}
+
 /******************************************************************************
  *
  *		Log recover routines
@@ -2467,38 +2479,64 @@ xlog_recover_process_data(
 	return 0;
 }
 
+static void
+xlog_cancel_defer_ops(
+	struct xfs_mount	*mp,
+	struct list_head	*dfops_freezers)
+{
+	struct xfs_defer_capture *dfc, *next;
+
+	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
+		list_del_init(&dfc->dfc_list);
+		xfs_defer_capture_free(mp, dfc);
+	}
+}
+
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
-	struct xfs_trans	*parent_tp)
+	struct xfs_mount	*mp,
+	struct list_head	*dfops_freezers)
 {
-	struct xfs_mount	*mp = parent_tp->t_mountp;
+	struct xfs_defer_capture *dfc, *next;
 	struct xfs_trans	*tp;
 	int64_t			freeblks;
-	uint			resblks;
-	int			error;
+	uint64_t		resblks;
+	int			error = 0;
 
-	/*
-	 * We're finishing the defer_ops that accumulated as a result of
-	 * recovering unfinished intent items during log recovery.  We
-	 * reserve an itruncate transaction because it is the largest
-	 * permanent transaction type.  Since we're the only user of the fs
-	 * right now, take 93% (15/16) of the available free blocks.  Use
-	 * weird math to avoid a 64-bit division.
-	 */
-	freeblks = percpu_counter_sum(&mp->m_fdblocks);
-	if (freeblks <= 0)
-		return -ENOSPC;
-	resblks = min_t(int64_t, UINT_MAX, freeblks);
-	resblks = (resblks * 15) >> 4;
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
-			0, XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-	/* transfer all collected dfops to this transaction */
-	xfs_defer_move(tp, parent_tp);
+	list_for_each_entry_safe(dfc, next, dfops_freezers, dfc_list) {
+		/*
+		 * We're finishing the defer_ops that accumulated as a result
+		 * of recovering unfinished intent items during log recovery.
+		 * We reserve an itruncate transaction because it is the
+		 * largest permanent transaction type.  Since we're the only
+		 * user of the fs right now, take 93% (15/16) of the available
+		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 */
+		freeblks = percpu_counter_sum(&mp->m_fdblocks);
+		if (freeblks <= 0) {
+			error = -ENOSPC;
+			break;
+		}
 
-	return xfs_trans_commit(tp);
+		resblks = min_t(uint64_t, UINT_MAX, freeblks);
+		resblks = (resblks * 15) >> 4;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, XFS_TRANS_RESERVE, &tp);
+		if (error)
+			break;
+
+		/* transfer all collected dfops to this transaction */
+		list_del_init(&dfc->dfc_list);
+		xfs_defer_continue(dfc, tp);
+
+		error = xfs_trans_commit(tp);
+		xfs_defer_capture_free(mp, dfc);
+		if (error)
+			break;
+	}
+
+	return error;
 }
 
 /* Is this log item a deferred action intent? */
@@ -2528,28 +2566,16 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_trans	*parent_tp;
+	LIST_HEAD(dfops_freezers);
 	struct xfs_ail_cursor	cur;
+	struct xfs_defer_capture *freezer = NULL;
 	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
-	int			error;
+	int			error = 0;
 #if defined(DEBUG) || defined(XFS_WARN)
 	xfs_lsn_t		last_lsn;
 #endif
 
-	/*
-	 * The intent recovery handlers commit transactions to complete recovery
-	 * for individual intents, but any new deferred operations that are
-	 * queued during that process are held off until the very end. The
-	 * purpose of this transaction is to serve as a container for deferred
-	 * operations. Each intent recovery handler must transfer dfops here
-	 * before its local transaction commits, and we'll finish the entire
-	 * list below.
-	 */
-	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
-	if (error)
-		return error;
-
 	ailp = log->l_ailp;
 	spin_lock(&ailp->ail_lock);
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
@@ -2578,26 +2604,31 @@ xlog_recover_process_intents(
 
 		/*
 		 * NOTE: If your intent processing routine can create more
-		 * deferred ops, you /must/ attach them to the transaction in
+		 * deferred ops, you /must/ attach them to the freezer in
 		 * this routine or else those subsequent intents will get
 		 * replayed in the wrong order!
 		 */
 		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
 			spin_unlock(&ailp->ail_lock);
-			error = lip->li_ops->iop_recover(lip, parent_tp);
+			error = lip->li_ops->iop_recover(lip, &freezer);
 			spin_lock(&ailp->ail_lock);
 		}
+		if (freezer) {
+			list_add_tail(&freezer->dfc_list, &dfops_freezers);
+			freezer = NULL;
+		}
 		if (error)
-			goto out;
+			break;
+
 		lip = xfs_trans_ail_cursor_next(ailp, &cur);
 	}
-out:
+
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
 	if (!error)
-		error = xlog_finish_defer_ops(parent_tp);
-	xfs_trans_cancel(parent_tp);
+		error = xlog_finish_defer_ops(log->l_mp, &dfops_freezers);
 
+	xlog_cancel_defer_ops(log->l_mp, &dfops_freezers);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 492d80a0b406..30e579b5857d 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -424,7 +424,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 STATIC int
 xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
@@ -432,7 +432,7 @@ xfs_cui_item_recover(
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	xfs_fsblock_t			new_fsb;
 	xfs_extlen_t			new_len;
@@ -493,12 +493,7 @@ xfs_cui_item_recover(
 			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	cudp = xfs_trans_get_cud(tp, cuip);
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
@@ -555,13 +550,10 @@ xfs_cui_item_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_capture(parent_tp, tp);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dfcp);
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index dc5b0753cd51..34dd6b02bdc8 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -467,14 +467,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
 STATIC int
 xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_capture	**dfcp)
 {
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_map_extent		*rmap;
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	enum xfs_rmap_intent_type	type;
 	xfs_exntst_t			state;
@@ -573,8 +573,7 @@ xfs_rui_item_recover(
 	}
 
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dfcp);
 
 abort_error:
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index b752501818d2..c88240961aa1 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -26,6 +26,7 @@ struct xfs_cui_log_item;
 struct xfs_cud_log_item;
 struct xfs_bui_log_item;
 struct xfs_bud_log_item;
+struct xfs_defer_capture;
 
 struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -74,7 +75,8 @@ struct xfs_item_ops {
 	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
-	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
+	int (*iop_recover)(struct xfs_log_item *lip,
+			   struct xfs_defer_capture **dfcp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
 


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  5:10       ` Dave Chinner
@ 2020-05-05 15:08         ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2020-05-05 15:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 05, 2020 at 03:10:29PM +1000, Dave Chinner wrote:
> On Mon, May 04, 2020 at 08:06:51PM -0700, Darrick J. Wong wrote:
> > On Tue, May 05, 2020 at 12:33:05PM +1000, Dave Chinner wrote:
> > > On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > When we replay unfinished intent items that have been recovered from the
> > > > log, it's possible that the replay will cause the creation of more
> > > > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > > > recovery should replay deferred ops in order"), later work items have an
> > > > implicit ordering dependency on earlier work items.  Therefore, recovery
> > > > must replay the items (both recovered and created) in the same order
> > > > that they would have been during normal operation.
> > > > 
> > > > For log recovery, we enforce this ordering by using an empty transaction
> > > > to collect deferred ops that get created in the process of recovering a
> > > > log intent item to prevent them from being committed before the rest of
> > > > the recovered intent items.  After we finish committing all the
> > > > recovered log items, we allocate a transaction with an enormous block
> > > > reservation, splice our huge list of created deferred ops into that
> > > > transaction, and commit it, thereby finishing all those ops.
> > > > 
> > > > This is /really/ hokey -- it's the one place in XFS where we allow
> > > > nested transactions; the splicing of the defer ops list is is inelegant
> > > > and has to be done twice per recovery function; and the broken way we
> > > > handle inode pointers and block reservations cause subtle use-after-free
> > > > and allocator problems that will be fixed by this patch and the two
> > > > patches after it.
> > > > 
> > > > Therefore, replace the hokey empty transaction with a structure designed
> > > > to capture each chain of deferred ops that are created as part of
> > > > recovering a single unfinished log intent.  Finally, refactor the loop
> > > > that replays those chains to do so using one transaction per chain.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > FWIW, I don't like the "freezer" based naming here. It's too easily
> > > confused with freezing and thawing the filesystem....
> > > 
> > > I know, "delayed deferred ops" isn't much better, but at least it
> > > won't get confused with existing unrelated functionality.
> > 
> > xfs_defer_{freeze,thaw} -> xfs_defer_{capture,relink} ?
> 
> Yeah, capture seems appropriate, maybe relink -> continue? i.e.
> capture the remaining defer_ops to be run, then continue running
> them?

I think I like capture/continue.

> /me shrugs and thinks "naming is hard"....

:)

> > > I've barely looked at the code, so no real comments on that yet,
> > > but I did notice this:
> > > 
> > > > @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
> > > >  /* Take all the collected deferred ops and finish them in order. */
> > > >  static int
> > > >  xlog_finish_defer_ops(
> > > > -	struct xfs_trans	*parent_tp)
> > > > +	struct xfs_mount	*mp,
> > > > +	struct list_head	*dfops_freezers)
> > > >  {
> > > > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > > > +	struct xfs_defer_freezer *dff, *next;
> > > >  	struct xfs_trans	*tp;
> > > >  	int64_t			freeblks;
> > > >  	uint			resblks;
> > > ....
> > > > +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > > > +		resblks = (resblks * 15) >> 4;
> > > 
> > > Can overflow when freeblks > (UINT_MAX / 15).
> > 
> > D'oh.  Ugh, I hate this whole fugly hack.
> > 
> > TBH I've been thinking that perhaps the freezer function should be
> > capturing the unused transaction block reservation when we capture the
> > dfops chain from the transaction.
> 
> Exactly what problem is this hack supposed to avoid? having the
> filesystem ENOSPC before all the deferops have been completed?
>
> if so, can that even happen? Because the fact that the intents are
> in the log means that when they were started there was enough space
> in the fs for them to run, so ENOSPC should not be an issue, right?

We're probably not going to run out of space, seeing as we had enough
space to run the first time.  However, there's various parts of the
filesystem that either behave differently or ENOSPC early if the
transaction has no block reservation, so we need to avoid them.

The outcome of the recovery work ought to be as close as possible to
what would have happened if the fs hadn't gone down.

> > When we set up the second transaction, we then set t_blk_res to the
> > captured block reservation.  So long as the recovery function is smart
> > enough to set up sufficient reservation we should avoid hitting ENOSPC,
> > right?
> 
> I'm not sure ENOSPC is really a problem for recovery of deferred ops
> given the above...

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  3:06     ` Darrick J. Wong
@ 2020-05-05  5:10       ` Dave Chinner
  2020-05-05 15:08         ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2020-05-05  5:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, May 04, 2020 at 08:06:51PM -0700, Darrick J. Wong wrote:
> On Tue, May 05, 2020 at 12:33:05PM +1000, Dave Chinner wrote:
> > On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > When we replay unfinished intent items that have been recovered from the
> > > log, it's possible that the replay will cause the creation of more
> > > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > > recovery should replay deferred ops in order"), later work items have an
> > > implicit ordering dependency on earlier work items.  Therefore, recovery
> > > must replay the items (both recovered and created) in the same order
> > > that they would have been during normal operation.
> > > 
> > > For log recovery, we enforce this ordering by using an empty transaction
> > > to collect deferred ops that get created in the process of recovering a
> > > log intent item to prevent them from being committed before the rest of
> > > the recovered intent items.  After we finish committing all the
> > > recovered log items, we allocate a transaction with an enormous block
> > > reservation, splice our huge list of created deferred ops into that
> > > transaction, and commit it, thereby finishing all those ops.
> > > 
> > > This is /really/ hokey -- it's the one place in XFS where we allow
> > > nested transactions; the splicing of the defer ops list is is inelegant
> > > and has to be done twice per recovery function; and the broken way we
> > > handle inode pointers and block reservations cause subtle use-after-free
> > > and allocator problems that will be fixed by this patch and the two
> > > patches after it.
> > > 
> > > Therefore, replace the hokey empty transaction with a structure designed
> > > to capture each chain of deferred ops that are created as part of
> > > recovering a single unfinished log intent.  Finally, refactor the loop
> > > that replays those chains to do so using one transaction per chain.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > FWIW, I don't like the "freezer" based naming here. It's too easily
> > confused with freezing and thawing the filesystem....
> > 
> > I know, "delayed deferred ops" isn't much better, but at least it
> > won't get confused with existing unrelated functionality.
> 
> xfs_defer_{freeze,thaw} -> xfs_defer_{capture,relink} ?

Yeah, capture seems appropriate, maybe relink -> continue? i.e.
capture the remaining defer_ops to be run, then continue running
them?

/me shrugs and thinks "naming is hard"....

> > I've barely looked at the code, so no real comments on that yet,
> > but I did notice this:
> > 
> > > @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
> > >  /* Take all the collected deferred ops and finish them in order. */
> > >  static int
> > >  xlog_finish_defer_ops(
> > > -	struct xfs_trans	*parent_tp)
> > > +	struct xfs_mount	*mp,
> > > +	struct list_head	*dfops_freezers)
> > >  {
> > > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > > +	struct xfs_defer_freezer *dff, *next;
> > >  	struct xfs_trans	*tp;
> > >  	int64_t			freeblks;
> > >  	uint			resblks;
> > ....
> > > +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > > +		resblks = (resblks * 15) >> 4;
> > 
> > Can overflow when freeblks > (UINT_MAX / 15).
> 
> D'oh.  Ugh, I hate this whole fugly hack.
> 
> TBH I've been thinking that perhaps the freezer function should be
> capturing the unused transaction block reservation when we capture the
> dfops chain from the transaction.

Exactly what problem is this hack supposed to avoid? having the
filesystem ENOSPC before all the deferops have been completed?

if so, can that even happen? Because the fact that the intents are
in the log means that when they were started there was enough space
in the fs for them to run, so ENOSPC should not be an issue, right?

> When we set up the second transaction, we then set t_blk_res to the
> captured block reservation.  So long as the recovery function is smart
> enough to set up sufficient reservation we should avoid hitting ENOSPC,
> right?

I'm not sure ENOSPC is really a problem for recovery of deferred ops
given the above...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  2:33   ` Dave Chinner
@ 2020-05-05  3:06     ` Darrick J. Wong
  2020-05-05  5:10       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-05-05  3:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 05, 2020 at 12:33:05PM +1000, Dave Chinner wrote:
> On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When we replay unfinished intent items that have been recovered from the
> > log, it's possible that the replay will cause the creation of more
> > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > recovery should replay deferred ops in order"), later work items have an
> > implicit ordering dependency on earlier work items.  Therefore, recovery
> > must replay the items (both recovered and created) in the same order
> > that they would have been during normal operation.
> > 
> > For log recovery, we enforce this ordering by using an empty transaction
> > to collect deferred ops that get created in the process of recovering a
> > log intent item to prevent them from being committed before the rest of
> > the recovered intent items.  After we finish committing all the
> > recovered log items, we allocate a transaction with an enormous block
> > reservation, splice our huge list of created deferred ops into that
> > transaction, and commit it, thereby finishing all those ops.
> > 
> > This is /really/ hokey -- it's the one place in XFS where we allow
> > nested transactions; the splicing of the defer ops list is is inelegant
> > and has to be done twice per recovery function; and the broken way we
> > handle inode pointers and block reservations cause subtle use-after-free
> > and allocator problems that will be fixed by this patch and the two
> > patches after it.
> > 
> > Therefore, replace the hokey empty transaction with a structure designed
> > to capture each chain of deferred ops that are created as part of
> > recovering a single unfinished log intent.  Finally, refactor the loop
> > that replays those chains to do so using one transaction per chain.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> FWIW, I don't like the "freezer" based naming here. It's too easily
> confused with freezing and thawing the filesystem....
> 
> I know, "delayed deferred ops" isn't much better, but at least it
> won't get confused with existing unrelated functionality.

xfs_defer_{freeze,thaw} -> xfs_defer_{capture,relink} ?

> I've barely looked at the code, so no real comments on that yet,
> but I did notice this:
> 
> > @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > -	struct xfs_trans	*parent_tp)
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*dfops_freezers)
> >  {
> > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > +	struct xfs_defer_freezer *dff, *next;
> >  	struct xfs_trans	*tp;
> >  	int64_t			freeblks;
> >  	uint			resblks;
> ....
> > +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> > +		resblks = (resblks * 15) >> 4;
> 
> Can overflow when freeblks > (UINT_MAX / 15).

D'oh.  Ugh, I hate this whole fugly hack.

TBH I've been thinking that perhaps the freezer function should be
capturing the unused transaction block reservation when we capture the
dfops chain from the transaction.

When we set up the second transaction, we then set t_blk_res to the
captured block reservation.  So long as the recovery function is smart
enough to set up sufficient reservation we should avoid hitting ENOSPC,
right?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  1:13 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
@ 2020-05-05  2:33   ` Dave Chinner
  2020-05-05  3:06     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2020-05-05  2:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, May 04, 2020 at 06:13:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When we replay unfinished intent items that have been recovered from the
> log, it's possible that the replay will cause the creation of more
> deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> recovery should replay deferred ops in order"), later work items have an
> implicit ordering dependency on earlier work items.  Therefore, recovery
> must replay the items (both recovered and created) in the same order
> that they would have been during normal operation.
> 
> For log recovery, we enforce this ordering by using an empty transaction
> to collect deferred ops that get created in the process of recovering a
> log intent item to prevent them from being committed before the rest of
> the recovered intent items.  After we finish committing all the
> recovered log items, we allocate a transaction with an enormous block
> reservation, splice our huge list of created deferred ops into that
> transaction, and commit it, thereby finishing all those ops.
> 
> This is /really/ hokey -- it's the one place in XFS where we allow
> nested transactions; the splicing of the defer ops list is is inelegant
> and has to be done twice per recovery function; and the broken way we
> handle inode pointers and block reservations cause subtle use-after-free
> and allocator problems that will be fixed by this patch and the two
> patches after it.
> 
> Therefore, replace the hokey empty transaction with a structure designed
> to capture each chain of deferred ops that are created as part of
> recovering a single unfinished log intent.  Finally, refactor the loop
> that replays those chains to do so using one transaction per chain.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

FWIW, I don't like the "freezer" based naming here. It's too easily
confused with freezing and thawing the filesystem....

I know, "delayed deferred ops" isn't much better, but at least it
won't get confused with existing unrelated functionality.

I've barely looked at the code, so no real comments on that yet,
but I did notice this:

> @@ -2495,35 +2515,59 @@ xlog_recover_process_data(
>  /* Take all the collected deferred ops and finish them in order. */
>  static int
>  xlog_finish_defer_ops(
> -	struct xfs_trans	*parent_tp)
> +	struct xfs_mount	*mp,
> +	struct list_head	*dfops_freezers)
>  {
> -	struct xfs_mount	*mp = parent_tp->t_mountp;
> +	struct xfs_defer_freezer *dff, *next;
>  	struct xfs_trans	*tp;
>  	int64_t			freeblks;
>  	uint			resblks;
....
> +		resblks = min_t(int64_t, UINT_MAX, freeblks);
> +		resblks = (resblks * 15) >> 4;

Can overflow when freeblks > (UINT_MAX / 15).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery
  2020-05-05  1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
@ 2020-05-05  1:13 ` Darrick J. Wong
  2020-05-05  2:33   ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2020-05-05  1:13 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items.  Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items.  After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent.  Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c       |   57 ++++++++++++++++++
 fs/xfs/libxfs/xfs_defer.h       |   21 +++++++
 fs/xfs/libxfs/xfs_log_recover.h |    4 +
 fs/xfs/xfs_bmap_item.c          |   16 +----
 fs/xfs/xfs_extfree_item.c       |    7 +-
 fs/xfs/xfs_log_recover.c        |  122 +++++++++++++++++++++++++--------------
 fs/xfs/xfs_refcount_item.c      |   16 +----
 fs/xfs/xfs_rmap_item.c          |    7 +-
 fs/xfs/xfs_trans.h              |    4 +
 9 files changed, 178 insertions(+), 76 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 1172fbf072d8..bbbd0141d4a6 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -544,3 +544,60 @@ xfs_defer_move(
 
 	xfs_defer_reset(stp);
 }
+
+/*
+ * Freeze a chain of deferred ops that are attached to a transaction.  The
+ * entire deferred ops state is transferred to the freezer, and each dfops
+ * item will be prepared for freezing.
+ */
+int
+xfs_defer_freeze(
+	struct xfs_trans		*tp,
+	struct xfs_defer_freezer	**dffp)
+{
+	struct xfs_defer_freezer	*dff;
+
+	*dffp = NULL;
+	if (list_empty(&tp->t_dfops))
+		return 0;
+
+	dff = kmem_zalloc(sizeof(*dff), KM_NOFS);
+	if (!dff)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&dff->dff_list);
+	INIT_LIST_HEAD(&dff->dff_dfops);
+
+	/* Move all the dfops items to the freezer. */
+	list_splice_init(&tp->t_dfops, &dff->dff_dfops);
+	dff->dff_tpflags = tp->t_flags & XFS_TRANS_LOWMODE;
+	xfs_defer_reset(tp);
+
+	*dffp = dff;
+	return 0;
+}
+
+/* Thaw a chain of deferred ops that are attached to a transaction. */
+int
+xfs_defer_thaw(
+	struct xfs_defer_freezer	*dff,
+	struct xfs_trans		*tp)
+{
+	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+
+	/* Add the dfops items to the transaction. */
+	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
+	tp->t_flags |= dff->dff_tpflags;
+
+	return 0;
+}
+
+/* Release a deferred op freezer and all resources associated with it. */
+void
+xfs_defer_freeezer_finish(
+	struct xfs_mount		*mp,
+	struct xfs_defer_freezer	*dff)
+{
+	xfs_defer_cancel_list(mp, &dff->dff_dfops);
+	kmem_free(dff);
+}
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 3bf7c2c4d851..4789bf53dd96 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -8,6 +8,7 @@
 
 struct xfs_btree_cur;
 struct xfs_defer_op_type;
+struct xfs_defer_freezer;
 
 /*
  * Header for deferred operation list.
@@ -63,4 +64,24 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 
+/*
+ * Deferred operation freezer.  This structure enables a dfops user to detach
+ * the chain of deferred operations from a transaction so that they can be
+ * run later.
+ */
+struct xfs_defer_freezer {
+	/* List of other freezer heads. */
+	struct list_head	dff_list;
+
+	/* Deferred ops state saved from the transaction. */
+	struct list_head	dff_dfops;
+	unsigned int		dff_tpflags;
+};
+
+/* Functions to freeze a chain of deferred operations for later. */
+int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
+int xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
+void xfs_defer_freeezer_finish(struct xfs_mount *mp,
+		struct xfs_defer_freezer *dff);
+
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index d8c0eae87179..6092a02b2d2a 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -6,6 +6,8 @@
 #ifndef	__XFS_LOG_RECOVER_H__
 #define __XFS_LOG_RECOVER_H__
 
+struct xfs_defer_freezer;
+
 /*
  * Each log item type (XFS_LI_*) gets its own xlog_recover_item_ops to
  * define how recovery should work for that type of log item.
@@ -130,5 +132,7 @@ void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id);
 void xlog_recover_insert_ail(struct xlog *log, struct xfs_log_item *lip,
 		xfs_lsn_t lsn);
+int xlog_recover_trans_commit(struct xfs_trans *tp,
+		struct xfs_defer_freezer **dffp);
 
 #endif	/* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 0793c317defb..c733bdeeeb9b 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -423,13 +423,13 @@ const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
 STATIC int
 xfs_bui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_map_extent		*bmap;
 	struct xfs_bud_log_item		*budp;
 	xfs_fsblock_t			startblock_fsb;
@@ -485,12 +485,7 @@ xfs_bui_item_recover(
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	budp = xfs_trans_get_bud(tp, buip);
 
 	/* Grab the inode. */
@@ -534,15 +529,12 @@ xfs_bui_item_recover(
 		xfs_bmap_unmap_extent(tp, ip, &irec);
 	}
 
-	xfs_defer_move(parent_tp, tp);
-	error = xfs_trans_commit(tp);
+	error = xlog_recover_trans_commit(tp, dffp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_irele(ip);
-
 	return error;
 
 err_inode:
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	if (ip) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index b92678bede24..262acddd2e1b 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -583,10 +583,10 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
 STATIC int
 xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	struct xfs_efd_log_item		*efdp;
 	struct xfs_trans		*tp;
 	struct xfs_extent		*extp;
@@ -631,8 +631,7 @@ xfs_efi_item_recover(
 
 	}
 
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_trans_cancel(tp);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index a9cc546535e0..908bfb284a9a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1809,6 +1809,26 @@ xlog_recover_insert_ail(
 	xfs_trans_ail_update(log->l_ailp, lip, lsn);
 }
 
+/*
+ * Freeze any deferred ops and commit the transaction.  This is the last step
+ * needed to finish a log intent item that we recovered from the log.
+ */
+int
+xlog_recover_trans_commit(
+	struct xfs_trans		*tp,
+	struct xfs_defer_freezer	**dffp)
+{
+	int				error;
+
+	error = xfs_defer_freeze(tp, dffp);
+	if (error) {
+		xfs_trans_cancel(tp);
+		return error;
+	}
+
+	return xfs_trans_commit(tp);
+}
+
 /******************************************************************************
  *
  *		Log recover routines
@@ -2495,35 +2515,59 @@ xlog_recover_process_data(
 /* Take all the collected deferred ops and finish them in order. */
 static int
 xlog_finish_defer_ops(
-	struct xfs_trans	*parent_tp)
+	struct xfs_mount	*mp,
+	struct list_head	*dfops_freezers)
 {
-	struct xfs_mount	*mp = parent_tp->t_mountp;
+	struct xfs_defer_freezer *dff, *next;
 	struct xfs_trans	*tp;
 	int64_t			freeblks;
 	uint			resblks;
-	int			error;
+	int			error = 0;
 
-	/*
-	 * We're finishing the defer_ops that accumulated as a result of
-	 * recovering unfinished intent items during log recovery.  We
-	 * reserve an itruncate transaction because it is the largest
-	 * permanent transaction type.  Since we're the only user of the fs
-	 * right now, take 93% (15/16) of the available free blocks.  Use
-	 * weird math to avoid a 64-bit division.
-	 */
-	freeblks = percpu_counter_sum(&mp->m_fdblocks);
-	if (freeblks <= 0)
-		return -ENOSPC;
-	resblks = min_t(int64_t, UINT_MAX, freeblks);
-	resblks = (resblks * 15) >> 4;
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
-			0, XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-	/* transfer all collected dfops to this transaction */
-	xfs_defer_move(tp, parent_tp);
+	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		/*
+		 * We're finishing the defer_ops that accumulated as a result
+		 * of recovering unfinished intent items during log recovery.
+		 * We reserve an itruncate transaction because it is the
+		 * largest permanent transaction type.  Since we're the only
+		 * user of the fs right now, take 93% (15/16) of the available
+		 * free blocks.  Use weird math to avoid a 64-bit division.
+		 */
+		freeblks = percpu_counter_sum(&mp->m_fdblocks);
+		if (freeblks <= 0) {
+			error = -ENOSPC;
+			break;
+		}
 
-	return xfs_trans_commit(tp);
+		resblks = min_t(int64_t, UINT_MAX, freeblks);
+		resblks = (resblks * 15) >> 4;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, XFS_TRANS_RESERVE, &tp);
+		if (error)
+			break;
+
+		/* transfer all collected dfops to this transaction */
+		list_del_init(&dff->dff_list);
+		error = xfs_defer_thaw(dff, tp);
+		if (error) {
+			xfs_trans_cancel(tp);
+			xfs_defer_freeezer_finish(mp, dff);
+			break;
+		}
+
+		error = xfs_trans_commit(tp);
+		xfs_defer_freeezer_finish(mp, dff);
+		if (error)
+			break;
+	}
+
+	/* Kill any remaining freezers. */
+	list_for_each_entry_safe(dff, next, dfops_freezers, dff_list) {
+		list_del_init(&dff->dff_list);
+		xfs_defer_freeezer_finish(mp, dff);
+	}
+
+	return error;
 }
 
 /* Is this log item a deferred action intent? */
@@ -2553,8 +2597,9 @@ STATIC int
 xlog_recover_process_intents(
 	struct xlog		*log)
 {
-	struct xfs_trans	*parent_tp;
+	LIST_HEAD(dfops_freezers);
 	struct xfs_ail_cursor	cur;
+	struct xfs_defer_freezer *freezer = NULL;
 	struct xfs_log_item	*lip;
 	struct xfs_ail		*ailp;
 	int			error = 0;
@@ -2562,19 +2607,6 @@ xlog_recover_process_intents(
 	xfs_lsn_t		last_lsn;
 #endif
 
-	/*
-	 * The intent recovery handlers commit transactions to complete recovery
-	 * for individual intents, but any new deferred operations that are
-	 * queued during that process are held off until the very end. The
-	 * purpose of this transaction is to serve as a container for deferred
-	 * operations. Each intent recovery handler must transfer dfops here
-	 * before its local transaction commits, and we'll finish the entire
-	 * list below.
-	 */
-	error = xfs_trans_alloc_empty(log->l_mp, &parent_tp);
-	if (error)
-		return error;
-
 	ailp = log->l_ailp;
 	spin_lock(&ailp->ail_lock);
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
@@ -2603,27 +2635,31 @@ xlog_recover_process_intents(
 
 		/*
 		 * NOTE: If your intent processing routine can create more
-		 * deferred ops, you /must/ attach them to the transaction in
+		 * deferred ops, you /must/ attach them to the freezer in
 		 * this routine or else those subsequent intents will get
 		 * replayed in the wrong order!
 		 */
 		if (!test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags)) {
 			spin_unlock(&ailp->ail_lock);
-			error = lip->li_ops->iop_recover(lip, parent_tp);
+			error = lip->li_ops->iop_recover(lip, &freezer);
 			spin_lock(&ailp->ail_lock);
 		}
 		if (error)
 			goto out;
+		if (freezer) {
+			list_add_tail(&freezer->dff_list, &dfops_freezers);
+			freezer = NULL;
+		}
+
 		lip = xfs_trans_ail_cursor_next(ailp, &cur);
 	}
 out:
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
-	if (!error)
-		error = xlog_finish_defer_ops(parent_tp);
-	xfs_trans_cancel(parent_tp);
+	if (error)
+		return error;
 
-	return error;
+	return xlog_finish_defer_ops(log->l_mp, &dfops_freezers);
 }
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index e6d355a09bb3..dcdb975dbb86 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -423,7 +423,7 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 STATIC int
 xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_bmbt_irec		irec;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
@@ -431,7 +431,7 @@ xfs_cui_item_recover(
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	xfs_fsblock_t			new_fsb;
 	xfs_extlen_t			new_len;
@@ -492,12 +492,7 @@ xfs_cui_item_recover(
 			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
-	/*
-	 * Recovery stashes all deferred ops during intent processing and
-	 * finishes them on completion. Transfer current dfops state to this
-	 * transaction and transfer the result back before we return.
-	 */
-	xfs_defer_move(tp, parent_tp);
+
 	cudp = xfs_trans_get_cud(tp, cuip);
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
@@ -554,13 +549,10 @@ xfs_cui_item_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	xfs_defer_move(parent_tp, tp);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 4a5e2b1cf75a..c751487740df 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -466,14 +466,14 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
 STATIC int
 xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
-	struct xfs_trans		*parent_tp)
+	struct xfs_defer_freezer	**dffp)
 {
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_map_extent		*rmap;
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
-	struct xfs_mount		*mp = parent_tp->t_mountp;
+	struct xfs_mount		*mp = lip->li_mountp;
 	xfs_fsblock_t			startblock_fsb;
 	enum xfs_rmap_intent_type	type;
 	xfs_exntst_t			state;
@@ -572,8 +572,7 @@ xfs_rui_item_recover(
 	}
 
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
-	error = xfs_trans_commit(tp);
-	return error;
+	return xlog_recover_trans_commit(tp, dffp);
 
 abort_error:
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8308bf6d7e40..48db4510b9d9 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -26,6 +26,7 @@ struct xfs_cui_log_item;
 struct xfs_cud_log_item;
 struct xfs_bui_log_item;
 struct xfs_bud_log_item;
+struct xfs_defer_freezer;
 
 struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -79,7 +80,8 @@ struct xfs_item_ops {
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
 	void (*iop_error)(struct xfs_log_item *, xfs_buf_t *);
-	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
+	int (*iop_recover)(struct xfs_log_item *lip,
+			   struct xfs_defer_freezer **dffp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
 


^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-09-23 15:48 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-22  2:08 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-04-22  2:08 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
2020-04-24 14:02   ` Brian Foster
2020-04-28 22:28     ` Darrick J. Wong
2020-04-22  2:08 ` [PATCH 2/3] xfs: reduce log recovery transaction block reservations Darrick J. Wong
2020-04-24 14:04   ` Brian Foster
2020-04-28 22:22     ` Darrick J. Wong
2020-05-27 22:39       ` Darrick J. Wong
2020-04-22  2:08 ` [PATCH 3/3] xfs: teach deferred op freezer to freeze and thaw inodes Darrick J. Wong
2020-04-25 19:01   ` Christoph Hellwig
2020-04-27 11:37     ` Brian Foster
2020-04-28 22:17       ` Darrick J. Wong
2020-04-29 11:38         ` Brian Foster
2020-04-29 11:48           ` Christoph Hellwig
2020-04-29 14:28             ` Darrick J. Wong
2020-04-29 14:55               ` Christoph Hellwig
2020-04-29 23:58                 ` Darrick J. Wong
2020-05-01 17:09                   ` Christoph Hellwig
2020-05-05  1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-05-05  1:13 ` [PATCH 1/3] xfs: proper replay of deferred ops queued " Darrick J. Wong
2020-05-05  2:33   ` Dave Chinner
2020-05-05  3:06     ` Darrick J. Wong
2020-05-05  5:10       ` Dave Chinner
2020-05-05 15:08         ` Darrick J. Wong
2020-09-17  3:29 [PATCH 0/3] xfs: fix how we deal with new intents during recovery Darrick J. Wong
2020-09-17  3:29 ` [PATCH 1/3] xfs: proper replay of deferred ops queued during log recovery Darrick J. Wong
2020-09-23  7:48   ` Christoph Hellwig
2020-09-23 15:46     ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.