linux-xfs.vger.kernel.org archive mirror
* [RFC PATCH 0/7] xfs: intent item whiteouts
@ 2021-09-02  9:59 Dave Chinner
  2021-09-02  9:59 ` [PATCH 1/7] xfs: add log item flags to indicate intents Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This is a patchset built on top of Allison's Logged Attributes
and my CIL Scalability patch sets. It is inspired by the performance
regressions that were seen from logging 64k xattrs and trying to
work out how to minimise the impact of logging xattrs. Most of that
is explained in the "[RFC] xfs: intent item whiteouts" patch, so
I won't repeat it here.

The whiteouts massively reduce the journal write overhead of logging
xattrs - with this patchset I've reduced 2.5GB/s of log traffic (16
way file create w/64k xattr workload) down to approximately 220MB of
log traffic, and performance has increased from 9k creates/s to 36k
creates/s. The workload still writes to disk at 2.5GB/s, but that's
what writing 35k x 64k xattrs to disk does.

This is still short of the non-logged attribute mechanism, which
runs at 45-50k creates a second and 3.5-4GB/s to disk, but it brings
logged attrs to within roughly 5-15% of non-logged attrs across the
full range of attribute sizes.

The biggest limitation to logged attr throughput at this point in
time is the memory allocation overhead of the shadow buffers for
logging the xattr name. I have some ideas on how to avoid that, but
nothing concrete yet, so in the mean time there's a patch to make
"kvmalloc()" behave how we need it to behave and the whiteout
implementation frees shadow buffers attached to the intents when the
whiteout is applied so that we don't hold lots of memory allocated
unnecessarily.

This patchset is separate to the attr code, though, because
intent whiteouts are not specific to the attr code. They are a
generic mechanism that can be applied to all the intent/intent done
item pairs we already have. This patch set modifies all those
intents to use whiteouts, and so there are benefits from the patch
set for all operations that use these intents.

I haven't done anything other than smoke test these patches with
xattr performance tests, so don't use them on anything you care
about. This posting is for early feedback so I can try to land them
with the logged attribute code to minimise the impact of the perf
regressions.

What do people think of the approach I've taken for this feature?

Cheers,

Dave.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/7] xfs: add log item flags to indicate intents
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
@ 2021-09-02  9:59 ` Dave Chinner
  2021-09-03 21:08   ` Allison Henderson
  2021-09-02  9:59 ` [PATCH 2/7] xfs: tag transactions that contain intent done items Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We currently have a couple of helper functions that try to infer
whether the log item is an intent or intent done item from the
combinations of operations it supports.  This is incredibly fragile
and not very efficient as it requires checking specific combinations
of ops.

We need to be able to identify intent and intent done items quickly
and easily in upcoming patches, so simply add intent and intent done
type flags to the log item ops flags. These are static flags to
begin with, so intent items should have been typed like this from
the start.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_attr_item.c     |  4 +++-
 fs/xfs/xfs_bmap_item.c     |  4 +++-
 fs/xfs/xfs_extfree_item.c  |  4 +++-
 fs/xfs/xfs_refcount_item.c |  4 +++-
 fs/xfs/xfs_rmap_item.c     |  4 +++-
 fs/xfs/xfs_trans.h         | 25 +++++++++++++------------
 6 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index bd4089eb8087..f900001e8f3a 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -479,7 +479,8 @@ xfs_trans_get_attrd(struct xfs_trans		*tp,
 }
 
 static const struct xfs_item_ops xfs_attrd_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_attrd_item_size,
 	.iop_format	= xfs_attrd_item_format,
 	.iop_release    = xfs_attrd_item_release,
@@ -684,6 +685,7 @@ xfs_attri_item_relog(
 }
 
 static const struct xfs_item_ops xfs_attri_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_attri_item_size,
 	.iop_format	= xfs_attri_item_format,
 	.iop_unpin	= xfs_attri_item_unpin,
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 51ba8ee368ca..8de644a343b5 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -202,7 +202,8 @@ xfs_bud_item_release(
 }
 
 static const struct xfs_item_ops xfs_bud_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_bud_item_size,
 	.iop_format	= xfs_bud_item_format,
 	.iop_release	= xfs_bud_item_release,
@@ -584,6 +585,7 @@ xfs_bui_item_relog(
 }
 
 static const struct xfs_item_ops xfs_bui_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_bui_item_size,
 	.iop_format	= xfs_bui_item_format,
 	.iop_unpin	= xfs_bui_item_unpin,
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 046f21338c48..952a46477907 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -307,7 +307,8 @@ xfs_efd_item_release(
 }
 
 static const struct xfs_item_ops xfs_efd_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_efd_item_size,
 	.iop_format	= xfs_efd_item_format,
 	.iop_release	= xfs_efd_item_release,
@@ -681,6 +682,7 @@ xfs_efi_item_relog(
 }
 
 static const struct xfs_item_ops xfs_efi_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_efi_item_size,
 	.iop_format	= xfs_efi_item_format,
 	.iop_unpin	= xfs_efi_item_unpin,
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index a6e7351ca4f9..38b38a734fd6 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -208,7 +208,8 @@ xfs_cud_item_release(
 }
 
 static const struct xfs_item_ops xfs_cud_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_cud_item_size,
 	.iop_format	= xfs_cud_item_format,
 	.iop_release	= xfs_cud_item_release,
@@ -600,6 +601,7 @@ xfs_cui_item_relog(
 }
 
 static const struct xfs_item_ops xfs_cui_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_cui_item_size,
 	.iop_format	= xfs_cui_item_format,
 	.iop_unpin	= xfs_cui_item_unpin,
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 8c70a4af80a9..1b3655090113 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -231,7 +231,8 @@ xfs_rud_item_release(
 }
 
 static const struct xfs_item_ops xfs_rud_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_rud_item_size,
 	.iop_format	= xfs_rud_item_format,
 	.iop_release	= xfs_rud_item_release,
@@ -630,6 +631,7 @@ xfs_rui_item_relog(
 }
 
 static const struct xfs_item_ops xfs_rui_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_rui_item_size,
 	.iop_format	= xfs_rui_item_format,
 	.iop_unpin	= xfs_rui_item_unpin,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 2d1cc1ff93c7..ab6e0bc1df1a 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -80,28 +80,29 @@ struct xfs_item_ops {
 			struct xfs_trans *tp);
 };
 
-/* Is this log item a deferred action intent? */
+/*
+ * Log item ops flags
+ */
+/*
+ * Release the log item when the journal commits instead of inserting into the
+ * AIL for writeback tracking and/or log tail pinning.
+ */
+#define XFS_ITEM_RELEASE_WHEN_COMMITTED	(1 << 0)
+#define XFS_ITEM_INTENT			(1 << 1)
+#define XFS_ITEM_INTENT_DONE		(1 << 2)
+
 static inline bool
 xlog_item_is_intent(struct xfs_log_item *lip)
 {
-	return lip->li_ops->iop_recover != NULL &&
-	       lip->li_ops->iop_match != NULL;
+	return lip->li_ops->flags & XFS_ITEM_INTENT;
 }
 
-/* Is this a log intent-done item? */
 static inline bool
 xlog_item_is_intent_done(struct xfs_log_item *lip)
 {
-	return lip->li_ops->iop_unpin == NULL &&
-	       lip->li_ops->iop_push == NULL;
+	return lip->li_ops->flags & XFS_ITEM_INTENT_DONE;
 }
 
-/*
- * Release the log item as soon as committed.  This is for items just logging
- * intents that never need to be written back in place.
- */
-#define XFS_ITEM_RELEASE_WHEN_COMMITTED	(1 << 0)
-
 void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 			  int type, const struct xfs_item_ops *ops);
 
-- 
2.31.1



* [PATCH 2/7] xfs: tag transactions that contain intent done items
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
  2021-09-02  9:59 ` [PATCH 1/7] xfs: add log item flags to indicate intents Dave Chinner
@ 2021-09-02  9:59 ` Dave Chinner
  2021-09-03 21:09   ` Allison Henderson
  2021-09-02  9:59 ` [PATCH 3/7] xfs: factor and move some code in xfs_log_cil.c Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Intent whiteouts will require extra work to be done during
transaction commit if the transaction contains an intent done item.

To determine if a transaction contains an intent done item, we want
to avoid having to walk all the items in the transaction to check if
they are intent done items. Hence when we add an intent done item to
a transaction, tag the transaction to indicate that it contains such
an item.

We don't tag the transaction when deferred ops are relogging an
intent to move it forward in the log. Whiteouts will never apply to
these cases, so we don't need to bother looking for them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_shared.h | 24 +++++++++++++++++-------
 fs/xfs/xfs_attr_item.c     |  4 +++-
 fs/xfs/xfs_bmap_item.c     |  2 +-
 fs/xfs/xfs_extfree_item.c  |  2 +-
 fs/xfs/xfs_refcount_item.c |  2 +-
 fs/xfs/xfs_rmap_item.c     |  2 +-
 6 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 25c4cab58851..e96618dbde29 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -54,13 +54,23 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 /*
  * Values for t_flags.
  */
-#define	XFS_TRANS_DIRTY		0x01	/* something needs to be logged */
-#define	XFS_TRANS_SB_DIRTY	0x02	/* superblock is modified */
-#define	XFS_TRANS_PERM_LOG_RES	0x04	/* xact took a permanent log res */
-#define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
-#define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
-#define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
-#define XFS_TRANS_RES_FDBLKS	0x80	/* reserve newly freed blocks */
+/* Transaction needs to be logged */
+#define XFS_TRANS_DIRTY			(1 << 0)
+/* Superblock is dirty and needs to be logged */
+#define XFS_TRANS_SB_DIRTY		(1 << 1)
+/* Transaction took a permanent log reservation */
+#define XFS_TRANS_PERM_LOG_RES		(1 << 2)
+/* Synchronous transaction commit needed */
+#define XFS_TRANS_SYNC			(1 << 3)
+/* Transaction can use reserve block pool */
+#define XFS_TRANS_RESERVE		(1 << 4)
+/* Transaction should avoid VFS level superblock write accounting */
+#define XFS_TRANS_NO_WRITECOUNT		(1 << 5)
+/* Transaction has freed blocks returned to its reservation */
+#define XFS_TRANS_RES_FDBLKS		(1 << 6)
+/* Transaction contains an intent done log item */
+#define XFS_TRANS_HAS_INTENT_DONE	(1 << 7)
+
 /*
  * LOWMODE is used by the allocator to activate the lowspace algorithm - when
  * free space is running low the extent allocator may choose to allocate an
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index f900001e8f3a..572edb7fb2cd 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -311,8 +311,10 @@ xfs_trans_attr_finish_update(
 	/*
 	 * attr intent/done items are null when delayed attributes are disabled
 	 */
-	if (attrdp)
+	if (attrdp) {
 		set_bit(XFS_LI_DIRTY, &attrdp->attrd_item.li_flags);
+		args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
+	}
 
 	return error;
 }
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 8de644a343b5..5244d85b1ba4 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -255,7 +255,7 @@ xfs_trans_log_finish_bmap_update(
 	 * 1.) releases the BUI and frees the BUD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &budp->bud_item.li_flags);
 
 	return error;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 952a46477907..f689530aaa75 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -381,7 +381,7 @@ xfs_trans_free_extent(
 	 * 1.) releases the EFI and frees the EFD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
 
 	next_extent = efdp->efd_next_extent;
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 38b38a734fd6..b426e98d7f4f 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -260,7 +260,7 @@ xfs_trans_log_finish_refcount_update(
 	 * 1.) releases the CUI and frees the CUD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &cudp->cud_item.li_flags);
 
 	return error;
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 1b3655090113..df3e61c1bf69 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -328,7 +328,7 @@ xfs_trans_log_finish_rmap_update(
 	 * 1.) releases the RUI and frees the RUD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &rudp->rud_item.li_flags);
 
 	return error;
-- 
2.31.1



* [PATCH 3/7] xfs: factor and move some code in xfs_log_cil.c
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
  2021-09-02  9:59 ` [PATCH 1/7] xfs: add log item flags to indicate intents Dave Chinner
  2021-09-02  9:59 ` [PATCH 2/7] xfs: tag transactions that contain intent done items Dave Chinner
@ 2021-09-02  9:59 ` Dave Chinner
  2021-09-03 21:09   ` Allison Henderson
  2021-09-02  9:59 ` [PATCH 4/7] xfs: add log item method to return related intents Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In preparation for adding support for intent item whiteouts.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 109 +++++++++++++++++++++++++------------------
 1 file changed, 64 insertions(+), 45 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 9488db6c6b21..bd2c8178255e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -58,6 +58,38 @@ xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil)
 			(log->l_iclog_size - log->l_iclog_hsize)));
 }
 
+/*
+ * Check if the current log item was first committed in this sequence.
+ * We can't rely on just the log item being in the CIL, we have to check
+ * the recorded commit sequence number.
+ *
+ * Note: for this to be used in a non-racy manner, it has to be called with
+ * CIL flushing locked out. As a result, it should only be used during the
+ * transaction commit process when deciding what to format into the item.
+ */
+static bool
+xlog_item_in_current_chkpt(
+	struct xfs_cil		*cil,
+	struct xfs_log_item	*lip)
+{
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
+		return false;
+
+	/*
+	 * li_seq is written on the first commit of a log item to record the
+	 * first checkpoint it is written to. Hence if it is different to the
+	 * current sequence, we're in a new checkpoint.
+	 */
+	return lip->li_seq == cil->xc_ctx->sequence;
+}
+
+bool
+xfs_log_item_in_current_chkpt(
+	struct xfs_log_item *lip)
+{
+	return xlog_item_in_current_chkpt(lip->li_mountp->m_log->l_cilp, lip);
+}
+
 /*
  * Unavoidable forward declaration - xlog_cil_push_work() calls
  * xlog_cil_ctx_alloc() itself.
@@ -995,6 +1027,37 @@ xlog_cil_order_cmp(
 	return l1->lv_order_id > l2->lv_order_id;
 }
 
+/*
+ * Build a log vector chain from the current CIL.
+ */
+static void
+xlog_cil_build_lv_chain(
+	struct xfs_cil_ctx	*ctx,
+	uint32_t		*num_iovecs,
+	uint32_t		*num_bytes)
+{
+
+	while (!list_empty(&ctx->log_items)) {
+		struct xfs_log_item	*item;
+		struct xfs_log_vec	*lv;
+
+		item = list_first_entry(&ctx->log_items,
+					struct xfs_log_item, li_cil);
+
+		lv = item->li_lv;
+		lv->lv_order_id = item->li_order_id;
+		*num_iovecs += lv->lv_niovecs;
+		/* we don't write ordered log vectors */
+		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
+			*num_bytes += lv->lv_bytes;
+
+		list_add_tail(&lv->lv_list, &ctx->lv_chain);
+		list_del_init(&item->li_cil);
+		item->li_order_id = 0;
+		item->li_lv = NULL;
+	}
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -1017,7 +1080,6 @@ xlog_cil_push_work(
 		container_of(work, struct xfs_cil_ctx, push_work);
 	struct xfs_cil		*cil = ctx->cil;
 	struct xlog		*log = cil->xc_log;
-	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*new_ctx;
 	int			num_iovecs = 0;
 	int			num_bytes = 0;
@@ -1116,24 +1178,7 @@ xlog_cil_push_work(
 				&bdev_flush);
 
 	xlog_cil_pcp_aggregate(cil, ctx);
-
-	while (!list_empty(&ctx->log_items)) {
-		struct xfs_log_item	*item;
-
-		item = list_first_entry(&ctx->log_items,
-					struct xfs_log_item, li_cil);
-		lv = item->li_lv;
-		lv->lv_order_id = item->li_order_id;
-		num_iovecs += lv->lv_niovecs;
-		/* we don't write ordered log vectors */
-		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
-			num_bytes += lv->lv_bytes;
-
-		list_add_tail(&lv->lv_list, &ctx->lv_chain);
-		list_del_init(&item->li_cil);
-		item->li_order_id = 0;
-		item->li_lv = NULL;
-	}
+	xlog_cil_build_lv_chain(ctx, &num_iovecs, &num_bytes);
 
 	/*
 	 * Switch the contexts so we can drop the context lock and move out
@@ -1612,32 +1657,6 @@ xlog_cil_force_seq(
 	return 0;
 }
 
-/*
- * Check if the current log item was first committed in this sequence.
- * We can't rely on just the log item being in the CIL, we have to check
- * the recorded commit sequence number.
- *
- * Note: for this to be used in a non-racy manner, it has to be called with
- * CIL flushing locked out. As a result, it should only be used during the
- * transaction commit process when deciding what to format into the item.
- */
-bool
-xfs_log_item_in_current_chkpt(
-	struct xfs_log_item *lip)
-{
-	struct xfs_cil		*cil = lip->li_mountp->m_log->l_cilp;
-
-	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
-		return false;
-
-	/*
-	 * li_seq is written on the first commit of a log item to record the
-	 * first checkpoint it is written to. Hence if it is different to the
-	 * current sequence, we're in a new checkpoint.
-	 */
-	return lip->li_seq == cil->xc_ctx->sequence;
-}
-
 /*
  * Move dead percpu state to the relevant CIL context structures.
  *
-- 
2.31.1



* [PATCH 4/7] xfs: add log item method to return related intents
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
                   ` (2 preceding siblings ...)
  2021-09-02  9:59 ` [PATCH 3/7] xfs: factor and move some code in xfs_log_cil.c Dave Chinner
@ 2021-09-02  9:59 ` Dave Chinner
  2021-09-03 21:09   ` Allison Henderson
  2021-09-02  9:59 ` [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To apply a whiteout to an intent item when an intent done item is
committed, we need to be able to retrieve the intent item from the
the intent done item. Add a log item op method for doing this, and
wire all the intent done items up to it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_attr_item.c     | 8 ++++++++
 fs/xfs/xfs_bmap_item.c     | 8 ++++++++
 fs/xfs/xfs_extfree_item.c  | 8 ++++++++
 fs/xfs/xfs_refcount_item.c | 8 ++++++++
 fs/xfs/xfs_rmap_item.c     | 8 ++++++++
 fs/xfs/xfs_trans.h         | 1 +
 6 files changed, 41 insertions(+)

diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 572edb7fb2cd..86c8d5d08176 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -480,12 +480,20 @@ xfs_trans_get_attrd(struct xfs_trans		*tp,
 	return attrdp;
 }
 
+static struct xfs_log_item *
+xfs_attrd_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &ATTRD_ITEM(lip)->attrd_attrip->attri_item;
+}
+
 static const struct xfs_item_ops xfs_attrd_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_attrd_item_size,
 	.iop_format	= xfs_attrd_item_format,
 	.iop_release    = xfs_attrd_item_release,
+	.iop_intent	= xfs_attrd_item_intent,
 };
 
 
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 5244d85b1ba4..0b06159cfd1b 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -201,12 +201,20 @@ xfs_bud_item_release(
 	kmem_cache_free(xfs_bud_zone, budp);
 }
 
+static struct xfs_log_item *
+xfs_bud_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &BUD_ITEM(lip)->bud_buip->bui_item;
+}
+
 static const struct xfs_item_ops xfs_bud_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_bud_item_size,
 	.iop_format	= xfs_bud_item_format,
 	.iop_release	= xfs_bud_item_release,
+	.iop_intent	= xfs_bud_item_intent,
 };
 
 static struct xfs_bud_log_item *
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f689530aaa75..87cba4a71883 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -306,12 +306,20 @@ xfs_efd_item_release(
 	xfs_efd_item_free(efdp);
 }
 
+static struct xfs_log_item *
+xfs_efd_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &EFD_ITEM(lip)->efd_efip->efi_item;
+}
+
 static const struct xfs_item_ops xfs_efd_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_efd_item_size,
 	.iop_format	= xfs_efd_item_format,
 	.iop_release	= xfs_efd_item_release,
+	.iop_intent	= xfs_efd_item_intent,
 };
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index b426e98d7f4f..de739884e857 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -207,12 +207,20 @@ xfs_cud_item_release(
 	kmem_cache_free(xfs_cud_zone, cudp);
 }
 
+static struct xfs_log_item *
+xfs_cud_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &CUD_ITEM(lip)->cud_cuip->cui_item;
+}
+
 static const struct xfs_item_ops xfs_cud_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_cud_item_size,
 	.iop_format	= xfs_cud_item_format,
 	.iop_release	= xfs_cud_item_release,
+	.iop_intent	= xfs_cud_item_intent,
 };
 
 static struct xfs_cud_log_item *
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index df3e61c1bf69..8d57529d9ddd 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -230,12 +230,20 @@ xfs_rud_item_release(
 	kmem_cache_free(xfs_rud_zone, rudp);
 }
 
+static struct xfs_log_item *
+xfs_rud_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &RUD_ITEM(lip)->rud_ruip->rui_item;
+}
+
 static const struct xfs_item_ops xfs_rud_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_rud_item_size,
 	.iop_format	= xfs_rud_item_format,
 	.iop_release	= xfs_rud_item_release,
+	.iop_intent	= xfs_rud_item_intent,
 };
 
 static struct xfs_rud_log_item *
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index ab6e0bc1df1a..a6d7b3309bd7 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -78,6 +78,7 @@ struct xfs_item_ops {
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 	struct xfs_log_item *(*iop_relog)(struct xfs_log_item *intent,
 			struct xfs_trans *tp);
+	struct xfs_log_item *(*iop_intent)(struct xfs_log_item *intent_done);
 };
 
 /*
-- 
2.31.1



* [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
                   ` (3 preceding siblings ...)
  2021-09-02  9:59 ` [PATCH 4/7] xfs: add log item method to return related intents Dave Chinner
@ 2021-09-02  9:59 ` Dave Chinner
  2021-09-03 21:09   ` Allison Henderson
  2021-09-02  9:59 ` [PATCH 6/7] [RFC] xfs: intent item whiteouts Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we release an intent that a whiteout applies to, it will not
have been committed to the journal and so won't be in the AIL. Hence
when we drop the last reference to the intent, we do not want to try
to remove it from the AIL as that will trigger a filesystem
shutdown. Hence make the removal of intents from the AIL conditional
on them actually being in the AIL so we do the correct thing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_attr_item.c     | 11 ++++++-----
 fs/xfs/xfs_bmap_item.c     |  8 +++++---
 fs/xfs/xfs_extfree_item.c  |  8 +++++---
 fs/xfs/xfs_refcount_item.c |  8 +++++---
 fs/xfs/xfs_rmap_item.c     |  8 +++++---
 5 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 86c8d5d08176..11546967a5d7 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -67,11 +67,12 @@ xfs_attri_release(
 	struct xfs_attri_log_item	*attrip)
 {
 	ASSERT(atomic_read(&attrip->attri_refcount) > 0);
-	if (atomic_dec_and_test(&attrip->attri_refcount)) {
-		xfs_trans_ail_delete(&attrip->attri_item,
-				     SHUTDOWN_LOG_IO_ERROR);
-		xfs_attri_item_free(attrip);
-	}
+	if (!atomic_dec_and_test(&attrip->attri_refcount))
+		return;
+
+	if (test_bit(XFS_LI_IN_AIL, &attrip->attri_item.li_flags))
+		xfs_trans_ail_delete(&attrip->attri_item, SHUTDOWN_LOG_IO_ERROR);
+	xfs_attri_item_free(attrip);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 0b06159cfd1b..7cabb59138b1 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -54,10 +54,12 @@ xfs_bui_release(
 	struct xfs_bui_log_item	*buip)
 {
 	ASSERT(atomic_read(&buip->bui_refcount) > 0);
-	if (atomic_dec_and_test(&buip->bui_refcount)) {
+	if (!atomic_dec_and_test(&buip->bui_refcount))
+		return;
+
+	if (test_bit(XFS_LI_IN_AIL, &buip->bui_item.li_flags))
 		xfs_trans_ail_delete(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_bui_item_free(buip);
-	}
+	xfs_bui_item_free(buip);
 }
 
 
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 87cba4a71883..7032125fe987 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -58,10 +58,12 @@ xfs_efi_release(
 	struct xfs_efi_log_item	*efip)
 {
 	ASSERT(atomic_read(&efip->efi_refcount) > 0);
-	if (atomic_dec_and_test(&efip->efi_refcount)) {
+	if (!atomic_dec_and_test(&efip->efi_refcount))
+		return;
+
+	if (test_bit(XFS_LI_IN_AIL, &efip->efi_item.li_flags))
 		xfs_trans_ail_delete(&efip->efi_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_efi_item_free(efip);
-	}
+	xfs_efi_item_free(efip);
 }
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index de739884e857..f62dc5b7af88 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -53,10 +53,12 @@ xfs_cui_release(
 	struct xfs_cui_log_item	*cuip)
 {
 	ASSERT(atomic_read(&cuip->cui_refcount) > 0);
-	if (atomic_dec_and_test(&cuip->cui_refcount)) {
+	if (!atomic_dec_and_test(&cuip->cui_refcount))
+		return;
+
+	if (test_bit(XFS_LI_IN_AIL, &cuip->cui_item.li_flags))
 		xfs_trans_ail_delete(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_cui_item_free(cuip);
-	}
+	xfs_cui_item_free(cuip);
 }
 
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 8d57529d9ddd..0c67abcd189b 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -53,10 +53,12 @@ xfs_rui_release(
 	struct xfs_rui_log_item	*ruip)
 {
 	ASSERT(atomic_read(&ruip->rui_refcount) > 0);
-	if (atomic_dec_and_test(&ruip->rui_refcount)) {
+	if (!atomic_dec_and_test(&ruip->rui_refcount))
+		return;
+
+	if (test_bit(XFS_LI_IN_AIL, &ruip->rui_item.li_flags))
 		xfs_trans_ail_delete(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_rui_item_free(ruip);
-	}
+	xfs_rui_item_free(ruip);
 }
 
 STATIC void
-- 
2.31.1



* [PATCH 6/7] [RFC] xfs: intent item whiteouts
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
                   ` (4 preceding siblings ...)
  2021-09-02  9:59 ` [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL Dave Chinner
@ 2021-09-02  9:59 ` Dave Chinner
  2021-09-03 21:09   ` Allison Henderson
  2021-09-02  9:59 ` [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers Dave Chinner
  2021-09-09 11:37 ` [RFC PATCH 0/7] xfs: intent item whiteouts Christoph Hellwig
  7 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we log modifications based on intents, we add both intent
and intent done items to the modification being made. These get
written to the log to ensure that the operation is re-run if the
intent done is not found in the log.

However, for operations that complete wholly within a single
checkpoint, the change in the checkpoint is atomic and will never
need replay. In this case, we don't need to actually write the
intent and intent done items to the journal because log recovery
will never need to manually restart this modification.

Log recovery currently handles intent/intent done matching by
inserting the intent into the AIL, then removing it when a matching
intent done item is found. Hence for all the intent-based operations
that complete within a checkpoint, we spend all that time parsing
the intent/intent done items just to cancel them and do nothing with
them.

Hence it follows that the only time we actually need intents in the
log is when the modification crosses checkpoint boundaries in the
log and so may only be partially complete in the journal. Hence if
we commit an intent done item to the CIL and the intent item is in
the same checkpoint, we don't actually have to write them to the
journal because log recovery will always cancel the intents.

We've never really worried about the overhead of logging intents
unnecessarily like this because the intents we log are generally
very much smaller than the change being made. e.g. freeing an extent
involves modifying at least two freespace btree blocks and the AGF,
so the EFI/EFD overhead is only a small increase in space and
processing time compared to the overall cost of freeing an extent.

However, delayed attributes change this cost equation dramatically,
especially for inline attributes. In the case of adding an inline
attribute, we only log the inode core and attribute fork at present.
With delayed attributes, we now log the attr intent which includes
the name and value, the inode core and attr fork, and finally the
attr intent done item. We increase the number of items we log from 1
to 3, and the number of log vectors (regions) goes up from 3 to 7.
Hence we triple the number of objects that the CIL has to process,
and more than double the number of log vectors that need to be
written to the journal.

At scale, this means delayed attributes cause a non-pipelined CIL to
become CPU bound processing all the extra items, resulting in a > 40%
performance degradation on 16-way file+xattr create workloads.
Pipelining the CIL (as per 5.15) reduces the performance degradation
to 20%, but now the limitation is the rate at which the log items
can be written to the iclogs and iclogs be dispatched for IO and
completed.

Even log IO completion is slowed down by these intents, because it
now has to process 3x the number of items in the checkpoint.
Processing completed intents is especially inefficient here, because
we first insert the intent into the AIL, then remove it from the AIL
when the intent done is processed. IOWs, we are also doing expensive
operations in log IO completion we could completely avoid if we
didn't log completed intent/intent done pairs.

Enter log item whiteouts.

When an intent done is committed, we can check to see if the
associated intent is in the same checkpoint as we are currently
committing the intent done to. If so, we can mark the intent log
item with a whiteout and immediately free the intent done item
rather than committing it to the CIL. We can basically skip the
entire formatting and CIL insertion steps for the intent done item.

However, we cannot remove the intent item from the CIL at this point
because the unlocked per-cpu CIL item lists do not permit removal
without holding the CIL context lock exclusively. Transaction commit
only holds the context lock shared, hence the best we can do is mark
the intent item with a whiteout so that the CIL push can release it
rather than writing it to the log.

This means we never write the intent to the log if the intent done
has also been committed to the same checkpoint, but we'll always
write the intent if the intent done has not been committed or has
been committed to a different checkpoint. This will result in
correct log recovery behaviour in all cases, without the overhead of
logging unnecessary intents.

This intent whiteout concept is generic - we can apply it to all
intent/intent done pairs that have a direct 1:1 relationship. The
way deferred ops iterate and relog intents means that all intents
currently have a 1:1 relationship with their intent done item, and hence
we can apply this cancellation to all existing intent/intent done
implementations.

For delayed attributes with a 16-way 64kB xattr create workload,
whiteouts reduce the amount of journalled metadata from ~2.5GB/s
down to ~600MB/s and improve the creation rate from 9000/s to
14000/s.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 78 +++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_trace.h   |  3 ++
 fs/xfs/xfs_trans.h   |  7 ++--
 3 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index bd2c8178255e..fff68aad254e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -502,7 +502,8 @@ xlog_cil_insert_format_items(
 static void
 xlog_cil_insert_items(
 	struct xlog		*log,
-	struct xfs_trans	*tp)
+	struct xfs_trans	*tp,
+	uint32_t		released_space)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
@@ -569,6 +570,7 @@ xlog_cil_insert_items(
 	 */
 	cilpcp = get_cpu_ptr(cil->xc_pcp);
 	cilpcp->space_reserved += ctx_res;
+	cilpcp->space_used -= released_space;
 	cilpcp->space_used += len;
 	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
 	    cilpcp->space_used >
@@ -1028,11 +1030,14 @@ xlog_cil_order_cmp(
 }
 
 /*
- * Build a log vector chain from the current CIL.
+ * Build a log vector chain from the current CIL. If a log item is marked with a
+ * whiteout, we do not need to write it to the journal and so we just move it
+ * to the whiteout list for the caller to dispose of appropriately.
  */
 static void
 xlog_cil_build_lv_chain(
 	struct xfs_cil_ctx	*ctx,
+	struct list_head	*whiteouts,
 	uint32_t		*num_iovecs,
 	uint32_t		*num_bytes)
 {
@@ -1043,6 +1048,11 @@ xlog_cil_build_lv_chain(
 
 		item = list_first_entry(&ctx->log_items,
 					struct xfs_log_item, li_cil);
+		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
+			list_move(&item->li_cil, whiteouts);
+			trace_xfs_cil_whiteout_skip(item);
+			continue;
+		}
 
 		lv = item->li_lv;
 		lv->lv_order_id = item->li_order_id;
@@ -1092,6 +1102,7 @@ xlog_cil_push_work(
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			push_commit_stable;
 	struct xlog_ticket	*ticket;
+	LIST_HEAD		(whiteouts);
 
 	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -1178,7 +1189,7 @@ xlog_cil_push_work(
 				&bdev_flush);
 
 	xlog_cil_pcp_aggregate(cil, ctx);
-	xlog_cil_build_lv_chain(ctx, &num_iovecs, &num_bytes);
+	xlog_cil_build_lv_chain(ctx, &whiteouts, &num_iovecs, &num_bytes);
 
 	/*
 	 * Switch the contexts so we can drop the context lock and move out
@@ -1312,6 +1323,15 @@ xlog_cil_push_work(
 	/* Not safe to reference ctx now! */
 
 	spin_unlock(&log->l_icloglock);
+
+	/* clean up log items that had whiteouts */
+	while (!list_empty(&whiteouts)) {
+		struct xfs_log_item *item = list_first_entry(&whiteouts,
+						struct xfs_log_item, li_cil);
+		list_del_init(&item->li_cil);
+		trace_xfs_cil_whiteout_unpin(item);
+		item->li_ops->iop_unpin(item, 1);
+	}
 	xfs_log_ticket_ungrant(log, ticket);
 	return;
 
@@ -1323,6 +1343,14 @@ xlog_cil_push_work(
 
 out_abort_free_ticket:
 	ASSERT(xlog_is_shutdown(log));
+	while (!list_empty(&whiteouts)) {
+		struct xfs_log_item *item = list_first_entry(&whiteouts,
+						struct xfs_log_item, li_cil);
+		list_del_init(&item->li_cil);
+		trace_xfs_cil_whiteout_unpin(item);
+		item->li_ops->iop_unpin(item, 1);
+	}
+
 	if (!ctx->commit_iclog) {
 		xfs_log_ticket_ungrant(log, ctx->ticket);
 		xlog_cil_committed(ctx);
@@ -1475,6 +1503,43 @@ xlog_cil_empty(
 	return empty;
 }
 
+/*
+ * If there are intent done items in this transaction and the related intent was
+ * committed in the current (same) CIL checkpoint, we don't need to write either
+ * the intent or intent done item to the journal as the change will be
+ * journalled atomically within this checkpoint. As we cannot remove items from
+ * the CIL here, mark the related intent with a whiteout so that the CIL push
+ * can remove it rather than writing it to the journal. Then remove the intent
+ * done item from the current transaction and release it so it doesn't get put
+ * into the CIL at all.
+ */
+static uint32_t
+xlog_cil_process_intents(
+	struct xfs_cil		*cil,
+	struct xfs_trans	*tp)
+{
+	struct xfs_log_item	*lip, *ilip, *next;
+	uint32_t		len = 0;
+
+	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
+		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
+			continue;
+
+		ilip = lip->li_ops->iop_intent(lip);
+		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
+			continue;
+		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
+		trace_xfs_cil_whiteout_mark(ilip);
+		len += ilip->li_lv->lv_bytes;
+		kmem_free(ilip->li_lv);
+		ilip->li_lv = NULL;
+
+		xfs_trans_del_item(lip);
+		lip->li_ops->iop_release(lip);
+	}
+	return len;
+}
+
 /*
  * Commit a transaction with the given vector to the Committed Item List.
  *
@@ -1497,6 +1562,7 @@ xlog_cil_commit(
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_log_item	*lip, *next;
+	uint32_t		released_space = 0;
 
 	/*
 	 * Do all necessary memory allocation before we lock the CIL.
@@ -1508,7 +1574,11 @@ xlog_cil_commit(
 	/* lock out background commit */
 	down_read(&cil->xc_ctx_lock);
 
-	xlog_cil_insert_items(log, tp);
+	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
+		released_space = xlog_cil_process_intents(cil, tp);
+
+	xlog_cil_insert_items(log, tp, released_space);
+	tp->t_ticket->t_curr_res += released_space;
 
 	if (regrant && !xlog_is_shutdown(log))
 		xfs_log_ticket_regrant(log, tp->t_ticket);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 77a78b5b1a29..d4f5a1410879 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1348,6 +1348,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
 
 DECLARE_EVENT_CLASS(xfs_ail_class,
 	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index a6d7b3309bd7..5765ca6e2c84 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -55,13 +55,16 @@ struct xfs_log_item {
 #define	XFS_LI_IN_AIL	0
 #define	XFS_LI_ABORTED	1
 #define	XFS_LI_FAILED	2
-#define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
+#define	XFS_LI_DIRTY	3
+#define	XFS_LI_WHITEOUT	4
 
 #define XFS_LI_FLAGS \
 	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
 	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
 	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
-	{ (1 << XFS_LI_DIRTY),		"DIRTY" }
+	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
+	{ (1 << XFS_LI_WHITEOUT),	"WHITEOUT" }
+
 
 struct xfs_item_ops {
 	unsigned flags;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
                   ` (5 preceding siblings ...)
  2021-09-02  9:59 ` [PATCH 6/7] [RFC] xfs: intent item whiteouts Dave Chinner
@ 2021-09-02  9:59 ` Dave Chinner
  2021-09-03 21:55   ` Allison Henderson
  2021-09-09 11:37 ` [RFC PATCH 0/7] xfs: intent item whiteouts Christoph Hellwig
  7 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-09-02  9:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Oh, let me count the ways that the kvmalloc API sucks dog eggs.

The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:

     - 49.73% xlog_cil_commit
	 - 31.62% kvmalloc_node
	    - 29.96% __kmalloc_node
	       - 29.38% kmalloc_large_node
		  - 29.33% __alloc_pages
		     - 24.33% __alloc_pages_slowpath.constprop.0
			- 18.35% __alloc_pages_direct_compact
			   - 17.39% try_to_compact_pages
			      - compact_zone_order
				 - 15.26% compact_zone
				      5.29% __pageblock_pfn_to_page
				      3.71% PageHuge
				    - 1.44% isolate_migratepages_block
					 0.71% set_pfnblock_flags_mask
				   1.11% get_pfnblock_flags_mask
			   - 0.81% get_page_from_freelist
			      - 0.59% _raw_spin_lock_irqsave
				 - do_raw_spin_lock
				      __pv_queued_spin_lock_slowpath
			- 3.24% try_to_free_pages
			   - 3.14% shrink_node
			      - 2.94% shrink_slab.constprop.0
				 - 0.89% super_cache_count
				    - 0.66% xfs_fs_nr_cached_objects
				       - 0.65% xfs_reclaim_inodes_count
					    0.55% xfs_perag_get_tag
				   0.58% kfree_rcu_shrink_count
			- 2.09% get_page_from_freelist
			   - 1.03% _raw_spin_lock_irqsave
			      - do_raw_spin_lock
				   __pv_queued_spin_lock_slowpath
		     - 4.88% get_page_from_freelist
			- 3.66% _raw_spin_lock_irqsave
			   - do_raw_spin_lock
				__pv_queued_spin_lock_slowpath
	    - 1.63% __vmalloc_node
	       - __vmalloc_node_range
		  - 1.10% __alloc_pages_bulk
		     - 0.93% __alloc_pages
			- 0.92% get_page_from_freelist
			   - 0.89% rmqueue_bulk
			      - 0.69% _raw_spin_lock
				 - do_raw_spin_lock
				      __pv_queued_spin_lock_slowpath
	   13.73% memcpy_erms
	 - 2.22% kvfree

On this workload, that's almost a dozen CPUs all trying to compact
and reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page
and shrinker reclaim that direct reclaim is doing. Copying all the
metadata is taking far less CPU time than allocating the storage!

Direct reclaim should be considered extremely harmful.

This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using
kvmalloc to allow memory allocation to avoid doing lots of work to
try to do contiguous allocations.

Except it still does *lots of costly work* that is unnecessary.

Worse: the only way to avoid the slowpath page allocation trying to
do compaction on costly allocations is to turn off direct reclaim
(i.e. remove __GFP_DIRECT_RECLAIM from the gfp flags).

Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This
cuts off the vmalloc fallback, and this leads to almost instant OOM
problems which ends up in filesystems deadlocks, shutdowns and/or
kernel crashes.

I want some basic kvmalloc behaviour:

- kmalloc for a contiguous range with fail fast semantics - no
  compaction or direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

The really, really stupid part about this is these kvmalloc() calls
are run under memalloc_nofs task context, so all the allocations are
always reduced to GFP_NOFS regardless of the fact that kvmalloc
requires GFP_KERNEL to be passed in. IOWs, we're already telling
kvmalloc to behave differently to the gfp flags we pass in, but it
still won't allow vmalloc to be run with anything other than
GFP_KERNEL.

So, this patch open codes the kvmalloc() in the commit path to have
the above described behaviour. The result is we more than halve the
CPU time spent doing kvmalloc() in this path, and the rate of transaction
commits with 64kB objects in them more than doubles. i.e. we get ~5x
reduction in CPU usage per costly-sized kvmalloc() invocation and
the profile looks like this:

  - 37.60% xlog_cil_commit
	16.01% memcpy_erms
      - 8.45% __kmalloc
	 - 8.04% kmalloc_order_trace
	    - 8.03% kmalloc_order
	       - 7.93% alloc_pages
		  - 7.90% __alloc_pages
		     - 4.05% __alloc_pages_slowpath.constprop.0
			- 2.18% get_page_from_freelist
			- 1.77% wake_all_kswapds
....
				    - __wake_up_common_lock
				       - 0.94% _raw_spin_lock_irqsave
		     - 3.72% get_page_from_freelist
			- 2.43% _raw_spin_lock_irqsave
      - 5.72% vmalloc
	 - 5.72% __vmalloc_node_range
	    - 4.81% __get_vm_area_node.constprop.0
	       - 3.26% alloc_vmap_area
		  - 2.52% _raw_spin_lock
	       - 1.46% _raw_spin_lock
	      0.56% __alloc_pages_bulk
      - 4.66% kvfree
	 - 3.25% vfree
	    - __vfree
	       - 3.23% __vunmap
		  - 1.95% remove_vm_area
		     - 1.06% free_vmap_area_noflush
			- 0.82% _raw_spin_lock
		     - 0.68% _raw_spin_lock
		  - 0.92% _raw_spin_lock
	 - 1.40% kfree
	    - 1.36% __free_pages
	       - 1.35% __free_pages_ok
		  - 1.02% _raw_spin_lock_irqsave

It's worth noting that over 50% of the CPU time spent allocating
these shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct
reclaim from kmalloc, and could probably be made even less costly if
vmalloc() didn't use global spinlocks to protect its structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 46 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index fff68aad254e..81ebf03bfa5c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -185,6 +185,39 @@ xlog_cil_iovec_space(
 			sizeof(uint64_t));
 }
 
+/*
+ * shadow buffers can be large, so we need to use kvmalloc() here to ensure
+ * success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts to fall
+ * back to vmalloc, so we can't actually do anything useful with gfp flags to
+ * control the kmalloc() behaviour within kvmalloc(). Hence kmalloc() will do
+ * direct reclaim and compaction in the slow path, both of which are
+ * horrendously expensive. We just want kmalloc to fail fast and fall back to
+ * vmalloc if it can't get something straight away from the free lists or buddy
+ * allocator. Hence we have to open code kvmalloc ourselves here.
+ *
+ * Also, we are in memalloc_nofs_save task context here, so despite the use of
+ * GFP_KERNEL here, we are actually going to be doing GFP_NOFS allocations. This
+ * is actually the only way to make vmalloc() do GFP_NOFS allocations, so let's
+ * just all pretend this is a GFP_KERNEL context operation....
+ */
+static inline void *
+xlog_cil_kvmalloc(
+	size_t		size)
+{
+	gfp_t		flags = GFP_KERNEL;
+	void		*p;
+
+	flags &= ~__GFP_DIRECT_RECLAIM;
+	flags |= __GFP_NOWARN | __GFP_NORETRY;
+	do {
+		p = kmalloc(size, flags);
+		if (!p)
+			p = vmalloc(size);
+	} while (!p);
+
+	return p;
+}
+
 /*
  * Allocate or pin log vector buffers for CIL insertion.
  *
@@ -293,25 +326,16 @@ xlog_cil_alloc_shadow_bufs(
 		 */
 		if (!lip->li_lv_shadow ||
 		    buf_size > lip->li_lv_shadow->lv_size) {
-
 			/*
 			 * We free and allocate here as a realloc would copy
-			 * unnecessary data. We don't use kmem_zalloc() for the
+			 * unnecessary data. We don't use kvzalloc() for the
 			 * same reason - we don't need to zero the data area in
 			 * the buffer, only the log vector header and the iovec
 			 * storage.
 			 */
 			kmem_free(lip->li_lv_shadow);
+			lv = xlog_cil_kvmalloc(buf_size);
 
-			/*
-			 * We are in transaction context, which means this
-			 * allocation will pick up GFP_NOFS from the
-			 * memalloc_nofs_save/restore context the transaction
-			 * holds. This means we can use GFP_KERNEL here so the
-			 * generic kvmalloc() code will run vmalloc on
-			 * contiguous page allocation failure as we require.
-			 */
-			lv = kvmalloc(buf_size, GFP_KERNEL);
 			memset(lv, 0, xlog_cil_iovec_space(niovecs));
 
 			INIT_LIST_HEAD(&lv->lv_list);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/7] xfs: add log item flags to indicate intents
  2021-09-02  9:59 ` [PATCH 1/7] xfs: add log item flags to indicate intents Dave Chinner
@ 2021-09-03 21:08   ` Allison Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2021-09-03 21:08 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently have a couple of helper functions that try to infer
> whether the log item is an intent or intent done item from the
> combinations of operations it supports.  This is incredibly fragile
> and not very efficient as it requires checking specific combinations
> of ops.
> 
> We need to be able to identify intent and intent done items quickly
> and easily in upcoming patches, so simply add intent and intent done
> type flags to the log item ops flags. These are static flags to
> begin with, so intent items should have been typed like this from
> the start.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, this is pretty straight forward and makes a lot of sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_attr_item.c     |  4 +++-
>   fs/xfs/xfs_bmap_item.c     |  4 +++-
>   fs/xfs/xfs_extfree_item.c  |  4 +++-
>   fs/xfs/xfs_refcount_item.c |  4 +++-
>   fs/xfs/xfs_rmap_item.c     |  4 +++-
>   fs/xfs/xfs_trans.h         | 25 +++++++++++++------------
>   6 files changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> index bd4089eb8087..f900001e8f3a 100644
> --- a/fs/xfs/xfs_attr_item.c
> +++ b/fs/xfs/xfs_attr_item.c
> @@ -479,7 +479,8 @@ xfs_trans_get_attrd(struct xfs_trans		*tp,
>   }
>   
>   static const struct xfs_item_ops xfs_attrd_item_ops = {
> -	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
> +			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_attrd_item_size,
>   	.iop_format	= xfs_attrd_item_format,
>   	.iop_release    = xfs_attrd_item_release,
> @@ -684,6 +685,7 @@ xfs_attri_item_relog(
>   }
>   
>   static const struct xfs_item_ops xfs_attri_item_ops = {
> +	.flags		= XFS_ITEM_INTENT,
>   	.iop_size	= xfs_attri_item_size,
>   	.iop_format	= xfs_attri_item_format,
>   	.iop_unpin	= xfs_attri_item_unpin,
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 51ba8ee368ca..8de644a343b5 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -202,7 +202,8 @@ xfs_bud_item_release(
>   }
>   
>   static const struct xfs_item_ops xfs_bud_item_ops = {
> -	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
> +			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_bud_item_size,
>   	.iop_format	= xfs_bud_item_format,
>   	.iop_release	= xfs_bud_item_release,
> @@ -584,6 +585,7 @@ xfs_bui_item_relog(
>   }
>   
>   static const struct xfs_item_ops xfs_bui_item_ops = {
> +	.flags		= XFS_ITEM_INTENT,
>   	.iop_size	= xfs_bui_item_size,
>   	.iop_format	= xfs_bui_item_format,
>   	.iop_unpin	= xfs_bui_item_unpin,
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 046f21338c48..952a46477907 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -307,7 +307,8 @@ xfs_efd_item_release(
>   }
>   
>   static const struct xfs_item_ops xfs_efd_item_ops = {
> -	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
> +			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_efd_item_size,
>   	.iop_format	= xfs_efd_item_format,
>   	.iop_release	= xfs_efd_item_release,
> @@ -681,6 +682,7 @@ xfs_efi_item_relog(
>   }
>   
>   static const struct xfs_item_ops xfs_efi_item_ops = {
> +	.flags		= XFS_ITEM_INTENT,
>   	.iop_size	= xfs_efi_item_size,
>   	.iop_format	= xfs_efi_item_format,
>   	.iop_unpin	= xfs_efi_item_unpin,
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index a6e7351ca4f9..38b38a734fd6 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -208,7 +208,8 @@ xfs_cud_item_release(
>   }
>   
>   static const struct xfs_item_ops xfs_cud_item_ops = {
> -	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
> +			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_cud_item_size,
>   	.iop_format	= xfs_cud_item_format,
>   	.iop_release	= xfs_cud_item_release,
> @@ -600,6 +601,7 @@ xfs_cui_item_relog(
>   }
>   
>   static const struct xfs_item_ops xfs_cui_item_ops = {
> +	.flags		= XFS_ITEM_INTENT,
>   	.iop_size	= xfs_cui_item_size,
>   	.iop_format	= xfs_cui_item_format,
>   	.iop_unpin	= xfs_cui_item_unpin,
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index 8c70a4af80a9..1b3655090113 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -231,7 +231,8 @@ xfs_rud_item_release(
>   }
>   
>   static const struct xfs_item_ops xfs_rud_item_ops = {
> -	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
> +			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_rud_item_size,
>   	.iop_format	= xfs_rud_item_format,
>   	.iop_release	= xfs_rud_item_release,
> @@ -630,6 +631,7 @@ xfs_rui_item_relog(
>   }
>   
>   static const struct xfs_item_ops xfs_rui_item_ops = {
> +	.flags		= XFS_ITEM_INTENT,
>   	.iop_size	= xfs_rui_item_size,
>   	.iop_format	= xfs_rui_item_format,
>   	.iop_unpin	= xfs_rui_item_unpin,
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 2d1cc1ff93c7..ab6e0bc1df1a 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -80,28 +80,29 @@ struct xfs_item_ops {
>   			struct xfs_trans *tp);
>   };
>   
> -/* Is this log item a deferred action intent? */
> +/*
> + * Log item ops flags
> + */
> +/*
> + * Release the log item when the journal commits instead of inserting into the
> + * AIL for writeback tracking and/or log tail pinning.
> + */
> +#define XFS_ITEM_RELEASE_WHEN_COMMITTED	(1 << 0)
> +#define XFS_ITEM_INTENT			(1 << 1)
> +#define XFS_ITEM_INTENT_DONE		(1 << 2)
> +
>   static inline bool
>   xlog_item_is_intent(struct xfs_log_item *lip)
>   {
> -	return lip->li_ops->iop_recover != NULL &&
> -	       lip->li_ops->iop_match != NULL;
> +	return lip->li_ops->flags & XFS_ITEM_INTENT;
>   }
>   
> -/* Is this a log intent-done item? */
>   static inline bool
>   xlog_item_is_intent_done(struct xfs_log_item *lip)
>   {
> -	return lip->li_ops->iop_unpin == NULL &&
> -	       lip->li_ops->iop_push == NULL;
> +	return lip->li_ops->flags & XFS_ITEM_INTENT_DONE;
>   }
>   
> -/*
> - * Release the log item as soon as committed.  This is for items just logging
> - * intents that never need to be written back in place.
> - */
> -#define XFS_ITEM_RELEASE_WHEN_COMMITTED	(1 << 0)
> -
>   void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
>   			  int type, const struct xfs_item_ops *ops);
>   
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/7] xfs: tag transactions that contain intent done items
  2021-09-02  9:59 ` [PATCH 2/7] xfs: tag transactions that contain intent done items Dave Chinner
@ 2021-09-03 21:09   ` Allison Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2021-09-03 21:09 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Intent whiteouts will require extra work to be done during
> transaction commit if the transaction contains an intent done item.
> 
> To determine if a transaction contains an intent done item, we want
> to avoid having to walk all the items in the transaction to check if
> they are intent done items. Hence when we add an intent done item to
> a transaction, tag the transaction to indicate that it contains such
> an item.
> 
> We don't tag the transaction when the defer ops is relogging an
> intent to move it forward in the log. Whiteouts will never apply to
> these cases, so we don't need to bother looking for them.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/libxfs/xfs_shared.h | 24 +++++++++++++++++-------
>   fs/xfs/xfs_attr_item.c     |  4 +++-
>   fs/xfs/xfs_bmap_item.c     |  2 +-
>   fs/xfs/xfs_extfree_item.c  |  2 +-
>   fs/xfs/xfs_refcount_item.c |  2 +-
>   fs/xfs/xfs_rmap_item.c     |  2 +-
>   6 files changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 25c4cab58851..e96618dbde29 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -54,13 +54,23 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>   /*
>    * Values for t_flags.
>    */
> -#define	XFS_TRANS_DIRTY		0x01	/* something needs to be logged */
> -#define	XFS_TRANS_SB_DIRTY	0x02	/* superblock is modified */
> -#define	XFS_TRANS_PERM_LOG_RES	0x04	/* xact took a permanent log res */
> -#define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
> -#define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
> -#define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
> -#define XFS_TRANS_RES_FDBLKS	0x80	/* reserve newly freed blocks */
> +/* Transaction needs to be logged */
> +#define XFS_TRANS_DIRTY			(1 << 0)
> +/* Superblock is dirty and needs to be logged */
> +#define XFS_TRANS_SB_DIRTY		(1 << 1)
> +/* Transaction took a permanent log reservation */
> +#define XFS_TRANS_PERM_LOG_RES		(1 << 2)
> +/* Synchronous transaction commit needed */
> +#define XFS_TRANS_SYNC			(1 << 3)
> +/* Transaction can use reserve block pool */
> +#define XFS_TRANS_RESERVE		(1 << 4)
> +/* Transaction should avoid VFS level superblock write accounting */
> +#define XFS_TRANS_NO_WRITECOUNT		(1 << 5)
> +/* Transaction has freed blocks returned to its reservation */
> +#define XFS_TRANS_RES_FDBLKS		(1 << 6)
> +/* Transaction contains an intent done log item */
> +#define XFS_TRANS_HAS_INTENT_DONE	(1 << 7)
> +
>   /*
>    * LOWMODE is used by the allocator to activate the lowspace algorithm - when
>    * free space is running low the extent allocator may choose to allocate an
> diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> index f900001e8f3a..572edb7fb2cd 100644
> --- a/fs/xfs/xfs_attr_item.c
> +++ b/fs/xfs/xfs_attr_item.c
> @@ -311,8 +311,10 @@ xfs_trans_attr_finish_update(
>   	/*
>   	 * attr intent/done items are null when delayed attributes are disabled
>   	 */
> -	if (attrdp)
> +	if (attrdp) {
>   		set_bit(XFS_LI_DIRTY, &attrdp->attrd_item.li_flags);
> +		args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> +	}
>   
>   	return error;
>   }
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 8de644a343b5..5244d85b1ba4 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -255,7 +255,7 @@ xfs_trans_log_finish_bmap_update(
>   	 * 1.) releases the BUI and frees the BUD
>   	 * 2.) shuts down the filesystem
>   	 */
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
>   	set_bit(XFS_LI_DIRTY, &budp->bud_item.li_flags);
>   
>   	return error;
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 952a46477907..f689530aaa75 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -381,7 +381,7 @@ xfs_trans_free_extent(
>   	 * 1.) releases the EFI and frees the EFD
>   	 * 2.) shuts down the filesystem
>   	 */
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
>   	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
>   
>   	next_extent = efdp->efd_next_extent;
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 38b38a734fd6..b426e98d7f4f 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -260,7 +260,7 @@ xfs_trans_log_finish_refcount_update(
>   	 * 1.) releases the CUI and frees the CUD
>   	 * 2.) shuts down the filesystem
>   	 */
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
>   	set_bit(XFS_LI_DIRTY, &cudp->cud_item.li_flags);
>   
>   	return error;
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index 1b3655090113..df3e61c1bf69 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -328,7 +328,7 @@ xfs_trans_log_finish_rmap_update(
>   	 * 1.) releases the RUI and frees the RUD
>   	 * 2.) shuts down the filesystem
>   	 */
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
>   	set_bit(XFS_LI_DIRTY, &rudp->rud_item.li_flags);
>   
>   	return error;
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/7] xfs: factor and move some code in xfs_log_cil.c
  2021-09-02  9:59 ` [PATCH 3/7] xfs: factor and move some code in xfs_log_cil.c Dave Chinner
@ 2021-09-03 21:09   ` Allison Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2021-09-03 21:09 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In preparation for adding support for intent item whiteouts.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks like a straight forward hoist
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log_cil.c | 109 +++++++++++++++++++++++++------------------
>   1 file changed, 64 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 9488db6c6b21..bd2c8178255e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -58,6 +58,38 @@ xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil)
>   			(log->l_iclog_size - log->l_iclog_hsize)));
>   }
>   
> +/*
> + * Check if the current log item was first committed in this sequence.
> + * We can't rely on just the log item being in the CIL, we have to check
> + * the recorded commit sequence number.
> + *
> + * Note: for this to be used in a non-racy manner, it has to be called with
> + * CIL flushing locked out. As a result, it should only be used during the
> + * transaction commit process when deciding what to format into the item.
> + */
> +static bool
> +xlog_item_in_current_chkpt(
> +	struct xfs_cil		*cil,
> +	struct xfs_log_item	*lip)
> +{
> +	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
> +		return false;
> +
> +	/*
> +	 * li_seq is written on the first commit of a log item to record the
> +	 * first checkpoint it is written to. Hence if it is different to the
> +	 * current sequence, we're in a new checkpoint.
> +	 */
> +	return lip->li_seq == cil->xc_ctx->sequence;
> +}
> +
> +bool
> +xfs_log_item_in_current_chkpt(
> +	struct xfs_log_item *lip)
> +{
> +	return xlog_item_in_current_chkpt(lip->li_mountp->m_log->l_cilp, lip);
> +}
> +
>   /*
>    * Unavoidable forward declaration - xlog_cil_push_work() calls
>    * xlog_cil_ctx_alloc() itself.
> @@ -995,6 +1027,37 @@ xlog_cil_order_cmp(
>   	return l1->lv_order_id > l2->lv_order_id;
>   }
>   
> +/*
> + * Build a log vector chain from the current CIL.
> + */
> +static void
> +xlog_cil_build_lv_chain(
> +	struct xfs_cil_ctx	*ctx,
> +	uint32_t		*num_iovecs,
> +	uint32_t		*num_bytes)
> +{
> +
> +	while (!list_empty(&ctx->log_items)) {
> +		struct xfs_log_item	*item;
> +		struct xfs_log_vec	*lv;
> +
> +		item = list_first_entry(&ctx->log_items,
> +					struct xfs_log_item, li_cil);
> +
> +		lv = item->li_lv;
> +		lv->lv_order_id = item->li_order_id;
> +		*num_iovecs += lv->lv_niovecs;
> +		/* we don't write ordered log vectors */
> +		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> +			*num_bytes += lv->lv_bytes;
> +
> +		list_add_tail(&lv->lv_list, &ctx->lv_chain);
> +		list_del_init(&item->li_cil);
> +		item->li_order_id = 0;
> +		item->li_lv = NULL;
> +	}
> +}
> +
>   /*
>    * Push the Committed Item List to the log.
>    *
> @@ -1017,7 +1080,6 @@ xlog_cil_push_work(
>   		container_of(work, struct xfs_cil_ctx, push_work);
>   	struct xfs_cil		*cil = ctx->cil;
>   	struct xlog		*log = cil->xc_log;
> -	struct xfs_log_vec	*lv;
>   	struct xfs_cil_ctx	*new_ctx;
>   	int			num_iovecs = 0;
>   	int			num_bytes = 0;
> @@ -1116,24 +1178,7 @@ xlog_cil_push_work(
>   				&bdev_flush);
>   
>   	xlog_cil_pcp_aggregate(cil, ctx);
> -
> -	while (!list_empty(&ctx->log_items)) {
> -		struct xfs_log_item	*item;
> -
> -		item = list_first_entry(&ctx->log_items,
> -					struct xfs_log_item, li_cil);
> -		lv = item->li_lv;
> -		lv->lv_order_id = item->li_order_id;
> -		num_iovecs += lv->lv_niovecs;
> -		/* we don't write ordered log vectors */
> -		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> -			num_bytes += lv->lv_bytes;
> -
> -		list_add_tail(&lv->lv_list, &ctx->lv_chain);
> -		list_del_init(&item->li_cil);
> -		item->li_order_id = 0;
> -		item->li_lv = NULL;
> -	}
> +	xlog_cil_build_lv_chain(ctx, &num_iovecs, &num_bytes);
>   
>   	/*
>   	 * Switch the contexts so we can drop the context lock and move out
> @@ -1612,32 +1657,6 @@ xlog_cil_force_seq(
>   	return 0;
>   }
>   
> -/*
> - * Check if the current log item was first committed in this sequence.
> - * We can't rely on just the log item being in the CIL, we have to check
> - * the recorded commit sequence number.
> - *
> - * Note: for this to be used in a non-racy manner, it has to be called with
> - * CIL flushing locked out. As a result, it should only be used during the
> - * transaction commit process when deciding what to format into the item.
> - */
> -bool
> -xfs_log_item_in_current_chkpt(
> -	struct xfs_log_item *lip)
> -{
> -	struct xfs_cil		*cil = lip->li_mountp->m_log->l_cilp;
> -
> -	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
> -		return false;
> -
> -	/*
> -	 * li_seq is written on the first commit of a log item to record the
> -	 * first checkpoint it is written to. Hence if it is different to the
> -	 * current sequence, we're in a new checkpoint.
> -	 */
> -	return lip->li_seq == cil->xc_ctx->sequence;
> -}
> -
>   /*
>    * Move dead percpu state to the relevant CIL context structures.
>    *
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] xfs: add log item method to return related intents
  2021-09-02  9:59 ` [PATCH 4/7] xfs: add log item method to return related intents Dave Chinner
@ 2021-09-03 21:09   ` Allison Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2021-09-03 21:09 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To apply a whiteout to an intent item when an intent done item is
> committed, we need to be able to retrieve the intent item from the
> the intent done item. Add a log item op method for doing this, and
> wire all the intent done items up to it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
OK, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_attr_item.c     | 8 ++++++++
>   fs/xfs/xfs_bmap_item.c     | 8 ++++++++
>   fs/xfs/xfs_extfree_item.c  | 8 ++++++++
>   fs/xfs/xfs_refcount_item.c | 8 ++++++++
>   fs/xfs/xfs_rmap_item.c     | 8 ++++++++
>   fs/xfs/xfs_trans.h         | 1 +
>   6 files changed, 41 insertions(+)
> 
> diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> index 572edb7fb2cd..86c8d5d08176 100644
> --- a/fs/xfs/xfs_attr_item.c
> +++ b/fs/xfs/xfs_attr_item.c
> @@ -480,12 +480,20 @@ xfs_trans_get_attrd(struct xfs_trans		*tp,
>   	return attrdp;
>   }
>   
> +static struct xfs_log_item *
> +xfs_attrd_item_intent(
> +	struct xfs_log_item	*lip)
> +{
> +	return &ATTRD_ITEM(lip)->attrd_attrip->attri_item;
> +}
> +
>   static const struct xfs_item_ops xfs_attrd_item_ops = {
>   	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
>   			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_attrd_item_size,
>   	.iop_format	= xfs_attrd_item_format,
>   	.iop_release    = xfs_attrd_item_release,
> +	.iop_intent	= xfs_attrd_item_intent,
>   };
>   
>   
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 5244d85b1ba4..0b06159cfd1b 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -201,12 +201,20 @@ xfs_bud_item_release(
>   	kmem_cache_free(xfs_bud_zone, budp);
>   }
>   
> +static struct xfs_log_item *
> +xfs_bud_item_intent(
> +	struct xfs_log_item	*lip)
> +{
> +	return &BUD_ITEM(lip)->bud_buip->bui_item;
> +}
> +
>   static const struct xfs_item_ops xfs_bud_item_ops = {
>   	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
>   			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_bud_item_size,
>   	.iop_format	= xfs_bud_item_format,
>   	.iop_release	= xfs_bud_item_release,
> +	.iop_intent	= xfs_bud_item_intent,
>   };
>   
>   static struct xfs_bud_log_item *
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index f689530aaa75..87cba4a71883 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -306,12 +306,20 @@ xfs_efd_item_release(
>   	xfs_efd_item_free(efdp);
>   }
>   
> +static struct xfs_log_item *
> +xfs_efd_item_intent(
> +	struct xfs_log_item	*lip)
> +{
> +	return &EFD_ITEM(lip)->efd_efip->efi_item;
> +}
> +
>   static const struct xfs_item_ops xfs_efd_item_ops = {
>   	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
>   			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_efd_item_size,
>   	.iop_format	= xfs_efd_item_format,
>   	.iop_release	= xfs_efd_item_release,
> +	.iop_intent	= xfs_efd_item_intent,
>   };
>   
>   /*
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index b426e98d7f4f..de739884e857 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -207,12 +207,20 @@ xfs_cud_item_release(
>   	kmem_cache_free(xfs_cud_zone, cudp);
>   }
>   
> +static struct xfs_log_item *
> +xfs_cud_item_intent(
> +	struct xfs_log_item	*lip)
> +{
> +	return &CUD_ITEM(lip)->cud_cuip->cui_item;
> +}
> +
>   static const struct xfs_item_ops xfs_cud_item_ops = {
>   	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
>   			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_cud_item_size,
>   	.iop_format	= xfs_cud_item_format,
>   	.iop_release	= xfs_cud_item_release,
> +	.iop_intent	= xfs_cud_item_intent,
>   };
>   
>   static struct xfs_cud_log_item *
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index df3e61c1bf69..8d57529d9ddd 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -230,12 +230,20 @@ xfs_rud_item_release(
>   	kmem_cache_free(xfs_rud_zone, rudp);
>   }
>   
> +static struct xfs_log_item *
> +xfs_rud_item_intent(
> +	struct xfs_log_item	*lip)
> +{
> +	return &RUD_ITEM(lip)->rud_ruip->rui_item;
> +}
> +
>   static const struct xfs_item_ops xfs_rud_item_ops = {
>   	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
>   			  XFS_ITEM_INTENT_DONE,
>   	.iop_size	= xfs_rud_item_size,
>   	.iop_format	= xfs_rud_item_format,
>   	.iop_release	= xfs_rud_item_release,
> +	.iop_intent	= xfs_rud_item_intent,
>   };
>   
>   static struct xfs_rud_log_item *
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index ab6e0bc1df1a..a6d7b3309bd7 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -78,6 +78,7 @@ struct xfs_item_ops {
>   	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
>   	struct xfs_log_item *(*iop_relog)(struct xfs_log_item *intent,
>   			struct xfs_trans *tp);
> +	struct xfs_log_item *(*iop_intent)(struct xfs_log_item *intent_done);
>   };
>   
>   /*
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL
  2021-09-02  9:59 ` [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL Dave Chinner
@ 2021-09-03 21:09   ` Allison Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2021-09-03 21:09 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we release an intent that a whiteout applies to, it will not
> have been committed to the journal and so won't be in the AIL. Hence
> when we drop the last reference to the intent, we do not want to try
> to remove it from the AIL as that will trigger a filesystem
> shutdown. Hence make the removal of intents from the AIL conditional
> on them actually being in the AIL so we do the correct thing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_attr_item.c     | 11 ++++++-----
>   fs/xfs/xfs_bmap_item.c     |  8 +++++---
>   fs/xfs/xfs_extfree_item.c  |  8 +++++---
>   fs/xfs/xfs_refcount_item.c |  8 +++++---
>   fs/xfs/xfs_rmap_item.c     |  8 +++++---
>   5 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> index 86c8d5d08176..11546967a5d7 100644
> --- a/fs/xfs/xfs_attr_item.c
> +++ b/fs/xfs/xfs_attr_item.c
> @@ -67,11 +67,12 @@ xfs_attri_release(
>   	struct xfs_attri_log_item	*attrip)
>   {
>   	ASSERT(atomic_read(&attrip->attri_refcount) > 0);
> -	if (atomic_dec_and_test(&attrip->attri_refcount)) {
> -		xfs_trans_ail_delete(&attrip->attri_item,
> -				     SHUTDOWN_LOG_IO_ERROR);
> -		xfs_attri_item_free(attrip);
> -	}
> +	if (!atomic_dec_and_test(&attrip->attri_refcount))
> +		return;
> +
> +	if (test_bit(XFS_LI_IN_AIL, &attrip->attri_item.li_flags))
> +		xfs_trans_ail_delete(&attrip->attri_item, SHUTDOWN_LOG_IO_ERROR);
> +	xfs_attri_item_free(attrip);
>   }
>   
>   STATIC void
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 0b06159cfd1b..7cabb59138b1 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -54,10 +54,12 @@ xfs_bui_release(
>   	struct xfs_bui_log_item	*buip)
>   {
>   	ASSERT(atomic_read(&buip->bui_refcount) > 0);
> -	if (atomic_dec_and_test(&buip->bui_refcount)) {
> +	if (!atomic_dec_and_test(&buip->bui_refcount))
> +		return;
> +
> +	if (test_bit(XFS_LI_IN_AIL, &buip->bui_item.li_flags))
>   		xfs_trans_ail_delete(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
> -		xfs_bui_item_free(buip);
> -	}
> +	xfs_bui_item_free(buip);
>   }
>   
>   
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 87cba4a71883..7032125fe987 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -58,10 +58,12 @@ xfs_efi_release(
>   	struct xfs_efi_log_item	*efip)
>   {
>   	ASSERT(atomic_read(&efip->efi_refcount) > 0);
> -	if (atomic_dec_and_test(&efip->efi_refcount)) {
> +	if (!atomic_dec_and_test(&efip->efi_refcount))
> +		return;
> +
> +	if (test_bit(XFS_LI_IN_AIL, &efip->efi_item.li_flags))
>   		xfs_trans_ail_delete(&efip->efi_item, SHUTDOWN_LOG_IO_ERROR);
> -		xfs_efi_item_free(efip);
> -	}
> +	xfs_efi_item_free(efip);
>   }
>   
>   /*
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index de739884e857..f62dc5b7af88 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -53,10 +53,12 @@ xfs_cui_release(
>   	struct xfs_cui_log_item	*cuip)
>   {
>   	ASSERT(atomic_read(&cuip->cui_refcount) > 0);
> -	if (atomic_dec_and_test(&cuip->cui_refcount)) {
> +	if (!atomic_dec_and_test(&cuip->cui_refcount))
> +		return;
> +
> +	if (test_bit(XFS_LI_IN_AIL, &cuip->cui_item.li_flags))
>   		xfs_trans_ail_delete(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
> -		xfs_cui_item_free(cuip);
> -	}
> +	xfs_cui_item_free(cuip);
>   }
>   
>   
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index 8d57529d9ddd..0c67abcd189b 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -53,10 +53,12 @@ xfs_rui_release(
>   	struct xfs_rui_log_item	*ruip)
>   {
>   	ASSERT(atomic_read(&ruip->rui_refcount) > 0);
> -	if (atomic_dec_and_test(&ruip->rui_refcount)) {
> +	if (!atomic_dec_and_test(&ruip->rui_refcount))
> +		return;
> +
> +	if (test_bit(XFS_LI_IN_AIL, &ruip->rui_item.li_flags))
>   		xfs_trans_ail_delete(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
> -		xfs_rui_item_free(ruip);
> -	}
> +	xfs_rui_item_free(ruip);
>   }
>   
>   STATIC void
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/7] [RFC] xfs: intent item whiteouts
  2021-09-02  9:59 ` [PATCH 6/7] [RFC] xfs: intent item whiteouts Dave Chinner
@ 2021-09-03 21:09   ` Allison Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2021-09-03 21:09 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we log modifications based on intents, we add both intent
> and intent done items to the modification being made. These get
> written to the log to ensure that the operation is re-run if the
> intent done is not found in the log.
> 
> However, for operations that complete wholly within a single
> checkpoint, the change in the checkpoint is atomic and will never
> need replay. In this case, we don't need to actually write the
> intent and intent done items to the journal because log recovery
> will never need to manually restart this modification.
> 
> Log recovery currently handles intent/intent done matching by
> inserting the intent into the AIL, then removing it when a matching
> intent done item is found. Hence for all the intent-based operations
> that complete within a checkpoint, we spend all that time parsing
> the intent/intent done items just to cancel them and do nothing with
> them.
> 
> Hence it follows that the only time we actually need intents in the
> log is when the modification crosses checkpoint boundaries in the
> log and so may only be partially complete in the journal. Hence if
> we commit an intent done item to the CIL and the intent item is in
> the same checkpoint, we don't actually have to write them to the
> journal because log recovery will always cancel the intents.
> 
> We've never really worried about the overhead of logging intents
> unnecessarily like this because the intents we log are generally
> very much smaller than the change being made. e.g. freeing an extent
> involves modifying at least two freespace btree blocks and the AGF,
> so the EFI/EFD overhead is only a small increase in space and
> processing time compared to the overall cost of freeing an extent.
> 
> However, delayed attributes change this cost equation dramatically,
> especially for inline attributes. In the case of adding an inline
> attribute, we only log the inode core and attribute fork at present.
> With delayed attributes, we now log the attr intent which includes
> the name and value, the inode core and attr fork, and finally the
> attr intent done item. We increase the number of items we log from 1
> to 3, and the number of log vectors (regions) goes up from 3 to 7.
> Hence we triple the number of objects that the CIL has to process,
> and more than double the number of log vectors that need to be
> written to the journal.
> 
> At scale, this means delayed attributes cause a non-pipelined CIL to
> become CPU bound processing all the extra items, resulting in a > 40%
> performance degradation on 16-way file+xattr create workloads.
> Pipelining the CIL (as per 5.15) reduces the performance degradation
> to 20%, but now the limitation is the rate at which the log items
> can be written to the iclogs and iclogs be dispatched for IO and
> completed.
> 
> Even log IO completion is slowed down by these intents, because it
> now has to process 3x the number of items in the checkpoint.
> Processing completed intents is especially inefficient here, because
> we first insert the intent into the AIL, then remove it from the AIL
> when the intent done is processed. IOWs, we are also doing expensive
> operations in log IO completion we could completely avoid if we
> didn't log completed intent/intent done pairs.
> 
> Enter log item whiteouts.
> 
> When an intent done is committed, we can check to see if the
> associated intent is in the same checkpoint as we are currently
> committing the intent done to. If so, we can mark the intent log
> item with a whiteout and immediately free the intent done item
> rather than committing it to the CIL. We can basically skip the
> entire formatting and CIL insertion steps for the intent done item.
> 
> However, we cannot remove the intent item from the CIL at this point
> because the unlocked per-cpu CIL item lists do not permit removal
> without holding the CIL context lock exclusively. Transaction commit
> only holds the context lock shared, hence the best we can do is mark
> the intent item with a whiteout so that the CIL push can release it
> rather than writing it to the log.
> 
> This means we never write the intent to the log if the intent done
> has also been committed to the same checkpoint, but we'll always
> write the intent if the intent done has not been committed or has
> been committed to a different checkpoint. This will result in
> correct log recovery behaviour in all cases, without the overhead of
> logging unnecessary intents.
> 
> This intent whiteout concept is generic - we can apply it to all
> intent/intent done pairs that have a direct 1:1 relationship. The
> way deferred ops iterate and relog intents means that all intents
> currently have a 1:1 relationship with their done intent, and hence
> we can apply this cancellation to all existing intent/intent done
> implementations.
> 
> For delayed attributes with a 16-way 64kB xattr create workload,
> whiteouts reduce the amount of journalled metadata from ~2.5GB/s
> down to ~600MB/s and improve the creation rate from 9000/s to
> 14000/s.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, this makes a lot of sense.  Thanks for the detailed explanation, it 
helps a lot.  I think we need to do everything we can to optimize things 
as much as possible.  Since the overall goal is parent pointers, attr 
activity will increase quite a bit.  So I would support this change.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log_cil.c | 78 +++++++++++++++++++++++++++++++++++++++++---
>   fs/xfs/xfs_trace.h   |  3 ++
>   fs/xfs/xfs_trans.h   |  7 ++--
>   3 files changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index bd2c8178255e..fff68aad254e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -502,7 +502,8 @@ xlog_cil_insert_format_items(
>   static void
>   xlog_cil_insert_items(
>   	struct xlog		*log,
> -	struct xfs_trans	*tp)
> +	struct xfs_trans	*tp,
> +	uint32_t		released_space)
>   {
>   	struct xfs_cil		*cil = log->l_cilp;
>   	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
> @@ -569,6 +570,7 @@ xlog_cil_insert_items(
>   	 */
>   	cilpcp = get_cpu_ptr(cil->xc_pcp);
>   	cilpcp->space_reserved += ctx_res;
> +	cilpcp->space_used -= released_space;
>   	cilpcp->space_used += len;
>   	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
>   	    cilpcp->space_used >
> @@ -1028,11 +1030,14 @@ xlog_cil_order_cmp(
>   }
>   
>   /*
> - * Build a log vector chain from the current CIL.
> + * Build a log vector chain from the current CIL. If a log item is marked with a
> + * whiteout, we do not need to write it to the journal and so we just move it
> + * to the whiteout list for the caller to dispose of appropriately.
>    */
>   static void
>   xlog_cil_build_lv_chain(
>   	struct xfs_cil_ctx	*ctx,
> +	struct list_head	*whiteouts,
>   	uint32_t		*num_iovecs,
>   	uint32_t		*num_bytes)
>   {
> @@ -1043,6 +1048,11 @@ xlog_cil_build_lv_chain(
>   
>   		item = list_first_entry(&ctx->log_items,
>   					struct xfs_log_item, li_cil);
> +		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> +			list_move(&item->li_cil, whiteouts);
> +			trace_xfs_cil_whiteout_skip(item);
> +			continue;
> +		}
>   
>   		lv = item->li_lv;
>   		lv->lv_order_id = item->li_order_id;
> @@ -1092,6 +1102,7 @@ xlog_cil_push_work(
>   	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>   	bool			push_commit_stable;
>   	struct xlog_ticket	*ticket;
> +	LIST_HEAD		(whiteouts);
>   
>   	new_ctx = xlog_cil_ctx_alloc();
>   	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1178,7 +1189,7 @@ xlog_cil_push_work(
>   				&bdev_flush);
>   
>   	xlog_cil_pcp_aggregate(cil, ctx);
> -	xlog_cil_build_lv_chain(ctx, &num_iovecs, &num_bytes);
> +	xlog_cil_build_lv_chain(ctx, &whiteouts, &num_iovecs, &num_bytes);
>   
>   	/*
>   	 * Switch the contexts so we can drop the context lock and move out
> @@ -1312,6 +1323,15 @@ xlog_cil_push_work(
>   	/* Not safe to reference ctx now! */
>   
>   	spin_unlock(&log->l_icloglock);
> +
> +	/* clean up log items that had whiteouts */
> +	while (!list_empty(&whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(&whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
>   	xfs_log_ticket_ungrant(log, ticket);
>   	return;
>   
> @@ -1323,6 +1343,14 @@ xlog_cil_push_work(
>   
>   out_abort_free_ticket:
>   	ASSERT(xlog_is_shutdown(log));
> +	while (!list_empty(&whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(&whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
> +
>   	if (!ctx->commit_iclog) {
>   		xfs_log_ticket_ungrant(log, ctx->ticket);
>   		xlog_cil_committed(ctx);
> @@ -1475,6 +1503,43 @@ xlog_cil_empty(
>   	return empty;
>   }
>   
> +/*
> + * If there are intent done items in this transaction and the related intent was
> + * committed in the current (same) CIL checkpoint, we don't need to write either
> + * the intent or intent done item to the journal as the change will be
> + * journalled atomically within this checkpoint. As we cannot remove items from
> + * the CIL here, mark the related intent with a whiteout so that the CIL push
> + * can remove it rather than writing it to the journal. Then remove the intent
> + * done item from the current transaction and release it so it doesn't get put
> + * into the CIL at all.
> + */
> +static uint32_t
> +xlog_cil_process_intents(
> +	struct xfs_cil		*cil,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_log_item	*lip, *ilip, *next;
> +	uint32_t		len = 0;
> +
> +	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
> +		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
> +			continue;
> +
> +		ilip = lip->li_ops->iop_intent(lip);
> +		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
> +			continue;
> +		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
> +		trace_xfs_cil_whiteout_mark(ilip);
> +		len += ilip->li_lv->lv_bytes;
> +		kmem_free(ilip->li_lv);
> +		ilip->li_lv = NULL;
> +
> +		xfs_trans_del_item(lip);
> +		lip->li_ops->iop_release(lip);
> +	}
> +	return len;
> +}
> +
>   /*
>    * Commit a transaction with the given vector to the Committed Item List.
>    *
> @@ -1497,6 +1562,7 @@ xlog_cil_commit(
>   {
>   	struct xfs_cil		*cil = log->l_cilp;
>   	struct xfs_log_item	*lip, *next;
> +	uint32_t		released_space = 0;
>   
>   	/*
>   	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1508,7 +1574,11 @@ xlog_cil_commit(
>   	/* lock out background commit */
>   	down_read(&cil->xc_ctx_lock);
>   
> -	xlog_cil_insert_items(log, tp);
> +	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
> +		released_space = xlog_cil_process_intents(cil, tp);
> +
> +	xlog_cil_insert_items(log, tp, released_space);
> +	tp->t_ticket->t_curr_res += released_space;
>   
>   	if (regrant && !xlog_is_shutdown(log))
>   		xfs_log_ticket_regrant(log, tp->t_ticket);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 77a78b5b1a29..d4f5a1410879 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1348,6 +1348,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
>   
>   DECLARE_EVENT_CLASS(xfs_ail_class,
>   	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index a6d7b3309bd7..5765ca6e2c84 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -55,13 +55,16 @@ struct xfs_log_item {
>   #define	XFS_LI_IN_AIL	0
>   #define	XFS_LI_ABORTED	1
>   #define	XFS_LI_FAILED	2
> -#define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
> +#define	XFS_LI_DIRTY	3
> +#define	XFS_LI_WHITEOUT	4
>   
>   #define XFS_LI_FLAGS \
>   	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
>   	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
>   	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
> -	{ (1 << XFS_LI_DIRTY),		"DIRTY" }
> +	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
> +	{ (1 << XFS_LI_WHITEOUT),	"WHITEOUT" }
> +
>   
>   struct xfs_item_ops {
>   	unsigned flags;
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers
  2021-09-02  9:59 ` [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers Dave Chinner
@ 2021-09-03 21:55   ` Allison Henderson
  0 siblings, 0 replies; 17+ messages in thread
From: Allison Henderson @ 2021-09-03 21:55 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> 
> The problem is when we are logging lots of large objects, we hit
> kvmalloc really damn hard with costly order allocations, and
> behaviour utterly sucks:
> 
>       - 49.73% xlog_cil_commit
> 	 - 31.62% kvmalloc_node
> 	    - 29.96% __kmalloc_node
> 	       - 29.38% kmalloc_large_node
> 		  - 29.33% __alloc_pages
> 		     - 24.33% __alloc_pages_slowpath.constprop.0
> 			- 18.35% __alloc_pages_direct_compact
> 			   - 17.39% try_to_compact_pages
> 			      - compact_zone_order
> 				 - 15.26% compact_zone
> 				      5.29% __pageblock_pfn_to_page
> 				      3.71% PageHuge
> 				    - 1.44% isolate_migratepages_block
> 					 0.71% set_pfnblock_flags_mask
> 				   1.11% get_pfnblock_flags_mask
> 			   - 0.81% get_page_from_freelist
> 			      - 0.59% _raw_spin_lock_irqsave
> 				 - do_raw_spin_lock
> 				      __pv_queued_spin_lock_slowpath
> 			- 3.24% try_to_free_pages
> 			   - 3.14% shrink_node
> 			      - 2.94% shrink_slab.constprop.0
> 				 - 0.89% super_cache_count
> 				    - 0.66% xfs_fs_nr_cached_objects
> 				       - 0.65% xfs_reclaim_inodes_count
> 					    0.55% xfs_perag_get_tag
> 				   0.58% kfree_rcu_shrink_count
> 			- 2.09% get_page_from_freelist
> 			   - 1.03% _raw_spin_lock_irqsave
> 			      - do_raw_spin_lock
> 				   __pv_queued_spin_lock_slowpath
> 		     - 4.88% get_page_from_freelist
> 			- 3.66% _raw_spin_lock_irqsave
> 			   - do_raw_spin_lock
> 				__pv_queued_spin_lock_slowpath
> 	    - 1.63% __vmalloc_node
> 	       - __vmalloc_node_range
> 		  - 1.10% __alloc_pages_bulk
> 		     - 0.93% __alloc_pages
> 			- 0.92% get_page_from_freelist
> 			   - 0.89% rmqueue_bulk
> 			      - 0.69% _raw_spin_lock
> 				 - do_raw_spin_lock
> 				      __pv_queued_spin_lock_slowpath
> 	   13.73% memcpy_erms
> 	 - 2.22% kvfree
> 
> On this workload, that's almost a dozen CPUs all trying to compact
> and reclaim memory inside kvmalloc_node at the same time. Yet it is
> regularly falling back to vmalloc despite all that compaction, page
> and shrinker reclaim that direct reclaim is doing. Copying all the
> metadata is taking far less CPU time than allocating the storage!
> 
> Direct reclaim should be considered extremely harmful.
> 
> This is a high frequency, high throughput, CPU usage and latency
> sensitive allocation. We've got memory there, and we're using
> kvmalloc to allow memory allocation to avoid doing lots of work to
> try to do contiguous allocations.
> 
> Except it still does *lots of costly work* that is unnecessary.
> 
> Worse: the only way to avoid the slowpath page allocation trying to
> do compaction on costly allocations is to turn off direct reclaim
> (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).
> 
> Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
> GFP_KERNEL allocation context, so you only get kmalloc!". This
> cuts off the vmalloc fallback, and this leads to almost instant OOM
> problems which ends up in filesystems deadlocks, shutdowns and/or
> kernel crashes.
> 
> I want some basic kvmalloc behaviour:
> 
> - kmalloc for a contiguous range with fail fast semantics - no
>    compaction direct reclaim if the allocation enters the slow path.
> - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
> 
> The really, really stupid part about this is these kvmalloc() calls
> are run under memalloc_nofs task context, so all the allocations are
> always reduced to GFP_NOFS regardless of the fact that kvmalloc
> requires GFP_KERNEL to be passed in. IOWs, we're already telling
> kvmalloc to behave differently to the gfp flags we pass in, but it
> still won't allow vmalloc to be run with anything other than
> GFP_KERNEL.
> 
> So, this patch open codes the kvmalloc() in the commit path to have
> the above described behaviour. The result is we more than halve the
> CPU time spent doing kvmalloc() in this path, and transaction commits
> with 64kB objects in them more than double in rate. i.e. we get ~5x
> reduction in CPU usage per costly-sized kvmalloc() invocation and
> the profile looks like this:
> 
>    - 37.60% xlog_cil_commit
> 	16.01% memcpy_erms
>        - 8.45% __kmalloc
> 	 - 8.04% kmalloc_order_trace
> 	    - 8.03% kmalloc_order
> 	       - 7.93% alloc_pages
> 		  - 7.90% __alloc_pages
> 		     - 4.05% __alloc_pages_slowpath.constprop.0
> 			- 2.18% get_page_from_freelist
> 			- 1.77% wake_all_kswapds
> ....
> 				    - __wake_up_common_lock
> 				       - 0.94% _raw_spin_lock_irqsave
> 		     - 3.72% get_page_from_freelist
> 			- 2.43% _raw_spin_lock_irqsave
>        - 5.72% vmalloc
> 	 - 5.72% __vmalloc_node_range
> 	    - 4.81% __get_vm_area_node.constprop.0
> 	       - 3.26% alloc_vmap_area
> 		  - 2.52% _raw_spin_lock
> 	       - 1.46% _raw_spin_lock
> 	      0.56% __alloc_pages_bulk
>        - 4.66% kvfree
> 	 - 3.25% vfree
> 	    - __vfree
> 	       - 3.23% __vunmap
> 		  - 1.95% remove_vm_area
> 		     - 1.06% free_vmap_area_noflush
> 			- 0.82% _raw_spin_lock
> 		     - 0.68% _raw_spin_lock
> 		  - 0.92% _raw_spin_lock
> 	 - 1.40% kfree
> 	    - 1.36% __free_pages
> 	       - 1.35% __free_pages_ok
> 		  - 1.02% _raw_spin_lock_irqsave
> 
> It's worth noting that over 50% of the CPU time spent allocating
> these shadow buffers is now spent on spinlocks. So the shadow buffer
> allocation overhead is greatly reduced by getting rid of direct
> reclaim from kmalloc, and could probably be made even less costly if
> vmalloc() didn't use global spinlocks to protect its structures.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, thanks for all the explanation.  I will try out your set with mine,
and do a few perf runs myself.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
> ---
>   fs/xfs/xfs_log_cil.c | 46 +++++++++++++++++++++++++++++++++-----------
>   1 file changed, 35 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index fff68aad254e..81ebf03bfa5c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -185,6 +185,39 @@ xlog_cil_iovec_space(
>   			sizeof(uint64_t));
>   }
>   
> +/*
> + * Shadow buffers can be large, so we need to use kvmalloc() here to ensure
> + * success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts to fall
> + * back to vmalloc, so we can't actually do anything useful with gfp flags to
> + * control the kmalloc() behaviour within kvmalloc(). Hence kmalloc() will do
> + * direct reclaim and compaction in the slow path, both of which are
> + * horrendously expensive. We just want kmalloc to fail fast and fall back to
> + * vmalloc if it can't get something straight away from the free lists or buddy
> + * allocator. Hence we have to open code kvmalloc ourselves here.
> + *
> + * Also, we are in memalloc_nofs_save task context here, so despite the use of
> + * GFP_KERNEL here, we are actually going to be doing GFP_NOFS allocations. This
> + * is actually the only way to make vmalloc() do GFP_NOFS allocations, so lets
> + * just all pretend this is a GFP_KERNEL context operation. Let's
> + */
> +static inline void *
> +xlog_cil_kvmalloc(
> +	size_t		buf_size)
> +{
> +	gfp_t		flags = GFP_KERNEL;
> +	void		*p;
> +
> +	flags &= ~__GFP_DIRECT_RECLAIM;
> +	flags |= __GFP_NOWARN | __GFP_NORETRY;
> +	do {
> +		p = kmalloc(buf_size, flags);
> +		if (!p)
> +			p = vmalloc(buf_size);
> +	} while (!p);
> +
> +	return p;
> +}
> +
>   /*
>    * Allocate or pin log vector buffers for CIL insertion.
>    *
> @@ -293,25 +326,16 @@ xlog_cil_alloc_shadow_bufs(
>   		 */
>   		if (!lip->li_lv_shadow ||
>   		    buf_size > lip->li_lv_shadow->lv_size) {
> -
>   			/*
>   			 * We free and allocate here as a realloc would copy
> -			 * unnecessary data. We don't use kmem_zalloc() for the
> +			 * unnecessary data. We don't use kvzalloc() for the
>   			 * same reason - we don't need to zero the data area in
>   			 * the buffer, only the log vector header and the iovec
>   			 * storage.
>   			 */
>   			kmem_free(lip->li_lv_shadow);
> +			lv = xlog_cil_kvmalloc(buf_size);
>   
> -			/*
> -			 * We are in transaction context, which means this
> -			 * allocation will pick up GFP_NOFS from the
> -			 * memalloc_nofs_save/restore context the transaction
> -			 * holds. This means we can use GFP_KERNEL here so the
> -			 * generic kvmalloc() code will run vmalloc on
> -			 * contiguous page allocation failure as we require.
> -			 */
> -			lv = kvmalloc(buf_size, GFP_KERNEL);
>   			memset(lv, 0, xlog_cil_iovec_space(niovecs));
>   
>   			INIT_LIST_HEAD(&lv->lv_list);
> 


* Re: [RFC PATCH 0/7] xfs: intent item whiteouts
  2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
                   ` (6 preceding siblings ...)
  2021-09-02  9:59 ` [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers Dave Chinner
@ 2021-09-09 11:37 ` Christoph Hellwig
  2021-09-09 21:21   ` Dave Chinner
  7 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-09-09 11:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Sep 02, 2021 at 07:59:20PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> This is a patchset built on top of Allison's Logged Attributes
> and my CIL Scalability patch sets.

Do you have a git tree with all applied somewhere to help playing
with this series?



* Re: [RFC PATCH 0/7] xfs: intent item whiteouts
  2021-09-09 11:37 ` [RFC PATCH 0/7] xfs: intent item whiteouts Christoph Hellwig
@ 2021-09-09 21:21   ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2021-09-09 21:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Sep 09, 2021 at 01:37:49PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 02, 2021 at 07:59:20PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is a patchset built on top of Allison's Logged Attributes
> > and my CIL Scalability patch sets.
> 
> Do you have a git tree with all applied somewhere to help playing
> with this series?

Not yet, because the patch stack it is built on top of has a bunch
of other WIP, not-quite-fully-working stuff in it, so I haven't been
able to run it through a full fstests pass yet (hence the RFC tag).
The next version (hopefully early next week) will be rebased
directly on top of the CIL scalability work and I'll push out those
branches into a git tree when it's actually been more fully tested
and survived some recovery stress looping.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2021-09-09 21:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
2021-09-02  9:59 ` [PATCH 1/7] xfs: add log item flags to indicate intents Dave Chinner
2021-09-03 21:08   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 2/7] xfs: tag transactions that contain intent done items Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 3/7] xfs: factor a move some code in xfs_log_cil.c Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 4/7] xfs: add log item method to return related intents Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 6/7] [RFC] xfs: intent item whiteouts Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers Dave Chinner
2021-09-03 21:55   ` Allison Henderson
2021-09-09 11:37 ` [RFC PATCH 0/7] xfs: intent item whiteouts Christoph Hellwig
2021-09-09 21:21   ` Dave Chinner
