All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/10 v6] xfs: intent whiteouts
@ 2022-05-03 22:17 Dave Chinner
  2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

This is a patchset inspired by the performance regressions that were
seen from logging 64k xattrs with Allison's delayed attribute
patchset and trying to work out how to minimise the impact of
logging xattrs. Most of that is explained in the "xfs: intent item
whiteouts" patch, so I won't repeat it here.

The whiteouts massively reduce the journal write overhead of logging
xattrs - with this patchset I've reduced 2.5GB/s of log traffic (16
way file create w/64k xattr workload) down to approximately 220MB of
log traffic, and performance has increased from 9k creates/s to 36k
creates/s. The workload still writes to disk at 2.5GB/s, but that's
what writing 35k x 64k xattrs to disk does.

This is still short of the non-logged attribute mechanism, which
runs at 40-45k creates a second and 3.5-4GB/s to disk, but it brings
logged attrs to within roughly 5-15% of non-logged attrs across the
full range of attribute sizes.

So, while this patchset was clearly insired and has major positive
impact on Allison's delayed attribute work, it also applies
generically to all other intent/intent done pairs that already
exist. Hence I've created this patchset as a stand-alone patchset
that isn't dependent on the delayed attributes being committed, nor
does the delayed attribute patchset need this to funciton properly.
IOWs, they can be merged in parallel and then the attribute log item
implementation be updated to support whiteouts after the fact.

This patchset is separate to the attr code, though, because
intent whiteouts are not specific to the attr code. They are a
generic mechanism that can be applied to all the intent/intent done
item pairs we already have. This patch set modifies all those
intents to use whiteouts, and so there is benefits from the patch
set for all operations that use these intents.

Changelog:

Version 6:
- added backportable bug fix for inode fork data leak to start of series
- added backportable bug fix for intent shadow buffer leak.
- modified xlog_finish_iovec() rework to avoid unaligned format objects having
  the same data leak.
- fixed whitespace/long line issues.
- moved the released_space transaction reservation update into
  xlog_cil_insert_items to keep all the CIL commit transaction reservation and
  used space accounting in the one place.
- cleaned up AIL removal of intents on last release.
- factored whiteout cleanup in xlog_cil_push_work() into helper function.

Version 5:
- https://lore.kernel.org/linux-xfs/20220427022259.695399-1-david@fromorbit.com/
- rebased on 5.18-rc2 + linux-xfs/for-next
- converted transaction flags to unsigned to match tp->t_flags definition

Version 4:
- not published
- rebased on 5.17 + for-next + log shutdown fixes + xlog-write-rework
- fixed memory leak of CUI shadow buffers from log recovery when clearing
  leftover reflink entries.

Version 3:
- https://lore.kernel.org/linux-xfs/20220314220631.3093283-1-david@fromorbit.com/
- rebased on 5.17-rc4 + xlog-write-rework
- no longer dependent on xfs-cil-scalability, so there's some porting changes
  that was needed to remove all the per-cpu CIL dependencies.

Version 2:
- not published
- rebased on 5.15-rc2 + xfs-cil-scalability
- dropped the kvmalloc changes for CIL shadow buffers as that's a
  separate perf problem and not something related to intent
  whiteouts.
- dropped all the delayed attribute modifications so that the
  patchset is not dependent on Allison's dev tree.
- Thanks to Allison for an initial quick review pass - I haven't
  included those RVB tags because every patch in the series has
  changed since the original RFC posting.

RFC:
- https://lore.kernel.org/linux-xfs/20210909212133.GE2361455@dread.disaster.area/


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/10] xfs: zero inode fork buffer at allocation
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:41   ` Darrick J. Wong
                     ` (2 more replies)
  2022-05-03 22:17 ` [PATCH 02/10] xfs: fix potential log item leak Dave Chinner
                   ` (8 subsequent siblings)
  9 siblings, 3 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we first allocate or resize an inline inode fork, we round up
the allocation to 4 byte alingment to make journal alignment
constraints. We don't clear the unused bytes, so we can copy up to
three uninitialised bytes into the journal. Zero those bytes so we
only ever copy zeros into the journal.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_fork.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 9aee4a1e2fe9..a15ff38c3d41 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -50,8 +50,13 @@ xfs_init_local_fork(
 		mem_size++;
 
 	if (size) {
+		/*
+		 * As we round up the allocation here, we need to ensure the
+		 * bytes we don't copy data into are zeroed because the log
+		 * vectors still copy them into the journal.
+		 */
 		real_size = roundup(mem_size, 4);
-		ifp->if_u1.if_data = kmem_alloc(real_size, KM_NOFS);
+		ifp->if_u1.if_data = kmem_zalloc(real_size, KM_NOFS);
 		memcpy(ifp->if_u1.if_data, data, size);
 		if (zero_terminate)
 			ifp->if_u1.if_data[size] = '\0';
@@ -500,10 +505,11 @@ xfs_idata_realloc(
 	/*
 	 * For inline data, the underlying buffer must be a multiple of 4 bytes
 	 * in size so that it can be logged and stay on word boundaries.
-	 * We enforce that here.
+	 * We enforce that here, and use __GFP_ZERO to ensure that size
+	 * extensions always zero the unused roundup area.
 	 */
 	ifp->if_u1.if_data = krealloc(ifp->if_u1.if_data, roundup(new_size, 4),
-				      GFP_NOFS | __GFP_NOFAIL);
+				      GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
 	ifp->if_bytes = new_size;
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 02/10] xfs: fix potential log item leak
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
  2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:42   ` Alli
                     ` (2 more replies)
  2022-05-03 22:17 ` [PATCH 03/10] xfs: hide log iovec alignment constraints Dave Chinner
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Ever since we added shadown format buffers to the log items, log
items need to handle the item being released with shadow buffers
attached. Due to the fact this requirement was added at the same
time we added new rmap/reflink intents, we missed the cleanup of
those items.

In theory, this means shadow buffers can be leaked in a very small
window when a shutdown is initiated. Testing with KASAN shows this
leak does not happen in practice - we haven't identified a single
leak in several years of shutdown testing since ~v4.8 kernels.

However, the intent whiteout cleanup mechanism results in every
cancelled intent in exactly the same state as this tiny race window
creates and so if intents down clean up shadow buffers on final
release we will leak the shadow buffer for just about every intent
we create.

Hence we start with this patch to close this condition off and
ensure that when whiteouts start to be used we don't leak lots of
memory.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_item.c     | 2 ++
 fs/xfs/xfs_icreate_item.c  | 1 +
 fs/xfs/xfs_refcount_item.c | 2 ++
 fs/xfs/xfs_rmap_item.c     | 2 ++
 4 files changed, 7 insertions(+)

diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 593ac29cffc7..2c8b686e2a11 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -39,6 +39,7 @@ STATIC void
 xfs_bui_item_free(
 	struct xfs_bui_log_item	*buip)
 {
+	kmem_free(buip->bui_item.li_lv_shadow);
 	kmem_cache_free(xfs_bui_cache, buip);
 }
 
@@ -198,6 +199,7 @@ xfs_bud_item_release(
 	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
 
 	xfs_bui_release(budp->bud_buip);
+	kmem_free(budp->bud_item.li_lv_shadow);
 	kmem_cache_free(xfs_bud_cache, budp);
 }
 
diff --git a/fs/xfs/xfs_icreate_item.c b/fs/xfs/xfs_icreate_item.c
index 508e184e3b8f..b05314d48176 100644
--- a/fs/xfs/xfs_icreate_item.c
+++ b/fs/xfs/xfs_icreate_item.c
@@ -63,6 +63,7 @@ STATIC void
 xfs_icreate_item_release(
 	struct xfs_log_item	*lip)
 {
+	kmem_free(ICR_ITEM(lip)->ic_item.li_lv_shadow);
 	kmem_cache_free(xfs_icreate_cache, ICR_ITEM(lip));
 }
 
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 0d868c93144d..10474fe389e1 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -35,6 +35,7 @@ STATIC void
 xfs_cui_item_free(
 	struct xfs_cui_log_item	*cuip)
 {
+	kmem_free(cuip->cui_item.li_lv_shadow);
 	if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS)
 		kmem_free(cuip);
 	else
@@ -204,6 +205,7 @@ xfs_cud_item_release(
 	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
 
 	xfs_cui_release(cudp->cud_cuip);
+	kmem_free(cudp->cud_item.li_lv_shadow);
 	kmem_cache_free(xfs_cud_cache, cudp);
 }
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index a22b2d19ef91..6c0b56ebdbe1 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -35,6 +35,7 @@ STATIC void
 xfs_rui_item_free(
 	struct xfs_rui_log_item	*ruip)
 {
+	kmem_free(ruip->rui_item.li_lv_shadow);
 	if (ruip->rui_format.rui_nextents > XFS_RUI_MAX_FAST_EXTENTS)
 		kmem_free(ruip);
 	else
@@ -227,6 +228,7 @@ xfs_rud_item_release(
 	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
 
 	xfs_rui_release(rudp->rud_ruip);
+	kmem_free(rudp->rud_item.li_lv_shadow);
 	kmem_cache_free(xfs_rud_cache, rudp);
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 03/10] xfs: hide log iovec alignment constraints
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
  2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
  2022-05-03 22:17 ` [PATCH 02/10] xfs: fix potential log item leak Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:45   ` Darrick J. Wong
  2022-05-03 22:17 ` [PATCH 04/10] xfs: don't commit the first deferred transaction without intents Dave Chinner
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Callers currently have to round out the size of buffers to match the
aligment constraints of log iovecs and xlog_write(). They should not
need to know this detail, so introduce a new function to calculate
the iovec length (for use in ->iop_size implementations). Also
modify xlog_finish_iovec() to round up the length to the correct
alignment so the callers don't need to do this, either.

Convert the only user - inode forks - of this alignment rounding to
use the new interface.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_fork.c  | 20 ++++------------
 fs/xfs/xfs_inode_item.c         | 25 ++++++--------------
 fs/xfs/xfs_inode_item_recover.c |  4 ++--
 fs/xfs/xfs_log.h                | 42 ++++++++++++++++++++++++++++++---
 4 files changed, 52 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index a15ff38c3d41..1a4cdf550f6d 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -36,7 +36,7 @@ xfs_init_local_fork(
 	int64_t			size)
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
-	int			mem_size = size, real_size = 0;
+	int			mem_size = size;
 	bool			zero_terminate;
 
 	/*
@@ -50,13 +50,7 @@ xfs_init_local_fork(
 		mem_size++;
 
 	if (size) {
-		/*
-		 * As we round up the allocation here, we need to ensure the
-		 * bytes we don't copy data into are zeroed because the log
-		 * vectors still copy them into the journal.
-		 */
-		real_size = roundup(mem_size, 4);
-		ifp->if_u1.if_data = kmem_zalloc(real_size, KM_NOFS);
+		ifp->if_u1.if_data = kmem_alloc(mem_size, KM_NOFS);
 		memcpy(ifp->if_u1.if_data, data, size);
 		if (zero_terminate)
 			ifp->if_u1.if_data[size] = '\0';
@@ -502,14 +496,8 @@ xfs_idata_realloc(
 		return;
 	}
 
-	/*
-	 * For inline data, the underlying buffer must be a multiple of 4 bytes
-	 * in size so that it can be logged and stay on word boundaries.
-	 * We enforce that here, and use __GFP_ZERO to ensure that size
-	 * extensions always zero the unused roundup area.
-	 */
-	ifp->if_u1.if_data = krealloc(ifp->if_u1.if_data, roundup(new_size, 4),
-				      GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
+	ifp->if_u1.if_data = krealloc(ifp->if_u1.if_data, new_size,
+				      GFP_NOFS | __GFP_NOFAIL);
 	ifp->if_bytes = new_size;
 }
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 00733a18ccdc..721def0639fd 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -71,7 +71,7 @@ xfs_inode_item_data_fork_size(
 	case XFS_DINODE_FMT_LOCAL:
 		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
-			*nbytes += roundup(ip->i_df.if_bytes, 4);
+			*nbytes += xlog_calc_iovec_len(ip->i_df.if_bytes);
 			*nvecs += 1;
 		}
 		break;
@@ -112,7 +112,7 @@ xfs_inode_item_attr_fork_size(
 	case XFS_DINODE_FMT_LOCAL:
 		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
-			*nbytes += roundup(ip->i_afp->if_bytes, 4);
+			*nbytes += xlog_calc_iovec_len(ip->i_afp->if_bytes);
 			*nvecs += 1;
 		}
 		break;
@@ -204,17 +204,12 @@ xfs_inode_item_format_data_fork(
 			~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | XFS_ILOG_DEV);
 		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
 		    ip->i_df.if_bytes > 0) {
-			/*
-			 * Round i_bytes up to a word boundary.
-			 * The underlying memory is guaranteed
-			 * to be there by xfs_idata_realloc().
-			 */
-			data_bytes = roundup(ip->i_df.if_bytes, 4);
 			ASSERT(ip->i_df.if_u1.if_data != NULL);
 			ASSERT(ip->i_disk_size > 0);
 			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_ILOCAL,
-					ip->i_df.if_u1.if_data, data_bytes);
-			ilf->ilf_dsize = (unsigned)data_bytes;
+					ip->i_df.if_u1.if_data,
+					ip->i_df.if_bytes);
+			ilf->ilf_dsize = (unsigned)ip->i_df.if_bytes;
 			ilf->ilf_size++;
 		} else {
 			iip->ili_fields &= ~XFS_ILOG_DDATA;
@@ -288,17 +283,11 @@ xfs_inode_item_format_attr_fork(
 
 		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
 		    ip->i_afp->if_bytes > 0) {
-			/*
-			 * Round i_bytes up to a word boundary.
-			 * The underlying memory is guaranteed
-			 * to be there by xfs_idata_realloc().
-			 */
-			data_bytes = roundup(ip->i_afp->if_bytes, 4);
 			ASSERT(ip->i_afp->if_u1.if_data != NULL);
 			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IATTR_LOCAL,
 					ip->i_afp->if_u1.if_data,
-					data_bytes);
-			ilf->ilf_asize = (unsigned)data_bytes;
+					ip->i_afp->if_bytes);
+			ilf->ilf_asize = (unsigned)ip->i_afp->if_bytes;
 			ilf->ilf_size++;
 		} else {
 			iip->ili_fields &= ~XFS_ILOG_ADATA;
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index 6d44f5fd6d7e..d28ffaebd067 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -462,7 +462,7 @@ xlog_recover_inode_commit_pass2(
 	ASSERT(in_f->ilf_size <= 4);
 	ASSERT((in_f->ilf_size == 3) || (fields & XFS_ILOG_AFORK));
 	ASSERT(!(fields & XFS_ILOG_DFORK) ||
-	       (len == in_f->ilf_dsize));
+	       (len == xlog_calc_iovec_len(in_f->ilf_dsize)));
 
 	switch (fields & XFS_ILOG_DFORK) {
 	case XFS_ILOG_DDATA:
@@ -497,7 +497,7 @@ xlog_recover_inode_commit_pass2(
 		}
 		len = item->ri_buf[attr_index].i_len;
 		src = item->ri_buf[attr_index].i_addr;
-		ASSERT(len == in_f->ilf_asize);
+		ASSERT(len == xlog_calc_iovec_len(in_f->ilf_asize));
 
 		switch (in_f->ilf_fields & XFS_ILOG_AFORK) {
 		case XFS_ILOG_ADATA:
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 8dafe8f771c7..3a4f6a4e4eb7 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -21,23 +21,59 @@ struct xfs_log_vec {
 
 #define XFS_LOG_VEC_ORDERED	(-1)
 
+/*
+ * Calculate the log iovec length for a given user buffer length. Intended to be
+ * used by ->iop_size implementations when sizing buffers of arbitrary
+ * alignments.
+ */
+static inline int
+xlog_calc_iovec_len(int len)
+{
+	return roundup(len, sizeof(uint32_t));
+}
+
 void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 		uint type);
 
 static inline void
-xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
+xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec,
+		int data_len)
 {
 	struct xlog_op_header	*oph = vec->i_addr;
-
-	/* opheader tracks payload length, logvec tracks region length */
+	int			len;
+
+	/*
+	 * Always round up the length to the correct alignment so callers don't
+	 * need to know anything about this log vec layout requirement. This
+	 * means we have to zero the area the data to be written does not cover.
+	 * This is complicated by fact the payload region is offset into the
+	 * logvec region by the opheader that tracks the payload.
+	 */
+	len = xlog_calc_iovec_len(data_len);
+	if (len - data_len != 0) {
+		char	*buf = vec->i_addr + sizeof(struct xlog_op_header);
+
+		memset(buf + data_len, 0, len - data_len);
+	}
+
+	/*
+	 * The opheader tracks aligned payload length, whilst the logvec tracks
+	 * the overall region length.
+	 */
 	oph->oh_len = cpu_to_be32(len);
 
 	len += sizeof(struct xlog_op_header);
 	lv->lv_buf_len += len;
 	lv->lv_bytes += len;
 	vec->i_len = len;
+
+	/* Catch buffer overruns */
+	ASSERT((void *)lv->lv_buf + lv->lv_bytes <= (void *)lv + lv->lv_size);
 }
 
+/*
+ * Copy the amount of data requested by the caller into a new log iovec.
+ */
 static inline void *
 xlog_copy_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 		uint type, void *data, int len)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 04/10] xfs: don't commit the first deferred transaction without intents
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
                   ` (2 preceding siblings ...)
  2022-05-03 22:17 ` [PATCH 03/10] xfs: hide log iovec alignment constraints Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:17 ` [PATCH 05/10] xfs: add log item flags to indicate intents Dave Chinner
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

If the first operation in a string of defer ops has no intents,
then there is no reason to commit it before running the first call
to xfs_defer_finish_one(). This allows the defer ops to be used
effectively for non-intent based operations without requiring an
unnecessary extra transaction commit when first called.

This fixes a regression in per-attribute modification transaction
count when delayed attributes are not being used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 0805ade2d300..1aa32bfdf0cc 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -186,7 +186,7 @@ static const struct xfs_defer_op_type *defer_op_types[] = {
 	[XFS_DEFER_OPS_TYPE_AGFL_FREE]	= &xfs_agfl_free_defer_type,
 };
 
-static void
+static bool
 xfs_defer_create_intent(
 	struct xfs_trans		*tp,
 	struct xfs_defer_pending	*dfp,
@@ -197,6 +197,7 @@ xfs_defer_create_intent(
 	if (!dfp->dfp_intent)
 		dfp->dfp_intent = ops->create_intent(tp, &dfp->dfp_work,
 						     dfp->dfp_count, sort);
+	return dfp->dfp_intent != NULL;
 }
 
 /*
@@ -204,16 +205,18 @@ xfs_defer_create_intent(
  * associated extents, then add the entire intake list to the end of
  * the pending list.
  */
-STATIC void
+static bool
 xfs_defer_create_intents(
 	struct xfs_trans		*tp)
 {
 	struct xfs_defer_pending	*dfp;
+	bool				ret = false;
 
 	list_for_each_entry(dfp, &tp->t_dfops, dfp_list) {
 		trace_xfs_defer_create_intent(tp->t_mountp, dfp);
-		xfs_defer_create_intent(tp, dfp, true);
+		ret |= xfs_defer_create_intent(tp, dfp, true);
 	}
+	return ret;
 }
 
 /* Abort all the intents that were committed. */
@@ -487,7 +490,7 @@ int
 xfs_defer_finish_noroll(
 	struct xfs_trans		**tp)
 {
-	struct xfs_defer_pending	*dfp;
+	struct xfs_defer_pending	*dfp = NULL;
 	int				error = 0;
 	LIST_HEAD(dop_pending);
 
@@ -506,17 +509,20 @@ xfs_defer_finish_noroll(
 		 * of time that any one intent item can stick around in memory,
 		 * pinning the log tail.
 		 */
-		xfs_defer_create_intents(*tp);
+		bool has_intents = xfs_defer_create_intents(*tp);
+
 		list_splice_init(&(*tp)->t_dfops, &dop_pending);
 
-		error = xfs_defer_trans_roll(tp);
-		if (error)
-			goto out_shutdown;
+		if (has_intents || dfp) {
+			error = xfs_defer_trans_roll(tp);
+			if (error)
+				goto out_shutdown;
 
-		/* Possibly relog intent items to keep the log moving. */
-		error = xfs_defer_relog(tp, &dop_pending);
-		if (error)
-			goto out_shutdown;
+			/* Relog intent items to keep the log moving. */
+			error = xfs_defer_relog(tp, &dop_pending);
+			if (error)
+				goto out_shutdown;
+		}
 
 		dfp = list_first_entry(&dop_pending, struct xfs_defer_pending,
 				       dfp_list);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 05/10] xfs: add log item flags to indicate intents
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
                   ` (3 preceding siblings ...)
  2022-05-03 22:17 ` [PATCH 04/10] xfs: don't commit the first deferred transaction without intents Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:17 ` [PATCH 06/10] xfs: tag transactions that contain intent done items Dave Chinner
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We currently have a couple of helper functions that try to infer
whether the log item is an intent or intent done item from the
combinations of operations it supports.  This is incredibly fragile
and not very efficient as it requires checking specific combinations
of ops.

We need to be able to identify intent and intent done items quickly
and easily in upcoming patches, so simply add intent and intent done
type flags to the log item ops flags. These are static flags to
begin with, so intent items should have been typed like this from
the start.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_item.c     |  4 +++-
 fs/xfs/xfs_extfree_item.c  |  4 +++-
 fs/xfs/xfs_refcount_item.c |  4 +++-
 fs/xfs/xfs_rmap_item.c     |  4 +++-
 fs/xfs/xfs_trans.h         | 25 +++++++++++++------------
 5 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 2c8b686e2a11..0e0aae83308c 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -204,7 +204,8 @@ xfs_bud_item_release(
 }
 
 static const struct xfs_item_ops xfs_bud_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_bud_item_size,
 	.iop_format	= xfs_bud_item_format,
 	.iop_release	= xfs_bud_item_release,
@@ -588,6 +589,7 @@ xfs_bui_item_relog(
 }
 
 static const struct xfs_item_ops xfs_bui_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_bui_item_size,
 	.iop_format	= xfs_bui_item_format,
 	.iop_unpin	= xfs_bui_item_unpin,
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 0e50f2c9348e..21a159f9d8c5 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -307,7 +307,8 @@ xfs_efd_item_release(
 }
 
 static const struct xfs_item_ops xfs_efd_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_efd_item_size,
 	.iop_format	= xfs_efd_item_format,
 	.iop_release	= xfs_efd_item_release,
@@ -688,6 +689,7 @@ xfs_efi_item_relog(
 }
 
 static const struct xfs_item_ops xfs_efi_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_efi_item_size,
 	.iop_format	= xfs_efi_item_format,
 	.iop_unpin	= xfs_efi_item_unpin,
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 10474fe389e1..71225d094e03 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -210,7 +210,8 @@ xfs_cud_item_release(
 }
 
 static const struct xfs_item_ops xfs_cud_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_cud_item_size,
 	.iop_format	= xfs_cud_item_format,
 	.iop_release	= xfs_cud_item_release,
@@ -602,6 +603,7 @@ xfs_cui_item_relog(
 }
 
 static const struct xfs_item_ops xfs_cui_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_cui_item_size,
 	.iop_format	= xfs_cui_item_format,
 	.iop_unpin	= xfs_cui_item_unpin,
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 6c0b56ebdbe1..6ecbc37e4b8d 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -233,7 +233,8 @@ xfs_rud_item_release(
 }
 
 static const struct xfs_item_ops xfs_rud_item_ops = {
-	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_rud_item_size,
 	.iop_format	= xfs_rud_item_format,
 	.iop_release	= xfs_rud_item_release,
@@ -632,6 +633,7 @@ xfs_rui_item_relog(
 }
 
 static const struct xfs_item_ops xfs_rui_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
 	.iop_size	= xfs_rui_item_size,
 	.iop_format	= xfs_rui_item_format,
 	.iop_unpin	= xfs_rui_item_unpin,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 87e940b5366e..f68e74e46026 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -80,28 +80,29 @@ struct xfs_item_ops {
 			struct xfs_trans *tp);
 };
 
-/* Is this log item a deferred action intent? */
+/*
+ * Log item ops flags
+ */
+/*
+ * Release the log item when the journal commits instead of inserting into the
+ * AIL for writeback tracking and/or log tail pinning.
+ */
+#define XFS_ITEM_RELEASE_WHEN_COMMITTED	(1 << 0)
+#define XFS_ITEM_INTENT			(1 << 1)
+#define XFS_ITEM_INTENT_DONE		(1 << 2)
+
 static inline bool
 xlog_item_is_intent(struct xfs_log_item *lip)
 {
-	return lip->li_ops->iop_recover != NULL &&
-	       lip->li_ops->iop_match != NULL;
+	return lip->li_ops->flags & XFS_ITEM_INTENT;
 }
 
-/* Is this a log intent-done item? */
 static inline bool
 xlog_item_is_intent_done(struct xfs_log_item *lip)
 {
-	return lip->li_ops->iop_unpin == NULL &&
-	       lip->li_ops->iop_push == NULL;
+	return lip->li_ops->flags & XFS_ITEM_INTENT_DONE;
 }
 
-/*
- * Release the log item as soon as committed.  This is for items just logging
- * intents that never need to be written back in place.
- */
-#define XFS_ITEM_RELEASE_WHEN_COMMITTED	(1 << 0)
-
 void	xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item,
 			  int type, const struct xfs_item_ops *ops);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 06/10] xfs: tag transactions that contain intent done items
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
                   ` (4 preceding siblings ...)
  2022-05-03 22:17 ` [PATCH 05/10] xfs: add log item flags to indicate intents Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:17 ` [PATCH 07/10] xfs: factor and move some code in xfs_log_cil.c Dave Chinner
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Intent whiteouts will require extra work to be done during
transaction commit if the transaction contains an intent done item.

To determine if a transaction contains an intent done item, we want
to avoid having to walk all the items in the transaction to check if
they are intent done items. Hence when we add an intent done item to
a transaction, tag the transaction to indicate that it contains such
an item.

We don't tag the transaction when the defer ops is relogging an
intent to move it forward in the log. Whiteouts will never apply to
these cases, so we don't need to bother looking for them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_shared.h | 24 +++++++++++++++++-------
 fs/xfs/xfs_bmap_item.c     |  2 +-
 fs/xfs/xfs_extfree_item.c  |  2 +-
 fs/xfs/xfs_refcount_item.c |  2 +-
 fs/xfs/xfs_rmap_item.c     |  2 +-
 5 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 25c4cab58851..c4381388c0c1 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -54,13 +54,23 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 /*
  * Values for t_flags.
  */
-#define	XFS_TRANS_DIRTY		0x01	/* something needs to be logged */
-#define	XFS_TRANS_SB_DIRTY	0x02	/* superblock is modified */
-#define	XFS_TRANS_PERM_LOG_RES	0x04	/* xact took a permanent log res */
-#define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
-#define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
-#define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
-#define XFS_TRANS_RES_FDBLKS	0x80	/* reserve newly freed blocks */
+/* Transaction needs to be logged */
+#define XFS_TRANS_DIRTY			(1u << 0)
+/* Superblock is dirty and needs to be logged */
+#define XFS_TRANS_SB_DIRTY		(1u << 1)
+/* Transaction took a permanent log reservation */
+#define XFS_TRANS_PERM_LOG_RES		(1u << 2)
+/* Synchronous transaction commit needed */
+#define XFS_TRANS_SYNC			(1u << 3)
+/* Transaction can use reserve block pool */
+#define XFS_TRANS_RESERVE		(1u << 4)
+/* Transaction should avoid VFS level superblock write accounting */
+#define XFS_TRANS_NO_WRITECOUNT		(1u << 5)
+/* Transaction has freed blocks returned to it's reservation */
+#define XFS_TRANS_RES_FDBLKS		(1u << 6)
+/* Transaction contains an intent done log item */
+#define XFS_TRANS_HAS_INTENT_DONE	(1u << 7)
+
 /*
  * LOWMODE is used by the allocator to activate the lowspace algorithm - when
  * free space is running low the extent allocator may choose to allocate an
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 0e0aae83308c..3d1fa8edf28f 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -257,7 +257,7 @@ xfs_trans_log_finish_bmap_update(
 	 * 1.) releases the BUI and frees the BUD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &budp->bud_item.li_flags);
 
 	return error;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 21a159f9d8c5..96735f23d12d 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -381,7 +381,7 @@ xfs_trans_free_extent(
 	 * 1.) releases the EFI and frees the EFD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
 
 	next_extent = efdp->efd_next_extent;
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 71225d094e03..b37a9d2ce652 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -262,7 +262,7 @@ xfs_trans_log_finish_refcount_update(
 	 * 1.) releases the CUI and frees the CUD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &cudp->cud_item.li_flags);
 
 	return error;
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 6ecbc37e4b8d..5221fd1e6f6f 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -330,7 +330,7 @@ xfs_trans_log_finish_rmap_update(
 	 * 1.) releases the RUI and frees the RUD
 	 * 2.) shuts down the filesystem
 	 */
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &rudp->rud_item.li_flags);
 
 	return error;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 07/10] xfs: factor and move some code in xfs_log_cil.c
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
                   ` (5 preceding siblings ...)
  2022-05-03 22:17 ` [PATCH 06/10] xfs: tag transactions that contain intent done items Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:17 ` [PATCH 08/10] xfs: add log item method to return related intents Dave Chinner
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In preparation for adding support for intent item whiteouts.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/xfs_log_cil.c | 119 ++++++++++++++++++++++++-------------------
 1 file changed, 67 insertions(+), 52 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index e5ab62f08c19..0d8d092447ad 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -47,6 +47,38 @@ xlog_cil_ticket_alloc(
 	return tic;
 }
 
+/*
+ * Check if the current log item was first committed in this sequence.
+ * We can't rely on just the log item being in the CIL, we have to check
+ * the recorded commit sequence number.
+ *
+ * Note: for this to be used in a non-racy manner, it has to be called with
+ * CIL flushing locked out. As a result, it should only be used during the
+ * transaction commit process when deciding what to format into the item.
+ */
+static bool
+xlog_item_in_current_chkpt(
+	struct xfs_cil		*cil,
+	struct xfs_log_item	*lip)
+{
+	if (list_empty(&lip->li_cil))
+		return false;
+
+	/*
+	 * li_seq is written on the first commit of a log item to record the
+	 * first checkpoint it is written to. Hence if it is different to the
+	 * current sequence, we're in a new checkpoint.
+	 */
+	return lip->li_seq == READ_ONCE(cil->xc_current_sequence);
+}
+
+bool
+xfs_log_item_in_current_chkpt(
+	struct xfs_log_item *lip)
+{
+	return xlog_item_in_current_chkpt(lip->li_log->l_cilp, lip);
+}
+
 /*
  * Unavoidable forward declaration - xlog_cil_push_work() calls
  * xlog_cil_ctx_alloc() itself.
@@ -934,6 +966,40 @@ xlog_cil_build_trans_hdr(
 	tic->t_curr_res -= lvhdr->lv_bytes;
 }
 
+/*
+ * Pull all the log vectors off the items in the CIL, and remove the items from
+ * the CIL. We don't need the CIL lock here because it's only needed on the
+ * transaction commit side which is currently locked out by the flush lock.
+ */
+static void
+xlog_cil_build_lv_chain(
+	struct xfs_cil		*cil,
+	struct xfs_cil_ctx	*ctx,
+	uint32_t		*num_iovecs,
+	uint32_t		*num_bytes)
+{
+	struct xfs_log_vec	*lv = NULL;
+
+	while (!list_empty(&cil->xc_cil)) {
+		struct xfs_log_item	*item;
+
+		item = list_first_entry(&cil->xc_cil,
+					struct xfs_log_item, li_cil);
+		list_del_init(&item->li_cil);
+		if (!ctx->lv_chain)
+			ctx->lv_chain = item->li_lv;
+		else
+			lv->lv_next = item->li_lv;
+		lv = item->li_lv;
+		item->li_lv = NULL;
+		*num_iovecs += lv->lv_niovecs;
+
+		/* we don't write ordered log vectors */
+		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
+			*num_bytes += lv->lv_bytes;
+	}
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -956,7 +1022,6 @@ xlog_cil_push_work(
 		container_of(work, struct xfs_cil_ctx, push_work);
 	struct xfs_cil		*cil = ctx->cil;
 	struct xlog		*log = cil->xc_log;
-	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*new_ctx;
 	int			num_iovecs = 0;
 	int			num_bytes = 0;
@@ -1033,31 +1098,7 @@ xlog_cil_push_work(
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
 
-	/*
-	 * Pull all the log vectors off the items in the CIL, and remove the
-	 * items from the CIL. We don't need the CIL lock here because it's only
-	 * needed on the transaction commit side which is currently locked out
-	 * by the flush lock.
-	 */
-	lv = NULL;
-	while (!list_empty(&cil->xc_cil)) {
-		struct xfs_log_item	*item;
-
-		item = list_first_entry(&cil->xc_cil,
-					struct xfs_log_item, li_cil);
-		list_del_init(&item->li_cil);
-		if (!ctx->lv_chain)
-			ctx->lv_chain = item->li_lv;
-		else
-			lv->lv_next = item->li_lv;
-		lv = item->li_lv;
-		item->li_lv = NULL;
-		num_iovecs += lv->lv_niovecs;
-
-		/* we don't write ordered log vectors */
-		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
-			num_bytes += lv->lv_bytes;
-	}
+	xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes);
 
 	/*
 	 * Switch the contexts so we can drop the context lock and move out
@@ -1508,32 +1549,6 @@ xlog_cil_force_seq(
 	return 0;
 }
 
-/*
- * Check if the current log item was first committed in this sequence.
- * We can't rely on just the log item being in the CIL, we have to check
- * the recorded commit sequence number.
- *
- * Note: for this to be used in a non-racy manner, it has to be called with
- * CIL flushing locked out. As a result, it should only be used during the
- * transaction commit process when deciding what to format into the item.
- */
-bool
-xfs_log_item_in_current_chkpt(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_cil		*cil = lip->li_log->l_cilp;
-
-	if (list_empty(&lip->li_cil))
-		return false;
-
-	/*
-	 * li_seq is written on the first commit of a log item to record the
-	 * first checkpoint it is written to. Hence if it is different to the
-	 * current sequence, we're in a new checkpoint.
-	 */
-	return lip->li_seq == READ_ONCE(cil->xc_current_sequence);
-}
-
 /*
  * Perform initial CIL structure initialisation.
  */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 08/10] xfs: add log item method to return related intents
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
                   ` (6 preceding siblings ...)
  2022-05-03 22:17 ` [PATCH 07/10] xfs: factor and move some code in xfs_log_cil.c Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:17 ` [PATCH 09/10] xfs: whiteouts release intents that are not in the AIL Dave Chinner
  2022-05-03 22:17 ` [PATCH 10/10] xfs: intent item whiteouts Dave Chinner
  9 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To apply a whiteout to an intent item when an intent done item is
committed, we need to be able to retrieve the intent item from the
the intent done item. Add a log item op method for doing this, and
wire all the intent done items up to it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_bmap_item.c     | 8 ++++++++
 fs/xfs/xfs_extfree_item.c  | 8 ++++++++
 fs/xfs/xfs_refcount_item.c | 8 ++++++++
 fs/xfs/xfs_rmap_item.c     | 8 ++++++++
 fs/xfs/xfs_trans.h         | 1 +
 5 files changed, 33 insertions(+)

diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 3d1fa8edf28f..f05663fdb6ff 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -203,12 +203,20 @@ xfs_bud_item_release(
 	kmem_cache_free(xfs_bud_cache, budp);
 }
 
+static struct xfs_log_item *
+xfs_bud_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &BUD_ITEM(lip)->bud_buip->bui_item;
+}
+
 static const struct xfs_item_ops xfs_bud_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_bud_item_size,
 	.iop_format	= xfs_bud_item_format,
 	.iop_release	= xfs_bud_item_release,
+	.iop_intent	= xfs_bud_item_intent,
 };
 
 static struct xfs_bud_log_item *
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 96735f23d12d..032db5269e97 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -306,12 +306,20 @@ xfs_efd_item_release(
 	xfs_efd_item_free(efdp);
 }
 
+static struct xfs_log_item *
+xfs_efd_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &EFD_ITEM(lip)->efd_efip->efi_item;
+}
+
 static const struct xfs_item_ops xfs_efd_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_efd_item_size,
 	.iop_format	= xfs_efd_item_format,
 	.iop_release	= xfs_efd_item_release,
+	.iop_intent	= xfs_efd_item_intent,
 };
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index b37a9d2ce652..57a025f5fd4b 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -209,12 +209,20 @@ xfs_cud_item_release(
 	kmem_cache_free(xfs_cud_cache, cudp);
 }
 
+static struct xfs_log_item *
+xfs_cud_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &CUD_ITEM(lip)->cud_cuip->cui_item;
+}
+
 static const struct xfs_item_ops xfs_cud_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_cud_item_size,
 	.iop_format	= xfs_cud_item_format,
 	.iop_release	= xfs_cud_item_release,
+	.iop_intent	= xfs_cud_item_intent,
 };
 
 static struct xfs_cud_log_item *
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 5221fd1e6f6f..1c7d8518cb48 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -232,12 +232,20 @@ xfs_rud_item_release(
 	kmem_cache_free(xfs_rud_cache, rudp);
 }
 
+static struct xfs_log_item *
+xfs_rud_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &RUD_ITEM(lip)->rud_ruip->rui_item;
+}
+
 static const struct xfs_item_ops xfs_rud_item_ops = {
 	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
 			  XFS_ITEM_INTENT_DONE,
 	.iop_size	= xfs_rud_item_size,
 	.iop_format	= xfs_rud_item_format,
 	.iop_release	= xfs_rud_item_release,
+	.iop_intent	= xfs_rud_item_intent,
 };
 
 static struct xfs_rud_log_item *
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f68e74e46026..d72a5995d33e 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -78,6 +78,7 @@ struct xfs_item_ops {
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 	struct xfs_log_item *(*iop_relog)(struct xfs_log_item *intent,
 			struct xfs_trans *tp);
+	struct xfs_log_item *(*iop_intent)(struct xfs_log_item *intent_done);
 };
 
 /*
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 09/10] xfs: whiteouts release intents that are not in the AIL
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
                   ` (7 preceding siblings ...)
  2022-05-03 22:17 ` [PATCH 08/10] xfs: add log item method to return related intents Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-10 12:49   ` Christoph Hellwig
  2022-05-03 22:17 ` [PATCH 10/10] xfs: intent item whiteouts Dave Chinner
  9 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we release an intent that a whiteout applies to, it will not
have been committed to the journal and so won't be in the AIL. Hence
when we drop the last reference to the intent, we do not want to try
to remove it from the AIL as that will trigger a filesystem
shutdown. Hence make the removal of intents from the AIL conditional
on them actually being in the AIL so we do the correct thing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_item.c     | 9 +++++----
 fs/xfs/xfs_extfree_item.c  | 9 +++++----
 fs/xfs/xfs_refcount_item.c | 9 +++++----
 fs/xfs/xfs_rmap_item.c     | 9 +++++----
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index f05663fdb6ff..51f66e982484 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -55,10 +55,11 @@ xfs_bui_release(
 	struct xfs_bui_log_item	*buip)
 {
 	ASSERT(atomic_read(&buip->bui_refcount) > 0);
-	if (atomic_dec_and_test(&buip->bui_refcount)) {
-		xfs_trans_ail_delete(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_bui_item_free(buip);
-	}
+	if (!atomic_dec_and_test(&buip->bui_refcount))
+		return;
+
+	xfs_trans_ail_delete(&buip->bui_item, 0);
+	xfs_bui_item_free(buip);
 }
 
 
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 032db5269e97..765be054dffe 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -58,10 +58,11 @@ xfs_efi_release(
 	struct xfs_efi_log_item	*efip)
 {
 	ASSERT(atomic_read(&efip->efi_refcount) > 0);
-	if (atomic_dec_and_test(&efip->efi_refcount)) {
-		xfs_trans_ail_delete(&efip->efi_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_efi_item_free(efip);
-	}
+	if (!atomic_dec_and_test(&efip->efi_refcount))
+		return;
+
+	xfs_trans_ail_delete(&efip->efi_item, 0);
+	xfs_efi_item_free(efip);
 }
 
 /*
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 57a025f5fd4b..7e97bf19793d 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -54,10 +54,11 @@ xfs_cui_release(
 	struct xfs_cui_log_item	*cuip)
 {
 	ASSERT(atomic_read(&cuip->cui_refcount) > 0);
-	if (atomic_dec_and_test(&cuip->cui_refcount)) {
-		xfs_trans_ail_delete(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_cui_item_free(cuip);
-	}
+	if (!atomic_dec_and_test(&cuip->cui_refcount))
+		return;
+
+	xfs_trans_ail_delete(&cuip->cui_item, 0);
+	xfs_cui_item_free(cuip);
 }
 
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 1c7d8518cb48..fef92e02f3bb 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -54,10 +54,11 @@ xfs_rui_release(
 	struct xfs_rui_log_item	*ruip)
 {
 	ASSERT(atomic_read(&ruip->rui_refcount) > 0);
-	if (atomic_dec_and_test(&ruip->rui_refcount)) {
-		xfs_trans_ail_delete(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
-		xfs_rui_item_free(ruip);
-	}
+	if (!atomic_dec_and_test(&ruip->rui_refcount))
+		return;
+
+	xfs_trans_ail_delete(&ruip->rui_item, 0);
+	xfs_rui_item_free(ruip);
 }
 
 STATIC void
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 10/10] xfs: intent item whiteouts
  2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
                   ` (8 preceding siblings ...)
  2022-05-03 22:17 ` [PATCH 09/10] xfs: whiteouts release intents that are not in the AIL Dave Chinner
@ 2022-05-03 22:17 ` Dave Chinner
  2022-05-03 22:42   ` Alli
  2022-05-03 22:50   ` Darrick J. Wong
  9 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 22:17 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we log modifications based on intents, we add both intent
and intent done items to the modification being made. These get
written to the log to ensure that the operation is re-run if the
intent done is not found in the log.

However, for operations that complete wholly within a single
checkpoint, the change in the checkpoint is atomic and will never
need replay. In this case, we don't need to actually write the
intent and intent done items to the journal because log recovery
will never need to manually restart this modification.

Log recovery currently handles intent/intent done matching by
inserting the intent into the AIL, then removing it when a matching
intent done item is found. Hence for all the intent-based operations
that complete within a checkpoint, we spend all that time parsing
the intent/intent done items just to cancel them and do nothing with
them.

Hence it follows that the only time we actually need intents in the
log is when the modification crosses checkpoint boundaries in the
log and so may only be partially complete in the journal. Hence if
we commit and intent done item to the CIL and the intent item is in
the same checkpoint, we don't actually have to write them to the
journal because log recovery will always cancel the intents.

We've never really worried about the overhead of logging intents
unnecessarily like this because the intents we log are generally
very much smaller than the change being made. e.g. freeing an extent
involves modifying at lease two freespace btree blocks and the AGF,
so the EFI/EFD overhead is only a small increase in space and
processing time compared to the overall cost of freeing an extent.

However, delayed attributes change this cost equation dramatically,
especially for inline attributes. In the case of adding an inline
attribute, we only log the inode core and attribute fork at present.
With delayed attributes, we now log the attr intent which includes
the name and value, the inode core adn attr fork, and finally the
attr intent done item. We increase the number of items we log from 1
to 3, and the number of log vectors (regions) goes up from 3 to 7.
Hence we tripple the number of objects that the CIL has to process,
and more than double the number of log vectors that need to be
written to the journal.

At scale, this means delayed attributes cause a non-pipelined CIL to
become CPU bound processing all the extra items, resulting in a > 40%
performance degradation on 16-way file+xattr create worklaods.
Pipelining the CIL (as per 5.15) reduces the performance degradation
to 20%, but now the limitation is the rate at which the log items
can be written to the iclogs and iclogs be dispatched for IO and
completed.

Even log IO completion is slowed down by these intents, because it
now has to process 3x the number of items in the checkpoint.
Processing completed intents is especially inefficient here, because
we first insert the intent into the AIL, then remove it from the AIL
when the intent done is processed. IOWs, we are also doing expensive
operations in log IO completion we could completely avoid if we
didn't log completed intent/intent done pairs.

Enter log item whiteouts.

When an intent done is committed, we can check to see if the
associated intent is in the same checkpoint as we are currently
committing the intent done to. If so, we can mark the intent log
item with a whiteout and immediately free the intent done item
rather than committing it to the CIL. We can basically skip the
entire formatting and CIL insertion steps for the intent done item.

However, we cannot remove the intent item from the CIL at this point
because the unlocked per-cpu CIL item lists do not permit removal
without holding the CIL context lock exclusively. Transaction commit
only holds the context lock shared, hence the best we can do is mark
the intent item with a whiteout so that the CIL push can release it
rather than writing it to the log.

This means we never write the intent to the log if the intent done
has also been committed to the same checkpoint, but we'll always
write the intent if the intent done has not been committed or has
been committed to a different checkpoint. This will result in
correct log recovery behaviour in all cases, without the overhead of
logging unnecessary intents.

This intent whiteout concept is generic - we can apply it to all
intent/intent done pairs that have a direct 1:1 relationship. The
way deferred ops iterate and relog intents mean that all intents
currently have a 1:1 relationship with their done intent, and hence
we can apply this cancellation to all existing intent/intent done
implementations.

For delayed attributes with a 16-way 64kB xattr create workload,
whiteouts reduce the amount of journalled metadata from ~2.5GB/s
down to ~600MB/s and improve the creation rate from 9000/s to
14000/s.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log_cil.c | 78 ++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_trace.h   |  3 ++
 fs/xfs/xfs_trans.h   |  6 ++--
 3 files changed, 82 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 0d8d092447ad..fecd2ea3e935 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -476,7 +476,8 @@ xlog_cil_insert_format_items(
 static void
 xlog_cil_insert_items(
 	struct xlog		*log,
-	struct xfs_trans	*tp)
+	struct xfs_trans	*tp,
+	uint32_t		released_space)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
@@ -525,7 +526,9 @@ xlog_cil_insert_items(
 		ASSERT(tp->t_ticket->t_curr_res >= len);
 	}
 	tp->t_ticket->t_curr_res -= len;
+	tp->t_ticket->t_curr_res += released_space;
 	ctx->space_used += len;
+	ctx->space_used -= released_space;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
@@ -970,11 +973,16 @@ xlog_cil_build_trans_hdr(
  * Pull all the log vectors off the items in the CIL, and remove the items from
  * the CIL. We don't need the CIL lock here because it's only needed on the
  * transaction commit side which is currently locked out by the flush lock.
+ *
+ * If a log item is marked with a whiteout, we do not need to write it to the
+ * journal and so we just move them to the whiteout list for the caller to
+ * dispose of appropriately.
  */
 static void
 xlog_cil_build_lv_chain(
 	struct xfs_cil		*cil,
 	struct xfs_cil_ctx	*ctx,
+	struct list_head	*whiteouts,
 	uint32_t		*num_iovecs,
 	uint32_t		*num_bytes)
 {
@@ -985,6 +993,13 @@ xlog_cil_build_lv_chain(
 
 		item = list_first_entry(&cil->xc_cil,
 					struct xfs_log_item, li_cil);
+
+		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
+			list_move(&item->li_cil, whiteouts);
+			trace_xfs_cil_whiteout_skip(item);
+			continue;
+		}
+
 		list_del_init(&item->li_cil);
 		if (!ctx->lv_chain)
 			ctx->lv_chain = item->li_lv;
@@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain(
 	}
 }
 
+static void
+xlog_cil_push_cleanup_whiteouts(
+	struct list_head	*whiteouts)
+{
+	while (!list_empty(whiteouts)) {
+		struct xfs_log_item *item = list_first_entry(whiteouts,
+						struct xfs_log_item, li_cil);
+		list_del_init(&item->li_cil);
+		trace_xfs_cil_whiteout_unpin(item);
+		item->li_ops->iop_unpin(item, 1);
+	}
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -1030,6 +1058,7 @@ xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_csn_t		push_seq;
 	bool			push_commit_stable;
+	LIST_HEAD		(whiteouts);
 
 	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -1098,7 +1127,7 @@ xlog_cil_push_work(
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
 
-	xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes);
+	xlog_cil_build_lv_chain(cil, ctx, &whiteouts, &num_iovecs, &num_bytes);
 
 	/*
 	 * Switch the contexts so we can drop the context lock and move out
@@ -1201,6 +1230,7 @@ xlog_cil_push_work(
 	/* Not safe to reference ctx now! */
 
 	spin_unlock(&log->l_icloglock);
+	xlog_cil_push_cleanup_whiteouts(&whiteouts);
 	return;
 
 out_skip:
@@ -1212,6 +1242,7 @@ xlog_cil_push_work(
 out_abort_free_ticket:
 	xfs_log_ticket_ungrant(log, ctx->ticket);
 	ASSERT(xlog_is_shutdown(log));
+	xlog_cil_push_cleanup_whiteouts(&whiteouts);
 	if (!ctx->commit_iclog) {
 		xlog_cil_committed(ctx);
 		return;
@@ -1360,6 +1391,43 @@ xlog_cil_empty(
 	return empty;
 }
 
+/*
+ * If there are intent done items in this transaction and the related intent was
+ * committed in the current (same) CIL checkpoint, we don't need to write either
+ * the intent or intent done item to the journal as the change will be
+ * journalled atomically within this checkpoint. As we cannot remove items from
+ * the CIL here, mark the related intent with a whiteout so that the CIL push
+ * can remove it rather than writing it to the journal. Then remove the intent
+ * done item from the current transaction and release it so it doesn't get put
+ * into the CIL at all.
+ */
+static uint32_t
+xlog_cil_process_intents(
+	struct xfs_cil		*cil,
+	struct xfs_trans	*tp)
+{
+	struct xfs_log_item	*lip, *ilip, *next;
+	uint32_t		len = 0;
+
+	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
+		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
+			continue;
+
+		ilip = lip->li_ops->iop_intent(lip);
+		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
+			continue;
+		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
+		trace_xfs_cil_whiteout_mark(ilip);
+		len += ilip->li_lv->lv_bytes;
+		kmem_free(ilip->li_lv);
+		ilip->li_lv = NULL;
+
+		xfs_trans_del_item(lip);
+		lip->li_ops->iop_release(lip);
+	}
+	return len;
+}
+
 /*
  * Commit a transaction with the given vector to the Committed Item List.
  *
@@ -1382,6 +1450,7 @@ xlog_cil_commit(
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_log_item	*lip, *next;
+	uint32_t		released_space = 0;
 
 	/*
 	 * Do all necessary memory allocation before we lock the CIL.
@@ -1393,7 +1462,10 @@ xlog_cil_commit(
 	/* lock out background commit */
 	down_read(&cil->xc_ctx_lock);
 
-	xlog_cil_insert_items(log, tp);
+	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
+		released_space = xlog_cil_process_intents(cil, tp);
+
+	xlog_cil_insert_items(log, tp, released_space);
 
 	if (regrant && !xlog_is_shutdown(log))
 		xfs_log_ticket_regrant(log, tp->t_ticket);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e1197f9ad97e..75934e3c3f55 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
+DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
 
 DECLARE_EVENT_CLASS(xfs_ail_class,
 	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index d72a5995d33e..9561f193e7e1 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -55,13 +55,15 @@ struct xfs_log_item {
 #define	XFS_LI_IN_AIL	0
 #define	XFS_LI_ABORTED	1
 #define	XFS_LI_FAILED	2
-#define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
+#define	XFS_LI_DIRTY	3
+#define	XFS_LI_WHITEOUT	4
 
 #define XFS_LI_FLAGS \
 	{ (1u << XFS_LI_IN_AIL),	"IN_AIL" }, \
 	{ (1u << XFS_LI_ABORTED),	"ABORTED" }, \
 	{ (1u << XFS_LI_FAILED),	"FAILED" }, \
-	{ (1u << XFS_LI_DIRTY),		"DIRTY" }
+	{ (1u << XFS_LI_DIRTY),		"DIRTY" }, \
+	{ (1u << XFS_LI_WHITEOUT),	"WHITEOUT" }
 
 struct xfs_item_ops {
 	unsigned flags;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 01/10] xfs: zero inode fork buffer at allocation
  2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
@ 2022-05-03 22:41   ` Darrick J. Wong
  2022-05-03 22:42   ` Alli
  2022-05-10 12:47   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2022-05-03 22:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 04, 2022 at 08:17:19AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we first allocate or resize an inline inode fork, we round up
> the allocation to 4 byte alingment to make journal alignment
> constraints. We don't clear the unused bytes, so we can copy up to
> three uninitialised bytes into the journal. Zero those bytes so we
> only ever copy zeros into the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Yup.  Thanks for the cleanup.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_inode_fork.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 9aee4a1e2fe9..a15ff38c3d41 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -50,8 +50,13 @@ xfs_init_local_fork(
>  		mem_size++;
>  
>  	if (size) {
> +		/*
> +		 * As we round up the allocation here, we need to ensure the
> +		 * bytes we don't copy data into are zeroed because the log
> +		 * vectors still copy them into the journal.
> +		 */
>  		real_size = roundup(mem_size, 4);
> -		ifp->if_u1.if_data = kmem_alloc(real_size, KM_NOFS);
> +		ifp->if_u1.if_data = kmem_zalloc(real_size, KM_NOFS);
>  		memcpy(ifp->if_u1.if_data, data, size);
>  		if (zero_terminate)
>  			ifp->if_u1.if_data[size] = '\0';
> @@ -500,10 +505,11 @@ xfs_idata_realloc(
>  	/*
>  	 * For inline data, the underlying buffer must be a multiple of 4 bytes
>  	 * in size so that it can be logged and stay on word boundaries.
> -	 * We enforce that here.
> +	 * We enforce that here, and use __GFP_ZERO to ensure that size
> +	 * extensions always zero the unused roundup area.
>  	 */
>  	ifp->if_u1.if_data = krealloc(ifp->if_u1.if_data, roundup(new_size, 4),
> -				      GFP_NOFS | __GFP_NOFAIL);
> +				      GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
>  	ifp->if_bytes = new_size;
>  }
>  
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 02/10] xfs: fix potential log item leak
  2022-05-03 22:17 ` [PATCH 02/10] xfs: fix potential log item leak Dave Chinner
@ 2022-05-03 22:42   ` Alli
  2022-05-03 22:44   ` Darrick J. Wong
  2022-05-10 12:48   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Alli @ 2022-05-03 22:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On Wed, 2022-05-04 at 08:17 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Ever since we added shadown format buffers to the log items, log
> items need to handle the item being released with shadow buffers
> attached. Due to the fact this requirement was added at the same
> time we added new rmap/reflink intents, we missed the cleanup of
> those items.
> 
> In theory, this means shadow buffers can be leaked in a very small
> window when a shutdown is initiated. Testing with KASAN shows this
> leak does not happen in practice - we haven't identified a single
> leak in several years of shutdown testing since ~v4.8 kernels.
> 
> However, the intent whiteout cleanup mechanism results in every
> cancelled intent in exactly the same state as this tiny race window
> creates and so if intents down clean up shadow buffers on final
> release we will leak the shadow buffer for just about every intent
> we create.
> 
> Hence we start with this patch to close this condition off and
> ensure that when whiteouts start to be used we don't leak lots of
> memory.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, looks good to me
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>  fs/xfs/xfs_bmap_item.c     | 2 ++
>  fs/xfs/xfs_icreate_item.c  | 1 +
>  fs/xfs/xfs_refcount_item.c | 2 ++
>  fs/xfs/xfs_rmap_item.c     | 2 ++
>  4 files changed, 7 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 593ac29cffc7..2c8b686e2a11 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -39,6 +39,7 @@ STATIC void
>  xfs_bui_item_free(
>  	struct xfs_bui_log_item	*buip)
>  {
> +	kmem_free(buip->bui_item.li_lv_shadow);
>  	kmem_cache_free(xfs_bui_cache, buip);
>  }
>  
> @@ -198,6 +199,7 @@ xfs_bud_item_release(
>  	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
>  
>  	xfs_bui_release(budp->bud_buip);
> +	kmem_free(budp->bud_item.li_lv_shadow);
>  	kmem_cache_free(xfs_bud_cache, budp);
>  }
>  
> diff --git a/fs/xfs/xfs_icreate_item.c b/fs/xfs/xfs_icreate_item.c
> index 508e184e3b8f..b05314d48176 100644
> --- a/fs/xfs/xfs_icreate_item.c
> +++ b/fs/xfs/xfs_icreate_item.c
> @@ -63,6 +63,7 @@ STATIC void
>  xfs_icreate_item_release(
>  	struct xfs_log_item	*lip)
>  {
> +	kmem_free(ICR_ITEM(lip)->ic_item.li_lv_shadow);
>  	kmem_cache_free(xfs_icreate_cache, ICR_ITEM(lip));
>  }
>  
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 0d868c93144d..10474fe389e1 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -35,6 +35,7 @@ STATIC void
>  xfs_cui_item_free(
>  	struct xfs_cui_log_item	*cuip)
>  {
> +	kmem_free(cuip->cui_item.li_lv_shadow);
>  	if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS)
>  		kmem_free(cuip);
>  	else
> @@ -204,6 +205,7 @@ xfs_cud_item_release(
>  	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
>  
>  	xfs_cui_release(cudp->cud_cuip);
> +	kmem_free(cudp->cud_item.li_lv_shadow);
>  	kmem_cache_free(xfs_cud_cache, cudp);
>  }
>  
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index a22b2d19ef91..6c0b56ebdbe1 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -35,6 +35,7 @@ STATIC void
>  xfs_rui_item_free(
>  	struct xfs_rui_log_item	*ruip)
>  {
> +	kmem_free(ruip->rui_item.li_lv_shadow);
>  	if (ruip->rui_format.rui_nextents > XFS_RUI_MAX_FAST_EXTENTS)
>  		kmem_free(ruip);
>  	else
> @@ -227,6 +228,7 @@ xfs_rud_item_release(
>  	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
>  
>  	xfs_rui_release(rudp->rud_ruip);
> +	kmem_free(rudp->rud_item.li_lv_shadow);
>  	kmem_cache_free(xfs_rud_cache, rudp);
>  }
>  


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 01/10] xfs: zero inode fork buffer at allocation
  2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
  2022-05-03 22:41   ` Darrick J. Wong
@ 2022-05-03 22:42   ` Alli
  2022-05-10 12:47   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Alli @ 2022-05-03 22:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On Wed, 2022-05-04 at 08:17 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we first allocate or resize an inline inode fork, we round up
> the allocation to 4 byte alingment to make journal alignment
> constraints. We don't clear the unused bytes, so we can copy up to
> three uninitialised bytes into the journal. Zero those bytes so we
> only ever copy zeros into the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks good
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>  fs/xfs/libxfs/xfs_inode_fork.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c
> b/fs/xfs/libxfs/xfs_inode_fork.c
> index 9aee4a1e2fe9..a15ff38c3d41 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -50,8 +50,13 @@ xfs_init_local_fork(
>  		mem_size++;
>  
>  	if (size) {
> +		/*
> +		 * As we round up the allocation here, we need to
> ensure the
> +		 * bytes we don't copy data into are zeroed because the
> log
> +		 * vectors still copy them into the journal.
> +		 */
>  		real_size = roundup(mem_size, 4);
> -		ifp->if_u1.if_data = kmem_alloc(real_size, KM_NOFS);
> +		ifp->if_u1.if_data = kmem_zalloc(real_size, KM_NOFS);
>  		memcpy(ifp->if_u1.if_data, data, size);
>  		if (zero_terminate)
>  			ifp->if_u1.if_data[size] = '\0';
> @@ -500,10 +505,11 @@ xfs_idata_realloc(
>  	/*
>  	 * For inline data, the underlying buffer must be a multiple of
> 4 bytes
>  	 * in size so that it can be logged and stay on word
> boundaries.
> -	 * We enforce that here.
> +	 * We enforce that here, and use __GFP_ZERO to ensure that size
> +	 * extensions always zero the unused roundup area.
>  	 */
>  	ifp->if_u1.if_data = krealloc(ifp->if_u1.if_data,
> roundup(new_size, 4),
> -				      GFP_NOFS | __GFP_NOFAIL);
> +				      GFP_NOFS | __GFP_NOFAIL |
> __GFP_ZERO);
>  	ifp->if_bytes = new_size;
>  }
>  


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 10/10] xfs: intent item whiteouts
  2022-05-03 22:17 ` [PATCH 10/10] xfs: intent item whiteouts Dave Chinner
@ 2022-05-03 22:42   ` Alli
  2022-05-03 22:50   ` Darrick J. Wong
  1 sibling, 0 replies; 23+ messages in thread
From: Alli @ 2022-05-03 22:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On Wed, 2022-05-04 at 08:17 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we log modifications based on intents, we add both intent
> and intent done items to the modification being made. These get
> written to the log to ensure that the operation is re-run if the
> intent done is not found in the log.
> 
> However, for operations that complete wholly within a single
> checkpoint, the change in the checkpoint is atomic and will never
> need replay. In this case, we don't need to actually write the
> intent and intent done items to the journal because log recovery
> will never need to manually restart this modification.
> 
> Log recovery currently handles intent/intent done matching by
> inserting the intent into the AIL, then removing it when a matching
> intent done item is found. Hence for all the intent-based operations
> that complete within a checkpoint, we spend all that time parsing
> the intent/intent done items just to cancel them and do nothing with
> them.
> 
> Hence it follows that the only time we actually need intents in the
> log is when the modification crosses checkpoint boundaries in the
> log and so may only be partially complete in the journal. Hence if
> we commit and intent done item to the CIL and the intent item is in
> the same checkpoint, we don't actually have to write them to the
> journal because log recovery will always cancel the intents.
> 
> We've never really worried about the overhead of logging intents
> unnecessarily like this because the intents we log are generally
> very much smaller than the change being made. e.g. freeing an extent
> involves modifying at lease two freespace btree blocks and the AGF,
> so the EFI/EFD overhead is only a small increase in space and
> processing time compared to the overall cost of freeing an extent.
> 
> However, delayed attributes change this cost equation dramatically,
> especially for inline attributes. In the case of adding an inline
> attribute, we only log the inode core and attribute fork at present.
> With delayed attributes, we now log the attr intent which includes
> the name and value, the inode core adn attr fork, and finally the
> attr intent done item. We increase the number of items we log from 1
> to 3, and the number of log vectors (regions) goes up from 3 to 7.
> Hence we tripple the number of objects that the CIL has to process,
> and more than double the number of log vectors that need to be
> written to the journal.
> 
> At scale, this means delayed attributes cause a non-pipelined CIL to
> become CPU bound processing all the extra items, resulting in a > 40%
> performance degradation on 16-way file+xattr create worklaods.
> Pipelining the CIL (as per 5.15) reduces the performance degradation
> to 20%, but now the limitation is the rate at which the log items
> can be written to the iclogs and iclogs be dispatched for IO and
> completed.
> 
> Even log IO completion is slowed down by these intents, because it
> now has to process 3x the number of items in the checkpoint.
> Processing completed intents is especially inefficient here, because
> we first insert the intent into the AIL, then remove it from the AIL
> when the intent done is processed. IOWs, we are also doing expensive
> operations in log IO completion we could completely avoid if we
> didn't log completed intent/intent done pairs.
> 
> Enter log item whiteouts.
> 
> When an intent done is committed, we can check to see if the
> associated intent is in the same checkpoint as we are currently
> committing the intent done to. If so, we can mark the intent log
> item with a whiteout and immediately free the intent done item
> rather than committing it to the CIL. We can basically skip the
> entire formatting and CIL insertion steps for the intent done item.
> 
> However, we cannot remove the intent item from the CIL at this point
> because the unlocked per-cpu CIL item lists do not permit removal
> without holding the CIL context lock exclusively. Transaction commit
> only holds the context lock shared, hence the best we can do is mark
> the intent item with a whiteout so that the CIL push can release it
> rather than writing it to the log.
> 
> This means we never write the intent to the log if the intent done
> has also been committed to the same checkpoint, but we'll always
> write the intent if the intent done has not been committed or has
> been committed to a different checkpoint. This will result in
> correct log recovery behaviour in all cases, without the overhead of
> logging unnecessary intents.
> 
> This intent whiteout concept is generic - we can apply it to all
> intent/intent done pairs that have a direct 1:1 relationship. The
> way deferred ops iterate and relog intents mean that all intents
> currently have a 1:1 relationship with their done intent, and hence
> we can apply this cancellation to all existing intent/intent done
> implementations.
> 
> For delayed attributes with a 16-way 64kB xattr create workload,
> whiteouts reduce the amount of journalled metadata from ~2.5GB/s
> down to ~600MB/s and improve the creation rate from 9000/s to
> 14000/s.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Ok, I think it looks good
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>  fs/xfs/xfs_log_cil.c | 78
> ++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_trace.h   |  3 ++
>  fs/xfs/xfs_trans.h   |  6 ++--
>  3 files changed, 82 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0d8d092447ad..fecd2ea3e935 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -476,7 +476,8 @@ xlog_cil_insert_format_items(
>  static void
>  xlog_cil_insert_items(
>  	struct xlog		*log,
> -	struct xfs_trans	*tp)
> +	struct xfs_trans	*tp,
> +	uint32_t		released_space)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
> @@ -525,7 +526,9 @@ xlog_cil_insert_items(
>  		ASSERT(tp->t_ticket->t_curr_res >= len);
>  	}
>  	tp->t_ticket->t_curr_res -= len;
> +	tp->t_ticket->t_curr_res += released_space;
>  	ctx->space_used += len;
> +	ctx->space_used -= released_space;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before
> we move
> @@ -970,11 +973,16 @@ xlog_cil_build_trans_hdr(
>   * Pull all the log vectors off the items in the CIL, and remove the
> items from
>   * the CIL. We don't need the CIL lock here because it's only needed
> on the
>   * transaction commit side which is currently locked out by the
> flush lock.
> + *
> + * If a log item is marked with a whiteout, we do not need to write
> it to the
> + * journal and so we just move them to the whiteout list for the
> caller to
> + * dispose of appropriately.
>   */
>  static void
>  xlog_cil_build_lv_chain(
>  	struct xfs_cil		*cil,
>  	struct xfs_cil_ctx	*ctx,
> +	struct list_head	*whiteouts,
>  	uint32_t		*num_iovecs,
>  	uint32_t		*num_bytes)
>  {
> @@ -985,6 +993,13 @@ xlog_cil_build_lv_chain(
>  
>  		item = list_first_entry(&cil->xc_cil,
>  					struct xfs_log_item, li_cil);
> +
> +		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> +			list_move(&item->li_cil, whiteouts);
> +			trace_xfs_cil_whiteout_skip(item);
> +			continue;
> +		}
> +
>  		list_del_init(&item->li_cil);
>  		if (!ctx->lv_chain)
>  			ctx->lv_chain = item->li_lv;
> @@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain(
>  	}
>  }
>  
> +static void
> +xlog_cil_push_cleanup_whiteouts(
> +	struct list_head	*whiteouts)
> +{
> +	while (!list_empty(whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(whiteouts,
> +						struct xfs_log_item,
> li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
> +}
> +
>  /*
>   * Push the Committed Item List to the log.
>   *
> @@ -1030,6 +1058,7 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_csn_t		push_seq;
>  	bool			push_commit_stable;
> +	LIST_HEAD		(whiteouts);
>  
>  	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1098,7 +1127,7 @@ xlog_cil_push_work(
>  	list_add(&ctx->committing, &cil->xc_committing);
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes);
> +	xlog_cil_build_lv_chain(cil, ctx, &whiteouts, &num_iovecs,
> &num_bytes);
>  
>  	/*
>  	 * Switch the contexts so we can drop the context lock and move
> out
> @@ -1201,6 +1230,7 @@ xlog_cil_push_work(
>  	/* Not safe to reference ctx now! */
>  
>  	spin_unlock(&log->l_icloglock);
> +	xlog_cil_push_cleanup_whiteouts(&whiteouts);
>  	return;
>  
>  out_skip:
> @@ -1212,6 +1242,7 @@ xlog_cil_push_work(
>  out_abort_free_ticket:
>  	xfs_log_ticket_ungrant(log, ctx->ticket);
>  	ASSERT(xlog_is_shutdown(log));
> +	xlog_cil_push_cleanup_whiteouts(&whiteouts);
>  	if (!ctx->commit_iclog) {
>  		xlog_cil_committed(ctx);
>  		return;
> @@ -1360,6 +1391,43 @@ xlog_cil_empty(
>  	return empty;
>  }
>  
> +/*
> + * If there are intent done items in this transaction and the
> related intent was
> + * committed in the current (same) CIL checkpoint, we don't need to
> write either
> + * the intent or intent done item to the journal as the change will
> be
> + * journalled atomically within this checkpoint. As we cannot remove
> items from
> + * the CIL here, mark the related intent with a whiteout so that the
> CIL push
> + * can remove it rather than writing it to the journal. Then remove
> the intent
> + * done item from the current transaction and release it so it
> doesn't get put
> + * into the CIL at all.
> + */
> +static uint32_t
> +xlog_cil_process_intents(
> +	struct xfs_cil		*cil,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_log_item	*lip, *ilip, *next;
> +	uint32_t		len = 0;
> +
> +	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
> +		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
> +			continue;
> +
> +		ilip = lip->li_ops->iop_intent(lip);
> +		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
> +			continue;
> +		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
> +		trace_xfs_cil_whiteout_mark(ilip);
> +		len += ilip->li_lv->lv_bytes;
> +		kmem_free(ilip->li_lv);
> +		ilip->li_lv = NULL;
> +
> +		xfs_trans_del_item(lip);
> +		lip->li_ops->iop_release(lip);
> +	}
> +	return len;
> +}
> +
>  /*
>   * Commit a transaction with the given vector to the Committed Item
> List.
>   *
> @@ -1382,6 +1450,7 @@ xlog_cil_commit(
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> +	uint32_t		released_space = 0;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1393,7 +1462,10 @@ xlog_cil_commit(
>  	/* lock out background commit */
>  	down_read(&cil->xc_ctx_lock);
>  
> -	xlog_cil_insert_items(log, tp);
> +	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
> +		released_space = xlog_cil_process_intents(cil, tp);
> +
> +	xlog_cil_insert_items(log, tp, released_space);
>  
>  	if (regrant && !xlog_is_shutdown(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index e1197f9ad97e..75934e3c3f55 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
>  
>  DECLARE_EVENT_CLASS(xfs_ail_class,
>  	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t
> new_lsn),
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index d72a5995d33e..9561f193e7e1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -55,13 +55,15 @@ struct xfs_log_item {
>  #define	XFS_LI_IN_AIL	0
>  #define	XFS_LI_ABORTED	1
>  #define	XFS_LI_FAILED	2
> -#define	XFS_LI_DIRTY	3	/* log item dirty in
> transaction */
> +#define	XFS_LI_DIRTY	3
> +#define	XFS_LI_WHITEOUT	4
>  
>  #define XFS_LI_FLAGS \
>  	{ (1u << XFS_LI_IN_AIL),	"IN_AIL" }, \
>  	{ (1u << XFS_LI_ABORTED),	"ABORTED" }, \
>  	{ (1u << XFS_LI_FAILED),	"FAILED" }, \
> -	{ (1u << XFS_LI_DIRTY),		"DIRTY" }
> +	{ (1u << XFS_LI_DIRTY),		"DIRTY" }, \
> +	{ (1u << XFS_LI_WHITEOUT),	"WHITEOUT" }
>  
>  struct xfs_item_ops {
>  	unsigned flags;


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 02/10] xfs: fix potential log item leak
  2022-05-03 22:17 ` [PATCH 02/10] xfs: fix potential log item leak Dave Chinner
  2022-05-03 22:42   ` Alli
@ 2022-05-03 22:44   ` Darrick J. Wong
  2022-05-10 12:48   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2022-05-03 22:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 04, 2022 at 08:17:20AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Ever since we added shadown format buffers to the log items, log
> items need to handle the item being released with shadow buffers
> attached. Due to the fact this requirement was added at the same
> time we added new rmap/reflink intents, we missed the cleanup of
> those items.
> 
> In theory, this means shadow buffers can be leaked in a very small
> window when a shutdown is initiated. Testing with KASAN shows this
> leak does not happen in practice - we haven't identified a single
> leak in several years of shutdown testing since ~v4.8 kernels.
> 
> However, the intent whiteout cleanup mechanism results in every
> cancelled intent in exactly the same state as this tiny race window
> creates and so if intents down clean up shadow buffers on final
> release we will leak the shadow buffer for just about every intent
> we create.
> 
> Hence we start with this patch to close this condition off and
> ensure that when whiteouts start to be used we don't leak lots of
> memory.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_item.c     | 2 ++
>  fs/xfs/xfs_icreate_item.c  | 1 +
>  fs/xfs/xfs_refcount_item.c | 2 ++
>  fs/xfs/xfs_rmap_item.c     | 2 ++
>  4 files changed, 7 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 593ac29cffc7..2c8b686e2a11 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -39,6 +39,7 @@ STATIC void
>  xfs_bui_item_free(
>  	struct xfs_bui_log_item	*buip)
>  {
> +	kmem_free(buip->bui_item.li_lv_shadow);
>  	kmem_cache_free(xfs_bui_cache, buip);
>  }
>  
> @@ -198,6 +199,7 @@ xfs_bud_item_release(
>  	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
>  
>  	xfs_bui_release(budp->bud_buip);
> +	kmem_free(budp->bud_item.li_lv_shadow);
>  	kmem_cache_free(xfs_bud_cache, budp);
>  }
>  
> diff --git a/fs/xfs/xfs_icreate_item.c b/fs/xfs/xfs_icreate_item.c
> index 508e184e3b8f..b05314d48176 100644
> --- a/fs/xfs/xfs_icreate_item.c
> +++ b/fs/xfs/xfs_icreate_item.c
> @@ -63,6 +63,7 @@ STATIC void
>  xfs_icreate_item_release(
>  	struct xfs_log_item	*lip)
>  {
> +	kmem_free(ICR_ITEM(lip)->ic_item.li_lv_shadow);

Oh hey, I missed one.  Good catch!

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	kmem_cache_free(xfs_icreate_cache, ICR_ITEM(lip));
>  }
>  
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 0d868c93144d..10474fe389e1 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -35,6 +35,7 @@ STATIC void
>  xfs_cui_item_free(
>  	struct xfs_cui_log_item	*cuip)
>  {
> +	kmem_free(cuip->cui_item.li_lv_shadow);
>  	if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS)
>  		kmem_free(cuip);
>  	else
> @@ -204,6 +205,7 @@ xfs_cud_item_release(
>  	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
>  
>  	xfs_cui_release(cudp->cud_cuip);
> +	kmem_free(cudp->cud_item.li_lv_shadow);
>  	kmem_cache_free(xfs_cud_cache, cudp);
>  }
>  
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index a22b2d19ef91..6c0b56ebdbe1 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -35,6 +35,7 @@ STATIC void
>  xfs_rui_item_free(
>  	struct xfs_rui_log_item	*ruip)
>  {
> +	kmem_free(ruip->rui_item.li_lv_shadow);
>  	if (ruip->rui_format.rui_nextents > XFS_RUI_MAX_FAST_EXTENTS)
>  		kmem_free(ruip);
>  	else
> @@ -227,6 +228,7 @@ xfs_rud_item_release(
>  	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
>  
>  	xfs_rui_release(rudp->rud_ruip);
> +	kmem_free(rudp->rud_item.li_lv_shadow);
>  	kmem_cache_free(xfs_rud_cache, rudp);
>  }
>  
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 03/10] xfs: hide log iovec alignment constraints
  2022-05-03 22:17 ` [PATCH 03/10] xfs: hide log iovec alignment constraints Dave Chinner
@ 2022-05-03 22:45   ` Darrick J. Wong
  2022-05-03 23:07     ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2022-05-03 22:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 04, 2022 at 08:17:21AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Callers currently have to round out the size of buffers to match the
> aligment constraints of log iovecs and xlog_write(). They should not
> need to know this detail, so introduce a new function to calculate
> the iovec length (for use in ->iop_size implementations). Also
> modify xlog_finish_iovec() to round up the length to the correct
> alignment so the callers don't need to do this, either.
> 
> Convert the only user - inode forks - of this alignment rounding to
> use the new interface.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_inode_fork.c  | 20 ++++------------
>  fs/xfs/xfs_inode_item.c         | 25 ++++++--------------
>  fs/xfs/xfs_inode_item_recover.c |  4 ++--
>  fs/xfs/xfs_log.h                | 42 ++++++++++++++++++++++++++++++---
>  4 files changed, 52 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index a15ff38c3d41..1a4cdf550f6d 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -36,7 +36,7 @@ xfs_init_local_fork(
>  	int64_t			size)
>  {
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> -	int			mem_size = size, real_size = 0;
> +	int			mem_size = size;
>  	bool			zero_terminate;
>  
>  	/*
> @@ -50,13 +50,7 @@ xfs_init_local_fork(
>  		mem_size++;
>  
>  	if (size) {
> -		/*
> -		 * As we round up the allocation here, we need to ensure the
> -		 * bytes we don't copy data into are zeroed because the log
> -		 * vectors still copy them into the journal.
> -		 */
> -		real_size = roundup(mem_size, 4);
> -		ifp->if_u1.if_data = kmem_zalloc(real_size, KM_NOFS);
> +		ifp->if_u1.if_data = kmem_alloc(mem_size, KM_NOFS);
>  		memcpy(ifp->if_u1.if_data, data, size);
>  		if (zero_terminate)
>  			ifp->if_u1.if_data[size] = '\0';
> @@ -502,14 +496,8 @@ xfs_idata_realloc(
>  		return;
>  	}
>  
> -	/*
> -	 * For inline data, the underlying buffer must be a multiple of 4 bytes
> -	 * in size so that it can be logged and stay on word boundaries.
> -	 * We enforce that here, and use __GFP_ZERO to ensure that size
> -	 * extensions always zero the unused roundup area.
> -	 */
> -	ifp->if_u1.if_data = krealloc(ifp->if_u1.if_data, roundup(new_size, 4),
> -				      GFP_NOFS | __GFP_NOFAIL | __GFP_ZERO);
> +	ifp->if_u1.if_data = krealloc(ifp->if_u1.if_data, new_size,
> +				      GFP_NOFS | __GFP_NOFAIL);
>  	ifp->if_bytes = new_size;
>  }
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 00733a18ccdc..721def0639fd 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -71,7 +71,7 @@ xfs_inode_item_data_fork_size(
>  	case XFS_DINODE_FMT_LOCAL:
>  		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
>  		    ip->i_df.if_bytes > 0) {
> -			*nbytes += roundup(ip->i_df.if_bytes, 4);
> +			*nbytes += xlog_calc_iovec_len(ip->i_df.if_bytes);
>  			*nvecs += 1;
>  		}
>  		break;
> @@ -112,7 +112,7 @@ xfs_inode_item_attr_fork_size(
>  	case XFS_DINODE_FMT_LOCAL:
>  		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
>  		    ip->i_afp->if_bytes > 0) {
> -			*nbytes += roundup(ip->i_afp->if_bytes, 4);
> +			*nbytes += xlog_calc_iovec_len(ip->i_afp->if_bytes);
>  			*nvecs += 1;
>  		}
>  		break;
> @@ -204,17 +204,12 @@ xfs_inode_item_format_data_fork(
>  			~(XFS_ILOG_DEXT | XFS_ILOG_DBROOT | XFS_ILOG_DEV);
>  		if ((iip->ili_fields & XFS_ILOG_DDATA) &&
>  		    ip->i_df.if_bytes > 0) {
> -			/*
> -			 * Round i_bytes up to a word boundary.
> -			 * The underlying memory is guaranteed
> -			 * to be there by xfs_idata_realloc().
> -			 */
> -			data_bytes = roundup(ip->i_df.if_bytes, 4);
>  			ASSERT(ip->i_df.if_u1.if_data != NULL);
>  			ASSERT(ip->i_disk_size > 0);
>  			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_ILOCAL,
> -					ip->i_df.if_u1.if_data, data_bytes);
> -			ilf->ilf_dsize = (unsigned)data_bytes;
> +					ip->i_df.if_u1.if_data,
> +					ip->i_df.if_bytes);
> +			ilf->ilf_dsize = (unsigned)ip->i_df.if_bytes;
>  			ilf->ilf_size++;
>  		} else {
>  			iip->ili_fields &= ~XFS_ILOG_DDATA;
> @@ -288,17 +283,11 @@ xfs_inode_item_format_attr_fork(
>  
>  		if ((iip->ili_fields & XFS_ILOG_ADATA) &&
>  		    ip->i_afp->if_bytes > 0) {
> -			/*
> -			 * Round i_bytes up to a word boundary.
> -			 * The underlying memory is guaranteed
> -			 * to be there by xfs_idata_realloc().
> -			 */
> -			data_bytes = roundup(ip->i_afp->if_bytes, 4);
>  			ASSERT(ip->i_afp->if_u1.if_data != NULL);
>  			xlog_copy_iovec(lv, vecp, XLOG_REG_TYPE_IATTR_LOCAL,
>  					ip->i_afp->if_u1.if_data,
> -					data_bytes);
> -			ilf->ilf_asize = (unsigned)data_bytes;
> +					ip->i_afp->if_bytes);
> +			ilf->ilf_asize = (unsigned)ip->i_afp->if_bytes;
>  			ilf->ilf_size++;
>  		} else {
>  			iip->ili_fields &= ~XFS_ILOG_ADATA;
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 6d44f5fd6d7e..d28ffaebd067 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -462,7 +462,7 @@ xlog_recover_inode_commit_pass2(
>  	ASSERT(in_f->ilf_size <= 4);
>  	ASSERT((in_f->ilf_size == 3) || (fields & XFS_ILOG_AFORK));
>  	ASSERT(!(fields & XFS_ILOG_DFORK) ||
> -	       (len == in_f->ilf_dsize));
> +	       (len == xlog_calc_iovec_len(in_f->ilf_dsize)));
>  
>  	switch (fields & XFS_ILOG_DFORK) {
>  	case XFS_ILOG_DDATA:
> @@ -497,7 +497,7 @@ xlog_recover_inode_commit_pass2(
>  		}
>  		len = item->ri_buf[attr_index].i_len;
>  		src = item->ri_buf[attr_index].i_addr;
> -		ASSERT(len == in_f->ilf_asize);
> +		ASSERT(len == xlog_calc_iovec_len(in_f->ilf_asize));
>  
>  		switch (in_f->ilf_fields & XFS_ILOG_AFORK) {
>  		case XFS_ILOG_ADATA:
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 8dafe8f771c7..3a4f6a4e4eb7 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -21,23 +21,59 @@ struct xfs_log_vec {
>  
>  #define XFS_LOG_VEC_ORDERED	(-1)
>  
> +/*
> + * Calculate the log iovec length for a given user buffer length. Intended to be
> + * used by ->iop_size implementations when sizing buffers of arbitrary
> + * alignments.
> + */
> +static inline int
> +xlog_calc_iovec_len(int len)
> +{
> +	return roundup(len, sizeof(uint32_t));
> +}
> +
>  void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  		uint type);
>  
>  static inline void
> -xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
> +xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec,
> +		int data_len)
>  {
>  	struct xlog_op_header	*oph = vec->i_addr;
> -
> -	/* opheader tracks payload length, logvec tracks region length */
> +	int			len;
> +
> +	/*
> +	 * Always round up the length to the correct alignment so callers don't
> +	 * need to know anything about this log vec layout requirement. This
> +	 * means we have to zero the area the data to be written does not cover.
> +	 * This is complicated by fact the payload region is offset into the
> +	 * logvec region by the opheader that tracks the payload.
> +	 */
> +	len = xlog_calc_iovec_len(data_len);
> +	if (len - data_len != 0) {
> +		char	*buf = vec->i_addr + sizeof(struct xlog_op_header);
> +
> +		memset(buf + data_len, 0, len - data_len);

Assuming this is the replacement for the kzalloc/kzrealloc calls above
so that we don't write junk to disk,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +	}
> +
> +	/*
> +	 * The opheader tracks aligned payload length, whilst the logvec tracks
> +	 * the overall region length.
> +	 */
>  	oph->oh_len = cpu_to_be32(len);
>  
>  	len += sizeof(struct xlog_op_header);
>  	lv->lv_buf_len += len;
>  	lv->lv_bytes += len;
>  	vec->i_len = len;
> +
> +	/* Catch buffer overruns */
> +	ASSERT((void *)lv->lv_buf + lv->lv_bytes <= (void *)lv + lv->lv_size);
>  }
>  
> +/*
> + * Copy the amount of data requested by the caller into a new log iovec.
> + */
>  static inline void *
>  xlog_copy_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
>  		uint type, void *data, int len)
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 10/10] xfs: intent item whiteouts
  2022-05-03 22:17 ` [PATCH 10/10] xfs: intent item whiteouts Dave Chinner
  2022-05-03 22:42   ` Alli
@ 2022-05-03 22:50   ` Darrick J. Wong
  2022-05-04  1:49     ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2022-05-03 22:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 04, 2022 at 08:17:28AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we log modifications based on intents, we add both intent
> and intent done items to the modification being made. These get
> written to the log to ensure that the operation is re-run if the
> intent done is not found in the log.
> 
> However, for operations that complete wholly within a single
> checkpoint, the change in the checkpoint is atomic and will never
> need replay. In this case, we don't need to actually write the
> intent and intent done items to the journal because log recovery
> will never need to manually restart this modification.
> 
> Log recovery currently handles intent/intent done matching by
> inserting the intent into the AIL, then removing it when a matching
> intent done item is found. Hence for all the intent-based operations
> that complete within a checkpoint, we spend all that time parsing
> the intent/intent done items just to cancel them and do nothing with
> them.
> 
> Hence it follows that the only time we actually need intents in the
> log is when the modification crosses checkpoint boundaries in the
> log and so may only be partially complete in the journal. Hence if
> we commit and intent done item to the CIL and the intent item is in
> the same checkpoint, we don't actually have to write them to the
> journal because log recovery will always cancel the intents.
> 
> We've never really worried about the overhead of logging intents
> unnecessarily like this because the intents we log are generally
> very much smaller than the change being made. e.g. freeing an extent
> involves modifying at lease two freespace btree blocks and the AGF,
> so the EFI/EFD overhead is only a small increase in space and
> processing time compared to the overall cost of freeing an extent.
> 
> However, delayed attributes change this cost equation dramatically,
> especially for inline attributes. In the case of adding an inline
> attribute, we only log the inode core and attribute fork at present.
> With delayed attributes, we now log the attr intent which includes
> the name and value, the inode core adn attr fork, and finally the
> attr intent done item. We increase the number of items we log from 1
> to 3, and the number of log vectors (regions) goes up from 3 to 7.
> Hence we tripple the number of objects that the CIL has to process,
> and more than double the number of log vectors that need to be
> written to the journal.
> 
> At scale, this means delayed attributes cause a non-pipelined CIL to
> become CPU bound processing all the extra items, resulting in a > 40%
> performance degradation on 16-way file+xattr create worklaods.
> Pipelining the CIL (as per 5.15) reduces the performance degradation
> to 20%, but now the limitation is the rate at which the log items
> can be written to the iclogs and iclogs be dispatched for IO and
> completed.
> 
> Even log IO completion is slowed down by these intents, because it
> now has to process 3x the number of items in the checkpoint.
> Processing completed intents is especially inefficient here, because
> we first insert the intent into the AIL, then remove it from the AIL
> when the intent done is processed. IOWs, we are also doing expensive
> operations in log IO completion we could completely avoid if we
> didn't log completed intent/intent done pairs.
> 
> Enter log item whiteouts.
> 
> When an intent done is committed, we can check to see if the
> associated intent is in the same checkpoint as we are currently
> committing the intent done to. If so, we can mark the intent log
> item with a whiteout and immediately free the intent done item
> rather than committing it to the CIL. We can basically skip the
> entire formatting and CIL insertion steps for the intent done item.
> 
> However, we cannot remove the intent item from the CIL at this point
> because the unlocked per-cpu CIL item lists do not permit removal
> without holding the CIL context lock exclusively. Transaction commit
> only holds the context lock shared, hence the best we can do is mark
> the intent item with a whiteout so that the CIL push can release it
> rather than writing it to the log.
> 
> This means we never write the intent to the log if the intent done
> has also been committed to the same checkpoint, but we'll always
> write the intent if the intent done has not been committed or has
> been committed to a different checkpoint. This will result in
> correct log recovery behaviour in all cases, without the overhead of
> logging unnecessary intents.
> 
> This intent whiteout concept is generic - we can apply it to all
> intent/intent done pairs that have a direct 1:1 relationship. The
> way deferred ops iterate and relog intents mean that all intents
> currently have a 1:1 relationship with their done intent, and hence
> we can apply this cancellation to all existing intent/intent done
> implementations.
> 
> For delayed attributes with a 16-way 64kB xattr create workload,
> whiteouts reduce the amount of journalled metadata from ~2.5GB/s
> down to ~600MB/s and improve the creation rate from 9000/s to
> 14000/s.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log_cil.c | 78 ++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_trace.h   |  3 ++
>  fs/xfs/xfs_trans.h   |  6 ++--
>  3 files changed, 82 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0d8d092447ad..fecd2ea3e935 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -476,7 +476,8 @@ xlog_cil_insert_format_items(
>  static void
>  xlog_cil_insert_items(
>  	struct xlog		*log,
> -	struct xfs_trans	*tp)
> +	struct xfs_trans	*tp,
> +	uint32_t		released_space)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
> @@ -525,7 +526,9 @@ xlog_cil_insert_items(
>  		ASSERT(tp->t_ticket->t_curr_res >= len);
>  	}
>  	tp->t_ticket->t_curr_res -= len;
> +	tp->t_ticket->t_curr_res += released_space;
>  	ctx->space_used += len;
> +	ctx->space_used -= released_space;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> @@ -970,11 +973,16 @@ xlog_cil_build_trans_hdr(
>   * Pull all the log vectors off the items in the CIL, and remove the items from
>   * the CIL. We don't need the CIL lock here because it's only needed on the
>   * transaction commit side which is currently locked out by the flush lock.
> + *
> + * If a log item is marked with a whiteout, we do not need to write it to the
> + * journal and so we just move them to the whiteout list for the caller to
> + * dispose of appropriately.
>   */
>  static void
>  xlog_cil_build_lv_chain(
>  	struct xfs_cil		*cil,
>  	struct xfs_cil_ctx	*ctx,
> +	struct list_head	*whiteouts,
>  	uint32_t		*num_iovecs,
>  	uint32_t		*num_bytes)
>  {
> @@ -985,6 +993,13 @@ xlog_cil_build_lv_chain(
>  
>  		item = list_first_entry(&cil->xc_cil,
>  					struct xfs_log_item, li_cil);
> +
> +		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> +			list_move(&item->li_cil, whiteouts);
> +			trace_xfs_cil_whiteout_skip(item);
> +			continue;
> +		}
> +
>  		list_del_init(&item->li_cil);
>  		if (!ctx->lv_chain)
>  			ctx->lv_chain = item->li_lv;
> @@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain(
>  	}
>  }
>  
> +static void
> +xlog_cil_push_cleanup_whiteouts(

Pushing cleanup whiteouts?

Oh, clean up whiteouts as part of pushing CIL.

I almost want to ask for a comment here:

/* Remove log items from the CIL that have been elided from the checkpoint. */
static void
xlog_cil_push_cleanup_whiteouts(

But fmeh, aside from my own momentary confusion this isn't that big of a
deal.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +	struct list_head	*whiteouts)
> +{
> +	while (!list_empty(whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
> +}
> +
>  /*
>   * Push the Committed Item List to the log.
>   *
> @@ -1030,6 +1058,7 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_csn_t		push_seq;
>  	bool			push_commit_stable;
> +	LIST_HEAD		(whiteouts);
>  
>  	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1098,7 +1127,7 @@ xlog_cil_push_work(
>  	list_add(&ctx->committing, &cil->xc_committing);
>  	spin_unlock(&cil->xc_push_lock);
>  
> -	xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes);
> +	xlog_cil_build_lv_chain(cil, ctx, &whiteouts, &num_iovecs, &num_bytes);
>  
>  	/*
>  	 * Switch the contexts so we can drop the context lock and move out
> @@ -1201,6 +1230,7 @@ xlog_cil_push_work(
>  	/* Not safe to reference ctx now! */
>  
>  	spin_unlock(&log->l_icloglock);
> +	xlog_cil_push_cleanup_whiteouts(&whiteouts);
>  	return;
>  
>  out_skip:
> @@ -1212,6 +1242,7 @@ xlog_cil_push_work(
>  out_abort_free_ticket:
>  	xfs_log_ticket_ungrant(log, ctx->ticket);
>  	ASSERT(xlog_is_shutdown(log));
> +	xlog_cil_push_cleanup_whiteouts(&whiteouts);
>  	if (!ctx->commit_iclog) {
>  		xlog_cil_committed(ctx);
>  		return;
> @@ -1360,6 +1391,43 @@ xlog_cil_empty(
>  	return empty;
>  }
>  
> +/*
> + * If there are intent done items in this transaction and the related intent was
> + * committed in the current (same) CIL checkpoint, we don't need to write either
> + * the intent or intent done item to the journal as the change will be
> + * journalled atomically within this checkpoint. As we cannot remove items from
> + * the CIL here, mark the related intent with a whiteout so that the CIL push
> + * can remove it rather than writing it to the journal. Then remove the intent
> + * done item from the current transaction and release it so it doesn't get put
> + * into the CIL at all.
> + */
> +static uint32_t
> +xlog_cil_process_intents(
> +	struct xfs_cil		*cil,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_log_item	*lip, *ilip, *next;
> +	uint32_t		len = 0;
> +
> +	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
> +		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
> +			continue;
> +
> +		ilip = lip->li_ops->iop_intent(lip);
> +		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
> +			continue;
> +		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
> +		trace_xfs_cil_whiteout_mark(ilip);
> +		len += ilip->li_lv->lv_bytes;
> +		kmem_free(ilip->li_lv);
> +		ilip->li_lv = NULL;
> +
> +		xfs_trans_del_item(lip);
> +		lip->li_ops->iop_release(lip);
> +	}
> +	return len;
> +}
> +
>  /*
>   * Commit a transaction with the given vector to the Committed Item List.
>   *
> @@ -1382,6 +1450,7 @@ xlog_cil_commit(
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> +	uint32_t		released_space = 0;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1393,7 +1462,10 @@ xlog_cil_commit(
>  	/* lock out background commit */
>  	down_read(&cil->xc_ctx_lock);
>  
> -	xlog_cil_insert_items(log, tp);
> +	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
> +		released_space = xlog_cil_process_intents(cil, tp);
> +
> +	xlog_cil_insert_items(log, tp, released_space);
>  
>  	if (regrant && !xlog_is_shutdown(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index e1197f9ad97e..75934e3c3f55 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>  DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
>  
>  DECLARE_EVENT_CLASS(xfs_ail_class,
>  	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index d72a5995d33e..9561f193e7e1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -55,13 +55,15 @@ struct xfs_log_item {
>  #define	XFS_LI_IN_AIL	0
>  #define	XFS_LI_ABORTED	1
>  #define	XFS_LI_FAILED	2
> -#define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
> +#define	XFS_LI_DIRTY	3
> +#define	XFS_LI_WHITEOUT	4
>  
>  #define XFS_LI_FLAGS \
>  	{ (1u << XFS_LI_IN_AIL),	"IN_AIL" }, \
>  	{ (1u << XFS_LI_ABORTED),	"ABORTED" }, \
>  	{ (1u << XFS_LI_FAILED),	"FAILED" }, \
> -	{ (1u << XFS_LI_DIRTY),		"DIRTY" }
> +	{ (1u << XFS_LI_DIRTY),		"DIRTY" }, \
> +	{ (1u << XFS_LI_WHITEOUT),	"WHITEOUT" }
>  
>  struct xfs_item_ops {
>  	unsigned flags;
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 03/10] xfs: hide log iovec alignment constraints
  2022-05-03 22:45   ` Darrick J. Wong
@ 2022-05-03 23:07     ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-03 23:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, May 03, 2022 at 03:45:29PM -0700, Darrick J. Wong wrote:
> On Wed, May 04, 2022 at 08:17:21AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Callers currently have to round out the size of buffers to match the
> > aligment constraints of log iovecs and xlog_write(). They should not
> > need to know this detail, so introduce a new function to calculate
> > the iovec length (for use in ->iop_size implementations). Also
> > modify xlog_finish_iovec() to round up the length to the correct
> > alignment so the callers don't need to do this, either.
> > 
> > Convert the only user - inode forks - of this alignment rounding to
> > use the new interface.
....
> >  static inline void
> > -xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
> > +xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec,
> > +		int data_len)
> >  {
> >  	struct xlog_op_header	*oph = vec->i_addr;
> > -
> > -	/* opheader tracks payload length, logvec tracks region length */
> > +	int			len;
> > +
> > +	/*
> > +	 * Always round up the length to the correct alignment so callers don't
> > +	 * need to know anything about this log vec layout requirement. This
> > +	 * means we have to zero the area the data to be written does not cover.
> > +	 * This is complicated by fact the payload region is offset into the
> > +	 * logvec region by the opheader that tracks the payload.
> > +	 */
> > +	len = xlog_calc_iovec_len(data_len);
> > +	if (len - data_len != 0) {
> > +		char	*buf = vec->i_addr + sizeof(struct xlog_op_header);
> > +
> > +		memset(buf + data_len, 0, len - data_len);
> 
> Assuming this is the replacement for the kzalloc/kzrealloc calls above
> so that we don't write junk to disk,

Yes, exactly that.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 10/10] xfs: intent item whiteouts
  2022-05-03 22:50   ` Darrick J. Wong
@ 2022-05-04  1:49     ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2022-05-04  1:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, May 03, 2022 at 03:50:09PM -0700, Darrick J. Wong wrote:
> On Wed, May 04, 2022 at 08:17:28AM +1000, Dave Chinner wrote:
> > @@ -985,6 +993,13 @@ xlog_cil_build_lv_chain(
> >  
> >  		item = list_first_entry(&cil->xc_cil,
> >  					struct xfs_log_item, li_cil);
> > +
> > +		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> > +			list_move(&item->li_cil, whiteouts);
> > +			trace_xfs_cil_whiteout_skip(item);
> > +			continue;
> > +		}
> > +
> >  		list_del_init(&item->li_cil);
> >  		if (!ctx->lv_chain)
> >  			ctx->lv_chain = item->li_lv;
> > @@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain(
> >  	}
> >  }
> >  
> > +static void
> > +xlog_cil_push_cleanup_whiteouts(
> 
> Pushing cleanup whiteouts?
> 
> Oh, clean up whiteouts as part of pushing CIL.
> 
> I almost want to ask for a comment here:
> 
> /* Remove log items from the CIL that have been elided from the checkpoint. */
> static void
> xlog_cil_push_cleanup_whiteouts(
> 
> But fmeh, aside from my own momentary confusion this isn't that big of a
> deal.

Oh, fair comment - it's not consistent with other helpers that
are just named xlog_cil_<thing>, like xlog_cil_build_lv_chain().

I dropped the "push" out of the name.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 01/10] xfs: zero inode fork buffer at allocation
  2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
  2022-05-03 22:41   ` Darrick J. Wong
  2022-05-03 22:42   ` Alli
@ 2022-05-10 12:47   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-10 12:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 04, 2022 at 08:17:19AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we first allocate or resize an inline inode fork, we round up
> the allocation to 4 byte alingment to make journal alignment
> constraints. We don't clear the unused bytes, so we can copy up to
> three uninitialised bytes into the journal. Zero those bytes so we
> only ever copy zeros into the journal.

It took me a while to figure out how GFP_ZERO works for krealloc,
and it seems like kmalloc and friends always zero the entire length
of the kmem_cache, so if krealloc reuses a rounded up allocation the
padding is alredy zeroed at that point.

So this looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 02/10] xfs: fix potential log item leak
  2022-05-03 22:17 ` [PATCH 02/10] xfs: fix potential log item leak Dave Chinner
  2022-05-03 22:42   ` Alli
  2022-05-03 22:44   ` Darrick J. Wong
@ 2022-05-10 12:48   ` Christoph Hellwig
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-10 12:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 09/10] xfs: whiteouts release intents that are not in the AIL
  2022-05-03 22:17 ` [PATCH 09/10] xfs: whiteouts release intents that are not in the AIL Dave Chinner
@ 2022-05-10 12:49   ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2022-05-10 12:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 04, 2022 at 08:17:27AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we release an intent that a whiteout applies to, it will not
> have been committed to the journal and so won't be in the AIL. Hence
> when we drop the last reference to the intent, we do not want to try
> to remove it from the AIL as that will trigger a filesystem
> shutdown. Hence make the removal of intents from the AIL conditional
> on them actually being in the AIL so we do the correct thing.
> 

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-05-10 12:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-03 22:17 [PATCH 00/10 v6] xfs: intent whiteouts Dave Chinner
2022-05-03 22:17 ` [PATCH 01/10] xfs: zero inode fork buffer at allocation Dave Chinner
2022-05-03 22:41   ` Darrick J. Wong
2022-05-03 22:42   ` Alli
2022-05-10 12:47   ` Christoph Hellwig
2022-05-03 22:17 ` [PATCH 02/10] xfs: fix potential log item leak Dave Chinner
2022-05-03 22:42   ` Alli
2022-05-03 22:44   ` Darrick J. Wong
2022-05-10 12:48   ` Christoph Hellwig
2022-05-03 22:17 ` [PATCH 03/10] xfs: hide log iovec alignment constraints Dave Chinner
2022-05-03 22:45   ` Darrick J. Wong
2022-05-03 23:07     ` Dave Chinner
2022-05-03 22:17 ` [PATCH 04/10] xfs: don't commit the first deferred transaction without intents Dave Chinner
2022-05-03 22:17 ` [PATCH 05/10] xfs: add log item flags to indicate intents Dave Chinner
2022-05-03 22:17 ` [PATCH 06/10] xfs: tag transactions that contain intent done items Dave Chinner
2022-05-03 22:17 ` [PATCH 07/10] xfs: factor and move some code in xfs_log_cil.c Dave Chinner
2022-05-03 22:17 ` [PATCH 08/10] xfs: add log item method to return related intents Dave Chinner
2022-05-03 22:17 ` [PATCH 09/10] xfs: whiteouts release intents that are not in the AIL Dave Chinner
2022-05-10 12:49   ` Christoph Hellwig
2022-05-03 22:17 ` [PATCH 10/10] xfs: intent item whiteouts Dave Chinner
2022-05-03 22:42   ` Alli
2022-05-03 22:50   ` Darrick J. Wong
2022-05-04  1:49     ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.