* [NYE DELUGE 2/4] xfs: online repair in its entirety
@ 2022-12-30 21:14 Darrick J. Wong
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
                   ` (77 more replies)
  0 siblings, 78 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 21:14 UTC (permalink / raw)
  To: Dave Chinner, Allison Henderson, Chandan Babu R, Catherine Hoang, djwong
  Cc: xfs, greg.marsden, shirley.ma, konrad.wilk, linux-fsdevel,
	Matthew Wilcox, tpkelly, smahar, Christoph Hellwig, fstests,
	Zorro Lang, Carlos Maiolino

Hi everyone,

As I've mentioned several times throughout 2022, I would like to merge
the online fsck feature in time for the 2023 LTS kernel.  This is the
second part of that effort.

This deluge contains all of the online repair kernel code, a significant
amount of restructuring of how repairs work in the userspace driver
program, and a ton of fstests updates to provide automated fuzz testing
and stress testing of forced repairs.

Within the kernel section, the major pieces are the use of tmpfs files
to provide pageable kernel memory for staging repair information;
lightweight hooks into the main xfs filesystem for scrub via jump
labels; coordinated inode scans for live index construction; and the
atomic file mapping swap feature.
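
To sketch the jump-label hook mechanism (every xchk_foo_* name below is
a placeholder for this illustration, not one of the real hooks in the
deluge): the hook call sites in the main filesystem compile down to a
nop until a scrubber arms the static key, so non-scrub workloads pay
essentially nothing:

#include <linux/jump_label.h>

/* Illustrative sketch only; assumes the usual xfs type definitions. */
DEFINE_STATIC_KEY_FALSE(xchk_foo_hook_key);

void xchk_foo_live_update(struct xfs_mount *mp, xfs_agnumber_t agno);

/* Call site in the main fs code; patched to a nop unless scrub armed it. */
static inline void
xchk_foo_hook(struct xfs_mount *mp, xfs_agnumber_t agno)
{
	if (static_branch_unlikely(&xchk_foo_hook_key))
		xchk_foo_live_update(mp, agno);
}

/* Scrub arms the hook for the duration of a coordinated live scan... */
static void
xchk_foo_start_live_scan(void)
{
	static_branch_inc(&xchk_foo_hook_key);
}

/* ...and disarms it once the new index has been built. */
static void
xchk_foo_stop_live_scan(void)
{
	static_branch_dec(&xchk_foo_hook_key);
}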

Changes to the userspace driver program fall into three main categories:
restructuring how repairs are scheduled so that they're tracked by inode
or AG; establishing data dependency chains so that we scan and repair
things in the correct order; and reworking the systemd background
services to be more secure, enable periodic media scans, and provide
some semblance of fs corruption reporting.

The fstests changes are a substantial reworking of the fuzzing code to
fit the testing described in the design documentation; adding stress
testing of online repairs vs. fsstress; and functional tests for all the
new features that ride in with online repair.

For this review, I would like people to focus on the following:

- Are the major subsystems sufficiently documented that you could figure
  out what the code does?

- Do you see any problems that are severe enough to cause long term
  support hassles? (e.g. bad API design, writing weird metadata to disk)

- Can you spot mis-interactions between the subsystems?

- What were my blind spots in devising this feature?

- Are there missing pieces that you'd like to help build?

- Can I just merge all of this?

The one thing that is /not/ in scope for this review is requests for
more refactoring of existing subsystems.  While there are usually valid
arguments for performing such cleanups, those are separate tasks to be
prioritized separately.  I will get to them after merging online fsck,
because revising existing subsystems generally involves rebasing work
in this patchset, which means the affected patches need re-reviewing.
Unless it's absolutely necessary, this just creates more work for
everybody.

I've been running daily online *repairs* of every computer I own for
the last eight months.  All modifications so far have been to optimize
data structures (holes in the xattr structures, excessively large rmap
btrees, and bugs in quota resource counter updates).  So far, no damage
has resulted from these operations.  All issues observed in that time
have been corrected in this submission.

Fuzz and stress testing of online repairs have been running well for a
year now.  As of this writing, online repair can fix slightly more
things than offline repair, and the fsstress+repair long soak test has
passed 100 million repairs with zero problems observed.

(For comparison, the long soak fsx test recently passed 92 billion file
operations, so online fsck has a ways to go...)

As a warning, the patches will likely take several days to trickle in.
While everyone else looks at this, I plan to prototype directory tree
reconstruction with Allison's parent pointers v27 patchset.  Having a
user of that functionality is (I think) the last major hurdle to
ensuring that parent pointers are a good fit for the problems that need
solving, which in turn is the last requirement for merging that feature.

--D


* [PATCHSET v24.0 0/9] xfs: fix online repair block reaping
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/9] xfs: move the post-repair block reaping code to a separate file Darrick J. Wong
                     ` (8 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
                   ` (76 subsequent siblings)
  77 siblings, 9 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

These patches fix a few problems that I noticed in the code that deals
with old btree blocks after a successful repair.

First, I observed that it is possible for repair to incorrectly
invalidate and delete old btree blocks if they were crosslinked.  The
solution here is to consult the reverse mappings for each block in the
extent -- singly owned blocks are invalidated and freed, whereas for
crosslinked blocks, we merely drop the incorrect reverse mapping.
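
Roughly speaking (condensed from the xrep_reap_block() changes later in
this series, with cursor setup, buffer invalidation, and transaction
rolling elided), the per-block decision is:

/* Does any other metadata structure also claim this block? */
error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap);
if (error)
	return error;

if (has_other_rmap) {
	/* Crosslinked: drop only our reverse mapping and move on. */
	error = xfs_rmap_free(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno,
			1, oinfo);
} else {
	/* Sole owner: free the extent, which also removes the rmap. */
	error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, oinfo, resv);
}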

A largish change in this patchset is moving the reaping code to a
separate file, because the code is mostly a set of interrelated static
functions.  For now this also drops the ability to reap file blocks,
which will return when we add the bmbt repair functions.

Second, we convert the reap function to use EFIs so that we can commit
to freeing as many blocks as possible in as few transactions as we
dare.  We would
like to free as many old blocks as we can in the same transaction that
commits the new structure to the ondisk filesystem to minimize the
number of blocks that leak if the system crashes before the repair fully
completes.
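
The batching pattern, condensed from patch 5 in this series (error
handling elided), is roughly:

/* Queue a deferred free (EFI) instead of freeing synchronously. */
__xfs_free_extent_later(sc->tp, fsbno, 1, rs->oinfo, true);
rs->deferred++;

/*
 * Roll the transaction once ~100 EFIs have been queued so that we never
 * blow the log reservation; otherwise keep batching so that as many old
 * blocks as possible ride in the transaction that commits the new
 * structure.
 */
if (rs->deferred > 100) {
	rs->deferred = 0;
	return xrep_roll_ag_trans(sc);
}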

The third change made in this series is to avoid tripping buffer cache
assertions if we're merely scanning the buffer cache for buffers to
invalidate, and find a non-stale buffer of the wrong length.  This is
primarily cosmetic, but makes my life easier.

The fourth change restructures the reaping code to try to process as many
blocks in one go as possible, to reduce logging traffic.

The last change switches the reaping mechanism to use per-AG bitmaps
defined in a previous patchset.  This should reduce type confusion when
reading the source code.
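
(The typed bitmap itself is defined in that earlier patchset, not here;
the sketch below only illustrates the idea of wrapping the generic
bitmap so that the units are baked into the type and AG block numbers
can't silently be mixed up with fs block numbers.)

/* Sketch of the idea only; the real helpers live in scrub/bitmap.h. */
struct xagb_bitmap {
	struct xbitmap	agbitmap;	/* bits are xfs_agblock_t */
};

static inline int
xagb_bitmap_set(struct xagb_bitmap *bitmap, xfs_agblock_t start,
		xfs_extlen_t len)
{
	return xbitmap_set(&bitmap->agbitmap, start, len);
}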

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-reap-fixes

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-reap-fixes
---
 fs/xfs/Makefile                |    1 
 fs/xfs/scrub/agheader_repair.c |   75 +++---
 fs/xfs/scrub/bitmap.c          |   78 ------
 fs/xfs/scrub/bitmap.h          |   10 -
 fs/xfs/scrub/reap.c            |  489 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/reap.h            |   12 +
 fs/xfs/scrub/repair.c          |  367 ++++--------------------------
 fs/xfs/scrub/repair.h          |   10 -
 fs/xfs/scrub/trace.h           |   72 +++---
 fs/xfs/xfs_buf.c               |    5 
 fs/xfs/xfs_buf.h               |   10 +
 11 files changed, 649 insertions(+), 480 deletions(-)
 create mode 100644 fs/xfs/scrub/reap.c
 create mode 100644 fs/xfs/scrub/reap.h



* [PATCH 1/9] xfs: cull repair code that will never get used
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 6/9] xfs: rearrange xrep_reap_block to make future code flow easier Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/9] xfs: only allow reaping of per-AG blocks in xrep_reap_extents Darrick J. Wong
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

These two functions date from the era when I thought that we could
rebuild btrees by creating an alternate root and adding records one by
one.  In other words, they predate the btree bulk loader.  They're not
necessary now, so remove them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/repair.c |   84 -------------------------------------------------
 fs/xfs/scrub/repair.h |    6 ----
 fs/xfs/scrub/trace.h  |   22 -------------
 3 files changed, 112 deletions(-)


diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 074c6f5974d1..c1037536ca45 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -297,90 +297,6 @@ xrep_calc_ag_resblks(
 	return max(max(bnobt_sz, inobt_sz), max(rmapbt_sz, refcbt_sz));
 }
 
-/* Allocate a block in an AG. */
-int
-xrep_alloc_ag_block(
-	struct xfs_scrub		*sc,
-	const struct xfs_owner_info	*oinfo,
-	xfs_fsblock_t			*fsbno,
-	enum xfs_ag_resv_type		resv)
-{
-	struct xfs_alloc_arg		args = {0};
-	xfs_agblock_t			bno;
-	int				error;
-
-	switch (resv) {
-	case XFS_AG_RESV_AGFL:
-	case XFS_AG_RESV_RMAPBT:
-		error = xfs_alloc_get_freelist(sc->sa.pag, sc->tp,
-				sc->sa.agf_bp, &bno, 1);
-		if (error)
-			return error;
-		if (bno == NULLAGBLOCK)
-			return -ENOSPC;
-		xfs_extent_busy_reuse(sc->mp, sc->sa.pag, bno, 1, false);
-		*fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno, bno);
-		if (resv == XFS_AG_RESV_RMAPBT)
-			xfs_ag_resv_rmapbt_alloc(sc->mp, sc->sa.pag->pag_agno);
-		return 0;
-	default:
-		break;
-	}
-
-	args.tp = sc->tp;
-	args.mp = sc->mp;
-	args.oinfo = *oinfo;
-	args.fsbno = XFS_AGB_TO_FSB(args.mp, sc->sa.pag->pag_agno, 0);
-	args.minlen = 1;
-	args.maxlen = 1;
-	args.prod = 1;
-	args.type = XFS_ALLOCTYPE_THIS_AG;
-	args.resv = resv;
-
-	error = xfs_alloc_vextent(&args);
-	if (error)
-		return error;
-	if (args.fsbno == NULLFSBLOCK)
-		return -ENOSPC;
-	ASSERT(args.len == 1);
-	*fsbno = args.fsbno;
-
-	return 0;
-}
-
-/* Initialize a new AG btree root block with zero entries. */
-int
-xrep_init_btblock(
-	struct xfs_scrub		*sc,
-	xfs_fsblock_t			fsb,
-	struct xfs_buf			**bpp,
-	xfs_btnum_t			btnum,
-	const struct xfs_buf_ops	*ops)
-{
-	struct xfs_trans		*tp = sc->tp;
-	struct xfs_mount		*mp = sc->mp;
-	struct xfs_buf			*bp;
-	int				error;
-
-	trace_xrep_init_btblock(mp, XFS_FSB_TO_AGNO(mp, fsb),
-			XFS_FSB_TO_AGBNO(mp, fsb), btnum);
-
-	ASSERT(XFS_FSB_TO_AGNO(mp, fsb) == sc->sa.pag->pag_agno);
-	error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
-			XFS_FSB_TO_DADDR(mp, fsb), XFS_FSB_TO_BB(mp, 1), 0,
-			&bp);
-	if (error)
-		return error;
-	xfs_buf_zero(bp, 0, BBTOB(bp->b_length));
-	xfs_btree_init_block(mp, bp, btnum, 0, 0, sc->sa.pag->pag_agno);
-	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_BTREE_BUF);
-	xfs_trans_log_buf(tp, bp, 0, BBTOB(bp->b_length) - 1);
-	bp->b_ops = ops;
-	*bpp = bp;
-
-	return 0;
-}
-
 /*
  * Reconstructing per-AG Btrees
  *
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 150157ac2489..a764838e969d 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -23,12 +23,6 @@ int xrep_roll_ag_trans(struct xfs_scrub *sc);
 bool xrep_ag_has_space(struct xfs_perag *pag, xfs_extlen_t nr_blocks,
 		enum xfs_ag_resv_type type);
 xfs_extlen_t xrep_calc_ag_resblks(struct xfs_scrub *sc);
-int xrep_alloc_ag_block(struct xfs_scrub *sc,
-		const struct xfs_owner_info *oinfo, xfs_fsblock_t *fsbno,
-		enum xfs_ag_resv_type resv);
-int xrep_init_btblock(struct xfs_scrub *sc, xfs_fsblock_t fsb,
-		struct xfs_buf **bpp, xfs_btnum_t btnum,
-		const struct xfs_buf_ops *ops);
 
 struct xbitmap;
 struct xagb_bitmap;
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index cd9cfe98f14f..030ea76f1c90 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -828,28 +828,6 @@ TRACE_EVENT(xrep_refcount_extent_fn,
 		  __entry->refcount)
 )
 
-TRACE_EVENT(xrep_init_btblock,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
-		 xfs_btnum_t btnum),
-	TP_ARGS(mp, agno, agbno, btnum),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_agnumber_t, agno)
-		__field(xfs_agblock_t, agbno)
-		__field(uint32_t, btnum)
-	),
-	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->agno = agno;
-		__entry->agbno = agbno;
-		__entry->btnum = btnum;
-	),
-	TP_printk("dev %d:%d agno 0x%x agbno 0x%x btree %s",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->agno,
-		  __entry->agbno,
-		  __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS))
-)
 TRACE_EVENT(xrep_findroot_block,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
 		 uint32_t magic, uint16_t level),



* [PATCH 2/9] xfs: move the post-repair block reaping code to a separate file
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 8/9] xfs: reap large AG metadata extents when possible Darrick J. Wong
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Reaping blocks after a repair is a complicated affair involving a lot of
rmap btree lookups and figuring out if we're going to unmap or free old
metadata blocks that might be crosslinked.  Eventually, we will need to
be able to reap per-AG metadata blocks, bmbt blocks from inode forks,
garbage CoW staging extents, and (even later) blocks from btrees rooted
in inodes.  This results in a lot of reaping code, so we might as well
split that off while it's easy.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/scrub/agheader_repair.c |    1 
 fs/xfs/scrub/reap.c            |  268 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/reap.h            |   13 ++
 fs/xfs/scrub/repair.c          |  232 -----------------------------------
 fs/xfs/scrub/repair.h          |    2 
 6 files changed, 283 insertions(+), 234 deletions(-)
 create mode 100644 fs/xfs/scrub/reap.c
 create mode 100644 fs/xfs/scrub/reap.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0b8dfac6d9a3..73bebce4d6f9 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -171,6 +171,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   reap.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index c22dc71fdd82..53863560ab0f 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -26,6 +26,7 @@
 #include "scrub/trace.h"
 #include "scrub/repair.h"
 #include "scrub/bitmap.h"
+#include "scrub/reap.h"
 
 /* Superblock */
 
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
new file mode 100644
index 000000000000..613a8b897a25
--- /dev/null
+++ b/fs/xfs/scrub/reap.c
@@ -0,0 +1,268 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_extent_busy.h"
+#include "xfs_ag.h"
+#include "xfs_ag_resv.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/reap.h"
+
+/*
+ * Disposal of Blocks from Old Metadata
+ *
+ * Now that we've constructed a new btree to replace the damaged one, we want
+ * to dispose of the blocks that (we think) the old btree was using.
+ * Previously, we used the rmapbt to collect the extents (bitmap) with the
+ * rmap owner corresponding to the tree we rebuilt, collected extents for any
+ * blocks with the same rmap owner that are owned by another data structure
+ * (sublist), and subtracted sublist from bitmap.  In theory the extents
+ * remaining in bitmap are the old btree's blocks.
+ *
+ * Unfortunately, it's possible that the btree was crosslinked with other
+ * blocks on disk.  The rmap data can tell us if there are multiple owners, so
+ * if the rmapbt says there is an owner of this block other than @oinfo, then
+ * the block is crosslinked.  Remove the reverse mapping and continue.
+ *
+ * If there is one rmap record, we can free the block, which removes the
+ * reverse mapping but doesn't add the block to the free space.  Our repair
+ * strategy is to hope the other metadata objects crosslinked on this block
+ * will be rebuilt (atop different blocks), thereby removing all the cross
+ * links.
+ *
+ * If there are no rmap records at all, we also free the block.  If the btree
+ * being rebuilt lives in the free space (bnobt/cntbt/rmapbt) then there isn't
+ * supposed to be a rmap record and everything is ok.  For other btrees there
+ * had to have been an rmap entry for the block to have ended up on @bitmap,
+ * so if it's gone now there's something wrong and the fs will shut down.
+ *
+ * Note: If there are multiple rmap records with only the same rmap owner as
+ * the btree we're trying to rebuild and the block is indeed owned by another
+ * data structure with the same rmap owner, then the block will be in sublist
+ * and therefore doesn't need disposal.  If there are multiple rmap records
+ * with only the same rmap owner but the block is not owned by something with
+ * the same rmap owner, the block will be freed.
+ *
+ * The caller is responsible for locking the AG headers for the entire rebuild
+ * operation so that nothing else can sneak in and change the AG state while
+ * we're not looking.  We also assume that the caller already invalidated any
+ * buffers associated with @bitmap.
+ */
+
+static int
+xrep_invalidate_block(
+	uint64_t		fsbno,
+	void			*priv)
+{
+	struct xfs_scrub	*sc = priv;
+	struct xfs_buf		*bp;
+	int			error;
+
+	/* Skip AG headers and post-EOFS blocks */
+	if (!xfs_verify_fsbno(sc->mp, fsbno))
+		return 0;
+
+	error = xfs_buf_incore(sc->mp->m_ddev_targp,
+			XFS_FSB_TO_DADDR(sc->mp, fsbno),
+			XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp);
+	if (error)
+		return 0;
+
+	xfs_trans_bjoin(sc->tp, bp);
+	xfs_trans_binval(sc->tp, bp);
+	return 0;
+}
+
+/*
+ * Invalidate buffers for per-AG btree blocks we're dumping.  This function
+ * is not intended for use with file data repairs; we have bunmapi for that.
+ */
+int
+xrep_invalidate_blocks(
+	struct xfs_scrub	*sc,
+	struct xbitmap		*bitmap)
+{
+	/*
+	 * For each block in each extent, see if there's an incore buffer for
+	 * exactly that block; if so, invalidate it.  The buffer cache only
+	 * lets us look for one buffer at a time, so we have to look one block
+	 * at a time.  Avoid invalidating AG headers and post-EOFS blocks
+	 * because we never own those; and if we can't TRYLOCK the buffer we
+	 * assume it's owned by someone else.
+	 */
+	return xbitmap_walk_bits(bitmap, xrep_invalidate_block, sc);
+}
+
+/* Information about reaping extents after a repair. */
+struct xrep_reap_state {
+	struct xfs_scrub		*sc;
+
+	/* Reverse mapping owner and metadata reservation type. */
+	const struct xfs_owner_info	*oinfo;
+	enum xfs_ag_resv_type		resv;
+};
+
+/*
+ * Put a block back on the AGFL.
+ */
+STATIC int
+xrep_put_freelist(
+	struct xfs_scrub	*sc,
+	xfs_agblock_t		agbno)
+{
+	struct xfs_buf		*agfl_bp;
+	int			error;
+
+	/* Make sure there's space on the freelist. */
+	error = xrep_fix_freelist(sc, true);
+	if (error)
+		return error;
+
+	/*
+	 * Since we're "freeing" a lost block onto the AGFL, we have to
+	 * create an rmap for the block prior to merging it or else other
+	 * parts will break.
+	 */
+	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno, 1,
+			&XFS_RMAP_OINFO_AG);
+	if (error)
+		return error;
+
+	/* Put the block on the AGFL. */
+	error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_put_freelist(sc->sa.pag, sc->tp, sc->sa.agf_bp,
+			agfl_bp, agbno, 0);
+	if (error)
+		return error;
+	xfs_extent_busy_insert(sc->tp, sc->sa.pag, agbno, 1,
+			XFS_EXTENT_BUSY_SKIP_DISCARD);
+
+	return 0;
+}
+
+/* Dispose of a single block. */
+STATIC int
+xrep_reap_block(
+	uint64_t			fsbno,
+	void				*priv)
+{
+	struct xrep_reap_state		*rs = priv;
+	struct xfs_scrub		*sc = rs->sc;
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agf_bp = NULL;
+	xfs_agblock_t			agbno;
+	bool				has_other_rmap;
+	int				error;
+
+	ASSERT(sc->ip != NULL ||
+	       XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
+	trace_xrep_dispose_btree_extent(sc->mp,
+			XFS_FSB_TO_AGNO(sc->mp, fsbno),
+			XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1);
+
+	agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
+	ASSERT(XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
+
+	/*
+	 * If we are repairing per-inode metadata, we need to read in the AGF
+	 * buffer.  Otherwise, we're repairing a per-AG structure, so reuse
+	 * the AGF buffer that the setup functions already grabbed.
+	 */
+	if (sc->ip) {
+		error = xfs_alloc_read_agf(sc->sa.pag, sc->tp, 0, &agf_bp);
+		if (error)
+			return error;
+	} else {
+		agf_bp = sc->sa.agf_bp;
+	}
+	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, sc->sa.pag);
+
+	/* Can we find any other rmappings? */
+	error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo,
+			&has_other_rmap);
+	xfs_btree_del_cursor(cur, error);
+	if (error)
+		goto out_free;
+
+	/*
+	 * If there are other rmappings, this block is cross linked and must
+	 * not be freed.  Remove the reverse mapping and move on.  Otherwise,
+	 * we were the only owner of the block, so free the extent, which will
+	 * also remove the rmap.
+	 *
+	 * XXX: XFS doesn't support detecting the case where a single block
+	 * metadata structure is crosslinked with a multi-block structure
+	 * because the buffer cache doesn't detect aliasing problems, so we
+	 * can't fix 100% of crosslinking problems (yet).  The verifiers will
+	 * blow on writeout, the filesystem will shut down, and the admin gets
+	 * to run xfs_repair.
+	 */
+	if (has_other_rmap)
+		error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno,
+					1, rs->oinfo);
+	else if (rs->resv == XFS_AG_RESV_AGFL)
+		error = xrep_put_freelist(sc, agbno);
+	else
+		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo,
+				rs->resv);
+	if (agf_bp != sc->sa.agf_bp)
+		xfs_trans_brelse(sc->tp, agf_bp);
+	if (error)
+		return error;
+
+	if (sc->ip)
+		return xfs_trans_roll_inode(&sc->tp, sc->ip);
+	return xrep_roll_ag_trans(sc);
+
+out_free:
+	if (agf_bp != sc->sa.agf_bp)
+		xfs_trans_brelse(sc->tp, agf_bp);
+	return error;
+}
+
+/* Dispose of every block of every extent in the bitmap. */
+int
+xrep_reap_extents(
+	struct xfs_scrub		*sc,
+	struct xbitmap			*bitmap,
+	const struct xfs_owner_info	*oinfo,
+	enum xfs_ag_resv_type		type)
+{
+	struct xrep_reap_state		rs = {
+		.sc			= sc,
+		.oinfo			= oinfo,
+		.resv			= type,
+	};
+
+	ASSERT(xfs_has_rmapbt(sc->mp));
+
+	return xbitmap_walk_bits(bitmap, xrep_reap_block, &rs);
+}
diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h
new file mode 100644
index 000000000000..73d098ea7b04
--- /dev/null
+++ b/fs/xfs/scrub/reap.h
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_REAP_H__
+#define __XFS_SCRUB_REAP_H__
+
+int xrep_reap_extents(struct xfs_scrub *sc, struct xbitmap *bitmap,
+		const struct xfs_owner_info *oinfo,
+		enum xfs_ag_resv_type type);
+
+#endif /* __XFS_SCRUB_REAP_H__ */
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index c1037536ca45..762eefb6ac90 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -321,91 +321,8 @@ xrep_calc_ag_resblks(
  * sublist.  As with the other btrees we subtract sublist from bitmap, and the
  * result (since the rmapbt lives in the free space) are the blocks from the
  * old rmapbt.
- *
- * Disposal of Blocks from Old per-AG Btrees
- *
- * Now that we've constructed a new btree to replace the damaged one, we want
- * to dispose of the blocks that (we think) the old btree was using.
- * Previously, we used the rmapbt to collect the extents (bitmap) with the
- * rmap owner corresponding to the tree we rebuilt, collected extents for any
- * blocks with the same rmap owner that are owned by another data structure
- * (sublist), and subtracted sublist from bitmap.  In theory the extents
- * remaining in bitmap are the old btree's blocks.
- *
- * Unfortunately, it's possible that the btree was crosslinked with other
- * blocks on disk.  The rmap data can tell us if there are multiple owners, so
- * if the rmapbt says there is an owner of this block other than @oinfo, then
- * the block is crosslinked.  Remove the reverse mapping and continue.
- *
- * If there is one rmap record, we can free the block, which removes the
- * reverse mapping but doesn't add the block to the free space.  Our repair
- * strategy is to hope the other metadata objects crosslinked on this block
- * will be rebuilt (atop different blocks), thereby removing all the cross
- * links.
- *
- * If there are no rmap records at all, we also free the block.  If the btree
- * being rebuilt lives in the free space (bnobt/cntbt/rmapbt) then there isn't
- * supposed to be a rmap record and everything is ok.  For other btrees there
- * had to have been an rmap entry for the block to have ended up on @bitmap,
- * so if it's gone now there's something wrong and the fs will shut down.
- *
- * Note: If there are multiple rmap records with only the same rmap owner as
- * the btree we're trying to rebuild and the block is indeed owned by another
- * data structure with the same rmap owner, then the block will be in sublist
- * and therefore doesn't need disposal.  If there are multiple rmap records
- * with only the same rmap owner but the block is not owned by something with
- * the same rmap owner, the block will be freed.
- *
- * The caller is responsible for locking the AG headers for the entire rebuild
- * operation so that nothing else can sneak in and change the AG state while
- * we're not looking.  We also assume that the caller already invalidated any
- * buffers associated with @bitmap.
  */
 
-static int
-xrep_invalidate_block(
-	uint64_t		fsbno,
-	void			*priv)
-{
-	struct xfs_scrub	*sc = priv;
-	struct xfs_buf		*bp;
-	int			error;
-
-	/* Skip AG headers and post-EOFS blocks */
-	if (!xfs_verify_fsbno(sc->mp, fsbno))
-		return 0;
-
-	error = xfs_buf_incore(sc->mp->m_ddev_targp,
-			XFS_FSB_TO_DADDR(sc->mp, fsbno),
-			XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp);
-	if (error)
-		return 0;
-
-	xfs_trans_bjoin(sc->tp, bp);
-	xfs_trans_binval(sc->tp, bp);
-	return 0;
-}
-
-/*
- * Invalidate buffers for per-AG btree blocks we're dumping.  This function
- * is not intended for use with file data repairs; we have bunmapi for that.
- */
-int
-xrep_invalidate_blocks(
-	struct xfs_scrub	*sc,
-	struct xbitmap		*bitmap)
-{
-	/*
-	 * For each block in each extent, see if there's an incore buffer for
-	 * exactly that block; if so, invalidate it.  The buffer cache only
-	 * lets us look for one buffer at a time, so we have to look one block
-	 * at a time.  Avoid invalidating AG headers and post-EOFS blocks
-	 * because we never own those; and if we can't TRYLOCK the buffer we
-	 * assume it's owned by someone else.
-	 */
-	return xbitmap_walk_bits(bitmap, xrep_invalidate_block, sc);
-}
-
 /* Ensure the freelist is the correct size. */
 int
 xrep_fix_freelist(
@@ -424,155 +341,6 @@ xrep_fix_freelist(
 			can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
 }
 
-/* Information about reaping extents after a repair. */
-struct xrep_reap_state {
-	struct xfs_scrub		*sc;
-
-	/* Reverse mapping owner and metadata reservation type. */
-	const struct xfs_owner_info	*oinfo;
-	enum xfs_ag_resv_type		resv;
-};
-
-/*
- * Put a block back on the AGFL.
- */
-STATIC int
-xrep_put_freelist(
-	struct xfs_scrub	*sc,
-	xfs_agblock_t		agbno)
-{
-	struct xfs_buf		*agfl_bp;
-	int			error;
-
-	/* Make sure there's space on the freelist. */
-	error = xrep_fix_freelist(sc, true);
-	if (error)
-		return error;
-
-	/*
-	 * Since we're "freeing" a lost block onto the AGFL, we have to
-	 * create an rmap for the block prior to merging it or else other
-	 * parts will break.
-	 */
-	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno, 1,
-			&XFS_RMAP_OINFO_AG);
-	if (error)
-		return error;
-
-	/* Put the block on the AGFL. */
-	error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp);
-	if (error)
-		return error;
-
-	error = xfs_alloc_put_freelist(sc->sa.pag, sc->tp, sc->sa.agf_bp,
-			agfl_bp, agbno, 0);
-	if (error)
-		return error;
-	xfs_extent_busy_insert(sc->tp, sc->sa.pag, agbno, 1,
-			XFS_EXTENT_BUSY_SKIP_DISCARD);
-
-	return 0;
-}
-
-/* Dispose of a single block. */
-STATIC int
-xrep_reap_block(
-	uint64_t			fsbno,
-	void				*priv)
-{
-	struct xrep_reap_state		*rs = priv;
-	struct xfs_scrub		*sc = rs->sc;
-	struct xfs_btree_cur		*cur;
-	struct xfs_buf			*agf_bp = NULL;
-	xfs_agblock_t			agbno;
-	bool				has_other_rmap;
-	int				error;
-
-	ASSERT(sc->ip != NULL ||
-	       XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
-	trace_xrep_dispose_btree_extent(sc->mp,
-			XFS_FSB_TO_AGNO(sc->mp, fsbno),
-			XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1);
-
-	agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
-	ASSERT(XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
-
-	/*
-	 * If we are repairing per-inode metadata, we need to read in the AGF
-	 * buffer.  Otherwise, we're repairing a per-AG structure, so reuse
-	 * the AGF buffer that the setup functions already grabbed.
-	 */
-	if (sc->ip) {
-		error = xfs_alloc_read_agf(sc->sa.pag, sc->tp, 0, &agf_bp);
-		if (error)
-			return error;
-	} else {
-		agf_bp = sc->sa.agf_bp;
-	}
-	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, sc->sa.pag);
-
-	/* Can we find any other rmappings? */
-	error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo,
-			&has_other_rmap);
-	xfs_btree_del_cursor(cur, error);
-	if (error)
-		goto out_free;
-
-	/*
-	 * If there are other rmappings, this block is cross linked and must
-	 * not be freed.  Remove the reverse mapping and move on.  Otherwise,
-	 * we were the only owner of the block, so free the extent, which will
-	 * also remove the rmap.
-	 *
-	 * XXX: XFS doesn't support detecting the case where a single block
-	 * metadata structure is crosslinked with a multi-block structure
-	 * because the buffer cache doesn't detect aliasing problems, so we
-	 * can't fix 100% of crosslinking problems (yet).  The verifiers will
-	 * blow on writeout, the filesystem will shut down, and the admin gets
-	 * to run xfs_repair.
-	 */
-	if (has_other_rmap)
-		error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno,
-					1, rs->oinfo);
-	else if (rs->resv == XFS_AG_RESV_AGFL)
-		error = xrep_put_freelist(sc, agbno);
-	else
-		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo,
-				rs->resv);
-	if (agf_bp != sc->sa.agf_bp)
-		xfs_trans_brelse(sc->tp, agf_bp);
-	if (error)
-		return error;
-
-	if (sc->ip)
-		return xfs_trans_roll_inode(&sc->tp, sc->ip);
-	return xrep_roll_ag_trans(sc);
-
-out_free:
-	if (agf_bp != sc->sa.agf_bp)
-		xfs_trans_brelse(sc->tp, agf_bp);
-	return error;
-}
-
-/* Dispose of every block of every extent in the bitmap. */
-int
-xrep_reap_extents(
-	struct xfs_scrub		*sc,
-	struct xbitmap			*bitmap,
-	const struct xfs_owner_info	*oinfo,
-	enum xfs_ag_resv_type		type)
-{
-	struct xrep_reap_state		rs = {
-		.sc			= sc,
-		.oinfo			= oinfo,
-		.resv			= type,
-	};
-
-	ASSERT(xfs_has_rmapbt(sc->mp));
-
-	return xbitmap_walk_bits(bitmap, xrep_reap_block, &rs);
-}
-
 /*
  * Finding per-AG Btree Roots for AGF/AGI Reconstruction
  *
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index a764838e969d..72ea48802848 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -29,8 +29,6 @@ struct xagb_bitmap;
 
 int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink);
 int xrep_invalidate_blocks(struct xfs_scrub *sc, struct xbitmap *btlist);
-int xrep_reap_extents(struct xfs_scrub *sc, struct xbitmap *exlist,
-		const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type);
 
 struct xrep_find_ag_btree {
 	/* in: rmap owner of the btree we're looking for */



* [PATCH 3/9] xfs: only invalidate blocks if we're going to free them
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 5/9] xfs: use deferred frees to reap old btree blocks Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 6/9] xfs: rearrange xrep_reap_block to make future code flow easier Darrick J. Wong
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're discarding old btree blocks after a repair, only invalidate
the buffers for the ones that we're freeing -- if the metadata was
crosslinked with another data structure, we don't want to touch it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/reap.c   |   96 +++++++++++++++++++++----------------------------
 fs/xfs/scrub/repair.h |    1 -
 2 files changed, 42 insertions(+), 55 deletions(-)


diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 613a8b897a25..b75ec582e5b1 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -70,54 +70,10 @@
  *
  * The caller is responsible for locking the AG headers for the entire rebuild
  * operation so that nothing else can sneak in and change the AG state while
- * we're not looking.  We also assume that the caller already invalidated any
- * buffers associated with @bitmap.
+ * we're not looking.  We must also invalidate any buffers associated with
+ * @bitmap.
  */
 
-static int
-xrep_invalidate_block(
-	uint64_t		fsbno,
-	void			*priv)
-{
-	struct xfs_scrub	*sc = priv;
-	struct xfs_buf		*bp;
-	int			error;
-
-	/* Skip AG headers and post-EOFS blocks */
-	if (!xfs_verify_fsbno(sc->mp, fsbno))
-		return 0;
-
-	error = xfs_buf_incore(sc->mp->m_ddev_targp,
-			XFS_FSB_TO_DADDR(sc->mp, fsbno),
-			XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp);
-	if (error)
-		return 0;
-
-	xfs_trans_bjoin(sc->tp, bp);
-	xfs_trans_binval(sc->tp, bp);
-	return 0;
-}
-
-/*
- * Invalidate buffers for per-AG btree blocks we're dumping.  This function
- * is not intended for use with file data repairs; we have bunmapi for that.
- */
-int
-xrep_invalidate_blocks(
-	struct xfs_scrub	*sc,
-	struct xbitmap		*bitmap)
-{
-	/*
-	 * For each block in each extent, see if there's an incore buffer for
-	 * exactly that block; if so, invalidate it.  The buffer cache only
-	 * lets us look for one buffer at a time, so we have to look one block
-	 * at a time.  Avoid invalidating AG headers and post-EOFS blocks
-	 * because we never own those; and if we can't TRYLOCK the buffer we
-	 * assume it's owned by someone else.
-	 */
-	return xbitmap_walk_bits(bitmap, xrep_invalidate_block, sc);
-}
-
 /* Information about reaping extents after a repair. */
 struct xrep_reap_state {
 	struct xfs_scrub		*sc;
@@ -127,9 +83,7 @@ struct xrep_reap_state {
 	enum xfs_ag_resv_type		resv;
 };
 
-/*
- * Put a block back on the AGFL.
- */
+/* Put a block back on the AGFL. */
 STATIC int
 xrep_put_freelist(
 	struct xfs_scrub	*sc,
@@ -168,6 +122,37 @@ xrep_put_freelist(
 	return 0;
 }
 
+/* Try to invalidate the incore buffer for a block that we're about to free. */
+STATIC void
+xrep_block_reap_binval(
+	struct xfs_scrub	*sc,
+	xfs_fsblock_t		fsbno)
+{
+	struct xfs_buf		*bp = NULL;
+	int			error;
+
+	/*
+	 * If there's an incore buffer for exactly this block, invalidate it.
+	 * Avoid invalidating AG headers and post-EOFS blocks because we never
+	 * own those.
+	 */
+	if (!xfs_verify_fsbno(sc->mp, fsbno))
+		return;
+
+	/*
+	 * We assume that the lack of any other known owners means that the
+	 * buffer can be locked without risk of deadlocking.
+	 */
+	error = xfs_buf_incore(sc->mp->m_ddev_targp,
+			XFS_FSB_TO_DADDR(sc->mp, fsbno),
+			XFS_FSB_TO_BB(sc->mp, 1), 0, &bp);
+	if (error)
+		return;
+
+	xfs_trans_bjoin(sc->tp, bp);
+	xfs_trans_binval(sc->tp, bp);
+}
+
 /* Dispose of a single block. */
 STATIC int
 xrep_reap_block(
@@ -225,14 +210,17 @@ xrep_reap_block(
 	 * blow on writeout, the filesystem will shut down, and the admin gets
 	 * to run xfs_repair.
 	 */
-	if (has_other_rmap)
-		error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno,
-					1, rs->oinfo);
-	else if (rs->resv == XFS_AG_RESV_AGFL)
+	if (has_other_rmap) {
+		error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno, 1,
+				rs->oinfo);
+	} else if (rs->resv == XFS_AG_RESV_AGFL) {
+		xrep_block_reap_binval(sc, fsbno);
 		error = xrep_put_freelist(sc, agbno);
-	else
+	} else {
+		xrep_block_reap_binval(sc, fsbno);
 		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo,
 				rs->resv);
+	}
 	if (agf_bp != sc->sa.agf_bp)
 		xfs_trans_brelse(sc->tp, agf_bp);
 	if (error)
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 72ea48802848..fd2f3ada7ca3 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -28,7 +28,6 @@ struct xbitmap;
 struct xagb_bitmap;
 
 int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink);
-int xrep_invalidate_blocks(struct xfs_scrub *sc, struct xbitmap *btlist);
 
 struct xrep_find_ag_btree {
 	/* in: rmap owner of the btree we're looking for */



* [PATCH 4/9] xfs: only allow reaping of per-AG blocks in xrep_reap_extents
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 1/9] xfs: cull repair code that will never get used Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 7/9] xfs: ignore stale buffers when scanning the buffer cache Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 9/9] xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we've refactored btree cursors to require the caller to pass in
a perag structure, there are numerous problems in xrep_reap_extents if
it's being called to reap extents for an inode metadata repair.  We
don't have any repair functions that can do that, so drop the support
for now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/reap.c |   45 +++++++++++++--------------------------------
 1 file changed, 13 insertions(+), 32 deletions(-)


diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index b75ec582e5b1..c57388c47dc4 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -162,40 +162,30 @@ xrep_reap_block(
 	struct xrep_reap_state		*rs = priv;
 	struct xfs_scrub		*sc = rs->sc;
 	struct xfs_btree_cur		*cur;
-	struct xfs_buf			*agf_bp = NULL;
+	xfs_agnumber_t			agno;
 	xfs_agblock_t			agbno;
 	bool				has_other_rmap;
 	int				error;
 
-	ASSERT(sc->ip != NULL ||
-	       XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
-	trace_xrep_dispose_btree_extent(sc->mp,
-			XFS_FSB_TO_AGNO(sc->mp, fsbno),
-			XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1);
-
+	agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
 	agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
-	ASSERT(XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
 
-	/*
-	 * If we are repairing per-inode metadata, we need to read in the AGF
-	 * buffer.  Otherwise, we're repairing a per-AG structure, so reuse
-	 * the AGF buffer that the setup functions already grabbed.
-	 */
-	if (sc->ip) {
-		error = xfs_alloc_read_agf(sc->sa.pag, sc->tp, 0, &agf_bp);
-		if (error)
-			return error;
-	} else {
-		agf_bp = sc->sa.agf_bp;
+	trace_xrep_dispose_btree_extent(sc->mp, agno, agbno, 1);
+
+	/* We don't support reaping file extents yet. */
+	if (sc->ip != NULL || sc->sa.pag->pag_agno != agno) {
+		ASSERT(0);
+		return -EFSCORRUPTED;
 	}
-	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, sc->sa.pag);
+
+	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, sc->sa.pag);
 
 	/* Can we find any other rmappings? */
 	error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo,
 			&has_other_rmap);
 	xfs_btree_del_cursor(cur, error);
 	if (error)
-		goto out_free;
+		return error;
 
 	/*
 	 * If there are other rmappings, this block is cross linked and must
@@ -211,8 +201,8 @@ xrep_reap_block(
 	 * to run xfs_repair.
 	 */
 	if (has_other_rmap) {
-		error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno, 1,
-				rs->oinfo);
+		error = xfs_rmap_free(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno,
+				1, rs->oinfo);
 	} else if (rs->resv == XFS_AG_RESV_AGFL) {
 		xrep_block_reap_binval(sc, fsbno);
 		error = xrep_put_freelist(sc, agbno);
@@ -221,19 +211,10 @@ xrep_reap_block(
 		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo,
 				rs->resv);
 	}
-	if (agf_bp != sc->sa.agf_bp)
-		xfs_trans_brelse(sc->tp, agf_bp);
 	if (error)
 		return error;
 
-	if (sc->ip)
-		return xfs_trans_roll_inode(&sc->tp, sc->ip);
 	return xrep_roll_ag_trans(sc);
-
-out_free:
-	if (agf_bp != sc->sa.agf_bp)
-		xfs_trans_brelse(sc->tp, agf_bp);
-	return error;
 }
 
 /* Dispose of every block of every extent in the bitmap. */



* [PATCH 5/9] xfs: use deferred frees to reap old btree blocks
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/9] xfs: move the post-repair block reaping code to a separate file Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 8/9] xfs: reap large AG metadata extents when possible Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/9] xfs: only invalidate blocks if we're going to free them Darrick J. Wong
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use deferred frees (EFIs) to reap the blocks of a btree that we just
replaced.  This helps us to shrink the window in which those old blocks
could be lost due to a system crash, though we try to flush the EFIs
every few hundred blocks so that we don't also overflow the transaction
reservations during and after we commit the new btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/reap.c |   27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index c57388c47dc4..74c150f38a33 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -26,6 +26,7 @@
 #include "xfs_ag_resv.h"
 #include "xfs_quota.h"
 #include "xfs_qm.h"
+#include "xfs_bmap.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -81,6 +82,9 @@ struct xrep_reap_state {
 	/* Reverse mapping owner and metadata reservation type. */
 	const struct xfs_owner_info	*oinfo;
 	enum xfs_ag_resv_type		resv;
+
+	/* Number of deferred reaps attached to the current transaction. */
+	unsigned int			deferred;
 };
 
 /* Put a block back on the AGFL. */
@@ -165,6 +169,7 @@ xrep_reap_block(
 	xfs_agnumber_t			agno;
 	xfs_agblock_t			agbno;
 	bool				has_other_rmap;
+	bool				need_roll = true;
 	int				error;
 
 	agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
@@ -207,13 +212,22 @@ xrep_reap_block(
 		xrep_block_reap_binval(sc, fsbno);
 		error = xrep_put_freelist(sc, agbno);
 	} else {
+		/*
+		 * Use deferred frees to get rid of the old btree blocks to try
+		 * to minimize the window in which we could crash and lose the
+		 * old blocks.  However, we still need to roll the transaction
+		 * every 100 or so EFIs so that we don't exceed the log
+		 * reservation.
+		 */
 		xrep_block_reap_binval(sc, fsbno);
-		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo,
-				rs->resv);
+		__xfs_free_extent_later(sc->tp, fsbno, 1, rs->oinfo, true);
+		rs->deferred++;
+		need_roll = rs->deferred > 100;
 	}
-	if (error)
+	if (error || !need_roll)
 		return error;
 
+	rs->deferred = 0;
 	return xrep_roll_ag_trans(sc);
 }
 
@@ -230,8 +244,13 @@ xrep_reap_extents(
 		.oinfo			= oinfo,
 		.resv			= type,
 	};
+	int				error;
 
 	ASSERT(xfs_has_rmapbt(sc->mp));
 
-	return xbitmap_walk_bits(bitmap, xrep_reap_block, &rs);
+	error = xbitmap_walk_bits(bitmap, xrep_reap_block, &rs);
+	if (error || rs.deferred == 0)
+		return error;
+
+	return xrep_roll_ag_trans(sc);
 }



* [PATCH 6/9] xfs: rearrange xrep_reap_block to make future code flow easier
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 3/9] xfs: only invalidate blocks if we're going to free them Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/9] xfs: cull repair code that will never get used Darrick J. Wong
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Rearrange the logic inside xrep_reap_block to make it more obvious that
crosslinked metadata blocks are handled differently.  Add a couple of
tracepoints so that we can tell what's going on at the end of a btree
rebuild operation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |    6 +++---
 fs/xfs/scrub/reap.c            |   19 ++++++++++++++-----
 fs/xfs/scrub/trace.h           |   17 ++++++++---------
 3 files changed, 25 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 53863560ab0f..629f2a681485 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -645,13 +645,13 @@ xrep_agfl_fill(
 	xfs_fsblock_t		fsbno = start;
 	int			error;
 
+	trace_xrep_agfl_insert(sc->sa.pag, XFS_FSB_TO_AGBNO(sc->mp, start),
+			len);
+
 	while (fsbno < start + len && af->fl_off < af->flcount)
 		af->agfl_bno[af->fl_off++] =
 				cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, fsbno++));
 
-	trace_xrep_agfl_insert(sc->mp, sc->sa.pag->pag_agno,
-			XFS_FSB_TO_AGBNO(sc->mp, start), len);
-
 	error = xbitmap_set(&af->used_extents, start, fsbno - 1);
 	if (error)
 		return error;
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 74c150f38a33..00cdc0e9063e 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -175,8 +175,6 @@ xrep_reap_block(
 	agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
 	agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
 
-	trace_xrep_dispose_btree_extent(sc->mp, agno, agbno, 1);
-
 	/* We don't support reaping file extents yet. */
 	if (sc->ip != NULL || sc->sa.pag->pag_agno != agno) {
 		ASSERT(0);
@@ -206,10 +204,21 @@ xrep_reap_block(
 	 * to run xfs_repair.
 	 */
 	if (has_other_rmap) {
+		trace_xrep_dispose_unmap_extent(sc->sa.pag, agbno, 1);
+
 		error = xfs_rmap_free(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno,
 				1, rs->oinfo);
-	} else if (rs->resv == XFS_AG_RESV_AGFL) {
-		xrep_block_reap_binval(sc, fsbno);
+		if (error)
+			return error;
+
+		goto roll_out;
+	}
+
+	trace_xrep_dispose_free_extent(sc->sa.pag, agbno, 1);
+
+	xrep_block_reap_binval(sc, fsbno);
+
+	if (rs->resv == XFS_AG_RESV_AGFL) {
 		error = xrep_put_freelist(sc, agbno);
 	} else {
 		/*
@@ -219,7 +228,6 @@ xrep_reap_block(
 		 * every 100 or so EFIs so that we don't exceed the log
 		 * reservation.
 		 */
-		xrep_block_reap_binval(sc, fsbno);
 		__xfs_free_extent_later(sc->tp, fsbno, 1, rs->oinfo, true);
 		rs->deferred++;
 		need_roll = rs->deferred > 100;
@@ -227,6 +235,7 @@ xrep_reap_block(
 	if (error || !need_roll)
 		return error;
 
+roll_out:
 	rs->deferred = 0;
 	return xrep_roll_ag_trans(sc);
 }
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 030ea76f1c90..5c4375397f24 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -730,9 +730,8 @@ TRACE_EVENT(xchk_refcount_incorrect,
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 
 DECLARE_EVENT_CLASS(xrep_extent_class,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 xfs_agblock_t agbno, xfs_extlen_t len),
-	TP_ARGS(mp, agno, agbno, len),
+	TP_PROTO(struct xfs_perag *pag, xfs_agblock_t agbno, xfs_extlen_t len),
+	TP_ARGS(pag, agbno, len),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
@@ -740,8 +739,8 @@ DECLARE_EVENT_CLASS(xrep_extent_class,
 		__field(xfs_extlen_t, len)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->agno = agno;
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
 		__entry->agbno = agbno;
 		__entry->len = len;
 	),
@@ -753,10 +752,10 @@ DECLARE_EVENT_CLASS(xrep_extent_class,
 );
 #define DEFINE_REPAIR_EXTENT_EVENT(name) \
 DEFINE_EVENT(xrep_extent_class, name, \
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
-		 xfs_agblock_t agbno, xfs_extlen_t len), \
-	TP_ARGS(mp, agno, agbno, len))
-DEFINE_REPAIR_EXTENT_EVENT(xrep_dispose_btree_extent);
+	TP_PROTO(struct xfs_perag *pag, xfs_agblock_t agbno, xfs_extlen_t len), \
+	TP_ARGS(pag, agbno, len))
+DEFINE_REPAIR_EXTENT_EVENT(xrep_dispose_unmap_extent);
+DEFINE_REPAIR_EXTENT_EVENT(xrep_dispose_free_extent);
 DEFINE_REPAIR_EXTENT_EVENT(xrep_agfl_insert);
 
 DECLARE_EVENT_CLASS(xrep_rmap_class,



* [PATCH 7/9] xfs: ignore stale buffers when scanning the buffer cache
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 4/9] xfs: only allow reaping of per-AG blocks in xrep_reap_extents Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 9/9] xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

After an online repair, we need to invalidate buffers representing the
blocks from the old metadata that we're replacing.  It's possible that
parts of a tree that were previously cached in memory are no longer
accessible due to media failure or other corruption on interior nodes,
so repair figures out the old blocks from the reverse mapping data and
scans the buffer cache directly.

Unfortunately, the current buffer cache code triggers asserts if the
rhashtable lookup finds a non-stale buffer of a different length than
the key we searched for.  For regular operation this is desirable, but
for this repair procedure, we don't care since we're going to forcibly
stale the buffer anyway.  Add an internal lookup flag to avoid the
assert.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/reap.c |    2 +-
 fs/xfs/xfs_buf.c    |    5 ++++-
 fs/xfs/xfs_buf.h    |   10 ++++++++++
 3 files changed, 15 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 00cdc0e9063e..a329235b039b 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -149,7 +149,7 @@ xrep_block_reap_binval(
 	 */
 	error = xfs_buf_incore(sc->mp->m_ddev_targp,
 			XFS_FSB_TO_DADDR(sc->mp, fsbno),
-			XFS_FSB_TO_BB(sc->mp, 1), 0, &bp);
+			XFS_FSB_TO_BB(sc->mp, 1), XBF_BCACHE_SCAN, &bp);
 	if (error)
 		return;
 
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 54c774af6e1c..a538501b652b 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -482,7 +482,8 @@ _xfs_buf_obj_cmp(
 		 * reallocating a busy extent. Skip this buffer and
 		 * continue searching for an exact match.
 		 */
-		ASSERT(bp->b_flags & XBF_STALE);
+		if (!(map->bm_flags & XBM_IGNORE_LENGTH_MISMATCH))
+			ASSERT(bp->b_flags & XBF_STALE);
 		return 1;
 	}
 	return 0;
@@ -683,6 +684,8 @@ xfs_buf_get_map(
 	int			error;
 	int			i;
 
+	if (flags & XBF_BCACHE_SCAN)
+		cmap.bm_flags |= XBM_IGNORE_LENGTH_MISMATCH;
 	for (i = 0; i < nmaps; i++)
 		cmap.bm_len += map[i].bm_len;
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 549c60942208..d6e8c3bab9f6 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -44,6 +44,11 @@ struct xfs_buf;
 #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
 
 /* flags used only as arguments to access routines */
+/*
+ * We're scanning the buffer cache; do not warn about lookup mismatches.
+ * Only online repair should use this.
+ */
+#define XBF_BCACHE_SCAN	 (1u << 28)
 #define XBF_INCORE	 (1u << 29)/* lookup only, return if found in cache */
 #define XBF_TRYLOCK	 (1u << 30)/* lock requested, but do not wait */
 #define XBF_UNMAPPED	 (1u << 31)/* do not map the buffer */
@@ -67,6 +72,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
 	/* The following interface flags should never be set */ \
+	{ XBF_BCACHE_SCAN,	"BCACHE_SCAN" }, \
 	{ XBF_INCORE,		"INCORE" }, \
 	{ XBF_TRYLOCK,		"TRYLOCK" }, \
 	{ XBF_UNMAPPED,		"UNMAPPED" }
@@ -114,8 +120,12 @@ typedef struct xfs_buftarg {
 struct xfs_buf_map {
 	xfs_daddr_t		bm_bn;	/* block number for I/O */
 	int			bm_len;	/* size of I/O */
+	unsigned int		bm_flags;
 };
 
+/* Don't complain about live buffers with the wrong length during lookup. */
+#define XBM_IGNORE_LENGTH_MISMATCH	(1U << 0)
+
 #define DEFINE_SINGLE_BUF_MAP(map, blkno, numblk) \
 	struct xfs_buf_map (map) = { .bm_bn = (blkno), .bm_len = (numblk) };
 



* [PATCH 8/9] xfs: reap large AG metadata extents when possible
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/9] xfs: move the post-repair block reaping code to a separate file Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/9] xfs: use deferred frees to reap old btree blocks Darrick J. Wong
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're freeing extents that have been set in a bitmap, break the
bitmap extent into multiple sub-extents organized by fate, and reap the
extents.  This enables us to dispose of old resources more efficiently
than doing them block by block.

While we're at it, rename the reaping functions to make it clear that
they're reaping per-AG extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |    2 
 fs/xfs/scrub/bitmap.c          |   37 ----
 fs/xfs/scrub/bitmap.h          |    4 
 fs/xfs/scrub/reap.c            |  382 ++++++++++++++++++++++++++++++++--------
 fs/xfs/scrub/reap.h            |    2 
 fs/xfs/scrub/repair.c          |   51 +++++
 fs/xfs/scrub/repair.h          |    1 
 fs/xfs/scrub/trace.h           |   37 ++++
 8 files changed, 395 insertions(+), 121 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 629f2a681485..bd4283d16891 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -774,7 +774,7 @@ xrep_agfl(
 		goto err;
 
 	/* Dump any AGFL overflow. */
-	error = xrep_reap_extents(sc, &agfl_extents, &XFS_RMAP_OINFO_AG,
+	error = xrep_reap_ag_metadata(sc, &agfl_extents, &XFS_RMAP_OINFO_AG,
 			XFS_AG_RESV_AGFL);
 err:
 	xbitmap_destroy(&agfl_extents);
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index 72fdb6cd69b4..ebcb7a07d26f 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -385,43 +385,6 @@ xbitmap_walk(
 	return error;
 }
 
-struct xbitmap_walk_bits {
-	xbitmap_walk_bits_fn	fn;
-	void			*priv;
-};
-
-/* Walk all the bits in a run. */
-static int
-xbitmap_walk_bits_in_run(
-	uint64_t			start,
-	uint64_t			len,
-	void				*priv)
-{
-	struct xbitmap_walk_bits	*wb = priv;
-	uint64_t			i;
-	int				error = 0;
-
-	for (i = start; i < start + len; i++) {
-		error = wb->fn(i, wb->priv);
-		if (error)
-			break;
-	}
-
-	return error;
-}
-
-/* Call a function for every set bit in this bitmap. */
-int
-xbitmap_walk_bits(
-	struct xbitmap			*bitmap,
-	xbitmap_walk_bits_fn		fn,
-	void				*priv)
-{
-	struct xbitmap_walk_bits	wb = {.fn = fn, .priv = priv};
-
-	return xbitmap_walk(bitmap, xbitmap_walk_bits_in_run, &wb);
-}
-
 /* Does this bitmap have no bits set at all? */
 bool
 xbitmap_empty(
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index ab67073f4f01..8c2711feea30 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -33,10 +33,6 @@ typedef int (*xbitmap_walk_fn)(uint64_t start, uint64_t len, void *priv);
 int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn,
 		void *priv);
 
-typedef int (*xbitmap_walk_bits_fn)(uint64_t bit, void *priv);
-int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn,
-		void *priv);
-
 bool xbitmap_empty(struct xbitmap *bitmap);
 bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len);
 
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index a329235b039b..7a6b68255781 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -27,6 +27,10 @@
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_bmap.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr.h"
+#include "xfs_attr_remote.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -76,20 +80,29 @@
  */
 
 /* Information about reaping extents after a repair. */
-struct xrep_reap_state {
+struct xreap_state {
 	struct xfs_scrub		*sc;
 
 	/* Reverse mapping owner and metadata reservation type. */
 	const struct xfs_owner_info	*oinfo;
 	enum xfs_ag_resv_type		resv;
 
+	/* If true, roll the transaction before reaping the next extent. */
+	bool				force_roll;
+
 	/* Number of deferred reaps attached to the current transaction. */
 	unsigned int			deferred;
+
+	/* Number of invalidated buffers logged to the current transaction. */
+	unsigned int			invalidated;
+
+	/* Number of deferred reaps queued during the whole reap sequence. */
+	unsigned long long		total_deferred;
 };
 
 /* Put a block back on the AGFL. */
 STATIC int
-xrep_put_freelist(
+xreap_put_freelist(
 	struct xfs_scrub	*sc,
 	xfs_agblock_t		agbno)
 {
@@ -126,69 +139,162 @@ xrep_put_freelist(
 	return 0;
 }
 
-/* Try to invalidate the incore buffer for a block that we're about to free. */
+/* Are there any uncommitted reap operations? */
+static inline bool xreap_dirty(const struct xreap_state *rs)
+{
+	if (rs->force_roll)
+		return true;
+	if (rs->deferred)
+		return true;
+	if (rs->invalidated)
+		return true;
+	if (rs->total_deferred)
+		return true;
+	return false;
+}
+
+#define XREAP_MAX_DEFERRED	(128)
+#define XREAP_MAX_BINVAL	(2048)
+
+/*
+ * Decide if we want to roll the transaction after reaping an extent.  We don't
+ * want to overrun the transaction reservation, so we prohibit more than
+ * 128 EFIs per transaction.  For the same reason, we limit the number
+ * of buffer invalidations to 2048.
+ */
+static inline bool xreap_want_roll(const struct xreap_state *rs)
+{
+	if (rs->force_roll)
+		return true;
+	if (rs->deferred > XREAP_MAX_DEFERRED)
+		return true;
+	if (rs->invalidated > XREAP_MAX_BINVAL)
+		return true;
+	return false;
+}
+
+static inline void xreap_reset(struct xreap_state *rs)
+{
+	rs->total_deferred += rs->deferred;
+	rs->deferred = 0;
+	rs->invalidated = 0;
+	rs->force_roll = false;
+}
+
+#define XREAP_MAX_DEFER_CHAIN		(2048)
+
+/*
+ * Decide if we want to finish the deferred ops that are attached to the scrub
+ * transaction.  We don't want to queue huge chains of deferred ops because
+ * that can consume a lot of log space and kernel memory.  Hence we trigger a
+ * xfs_defer_finish if there are more than 2048 deferred reap operations or the
+ * caller did some real work.
+ */
+static inline bool
+xreap_want_defer_finish(const struct xreap_state *rs)
+{
+	if (rs->force_roll)
+		return true;
+	if (rs->total_deferred > XREAP_MAX_DEFER_CHAIN)
+		return true;
+	return false;
+}
+
+static inline void xreap_defer_finish_reset(struct xreap_state *rs)
+{
+	rs->total_deferred = 0;
+	rs->deferred = 0;
+	rs->invalidated = 0;
+	rs->force_roll = false;
+}
+
+/* Try to invalidate the incore buffers for an extent that we're freeing. */
 STATIC void
-xrep_block_reap_binval(
-	struct xfs_scrub	*sc,
-	xfs_fsblock_t		fsbno)
+xreap_agextent_binval(
+	struct xreap_state	*rs,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		*aglenp)
 {
-	struct xfs_buf		*bp = NULL;
-	int			error;
+	struct xfs_scrub	*sc = rs->sc;
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agnumber_t		agno = sc->sa.pag->pag_agno;
+	xfs_agblock_t		agbno_next = agbno + *aglenp;
+	xfs_agblock_t		bno = agbno;
 
 	/*
-	 * If there's an incore buffer for exactly this block, invalidate it.
 	 * Avoid invalidating AG headers and post-EOFS blocks because we never
 	 * own those.
 	 */
-	if (!xfs_verify_fsbno(sc->mp, fsbno))
+	if (!xfs_verify_agbno(pag, agbno) ||
+	    !xfs_verify_agbno(pag, agbno_next - 1))
 		return;
 
 	/*
-	 * We assume that the lack of any other known owners means that the
-	 * buffer can be locked without risk of deadlocking.
+	 * If there are incore buffers for these blocks, invalidate them.  We
+	 * assume that the lack of any other known owners means that the buffer
+	 * can be locked without risk of deadlocking.  The buffer cache cannot
+	 * detect aliasing, so employ nested loops to scan for incore buffers
+	 * of any plausible size.
 	 */
-	error = xfs_buf_incore(sc->mp->m_ddev_targp,
-			XFS_FSB_TO_DADDR(sc->mp, fsbno),
-			XFS_FSB_TO_BB(sc->mp, 1), XBF_BCACHE_SCAN, &bp);
-	if (error)
-		return;
-
-	xfs_trans_bjoin(sc->tp, bp);
-	xfs_trans_binval(sc->tp, bp);
+	while (bno < agbno_next) {
+		xfs_agblock_t	fsbcount;
+		xfs_agblock_t	max_fsbs;
+
+		/*
+		 * Max buffer size is the max remote xattr buffer size, which
+		 * is one fs block larger than 64k.
+		 */
+		max_fsbs = min_t(xfs_agblock_t, agbno_next - bno,
+				xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX));
+
+		for (fsbcount = 1; fsbcount < max_fsbs; fsbcount++) {
+			struct xfs_buf	*bp = NULL;
+			xfs_daddr_t	daddr;
+			int		error;
+
+			daddr = XFS_AGB_TO_DADDR(mp, agno, bno);
+			error = xfs_buf_incore(mp->m_ddev_targp, daddr,
+					XFS_FSB_TO_BB(mp, fsbcount),
+					XBF_BCACHE_SCAN, &bp);
+			if (error)
+				continue;
+
+			xfs_trans_bjoin(sc->tp, bp);
+			xfs_trans_binval(sc->tp, bp);
+			rs->invalidated++;
+
+			/*
+			 * Stop invalidating if we've hit the limit; we should
+			 * still have enough reservation left to free however
+			 * far we've gotten.
+			 */
+			if (rs->invalidated > XREAP_MAX_BINVAL) {
+				*aglenp -= agbno_next - bno;
+				goto out;
+			}
+		}
+
+		bno++;
+	}
+
+out:
+	trace_xreap_agextent_binval(sc->sa.pag, agbno, *aglenp);
 }
 
-/* Dispose of a single block. */
+/* Dispose of a single AG extent. */
 STATIC int
-xrep_reap_block(
-	uint64_t			fsbno,
-	void				*priv)
+xreap_agextent(
+	struct xreap_state	*rs,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		*aglenp,
+	bool			crosslinked)
 {
-	struct xrep_reap_state		*rs = priv;
-	struct xfs_scrub		*sc = rs->sc;
-	struct xfs_btree_cur		*cur;
-	xfs_agnumber_t			agno;
-	xfs_agblock_t			agbno;
-	bool				has_other_rmap;
-	bool				need_roll = true;
-	int				error;
+	struct xfs_scrub	*sc = rs->sc;
+	xfs_fsblock_t		fsbno;
+	int			error = 0;
 
-	agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
-	agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
-
-	/* We don't support reaping file extents yet. */
-	if (sc->ip != NULL || sc->sa.pag->pag_agno != agno) {
-		ASSERT(0);
-		return -EFSCORRUPTED;
-	}
-
-	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, sc->sa.pag);
-
-	/* Can we find any other rmappings? */
-	error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo,
-			&has_other_rmap);
-	xfs_btree_del_cursor(cur, error);
-	if (error)
-		return error;
+	fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno, agbno);
 
 	/*
 	 * If there are other rmappings, this block is cross linked and must
@@ -203,52 +309,172 @@ xrep_reap_block(
 	 * blow on writeout, the filesystem will shut down, and the admin gets
 	 * to run xfs_repair.
 	 */
-	if (has_other_rmap) {
-		trace_xrep_dispose_unmap_extent(sc->sa.pag, agbno, 1);
+	if (crosslinked) {
+		trace_xreap_dispose_unmap_extent(sc->sa.pag, agbno, *aglenp);
 
-		error = xfs_rmap_free(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno,
-				1, rs->oinfo);
-		if (error)
-			return error;
-
-		goto roll_out;
+		rs->force_roll = true;
+		return xfs_rmap_free(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno,
+				*aglenp, rs->oinfo);
 	}
 
-	trace_xrep_dispose_free_extent(sc->sa.pag, agbno, 1);
+	trace_xreap_dispose_free_extent(sc->sa.pag, agbno, *aglenp);
 
-	xrep_block_reap_binval(sc, fsbno);
+	/*
+	 * Invalidate as many buffers as we can, starting at agbno.  If this
+	 * function sets *aglenp to zero, the transaction is full of logged
+	 * buffer invalidations, so we need to return early so that we can
+	 * roll and retry.
+	 */
+	xreap_agextent_binval(rs, agbno, aglenp);
+	if (*aglenp == 0) {
+		ASSERT(xreap_want_roll(rs));
+		return 0;
+	}
 
 	if (rs->resv == XFS_AG_RESV_AGFL) {
-		error = xrep_put_freelist(sc, agbno);
+		ASSERT(*aglenp == 1);
+		error = xreap_put_freelist(sc, agbno);
+		rs->force_roll = true;
 	} else {
 		/*
 		 * Use deferred frees to get rid of the old btree blocks to try
 		 * to minimize the window in which we could crash and lose the
-		 * old blocks.  However, we still need to roll the transaction
-		 * every 100 or so EFIs so that we don't exceed the log
-		 * reservation.
+		 * old blocks.
 		 */
-		__xfs_free_extent_later(sc->tp, fsbno, 1, rs->oinfo, true);
+		__xfs_free_extent_later(sc->tp, fsbno, *aglenp, rs->oinfo, true);
 		rs->deferred++;
-		need_roll = rs->deferred > 100;
 	}
-	if (error || !need_roll)
-		return error;
 
-roll_out:
-	rs->deferred = 0;
-	return xrep_roll_ag_trans(sc);
+	return error;
 }
 
-/* Dispose of every block of every extent in the bitmap. */
+/*
+ * Figure out the longest run of blocks that we can dispose of with a single
+ * call.  Cross-linked blocks should have their reverse mappings removed, but
+ * single-owner extents can be freed.  AGFL blocks can only be put back one at
+ * a time.
+ */
+STATIC int
+xreap_agextent_select(
+	struct xreap_state	*rs,
+	xfs_agblock_t		agbno,
+	xfs_agblock_t		agbno_next,
+	bool			*crosslinked,
+	xfs_extlen_t		*aglenp)
+{
+	struct xfs_scrub	*sc = rs->sc;
+	struct xfs_btree_cur	*cur;
+	xfs_agblock_t		bno = agbno + 1;
+	xfs_extlen_t		len = 1;
+	int			error;
+
+	/*
+	 * Determine if there are any other rmap records covering the first
+	 * block of this extent.  If so, the block is crosslinked.
+	 */
+	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.pag);
+	error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo,
+			crosslinked);
+	if (error)
+		goto out_cur;
+
+	/* AGFL blocks can only be dealt with one at a time. */
+	if (rs->resv == XFS_AG_RESV_AGFL)
+		goto out_found;
+
+	/*
+	 * Figure out how many of the subsequent blocks have the same crosslink
+	 * status.
+	 */
+	while (bno < agbno_next) {
+		bool		also_crosslinked;
+
+		error = xfs_rmap_has_other_keys(cur, bno, 1, rs->oinfo,
+				&also_crosslinked);
+		if (error)
+			goto out_cur;
+
+		if (*crosslinked != also_crosslinked)
+			break;
+
+		len++;
+		bno++;
+	}
+
+out_found:
+	*aglenp = len;
+	trace_xreap_agextent_select(sc->sa.pag, agbno, len, *crosslinked);
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/*
+ * Break an AG metadata extent into sub-extents by fate (crosslinked, not
+ * crosslinked), and dispose of each sub-extent separately.
+ */
+STATIC int
+xreap_agmeta_extent(
+	uint64_t		fsbno,
+	uint64_t		len,
+	void			*priv)
+{
+	struct xreap_state	*rs = priv;
+	struct xfs_scrub	*sc = rs->sc;
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
+	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
+	xfs_agblock_t		agbno_next = agbno + len;
+	int			error = 0;
+
+	ASSERT(len <= XFS_MAX_BMBT_EXTLEN);
+	ASSERT(sc->ip == NULL);
+
+	if (agno != sc->sa.pag->pag_agno) {
+		ASSERT(sc->sa.pag->pag_agno == agno);
+		return -EFSCORRUPTED;
+	}
+
+	while (agbno < agbno_next) {
+		xfs_extlen_t	aglen;
+		bool		crosslinked;
+
+		error = xreap_agextent_select(rs, agbno, agbno_next,
+				&crosslinked, &aglen);
+		if (error)
+			return error;
+
+		error = xreap_agextent(rs, agbno, &aglen, crosslinked);
+		if (error)
+			return error;
+
+		if (xreap_want_defer_finish(rs)) {
+			error = xrep_defer_finish(sc);
+			if (error)
+				return error;
+			xreap_defer_finish_reset(rs);
+		} else if (xreap_want_roll(rs)) {
+			error = xrep_roll_ag_trans(sc);
+			if (error)
+				return error;
+			xreap_reset(rs);
+		}
+
+		agbno += aglen;
+	}
+
+	return 0;
+}
+
+/* Dispose of every block of every AG metadata extent in the bitmap. */
 int
-xrep_reap_extents(
+xrep_reap_ag_metadata(
 	struct xfs_scrub		*sc,
 	struct xbitmap			*bitmap,
 	const struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type		type)
 {
-	struct xrep_reap_state		rs = {
+	struct xreap_state		rs = {
 		.sc			= sc,
 		.oinfo			= oinfo,
 		.resv			= type,
@@ -256,10 +482,14 @@ xrep_reap_extents(
 	int				error;
 
 	ASSERT(xfs_has_rmapbt(sc->mp));
+	ASSERT(sc->ip == NULL);
 
-	error = xbitmap_walk_bits(bitmap, xrep_reap_block, &rs);
-	if (error || rs.deferred == 0)
+	error = xbitmap_walk(bitmap, xreap_agmeta_extent, &rs);
+	if (error)
 		return error;
 
-	return xrep_roll_ag_trans(sc);
+	if (xreap_dirty(&rs))
+		return xrep_defer_finish(sc);
+
+	return 0;
 }
diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h
index 73d098ea7b04..143bd5ae5fe7 100644
--- a/fs/xfs/scrub/reap.h
+++ b/fs/xfs/scrub/reap.h
@@ -6,7 +6,7 @@
 #ifndef __XFS_SCRUB_REAP_H__
 #define __XFS_SCRUB_REAP_H__
 
-int xrep_reap_extents(struct xfs_scrub *sc, struct xbitmap *bitmap,
+int xrep_reap_ag_metadata(struct xfs_scrub *sc, struct xbitmap *bitmap,
 		const struct xfs_owner_info *oinfo,
 		enum xfs_ag_resv_type type);
 
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 762eefb6ac90..8d990a42119e 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -26,6 +26,7 @@
 #include "xfs_ag_resv.h"
 #include "xfs_quota.h"
 #include "xfs_qm.h"
+#include "xfs_defer.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -166,6 +167,56 @@ xrep_roll_ag_trans(
 	return 0;
 }
 
+/* Finish all deferred work attached to the repair transaction. */
+int
+xrep_defer_finish(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	/*
+	 * Keep the AG header buffers locked while we complete deferred work
+	 * items.  Ensure that both AG buffers are dirty and held when we roll
+	 * the transaction so that they move forward in the log without losing
+	 * the bli (and hence the bli type) when the transaction commits.
+	 *
+	 * Normal code would never hold clean buffers across a roll, but repair
+	 * needs both buffers to maintain a total lock on the AG.
+	 */
+	if (sc->sa.agi_bp) {
+		xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, XFS_AGI_MAGICNUM);
+		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
+	}
+
+	if (sc->sa.agf_bp) {
+		xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, XFS_AGF_MAGICNUM);
+		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
+	}
+
+	/*
+	 * Finish all deferred work items.  We still hold the AG header buffers
+	 * locked regardless of whether or not that succeeds.  On failure, the
+	 * buffers will be released during teardown on our way out of the
+	 * kernel.  If successful, join the buffers to the new transaction
+	 * and move on.
+	 */
+	error = xfs_defer_finish(&sc->tp);
+	if (error)
+		return error;
+
+	/*
+	 * Release the hold that we set above because defer_finish won't do
+	 * that for us.  The defer roll code redirties held buffers after each
+	 * roll, so the AG header buffers should be ready for logging.
+	 */
+	if (sc->sa.agi_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
+	if (sc->sa.agf_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
+
+	return 0;
+}
+
 /*
  * Does the given AG have enough space to rebuild a btree?  Neither AG
  * reservation can be critical, and we must have enough space (factoring
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index fd2f3ada7ca3..a0df121e6866 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -20,6 +20,7 @@ static inline int xrep_notsupported(struct xfs_scrub *sc)
 int xrep_attempt(struct xfs_scrub *sc);
 void xrep_failure(struct xfs_mount *mp);
 int xrep_roll_ag_trans(struct xfs_scrub *sc);
+int xrep_defer_finish(struct xfs_scrub *sc);
 bool xrep_ag_has_space(struct xfs_perag *pag, xfs_extlen_t nr_blocks,
 		enum xfs_ag_resv_type type);
 xfs_extlen_t xrep_calc_ag_resblks(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 5c4375397f24..df36311081a5 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -754,10 +754,43 @@ DECLARE_EVENT_CLASS(xrep_extent_class,
 DEFINE_EVENT(xrep_extent_class, name, \
 	TP_PROTO(struct xfs_perag *pag, xfs_agblock_t agbno, xfs_extlen_t len), \
 	TP_ARGS(pag, agbno, len))
-DEFINE_REPAIR_EXTENT_EVENT(xrep_dispose_unmap_extent);
-DEFINE_REPAIR_EXTENT_EVENT(xrep_dispose_free_extent);
+DEFINE_REPAIR_EXTENT_EVENT(xreap_dispose_unmap_extent);
+DEFINE_REPAIR_EXTENT_EVENT(xreap_dispose_free_extent);
+DEFINE_REPAIR_EXTENT_EVENT(xreap_agextent_binval);
 DEFINE_REPAIR_EXTENT_EVENT(xrep_agfl_insert);
 
+DECLARE_EVENT_CLASS(xrep_reap_find_class,
+	TP_PROTO(struct xfs_perag *pag, xfs_agblock_t agbno, xfs_extlen_t len,
+		bool crosslinked),
+	TP_ARGS(pag, agbno, len, crosslinked),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(bool, crosslinked)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->crosslinked = crosslinked;
+	),
+	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x crosslinked %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->crosslinked ? 1 : 0)
+);
+#define DEFINE_REPAIR_REAP_FIND_EVENT(name) \
+DEFINE_EVENT(xrep_reap_find_class, name, \
+	TP_PROTO(struct xfs_perag *pag, xfs_agblock_t agbno, xfs_extlen_t len, \
+		 bool crosslinked), \
+	TP_ARGS(pag, agbno, len, crosslinked))
+DEFINE_REPAIR_REAP_FIND_EVENT(xreap_agextent_select);
+
 DECLARE_EVENT_CLASS(xrep_rmap_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
 		 xfs_agblock_t agbno, xfs_extlen_t len,



* [PATCH 9/9] xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 7/9] xfs: ignore stale buffers when scanning the buffer cache Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The AGFL repair code uses a series of bitmaps to figure out where there
are OWN_AG blocks that are not claimed by the free space and rmap
btrees.  These blocks become the new AGFL, and any overflow is reaped.
The bitmaps currently track xfs_fsblock_t even though we already know the
AG number.

In the last patch, we introduced a new bitmap "type" for tracking
xfs_agblock_t extents.  Port the reaping code and the AGFL repair to use
this new type, which makes it very obvious what we're tracking.  This
also eliminates a bunch of unnecessary agblock <-> fsblock conversions.
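
For reference, an xfs_fsblock_t for a per-AG block is just the AG number
shifted above the AG block number, so every fsblock-based bitmap
operation pays for an encode and a decode.  Here's a tiny standalone
model (simplified from the XFS_AGB_TO_FSB/XFS_FSB_TO_AGNO/
XFS_FSB_TO_AGBNO macros; the agblklog and block values are invented)
showing the round trip that the xagb_bitmap type lets us skip:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t agb_to_fsb(uint32_t agno, uint32_t agbno, unsigned agblklog)
{
	return ((uint64_t)agno << agblklog) | agbno;
}

static uint32_t fsb_to_agno(uint64_t fsbno, unsigned agblklog)
{
	return fsbno >> agblklog;
}

static uint32_t fsb_to_agbno(uint64_t fsbno, unsigned agblklog)
{
	return fsbno & ((1ULL << agblklog) - 1);
}

int main(void)
{
	unsigned agblklog = 16;		/* 65536 blocks per AG, invented */
	uint32_t agno = 3, agbno = 1234;

	/* An fsblock bitmap has to round-trip through this encoding... */
	uint64_t fsbno = agb_to_fsb(agno, agbno, agblklog);
	assert(fsb_to_agno(fsbno, agblklog) == agno);
	assert(fsb_to_agbno(fsbno, agblklog) == agbno);

	/* ...whereas an xagb_bitmap stores agbno as-is, since the AG is
	 * already fixed by the repair context. */
	printf("fsbno 0x%llx <-> agno %u agbno %u\n",
			(unsigned long long)fsbno, agno, agbno);
	return 0;
}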

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |   74 ++++++++++++++++++----------------------
 fs/xfs/scrub/bitmap.c          |   41 ++--------------------
 fs/xfs/scrub/bitmap.h          |    6 +--
 fs/xfs/scrub/reap.c            |   14 ++------
 fs/xfs/scrub/reap.h            |    5 +--
 5 files changed, 45 insertions(+), 95 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index bd4283d16891..df0dadcedd97 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -444,13 +444,13 @@ xrep_agf(
 
 struct xrep_agfl {
 	/* Bitmap of alleged AGFL blocks that we're not going to add. */
-	struct xbitmap		crossed;
+	struct xagb_bitmap	crossed;
 
 	/* Bitmap of other OWN_AG metadata blocks. */
-	struct xbitmap		agmetablocks;
+	struct xagb_bitmap	agmetablocks;
 
 	/* Bitmap of free space. */
-	struct xbitmap		*freesp;
+	struct xagb_bitmap	*freesp;
 
 	/* rmapbt cursor for finding crosslinked blocks */
 	struct xfs_btree_cur	*rmap_cur;
@@ -466,7 +466,6 @@ xrep_agfl_walk_rmap(
 	void			*priv)
 {
 	struct xrep_agfl	*ra = priv;
-	xfs_fsblock_t		fsb;
 	int			error = 0;
 
 	if (xchk_should_terminate(ra->sc, &error))
@@ -474,14 +473,13 @@ xrep_agfl_walk_rmap(
 
 	/* Record all the OWN_AG blocks. */
 	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
-		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno,
-				rec->rm_startblock);
-		error = xbitmap_set(ra->freesp, fsb, rec->rm_blockcount);
+		error = xagb_bitmap_set(ra->freesp, rec->rm_startblock,
+				rec->rm_blockcount);
 		if (error)
 			return error;
 	}
 
-	return xbitmap_set_btcur_path(&ra->agmetablocks, cur);
+	return xagb_bitmap_set_btcur_path(&ra->agmetablocks, cur);
 }
 
 /* Strike out the blocks that are cross-linked according to the rmapbt. */
@@ -492,12 +490,10 @@ xrep_agfl_check_extent(
 	void			*priv)
 {
 	struct xrep_agfl	*ra = priv;
-	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(ra->sc->mp, start);
+	xfs_agblock_t		agbno = start;
 	xfs_agblock_t		last_agbno = agbno + len - 1;
 	int			error;
 
-	ASSERT(XFS_FSB_TO_AGNO(ra->sc->mp, start) == ra->sc->sa.pag->pag_agno);
-
 	while (agbno <= last_agbno) {
 		bool		other_owners;
 
@@ -507,7 +503,7 @@ xrep_agfl_check_extent(
 			return error;
 
 		if (other_owners) {
-			error = xbitmap_set(&ra->crossed, agbno, 1);
+			error = xagb_bitmap_set(&ra->crossed, agbno, 1);
 			if (error)
 				return error;
 		}
@@ -533,7 +529,7 @@ STATIC int
 xrep_agfl_collect_blocks(
 	struct xfs_scrub	*sc,
 	struct xfs_buf		*agf_bp,
-	struct xbitmap		*agfl_extents,
+	struct xagb_bitmap	*agfl_extents,
 	xfs_agblock_t		*flcount)
 {
 	struct xrep_agfl	ra;
@@ -543,8 +539,8 @@ xrep_agfl_collect_blocks(
 
 	ra.sc = sc;
 	ra.freesp = agfl_extents;
-	xbitmap_init(&ra.agmetablocks);
-	xbitmap_init(&ra.crossed);
+	xagb_bitmap_init(&ra.agmetablocks);
+	xagb_bitmap_init(&ra.crossed);
 
 	/* Find all space used by the free space btrees & rmapbt. */
 	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.pag);
@@ -556,7 +552,7 @@ xrep_agfl_collect_blocks(
 	/* Find all blocks currently being used by the bnobt. */
 	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp,
 			sc->sa.pag, XFS_BTNUM_BNO);
-	error = xbitmap_set_btblocks(&ra.agmetablocks, cur);
+	error = xagb_bitmap_set_btblocks(&ra.agmetablocks, cur);
 	xfs_btree_del_cursor(cur, error);
 	if (error)
 		goto out_bmp;
@@ -564,7 +560,7 @@ xrep_agfl_collect_blocks(
 	/* Find all blocks currently being used by the cntbt. */
 	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp,
 			sc->sa.pag, XFS_BTNUM_CNT);
-	error = xbitmap_set_btblocks(&ra.agmetablocks, cur);
+	error = xagb_bitmap_set_btblocks(&ra.agmetablocks, cur);
 	xfs_btree_del_cursor(cur, error);
 	if (error)
 		goto out_bmp;
@@ -573,17 +569,17 @@ xrep_agfl_collect_blocks(
 	 * Drop the freesp meta blocks that are in use by btrees.
 	 * The remaining blocks /should/ be AGFL blocks.
 	 */
-	error = xbitmap_disunion(agfl_extents, &ra.agmetablocks);
+	error = xagb_bitmap_disunion(agfl_extents, &ra.agmetablocks);
 	if (error)
 		goto out_bmp;
 
 	/* Strike out the blocks that are cross-linked. */
 	ra.rmap_cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.pag);
-	error = xbitmap_walk(agfl_extents, xrep_agfl_check_extent, &ra);
+	error = xagb_bitmap_walk(agfl_extents, xrep_agfl_check_extent, &ra);
 	xfs_btree_del_cursor(ra.rmap_cur, error);
 	if (error)
 		goto out_bmp;
-	error = xbitmap_disunion(agfl_extents, &ra.crossed);
+	error = xagb_bitmap_disunion(agfl_extents, &ra.crossed);
 	if (error)
 		goto out_bmp;
 
@@ -591,12 +587,12 @@ xrep_agfl_collect_blocks(
 	 * Calculate the new AGFL size.  If we found more blocks than fit in
 	 * the AGFL we'll free them later.
 	 */
-	*flcount = min_t(uint64_t, xbitmap_hweight(agfl_extents),
+	*flcount = min_t(uint64_t, xagb_bitmap_hweight(agfl_extents),
 			 xfs_agfl_size(mp));
 
 out_bmp:
-	xbitmap_destroy(&ra.crossed);
-	xbitmap_destroy(&ra.agmetablocks);
+	xagb_bitmap_destroy(&ra.crossed);
+	xagb_bitmap_destroy(&ra.agmetablocks);
 	return error;
 }
 
@@ -626,7 +622,7 @@ xrep_agfl_update_agf(
 }
 
 struct xrep_agfl_fill {
-	struct xbitmap		used_extents;
+	struct xagb_bitmap	used_extents;
 	struct xfs_scrub	*sc;
 	__be32			*agfl_bno;
 	xfs_agblock_t		flcount;
@@ -642,17 +638,15 @@ xrep_agfl_fill(
 {
 	struct xrep_agfl_fill	*af = priv;
 	struct xfs_scrub	*sc = af->sc;
-	xfs_fsblock_t		fsbno = start;
+	xfs_agblock_t		agbno = start;
 	int			error;
 
-	trace_xrep_agfl_insert(sc->sa.pag, XFS_FSB_TO_AGBNO(sc->mp, start),
-			len);
+	trace_xrep_agfl_insert(sc->sa.pag, agbno, len);
 
-	while (fsbno < start + len && af->fl_off < af->flcount)
-		af->agfl_bno[af->fl_off++] =
-				cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, fsbno++));
+	while (agbno < start + len && af->fl_off < af->flcount)
+		af->agfl_bno[af->fl_off++] = cpu_to_be32(agbno++);
 
-	error = xbitmap_set(&af->used_extents, start, fsbno - 1);
+	error = xagb_bitmap_set(&af->used_extents, start, agbno - 1);
 	if (error)
 		return error;
 
@@ -667,7 +661,7 @@ STATIC int
 xrep_agfl_init_header(
 	struct xfs_scrub	*sc,
 	struct xfs_buf		*agfl_bp,
-	struct xbitmap		*agfl_extents,
+	struct xagb_bitmap	*agfl_extents,
 	xfs_agblock_t		flcount)
 {
 	struct xrep_agfl_fill	af = {
@@ -695,17 +689,17 @@ xrep_agfl_init_header(
 	 * blocks than fit in the AGFL, they will be freed in a subsequent
 	 * step.
 	 */
-	xbitmap_init(&af.used_extents);
+	xagb_bitmap_init(&af.used_extents);
 	af.agfl_bno = xfs_buf_to_agfl_bno(agfl_bp),
-	xbitmap_walk(agfl_extents, xrep_agfl_fill, &af);
-	error = xbitmap_disunion(agfl_extents, &af.used_extents);
+	xagb_bitmap_walk(agfl_extents, xrep_agfl_fill, &af);
+	error = xagb_bitmap_disunion(agfl_extents, &af.used_extents);
 	if (error)
 		return error;
 
 	/* Write new AGFL to disk. */
 	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
 	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
-	xbitmap_destroy(&af.used_extents);
+	xagb_bitmap_destroy(&af.used_extents);
 	return 0;
 }
 
@@ -714,7 +708,7 @@ int
 xrep_agfl(
 	struct xfs_scrub	*sc)
 {
-	struct xbitmap		agfl_extents;
+	struct xagb_bitmap	agfl_extents;
 	struct xfs_mount	*mp = sc->mp;
 	struct xfs_buf		*agf_bp;
 	struct xfs_buf		*agfl_bp;
@@ -725,7 +719,7 @@ xrep_agfl(
 	if (!xfs_has_rmapbt(mp))
 		return -EOPNOTSUPP;
 
-	xbitmap_init(&agfl_extents);
+	xagb_bitmap_init(&agfl_extents);
 
 	/*
 	 * Read the AGF so that we can query the rmapbt.  We hope that there's
@@ -774,10 +768,10 @@ xrep_agfl(
 		goto err;
 
 	/* Dump any AGFL overflow. */
-	error = xrep_reap_ag_metadata(sc, &agfl_extents, &XFS_RMAP_OINFO_AG,
+	error = xrep_reap_agblocks(sc, &agfl_extents, &XFS_RMAP_OINFO_AG,
 			XFS_AG_RESV_AGFL);
 err:
-	xbitmap_destroy(&agfl_extents);
+	xagb_bitmap_destroy(&agfl_extents);
 	return error;
 }
 
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index ebcb7a07d26f..f707434b1c86 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -301,21 +301,15 @@ xagb_bitmap_set_btblocks(
  * blocks going from the leaf towards the root.
  */
 int
-xbitmap_set_btcur_path(
-	struct xbitmap		*bitmap,
+xagb_bitmap_set_btcur_path(
+	struct xagb_bitmap	*bitmap,
 	struct xfs_btree_cur	*cur)
 {
-	struct xfs_buf		*bp;
-	xfs_fsblock_t		fsb;
 	int			i;
 	int			error;
 
 	for (i = 0; i < cur->bc_nlevels && cur->bc_levels[i].ptr == 1; i++) {
-		xfs_btree_get_block(cur, i, &bp);
-		if (!bp)
-			continue;
-		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
-		error = xbitmap_set(bitmap, fsb, 1);
+		error = xagb_bitmap_visit_btblock(cur, i, bitmap);
 		if (error)
 			return error;
 	}
@@ -323,35 +317,6 @@ xbitmap_set_btcur_path(
 	return 0;
 }
 
-/* Collect a btree's block in the bitmap. */
-STATIC int
-xbitmap_collect_btblock(
-	struct xfs_btree_cur	*cur,
-	int			level,
-	void			*priv)
-{
-	struct xbitmap		*bitmap = priv;
-	struct xfs_buf		*bp;
-	xfs_fsblock_t		fsbno;
-
-	xfs_btree_get_block(cur, level, &bp);
-	if (!bp)
-		return 0;
-
-	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
-	return xbitmap_set(bitmap, fsbno, 1);
-}
-
-/* Walk the btree and mark the bitmap wherever a btree block is found. */
-int
-xbitmap_set_btblocks(
-	struct xbitmap		*bitmap,
-	struct xfs_btree_cur	*cur)
-{
-	return xfs_btree_visit_blocks(cur, xbitmap_collect_btblock,
-			XFS_BTREE_VISIT_ALL, bitmap);
-}
-
 /* How many bits are set in this bitmap? */
 uint64_t
 xbitmap_hweight(
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 8c2711feea30..4265e53f45fa 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -16,10 +16,6 @@ void xbitmap_destroy(struct xbitmap *bitmap);
 int xbitmap_clear(struct xbitmap *bitmap, uint64_t start, uint64_t len);
 int xbitmap_set(struct xbitmap *bitmap, uint64_t start, uint64_t len);
 int xbitmap_disunion(struct xbitmap *bitmap, struct xbitmap *sub);
-int xbitmap_set_btcur_path(struct xbitmap *bitmap,
-		struct xfs_btree_cur *cur);
-int xbitmap_set_btblocks(struct xbitmap *bitmap,
-		struct xfs_btree_cur *cur);
 uint64_t xbitmap_hweight(struct xbitmap *bitmap);
 
 /*
@@ -104,5 +100,7 @@ static inline int xagb_bitmap_walk(struct xagb_bitmap *bitmap,
 
 int xagb_bitmap_set_btblocks(struct xagb_bitmap *bitmap,
 		struct xfs_btree_cur *cur);
+int xagb_bitmap_set_btcur_path(struct xagb_bitmap *bitmap,
+		struct xfs_btree_cur *cur);
 
 #endif	/* __XFS_SCRUB_BITMAP_H__ */
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 7a6b68255781..797536e6eba8 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -422,19 +422,13 @@ xreap_agmeta_extent(
 {
 	struct xreap_state	*rs = priv;
 	struct xfs_scrub	*sc = rs->sc;
-	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
-	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
+	xfs_agblock_t		agbno = fsbno;
 	xfs_agblock_t		agbno_next = agbno + len;
 	int			error = 0;
 
 	ASSERT(len <= XFS_MAX_BMBT_EXTLEN);
 	ASSERT(sc->ip == NULL);
 
-	if (agno != sc->sa.pag->pag_agno) {
-		ASSERT(sc->sa.pag->pag_agno == agno);
-		return -EFSCORRUPTED;
-	}
-
 	while (agbno < agbno_next) {
 		xfs_extlen_t	aglen;
 		bool		crosslinked;
@@ -468,9 +462,9 @@ xreap_agmeta_extent(
 
 /* Dispose of every block of every AG metadata extent in the bitmap. */
 int
-xrep_reap_ag_metadata(
+xrep_reap_agblocks(
 	struct xfs_scrub		*sc,
-	struct xbitmap			*bitmap,
+	struct xagb_bitmap		*bitmap,
 	const struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type		type)
 {
@@ -484,7 +478,7 @@ xrep_reap_ag_metadata(
 	ASSERT(xfs_has_rmapbt(sc->mp));
 	ASSERT(sc->ip == NULL);
 
-	error = xbitmap_walk(bitmap, xreap_agmeta_extent, &rs);
+	error = xagb_bitmap_walk(bitmap, xreap_agmeta_extent, &rs);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h
index 143bd5ae5fe7..032d1eac91b2 100644
--- a/fs/xfs/scrub/reap.h
+++ b/fs/xfs/scrub/reap.h
@@ -6,8 +6,7 @@
 #ifndef __XFS_SCRUB_REAP_H__
 #define __XFS_SCRUB_REAP_H__
 
-int xrep_reap_ag_metadata(struct xfs_scrub *sc, struct xbitmap *bitmap,
-		const struct xfs_owner_info *oinfo,
-		enum xfs_ag_resv_type type);
+int xrep_reap_agblocks(struct xfs_scrub *sc, struct xagb_bitmap *bitmap,
+		const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type);
 
 #endif /* __XFS_SCRUB_REAP_H__ */



* [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/6] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
                     ` (5 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
                   ` (75 subsequent siblings)
  77 siblings, 6 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Before we start merging the online repair functions, let's improve the
bulk loading code a bit.  First, we need to fix a misinteraction between
the AIL and the btree bulkloader wherein the delwri queueing at the end
of the bulk load fails to add a buffer to the writeback list if that
buffer happens to be sitting on the AIL's buffer list.

Second, we introduce EFIs in the btree bulkloader block allocator to
guarantee that staging blocks are freed if the filesystem goes down
before committing the new btree.

Third, we change the bulkloader itself to copy multiple records into a
block if possible, and add some debugging knobs so that developers can
control the slack factors, just like they can do for xfs_repair.
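
To give a rough feel for what the slack knobs do (this is not the real
geometry code, which lives in xfs_btree_bload_compute_geometry; the
record counts below are invented), leaving slack in each leaf costs
extra blocks at load time but leaves headroom so the first insertions
don't immediately split every block:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t nr_records = 1000000;	/* records to bulk load, invented */
	unsigned maxrecs = 250;		/* records per full leaf, invented */
	unsigned slacks[] = { 0, 25, 50 };

	for (int i = 0; i < 3; i++) {
		unsigned per_block = maxrecs - slacks[i];
		uint64_t leaves = (nr_records + per_block - 1) / per_block;

		printf("leaf_slack %2u: %3u recs/block, %llu leaf blocks\n",
		       slacks[i], per_block, (unsigned long long)leaves);
	}
	return 0;
}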

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-prep-for-bulk-loading
---
 fs/xfs/Makefile                   |    1 
 fs/xfs/libxfs/xfs_btree.c         |    2 
 fs/xfs/libxfs/xfs_btree.h         |    3 
 fs/xfs/libxfs/xfs_btree_staging.c |   67 +++-
 fs/xfs/libxfs/xfs_btree_staging.h |   32 +-
 fs/xfs/scrub/agheader_repair.c    |    1 
 fs/xfs/scrub/common.c             |    1 
 fs/xfs/scrub/newbt.c              |  567 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/newbt.h              |   68 ++++
 fs/xfs/scrub/repair.c             |   10 +
 fs/xfs/scrub/repair.h             |    1 
 fs/xfs/scrub/scrub.c              |    2 
 fs/xfs/scrub/trace.h              |   36 ++
 fs/xfs/xfs_buf.c                  |   31 ++
 fs/xfs/xfs_buf.h                  |    1 
 fs/xfs/xfs_globals.c              |   12 +
 fs/xfs/xfs_sysctl.h               |    2 
 fs/xfs/xfs_sysfs.c                |   54 ++++
 18 files changed, 858 insertions(+), 33 deletions(-)
 create mode 100644 fs/xfs/scrub/newbt.c
 create mode 100644 fs/xfs/scrub/newbt.h



* [PATCH 1/6] xfs: force all buffers to be written during btree bulk load
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/6] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/6] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffer readers
encountering buffers with DELWRI_Q set, even though the btree bulk load
had already committed and the buffer itself wasn't on any delwri list.

I traced this to a misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list.  If a buffer is
logged and committed, the buffer can end up on that AIL buffer list.  If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up.  This clears DELWRI_Q from the buffer state.

If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block.  Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.

When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload's buffer list.

This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer.  This ordering requirement is
required for data consistency.

Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.

To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_staging.c |    4 +---
 fs/xfs/xfs_buf.c                  |   31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf.h                  |    1 +
 3 files changed, 33 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index dd75e208b543..29e3f8ccb185 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf(
 	if (*bpp == NULL)
 		return;
 
-	if (!xfs_buf_delwri_queue(*bpp, buffers_list))
-		ASSERT(0);
-
+	xfs_buf_delwri_queue_here(*bpp, buffers_list);
 	xfs_buf_relse(*bpp);
 	*bpp = NULL;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a538501b652b..2bea2c3f9ead 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2113,6 +2113,37 @@ xfs_buf_delwri_queue(
 	return true;
 }
 
+/*
+ * Queue a buffer to this delwri list as part of a data integrity operation.
+ * If the buffer is on any other delwri list, we'll wait for that to clear
+ * so that the caller can submit the buffer for IO and wait for the result.
+ * Callers must ensure the buffer is not already on the list.
+ */
+void
+xfs_buf_delwri_queue_here(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	/*
+	 * We need this buffer to end up on the /caller's/ delwri list, not any
+	 * old list.  This can happen if the buffer is marked stale (which
+	 * clears DELWRI_Q) after the AIL queues the buffer to its list but
+	 * before the AIL has a chance to submit the list.
+	 */
+	while (!list_empty(&bp->b_list)) {
+		xfs_buf_unlock(bp);
+		delay(1);
+		xfs_buf_lock(bp);
+	}
+
+	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
+
+	/* This buffer is uptodate; don't let it get reread. */
+	bp->b_flags |= XBF_DONE;
+
+	xfs_buf_delwri_queue(bp, buffer_list);
+}
+
 /*
  * Compare function is more complex than it needs to be because
  * the return value is only 32 bits and we are doing comparisons
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d6e8c3bab9f6..467ddb2e2f0d 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -315,6 +315,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
 /* Delayed Write Buffer Routines */
 extern void xfs_buf_delwri_cancel(struct list_head *);
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
+void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
 extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);



* [PATCH 2/6] xfs: implement block reservation accounting for btrees we're staging
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 3/6] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
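
As a usage note, the claim_block path added here hands blocks to the
bulk loader by peeling them off the front of each reserved extent and
moving exhausted reservations to the end of the list, so consecutively
claimed blocks tend to stay contiguous.  Here's a toy standalone model
of that behavior (not part of the patch; the extents are invented and
the list rotation is reduced to a head index):

#include <stdint.h>
#include <stdio.h>

struct resv {
	uint32_t	agbno;	/* start of the reserved extent */
	uint32_t	len;	/* length of the reservation */
	uint32_t	used;	/* blocks already claimed */
};

int main(void)
{
	/* Two reserved extents, as the block allocator might produce. */
	struct resv	resvs[] = { { .agbno = 500, .len = 2 },
				    { .agbno = 800, .len = 3 } };
	unsigned	head = 0, nresv = 2;

	/* Claim five blocks, always from the first unexhausted reservation. */
	for (int i = 0; i < 5; i++) {
		struct resv	*r = &resvs[head];
		uint32_t	agbno = r->agbno + r->used;

		r->used++;
		printf("claimed agbno %u\n", agbno);

		/* Exhausted: move on (the kernel list_move_tails it). */
		if (r->used == r->len && head + 1 < nresv)
			head++;
	}
	return 0;
}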

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                   |    1 
 fs/xfs/libxfs/xfs_btree_staging.h |    7 -
 fs/xfs/scrub/agheader_repair.c    |    1 
 fs/xfs/scrub/common.c             |    1 
 fs/xfs/scrub/newbt.c              |  399 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/newbt.h              |   64 ++++++
 fs/xfs/scrub/scrub.c              |    2 
 fs/xfs/scrub/trace.h              |   36 +++
 8 files changed, 506 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/scrub/newbt.c
 create mode 100644 fs/xfs/scrub/newbt.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 73bebce4d6f9..90f1f01277be 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -171,6 +171,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   newbt.o \
 				   reap.o \
 				   repair.o \
 				   )
diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h
index f0d2976050ae..d6dea3f0088c 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.h
+++ b/fs/xfs/libxfs/xfs_btree_staging.h
@@ -38,11 +38,8 @@ struct xbtree_ifakeroot {
 	/* Number of bytes available for this fork in the inode. */
 	unsigned int		if_fork_size;
 
-	/* Fork format. */
-	unsigned int		if_format;
-
-	/* Number of records. */
-	unsigned int		if_extents;
+	/* Which fork is this btree being built for? */
+	int			if_whichfork;
 };
 
 /* Cursor interactions with fake roots for inode-rooted btrees. */
diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index df0dadcedd97..5140f52fa9a1 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -10,6 +10,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_btree.h"
+#include "xfs_btree_staging.h"
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
 #include "xfs_sb.h"
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 002bb90559ff..6b2e14aecd66 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -10,6 +10,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_btree.h"
+#include "xfs_btree_staging.h"
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
 #include "xfs_inode.h"
diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
new file mode 100644
index 000000000000..9d677640e65e
--- /dev/null
+++ b/fs/xfs/scrub/newbt.c
@@ -0,0 +1,399 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_btree_staging.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_ag.h"
+#include "xfs_defer.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/newbt.h"
+
+/*
+ * Estimate proper slack values for a btree that's being reloaded.
+ *
+ * Under most circumstances, we'll take whatever default loading value the
+ * btree bulk loading code calculates for us.  However, there are some
+ * exceptions to this rule:
+ *
+ * (1) If someone turned one of the debug knobs.
+ * (2) If this is a per-AG btree and the AG has less than ~9% space free.
+ * (3) If this is an inode btree and the FS has less than ~9% space free.
+ *
+ * Note that we actually use 3/32 for the comparison to avoid division.
+ */
+static void
+xrep_newbt_estimate_slack(
+	struct xrep_newbt	*xnr)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	struct xfs_btree_bload	*bload = &xnr->bload;
+	uint64_t		free;
+	uint64_t		sz;
+
+	/* Let the btree code compute the default slack values. */
+	bload->leaf_slack = -1;
+	bload->node_slack = -1;
+
+	if (sc->ops->type == ST_PERAG) {
+		free = sc->sa.pag->pagf_freeblks;
+		sz = xfs_ag_block_count(sc->mp, sc->sa.pag->pag_agno);
+	} else {
+		free = percpu_counter_sum(&sc->mp->m_fdblocks);
+		sz = sc->mp->m_sb.sb_dblocks;
+	}
+
+	/* No further changes if there's more than 3/32ths space left. */
+	if (free >= ((sz * 3) >> 5))
+		return;
+
+	/* We're low on space; load the btrees as tightly as possible. */
+	if (bload->leaf_slack < 0)
+		bload->leaf_slack = 0;
+	if (bload->node_slack < 0)
+		bload->node_slack = 0;
+}
+
+/* Initialize accounting resources for staging a new AG btree. */
+void
+xrep_newbt_init_ag(
+	struct xrep_newbt		*xnr,
+	struct xfs_scrub		*sc,
+	const struct xfs_owner_info	*oinfo,
+	xfs_fsblock_t			alloc_hint,
+	enum xfs_ag_resv_type		resv)
+{
+	memset(xnr, 0, sizeof(struct xrep_newbt));
+	xnr->sc = sc;
+	xnr->oinfo = *oinfo; /* structure copy */
+	xnr->alloc_hint = alloc_hint;
+	xnr->resv = resv;
+	INIT_LIST_HEAD(&xnr->resv_list);
+	xrep_newbt_estimate_slack(xnr);
+}
+
+/* Initialize accounting resources for staging a new inode fork btree. */
+int
+xrep_newbt_init_inode(
+	struct xrep_newbt		*xnr,
+	struct xfs_scrub		*sc,
+	int				whichfork,
+	const struct xfs_owner_info	*oinfo)
+{
+	struct xfs_ifork		*ifp;
+
+	ifp = kmem_cache_zalloc(xfs_ifork_cache, XCHK_GFP_FLAGS);
+	if (!ifp)
+		return -ENOMEM;
+
+	xrep_newbt_init_ag(xnr, sc, oinfo,
+			XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino),
+			XFS_AG_RESV_NONE);
+	xnr->ifake.if_fork = ifp;
+	xnr->ifake.if_fork_size = xfs_inode_fork_size(sc->ip, whichfork);
+	xnr->ifake.if_whichfork = whichfork;
+	return 0;
+}
+
+/*
+ * Initialize accounting resources for staging a new btree.  Callers are
+ * expected to add their own reservations (and clean them up) manually.
+ */
+void
+xrep_newbt_init_bare(
+	struct xrep_newbt		*xnr,
+	struct xfs_scrub		*sc)
+{
+	xrep_newbt_init_ag(xnr, sc, &XFS_RMAP_OINFO_ANY_OWNER, NULLFSBLOCK,
+			XFS_AG_RESV_NONE);
+}
+
+/* Designate specific blocks to be used to build our new btree. */
+int
+xrep_newbt_add_blocks(
+	struct xrep_newbt		*xnr,
+	xfs_fsblock_t			fsbno,
+	xfs_extlen_t			len)
+{
+	struct xrep_newbt_resv		*resv;
+	struct xfs_mount		*mp = xnr->sc->mp;
+
+	resv = kmalloc(sizeof(struct xrep_newbt_resv), XCHK_GFP_FLAGS);
+	if (!resv)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&resv->list);
+	resv->agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	resv->len = len;
+	resv->used = 0;
+	resv->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, fsbno));
+
+	list_add_tail(&resv->list, &xnr->resv_list);
+	return 0;
+}
+
+/* Allocate disk space for our new btree. */
+int
+xrep_newbt_alloc_blocks(
+	struct xrep_newbt	*xnr,
+	uint64_t		nr_blocks)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	xfs_alloctype_t		type;
+	xfs_fsblock_t		alloc_hint = xnr->alloc_hint;
+	int			error = 0;
+
+	/*
+	 * Inode-rooted btrees can allocate from any AG, whereas AG btrees
+	 * require a specific AG mentioned in the alloc hint.
+	 */
+	type = sc->ip ? XFS_ALLOCTYPE_START_BNO : XFS_ALLOCTYPE_NEAR_BNO;
+
+	while (nr_blocks > 0) {
+		struct xfs_alloc_arg	args = {
+			.tp		= sc->tp,
+			.mp		= sc->mp,
+			.type		= type,
+			.fsbno		= alloc_hint,
+			.oinfo		= xnr->oinfo,
+			.minlen		= 1,
+			.maxlen		= nr_blocks,
+			.prod		= 1,
+			.resv		= xnr->resv,
+		};
+
+		error = xfs_alloc_vextent(&args);
+		if (error)
+			return error;
+		if (args.fsbno == NULLFSBLOCK)
+			return -ENOSPC;
+
+		trace_xrep_newbt_alloc_blocks(sc->mp,
+				XFS_FSB_TO_AGNO(sc->mp, args.fsbno),
+				XFS_FSB_TO_AGBNO(sc->mp, args.fsbno),
+				args.len, xnr->oinfo.oi_owner);
+
+		error = xrep_newbt_add_blocks(xnr, args.fsbno, args.len);
+		if (error)
+			return error;
+
+		nr_blocks -= args.len;
+		alloc_hint = args.fsbno + args.len - 1;
+
+		error = xrep_defer_finish(sc);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * How many extent freeing items can we attach to a transaction before we want
+ * to finish the chain so that unreserving new btree blocks doesn't overrun
+ * the transaction reservation?
+ */
+#define XREP_REAP_MAX_NEWBT_EFIS	(128)
+
+/*
+ * Free the unused part of an extent.  Returns the number of EFIs logged or
+ * a negative errno.
+ */
+STATIC int
+xrep_newbt_free_extent(
+	struct xrep_newbt	*xnr,
+	struct xrep_newbt_resv	*resv,
+	bool			btree_committed)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	xfs_agblock_t		free_agbno = resv->agbno;
+	xfs_extlen_t		free_aglen = resv->len;
+	xfs_fsblock_t		fsbno;
+	int			error;
+
+	/* Free only the unused part of the extent if the btree committed. */
+	if (btree_committed) {
+		free_agbno += resv->used;
+		free_aglen -= resv->used;
+	}
+
+	if (free_aglen == 0)
+		return 0;
+
+	trace_xrep_newbt_free_blocks(sc->mp, resv->pag->pag_agno, free_agbno,
+			free_aglen, xnr->oinfo.oi_owner);
+
+	if (xnr->resv == XFS_AG_RESV_NONE) {
+		/*
+		 * No per-AG reservation means that we can use EFIs to free the
+		 * reservations.  This reduces the chance that we leak blocks
+		 * if the system goes down.
+		 */
+		fsbno = XFS_AGB_TO_FSB(sc->mp, resv->pag->pag_agno, free_agbno);
+		__xfs_free_extent_later(sc->tp, fsbno, free_aglen, &xnr->oinfo,
+				true);
+		return 1;
+	}
+
+	if (xnr->resv == XFS_AG_RESV_RMAPBT ||
+	    xnr->resv == XFS_AG_RESV_METADATA) {
+		/*
+		 * Metadata blocks taken from a per-AG reservation must be put
+		 * back into that reservation immediately because EFIs cannot
+		 * free into per-AG reservations.
+		 */
+		error = __xfs_free_extent(sc->tp, resv->pag, free_agbno,
+				free_aglen, &xnr->oinfo, xnr->resv, true);
+		if (error < 0)
+			return error;
+		return XREP_REAP_MAX_NEWBT_EFIS;
+	}
+
+	ASSERT(0);
+	return -EFSCORRUPTED;
+}
+
+/* Free all the accounting info and disk space we reserved for a new btree. */
+STATIC int
+xrep_newbt_free(
+	struct xrep_newbt	*xnr,
+	bool			btree_committed)
+{
+	struct xfs_scrub	*sc = xnr->sc;
+	struct xrep_newbt_resv	*resv, *n;
+	unsigned int		freed = 0;
+	int			error = 0;
+
+	/*
+	 * If the filesystem already went down, we can't free the blocks.  Skip
+	 * ahead to freeing the incore metadata because we can't fix anything.
+	 */
+	if (xfs_is_shutdown(sc->mp))
+		goto junkit;
+
+	list_for_each_entry_safe(resv, n, &xnr->resv_list, list) {
+		int		ret;
+
+		ret = xrep_newbt_free_extent(xnr, resv, btree_committed);
+		list_del(&resv->list);
+		xfs_perag_put(resv->pag);
+		kfree(resv);
+		if (ret < 0) {
+			error = ret;
+			goto junkit;
+		}
+
+		freed += ret;
+		if (freed >= XREP_REAP_MAX_NEWBT_EFIS) {
+			error = xrep_defer_finish(sc);
+			if (error)
+				goto junkit;
+			freed = 0;
+		}
+	}
+
+	if (freed)
+		error = xrep_defer_finish(sc);
+
+junkit:
+	/*
+	 * If we still have reservations attached to @xnr, cleanup must have
+	 * failed and the filesystem is about to go down.  Clean up the incore
+	 * reservations.
+	 */
+	list_for_each_entry_safe(resv, n, &xnr->resv_list, list) {
+		list_del(&resv->list);
+		xfs_perag_put(resv->pag);
+		kfree(resv);
+	}
+
+	if (sc->ip) {
+		kmem_cache_free(xfs_ifork_cache, xnr->ifake.if_fork);
+		xnr->ifake.if_fork = NULL;
+	}
+
+	return error;
+}
+
+/*
+ * Free all the accounting info and unused disk space allocations after
+ * committing a new btree.
+ */
+int
+xrep_newbt_commit(
+	struct xrep_newbt	*xnr)
+{
+	return xrep_newbt_free(xnr, true);
+}
+
+/*
+ * Free all the accounting info and all of the disk space we reserved for a new
+ * btree that we're not going to commit.  We want to try to roll things back
+ * cleanly for things like ENOSPC midway through allocation.
+ */
+void
+xrep_newbt_cancel(
+	struct xrep_newbt	*xnr)
+{
+	xrep_newbt_free(xnr, false);
+}
+
+/* Feed one of the reserved btree blocks to the bulk loader. */
+int
+xrep_newbt_claim_block(
+	struct xfs_btree_cur	*cur,
+	struct xrep_newbt	*xnr,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xrep_newbt_resv	*resv;
+	struct xfs_mount	*mp = cur->bc_mp;
+	xfs_agblock_t		agbno;
+
+	/*
+	 * The first item in the list should always have a free block unless
+	 * we're completely out.
+	 */
+	resv = list_first_entry(&xnr->resv_list, struct xrep_newbt_resv, list);
+	if (resv->used == resv->len)
+		return -ENOSPC;
+
+	/*
+	 * Peel off a block from the start of the reservation.  We allocate
+	 * blocks in order to place blocks on disk in increasing record or key
+	 * order.  The block reservations tend to end up on the list in
+	 * decreasing order, which hopefully results in leaf blocks ending up
+	 * together.
+	 */
+	agbno = resv->agbno + resv->used;
+	resv->used++;
+
+	/* If we used all the blocks in this reservation, move it to the end. */
+	if (resv->used == resv->len)
+		list_move_tail(&resv->list, &xnr->resv_list);
+
+	trace_xrep_newbt_claim_block(mp, resv->pag->pag_agno, agbno, 1,
+			xnr->oinfo.oi_owner);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		ptr->l = cpu_to_be64(XFS_AGB_TO_FSB(mp, resv->pag->pag_agno,
+								agbno));
+	else
+		ptr->s = cpu_to_be32(agbno);
+	return 0;
+}
diff --git a/fs/xfs/scrub/newbt.h b/fs/xfs/scrub/newbt.h
new file mode 100644
index 000000000000..ad24cb32a7f1
--- /dev/null
+++ b/fs/xfs/scrub/newbt.h
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_NEWBT_H__
+#define __XFS_SCRUB_NEWBT_H__
+
+struct xrep_newbt_resv {
+	/* Link to list of extents that we've reserved. */
+	struct list_head	list;
+
+	struct xfs_perag	*pag;
+
+	/* AG block of the extent we reserved. */
+	xfs_agblock_t		agbno;
+
+	/* Length of the reservation. */
+	xfs_extlen_t		len;
+
+	/* How much of this reservation has been used. */
+	xfs_extlen_t		used;
+};
+
+struct xrep_newbt {
+	struct xfs_scrub	*sc;
+
+	/* List of extents that we've reserved. */
+	struct list_head	resv_list;
+
+	/* Fake root for new btree. */
+	union {
+		struct xbtree_afakeroot	afake;
+		struct xbtree_ifakeroot	ifake;
+	};
+
+	/* rmap owner of these blocks */
+	struct xfs_owner_info	oinfo;
+
+	/* btree geometry for the bulk loader */
+	struct xfs_btree_bload	bload;
+
+	/* Allocation hint */
+	xfs_fsblock_t		alloc_hint;
+
+	/* per-ag reservation type */
+	enum xfs_ag_resv_type	resv;
+};
+
+void xrep_newbt_init_bare(struct xrep_newbt *xnr, struct xfs_scrub *sc);
+void xrep_newbt_init_ag(struct xrep_newbt *xnr, struct xfs_scrub *sc,
+		const struct xfs_owner_info *oinfo, xfs_fsblock_t alloc_hint,
+		enum xfs_ag_resv_type resv);
+int xrep_newbt_init_inode(struct xrep_newbt *xnr, struct xfs_scrub *sc,
+		int whichfork, const struct xfs_owner_info *oinfo);
+int xrep_newbt_add_blocks(struct xrep_newbt *xnr, xfs_fsblock_t fsbno,
+		xfs_extlen_t len);
+int xrep_newbt_alloc_blocks(struct xrep_newbt *xnr, uint64_t nr_blocks);
+void xrep_newbt_cancel(struct xrep_newbt *xnr);
+int xrep_newbt_commit(struct xrep_newbt *xnr);
+int xrep_newbt_claim_block(struct xfs_btree_cur *cur, struct xrep_newbt *xnr,
+		union xfs_btree_ptr *ptr);
+
+#endif /* __XFS_SCRUB_NEWBT_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6697f5f32106..afd481f5a15e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -17,6 +17,8 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_scrub.h"
+#include "xfs_btree.h"
+#include "xfs_btree_staging.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index df36311081a5..cb33f42190df 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -989,6 +989,42 @@ TRACE_EVENT(xrep_ialloc_insert,
 		  __entry->freemask)
 )
 
+DECLARE_EVENT_CLASS(xrep_newbt_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_extlen_t len,
+		 int64_t owner),
+	TP_ARGS(mp, agno, agbno, len, owner),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(int64_t, owner)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->owner = owner;
+	),
+	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x owner 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->owner)
+);
+#define DEFINE_NEWBT_EXTENT_EVENT(name) \
+DEFINE_EVENT(xrep_newbt_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 xfs_agblock_t agbno, xfs_extlen_t len, \
+		 int64_t owner), \
+	TP_ARGS(mp, agno, agbno, len, owner))
+DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_blocks);
+DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_free_blocks);
+DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_claim_block);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/6] xfs: log EFIs for all btree blocks being used to stage a btree
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/6] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We need to log EFIs for every extent that we allocate for staging a new
btree so that if we fail, the blocks will be freed during log recovery.
Add a function to relog these EFIs so that repair can relog them all
every time it creates a new btree block, which helps us avoid pinning
the log tail.
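
The relog policy is simple bookkeeping: charge roughly 256 bytes of log
space for each relogged EF[ID] pair and roll the transaction once more
than half of the reservation has been consumed.  A rough userspace
sketch of that accounting (all names below are made up; none of this is
kernel API) looks like:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define EFI_RELOG_BYTES	256	/* single-mapping EFD + EFI, roughly */

struct fake_resv {
	bool	in_current_chkpt;	/* stand-in for the checkpoint test */
};

/* Returns the number of times we had to roll the transaction. */
static int relog_autoreap(struct fake_resv *resvs, size_t nr,
			  unsigned int log_res)
{
	unsigned int	used = 0;
	int		rolls = 0;
	size_t		i;

	for (i = 0; i < nr; i++) {
		/* item is in the current checkpoint; no relog needed */
		if (resvs[i].in_current_chkpt)
			continue;

		/* pretend we logged an EFD and a replacement EFI here */
		used += EFI_RELOG_BYTES;
		if (used > log_res / 2) {
			rolls++;	/* kernel code calls xrep_roll_trans() */
			used = 0;
		}
	}
	return rolls;
}

int main(void)
{
	struct fake_resv	resvs[100] = { 0 };

	printf("rolled %d times\n", relog_autoreap(resvs, 100, 4096));
	return 0;
}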

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/newbt.c  |  171 ++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/newbt.h  |    4 +
 fs/xfs/scrub/repair.c |   10 +++
 fs/xfs/scrub/repair.h |    1 
 4 files changed, 182 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
index 9d677640e65e..4eee8923a9bc 100644
--- a/fs/xfs/scrub/newbt.c
+++ b/fs/xfs/scrub/newbt.c
@@ -13,12 +13,14 @@
 #include "xfs_btree_staging.h"
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_sb.h"
 #include "xfs_inode.h"
 #include "xfs_alloc.h"
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 #include "xfs_defer.h"
+#include "xfs_extfree_item.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -124,15 +126,150 @@ xrep_newbt_init_bare(
 			XFS_AG_RESV_NONE);
 }
 
+/*
+ * Set up automatic reaping of the blocks reserved for btree reconstruction in
+ * case we crash by logging a deferred free item for each extent we allocate so
+ * that we can get all of the space back if we crash before we can commit the
+ * new btree.  The EFI log item is stashed in the reservation so that
+ * automatic reaping can be cancelled if repair succeeds.
+ */
+static int
+xrep_newbt_schedule_autoreap(
+	struct xrep_newbt		*xnr,
+	struct xrep_newbt_resv		*resv)
+{
+	struct xfs_extent_free_item	efi_item = {
+		.xefi_blockcount	= resv->len,
+		.xefi_owner		= xnr->oinfo.oi_owner,
+		.xefi_flags		= XFS_EFI_SKIP_DISCARD,
+		.xefi_pag		= resv->pag,
+	};
+	struct xfs_scrub		*sc = xnr->sc;
+	struct xfs_log_item		*lip;
+	LIST_HEAD(items);
+
+	ASSERT(xnr->oinfo.oi_offset == 0);
+
+	efi_item.xefi_startblock = XFS_AGB_TO_FSB(sc->mp, resv->pag->pag_agno,
+			resv->agbno);
+	if (xnr->oinfo.oi_flags & XFS_OWNER_INFO_ATTR_FORK)
+		efi_item.xefi_flags |= XFS_EFI_ATTR_FORK;
+	if (xnr->oinfo.oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
+		efi_item.xefi_flags |= XFS_EFI_BMBT_BLOCK;
+
+	INIT_LIST_HEAD(&efi_item.xefi_list);
+	list_add(&efi_item.xefi_list, &items);
+
+	xfs_perag_bump_intents(resv->pag);
+	lip = xfs_extent_free_defer_type.create_intent(sc->tp, &items, 1,
+			false);
+	ASSERT(lip != NULL && !IS_ERR(lip));
+
+	resv->efi = lip;
+	return 0;
+}
+
+/*
+ * Earlier, we logged EFIs for the extents that we allocated to hold the new
+ * btree so that we could automatically roll back those allocations if the
+ * system crashed.  Now we log an EFD to cancel the EFI, either because the
+ * repair succeeded and the new blocks are in use; or because the repair was
+ * cancelled and we're about to free the extents directly.
+ */
+static inline void
+xrep_newbt_finish_autoreap(
+	struct xfs_scrub	*sc,
+	struct xrep_newbt_resv	*resv)
+{
+	struct xfs_efd_log_item	*efdp;
+	struct xfs_extent	*extp;
+	struct xfs_log_item	*efd_lip;
+
+	efd_lip = xfs_extent_free_defer_type.create_done(sc->tp, resv->efi, 1);
+	efdp = container_of(efd_lip, struct xfs_efd_log_item, efd_item);
+	extp = efdp->efd_format.efd_extents;
+	extp->ext_start = XFS_AGB_TO_FSB(sc->mp, resv->pag->pag_agno,
+					 resv->agbno);
+	extp->ext_len = resv->len;
+	efdp->efd_next_extent++;
+	set_bit(XFS_LI_DIRTY, &efd_lip->li_flags);
+	xfs_perag_drop_intents(resv->pag);
+}
+
+/* Abort an EFI logged for a new btree block reservation. */
+static inline void
+xrep_newbt_cancel_autoreap(
+	struct xrep_newbt_resv	*resv)
+{
+	xfs_extent_free_defer_type.abort_intent(resv->efi);
+	xfs_perag_drop_intents(resv->pag);
+}
+
+/*
+ * Relog the EFIs attached to a staging btree so that we don't pin the log
+ * tail.  Same logic as xfs_defer_relog.
+ */
+int
+xrep_newbt_relog_autoreap(
+	struct xrep_newbt	*xnr)
+{
+	struct xrep_newbt_resv	*resv;
+	unsigned int		efi_bytes = 0;
+
+	list_for_each_entry(resv, &xnr->resv_list, list) {
+		/*
+		 * If the log intent item for this deferred op is in a
+		 * different checkpoint, relog it to keep the log tail moving
+		 * forward.  We're ok with this being racy because an incorrect
+		 * decision means we'll be a little slower at pushing the tail.
+		 */
+		if (!resv->efi || xfs_log_item_in_current_chkpt(resv->efi))
+			continue;
+
+		resv->efi = xfs_trans_item_relog(resv->efi, xnr->sc->tp);
+
+		/*
+		 * If free space is very fragmented, it's possible that the new
+		 * btree will be allocated a large number of small extents.
+		 * On an active system, it's possible that so many of those
+		 * EFIs will need relogging here that doing them all in one
+		 * transaction will overflow the reservation.
+		 *
+		 * Each allocation for the new btree (xrep_newbt_resv) points
+		 * to a unique single-mapping EFI, so each relog operation logs
+		 * a single-mapping EFD followed by a new EFI.  Each single
+		 * mapping EF[ID] item consumes about 128 bytes, so we'll
+		 * assume 256 bytes per relog.  Roll if we consume more than
+		 * half of the transaction reservation.
+		 */
+		efi_bytes += 256;
+		if (efi_bytes > xnr->sc->tp->t_log_res / 2) {
+			int	error;
+
+			error = xrep_roll_trans(xnr->sc);
+			if (error)
+				return error;
+
+			efi_bytes = 0;
+		}
+	}
+
+	if (xnr->sc->tp->t_flags & XFS_TRANS_DIRTY)
+		return xrep_roll_trans(xnr->sc);
+	return 0;
+}
+
 /* Designate specific blocks to be used to build our new btree. */
-int
-xrep_newbt_add_blocks(
+static int
+__xrep_newbt_add_blocks(
 	struct xrep_newbt		*xnr,
 	xfs_fsblock_t			fsbno,
-	xfs_extlen_t			len)
+	xfs_extlen_t			len,
+	bool				auto_reap)
 {
 	struct xrep_newbt_resv		*resv;
 	struct xfs_mount		*mp = xnr->sc->mp;
+	int				error;
 
 	resv = kmalloc(sizeof(struct xrep_newbt_resv), XCHK_GFP_FLAGS);
 	if (!resv)
@@ -144,10 +281,32 @@ xrep_newbt_add_blocks(
 	resv->used = 0;
 	resv->pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, fsbno));
 
+	if (auto_reap) {
+		error = xrep_newbt_schedule_autoreap(xnr, resv);
+		if (error) {
+			xfs_perag_put(resv->pag);
+			kfree(resv);
+			return error;
+		}
+	}
+
 	list_add_tail(&resv->list, &xnr->resv_list);
 	return 0;
 }
 
+/*
+ * Allow certain callers to add disk space directly to the reservation.
+ * Callers are responsible for cleaning up the reservations.
+ */
+int
+xrep_newbt_add_blocks(
+	struct xrep_newbt		*xnr,
+	xfs_fsblock_t			fsbno,
+	xfs_extlen_t			len)
+{
+	return __xrep_newbt_add_blocks(xnr, fsbno, len, false);
+}
+
 /* Allocate disk space for our new btree. */
 int
 xrep_newbt_alloc_blocks(
@@ -189,7 +348,8 @@ xrep_newbt_alloc_blocks(
 				XFS_FSB_TO_AGBNO(sc->mp, args.fsbno),
 				args.len, xnr->oinfo.oi_owner);
 
-		error = xrep_newbt_add_blocks(xnr, args.fsbno, args.len);
+		error = __xrep_newbt_add_blocks(xnr, args.fsbno, args.len,
+				true);
 		if (error)
 			return error;
 
@@ -233,6 +393,8 @@ xrep_newbt_free_extent(
 		free_aglen -= resv->used;
 	}
 
+	xrep_newbt_finish_autoreap(sc, resv);
+
 	if (free_aglen == 0)
 		return 0;
 
@@ -318,6 +480,7 @@ xrep_newbt_free(
 	 * reservations.
 	 */
 	list_for_each_entry_safe(resv, n, &xnr->resv_list, list) {
+		xrep_newbt_cancel_autoreap(resv);
 		list_del(&resv->list);
 		xfs_perag_put(resv->pag);
 		kfree(resv);
diff --git a/fs/xfs/scrub/newbt.h b/fs/xfs/scrub/newbt.h
index ad24cb32a7f1..a1eda5213b83 100644
--- a/fs/xfs/scrub/newbt.h
+++ b/fs/xfs/scrub/newbt.h
@@ -12,6 +12,9 @@ struct xrep_newbt_resv {
 
 	struct xfs_perag	*pag;
 
+	/* EFI tracking this space reservation */
+	struct xfs_log_item	*efi;
+
 	/* AG block of the extent we reserved. */
 	xfs_agblock_t		agbno;
 
@@ -60,5 +63,6 @@ void xrep_newbt_cancel(struct xrep_newbt *xnr);
 int xrep_newbt_commit(struct xrep_newbt *xnr);
 int xrep_newbt_claim_block(struct xfs_btree_cur *cur, struct xrep_newbt *xnr,
 		union xfs_btree_ptr *ptr);
+int xrep_newbt_relog_autoreap(struct xrep_newbt *xnr);
 
 #endif /* __XFS_SCRUB_NEWBT_H__ */
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 8d990a42119e..b4117ff221aa 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -167,6 +167,16 @@ xrep_roll_ag_trans(
 	return 0;
 }
 
+/* Roll the scrub transaction, holding the primary metadata locked. */
+int
+xrep_roll_trans(
+	struct xfs_scrub	*sc)
+{
+	if (!sc->ip)
+		return xrep_roll_ag_trans(sc);
+	return xfs_trans_roll_inode(&sc->tp, sc->ip);
+}
+
 /* Finish all deferred work attached to the repair transaction. */
 int
 xrep_defer_finish(
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index a0df121e6866..3179746a063e 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -20,6 +20,7 @@ static inline int xrep_notsupported(struct xfs_scrub *sc)
 int xrep_attempt(struct xfs_scrub *sc);
 void xrep_failure(struct xfs_mount *mp);
 int xrep_roll_ag_trans(struct xfs_scrub *sc);
+int xrep_roll_trans(struct xfs_scrub *sc);
 int xrep_defer_finish(struct xfs_scrub *sc);
 bool xrep_ag_has_space(struct xfs_perag *pag, xfs_extlen_t nr_blocks,
 		enum xfs_ag_resv_type type);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/6] xfs: add debug knobs to control btree bulk load slack factors
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/6] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add some debug knobs so that we can control the leaf and node block
slack when rebuilding btrees.
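
A test harness can poke the new knobs through sysfs.  The path below is
an assumption based on where the other xfs_dbg attributes live (normally
/sys/fs/xfs/debug/); the snippet is illustrative only:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int set_knob(const char *path, const char *value)
{
	int	fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	if (write(fd, value, strlen(value)) < 0)
		perror("write");
	close(fd);
	return 0;
}

int main(void)
{
	/* leave only 2 free record slots per new leaf block */
	set_knob("/sys/fs/xfs/debug/bload_leaf_slack", "2");
	/* -1 goes back to the built-in default slack calculation */
	set_knob("/sys/fs/xfs/debug/bload_node_slack", "-1");
	return 0;
}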

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/newbt.c |   10 ++++++---
 fs/xfs/xfs_globals.c |   12 +++++++++++
 fs/xfs/xfs_sysctl.h  |    2 ++
 fs/xfs/xfs_sysfs.c   |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 75 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
index 4eee8923a9bc..56c790fa0f9a 100644
--- a/fs/xfs/scrub/newbt.c
+++ b/fs/xfs/scrub/newbt.c
@@ -49,9 +49,13 @@ xrep_newbt_estimate_slack(
 	uint64_t		free;
 	uint64_t		sz;
 
-	/* Let the btree code compute the default slack values. */
-	bload->leaf_slack = -1;
-	bload->node_slack = -1;
+	/*
+	 * The xfs_globals values are set to -1 (i.e. take the bload defaults)
+	 * unless someone has set them otherwise, so we just pull the values
+	 * here.
+	 */
+	bload->leaf_slack = xfs_globals.bload_leaf_slack;
+	bload->node_slack = xfs_globals.bload_node_slack;
 
 	if (sc->ops->type == ST_PERAG) {
 		free = sc->sa.pag->pagf_freeblks;
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index 4d0a98f920ca..eaee77491bca 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -43,4 +43,16 @@ struct xfs_globals xfs_globals = {
 	.pwork_threads		=	-1,	/* automatic thread detection */
 	.larp			=	false,	/* log attribute replay */
 #endif
+
+	/*
+	 * Leave this many record slots empty when bulk loading btrees.  By
+	 * default we load new btree leaf blocks 75% full.
+	 */
+	.bload_leaf_slack	=	-1,
+
+	/*
+	 * Leave this many key/ptr slots empty when bulk loading btrees.  By
+	 * default we load new btree node blocks 75% full.
+	 */
+	.bload_node_slack	=	-1,
 };
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index f78ad6b10ea5..276696a07040 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -85,6 +85,8 @@ struct xfs_globals {
 	int	pwork_threads;		/* parallel workqueue threads */
 	bool	larp;			/* log attribute replay */
 #endif
+	int	bload_leaf_slack;	/* btree bulk load leaf slack */
+	int	bload_node_slack;	/* btree bulk load node slack */
 	int	log_recovery_delay;	/* log recovery delay (secs) */
 	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f7faf6e70d7f..57dea30fa757 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -253,6 +253,58 @@ larp_show(
 XFS_SYSFS_ATTR_RW(larp);
 #endif /* DEBUG */
 
+STATIC ssize_t
+bload_leaf_slack_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	xfs_globals.bload_leaf_slack = val;
+	return count;
+}
+
+STATIC ssize_t
+bload_leaf_slack_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.bload_leaf_slack);
+}
+XFS_SYSFS_ATTR_RW(bload_leaf_slack);
+
+STATIC ssize_t
+bload_node_slack_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	xfs_globals.bload_node_slack = val;
+	return count;
+}
+
+STATIC ssize_t
+bload_node_slack_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.bload_node_slack);
+}
+XFS_SYSFS_ATTR_RW(bload_node_slack);
+
 static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(bug_on_assert),
 	ATTR_LIST(log_recovery_delay),
@@ -262,6 +314,8 @@ static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(pwork_threads),
 	ATTR_LIST(larp),
 #endif
+	ATTR_LIST(bload_leaf_slack),
+	ATTR_LIST(bload_node_slack),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_dbg);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/6] xfs: move btree bulkload record initialization to ->get_record implementations
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/6] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're performing a bulk load of a btree, move the code that
actually stores the btree record in the new btree block out of the
generic code and into the individual ->get_record implementations.
This is preparation for being able to store multiple records with a
single indirect call.
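
As a rough userspace sketch of the new calling convention (all names
below are made up; this is not the kernel interface): the loader asks
the callback for up to nr_wanted records starting at a given slot, and
the callback returns how many records it stored or a negative errno:

#include <stdio.h>

struct fake_block { int recs[16]; };

typedef int (*get_records_fn)(struct fake_block *block, unsigned int idx,
			      unsigned int nr_wanted, void *priv);

/* Fill a leaf block by asking the callback for batches of records. */
static int load_leaf(struct fake_block *block, unsigned int recs_this_block,
		     get_records_fn get_records, void *priv)
{
	unsigned int	j = 1;
	int		ret;

	while (j <= recs_this_block) {
		ret = get_records(block, j, recs_this_block - j + 1, priv);
		if (ret < 0)
			return ret;
		j += ret;
	}
	return 0;
}

/* Sample callback: copy a whole batch from an in-memory source array. */
static int copy_from_array(struct fake_block *block, unsigned int idx,
			   unsigned int nr_wanted, void *priv)
{
	int		*src = priv;
	unsigned int	i;

	for (i = 0; i < nr_wanted; i++)
		block->recs[idx - 1 + i] = src[idx - 1 + i];
	return nr_wanted;
}

int main(void)
{
	int		src[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	struct fake_block block = { { 0 } };

	if (load_leaf(&block, 8, copy_from_array, src))
		return 1;
	printf("rec 5 = %d\n", block.recs[4]);
	return 0;
}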

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_staging.c |   17 +++++++----------
 fs/xfs/libxfs/xfs_btree_staging.h |   15 ++++++++++-----
 2 files changed, 17 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index 29e3f8ccb185..369965cacc8c 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -434,22 +434,19 @@ STATIC int
 xfs_btree_bload_leaf(
 	struct xfs_btree_cur		*cur,
 	unsigned int			recs_this_block,
-	xfs_btree_bload_get_record_fn	get_record,
+	xfs_btree_bload_get_records_fn	get_records,
 	struct xfs_btree_block		*block,
 	void				*priv)
 {
-	unsigned int			j;
+	unsigned int			j = 1;
 	int				ret;
 
 	/* Fill the leaf block with records. */
-	for (j = 1; j <= recs_this_block; j++) {
-		union xfs_btree_rec	*block_rec;
-
-		ret = get_record(cur, priv);
-		if (ret)
+	while (j <= recs_this_block) {
+		ret = get_records(cur, j, block, recs_this_block - j + 1, priv);
+		if (ret < 0)
 			return ret;
-		block_rec = xfs_btree_rec_addr(cur, j, block);
-		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+		j += ret;
 	}
 
 	return 0;
@@ -787,7 +784,7 @@ xfs_btree_bload(
 		trace_xfs_btree_bload_block(cur, level, i, blocks, &ptr,
 				nr_this_block);
 
-		ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_record,
+		ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_records,
 				block, priv);
 		if (ret)
 			goto out;
diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h
index d6dea3f0088c..82a3a8ef0f12 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.h
+++ b/fs/xfs/libxfs/xfs_btree_staging.h
@@ -50,7 +50,9 @@ void xfs_btree_commit_ifakeroot(struct xfs_btree_cur *cur, struct xfs_trans *tp,
 		int whichfork, const struct xfs_btree_ops *ops);
 
 /* Bulk loading of staged btrees. */
-typedef int (*xfs_btree_bload_get_record_fn)(struct xfs_btree_cur *cur, void *priv);
+typedef int (*xfs_btree_bload_get_records_fn)(struct xfs_btree_cur *cur,
+		unsigned int idx, struct xfs_btree_block *block,
+		unsigned int nr_wanted, void *priv);
 typedef int (*xfs_btree_bload_claim_block_fn)(struct xfs_btree_cur *cur,
 		union xfs_btree_ptr *ptr, void *priv);
 typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur,
@@ -58,11 +60,14 @@ typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur,
 
 struct xfs_btree_bload {
 	/*
-	 * This function will be called nr_records times to load records into
-	 * the btree.  The function does this by setting the cursor's bc_rec
-	 * field in in-core format.  Records must be returned in sort order.
+	 * This function will be called to load @nr_wanted records into the
+	 * btree.  The implementation does this by setting the cursor's bc_rec
+	 * field in in-core format and using init_rec_from_cur to set the
+	 * records in the btree block.  Records must be returned in sort order.
+	 * The function must return the number of records loaded or the usual
+	 * negative errno.
 	 */
-	xfs_btree_bload_get_record_fn	get_record;
+	xfs_btree_bload_get_records_fn	get_records;
 
 	/*
 	 * This function will be called nr_blocks times to obtain a pointer


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/6] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/6] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Constrain the number of dirty buffers that are locked by the btree
staging code at any given time by establishing a threshold at which we
put them all on the delwri queue and push them to disk.  This limits
memory consumption while writing out new btrees.
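
The policy amounts to simple batching.  A rough userspace sketch
(made-up names, not kernel code): count each block handed to the delwri
queue and write out the whole batch once the count reaches the
threshold, with zero meaning "hold everything until the end":

#include <stdio.h>

struct bload_ctl {
	unsigned int	max_dirty;	/* flush threshold; 0 = never early */
	unsigned int	nr_dirty;	/* blocks queued since the last flush */
};

static int flush_batch(struct bload_ctl *ctl)
{
	/* the kernel submits the delwri buffer list here */
	printf("flushing %u dirty blocks\n", ctl->nr_dirty);
	ctl->nr_dirty = 0;
	return 0;
}

static int drop_buf(struct bload_ctl *ctl)
{
	ctl->nr_dirty++;	/* block handed to the delwri queue */

	if (!ctl->max_dirty || ctl->nr_dirty < ctl->max_dirty)
		return 0;
	return flush_batch(ctl);
}

int main(void)
{
	struct bload_ctl	ctl = { .max_dirty = 64 };
	unsigned int		i;

	for (i = 0; i < 1000; i++)
		if (drop_buf(&ctl))
			return 1;
	if (ctl.nr_dirty)
		flush_batch(&ctl);	/* final partial batch */
	return 0;
}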

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c         |    2 +
 fs/xfs/libxfs/xfs_btree.h         |    3 ++
 fs/xfs/libxfs/xfs_btree_staging.c |   50 +++++++++++++++++++++++++++++--------
 fs/xfs/libxfs/xfs_btree_staging.h |   10 +++++++
 fs/xfs/scrub/newbt.c              |    1 +
 5 files changed, 54 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 99b79de7efcd..842a710e6c3b 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1330,7 +1330,7 @@ xfs_btree_get_buf_block(
  * Read in the buffer at the given ptr and return the buffer and
  * the block pointer within the buffer.
  */
-STATIC int
+int
 xfs_btree_read_buf_block(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_ptr	*ptr,
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index a2aa36b23e25..2db03f0ae961 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -700,6 +700,9 @@ void xfs_btree_set_ptr_null(struct xfs_btree_cur *cur,
 int xfs_btree_get_buf_block(struct xfs_btree_cur *cur,
 		const union xfs_btree_ptr *ptr, struct xfs_btree_block **block,
 		struct xfs_buf **bpp);
+int xfs_btree_read_buf_block(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr, int flags,
+		struct xfs_btree_block **block, struct xfs_buf **bpp);
 void xfs_btree_set_sibling(struct xfs_btree_cur *cur,
 		struct xfs_btree_block *block, const union xfs_btree_ptr *ptr,
 		int lr);
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index 369965cacc8c..6fd6ea8e6fbd 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -333,18 +333,35 @@ xfs_btree_commit_ifakeroot(
 /*
  * Put a btree block that we're loading onto the ordered list and release it.
  * The btree blocks will be written to disk when bulk loading is finished.
+ * If we reach the dirty buffer threshold, flush them to disk before
+ * continuing.
  */
-static void
+static int
 xfs_btree_bload_drop_buf(
-	struct list_head	*buffers_list,
-	struct xfs_buf		**bpp)
+	struct xfs_btree_bload		*bbl,
+	struct list_head		*buffers_list,
+	struct xfs_buf			**bpp)
 {
-	if (*bpp == NULL)
-		return;
+	struct xfs_buf			*bp = *bpp;
+	int				error;
 
-	xfs_buf_delwri_queue_here(*bpp, buffers_list);
-	xfs_buf_relse(*bpp);
+	if (!bp)
+		return 0;
+
+	xfs_buf_delwri_queue_here(bp, buffers_list);
+	xfs_buf_relse(bp);
 	*bpp = NULL;
+	bbl->nr_dirty++;
+
+	if (!bbl->max_dirty || bbl->nr_dirty < bbl->max_dirty)
+		return 0;
+
+	error = xfs_buf_delwri_submit(buffers_list);
+	if (error)
+		return error;
+
+	bbl->nr_dirty = 0;
+	return 0;
 }
 
 /*
@@ -416,7 +433,10 @@ xfs_btree_bload_prep_block(
 	 */
 	if (*blockp)
 		xfs_btree_set_sibling(cur, *blockp, &new_ptr, XFS_BB_RIGHTSIB);
-	xfs_btree_bload_drop_buf(buffers_list, bpp);
+
+	ret = xfs_btree_bload_drop_buf(bbl, buffers_list, bpp);
+	if (ret)
+		return ret;
 
 	/* Initialize the new btree block. */
 	xfs_btree_init_block_cur(cur, new_bp, level, nr_this_block);
@@ -480,7 +500,7 @@ xfs_btree_bload_node(
 
 		ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr));
 
-		ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block,
+		ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block,
 				&child_bp);
 		if (ret)
 			return ret;
@@ -759,6 +779,7 @@ xfs_btree_bload(
 	cur->bc_nlevels = bbl->btree_height;
 	xfs_btree_set_ptr_null(cur, &child_ptr);
 	xfs_btree_set_ptr_null(cur, &ptr);
+	bbl->nr_dirty = 0;
 
 	xfs_btree_bload_level_geometry(cur, bbl, level, nr_this_level,
 			&avg_per_block, &blocks, &blocks_with_extra);
@@ -797,7 +818,10 @@ xfs_btree_bload(
 			xfs_btree_copy_ptrs(cur, &child_ptr, &ptr, 1);
 	}
 	total_blocks += blocks;
-	xfs_btree_bload_drop_buf(&buffers_list, &bp);
+
+	ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp);
+	if (ret)
+		goto out;
 
 	/* Populate the internal btree nodes. */
 	for (level = 1; level < cur->bc_nlevels; level++) {
@@ -839,7 +863,11 @@ xfs_btree_bload(
 				xfs_btree_copy_ptrs(cur, &first_ptr, &ptr, 1);
 		}
 		total_blocks += blocks;
-		xfs_btree_bload_drop_buf(&buffers_list, &bp);
+
+		ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp);
+		if (ret)
+			goto out;
+
 		xfs_btree_copy_ptrs(cur, &child_ptr, &first_ptr, 1);
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h
index 82a3a8ef0f12..d2eaf4fdc603 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.h
+++ b/fs/xfs/libxfs/xfs_btree_staging.h
@@ -115,6 +115,16 @@ struct xfs_btree_bload {
 	 * height of the new btree.
 	 */
 	unsigned int			btree_height;
+
+	/*
+	 * Flush the new btree block buffer list to disk after this many blocks
+	 * have been formatted.  Zero prohibits writing any buffers until all
+	 * blocks have been formatted.
+	 */
+	uint16_t			max_dirty;
+
+	/* Number of dirty buffers. */
+	uint16_t			nr_dirty;
 };
 
 int xfs_btree_bload_compute_geometry(struct xfs_btree_cur *cur,
diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
index 56c790fa0f9a..507137e7bc24 100644
--- a/fs/xfs/scrub/newbt.c
+++ b/fs/xfs/scrub/newbt.c
@@ -91,6 +91,7 @@ xrep_newbt_init_ag(
 	xnr->alloc_hint = alloc_hint;
 	xnr->resv = resv;
 	INIT_LIST_HEAD(&xnr->resv_list);
+	xnr->bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */
 	xrep_newbt_estimate_slack(xnr);
 }
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
  2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly Darrick J. Wong
                     ` (6 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files Darrick J. Wong
                   ` (74 subsequent siblings)
  77 siblings, 7 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

Hi all,

In general, online repair of an indexed record set walks the filesystem
looking for records.  These records are sorted and bulk-loaded into a
new btree.  To make this happen without pinning gigabytes of metadata in
memory, first create an abstraction ('xfile') of memfd files so that
kernel code can access paged memory, and then an array abstraction
('xfarray') based on xfiles so that online repair can create an array of
new records without pinning memory.

These two data storage abstractions are critical for repair of space
metadata -- the memory used is pageable, which helps us avoid pinning
kernel memory and driving OOM problems; and they are byte-accessible
enough that we can use them like (very slow and programmatic) memory
buffers.
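
If it helps to see the idea outside of the kernel, here is a rough
userspace sketch of an xfile/xfarray-style store (illustrative names
only, not the kernel API; assumes a glibc with memfd_create): fixed-size
records pread/pwritten into a memfd so that the backing pages remain
pageable:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct rec { uint64_t key; uint64_t val; };

struct bigarray {
	int		fd;	/* memfd backing store */
	uint64_t	nr;	/* records appended so far */
};

static int bigarray_create(struct bigarray *ba)
{
	ba->fd = memfd_create("bigarray", 0);
	ba->nr = 0;
	return ba->fd < 0 ? -1 : 0;
}

static int bigarray_append(struct bigarray *ba, const struct rec *r)
{
	if (pwrite(ba->fd, r, sizeof(*r), ba->nr * sizeof(*r)) != sizeof(*r))
		return -1;
	ba->nr++;
	return 0;
}

static int bigarray_load(struct bigarray *ba, uint64_t idx, struct rec *r)
{
	ssize_t	ret = pread(ba->fd, r, sizeof(*r), idx * sizeof(*r));

	return ret == sizeof(*r) ? 0 : -1;
}

int main(void)
{
	struct bigarray	ba;
	struct rec	r = { .key = 42, .val = 17 }, out;

	if (bigarray_create(&ba) || bigarray_append(&ba, &r) ||
	    bigarray_load(&ba, 0, &out))
		return 1;
	printf("key %llu val %llu\n",
	       (unsigned long long)out.key, (unsigned long long)out.val);
	close(ba.fd);
	return 0;
}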

Later patchsets will build on this functionality to provide blob storage
and btrees.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array
---
 fs/xfs/Kconfig         |    1 
 fs/xfs/Makefile        |    2 
 fs/xfs/scrub/trace.c   |    4 
 fs/xfs/scrub/trace.h   |  262 ++++++++++++
 fs/xfs/scrub/xfarray.c | 1084 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfarray.h |  142 ++++++
 fs/xfs/scrub/xfile.c   |  426 +++++++++++++++++++
 fs/xfs/scrub/xfile.h   |   78 +++
 8 files changed, 1998 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/xfarray.c
 create mode 100644 fs/xfs/scrub/xfarray.h
 create mode 100644 fs/xfs/scrub/xfile.c
 create mode 100644 fs/xfs/scrub/xfile.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 2/7] xfs: enable sorting of xfile-backed arrays
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 7/7] xfs: improve xfarray quicksort pivot Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 6/7] xfs: cache pages used for xfarray quicksort convergence Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

The btree bulk loading code requires that records be provided in the
correct record sort order for the given btree type.  In general, repair
code cannot be required to collect records in order, and it is not
feasible to insert new records in the middle of an array to maintain
sort order.

Implement a sorting algorithm so that we can sort the records just prior
to bulk loading.  In principle, an xfarray could consume many gigabytes
of memory and its backing pages can be sent out to disk at any time.
This means that we cannot map the entire array into memory at once, so
we must find a way to divide the work into smaller portions (e.g. a
page) that /can/ be mapped into memory.

Quicksort seems like a reasonable fit for this purpose, since it uses a
divide and conquer strategy to keep its average runtime logarithmic.
The solution presented here is a port of the glibc implementation, which
itself is derived from the median-of-three and tail call recursion
strategies outlined by Sedgewick.

Subsequent patches will optimize the implementation further by utilizing
the kernel's heapsort on directly-mapped memory whenever possible, and
improving the quicksort pivot selection algorithm to try to avoid O(n^2)
collapses.
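
As a refresher, median-of-three selection looks like this on a plain
int array (the patch does the same dance through xfarray loads and
stores); after the call, the median of the three samples sits in a[lo],
the smallest in a[mid], and the largest in a[hi]:

#include <stdio.h>

static void swap_ints(int *a, int *b)
{
	int	t = *a;

	*a = *b;
	*b = t;
}

static int median_of_three(int *a, int lo, int hi)
{
	int	mid = lo + (hi - lo) / 2;

	/* order the three samples: a[lo] <= a[mid] <= a[hi] */
	if (a[mid] < a[lo])
		swap_ints(&a[mid], &a[lo]);
	if (a[hi] < a[mid]) {
		swap_ints(&a[mid], &a[hi]);
		if (a[mid] < a[lo])
			swap_ints(&a[mid], &a[lo]);
	}

	/* move the chosen pivot to a[lo], where the partition loop wants it */
	swap_ints(&a[lo], &a[mid]);
	return a[lo];
}

int main(void)
{
	int	a[] = { 9, 1, 5, 7, 3 };

	printf("pivot %d\n", median_of_three(a, 0, 4));
	return 0;
}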

Note: The sorting functionality gets its own patch because the basic big
array mechanisms were plenty for a single code patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/trace.h   |  114 ++++++++++
 fs/xfs/scrub/xfarray.c |  569 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfarray.h |   67 ++++++
 3 files changed, 750 insertions(+)


diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 84edfa7556ac..02f5f547c563 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -18,6 +18,7 @@
 
 struct xfile;
 struct xfarray;
+struct xfarray_sortinfo;
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -849,6 +850,119 @@ TRACE_EVENT(xfarray_create,
 		  __entry->obj_size_log)
 );
 
+TRACE_EVENT(xfarray_isort,
+	TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
+	TP_ARGS(si, lo, hi),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned long long, lo)
+		__field(unsigned long long, hi)
+	),
+	TP_fast_assign(
+		__entry->ino = file_inode(si->array->xfile->file)->i_ino;
+		__entry->lo = lo;
+		__entry->hi = hi;
+	),
+	TP_printk("xfino 0x%lx lo %llu hi %llu elts %llu",
+		  __entry->ino,
+		  __entry->lo,
+		  __entry->hi,
+		  __entry->hi - __entry->lo)
+);
+
+TRACE_EVENT(xfarray_qsort,
+	TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
+	TP_ARGS(si, lo, hi),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned long long, lo)
+		__field(unsigned long long, hi)
+		__field(int, stack_depth)
+		__field(int, max_stack_depth)
+	),
+	TP_fast_assign(
+		__entry->ino = file_inode(si->array->xfile->file)->i_ino;
+		__entry->lo = lo;
+		__entry->hi = hi;
+		__entry->stack_depth = si->stack_depth;
+		__entry->max_stack_depth = si->max_stack_depth;
+	),
+	TP_printk("xfino 0x%lx lo %llu hi %llu elts %llu stack %d/%d",
+		  __entry->ino,
+		  __entry->lo,
+		  __entry->hi,
+		  __entry->hi - __entry->lo,
+		  __entry->stack_depth,
+		  __entry->max_stack_depth)
+);
+
+TRACE_EVENT(xfarray_sort,
+	TP_PROTO(struct xfarray_sortinfo *si, size_t bytes),
+	TP_ARGS(si, bytes),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned long long, nr)
+		__field(size_t, obj_size)
+		__field(size_t, bytes)
+		__field(unsigned int, max_stack_depth)
+	),
+	TP_fast_assign(
+		__entry->nr = si->array->nr;
+		__entry->obj_size = si->array->obj_size;
+		__entry->ino = file_inode(si->array->xfile->file)->i_ino;
+		__entry->bytes = bytes;
+		__entry->max_stack_depth = si->max_stack_depth;
+	),
+	TP_printk("xfino 0x%lx nr %llu objsz %zu stack %u bytes %zu",
+		  __entry->ino,
+		  __entry->nr,
+		  __entry->obj_size,
+		  __entry->max_stack_depth,
+		  __entry->bytes)
+);
+
+TRACE_EVENT(xfarray_sort_stats,
+	TP_PROTO(struct xfarray_sortinfo *si, int error),
+	TP_ARGS(si, error),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+#ifdef DEBUG
+		__field(unsigned long long, loads)
+		__field(unsigned long long, stores)
+		__field(unsigned long long, compares)
+#endif
+		__field(unsigned int, max_stack_depth)
+		__field(unsigned int, max_stack_used)
+		__field(int, error)
+	),
+	TP_fast_assign(
+		__entry->ino = file_inode(si->array->xfile->file)->i_ino;
+#ifdef DEBUG
+		__entry->loads = si->loads;
+		__entry->stores = si->stores;
+		__entry->compares = si->compares;
+#endif
+		__entry->max_stack_depth = si->max_stack_depth;
+		__entry->max_stack_used = si->max_stack_used;
+		__entry->error = error;
+	),
+	TP_printk(
+#ifdef DEBUG
+		  "xfino 0x%lx loads %llu stores %llu compares %llu stack_depth %u/%u error %d",
+#else
+		  "xfino 0x%lx stack_depth %u/%u error %d",
+#endif
+		  __entry->ino,
+#ifdef DEBUG
+		  __entry->loads,
+		  __entry->stores,
+		  __entry->compares,
+#endif
+		  __entry->max_stack_used,
+		  __entry->max_stack_depth,
+		  __entry->error)
+);
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 8fdd7dd40193..2cd3a2f42e19 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -368,3 +368,572 @@ xfarray_load_next(
 	*idx = cur;
 	return 0;
 }
+
+/* Sorting functions */
+
+#ifdef DEBUG
+# define xfarray_sort_bump_loads(si)	do { (si)->loads++; } while (0)
+# define xfarray_sort_bump_stores(si)	do { (si)->stores++; } while (0)
+# define xfarray_sort_bump_compares(si)	do { (si)->compares++; } while (0)
+#else
+# define xfarray_sort_bump_loads(si)
+# define xfarray_sort_bump_stores(si)
+# define xfarray_sort_bump_compares(si)
+#endif /* DEBUG */
+
+/* Load an array element for sorting. */
+static inline int
+xfarray_sort_load(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		idx,
+	void			*ptr)
+{
+	xfarray_sort_bump_loads(si);
+	return xfarray_load(si->array, idx, ptr);
+}
+
+/* Store an array element for sorting. */
+static inline int
+xfarray_sort_store(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		idx,
+	void			*ptr)
+{
+	xfarray_sort_bump_stores(si);
+	return xfarray_store(si->array, idx, ptr);
+}
+
+/* Compare an array element for sorting. */
+static inline int
+xfarray_sort_cmp(
+	struct xfarray_sortinfo	*si,
+	const void		*a,
+	const void		*b)
+{
+	xfarray_sort_bump_compares(si);
+	return si->cmp_fn(a, b);
+}
+
+/* Return a pointer to the low index stack for quicksort partitioning. */
+static inline xfarray_idx_t *xfarray_sortinfo_lo(struct xfarray_sortinfo *si)
+{
+	return (xfarray_idx_t *)(si + 1);
+}
+
+/* Return a pointer to the high index stack for quicksort partitioning. */
+static inline xfarray_idx_t *xfarray_sortinfo_hi(struct xfarray_sortinfo *si)
+{
+	return xfarray_sortinfo_lo(si) + si->max_stack_depth;
+}
+
+/* Allocate memory to handle the sort. */
+static inline int
+xfarray_sortinfo_alloc(
+	struct xfarray		*array,
+	xfarray_cmp_fn		cmp_fn,
+	unsigned int		flags,
+	struct xfarray_sortinfo	**infop)
+{
+	struct xfarray_sortinfo	*si;
+	size_t			nr_bytes = sizeof(struct xfarray_sortinfo);
+	int			max_stack_depth;
+
+	/*
+	 * Tail-call recursion during the partitioning phase means that
+	 * quicksort will never recurse more than log2(nr) times.  We need one
+	 * extra level of stack to hold the initial parameters.
+	 */
+	max_stack_depth = ilog2(array->nr) + 1;
+
+	/* Each level of quicksort uses a lo and a hi index */
+	nr_bytes += max_stack_depth * sizeof(xfarray_idx_t) * 2;
+
+	/* One record for the pivot */
+	nr_bytes += array->obj_size;
+
+	si = kvzalloc(nr_bytes, XCHK_GFP_FLAGS);
+	if (!si)
+		return -ENOMEM;
+
+	si->array = array;
+	si->cmp_fn = cmp_fn;
+	si->flags = flags;
+	si->max_stack_depth = max_stack_depth;
+	si->max_stack_used = 1;
+
+	xfarray_sortinfo_lo(si)[0] = 0;
+	xfarray_sortinfo_hi(si)[0] = array->nr - 1;
+
+	trace_xfarray_sort(si, nr_bytes);
+	*infop = si;
+	return 0;
+}
+
+/* Should this sort be terminated by a fatal signal? */
+static inline bool
+xfarray_sort_terminated(
+	struct xfarray_sortinfo	*si,
+	int			*error)
+{
+	/*
+	 * If preemption is disabled, we need to yield to the scheduler every
+	 * few seconds so that we don't run afoul of the soft lockup watchdog
+	 * or RCU stall detector.
+	 */
+	cond_resched();
+
+	if ((si->flags & XFARRAY_SORT_KILLABLE) &&
+	    fatal_signal_pending(current)) {
+		if (*error == 0)
+			*error = -EINTR;
+		return true;
+	}
+	return false;
+}
+
+/* Do we want an insertion sort? */
+static inline bool
+xfarray_want_isort(
+	struct xfarray_sortinfo *si,
+	xfarray_idx_t		start,
+	xfarray_idx_t		end)
+{
+	/*
+	 * For array subsets smaller than 8 elements, it's slightly faster to
+	 * use insertion sort than quicksort's stack machine.
+	 */
+	return (end - start) < 8;
+}
+
+/* Return the scratch space within the sortinfo structure. */
+static inline void *xfarray_sortinfo_isort_scratch(struct xfarray_sortinfo *si)
+{
+	return xfarray_sortinfo_hi(si) + si->max_stack_depth;
+}
+
+/*
+ * Perform an insertion sort on a subset of the array.
+ * Though insertion sort is an O(n^2) algorithm, for small set sizes it's
+ * faster than quicksort's stack machine, so we let it take over for that.
+ * This ought to be replaced with something more efficient.
+ */
+STATIC int
+xfarray_isort(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		lo,
+	xfarray_idx_t		hi)
+{
+	void			*a = xfarray_sortinfo_isort_scratch(si);
+	void			*b = xfarray_scratch(si->array);
+	xfarray_idx_t		tmp;
+	xfarray_idx_t		i;
+	xfarray_idx_t		run;
+	int			error;
+
+	trace_xfarray_isort(si, lo, hi);
+
+	/*
+	 * Move the smallest element in a[lo..hi] to a[lo].  This
+	 * simplifies the loop control logic below.
+	 */
+	tmp = lo;
+	error = xfarray_sort_load(si, tmp, b);
+	if (error)
+		return error;
+	for (run = lo + 1; run <= hi; run++) {
+		/* if a[run] < a[tmp], tmp = run */
+		error = xfarray_sort_load(si, run, a);
+		if (error)
+			return error;
+		if (xfarray_sort_cmp(si, a, b) < 0) {
+			tmp = run;
+			memcpy(b, a, si->array->obj_size);
+		}
+
+		if (xfarray_sort_terminated(si, &error))
+			return error;
+	}
+
+	/*
+	 * The smallest element is a[tmp]; swap with a[lo] if tmp != lo.
+	 * Recall that a[tmp] is already in *b.
+	 */
+	if (tmp != lo) {
+		error = xfarray_sort_load(si, lo, a);
+		if (error)
+			return error;
+		error = xfarray_sort_store(si, tmp, a);
+		if (error)
+			return error;
+		error = xfarray_sort_store(si, lo, b);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Perform an insertion sort on a[lo+1..hi].  We already made sure
+	 * that the smallest value in the original range is now in a[lo],
+	 * so the inner loop should never underflow.
+	 *
+	 * For each a[lo+2..hi], make sure it's in the correct position
+	 * with respect to the elements that came before it.
+	 */
+	for (run = lo + 2; run <= hi; run++) {
+		error = xfarray_sort_load(si, run, a);
+		if (error)
+			return error;
+
+		/*
+		 * Find the correct place for a[run] by walking leftwards
+		 * towards the start of the range until a[tmp] is no longer
+		 * greater than a[run].
+		 */
+		tmp = run - 1;
+		error = xfarray_sort_load(si, tmp, b);
+		if (error)
+			return error;
+		while (xfarray_sort_cmp(si, a, b) < 0) {
+			tmp--;
+			error = xfarray_sort_load(si, tmp, b);
+			if (error)
+				return error;
+
+			if (xfarray_sort_terminated(si, &error))
+				return error;
+		}
+		tmp++;
+
+		/*
+		 * If tmp != run, then a[tmp..run-1] are all less than a[run],
+		 * so right barrel roll a[tmp..run] to get this range in
+		 * sorted order.
+		 */
+		if (tmp == run)
+			continue;
+
+		for (i = run; i >= tmp; i--) {
+			error = xfarray_sort_load(si, i - 1, b);
+			if (error)
+				return error;
+			error = xfarray_sort_store(si, i, b);
+			if (error)
+				return error;
+
+			if (xfarray_sort_terminated(si, &error))
+				return error;
+		}
+		error = xfarray_sort_store(si, tmp, a);
+		if (error)
+			return error;
+
+		if (xfarray_sort_terminated(si, &error))
+			return error;
+	}
+
+	return 0;
+}
+
+/* Return a pointer to the xfarray pivot record within the sortinfo struct. */
+static inline void *xfarray_sortinfo_pivot(struct xfarray_sortinfo *si)
+{
+	return xfarray_sortinfo_hi(si) + si->max_stack_depth;
+}
+
+/*
+ * Find a pivot value for quicksort partitioning, swap it with a[lo], and save
+ * the cached pivot record for the next step.
+ *
+ * Select the median value from a[lo], a[mid], and a[hi].  Put the median in
+ * a[lo], the lowest in a[mid], and the highest in a[hi].  Using the median of
+ * the three reduces the chances that we pick the worst case pivot value, since
+ * it's likely that our array values are nearly sorted.
+ */
+STATIC int
+xfarray_qsort_pivot(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		lo,
+	xfarray_idx_t		hi)
+{
+	void			*a = xfarray_sortinfo_pivot(si);
+	void			*b = xfarray_scratch(si->array);
+	xfarray_idx_t		mid = lo + ((hi - lo) / 2);
+	int			error;
+
+	/* if a[mid] < a[lo], swap a[mid] and a[lo]. */
+	error = xfarray_sort_load(si, mid, a);
+	if (error)
+		return error;
+	error = xfarray_sort_load(si, lo, b);
+	if (error)
+		return error;
+	if (xfarray_sort_cmp(si, a, b) < 0) {
+		error = xfarray_sort_store(si, lo, a);
+		if (error)
+			return error;
+		error = xfarray_sort_store(si, mid, b);
+		if (error)
+			return error;
+	}
+
+	/* if a[hi] < a[mid], swap a[mid] and a[hi]. */
+	error = xfarray_sort_load(si, hi, a);
+	if (error)
+		return error;
+	error = xfarray_sort_load(si, mid, b);
+	if (error)
+		return error;
+	if (xfarray_sort_cmp(si, a, b) < 0) {
+		error = xfarray_sort_store(si, mid, a);
+		if (error)
+			return error;
+		error = xfarray_sort_store(si, hi, b);
+		if (error)
+			return error;
+	} else {
+		goto move_front;
+	}
+
+	/* if a[mid] < a[lo], swap a[mid] and a[lo]. */
+	error = xfarray_sort_load(si, mid, a);
+	if (error)
+		return error;
+	error = xfarray_sort_load(si, lo, b);
+	if (error)
+		return error;
+	if (xfarray_sort_cmp(si, a, b) < 0) {
+		error = xfarray_sort_store(si, lo, a);
+		if (error)
+			return error;
+		error = xfarray_sort_store(si, mid, b);
+		if (error)
+			return error;
+	}
+
+move_front:
+	/*
+	 * Move our selected pivot to a[lo].  Recall that a == si->pivot, so
+	 * this leaves us with the pivot cached in the sortinfo structure.
+	 */
+	error = xfarray_sort_load(si, lo, b);
+	if (error)
+		return error;
+	error = xfarray_sort_load(si, mid, a);
+	if (error)
+		return error;
+	error = xfarray_sort_store(si, mid, b);
+	if (error)
+		return error;
+	return xfarray_sort_store(si, lo, a);
+}
+
+/*
+ * Set up the pointers for the next iteration.  We push onto the stack all of
+ * the unsorted values between a[lo + 1] and a[end[i]], and we tweak the
+ * current stack frame to point to the unsorted values between a[beg[i]] and
+ * a[lo] so that those values will be sorted when we pop the stack.
+ */
+static inline int
+xfarray_qsort_push(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		*si_lo,
+	xfarray_idx_t		*si_hi,
+	xfarray_idx_t		lo,
+	xfarray_idx_t		hi)
+{
+	/* Check for stack overflows */
+	if (si->stack_depth >= si->max_stack_depth - 1) {
+		ASSERT(si->stack_depth < si->max_stack_depth - 1);
+		return -EFSCORRUPTED;
+	}
+
+	si->max_stack_used = max_t(uint8_t, si->max_stack_used,
+					    si->stack_depth + 2);
+
+	si_lo[si->stack_depth + 1] = lo + 1;
+	si_hi[si->stack_depth + 1] = si_hi[si->stack_depth];
+	si_hi[si->stack_depth++] = lo - 1;
+
+	/*
+	 * Always start with the smaller of the two partitions to keep the
+	 * amount of recursion in check.
+	 */
+	if (si_hi[si->stack_depth]     - si_lo[si->stack_depth] >
+	    si_hi[si->stack_depth - 1] - si_lo[si->stack_depth - 1]) {
+		swap(si_lo[si->stack_depth], si_lo[si->stack_depth - 1]);
+		swap(si_hi[si->stack_depth], si_hi[si->stack_depth - 1]);
+	}
+
+	return 0;
+}
+
+/*
+ * Sort the array elements via quicksort.  This implementation incorporates
+ * four optimizations discussed in Sedgewick:
+ *
+ * 1. Use an explicit stack of array indices to store the next array partition
+ *    to sort.  This helps us to avoid recursion in the call stack, which is
+ *    particularly expensive in the kernel.
+ *
+ * 2. For arrays with records in arbitrary or user-controlled order, choose the
+ *    pivot element using a median-of-three decision tree.  This reduces the
+ *    probability of selecting a bad pivot value which causes worst case
+ *    behavior (i.e. partition sizes of 1).
+ *
+ * 3. The smaller of the two sub-partitions is pushed onto the stack to start
+ *    the next level of recursion, and the larger sub-partition replaces the
+ *    current stack frame.  This guarantees that we won't need more than
+ *    log2(nr) stack space.
+ *
+ * 4. Use insertion sort for small sets since insertion sort is faster
+ *    for small, mostly sorted array segments.  In the author's experience,
+ *    substituting insertion sort for arrays smaller than 8 elements yields
+ *    a ~10% reduction in runtime.
+ */
+
+/*
+ * Due to the use of signed indices, we can only support up to 2^63 records.
+ * Files can only grow to 2^63 bytes, so this is not much of a limitation.
+ */
+#define QSORT_MAX_RECS		(1ULL << 63)
+
+int
+xfarray_sort(
+	struct xfarray		*array,
+	xfarray_cmp_fn		cmp_fn,
+	unsigned int		flags)
+{
+	struct xfarray_sortinfo	*si;
+	xfarray_idx_t		*si_lo, *si_hi;
+	void			*pivot;
+	void			*scratch = xfarray_scratch(array);
+	xfarray_idx_t		lo, hi;
+	int			error = 0;
+
+	if (array->nr < 2)
+		return 0;
+	if (array->nr >= QSORT_MAX_RECS)
+		return -E2BIG;
+
+	error = xfarray_sortinfo_alloc(array, cmp_fn, flags, &si);
+	if (error)
+		return error;
+	si_lo = xfarray_sortinfo_lo(si);
+	si_hi = xfarray_sortinfo_hi(si);
+	pivot = xfarray_sortinfo_pivot(si);
+
+	while (si->stack_depth >= 0) {
+		lo = si_lo[si->stack_depth];
+		hi = si_hi[si->stack_depth];
+
+		trace_xfarray_qsort(si, lo, hi);
+
+		/* Nothing left in this partition to sort; pop stack. */
+		if (lo >= hi) {
+			si->stack_depth--;
+			continue;
+		}
+
+		/* If insertion sort can solve our problems, we're done. */
+		if (xfarray_want_isort(si, lo, hi)) {
+			error = xfarray_isort(si, lo, hi);
+			if (error)
+				goto out_free;
+			si->stack_depth--;
+			continue;
+		}
+
+		/* Pick a pivot, move it to a[lo] and stash it. */
+		error = xfarray_qsort_pivot(si, lo, hi);
+		if (error)
+			goto out_free;
+
+		/*
+		 * Rearrange a[lo..hi] such that everything smaller than the
+		 * pivot is on the left side of the range and everything larger
+		 * than the pivot is on the right side of the range.
+		 */
+		while (lo < hi) {
+			/*
+			 * Decrement hi until it finds an a[hi] less than the
+			 * pivot value.
+			 */
+			error = xfarray_sort_load(si, hi, scratch);
+			if (error)
+				goto out_free;
+			while (xfarray_sort_cmp(si, scratch, pivot) >= 0 &&
+								lo < hi) {
+				if (xfarray_sort_terminated(si, &error))
+					goto out_free;
+
+				hi--;
+				error = xfarray_sort_load(si, hi, scratch);
+				if (error)
+					goto out_free;
+			}
+
+			if (xfarray_sort_terminated(si, &error))
+				goto out_free;
+
+			/* Copy that item (a[hi]) to a[lo]. */
+			if (lo < hi) {
+				error = xfarray_sort_store(si, lo++, scratch);
+				if (error)
+					goto out_free;
+			}
+
+			/*
+			 * Increment lo until it finds an a[lo] greater than
+			 * the pivot value.
+			 */
+			error = xfarray_sort_load(si, lo, scratch);
+			if (error)
+				goto out_free;
+			while (xfarray_sort_cmp(si, scratch, pivot) <= 0 &&
+								lo < hi) {
+				if (xfarray_sort_terminated(si, &error))
+					goto out_free;
+
+				lo++;
+				error = xfarray_sort_load(si, lo, scratch);
+				if (error)
+					goto out_free;
+			}
+
+			if (xfarray_sort_terminated(si, &error))
+				goto out_free;
+
+			/* Copy that item (a[lo]) to a[hi]. */
+			if (lo < hi) {
+				error = xfarray_sort_store(si, hi--, scratch);
+				if (error)
+					goto out_free;
+			}
+
+			if (xfarray_sort_terminated(si, &error))
+				goto out_free;
+		}
+
+		/*
+		 * Put our pivot value in the correct place at a[lo].  All
+		 * values between a[beg[i]] and a[lo - 1] should be less than
+		 * the pivot; and all values between a[lo + 1] and a[end[i]-1]
+		 * should be greater than the pivot.
+		 */
+		error = xfarray_sort_store(si, lo, pivot);
+		if (error)
+			goto out_free;
+
+		/* Set up the stack frame to process the two partitions. */
+		error = xfarray_qsort_push(si, si_lo, si_hi, lo, hi);
+		if (error)
+			goto out_free;
+
+		if (xfarray_sort_terminated(si, &error))
+			goto out_free;
+	}
+
+out_free:
+	trace_xfarray_sort_stats(si, error);
+	kvfree(si);
+	return error;
+}
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index 26e2b594f121..b0cf818c6a7f 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -55,4 +55,71 @@ static inline int xfarray_append(struct xfarray *array, const void *ptr)
 uint64_t xfarray_length(struct xfarray *array);
 int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec);
 
+/* Declarations for xfile array sort functionality. */
+
+typedef cmp_func_t xfarray_cmp_fn;
+
+struct xfarray_sortinfo {
+	struct xfarray		*array;
+
+	/* Comparison function for the sort. */
+	xfarray_cmp_fn		cmp_fn;
+
+	/* Maximum height of the partition stack. */
+	uint8_t			max_stack_depth;
+
+	/* Current height of the partition stack. */
+	int8_t			stack_depth;
+
+	/* Maximum stack depth ever used. */
+	uint8_t			max_stack_used;
+
+	/* XFARRAY_SORT_* flags; see below. */
+	unsigned int		flags;
+
+#ifdef DEBUG
+	/* Performance statistics. */
+	uint64_t		loads;
+	uint64_t		stores;
+	uint64_t		compares;
+#endif
+
+	/*
+	 * Extra bytes are allocated beyond the end of the structure to store
+	 * quicksort information.  C does not permit multiple VLAs per struct,
+	 * so we document all of this in a comment.
+	 *
+	 * Pretend that we have a typedef for array records:
+	 *
+	 * typedef char[array->obj_size]	xfarray_rec_t;
+	 *
+	 * First comes the quicksort partition stack:
+	 *
+	 * xfarray_idx_t	lo[max_stack_depth];
+	 * xfarray_idx_t	hi[max_stack_depth];
+	 *
+	 * union {
+	 *
+	 * If for a given subset we decide to use an insertion sort, we use the
+	 * scratchpad record after the xfarray and a second scratchpad record
+	 * here to compare items:
+	 *
+	 * 	xfarray_rec_t	scratch;
+	 *
+	 * Otherwise, we want to partition the records to partition the array.
+	 * We store the chosen pivot record here and use the xfarray scratchpad
+	 * to rearrange the array around the pivot:
+	 *
+	 * 	xfarray_rec_t	pivot;
+	 *
+	 * }
+	 */
+};
+
+/* Sort can be interrupted by a fatal signal. */
+#define XFARRAY_SORT_KILLABLE	(1U << 0)
+
+int xfarray_sort(struct xfarray *array, xfarray_cmp_fn cmp_fn,
+		unsigned int flags);
+
 #endif /* __XFS_SCRUB_XFARRAY_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 1/7] xfs: create a big array data structure
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 6/7] xfs: cache pages used for xfarray quicksort convergence Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/7] xfs: teach xfile to pass back direct-map pages to caller Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index.  For
repairs, the most important operations are append, iterate,
and sort.

Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently led to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.

Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages.  Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig         |    1 
 fs/xfs/Makefile        |    2 
 fs/xfs/scrub/trace.c   |    4 -
 fs/xfs/scrub/trace.h   |  123 ++++++++++++++++
 fs/xfs/scrub/xfarray.c |  370 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfarray.h |   58 ++++++++
 fs/xfs/scrub/xfile.c   |  318 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfile.h   |   58 ++++++++
 8 files changed, 933 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/xfarray.c
 create mode 100644 fs/xfs/scrub/xfarray.h
 create mode 100644 fs/xfs/scrub/xfile.c
 create mode 100644 fs/xfs/scrub/xfile.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 05bc865142b8..6077ac04c0c3 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -101,6 +101,7 @@ config XFS_ONLINE_SCRUB
 	bool "XFS online metadata check support"
 	default n
 	depends on XFS_FS
+	depends on TMPFS && SHMEM
 	select XFS_DRAIN_INTENTS
 	help
 	  If you say Y here you will be able to check metadata on a
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 90f1f01277be..90cbba7dc550 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -162,6 +162,8 @@ xfs-y				+= $(addprefix scrub/, \
 				   rmap.o \
 				   scrub.o \
 				   symlink.o \
+				   xfarray.o \
+				   xfile.o \
 				   )
 
 xfs-$(CONFIG_XFS_RT)		+= scrub/rtbitmap.o
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index b5f94676c37c..4a0385c97ea6 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -12,8 +12,10 @@
 #include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
-#include "scrub/scrub.h"
 #include "xfs_ag.h"
+#include "scrub/scrub.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
 
 /* Figure out which block the btree cursor was pointing to. */
 static inline xfs_fsblock_t
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index cb33f42190df..84edfa7556ac 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -16,6 +16,9 @@
 #include <linux/tracepoint.h>
 #include "xfs_bit.h"
 
+struct xfile;
+struct xfarray;
+
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
  * TRACE_DEFINE_ENUM macro so that the enum value can be encoded in the ftrace
@@ -726,6 +729,126 @@ TRACE_EVENT(xchk_refcount_incorrect,
 		  __entry->seen)
 )
 
+TRACE_EVENT(xfile_create,
+	TP_PROTO(struct xfs_mount *mp, struct xfile *xf),
+	TP_ARGS(mp, xf),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, ino)
+		__array(char, pathname, 256)
+	),
+	TP_fast_assign(
+		char		pathname[257];
+		char		*path;
+
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino = file_inode(xf->file)->i_ino;
+		memset(pathname, 0, sizeof(pathname));
+		path = file_path(xf->file, pathname, sizeof(pathname) - 1);
+		if (IS_ERR(path))
+			path = "(unknown)";
+		strncpy(__entry->pathname, path, sizeof(__entry->pathname));
+	),
+	TP_printk("dev %d:%d xfino 0x%lx path '%s'",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->pathname)
+);
+
+TRACE_EVENT(xfile_destroy,
+	TP_PROTO(struct xfile *xf),
+	TP_ARGS(xf),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned long long, bytes)
+		__field(loff_t, size)
+	),
+	TP_fast_assign(
+		struct xfile_stat	statbuf;
+		int			ret;
+
+		ret = xfile_stat(xf, &statbuf);
+		if (!ret) {
+			__entry->bytes = statbuf.bytes;
+			__entry->size = statbuf.size;
+		} else {
+			__entry->bytes = -1;
+			__entry->size = -1;
+		}
+		__entry->ino = file_inode(xf->file)->i_ino;
+	),
+	TP_printk("xfino 0x%lx mem_bytes 0x%llx isize 0x%llx",
+		  __entry->ino,
+		  __entry->bytes,
+		  __entry->size)
+);
+
+DECLARE_EVENT_CLASS(xfile_class,
+	TP_PROTO(struct xfile *xf, loff_t pos, unsigned long long bytecount),
+	TP_ARGS(xf, pos, bytecount),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned long long, bytes_used)
+		__field(loff_t, pos)
+		__field(loff_t, size)
+		__field(unsigned long long, bytecount)
+	),
+	TP_fast_assign(
+		struct xfile_stat	statbuf;
+		int			ret;
+
+		ret = xfile_stat(xf, &statbuf);
+		if (!ret) {
+			__entry->bytes_used = statbuf.bytes;
+			__entry->size = statbuf.size;
+		} else {
+			__entry->bytes_used = -1;
+			__entry->size = -1;
+		}
+		__entry->ino = file_inode(xf->file)->i_ino;
+		__entry->pos = pos;
+		__entry->bytecount = bytecount;
+	),
+	TP_printk("xfino 0x%lx mem_bytes 0x%llx pos 0x%llx bytecount 0x%llx isize 0x%llx",
+		  __entry->ino,
+		  __entry->bytes_used,
+		  __entry->pos,
+		  __entry->bytecount,
+		  __entry->size)
+);
+#define DEFINE_XFILE_EVENT(name) \
+DEFINE_EVENT(xfile_class, name, \
+	TP_PROTO(struct xfile *xf, loff_t pos, unsigned long long bytecount), \
+	TP_ARGS(xf, pos, bytecount))
+DEFINE_XFILE_EVENT(xfile_pread);
+DEFINE_XFILE_EVENT(xfile_pwrite);
+DEFINE_XFILE_EVENT(xfile_seek_data);
+
+TRACE_EVENT(xfarray_create,
+	TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity),
+	TP_ARGS(xfa, required_capacity),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(uint64_t, max_nr)
+		__field(size_t, obj_size)
+		__field(int, obj_size_log)
+		__field(unsigned long long, required_capacity)
+	),
+	TP_fast_assign(
+		__entry->max_nr = xfa->max_nr;
+		__entry->obj_size = xfa->obj_size;
+		__entry->obj_size_log = xfa->obj_size_log;
+		__entry->ino = file_inode(xfa->xfile->file)->i_ino;
+		__entry->required_capacity = required_capacity;
+	),
+	TP_printk("xfino 0x%lx max_nr %llu reqd_nr %llu objsz %zu objszlog %d",
+		  __entry->ino,
+		  __entry->max_nr,
+		  __entry->required_capacity,
+		  __entry->obj_size,
+		  __entry->obj_size_log)
+);
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
new file mode 100644
index 000000000000..8fdd7dd40193
--- /dev/null
+++ b/fs/xfs/scrub/xfarray.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/scrub.h"
+#include "scrub/trace.h"
+
+/*
+ * Large Arrays of Fixed-Size Records
+ * ==================================
+ *
+ * This memory array uses an xfile (which itself is a memfd "file") to store
+ * large numbers of fixed-size records in memory that can be paged out.  This
+ * puts less stress on the memory reclaim algorithms during an online repair
+ * because we don't have to pin so much memory.  However, array access is less
+ * direct than would be in a regular memory array.  Access to the array is
+ * performed via indexed load and store methods, and an append method is
+ * provided for convenience.  Array elements can be unset, which sets them to
+ * all zeroes.  Unset entries are skipped during iteration, though direct loads
+ * will return a zeroed buffer.  Callers are responsible for concurrency
+ * control.
+ */
+
+/*
+ * Pointer to scratch space.  Because we can't access the xfile data directly,
+ * we allocate a small amount of memory on the end of the xfarray structure to
+ * buffer array items when we need space to store values temporarily.
+ */
+static inline void *xfarray_scratch(struct xfarray *array)
+{
+	return (array + 1);
+}
+
+/* Compute array index given an xfile offset. */
+static xfarray_idx_t
+xfarray_idx(
+	struct xfarray	*array,
+	loff_t		pos)
+{
+	if (array->obj_size_log >= 0)
+		return (xfarray_idx_t)pos >> array->obj_size_log;
+
+	return div_u64((xfarray_idx_t)pos, array->obj_size);
+}
+
+/* Compute xfile offset of array element. */
+static inline loff_t xfarray_pos(struct xfarray *array, xfarray_idx_t idx)
+{
+	if (array->obj_size_log >= 0)
+		return idx << array->obj_size_log;
+
+	return idx * array->obj_size;
+}
+
+/*
+ * Initialize a big memory array.  Array records cannot be larger than a
+ * page, and the array cannot span more bytes than the page cache supports.
+ * If @required_capacity is nonzero, the maximum array size will be set to this
+ * quantity and the array creation will fail if the underlying storage cannot
+ * support that many records.
+ */
+int
+xfarray_create(
+	struct xfs_mount	*mp,
+	const char		*description,
+	unsigned long long	required_capacity,
+	size_t			obj_size,
+	struct xfarray		**arrayp)
+{
+	struct xfarray		*array;
+	struct xfile		*xfile;
+	int			error;
+
+	ASSERT(obj_size < PAGE_SIZE);
+
+	error = xfile_create(mp, description, 0, &xfile);
+	if (error)
+		return error;
+
+	error = -ENOMEM;
+	array = kzalloc(sizeof(struct xfarray) + obj_size, XCHK_GFP_FLAGS);
+	if (!array)
+		goto out_xfile;
+
+	array->xfile = xfile;
+	array->obj_size = obj_size;
+
+	if (is_power_of_2(obj_size))
+		array->obj_size_log = ilog2(obj_size);
+	else
+		array->obj_size_log = -1;
+
+	array->max_nr = xfarray_idx(array, MAX_LFS_FILESIZE);
+	trace_xfarray_create(array, required_capacity);
+
+	if (required_capacity > 0) {
+		if (array->max_nr < required_capacity) {
+			error = -ENOMEM;
+			goto out_xfarray;
+		}
+		array->max_nr = required_capacity;
+	}
+
+	*arrayp = array;
+	return 0;
+
+out_xfarray:
+	kfree(array);
+out_xfile:
+	xfile_destroy(xfile);
+	return error;
+}
+
+/* Destroy the array. */
+void
+xfarray_destroy(
+	struct xfarray	*array)
+{
+	xfile_destroy(array->xfile);
+	kfree(array);
+}
+
+/* Load an element from the array. */
+int
+xfarray_load(
+	struct xfarray	*array,
+	xfarray_idx_t	idx,
+	void		*ptr)
+{
+	if (idx >= array->nr)
+		return -ENODATA;
+
+	return xfile_obj_load(array->xfile, ptr, array->obj_size,
+			xfarray_pos(array, idx));
+}
+
+/* Is this array element potentially unset? */
+static inline bool
+xfarray_is_unset(
+	struct xfarray	*array,
+	loff_t		pos)
+{
+	void		*temp = xfarray_scratch(array);
+	int		error;
+
+	if (array->unset_slots == 0)
+		return false;
+
+	error = xfile_obj_load(array->xfile, temp, array->obj_size, pos);
+	if (!error && xfarray_element_is_null(array, temp))
+		return true;
+
+	return false;
+}
+
+/*
+ * Unset an array element.  If @idx is the last element in the array, the
+ * array will be truncated.  Otherwise, the entry will be zeroed.
+ */
+int
+xfarray_unset(
+	struct xfarray	*array,
+	xfarray_idx_t	idx)
+{
+	void		*temp = xfarray_scratch(array);
+	loff_t		pos = xfarray_pos(array, idx);
+	int		error;
+
+	if (idx >= array->nr)
+		return -ENODATA;
+
+	if (idx == array->nr - 1) {
+		array->nr--;
+		return 0;
+	}
+
+	if (xfarray_is_unset(array, pos))
+		return 0;
+
+	memset(temp, 0, array->obj_size);
+	error = xfile_obj_store(array->xfile, temp, array->obj_size, pos);
+	if (error)
+		return error;
+
+	array->unset_slots++;
+	return 0;
+}
+
+/*
+ * Store an element in the array.  The element must not be completely zeroed,
+ * because those are considered unset sparse elements.
+ */
+int
+xfarray_store(
+	struct xfarray	*array,
+	xfarray_idx_t	idx,
+	const void	*ptr)
+{
+	int		ret;
+
+	if (idx >= array->max_nr)
+		return -EFBIG;
+
+	ASSERT(!xfarray_element_is_null(array, ptr));
+
+	ret = xfile_obj_store(array->xfile, ptr, array->obj_size,
+			xfarray_pos(array, idx));
+	if (ret)
+		return ret;
+
+	array->nr = max(array->nr, idx + 1);
+	return 0;
+}
+
+/* Is this array element NULL? */
+bool
+xfarray_element_is_null(
+	struct xfarray	*array,
+	const void	*ptr)
+{
+	return !memchr_inv(ptr, 0, array->obj_size);
+}
+
+/*
+ * Store an element anywhere in the array that is unset.  If there are no
+ * unset slots, append the element to the array.
+ */
+int
+xfarray_store_anywhere(
+	struct xfarray	*array,
+	const void	*ptr)
+{
+	void		*temp = xfarray_scratch(array);
+	loff_t		endpos = xfarray_pos(array, array->nr);
+	loff_t		pos;
+	int		error;
+
+	/* Find an unset slot to put it in. */
+	for (pos = 0;
+	     pos < endpos && array->unset_slots > 0;
+	     pos += array->obj_size) {
+		error = xfile_obj_load(array->xfile, temp, array->obj_size,
+				pos);
+		if (error || !xfarray_element_is_null(array, temp))
+			continue;
+
+		error = xfile_obj_store(array->xfile, ptr, array->obj_size,
+				pos);
+		if (error)
+			return error;
+
+		array->unset_slots--;
+		return 0;
+	}
+
+	/* No unset slots found; attach it on the end. */
+	array->unset_slots = 0;
+	return xfarray_append(array, ptr);
+}
+
+/* Return length of array. */
+uint64_t
+xfarray_length(
+	struct xfarray	*array)
+{
+	return array->nr;
+}
+
+/*
+ * Decide which array item we're going to read as part of an _iter_get.
+ * @cur is the array index, and @pos is the file offset of that array index in
+ * the backing xfile.  Returns ENODATA if we reach the end of the records.
+ *
+ * Reading from a hole in a sparse xfile causes page instantiation, so for
+ * iterating a (possibly sparse) array we need to figure out if the cursor is
+ * pointing at a totally uninitialized hole and move the cursor up if
+ * necessary.
+ */
+static inline int
+xfarray_find_data(
+	struct xfarray	*array,
+	xfarray_idx_t	*cur,
+	loff_t		*pos)
+{
+	unsigned int	pgoff = offset_in_page(*pos);
+	loff_t		end_pos = *pos + array->obj_size - 1;
+	loff_t		new_pos;
+
+	/*
+	 * If the current array record is not adjacent to a page boundary, we
+	 * are in the middle of the page.  We do not need to move the cursor.
+	 */
+	if (pgoff != 0 && pgoff + array->obj_size - 1 < PAGE_SIZE)
+		return 0;
+
+	/*
+	 * Call SEEK_DATA on the last byte in the record we're about to read.
+	 * If the record ends at (or crosses) the end of a page then we know
+	 * that the first byte of the record is backed by pages and don't need
+	 * to query it.  If instead the record begins at the start of the page
+	 * then we know that querying the last byte is just as good as querying
+	 * the first byte, since records cannot be larger than a page.
+	 *
+	 * If the call returns the same file offset, we know this record is
+	 * backed by real pages.  We do not need to move the cursor.
+	 */
+	new_pos = xfile_seek_data(array->xfile, end_pos);
+	if (new_pos == -ENXIO)
+		return -ENODATA;
+	if (new_pos < 0)
+		return new_pos;
+	if (new_pos == end_pos)
+		return 0;
+
+	/*
+	 * Otherwise, SEEK_DATA told us how far up to move the file pointer to
+	 * find more data.  Move the array index to the first record past the
+	 * byte offset we were given.
+	 */
+	new_pos = roundup_64(new_pos, array->obj_size);
+	*cur = xfarray_idx(array, new_pos);
+	*pos = xfarray_pos(array, *cur);
+	return 0;
+}
+
+/*
+ * Starting at *idx, fetch the next non-null array entry and advance the index
+ * to set up the next _load_next call.  Returns ENODATA if we reach the end of
+ * the array.  Callers must set @*idx to XFARRAY_CURSOR_INIT before the first
+ * call to this function.
+ */
+int
+xfarray_load_next(
+	struct xfarray	*array,
+	xfarray_idx_t	*idx,
+	void		*rec)
+{
+	xfarray_idx_t	cur = *idx;
+	loff_t		pos = xfarray_pos(array, cur);
+	int		error;
+
+	do {
+		if (cur >= array->nr)
+			return -ENODATA;
+
+		/*
+		 * Ask the backing store for the location of next possible
+		 * written record, then retrieve that record.
+		 */
+		error = xfarray_find_data(array, &cur, &pos);
+		if (error)
+			return error;
+		error = xfarray_load(array, cur, rec);
+		if (error)
+			return error;
+
+		cur++;
+		pos += array->obj_size;
+	} while (xfarray_element_is_null(array, rec));
+
+	*idx = cur;
+	return 0;
+}
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
new file mode 100644
index 000000000000..26e2b594f121
--- /dev/null
+++ b/fs/xfs/scrub/xfarray.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_XFARRAY_H__
+#define __XFS_SCRUB_XFARRAY_H__
+
+/* xfile array index type, along with cursor initialization */
+typedef uint64_t		xfarray_idx_t;
+#define XFARRAY_CURSOR_INIT	((__force xfarray_idx_t)0)
+
+/* Iterate each index of an xfile array. */
+#define foreach_xfarray_idx(array, idx) \
+	for ((idx) = XFARRAY_CURSOR_INIT; \
+	     (idx) < xfarray_length(array); \
+	     (idx)++)
+
+struct xfarray {
+	/* Underlying file that backs the array. */
+	struct xfile	*xfile;
+
+	/* Number of array elements. */
+	xfarray_idx_t	nr;
+
+	/* Maximum possible array size. */
+	xfarray_idx_t	max_nr;
+
+	/* Number of unset slots in the array below @nr. */
+	uint64_t	unset_slots;
+
+	/* Size of an array element. */
+	size_t		obj_size;
+
+	/* log2 of array element size, if possible. */
+	int		obj_size_log;
+};
+
+int xfarray_create(struct xfs_mount *mp, const char *descr,
+		unsigned long long required_capacity, size_t obj_size,
+		struct xfarray **arrayp);
+void xfarray_destroy(struct xfarray *array);
+int xfarray_load(struct xfarray *array, xfarray_idx_t idx, void *ptr);
+int xfarray_unset(struct xfarray *array, xfarray_idx_t idx);
+int xfarray_store(struct xfarray *array, xfarray_idx_t idx, const void *ptr);
+int xfarray_store_anywhere(struct xfarray *array, const void *ptr);
+bool xfarray_element_is_null(struct xfarray *array, const void *ptr);
+
+/* Append an element to the array. */
+static inline int xfarray_append(struct xfarray *array, const void *ptr)
+{
+	return xfarray_store(array, array->nr, ptr);
+}
+
+uint64_t xfarray_length(struct xfarray *array);
+int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec);
+
+#endif /* __XFS_SCRUB_XFARRAY_H__ */
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
new file mode 100644
index 000000000000..43455aa78243
--- /dev/null
+++ b/fs/xfs/scrub/xfile.c
@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_format.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/scrub.h"
+#include "scrub/trace.h"
+#include <linux/shmem_fs.h>
+
+/*
+ * Swappable Temporary Memory
+ * ==========================
+ *
+ * Online checking sometimes needs to be able to stage a large amount of data
+ * in memory.  This information might not fit in the available memory and it
+ * doesn't all need to be accessible at all times.  In other words, we want an
+ * indexed data buffer to store data that can be paged out.
+ *
+ * When CONFIG_TMPFS=y, shmemfs is enough of a filesystem to meet those
+ * requirements.  Therefore, the xfile mechanism uses an unlinked shmem file to
+ * store our staging data.  This file is not installed in the file descriptor
+ * table so that user programs cannot access the data, which means that the
+ * xfile must be freed with xfile_destroy.
+ *
+ * xfiles assume that the caller will handle all required concurrency
+ * management; standard vfs locks (freezer and inode) are not taken.  Reads
+ * and writes are satisfied directly from the page cache.
+ *
+ * NOTE: The current shmemfs implementation has a quirk that in-kernel reads
+ * of a hole cause a page to be mapped into the file.  If you are going to
+ * create a sparse xfile, please be careful about reading from uninitialized
+ * parts of the file.  These pages are !Uptodate and will eventually be
+ * reclaimed if not written, but in the short term this boosts memory
+ * consumption.
+ */
+
+/*
+ * xfiles must not be exposed to userspace and require upper layers to
+ * coordinate access to the one handle returned by the constructor, so
+ * establish a separate lock class for xfiles to avoid confusing lockdep.
+ */
+static struct lock_class_key xfile_i_mutex_key;
+
+/*
+ * Create an xfile of the given size.  The description will be used in the
+ * trace output.
+ */
+int
+xfile_create(
+	struct xfs_mount	*mp,
+	const char		*description,
+	loff_t			isize,
+	struct xfile		**xfilep)
+{
+	char			*fname;
+	struct xfile		*xf;
+	int			error = -ENOMEM;
+
+	xf = kmalloc(sizeof(struct xfile), XCHK_GFP_FLAGS);
+	if (!xf)
+		return -ENOMEM;
+
+	fname = kmalloc(MAXNAMELEN, XCHK_GFP_FLAGS);
+	if (!fname)
+		goto out_xfile;
+
+	snprintf(fname, MAXNAMELEN - 1, "XFS (%s): %s", mp->m_super->s_id,
+			description);
+	fname[MAXNAMELEN - 1] = 0;
+
+	xf->file = shmem_file_setup(fname, isize, 0);
+	if (!xf->file)
+		goto out_fname;
+	if (IS_ERR(xf->file)) {
+		error = PTR_ERR(xf->file);
+		goto out_fname;
+	}
+
+	/*
+	 * We want a large sparse file that we can pread, pwrite, and seek.
+	 * xfile users are responsible for keeping the xfile hidden away from
+	 * all other callers, so we skip timestamp updates and security checks.
+	 */
+	xf->file->f_mode |= FMODE_PREAD | FMODE_PWRITE | FMODE_NOCMTIME |
+			    FMODE_LSEEK;
+	xf->file->f_flags |= O_RDWR | O_LARGEFILE | O_NOATIME;
+	xf->file->f_inode->i_flags |= S_PRIVATE | S_NOCMTIME | S_NOATIME;
+
+	lockdep_set_class(&file_inode(xf->file)->i_rwsem, &xfile_i_mutex_key);
+
+	trace_xfile_create(mp, xf);
+
+	kfree(fname);
+	*xfilep = xf;
+	return 0;
+out_fname:
+	kfree(fname);
+out_xfile:
+	kfree(xf);
+	return error;
+}
+
+/* Close the file and release all resources. */
+void
+xfile_destroy(
+	struct xfile		*xf)
+{
+	struct inode		*inode = file_inode(xf->file);
+
+	trace_xfile_destroy(xf);
+
+	lockdep_set_class(&inode->i_rwsem, &inode->i_sb->s_type->i_mutex_key);
+	fput(xf->file);
+	kfree(xf);
+}
+
+/*
+ * Read a memory object directly from the xfile's page cache.  Unlike regular
+ * pread, we return -E2BIG and -EFBIG for reads that are too large or at too
+ * high an offset, instead of truncating the read.  Otherwise, we return
+ * bytes read or an error code, like regular pread.
+ */
+ssize_t
+xfile_pread(
+	struct xfile		*xf,
+	void			*buf,
+	size_t			count,
+	loff_t			pos)
+{
+	struct inode		*inode = file_inode(xf->file);
+	struct address_space	*mapping = inode->i_mapping;
+	struct page		*page = NULL;
+	ssize_t			read = 0;
+	unsigned int		pflags;
+	int			error = 0;
+
+	if (count > MAX_RW_COUNT)
+		return -E2BIG;
+	if (inode->i_sb->s_maxbytes - pos < count)
+		return -EFBIG;
+
+	trace_xfile_pread(xf, pos, count);
+
+	pflags = memalloc_nofs_save();
+	while (count > 0) {
+		void		*p, *kaddr;
+		unsigned int	len;
+
+		len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
+
+		/*
+		 * In-kernel reads of a shmem file cause it to allocate a page
+		 * if the mapping shows a hole.  Therefore, if we hit ENOMEM
+		 * we can continue by zeroing the caller's buffer.
+		 */
+		page = shmem_read_mapping_page_gfp(mapping, pos >> PAGE_SHIFT,
+				__GFP_NOWARN);
+		if (IS_ERR(page)) {
+			error = PTR_ERR(page);
+			if (error != -ENOMEM)
+				break;
+
+			memset(buf, 0, len);
+			goto advance;
+		}
+
+		if (PageUptodate(page)) {
+			/*
+			 * xfile pages must never be mapped into userspace, so
+			 * we skip the dcache flush.
+			 */
+			kaddr = kmap_local_page(page);
+			p = kaddr + offset_in_page(pos);
+			memcpy(buf, p, len);
+			kunmap_local(kaddr);
+		} else {
+			memset(buf, 0, len);
+		}
+		put_page(page);
+
+advance:
+		count -= len;
+		pos += len;
+		buf += len;
+		read += len;
+	}
+	memalloc_nofs_restore(pflags);
+
+	if (read > 0)
+		return read;
+	return error;
+}
+
+/*
+ * Write a memory object directly to the xfile's page cache.  Unlike regular
+ * pwrite, we return -E2BIG and -EFBIG for writes that are too large or at too
+ * high an offset, instead of truncating the write.  Otherwise, we return
+ * bytes written or an error code, like regular pwrite.
+ */
+ssize_t
+xfile_pwrite(
+	struct xfile		*xf,
+	const void		*buf,
+	size_t			count,
+	loff_t			pos)
+{
+	struct inode		*inode = file_inode(xf->file);
+	struct address_space	*mapping = inode->i_mapping;
+	const struct address_space_operations *aops = mapping->a_ops;
+	struct page		*page = NULL;
+	ssize_t			written = 0;
+	unsigned int		pflags;
+	int			error = 0;
+
+	if (count > MAX_RW_COUNT)
+		return -E2BIG;
+	if (inode->i_sb->s_maxbytes - pos < count)
+		return -EFBIG;
+
+	trace_xfile_pwrite(xf, pos, count);
+
+	pflags = memalloc_nofs_save();
+	while (count > 0) {
+		void		*fsdata = NULL;
+		void		*p, *kaddr;
+		unsigned int	len;
+		int		ret;
+
+		len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
+
+		/*
+		 * We call write_begin directly here to avoid all the freezer
+		 * protection lock-taking that happens in the normal path.
+		 * shmem doesn't support fs freeze, but lockdep doesn't know
+		 * that and will trip over that.
+		 */
+		error = aops->write_begin(NULL, mapping, pos, len, &page,
+				&fsdata);
+		if (error)
+			break;
+
+		/*
+		 * xfile pages must never be mapped into userspace, so we skip
+		 * the dcache flush.  If the page is not uptodate, zero it
+		 * before writing data.
+		 */
+		kaddr = kmap_local_page(page);
+		if (!PageUptodate(page)) {
+			memset(kaddr, 0, PAGE_SIZE);
+			SetPageUptodate(page);
+		}
+		p = kaddr + offset_in_page(pos);
+		memcpy(p, buf, len);
+		kunmap_local(kaddr);
+
+		ret = aops->write_end(NULL, mapping, pos, len, len, page,
+				fsdata);
+		if (ret < 0) {
+			error = ret;
+			break;
+		}
+
+		written += ret;
+		if (ret != len)
+			break;
+
+		count -= ret;
+		pos += ret;
+		buf += ret;
+	}
+	memalloc_nofs_restore(pflags);
+
+	if (written > 0)
+		return written;
+	return error;
+}
+
+/* Find the next written area in the xfile data for a given offset. */
+loff_t
+xfile_seek_data(
+	struct xfile		*xf,
+	loff_t			pos)
+{
+	loff_t			ret;
+
+	ret = vfs_llseek(xf->file, pos, SEEK_DATA);
+	trace_xfile_seek_data(xf, pos, ret);
+	return ret;
+}
+
+/* Query stat information for an xfile. */
+int
+xfile_stat(
+	struct xfile		*xf,
+	struct xfile_stat	*statbuf)
+{
+	struct kstat		ks;
+	int			error;
+
+	error = vfs_getattr_nosec(&xf->file->f_path, &ks,
+			STATX_SIZE | STATX_BLOCKS, AT_STATX_DONT_SYNC);
+	if (error)
+		return error;
+
+	statbuf->size = ks.size;
+	statbuf->bytes = ks.blocks << SECTOR_SHIFT;
+	return 0;
+}
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
new file mode 100644
index 000000000000..b37dba1961d8
--- /dev/null
+++ b/fs/xfs/scrub/xfile.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_XFILE_H__
+#define __XFS_SCRUB_XFILE_H__
+
+struct xfile {
+	struct file		*file;
+};
+
+int xfile_create(struct xfs_mount *mp, const char *description, loff_t isize,
+		struct xfile **xfilep);
+void xfile_destroy(struct xfile *xf);
+
+ssize_t xfile_pread(struct xfile *xf, void *buf, size_t count, loff_t pos);
+ssize_t xfile_pwrite(struct xfile *xf, const void *buf, size_t count,
+		loff_t pos);
+
+/*
+ * Load an object.  Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+static inline int
+xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t pos)
+{
+	ssize_t	ret = xfile_pread(xf, buf, count, pos);
+
+	if (ret < 0 || ret != count)
+		return -ENOMEM;
+	return 0;
+}
+
+/*
+ * Store an object.  Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+static inline int
+xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t pos)
+{
+	ssize_t	ret = xfile_pwrite(xf, buf, count, pos);
+
+	if (ret < 0 || ret != count)
+		return -ENOMEM;
+	return 0;
+}
+
+loff_t xfile_seek_data(struct xfile *xf, loff_t pos);
+
+struct xfile_stat {
+	loff_t			size;
+	unsigned long long	bytes;
+};
+
+int xfile_stat(struct xfile *xf, struct xfile_stat *statbuf);
+
+#endif /* __XFS_SCRUB_XFILE_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/7] xfs: convert xfarray insertion sort to heapsort using scratchpad memory
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 7/7] xfs: improve xfarray quicksort pivot Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

In the previous patch, we created a very basic quicksort implementation
for xfile arrays.  While the use of an alternate sorting algorithm to
avoid quicksort recursion on very small subsets reduces the runtime
modestly, we could do better than a load- and store-heavy insertion sort,
particularly since each load and store requires a page mapping lookup in
the xfile.

For a small increase in kernel memory requirements, we could instead
bulk load the xfarray records into memory, use the kernel's existing
heapsort implementation to sort the records, and bulk store the memory
buffer back into the xfile.  On the author's computer, this reduces the
runtime by about 5% on a 500,000-element array.
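
The pattern is easy to sketch in userspace; the snippet below stands in
qsort(3) for the kernel's heapsort and invents its own record type and
helpers, so treat it as an illustration of the idea rather than the
patch itself.

/* Sketch only: sort records [lo, hi] of a file-backed array in memory. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return x < y ? -1 : (x > y ? 1 : 0);
}

static int sort_subset(int fd, size_t objsz, uint64_t lo, uint64_t hi)
{
	size_t	len = (hi - lo + 1) * objsz;
	off_t	pos = lo * objsz;
	void	*scratch = malloc(len);
	int	ret = -1;

	if (!scratch)
		return -1;
	/* Bulk load the subset, sort it in memory, bulk store it back. */
	if (pread(fd, scratch, len, pos) != (ssize_t)len)
		goto out;
	qsort(scratch, hi - lo + 1, objsz, cmp_u64);
	if (pwrite(fd, scratch, len, pos) != (ssize_t)len)
		goto out;
	ret = 0;
out:
	free(scratch);
	return ret;
}

int main(void)
{
	int		fd = memfd_create("sortdemo", MFD_CLOEXEC);
	uint64_t	recs[16], i;

	if (fd < 0)
		return 1;
	for (i = 0; i < 16; i++)
		recs[i] = 16 - i;
	pwrite(fd, recs, sizeof(recs), 0);
	sort_subset(fd, sizeof(uint64_t), 0, 15);
	pread(fd, recs, sizeof(recs), 0);
	printf("first %llu last %llu\n",
			(unsigned long long)recs[0],
			(unsigned long long)recs[15]);
	close(fd);
	return 0;
}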

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/trace.h   |    5 +-
 fs/xfs/scrub/xfarray.c |  142 +++++++++---------------------------------------
 fs/xfs/scrub/xfarray.h |   12 +++-
 3 files changed, 39 insertions(+), 120 deletions(-)


diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 02f5f547c563..9de9d4f795e8 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -930,6 +930,7 @@ TRACE_EVENT(xfarray_sort_stats,
 		__field(unsigned long long, loads)
 		__field(unsigned long long, stores)
 		__field(unsigned long long, compares)
+		__field(unsigned long long, heapsorts)
 #endif
 		__field(unsigned int, max_stack_depth)
 		__field(unsigned int, max_stack_used)
@@ -941,6 +942,7 @@ TRACE_EVENT(xfarray_sort_stats,
 		__entry->loads = si->loads;
 		__entry->stores = si->stores;
 		__entry->compares = si->compares;
+		__entry->heapsorts = si->heapsorts;
 #endif
 		__entry->max_stack_depth = si->max_stack_depth;
 		__entry->max_stack_used = si->max_stack_used;
@@ -948,7 +950,7 @@ TRACE_EVENT(xfarray_sort_stats,
 	),
 	TP_printk(
 #ifdef DEBUG
-		  "xfino 0x%lx loads %llu stores %llu compares %llu stack_depth %u/%u error %d",
+		  "xfino 0x%lx loads %llu stores %llu compares %llu heapsorts %llu stack_depth %u/%u error %d",
 #else
 		  "xfino 0x%lx stack_depth %u/%u error %d",
 #endif
@@ -957,6 +959,7 @@ TRACE_EVENT(xfarray_sort_stats,
 		  __entry->loads,
 		  __entry->stores,
 		  __entry->compares,
+		  __entry->heapsorts,
 #endif
 		  __entry->max_stack_used,
 		  __entry->max_stack_depth,
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 2cd3a2f42e19..171c40d04e6c 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -375,10 +375,12 @@ xfarray_load_next(
 # define xfarray_sort_bump_loads(si)	do { (si)->loads++; } while (0)
 # define xfarray_sort_bump_stores(si)	do { (si)->stores++; } while (0)
 # define xfarray_sort_bump_compares(si)	do { (si)->compares++; } while (0)
+# define xfarray_sort_bump_heapsorts(si) do { (si)->heapsorts++; } while (0)
 #else
 # define xfarray_sort_bump_loads(si)
 # define xfarray_sort_bump_stores(si)
 # define xfarray_sort_bump_compares(si)
+# define xfarray_sort_bump_heapsorts(si)
 #endif /* DEBUG */
 
 /* Load an array element for sorting. */
@@ -441,15 +443,19 @@ xfarray_sortinfo_alloc(
 	/*
 	 * Tail-call recursion during the partitioning phase means that
 	 * quicksort will never recurse more than log2(nr) times.  We need one
-	 * extra level of stack to hold the initial parameters.
+	 * extra level of stack to hold the initial parameters.  In-memory
+	 * sort will always take care of the last few levels of recursion for
+	 * us, so we can reduce the stack depth by that much.
 	 */
-	max_stack_depth = ilog2(array->nr) + 1;
+	max_stack_depth = ilog2(array->nr) + 1 - (XFARRAY_ISORT_SHIFT - 1);
+	if (max_stack_depth < 1)
+		max_stack_depth = 1;
 
 	/* Each level of quicksort uses a lo and a hi index */
 	nr_bytes += max_stack_depth * sizeof(xfarray_idx_t) * 2;
 
-	/* One record for the pivot */
-	nr_bytes += array->obj_size;
+	/* Scratchpad for in-memory sort, or one record for the pivot */
+	nr_bytes += (XFARRAY_ISORT_NR * array->obj_size);
 
 	si = kvzalloc(nr_bytes, XCHK_GFP_FLAGS);
 	if (!si)
@@ -491,7 +497,7 @@ xfarray_sort_terminated(
 	return false;
 }
 
-/* Do we want an insertion sort? */
+/* Do we want an in-memory sort? */
 static inline bool
 xfarray_want_isort(
 	struct xfarray_sortinfo *si,
@@ -499,10 +505,10 @@ xfarray_want_isort(
 	xfarray_idx_t		end)
 {
 	/*
-	 * For array subsets smaller than 8 elements, it's slightly faster to
-	 * use insertion sort than quicksort's stack machine.
+	 * For array subsets that fit in the scratchpad, it's much faster to
+	 * use the kernel's heapsort than quicksort's stack machine.
 	 */
-	return (end - start) < 8;
+	return (end - start) < XFARRAY_ISORT_NR;
 }
 
 /* Return the scratch space within the sortinfo structure. */
@@ -512,10 +518,8 @@ static inline void *xfarray_sortinfo_isort_scratch(struct xfarray_sortinfo *si)
 }
 
 /*
- * Perform an insertion sort on a subset of the array.
- * Though insertion sort is an O(n^2) algorithm, for small set sizes it's
- * faster than quicksort's stack machine, so we let it take over for that.
- * This ought to be replaced with something more efficient.
+ * Sort a small number of array records using scratchpad memory.  The records
+ * need not be contiguous in the xfile's memory pages.
  */
 STATIC int
 xfarray_isort(
@@ -523,114 +527,23 @@ xfarray_isort(
 	xfarray_idx_t		lo,
 	xfarray_idx_t		hi)
 {
-	void			*a = xfarray_sortinfo_isort_scratch(si);
-	void			*b = xfarray_scratch(si->array);
-	xfarray_idx_t		tmp;
-	xfarray_idx_t		i;
-	xfarray_idx_t		run;
+	void			*scratch = xfarray_sortinfo_isort_scratch(si);
+	loff_t			lo_pos = xfarray_pos(si->array, lo);
+	loff_t			len = xfarray_pos(si->array, hi - lo + 1);
 	int			error;
 
 	trace_xfarray_isort(si, lo, hi);
 
-	/*
-	 * Move the smallest element in a[lo..hi] to a[lo].  This
-	 * simplifies the loop control logic below.
-	 */
-	tmp = lo;
-	error = xfarray_sort_load(si, tmp, b);
+	xfarray_sort_bump_loads(si);
+	error = xfile_obj_load(si->array->xfile, scratch, len, lo_pos);
 	if (error)
 		return error;
-	for (run = lo + 1; run <= hi; run++) {
-		/* if a[run] < a[tmp], tmp = run */
-		error = xfarray_sort_load(si, run, a);
-		if (error)
-			return error;
-		if (xfarray_sort_cmp(si, a, b) < 0) {
-			tmp = run;
-			memcpy(b, a, si->array->obj_size);
-		}
 
-		if (xfarray_sort_terminated(si, &error))
-			return error;
-	}
+	xfarray_sort_bump_heapsorts(si);
+	sort(scratch, hi - lo + 1, si->array->obj_size, si->cmp_fn, NULL);
 
-	/*
-	 * The smallest element is a[tmp]; swap with a[lo] if tmp != lo.
-	 * Recall that a[tmp] is already in *b.
-	 */
-	if (tmp != lo) {
-		error = xfarray_sort_load(si, lo, a);
-		if (error)
-			return error;
-		error = xfarray_sort_store(si, tmp, a);
-		if (error)
-			return error;
-		error = xfarray_sort_store(si, lo, b);
-		if (error)
-			return error;
-	}
-
-	/*
-	 * Perform an insertion sort on a[lo+1..hi].  We already made sure
-	 * that the smallest value in the original range is now in a[lo],
-	 * so the inner loop should never underflow.
-	 *
-	 * For each a[lo+2..hi], make sure it's in the correct position
-	 * with respect to the elements that came before it.
-	 */
-	for (run = lo + 2; run <= hi; run++) {
-		error = xfarray_sort_load(si, run, a);
-		if (error)
-			return error;
-
-		/*
-		 * Find the correct place for a[run] by walking leftwards
-		 * towards the start of the range until a[tmp] is no longer
-		 * greater than a[run].
-		 */
-		tmp = run - 1;
-		error = xfarray_sort_load(si, tmp, b);
-		if (error)
-			return error;
-		while (xfarray_sort_cmp(si, a, b) < 0) {
-			tmp--;
-			error = xfarray_sort_load(si, tmp, b);
-			if (error)
-				return error;
-
-			if (xfarray_sort_terminated(si, &error))
-				return error;
-		}
-		tmp++;
-
-		/*
-		 * If tmp != run, then a[tmp..run-1] are all less than a[run],
-		 * so right barrel roll a[tmp..run] to get this range in
-		 * sorted order.
-		 */
-		if (tmp == run)
-			continue;
-
-		for (i = run; i >= tmp; i--) {
-			error = xfarray_sort_load(si, i - 1, b);
-			if (error)
-				return error;
-			error = xfarray_sort_store(si, i, b);
-			if (error)
-				return error;
-
-			if (xfarray_sort_terminated(si, &error))
-				return error;
-		}
-		error = xfarray_sort_store(si, tmp, a);
-		if (error)
-			return error;
-
-		if (xfarray_sort_terminated(si, &error))
-			return error;
-	}
-
-	return 0;
+	xfarray_sort_bump_stores(si);
+	return xfile_obj_store(si->array->xfile, scratch, len, lo_pos);
 }
 
 /* Return a pointer to the xfarray pivot record within the sortinfo struct. */
@@ -784,9 +697,8 @@ xfarray_qsort_push(
  *    current stack frame.  This guarantees that we won't need more than
  *    log2(nr) stack space.
  *
- * 4. Use insertion sort for small sets since since insertion sort is faster
- *    for small, mostly sorted array segments.  In the author's experience,
- *    substituting insertion sort for arrays smaller than 8 elements yields
+ * 4. For small sets, load the records into the scratchpad and run heapsort on
+ *    them because that is very fast.  In the author's experience, this yields
  *    a ~10% reduction in runtime.
  */
 
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index b0cf818c6a7f..f49c1afe24a1 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -59,6 +59,10 @@ int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec);
 
 typedef cmp_func_t xfarray_cmp_fn;
 
+/* Perform an in-memory heapsort for small subsets. */
+#define XFARRAY_ISORT_SHIFT		(4)
+#define XFARRAY_ISORT_NR		(1U << XFARRAY_ISORT_SHIFT)
+
 struct xfarray_sortinfo {
 	struct xfarray		*array;
 
@@ -82,6 +86,7 @@ struct xfarray_sortinfo {
 	uint64_t		loads;
 	uint64_t		stores;
 	uint64_t		compares;
+	uint64_t		heapsorts;
 #endif
 
 	/*
@@ -100,11 +105,10 @@ struct xfarray_sortinfo {
 	 *
 	 * union {
 	 *
-	 * If for a given subset we decide to use an insertion sort, we use the
-	 * scratchpad record after the xfarray and a second scratchpad record
-	 * here to compare items:
+	 * If for a given subset we decide to use an in-memory sort, we use a
+	 * block of scratchpad records here to compare items:
 	 *
-	 * 	xfarray_rec_t	scratch;
+	 * 	xfarray_rec_t	scratch[ISORT_NR];
 	 *
 	 * Otherwise, we want to partition the records to partition the array.
 	 * We store the chosen pivot record here and use the xfarray scratchpad


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/7] xfs: teach xfile to pass back direct-map pages to caller
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 1/7] xfs: create a big array data structure Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Certain xfile array operations (such as sorting) can be sped up quite a
bit by allowing xfile users to grab a page to bulk-read the records
contained within it.  Create helper methods to facilitate this.
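
A rough userspace analogue of handing a direct-mapped page back to the
caller is to mmap() the backing page once and bulk-read the records from
the mapping instead of issuing one pread() per record; the demo below is
illustrative only and does not use the kernel helpers added here.

/* Illustration only: map the backing page once, then bulk-read records. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int		fd = memfd_create("pagedemo", MFD_CLOEXEC);
	long		pagesz = sysconf(_SC_PAGESIZE);
	uint64_t	recs[8], *recp, i, sum = 0;
	void		*map;

	if (fd < 0)
		return 1;
	for (i = 0; i < 8; i++)
		recs[i] = i + 1;
	if (pwrite(fd, recs, sizeof(recs), 0) != (ssize_t)sizeof(recs))
		return 1;

	/* "Get" the page containing the records. */
	map = mmap(NULL, pagesz, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	/* Bulk-read the records straight out of the mapping. */
	recp = map;
	for (i = 0; i < 8; i++)
		sum += recp[i];
	printf("sum = %llu\n", (unsigned long long)sum);

	/* "Put" the page back. */
	munmap(map, pagesz);
	close(fd);
	return 0;
}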

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/trace.h |    2 +
 fs/xfs/scrub/xfile.c |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfile.h |   10 +++++
 3 files changed, 120 insertions(+)


diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 9de9d4f795e8..79b844c969df 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -824,6 +824,8 @@ DEFINE_EVENT(xfile_class, name, \
 DEFINE_XFILE_EVENT(xfile_pread);
 DEFINE_XFILE_EVENT(xfile_pwrite);
 DEFINE_XFILE_EVENT(xfile_seek_data);
+DEFINE_XFILE_EVENT(xfile_get_page);
+DEFINE_XFILE_EVENT(xfile_put_page);
 
 TRACE_EVENT(xfarray_create,
 	TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity),
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index 43455aa78243..7090a8e12b60 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -316,3 +316,111 @@ xfile_stat(
 	statbuf->bytes = ks.blocks << SECTOR_SHIFT;
 	return 0;
 }
+
+/*
+ * Grab the (locked) page for a memory object.  The object cannot span a page
+ * boundary.  Returns 0 (and a locked page) if successful, -ENOTBLK if we
+ * cannot grab the page, or the usual negative errno.
+ */
+int
+xfile_get_page(
+	struct xfile		*xf,
+	loff_t			pos,
+	unsigned int		len,
+	struct xfile_page	*xfpage)
+{
+	struct inode		*inode = file_inode(xf->file);
+	struct address_space	*mapping = inode->i_mapping;
+	const struct address_space_operations *aops = mapping->a_ops;
+	struct page		*page = NULL;
+	void			*fsdata = NULL;
+	loff_t			key = round_down(pos, PAGE_SIZE);
+	unsigned int		pflags;
+	int			error;
+
+	if (inode->i_sb->s_maxbytes - pos < len)
+		return -ENOMEM;
+	if (len > PAGE_SIZE - offset_in_page(pos))
+		return -ENOTBLK;
+
+	trace_xfile_get_page(xf, pos, len);
+
+	pflags = memalloc_nofs_save();
+
+	/*
+	 * We call write_begin directly here to avoid all the freezer
+	 * protection lock-taking that happens in the normal path.  shmem
+	 * doesn't support fs freeze, but lockdep doesn't know that and will
+	 * trip over that.
+	 */
+	error = aops->write_begin(NULL, mapping, key, PAGE_SIZE, &page,
+			&fsdata);
+	if (error)
+		goto out_pflags;
+
+	/* We got the page, so make sure we push out EOF. */
+	if (i_size_read(inode) < pos + len)
+		i_size_write(inode, pos + len);
+
+	/*
+	 * If the page isn't up to date, fill it with zeroes before we hand it
+	 * to the caller and make sure the backing store will hold on to them.
+	 */
+	if (!PageUptodate(page)) {
+		void	*kaddr;
+
+		kaddr = kmap_local_page(page);
+		memset(kaddr, 0, PAGE_SIZE);
+		kunmap_local(kaddr);
+		SetPageUptodate(page);
+	}
+
+	/*
+	 * Mark each page dirty so that the contents are written to some
+	 * backing store when we drop this buffer, and take an extra reference
+	 * to prevent the xfile page from being swapped or removed from the
+	 * page cache by reclaim if the caller unlocks the page.
+	 */
+	set_page_dirty(page);
+	get_page(page);
+
+	xfpage->page = page;
+	xfpage->fsdata = fsdata;
+	xfpage->pos = key;
+out_pflags:
+	memalloc_nofs_restore(pflags);
+	return error;
+}
+
+/*
+ * Release the (locked) page for a memory object.  Returns 0 or a negative
+ * errno.
+ */
+int
+xfile_put_page(
+	struct xfile		*xf,
+	struct xfile_page	*xfpage)
+{
+	struct inode		*inode = file_inode(xf->file);
+	struct address_space	*mapping = inode->i_mapping;
+	const struct address_space_operations *aops = mapping->a_ops;
+	unsigned int		pflags;
+	int			ret;
+
+	trace_xfile_put_page(xf, xfpage->pos, PAGE_SIZE);
+
+	/* Give back the reference that we took in xfile_get_page. */
+	put_page(xfpage->page);
+
+	pflags = memalloc_nofs_save();
+	ret = aops->write_end(NULL, mapping, xfpage->pos, PAGE_SIZE, PAGE_SIZE,
+			xfpage->page, xfpage->fsdata);
+	memalloc_nofs_restore(pflags);
+	memset(xfpage, 0, sizeof(struct xfile_page));
+
+	if (ret < 0)
+		return ret;
+	if (ret != PAGE_SIZE)
+		return -EIO;
+	return 0;
+}
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index b37dba1961d8..e34ab9c4aad9 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -6,6 +6,12 @@
 #ifndef __XFS_SCRUB_XFILE_H__
 #define __XFS_SCRUB_XFILE_H__
 
+struct xfile_page {
+	struct page		*page;
+	void			*fsdata;
+	loff_t			pos;
+};
+
 struct xfile {
 	struct file		*file;
 };
@@ -55,4 +61,8 @@ struct xfile_stat {
 
 int xfile_stat(struct xfile *xf, struct xfile_stat *statbuf);
 
+int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
+		struct xfile_page *xbuf);
+int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
+
 #endif /* __XFS_SCRUB_XFILE_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/7] xfs: cache pages used for xfarray quicksort convergence
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 2/7] xfs: enable sorting of xfile-backed arrays Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/7] xfs: create a big array data structure Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/7] xfs: teach xfile to pass back direct-map pages to caller Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

After quicksort picks a pivot item for a particular subsort, it walks
the records in that subset from the outside in, rearranging them so that
every record less than the pivot comes before it, and every record
greater than the pivot comes after it.  This scan has a lot of locality,
so we can speed it up quite a bit by grabbing the xfile backing page and
holding onto it as long as we possibly can.  Doing so reduces the
runtime by another 5% on the author's computer.
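
The caching idea can be sketched in userspace as a one-entry page cache
that only remaps when a load lands on a different page; the structure
and helper below are invented for the example and are not the sortinfo
fields added by this patch.

/* Sketch only: reuse the last mapped page across record loads. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct page_cache {
	int	fd;
	void	*map;		/* currently mapped page, or NULL */
	off_t	page_off;	/* file offset of the mapped page */
	long	pagesz;
};

static int cached_load(struct page_cache *pc, void *rec, size_t objsz,
		off_t pos)
{
	off_t	want = pos - (pos % pc->pagesz);

	/* Drop the cached page if it isn't the one backing this record. */
	if (pc->map && pc->page_off != want) {
		munmap(pc->map, pc->pagesz);
		pc->map = NULL;
	}
	if (!pc->map) {
		pc->map = mmap(NULL, pc->pagesz, PROT_READ, MAP_SHARED,
				pc->fd, want);
		if (pc->map == MAP_FAILED) {
			pc->map = NULL;
			return -1;
		}
		pc->page_off = want;
	}
	memcpy(rec, (char *)pc->map + (pos - want), objsz);
	return 0;
}

int main(void)
{
	struct page_cache	pc = { .map = NULL };
	uint64_t		recs[32], out, i;

	pc.fd = memfd_create("cachedemo", MFD_CLOEXEC);
	pc.pagesz = sysconf(_SC_PAGESIZE);
	if (pc.fd < 0)
		return 1;
	for (i = 0; i < 32; i++)
		recs[i] = i;
	pwrite(pc.fd, recs, sizeof(recs), 0);

	/* Repeated loads from the same page reuse one mapping. */
	for (i = 31; i > 0; i--)
		cached_load(&pc, &out, sizeof(out), i * sizeof(uint64_t));
	printf("last loaded record: %llu\n", (unsigned long long)out);

	if (pc.map)
		munmap(pc.map, pc.pagesz);
	close(pc.fd);
	return 0;
}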

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/xfarray.c |   86 ++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/scrub/xfile.h   |   10 ++++++
 2 files changed, 86 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 08479be07fda..3e232ee5e7e6 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -760,6 +760,66 @@ xfarray_qsort_push(
 	return 0;
 }
 
+/*
+ * Load an element from the array into the first scratchpad and cache the page,
+ * if possible.
+ */
+static inline int
+xfarray_sort_load_cached(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		idx,
+	void			*ptr)
+{
+	loff_t			idx_pos = xfarray_pos(si->array, idx);
+	pgoff_t			startpage;
+	pgoff_t			endpage;
+	int			error = 0;
+
+	/*
+	 * If this load would split a page, release the cached page, if any,
+	 * and perform a traditional read.
+	 */
+	startpage = idx_pos >> PAGE_SHIFT;
+	endpage = (idx_pos + si->array->obj_size - 1) >> PAGE_SHIFT;
+	if (startpage != endpage) {
+		error = xfarray_sort_put_page(si);
+		if (error)
+			return error;
+
+		if (xfarray_sort_terminated(si, &error))
+			return error;
+
+		return xfile_obj_load(si->array->xfile, ptr,
+				si->array->obj_size, idx_pos);
+	}
+
+	/* If the cached page is not the one we want, release it. */
+	if (xfile_page_cached(&si->xfpage) &&
+	    xfile_page_index(&si->xfpage) != startpage) {
+		error = xfarray_sort_put_page(si);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If we don't have a cached page (and we know the load is contained
+	 * in a single page) then grab it.
+	 */
+	if (!xfile_page_cached(&si->xfpage)) {
+		if (xfarray_sort_terminated(si, &error))
+			return error;
+
+		error = xfarray_sort_get_page(si, startpage << PAGE_SHIFT,
+				PAGE_SIZE);
+		if (error)
+			return error;
+	}
+
+	memcpy(ptr, si->page_kaddr + offset_in_page(idx_pos),
+			si->array->obj_size);
+	return 0;
+}
+
 /*
  * Sort the array elements via quicksort.  This implementation incorporates
  * four optimizations discussed in Sedgewick:
@@ -785,6 +845,10 @@ xfarray_qsort_push(
  *    If a small set is contained entirely within a single xfile memory page,
  *    map the page directly and run heap sort directly on the xfile page
  *    instead of using the load/store interface.  This halves the runtime.
+ *
+ * 5. This optimization is specific to the implementation.  When converging lo
+ *    and hi after selecting a pivot, we will try to retain the xfile memory
+ *    page between load calls, which reduces run time by 50%.
  */
 
 /*
@@ -866,19 +930,20 @@ xfarray_sort(
 			 * Decrement hi until it finds an a[hi] less than the
 			 * pivot value.
 			 */
-			error = xfarray_sort_load(si, hi, scratch);
+			error = xfarray_sort_load_cached(si, hi, scratch);
 			if (error)
 				goto out_free;
 			while (xfarray_sort_cmp(si, scratch, pivot) >= 0 &&
 								lo < hi) {
-				if (xfarray_sort_terminated(si, &error))
-					goto out_free;
-
 				hi--;
-				error = xfarray_sort_load(si, hi, scratch);
+				error = xfarray_sort_load_cached(si, hi,
+						scratch);
 				if (error)
 					goto out_free;
 			}
+			error = xfarray_sort_put_page(si);
+			if (error)
+				goto out_free;
 
 			if (xfarray_sort_terminated(si, &error))
 				goto out_free;
@@ -894,19 +959,20 @@ xfarray_sort(
 			 * Increment lo until it finds an a[lo] greater than
 			 * the pivot value.
 			 */
-			error = xfarray_sort_load(si, lo, scratch);
+			error = xfarray_sort_load_cached(si, lo, scratch);
 			if (error)
 				goto out_free;
 			while (xfarray_sort_cmp(si, scratch, pivot) <= 0 &&
 								lo < hi) {
-				if (xfarray_sort_terminated(si, &error))
-					goto out_free;
-
 				lo++;
-				error = xfarray_sort_load(si, lo, scratch);
+				error = xfarray_sort_load_cached(si, lo,
+						scratch);
 				if (error)
 					goto out_free;
 			}
+			error = xfarray_sort_put_page(si);
+			if (error)
+				goto out_free;
 
 			if (xfarray_sort_terminated(si, &error))
 				goto out_free;
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index e34ab9c4aad9..0172bd9eeab0 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -12,6 +12,16 @@ struct xfile_page {
 	loff_t			pos;
 };
 
+static inline bool xfile_page_cached(const struct xfile_page *xfpage)
+{
+	return xfpage->page != NULL;
+}
+
+static inline pgoff_t xfile_page_index(const struct xfile_page *xfpage)
+{
+	return xfpage->page->index;
+}
+
 struct xfile {
 	struct file		*file;
 };


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/7] xfs: convert xfarray insertion sort to heapsort using scratchpad memory Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

If all the records in an xfarray subset live within the same memory
page, we can short-circuit even more quicksort recursion by mapping that
page into the local CPU and using the kernel's heapsort function to sort
the subset.  On the author's computer, this reduces the runtime by
another 15% on a 500,000-element array.
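
A userspace illustration of sorting records in place inside the mapped
page (with qsort(3) standing in for the kernel's heapsort; every name
below is made up for the demo):

/* Illustration only: sort records in place inside the mapped page. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return x < y ? -1 : (x > y ? 1 : 0);
}

int main(void)
{
	int		fd = memfd_create("pagesort", MFD_CLOEXEC);
	long		pagesz = sysconf(_SC_PAGESIZE);
	uint64_t	recs[64], *map, i;

	if (fd < 0)
		return 1;
	for (i = 0; i < 64; i++)
		recs[i] = 64 - i;
	if (pwrite(fd, recs, sizeof(recs), 0) != (ssize_t)sizeof(recs))
		return 1;

	/* Map the page and sort the records without any copying. */
	map = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	qsort(map, 64, sizeof(uint64_t), cmp_u64);
	munmap(map, pagesz);

	/* The sorted records are visible through the file as well. */
	pread(fd, recs, sizeof(recs), 0);
	printf("first %llu last %llu\n",
			(unsigned long long)recs[0],
			(unsigned long long)recs[63]);
	close(fd);
	return 0;
}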

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/trace.h   |   20 ++++++++++
 fs/xfs/scrub/xfarray.c |   97 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfarray.h |    4 ++
 3 files changed, 121 insertions(+)


diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 79b844c969df..2431083b9f91 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -872,6 +872,26 @@ TRACE_EVENT(xfarray_isort,
 		  __entry->hi - __entry->lo)
 );
 
+TRACE_EVENT(xfarray_pagesort,
+	TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
+	TP_ARGS(si, lo, hi),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned long long, lo)
+		__field(unsigned long long, hi)
+	),
+	TP_fast_assign(
+		__entry->ino = file_inode(si->array->xfile->file)->i_ino;
+		__entry->lo = lo;
+		__entry->hi = hi;
+	),
+	TP_printk("xfino 0x%lx lo %llu hi %llu elts %llu",
+		  __entry->ino,
+		  __entry->lo,
+		  __entry->hi,
+		  __entry->hi - __entry->lo)
+);
+
 TRACE_EVENT(xfarray_qsort,
 	TP_PROTO(struct xfarray_sortinfo *si, uint64_t lo, uint64_t hi),
 	TP_ARGS(si, lo, hi),
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 171c40d04e6c..08479be07fda 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -546,6 +546,87 @@ xfarray_isort(
 	return xfile_obj_store(si->array->xfile, scratch, len, lo_pos);
 }
 
+/* Grab a page for sorting records. */
+static inline int
+xfarray_sort_get_page(
+	struct xfarray_sortinfo	*si,
+	loff_t			pos,
+	uint64_t		len)
+{
+	int			error;
+
+	error = xfile_get_page(si->array->xfile, pos, len, &si->xfpage);
+	if (error)
+		return error;
+
+	/*
+	 * xfile pages must never be mapped into userspace, so we skip the
+	 * dcache flush when mapping the page.
+	 */
+	si->page_kaddr = kmap_local_page(si->xfpage.page);
+	return 0;
+}
+
+/* Release a page we grabbed for sorting records. */
+static inline int
+xfarray_sort_put_page(
+	struct xfarray_sortinfo	*si)
+{
+	if (!si->page_kaddr)
+		return 0;
+
+	kunmap_local(si->page_kaddr);
+	si->page_kaddr = NULL;
+
+	return xfile_put_page(si->array->xfile, &si->xfpage);
+}
+
+/* Decide if these records are eligible for in-page sorting. */
+static inline bool
+xfarray_want_pagesort(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		lo,
+	xfarray_idx_t		hi)
+{
+	pgoff_t			lo_page;
+	pgoff_t			hi_page;
+	loff_t			end_pos;
+
+	/* We can only map one page at a time. */
+	lo_page = xfarray_pos(si->array, lo) >> PAGE_SHIFT;
+	end_pos = xfarray_pos(si->array, hi) + si->array->obj_size - 1;
+	hi_page = end_pos >> PAGE_SHIFT;
+
+	return lo_page == hi_page;
+}
+
+/* Sort a bunch of records that all live in the same memory page. */
+STATIC int
+xfarray_pagesort(
+	struct xfarray_sortinfo	*si,
+	xfarray_idx_t		lo,
+	xfarray_idx_t		hi)
+{
+	void			*startp;
+	loff_t			lo_pos = xfarray_pos(si->array, lo);
+	uint64_t		len = xfarray_pos(si->array, hi - lo);
+	int			error = 0;
+
+	trace_xfarray_pagesort(si, lo, hi);
+
+	xfarray_sort_bump_loads(si);
+	error = xfarray_sort_get_page(si, lo_pos, len);
+	if (error)
+		return error;
+
+	xfarray_sort_bump_heapsorts(si);
+	startp = si->page_kaddr + offset_in_page(lo_pos);
+	sort(startp, hi - lo + 1, si->array->obj_size, si->cmp_fn, NULL);
+
+	xfarray_sort_bump_stores(si);
+	return xfarray_sort_put_page(si);
+}
+
 /* Return a pointer to the xfarray pivot record within the sortinfo struct. */
 static inline void *xfarray_sortinfo_pivot(struct xfarray_sortinfo *si)
 {
@@ -700,6 +781,10 @@ xfarray_qsort_push(
  * 4. For small sets, load the records into the scratchpad and run heapsort on
  *    them because that is very fast.  In the author's experience, this yields
  *    a ~10% reduction in runtime.
+ *
+ *    If a small set is contained entirely within a single xfile memory page,
+ *    map the page directly and run heap sort directly on the xfile page
+ *    instead of using the load/store interface.  This halves the runtime.
  */
 
 /*
@@ -745,6 +830,18 @@ xfarray_sort(
 			continue;
 		}
 
+		/*
+		 * If directly mapping the page and sorting can solve our
+		 * problems, we're done.
+		 */
+		if (xfarray_want_pagesort(si, lo, hi)) {
+			error = xfarray_pagesort(si, lo, hi);
+			if (error)
+				goto out_free;
+			si->stack_depth--;
+			continue;
+		}
+
 		/* If insertion sort can solve our problems, we're done. */
 		if (xfarray_want_isort(si, lo, hi)) {
 			error = xfarray_isort(si, lo, hi);
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index f49c1afe24a1..e8a4523bf2de 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -81,6 +81,10 @@ struct xfarray_sortinfo {
 	/* XFARRAY_SORT_* flags; see below. */
 	unsigned int		flags;
 
+	/* Cache a page here for faster access. */
+	struct xfile_page	xfpage;
+	void			*page_kaddr;
+
 #ifdef DEBUG
 	/* Performance statistics. */
 	uint64_t		loads;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 7/7] xfs: improve xfarray quicksort pivot
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/7] xfs: convert xfarray insertion sort to heapsort using scratchpad memory Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/7] xfs: enable sorting of xfile-backed arrays Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the means to sort small subsets of an xfarray in
memory, use that capability to improve the quicksort pivot algorithm:
read 7 records into memory and take the median of those as the pivot.
This should prevent bad partitioning when a[lo] and a[hi] end up next to
each other in the final sort, which can happen when sorting for cntbt
repair while the free space is extremely fragmented (e.g. generic/176).

This doesn't speed up the average quicksort run by much, but it will
(hopefully) avoid the quadratic time collapse for which quicksort is
famous.
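
In other words, the pivot is chosen by sampling evenly spaced records,
sorting that tiny sample in memory, and taking the middle sample.  A
minimal in-memory sketch of that selection step (sample count and record
type picked arbitrarily for the example):

/* Sketch: pick a quicksort pivot as the median of evenly spaced samples. */
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>

#define PIVOT_SAMPLES	7

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return x < y ? -1 : (x > y ? 1 : 0);
}

/*
 * Return the median of PIVOT_SAMPLES records taken from a[lo..hi].
 * The caller must ensure the range holds at least PIVOT_SAMPLES records.
 */
static uint64_t choose_pivot(const uint64_t *a, size_t lo, size_t hi)
{
	uint64_t	samples[PIVOT_SAMPLES];
	size_t		step = (hi - lo) / (PIVOT_SAMPLES - 1);
	int		i;

	/* Sample the endpoints and evenly spaced records in between. */
	for (i = 0; i < PIVOT_SAMPLES - 1; i++)
		samples[i] = a[lo + i * step];
	samples[PIVOT_SAMPLES - 1] = a[hi];

	/* Sort the tiny sample and take the middle element as the pivot. */
	qsort(samples, PIVOT_SAMPLES, sizeof(uint64_t), cmp_u64);
	return samples[PIVOT_SAMPLES / 2];
}

int main(void)
{
	uint64_t	a[1000];
	size_t		i;

	for (i = 0; i < 1000; i++)
		a[i] = (i * 7919) % 1000;	/* scrambled input */
	printf("pivot = %llu\n",
			(unsigned long long)choose_pivot(a, 0, 999));
	return 0;
}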

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/xfarray.c |  198 ++++++++++++++++++++++++++++++++----------------
 fs/xfs/scrub/xfarray.h |   19 +++--
 2 files changed, 148 insertions(+), 69 deletions(-)


diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index 3e232ee5e7e6..ce1365144209 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -428,6 +428,14 @@ static inline xfarray_idx_t *xfarray_sortinfo_hi(struct xfarray_sortinfo *si)
 	return xfarray_sortinfo_lo(si) + si->max_stack_depth;
 }
 
+/* Size of each element in the quicksort pivot array. */
+static inline size_t
+xfarray_pivot_rec_sz(
+	struct xfarray		*array)
+{
+	return round_up(array->obj_size, 8) + sizeof(xfarray_idx_t);
+}
+
 /* Allocate memory to handle the sort. */
 static inline int
 xfarray_sortinfo_alloc(
@@ -438,8 +446,16 @@ xfarray_sortinfo_alloc(
 {
 	struct xfarray_sortinfo	*si;
 	size_t			nr_bytes = sizeof(struct xfarray_sortinfo);
+	size_t			pivot_rec_sz = xfarray_pivot_rec_sz(array);
 	int			max_stack_depth;
 
+	/*
+	 * The median-of-nine pivot algorithm doesn't work if a subset has
+	 * fewer than 9 items.  Make sure the in-memory sort will always take
+	 * over for subsets where this wouldn't be the case.
+	 */
+	BUILD_BUG_ON(XFARRAY_QSORT_PIVOT_NR >= XFARRAY_ISORT_NR);
+
 	/*
 	 * Tail-call recursion during the partitioning phase means that
 	 * quicksort will never recurse more than log2(nr) times.  We need one
@@ -454,8 +470,10 @@ xfarray_sortinfo_alloc(
 	/* Each level of quicksort uses a lo and a hi index */
 	nr_bytes += max_stack_depth * sizeof(xfarray_idx_t) * 2;
 
-	/* Scratchpad for in-memory sort, or one record for the pivot */
-	nr_bytes += (XFARRAY_ISORT_NR * array->obj_size);
+	/* Scratchpad for in-memory sort, or finding the pivot */
+	nr_bytes += max_t(size_t,
+			(XFARRAY_QSORT_PIVOT_NR + 1) * pivot_rec_sz,
+			XFARRAY_ISORT_NR * array->obj_size);
 
 	si = kvzalloc(nr_bytes, XCHK_GFP_FLAGS);
 	if (!si)
@@ -633,14 +651,43 @@ static inline void *xfarray_sortinfo_pivot(struct xfarray_sortinfo *si)
 	return xfarray_sortinfo_hi(si) + si->max_stack_depth;
 }
 
+/* Return a pointer to the start of the pivot array. */
+static inline void *
+xfarray_sortinfo_pivot_array(
+	struct xfarray_sortinfo	*si)
+{
+	return xfarray_sortinfo_pivot(si) + si->array->obj_size;
+}
+
+/* The xfarray record is stored at the start of each pivot array element. */
+static inline void *
+xfarray_pivot_array_rec(
+	void			*pa,
+	size_t			pa_recsz,
+	unsigned int		pa_idx)
+{
+	return pa + (pa_recsz * pa_idx);
+}
+
+/* The xfarray index is stored at the end of each pivot array element. */
+static inline xfarray_idx_t *
+xfarray_pivot_array_idx(
+	void			*pa,
+	size_t			pa_recsz,
+	unsigned int		pa_idx)
+{
+	return xfarray_pivot_array_rec(pa, pa_recsz, pa_idx + 1) -
+			sizeof(xfarray_idx_t);
+}
+
 /*
  * Find a pivot value for quicksort partitioning, swap it with a[lo], and save
  * the cached pivot record for the next step.
  *
- * Select the median value from a[lo], a[mid], and a[hi].  Put the median in
- * a[lo], the lowest in a[mid], and the highest in a[hi].  Using the median of
- * the three reduces the chances that we pick the worst case pivot value, since
- * it's likely that our array values are nearly sorted.
+ * Load evenly-spaced records within the given range into memory, sort them,
+ * and choose the pivot from the median record.  Using multiple points will
+ * improve the quality of the pivot selection, and hopefully avoid the worst
+ * quicksort behavior, since our array values are nearly always evenly sorted.
  */
 STATIC int
 xfarray_qsort_pivot(
@@ -648,76 +695,99 @@ xfarray_qsort_pivot(
 	xfarray_idx_t		lo,
 	xfarray_idx_t		hi)
 {
-	void			*a = xfarray_sortinfo_pivot(si);
-	void			*b = xfarray_scratch(si->array);
-	xfarray_idx_t		mid = lo + ((hi - lo) / 2);
+	void			*pivot = xfarray_sortinfo_pivot(si);
+	void			*parray = xfarray_sortinfo_pivot_array(si);
+	void			*recp;
+	xfarray_idx_t		*idxp;
+	xfarray_idx_t		step = (hi - lo) / (XFARRAY_QSORT_PIVOT_NR - 1);
+	size_t			pivot_rec_sz = xfarray_pivot_rec_sz(si->array);
+	int			i, j;
 	int			error;
 
-	/* if a[mid] < a[lo], swap a[mid] and a[lo]. */
-	error = xfarray_sort_load(si, mid, a);
-	if (error)
-		return error;
-	error = xfarray_sort_load(si, lo, b);
-	if (error)
-		return error;
-	if (xfarray_sort_cmp(si, a, b) < 0) {
-		error = xfarray_sort_store(si, lo, a);
-		if (error)
-			return error;
-		error = xfarray_sort_store(si, mid, b);
-		if (error)
-			return error;
-	}
+	ASSERT(step > 0);
 
-	/* if a[hi] < a[mid], swap a[mid] and a[hi]. */
-	error = xfarray_sort_load(si, hi, a);
-	if (error)
-		return error;
-	error = xfarray_sort_load(si, mid, b);
-	if (error)
-		return error;
-	if (xfarray_sort_cmp(si, a, b) < 0) {
-		error = xfarray_sort_store(si, mid, a);
-		if (error)
-			return error;
-		error = xfarray_sort_store(si, hi, b);
-		if (error)
-			return error;
-	} else {
-		goto move_front;
+	/*
+	 * Load the xfarray indexes of the records we intend to sample into the
+	 * pivot array.
+	 */
+	idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, 0);
+	*idxp = lo;
+	for (i = 1; i < XFARRAY_QSORT_PIVOT_NR - 1; i++) {
+		idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, i);
+		*idxp = lo + (i * step);
 	}
+	idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz,
+			XFARRAY_QSORT_PIVOT_NR - 1);
+	*idxp = hi;
 
-	/* if a[mid] < a[lo], swap a[mid] and a[lo]. */
-	error = xfarray_sort_load(si, mid, a);
-	if (error)
-		return error;
-	error = xfarray_sort_load(si, lo, b);
-	if (error)
-		return error;
-	if (xfarray_sort_cmp(si, a, b) < 0) {
-		error = xfarray_sort_store(si, lo, a);
-		if (error)
-			return error;
-		error = xfarray_sort_store(si, mid, b);
+	/* Load the selected xfarray records into the pivot array. */
+	for (i = 0; i < XFARRAY_QSORT_PIVOT_NR; i++) {
+		xfarray_idx_t	idx;
+
+		recp = xfarray_pivot_array_rec(parray, pivot_rec_sz, i);
+		idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, i);
+
+		/* No unset records; load directly into the array. */
+		if (likely(si->array->unset_slots == 0)) {
+			error = xfarray_sort_load(si, *idxp, recp);
+			if (error)
+				return error;
+			continue;
+		}
+
+		/*
+		 * Load non-null records into the scratchpad without changing
+		 * the xfarray_idx_t in the pivot array.
+		 */
+		idx = *idxp;
+		xfarray_sort_bump_loads(si);
+		error = xfarray_load_next(si->array, &idx, recp);
 		if (error)
 			return error;
 	}
 
-move_front:
+	xfarray_sort_bump_heapsorts(si);
+	sort(parray, XFARRAY_QSORT_PIVOT_NR, pivot_rec_sz, si->cmp_fn, NULL);
+
 	/*
-	 * Move our selected pivot to a[lo].  Recall that a == si->pivot, so
-	 * this leaves us with the pivot cached in the sortinfo structure.
+	 * We sorted the pivot array records (which includes the xfarray
+	 * indices) in xfarray record order.  The median element of the pivot
+	 * array contains the xfarray record that we will use as the pivot.
+	 * Copy that xfarray record to the designated space.
 	 */
-	error = xfarray_sort_load(si, lo, b);
-	if (error)
-		return error;
-	error = xfarray_sort_load(si, mid, a);
-	if (error)
-		return error;
-	error = xfarray_sort_store(si, mid, b);
+	recp = xfarray_pivot_array_rec(parray, pivot_rec_sz,
+			XFARRAY_QSORT_PIVOT_NR / 2);
+	memcpy(pivot, recp, si->array->obj_size);
+
+	/* If the pivot record we chose was already in a[lo] then we're done. */
+	idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz,
+			XFARRAY_QSORT_PIVOT_NR / 2);
+	if (*idxp == lo)
+		return 0;
+
+	/*
+	 * Find the cached copy of a[lo] in the pivot array so that we can swap
+	 * a[lo] and a[pivot].
+	 */
+	for (i = 0, j = -1; i < XFARRAY_QSORT_PIVOT_NR; i++) {
+		idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz, i);
+		if (*idxp == lo)
+			j = i;
+	}
+	if (j < 0) {
+		ASSERT(j >= 0);
+		return -EFSCORRUPTED;
+	}
+
+	/* Swap a[lo] and a[pivot]. */
+	error = xfarray_sort_store(si, lo, pivot);
 	if (error)
 		return error;
-	return xfarray_sort_store(si, lo, a);
+
+	recp = xfarray_pivot_array_rec(parray, pivot_rec_sz, j);
+	idxp = xfarray_pivot_array_idx(parray, pivot_rec_sz,
+			XFARRAY_QSORT_PIVOT_NR / 2);
+	return xfarray_sort_store(si, *idxp, recp);
 }
 
 /*
@@ -829,7 +899,7 @@ xfarray_sort_load_cached(
  *    particularly expensive in the kernel.
  *
  * 2. For arrays with records in arbitrary or user-controlled order, choose the
- *    pivot element using a median-of-three decision tree.  This reduces the
+ *    pivot element using a median-of-nine decision tree.  This reduces the
  *    probability of selecting a bad pivot value which causes worst case
  *    behavior (i.e. partition sizes of 1).
  *
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index e8a4523bf2de..69f0c922c98a 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -63,6 +63,9 @@ typedef cmp_func_t xfarray_cmp_fn;
 #define XFARRAY_ISORT_SHIFT		(4)
 #define XFARRAY_ISORT_NR		(1U << XFARRAY_ISORT_SHIFT)
 
+/* Evaluate this many points to find the qsort pivot. */
+#define XFARRAY_QSORT_PIVOT_NR		(9)
+
 struct xfarray_sortinfo {
 	struct xfarray		*array;
 
@@ -92,7 +95,6 @@ struct xfarray_sortinfo {
 	uint64_t		compares;
 	uint64_t		heapsorts;
 #endif
-
 	/*
 	 * Extra bytes are allocated beyond the end of the structure to store
 	 * quicksort information.  C does not permit multiple VLAs per struct,
@@ -115,11 +117,18 @@ struct xfarray_sortinfo {
 	 * 	xfarray_rec_t	scratch[ISORT_NR];
 	 *
 	 * Otherwise, we want to partition the records to partition the array.
-	 * We store the chosen pivot record here and use the xfarray scratchpad
-	 * to rearrange the array around the pivot:
-	 *
-	 * 	xfarray_rec_t	pivot;
+	 * We store the chosen pivot record at the start of the scratchpad area
+	 * and use the rest to sample some records to estimate the median.
+	 * The format of the qsort_pivot array enables us to use the kernel
+	 * heapsort function to place the median value in the middle.
 	 *
+	 * 	struct {
+	 * 		xfarray_rec_t	pivot;
+	 * 		struct {
+	 *			xfarray_rec_t	rec;  (rounded up to 8 bytes)
+	 * 			xfarray_idx_t	idx;
+	 *		} qsort_pivot[QSORT_PIVOT_NR];
+	 * 	};
 	 * }
 	 */
 };


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/4] xfs: get our own reference to inodes that we want to scrub Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: miscellaneous repair tweaks Darrick J. Wong
                   ` (73 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This patchset implements an online checker for the realtime summary
file.  The first few changes are some general cleanups -- scrub should
get its own references to all inodes, and we also wrap the inode lock
functions so that we can standardize unlocking and releasing inodes that
are the focus of a scrub.

With that out of the way, we move on to constructing a shadow copy of
the rtsummary information from the rtbitmap, and comparing the new copy
against the ondisk copy.
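
(A rough sketch of that check, modeled on a toy in-memory bitmap.  The
names and sizes below are made up; the real code walks the rtbitmap via
xfs_rtalloc_query_all and stages the computed counts in an xfile rather
than a flat array.)

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RTEXTS		64	/* toy number of rt extents */
#define BLOCKBITS	16	/* toy "bitmap block" size, in extents */
#define LOGS		7	/* log2(RTEXTS) + 1 */
#define SUMBLOCKS	(RTEXTS / BLOCKBITS)

static int ilog2_64(uint64_t len)
{
	int	ret = -1;

	while (len) {
		len >>= 1;
		ret++;
	}
	return ret;
}

/*
 * Recompute the summary (free extent counts keyed by log2 of the extent
 * length and by the bitmap block of the starting extent) and compare it
 * with the purported ondisk summary.
 */
static bool rtsummary_matches(const bool *bmp_free,
			      const uint32_t (*ondisk)[SUMBLOCKS])
{
	uint32_t	calc[LOGS][SUMBLOCKS] = { { 0 } };
	uint64_t	i = 0;

	while (i < RTEXTS) {
		uint64_t	start = i, len;

		if (!bmp_free[i]) {
			i++;
			continue;
		}
		while (i < RTEXTS && bmp_free[i])
			i++;
		len = i - start;
		calc[ilog2_64(len)][start / BLOCKBITS]++;
	}

	return memcmp(calc, ondisk, sizeof(calc)) == 0;
}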

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-rtsummary

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=scrub-rtsummary
---
 fs/xfs/Makefile          |    6 +
 fs/xfs/scrub/bmap.c      |    6 -
 fs/xfs/scrub/common.c    |   63 +++++++++--
 fs/xfs/scrub/common.h    |   16 ++-
 fs/xfs/scrub/dir.c       |    3 -
 fs/xfs/scrub/inode.c     |   11 +-
 fs/xfs/scrub/parent.c    |   10 +-
 fs/xfs/scrub/quota.c     |   15 +--
 fs/xfs/scrub/rtbitmap.c  |   48 +-------
 fs/xfs/scrub/rtsummary.c |  262 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c     |   17 ++-
 fs/xfs/scrub/scrub.h     |    4 +
 fs/xfs/scrub/trace.h     |   34 ++++++
 fs/xfs/xfs_trace.h       |    3 +
 14 files changed, 411 insertions(+), 87 deletions(-)
 create mode 100644 fs/xfs/scrub/rtsummary.c


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs: get our own reference to inodes that we want to scrub
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/4] xfs: move the realtime summary file scrubber to a separate source file Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we want to scrub a file, get our own reference to the inode
unconditionally.  This will make disposal rules simpler in the long run.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c   |   25 +++++++++++++++++++++----
 fs/xfs/scrub/common.h   |    1 +
 fs/xfs/scrub/inode.c    |    5 ++++-
 fs/xfs/scrub/quota.c    |    6 +++++-
 fs/xfs/scrub/rtbitmap.c |    6 ++++--
 fs/xfs/scrub/scrub.c    |    6 +-----
 6 files changed, 36 insertions(+), 13 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 6b2e14aecd66..305bbacc03df 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -807,6 +807,25 @@ xchk_install_handle_inode(
 	return 0;
 }
 
+/*
+ * Install an already-referenced inode for scrubbing.  Get our own reference to
+ * the inode to make disposal simpler.  The inode must not be in I_FREEING or
+ * I_WILL_FREE state!
+ */
+int
+xchk_install_live_inode(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip)
+{
+	if (!igrab(VFS_I(ip))) {
+		xchk_ino_set_corrupt(sc, ip->i_ino);
+		return -EFSCORRUPTED;
+	}
+
+	sc->ip = ip;
+	return 0;
+}
+
 /*
  * In preparation to scrub metadata structures that hang off of an inode,
  * grab either the inode referenced in the scrub control structure or the
@@ -829,10 +848,8 @@ xchk_iget_for_scrubbing(
 	ASSERT(sc->tp == NULL);
 
 	/* We want to scan the inode we already had opened. */
-	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) {
-		sc->ip = ip_in;
-		return 0;
-	}
+	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino)
+		return xchk_install_live_inode(sc, ip_in);
 
 	/* Reject internal metadata files and obviously bad inode numbers. */
 	if (xfs_internal_inum(mp, sc->sm->sm_ino))
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 9cfc2660dbb4..e51f4b6d287c 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -137,6 +137,7 @@ int xchk_count_rmap_ownedby_ag(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
 int xchk_setup_ag_btree(struct xfs_scrub *sc, bool force_log);
 int xchk_iget_for_scrubbing(struct xfs_scrub *sc);
 int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks);
+int xchk_install_live_inode(struct xfs_scrub *sc, struct xfs_inode *ip);
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
 
 int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp);
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 51b8ba7037f3..30437a7f5660 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -81,7 +81,10 @@ xchk_setup_inode(
 
 	/* We want to scan the opened inode, so lock it and exit. */
 	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) {
-		sc->ip = ip_in;
+		error = xchk_install_live_inode(sc, ip_in);
+		if (error)
+			return error;
+
 		return xchk_prepare_iscrub(sc);
 	}
 
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 7b21e1012eff..57330694cb37 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -59,7 +59,11 @@ xchk_setup_quota(
 	error = xchk_setup_fs(sc);
 	if (error)
 		return error;
-	sc->ip = xfs_quota_inode(sc->mp, dqtype);
+
+	error = xchk_install_live_inode(sc, xfs_quota_inode(sc->mp, dqtype));
+	if (error)
+		return error;
+
 	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
 	sc->ilock_flags = XFS_ILOCK_EXCL;
 	return 0;
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 0a3bde64c675..41dcffeb2947 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -28,10 +28,12 @@ xchk_setup_rt(
 	if (error)
 		return error;
 
+	error = xchk_install_live_inode(sc, sc->mp->m_rbmip);
+	if (error)
+		return error;
+
 	sc->ilock_flags = XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP;
-	sc->ip = sc->mp->m_rbmip;
 	xfs_ilock(sc->ip, sc->ilock_flags);
-
 	return 0;
 }
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index afd481f5a15e..5f006e6799c6 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -168,8 +168,6 @@ xchk_teardown(
 	struct xfs_scrub	*sc,
 	int			error)
 {
-	struct xfs_inode	*ip_in = XFS_I(file_inode(sc->file));
-
 	xchk_ag_free(sc, &sc->sa);
 	if (sc->tp) {
 		if (error == 0 && (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR))
@@ -181,9 +179,7 @@ xchk_teardown(
 	if (sc->ip) {
 		if (sc->ilock_flags)
 			xfs_iunlock(sc->ip, sc->ilock_flags);
-		if (sc->ip != ip_in &&
-		    !xfs_internal_inum(sc->mp, sc->ip->i_ino))
-			xchk_irele(sc, sc->ip);
+		xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
 	}
 	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs: wrap ilock/iunlock operations on sc->ip
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 4/4] xfs: implement online scrubbing of rtsummary info Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Scrub tracks the resources that it's holding onto in the xfs_scrub
structure.  This includes the inode being checked (if applicable) and
the inode lock state of that inode.  Replace the open-coded structure
manipulation with a trivial helper to eliminate sources of error.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap.c     |    6 ++----
 fs/xfs/scrub/common.c   |   38 +++++++++++++++++++++++++++++++++-----
 fs/xfs/scrub/common.h   |    5 +++++
 fs/xfs/scrub/dir.c      |    3 +--
 fs/xfs/scrub/inode.c    |    6 ++----
 fs/xfs/scrub/parent.c   |   10 +++-------
 fs/xfs/scrub/quota.c    |    9 +++------
 fs/xfs/scrub/rtbitmap.c |    9 ++++-----
 fs/xfs/scrub/scrub.c    |    2 +-
 9 files changed, 54 insertions(+), 34 deletions(-)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 499e82110f2f..2dc5bcd5c4fa 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -38,8 +38,7 @@ xchk_setup_inode_bmap(
 	if (error)
 		goto out;
 
-	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
-	xfs_ilock(sc->ip, sc->ilock_flags);
+	xchk_ilock(sc, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
 	/*
 	 * We don't want any ephemeral data fork updates sitting around
@@ -76,9 +75,8 @@ xchk_setup_inode_bmap(
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		goto out;
-	sc->ilock_flags |= XFS_ILOCK_EXCL;
-	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
 
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
 out:
 	/* scrub teardown will unlock and release the inode */
 	return error;
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 305bbacc03df..0c1fa210deae 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -986,19 +986,47 @@ xchk_setup_inode_contents(
 		return error;
 
 	/* Got the inode, lock it and we're ready to go. */
-	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
-	xfs_ilock(sc->ip, sc->ilock_flags);
+	xchk_ilock(sc, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 	error = xchk_trans_alloc(sc, resblks);
 	if (error)
 		goto out;
-	sc->ilock_flags |= XFS_ILOCK_EXCL;
-	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
-
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
 out:
 	/* scrub teardown will unlock and release the inode for us */
 	return error;
 }
 
+void
+xchk_ilock(
+	struct xfs_scrub	*sc,
+	unsigned int		ilock_flags)
+{
+	sc->ilock_flags |= ilock_flags;
+	xfs_ilock(sc->ip, ilock_flags);
+}
+
+bool
+xchk_ilock_nowait(
+	struct xfs_scrub	*sc,
+	unsigned int		ilock_flags)
+{
+	if (xfs_ilock_nowait(sc->ip, ilock_flags)) {
+		sc->ilock_flags |= ilock_flags;
+		return true;
+	}
+
+	return false;
+}
+
+void
+xchk_iunlock(
+	struct xfs_scrub	*sc,
+	unsigned int		ilock_flags)
+{
+	xfs_iunlock(sc->ip, ilock_flags);
+	sc->ilock_flags &= ~ilock_flags;
+}
+
 /*
  * Predicate that decides if we need to evaluate the cross-reference check.
  * If there was an error accessing the cross-reference btree, just delete
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index e51f4b6d287c..d755ff734c9e 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -138,6 +138,11 @@ int xchk_setup_ag_btree(struct xfs_scrub *sc, bool force_log);
 int xchk_iget_for_scrubbing(struct xfs_scrub *sc);
 int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks);
 int xchk_install_live_inode(struct xfs_scrub *sc, struct xfs_inode *ip);
+
+void xchk_ilock(struct xfs_scrub *sc, unsigned int ilock_flags);
+bool xchk_ilock_nowait(struct xfs_scrub *sc, unsigned int ilock_flags);
+void xchk_iunlock(struct xfs_scrub *sc, unsigned int ilock_flags);
+
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
 
 int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp);
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 8076e7620734..2a3107cc8ccb 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -840,8 +840,7 @@ xchk_directory(
 	 * _dir_lookup routines, which do their own ILOCK locking.
 	 */
 	oldpos = 0;
-	sc->ilock_flags &= ~XFS_ILOCK_EXCL;
-	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL);
+	xchk_iunlock(sc, XFS_ILOCK_EXCL);
 	while (true) {
 		error = xfs_readdir(sc->tp, sc->ip, &sdc.dir_iter, bufsize);
 		if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0,
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 30437a7f5660..7a248f26a03c 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -31,15 +31,13 @@ xchk_prepare_iscrub(
 {
 	int			error;
 
-	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
-	xfs_ilock(sc->ip, sc->ilock_flags);
+	xchk_ilock(sc, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
 
-	sc->ilock_flags |= XFS_ILOCK_EXCL;
-	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
 	return 0;
 }
 
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 0c23fd49716b..8581a21bfbfd 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -137,8 +137,7 @@ xchk_parent_lock_two_dirs(
 		return -EINVAL;
 	}
 
-	xfs_iunlock(sc->ip, sc->ilock_flags);
-	sc->ilock_flags = 0;
+	xchk_iunlock(sc, sc->ilock_flags);
 	while (true) {
 		if (xchk_should_terminate(sc, &error))
 			return error;
@@ -149,10 +148,8 @@ xchk_parent_lock_two_dirs(
 		 * on either IOLOCK.
 		 */
 		if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) {
-			if (xfs_ilock_nowait(sc->ip, XFS_IOLOCK_EXCL)) {
-				sc->ilock_flags = XFS_IOLOCK_EXCL;
+			if (xchk_ilock_nowait(sc, XFS_IOLOCK_EXCL))
 				break;
-			}
 			xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 		}
 
@@ -299,8 +296,7 @@ xchk_parent(
 	 * getting a write lock on i_rwsem.  Therefore, it is safe for us
 	 * to drop the ILOCK here in order to do directory lookups.
 	 */
-	sc->ilock_flags &= ~(XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL);
-	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL);
+	xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
 	/* Look up '..' */
 	error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &parent_ino,
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 57330694cb37..085ff234f6ba 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -64,8 +64,7 @@ xchk_setup_quota(
 	if (error)
 		return error;
 
-	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
-	sc->ilock_flags = XFS_ILOCK_EXCL;
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
 	return 0;
 }
 
@@ -239,13 +238,11 @@ xchk_quota(
 	 * data fork we have to drop ILOCK_EXCL to use the regular dquot
 	 * functions.
 	 */
-	xfs_iunlock(sc->ip, sc->ilock_flags);
-	sc->ilock_flags = 0;
+	xchk_iunlock(sc, sc->ilock_flags);
 	sqi.sc = sc;
 	sqi.last_id = 0;
 	error = xfs_qm_dqiterate(mp, dqtype, xchk_quota_item, &sqi);
-	sc->ilock_flags = XFS_ILOCK_EXCL;
-	xfs_ilock(sc->ip, sc->ilock_flags);
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
 	if (error == -ECANCELED)
 		error = 0;
 	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 41dcffeb2947..c58c86fa1b03 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -32,8 +32,7 @@ xchk_setup_rt(
 	if (error)
 		return error;
 
-	sc->ilock_flags = XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP;
-	xfs_ilock(sc->ip, sc->ilock_flags);
+	xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
 	return 0;
 }
 
@@ -143,8 +142,8 @@ xchk_rtsummary(
 	 * flags so that we don't mix up the inode state that @sc tracks.
 	 */
 	sc->ip = rsumip;
-	sc->ilock_flags = XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM;
-	xfs_ilock(sc->ip, sc->ilock_flags);
+	sc->ilock_flags = 0;
+	xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
 
 	/* Invoke the fork scrubber. */
 	error = xchk_metadata_inode_forks(sc);
@@ -155,7 +154,7 @@ xchk_rtsummary(
 	xchk_set_incomplete(sc);
 out:
 	/* Switch back to the rtbitmap inode and lock flags. */
-	xfs_iunlock(sc->ip, sc->ilock_flags);
+	xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
 	sc->ilock_flags = old_ilock_flags;
 	sc->ip = old_ip;
 	return error;
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 5f006e6799c6..189a56f93279 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -178,7 +178,7 @@ xchk_teardown(
 	}
 	if (sc->ip) {
 		if (sc->ilock_flags)
-			xfs_iunlock(sc->ip, sc->ilock_flags);
+			xchk_iunlock(sc, sc->ilock_flags);
 		xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
 	}


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs: move the realtime summary file scrubber to a separate source file
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/4] xfs: get our own reference to inodes that we want to scrub Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/4] xfs: implement online scrubbing of rtsummary info Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/4] xfs: wrap ilock/iunlock operations on sc->ip Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the realtime summary file checking code to a separate file in
preparation to actually implement it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile          |    6 ++++-
 fs/xfs/scrub/rtbitmap.c  |   37 --------------------------------
 fs/xfs/scrub/rtsummary.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/scrub/rtsummary.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 90cbba7dc550..e82d937a4513 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -166,7 +166,11 @@ xfs-y				+= $(addprefix scrub/, \
 				   xfile.o \
 				   )
 
-xfs-$(CONFIG_XFS_RT)		+= scrub/rtbitmap.o
+xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
+				   rtbitmap.o \
+				   rtsummary.o \
+				   )
+
 xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 
 # online repair
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index c58c86fa1b03..0ac7e7c2fbf9 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -124,43 +124,6 @@ xchk_rtbitmap(
 	return error;
 }
 
-/* Scrub the realtime summary. */
-int
-xchk_rtsummary(
-	struct xfs_scrub	*sc)
-{
-	struct xfs_inode	*rsumip = sc->mp->m_rsumip;
-	struct xfs_inode	*old_ip = sc->ip;
-	uint			old_ilock_flags = sc->ilock_flags;
-	int			error = 0;
-
-	/*
-	 * We ILOCK'd the rt bitmap ip in the setup routine, now lock the
-	 * rt summary ip in compliance with the rt inode locking rules.
-	 *
-	 * Since we switch sc->ip to rsumip we have to save the old ilock
-	 * flags so that we don't mix up the inode state that @sc tracks.
-	 */
-	sc->ip = rsumip;
-	sc->ilock_flags = 0;
-	xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
-
-	/* Invoke the fork scrubber. */
-	error = xchk_metadata_inode_forks(sc);
-	if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
-		goto out;
-
-	/* XXX: implement this some day */
-	xchk_set_incomplete(sc);
-out:
-	/* Switch back to the rtbitmap inode and lock flags. */
-	xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
-	sc->ilock_flags = old_ilock_flags;
-	sc->ip = old_ip;
-	return error;
-}
-
-
 /* xref check that the extent is not free in the rtbitmap */
 void
 xchk_xref_is_used_rt_space(
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
new file mode 100644
index 000000000000..e12554e7f6f9
--- /dev/null
+++ b/fs/xfs/scrub/rtsummary.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2017-2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_inode.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_rtalloc.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+
+/* Scrub the realtime summary. */
+int
+xchk_rtsummary(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_inode	*rsumip = sc->mp->m_rsumip;
+	struct xfs_inode	*old_ip = sc->ip;
+	uint			old_ilock_flags = sc->ilock_flags;
+	int			error = 0;
+
+	/*
+	 * We ILOCK'd the rt bitmap ip in the setup routine, now lock the
+	 * rt summary ip in compliance with the rt inode locking rules.
+	 *
+	 * Since we switch sc->ip to rsumip we have to save the old ilock
+	 * flags so that we don't mix up the inode state that @sc tracks.
+	 */
+	sc->ip = rsumip;
+	sc->ilock_flags = 0;
+	xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
+
+	/* Invoke the fork scrubber. */
+	error = xchk_metadata_inode_forks(sc);
+	if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
+		goto out;
+
+	/* XXX: implement this some day */
+	xchk_set_incomplete(sc);
+out:
+	/* Switch back to the rtbitmap inode and lock flags. */
+	xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
+	sc->ilock_flags = old_ilock_flags;
+	sc->ip = old_ip;
+	return error;
+}


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs: implement online scrubbing of rtsummary info
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/4] xfs: get our own reference to inodes that we want to scrub Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/4] xfs: move the realtime summary file scrubber to a separate source file Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/4] xfs: wrap ilock/iunlock operations on sc->ip Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Finish the realtime summary scrubber by adding the functions we need to
compute a fresh copy of the rtsummary info and comparing it to the copy
on disk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.h    |   10 +-
 fs/xfs/scrub/rtbitmap.c  |    4 -
 fs/xfs/scrub/rtsummary.c |  252 ++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/scrub/scrub.c     |    9 +-
 fs/xfs/scrub/scrub.h     |    4 +
 fs/xfs/scrub/trace.h     |   34 ++++++
 fs/xfs/xfs_trace.h       |    3 +
 7 files changed, 288 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index d755ff734c9e..4b2590540be4 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -88,10 +88,16 @@ int xchk_setup_xattr(struct xfs_scrub *sc);
 int xchk_setup_symlink(struct xfs_scrub *sc);
 int xchk_setup_parent(struct xfs_scrub *sc);
 #ifdef CONFIG_XFS_RT
-int xchk_setup_rt(struct xfs_scrub *sc);
+int xchk_setup_rtbitmap(struct xfs_scrub *sc);
+int xchk_setup_rtsummary(struct xfs_scrub *sc);
 #else
 static inline int
-xchk_setup_rt(struct xfs_scrub *sc)
+xchk_setup_rtbitmap(struct xfs_scrub *sc)
+{
+	return -ENOENT;
+}
+static inline int
+xchk_setup_rtsummary(struct xfs_scrub *sc)
 {
 	return -ENOENT;
 }
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 0ac7e7c2fbf9..851039588fe3 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -19,12 +19,12 @@
 
 /* Set us up with the realtime metadata locked. */
 int
-xchk_setup_rt(
+xchk_setup_rtbitmap(
 	struct xfs_scrub	*sc)
 {
 	int			error;
 
-	error = xchk_setup_fs(sc);
+	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index e12554e7f6f9..6e71e39e197e 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -14,41 +14,249 @@
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
 #include "xfs_rtalloc.h"
+#include "xfs_bit.h"
+#include "xfs_bmap.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/xfile.h"
+
+/*
+ * Realtime Summary
+ * ================
+ *
+ * We check the realtime summary by scanning the realtime bitmap file to create
+ * a new summary file incore, and then we compare the computed version against
+ * the ondisk version.  We use the 'xfile' functionality to store this
+ * (potentially large) amount of data in pageable memory.
+ */
+
+/* Set us up to check the rtsummary file. */
+int
+xchk_setup_rtsummary(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	/*
+	 * Create an xfile to construct a new rtsummary file.  The xfile allows
+	 * us to avoid pinning kernel memory for this purpose.
+	 */
+	error = xfile_create(mp, "realtime summary file", mp->m_rsumsize,
+			&sc->xfile);
+	if (error)
+		return error;
+
+	error = xchk_trans_alloc(sc, 0);
+	if (error)
+		return error;
+
+	/* Allocate a memory buffer for the summary comparison. */
+	sc->buf = kvmalloc(mp->m_sb.sb_blocksize, XCHK_GFP_FLAGS);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	error = xchk_install_live_inode(sc, mp->m_rsumip);
+	if (error)
+		return error;
+
+	/*
+	 * Locking order requires us to take the rtbitmap first.  We must be
+	 * careful to unlock it ourselves when we are done with the rtbitmap
+	 * file since the scrub infrastructure won't do that for us.  Only
+	 * then we can lock the rtsummary inode.
+	 */
+	xfs_ilock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
+	xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
+	return 0;
+}
+
+/* Helper functions to record suminfo words in an xfile. */
+
+typedef unsigned int xchk_rtsumoff_t;
+
+static inline int
+xfsum_load(
+	struct xfs_scrub	*sc,
+	xchk_rtsumoff_t		sumoff,
+	xfs_suminfo_t		*info)
+{
+	return xfile_obj_load(sc->xfile, info, sizeof(xfs_suminfo_t),
+			sumoff << XFS_WORDLOG);
+}
+
+static inline int
+xfsum_store(
+	struct xfs_scrub	*sc,
+	xchk_rtsumoff_t		sumoff,
+	const xfs_suminfo_t	info)
+{
+	return xfile_obj_store(sc->xfile, &info, sizeof(xfs_suminfo_t),
+			sumoff << XFS_WORDLOG);
+}
+
+static inline int
+xfsum_copyout(
+	struct xfs_scrub	*sc,
+	xchk_rtsumoff_t		sumoff,
+	xfs_suminfo_t		*info,
+	unsigned int		nr_words)
+{
+	return xfile_obj_load(sc->xfile, info, nr_words << XFS_WORDLOG,
+			sumoff << XFS_WORDLOG);
+}
+
+/* Update the summary file to reflect the free extent that we've accumulated. */
+STATIC int
+xchk_rtsum_record_free(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	const struct xfs_rtalloc_rec	*rec,
+	void				*priv)
+{
+	struct xfs_scrub		*sc = priv;
+	xfs_fileoff_t			rbmoff;
+	xfs_rtblock_t			rtbno;
+	xfs_filblks_t			rtlen;
+	xchk_rtsumoff_t			offs;
+	unsigned int			lenlog;
+	xfs_suminfo_t			v = 0;
+	int				error = 0;
+
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	/* Compute the relevant location in the rtsum file. */
+	rbmoff = XFS_BITTOBLOCK(mp, rec->ar_startext);
+	lenlog = XFS_RTBLOCKLOG(rec->ar_extcount);
+	offs = XFS_SUMOFFS(mp, lenlog, rbmoff);
+
+	rtbno = rec->ar_startext * mp->m_sb.sb_rextsize;
+	rtlen = rec->ar_extcount * mp->m_sb.sb_rextsize;
+
+	if (!xfs_verify_rtext(mp, rtbno, rtlen)) {
+		xchk_ino_xref_set_corrupt(sc, mp->m_rbmip->i_ino);
+		return -EFSCORRUPTED;
+	}
+
+	/* Bump the summary count. */
+	error = xfsum_load(sc, offs, &v);
+	if (error)
+		return error;
+
+	v++;
+	trace_xchk_rtsum_record_free(mp, rec->ar_startext, rec->ar_extcount,
+			lenlog, offs, v);
+
+	return xfsum_store(sc, offs, v);
+}
+
+/* Compute the realtime summary from the realtime bitmap. */
+STATIC int
+xchk_rtsum_compute(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	unsigned long long	rtbmp_bytes;
+
+	/* If the bitmap size doesn't match the computed size, bail. */
+	rtbmp_bytes = howmany_64(mp->m_sb.sb_rextents, NBBY);
+	if (roundup_64(rtbmp_bytes, mp->m_sb.sb_blocksize) !=
+			mp->m_rbmip->i_disk_size)
+		return -EFSCORRUPTED;
+
+	return xfs_rtalloc_query_all(sc->mp, sc->tp, xchk_rtsum_record_free,
+			sc);
+}
+
+/* Compare the rtsummary file against the one we computed. */
+STATIC int
+xchk_rtsum_compare(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*bp;
+	struct xfs_bmbt_irec	map;
+	xfs_fileoff_t		off;
+	xchk_rtsumoff_t		sumoff = 0;
+	int			nmap;
+
+	for (off = 0; off < XFS_B_TO_FSB(mp, mp->m_rsumsize); off++) {
+		int		error = 0;
+
+		if (xchk_should_terminate(sc, &error))
+			return error;
+		if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+			return 0;
+
+		/* Make sure we have a written extent. */
+		nmap = 1;
+		error = xfs_bmapi_read(mp->m_rsumip, off, 1, &map, &nmap,
+				XFS_DATA_FORK);
+		if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, off, &error))
+			return error;
+
+		if (nmap != 1 || !xfs_bmap_is_written_extent(&map)) {
+			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
+			return 0;
+		}
+
+		/* Read a block's worth of ondisk rtsummary file. */
+		error = xfs_rtbuf_get(mp, sc->tp, off, 1, &bp);
+		if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, off, &error))
+			return error;
+
+		/* Read a block's worth of computed rtsummary file. */
+		error = xfsum_copyout(sc, sumoff, sc->buf, mp->m_blockwsize);
+		if (error) {
+			xfs_trans_brelse(sc->tp, bp);
+			return error;
+		}
+
+		if (memcmp(bp->b_addr, sc->buf,
+					mp->m_blockwsize << XFS_WORDLOG) != 0)
+			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
+
+		xfs_trans_brelse(sc->tp, bp);
+		sumoff += mp->m_blockwsize;
+	}
+
+	return 0;
+}
 
 /* Scrub the realtime summary. */
 int
 xchk_rtsummary(
 	struct xfs_scrub	*sc)
 {
-	struct xfs_inode	*rsumip = sc->mp->m_rsumip;
-	struct xfs_inode	*old_ip = sc->ip;
-	uint			old_ilock_flags = sc->ilock_flags;
+	struct xfs_mount	*mp = sc->mp;
 	int			error = 0;
 
-	/*
-	 * We ILOCK'd the rt bitmap ip in the setup routine, now lock the
-	 * rt summary ip in compliance with the rt inode locking rules.
-	 *
-	 * Since we switch sc->ip to rsumip we have to save the old ilock
-	 * flags so that we don't mix up the inode state that @sc tracks.
-	 */
-	sc->ip = rsumip;
-	sc->ilock_flags = 0;
-	xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
-
 	/* Invoke the fork scrubber. */
 	error = xchk_metadata_inode_forks(sc);
 	if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
-		goto out;
+		goto out_rbm;
 
-	/* XXX: implement this some day */
-	xchk_set_incomplete(sc);
-out:
-	/* Switch back to the rtbitmap inode and lock flags. */
-	xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
-	sc->ilock_flags = old_ilock_flags;
-	sc->ip = old_ip;
+	/* Construct the new summary file from the rtbitmap. */
+	error = xchk_rtsum_compute(sc);
+	if (error == -EFSCORRUPTED) {
+		/*
+		 * EFSCORRUPTED means the rtbitmap is corrupt, which is an xref
+		 * error since we're checking the summary file.
+		 */
+		xchk_ino_xref_set_corrupt(sc, mp->m_rbmip->i_ino);
+		error = 0;
+		goto out_rbm;
+	}
+	if (error)
+		goto out_rbm;
+
+	/* Does the computed summary file match the actual rtsummary file? */
+	error = xchk_rtsum_compare(sc);
+
+out_rbm:
+	/* Unlock the rtbitmap since we're done with it. */
+	xfs_iunlock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 189a56f93279..fe5faec4d1d2 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -24,6 +24,7 @@
 #include "scrub/trace.h"
 #include "scrub/repair.h"
 #include "scrub/health.h"
+#include "scrub/xfile.h"
 
 /*
  * Online Scrub and Repair
@@ -186,6 +187,10 @@ xchk_teardown(
 		mnt_drop_write_file(sc->file);
 	if (sc->flags & XCHK_REAPING_DISABLED)
 		xchk_start_reaping(sc);
+	if (sc->xfile) {
+		xfile_destroy(sc->xfile);
+		sc->xfile = NULL;
+	}
 	if (sc->buf) {
 		if (sc->buf_cleanup)
 			sc->buf_cleanup(sc->buf);
@@ -320,14 +325,14 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 	},
 	[XFS_SCRUB_TYPE_RTBITMAP] = {	/* realtime bitmap */
 		.type	= ST_FS,
-		.setup	= xchk_setup_rt,
+		.setup	= xchk_setup_rtbitmap,
 		.scrub	= xchk_rtbitmap,
 		.has	= xfs_has_realtime,
 		.repair	= xrep_notsupported,
 	},
 	[XFS_SCRUB_TYPE_RTSUM] = {	/* realtime summary */
 		.type	= ST_FS,
-		.setup	= xchk_setup_rt,
+		.setup	= xchk_setup_rtsummary,
 		.scrub	= xchk_rtsummary,
 		.has	= xfs_has_realtime,
 		.repair	= xrep_notsupported,
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 5d6e9a9527c3..83ad1662d802 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -88,6 +88,10 @@ struct xfs_scrub {
 	 */
 	void				(*buf_cleanup)(void *buf);
 
+	/* xfile used by the scrubbers; freed at teardown. */
+	struct xfile			*xfile;
+
+	/* Lock flags for @ip. */
 	uint				ilock_flags;
 
 	/* See the XCHK/XREP state flags below. */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 2431083b9f91..8d9a5e8c59e2 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -988,6 +988,40 @@ TRACE_EVENT(xfarray_sort_stats,
 		  __entry->error)
 );
 
+#ifdef CONFIG_XFS_RT
+TRACE_EVENT(xchk_rtsum_record_free,
+	TP_PROTO(struct xfs_mount *mp, xfs_rtblock_t start,
+		 uint64_t len, unsigned int log, loff_t pos, xfs_suminfo_t v),
+	TP_ARGS(mp, start, len, log, pos, v),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, rtdev)
+		__field(xfs_rtblock_t, start)
+		__field(unsigned long long, len)
+		__field(unsigned int, log)
+		__field(loff_t, pos)
+		__field(xfs_suminfo_t, v)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->rtdev = mp->m_rtdev_targp->bt_dev;
+		__entry->start = start;
+		__entry->len = len;
+		__entry->log = log;
+		__entry->pos = pos;
+		__entry->v = v;
+	),
+	TP_printk("dev %d:%d rtdev %d:%d rtx 0x%llx rtxcount 0x%llx log %u rsumpos 0x%llx sumcount %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->rtdev), MINOR(__entry->rtdev),
+		  __entry->start,
+		  __entry->len,
+		  __entry->log,
+		  __entry->pos,
+		  __entry->v)
+);
+#endif /* CONFIG_XFS_RT */
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6941deb80244..dfb1d47ee3c3 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -22,6 +22,9 @@
  * daddr: physical block number in 512b blocks
  * bbcount: number of blocks in a physical extent, in 512b blocks
  *
+ * rtx: physical rt extent number for extent mappings
+ * rtxcount: number of rt extents in an extent mapping
+ *
  * owner: reverse-mapping owner, usually inodes
  *
  * fileoff: file offset, in fs blocks


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] xfs: miscellaneous repair tweaks
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/2] xfs: allow the user to cancel repairs before we start writing Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/2] xfs: always rescan allegedly healthy per-ag metadata after repair Darrick J. Wong
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: force rebuilding of metadata Darrick J. Wong
                   ` (72 subsequent siblings)
  77 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Before we start adding online repair functionality, there are a few tweaks
that I'd like to make to the common repair code.  First is a fix to the
integration between repair and the health status code that was
interfering with repair re-evaluations.  Second is a minor tweak to the
existing repair functions to make one last check that the user
hasn't terminated the calling process before we start writing to the
filesystem.  This is a pattern that will repeat throughout the rest of
the repair functions.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tweaks
---
 fs/xfs/scrub/agheader_repair.c |   16 ++++++++++++++++
 fs/xfs/scrub/health.c          |   10 ++++++++++
 2 files changed, 26 insertions(+)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] xfs: always rescan allegedly healthy per-ag metadata after repair
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: miscellaneous repair tweaks Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/2] xfs: allow the user to cancel repairs before we start writing Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

After an online repair function runs for a per-AG metadata structure,
sc->sick_mask is supposed to reflect the per-AG metadata that the repair
function fixed.  Our next move is to re-check the metadata to assess
the completeness of our repair, so we don't want the rebuilt structure
to be excluded from the rescan just because the health system previously
logged a problem with the data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/health.c |   10 ++++++++++
 1 file changed, 10 insertions(+)


diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index f7c5a109615f..f67279ecb69c 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -226,6 +226,16 @@ xchk_ag_btree_healthy_enough(
 		return true;
 	}
 
+	/*
+	 * If we just repaired some AG metadata, sc->sick_mask will reflect all
+	 * the per-AG metadata types that were repaired.  Exclude these from
+	 * the filesystem health query because we have not yet updated the
+	 * health status and we want everything to be scanned.
+	 */
+	if ((sc->flags & XREP_ALREADY_FIXED) &&
+	    type_to_health_flag[sc->sm->sm_type].group == XHG_AG)
+		mask &= ~sc->sick_mask;
+
 	if (xfs_ag_has_sickness(pag, mask)) {
 		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XFAIL;
 		return false;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] xfs: allow the user to cancel repairs before we start writing
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: miscellaneous repair tweaks Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/2] xfs: always rescan allegedly healthy per-ag metadata after repair Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

All online repair functions have the same structure: walk filesystem
metadata structures gathering enough data to rebuild the structure,
stage a new copy, and then commit the new copy.

The gathering steps do not write anything to disk, so they are peppered
with xchk_should_terminate calls to avoid softlockup warnings and to
provide an opportunity to abort the repair (by killing xfs_scrub).
However, it's not clear in the code base where the last chance is to
abort cleanly without having to undo a bunch of partially built structure.

Therefore, add one more call to xchk_should_terminate (along with a
comment) providing the sysadmin with the ability to abort before it's
too late and to make it clear in the source code when it's no longer
convenient or safe to abort a repair.  As there are only four repair
functions right now, this patch exists more to establish a precedent for
subsequent additions than to deliver practical functionality.
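
(To make the shape of the pattern concrete, here's a toy standalone
program, not kernel code: a read-only gather phase that polls for a
fatal signal, one last termination check, and then an irrevocable
commit.  The helpers are invented for this sketch.)

#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>

static volatile sig_atomic_t	fatal_signal;

static void on_term(int sig)
{
	(void)sig;
	fatal_signal = 1;
}

/* Toy stand-in for xchk_should_terminate(): was the caller killed? */
static bool should_terminate(int *error)
{
	if (fatal_signal) {
		*error = -EINTR;
		return true;
	}
	return false;
}

/* Phase 1: read-only scan that stages the replacement (here, a counter). */
static int gather(long *staged)
{
	int	error = 0;
	long	i;

	for (i = 0; i < 10000000; i++) {
		if (should_terminate(&error))
			return error;
		(*staged)++;
	}
	return 0;
}

int main(void)
{
	long	staged = 0;
	int	error = 0;

	signal(SIGTERM, on_term);

	if (gather(&staged))
		return 1;

	/* Last chance to abort before "committing" anything. */
	if (should_terminate(&error))
		return 1;

	/* Phase 2: commit; bailing out after this is no longer clean. */
	printf("committed %ld staged updates\n", staged);
	return 0;
}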

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 5140f52fa9a1..377a7a4bda5c 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -50,6 +50,10 @@ xrep_superblock(
 	if (error)
 		return error;
 
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
 	/* Copy AG 0's superblock to this one. */
 	xfs_buf_zero(bp, 0, BBTOB(bp->b_length));
 	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
@@ -424,6 +428,10 @@ xrep_agf(
 	if (error)
 		return error;
 
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
 	/* Start rewriting the header and implant the btrees we found. */
 	xrep_agf_init_header(sc, agf_bp, &old_agf);
 	xrep_agf_set_roots(sc, agf, fab);
@@ -748,6 +756,10 @@ xrep_agfl(
 	if (error)
 		goto err;
 
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		goto err;
+
 	/*
 	 * Update AGF and AGFL.  We reset the global free block counter when
 	 * we adjust the AGF flcount (which can fail) so avoid updating any
@@ -995,6 +1007,10 @@ xrep_agi(
 	if (error)
 		return error;
 
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
 	/* Start rewriting the header and implant the btrees we found. */
 	xrep_agi_init_header(sc, agi_bp, &old_agi);
 	xrep_agi_set_roots(sc, agi, fab);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] xfs: force rebuilding of metadata
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: miscellaneous repair tweaks Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/2] xfs: allow userspace to rebuild metadata structures Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/2] xfs: don't complain about unfixed metadata when repairs were injected Darrick J. Wong
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
                   ` (71 subsequent siblings)
  77 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This patchset adds a new IFLAG to the scrub ioctl so that userspace can
force a rebuild of an otherwise consistent piece of metadata.  This will
eventually enable the use of online repair to relocate metadata during a
filesystem reorganization (e.g. shrink).  For now, it facilitates stress
testing of online repair without needing the debugging knobs to be
enabled.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-force-rebuild

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-force-rebuild

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-force-rebuild
---
 fs/xfs/libxfs/xfs_fs.h |    6 +++++-
 fs/xfs/scrub/common.h  |   12 ++++++++++++
 fs/xfs/scrub/scrub.c   |   18 ++++++++++++------
 fs/xfs/scrub/trace.h   |    3 ++-
 4 files changed, 31 insertions(+), 8 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] xfs: don't complain about unfixed metadata when repairs were injected
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: force rebuilding of metadata Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/2] xfs: allow userspace to rebuild metadata structures Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While debugging other parts of online repair, I noticed that if someone
injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of
metadata, and the metadata repair fails, we'll log a message about
uncorrected errors in the filesystem.

This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT
and we're only doing the repair because the error injection knob is set.
Repair functions are allowed to abort the entire operation at any point
before committing new metadata, in which case the piece of metadata is
in the same state as it was before.  Therefore, the log message should
be gated on the results of the scrub.  Refactor the predicate and
rearrange the code flow to make this happen.

Note: If the repair function errors out after it commits the new
metadata, the transaction cancellation will shut down the filesystem,
which is an obvious sign of corrupt metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.h |   12 ++++++++++++
 fs/xfs/scrub/scrub.c  |    7 ++-----
 2 files changed, 14 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 4b2590540be4..92578c4aed13 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -167,6 +167,18 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm)
 			       XFS_SCRUB_OFLAG_XCORRUPT);
 }
 
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+/* Decide if a repair is required. */
+static inline bool xchk_needs_repair(const struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+			       XFS_SCRUB_OFLAG_XCORRUPT |
+			       XFS_SCRUB_OFLAG_PREEN);
+}
+#else
+# define xchk_needs_repair(sc)		(false)
+#endif /* CONFIG_XFS_ONLINE_REPAIR */
+
 int xchk_metadata_inode_forks(struct xfs_scrub *sc);
 void xchk_stop_reaping(struct xfs_scrub *sc);
 void xchk_start_reaping(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index fe5faec4d1d2..cda7b55d77a5 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -534,15 +534,12 @@ xfs_scrub_metadata(
 
 	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
 	    !(sc->flags & XREP_ALREADY_FIXED)) {
-		bool needs_fix;
+		bool needs_fix = xchk_needs_repair(sc->sm);
 
 		/* Let debug users force us into the repair routines. */
 		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
-			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+			needs_fix = true;
 
-		needs_fix = (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
-						 XFS_SCRUB_OFLAG_XCORRUPT |
-						 XFS_SCRUB_OFLAG_PREEN));
 		/*
 		 * If userspace asked for a repair but it wasn't necessary,
 		 * report that back to userspace.


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] xfs: allow userspace to rebuild metadata structures
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: force rebuilding of metadata Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/2] xfs: don't complain about unfixed metadata when repairs were injected Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new (superuser-only) flag to the online metadata repair ioctl to
force it to rebuild structures, even if they're not broken.  We will use
this to move metadata structures out of the way during a free space
defragmentation operation.
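
(For illustration, a userspace caller might exercise the new flag like
this once it reaches the installed xfsprogs headers.  The <xfs/xfs.h>
include path, the bnobt target, and the helper name are just examples;
the ioctl requires CAP_SYS_ADMIN.)

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

/* Ask the kernel to rebuild the bnobt in one AG, corrupt or not. */
static int force_rebuild_bnobt(const char *mntpt, unsigned int agno)
{
	struct xfs_scrub_metadata	sm;
	int				fd, ret;

	memset(&sm, 0, sizeof(sm));
	sm.sm_type = XFS_SCRUB_TYPE_BNOBT;
	sm.sm_agno = agno;
	sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR | XFS_SCRUB_IFLAG_FORCE_REBUILD;

	fd = open(mntpt, O_RDONLY);
	if (fd < 0)
		return -errno;

	ret = ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
	if (ret < 0)
		ret = -errno;
	else if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
		fprintf(stderr, "rebuild failed, still corrupt\n");

	close(fd);
	return ret;
}

int main(int argc, char *argv[])
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
		return 1;
	}
	return force_rebuild_bnobt(argv[1], 0) ? 1 : 0;
}

Note that the kernel rejects FORCE_REBUILD without IFLAG_REPAIR (see the
xchk_validate_inputs hunk below), so the two flags always travel
together.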

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |    6 +++++-
 fs/xfs/scrub/scrub.c   |   11 ++++++++++-
 fs/xfs/scrub/trace.h   |    3 ++-
 3 files changed, 17 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 1cfd5bc6520a..920fd4513fcb 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -741,7 +741,11 @@ struct xfs_scrub_metadata {
  */
 #define XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED (1u << 7)
 
-#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_IFLAG_REPAIR)
+/* i: Rebuild the data structure. */
+#define XFS_SCRUB_IFLAG_FORCE_REBUILD	(1 << 31)
+
+#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_IFLAG_REPAIR | \
+				 XFS_SCRUB_IFLAG_FORCE_REBUILD)
 #define XFS_SCRUB_FLAGS_OUT	(XFS_SCRUB_OFLAG_CORRUPT | \
 				 XFS_SCRUB_OFLAG_PREEN | \
 				 XFS_SCRUB_OFLAG_XFAIL | \
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index cda7b55d77a5..60975d050b82 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -412,6 +412,11 @@ xchk_validate_inputs(
 		goto out;
 	}
 
+	/* No rebuild without repair. */
+	if ((sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD) &&
+	    !(sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR))
+		return -EINVAL;
+
 	/*
 	 * We only want to repair read-write v5+ filesystems.  Defer the check
 	 * for ops->repair until after our scrub confirms that we need to
@@ -536,8 +541,12 @@ xfs_scrub_metadata(
 	    !(sc->flags & XREP_ALREADY_FIXED)) {
 		bool needs_fix = xchk_needs_repair(sc->sm);
 
+		/* Userspace asked us to rebuild the structure regardless. */
+		if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD)
+			needs_fix = true;
+
 		/* Let debug users force us into the repair routines. */
-		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
+		if (XFS_TEST_ERROR(needs_fix, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
 			needs_fix = true;
 
 		/*
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 8d9a5e8c59e2..788a02aee689 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -98,7 +98,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 	{ XFS_SCRUB_OFLAG_XCORRUPT,		"xcorrupt" }, \
 	{ XFS_SCRUB_OFLAG_INCOMPLETE,		"incomplete" }, \
 	{ XFS_SCRUB_OFLAG_WARNING,		"warning" }, \
-	{ XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED,	"norepair" }
+	{ XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED,	"norepair" }, \
+	{ XFS_SCRUB_IFLAG_FORCE_REBUILD,	"rebuild" }
 
 #define XFS_SCRUB_STATE_STRINGS \
 	{ XCHK_TRY_HARDER,			"try_harder" }, \



* [PATCHSET v24.0 0/5] xfs: online repair of AG btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (5 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: force rebuilding of metadata Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/5] xfs: clear pagf_agflreset when repairing the AGFL Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
                   ` (70 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Now that we've spent a lot of time reworking common code in online fsck,
we're ready to start rebuilding the AG space btrees.  This series
implements repair functions for the free space, inode, and refcount
btrees.  Rebuilding the reverse mapping btree is much more intense and
is left for a subsequent patchset.  The fstests counterpart of this
patchset implements stress testing of repair.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-ag-btrees
---
 fs/xfs/Makefile                    |    3 
 fs/xfs/libxfs/xfs_ag.h             |   10 
 fs/xfs/libxfs/xfs_ag_resv.c        |    2 
 fs/xfs/libxfs/xfs_alloc.c          |   18 -
 fs/xfs/libxfs/xfs_alloc.h          |    2 
 fs/xfs/libxfs/xfs_alloc_btree.c    |   13 -
 fs/xfs/libxfs/xfs_btree.c          |   26 +
 fs/xfs/libxfs/xfs_btree.h          |    2 
 fs/xfs/libxfs/xfs_ialloc.c         |   41 +-
 fs/xfs/libxfs/xfs_ialloc.h         |    3 
 fs/xfs/libxfs/xfs_refcount.c       |   18 -
 fs/xfs/libxfs/xfs_refcount.h       |    2 
 fs/xfs/libxfs/xfs_refcount_btree.c |   13 -
 fs/xfs/libxfs/xfs_types.h          |    7 
 fs/xfs/scrub/agheader_repair.c     |    4 
 fs/xfs/scrub/alloc.c               |   14 -
 fs/xfs/scrub/alloc_repair.c        |  912 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.c              |    1 
 fs/xfs/scrub/common.h              |   13 +
 fs/xfs/scrub/ialloc_repair.c       |  873 ++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/newbt.c               |   13 +
 fs/xfs/scrub/newbt.h               |    4 
 fs/xfs/scrub/reap.c                |   17 +
 fs/xfs/scrub/refcount_repair.c     |  791 +++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c              |  128 +++++
 fs/xfs/scrub/repair.h              |   43 ++
 fs/xfs/scrub/scrub.c               |   22 +
 fs/xfs/scrub/scrub.h               |    9 
 fs/xfs/scrub/trace.h               |  112 +++-
 fs/xfs/scrub/xfarray.h             |   22 +
 fs/xfs/xfs_extent_busy.c           |   13 +
 fs/xfs/xfs_extent_busy.h           |    2 
 fs/xfs/xfs_icache.c                |  122 ++++-
 fs/xfs/xfs_trace.h                 |   22 +
 34 files changed, 3193 insertions(+), 104 deletions(-)
 create mode 100644 fs/xfs/scrub/alloc_repair.c
 create mode 100644 fs/xfs/scrub/ialloc_repair.c
 create mode 100644 fs/xfs/scrub/refcount_repair.c



* [PATCH 1/5] xfs: clear pagf_agflreset when repairing the AGFL
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/5] xfs: repair free space btrees Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Clear the pagf_agflreset flag when we're repairing the AGFL because we
fix all the same padding problems that xfs_agfl_reset does.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 377a7a4bda5c..daeb88cf5825 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -620,8 +620,10 @@ xrep_agfl_update_agf(
 	xfs_force_summary_recalc(sc->mp);
 
 	/* Update the AGF counters. */
-	if (sc->sa.pag->pagf_init)
+	if (sc->sa.pag->pagf_init) {
 		sc->sa.pag->pagf_flcount = flcount;
+		sc->sa.pag->pagf_agflreset = false;
+	}
 	agf->agf_flfirst = cpu_to_be32(0);
 	agf->agf_flcount = cpu_to_be32(flcount);
 	agf->agf_fllast = cpu_to_be32(flcount - 1);



* [PATCH 2/5] xfs: repair free space btrees
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/5] xfs: clear pagf_agflreset when repairing the AGFL Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Rebuild the free space btrees from the gaps in the rmap btree.
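
The core idea, stated in isolation: once the reverse mappings are sorted by
start block, any block not covered by a mapping must be free space, and
overlapping records (possible with reflink) are handled by only ever moving
the "next expected block" cursor forward.  The standalone sketch below is
purely illustrative: the struct and function names are invented, and it
does not use the kernel types or the xfarray/xagb_bitmap machinery that the
real code relies on.

	struct ext { unsigned int start; unsigned int len; };

	/*
	 * Emit the gaps between sorted, possibly overlapping mappings as
	 * free extents, including the tail gap up to ag_len blocks.
	 * Returns the number of free extents written to out[].
	 */
	static unsigned int
	emit_free_gaps(const struct ext *rmaps, unsigned int nr,
		       unsigned int ag_len, struct ext *out)
	{
		unsigned int	next = 0;	/* next block we expect to be mapped */
		unsigned int	nfree = 0;
		unsigned int	i;

		for (i = 0; i < nr; i++) {
			if (rmaps[i].start > next) {
				out[nfree].start = next;
				out[nfree].len = rmaps[i].start - next;
				nfree++;
			}
			if (rmaps[i].start + rmaps[i].len > next)
				next = rmaps[i].start + rmaps[i].len;
		}
		if (next < ag_len) {
			out[nfree].start = next;
			out[nfree].len = ag_len - next;
			nfree++;
		}
		return nfree;
	}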

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/libxfs/xfs_ag.h          |    9 
 fs/xfs/libxfs/xfs_ag_resv.c     |    2 
 fs/xfs/libxfs/xfs_alloc.c       |   18 +
 fs/xfs/libxfs/xfs_alloc.h       |    2 
 fs/xfs/libxfs/xfs_alloc_btree.c |   13 +
 fs/xfs/libxfs/xfs_types.h       |    7 
 fs/xfs/scrub/alloc.c            |   14 +
 fs/xfs/scrub/alloc_repair.c     |  912 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h           |   13 +
 fs/xfs/scrub/newbt.c            |   13 +
 fs/xfs/scrub/newbt.h            |    4 
 fs/xfs/scrub/reap.c             |   17 +
 fs/xfs/scrub/repair.c           |   69 +++
 fs/xfs/scrub/repair.h           |   24 +
 fs/xfs/scrub/scrub.c            |   14 -
 fs/xfs/scrub/scrub.h            |    8 
 fs/xfs/scrub/trace.h            |   24 +
 fs/xfs/scrub/xfarray.h          |   22 +
 fs/xfs/xfs_extent_busy.c        |   13 +
 fs/xfs/xfs_extent_busy.h        |    2 
 21 files changed, 1184 insertions(+), 17 deletions(-)
 create mode 100644 fs/xfs/scrub/alloc_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e82d937a4513..b0ba3eec7068 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -177,6 +177,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   alloc_repair.o \
 				   newbt.o \
 				   reap.o \
 				   repair.o \
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 5b4b8658685f..bb87f6677495 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -82,6 +82,15 @@ struct xfs_perag {
 	 */
 	uint16_t	pag_checked;
 	uint16_t	pag_sick;
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+	/*
+	 * Alternate btree heights so that online repair won't trip the write
+	 * verifiers while rebuilding the AG btrees.
+	 */
+	uint8_t		pagf_alt_levels[XFS_BTNUM_AGF];
+#endif
+
 	spinlock_t	pag_state_lock;
 
 	spinlock_t	pagb_lock;	/* lock for pagb_tree */
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 5af123d13a63..2e6128a25635 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -411,6 +411,8 @@ xfs_ag_resv_free_extent(
 		fallthrough;
 	case XFS_AG_RESV_NONE:
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
+		fallthrough;
+	case XFS_AG_RESV_IGNORE:
 		return;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e0ddae7a62ec..62136ecaa071 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -247,14 +247,11 @@ xfs_alloc_btrec_to_irec(
 	irec->ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount);
 }
 
-/* Simple checks for free space records. */
-xfs_failaddr_t
-xfs_alloc_check_irec(
-	struct xfs_btree_cur		*cur,
+inline xfs_failaddr_t
+xfs_alloc_check_perag_irec(
+	struct xfs_perag		*pag,
 	const struct xfs_alloc_rec_incore *irec)
 {
-	struct xfs_perag		*pag = cur->bc_ag.pag;
-
 	if (irec->ar_blockcount == 0)
 		return __this_address;
 
@@ -265,6 +262,15 @@ xfs_alloc_check_irec(
 	return NULL;
 }
 
+/* Simple checks for free space records. */
+xfs_failaddr_t
+xfs_alloc_check_irec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_alloc_rec_incore *irec)
+{
+	return xfs_alloc_check_perag_irec(cur->bc_ag.pag, irec);
+}
+
 static inline int
 xfs_alloc_complain_bad_rec(
 	struct xfs_btree_cur		*cur,
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 6d17f8d36a37..5b05c8bfa60a 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -173,6 +173,8 @@ xfs_alloc_get_rec(
 union xfs_btree_rec;
 void xfs_alloc_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_alloc_rec_incore *irec);
+xfs_failaddr_t xfs_alloc_check_perag_irec(struct xfs_perag *pag,
+		const struct xfs_alloc_rec_incore *irec);
 xfs_failaddr_t xfs_alloc_check_irec(struct xfs_btree_cur *cur,
 		const struct xfs_alloc_rec_incore *irec);
 
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index a91e2a81ba2c..6162223ce18d 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -323,7 +323,18 @@ xfs_allocbt_verify(
 	if (bp->b_ops->magic[0] == cpu_to_be32(XFS_ABTC_MAGIC))
 		btnum = XFS_BTNUM_CNTi;
 	if (pag && pag->pagf_init) {
-		if (level >= pag->pagf_levels[btnum])
+		unsigned int	maxlevel = pag->pagf_levels[btnum];
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+		/*
+		 * Online repair could be rewriting the free space btrees, so
+		 * we'll validate against the larger of either tree while this
+		 * is going on.
+		 */
+		maxlevel = max_t(unsigned int, maxlevel,
+				pag->pagf_alt_levels[btnum]);
+#endif
+		if (level >= maxlevel)
 			return __this_address;
 	} else if (level >= mp->m_alloc_maxlevels)
 		return __this_address;
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 851220021484..c2868e8b6a1e 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -202,6 +202,13 @@ enum xfs_ag_resv_type {
 	XFS_AG_RESV_AGFL,
 	XFS_AG_RESV_METADATA,
 	XFS_AG_RESV_RMAPBT,
+
+	/*
+	 * Don't increase fdblocks when freeing extent.  This is a pony for
+	 * the bnobt repair functions to re-free the free space without
+	 * altering fdblocks.  If you think you need this you're wrong.
+	 */
+	XFS_AG_RESV_IGNORE,
 };
 
 /* Results of scanning a btree keyspace to check occupancy. */
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index 94f4b836c48d..e16a48486c6d 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -12,10 +12,11 @@
 #include "xfs_btree.h"
 #include "xfs_alloc.h"
 #include "xfs_rmap.h"
+#include "xfs_ag.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
-#include "xfs_ag.h"
+#include "scrub/repair.h"
 
 /*
  * Set us up to scrub free space btrees.
@@ -24,10 +25,19 @@ int
 xchk_setup_ag_allocbt(
 	struct xfs_scrub	*sc)
 {
+	int			error;
+
 	if (xchk_need_fshook_drain(sc))
 		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
 
-	return xchk_setup_ag_btree(sc, false);
+	error = xchk_setup_ag_btree(sc, false);
+	if (error)
+		return error;
+
+	if (xchk_could_repair(sc))
+		return xrep_setup_ag_allocbt(sc);
+
+	return 0;
 }
 
 /* Free space btree scrubber. */
diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
new file mode 100644
index 000000000000..1e06ffe26029
--- /dev/null
+++ b/fs/xfs/scrub/alloc_repair.c
@@ -0,0 +1,912 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_btree_staging.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_refcount.h"
+#include "xfs_extent_busy.h"
+#include "xfs_health.h"
+#include "xfs_bmap.h"
+#include "xfs_ialloc.h"
+#include "xfs_ag.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/newbt.h"
+#include "scrub/reap.h"
+
+/*
+ * Free Space Btree Repair
+ * =======================
+ *
+ * The reverse mappings are supposed to record all space usage for the entire
+ * AG.  Therefore, we can recalculate the free extents in an AG by looking for
+ * gaps in the physical extents recorded in the rmapbt.  On a reflink
+ * filesystem this is a little more tricky in that we have to be aware that
+ * the rmap records are allowed to overlap.
+ *
+ * We derive which blocks belonged to the old bnobt/cntbt by recording all the
+ * OWN_AG extents and subtracting out the blocks owned by all other OWN_AG
+ * metadata: the rmapbt blocks visited while iterating the reverse mappings
+ * and the AGFL blocks.
+ *
+ * Once we have both of those pieces, we can reconstruct the bnobt and cntbt
+ * by blowing out the free block state and freeing all the extents that we
+ * found.  This adds the requirement that we can't have any busy extents in
+ * the AG because the busy code cannot handle duplicate records.
+ *
+ * Note that we can only rebuild both free space btrees at the same time
+ * because the regular extent freeing infrastructure loads both btrees at the
+ * same time.
+ *
+ * We use the prefix 'xrep_abt' here because we regenerate both free space
+ * allocation btrees at the same time.
+ */
+
+struct xrep_abt {
+	/* Blocks owned by the rmapbt or the agfl. */
+	struct xagb_bitmap	not_allocbt_blocks;
+
+	/* All OWN_AG blocks. */
+	struct xagb_bitmap	old_allocbt_blocks;
+
+	/*
+	 * New bnobt information.  All btree block reservations are added to
+	 * the reservation list in new_bnobt.
+	 */
+	struct xrep_newbt	new_bnobt;
+
+	/* new cntbt information */
+	struct xrep_newbt	new_cntbt;
+
+	/* Free space extents. */
+	struct xfarray		*free_records;
+
+	struct xfs_scrub	*sc;
+
+	/* Number of non-null records in @free_records. */
+	uint64_t		nr_real_records;
+
+	/* get_records()'s position in the free space record array. */
+	xfarray_idx_t		array_cur;
+
+	/*
+	 * Next block we anticipate seeing in the rmap records.  If the next
+	 * rmap record is greater than next_agbno, we have found unused space.
+	 */
+	xfs_agblock_t		next_agbno;
+
+	/* Number of free blocks in this AG. */
+	xfs_agblock_t		nr_blocks;
+
+	/* Longest free extent we found in the AG. */
+	xfs_agblock_t		longest;
+};
+
+/* Set up to repair AG free space btrees. */
+int
+xrep_setup_ag_allocbt(
+	struct xfs_scrub	*sc)
+{
+	unsigned int		busy_gen;
+
+	/*
+	 * Make sure the busy extent list is clear because we can't put extents
+	 * on there twice.
+	 */
+	busy_gen = READ_ONCE(sc->sa.pag->pagb_gen);
+	if (!xfs_extent_busy_list_empty(sc->sa.pag))
+		xfs_extent_busy_flush(sc->mp, sc->sa.pag, busy_gen);
+
+	return 0;
+}
+
+/* Check for any obvious conflicts in the free extent. */
+STATIC int
+xrep_abt_check_free_ext(
+	struct xfs_scrub	*sc,
+	const struct xfs_alloc_rec_incore *rec)
+{
+	enum xbtree_recpacking	outcome;
+	int			error;
+
+	if (xfs_alloc_check_perag_irec(sc->sa.pag, rec) != NULL)
+		return -EFSCORRUPTED;
+
+	/* Must not be an inode chunk. */
+	error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur,
+			rec->ar_startblock, rec->ar_blockcount, &outcome);
+	if (error)
+		return error;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		return -EFSCORRUPTED;
+
+	/* Must not be shared or CoW staging. */
+	if (sc->sa.refc_cur) {
+		error = xfs_refcount_has_records(sc->sa.refc_cur,
+				XFS_REFC_DOMAIN_SHARED, rec->ar_startblock,
+				rec->ar_blockcount, &outcome);
+		if (error)
+			return error;
+		if (outcome != XBTREE_RECPACKING_EMPTY)
+			return -EFSCORRUPTED;
+
+		error = xfs_refcount_has_records(sc->sa.refc_cur,
+				XFS_REFC_DOMAIN_COW, rec->ar_startblock,
+				rec->ar_blockcount, &outcome);
+		if (error)
+			return error;
+		if (outcome != XBTREE_RECPACKING_EMPTY)
+			return -EFSCORRUPTED;
+	}
+
+	return 0;
+}
+
+/*
+ * Stash a free space record for all the space since the last bno we found
+ * all the way up to @end.
+ */
+static int
+xrep_abt_stash(
+	struct xrep_abt		*ra,
+	xfs_agblock_t		end)
+{
+	struct xfs_alloc_rec_incore arec = {
+		.ar_startblock	= ra->next_agbno,
+		.ar_blockcount	= end - ra->next_agbno,
+	};
+	struct xfs_scrub	*sc = ra->sc;
+	int			error = 0;
+
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	error = xrep_abt_check_free_ext(ra->sc, &arec);
+	if (error)
+		return error;
+
+	trace_xrep_abt_found(sc->mp, sc->sa.pag->pag_agno, &arec);
+
+	error = xfarray_append(ra->free_records, &arec);
+	if (error)
+		return error;
+
+	ra->nr_blocks += arec.ar_blockcount;
+	return 0;
+}
+
+/* Record extents that aren't in use from gaps in the rmap records. */
+STATIC int
+xrep_abt_walk_rmap(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_abt			*ra = priv;
+	int				error;
+
+	/* Record all the OWN_AG blocks... */
+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
+		error = xagb_bitmap_set(&ra->old_allocbt_blocks,
+				rec->rm_startblock, rec->rm_blockcount);
+		if (error)
+			return error;
+	}
+
+	/* ...and all the rmapbt blocks... */
+	error = xagb_bitmap_set_btcur_path(&ra->not_allocbt_blocks, cur);
+	if (error)
+		return error;
+
+	/* ...and all the free space. */
+	if (rec->rm_startblock > ra->next_agbno) {
+		error = xrep_abt_stash(ra, rec->rm_startblock);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * rmap records can overlap on reflink filesystems, so project
+	 * next_agbno as far out into the AG space as we currently know about.
+	 */
+	ra->next_agbno = max_t(xfs_agblock_t, ra->next_agbno,
+			rec->rm_startblock + rec->rm_blockcount);
+	return 0;
+}
+
+/* Collect an AGFL block for the not-to-release list. */
+static int
+xrep_abt_walk_agfl(
+	struct xfs_mount	*mp,
+	xfs_agblock_t		agbno,
+	void			*priv)
+{
+	struct xrep_abt		*ra = priv;
+
+	return xagb_bitmap_set(&ra->not_allocbt_blocks, agbno, 1);
+}
+
+/*
+ * Compare two free space extents by block number.  We want to sort in order of
+ * increasing block number.
+ */
+static int
+xrep_bnobt_extent_cmp(
+	const void		*a,
+	const void		*b)
+{
+	const struct xfs_alloc_rec_incore *ap = a;
+	const struct xfs_alloc_rec_incore *bp = b;
+
+	if (ap->ar_startblock > bp->ar_startblock)
+		return 1;
+	else if (ap->ar_startblock < bp->ar_startblock)
+		return -1;
+	return 0;
+}
+
+/*
+ * Re-sort the free extents by block number so that we can put the records
+ * into the bnobt in the correct order.  Make sure the records do not overlap
+ * in physical space.
+ */
+STATIC int
+xrep_bnobt_sort_records(
+	struct xrep_abt			*ra)
+{
+	struct xfs_alloc_rec_incore	arec;
+	xfarray_idx_t			cur = XFARRAY_CURSOR_INIT;
+	xfs_agblock_t			next_agbno = 0;
+	int				error;
+
+	error = xfarray_sort(ra->free_records, xrep_bnobt_extent_cmp, 0);
+	if (error)
+		return error;
+
+	while ((error = xfarray_iter(ra->free_records, &cur, &arec)) == 1) {
+		if (arec.ar_startblock < next_agbno)
+			return -EFSCORRUPTED;
+
+		next_agbno = arec.ar_startblock + arec.ar_blockcount;
+	}
+
+	return error;
+}
+
+/*
+ * Compare two free space extents by length and then block number.  We want
+ * to sort first in order of increasing length and then in order of increasing
+ * block number.
+ */
+static int
+xrep_cntbt_extent_cmp(
+	const void			*a,
+	const void			*b)
+{
+	const struct xfs_alloc_rec_incore *ap = a;
+	const struct xfs_alloc_rec_incore *bp = b;
+
+	if (ap->ar_blockcount > bp->ar_blockcount)
+		return 1;
+	else if (ap->ar_blockcount < bp->ar_blockcount)
+		return -1;
+	return xrep_bnobt_extent_cmp(a, b);
+}
+
+/*
+ * Sort the free extents by length so that we can put the records into the
+ * cntbt in the correct order.  Don't let userspace kill us if we're resorting
+ * after allocating btree blocks.
+ */
+STATIC int
+xrep_cntbt_sort_records(
+	struct xrep_abt			*ra,
+	bool				is_resort)
+{
+	return xfarray_sort(ra->free_records, xrep_cntbt_extent_cmp,
+			is_resort ? 0 : XFARRAY_SORT_KILLABLE);
+}
+
+/*
+ * Iterate all reverse mappings to find (1) the gaps between rmap records (all
+ * unowned space), (2) the OWN_AG extents (which encompass the free space
+ * btrees, the rmapbt, and the agfl), (3) the rmapbt blocks, and (4) the AGFL
+ * blocks.  The free space is (1) + (2) - (3) - (4).
+ */
+STATIC int
+xrep_abt_find_freespace(
+	struct xrep_abt		*ra)
+{
+	struct xfs_scrub	*sc = ra->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
+	struct xfs_buf		*agfl_bp;
+	xfs_agblock_t		agend;
+	int			error;
+
+	xagb_bitmap_init(&ra->not_allocbt_blocks);
+
+	xrep_ag_btcur_init(sc, &sc->sa);
+
+	/*
+	 * Iterate all the reverse mappings to find gaps in the physical
+	 * mappings, all the OWN_AG blocks, and all the rmapbt extents.
+	 */
+	error = xfs_rmap_query_all(sc->sa.rmap_cur, xrep_abt_walk_rmap, ra);
+	if (error)
+		goto err;
+
+	/* Insert a record for space between the last rmap and EOAG. */
+	agend = be32_to_cpu(agf->agf_length);
+	if (ra->next_agbno < agend) {
+		error = xrep_abt_stash(ra, agend);
+		if (error)
+			goto err;
+	}
+
+	/* Collect all the AGFL blocks. */
+	error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp);
+	if (error)
+		goto err;
+
+	error = xfs_agfl_walk(mp, agf, agfl_bp, xrep_abt_walk_agfl, ra);
+	if (error)
+		goto err_agfl;
+
+	/* Compute the old bnobt/cntbt blocks. */
+	error = xagb_bitmap_disunion(&ra->old_allocbt_blocks,
+			&ra->not_allocbt_blocks);
+	if (error)
+		goto err_agfl;
+
+	ra->nr_real_records = xfarray_length(ra->free_records);
+err_agfl:
+	xfs_trans_brelse(sc->tp, agfl_bp);
+err:
+	xchk_ag_btcur_free(&sc->sa);
+	xagb_bitmap_destroy(&ra->not_allocbt_blocks);
+	return error;
+}
+
+/*
+ * We're going to use the observed free space records to reserve blocks for the
+ * new free space btrees, so we play an iterative game where we try to converge
+ * on the number of blocks we need:
+ *
+ * 1. Estimate how many blocks we'll need to store the records.
+ * 2. If the first free record has more blocks than we need, we're done.
+ *    We will have to re-sort the records prior to building the cntbt.
+ * 3. If that record has exactly the number of blocks we need, null out the
+ *    record.  We're done.
+ * 4. Otherwise, we still need more blocks.  Null out the record, subtract its
+ *    length from the number of blocks we need, and go back to step 1.
+ *
+ * Fortunately, we don't have to do any transaction work to play this game, so
+ * we don't have to tear down the staging cursors.
+ */
+STATIC int
+xrep_abt_reserve_space(
+	struct xrep_abt		*ra,
+	struct xfs_btree_cur	*bno_cur,
+	struct xfs_btree_cur	*cnt_cur,
+	bool			*needs_resort)
+{
+	struct xfs_scrub	*sc = ra->sc;
+	xfarray_idx_t		record_nr;
+	unsigned int		allocated = 0;
+	int			error = 0;
+
+	record_nr = xfarray_length(ra->free_records) - 1;
+	do {
+		struct xfs_alloc_rec_incore arec;
+		xfs_fsblock_t		fsbno;
+		uint64_t		required;
+		unsigned int		desired;
+		unsigned int		len;
+
+		/* Compute how many blocks we'll need. */
+		error = xfs_btree_bload_compute_geometry(cnt_cur,
+				&ra->new_cntbt.bload, ra->nr_real_records);
+		if (error)
+			break;
+
+		error = xfs_btree_bload_compute_geometry(bno_cur,
+				&ra->new_bnobt.bload, ra->nr_real_records);
+		if (error)
+			break;
+
+		/* How many btree blocks do we need to store all records? */
+		required = ra->new_bnobt.bload.nr_blocks +
+			   ra->new_cntbt.bload.nr_blocks;
+		ASSERT(required < INT_MAX);
+
+		/* If we've reserved enough blocks, we're done. */
+		if (allocated >= required)
+			break;
+
+		desired = required - allocated;
+
+		/* We need space but there's none left; bye! */
+		if (ra->nr_real_records == 0) {
+			error = -ENOSPC;
+			break;
+		}
+
+		/* Grab the first record from the list. */
+		error = xfarray_load(ra->free_records, record_nr, &arec);
+		if (error)
+			break;
+
+		ASSERT(arec.ar_blockcount <= UINT_MAX);
+		len = min_t(unsigned int, arec.ar_blockcount, desired);
+		fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno,
+				arec.ar_startblock);
+
+		trace_xrep_newbt_alloc_blocks(sc->mp, sc->sa.pag->pag_agno,
+				arec.ar_startblock, len, XFS_RMAP_OWN_AG);
+
+		error = xrep_newbt_add_blocks(&ra->new_bnobt, fsbno, len);
+		if (error)
+			break;
+		allocated += len;
+		ra->nr_blocks -= len;
+
+		if (arec.ar_blockcount > desired) {
+			/*
+			 * Record has more space than we need.  The number of
+			 * free records doesn't change, so shrink the free
+			 * record, inform the caller that the records are no
+			 * longer sorted by length, and exit.
+			 */
+			arec.ar_startblock += desired;
+			arec.ar_blockcount -= desired;
+			error = xfarray_store(ra->free_records, record_nr,
+					&arec);
+			if (error)
+				break;
+
+			*needs_resort = true;
+			return 0;
+		}
+
+		/*
+		 * We're going to use up the entire record, so unset it and
+		 * move on to the next one.  This changes the number of free
+		 * records (but doesn't break the sorting order), so we must
+		 * go around the loop once more to re-estimate the btree geometry.
+		 */
+		error = xfarray_unset(ra->free_records, record_nr);
+		if (error)
+			break;
+		ra->nr_real_records--;
+		record_nr--;
+	} while (1);
+
+	return error;
+}
+
+STATIC int
+xrep_abt_dispose_one(
+	struct xrep_abt		*ra,
+	struct xrep_newbt_resv	*resv)
+{
+	struct xfs_scrub	*sc = ra->sc;
+	struct xfs_perag	*pag = sc->sa.pag;
+	xfs_agblock_t		free_agbno = resv->agbno + resv->used;
+	xfs_extlen_t		free_aglen = resv->len - resv->used;
+	int			error;
+
+	ASSERT(pag == resv->pag);
+
+	/* Add a deferred rmap for each extent we used. */
+	if (resv->used > 0)
+		xfs_rmap_alloc_extent(sc->tp, pag->pag_agno, resv->agbno,
+				resv->used, XFS_RMAP_OWN_AG);
+
+	/*
+	 * For each reserved btree block we didn't use, add it to the free
+	 * space btree.  We didn't touch fdblocks when we reserved them, so
+	 * we don't touch it now.
+	 */
+	if (free_aglen == 0)
+		return 0;
+
+	trace_xrep_newbt_free_blocks(sc->mp, resv->pag->pag_agno, free_agbno,
+			free_aglen, ra->new_bnobt.oinfo.oi_owner);
+
+	error = __xfs_free_extent(sc->tp, resv->pag, free_agbno, free_aglen,
+			&ra->new_bnobt.oinfo, XFS_AG_RESV_IGNORE, true);
+	if (error)
+		return error;
+
+	return xrep_defer_finish(sc);
+}
+
+/*
+ * Deal with all the space we reserved.  Blocks that were allocated for the
+ * free space btrees need to have a (deferred) rmap added for the OWN_AG
+ * allocation, and blocks that didn't get used can be freed via the usual
+ * (deferred) means.
+ */
+STATIC void
+xrep_abt_dispose_reservations(
+	struct xrep_abt		*ra,
+	int			error)
+{
+	struct xrep_newbt_resv	*resv, *n;
+
+	if (error)
+		goto junkit;
+
+	for_each_xrep_newbt_reservation(&ra->new_bnobt, resv, n) {
+		error = xrep_abt_dispose_one(ra, resv);
+		if (error)
+			goto junkit;
+	}
+
+junkit:
+	for_each_xrep_newbt_reservation(&ra->new_bnobt, resv, n) {
+		xfs_perag_put(resv->pag);
+		list_del(&resv->list);
+		kfree(resv);
+	}
+
+	xrep_newbt_cancel(&ra->new_bnobt);
+	xrep_newbt_cancel(&ra->new_cntbt);
+}
+
+/* Retrieve free space data for bulk load. */
+STATIC int
+xrep_abt_get_records(
+	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
+	void				*priv)
+{
+	struct xfs_alloc_rec_incore	*arec = &cur->bc_rec.a;
+	struct xrep_abt			*ra = priv;
+	union xfs_btree_rec		*block_rec;
+	unsigned int			loaded;
+	int				error;
+
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		error = xfarray_load_next(ra->free_records, &ra->array_cur,
+				arec);
+		if (error)
+			return error;
+
+		ra->longest = max(ra->longest, arec->ar_blockcount);
+
+		block_rec = xfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
+}
+
+/* Feed one of the new btree blocks to the bulk loader. */
+STATIC int
+xrep_abt_claim_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	void			*priv)
+{
+	struct xrep_abt		*ra = priv;
+
+	return xrep_newbt_claim_block(cur, &ra->new_bnobt, ptr);
+}
+
+/*
+ * Reset the AGF counters to reflect the free space btrees that we just
+ * rebuilt, then reinitialize the per-AG data.
+ */
+STATIC int
+xrep_abt_reset_counters(
+	struct xrep_abt		*ra)
+{
+	struct xfs_scrub	*sc = ra->sc;
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
+	unsigned int		freesp_btreeblks = 0;
+
+	/*
+	 * Compute the contribution to agf_btreeblks for the new free space
+	 * btrees.  This is the computed btree size minus anything we didn't
+	 * use.
+	 */
+	freesp_btreeblks += ra->new_bnobt.bload.nr_blocks - 1;
+	freesp_btreeblks += ra->new_cntbt.bload.nr_blocks - 1;
+
+	freesp_btreeblks -= xrep_newbt_unused_blocks(&ra->new_bnobt);
+	freesp_btreeblks -= xrep_newbt_unused_blocks(&ra->new_cntbt);
+
+	/*
+	 * The AGF header contains extra information related to the free space
+	 * btrees, so we must update those fields here.
+	 */
+	agf->agf_btreeblks = cpu_to_be32(freesp_btreeblks +
+				(be32_to_cpu(agf->agf_rmap_blocks) - 1));
+	agf->agf_freeblks = cpu_to_be32(ra->nr_blocks);
+	agf->agf_longest = cpu_to_be32(ra->longest);
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, XFS_AGF_BTREEBLKS |
+						 XFS_AGF_LONGEST |
+						 XFS_AGF_FREEBLKS);
+
+	/*
+	 * After we commit the new btree to disk, it is possible that the
+	 * process to reap the old btree blocks will race with the AIL trying
+	 * to checkpoint the old btree blocks into the filesystem.  If the new
+	 * tree is shorter than the old one, the allocbt write verifier will
+	 * fail and the AIL will shut down the filesystem.
+	 *
+	 * To avoid this, save the old incore btree height values as the alt
+	 * height values before re-initializing the perag info from the updated
+	 * AGF to capture all the new values.
+	 */
+	pag->pagf_alt_levels[XFS_BTNUM_BNOi] = pag->pagf_levels[XFS_BTNUM_BNOi];
+	pag->pagf_alt_levels[XFS_BTNUM_CNTi] = pag->pagf_levels[XFS_BTNUM_CNTi];
+
+	/* Reinitialize with the values we just logged. */
+	return xrep_reinit_pagf(sc);
+}
+
+/*
+ * Use the collected free space information to stage new free space btrees.
+ * If this is successful we'll return with the new btree root
+ * information logged to the repair transaction but not yet committed.
+ */
+STATIC int
+xrep_abt_build_new_trees(
+	struct xrep_abt		*ra)
+{
+	struct xfs_scrub	*sc = ra->sc;
+	struct xfs_btree_cur	*bno_cur;
+	struct xfs_btree_cur	*cnt_cur;
+	struct xfs_perag	*pag = sc->sa.pag;
+	bool			needs_resort = false;
+	int			error;
+
+	/*
+	 * Sort the free extents by length so that we can set up the free space
+	 * btrees in as few extents as possible.  This reduces the amount of
+	 * deferred rmap / free work we have to do at the end.
+	 */
+	error = xrep_cntbt_sort_records(ra, false);
+	if (error)
+		return error;
+
+	/*
+	 * Prepare to construct the new btree by reserving disk space for the
+	 * new btree and setting up all the accounting information we'll need
+	 * to root the new btree while it's under construction and before we
+	 * attach it to the AG header.
+	 */
+	xrep_newbt_init_bare(&ra->new_bnobt, sc);
+	xrep_newbt_init_bare(&ra->new_cntbt, sc);
+
+	ra->new_bnobt.bload.get_records = xrep_abt_get_records;
+	ra->new_cntbt.bload.get_records = xrep_abt_get_records;
+
+	ra->new_bnobt.bload.claim_block = xrep_abt_claim_block;
+	ra->new_cntbt.bload.claim_block = xrep_abt_claim_block;
+
+	/* Allocate cursors for the staged btrees. */
+	bno_cur = xfs_allocbt_stage_cursor(sc->mp, &ra->new_bnobt.afake,
+			pag, XFS_BTNUM_BNO);
+	cnt_cur = xfs_allocbt_stage_cursor(sc->mp, &ra->new_cntbt.afake,
+			pag, XFS_BTNUM_CNT);
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		goto err_cur;
+
+	/* Reserve the space we'll need for the new btrees. */
+	error = xrep_abt_reserve_space(ra, bno_cur, cnt_cur, &needs_resort);
+	if (error)
+		goto err_cur;
+
+	/*
+	 * If we need to re-sort the free extents by length, do so so that we
+	 * can put the records into the cntbt in the correct order.
+	 */
+	if (needs_resort) {
+		error = xrep_cntbt_sort_records(ra, needs_resort);
+		if (error)
+			goto err_cur;
+	}
+
+	/*
+	 * Due to btree slack factors, it's possible for a new btree to be one
+	 * level taller than the old btree.  Update the alternate incore btree
+	 * height so that we don't trip the verifiers when writing the new
+	 * btree blocks to disk.
+	 */
+	pag->pagf_alt_levels[XFS_BTNUM_BNOi] =
+					ra->new_bnobt.bload.btree_height;
+	pag->pagf_alt_levels[XFS_BTNUM_CNTi] =
+					ra->new_cntbt.bload.btree_height;
+
+	/* Load the free space by length tree. */
+	ra->array_cur = XFARRAY_CURSOR_INIT;
+	ra->longest = 0;
+	error = xfs_btree_bload(cnt_cur, &ra->new_cntbt.bload, ra);
+	if (error)
+		goto err_levels;
+
+	error = xrep_bnobt_sort_records(ra);
+	if (error)
+		goto err_levels;
+
+	/* Load the free space by block number tree. */
+	ra->array_cur = XFARRAY_CURSOR_INIT;
+	error = xfs_btree_bload(bno_cur, &ra->new_bnobt.bload, ra);
+	if (error)
+		goto err_levels;
+
+	/*
+	 * Install the new btrees in the AG header.  After this point the old
+	 * btrees are no longer accessible and the new trees are live.
+	 */
+	xfs_allocbt_commit_staged_btree(bno_cur, sc->tp, sc->sa.agf_bp);
+	xfs_btree_del_cursor(bno_cur, 0);
+	xfs_allocbt_commit_staged_btree(cnt_cur, sc->tp, sc->sa.agf_bp);
+	xfs_btree_del_cursor(cnt_cur, 0);
+
+	/* Reset the AGF counters now that we've changed the btree shape. */
+	error = xrep_abt_reset_counters(ra);
+	if (error)
+		goto err_newbt;
+
+	/* Dispose of any unused blocks and the accounting information. */
+	xrep_abt_dispose_reservations(ra, error);
+
+	return xrep_roll_ag_trans(sc);
+
+err_levels:
+	pag->pagf_alt_levels[XFS_BTNUM_BNOi] = 0;
+	pag->pagf_alt_levels[XFS_BTNUM_CNTi] = 0;
+err_cur:
+	xfs_btree_del_cursor(cnt_cur, error);
+	xfs_btree_del_cursor(bno_cur, error);
+err_newbt:
+	xrep_abt_dispose_reservations(ra, error);
+	return error;
+}
+
+/*
+ * Now that we've logged the roots of the new btrees, invalidate all of the
+ * old blocks and free them.
+ */
+STATIC int
+xrep_abt_remove_old_trees(
+	struct xrep_abt		*ra)
+{
+	struct xfs_perag	*pag = ra->sc->sa.pag;
+	int			error;
+
+	/* Free the old btree blocks if they're not in use. */
+	error = xrep_reap_agblocks(ra->sc, &ra->old_allocbt_blocks,
+			&XFS_RMAP_OINFO_AG, XFS_AG_RESV_IGNORE);
+	if (error)
+		return error;
+
+	/*
+	 * Now that we've zapped all the old allocbt blocks we can turn off
+	 * the alternate height mechanism.
+	 */
+	pag->pagf_alt_levels[XFS_BTNUM_BNOi] = 0;
+	pag->pagf_alt_levels[XFS_BTNUM_CNTi] = 0;
+	return 0;
+}
+
+/* Repair the freespace btrees for some AG. */
+int
+xrep_allocbt(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_abt		*ra;
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_has_rmapbt(mp))
+		return -EOPNOTSUPP;
+
+	ra = kzalloc(sizeof(struct xrep_abt), XCHK_GFP_FLAGS);
+	if (!ra)
+		return -ENOMEM;
+	ra->sc = sc;
+
+	/* We rebuild both data structures. */
+	sc->sick_mask = XFS_SICK_AG_BNOBT | XFS_SICK_AG_CNTBT;
+
+	/*
+	 * Make sure the busy extent list is clear because we can't put extents
+	 * on there twice.  In theory we cleared this before we started, but
+	 * let's not risk the filesystem.
+	 */
+	if (!xfs_extent_busy_list_empty(sc->sa.pag)) {
+		error = -EDEADLOCK;
+		goto out_ra;
+	}
+
+	/* Set up enough storage to handle maximally fragmented free space. */
+	error = xfarray_create(mp, "free space extents",
+			mp->m_sb.sb_agblocks / 2,
+			sizeof(struct xfs_alloc_rec_incore),
+			&ra->free_records);
+	if (error)
+		goto out_ra;
+
+	/* Collect the free space data and find the old btree blocks. */
+	xagb_bitmap_init(&ra->old_allocbt_blocks);
+	error = xrep_abt_find_freespace(ra);
+	if (error)
+		goto out_bitmap;
+
+	/* Rebuild the free space information. */
+	error = xrep_abt_build_new_trees(ra);
+	if (error)
+		goto out_bitmap;
+
+	/* Kill the old trees. */
+	error = xrep_abt_remove_old_trees(ra);
+
+out_bitmap:
+	xagb_bitmap_destroy(&ra->old_allocbt_blocks);
+	xfarray_destroy(ra->free_records);
+out_ra:
+	kfree(ra);
+	return error;
+}
+
+/* Make sure both btrees are ok after we've rebuilt them. */
+int
+xrep_revalidate_allocbt(
+	struct xfs_scrub	*sc)
+{
+	__u32			old_type = sc->sm->sm_type;
+	int			error;
+
+	/*
+	 * We must update sm_type temporarily so that the tree-to-tree cross
+	 * reference checks will work in the correct direction, and also so
+	 * that tracing will report correctly if there are more errors.
+	 */
+	sc->sm->sm_type = XFS_SCRUB_TYPE_BNOBT;
+	error = xchk_bnobt(sc);
+	if (error)
+		goto out;
+
+	sc->sm->sm_type = XFS_SCRUB_TYPE_CNTBT;
+	error = xchk_cntbt(sc);
+out:
+	sc->sm->sm_type = old_type;
+	return error;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 92578c4aed13..7f2714092514 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -175,8 +175,21 @@ static inline bool xchk_needs_repair(const struct xfs_scrub_metadata *sm)
 			       XFS_SCRUB_OFLAG_XCORRUPT |
 			       XFS_SCRUB_OFLAG_PREEN);
 }
+
+/*
+ * "Should we prepare for a repair?"
+ *
+ * Return true if the caller permits us to repair metadata and we're not
+ * setting up for a post-repair evaluation.
+ */
+static inline bool xchk_could_repair(const struct xfs_scrub *sc)
+{
+	return (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
+		!(sc->flags & XREP_ALREADY_FIXED);
+}
 #else
 # define xchk_needs_repair(sc)		(false)
+# define xchk_could_repair(sc)		(false)
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
 int xchk_metadata_inode_forks(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
index 507137e7bc24..efa73f676ad4 100644
--- a/fs/xfs/scrub/newbt.c
+++ b/fs/xfs/scrub/newbt.c
@@ -565,3 +565,16 @@ xrep_newbt_claim_block(
 		ptr->s = cpu_to_be32(agbno);
 	return 0;
 }
+
+/* How many reserved blocks are unused? */
+unsigned int
+xrep_newbt_unused_blocks(
+	struct xrep_newbt	*xnr)
+{
+	struct xrep_newbt_resv	*resv;
+	unsigned int		unused = 0;
+
+	list_for_each_entry(resv, &xnr->resv_list, list)
+		unused += resv->len - resv->used;
+	return unused;
+}
diff --git a/fs/xfs/scrub/newbt.h b/fs/xfs/scrub/newbt.h
index a1eda5213b83..96ae0dc3bc41 100644
--- a/fs/xfs/scrub/newbt.h
+++ b/fs/xfs/scrub/newbt.h
@@ -50,6 +50,9 @@ struct xrep_newbt {
 	enum xfs_ag_resv_type	resv;
 };
 
+#define for_each_xrep_newbt_reservation(xnr, resv, n)	\
+	list_for_each_entry_safe((resv), (n), &(xnr)->resv_list, list)
+
 void xrep_newbt_init_bare(struct xrep_newbt *xnr, struct xfs_scrub *sc);
 void xrep_newbt_init_ag(struct xrep_newbt *xnr, struct xfs_scrub *sc,
 		const struct xfs_owner_info *oinfo, xfs_fsblock_t alloc_hint,
@@ -64,5 +67,6 @@ int xrep_newbt_commit(struct xrep_newbt *xnr);
 int xrep_newbt_claim_block(struct xfs_btree_cur *cur, struct xrep_newbt *xnr,
 		union xfs_btree_ptr *ptr);
 int xrep_newbt_relog_autoreap(struct xrep_newbt *xnr);
+unsigned int xrep_newbt_unused_blocks(struct xrep_newbt *xnr);
 
 #endif /* __XFS_SCRUB_NEWBT_H__ */
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 797536e6eba8..3ff4adaf3fe8 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -331,11 +331,23 @@ xreap_agextent(
 		return 0;
 	}
 
-	if (rs->resv == XFS_AG_RESV_AGFL) {
+	switch (rs->resv) {
+	case XFS_AG_RESV_AGFL:
 		ASSERT(*aglenp == 1);
 		error = xreap_put_freelist(sc, agbno);
 		rs->force_roll = true;
-	} else {
+		break;
+	case XFS_AG_RESV_IGNORE:
+		/*
+		 * bnobt/cntbt blocks are counted as free space, so we pass
+		 * XFS_AG_RESV_IGNORE when reaping the old free space btree
+		 * blocks to avoid changing fdblocks.
+		 */
+		error = __xfs_free_extent(sc->tp, sc->sa.pag, agbno, *aglenp,
+				rs->oinfo, rs->resv, true);
+		rs->force_roll = true;
+		break;
+	default:
 		/*
 		 * Use deferred frees to get rid of the old btree blocks to try
 		 * to minimize the window in which we could crash and lose the
@@ -343,6 +355,7 @@ xreap_agextent(
 		 */
 		__xfs_free_extent_later(sc->tp, fsbno, *aglenp, rs->oinfo, true);
 		rs->deferred++;
+		break;
 	}
 
 	return error;
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index b4117ff221aa..964b48fffc58 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -735,3 +735,72 @@ xrep_ino_dqattach(
 
 	return error;
 }
+
+/* Initialize all the btree cursors for an AG repair. */
+void
+xrep_ag_btcur_init(
+	struct xfs_scrub	*sc,
+	struct xchk_ag		*sa)
+{
+	struct xfs_mount	*mp = sc->mp;
+
+	/* Set up a bnobt cursor for cross-referencing. */
+	if (sc->sm->sm_type != XFS_SCRUB_TYPE_BNOBT &&
+	    sc->sm->sm_type != XFS_SCRUB_TYPE_CNTBT) {
+		sa->bno_cur = xfs_allocbt_init_cursor(mp, sc->tp, sa->agf_bp,
+				sc->sa.pag, XFS_BTNUM_BNO);
+		sa->cnt_cur = xfs_allocbt_init_cursor(mp, sc->tp, sa->agf_bp,
+				sc->sa.pag, XFS_BTNUM_CNT);
+	}
+
+	/* Set up an inobt cursor for cross-referencing. */
+	if (sc->sm->sm_type != XFS_SCRUB_TYPE_INOBT &&
+	    sc->sm->sm_type != XFS_SCRUB_TYPE_FINOBT) {
+		sa->ino_cur = xfs_inobt_init_cursor(mp, sc->tp, sa->agi_bp,
+				sc->sa.pag, XFS_BTNUM_INO);
+		if (xfs_has_finobt(mp))
+			sa->fino_cur = xfs_inobt_init_cursor(mp, sc->tp,
+					sa->agi_bp, sc->sa.pag, XFS_BTNUM_FINO);
+	}
+
+	/* Set up a rmapbt cursor for cross-referencing. */
+	if (sc->sm->sm_type != XFS_SCRUB_TYPE_RMAPBT &&
+	    xfs_has_rmapbt(mp))
+		sa->rmap_cur = xfs_rmapbt_init_cursor(mp, sc->tp, sa->agf_bp,
+				sc->sa.pag);
+
+	/* Set up a refcountbt cursor for cross-referencing. */
+	if (sc->sm->sm_type != XFS_SCRUB_TYPE_REFCNTBT &&
+	    xfs_has_reflink(mp))
+		sa->refc_cur = xfs_refcountbt_init_cursor(mp, sc->tp,
+				sa->agf_bp, sc->sa.pag);
+}
+
+/*
+ * Reinitialize the in-core AG state after a repair by rereading the AGF
+ * buffer.  We had better get the same AGF buffer as the one that's attached
+ * to the scrub context.
+ */
+int
+xrep_reinit_pagf(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_buf		*bp;
+	int			error;
+
+	ASSERT(pag);
+	ASSERT(pag->pagf_init);
+
+	pag->pagf_init = 0;
+	error = xfs_alloc_read_agf(pag, sc->tp, 0, &bp);
+	if (error)
+		return error;
+
+	if (bp != sc->sa.agf_bp) {
+		ASSERT(bp == sc->sa.agf_bp);
+		return -EFSCORRUPTED;
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 3179746a063e..71f2423b6de7 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -51,6 +51,15 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 void xrep_force_quotacheck(struct xfs_scrub *sc, xfs_dqtype_t type);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
 
+/* Repair setup functions */
+int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
+
+void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa);
+
+/* Metadata revalidators */
+
+int xrep_revalidate_allocbt(struct xfs_scrub *sc);
+
 /* Metadata repairers */
 
 int xrep_probe(struct xfs_scrub *sc);
@@ -58,6 +67,9 @@ int xrep_superblock(struct xfs_scrub *sc);
 int xrep_agf(struct xfs_scrub *sc);
 int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
+int xrep_allocbt(struct xfs_scrub *sc);
+
+int xrep_reinit_pagf(struct xfs_scrub *sc);
 
 #else
 
@@ -77,11 +89,23 @@ xrep_calc_ag_resblks(
 	return 0;
 }
 
+/* repair setup functions for no-repair */
+static inline int
+xrep_setup_nothing(
+	struct xfs_scrub	*sc)
+{
+	return 0;
+}
+#define xrep_setup_ag_allocbt		xrep_setup_nothing
+
+#define xrep_revalidate_allocbt		(NULL)
+
 #define xrep_probe			xrep_notsupported
 #define xrep_superblock			xrep_notsupported
 #define xrep_agf			xrep_notsupported
 #define xrep_agfl			xrep_notsupported
 #define xrep_agi			xrep_notsupported
+#define xrep_allocbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 60975d050b82..6e52aa7a14b0 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -240,13 +240,15 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_allocbt,
 		.scrub	= xchk_bnobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_allocbt,
+		.repair_eval = xrep_revalidate_allocbt,
 	},
 	[XFS_SCRUB_TYPE_CNTBT] = {	/* cntbt */
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_allocbt,
 		.scrub	= xchk_cntbt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_allocbt,
+		.repair_eval = xrep_revalidate_allocbt,
 	},
 	[XFS_SCRUB_TYPE_INOBT] = {	/* inobt */
 		.type	= ST_PERAG,
@@ -527,7 +529,10 @@ xfs_scrub_metadata(
 		goto out_teardown;
 
 	/* Scrub for errors. */
-	error = sc->ops->scrub(sc);
+	if ((sc->flags & XREP_ALREADY_FIXED) && sc->ops->repair_eval != NULL)
+		error = sc->ops->repair_eval(sc);
+	else
+		error = sc->ops->scrub(sc);
 	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
 		goto try_harder;
 	if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN))
@@ -537,8 +542,7 @@ xfs_scrub_metadata(
 
 	xchk_update_health(sc);
 
-	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
-	    !(sc->flags & XREP_ALREADY_FIXED)) {
+	if (xchk_could_repair(sc)) {
 		bool needs_fix = xchk_needs_repair(sc->sm);
 
 		/* Userspace asked us to rebuild the structure regardless. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 83ad1662d802..c21a8d732530 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -35,6 +35,14 @@ struct xchk_meta_ops {
 	/* Repair or optimize the metadata. */
 	int		(*repair)(struct xfs_scrub *);
 
+	/*
+	 * Re-scrub the metadata we repaired, in case there's extra work that
+	 * we need to do to check our repair work.  If this is NULL, we'll use
+	 * the ->scrub function pointer, assuming that the regular scrub is
+	 * sufficient.
+	 */
+	int		(*repair_eval)(struct xfs_scrub *sc);
+
 	/* Decide if we even have this piece of metadata. */
 	bool		(*has)(struct xfs_mount *);
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 788a02aee689..d78d28d7c12b 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1126,11 +1126,33 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 		 xfs_agblock_t agbno, xfs_extlen_t len, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
-DEFINE_REPAIR_RMAP_EVENT(xrep_alloc_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
 
+TRACE_EVENT(xrep_abt_found,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 const struct xfs_alloc_rec_incore *rec),
+	TP_ARGS(mp, agno, rec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, startblock)
+		__field(xfs_extlen_t, blockcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startblock = rec->ar_startblock;
+		__entry->blockcount = rec->ar_blockcount;
+	),
+	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->startblock,
+		  __entry->blockcount)
+)
+
 TRACE_EVENT(xrep_refcount_extent_fn,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
 		 struct xfs_refcount_irec *irec),
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index 69f0c922c98a..95684e0c572e 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -55,6 +55,28 @@ static inline int xfarray_append(struct xfarray *array, const void *ptr)
 uint64_t xfarray_length(struct xfarray *array);
 int xfarray_load_next(struct xfarray *array, xfarray_idx_t *idx, void *rec);
 
+/*
+ * Iterate the non-null elements in a sparse xfarray.  Callers should
+ * initialize *idx to XFARRAY_CURSOR_INIT before the first call; on return, it
+ * will be set to one more than the index of the record that was retrieved.
+ * Returns 1 if a record was retrieved, 0 if there weren't any more records, or
+ * a negative errno.
+ */
+static inline int
+xfarray_iter(
+	struct xfarray	*array,
+	xfarray_idx_t	*idx,
+	void		*rec)
+{
+	int ret = xfarray_load_next(array, idx, rec);
+
+	if (ret == -ENODATA)
+		return 0;
+	if (ret == 0)
+		return 1;
+	return ret;
+}
+
 /* Declarations for xfile array sort functionality. */
 
 typedef cmp_func_t xfarray_cmp_fn;
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index ad22a003f959..ea384d031804 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -628,3 +628,16 @@ xfs_extent_busy_ag_cmp(
 		diff = b1->bno - b2->bno;
 	return diff;
 }
+
+/* Are there any busy extents in this AG? */
+bool
+xfs_extent_busy_list_empty(
+	struct xfs_perag	*pag)
+{
+	bool			res;
+
+	spin_lock(&pag->pagb_lock);
+	res = RB_EMPTY_ROOT(&pag->pagb_tree);
+	spin_unlock(&pag->pagb_lock);
+	return res;
+}
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 4a118131059f..19828c9854d1 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -67,4 +67,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
 	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
 }
 
+bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
+
 #endif /* __XFS_EXTENT_BUSY_H__ */



* [PATCH 3/5] xfs: rewrite xfs_icache_inode_is_allocated
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Back in the mists of time[1], I proposed this function to assist the
inode btree scrubbers in checking the inode btree contents against the
allocation state of the inode records.  The original version performed a
direct lookup in the inode cache and returned the allocation status if
the cached inode hadn't been reused and wasn't in an intermediate state.
Brian thought it would be better to use the usual iget/irele mechanisms,
so that was changed for the final version.

Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has
one user and clutters up the regular iget path, which makes it hard to
reason about how it actually works.  Worse yet, the inode inactivation
series silently broke it because iget won't return inodes that are
anywhere in the inactivation machinery, even though the caller is
already required to prevent inode allocation and freeing.  Inodes in the
inactivation machinery are still allocated, but the current code's
interactions with the iget code prevent us from being able to say that.

Now that I understand the inode lifecycle better than I did in early
2017, I now realize that as long as the cached inode hasn't been reused
and isn't actively being reclaimed, it's safe to access the i_mode field
(with the AGI, rcu, and i_flags locks held), and we don't need to worry
about the inode being freed out from under us.

Therefore, port the original version to modern code structure, which
fixes the brokenness w.r.t. inactivation.  In the next patch we'll remove
IGET_INCORE since it's no longer necessary.

[1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/
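
For reference, a hypothetical caller (the function name is invented for
illustration) would consume the contract described above roughly as
follows, with the AGI buffer held by the surrounding scrub context:
-EINVAL for a garbage inode number, -ENODATA when the inode is not in
cache (so the caller falls back to the ondisk cluster buffer), -EAGAIN for
a transient state, and 0 with *inuse reporting whether the incore i_mode
is nonzero:

	static int
	check_inode_allocation_state(
		struct xfs_scrub	*sc,
		xfs_ino_t		ino,
		bool			expect_inuse)
	{
		bool			inuse;
		int			error;

		error = xfs_icache_inode_is_allocated(sc->mp, sc->tp, ino,
				&inuse);
		switch (error) {
		case 0:
			/* Cached inode; compare with the btree's opinion. */
			return inuse == expect_inuse ? 0 : -EFSCORRUPTED;
		case -ENODATA:
			/* Not cached; check the inode cluster buffer instead. */
			return 0;
		default:
			/* -EINVAL, -EAGAIN, or anything else: pass it up. */
			return error;
		}
	}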

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |  122 +++++++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_trace.h  |   22 +++++++++
 2 files changed, 124 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0d58d7b0d8ac..c602c0f98053 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -777,23 +777,23 @@ xfs_iget(
 }
 
 /*
- * "Is this a cached inode that's also allocated?"
+ * Decide if this is a cached inode that's also allocated.  The caller
+ * must hold the AGI buffer lock to prevent inodes from being allocated or
+ * freed.
  *
- * Look up an inode by number in the given file system.  If the inode is
- * in cache and isn't in purgatory, return 1 if the inode is allocated
- * and 0 if it is not.  For all other cases (not in cache, being torn
- * down, etc.), return a negative error code.
+ * Look up an inode by number in the given file system.  If the inode number
+ * is invalid, return -EINVAL.  If the inode is not in cache, return -ENODATA.
+ * If the inode is in an intermediate state (new, being reclaimed, reused) then
+ * return -EAGAIN.
  *
- * The caller has to prevent inode allocation and freeing activity,
- * presumably by locking the AGI buffer.   This is to ensure that an
- * inode cannot transition from allocated to freed until the caller is
- * ready to allow that.  If the inode is in an intermediate state (new,
- * reclaimable, or being reclaimed), -EAGAIN will be returned; if the
- * inode is not in the cache, -ENOENT will be returned.  The caller must
- * deal with these scenarios appropriately.
+ * Otherwise, the incore inode is the one we want, and it is either live,
+ * somewhere in the inactivation machinery, or reclaimable.  The inode is
+ * allocated if i_mode is nonzero.  In all three cases, the cached inode will
+ * be more up to date than the ondisk inode buffer, so we must use the incore
+ * i_mode.
  *
- * This is a specialized use case for the online scrubber; if you're
- * reading this, you probably want xfs_iget.
+ * This is a specialized use case for the online fsck; if you're reading this,
+ * you probably want xfs_iget.
  */
 int
 xfs_icache_inode_is_allocated(
@@ -803,15 +803,97 @@ xfs_icache_inode_is_allocated(
 	bool			*inuse)
 {
 	struct xfs_inode	*ip;
+	struct xfs_perag	*pag;
+	xfs_agino_t		agino;
 	int			error;
 
-	error = xfs_iget(mp, tp, ino, XFS_IGET_INCORE, 0, &ip);
-	if (error)
-		return error;
+	/* reject inode numbers outside existing AGs */
+	if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
+		return -EINVAL;
 
-	*inuse = !!(VFS_I(ip)->i_mode);
-	xfs_irele(ip);
-	return 0;
+	/* get the perag structure and ensure that it's inode capable */
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
+	agino = XFS_INO_TO_AGINO(mp, ino);
+
+	rcu_read_lock();
+	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
+	if (!ip) {
+		/* cache miss */
+		error = -ENODATA;
+		goto out_pag;
+	}
+
+	/*
+	 * If the inode number doesn't match, the incore inode got reused
+	 * during an RCU grace period and the radix tree hasn't been updated.
+	 * This isn't the inode we want.
+	 */
+	error = -ENODATA;
+	spin_lock(&ip->i_flags_lock);
+	if (ip->i_ino != ino)
+		goto out_skip;
+
+	trace_xfs_icache_inode_is_allocated(ip);
+
+	/*
+	 * We have an incore inode that matches the inode we want, and the
+	 * caller holds the AGI buffer.
+	 *
+	 * If the incore inode is INEW, there are several possibilities:
+	 *
+	 * For a file that is being created, note that we allocate the ondisk
+	 * inode before allocating, initializing, and adding the incore inode
+	 * to the radix tree.
+	 *
+	 * If the incore inode is being recycled, the inode has to be allocated
+	 * because we don't allow freed inodes to be recycled.
+	 *
+	 * If the inode is queued for inactivation, it should still be
+	 * allocated.
+	 *
+	 * If the incore inode is undergoing inactivation, either it is before
+	 * the point where it would get freed ondisk (in which case i_mode is
+	 * still nonzero), or it has already been freed, in which case i_mode
+	 * is zero.  We don't take the ILOCK here, but difree and dialloc
+	 * require the AGI, which we do hold.
+	 *
+	 * If the inode is anywhere in the reclaim mechanism, we know that it's
+	 * still ok to query i_mode because we don't allow uncached inode
+	 * updates.
+	 *
+	 * If the incore inode is live (i.e. referenced from the dcache), the
+	 * ondisk inode had better be allocated.  This is the most trivial
+	 * case.
+	 */
+#ifdef DEBUG
+	if (ip->i_flags & XFS_INEW) {
+		/* created on disk already or recycling */
+		ASSERT(VFS_I(ip)->i_mode != 0);
+	}
+
+	if ((ip->i_flags & XFS_NEED_INACTIVE) &&
+	    !(ip->i_flags & XFS_INACTIVATING)) {
+		/* definitely before difree */
+		ASSERT(VFS_I(ip)->i_mode != 0);
+	}
+
+	/* XFS_INACTIVATING and XFS_IRECLAIMABLE could be either state */
+
+	if (!(ip->i_flags & (XFS_NEED_INACTIVE | XFS_INEW | XFS_IRECLAIMABLE |
+			     XFS_INACTIVATING))) {
+		/* live inode */
+		ASSERT(VFS_I(ip)->i_mode != 0);
+	}
+#endif
+	*inuse = VFS_I(ip)->i_mode != 0;
+	error = 0;
+
+out_skip:
+	spin_unlock(&ip->i_flags_lock);
+out_pag:
+	rcu_read_unlock();
+	xfs_perag_put(pag);
+	return error;
 }
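As a usage sketch of these return codes (illustrative only: the helper name below is hypothetical, but the fallback mirrors how the inode btree repair patch later in this series consumes this function):

STATIC int
xchk_example_inode_inuse(
	struct xfs_mount	*mp,
	struct xfs_trans	*tp,
	xfs_ino_t		ino,
	struct xfs_dinode	*dip,	/* verified ondisk copy of the inode */
	bool			*inuse)
{
	int			error;

	/* Ask the inode cache first; the incore i_mode is authoritative. */
	error = xfs_icache_inode_is_allocated(mp, tp, ino, inuse);
	if (error == -ENODATA) {
		/* Not cached (or reused); fall back to the ondisk mode. */
		*inuse = dip->di_mode != 0;
		return 0;
	}

	/* 0 means *inuse is valid; -EINVAL means a garbage inode number. */
	return error;
}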
 
 /*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index dfb1d47ee3c3..a435ca32f186 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -793,6 +793,28 @@ DEFINE_INODE_EVENT(xfs_inode_reclaiming);
 DEFINE_INODE_EVENT(xfs_inode_set_need_inactive);
 DEFINE_INODE_EVENT(xfs_inode_inactivating);
 
+TRACE_EVENT(xfs_icache_inode_is_allocated,
+	TP_PROTO(struct xfs_inode *ip),
+	TP_ARGS(ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(unsigned long, iflags)
+		__field(umode_t, mode)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->iflags = ip->i_flags;
+		__entry->mode = VFS_I(ip)->i_mode;
+	),
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx mode 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->iflags,
+		  __entry->mode)
+);
+
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
  * TRACE_DEFINE_ENUM macro so that the enum value can be encoded in the ftrace


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs: repair inode btrees
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/5] xfs: clear pagf_agflreset when repairing the AGFL Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/5] xfs: repair free space btrees Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/5] xfs: rewrite xfs_icache_inode_is_allocated Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/libxfs/xfs_ialloc.c   |   41 +-
 fs/xfs/libxfs/xfs_ialloc.h   |    3 
 fs/xfs/scrub/common.c        |    1 
 fs/xfs/scrub/ialloc_repair.c |  873 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c        |   59 +++
 fs/xfs/scrub/repair.h        |   17 +
 fs/xfs/scrub/scrub.c         |    6 
 fs/xfs/scrub/scrub.h         |    1 
 fs/xfs/scrub/trace.h         |   68 ++-
 10 files changed, 1021 insertions(+), 49 deletions(-)
 create mode 100644 fs/xfs/scrub/ialloc_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b0ba3eec7068..48985e83ad4c 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -178,6 +178,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
+				   ialloc_repair.o \
 				   newbt.o \
 				   reap.o \
 				   repair.o \
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index d5de1eed97e2..98961914dc01 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -95,18 +95,29 @@ xfs_inobt_btrec_to_irec(
 	irec->ir_free = be64_to_cpu(rec->inobt.ir_free);
 }
 
-/* Simple checks for inode records. */
-xfs_failaddr_t
-xfs_inobt_check_irec(
-	struct xfs_btree_cur			*cur,
+/* Compute the freecount of an incore inode record. */
+uint8_t
+xfs_inobt_rec_freecount(
 	const struct xfs_inobt_rec_incore	*irec)
 {
-	uint64_t			realfree;
+	uint64_t				realfree;
 
+	if (!xfs_inobt_issparse(irec->ir_holemask))
+		realfree = irec->ir_free;
+	else
+		realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec);
+	return hweight64(realfree);
+}
+
+inline xfs_failaddr_t
+xfs_inobt_check_perag_irec(
+	struct xfs_perag			*pag,
+	const struct xfs_inobt_rec_incore	*irec)
+{
 	/* Record has to be properly aligned within the AG. */
-	if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino))
+	if (!xfs_verify_agino(pag, irec->ir_startino))
 		return __this_address;
-	if (!xfs_verify_agino(cur->bc_ag.pag,
+	if (!xfs_verify_agino(pag,
 				irec->ir_startino + XFS_INODES_PER_CHUNK - 1))
 		return __this_address;
 	if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT ||
@@ -115,17 +126,21 @@ xfs_inobt_check_irec(
 	if (irec->ir_freecount > XFS_INODES_PER_CHUNK)
 		return __this_address;
 
-	/* if there are no holes, return the first available offset */
-	if (!xfs_inobt_issparse(irec->ir_holemask))
-		realfree = irec->ir_free;
-	else
-		realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec);
-	if (hweight64(realfree) != irec->ir_freecount)
+	if (xfs_inobt_rec_freecount(irec) != irec->ir_freecount)
 		return __this_address;
 
 	return NULL;
 }
 
+/* Simple checks for inode records. */
+xfs_failaddr_t
+xfs_inobt_check_irec(
+	struct xfs_btree_cur			*cur,
+	const struct xfs_inobt_rec_incore	*irec)
+{
+	return xfs_inobt_check_perag_irec(cur->bc_ag.pag, irec);
+}
+
 static inline int
 xfs_inobt_complain_bad_rec(
 	struct xfs_btree_cur		*cur,
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index fa4d506086b9..f4dc97bb8e83 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -78,6 +78,7 @@ int xfs_inobt_lookup(struct xfs_btree_cur *cur, xfs_agino_t ino,
  */
 int xfs_inobt_get_rec(struct xfs_btree_cur *cur,
 		xfs_inobt_rec_incore_t *rec, int *stat);
+uint8_t xfs_inobt_rec_freecount(const struct xfs_inobt_rec_incore *irec);
 
 /*
  * Inode chunk initialisation routine
@@ -92,6 +93,8 @@ union xfs_btree_rec;
 void xfs_inobt_btrec_to_irec(struct xfs_mount *mp,
 		const union xfs_btree_rec *rec,
 		struct xfs_inobt_rec_incore *irec);
+xfs_failaddr_t xfs_inobt_check_perag_irec(struct xfs_perag *pag,
+		const struct xfs_inobt_rec_incore *irec);
 xfs_failaddr_t xfs_inobt_check_irec(struct xfs_btree_cur *cur,
 		const struct xfs_inobt_rec_incore *irec);
 int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur,
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 0c1fa210deae..2c3707730cd4 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -597,6 +597,7 @@ xchk_ag_free(
 	struct xchk_ag		*sa)
 {
 	xchk_ag_btcur_free(sa);
+	xrep_reset_perag_resv(sc);
 	if (sa->agf_bp) {
 		xfs_trans_brelse(sc->tp, sa->agf_bp);
 		sa->agf_bp = NULL;
diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
new file mode 100644
index 000000000000..dde2cde16e8c
--- /dev/null
+++ b/fs/xfs/scrub/ialloc_repair.c
@@ -0,0 +1,873 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_btree_staging.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_log.h"
+#include "xfs_trans_priv.h"
+#include "xfs_error.h"
+#include "xfs_health.h"
+#include "xfs_ag.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/newbt.h"
+#include "scrub/reap.h"
+
+/*
+ * Inode Btree Repair
+ * ==================
+ *
+ * A quick refresher of inode btrees on a v5 filesystem:
+ *
+ * - Inode records are read into memory in units of 'inode clusters'.  The
+ *   number of inodes fitting in a cluster buffer is the smallest number of
+ *   inodes that can be allocated or freed.  Clusters are never smaller than
+ *   one fs block
+ *   though they can span multiple blocks.  The size (in fs blocks) is
+ *   computed with xfs_icluster_size_fsb().  The fs block alignment of a
+ *   cluster is computed with xfs_ialloc_cluster_alignment().
+ *
+ * - Each inode btree record can describe a single 'inode chunk'.  The chunk
+ *   size is defined to be 64 inodes.  If sparse inodes are enabled, every
+ *   inobt record must be aligned to the chunk size; if not, every record must
+ *   be aligned to the start of a cluster.  It is possible to construct an XFS
+ *   geometry where one inobt record maps to multiple inode clusters; it is
+ *   also possible to construct a geometry where multiple inobt records map to
+ *   different parts of one inode cluster.
+ *
+ * - If sparse inodes are not enabled, the smallest unit of allocation for
+ *   inode records is enough to contain one inode chunk's worth of inodes.
+ *
+ * - If sparse inodes are enabled, the holemask field will be active.  Each
+ *   bit of the holemask represents 4 potential inodes; if set, the
+ *   corresponding space does *not* contain inodes and must be left alone.
+ *   Clusters cannot be smaller than 4 inodes.  The smallest unit of allocation
+ *   of inode records is one inode cluster.
+ *
+ * So what's the rebuild algorithm?
+ *
+ * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
+ * records.  The OWN_INOBT records are the old inode btree blocks and will be
+ * cleared out after we've rebuilt the tree.  Each possible inode cluster
+ * within an OWN_INODES record will be read in; for each possible inobt record
+ * associated with that cluster, compute the freemask from the i_mode data in
+ * the inode chunk.  For sparse inodes the holemask will be calculated by
+ * creating the properly aligned inobt record and punching out whichever parts
+ * of the chunk are missing.  Inode allocations and frees grab the AGI first,
+ * so repair protects itself from concurrent access by locking the AGI.
+ *
+ * Once we've reconstructed all the inode records, we can create new inode
+ * btree roots and reload the btrees.  We rebuild both inode trees at the same
+ * time because they have the same rmap owner and it would be more complex to
+ * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
+ * blocks it owns.  We have all the data we need to build both, so dump
+ * everything and start over.
+ *
+ * We use the prefix 'xrep_ibt' because we rebuild both inode btrees at once.
+ */
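To make the holemask/freemask bookkeeping above concrete, here is a small self-contained userspace sketch; the constants and structure are simplified stand-ins for XFS_INODES_PER_CHUNK, XFS_INODES_PER_HOLEMASK_BIT, and struct xfs_inobt_rec_incore, not the kernel definitions:

#include <stdint.h>
#include <stdio.h>

#define TOY_INODES_PER_CHUNK		64	/* stand-in for XFS_INODES_PER_CHUNK */
#define TOY_INODES_PER_HOLEMASK_BIT	4	/* stand-in for XFS_INODES_PER_HOLEMASK_BIT */

struct toy_irec {
	uint64_t	ir_free;	/* bit set => inode is free */
	uint16_t	ir_holemask;	/* bit set => these 4 inodes are absent */
	unsigned int	ir_count;	/* inodes backed by disk space */
};

/* Start a fresh record: all holes, everything free, nothing counted. */
static void toy_irec_init(struct toy_irec *r)
{
	r->ir_free = ~0ULL;
	r->ir_holemask = 0xFFFF;
	r->ir_count = 0;
}

/*
 * Fold one inode cluster into the record: @base is the cluster's inode offset
 * within the 64-inode chunk, @nr is how many inodes the cluster holds, and
 * @inuse has a bit set for each inode in the cluster whose mode was nonzero.
 */
static void toy_irec_add_cluster(struct toy_irec *r, unsigned int base,
				 unsigned int nr, uint64_t inuse)
{
	unsigned int i;

	r->ir_count += nr;
	for (i = 0; i < nr / TOY_INODES_PER_HOLEMASK_BIT; i++)
		r->ir_holemask &= ~(1U << (base / TOY_INODES_PER_HOLEMASK_BIT + i));
	r->ir_free &= ~(inuse << base);
}

int main(void)
{
	struct toy_irec r;

	toy_irec_init(&r);
	/* One 32-inode cluster at the start of the chunk; inodes 0 and 5 in use. */
	toy_irec_add_cluster(&r, 0, 32, (1ULL << 0) | (1ULL << 5));
	printf("count %u holemask 0x%x free 0x%llx\n",
	       r.ir_count, (unsigned int)r.ir_holemask,
	       (unsigned long long)r.ir_free);
	return 0;
}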
+
+struct xrep_ibt {
+	/* Record under construction. */
+	struct xfs_inobt_rec_incore	rie;
+
+	/* new inobt information */
+	struct xrep_newbt	new_inobt;
+
+	/* new finobt information */
+	struct xrep_newbt	new_finobt;
+
+	/* Old inode btree blocks we found in the rmap. */
+	struct xagb_bitmap	old_iallocbt_blocks;
+
+	/* Reconstructed inode records. */
+	struct xfarray		*inode_records;
+
+	struct xfs_scrub	*sc;
+
+	/* Number of inodes assigned disk space. */
+	unsigned int		icount;
+
+	/* Number of inodes in use. */
+	unsigned int		iused;
+
+	/* Number of finobt records needed. */
+	unsigned int		finobt_recs;
+
+	/* get_records()'s position in the inode record array. */
+	xfarray_idx_t		array_cur;
+};
+
+/*
+ * Is this inode in use?  If the inode is in memory we can tell from i_mode,
+ * otherwise we have to check di_mode in the on-disk buffer.  We only care
+ * whether the high (i.e. non-permission) bits of the mode are zero.  This is
+ * safe because repair keeps all AG headers locked until the end, and any
+ * process trying to perform an inode allocation or free must lock the AGI.
+ *
+ * @cluster_ag_base is the inode offset of the cluster within the AG.
+ * @cluster_bp is the cluster buffer.
+ * @cluster_index is the inode offset within the inode cluster.
+ */
+STATIC int
+xrep_ibt_check_ifree(
+	struct xrep_ibt		*ri,
+	xfs_agino_t		cluster_ag_base,
+	struct xfs_buf		*cluster_bp,
+	unsigned int		cluster_index,
+	bool			*inuse)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_dinode	*dip;
+	xfs_ino_t		fsino;
+	xfs_agnumber_t		agno = ri->sc->sa.pag->pag_agno;
+	unsigned int		cluster_buf_base;
+	unsigned int		offset;
+	int			error;
+
+	fsino = XFS_AGINO_TO_INO(mp, agno, cluster_ag_base + cluster_index);
+
+	/* Inode uncached or half assembled, read disk buffer */
+	cluster_buf_base = XFS_INO_TO_OFFSET(mp, cluster_ag_base);
+	offset = (cluster_buf_base + cluster_index) * mp->m_sb.sb_inodesize;
+	if (offset >= BBTOB(cluster_bp->b_length))
+		return -EFSCORRUPTED;
+	dip = xfs_buf_offset(cluster_bp, offset);
+	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC)
+		return -EFSCORRUPTED;
+
+	if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino)
+		return -EFSCORRUPTED;
+
+	/* Will the in-core inode tell us if it's in use? */
+	error = xfs_icache_inode_is_allocated(mp, sc->tp, fsino, inuse);
+	if (!error)
+		return 0;
+
+	*inuse = dip->di_mode != 0;
+	return 0;
+}
+
+/* Stash the accumulated inobt record for rebuilding. */
+STATIC int
+xrep_ibt_stash(
+	struct xrep_ibt		*ri)
+{
+	int			error = 0;
+
+	if (xchk_should_terminate(ri->sc, &error))
+		return error;
+
+	ri->rie.ir_freecount = xfs_inobt_rec_freecount(&ri->rie);
+	if (xfs_inobt_check_perag_irec(ri->sc->sa.pag, &ri->rie) != NULL)
+		return -EFSCORRUPTED;
+
+	if (ri->rie.ir_freecount > 0)
+		ri->finobt_recs++;
+
+	trace_xrep_ibt_found(ri->sc->mp, ri->sc->sa.pag->pag_agno, &ri->rie);
+
+	error = xfarray_append(ri->inode_records, &ri->rie);
+	if (error)
+		return error;
+
+	ri->rie.ir_startino = NULLAGINO;
+	return 0;
+}
+
+/*
+ * Given an extent of inodes and an inode cluster buffer, calculate the
+ * location of the corresponding inobt record (creating it if necessary),
+ * then update the parts of the holemask and freemask of that record that
+ * correspond to the inode extent we were given.
+ *
+ * @cluster_ir_startino is the AG inode number of an inobt record that we're
+ * proposing to create for this inode cluster.  If sparse inodes are enabled,
+ * we must round down to a chunk boundary to find the actual sparse record.
+ * @cluster_bp is the buffer of the inode cluster.
+ * @nr_inodes is the number of inodes to check from the cluster.
+ */
+STATIC int
+xrep_ibt_cluster_record(
+	struct xrep_ibt		*ri,
+	xfs_agino_t		cluster_ir_startino,
+	struct xfs_buf		*cluster_bp,
+	unsigned int		nr_inodes)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agino_t		ir_startino;
+	unsigned int		cluster_base;
+	unsigned int		cluster_index;
+	int			error = 0;
+
+	ir_startino = cluster_ir_startino;
+	if (xfs_has_sparseinodes(mp))
+		ir_startino = rounddown(ir_startino, XFS_INODES_PER_CHUNK);
+	cluster_base = cluster_ir_startino - ir_startino;
+
+	/*
+	 * If the accumulated inobt record doesn't map this cluster, add it to
+	 * the list and reset it.
+	 */
+	if (ri->rie.ir_startino != NULLAGINO &&
+	    ri->rie.ir_startino + XFS_INODES_PER_CHUNK <= ir_startino) {
+		error = xrep_ibt_stash(ri);
+		if (error)
+			return error;
+	}
+
+	if (ri->rie.ir_startino == NULLAGINO) {
+		ri->rie.ir_startino = ir_startino;
+		ri->rie.ir_free = XFS_INOBT_ALL_FREE;
+		ri->rie.ir_holemask = 0xFFFF;
+		ri->rie.ir_count = 0;
+	}
+
+	/* Record the whole cluster. */
+	ri->icount += nr_inodes;
+	ri->rie.ir_count += nr_inodes;
+	ri->rie.ir_holemask &= ~xfs_inobt_maskn(
+				cluster_base / XFS_INODES_PER_HOLEMASK_BIT,
+				nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
+
+	/* Which inodes within this cluster are free? */
+	for (cluster_index = 0; cluster_index < nr_inodes; cluster_index++) {
+		bool		inuse = false;
+
+		error = xrep_ibt_check_ifree(ri, cluster_ir_startino,
+				cluster_bp, cluster_index, &inuse);
+		if (error)
+			return error;
+		if (!inuse)
+			continue;
+		ri->iused++;
+		ri->rie.ir_free &= ~XFS_INOBT_MASK(cluster_base +
+						   cluster_index);
+	}
+	return 0;
+}
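A worked example of the rounddown above, under an assumed sparse-inode geometry with 32-inode clusters:

/*
 * A cluster whose first inode is agino 96 contributes to the inobt record
 * at agino 64:
 *
 *	ir_startino  = rounddown(96, XFS_INODES_PER_CHUNK) = 64
 *	cluster_base = 96 - 64 = 32
 *
 * so xfs_inobt_maskn clears holemask bits 8-15, and the in-use inodes in
 * this cluster clear freemask bits 32 and up.
 */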
+
+/*
+ * For each inode cluster covering the physical extent recorded by the rmapbt,
+ * we must calculate the properly aligned startino of that cluster, then
+ * iterate each cluster to fill in used and filled masks appropriately.  We
+ * then use the (startino, used, filled) information to construct the
+ * appropriate inode records.
+ */
+STATIC int
+xrep_ibt_process_cluster(
+	struct xrep_ibt		*ri,
+	xfs_agblock_t		cluster_bno)
+{
+	struct xfs_imap		imap;
+	struct xfs_buf		*cluster_bp;
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	xfs_agino_t		cluster_ag_base;
+	xfs_agino_t		irec_index;
+	unsigned int		nr_inodes;
+	int			error;
+
+	nr_inodes = min_t(unsigned int, igeo->inodes_per_cluster,
+			XFS_INODES_PER_CHUNK);
+
+	/*
+	 * Grab the inode cluster buffer.  This is safe to do with a broken
+	 * inobt because imap_to_bp directly maps the buffer without touching
+	 * either inode btree.
+	 */
+	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.pag->pag_agno, cluster_bno);
+	imap.im_len = XFS_FSB_TO_BB(mp, igeo->blocks_per_cluster);
+	imap.im_boffset = 0;
+	error = xfs_imap_to_bp(mp, sc->tp, &imap, &cluster_bp);
+	if (error)
+		return error;
+
+	/*
+	 * Record the contents of each possible inobt record mapping this
+	 * cluster.
+	 */
+	cluster_ag_base = XFS_AGB_TO_AGINO(mp, cluster_bno);
+	for (irec_index = 0;
+	     irec_index < igeo->inodes_per_cluster;
+	     irec_index += XFS_INODES_PER_CHUNK) {
+		error = xrep_ibt_cluster_record(ri,
+				cluster_ag_base + irec_index, cluster_bp,
+				nr_inodes);
+		if (error)
+			break;
+	}
+
+	xfs_trans_brelse(sc->tp, cluster_bp);
+	return error;
+}
+
+/* Check for any obvious conflicts in the inode chunk extent. */
+STATIC int
+xrep_ibt_check_inode_ext(
+	struct xfs_scrub	*sc,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		len)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	xfs_agino_t		agino;
+	enum xbtree_recpacking	outcome;
+	int			error;
+
+	/* Inode records must be within the AG. */
+	if (!xfs_verify_agbext(sc->sa.pag, agbno, len))
+		return -EFSCORRUPTED;
+
+	/* The entire record must align to the inode cluster size. */
+	if (!IS_ALIGNED(agbno, igeo->blocks_per_cluster) ||
+	    !IS_ALIGNED(agbno + len, igeo->blocks_per_cluster))
+		return -EFSCORRUPTED;
+
+	/*
+	 * The entire record must also adhere to the inode cluster alignment
+	 * size if sparse inodes are not enabled.
+	 */
+	if (!xfs_has_sparseinodes(mp) &&
+	    (!IS_ALIGNED(agbno, igeo->cluster_align) ||
+	     !IS_ALIGNED(agbno + len, igeo->cluster_align)))
+		return -EFSCORRUPTED;
+
+	/*
+	 * On a sparse inode fs, this cluster could be part of a sparse chunk.
+	 * Sparse clusters must be aligned to sparse chunk alignment.
+	 */
+	if (xfs_has_sparseinodes(mp) &&
+	    (!IS_ALIGNED(agbno, mp->m_sb.sb_spino_align) ||
+	     !IS_ALIGNED(agbno + len, mp->m_sb.sb_spino_align)))
+		return -EFSCORRUPTED;
+
+	/* Make sure the entire range of blocks are valid AG inodes. */
+	agino = XFS_AGB_TO_AGINO(mp, agbno);
+	if (!xfs_verify_agino(sc->sa.pag, agino))
+		return -EFSCORRUPTED;
+
+	agino = XFS_AGB_TO_AGINO(mp, agbno + len) - 1;
+	if (!xfs_verify_agino(sc->sa.pag, agino))
+		return -EFSCORRUPTED;
+
+	/* Make sure this isn't free space. */
+	error = xfs_alloc_has_records(sc->sa.bno_cur, agbno, len, &outcome);
+	if (error)
+		return error;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Found a fragment of the old inode btrees; dispose of them later. */
+STATIC int
+xrep_ibt_record_old_btree_blocks(
+	struct xrep_ibt			*ri,
+	const struct xfs_rmap_irec	*rec)
+{
+	if (!xfs_verify_agbext(ri->sc->sa.pag, rec->rm_startblock,
+				rec->rm_blockcount))
+		return -EFSCORRUPTED;
+
+	return xagb_bitmap_set(&ri->old_iallocbt_blocks, rec->rm_startblock,
+			rec->rm_blockcount);
+}
+
+/* Record extents that belong to inode btrees. */
+STATIC int
+xrep_ibt_walk_rmap(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_ibt			*ri = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_ino_geometry		*igeo = M_IGEO(mp);
+	xfs_agblock_t			cluster_base;
+	int				error = 0;
+
+	if (xchk_should_terminate(ri->sc, &error))
+		return error;
+
+	if (rec->rm_owner == XFS_RMAP_OWN_INOBT)
+		return xrep_ibt_record_old_btree_blocks(ri, rec);
+
+	/* Skip extents which are not owned by this inode and fork. */
+	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
+		return 0;
+
+	error = xrep_ibt_check_inode_ext(ri->sc, rec->rm_startblock,
+			rec->rm_blockcount);
+	if (error)
+		return error;
+
+	trace_xrep_ibt_walk_rmap(mp, ri->sc->sa.pag->pag_agno,
+			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
+			rec->rm_offset, rec->rm_flags);
+
+	/*
+	 * Record the free/hole masks for each inode cluster that could be
+	 * mapped by this rmap record.
+	 */
+	for (cluster_base = 0;
+	     cluster_base < rec->rm_blockcount;
+	     cluster_base += igeo->blocks_per_cluster) {
+		error = xrep_ibt_process_cluster(ri,
+				rec->rm_startblock + cluster_base);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode
+ * btrees (OWN_INOBT).  Figure out if we have enough free space to reconstruct
+ * the inode btrees.  The caller must clean up the lists if anything goes
+ * wrong.
+ */
+STATIC int
+xrep_ibt_find_inodes(
+	struct xrep_ibt		*ri)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	int			error;
+
+	ri->rie.ir_startino = NULLAGINO;
+
+	/* Collect all reverse mappings for inode blocks. */
+	xrep_ag_btcur_init(sc, &sc->sa);
+	error = xfs_rmap_query_all(sc->sa.rmap_cur, xrep_ibt_walk_rmap, ri);
+	xchk_ag_btcur_free(&sc->sa);
+	if (error)
+		return error;
+
+	/* If we have a record ready to go, add it to the array. */
+	if (ri->rie.ir_startino == NULLAGINO)
+		return 0;
+
+	return xrep_ibt_stash(ri);
+}
+
+/* Update the AGI counters. */
+STATIC int
+xrep_ibt_reset_counters(
+	struct xrep_ibt		*ri)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_agi		*agi = sc->sa.agi_bp->b_addr;
+	unsigned int		freecount = ri->icount - ri->iused;
+
+	/* Trigger inode count recalculation */
+	xfs_force_summary_recalc(sc->mp);
+
+	/*
+	 * The AGI header contains extra information related to the inode
+	 * btrees, so we must update those fields here.
+	 */
+	agi->agi_count = cpu_to_be32(ri->icount);
+	agi->agi_freecount = cpu_to_be32(freecount);
+	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp,
+			   XFS_AGI_COUNT | XFS_AGI_FREECOUNT);
+
+	/* Reinitialize with the values we just logged. */
+	return xrep_reinit_pagi(sc);
+}
+
+/* Retrieve finobt data for bulk load. */
+STATIC int
+xrep_fibt_get_records(
+	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	*irec = &cur->bc_rec.i;
+	struct xrep_ibt			*ri = priv;
+	union xfs_btree_rec		*block_rec;
+	unsigned int			loaded;
+	int				error;
+
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		do {
+			error = xfarray_load(ri->inode_records,
+					ri->array_cur++, irec);
+		} while (error == 0 && xfs_inobt_rec_freecount(irec) == 0);
+		if (error)
+			return error;
+
+		block_rec = xfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
+}
+
+/* Retrieve inobt data for bulk load. */
+STATIC int
+xrep_ibt_get_records(
+	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	*irec = &cur->bc_rec.i;
+	struct xrep_ibt			*ri = priv;
+	union xfs_btree_rec		*block_rec;
+	unsigned int			loaded;
+	int				error;
+
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		error = xfarray_load(ri->inode_records, ri->array_cur++, irec);
+		if (error)
+			return error;
+
+		block_rec = xfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
+}
+
+/* Feed one of the new inobt blocks to the bulk loader. */
+STATIC int
+xrep_ibt_claim_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	void			*priv)
+{
+	struct xrep_ibt		*ri = priv;
+	int			error;
+
+	error = xrep_newbt_relog_autoreap(&ri->new_inobt);
+	if (error)
+		return error;
+
+	return xrep_newbt_claim_block(cur, &ri->new_inobt, ptr);
+}
+
+/* Feed one of the new finobt blocks to the bulk loader. */
+STATIC int
+xrep_fibt_claim_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	void			*priv)
+{
+	struct xrep_ibt		*ri = priv;
+	int			error;
+
+	error = xrep_newbt_relog_autoreap(&ri->new_finobt);
+	if (error)
+		return error;
+
+	return xrep_newbt_claim_block(cur, &ri->new_finobt, ptr);
+}
+
+/* Make sure the records do not overlap in inumber address space. */
+STATIC int
+xrep_ibt_check_startino(
+	struct xrep_ibt			*ri)
+{
+	struct xfs_inobt_rec_incore	irec;
+	xfarray_idx_t			cur;
+	xfs_agino_t			next_agino = 0;
+	int				error = 0;
+
+	foreach_xfarray_idx(ri->inode_records, cur) {
+		if (xchk_should_terminate(ri->sc, &error))
+			return error;
+
+		error = xfarray_load(ri->inode_records, cur, &irec);
+		if (error)
+			return error;
+
+		if (irec.ir_startino < next_agino)
+			return -EFSCORRUPTED;
+
+		next_agino = irec.ir_startino + XFS_INODES_PER_CHUNK;
+	}
+
+	return error;
+}
+
+/* Build new inode btrees and dispose of the old one. */
+STATIC int
+xrep_ibt_build_new_trees(
+	struct xrep_ibt		*ri)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_btree_cur	*ino_cur;
+	struct xfs_btree_cur	*fino_cur = NULL;
+	xfs_fsblock_t		fsbno;
+	bool			need_finobt;
+	int			error;
+
+	need_finobt = xfs_has_finobt(sc->mp);
+
+	/*
+	 * Create new btrees for staging all the inobt records we collected
+	 * earlier.  The records were collected in order of increasing agino,
+	 * so we do not have to sort them.  Ensure there are no overlapping
+	 * records.
+	 */
+	error = xrep_ibt_check_startino(ri);
+	if (error)
+		return error;
+
+	/*
+	 * The new inode btrees will not be rooted in the AGI until we've
+	 * successfully rebuilt the tree.
+	 *
+	 * Start by setting up the inobt staging cursor.
+	 */
+	fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno,
+			XFS_IBT_BLOCK(sc->mp));
+	xrep_newbt_init_ag(&ri->new_inobt, sc, &XFS_RMAP_OINFO_INOBT, fsbno,
+			XFS_AG_RESV_NONE);
+	ri->new_inobt.bload.claim_block = xrep_ibt_claim_block;
+	ri->new_inobt.bload.get_records = xrep_ibt_get_records;
+
+	ino_cur = xfs_inobt_stage_cursor(sc->mp, &ri->new_inobt.afake,
+			sc->sa.pag, XFS_BTNUM_INO);
+	error = xfs_btree_bload_compute_geometry(ino_cur, &ri->new_inobt.bload,
+			xfarray_length(ri->inode_records));
+	if (error)
+		goto err_inocur;
+
+	/* Set up finobt staging cursor. */
+	if (need_finobt) {
+		enum xfs_ag_resv_type	resv = XFS_AG_RESV_METADATA;
+
+		if (sc->mp->m_finobt_nores)
+			resv = XFS_AG_RESV_NONE;
+
+		fsbno = XFS_AGB_TO_FSB(sc->mp, sc->sa.pag->pag_agno,
+				XFS_FIBT_BLOCK(sc->mp));
+		xrep_newbt_init_ag(&ri->new_finobt, sc, &XFS_RMAP_OINFO_INOBT,
+				fsbno, resv);
+		ri->new_finobt.bload.claim_block = xrep_fibt_claim_block;
+		ri->new_finobt.bload.get_records = xrep_fibt_get_records;
+
+		fino_cur = xfs_inobt_stage_cursor(sc->mp,
+				&ri->new_finobt.afake, sc->sa.pag,
+				XFS_BTNUM_FINO);
+		error = xfs_btree_bload_compute_geometry(fino_cur,
+				&ri->new_finobt.bload, ri->finobt_recs);
+		if (error)
+			goto err_finocur;
+	}
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		goto err_finocur;
+
+	/* Reserve all the space we need to build the new btrees. */
+	error = xrep_newbt_alloc_blocks(&ri->new_inobt,
+			ri->new_inobt.bload.nr_blocks);
+	if (error)
+		goto err_finocur;
+
+	if (need_finobt) {
+		error = xrep_newbt_alloc_blocks(&ri->new_finobt,
+				ri->new_finobt.bload.nr_blocks);
+		if (error)
+			goto err_finocur;
+	}
+
+	/* Add all inobt records. */
+	ri->array_cur = XFARRAY_CURSOR_INIT;
+	error = xfs_btree_bload(ino_cur, &ri->new_inobt.bload, ri);
+	if (error)
+		goto err_finocur;
+
+	/* Add all finobt records. */
+	if (need_finobt) {
+		ri->array_cur = XFARRAY_CURSOR_INIT;
+		error = xfs_btree_bload(fino_cur, &ri->new_finobt.bload, ri);
+		if (error)
+			goto err_finocur;
+	}
+
+	/*
+	 * Install the new btrees in the AG header.  After this point the old
+	 * btrees are no longer accessible and the new trees are live.
+	 */
+	xfs_inobt_commit_staged_btree(ino_cur, sc->tp, sc->sa.agi_bp);
+	xfs_btree_del_cursor(ino_cur, 0);
+
+	if (fino_cur) {
+		xfs_inobt_commit_staged_btree(fino_cur, sc->tp, sc->sa.agi_bp);
+		xfs_btree_del_cursor(fino_cur, 0);
+	}
+
+	/* Reset the AGI counters now that we've changed the inode roots. */
+	error = xrep_ibt_reset_counters(ri);
+	if (error)
+		goto err_finobt;
+
+	/* Free unused blocks and bitmap. */
+	if (need_finobt) {
+		error = xrep_newbt_commit(&ri->new_finobt);
+		if (error)
+			goto err_inobt;
+	}
+	error = xrep_newbt_commit(&ri->new_inobt);
+	if (error)
+		return error;
+
+	return xrep_roll_ag_trans(sc);
+
+err_finocur:
+	if (need_finobt)
+		xfs_btree_del_cursor(fino_cur, error);
+	xfs_btree_del_cursor(ino_cur, error);
+	goto err_finobt;
+err_inocur:
+	/*
+	 * We arrive here before the finobt staging context has been set up,
+	 * so only tear down the inobt staging state.
+	 */
+	xfs_btree_del_cursor(ino_cur, error);
+	goto err_inobt;
+err_finobt:
+	if (need_finobt)
+		xrep_newbt_cancel(&ri->new_finobt);
+err_inobt:
+	xrep_newbt_cancel(&ri->new_inobt);
+	return error;
+}
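Stripped of error handling and declarations, the staging sequence above follows a shape shared by every btree rebuild in this series; in this sketch, my_get_records, my_claim_block, and btnum stand in for the callbacks and btree type chosen by the caller:

	/* 1: set up a staging context; nothing is visible in the AGI yet */
	xrep_newbt_init_ag(&newbt, sc, &XFS_RMAP_OINFO_INOBT, fsbno, resv);
	newbt.bload.get_records = my_get_records;	/* feeds sorted records */
	newbt.bload.claim_block = my_claim_block;	/* hands out staged blocks */
	cur = xfs_inobt_stage_cursor(sc->mp, &newbt.afake, sc->sa.pag, btnum);

	/* 2: size the new tree, then reserve that much space */
	error = xfs_btree_bload_compute_geometry(cur, &newbt.bload, nr_records);
	error = xrep_newbt_alloc_blocks(&newbt, newbt.bload.nr_blocks);

	/* 3: write the new blocks, then atomically switch the root */
	error = xfs_btree_bload(cur, &newbt.bload, priv);
	xfs_inobt_commit_staged_btree(cur, sc->tp, sc->sa.agi_bp);
	xfs_btree_del_cursor(cur, 0);

	/* 4: return unused space; on failure, xrep_newbt_cancel() instead */
	error = xrep_newbt_commit(&newbt);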
+
+/*
+ * Now that we've logged the roots of the new btrees, invalidate all of the
+ * old blocks and free them.
+ */
+STATIC int
+xrep_ibt_remove_old_trees(
+	struct xrep_ibt		*ri)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	int			error;
+
+	/* Free the old inode btree blocks if they're not in use. */
+	error = xrep_reap_agblocks(sc, &ri->old_iallocbt_blocks,
+			&XFS_RMAP_OINFO_INOBT, XFS_AG_RESV_NONE);
+	if (error)
+		return error;
+
+	/*
+	 * If the finobt is enabled and has a per-AG reservation, make sure we
+	 * reinitialize the per-AG reservations.
+	 */
+	if (xfs_has_finobt(sc->mp) && !sc->mp->m_finobt_nores)
+		sc->flags |= XREP_RESET_PERAG_RESV;
+
+	return 0;
+}
+
+/* Repair both inode btrees. */
+int
+xrep_iallocbt(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_ibt		*ri;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agino_t		first_agino, last_agino;
+	int			error = 0;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_has_rmapbt(mp))
+		return -EOPNOTSUPP;
+
+	ri = kzalloc(sizeof(struct xrep_ibt), XCHK_GFP_FLAGS);
+	if (!ri)
+		return -ENOMEM;
+	ri->sc = sc;
+
+	/* We rebuild both inode btrees. */
+	sc->sick_mask = XFS_SICK_AG_INOBT | XFS_SICK_AG_FINOBT;
+
+	/* Set up enough storage to handle an AG with nothing but inodes. */
+	xfs_agino_range(mp, sc->sa.pag->pag_agno, &first_agino, &last_agino);
+	last_agino /= XFS_INODES_PER_CHUNK;
+	error = xfarray_create(mp, "inode records", last_agino,
+			sizeof(struct xfs_inobt_rec_incore),
+			&ri->inode_records);
+	if (error)
+		goto out_ri;
+
+	/* Collect the inode data and find the old btree blocks. */
+	xagb_bitmap_init(&ri->old_iallocbt_blocks);
+	error = xrep_ibt_find_inodes(ri);
+	if (error)
+		goto out_bitmap;
+
+	/* Rebuild the inode indexes. */
+	error = xrep_ibt_build_new_trees(ri);
+	if (error)
+		goto out_bitmap;
+
+	/* Kill the old tree. */
+	error = xrep_ibt_remove_old_trees(ri);
+
+out_bitmap:
+	xagb_bitmap_destroy(&ri->old_iallocbt_blocks);
+	xfarray_destroy(ri->inode_records);
+out_ri:
+	kfree(ri);
+	return error;
+}
+
+/* Make sure both btrees are ok after we've rebuilt them. */
+int
+xrep_revalidate_iallocbt(
+	struct xfs_scrub	*sc)
+{
+	__u32			old_type = sc->sm->sm_type;
+	int			error;
+
+	/*
+	 * We must update sm_type temporarily so that the tree-to-tree cross
+	 * reference checks will work in the correct direction, and also so
+	 * that tracing will report correctly if there are more errors.
+	 */
+	sc->sm->sm_type = XFS_SCRUB_TYPE_INOBT;
+	error = xchk_inobt(sc);
+	if (error)
+		goto out;
+
+	if (xfs_has_finobt(sc->mp)) {
+		sc->sm->sm_type = XFS_SCRUB_TYPE_FINOBT;
+		error = xchk_finobt(sc);
+	}
+
+out:
+	sc->sm->sm_type = old_type;
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 964b48fffc58..bc5bbff1558e 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -804,3 +804,62 @@ xrep_reinit_pagf(
 
 	return 0;
 }
+
+/*
+ * Reinitialize the in-core AG state after a repair by rereading the AGI
+ * buffer.  We had better get the same AGI buffer as the one that's attached
+ * to the scrub context.
+ */
+int
+xrep_reinit_pagi(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_buf		*bp;
+	int			error;
+
+	ASSERT(pag);
+	ASSERT(pag->pagi_init);
+
+	pag->pagi_init = 0;
+	error = xfs_ialloc_read_agi(pag, sc->tp, &bp);
+	if (error)
+		return error;
+
+	if (bp != sc->sa.agi_bp) {
+		ASSERT(bp == sc->sa.agi_bp);
+		return -EFSCORRUPTED;
+	}
+
+	return 0;
+}
+
+/* Reinitialize the per-AG block reservation for the AG we just fixed. */
+int
+xrep_reset_perag_resv(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	if (!(sc->flags & XREP_RESET_PERAG_RESV))
+		return 0;
+
+	ASSERT(sc->sa.pag != NULL);
+	ASSERT(sc->ops->type == ST_PERAG);
+	ASSERT(sc->tp);
+
+	sc->flags &= ~XREP_RESET_PERAG_RESV;
+	error = xfs_ag_resv_free(sc->sa.pag);
+	if (error)
+		goto out;
+	error = xfs_ag_resv_init(sc->sa.pag, sc->tp);
+	if (error == -ENOSPC) {
+		xfs_err(sc->mp,
+"Insufficient free space to reset per-AG reservation for AG %u after repair.",
+				sc->sa.pag->pag_agno);
+		error = 0;
+	}
+
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 71f2423b6de7..b6e60362b7cb 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -50,6 +50,7 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 		struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp);
 void xrep_force_quotacheck(struct xfs_scrub *sc, xfs_dqtype_t type);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
+int xrep_reset_perag_resv(struct xfs_scrub *sc);
 
 /* Repair setup functions */
 int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
@@ -59,6 +60,7 @@ void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa);
 /* Metadata revalidators */
 
 int xrep_revalidate_allocbt(struct xfs_scrub *sc);
+int xrep_revalidate_iallocbt(struct xfs_scrub *sc);
 
 /* Metadata repairers */
 
@@ -68,8 +70,10 @@ int xrep_agf(struct xfs_scrub *sc);
 int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
+int xrep_iallocbt(struct xfs_scrub *sc);
 
 int xrep_reinit_pagf(struct xfs_scrub *sc);
+int xrep_reinit_pagi(struct xfs_scrub *sc);
 
 #else
 
@@ -89,6 +93,17 @@ xrep_calc_ag_resblks(
 	return 0;
 }
 
+static inline int
+xrep_reset_perag_resv(
+	struct xfs_scrub	*sc)
+{
+	if (!(sc->flags & XREP_RESET_PERAG_RESV))
+		return 0;
+
+	ASSERT(0);
+	return -EOPNOTSUPP;
+}
+
 /* repair setup functions for no-repair */
 static inline int
 xrep_setup_nothing(
@@ -99,6 +114,7 @@ xrep_setup_nothing(
 #define xrep_setup_ag_allocbt		xrep_setup_nothing
 
 #define xrep_revalidate_allocbt		(NULL)
+#define xrep_revalidate_iallocbt	(NULL)
 
 #define xrep_probe			xrep_notsupported
 #define xrep_superblock			xrep_notsupported
@@ -106,6 +122,7 @@ xrep_setup_nothing(
 #define xrep_agfl			xrep_notsupported
 #define xrep_agi			xrep_notsupported
 #define xrep_allocbt			xrep_notsupported
+#define xrep_iallocbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6e52aa7a14b0..aef30515c050 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -254,14 +254,16 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_iallocbt,
 		.scrub	= xchk_inobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_iallocbt,
+		.repair_eval = xrep_revalidate_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_iallocbt,
 		.scrub	= xchk_finobt,
 		.has	= xfs_has_finobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_iallocbt,
+		.repair_eval = xrep_revalidate_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index c21a8d732530..7670c0a415a4 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -121,6 +121,7 @@ struct xfs_scrub {
 #define XCHK_REAPING_DISABLED	(1 << 1)  /* background block reaping paused */
 #define XCHK_FSHOOKS_DRAIN	(1 << 2)  /* defer ops draining enabled */
 #define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to use intent drain */
+#define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
 #define XCHK_FSHOOKS_ALL	(XCHK_FSHOOKS_DRAIN)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index d78d28d7c12b..5e66be26055b 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -106,6 +106,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 	{ XCHK_REAPING_DISABLED,		"reaping_disabled" }, \
 	{ XCHK_FSHOOKS_DRAIN,			"fshooks_drain" }, \
 	{ XCHK_NEED_DRAIN,			"need_drain" }, \
+	{ XREP_RESET_PERAG_RESV,		"reset_perag_resv" }, \
 	{ XREP_ALREADY_FIXED,			"already_fixed" }
 
 DECLARE_EVENT_CLASS(xchk_class,
@@ -1126,7 +1127,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 		 xfs_agblock_t agbno, xfs_extlen_t len, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
-DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
 
@@ -1153,6 +1154,38 @@ TRACE_EVENT(xrep_abt_found,
 		  __entry->blockcount)
 )
 
+TRACE_EVENT(xrep_ibt_found,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 const struct xfs_inobt_rec_incore *rec),
+	TP_ARGS(mp, agno, rec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, startino)
+		__field(uint16_t, holemask)
+		__field(uint8_t, count)
+		__field(uint8_t, freecount)
+		__field(uint64_t, freemask)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startino = rec->ir_startino;
+		__entry->holemask = rec->ir_holemask;
+		__entry->count = rec->ir_count;
+		__entry->freecount = rec->ir_freecount;
+		__entry->freemask = rec->ir_free;
+	),
+	TP_printk("dev %d:%d agno 0x%x agino 0x%x holemask 0x%x count 0x%x freecount 0x%x freemask 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->startino,
+		  __entry->holemask,
+		  __entry->count,
+		  __entry->freecount,
+		  __entry->freemask)
+)
+
 TRACE_EVENT(xrep_refcount_extent_fn,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
 		 struct xfs_refcount_irec *irec),
@@ -1275,39 +1308,6 @@ TRACE_EVENT(xrep_reset_counters,
 		  MAJOR(__entry->dev), MINOR(__entry->dev))
 )
 
-TRACE_EVENT(xrep_ialloc_insert,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 xfs_agino_t startino, uint16_t holemask, uint8_t count,
-		 uint8_t freecount, uint64_t freemask),
-	TP_ARGS(mp, agno, startino, holemask, count, freecount, freemask),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_agnumber_t, agno)
-		__field(xfs_agino_t, startino)
-		__field(uint16_t, holemask)
-		__field(uint8_t, count)
-		__field(uint8_t, freecount)
-		__field(uint64_t, freemask)
-	),
-	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->agno = agno;
-		__entry->startino = startino;
-		__entry->holemask = holemask;
-		__entry->count = count;
-		__entry->freecount = freecount;
-		__entry->freemask = freemask;
-	),
-	TP_printk("dev %d:%d agno 0x%x startino 0x%x holemask 0x%x count %u freecount %u freemask 0x%llx",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->agno,
-		  __entry->startino,
-		  __entry->holemask,
-		  __entry->count,
-		  __entry->freecount,
-		  __entry->freemask)
-)
-
 DECLARE_EVENT_CLASS(xrep_newbt_extent_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
 		 xfs_agblock_t agbno, xfs_extlen_t len,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] xfs: repair refcount btrees
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/5] xfs: rewrite xfs_icache_inode_is_allocated Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Reconstruct the refcount data from the rmap btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_ag.h             |    1 
 fs/xfs/libxfs/xfs_btree.c          |   26 +
 fs/xfs/libxfs/xfs_btree.h          |    2 
 fs/xfs/libxfs/xfs_refcount.c       |   18 +
 fs/xfs/libxfs/xfs_refcount.h       |    2 
 fs/xfs/libxfs/xfs_refcount_btree.c |   13 +
 fs/xfs/scrub/refcount_repair.c     |  791 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h              |    2 
 fs/xfs/scrub/scrub.c               |    2 
 fs/xfs/scrub/trace.h               |   22 +
 11 files changed, 862 insertions(+), 18 deletions(-)
 create mode 100644 fs/xfs/scrub/refcount_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 48985e83ad4c..c448c2a4d691 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -181,6 +181,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   ialloc_repair.o \
 				   newbt.o \
 				   reap.o \
+				   refcount_repair.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index bb87f6677495..fd663d04bdff 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -89,6 +89,7 @@ struct xfs_perag {
 	 * verifiers while rebuilding the AG btrees.
 	 */
 	uint8_t		pagf_alt_levels[XFS_BTNUM_AGF];
+	uint8_t		pagf_alt_refcount_level;
 #endif
 
 	spinlock_t	pag_state_lock;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 842a710e6c3b..b63650a5d690 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -5198,3 +5198,29 @@ xfs_btree_destroy_cur_caches(void)
 	xfs_rmapbt_destroy_cur_cache();
 	xfs_refcountbt_destroy_cur_cache();
 }
+
+/* Move the btree cursor before the first record. */
+int
+xfs_btree_goto_left_edge(
+	struct xfs_btree_cur	*cur)
+{
+	int			stat = 0;
+	int			error;
+
+	memset(&cur->bc_rec, 0, sizeof(cur->bc_rec));
+	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat);
+	if (error)
+		return error;
+	if (!stat)
+		return 0;
+
+	error = xfs_btree_decrement(cur, 0, &stat);
+	if (error)
+		return error;
+	if (stat != 0) {
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	return 0;
+}
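A sketch of the intended calling pattern, mirroring how the refcount repair code later in this patch walks the rmapbt from the start (declarations omitted):

	error = xfs_btree_goto_left_edge(cur);
	if (error)
		return error;

	while (xfs_btree_has_more_records(cur)) {
		error = xfs_btree_increment(cur, 0, &stat);
		if (error)
			return error;
		if (!stat)
			break;
		/* ...examine the record now under the cursor... */
	}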
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 2db03f0ae961..5525d3715d57 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -738,4 +738,6 @@ xfs_btree_alloc_cursor(
 int __init xfs_btree_init_cur_caches(void);
 void xfs_btree_destroy_cur_caches(void);
 
+int xfs_btree_goto_left_edge(struct xfs_btree_cur *cur);
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index c1c65774dcc2..8082bb7b953a 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -120,14 +120,11 @@ xfs_refcount_btrec_to_irec(
 	irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
 }
 
-/* Simple checks for refcount records. */
-xfs_failaddr_t
-xfs_refcount_check_irec(
-	struct xfs_btree_cur		*cur,
+inline xfs_failaddr_t
+xfs_refcount_check_perag_irec(
+	struct xfs_perag		*pag,
 	const struct xfs_refcount_irec	*irec)
 {
-	struct xfs_perag		*pag = cur->bc_ag.pag;
-
 	if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN)
 		return __this_address;
 
@@ -144,6 +141,15 @@ xfs_refcount_check_irec(
 	return NULL;
 }
 
+/* Simple checks for refcount records. */
+xfs_failaddr_t
+xfs_refcount_check_irec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*irec)
+{
+	return xfs_refcount_check_perag_irec(cur->bc_ag.pag, irec);
+}
+
 static inline int
 xfs_refcount_complain_bad_rec(
 	struct xfs_btree_cur		*cur,
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 783cd89ca195..2d6fecb258bb 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -117,6 +117,8 @@ extern int xfs_refcount_has_records(struct xfs_btree_cur *cur,
 union xfs_btree_rec;
 extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_refcount_irec *irec);
+xfs_failaddr_t xfs_refcount_check_perag_irec(struct xfs_perag *pag,
+		const struct xfs_refcount_irec *irec);
 xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur,
 		const struct xfs_refcount_irec *irec);
 extern int xfs_refcount_insert(struct xfs_btree_cur *cur,
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 2ec45e2ffbe1..1bf991bf452f 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -232,7 +232,18 @@ xfs_refcountbt_verify(
 
 	level = be16_to_cpu(block->bb_level);
 	if (pag && pag->pagf_init) {
-		if (level >= pag->pagf_refcount_level)
+		unsigned int	maxlevel = pag->pagf_refcount_level;
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+		/*
+		 * Online repair could be rewriting the refcount btree, so
+		 * we'll validate against the larger of either tree while this
+		 * is going on.
+		 */
+		maxlevel = max_t(unsigned int, maxlevel,
+				pag->pagf_alt_refcount_level);
+#endif
+		if (level >= maxlevel)
 			return __this_address;
 	} else if (level >= mp->m_refc_maxlevels)
 		return __this_address;
diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c
new file mode 100644
index 000000000000..d3f0384d084d
--- /dev/null
+++ b/fs/xfs/scrub/refcount_repair.c
@@ -0,0 +1,791 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_btree_staging.h"
+#include "xfs_inode.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_error.h"
+#include "xfs_ag.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/newbt.h"
+#include "scrub/reap.h"
+
+/*
+ * Rebuilding the Reference Count Btree
+ * ====================================
+ *
+ * This algorithm is "borrowed" from xfs_repair.  Imagine the rmap
+ * entries as rectangles representing extents of physical blocks, and
+ * that the rectangles can be laid down to allow them to overlap each
+ * other; then we know that we must emit a refcnt btree entry wherever
+ * the amount of overlap changes, i.e. the emission stimulus is
+ * level-triggered:
+ *
+ *                 -    ---
+ *       --      ----- ----   ---        ------
+ * --   ----     ----------- ----     ---------
+ * -------------------------------- -----------
+ * ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
+ * 2 1  23 21    3 43 234  2123  1 01 2  3     0
+ *
+ * For our purposes, an rmap is a tuple (startblock, len, fileoff, owner).
+ *
+ * Note that in the actual refcnt btree we don't store the refcount < 2
+ * cases because the bnobt tells us which blocks are free; single-use
+ * blocks aren't recorded in the bnobt or the refcntbt.  If the rmapbt
+ * supports storing multiple entries covering a given block we could
+ * theoretically dispense with the refcntbt and simply count rmaps, but
+ * that's inefficient in the (hot) write path, so we'll take the cost of
+ * the extra tree to save time.  Also there's no guarantee that rmap
+ * will be enabled.
+ *
+ * Given an array of rmaps sorted by physical block number, a starting
+ * physical block (sp), a bag to hold rmaps that cover sp, and the next
+ * physical block where the level changes (np), we can reconstruct the
+ * refcount btree as follows:
+ *
+ * While there are still unprocessed rmaps in the array,
+ *  - Set sp to the physical block (pblk) of the next unprocessed rmap.
+ *  - Add to the bag all rmaps in the array where startblock == sp.
+ *  - Set np to the physical block where the bag size will change.  This
+ *    is the minimum of (the pblk of the next unprocessed rmap) and
+ *    (startblock + len of each rmap in the bag).
+ *  - Record the bag size as old_bag_size.
+ *
+ *  - While the bag isn't empty,
+ *     - Remove from the bag all rmaps where startblock + len == np.
+ *     - Add to the bag all rmaps in the array where startblock == np.
+ *     - If the bag size isn't old_bag_size, store the refcount entry
+ *       (sp, np - sp, bag_size) in the refcnt btree.
+ *     - If the bag is empty, break out of the inner loop.
+ *     - Set old_bag_size to the bag size
+ *     - Set sp = np.
+ *     - Set np to the physical block where the bag size will change.
+ *       This is the minimum of (the pblk of the next unprocessed rmap)
+ *       and (startblock + len of each rmap in the bag).
+ *
+ * Like all the other repairers, we make a list of all the refcount
+ * records we need, then reinitialize the refcount btree root and
+ * insert all the records.
+ */
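As a self-contained illustration of the level-triggered emission described above, here is a brute-force userspace sketch: it sweeps interval boundaries instead of maintaining an rmap bag, does not merge adjacent ranges with equal counts, and every name in it is invented for this example:

#include <stdio.h>
#include <stdlib.h>

/* One reverse mapping, reduced to the fields the algorithm cares about. */
struct toy_rmap { unsigned int start, len; };

static int cmp_uint(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a, y = *(const unsigned int *)b;
	return (x > y) - (x < y);
}

/*
 * Emit (start, len, refcount) wherever the number of overlapping rmaps
 * changes.  Ranges with refcount < 2 are skipped, just as the real refcount
 * btree omits them because they are implied by the free space btrees.
 */
static void emit_refcounts(const struct toy_rmap *r, unsigned int nr)
{
	unsigned int *edges = malloc(2 * nr * sizeof(*edges));
	unsigned int i, j, nedges = 0;

	if (!edges)
		return;
	for (i = 0; i < nr; i++) {
		edges[nedges++] = r[i].start;
		edges[nedges++] = r[i].start + r[i].len;
	}
	qsort(edges, nedges, sizeof(*edges), cmp_uint);

	for (j = 0; j + 1 < nedges; j++) {
		unsigned int lo = edges[j], hi = edges[j + 1], refcount = 0;

		if (lo == hi)
			continue;
		for (i = 0; i < nr; i++)
			if (r[i].start <= lo && lo < r[i].start + r[i].len)
				refcount++;
		if (refcount >= 2)
			printf("start %u len %u refcount %u\n",
			       lo, hi - lo, refcount);
	}
	free(edges);
}

int main(void)
{
	/* Three files sharing parts of the same physical extent. */
	struct toy_rmap rmaps[] = {
		{ .start = 0,  .len = 20 },
		{ .start = 5,  .len = 10 },
		{ .start = 10, .len = 20 },
	};

	emit_refcounts(rmaps, 3);
	return 0;
}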
+
+/* The only parts of the rmap that we care about for computing refcounts. */
+struct xrep_refc_rmap {
+	xfs_agblock_t		startblock;
+	xfs_extlen_t		blockcount;
+} __packed;
+
+struct xrep_refc {
+	/* refcount extents */
+	struct xfarray		*refcount_records;
+
+	/* new refcountbt information */
+	struct xrep_newbt	new_btree;
+
+	/* old refcountbt blocks */
+	struct xagb_bitmap	old_refcountbt_blocks;
+
+	struct xfs_scrub	*sc;
+
+	/* get_records()'s position in the refcount record array. */
+	xfarray_idx_t		array_cur;
+
+	/* # of refcountbt blocks */
+	xfs_extlen_t		btblocks;
+};
+
+/* Check for any obvious conflicts with this shared/CoW staging extent. */
+STATIC int
+xrep_refc_check_ext(
+	struct xfs_scrub		*sc,
+	const struct xfs_refcount_irec	*rec)
+{
+	enum xbtree_recpacking		outcome;
+	int				error;
+
+	if (xfs_refcount_check_perag_irec(sc->sa.pag, rec) != NULL)
+		return -EFSCORRUPTED;
+
+	/* Make sure this isn't free space. */
+	error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rc_startblock,
+			rec->rc_blockcount, &outcome);
+	if (error)
+		return error;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		return -EFSCORRUPTED;
+
+	/* Must not be an inode chunk. */
+	error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur,
+			rec->rc_startblock, rec->rc_blockcount, &outcome);
+	if (error)
+		return error;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Record a reference count extent. */
+STATIC int
+xrep_refc_stash(
+	struct xrep_refc		*rr,
+	enum xfs_refc_domain		domain,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			len,
+	uint64_t			refcount)
+{
+	struct xfs_refcount_irec	irec = {
+		.rc_startblock		= agbno,
+		.rc_blockcount		= len,
+		.rc_domain		= domain,
+	};
+	struct xfs_scrub		*sc = rr->sc;
+	int				error = 0;
+
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	irec.rc_refcount = min_t(uint64_t, MAXREFCOUNT, refcount);
+
+	error = xrep_refc_check_ext(rr->sc, &irec);
+	if (error)
+		return error;
+
+	trace_xrep_refc_found(sc->sa.pag, &irec);
+
+	return xfarray_append(rr->refcount_records, &irec);
+}
+
+/* Record a CoW staging extent. */
+STATIC int
+xrep_refc_stash_cow(
+	struct xrep_refc		*rr,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			len)
+{
+	return xrep_refc_stash(rr, XFS_REFC_DOMAIN_COW, agbno, len, 1);
+}
+
+/* Decide if an rmap could describe a shared extent. */
+static inline bool
+xrep_refc_rmap_shareable(
+	struct xfs_mount		*mp,
+	const struct xfs_rmap_irec	*rmap)
+{
+	/* AG metadata are never shareable */
+	if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
+		return false;
+
+	/* Metadata in files are never shareable */
+	if (xfs_internal_inum(mp, rmap->rm_owner))
+		return false;
+
+	/* Metadata and unwritten file blocks are not shareable. */
+	if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+			      XFS_RMAP_UNWRITTEN))
+		return false;
+
+	return true;
+}
+
+/*
+ * Walk along the reverse mapping records until we find one that could describe
+ * a shared extent.
+ */
+STATIC int
+xrep_refc_walk_rmaps(
+	struct xrep_refc	*rr,
+	struct xrep_refc_rmap	*rrm,
+	bool			*have_rec)
+{
+	struct xfs_rmap_irec	rmap;
+	struct xfs_btree_cur	*cur = rr->sc->sa.rmap_cur;
+	struct xfs_mount	*mp = cur->bc_mp;
+	int			have_gt;
+	int			error = 0;
+
+	*have_rec = false;
+
+	/*
+	 * Loop through the remaining rmaps.  Remember CoW staging
+	 * extents and the refcountbt blocks from the old tree for later
+	 * disposal.  We can only share written data fork extents, so
+	 * keep looping until we find an rmap for one.
+	 */
+	do {
+		if (xchk_should_terminate(rr->sc, &error))
+			return error;
+
+		error = xfs_btree_increment(cur, 0, &have_gt);
+		if (error)
+			return error;
+		if (!have_gt)
+			return 0;
+
+		error = xfs_rmap_get_rec(cur, &rmap, &have_gt);
+		if (error)
+			return error;
+		if (XFS_IS_CORRUPT(mp, !have_gt))
+			return -EFSCORRUPTED;
+
+		if (rmap.rm_owner == XFS_RMAP_OWN_COW) {
+			error = xrep_refc_stash_cow(rr, rmap.rm_startblock,
+					rmap.rm_blockcount);
+			if (error)
+				return error;
+		} else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) {
+			/* refcountbt block, dump it when we're done. */
+			rr->btblocks += rmap.rm_blockcount;
+			error = xagb_bitmap_set(&rr->old_refcountbt_blocks,
+					rmap.rm_startblock, rmap.rm_blockcount);
+			if (error)
+				return error;
+		}
+	} while (!xrep_refc_rmap_shareable(mp, &rmap));
+
+	rrm->startblock = rmap.rm_startblock;
+	rrm->blockcount = rmap.rm_blockcount;
+	*have_rec = true;
+	return 0;
+}
+
+static inline uint32_t
+xrep_refc_encode_startblock(
+	const struct xfs_refcount_irec	*irec)
+{
+	uint32_t			start;
+
+	start = irec->rc_startblock & ~XFS_REFC_COWFLAG;
+	if (irec->rc_domain == XFS_REFC_DOMAIN_COW)
+		start |= XFS_REFC_COWFLAG;
+
+	return start;
+}
+
+/* Sort in the same order as the ondisk records. */
+static int
+xrep_refc_extent_cmp(
+	const void			*a,
+	const void			*b)
+{
+	const struct xfs_refcount_irec	*ap = a;
+	const struct xfs_refcount_irec	*bp = b;
+	uint32_t			sa, sb;
+
+	sa = xrep_refc_encode_startblock(ap);
+	sb = xrep_refc_encode_startblock(bp);
+
+	if (sa > sb)
+		return 1;
+	if (sa < sb)
+		return -1;
+	return 0;
+}
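A brief worked example of the sort key, assuming XFS_REFC_COWFLAG is bit 31 of the 32-bit startblock:

/*
 * A shared extent at agbno 100 encodes to 0x64, while a CoW staging extent
 * at agbno 100 encodes to 0x80000064.  Sorting on the encoded value thus
 * places every shared-domain record ahead of every CoW-domain record,
 * matching the ondisk key order of the refcount btree.
 */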
+
+/*
+ * Sort the refcount extents by startblock or else the btree records will be in
+ * the wrong order.  Make sure the records do not overlap in physical space.
+ */
+STATIC int
+xrep_refc_sort_records(
+	struct xrep_refc		*rr)
+{
+	struct xfs_refcount_irec	irec;
+	xfarray_idx_t			cur;
+	enum xfs_refc_domain		dom = XFS_REFC_DOMAIN_SHARED;
+	xfs_agblock_t			next_agbno = 0;
+	int				error;
+
+	error = xfarray_sort(rr->refcount_records, xrep_refc_extent_cmp,
+			XFARRAY_SORT_KILLABLE);
+	if (error)
+		return error;
+
+	foreach_xfarray_idx(rr->refcount_records, cur) {
+		if (xchk_should_terminate(rr->sc, &error))
+			return error;
+
+		error = xfarray_load(rr->refcount_records, cur, &irec);
+		if (error)
+			return error;
+
+		if (dom == XFS_REFC_DOMAIN_SHARED &&
+		    irec.rc_domain == XFS_REFC_DOMAIN_COW) {
+			dom = irec.rc_domain;
+			next_agbno = 0;
+		}
+
+		if (dom != irec.rc_domain)
+			return -EFSCORRUPTED;
+		if (irec.rc_startblock < next_agbno)
+			return -EFSCORRUPTED;
+
+		next_agbno = irec.rc_startblock + irec.rc_blockcount;
+	}
+
+	return error;
+}
+
+#define RRM_NEXT(r)	((r).startblock + (r).blockcount)
+/*
+ * Find the next block where the refcount changes, given the next rmap we
+ * looked at and the ones we're already tracking.
+ */
+static inline int
+xrep_refc_next_edge(
+	struct xfarray		*rmap_bag,
+	struct xrep_refc_rmap	*next_rrm,
+	bool			next_valid,
+	xfs_agblock_t		*nbnop)
+{
+	struct xrep_refc_rmap	rrm;
+	xfarray_idx_t		array_cur = XFARRAY_CURSOR_INIT;
+	xfs_agblock_t		nbno = NULLAGBLOCK;
+	int			error;
+
+	if (next_valid)
+		nbno = next_rrm->startblock;
+
+	while ((error = xfarray_iter(rmap_bag, &array_cur, &rrm)) == 1)
+		nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm));
+
+	if (error)
+		return error;
+
+	/*
+	 * We should have found /something/ because either next_rrm is the next
+	 * interesting rmap to look at after emitting this refcount extent, or
+	 * there are other rmaps in rmap_bag contributing to the current
+	 * sharing count.  But if something is seriously wrong, bail out.
+	 */
+	if (nbno == NULLAGBLOCK)
+		return -EFSCORRUPTED;
+
+	*nbnop = nbno;
+	return 0;
+}
+
+/*
+ * Walk forward through the rmap btree to collect all rmaps starting at
+ * @bno in @rmap_bag.  These represent the file(s) that share ownership of
+ * the current block.  Upon return, the rmap cursor points to the last record
+ * satisfying the startblock constraint.
+ */
+static int
+xrep_refc_push_rmaps_at(
+	struct xrep_refc	*rr,
+	struct xfarray		*rmap_bag,
+	xfs_agblock_t		bno,
+	struct xrep_refc_rmap	*rrm,
+	bool			*have,
+	uint64_t		*stack_sz)
+{
+	struct xfs_scrub	*sc = rr->sc;
+	int			have_gt;
+	int			error;
+
+	while (*have && rrm->startblock == bno) {
+		error = xfarray_store_anywhere(rmap_bag, rrm);
+		if (error)
+			return error;
+		(*stack_sz)++;
+		error = xrep_refc_walk_rmaps(rr, rrm, have);
+		if (error)
+			return error;
+	}
+
+	error = xfs_btree_decrement(sc->sa.rmap_cur, 0, &have_gt);
+	if (error)
+		return error;
+	if (XFS_IS_CORRUPT(sc->mp, !have_gt))
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Iterate all the rmap records to generate reference count data. */
+STATIC int
+xrep_refc_find_refcounts(
+	struct xrep_refc	*rr)
+{
+	struct xrep_refc_rmap	rrm;
+	struct xfs_scrub	*sc = rr->sc;
+	struct xfarray		*rmap_bag;
+	uint64_t		old_stack_sz;
+	uint64_t		stack_sz = 0;
+	xfs_agblock_t		sbno;
+	xfs_agblock_t		cbno;
+	xfs_agblock_t		nbno;
+	bool			have;
+	int			error;
+
+	xrep_ag_btcur_init(sc, &sc->sa);
+
+	/*
+	 * Set up a sparse array to store all the rmap records that we're
+	 * tracking to generate a reference count record.  If this exceeds
+	 * MAXREFCOUNT, we clamp rc_refcount.
+	 */
+	error = xfarray_create(sc->mp, "rmap bag", 0,
+			sizeof(struct xrep_refc_rmap), &rmap_bag);
+	if (error)
+		goto out_cur;
+
+	/* Start the rmapbt cursor to the left of all records. */
+	error = xfs_btree_goto_left_edge(sc->sa.rmap_cur);
+	if (error)
+		goto out_bag;
+
+	/* Process reverse mappings into refcount data. */
+	while (xfs_btree_has_more_records(sc->sa.rmap_cur)) {
+		/* Push all rmaps with pblk == sbno onto the stack */
+		error = xrep_refc_walk_rmaps(rr, &rrm, &have);
+		if (error)
+			goto out_bag;
+		if (!have)
+			break;
+		sbno = cbno = rrm.startblock;
+		error = xrep_refc_push_rmaps_at(rr, rmap_bag, sbno,
+					&rrm, &have, &stack_sz);
+		if (error)
+			goto out_bag;
+
+		/* Set nbno to the bno of the next refcount change */
+		error = xrep_refc_next_edge(rmap_bag, &rrm, have, &nbno);
+		if (error)
+			goto out_bag;
+
+		ASSERT(nbno > sbno);
+		old_stack_sz = stack_sz;
+
+		/* While stack isn't empty... */
+		while (stack_sz) {
+			xfarray_idx_t	array_cur = XFARRAY_CURSOR_INIT;
+
+			/* Pop all rmaps that end at nbno */
+			while ((error = xfarray_iter(rmap_bag, &array_cur,
+								&rrm)) == 1) {
+				if (RRM_NEXT(rrm) != nbno)
+					continue;
+				error = xfarray_unset(rmap_bag, array_cur - 1);
+				if (error)
+					goto out_bag;
+				stack_sz--;
+			}
+			if (error)
+				goto out_bag;
+
+			/* Push array items that start at nbno */
+			error = xrep_refc_walk_rmaps(rr, &rrm, &have);
+			if (error)
+				goto out_bag;
+			if (have) {
+				error = xrep_refc_push_rmaps_at(rr, rmap_bag,
+						nbno, &rrm, &have, &stack_sz);
+				if (error)
+					goto out_bag;
+			}
+
+			/* Emit refcount if necessary */
+			ASSERT(nbno > cbno);
+			if (stack_sz != old_stack_sz) {
+				if (old_stack_sz > 1) {
+					error = xrep_refc_stash(rr,
+							XFS_REFC_DOMAIN_SHARED,
+							cbno, nbno - cbno,
+							old_stack_sz);
+					if (error)
+						goto out_bag;
+				}
+				cbno = nbno;
+			}
+
+			/* Stack empty, go find the next rmap */
+			if (stack_sz == 0)
+				break;
+			old_stack_sz = stack_sz;
+			sbno = nbno;
+
+			/* Set nbno to the bno of the next refcount change */
+			error = xrep_refc_next_edge(rmap_bag, &rrm, have,
+					&nbno);
+			if (error)
+				goto out_bag;
+
+			ASSERT(nbno > sbno);
+		}
+	}
+
+	ASSERT(stack_sz == 0);
+out_bag:
+	xfarray_destroy(rmap_bag);
+out_cur:
+	xchk_ag_btcur_free(&sc->sa);
+	return error;
+}
+#undef RRM_NEXT
+
+/* Retrieve refcountbt data for bulk load. */
+STATIC int
+xrep_refc_get_records(
+	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
+	void				*priv)
+{
+	struct xfs_refcount_irec	*irec = &cur->bc_rec.rc;
+	struct xrep_refc		*rr = priv;
+	union xfs_btree_rec		*block_rec;
+	unsigned int			loaded;
+	int				error;
+
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		error = xfarray_load(rr->refcount_records, rr->array_cur++,
+				irec);
+		if (error)
+			return error;
+
+		block_rec = xfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
+}
+
+/* Feed one of the new btree blocks to the bulk loader. */
+STATIC int
+xrep_refc_claim_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	void			*priv)
+{
+	struct xrep_refc	*rr = priv;
+	int			error;
+
+	error = xrep_newbt_relog_autoreap(&rr->new_btree);
+	if (error)
+		return error;
+
+	return xrep_newbt_claim_block(cur, &rr->new_btree, ptr);
+}
+
+/* Update the AGF counters. */
+STATIC int
+xrep_refc_reset_counters(
+	struct xrep_refc	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+	struct xfs_perag	*pag = sc->sa.pag;
+
+	/*
+	 * After we commit the new btree to disk, it is possible that the
+	 * process to reap the old btree blocks will race with the AIL trying
+	 * to checkpoint the old btree blocks into the filesystem.  If the new
+	 * tree is shorter than the old one, the refcountbt write verifier will
+	 * fail and the AIL will shut down the filesystem.
+	 *
+	 * To avoid this, save the old incore btree height values as the alt
+	 * height values before re-initializing the perag info from the updated
+	 * AGF to capture all the new values.
+	 */
+	pag->pagf_alt_refcount_level = pag->pagf_refcount_level;
+
+	/* Reinitialize with the values we just logged. */
+	return xrep_reinit_pagf(sc);
+}
+
+/*
+ * Use the collected refcount information to stage a new refcount btree.  If
+ * this is successful we'll return with the new btree root information logged
+ * to the repair transaction but not yet committed.
+ */
+STATIC int
+xrep_refc_build_new_tree(
+	struct xrep_refc	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+	struct xfs_btree_cur	*refc_cur;
+	struct xfs_perag	*pag = sc->sa.pag;
+	xfs_fsblock_t		fsbno;
+	int			error;
+
+	error = xrep_refc_sort_records(rr);
+	if (error)
+		return error;
+
+	/*
+	 * Prepare to construct the new btree by reserving disk space for the
+	 * new btree and setting up all the accounting information we'll need
+	 * to root the new btree while it's under construction and before we
+	 * attach it to the AG header.
+	 */
+	fsbno = XFS_AGB_TO_FSB(sc->mp, pag->pag_agno, xfs_refc_block(sc->mp));
+	xrep_newbt_init_ag(&rr->new_btree, sc, &XFS_RMAP_OINFO_REFC, fsbno,
+			XFS_AG_RESV_METADATA);
+	rr->new_btree.bload.get_records = xrep_refc_get_records;
+	rr->new_btree.bload.claim_block = xrep_refc_claim_block;
+
+	/* Compute how many blocks we'll need. */
+	refc_cur = xfs_refcountbt_stage_cursor(sc->mp, &rr->new_btree.afake,
+			pag);
+	error = xfs_btree_bload_compute_geometry(refc_cur,
+			&rr->new_btree.bload,
+			xfarray_length(rr->refcount_records));
+	if (error)
+		goto err_cur;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		goto err_cur;
+
+	/* Reserve the space we'll need for the new btree. */
+	error = xrep_newbt_alloc_blocks(&rr->new_btree,
+			rr->new_btree.bload.nr_blocks);
+	if (error)
+		goto err_cur;
+
+	/*
+	 * Due to btree slack factors, it's possible for a new btree to be one
+	 * level taller than the old btree.  Update the incore btree height so
+	 * that we don't trip the verifiers when writing the new btree blocks
+	 * to disk.
+	 */
+	pag->pagf_alt_refcount_level = rr->new_btree.bload.btree_height;
+
+	/* Add all observed refcount records. */
+	rr->array_cur = XFARRAY_CURSOR_INIT;
+	error = xfs_btree_bload(refc_cur, &rr->new_btree.bload, rr);
+	if (error)
+		goto err_level;
+
+	/*
+	 * Install the new btree in the AG header.  After this point the old
+	 * btree is no longer accessible and the new tree is live.
+	 */
+	xfs_refcountbt_commit_staged_btree(refc_cur, sc->tp, sc->sa.agf_bp);
+	xfs_btree_del_cursor(refc_cur, 0);
+
+	/* Reset the AGF counters now that we've changed the btree shape. */
+	error = xrep_refc_reset_counters(rr);
+	if (error)
+		goto err_newbt;
+
+	/* Dispose of any unused blocks and the accounting information. */
+	error = xrep_newbt_commit(&rr->new_btree);
+	if (error)
+		return error;
+
+	return xrep_roll_ag_trans(sc);
+
+err_level:
+	pag->pagf_alt_refcount_level = 0;
+err_cur:
+	xfs_btree_del_cursor(refc_cur, error);
+err_newbt:
+	xrep_newbt_cancel(&rr->new_btree);
+	return error;
+}
+
+/*
+ * Now that we've logged the root of the new btree, invalidate all of the
+ * old blocks and free them.
+ */
+STATIC int
+xrep_refc_remove_old_tree(
+	struct xrep_refc	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+	struct xfs_perag	*pag = sc->sa.pag;
+	int			error;
+
+	/* Free the old refcountbt blocks if they're not in use. */
+	error = xrep_reap_agblocks(sc, &rr->old_refcountbt_blocks,
+			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA);
+	if (error)
+		return error;
+
+	/*
+	 * Now that we've zapped all the old refcountbt blocks we can turn off
+	 * the alternate height mechanism and reset the per-AG space
+	 * reservations.
+	 */
+	pag->pagf_alt_refcount_level = 0;
+	sc->flags |= XREP_RESET_PERAG_RESV;
+	return 0;
+}
+
+/* Rebuild the refcount btree. */
+int
+xrep_refcountbt(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_refc	*rr;
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_has_rmapbt(mp))
+		return -EOPNOTSUPP;
+
+	rr = kzalloc(sizeof(struct xrep_refc), XCHK_GFP_FLAGS);
+	if (!rr)
+		return -ENOMEM;
+	rr->sc = sc;
+
+	/* Set up enough storage to handle one refcount record per block. */
+	error = xfarray_create(mp, "refcount records",
+			mp->m_sb.sb_agblocks,
+			sizeof(struct xfs_refcount_irec),
+			&rr->refcount_records);
+	if (error)
+		goto out_rr;
+
+	/* Collect all reference counts. */
+	xagb_bitmap_init(&rr->old_refcountbt_blocks);
+	error = xrep_refc_find_refcounts(rr);
+	if (error)
+		goto out_bitmap;
+
+	/* Rebuild the refcount information. */
+	error = xrep_refc_build_new_tree(rr);
+	if (error)
+		goto out_bitmap;
+
+	/* Kill the old tree. */
+	error = xrep_refc_remove_old_tree(rr);
+
+out_bitmap:
+	xagb_bitmap_destroy(&rr->old_refcountbt_blocks);
+	xfarray_destroy(rr->refcount_records);
+out_rr:
+	kfree(rr);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index b6e60362b7cb..e93cae73cf61 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -71,6 +71,7 @@ int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
+int xrep_refcountbt(struct xfs_scrub *sc);
 
 int xrep_reinit_pagf(struct xfs_scrub *sc);
 int xrep_reinit_pagi(struct xfs_scrub *sc);
@@ -123,6 +124,7 @@ xrep_setup_nothing(
 #define xrep_agi			xrep_notsupported
 #define xrep_allocbt			xrep_notsupported
 #define xrep_iallocbt			xrep_notsupported
+#define xrep_refcountbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index aef30515c050..449c3e623c63 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -277,7 +277,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.setup	= xchk_setup_ag_refcountbt,
 		.scrub	= xchk_refcountbt,
 		.has	= xfs_has_reflink,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_refcountbt,
 	},
 	[XFS_SCRUB_TYPE_INODE] = {	/* inode record */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 5e66be26055b..8532dcd16630 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1186,27 +1186,29 @@ TRACE_EVENT(xrep_ibt_found,
 		  __entry->freemask)
 )
 
-TRACE_EVENT(xrep_refcount_extent_fn,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 struct xfs_refcount_irec *irec),
-	TP_ARGS(mp, agno, irec),
+TRACE_EVENT(xrep_refc_found,
+	TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *rec),
+	TP_ARGS(pag, rec),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
+		__field(enum xfs_refc_domain, domain)
 		__field(xfs_agblock_t, startblock)
 		__field(xfs_extlen_t, blockcount)
 		__field(xfs_nlink_t, refcount)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->agno = agno;
-		__entry->startblock = irec->rc_startblock;
-		__entry->blockcount = irec->rc_blockcount;
-		__entry->refcount = irec->rc_refcount;
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->domain = rec->rc_domain;
+		__entry->startblock = rec->rc_startblock;
+		__entry->blockcount = rec->rc_blockcount;
+		__entry->refcount = rec->rc_refcount;
 	),
-	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u",
+	TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
+		  __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->startblock,
 		  __entry->blockcount,
 		  __entry->refcount)
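
To make the sweep in xrep_refc_find_refcounts() above easier to follow,
here is a small standalone userspace model of the same idea.  This is
only a sketch -- the struct, the fixed-size bag, and the sample mappings
are invented for illustration and are not part of the patch -- but it
walks a start-sorted list of reverse mappings, keeps a bag of the
mappings that overlap the current block, and emits a refcount record
whenever the size of the bag changes, which is what the kernel code does
with its xfarray-backed rmap bag:

/* Toy model of the refcount sweep; illustrative only, not kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct rmap { unsigned int start, len; };	/* one reverse mapping */

static int cmp_start(const void *a, const void *b)
{
	const struct rmap *ra = a, *rb = b;

	return (ra->start > rb->start) - (ra->start < rb->start);
}

int main(void)
{
	/* Three files mapping overlapping ranges of the same AG. */
	struct rmap rmaps[] = {
		{ 10, 20 },	/* agbno 10-29 */
		{ 15, 10 },	/* agbno 15-24 */
		{ 20, 15 },	/* agbno 20-34 */
	};
	unsigned int nr = sizeof(rmaps) / sizeof(rmaps[0]);
	unsigned int bag_end[16];	/* first block after each bag member */
	unsigned int bag_sz = 0, i = 0, cbno, nbno, old_sz, j;

	qsort(rmaps, nr, sizeof(rmaps[0]), cmp_start);
	cbno = rmaps[0].start;

	while (i < nr || bag_sz > 0) {
		/* Next edge: earliest of next rmap start or any bag end. */
		nbno = i < nr ? rmaps[i].start : ~0U;
		for (j = 0; j < bag_sz; j++)
			if (bag_end[j] < nbno)
				nbno = bag_end[j];

		/* Pop mappings ending here, push mappings starting here. */
		old_sz = bag_sz;
		for (j = 0; j < bag_sz; )
			if (bag_end[j] == nbno)
				bag_end[j] = bag_end[--bag_sz];
			else
				j++;
		while (i < nr && rmaps[i].start == nbno) {
			bag_end[bag_sz++] = rmaps[i].start + rmaps[i].len;
			i++;
		}

		/* Emit a record whenever the sharing count changes. */
		if (bag_sz != old_sz) {
			if (old_sz > 1)
				printf("refcount: agbno %u len %u refcount %u\n",
						cbno, nbno - cbno, old_sz);
			cbno = nbno;
		}
	}
	return 0;
}

For the three sample mappings this prints shared extents at agbno 15
(refcount 2), agbno 20 (refcount 3), and agbno 25 (refcount 2); the
singly-owned ranges are skipped, which mirrors the old_stack_sz > 1
check in xrep_refc_find_refcounts().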



* [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (6 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/6] xfs: repair inode records Darrick J. Wong
                     ` (5 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
                   ` (69 subsequent siblings)
  77 siblings, 6 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

In this series, online repair gains the ability to repair inode records.
To do this, we must repair the ondisk inode and fork information enough
to pass the iget verifiers and hence make the inode igettable again.
Once that's done, we can perform higher level repairs on the incore
inode.  The fstests counterpart of this patchset implements stress
testing of repair.
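
As a rough sketch of how those pieces fit together, the top level repair
function added in patch 3 looks like this (condensed here; the dquot and
reflink handling are trimmed, so treat this as an outline rather than
the final code):

int
xrep_inode(
	struct xfs_scrub	*sc)
{
	int			error;

	/* No incore inode?  The iget verifiers tripped; fix the ondisk dinode. */
	if (!sc->ip) {
		struct xrep_inode	*ri = sc->buf;

		error = xrep_dinode_problems(ri);	/* repair dinode, retry iget */
		if (error)
			return error;
		if (!sc->ip)				/* still not igettable */
			return -EFSCORRUPTED;
	}

	xfs_trans_ijoin(sc->tp, sc->ip, 0);

	/* Now perform the higher level repairs on the incore inode. */
	if ((sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ||
	    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT))
		return xrep_inode_problems(sc);

	return 0;
}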

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-inodes

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-inodes
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_attr_leaf.c      |   32 -
 fs/xfs/libxfs/xfs_attr_leaf.h      |    2 
 fs/xfs/libxfs/xfs_bmap.c           |   22 -
 fs/xfs/libxfs/xfs_bmap.h           |    2 
 fs/xfs/libxfs/xfs_dir2_priv.h      |    2 
 fs/xfs/libxfs/xfs_dir2_sf.c        |   29 -
 fs/xfs/libxfs/xfs_format.h         |    3 
 fs/xfs/libxfs/xfs_shared.h         |    1 
 fs/xfs/libxfs/xfs_symlink_remote.c |   21 
 fs/xfs/scrub/alloc.c               |    2 
 fs/xfs/scrub/bmap.c                |    4 
 fs/xfs/scrub/common.c              |   26 +
 fs/xfs/scrub/common.h              |    8 
 fs/xfs/scrub/dir.c                 |   21 
 fs/xfs/scrub/inode.c               |   14 
 fs/xfs/scrub/inode_repair.c        | 1565 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/parent.c              |   11 
 fs/xfs/scrub/repair.c              |   47 +
 fs/xfs/scrub/repair.h              |   28 +
 fs/xfs/scrub/rtbitmap.c            |    4 
 fs/xfs/scrub/rtsummary.c           |    4 
 fs/xfs/scrub/scrub.c               |    2 
 fs/xfs/scrub/trace.h               |  174 ++++
 24 files changed, 1982 insertions(+), 43 deletions(-)
 create mode 100644 fs/xfs/scrub/inode_repair.c



* [PATCH 1/6] xfs: disable online repair quota helpers when quota not enabled
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 4/6] xfs: zap broken inode forks Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/6] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 6/6] xfs: repair obviously broken inode modes Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't compile the quota helper functions if quota isn't being built into
the XFS module.
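
The payoff is that repair code can call these helpers unconditionally;
when CONFIG_XFS_QUOTA is disabled the stubs added below compile away.  A
sketch of a typical call site (the surrounding code is illustrative):

	error = xrep_ino_dqattach(sc);	/* expands to (0) without quota */
	if (error)
		return error;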

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/repair.c |    2 ++
 fs/xfs/scrub/repair.h |    9 +++++++++
 2 files changed, 11 insertions(+)


diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index bc5bbff1558e..d9b0d19c8e2d 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -674,6 +674,7 @@ xrep_find_ag_btree_roots(
 	return error;
 }
 
+#ifdef CONFIG_XFS_QUOTA
 /* Force a quotacheck the next time we mount. */
 void
 xrep_force_quotacheck(
@@ -735,6 +736,7 @@ xrep_ino_dqattach(
 
 	return error;
 }
+#endif /* CONFIG_XFS_QUOTA */
 
 /* Initialize all the btree cursors for an AG repair. */
 void
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index e93cae73cf61..441b6b073001 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -48,8 +48,15 @@ struct xrep_find_ag_btree {
 
 int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 		struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp);
+
+#ifdef CONFIG_XFS_QUOTA
 void xrep_force_quotacheck(struct xfs_scrub *sc, xfs_dqtype_t type);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
+#else
+# define xrep_force_quotacheck(sc, type)	((void)0)
+# define xrep_ino_dqattach(sc)			(0)
+#endif /* CONFIG_XFS_QUOTA */
+
 int xrep_reset_perag_resv(struct xfs_scrub *sc);
 
 /* Repair setup functions */
@@ -78,6 +85,8 @@ int xrep_reinit_pagi(struct xfs_scrub *sc);
 
 #else
 
+#define xrep_ino_dqattach(sc)	(0)
+
 static inline int
 xrep_attempt(
 	struct xfs_scrub	*sc)



* [PATCH 2/6] xfs: try to attach dquots to files before repairing them
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/6] xfs: repair inode records Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/6] xfs: zap broken inode forks Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Inode resource usage is tracked in the quota metadata.  Repairing a file
might change the resources used by that file, which means that we need
to attach dquots to the file that we're examining before accessing
anything in the file protected by the ILOCK.

However, there's a twist: a dquot cache miss requires the dquot to be
read in from the quota file, during which we drop the ILOCK on the file
being examined.  This means that we *must* try to attach the dquots
before taking the ILOCK.

Therefore, dquots must be attached to files in the scrub setup function.
If doing so yields corruption errors (or unknown dquot errors), we
instead clear the quotachecked status, which will cause a quotacheck on
next mount.  A future series will make this trigger live quotacheck.
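
Concretely, the scrub setup functions changed below end up with this
shape (a sketch that paraphrases the hunks in this patch; error paths
abbreviated):

	error = xchk_trans_alloc(sc, resblks);
	if (error)
		return error;

	/* Attach dquots while we hold no ILOCKs; a cache miss reads the quota file. */
	error = xchk_ino_dqattach(sc);
	if (error)
		return error;

	/* Only now is it safe to take the ILOCK of the inode being scrubbed. */
	xchk_ilock(sc, XFS_ILOCK_EXCL);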

While we're here, change the xrep_ino_dqattach function to use the
unlocked dqattach functions so that we avoid cycling the ILOCK if the
inode already has dquots attached.  This makes the naming and locking
requirements consistent with the rest of the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap.c      |    4 ++++
 fs/xfs/scrub/common.c    |   25 +++++++++++++++++++++++++
 fs/xfs/scrub/common.h    |    6 ++++++
 fs/xfs/scrub/inode.c     |    4 ++++
 fs/xfs/scrub/repair.c    |   13 ++++++++-----
 fs/xfs/scrub/rtbitmap.c  |    4 ++++
 fs/xfs/scrub/rtsummary.c |    4 ++++
 7 files changed, 55 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 2dc5bcd5c4fa..b6fa880787d9 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -76,6 +76,10 @@ xchk_setup_inode_bmap(
 	if (error)
 		goto out;
 
+	error = xchk_ino_dqattach(sc);
+	if (error)
+		goto out;
+
 	xchk_ilock(sc, XFS_ILOCK_EXCL);
 out:
 	/* scrub teardown will unlock and release the inode */
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 2c3707730cd4..6b9d852873d8 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -793,6 +793,26 @@ xchk_iget_agi(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_QUOTA
+/*
+ * Try to attach dquots to this inode if we think we might want to repair it.
+ * Callers must not hold any ILOCKs.  If the dquots are broken and cannot be
+ * attached, a quotacheck will be scheduled.
+ */
+int
+xchk_ino_dqattach(
+	struct xfs_scrub	*sc)
+{
+	ASSERT(sc->tp != NULL);
+	ASSERT(sc->ip != NULL);
+
+	if (!xchk_could_repair(sc))
+		return 0;
+
+	return xrep_ino_dqattach(sc);
+}
+#endif
+
 /* Install an inode that we opened by handle for scrubbing. */
 int
 xchk_install_handle_inode(
@@ -991,6 +1011,11 @@ xchk_setup_inode_contents(
 	error = xchk_trans_alloc(sc, resblks);
 	if (error)
 		goto out;
+
+	error = xchk_ino_dqattach(sc);
+	if (error)
+		goto out;
+
 	xchk_ilock(sc, XFS_ILOCK_EXCL);
 out:
 	/* scrub teardown will unlock and release the inode for us */
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 7f2714092514..c1a0a1ac19b2 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -103,9 +103,15 @@ xchk_setup_rtsummary(struct xfs_scrub *sc)
 }
 #endif
 #ifdef CONFIG_XFS_QUOTA
+int xchk_ino_dqattach(struct xfs_scrub *sc);
 int xchk_setup_quota(struct xfs_scrub *sc);
 #else
 static inline int
+xchk_ino_dqattach(struct xfs_scrub *sc)
+{
+	return 0;
+}
+static inline int
 xchk_setup_quota(struct xfs_scrub *sc)
 {
 	return -ENOENT;
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 7a248f26a03c..5c0aaffc6f01 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -37,6 +37,10 @@ xchk_prepare_iscrub(
 	if (error)
 		return error;
 
+	error = xchk_ino_dqattach(sc);
+	if (error)
+		return error;
+
 	xchk_ilock(sc, XFS_ILOCK_EXCL);
 	return 0;
 }
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index d9b0d19c8e2d..e3bcbbcb373e 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -701,10 +701,10 @@ xrep_force_quotacheck(
  *
  * This function ensures that the appropriate dquots are attached to an inode.
  * We cannot allow the dquot code to allocate an on-disk dquot block here
- * because we're already in transaction context with the inode locked.  The
- * on-disk dquot should already exist anyway.  If the quota code signals
- * corruption or missing quota information, schedule quotacheck, which will
- * repair corruptions in the quota metadata.
+ * because we're already in transaction context.  The on-disk dquot should
+ * already exist anyway.  If the quota code signals corruption or missing quota
+ * information, schedule quotacheck, which will repair corruptions in the quota
+ * metadata.
  */
 int
 xrep_ino_dqattach(
@@ -712,7 +712,10 @@ xrep_ino_dqattach(
 {
 	int			error;
 
-	error = xfs_qm_dqattach_locked(sc->ip, false);
+	ASSERT(sc->tp != NULL);
+	ASSERT(sc->ip != NULL);
+
+	error = xfs_qm_dqattach(sc->ip);
 	switch (error) {
 	case -EFSBADCRC:
 	case -EFSCORRUPTED:
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 851039588fe3..c22427012a11 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -32,6 +32,10 @@ xchk_setup_rtbitmap(
 	if (error)
 		return error;
 
+	error = xchk_ino_dqattach(sc);
+	if (error)
+		return error;
+
 	xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
 	return 0;
 }
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index 6e71e39e197e..73c75d41ef3c 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -61,6 +61,10 @@ xchk_setup_rtsummary(
 	if (error)
 		return error;
 
+	error = xchk_ino_dqattach(sc);
+	if (error)
+		return error;
+
 	/*
 	 * Locking order requires us to take the rtbitmap first.  We must be
 	 * careful to unlock it ourselves when we are done with the rtbitmap



* [PATCH 3/6] xfs: repair inode records
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/6] xfs: try to attach dquots to files before repairing them Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If an inode is so badly damaged that it cannot be loaded into the cache,
fix the ondisk metadata and try again.  If there /is/ a cached inode,
fix any problems and apply any optimizations that can be handled incore.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/libxfs/xfs_format.h  |    3 
 fs/xfs/scrub/alloc.c        |    2 
 fs/xfs/scrub/inode.c        |   10 +
 fs/xfs/scrub/inode_repair.c |  774 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c       |   32 ++
 fs/xfs/scrub/repair.h       |   19 +
 fs/xfs/scrub/scrub.c        |    2 
 fs/xfs/scrub/trace.h        |  129 +++++++
 9 files changed, 969 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/scrub/inode_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index c448c2a4d691..1f61f299e436 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -179,6 +179,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
 				   ialloc_repair.o \
+				   inode_repair.o \
 				   newbt.o \
 				   reap.o \
 				   refcount_repair.o \
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 371dc07233e0..5ba2dae7aa2f 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -996,7 +996,8 @@ enum xfs_dinode_fmt {
 #define XFS_DFORK_APTR(dip)	\
 	(XFS_DFORK_DPTR(dip) + XFS_DFORK_BOFF(dip))
 #define XFS_DFORK_PTR(dip,w)	\
-	((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : XFS_DFORK_APTR(dip))
+	((void *)((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : \
+					 XFS_DFORK_APTR(dip)))
 
 #define XFS_DFORK_FORMAT(dip,w) \
 	((w) == XFS_DATA_FORK ? \
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index e16a48486c6d..0fc412abfc50 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -9,6 +9,8 @@
 #include "xfs_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
 #include "xfs_btree.h"
 #include "xfs_alloc.h"
 #include "xfs_rmap.h"
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 5c0aaffc6f01..d86a2e1572ee 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -23,6 +23,7 @@
 #include "scrub/common.h"
 #include "scrub/btree.h"
 #include "scrub/trace.h"
+#include "scrub/repair.h"
 
 /* Prepare the attached inode for scrubbing. */
 static inline int
@@ -175,8 +176,11 @@ xchk_setup_inode(
 	 * saying the inode is allocated and the icache being unable to load
 	 * the inode until we can flag the corruption in xchk_inode.  The
 	 * scrub function has to note the corruption, since we're not really
-	 * supposed to do that from the setup function.
+	 * supposed to do that from the setup function.  Save the mapping to
+	 * make repairs to the ondisk inode buffer.
 	 */
+	if (xchk_could_repair(sc))
+		xrep_setup_inode(sc, &imap);
 	return 0;
 
 out_cancel:
@@ -332,6 +336,10 @@ xchk_inode_flags2(
 	if (xfs_dinode_has_bigtime(dip) && !xfs_has_bigtime(mp))
 		goto bad;
 
+	/* no large extent counts without the filesystem feature */
+	if ((flags2 & XFS_DIFLAG2_NREXT64) && !xfs_has_large_extent_counts(mp))
+		goto bad;
+
 	return;
 bad:
 	xchk_ino_set_corrupt(sc, ino);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
new file mode 100644
index 000000000000..f1490e78766d
--- /dev/null
+++ b/fs/xfs/scrub/inode_repair.c
@@ -0,0 +1,774 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_inode_buf.h"
+#include "xfs_inode_fork.h"
+#include "xfs_ialloc.h"
+#include "xfs_da_format.h"
+#include "xfs_reflink.h"
+#include "xfs_rmap.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_quota_defs.h"
+#include "xfs_quota.h"
+#include "xfs_ag.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Inode Repair
+ *
+ * Roughly speaking, inode problems can be classified based on whether or not
+ * they trip the dinode verifiers.  If those trip, then we won't be able to
+ * _iget ourselves the inode.
+ *
+ * Therefore, the xrep_dinode_* functions fix anything that will cause the
+ * inode buffer verifier or the dinode verifier to fail.  The xrep_inode_*
+ * functions fix things on live incore inodes.
+ */
+
+/*
+ * All the information we need to repair the ondisk inode if we can't iget the
+ * incore inode.  We don't allocate this buffer unless we're going to perform
+ * a repair to the ondisk inode cluster buffer.
+ */
+struct xrep_inode {
+	/* Inode mapping that we saved from the initial lookup attempt. */
+	struct xfs_imap		imap;
+
+	struct xfs_scrub	*sc;
+};
+
+/* Setup function for inode repair. */
+int
+xrep_setup_inode(
+	struct xfs_scrub	*sc,
+	struct xfs_imap		*imap)
+{
+	struct xrep_inode	*ri;
+
+	/*
+	 * The only information that needs to be passed between inode scrub and
+	 * repair is the location of the ondisk metadata if iget fails.  The
+	 * rest of struct xrep_inode is context data that we need to massage
+	 * the ondisk inode to the point that iget will work, which means that
+	 * we don't allocate anything at all if the incore inode is loaded.
+	 */
+	if (!imap)
+		return 0;
+
+	sc->buf = kzalloc(sizeof(struct xrep_inode), XCHK_GFP_FLAGS);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	ri = sc->buf;
+	memcpy(&ri->imap, imap, sizeof(struct xfs_imap));
+	ri->sc = sc;
+	return 0;
+}
+
+/* Make sure this inode cluster buffer can pass the inode buffer verifier. */
+STATIC void
+xrep_dinode_buf(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_trans	*tp = sc->tp;
+	struct xfs_perag	*pag;
+	struct xfs_dinode	*dip;
+	xfs_agnumber_t		agno;
+	xfs_agino_t		agino;
+	int			ioff;
+	int			i;
+	int			ni;
+	bool			crc_ok;
+	bool			magic_ok;
+	bool			unlinked_ok;
+
+	ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock;
+	agno = xfs_daddr_to_agno(mp, xfs_buf_daddr(bp));
+	pag = xfs_perag_get(mp, agno);
+	for (i = 0; i < ni; i++) {
+		ioff = i << mp->m_sb.sb_inodelog;
+		dip = xfs_buf_offset(bp, ioff);
+		agino = be32_to_cpu(dip->di_next_unlinked);
+
+		unlinked_ok = magic_ok = crc_ok = false;
+
+		if (xfs_verify_agino_or_null(pag, agino))
+			unlinked_ok = true;
+
+		if (dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
+		    xfs_dinode_good_version(mp, dip->di_version))
+			magic_ok = true;
+
+		if (xfs_verify_cksum((char *)dip, mp->m_sb.sb_inodesize,
+				XFS_DINODE_CRC_OFF))
+			crc_ok = true;
+
+		if (magic_ok && unlinked_ok && crc_ok)
+			continue;
+
+		if (!magic_ok) {
+			dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+			dip->di_version = 3;
+		}
+		if (!unlinked_ok)
+			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
+		xfs_dinode_calc_crc(mp, dip);
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
+		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);
+	}
+	xfs_perag_put(pag);
+}
+
+/* Reinitialize things that never change in an inode. */
+STATIC void
+xrep_dinode_header(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	trace_xrep_dinode_header(sc, dip);
+
+	dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+	if (!xfs_dinode_good_version(sc->mp, dip->di_version))
+		dip->di_version = 3;
+	dip->di_ino = cpu_to_be64(sc->sm->sm_ino);
+	uuid_copy(&dip->di_uuid, &sc->mp->m_sb.sb_meta_uuid);
+	dip->di_gen = cpu_to_be32(sc->sm->sm_gen);
+}
+
+/* Turn di_mode into /something/ recognizable. */
+STATIC void
+xrep_dinode_mode(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	uint16_t		mode;
+
+	trace_xrep_dinode_mode(sc, dip);
+
+	mode = be16_to_cpu(dip->di_mode);
+	if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN)
+		return;
+
+	/* bad mode, so we set it to a file that only root can read */
+	mode = S_IFREG;
+	dip->di_mode = cpu_to_be16(mode);
+	dip->di_uid = 0;
+	dip->di_gid = 0;
+}
+
+/* Fix any conflicting flags that the verifiers complain about. */
+STATIC void
+xrep_dinode_flags(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_mount	*mp = sc->mp;
+	uint64_t		flags2;
+	uint16_t		mode;
+	uint16_t		flags;
+
+	trace_xrep_dinode_flags(sc, dip);
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	if (xfs_has_reflink(mp) && S_ISREG(mode))
+		flags2 |= XFS_DIFLAG2_REFLINK;
+	else
+		flags2 &= ~(XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE);
+	if (flags & XFS_DIFLAG_REALTIME)
+		flags2 &= ~XFS_DIFLAG2_REFLINK;
+	if (flags2 & XFS_DIFLAG2_REFLINK)
+		flags2 &= ~XFS_DIFLAG2_DAX;
+	if (!xfs_has_bigtime(mp))
+		flags2 &= ~XFS_DIFLAG2_BIGTIME;
+	if (!xfs_has_large_extent_counts(mp))
+		flags2 &= ~XFS_DIFLAG2_NREXT64;
+	if (flags2 & XFS_DIFLAG2_NREXT64)
+		dip->di_nrext64_pad = 0;
+	else if (dip->di_version >= 3)
+		dip->di_v3_pad = 0;
+	dip->di_flags = cpu_to_be16(flags);
+	dip->di_flags2 = cpu_to_be64(flags2);
+}
+
+/*
+ * Blow out symlink; now it points to the current dir.  We don't have to worry
+ * about incore state because this inode is failing the verifiers.
+ */
+STATIC void
+xrep_dinode_zap_symlink(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	char			*p;
+
+	trace_xrep_dinode_zap_symlink(sc, dip);
+
+	dip->di_format = XFS_DINODE_FMT_LOCAL;
+	dip->di_size = cpu_to_be64(1);
+	p = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+	*p = '.';
+}
+
+/*
+ * Blow out dir, make it point to the root.  In the future repair will
+ * reconstruct this directory for us.  Note that there's no in-core directory
+ * inode because the sf verifier tripped, so we don't have to worry about the
+ * dentry cache.
+ */
+STATIC void
+xrep_dinode_zap_dir(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_dir2_sf_hdr	*sfp;
+	int			i8count;
+
+	trace_xrep_dinode_zap_dir(sc, dip);
+
+	dip->di_format = XFS_DINODE_FMT_LOCAL;
+	i8count = mp->m_sb.sb_rootino > XFS_DIR2_MAX_SHORT_INUM;
+	sfp = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+	sfp->count = 0;
+	sfp->i8count = i8count;
+	xfs_dir2_sf_put_parent_ino(sfp, mp->m_sb.sb_rootino);
+	dip->di_size = cpu_to_be64(xfs_dir2_sf_hdr_size(i8count));
+}
+
+/* Make sure we don't have a garbage file size. */
+STATIC void
+xrep_dinode_size(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	uint64_t		size;
+	uint16_t		mode;
+
+	trace_xrep_dinode_size(sc, dip);
+
+	mode = be16_to_cpu(dip->di_mode);
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		/* di_size can't be nonzero for special files */
+		dip->di_size = 0;
+		break;
+	case S_IFREG:
+		/* Regular files can't be larger than 2^63-1 bytes. */
+		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
+		break;
+	case S_IFLNK:
+		/*
+		 * Truncate ridiculously oversized symlinks.  If the size is
+		 * zero, reset it to point to the current directory.  Both of
+		 * these conditions trigger dinode verifier errors, so there
+		 * is no in-core state to reset.
+		 */
+		if (size > XFS_SYMLINK_MAXLEN)
+			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
+		else if (size == 0)
+			xrep_dinode_zap_symlink(sc, dip);
+		break;
+	case S_IFDIR:
+		/*
+		 * Directories can't have a size larger than 32G.  If the size
+		 * is zero, reset it to an empty directory.  Both of these
+		 * conditions trigger dinode verifier errors, so there is no
+		 * in-core state to reset.
+		 */
+		if (size > XFS_DIR2_SPACE_SIZE)
+			dip->di_size = cpu_to_be64(XFS_DIR2_SPACE_SIZE);
+		else if (size == 0)
+			xrep_dinode_zap_dir(sc, dip);
+		break;
+	}
+}
+
+/* Fix extent size hints. */
+STATIC void
+xrep_dinode_extsize_hints(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_mount	*mp = sc->mp;
+	uint64_t		flags2;
+	uint16_t		flags;
+	uint16_t		mode;
+	xfs_failaddr_t		fa;
+
+	trace_xrep_dinode_extsize_hints(sc, dip);
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	fa = xfs_inode_validate_extsize(mp, be32_to_cpu(dip->di_extsize),
+			mode, flags);
+	if (fa) {
+		dip->di_extsize = 0;
+		dip->di_flags &= ~cpu_to_be16(XFS_DIFLAG_EXTSIZE |
+					      XFS_DIFLAG_EXTSZINHERIT);
+	}
+
+	if (dip->di_version < 3)
+		return;
+
+	fa = xfs_inode_validate_cowextsize(mp, be32_to_cpu(dip->di_cowextsize),
+			mode, flags, flags2);
+	if (fa) {
+		dip->di_cowextsize = 0;
+		dip->di_flags2 &= ~cpu_to_be64(XFS_DIFLAG2_COWEXTSIZE);
+	}
+}
+
+/* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
+STATIC int
+xrep_dinode_core(
+	struct xrep_inode	*ri)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_buf		*bp;
+	struct xfs_dinode	*dip;
+	xfs_ino_t		ino = sc->sm->sm_ino;
+	bool			dontcare;
+	int			error;
+
+	/* Read the inode cluster buffer. */
+	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+			ri->imap.im_blkno, ri->imap.im_len, XBF_UNMAPPED, &bp,
+			NULL);
+	if (error)
+		return error;
+
+	/*
+	 * Make absolutely sure this inode isn't in core, because we don't
+	 * want to create a cache coherence problem.
+	 */
+	error = xfs_icache_inode_is_allocated(sc->mp, sc->tp, ino, &dontcare);
+	if (error == 0) {
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	/* Make sure we can pass the inode buffer verifier. */
+	xrep_dinode_buf(sc, bp);
+	bp->b_ops = &xfs_inode_buf_ops;
+
+	/* Fix everything the verifier will complain about. */
+	dip = xfs_buf_offset(bp, ri->imap.im_boffset);
+	xrep_dinode_header(sc, dip);
+	xrep_dinode_mode(sc, dip);
+	xrep_dinode_flags(sc, dip);
+	xrep_dinode_size(sc, dip);
+	xrep_dinode_extsize_hints(sc, dip);
+
+	/* Write out the inode. */
+	trace_xrep_dinode_fixed(sc, dip);
+	xfs_dinode_calc_crc(sc->mp, dip);
+	xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_DINO_BUF);
+	xfs_trans_log_buf(sc->tp, bp, ri->imap.im_boffset,
+			ri->imap.im_boffset + sc->mp->m_sb.sb_inodesize - 1);
+
+	/*
+	 * Now that we've finished rewriting anything in the ondisk metadata
+	 * that would prevent iget from giving us an incore inode, commit the
+	 * inode cluster buffer updates and drop the AGI buffer that we've been
+	 * holding since scrub setup.
+	 */
+	error = xrep_trans_commit(sc);
+	if (error)
+		return error;
+
+	/* Try again to load the inode. */
+	error = xchk_iget(sc, ino, &sc->ip);
+	if (error)
+		return error;
+
+	xchk_ilock(sc, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
+	error = xchk_trans_alloc(sc, 0);
+	if (error)
+		return error;
+
+	error = xrep_ino_dqattach(sc);
+	if (error)
+		return error;
+
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
+	return 0;
+}
+
+/* Fix everything xfs_dinode_verify cares about. */
+STATIC int
+xrep_dinode_problems(
+	struct xrep_inode	*ri)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	int			error;
+
+	error = xrep_dinode_core(ri);
+	if (error)
+		return error;
+
+	/* We had to fix a totally busted inode, schedule quotacheck. */
+	if (XFS_IS_UQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQTYPE_USER);
+	if (XFS_IS_GQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQTYPE_GROUP);
+	if (XFS_IS_PQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQTYPE_PROJ);
+
+	return 0;
+}
+
+/*
+ * Fix problems that the verifiers don't care about.  In general these are
+ * errors that don't cause problems elsewhere in the kernel that we can easily
+ * detect, so we don't check them all that rigorously.
+ */
+
+/* Make sure block and extent counts are ok. */
+STATIC int
+xrep_inode_blockcounts(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp;
+	xfs_filblks_t		count;
+	xfs_filblks_t		acount;
+	xfs_extnum_t		nextents;
+	int			error;
+
+	trace_xrep_inode_blockcounts(sc);
+
+	/* Set data fork counters from the data fork mappings. */
+	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_DATA_FORK,
+			&nextents, &count);
+	if (error)
+		return error;
+	if (xfs_has_reflink(sc->mp)) {
+		; /* data fork blockcount can exceed physical storage */
+	} else if (XFS_IS_REALTIME_INODE(sc->ip)) {
+		if (count >= sc->mp->m_sb.sb_rblocks)
+			return -EFSCORRUPTED;
+	} else {
+		if (count >= sc->mp->m_sb.sb_dblocks)
+			return -EFSCORRUPTED;
+	}
+	error = xrep_ino_ensure_extent_count(sc, XFS_DATA_FORK, nextents);
+	if (error)
+		return error;
+	sc->ip->i_df.if_nextents = nextents;
+
+	/* Set attr fork counters from the attr fork mappings. */
+	ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK);
+	if (ifp) {
+		error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
+				&nextents, &acount);
+		if (error)
+			return error;
+		if (acount >= sc->mp->m_sb.sb_dblocks)
+			return -EFSCORRUPTED;
+		error = xrep_ino_ensure_extent_count(sc, XFS_ATTR_FORK,
+				nextents);
+		if (error)
+			return error;
+		ifp->if_nextents = nextents;
+	} else {
+		acount = 0;
+	}
+
+	sc->ip->i_nblocks = count + acount;
+	return 0;
+}
+
+/* Check for invalid uid/gid/prid. */
+STATIC void
+xrep_inode_ids(
+	struct xfs_scrub	*sc)
+{
+	trace_xrep_inode_ids(sc);
+
+	if (i_uid_read(VFS_I(sc->ip)) == -1U) {
+		i_uid_write(VFS_I(sc->ip), 0);
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_UQUOTA_ON(sc->mp))
+			xrep_force_quotacheck(sc, XFS_DQTYPE_USER);
+	}
+
+	if (i_gid_read(VFS_I(sc->ip)) == -1U) {
+		i_gid_write(VFS_I(sc->ip), 0);
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_GQUOTA_ON(sc->mp))
+			xrep_force_quotacheck(sc, XFS_DQTYPE_GROUP);
+	}
+
+	if (sc->ip->i_projid == -1U) {
+		sc->ip->i_projid = 0;
+		if (XFS_IS_PQUOTA_ON(sc->mp))
+			xrep_force_quotacheck(sc, XFS_DQTYPE_PROJ);
+	}
+}
+
+static inline void
+xrep_clamp_nsec(
+	struct timespec64	*ts)
+{
+	ts->tv_nsec = clamp_t(long, ts->tv_nsec, 0, NSEC_PER_SEC);
+}
+
+/* Nanosecond counters can't have more than 1 billion. */
+STATIC void
+xrep_inode_timestamps(
+	struct xfs_inode	*ip)
+{
+	xrep_clamp_nsec(&VFS_I(ip)->i_atime);
+	xrep_clamp_nsec(&VFS_I(ip)->i_mtime);
+	xrep_clamp_nsec(&VFS_I(ip)->i_ctime);
+	xrep_clamp_nsec(&ip->i_crtime);
+}
+
+/* Fix inode flags that don't make sense together. */
+STATIC void
+xrep_inode_flags(
+	struct xfs_scrub	*sc)
+{
+	uint16_t		mode;
+
+	trace_xrep_inode_flags(sc);
+
+	mode = VFS_I(sc->ip)->i_mode;
+
+	/* Clear junk flags */
+	if (sc->ip->i_diflags & ~XFS_DIFLAG_ANY)
+		sc->ip->i_diflags &= XFS_DIFLAG_ANY;
+
+	/* NEWRTBM only applies to realtime bitmaps */
+	if (sc->ip->i_ino == sc->mp->m_sb.sb_rbmino)
+		sc->ip->i_diflags |= XFS_DIFLAG_NEWRTBM;
+	else
+		sc->ip->i_diflags &= ~XFS_DIFLAG_NEWRTBM;
+
+	/* These only make sense for directories. */
+	if (!S_ISDIR(mode))
+		sc->ip->i_diflags &= ~(XFS_DIFLAG_RTINHERIT |
+					  XFS_DIFLAG_EXTSZINHERIT |
+					  XFS_DIFLAG_PROJINHERIT |
+					  XFS_DIFLAG_NOSYMLINKS);
+
+	/* These only make sense for files. */
+	if (!S_ISREG(mode))
+		sc->ip->i_diflags &= ~(XFS_DIFLAG_REALTIME |
+					  XFS_DIFLAG_EXTSIZE);
+
+	/* These only make sense for non-rt files. */
+	if (sc->ip->i_diflags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_diflags &= ~XFS_DIFLAG_FILESTREAM;
+
+	/* Immutable and append only?  Drop the append. */
+	if ((sc->ip->i_diflags & XFS_DIFLAG_IMMUTABLE) &&
+	    (sc->ip->i_diflags & XFS_DIFLAG_APPEND))
+		sc->ip->i_diflags &= ~XFS_DIFLAG_APPEND;
+
+	/* Clear junk flags. */
+	if (sc->ip->i_diflags2 & ~XFS_DIFLAG2_ANY)
+		sc->ip->i_diflags2 &= XFS_DIFLAG2_ANY;
+
+	/* No reflink flag unless we support it and it's a file. */
+	if (!xfs_has_reflink(sc->mp) || !S_ISREG(mode))
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* DAX only applies to files and dirs. */
+	if (!(S_ISREG(mode) || S_ISDIR(mode)))
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_DAX;
+
+	/* No reflink files on the realtime device. */
+	if (sc->ip->i_diflags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* No mixing reflink and DAX yet. */
+	if (sc->ip->i_diflags2 & XFS_DIFLAG2_REFLINK)
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_DAX;
+}
+
+/*
+ * Fix size problems with block/node format directories.  If we fail to find
+ * the extent list, just bail out and let the bmapbtd repair functions clean
+ * up that mess.
+ */
+STATIC void
+xrep_inode_blockdir_size(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_ifork	*ifp;
+	xfs_fileoff_t		off;
+	int			error;
+
+	trace_xrep_inode_blockdir_size(sc);
+
+	/* Find the last block before 32G; this is the dir size. */
+	error = xfs_iread_extents(sc->tp, sc->ip, XFS_DATA_FORK);
+	if (error)
+		return;
+
+	ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK);
+	off = XFS_B_TO_FSB(sc->mp, XFS_DIR2_SPACE_SIZE);
+	if (!xfs_iext_lookup_extent_before(sc->ip, ifp, &off, &icur, &got)) {
+		/* zero-extents directory? */
+		return;
+	}
+
+	off = got.br_startoff + got.br_blockcount;
+	sc->ip->i_disk_size = min_t(loff_t, XFS_DIR2_SPACE_SIZE,
+			XFS_FSB_TO_B(sc->mp, off));
+}
+
+/* Fix size problems with short format directories. */
+STATIC void
+xrep_inode_sfdir_size(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp;
+
+	trace_xrep_inode_sfdir_size(sc);
+
+	ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK);
+	sc->ip->i_disk_size = ifp->if_bytes;
+}
+
+/*
+ * Fix any irregularities in an inode's size now that we can iterate extent
+ * maps and access other regular inode data.
+ */
+STATIC void
+xrep_inode_size(
+	struct xfs_scrub	*sc)
+{
+	trace_xrep_inode_size(sc);
+
+	/*
+	 * Currently we only support fixing size on extents or btree format
+	 * directories.  Files can be any size, and the sizes of the other
+	 * special inode types are fixed by xrep_dinode_size.
+	 */
+	if (!S_ISDIR(VFS_I(sc->ip)->i_mode))
+		return;
+	switch (sc->ip->i_df.if_format) {
+	case XFS_DINODE_FMT_EXTENTS:
+	case XFS_DINODE_FMT_BTREE:
+		xrep_inode_blockdir_size(sc);
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		xrep_inode_sfdir_size(sc);
+		break;
+	}
+}
+
+/* Fix extent size hint problems. */
+STATIC void
+xrep_inode_extsize(
+	struct xfs_scrub	*sc)
+{
+	/* Fix misaligned extent size hints on a directory. */
+	if ((sc->ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
+	    (sc->ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
+	    sc->ip->i_extsize % sc->mp->m_sb.sb_rextsize > 0) {
+		sc->ip->i_extsize = 0;
+		sc->ip->i_diflags &= ~XFS_DIFLAG_EXTSZINHERIT;
+	}
+}
+
+/* Fix any irregularities in an inode that the verifiers don't catch. */
+STATIC int
+xrep_inode_problems(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xrep_inode_blockcounts(sc);
+	if (error)
+		return error;
+	xrep_inode_timestamps(sc->ip);
+	xrep_inode_flags(sc);
+	xrep_inode_ids(sc);
+	xrep_inode_size(sc);
+	xrep_inode_extsize(sc);
+
+	trace_xrep_inode_fixed(sc);
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	return xrep_roll_trans(sc);
+}
+
+/* Repair an inode's fields. */
+int
+xrep_inode(
+	struct xfs_scrub	*sc)
+{
+	int			error = 0;
+
+	/*
+	 * No inode?  That means we failed the _iget verifiers.  Repair all
+	 * the things that the inode verifiers care about, then retry _iget.
+	 */
+	if (!sc->ip) {
+		struct xrep_inode	*ri = sc->buf;
+
+		ASSERT(ri != NULL);
+
+		error = xrep_dinode_problems(ri);
+		if (error)
+			return error;
+
+		/* By this point we had better have a working incore inode. */
+		if (!sc->ip)
+			return -EFSCORRUPTED;
+	}
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* If we found corruption of any kind, try to fix it. */
+	if ((sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ||
+	    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)) {
+		error = xrep_inode_problems(sc);
+		if (error)
+			return error;
+	}
+
+	/* See if we can clear the reflink flag. */
+	if (xfs_is_reflink_inode(sc->ip))
+		return xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index e3bcbbcb373e..f06fb50d9082 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -741,6 +741,38 @@ xrep_ino_dqattach(
 }
 #endif /* CONFIG_XFS_QUOTA */
 
+/*
+ * Ensure that the inode being repaired is ready to handle a certain number of
+ * extents, or return EFSCORRUPTED.  Caller must hold the ILOCK of the inode
+ * being repaired and have joined it to the scrub transaction.
+ */
+int
+xrep_ino_ensure_extent_count(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	xfs_extnum_t		nextents)
+{
+	xfs_extnum_t		max_extents;
+	bool			large_extcount;
+
+	large_extcount = xfs_inode_has_large_extent_counts(sc->ip);
+	max_extents = xfs_iext_max_nextents(large_extcount, whichfork);
+	if (nextents <= max_extents)
+		return 0;
+	if (large_extcount)
+		return -EFSCORRUPTED;
+	if (!xfs_has_large_extent_counts(sc->mp))
+		return -EFSCORRUPTED;
+
+	max_extents = xfs_iext_max_nextents(true, whichfork);
+	if (nextents > max_extents)
+		return -EFSCORRUPTED;
+
+	sc->ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	return 0;
+}
+
 /* Initialize all the btree cursors for an AG repair. */
 void
 xrep_ag_btcur_init(
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 441b6b073001..892cb593a716 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -26,6 +26,16 @@ bool xrep_ag_has_space(struct xfs_perag *pag, xfs_extlen_t nr_blocks,
 		enum xfs_ag_resv_type type);
 xfs_extlen_t xrep_calc_ag_resblks(struct xfs_scrub *sc);
 
+static inline int
+xrep_trans_commit(
+	struct xfs_scrub	*sc)
+{
+	int			error = xfs_trans_commit(sc->tp);
+
+	sc->tp = NULL;
+	return error;
+}
+
 struct xbitmap;
 struct xagb_bitmap;
 
@@ -57,11 +67,16 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
 # define xrep_ino_dqattach(sc)			(0)
 #endif /* CONFIG_XFS_QUOTA */
 
+int xrep_ino_ensure_extent_count(struct xfs_scrub *sc, int whichfork,
+		xfs_extnum_t nextents);
 int xrep_reset_perag_resv(struct xfs_scrub *sc);
 
 /* Repair setup functions */
 int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
 
+struct xfs_imap;
+int xrep_setup_inode(struct xfs_scrub *sc, struct xfs_imap *imap);
+
 void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa);
 
 /* Metadata revalidators */
@@ -79,6 +94,7 @@ int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
 int xrep_refcountbt(struct xfs_scrub *sc);
+int xrep_inode(struct xfs_scrub *sc);
 
 int xrep_reinit_pagf(struct xfs_scrub *sc);
 int xrep_reinit_pagi(struct xfs_scrub *sc);
@@ -123,6 +139,8 @@ xrep_setup_nothing(
 }
 #define xrep_setup_ag_allocbt		xrep_setup_nothing
 
+#define xrep_setup_inode(sc, imap)	((void)0)
+
 #define xrep_revalidate_allocbt		(NULL)
 #define xrep_revalidate_iallocbt	(NULL)
 
@@ -134,6 +152,7 @@ xrep_setup_nothing(
 #define xrep_allocbt			xrep_notsupported
 #define xrep_iallocbt			xrep_notsupported
 #define xrep_refcountbt			xrep_notsupported
+#define xrep_inode			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 449c3e623c63..ace21c484cbc 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -283,7 +283,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode,
 		.scrub	= xchk_inode,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_inode,
 	},
 	[XFS_SCRUB_TYPE_BMBTD] = {	/* inode data fork */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 8532dcd16630..5313fce1c69d 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1346,6 +1346,135 @@ DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_alloc_blocks);
 DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_free_blocks);
 DEFINE_NEWBT_EXTENT_EVENT(xrep_newbt_claim_block);
 
+DECLARE_EVENT_CLASS(xrep_dinode_class,
+	TP_PROTO(struct xfs_scrub *sc, struct xfs_dinode *dip),
+	TP_ARGS(sc, dip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(uint16_t, mode)
+		__field(uint8_t, version)
+		__field(uint8_t, format)
+		__field(uint32_t, uid)
+		__field(uint32_t, gid)
+		__field(uint64_t, size)
+		__field(uint64_t, nblocks)
+		__field(uint32_t, extsize)
+		__field(uint32_t, nextents)
+		__field(uint16_t, anextents)
+		__field(uint8_t, forkoff)
+		__field(uint8_t, aformat)
+		__field(uint16_t, flags)
+		__field(uint32_t, gen)
+		__field(uint64_t, flags2)
+		__field(uint32_t, cowextsize)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = sc->sm->sm_ino;
+		__entry->mode = be16_to_cpu(dip->di_mode);
+		__entry->version = dip->di_version;
+		__entry->format = dip->di_format;
+		__entry->uid = be32_to_cpu(dip->di_uid);
+		__entry->gid = be32_to_cpu(dip->di_gid);
+		__entry->size = be64_to_cpu(dip->di_size);
+		__entry->nblocks = be64_to_cpu(dip->di_nblocks);
+		__entry->extsize = be32_to_cpu(dip->di_extsize);
+		__entry->nextents = be32_to_cpu(dip->di_nextents);
+		__entry->anextents = be16_to_cpu(dip->di_anextents);
+		__entry->forkoff = dip->di_forkoff;
+		__entry->aformat = dip->di_aformat;
+		__entry->flags = be16_to_cpu(dip->di_flags);
+		__entry->gen = be32_to_cpu(dip->di_gen);
+		__entry->flags2 = be64_to_cpu(dip->di_flags2);
+		__entry->cowextsize = be32_to_cpu(dip->di_cowextsize);
+	),
+	TP_printk("dev %d:%d ino 0x%llx mode 0x%x version %u format %u uid %u gid %u disize 0x%llx nblocks 0x%llx extsize %u nextents %u anextents %u forkoff 0x%x aformat %u flags 0x%x gen 0x%x flags2 0x%llx cowextsize %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->mode,
+		  __entry->version,
+		  __entry->format,
+		  __entry->uid,
+		  __entry->gid,
+		  __entry->size,
+		  __entry->nblocks,
+		  __entry->extsize,
+		  __entry->nextents,
+		  __entry->anextents,
+		  __entry->forkoff,
+		  __entry->aformat,
+		  __entry->flags,
+		  __entry->gen,
+		  __entry->flags2,
+		  __entry->cowextsize)
+)
+
+#define DEFINE_REPAIR_DINODE_EVENT(name) \
+DEFINE_EVENT(xrep_dinode_class, name, \
+	TP_PROTO(struct xfs_scrub *sc, struct xfs_dinode *dip), \
+	TP_ARGS(sc, dip))
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_header);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_mode);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_flags);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_size);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_extsize_hints);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_symlink);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_dir);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_fixed);
+
+DECLARE_EVENT_CLASS(xrep_inode_class,
+	TP_PROTO(struct xfs_scrub *sc),
+	TP_ARGS(sc),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fsize_t, size)
+		__field(xfs_rfsblock_t, nblocks)
+		__field(uint16_t, flags)
+		__field(uint64_t, flags2)
+		__field(uint32_t, nextents)
+		__field(uint8_t, format)
+		__field(uint32_t, anextents)
+		__field(uint8_t, aformat)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = sc->sm->sm_ino;
+		__entry->size = sc->ip->i_disk_size;
+		__entry->nblocks = sc->ip->i_nblocks;
+		__entry->flags = sc->ip->i_diflags;
+		__entry->flags2 = sc->ip->i_diflags2;
+		__entry->nextents = sc->ip->i_df.if_nextents;
+		__entry->format = sc->ip->i_df.if_format;
+		__entry->anextents = sc->ip->i_af.if_nextents;
+		__entry->aformat = sc->ip->i_af.if_format;
+	),
+	TP_printk("dev %d:%d ino 0x%llx disize 0x%llx nblocks 0%llx flags 0x%x flags2 0x%llx nextents %u format %u anextents %u aformat %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->size,
+		  __entry->nblocks,
+		  __entry->flags,
+		  __entry->flags2,
+		  __entry->nextents,
+		  __entry->format,
+		  __entry->anextents,
+		  __entry->aformat)
+)
+
+#define DEFINE_REPAIR_INODE_EVENT(name) \
+DEFINE_EVENT(xrep_inode_class, name, \
+	TP_PROTO(struct xfs_scrub *sc), \
+	TP_ARGS(sc))
+DEFINE_REPAIR_INODE_EVENT(xrep_inode_blockcounts);
+DEFINE_REPAIR_INODE_EVENT(xrep_inode_ids);
+DEFINE_REPAIR_INODE_EVENT(xrep_inode_flags);
+DEFINE_REPAIR_INODE_EVENT(xrep_inode_blockdir_size);
+DEFINE_REPAIR_INODE_EVENT(xrep_inode_sfdir_size);
+DEFINE_REPAIR_INODE_EVENT(xrep_inode_size);
+DEFINE_REPAIR_INODE_EVENT(xrep_inode_fixed);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/6] xfs: zap broken inode forks
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/6] xfs: repair inode records Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/6] xfs: try to attach dquots to files before repairing them Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/6] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true.  Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
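
To make the flow above concrete -- check each ondisk fork, zap anything
the verifiers would reject, and leave an empty fork for a later rebuild
-- here is a minimal standalone sketch.  The struct and helper names
(fake_fork, fork_is_garbage, zap_fork) are invented stand-ins for
illustration, not the xfs_dinode layout or the functions added by this
patch:

/* Hedged sketch of the zap-then-rebuild idea; not the kernel code. */
#include <stdbool.h>
#include <stdio.h>

enum fork_fmt { FMT_LOCAL, FMT_EXTENTS, FMT_BTREE };

struct fake_fork {
	enum fork_fmt	fmt;
	unsigned int	nextents;	/* extent count recorded on disk */
	unsigned int	nrecs;		/* mapping records actually found */
};

/* Stand-in for the ifork verifier checks that iget would apply. */
static bool fork_is_garbage(const struct fake_fork *f)
{
	return f->fmt == FMT_EXTENTS && f->nextents != f->nrecs;
}

/*
 * Zap a damaged fork down to an empty extents-format fork so that an
 * in-core inode can be constructed; a later repair pass rebuilds the
 * mappings from the reverse-mapping index.
 */
static void zap_fork(struct fake_fork *f)
{
	f->fmt = FMT_EXTENTS;
	f->nextents = 0;
	f->nrecs = 0;
}

int main(void)
{
	struct fake_fork dfork = { FMT_EXTENTS, 5, 3 };	/* inconsistent */

	if (fork_is_garbage(&dfork))
		zap_fork(&dfork);
	printf("data fork now claims %u extents\n", dfork.nextents);
	return 0;
}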

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_leaf.c      |   32 +-
 fs/xfs/libxfs/xfs_attr_leaf.h      |    2 
 fs/xfs/libxfs/xfs_bmap.c           |   22 +
 fs/xfs/libxfs/xfs_bmap.h           |    2 
 fs/xfs/libxfs/xfs_dir2_priv.h      |    2 
 fs/xfs/libxfs/xfs_dir2_sf.c        |   29 +-
 fs/xfs/libxfs/xfs_shared.h         |    1 
 fs/xfs/libxfs/xfs_symlink_remote.c |   21 +
 fs/xfs/scrub/inode_repair.c        |  635 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/trace.h               |   42 ++
 10 files changed, 751 insertions(+), 37 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index beee51ad75ce..40aba0a9d339 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1040,23 +1040,16 @@ xfs_attr_shortform_allfit(
 	return xfs_attr_shortform_bytesfit(dp, bytes);
 }
 
-/* Verify the consistency of an inline attribute fork. */
+/* Verify the consistency of a raw inline attribute fork. */
 xfs_failaddr_t
-xfs_attr_shortform_verify(
-	struct xfs_inode		*ip)
+xfs_attr_shortform_verify_struct(
+	struct xfs_attr_shortform	*sfp,
+	size_t				size)
 {
-	struct xfs_attr_shortform	*sfp;
 	struct xfs_attr_sf_entry	*sfep;
 	struct xfs_attr_sf_entry	*next_sfep;
 	char				*endp;
-	struct xfs_ifork		*ifp;
 	int				i;
-	int64_t				size;
-
-	ASSERT(ip->i_af.if_format == XFS_DINODE_FMT_LOCAL);
-	ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK);
-	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
-	size = ifp->if_bytes;
 
 	/*
 	 * Give up if the attribute is way too short.
@@ -1116,6 +1109,23 @@ xfs_attr_shortform_verify(
 	return NULL;
 }
 
+/* Verify the consistency of an inline attribute fork. */
+xfs_failaddr_t
+xfs_attr_shortform_verify(
+	struct xfs_inode		*ip)
+{
+	struct xfs_attr_shortform	*sfp;
+	struct xfs_ifork		*ifp;
+	int64_t				size;
+
+	ASSERT(ip->i_af.if_format == XFS_DINODE_FMT_LOCAL);
+	ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK);
+	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
+	size = ifp->if_bytes;
+
+	return xfs_attr_shortform_verify_struct(sfp, size);
+}
+
 /*
  * Convert a leaf attribute list to shortform attribute list
  */
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 368f4d9fa1d5..0711a448f64c 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -56,6 +56,8 @@ int	xfs_attr_sf_findname(struct xfs_da_args *args,
 			     unsigned int *basep);
 int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
 int	xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
+xfs_failaddr_t xfs_attr_shortform_verify_struct(struct xfs_attr_shortform *sfp,
+		size_t size);
 xfs_failaddr_t xfs_attr_shortform_verify(struct xfs_inode *ip);
 void	xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index d9083cbeb20e..d689a262ce39 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6216,19 +6216,18 @@ xfs_bmap_finish_one(
 	return error;
 }
 
-/* Check that an inode's extent does not have invalid flags or bad ranges. */
+/* Check that an extent does not have invalid flags or bad ranges. */
 xfs_failaddr_t
-xfs_bmap_validate_extent(
-	struct xfs_inode	*ip,
+xfs_bmap_validate_extent_raw(
+	struct xfs_mount	*mp,
+	bool			rtfile,
 	int			whichfork,
 	struct xfs_bmbt_irec	*irec)
 {
-	struct xfs_mount	*mp = ip->i_mount;
-
 	if (!xfs_verify_fileext(mp, irec->br_startoff, irec->br_blockcount))
 		return __this_address;
 
-	if (XFS_IS_REALTIME_INODE(ip) && whichfork == XFS_DATA_FORK) {
+	if (rtfile && whichfork == XFS_DATA_FORK) {
 		if (!xfs_verify_rtext(mp, irec->br_startblock,
 					  irec->br_blockcount))
 			return __this_address;
@@ -6258,3 +6257,14 @@ xfs_bmap_intent_destroy_cache(void)
 	kmem_cache_destroy(xfs_bmap_intent_cache);
 	xfs_bmap_intent_cache = NULL;
 }
+
+/* Check that an inode's extent does not have invalid flags or bad ranges. */
+xfs_failaddr_t
+xfs_bmap_validate_extent(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*irec)
+{
+	return xfs_bmap_validate_extent_raw(ip->i_mount,
+			XFS_IS_REALTIME_INODE(ip), whichfork, irec);
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 7af24f2ef8a2..903047c146c3 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -256,6 +256,8 @@ static inline uint32_t xfs_bmap_fork_to_state(int whichfork)
 	}
 }
 
+xfs_failaddr_t xfs_bmap_validate_extent_raw(struct xfs_mount *mp, bool rtfile,
+		int whichfork, struct xfs_bmbt_irec *irec);
 xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork,
 		struct xfs_bmbt_irec *irec);
 int xfs_bmap_complain_bad_rec(struct xfs_inode *ip, int whichfork,
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index 7404a9ff1a92..b10859a43776 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -175,6 +175,8 @@ extern int xfs_dir2_sf_create(struct xfs_da_args *args, xfs_ino_t pino);
 extern int xfs_dir2_sf_lookup(struct xfs_da_args *args);
 extern int xfs_dir2_sf_removename(struct xfs_da_args *args);
 extern int xfs_dir2_sf_replace(struct xfs_da_args *args);
+extern xfs_failaddr_t xfs_dir2_sf_verify_struct(struct xfs_mount *mp,
+		struct xfs_dir2_sf_hdr *sfp, int64_t size);
 extern xfs_failaddr_t xfs_dir2_sf_verify(struct xfs_inode *ip);
 int xfs_dir2_sf_entsize(struct xfs_mount *mp,
 		struct xfs_dir2_sf_hdr *hdr, int len);
diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c
index 8cd37e6e9d38..008904658524 100644
--- a/fs/xfs/libxfs/xfs_dir2_sf.c
+++ b/fs/xfs/libxfs/xfs_dir2_sf.c
@@ -706,12 +706,11 @@ xfs_dir2_sf_check(
 
 /* Verify the consistency of an inline directory. */
 xfs_failaddr_t
-xfs_dir2_sf_verify(
-	struct xfs_inode		*ip)
+xfs_dir2_sf_verify_struct(
+	struct xfs_mount		*mp,
+	struct xfs_dir2_sf_hdr		*sfp,
+	int64_t				size)
 {
-	struct xfs_mount		*mp = ip->i_mount;
-	struct xfs_ifork		*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
-	struct xfs_dir2_sf_hdr		*sfp;
 	struct xfs_dir2_sf_entry	*sfep;
 	struct xfs_dir2_sf_entry	*next_sfep;
 	char				*endp;
@@ -719,15 +718,9 @@ xfs_dir2_sf_verify(
 	int				i;
 	int				i8count;
 	int				offset;
-	int64_t				size;
 	int				error;
 	uint8_t				filetype;
 
-	ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL);
-
-	sfp = (struct xfs_dir2_sf_hdr *)ifp->if_u1.if_data;
-	size = ifp->if_bytes;
-
 	/*
 	 * Give up if the directory is way too short.
 	 */
@@ -803,6 +796,20 @@ xfs_dir2_sf_verify(
 	return NULL;
 }
 
+xfs_failaddr_t
+xfs_dir2_sf_verify(
+	struct xfs_inode		*ip)
+{
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_ifork		*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+	struct xfs_dir2_sf_hdr		*sfp;
+
+	ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL);
+
+	sfp = (struct xfs_dir2_sf_hdr *)ifp->if_u1.if_data;
+	return xfs_dir2_sf_verify_struct(mp, sfp, ifp->if_bytes);
+}
+
 /*
  * Create a new (shortform) directory.
  */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index c4381388c0c1..57a52fa76a49 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -139,6 +139,7 @@ bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset,
 			uint32_t size, struct xfs_buf *bp);
 void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
 				 struct xfs_inode *ip, struct xfs_ifork *ifp);
+xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
 xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
 
 /* Computed inode geometry for the filesystem. */
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index bdc777b9ec4a..7660a95b1ea9 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -201,16 +201,12 @@ xfs_symlink_local_to_remote(
  * does not do on-disk format checks.
  */
 xfs_failaddr_t
-xfs_symlink_shortform_verify(
-	struct xfs_inode	*ip)
+xfs_symlink_sf_verify_struct(
+	void			*sfp,
+	int64_t			size)
 {
-	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
-	char			*sfp = (char *)ifp->if_u1.if_data;
-	int			size = ifp->if_bytes;
 	char			*endp = sfp + size;
 
-	ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL);
-
 	/*
 	 * Zero length symlinks should never occur in memory as they are
 	 * never allowed to exist on disk.
@@ -231,3 +227,14 @@ xfs_symlink_shortform_verify(
 		return __this_address;
 	return NULL;
 }
+
+xfs_failaddr_t
+xfs_symlink_shortform_verify(
+	struct xfs_inode	*ip)
+{
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+
+	ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL);
+
+	return xfs_symlink_sf_verify_struct(ifp->if_u1.if_data, ifp->if_bytes);
+}
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index f1490e78766d..45680de6b0ba 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -22,14 +22,19 @@
 #include "xfs_ialloc.h"
 #include "xfs_da_format.h"
 #include "xfs_reflink.h"
+#include "xfs_alloc.h"
 #include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
 #include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_bmap_util.h"
 #include "xfs_dir2.h"
 #include "xfs_dir2_priv.h"
 #include "xfs_quota_defs.h"
 #include "xfs_quota.h"
 #include "xfs_ag.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_log_priv.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -59,6 +64,28 @@ struct xrep_inode {
 	struct xfs_imap		imap;
 
 	struct xfs_scrub	*sc;
+
+	/* Blocks in use on the data device by data extents or bmbt blocks. */
+	xfs_rfsblock_t		data_blocks;
+
+	/* Blocks in use on the rt device. */
+	xfs_rfsblock_t		rt_blocks;
+
+	/* Blocks in use by the attr fork. */
+	xfs_rfsblock_t		attr_blocks;
+
+	/* Number of data device extents for the data fork. */
+	xfs_extnum_t		data_extents;
+
+	/*
+	 * Number of realtime device extents for the data fork.  If
+	 * data_extents and rt_extents indicate that the data fork has extents
+	 * on both devices, we'll just back away slowly.
+	 */
+	xfs_extnum_t		rt_extents;
+
+	/* Number of (data device) extents for the attr fork. */
+	xfs_aextnum_t		attr_extents;
 };
 
 /* Setup function for inode repair. */
@@ -186,7 +213,8 @@ xrep_dinode_mode(
 STATIC void
 xrep_dinode_flags(
 	struct xfs_scrub	*sc,
-	struct xfs_dinode	*dip)
+	struct xfs_dinode	*dip,
+	bool			isrt)
 {
 	struct xfs_mount	*mp = sc->mp;
 	uint64_t		flags2;
@@ -199,6 +227,11 @@ xrep_dinode_flags(
 	flags = be16_to_cpu(dip->di_flags);
 	flags2 = be64_to_cpu(dip->di_flags2);
 
+	if (isrt)
+		flags |= XFS_DIFLAG_REALTIME;
+	else
+		flags &= ~XFS_DIFLAG_REALTIME;
+
 	if (xfs_has_reflink(mp) && S_ISREG(mode))
 		flags2 |= XFS_DIFLAG2_REFLINK;
 	else
@@ -353,6 +386,598 @@ xrep_dinode_extsize_hints(
 	}
 }
 
+/* Count extents and blocks for an inode given an rmap. */
+STATIC int
+xrep_dinode_walk_rmap(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_inode		*ri = priv;
+	int				error = 0;
+
+	if (xchk_should_terminate(ri->sc, &error))
+		return error;
+
+	/* We only care about this inode. */
+	if (rec->rm_owner != ri->sc->sm->sm_ino)
+		return 0;
+
+	if (rec->rm_flags & XFS_RMAP_ATTR_FORK) {
+		ri->attr_blocks += rec->rm_blockcount;
+		if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+			ri->attr_extents++;
+
+		return 0;
+	}
+
+	ri->data_blocks += rec->rm_blockcount;
+	if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+		ri->data_extents++;
+
+	return 0;
+}
+
+/* Count extents and blocks for an inode from all AG rmap data. */
+STATIC int
+xrep_dinode_count_ag_rmaps(
+	struct xrep_inode	*ri,
+	struct xfs_perag	*pag)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_buf		*agf;
+	int			error;
+
+	error = xfs_alloc_read_agf(pag, ri->sc->tp, 0, &agf);
+	if (error)
+		return error;
+
+	cur = xfs_rmapbt_init_cursor(ri->sc->mp, ri->sc->tp, agf, pag);
+	error = xfs_rmap_query_all(cur, xrep_dinode_walk_rmap, ri);
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(ri->sc->tp, agf);
+	return error;
+}
+
+/* Count extents and blocks for a given inode from all rmap data. */
+STATIC int
+xrep_dinode_count_rmaps(
+	struct xrep_inode	*ri)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	int			error;
+
+	if (!xfs_has_rmapbt(ri->sc->mp) || xfs_has_realtime(ri->sc->mp))
+		return -EOPNOTSUPP;
+
+	for_each_perag(ri->sc->mp, agno, pag) {
+		error = xrep_dinode_count_ag_rmaps(ri, pag);
+		if (error) {
+			xfs_perag_put(pag);
+			return error;
+		}
+	}
+
+	/* Can't have extents on both the rt and the data device. */
+	if (ri->data_extents && ri->rt_extents)
+		return -EFSCORRUPTED;
+
+	trace_xrep_dinode_count_rmaps(ri->sc,
+			ri->data_blocks, ri->rt_blocks, ri->attr_blocks,
+			ri->data_extents, ri->rt_extents, ri->attr_extents);
+	return 0;
+}
+
+/* Return true if this extents-format ifork looks like garbage. */
+STATIC bool
+xrep_dinode_bad_extents_fork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	int			dfork_size,
+	int			whichfork)
+{
+	struct xfs_bmbt_irec	new;
+	struct xfs_bmbt_rec	*dp;
+	bool			isrt;
+	int			i;
+	int			nex;
+	int			fork_size;
+
+	nex = xfs_dfork_nextents(dip, whichfork);
+	fork_size = nex * sizeof(struct xfs_bmbt_rec);
+	if (fork_size < 0 || fork_size > dfork_size)
+		return true;
+	if (whichfork == XFS_ATTR_FORK && nex > ((uint16_t)-1U))
+		return true;
+	dp = XFS_DFORK_PTR(dip, whichfork);
+
+	isrt = dip->di_flags & cpu_to_be16(XFS_DIFLAG_REALTIME);
+	for (i = 0; i < nex; i++, dp++) {
+		xfs_failaddr_t	fa;
+
+		xfs_bmbt_disk_get_all(dp, &new);
+		fa = xfs_bmap_validate_extent_raw(sc->mp, isrt, whichfork,
+				&new);
+		if (fa)
+			return true;
+	}
+
+	return false;
+}
+
+/* Return true if this btree-format ifork looks like garbage. */
+STATIC bool
+xrep_dinode_bad_btree_fork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	int			dfork_size,
+	int			whichfork)
+{
+	struct xfs_bmdr_block	*dfp;
+	int			nrecs;
+	int			level;
+
+	if (xfs_dfork_nextents(dip, whichfork) <=
+			dfork_size / sizeof(struct xfs_bmbt_rec))
+		return true;
+
+	if (dfork_size < sizeof(struct xfs_bmdr_block))
+		return true;
+
+	dfp = XFS_DFORK_PTR(dip, whichfork);
+	nrecs = be16_to_cpu(dfp->bb_numrecs);
+	level = be16_to_cpu(dfp->bb_level);
+
+	if (nrecs == 0 || XFS_BMDR_SPACE_CALC(nrecs) > dfork_size)
+		return true;
+	if (level == 0 || level >= XFS_BM_MAXLEVELS(sc->mp, whichfork))
+		return true;
+	return false;
+}
+
+/*
+ * Check the data fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xrep_dinode_check_dfork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	uint16_t		mode)
+{
+	void			*sfp;
+	uint64_t		size;
+	unsigned int		fmt;
+	int			dfork_size;
+
+	fmt = XFS_DFORK_FORMAT(dip, XFS_DATA_FORK);
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		if (fmt != XFS_DINODE_FMT_DEV)
+			return true;
+		break;
+	case S_IFREG:
+		if (fmt == XFS_DINODE_FMT_LOCAL)
+			return true;
+		fallthrough;
+	case S_IFLNK:
+	case S_IFDIR:
+		switch (fmt) {
+		case XFS_DINODE_FMT_LOCAL:
+		case XFS_DINODE_FMT_EXTENTS:
+		case XFS_DINODE_FMT_BTREE:
+			break;
+		default:
+			return true;
+		}
+		break;
+	default:
+		return true;
+	}
+
+	dfork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_DATA_FORK);
+	sfp = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+	switch (fmt) {
+	case XFS_DINODE_FMT_DEV:
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		if (size > dfork_size)
+			return true;
+		if (S_ISDIR(mode) &&
+		    xfs_dir2_sf_verify_struct(sc->mp, sfp, size) != NULL)
+			return true;
+		if (S_ISLNK(mode) &&
+		    xfs_symlink_sf_verify_struct(sfp, size) != NULL)
+			return true;
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xrep_dinode_bad_extents_fork(sc, dip, dfork_size,
+				XFS_DATA_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xrep_dinode_bad_btree_fork(sc, dip, dfork_size,
+				XFS_DATA_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+static void
+xrep_dinode_set_data_nextents(
+	struct xfs_dinode	*dip,
+	xfs_extnum_t		nextents)
+{
+	if (xfs_dinode_has_large_extent_counts(dip))
+		dip->di_big_nextents = cpu_to_be64(nextents);
+	else
+		dip->di_nextents = cpu_to_be32(nextents);
+}
+
+static void
+xrep_dinode_set_attr_nextents(
+	struct xfs_dinode	*dip,
+	xfs_extnum_t		nextents)
+{
+	if (xfs_dinode_has_large_extent_counts(dip))
+		dip->di_big_anextents = cpu_to_be32(nextents);
+	else
+		dip->di_anextents = cpu_to_be16(nextents);
+}
+
+/* Reset the data fork to something sane. */
+STATIC void
+xrep_dinode_zap_dfork(
+	struct xrep_inode	*ri,
+	struct xfs_dinode	*dip,
+	uint16_t		mode)
+{
+	struct xfs_scrub	*sc = ri->sc;
+
+	trace_xrep_dinode_zap_dfork(sc, dip);
+
+	xrep_dinode_set_data_nextents(dip, 0);
+
+	/* Special files always get reset to DEV */
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		dip->di_format = XFS_DINODE_FMT_DEV;
+		dip->di_size = 0;
+		return;
+	}
+
+	/*
+	 * If we have data extents, reset to an empty map and hope the user
+	 * will run the bmapbtd checker next.
+	 */
+	if (ri->data_extents || ri->rt_extents || S_ISREG(mode)) {
+		dip->di_format = XFS_DINODE_FMT_EXTENTS;
+		return;
+	}
+
+	/* Otherwise, reset the local format to the minimum. */
+	switch (mode & S_IFMT) {
+	case S_IFLNK:
+		xrep_dinode_zap_symlink(sc, dip);
+		break;
+	case S_IFDIR:
+		xrep_dinode_zap_dir(sc, dip);
+		break;
+	}
+}
+
+/*
+ * Check the attr fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xrep_dinode_check_afork(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip)
+{
+	struct xfs_attr_shortform	*sfp;
+	int				size;
+
+	if (XFS_DFORK_BOFF(dip) == 0)
+		return dip->di_aformat != XFS_DINODE_FMT_EXTENTS ||
+		       xfs_dfork_attr_extents(dip) != 0;
+
+	size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
+	switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) {
+	case XFS_DINODE_FMT_LOCAL:
+		sfp = XFS_DFORK_PTR(dip, XFS_ATTR_FORK);
+		return xfs_attr_shortform_verify_struct(sfp, size) != NULL;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xrep_dinode_bad_extents_fork(sc, dip, size, XFS_ATTR_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xrep_dinode_bad_btree_fork(sc, dip, size, XFS_ATTR_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Reset the attr fork to empty.  Since the attr fork could have contained
+ * ACLs, make the file readable only by root.
+ */
+STATIC void
+xrep_dinode_zap_afork(
+	struct xrep_inode	*ri,
+	struct xfs_dinode	*dip,
+	uint16_t		mode)
+{
+	struct xfs_scrub	*sc = ri->sc;
+
+	trace_xrep_dinode_zap_afork(sc, dip);
+
+	dip->di_aformat = XFS_DINODE_FMT_EXTENTS;
+	xrep_dinode_set_attr_nextents(dip, 0);
+
+	dip->di_forkoff = 0;
+	dip->di_mode = cpu_to_be16(mode & ~0777);
+	dip->di_uid = 0;
+	dip->di_gid = 0;
+}
+
+/* Make sure the fork offset is a sensible value. */
+STATIC void
+xrep_dinode_ensure_forkoff(
+	struct xrep_inode	*ri,
+	struct xfs_dinode	*dip,
+	uint16_t		mode)
+{
+	struct xfs_bmdr_block	*bmdr;
+	struct xfs_scrub	*sc = ri->sc;
+	xfs_extnum_t		attr_extents, data_extents;
+	size_t			bmdr_minsz = XFS_BMDR_SPACE_CALC(1);
+	unsigned int		lit_sz = XFS_LITINO(sc->mp);
+	unsigned int		afork_min, dfork_min;
+
+	trace_xrep_dinode_ensure_forkoff(sc, dip);
+
+	/*
+	 * Before calling this function, xrep_dinode_core ensured that both
+	 * forks actually fit inside their respective literal areas.  If this
+	 * was not the case, the fork was reset to FMT_EXTENTS with zero
+	 * records.  If the rmapbt scan found attr or data fork blocks, this
+	 * will be noted in the dinode_stats, and we must leave enough room
+	 * for the bmap repair code to reconstruct the mapping structure.
+	 *
+	 * First, compute the minimum space required for the attr fork.
+	 */
+	switch (dip->di_aformat) {
+	case XFS_DINODE_FMT_LOCAL:
+		/*
+		 * If we still have a shortform xattr structure at all, that
+		 * means the attr fork area was exactly large enough to fit
+		 * the sf structure.
+		 */
+		afork_min = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		attr_extents = xfs_dfork_attr_extents(dip);
+		if (attr_extents) {
+			/*
+			 * We must maintain sufficient space to hold the entire
+			 * extent map array in the attr fork.  Note that we
+			 * previously zapped the fork if it had no chance of
+			 * fitting in the inode.
+			 */
+			afork_min = sizeof(struct xfs_bmbt_rec) * attr_extents;
+		} else if (ri->attr_extents > 0) {
+			/*
+			 * The attr fork thinks it has zero extents, but we
+			 * found some xattr extents.  We need to leave enough
+			 * empty space here so that the incore attr fork will
+			 * get created (and hence trigger the attr fork bmap
+			 * repairer).
+			 */
+			afork_min = bmdr_minsz;
+		} else {
+			/* No extents on disk or found in rmapbt. */
+			afork_min = 0;
+		}
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		/* Must have space for btree header and key/pointers. */
+		bmdr = XFS_DFORK_PTR(dip, XFS_ATTR_FORK);
+		afork_min = XFS_BMAP_BROOT_SPACE(sc->mp, bmdr);
+		break;
+	default:
+		/* We should never see any other formats. */
+		afork_min = 0;
+		break;
+	}
+
+	/* Compute the minimum space required for the data fork. */
+	switch (dip->di_format) {
+	case XFS_DINODE_FMT_DEV:
+		dfork_min = sizeof(__be32);
+		break;
+	case XFS_DINODE_FMT_UUID:
+		dfork_min = sizeof(uuid_t);
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		/*
+		 * If we still have a shortform data fork at all, that means
+		 * the data fork area was large enough to fit whatever was in
+		 * there.
+		 */
+		dfork_min = be64_to_cpu(dip->di_size);
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		data_extents = xfs_dfork_data_extents(dip);
+		if (data_extents) {
+			/*
+			 * We must maintain sufficient space to hold the entire
+			 * extent map array in the data fork.  Note that we
+			 * previously zapped the fork if it had no chance of
+			 * fitting in the inode.
+			 */
+			dfork_min = sizeof(struct xfs_bmbt_rec) * data_extents;
+		} else if (ri->data_extents > 0 || ri->rt_extents > 0) {
+			/*
+			 * The data fork thinks it has zero extents, but we
+			 * found some data extents.  We need to leave enough
+			 * empty space here so that the data fork bmap
+			 * repair will recover the mappings.
+			 */
+			dfork_min = bmdr_minsz;
+		} else {
+			/* No extents on disk or found in rmapbt. */
+			dfork_min = 0;
+		}
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		/* Must have space for btree header and key/pointers. */
+		bmdr = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+		dfork_min = XFS_BMAP_BROOT_SPACE(sc->mp, bmdr);
+		break;
+	default:
+		dfork_min = 0;
+		break;
+	}
+
+	/*
+	 * Round all values up to the nearest 8 bytes, because that is the
+	 * precision of di_forkoff.
+	 */
+	afork_min = roundup(afork_min, 8);
+	dfork_min = roundup(dfork_min, 8);
+	bmdr_minsz = roundup(bmdr_minsz, 8);
+
+	ASSERT(dfork_min <= lit_sz);
+	ASSERT(afork_min <= lit_sz);
+
+	/*
+	 * If the data fork was zapped and we don't have enough space for the
+	 * recovery fork, move the attr fork up.
+	 */
+	if (dip->di_format == XFS_DINODE_FMT_EXTENTS &&
+	    xfs_dfork_data_extents(dip) == 0 &&
+	    (ri->data_extents > 0 || ri->rt_extents > 0) &&
+	    bmdr_minsz > XFS_DFORK_DSIZE(dip, sc->mp)) {
+		if (bmdr_minsz + afork_min > lit_sz) {
+			/*
+			 * The attr fork and the stub fork we need to recover
+			 * the data fork won't both fit.  Zap the attr fork.
+			 */
+			xrep_dinode_zap_afork(ri, dip, mode);
+			afork_min = bmdr_minsz;
+		} else {
+			void	*before, *after;
+
+			/* Otherwise, just slide the attr fork up. */
+			before = XFS_DFORK_APTR(dip);
+			dip->di_forkoff = bmdr_minsz >> 3;
+			after = XFS_DFORK_APTR(dip);
+			memmove(after, before, XFS_DFORK_ASIZE(dip, sc->mp));
+		}
+	}
+
+	/*
+	 * If the attr fork was zapped and we don't have enough space for the
+	 * recovery fork, move the attr fork down.
+	 */
+	if (dip->di_aformat == XFS_DINODE_FMT_EXTENTS &&
+	    xfs_dfork_attr_extents(dip) == 0 &&
+	    ri->attr_extents > 0 &&
+	    bmdr_minsz > XFS_DFORK_ASIZE(dip, sc->mp)) {
+		if (dip->di_format == XFS_DINODE_FMT_BTREE) {
+			/*
+			 * If the data fork is in btree format then we can't
+			 * adjust forkoff because that runs the risk of
+			 * violating the extents/btree format transition rules.
+			 */
+		} else if (bmdr_minsz + dfork_min > lit_sz) {
+			/*
+			 * If we can't move the attr fork, too bad, we lose the
+			 * attr fork and leak its blocks.
+			 */
+			xrep_dinode_zap_afork(ri, dip, mode);
+		} else {
+			/*
+			 * Otherwise, just slide the attr fork down.  The attr
+			 * fork is empty, so we don't have any old contents to
+			 * move here.
+			 */
+			dip->di_forkoff = (lit_sz - bmdr_minsz) >> 3;
+		}
+	}
+}
+
+/*
+ * Zap the data/attr forks if we spot anything that isn't going to pass the
+ * ifork verifiers or the ifork formatters, because we need to get the inode
+ * into good enough shape that the higher level repair functions can run.
+ */
+STATIC void
+xrep_dinode_zap_forks(
+	struct xrep_inode	*ri,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_scrub	*sc = ri->sc;
+	xfs_extnum_t		data_extents;
+	xfs_extnum_t		attr_extents;
+	xfs_filblks_t		nblocks;
+	uint16_t		mode;
+	bool			zap_datafork = false;
+	bool			zap_attrfork = false;
+
+	trace_xrep_dinode_zap_forks(sc, dip);
+
+	mode = be16_to_cpu(dip->di_mode);
+
+	data_extents = xfs_dfork_data_extents(dip);
+	attr_extents = xfs_dfork_attr_extents(dip);
+	nblocks = be64_to_cpu(dip->di_nblocks);
+
+	/* Inode counters don't make sense? */
+	if (data_extents > nblocks)
+		zap_datafork = true;
+	if (attr_extents > nblocks)
+		zap_attrfork = true;
+	if (data_extents + attr_extents > nblocks)
+		zap_datafork = zap_attrfork = true;
+
+	if (!zap_datafork)
+		zap_datafork = xrep_dinode_check_dfork(sc, dip, mode);
+	if (!zap_attrfork)
+		zap_attrfork = xrep_dinode_check_afork(sc, dip);
+
+	/* Zap whatever's bad. */
+	if (zap_attrfork)
+		xrep_dinode_zap_afork(ri, dip, mode);
+	if (zap_datafork)
+		xrep_dinode_zap_dfork(ri, dip, mode);
+	xrep_dinode_ensure_forkoff(ri, dip, mode);
+	dip->di_nblocks = 0;
+	if (!zap_attrfork)
+		be64_add_cpu(&dip->di_nblocks, ri->attr_blocks);
+	if (!zap_datafork) {
+		be64_add_cpu(&dip->di_nblocks, ri->data_blocks);
+		be64_add_cpu(&dip->di_nblocks, ri->rt_blocks);
+	}
+}
+
 /* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
 STATIC int
 xrep_dinode_core(
@@ -365,6 +990,11 @@ xrep_dinode_core(
 	bool			dontcare;
 	int			error;
 
+	/* Figure out what this inode had mapped in both forks. */
+	error = xrep_dinode_count_rmaps(ri);
+	if (error)
+		return error;
+
 	/* Read the inode cluster buffer. */
 	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
 			ri->imap.im_blkno, ri->imap.im_len, XBF_UNMAPPED, &bp,
@@ -390,9 +1020,10 @@ xrep_dinode_core(
 	dip = xfs_buf_offset(bp, ri->imap.im_boffset);
 	xrep_dinode_header(sc, dip);
 	xrep_dinode_mode(sc, dip);
-	xrep_dinode_flags(sc, dip);
+	xrep_dinode_flags(sc, dip, ri->rt_extents > 0);
 	xrep_dinode_size(sc, dip);
 	xrep_dinode_extsize_hints(sc, dip);
+	xrep_dinode_zap_forks(ri, dip);
 
 	/* Write out the inode. */
 	trace_xrep_dinode_fixed(sc, dip);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 5313fce1c69d..22a8a32a53e5 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1422,6 +1422,10 @@ DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_extsize_hints);
 DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_symlink);
 DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_dir);
 DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_fixed);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_forks);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_dfork);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_zap_afork);
+DEFINE_REPAIR_DINODE_EVENT(xrep_dinode_ensure_forkoff);
 
 DECLARE_EVENT_CLASS(xrep_inode_class,
 	TP_PROTO(struct xfs_scrub *sc),
@@ -1475,6 +1479,44 @@ DEFINE_REPAIR_INODE_EVENT(xrep_inode_sfdir_size);
 DEFINE_REPAIR_INODE_EVENT(xrep_inode_size);
 DEFINE_REPAIR_INODE_EVENT(xrep_inode_fixed);
 
+TRACE_EVENT(xrep_dinode_count_rmaps,
+	TP_PROTO(struct xfs_scrub *sc, xfs_rfsblock_t data_blocks,
+		xfs_rfsblock_t rt_blocks, xfs_rfsblock_t attr_blocks,
+		xfs_extnum_t data_extents, xfs_extnum_t rt_extents,
+		xfs_aextnum_t attr_extents),
+	TP_ARGS(sc, data_blocks, rt_blocks, attr_blocks, data_extents,
+		rt_extents, attr_extents),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_rfsblock_t, data_blocks)
+		__field(xfs_rfsblock_t, rt_blocks)
+		__field(xfs_rfsblock_t, attr_blocks)
+		__field(xfs_extnum_t, data_extents)
+		__field(xfs_extnum_t, rt_extents)
+		__field(xfs_aextnum_t, attr_extents)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = sc->sm->sm_ino;
+		__entry->data_blocks = data_blocks;
+		__entry->rt_blocks = rt_blocks;
+		__entry->attr_blocks = attr_blocks;
+		__entry->data_extents = data_extents;
+		__entry->rt_extents = rt_extents;
+		__entry->attr_extents = attr_extents;
+	),
+	TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %llu rtextents %llu aextents %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->data_blocks,
+		  __entry->rt_blocks,
+		  __entry->attr_blocks,
+		  __entry->data_extents,
+		  __entry->rt_extents,
+		  __entry->attr_extents)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/6] xfs: abort directory parent scrub scans if we encounter a zapped directory
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 1/6] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 6/6] xfs: repair obviously broken inode modes Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In the previous patch, we added some code to perform sufficient repairs
to an ondisk inode record such that the inode cache would be willing to
load the inode.  If the broken inode was a shortform directory, the
repair code resets the directory to something plausible, which is to say
an empty subdirectory of the root.  The telltale sign that something is
seriously wrong is the broken link count.

Such directories look clean, but they shouldn't participate in a
filesystem scan to find or confirm a directory parent pointer.  Create a
predicate that identifies such directories and abort the scrub.

Found by fuzzing xfs/1554 with multithreaded xfs_scrub enabled and
u3.bmx[0].startblock = zeroes.
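
As a rough standalone restatement of the check above (a directory whose
data fork is extents format with zero extents must have been zapped, so
parent scanning has to wait), assuming simplified stand-in types rather
than the real scrub structures:

#include <stdbool.h>
#include <stdio.h>

enum fork_fmt { FMT_LOCAL, FMT_EXTENTS, FMT_BTREE };

struct fake_dir {
	enum fork_fmt	fmt;
	unsigned int	nextents;
};

/* Directories always have content, so this combination means "zapped". */
static bool dir_looks_zapped(const struct fake_dir *dp)
{
	return dp->fmt == FMT_EXTENTS && dp->nextents == 0;
}

/* Caller-side pattern: postpone the parent scan until the dir is rebuilt. */
static int scan_parent(const struct fake_dir *parent)
{
	if (dir_looks_zapped(parent))
		return -1;	/* the real code marks the scan incomplete */
	return 0;		/* safe to walk the directory entries */
}

int main(void)
{
	struct fake_dir zapped = { FMT_EXTENTS, 0 };

	printf("scan result: %d\n", scan_parent(&zapped));
	return 0;
}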

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |    1 +
 fs/xfs/scrub/common.h |    2 ++
 fs/xfs/scrub/dir.c    |   21 +++++++++++++++++++++
 fs/xfs/scrub/parent.c |   11 +++++++++++
 4 files changed, 35 insertions(+)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 6b9d852873d8..3fc392c1b1a8 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -26,6 +26,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
+#include "xfs_dir2_priv.h"
 #include "xfs_attr.h"
 #include "xfs_reflink.h"
 #include "xfs_ag.h"
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index c1a0a1ac19b2..4c90c45b9b34 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -173,6 +173,8 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm)
 			       XFS_SCRUB_OFLAG_XCORRUPT);
 }
 
+bool xchk_dir_looks_zapped(struct xfs_inode *dp);
+
 #ifdef CONFIG_XFS_ONLINE_REPAIR
 /* Decide if a repair is required. */
 static inline bool xchk_needs_repair(const struct xfs_scrub_metadata *sm)
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 2a3107cc8ccb..5b3a9edc8932 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -854,3 +854,24 @@ xchk_directory(
 out:
 	return error;
 }
+
+/*
+ * Decide if this directory has been zapped to satisfy the inode and ifork
+ * verifiers.  Checking and repairing should be postponed until the directory
+ * is fixed.
+ */
+bool
+xchk_dir_looks_zapped(
+	struct xfs_inode	*dp)
+{
+	/*
+	 * If the dinode repair found a bad data fork, it will reset the fork
+	 * to extents format with zero records and wait for the bmapbtd
+	 * scrubber to reconstruct the block mappings.  Directories always
+	 * contain some content, so this is a clear sign of a zapped directory.
+	 */
+	if (dp->i_df.if_format == XFS_DINODE_FMT_EXTENTS)
+		return dp->i_df.if_nextents == 0;
+
+	return false;
+}
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 8581a21bfbfd..371526f4369d 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -89,6 +89,17 @@ xchk_parent_count_parent_dentries(
 	 * if there is one.
 	 */
 	lock_mode = xfs_ilock_data_map_shared(parent);
+
+	/*
+	 * We cannot yet validate this parent pointer if the directory looks as
+	 * though it has been zapped by the inode record repair code.
+	 */
+	if (xchk_dir_looks_zapped(parent)) {
+		xfs_iunlock(parent, lock_mode);
+		xchk_set_incomplete(sc);
+		return -EFSCORRUPTED;
+	}
+
 	if (parent->i_df.if_nextents > 0)
 		error = xfs_dir3_data_readahead(parent, 0, 0);
 	xfs_iunlock(parent, lock_mode);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/6] xfs: repair obviously broken inode modes
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 5/6] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Building off the rmap scanner that we added in the previous patch, we
can now find block 0 and try to use the information contained inside it
to guess the mode of an inode whose mode field is totally improper.
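
Here is a heavily simplified standalone model of that guessing strategy:
only call something a directory when file block 0 carries a plausible,
self-describing directory header, and fall back to a regular file
otherwise.  The block layout, magic value, and helper below are invented
for the sketch and omit the real xfs_dir3 header, UUID, and LSN checks:

#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define FAKE_DIR_MAGIC	0x58444433u	/* invented magic for this sketch */

struct fake_block0 {
	uint32_t	magic;
	uint64_t	owner;	/* inode number stamped in the block */
};

/* Guess S_IFDIR only when block 0 carries a plausible directory header. */
static mode_t guess_mode(uint64_t ino, const struct fake_block0 *b0)
{
	if (b0 == NULL)			/* no written block 0: sparse file */
		return S_IFREG;
	if (b0->magic != FAKE_DIR_MAGIC)
		return S_IFREG;
	if (b0->owner != ino)		/* self-describing owner must match */
		return S_IFREG;
	return S_IFDIR;
}

int main(void)
{
	struct fake_block0 b = { FAKE_DIR_MAGIC, 128 };

	printf("guessed mode 0%o\n", (unsigned int)guess_mode(128, &b));
	return 0;
}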

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/inode_repair.c |  170 ++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/trace.h        |   11 ++-
 2 files changed, 172 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 45680de6b0ba..e5e3daf75fd1 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -74,6 +74,9 @@ struct xrep_inode {
 	/* Blocks in use by the attr fork. */
 	xfs_rfsblock_t		attr_blocks;
 
+	/* Physical block containing data block 0. */
+	xfs_fsblock_t		block0;
+
 	/* Number of data device extents for the data fork. */
 	xfs_extnum_t		data_extents;
 
@@ -113,6 +116,7 @@ xrep_setup_inode(
 	ri = sc->buf;
 	memcpy(&ri->imap, imap, sizeof(struct xfs_imap));
 	ri->sc = sc;
+	ri->block0 = NULLFSBLOCK;
 	return 0;
 }
 
@@ -188,12 +192,159 @@ xrep_dinode_header(
 	dip->di_gen = cpu_to_be32(sc->sm->sm_gen);
 }
 
+/* Parse enough of the directory block header to guess if this is a dir. */
+static inline bool
+xrep_dinode_is_dir(
+	xfs_ino_t			ino,
+	xfs_daddr_t			daddr,
+	struct xfs_buf			*bp)
+{
+	struct xfs_dir3_blk_hdr		*hdr3 = bp->b_addr;
+	struct xfs_dir2_data_free	*bf;
+	struct xfs_mount		*mp = bp->b_mount;
+	xfs_lsn_t			lsn = be64_to_cpu(hdr3->lsn);
+
+	/* Does the dir3 header match the filesystem? */
+	if (hdr3->magic != cpu_to_be32(XFS_DIR3_BLOCK_MAGIC) &&
+	    hdr3->magic != cpu_to_be32(XFS_DIR3_DATA_MAGIC))
+		return false;
+
+	if (be64_to_cpu(hdr3->owner) != ino)
+		return false;
+
+	if (!uuid_equal(&hdr3->uuid, &mp->m_sb.sb_meta_uuid))
+		return false;
+
+	if (be64_to_cpu(hdr3->blkno) != daddr)
+		return false;
+
+	/* Directory blocks are always logged and must have a valid LSN. */
+	if (lsn == NULLCOMMITLSN)
+		return false;
+	if (!xlog_valid_lsn(mp->m_log, lsn))
+		return false;
+
+	/*
+	 * bestfree information lives immediately after the end of the header,
+	 * so we won't run off the end of the buffer.
+	 */
+	bf = xfs_dir2_data_bestfree_p(mp, bp->b_addr);
+	if (!bf[0].length && bf[0].offset)
+		return false;
+	if (!bf[1].length && bf[1].offset)
+		return false;
+	if (!bf[2].length && bf[2].offset)
+		return false;
+
+	if (be16_to_cpu(bf[0].length) < be16_to_cpu(bf[1].length))
+		return false;
+	if (be16_to_cpu(bf[1].length) < be16_to_cpu(bf[2].length))
+		return false;
+
+	return true;
+}
+
+/* Guess the mode of this file from the contents. */
+STATIC uint16_t
+xrep_dinode_guess_mode(
+	struct xrep_inode	*ri,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_buf		*bp;
+	struct xfs_mount	*mp = ri->sc->mp;
+	xfs_daddr_t		daddr;
+	uint64_t		fsize = be64_to_cpu(dip->di_size);
+	unsigned int		dfork_sz = XFS_DFORK_DSIZE(dip, mp);
+	uint16_t		mode = S_IFREG;
+	int			error;
+
+	switch (dip->di_format) {
+	case XFS_DINODE_FMT_LOCAL:
+		/*
+		 * If the data fork is local format, the size of the data area
+		 * is reasonable and is big enough to contain the entire file,
+		 * we can guess the file type from the local data.
+		 *
+		 * If there are no nulls, guess this is a symbolic link.
+		 * Otherwise, this is probably a shortform directory.
+		 */
+		if (dfork_sz <= XFS_LITINO(mp) && dfork_sz >= fsize) {
+			if (!memchr(XFS_DFORK_DPTR(dip), 0, fsize))
+				return S_IFLNK;
+			return S_IFDIR;
+		}
+
+		/* By default, we guess regular file. */
+		return S_IFREG;
+	case XFS_DINODE_FMT_DEV:
+		/*
+		 * If the data fork is dev format, the size of the data area is
+		 * reasonable and large enough to store a dev_t, and the file
+		 * size is zero, this could be a blockdev, a chardev, a fifo,
+		 * or a socket.  There is no solid way to distinguish between
+		 * those choices, so we guess blockdev if the device number is
+		 * nonzero and chardev if it's zero (aka whiteout).
+		 */
+		if (dfork_sz <= XFS_LITINO(mp) &&
+		    dfork_sz >= sizeof(__be32) && fsize == 0) {
+			xfs_dev_t	dev = xfs_dinode_get_rdev(dip);
+
+			return dev != 0 ? S_IFBLK : S_IFCHR;
+		}
+
+		/* By default, we guess regular file. */
+		return S_IFREG;
+	case XFS_DINODE_FMT_EXTENTS:
+	case XFS_DINODE_FMT_BTREE:
+		/* There are data blocks to examine below. */
+		break;
+	default:
+		/* Everything else is considered a regular file. */
+		return S_IFREG;
+	}
+
+	/* There are no zero-length directories. */
+	if (fsize == 0)
+		return S_IFREG;
+
+	/*
+	 * If we didn't find a written mapping for file block zero, we'll guess
+	 * that it's a sparse regular file.
+	 */
+	if (ri->block0 == NULLFSBLOCK)
+		return S_IFREG;
+
+	/* Directories can't have rt extents. */
+	if (ri->rt_extents > 0)
+		return S_IFREG;
+
+	/*
+	 * Read the first block of the file.  Since we have no idea what kind
+	 * of file geometry (e.g. dirblock size) we might be reading into, use
+	 * an uncached buffer so that we don't pollute the buffer cache.  We
+	 * can't do uncached mapped buffers, so the best we can do is guess
+	 * from the directory header.
+	 */
+	daddr = XFS_FSB_TO_DADDR(mp, ri->block0);
+	error = xfs_buf_read_uncached(mp->m_ddev_targp, daddr,
+			XFS_FSS_TO_BB(mp, 1), 0, &bp, NULL);
+	if (error)
+		return S_IFREG;
+
+	if (xrep_dinode_is_dir(ri->sc->sm->sm_ino, daddr, bp))
+		mode = S_IFDIR;
+
+	xfs_buf_relse(bp);
+	return mode;
+}
+
 /* Turn di_mode into /something/ recognizable. */
 STATIC void
 xrep_dinode_mode(
-	struct xfs_scrub	*sc,
+	struct xrep_inode	*ri,
 	struct xfs_dinode	*dip)
 {
+	struct xfs_scrub	*sc = ri->sc;
 	uint16_t		mode;
 
 	trace_xrep_dinode_mode(sc, dip);
@@ -203,7 +354,7 @@ xrep_dinode_mode(
 		return;
 
 	/* bad mode, so we set it to a file that only root can read */
-	mode = S_IFREG;
+	mode = xrep_dinode_guess_mode(ri, dip);
 	dip->di_mode = cpu_to_be16(mode);
 	dip->di_uid = 0;
 	dip->di_gid = 0;
@@ -412,9 +563,17 @@ xrep_dinode_walk_rmap(
 	}
 
 	ri->data_blocks += rec->rm_blockcount;
-	if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+	if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK)) {
 		ri->data_extents++;
 
+		if (rec->rm_offset == 0 &&
+		    !(rec->rm_flags & XFS_RMAP_UNWRITTEN)) {
+			if (ri->block0 != NULLFSBLOCK)
+				return -EFSCORRUPTED;
+			ri->block0 = rec->rm_startblock;
+		}
+	}
+
 	return 0;
 }
 
@@ -465,7 +624,8 @@ xrep_dinode_count_rmaps(
 
 	trace_xrep_dinode_count_rmaps(ri->sc,
 			ri->data_blocks, ri->rt_blocks, ri->attr_blocks,
-			ri->data_extents, ri->rt_extents, ri->attr_extents);
+			ri->data_extents, ri->rt_extents, ri->attr_extents,
+			ri->block0);
 	return 0;
 }
 
@@ -1019,7 +1179,7 @@ xrep_dinode_core(
 	/* Fix everything the verifier will complain about. */
 	dip = xfs_buf_offset(bp, ri->imap.im_boffset);
 	xrep_dinode_header(sc, dip);
-	xrep_dinode_mode(sc, dip);
+	xrep_dinode_mode(ri, dip);
 	xrep_dinode_flags(sc, dip, ri->rt_extents > 0);
 	xrep_dinode_size(sc, dip);
 	xrep_dinode_extsize_hints(sc, dip);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 22a8a32a53e5..253b3cd784a9 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1483,9 +1483,9 @@ TRACE_EVENT(xrep_dinode_count_rmaps,
 	TP_PROTO(struct xfs_scrub *sc, xfs_rfsblock_t data_blocks,
 		xfs_rfsblock_t rt_blocks, xfs_rfsblock_t attr_blocks,
 		xfs_extnum_t data_extents, xfs_extnum_t rt_extents,
-		xfs_aextnum_t attr_extents),
+		xfs_aextnum_t attr_extents, xfs_fsblock_t block0),
 	TP_ARGS(sc, data_blocks, rt_blocks, attr_blocks, data_extents,
-		rt_extents, attr_extents),
+		rt_extents, attr_extents, block0),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
@@ -1495,6 +1495,7 @@ TRACE_EVENT(xrep_dinode_count_rmaps,
 		__field(xfs_extnum_t, data_extents)
 		__field(xfs_extnum_t, rt_extents)
 		__field(xfs_aextnum_t, attr_extents)
+		__field(xfs_fsblock_t, block0)
 	),
 	TP_fast_assign(
 		__entry->dev = sc->mp->m_super->s_dev;
@@ -1505,8 +1506,9 @@ TRACE_EVENT(xrep_dinode_count_rmaps,
 		__entry->data_extents = data_extents;
 		__entry->rt_extents = rt_extents;
 		__entry->attr_extents = attr_extents;
+		__entry->block0 = block0;
 	),
-	TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %llu rtextents %llu aextents %u",
+	TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %llu rtextents %llu aextents %u startblock0 0x%llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->data_blocks,
@@ -1514,7 +1516,8 @@ TRACE_EVENT(xrep_dinode_count_rmaps,
 		  __entry->attr_blocks,
 		  __entry->data_extents,
 		  __entry->rt_extents,
-		  __entry->attr_extents)
+		  __entry->attr_extents,
+		  __entry->block0)
 );
 
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (7 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files Darrick J. Wong
                   ` (68 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

In this series, online repair gains the ability to rebuild data and attr
fork mappings from the reverse mapping information.  It is at this point
that we reintroduce the ability to reap file extents.

Repair of CoW forks is a little different -- on disk, CoW staging
extents are owned by the refcount btree and cannot be mapped back to
individual files.  Hence we can only detect staging extents that don't
quite look right (missing reverse mappings, shared staging extents) and
replace them with fresh allocations.
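
A loose standalone model of that CoW-fork strategy, with all names
invented for the illustration: a staging extent is suspect if it lacks a
CoW-owner reverse mapping or its refcount says the blocks are shared,
and suspect extents are retargeted at freshly allocated blocks:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct staging_extent {
	uint64_t	startblock;
	uint64_t	blockcount;
	bool		has_cow_rmap;	/* rmap record with the CoW owner? */
	unsigned int	refcount;	/* shared if greater than 1 */
};

/* Staging extents must be singly referenced and owned by the CoW owner. */
static bool staging_extent_is_suspect(const struct staging_extent *se)
{
	return !se->has_cow_rmap || se->refcount > 1;
}

/* Hypothetical replacement: allocate new blocks and retarget the mapping. */
static void replace_staging_extent(struct staging_extent *se,
				   uint64_t new_startblock)
{
	se->startblock = new_startblock;
	se->has_cow_rmap = true;
	se->refcount = 1;
}

int main(void)
{
	struct staging_extent se = { 1000, 8, true, 2 };	/* shared: bad */

	if (staging_extent_is_suspect(&se))
		replace_staging_extent(&se, 2000);
	printf("staging extent now starts at block %llu\n",
	       (unsigned long long)se.startblock);
	return 0;
}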

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-file-mappings

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-file-mappings
---
 fs/xfs/Makefile                   |    2 
 fs/xfs/libxfs/xfs_bmap_btree.c    |  112 +++++
 fs/xfs/libxfs/xfs_bmap_btree.h    |    5 
 fs/xfs/libxfs/xfs_btree_staging.c |   11 -
 fs/xfs/libxfs/xfs_btree_staging.h |    2 
 fs/xfs/libxfs/xfs_iext_tree.c     |   23 +
 fs/xfs/libxfs/xfs_inode_fork.c    |    1 
 fs/xfs/libxfs/xfs_inode_fork.h    |    3 
 fs/xfs/libxfs/xfs_refcount.c      |   41 ++
 fs/xfs/libxfs/xfs_refcount.h      |   10 
 fs/xfs/scrub/bitmap.h             |   28 +
 fs/xfs/scrub/bmap.c               |   18 +
 fs/xfs/scrub/bmap_repair.c        |  776 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/cow_repair.c         |  661 ++++++++++++++++++++++++++++++++
 fs/xfs/scrub/reap.c               |  145 +++++++
 fs/xfs/scrub/reap.h               |    2 
 fs/xfs/scrub/repair.c             |   47 ++
 fs/xfs/scrub/repair.h             |   11 +
 fs/xfs/scrub/scrub.c              |   20 -
 fs/xfs/scrub/trace.h              |  115 +++++
 fs/xfs/xfs_trans.c                |   95 +++++
 fs/xfs/xfs_trans.h                |    4 
 22 files changed, 2087 insertions(+), 45 deletions(-)
 create mode 100644 fs/xfs/scrub/bmap_repair.c
 create mode 100644 fs/xfs/scrub/cow_repair.c


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Reintroduce to xrep_reap_extents the ability to reap extents from any
AG.  We dropped this before because it was buggy, but in the next patch
we will gain the ability to reap old bmap btrees, which can have blocks
in any AG.  To do this, we require that sc->sa is uninitialized, so that
we can use it to hold all the per-AG context for a given extent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bitmap.h |   28 +++++++++++
 fs/xfs/scrub/reap.c   |  120 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/scrub/reap.h   |    2 +
 fs/xfs/scrub/repair.h |    1 
 4 files changed, 147 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 4265e53f45fa..7f1b9c9c7831 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -103,4 +103,32 @@ int xagb_bitmap_set_btblocks(struct xagb_bitmap *bitmap,
 int xagb_bitmap_set_btcur_path(struct xagb_bitmap *bitmap,
 		struct xfs_btree_cur *cur);
 
+/* Bitmaps, but for type-checked for xfs_fsblock_t */
+
+struct xfsb_bitmap {
+	struct xbitmap	fsbitmap;
+};
+
+static inline void xfsb_bitmap_init(struct xfsb_bitmap *bitmap)
+{
+	xbitmap_init(&bitmap->fsbitmap);
+}
+
+static inline void xfsb_bitmap_destroy(struct xfsb_bitmap *bitmap)
+{
+	xbitmap_destroy(&bitmap->fsbitmap);
+}
+
+static inline int xfsb_bitmap_set(struct xfsb_bitmap *bitmap,
+		xfs_fsblock_t start, xfs_filblks_t len)
+{
+	return xbitmap_set(&bitmap->fsbitmap, start, len);
+}
+
+static inline int xfsb_bitmap_walk(struct xfsb_bitmap *bitmap,
+		xbitmap_walk_fn fn, void *priv)
+{
+	return xbitmap_walk(&bitmap->fsbitmap, fn, priv);
+}
+
 #endif	/* __XFS_SCRUB_BITMAP_H__ */
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 3ff4adaf3fe8..3da372bc6eaa 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -73,10 +73,10 @@
  * with only the same rmap owner but the block is not owned by something with
  * the same rmap owner, the block will be freed.
  *
- * The caller is responsible for locking the AG headers for the entire rebuild
- * operation so that nothing else can sneak in and change the AG state while
- * we're not looking.  We must also invalidate any buffers associated with
- * @bitmap.
+ * The caller is responsible for locking the AG headers/inode for the entire
+ * rebuild operation so that nothing else can sneak in and change the incore
+ * state while we're not looking.  We must also invalidate any buffers
+ * associated with @bitmap.
  */
 
 /* Information about reaping extents after a repair. */
@@ -500,3 +500,115 @@ xrep_reap_agblocks(
 
 	return 0;
 }
+
+/*
+ * Break a file metadata extent into sub-extents by fate (crosslinked, not
+ * crosslinked), and dispose of each sub-extent separately.  The extent must
+ * not cross an AG boundary.
+ */
+STATIC int
+xreap_fsmeta_extent(
+	uint64_t		fsbno,
+	uint64_t		len,
+	void			*priv)
+{
+	struct xreap_state	*rs = priv;
+	struct xfs_scrub	*sc = rs->sc;
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
+	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
+	xfs_agblock_t		agbno_next = agbno + len;
+	int			error = 0;
+
+	ASSERT(len <= XFS_MAX_BMBT_EXTLEN);
+	ASSERT(sc->ip != NULL);
+	ASSERT(!sc->sa.pag);
+
+	/*
+	 * We're reaping blocks after repairing file metadata, which means that
+	 * we have to init the xchk_ag structure ourselves.
+	 */
+	sc->sa.pag = xfs_perag_get(sc->mp, agno);
+	if (!sc->sa.pag)
+		return -EFSCORRUPTED;
+
+	error = xfs_alloc_read_agf(sc->sa.pag, sc->tp, 0, &sc->sa.agf_bp);
+	if (error)
+		goto out_pag;
+
+	while (agbno < agbno_next) {
+		xfs_extlen_t	aglen;
+		bool		crosslinked;
+
+		error = xreap_agextent_select(rs, agbno, agbno_next,
+				&crosslinked, &aglen);
+		if (error)
+			goto out_agf;
+
+		error = xreap_agextent(rs, agbno, &aglen, crosslinked);
+		if (error)
+			goto out_agf;
+
+		if (xreap_want_defer_finish(rs)) {
+			/*
+			 * Holds the AGF buffer across the deferred chain
+			 * processing.
+			 */
+			error = xrep_defer_finish(sc);
+			if (error)
+				goto out_agf;
+			xreap_defer_finish_reset(rs);
+		} else if (xreap_want_roll(rs)) {
+			/*
+			 * Hold the AGF buffer across the transaction roll so
+			 * that we don't have to reattach it to the scrub
+			 * context.
+			 */
+			xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
+			error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+			xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
+			if (error)
+				goto out_agf;
+			xreap_reset(rs);
+		}
+
+		agbno += aglen;
+	}
+
+out_agf:
+	xfs_trans_brelse(sc->tp, sc->sa.agf_bp);
+	sc->sa.agf_bp = NULL;
+out_pag:
+	xfs_perag_put(sc->sa.pag);
+	sc->sa.pag = NULL;
+	return error;
+}
+
+/*
+ * Dispose of every block of every fs metadata extent in the bitmap.
+ * Do not use this to dispose of the mappings in an ondisk inode fork.
+ */
+int
+xrep_reap_fsblocks(
+	struct xfs_scrub		*sc,
+	struct xfsb_bitmap		*bitmap,
+	const struct xfs_owner_info	*oinfo)
+{
+	struct xreap_state		rs = {
+		.sc			= sc,
+		.oinfo			= oinfo,
+		.resv			= XFS_AG_RESV_NONE,
+	};
+	int				error;
+
+	ASSERT(xfs_has_rmapbt(sc->mp));
+	ASSERT(sc->ip != NULL);
+
+	error = xfsb_bitmap_walk(bitmap, xreap_fsmeta_extent, &rs);
+	if (error)
+		return error;
+
+	if (xreap_dirty(&rs))
+		return xrep_defer_finish(sc);
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h
index 032d1eac91b2..29a8b52a1a54 100644
--- a/fs/xfs/scrub/reap.h
+++ b/fs/xfs/scrub/reap.h
@@ -8,5 +8,7 @@
 
 int xrep_reap_agblocks(struct xfs_scrub *sc, struct xagb_bitmap *bitmap,
 		const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type);
+int xrep_reap_fsblocks(struct xfs_scrub *sc, struct xfsb_bitmap *bitmap,
+		const struct xfs_owner_info *oinfo);
 
 #endif /* __XFS_SCRUB_REAP_H__ */
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 892cb593a716..4b01a3d954e4 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -38,6 +38,7 @@ xrep_trans_commit(
 
 struct xbitmap;
 struct xagb_bitmap;
+struct xfsb_bitmap;
 
 int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink);
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/5] xfs: repair inode fork block mapping data structures
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
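
For readers who haven't followed the btree staging work, here is a hedged,
condensed sketch (not the repair code in this patch) of how the staged bmbt
cursor added below is meant to be driven: build a cursor around the fake
root, compute the geometry, bulk load the records, then commit the staged
root into the inode fork.  The ifakeroot is assumed to have been set up
already (the real repair code uses xrep_newbt_init_inode for that), block
reservation is omitted, and my_get_records, my_claim_block, my_iroot_size,
and rebuild_datafork_sketch are hypothetical stand-ins, not functions from
this series.

/* Hypothetical repair-specific callbacks, assumed to be defined elsewhere. */
int my_get_records(struct xfs_btree_cur *cur, unsigned int idx,
		struct xfs_btree_block *block, unsigned int nr_wanted,
		void *priv);
int my_claim_block(struct xfs_btree_cur *cur, union xfs_btree_ptr *ptr,
		void *priv);
size_t my_iroot_size(struct xfs_btree_cur *cur, unsigned int level,
		unsigned int nr_this_level, void *priv);

/* Sketch only: stage, bulk load, and commit a new data fork bmbt. */
STATIC int
rebuild_datafork_sketch(
	struct xfs_scrub	*sc,
	struct xbtree_ifakeroot	*ifake,
	uint64_t		nr_records)
{
	struct xfs_btree_bload	bload = {
		.get_records	= my_get_records,
		.claim_block	= my_claim_block,
		.iroot_size	= my_iroot_size,
	};
	struct xfs_btree_cur	*cur;
	int			error;

	/* Build a cursor around the fake root; the real fork is untouched. */
	cur = xfs_bmbt_stage_cursor(sc->mp, sc->ip, ifake);

	/* Work out how many blocks the new btree needs, then load it. */
	error = xfs_btree_bload_compute_geometry(cur, &bload, nr_records);
	if (error)
		goto out_cur;
	error = xfs_btree_bload(cur, &bload, NULL);
	if (error)
		goto out_cur;

	/* Swap the staged root into the real fork and log the inode core. */
	xfs_bmbt_commit_staged_btree(cur, sc->tp, XFS_DATA_FORK);
	xfs_btree_del_cursor(cur, 0);
	return 0;

out_cur:
	xfs_btree_del_cursor(cur, error);
	return error;
}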

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                   |    1 
 fs/xfs/libxfs/xfs_bmap_btree.c    |  112 +++++
 fs/xfs/libxfs/xfs_bmap_btree.h    |    5 
 fs/xfs/libxfs/xfs_btree_staging.c |   11 -
 fs/xfs/libxfs/xfs_btree_staging.h |    2 
 fs/xfs/libxfs/xfs_iext_tree.c     |   23 +
 fs/xfs/libxfs/xfs_inode_fork.c    |    1 
 fs/xfs/libxfs/xfs_inode_fork.h    |    3 
 fs/xfs/scrub/bmap.c               |   18 +
 fs/xfs/scrub/bmap_repair.c        |  776 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c             |   25 +
 fs/xfs/scrub/repair.h             |    6 
 fs/xfs/scrub/scrub.c              |    4 
 fs/xfs/scrub/trace.h              |   34 ++
 fs/xfs/xfs_trans.c                |   95 +++++
 fs/xfs/xfs_trans.h                |    4 
 16 files changed, 1093 insertions(+), 27 deletions(-)
 create mode 100644 fs/xfs/scrub/bmap_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 1f61f299e436..08dcb9eea58e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -178,6 +178,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
+				   bmap_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   newbt.o \
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index e53c5bd42e86..4c59da1a977b 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -15,6 +15,7 @@
 #include "xfs_trans.h"
 #include "xfs_alloc.h"
 #include "xfs_btree.h"
+#include "xfs_btree_staging.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_bmap.h"
 #include "xfs_error.h"
@@ -302,10 +303,7 @@ xfs_bmbt_get_minrecs(
 	int			level)
 {
 	if (level == cur->bc_nlevels - 1) {
-		struct xfs_ifork	*ifp;
-
-		ifp = xfs_ifork_ptr(cur->bc_ino.ip,
-				    cur->bc_ino.whichfork);
+		struct xfs_ifork	*ifp = xfs_btree_ifork_ptr(cur);
 
 		return xfs_bmbt_maxrecs(cur->bc_mp,
 					ifp->if_broot_bytes, level == 0) / 2;
@@ -320,10 +318,7 @@ xfs_bmbt_get_maxrecs(
 	int			level)
 {
 	if (level == cur->bc_nlevels - 1) {
-		struct xfs_ifork	*ifp;
-
-		ifp = xfs_ifork_ptr(cur->bc_ino.ip,
-				    cur->bc_ino.whichfork);
+		struct xfs_ifork	*ifp = xfs_btree_ifork_ptr(cur);
 
 		return xfs_bmbt_maxrecs(cur->bc_mp,
 					ifp->if_broot_bytes, level == 0);
@@ -560,20 +555,19 @@ static const struct xfs_btree_ops xfs_bmbt_ops = {
 /*
  * Allocate a new bmap btree cursor.
  */
-struct xfs_btree_cur *				/* new bmap btree cursor */
-xfs_bmbt_init_cursor(
+static struct xfs_btree_cur *			/* new bmap btree cursor */
+xfs_bmbt_init_common(
 	struct xfs_mount	*mp,		/* file system mount point */
 	struct xfs_trans	*tp,		/* transaction pointer */
 	struct xfs_inode	*ip,		/* inode owning the btree */
-	int			whichfork)	/* data or attr fork */
+	int			whichfork)
 {
-	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, whichfork);
 	struct xfs_btree_cur	*cur;
+
 	ASSERT(whichfork != XFS_COW_FORK);
 
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
 			mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache);
-	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
 	cur->bc_ops = &xfs_bmbt_ops;
@@ -581,10 +575,30 @@ xfs_bmbt_init_cursor(
 	if (xfs_has_crc(mp))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
-	cur->bc_ino.forksize = xfs_inode_fork_size(ip, whichfork);
 	cur->bc_ino.ip = ip;
 	cur->bc_ino.allocated = 0;
 	cur->bc_ino.flags = 0;
+
+	return cur;
+}
+
+/*
+ * Allocate a new bmap btree cursor.
+ */
+struct xfs_btree_cur *				/* new bmap btree cursor */
+xfs_bmbt_init_cursor(
+	struct xfs_mount	*mp,		/* file system mount point */
+	struct xfs_trans	*tp,		/* transaction pointer */
+	struct xfs_inode	*ip,		/* inode owning the btree */
+	int			whichfork)	/* data or attr fork */
+{
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, whichfork);
+	struct xfs_btree_cur	*cur;
+
+	cur = xfs_bmbt_init_common(mp, tp, ip, whichfork);
+
+	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
+	cur->bc_ino.forksize = xfs_inode_fork_size(ip, whichfork);
 	cur->bc_ino.whichfork = whichfork;
 
 	return cur;
@@ -601,6 +615,76 @@ xfs_bmbt_block_maxrecs(
 	return blocklen / (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t));
 }
 
+/*
+ * Allocate a new bmap btree cursor for reloading an inode block mapping data
+ * structure.  Note that callers can use the staged cursor to reload extents
+ * format inode forks if they rebuild the iext tree and commit the staged
+ * cursor immediately.
+ */
+struct xfs_btree_cur *
+xfs_bmbt_stage_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
+	struct xbtree_ifakeroot	*ifake)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_btree_ops	*ops;
+
+	cur = xfs_bmbt_init_common(mp, NULL, ip, ifake->if_whichfork);
+	cur->bc_nlevels = ifake->if_levels;
+	cur->bc_ino.forksize = ifake->if_fork_size;
+	/* Don't let anyone think we're attached to the real fork yet. */
+	cur->bc_ino.whichfork = -1;
+	xfs_btree_stage_ifakeroot(cur, ifake, &ops);
+	ops->update_cursor = NULL;
+	return cur;
+}
+
+/*
+ * Swap in the new inode fork root.  Once we pass this point the newly rebuilt
+ * mappings are in place and we have to kill off any old btree blocks.
+ */
+void
+xfs_bmbt_commit_staged_btree(
+	struct xfs_btree_cur	*cur,
+	struct xfs_trans	*tp,
+	int			whichfork)
+{
+	struct xbtree_ifakeroot	*ifake = cur->bc_ino.ifake;
+	struct xfs_ifork	*ifp;
+	static const short	brootflag[2] =
+		{ XFS_ILOG_DBROOT, XFS_ILOG_ABROOT };
+	static const short	extflag[2] =
+		{ XFS_ILOG_DEXT, XFS_ILOG_AEXT };
+	int			flags = XFS_ILOG_CORE;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_STAGING);
+	ASSERT(whichfork != XFS_COW_FORK);
+
+	/*
+	 * Free any resources hanging off the real fork, then shallow-copy the
+	 * staging fork's contents into the real fork to transfer everything
+	 * we just built.
+	 */
+	ifp = xfs_ifork_ptr(cur->bc_ino.ip, whichfork);
+	xfs_idestroy_fork(ifp);
+	memcpy(ifp, ifake->if_fork, sizeof(struct xfs_ifork));
+
+	switch (ifp->if_format) {
+	case XFS_DINODE_FMT_EXTENTS:
+		flags |= extflag[whichfork];
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		flags |= brootflag[whichfork];
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+	xfs_trans_log_inode(tp, cur->bc_ino.ip, flags);
+	xfs_btree_commit_ifakeroot(cur, tp, whichfork, &xfs_bmbt_ops);
+}
+
 /*
  * Calculate number of records in a bmap btree block.
  */
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
index 3e7a40a83835..151b8491f60e 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.h
+++ b/fs/xfs/libxfs/xfs_bmap_btree.h
@@ -11,6 +11,7 @@ struct xfs_btree_block;
 struct xfs_mount;
 struct xfs_inode;
 struct xfs_trans;
+struct xbtree_ifakeroot;
 
 /*
  * Btree block header size depends on a superblock flag.
@@ -106,6 +107,10 @@ extern int xfs_bmbt_change_owner(struct xfs_trans *tp, struct xfs_inode *ip,
 
 extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
 		struct xfs_trans *, struct xfs_inode *, int);
+struct xfs_btree_cur *xfs_bmbt_stage_cursor(struct xfs_mount *mp,
+		struct xfs_inode *ip, struct xbtree_ifakeroot *ifake);
+void xfs_bmbt_commit_staged_btree(struct xfs_btree_cur *cur,
+		struct xfs_trans *tp, int whichfork);
 
 extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
 		unsigned long long len);
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index 6fd6ea8e6fbd..4cdf7976b7bf 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -399,7 +399,7 @@ xfs_btree_bload_prep_block(
 		ASSERT(*bpp == NULL);
 
 		/* Allocate a new incore btree root block. */
-		new_size = bbl->iroot_size(cur, nr_this_block, priv);
+		new_size = bbl->iroot_size(cur, level, nr_this_block, priv);
 		ifp->if_broot = kmem_zalloc(new_size, 0);
 		ifp->if_broot_bytes = (int)new_size;
 
@@ -585,7 +585,14 @@ xfs_btree_bload_level_geometry(
 	unsigned int		desired_npb;
 	unsigned int		maxnr;
 
-	maxnr = cur->bc_ops->get_maxrecs(cur, level);
+	/*
+	 * Compute the absolute maximum number of records that we can store in
+	 * the ondisk block or inode root.
+	 */
+	if (cur->bc_ops->get_dmaxrecs)
+		maxnr = cur->bc_ops->get_dmaxrecs(cur, level);
+	else
+		maxnr = cur->bc_ops->get_maxrecs(cur, level);
 
 	/*
 	 * Compute the number of blocks we need to fill each block with the
diff --git a/fs/xfs/libxfs/xfs_btree_staging.h b/fs/xfs/libxfs/xfs_btree_staging.h
index d2eaf4fdc603..439d3490c878 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.h
+++ b/fs/xfs/libxfs/xfs_btree_staging.h
@@ -56,7 +56,7 @@ typedef int (*xfs_btree_bload_get_records_fn)(struct xfs_btree_cur *cur,
 typedef int (*xfs_btree_bload_claim_block_fn)(struct xfs_btree_cur *cur,
 		union xfs_btree_ptr *ptr, void *priv);
 typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur,
-		unsigned int nr_this_level, void *priv);
+		unsigned int level, unsigned int nr_this_level, void *priv);
 
 struct xfs_btree_bload {
 	/*
diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index 773cf4349428..d062794cc795 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -622,13 +622,11 @@ static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp)
 }
 
 void
-xfs_iext_insert(
-	struct xfs_inode	*ip,
+xfs_iext_insert_raw(
+	struct xfs_ifork	*ifp,
 	struct xfs_iext_cursor	*cur,
-	struct xfs_bmbt_irec	*irec,
-	int			state)
+	struct xfs_bmbt_irec	*irec)
 {
-	struct xfs_ifork	*ifp = xfs_iext_state_to_fork(ip, state);
 	xfs_fileoff_t		offset = irec->br_startoff;
 	struct xfs_iext_leaf	*new = NULL;
 	int			nr_entries, i;
@@ -662,12 +660,23 @@ xfs_iext_insert(
 	xfs_iext_set(cur_rec(cur), irec);
 	ifp->if_bytes += sizeof(struct xfs_iext_rec);
 
-	trace_xfs_iext_insert(ip, cur, state, _RET_IP_);
-
 	if (new)
 		xfs_iext_insert_node(ifp, xfs_iext_leaf_key(new, 0), new, 2);
 }
 
+void
+xfs_iext_insert(
+	struct xfs_inode	*ip,
+	struct xfs_iext_cursor	*cur,
+	struct xfs_bmbt_irec	*irec,
+	int			state)
+{
+	struct xfs_ifork	*ifp = xfs_iext_state_to_fork(ip, state);
+
+	xfs_iext_insert_raw(ifp, cur, irec);
+	trace_xfs_iext_insert(ip, cur, state, _RET_IP_);
+}
+
 static struct xfs_iext_node *
 xfs_iext_rebalance_node(
 	struct xfs_iext_node	*parent,
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index ff37eecec4b0..a175e7f0f30f 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -506,6 +506,7 @@ xfs_idata_realloc(
 	ifp->if_bytes = new_size;
 }
 
+/* Free all memory and reset a fork back to its initial state. */
 void
 xfs_idestroy_fork(
 	struct xfs_ifork	*ifp)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index d3943d6ad0b9..36a9fe3420cd 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -179,6 +179,9 @@ void		xfs_init_local_fork(struct xfs_inode *ip, int whichfork,
 				const void *data, int64_t size);
 
 xfs_extnum_t	xfs_iext_count(struct xfs_ifork *ifp);
+void		xfs_iext_insert_raw(struct xfs_ifork *ifp,
+			struct xfs_iext_cursor *cur,
+			struct xfs_bmbt_irec *irec);
 void		xfs_iext_insert(struct xfs_inode *, struct xfs_iext_cursor *cur,
 			struct xfs_bmbt_irec *, int);
 void		xfs_iext_remove(struct xfs_inode *, struct xfs_iext_cursor *,
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index b6fa880787d9..93fb88ca5f28 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -48,6 +48,15 @@ xchk_setup_inode_bmap(
 	if (S_ISREG(VFS_I(sc->ip)->i_mode) &&
 	    sc->sm->sm_type == XFS_SCRUB_TYPE_BMBTD) {
 		struct address_space	*mapping = VFS_I(sc->ip)->i_mapping;
+		bool			is_repair = xchk_could_repair(sc);
+
+		/* Break all our leases, we're going to mess with things. */
+		if (is_repair) {
+			error = xfs_break_layouts(VFS_I(sc->ip),
+					&sc->ilock_flags, BREAK_UNMAP);
+			if (error)
+				goto out;
+		}
 
 		inode_dio_wait(VFS_I(sc->ip));
 
@@ -69,6 +78,15 @@ xchk_setup_inode_bmap(
 			error = filemap_fdatawait_keep_errors(mapping);
 		if (error && (error != -ENOSPC && error != -EIO))
 			goto out;
+
+		/* Drop the page cache if we're repairing block mappings. */
+		if (is_repair) {
+			error = invalidate_inode_pages2(
+					VFS_I(sc->ip)->i_mapping);
+			if (error)
+				goto out;
+		}
+
 	}
 
 	/* Got the inode, lock it and we're ready to go. */
diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c
new file mode 100644
index 000000000000..b9425f1a87d7
--- /dev/null
+++ b/fs/xfs/scrub/bmap_repair.c
@@ -0,0 +1,776 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_btree_staging.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_quota.h"
+#include "xfs_ialloc.h"
+#include "xfs_ag.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/newbt.h"
+#include "scrub/reap.h"
+
+/*
+ * Inode Fork Block Mapping (BMBT) Repair
+ * ======================================
+ *
+ * Gather all the rmap records for the inode and fork we're fixing, reset the
+ * incore fork, then recreate the btree.
+ */
+struct xrep_bmap {
+	/* Old bmbt blocks */
+	struct xfsb_bitmap	old_bmbt_blocks;
+
+	/* New fork. */
+	struct xrep_newbt	new_bmapbt;
+
+	/* List of new bmap records. */
+	struct xfarray		*bmap_records;
+
+	struct xfs_scrub	*sc;
+
+	/* How many blocks did we find allocated to this file? */
+	xfs_rfsblock_t		nblocks;
+
+	/* How many bmbt blocks did we find for this fork? */
+	xfs_rfsblock_t		old_bmbt_block_count;
+
+	/* get_records()'s position in the bmap record array. */
+	xfarray_idx_t		array_cur;
+
+	/* How many real (non-hole, non-delalloc) mappings do we have? */
+	uint64_t		real_mappings;
+
+	/* Which fork are we fixing? */
+	int			whichfork;
+};
+
+/* Remember this reverse-mapping as a series of bmap records. */
+STATIC int
+xrep_bmap_from_rmap(
+	struct xrep_bmap	*rb,
+	xfs_fileoff_t		startoff,
+	xfs_fsblock_t		startblock,
+	xfs_filblks_t		blockcount,
+	bool			unwritten)
+{
+	struct xfs_bmbt_rec	rbe;
+	struct xfs_bmbt_irec	irec;
+	int			error = 0;
+
+	irec.br_startoff = startoff;
+	irec.br_startblock = startblock;
+	irec.br_state = unwritten ? XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
+
+	do {
+		xfs_failaddr_t	fa;
+
+		irec.br_blockcount = min_t(xfs_filblks_t, blockcount,
+				XFS_MAX_BMBT_EXTLEN);
+
+		fa = xfs_bmap_validate_extent(rb->sc->ip, rb->whichfork, &irec);
+		if (fa)
+			return -EFSCORRUPTED;
+
+		xfs_bmbt_disk_set_all(&rbe, &irec);
+
+		trace_xrep_bmap_found(rb->sc->ip, rb->whichfork, &irec);
+
+		if (xchk_should_terminate(rb->sc, &error))
+			return error;
+
+		error = xfarray_append(rb->bmap_records, &rbe);
+		if (error)
+			return error;
+
+		rb->real_mappings++;
+
+		irec.br_startblock += irec.br_blockcount;
+		irec.br_startoff += irec.br_blockcount;
+		blockcount -= irec.br_blockcount;
+	} while (blockcount > 0);
+
+	return 0;
+}
+
+/* Check for any obvious errors or conflicts in the file mapping. */
+STATIC int
+xrep_bmap_check_fork_rmap(
+	struct xrep_bmap		*rb,
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec)
+{
+	struct xfs_scrub		*sc = rb->sc;
+	enum xbtree_recpacking		outcome;
+	int				error;
+
+	/*
+	 * Data extents for rt files are never stored on the data device, but
+	 * everything else (xattrs, bmbt blocks) can be.
+	 */
+	if (XFS_IS_REALTIME_INODE(sc->ip) &&
+	    !(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK)))
+		return -EFSCORRUPTED;
+
+	/* Check that this is within the AG. */
+	if (!xfs_verify_agbext(cur->bc_ag.pag, rec->rm_startblock,
+				rec->rm_blockcount))
+		return -EFSCORRUPTED;
+
+	/* Check the file offset range. */
+	if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
+	    !xfs_verify_fileext(sc->mp, rec->rm_offset, rec->rm_blockcount))
+		return -EFSCORRUPTED;
+
+	/* No contradictory flags. */
+	if ((rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK)) &&
+	    (rec->rm_flags & XFS_RMAP_UNWRITTEN))
+		return -EFSCORRUPTED;
+
+	/* Make sure this isn't free space. */
+	error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rm_startblock,
+			rec->rm_blockcount, &outcome);
+	if (error)
+		return error;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		return -EFSCORRUPTED;
+
+	/* Must not be an inode chunk. */
+	error = xfs_ialloc_has_inodes_at_extent(sc->sa.ino_cur,
+			rec->rm_startblock, rec->rm_blockcount, &outcome);
+	if (error)
+		return error;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Record extents that belong to this inode's fork. */
+STATIC int
+xrep_bmap_walk_rmap(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_bmap		*rb = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	xfs_fsblock_t			fsbno;
+	int				error = 0;
+
+	if (xchk_should_terminate(rb->sc, &error))
+		return error;
+
+	if (rec->rm_owner != rb->sc->ip->i_ino)
+		return 0;
+
+	error = xrep_bmap_check_fork_rmap(rb, cur, rec);
+	if (error)
+		return error;
+
+	/*
+	 * Record all blocks allocated to this file even if the extent isn't
+	 * for the fork we're rebuilding so that we can reset di_nblocks later.
+	 */
+	rb->nblocks += rec->rm_blockcount;
+
+	/* If this rmap isn't for the fork we want, we're done. */
+	if (rb->whichfork == XFS_DATA_FORK &&
+	    (rec->rm_flags & XFS_RMAP_ATTR_FORK))
+		return 0;
+	if (rb->whichfork == XFS_ATTR_FORK &&
+	    !(rec->rm_flags & XFS_RMAP_ATTR_FORK))
+		return 0;
+
+	fsbno = XFS_AGB_TO_FSB(mp, cur->bc_ag.pag->pag_agno,
+			rec->rm_startblock);
+
+	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
+		rb->old_bmbt_block_count += rec->rm_blockcount;
+		return xfsb_bitmap_set(&rb->old_bmbt_blocks, fsbno,
+				rec->rm_blockcount);
+	}
+
+	return xrep_bmap_from_rmap(rb, rec->rm_offset, fsbno,
+			rec->rm_blockcount,
+			rec->rm_flags & XFS_RMAP_UNWRITTEN);
+}
+
+/*
+ * Compare two block mapping records.  We want to sort in order of increasing
+ * file offset.
+ */
+static int
+xrep_bmap_extent_cmp(
+	const void			*a,
+	const void			*b)
+{
+	xfs_fileoff_t			ao;
+	xfs_fileoff_t			bo;
+
+	ao = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)a);
+	bo = xfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)b);
+
+	if (ao > bo)
+		return 1;
+	else if (ao < bo)
+		return -1;
+	return 0;
+}
+
+/*
+ * Sort the bmap extents by fork offset or else the records will be in the
+ * wrong order.  Ensure there are no overlaps in the file offset ranges.
+ */
+STATIC int
+xrep_bmap_sort_records(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_bmbt_irec	irec;
+	xfs_fileoff_t		next_off = 0;
+	xfarray_idx_t		array_cur;
+	int			error;
+
+	error = xfarray_sort(rb->bmap_records, xrep_bmap_extent_cmp,
+			XFARRAY_SORT_KILLABLE);
+	if (error)
+		return error;
+
+	foreach_xfarray_idx(rb->bmap_records, array_cur) {
+		struct xfs_bmbt_rec	rec;
+
+		if (xchk_should_terminate(rb->sc, &error))
+			return error;
+
+		error = xfarray_load(rb->bmap_records, array_cur, &rec);
+		if (error)
+			return error;
+
+		xfs_bmbt_disk_get_all(&rec, &irec);
+
+		if (irec.br_startoff < next_off)
+			return -EFSCORRUPTED;
+
+		next_off = irec.br_startoff + irec.br_blockcount;
+	}
+
+	return 0;
+}
+
+/* Scan one AG for reverse mappings that we can turn into extent maps. */
+STATIC int
+xrep_bmap_scan_ag(
+	struct xrep_bmap	*rb,
+	struct xfs_perag	*pag)
+{
+	struct xfs_scrub	*sc = rb->sc;
+	int			error;
+
+	error = xrep_ag_init(sc, pag, &sc->sa);
+	if (error)
+		return error;
+
+	error = xfs_rmap_query_all(sc->sa.rmap_cur, xrep_bmap_walk_rmap, rb);
+	xchk_ag_free(sc, &sc->sa);
+	return error;
+}
+
+/* Find the delalloc extents from the old incore extent tree. */
+STATIC int
+xrep_bmap_find_delalloc(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_bmbt_irec	irec;
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_rec	rbe;
+	struct xfs_inode	*ip = rb->sc->ip;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, rb->whichfork);
+	int			error = 0;
+
+	/*
+	 * Skip this scan if we don't expect to find delayed allocation
+	 * reservations in this fork.
+	 */
+	if (rb->whichfork == XFS_ATTR_FORK || ip->i_delayed_blks == 0)
+		return 0;
+
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		if (!isnullstartblock(irec.br_startblock))
+			continue;
+
+		xfs_bmbt_disk_set_all(&rbe, &irec);
+
+		trace_xrep_bmap_found(ip, rb->whichfork, &irec);
+
+		if (xchk_should_terminate(rb->sc, &error))
+			return error;
+
+		error = xfarray_append(rb->bmap_records, &rbe);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Collect block mappings for this fork of this inode and decide if we have
+ * enough space to rebuild.  Caller is responsible for cleaning up the list if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_bmap_find_mappings(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_scrub	*sc = rb->sc;
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	int			error = 0;
+
+	/* Iterate the rmaps for extents. */
+	for_each_perag(sc->mp, agno, pag) {
+		error = xrep_bmap_scan_ag(rb, pag);
+		if (error) {
+			xfs_perag_put(pag);
+			return error;
+		}
+	}
+
+	return xrep_bmap_find_delalloc(rb);
+}
+
+/* Retrieve real extent mappings for bulk loading the bmap btree. */
+STATIC int
+xrep_bmap_get_records(
+	struct xfs_btree_cur	*cur,
+	unsigned int		idx,
+	struct xfs_btree_block	*block,
+	unsigned int		nr_wanted,
+	void			*priv)
+{
+	struct xfs_bmbt_rec	rec;
+	struct xfs_bmbt_irec	*irec = &cur->bc_rec.b;
+	struct xrep_bmap	*rb = priv;
+	union xfs_btree_rec	*block_rec;
+	unsigned int		loaded;
+	int			error;
+
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		do {
+			error = xfarray_load(rb->bmap_records, rb->array_cur++,
+					&rec);
+			if (error)
+				return error;
+
+			xfs_bmbt_disk_get_all(&rec, irec);
+		} while (isnullstartblock(irec->br_startblock));
+
+		block_rec = xfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
+}
+
+/* Feed one of the new btree blocks to the bulk loader. */
+STATIC int
+xrep_bmap_claim_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	void			*priv)
+{
+	struct xrep_bmap	*rb = priv;
+	int			error;
+
+	error = xrep_newbt_relog_autoreap(&rb->new_bmapbt);
+	if (error)
+		return error;
+
+	return xrep_newbt_claim_block(cur, &rb->new_bmapbt, ptr);
+}
+
+/* Figure out how much space we need to create the incore btree root block. */
+STATIC size_t
+xrep_bmap_iroot_size(
+	struct xfs_btree_cur	*cur,
+	unsigned int		level,
+	unsigned int		nr_this_level,
+	void			*priv)
+{
+	ASSERT(level > 0);
+
+	return XFS_BMAP_BROOT_SPACE_CALC(cur->bc_mp, nr_this_level);
+}
+
+/* Update the inode counters. */
+STATIC int
+xrep_bmap_reset_counters(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_scrub	*sc = rb->sc;
+	struct xbtree_ifakeroot	*ifake = &rb->new_bmapbt.ifake;
+	int64_t			delta;
+
+	/*
+	 * Update the inode block counts to reflect the extents we found in the
+	 * rmapbt.
+	 */
+	delta = ifake->if_blocks - rb->old_bmbt_block_count;
+	sc->ip->i_nblocks = rb->nblocks + delta;
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+
+	/*
+	 * Adjust the quota counts by the difference in size between the old
+	 * and new bmbt.
+	 */
+	xfs_trans_mod_dquot_byino(sc->tp, sc->ip, XFS_TRANS_DQ_BCOUNT, delta);
+	return 0;
+}
+
+/*
+ * Create a new iext tree and load it with block mappings.  If the inode is
+ * in extents format, that's all we need to do to commit the new mappings.
+ * If it is in btree format, this takes care of preloading the incore tree.
+ */
+STATIC int
+xrep_bmap_extents_load(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	irec;
+	struct xfs_ifork	*ifp = rb->new_bmapbt.ifake.if_fork;
+	xfarray_idx_t		array_cur;
+	int			error;
+
+	ASSERT(ifp->if_bytes == 0);
+
+	/* Add all the mappings (incl. delalloc) to the incore extent tree. */
+	xfs_iext_first(ifp, &icur);
+	foreach_xfarray_idx(rb->bmap_records, array_cur) {
+		struct xfs_bmbt_rec	rec;
+
+		error = xfarray_load(rb->bmap_records, array_cur, &rec);
+		if (error)
+			return error;
+
+		xfs_bmbt_disk_get_all(&rec, &irec);
+
+		xfs_iext_insert_raw(ifp, &icur, &irec);
+		if (!isnullstartblock(irec.br_startblock))
+			ifp->if_nextents++;
+
+		xfs_iext_next(ifp, &icur);
+	}
+
+	return xrep_ino_ensure_extent_count(rb->sc, rb->whichfork,
+			ifp->if_nextents);
+}
+
+/*
+ * Reserve new btree blocks, bulk load the bmap records into the ondisk btree,
+ * and load the incore extent tree.
+ */
+STATIC int
+xrep_bmap_btree_load(
+	struct xrep_bmap	*rb,
+	struct xfs_btree_cur	*bmap_cur)
+{
+	struct xfs_scrub	*sc = rb->sc;
+	int			error;
+
+	/* Compute how many blocks we'll need. */
+	error = xfs_btree_bload_compute_geometry(bmap_cur,
+			&rb->new_bmapbt.bload, rb->real_mappings);
+	if (error)
+		return error;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	/*
+	 * Guess how many blocks we're going to need to rebuild an entire bmap
+	 * from the number of extents we found, and pump up our transaction to
+	 * have sufficient block reservation.  We're allowed to exceed file
+	 * quota to repair inconsistent metadata.
+	 */
+	error = xfs_trans_reserve_more_inode(sc->tp, sc->ip,
+			rb->new_bmapbt.bload.nr_blocks, 0, true);
+	if (error)
+		return error;
+
+	/* Reserve the space we'll need for the new btree. */
+	error = xrep_newbt_alloc_blocks(&rb->new_bmapbt,
+			rb->new_bmapbt.bload.nr_blocks);
+	if (error)
+		return error;
+
+	/* Add all observed bmap records. */
+	rb->array_cur = XFARRAY_CURSOR_INIT;
+	error = xfs_btree_bload(bmap_cur, &rb->new_bmapbt.bload, rb);
+	if (error)
+		return error;
+
+	/*
+	 * Load the new bmap records into the new incore extent tree to
+	 * preserve delalloc reservations for regular files.  The directory
+	 * code loads the extent tree during xfs_dir_open and assumes
+	 * thereafter that it remains loaded, so we must not violate that
+	 * assumption.
+	 */
+	return xrep_bmap_extents_load(rb);
+}
+
+/*
+ * Use the collected bmap information to stage a new bmap fork.  If this is
+ * successful we'll return with the new fork information logged to the repair
+ * transaction but not yet committed.  The caller must ensure that the inode
+ * is joined to the transaction; the inode will be joined to a clean
+ * transaction when the function returns.
+ */
+STATIC int
+xrep_bmap_build_new_fork(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_owner_info	oinfo;
+	struct xfs_scrub	*sc = rb->sc;
+	struct xfs_btree_cur	*bmap_cur;
+	struct xbtree_ifakeroot	*ifake = &rb->new_bmapbt.ifake;
+	int			error;
+
+	error = xrep_bmap_sort_records(rb);
+	if (error)
+		return error;
+
+	/*
+	 * Prepare to construct the new fork by initializing the new btree
+	 * structure and creating a fake ifork in the ifakeroot structure.
+	 */
+	xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, rb->whichfork);
+	error = xrep_newbt_init_inode(&rb->new_bmapbt, sc, rb->whichfork,
+			&oinfo);
+	if (error)
+		return error;
+
+	rb->new_bmapbt.bload.get_records = xrep_bmap_get_records;
+	rb->new_bmapbt.bload.claim_block = xrep_bmap_claim_block;
+	rb->new_bmapbt.bload.iroot_size = xrep_bmap_iroot_size;
+	bmap_cur = xfs_bmbt_stage_cursor(sc->mp, sc->ip, ifake);
+
+	/*
+	 * Figure out the size and format of the new fork, then fill it with
+	 * all the bmap records we've found.  Join the inode to the transaction
+	 * so that we can roll the transaction while holding the inode locked.
+	 */
+	if (rb->real_mappings <= XFS_IFORK_MAXEXT(sc->ip, rb->whichfork)) {
+		ifake->if_fork->if_format = XFS_DINODE_FMT_EXTENTS;
+		error = xrep_bmap_extents_load(rb);
+	} else {
+		ifake->if_fork->if_format = XFS_DINODE_FMT_BTREE;
+		error = xrep_bmap_btree_load(rb, bmap_cur);
+	}
+	if (error)
+		goto err_cur;
+
+	/*
+	 * Install the new fork in the inode.  After this point the old mapping
+	 * data are no longer accessible and the new tree is live.  We delete
+	 * the cursor immediately after committing the staged root because the
+	 * staged fork might be in extents format.
+	 */
+	xfs_bmbt_commit_staged_btree(bmap_cur, sc->tp, rb->whichfork);
+	xfs_btree_del_cursor(bmap_cur, 0);
+
+	/* Reset the inode counters now that we've changed the fork. */
+	error = xrep_bmap_reset_counters(rb);
+	if (error)
+		goto err_newbt;
+
+	/* Dispose of any unused blocks and the accounting information. */
+	error = xrep_newbt_commit(&rb->new_bmapbt);
+	if (error)
+		return error;
+
+	return xrep_roll_trans(sc);
+
+err_cur:
+	if (bmap_cur)
+		xfs_btree_del_cursor(bmap_cur, error);
+err_newbt:
+	xrep_newbt_cancel(&rb->new_bmapbt);
+	return error;
+}
+
+/*
+ * Now that we've logged the new inode btree, invalidate all of the old blocks
+ * and free them, if there were any.
+ */
+STATIC int
+xrep_bmap_remove_old_tree(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_scrub	*sc = rb->sc;
+	struct xfs_owner_info	oinfo;
+
+	/* Free the old bmbt blocks if they're not in use. */
+	xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, rb->whichfork);
+	return xrep_reap_fsblocks(sc, &rb->old_bmbt_blocks, &oinfo);
+}
+
+/* Check for garbage inputs.  Returns -ECANCELED if there's nothing to do. */
+STATIC int
+xrep_bmap_check_inputs(
+	struct xfs_scrub	*sc,
+	int			whichfork)
+{
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(sc->ip, whichfork);
+
+	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
+
+	if (!xfs_has_rmapbt(sc->mp))
+		return -EOPNOTSUPP;
+
+	/* No fork means nothing to rebuild. */
+	if (!ifp)
+		return -ECANCELED;
+
+	/*
+	 * We only know how to repair extent mappings, which is to say that we
+	 * only support extents and btree fork format.  Repairs to a local
+	 * format fork require a higher level repair function, so we do not
+	 * have any work to do here.
+	 */
+	switch (ifp->if_format) {
+	case XFS_DINODE_FMT_DEV:
+	case XFS_DINODE_FMT_LOCAL:
+	case XFS_DINODE_FMT_UUID:
+		return -ECANCELED;
+	case XFS_DINODE_FMT_EXTENTS:
+	case XFS_DINODE_FMT_BTREE:
+		break;
+	default:
+		return -EFSCORRUPTED;
+	}
+
+	if (whichfork == XFS_ATTR_FORK)
+		return 0;
+
+	/* Only files, symlinks, and directories get to have data forks. */
+	switch (VFS_I(sc->ip)->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+		/* ok */
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Don't know how to rebuild realtime data forks. */
+	if (XFS_IS_REALTIME_INODE(sc->ip))
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+/* Repair an inode fork. */
+STATIC int
+xrep_bmap(
+	struct xfs_scrub	*sc,
+	int			whichfork)
+{
+	struct xrep_bmap	*rb;
+	unsigned int		max_bmbt_recs;
+	bool			large_extcount;
+	int			error = 0;
+
+	error = xrep_bmap_check_inputs(sc, whichfork);
+	if (error == -ECANCELED)
+		return 0;
+	if (error)
+		return error;
+
+	rb = kzalloc(sizeof(struct xrep_bmap), XCHK_GFP_FLAGS);
+	if (!rb)
+		return -ENOMEM;
+	rb->sc = sc;
+	rb->whichfork = whichfork;
+
+	/* Set up enough storage to handle the max records for this fork. */
+	large_extcount = xfs_has_large_extent_counts(sc->mp);
+	max_bmbt_recs = xfs_iext_max_nextents(large_extcount, whichfork);
+	error = xfarray_create(sc->mp, "bmap records", max_bmbt_recs,
+			sizeof(struct xfs_bmbt_rec), &rb->bmap_records);
+	if (error)
+		goto out_rb;
+
+	/* Collect all reverse mappings for this fork's extents. */
+	xfsb_bitmap_init(&rb->old_bmbt_blocks);
+	error = xrep_bmap_find_mappings(rb);
+	if (error)
+		goto out_bitmap;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* Rebuild the bmap information. */
+	error = xrep_bmap_build_new_fork(rb);
+	if (error)
+		goto out_bitmap;
+
+	/* Kill the old tree. */
+	error = xrep_bmap_remove_old_tree(rb);
+
+out_bitmap:
+	xfsb_bitmap_destroy(&rb->old_bmbt_blocks);
+	xfarray_destroy(rb->bmap_records);
+out_rb:
+	kfree(rb);
+	return error;
+}
+
+/* Repair an inode's data fork. */
+int
+xrep_bmap_data(
+	struct xfs_scrub	*sc)
+{
+	return xrep_bmap(sc, XFS_DATA_FORK);
+}
+
+/* Repair an inode's attr fork. */
+int
+xrep_bmap_attr(
+	struct xfs_scrub	*sc)
+{
+	return xrep_bmap(sc, XFS_ATTR_FORK);
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index f06fb50d9082..55d096ad4c15 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -871,6 +871,31 @@ xrep_reinit_pagi(
 	return 0;
 }
 
+/* Given a reference to a perag structure, load AG headers and cursors. */
+int
+xrep_ag_init(
+	struct xfs_scrub	*sc,
+	struct xfs_perag	*pag,
+	struct xchk_ag		*sa)
+{
+	int			error;
+
+	ASSERT(!sa->pag);
+
+	error = xfs_ialloc_read_agi(pag, sc->tp, &sa->agi_bp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_read_agf(pag, sc->tp, 0, &sa->agf_bp);
+	if (error)
+		return error;
+
+	/* Grab our own reference to the perag structure. */
+	sa->pag = xfs_perag_bump(pag);
+	xrep_ag_btcur_init(sc, sa);
+	return 0;
+}
+
 /* Reinitialize the per-AG block reservation for the AG we just fixed. */
 int
 xrep_reset_perag_resv(
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 4b01a3d954e4..054eb9324ba3 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -79,6 +79,8 @@ struct xfs_imap;
 int xrep_setup_inode(struct xfs_scrub *sc, struct xfs_imap *imap);
 
 void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa);
+int xrep_ag_init(struct xfs_scrub *sc, struct xfs_perag *pag,
+		struct xchk_ag *sa);
 
 /* Metadata revalidators */
 
@@ -96,6 +98,8 @@ int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
 int xrep_refcountbt(struct xfs_scrub *sc);
 int xrep_inode(struct xfs_scrub *sc);
+int xrep_bmap_data(struct xfs_scrub *sc);
+int xrep_bmap_attr(struct xfs_scrub *sc);
 
 int xrep_reinit_pagf(struct xfs_scrub *sc);
 int xrep_reinit_pagi(struct xfs_scrub *sc);
@@ -154,6 +158,8 @@ xrep_setup_nothing(
 #define xrep_iallocbt			xrep_notsupported
 #define xrep_refcountbt			xrep_notsupported
 #define xrep_inode			xrep_notsupported
+#define xrep_bmap_data			xrep_notsupported
+#define xrep_bmap_attr			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index ace21c484cbc..3ddc2790601e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -289,13 +289,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode_bmap,
 		.scrub	= xchk_bmap_data,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_bmap_data,
 	},
 	[XFS_SCRUB_TYPE_BMBTA] = {	/* inode attr fork */
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode_bmap,
 		.scrub	= xchk_bmap_attr,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_bmap_attr,
 	},
 	[XFS_SCRUB_TYPE_BMBTC] = {	/* inode CoW fork */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 253b3cd784a9..531b15181c95 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1129,7 +1129,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
 DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
-DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_walk_rmap);
 
 TRACE_EVENT(xrep_abt_found,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
@@ -1214,6 +1214,38 @@ TRACE_EVENT(xrep_refc_found,
 		  __entry->refcount)
 )
 
+TRACE_EVENT(xrep_bmap_found,
+	TP_PROTO(struct xfs_inode *ip, int whichfork,
+		 struct xfs_bmbt_irec *irec),
+	TP_ARGS(ip, whichfork, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, whichfork)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_filblks_t, len)
+		__field(xfs_fsblock_t, pblk)
+		__field(int, state)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->whichfork = whichfork;
+		__entry->lblk = irec->br_startoff;
+		__entry->len = irec->br_blockcount;
+		__entry->pblk = irec->br_startblock;
+		__entry->state = irec->br_state;
+	),
+	TP_printk("dev %d:%d ino 0x%llx whichfork %s fileoff 0x%llx fsbcount 0x%llx startblock 0x%llx state %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
+		  __entry->lblk,
+		  __entry->len,
+		  __entry->pblk,
+		  __entry->state)
+);
+
 TRACE_EVENT(xrep_findroot_block,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
 		 uint32_t magic, uint16_t level),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 7bd16fbff534..1e95d11b6d7d 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -131,6 +131,62 @@ xfs_trans_dup(
 	return ntp;
 }
 
+/*
+ * Try to reserve more blocks for a transaction.
+ *
+ * This is for callers that need to attach resources to a transaction, scan
+ * those resources to determine the space reservation requirements, and then
+ * modify the attached resources.  In other words, online repair.  This can
+ * fail due to ENOSPC, so the caller must be able to cancel the transaction
+ * without shutting down the fs.
+ */
+int
+xfs_trans_reserve_more(
+	struct xfs_trans	*tp,
+	unsigned int		blocks,
+	unsigned int		rtextents)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
+	int			error = 0;
+
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/*
+	 * Attempt to reserve the needed disk blocks by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (blocks > 0) {
+		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
+		if (error)
+			return -ENOSPC;
+		tp->t_blk_res += blocks;
+	}
+
+	/*
+	 * Attempt to reserve the needed realtime extents by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (rtextents > 0) {
+		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
+		if (error) {
+			error = -ENOSPC;
+			goto out_blocks;
+		}
+		tp->t_rtx_res += rtextents;
+	}
+
+	return 0;
+out_blocks:
+	if (blocks > 0) {
+		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
+		tp->t_blk_res -= blocks;
+	}
+	return error;
+}
+
 /*
  * This is called to reserve free disk blocks and log space for the
  * given transaction.  This must be done before allocating any resources
@@ -1224,6 +1280,45 @@ xfs_trans_alloc_inode(
 	return error;
 }
 
+
+/* Try to reserve more blocks and file quota for a transaction. */
+int
+xfs_trans_reserve_more_inode(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	unsigned int		dblocks,
+	unsigned int		rblocks,
+	bool			force_quota)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	unsigned int		rtx = rblocks / mp->m_sb.sb_rextsize;
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+
+	error = xfs_trans_reserve_more(tp, dblocks, rtx);
+	if (error)
+		return error;
+
+	if (!XFS_IS_QUOTA_ON(mp) || xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
+		return 0;
+
+	if (tp->t_flags & XFS_TRANS_RESERVE)
+		force_quota = true;
+
+	error = xfs_trans_reserve_quota_nblks(tp, ip, dblocks, rblocks,
+			force_quota);
+	if (!error)
+		return 0;
+
+	/* Quota failed, give back the new reservation. */
+	xfs_mod_fdblocks(mp, dblocks, tp->t_flags & XFS_TRANS_RESERVE);
+	tp->t_blk_res -= dblocks;
+	xfs_mod_frextents(mp, rtx);
+	tp->t_rtx_res -= rtx;
+	return error;
+}
+
 /*
  * Allocate an transaction in preparation for inode creation by reserving quota
  * against the given dquots.  Callers are not required to hold any inode locks.
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 55819785941c..ae587101e167 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -168,6 +168,8 @@ typedef struct xfs_trans {
 int		xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
 			uint blocks, uint rtextents, uint flags,
 			struct xfs_trans **tpp);
+int		xfs_trans_reserve_more(struct xfs_trans *tp,
+			unsigned int blocks, unsigned int rtextents);
 int		xfs_trans_alloc_empty(struct xfs_mount *mp,
 			struct xfs_trans **tpp);
 void		xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);
@@ -260,6 +262,8 @@ struct xfs_dquot;
 int xfs_trans_alloc_inode(struct xfs_inode *ip, struct xfs_trans_res *resv,
 		unsigned int dblocks, unsigned int rblocks, bool force,
 		struct xfs_trans **tpp);
+int xfs_trans_reserve_more_inode(struct xfs_trans *tp, struct xfs_inode *ip,
+		unsigned int dblocks, unsigned int rblocks, bool force_quota);
 int xfs_trans_alloc_icreate(struct xfs_mount *mp, struct xfs_trans_res *resv,
 		struct xfs_dquot *udqp, struct xfs_dquot *gdqp,
 		struct xfs_dquot *pdqp, unsigned int dblocks,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

There are a couple of conditions that userspace can set to force repairs
of metadata.  These really belong in the repair code, not open-coded
into the check code, so refactor them into a helper.
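
As an illustration of one of those conditions (the
XFS_SCRUB_IFLAG_FORCE_REBUILD flag), here is a hedged userspace sketch,
not part of this patch: setting that flag next to the repair flag in the
scrub ioctl asks the kernel to rebuild the metadata even if checking finds
nothing wrong.  The helper name and the xfs/xfs.h header path are
assumptions made for the example.

/*
 * Hedged illustration only: force a rebuild of a file's data fork
 * mappings regardless of whether scrub thinks they are damaged.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* assumed to pull in the scrub ioctl definitions */

static int
force_rebuild_data_fork(const char *path)
{
	struct xfs_scrub_metadata	sm;
	int				fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	memset(&sm, 0, sizeof(sm));
	sm.sm_type = XFS_SCRUB_TYPE_BMBTD;
	sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR | XFS_SCRUB_IFLAG_FORCE_REBUILD;

	ret = ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
	close(fd);
	return ret;
}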

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/repair.c |   22 ++++++++++++++++++++++
 fs/xfs/scrub/repair.h |    2 ++
 fs/xfs/scrub/scrub.c  |   14 +-------------
 3 files changed, 25 insertions(+), 13 deletions(-)


diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 55d096ad4c15..3f3554d82eeb 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -27,6 +27,8 @@
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_defer.h"
+#include "xfs_errortag.h"
+#include "xfs_error.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -925,3 +927,23 @@ xrep_reset_perag_resv(
 out:
 	return error;
 }
+
+/* Decide if we are going to call the repair function for a scrub type. */
+bool
+xrep_will_attempt(
+	struct xfs_scrub	*sc)
+{
+	/* Userspace asked us to rebuild the structure regardless. */
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD)
+		return true;
+
+	/* Let debug users force us into the repair routines. */
+	if (XFS_TEST_ERROR(false, sc->mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
+		return true;
+
+	/* Metadata is corrupt or failed cross-referencing. */
+	if (xchk_needs_repair(sc->sm))
+		return true;
+
+	return false;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 054eb9324ba3..88af16715a46 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -18,6 +18,7 @@ static inline int xrep_notsupported(struct xfs_scrub *sc)
 /* Repair helpers */
 
 int xrep_attempt(struct xfs_scrub *sc);
+bool xrep_will_attempt(struct xfs_scrub *sc);
 void xrep_failure(struct xfs_mount *mp);
 int xrep_roll_ag_trans(struct xfs_scrub *sc);
 int xrep_roll_trans(struct xfs_scrub *sc);
@@ -107,6 +108,7 @@ int xrep_reinit_pagi(struct xfs_scrub *sc);
 #else
 
 #define xrep_ino_dqattach(sc)	(0)
+#define xrep_will_attempt(sc)	(false)
 
 static inline int
 xrep_attempt(
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 3ddc2790601e..61deec5a119a 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -14,8 +14,6 @@
 #include "xfs_inode.h"
 #include "xfs_quota.h"
 #include "xfs_qm.h"
-#include "xfs_errortag.h"
-#include "xfs_error.h"
 #include "xfs_scrub.h"
 #include "xfs_btree.h"
 #include "xfs_btree_staging.h"
@@ -545,21 +543,11 @@ xfs_scrub_metadata(
 	xchk_update_health(sc);
 
 	if (xchk_could_repair(sc)) {
-		bool needs_fix = xchk_needs_repair(sc->sm);
-
-		/* Userspace asked us to rebuild the structure regardless. */
-		if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD)
-			needs_fix = true;
-
-		/* Let debug users force us into the repair routines. */
-		if (XFS_TEST_ERROR(needs_fix, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
-			needs_fix = true;
-
 		/*
 		 * If userspace asked for a repair but it wasn't necessary,
 		 * report that back to userspace.
 		 */
-		if (!needs_fix) {
+		if (!xrep_will_attempt(sc)) {
 			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED;
 			goto out_nofix;
 		}


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs: create a ranged query function for refcount btrees
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Implement ranged queries for refcount records.  The next patch will use
this to scan refcount data.
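
To show the intended shape of the new interface, here is a hedged usage
sketch (not taken from this patch): fill out the low and high keys for the
domain of interest and pass a callback of type xfs_refcount_query_range_fn.
The count_cow_staging and count_cow_records helpers below are hypothetical;
the CoW fork repair patch later in this series applies the same pattern to
sc->sa.refc_cur.

/* Sketch: count how many CoW-domain records cover an AG block range. */
struct count_ctx {
	uint64_t			nr;
};

STATIC int
count_cow_records(
	struct xfs_btree_cur		*cur,
	const struct xfs_refcount_irec	*rec,
	void				*priv)
{
	struct count_ctx		*ctx = priv;

	ctx->nr++;
	return 0;
}

STATIC int
count_cow_staging(
	struct xfs_btree_cur		*refc_cur,
	xfs_agblock_t			agbno,
	xfs_extlen_t			len,
	uint64_t			*nr)
{
	struct xfs_refcount_irec	low = { 0 };
	struct xfs_refcount_irec	high = { 0 };
	struct count_ctx		ctx = { 0 };
	int				error;

	low.rc_startblock = agbno;
	high.rc_startblock = agbno + len - 1;
	low.rc_domain = high.rc_domain = XFS_REFC_DOMAIN_COW;

	error = xfs_refcount_query_range(refc_cur, &low, &high,
			count_cow_records, &ctx);
	*nr = ctx.nr;
	return error;
}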

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   41 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h |   10 ++++++++++
 2 files changed, 51 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 8082bb7b953a..bd4cd1c65221 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -2028,6 +2028,47 @@ xfs_refcount_has_records(
 	return xfs_btree_has_records(cur, &low, &high, NULL, outcome);
 }
 
+struct xfs_refcount_query_range_info {
+	xfs_refcount_query_range_fn	fn;
+	void				*priv;
+};
+
+/* Format btree record and pass to our callback. */
+STATIC int
+xfs_refcount_query_range_helper(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_rec	*rec,
+	void				*priv)
+{
+	struct xfs_refcount_query_range_info	*query = priv;
+	struct xfs_refcount_irec	irec;
+	xfs_failaddr_t			fa;
+
+	xfs_refcount_btrec_to_irec(rec, &irec);
+	fa = xfs_refcount_check_irec(cur, &irec);
+	if (fa)
+		return xfs_refcount_complain_bad_rec(cur, fa, &irec);
+
+	return query->fn(cur, &irec, query->priv);
+}
+
+/* Find all refcount records between two keys. */
+int
+xfs_refcount_query_range(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*low_rec,
+	const struct xfs_refcount_irec	*high_rec,
+	xfs_refcount_query_range_fn	fn,
+	void				*priv)
+{
+	union xfs_btree_irec		low_brec = { .rc = *low_rec };
+	union xfs_btree_irec		high_brec = { .rc = *high_rec };
+	struct xfs_refcount_query_range_info query = { .priv = priv, .fn = fn };
+
+	return xfs_btree_query_range(cur, &low_brec, &high_brec,
+			xfs_refcount_query_range_helper, &query);
+}
+
 int __init
 xfs_refcount_intent_init_cache(void)
 {
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 2d6fecb258bb..9563eb91be17 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -129,4 +129,14 @@ extern struct kmem_cache	*xfs_refcount_intent_cache;
 int __init xfs_refcount_intent_init_cache(void);
 void xfs_refcount_intent_destroy_cache(void);
 
+typedef int (*xfs_refcount_query_range_fn)(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*rec,
+	void				*priv);
+
+int xfs_refcount_query_range(struct xfs_btree_cur *cur,
+		const struct xfs_refcount_irec *low_rec,
+		const struct xfs_refcount_irec *high_rec,
+		xfs_refcount_query_range_fn fn, void *priv);
+
 #endif	/* __XFS_REFCOUNT_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] xfs: repair problems in CoW forks
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Try to repair errors that we see in file CoW forks so that we don't do
stupid things like remap garbage into a file.  There's not a lot we can
do with the CoW fork -- the ondisk metadata records only that the CoW
staging extents are owned by the refcount btree, which effectively means
that we can't reconstruct this incore structure from scratch.

Actually, this is even worse -- we can't touch written extents, because
those map space that is actively under writeback, and there's not much
to do with delalloc reservations.  Hence we can only detect crosslinked
unwritten extents and fix them by punching out the problematic parts and
replacing them with delalloc extents.
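
To restate that policy as code, here is a hedged sketch of the candidate
test (the helper name is hypothetical and the check is a summary of the
rules above, not code from this patch): only unwritten, real mappings in
the CoW fork are eligible for replacement.

/* Sketch: is this CoW fork mapping something the repair code may touch? */
static inline bool
cow_mapping_is_repair_candidate(
	const struct xfs_bmbt_irec	*irec)
{
	/* Written extents may be under writeback; leave them alone. */
	if (irec->br_state != XFS_EXT_UNWRITTEN)
		return false;

	/* Delalloc reservations exist only incore; nothing to fix ondisk. */
	if (isnullstartblock(irec->br_startblock))
		return false;

	return true;
}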

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile           |    1 
 fs/xfs/scrub/cow_repair.c |  661 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/reap.c       |   25 ++
 fs/xfs/scrub/repair.h     |    2 
 fs/xfs/scrub/scrub.c      |    2 
 fs/xfs/scrub/trace.h      |   81 ++++++
 6 files changed, 771 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/cow_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 08dcb9eea58e..b0a36ebb0a3b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -179,6 +179,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
 				   bmap_repair.o \
+				   cow_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   newbt.o \
diff --git a/fs/xfs/scrub/cow_repair.c b/fs/xfs/scrub/cow_repair.c
new file mode 100644
index 000000000000..d1b5915e1703
--- /dev/null
+++ b/fs/xfs/scrub/cow_repair.c
@@ -0,0 +1,661 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_rmap.h"
+#include "xfs_refcount.h"
+#include "xfs_quota.h"
+#include "xfs_ialloc.h"
+#include "xfs_ag.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
+#include "xfs_icache.h"
+#include "xfs_refcount_btree.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/reap.h"
+
+/*
+ * CoW Fork Mapping Repair
+ * =======================
+ *
+ * Although CoW staging extents are owned by incore CoW inode forks, on disk
+ * they are owned by the refcount btree.  The ondisk metadata does not record
+ * any ownership information, which limits what we can do to repair the
+ * mappings in the CoW fork.  At most, we can replace ifork mappings that lack
+ * an entry in the refcount btree or are described by a reverse mapping record
+ * whose owner is not OWN_COW.
+ *
+ * Replacing extents is also tricky -- we can't touch written CoW fork extents
+ * since they are undergoing writeback, and delalloc extents do not require
+ * repair since they only exist incore.  Hence the most we can do is find the
+ * bad parts of unwritten mappings, allocate a replacement set of blocks, and
+ * replace the incore mapping.  We use the regular reaping process to unmap
+ * or free the discarded blocks, as appropriate.
+ */
+struct xrep_cow {
+	struct xfs_scrub	*sc;
+
+	/* Bitmap of file offset ranges that need replacing. */
+	struct xbitmap		bad_fileoffs;
+
+	/* Bitmap of fsblocks that were removed from the CoW fork. */
+	struct xfsb_bitmap	old_cowfork_fsblocks;
+
+	/* CoW fork mappings used to scan for bad CoW staging extents. */
+	struct xfs_bmbt_irec	irec;
+
+	/* refcount btree block number of irec.br_startblock */
+	unsigned int		irec_startbno;
+
+	/* refcount btree block number of the next refcount record we expect */
+	unsigned int		next_bno;
+};
+
+/*
+ * Mark the part of the file range that corresponds to the given physical
+ * space.  Caller must ensure that the physical range is within xc->irec.
+ */
+STATIC int
+xrep_cow_mark_file_range(
+	struct xrep_cow		*xc,
+	xfs_fsblock_t		startblock,
+	xfs_filblks_t		blockcount)
+{
+	xfs_fileoff_t		startoff;
+
+	startoff = xc->irec.br_startoff +
+				(startblock - xc->irec.br_startblock);
+
+	trace_xrep_cow_mark_file_range(xc->sc->ip, startblock, startoff,
+			blockcount);
+
+	return xbitmap_set(&xc->bad_fileoffs, startoff, blockcount);
+}
+
+/*
+ * Trim @src to fit within the CoW fork mapping being examined, and put the
+ * result in @dst.
+ */
+static inline void
+xrep_cow_trim_refcount(
+	struct xrep_cow			*xc,
+	struct xfs_refcount_irec	*dst,
+	const struct xfs_refcount_irec	*src)
+{
+	unsigned int			adj;
+
+	memcpy(dst, src, sizeof(*dst));
+
+	if (dst->rc_startblock < xc->irec_startbno) {
+		adj = xc->irec_startbno - dst->rc_startblock;
+		dst->rc_blockcount -= adj;
+		dst->rc_startblock += adj;
+	}
+
+	if (dst->rc_startblock + dst->rc_blockcount >
+	    xc->irec_startbno + xc->irec.br_blockcount) {
+		adj = (dst->rc_startblock + dst->rc_blockcount) -
+		      (xc->irec_startbno + xc->irec.br_blockcount);
+		dst->rc_blockcount -= adj;
+	}
+}
+
+/* Mark any shared CoW staging extents. */
+STATIC int
+xrep_cow_mark_shared_staging(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*rec,
+	void				*priv)
+{
+	struct xrep_cow			*xc = priv;
+	struct xfs_refcount_irec	rrec;
+	xfs_fsblock_t			fsbno;
+
+	if (!xfs_refcount_check_domain(rec) ||
+	    rec->rc_domain != XFS_REFC_DOMAIN_SHARED)
+		return -EFSCORRUPTED;
+
+	xrep_cow_trim_refcount(xc, &rrec, rec);
+
+	fsbno = XFS_AGB_TO_FSB(xc->sc->mp, cur->bc_ag.pag->pag_agno,
+			rrec.rc_startblock);
+	return xrep_cow_mark_file_range(xc, fsbno, rrec.rc_blockcount);
+}
+
+/*
+ * Mark any portion of the CoW fork file offset range where there is not a CoW
+ * staging extent record in the refcountbt, and keep a record of where we did
+ * find correct refcountbt records.  Staging records are always cleaned out at
+ * mount time, so any two inodes trying to map the same staging area would have
+ * already taken the fs down due to refcount btree verifier errors.  Hence this
+ * inode should be the sole creator of the staging extent records ondisk.
+ */
+STATIC int
+xrep_cow_mark_missing_staging(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*rec,
+	void				*priv)
+{
+	struct xrep_cow			*xc = priv;
+	struct xfs_refcount_irec	rrec;
+	int				error;
+
+	if (!xfs_refcount_check_domain(rec) ||
+	    rec->rc_domain != XFS_REFC_DOMAIN_COW)
+		return -EFSCORRUPTED;
+
+	xrep_cow_trim_refcount(xc, &rrec, rec);
+
+	if (xc->next_bno >= rrec.rc_startblock)
+		goto next;
+
+	error = xrep_cow_mark_file_range(xc,
+			XFS_AGB_TO_FSB(xc->sc->mp, cur->bc_ag.pag->pag_agno,
+				       xc->next_bno),
+			rrec.rc_startblock - xc->next_bno);
+	if (error)
+		return error;
+
+next:
+	xc->next_bno = rrec.rc_startblock + rrec.rc_blockcount;
+	return 0;
+}
+
+/*
+ * Mark any area that does not correspond to a CoW staging rmap.  These are
+ * cross-linked areas that must be avoided.
+ */
+STATIC int
+xrep_cow_mark_missing_staging_rmap(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_cow			*xc = priv;
+	xfs_fsblock_t			fsbno;
+	xfs_agblock_t			rec_bno;
+	xfs_extlen_t			rec_len;
+	unsigned int			adj;
+
+	if (rec->rm_owner == XFS_RMAP_OWN_COW)
+		return 0;
+
+	rec_bno = rec->rm_startblock;
+	rec_len = rec->rm_blockcount;
+	if (rec_bno < xc->irec_startbno) {
+		adj = xc->irec_startbno - rec_bno;
+		rec_len -= adj;
+		rec_bno += adj;
+	}
+
+	if (rec_bno + rec_len > xc->irec_startbno + xc->irec.br_blockcount) {
+		adj = (rec_bno + rec_len) -
+		      (xc->irec_startbno + xc->irec.br_blockcount);
+		rec_len -= adj;
+	}
+
+	fsbno = XFS_AGB_TO_FSB(xc->sc->mp, cur->bc_ag.pag->pag_agno, rec_bno);
+	return xrep_cow_mark_file_range(xc, fsbno, rec_len);
+}
+
+/*
+ * Find any part of the CoW fork mapping that isn't a single-owner CoW staging
+ * extent and mark the corresponding part of the file range in the bitmap.
+ */
+STATIC int
+xrep_cow_find_bad(
+	struct xrep_cow			*xc)
+{
+	struct xfs_refcount_irec	rc_low = { 0 };
+	struct xfs_refcount_irec	rc_high = { 0 };
+	struct xfs_rmap_irec		rm_low = { 0 };
+	struct xfs_rmap_irec		rm_high = { 0 };
+	struct xfs_perag		*pag;
+	struct xfs_scrub		*sc = xc->sc;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	agno = XFS_FSB_TO_AGNO(sc->mp, xc->irec.br_startblock);
+	xc->irec_startbno = XFS_FSB_TO_AGBNO(sc->mp, xc->irec.br_startblock);
+
+	pag = xfs_perag_get(sc->mp, agno);
+	if (!pag)
+		return -EFSCORRUPTED;
+
+	error = xrep_ag_init(sc, pag, &sc->sa);
+	if (error)
+		goto out_pag;
+
+	/* Mark any CoW fork extents that are shared. */
+	rc_low.rc_startblock = xc->irec_startbno;
+	rc_high.rc_startblock = xc->irec_startbno + xc->irec.br_blockcount - 1;
+	rc_low.rc_domain = rc_high.rc_domain = XFS_REFC_DOMAIN_SHARED;
+	error = xfs_refcount_query_range(sc->sa.refc_cur, &rc_low, &rc_high,
+			xrep_cow_mark_shared_staging, xc);
+	if (error)
+		goto out_sa;
+
+	/* Make sure there are CoW staging extents for the whole mapping. */
+	rc_low.rc_startblock = xc->irec_startbno;
+	rc_high.rc_startblock = xc->irec_startbno + xc->irec.br_blockcount - 1;
+	rc_low.rc_domain = rc_high.rc_domain = XFS_REFC_DOMAIN_COW;
+	xc->next_bno = xc->irec_startbno;
+	error = xfs_refcount_query_range(sc->sa.refc_cur, &rc_low, &rc_high,
+			xrep_cow_mark_missing_staging, xc);
+	if (error)
+		goto out_sa;
+
+	if (xc->next_bno < xc->irec_startbno + xc->irec.br_blockcount) {
+		error = xrep_cow_mark_file_range(xc,
+				XFS_AGB_TO_FSB(sc->mp, pag->pag_agno,
+					       xc->next_bno),
+				xc->irec_startbno + xc->irec.br_blockcount -
+				xc->next_bno);
+		if (error)
+			goto out_sa;
+	}
+
+	/* Mark any area that has an rmap that isn't a CoW staging extent. */
+	rm_low.rm_startblock = xc->irec_startbno;
+	memset(&rm_high, 0xFF, sizeof(rm_high));
+	rm_high.rm_startblock = xc->irec_startbno + xc->irec.br_blockcount - 1;
+	error = xfs_rmap_query_range(sc->sa.rmap_cur, &rm_low, &rm_high,
+			xrep_cow_mark_missing_staging_rmap, xc);
+	if (error)
+		goto out_sa;
+
+	/*
+	 * If userspace is forcing us to rebuild the CoW fork or someone turned
+	 * on the debugging knob, replace everything in the CoW fork.
+	 */
+	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD) ||
+	    XFS_TEST_ERROR(false, sc->mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR)) {
+		error = xrep_cow_mark_file_range(xc, xc->irec.br_startblock,
+				xc->irec.br_blockcount);
+		if (error)
+			goto out_sa;
+	}
+
+out_sa:
+	xchk_ag_free(sc, &sc->sa);
+out_pag:
+	xfs_perag_put(pag);
+	return error;
+}
+
+/*
+ * Allocate a replacement CoW staging extent of up to the given number of
+ * blocks, and fill out the mapping.  The caller must set irec->br_blockcount.
+ */
+STATIC int
+xrep_cow_alloc(
+	struct xfs_scrub	*sc,
+	struct xfs_bmbt_irec	*irec)
+{
+	struct xfs_alloc_arg	args = {
+		.tp		= sc->tp,
+		.mp		= sc->mp,
+		.oinfo		= XFS_RMAP_OINFO_SKIP_UPDATE,
+		.minlen		= 1,
+		.maxlen		= irec->br_blockcount,
+		.prod		= 1,
+		.type		= XFS_ALLOCTYPE_START_BNO,
+		.resv		= XFS_AG_RESV_NONE,
+		.datatype	= XFS_ALLOC_USERDATA,
+		.fsbno		= XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino),
+	};
+	int			error;
+
+	error = xfs_trans_reserve_more(sc->tp, irec->br_blockcount, 0);
+	if (error)
+		return error;
+
+	error = xfs_alloc_vextent(&args);
+	if (error)
+		return error;
+	if (args.fsbno == NULLFSBLOCK)
+		return -ENOSPC;
+
+	xfs_refcount_alloc_cow_extent(sc->tp, args.fsbno, args.len);
+
+	irec->br_startblock = args.fsbno;
+	irec->br_blockcount = args.len;
+	return 0;
+}
+
+/*
+ * Look up the current CoW fork mapping so that we only allocate enough to
+ * replace a single mapping.  If we don't find a mapping that covers the start
+ * of the file range, or we find a delalloc or written extent, something is
+ * seriously wrong, since we didn't drop the ILOCK.
+ */
+static inline int
+xrep_cow_find_mapping(
+	struct xrep_cow		*xc,
+	struct xfs_iext_cursor	*icur,
+	xfs_fileoff_t		startoff,
+	struct xfs_bmbt_irec	*got)
+{
+	struct xfs_inode	*ip = xc->sc->ip;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
+
+	if (!xfs_iext_lookup_extent(ip, ifp, startoff, icur, got))
+		goto bad;
+
+	if (got->br_startoff > startoff)
+		goto bad;
+
+	if (got->br_blockcount == 0)
+		goto bad;
+
+	if (isnullstartblock(got->br_startblock))
+		goto bad;
+
+	if (xfs_bmap_is_written_extent(got))
+		goto bad;
+
+	return 0;
+bad:
+	ASSERT(0);
+	return -EFSCORRUPTED;
+}
+
+#define REPLACE_LEFT_SIDE	(1U << 0)
+#define REPLACE_RIGHT_SIDE	(1U << 1)
+
+/*
+ * Given a CoW fork mapping @got and a replacement mapping @rep that is within
+ * the file range described by @got, replace the relevant parts of @got with
+ * @rep, and report the start of the replaced disk space in @old_startblock.
+ */
+static inline void
+xrep_cow_replace_mapping(
+	struct xfs_inode		*ip,
+	struct xfs_iext_cursor		*icur,
+	struct xfs_bmbt_irec		*got,
+	struct xfs_bmbt_irec		*rep,
+	xfs_fsblock_t			*old_startblock)
+{
+	struct xfs_ifork		*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
+	struct xfs_bmbt_irec		new;
+	xfs_fileoff_t			rep_endoff, got_endoff;
+	uint32_t			state = BMAP_COWFORK;
+
+	rep_endoff = rep->br_startoff + rep->br_blockcount;
+	got_endoff = got->br_startoff + got->br_blockcount;
+
+	ASSERT(rep->br_blockcount > 0);
+	ASSERT(got->br_startoff <= rep->br_startoff);
+	ASSERT(got_endoff >= rep_endoff);
+	ASSERT(!isnullstartblock(got->br_startblock));
+
+	if (got->br_startoff == rep->br_startoff)
+		state |= REPLACE_LEFT_SIDE;
+	if (got_endoff == rep_endoff)
+		state |= REPLACE_RIGHT_SIDE;
+
+	switch (state & (REPLACE_LEFT_SIDE | REPLACE_RIGHT_SIDE)) {
+	case REPLACE_LEFT_SIDE | REPLACE_RIGHT_SIDE:
+		/* Matches the whole extent.  Replace it. */
+		xfs_iext_update_extent(ip, state, icur, rep);
+		*old_startblock = got->br_startblock;
+		break;
+	case REPLACE_LEFT_SIDE:
+		/*
+		 * Move the left side of the extent up, then insert the
+		 * replacement mapping at the cursor position.
+		 */
+		new.br_startoff = got->br_startoff + rep->br_blockcount;
+		new.br_startblock = got->br_startblock + rep->br_blockcount;
+		new.br_blockcount = got->br_blockcount - rep->br_blockcount;
+		new.br_state = got->br_state;
+		xfs_iext_update_extent(ip, state, icur, &new);
+
+		xfs_iext_insert(ip, icur, rep, state);
+
+		*old_startblock = got->br_startblock;
+		break;
+	case REPLACE_RIGHT_SIDE:
+		/*
+		 * Trim the right side of the extent, then move forward and
+		 * add the replacement mapping.
+		 */
+		new.br_startoff = got->br_startoff;
+		new.br_startblock = got->br_startblock;
+		new.br_blockcount = got->br_blockcount - rep->br_blockcount;
+		new.br_state = got->br_state;
+		xfs_iext_update_extent(ip, state, icur, &new);
+
+		xfs_iext_next(ifp, icur);
+		xfs_iext_insert(ip, icur, rep, state);
+
+		*old_startblock = got->br_startblock +
+					(rep->br_startoff - got->br_startoff);
+		break;
+	case 0:
+		/*
+		 * Replace the middle of the extent by updating the left,
+		 * (re)inserting the right, and then adding the replacement.
+		 */
+		new.br_startoff = got->br_startoff;
+		new.br_startblock = got->br_startblock;
+		new.br_blockcount = rep->br_startoff - got->br_startoff;
+		new.br_state = got->br_state;
+		xfs_iext_update_extent(ip, state, icur, &new);
+
+		new.br_startoff = rep_endoff;
+		new.br_blockcount = got_endoff - rep_endoff;
+		new.br_startblock = got->br_startblock +
+					(rep_endoff - got->br_startoff);
+		xfs_iext_next(ifp, icur);
+		xfs_iext_insert(ip, icur, &new, state);
+
+		xfs_iext_insert(ip, icur, rep, state);
+
+		*old_startblock = got->br_startblock +
+					(rep->br_startoff - got->br_startoff);
+		break;
+	}
+}
+
+/*
+ * Replace an unwritten CoW staging extent with a new one that isn't as
+ * problematic.
+ */
+STATIC int
+xrep_cow_replace_one(
+	struct xrep_cow		*xc,
+	xfs_fileoff_t		startoff,
+	xfs_extlen_t		*blockcount)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	rep = {
+		.br_startoff	= startoff,
+		.br_state	= XFS_EXT_UNWRITTEN,
+	};
+	struct xfs_bmbt_irec	got;
+	struct xfs_scrub	*sc = xc->sc;
+	xfs_fileoff_t		endoff;
+	xfs_fsblock_t		old_startblock;
+	int			error;
+
+	/*
+	 * Put the existing CoW fork mapping in @got.  If @got ends before
+	 * @rep, truncate @rep so we only replace one extent mapping at a time.
+	 */
+	error = xrep_cow_find_mapping(xc, &icur, startoff, &got);
+	if (error)
+		return error;
+	endoff = min(startoff + *blockcount,
+		     got.br_startoff + got.br_blockcount);
+	rep.br_blockcount = endoff - startoff;
+
+	/*
+	 * Allocate a replacement extent.  If we don't fill all the blocks,
+	 * shorten the quantity that will be deleted in this step.
+	 */
+	error = xrep_cow_alloc(sc, &rep);
+	if (error)
+		return error;
+
+	/*
+	 * Replace the old mapping with the new one, and commit the metadata
+	 * changes made so far.
+	 */
+	xrep_cow_replace_mapping(sc->ip, &icur, &got, &rep, &old_startblock);
+	trace_xrep_cow_replace_one(sc->ip, &rep, old_startblock);
+
+	xfs_inode_set_cowblocks_tag(sc->ip);
+	error = xfs_defer_finish(&sc->tp);
+	if (error)
+		return error;
+
+	/* Note the old CoW staging extents; we'll reap them all later. */
+	error = xfsb_bitmap_set(&xc->old_cowfork_fsblocks, old_startblock,
+			rep.br_blockcount);
+	if (error)
+		return error;
+
+	*blockcount = rep.br_blockcount;
+	return 0;
+}
+
+/*
+ * Replace a bad part of an unwritten CoW staging extent with a freshly
+ * allocated unwritten staging extent.
+ */
+STATIC int
+xrep_cow_replace(
+	uint64_t		startoff,
+	uint64_t		blockcount,
+	void			*priv)
+{
+	struct xrep_cow		*xc = priv;
+	int			error = 0;
+
+	while (blockcount > 0) {
+		xfs_extlen_t	len = min_t(xfs_filblks_t, blockcount,
+					    XFS_MAX_BMBT_EXTLEN);
+
+		error = xrep_cow_replace_one(xc, startoff, &len);
+		if (error)
+			break;
+
+		blockcount -= len;
+		startoff += len;
+	}
+
+	return error;
+}
+
+/*
+ * Repair an inode's CoW fork.  The CoW fork is an in-core structure, so
+ * there's no btree to rebuild.  Instead, we replace any mappings that are
+ * cross-linked or lack ondisk CoW fork records in the refcount btree.
+ */
+int
+xrep_bmap_cow(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_cow		*xc;
+	struct xfs_iext_cursor	icur;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(sc->ip, XFS_COW_FORK);
+	int			error;
+
+	if (!xfs_has_rmapbt(sc->mp) || !xfs_has_reflink(sc->mp))
+		return -EOPNOTSUPP;
+
+	if (!ifp)
+		return 0;
+
+	/* realtime files aren't supported yet */
+	if (XFS_IS_REALTIME_INODE(sc->ip))
+		return -EOPNOTSUPP;
+
+	/*
+	 * If we're somehow not in extents format, then reinitialize it to
+	 * an empty extent mapping fork and exit.
+	 */
+	if (ifp->if_format != XFS_DINODE_FMT_EXTENTS) {
+		ifp->if_format = XFS_DINODE_FMT_EXTENTS;
+		ifp->if_nextents = 0;
+		return 0;
+	}
+
+	xc = kzalloc(sizeof(struct xrep_cow), XCHK_GFP_FLAGS);
+	if (!xc)
+		return -ENOMEM;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	xc->sc = sc;
+	xbitmap_init(&xc->bad_fileoffs);
+	xfsb_bitmap_init(&xc->old_cowfork_fsblocks);
+
+	for_each_xfs_iext(ifp, &icur, &xc->irec) {
+		if (xchk_should_terminate(sc, &error))
+			goto out_bitmap;
+
+		/*
+		 * delalloc reservations only exist incore, so there is no
+		 * ondisk metadata that we can examine.  Hence we leave them
+		 * alone.
+		 */
+		if (isnullstartblock(xc->irec.br_startblock))
+			continue;
+
+		/*
+		 * COW fork extents are only in the written state if writeback
+		 * is actively writing to disk.  We cannot restart the write
+		 * at a different disk address since we've already issued the
+		 * IO, so we leave these alone and hope for the best.
+		 */
+		if (xfs_bmap_is_written_extent(&xc->irec))
+			continue;
+
+		error = xrep_cow_find_bad(xc);
+		if (error)
+			goto out_bitmap;
+	}
+
+	/* Replace any bad unwritten mappings with fresh reservations. */
+	error = xbitmap_walk(&xc->bad_fileoffs, xrep_cow_replace, xc);
+	if (error)
+		goto out_bitmap;
+
+	/*
+	 * Reap as many of the old CoW blocks as we can.  They are owned ondisk
+	 * by the refcount btree, not the inode, so it is correct to treat them
+	 * like inode metadata.
+	 */
+	error = xrep_reap_fsblocks(sc, &xc->old_cowfork_fsblocks,
+			&XFS_RMAP_OINFO_COW);
+	if (error)
+		goto out_bitmap;
+
+out_bitmap:
+	xfsb_bitmap_destroy(&xc->old_cowfork_fsblocks);
+	xbitmap_destroy(&xc->bad_fileoffs);
+	kmem_free(xc);
+	return error;
+}
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 3da372bc6eaa..79d43b9448a3 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -20,6 +20,7 @@
 #include "xfs_ialloc_btree.h"
 #include "xfs_rmap.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
 #include "xfs_refcount_btree.h"
 #include "xfs_extent_busy.h"
 #include "xfs_ag.h"
@@ -313,6 +314,14 @@ xreap_agextent(
 		trace_xreap_dispose_unmap_extent(sc->sa.pag, agbno, *aglenp);
 
 		rs->force_roll = true;
+		if (rs->oinfo == &XFS_RMAP_OINFO_COW) {
+			/*
+			 * If we're unmapping CoW staging extents, remove the
+			 * records from the refcountbt as well.
+			 */
+			xfs_refcount_free_cow_extent(sc->tp, fsbno, *aglenp);
+			return 0;
+		}
 		return xfs_rmap_free(sc->tp, sc->sa.agf_bp, sc->sa.pag, agbno,
 				*aglenp, rs->oinfo);
 	}
@@ -331,6 +340,22 @@ xreap_agextent(
 		return 0;
 	}
 
+	/*
+	 * If we're getting rid of CoW staging extents, use deferred work items
+	 * to remove the refcountbt records (which removes the rmap records)
+	 * and free the extent.  We're not worried about the system going down
+	 * here because log recovery walks the refcount btree to clean out the
+	 * CoW staging extents.
+	 */
+	if (rs->oinfo == &XFS_RMAP_OINFO_COW) {
+		ASSERT(rs->resv == XFS_AG_RESV_NONE);
+
+		rs->force_roll = true;
+		xfs_refcount_free_cow_extent(sc->tp, fsbno, *aglenp);
+		__xfs_free_extent_later(sc->tp, fsbno, *aglenp, NULL, true);
+		return 0;
+	}
+
 	switch (rs->resv) {
 	case XFS_AG_RESV_AGFL:
 		ASSERT(*aglenp == 1);
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 88af16715a46..8f525d7c63c7 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -101,6 +101,7 @@ int xrep_refcountbt(struct xfs_scrub *sc);
 int xrep_inode(struct xfs_scrub *sc);
 int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
+int xrep_bmap_cow(struct xfs_scrub *sc);
 
 int xrep_reinit_pagf(struct xfs_scrub *sc);
 int xrep_reinit_pagi(struct xfs_scrub *sc);
@@ -162,6 +163,7 @@ xrep_setup_nothing(
 #define xrep_inode			xrep_notsupported
 #define xrep_bmap_data			xrep_notsupported
 #define xrep_bmap_attr			xrep_notsupported
+#define xrep_bmap_cow			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 61deec5a119a..1fed319b4f7a 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -299,7 +299,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode_bmap,
 		.scrub	= xchk_bmap_cow,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_bmap_cow,
 	},
 	[XFS_SCRUB_TYPE_DIR] = {	/* directory */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 531b15181c95..f0618ea849f0 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1552,6 +1552,87 @@ TRACE_EVENT(xrep_dinode_count_rmaps,
 		  __entry->block0)
 );
 
+TRACE_EVENT(xrep_cow_mark_file_range,
+	TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t startblock,
+		 xfs_fileoff_t startoff, xfs_filblks_t blockcount),
+	TP_ARGS(ip, startblock, startoff, blockcount),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fsblock_t, startblock)
+		__field(xfs_fileoff_t, startoff)
+		__field(xfs_filblks_t, blockcount)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->startoff = startoff;
+		__entry->startblock = startblock;
+		__entry->blockcount = blockcount;
+	),
+	TP_printk("dev %d:%d ino 0x%llx fileoff 0x%llx startblock 0x%llx fsbcount 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->startoff,
+		  __entry->startblock,
+		  __entry->blockcount)
+);
+
+TRACE_EVENT(xrep_cow_replace_one,
+	TP_PROTO(struct xfs_inode *ip, const struct xfs_bmbt_irec *irec,
+		 xfs_fsblock_t old_startblock),
+	TP_ARGS(ip, irec, old_startblock),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fsblock_t, startblock)
+		__field(xfs_fileoff_t, startoff)
+		__field(xfs_filblks_t, blockcount)
+		__field(xfs_exntst_t, state)
+		__field(xfs_fsblock_t, old_startblock)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->startoff = irec->br_startoff;
+		__entry->startblock = irec->br_startblock;
+		__entry->blockcount = irec->br_blockcount;
+		__entry->state = irec->br_state;
+		__entry->old_startblock = old_startblock;
+	),
+	TP_printk("dev %d:%d ino 0x%llx startoff 0x%llx startblock 0x%llx fsbcount 0x%llx state 0x%x old_startblock 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->startoff,
+		  __entry->startblock,
+		  __entry->blockcount,
+		  __entry->state,
+		  __entry->old_startblock)
+);
+
+TRACE_EVENT(xrep_cow_free_staging,
+	TP_PROTO(struct xfs_perag *pag, xfs_agblock_t agbno,
+		 xfs_extlen_t blockcount),
+	TP_ARGS(pag, agbno, blockcount),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, blockcount)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->agbno = agbno;
+		__entry->blockcount = blockcount;
+	),
+	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->blockcount)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
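An aside for reviewers skimming the helpers above: xrep_cow_trim_refcount()
and the rmap clamping in xrep_cow_mark_missing_staging_rmap() are the same
interval-clamp operation.  A standalone sketch of that operation with
illustrative names (it assumes the record overlaps the window, which the
range queries guarantee):

/* Clamp the record [*start, *start + *len) to [win_start, win_start + win_len). */
static inline void
clamp_to_window(
	unsigned long long	*start,
	unsigned long long	*len,
	unsigned long long	win_start,
	unsigned long long	win_len)
{
	if (*start < win_start) {
		unsigned long long	adj = win_start - *start;

		*len -= adj;
		*start += adj;
	}

	if (*start + *len > win_start + win_len)
		*len -= (*start + *len) - (win_start + win_len);
}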


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (8 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: online repair of realtime bitmaps Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck Darrick J. Wong
                   ` (67 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

XFS stores quota records and free space bitmap information in files.
Add the necessary infrastructure to enable repairing metadata inodes and
their forks, and then make it so that we can repair the file metadata
for the rtbitmap.  Repairing the bitmap contents (and the summary file)
is left for subsequent patchsets.

We also add the ability to repair the file metadata of the quota files.  As
part of these repairs, we also reinitialize the ondisk dquot records as
necessary to get the incore dquots working.  We can also correct
obviously bad dquot record attributes, but we leave checking the
resource usage counts for the next patchsets.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-quota

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-quota
---
 fs/xfs/Makefile                |    8 +
 fs/xfs/libxfs/xfs_bmap.c       |   39 ++++
 fs/xfs/libxfs/xfs_bmap.h       |    2 
 fs/xfs/scrub/bmap_repair.c     |   17 +-
 fs/xfs/scrub/quota.c           |   11 +
 fs/xfs/scrub/quota.h           |   11 +
 fs/xfs/scrub/quota_repair.c    |  405 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c          |  151 +++++++++++++++
 fs/xfs/scrub/repair.h          |   22 ++
 fs/xfs/scrub/rtbitmap.c        |   10 +
 fs/xfs/scrub/rtbitmap_repair.c |   56 ++++++
 fs/xfs/scrub/scrub.c           |    8 -
 fs/xfs/scrub/trace.c           |    1 
 fs/xfs/scrub/trace.h           |   28 +++
 fs/xfs/xfs_inode.c             |   24 --
 15 files changed, 761 insertions(+), 32 deletions(-)
 create mode 100644 fs/xfs/scrub/quota.h
 create mode 100644 fs/xfs/scrub/quota_repair.c
 create mode 100644 fs/xfs/scrub/rtbitmap_repair.c


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs: repair the inode core and forks of a metadata inode
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: online repair of realtime bitmaps Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: create a new inode fork block unmap helper Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/4] xfs: repair quotas Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a helper function to repair the core and forks of a metadata inode,
so that we can move on to the task of repairing higher-level metadata
that lives in an inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap_repair.c |   17 ++++-
 fs/xfs/scrub/repair.c      |  151 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h      |    2 +
 3 files changed, 166 insertions(+), 4 deletions(-)
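
The intended calling convention, for reviewers who want the one-paragraph
version: a metadata-file repair function fixes the inode and its forks with
the new helper first, and only then repairs the file contents.  A minimal
sketch with a hypothetical function name (the rtbitmap and quota patches
later in this series follow exactly this shape):

/* Hypothetical example caller; xrep_rtbitmap and xrep_quota are the real ones. */
int
xrep_some_metadata_file(
	struct xfs_scrub	*sc)
{
	int			error;

	/* Fix the inode core, data fork, and attr fork first... */
	error = xrep_metadata_inode_forks(sc);
	if (error)
		return error;

	/* ...then go repair whatever lives in the file itself. */
	return 0;
}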


diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c
index b9425f1a87d7..4638f3652b54 100644
--- a/fs/xfs/scrub/bmap_repair.c
+++ b/fs/xfs/scrub/bmap_repair.c
@@ -74,6 +74,9 @@ struct xrep_bmap {
 
 	/* Which fork are we fixing? */
 	int			whichfork;
+
+	/* Do we allow unwritten extents? */
+	bool			allow_unwritten;
 };
 
 /* Remember this reverse-mapping as a series of bmap records. */
@@ -213,6 +216,10 @@ xrep_bmap_walk_rmap(
 	    !(rec->rm_flags & XFS_RMAP_ATTR_FORK))
 		return 0;
 
+	/* Reject unwritten extents if we don't allow those. */
+	if ((rec->rm_flags & XFS_RMAP_UNWRITTEN) && !rb->allow_unwritten)
+		return -EFSCORRUPTED;
+
 	fsbno = XFS_AGB_TO_FSB(mp, cur->bc_ag.pag->pag_agno,
 			rec->rm_startblock);
 
@@ -705,10 +712,11 @@ xrep_bmap_check_inputs(
 }
 
 /* Repair an inode fork. */
-STATIC int
+int
 xrep_bmap(
 	struct xfs_scrub	*sc,
-	int			whichfork)
+	int			whichfork,
+	bool			allow_unwritten)
 {
 	struct xrep_bmap	*rb;
 	unsigned int		max_bmbt_recs;
@@ -726,6 +734,7 @@ xrep_bmap(
 		return -ENOMEM;
 	rb->sc = sc;
 	rb->whichfork = whichfork;
+	rb->allow_unwritten = allow_unwritten;
 
 	/* Set up enough storage to handle the max records for this fork. */
 	large_extcount = xfs_has_large_extent_counts(sc->mp);
@@ -764,7 +773,7 @@ int
 xrep_bmap_data(
 	struct xfs_scrub	*sc)
 {
-	return xrep_bmap(sc, XFS_DATA_FORK);
+	return xrep_bmap(sc, XFS_DATA_FORK, true);
 }
 
 /* Repair an inode's attr fork. */
@@ -772,5 +781,5 @@ int
 xrep_bmap_attr(
 	struct xfs_scrub	*sc)
 {
-	return xrep_bmap(sc, XFS_ATTR_FORK);
+	return xrep_bmap(sc, XFS_ATTR_FORK, false);
 }
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 3f3554d82eeb..2de438ddb8ac 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -29,6 +29,7 @@
 #include "xfs_defer.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
+#include "xfs_reflink.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -947,3 +948,153 @@ xrep_will_attempt(
 
 	return false;
 }
+
+/* Try to fix some part of a metadata inode by calling another scrubber. */
+STATIC int
+xrep_metadata_inode_subtype(
+	struct xfs_scrub	*sc,
+	unsigned int		scrub_type)
+{
+	__u32			smtype = sc->sm->sm_type;
+	__u32			smflags = sc->sm->sm_flags;
+	int			error;
+
+	/*
+	 * Let's see if the inode needs repair.  We're going to open-code calls
+	 * to the scrub and repair functions so that we can hang on to the
+	 * resources that we already acquired instead of using the standard
+	 * setup/teardown routines.
+	 */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	sc->sm->sm_type = scrub_type;
+
+	switch (scrub_type) {
+	case XFS_SCRUB_TYPE_INODE:
+		error = xchk_inode(sc);
+		break;
+	case XFS_SCRUB_TYPE_BMBTD:
+		error = xchk_bmap_data(sc);
+		break;
+	case XFS_SCRUB_TYPE_BMBTA:
+		error = xchk_bmap_attr(sc);
+		break;
+	default:
+		ASSERT(0);
+		error = -EFSCORRUPTED;
+	}
+	if (error)
+		goto out;
+
+	if (!xrep_will_attempt(sc))
+		goto out;
+
+	/*
+	 * Repair some part of the inode.  This will potentially join the inode
+	 * to the transaction.
+	 */
+	switch (scrub_type) {
+	case XFS_SCRUB_TYPE_INODE:
+		error = xrep_inode(sc);
+		break;
+	case XFS_SCRUB_TYPE_BMBTD:
+		error = xrep_bmap(sc, XFS_DATA_FORK, false);
+		break;
+	case XFS_SCRUB_TYPE_BMBTA:
+		error = xrep_bmap(sc, XFS_ATTR_FORK, false);
+		break;
+	}
+	if (error)
+		goto out;
+
+	/*
+	 * Finish all deferred intent items and then roll the transaction so
+	 * that the inode will not be joined to the transaction when we exit
+	 * the function.
+	 */
+	error = xfs_defer_finish(&sc->tp);
+	if (error)
+		goto out;
+	error = xfs_trans_roll(&sc->tp);
+	if (error)
+		goto out;
+
+	/*
+	 * Clear the corruption flags and re-check the metadata that we just
+	 * repaired.
+	 */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+
+	switch (scrub_type) {
+	case XFS_SCRUB_TYPE_INODE:
+		error = xchk_inode(sc);
+		break;
+	case XFS_SCRUB_TYPE_BMBTD:
+		error = xchk_bmap_data(sc);
+		break;
+	case XFS_SCRUB_TYPE_BMBTA:
+		error = xchk_bmap_attr(sc);
+		break;
+	}
+	if (error)
+		goto out;
+
+	/* If corruption persists, the repair has failed. */
+	if (xchk_needs_repair(sc->sm)) {
+		error = -EFSCORRUPTED;
+		goto out;
+	}
+out:
+	sc->sm->sm_type = smtype;
+	sc->sm->sm_flags = smflags;
+	return error;
+}
+
+/*
+ * Repair the ondisk forks of a metadata inode.  The caller must ensure that
+ * sc->ip points to the metadata inode and the ILOCK is held on that inode.
+ * The inode must not be joined to the transaction before the call, and will
+ * not be afterwards.
+ */
+int
+xrep_metadata_inode_forks(
+	struct xfs_scrub	*sc)
+{
+	bool			dirty = false;
+	int			error;
+
+	/* Repair the inode record and the data fork. */
+	error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_INODE);
+	if (error)
+		return error;
+
+	error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTD);
+	if (error)
+		return error;
+
+	/* Make sure the attr fork looks ok before we delete it. */
+	error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTA);
+	if (error)
+		return error;
+
+	/* Clear the reflink flag since metadata never shares. */
+	if (xfs_is_reflink_inode(sc->ip)) {
+		dirty = true;
+		xfs_trans_ijoin(sc->tp, sc->ip, 0);
+		error = xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If we modified the inode, roll the transaction but don't rejoin the
+	 * inode to the new transaction because xrep_bmap_data can do that.
+	 */
+	if (dirty) {
+		error = xfs_trans_roll(&sc->tp);
+		if (error)
+			return error;
+		dirty = false;
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 8f525d7c63c7..69cb6b38bc55 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -72,6 +72,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
 int xrep_ino_ensure_extent_count(struct xfs_scrub *sc, int whichfork,
 		xfs_extnum_t nextents);
 int xrep_reset_perag_resv(struct xfs_scrub *sc);
+int xrep_bmap(struct xfs_scrub *sc, int whichfork, bool allow_unwritten);
+int xrep_metadata_inode_forks(struct xfs_scrub *sc);
 
 /* Repair setup functions */
 int xrep_setup_ag_allocbt(struct xfs_scrub *sc);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs: create a new inode fork block unmap helper
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: online repair of realtime bitmaps Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/4] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/4] xfs: repair quotas Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new helper to unmap blocks from an inode's fork.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |   39 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h |    2 ++
 fs/xfs/xfs_inode.c       |   24 ++++--------------------
 3 files changed, 45 insertions(+), 20 deletions(-)
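
One note on the calling convention that is easy to miss from the prototype:
the helper finishes deferred work as it goes, which rolls the transaction, so
it takes a struct xfs_trans ** and the caller must keep using *tpp afterwards.
A minimal usage sketch, lifted from the xfs_itruncate_extents_flags()
conversion below:

	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));

	/* ip is already joined to the permanent transaction in tp. */
	error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
			XFS_MAX_FILEOFF);
	if (error)
		goto out;
	/* tp may have been rolled by now; do not cache the old pointer. */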


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index d689a262ce39..2f626ad1f4b4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6268,3 +6268,42 @@ xfs_bmap_validate_extent(
 	return xfs_bmap_validate_extent_raw(ip->i_mount,
 			XFS_IS_REALTIME_INODE(ip), whichfork, irec);
 }
+
+/*
+ * Used in xfs_itruncate_extents().  This is the maximum number of extents
+ * freed from a file in a single transaction.
+ */
+#define	XFS_ITRUNC_MAX_EXTENTS	2
+
+/*
+ * Unmap every extent in part of an inode's fork.  We don't do any higher level
+ * invalidation work at all.
+ */
+int
+xfs_bunmapi_range(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	uint32_t		flags,
+	xfs_fileoff_t		startoff,
+	xfs_fileoff_t		endoff)
+{
+	xfs_filblks_t		unmap_len = endoff - startoff + 1;
+	int			error = 0;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+
+	while (unmap_len > 0) {
+		ASSERT((*tpp)->t_firstblock == NULLFSBLOCK);
+		error = __xfs_bunmapi(*tpp, ip, startoff, &unmap_len, flags,
+				XFS_ITRUNC_MAX_EXTENTS);
+		if (error)
+			goto out;
+
+		/* free the just unmapped extents */
+		error = xfs_defer_finish(tpp);
+		if (error)
+			goto out;
+	}
+out:
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 903047c146c3..1201ee024c1f 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -266,6 +266,8 @@ int xfs_bmap_complain_bad_rec(struct xfs_inode *ip, int whichfork,
 int	xfs_bmapi_remap(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, xfs_fsblock_t startblock,
 		uint32_t flags);
+int	xfs_bunmapi_range(struct xfs_trans **tpp, struct xfs_inode *ip,
+		uint32_t flags, xfs_fileoff_t startoff, xfs_fileoff_t endoff);
 
 extern struct kmem_cache	*xfs_bmap_intent_cache;
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d354ea2b74f9..47197e4cdbe8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -40,12 +40,6 @@
 
 struct kmem_cache *xfs_inode_cache;
 
-/*
- * Used in xfs_itruncate_extents().  This is the maximum number of extents
- * freed from a file in a single transaction.
- */
-#define	XFS_ITRUNC_MAX_EXTENTS	2
-
 STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
 STATIC int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag,
 	struct xfs_inode *);
@@ -1333,7 +1327,6 @@ xfs_itruncate_extents_flags(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp = *tpp;
 	xfs_fileoff_t		first_unmap_block;
-	xfs_filblks_t		unmap_len;
 	int			error = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
@@ -1365,19 +1358,10 @@ xfs_itruncate_extents_flags(
 		return 0;
 	}
 
-	unmap_len = XFS_MAX_FILEOFF - first_unmap_block + 1;
-	while (unmap_len > 0) {
-		ASSERT(tp->t_firstblock == NULLFSBLOCK);
-		error = __xfs_bunmapi(tp, ip, first_unmap_block, &unmap_len,
-				flags, XFS_ITRUNC_MAX_EXTENTS);
-		if (error)
-			goto out;
-
-		/* free the just unmapped extents */
-		error = xfs_defer_finish(&tp);
-		if (error)
-			goto out;
-	}
+	error = xfs_bunmapi_range(&tp, ip, flags, first_unmap_block,
+			XFS_MAX_FILEOFF);
+	if (error)
+		goto out;
 
 	if (whichfork == XFS_DATA_FORK) {
 		/* Remove all pending CoW reservations. */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs: online repair of realtime bitmaps
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: create a new inode fork block unmap helper Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Rebuild the realtime bitmap from the realtime rmap btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                |    4 +++
 fs/xfs/scrub/repair.h          |   13 +++++++++
 fs/xfs/scrub/rtbitmap.c        |   10 ++++++-
 fs/xfs/scrub/rtbitmap_repair.c |   56 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c           |    2 +
 5 files changed, 83 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/rtbitmap_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b0a36ebb0a3b..74fcf2b4dc86 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -187,5 +187,9 @@ xfs-y				+= $(addprefix scrub/, \
 				   refcount_repair.o \
 				   repair.o \
 				   )
+
+xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
+				   rtbitmap_repair.o \
+				   )
 endif
 endif
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 69cb6b38bc55..74325131f3ca 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -80,6 +80,7 @@ int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
 
 struct xfs_imap;
 int xrep_setup_inode(struct xfs_scrub *sc, struct xfs_imap *imap);
+int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *resblks);
 
 void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa);
 int xrep_ag_init(struct xfs_scrub *sc, struct xfs_perag *pag,
@@ -105,6 +106,12 @@ int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_bmap_cow(struct xfs_scrub *sc);
 
+#ifdef CONFIG_XFS_RT
+int xrep_rtbitmap(struct xfs_scrub *sc);
+#else
+# define xrep_rtbitmap			xrep_notsupported
+#endif /* CONFIG_XFS_RT */
+
 int xrep_reinit_pagf(struct xfs_scrub *sc);
 int xrep_reinit_pagi(struct xfs_scrub *sc);
 
@@ -151,6 +158,11 @@ xrep_setup_nothing(
 
 #define xrep_setup_inode(sc, imap)	((void)0)
 
+static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
+{
+	return 0;
+}
+
 #define xrep_revalidate_allocbt		(NULL)
 #define xrep_revalidate_iallocbt	(NULL)
 
@@ -166,6 +178,7 @@ xrep_setup_nothing(
 #define xrep_bmap_data			xrep_notsupported
 #define xrep_bmap_attr			xrep_notsupported
 #define xrep_bmap_cow			xrep_notsupported
+#define xrep_rtbitmap			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index c22427012a11..1d84a9eed67c 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -16,15 +16,23 @@
 #include "xfs_bmap.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
+#include "scrub/repair.h"
 
 /* Set us up with the realtime metadata locked. */
 int
 xchk_setup_rtbitmap(
 	struct xfs_scrub	*sc)
 {
+	unsigned int		resblks = 0;
 	int			error;
 
-	error = xchk_trans_alloc(sc, 0);
+	if (xchk_could_repair(sc)) {
+		error = xrep_setup_rtbitmap(sc, &resblks);
+		if (error)
+			return error;
+	}
+
+	error = xchk_trans_alloc(sc, resblks);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
new file mode 100644
index 000000000000..c88c49b03e86
--- /dev/null
+++ b/fs/xfs/scrub/rtbitmap_repair.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_bit.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/xfile.h"
+
+/* Set up to repair the realtime bitmap file metadata. */
+int
+xrep_setup_rtbitmap(
+	struct xfs_scrub	*sc,
+	unsigned int		*resblks)
+{
+	struct xfs_mount	*mp = sc->mp;
+	unsigned long long	blocks = 0;
+
+	/*
+	 * Reserve enough blocks to write out a completely new bmbt for the
+	 * bitmap file.
+	 */
+	blocks = xfs_bmbt_calc_size(mp, mp->m_sb.sb_rbmblocks);
+	if (blocks > UINT_MAX)
+		return -EOPNOTSUPP;
+
+	*resblks += blocks;
+	return 0;
+}
+
+/* Repair the realtime bitmap file metadata. */
+int
+xrep_rtbitmap(
+	struct xfs_scrub	*sc)
+{
+	/*
+	 * The only thing we know how to fix right now is problems with the
+	 * inode or its fork data.
+	 */
+	return xrep_metadata_inode_forks(sc);
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 1fed319b4f7a..06da054bf9e4 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -330,7 +330,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.setup	= xchk_setup_rtbitmap,
 		.scrub	= xchk_rtbitmap,
 		.has	= xfs_has_realtime,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_rtbitmap,
 	},
 	[XFS_SCRUB_TYPE_RTSUM] = {	/* realtime summary */
 		.type	= ST_FS,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs: repair quotas
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 1/4] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Fix anything that causes the quota verifiers to fail.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile             |    4 
 fs/xfs/scrub/quota.c        |   11 +
 fs/xfs/scrub/quota.h        |   11 +
 fs/xfs/scrub/quota_repair.c |  405 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h       |    7 +
 fs/xfs/scrub/scrub.c        |    6 -
 fs/xfs/scrub/trace.c        |    1 
 fs/xfs/scrub/trace.h        |   28 +++
 8 files changed, 467 insertions(+), 6 deletions(-)
 create mode 100644 fs/xfs/scrub/quota.h
 create mode 100644 fs/xfs/scrub/quota_repair.c
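
For reviewers who haven't touched the dqiterate machinery in a while: the
second phase of this repair reuses xfs_qm_dqiterate(), and the callback has
the same shape as the scrub-side xchk_quota_item().  A do-nothing sketch of
that contract, with a hypothetical function name (xrep_quota_item below is
the real thing):

STATIC int
xrep_quota_example_item(
	struct xfs_dquot	*dqp,
	xfs_dqtype_t		dqtype,
	void			*priv)
{
	/*
	 * @priv is whatever was handed to xfs_qm_dqiterate(); inspect or fix
	 * @dqp here.  Returning a nonzero error stops the walk.
	 */
	return 0;
}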


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 74fcf2b4dc86..5f31a5ee1473 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -191,5 +191,9 @@ xfs-y				+= $(addprefix scrub/, \
 xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
 				   rtbitmap_repair.o \
 				   )
+
+xfs-$(CONFIG_XFS_QUOTA)		+= $(addprefix scrub/, \
+				   quota_repair.o \
+				   )
 endif
 endif
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 085ff234f6ba..714bd4c0753a 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -17,9 +17,10 @@
 #include "xfs_bmap.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
+#include "scrub/quota.h"
 
 /* Convert a scrub type code to a DQ flag, or return 0 if error. */
-static inline xfs_dqtype_t
+xfs_dqtype_t
 xchk_quota_to_dqtype(
 	struct xfs_scrub	*sc)
 {
@@ -226,7 +227,10 @@ xchk_quota(
 
 	dqtype = xchk_quota_to_dqtype(sc);
 
-	/* Look for problem extents. */
+	/*
+	 * Look for problem extents.  Leave the quota inode ILOCKd if we find
+	 * any.
+	 */
 	error = xchk_quota_data_fork(sc);
 	if (error)
 		goto out;
@@ -239,10 +243,11 @@ xchk_quota(
 	 * functions.
 	 */
 	xchk_iunlock(sc, sc->ilock_flags);
+
+	/* Now look for things that the quota verifiers won't complain about. */
 	sqi.sc = sc;
 	sqi.last_id = 0;
 	error = xfs_qm_dqiterate(mp, dqtype, xchk_quota_item, &sqi);
-	xchk_ilock(sc, XFS_ILOCK_EXCL);
 	if (error == -ECANCELED)
 		error = 0;
 	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK,
diff --git a/fs/xfs/scrub/quota.h b/fs/xfs/scrub/quota.h
new file mode 100644
index 000000000000..1f0dc0b500e3
--- /dev/null
+++ b/fs/xfs/scrub/quota.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_QUOTA_H__
+#define __XFS_SCRUB_QUOTA_H__
+
+xfs_dqtype_t xchk_quota_to_dqtype(struct xfs_scrub *sc);
+
+#endif /* __XFS_SCRUB_QUOTA_H__ */
diff --git a/fs/xfs/scrub/quota_repair.c b/fs/xfs/scrub/quota_repair.c
new file mode 100644
index 000000000000..a150719c2b90
--- /dev/null
+++ b/fs/xfs/scrub/quota_repair.c
@@ -0,0 +1,405 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_dquot.h"
+#include "xfs_dquot_item.h"
+#include "xfs_reflink.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/quota.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Quota Repair
+ * ============
+ *
+ * Quota repairs are fairly simplistic; we fix everything that the dquot
+ * verifiers complain about, cap any counters or limits that make no sense,
+ * and schedule a quotacheck if we had to fix anything.  We also repair any
+ * data fork extent records that don't apply to metadata files.
+ */
+
+struct xrep_quota_info {
+	struct xfs_scrub	*sc;
+	bool			need_quotacheck;
+};
+
+/* Repair the fields in an individual quota item. */
+STATIC int
+xrep_quota_item(
+	struct xfs_dquot	*dqp,
+	xfs_dqtype_t		dqtype,
+	void			*priv)
+{
+	struct xrep_quota_info	*rqi = priv;
+	struct xfs_scrub	*sc = rqi->sc;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_ino_t		fs_icount;
+	bool			dirty = false;
+	int			error = 0;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	/* Check the limits. */
+	if (dqp->q_blk.softlimit > dqp->q_blk.hardlimit) {
+		dqp->q_blk.softlimit = dqp->q_blk.hardlimit;
+		dirty = true;
+	}
+
+	if (dqp->q_ino.softlimit > dqp->q_ino.hardlimit) {
+		dqp->q_ino.softlimit = dqp->q_ino.hardlimit;
+		dirty = true;
+	}
+
+	if (dqp->q_rtb.softlimit > dqp->q_rtb.hardlimit) {
+		dqp->q_rtb.softlimit = dqp->q_rtb.hardlimit;
+		dirty = true;
+	}
+
+	/*
+	 * Check that usage doesn't exceed physical limits.  However, on
+	 * a reflink filesystem we're allowed to exceed physical space
+	 * if there are no quota limits.  We don't know what the real number
+	 * is, but we can make quotacheck find out for us.
+	 */
+	if (!xfs_has_reflink(mp) && dqp->q_blk.count > mp->m_sb.sb_dblocks) {
+		dqp->q_blk.reserved -= dqp->q_blk.count;
+		dqp->q_blk.reserved += mp->m_sb.sb_dblocks;
+		dqp->q_blk.count = mp->m_sb.sb_dblocks;
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	fs_icount = percpu_counter_sum(&mp->m_icount);
+	if (dqp->q_ino.count > fs_icount) {
+		dqp->q_ino.reserved -= dqp->q_ino.count;
+		dqp->q_ino.reserved += fs_icount;
+		dqp->q_ino.count = fs_icount;
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	if (dqp->q_rtb.count > mp->m_sb.sb_rblocks) {
+		dqp->q_rtb.reserved -= dqp->q_rtb.count;
+		dqp->q_rtb.reserved += mp->m_sb.sb_rblocks;
+		dqp->q_rtb.count = mp->m_sb.sb_rblocks;
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+
+	if (!dirty)
+		return 0;
+
+	trace_xrep_dquot_item(sc->mp, dqp->q_type, dqp->q_id);
+
+	dqp->q_flags |= XFS_DQFLAG_DIRTY;
+	xfs_trans_dqjoin(sc->tp, dqp);
+	if (dqp->q_id) {
+		xfs_qm_adjust_dqlimits(dqp);
+		xfs_qm_adjust_dqtimers(dqp);
+	}
+	xfs_trans_log_dquot(sc->tp, dqp);
+	error = xfs_trans_roll(&sc->tp);
+	xfs_dqlock(dqp);
+	return error;
+}
+
+/* Fix a quota timer so that we can pass the verifier. */
+STATIC void
+xrep_quota_fix_timer(
+	struct xfs_mount	*mp,
+	__be64			softlimit,
+	__be64			countnow,
+	__be32			*timer,
+	time64_t		timelimit)
+{
+	uint64_t		soft = be64_to_cpu(softlimit);
+	uint64_t		count = be64_to_cpu(countnow);
+	time64_t		new_timer;
+
+	if (!soft || count <= soft || *timer != 0)
+		return;
+
+	new_timer = xfs_dquot_set_timeout(mp,
+				ktime_get_real_seconds() + timelimit);
+	*timer = cpu_to_be32(new_timer);
+}
+
+/* Fix anything the verifiers complain about. */
+STATIC int
+xrep_quota_block(
+	struct xfs_scrub	*sc,
+	xfs_daddr_t		daddr,
+	xfs_dqtype_t		dqtype,
+	xfs_dqid_t		id)
+{
+	struct xfs_dqblk	*dqblk;
+	struct xfs_disk_dquot	*ddq;
+	struct xfs_quotainfo	*qi = sc->mp->m_quotainfo;
+	struct xfs_def_quota	*defq = xfs_get_defquota(qi, dqtype);
+	struct xfs_buf		*bp = NULL;
+	enum xfs_blft		buftype = 0;
+	int			i;
+	int			error;
+
+	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp, daddr,
+			qi->qi_dqchunklen, 0, &bp, &xfs_dquot_buf_ops);
+	switch (error) {
+	case -EFSBADCRC:
+	case -EFSCORRUPTED:
+		/* Failed verifier, retry read with no ops. */
+		error = xfs_trans_read_buf(sc->mp, sc->tp,
+				sc->mp->m_ddev_targp, daddr, qi->qi_dqchunklen,
+				0, &bp, NULL);
+		if (error)
+			return error;
+		break;
+	case 0:
+		dqblk = bp->b_addr;
+		ddq = &dqblk[0].dd_diskdq;
+
+		/*
+		 * If there's nothing that would impede a dqiterate, we're
+		 * done.
+		 */
+		if ((ddq->d_type & XFS_DQTYPE_REC_MASK) != dqtype ||
+		    id == be32_to_cpu(ddq->d_id)) {
+			xfs_trans_brelse(sc->tp, bp);
+			return 0;
+		}
+		break;
+	default:
+		return error;
+	}
+
+	/* Something's wrong with the block, fix the whole thing. */
+	dqblk = bp->b_addr;
+	bp->b_ops = &xfs_dquot_buf_ops;
+	for (i = 0; i < qi->qi_dqperchunk; i++, dqblk++) {
+		ddq = &dqblk->dd_diskdq;
+
+		trace_xrep_disk_dquot(sc->mp, dqtype, id + i);
+
+		ddq->d_magic = cpu_to_be16(XFS_DQUOT_MAGIC);
+		ddq->d_version = XFS_DQUOT_VERSION;
+		ddq->d_type = dqtype;
+		ddq->d_id = cpu_to_be32(id + i);
+
+		xrep_quota_fix_timer(sc->mp, ddq->d_blk_softlimit,
+				ddq->d_bcount, &ddq->d_btimer,
+				defq->blk.time);
+
+		xrep_quota_fix_timer(sc->mp, ddq->d_ino_softlimit,
+				ddq->d_icount, &ddq->d_itimer,
+				defq->ino.time);
+
+		xrep_quota_fix_timer(sc->mp, ddq->d_rtb_softlimit,
+				ddq->d_rtbcount, &ddq->d_rtbtimer,
+				defq->rtb.time);
+
+		/* We only support v5 filesystems so always set these. */
+		uuid_copy(&dqblk->dd_uuid, &sc->mp->m_sb.sb_meta_uuid);
+		xfs_update_cksum((char *)dqblk, sizeof(struct xfs_dqblk),
+				 XFS_DQUOT_CRC_OFF);
+		dqblk->dd_lsn = 0;
+	}
+	switch (dqtype) {
+	case XFS_DQTYPE_USER:
+		buftype = XFS_BLFT_UDQUOT_BUF;
+		break;
+	case XFS_DQTYPE_GROUP:
+		buftype = XFS_BLFT_GDQUOT_BUF;
+		break;
+	case XFS_DQTYPE_PROJ:
+		buftype = XFS_BLFT_PDQUOT_BUF;
+		break;
+	}
+	xfs_trans_buf_set_type(sc->tp, bp, buftype);
+	xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1);
+	return xrep_roll_trans(sc);
+}
+
+/*
+ * Repair a quota file's data fork.  The function returns with the inode
+ * joined.
+ */
+STATIC int
+xrep_quota_data_fork(
+	struct xfs_scrub	*sc,
+	xfs_dqtype_t		dqtype)
+{
+	struct xfs_bmbt_irec	irec = { 0 };
+	struct xfs_iext_cursor	icur;
+	struct xfs_quotainfo	*qi = sc->mp->m_quotainfo;
+	struct xfs_ifork	*ifp;
+	xfs_fileoff_t		max_dqid_off;
+	xfs_fileoff_t		off;
+	xfs_fsblock_t		fsbno;
+	bool			truncate = false;
+	int			error = 0;
+
+	error = xrep_metadata_inode_forks(sc);
+	if (error)
+		goto out;
+
+	/* Check for data fork problems that apply only to quota files. */
+	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
+	ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK);
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		if (isnullstartblock(irec.br_startblock)) {
+			error = -EFSCORRUPTED;
+			goto out;
+		}
+
+		if (irec.br_startoff > max_dqid_off ||
+		    irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) {
+			truncate = true;
+			break;
+		}
+
+		/* Convert unwritten extents to real ones. */
+		if (irec.br_state == XFS_EXT_UNWRITTEN) {
+			struct xfs_bmbt_irec	nrec;
+			int			nmap = 1;
+
+			xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+			error = xfs_bmapi_write(sc->tp, sc->ip,
+					irec.br_startoff, irec.br_blockcount,
+					XFS_BMAPI_CONVERT, 0, &nrec, &nmap);
+			if (error)
+				goto out;
+			ASSERT(nmap == 1);
+			ASSERT(nrec.br_startoff == irec.br_startoff);
+			ASSERT(nrec.br_blockcount == irec.br_blockcount);
+
+			error = xfs_defer_finish(&sc->tp);
+			if (error)
+				goto out;
+		}
+	}
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	if (truncate) {
+		/* Erase everything after the block containing the max dquot */
+		error = xfs_bunmapi_range(&sc->tp, sc->ip, 0,
+				max_dqid_off * sc->mp->m_sb.sb_blocksize,
+				XFS_MAX_FILEOFF);
+		if (error)
+			goto out;
+
+		/* Remove all CoW reservations. */
+		error = xfs_reflink_cancel_cow_blocks(sc->ip, &sc->tp, 0,
+				XFS_MAX_FILEOFF, true);
+		if (error)
+			goto out;
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+
+		/*
+		 * Always re-log the inode so that our permanent transaction
+		 * can keep on rolling it forward in the log.
+		 */
+		xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	}
+
+	/* Now go fix anything that fails the verifiers. */
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		for (fsbno = irec.br_startblock, off = irec.br_startoff;
+		     fsbno < irec.br_startblock + irec.br_blockcount;
+		     fsbno += XFS_DQUOT_CLUSTER_SIZE_FSB,
+				off += XFS_DQUOT_CLUSTER_SIZE_FSB) {
+			error = xrep_quota_block(sc,
+					XFS_FSB_TO_DADDR(sc->mp, fsbno),
+					dqtype, off * qi->qi_dqperchunk);
+			if (error)
+				goto out;
+		}
+	}
+
+out:
+	return error;
+}
+
+/*
+ * Go fix anything in the quota items that we could have been mad about.  Now
+ * that we've checked the quota inode data fork we have to drop ILOCK_EXCL to
+ * use the regular dquot functions.
+ */
+STATIC int
+xrep_quota_problems(
+	struct xfs_scrub	*sc,
+	xfs_dqtype_t		dqtype)
+{
+	struct xrep_quota_info	rqi;
+	int			error;
+
+	rqi.sc = sc;
+	rqi.need_quotacheck = false;
+	error = xfs_qm_dqiterate(sc->mp, dqtype, xrep_quota_item, &rqi);
+	if (error)
+		return error;
+
+	/* Make a quotacheck happen. */
+	if (rqi.need_quotacheck)
+		xrep_force_quotacheck(sc, dqtype);
+	return 0;
+}
+
+/* Repair all of a quota type's items. */
+int
+xrep_quota(
+	struct xfs_scrub	*sc)
+{
+	xfs_dqtype_t		dqtype;
+	int			error;
+
+	dqtype = xchk_quota_to_dqtype(sc);
+
+	/*
+	 * Re-take the ILOCK so that we can fix any problems that we found
+	 * with the data fork mappings, or with the dquot bufs themselves.
+	 */
+	if (!(sc->ilock_flags & XFS_ILOCK_EXCL))
+		xchk_ilock(sc, XFS_ILOCK_EXCL);
+	error = xrep_quota_data_fork(sc, dqtype);
+	if (error)
+		goto out;
+
+	/*
+	 * Roll the transaction to unjoin the quota inode from the transaction
+	 * so that we can unlock the quota inode; we play only with dquots from
+	 * now on.
+	 */
+	error = xfs_trans_roll(&sc->tp);
+	if (error)
+		goto out;
+	xchk_iunlock(sc, sc->ilock_flags);
+
+	/* Fix anything the dquot verifiers don't complain about. */
+	error = xrep_quota_problems(sc, dqtype);
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 74325131f3ca..16047fc42696 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -112,6 +112,12 @@ int xrep_rtbitmap(struct xfs_scrub *sc);
 # define xrep_rtbitmap			xrep_notsupported
 #endif /* CONFIG_XFS_RT */
 
+#ifdef CONFIG_XFS_QUOTA
+int xrep_quota(struct xfs_scrub *sc);
+#else
+# define xrep_quota			xrep_notsupported
+#endif /* CONFIG_XFS_QUOTA */
+
 int xrep_reinit_pagf(struct xfs_scrub *sc);
 int xrep_reinit_pagi(struct xfs_scrub *sc);
 
@@ -179,6 +185,7 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 #define xrep_bmap_attr			xrep_notsupported
 #define xrep_bmap_cow			xrep_notsupported
 #define xrep_rtbitmap			xrep_notsupported
+#define xrep_quota			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 06da054bf9e4..60875c4ad5d9 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -343,19 +343,19 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_GQUOTA] = {	/* group quota */
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_PQUOTA] = {	/* project quota */
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_FSCOUNTERS] = {	/* fs summary counters */
 		.type	= ST_FS,
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 4a0385c97ea6..6e3395d22824 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -13,6 +13,7 @@
 #include "xfs_inode.h"
 #include "xfs_btree.h"
 #include "xfs_ag.h"
+#include "xfs_quota_defs.h"
 #include "scrub/scrub.h"
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index f0618ea849f0..4978548dfbff 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1633,6 +1633,34 @@ TRACE_EVENT(xrep_cow_free_staging,
 		  __entry->blockcount)
 );
 
+#ifdef CONFIG_XFS_QUOTA
+DECLARE_EVENT_CLASS(xrep_dquot_class,
+	TP_PROTO(struct xfs_mount *mp, uint8_t type, uint32_t id),
+	TP_ARGS(mp, type, id),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint8_t, type)
+		__field(uint32_t, id)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->id = id;
+		__entry->type = type;
+	),
+	TP_printk("dev %d:%d type %s id 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->type, "|", XFS_DQTYPE_STRINGS),
+		  __entry->id)
+);
+
+#define DEFINE_XREP_DQUOT_EVENT(name) \
+DEFINE_EVENT(xrep_dquot_class, name, \
+	TP_PROTO(struct xfs_mount *mp, uint8_t type, uint32_t id), \
+	TP_ARGS(mp, type, id))
+DEFINE_XREP_DQUOT_EVENT(xrep_dquot_item);
+DEFINE_XREP_DQUOT_EVENT(xrep_disk_dquot);
+#endif /* CONFIG_XFS_QUOTA */
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (9 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: allow scrub to hook metadata updates in other writers Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of quota counters Darrick J. Wong
                   ` (66 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

The design document discusses the need for a specialized inode scan
cursor to manage walking every file on a live filesystem to build
replacement metadata objects while receiving updates about the files
already scanned.  This series adds two pieces of infrastructure -- the
scan cursor, and live hooks to deliver information about updates going
on in other parts of the filesystem.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan
---
 fs/xfs/Kconfig       |   37 ++++
 fs/xfs/Makefile      |    6 +
 fs/xfs/scrub/iscan.c |  478 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/iscan.h |   62 ++++++
 fs/xfs/scrub/trace.c |    1 
 fs/xfs/scrub/trace.h |   74 ++++++++
 fs/xfs/xfs_hooks.c   |   94 ++++++++++
 fs/xfs/xfs_hooks.h   |   72 ++++++++
 fs/xfs/xfs_iwalk.c   |   13 -
 fs/xfs/xfs_linux.h   |    1 
 10 files changed, 826 insertions(+), 12 deletions(-)
 create mode 100644 fs/xfs/scrub/iscan.c
 create mode 100644 fs/xfs/scrub/iscan.h
 create mode 100644 fs/xfs/xfs_hooks.c
 create mode 100644 fs/xfs/xfs_hooks.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs: speed up xfs_iwalk_adjust_start a little bit
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: allow scrub to hook metadata updates in other writers Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: implement live inode scan for scrub Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/4] xfs: allow blocking notifier chains with filesystem hooks Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Replace the open-coded loop that recomputes freecount with a single call
to a bit weight function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iwalk.c |   13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 594ccadb729f..54a262b33244 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -22,6 +22,7 @@
 #include "xfs_trans.h"
 #include "xfs_pwork.h"
 #include "xfs_ag.h"
+#include "xfs_bit.h"
 
 /*
  * Walking Inodes in the Filesystem
@@ -131,21 +132,11 @@ xfs_iwalk_adjust_start(
 	struct xfs_inobt_rec_incore	*irec)	/* btree record */
 {
 	int				idx;	/* index into inode chunk */
-	int				i;
 
 	idx = agino - irec->ir_startino;
 
-	/*
-	 * We got a right chunk with some left inodes allocated at it.  Grab
-	 * the chunk record.  Mark all the uninteresting inodes free because
-	 * they're before our start point.
-	 */
-	for (i = 0; i < idx; i++) {
-		if (XFS_INOBT_MASK(i) & ~irec->ir_free)
-			irec->ir_freecount++;
-	}
-
 	irec->ir_free |= xfs_inobt_maskn(0, idx);
+	irec->ir_freecount = hweight64(irec->ir_free);
 }
 
 /* Allocate memory for a walk. */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs: implement live inode scan for scrub
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: allow scrub to hook metadata updates in other writers Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/4] xfs: speed up xfs_iwalk_adjust_start a little bit Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/4] xfs: allow blocking notifier chains with filesystem hooks Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

This patch implements a live file scanner for online fsck functions that
require the ability to walk a filesystem to gather metadata records and
stay informed about metadata changes to files that have already been
visited.

The iscan structure consists of two inode number cursors: one to track
which inode we want to visit next, and a second one to track which
inodes have already been visited.  This second cursor is key to
capturing live updates to files previously scanned while the main thread
continues scanning -- any inode greater than this value hasn't been
scanned and can go on its way; any other update must be incorporated
into the collected data.  It is critical for the scanning thread to hold
exclusive access to the inode until after marking the inode visited.

This new code is split out as a separate patch from its initial user for
the sake of enabling the author to move patches around his tree with
ease.  The intended usage model for this code is roughly:

	xchk_iscan_start(iscan, 0, 0);
	while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
		xfs_ilock(ip, ...);
		/* capture inode metadata */
		xchk_iscan_mark_visited(iscan, ip);
		xfs_iunlock(ip, ...);

		xfs_irele(ip);
	}
	xchk_iscan_stop(iscan);
	if (error)
		return error;

Hook functions for live updates can then do:

	if (xchk_iscan_want_live_update(...))
		/* update the captured inode metadata */
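
A slightly fuller sketch of such a hook function, using the hook types
added elsewhere in this series; the xchk_foo structure, its fields, and
the update payload are made up purely for illustration:

	static int
	xchk_foo_live_update_hook(
		struct xfs_hook		*hook,
		unsigned long		action,
		void			*data)
	{
		struct foo_update	*p = data;
		struct xchk_foo		*xfoo;

		xfoo = container_of(hook, struct xchk_foo, hook);

		if (xchk_iscan_want_live_update(&xfoo->iscan, p->ino)) {
			/*
			 * Fold @p into the captured metadata here, under
			 * whatever lock protects xfoo's observations.
			 */
		}

		return NOTIFY_DONE;
	}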

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile      |    5 -
 fs/xfs/scrub/iscan.c |  478 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/iscan.h |   62 ++++++
 fs/xfs/scrub/trace.c |    1 
 fs/xfs/scrub/trace.h |   74 ++++++++
 5 files changed, 619 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/iscan.c
 create mode 100644 fs/xfs/scrub/iscan.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5f31a5ee1473..a0321f26f06d 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -171,7 +171,10 @@ xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
 				   rtsummary.o \
 				   )
 
-xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
+xfs-$(CONFIG_XFS_QUOTA)		+= $(addprefix scrub/, \
+				   iscan.o \
+				   quota.o \
+				   )
 
 # online repair
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
diff --git a/fs/xfs/scrub/iscan.c b/fs/xfs/scrub/iscan.c
new file mode 100644
index 000000000000..e3db6a64338b
--- /dev/null
+++ b/fs/xfs/scrub/iscan.c
@@ -0,0 +1,478 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_ag.h"
+#include "xfs_error.h"
+#include "xfs_bit.h"
+#include "xfs_icache.h"
+#include "scrub/scrub.h"
+#include "scrub/iscan.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/*
+ * Live File Scan
+ * ==============
+ *
+ * Live file scans walk every inode in a live filesystem.  This is more or
+ * less like a regular iwalk, except that when we're advancing the scan cursor,
+ * we must ensure that inodes cannot be added or deleted anywhere between the
+ * old cursor value and the new cursor value.  If we're advancing the cursor
+ * by one inode, the caller must hold that inode; if we're finding the next
+ * inode to scan, we must grab the AGI and hold it until we've updated the
+ * scan cursor.
+ *
+ * Callers are expected to use this code to scan all files in the filesystem to
+ * construct a new metadata index of some kind.  The scan races against other
+ * live updates, which means there must be a provision to update the new index
+ * when updates are made to inodes that have already been scanned.  The iscan
+ * lock can be used in live update hook code to stop the scan and protect this
+ * data structure.
+ *
+ * To keep the new index up to date with other metadata updates being made to
+ * the live filesystem, it is assumed that the caller will add hooks as needed
+ * to be notified when a metadata update occurs.  The inode scanner must tell
+ * the hook code when an inode has been visited with xchk_iscan_mark_visited.
+ * Hook functions can use xchk_iscan_want_live_update to decide if the
+ * scanner's observations must be updated.
+ */
+
+/*
+ * Set the bits in @irec's free mask that correspond to the inodes before
+ * @agino so that we skip them.  This is how we restart an inode walk that was
+ * interrupted in the middle of an inode record.
+ */
+STATIC void
+xchk_iscan_adjust_start(
+	xfs_agino_t			agino,	/* starting inode of chunk */
+	struct xfs_inobt_rec_incore	*irec)	/* btree record */
+{
+	int				idx;	/* index into inode chunk */
+
+	idx = agino - irec->ir_startino;
+
+	irec->ir_free |= xfs_inobt_maskn(0, idx);
+	irec->ir_freecount = hweight64(irec->ir_free);
+}
+
+/*
+ * Set *cursor to the next allocated inode after whatever it's set to now.
+ * If there are no more inodes in this AG, cursor is set to NULLAGINO.
+ */
+STATIC int
+xchk_iscan_find_next(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agi_bp,
+	struct xfs_perag	*pag,
+	xfs_agino_t		*cursor)
+{
+	struct xfs_inobt_rec_incore	rec;
+	struct xfs_btree_cur	*cur;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_trans	*tp = sc->tp;
+	xfs_agnumber_t		agno = pag->pag_agno;
+	xfs_agino_t		lastino = NULLAGINO;
+	xfs_agino_t		first, last;
+	xfs_agino_t		agino = *cursor;
+	int			has_rec;
+	int			error;
+
+	/* If the cursor is beyond the end of this AG, move to the next one. */
+	xfs_agino_range(mp, agno, &first, &last);
+	if (agino > last) {
+		*cursor = NULLAGINO;
+		return 0;
+	}
+
+	/*
+	 * Look up the inode chunk for the current cursor position.  If there
+	 * is no chunk here, we want the next one.
+	 */
+	cur = xfs_inobt_init_cursor(mp, tp, agi_bp, pag, XFS_BTNUM_INO);
+	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &has_rec);
+	if (!error && !has_rec)
+		error = xfs_btree_increment(cur, 0, &has_rec);
+	for (; !error; error = xfs_btree_increment(cur, 0, &has_rec)) {
+		/*
+		 * If we've run out of inobt records in this AG, move the
+		 * cursor on to the next AG and exit.  The caller can try
+		 * again with the next AG.
+		 */
+		if (!has_rec) {
+			*cursor = NULLAGINO;
+			break;
+		}
+
+		error = xfs_inobt_get_rec(cur, &rec, &has_rec);
+		if (error)
+			break;
+		if (!has_rec) {
+			error = -EFSCORRUPTED;
+			break;
+		}
+
+		/* Make sure that we always move forward. */
+		if (lastino != NULLAGINO &&
+		    XFS_IS_CORRUPT(mp, lastino >= rec.ir_startino)) {
+			error = -EFSCORRUPTED;
+			break;
+		}
+		lastino = rec.ir_startino + XFS_INODES_PER_CHUNK - 1;
+
+		/*
+		 * If this record only covers inodes that come before the
+		 * cursor, advance to the next record.
+		 */
+		if (rec.ir_startino + XFS_INODES_PER_CHUNK <= agino)
+			continue;
+
+		/*
+		 * If the incoming lookup put us in the middle of an inobt
+		 * record, mark it and the previous inodes "free" so that the
+		 * search for allocated inodes will start at the cursor.  Use
+		 * funny math to avoid overflowing the bit shift.
+		 */
+		if (agino >= rec.ir_startino)
+			xchk_iscan_adjust_start(agino + 1, &rec);
+
+		/*
+		 * If there are allocated inodes in this chunk, find them,
+		 * and update the cursor.
+		 */
+		if (rec.ir_freecount < XFS_INODES_PER_CHUNK) {
+			int	next = xfs_lowbit64(~rec.ir_free);
+
+			*cursor = rec.ir_startino + next;
+			break;
+		}
+	}
+
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/*
+ * Prepare to return agno/agino to the iscan caller by moving the lastino
+ * cursor to the previous inode.  Do this while we still hold the AGI so that
+ * no other threads can create or delete inodes in this AG.
+ */
+static inline void
+xchk_iscan_move_cursor(
+	struct xfs_scrub	*sc,
+	struct xchk_iscan	*iscan,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino)
+{
+	struct xfs_mount	*mp = sc->mp;
+
+	mutex_lock(&iscan->lock);
+	iscan->cursor_ino = XFS_AGINO_TO_INO(mp, agno, agino);
+	iscan->__visited_ino = iscan->cursor_ino - 1;
+	trace_xchk_iscan_move_cursor(mp, iscan);
+	mutex_unlock(&iscan->lock);
+}
+
+/*
+ * Prepare to tell the iscan caller that the scan has finished.  Set both
+ * cursors to NULLFSINO so that, from this point on, every live update will be
+ * applied to the data that the scan has collected.
+ */
+static inline void
+xchk_iscan_finish_scan(
+	struct xfs_scrub	*sc,
+	struct xchk_iscan	*iscan)
+{
+	struct xfs_mount	*mp = sc->mp;
+
+	mutex_lock(&iscan->lock);
+	iscan->cursor_ino = NULLFSINO;
+
+	/* All live updates will be applied from now on */
+	iscan->__visited_ino = NULLFSINO;
+
+	trace_xchk_iscan_move_cursor(mp, iscan);
+	mutex_unlock(&iscan->lock);
+}
+
+/*
+ * Advance ino to the next inode that the inobt thinks is allocated, being
+ * careful to jump to the next AG if we've reached the right end of this AG's
+ * inode btree.  Advancing ino effectively means that we've pushed the inode
+ * scan forward, so set the iscan cursor to (ino - 1) so that our live update
+ * predicates will track inode allocations in that part of the inode number
+ * key space once we release the AGI buffer.
+ *
+ * Returns 1 if there's a new inode to examine, 0 if we've run out of inodes,
+ * -ECANCELED if the live scan aborted, or the usual negative errno.
+ */
+STATIC int
+xchk_iscan_advance(
+	struct xfs_scrub	*sc,
+	struct xchk_iscan	*iscan,
+	struct xfs_buf		**agi_bpp)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agi_bp;
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	xfs_agino_t		agino;
+	int			ret;
+
+	ASSERT(iscan->cursor_ino >= iscan->__visited_ino);
+
+	do {
+		agno = XFS_INO_TO_AGNO(mp, iscan->cursor_ino);
+		pag = xfs_perag_get(mp, agno);
+		if (!pag) {
+			xchk_iscan_finish_scan(sc, iscan);
+			return 0;
+		}
+
+		ret = xfs_ialloc_read_agi(pag, sc->tp, &agi_bp);
+		if (ret)
+			goto out_pag;
+
+		agino = XFS_INO_TO_AGINO(mp, iscan->cursor_ino);
+		ret = xchk_iscan_find_next(sc, agi_bp, pag, &agino);
+		if (ret)
+			goto out_buf;
+
+		if (agino != NULLAGINO)
+			break;
+
+		xchk_iscan_move_cursor(sc, iscan, agno + 1, 0);
+		xfs_trans_brelse(sc->tp, agi_bp);
+		xfs_perag_put(pag);
+
+		if (xchk_iscan_aborted(iscan))
+			return -ECANCELED;
+	} while (1);
+
+	xchk_iscan_move_cursor(sc, iscan, agno, agino);
+	*agi_bpp = agi_bp;
+	xfs_perag_put(pag);
+	return 1;
+
+out_buf:
+	xfs_trans_brelse(sc->tp, agi_bp);
+out_pag:
+	xfs_perag_put(pag);
+	return ret;
+}
+
+/*
+ * Grabbing the inode failed, so we need to back up the scan and ask the caller
+ * to try to _advance the scan again.  Returns -EBUSY if we've run out of retry
+ * opportunities, -ECANCELED if the process has a fatal signal pending, or
+ * -EAGAIN if we should try again.
+ */
+STATIC int
+xchk_iscan_iget_retry(
+	struct xfs_mount	*mp,
+	struct xchk_iscan	*iscan,
+	bool			wait)
+{
+	ASSERT(iscan->cursor_ino == iscan->__visited_ino + 1);
+
+	if (!iscan->iget_timeout ||
+	    time_is_before_jiffies(iscan->__iget_deadline))
+		return -EBUSY;
+
+	if (wait) {
+		unsigned long	relax;
+
+		/*
+		 * Sleep for a period of time to let the rest of the system
+		 * catch up.  If we return early, someone sent a kill signal to
+		 * the calling process.
+		 */
+		relax = msecs_to_jiffies(iscan->iget_retry_delay);
+		trace_xchk_iscan_iget_retry_wait(mp, iscan);
+
+		if (schedule_timeout_killable(relax) ||
+		    xchk_iscan_aborted(iscan))
+			return -ECANCELED;
+	}
+
+	iscan->cursor_ino--;
+	return -EAGAIN;
+}
+
+/*
+ * Grab an inode as part of an inode scan.  While scanning this inode, the
+ * caller must ensure that no other threads can modify the inode until a call
+ * to xchk_iscan_visit succeeds.
+ *
+ * Returns 0 and an incore inode; -EAGAIN if the caller should call
+ * xchk_iscan_advance again; -EBUSY if we couldn't grab an inode; -ECANCELED if
+ * there's a fatal signal pending; or some other negative errno.
+ */
+STATIC int
+xchk_iscan_iget(
+	struct xfs_scrub	*sc,
+	struct xchk_iscan	*iscan,
+	struct xfs_buf		*agi_bp,
+	struct xfs_inode	**ipp)
+{
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	error = xfs_iget(sc->mp, sc->tp, iscan->cursor_ino, XFS_IGET_NORETRY, 0,
+			ipp);
+	xfs_trans_brelse(sc->tp, agi_bp);
+
+	trace_xchk_iscan_iget(mp, iscan, error);
+
+	if (error == -ENOENT || error == -EAGAIN) {
+		/*
+		 * It's possible that this inode has lost all of its links but
+		 * hasn't yet been inactivated.  If we don't have a transaction
+		 * or it's not writable, flush the inodegc workers and wait.
+		 */
+		xfs_inodegc_flush(mp);
+		return xchk_iscan_iget_retry(mp, iscan, true);
+	}
+
+	if (error == -EINVAL) {
+		/*
+		 * We thought the inode was allocated, but the inode btree
+		 * lookup failed, which means that it was freed since the last
+		 * time we advanced the cursor.  Back up and try again.  This
+		 * should never happen since we still hold the AGI buffer from the
+		 * inobt check, but we need to be careful about infinite loops.
+		 */
+		return xchk_iscan_iget_retry(mp, iscan, false);
+	}
+
+	return error;
+}
+
+/*
+ * Advance the inode scan cursor to the next allocated inode and return the
+ * incore inode structure associated with it.
+ *
+ * Returns 1 if there's a new inode to examine, 0 if we've run out of inodes,
+ * -ECANCELED if the live scan aborted, -EBUSY if the incore inode could not be
+ * grabbed, or the usual negative errno.
+ *
+ * If the function returns -EBUSY and the caller can handle skipping an inode,
+ * it may call this function again to continue the scan with the next allocated
+ * inode.
+ */
+int
+xchk_iscan_iter(
+	struct xfs_scrub	*sc,
+	struct xchk_iscan	*iscan,
+	struct xfs_inode	**ipp)
+{
+	int			ret;
+
+	if (iscan->iget_timeout)
+		iscan->__iget_deadline = jiffies +
+					 msecs_to_jiffies(iscan->iget_timeout);
+
+	do {
+		struct xfs_buf	*agi_bp = NULL;
+
+		ret = xchk_iscan_advance(sc, iscan, &agi_bp);
+		if (ret != 1)
+			return ret;
+
+		if (xchk_iscan_aborted(iscan)) {
+			xfs_trans_brelse(sc->tp, agi_bp);
+			ret = -ECANCELED;
+			break;
+		}
+
+		ret = xchk_iscan_iget(sc, iscan, agi_bp, ipp);
+	} while (ret == -EAGAIN);
+
+	if (!ret)
+		return 1;
+
+	return ret;
+}
+
+
+/* Release inode scan resources. */
+void
+xchk_iscan_finish(
+	struct xchk_iscan	*iscan)
+{
+	mutex_destroy(&iscan->lock);
+	iscan->cursor_ino = NULLFSINO;
+	iscan->__visited_ino = NULLFSINO;
+}
+
+/*
+ * Set ourselves up to start an inode scan.  If the @iget_timeout and
+ * @iget_retry_delay parameters are set, the scan will try to iget each inode
+ * for @iget_timeout milliseconds.  If an iget call indicates that the inode is
+ * waiting to be inactivated, the CPU will relax for @iget_retry_delay
+ * milliseconds after pushing the inactivation workers.
+ */
+void
+xchk_iscan_start(
+	struct xchk_iscan	*iscan,
+	unsigned int		iget_timeout,
+	unsigned int		iget_retry_delay)
+{
+	clear_bit(XCHK_ISCAN_OPSTATE_ABORTED, &iscan->__opstate);
+	iscan->iget_timeout = iget_timeout;
+	iscan->iget_retry_delay = iget_retry_delay;
+	iscan->__visited_ino = 0;
+	iscan->cursor_ino = 0;
+	mutex_init(&iscan->lock);
+}
+
+/*
+ * Mark this inode as having been visited.  Callers must hold a sufficiently
+ * exclusive lock on the inode to prevent concurrent modifications.
+ */
+void
+xchk_iscan_mark_visited(
+	struct xchk_iscan	*iscan,
+	struct xfs_inode	*ip)
+{
+	mutex_lock(&iscan->lock);
+	iscan->__visited_ino = ip->i_ino;
+	trace_xchk_iscan_visit(ip->i_mount, iscan);
+	mutex_unlock(&iscan->lock);
+}
+
+/*
+ * Do we need a live update for this inode?  This is true if the scanner thread
+ * has visited this inode and the scan hasn't been aborted due to errors.
+ * Callers must hold a sufficiently exclusive lock on the inode to prevent
+ * scanners from reading any inode metadata.
+ */
+bool
+xchk_iscan_want_live_update(
+	struct xchk_iscan	*iscan,
+	xfs_ino_t		ino)
+{
+	bool			ret;
+
+	if (xchk_iscan_aborted(iscan))
+		return false;
+
+	mutex_lock(&iscan->lock);
+	ret = iscan->__visited_ino >= ino;
+	mutex_unlock(&iscan->lock);
+
+	return ret;
+}
diff --git a/fs/xfs/scrub/iscan.h b/fs/xfs/scrub/iscan.h
new file mode 100644
index 000000000000..947176620bc3
--- /dev/null
+++ b/fs/xfs/scrub/iscan.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_ISCAN_H__
+#define __XFS_SCRUB_ISCAN_H__
+
+struct xchk_iscan {
+	/* Lock to protect the scan cursor. */
+	struct mutex		lock;
+
+	/* This is the inode that will be examined next. */
+	xfs_ino_t		cursor_ino;
+
+	/*
+	 * This is the last inode that we've successfully scanned, either
+	 * because the caller scanned it, or we moved the cursor past an empty
+	 * part of the inode address space.  Scan callers should only use the
+	 * xchk_iscan_mark_visited function to modify this.
+	 */
+	xfs_ino_t		__visited_ino;
+
+	/* Operational state of the livescan. */
+	unsigned long		__opstate;
+
+	/* Give up on iterating @cursor_ino if we can't iget it by this time. */
+	unsigned long		__iget_deadline;
+
+	/* Amount of time (in ms) that we will try to iget an inode. */
+	unsigned int		iget_timeout;
+
+	/* Wait this many ms to retry an iget. */
+	unsigned int		iget_retry_delay;
+};
+
+/* Set if the scan has been aborted due to some event in the fs. */
+#define XCHK_ISCAN_OPSTATE_ABORTED	(1)
+
+static inline bool
+xchk_iscan_aborted(const struct xchk_iscan *iscan)
+{
+	return test_bit(XCHK_ISCAN_OPSTATE_ABORTED, &iscan->__opstate);
+}
+
+static inline void
+xchk_iscan_abort(struct xchk_iscan *iscan)
+{
+	set_bit(XCHK_ISCAN_OPSTATE_ABORTED, &iscan->__opstate);
+}
+
+void xchk_iscan_start(struct xchk_iscan *iscan, unsigned int iget_timeout,
+		unsigned int iget_retry_delay);
+void xchk_iscan_finish(struct xchk_iscan *iscan);
+
+int xchk_iscan_iter(struct xfs_scrub *sc, struct xchk_iscan *iscan,
+		struct xfs_inode **ipp);
+
+void xchk_iscan_mark_visited(struct xchk_iscan *iscan, struct xfs_inode *ip);
+bool xchk_iscan_want_live_update(struct xchk_iscan *iscan, xfs_ino_t ino);
+
+#endif /* __XFS_SCRUB_ISCAN_H__ */
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 6e3395d22824..6a9835d9779f 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -17,6 +17,7 @@
 #include "scrub/scrub.h"
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
+#include "scrub/iscan.h"
 
 /* Figure out which block the btree cursor was pointing to. */
 static inline xfs_fsblock_t
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 4978548dfbff..a283e0462bae 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -16,9 +16,11 @@
 #include <linux/tracepoint.h>
 #include "xfs_bit.h"
 
+struct xfs_scrub;
 struct xfile;
 struct xfarray;
 struct xfarray_sortinfo;
+struct xchk_iscan;
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -1024,6 +1026,78 @@ TRACE_EVENT(xchk_rtsum_record_free,
 );
 #endif /* CONFIG_XFS_RT */
 
+DECLARE_EVENT_CLASS(xchk_iscan_class,
+	TP_PROTO(struct xfs_mount *mp, struct xchk_iscan *iscan),
+	TP_ARGS(mp, iscan),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, cursor)
+		__field(xfs_ino_t, visited)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->cursor = iscan->cursor_ino;
+		__entry->visited = iscan->__visited_ino;
+	),
+	TP_printk("dev %d:%d iscan cursor 0x%llx visited 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->cursor, __entry->visited)
+)
+#define DEFINE_ISCAN_EVENT(name) \
+DEFINE_EVENT(xchk_iscan_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xchk_iscan *iscan), \
+	TP_ARGS(mp, iscan))
+DEFINE_ISCAN_EVENT(xchk_iscan_move_cursor);
+DEFINE_ISCAN_EVENT(xchk_iscan_visit);
+
+TRACE_EVENT(xchk_iscan_iget,
+	TP_PROTO(struct xfs_mount *mp, struct xchk_iscan *iscan, int error),
+	TP_ARGS(mp, iscan, error),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, cursor)
+		__field(xfs_ino_t, visited)
+		__field(int, error)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->cursor = iscan->cursor_ino;
+		__entry->visited = iscan->__visited_ino;
+		__entry->error = error;
+	),
+	TP_printk("dev %d:%d iscan cursor 0x%llx visited 0x%llx error %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->cursor, __entry->visited, __entry->error)
+);
+
+TRACE_EVENT(xchk_iscan_iget_retry_wait,
+	TP_PROTO(struct xfs_mount *mp, struct xchk_iscan *iscan),
+	TP_ARGS(mp, iscan),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, cursor)
+		__field(xfs_ino_t, visited)
+		__field(unsigned int, retry_delay)
+		__field(unsigned long, remaining)
+		__field(unsigned int, iget_timeout)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->cursor = iscan->cursor_ino;
+		__entry->visited = iscan->__visited_ino;
+		__entry->retry_delay = iscan->iget_retry_delay;
+		__entry->remaining = jiffies_to_msecs(iscan->__iget_deadline - jiffies);
+		__entry->iget_timeout = iscan->iget_timeout;
+	),
+	TP_printk("dev %d:%d iscan cursor 0x%llx visited 0x%llx remaining %lu timeout %u delay %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->cursor,
+		  __entry->visited,
+		  __entry->remaining,
+		  __entry->iget_timeout,
+		  __entry->retry_delay)
+);
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs: allow scrub to hook metadata updates in other writers
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: implement live inode scan for scrub Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Certain types of filesystem metadata can only be checked by scanning
every file in the entire filesystem.  Specific examples of this include
quota counts, file link counts, and reverse mappings of file extents.
Directory and parent pointer reconstruction may also fall into this
category.  File scanning is much trickier than scanning AG metadata
because we have to take inode locks in the same order as the rest of
[VX]FS, we can't be holding buffer locks when we do that, and scanning
the whole filesystem takes time.

Earlier versions of the online repair patchset relied heavily on
fsfreeze as a means to quiesce the filesystem so that we could take
locks in the proper order without worrying about concurrent updates from
other writers.  Reviewers of those patches opined that freezing the
entire fs to check and repair something was not sufficiently better than
unmounting to run fsck offline.  I don't agree with that 100%, but the
message was clear: find a way to repair things that minimizes the
quiet period where nobody can write to the filesystem.

Generally, building btree indexes online can be split into two phases: a
collection phase where we compute the records that will be put into the
new btree; and a construction phase, where we construct the physical
btree blocks and persist them.  While it's simple to hold resource locks
for the entirety of the two phases to ensure that the new index is
consistent with the rest of the system, we don't need to hold resource
locks during the collection phase if we have a means to receive live
updates of other work going on elsewhere in the system.

The goal of this patch, then, is to enable online fsck to learn about
metadata updates going on in other threads while it constructs a shadow
copy of the metadata records to verify or correct the real metadata.  To
minimize the overhead when online fsck isn't running, we use srcu
notifiers because they prioritize fast access to the notifier call chain
(particularly when the chain is empty) at the cost of more expensive
notifier registration and removal.  Online fsck should be relatively
infrequent, so this is acceptable.

The intended usage model is fairly simple.  Code that modifies a
metadata structure of interest should declare an xfs_hooks structure in
some well defined place, and call xfs_hooks_call whenever an update
happens.  Online fsck code should define a struct xfs_hook (which embeds
a notifier_block) and use xfs_hooks_add to attach it to the chain, along
with a function to be called.  This function should synchronize with the
fsck scanner to update whatever in-memory data the scanner is
collecting.  When finished, xfs_hooks_del removes the hook from the
chain and waits for any hook functions still running to complete.
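
A minimal sketch of that model follows; the chain location, the payload,
and all of the "foo" names are hypothetical:

	/* main fs: one chain per kind of update, set up at mount time */
	xfs_hooks_init(&mp->m_foo_update_hooks);

	/* main fs: announce an update to anyone who is listening */
	xfs_hooks_call(&mp->m_foo_update_hooks, FOO_OP_ADD, &foo_update);

	/* scrub: attach a hook function for the duration of a scan... */
	xfs_hook_setup(&xfoo->hook, xchk_foo_live_update_hook);
	error = xfs_hooks_add(&mp->m_foo_update_hooks, &xfoo->hook);

	/* ...and detach it, waiting for in-flight hook calls to finish */
	xfs_hooks_del(&mp->m_foo_update_hooks, &xfoo->hook);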

On the author's computer, calling an empty srcu notifier chain was
observed to have an overhead averaging ~40ns with a maximum of 60ns.
Adding a no-op notifier function increased the average to ~58ns and
66ns.  When the quotacheck live update notifier is attached, the average
increases to ~322ns with a max of 372ns to update scrub's in-memory
observation data, assuming no lock contention.

With jump labels enabled, calls to empty srcu notifier chains are elided
from the call sites when there are no hooks registered, which means that
the overhead is 0.36ns when fsck is not running.  For configurations
without jump label support (all major architectures have it), calling an
empty notifier chain is cheaper (on a many-cpu system) than maintaining
an atomic counter for the switch would be, so we make the hook switch
itself a no-op.
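
As a sketch, a hook call site guarded by one of these switches might
look like this (the switch and chain names are hypothetical):

	DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_foo_hooks_switch);

	static inline void
	xfs_foo_call_hooks(struct xfs_mount *mp, void *priv)
	{
		if (xfs_hooks_switched_on(&xfs_foo_hooks_switch))
			xfs_hooks_call(&mp->m_foo_update_hooks, 0, priv);
	}

	/* scrub flips the switch on only while its hook is registered */
	xfs_hooks_switch_on(&xfs_foo_hooks_switch);
	...
	xfs_hooks_switch_off(&xfs_foo_hooks_switch);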

Note: This new code is also split out as a separate patch from its
initial user so that the author can move patches around his tree with
ease.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig     |    6 +++++
 fs/xfs/Makefile    |    1 +
 fs/xfs/xfs_hooks.c |   53 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_hooks.h |   68 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_linux.h |    1 +
 5 files changed, 129 insertions(+)
 create mode 100644 fs/xfs/xfs_hooks.c
 create mode 100644 fs/xfs/xfs_hooks.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 6077ac04c0c3..db60944ab3c3 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -97,11 +97,17 @@ config XFS_DRAIN_INTENTS
 	bool
 	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
 
+config XFS_LIVE_HOOKS
+	bool
+	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
+
 config XFS_ONLINE_SCRUB
 	bool "XFS online metadata check support"
 	default n
 	depends on XFS_FS
 	depends on TMPFS && SHMEM
+	depends on SRCU
+	select XFS_LIVE_HOOKS
 	select XFS_DRAIN_INTENTS
 	help
 	  If you say Y here you will be able to check metadata on a
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a0321f26f06d..76b6095154bf 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -136,6 +136,7 @@ xfs-$(CONFIG_FS_DAX)		+= xfs_notify_failure.o
 endif
 
 xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
+xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
diff --git a/fs/xfs/xfs_hooks.c b/fs/xfs/xfs_hooks.c
new file mode 100644
index 000000000000..3f958ece0dc0
--- /dev/null
+++ b/fs/xfs/xfs_hooks.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_ag.h"
+#include "xfs_trace.h"
+
+/* Initialize a notifier chain. */
+void
+xfs_hooks_init(
+	struct xfs_hooks	*chain)
+{
+	srcu_init_notifier_head(&chain->head);
+}
+
+/* Make it so a function gets called whenever we hit a certain hook point. */
+int
+xfs_hooks_add(
+	struct xfs_hooks	*chain,
+	struct xfs_hook		*hook)
+{
+	ASSERT(hook->nb.notifier_call != NULL);
+	BUILD_BUG_ON(offsetof(struct xfs_hook, nb) != 0);
+
+	return srcu_notifier_chain_register(&chain->head, &hook->nb);
+}
+
+/* Remove a previously installed hook. */
+void
+xfs_hooks_del(
+	struct xfs_hooks	*chain,
+	struct xfs_hook		*hook)
+{
+	srcu_notifier_chain_unregister(&chain->head, &hook->nb);
+	rcu_barrier();
+}
+
+/* Call a hook.  Returns the NOTIFY_* value returned by the last hook. */
+int
+xfs_hooks_call(
+	struct xfs_hooks	*chain,
+	unsigned long		val,
+	void			*priv)
+{
+	return srcu_notifier_call_chain(&chain->head, val, priv);
+}
diff --git a/fs/xfs/xfs_hooks.h b/fs/xfs/xfs_hooks.h
new file mode 100644
index 000000000000..9cd3f6e07751
--- /dev/null
+++ b/fs/xfs/xfs_hooks.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef XFS_HOOKS_H_
+#define XFS_HOOKS_H_
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_hooks {
+	struct srcu_notifier_head	head;
+};
+#else
+struct xfs_hooks { /* empty */ };
+#endif
+
+/*
+ * If hooks and jump labels are enabled, we use jump labels (aka patching of
+ * the code segment) to avoid the minute overhead of calling an empty notifier
+ * chain when we know there are no callers.  If hooks are enabled without jump
+ * labels, hardwire the predicate to true because calling an empty srcu
+ * notifier chain isn't so expensive.
+ */
+#if defined(CONFIG_JUMP_LABEL) && defined(CONFIG_XFS_LIVE_HOOKS)
+# define DEFINE_STATIC_XFS_HOOK_SWITCH(name) \
+	static DEFINE_STATIC_KEY_FALSE(name)
+# define xfs_hooks_switch_on(name)	static_branch_inc(name)
+# define xfs_hooks_switch_off(name)	static_branch_dec(name)
+# define xfs_hooks_switched_on(name)	static_branch_unlikely(name)
+#elif defined(CONFIG_XFS_LIVE_HOOKS)
+# define DEFINE_STATIC_XFS_HOOK_SWITCH(name)
+# define xfs_hooks_switch_on(name)	((void)0)
+# define xfs_hooks_switch_off(name)	((void)0)
+# define xfs_hooks_switched_on(name)	(true)
+#else
+# define DEFINE_STATIC_XFS_HOOK_SWITCH(name)
+# define xfs_hooks_switch_on(name)	((void)0)
+# define xfs_hooks_switch_off(name)	((void)0)
+# define xfs_hooks_switched_on(name)	(false)
+#endif /* JUMP_LABEL && XFS_LIVE_HOOKS */
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_hook {
+	/* This must come at the start of the structure. */
+	struct notifier_block		nb;
+};
+
+typedef	int (*xfs_hook_fn_t)(struct xfs_hook *hook, unsigned long action,
+		void *data);
+
+void xfs_hooks_init(struct xfs_hooks *chain);
+int xfs_hooks_add(struct xfs_hooks *chain, struct xfs_hook *hook);
+void xfs_hooks_del(struct xfs_hooks *chain, struct xfs_hook *hook);
+int xfs_hooks_call(struct xfs_hooks *chain, unsigned long action,
+		void *priv);
+
+static inline void xfs_hook_setup(struct xfs_hook *hook, xfs_hook_fn_t fn)
+{
+	hook->nb.notifier_call = (notifier_fn_t)fn;
+	hook->nb.priority = 0;
+}
+
+#else
+# define xfs_hooks_init(chain)			((void)0)
+# define xfs_hooks_call(chain, val, priv)	(NOTIFY_DONE)
+#endif
+
+#endif /* XFS_HOOKS_H_ */
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 51e84f824a7c..3847719c3026 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -80,6 +80,7 @@ typedef __u32			xfs_nlink_t;
 #include "xfs_buf.h"
 #include "xfs_message.h"
 #include "xfs_drain.h"
+#include "xfs_hooks.h"
 
 #ifdef __BIG_ENDIAN
 #define XFS_NATIVE_HOST 1


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs: allow blocking notifier chains with filesystem hooks
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 1/4] xfs: speed up xfs_iwalk_adjust_start a little bit Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that we can switch between notifier chain implementations for
testing purposes.  On the author's test system, calling an empty srcu
notifier chain cost about 19ns per call, vs. 4ns for a blocking notifier
chain.  Hm.  Might we actually want regular blocking notifiers?

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig     |   33 ++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_hooks.c |   41 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_hooks.h |    6 +++++-
 3 files changed, 78 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index db60944ab3c3..54806c2b80d4 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -106,7 +106,6 @@ config XFS_ONLINE_SCRUB
 	default n
 	depends on XFS_FS
 	depends on TMPFS && SHMEM
-	depends on SRCU
 	select XFS_LIVE_HOOKS
 	select XFS_DRAIN_INTENTS
 	help
@@ -122,6 +121,38 @@ config XFS_ONLINE_SCRUB
 
 	  If unsure, say N.
 
+choice
+	prompt "XFS hook implementation"
+	depends on XFS_FS && XFS_LIVE_HOOKS && XFS_ONLINE_SCRUB
+	default XFS_LIVE_HOOKS_BLOCKING if HAVE_ARCH_JUMP_LABEL
+	default XFS_LIVE_HOOKS_SRCU if !HAVE_ARCH_JUMP_LABEL
+	help
+	  Pick one
+
+config XFS_LIVE_HOOKS_SRCU
+	bool "SRCU notifier chains"
+	depends on SRCU
+	help
+	  Use SRCU notifier chains for filesystem hooks.  These have very low
+	  overhead for event initiators (the main filesystem) and higher
+	  overhead for chain modifiers (scrub waits for RCU grace).  This is
+	  the best option when jump labels are not supported or there are many
+	  CPUs in the system.
+
+	  This may cause problems with CPU hotplug invoking reclaim invoking
+	  XFS.
+
+config XFS_LIVE_HOOKS_BLOCKING
+	bool "Blocking notifier chains"
+	help
+	  Use blocking notifier chains for filesystem hooks.  These have medium
+	  overhead for event initiators (the main fs) and chain modifiers
+	  (scrub) due to their use of rwsems.  This is the best option when
+	  jump labels can be used to eliminate overhead for the filesystem when
+	  scrub is not running.
+
+endchoice
+
 config XFS_ONLINE_REPAIR
 	bool "XFS online metadata repair support"
 	default n
diff --git a/fs/xfs/xfs_hooks.c b/fs/xfs/xfs_hooks.c
index 3f958ece0dc0..653fc1f82516 100644
--- a/fs/xfs/xfs_hooks.c
+++ b/fs/xfs/xfs_hooks.c
@@ -12,6 +12,7 @@
 #include "xfs_ag.h"
 #include "xfs_trace.h"
 
+#if defined(CONFIG_XFS_LIVE_HOOKS_SRCU)
 /* Initialize a notifier chain. */
 void
 xfs_hooks_init(
@@ -51,3 +52,43 @@ xfs_hooks_call(
 {
 	return srcu_notifier_call_chain(&chain->head, val, priv);
 }
+#elif defined(CONFIG_XFS_LIVE_HOOKS_BLOCKING)
+/* Initialize a notifier chain. */
+void
+xfs_hooks_init(
+	struct xfs_hooks	*chain)
+{
+	BLOCKING_INIT_NOTIFIER_HEAD(&chain->head);
+}
+
+/* Make it so a function gets called whenever we hit a certain hook point. */
+int
+xfs_hooks_add(
+	struct xfs_hooks	*chain,
+	struct xfs_hook		*hook)
+{
+	ASSERT(hook->nb.notifier_call != NULL);
+	BUILD_BUG_ON(offsetof(struct xfs_hook, nb) != 0);
+
+	return blocking_notifier_chain_register(&chain->head, &hook->nb);
+}
+
+/* Remove a previously installed hook. */
+void
+xfs_hooks_del(
+	struct xfs_hooks	*chain,
+	struct xfs_hook		*hook)
+{
+	blocking_notifier_chain_unregister(&chain->head, &hook->nb);
+}
+
+/* Call a hook.  Returns the NOTIFY_* value returned by the last hook. */
+int
+xfs_hooks_call(
+	struct xfs_hooks	*chain,
+	unsigned long		val,
+	void			*priv)
+{
+	return blocking_notifier_call_chain(&chain->head, val, priv);
+}
+#endif /* CONFIG_XFS_LIVE_HOOKS_BLOCKING */
diff --git a/fs/xfs/xfs_hooks.h b/fs/xfs/xfs_hooks.h
index 9cd3f6e07751..7e5ef53f5829 100644
--- a/fs/xfs/xfs_hooks.h
+++ b/fs/xfs/xfs_hooks.h
@@ -6,10 +6,14 @@
 #ifndef XFS_HOOKS_H_
 #define XFS_HOOKS_H_
 
-#ifdef CONFIG_XFS_LIVE_HOOKS
+#if defined(CONFIG_XFS_LIVE_HOOKS_SRCU)
 struct xfs_hooks {
 	struct srcu_notifier_head	head;
 };
+#elif defined(CONFIG_XFS_LIVE_HOOKS_BLOCKING)
+struct xfs_hooks {
+	struct blocking_notifier_head	head;
+};
 #else
 struct xfs_hooks { /* empty */ };
 #endif


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/5] xfs: online repair of quota counters
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (10 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/5] xfs: implement live quotacheck inode scan Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of file link counts Darrick J. Wong
                   ` (65 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series uses the inode scanner and live update hook functionality
introduced in the last patchset to implement quotacheck on a live
filesystem.  The quotacheck scrubber builds an incore copy of the
dquot resource usage counters and compares it to the live dquots to
report discrepancies.

If the user chooses to repair the quota counters, the repair function
visits each incore dquot to update the counts from the live information.
The live update hooks are key to keeping the incore copy up to date.
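
Roughly, the check boils down to a comparison of this shape for each
quota type; this is only a sketch, and the actual scrubber in this
series also deals with locking, dquot freshness, and incomplete scans:

	/* compare one live dquot against the shadow counters */
	error = xfarray_load_sparse(counts, dq->q_id, &xcdq);
	if (error)
		return error;

	if (xcdq.icount != dq->q_ino.count ||
	    xcdq.bcount != dq->q_blk.count ||
	    xcdq.rtbcount != dq->q_rtb.count)
		xchk_qcheck_set_corrupt(sc, dqtype, dq->q_id);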

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-quotacheck

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-quotacheck
---
 fs/xfs/Makefile                  |    2 
 fs/xfs/libxfs/xfs_fs.h           |    4 
 fs/xfs/libxfs/xfs_health.h       |    4 
 fs/xfs/scrub/common.c            |   47 ++
 fs/xfs/scrub/common.h            |   11 
 fs/xfs/scrub/health.c            |    1 
 fs/xfs/scrub/quotacheck.c        |  840 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/quotacheck.h        |   76 +++
 fs/xfs/scrub/quotacheck_repair.c |  254 +++++++++++
 fs/xfs/scrub/repair.c            |   46 ++
 fs/xfs/scrub/repair.h            |    5 
 fs/xfs/scrub/scrub.c             |    9 
 fs/xfs/scrub/scrub.h             |   10 
 fs/xfs/scrub/trace.h             |   32 +
 fs/xfs/scrub/xfarray.h           |   19 +
 fs/xfs/xfs_health.c              |    1 
 fs/xfs/xfs_inode.c               |   21 +
 fs/xfs/xfs_inode.h               |    3 
 fs/xfs/xfs_qm.c                  |   23 +
 fs/xfs/xfs_qm.h                  |   16 +
 fs/xfs/xfs_qm_bhv.c              |    1 
 fs/xfs/xfs_quota.h               |   45 ++
 fs/xfs/xfs_trans_dquot.c         |  158 +++++++
 23 files changed, 1604 insertions(+), 24 deletions(-)
 create mode 100644 fs/xfs/scrub/quotacheck.c
 create mode 100644 fs/xfs/scrub/quotacheck.h
 create mode 100644 fs/xfs/scrub/quotacheck_repair.c


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/5] xfs: report the health of quota counts
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of quota counters Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/5] xfs: implement live quotacheck inode scan Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/5] xfs: repair cannot update the summary counters when logging quota flags Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/5] xfs: track quota updates during live quotacheck Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 5/5] xfs: repair dquots based on live quotacheck results Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Report the health of quota counts.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h     |    1 +
 fs/xfs/libxfs/xfs_health.h |    4 +++-
 fs/xfs/xfs_health.c        |    1 +
 fs/xfs/xfs_qm.c            |    7 ++++++-
 fs/xfs/xfs_trans_dquot.c   |    2 ++
 5 files changed, 13 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 920fd4513fcb..7e86e1db66dd 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -195,6 +195,7 @@ struct xfs_fsop_geom {
 #define XFS_FSOP_GEOM_SICK_PQUOTA	(1 << 3)  /* project quota */
 #define XFS_FSOP_GEOM_SICK_RT_BITMAP	(1 << 4)  /* realtime bitmap */
 #define XFS_FSOP_GEOM_SICK_RT_SUMMARY	(1 << 5)  /* realtime summary */
+#define XFS_FSOP_GEOM_SICK_QUOTACHECK	(1 << 6)  /* quota counts */
 
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 99e796256c5d..1dea286bb157 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -41,6 +41,7 @@ struct xfs_fsop_geom;
 #define XFS_SICK_FS_UQUOTA	(1 << 1)  /* user quota */
 #define XFS_SICK_FS_GQUOTA	(1 << 2)  /* group quota */
 #define XFS_SICK_FS_PQUOTA	(1 << 3)  /* project quota */
+#define XFS_SICK_FS_QUOTACHECK	(1 << 4)  /* quota counts */
 
 /* Observable health issues for realtime volume metadata. */
 #define XFS_SICK_RT_BITMAP	(1 << 0)  /* realtime bitmap */
@@ -72,7 +73,8 @@ struct xfs_fsop_geom;
 #define XFS_SICK_FS_PRIMARY	(XFS_SICK_FS_COUNTERS | \
 				 XFS_SICK_FS_UQUOTA | \
 				 XFS_SICK_FS_GQUOTA | \
-				 XFS_SICK_FS_PQUOTA)
+				 XFS_SICK_FS_PQUOTA | \
+				 XFS_SICK_FS_QUOTACHECK)
 
 #define XFS_SICK_RT_PRIMARY	(XFS_SICK_RT_BITMAP | \
 				 XFS_SICK_RT_SUMMARY)
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 72a075bb2c10..cd32f0fc0643 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -280,6 +280,7 @@ static const struct ioctl_sick_map fs_map[] = {
 	{ XFS_SICK_FS_UQUOTA,	XFS_FSOP_GEOM_SICK_UQUOTA },
 	{ XFS_SICK_FS_GQUOTA,	XFS_FSOP_GEOM_SICK_GQUOTA },
 	{ XFS_SICK_FS_PQUOTA,	XFS_FSOP_GEOM_SICK_PQUOTA },
+	{ XFS_SICK_FS_QUOTACHECK, XFS_FSOP_GEOM_SICK_QUOTACHECK },
 	{ 0, 0 },
 };
 
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index ff53d40a2dae..8356b7500d75 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -26,6 +26,7 @@
 #include "xfs_ag.h"
 #include "xfs_ialloc.h"
 #include "xfs_log_priv.h"
+#include "xfs_health.h"
 
 /*
  * The global quota manager. There is only one of these for the entire
@@ -1392,8 +1393,12 @@ xfs_qm_quotacheck(
 			xfs_warn(mp,
 				"Quotacheck: Failed to reset quota flags.");
 		}
-	} else
+		xfs_fs_mark_sick(mp, XFS_SICK_FS_QUOTACHECK);
+	} else {
 		xfs_notice(mp, "Quotacheck: Done.");
+		xfs_fs_mark_healthy(mp, XFS_SICK_FS_QUOTACHECK);
+	}
+
 	return error;
 }
 
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index aa00cf67ad72..968dc7af4fc7 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -17,6 +17,7 @@
 #include "xfs_qm.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
+#include "xfs_health.h"
 
 STATIC void	xfs_trans_alloc_dqinfo(xfs_trans_t *);
 
@@ -706,6 +707,7 @@ xfs_trans_dqresv(
 error_corrupt:
 	xfs_dqunlock(dqp);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	xfs_fs_mark_sick(mp, XFS_SICK_FS_QUOTACHECK);
 	return -EFSCORRUPTED;
 }
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/5] xfs: implement live quotacheck inode scan
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of quota counters Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/5] xfs: repair cannot update the summary counters when logging quota flags Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new trio of scrub functions to check quota counters.  While the
dquots themselves are filesystem metadata and should be checked early,
the dquot counter values are computed from other metadata and are
therefore summary counters.  We don't plug these into the scrub dispatch
just yet, because we still need to be able to watch quota updates while
doing our scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile           |    1 
 fs/xfs/libxfs/xfs_fs.h    |    3 
 fs/xfs/scrub/common.c     |   43 ++++
 fs/xfs/scrub/common.h     |   11 +
 fs/xfs/scrub/health.c     |    1 
 fs/xfs/scrub/quotacheck.c |  487 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/quotacheck.h |   67 ++++++
 fs/xfs/scrub/scrub.c      |    6 +
 fs/xfs/scrub/scrub.h      |    6 +
 fs/xfs/scrub/trace.h      |   30 +++
 fs/xfs/scrub/xfarray.h    |   19 ++
 fs/xfs/xfs_inode.c        |   21 ++
 fs/xfs/xfs_inode.h        |    3 
 13 files changed, 694 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/quotacheck.c
 create mode 100644 fs/xfs/scrub/quotacheck.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 76b6095154bf..62398f7332a1 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -175,6 +175,7 @@ xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
 xfs-$(CONFIG_XFS_QUOTA)		+= $(addprefix scrub/, \
 				   iscan.o \
 				   quota.o \
+				   quotacheck.o \
 				   )
 
 # online repair
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 7e86e1db66dd..6612c89944d0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -708,9 +708,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_GQUOTA	22	/* group quotas */
 #define XFS_SCRUB_TYPE_PQUOTA	23	/* project quotas */
 #define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
+#define XFS_SCRUB_TYPE_QUOTACHECK 25	/* quota counters */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	25
+#define XFS_SCRUB_TYPE_NR	26
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1u << 0)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 3fc392c1b1a8..0a0ac993c602 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -30,6 +30,7 @@
 #include "xfs_attr.h"
 #include "xfs_reflink.h"
 #include "xfs_ag.h"
+#include "xfs_error.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -83,6 +84,15 @@ __xchk_process_error(
 				sc->ip ? sc->ip : XFS_I(file_inode(sc->file)),
 				sc->sm, *error);
 		break;
+	case -ECANCELED:
+		/*
+		 * ECANCELED here means that the caller set one of the scrub
+		 * outcome flags (corrupt, xfail, xcorrupt) and wants to exit
+		 * quickly.  Set error to zero and do not continue.
+		 */
+		trace_xchk_op_error(sc, agno, bno, *error, ret_ip);
+		*error = 0;
+		break;
 	case -EFSBADCRC:
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
@@ -90,8 +100,7 @@ __xchk_process_error(
 		*error = 0;
 		fallthrough;
 	default:
-		trace_xchk_op_error(sc, agno, bno, *error,
-				ret_ip);
+		trace_xchk_op_error(sc, agno, bno, *error, ret_ip);
 		break;
 	}
 	return false;
@@ -137,6 +146,16 @@ __xchk_fblock_process_error(
 		/* Used to restart an op with deadlock avoidance. */
 		trace_xchk_deadlock_retry(sc->ip, sc->sm, *error);
 		break;
+	case -ECANCELED:
+		/*
+		 * ECANCELED here means that the caller set one of the scrub
+		 * outcome flags (corrupt, xfail, xcorrupt) and wants to exit
+		 * quickly.  Set error to zero and do not continue.
+		 */
+		trace_xchk_file_op_error(sc, whichfork, offset, *error,
+				ret_ip);
+		*error = 0;
+		break;
 	case -EFSBADCRC:
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
@@ -228,6 +247,19 @@ xchk_block_set_corrupt(
 	trace_xchk_block_error(sc, xfs_buf_daddr(bp), __return_address);
 }
 
+#ifdef CONFIG_XFS_QUOTA
+/* Record a corrupt quota counter. */
+void
+xchk_qcheck_set_corrupt(
+	struct xfs_scrub	*sc,
+	unsigned int		dqtype,
+	xfs_dqid_t		id)
+{
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	trace_xchk_qcheck_error(sc, dqtype, id, __return_address);
+}
+#endif /* CONFIG_XFS_QUOTA */
+
 /* Record a corruption while cross-referencing. */
 void
 xchk_block_xref_set_corrupt(
@@ -646,6 +678,13 @@ xchk_trans_cancel(
 	sc->tp = NULL;
 }
 
+int
+xchk_trans_alloc_empty(
+	struct xfs_scrub	*sc)
+{
+	return xfs_trans_alloc_empty(sc->mp, &sc->tp);
+}
+
 /*
  * Grab an empty transaction so that we can re-grab locked buffers if
  * one of our btrees turns out to be cyclic.
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 4c90c45b9b34..2c33814e0b69 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -32,6 +32,7 @@ xchk_should_terminate(
 }
 
 int xchk_trans_alloc(struct xfs_scrub *sc, uint resblks);
+int xchk_trans_alloc_empty(struct xfs_scrub *sc);
 void xchk_trans_cancel(struct xfs_scrub *sc);
 
 bool xchk_process_error(struct xfs_scrub *sc, xfs_agnumber_t agno,
@@ -54,6 +55,10 @@ void xchk_block_set_corrupt(struct xfs_scrub *sc,
 void xchk_ino_set_corrupt(struct xfs_scrub *sc, xfs_ino_t ino);
 void xchk_fblock_set_corrupt(struct xfs_scrub *sc, int whichfork,
 		xfs_fileoff_t offset);
+#ifdef CONFIG_XFS_QUOTA
+void xchk_qcheck_set_corrupt(struct xfs_scrub *sc, unsigned int dqtype,
+		xfs_dqid_t id);
+#endif /* CONFIG_XFS_QUOTA */
 
 void xchk_block_xref_set_corrupt(struct xfs_scrub *sc,
 		struct xfs_buf *bp);
@@ -105,6 +110,7 @@ xchk_setup_rtsummary(struct xfs_scrub *sc)
 #ifdef CONFIG_XFS_QUOTA
 int xchk_ino_dqattach(struct xfs_scrub *sc);
 int xchk_setup_quota(struct xfs_scrub *sc);
+int xchk_setup_quotacheck(struct xfs_scrub *sc);
 #else
 static inline int
 xchk_ino_dqattach(struct xfs_scrub *sc)
@@ -116,6 +122,11 @@ xchk_setup_quota(struct xfs_scrub *sc)
 {
 	return -ENOENT;
 }
+static inline int
+xchk_setup_quotacheck(struct xfs_scrub *sc)
+{
+	return -ENOENT;
+}
 #endif
 int xchk_setup_fscounters(struct xfs_scrub *sc);
 
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index f67279ecb69c..5e28fa1ab6aa 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -107,6 +107,7 @@ static const struct xchk_health_map type_to_health_flag[XFS_SCRUB_TYPE_NR] = {
 	[XFS_SCRUB_TYPE_GQUOTA]		= { XHG_FS,  XFS_SICK_FS_GQUOTA },
 	[XFS_SCRUB_TYPE_PQUOTA]		= { XHG_FS,  XFS_SICK_FS_PQUOTA },
 	[XFS_SCRUB_TYPE_FSCOUNTERS]	= { XHG_FS,  XFS_SICK_FS_COUNTERS },
+	[XFS_SCRUB_TYPE_QUOTACHECK]	= { XHG_FS,  XFS_SICK_FS_QUOTACHECK },
 };
 
 /* Return the health status mask for this scrub type. */
diff --git a/fs/xfs/scrub/quotacheck.c b/fs/xfs/scrub/quotacheck.c
new file mode 100644
index 000000000000..d2cbd79eafa5
--- /dev/null
+++ b/fs/xfs/scrub/quotacheck.c
@@ -0,0 +1,487 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_icache.h"
+#include "xfs_bmap_util.h"
+#include "xfs_ialloc.h"
+#include "xfs_ag.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/repair.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/iscan.h"
+#include "scrub/quotacheck.h"
+#include "scrub/trace.h"
+
+/*
+ * Live Quotacheck
+ * ===============
+ *
+ * Quota counters are "summary" metadata, in the sense that they are computed
+ * as the summation of the block usage counts for every file on the filesystem.
+ * Therefore, we compute the correct icount, bcount, and rtbcount values by
+ * creating a shadow quota counter structure and walking every inode.
+ */
+
+/* Set us up to scrub quota counters. */
+int
+xchk_setup_quotacheck(
+	struct xfs_scrub	*sc)
+{
+	/* Not ready for general consumption yet. */
+	return -EOPNOTSUPP;
+
+	if (!XFS_IS_QUOTA_ON(sc->mp))
+		return -ENOENT;
+
+	sc->buf = kzalloc(sizeof(struct xqcheck), XCHK_GFP_FLAGS);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	return xchk_setup_fs(sc);
+}
+
+/*
+ * Part 1: Collecting dquot resource usage counts.  For each xfs_dquot attached
+ * to each inode, we create a shadow dquot and accumulate the inode count and
+ * the data/rt block usage that we observe.
+ *
+ * To avoid false corruption reports in part 2, any failure in this part must
+ * set the INCOMPLETE flag even when a negative errno is returned.  This care
+ * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
+ * ECANCELED) that are absorbed into a scrub state flag update by
+ * xchk_*_process_error.
+ */
+
+/* Update the incore dquot counter information from a live update. */
+static int
+xqcheck_update_incore_counts(
+	struct xqcheck		*xqc,
+	struct xfarray		*counts,
+	xfs_dqid_t		id,
+	int64_t			inodes,
+	int64_t			nblks,
+	int64_t			rtblks)
+{
+	struct xqcheck_dquot	xcdq;
+	int			error;
+
+	error = xfarray_load_sparse(counts, id, &xcdq);
+	if (error)
+		return error;
+
+	xcdq.flags |= XQCHECK_DQUOT_WRITTEN;
+	xcdq.icount += inodes;
+	xcdq.bcount += nblks;
+	xcdq.rtbcount += rtblks;
+
+	error = xfarray_store(counts, id, &xcdq);
+	if (error == -EFBIG) {
+		/*
+		 * EFBIG means we tried to store data at too high a byte offset
+		 * in the sparse array.  IOWs, we cannot complete the check and
+		 * must notify userspace that the check was incomplete.
+		 */
+		error = -ECANCELED;
+	}
+	return error;
+}
+
+/* Record this inode's quota usage in our shadow quota counter data. */
+STATIC int
+xqcheck_collect_inode(
+	struct xqcheck		*xqc,
+	struct xfs_inode	*ip)
+{
+	struct xfs_trans	*tp = xqc->sc->tp;
+	xfs_filblks_t		nblks, rtblks;
+	uint			ilock_flags = 0;
+	xfs_dqid_t		id;
+	int			error = 0;
+
+	if (xfs_is_quota_inode(&tp->t_mountp->m_sb, ip->i_ino)) {
+		/*
+		 * Quota inode blocks are never counted towards quota, so we
+		 * do not need to take the lock.
+		 */
+		xchk_iscan_mark_visited(&xqc->iscan, ip);
+		return 0;
+	}
+
+	/* Figure out the data / rt device block counts. */
+	xfs_ilock(ip, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED);
+	ilock_flags = xfs_ilock_data_map_shared(ip);
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
+		if (error)
+			goto out_incomplete;
+	}
+	xfs_inode_count_blocks(tp, ip, &nblks, &rtblks);
+
+	/* Update the shadow dquot counters. */
+	mutex_lock(&xqc->lock);
+	if (xqc->ucounts) {
+		id = xfs_qm_id_for_quotatype(ip, XFS_DQTYPE_USER);
+		error = xqcheck_update_incore_counts(xqc, xqc->ucounts, id, 1,
+				nblks, rtblks);
+		if (error)
+			goto out_mutex;
+	}
+
+	if (xqc->gcounts) {
+		id = xfs_qm_id_for_quotatype(ip, XFS_DQTYPE_GROUP);
+		error = xqcheck_update_incore_counts(xqc, xqc->gcounts, id, 1,
+				nblks, rtblks);
+		if (error)
+			goto out_mutex;
+	}
+
+	if (xqc->pcounts) {
+		id = xfs_qm_id_for_quotatype(ip, XFS_DQTYPE_PROJ);
+		error = xqcheck_update_incore_counts(xqc, xqc->pcounts, id, 1,
+				nblks, rtblks);
+		if (error)
+			goto out_mutex;
+	}
+	mutex_unlock(&xqc->lock);
+
+	xchk_iscan_mark_visited(&xqc->iscan, ip);
+	goto out_ilock;
+
+out_mutex:
+	mutex_unlock(&xqc->lock);
+out_incomplete:
+	xchk_set_incomplete(xqc->sc);
+out_ilock:
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | ilock_flags);
+	return error;
+}
+
+/* Walk all the allocated inodes and run a quota scan on them. */
+STATIC int
+xqcheck_collect_counts(
+	struct xqcheck		*xqc)
+{
+	struct xfs_scrub	*sc = xqc->sc;
+	struct xfs_inode	*ip;
+	int			error;
+
+	/*
+	 * Set up for a potentially lengthy filesystem scan by reducing our
+	 * transaction resource usage for the duration.  Specifically:
+	 *
+	 * Cancel the transaction to release the log grant space while we scan
+	 * the filesystem.
+	 *
+	 * Create a new empty transaction to eliminate the possibility of the
+	 * inode scan deadlocking on cyclical metadata.
+	 *
+	 * We pass the empty transaction to the file scanning function to avoid
+	 * repeatedly cycling empty transactions.  This can be done without
+	 * risk of deadlock between sb_internal and the IOLOCK (we take the
+	 * IOLOCK to quiesce the file before scanning) because empty
+	 * transactions do not take sb_internal.
+	 */
+	xchk_trans_cancel(sc);
+	error = xchk_trans_alloc_empty(sc);
+	if (error)
+		return error;
+
+	while ((error = xchk_iscan_iter(sc, &xqc->iscan, &ip)) == 1) {
+		error = xqcheck_collect_inode(xqc, ip);
+		xchk_irele(sc, ip);
+		if (error)
+			break;
+
+		if (xchk_should_terminate(sc, &error))
+			break;
+	}
+	if (error) {
+		xchk_set_incomplete(sc);
+		/*
+		 * If we couldn't grab an inode that was busy with a state
+		 * change, change the error code so that we exit to userspace
+		 * as quickly as possible.
+		 */
+		if (error == -EBUSY)
+			return -ECANCELED;
+		return error;
+	}
+
+	/*
+	 * Switch out for a real transaction in preparation for comparing the
+	 * dquot counters and possibly repairing them.
+	 */
+	xchk_trans_cancel(sc);
+	return xchk_setup_fs(sc);
+}
+
+/*
+ * Part 2: Comparing dquot resource counters.  Walk each xfs_dquot, comparing
+ * the resource usage counters against our shadow dquots; and then walk each
+ * shadow dquot (that wasn't covered in the first part), comparing it against
+ * the xfs_dquot.
+ */
+
+/*
+ * Check the dquot data against what we observed.  Caller must hold the dquot
+ * lock.
+ */
+STATIC int
+xqcheck_compare_dquot(
+	struct xfs_dquot	*dqp,
+	xfs_dqtype_t		dqtype,
+	void			*priv)
+{
+	struct xqcheck_dquot	xcdq;
+	struct xqcheck		*xqc = priv;
+	struct xfarray		*counts = xqcheck_counters_for(xqc, dqtype);
+	int			error;
+
+	mutex_lock(&xqc->lock);
+	error = xfarray_load_sparse(counts, dqp->q_id, &xcdq);
+	if (error)
+		goto out_unlock;
+
+	if (xcdq.icount != dqp->q_ino.count)
+		xchk_qcheck_set_corrupt(xqc->sc, dqtype, dqp->q_id);
+
+	if (xcdq.bcount != dqp->q_blk.count)
+		xchk_qcheck_set_corrupt(xqc->sc, dqtype, dqp->q_id);
+
+	if (xcdq.rtbcount != dqp->q_rtb.count)
+		xchk_qcheck_set_corrupt(xqc->sc, dqtype, dqp->q_id);
+
+	xcdq.flags |= (XQCHECK_DQUOT_COMPARE_SCANNED | XQCHECK_DQUOT_WRITTEN);
+	error = xfarray_store(counts, dqp->q_id, &xcdq);
+	if (error == -EFBIG) {
+		/*
+		 * EFBIG means we tried to store data at too high a byte offset
+		 * in the sparse array.  IOWs, we cannot complete the check and
+		 * must notify userspace that the check was incomplete.  This
+		 * should never happen, since we just read the record.
+		 */
+		xchk_set_incomplete(xqc->sc);
+		error = -ECANCELED;
+	}
+	mutex_unlock(&xqc->lock);
+	if (error)
+		return error;
+
+	if (xqc->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return -ECANCELED;
+
+	return 0;
+
+out_unlock:
+	mutex_unlock(&xqc->lock);
+	return error;
+}
+
+/*
+ * Walk all the observed dquots, and make sure there's a matching incore
+ * dquot and that its counts match ours.
+ */
+STATIC int
+xqcheck_walk_observations(
+	struct xqcheck		*xqc,
+	xfs_dqtype_t		dqtype)
+{
+	struct xqcheck_dquot	xcdq;
+	struct xfs_dquot	*dqp;
+	struct xfarray		*counts = xqcheck_counters_for(xqc, dqtype);
+	xfarray_idx_t		cur = XFARRAY_CURSOR_INIT;
+	int			error;
+
+	mutex_lock(&xqc->lock);
+	while ((error = xfarray_iter(counts, &cur, &xcdq)) == 1) {
+		xfs_dqid_t	id = cur - 1;
+
+		if (xcdq.flags & XQCHECK_DQUOT_COMPARE_SCANNED)
+			continue;
+
+		mutex_unlock(&xqc->lock);
+
+		error = xfs_qm_dqget(xqc->sc->mp, id, dqtype, false, &dqp);
+		if (error == -ENOENT) {
+			xchk_qcheck_set_corrupt(xqc->sc, dqtype, id);
+			return 0;
+		}
+		if (error)
+			return error;
+
+		error = xqcheck_compare_dquot(dqp, dqtype, xqc);
+		xfs_qm_dqput(dqp);
+		if (error)
+			return error;
+
+		if (xchk_should_terminate(xqc->sc, &error))
+			return error;
+
+		mutex_lock(&xqc->lock);
+	}
+	mutex_unlock(&xqc->lock);
+
+	return error;
+}
+
+/* Compare the quota counters we observed against the live dquots. */
+STATIC int
+xqcheck_compare_dqtype(
+	struct xqcheck		*xqc,
+	xfs_dqtype_t		dqtype)
+{
+	struct xfs_scrub	*sc = xqc->sc;
+	int			error;
+
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return 0;
+
+	/* If the quota CHKD flag is cleared, we need to repair this quota. */
+	if (!(xfs_quota_chkd_flag(dqtype) & sc->mp->m_qflags)) {
+		xchk_qcheck_set_corrupt(xqc->sc, dqtype, 0);
+		return 0;
+	}
+
+	/* Compare what we observed against the actual dquots. */
+	error = xfs_qm_dqiterate(sc->mp, dqtype, xqcheck_compare_dquot, xqc);
+	if (error)
+		return error;
+
+	/* Walk all the observed dquots and compare to the incore ones. */
+	return xqcheck_walk_observations(xqc, dqtype);
+}
+
+/* Tear down everything associated with a quotacheck. */
+static void
+xqcheck_teardown_scan(
+	void			*priv)
+{
+	struct xqcheck		*xqc = priv;
+
+	if (xqc->pcounts) {
+		xfarray_destroy(xqc->pcounts);
+		xqc->pcounts = NULL;
+	}
+
+	if (xqc->gcounts) {
+		xfarray_destroy(xqc->gcounts);
+		xqc->gcounts = NULL;
+	}
+
+	if (xqc->ucounts) {
+		xfarray_destroy(xqc->ucounts);
+		xqc->ucounts = NULL;
+	}
+
+	xchk_iscan_finish(&xqc->iscan);
+	mutex_destroy(&xqc->lock);
+	xqc->sc = NULL;
+}
+
+/*
+ * Scan all inodes in the entire filesystem to generate quota counter data.
+ * If the scan is successful, the quota data will be left alive for a repair.
+ * If any error occurs, we'll tear everything down.
+ */
+STATIC int
+xqcheck_setup_scan(
+	struct xfs_scrub	*sc,
+	struct xqcheck		*xqc)
+{
+	unsigned long long	max_dquots = ((xfs_dqid_t)-1) + 1ULL;
+	int			error;
+
+	ASSERT(xqc->sc == NULL);
+	xqc->sc = sc;
+
+	mutex_init(&xqc->lock);
+
+	/* Retry iget every tenth of a second for up to 30 seconds. */
+	xchk_iscan_start(&xqc->iscan, 30000, 100);
+
+	error = -ENOMEM;
+	if (xfs_this_quota_on(sc->mp, XFS_DQTYPE_USER)) {
+		error = xfarray_create(sc->mp, "user dquots", max_dquots,
+				sizeof(struct xqcheck_dquot), &xqc->ucounts);
+		if (error)
+			goto out_teardown;
+	}
+
+	if (xfs_this_quota_on(sc->mp, XFS_DQTYPE_GROUP)) {
+		error = xfarray_create(sc->mp, "group dquots", max_dquots,
+				sizeof(struct xqcheck_dquot), &xqc->gcounts);
+		if (error)
+			goto out_teardown;
+	}
+
+	if (xfs_this_quota_on(sc->mp, XFS_DQTYPE_PROJ)) {
+		error = xfarray_create(sc->mp, "project dquots", max_dquots,
+				sizeof(struct xqcheck_dquot), &xqc->pcounts);
+		if (error)
+			goto out_teardown;
+	}
+
+	/* Use deferred cleanup to pass the quota count data to repair. */
+	sc->buf_cleanup = xqcheck_teardown_scan;
+	return 0;
+
+out_teardown:
+	xqcheck_teardown_scan(xqc);
+	return error;
+}
+
+/* Scrub all counters for a given quota type. */
+int
+xchk_quotacheck(
+	struct xfs_scrub	*sc)
+{
+	struct xqcheck		*xqc = sc->buf;
+	int			error = 0;
+
+	/* Check quota counters on the live filesystem. */
+	error = xqcheck_setup_scan(sc, xqc);
+	if (error)
+		return error;
+
+	/* Walk all inodes, picking up quota information. */
+	error = xqcheck_collect_counts(xqc);
+	if (!xchk_xref_process_error(sc, 0, 0, &error))
+		return error;
+
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)
+		return 0;
+
+	/* Compare quota counters. */
+	if (xqc->ucounts) {
+		error = xqcheck_compare_dqtype(xqc, XFS_DQTYPE_USER);
+		if (!xchk_xref_process_error(sc, 0, 0, &error))
+			return error;
+	}
+	if (xqc->gcounts) {
+		error = xqcheck_compare_dqtype(xqc, XFS_DQTYPE_GROUP);
+		if (!xchk_xref_process_error(sc, 0, 0, &error))
+			return error;
+	}
+	if (xqc->pcounts) {
+		error = xqcheck_compare_dqtype(xqc, XFS_DQTYPE_PROJ);
+		if (!xchk_xref_process_error(sc, 0, 0, &error))
+			return error;
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/quotacheck.h b/fs/xfs/scrub/quotacheck.h
new file mode 100644
index 000000000000..0caf41dd10e6
--- /dev/null
+++ b/fs/xfs/scrub/quotacheck.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_QUOTACHECK_H__
+#define __XFS_SCRUB_QUOTACHECK_H__
+
+/* Quota counters for live quotacheck. */
+struct xqcheck_dquot {
+	/* block usage count */
+	int64_t			bcount;
+
+	/* inode usage count */
+	int64_t			icount;
+
+	/* realtime block usage count */
+	int64_t			rtbcount;
+
+	/* Record state */
+	unsigned int		flags;
+};
+
+/*
+ * This incore dquot record has been written at least once.  We never want to
+ * store an xqcheck_dquot that looks uninitialized.
+ */
+#define XQCHECK_DQUOT_WRITTEN		(1U << 0)
+
+/* Already checked this dquot. */
+#define XQCHECK_DQUOT_COMPARE_SCANNED	(1U << 1)
+
+/* Live quotacheck control structure. */
+struct xqcheck {
+	struct xfs_scrub	*sc;
+
+	/* Shadow dquot counter data. */
+	struct xfarray		*ucounts;
+	struct xfarray		*gcounts;
+	struct xfarray		*pcounts;
+
+	/* Lock protecting quotacheck count observations */
+	struct mutex		lock;
+
+	struct xchk_iscan	iscan;
+};
+
+/* Return the incore counter array for a given quota type. */
+static inline struct xfarray *
+xqcheck_counters_for(
+	struct xqcheck		*xqc,
+	xfs_dqtype_t		dqtype)
+{
+	switch (dqtype) {
+	case XFS_DQTYPE_USER:
+		return xqc->ucounts;
+	case XFS_DQTYPE_GROUP:
+		return xqc->gcounts;
+	case XFS_DQTYPE_PROJ:
+		return xqc->pcounts;
+	}
+
+	ASSERT(0);
+	return NULL;
+}
+
+#endif /* __XFS_SCRUB_QUOTACHECK_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 60875c4ad5d9..41db4c83f1cb 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -363,6 +363,12 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.scrub	= xchk_fscounters,
 		.repair	= xrep_notsupported,
 	},
+	[XFS_SCRUB_TYPE_QUOTACHECK] = {	/* quota counters */
+		.type	= ST_FS,
+		.setup	= xchk_setup_quotacheck,
+		.scrub	= xchk_quotacheck,
+		.repair	= xrep_notsupported,
+	},
 };
 
 static int
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 7670c0a415a4..a7e0dbd47733 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -163,12 +163,18 @@ xchk_rtsummary(struct xfs_scrub *sc)
 #endif
 #ifdef CONFIG_XFS_QUOTA
 int xchk_quota(struct xfs_scrub *sc);
+int xchk_quotacheck(struct xfs_scrub *sc);
 #else
 static inline int
 xchk_quota(struct xfs_scrub *sc)
 {
 	return -ENOENT;
 }
+static inline int
+xchk_quotacheck(struct xfs_scrub *sc)
+{
+	return -ENOENT;
+}
 #endif
 int xchk_fscounters(struct xfs_scrub *sc);
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index a283e0462bae..9f161c0de7a8 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -15,6 +15,7 @@
 
 #include <linux/tracepoint.h>
 #include "xfs_bit.h"
+#include "xfs_quota_defs.h"
 
 struct xfs_scrub;
 struct xfile;
@@ -64,6 +65,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_UQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_GQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
+TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_QUOTACHECK);
 
 #define XFS_SCRUB_TYPE_STRINGS \
 	{ XFS_SCRUB_TYPE_PROBE,		"probe" }, \
@@ -90,7 +92,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 	{ XFS_SCRUB_TYPE_UQUOTA,	"usrquota" }, \
 	{ XFS_SCRUB_TYPE_GQUOTA,	"grpquota" }, \
 	{ XFS_SCRUB_TYPE_PQUOTA,	"prjquota" }, \
-	{ XFS_SCRUB_TYPE_FSCOUNTERS,	"fscounters" }
+	{ XFS_SCRUB_TYPE_FSCOUNTERS,	"fscounters" }, \
+	{ XFS_SCRUB_TYPE_QUOTACHECK,	"quotacheck" }
 
 #define XFS_SCRUB_FLAG_STRINGS \
 	{ XFS_SCRUB_IFLAG_REPAIR,		"repair" }, \
@@ -350,6 +353,31 @@ DEFINE_EVENT(xchk_fblock_error_class, name, \
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_error);
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_warning);
 
+#ifdef CONFIG_XFS_QUOTA
+TRACE_EVENT(xchk_qcheck_error,
+	TP_PROTO(struct xfs_scrub *sc, xfs_dqtype_t dqtype, xfs_dqid_t id,
+		 void *ret_ip),
+	TP_ARGS(sc, dqtype, id, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_dqtype_t, dqtype)
+		__field(xfs_dqid_t, id)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->dqtype = dqtype;
+		__entry->id = id;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d dquot type %s id 0x%x ret_ip %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->dqtype, XFS_DQTYPE_STRINGS),
+		  __entry->id,
+		  __entry->ret_ip)
+);
+#endif /* CONFIG_XFS_QUOTA */
+
 TRACE_EVENT(xchk_incomplete,
 	TP_PROTO(struct xfs_scrub *sc, void *ret_ip),
 	TP_ARGS(sc, ret_ip),
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index 95684e0c572e..44c7e7083881 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -46,6 +46,25 @@ int xfarray_store(struct xfarray *array, xfarray_idx_t idx, const void *ptr);
 int xfarray_store_anywhere(struct xfarray *array, const void *ptr);
 bool xfarray_element_is_null(struct xfarray *array, const void *ptr);
 
+/*
+ * Load an array element, but zero the buffer if there's no data because we
+ * haven't stored to that array element yet.
+ */
+static inline int
+xfarray_load_sparse(
+	struct xfarray	*array,
+	uint64_t	idx,
+	void		*rec)
+{
+	int		error = xfarray_load(array, idx, rec);
+
+	if (error == -ENODATA) {
+		memset(rec, 0, array->obj_size);
+		return 0;
+	}
+	return error;
+}
+
 /* Append an element to the array. */
 static inline int xfarray_append(struct xfarray *array, const void *ptr)
 {
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 47197e4cdbe8..eebdbc55d078 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3522,3 +3522,24 @@ xfs_iunlock2_io_mmap(
 	if (ip1 != ip2)
 		inode_unlock(VFS_I(ip1));
 }
+
+/* Compute the number of data and realtime blocks used by a file. */
+void
+xfs_inode_count_blocks(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	xfs_filblks_t		*dblocks,
+	xfs_filblks_t		*rblocks)
+{
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+
+	if (!XFS_IS_REALTIME_INODE(ip)) {
+		*dblocks = ip->i_nblocks;
+		*rblocks = 0;
+		return;
+	}
+
+	*rblocks = 0;
+	xfs_bmap_count_leaves(ifp, rblocks);
+	*dblocks = ip->i_nblocks - *rblocks;
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index fa780f08dc89..57c459f8e669 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -575,4 +575,7 @@ void xfs_end_io(struct work_struct *work);
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
 void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
 
+void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
+		xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
+
 #endif	/* __XFS_INODE_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/5] xfs: track quota updates during live quotacheck
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of quota counters Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 1/5] xfs: report the health of quota counts Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 5/5] xfs: repair dquots based on live quotacheck results Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a shadow dqtrx system in the quotacheck code that hooks the
regular dquot counter update code.  This will be the means to keep our
copy of the dquot counters up to date while the scan runs in real time.
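
The split can be pictured with a small standalone userspace sketch
(illustrative only; none of the structure or function names below appear
in the patch): a "mod" hook accumulates per-transaction deltas in a
shadow structure, and an "apply" hook folds those deltas into the shadow
counters when the transaction commits.

#include <stdint.h>
#include <stdio.h>

struct shadow_dqtrx {			/* per-transaction deltas for one dquot */
	uint32_t	id;
	int64_t		icount_delta;
	int64_t		bcount_delta;
	int		in_use;
};

struct shadow_dquot {			/* scrub's shadow copy of a dquot */
	int64_t		icount;
	int64_t		bcount;
};

#define NR_DQUOTS	4

static struct shadow_dquot shadow[NR_DQUOTS];

/* "mod" hook: record a delta against the transaction's shadow dqtrx. */
static void mod_hook(struct shadow_dqtrx *dqtrx, uint32_t id,
		     int64_t inodes, int64_t blocks)
{
	dqtrx->id = id;
	dqtrx->icount_delta += inodes;
	dqtrx->bcount_delta += blocks;
	dqtrx->in_use = 1;
}

/* "apply" hook: fold the deltas into the shadow dquot at commit time. */
static void apply_hook(const struct shadow_dqtrx *dqtrx)
{
	if (!dqtrx->in_use)
		return;
	shadow[dqtrx->id].icount += dqtrx->icount_delta;
	shadow[dqtrx->id].bcount += dqtrx->bcount_delta;
}

int main(void)
{
	struct shadow_dqtrx tx = { 0 };

	/* One transaction allocates an inode and eight blocks for dquot 2. */
	mod_hook(&tx, 2, 1, 8);
	apply_hook(&tx);		/* transaction commits */

	printf("dquot 2: icount=%lld bcount=%lld\n",
	       (long long)shadow[2].icount, (long long)shadow[2].bcount);
	return 0;
}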

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c     |    4 +
 fs/xfs/scrub/quotacheck.c |  358 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/quotacheck.h |    6 +
 fs/xfs/scrub/scrub.c      |    3 
 fs/xfs/scrub/scrub.h      |    4 -
 fs/xfs/scrub/trace.h      |    1 
 fs/xfs/xfs_qm.c           |   16 +-
 fs/xfs/xfs_qm.h           |   16 ++
 fs/xfs/xfs_qm_bhv.c       |    1 
 fs/xfs/xfs_quota.h        |   45 ++++++
 fs/xfs/xfs_trans_dquot.c  |  156 +++++++++++++++++++-
 11 files changed, 594 insertions(+), 16 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 0a0ac993c602..9f418e30f5a3 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -31,6 +31,7 @@
 #include "xfs_reflink.h"
 #include "xfs_ag.h"
 #include "xfs_error.h"
+#include "xfs_quota.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -1283,5 +1284,8 @@ xchk_fshooks_enable(
 	if (scrub_fshooks & XCHK_FSHOOKS_DRAIN)
 		xfs_drain_wait_enable();
 
+	if (scrub_fshooks & XCHK_FSHOOKS_QUOTA)
+		xfs_dqtrx_hook_enable();
+
 	sc->flags |= scrub_fshooks;
 }
diff --git a/fs/xfs/scrub/quotacheck.c b/fs/xfs/scrub/quotacheck.c
index d2cbd79eafa5..b9162556d00e 100644
--- a/fs/xfs/scrub/quotacheck.c
+++ b/fs/xfs/scrub/quotacheck.c
@@ -37,17 +37,54 @@
  * creating a shadow quota counter structure and walking every inode.
  */
 
+/* Track the quota deltas for a dquot in a transaction. */
+struct xqcheck_dqtrx {
+	xfs_dqtype_t		q_type;
+	xfs_dqid_t		q_id;
+
+	int64_t			icount_delta;
+
+	int64_t			bcount_delta;
+	int64_t			delbcnt_delta;
+
+	int64_t			rtbcount_delta;
+	int64_t			delrtb_delta;
+};
+
+#define XQCHECK_MAX_NR_DQTRXS	(XFS_QM_TRANS_DQTYPES * XFS_QM_TRANS_MAXDQS)
+
+/*
+ * Track the quota deltas for all dquots attached to a transaction if the
+ * quota deltas are being applied to an inode that we already scanned.
+ */
+struct xqcheck_dqacct {
+	struct rhash_head	hash;
+	uintptr_t		tx_id;
+	struct xqcheck_dqtrx	dqtrx[XQCHECK_MAX_NR_DQTRXS];
+	unsigned int		refcount;
+};
+
+/* Free a shadow dquot accounting structure. */
+static void
+xqcheck_dqacct_free(
+	void			*ptr,
+	void			*arg)
+{
+	struct xqcheck_dqacct	*dqa = ptr;
+
+	kfree(dqa);
+}
+
 /* Set us up to scrub quota counters. */
 int
 xchk_setup_quotacheck(
 	struct xfs_scrub	*sc)
 {
-	/* Not ready for general consumption yet. */
-	return -EOPNOTSUPP;
-
 	if (!XFS_IS_QUOTA_ON(sc->mp))
 		return -ENOENT;
 
+	xchk_fshooks_enable(sc, XCHK_FSHOOKS_QUOTA);
+
 	sc->buf = kzalloc(sizeof(struct xqcheck), XCHK_GFP_FLAGS);
 	if (!sc->buf)
 		return -ENOMEM;
@@ -65,6 +102,22 @@ xchk_setup_quotacheck(
  * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
  * ECANCELED) that are absorbed into a scrub state flag update by
  * xchk_*_process_error.
+ *
+ * Because we are scanning a live filesystem, it's possible that another thread
+ * will try to update the quota counters for an inode that we've already
+ * scanned.  This will cause our counts to be incorrect.  Therefore, we hook
+ * the live transaction code in two places: (1) when the callers update the
+ * per-transaction dqtrx structure to log quota counter updates; and (2) when
+ * transaction commit actually logs those updates to the incore dquot.  By
+ * shadowing transaction updates in this manner, live quotacheck can ensure
+ * by locking the dquot and the shadow structure that its own copies are not
+ * out of date.  Because the hook code runs in a different process context from
+ * the scrub code and the scrub state flags are not accessed atomically,
+ * failures in the hook code must abort the iscan and the scrubber must notice
+ * the aborted scan and set the incomplete flag.
+ *
+ * Note that we use srcu notifier hooks to minimize the overhead when live
+ * quotacheck is /not/ running.
  */
 
 /* Update the incore dquot counters from a live update. */
@@ -101,6 +154,234 @@ xqcheck_update_incore_counts(
 	return error;
 }
 
+/* Decide if this is the shadow dquot accounting structure for a transaction. */
+static int
+xqcheck_dqacct_obj_cmpfn(
+	struct rhashtable_compare_arg	*arg,
+	const void			*obj)
+{
+	const uintptr_t			*tx_idp = arg->key;
+	const struct xqcheck_dqacct	*dqa = obj;
+
+	if (dqa->tx_id != *tx_idp)
+		return 1;
+	return 0;
+}
+
+static const struct rhashtable_params xqcheck_dqacct_hash_params = {
+	.min_size		= 32,
+	.key_len		= sizeof(uintptr_t),
+	.key_offset		= offsetof(struct xqcheck_dqacct, tx_id),
+	.head_offset		= offsetof(struct xqcheck_dqacct, hash),
+	.automatic_shrinking	= true,
+	.obj_cmpfn		= xqcheck_dqacct_obj_cmpfn,
+};
+
+/* Find a shadow dqtrx slot for the given dquot. */
+STATIC struct xqcheck_dqtrx *
+xqcheck_get_dqtrx(
+	struct xqcheck_dqacct	*dqa,
+	xfs_dqtype_t		q_type,
+	xfs_dqid_t		q_id)
+{
+	int			i;
+
+	for (i = 0; i < XQCHECK_MAX_NR_DQTRXS; i++) {
+		if (dqa->dqtrx[i].q_type == 0 ||
+		    (dqa->dqtrx[i].q_type == q_type &&
+		     dqa->dqtrx[i].q_id == q_id))
+			return &dqa->dqtrx[i];
+	}
+
+	return NULL;
+}
+
+/*
+ * Create and fill out a quota delta tracking structure to shadow the updates
+ * going on in the regular quota code.
+ */
+static int
+xqcheck_mod_live_ino_dqtrx(
+	struct xfs_hook			*mod_hook,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_mod_ino_dqtrx_params *p = data;
+	struct xqcheck			*xqc;
+	struct xqcheck_dqacct		*dqa;
+	struct xqcheck_dqtrx		*dqtrx;
+	int				error;
+
+	xqc = container_of(mod_hook, struct xqcheck, hooks.mod_hook);
+
+	/* Skip quota reservation fields. */
+	switch (action) {
+	case XFS_TRANS_DQ_BCOUNT:
+	case XFS_TRANS_DQ_DELBCOUNT:
+	case XFS_TRANS_DQ_ICOUNT:
+	case XFS_TRANS_DQ_RTBCOUNT:
+	case XFS_TRANS_DQ_DELRTBCOUNT:
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+
+	/* Ignore dqtrx updates for quota types we don't care about. */
+	switch (p->q_type) {
+	case XFS_DQTYPE_USER:
+		if (!xqc->ucounts)
+			return NOTIFY_DONE;
+		break;
+	case XFS_DQTYPE_GROUP:
+		if (!xqc->gcounts)
+			return NOTIFY_DONE;
+		break;
+	case XFS_DQTYPE_PROJ:
+		if (!xqc->pcounts)
+			return NOTIFY_DONE;
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+
+	/* Skip inodes that haven't been scanned yet. */
+	if (!xchk_iscan_want_live_update(&xqc->iscan, p->ino))
+		return NOTIFY_DONE;
+
+	/* Make a shadow quota accounting tracker for this transaction. */
+	mutex_lock(&xqc->lock);
+	dqa = rhashtable_lookup_fast(&xqc->shadow_dquot_acct, &p->tx_id,
+			xqcheck_dqacct_hash_params);
+	if (!dqa) {
+		dqa = kzalloc(sizeof(struct xqcheck_dqacct), XCHK_GFP_FLAGS);
+		if (!dqa)
+			goto out_abort;
+
+		dqa->tx_id = p->tx_id;
+		error = rhashtable_insert_fast(&xqc->shadow_dquot_acct,
+				&dqa->hash, xqcheck_dqacct_hash_params);
+		if (error)
+			goto out_abort;
+	}
+
+	/* Find the shadow dqtrx (or an empty slot) here. */
+	dqtrx = xqcheck_get_dqtrx(dqa, p->q_type, p->q_id);
+	if (!dqtrx)
+		goto out_abort;
+	if (dqtrx->q_type == 0) {
+		dqtrx->q_type = p->q_type;
+		dqtrx->q_id = p->q_id;
+		dqa->refcount++;
+	}
+
+	/* Update counter */
+	switch (action) {
+	case XFS_TRANS_DQ_BCOUNT:
+		dqtrx->bcount_delta += p->delta;
+		break;
+	case XFS_TRANS_DQ_DELBCOUNT:
+		dqtrx->delbcnt_delta += p->delta;
+		break;
+	case XFS_TRANS_DQ_ICOUNT:
+		dqtrx->icount_delta += p->delta;
+		break;
+	case XFS_TRANS_DQ_RTBCOUNT:
+		dqtrx->rtbcount_delta += p->delta;
+		break;
+	case XFS_TRANS_DQ_DELRTBCOUNT:
+		dqtrx->delrtb_delta += p->delta;
+		break;
+	}
+
+	mutex_unlock(&xqc->lock);
+	return NOTIFY_DONE;
+
+out_abort:
+	xchk_iscan_abort(&xqc->iscan);
+	mutex_unlock(&xqc->lock);
+	return NOTIFY_DONE;
+}
+
+/*
+ * Apply the transaction quota deltas to our shadow quota accounting info when
+ * the regular quota code is doing the same.
+ */
+static int
+xqcheck_apply_live_dqtrx(
+	struct xfs_hook			*apply_hook,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_apply_dqtrx_params	*p = data;
+	struct xqcheck			*xqc;
+	struct xqcheck_dqacct		*dqa;
+	struct xqcheck_dqtrx		*dqtrx;
+	struct xfarray			*counts;
+	int				error;
+
+	xqc = container_of(apply_hook, struct xqcheck, hooks.apply_hook);
+
+	/* Map the dquot type to an incore counter object. */
+	switch (p->q_type) {
+	case XFS_DQTYPE_USER:
+		counts = xqc->ucounts;
+		break;
+	case XFS_DQTYPE_GROUP:
+		counts = xqc->gcounts;
+		break;
+	case XFS_DQTYPE_PROJ:
+		counts = xqc->pcounts;
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+
+	if (xchk_iscan_aborted(&xqc->iscan) || counts == NULL)
+		return NOTIFY_DONE;
+
+	/*
+	 * Find the shadow dqtrx for this transaction and dquot, if any deltas
+	 * need to be applied here.  If not, we're finished early.
+	 */
+	mutex_lock(&xqc->lock);
+	dqa = rhashtable_lookup_fast(&xqc->shadow_dquot_acct, &p->tx_id,
+			xqcheck_dqacct_hash_params);
+	if (!dqa)
+		goto out_unlock;
+	dqtrx = xqcheck_get_dqtrx(dqa, p->q_type, p->q_id);
+	if (!dqtrx || dqtrx->q_type == 0)
+		goto out_unlock;
+
+	/* Update our shadow dquot if we're committing. */
+	if (action == XFS_APPLY_DQTRX_COMMIT) {
+		error = xqcheck_update_incore_counts(xqc, counts, p->q_id,
+				dqtrx->icount_delta,
+				dqtrx->bcount_delta + dqtrx->delbcnt_delta,
+				dqtrx->rtbcount_delta + dqtrx->delrtb_delta);
+		if (error)
+			goto out_abort;
+	}
+
+	/* Free the shadow accounting structure if that was the last user. */
+	dqa->refcount--;
+	if (dqa->refcount == 0) {
+		error = rhashtable_remove_fast(&xqc->shadow_dquot_acct,
+				&dqa->hash, xqcheck_dqacct_hash_params);
+		if (error)
+			goto out_abort;
+		xqcheck_dqacct_free(dqa, NULL);
+	}
+
+	mutex_unlock(&xqc->lock);
+	return NOTIFY_DONE;
+
+out_abort:
+	xchk_iscan_abort(&xqc->iscan);
+out_unlock:
+	mutex_unlock(&xqc->lock);
+	return NOTIFY_DONE;
+}
+
 /* Record this inode's quota usage in our shadow quota counter data. */
 STATIC int
 xqcheck_collect_inode(
@@ -128,10 +409,15 @@ xqcheck_collect_inode(
 	if (XFS_IS_REALTIME_INODE(ip)) {
 		error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
 		if (error)
-			goto out_incomplete;
+			goto out_abort;
 	}
 	xfs_inode_count_blocks(tp, ip, &nblks, &rtblks);
 
+	if (xchk_iscan_aborted(&xqc->iscan)) {
+		error = -ECANCELED;
+		goto out_incomplete;
+	}
+
 	/* Update the shadow dquot counters. */
 	mutex_lock(&xqc->lock);
 	if (xqc->ucounts) {
@@ -164,6 +450,8 @@ xqcheck_collect_inode(
 
 out_mutex:
 	mutex_unlock(&xqc->lock);
+out_abort:
+	xchk_iscan_abort(&xqc->iscan);
 out_incomplete:
 	xchk_set_incomplete(xqc->sc);
 out_ilock:
@@ -252,6 +540,11 @@ xqcheck_compare_dquot(
 	struct xfarray		*counts = xqcheck_counters_for(xqc, dqtype);
 	int			error;
 
+	if (xchk_iscan_aborted(&xqc->iscan)) {
+		xchk_set_incomplete(xqc->sc);
+		return -ECANCELED;
+	}
+
 	mutex_lock(&xqc->lock);
 	error = xfarray_load_sparse(counts, dqp->q_id, &xcdq);
 	if (error)
@@ -273,7 +566,7 @@ xqcheck_compare_dquot(
 		 * EFBIG means we tried to store data at too high a byte offset
 		 * in the sparse array.  IOWs, we cannot complete the check and
 		 * must notify userspace that the check was incomplete.  This
-		 * should never happen, since we just read the record.
+		 * should never happen outside of the collection phase.
 		 */
 		xchk_set_incomplete(xqc->sc);
 		error = -ECANCELED;
@@ -372,6 +665,26 @@ xqcheck_teardown_scan(
 	void			*priv)
 {
 	struct xqcheck		*xqc = priv;
+	struct xfs_quotainfo	*qi = xqc->sc->mp->m_quotainfo;
+
+	/* Discourage any hook functions that might be running. */
+	xchk_iscan_abort(&xqc->iscan);
+
+	/*
+	 * As noted above, the apply hook is responsible for cleaning up the
+	 * hook must be removed before the apply hook so that the mod hook
+	 * cannot create new shadow accounts that the apply hook would never
+	 * get a chance to tear down.  No hooks should be running after these
+	 * functions return.
+	 * return.
+	 */
+	xfs_dqtrx_hook_del(qi, &xqc->hooks);
+
+	if (xqc->shadow_dquot_acct.key_len) {
+		rhashtable_free_and_destroy(&xqc->shadow_dquot_acct,
+				xqcheck_dqacct_free, NULL);
+		xqc->shadow_dquot_acct.key_len = 0;
+	}
 
 	if (xqc->pcounts) {
 		xfarray_destroy(xqc->pcounts);
@@ -403,6 +716,7 @@ xqcheck_setup_scan(
 	struct xfs_scrub	*sc,
 	struct xqcheck		*xqc)
 {
+	struct xfs_quotainfo	*qi = sc->mp->m_quotainfo;
 	unsigned long long	max_dquots = ((xfs_dqid_t)-1) + 1ULL;
 	int			error;
 
@@ -436,6 +750,33 @@ xqcheck_setup_scan(
 			goto out_teardown;
 	}
 
+	/*
+	 * Set up hash table to map transactions to our internal shadow dqtrx
+	 * structures.
+	 */
+	error = rhashtable_init(&xqc->shadow_dquot_acct,
+			&xqcheck_dqacct_hash_params);
+	if (error)
+		goto out_teardown;
+
+	/*
+	 * Hook into the quota code.  The hook only triggers for inodes that
+	 * were already scanned, and the scanner thread takes each inode's
+	 * ILOCK, which means that any in-progress inode updates will finish
+	 * before we can scan the inode.
+	 *
+	 * The apply hook (which removes the shadow dquot accounting struct)
+	 * must be installed before the mod hook so that we never fail to catch
+	 * the end of a quota update sequence and leave stale shadow data.
+	 */
+	ASSERT(sc->flags & XCHK_FSHOOKS_QUOTA);
+	xfs_hook_setup(&xqc->hooks.mod_hook, xqcheck_mod_live_ino_dqtrx);
+	xfs_hook_setup(&xqc->hooks.apply_hook, xqcheck_apply_live_dqtrx);
+
+	error = xfs_dqtrx_hook_add(qi, &xqc->hooks);
+	if (error)
+		goto out_teardown;
+
 	/* Use deferred cleanup to pass the quota count data to repair. */
 	sc->buf_cleanup = xqcheck_teardown_scan;
 	return 0;
@@ -463,6 +804,9 @@ xchk_quotacheck(
 	if (!xchk_xref_process_error(sc, 0, 0, &error))
 		return error;
 
+	/* Fail fast if we're not playing with a full dataset. */
+	if (xchk_iscan_aborted(&xqc->iscan))
+		xchk_set_incomplete(sc);
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)
 		return 0;
 
@@ -483,5 +827,9 @@ xchk_quotacheck(
 			return error;
 	}
 
+	/* Check one last time for an incomplete dataset. */
+	if (xchk_iscan_aborted(&xqc->iscan))
+		xchk_set_incomplete(sc);
+
 	return 0;
 }
diff --git a/fs/xfs/scrub/quotacheck.h b/fs/xfs/scrub/quotacheck.h
index 0caf41dd10e6..6dc55686ad39 100644
--- a/fs/xfs/scrub/quotacheck.h
+++ b/fs/xfs/scrub/quotacheck.h
@@ -43,6 +43,12 @@ struct xqcheck {
 	struct mutex		lock;
 
 	struct xchk_iscan	iscan;
+
+	/* Hooks into the quota code. */
+	struct xfs_dqtrx_hook	hooks;
+
+	/* Shadow quota delta tracking structure. */
+	struct rhashtable	shadow_dquot_acct;
 };
 
 /* Return the incore counter array for a given quota type. */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 41db4c83f1cb..db277b57e8a2 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -158,6 +158,9 @@ xchk_fshooks_disable(
 	if (sc->flags & XCHK_FSHOOKS_DRAIN)
 		xfs_drain_wait_disable();
 
+	if (sc->flags & XCHK_FSHOOKS_QUOTA)
+		xfs_dqtrx_hook_disable();
+
 	sc->flags &= ~XCHK_FSHOOKS_ALL;
 }
 
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index a7e0dbd47733..de09b709992b 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -121,10 +121,12 @@ struct xfs_scrub {
 #define XCHK_REAPING_DISABLED	(1 << 1)  /* background block reaping paused */
 #define XCHK_FSHOOKS_DRAIN	(1 << 2)  /* defer ops draining enabled */
 #define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to use intent drain */
+#define XCHK_FSHOOKS_QUOTA	(1 << 4)  /* quota live update enabled */
 #define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
-#define XCHK_FSHOOKS_ALL	(XCHK_FSHOOKS_DRAIN)
+#define XCHK_FSHOOKS_ALL	(XCHK_FSHOOKS_DRAIN | \
+				 XCHK_FSHOOKS_QUOTA)
 
 /* Metadata scrubbers */
 int xchk_tester(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 9f161c0de7a8..f8ee323a3cd2 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -111,6 +111,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_QUOTACHECK);
 	{ XCHK_REAPING_DISABLED,		"reaping_disabled" }, \
 	{ XCHK_FSHOOKS_DRAIN,			"fshooks_drain" }, \
 	{ XCHK_NEED_DRAIN,			"need_drain" }, \
+	{ XCHK_FSHOOKS_QUOTA,			"fshooks_quota" }, \
 	{ XREP_RESET_PERAG_RESV,		"reset_perag_resv" }, \
 	{ XREP_ALREADY_FIXED,			"already_fixed" }
 
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 8356b7500d75..972ed5912950 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -691,6 +691,9 @@ xfs_qm_init_quotainfo(
 	if (error)
 		goto out_free_inos;
 
+	xfs_hooks_init(&qinf->qi_mod_ino_dqtrx_hooks);
+	xfs_hooks_init(&qinf->qi_apply_dqtrx_hooks);
+
 	return 0;
 
 out_free_inos:
@@ -1793,12 +1796,12 @@ xfs_qm_vop_chown(
 	ASSERT(prevdq);
 	ASSERT(prevdq != newdq);
 
-	xfs_trans_mod_dquot(tp, prevdq, bfield, -(ip->i_nblocks));
-	xfs_trans_mod_dquot(tp, prevdq, XFS_TRANS_DQ_ICOUNT, -1);
+	xfs_trans_mod_ino_dquot(tp, ip, prevdq, bfield, -(ip->i_nblocks));
+	xfs_trans_mod_ino_dquot(tp, ip, prevdq, XFS_TRANS_DQ_ICOUNT, -1);
 
 	/* the sparkling new dquot */
-	xfs_trans_mod_dquot(tp, newdq, bfield, ip->i_nblocks);
-	xfs_trans_mod_dquot(tp, newdq, XFS_TRANS_DQ_ICOUNT, 1);
+	xfs_trans_mod_ino_dquot(tp, ip, newdq, bfield, ip->i_nblocks);
+	xfs_trans_mod_ino_dquot(tp, ip, newdq, XFS_TRANS_DQ_ICOUNT, 1);
 
 	/*
 	 * Back when we made quota reservations for the chown, we reserved the
@@ -1880,22 +1883,21 @@ xfs_qm_vop_create_dqattach(
 		ASSERT(i_uid_read(VFS_I(ip)) == udqp->q_id);
 
 		ip->i_udquot = xfs_qm_dqhold(udqp);
-		xfs_trans_mod_dquot(tp, udqp, XFS_TRANS_DQ_ICOUNT, 1);
 	}
 	if (gdqp && XFS_IS_GQUOTA_ON(mp)) {
 		ASSERT(ip->i_gdquot == NULL);
 		ASSERT(i_gid_read(VFS_I(ip)) == gdqp->q_id);
 
 		ip->i_gdquot = xfs_qm_dqhold(gdqp);
-		xfs_trans_mod_dquot(tp, gdqp, XFS_TRANS_DQ_ICOUNT, 1);
 	}
 	if (pdqp && XFS_IS_PQUOTA_ON(mp)) {
 		ASSERT(ip->i_pdquot == NULL);
 		ASSERT(ip->i_projid == pdqp->q_id);
 
 		ip->i_pdquot = xfs_qm_dqhold(pdqp);
-		xfs_trans_mod_dquot(tp, pdqp, XFS_TRANS_DQ_ICOUNT, 1);
 	}
+
+	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_ICOUNT, 1);
 }
 
 /* Decide if this inode's dquot is near an enforcement boundary. */
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 9683f0457d19..d5700212b95c 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -68,6 +68,10 @@ struct xfs_quotainfo {
 	/* Minimum and maximum quota expiration timestamp values. */
 	time64_t		qi_expiry_min;
 	time64_t		qi_expiry_max;
+
+	/* Hook to feed quota counter updates to an active online repair. */
+	struct xfs_hooks	qi_mod_ino_dqtrx_hooks;
+	struct xfs_hooks	qi_apply_dqtrx_hooks;
 };
 
 static inline struct radix_tree_root *
@@ -104,6 +108,18 @@ xfs_quota_inode(struct xfs_mount *mp, xfs_dqtype_t type)
 	return NULL;
 }
 
+/*
+ * Parameters for tracking dqtrx changes on behalf of an inode.  The hook
+ * function arg parameter is the field being updated.
+ */
+struct xfs_mod_ino_dqtrx_params {
+	uintptr_t		tx_id;
+	xfs_ino_t		ino;
+	xfs_dqtype_t		q_type;
+	xfs_dqid_t		q_id;
+	int64_t			delta;
+};
+
 extern void	xfs_trans_mod_dquot(struct xfs_trans *tp, struct xfs_dquot *dqp,
 				    uint field, int64_t delta);
 extern void	xfs_trans_dqjoin(struct xfs_trans *, struct xfs_dquot *);
diff --git a/fs/xfs/xfs_qm_bhv.c b/fs/xfs/xfs_qm_bhv.c
index b77673dd0558..271c1021c733 100644
--- a/fs/xfs/xfs_qm_bhv.c
+++ b/fs/xfs/xfs_qm_bhv.c
@@ -9,6 +9,7 @@
 #include "xfs_format.h"
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_quota.h"
 #include "xfs_mount.h"
 #include "xfs_inode.h"
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index dcc785fdd345..fe63489d91b2 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -74,6 +74,22 @@ struct xfs_dqtrx {
 	int64_t		qt_icount_delta;  /* dquot inode count changes */
 };
 
+enum xfs_apply_dqtrx_type {
+	XFS_APPLY_DQTRX_COMMIT = 0,
+	XFS_APPLY_DQTRX_UNRESERVE,
+};
+
+/*
+ * Parameters for applying dqtrx changes to a dquot.  The hook function arg
+ * parameter is enum xfs_apply_dqtrx_type.
+ */
+struct xfs_apply_dqtrx_params {
+	uintptr_t		tx_id;
+	xfs_ino_t		ino;
+	xfs_dqtype_t		q_type;
+	xfs_dqid_t		q_id;
+};
+
 #ifdef CONFIG_XFS_QUOTA
 extern void xfs_trans_dup_dqinfo(struct xfs_trans *, struct xfs_trans *);
 extern void xfs_trans_free_dqinfo(struct xfs_trans *);
@@ -114,6 +130,29 @@ xfs_quota_reserve_blkres(struct xfs_inode *ip, int64_t blocks)
 	return xfs_trans_reserve_quota_nblks(NULL, ip, blocks, 0, false);
 }
 bool xfs_inode_near_dquot_enforcement(struct xfs_inode *ip, xfs_dqtype_t type);
+
+# ifdef CONFIG_XFS_LIVE_HOOKS
+void xfs_trans_mod_ino_dquot(struct xfs_trans *tp, struct xfs_inode *ip,
+		struct xfs_dquot *dqp, unsigned int field, int64_t delta);
+
+struct xfs_quotainfo;
+
+struct xfs_dqtrx_hook {
+	struct xfs_hook		mod_hook;
+	struct xfs_hook		apply_hook;
+};
+
+void xfs_dqtrx_hook_disable(void);
+void xfs_dqtrx_hook_enable(void);
+
+int xfs_dqtrx_hook_add(struct xfs_quotainfo *qi, struct xfs_dqtrx_hook *hook);
+void xfs_dqtrx_hook_del(struct xfs_quotainfo *qi, struct xfs_dqtrx_hook *hook);
+
+# else
+#  define xfs_trans_mod_ino_dquot(tp, ip, dqp, field, delta) \
+		xfs_trans_mod_dquot((tp), (dqp), (field), (delta))
+# endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #else
 static inline int
 xfs_qm_vop_dqalloc(struct xfs_inode *ip, kuid_t kuid, kgid_t kgid,
@@ -170,6 +209,12 @@ xfs_trans_reserve_quota_icreate(struct xfs_trans *tp, struct xfs_dquot *udqp,
 #define xfs_qm_unmount(mp)
 #define xfs_qm_unmount_quotas(mp)
 #define xfs_inode_near_dquot_enforcement(ip, type)			(false)
+
+# ifdef CONFIG_XFS_LIVE_HOOKS
+#  define xfs_dqtrx_hook_enable()		((void)0)
+#  define xfs_dqtrx_hook_disable()		((void)0)
+# endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif /* CONFIG_XFS_QUOTA */
 
 static inline int
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 968dc7af4fc7..f5e9d76fb9a2 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -121,6 +121,105 @@ xfs_trans_dup_dqinfo(
 	}
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of quota live updates.  If the
+ * compiler supports jump labels, the static branch will be replaced by a nop
+ * sled when there are no hook users.  Online fsck is currently the only
+ * caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_dqtrx_hooks_switch);
+
+void
+xfs_dqtrx_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_dqtrx_hooks_switch);
+}
+
+void
+xfs_dqtrx_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_dqtrx_hooks_switch);
+}
+
+/* Schedule a transactional dquot update on behalf of an inode. */
+void
+xfs_trans_mod_ino_dquot(
+	struct xfs_trans		*tp,
+	struct xfs_inode		*ip,
+	struct xfs_dquot		*dqp,
+	unsigned int			field,
+	int64_t				delta)
+{
+	xfs_trans_mod_dquot(tp, dqp, field, delta);
+
+	if (xfs_hooks_switched_on(&xfs_dqtrx_hooks_switch)) {
+		struct xfs_mod_ino_dqtrx_params	p = {
+			.tx_id		= (uintptr_t)tp,
+			.ino		= ip->i_ino,
+			.q_type		= xfs_dquot_type(dqp),
+			.q_id		= dqp->q_id,
+			.delta		= delta
+		};
+		struct xfs_quotainfo	*qi = tp->t_mountp->m_quotainfo;
+
+		xfs_hooks_call(&qi->qi_mod_ino_dqtrx_hooks, field, &p);
+	}
+}
+
+/* Call the specified functions during a dquot counter update. */
+int
+xfs_dqtrx_hook_add(
+	struct xfs_quotainfo	*qi,
+	struct xfs_dqtrx_hook	*hook)
+{
+	int			error;
+
+	/*
+	 * Transactional dquot updates first call the mod hook when changes
+	 * are attached to the transaction and then call the apply hook when
+	 * those changes are committed (or canceled).
+	 *
+	 * The apply hook must be installed before the mod hook so that we
+	 * never fail to catch the end of a quota update sequence.
+	 */
+	error = xfs_hooks_add(&qi->qi_apply_dqtrx_hooks, &hook->apply_hook);
+	if (error)
+		goto out;
+
+	error = xfs_hooks_add(&qi->qi_mod_ino_dqtrx_hooks, &hook->mod_hook);
+	if (error)
+		goto out_apply;
+
+	return 0;
+
+out_apply:
+	xfs_hooks_del(&qi->qi_apply_dqtrx_hooks, &hook->apply_hook);
+out:
+	return error;
+}
+
+/* Stop calling the specified function during a dquot counter update. */
+void
+xfs_dqtrx_hook_del(
+	struct xfs_quotainfo	*qi,
+	struct xfs_dqtrx_hook	*hook)
+{
+	/*
+	 * The mod hook must be removed before the apply hook to avoid leaving
+	 * the hook consumer with an incomplete update.  No hooks should be
+	 * running after these functions return.
+	 */
+	xfs_hooks_del(&qi->qi_mod_ino_dqtrx_hooks, &hook->mod_hook);
+	xfs_hooks_del(&qi->qi_apply_dqtrx_hooks, &hook->apply_hook);
+}
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Wrap around mod_dquot to account for both user and group quotas.
  */
@@ -138,11 +237,11 @@ xfs_trans_mod_dquot_byino(
 		return;
 
 	if (XFS_IS_UQUOTA_ON(mp) && ip->i_udquot)
-		(void) xfs_trans_mod_dquot(tp, ip->i_udquot, field, delta);
+		xfs_trans_mod_ino_dquot(tp, ip, ip->i_udquot, field, delta);
 	if (XFS_IS_GQUOTA_ON(mp) && ip->i_gdquot)
-		(void) xfs_trans_mod_dquot(tp, ip->i_gdquot, field, delta);
+		xfs_trans_mod_ino_dquot(tp, ip, ip->i_gdquot, field, delta);
 	if (XFS_IS_PQUOTA_ON(mp) && ip->i_pdquot)
-		(void) xfs_trans_mod_dquot(tp, ip->i_pdquot, field, delta);
+		xfs_trans_mod_ino_dquot(tp, ip, ip->i_pdquot, field, delta);
 }
 
 STATIC struct xfs_dqtrx *
@@ -322,6 +421,29 @@ xfs_apply_quota_reservation_deltas(
 	}
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/* Call downstream hooks now that it's time to apply dquot deltas. */
+static inline void
+xfs_trans_apply_dquot_deltas_hook(
+	struct xfs_trans		*tp,
+	struct xfs_dquot		*dqp)
+{
+	if (xfs_hooks_switched_on(&xfs_dqtrx_hooks_switch)) {
+		struct xfs_apply_dqtrx_params	p = {
+			.tx_id		= (uintptr_t)tp,
+			.q_type		= xfs_dquot_type(dqp),
+			.q_id		= dqp->q_id,
+		};
+		struct xfs_quotainfo	*qi = tp->t_mountp->m_quotainfo;
+
+		xfs_hooks_call(&qi->qi_apply_dqtrx_hooks,
+				XFS_APPLY_DQTRX_COMMIT, &p);
+	}
+}
+#else
+# define xfs_trans_apply_dquot_deltas_hook(tp, dqp)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Called by xfs_trans_commit() and similar in spirit to
  * xfs_trans_apply_sb_deltas().
@@ -367,6 +489,8 @@ xfs_trans_apply_dquot_deltas(
 
 			ASSERT(XFS_DQ_IS_LOCKED(dqp));
 
+			xfs_trans_apply_dquot_deltas_hook(tp, dqp);
+
 			/*
 			 * adjust the actual number of blocks used
 			 */
@@ -466,6 +590,29 @@ xfs_trans_apply_dquot_deltas(
 	}
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/* Call downstream hooks now that it's time to cancel dquot deltas. */
+static inline void
+xfs_trans_unreserve_and_mod_dquots_hook(
+	struct xfs_trans		*tp,
+	struct xfs_dquot		*dqp)
+{
+	if (xfs_hooks_switched_on(&xfs_dqtrx_hooks_switch)) {
+		struct xfs_apply_dqtrx_params	p = {
+			.tx_id		= (uintptr_t)tp,
+			.q_type		= xfs_dquot_type(dqp),
+			.q_id		= dqp->q_id,
+		};
+		struct xfs_quotainfo	*qi = tp->t_mountp->m_quotainfo;
+
+		xfs_hooks_call(&qi->qi_apply_dqtrx_hooks,
+				XFS_APPLY_DQTRX_UNRESERVE, &p);
+	}
+}
+#else
+# define xfs_trans_unreserve_and_mod_dquots_hook(tp, dqp)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Release the reservations, and adjust the dquots accordingly.
  * This is called only when the transaction is being aborted. If by
@@ -496,6 +643,9 @@ xfs_trans_unreserve_and_mod_dquots(
 			 */
 			if ((dqp = qtrx->qt_dquot) == NULL)
 				break;
+
+			xfs_trans_unreserve_and_mod_dquots_hook(tp, dqp);
+
 			/*
 			 * Unreserve the original reservation. We don't care
 			 * about the number of blocks used field, or deltas.


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs: repair cannot update the summary counters when logging quota flags
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of quota counters Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/5] xfs: implement live quotacheck inode scan Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/5] xfs: report the health of quota counts Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While running xfs/804 (quota repairs racing with fsstress), I observed a
filesystem shutdown in the primary sb write verifier:

run fstests xfs/804 at 2022-05-23 18:43:48
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Ending clean mount
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Quotacheck: Done.
XFS (sda4): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (sda4): SB ifree sanity check failed 0xb5 > 0x80
XFS (sda4): Metadata corruption detected at xfs_sb_write_verify+0x5e/0x100 [xfs], xfs_sb block 0x0
XFS (sda4): Unmount and run xfs_repair

The "SB ifree sanity check failed" message was a debugging printk that I
added to the kernel; observe that 0xb5 - 0x80 = 53, which is less than
one inode chunk.

I traced this to the xfs_log_sb calls from the online quota repair code,
which tries to clear the CHKD flags from the superblock to force a
mount-time quotacheck if the repair fails.  On a V5 filesystem,
xfs_log_sb updates the ondisk sb summary counters with the current
contents of the percpu counters.  This is done without quiescing other
writer threads, which means it could be racing with a thread that has
updated icount and is about to update ifree.

If the other write thread had incremented ifree before updating icount,
the repair thread will write ifree > icount into the logged update.  If
the AIL writes the logged superblock back to disk before anyone else
fixes this situation, this will lead to a write verifier failure, which
causes a filesystem shutdown.

Resolve this problem by updating the quota flags and calling
xfs_sb_to_disk directly, which does not touch the percpu counters.
While we're at it, we can elide the entire update if the selected qflags
aren't set.
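
The torn read is easy to model in a standalone userspace sketch (the
counter values and update ordering below are illustrative only, not
taken from the trace above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t icount = 0x80, ifree = 0x51;

	/* A writer allocates an inode chunk and bumps ifree first... */
	ifree += 64;

	/* ...the superblock logger samples both counters right here... */
	printf("sampled icount=0x%llx ifree=0x%llx -> %s\n",
	       (unsigned long long)icount, (unsigned long long)ifree,
	       ifree > icount ? "write verifier failure" : "ok");

	/* ...and only afterwards does the writer update icount. */
	icount += 64;
	return 0;
}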

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/repair.c |   41 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 34 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 2de438ddb8ac..539c3544b11a 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -678,6 +678,39 @@ xrep_find_ag_btree_roots(
 }
 
 #ifdef CONFIG_XFS_QUOTA
+/* Update some quota flags in the superblock. */
+static void
+xrep_update_qflags(
+	struct xfs_scrub	*sc,
+	unsigned int		clear_flags)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*bp;
+
+	mutex_lock(&mp->m_quotainfo->qi_quotaofflock);
+	if ((mp->m_qflags & clear_flags) == 0)
+		goto no_update;
+
+	mp->m_qflags &= ~clear_flags;
+	spin_lock(&mp->m_sb_lock);
+	mp->m_sb.sb_qflags &= ~clear_flags;
+	spin_unlock(&mp->m_sb_lock);
+
+	/*
+	 * Update the quota flags in the ondisk superblock without touching
+	 * the summary counters.  We have not quiesced inode chunk allocation,
+	 * so we cannot coordinate with updates to the icount and ifree percpu
+	 * counters.
+	 */
+	bp = xfs_trans_getsb(sc->tp);
+	xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
+	xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF);
+	xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1);
+
+no_update:
+	mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
+}
+
 /* Force a quotacheck the next time we mount. */
 void
 xrep_force_quotacheck(
@@ -690,13 +723,7 @@ xrep_force_quotacheck(
 	if (!(flag & sc->mp->m_qflags))
 		return;
 
-	mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock);
-	sc->mp->m_qflags &= ~flag;
-	spin_lock(&sc->mp->m_sb_lock);
-	sc->mp->m_sb.sb_qflags &= ~flag;
-	spin_unlock(&sc->mp->m_sb_lock);
-	xfs_log_sb(sc->tp);
-	mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
+	xrep_update_qflags(sc, flag);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] xfs: repair dquots based on live quotacheck results
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of quota counters Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 3/5] xfs: track quota updates during live quotacheck Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use the shadow quota counters that live quotacheck creates to reset the
incore dquot counters.
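
The per-dquot adjustment amounts to moving each counter to the observed
value while carrying the same delta into the reservation, roughly like
this userspace sketch (the names are simplified stand-ins, not the
kernel structures used in the patch):

#include <stdint.h>
#include <stdio.h>

struct counter {
	int64_t	count;			/* current usage */
	int64_t	reserved;		/* usage plus outstanding reservations */
};

/* Move a counter to the observed value, preserving the reservation delta. */
static int commit_counter(struct counter *c, int64_t observed)
{
	int64_t delta = observed - c->count;

	if (!delta)
		return 0;		/* counter already correct */
	c->count += delta;
	c->reserved += delta;
	return 1;			/* dquot must be logged */
}

int main(void)
{
	struct counter blk = { .count = 100, .reserved = 108 };

	if (commit_counter(&blk, 96))
		printf("bcount now %lld, reserved %lld (needs logging)\n",
		       (long long)blk.count, (long long)blk.reserved);
	return 0;
}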

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                  |    1 
 fs/xfs/scrub/quotacheck.c        |    9 +
 fs/xfs/scrub/quotacheck.h        |    3 
 fs/xfs/scrub/quotacheck_repair.c |  254 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c            |   13 +-
 fs/xfs/scrub/repair.h            |    5 +
 fs/xfs/scrub/scrub.c             |    2 
 fs/xfs/scrub/trace.h             |    1 
 8 files changed, 281 insertions(+), 7 deletions(-)
 create mode 100644 fs/xfs/scrub/quotacheck_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 62398f7332a1..30e555eac02e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -199,6 +199,7 @@ xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
 
 xfs-$(CONFIG_XFS_QUOTA)		+= $(addprefix scrub/, \
 				   quota_repair.o \
+				   quotacheck_repair.o \
 				   )
 endif
 endif
diff --git a/fs/xfs/scrub/quotacheck.c b/fs/xfs/scrub/quotacheck.c
index b9162556d00e..10d0fd717aea 100644
--- a/fs/xfs/scrub/quotacheck.c
+++ b/fs/xfs/scrub/quotacheck.c
@@ -101,7 +101,9 @@ xchk_setup_quotacheck(
  * set the INCOMPLETE flag even when a negative errno is returned.  This care
  * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
  * ECANCELED) that are absorbed into a scrub state flag update by
- * xchk_*_process_error.
+ * xchk_*_process_error.  Scrub and repair share the same incore data
+ * structures, so the INCOMPLETE flag is critical to prevent a repair based on
+ * insufficient information.
  *
  * Because we are scanning a live filesystem, it's possible that another thread
  * will try to update the quota counters for an inode that we've already
@@ -405,11 +407,14 @@ xqcheck_collect_inode(
 
 	/* Figure out the data / rt device block counts. */
 	xfs_ilock(ip, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED);
-	ilock_flags = xfs_ilock_data_map_shared(ip);
 	if (XFS_IS_REALTIME_INODE(ip)) {
+		ilock_flags = xfs_ilock_data_map_shared(ip);
 		error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
 		if (error)
 			goto out_abort;
+	} else {
+		ilock_flags = XFS_ILOCK_SHARED;
+		xfs_ilock(ip, ilock_flags);
 	}
 	xfs_inode_count_blocks(tp, ip, &nblks, &rtblks);
 
diff --git a/fs/xfs/scrub/quotacheck.h b/fs/xfs/scrub/quotacheck.h
index 6dc55686ad39..ccb3a9ffb168 100644
--- a/fs/xfs/scrub/quotacheck.h
+++ b/fs/xfs/scrub/quotacheck.h
@@ -30,6 +30,9 @@ struct xqcheck_dquot {
 /* Already checked this dquot. */
 #define XQCHECK_DQUOT_COMPARE_SCANNED	(1U << 1)
 
+/* Already repaired this dquot. */
+#define XQCHECK_DQUOT_REPAIR_SCANNED	(1U << 2)
+
 /* Live quotacheck control structure. */
 struct xqcheck {
 	struct xfs_scrub	*sc;
diff --git a/fs/xfs/scrub/quotacheck_repair.c b/fs/xfs/scrub/quotacheck_repair.c
new file mode 100644
index 000000000000..c7e5941366b8
--- /dev/null
+++ b/fs/xfs/scrub/quotacheck_repair.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_icache.h"
+#include "xfs_bmap_util.h"
+#include "xfs_iwalk.h"
+#include "xfs_ialloc.h"
+#include "xfs_sb.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/repair.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/iscan.h"
+#include "scrub/quotacheck.h"
+#include "scrub/trace.h"
+
+/*
+ * Live Quotacheck Repair
+ * ======================
+ *
+ * Use the live quota counter information that we collected to replace the
+ * counter values in the incore dquots.  A scrub->repair cycle should have left
+ * the live data and hooks active, so this is safe so long as we make sure the
+ * dquot is locked.
+ */
+
+/* Commit new counters to a dquot. */
+static int
+xqcheck_commit_dquot(
+	struct xfs_dquot	*dqp,
+	xfs_dqtype_t		dqtype,
+	void			*priv)
+{
+	struct xqcheck_dquot	xcdq;
+	struct xqcheck		*xqc = priv;
+	struct xfarray		*counts = xqcheck_counters_for(xqc, dqtype);
+	int64_t			delta;
+	bool			dirty = false;
+	int			error = 0;
+
+	/* Unlock the dquot just long enough to allocate a transaction. */
+	xfs_dqunlock(dqp);
+	error = xchk_trans_alloc(xqc->sc, 0);
+	xfs_dqlock(dqp);
+	if (error)
+		return error;
+
+	xfs_trans_dqjoin(xqc->sc->tp, dqp);
+
+	if (xchk_iscan_aborted(&xqc->iscan)) {
+		error = -ECANCELED;
+		goto out_cancel;
+	}
+
+	mutex_lock(&xqc->lock);
+	error = xfarray_load_sparse(counts, dqp->q_id, &xcdq);
+	if (error)
+		goto out_unlock;
+
+	/* Adjust counters as needed. */
+	delta = (int64_t)xcdq.icount - dqp->q_ino.count;
+	if (delta) {
+		dqp->q_ino.reserved += delta;
+		dqp->q_ino.count += delta;
+		dirty = true;
+	}
+
+	delta = (int64_t)xcdq.bcount - dqp->q_blk.count;
+	if (delta) {
+		dqp->q_blk.reserved += delta;
+		dqp->q_blk.count += delta;
+		dirty = true;
+	}
+
+	delta = (int64_t)xcdq.rtbcount - dqp->q_rtb.count;
+	if (delta) {
+		dqp->q_rtb.reserved += delta;
+		dqp->q_rtb.count += delta;
+		dirty = true;
+	}
+
+	xcdq.flags |= (XQCHECK_DQUOT_REPAIR_SCANNED | XQCHECK_DQUOT_WRITTEN);
+	error = xfarray_store(counts, dqp->q_id, &xcdq);
+	if (error == -EFBIG) {
+		/*
+		 * EFBIG means we tried to store data at too high a byte offset
+		 * in the sparse array.  IOWs, we cannot complete the repair
+		 * and must cancel the whole operation.  This should never
+		 * happen, but we need to catch it anyway.
+		 */
+		error = -ECANCELED;
+	}
+	mutex_unlock(&xqc->lock);
+	if (error || !dirty)
+		goto out_cancel;
+
+	trace_xrep_quotacheck_dquot(xqc->sc->mp, dqp->q_type, dqp->q_id);
+
+	/* Commit the dirty dquot to disk. */
+	dqp->q_flags |= XFS_DQFLAG_DIRTY;
+	if (dqp->q_id)
+		xfs_qm_adjust_dqtimers(dqp);
+	xfs_trans_log_dquot(xqc->sc->tp, dqp);
+
+	/*
+	 * Transaction commit unlocks the dquot, so we must re-lock it so that
+	 * the caller can put the reference (which apparently requires a locked
+	 * dquot).
+	 */
+	error = xrep_trans_commit(xqc->sc);
+	xfs_dqlock(dqp);
+	return error;
+
+out_unlock:
+	mutex_unlock(&xqc->lock);
+out_cancel:
+	xchk_trans_cancel(xqc->sc);
+
+	/* Re-lock the dquot so the caller can put the reference. */
+	xfs_dqlock(dqp);
+	return error;
+}
+
+/* Commit new quota counters for a particular quota type. */
+STATIC int
+xqcheck_commit_dqtype(
+	struct xqcheck		*xqc,
+	unsigned int		dqtype)
+{
+	struct xqcheck_dquot	xcdq;
+	struct xfs_scrub	*sc = xqc->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfarray		*counts = xqcheck_counters_for(xqc, dqtype);
+	struct xfs_dquot	*dqp;
+	xfarray_idx_t		cur = XFARRAY_CURSOR_INIT;
+	int			error;
+
+	/*
+	 * Update the counters of every dquot that the quota file knows about.
+	 */
+	error = xfs_qm_dqiterate(mp, dqtype, xqcheck_commit_dquot, xqc);
+	if (error)
+		return error;
+
+	/*
+	 * Make a second pass to deal with the dquots that we know about but
+	 * the quota file previously did not know about.
+	 */
+	mutex_lock(&xqc->lock);
+	while ((error = xfarray_iter(counts, &cur, &xcdq)) == 1) {
+		xfs_dqid_t	id = cur - 1;
+
+		if (xcdq.flags & XQCHECK_DQUOT_REPAIR_SCANNED)
+			continue;
+
+		mutex_unlock(&xqc->lock);
+
+		/*
+		 * Grab the dquot, allowing for dquot block allocation in a
+		 * separate transaction.  We committed the scrub transaction
+		 * in a previous step, so we will not be creating nested
+		 * transactions here.
+		 */
+		error = xfs_qm_dqget(mp, id, dqtype, true, &dqp);
+		if (error)
+			return error;
+
+		error = xqcheck_commit_dquot(dqp, dqtype, xqc);
+		xfs_qm_dqput(dqp);
+		if (error)
+			return error;
+
+		mutex_lock(&xqc->lock);
+	}
+	mutex_unlock(&xqc->lock);
+
+	return error;
+}
+
+/* Figure out quota CHKD flags for the running quota types. */
+static inline unsigned int
+xqcheck_chkd_flags(
+	struct xfs_mount	*mp)
+{
+	unsigned int		ret = 0;
+
+	if (XFS_IS_UQUOTA_ON(mp))
+		ret |= XFS_UQUOTA_CHKD;
+	if (XFS_IS_GQUOTA_ON(mp))
+		ret |= XFS_GQUOTA_CHKD;
+	if (XFS_IS_PQUOTA_ON(mp))
+		ret |= XFS_PQUOTA_CHKD;
+	return ret;
+}
+
+/* Commit the new dquot counters. */
+int
+xrep_quotacheck(
+	struct xfs_scrub	*sc)
+{
+	struct xqcheck		*xqc = sc->buf;
+	unsigned int		qflags = xqcheck_chkd_flags(sc->mp);
+	int			error;
+
+	/*
+	 * Clear the CHKD flag for the running quota types and commit the scrub
+	 * transaction so that we can allocate new quota block mappings if we
+	 * have to.  If we crash after this point, the sb still has the CHKD
+	 * flags cleared, so mount quotacheck will fix all of this up.
+	 */
+	xrep_update_qflags(sc, qflags, 0);
+	error = xrep_trans_commit(sc);
+	if (error)
+		return error;
+
+	/* Commit the new counters to the dquots. */
+	if (xqc->ucounts) {
+		error = xqcheck_commit_dqtype(xqc, XFS_DQTYPE_USER);
+		if (error)
+			return error;
+	}
+	if (xqc->gcounts) {
+		error = xqcheck_commit_dqtype(xqc, XFS_DQTYPE_GROUP);
+		if (error)
+			return error;
+	}
+	if (xqc->pcounts) {
+		error = xqcheck_commit_dqtype(xqc, XFS_DQTYPE_PROJ);
+		if (error)
+			return error;
+	}
+
+	/* Set the CHKD flags now that we've fixed quota counts. */
+	error = xchk_trans_alloc(sc, 0);
+	if (error)
+		return error;
+
+	xrep_update_qflags(sc, 0, qflags);
+	return 0;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 539c3544b11a..7f66c763580b 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -679,21 +679,26 @@ xrep_find_ag_btree_roots(
 
 #ifdef CONFIG_XFS_QUOTA
 /* Update some quota flags in the superblock. */
-static void
+void
 xrep_update_qflags(
 	struct xfs_scrub	*sc,
-	unsigned int		clear_flags)
+	unsigned int		clear_flags,
+	unsigned int		set_flags)
 {
 	struct xfs_mount	*mp = sc->mp;
 	struct xfs_buf		*bp;
 
 	mutex_lock(&mp->m_quotainfo->qi_quotaofflock);
-	if ((mp->m_qflags & clear_flags) == 0)
+	if ((mp->m_qflags & clear_flags) == 0 &&
+	    (mp->m_qflags & set_flags) == set_flags)
 		goto no_update;
 
 	mp->m_qflags &= ~clear_flags;
+	mp->m_qflags |= set_flags;
+
 	spin_lock(&mp->m_sb_lock);
 	mp->m_sb.sb_qflags &= ~clear_flags;
+	mp->m_sb.sb_qflags |= set_flags;
 	spin_unlock(&mp->m_sb_lock);
 
 	/*
@@ -723,7 +728,7 @@ xrep_force_quotacheck(
 	if (!(flag & sc->mp->m_qflags))
 		return;
 
-	xrep_update_qflags(sc, flag);
+	xrep_update_qflags(sc, flag, 0);
 }
 
 /*
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 16047fc42696..f8d356b17b1f 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -62,6 +62,8 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 		struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp);
 
 #ifdef CONFIG_XFS_QUOTA
+void xrep_update_qflags(struct xfs_scrub *sc, unsigned int clear_flags,
+		unsigned int set_flags);
 void xrep_force_quotacheck(struct xfs_scrub *sc, xfs_dqtype_t type);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
 #else
@@ -114,8 +116,10 @@ int xrep_rtbitmap(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_QUOTA
 int xrep_quota(struct xfs_scrub *sc);
+int xrep_quotacheck(struct xfs_scrub *sc);
 #else
 # define xrep_quota			xrep_notsupported
+# define xrep_quotacheck		xrep_notsupported
 #endif /* CONFIG_XFS_QUOTA */
 
 int xrep_reinit_pagf(struct xfs_scrub *sc);
@@ -186,6 +190,7 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 #define xrep_bmap_cow			xrep_notsupported
 #define xrep_rtbitmap			xrep_notsupported
 #define xrep_quota			xrep_notsupported
+#define xrep_quotacheck			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index db277b57e8a2..60c6665b6277 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -370,7 +370,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xchk_setup_quotacheck,
 		.scrub	= xchk_quotacheck,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quotacheck,
 	},
 };
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index f8ee323a3cd2..2a025eb356fc 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1762,6 +1762,7 @@ DEFINE_EVENT(xrep_dquot_class, name, \
 	TP_ARGS(mp, type, id))
 DEFINE_XREP_DQUOT_EVENT(xrep_dquot_item);
 DEFINE_XREP_DQUOT_EVENT(xrep_disk_dquot);
+DEFINE_XREP_DQUOT_EVENT(xrep_quotacheck_dquot);
 #endif /* CONFIG_XFS_QUOTA */
 
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */



* [PATCHSET v24.0 0/5] xfs: online repair of file link counts
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Now that we've created the infrastructure to perform live scans of every
file in the filesystem and the hook infrastructure needed to observe live
updates, use it to scan directories, compute the correct link counts for
all files in the filesystem, and reset those link counts.

This patchset creates a tailored readdir implementation for scrub
because the regular version has to cycle ILOCKs to copy information to
userspace.  We can't cycle the ILOCK during the nlink scan, and we don't
need all the other VFS support code (maintaining a readdir cursor and
translating XFS structures to VFS structures and back), so it was easier
to duplicate the code.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-nlinks

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=scrub-nlinks
---
 fs/xfs/Makefile               |    5 
 fs/xfs/libxfs/xfs_da_format.h |   11 
 fs/xfs/libxfs/xfs_dir2.c      |    6 
 fs/xfs/libxfs/xfs_dir2.h      |    1 
 fs/xfs/libxfs/xfs_fs.h        |    4 
 fs/xfs/libxfs/xfs_health.h    |    4 
 fs/xfs/scrub/common.c         |    3 
 fs/xfs/scrub/common.h         |    1 
 fs/xfs/scrub/dir.c            |  173 ++------
 fs/xfs/scrub/health.c         |    1 
 fs/xfs/scrub/nlinks.c         |  929 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/nlinks.h         |   99 ++++
 fs/xfs/scrub/nlinks_repair.c  |  226 ++++++++++
 fs/xfs/scrub/parent.c         |   90 +---
 fs/xfs/scrub/readdir.c        |  375 +++++++++++++++++
 fs/xfs/scrub/readdir.h        |   19 +
 fs/xfs/scrub/repair.h         |    2 
 fs/xfs/scrub/scrub.c          |    9 
 fs/xfs/scrub/scrub.h          |    5 
 fs/xfs/scrub/trace.c          |    2 
 fs/xfs/scrub/trace.h          |  193 ++++++++-
 fs/xfs/xfs_health.c           |    1 
 fs/xfs/xfs_inode.c            |  210 +++++++++
 fs/xfs/xfs_inode.h            |   35 ++
 fs/xfs/xfs_mount.h            |    2 
 fs/xfs/xfs_super.c            |    2 
 fs/xfs/xfs_symlink.c          |    1 
 27 files changed, 2219 insertions(+), 190 deletions(-)
 create mode 100644 fs/xfs/scrub/nlinks.c
 create mode 100644 fs/xfs/scrub/nlinks.h
 create mode 100644 fs/xfs/scrub/nlinks_repair.c
 create mode 100644 fs/xfs/scrub/readdir.c
 create mode 100644 fs/xfs/scrub/readdir.h



* [PATCH 1/5] xfs: report health of inode link counts
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Report on the health of the inode link counts.
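
For context, userspace can observe the new bit through the health fields
of the fs geometry ioctl (the fs_map table in xfs_health.c below is what
fills them in).  A rough, hypothetical userspace sketch, assuming xfsprogs
headers that already carry this patch's XFS_FSOP_GEOM_SICK_NLINKS
definition; error handling is minimal and report_nlinks_health is not a
real helper:

/* Hypothetical sketch: report whether inode link counts were flagged sick. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* struct xfs_fsop_geom, XFS_IOC_FSGEOMETRY */

int
report_nlinks_health(const char *mountpoint)
{
	struct xfs_fsop_geom	geo = { 0 };
	int			fd, ret;

	fd = open(mountpoint, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, XFS_IOC_FSGEOMETRY, &geo);
	if (ret == 0 && (geo.sick & XFS_FSOP_GEOM_SICK_NLINKS))
		printf("%s: inode link counts are unhealthy\n", mountpoint);

	close(fd);
	return ret;
}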

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h     |    1 +
 fs/xfs/libxfs/xfs_health.h |    4 +++-
 fs/xfs/xfs_health.c        |    1 +
 3 files changed, 5 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 6612c89944d0..2f9f13ba75b8 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -196,6 +196,7 @@ struct xfs_fsop_geom {
 #define XFS_FSOP_GEOM_SICK_RT_BITMAP	(1 << 4)  /* realtime bitmap */
 #define XFS_FSOP_GEOM_SICK_RT_SUMMARY	(1 << 5)  /* realtime summary */
 #define XFS_FSOP_GEOM_SICK_QUOTACHECK	(1 << 6)  /* quota counts */
+#define XFS_FSOP_GEOM_SICK_NLINKS	(1 << 7)  /* inode link counts */
 
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 1dea286bb157..5571f6cb2539 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -42,6 +42,7 @@ struct xfs_fsop_geom;
 #define XFS_SICK_FS_GQUOTA	(1 << 2)  /* group quota */
 #define XFS_SICK_FS_PQUOTA	(1 << 3)  /* project quota */
 #define XFS_SICK_FS_QUOTACHECK	(1 << 4)  /* quota counts */
+#define XFS_SICK_FS_NLINKS	(1 << 5)  /* inode link counts */
 
 /* Observable health issues for realtime volume metadata. */
 #define XFS_SICK_RT_BITMAP	(1 << 0)  /* realtime bitmap */
@@ -74,7 +75,8 @@ struct xfs_fsop_geom;
 				 XFS_SICK_FS_UQUOTA | \
 				 XFS_SICK_FS_GQUOTA | \
 				 XFS_SICK_FS_PQUOTA | \
-				 XFS_SICK_FS_QUOTACHECK)
+				 XFS_SICK_FS_QUOTACHECK | \
+				 XFS_SICK_FS_NLINKS)
 
 #define XFS_SICK_RT_PRIMARY	(XFS_SICK_RT_BITMAP | \
 				 XFS_SICK_RT_SUMMARY)
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index cd32f0fc0643..9cf933a8f532 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -281,6 +281,7 @@ static const struct ioctl_sick_map fs_map[] = {
 	{ XFS_SICK_FS_GQUOTA,	XFS_FSOP_GEOM_SICK_GQUOTA },
 	{ XFS_SICK_FS_PQUOTA,	XFS_FSOP_GEOM_SICK_PQUOTA },
 	{ XFS_SICK_FS_QUOTACHECK, XFS_FSOP_GEOM_SICK_QUOTACHECK },
+	{ XFS_SICK_FS_NLINKS,	XFS_FSOP_GEOM_SICK_NLINKS },
 	{ 0, 0 },
 };
 



* [PATCH 2/5] xfs: streamline the directory iteration code for scrub
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, online scrub reuses the xfs_readdir code to walk every entry
in a directory.  This isn't awesome for performance, since we end up
cycling the directory ILOCK needlessly and coding around the particular
quirks of the VFS dir_context interface.

Create a streamlined version of readdir that keeps the ILOCK (since the
walk function isn't going to copy stuff to userspace), skips a whole lot
of directory walk cursor checks (since we start at 0 and walk to the
end) and has a sane way to return error codes.
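
As a rough usage sketch (the xchk_example_* names are hypothetical; the
xchk_dir_walk() and xchk_dirent_fn interface is the one added by
readdir.h below), a caller holds the directory ILOCK and supplies a
callback that is invoked once per entry and may return a negative errno
to stop the walk early:

/* Hypothetical callback matching the new xchk_dirent_fn signature. */
STATIC int
xchk_example_dirent(
	struct xfs_scrub	*sc,
	struct xfs_inode	*dp,
	xfs_dir2_dataptr_t	dapos,
	const struct xfs_name	*name,
	xfs_ino_t		ino,
	void			*priv)
{
	unsigned long long	*entries = priv;

	(*entries)++;		/* count every entry, including . and .. */
	return 0;		/* a negative errno aborts the whole walk */
}

/* Hypothetical caller; dp must be ILOCKed and sc->tp set up beforehand. */
STATIC int
xchk_example_count_dirents(
	struct xfs_scrub	*sc,
	struct xfs_inode	*dp,
	unsigned long long	*entries)
{
	*entries = 0;
	return xchk_dir_walk(sc, dp, xchk_example_dirent, entries);
}

Returning -ECANCELED from the callback is how the real scrubbers stop the
walk once they've flagged corruption, as xchk_dir_actor below does.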

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/scrub/dir.c     |  173 +++++++---------------
 fs/xfs/scrub/parent.c  |   90 +++---------
 fs/xfs/scrub/readdir.c |  375 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/readdir.h |   19 ++
 5 files changed, 473 insertions(+), 185 deletions(-)
 create mode 100644 fs/xfs/scrub/readdir.c
 create mode 100644 fs/xfs/scrub/readdir.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 30e555eac02e..a762ee3cc454 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -159,6 +159,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   ialloc.o \
 				   inode.o \
 				   parent.o \
+				   readdir.o \
 				   refcount.o \
 				   rmap.o \
 				   scrub.o \
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 5b3a9edc8932..5b47d3cc8f78 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -18,6 +18,7 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/dabtree.h"
+#include "scrub/readdir.h"
 
 /* Set us up to scrub directories. */
 int
@@ -31,115 +32,88 @@ xchk_setup_directory(
 
 /* Scrub a directory entry. */
 
-struct xchk_dir_ctx {
-	/* VFS fill-directory iterator */
-	struct dir_context	dir_iter;
-
-	struct xfs_scrub	*sc;
-};
-
-/* Check that an inode's mode matches a given DT_ type. */
+/* Check that an inode's mode matches a given XFS_DIR3_FT_* type. */
 STATIC void
 xchk_dir_check_ftype(
-	struct xchk_dir_ctx	*sdc,
+	struct xfs_scrub	*sc,
 	xfs_fileoff_t		offset,
 	struct xfs_inode	*ip,
-	int			dtype)
+	int			ftype)
 {
-	struct xfs_mount	*mp = sdc->sc->mp;
-	int			ino_dtype;
+	struct xfs_mount	*mp = sc->mp;
 
 	if (!xfs_has_ftype(mp)) {
-		if (dtype != DT_UNKNOWN && dtype != DT_DIR)
-			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
-					offset);
+		if (ftype != XFS_DIR3_FT_UNKNOWN && ftype != XFS_DIR3_FT_DIR)
+			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
 		return;
 	}
 
-	/* Convert mode to the DT_* values that dir_emit uses. */
-	ino_dtype = xfs_dir3_get_dtype(mp,
-			xfs_mode_to_ftype(VFS_I(ip)->i_mode));
-	if (ino_dtype != dtype)
-		xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
+	if (xfs_mode_to_ftype(VFS_I(ip)->i_mode) != ftype)
+		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
 }
 
 /*
  * Scrub a single directory entry.
  *
- * We use the VFS directory iterator (i.e. readdir) to call this
- * function for every directory entry in a directory.  Once we're here,
- * we check the inode number to make sure it's sane, then we check that
- * we can look up this filename.  Finally, we check the ftype.
+ * Check the inode number to make sure it's sane, then we check that we can
+ * look up this filename.  Finally, we check the ftype.
  */
-STATIC bool
+STATIC int
 xchk_dir_actor(
-	struct dir_context	*dir_iter,
-	const char		*name,
-	int			namelen,
-	loff_t			pos,
-	u64			ino,
-	unsigned		type)
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	xfs_dir2_dataptr_t	dapos,
+	const struct xfs_name	*name,
+	xfs_ino_t		ino,
+	void			*priv)
 {
-	struct xfs_mount	*mp;
-	struct xfs_inode	*dp;
+	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_inode	*ip;
-	struct xchk_dir_ctx	*sdc;
-	struct xfs_name		xname;
 	xfs_ino_t		lookup_ino;
 	xfs_dablk_t		offset;
 	int			error = 0;
 
-	sdc = container_of(dir_iter, struct xchk_dir_ctx, dir_iter);
-	dp = sdc->sc->ip;
-	mp = dp->i_mount;
 	offset = xfs_dir2_db_to_da(mp->m_dir_geo,
-			xfs_dir2_dataptr_to_db(mp->m_dir_geo, pos));
+			xfs_dir2_dataptr_to_db(mp->m_dir_geo, dapos));
 
-	if (xchk_should_terminate(sdc->sc, &error))
-		return !error;
+	if (xchk_should_terminate(sc, &error))
+		return error;
 
 	/* Does this inode number make sense? */
 	if (!xfs_verify_dir_ino(mp, ino)) {
-		xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-		goto out;
+		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+		return -ECANCELED;
 	}
 
 	/* Does this name make sense? */
-	if (!xfs_dir2_namecheck(name, namelen)) {
-		xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-		goto out;
+	if (!xfs_dir2_namecheck(name->name, name->len)) {
+		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+		return -ECANCELED;
 	}
 
-	if (!strncmp(".", name, namelen)) {
+	if (!strncmp(".", name->name, name->len)) {
 		/* If this is "." then check that the inum matches the dir. */
 		if (ino != dp->i_ino)
-			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
-					offset);
-	} else if (!strncmp("..", name, namelen)) {
+			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+	} else if (!strncmp("..", name->name, name->len)) {
 		/*
 		 * If this is ".." in the root inode, check that the inum
 		 * matches this dir.
 		 */
 		if (dp->i_ino == mp->m_sb.sb_rootino && ino != dp->i_ino)
-			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
-					offset);
+			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
 	}
 
 	/* Verify that we can look up this name by hash. */
-	xname.name = name;
-	xname.len = namelen;
-	xname.type = XFS_DIR3_FT_UNKNOWN;
-
-	error = xfs_dir_lookup(sdc->sc->tp, dp, &xname, &lookup_ino, NULL);
+	error = xchk_dir_lookup(sc, dp, name, &lookup_ino);
 	/* ENOENT means the hash lookup failed and the dir is corrupt */
 	if (error == -ENOENT)
 		error = -EFSCORRUPTED;
-	if (!xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, offset,
-			&error))
+	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, offset, &error))
 		goto out;
 	if (lookup_ino != ino) {
-		xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-		goto out;
+		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+		return -ECANCELED;
 	}
 
 	/*
@@ -151,27 +125,21 @@ xchk_dir_actor(
 	 * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a
 	 *  cross referencing error.  Any other error is an operational error.
 	 */
-	error = xchk_iget(sdc->sc, ino, &ip);
+	error = xchk_iget(sc, ino, &ip);
 	if (error == -EINVAL || error == -ENOENT) {
 		error = -EFSCORRUPTED;
-		xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, 0, &error);
+		xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error);
 		goto out;
 	}
-	if (!xchk_fblock_xref_process_error(sdc->sc, XFS_DATA_FORK, offset,
-			&error))
+	if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, offset, &error))
 		goto out;
 
-	xchk_dir_check_ftype(sdc, offset, ip, type);
-	xchk_irele(sdc->sc, ip);
+	xchk_dir_check_ftype(sc, offset, ip, name->type);
+	xchk_irele(sc, ip);
 out:
-	/*
-	 * A negative error code returned here is supposed to cause the
-	 * dir_emit caller (xfs_readdir) to abort the directory iteration
-	 * and return zero to xchk_directory.
-	 */
-	if (error == 0 && sdc->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		return false;
-	return !error;
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return -ECANCELED;
+	return error;
 }
 
 /* Scrub a directory btree record. */
@@ -782,14 +750,7 @@ int
 xchk_directory(
 	struct xfs_scrub	*sc)
 {
-	struct xchk_dir_ctx	sdc = {
-		.dir_iter.actor = xchk_dir_actor,
-		.dir_iter.pos = 0,
-		.sc = sc,
-	};
-	size_t			bufsize;
-	loff_t			oldpos;
-	int			error = 0;
+	int			error;
 
 	if (!S_ISDIR(VFS_I(sc->ip)->i_mode))
 		return -ENOENT;
@@ -797,7 +758,7 @@ xchk_directory(
 	/* Plausible size? */
 	if (sc->ip->i_disk_size < xfs_dir2_sf_hdr_size(0)) {
 		xchk_ino_set_corrupt(sc, sc->ip->i_ino);
-		goto out;
+		return 0;
 	}
 
 	/* Check directory tree structure */
@@ -806,7 +767,7 @@ xchk_directory(
 		return error;
 
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		return error;
+		return 0;
 
 	/* Check the freespace. */
 	error = xchk_directory_blocks(sc);
@@ -814,44 +775,12 @@ xchk_directory(
 		return error;
 
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		return error;
+		return 0;
 
-	/*
-	 * Check that every dirent we see can also be looked up by hash.
-	 * Userspace usually asks for a 32k buffer, so we will too.
-	 */
-	bufsize = (size_t)min_t(loff_t, XFS_READDIR_BUFSIZE,
-			sc->ip->i_disk_size);
-
-	/*
-	 * Look up every name in this directory by hash.
-	 *
-	 * Use the xfs_readdir function to call xchk_dir_actor on
-	 * every directory entry in this directory.  In _actor, we check
-	 * the name, inode number, and ftype (if applicable) of the
-	 * entry.  xfs_readdir uses the VFS filldir functions to provide
-	 * iteration context.
-	 *
-	 * The VFS grabs a read or write lock via i_rwsem before it reads
-	 * or writes to a directory.  If we've gotten this far we've
-	 * already obtained IOLOCK_EXCL, which (since 4.10) is the same as
-	 * getting a write lock on i_rwsem.  Therefore, it is safe for us
-	 * to drop the ILOCK here in order to reuse the _readdir and
-	 * _dir_lookup routines, which do their own ILOCK locking.
-	 */
-	oldpos = 0;
-	xchk_iunlock(sc, XFS_ILOCK_EXCL);
-	while (true) {
-		error = xfs_readdir(sc->tp, sc->ip, &sdc.dir_iter, bufsize);
-		if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0,
-				&error))
-			goto out;
-		if (oldpos == sdc.dir_iter.pos)
-			break;
-		oldpos = sdc.dir_iter.pos;
-	}
-
-out:
+	/* Look up every name in this directory by hash. */
+	error = xchk_dir_walk(sc, sc->ip, xchk_dir_actor, NULL);
+	if (error == -ECANCELED)
+		error = 0;
 	return error;
 }
 
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 371526f4369d..84740ffee0d2 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -16,6 +16,7 @@
 #include "xfs_dir2_priv.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
+#include "scrub/readdir.h"
 
 /* Set us up to scrub parents. */
 int
@@ -30,39 +31,37 @@ xchk_setup_parent(
 /* Look for an entry in a parent pointing to this inode. */
 
 struct xchk_parent_ctx {
-	struct dir_context	dc;
 	struct xfs_scrub	*sc;
 	xfs_ino_t		ino;
 	xfs_nlink_t		nlink;
-	bool			cancelled;
 };
 
 /* Look for a single entry in a directory pointing to an inode. */
-STATIC bool
+STATIC int
 xchk_parent_actor(
-	struct dir_context	*dc,
-	const char		*name,
-	int			namelen,
-	loff_t			pos,
-	u64			ino,
-	unsigned		type)
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	xfs_dir2_dataptr_t	dapos,
+	const struct xfs_name	*name,
+	xfs_ino_t		ino,
+	void			*priv)
 {
-	struct xchk_parent_ctx	*spc;
+	struct xchk_parent_ctx	*spc = priv;
 	int			error = 0;
 
-	spc = container_of(dc, struct xchk_parent_ctx, dc);
+	/* Does this name make sense? */
+	if (!xfs_dir2_namecheck(name->name, name->len))
+		error = -EFSCORRUPTED;
+	if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
+		return error;
+
 	if (spc->ino == ino)
 		spc->nlink++;
 
-	/*
-	 * If we're facing a fatal signal, bail out.  Store the cancellation
-	 * status separately because the VFS readdir code squashes error codes
-	 * into short directory reads.
-	 */
 	if (xchk_should_terminate(spc->sc, &error))
-		spc->cancelled = true;
+		return error;
 
-	return !error;
+	return 0;
 }
 
 /* Count the number of dentries in the parent dir that point to this inode. */
@@ -70,24 +69,11 @@ STATIC int
 xchk_parent_count_parent_dentries(
 	struct xfs_scrub	*sc,
 	struct xfs_inode	*parent,
-	xfs_nlink_t		*nlink)
+	struct xchk_parent_ctx	*spc)
 {
-	struct xchk_parent_ctx	spc = {
-		.dc.actor	= xchk_parent_actor,
-		.ino		= sc->ip->i_ino,
-		.sc		= sc,
-	};
-	size_t			bufsize;
-	loff_t			oldpos;
 	uint			lock_mode;
-	int			error = 0;
+	int			error;
 
-	/*
-	 * If there are any blocks, read-ahead block 0 as we're almost
-	 * certain to have the next operation be a read there.  This is
-	 * how we guarantee that the parent's extent map has been loaded,
-	 * if there is one.
-	 */
 	lock_mode = xfs_ilock_data_map_shared(parent);
 
 	/*
@@ -100,34 +86,8 @@ xchk_parent_count_parent_dentries(
 		return -EFSCORRUPTED;
 	}
 
-	if (parent->i_df.if_nextents > 0)
-		error = xfs_dir3_data_readahead(parent, 0, 0);
+	error = xchk_dir_walk(sc, parent, xchk_parent_actor, spc);
 	xfs_iunlock(parent, lock_mode);
-	if (error)
-		return error;
-
-	/*
-	 * Iterate the parent dir to confirm that there is
-	 * exactly one entry pointing back to the inode being
-	 * scanned.
-	 */
-	bufsize = (size_t)min_t(loff_t, XFS_READDIR_BUFSIZE,
-			parent->i_disk_size);
-	oldpos = 0;
-	while (true) {
-		error = xfs_readdir(sc->tp, parent, &spc.dc, bufsize);
-		if (error)
-			goto out;
-		if (spc.cancelled) {
-			error = -EAGAIN;
-			goto out;
-		}
-		if (oldpos == spc.dc.pos)
-			break;
-		oldpos = spc.dc.pos;
-	}
-	*nlink = spc.nlink;
-out:
 	return error;
 }
 
@@ -180,9 +140,13 @@ xchk_parent_validate(
 	struct xfs_scrub	*sc,
 	xfs_ino_t		parent_ino)
 {
+	struct xchk_parent_ctx	spc = {
+		.sc		= sc,
+		.ino		= sc->ip->i_ino,
+		.nlink		= 0,
+	};
 	struct xfs_inode	*dp = NULL;
 	xfs_nlink_t		expected_nlink;
-	xfs_nlink_t		nlink;
 	int			error = 0;
 
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
@@ -260,7 +224,7 @@ xchk_parent_validate(
 	}
 
 	/* Look for a directory entry in the parent pointing to the child. */
-	error = xchk_parent_count_parent_dentries(sc, dp, &nlink);
+	error = xchk_parent_count_parent_dentries(sc, dp, &spc);
 	if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
 		goto out_unlock;
 
@@ -268,7 +232,7 @@ xchk_parent_validate(
 	 * Ensure that the parent has as many links to the child as the child
 	 * thinks it has to the parent.
 	 */
-	if (nlink != expected_nlink)
+	if (spc.nlink != expected_nlink)
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
 
 out_unlock:
diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c
new file mode 100644
index 000000000000..fd888fe5151f
--- /dev/null
+++ b/fs/xfs/scrub/readdir.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_trace.h"
+#include "xfs_bmap.h"
+#include "xfs_trans.h"
+#include "xfs_error.h"
+#include "scrub/scrub.h"
+#include "scrub/readdir.h"
+
+/* Call a function for every entry in a shortform directory. */
+STATIC int
+xchk_dir_walk_sf(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	xchk_dirent_fn		dirent_fn,
+	void			*priv)
+{
+	struct xfs_name		name = {
+		.name		= ".",
+		.len		= 1,
+		.type		= XFS_DIR3_FT_DIR,
+	};
+	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_da_geometry	*geo = mp->m_dir_geo;
+	struct xfs_dir2_sf_entry *sfep;
+	struct xfs_dir2_sf_hdr	*sfp;
+	xfs_ino_t		ino;
+	xfs_dir2_dataptr_t	dapos;
+	unsigned int		i;
+	int			error;
+
+	ASSERT(dp->i_df.if_bytes == dp->i_disk_size);
+	ASSERT(dp->i_df.if_u1.if_data != NULL);
+
+	sfp = (struct xfs_dir2_sf_hdr *)dp->i_df.if_u1.if_data;
+
+	/* dot entry */
+	dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk,
+			geo->data_entry_offset);
+
+	error = dirent_fn(sc, dp, dapos, &name, dp->i_ino, priv);
+	if (error)
+		return error;
+
+	/* dotdot entry */
+	dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk,
+			geo->data_entry_offset +
+			xfs_dir2_data_entsize(mp, sizeof(".") - 1));
+	ino = xfs_dir2_sf_get_parent_ino(sfp);
+	name.name = "..";
+	name.len = 2;
+
+	error = dirent_fn(sc, dp, dapos, &name, ino, priv);
+	if (error)
+		return error;
+
+	/* iterate everything else */
+	sfep = xfs_dir2_sf_firstentry(sfp);
+	for (i = 0; i < sfp->count; i++) {
+		dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk,
+				xfs_dir2_sf_get_offset(sfep));
+		ino = xfs_dir2_sf_get_ino(mp, sfp, sfep);
+		name.name = sfep->name;
+		name.len = sfep->namelen;
+		name.type = xfs_dir2_sf_get_ftype(mp, sfep);
+
+		error = dirent_fn(sc, dp, dapos, &name, ino, priv);
+		if (error)
+			return error;
+
+		sfep = xfs_dir2_sf_nextentry(mp, sfp, sfep);
+	}
+
+	return 0;
+}
+
+/* Call a function for every entry in a block directory. */
+STATIC int
+xchk_dir_walk_block(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	xchk_dirent_fn		dirent_fn,
+	void			*priv)
+{
+	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_da_geometry	*geo = mp->m_dir_geo;
+	struct xfs_buf		*bp;
+	unsigned int		off, next_off, end;
+	int			error;
+
+	error = xfs_dir3_block_read(sc->tp, dp, &bp);
+	if (error)
+		return error;
+
+	/* Walk each directory entry. */
+	end = xfs_dir3_data_end_offset(geo, bp->b_addr);
+	for (off = geo->data_entry_offset; off < end; off = next_off) {
+		struct xfs_name			name = { };
+		struct xfs_dir2_data_unused	*dup = bp->b_addr + off;
+		struct xfs_dir2_data_entry	*dep = bp->b_addr + off;
+		xfs_ino_t			ino;
+		xfs_dir2_dataptr_t		dapos;
+
+		/* Skip an empty entry. */
+		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
+			next_off = off + be16_to_cpu(dup->length);
+			continue;
+		}
+
+		/* Otherwise, find the next entry and report it. */
+		next_off = off + xfs_dir2_data_entsize(mp, dep->namelen);
+		if (next_off > end)
+			break;
+
+		dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, off);
+		ino = be64_to_cpu(dep->inumber);
+		name.name = dep->name;
+		name.len = dep->namelen;
+		name.type = xfs_dir2_data_get_ftype(mp, dep);
+
+		error = dirent_fn(sc, dp, dapos, &name, ino, priv);
+		if (error)
+			break;
+	}
+
+	xfs_trans_brelse(sc->tp, bp);
+	return error;
+}
+
+/* Read a leaf-format directory buffer. */
+STATIC int
+xchk_read_leaf_dir_buf(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*dp,
+	struct xfs_da_geometry	*geo,
+	xfs_dir2_off_t		*curoff,
+	struct xfs_buf		**bpp)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	map;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(dp, XFS_DATA_FORK);
+	xfs_dablk_t		last_da;
+	xfs_dablk_t		map_off;
+	xfs_dir2_off_t		new_off;
+
+	*bpp = NULL;
+
+	/*
+	 * Look for mapped directory blocks at or above the current offset.
+	 * Truncate down to the nearest directory block to start the scanning
+	 * operation.
+	 */
+	last_da = xfs_dir2_byte_to_da(geo, XFS_DIR2_LEAF_OFFSET);
+	map_off = xfs_dir2_db_to_da(geo, xfs_dir2_byte_to_db(geo, *curoff));
+
+	if (!xfs_iext_lookup_extent(dp, ifp, map_off, &icur, &map))
+		return 0;
+	if (map.br_startoff >= last_da)
+		return 0;
+	xfs_trim_extent(&map, map_off, last_da - map_off);
+
+	/* Read the directory block of that first mapping. */
+	new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
+	if (new_off > *curoff)
+		*curoff = new_off;
+
+	return xfs_dir3_data_read(tp, dp, map.br_startoff, 0, bpp);
+}
+
+/* Call a function for every entry in a leaf directory. */
+STATIC int
+xchk_dir_walk_leaf(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	xchk_dirent_fn		dirent_fn,
+	void			*priv)
+{
+	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_da_geometry	*geo = mp->m_dir_geo;
+	struct xfs_buf		*bp = NULL;
+	xfs_dir2_off_t		curoff = 0;
+	unsigned int		offset = 0;
+	int			error;
+
+	/* Iterate every directory offset in this directory. */
+	while (curoff < XFS_DIR2_LEAF_OFFSET) {
+		struct xfs_name			name = { };
+		struct xfs_dir2_data_unused	*dup;
+		struct xfs_dir2_data_entry	*dep;
+		xfs_ino_t			ino;
+		unsigned int			length;
+		xfs_dir2_dataptr_t		dapos;
+
+		/*
+		 * If we have no buffer, or we're off the end of the
+		 * current buffer, need to get another one.
+		 */
+		if (!bp || offset >= geo->blksize) {
+			if (bp) {
+				xfs_trans_brelse(sc->tp, bp);
+				bp = NULL;
+			}
+
+			error = xchk_read_leaf_dir_buf(sc->tp, dp, geo, &curoff,
+					&bp);
+			if (error || !bp)
+				break;
+
+			/*
+			 * Find our position in the block.
+			 */
+			offset = geo->data_entry_offset;
+			curoff += geo->data_entry_offset;
+		}
+
+		/* Skip an empty entry. */
+		dup = bp->b_addr + offset;
+		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
+			length = be16_to_cpu(dup->length);
+			offset += length;
+			curoff += length;
+			continue;
+		}
+
+		/* Otherwise, find the next entry and report it. */
+		dep = bp->b_addr + offset;
+		length = xfs_dir2_data_entsize(mp, dep->namelen);
+
+		dapos = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, offset);
+		ino = be64_to_cpu(dep->inumber);
+		name.name = dep->name;
+		name.len = dep->namelen;
+		name.type = xfs_dir2_data_get_ftype(mp, dep);
+
+		error = dirent_fn(sc, dp, dapos, &name, ino, priv);
+		if (error)
+			break;
+
+		/* Advance to the next entry. */
+		offset += length;
+		curoff += length;
+	}
+
+	if (bp)
+		xfs_trans_brelse(sc->tp, bp);
+	return error;
+}
+
+/*
+ * Call a function for every entry in a directory.
+ *
+ * Callers must hold the ILOCK.  File types are XFS_DIR3_FT_*.
+ */
+int
+xchk_dir_walk(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	xchk_dirent_fn		dirent_fn,
+	void			*priv)
+{
+	struct xfs_da_args	args = {
+		.dp		= dp,
+		.geo		= dp->i_mount->m_dir_geo,
+		.trans		= sc->tp,
+	};
+	bool			isblock;
+	int			error;
+
+	if (xfs_is_shutdown(dp->i_mount))
+		return -EIO;
+
+	ASSERT(S_ISDIR(VFS_I(dp)->i_mode));
+	ASSERT(xfs_isilocked(dp, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
+
+	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
+		return xchk_dir_walk_sf(sc, dp, dirent_fn, priv);
+
+	/* dir2 functions require that the data fork is loaded */
+	error = xfs_iread_extents(sc->tp, dp, XFS_DATA_FORK);
+	if (error)
+		return error;
+
+	error = xfs_dir2_isblock(&args, &isblock);
+	if (error)
+		return error;
+
+	if (isblock)
+		return xchk_dir_walk_block(sc, dp, dirent_fn, priv);
+
+	return xchk_dir_walk_leaf(sc, dp, dirent_fn, priv);
+}
+
+/*
+ * Look up the inode number for an exact name in a directory.
+ *
+ * Callers must hold the ILOCK.  File types are XFS_DIR3_FT_*.  Names are not
+ * checked for correctness.
+ */
+int
+xchk_dir_lookup(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	const struct xfs_name	*name,
+	xfs_ino_t		*ino)
+{
+	struct xfs_da_args	args = {
+		.dp		= dp,
+		.geo		= dp->i_mount->m_dir_geo,
+		.trans		= sc->tp,
+		.name		= name->name,
+		.namelen	= name->len,
+		.filetype	= name->type,
+		.hashval	= xfs_dir2_hashname(dp->i_mount, name),
+		.whichfork	= XFS_DATA_FORK,
+		.op_flags	= XFS_DA_OP_OKNOENT,
+	};
+	bool			isblock, isleaf;
+	int			error;
+
+	if (xfs_is_shutdown(dp->i_mount))
+		return -EIO;
+
+	ASSERT(S_ISDIR(VFS_I(dp)->i_mode));
+	ASSERT(xfs_isilocked(dp, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
+
+	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		error = xfs_dir2_sf_lookup(&args);
+		goto out_check_rval;
+	}
+
+	/* dir2 functions require that the data fork is loaded */
+	error = xfs_iread_extents(sc->tp, dp, XFS_DATA_FORK);
+	if (error)
+		return error;
+
+	error = xfs_dir2_isblock(&args, &isblock);
+	if (error)
+		return error;
+
+	if (isblock) {
+		error = xfs_dir2_block_lookup(&args);
+		goto out_check_rval;
+	}
+
+	error = xfs_dir2_isleaf(&args, &isleaf);
+	if (error)
+		return error;
+
+	if (isleaf) {
+		error = xfs_dir2_leaf_lookup(&args);
+		goto out_check_rval;
+	}
+
+	error = xfs_dir2_node_lookup(&args);
+
+out_check_rval:
+	if (error == -EEXIST)
+		error = 0;
+	if (!error)
+		*ino = args.inumber;
+	return error;
+}
diff --git a/fs/xfs/scrub/readdir.h b/fs/xfs/scrub/readdir.h
new file mode 100644
index 000000000000..7272f3bd28b4
--- /dev/null
+++ b/fs/xfs/scrub/readdir.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_READDIR_H__
+#define __XFS_SCRUB_READDIR_H__
+
+typedef int (*xchk_dirent_fn)(struct xfs_scrub *sc, struct xfs_inode *dp,
+		xfs_dir2_dataptr_t dapos, const struct xfs_name *name,
+		xfs_ino_t ino, void *priv);
+
+int xchk_dir_walk(struct xfs_scrub *sc, struct xfs_inode *dp,
+		xchk_dirent_fn dirent_fn, void *priv);
+
+int xchk_dir_lookup(struct xfs_scrub *sc, struct xfs_inode *dp,
+		const struct xfs_name *name, xfs_ino_t *ino);
+
+#endif /* __XFS_SCRUB_READDIR_H__ */



* [PATCH 3/5] xfs: teach scrub to check file nlinks
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the necessary scrub code to walk the filesystem's directory tree
so that we can compute file link counts.  Similar to quotacheck, we
create an incore shadow array of link count information and then we walk
the filesystem a second time to compare the link counts.  We need live
updates to keep the information up to date during the lengthy scan, so
this scrubber remains disabled until the next patch.
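
At a high level the scrubber runs in two passes over a shared shadow
xfarray of per-inode counters.  A condensed, hypothetical outline (the
two helpers it calls are real functions defined in nlinks.c below;
xchk_nlinks_outline itself is not):

/* Hypothetical outline of the two passes added by this patch. */
STATIC int
xchk_nlinks_outline(
	struct xchk_nlink_ctrs	*xnc)
{
	int			error;

	/*
	 * Pass 1: walk every directory, bumping the shadow parents/backrefs/
	 * children counters for each entry seen (updates happen under
	 * xnc->lock so the live hooks can share the same array later).
	 */
	error = xchk_nlinks_collect(xnc);
	if (error)
		return error;

	/* Pass 2: walk every inode and compare i_nlink with the shadow data. */
	return xchk_nlinks_compare(xnc);
}

Inodes that can't be grabbed during the compare pass are caught by a
second sweep over the leftover shadow records, as described in the
"Part 2" comment below.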

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile               |    3 
 fs/xfs/libxfs/xfs_da_format.h |   11 +
 fs/xfs/libxfs/xfs_dir2.c      |    6 
 fs/xfs/libxfs/xfs_dir2.h      |    1 
 fs/xfs/libxfs/xfs_fs.h        |    3 
 fs/xfs/scrub/common.h         |    1 
 fs/xfs/scrub/health.c         |    1 
 fs/xfs/scrub/nlinks.c         |  811 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/nlinks.h         |   90 +++++
 fs/xfs/scrub/scrub.c          |    6 
 fs/xfs/scrub/scrub.h          |    1 
 fs/xfs/scrub/trace.c          |    2 
 fs/xfs/scrub/trace.h          |  147 +++++++
 13 files changed, 1080 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/scrub/nlinks.c
 create mode 100644 fs/xfs/scrub/nlinks.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a762ee3cc454..ea9eda20a11d 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -158,6 +158,8 @@ xfs-y				+= $(addprefix scrub/, \
 				   health.o \
 				   ialloc.o \
 				   inode.o \
+				   iscan.o \
+				   nlinks.o \
 				   parent.o \
 				   readdir.o \
 				   refcount.o \
@@ -174,7 +176,6 @@ xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
 				   )
 
 xfs-$(CONFIG_XFS_QUOTA)		+= $(addprefix scrub/, \
-				   iscan.o \
 				   quota.o \
 				   quotacheck.o \
 				   )
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 25e2841084e1..9d332415e0b6 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -159,6 +159,17 @@ struct xfs_da3_intnode {
 
 #define XFS_DIR3_FT_MAX			9
 
+#define XFS_DIR3_FTYPE_STR \
+	{ XFS_DIR3_FT_UNKNOWN,	"unknown" }, \
+	{ XFS_DIR3_FT_REG_FILE,	"file" }, \
+	{ XFS_DIR3_FT_DIR,	"directory" }, \
+	{ XFS_DIR3_FT_CHRDEV,	"char" }, \
+	{ XFS_DIR3_FT_BLKDEV,	"block" }, \
+	{ XFS_DIR3_FT_FIFO,	"fifo" }, \
+	{ XFS_DIR3_FT_SOCK,	"sock" }, \
+	{ XFS_DIR3_FT_SYMLINK,	"symlink" }, \
+	{ XFS_DIR3_FT_WHT,	"whiteout" }
+
 /*
  * Byte offset in data block and shortform entry.
  */
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 92bac3373f1f..ee30044af39d 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -25,6 +25,12 @@ const struct xfs_name xfs_name_dotdot = {
 	.type	= XFS_DIR3_FT_DIR,
 };
 
+const struct xfs_name xfs_name_dot = {
+	.name	= (unsigned char *)".",
+	.len	= 1,
+	.type	= XFS_DIR3_FT_DIR,
+};
+
 /*
  * Convert inode mode to directory entry filetype
  */
diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index dd39f17dd9a9..15a36cf7ae87 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -22,6 +22,7 @@ struct xfs_dir3_icfree_hdr;
 struct xfs_dir3_icleaf_hdr;
 
 extern const struct xfs_name	xfs_name_dotdot;
+extern const struct xfs_name	xfs_name_dot;
 
 /*
  * Convert inode mode to directory entry filetype
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 2f9f13ba75b8..3885c56078f5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -710,9 +710,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_PQUOTA	23	/* project quotas */
 #define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
 #define XFS_SCRUB_TYPE_QUOTACHECK 25	/* quota counters */
+#define XFS_SCRUB_TYPE_NLINKS	26	/* inode link counts */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	26
+#define XFS_SCRUB_TYPE_NR	27
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1u << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 2c33814e0b69..45318bd5678d 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -129,6 +129,7 @@ xchk_setup_quotacheck(struct xfs_scrub *sc)
 }
 #endif
 int xchk_setup_fscounters(struct xfs_scrub *sc);
+int xchk_setup_nlinks(struct xfs_scrub *sc);
 
 void xchk_ag_free(struct xfs_scrub *sc, struct xchk_ag *sa);
 int xchk_ag_init(struct xfs_scrub *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index 5e28fa1ab6aa..6749930b50a2 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -108,6 +108,7 @@ static const struct xchk_health_map type_to_health_flag[XFS_SCRUB_TYPE_NR] = {
 	[XFS_SCRUB_TYPE_PQUOTA]		= { XHG_FS,  XFS_SICK_FS_PQUOTA },
 	[XFS_SCRUB_TYPE_FSCOUNTERS]	= { XHG_FS,  XFS_SICK_FS_COUNTERS },
 	[XFS_SCRUB_TYPE_QUOTACHECK]	= { XHG_FS,  XFS_SICK_FS_QUOTACHECK },
+	[XFS_SCRUB_TYPE_NLINKS]		= { XHG_FS,  XFS_SICK_FS_NLINKS },
 };
 
 /* Return the health status mask for this scrub type. */
diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c
new file mode 100644
index 000000000000..f97c46bdd06c
--- /dev/null
+++ b/fs/xfs/scrub/nlinks.c
@@ -0,0 +1,811 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_iwalk.h"
+#include "xfs_ialloc.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_ag.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/repair.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/iscan.h"
+#include "scrub/nlinks.h"
+#include "scrub/trace.h"
+#include "scrub/readdir.h"
+
+/*
+ * Live Inode Link Count Checking
+ * ==============================
+ *
+ * Inode link counts are "summary" metadata, in the sense that they are
+ * computed as the number of directory entries referencing each file on the
+ * filesystem.  Therefore, we compute the correct link counts by creating a
+ * shadow link count structure and walking every inode.
+ */
+
+/* Set us up to scrub inode link counts. */
+int
+xchk_setup_nlinks(
+	struct xfs_scrub	*sc)
+{
+	/* Not ready for general consumption yet. */
+	return -EOPNOTSUPP;
+
+	sc->buf = kzalloc(sizeof(struct xchk_nlink_ctrs), XCHK_GFP_FLAGS);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	return xchk_setup_fs(sc);
+}
+
+/*
+ * Part 1: Collecting file link counts.  For each file, we create a shadow link
+ * counting structure, then walk the entire directory tree, incrementing parent
+ * and child link counts for each directory entry seen.
+ *
+ * To avoid false corruption reports in part 2, any failure in this part must
+ * set the INCOMPLETE flag even when a negative errno is returned.  This care
+ * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
+ * ECANCELED) that are absorbed into a scrub state flag update by
+ * xchk_*_process_error.
+ */
+
+/* Update incore link count information.  Caller must hold the nlinks lock. */
+STATIC int
+xchk_nlinks_update_incore(
+	struct xchk_nlink_ctrs	*xnc,
+	xfs_ino_t		ino,
+	int			parents_delta,
+	int			backrefs_delta,
+	int			children_delta)
+{
+	struct xchk_nlink	nl;
+	int			error;
+
+	if (!xnc->nlinks)
+		return 0;
+
+	error = xfarray_load_sparse(xnc->nlinks, ino, &nl);
+	if (error)
+		return error;
+
+	trace_xchk_nlinks_update_incore(xnc->sc->mp, ino, &nl, parents_delta,
+			backrefs_delta, children_delta);
+
+	nl.flags |= XCHK_NLINK_WRITTEN;
+	nl.parents += parents_delta;
+	nl.backrefs += backrefs_delta;
+	nl.children += children_delta;
+
+	error = xfarray_store(xnc->nlinks, ino, &nl);
+	if (error == -EFBIG) {
+		/*
+		 * EFBIG means we tried to store data at too high a byte offset
+		 * in the sparse array.  IOWs, we cannot complete the check and
+		 * must notify userspace that the check was incomplete.
+		 */
+		error = -ECANCELED;
+	}
+	return error;
+}
+
+/* Bump the observed link count for the inode referenced by this entry. */
+STATIC int
+xchk_nlinks_collect_dirent(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp,
+	xfs_dir2_dataptr_t	dapos,
+	const struct xfs_name	*name,
+	xfs_ino_t		ino,
+	void			*priv)
+{
+	struct xchk_nlink_ctrs	*xnc = priv;
+	bool			dot = false, dotdot = false;
+	int			error;
+
+	/* Does this name make sense? */
+	if (name->len == 0 || !xfs_dir2_namecheck(name->name, name->len)) {
+		error = -ECANCELED;
+		goto out_abort;
+	}
+
+	if (name->len == 1 && name->name[0] == '.')
+		dot = true;
+	else if (name->len == 2 && name->name[0] == '.' &&
+				   name->name[1] == '.')
+		dotdot = true;
+
+	/* Don't accept a '.' entry that points somewhere else. */
+	if (dot && ino != dp->i_ino) {
+		error = -ECANCELED;
+		goto out_abort;
+	}
+
+	/* Don't accept an invalid inode number. */
+	if (!xfs_verify_ino(sc->mp, ino)) {
+		error = -ECANCELED;
+		goto out_abort;
+	}
+
+	/* Update the shadow link counts if we haven't already failed. */
+
+	if (xchk_iscan_aborted(&xnc->collect_iscan)) {
+		error = -ECANCELED;
+		goto out_incomplete;
+	}
+
+	trace_xchk_nlinks_collect_dirent(sc->mp, dp, ino, name);
+
+	mutex_lock(&xnc->lock);
+
+	/*
+	 * If this is a dotdot entry, it is a back link from dp to ino.  How
+	 * we handle this depends on whether or not dp is the root directory.
+	 *
+	 * The root directory is its own parent, so we pretend the dotdot entry
+	 * establishes the "parent" of the root directory.  Increment the
+	 * number of parents of the root directory.
+	 *
+	 * Otherwise, increment the number of backrefs pointing back to ino.
+	 */
+	if (dotdot) {
+		if (dp == sc->mp->m_rootip)
+			error = xchk_nlinks_update_incore(xnc, ino, 1, 0, 0);
+		else
+			error = xchk_nlinks_update_incore(xnc, ino, 0, 1, 0);
+		if (error)
+			goto out_unlock;
+	}
+
+	/*
+	 * If this dirent is a forward link from dp to ino, increment the
+	 * number of parents linking into ino.
+	 */
+	if (!dot && !dotdot) {
+		error = xchk_nlinks_update_incore(xnc, ino, 1, 0, 0);
+		if (error)
+			goto out_unlock;
+	}
+
+	/*
+	 * If this dirent is a forward link to a subdirectory or the dot entry,
+	 * increment the number of child links of dp.
+	 */
+	if (!dotdot && name->type == XFS_DIR3_FT_DIR) {
+		error = xchk_nlinks_update_incore(xnc, dp->i_ino, 0, 0,
+				1);
+		if (error)
+			goto out_unlock;
+	}
+
+	mutex_unlock(&xnc->lock);
+	return 0;
+
+out_unlock:
+	mutex_unlock(&xnc->lock);
+out_abort:
+	xchk_iscan_abort(&xnc->collect_iscan);
+out_incomplete:
+	xchk_set_incomplete(sc);
+	return error;
+}
+
+/* Walk a directory to bump the observed link counts of the children. */
+STATIC int
+xchk_nlinks_collect_dir(
+	struct xchk_nlink_ctrs	*xnc,
+	struct xfs_inode	*dp)
+{
+	struct xfs_scrub	*sc = xnc->sc;
+	unsigned int		lock_mode;
+	int			error = 0;
+
+	/* Prevent anyone from changing this directory while we walk it. */
+	xfs_ilock(dp, XFS_IOLOCK_SHARED);
+	lock_mode = xfs_ilock_data_map_shared(dp);
+
+	/*
+	 * The dotdot entry of an unlinked directory still points to the last
+	 * parent, but the parent no longer links to this directory.  Skip the
+	 * directory to avoid overcounting.
+	 */
+	if (VFS_I(dp)->i_nlink == 0)
+		goto out_unlock;
+
+	/*
+	 * We cannot count file links if the directory looks as though it has
+	 * been zapped by the inode record repair code.
+	 */
+	if (xchk_dir_looks_zapped(dp))
+		goto out_abort;
+
+	error = xchk_dir_walk(sc, dp, xchk_nlinks_collect_dirent, xnc);
+	if (error == -ECANCELED) {
+		error = 0;
+		goto out_unlock;
+	}
+	if (error)
+		goto out_abort;
+
+	xchk_iscan_mark_visited(&xnc->collect_iscan, dp);
+	goto out_unlock;
+
+out_abort:
+	xchk_set_incomplete(sc);
+	xchk_iscan_abort(&xnc->collect_iscan);
+out_unlock:
+	xfs_iunlock(dp, lock_mode);
+	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
+	return error;
+}
+
+/* If this looks like a valid pointer, count it. */
+static inline int
+xchk_nlinks_collect_metafile(
+	struct xchk_nlink_ctrs	*xnc,
+	xfs_ino_t		ino)
+{
+	if (!xfs_verify_ino(xnc->sc->mp, ino))
+		return 0;
+
+	trace_xchk_nlinks_collect_metafile(xnc->sc->mp, ino);
+	return xchk_nlinks_update_incore(xnc, ino, 1, 0, 0);
+}
+
+/* Bump the link counts of metadata files rooted in the superblock. */
+STATIC int
+xchk_nlinks_collect_metafiles(
+	struct xchk_nlink_ctrs	*xnc)
+{
+	struct xfs_mount	*mp = xnc->sc->mp;
+	int			error = -ECANCELED;
+
+
+	if (xchk_iscan_aborted(&xnc->collect_iscan))
+		goto out_incomplete;
+
+	mutex_lock(&xnc->lock);
+	error = xchk_nlinks_collect_metafile(xnc, mp->m_sb.sb_rbmino);
+	if (error)
+		goto out_abort;
+
+	error = xchk_nlinks_collect_metafile(xnc, mp->m_sb.sb_rsumino);
+	if (error)
+		goto out_abort;
+
+	error = xchk_nlinks_collect_metafile(xnc, mp->m_sb.sb_uquotino);
+	if (error)
+		goto out_abort;
+
+	error = xchk_nlinks_collect_metafile(xnc, mp->m_sb.sb_gquotino);
+	if (error)
+		goto out_abort;
+
+	error = xchk_nlinks_collect_metafile(xnc, mp->m_sb.sb_pquotino);
+	if (error)
+		goto out_abort;
+	mutex_unlock(&xnc->lock);
+
+	return 0;
+
+out_abort:
+	mutex_unlock(&xnc->lock);
+	xchk_iscan_abort(&xnc->collect_iscan);
+out_incomplete:
+	xchk_set_incomplete(xnc->sc);
+	return error;
+}
+
+/* Advance the collection scan cursor for this non-directory file. */
+static inline int
+xchk_nlinks_collect_file(
+	struct xchk_nlink_ctrs	*xnc,
+	struct xfs_inode	*ip)
+{
+	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	xchk_iscan_mark_visited(&xnc->collect_iscan, ip);
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+	return 0;
+}
+
+/* Walk all directories and count inode links. */
+STATIC int
+xchk_nlinks_collect(
+	struct xchk_nlink_ctrs	*xnc)
+{
+	struct xfs_scrub	*sc = xnc->sc;
+	struct xfs_inode	*ip;
+	int			error;
+
+	/* Count the rt and quota files that are rooted in the superblock. */
+	error = xchk_nlinks_collect_metafiles(xnc);
+	if (error)
+		return error;
+
+	/*
+	 * Set up for a potentially lengthy filesystem scan by reducing our
+	 * transaction resource usage for the duration.  Specifically:
+	 *
+	 * Cancel the transaction to release the log grant space while we scan
+	 * the filesystem.
+	 *
+	 * Create a new empty transaction to eliminate the possibility of the
+	 * inode scan deadlocking on cyclical metadata.
+	 *
+	 * We pass the empty transaction to the file scanning function to avoid
+	 * repeatedly cycling empty transactions.  This can be done even though
+	 * we take the IOLOCK to quiesce the file because empty transactions
+	 * do not take sb_internal.
+	 */
+	xchk_trans_cancel(sc);
+	error = xchk_trans_alloc_empty(sc);
+	if (error)
+		return error;
+
+	while ((error = xchk_iscan_iter(sc, &xnc->collect_iscan, &ip)) == 1) {
+		if (S_ISDIR(VFS_I(ip)->i_mode))
+			error = xchk_nlinks_collect_dir(xnc, ip);
+		else
+			error = xchk_nlinks_collect_file(xnc, ip);
+		xchk_irele(sc, ip);
+		if (error)
+			break;
+
+		if (xchk_should_terminate(sc, &error))
+			break;
+	}
+	if (error) {
+		xchk_set_incomplete(sc);
+		/*
+		 * If we couldn't grab an inode that was busy with a state
+		 * change, change the error code so that we exit to userspace
+		 * as quickly as possible.
+		 */
+		if (error == -EBUSY)
+			return -ECANCELED;
+		return error;
+	}
+
+	/*
+	 * Switch out for a real transaction in preparation for building a new
+	 * tree.
+	 */
+	xchk_trans_cancel(sc);
+	return xchk_setup_fs(sc);
+}
+
+/*
+ * Part 2: Comparing file link counters.  Walk each inode and compare the link
+ * counts against our shadow information; and then walk each shadow link count
+ * structure (that wasn't covered in the first part), comparing it against the
+ * file.
+ */
+
+/* Read the observed link count for comparison with the actual inode. */
+STATIC int
+xchk_nlinks_comparison_read(
+	struct xchk_nlink_ctrs	*xnc,
+	xfs_ino_t		ino,
+	struct xchk_nlink	*obs)
+{
+	struct xchk_nlink	nl;
+	int			error;
+
+	error = xfarray_load_sparse(xnc->nlinks, ino, &nl);
+	if (error)
+		return error;
+
+	nl.flags |= (XCHK_NLINK_COMPARE_SCANNED | XCHK_NLINK_WRITTEN);
+
+	error = xfarray_store(xnc->nlinks, ino, &nl);
+	if (error == -EFBIG) {
+		/*
+		 * EFBIG means we tried to store data at too high a byte offset
+		 * in the sparse array.  IOWs, we cannot complete the check and
+		 * must notify userspace that the check was incomplete.  This
+		 * shouldn't really happen outside of the collection phase.
+		 */
+		xchk_set_incomplete(xnc->sc);
+		return -ECANCELED;
+	}
+	if (error)
+		return error;
+
+	/* Copy the counters, but do not expose the internal state. */
+	obs->parents = nl.parents;
+	obs->backrefs = nl.backrefs;
+	obs->children = nl.children;
+	obs->flags = 0;
+	return 0;
+}
+
+/* Check our link count against an inode. */
+STATIC int
+xchk_nlinks_compare_inode(
+	struct xchk_nlink_ctrs	*xnc,
+	struct xfs_inode	*ip)
+{
+	struct xchk_nlink	obs;
+	struct xfs_scrub	*sc = xnc->sc;
+	uint64_t		total_links;
+	unsigned int		actual_nlink;
+	int			error;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	mutex_lock(&xnc->lock);
+
+	if (xchk_iscan_aborted(&xnc->collect_iscan)) {
+		xchk_set_incomplete(xnc->sc);
+		error = -ECANCELED;
+		goto out_scanlock;
+	}
+
+	error = xchk_nlinks_comparison_read(xnc, ip->i_ino, &obs);
+	if (error)
+		goto out_scanlock;
+
+	/*
+	 * If we don't have ftype to get an accurate count of the subdirectory
+	 * entries in this directory, take advantage of the fact that on a
+	 * consistent ftype=0 filesystem, the number of subdirectory
+	 * backreferences (dotdot entries) pointing towards this directory
+	 * should be one less than the number of subdirectory entries in the
+	 * directory.
+	 */
+	if (!xfs_has_ftype(sc->mp) && S_ISDIR(VFS_I(ip)->i_mode))
+		obs.children = obs.backrefs + 1;
+
+	total_links = xchk_nlink_total(&obs);
+	actual_nlink = VFS_I(ip)->i_nlink;
+
+	trace_xchk_nlinks_compare_inode(sc->mp, ip, &obs);
+
+	/* We found more than the maximum possible link count. */
+	if (total_links > U32_MAX)
+		xchk_ino_set_corrupt(sc, ip->i_ino);
+
+	/* Link counts should match. */
+	if (total_links != actual_nlink)
+		xchk_ino_set_corrupt(sc, ip->i_ino);
+
+	if (S_ISDIR(VFS_I(ip)->i_mode) && actual_nlink > 0) {
+		/*
+		 * The collection phase ignores directories with zero link
+		 * count, so we ignore them here too.
+		 *
+		 * Linked directories must have at least one child (dot entry).
+		 */
+		if (obs.children < 1)
+			xchk_ino_set_corrupt(sc, ip->i_ino);
+		/*
+		 * The number of subdirectory backreferences (dotdot entries)
+		 * pointing towards this directory should be one less than the
+		 * number of subdirectory entries in the directory.
+		 */
+		if (obs.children != obs.backrefs + 1)
+			xchk_ino_xref_set_corrupt(sc, ip->i_ino);
+	} else {
+		/*
+		 * Non-directories and unlinked directories should not have
+		 * back references.
+		 */
+		if (obs.backrefs != 0)
+			xchk_ino_set_corrupt(sc, ip->i_ino);
+
+		/*
+		 * Non-directories and unlinked directories should not have
+		 * children.
+		 */
+		if (obs.children != 0)
+			xchk_ino_set_corrupt(sc, ip->i_ino);
+	}
+
+	if (ip == sc->mp->m_rootip) {
+		/*
+		 * For the root of a directory tree, both the '.' and '..'
+		 * entries should point to the root directory.  The dot entry
+		 * is counted as a child subdirectory (like any directory).
+		 * The dotdot entry is counted as a parent of the root /and/
+		 * a backref of the root directory.
+		 */
+		if (obs.parents != 1)
+			xchk_ino_set_corrupt(sc, ip->i_ino);
+		if (obs.children < 1)
+			xchk_ino_set_corrupt(sc, ip->i_ino);
+	} else if (actual_nlink > 0) {
+		/*
+		 * Linked files that are not the root directory should have at
+		 * least one parent.
+		 */
+		if (obs.parents == 0)
+			xchk_ino_set_corrupt(sc, ip->i_ino);
+	}
+
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		error = -ECANCELED;
+
+out_scanlock:
+	mutex_unlock(&xnc->lock);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	return error;
+}
+
+/*
+ * Check our link count against an inode that wasn't checked previously.  This
+ * is intended to catch directories with dangling links, though we could be
+ * racing with inode allocation in other threads.
+ */
+STATIC int
+xchk_nlinks_compare_inum(
+	struct xchk_nlink_ctrs	*xnc,
+	xfs_ino_t		ino)
+{
+	struct xchk_nlink	obs;
+	struct xfs_mount	*mp = xnc->sc->mp;
+	struct xfs_trans	*tp = xnc->sc->tp;
+	struct xfs_buf		*agi_bp;
+	struct xfs_inode	*ip;
+	int			error;
+
+	/*
+	 * The first iget failed, so try again with the variant that returns
+	 * either an incore inode or the AGI buffer.  If the function returns
+	 * EINVAL/ENOENT, it should have passed us the AGI buffer so that we
+	 * can guarantee that the inode won't be allocated while we check for
+	 * a zero link count in the observed link count data.
+	 */
+	error = xchk_iget_agi(xnc->sc, ino, &agi_bp, &ip);
+	if (!error) {
+		/* Actually got an inode, so use the inode compare. */
+		error = xchk_nlinks_compare_inode(xnc, ip);
+		xchk_irele(xnc->sc, ip);
+		return error;
+	}
+	if (error == -ENOENT || error == -EINVAL) {
+		/* No inode was found.  Check for zero link count below. */
+		error = 0;
+	}
+	if (error)
+		goto out_agi;
+
+	/* Ensure that we have protected against inode allocation/freeing. */
+	if (agi_bp == NULL) {
+		ASSERT(agi_bp != NULL);
+		xchk_set_incomplete(xnc->sc);
+		return -ECANCELED;
+	}
+
+	if (xchk_iscan_aborted(&xnc->collect_iscan)) {
+		xchk_set_incomplete(xnc->sc);
+		error = -ECANCELED;
+		goto out_agi;
+	}
+
+	mutex_lock(&xnc->lock);
+	error = xchk_nlinks_comparison_read(xnc, ino, &obs);
+	if (error)
+		goto out_scanlock;
+
+	trace_xchk_nlinks_check_zero(mp, ino, &obs);
+
+	/*
+	 * If we can't grab the inode, the link count had better be zero.  We
+	 * still hold the AGI to prevent inode allocation/freeing.
+	 */
+	if (xchk_nlink_total(&obs) != 0) {
+		xchk_ino_set_corrupt(xnc->sc, ino);
+		error = -ECANCELED;
+	}
+
+out_scanlock:
+	mutex_unlock(&xnc->lock);
+out_agi:
+	if (agi_bp)
+		xfs_trans_brelse(tp, agi_bp);
+	return error;
+}
+
+/*
+ * Try to visit every inode in the filesystem to compare the link count.  Move
+ * on if we can't grab an inode, since we'll revisit unchecked nlink records in
+ * the second part.
+ */
+static int
+xchk_nlinks_compare_iter(
+	struct xchk_nlink_ctrs	*xnc,
+	struct xfs_inode	**ipp)
+{
+	int			error;
+
+	do {
+		error = xchk_iscan_iter(xnc->sc, &xnc->compare_iscan, ipp);
+	} while (error == -EBUSY);
+
+	return error;
+}
+
+/* Compare the link counts we observed against the live information. */
+STATIC int
+xchk_nlinks_compare(
+	struct xchk_nlink_ctrs	*xnc)
+{
+	struct xchk_nlink	nl;
+	struct xfs_scrub	*sc = xnc->sc;
+	struct xfs_inode	*ip;
+	xfarray_idx_t		cur = XFARRAY_CURSOR_INIT;
+	int			error;
+
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return 0;
+
+	/*
+	 * Create a new empty transaction so that we can advance the iscan
+	 * cursor without deadlocking if the inobt has a cycle, and so that we
+	 * can push on the inactivation workqueue.
+	 */
+	xchk_trans_cancel(sc);
+	error = xchk_trans_alloc_empty(sc);
+	if (error)
+		return error;
+
+	/*
+	 * Use the inobt to walk all allocated inodes to compare the link
+	 * counts.  Inodes skipped by _compare_iter will be tried again in the
+	 * next phase of the scan.
+	 */
+	xchk_iscan_start(&xnc->compare_iscan, 0, 0);
+	while ((error = xchk_nlinks_compare_iter(xnc, &ip)) == 1) {
+		error = xchk_nlinks_compare_inode(xnc, ip);
+		xchk_iscan_mark_visited(&xnc->compare_iscan, ip);
+		xchk_irele(sc, ip);
+		if (error)
+			break;
+
+		if (xchk_should_terminate(sc, &error))
+			break;
+	}
+	xchk_iscan_finish(&xnc->compare_iscan);
+	if (error)
+		return error;
+
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return 0;
+
+	/*
+	 * Walk all the non-null nlink observations that weren't checked in the
+	 * previous step.
+	 */
+	mutex_lock(&xnc->lock);
+	while ((error = xfarray_iter(xnc->nlinks, &cur, &nl)) == 1) {
+		xfs_ino_t	ino = cur - 1;
+
+		if (nl.flags & XCHK_NLINK_COMPARE_SCANNED)
+			continue;
+
+		mutex_unlock(&xnc->lock);
+
+		error = xchk_nlinks_compare_inum(xnc, ino);
+		if (error)
+			return error;
+
+		if (xchk_should_terminate(xnc->sc, &error))
+			return error;
+
+		mutex_lock(&xnc->lock);
+	}
+	mutex_unlock(&xnc->lock);
+
+	return error;
+}
+
+/* Tear down everything associated with a nlinks check. */
+static void
+xchk_nlinks_teardown_scan(
+	void			*priv)
+{
+	struct xchk_nlink_ctrs	*xnc = priv;
+
+	xfarray_destroy(xnc->nlinks);
+	xnc->nlinks = NULL;
+
+	xchk_iscan_finish(&xnc->collect_iscan);
+	mutex_destroy(&xnc->lock);
+	xnc->sc = NULL;
+}
+
+/*
+ * Scan all inodes in the entire filesystem to generate link count data.  If
+ * the scan is successful, the counts will be left alive for a repair.  If any
+ * error occurs, we'll tear everything down.
+ */
+STATIC int
+xchk_nlinks_setup_scan(
+	struct xfs_scrub	*sc,
+	struct xchk_nlink_ctrs	*xnc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	unsigned long long	max_inos;
+	xfs_agnumber_t		last_agno = mp->m_sb.sb_agcount - 1;
+	xfs_agino_t		first_agino, last_agino;
+	int			error;
+
+	ASSERT(xnc->sc == NULL);
+	xnc->sc = sc;
+
+	mutex_init(&xnc->lock);
+
+	/* Retry iget every tenth of a second for up to 30 seconds. */
+	xchk_iscan_start(&xnc->collect_iscan, 30000, 100);
+
+	/*
+	 * Set up enough space to store an nlink record for the highest
+	 * possible inode number in this system.
+	 */
+	xfs_agino_range(mp, last_agno, &first_agino, &last_agino);
+	max_inos = XFS_AGINO_TO_INO(mp, last_agno, last_agino) + 1;
+
+	error = xfarray_create(mp, "file link counts",
+			min(XFS_MAXINUMBER + 1, max_inos),
+			sizeof(struct xchk_nlink), &xnc->nlinks);
+	if (error)
+		goto out_teardown;
+
+	/* Use deferred cleanup to pass the inode link count data to repair. */
+	sc->buf_cleanup = xchk_nlinks_teardown_scan;
+	return 0;
+
+out_teardown:
+	xchk_nlinks_teardown_scan(xnc);
+	return error;
+}
+
+/* Scrub the link count of all inodes on the filesystem. */
+int
+xchk_nlinks(
+	struct xfs_scrub	*sc)
+{
+	struct xchk_nlink_ctrs	*xnc = sc->buf;
+	int			error = 0;
+
+	/* Set ourselves up to check link counts on the live filesystem. */
+	error = xchk_nlinks_setup_scan(sc, xnc);
+	if (error)
+		return error;
+
+	/* Walk all inodes, picking up link count information. */
+	error = xchk_nlinks_collect(xnc);
+	if (!xchk_xref_process_error(sc, 0, 0, &error))
+		return error;
+
+	/* Fail fast if we're not playing with a full dataset. */
+	if (xchk_iscan_aborted(&xnc->collect_iscan))
+		xchk_set_incomplete(sc);
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)
+		return 0;
+
+	/* Compare link counts. */
+	error = xchk_nlinks_compare(xnc);
+	if (!xchk_xref_process_error(sc, 0, 0, &error))
+		return error;
+
+	/* Check one last time for an incomplete dataset. */
+	if (xchk_iscan_aborted(&xnc->collect_iscan))
+		xchk_set_incomplete(sc);
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/nlinks.h b/fs/xfs/scrub/nlinks.h
new file mode 100644
index 000000000000..30fa7dd93029
--- /dev/null
+++ b/fs/xfs/scrub/nlinks.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_NLINKS_H__
+#define __XFS_SCRUB_NLINKS_H__
+
+/* Live link count control structure. */
+struct xchk_nlink_ctrs {
+	struct xfs_scrub	*sc;
+
+	/* Shadow link count data and its mutex. */
+	struct xfarray		*nlinks;
+	struct mutex		lock;
+
+	/*
+	 * The collection step uses a separate iscan context from the compare
+	 * step because the collection iscan coordinates live updates to the
+	 * observation data while this scanner is running.  The compare iscan
+	 * is secondary and can be reinitialized as needed.
+	 */
+	struct xchk_iscan	collect_iscan;
+	struct xchk_iscan	compare_iscan;
+};
+
+/*
+ * In-core link counts for a given inode in the filesystem.
+ *
+ * For an example tree with a root directory, a subdirectory d1, and files f1
+ * and f2 (f1 also linked as d1/f3), the entries are accounted as follows:
+ *
+ * Root directory:
+ *
+ * . points to self		(root.child)
+ * .. points to self		(root.parent)
+ * f1 points to a child file	(f1.parent)
+ * d1 points to a child dir	(d1.parent, root.child)
+ *
+ * Subdirectory d1:
+ *
+ * . points to self		(d1.child)
+ * .. points to root dir	(root.backref)
+ * f2 points to child file	(f2.parent)
+ * f3 points to root.f1		(f1.parent)
+ *
+ * root.nlink == 3 (root.dot, root.dotdot, root.d1)
+ * d1.nlink == 2 (root.d1, d1.dot)
+ * f1.nlink == 2 (root.f1, d1.f3)
+ * f2.nlink == 1 (d1.f2)
+ */
+struct xchk_nlink {
+	/* Count of forward links from parent directories to this file. */
+	xfs_nlink_t		parents;
+
+	/*
+	 * Count of back links to this parent directory from child
+	 * subdirectories.
+	 */
+	xfs_nlink_t		backrefs;
+
+	/*
+	 * Count of forward links from this directory to child subdirectories
+	 * plus the number of dot entries.  Should be zero for non-directories.
+	 */
+	xfs_nlink_t		children;
+
+	/* Record state flags */
+	unsigned int		flags;
+};
+
+/*
+ * This incore link count has been written at least once.  We never want to
+ * store an xchk_nlink that looks uninitialized.
+ */
+#define XCHK_NLINK_WRITTEN		(1U << 0)
+
+/* This data item was seen by the check-time compare function. */
+#define XCHK_NLINK_COMPARE_SCANNED	(1U << 1)
+
+/* Compute total link count, using large enough variables to detect overflow. */
+static inline uint64_t
+xchk_nlink_total(const struct xchk_nlink *live)
+{
+	uint64_t	ret = live->parents;
+
+	return ret + live->children;
+}
+
+#endif /* __XFS_SCRUB_NLINKS_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 60c6665b6277..8fdd38dbb9f4 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -372,6 +372,12 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.scrub	= xchk_quotacheck,
 		.repair	= xrep_quotacheck,
 	},
+	[XFS_SCRUB_TYPE_NLINKS] = {	/* inode link counts */
+		.type	= ST_FS,
+		.setup	= xchk_setup_nlinks,
+		.scrub	= xchk_nlinks,
+		.repair	= xrep_notsupported,
+	},
 };
 
 static int
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index de09b709992b..d39b2b95352a 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -179,6 +179,7 @@ xchk_quotacheck(struct xfs_scrub *sc)
 }
 #endif
 int xchk_fscounters(struct xfs_scrub *sc);
+int xchk_nlinks(struct xfs_scrub *sc);
 
 /* cross-referencing helpers */
 void xchk_xref_is_used_space(struct xfs_scrub *sc, xfs_agblock_t agbno,
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 6a9835d9779f..82edcc830fb8 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -14,10 +14,12 @@
 #include "xfs_btree.h"
 #include "xfs_ag.h"
 #include "xfs_quota_defs.h"
+#include "xfs_dir2.h"
 #include "scrub/scrub.h"
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
 #include "scrub/iscan.h"
+#include "scrub/nlinks.h"
 
 /* Figure out which block the btree cursor was pointing to. */
 static inline xfs_fsblock_t
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 2a025eb356fc..508698d356d2 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -22,6 +22,7 @@ struct xfile;
 struct xfarray;
 struct xfarray_sortinfo;
 struct xchk_iscan;
+struct xchk_nlink;
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -66,6 +67,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_GQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_QUOTACHECK);
+TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_NLINKS);
 
 #define XFS_SCRUB_TYPE_STRINGS \
 	{ XFS_SCRUB_TYPE_PROBE,		"probe" }, \
@@ -93,7 +95,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_QUOTACHECK);
 	{ XFS_SCRUB_TYPE_GQUOTA,	"grpquota" }, \
 	{ XFS_SCRUB_TYPE_PQUOTA,	"prjquota" }, \
 	{ XFS_SCRUB_TYPE_FSCOUNTERS,	"fscounters" }, \
-	{ XFS_SCRUB_TYPE_QUOTACHECK,	"quotacheck" }
+	{ XFS_SCRUB_TYPE_QUOTACHECK,	"quotacheck" }, \
+	{ XFS_SCRUB_TYPE_NLINKS,	"nlinks" }
 
 #define XFS_SCRUB_FLAG_STRINGS \
 	{ XFS_SCRUB_IFLAG_REPAIR,		"repair" }, \
@@ -1127,6 +1130,148 @@ TRACE_EVENT(xchk_iscan_iget_retry_wait,
 		  __entry->retry_delay)
 );
 
+TRACE_EVENT(xchk_nlinks_collect_dirent,
+	TP_PROTO(struct xfs_mount *mp, struct xfs_inode *dp,
+		 xfs_ino_t ino, const struct xfs_name *name),
+	TP_ARGS(mp, dp, ino, name),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, dir)
+		__field(xfs_ino_t, ino)
+		__field(unsigned int, namelen)
+		__dynamic_array(char, name, name->len)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->dir = dp->i_ino;
+		__entry->ino = ino;
+		__entry->namelen = name->len;
+		memcpy(__get_str(name), name->name, name->len);
+	),
+	TP_printk("dev %d:%d dir 0x%llx -> ino 0x%llx name '%.*s'",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dir,
+		  __entry->ino,
+		  __entry->namelen,
+		  __get_str(name))
+);
+
+TRACE_EVENT(xchk_nlinks_collect_metafile,
+	TP_PROTO(struct xfs_mount *mp, xfs_ino_t ino),
+	TP_ARGS(mp, ino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino = ino;
+	),
+	TP_printk("dev %d:%d ino 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino)
+);
+
+TRACE_EVENT(xchk_nlinks_check_zero,
+	TP_PROTO(struct xfs_mount *mp, xfs_ino_t ino,
+		 const struct xchk_nlink *live),
+	TP_ARGS(mp, ino, live),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_nlink_t, parents)
+		__field(xfs_nlink_t, backrefs)
+		__field(xfs_nlink_t, children)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino = ino;
+		__entry->parents = live->parents;
+		__entry->backrefs = live->backrefs;
+		__entry->children = live->children;
+	),
+	TP_printk("dev %d:%d ino 0x%llx parents %u backrefs %u children %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->parents,
+		  __entry->backrefs,
+		  __entry->children)
+);
+
+TRACE_EVENT(xchk_nlinks_update_incore,
+	TP_PROTO(struct xfs_mount *mp, xfs_ino_t ino,
+		 const struct xchk_nlink *live, int parents_delta,
+		 int backrefs_delta, int children_delta),
+	TP_ARGS(mp, ino, live, parents_delta, backrefs_delta, children_delta),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_nlink_t, parents)
+		__field(xfs_nlink_t, backrefs)
+		__field(xfs_nlink_t, children)
+		__field(int, parents_delta)
+		__field(int, backrefs_delta)
+		__field(int, children_delta)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino = ino;
+		__entry->parents = live->parents;
+		__entry->backrefs = live->backrefs;
+		__entry->children = live->children;
+		__entry->parents_delta = parents_delta;
+		__entry->backrefs_delta = backrefs_delta;
+		__entry->children_delta = children_delta;
+	),
+	TP_printk("dev %d:%d ino 0x%llx parents %d:%u backrefs %d:%u children %d:%u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->parents_delta,
+		  __entry->parents,
+		  __entry->backrefs_delta,
+		  __entry->backrefs,
+		  __entry->children_delta,
+		  __entry->children)
+);
+
+DECLARE_EVENT_CLASS(xchk_nlinks_diff_class,
+	TP_PROTO(struct xfs_mount *mp, struct xfs_inode *ip,
+		 const struct xchk_nlink *live),
+	TP_ARGS(mp, ip, live),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(uint8_t, ftype)
+		__field(xfs_nlink_t, nlink)
+		__field(xfs_nlink_t, parents)
+		__field(xfs_nlink_t, backrefs)
+		__field(xfs_nlink_t, children)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->ftype = xfs_mode_to_ftype(VFS_I(ip)->i_mode);
+		__entry->nlink = VFS_I(ip)->i_nlink;
+		__entry->parents = live->parents;
+		__entry->backrefs = live->backrefs;
+		__entry->children = live->children;
+	),
+	TP_printk("dev %d:%d ino 0x%llx ftype %s nlink %u parents %u backrefs %u children %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __print_symbolic(__entry->ftype, XFS_DIR3_FTYPE_STR),
+		  __entry->nlink,
+		  __entry->parents,
+		  __entry->backrefs,
+		  __entry->children)
+);
+#define DEFINE_SCRUB_NLINKS_DIFF_EVENT(name) \
+DEFINE_EVENT(xchk_nlinks_diff_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xfs_inode *ip, \
+		 const struct xchk_nlink *live), \
+	TP_ARGS(mp, ip, live))
+DEFINE_SCRUB_NLINKS_DIFF_EVENT(xchk_nlinks_compare_inode);
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 



* [PATCH 4/5] xfs: track file link count updates during live nlinks fsck
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of file link counts Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 5/5] xfs: teach repair to fix file nlinks Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the necessary hooks in the file create/unlink/rename code so that
our live nlink scrub code can stay up to date with the rest of the
filesystem.  This will be the means to keep our shadow link count
information up to date while the scan runs in real time.
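
As a purely illustrative aside (this is not the kernel API; the names and
the fixed-size array here are made up), the rule the hooks enforce can be
modeled in a few lines of standalone C: a live link-count delta is folded
into the shadow data only for inodes that the scan has already visited,
and everything else is left for the scanner to pick up later.

/* Hypothetical standalone model of the live-update rule; not kernel code. */
#include <stdio.h>
#include <stdint.h>

#define NR_INODES	8

static uint32_t shadow_nlink[NR_INODES];	/* shadow link counts */
static uint64_t scan_cursor;			/* inodes < cursor are scanned */

/* Model of the hook: drop deltas for inodes the scan has not reached. */
static void hook_nlink_delta(uint64_t ino, int delta)
{
	if (ino >= scan_cursor)
		return;			/* the scanner will observe this later */
	shadow_nlink[ino] += delta;
}

int main(void)
{
	uint64_t i;

	/* The scanner visits inodes 0-3 and records a baseline count of 1. */
	for (scan_cursor = 0; scan_cursor < 4; scan_cursor++)
		shadow_nlink[scan_cursor] = 1;

	hook_nlink_delta(2, 1);		/* applied: inode 2 already scanned */
	hook_nlink_delta(6, 1);		/* dropped: inode 6 not yet scanned */

	for (i = 0; i < NR_INODES; i++)
		printf("ino %llu shadow nlink %u\n",
		       (unsigned long long)i, shadow_nlink[i]);
	return 0;
}

In the patch itself this gating is done by xchk_iscan_want_live_update(),
and the entire hook path is skipped via a static key whenever no nlinks
scrub is running.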

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |    3 +
 fs/xfs/scrub/nlinks.c |  120 ++++++++++++++++++++++++++++
 fs/xfs/scrub/nlinks.h |    6 +
 fs/xfs/scrub/scrub.c  |    3 +
 fs/xfs/scrub/scrub.h  |    4 +
 fs/xfs/scrub/trace.h  |   43 ++++++++++
 fs/xfs/xfs_inode.c    |  210 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h    |   35 ++++++++
 fs/xfs/xfs_mount.h    |    2 
 fs/xfs/xfs_super.c    |    2 
 fs/xfs/xfs_symlink.c  |    1 
 11 files changed, 426 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 9f418e30f5a3..ef2dd59d0ac9 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1287,5 +1287,8 @@ xchk_fshooks_enable(
 	if (scrub_fshooks & XCHK_FSHOOKS_QUOTA)
 		xfs_dqtrx_hook_enable();
 
+	if (scrub_fshooks & XCHK_FSHOOKS_NLINKS)
+		xfs_nlink_hook_enable();
+
 	sc->flags |= scrub_fshooks;
 }
diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c
index f97c46bdd06c..49ac7904896f 100644
--- a/fs/xfs/scrub/nlinks.c
+++ b/fs/xfs/scrub/nlinks.c
@@ -43,8 +43,7 @@ int
 xchk_setup_nlinks(
 	struct xfs_scrub	*sc)
 {
-	/* Not ready for general consumption yet. */
-	return -EOPNOTSUPP;
+	xchk_fshooks_enable(sc, XCHK_FSHOOKS_NLINKS);
 
 	sc->buf = kzalloc(sizeof(struct xchk_nlink_ctrs), XCHK_GFP_FLAGS);
 	if (!sc->buf)
@@ -63,6 +62,21 @@ xchk_setup_nlinks(
  * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
  * ECANCELED) that are absorbed into a scrub state flag update by
  * xchk_*_process_error.
+ *
+ * Because we are scanning a live filesystem, it's possible that another thread
+ * will try to update the link counts for an inode that we've already scanned.
+ * This will cause our counts to be incorrect.  Therefore, we hook all inode
+ * link count updates when the change is made to the incore inode.  By
+ * shadowing transaction updates in this manner, the live nlink check can lock
+ * the inode and the shadow structure to ensure that its own copies are not out
+ * of date.  Because the hook code runs in a different process context from the
+ * scrub code and the scrub state flags are not accessed atomically, failures
+ * in the hook code must abort the iscan and the scrubber must notice the
+ * aborted scan and set the incomplete flag.
+ *
+ * Note that we use jump labels and srcu notifier hooks to minimize the
+ * overhead when live nlinks is /not/ running.  Locking order for nlink
+ * observations is inode ILOCK -> iscan_lock/xchk_nlink_ctrs lock.
  */
 
 /* Update incore link count information.  Caller must hold the nlinks lock. */
@@ -104,6 +118,91 @@ xchk_nlinks_update_incore(
 	return error;
 }
 
+/*
+ * Apply a link count change from the regular filesystem into our shadow link
+ * count structure.
+ */
+STATIC int
+xchk_nlinks_live_update(
+	struct xfs_hook			*delta_hook,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_nlink_delta_params	*p = data;
+	struct xchk_nlink_ctrs		*xnc;
+	const struct xfs_inode		*scan_dir = p->dp;
+	int				error;
+
+	xnc = container_of(delta_hook, struct xchk_nlink_ctrs, hooks.delta_hook);
+
+	/*
+	 * Back links between a parent directory and a child subdirectory are
+	 * accounted to the incore data when the child is scanned, so we only
+	 * want live backref updates if the child has been scanned.  For all
+	 * other links (forward and dot) we accept the live update for the
+	 * parent directory.
+	 */
+	if (action == XFS_BACKREF_NLINK_DELTA)
+		scan_dir = p->ip;
+
+	/* Ignore the live update if the directory hasn't been scanned yet. */
+	if (!xchk_iscan_want_live_update(&xnc->collect_iscan, scan_dir->i_ino))
+		return NOTIFY_DONE;
+
+	trace_xchk_nlinks_live_update(xnc->sc->mp, p->dp, action, p->ip->i_ino,
+			p->delta, p->name->name, p->name->len);
+
+	mutex_lock(&xnc->lock);
+
+	if (action == XFS_DIRENT_NLINK_DELTA) {
+		const struct inode	*inode = &p->ip->i_vnode;
+
+		/*
+		 * This is an update of a forward link from dp to ino.
+		 * Increment the number of parents linking into ino.  If the
+		 * forward link is to a subdirectory, increment the number of
+		 * child links of dp.
+		 */
+		error = xchk_nlinks_update_incore(xnc, p->ip->i_ino, p->delta,
+				0, 0);
+		if (error)
+			goto out_abort;
+
+		if (S_ISDIR(inode->i_mode)) {
+			error = xchk_nlinks_update_incore(xnc, p->dp->i_ino, 0,
+					0, p->delta);
+			if (error)
+				goto out_abort;
+		}
+	} else if (action == XFS_SELF_NLINK_DELTA) {
+		/*
+		 * This is an update to the dot entry.  Increment the number of
+		 * child links of dp.
+		 */
+		error = xchk_nlinks_update_incore(xnc, p->dp->i_ino, 0, 0,
+				p->delta);
+		if (error)
+			goto out_abort;
+	} else if (action == XFS_BACKREF_NLINK_DELTA) {
+		/*
+		 * This is an update to the dotdot entry.  Increment the number
+		 * of backrefs pointing back to dp (from ip).
+		 */
+		error = xchk_nlinks_update_incore(xnc, p->dp->i_ino, 0,
+				p->delta, 0);
+		if (error)
+			goto out_abort;
+	}
+
+	mutex_unlock(&xnc->lock);
+	return NOTIFY_DONE;
+
+out_abort:
+	xchk_iscan_abort(&xnc->collect_iscan);
+	mutex_unlock(&xnc->lock);
+	return NOTIFY_DONE;
+}
+
 /* Bump the observed link count for the inode referenced by this entry. */
 STATIC int
 xchk_nlinks_collect_dirent(
@@ -720,6 +819,11 @@ xchk_nlinks_teardown_scan(
 {
 	struct xchk_nlink_ctrs	*xnc = priv;
 
+	/* Discourage any hook functions that might be running. */
+	xchk_iscan_abort(&xnc->collect_iscan);
+
+	xfs_nlink_hook_del(xnc->sc->mp, &xnc->hooks);
+
 	xfarray_destroy(xnc->nlinks);
 	xnc->nlinks = NULL;
 
@@ -765,6 +869,18 @@ xchk_nlinks_setup_scan(
 	if (error)
 		goto out_teardown;
 
+	/*
+	 * Hook into the bumplink/droplink code.  The hook only triggers for
+	 * inodes that were already scanned, and the scanner thread takes each
+	 * inode's ILOCK, which means that any in-progress inode updates will
+	 * finish before we can scan the inode.
+	 */
+	ASSERT(sc->flags & XCHK_FSHOOKS_NLINKS);
+	xfs_hook_setup(&xnc->hooks.delta_hook, xchk_nlinks_live_update);
+	error = xfs_nlink_hook_add(mp, &xnc->hooks);
+	if (error)
+		goto out_teardown;
+
 	/* Use deferred cleanup to pass the inode link count data to repair. */
 	sc->buf_cleanup = xchk_nlinks_teardown_scan;
 	return 0;
diff --git a/fs/xfs/scrub/nlinks.h b/fs/xfs/scrub/nlinks.h
index 30fa7dd93029..69cf556b15a3 100644
--- a/fs/xfs/scrub/nlinks.h
+++ b/fs/xfs/scrub/nlinks.h
@@ -22,6 +22,12 @@ struct xchk_nlink_ctrs {
 	 */
 	struct xchk_iscan	collect_iscan;
 	struct xchk_iscan	compare_iscan;
+
+	/*
+	 * Hook into bumplink/droplink so that we can receive live updates
+	 * from other writer threads.
+	 */
+	struct xfs_nlink_hook	hooks;
 };
 
 /*
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 8fdd38dbb9f4..7e06aa98ca82 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -161,6 +161,9 @@ xchk_fshooks_disable(
 	if (sc->flags & XCHK_FSHOOKS_QUOTA)
 		xfs_dqtrx_hook_disable();
 
+	if (sc->flags & XCHK_FSHOOKS_NLINKS)
+		xfs_nlink_hook_disable();
+
 	sc->flags &= ~XCHK_FSHOOKS_ALL;
 }
 
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index d39b2b95352a..da9da6245475 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -122,11 +122,13 @@ struct xfs_scrub {
 #define XCHK_FSHOOKS_DRAIN	(1 << 2)  /* defer ops draining enabled */
 #define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to use intent drain */
 #define XCHK_FSHOOKS_QUOTA	(1 << 4)  /* quota live update enabled */
+#define XCHK_FSHOOKS_NLINKS	(1 << 5)  /* link count live update enabled */
 #define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
 #define XCHK_FSHOOKS_ALL	(XCHK_FSHOOKS_DRAIN | \
-				 XCHK_FSHOOKS_QUOTA)
+				 XCHK_FSHOOKS_QUOTA | \
+				 XCHK_FSHOOKS_NLINKS)
 
 /* Metadata scrubbers */
 int xchk_tester(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 508698d356d2..25acfff8fe6b 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -115,6 +115,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_NLINKS);
 	{ XCHK_FSHOOKS_DRAIN,			"fshooks_drain" }, \
 	{ XCHK_NEED_DRAIN,			"need_drain" }, \
 	{ XCHK_FSHOOKS_QUOTA,			"fshooks_quota" }, \
+	{ XCHK_FSHOOKS_NLINKS,			"fshooks_nlinks" }, \
 	{ XREP_RESET_PERAG_RESV,		"reset_perag_resv" }, \
 	{ XREP_ALREADY_FIXED,			"already_fixed" }
 
@@ -1172,6 +1173,48 @@ TRACE_EVENT(xchk_nlinks_collect_metafile,
 		  __entry->ino)
 );
 
+TRACE_DEFINE_ENUM(XFS_DIRENT_NLINK_DELTA);
+TRACE_DEFINE_ENUM(XFS_BACKREF_NLINK_DELTA);
+TRACE_DEFINE_ENUM(XFS_SELF_NLINK_DELTA);
+
+#define XFS_NLINK_DELTA_STRINGS \
+	{ XFS_DIRENT_NLINK_DELTA,	"->" }, \
+	{ XFS_BACKREF_NLINK_DELTA,	"<-" }, \
+	{ XFS_SELF_NLINK_DELTA,		"<>" }
+
+TRACE_EVENT(xchk_nlinks_live_update,
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_inode *dp,
+		 int action, xfs_ino_t ino, int delta,
+		 const char *name, unsigned int namelen),
+	TP_ARGS(mp, dp, action, ino, delta, name, namelen),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, dir)
+		__field(int, action)
+		__field(xfs_ino_t, ino)
+		__field(int, delta)
+		__field(unsigned int, namelen)
+		__dynamic_array(char, name, namelen)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->dir = dp ? dp->i_ino : NULLFSINO;
+		__entry->action = action;
+		__entry->ino = ino;
+		__entry->delta = delta;
+		__entry->namelen = namelen;
+		memcpy(__get_str(name), name, namelen);
+	),
+	TP_printk("dev %d:%d dir 0x%llx %s ino 0x%llx nlink_delta %d name '%.*s'",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dir,
+		  __print_symbolic(__entry->action, XFS_NLINK_DELTA_STRINGS),
+		  __entry->ino,
+		  __entry->delta,
+		  __entry->namelen,
+		  __get_str(name))
+);
+
 TRACE_EVENT(xchk_nlinks_check_zero,
 	TP_PROTO(struct xfs_mount *mp, xfs_ino_t ino,
 		 const struct xchk_nlink *live),
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index eebdbc55d078..d6eeb59217b4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -938,6 +938,117 @@ xfs_bumplink(
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of link count live updates.  If
+ * the compiler supports jump labels, the static branch will be replaced by a
+ * nop sled when there are no hook users.  Online fsck is currently the only
+ * caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_nlinks_hooks_switch);
+
+void
+xfs_nlink_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_nlinks_hooks_switch);
+}
+
+void
+xfs_nlink_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_nlinks_hooks_switch);
+}
+
+/* Call hooks for a link count update relating to a dot dirent update. */
+static inline void
+xfs_nlink_self_delta(
+	struct xfs_inode		*dp,
+	int				delta)
+{
+	if (xfs_hooks_switched_on(&xfs_nlinks_hooks_switch)) {
+		struct xfs_nlink_delta_params	p = {
+			.dp		= dp,
+			.ip		= dp,
+			.delta		= delta,
+			.name		= &xfs_name_dot,
+		};
+		struct xfs_mount	*mp = dp->i_mount;
+
+		xfs_hooks_call(&mp->m_nlink_delta_hooks, XFS_SELF_NLINK_DELTA,
+				&p);
+	}
+}
+
+/* Call hooks for a link count update relating to a dotdot dirent update. */
+static inline void
+xfs_nlink_backref_delta(
+	struct xfs_inode		*dp,
+	struct xfs_inode		*ip,
+	int				delta)
+{
+	if (xfs_hooks_switched_on(&xfs_nlinks_hooks_switch)) {
+		struct xfs_nlink_delta_params	p = {
+			.dp		= dp,
+			.ip		= ip,
+			.delta		= delta,
+			.name		= &xfs_name_dotdot,
+		};
+		struct xfs_mount	*mp = ip->i_mount;
+
+		xfs_hooks_call(&mp->m_nlink_delta_hooks, XFS_BACKREF_NLINK_DELTA,
+				&p);
+	}
+}
+
+/* Call hooks for a link count update relating to a dirent update. */
+void
+xfs_nlink_dirent_delta(
+	struct xfs_inode		*dp,
+	struct xfs_inode		*ip,
+	int				delta,
+	struct xfs_name			*name)
+{
+	if (xfs_hooks_switched_on(&xfs_nlinks_hooks_switch)) {
+		struct xfs_nlink_delta_params	p = {
+			.dp		= dp,
+			.ip		= ip,
+			.delta		= delta,
+			.name		= name,
+		};
+		struct xfs_mount	*mp = ip->i_mount;
+
+		xfs_hooks_call(&mp->m_nlink_delta_hooks, XFS_DIRENT_NLINK_DELTA,
+				&p);
+	}
+}
+
+/* Call the specified function during a link count update. */
+int
+xfs_nlink_hook_add(
+	struct xfs_mount	*mp,
+	struct xfs_nlink_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_nlink_delta_hooks, &hook->delta_hook);
+}
+
+/* Stop calling the specified function during a link count update. */
+void
+xfs_nlink_hook_del(
+	struct xfs_mount	*mp,
+	struct xfs_nlink_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_nlink_delta_hooks, &hook->delta_hook);
+}
+#else
+# define xfs_nlink_self_delta(dp, delta)		((void)0)
+# define xfs_nlink_backref_delta(dp, ip, delta)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 int
 xfs_create(
 	struct user_namespace	*mnt_userns,
@@ -1046,6 +1157,16 @@ xfs_create(
 		xfs_bumplink(tp, dp);
 	}
 
+	/*
+	 * Create ip with a reference from dp, and add '.' and '..' references
+	 * if it's a directory.
+	 */
+	xfs_nlink_dirent_delta(dp, ip, 1, name);
+	if (is_dir) {
+		xfs_nlink_self_delta(ip, 1);
+		xfs_nlink_backref_delta(dp, ip, 1);
+	}
+
 	/*
 	 * If this is a synchronous mount, make sure that the
 	 * create transaction goes to disk before returning to
@@ -1258,6 +1379,7 @@ xfs_link(
 	xfs_trans_log_inode(tp, tdp, XFS_ILOG_CORE);
 
 	xfs_bumplink(tp, sip);
+	xfs_nlink_dirent_delta(tdp, sip, 1, target_name);
 
 	/*
 	 * If this is a synchronous mount, make sure that the
@@ -2486,6 +2608,16 @@ xfs_remove(
 		goto out_trans_cancel;
 	}
 
+	/*
+	 * Drop the link from dp to ip, and if ip was a directory, remove the
+	 * '.' and '..' references since we freed the directory.
+	 */
+	xfs_nlink_dirent_delta(dp, ip, -1, name);
+	if (S_ISDIR(VFS_I(ip)->i_mode)) {
+		xfs_nlink_backref_delta(dp, ip, -1);
+		xfs_nlink_self_delta(ip, -1);
+	}
+
 	/*
 	 * If this is a synchronous mount, make sure that the
 	 * remove transaction goes to disk before returning to
@@ -2560,6 +2692,75 @@ xfs_sort_for_rename(
 	}
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+static inline void
+xfs_rename_call_nlink_hooks(
+	struct xfs_inode	*src_dp,
+	struct xfs_name		*src_name,
+	struct xfs_inode	*src_ip,
+	struct xfs_inode	*target_dp,
+	struct xfs_name		*target_name,
+	struct xfs_inode	*target_ip,
+	struct xfs_inode	*wip,
+	unsigned int		flags)
+{
+	/* If we added a whiteout, add the reference from src_dp. */
+	if (wip)
+		xfs_nlink_dirent_delta(src_dp, wip, 1, src_name);
+
+	/* Move the src_ip forward link from src_dp to target_dp. */
+	xfs_nlink_dirent_delta(src_dp, src_ip, -1, src_name);
+	xfs_nlink_dirent_delta(target_dp, src_ip, 1, target_name);
+
+	/*
+	 * If src_ip is a dir, move its '..' back link from src_dp to
+	 * target_dp.
+	 */
+	if (S_ISDIR(VFS_I(src_ip)->i_mode)) {
+		xfs_nlink_backref_delta(src_dp, src_ip, -1);
+		xfs_nlink_backref_delta(target_dp, src_ip, 1);
+	}
+
+	if (!target_ip)
+		return;
+
+	if (flags & RENAME_EXCHANGE) {
+		/* Move the target_ip forward link from target_dp to src_dp. */
+		xfs_nlink_dirent_delta(target_dp, target_ip, -1, target_name);
+		xfs_nlink_dirent_delta(src_dp, target_ip, 1, target_name);
+
+		/*
+		 * If target_ip is a dir, move its '..' back link from
+		 * target_dp to src_dp.
+		 */
+		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
+			xfs_nlink_backref_delta(target_dp, target_ip, -1);
+			xfs_nlink_backref_delta(src_dp, target_ip, 1);
+		}
+
+		return;
+	}
+
+	/* Drop target_ip's forward link from target_dp. */
+	xfs_nlink_dirent_delta(target_dp, target_ip, -1, target_name);
+
+	if (!S_ISDIR(VFS_I(target_ip)->i_mode))
+		return;
+
+	/*
+	 * If target_ip was a dir, drop the '.' and '..' references since that
+	 * was the last reference.
+	 */
+	ASSERT(VFS_I(target_ip)->i_nlink == 0);
+	xfs_nlink_self_delta(target_ip, -1);
+	xfs_nlink_backref_delta(target_dp, target_ip, -1);
+}
+#else
+# define xfs_rename_call_nlink_hooks(src_dp, src_name, src_ip, target_dp, \
+				     target_name, target_ip, wip, flags) \
+		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 static int
 xfs_finish_rename(
 	struct xfs_trans	*tp)
@@ -2676,6 +2877,11 @@ xfs_cross_rename(
 	}
 	xfs_trans_ichgtime(tp, dp1, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, dp1, XFS_ILOG_CORE);
+
+	if (xfs_hooks_switched_on(&xfs_nlinks_hooks_switch))
+		xfs_rename_call_nlink_hooks(dp1, name1, ip1, dp2, name2, ip2,
+				NULL, RENAME_EXCHANGE);
+
 	return xfs_finish_rename(tp);
 
 out_trans_abort:
@@ -3059,6 +3265,10 @@ xfs_rename(
 	if (new_parent)
 		xfs_trans_log_inode(tp, target_dp, XFS_ILOG_CORE);
 
+	if (xfs_hooks_switched_on(&xfs_nlinks_hooks_switch))
+		xfs_rename_call_nlink_hooks(src_dp, src_name, src_ip,
+				target_dp, target_name, target_ip, wip, flags);
+
 	error = xfs_finish_rename(tp);
 	if (wip)
 		xfs_irele(wip);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 57c459f8e669..926e4dd566d0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -578,4 +578,39 @@ void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
 void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
 
+/*
+ * Parameters for tracking bumplink and droplink operations.  The hook
+ * function arg parameter is one of these.
+ */
+enum xfs_nlink_delta_type {
+	XFS_DIRENT_NLINK_DELTA,		/* parent pointing to child */
+	XFS_BACKREF_NLINK_DELTA,		/* dotdot entries */
+	XFS_SELF_NLINK_DELTA,		/* dot entries */
+};
+
+struct xfs_nlink_delta_params {
+	const struct xfs_inode	*dp;
+	const struct xfs_inode	*ip;
+	const struct xfs_name	*name;
+	int			delta;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+void xfs_nlink_dirent_delta(struct xfs_inode *dp, struct xfs_inode *ip,
+		int delta, struct xfs_name *name);
+
+struct xfs_nlink_hook {
+	struct xfs_hook		delta_hook;
+};
+
+void xfs_nlink_hook_disable(void);
+void xfs_nlink_hook_enable(void);
+
+int xfs_nlink_hook_add(struct xfs_mount *mp, struct xfs_nlink_hook *hook);
+void xfs_nlink_hook_del(struct xfs_mount *mp, struct xfs_nlink_hook *hook);
+
+#else
+# define xfs_nlink_dirent_delta(dp, ip, delta, name)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8aca2cc173ac..ec8b185d45f8 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -242,6 +242,8 @@ typedef struct xfs_mount {
 	unsigned int		*m_errortag;
 	struct xfs_kobj		m_errortag_kobj;
 #endif
+	/* Hook to feed file link count updates to an active online repair. */
+	struct xfs_hooks	m_nlink_delta_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 0c4b73e9b29d..020ff2d93f23 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1944,6 +1944,8 @@ static int xfs_init_fs_context(
 	mp->m_logbsize = -1;
 	mp->m_allocsize_log = 16; /* 64k */
 
+	xfs_hooks_init(&mp->m_nlink_delta_hooks);
+
 	/*
 	 * Copy binary VFS mount flags we are interested in.
 	 */
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 8389f3ef88ef..8241c0fcd0ba 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -319,6 +319,7 @@ xfs_symlink(
 		goto out_trans_cancel;
 	xfs_trans_ichgtime(tp, dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, dp, XFS_ILOG_CORE);
+	xfs_nlink_dirent_delta(dp, ip, 1, link_name);
 
 	/*
 	 * If this is a synchronous mount, make sure that the



* [PATCH 5/5] xfs: teach repair to fix file nlinks
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of file link counts Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 2/5] xfs: streamline the directory iteration code for scrub Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/5] xfs: track file link count updates during live nlinks fsck Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach online repair to correct file link counts, using the shadow link
count information collected by the nlinks scrubber to reset the nlink
value of each incore inode that doesn't match.
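
The per-inode decision is the heart of the repair; the following is a
condensed sketch of the logic in xrep_nlinks_repair_inode() below, with
the locking, transaction setup, directory sanity checks, and tracepoints
omitted:

	total_links = xchk_nlink_total(&obs);
	actual_nlink = VFS_I(ip)->i_nlink;

	if (total_links > U32_MAX)
		goto out_nextfile;	/* cannot represent; leave it alone */
	if (total_links == 0 && actual_nlink != 0)
		goto out_nextfile;	/* nowhere to graft this file */
	if (total_links == actual_nlink)
		goto out_nextfile;	/* already correct */

	set_nlink(VFS_I(ip), total_links);
	xfs_trans_log_inode(sc->tp, ip, XFS_ILOG_CORE);
	error = xfs_trans_commit(sc->tp);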

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/nlinks.c        |    4 +
 fs/xfs/scrub/nlinks.h        |    5 +
 fs/xfs/scrub/nlinks_repair.c |  226 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h        |    2 
 fs/xfs/scrub/scrub.c         |    2 
 fs/xfs/scrub/trace.h         |    3 +
 7 files changed, 240 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/scrub/nlinks_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ea9eda20a11d..a69c5585e41c 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -190,6 +190,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   newbt.o \
+				   nlinks_repair.o \
 				   reap.o \
 				   refcount_repair.o \
 				   repair.o \
diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c
index 49ac7904896f..e29d7da2eb32 100644
--- a/fs/xfs/scrub/nlinks.c
+++ b/fs/xfs/scrub/nlinks.c
@@ -61,7 +61,9 @@ xchk_setup_nlinks(
  * set the INCOMPLETE flag even when a negative errno is returned.  This care
  * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
  * ECANCELED) that are absorbed into a scrub state flag update by
- * xchk_*_process_error.
+ * xchk_*_process_error.  Scrub and repair share the same incore data
+ * structures, so the INCOMPLETE flag is critical to prevent a repair based on
+ * insufficient information.
  *
  * Because we are scanning a live filesystem, it's possible that another thread
  * will try to update the link counts for an inode that we've already scanned.
diff --git a/fs/xfs/scrub/nlinks.h b/fs/xfs/scrub/nlinks.h
index 69cf556b15a3..46baef3c2237 100644
--- a/fs/xfs/scrub/nlinks.h
+++ b/fs/xfs/scrub/nlinks.h
@@ -81,9 +81,12 @@ struct xchk_nlink {
  */
 #define XCHK_NLINK_WRITTEN		(1U << 0)
 
-/* This data item was seen by the check-time compare function. */
+/* Already checked this link count record. */
 #define XCHK_NLINK_COMPARE_SCANNED	(1U << 1)
 
+/* Already made a repair with this link count record. */
+#define XREP_NLINK_DIRTY		(1U << 2)
+
 /* Compute total link count, using large enough variables to detect overflow. */
 static inline uint64_t
 xchk_nlink_total(const struct xchk_nlink *live)
diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c
new file mode 100644
index 000000000000..2f83abd6eec7
--- /dev/null
+++ b/fs/xfs/scrub/nlinks_repair.c
@@ -0,0 +1,226 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_bmap_util.h"
+#include "xfs_iwalk.h"
+#include "xfs_ialloc.h"
+#include "xfs_sb.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/repair.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/iscan.h"
+#include "scrub/nlinks.h"
+#include "scrub/trace.h"
+
+/*
+ * Live Inode Link Count Repair
+ * ============================
+ *
+ * Use the live inode link count information that we collected to replace the
+ * nlink values of the incore inodes.  A scrub->repair cycle should have left
+ * the live data and hooks active, so this is safe so long as we make sure the
+ * inode is locked.
+ */
+
+/*
+ * Correct the link count of the given inode.  Because we have to grab locks
+ * and resources in a certain order, it's possible that this will be a no-op.
+ */
+STATIC int
+xrep_nlinks_repair_inode(
+	struct xchk_nlink_ctrs	*xnc)
+{
+	struct xchk_nlink	obs;
+	struct xfs_scrub	*sc = xnc->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_inode	*ip = sc->ip;
+	uint64_t		total_links;
+	unsigned int		actual_nlink;
+	int			error;
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, 0, 0, 0, &sc->tp);
+	if (error)
+		goto out_iolock;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(sc->tp, ip, 0);
+
+	mutex_lock(&xnc->lock);
+
+	if (xchk_iscan_aborted(&xnc->collect_iscan)) {
+		error = -ECANCELED;
+		goto out_scanlock;
+	}
+
+	error = xfarray_load_sparse(xnc->nlinks, ip->i_ino, &obs);
+	if (error)
+		goto out_scanlock;
+
+	total_links = xchk_nlink_total(&obs);
+	actual_nlink = VFS_I(ip)->i_nlink;
+
+	/*
+	 * Cannot set more than the maximum possible link count.  We won't
+	 * touch this file, but we don't want to abort the entire operation,
+	 * so set an explicit error value to avoid static checker warnings and
+	 * return early.
+	 */
+	if (total_links > U32_MAX) {
+		trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		goto out_nextfile;
+	}
+
+	/*
+	 * Linked directories should have at least one "child" (the dot entry)
+	 * pointing up to them.
+	 */
+	if (S_ISDIR(VFS_I(ip)->i_mode) && actual_nlink > 0 &&
+					  obs.children == 0) {
+		trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		goto out_nextfile;
+	}
+
+	/* Non-directories cannot have directories pointing up to them. */
+	if (!S_ISDIR(VFS_I(ip)->i_mode) && obs.children != 0) {
+		trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		goto out_nextfile;
+	}
+
+	/*
+	 * We did not find any links to this inode.  If the inode agrees, we
+	 * have nothing further to do.  If not, the inode has a nonzero link
+	 * count and we don't have anywhere to graft the child onto.  Dropping
+	 * a live inode's link count to zero can cause unexpected shutdowns in
+	 * inactivation, so leave it alone.
+	 */
+	if (total_links == 0) {
+		if (actual_nlink != 0)
+			trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		goto out_nextfile;
+	}
+
+	/* Perfect match means we're done with this file. */
+	if (total_links == actual_nlink)
+		goto out_nextfile;
+
+	mutex_unlock(&xnc->lock);
+
+	/* Commit the new link count. */
+	trace_xrep_nlinks_update_inode(mp, ip, &obs);
+
+	set_nlink(VFS_I(ip), total_links);
+	xfs_trans_log_inode(sc->tp, ip, XFS_ILOG_CORE);
+	error = xfs_trans_commit(sc->tp);
+	sc->tp = NULL;
+	if (error)
+		goto out_ilock;
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	return 0;
+
+out_nextfile:
+	error = 0;
+out_scanlock:
+	mutex_unlock(&xnc->lock);
+	xchk_trans_cancel(sc);
+out_ilock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_iolock:
+	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	return error;
+}
+
+/*
+ * Try to visit every inode in the filesystem for repairs.  Move on if we can't
+ * grab an inode, since we're still making forward progress.
+ */
+static int
+xrep_nlinks_iter(
+	struct xchk_nlink_ctrs	*xnc,
+	struct xfs_inode	**ipp)
+{
+	int			error;
+
+	do {
+		error = xchk_iscan_iter(xnc->sc, &xnc->compare_iscan, ipp);
+	} while (error == -EBUSY);
+
+	return error;
+}
+
+/* Commit the new inode link counters. */
+int
+xrep_nlinks(
+	struct xfs_scrub	*sc)
+{
+	struct xchk_nlink_ctrs	*xnc = sc->buf;
+	int			error;
+
+	/*
+	 * We need ftype for an accurate count of the number of child
+	 * subdirectory links.  Child subdirectories with a back link (dotdot
+	 * entry) but no forward link are unfixable, so we cannot repair the
+	 * link count of the parent directory based on the back link count
+	 * alone.  Filesystems without ftype support are rare (old V4) so we
+	 * just skip out here.
+	 */
+	if (!xfs_has_ftype(sc->mp))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Use the inobt to walk all allocated inodes to compare and fix the
+	 * link counts.  Retry iget every tenth of a second for up to 30
+	 * seconds -- even if repair misses a few inodes, we still try to fix
+	 * as many of them as we can.
+	 */
+	xchk_iscan_start(&xnc->compare_iscan, 30000, 100);
+	ASSERT(sc->ip == NULL);
+
+	while ((error = xrep_nlinks_iter(xnc, &sc->ip)) == 1) {
+		/*
+		 * Commit the scrub transaction so that we can create repair
+		 * transactions with the correct reservations.
+		 */
+		xchk_trans_cancel(sc);
+
+		error = xrep_nlinks_repair_inode(xnc);
+		xchk_iscan_mark_visited(&xnc->compare_iscan, sc->ip);
+		xchk_irele(sc, sc->ip);
+		sc->ip = NULL;
+		if (error)
+			break;
+
+		if (xchk_should_terminate(sc, &error))
+			break;
+
+		/*
+		 * Create a new empty transaction so that we can advance the
+		 * iscan cursor without deadlocking if the inobt has a cycle.
+		 * We can only push the inactivation workqueues with an empty
+		 * transaction.
+		 */
+		error = xchk_trans_alloc_empty(sc);
+		if (error)
+			break;
+	}
+	xchk_iscan_finish(&xnc->compare_iscan);
+
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index f8d356b17b1f..6c19f0d7f335 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -107,6 +107,7 @@ int xrep_inode(struct xfs_scrub *sc);
 int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_bmap_cow(struct xfs_scrub *sc);
+int xrep_nlinks(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -191,6 +192,7 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 #define xrep_rtbitmap			xrep_notsupported
 #define xrep_quota			xrep_notsupported
 #define xrep_quotacheck			xrep_notsupported
+#define xrep_nlinks			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 7e06aa98ca82..2c05fbde1f88 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -379,7 +379,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xchk_setup_nlinks,
 		.scrub	= xchk_nlinks,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_nlinks,
 	},
 };
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 25acfff8fe6b..dcbab5a197c1 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1953,6 +1953,9 @@ DEFINE_XREP_DQUOT_EVENT(xrep_disk_dquot);
 DEFINE_XREP_DQUOT_EVENT(xrep_quotacheck_dquot);
 #endif /* CONFIG_XFS_QUOTA */
 
+DEFINE_SCRUB_NLINKS_DIFF_EVENT(xrep_nlinks_update_inode);
+DEFINE_SCRUB_NLINKS_DIFF_EVENT(xrep_nlinks_unfixable_inode);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */



* [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (12 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of file link counts Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 01/11] xfs: separate the marking of sick and checked metadata Darrick J. Wong
                     ` (10 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: indirect health reporting Darrick J. Wong
                   ` (63 subsequent siblings)
  77 siblings, 11 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Any time that the runtime code thinks it has found corrupt metadata, it
should tell the health tracking subsystem that the corresponding part of
the filesystem is sick.  These reports come primarily from two places --
code that is reading a buffer that fails validation, and higher level
pieces that observe a conflict involving multiple buffers.  This
patchset uses automated scanning to update all such callsites with a
mark_sick call.

Doing this enables the health system to record problems observed at
runtime, which (for now) can prompt the sysadmin to run xfs_scrub, and
(later) may enable more targeted fixing of the filesystem.
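
Concretely, the calling convention that this series establishes at a
corruption detection site looks roughly like the sketch below.  The
condition is a stand-in, but the helper, the sickness mask, and the errno
follow the patterns in the patches that follow; note that runtime code
only marks the metadata sick and does not claim to have fully checked it:

	if (cross_reference_failed) {	/* stand-in for a real check */
		xfs_ag_mark_sick(pag, XFS_SICK_AG_BNOBT);
		return -EFSCORRUPTED;
	}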

Note: Earlier reviewers of this patchset suggested that the verifier
functions themselves should be responsible for calling _mark_sick.  In a
higher level language this would be easily accomplished with lambda
functions and closures.  For the kernel, however, we'd have to create
the necessary closures by hand, pass them to the buf_read calls, and
then implement the state tracking needed to detach the xfs_buf from the
closure at the right time.  This is far too much work and complexity
and will not be pursued further.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=corruption-health-reports

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=corruption-health-reports
---
 fs/xfs/libxfs/xfs_ag.c          |    5 +
 fs/xfs/libxfs/xfs_alloc.c       |  105 ++++++++++++++++++++----
 fs/xfs/libxfs/xfs_attr_leaf.c   |    4 +
 fs/xfs/libxfs/xfs_attr_remote.c |   35 +++++---
 fs/xfs/libxfs/xfs_bmap.c        |  123 +++++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_btree.c       |   39 ++++++++-
 fs/xfs/libxfs/xfs_da_btree.c    |   37 +++++++-
 fs/xfs/libxfs/xfs_dir2.c        |    5 +
 fs/xfs/libxfs/xfs_dir2_block.c  |    2 
 fs/xfs/libxfs/xfs_dir2_data.c   |    3 +
 fs/xfs/libxfs/xfs_dir2_leaf.c   |    3 +
 fs/xfs/libxfs/xfs_dir2_node.c   |    7 ++
 fs/xfs/libxfs/xfs_health.h      |   35 +++++++-
 fs/xfs/libxfs/xfs_ialloc.c      |   57 +++++++++++--
 fs/xfs/libxfs/xfs_inode_buf.c   |   12 ++-
 fs/xfs/libxfs/xfs_inode_fork.c  |    8 ++
 fs/xfs/libxfs/xfs_refcount.c    |   43 +++++++++-
 fs/xfs/libxfs/xfs_rmap.c        |   83 ++++++++++++++++++-
 fs/xfs/libxfs/xfs_rtbitmap.c    |    9 ++
 fs/xfs/libxfs/xfs_sb.c          |    2 
 fs/xfs/scrub/health.c           |   20 +++--
 fs/xfs/scrub/refcount_repair.c  |    9 ++
 fs/xfs/xfs_attr_inactive.c      |    4 +
 fs/xfs/xfs_attr_list.c          |   18 +++-
 fs/xfs/xfs_dir2_readdir.c       |    7 +-
 fs/xfs/xfs_discard.c            |    2 
 fs/xfs/xfs_dquot.c              |   30 +++++++
 fs/xfs/xfs_health.c             |  172 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.c             |    9 ++
 fs/xfs/xfs_inode.c              |   17 +++-
 fs/xfs/xfs_iomap.c              |   15 +++
 fs/xfs/xfs_iwalk.c              |    5 +
 fs/xfs/xfs_mount.c              |    5 +
 fs/xfs/xfs_qm.c                 |    8 +-
 fs/xfs/xfs_reflink.c            |    6 +
 fs/xfs/xfs_rtalloc.c            |    6 +
 fs/xfs/xfs_symlink.c            |   12 ++-
 37 files changed, 860 insertions(+), 102 deletions(-)



* [PATCH 02/11] xfs: report fs corruption errors to the health tracking system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 01/11] xfs: separate the marking of sick and checked metadata Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 08/11] xfs: report inode corruption errors to the health system Darrick J. Wong
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter corrupt fs metadata, we should report that to the
health monitoring system for later reporting.  A convenient program for
identifying places to insert xfs_*_mark_sick calls is as follows:

#!/bin/bash

# Detect missing calls to xfs_*_mark_sick

filter=cat
tty -s && filter=less

git grep -B3 EFSCORRUPTED fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] fs/xfs/scrub/*.[ch] | awk '
BEGIN {
	ignore = 0;
	lineno = 0;
	delete lines;
}
{
	if ($0 == "--") {
		if (!ignore) {
			for (i = 0; i < lineno; i++) {
				print(lines[i]);
			}
			printf("--\n");
		}
		delete lines;
		lineno = 0;
		ignore = 0;
	} else if ($0 ~ /mark_sick/) {
		ignore = 1;
	} else if ($0 ~ /if .fa/) {
		ignore = 1;
	} else if ($0 ~ /failaddr/) {
		ignore = 1;
	} else if ($0 ~ /_verifier_error/) {
		ignore = 1;
	} else if ($0 ~ /^ \* .*EFSCORRUPTED/) {
		ignore = 1;
	} else if ($0 ~ /== -EFSCORRUPTED/) {
		ignore = 1;
	} else if ($0 ~ /!= -EFSCORRUPTED/) {
		ignore = 1;
	} else {
		lines[lineno++] = $0;
	}
}
' | $filter

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c |    1 +
 1 file changed, 1 insertion(+)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 8b1bb228cba6..a30eab622266 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -164,6 +164,7 @@ xfs_initialize_perag_data(
 	 */
 	if (fdblocks > sbp->sb_dblocks || ifree > ialloc) {
 		xfs_alert(mp, "AGF corruption. Please run xfs_repair.");
+		xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);
 		error = -EFSCORRUPTED;
 		goto out;
 	}



* [PATCH 01/11] xfs: separate the marking of sick and checked metadata
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 02/11] xfs: report fs corruption errors to the health tracking system Darrick J. Wong
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Split the setting of the sick and checked masks into separate functions
as part of preparing to add the ability for regular runtime fs code
(i.e. not scrub) to mark metadata structures sick when corruptions are
found.  Improve the documentation of libxfs' requirements for helper
behavior.
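
After the split, the intended division of labor is roughly the following
sketch (mirroring the scrub/health.c hunk below; the condition name is
hypothetical):

	/* scrub examined the structure, so record both verdicts */
	if (found_corruption) {
		xfs_ag_mark_sick(pag, mask);
		xfs_ag_mark_checked(pag, mask);
	} else {
		xfs_ag_mark_healthy(pag, mask);
	}

	/* runtime code reports a problem without claiming a full check */
	xfs_ag_mark_sick(pag, mask);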

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_health.h |   16 +++++++++++++-
 fs/xfs/scrub/health.c      |   20 ++++++++++-------
 fs/xfs/xfs_health.c        |   51 +++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_mount.c         |    5 +++-
 4 files changed, 81 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 5571f6cb2539..aa4771fad505 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -101,24 +101,38 @@ struct xfs_fsop_geom;
 				 XFS_SICK_INO_SYMLINK | \
 				 XFS_SICK_INO_PARENT)
 
-/* These functions must be provided by the xfs implementation. */
+/*
+ * These functions must be provided by the xfs implementation.  Function
+ * behavior with respect to the first argument should be as follows:
+ *
+ * xfs_*_mark_sick:    set the sick flags and do not set checked flags.
+ * xfs_*_mark_checked: set the checked flags.
+ * xfs_*_mark_healthy: clear the sick flags and set the checked flags.
+ *
+ * xfs_*_measure_sickness: return the sick and check status in the provided
+ * out parameters.
+ */
 
 void xfs_fs_mark_sick(struct xfs_mount *mp, unsigned int mask);
+void xfs_fs_mark_checked(struct xfs_mount *mp, unsigned int mask);
 void xfs_fs_mark_healthy(struct xfs_mount *mp, unsigned int mask);
 void xfs_fs_measure_sickness(struct xfs_mount *mp, unsigned int *sick,
 		unsigned int *checked);
 
 void xfs_rt_mark_sick(struct xfs_mount *mp, unsigned int mask);
+void xfs_rt_mark_checked(struct xfs_mount *mp, unsigned int mask);
 void xfs_rt_mark_healthy(struct xfs_mount *mp, unsigned int mask);
 void xfs_rt_measure_sickness(struct xfs_mount *mp, unsigned int *sick,
 		unsigned int *checked);
 
 void xfs_ag_mark_sick(struct xfs_perag *pag, unsigned int mask);
+void xfs_ag_mark_checked(struct xfs_perag *pag, unsigned int mask);
 void xfs_ag_mark_healthy(struct xfs_perag *pag, unsigned int mask);
 void xfs_ag_measure_sickness(struct xfs_perag *pag, unsigned int *sick,
 		unsigned int *checked);
 
 void xfs_inode_mark_sick(struct xfs_inode *ip, unsigned int mask);
+void xfs_inode_mark_checked(struct xfs_inode *ip, unsigned int mask);
 void xfs_inode_mark_healthy(struct xfs_inode *ip, unsigned int mask);
 void xfs_inode_measure_sickness(struct xfs_inode *ip, unsigned int *sick,
 		unsigned int *checked);
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index 6749930b50a2..e5cc89d43808 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -144,30 +144,34 @@ xchk_update_health(
 	switch (type_to_health_flag[sc->sm->sm_type].group) {
 	case XHG_AG:
 		pag = xfs_perag_get(sc->mp, sc->sm->sm_agno);
-		if (bad)
+		if (bad) {
 			xfs_ag_mark_sick(pag, sc->sick_mask);
-		else
+			xfs_ag_mark_checked(pag, sc->sick_mask);
+		} else
 			xfs_ag_mark_healthy(pag, sc->sick_mask);
 		xfs_perag_put(pag);
 		break;
 	case XHG_INO:
 		if (!sc->ip)
 			return;
-		if (bad)
+		if (bad) {
 			xfs_inode_mark_sick(sc->ip, sc->sick_mask);
-		else
+			xfs_inode_mark_checked(sc->ip, sc->sick_mask);
+		} else
 			xfs_inode_mark_healthy(sc->ip, sc->sick_mask);
 		break;
 	case XHG_FS:
-		if (bad)
+		if (bad) {
 			xfs_fs_mark_sick(sc->mp, sc->sick_mask);
-		else
+			xfs_fs_mark_checked(sc->mp, sc->sick_mask);
+		} else
 			xfs_fs_mark_healthy(sc->mp, sc->sick_mask);
 		break;
 	case XHG_RT:
-		if (bad)
+		if (bad) {
 			xfs_rt_mark_sick(sc->mp, sc->sick_mask);
-		else
+			xfs_rt_mark_checked(sc->mp, sc->sick_mask);
+		} else
 			xfs_rt_mark_healthy(sc->mp, sc->sick_mask);
 		break;
 	default:
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 9cf933a8f532..3ef91601bc2b 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -98,6 +98,18 @@ xfs_fs_mark_sick(
 
 	spin_lock(&mp->m_sb_lock);
 	mp->m_fs_sick |= mask;
+	spin_unlock(&mp->m_sb_lock);
+}
+
+/* Mark per-fs metadata as having been checked. */
+void
+xfs_fs_mark_checked(
+	struct xfs_mount	*mp,
+	unsigned int		mask)
+{
+	ASSERT(!(mask & ~XFS_SICK_FS_PRIMARY));
+
+	spin_lock(&mp->m_sb_lock);
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
 }
@@ -141,6 +153,19 @@ xfs_rt_mark_sick(
 
 	spin_lock(&mp->m_sb_lock);
 	mp->m_rt_sick |= mask;
+	spin_unlock(&mp->m_sb_lock);
+}
+
+/* Mark realtime metadata as having been checked. */
+void
+xfs_rt_mark_checked(
+	struct xfs_mount	*mp,
+	unsigned int		mask)
+{
+	ASSERT(!(mask & ~XFS_SICK_RT_PRIMARY));
+	trace_xfs_rt_mark_sick(mp, mask);
+
+	spin_lock(&mp->m_sb_lock);
 	mp->m_rt_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
 }
@@ -184,6 +209,18 @@ xfs_ag_mark_sick(
 
 	spin_lock(&pag->pag_state_lock);
 	pag->pag_sick |= mask;
+	spin_unlock(&pag->pag_state_lock);
+}
+
+/* Mark per-ag metadata as having been checked. */
+void
+xfs_ag_mark_checked(
+	struct xfs_perag	*pag,
+	unsigned int		mask)
+{
+	ASSERT(!(mask & ~XFS_SICK_AG_PRIMARY));
+
+	spin_lock(&pag->pag_state_lock);
 	pag->pag_checked |= mask;
 	spin_unlock(&pag->pag_state_lock);
 }
@@ -227,7 +264,6 @@ xfs_inode_mark_sick(
 
 	spin_lock(&ip->i_flags_lock);
 	ip->i_sick |= mask;
-	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
 
 	/*
@@ -240,6 +276,19 @@ xfs_inode_mark_sick(
 	spin_unlock(&VFS_I(ip)->i_lock);
 }
 
+/* Mark inode metadata as having been checked. */
+void
+xfs_inode_mark_checked(
+	struct xfs_inode	*ip,
+	unsigned int		mask)
+{
+	ASSERT(!(mask & ~XFS_SICK_INO_PRIMARY));
+
+	spin_lock(&ip->i_flags_lock);
+	ip->i_checked |= mask;
+	spin_unlock(&ip->i_flags_lock);
+}
+
 /* Mark parts of an inode healed. */
 void
 xfs_inode_mark_healthy(
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fb87ffb48f7f..31f49211fdd6 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -496,8 +496,10 @@ xfs_check_summary_counts(
 	if (xfs_is_clean(mp) &&
 	    (mp->m_sb.sb_fdblocks > mp->m_sb.sb_dblocks ||
 	     !xfs_verify_icount(mp, mp->m_sb.sb_icount) ||
-	     mp->m_sb.sb_ifree > mp->m_sb.sb_icount))
+	     mp->m_sb.sb_ifree > mp->m_sb.sb_icount)) {
 		xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);
+		xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
+	}
 
 	/*
 	 * We can safely re-initialise incore superblock counters from the
@@ -1271,6 +1273,7 @@ xfs_force_summary_recalc(
 		return;
 
 	xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);
+	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 469+ messages in thread
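
To make the helper contract concrete, here is a minimal sketch of the two
call patterns that the split enables; it uses only the helpers declared in
xfs_health.h above, and found_bad stands in for whatever corruption test the
caller performs:

	/*
	 * Scrub examined the summary counters in full, so it records both
	 * the verdict and the fact that a check happened.
	 */
	if (found_bad) {
		xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);
		xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
	} else {
		xfs_fs_mark_healthy(mp, XFS_SICK_FS_COUNTERS);
	}

	/*
	 * Runtime code that merely trips over bad counters records the
	 * observation without claiming that a full check was performed.
	 */
	xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS);

xfs_fs_measure_sickness() then lets reporting code distinguish "observed
sick" from "checked and found sick".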

* [PATCH 03/11] xfs: report ag header corruption errors to the health tracking system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 09/11] xfs: report quota " Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 07/11] xfs: report symlink block corruption errors to the health system Darrick J. Wong
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter a corrupt AG header, we should record that in the
health monitoring system so that it can be reported to userspace later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c  |    6 ++++++
 fs/xfs/libxfs/xfs_health.h |   13 ++++++++++---
 fs/xfs/libxfs/xfs_ialloc.c |    3 +++
 fs/xfs/libxfs/xfs_sb.c     |    2 ++
 fs/xfs/xfs_health.c        |   17 +++++++++++++++++
 fs/xfs/xfs_inode.c         |   15 +++++++++++++--
 6 files changed, 51 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 62136ecaa071..819a38170351 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -26,6 +26,7 @@
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 #include "xfs_bmap.h"
+#include "xfs_health.h"
 
 struct kmem_cache	*xfs_extfree_item_cache;
 
@@ -754,6 +755,8 @@ xfs_alloc_read_agfl(
 			mp, tp, mp->m_ddev_targp,
 			XFS_AG_DADDR(mp, pag->pag_agno, XFS_AGFL_DADDR(mp)),
 			XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_agfl_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGFL);
 	if (error)
 		return error;
 	xfs_buf_set_ref(bp, XFS_AGFL_REF);
@@ -775,6 +778,7 @@ xfs_alloc_update_counters(
 	if (unlikely(be32_to_cpu(agf->agf_freeblks) >
 		     be32_to_cpu(agf->agf_length))) {
 		xfs_buf_mark_corrupt(agbp);
+		xfs_ag_mark_sick(agbp->b_pag, XFS_SICK_AG_AGF);
 		return -EFSCORRUPTED;
 	}
 
@@ -3106,6 +3110,8 @@ xfs_read_agf(
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
 			XFS_AG_DADDR(mp, pag->pag_agno, XFS_AGF_DADDR(mp)),
 			XFS_FSS_TO_BB(mp, 1), flags, agfbpp, &xfs_agf_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGF);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index aa4771fad505..5a4995391ae7 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -26,9 +26,11 @@
  * and the "sick" field tells us if that piece was found to need repairs.
  * Therefore we can conclude that for a given sick flag value:
  *
- *  - checked && sick  => metadata needs repair
- *  - checked && !sick => metadata is ok
- *  - !checked         => has not been examined since mount
+ *  - checked && sick   => metadata needs repair
+ *  - checked && !sick  => metadata is ok
+ *  - !checked && sick  => errors have been observed during normal operation,
+ *                         but the metadata has not been checked thoroughly
+ *  - !checked && !sick => has not been examined since mount
  */
 
 struct xfs_mount;
@@ -125,6 +127,8 @@ void xfs_rt_mark_healthy(struct xfs_mount *mp, unsigned int mask);
 void xfs_rt_measure_sickness(struct xfs_mount *mp, unsigned int *sick,
 		unsigned int *checked);
 
+void xfs_agno_mark_sick(struct xfs_mount *mp, xfs_agnumber_t agno,
+		unsigned int mask);
 void xfs_ag_mark_sick(struct xfs_perag *pag, unsigned int mask);
 void xfs_ag_mark_checked(struct xfs_perag *pag, unsigned int mask);
 void xfs_ag_mark_healthy(struct xfs_perag *pag, unsigned int mask);
@@ -205,4 +209,7 @@ void xfs_fsop_geom_health(struct xfs_mount *mp, struct xfs_fsop_geom *geo);
 void xfs_ag_geom_health(struct xfs_perag *pag, struct xfs_ag_geometry *ageo);
 void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 
+#define xfs_metadata_is_sick(error) \
+	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
+
 #endif	/* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 98961914dc01..0f5a4a591775 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -27,6 +27,7 @@
 #include "xfs_log.h"
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
+#include "xfs_health.h"
 
 /*
  * Lookup a record by ino in the btree given by cur.
@@ -2622,6 +2623,8 @@ xfs_read_agi(
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
 			XFS_AG_DADDR(mp, pag->pag_agno, XFS_AGI_DADDR(mp)),
 			XFS_FSS_TO_BB(mp, 1), 0, agibpp, &xfs_agi_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGI);
 	if (error)
 		return error;
 	if (tp)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 1eeecf2eb2a7..b3e8ab247b28 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1229,6 +1229,8 @@ xfs_sb_read_secondary(
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
 			XFS_AG_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
 			XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_agno_mark_sick(mp, agno, XFS_SICK_AG_SB);
 	if (error)
 		return error;
 	xfs_buf_set_ref(bp, XFS_SSB_REF);
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 3ef91601bc2b..ec987aebb042 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -198,6 +198,23 @@ xfs_rt_measure_sickness(
 	spin_unlock(&mp->m_sb_lock);
 }
 
+/* Mark unhealthy per-ag metadata given a raw AG number. */
+void
+xfs_agno_mark_sick(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	unsigned int		mask)
+{
+	struct xfs_perag	*pag = xfs_perag_get(mp, agno);
+
+	/* per-ag structure not set up yet? */
+	if (!pag)
+		return;
+
+	xfs_ag_mark_sick(pag, mask);
+	xfs_perag_put(pag);
+}
+
 /* Mark unhealthy per-ag metadata. */
 void
 xfs_ag_mark_sick(
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d6eeb59217b4..c238f43bd773 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -37,6 +37,7 @@
 #include "xfs_reflink.h"
 #include "xfs_ag.h"
 #include "xfs_log_priv.h"
+#include "xfs_health.h"
 
 struct kmem_cache *xfs_inode_cache;
 
@@ -799,6 +800,8 @@ xfs_init_new_inode(
 	 */
 	if ((pip && ino == pip->i_ino) || !xfs_verify_dir_ino(mp, ino)) {
 		xfs_alert(mp, "Allocated a known in-use inode 0x%llx!", ino);
+		xfs_agno_mark_sick(mp, XFS_INO_TO_AGNO(mp, ino),
+				XFS_SICK_AG_INOBT);
 		return -EFSCORRUPTED;
 	}
 
@@ -1965,8 +1968,11 @@ xfs_iunlink_update_backref(
 		return 0;
 
 	ip = xfs_iunlink_lookup(pag, next_agino);
-	if (!ip)
+	if (!ip) {
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGI);
 		return -EFSCORRUPTED;
+	}
+
 	ip->i_prev_unlinked = prev_agino;
 	return 0;
 }
@@ -2000,6 +2006,7 @@ xfs_iunlink_update_bucket(
 	 */
 	if (old_value == new_agino) {
 		xfs_buf_mark_corrupt(agibp);
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGI);
 		return -EFSCORRUPTED;
 	}
 
@@ -2033,6 +2040,7 @@ xfs_iunlink_insert_inode(
 	if (next_agino == agino ||
 	    !xfs_verify_agino_or_null(pag, next_agino)) {
 		xfs_buf_mark_corrupt(agibp);
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGI);
 		return -EFSCORRUPTED;
 	}
 
@@ -2117,6 +2125,7 @@ xfs_iunlink_remove_inode(
 	if (!xfs_verify_agino(pag, head_agino)) {
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 				agi, sizeof(*agi));
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGI);
 		return -EFSCORRUPTED;
 	}
 
@@ -2142,8 +2151,10 @@ xfs_iunlink_remove_inode(
 		struct xfs_inode	*prev_ip;
 
 		prev_ip = xfs_iunlink_lookup(pag, ip->i_prev_unlinked);
-		if (!prev_ip)
+		if (!prev_ip) {
+			xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 			return -EFSCORRUPTED;
+		}
 
 		error = xfs_iunlink_log_inode(tp, prev_ip, pag,
 				ip->i_next_unlinked);


^ permalink raw reply related	[flat|nested] 469+ messages in thread
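
One detail of the calling convention added here is worth spelling out: the
error from the metadata read is classified before the usual bail-out, so
only -EFSCORRUPTED and -EFSBADCRC (the two errors covered by the
xfs_metadata_is_sick() macro above) mark the AG, while -EIO, -ENOMEM and
other failures do not.  The fragment below repeats the shape of the
xfs_read_agf() hunk (tp, flags and agfbpp are that function's locals);
xfs_agno_mark_sick() exists for callers such as xfs_sb_read_secondary()
that hold only a raw AG number rather than an xfs_perag reference:

	/* Classify the failure before the generic error bail-out. */
	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
			XFS_AG_DADDR(mp, pag->pag_agno, XFS_AGF_DADDR(mp)),
			XFS_FSS_TO_BB(mp, 1), flags, agfbpp, &xfs_agf_buf_ops);
	if (xfs_metadata_is_sick(error))
		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGF);
	if (error)
		return error;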

* [PATCH 04/11] xfs: report block map corruption errors to the health tracking system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 05/11] xfs: report btree " Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 10/11] xfs: report realtime metadata corruption errors to the health system Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 11/11] xfs: report XFS_IS_CORRUPT " Darrick J. Wong
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter a corrupt block mapping, we should record that in
the health monitoring system so that it can be reported to userspace later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c   |   35 +++++++++++++++++++++++++++++------
 fs/xfs/libxfs/xfs_health.h |    1 +
 fs/xfs/xfs_health.c        |   26 ++++++++++++++++++++++++++
 fs/xfs/xfs_iomap.c         |   15 ++++++++++++---
 fs/xfs/xfs_reflink.c       |    6 +++++-
 5 files changed, 73 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2f626ad1f4b4..eb5b766a4d5a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -36,6 +36,7 @@
 #include "xfs_refcount.h"
 #include "xfs_icache.h"
 #include "xfs_iomap.h"
+#include "xfs_health.h"
 
 struct kmem_cache		*xfs_bmap_intent_cache;
 
@@ -971,6 +972,7 @@ xfs_bmap_add_attrfork_local(
 
 	/* should only be called for types that support local format data */
 	ASSERT(0);
+	xfs_bmap_mark_sick(ip, XFS_ATTR_FORK);
 	return -EFSCORRUPTED;
 }
 
@@ -1154,6 +1156,7 @@ xfs_iread_bmbt_block(
 				(unsigned long long)ip->i_ino);
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, block,
 				sizeof(*block), __this_address);
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -1169,6 +1172,7 @@ xfs_iread_bmbt_block(
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 					"xfs_iread_extents(2)", frp,
 					sizeof(*frp), fa);
+			xfs_bmap_mark_sick(ip, whichfork);
 			return xfs_bmap_complain_bad_rec(ip, whichfork, fa,
 					&new);
 		}
@@ -1218,6 +1222,8 @@ xfs_iread_extents(
 	ASSERT(ir.loaded == xfs_iext_count(ifp));
 	return 0;
 out:
+	if (xfs_metadata_is_sick(error))
+		xfs_bmap_mark_sick(ip, whichfork);
 	xfs_iext_destroy(ifp);
 	return error;
 }
@@ -1297,6 +1303,7 @@ xfs_bmap_last_before(
 		break;
 	default:
 		ASSERT(0);
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -3908,12 +3915,16 @@ xfs_bmapi_read(
 	ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK | XFS_BMAPI_ENTIRE)));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL));
 
-	if (WARN_ON_ONCE(!ifp))
+	if (WARN_ON_ONCE(!ifp)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
+	}
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
-	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT))
+	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
+	}
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -4394,6 +4405,7 @@ xfs_bmapi_write(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -4621,9 +4633,11 @@ xfs_bmapi_convert_delalloc(
 	error = -ENOSPC;
 	if (WARN_ON_ONCE(bma.blkno == NULLFSBLOCK))
 		goto out_finish;
-	error = -EFSCORRUPTED;
-	if (WARN_ON_ONCE(!xfs_valid_startblock(ip, bma.got.br_startblock)))
+	if (WARN_ON_ONCE(!xfs_valid_startblock(ip, bma.got.br_startblock))) {
+		xfs_bmap_mark_sick(ip, whichfork);
+		error = -EFSCORRUPTED;
 		goto out_finish;
+	}
 
 	XFS_STATS_ADD(mp, xs_xstrat_bytes, XFS_FSB_TO_B(mp, bma.length));
 	XFS_STATS_INC(mp, xs_xstrat_quick);
@@ -4682,6 +4696,7 @@ xfs_bmapi_remap(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -5320,8 +5335,10 @@ __xfs_bunmapi(
 	whichfork = xfs_bmapi_whichfork(flags);
 	ASSERT(whichfork != XFS_COW_FORK);
 	ifp = xfs_ifork_ptr(ip, whichfork);
-	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)))
+	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp))) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
+	}
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
@@ -5791,6 +5808,7 @@ xfs_bmap_collapse_extents(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -5906,6 +5924,7 @@ xfs_bmap_insert_extents(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -6009,6 +6028,7 @@ xfs_bmap_split_extent(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -6191,8 +6211,10 @@ xfs_bmap_finish_one(
 			bmap->br_startoff, bmap->br_blockcount,
 			bmap->br_state);
 
-	if (WARN_ON_ONCE(bi->bi_whichfork != XFS_DATA_FORK))
+	if (WARN_ON_ONCE(bi->bi_whichfork != XFS_DATA_FORK)) {
+		xfs_bmap_mark_sick(bi->bi_owner, bi->bi_whichfork);
 		return -EFSCORRUPTED;
+	}
 
 	if (XFS_TEST_ERROR(false, tp->t_mountp,
 			XFS_ERRTAG_BMAP_FINISH_ONE))
@@ -6210,6 +6232,7 @@ xfs_bmap_finish_one(
 		break;
 	default:
 		ASSERT(0);
+		xfs_bmap_mark_sick(bi->bi_owner, bi->bi_whichfork);
 		error = -EFSCORRUPTED;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 5a4995391ae7..8936176c38f1 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -142,6 +142,7 @@ void xfs_inode_measure_sickness(struct xfs_inode *ip, unsigned int *sick,
 		unsigned int *checked);
 
 void xfs_health_unmount(struct xfs_mount *mp);
+void xfs_bmap_mark_sick(struct xfs_inode *ip, int whichfork);
 
 /* Now some helpers. */
 
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index ec987aebb042..c60decd40e5e 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -461,3 +461,29 @@ xfs_bulkstat_health(
 			bs->bs_sick |= m->ioctl_mask;
 	}
 }
+
+/* Mark a block mapping sick. */
+void
+xfs_bmap_mark_sick(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	unsigned int		mask;
+
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+		mask = XFS_SICK_INO_BMBTD;
+		break;
+	case XFS_ATTR_FORK:
+		mask = XFS_SICK_INO_BMBTA;
+		break;
+	case XFS_COW_FORK:
+		mask = XFS_SICK_INO_BMBTC;
+		break;
+	default:
+		ASSERT(0);
+		return;
+	}
+
+	xfs_inode_mark_sick(ip, mask);
+}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index fc1946f80a4a..c2ba03281daf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -27,6 +27,7 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "xfs_health.h"
 
 #define XFS_ALLOC_ALIGN(mp, off) \
 	(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -45,6 +46,7 @@ xfs_alert_fsblock_zero(
 		(unsigned long long)imap->br_startoff,
 		(unsigned long long)imap->br_blockcount,
 		imap->br_state);
+	xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 	return -EFSCORRUPTED;
 }
 
@@ -99,8 +101,10 @@ xfs_bmbt_to_iomap(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
 
-	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
+	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock))) {
+		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 		return xfs_alert_fsblock_zero(ip, imap);
+	}
 
 	if (imap->br_startblock == HOLESTARTBLOCK) {
 		iomap->addr = IOMAP_NULL_ADDR;
@@ -325,8 +329,10 @@ xfs_iomap_write_direct(
 		goto out_unlock;
 	}
 
-	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
+	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock))) {
+		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 		error = xfs_alert_fsblock_zero(ip, imap);
+	}
 
 out_unlock:
 	*seq = xfs_iomap_inode_sequence(ip, 0);
@@ -639,8 +645,10 @@ xfs_iomap_write_unwritten(
 		if (error)
 			return error;
 
-		if (unlikely(!xfs_valid_startblock(ip, imap.br_startblock)))
+		if (unlikely(!xfs_valid_startblock(ip, imap.br_startblock))) {
+			xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 			return xfs_alert_fsblock_zero(ip, &imap);
+		}
 
 		if ((numblks_fsb = imap.br_blockcount) == 0) {
 			/*
@@ -986,6 +994,7 @@ xfs_buffered_write_iomap_begin(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 		error = -EFSCORRUPTED;
 		goto out_unlock;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 5535778a98f9..55604bbd25a4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -29,6 +29,7 @@
 #include "xfs_iomap.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
+#include "xfs_health.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -1223,8 +1224,10 @@ xfs_reflink_remap_extent(
 	 * extent if they're both holes or both the same physical extent.
 	 */
 	if (dmap->br_startblock == smap.br_startblock) {
-		if (dmap->br_state != smap.br_state)
+		if (dmap->br_state != smap.br_state) {
+			xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 			error = -EFSCORRUPTED;
+		}
 		goto out_cancel;
 	}
 
@@ -1387,6 +1390,7 @@ xfs_reflink_remap_blocks(
 		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
 		if (imap.br_startblock == DELAYSTARTBLOCK) {
 			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+			xfs_bmap_mark_sick(src, XFS_DATA_FORK);
 			error = -EFSCORRUPTED;
 			break;
 		}


^ permalink raw reply related	[flat|nested] 469+ messages in thread
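
The caller side stays simple because xfs_bmap_mark_sick() keeps the
fork-to-flag mapping (XFS_SICK_INO_BMBTD, _BMBTA, _BMBTC) in one place.  A
minimal sketch of the pattern this patch adds at each corruption check; the
surrounding code is illustrative rather than any specific call site:

	/*
	 * The fork should hold extents here; if it does not (or error
	 * injection says to pretend so), record the problem and bail out.
	 */
	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
		xfs_bmap_mark_sick(ip, whichfork);
		return -EFSCORRUPTED;
	}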

* [PATCH 05/11] xfs: report btree block corruption errors to the health system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 07/11] xfs: report symlink block corruption errors to the health system Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 04/11] xfs: report block map corruption errors to the health tracking system Darrick J. Wong
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter corrupt btree blocks, we should record that in the
health monitoring system so that it can be reported to userspace later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c    |    2 ++
 fs/xfs/libxfs/xfs_bmap.c     |    6 ++++++
 fs/xfs/libxfs/xfs_btree.c    |   25 ++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_health.h   |    2 ++
 fs/xfs/libxfs/xfs_ialloc.c   |    1 +
 fs/xfs/libxfs/xfs_refcount.c |    6 +++++-
 fs/xfs/libxfs/xfs_rmap.c     |    6 +++++-
 fs/xfs/xfs_health.c          |   38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 81 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 819a38170351..06562746fa75 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -287,6 +287,7 @@ xfs_alloc_complain_bad_rec(
 	xfs_warn(mp,
 		"start block 0x%x block count 0x%x", irec->ar_startblock,
 		irec->ar_blockcount);
+	xfs_btree_mark_sick(cur);
 	return -EFSCORRUPTED;
 }
 
@@ -2612,6 +2613,7 @@ xfs_exact_minlen_extent_available(
 		goto out;
 
 	if (*stat == 0) {
+		xfs_btree_mark_sick(cnt_cur);
 		error = -EFSCORRUPTED;
 		goto out;
 	}
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index eb5b766a4d5a..de5d7b842145 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -368,6 +368,8 @@ xfs_bmap_check_leaf_extents(
 			error = xfs_btree_read_bufl(mp, NULL, bno, &bp,
 						XFS_BMAP_BTREE_REF,
 						&xfs_bmbt_buf_ops);
+			if (xfs_metadata_is_sick(error))
+				xfs_btree_mark_sick(cur);
 			if (error)
 				goto error_norelse;
 		}
@@ -454,6 +456,8 @@ xfs_bmap_check_leaf_extents(
 			error = xfs_btree_read_bufl(mp, NULL, bno, &bp,
 						XFS_BMAP_BTREE_REF,
 						&xfs_bmbt_buf_ops);
+			if (xfs_metadata_is_sick(error))
+				xfs_btree_mark_sick(cur);
 			if (error)
 				goto error_norelse;
 		}
@@ -568,6 +572,8 @@ xfs_bmap_btree_to_extents(
 #endif
 	error = xfs_btree_read_bufl(mp, tp, cbno, &cbp, XFS_BMAP_BTREE_REF,
 				&xfs_bmbt_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_btree_mark_sick(cur);
 	if (error)
 		return error;
 	cblock = XFS_BUF_TO_BLOCK(cbp);
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index b63650a5d690..2f0e8450268f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -27,6 +27,7 @@
 #include "xfs_bmap_btree.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_health.h"
 
 /*
  * Btree magic numbers.
@@ -177,6 +178,7 @@ xfs_btree_check_lblock(
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BTREE_CHECK_LBLOCK)) {
 		if (bp)
 			trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
 	}
 	return 0;
@@ -243,6 +245,7 @@ xfs_btree_check_sblock(
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BTREE_CHECK_SBLOCK)) {
 		if (bp)
 			trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
 	}
 	return 0;
@@ -318,6 +321,7 @@ xfs_btree_check_ptr(
 				level, index);
 	}
 
+	xfs_btree_mark_sick(cur);
 	return -EFSCORRUPTED;
 }
 
@@ -498,6 +502,8 @@ xfs_btree_dup_cursor(
 						   xfs_buf_daddr(bp), mp->m_bsize,
 						   0, &bp,
 						   cur->bc_ops->buf_ops);
+			if (xfs_metadata_is_sick(error))
+				xfs_btree_mark_sick(new);
 			if (error) {
 				xfs_btree_del_cursor(new, error);
 				*ncur = NULL;
@@ -1351,6 +1357,8 @@ xfs_btree_read_buf_block(
 	error = xfs_trans_read_buf(mp, cur->bc_tp, mp->m_ddev_targp, d,
 				   mp->m_bsize, flags, bpp,
 				   cur->bc_ops->buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_btree_mark_sick(cur);
 	if (error)
 		return error;
 
@@ -1661,6 +1669,7 @@ xfs_btree_increment(
 		if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
 			goto out0;
 		ASSERT(0);
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1754,6 +1763,7 @@ xfs_btree_decrement(
 		if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
 			goto out0;
 		ASSERT(0);
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1846,6 +1856,7 @@ xfs_btree_lookup_get_block(
 	*blkp = NULL;
 	xfs_buf_mark_corrupt(bp);
 	xfs_trans_brelse(cur->bc_tp, bp);
+	xfs_btree_mark_sick(cur);
 	return -EFSCORRUPTED;
 }
 
@@ -1892,8 +1903,10 @@ xfs_btree_lookup(
 	XFS_BTREE_STATS_INC(cur, lookup);
 
 	/* No such thing as a zero-level tree. */
-	if (XFS_IS_CORRUPT(cur->bc_mp, cur->bc_nlevels == 0))
+	if (XFS_IS_CORRUPT(cur->bc_mp, cur->bc_nlevels == 0)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	block = NULL;
 	keyno = 0;
@@ -1936,6 +1949,7 @@ xfs_btree_lookup(
 							XFS_ERRLEVEL_LOW,
 							cur->bc_mp, block,
 							sizeof(*block));
+					xfs_btree_mark_sick(cur);
 					return -EFSCORRUPTED;
 				}
 
@@ -4355,12 +4369,16 @@ xfs_btree_visit_block(
 	 */
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		if (be64_to_cpu(rptr.l) == XFS_DADDR_TO_FSB(cur->bc_mp,
-							xfs_buf_daddr(bp)))
+							xfs_buf_daddr(bp))) {
+			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
+		}
 	} else {
 		if (be32_to_cpu(rptr.s) == xfs_daddr_to_agbno(cur->bc_mp,
-							xfs_buf_daddr(bp)))
+							xfs_buf_daddr(bp))) {
+			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
+		}
 	}
 	return xfs_btree_lookup_get_block(cur, level, &rptr, &block);
 }
@@ -5219,6 +5237,7 @@ xfs_btree_goto_left_edge(
 		return error;
 	if (stat != 0) {
 		ASSERT(0);
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 8936176c38f1..2ee06af82c5a 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -37,6 +37,7 @@ struct xfs_mount;
 struct xfs_perag;
 struct xfs_inode;
 struct xfs_fsop_geom;
+struct xfs_btree_cur;
 
 /* Observable health issues for metadata spanning the entire filesystem. */
 #define XFS_SICK_FS_COUNTERS	(1 << 0)  /* summary counters */
@@ -143,6 +144,7 @@ void xfs_inode_measure_sickness(struct xfs_inode *ip, unsigned int *sick,
 
 void xfs_health_unmount(struct xfs_mount *mp);
 void xfs_bmap_mark_sick(struct xfs_inode *ip, int whichfork);
+void xfs_btree_mark_sick(struct xfs_btree_cur *cur);
 
 /* Now some helpers. */
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 0f5a4a591775..e93e15153686 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -158,6 +158,7 @@ xfs_inobt_complain_bad_rec(
 "start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x",
 		irec->ir_startino, irec->ir_count, irec->ir_freecount,
 		irec->ir_free, irec->ir_holemask);
+	xfs_btree_mark_sick(cur);
 	return -EFSCORRUPTED;
 }
 
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index bd4cd1c65221..e8eda50a1ee1 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -23,6 +23,7 @@
 #include "xfs_refcount.h"
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
+#include "xfs_health.h"
 
 struct kmem_cache	*xfs_refcount_intent_cache;
 
@@ -164,6 +165,7 @@ xfs_refcount_complain_bad_rec(
 	xfs_warn(mp,
 		"Start block 0x%x, block count 0x%x, references 0x%x",
 		irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount);
+	xfs_btree_mark_sick(cur);
 	return -EFSCORRUPTED;
 }
 
@@ -1891,8 +1893,10 @@ xfs_refcount_recover_extent(
 	struct xfs_refcount_recovery	*rr;
 
 	if (XFS_IS_CORRUPT(cur->bc_mp,
-			   be32_to_cpu(rec->refc.rc_refcount) != 1))
+			   be32_to_cpu(rec->refc.rc_refcount) != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	rr = kmalloc(sizeof(struct xfs_refcount_recovery),
 			GFP_KERNEL | __GFP_NOFAIL);
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index a9f13d877822..0d41d1256fd9 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -23,6 +23,7 @@
 #include "xfs_error.h"
 #include "xfs_inode.h"
 #include "xfs_ag.h"
+#include "xfs_health.h"
 
 struct kmem_cache	*xfs_rmap_intent_cache;
 
@@ -56,8 +57,10 @@ xfs_rmap_lookup_le(
 	error = xfs_rmap_get_rec(cur, irec, &get_stat);
 	if (error)
 		return error;
-	if (!get_stat)
+	if (!get_stat) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	return 0;
 }
@@ -277,6 +280,7 @@ xfs_rmap_complain_bad_rec(
 		"Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x",
 		irec->rm_owner, irec->rm_flags, irec->rm_startblock,
 		irec->rm_blockcount);
+	xfs_btree_mark_sick(cur);
 	return -EFSCORRUPTED;
 }
 
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index c60decd40e5e..89cd4d21065e 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -14,6 +14,7 @@
 #include "xfs_trace.h"
 #include "xfs_health.h"
 #include "xfs_ag.h"
+#include "xfs_btree.h"
 
 /*
  * Warn about metadata corruption that we detected but haven't fixed, and
@@ -487,3 +488,40 @@ xfs_bmap_mark_sick(
 
 	xfs_inode_mark_sick(ip, mask);
 }
+
+/* Record observations of btree corruption with the health tracking system. */
+void
+xfs_btree_mark_sick(
+	struct xfs_btree_cur		*cur)
+{
+	unsigned int			mask;
+
+	switch (cur->bc_btnum) {
+	case XFS_BTNUM_BMAP:
+		xfs_bmap_mark_sick(cur->bc_ino.ip, cur->bc_ino.whichfork);
+		return;
+	case XFS_BTNUM_BNO:
+		mask = XFS_SICK_AG_BNOBT;
+		break;
+	case XFS_BTNUM_CNT:
+		mask = XFS_SICK_AG_CNTBT;
+		break;
+	case XFS_BTNUM_INO:
+		mask = XFS_SICK_AG_INOBT;
+		break;
+	case XFS_BTNUM_FINO:
+		mask = XFS_SICK_AG_FINOBT;
+		break;
+	case XFS_BTNUM_RMAP:
+		mask = XFS_SICK_AG_RMAPBT;
+		break;
+	case XFS_BTNUM_REFC:
+		mask = XFS_SICK_AG_REFCNTBT;
+		break;
+	default:
+		ASSERT(0);
+		return;
+	}
+
+	xfs_ag_mark_sick(cur->bc_ag.pag, mask);
+}


^ permalink raw reply related	[flat|nested] 469+ messages in thread
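
xfs_btree_mark_sick() means callers only need the cursor in hand: the helper
inspects bc_btnum to decide whether the damage belongs to the inode (bmbt
cursors are redirected to xfs_bmap_mark_sick()) or to the per-AG health
flags for the AG btrees (bno, cnt, ino, fino, rmap, refcount).  A minimal
sketch of the call pattern, borrowing the lookup check from the
xfs_rmap_lookup_le() hunk above (irec and get_stat are that function's
locals):

	error = xfs_rmap_get_rec(cur, irec, &get_stat);
	if (error)
		return error;
	if (!get_stat) {
		/* A matching record must exist; blame this AG's rmapbt. */
		xfs_btree_mark_sick(cur);
		return -EFSCORRUPTED;
	}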

* [PATCH 06/11] xfs: report dir/attr block corruption errors to the health system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 08/11] xfs: report inode corruption errors to the health system Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 09/11] xfs: report quota " Darrick J. Wong
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter corrupt directory or extended attribute blocks, we
should record that in the health monitoring system so that it can be
reported to userspace later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_leaf.c   |    4 ++++
 fs/xfs/libxfs/xfs_attr_remote.c |   27 ++++++++++++++++-----------
 fs/xfs/libxfs/xfs_da_btree.c    |   37 ++++++++++++++++++++++++++++++++-----
 fs/xfs/libxfs/xfs_dir2.c        |    5 ++++-
 fs/xfs/libxfs/xfs_dir2_block.c  |    2 ++
 fs/xfs/libxfs/xfs_dir2_data.c   |    3 +++
 fs/xfs/libxfs/xfs_dir2_leaf.c   |    3 +++
 fs/xfs/libxfs/xfs_dir2_node.c   |    7 +++++++
 fs/xfs/libxfs/xfs_health.h      |    3 +++
 fs/xfs/xfs_attr_inactive.c      |    4 ++++
 fs/xfs/xfs_attr_list.c          |    9 ++++++++-
 fs/xfs/xfs_health.c             |   39 +++++++++++++++++++++++++++++++++++++++
 12 files changed, 125 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 40aba0a9d339..1ba16c369f26 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -29,6 +29,7 @@
 #include "xfs_log.h"
 #include "xfs_ag.h"
 #include "xfs_errortag.h"
+#include "xfs_health.h"
 
 
 /*
@@ -2437,6 +2438,7 @@ xfs_attr3_leaf_lookup_int(
 	entries = xfs_attr3_leaf_entryp(leaf);
 	if (ichdr.count >= args->geo->blksize / 8) {
 		xfs_buf_mark_corrupt(bp);
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
 	}
 
@@ -2456,10 +2458,12 @@ xfs_attr3_leaf_lookup_int(
 	}
 	if (!(probe >= 0 && (!ichdr.count || probe < ichdr.count))) {
 		xfs_buf_mark_corrupt(bp);
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
 	}
 	if (!(span <= 4 || be32_to_cpu(entry->hashval) == hashval)) {
 		xfs_buf_mark_corrupt(bp);
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index d440393b40eb..b18a3cf44192 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -22,6 +22,7 @@
 #include "xfs_attr_remote.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
+#include "xfs_health.h"
 
 #define ATTR_RMTVALUE_MAPSIZE	1	/* # of map entries at once */
 
@@ -276,17 +277,18 @@ xfs_attr3_rmt_hdr_set(
  */
 STATIC int
 xfs_attr_rmtval_copyout(
-	struct xfs_mount *mp,
-	struct xfs_buf	*bp,
-	xfs_ino_t	ino,
-	int		*offset,
-	int		*valuelen,
-	uint8_t		**dst)
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct xfs_inode	*dp,
+	int			*offset,
+	int			*valuelen,
+	uint8_t			**dst)
 {
-	char		*src = bp->b_addr;
-	xfs_daddr_t	bno = xfs_buf_daddr(bp);
-	int		len = BBTOB(bp->b_length);
-	int		blksize = mp->m_attr_geo->blksize;
+	char			*src = bp->b_addr;
+	xfs_ino_t		ino = dp->i_ino;
+	xfs_daddr_t		bno = xfs_buf_daddr(bp);
+	int			len = BBTOB(bp->b_length);
+	int			blksize = mp->m_attr_geo->blksize;
 
 	ASSERT(len >= blksize);
 
@@ -302,6 +304,7 @@ xfs_attr_rmtval_copyout(
 				xfs_alert(mp,
 "remote attribute header mismatch bno/off/len/owner (0x%llx/0x%x/Ox%x/0x%llx)",
 					bno, *offset, byte_cnt, ino);
+				xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 				return -EFSCORRUPTED;
 			}
 			hdr_size = sizeof(struct xfs_attr3_rmt_hdr);
@@ -418,10 +421,12 @@ xfs_attr_rmtval_get(
 			dblkcnt = XFS_FSB_TO_BB(mp, map[i].br_blockcount);
 			error = xfs_buf_read(mp->m_ddev_targp, dblkno, dblkcnt,
 					0, &bp, &xfs_attr3_rmt_buf_ops);
+			if (xfs_metadata_is_sick(error))
+				xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
 			if (error)
 				return error;
 
-			error = xfs_attr_rmtval_copyout(mp, bp, args->dp->i_ino,
+			error = xfs_attr_rmtval_copyout(mp, bp, args->dp,
 							&offset, &valuelen,
 							&dst);
 			xfs_buf_relse(bp);
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index e576560b46e9..8d2992413047 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -23,6 +23,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 #include "xfs_errortag.h"
+#include "xfs_health.h"
 
 /*
  * xfs_da_btree.c
@@ -352,6 +353,8 @@ const struct xfs_buf_ops xfs_da3_node_buf_ops = {
 static int
 xfs_da3_node_set_type(
 	struct xfs_trans	*tp,
+	struct xfs_inode	*dp,
+	int			whichfork,
 	struct xfs_buf		*bp)
 {
 	struct xfs_da_blkinfo	*info = bp->b_addr;
@@ -373,6 +376,7 @@ xfs_da3_node_set_type(
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, tp->t_mountp,
 				info, sizeof(*info));
 		xfs_trans_brelse(tp, bp);
+		xfs_dirattr_mark_sick(dp, whichfork);
 		return -EFSCORRUPTED;
 	}
 }
@@ -391,7 +395,7 @@ xfs_da3_node_read(
 			&xfs_da3_node_buf_ops);
 	if (error || !*bpp || !tp)
 		return error;
-	return xfs_da3_node_set_type(tp, *bpp);
+	return xfs_da3_node_set_type(tp, dp, whichfork, *bpp);
 }
 
 int
@@ -408,6 +412,8 @@ xfs_da3_node_read_mapped(
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, mappedbno,
 			XFS_FSB_TO_BB(mp, xfs_dabuf_nfsb(mp, whichfork)), 0,
 			bpp, &xfs_da3_node_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_dirattr_mark_sick(dp, whichfork);
 	if (error || !*bpp)
 		return error;
 
@@ -418,7 +424,7 @@ xfs_da3_node_read_mapped(
 
 	if (!tp)
 		return 0;
-	return xfs_da3_node_set_type(tp, *bpp);
+	return xfs_da3_node_set_type(tp, dp, whichfork, *bpp);
 }
 
 /*========================================================================
@@ -612,6 +618,7 @@ xfs_da3_split(
 	if (node->hdr.info.forw) {
 		if (be32_to_cpu(node->hdr.info.forw) != addblk->blkno) {
 			xfs_buf_mark_corrupt(oldblk->bp);
+			xfs_da_mark_sick(state->args);
 			error = -EFSCORRUPTED;
 			goto out;
 		}
@@ -625,6 +632,7 @@ xfs_da3_split(
 	if (node->hdr.info.back) {
 		if (be32_to_cpu(node->hdr.info.back) != addblk->blkno) {
 			xfs_buf_mark_corrupt(oldblk->bp);
+			xfs_da_mark_sick(state->args);
 			error = -EFSCORRUPTED;
 			goto out;
 		}
@@ -1643,6 +1651,7 @@ xfs_da3_node_lookup_int(
 
 		if (magic != XFS_DA_NODE_MAGIC && magic != XFS_DA3_NODE_MAGIC) {
 			xfs_buf_mark_corrupt(blk->bp);
+			xfs_da_mark_sick(args);
 			return -EFSCORRUPTED;
 		}
 
@@ -1658,6 +1667,7 @@ xfs_da3_node_lookup_int(
 		/* Tree taller than we can handle; bail out! */
 		if (nodehdr.level >= XFS_DA_NODE_MAXDEPTH) {
 			xfs_buf_mark_corrupt(blk->bp);
+			xfs_da_mark_sick(args);
 			return -EFSCORRUPTED;
 		}
 
@@ -1666,6 +1676,7 @@ xfs_da3_node_lookup_int(
 			expected_level = nodehdr.level - 1;
 		else if (expected_level != nodehdr.level) {
 			xfs_buf_mark_corrupt(blk->bp);
+			xfs_da_mark_sick(args);
 			return -EFSCORRUPTED;
 		} else
 			expected_level--;
@@ -1717,12 +1728,16 @@ xfs_da3_node_lookup_int(
 		}
 
 		/* We can't point back to the root. */
-		if (XFS_IS_CORRUPT(dp->i_mount, blkno == args->geo->leafblk))
+		if (XFS_IS_CORRUPT(dp->i_mount, blkno == args->geo->leafblk)) {
+			xfs_da_mark_sick(args);
 			return -EFSCORRUPTED;
+		}
 	}
 
-	if (XFS_IS_CORRUPT(dp->i_mount, expected_level != 0))
+	if (XFS_IS_CORRUPT(dp->i_mount, expected_level != 0)) {
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
+	}
 
 	/*
 	 * A leaf block that ends in the hashval that we are interested in
@@ -1740,6 +1755,7 @@ xfs_da3_node_lookup_int(
 			args->blkno = blk->blkno;
 		} else {
 			ASSERT(0);
+			xfs_da_mark_sick(args);
 			return -EFSCORRUPTED;
 		}
 		if (((retval == -ENOENT) || (retval == -ENOATTR)) &&
@@ -2305,8 +2321,10 @@ xfs_da3_swap_lastblock(
 	error = xfs_bmap_last_before(tp, dp, &lastoff, w);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(mp, lastoff == 0))
+	if (XFS_IS_CORRUPT(mp, lastoff == 0)) {
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
+	}
 	/*
 	 * Read the last block in the btree space.
 	 */
@@ -2355,6 +2373,7 @@ xfs_da3_swap_lastblock(
 		if (XFS_IS_CORRUPT(mp,
 				   be32_to_cpu(sib_info->forw) != last_blkno ||
 				   sib_info->magic != dead_info->magic)) {
+			xfs_da_mark_sick(args);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -2375,6 +2394,7 @@ xfs_da3_swap_lastblock(
 		if (XFS_IS_CORRUPT(mp,
 				   be32_to_cpu(sib_info->back) != last_blkno ||
 				   sib_info->magic != dead_info->magic)) {
+			xfs_da_mark_sick(args);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -2397,6 +2417,7 @@ xfs_da3_swap_lastblock(
 		xfs_da3_node_hdr_from_disk(dp->i_mount, &par_hdr, par_node);
 		if (XFS_IS_CORRUPT(mp,
 				   level >= 0 && level != par_hdr.level + 1)) {
+			xfs_da_mark_sick(args);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -2408,6 +2429,7 @@ xfs_da3_swap_lastblock(
 		     entno++)
 			continue;
 		if (XFS_IS_CORRUPT(mp, entno == par_hdr.count)) {
+			xfs_da_mark_sick(args);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -2433,6 +2455,7 @@ xfs_da3_swap_lastblock(
 		xfs_trans_brelse(tp, par_buf);
 		par_buf = NULL;
 		if (XFS_IS_CORRUPT(mp, par_blkno == 0)) {
+			xfs_da_mark_sick(args);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -2442,6 +2465,7 @@ xfs_da3_swap_lastblock(
 		par_node = par_buf->b_addr;
 		xfs_da3_node_hdr_from_disk(dp->i_mount, &par_hdr, par_node);
 		if (XFS_IS_CORRUPT(mp, par_hdr.level != level)) {
+			xfs_da_mark_sick(args);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -2570,6 +2594,7 @@ xfs_dabuf_map(
 invalid_mapping:
 	/* Caller ok with no mapping. */
 	if (XFS_IS_CORRUPT(mp, !(flags & XFS_DABUF_MAP_HOLE_OK))) {
+		xfs_dirattr_mark_sick(dp, whichfork);
 		error = -EFSCORRUPTED;
 		if (xfs_error_level >= XFS_ERRLEVEL_LOW) {
 			xfs_alert(mp, "%s: bno %u inode %llu",
@@ -2651,6 +2676,8 @@ xfs_da_read_buf(
 
 	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
 			&bp, ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_dirattr_mark_sick(dp, whichfork);
 	if (error)
 		goto out_free;
 
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index ee30044af39d..f56710005f43 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -18,6 +18,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_trace.h"
+#include "xfs_health.h"
 
 const struct xfs_name xfs_name_dotdot = {
 	.name	= (const unsigned char *)"..",
@@ -631,8 +632,10 @@ xfs_dir2_isblock(
 		return 0;
 
 	*isblock = true;
-	if (XFS_IS_CORRUPT(mp, args->dp->i_disk_size != args->geo->blksize))
+	if (XFS_IS_CORRUPT(mp, args->dp->i_disk_size != args->geo->blksize)) {
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
+	}
 	return 0;
 }
 
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 00f960a703b2..6b3ca2b384cf 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -20,6 +20,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_health.h"
 
 /*
  * Local function prototypes.
@@ -152,6 +153,7 @@ xfs_dir3_block_read(
 		__xfs_buf_mark_corrupt(*bpp, fa);
 		xfs_trans_brelse(tp, *bpp);
 		*bpp = NULL;
+		xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
 		return -EFSCORRUPTED;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index dbcf58979a59..7a6d965bea71 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -18,6 +18,7 @@
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
+#include "xfs_health.h"
 
 static xfs_failaddr_t xfs_dir2_data_freefind_verify(
 		struct xfs_dir2_data_hdr *hdr, struct xfs_dir2_data_free *bf,
@@ -433,6 +434,7 @@ xfs_dir3_data_read(
 		__xfs_buf_mark_corrupt(*bpp, fa);
 		xfs_trans_brelse(tp, *bpp);
 		*bpp = NULL;
+		xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
 		return -EFSCORRUPTED;
 	}
 
@@ -1198,6 +1200,7 @@ xfs_dir2_data_use_free(
 corrupt:
 	xfs_corruption_error(__func__, XFS_ERRLEVEL_LOW, args->dp->i_mount,
 			hdr, sizeof(*hdr), __FILE__, __LINE__, fa);
+	xfs_da_mark_sick(args);
 	return -EFSCORRUPTED;
 }
 
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index cb9e950a911d..08dda5ce9d91 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -19,6 +19,7 @@
 #include "xfs_trace.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
+#include "xfs_health.h"
 
 /*
  * Local function declarations.
@@ -1393,8 +1394,10 @@ xfs_dir2_leaf_removename(
 	bestsp = xfs_dir2_leaf_bests_p(ltp);
 	if (be16_to_cpu(bestsp[db]) != oldbest) {
 		xfs_buf_mark_corrupt(lbp);
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
 	}
+
 	/*
 	 * Mark the former data entry unused.
 	 */
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 7a03aeb9f4c9..be0b8834028c 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -20,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
+#include "xfs_health.h"
 
 /*
  * Function declarations.
@@ -231,6 +232,7 @@ __xfs_dir3_free_read(
 		__xfs_buf_mark_corrupt(*bpp, fa);
 		xfs_trans_brelse(tp, *bpp);
 		*bpp = NULL;
+		xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
 		return -EFSCORRUPTED;
 	}
 
@@ -443,6 +445,7 @@ xfs_dir2_leaf_to_node(
 	if (be32_to_cpu(ltp->bestcount) >
 				(uint)dp->i_disk_size / args->geo->blksize) {
 		xfs_buf_mark_corrupt(lbp);
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
 	}
 
@@ -517,6 +520,7 @@ xfs_dir2_leafn_add(
 	 */
 	if (index < 0) {
 		xfs_buf_mark_corrupt(bp);
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
 	}
 
@@ -736,6 +740,7 @@ xfs_dir2_leafn_lookup_for_addname(
 					   cpu_to_be16(NULLDATAOFF))) {
 				if (curfdb != newfdb)
 					xfs_trans_brelse(tp, curbp);
+				xfs_da_mark_sick(args);
 				return -EFSCORRUPTED;
 			}
 			curfdb = newfdb;
@@ -804,6 +809,7 @@ xfs_dir2_leafn_lookup_for_entry(
 	xfs_dir3_leaf_check(dp, bp);
 	if (leafhdr.count <= 0) {
 		xfs_buf_mark_corrupt(bp);
+		xfs_da_mark_sick(args);
 		return -EFSCORRUPTED;
 	}
 
@@ -1739,6 +1745,7 @@ xfs_dir2_node_add_datablk(
 			} else {
 				xfs_alert(mp, " ... fblk is NULL");
 			}
+			xfs_da_mark_sick(args);
 			return -EFSCORRUPTED;
 		}
 
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 2ee06af82c5a..e2e1b95ddfb9 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -38,6 +38,7 @@ struct xfs_perag;
 struct xfs_inode;
 struct xfs_fsop_geom;
 struct xfs_btree_cur;
+struct xfs_da_args;
 
 /* Observable health issues for metadata spanning the entire filesystem. */
 #define XFS_SICK_FS_COUNTERS	(1 << 0)  /* summary counters */
@@ -145,6 +146,8 @@ void xfs_inode_measure_sickness(struct xfs_inode *ip, unsigned int *sick,
 void xfs_health_unmount(struct xfs_mount *mp);
 void xfs_bmap_mark_sick(struct xfs_inode *ip, int whichfork);
 void xfs_btree_mark_sick(struct xfs_btree_cur *cur);
+void xfs_dirattr_mark_sick(struct xfs_inode *ip, int whichfork);
+void xfs_da_mark_sick(struct xfs_da_args *args);
 
 /* Now some helpers. */
 
diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
index 5db87b34fb6e..6ced1a657b1b 100644
--- a/fs/xfs/xfs_attr_inactive.c
+++ b/fs/xfs/xfs_attr_inactive.c
@@ -23,6 +23,7 @@
 #include "xfs_quota.h"
 #include "xfs_dir2.h"
 #include "xfs_error.h"
+#include "xfs_health.h"
 
 /*
  * Invalidate any incore buffers associated with this remote attribute value
@@ -147,6 +148,7 @@ xfs_attr3_node_inactive(
 	if (level > XFS_DA_NODE_MAXDEPTH) {
 		xfs_buf_mark_corrupt(bp);
 		xfs_trans_brelse(*trans, bp);	/* no locks for later trans */
+		xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 		return -EFSCORRUPTED;
 	}
 
@@ -197,6 +199,7 @@ xfs_attr3_node_inactive(
 		default:
 			xfs_buf_mark_corrupt(child_bp);
 			xfs_trans_brelse(*trans, child_bp);
+			xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 			error = -EFSCORRUPTED;
 			break;
 		}
@@ -286,6 +289,7 @@ xfs_attr3_root_inactive(
 		error = xfs_attr3_leaf_inactive(trans, dp, bp);
 		break;
 	default:
+		xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 		error = -EFSCORRUPTED;
 		xfs_buf_mark_corrupt(bp);
 		xfs_trans_brelse(*trans, bp);
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index 99bbbe1a0e44..305559bfe2a1 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -22,6 +22,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_dir2.h"
+#include "xfs_health.h"
 
 STATIC int
 xfs_attr_shortform_compare(const void *a, const void *b)
@@ -126,6 +127,7 @@ xfs_attr_shortform_list(
 					     context->dp->i_mount, sfe,
 					     sizeof(*sfe));
 			kmem_free(sbuf);
+			xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 			return -EFSCORRUPTED;
 		}
 
@@ -263,8 +265,10 @@ xfs_attr_node_list_lookup(
 			return 0;
 
 		/* We can't point back to the root. */
-		if (XFS_IS_CORRUPT(mp, cursor->blkno == 0))
+		if (XFS_IS_CORRUPT(mp, cursor->blkno == 0)) {
+			xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 			return -EFSCORRUPTED;
+		}
 	}
 
 	if (expected_level != 0)
@@ -276,6 +280,7 @@ xfs_attr_node_list_lookup(
 out_corruptbuf:
 	xfs_buf_mark_corrupt(bp);
 	xfs_trans_brelse(tp, bp);
+	xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 	return -EFSCORRUPTED;
 }
 
@@ -305,6 +310,8 @@ xfs_attr_node_list(
 	if (cursor->blkno > 0) {
 		error = xfs_da3_node_read(context->tp, dp, cursor->blkno, &bp,
 				XFS_ATTR_FORK);
+		if (xfs_metadata_is_sick(error))
+			xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 		if ((error != 0) && (error != -EFSCORRUPTED))
 			return error;
 		if (bp) {
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 89cd4d21065e..50b054e4751b 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -15,6 +15,8 @@
 #include "xfs_health.h"
 #include "xfs_ag.h"
 #include "xfs_btree.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
 
 /*
  * Warn about metadata corruption that we detected but haven't fixed, and
@@ -525,3 +527,40 @@ xfs_btree_mark_sick(
 
 	xfs_ag_mark_sick(cur->bc_ag.pag, mask);
 }
+
+/*
+ * Record observations of dir/attr btree corruption with the health tracking
+ * system.
+ */
+void
+xfs_dirattr_mark_sick(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	unsigned int		mask;
+
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+		mask = XFS_SICK_INO_DIR;
+		break;
+	case XFS_ATTR_FORK:
+		mask = XFS_SICK_INO_XATTR;
+		break;
+	default:
+		ASSERT(0);
+		return;
+	}
+
+	xfs_inode_mark_sick(ip, mask);
+}
+
+/*
+ * Record observations of dir/attr btree corruption with the health tracking
+ * system.
+ */
+void
+xfs_da_mark_sick(
+	struct xfs_da_args	*args)
+{
+	xfs_dirattr_mark_sick(args->dp, args->whichfork);
+}


^ permalink raw reply related	[flat|nested] 469+ messages in thread
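
Two helpers are added here because callers hold different amounts of
context: code with a struct xfs_da_args calls xfs_da_mark_sick(args), which
is shorthand for xfs_dirattr_mark_sick(args->dp, args->whichfork), while
read paths that only have the inode and fork call the latter directly.
Either way the damage lands in XFS_SICK_INO_DIR or XFS_SICK_INO_XATTR.  A
minimal sketch built from the hunks above (ichdr, bp, dp and error are the
locals of those functions):

	/* Lookup path: a da_args is available. */
	if (ichdr.count >= args->geo->blksize / 8) {
		xfs_buf_mark_corrupt(bp);
		xfs_da_mark_sick(args);
		return -EFSCORRUPTED;
	}

	/* Read path: only the inode and fork are in hand. */
	if (xfs_metadata_is_sick(error))
		xfs_dirattr_mark_sick(dp, whichfork);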

* [PATCH 07/11] xfs: report symlink block corruption errors to the health system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 03/11] xfs: report ag header corruption errors to the health tracking system Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 05/11] xfs: report btree " Darrick J. Wong
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter corrupt symbolic link blocks, we should record
that in the health monitoring system so that it can be reported to
userspace later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_symlink.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 8241c0fcd0ba..90de8991cd94 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -23,6 +23,7 @@
 #include "xfs_trans.h"
 #include "xfs_ialloc.h"
 #include "xfs_error.h"
+#include "xfs_health.h"
 
 /* ----- Kernel only functions below ----- */
 int
@@ -57,6 +58,8 @@ xfs_readlink_bmap_ilocked(
 
 		error = xfs_buf_read(mp->m_ddev_targp, d, BTOBB(byte_cnt), 0,
 				&bp, &xfs_symlink_buf_ops);
+		if (xfs_metadata_is_sick(error))
+			xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
 		if (error)
 			return error;
 		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
@@ -67,6 +70,7 @@ xfs_readlink_bmap_ilocked(
 		if (xfs_has_crc(mp)) {
 			if (!xfs_symlink_hdr_ok(ip->i_ino, offset,
 							byte_cnt, bp)) {
+				xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
 				error = -EFSCORRUPTED;
 				xfs_alert(mp,
 "symlink header does not match required off/len/owner (0x%x/Ox%x,0x%llx)",
@@ -120,6 +124,7 @@ xfs_readlink(
 			 __func__, (unsigned long long) ip->i_ino,
 			 (long long) pathlen);
 		ASSERT(0);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
 		goto out;
 	}
 
@@ -128,8 +133,10 @@ xfs_readlink(
 		 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED
 		 * if if_data is junk.
 		 */
-		if (XFS_IS_CORRUPT(ip->i_mount, !ip->i_df.if_u1.if_data))
+		if (XFS_IS_CORRUPT(ip->i_mount, !ip->i_df.if_u1.if_data)) {
+			xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
 			goto out;
+		}
 
 		memcpy(link, ip->i_df.if_u1.if_data, pathlen + 1);
 		error = 0;
@@ -139,6 +146,8 @@ xfs_readlink(
 
  out:
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (error == -EFSCORRUPTED)
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
 	return error;
 }
 
@@ -494,6 +503,7 @@ xfs_inactive_symlink(
 			 __func__, (unsigned long long)ip->i_ino, pathlen);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		ASSERT(0);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
 		return -EFSCORRUPTED;
 	}
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread
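
Note the shape of the xfs_readlink() change above: in addition to the
per-site calls, the common out: label marks the inode whenever the function
is about to return -EFSCORRUPTED, which also catches corruption reported by
deeper helpers.  A minimal sketch of that exit-label pattern (the function
body is heavily abbreviated):

	error = xfs_readlink_bmap_ilocked(ip, link);

 out:
	xfs_iunlock(ip, XFS_ILOCK_SHARED);
	/* Any path that set -EFSCORRUPTED marks the symlink sick on exit. */
	if (error == -EFSCORRUPTED)
		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
	return error;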

* [PATCH 08/11] xfs: report inode corruption errors to the health system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 01/11] xfs: separate the marking of sick and checked metadata Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 02/11] xfs: report fs corruption errors to the health tracking system Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 06/11] xfs: report dir/attr block " Darrick J. Wong
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter corrupt inode records, we should record that in
the health monitoring system so that it can be reported to userspace later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c     |    1 +
 fs/xfs/libxfs/xfs_inode_buf.c  |   12 +++++++++---
 fs/xfs/libxfs/xfs_inode_fork.c |    8 ++++++++
 fs/xfs/xfs_icache.c            |    9 +++++++++
 fs/xfs/xfs_inode.c             |    2 ++
 5 files changed, 29 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e93e15153686..ddfe365666ba 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3019,6 +3019,7 @@ xfs_ialloc_check_shrink(
 		goto out;
 
 	if (!has) {
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_INOBT);
 		error = -EFSCORRUPTED;
 		goto out;
 	}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 758aacd8166b..992ce2d5b9d0 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -18,6 +18,7 @@
 #include "xfs_trans.h"
 #include "xfs_ialloc.h"
 #include "xfs_dir2.h"
+#include "xfs_health.h"
 
 #include <linux/iversion.h>
 
@@ -132,9 +133,14 @@ xfs_imap_to_bp(
 	struct xfs_imap		*imap,
 	struct xfs_buf		**bpp)
 {
-	return xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
-				   imap->im_len, XBF_UNMAPPED, bpp,
-				   &xfs_inode_buf_ops);
+	int			error;
+
+	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
+			imap->im_len, XBF_UNMAPPED, bpp, &xfs_inode_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_agno_mark_sick(mp, xfs_daddr_to_agno(mp, imap->im_blkno),
+				XFS_SICK_AG_INOBT);
+	return error;
 }
 
 static inline struct timespec64 xfs_inode_decode_bigtime(uint64_t ts)
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index a175e7f0f30f..6d23add33de9 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -25,6 +25,7 @@
 #include "xfs_attr_leaf.h"
 #include "xfs_types.h"
 #include "xfs_errortag.h"
+#include "xfs_health.h"
 
 struct kmem_cache *xfs_ifork_cache;
 
@@ -84,6 +85,7 @@ xfs_iformat_local(
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 				"xfs_iformat_local", dip, sizeof(*dip),
 				__this_address);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 		return -EFSCORRUPTED;
 	}
 
@@ -121,6 +123,7 @@ xfs_iformat_extents(
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 				"xfs_iformat_extents(1)", dip, sizeof(*dip),
 				__this_address);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 		return -EFSCORRUPTED;
 	}
 
@@ -140,6 +143,7 @@ xfs_iformat_extents(
 				xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 						"xfs_iformat_extents(2)",
 						dp, sizeof(*dp), fa);
+				xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 				return xfs_bmap_complain_bad_rec(ip, whichfork,
 						fa, &new);
 			}
@@ -198,6 +202,7 @@ xfs_iformat_btree(
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 				"xfs_iformat_btree", dfp, size,
 				__this_address);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 		return -EFSCORRUPTED;
 	}
 
@@ -257,12 +262,14 @@ xfs_iformat_data_fork(
 		default:
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
 					dip, sizeof(*dip), __this_address);
+			xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 			return -EFSCORRUPTED;
 		}
 		break;
 	default:
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
 				sizeof(*dip), __this_address);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 		return -EFSCORRUPTED;
 	}
 }
@@ -326,6 +333,7 @@ xfs_iformat_attr_fork(
 	default:
 		xfs_inode_verifier_error(ip, error, __func__, dip,
 				sizeof(*dip), __this_address);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 		error = -EFSCORRUPTED;
 		break;
 	}
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c602c0f98053..e94c193cd417 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -24,6 +24,7 @@
 #include "xfs_ialloc.h"
 #include "xfs_ag.h"
 #include "xfs_log_priv.h"
+#include "xfs_health.h"
 
 #include <linux/iversion.h>
 
@@ -415,6 +416,9 @@ xfs_iget_check_free_state(
 			xfs_warn(ip->i_mount,
 "Corruption detected! Free inode 0x%llx not marked free! (mode 0x%x)",
 				ip->i_ino, VFS_I(ip)->i_mode);
+			xfs_agno_mark_sick(ip->i_mount,
+					XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
+					XFS_SICK_AG_INOBT);
 			return -EFSCORRUPTED;
 		}
 
@@ -422,6 +426,9 @@ xfs_iget_check_free_state(
 			xfs_warn(ip->i_mount,
 "Corruption detected! Free inode 0x%llx has blocks allocated!",
 				ip->i_ino);
+			xfs_agno_mark_sick(ip->i_mount,
+					XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
+					XFS_SICK_AG_INOBT);
 			return -EFSCORRUPTED;
 		}
 		return 0;
@@ -614,6 +621,8 @@ xfs_iget_cache_miss(
 				xfs_buf_offset(bp, ip->i_imap.im_boffset));
 		if (!error)
 			xfs_buf_set_ref(bp, XFS_INO_REF);
+		else
+			xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 		xfs_trans_brelse(tp, bp);
 
 		if (error)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c238f43bd773..2a1eee807f15 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3440,6 +3440,8 @@ xfs_iflush(
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
+	if (error)
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_CORE);
 	return error;
 }
 



* [PATCH 09/11] xfs: report quota block corruption errors to the health system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 06/11] xfs: report dir/attr block " Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 03/11] xfs: report ag header corruption errors to the health tracking system Darrick J. Wong
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter corrupt quota blocks, we should record that in the
health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_dquot.c  |   30 ++++++++++++++++++++++++++++++
 fs/xfs/xfs_health.c |    1 +
 fs/xfs/xfs_qm.c     |    8 ++++++--
 3 files changed, 37 insertions(+), 2 deletions(-)
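
The new xfs_dquot_mark_sick() helper maps the dquot type to the
corresponding filesystem-level sickness flag (XFS_SICK_FS_UQUOTA,
_GQUOTA, or _PQUOTA), so each call site only has to classify the error.
Roughly, the read path in xfs_dquot_disk_read() below becomes:

	error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
			mp->m_quotainfo->qi_dqchunklen, 0, &bp,
			&xfs_dquot_buf_ops);
	if (xfs_metadata_is_sick(error))
		xfs_dquot_mark_sick(dqp);
	if (error) {
		ASSERT(bp == NULL);
		return error;
	}

Marking happens before the early return so the observation is preserved
even when the caller simply bails out with the error.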


diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 8fb90da89787..9b3fde256c9a 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -24,6 +24,7 @@
 #include "xfs_log.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_error.h"
+#include "xfs_health.h"
 
 /*
  * Lock order:
@@ -44,6 +45,29 @@ static struct kmem_cache	*xfs_dquot_cache;
 static struct lock_class_key xfs_dquot_group_class;
 static struct lock_class_key xfs_dquot_project_class;
 
+/* Record observations of quota corruption with the health tracking system. */
+static void
+xfs_dquot_mark_sick(
+	struct xfs_dquot	*dqp)
+{
+	struct xfs_mount	*mp = dqp->q_mount;
+
+	switch (dqp->q_type) {
+	case XFS_DQTYPE_USER:
+		xfs_fs_mark_sick(mp, XFS_SICK_FS_UQUOTA);
+		break;
+	case XFS_DQTYPE_GROUP:
+		xfs_fs_mark_sick(mp, XFS_SICK_FS_GQUOTA);
+		break;
+	case XFS_DQTYPE_PROJ:
+		xfs_fs_mark_sick(mp, XFS_SICK_FS_PQUOTA);
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+}
+
 /*
  * This is called to free all the memory associated with a dquot
  */
@@ -451,6 +475,8 @@ xfs_dquot_disk_read(
 	error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
 			mp->m_quotainfo->qi_dqchunklen, 0, &bp,
 			&xfs_dquot_buf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_dquot_mark_sick(dqp);
 	if (error) {
 		ASSERT(bp == NULL);
 		return error;
@@ -573,6 +599,7 @@ xfs_dquot_from_disk(
 			  "Metadata corruption detected at %pS, quota %u",
 			  __this_address, dqp->q_id);
 		xfs_alert(bp->b_mount, "Unmount and run xfs_repair");
+		xfs_dquot_mark_sick(dqp);
 		return -EFSCORRUPTED;
 	}
 
@@ -1238,6 +1265,8 @@ xfs_qm_dqflush(
 				   &bp, &xfs_dquot_buf_ops);
 	if (error == -EAGAIN)
 		goto out_unlock;
+	if (xfs_metadata_is_sick(error))
+		xfs_dquot_mark_sick(dqp);
 	if (error)
 		goto out_abort;
 
@@ -1246,6 +1275,7 @@ xfs_qm_dqflush(
 		xfs_alert(mp, "corrupt dquot ID 0x%x in memory at %pS",
 				dqp->q_id, fa);
 		xfs_buf_relse(bp);
+		xfs_dquot_mark_sick(dqp);
 		error = -EFSCORRUPTED;
 		goto out_abort;
 	}
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 50b054e4751b..aef5345804da 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -17,6 +17,7 @@
 #include "xfs_btree.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
+#include "xfs_quota_defs.h"
 
 /*
  * Warn about metadata corruption that we detected but haven't fixed, and
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 972ed5912950..59ace2eedf69 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -761,14 +761,18 @@ xfs_qm_qino_alloc(
 			     (mp->m_sb.sb_gquotino != NULLFSINO)) {
 			ino = mp->m_sb.sb_gquotino;
 			if (XFS_IS_CORRUPT(mp,
-					   mp->m_sb.sb_pquotino != NULLFSINO))
+					   mp->m_sb.sb_pquotino != NULLFSINO)) {
+				xfs_fs_mark_sick(mp, XFS_SICK_FS_PQUOTA);
 				return -EFSCORRUPTED;
+			}
 		} else if ((flags & XFS_QMOPT_GQUOTA) &&
 			     (mp->m_sb.sb_pquotino != NULLFSINO)) {
 			ino = mp->m_sb.sb_pquotino;
 			if (XFS_IS_CORRUPT(mp,
-					   mp->m_sb.sb_gquotino != NULLFSINO))
+					   mp->m_sb.sb_gquotino != NULLFSINO)) {
+				xfs_fs_mark_sick(mp, XFS_SICK_FS_GQUOTA);
 				return -EFSCORRUPTED;
+			}
 		}
 		if (ino != NULLFSINO) {
 			error = xfs_iget(mp, NULL, ino, 0, 0, ipp);



* [PATCH 10/11] xfs: report realtime metadata corruption errors to the health system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (8 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 04/11] xfs: report block map corruption errors to the health tracking system Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 11/11] xfs: report XFS_IS_CORRUPT " Darrick J. Wong
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter corrupt realtime metadata blocks, we should record
that in the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rtbitmap.c |    9 ++++++++-
 fs/xfs/xfs_rtalloc.c         |    6 ++++++
 2 files changed, 14 insertions(+), 1 deletion(-)
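
xfs_rtbuf_get() serves both the realtime bitmap and the summary file, so
it picks the sickness flag based on its issum argument; a rough sketch
of the shape (mirroring the hunk below):

	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
				   XFS_FSB_TO_DADDR(mp, map.br_startblock),
				   mp->m_bsize, 0, &bp, &xfs_rtbuf_ops);
	if (xfs_metadata_is_sick(error))
		xfs_rt_mark_sick(mp, issum ? XFS_SICK_RT_SUMMARY :
					     XFS_SICK_RT_BITMAP);
	if (error)
		return error;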


diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index fa180ab66b73..99a0af8d9028 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -16,6 +16,7 @@
 #include "xfs_trans.h"
 #include "xfs_rtalloc.h"
 #include "xfs_error.h"
+#include "xfs_health.h"
 
 /*
  * Realtime allocator bitmap functions shared with userspace.
@@ -70,13 +71,19 @@ xfs_rtbuf_get(
 	if (error)
 		return error;
 
-	if (XFS_IS_CORRUPT(mp, nmap == 0 || !xfs_bmap_is_written_extent(&map)))
+	if (XFS_IS_CORRUPT(mp, nmap == 0 || !xfs_bmap_is_written_extent(&map))) {
+		xfs_rt_mark_sick(mp, issum ? XFS_SICK_RT_SUMMARY :
+					     XFS_SICK_RT_BITMAP);
 		return -EFSCORRUPTED;
+	}
 
 	ASSERT(map.br_startblock != NULLFSBLOCK);
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
 				   XFS_FSB_TO_DADDR(mp, map.br_startblock),
 				   mp->m_bsize, 0, &bp, &xfs_rtbuf_ops);
+	if (xfs_metadata_is_sick(error))
+		xfs_rt_mark_sick(mp, issum ? XFS_SICK_RT_SUMMARY :
+					     XFS_SICK_RT_BITMAP);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 16534e9873f6..790191316a32 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -19,6 +19,8 @@
 #include "xfs_icache.h"
 #include "xfs_rtalloc.h"
 #include "xfs_sb.h"
+#include "xfs_log_priv.h"
+#include "xfs_health.h"
 
 /*
  * Read and return the summary information for a given extent size,
@@ -1373,6 +1375,8 @@ xfs_rtmount_inodes(
 
 	sbp = &mp->m_sb;
 	error = xfs_iget(mp, NULL, sbp->sb_rbmino, 0, 0, &mp->m_rbmip);
+	if (xfs_metadata_is_sick(error))
+		xfs_rt_mark_sick(mp, XFS_SICK_RT_BITMAP);
 	if (error)
 		return error;
 	ASSERT(mp->m_rbmip != NULL);
@@ -1382,6 +1386,8 @@ xfs_rtmount_inodes(
 		goto out_rele_bitmap;
 
 	error = xfs_iget(mp, NULL, sbp->sb_rsumino, 0, 0, &mp->m_rsumip);
+	if (xfs_metadata_is_sick(error))
+		xfs_rt_mark_sick(mp, XFS_SICK_RT_SUMMARY);
 	if (error)
 		goto out_rele_bitmap;
 	ASSERT(mp->m_rsumip != NULL);



* [PATCH 11/11] xfs: report XFS_IS_CORRUPT errors to the health system
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
                     ` (9 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 10/11] xfs: report realtime metadata corruption errors to the health system Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  10 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Whenever we encounter XFS_IS_CORRUPT failures, we should record that in
the health monitoring system for later reporting.

I started with this semantic patch and massaged everything until it
built:

@@
expression mp, test;
@@

- if (XFS_IS_CORRUPT(mp, test)) return -EFSCORRUPTED;
+ if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); return -EFSCORRUPTED; }

@@
expression mp, test;
identifier label, error;
@@

- if (XFS_IS_CORRUPT(mp, test)) { error = -EFSCORRUPTED; goto label; }
+ if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); error = -EFSCORRUPTED; goto label; }
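
Concretely, the first rule turns a bare corruption return into a block
that marks the btree cursor sick before bailing out, e.g. in
xfs_alloc_fixup_trees() below:

-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			return -EFSCORRUPTED;
+		}

Call sites with no btree cursor in scope do not build after the
mechanical rewrite, so those were converted by hand to the appropriate
helper (xfs_ag_mark_sick(), xfs_bmap_mark_sick(), and so on), which is
where the massaging went.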

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c          |    4 +-
 fs/xfs/libxfs/xfs_alloc.c       |   97 +++++++++++++++++++++++++++++++++------
 fs/xfs/libxfs/xfs_attr_remote.c |    8 ++-
 fs/xfs/libxfs/xfs_bmap.c        |   82 +++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_btree.c       |   14 +++++-
 fs/xfs/libxfs/xfs_ialloc.c      |   52 +++++++++++++++++----
 fs/xfs/libxfs/xfs_refcount.c    |   37 ++++++++++++++-
 fs/xfs/libxfs/xfs_rmap.c        |   77 +++++++++++++++++++++++++++++--
 fs/xfs/scrub/refcount_repair.c  |    9 +++-
 fs/xfs/xfs_attr_list.c          |    9 +++-
 fs/xfs/xfs_dir2_readdir.c       |    7 ++-
 fs/xfs/xfs_discard.c            |    2 +
 fs/xfs/xfs_iwalk.c              |    5 ++
 13 files changed, 357 insertions(+), 46 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index a30eab622266..d98ac1a997d9 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -871,8 +871,10 @@ xfs_ag_shrink_space(
 	agf = agfbp->b_addr;
 	aglen = be32_to_cpu(agi->agi_length);
 	/* some extra paranoid checks before we shrink the ag */
-	if (XFS_IS_CORRUPT(mp, agf->agf_length != agi->agi_length))
+	if (XFS_IS_CORRUPT(mp, agf->agf_length != agi->agi_length)) {
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_AGF);
 		return -EFSCORRUPTED;
+	}
 	if (delta >= aglen)
 		return -EINVAL;
 
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 06562746fa75..550d0e3c8528 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -511,14 +511,18 @@ xfs_alloc_fixup_trees(
 		if (XFS_IS_CORRUPT(mp,
 				   i != 1 ||
 				   nfbno1 != fbno ||
-				   nflen1 != flen))
+				   nflen1 != flen)) {
+			xfs_btree_mark_sick(cnt_cur);
 			return -EFSCORRUPTED;
+		}
 #endif
 	} else {
 		if ((error = xfs_alloc_lookup_eq(cnt_cur, fbno, flen, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			return -EFSCORRUPTED;
+		}
 	}
 	/*
 	 * Look up the record in the by-block tree if necessary.
@@ -530,14 +534,18 @@ xfs_alloc_fixup_trees(
 		if (XFS_IS_CORRUPT(mp,
 				   i != 1 ||
 				   nfbno1 != fbno ||
-				   nflen1 != flen))
+				   nflen1 != flen)) {
+			xfs_btree_mark_sick(bno_cur);
 			return -EFSCORRUPTED;
+		}
 #endif
 	} else {
 		if ((error = xfs_alloc_lookup_eq(bno_cur, fbno, flen, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			return -EFSCORRUPTED;
+		}
 	}
 
 #ifdef DEBUG
@@ -550,8 +558,10 @@ xfs_alloc_fixup_trees(
 
 		if (XFS_IS_CORRUPT(mp,
 				   bnoblock->bb_numrecs !=
-				   cntblock->bb_numrecs))
+				   cntblock->bb_numrecs)) {
+			xfs_btree_mark_sick(bno_cur);
 			return -EFSCORRUPTED;
+		}
 	}
 #endif
 
@@ -581,30 +591,40 @@ xfs_alloc_fixup_trees(
 	 */
 	if ((error = xfs_btree_delete(cnt_cur, &i)))
 		return error;
-	if (XFS_IS_CORRUPT(mp, i != 1))
+	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cnt_cur);
 		return -EFSCORRUPTED;
+	}
 	/*
 	 * Add new by-size btree entry(s).
 	 */
 	if (nfbno1 != NULLAGBLOCK) {
 		if ((error = xfs_alloc_lookup_eq(cnt_cur, nfbno1, nflen1, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 0))
+		if (XFS_IS_CORRUPT(mp, i != 0)) {
+			xfs_btree_mark_sick(cnt_cur);
 			return -EFSCORRUPTED;
+		}
 		if ((error = xfs_btree_insert(cnt_cur, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			return -EFSCORRUPTED;
+		}
 	}
 	if (nfbno2 != NULLAGBLOCK) {
 		if ((error = xfs_alloc_lookup_eq(cnt_cur, nfbno2, nflen2, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 0))
+		if (XFS_IS_CORRUPT(mp, i != 0)) {
+			xfs_btree_mark_sick(cnt_cur);
 			return -EFSCORRUPTED;
+		}
 		if ((error = xfs_btree_insert(cnt_cur, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			return -EFSCORRUPTED;
+		}
 	}
 	/*
 	 * Fix up the by-block btree entry(s).
@@ -615,8 +635,10 @@ xfs_alloc_fixup_trees(
 		 */
 		if ((error = xfs_btree_delete(bno_cur, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			return -EFSCORRUPTED;
+		}
 	} else {
 		/*
 		 * Update the by-block entry to start later|be shorter.
@@ -630,12 +652,16 @@ xfs_alloc_fixup_trees(
 		 */
 		if ((error = xfs_alloc_lookup_eq(bno_cur, nfbno2, nflen2, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 0))
+		if (XFS_IS_CORRUPT(mp, i != 0)) {
+			xfs_btree_mark_sick(bno_cur);
 			return -EFSCORRUPTED;
+		}
 		if ((error = xfs_btree_insert(bno_cur, &i)))
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			return -EFSCORRUPTED;
+		}
 	}
 	return 0;
 }
@@ -897,8 +923,10 @@ xfs_alloc_cur_check(
 	error = xfs_alloc_get_rec(cur, &bno, &len, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(args->mp, i != 1))
+	if (XFS_IS_CORRUPT(args->mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	/*
 	 * Check minlen and deactivate a cntbt cursor if out of acceptable size
@@ -1105,6 +1133,7 @@ xfs_alloc_ag_vextent_small(
 		if (error)
 			goto error;
 		if (XFS_IS_CORRUPT(args->mp, i != 1)) {
+			xfs_btree_mark_sick(ccur);
 			error = -EFSCORRUPTED;
 			goto error;
 		}
@@ -1139,6 +1168,7 @@ xfs_alloc_ag_vextent_small(
 	*fbnop = args->agbno = fbno;
 	*flenp = args->len = 1;
 	if (XFS_IS_CORRUPT(args->mp, fbno >= be32_to_cpu(agf->agf_length))) {
+		xfs_btree_mark_sick(ccur);
 		error = -EFSCORRUPTED;
 		goto error;
 	}
@@ -1297,6 +1327,7 @@ xfs_alloc_ag_vextent_exact(
 	if (error)
 		goto error0;
 	if (XFS_IS_CORRUPT(args->mp, i != 1)) {
+		xfs_btree_mark_sick(bno_cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1577,8 +1608,10 @@ xfs_alloc_ag_vextent_lastblock(
 			error = xfs_alloc_get_rec(acur->cnt, bno, len, &i);
 			if (error)
 				return error;
-			if (XFS_IS_CORRUPT(args->mp, i != 1))
+			if (XFS_IS_CORRUPT(args->mp, i != 1)) {
+				xfs_btree_mark_sick(acur->cnt);
 				return -EFSCORRUPTED;
+			}
 			if (*len >= args->minlen)
 				break;
 			error = xfs_btree_increment(acur->cnt, 0, &i);
@@ -1773,6 +1806,7 @@ xfs_alloc_ag_vextent_size(
 			if (error)
 				goto error0;
 			if (XFS_IS_CORRUPT(args->mp, i != 1)) {
+				xfs_btree_mark_sick(cnt_cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -1813,6 +1847,7 @@ xfs_alloc_ag_vextent_size(
 			   rlen != 0 &&
 			   (rlen > flen ||
 			    rbno + rlen > fbno + flen))) {
+		xfs_btree_mark_sick(cnt_cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1835,6 +1870,7 @@ xfs_alloc_ag_vextent_size(
 					&i)))
 				goto error0;
 			if (XFS_IS_CORRUPT(args->mp, i != 1)) {
+				xfs_btree_mark_sick(cnt_cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -1847,6 +1883,7 @@ xfs_alloc_ag_vextent_size(
 					   rlen != 0 &&
 					   (rlen > flen ||
 					    rbno + rlen > fbno + flen))) {
+				xfs_btree_mark_sick(cnt_cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -1863,6 +1900,7 @@ xfs_alloc_ag_vextent_size(
 				&i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(args->mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -1889,6 +1927,7 @@ xfs_alloc_ag_vextent_size(
 
 	rlen = args->len;
 	if (XFS_IS_CORRUPT(args->mp, rlen > flen)) {
+		xfs_btree_mark_sick(cnt_cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1908,6 +1947,7 @@ xfs_alloc_ag_vextent_size(
 	if (XFS_IS_CORRUPT(args->mp,
 			   args->agbno + args->len >
 			   be32_to_cpu(agf->agf_length))) {
+		xfs_ag_mark_sick(args->pag, XFS_SICK_AG_BNOBT);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1983,6 +2023,7 @@ xfs_free_ag_extent(
 		if ((error = xfs_alloc_get_rec(bno_cur, &ltbno, &ltlen, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -1998,6 +2039,7 @@ xfs_free_ag_extent(
 			 * Very bad.
 			 */
 			if (XFS_IS_CORRUPT(mp, ltbno + ltlen > bno)) {
+				xfs_btree_mark_sick(bno_cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -2016,6 +2058,7 @@ xfs_free_ag_extent(
 		if ((error = xfs_alloc_get_rec(bno_cur, &gtbno, &gtlen, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2031,6 +2074,7 @@ xfs_free_ag_extent(
 			 * Very bad.
 			 */
 			if (XFS_IS_CORRUPT(mp, bno + len > gtbno)) {
+				xfs_btree_mark_sick(bno_cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -2051,12 +2095,14 @@ xfs_free_ag_extent(
 		if ((error = xfs_alloc_lookup_eq(cnt_cur, ltbno, ltlen, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
 		if ((error = xfs_btree_delete(cnt_cur, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2066,12 +2112,14 @@ xfs_free_ag_extent(
 		if ((error = xfs_alloc_lookup_eq(cnt_cur, gtbno, gtlen, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
 		if ((error = xfs_btree_delete(cnt_cur, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2081,6 +2129,7 @@ xfs_free_ag_extent(
 		if ((error = xfs_btree_delete(bno_cur, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2090,6 +2139,7 @@ xfs_free_ag_extent(
 		if ((error = xfs_btree_decrement(bno_cur, 0, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2109,6 +2159,7 @@ xfs_free_ag_extent(
 					   i != 1 ||
 					   xxbno != ltbno ||
 					   xxlen != ltlen)) {
+				xfs_btree_mark_sick(bno_cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -2133,12 +2184,14 @@ xfs_free_ag_extent(
 		if ((error = xfs_alloc_lookup_eq(cnt_cur, ltbno, ltlen, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
 		if ((error = xfs_btree_delete(cnt_cur, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2149,6 +2202,7 @@ xfs_free_ag_extent(
 		if ((error = xfs_btree_decrement(bno_cur, 0, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2168,12 +2222,14 @@ xfs_free_ag_extent(
 		if ((error = xfs_alloc_lookup_eq(cnt_cur, gtbno, gtlen, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
 		if ((error = xfs_btree_delete(cnt_cur, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cnt_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2196,6 +2252,7 @@ xfs_free_ag_extent(
 		if ((error = xfs_btree_insert(bno_cur, &i)))
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(bno_cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2208,12 +2265,14 @@ xfs_free_ag_extent(
 	if ((error = xfs_alloc_lookup_eq(cnt_cur, nbno, nlen, &i)))
 		goto error0;
 	if (XFS_IS_CORRUPT(mp, i != 0)) {
+		xfs_btree_mark_sick(cnt_cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
 	if ((error = xfs_btree_insert(cnt_cur, &i)))
 		goto error0;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cnt_cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -3453,17 +3512,23 @@ __xfs_free_extent(
 		return -EIO;
 
 	error = xfs_free_extent_fix_freelist(tp, pag, &agbp);
-	if (error)
+	if (error) {
+		if (xfs_metadata_is_sick(error))
+			xfs_ag_mark_sick(pag, XFS_SICK_AG_BNOBT);
 		return error;
+	}
+
 	agf = agbp->b_addr;
 
 	if (XFS_IS_CORRUPT(mp, agbno >= mp->m_sb.sb_agblocks)) {
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_BNOBT);
 		error = -EFSCORRUPTED;
 		goto err_release;
 	}
 
 	/* validate the extent size is legal now we have the agf locked */
 	if (XFS_IS_CORRUPT(mp, agbno + len > be32_to_cpu(agf->agf_length))) {
+		xfs_ag_mark_sick(pag, XFS_SICK_AG_BNOBT);
 		error = -EFSCORRUPTED;
 		goto err_release;
 	}
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index b18a3cf44192..bb4cf1fa0dc2 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -553,8 +553,10 @@ xfs_attr_rmtval_stale(
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
 	if (XFS_IS_CORRUPT(mp, map->br_startblock == DELAYSTARTBLOCK) ||
-	    XFS_IS_CORRUPT(mp, map->br_startblock == HOLESTARTBLOCK))
+	    XFS_IS_CORRUPT(mp, map->br_startblock == HOLESTARTBLOCK)) {
+		xfs_bmap_mark_sick(ip, XFS_ATTR_FORK);
 		return -EFSCORRUPTED;
+	}
 
 	error = xfs_buf_incore(mp->m_ddev_targp,
 			XFS_FSB_TO_DADDR(mp, map->br_startblock),
@@ -664,8 +666,10 @@ xfs_attr_rmtval_invalidate(
 				       blkcnt, &map, &nmap, XFS_BMAPI_ATTRFORK);
 		if (error)
 			return error;
-		if (XFS_IS_CORRUPT(args->dp->i_mount, nmap != 1))
+		if (XFS_IS_CORRUPT(args->dp->i_mount, nmap != 1)) {
+			xfs_bmap_mark_sick(args->dp, XFS_ATTR_FORK);
 			return -EFSCORRUPTED;
+		}
 		error = xfs_attr_rmtval_stale(args->dp, &map, XBF_TRYLOCK);
 		if (error)
 			return error;
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index de5d7b842145..b658373bedc7 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -386,6 +386,7 @@ xfs_bmap_check_leaf_extents(
 		pp = XFS_BMBT_PTR_ADDR(mp, block, 1, mp->m_bmap_dmxr[1]);
 		bno = be64_to_cpu(*pp);
 		if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, bno))) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -567,8 +568,10 @@ xfs_bmap_btree_to_extents(
 	pp = XFS_BMAP_BROOT_PTR_ADDR(mp, rblock, 1, ifp->if_broot_bytes);
 	cbno = be64_to_cpu(*pp);
 #ifdef DEBUG
-	if (XFS_IS_CORRUPT(cur->bc_mp, !xfs_btree_check_lptr(cur, cbno, 1)))
+	if (XFS_IS_CORRUPT(cur->bc_mp, !xfs_btree_check_lptr(cur, cbno, 1))) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 #endif
 	error = xfs_btree_read_bufl(mp, tp, cbno, &cbp, XFS_BMAP_BTREE_REF,
 				&xfs_bmbt_buf_ops);
@@ -896,6 +899,7 @@ xfs_bmap_add_attrfork_btree(
 			goto error0;
 		/* must be at least one entry */
 		if (XFS_IS_CORRUPT(mp, stat != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -1222,6 +1226,7 @@ xfs_iread_extents(
 		goto out;
 
 	if (XFS_IS_CORRUPT(mp, ir.loaded != ifp->if_nextents)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		error = -EFSCORRUPTED;
 		goto out;
 	}
@@ -1406,8 +1411,10 @@ xfs_bmap_last_offset(
 	if (ifp->if_format == XFS_DINODE_FMT_LOCAL)
 		return 0;
 
-	if (XFS_IS_CORRUPT(ip->i_mount, !xfs_ifork_has_extents(ifp)))
+	if (XFS_IS_CORRUPT(ip->i_mount, !xfs_ifork_has_extents(ifp))) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
+	}
 
 	error = xfs_bmap_last_extent(NULL, ip, whichfork, &rec, &is_empty);
 	if (error || is_empty)
@@ -1546,6 +1553,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1553,6 +1561,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1560,6 +1569,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1589,6 +1599,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1622,6 +1633,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1650,6 +1662,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 0)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1657,6 +1670,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1691,6 +1705,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1716,6 +1731,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 0)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1723,6 +1739,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1767,6 +1784,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1803,6 +1821,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 0)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1810,6 +1829,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1889,6 +1909,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 0)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -1896,6 +1917,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(bma->cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2092,30 +2114,35 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_delete(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_decrement(cur, 0, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_delete(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_decrement(cur, 0, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2144,18 +2171,21 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_delete(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_decrement(cur, 0, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2187,18 +2217,21 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_delete(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_decrement(cur, 0, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2225,6 +2258,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2258,6 +2292,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2295,6 +2330,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2305,6 +2341,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if ((error = xfs_btree_insert(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2335,6 +2372,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2371,6 +2409,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2381,12 +2420,14 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 0)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
 			if ((error = xfs_btree_insert(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2423,6 +2464,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2435,6 +2477,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if ((error = xfs_btree_insert(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2447,6 +2490,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 0)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2454,6 +2498,7 @@ xfs_bmap_add_extent_unwritten_real(
 			if ((error = xfs_btree_insert(cur, &i)))
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2739,6 +2784,7 @@ xfs_bmap_add_extent_hole_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2746,6 +2792,7 @@ xfs_bmap_add_extent_hole_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2753,6 +2800,7 @@ xfs_bmap_add_extent_hole_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2782,6 +2830,7 @@ xfs_bmap_add_extent_hole_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2812,6 +2861,7 @@ xfs_bmap_add_extent_hole_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2838,6 +2888,7 @@ xfs_bmap_add_extent_hole_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 0)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -2845,6 +2896,7 @@ xfs_bmap_add_extent_hole_real(
 			if (error)
 				goto done;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -5135,6 +5187,7 @@ xfs_bmap_del_extent_real(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -5162,6 +5215,7 @@ xfs_bmap_del_extent_real(
 		if ((error = xfs_btree_delete(cur, &i)))
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -5237,6 +5291,7 @@ xfs_bmap_del_extent_real(
 				if (error)
 					goto done;
 				if (XFS_IS_CORRUPT(mp, i != 1)) {
+					xfs_btree_mark_sick(cur);
 					error = -EFSCORRUPTED;
 					goto done;
 				}
@@ -5257,6 +5312,7 @@ xfs_bmap_del_extent_real(
 				goto done;
 			}
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto done;
 			}
@@ -5713,21 +5769,27 @@ xfs_bmse_merge(
 	error = xfs_bmbt_lookup_eq(cur, got, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(mp, i != 1))
+	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	error = xfs_btree_delete(cur, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(mp, i != 1))
+	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	/* lookup and update size of the previous extent */
 	error = xfs_bmbt_lookup_eq(cur, left, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(mp, i != 1))
+	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	error = xfs_bmbt_update(cur, &new);
 	if (error)
@@ -5775,8 +5837,10 @@ xfs_bmap_shift_update_extent(
 		error = xfs_bmbt_lookup_eq(cur, &prev, &i);
 		if (error)
 			return error;
-		if (XFS_IS_CORRUPT(mp, i != 1))
+		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
+		}
 
 		error = xfs_bmbt_update(cur, got);
 		if (error)
@@ -5837,6 +5901,7 @@ xfs_bmap_collapse_extents(
 		goto del_cursor;
 	}
 	if (XFS_IS_CORRUPT(mp, isnullstartblock(got.br_startblock))) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		error = -EFSCORRUPTED;
 		goto del_cursor;
 	}
@@ -5962,11 +6027,13 @@ xfs_bmap_insert_extents(
 		}
 	}
 	if (XFS_IS_CORRUPT(mp, isnullstartblock(got.br_startblock))) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		error = -EFSCORRUPTED;
 		goto del_cursor;
 	}
 
 	if (XFS_IS_CORRUPT(mp, stop_fsb > got.br_startoff)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		error = -EFSCORRUPTED;
 		goto del_cursor;
 	}
@@ -6066,6 +6133,7 @@ xfs_bmap_split_extent(
 		if (error)
 			goto del_cursor;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto del_cursor;
 		}
@@ -6093,6 +6161,7 @@ xfs_bmap_split_extent(
 		if (error)
 			goto del_cursor;
 		if (XFS_IS_CORRUPT(mp, i != 0)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto del_cursor;
 		}
@@ -6100,6 +6169,7 @@ xfs_bmap_split_extent(
 		if (error)
 			goto del_cursor;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto del_cursor;
 		}
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 2f0e8450268f..02c237984fa6 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2026,8 +2026,10 @@ xfs_btree_lookup(
 			error = xfs_btree_increment(cur, 0, &i);
 			if (error)
 				goto error0;
-			if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+			if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				return -EFSCORRUPTED;
+			}
 			*stat = 1;
 			return 0;
 		}
@@ -2480,6 +2482,7 @@ xfs_btree_lshift(
 			goto error0;
 		i = xfs_btree_firstrec(tcur, level);
 		if (XFS_IS_CORRUPT(tcur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -2650,6 +2653,7 @@ xfs_btree_rshift(
 		goto error0;
 	i = xfs_btree_lastrec(tcur, level);
 	if (XFS_IS_CORRUPT(tcur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -3524,6 +3528,7 @@ xfs_btree_insert(
 		}
 
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -3931,6 +3936,7 @@ xfs_btree_delrec(
 		 */
 		i = xfs_btree_lastrec(tcur, level);
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -3939,12 +3945,14 @@ xfs_btree_delrec(
 		if (error)
 			goto error0;
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
 
 		i = xfs_btree_lastrec(tcur, level);
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -3992,6 +4000,7 @@ xfs_btree_delrec(
 		if (!xfs_btree_ptr_is_null(cur, &lptr)) {
 			i = xfs_btree_firstrec(tcur, level);
 			if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -4000,6 +4009,7 @@ xfs_btree_delrec(
 			if (error)
 				goto error0;
 			if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto error0;
 			}
@@ -4017,6 +4027,7 @@ xfs_btree_delrec(
 		 */
 		i = xfs_btree_firstrec(tcur, level);
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -4026,6 +4037,7 @@ xfs_btree_delrec(
 			goto error0;
 		i = xfs_btree_firstrec(tcur, level);
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index ddfe365666ba..ecf53198907f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -584,6 +584,7 @@ xfs_inobt_insert_sprec(
 		if (error)
 			goto error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error;
 		}
@@ -600,10 +601,12 @@ xfs_inobt_insert_sprec(
 		if (error)
 			goto error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error;
 		}
 		if (XFS_IS_CORRUPT(mp, rec.ir_startino != nrec->ir_startino)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error;
 		}
@@ -613,6 +616,7 @@ xfs_inobt_insert_sprec(
 		 * cannot merge, something is seriously wrong.
 		 */
 		if (XFS_IS_CORRUPT(mp, !__xfs_inobt_can_merge(nrec, &rec))) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error;
 		}
@@ -966,8 +970,10 @@ xfs_ialloc_next_rec(
 		error = xfs_inobt_get_rec(cur, rec, &i);
 		if (error)
 			return error;
-		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
+		}
 	}
 
 	return 0;
@@ -991,8 +997,10 @@ xfs_ialloc_get_rec(
 		error = xfs_inobt_get_rec(cur, rec, &i);
 		if (error)
 			return error;
-		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
+		}
 	}
 
 	return 0;
@@ -1070,6 +1078,7 @@ xfs_dialloc_ag_inobt(
 		if (error)
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -1078,6 +1087,7 @@ xfs_dialloc_ag_inobt(
 		if (error)
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, j != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -1236,6 +1246,7 @@ xfs_dialloc_ag_inobt(
 	if (error)
 		goto error0;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1245,6 +1256,7 @@ xfs_dialloc_ag_inobt(
 		if (error)
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -1254,6 +1266,7 @@ xfs_dialloc_ag_inobt(
 		if (error)
 			goto error0;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error0;
 		}
@@ -1314,8 +1327,10 @@ xfs_dialloc_ag_finobt_near(
 		error = xfs_inobt_get_rec(lcur, rec, &i);
 		if (error)
 			return error;
-		if (XFS_IS_CORRUPT(lcur->bc_mp, i != 1))
+		if (XFS_IS_CORRUPT(lcur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(lcur);
 			return -EFSCORRUPTED;
+		}
 
 		/*
 		 * See if we've landed in the parent inode record. The finobt
@@ -1339,12 +1354,14 @@ xfs_dialloc_ag_finobt_near(
 		if (error)
 			goto error_rcur;
 		if (XFS_IS_CORRUPT(lcur->bc_mp, j != 1)) {
+			xfs_btree_mark_sick(lcur);
 			error = -EFSCORRUPTED;
 			goto error_rcur;
 		}
 	}
 
 	if (XFS_IS_CORRUPT(lcur->bc_mp, i != 1 && j != 1)) {
+		xfs_btree_mark_sick(lcur);
 		error = -EFSCORRUPTED;
 		goto error_rcur;
 	}
@@ -1400,8 +1417,10 @@ xfs_dialloc_ag_finobt_newino(
 			error = xfs_inobt_get_rec(cur, rec, &i);
 			if (error)
 				return error;
-			if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+			if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				return -EFSCORRUPTED;
+			}
 			return 0;
 		}
 	}
@@ -1412,14 +1431,18 @@ xfs_dialloc_ag_finobt_newino(
 	error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	error = xfs_inobt_get_rec(cur, rec, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	return 0;
 }
@@ -1441,14 +1464,18 @@ xfs_dialloc_ag_update_inobt(
 	error = xfs_inobt_lookup(cur, frec->ir_startino, XFS_LOOKUP_EQ, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	error = xfs_inobt_get_rec(cur, &rec, &i);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
+	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 	ASSERT((XFS_AGINO_TO_OFFSET(cur->bc_mp, rec.ir_startino) %
 				   XFS_INODES_PER_CHUNK) == 0);
 
@@ -1457,8 +1484,10 @@ xfs_dialloc_ag_update_inobt(
 
 	if (XFS_IS_CORRUPT(cur->bc_mp,
 			   rec.ir_free != frec->ir_free ||
-			   rec.ir_freecount != frec->ir_freecount))
+			   rec.ir_freecount != frec->ir_freecount)) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	return xfs_inobt_update(cur, &rec);
 }
@@ -1965,6 +1994,7 @@ xfs_difree_inobt(
 		goto error0;
 	}
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -1975,6 +2005,7 @@ xfs_difree_inobt(
 		goto error0;
 	}
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error0;
 	}
@@ -2085,6 +2116,7 @@ xfs_difree_finobt(
 		 * something is out of sync.
 		 */
 		if (XFS_IS_CORRUPT(mp, ibtrec->ir_freecount != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto error;
 		}
@@ -2111,6 +2143,7 @@ xfs_difree_finobt(
 	if (error)
 		goto error;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error;
 	}
@@ -2121,6 +2154,7 @@ xfs_difree_finobt(
 	if (XFS_IS_CORRUPT(mp,
 			   rec.ir_free != ibtrec->ir_free ||
 			   rec.ir_freecount != ibtrec->ir_freecount)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto error;
 	}
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index e8eda50a1ee1..ba329fa53a56 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -248,6 +248,7 @@ xfs_refcount_insert(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, *i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -278,12 +279,14 @@ xfs_refcount_delete(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
 	trace_xfs_refcount_delete(cur->bc_mp, cur->bc_ag.pag->pag_agno, &irec);
 	error = xfs_btree_delete(cur, i);
 	if (XFS_IS_CORRUPT(cur->bc_mp, *i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -408,6 +411,7 @@ xfs_refcount_split_extent(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -435,6 +439,7 @@ xfs_refcount_split_extent(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -480,6 +485,7 @@ xfs_refcount_merge_center_extents(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -488,6 +494,7 @@ xfs_refcount_merge_center_extents(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -497,6 +504,7 @@ xfs_refcount_merge_center_extents(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -508,6 +516,7 @@ xfs_refcount_merge_center_extents(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -552,6 +561,7 @@ xfs_refcount_merge_left_extent(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -560,6 +570,7 @@ xfs_refcount_merge_left_extent(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -571,6 +582,7 @@ xfs_refcount_merge_left_extent(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -618,6 +630,7 @@ xfs_refcount_merge_right_extent(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -626,6 +639,7 @@ xfs_refcount_merge_right_extent(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -637,6 +651,7 @@ xfs_refcount_merge_right_extent(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -684,6 +699,7 @@ xfs_refcount_find_left_extents(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -703,6 +719,7 @@ xfs_refcount_find_left_extents(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -777,6 +794,7 @@ xfs_refcount_find_right_extents(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -796,6 +814,7 @@ xfs_refcount_find_right_extents(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1152,6 +1171,7 @@ xfs_refcount_adjust_extents(
 					goto out_error;
 				if (XFS_IS_CORRUPT(cur->bc_mp,
 						   found_tmp != 1)) {
+					xfs_btree_mark_sick(cur);
 					error = -EFSCORRUPTED;
 					goto out_error;
 				}
@@ -1187,6 +1207,7 @@ xfs_refcount_adjust_extents(
 		 */
 		if (XFS_IS_CORRUPT(cur->bc_mp, ext.rc_blockcount == 0) ||
 		    XFS_IS_CORRUPT(cur->bc_mp, ext.rc_blockcount > *aglen)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1210,6 +1231,7 @@ xfs_refcount_adjust_extents(
 			if (error)
 				goto out_error;
 			if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto out_error;
 			}
@@ -1331,8 +1353,10 @@ xfs_refcount_continue_op(
 	struct xfs_perag		*pag = cur->bc_ag.pag;
 
 	if (XFS_IS_CORRUPT(mp, !xfs_verify_agbext(pag, new_agbno,
-					ri->ri_blockcount)))
+					ri->ri_blockcount))) {
+		xfs_btree_mark_sick(cur);
 		return -EFSCORRUPTED;
+	}
 
 	ri->ri_startblock = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
 
@@ -1539,6 +1563,7 @@ xfs_refcount_find_shared(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -1556,6 +1581,7 @@ xfs_refcount_find_shared(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1589,6 +1615,7 @@ xfs_refcount_find_shared(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1686,6 +1713,7 @@ xfs_refcount_adjust_cow_extents(
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec &&
 				ext.rc_domain != XFS_REFC_DOMAIN_COW)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -1701,6 +1729,7 @@ xfs_refcount_adjust_cow_extents(
 		/* Adding a CoW reservation, there should be nothing here. */
 		if (XFS_IS_CORRUPT(cur->bc_mp,
 				   agbno + aglen > ext.rc_startblock)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1718,6 +1747,7 @@ xfs_refcount_adjust_cow_extents(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_tmp != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1725,14 +1755,17 @@ xfs_refcount_adjust_cow_extents(
 	case XFS_REFCOUNT_ADJUST_COW_FREE:
 		/* Removing a CoW reservation, there should be one extent. */
 		if (XFS_IS_CORRUPT(cur->bc_mp, ext.rc_startblock != agbno)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
 		if (XFS_IS_CORRUPT(cur->bc_mp, ext.rc_blockcount != aglen)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
 		if (XFS_IS_CORRUPT(cur->bc_mp, ext.rc_refcount != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1744,6 +1777,7 @@ xfs_refcount_adjust_cow_extents(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1906,6 +1940,7 @@ xfs_refcount_recover_extent(
 	if (xfs_refcount_check_irec(cur, &rr->rr_rrec) != NULL ||
 	    XFS_IS_CORRUPT(cur->bc_mp,
 			   rr->rr_rrec.rc_domain != XFS_REFC_DOMAIN_COW)) {
+		xfs_btree_mark_sick(cur);
 		kfree(rr);
 		return -EFSCORRUPTED;
 	}
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 0d41d1256fd9..a30602660669 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -135,6 +135,7 @@ xfs_rmap_insert(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(rcur->bc_mp, i != 0)) {
+		xfs_btree_mark_sick(rcur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -148,6 +149,7 @@ xfs_rmap_insert(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(rcur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(rcur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -177,6 +179,7 @@ xfs_rmap_delete(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(rcur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(rcur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -185,6 +188,7 @@ xfs_rmap_delete(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(rcur->bc_mp, i != 1)) {
+		xfs_btree_mark_sick(rcur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -516,7 +520,7 @@ xfs_rmap_lookup_le_range(
  */
 static int
 xfs_rmap_free_check_owner(
-	struct xfs_mount	*mp,
+	struct xfs_btree_cur	*cur,
 	uint64_t		ltoff,
 	struct xfs_rmap_irec	*rec,
 	xfs_filblks_t		len,
@@ -524,6 +528,7 @@ xfs_rmap_free_check_owner(
 	uint64_t		offset,
 	unsigned int		flags)
 {
+	struct xfs_mount	*mp = cur->bc_mp;
 	int			error = 0;
 
 	if (owner == XFS_RMAP_OWN_UNKNOWN)
@@ -533,12 +538,14 @@ xfs_rmap_free_check_owner(
 	if (XFS_IS_CORRUPT(mp,
 			   (flags & XFS_RMAP_UNWRITTEN) !=
 			   (rec->rm_flags & XFS_RMAP_UNWRITTEN))) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out;
 	}
 
 	/* Make sure the owner matches what we expect to find in the tree. */
 	if (XFS_IS_CORRUPT(mp, owner != rec->rm_owner)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out;
 	}
@@ -550,16 +557,19 @@ xfs_rmap_free_check_owner(
 	if (flags & XFS_RMAP_BMBT_BLOCK) {
 		if (XFS_IS_CORRUPT(mp,
 				   !(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out;
 		}
 	} else {
 		if (XFS_IS_CORRUPT(mp, rec->rm_offset > offset)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out;
 		}
 		if (XFS_IS_CORRUPT(mp,
 				   offset + len > ltoff + rec->rm_blockcount)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out;
 		}
@@ -622,6 +632,7 @@ xfs_rmap_unmap(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -643,6 +654,7 @@ xfs_rmap_unmap(
 		if (XFS_IS_CORRUPT(mp,
 				   bno <
 				   ltrec.rm_startblock + ltrec.rm_blockcount)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -669,6 +681,7 @@ xfs_rmap_unmap(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -681,12 +694,13 @@ xfs_rmap_unmap(
 			   ltrec.rm_startblock > bno ||
 			   ltrec.rm_startblock + ltrec.rm_blockcount <
 			   bno + len)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
 
 	/* Check owner information. */
-	error = xfs_rmap_free_check_owner(mp, ltoff, &ltrec, len, owner,
+	error = xfs_rmap_free_check_owner(cur, ltoff, &ltrec, len, owner,
 			offset, flags);
 	if (error)
 		goto out_error;
@@ -701,6 +715,7 @@ xfs_rmap_unmap(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -904,6 +919,7 @@ xfs_rmap_map(
 	if (XFS_IS_CORRUPT(mp,
 			   have_lt != 0 &&
 			   ltrec.rm_startblock + ltrec.rm_blockcount > bno)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -921,10 +937,12 @@ xfs_rmap_map(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, have_gt != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
 		if (XFS_IS_CORRUPT(mp, bno + len > gtrec.rm_startblock)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -978,6 +996,7 @@ xfs_rmap_map(
 			if (error)
 				goto out_error;
 			if (XFS_IS_CORRUPT(mp, i != 1)) {
+				xfs_btree_mark_sick(cur);
 				error = -EFSCORRUPTED;
 				goto out_error;
 			}
@@ -1025,6 +1044,7 @@ xfs_rmap_map(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -1120,6 +1140,7 @@ xfs_rmap_convert(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -1157,12 +1178,14 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
 		if (XFS_IS_CORRUPT(mp,
 				   LEFT.rm_startblock + LEFT.rm_blockcount >
 				   bno)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1185,6 +1208,7 @@ xfs_rmap_convert(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -1197,10 +1221,12 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
 		if (XFS_IS_CORRUPT(mp, bno + len > RIGHT.rm_startblock)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1231,6 +1257,7 @@ xfs_rmap_convert(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -1250,6 +1277,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1261,6 +1289,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1268,6 +1297,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1279,6 +1309,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1286,6 +1317,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1309,6 +1341,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1316,6 +1349,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1335,6 +1369,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1346,6 +1381,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1353,6 +1389,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1423,6 +1460,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1465,6 +1503,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 0)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1480,6 +1519,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1513,6 +1553,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1526,6 +1567,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 0)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1538,6 +1580,7 @@ xfs_rmap_convert(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1610,6 +1653,7 @@ xfs_rmap_convert_shared(
 	if (error)
 		goto done;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto done;
 	}
@@ -1638,6 +1682,7 @@ xfs_rmap_convert_shared(
 		if (XFS_IS_CORRUPT(mp,
 				   LEFT.rm_startblock + LEFT.rm_blockcount >
 				   bno)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1656,10 +1701,12 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
 		if (XFS_IS_CORRUPT(mp, bno + len > RIGHT.rm_startblock)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1710,6 +1757,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1736,6 +1784,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1762,6 +1811,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1785,6 +1835,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1820,6 +1871,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1865,6 +1917,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1900,6 +1953,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -1938,6 +1992,7 @@ xfs_rmap_convert_shared(
 		if (error)
 			goto done;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto done;
 		}
@@ -2027,6 +2082,7 @@ xfs_rmap_unmap_shared(
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(mp, i != 1)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -2037,12 +2093,14 @@ xfs_rmap_unmap_shared(
 			   ltrec.rm_startblock > bno ||
 			   ltrec.rm_startblock + ltrec.rm_blockcount <
 			   bno + len)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
 
 	/* Make sure the owner matches what we expect to find in the tree. */
 	if (XFS_IS_CORRUPT(mp, owner != ltrec.rm_owner)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -2051,16 +2109,19 @@ xfs_rmap_unmap_shared(
 	if (XFS_IS_CORRUPT(mp,
 			   (flags & XFS_RMAP_UNWRITTEN) !=
 			   (ltrec.rm_flags & XFS_RMAP_UNWRITTEN))) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
 
 	/* Check the offset. */
 	if (XFS_IS_CORRUPT(mp, ltrec.rm_offset > offset)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
 	if (XFS_IS_CORRUPT(mp, offset > ltoff + ltrec.rm_blockcount)) {
+		xfs_btree_mark_sick(cur);
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
@@ -2117,6 +2178,7 @@ xfs_rmap_unmap_shared(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -2146,6 +2208,7 @@ xfs_rmap_unmap_shared(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -2225,6 +2288,7 @@ xfs_rmap_map_shared(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, have_gt != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -2277,6 +2341,7 @@ xfs_rmap_map_shared(
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
@@ -2484,10 +2549,14 @@ xfs_rmap_finish_one(
 		 * allocate blocks.
 		 */
 		error = xfs_free_extent_fix_freelist(tp, ri->ri_pag, &agbp);
-		if (error)
+		if (error) {
+			xfs_ag_mark_sick(ri->ri_pag, XFS_SICK_AG_AGFL);
 			return error;
-		if (XFS_IS_CORRUPT(tp->t_mountp, !agbp))
+		}
+		if (XFS_IS_CORRUPT(tp->t_mountp, !agbp)) {
+			xfs_ag_mark_sick(ri->ri_pag, XFS_SICK_AG_AGFL);
 			return -EFSCORRUPTED;
+		}
 
 		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, ri->ri_pag);
 	}
diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c
index d3f0384d084d..4bf48fbe5285 100644
--- a/fs/xfs/scrub/refcount_repair.c
+++ b/fs/xfs/scrub/refcount_repair.c
@@ -25,6 +25,7 @@
 #include "xfs_refcount_btree.h"
 #include "xfs_error.h"
 #include "xfs_ag.h"
+#include "xfs_health.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -252,8 +253,10 @@ xrep_refc_walk_rmaps(
 		error = xfs_rmap_get_rec(cur, &rmap, &have_gt);
 		if (error)
 			return error;
-		if (XFS_IS_CORRUPT(mp, !have_gt))
+		if (XFS_IS_CORRUPT(mp, !have_gt)) {
+			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
+		}
 
 		if (rmap.rm_owner == XFS_RMAP_OWN_COW) {
 			error = xrep_refc_stash_cow(rr, rmap.rm_startblock,
@@ -424,8 +427,10 @@ xrep_refc_push_rmaps_at(
 	error = xfs_btree_decrement(sc->sa.rmap_cur, 0, &have_gt);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(sc->mp, !have_gt))
+	if (XFS_IS_CORRUPT(sc->mp, !have_gt)) {
+		xfs_btree_mark_sick(sc->sa.rmap_cur);
 		return -EFSCORRUPTED;
+	}
 
 	return 0;
 }
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index 305559bfe2a1..dcfa8e8e146a 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -84,8 +84,10 @@ xfs_attr_shortform_list(
 		for (i = 0, sfe = &sf->list[0]; i < sf->hdr.count; i++) {
 			if (XFS_IS_CORRUPT(context->dp->i_mount,
 					   !xfs_attr_namecheck(sfe->nameval,
-							       sfe->namelen)))
+							       sfe->namelen))) {
+				xfs_dirattr_mark_sick(context->dp, XFS_ATTR_FORK);
 				return -EFSCORRUPTED;
+			}
 			context->put_listent(context,
 					     sfe->flags,
 					     sfe->nameval,
@@ -178,6 +180,7 @@ xfs_attr_shortform_list(
 		if (XFS_IS_CORRUPT(context->dp->i_mount,
 				   !xfs_attr_namecheck(sbp->name,
 						       sbp->namelen))) {
+			xfs_dirattr_mark_sick(context->dp, XFS_ATTR_FORK);
 			error = -EFSCORRUPTED;
 			goto out;
 		}
@@ -472,8 +475,10 @@ xfs_attr3_leaf_list_int(
 		}
 
 		if (XFS_IS_CORRUPT(context->dp->i_mount,
-				   !xfs_attr_namecheck(name, namelen)))
+				   !xfs_attr_namecheck(name, namelen))) {
+			xfs_dirattr_mark_sick(context->dp, XFS_ATTR_FORK);
 			return -EFSCORRUPTED;
+		}
 		context->put_listent(context, entry->flags,
 					      name, namelen, valuelen);
 		if (context->seen_enough)
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 9f3ceb461515..4c061b48da18 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -18,6 +18,7 @@
 #include "xfs_bmap.h"
 #include "xfs_trans.h"
 #include "xfs_error.h"
+#include "xfs_health.h"
 
 /*
  * Directory file type support functions
@@ -119,8 +120,10 @@ xfs_dir2_sf_getdents(
 		ctx->pos = off & 0x7fffffff;
 		if (XFS_IS_CORRUPT(dp->i_mount,
 				   !xfs_dir2_namecheck(sfep->name,
-						       sfep->namelen)))
+						       sfep->namelen))) {
+			xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
 			return -EFSCORRUPTED;
+		}
 		if (!dir_emit(ctx, (char *)sfep->name, sfep->namelen, ino,
 			    xfs_dir3_get_dtype(mp, filetype)))
 			return 0;
@@ -212,6 +215,7 @@ xfs_dir2_block_getdents(
 		if (XFS_IS_CORRUPT(dp->i_mount,
 				   !xfs_dir2_namecheck(dep->name,
 						       dep->namelen))) {
+			xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
 			error = -EFSCORRUPTED;
 			goto out_rele;
 		}
@@ -466,6 +470,7 @@ xfs_dir2_leaf_getdents(
 		if (XFS_IS_CORRUPT(dp->i_mount,
 				   !xfs_dir2_namecheck(dep->name,
 						       dep->namelen))) {
+			xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
 			error = -EFSCORRUPTED;
 			break;
 		}
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index bfc829c07f03..3fa6b0ab9ed6 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -18,6 +18,7 @@
 #include "xfs_trace.h"
 #include "xfs_log.h"
 #include "xfs_ag.h"
+#include "xfs_health.h"
 
 STATIC int
 xfs_trim_extents(
@@ -73,6 +74,7 @@ xfs_trim_extents(
 		if (error)
 			goto out_del_cursor;
 		if (XFS_IS_CORRUPT(mp, i != 1)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out_del_cursor;
 		}
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 54a262b33244..9a8c7c0913a0 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -297,8 +297,10 @@ xfs_iwalk_ag_start(
 	error = xfs_inobt_get_rec(*curpp, irec, has_more);
 	if (error)
 		return error;
-	if (XFS_IS_CORRUPT(mp, *has_more != 1))
+	if (XFS_IS_CORRUPT(mp, *has_more != 1)) {
+		xfs_btree_mark_sick(*curpp);
 		return -EFSCORRUPTED;
+	}
 
 	iwag->lastino = XFS_AGINO_TO_INO(mp, pag->pag_agno,
 				irec->ir_startino + XFS_INODES_PER_CHUNK - 1);
@@ -425,6 +427,7 @@ xfs_iwalk_ag(
 		rec_fsino = XFS_AGINO_TO_INO(mp, pag->pag_agno, irec->ir_startino);
 		if (iwag->lastino != NULLFSINO &&
 		    XFS_IS_CORRUPT(mp, iwag->lastino >= rec_fsino)) {
+			xfs_btree_mark_sick(cur);
 			error = -EFSCORRUPTED;
 			goto out;
 		}



* [PATCHSET v24.0 0/3] xfs: indirect health reporting
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (13 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: update health status if we get a clean bill of health Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: online repair for fs summary counters Darrick J. Wong
                   ` (62 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series enables the XFS health reporting infrastructure to remember
indirect health concerns when resources are scarce.  For example, if a
scrub notices that there's something wrong with an inode's metadata but
memory reclaim needs to free the incore inode, we want to record in the
perag data the fact that there was some inode somewhere with an error.
The perag structures never go away.

The first two patches in this series set that up, and the third one
provides a means for xfs_scrub to tell the kernel that it can forget the
indirect problem report.
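
To make the idea concrete before you wade into the patches, here's a
tiny standalone model of what "indirect" means here.  The structure and
helper names below are made up for illustration only; the real flag
introduced later in this series is XFS_SICK_AG_INODES, set through
xfs_ag_mark_sick():

/*
 * Toy model (not kernel code): per-AG health state outlives any single
 * incore inode, so an unfixed inode problem gets summarized there just
 * before the inode goes away.
 */
#include <stdio.h>

#define AG_SICK_INODES		(1 << 10)	/* "a bad inode was seen here" */

struct perag_model {
	unsigned int	sick;			/* never reclaimed */
};

struct inode_model {
	unsigned int	agno;
	unsigned int	sick;			/* lost once the inode is reclaimed */
};

/* Model of the inactivation hook added in patch 2 of this series. */
static void inactivate(struct perag_model *ags, struct inode_model *ip)
{
	if (ip->sick)
		ags[ip->agno].sick |= AG_SICK_INODES;
	ip->sick = 0;
}

int main(void)
{
	struct perag_model	ags[4] = { };
	struct inode_model	ip = { .agno = 2, .sick = 1 };

	inactivate(ags, &ip);
	printf("AG 2 sick mask: 0x%x\n", ags[2].sick);	/* prints 0x400 */
	return 0;
}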

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=indirect-health-reporting
---
 fs/xfs/libxfs/xfs_fs.h        |    4 ++
 fs/xfs/libxfs/xfs_health.h    |   45 ++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_inode_buf.c |    2 +
 fs/xfs/scrub/health.c         |   76 ++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/health.h         |    1 +
 fs/xfs/scrub/repair.c         |    1 +
 fs/xfs/scrub/scrub.c          |    6 +++
 fs/xfs/scrub/trace.h          |    4 ++
 fs/xfs/xfs_health.c           |   25 +++++++++----
 fs/xfs/xfs_inode.c            |   29 ++++++++++++++++
 fs/xfs/xfs_trace.h            |    1 +
 11 files changed, 182 insertions(+), 12 deletions(-)



* [PATCH 1/3] xfs: add secondary and indirect classes to the health tracking system
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: indirect health reporting Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: update health status if we get a clean bill of health Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: remember sick inodes that get inactivated Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Establish two more classes of health tracking bits:

 * Indirect problems, which suggest problems in other health domains
   whose state we weren't able to preserve; and

 * Secondary problems, which track state that's related to primary
   evidence of health problems.

The first class we'll use in an upcoming patch to record in the AG
health status the fact that we ran out of memory and had to inactivate
an inode with defective metadata.  The second class we use to indicate
that repair knows that an inode is bad and we need to fix it later.
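
To make the intended mask semantics concrete, here's a standalone toy
of the mark_healthy behaviour that the hunks below add: secondary bits
only survive for as long as at least one primary bit is still set.  The
mask values are arbitrary stand-ins, not the real XFS_SICK_* layout.

/* Toy model of "clear secondary state along with the last primary
 * problem"; the masks are illustrative, not the kernel definitions. */
#include <stdio.h>

#define PRIMARY_MASK	0x00ffU
#define SECONDARY_MASK	0x0f00U

static unsigned int mark_healthy(unsigned int sick, unsigned int mask)
{
	sick &= ~mask;
	if (!(sick & PRIMARY_MASK))		/* last primary bit gone? */
		sick &= ~SECONDARY_MASK;	/* then drop the side effects too */
	return sick;
}

int main(void)
{
	unsigned int	sick = 0x0101;	/* one primary + one secondary problem */

	sick = mark_healthy(sick, 0x0001);
	printf("sick after clearing the primary bit: 0x%x\n", sick);	/* 0x0 */
	return 0;
}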

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_health.h |   42 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_health.c        |   24 ++++++++++++++++--------
 2 files changed, 58 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index e2e1b95ddfb9..b3733f756bb2 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -31,6 +31,19 @@
  *  - !checked && sick  => errors have been observed during normal operation,
  *                         but the metadata has not been checked thoroughly
  *  - !checked && !sick => has not been examined since mount
+ *
+ * Evidence of health problems can be sorted into three basic categories:
+ *
+ * a) Primary evidence, which signals that something is defective within the
+ *    general grouping of metadata.
+ *
+ * b) Secondary evidence, which are side effects of a primary problem but are
+ *    not themselves problems.  These can be forgotten when the primary
+ *    health problems are addressed.
+ *
+ * c) Indirect evidence, which points to something being wrong in another
+ *    group, but we had to release resources and this is all that's left of
+ *    that state.
  */
 
 struct xfs_mount;
@@ -105,6 +118,35 @@ struct xfs_da_args;
 				 XFS_SICK_INO_SYMLINK | \
 				 XFS_SICK_INO_PARENT)
 
+/* Secondary state related to (but not primary evidence of) health problems. */
+#define XFS_SICK_FS_SECONDARY	(0)
+#define XFS_SICK_RT_SECONDARY	(0)
+#define XFS_SICK_AG_SECONDARY	(0)
+#define XFS_SICK_INO_SECONDARY	(0)
+
+/* Evidence of health problems elsewhere. */
+#define XFS_SICK_FS_INDIRECT	(0)
+#define XFS_SICK_RT_INDIRECT	(0)
+#define XFS_SICK_AG_INDIRECT	(0)
+#define XFS_SICK_INO_INDIRECT	(0)
+
+/* All health masks. */
+#define XFS_SICK_FS_ALL	(XFS_SICK_FS_PRIMARY | \
+				 XFS_SICK_FS_SECONDARY | \
+				 XFS_SICK_FS_INDIRECT)
+
+#define XFS_SICK_RT_ALL	(XFS_SICK_RT_PRIMARY | \
+				 XFS_SICK_RT_SECONDARY | \
+				 XFS_SICK_RT_INDIRECT)
+
+#define XFS_SICK_AG_ALL	(XFS_SICK_AG_PRIMARY | \
+				 XFS_SICK_AG_SECONDARY | \
+				 XFS_SICK_AG_INDIRECT)
+
+#define XFS_SICK_INO_ALL	(XFS_SICK_INO_PRIMARY | \
+				 XFS_SICK_INO_SECONDARY | \
+				 XFS_SICK_INO_INDIRECT)
+
 /*
  * These functions must be provided by the xfs implementation.  Function
  * behavior with respect to the first argument should be as follows:
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index aef5345804da..e1c7fe898161 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -97,7 +97,7 @@ xfs_fs_mark_sick(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_FS_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_sick(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
@@ -124,11 +124,13 @@ xfs_fs_mark_healthy(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_FS_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_healthy(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
 	mp->m_fs_sick &= ~mask;
+	if (!(mp->m_fs_sick & XFS_SICK_FS_PRIMARY))
+		mp->m_fs_sick &= ~XFS_SICK_FS_SECONDARY;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
 }
@@ -152,7 +154,7 @@ xfs_rt_mark_sick(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_RT_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_RT_ALL));
 	trace_xfs_rt_mark_sick(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
@@ -180,11 +182,13 @@ xfs_rt_mark_healthy(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_RT_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_RT_ALL));
 	trace_xfs_rt_mark_healthy(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
 	mp->m_rt_sick &= ~mask;
+	if (!(mp->m_rt_sick & XFS_SICK_RT_PRIMARY))
+		mp->m_rt_sick &= ~XFS_SICK_RT_SECONDARY;
 	mp->m_rt_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
 }
@@ -225,7 +229,7 @@ xfs_ag_mark_sick(
 	struct xfs_perag	*pag,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_AG_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_AG_ALL));
 	trace_xfs_ag_mark_sick(pag->pag_mount, pag->pag_agno, mask);
 
 	spin_lock(&pag->pag_state_lock);
@@ -252,11 +256,13 @@ xfs_ag_mark_healthy(
 	struct xfs_perag	*pag,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_AG_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_AG_ALL));
 	trace_xfs_ag_mark_healthy(pag->pag_mount, pag->pag_agno, mask);
 
 	spin_lock(&pag->pag_state_lock);
 	pag->pag_sick &= ~mask;
+	if (!(pag->pag_sick & XFS_SICK_AG_PRIMARY))
+		pag->pag_sick &= ~XFS_SICK_AG_SECONDARY;
 	pag->pag_checked |= mask;
 	spin_unlock(&pag->pag_state_lock);
 }
@@ -280,7 +286,7 @@ xfs_inode_mark_sick(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_INO_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_sick(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
@@ -316,11 +322,13 @@ xfs_inode_mark_healthy(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
-	ASSERT(!(mask & ~XFS_SICK_INO_PRIMARY));
+	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_healthy(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
 	ip->i_sick &= ~mask;
+	if (!(ip->i_sick & XFS_SICK_INO_PRIMARY))
+		ip->i_sick &= ~XFS_SICK_INO_SECONDARY;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
 }



* [PATCH 2/3] xfs: remember sick inodes that get inactivated
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: indirect health reporting Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: update health status if we get a clean bill of health Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: add secondary and indirect classes to the health tracking system Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If an unhealthy inode gets inactivated, remember this fact in the
per-fs health summary.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h        |    1 +
 fs/xfs/libxfs/xfs_health.h    |    7 +++++--
 fs/xfs/libxfs/xfs_inode_buf.c |    2 +-
 fs/xfs/scrub/health.c         |   12 +++++++++++-
 fs/xfs/xfs_health.c           |    1 +
 fs/xfs/xfs_inode.c            |   29 +++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.h            |    1 +
 7 files changed, 49 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 3885c56078f5..417cf85c0f70 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -292,6 +292,7 @@ struct xfs_ag_geometry {
 #define XFS_AG_GEOM_SICK_FINOBT	(1 << 7)  /* free inode index */
 #define XFS_AG_GEOM_SICK_RMAPBT	(1 << 8)  /* reverse mappings */
 #define XFS_AG_GEOM_SICK_REFCNTBT (1 << 9)  /* reference counts */
+#define XFS_AG_GEOM_SICK_INODES	(1 << 10) /* bad inodes were seen */
 
 /*
  * Structures for XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG & XFS_IOC_FSGROWFSRT
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index b3733f756bb2..252334bc0488 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -76,6 +76,7 @@ struct xfs_da_args;
 #define XFS_SICK_AG_FINOBT	(1 << 7)  /* free inode index */
 #define XFS_SICK_AG_RMAPBT	(1 << 8)  /* reverse mappings */
 #define XFS_SICK_AG_REFCNTBT	(1 << 9)  /* reference counts */
+#define XFS_SICK_AG_INODES	(1 << 10) /* inactivated bad inodes */
 
 /* Observable health issues for inode metadata. */
 #define XFS_SICK_INO_CORE	(1 << 0)  /* inode core */
@@ -86,6 +87,8 @@ struct xfs_da_args;
 #define XFS_SICK_INO_XATTR	(1 << 5)  /* extended attributes */
 #define XFS_SICK_INO_SYMLINK	(1 << 6)  /* symbolic link remote target */
 #define XFS_SICK_INO_PARENT	(1 << 7)  /* parent pointers */
+/* Don't propagate sick status to ag health summary during inactivation */
+#define XFS_SICK_INO_FORGET	(1 << 8)
 
 /* Primary evidence of health problems in a given group. */
 #define XFS_SICK_FS_PRIMARY	(XFS_SICK_FS_COUNTERS | \
@@ -122,12 +125,12 @@ struct xfs_da_args;
 #define XFS_SICK_FS_SECONDARY	(0)
 #define XFS_SICK_RT_SECONDARY	(0)
 #define XFS_SICK_AG_SECONDARY	(0)
-#define XFS_SICK_INO_SECONDARY	(0)
+#define XFS_SICK_INO_SECONDARY	(XFS_SICK_INO_FORGET)
 
 /* Evidence of health problems elsewhere. */
 #define XFS_SICK_FS_INDIRECT	(0)
 #define XFS_SICK_RT_INDIRECT	(0)
-#define XFS_SICK_AG_INDIRECT	(0)
+#define XFS_SICK_AG_INDIRECT	(XFS_SICK_AG_INODES)
 #define XFS_SICK_INO_INDIRECT	(0)
 
 /* All health masks. */
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 992ce2d5b9d0..454f40b29249 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -139,7 +139,7 @@ xfs_imap_to_bp(
 			imap->im_len, XBF_UNMAPPED, bpp, &xfs_inode_buf_ops);
 	if (xfs_metadata_is_sick(error))
 		xfs_agno_mark_sick(mp, xfs_daddr_to_agno(mp, imap->im_blkno),
-				XFS_SICK_AG_INOBT);
+				XFS_SICK_AG_INODES);
 	return error;
 }
 
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index e5cc89d43808..962791c8fafb 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -155,7 +155,17 @@ xchk_update_health(
 		if (!sc->ip)
 			return;
 		if (bad) {
-			xfs_inode_mark_sick(sc->ip, sc->sick_mask);
+			unsigned int	mask = sc->sick_mask;
+
+			/*
+			 * If we're coming in for repairs then we don't want
+			 * sickness flags to propagate to the incore health
+			 * status if the inode gets inactivated before we can
+			 * fix it.
+			 */
+			if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+				mask |= XFS_SICK_INO_FORGET;
+			xfs_inode_mark_sick(sc->ip, mask);
 			xfs_inode_mark_checked(sc->ip, sc->sick_mask);
 		} else
 			xfs_inode_mark_healthy(sc->ip, sc->sick_mask);
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index e1c7fe898161..74a4620d763b 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -415,6 +415,7 @@ static const struct ioctl_sick_map ag_map[] = {
 	{ XFS_SICK_AG_FINOBT,	XFS_AG_GEOM_SICK_FINOBT },
 	{ XFS_SICK_AG_RMAPBT,	XFS_AG_GEOM_SICK_RMAPBT },
 	{ XFS_SICK_AG_REFCNTBT,	XFS_AG_GEOM_SICK_REFCNTBT },
+	{ XFS_SICK_AG_INODES,	XFS_AG_GEOM_SICK_INODES },
 	{ 0, 0 },
 };
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2a1eee807f15..b082222a9061 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1794,6 +1794,33 @@ xfs_inode_needs_inactive(
 	return xfs_can_free_eofblocks(ip, true);
 }
 
+/*
+ * Save health status somewhere, if we're dumping an inode with uncorrected
+ * errors and online repair isn't running.
+ */
+static inline void
+xfs_inactive_health(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	unsigned int		sick;
+	unsigned int		checked;
+
+	xfs_inode_measure_sickness(ip, &sick, &checked);
+	if (!sick)
+		return;
+
+	trace_xfs_inode_unfixed_corruption(ip, sick);
+
+	if (sick & XFS_SICK_INO_FORGET)
+		return;
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	xfs_ag_mark_sick(pag, XFS_SICK_AG_INODES);
+	xfs_perag_put(pag);
+}
+
 /*
  * xfs_inactive
  *
@@ -1822,6 +1849,8 @@ xfs_inactive(
 	mp = ip->i_mount;
 	ASSERT(!xfs_iflags_test(ip, XFS_IRECOVERY));
 
+	xfs_inactive_health(ip);
+
 	/* If this is a read-only mount, don't do this (would generate I/O) */
 	if (xfs_is_readonly(mp))
 		goto out;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a435ca32f186..145808b733ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3921,6 +3921,7 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
 	TP_ARGS(ip, flags))
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
+DEFINE_INODE_CORRUPT_EVENT(xfs_inode_unfixed_corruption);
 
 TRACE_EVENT(xfs_iwalk_ag,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,



* [PATCH 3/3] xfs: update health status if we get a clean bill of health
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: indirect health reporting Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: remember sick inodes that get inactivated Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: add secondary and indirect classes to the health tracking system Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If scrub finds that everything is ok with the filesystem, we need a way
to tell the health tracking that it can let go of indirect health flags,
since indirect flags only mean that at some point in the past we lost
some context.
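
For illustration, a userspace caller could request this "everything was
clean, drop the reminders" scan roughly as follows.  This is a sketch:
it assumes uapi headers that already carry the new
XFS_SCRUB_TYPE_HEALTHY definition, and in practice xfs_scrub issues the
call itself after a clean pass rather than a hand-rolled tool.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

int main(int argc, char **argv)
{
	struct xfs_scrub_metadata	sm;
	int				fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <xfs mountpoint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&sm, 0, sizeof(sm));
	sm.sm_type = XFS_SCRUB_TYPE_HEALTHY;	/* added by this patch */

	if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0) {
		perror("XFS_IOC_SCRUB_METADATA");
		return 1;
	}

	if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
		printf("primary problems remain; reminders not cleared\n");
	else
		printf("clean; indirect health reminders cleared\n");
	return 0;
}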

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |    3 ++
 fs/xfs/scrub/health.c  |   64 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/health.h  |    1 +
 fs/xfs/scrub/repair.c  |    1 +
 fs/xfs/scrub/scrub.c   |    6 +++++
 fs/xfs/scrub/trace.h   |    4 ++-
 6 files changed, 77 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 417cf85c0f70..400cf68e551e 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -712,9 +712,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
 #define XFS_SCRUB_TYPE_QUOTACHECK 25	/* quota counters */
 #define XFS_SCRUB_TYPE_NLINKS	26	/* inode link counts */
+#define XFS_SCRUB_TYPE_HEALTHY	27	/* everything checked out ok */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	27
+#define XFS_SCRUB_TYPE_NR	28
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1u << 0)
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index 962791c8fafb..cdf059f47656 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -16,6 +16,7 @@
 #include "xfs_health.h"
 #include "scrub/scrub.h"
 #include "scrub/health.h"
+#include "scrub/common.h"
 
 /*
  * Scrub and In-Core Filesystem Health Assessments
@@ -119,6 +120,24 @@ xchk_health_mask_for_scrub_type(
 	return type_to_health_flag[scrub_type].sick_mask;
 }
 
+/*
+ * Scrub gave the filesystem a clean bill of health, so clear all the indirect
+ * markers of past problems (at least for the fs and ags) so that we can be
+ * healthy again.
+ */
+STATIC void
+xchk_mark_all_healthy(
+	struct xfs_mount	*mp)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	xfs_fs_mark_healthy(mp, XFS_SICK_FS_INDIRECT);
+	xfs_rt_mark_healthy(mp, XFS_SICK_RT_INDIRECT);
+	for_each_perag(mp, agno, pag)
+		xfs_ag_mark_healthy(pag, XFS_SICK_AG_INDIRECT);
+}
+
 /*
  * Update filesystem health assessments based on what we found and did.
  *
@@ -136,6 +155,18 @@ xchk_update_health(
 	struct xfs_perag	*pag;
 	bool			bad;
 
+	/*
+	 * The HEALTHY scrub type is a request from userspace to clear all the
+	 * indirect flags after a clean scan of the entire filesystem.  As such
+	 * there's no sick flag defined for it, so we branch here ahead of the
+	 * mask check.
+	 */
+	if (sc->sm->sm_type == XFS_SCRUB_TYPE_HEALTHY &&
+	    !(sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) {
+		xchk_mark_all_healthy(sc->mp);
+		return;
+	}
+
 	if (!sc->sick_mask)
 		return;
 
@@ -259,3 +290,36 @@ xchk_ag_btree_healthy_enough(
 
 	return true;
 }
+
+/*
+ * Quick scan to double-check that there isn't any evidence of lingering
+ * primary health problems.  If we're still clear, then the health update will
+ * take care of clearing the indirect evidence.
+ */
+int
+xchk_health_record(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	unsigned int		sick;
+	unsigned int		checked;
+
+	xfs_fs_measure_sickness(mp, &sick, &checked);
+	if (sick & XFS_SICK_FS_PRIMARY)
+		xchk_set_corrupt(sc);
+
+	xfs_rt_measure_sickness(mp, &sick, &checked);
+	if (sick & XFS_SICK_RT_PRIMARY)
+		xchk_set_corrupt(sc);
+
+	for_each_perag(mp, agno, pag) {
+		xfs_ag_measure_sickness(pag, &sick, &checked);
+		if (sick & XFS_SICK_AG_PRIMARY)
+			xchk_set_corrupt(sc);
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/health.h b/fs/xfs/scrub/health.h
index d0b938d3d028..ee80b663cfab 100644
--- a/fs/xfs/scrub/health.h
+++ b/fs/xfs/scrub/health.h
@@ -10,5 +10,6 @@ unsigned int xchk_health_mask_for_scrub_type(__u32 scrub_type);
 void xchk_update_health(struct xfs_scrub *sc);
 bool xchk_ag_btree_healthy_enough(struct xfs_scrub *sc, struct xfs_perag *pag,
 		xfs_btnum_t btnum);
+int xchk_health_record(struct xfs_scrub *sc);
 
 #endif /* __XFS_SCRUB_HEALTH_H__ */
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 7f66c763580b..1862e05e398f 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -30,6 +30,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_reflink.h"
+#include "xfs_health.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 2c05fbde1f88..a9d3f344f8af 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -381,6 +381,12 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.scrub	= xchk_nlinks,
 		.repair	= xrep_nlinks,
 	},
+	[XFS_SCRUB_TYPE_HEALTHY] = {	/* fs healthy; clean all reminders */
+		.type	= ST_FS,
+		.setup	= xchk_setup_fs,
+		.scrub	= xchk_health_record,
+		.repair = xrep_notsupported,
+	},
 };
 
 static int
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index dcbab5a197c1..c723b6302bd6 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -68,6 +68,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_QUOTACHECK);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_NLINKS);
+TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
 
 #define XFS_SCRUB_TYPE_STRINGS \
 	{ XFS_SCRUB_TYPE_PROBE,		"probe" }, \
@@ -96,7 +97,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_NLINKS);
 	{ XFS_SCRUB_TYPE_PQUOTA,	"prjquota" }, \
 	{ XFS_SCRUB_TYPE_FSCOUNTERS,	"fscounters" }, \
 	{ XFS_SCRUB_TYPE_QUOTACHECK,	"quotacheck" }, \
-	{ XFS_SCRUB_TYPE_NLINKS,	"nlinks" }
+	{ XFS_SCRUB_TYPE_NLINKS,	"nlinks" }, \
+	{ XFS_SCRUB_TYPE_HEALTHY,	"healthy" }
 
 #define XFS_SCRUB_FLAG_STRINGS \
 	{ XFS_SCRUB_IFLAG_REPAIR,		"repair" }, \



* [PATCHSET v24.0 0/3] xfs: online repair for fs summary counters
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (14 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: indirect health reporting Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: repair " Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
                   ` (61 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

A longstanding deficiency in the online fs summary counter scrubbing
code is that it hasn't any means to quiesce the incore percpu counters
while it's running.  There is no way to coordinate with other threads
that are reserving or freeing free space simultaneously, which leads to
false error reports.  Right now, if the discrepancy is large, we just
sort of shrug and bail out with an incomplete flag, but this is lame.

For repair activity, we actually /do/ need to stabilize the counters to
get an accurate reading and install it in the percpu counter.  To
improve the former and enable the latter, allow the fscounters online
fsck code to perform an exclusive mini-freeze on the filesystem.  The
exclusivity prevents userspace from thawing while we're running, and the
mini-freeze means that we don't wait for the log to quiesce, which will
make both speedier.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-fscounters
---
 fs/xfs/Makefile                  |    1 
 fs/xfs/scrub/common.c            |   26 ----
 fs/xfs/scrub/common.h            |    2 
 fs/xfs/scrub/fscounters.c        |  264 +++++++++++++++++++++++++++++++-------
 fs/xfs/scrub/fscounters.h        |   20 +++
 fs/xfs/scrub/fscounters_repair.c |   72 ++++++++++
 fs/xfs/scrub/repair.h            |    2 
 fs/xfs/scrub/scrub.c             |   10 +
 fs/xfs/scrub/scrub.h             |    2 
 fs/xfs/scrub/trace.c             |    1 
 fs/xfs/scrub/trace.h             |   23 +++
 11 files changed, 339 insertions(+), 84 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters.h
 create mode 100644 fs/xfs/scrub/fscounters_repair.c



* [PATCH 1/3] xfs: stabilize fs summary counters for online fsck
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: online repair for fs summary counters Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: repair " Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: remove XCHK_REAPING_DISABLED from scrub Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the fscounters scrubber notices incorrect summary counters, it's
entirely possible that scrub is simply racing with other threads that
are updating the incore counters.  There isn't a good way to stabilize
percpu counters or set ourselves up to observe live updates with hooks
like we do for the quotacheck or nlinks scanners, so we instead choose
to freeze the filesystem long enough to walk the incore per-AG
structures.

Past me thought that it was going to be commonplace to have to freeze
the filesystem to perform some kind of repair, so I set up a whole
separate infrastructure to freeze the filesystem in such a way that
userspace could not unfreeze while we were running.  This involved
adding a mutex and freeze_super/thaw_super functions, and dealing with
the fact that the VFS freeze/thaw functions can free the VFS superblock
references on return.

This was all very overwrought, since fscounters turned out to be the
only user of scrub freezes, and it doesn't require the log to quiesce,
only the incore superblock counters.  We prevent other threads from
changing the freeze level by adding a new SB_FREEZE_EXCLUSIVE level.

The end result is that fscounters should be much more efficient.  When
we're checking a busy system and we can't stabilize the counters, the
custom freeze will do less work, which should result in less downtime.
Repair should be similarly speedy, but that's in the next patch.
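
A quick aside on why the extra level is sufficient: the normal VFS thaw
path only proceeds when the freeze state is exactly "complete", so
parking the counter one step above that locks out userspace thaw until
scrub steps back down.  A standalone toy of that check (just the shape
of the idea, not the VFS code):

#include <stdio.h>

enum freeze_level {
	SB_UNFROZEN = 0,
	SB_FREEZE_WRITE,
	SB_FREEZE_PAGEFAULT,
	SB_FREEZE_FS,
	SB_FREEZE_COMPLETE,
	SB_FREEZE_EXCLUSIVE,		/* scrub-private, one past COMPLETE */
};

/* Model of the thaw entry point refusing a state it didn't expect. */
static int thaw(enum freeze_level *frozen)
{
	if (*frozen != SB_FREEZE_COMPLETE)
		return -1;
	*frozen = SB_UNFROZEN;
	return 0;
}

int main(void)
{
	enum freeze_level	frozen = SB_FREEZE_EXCLUSIVE;

	if (thaw(&frozen))
		printf("userspace thaw refused while scrub holds the freeze\n");
	return 0;
}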

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/fscounters.c |  251 +++++++++++++++++++++++++++++++++++++--------
 fs/xfs/scrub/fscounters.h |   20 ++++
 fs/xfs/scrub/scrub.c      |    6 +
 fs/xfs/scrub/scrub.h      |    1 
 fs/xfs/scrub/trace.h      |    1 
 5 files changed, 233 insertions(+), 46 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters.h


diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index 63755ba4fc0e..e90e59c2e565 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: GPL-2.0+
+// SPDX-License-Identifier: GPL-2.0-or-later
 /*
  * Copyright (C) 2019 Oracle.  All Rights Reserved.
  * Author: Darrick J. Wong <darrick.wong@oracle.com>
@@ -16,9 +16,11 @@
 #include "xfs_ag.h"
 #include "xfs_rtalloc.h"
 #include "xfs_inode.h"
+#include "xfs_icache.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
+#include "scrub/fscounters.h"
 
 /*
  * FS Summary Counters
@@ -45,16 +47,6 @@
  * our tolerance for mismatch between expected and actual counter values.
  */
 
-struct xchk_fscounters {
-	struct xfs_scrub	*sc;
-	uint64_t		icount;
-	uint64_t		ifree;
-	uint64_t		fdblocks;
-	uint64_t		frextents;
-	unsigned long long	icount_min;
-	unsigned long long	icount_max;
-};
-
 /*
  * Since the expected value computation is lockless but only browses incore
  * values, the percpu counters should be fairly close to each other.  However,
@@ -121,6 +113,142 @@ xchk_fscount_warmup(
 	return error;
 }
 
+#define SB_FREEZE_EXCLUSIVE	(SB_FREEZE_COMPLETE + 1)
+
+/*
+ * We couldn't stabilize the filesystem long enough to sample all the variables
+ * that comprise the summary counters and compare them to the percpu counters.
+ * We need to disable all writer threads, which means taking the first two
+ * freeze levels to put userspace to sleep, and the third freeze level to
+ * prevent background threads from starting new transactions.  Take one level
+ * more to prevent other callers from unfreezing the filesystem while we run.
+ */
+STATIC int
+xchk_fscounters_freeze(
+	struct xfs_scrub	*sc)
+{
+	struct xchk_fscounters	*fsc = sc->buf;
+	struct xfs_mount	*mp = sc->mp;
+	struct super_block	*sb = mp->m_super;
+	int			level;
+	int			error = 0;
+
+	if (sc->flags & XCHK_HAVE_FREEZE_PROT) {
+		sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
+		mnt_drop_write_file(sc->file);
+	}
+
+	/* Wait until we're ready to freeze or give up. */
+	down_write(&sb->s_umount);
+	while (sb->s_writers.frozen != SB_UNFROZEN) {
+		up_write(&sb->s_umount);
+
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		delay(HZ / 10);
+		down_write(&sb->s_umount);
+	}
+
+	if (sb_rdonly(sb)) {
+		sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
+		goto done;
+	}
+
+	sb->s_writers.frozen = SB_FREEZE_WRITE;
+	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
+	up_write(&sb->s_umount);
+	percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+	down_write(&sb->s_umount);
+
+	/* Now we go and block page faults... */
+	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
+	percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_PAGEFAULT - 1);
+
+	/*
+	 * All writers are done so after syncing there won't be dirty data.
+	 * Let xfs_fs_sync_fs flush dirty data so the VFS won't start writeback
+	 * and to disable the background gc workers.
+	 */
+	error = sync_filesystem(sb);
+	if (error) {
+		sb->s_writers.frozen = SB_UNFROZEN;
+		for (level = SB_FREEZE_PAGEFAULT - 1; level >= 0; level--)
+			percpu_up_write(sb->s_writers.rw_sem + level);
+		wake_up(&sb->s_writers.wait_unfrozen);
+		up_write(&sb->s_umount);
+		return error;
+	}
+
+	/* Now wait for internal filesystem counter */
+	sb->s_writers.frozen = SB_FREEZE_FS;
+	percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_FS - 1);
+
+	/*
+	 * We do not need to quiesce the log to check the summary counters, so
+	 * skip the call to xfs_fs_freeze here.  To prevent anyone else from
+	 * unfreezing us, set the VFS freeze level to one higher than
+	 * FREEZE_COMPLETE.
+	 */
+	sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
+	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
+		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0,
+				_THIS_IP_);
+done:
+	fsc->frozen = true;
+	up_write(&sb->s_umount);
+	return 0;
+}
+
+/* Thaw the filesystem after checking or repairing fscounters. */
+STATIC void
+xchk_fscounters_cleanup(
+	void			*buf)
+{
+	struct xchk_fscounters	*fsc = buf;
+	struct xfs_scrub	*sc = fsc->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct super_block	*sb = mp->m_super;
+	int			level;
+
+	if (!fsc->frozen)
+		return;
+
+	down_write(&sb->s_umount);
+	if (sb->s_writers.frozen != SB_FREEZE_EXCLUSIVE) {
+		/* somebody snuck in and unfroze us? */
+		ASSERT(0);
+		up_write(&sb->s_umount);
+		return;
+	}
+
+	if (sb_rdonly(sb)) {
+		sb->s_writers.frozen = SB_UNFROZEN;
+		goto out;
+	}
+
+	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0,
+				_THIS_IP_);
+
+	/*
+	 * We didn't call xfs_fs_freeze, so we can't call xfs_fs_thaw.  Start
+	 * the background gc workers that were shut down by xfs_fs_sync_fs
+	 * when we froze.
+	 */
+	xfs_blockgc_start(mp);
+	xfs_inodegc_start(mp);
+
+	sb->s_writers.frozen = SB_UNFROZEN;
+	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
+		percpu_up_write(sb->s_writers.rw_sem + level);
+out:
+	wake_up(&sb->s_writers.wait_unfrozen);
+	fsc->frozen = false;
+	up_write(&sb->s_umount);
+	return;
+}
+
 int
 xchk_setup_fscounters(
 	struct xfs_scrub	*sc)
@@ -138,6 +266,7 @@ xchk_setup_fscounters(
 	sc->buf = kzalloc(sizeof(struct xchk_fscounters), XCHK_GFP_FLAGS);
 	if (!sc->buf)
 		return -ENOMEM;
+	sc->buf_cleanup = xchk_fscounters_cleanup;
 	fsc = sc->buf;
 	fsc->sc = sc;
 
@@ -149,13 +278,17 @@ xchk_setup_fscounters(
 		return error;
 
 	/*
-	 * Pause background reclaim while we're scrubbing to reduce the
-	 * likelihood of background perturbations to the counters throwing off
-	 * our calculations.
+	 * Pause all writer activity in the filesystem while we're scrubbing to
+	 * reduce the likelihood of background perturbations to the counters
+	 * throwing off our calculations.
 	 */
-	xchk_stop_reaping(sc);
+	if (sc->flags & XCHK_TRY_HARDER) {
+		error = xchk_fscounters_freeze(sc);
+		if (error)
+			return error;
+	}
 
-	return xchk_trans_alloc(sc, 0);
+	return xchk_trans_alloc_empty(sc);
 }
 
 /*
@@ -294,8 +427,7 @@ xchk_fscount_aggregate_agcounts(
 	if (fsc->ifree > fsc->icount) {
 		if (tries--)
 			goto retry;
-		xchk_set_incomplete(sc);
-		return 0;
+		return -EDEADLOCK;
 	}
 
 	return 0;
@@ -371,6 +503,8 @@ xchk_fscount_count_frextents(
  * Otherwise, we /might/ have a problem.  If the change in the summations is
  * more than we want to tolerate, the filesystem is probably busy and we should
  * just send back INCOMPLETE and see if userspace will try again.
+ *
+ * If we're repairing then we require an exact match.
  */
 static inline bool
 xchk_fscount_within_range(
@@ -400,21 +534,7 @@ xchk_fscount_within_range(
 	if (expected >= min_value && expected <= max_value)
 		return true;
 
-	/*
-	 * If the difference between the two summations is too large, the fs
-	 * might just be busy and so we'll mark the scrub incomplete.  Return
-	 * true here so that we don't mark the counter corrupt.
-	 *
-	 * XXX: In the future when userspace can grant scrub permission to
-	 * quiesce the filesystem to solve the outsized variance problem, this
-	 * check should be moved up and the return code changed to signal to
-	 * userspace that we need quiesce permission.
-	 */
-	if (max_value - min_value >= XCHK_FSCOUNT_MIN_VARIANCE) {
-		xchk_set_incomplete(sc);
-		return true;
-	}
-
+	/* Everything else is bad. */
 	return false;
 }
 
@@ -426,6 +546,7 @@ xchk_fscounters(
 	struct xfs_mount	*mp = sc->mp;
 	struct xchk_fscounters	*fsc = sc->buf;
 	int64_t			icount, ifree, fdblocks, frextents;
+	bool			try_again = false;
 	int			error;
 
 	/* Snapshot the percpu counters. */
@@ -435,9 +556,26 @@ xchk_fscounters(
 	frextents = percpu_counter_sum(&mp->m_frextents);
 
 	/* No negative values, please! */
-	if (icount < 0 || ifree < 0 || fdblocks < 0 || frextents < 0)
+	if (icount < 0 || ifree < 0)
 		xchk_set_corrupt(sc);
 
+	/*
+	 * If the filesystem is not frozen, the counter summation calls above
+	 * can race with xfs_mod_freecounter, which subtracts a requested space
+	 * reservation from the counter and undoes the subtraction if that made
+	 * the counter go negative.  Therefore, it's possible to see negative
+	 * values here, and we should only flag that as a corruption if we
+	 * froze the fs.  This is much more likely to happen with frextents
+	 * since there are no reserved pools.
+	 */
+	if (fdblocks < 0 || frextents < 0) {
+		if (!fsc->frozen)
+			return -EDEADLOCK;
+
+		xchk_set_corrupt(sc);
+		return 0;
+	}
+
 	/* See if icount is obviously wrong. */
 	if (icount < fsc->icount_min || icount > fsc->icount_max)
 		xchk_set_corrupt(sc);
@@ -461,8 +599,6 @@ xchk_fscounters(
 	error = xchk_fscount_aggregate_agcounts(sc, fsc);
 	if (!xchk_process_error(sc, 0, XFS_SB_BLOCK(mp), &error))
 		return error;
-	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)
-		return 0;
 
 	/* Count the free extents counter for rt volumes. */
 	error = xchk_fscount_count_frextents(sc, fsc);
@@ -471,20 +607,45 @@ xchk_fscounters(
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE)
 		return 0;
 
-	/* Compare the in-core counters with whatever we counted. */
-	if (!xchk_fscount_within_range(sc, icount, &mp->m_icount, fsc->icount))
-		xchk_set_corrupt(sc);
+	/*
+	 * Compare the in-core counters with whatever we counted.  If the fs is
+	 * frozen, we treat the discrepancy as a corruption because the freeze
+	 * should have stabilized the counter values.  Otherwise, we need
+	 * userspace to call us back having granted us freeze permission.
+	 */
+	if (!xchk_fscount_within_range(sc, icount, &mp->m_icount,
+				fsc->icount)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
 
-	if (!xchk_fscount_within_range(sc, ifree, &mp->m_ifree, fsc->ifree))
-		xchk_set_corrupt(sc);
+	if (!xchk_fscount_within_range(sc, ifree, &mp->m_ifree, fsc->ifree)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
 
 	if (!xchk_fscount_within_range(sc, fdblocks, &mp->m_fdblocks,
-			fsc->fdblocks))
-		xchk_set_corrupt(sc);
+			fsc->fdblocks)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
 
 	if (!xchk_fscount_within_range(sc, frextents, &mp->m_frextents,
-			fsc->frextents))
-		xchk_set_corrupt(sc);
+			fsc->frextents)) {
+		if (fsc->frozen)
+			xchk_set_corrupt(sc);
+		else
+			try_again = true;
+	}
+
+	if (try_again)
+		return -EDEADLOCK;
 
 	return 0;
 }
diff --git a/fs/xfs/scrub/fscounters.h b/fs/xfs/scrub/fscounters.h
new file mode 100644
index 000000000000..8c8af4ff3c91
--- /dev/null
+++ b/fs/xfs/scrub/fscounters.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_FSCOUNTERS_H__
+#define __XFS_SCRUB_FSCOUNTERS_H__
+
+struct xchk_fscounters {
+	struct xfs_scrub	*sc;
+	uint64_t		icount;
+	uint64_t		ifree;
+	uint64_t		fdblocks;
+	uint64_t		frextents;
+	unsigned long long	icount_min;
+	unsigned long long	icount_max;
+	bool			frozen;
+};
+
+#endif /* __XFS_SCRUB_FSCOUNTERS_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index a9d3f344f8af..6728b1409fea 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -187,8 +187,10 @@ xchk_teardown(
 		xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
 	}
-	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+	if (sc->flags & XCHK_HAVE_FREEZE_PROT) {
+		sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
 		mnt_drop_write_file(sc->file);
+	}
 	if (sc->flags & XCHK_REAPING_DISABLED)
 		xchk_start_reaping(sc);
 	if (sc->xfile) {
@@ -541,6 +543,8 @@ xfs_scrub_metadata(
 		error = mnt_want_write_file(sc->file);
 		if (error)
 			goto out_sc;
+
+		sc->flags |= XCHK_HAVE_FREEZE_PROT;
 	}
 
 	/* Set up for the operation. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index da9da6245475..12d1ba2aa83b 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -123,6 +123,7 @@ struct xfs_scrub {
 #define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to use intent drain */
 #define XCHK_FSHOOKS_QUOTA	(1 << 4)  /* quota live update enabled */
 #define XCHK_FSHOOKS_NLINKS	(1 << 5)  /* link count live update enabled */
+#define XCHK_HAVE_FREEZE_PROT	(1 << 6)  /* do we have freeze protection? */
 #define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index c723b6302bd6..19b2d569d0c8 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -113,6 +113,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
 
 #define XFS_SCRUB_STATE_STRINGS \
 	{ XCHK_TRY_HARDER,			"try_harder" }, \
+	{ XCHK_HAVE_FREEZE_PROT,		"nofreeze" }, \
 	{ XCHK_REAPING_DISABLED,		"reaping_disabled" }, \
 	{ XCHK_FSHOOKS_DRAIN,			"fshooks_drain" }, \
 	{ XCHK_NEED_DRAIN,			"need_drain" }, \


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs: remove XCHK_REAPING_DISABLED from scrub
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: online repair for fs summary counters Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: repair " Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: stabilize fs summary counters for online fsck Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Nobody uses this code anymore, so get rid of it.  It was racy with respect
to freezes and remounts anyway.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |   26 --------------------------
 fs/xfs/scrub/common.h |    2 --
 fs/xfs/scrub/scrub.c  |    2 --
 fs/xfs/scrub/scrub.h  |    3 +--
 fs/xfs/scrub/trace.h  |    1 -
 5 files changed, 1 insertion(+), 33 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index ef2dd59d0ac9..d96355ca4175 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1239,32 +1239,6 @@ xchk_metadata_inode_forks(
 	return 0;
 }
 
-/* Pause background reaping of resources. */
-void
-xchk_stop_reaping(
-	struct xfs_scrub	*sc)
-{
-	sc->flags |= XCHK_REAPING_DISABLED;
-	xfs_blockgc_stop(sc->mp);
-	xfs_inodegc_stop(sc->mp);
-}
-
-/* Restart background reaping of resources. */
-void
-xchk_start_reaping(
-	struct xfs_scrub	*sc)
-{
-	/*
-	 * Readonly filesystems do not perform inactivation or speculative
-	 * preallocation, so there's no need to restart the workers.
-	 */
-	if (!xfs_is_readonly(sc->mp)) {
-		xfs_inodegc_start(sc->mp);
-		xfs_blockgc_start(sc->mp);
-	}
-	sc->flags &= ~XCHK_REAPING_DISABLED;
-}
-
 /*
  * Enable filesystem hooks (i.e. runtime code patching) before starting a scrub
  * operation.  Callers must not hold any locks that intersect with the CPU
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 45318bd5678d..6992e3db5f11 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -213,8 +213,6 @@ static inline bool xchk_could_repair(const struct xfs_scrub *sc)
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
 int xchk_metadata_inode_forks(struct xfs_scrub *sc);
-void xchk_stop_reaping(struct xfs_scrub *sc);
-void xchk_start_reaping(struct xfs_scrub *sc);
 
 /*
  * Setting up a hook to wait for intents to drain is costly -- we have to take
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6728b1409fea..4ec9ff259c5e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -191,8 +191,6 @@ xchk_teardown(
 		sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
 		mnt_drop_write_file(sc->file);
 	}
-	if (sc->flags & XCHK_REAPING_DISABLED)
-		xchk_start_reaping(sc);
 	if (sc->xfile) {
 		xfile_destroy(sc->xfile);
 		sc->xfile = NULL;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 12d1ba2aa83b..04afb584f504 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -118,12 +118,11 @@ struct xfs_scrub {
 
 /* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */
 #define XCHK_TRY_HARDER		(1 << 0)  /* can't get resources, try again */
-#define XCHK_REAPING_DISABLED	(1 << 1)  /* background block reaping paused */
+#define XCHK_HAVE_FREEZE_PROT	(1 << 1)  /* do we have freeze protection? */
 #define XCHK_FSHOOKS_DRAIN	(1 << 2)  /* defer ops draining enabled */
 #define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to use intent drain */
 #define XCHK_FSHOOKS_QUOTA	(1 << 4)  /* quota live update enabled */
 #define XCHK_FSHOOKS_NLINKS	(1 << 5)  /* link count live update enabled */
-#define XCHK_HAVE_FREEZE_PROT	(1 << 6)  /* do we have freeze protection? */
 #define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 19b2d569d0c8..e0de12c4fcf4 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -114,7 +114,6 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
 #define XFS_SCRUB_STATE_STRINGS \
 	{ XCHK_TRY_HARDER,			"try_harder" }, \
 	{ XCHK_HAVE_FREEZE_PROT,		"nofreeze" }, \
-	{ XCHK_REAPING_DISABLED,		"reaping_disabled" }, \
 	{ XCHK_FSHOOKS_DRAIN,			"fshooks_drain" }, \
 	{ XCHK_NEED_DRAIN,			"need_drain" }, \
 	{ XCHK_FSHOOKS_QUOTA,			"fshooks_quota" }, \


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/3] xfs: repair summary counters
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: online repair for fs summary counters Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: stabilize fs summary counters for online fsck Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: remove XCHK_REAPING_DISABLED from scrub Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use the same summary counter calculation infrastructure to generate new
values for the in-core summary counters.  The difference between the
scrubber and the repairer is that the repairer will freeze the fs during
setup, which means that the values should match exactly.
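
As a rough illustration of how this gets invoked, the sketch below shows a
userspace caller asking the kernel to check and repair the summary counters
through the scrub ioctl.  This is not part of the patch; it assumes the
xfsprogs headers that export XFS_IOC_SCRUB_METADATA and friends, and the
error handling is deliberately minimal.

#include <string.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* assumed to provide the scrub ioctl bits */

/* Ask the kernel to check (and repair) the fs summary counters. */
static int repair_fscounters(int mount_fd)
{
	struct xfs_scrub_metadata	sm;

	memset(&sm, 0, sizeof(sm));
	sm.sm_type = XFS_SCRUB_TYPE_FSCOUNTERS;
	sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;

	if (ioctl(mount_fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
		return -1;

	/* The output flags tell us whether anything is still wrong. */
	return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? 1 : 0;
}

The xfs_scrub utility drives this same ioctl on behalf of administrators, so
most people will never need to call it by hand.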

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                  |    1 +
 fs/xfs/scrub/fscounters.c        |   15 +++++++-
 fs/xfs/scrub/fscounters_repair.c |   72 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h            |    2 +
 fs/xfs/scrub/scrub.c             |    2 +
 fs/xfs/scrub/trace.c             |    1 +
 fs/xfs/scrub/trace.h             |   21 +++++++++--
 7 files changed, 107 insertions(+), 7 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a69c5585e41c..2d756e13d441 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -187,6 +187,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   bmap_repair.o \
 				   cow_repair.o \
+				   fscounters_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   newbt.o \
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index e90e59c2e565..ae12da1be95c 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -281,8 +281,13 @@ xchk_setup_fscounters(
 	 * Pause all writer activity in the filesystem while we're scrubbing to
 	 * reduce the likelihood of background perturbations to the counters
 	 * throwing off our calculations.
+	 *
+	 * If we're repairing, we need to prevent any other thread from
+	 * changing the global fs summary counters while we're repairing them.
+	 * This requires the fs to be frozen, which will disable background
+	 * reclaim and purge all inactive inodes.
 	 */
-	if (sc->flags & XCHK_TRY_HARDER) {
+	if ((sc->flags & XCHK_TRY_HARDER) || xchk_could_repair(sc)) {
 		error = xchk_fscounters_freeze(sc);
 		if (error)
 			return error;
@@ -300,7 +305,9 @@ xchk_setup_fscounters(
  * set the INCOMPLETE flag even when a negative errno is returned.  This care
  * must be taken with certain errno values (i.e. EFSBADCRC, EFSCORRUPTED,
  * ECANCELED) that are absorbed into a scrub state flag update by
- * xchk_*_process_error.
+ * xchk_*_process_error.  Scrub and repair share the same incore data
+ * structures, so the INCOMPLETE flag is critical to prevent a repair based on
+ * insufficient information.
  */
 
 /* Count free space btree blocks manually for pre-lazysbcount filesystems. */
@@ -527,6 +534,10 @@ xchk_fscount_within_range(
 	if (curr_value == expected)
 		return true;
 
+	/* We require exact matches when repair is running. */
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+		return false;
+
 	min_value = min(old_value, curr_value);
 	max_value = max(old_value, curr_value);
 
diff --git a/fs/xfs/scrub/fscounters_repair.c b/fs/xfs/scrub/fscounters_repair.c
new file mode 100644
index 000000000000..1c38870736d9
--- /dev/null
+++ b/fs/xfs/scrub/fscounters_repair.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_health.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/fscounters.h"
+
+/*
+ * FS Summary Counters
+ * ===================
+ *
+ * We correct errors in the filesystem summary counters by setting them to the
+ * values computed during the obligatory scrub phase.  However, we must be
+ * careful not to allow any other thread to change the counters while we're
+ * computing and setting new values.  To achieve this, we freeze the
+ * filesystem for the whole operation if the REPAIR flag is set.  The checking
+ * function is stricter when we've frozen the fs.
+ */
+
+/*
+ * Reset the superblock counters.  Caller is responsible for freezing the
+ * filesystem during the calculation and reset phases.
+ */
+int
+xrep_fscounters(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xchk_fscounters	*fsc = sc->buf;
+
+	/*
+	 * Reinitialize the in-core counters from what we computed.  We froze
+	 * the filesystem, so there shouldn't be anyone else trying to modify
+	 * these counters.
+	 */
+	if (!fsc->frozen) {
+		ASSERT(fsc->frozen);
+		return -EFSCORRUPTED;
+	}
+
+	trace_xrep_reset_counters(mp, fsc);
+
+	percpu_counter_set(&mp->m_icount, fsc->icount);
+	percpu_counter_set(&mp->m_ifree, fsc->ifree);
+	percpu_counter_set(&mp->m_fdblocks, fsc->fdblocks);
+	percpu_counter_set(&mp->m_frextents, fsc->frextents);
+	mp->m_sb.sb_frextents = fsc->frextents;
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 6c19f0d7f335..5e3e6cfe3332 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -108,6 +108,7 @@ int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_bmap_cow(struct xfs_scrub *sc);
 int xrep_nlinks(struct xfs_scrub *sc);
+int xrep_fscounters(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -193,6 +194,7 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 #define xrep_quota			xrep_notsupported
 #define xrep_quotacheck			xrep_notsupported
 #define xrep_nlinks			xrep_notsupported
+#define xrep_fscounters			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 4ec9ff259c5e..fd116531a0d9 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -367,7 +367,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xchk_setup_fscounters,
 		.scrub	= xchk_fscounters,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_fscounters,
 	},
 	[XFS_SCRUB_TYPE_QUOTACHECK] = {	/* quota counters */
 		.type	= ST_FS,
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 82edcc830fb8..08e05d49e7c0 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -20,6 +20,7 @@
 #include "scrub/xfarray.h"
 #include "scrub/iscan.h"
 #include "scrub/nlinks.h"
+#include "scrub/fscounters.h"
 
 /* Figure out which block the btree cursor was pointing to. */
 static inline xfs_fsblock_t
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index e0de12c4fcf4..14569068b6ee 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -23,6 +23,7 @@ struct xfarray;
 struct xfarray_sortinfo;
 struct xchk_iscan;
 struct xchk_nlink;
+struct xchk_fscounters;
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -1623,16 +1624,28 @@ TRACE_EVENT(xrep_calc_ag_resblks_btsize,
 		  __entry->refcbt_sz)
 )
 TRACE_EVENT(xrep_reset_counters,
-	TP_PROTO(struct xfs_mount *mp),
-	TP_ARGS(mp),
+	TP_PROTO(struct xfs_mount *mp, struct xchk_fscounters *fsc),
+	TP_ARGS(mp, fsc),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(uint64_t, icount)
+		__field(uint64_t, ifree)
+		__field(uint64_t, fdblocks)
+		__field(uint64_t, frextents)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
+		__entry->icount = fsc->icount;
+		__entry->ifree = fsc->ifree;
+		__entry->fdblocks = fsc->fdblocks;
+		__entry->frextents = fsc->frextents;
 	),
-	TP_printk("dev %d:%d",
-		  MAJOR(__entry->dev), MINOR(__entry->dev))
+	TP_printk("dev %d:%d icount %llu ifree %llu fdblocks %llu frextents %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->icount,
+		  __entry->ifree,
+		  __entry->fdblocks,
+		  __entry->frextents)
 )
 
 DECLARE_EVENT_CLASS(xrep_newbt_extent_class,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/7] xfs: support in-memory btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (15 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: online repair for fs summary counters Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/7] xfs: dump xfiles for debugging purposes Darrick J. Wong
                     ` (6 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees Darrick J. Wong
                   ` (60 subsequent siblings)
  77 siblings, 7 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

Hi all,

Online repair of the reverse-mapping btrees presents some unique
challenges.  To construct a new reverse mapping btree, we must scan the
entire filesystem, but we cannot afford to quiesce the entire filesystem
for the potentially lengthy scan.

For rmap btrees, therefore, we relax our requirements of totally atomic
repairs.  Instead, repairs will scan all inodes, construct a new reverse
mapping dataset, format a new btree, and commit it before anyone trips
over the corruption.  This is exactly the same strategy as was used in
the quotacheck and nlink scanners.

Unfortunately, the xfarray cannot perform key-based lookups and is
therefore unsuitable for supporting live updates.  Luckily, we already have
a data structure that maintains an indexed rmap recordset -- the existing
rmap btree code!  Hence we port the existing btree and buffer target
code to be able to create a btree using the xfile we developed earlier.
Live hooks keep the in-memory btree up to date for any resources that
have already been scanned.

This approach is not maximally memory efficient, but we can use the same
rmap code that we do everywhere else, which provides improved stability
without growing the code base even more.  Note that in-memory btree
blocks are always page sized.

This patchset modifies the kernel xfs buffer cache to be capable of
using a xfile (aka a shmem file) as a backing device.  It then augments
the btree code to support creating btree cursors with buffers that come
from a buftarg other than the data device (namely an xfile-backed
buftarg).  For the userspace xfs buffer cache, we instead use a memfd or
an O_TMPFILE file as a backing device.
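
To make the plumbing concrete, here is a minimal sketch (not lifted from the
patches verbatim) of standing up an xfile and wrapping it in an in-memory
buffer target.  The xfile_create() signature is assumed from the earlier
xfile series, and xfs_alloc_memory_buftarg() is introduced by patch 3 below:

/*
 * Minimal sketch: create a shmem-backed xfile and an in-memory buftarg
 * on top of it.  Error paths tear down whatever was already built.
 */
static int
demo_create_memory_buftarg(
	struct xfs_mount	*mp,
	struct xfile		**xfilep,
	struct xfs_buftarg	**btpp)
{
	struct xfile		*xfile;
	int			error;

	/* xfile_create() arguments assumed from the earlier xfile patches */
	error = xfile_create(mp, "in-memory btree demo", 0, &xfile);
	if (error)
		return error;

	error = xfs_alloc_memory_buftarg(mp, xfile, btpp);
	if (error) {
		xfile_destroy(xfile);
		return error;
	}

	*xfilep = xfile;
	return 0;
}

Teardown would presumably be the reverse: xfs_free_buftarg() followed by
xfile_destroy(), since the in-memory buftarg does not own the xfile.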

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=in-memory-btrees
---
 fs/xfs/Kconfig                     |    8 
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_btree.c          |  173 ++++++--
 fs/xfs/libxfs/xfs_btree.h          |   17 +
 fs/xfs/libxfs/xfs_btree_mem.h      |  128 ++++++
 fs/xfs/libxfs/xfs_refcount_btree.c |    4 
 fs/xfs/libxfs/xfs_rmap_btree.c     |    4 
 fs/xfs/scrub/bitmap.c              |   28 +
 fs/xfs/scrub/bitmap.h              |    3 
 fs/xfs/scrub/scrub.c               |    4 
 fs/xfs/scrub/scrub.h               |    3 
 fs/xfs/scrub/trace.c               |   13 +
 fs/xfs/scrub/trace.h               |  110 +++++
 fs/xfs/scrub/xfbtree.c             |  816 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfbtree.h             |   57 +++
 fs/xfs/scrub/xfile.c               |  181 ++++++++
 fs/xfs/scrub/xfile.h               |   65 +++
 fs/xfs/xfs_aops.c                  |    5 
 fs/xfs/xfs_bmap_util.c             |    8 
 fs/xfs/xfs_buf.c                   |  234 ++++++++--
 fs/xfs/xfs_buf.h                   |   90 ++++
 fs/xfs/xfs_discard.c               |    8 
 fs/xfs/xfs_file.c                  |    6 
 fs/xfs/xfs_health.c                |    3 
 fs/xfs/xfs_ioctl.c                 |    3 
 fs/xfs/xfs_iomap.c                 |    4 
 fs/xfs/xfs_log.c                   |    4 
 fs/xfs/xfs_log_cil.c               |    3 
 fs/xfs/xfs_log_recover.c           |    3 
 fs/xfs/xfs_super.c                 |    4 
 fs/xfs/xfs_trace.c                 |    3 
 fs/xfs/xfs_trace.h                 |   85 ++++
 fs/xfs/xfs_trans.h                 |    1 
 fs/xfs/xfs_trans_buf.c             |   42 ++
 34 files changed, 2011 insertions(+), 110 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_btree_mem.h
 create mode 100644 fs/xfs/scrub/xfbtree.c
 create mode 100644 fs/xfs/scrub/xfbtree.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/7] xfs: dump xfiles for debugging purposes
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/7] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add a debug function to dump an xfile's contents to dmesg for debugging
purposes.
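
Purely as a usage sketch (the helper below is hypothetical, not part of this
patch), a scrubber that decides its staging file contains garbage could dump
it before bailing out:

/*
 * Hypothetical call site: mark the scrub corrupt and dump the staging
 * xfile to dmesg so its contents can be inspected after the fact.
 */
static inline void
xchk_dump_staging_file(
	struct xfs_scrub	*sc)
{
	xchk_set_corrupt(sc);
	if (sc->xfile)
		xfile_dump(sc->xfile);	/* best effort; ignore the return value */
}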

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/xfile.c |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfile.h |    2 +
 2 files changed, 100 insertions(+)


diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index 7090a8e12b60..1b858b6a53c8 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -424,3 +424,101 @@ xfile_put_page(
 		return -EIO;
 	return 0;
 }
+
+/* Dump an xfile to dmesg. */
+int
+xfile_dump(
+	struct xfile		*xf)
+{
+	struct xfile_stat	sb;
+	struct inode		*inode = file_inode(xf->file);
+	struct address_space	*mapping = inode->i_mapping;
+	loff_t			holepos = 0;
+	loff_t			datapos;
+	loff_t			ret;
+	unsigned int		pflags;
+	bool			all_zeroes = true;
+	int			error = 0;
+
+	error = xfile_stat(xf, &sb);
+	if (error)
+		return error;
+
+	printk(KERN_ALERT "xfile ino 0x%lx isize 0x%llx dump:", inode->i_ino,
+			sb.size);
+	pflags = memalloc_nofs_save();
+
+	while ((ret = vfs_llseek(xf->file, holepos, SEEK_DATA)) >= 0) {
+		datapos = rounddown_64(ret, PAGE_SIZE);
+		ret = vfs_llseek(xf->file, datapos, SEEK_HOLE);
+		if (ret < 0)
+			break;
+		holepos = min_t(loff_t, sb.size, roundup_64(ret, PAGE_SIZE));
+
+		while (datapos < holepos) {
+			struct page	*page = NULL;
+			void		*p, *kaddr;
+			u64		datalen = holepos - datapos;
+			unsigned int	pagepos;
+			unsigned int	pagelen;
+
+			cond_resched();
+
+			if (fatal_signal_pending(current)) {
+				error = -EINTR;
+				goto out_pflags;
+			}
+
+			pagelen = min_t(u64, datalen, PAGE_SIZE);
+
+			page = shmem_read_mapping_page_gfp(mapping,
+					datapos >> PAGE_SHIFT, __GFP_NOWARN);
+			if (IS_ERR(page)) {
+				error = PTR_ERR(page);
+				if (error == -EIO)
+					printk(KERN_ALERT "%.8llx: poisoned",
+							datapos);
+				else if (error != -ENOMEM)
+					goto out_pflags;
+
+				goto next_pgoff;
+			}
+
+			if (!PageUptodate(page))
+				goto next_page;
+
+			kaddr = kmap_local_page(page);
+			p = kaddr;
+
+			for (pagepos = 0; pagepos < pagelen; pagepos += 16) {
+				char prefix[16];
+				unsigned int linelen;
+
+				linelen = min_t(unsigned int, pagelen, 16);
+
+				if (!memchr_inv(p + pagepos, 0, linelen))
+					continue;
+
+				snprintf(prefix, 16, "%.8llx: ",
+						datapos + pagepos);
+
+				all_zeroes = false;
+				print_hex_dump(KERN_ALERT, prefix,
+						DUMP_PREFIX_NONE, 16, 1,
+						p + pagepos, linelen, true);
+			}
+			kunmap_local(kaddr);
+next_page:
+			put_page(page);
+next_pgoff:
+			datapos += PAGE_SIZE;
+		}
+	}
+	if (all_zeroes)
+		printk(KERN_ALERT "<all zeroes>");
+	if (ret != -ENXIO)
+		error = ret;
+out_pflags:
+	memalloc_nofs_restore(pflags);
+	return error;
+}
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index 0172bd9eeab0..b7f046016b1b 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -75,4 +75,6 @@ int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
 		struct xfile_page *xbuf);
 int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
 
+int xfile_dump(struct xfile *xf);
+
 #endif /* __XFS_SCRUB_XFILE_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/7] xfs: teach buftargs to maintain their own buffer hashtable
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/7] xfs: dump xfiles for debugging purposes Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 5/7] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Currently, cached buffers are indexed by per-AG hashtables.  This works
great for the data device, but won't work for in-memory btrees.  Make it
so that buftargs can index buffers too.  Introduce XFS_BSTATE_CACHED as
an explicit state flag for buffers that are cached in an rhashtable,
since we can't rely on b_pag being set for buffers that are cached but
not on behalf of an AG.  We'll soon be using the buffer cache for
xfiles.
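
The dispatch that this patch open-codes in xfs_buf_get_map and xfs_buf_rele
boils down to the following (an illustrative restatement, not the patch
itself):

/*
 * Pick the hash table and lock that index a buffer: the buftarg-wide
 * ones for self-cached targets, the per-AG ones otherwise.
 */
static inline void
xfs_buftarg_bufhash(
	struct xfs_buftarg	*btp,
	xfs_daddr_t		daddr,
	struct xfs_perag	**pagp,
	spinlock_t		**lockp,
	struct rhashtable	**hashp)
{
	if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED) {
		*pagp = NULL;
		*lockp = &btp->bt_hashlock;
		*hashp = &btp->bt_bufhash;
	} else {
		*pagp = xfs_perag_get(btp->bt_mount,
				xfs_daddr_to_agno(btp->bt_mount, daddr));
		*lockp = &(*pagp)->pag_buf_lock;
		*hashp = &(*pagp)->pag_buf_hash;
	}
}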

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf.c |  142 ++++++++++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_buf.h |    9 +++
 2 files changed, 113 insertions(+), 38 deletions(-)


diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2bea2c3f9ead..7dfc1db566fa 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -570,7 +570,7 @@ xfs_buf_find_lock(
 
 static inline int
 xfs_buf_lookup(
-	struct xfs_perag	*pag,
+	struct rhashtable	*bufhash,
 	struct xfs_buf_map	*map,
 	xfs_buf_flags_t		flags,
 	struct xfs_buf		**bpp)
@@ -579,7 +579,7 @@ xfs_buf_lookup(
 	int			error;
 
 	rcu_read_lock();
-	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
+	bp = rhashtable_lookup(bufhash, map, xfs_buf_hash_params);
 	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
 		rcu_read_unlock();
 		return -ENOENT;
@@ -605,6 +605,8 @@ static int
 xfs_buf_find_insert(
 	struct xfs_buftarg	*btp,
 	struct xfs_perag	*pag,
+	spinlock_t		*hashlock,
+	struct rhashtable	*bufhash,
 	struct xfs_buf_map	*cmap,
 	struct xfs_buf_map	*map,
 	int			nmaps,
@@ -632,18 +634,18 @@ xfs_buf_find_insert(
 			goto out_free_buf;
 	}
 
-	spin_lock(&pag->pag_buf_lock);
-	bp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash,
-			&new_bp->b_rhash_head, xfs_buf_hash_params);
+	spin_lock(hashlock);
+	bp = rhashtable_lookup_get_insert_fast(bufhash, &new_bp->b_rhash_head,
+			xfs_buf_hash_params);
 	if (IS_ERR(bp)) {
 		error = PTR_ERR(bp);
-		spin_unlock(&pag->pag_buf_lock);
+		spin_unlock(hashlock);
 		goto out_free_buf;
 	}
 	if (bp) {
 		/* found an existing buffer */
 		atomic_inc(&bp->b_hold);
-		spin_unlock(&pag->pag_buf_lock);
+		spin_unlock(hashlock);
 		error = xfs_buf_find_lock(bp, flags);
 		if (error)
 			xfs_buf_rele(bp);
@@ -654,14 +656,16 @@ xfs_buf_find_insert(
 
 	/* The new buffer keeps the perag reference until it is freed. */
 	new_bp->b_pag = pag;
-	spin_unlock(&pag->pag_buf_lock);
+	new_bp->b_state |= XFS_BSTATE_CACHED;
+	spin_unlock(hashlock);
 	*bpp = new_bp;
 	return 0;
 
 out_free_buf:
 	xfs_buf_free(new_bp);
 out_drop_pag:
-	xfs_perag_put(pag);
+	if (pag)
+		xfs_perag_put(pag);
 	return error;
 }
 
@@ -678,6 +682,8 @@ xfs_buf_get_map(
 	xfs_buf_flags_t		flags,
 	struct xfs_buf		**bpp)
 {
+	spinlock_t		*hashlock;
+	struct rhashtable	*bufhash;
 	struct xfs_perag	*pag;
 	struct xfs_buf		*bp = NULL;
 	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
@@ -693,10 +699,18 @@ xfs_buf_get_map(
 	if (error)
 		return error;
 
-	pag = xfs_perag_get(btp->bt_mount,
-			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
+	if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED) {
+		pag = NULL;
+		hashlock = &btp->bt_hashlock;
+		bufhash = &btp->bt_bufhash;
+	} else {
+		pag = xfs_perag_get(btp->bt_mount,
+				xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
+		hashlock = &pag->pag_buf_lock;
+		bufhash = &pag->pag_buf_hash;
+	}
 
-	error = xfs_buf_lookup(pag, &cmap, flags, &bp);
+	error = xfs_buf_lookup(bufhash, &cmap, flags, &bp);
 	if (error && error != -ENOENT)
 		goto out_put_perag;
 
@@ -708,13 +722,14 @@ xfs_buf_get_map(
 			goto out_put_perag;
 
 		/* xfs_buf_find_insert() consumes the perag reference. */
-		error = xfs_buf_find_insert(btp, pag, &cmap, map, nmaps,
-				flags, &bp);
+		error = xfs_buf_find_insert(btp, pag, hashlock, bufhash, &cmap,
+				map, nmaps, flags, &bp);
 		if (error)
 			return error;
 	} else {
 		XFS_STATS_INC(btp->bt_mount, xb_get_locked);
-		xfs_perag_put(pag);
+		if (pag)
+			xfs_perag_put(pag);
 	}
 
 	/* We do not hold a perag reference anymore. */
@@ -742,7 +757,8 @@ xfs_buf_get_map(
 	return 0;
 
 out_put_perag:
-	xfs_perag_put(pag);
+	if (pag)
+		xfs_perag_put(pag);
 	return error;
 }
 
@@ -996,12 +1012,14 @@ xfs_buf_rele(
 	struct xfs_buf		*bp)
 {
 	struct xfs_perag	*pag = bp->b_pag;
+	spinlock_t		*hashlock;
+	struct rhashtable	*bufhash;
 	bool			release;
 	bool			freebuf = false;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
 
-	if (!pag) {
+	if (!(bp->b_state & XFS_BSTATE_CACHED)) {
 		ASSERT(list_empty(&bp->b_lru));
 		if (atomic_dec_and_test(&bp->b_hold)) {
 			xfs_buf_ioacct_dec(bp);
@@ -1012,6 +1030,14 @@ xfs_buf_rele(
 
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 
+	if (bp->b_target->bt_flags & XFS_BUFTARG_SELF_CACHED) {
+		hashlock = &bp->b_target->bt_hashlock;
+		bufhash = &bp->b_target->bt_bufhash;
+	} else {
+		hashlock = &pag->pag_buf_lock;
+		bufhash = &pag->pag_buf_hash;
+	}
+
 	/*
 	 * We grab the b_lock here first to serialise racing xfs_buf_rele()
 	 * calls. The pag_buf_lock being taken on the last reference only
@@ -1023,7 +1049,7 @@ xfs_buf_rele(
 	 * leading to a use-after-free scenario.
 	 */
 	spin_lock(&bp->b_lock);
-	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
+	release = atomic_dec_and_lock(&bp->b_hold, hashlock);
 	if (!release) {
 		/*
 		 * Drop the in-flight state if the buffer is already on the LRU
@@ -1048,7 +1074,7 @@ xfs_buf_rele(
 			bp->b_state &= ~XFS_BSTATE_DISPOSE;
 			atomic_inc(&bp->b_hold);
 		}
-		spin_unlock(&pag->pag_buf_lock);
+		spin_unlock(hashlock);
 	} else {
 		/*
 		 * most of the time buffers will already be removed from the
@@ -1063,10 +1089,13 @@ xfs_buf_rele(
 		}
 
 		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
-		rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
-				       xfs_buf_hash_params);
-		spin_unlock(&pag->pag_buf_lock);
-		xfs_perag_put(pag);
+		rhashtable_remove_fast(bufhash, &bp->b_rhash_head,
+				xfs_buf_hash_params);
+		spin_unlock(hashlock);
+		if (pag)
+			xfs_perag_put(pag);
+		bp->b_state &= ~XFS_BSTATE_CACHED;
+		bp->b_pag = NULL;
 		freebuf = true;
 	}
 
@@ -1946,6 +1975,8 @@ xfs_free_buftarg(
 	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
 	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
+	if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED)
+		rhashtable_destroy(&btp->bt_bufhash);
 
 	blkdev_issue_flush(btp->bt_bdev);
 	invalidate_bdev(btp->bt_bdev);
@@ -1990,24 +2021,20 @@ xfs_setsize_buftarg_early(
 	return xfs_setsize_buftarg(btp, bdev_logical_block_size(bdev));
 }
 
-struct xfs_buftarg *
-xfs_alloc_buftarg(
+static struct xfs_buftarg *
+__xfs_alloc_buftarg(
 	struct xfs_mount	*mp,
-	struct block_device	*bdev)
+	unsigned int		flags)
 {
-	xfs_buftarg_t		*btp;
-	const struct dax_holder_operations *ops = NULL;
+	struct xfs_buftarg	*btp;
+	int			error;
 
-#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
-	ops = &xfs_dax_holder_operations;
-#endif
 	btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
+	if (!btp)
+		return NULL;
 
 	btp->bt_mount = mp;
-	btp->bt_dev =  bdev->bd_dev;
-	btp->bt_bdev = bdev;
-	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
-					    mp, ops);
+	btp->bt_flags = flags;
 
 	/*
 	 * Buffer IO error rate limiting. Limit it to no more than 10 messages
@@ -2016,9 +2043,6 @@ xfs_alloc_buftarg(
 	ratelimit_state_init(&btp->bt_ioerror_rl, 30 * HZ,
 			     DEFAULT_RATELIMIT_BURST);
 
-	if (xfs_setsize_buftarg_early(btp, bdev))
-		goto error_free;
-
 	if (list_lru_init(&btp->bt_lru))
 		goto error_free;
 
@@ -2032,8 +2056,18 @@ xfs_alloc_buftarg(
 	if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s",
 			      mp->m_super->s_id))
 		goto error_pcpu;
+
+	if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED) {
+		spin_lock_init(&btp->bt_hashlock);
+		error = rhashtable_init(&btp->bt_bufhash, &xfs_buf_hash_params);
+		if (error)
+			goto error_shrinker;
+	}
+
 	return btp;
 
+error_shrinker:
+	unregister_shrinker(&btp->bt_shrinker);
 error_pcpu:
 	percpu_counter_destroy(&btp->bt_io_count);
 error_lru:
@@ -2043,6 +2077,38 @@ xfs_alloc_buftarg(
 	return NULL;
 }
 
+/* Allocate a buffer cache target for a persistent block device. */
+struct xfs_buftarg *
+xfs_alloc_buftarg(
+	struct xfs_mount	*mp,
+	struct block_device	*bdev)
+{
+	struct xfs_buftarg	*btp;
+	const struct dax_holder_operations *ops = NULL;
+
+#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
+	ops = &xfs_dax_holder_operations;
+#endif
+
+	btp = __xfs_alloc_buftarg(mp, 0);
+	if (!btp)
+		return NULL;
+
+	btp->bt_dev =  bdev->bd_dev;
+	btp->bt_bdev = bdev;
+	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
+					    mp, ops);
+
+	if (xfs_setsize_buftarg_early(btp, bdev))
+		goto error_free;
+
+	return btp;
+
+error_free:
+	xfs_free_buftarg(btp);
+	return NULL;
+}
+
 /*
  * Cancel a delayed write list.
  *
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 467ddb2e2f0d..d7bf7f657e99 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -82,6 +82,7 @@ typedef unsigned int xfs_buf_flags_t;
  */
 #define XFS_BSTATE_DISPOSE	 (1 << 0)	/* buffer being discarded */
 #define XFS_BSTATE_IN_FLIGHT	 (1 << 1)	/* I/O in flight */
+#define XFS_BSTATE_CACHED	 (1 << 2)	/* cached buffer */
 
 /*
  * The xfs_buftarg contains 2 notions of "sector size" -
@@ -102,11 +103,16 @@ typedef struct xfs_buftarg {
 	struct dax_device	*bt_daxdev;
 	u64			bt_dax_part_off;
 	struct xfs_mount	*bt_mount;
+	unsigned int		bt_flags;
 	unsigned int		bt_meta_sectorsize;
 	size_t			bt_meta_sectormask;
 	size_t			bt_logical_sectorsize;
 	size_t			bt_logical_sectormask;
 
+	/* self-caching buftargs */
+	spinlock_t		bt_hashlock;
+	struct rhashtable	bt_bufhash;
+
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_lru		bt_lru;
@@ -115,6 +121,9 @@ typedef struct xfs_buftarg {
 	struct ratelimit_state	bt_ioerror_rl;
 } xfs_buftarg_t;
 
+/* the xfs_buftarg indexes buffers via bt_buf_hash */
+#define XFS_BUFTARG_SELF_CACHED	(1U << 0)
+
 #define XB_PAGES	2
 
 struct xfs_buf_map {


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/7] xfs: support in-memory buffer cache targets
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 6/7] xfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/7] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 7/7] xfs: connect in-memory btrees to xfiles Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Allow the buffer cache to target in-memory files by connecting it to
xfiles.
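
A rough usage sketch (calling conventions assumed, not taken from the patch):
once an in-memory buftarg exists, the ordinary buffer cache APIs read and
write the backing xfile instead of a block device.

/*
 * Sketch: write one page-sized buffer through an in-memory buftarg.
 * The data lands in the backing xfile via xfile_obj_store().
 */
static int
demo_membuf_write(
	struct xfs_buftarg	*btp)	/* from xfs_alloc_memory_buftarg() */
{
	struct xfs_buf		*bp;
	int			error;

	error = xfs_buf_get(btp, 0, BTOBB(PAGE_SIZE), &bp);
	if (error)
		return error;

	memset(bp->b_addr, 0x58, PAGE_SIZE);
	error = xfs_bwrite(bp);
	xfs_buf_relse(bp);
	return error;
}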

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig           |    4 ++
 fs/xfs/scrub/xfile.h     |   15 +++++++++
 fs/xfs/xfs_aops.c        |    5 ++-
 fs/xfs/xfs_bmap_util.c   |    8 ++---
 fs/xfs/xfs_buf.c         |   80 +++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_buf.h         |   71 +++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_discard.c     |    8 ++---
 fs/xfs/xfs_file.c        |    6 ++-
 fs/xfs/xfs_ioctl.c       |    3 +-
 fs/xfs/xfs_iomap.c       |    4 +-
 fs/xfs/xfs_log.c         |    4 +-
 fs/xfs/xfs_log_cil.c     |    3 +-
 fs/xfs/xfs_log_recover.c |    3 +-
 fs/xfs/xfs_super.c       |    4 +-
 14 files changed, 188 insertions(+), 30 deletions(-)


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 54806c2b80d4..2373324be997 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -101,6 +101,9 @@ config XFS_LIVE_HOOKS
 	bool
 	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
 
+config XFS_IN_MEMORY_FILE
+	bool
+
 config XFS_ONLINE_SCRUB
 	bool "XFS online metadata check support"
 	default n
@@ -108,6 +111,7 @@ config XFS_ONLINE_SCRUB
 	depends on TMPFS && SHMEM
 	select XFS_LIVE_HOOKS
 	select XFS_DRAIN_INTENTS
+	select XFS_IN_MEMORY_FILE
 	help
 	  If you say Y here you will be able to check metadata on a
 	  mounted XFS filesystem.  This feature is intended to reduce
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index b7f046016b1b..99b6db838612 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -6,6 +6,8 @@
 #ifndef __XFS_SCRUB_XFILE_H__
 #define __XFS_SCRUB_XFILE_H__
 
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+
 struct xfile_page {
 	struct page		*page;
 	void			*fsdata;
@@ -76,5 +78,18 @@ int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
 int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
 
 int xfile_dump(struct xfile *xf);
+#else
+static inline int
+xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t offset)
+{
+	return -EIO;
+}
+
+static inline int
+xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t offset)
+{
+	return -EIO;
+}
+#endif /* CONFIG_XFS_IN_MEMORY_FILE */
 
 #endif /* __XFS_SCRUB_XFILE_H__ */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 41734202796f..c3a9df0c0eab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -562,7 +562,10 @@ xfs_iomap_swapfile_activate(
 	struct file			*swap_file,
 	sector_t			*span)
 {
-	sis->bdev = xfs_inode_buftarg(XFS_I(file_inode(swap_file)))->bt_bdev;
+	struct xfs_inode		*ip = XFS_I(file_inode(swap_file));
+	struct xfs_buftarg		*btp = xfs_inode_buftarg(ip);
+
+	sis->bdev = xfs_buftarg_bdev(btp);
 	return iomap_swapfile_activate(sis, swap_file, span,
 			&xfs_read_iomap_ops);
 }
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 867645b74d88..e094932869f6 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -62,10 +62,10 @@ xfs_zero_extent(
 	xfs_daddr_t		sector = xfs_fsb_to_db(ip, start_fsb);
 	sector_t		block = XFS_BB_TO_FSBT(mp, sector);
 
-	return blkdev_issue_zeroout(target->bt_bdev,
-		block << (mp->m_super->s_blocksize_bits - 9),
-		count_fsb << (mp->m_super->s_blocksize_bits - 9),
-		GFP_NOFS, 0);
+	return xfs_buftarg_zeroout(target,
+			block << (mp->m_super->s_blocksize_bits - 9),
+			count_fsb << (mp->m_super->s_blocksize_bits - 9),
+			GFP_NOFS, 0);
 }
 
 #ifdef CONFIG_XFS_RT
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7dfc1db566fa..2ec8d39def9c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -21,6 +21,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_ag.h"
+#include "scrub/xfile.h"
 
 struct kmem_cache *xfs_buf_cache;
 
@@ -1554,6 +1555,36 @@ xfs_buf_ioapply_map(
 
 }
 
+static inline void
+xfs_buf_ioapply_in_memory(
+	struct xfs_buf		*bp)
+{
+	struct xfile		*xfile = bp->b_target->bt_xfile;
+	loff_t			pos = BBTOB(xfs_buf_daddr(bp));
+	size_t			size = BBTOB(bp->b_length);
+	int			error;
+
+	atomic_inc(&bp->b_io_remaining);
+
+	if (bp->b_map_count > 1) {
+		/* We don't need or support multi-map buffers. */
+		ASSERT(0);
+		error = -EIO;
+	} else if (bp->b_flags & XBF_WRITE) {
+		error = xfile_obj_store(xfile, bp->b_addr, size, pos);
+	} else {
+		error = xfile_obj_load(xfile, bp->b_addr, size, pos);
+	}
+	if (error)
+		cmpxchg(&bp->b_io_error, 0, error);
+
+	if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
+		invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
+
+	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+		xfs_buf_ioend(bp);
+}
+
 STATIC void
 _xfs_buf_ioapply(
 	struct xfs_buf	*bp)
@@ -1611,6 +1642,11 @@ _xfs_buf_ioapply(
 	/* we only use the buffer cache for meta-data */
 	op |= REQ_META;
 
+	if (bp->b_target->bt_flags & XFS_BUFTARG_IN_MEMORY) {
+		xfs_buf_ioapply_in_memory(bp);
+		return;
+	}
+
 	/*
 	 * Walk all the vectors issuing IO on them. Set up the initial offset
 	 * into the buffer and the desired IO size before we start -
@@ -1978,9 +2014,11 @@ xfs_free_buftarg(
 	if (btp->bt_flags & XFS_BUFTARG_SELF_CACHED)
 		rhashtable_destroy(&btp->bt_bufhash);
 
-	blkdev_issue_flush(btp->bt_bdev);
-	invalidate_bdev(btp->bt_bdev);
-	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
+	if (!(btp->bt_flags & XFS_BUFTARG_IN_MEMORY)) {
+		blkdev_issue_flush(btp->bt_bdev);
+		invalidate_bdev(btp->bt_bdev);
+		fs_put_dax(btp->bt_daxdev, btp->bt_mount);
+	}
 
 	kmem_free(btp);
 }
@@ -2024,12 +2062,13 @@ xfs_setsize_buftarg_early(
 static struct xfs_buftarg *
 __xfs_alloc_buftarg(
 	struct xfs_mount	*mp,
-	unsigned int		flags)
+	unsigned int		flags,
+	xfs_km_flags_t		km_flags)
 {
 	struct xfs_buftarg	*btp;
 	int			error;
 
-	btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
+	btp = kmem_zalloc(sizeof(*btp), KM_NOFS | km_flags);
 	if (!btp)
 		return NULL;
 
@@ -2090,7 +2129,7 @@ xfs_alloc_buftarg(
 	ops = &xfs_dax_holder_operations;
 #endif
 
-	btp = __xfs_alloc_buftarg(mp, 0);
+	btp = __xfs_alloc_buftarg(mp, 0, 0);
 	if (!btp)
 		return NULL;
 
@@ -2109,6 +2148,35 @@ xfs_alloc_buftarg(
 	return NULL;
 }
 
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+/* Allocate a buffer cache target for a memory-backed file. */
+int
+xfs_alloc_memory_buftarg(
+	struct xfs_mount	*mp,
+	struct xfile		*xfile,
+	struct xfs_buftarg	**btpp)
+{
+	struct xfs_buftarg	*btp;
+
+	btp = __xfs_alloc_buftarg(mp,
+			XFS_BUFTARG_SELF_CACHED | XFS_BUFTARG_IN_MEMORY,
+			KM_MAYFAIL);
+	if (!btp)
+		return -ENOMEM;
+
+	btp->bt_xfile = xfile;
+	btp->bt_dev = (dev_t)-1U;
+
+	btp->bt_meta_sectorsize = SECTOR_SIZE;
+	btp->bt_meta_sectormask = SECTOR_SIZE - 1;
+	btp->bt_logical_sectorsize = SECTOR_SIZE;
+	btp->bt_logical_sectormask = SECTOR_SIZE - 1;
+
+	*btpp = btp;
+	return 0;
+}
+#endif /* CONFIG_XFS_IN_MEMORY_FILE */
+
 /*
  * Cancel a delayed write list.
  *
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d7bf7f657e99..dcae77dabdcc 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -21,6 +21,7 @@ extern struct kmem_cache *xfs_buf_cache;
  *	Base types
  */
 struct xfs_buf;
+struct xfile;
 
 #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
 
@@ -99,7 +100,10 @@ typedef unsigned int xfs_buf_flags_t;
  */
 typedef struct xfs_buftarg {
 	dev_t			bt_dev;
-	struct block_device	*bt_bdev;
+	union {
+		struct block_device	*bt_bdev;
+		struct xfile		*bt_xfile;
+	};
 	struct dax_device	*bt_daxdev;
 	u64			bt_dax_part_off;
 	struct xfs_mount	*bt_mount;
@@ -124,6 +128,20 @@ typedef struct xfs_buftarg {
 /* the xfs_buftarg indexes buffers via bt_buf_hash */
 #define XFS_BUFTARG_SELF_CACHED	(1U << 0)
 
+/* in-memory buftarg via bt_xfile */
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+# define XFS_BUFTARG_IN_MEMORY	(1U << 1)
+#else
+# define XFS_BUFTARG_IN_MEMORY	(0)
+#endif
+
+static inline bool
+xfs_buftarg_in_memory(
+	struct xfs_buftarg	*btp)
+{
+	return btp->bt_flags & XFS_BUFTARG_IN_MEMORY;
+}
+
 #define XB_PAGES	2
 
 struct xfs_buf_map {
@@ -372,13 +390,60 @@ xfs_buf_update_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
  */
 struct xfs_buftarg *xfs_alloc_buftarg(struct xfs_mount *mp,
 		struct block_device *bdev);
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+int xfs_alloc_memory_buftarg(struct xfs_mount *mp, struct xfile *xfile,
+		struct xfs_buftarg **btpp);
+#endif
 extern void xfs_free_buftarg(struct xfs_buftarg *);
 extern void xfs_buftarg_wait(struct xfs_buftarg *);
 extern void xfs_buftarg_drain(struct xfs_buftarg *);
 extern int xfs_setsize_buftarg(struct xfs_buftarg *, unsigned int);
 
-#define xfs_getsize_buftarg(buftarg)	block_size((buftarg)->bt_bdev)
-#define xfs_readonly_buftarg(buftarg)	bdev_read_only((buftarg)->bt_bdev)
+static inline struct block_device *
+xfs_buftarg_bdev(struct xfs_buftarg *btp)
+{
+	if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+		return NULL;
+	return btp->bt_bdev;
+}
+
+static inline unsigned int
+xfs_getsize_buftarg(struct xfs_buftarg *btp)
+{
+	if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+		return SECTOR_SIZE;
+	return block_size(btp->bt_bdev);
+}
+
+static inline bool
+xfs_readonly_buftarg(struct xfs_buftarg *btp)
+{
+	if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+		return false;
+	return bdev_read_only(btp->bt_bdev);
+}
+
+static inline int
+xfs_buftarg_flush(struct xfs_buftarg *btp)
+{
+	if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+		return 0;
+	return blkdev_issue_flush(btp->bt_bdev);
+}
+
+static inline int
+xfs_buftarg_zeroout(
+	struct xfs_buftarg	*btp,
+	sector_t		sector,
+	sector_t		nr_sects,
+	gfp_t			gfp_mask,
+	unsigned		flags)
+{
+	if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+		return -EOPNOTSUPP;
+	return blkdev_issue_zeroout(btp->bt_bdev, sector, nr_sects, gfp_mask,
+			flags);
+}
 
 int xfs_buf_reverify(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
 bool xfs_verify_magic(struct xfs_buf *bp, __be32 dmagic);
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 3fa6b0ab9ed6..44658cc7d3f2 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -29,7 +29,7 @@ xfs_trim_extents(
 	xfs_daddr_t		minlen,
 	uint64_t		*blocks_trimmed)
 {
-	struct block_device	*bdev = mp->m_ddev_targp->bt_bdev;
+	struct block_device	*bdev = xfs_buftarg_bdev(mp->m_ddev_targp);
 	struct xfs_btree_cur	*cur;
 	struct xfs_buf		*agbp;
 	struct xfs_agf		*agf;
@@ -154,8 +154,8 @@ xfs_ioc_trim(
 	struct xfs_mount		*mp,
 	struct fstrim_range __user	*urange)
 {
-	unsigned int		granularity =
-		bdev_discard_granularity(mp->m_ddev_targp->bt_bdev);
+	struct block_device	*bdev = xfs_buftarg_bdev(mp->m_ddev_targp);
+	unsigned int		granularity = bdev_discard_granularity(bdev);
 	struct fstrim_range	range;
 	xfs_daddr_t		start, end, minlen;
 	xfs_agnumber_t		start_agno, end_agno, agno;
@@ -164,7 +164,7 @@ xfs_ioc_trim(
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
-	if (!bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev))
+	if (!bdev_max_discard_sectors(bdev))
 		return -EOPNOTSUPP;
 
 	/*
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 595a5bcf46b9..c4bdadd8fa71 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -164,9 +164,9 @@ xfs_file_fsync(
 	 * inode size in case of an extending write.
 	 */
 	if (XFS_IS_REALTIME_INODE(ip))
-		error = blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev);
+		error = xfs_buftarg_flush(mp->m_rtdev_targp);
 	else if (mp->m_logdev_targp != mp->m_ddev_targp)
-		error = blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
+		error = xfs_buftarg_flush(mp->m_ddev_targp);
 
 	/*
 	 * Any inode that has dirty modifications in the log is pinned.  The
@@ -189,7 +189,7 @@ xfs_file_fsync(
 	 */
 	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
 	    mp->m_logdev_targp == mp->m_ddev_targp) {
-		err2 = blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
+		err2 = xfs_buftarg_flush(mp->m_ddev_targp);
 		if (err2 && !error)
 			error = err2;
 	}
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 020111f0f2a2..4b2a02a08dfa 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1762,6 +1762,7 @@ xfs_ioc_setlabel(
 	char			__user *newlabel)
 {
 	struct xfs_sb		*sbp = &mp->m_sb;
+	struct block_device	*bdev = xfs_buftarg_bdev(mp->m_ddev_targp);
 	char			label[XFSLABEL_MAX + 1];
 	size_t			len;
 	int			error;
@@ -1808,7 +1809,7 @@ xfs_ioc_setlabel(
 	error = xfs_update_secondary_sbs(mp);
 	mutex_unlock(&mp->m_growlock);
 
-	invalidate_bdev(mp->m_ddev_targp->bt_bdev);
+	invalidate_bdev(bdev);
 
 out:
 	mnt_drop_write_file(filp);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c2ba03281daf..99a7c271c353 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -129,7 +129,7 @@ xfs_bmbt_to_iomap(
 	if (mapping_flags & IOMAP_DAX)
 		iomap->dax_dev = target->bt_daxdev;
 	else
-		iomap->bdev = target->bt_bdev;
+		iomap->bdev = xfs_buftarg_bdev(target);
 	iomap->flags = iomap_flags;
 
 	if (xfs_ipincount(ip) &&
@@ -154,7 +154,7 @@ xfs_hole_to_iomap(
 	iomap->type = IOMAP_HOLE;
 	iomap->offset = XFS_FSB_TO_B(ip->i_mount, offset_fsb);
 	iomap->length = XFS_FSB_TO_B(ip->i_mount, end_fsb - offset_fsb);
-	iomap->bdev = target->bt_bdev;
+	iomap->bdev = xfs_buftarg_bdev(target);
 	iomap->dax_dev = target->bt_daxdev;
 }
 
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fc61cc024023..b32a8e57f576 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1938,7 +1938,7 @@ xlog_write_iclog(
 	 * writeback throttle from throttling log writes behind background
 	 * metadata writeback and causing priority inversions.
 	 */
-	bio_init(&iclog->ic_bio, log->l_targ->bt_bdev, iclog->ic_bvec,
+	bio_init(&iclog->ic_bio, xfs_buftarg_bdev(log->l_targ), iclog->ic_bvec,
 		 howmany(count, PAGE_SIZE),
 		 REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE);
 	iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno;
@@ -1959,7 +1959,7 @@ xlog_write_iclog(
 		 * avoid shutdown re-entering this path and erroring out again.
 		 */
 		if (log->l_targ != log->l_mp->m_ddev_targp &&
-		    blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev)) {
+		    xfs_buftarg_flush(log->l_mp->m_ddev_targp)) {
 			xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
 			return;
 		}
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index eccbfb99e894..12cd2874048f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -742,7 +742,8 @@ xlog_discard_busy_extents(
 		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
 					 busyp->length);
 
-		error = __blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+		error = __blkdev_issue_discard(
+				xfs_buftarg_bdev(mp->m_ddev_targp),
 				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
 				XFS_FSB_TO_BB(mp, busyp->length),
 				GFP_NOFS, &bio);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 322eb2ee6c55..6b1f37bc3e95 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -137,7 +137,8 @@ xlog_do_io(
 	nbblks = round_up(nbblks, log->l_sectBBsize);
 	ASSERT(nbblks > 0);
 
-	error = xfs_rw_bdev(log->l_targ->bt_bdev, log->l_logBBstart + blk_no,
+	error = xfs_rw_bdev(xfs_buftarg_bdev(log->l_targ),
+			log->l_logBBstart + blk_no,
 			BBTOB(nbblks), data, op);
 	if (error && !xlog_is_shutdown(log)) {
 		xfs_alert(log->l_mp,
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 020ff2d93f23..8841947bdce7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -397,13 +397,13 @@ xfs_close_devices(
 	struct xfs_mount	*mp)
 {
 	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
-		struct block_device *logdev = mp->m_logdev_targp->bt_bdev;
+		struct block_device *logdev = xfs_buftarg_bdev(mp->m_logdev_targp);
 
 		xfs_free_buftarg(mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
 	}
 	if (mp->m_rtdev_targp) {
-		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
+		struct block_device *rtdev = xfs_buftarg_bdev(mp->m_rtdev_targp);
 
 		xfs_free_buftarg(mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/7] xfs: consolidate btree block freeing tracepoints
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 3/7] xfs: support in-memory buffer cache targets Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 7/7] xfs: connect in-memory btrees to xfiles Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Don't waste tracepoint segment memory on per-btree block freeing
tracepoints when we can do it from the generic btree code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c          |    2 ++
 fs/xfs/libxfs/xfs_refcount_btree.c |    2 --
 fs/xfs/libxfs/xfs_rmap_btree.c     |    2 --
 fs/xfs/xfs_trace.h                 |   32 ++++++++++++++++++++++++++++++--
 4 files changed, 32 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 02c237984fa6..7fab2df1046f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -414,6 +414,8 @@ xfs_btree_free_block(
 {
 	int			error;
 
+	trace_xfs_btree_free_block(cur, bp);
+
 	error = cur->bc_ops->free_block(cur, bp);
 	if (!error) {
 		xfs_trans_binval(cur->bc_tp, bp);
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 1bf991bf452f..b1d1f3bb159f 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -108,8 +108,6 @@ xfs_refcountbt_free_block(
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
 	int			error;
 
-	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
-			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
 	error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 2c90a05ca814..1421fcfcad64 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -125,8 +125,6 @@ xfs_rmapbt_free_block(
 	int			error;
 
 	bno = xfs_daddr_to_agbno(cur->bc_mp, xfs_buf_daddr(bp));
-	trace_xfs_rmapbt_free_block(cur->bc_mp, pag->pag_agno,
-			bno, 1);
 	be32_add_cpu(&agf->agf_rmap_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
 	error = xfs_alloc_put_freelist(pag, cur->bc_tp, agbp, NULL, bno, 1);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 145808b733ce..50f4d4410976 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2515,6 +2515,36 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
 DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
 DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
 
+TRACE_EVENT(xfs_btree_free_block,
+	TP_PROTO(struct xfs_btree_cur *cur, struct xfs_buf *bp),
+	TP_ARGS(cur, bp),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_ino_t, ino)
+		__field(xfs_btnum_t, btnum)
+		__field(xfs_agblock_t, agbno)
+	),
+	TP_fast_assign(
+		__entry->dev = cur->bc_mp->m_super->s_dev;
+		__entry->agno = xfs_daddr_to_agno(cur->bc_mp,
+							xfs_buf_daddr(bp));
+		if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
+			__entry->ino = cur->bc_ino.ip->i_ino;
+		else
+			__entry->ino = 0;
+		__entry->btnum = cur->bc_btnum;
+		__entry->agbno = xfs_daddr_to_agbno(cur->bc_mp,
+							xfs_buf_daddr(bp));
+	),
+	TP_printk("dev %d:%d btree %s agno 0x%x ino 0x%llx agbno 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+		  __entry->agno,
+		  __entry->ino,
+		  __entry->agbno)
+);
+
 /* deferred ops */
 struct xfs_defer_pending;
 
@@ -2869,7 +2899,6 @@ DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_defer);
 DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_deferred);
 
 DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
-DEFINE_BUSY_EVENT(xfs_rmapbt_free_block);
 DEFINE_RMAPBT_EVENT(xfs_rmap_update);
 DEFINE_RMAPBT_EVENT(xfs_rmap_insert);
 DEFINE_RMAPBT_EVENT(xfs_rmap_delete);
@@ -3228,7 +3257,6 @@ DEFINE_EVENT(xfs_refcount_triple_extent_class, name, \
 
 /* refcount btree tracepoints */
 DEFINE_BUSY_EVENT(xfs_refcountbt_alloc_block);
-DEFINE_BUSY_EVENT(xfs_refcountbt_free_block);
 DEFINE_AG_BTREE_LOOKUP_EVENT(xfs_refcount_lookup);
 DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_get);
 DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_update);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/7] xfs: consolidate btree block allocation tracepoints
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/7] xfs: dump xfiles for debugging purposes Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/7] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 6/7] xfs: support in-memory btrees Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Don't waste tracepoint segment memory on per-btree block allocation
tracepoints when we can do it from the generic btree code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c          |   20 ++++++++++++---
 fs/xfs/libxfs/xfs_refcount_btree.c |    2 -
 fs/xfs/libxfs/xfs_rmap_btree.c     |    2 -
 fs/xfs/xfs_trace.h                 |   49 +++++++++++++++++++++++++++++++++++-
 4 files changed, 64 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 7fab2df1046f..f577c0463c6e 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2693,6 +2693,20 @@ xfs_btree_rshift(
 	return error;
 }
 
+static inline int
+xfs_btree_alloc_block(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*hint_block,
+	union xfs_btree_ptr		*new_block,
+	int				*stat)
+{
+	int				error;
+
+	error = cur->bc_ops->alloc_block(cur, hint_block, new_block, stat);
+	trace_xfs_btree_alloc_block(cur, new_block, *stat, error);
+	return error;
+}
+
 /*
  * Split cur/level block in half.
  * Return new block number and the key to its first
@@ -2736,7 +2750,7 @@ __xfs_btree_split(
 	xfs_btree_buf_to_ptr(cur, lbp, &lptr);
 
 	/* Allocate the new block. If we can't do it, we're toast. Give up. */
-	error = cur->bc_ops->alloc_block(cur, &lptr, &rptr, stat);
+	error = xfs_btree_alloc_block(cur, &lptr, &rptr, stat);
 	if (error)
 		goto error0;
 	if (*stat == 0)
@@ -3002,7 +3016,7 @@ xfs_btree_new_iroot(
 	pp = xfs_btree_ptr_addr(cur, 1, block);
 
 	/* Allocate the new block. If we can't do it, we're toast. Give up. */
-	error = cur->bc_ops->alloc_block(cur, pp, &nptr, stat);
+	error = xfs_btree_alloc_block(cur, pp, &nptr, stat);
 	if (error)
 		goto error0;
 	if (*stat == 0)
@@ -3102,7 +3116,7 @@ xfs_btree_new_root(
 	cur->bc_ops->init_ptr_from_cur(cur, &rptr);
 
 	/* Allocate the new block. If we can't do it, we're toast. Give up. */
-	error = cur->bc_ops->alloc_block(cur, &rptr, &lptr, stat);
+	error = xfs_btree_alloc_block(cur, &rptr, &lptr, stat);
 	if (error)
 		goto error0;
 	if (*stat == 0)
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index b1d1f3bb159f..b75005684aa2 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -77,8 +77,6 @@ xfs_refcountbt_alloc_block(
 	error = xfs_alloc_vextent(&args);
 	if (error)
 		goto out_error;
-	trace_xfs_refcountbt_alloc_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
-			args.agbno, 1);
 	if (args.fsbno == NULLFSBLOCK) {
 		*stat = 0;
 		return 0;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 1421fcfcad64..5583dbe43bb5 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -94,8 +94,6 @@ xfs_rmapbt_alloc_block(
 				       &bno, 1);
 	if (error)
 		return error;
-
-	trace_xfs_rmapbt_alloc_block(cur->bc_mp, pag->pag_agno, bno, 1);
 	if (bno == NULLAGBLOCK) {
 		*stat = 0;
 		return 0;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 50f4d4410976..d86dd34127f2 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2515,6 +2515,53 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
 DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
 DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
 
+TRACE_EVENT(xfs_btree_alloc_block,
+	TP_PROTO(struct xfs_btree_cur *cur, union xfs_btree_ptr *ptr, int stat,
+		 int error),
+	TP_ARGS(cur, ptr, stat, error),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_ino_t, ino)
+		__field(xfs_btnum_t, btnum)
+		__field(int, error)
+		__field(xfs_agblock_t, agbno)
+	),
+	TP_fast_assign(
+		__entry->dev = cur->bc_mp->m_super->s_dev;
+		if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) {
+			__entry->agno = 0;
+			__entry->ino = cur->bc_ino.ip->i_ino;
+		} else {
+			__entry->agno = cur->bc_ag.pag->pag_agno;
+			__entry->ino = 0;
+		}
+		__entry->btnum = cur->bc_btnum;
+		__entry->error = error;
+		if (!error && stat) {
+			if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+				xfs_fsblock_t	fsb = be64_to_cpu(ptr->l);
+
+				__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp,
+								fsb);
+				__entry->agbno = XFS_FSB_TO_AGBNO(cur->bc_mp,
+								fsb);
+			} else {
+				__entry->agbno = be32_to_cpu(ptr->s);
+			}
+		} else {
+			__entry->agbno = NULLAGBLOCK;
+		}
+	),
+	TP_printk("dev %d:%d btree %s agno 0x%x ino 0x%llx agbno 0x%x error %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+		  __entry->agno,
+		  __entry->ino,
+		  __entry->agbno,
+		  __entry->error)
+);
+
 TRACE_EVENT(xfs_btree_free_block,
 	TP_PROTO(struct xfs_btree_cur *cur, struct xfs_buf *bp),
 	TP_ARGS(cur, bp),
@@ -2898,7 +2945,6 @@ DEFINE_EVENT(xfs_rmapbt_class, name, \
 DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_defer);
 DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_deferred);
 
-DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
 DEFINE_RMAPBT_EVENT(xfs_rmap_update);
 DEFINE_RMAPBT_EVENT(xfs_rmap_insert);
 DEFINE_RMAPBT_EVENT(xfs_rmap_delete);
@@ -3256,7 +3302,6 @@ DEFINE_EVENT(xfs_refcount_triple_extent_class, name, \
 	TP_ARGS(mp, agno, i1, i2, i3))
 
 /* refcount btree tracepoints */
-DEFINE_BUSY_EVENT(xfs_refcountbt_alloc_block);
 DEFINE_AG_BTREE_LOOKUP_EVENT(xfs_refcount_lookup);
 DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_get);
 DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_update);



* [PATCH 6/7] xfs: support in-memory btrees
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 5/7] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/7] xfs: support in-memory buffer cache targets Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Adapt the generic btree cursor code to be able to create a btree whose
buffers come from a (presumably in-memory) buftarg with a header block
that's specific to in-memory btrees.  We'll connect this to other parts
of online scrub in the next patches.

Note that in-memory btrees always have a block size matching the system
memory page size for efficiency reasons.
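
To make the block size note concrete, here is a minimal sketch of the
page-to-sector arithmetic, assuming 4096-byte pages and 512-byte basic
blocks (BBSHIFT == 9).  The EX_* and ex_* names are placeholders for this
example only; the real helpers (xfo_to_daddr and friends) are added to
fs/xfs/scrub/xfile.h in the diff below:

/* Sketch only; assumes PAGE_SHIFT == 12 and BBSHIFT == 9. */
#define EX_PAGE_SHIFT	12		/* 4096-byte pages */
#define EX_BBSHIFT	9		/* 512-byte basic blocks */
#define EX_XFB_SHIFT	(EX_PAGE_SHIFT - EX_BBSHIFT)

/* One in-memory btree block == one page == 8 sectors. */
static inline unsigned long long ex_xfo_to_daddr(unsigned long long xfoff)
{
	return xfoff << EX_XFB_SHIFT;	/* e.g. block 3 -> sector 24 */
}

static inline unsigned int ex_xfbtree_bbsize(void)
{
	return ex_xfo_to_daddr(1);	/* 8 sectors per btree block */
}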

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig                |    4 
 fs/xfs/Makefile               |    1 
 fs/xfs/libxfs/xfs_btree.c     |  151 ++++++++++++++----
 fs/xfs/libxfs/xfs_btree.h     |   17 ++
 fs/xfs/libxfs/xfs_btree_mem.h |   87 ++++++++++
 fs/xfs/scrub/xfbtree.c        |  352 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfbtree.h        |   34 ++++
 fs/xfs/scrub/xfile.h          |   46 +++++
 fs/xfs/xfs_buf.c              |   10 +
 fs/xfs/xfs_buf.h              |   10 +
 fs/xfs/xfs_health.c           |    3 
 fs/xfs/xfs_trace.c            |    3 
 fs/xfs/xfs_trace.h            |    5 -
 13 files changed, 694 insertions(+), 29 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_btree_mem.h
 create mode 100644 fs/xfs/scrub/xfbtree.c
 create mode 100644 fs/xfs/scrub/xfbtree.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 2373324be997..612e5c458033 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -104,6 +104,9 @@ config XFS_LIVE_HOOKS
 config XFS_IN_MEMORY_FILE
 	bool
 
+config XFS_IN_MEMORY_BTREE
+	bool
+
 config XFS_ONLINE_SCRUB
 	bool "XFS online metadata check support"
 	default n
@@ -161,6 +164,7 @@ config XFS_ONLINE_REPAIR
 	bool "XFS online metadata repair support"
 	default n
 	depends on XFS_FS && XFS_ONLINE_SCRUB
+	select XFS_IN_MEMORY_BTREE
 	help
 	  If you say Y here you will be able to repair metadata on a
 	  mounted XFS filesystem.  This feature is intended to reduce
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 2d756e13d441..7e1495465cec 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -195,6 +195,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   reap.o \
 				   refcount_repair.o \
 				   repair.o \
+				   xfbtree.o \
 				   )
 
 xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index f577c0463c6e..4c5b4d26cd1b 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -28,6 +28,9 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
 #include "xfs_health.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+#include "xfs_btree_mem.h"
 
 /*
  * Btree magic numbers.
@@ -82,6 +85,9 @@ xfs_btree_check_lblock_siblings(
 	if (level >= 0) {
 		if (!xfs_btree_check_lptr(cur, sibling, level + 1))
 			return __this_address;
+	} else if (cur && (cur->bc_flags & XFS_BTREE_IN_MEMORY)) {
+		if (!xfbtree_verify_xfileoff(cur, sibling))
+			return __this_address;
 	} else {
 		if (!xfs_verify_fsbno(mp, sibling))
 			return __this_address;
@@ -109,6 +115,9 @@ xfs_btree_check_sblock_siblings(
 	if (level >= 0) {
 		if (!xfs_btree_check_sptr(cur, sibling, level + 1))
 			return __this_address;
+	} else if (cur && (cur->bc_flags & XFS_BTREE_IN_MEMORY)) {
+		if (!xfbtree_verify_xfileoff(cur, sibling))
+			return __this_address;
 	} else {
 		if (!xfs_verify_agbno(pag, sibling))
 			return __this_address;
@@ -151,7 +160,9 @@ __xfs_btree_check_lblock(
 	    cur->bc_ops->get_maxrecs(cur, level))
 		return __this_address;
 
-	if (bp)
+	if ((cur->bc_flags & XFS_BTREE_IN_MEMORY) && bp)
+		fsb = xfbtree_buf_to_xfoff(cur, bp);
+	else if (bp)
 		fsb = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
 
 	fa = xfs_btree_check_lblock_siblings(mp, cur, level, fsb,
@@ -218,8 +229,12 @@ __xfs_btree_check_sblock(
 	    cur->bc_ops->get_maxrecs(cur, level))
 		return __this_address;
 
-	if (bp)
+	if ((cur->bc_flags & XFS_BTREE_IN_MEMORY) && bp) {
+		pag = NULL;
+		agbno = xfbtree_buf_to_xfoff(cur, bp);
+	} else if (bp) {
 		agbno = xfs_daddr_to_agbno(mp, xfs_buf_daddr(bp));
+	}
 
 	fa = xfs_btree_check_sblock_siblings(pag, cur, level, agbno,
 			block->bb_u.s.bb_leftsib);
@@ -276,6 +291,8 @@ xfs_btree_check_lptr(
 {
 	if (level <= 0)
 		return false;
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_verify_xfileoff(cur, fsbno);
 	return xfs_verify_fsbno(cur->bc_mp, fsbno);
 }
 
@@ -288,6 +305,8 @@ xfs_btree_check_sptr(
 {
 	if (level <= 0)
 		return false;
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_verify_xfileoff(cur, agbno);
 	return xfs_verify_agbno(cur->bc_ag.pag, agbno);
 }
 
@@ -302,6 +321,9 @@ xfs_btree_check_ptr(
 	int				index,
 	int				level)
 {
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_check_ptr(cur, ptr, index, level);
+
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		if (xfs_btree_check_lptr(cur, be64_to_cpu((&ptr->l)[index]),
 				level))
@@ -458,11 +480,36 @@ xfs_btree_del_cursor(
 	       xfs_is_shutdown(cur->bc_mp) || error != 0);
 	if (unlikely(cur->bc_flags & XFS_BTREE_STAGING))
 		kmem_free(cur->bc_ops);
-	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
+	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) &&
+	    !(cur->bc_flags & XFS_BTREE_IN_MEMORY) && cur->bc_ag.pag)
 		xfs_perag_put(cur->bc_ag.pag);
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+		if (cur->bc_mem.pag)
+			xfs_perag_put(cur->bc_mem.pag);
+	}
 	kmem_cache_free(cur->bc_cache, cur);
 }
 
+/* Return the buffer target for this btree's buffer. */
+static inline struct xfs_buftarg *
+xfs_btree_buftarg(
+	struct xfs_btree_cur	*cur)
+{
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_target(cur->bc_mem.xfbtree);
+	return cur->bc_mp->m_ddev_targp;
+}
+
+/* Return the block size (in units of 512b sectors) for this btree. */
+static inline unsigned int
+xfs_btree_bbsize(
+	struct xfs_btree_cur	*cur)
+{
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_bbsize();
+	return cur->bc_mp->m_bsize;
+}
+
 /*
  * Duplicate the btree cursor.
  * Allocate a new one, copy the record, re-get the buffers.
@@ -500,10 +547,11 @@ xfs_btree_dup_cursor(
 		new->bc_levels[i].ra = cur->bc_levels[i].ra;
 		bp = cur->bc_levels[i].bp;
 		if (bp) {
-			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
-						   xfs_buf_daddr(bp), mp->m_bsize,
-						   0, &bp,
-						   cur->bc_ops->buf_ops);
+			error = xfs_trans_read_buf(mp, tp,
+					xfs_btree_buftarg(cur),
+					xfs_buf_daddr(bp),
+					xfs_btree_bbsize(cur), 0, &bp,
+					cur->bc_ops->buf_ops);
 			if (xfs_metadata_is_sick(error))
 				xfs_btree_mark_sick(new);
 			if (error) {
@@ -944,6 +992,9 @@ xfs_btree_readahead_lblock(
 	xfs_fsblock_t		left = be64_to_cpu(block->bb_u.l.bb_leftsib);
 	xfs_fsblock_t		right = be64_to_cpu(block->bb_u.l.bb_rightsib);
 
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return 0;
+
 	if ((lr & XFS_BTCUR_LEFTRA) && left != NULLFSBLOCK) {
 		xfs_btree_reada_bufl(cur->bc_mp, left, 1,
 				     cur->bc_ops->buf_ops);
@@ -969,6 +1020,8 @@ xfs_btree_readahead_sblock(
 	xfs_agblock_t		left = be32_to_cpu(block->bb_u.s.bb_leftsib);
 	xfs_agblock_t		right = be32_to_cpu(block->bb_u.s.bb_rightsib);
 
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return 0;
 
 	if ((lr & XFS_BTCUR_LEFTRA) && left != NULLAGBLOCK) {
 		xfs_btree_reada_bufs(cur->bc_mp, cur->bc_ag.pag->pag_agno,
@@ -1030,6 +1083,11 @@ xfs_btree_ptr_to_daddr(
 	if (error)
 		return error;
 
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+		*daddr = xfbtree_ptr_to_daddr(cur, ptr);
+		return 0;
+	}
+
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		fsbno = be64_to_cpu(ptr->l);
 		*daddr = XFS_FSB_TO_DADDR(cur->bc_mp, fsbno);
@@ -1058,8 +1116,9 @@ xfs_btree_readahead_ptr(
 
 	if (xfs_btree_ptr_to_daddr(cur, ptr, &daddr))
 		return;
-	xfs_buf_readahead(cur->bc_mp->m_ddev_targp, daddr,
-			  cur->bc_mp->m_bsize * count, cur->bc_ops->buf_ops);
+	xfs_buf_readahead(xfs_btree_buftarg(cur), daddr,
+			xfs_btree_bbsize(cur) * count,
+			cur->bc_ops->buf_ops);
 }
 
 /*
@@ -1233,7 +1292,9 @@ xfs_btree_init_block_cur(
 	 * change in future, but is safe for current users of the generic btree
 	 * code.
 	 */
-	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		owner = xfbtree_owner(cur);
+	else if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		owner = cur->bc_ino.ip->i_ino;
 	else
 		owner = cur->bc_ag.pag->pag_agno;
@@ -1273,6 +1334,11 @@ xfs_btree_buf_to_ptr(
 	struct xfs_buf		*bp,
 	union xfs_btree_ptr	*ptr)
 {
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+		xfbtree_buf_to_ptr(cur, bp, ptr);
+		return;
+	}
+
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		ptr->l = cpu_to_be64(XFS_DADDR_TO_FSB(cur->bc_mp,
 					xfs_buf_daddr(bp)));
@@ -1317,15 +1383,14 @@ xfs_btree_get_buf_block(
 	struct xfs_btree_block		**block,
 	struct xfs_buf			**bpp)
 {
-	struct xfs_mount	*mp = cur->bc_mp;
-	xfs_daddr_t		d;
-	int			error;
+	xfs_daddr_t			d;
+	int				error;
 
 	error = xfs_btree_ptr_to_daddr(cur, ptr, &d);
 	if (error)
 		return error;
-	error = xfs_trans_get_buf(cur->bc_tp, mp->m_ddev_targp, d, mp->m_bsize,
-			0, bpp);
+	error = xfs_trans_get_buf(cur->bc_tp, xfs_btree_buftarg(cur), d,
+			xfs_btree_bbsize(cur), 0, bpp);
 	if (error)
 		return error;
 
@@ -1356,9 +1421,9 @@ xfs_btree_read_buf_block(
 	error = xfs_btree_ptr_to_daddr(cur, ptr, &d);
 	if (error)
 		return error;
-	error = xfs_trans_read_buf(mp, cur->bc_tp, mp->m_ddev_targp, d,
-				   mp->m_bsize, flags, bpp,
-				   cur->bc_ops->buf_ops);
+	error = xfs_trans_read_buf(mp, cur->bc_tp, xfs_btree_buftarg(cur), d,
+			xfs_btree_bbsize(cur), flags, bpp,
+			cur->bc_ops->buf_ops);
 	if (xfs_metadata_is_sick(error))
 		xfs_btree_mark_sick(cur);
 	if (error)
@@ -1798,6 +1863,37 @@ xfs_btree_decrement(
 	return error;
 }
 
+/*
+ * Check the btree block owner now that we have the context to know who the
+ * real owner is.
+ */
+static inline xfs_failaddr_t
+xfs_btree_check_block_owner(
+	struct xfs_btree_cur	*cur,
+	struct xfs_btree_block	*block)
+{
+	if (!xfs_has_crc(cur->bc_mp))
+		return NULL;
+
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_check_block_owner(cur, block);
+
+	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS)) {
+		if (be32_to_cpu(block->bb_u.s.bb_owner) !=
+						cur->bc_ag.pag->pag_agno)
+			return __this_address;
+		return NULL;
+	}
+
+	if (cur->bc_ino.flags & XFS_BTCUR_BMBT_INVALID_OWNER)
+		return NULL;
+
+	if (be64_to_cpu(block->bb_u.l.bb_owner) != cur->bc_ino.ip->i_ino)
+		return __this_address;
+
+	return NULL;
+}
+
 int
 xfs_btree_lookup_get_block(
 	struct xfs_btree_cur		*cur,	/* btree cursor */
@@ -1836,11 +1932,7 @@ xfs_btree_lookup_get_block(
 		return error;
 
 	/* Check the inode owner since the verifiers don't. */
-	if (xfs_has_crc(cur->bc_mp) &&
-	    !(cur->bc_ino.flags & XFS_BTCUR_BMBT_INVALID_OWNER) &&
-	    (cur->bc_flags & XFS_BTREE_LONG_PTRS) &&
-	    be64_to_cpu((*blkp)->bb_u.l.bb_owner) !=
-			cur->bc_ino.ip->i_ino)
+	if (xfs_btree_check_block_owner(cur, *blkp) != NULL)
 		goto out_bad;
 
 	/* Did we get the level we were looking for? */
@@ -4372,7 +4464,7 @@ xfs_btree_visit_block(
 {
 	struct xfs_btree_block		*block;
 	struct xfs_buf			*bp;
-	union xfs_btree_ptr		rptr;
+	union xfs_btree_ptr		rptr, bufptr;
 	int				error;
 
 	/* do right sibling readahead */
@@ -4395,15 +4487,14 @@ xfs_btree_visit_block(
 	 * return the same block without checking if the right sibling points
 	 * back to us and creates a cyclic reference in the btree.
 	 */
+	xfs_btree_buf_to_ptr(cur, bp, &bufptr);
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
-		if (be64_to_cpu(rptr.l) == XFS_DADDR_TO_FSB(cur->bc_mp,
-							xfs_buf_daddr(bp))) {
+		if (rptr.l == bufptr.l) {
 			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
 		}
 	} else {
-		if (be32_to_cpu(rptr.s) == xfs_daddr_to_agbno(cur->bc_mp,
-							xfs_buf_daddr(bp))) {
+		if (rptr.s == bufptr.s) {
 			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
 		}
@@ -4585,6 +4676,8 @@ xfs_btree_lblock_verify(
 	xfs_fsblock_t		fsb;
 	xfs_failaddr_t		fa;
 
+	ASSERT(!xfs_buftarg_in_memory(bp->b_target));
+
 	/* numrecs verification */
 	if (be16_to_cpu(block->bb_numrecs) > max_recs)
 		return __this_address;
@@ -4640,6 +4733,8 @@ xfs_btree_sblock_verify(
 	xfs_agblock_t		agbno;
 	xfs_failaddr_t		fa;
 
+	ASSERT(!xfs_buftarg_in_memory(bp->b_target));
+
 	/* numrecs verification */
 	if (be16_to_cpu(block->bb_numrecs) > max_recs)
 		return __this_address;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 5525d3715d57..b2c08a436997 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -248,6 +248,15 @@ struct xfs_btree_cur_ino {
 #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
 };
 
+/* In-memory btree information */
+struct xfbtree;
+
+struct xfs_btree_cur_mem {
+	struct xfbtree			*xfbtree;
+	struct xfs_buf			*head_bp;
+	struct xfs_perag		*pag;
+};
+
 struct xfs_btree_level {
 	/* buffer pointer */
 	struct xfs_buf		*bp;
@@ -287,6 +296,7 @@ struct xfs_btree_cur
 	union {
 		struct xfs_btree_cur_ag	bc_ag;
 		struct xfs_btree_cur_ino bc_ino;
+		struct xfs_btree_cur_mem bc_mem;
 	};
 
 	/* Must be at the end of the struct! */
@@ -317,6 +327,13 @@ xfs_btree_cur_sizeof(unsigned int nlevels)
  */
 #define XFS_BTREE_STAGING		(1<<5)
 
+/* btree stored in memory; not compatible with ROOT_IN_INODE */
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+# define XFS_BTREE_IN_MEMORY		(1<<7)
+#else
+# define XFS_BTREE_IN_MEMORY		(0)
+#endif
+
 #define	XFS_BTREE_NOERROR	0
 #define	XFS_BTREE_ERROR		1
 
diff --git a/fs/xfs/libxfs/xfs_btree_mem.h b/fs/xfs/libxfs/xfs_btree_mem.h
new file mode 100644
index 000000000000..6ca9ea64a9a4
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_btree_mem.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_BTREE_MEM_H__
+#define __XFS_BTREE_MEM_H__
+
+struct xfbtree;
+
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
+
+struct xfs_buftarg *xfbtree_target(struct xfbtree *xfbtree);
+int xfbtree_check_ptr(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr, int index, int level);
+xfs_daddr_t xfbtree_ptr_to_daddr(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr);
+void xfbtree_buf_to_ptr(struct xfs_btree_cur *cur, struct xfs_buf *bp,
+		union xfs_btree_ptr *ptr);
+
+unsigned int xfbtree_bbsize(void);
+
+void xfbtree_set_root(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr, int inc);
+void xfbtree_init_ptr_from_cur(struct xfs_btree_cur *cur,
+		union xfs_btree_ptr *ptr);
+struct xfs_btree_cur *xfbtree_dup_cursor(struct xfs_btree_cur *cur);
+bool xfbtree_verify_xfileoff(struct xfs_btree_cur *cur,
+		unsigned long long xfoff);
+xfs_failaddr_t xfbtree_check_block_owner(struct xfs_btree_cur *cur,
+		struct xfs_btree_block *block);
+unsigned long long xfbtree_owner(struct xfs_btree_cur *cur);
+xfs_failaddr_t xfbtree_lblock_verify(struct xfs_buf *bp, unsigned int max_recs);
+xfs_failaddr_t xfbtree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
+unsigned long long xfbtree_buf_to_xfoff(struct xfs_btree_cur *cur,
+		struct xfs_buf *bp);
+#else
+static inline unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp)
+{
+	return 0;
+}
+
+static inline struct xfs_buftarg *
+xfbtree_target(struct xfbtree *xfbtree)
+{
+	return NULL;
+}
+
+static inline int
+xfbtree_check_ptr(struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr,
+		  int index, int level)
+{
+	return 0;
+}
+
+static inline xfs_daddr_t
+xfbtree_ptr_to_daddr(struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr)
+{
+	return 0;
+}
+
+static inline void
+xfbtree_buf_to_ptr(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp,
+	union xfs_btree_ptr	*ptr)
+{
+	memset(ptr, 0xFF, sizeof(*ptr));
+}
+
+static inline unsigned int xfbtree_bbsize(void)
+{
+	return 0;
+}
+
+#define xfbtree_set_root			NULL
+#define xfbtree_init_ptr_from_cur		NULL
+#define xfbtree_dup_cursor			NULL
+#define xfbtree_verify_xfileoff(cur, xfoff)	(false)
+#define xfbtree_check_block_owner(cur, block)	NULL
+#define xfbtree_owner(cur)			(0ULL)
+#define xfbtree_buf_to_xfoff(cur, bp)		(-1)
+
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
+#endif /* __XFS_BTREE_MEM_H__ */
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
new file mode 100644
index 000000000000..80f9ab4fec07
--- /dev/null
+++ b/fs/xfs/scrub/xfbtree.c
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_btree.h"
+#include "xfs_error.h"
+#include "xfs_btree_mem.h"
+#include "xfs_ag.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+
+/* btree ops functions for in-memory btrees. */
+
+static xfs_failaddr_t
+xfs_btree_mem_head_verify(
+	struct xfs_buf			*bp)
+{
+	struct xfs_btree_mem_head	*mhead = bp->b_addr;
+	struct xfs_mount		*mp = bp->b_mount;
+
+	if (!xfs_verify_magic(bp, mhead->mh_magic))
+		return __this_address;
+	if (be32_to_cpu(mhead->mh_nlevels) == 0)
+		return __this_address;
+	if (!uuid_equal(&mhead->mh_uuid, &mp->m_sb.sb_meta_uuid))
+		return __this_address;
+
+	return NULL;
+}
+
+static void
+xfs_btree_mem_head_read_verify(
+	struct xfs_buf		*bp)
+{
+	xfs_failaddr_t		fa = xfs_btree_mem_head_verify(bp);
+
+	if (fa)
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+static void
+xfs_btree_mem_head_write_verify(
+	struct xfs_buf		*bp)
+{
+	xfs_failaddr_t		fa = xfs_btree_mem_head_verify(bp);
+
+	if (fa)
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+static const struct xfs_buf_ops xfs_btree_mem_head_buf_ops = {
+	.name			= "xfs_btree_mem_head",
+	.magic			= { cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC),
+				    cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC) },
+	.verify_read		= xfs_btree_mem_head_read_verify,
+	.verify_write		= xfs_btree_mem_head_write_verify,
+	.verify_struct		= xfs_btree_mem_head_verify,
+};
+
+/* Initialize the header block for an in-memory btree. */
+static inline void
+xfs_btree_mem_head_init(
+	struct xfs_buf			*head_bp,
+	unsigned long long		owner,
+	xfileoff_t			leaf_xfoff)
+{
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+	struct xfs_mount		*mp = head_bp->b_mount;
+
+	mhead->mh_magic = cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC);
+	mhead->mh_nlevels = cpu_to_be32(1);
+	mhead->mh_owner = cpu_to_be64(owner);
+	mhead->mh_root = cpu_to_be64(leaf_xfoff);
+	uuid_copy(&mhead->mh_uuid, &mp->m_sb.sb_meta_uuid);
+
+	head_bp->b_ops = &xfs_btree_mem_head_buf_ops;
+}
+
+/* Return tree height from the in-memory btree head. */
+unsigned int
+xfs_btree_mem_head_nlevels(
+	struct xfs_buf			*head_bp)
+{
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+
+	return be32_to_cpu(mhead->mh_nlevels);
+}
+
+/* Extract the buftarg target for this xfile btree. */
+struct xfs_buftarg *
+xfbtree_target(struct xfbtree *xfbtree)
+{
+	return xfbtree->target;
+}
+
+/* Is this daddr (sector offset) contained within the buffer target? */
+static inline bool
+xfbtree_verify_buftarg_xfileoff(
+	struct xfs_buftarg	*btp,
+	xfileoff_t		xfoff)
+{
+	xfs_daddr_t		xfoff_daddr = xfo_to_daddr(xfoff);
+
+	return xfs_buftarg_verify_daddr(btp, xfoff_daddr);
+}
+
+/* Is this btree xfile offset contained within the xfile? */
+bool
+xfbtree_verify_xfileoff(
+	struct xfs_btree_cur	*cur,
+	unsigned long long	xfoff)
+{
+	struct xfs_buftarg	*btp = xfbtree_target(cur->bc_mem.xfbtree);
+
+	return xfbtree_verify_buftarg_xfileoff(btp, xfoff);
+}
+
+/* Check if a btree pointer is reasonable. */
+int
+xfbtree_check_ptr(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*ptr,
+	int				index,
+	int				level)
+{
+	xfileoff_t			bt_xfoff;
+	xfs_failaddr_t			fa = NULL;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		bt_xfoff = be64_to_cpu(ptr->l);
+	else
+		bt_xfoff = be32_to_cpu(ptr->s);
+
+	if (!xfbtree_verify_xfileoff(cur, bt_xfoff))
+		fa = __this_address;
+
+	if (fa) {
+		xfs_err(cur->bc_mp,
+"In-memory: Corrupt btree %d flags 0x%x pointer at level %d index %d fa %pS.",
+				cur->bc_btnum, cur->bc_flags, level, index,
+				fa);
+		return -EFSCORRUPTED;
+	}
+	return 0;
+}
+
+/* Convert a btree pointer to a daddr */
+xfs_daddr_t
+xfbtree_ptr_to_daddr(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*ptr)
+{
+	xfileoff_t			bt_xfoff;
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		bt_xfoff = be64_to_cpu(ptr->l);
+	else
+		bt_xfoff = be32_to_cpu(ptr->s);
+	return xfo_to_daddr(bt_xfoff);
+}
+
+/* Set the pointer to point to this buffer. */
+void
+xfbtree_buf_to_ptr(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp,
+	union xfs_btree_ptr	*ptr)
+{
+	xfileoff_t		xfoff = xfs_daddr_to_xfo(xfs_buf_daddr(bp));
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		ptr->l = cpu_to_be64(xfoff);
+	else
+		ptr->s = cpu_to_be32(xfoff);
+}
+
+/* Return the in-memory btree block size, in units of 512 bytes. */
+unsigned int xfbtree_bbsize(void)
+{
+	return xfo_to_daddr(1);
+}
+
+/* Set the root of an in-memory btree. */
+void
+xfbtree_set_root(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*ptr,
+	int				inc)
+{
+	struct xfs_buf			*head_bp = cur->bc_mem.head_bp;
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		mhead->mh_root = ptr->l;
+	} else {
+		uint32_t		root = be32_to_cpu(ptr->s);
+
+		mhead->mh_root = cpu_to_be64(root);
+	}
+	be32_add_cpu(&mhead->mh_nlevels, inc);
+	xfs_trans_log_buf(cur->bc_tp, head_bp, 0, sizeof(*mhead) - 1);
+}
+
+/* Initialize a pointer from the in-memory btree header. */
+void
+xfbtree_init_ptr_from_cur(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_ptr		*ptr)
+{
+	struct xfs_buf			*head_bp = cur->bc_mem.head_bp;
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		ptr->l = mhead->mh_root;
+	} else {
+		uint64_t		root = be64_to_cpu(mhead->mh_root);
+
+		ptr->s = cpu_to_be32(root);
+	}
+}
+
+/* Duplicate an in-memory btree cursor. */
+struct xfs_btree_cur *
+xfbtree_dup_cursor(
+	struct xfs_btree_cur		*cur)
+{
+	struct xfs_btree_cur		*ncur;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	ncur = xfs_btree_alloc_cursor(cur->bc_mp, cur->bc_tp, cur->bc_btnum,
+			cur->bc_maxlevels, cur->bc_cache);
+	ncur->bc_flags = cur->bc_flags;
+	ncur->bc_nlevels = cur->bc_nlevels;
+	ncur->bc_statoff = cur->bc_statoff;
+	ncur->bc_ops = cur->bc_ops;
+	memcpy(&ncur->bc_mem, &cur->bc_mem, sizeof(cur->bc_mem));
+
+	if (cur->bc_mem.pag)
+		ncur->bc_mem.pag = xfs_perag_bump(cur->bc_mem.pag);
+
+	return ncur;
+}
+
+/* Check the owner of an in-memory btree block. */
+xfs_failaddr_t
+xfbtree_check_block_owner(
+	struct xfs_btree_cur	*cur,
+	struct xfs_btree_block	*block)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		if (be64_to_cpu(block->bb_u.l.bb_owner) != xfbt->owner)
+			return __this_address;
+
+		return NULL;
+	}
+
+	if (be32_to_cpu(block->bb_u.s.bb_owner) != xfbt->owner)
+		return __this_address;
+
+	return NULL;
+}
+
+/* Return the owner of this in-memory btree. */
+unsigned long long
+xfbtree_owner(
+	struct xfs_btree_cur	*cur)
+{
+	return cur->bc_mem.xfbtree->owner;
+}
+
+/* Return the xfile offset (in blocks) of a btree buffer. */
+unsigned long long
+xfbtree_buf_to_xfoff(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	return xfs_daddr_to_xfo(xfs_buf_daddr(bp));
+}
+
+/* Verify a long-format btree block. */
+xfs_failaddr_t
+xfbtree_lblock_verify(
+	struct xfs_buf		*bp,
+	unsigned int		max_recs)
+{
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_buftarg	*btp = bp->b_target;
+
+	/* numrecs verification */
+	if (be16_to_cpu(block->bb_numrecs) > max_recs)
+		return __this_address;
+
+	/* sibling pointer verification */
+	if (block->bb_u.l.bb_leftsib != cpu_to_be64(NULLFSBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be64_to_cpu(block->bb_u.l.bb_leftsib)))
+		return __this_address;
+
+	if (block->bb_u.l.bb_rightsib != cpu_to_be64(NULLFSBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be64_to_cpu(block->bb_u.l.bb_rightsib)))
+		return __this_address;
+
+	return NULL;
+}
+
+/* Verify a short-format btree block. */
+xfs_failaddr_t
+xfbtree_sblock_verify(
+	struct xfs_buf		*bp,
+	unsigned int		max_recs)
+{
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_buftarg	*btp = bp->b_target;
+
+	/* numrecs verification */
+	if (be16_to_cpu(block->bb_numrecs) > max_recs)
+		return __this_address;
+
+	/* sibling pointer verification */
+	if (block->bb_u.s.bb_leftsib != cpu_to_be32(NULLAGBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be32_to_cpu(block->bb_u.s.bb_leftsib)))
+		return __this_address;
+
+	if (block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be32_to_cpu(block->bb_u.s.bb_rightsib)))
+		return __this_address;
+
+	return NULL;
+}
diff --git a/fs/xfs/scrub/xfbtree.h b/fs/xfs/scrub/xfbtree.h
new file mode 100644
index 000000000000..b3836f21085d
--- /dev/null
+++ b/fs/xfs/scrub/xfbtree.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef XFS_SCRUB_XFBTREE_H__
+#define XFS_SCRUB_XFBTREE_H__
+
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+
+/* Root block for an in-memory btree. */
+struct xfs_btree_mem_head {
+	__be32				mh_magic;
+	__be32				mh_nlevels;
+	__be64				mh_owner;
+	__be64				mh_root;
+	uuid_t				mh_uuid;
+};
+
+#define XFS_BTREE_MEM_HEAD_MAGIC	0x4341544D	/* "CATM" */
+
+/* xfile-backed in-memory btrees */
+
+struct xfbtree {
+	/* buffer cache target for this in-memory btree */
+	struct xfs_buftarg		*target;
+
+	/* Owner of this btree. */
+	unsigned long long		owner;
+};
+
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
+#endif /* XFS_SCRUB_XFBTREE_H__ */
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index 99b6db838612..c934e70f95e8 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -78,6 +78,47 @@ int xfile_get_page(struct xfile *xf, loff_t offset, unsigned int len,
 int xfile_put_page(struct xfile *xf, struct xfile_page *xbuf);
 
 int xfile_dump(struct xfile *xf);
+
+static inline loff_t xfile_size(struct xfile *xf)
+{
+	return i_size_read(file_inode(xf->file));
+}
+
+/* file block (aka system page size) to basic block conversions. */
+typedef unsigned long long	xfileoff_t;
+#define XFB_BLOCKSIZE		(PAGE_SIZE)
+#define XFB_BSHIFT		(PAGE_SHIFT)
+#define XFB_SHIFT		(XFB_BSHIFT - BBSHIFT)
+
+static inline loff_t xfo_to_b(xfileoff_t xfoff)
+{
+	return xfoff << XFB_BSHIFT;
+}
+
+static inline xfileoff_t b_to_xfo(loff_t pos)
+{
+	return (pos + (XFB_BLOCKSIZE - 1)) >> XFB_BSHIFT;
+}
+
+static inline xfileoff_t b_to_xfot(loff_t pos)
+{
+	return pos >> XFB_BSHIFT;
+}
+
+static inline xfs_daddr_t xfo_to_daddr(xfileoff_t xfoff)
+{
+	return xfoff << XFB_SHIFT;
+}
+
+static inline xfileoff_t xfs_daddr_to_xfo(xfs_daddr_t bb)
+{
+	return (bb + (xfo_to_daddr(1) - 1)) >> XFB_SHIFT;
+}
+
+static inline xfileoff_t xfs_daddr_to_xfot(xfs_daddr_t bb)
+{
+	return bb >> XFB_SHIFT;
+}
 #else
 static inline int
 xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t offset)
@@ -90,6 +131,11 @@ xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t offset)
 {
 	return -EIO;
 }
+
+static inline loff_t xfile_size(struct xfile *xf)
+{
+	return 0;
+}
 #endif /* CONFIG_XFS_IN_MEMORY_FILE */
 
 #endif /* __XFS_SCRUB_XFILE_H__ */
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2ec8d39def9c..bf3b7c96f207 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2533,3 +2533,13 @@ xfs_verify_magic16(
 		return false;
 	return dmagic == bp->b_ops->magic16[idx];
 }
+
+/* Return the number of sectors for a buffer target. */
+xfs_daddr_t
+xfs_buftarg_nr_sectors(
+	struct xfs_buftarg	*btp)
+{
+	if (btp->bt_flags & XFS_BUFTARG_IN_MEMORY)
+		return xfile_size(btp->bt_xfile) >> SECTOR_SHIFT;
+	return bdev_nr_sectors(btp->bt_bdev);
+}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index dcae77dabdcc..d74ce9080282 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -445,6 +445,16 @@ xfs_buftarg_zeroout(
 			flags);
 }
 
+xfs_daddr_t xfs_buftarg_nr_sectors(struct xfs_buftarg *btp);
+
+static inline bool
+xfs_buftarg_verify_daddr(
+	struct xfs_buftarg	*btp,
+	xfs_daddr_t		daddr)
+{
+	return daddr < xfs_buftarg_nr_sectors(btp);
+}
+
 int xfs_buf_reverify(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
 bool xfs_verify_magic(struct xfs_buf *bp, __be32 dmagic);
 bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 74a4620d763b..6de8780b208a 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -508,6 +508,9 @@ xfs_btree_mark_sick(
 {
 	unsigned int			mask;
 
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return;
+
 	switch (cur->bc_btnum) {
 	case XFS_BTNUM_BMAP:
 		xfs_bmap_mark_sick(cur->bc_ino.ip, cur->bc_ino.whichfork);
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 8a5dc1538aa8..2d49310fb912 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -36,6 +36,9 @@
 #include "xfs_error.h"
 #include <linux/iomap.h>
 #include "xfs_iomap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+#include "xfs_btree_mem.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d86dd34127f2..2d006bf0f9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2529,7 +2529,10 @@ TRACE_EVENT(xfs_btree_alloc_block,
 	),
 	TP_fast_assign(
 		__entry->dev = cur->bc_mp->m_super->s_dev;
-		if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) {
+		if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+			__entry->agno = 0;
+			__entry->ino = 0;
+		} else if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) {
 			__entry->agno = 0;
 			__entry->ino = cur->bc_ino.ip->i_ino;
 		} else {



* [PATCH 7/7] xfs: connect in-memory btrees to xfiles
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 4/7] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, willy, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Add to our stubbed-out in-memory btrees the ability to connect them to an
actual in-memory backing file (aka an xfile), along with the pieces
needed to track free space in the xfile and to flush dirty xfbtree
buffers on demand.  Online repair will need both of these.
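
Roughly speaking, a repair function creates an xfbtree against an
xfile-backed buftarg, stages records into it under a transaction, and
then either flushes or discards the dirty buffers before tearing the
whole thing down.  Here is a minimal sketch of that lifecycle using the
interfaces added in this patch; the staging helper and the btree ops
pointer are hypothetical placeholders, since the real callers arrive in
later repair patches:

/* Hypothetical helpers for illustration only. */
extern const struct xfs_btree_ops example_mem_btree_ops;
int example_stage_records(struct xfs_scrub *sc, struct xfbtree *xfbt,
		struct xfs_trans *tp);

static int
example_build_staging_btree(
	struct xfs_scrub		*sc,
	struct xfs_trans		*tp)
{
	struct xfbtree_config		cfg = {
		.btree_ops	= &example_mem_btree_ops,
		.target		= sc->xfile_buftarg,
		.owner		= 0,	/* e.g. an AG number or inode */
		.btnum		= XFS_BTNUM_RMAP,
	};
	struct xfbtree			*xfbt;
	int				error;

	error = xfbtree_create(sc->mp, &cfg, &xfbt);
	if (error)
		return error;

	error = example_stage_records(sc, xfbt, tp);
	if (error) {
		/* Ephemeral data; just detach the dirty buffers. */
		xfbtree_trans_cancel(xfbt, tp);
		goto out_destroy;
	}

	/* Write the dirty xfbtree buffers out to the backing xfile. */
	error = xfbtree_trans_commit(xfbt, tp);

out_destroy:
	xfbtree_destroy(xfbt);
	return error;
}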

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_mem.h |   41 ++++
 fs/xfs/scrub/bitmap.c         |   28 ++
 fs/xfs/scrub/bitmap.h         |    3 
 fs/xfs/scrub/scrub.c          |    4 
 fs/xfs/scrub/scrub.h          |    3 
 fs/xfs/scrub/trace.c          |   13 +
 fs/xfs/scrub/trace.h          |  110 ++++++++++
 fs/xfs/scrub/xfbtree.c        |  466 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfbtree.h        |   25 ++
 fs/xfs/scrub/xfile.c          |   83 +++++++
 fs/xfs/scrub/xfile.h          |    2 
 fs/xfs/xfs_buf.c              |    6 -
 fs/xfs/xfs_trace.h            |    1 
 fs/xfs/xfs_trans.h            |    1 
 fs/xfs/xfs_trans_buf.c        |   42 ++++
 15 files changed, 825 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree_mem.h b/fs/xfs/libxfs/xfs_btree_mem.h
index 6ca9ea64a9a4..5e7b1f20fb5b 100644
--- a/fs/xfs/libxfs/xfs_btree_mem.h
+++ b/fs/xfs/libxfs/xfs_btree_mem.h
@@ -8,6 +8,26 @@
 
 struct xfbtree;
 
+struct xfbtree_config {
+	/* Buffer ops for the btree root block */
+	const struct xfs_btree_ops	*btree_ops;
+
+	/* Buffer target for the xfile backing this btree. */
+	struct xfs_buftarg		*target;
+
+	/* Owner of this btree. */
+	unsigned long long		owner;
+
+	/* Btree type number */
+	xfs_btnum_t			btnum;
+
+	/* XFBTREE_CREATE_* flags */
+	unsigned int			flags;
+};
+
+/* btree has long pointers */
+#define XFBTREE_CREATE_LONG_PTRS	(1U << 0)
+
 #ifdef CONFIG_XFS_IN_MEMORY_BTREE
 unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
 
@@ -35,6 +55,16 @@ xfs_failaddr_t xfbtree_lblock_verify(struct xfs_buf *bp, unsigned int max_recs);
 xfs_failaddr_t xfbtree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
 unsigned long long xfbtree_buf_to_xfoff(struct xfs_btree_cur *cur,
 		struct xfs_buf *bp);
+
+int xfbtree_get_minrecs(struct xfs_btree_cur *cur, int level);
+int xfbtree_get_maxrecs(struct xfs_btree_cur *cur, int level);
+
+int xfbtree_create(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+		struct xfbtree **xfbtreep);
+int xfbtree_alloc_block(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *start, union xfs_btree_ptr *ptr,
+		int *stat);
+int xfbtree_free_block(struct xfs_btree_cur *cur, struct xfs_buf *bp);
 #else
 static inline unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp)
 {
@@ -77,11 +107,22 @@ static inline unsigned int xfbtree_bbsize(void)
 #define xfbtree_set_root			NULL
 #define xfbtree_init_ptr_from_cur		NULL
 #define xfbtree_dup_cursor			NULL
+#define xfbtree_get_minrecs			NULL
+#define xfbtree_get_maxrecs			NULL
+#define xfbtree_alloc_block			NULL
+#define xfbtree_free_block			NULL
 #define xfbtree_verify_xfileoff(cur, xfoff)	(false)
 #define xfbtree_check_block_owner(cur, block)	NULL
 #define xfbtree_owner(cur)			(0ULL)
 #define xfbtree_buf_to_xfoff(cur, bp)		(-1)
 
+static inline int
+xfbtree_create(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+		struct xfbtree **xfbtreep)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_XFS_IN_MEMORY_BTREE */
 
 #endif /* __XFS_BTREE_MEM_H__ */
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index f707434b1c86..c98f4c45414a 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -379,3 +379,31 @@ xbitmap_test(
 	*len = bn->bn_start - start;
 	return false;
 }
+
+/*
+ * Find the first set bit in this bitmap, clear it, and return the index of
+ * that bit in @valp.  Returns -ENODATA if no bits were set, or the usual
+ * negative errno.
+ */
+int
+xbitmap_take_first_set(
+	struct xbitmap		*bitmap,
+	uint64_t		start,
+	uint64_t		last,
+	uint64_t		*valp)
+{
+	struct xbitmap_node	*bn;
+	uint64_t		val;
+	int			error;
+
+	bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last);
+	if (!bn)
+		return -ENODATA;
+
+	val = bn->bn_start;
+	error = xbitmap_clear(bitmap, bn->bn_start, 1);
+	if (error)
+		return error;
+	*valp = val;
+	return 0;
+}
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 7f1b9c9c7831..1ebe1918bdb2 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -32,6 +32,9 @@ int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn,
 bool xbitmap_empty(struct xbitmap *bitmap);
 bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len);
 
+int xbitmap_take_first_set(struct xbitmap *bitmap, uint64_t start,
+		uint64_t last, uint64_t *valp);
+
 /* Bitmaps, but for type-checked for xfs_agblock_t */
 
 struct xagb_bitmap {
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index fd116531a0d9..5bbc12649277 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -191,6 +191,10 @@ xchk_teardown(
 		sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
 		mnt_drop_write_file(sc->file);
 	}
+	if (sc->xfile_buftarg) {
+		xfs_free_buftarg(sc->xfile_buftarg);
+		sc->xfile_buftarg = NULL;
+	}
 	if (sc->xfile) {
 		xfile_destroy(sc->xfile);
 		sc->xfile = NULL;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 04afb584f504..6fe59d1a2518 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -99,6 +99,9 @@ struct xfs_scrub {
 	/* xfile used by the scrubbers; freed at teardown. */
 	struct xfile			*xfile;
 
+	/* buffer target for the xfile; also freed at teardown. */
+	struct xfs_buftarg		*xfile_buftarg;
+
 	/* Lock flags for @ip. */
 	uint				ilock_flags;
 
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 08e05d49e7c0..177fc4c75507 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -12,15 +12,19 @@
 #include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
+#include "xfs_btree_mem.h"
 #include "xfs_ag.h"
 #include "xfs_quota_defs.h"
 #include "xfs_dir2.h"
+#include "xfs_da_format.h"
+#include "xfs_btree_mem.h"
 #include "scrub/scrub.h"
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
 #include "scrub/iscan.h"
 #include "scrub/nlinks.h"
 #include "scrub/fscounters.h"
+#include "scrub/xfbtree.h"
 
 /* Figure out which block the btree cursor was pointing to. */
 static inline xfs_fsblock_t
@@ -39,6 +43,15 @@ xchk_btree_cur_fsbno(
 	return NULLFSBLOCK;
 }
 
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+static inline unsigned long
+xfbtree_ino(
+	struct xfbtree		*xfbt)
+{
+	return file_inode(xfbt->target->bt_xfile->file)->i_ino;
+}
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
 /*
  * We include this last to have the helpers above available for the trace
  * event implementations.
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 14569068b6ee..05b6a6e3d0ab 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -24,6 +24,8 @@ struct xfarray_sortinfo;
 struct xchk_iscan;
 struct xchk_nlink;
 struct xchk_fscounters;
+struct xfbtree;
+struct xfbtree_config;
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -866,6 +868,8 @@ DEFINE_XFILE_EVENT(xfile_pwrite);
 DEFINE_XFILE_EVENT(xfile_seek_data);
 DEFINE_XFILE_EVENT(xfile_get_page);
 DEFINE_XFILE_EVENT(xfile_put_page);
+DEFINE_XFILE_EVENT(xfile_discard);
+DEFINE_XFILE_EVENT(xfile_prealloc);
 
 TRACE_EVENT(xfarray_create,
 	TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity),
@@ -1971,8 +1975,114 @@ DEFINE_XREP_DQUOT_EVENT(xrep_quotacheck_dquot);
 DEFINE_SCRUB_NLINKS_DIFF_EVENT(xrep_nlinks_update_inode);
 DEFINE_SCRUB_NLINKS_DIFF_EVENT(xrep_nlinks_unfixable_inode);
 
+TRACE_EVENT(xfbtree_create,
+	TP_PROTO(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+		 struct xfbtree *xfbt),
+	TP_ARGS(mp, cfg, xfbt),
+	TP_STRUCT__entry(
+		__field(xfs_btnum_t, btnum)
+		__field(unsigned int, xfbtree_flags)
+		__field(unsigned long, xfino)
+		__field(unsigned int, leaf_mxr)
+		__field(unsigned int, leaf_mnr)
+		__field(unsigned int, node_mxr)
+		__field(unsigned int, node_mnr)
+		__field(unsigned long long, owner)
+	),
+	TP_fast_assign(
+		__entry->btnum = cfg->btnum;
+		__entry->xfbtree_flags = cfg->flags;
+		__entry->xfino = xfbtree_ino(xfbt);
+		__entry->leaf_mxr = xfbt->maxrecs[0];
+		__entry->node_mxr = xfbt->maxrecs[1];
+		__entry->leaf_mnr = xfbt->minrecs[0];
+		__entry->node_mnr = xfbt->minrecs[1];
+		__entry->owner = cfg->owner;
+	),
+	TP_printk("xfino 0x%lx btnum %s owner 0x%llx leaf_mxr %u leaf_mnr %u node_mxr %u node_mnr %u",
+		  __entry->xfino,
+		  __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+		  __entry->owner,
+		  __entry->leaf_mxr,
+		  __entry->leaf_mnr,
+		  __entry->node_mxr,
+		  __entry->node_mnr)
+);
+
+DECLARE_EVENT_CLASS(xfbtree_buf_class,
+	TP_PROTO(struct xfbtree *xfbt, struct xfs_buf *bp),
+	TP_ARGS(xfbt, bp),
+	TP_STRUCT__entry(
+		__field(unsigned long, xfino)
+		__field(xfs_daddr_t, bno)
+		__field(int, nblks)
+		__field(int, hold)
+		__field(int, pincount)
+		__field(unsigned, lockval)
+		__field(unsigned, flags)
+	),
+	TP_fast_assign(
+		__entry->xfino = xfbtree_ino(xfbt);
+		__entry->bno = xfs_buf_daddr(bp);
+		__entry->nblks = bp->b_length;
+		__entry->hold = atomic_read(&bp->b_hold);
+		__entry->pincount = atomic_read(&bp->b_pin_count);
+		__entry->lockval = bp->b_sema.count;
+		__entry->flags = bp->b_flags;
+	),
+	TP_printk("xfino 0x%lx daddr 0x%llx bbcount 0x%x hold %d pincount %d "
+		  "lock %d flags %s",
+		  __entry->xfino,
+		  (unsigned long long)__entry->bno,
+		  __entry->nblks,
+		  __entry->hold,
+		  __entry->pincount,
+		  __entry->lockval,
+		  __print_flags(__entry->flags, "|", XFS_BUF_FLAGS))
+)
+
+#define DEFINE_XFBTREE_BUF_EVENT(name) \
+DEFINE_EVENT(xfbtree_buf_class, name, \
+	TP_PROTO(struct xfbtree *xfbt, struct xfs_buf *bp), \
+	TP_ARGS(xfbt, bp))
+DEFINE_XFBTREE_BUF_EVENT(xfbtree_create_root_buf);
+DEFINE_XFBTREE_BUF_EVENT(xfbtree_trans_commit_buf);
+DEFINE_XFBTREE_BUF_EVENT(xfbtree_trans_cancel_buf);
+
+DECLARE_EVENT_CLASS(xfbtree_freesp_class,
+	TP_PROTO(struct xfbtree *xfbt, struct xfs_btree_cur *cur,
+		 xfs_fileoff_t fileoff),
+	TP_ARGS(xfbt, cur, fileoff),
+	TP_STRUCT__entry(
+		__field(unsigned long, xfino)
+		__field(xfs_btnum_t, btnum)
+		__field(int, nlevels)
+		__field(xfs_fileoff_t, fileoff)
+	),
+	TP_fast_assign(
+		__entry->xfino = xfbtree_ino(xfbt);
+		__entry->btnum = cur->bc_btnum;
+		__entry->nlevels = cur->bc_nlevels;
+		__entry->fileoff = fileoff;
+	),
+	TP_printk("xfino 0x%lx btree %s nlevels %d fileoff 0x%llx",
+		  __entry->xfino,
+		  __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+		  __entry->nlevels,
+		  (unsigned long long)__entry->fileoff)
+)
+
+#define DEFINE_XFBTREE_FREESP_EVENT(name) \
+DEFINE_EVENT(xfbtree_freesp_class, name, \
+	TP_PROTO(struct xfbtree *xfbt, struct xfs_btree_cur *cur, \
+		 xfs_fileoff_t fileoff), \
+	TP_ARGS(xfbt, cur, fileoff))
+DEFINE_XFBTREE_FREESP_EVENT(xfbtree_alloc_block);
+DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
+
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index 80f9ab4fec07..3eeb5110a1cc 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -9,14 +9,19 @@
 #include "xfs_format.h"
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
+#include "xfs_bit.h"
 #include "xfs_mount.h"
 #include "xfs_trans.h"
+#include "xfs_buf_item.h"
 #include "xfs_btree.h"
 #include "xfs_error.h"
 #include "xfs_btree_mem.h"
 #include "xfs_ag.h"
+#include "scrub/scrub.h"
 #include "scrub/xfile.h"
 #include "scrub/xfbtree.h"
+#include "scrub/bitmap.h"
+#include "scrub/trace.h"
 
 /* btree ops functions for in-memory btrees. */
 
@@ -142,9 +147,18 @@ xfbtree_check_ptr(
 	else
 		bt_xfoff = be32_to_cpu(ptr->s);
 
-	if (!xfbtree_verify_xfileoff(cur, bt_xfoff))
+	if (!xfbtree_verify_xfileoff(cur, bt_xfoff)) {
 		fa = __this_address;
+		goto done;
+	}
 
+	/* Can't point to the head or anything before it */
+	if (bt_xfoff < XFBTREE_INIT_LEAF_BLOCK) {
+		fa = __this_address;
+		goto done;
+	}
+
+done:
 	if (fa) {
 		xfs_err(cur->bc_mp,
 "In-memory: Corrupt btree %d flags 0x%x pointer at level %d index %d fa %pS.",
@@ -350,3 +364,453 @@ xfbtree_sblock_verify(
 
 	return NULL;
 }
+
+/* Close the btree xfile and release all resources. */
+void
+xfbtree_destroy(
+	struct xfbtree		*xfbt)
+{
+	xbitmap_destroy(xfbt->freespace);
+	kfree(xfbt->freespace);
+	xfs_buftarg_drain(xfbt->target);
+	kfree(xfbt);
+}
+
+/* Compute the number of bytes available for records. */
+static inline unsigned int
+xfbtree_rec_bytes(
+	struct xfs_mount		*mp,
+	const struct xfbtree_config	*cfg)
+{
+	unsigned int			blocklen = xfo_to_b(1);
+
+	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS) {
+		if (xfs_has_crc(mp))
+			return blocklen - XFS_BTREE_LBLOCK_CRC_LEN;
+
+		return blocklen - XFS_BTREE_LBLOCK_LEN;
+	}
+
+	if (xfs_has_crc(mp))
+		return blocklen - XFS_BTREE_SBLOCK_CRC_LEN;
+
+	return blocklen - XFS_BTREE_SBLOCK_LEN;
+}
+
+/* Initialize an empty leaf block as the btree root. */
+STATIC int
+xfbtree_init_leaf_block(
+	struct xfs_mount		*mp,
+	struct xfbtree			*xfbt,
+	const struct xfbtree_config	*cfg)
+{
+	struct xfs_buf			*bp;
+	xfs_daddr_t			daddr;
+	int				error;
+	unsigned int			bc_flags = 0;
+
+	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
+		bc_flags |= XFS_BTREE_LONG_PTRS;
+
+	daddr = xfo_to_daddr(XFBTREE_INIT_LEAF_BLOCK);
+	error = xfs_buf_get(xfbt->target, daddr, xfbtree_bbsize(), &bp);
+	if (error)
+		return error;
+
+	trace_xfbtree_create_root_buf(xfbt, bp);
+
+	bp->b_ops = cfg->btree_ops->buf_ops;
+	xfs_btree_init_block_int(mp, bp->b_addr, daddr, cfg->btnum, 0, 0,
+			cfg->owner, bc_flags);
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		return error;
+
+	xfbt->xf_used++;
+	return 0;
+}
+
+/* Initialize the in-memory btree header block. */
+STATIC int
+xfbtree_init_head(
+	struct xfbtree		*xfbt)
+{
+	struct xfs_buf		*bp;
+	xfs_daddr_t		daddr;
+	int			error;
+
+	daddr = xfo_to_daddr(XFBTREE_HEAD_BLOCK);
+	error = xfs_buf_get(xfbt->target, daddr, xfbtree_bbsize(), &bp);
+	if (error)
+		return error;
+
+	xfs_btree_mem_head_init(bp, xfbt->owner, XFBTREE_INIT_LEAF_BLOCK);
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		return error;
+
+	xfbt->xf_used++;
+	return 0;
+}
+
+/* Create an xfile btree backing thing that can be used for in-memory btrees. */
+int
+xfbtree_create(
+	struct xfs_mount		*mp,
+	const struct xfbtree_config	*cfg,
+	struct xfbtree			**xfbtreep)
+{
+	struct xfbtree			*xfbt;
+	unsigned int			blocklen = xfbtree_rec_bytes(mp, cfg);
+	unsigned int			keyptr_len = cfg->btree_ops->key_len;
+	int				error;
+
+	/* Requires an xfile-backed buftarg. */
+	if (!(cfg->target->bt_flags & XFS_BUFTARG_IN_MEMORY)) {
+		ASSERT(cfg->target->bt_flags & XFS_BUFTARG_IN_MEMORY);
+		return -EINVAL;
+	}
+
+	xfbt = kzalloc(sizeof(struct xfbtree), XCHK_GFP_FLAGS);
+	if (!xfbt)
+		return -ENOMEM;
+
+	/* Assign our memory file and the free space bitmap. */
+	xfbt->target = cfg->target;
+	xfbt->freespace = kmalloc(sizeof(struct xbitmap), XCHK_GFP_FLAGS);
+	if (!xfbt->freespace) {
+		error = -ENOMEM;
+		goto err_buftarg;
+	}
+	xbitmap_init(xfbt->freespace);
+
+	/* Set up min/maxrecs for this btree. */
+	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
+		keyptr_len += sizeof(__be64);
+	else
+		keyptr_len += sizeof(__be32);
+	xfbt->maxrecs[0] = blocklen / cfg->btree_ops->rec_len;
+	xfbt->maxrecs[1] = blocklen / keyptr_len;
+	xfbt->minrecs[0] = xfbt->maxrecs[0] / 2;
+	xfbt->minrecs[1] = xfbt->maxrecs[1] / 2;
+	xfbt->owner = cfg->owner;
+
+	/* Initialize the empty btree. */
+	error = xfbtree_init_leaf_block(mp, xfbt, cfg);
+	if (error)
+		goto err_freesp;
+
+	error = xfbtree_init_head(xfbt);
+	if (error)
+		goto err_freesp;
+
+	trace_xfbtree_create(mp, cfg, xfbt);
+
+	*xfbtreep = xfbt;
+	return 0;
+
+err_freesp:
+	xbitmap_destroy(xfbt->freespace);
+	kfree(xfbt->freespace);
+err_buftarg:
+	xfs_buftarg_drain(xfbt->target);
+	kfree(xfbt);
+	return error;
+}
+
+/* Read the in-memory btree head. */
+int
+xfbtree_head_read_buf(
+	struct xfbtree		*xfbt,
+	struct xfs_trans	*tp,
+	struct xfs_buf		**bpp)
+{
+	struct xfs_buftarg	*btp = xfbt->target;
+	struct xfs_mount	*mp = btp->bt_mount;
+	struct xfs_btree_mem_head *mhead;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		daddr;
+	int			error;
+
+	daddr = xfo_to_daddr(XFBTREE_HEAD_BLOCK);
+	error = xfs_trans_read_buf(mp, tp, btp, daddr, xfbtree_bbsize(), 0,
+			&bp, &xfs_btree_mem_head_buf_ops);
+	if (error)
+		return error;
+
+	mhead = bp->b_addr;
+	if (be64_to_cpu(mhead->mh_owner) != xfbt->owner) {
+		xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
+		xfs_trans_brelse(tp, bp);
+		return -EFSCORRUPTED;
+	}
+
+	*bpp = bp;
+	return 0;
+}
+
+static inline struct xfile *xfbtree_xfile(struct xfbtree *xfbt)
+{
+	return xfbt->target->bt_xfile;
+}
+
+/* Allocate a block to our in-memory btree. */
+int
+xfbtree_alloc_block(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*start,
+	union xfs_btree_ptr		*new,
+	int				*stat)
+{
+	struct xfbtree			*xfbt = cur->bc_mem.xfbtree;
+	xfileoff_t			bt_xfoff;
+	loff_t				pos;
+	int				error;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	/*
+	 * Find the first free block in the free space bitmap and take it.  If
+	 * none are found, seek to end of the file.
+	 */
+	error = xbitmap_take_first_set(xfbt->freespace, 0, -1ULL, &bt_xfoff);
+	if (error == -ENODATA) {
+		bt_xfoff = xfbt->xf_used;
+		xfbt->xf_used++;
+	} else if (error) {
+		return error;
+	}
+
+	trace_xfbtree_alloc_block(xfbt, cur, bt_xfoff);
+
+	/* Fail if the block address exceeds the maximum for short pointers. */
+	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && bt_xfoff >= INT_MAX) {
+		*stat = 0;
+		return 0;
+	}
+
+	/* Make sure we actually can write to the block before we return it. */
+	pos = xfo_to_b(bt_xfoff);
+	error = xfile_prealloc(xfbtree_xfile(xfbt), pos, xfo_to_b(1));
+	if (error)
+		return error;
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		new->l = cpu_to_be64(bt_xfoff);
+	else
+		new->s = cpu_to_be32(bt_xfoff);
+
+	*stat = 1;
+	return 0;
+}
+
+/* Free a block from our in-memory btree. */
+int
+xfbtree_free_block(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+	xfileoff_t		bt_xfoff, bt_xflen;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	bt_xfoff = xfs_daddr_to_xfot(xfs_buf_daddr(bp));
+	bt_xflen = xfs_daddr_to_xfot(bp->b_length);
+
+	trace_xfbtree_free_block(xfbt, cur, bt_xfoff);
+
+	return xbitmap_set(xfbt->freespace, bt_xfoff, bt_xflen);
+}
+
+/* Return the minimum number of records for a btree block. */
+int
+xfbtree_get_minrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+
+	return xfbt->minrecs[level != 0];
+}
+
+/* Return the maximum number of records for a btree block. */
+int
+xfbtree_get_maxrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+
+	return xfbt->maxrecs[level != 0];
+}
+
+/* If this log item is a buffer item that came from the xfbtree, return it. */
+static inline struct xfs_buf *
+xfbtree_buf_match(
+	struct xfbtree			*xfbt,
+	const struct xfs_log_item	*lip)
+{
+	const struct xfs_buf_log_item	*bli;
+	struct xfs_buf			*bp;
+
+	if (lip->li_type != XFS_LI_BUF)
+		return NULL;
+
+	bli = container_of(lip, struct xfs_buf_log_item, bli_item);
+	bp = bli->bli_buf;
+	if (bp->b_target != xfbt->target)
+		return NULL;
+
+	return bp;
+}
+
+/*
+ * Detach this (probably dirty) xfbtree buffer from the transaction by any
+ * means necessary.  Returns true if the buffer needs to be written.
+ */
+STATIC bool
+xfbtree_trans_bdetach(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bli = bp->b_log_item;
+	bool			dirty;
+
+	ASSERT(bli != NULL);
+
+	dirty = bli->bli_flags & (XFS_BLI_DIRTY | XFS_BLI_ORDERED);
+
+	bli->bli_flags &= ~(XFS_BLI_DIRTY | XFS_BLI_ORDERED |
+			    XFS_BLI_LOGGED | XFS_BLI_STALE);
+	clear_bit(XFS_LI_DIRTY, &bli->bli_item.li_flags);
+
+	while (bp->b_log_item != NULL)
+		xfs_trans_bdetach(tp, bp);
+
+	return dirty;
+}
+
+/*
+ * Commit changes to the incore btree immediately by writing all dirty xfbtree
+ * buffers to the backing xfile.  This detaches all xfbtree buffers from the
+ * transaction, even on failure.  The buffer locks are dropped between the
+ * delwri queue and submit, so the caller must synchronize btree access.
+ *
+ * Normally we'd let the buffers commit with the transaction and get written to
+ * the xfile via the log, but online repair stages ephemeral btrees in memory
+ * and uses the btree_staging functions to write new btrees to disk atomically.
+ * The in-memory btree (and its backing store) are discarded at the end of the
+ * repair phase, which means that xfbtree buffers cannot commit with the rest
+ * of a transaction.
+ *
+ * In other words, online repair only needs the transaction to collect buffer
+ * pointers and to avoid buffer deadlocks, not to guarantee consistency of
+ * updates.
+ */
+int
+xfbtree_trans_commit(
+	struct xfbtree		*xfbt,
+	struct xfs_trans	*tp)
+{
+	LIST_HEAD(buffer_list);
+	struct xfs_log_item	*lip, *n;
+	bool			corrupt = false;
+	bool			tp_dirty = false;
+
+	/*
+	 * For each xfbtree buffer attached to the transaction, write the dirty
+	 * buffers to the xfile and release them.
+	 */
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		struct xfs_buf	*bp = xfbtree_buf_match(xfbt, lip);
+		bool		dirty;
+
+		if (!bp) {
+			if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
+				tp_dirty |= true;
+			continue;
+		}
+
+		trace_xfbtree_trans_commit_buf(xfbt, bp);
+
+		dirty = xfbtree_trans_bdetach(tp, bp);
+		if (dirty && !corrupt) {
+			xfs_failaddr_t	fa = bp->b_ops->verify_struct(bp);
+
+			/*
+			 * Because this btree is ephemeral, validate the buffer
+			 * structure before delwri_submit so that we can return
+			 * corruption errors to the caller without shutting
+			 * down the filesystem.
+			 *
+			 * If the buffer fails verification, log the failure
+			 * but continue walking the transaction items so that
+			 * we remove all ephemeral btree buffers.
+			 */
+			if (fa) {
+				corrupt = true;
+				xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+			} else {
+				xfs_buf_delwri_queue_here(bp, &buffer_list);
+			}
+		}
+
+		xfs_buf_relse(bp);
+	}
+
+	/*
+	 * Reset the transaction's dirty flag to reflect the dirty state of the
+	 * log items that are still attached.
+	 */
+	tp->t_flags = (tp->t_flags & ~XFS_TRANS_DIRTY) |
+			(tp_dirty ? XFS_TRANS_DIRTY : 0);
+
+	if (corrupt) {
+		xfs_buf_delwri_cancel(&buffer_list);
+		return -EFSCORRUPTED;
+	}
+
+	if (list_empty(&buffer_list))
+		return 0;
+
+	return xfs_buf_delwri_submit(&buffer_list);
+}
+
+/*
+ * Cancel changes to the incore btree by detaching all the xfbtree buffers.
+ * Changes are not written to the backing store.  This is needed for online
+ * repair btrees, which are by nature ephemeral.
+ */
+void
+xfbtree_trans_cancel(
+	struct xfbtree		*xfbt,
+	struct xfs_trans	*tp)
+{
+	struct xfs_log_item	*lip, *n;
+	bool			tp_dirty = false;
+
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		struct xfs_buf	*bp = xfbtree_buf_match(xfbt, lip);
+
+		if (!bp) {
+			if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
+				tp_dirty |= true;
+			continue;
+		}
+
+		trace_xfbtree_trans_cancel_buf(xfbt, bp);
+
+		xfbtree_trans_bdetach(tp, bp);
+		xfs_buf_relse(bp);
+	}
+
+	/*
+	 * Reset the transaction's dirty flag to reflect the dirty state of the
+	 * log items that are still attached.
+	 */
+	tp->t_flags = (tp->t_flags & ~XFS_TRANS_DIRTY) |
+			(tp_dirty ? XFS_TRANS_DIRTY : 0);
+}
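
For reviewers who want the calling convention in one place, here's a minimal
sketch (not taken from this series; the xfbtree-backed cursor and the insert
call are hypothetical stand-ins) of how a repair routine might stage an
update in an in-memory btree and then either flush it to the xfile or throw
it away:

	/* Stage one update in the shadow btree, then push it to the xfile. */
	STATIC int
	xrep_stage_shadow_update(
		struct xfs_scrub	*sc,
		struct xfbtree		*xfbt,
		struct xfs_btree_cur	*mcur)	/* cursor over the xfbtree */
	{
		int			stat;
		int			error;

		/* Hypothetical record insertion into the in-memory btree. */
		error = xfs_btree_insert(mcur, &stat);
		if (error) {
			/* Detach the dirty xfbtree buffers without writing. */
			xfbtree_trans_cancel(xfbt, sc->tp);
			return error;
		}

		/*
		 * Write the dirty xfbtree buffers to the backing xfile and
		 * detach them from the scrub transaction.  On failure the
		 * buffers have already been detached, so there's nothing
		 * more to unwind here.
		 */
		return xfbtree_trans_commit(xfbt, sc->tp);
	}
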
diff --git a/fs/xfs/scrub/xfbtree.h b/fs/xfs/scrub/xfbtree.h
index b3836f21085d..bdbf850bf7a1 100644
--- a/fs/xfs/scrub/xfbtree.h
+++ b/fs/xfs/scrub/xfbtree.h
@@ -22,13 +22,36 @@ struct xfs_btree_mem_head {
 /* xfile-backed in-memory btrees */
 
 struct xfbtree {
-	/* buffer cache target for this in-memory btree */
+	/* buffer cache target for the xfile backing this in-memory btree */
 	struct xfs_buftarg		*target;
 
+	/* Bitmap of free space from pos 0 to xf_used */
+	struct xbitmap			*freespace;
+
+	/* Number of xfile blocks actually used by this xfbtree. */
+	xfileoff_t			xf_used;
+
 	/* Owner of this btree. */
 	unsigned long long		owner;
+
+	/* Minimum and maximum records per block. */
+	unsigned int			maxrecs[2];
+	unsigned int			minrecs[2];
 };
 
+/* The head of the in-memory btree is always at block 0 */
+#define XFBTREE_HEAD_BLOCK		0
+
+/* in-memory btrees are always created with an empty leaf block at block 1 */
+#define XFBTREE_INIT_LEAF_BLOCK		1
+
+int xfbtree_head_read_buf(struct xfbtree *xfbt, struct xfs_trans *tp,
+		struct xfs_buf **bpp);
+
+void xfbtree_destroy(struct xfbtree *xfbt);
+int xfbtree_trans_commit(struct xfbtree *xfbt, struct xfs_trans *tp);
+void xfbtree_trans_cancel(struct xfbtree *xfbt, struct xfs_trans *tp);
+
 #endif /* CONFIG_XFS_IN_MEMORY_BTREE */
 
 #endif /* XFS_SCRUB_XFBTREE_H__ */
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index 1b858b6a53c8..b1cbf80f55d7 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -285,6 +285,89 @@ xfile_pwrite(
 	return error;
 }
 
+/* Discard pages backing a range of the xfile. */
+void
+xfile_discard(
+	struct xfile		*xf,
+	loff_t			pos,
+	u64			count)
+{
+	trace_xfile_discard(xf, pos, count);
+	shmem_truncate_range(file_inode(xf->file), pos, pos + count - 1);
+}
+
+/* Ensure that there is storage backing the given range. */
+int
+xfile_prealloc(
+	struct xfile		*xf,
+	loff_t			pos,
+	u64			count)
+{
+	struct inode		*inode = file_inode(xf->file);
+	struct address_space	*mapping = inode->i_mapping;
+	const struct address_space_operations *aops = mapping->a_ops;
+	struct page		*page = NULL;
+	unsigned int		pflags;
+	int			error = 0;
+
+	if (count > MAX_RW_COUNT)
+		return -E2BIG;
+	if (inode->i_sb->s_maxbytes - pos < count)
+		return -EFBIG;
+
+	trace_xfile_prealloc(xf, pos, count);
+
+	pflags = memalloc_nofs_save();
+	while (count > 0) {
+		void		*fsdata = NULL;
+		unsigned int	len;
+		int		ret;
+
+		len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
+
+		/*
+		 * We call write_begin directly here to avoid all the freezer
+		 * protection lock-taking that happens in the normal path.
+		 * shmem doesn't support fs freeze, but lockdep doesn't know
+		 * that and will trip over it.
+		 */
+		error = aops->write_begin(NULL, mapping, pos, len, &page,
+				&fsdata);
+		if (error)
+			break;
+
+		/*
+		 * xfile pages must never be mapped into userspace, so we skip
+		 * the dcache flush.  If the page is not uptodate, zero it to
+		 * ensure we never go lacking for space here.
+		 */
+		if (!PageUptodate(page)) {
+			void	*kaddr = kmap_local_page(page);
+
+			memset(kaddr, 0, PAGE_SIZE);
+			SetPageUptodate(page);
+			kunmap_local(kaddr);
+		}
+
+		ret = aops->write_end(NULL, mapping, pos, len, len, page,
+				fsdata);
+		if (ret < 0) {
+			error = ret;
+			break;
+		}
+		if (ret != len) {
+			error = -EIO;
+			break;
+		}
+
+		count -= len;
+		pos += len;
+	}
+	memalloc_nofs_restore(pflags);
+
+	return error;
+}
+
 /* Find the next written area in the xfile data for a given offset. */
 loff_t
 xfile_seek_data(
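
To illustrate how the two new calls are meant to pair up (a sketch under
assumed usage, not code from this series; the function names are made up),
a staging-memory user could pin down backing pages before doing work that
must not fail partway through, and punch them back out when finished:

	/* Pin down zeroed backing pages for a range of staged records. */
	static int
	example_stage_setup(struct xfile *xf, loff_t pos, u64 len)
	{
		return xfile_prealloc(xf, pos, len);
	}

	/* Release the staging memory once the records are no longer needed. */
	static void
	example_stage_teardown(struct xfile *xf, loff_t pos, u64 len)
	{
		xfile_discard(xf, pos, len);
	}

Because the preallocation goes through shmem's write_begin/write_end, any
allocation failure surfaces here rather than in a later xfile_obj_store().
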
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index c934e70f95e8..bf80bb796e83 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -64,6 +64,8 @@ xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t pos)
 	return 0;
 }
 
+void xfile_discard(struct xfile *xf, loff_t pos, u64 count);
+int xfile_prealloc(struct xfile *xf, loff_t pos, u64 count);
 loff_t xfile_seek_data(struct xfile *xf, loff_t pos);
 
 struct xfile_stat {
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index bf3b7c96f207..410db46e7935 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2066,8 +2066,12 @@ __xfs_alloc_buftarg(
 	xfs_km_flags_t		km_flags)
 {
 	struct xfs_buftarg	*btp;
+	gfp_t			gfp = GFP_KERNEL;
 	int			error;
 
+	if (km_flags & KM_MAYFAIL)
+		gfp |= __GFP_RETRY_MAYFAIL;
+
 	btp = kmem_zalloc(sizeof(*btp), KM_NOFS | km_flags);
 	if (!btp)
 		return NULL;
@@ -2085,7 +2089,7 @@ __xfs_alloc_buftarg(
 	if (list_lru_init(&btp->bt_lru))
 		goto error_free;
 
-	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+	if (percpu_counter_init(&btp->bt_io_count, 0, gfp))
 		goto error_lru;
 
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2d006bf0f9ce..d1620ea1c70f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -632,6 +632,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_trans_read_buf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_read_buf_recur);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_log_buf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_brelse);
+DEFINE_BUF_ITEM_EVENT(xfs_trans_bdetach);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_bjoin);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_bhold);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_bhold_release);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index ae587101e167..a43d6465b9d4 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -219,6 +219,7 @@ struct xfs_buf	*xfs_trans_getsb(struct xfs_trans *);
 
 void		xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_bjoin(xfs_trans_t *, struct xfs_buf *);
+void		xfs_trans_bdetach(struct xfs_trans *tp, struct xfs_buf *bp);
 void		xfs_trans_bhold(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_bhold_release(xfs_trans_t *, struct xfs_buf *);
 void		xfs_trans_binval(xfs_trans_t *, struct xfs_buf *);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 6549e50d852c..e28ab74af4f0 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -392,6 +392,48 @@ xfs_trans_brelse(
 	xfs_buf_relse(bp);
 }
 
+/*
+ * Forcibly detach a buffer previously joined to the transaction.  The caller
+ * will retain its locked reference to the buffer after this function returns.
+ * The buffer must be completely clean and must not be held to the transaction.
+ */
+void
+xfs_trans_bdetach(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+
+	ASSERT(tp != NULL);
+	ASSERT(bp->b_transp == tp);
+	ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
+	ASSERT(atomic_read(&bip->bli_refcount) > 0);
+
+	trace_xfs_trans_bdetach(bip);
+
+	/*
+	 * Erase the recursion count, since we're removing this buffer from the
+	 * transaction.
+	 */
+	bip->bli_recur = 0;
+
+	/*
+	 * The buffer must be completely clean.  Specifically, it had better
+	 * not be dirty, stale, logged, ordered, or held to the transaction.
+	 */
+	ASSERT(!test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags));
+	ASSERT(!(bip->bli_flags & XFS_BLI_DIRTY));
+	ASSERT(!(bip->bli_flags & XFS_BLI_HOLD));
+	ASSERT(!(bip->bli_flags & XFS_BLI_LOGGED));
+	ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED));
+	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
+
+	/* Unlink the log item from the transaction and drop the log item. */
+	xfs_trans_del_item(&bip->bli_item);
+	xfs_buf_item_put(bip);
+	bp->b_transp = NULL;
+}
+
 /*
  * Mark the buffer as not needing to be unlocked when the buf item's
  * iop_committing() routine is called.  The buffer must already be locked


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (16 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: repair the rmapbt Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
                   ` (59 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

We have now constructed the four tools that we need to scan the
filesystem looking for reverse mappings: an inode scanner, hooks to
receive live updates from other writer threads, the ability to construct
btrees in memory, and a btree bulk loader.

This series glues those four pieces together, enabling us to scan the
filesystem for mappings, keep the record set up to date while other
writers run, and then commit the new btree to disk atomically.

To reduce the size of each patch, the functionality is left disabled
until the end of the series and broken up into three patches: one to
create the mechanics of scanning the filesystem, a second to transition
to in-memory btrees, and a third to set up the live hooks.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-rmap-btree

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-rmap-btree
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_ag.c         |    1 
 fs/xfs/libxfs/xfs_ag.h         |    3 
 fs/xfs/libxfs/xfs_bmap.c       |   49 +
 fs/xfs/libxfs/xfs_bmap.h       |    8 
 fs/xfs/libxfs/xfs_inode_fork.c |    9 
 fs/xfs/libxfs/xfs_inode_fork.h |    1 
 fs/xfs/libxfs/xfs_rmap.c       |  192 +++--
 fs/xfs/libxfs/xfs_rmap.h       |   30 +
 fs/xfs/libxfs/xfs_rmap_btree.c |  136 +++
 fs/xfs/libxfs/xfs_rmap_btree.h |    9 
 fs/xfs/scrub/bitmap.c          |   14 
 fs/xfs/scrub/bitmap.h          |    5 
 fs/xfs/scrub/bmap.c            |    4 
 fs/xfs/scrub/common.c          |    7 
 fs/xfs/scrub/common.h          |    1 
 fs/xfs/scrub/newbt.c           |    5 
 fs/xfs/scrub/newbt.h           |    6 
 fs/xfs/scrub/reap.c            |    6 
 fs/xfs/scrub/repair.c          |   64 +-
 fs/xfs/scrub/repair.h          |   12 
 fs/xfs/scrub/rmap.c            |    9 
 fs/xfs/scrub/rmap_repair.c     | 1651 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c           |    6 
 fs/xfs/scrub/scrub.h           |    4 
 fs/xfs/scrub/trace.c           |    1 
 fs/xfs/scrub/trace.h           |   80 ++
 27 files changed, 2246 insertions(+), 68 deletions(-)
 create mode 100644 fs/xfs/scrub/rmap_repair.c


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs: create a helper to decide if a file mapping targets the rt volume
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: repair the rmapbt Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/4] xfs: hook live rmap operations during a repair operation Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: create a shadow rmap btree during rmap repair Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a helper so that we can stop open-coding this decision
everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c       |    6 +++---
 fs/xfs/libxfs/xfs_inode_fork.c |    9 +++++++++
 fs/xfs/libxfs/xfs_inode_fork.h |    1 +
 fs/xfs/scrub/bmap.c            |    4 ++--
 4 files changed, 15 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b658373bedc7..89cbd9b563ff 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4912,7 +4912,7 @@ xfs_bmap_del_extent_delay(
 
 	XFS_STATS_INC(mp, xs_del_exlist);
 
-	isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);
+	isrt = xfs_ifork_is_realtime(ip, whichfork);
 	del_endoff = del->br_startoff + del->br_blockcount;
 	got_endoff = got->br_startoff + got->br_blockcount;
 	da_old = startblockval(got->br_startblock);
@@ -5152,7 +5152,7 @@ xfs_bmap_del_extent_real(
 		return -ENOSPC;
 
 	flags = XFS_ILOG_CORE;
-	if (whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip)) {
+	if (xfs_ifork_is_realtime(ip, whichfork)) {
 		xfs_filblks_t	len;
 		xfs_extlen_t	mod;
 
@@ -5417,7 +5417,7 @@ __xfs_bunmapi(
 		return 0;
 	}
 	XFS_STATS_INC(mp, xs_blk_unmap);
-	isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);
+	isrt = xfs_ifork_is_realtime(ip, whichfork);
 	end = start + len;
 
 	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 6d23add33de9..c2cc3c193ffc 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -787,3 +787,12 @@ xfs_iext_count_upgrade(
 
 	return 0;
 }
+
+/* Decide if a file mapping is on the realtime device or not. */
+bool
+xfs_ifork_is_realtime(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	return XFS_IS_REALTIME_INODE(ip) && whichfork != XFS_ATTR_FORK;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 36a9fe3420cd..c201d8ad5957 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -261,6 +261,7 @@ int xfs_iext_count_may_overflow(struct xfs_inode *ip, int whichfork,
 		int nr_to_add);
 int xfs_iext_count_upgrade(struct xfs_trans *tp, struct xfs_inode *ip,
 		uint nr_to_add);
+bool xfs_ifork_is_realtime(struct xfs_inode *ip, int whichfork);
 
 /* returns true if the fork has extents but they are not read in yet. */
 static inline bool xfs_need_iread_extents(struct xfs_ifork *ifp)
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 93fb88ca5f28..150b8c40b809 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -711,7 +711,7 @@ xchk_bmap_check_rmaps(
 		return 0;
 
 	/* Don't support realtime rmap checks yet. */
-	if (XFS_IS_REALTIME_INODE(sc->ip) && whichfork == XFS_DATA_FORK)
+	if (xfs_ifork_is_realtime(sc->ip, whichfork))
 		return 0;
 
 	ASSERT(xfs_ifork_ptr(sc->ip, whichfork) != NULL);
@@ -796,7 +796,7 @@ xchk_bmap(
 	if (!ifp)
 		goto out;
 
-	info.is_rt = whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip);
+	info.is_rt = xfs_ifork_is_realtime(ip, whichfork);
 	info.whichfork = whichfork;
 	info.is_shared = whichfork == XFS_DATA_FORK && xfs_is_reflink_inode(ip);
 	info.sc = sc;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs: repair the rmapbt
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/4] xfs: create a helper to decide if a file mapping targets the rt volume Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Rebuild the reverse mapping btree from all primary metadata.  This first
patch establishes the bare mechanics of finding records and putting
together a new ondisk tree; more complex pieces are needed to make it
work properly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_bmap.c       |   43 +
 fs/xfs/libxfs/xfs_bmap.h       |    8 
 fs/xfs/libxfs/xfs_rmap.c       |   22 -
 fs/xfs/libxfs/xfs_rmap.h       |    2 
 fs/xfs/libxfs/xfs_rmap_btree.c |   13 
 fs/xfs/scrub/bitmap.c          |   14 
 fs/xfs/scrub/bitmap.h          |    5 
 fs/xfs/scrub/common.c          |    4 
 fs/xfs/scrub/common.h          |    1 
 fs/xfs/scrub/newbt.c           |    5 
 fs/xfs/scrub/newbt.h           |    6 
 fs/xfs/scrub/reap.c            |    6 
 fs/xfs/scrub/repair.c          |    5 
 fs/xfs/scrub/repair.h          |    6 
 fs/xfs/scrub/rmap.c            |    9 
 fs/xfs/scrub/rmap_repair.c     | 1405 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c           |    2 
 fs/xfs/scrub/trace.h           |   33 +
 19 files changed, 1572 insertions(+), 18 deletions(-)
 create mode 100644 fs/xfs/scrub/rmap_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 7e1495465cec..78ea3a6a0f5b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -195,6 +195,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   reap.o \
 				   refcount_repair.o \
 				   repair.o \
+				   rmap_repair.o \
 				   xfbtree.o \
 				   )
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 89cbd9b563ff..89fd34fbae7a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6406,3 +6406,46 @@ xfs_bunmapi_range(
 out:
 	return error;
 }
+
+struct xfs_bmap_query_range {
+	xfs_bmap_query_range_fn	fn;
+	void			*priv;
+};
+
+/* Format btree record and pass to our callback. */
+STATIC int
+xfs_bmap_query_range_helper(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_rec	*rec,
+	void				*priv)
+{
+	struct xfs_bmap_query_range	*query = priv;
+	struct xfs_bmbt_irec		irec;
+	xfs_failaddr_t			fa;
+
+	xfs_bmbt_disk_get_all(&rec->bmbt, &irec);
+	fa = xfs_bmap_validate_extent(cur->bc_ino.ip, cur->bc_ino.whichfork,
+			&irec);
+	if (fa) {
+		xfs_btree_mark_sick(cur);
+		return xfs_bmap_complain_bad_rec(cur->bc_ino.ip,
+				cur->bc_ino.whichfork, fa, &irec);
+	}
+
+	return query->fn(cur, &irec, query->priv);
+}
+
+/* Find all bmaps. */
+int
+xfs_bmap_query_all(
+	struct xfs_btree_cur		*cur,
+	xfs_bmap_query_range_fn		fn,
+	void				*priv)
+{
+	struct xfs_bmap_query_range	query = {
+		.priv			= priv,
+		.fn			= fn,
+	};
+
+	return xfs_btree_query_all(cur, xfs_bmap_query_range_helper, &query);
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 1201ee024c1f..bbda4a77cb69 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -274,4 +274,12 @@ extern struct kmem_cache	*xfs_bmap_intent_cache;
 int __init xfs_bmap_intent_init_cache(void);
 void xfs_bmap_intent_destroy_cache(void);
 
+typedef int (*xfs_bmap_query_range_fn)(
+	struct xfs_btree_cur	*cur,
+	struct xfs_bmbt_irec	*rec,
+	void			*priv);
+
+int xfs_bmap_query_all(struct xfs_btree_cur *cur, xfs_bmap_query_range_fn fn,
+		void *priv);
+
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index a30602660669..16233bb5be7e 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -212,13 +212,12 @@ xfs_rmap_btrec_to_irec(
 			irec);
 }
 
-/* Simple checks for rmap records. */
-xfs_failaddr_t
-xfs_rmap_check_irec(
-	struct xfs_btree_cur		*cur,
+inline xfs_failaddr_t
+xfs_rmap_check_perag_irec(
+	struct xfs_perag		*pag,
 	const struct xfs_rmap_irec	*irec)
 {
-	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_mount		*mp = pag->pag_mount;
 	bool				is_inode;
 	bool				is_unwritten;
 	bool				is_bmbt;
@@ -233,8 +232,8 @@ xfs_rmap_check_irec(
 			return __this_address;
 	} else {
 		/* check for valid extent range, including overflow */
-		if (!xfs_verify_agbext(cur->bc_ag.pag, irec->rm_startblock,
-						       irec->rm_blockcount))
+		if (!xfs_verify_agbext(pag, irec->rm_startblock,
+					    irec->rm_blockcount))
 			return __this_address;
 	}
 
@@ -269,6 +268,15 @@ xfs_rmap_check_irec(
 	return NULL;
 }
 
+/* Simple checks for rmap records. */
+xfs_failaddr_t
+xfs_rmap_check_irec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*irec)
+{
+	return xfs_rmap_check_perag_irec(cur->bc_ag.pag, irec);
+}
+
 static inline int
 xfs_rmap_complain_bad_rec(
 	struct xfs_btree_cur		*cur,
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index ced605d69324..b7ad51055e13 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -195,6 +195,8 @@ int xfs_rmap_compare(const struct xfs_rmap_irec *a,
 union xfs_btree_rec;
 xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_rmap_irec *irec);
+xfs_failaddr_t xfs_rmap_check_perag_irec(struct xfs_perag *pag,
+		const struct xfs_rmap_irec *irec);
 xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur,
 		const struct xfs_rmap_irec *irec);
 
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 5583dbe43bb5..103e4c97badc 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -342,7 +342,18 @@ xfs_rmapbt_verify(
 
 	level = be16_to_cpu(block->bb_level);
 	if (pag && pag->pagf_init) {
-		if (level >= pag->pagf_levels[XFS_BTNUM_RMAPi])
+		unsigned int	maxlevel = pag->pagf_levels[XFS_BTNUM_RMAPi];
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+		/*
+		 * Online repair could be rewriting the rmap btree, so we'll
+		 * validate against the larger of the old and in-progress
+		 * tree heights while this is going on.
+		 */
+		maxlevel = max_t(unsigned int, maxlevel,
+				pag->pagf_alt_levels[XFS_BTNUM_RMAPi]);
+#endif
+		if (level >= maxlevel)
 			return __this_address;
 	} else if (level >= mp->m_rmap_maxlevels)
 		return __this_address;
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index c98f4c45414a..7aeeb42a809f 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -407,3 +407,17 @@ xbitmap_take_first_set(
 	*valp = val;
 	return 0;
 }
+
+/* Count the number of set regions in this bitmap. */
+uint64_t
+xbitmap_count_set_regions(
+	struct xbitmap		*bitmap)
+{
+	struct xbitmap_node	*bn;
+	uint64_t		nr = 0;
+
+	for_each_xbitmap_extent(bn, bitmap)
+		nr++;
+
+	return nr;
+}
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 1ebe1918bdb2..d59d5e76782c 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -31,6 +31,7 @@ int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn,
 
 bool xbitmap_empty(struct xbitmap *bitmap);
 bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len);
+uint64_t xbitmap_count_set_regions(struct xbitmap *bitmap);
 
 int xbitmap_take_first_set(struct xbitmap *bitmap, uint64_t start,
 		uint64_t last, uint64_t *valp);
@@ -86,6 +87,10 @@ static inline int xagb_bitmap_disunion(struct xagb_bitmap *bitmap,
 	return xbitmap_disunion(&bitmap->agbitmap, &sub->agbitmap);
 }
 
+static inline uint32_t xagb_bitmap_count_set_regions(struct xagb_bitmap *bitmap)
+{
+	return xbitmap_count_set_regions(&bitmap->agbitmap);
+}
 static inline uint32_t xagb_bitmap_hweight(struct xagb_bitmap *bitmap)
 {
 	return xbitmap_hweight(&bitmap->agbitmap);
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index d96355ca4175..c436d613521c 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -461,7 +461,7 @@ xchk_perag_read_headers(
  * Grab the AG headers for the attached perag structure and wait for pending
  * intents to drain.
  */
-static int
+int
 xchk_perag_lock(
 	struct xfs_scrub	*sc)
 {
@@ -705,7 +705,7 @@ xchk_trans_alloc(
 		return xfs_trans_alloc(sc->mp, &M_RES(sc->mp)->tr_itruncate,
 				resblks, 0, 0, &sc->tp);
 
-	return xfs_trans_alloc_empty(sc->mp, &sc->tp);
+	return xchk_trans_alloc_empty(sc);
 }
 
 /* Set us up with a transaction and an empty context. */
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 6992e3db5f11..9bdacce17d82 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -134,6 +134,7 @@ int xchk_setup_nlinks(struct xfs_scrub *sc);
 void xchk_ag_free(struct xfs_scrub *sc, struct xchk_ag *sa);
 int xchk_ag_init(struct xfs_scrub *sc, xfs_agnumber_t agno,
 		struct xchk_ag *sa);
+int xchk_perag_lock(struct xfs_scrub *sc);
 
 /*
  * Grab all AG resources, treating the inability to grab the perag structure as
diff --git a/fs/xfs/scrub/newbt.c b/fs/xfs/scrub/newbt.c
index efa73f676ad4..ebdfdf631be3 100644
--- a/fs/xfs/scrub/newbt.c
+++ b/fs/xfs/scrub/newbt.c
@@ -342,7 +342,10 @@ xrep_newbt_alloc_blocks(
 			.resv		= xnr->resv,
 		};
 
-		error = xfs_alloc_vextent(&args);
+		if (xnr->alloc_vextent)
+			error = xnr->alloc_vextent(sc, &args);
+		else
+			error = xfs_alloc_vextent(&args);
 		if (error)
 			return error;
 		if (args.fsbno == NULLFSBLOCK)
diff --git a/fs/xfs/scrub/newbt.h b/fs/xfs/scrub/newbt.h
index 96ae0dc3bc41..95fea3dd5211 100644
--- a/fs/xfs/scrub/newbt.h
+++ b/fs/xfs/scrub/newbt.h
@@ -6,6 +6,8 @@
 #ifndef __XFS_SCRUB_NEWBT_H__
 #define __XFS_SCRUB_NEWBT_H__
 
+struct xfs_alloc_arg;
+
 struct xrep_newbt_resv {
 	/* Link to list of extents that we've reserved. */
 	struct list_head	list;
@@ -28,6 +30,10 @@ struct xrep_newbt_resv {
 struct xrep_newbt {
 	struct xfs_scrub	*sc;
 
+	/* Custom allocation function, or NULL for xfs_alloc_vextent */
+	int			(*alloc_vextent)(struct xfs_scrub *sc,
+						 struct xfs_alloc_arg *args);
+
 	/* List of extents that we've reserved. */
 	struct list_head	resv_list;
 
diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 79d43b9448a3..ea7a274aa778 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -111,7 +111,7 @@ xreap_put_freelist(
 	int			error;
 
 	/* Make sure there's space on the freelist. */
-	error = xrep_fix_freelist(sc, true);
+	error = xrep_fix_freelist(sc, 0);
 	if (error)
 		return error;
 
@@ -363,10 +363,14 @@ xreap_agextent(
 		rs->force_roll = true;
 		break;
 	case XFS_AG_RESV_IGNORE:
+	case XFS_AG_RESV_RMAPBT:
 		/*
 		 * bnobt/cntbt blocks are counted as free space, so we pass
 		 * XFS_AG_RESV_IGNORE when reaping the old free space btree
 		 * blocks to avoid changing fdblocks.
+		 *
+		 * rmapbt blocks are also counted as free space, but they have
+		 * their own per-AG reservation type.
 		 */
 		error = __xfs_free_extent(sc->tp, sc->sa.pag, agbno, *aglenp,
 				rs->oinfo, rs->resv, true);
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 1862e05e398f..7c242fddac8a 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -392,7 +392,7 @@ xrep_calc_ag_resblks(
 int
 xrep_fix_freelist(
 	struct xfs_scrub	*sc,
-	bool			can_shrink)
+	int			alloc_flags)
 {
 	struct xfs_alloc_arg	args = {0};
 
@@ -402,8 +402,7 @@ xrep_fix_freelist(
 	args.alignment = 1;
 	args.pag = sc->sa.pag;
 
-	return xfs_alloc_fix_freelist(&args,
-			can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
+	return xfs_alloc_fix_freelist(&args, alloc_flags);
 }
 
 /*
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 5e3e6cfe3332..22e8e1ed2de2 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -41,7 +41,7 @@ struct xbitmap;
 struct xagb_bitmap;
 struct xfsb_bitmap;
 
-int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink);
+int xrep_fix_freelist(struct xfs_scrub *sc, int alloc_flags);
 
 struct xrep_find_ag_btree {
 	/* in: rmap owner of the btree we're looking for */
@@ -76,6 +76,7 @@ int xrep_ino_ensure_extent_count(struct xfs_scrub *sc, int whichfork,
 int xrep_reset_perag_resv(struct xfs_scrub *sc);
 int xrep_bmap(struct xfs_scrub *sc, int whichfork, bool allow_unwritten);
 int xrep_metadata_inode_forks(struct xfs_scrub *sc);
+int xrep_setup_ag_rmapbt(struct xfs_scrub *sc);
 
 /* Repair setup functions */
 int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
@@ -102,6 +103,7 @@ int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
+int xrep_rmapbt(struct xfs_scrub *sc);
 int xrep_refcountbt(struct xfs_scrub *sc);
 int xrep_inode(struct xfs_scrub *sc);
 int xrep_bmap_data(struct xfs_scrub *sc);
@@ -167,6 +169,7 @@ xrep_setup_nothing(
 	return 0;
 }
 #define xrep_setup_ag_allocbt		xrep_setup_nothing
+#define xrep_setup_ag_rmapbt		xrep_setup_nothing
 
 #define xrep_setup_inode(sc, imap)	((void)0)
 
@@ -185,6 +188,7 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 #define xrep_agi			xrep_notsupported
 #define xrep_allocbt			xrep_notsupported
 #define xrep_iallocbt			xrep_notsupported
+#define xrep_rmapbt			xrep_notsupported
 #define xrep_refcountbt			xrep_notsupported
 #define xrep_inode			xrep_notsupported
 #define xrep_bmap_data			xrep_notsupported
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 8f1fdae71766..2ed0b877056a 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -24,6 +24,7 @@
 #include "scrub/common.h"
 #include "scrub/btree.h"
 #include "scrub/bitmap.h"
+#include "scrub/repair.h"
 
 /*
  * Set us up to scrub reverse mapping btrees.
@@ -35,6 +36,14 @@ xchk_setup_ag_rmapbt(
 	if (xchk_need_fshook_drain(sc))
 		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
 
+	if (xchk_could_repair(sc)) {
+		int		error;
+
+		error = xrep_setup_ag_rmapbt(sc);
+		if (error)
+			return error;
+	}
+
 	return xchk_setup_ag_btree(sc, false);
 }
 
diff --git a/fs/xfs/scrub/rmap_repair.c b/fs/xfs/scrub/rmap_repair.c
new file mode 100644
index 000000000000..952dae473d4d
--- /dev/null
+++ b/fs/xfs/scrub/rmap_repair.c
@@ -0,0 +1,1405 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_btree_staging.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_ag.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/iscan.h"
+#include "scrub/newbt.h"
+#include "scrub/reap.h"
+
+/*
+ * Reverse Mapping Btree Repair
+ * ============================
+ *
+ * This is the most involved of all the AG space btree rebuilds.  Everywhere
+ * else in XFS we lock inodes and then AG data structures, but generating the
+ * list of rmap records requires that we be able to scan both block mapping
+ * btrees of every inode in the filesystem to see if it owns any extents in
+ * this AG.  We can't tolerate any inode updates while we do this, so we
+ * freeze the filesystem to lock everyone else out, and grant ourselves
+ * special privileges to run transactions with regular background reclamation
+ * turned off.
+ *
+ * We also have to be very careful not to allow inode reclaim to start a
+ * transaction because all transactions (other than our own) will block.
+ * Deferred inode inactivation helps us out there.
+ *
+ * I) Reverse mappings for all non-space metadata and file data are collected
+ * according to the following algorithm:
+ *
+ * 1. For each fork of each inode:
+ * 1.1. Create a bitmap BMBIT to track bmbt blocks if necessary.
+ * 1.2. If the incore extent map isn't loaded, walk the bmbt to accumulate
+ *      bmaps into rmap records (see 1.4).  Set bits in BMBIT for each btree
+ *      block.
+ * 1.3. If the incore extent map is loaded but the fork is in btree format,
+ *      just visit the bmbt blocks to set the corresponding BMBIT areas.
+ * 1.4. From the incore extent map, accumulate each bmap that falls into our
+ *      target AG.  Remember, multiple bmap records can map to a single rmap
+ *      record, so we cannot simply emit rmap records 1:1.
+ * 1.5. Emit rmap records for each extent in BMBIT and free it.
+ * 2. Create bitmaps INOBIT and ICHUNKBIT.
+ * 3. For each record in the inobt, set the corresponding areas in ICHUNKBIT,
+ *    and set bits in INOBIT for each btree block.  If the inobt has no records
+ *    at all, we must be careful to record its root in INOBIT.
+ * 4. For each block in the finobt, set the corresponding INOBIT area.
+ * 5. Emit rmap records for each extent in INOBIT and ICHUNKBIT and free them.
+ * 6. Create bitmaps REFCBIT and COWBIT.
+ * 7. For each CoW staging extent in the refcountbt, set the corresponding
+ *    areas in COWBIT.
+ * 8. For each block in the refcountbt, set the corresponding REFCBIT area.
+ * 9. Emit rmap records for each extent in REFCBIT and COWBIT and free them.
+ * A. Emit rmap for the AG headers.
+ * B. Emit rmap for the log, if there is one.
+ *
+ * II) The rmapbt shape and space metadata rmaps are computed as follows:
+ *
+ * 1. Count the rmaps collected in the previous step. (= NR)
+ * 2. Estimate the number of rmapbt blocks needed to store NR records. (= RMB)
+ * 3. Reserve RMB blocks through the newbt using the allocator in normap mode.
+ * 4. Create bitmap AGBIT.
+ * 5. For each reservation in the newbt, set the corresponding areas in AGBIT.
+ * 6. For each block in the AGFL, bnobt, and cntbt, set the bits in AGBIT.
+ * 7. Count the extents in AGBIT. (= AGNR)
+ * 8. Estimate the number of rmapbt blocks needed for NR + AGNR rmaps. (= RMB')
+ * 9. If RMB' > RMB, reserve RMB' - RMB more newbt blocks, set RMB = RMB',
+ *    and clear AGBIT.  Go to step 5.
+ * A. Emit rmaps for each extent in AGBIT.
+ *
+ * III) The rmapbt is constructed and set in place as follows:
+ *
+ * 1. Sort the rmap records.
+ * 2. Bulk load the rmaps.
+ *
+ * IV) Reap the old btree blocks.
+ *
+ * 1. Create a bitmap OLDRMBIT.
+ * 2. For each gap in the new rmapbt, set the corresponding areas of OLDRMBIT.
+ * 3. For each extent in the bnobt, clear the corresponding parts of OLDRMBIT.
+ * 4. Reap the extents corresponding to the set areas in OLDRMBIT.  These are
+ *    the parts of the AG that the rmap didn't find during its scan of the
+ *    primary metadata and aren't known to be in the free space, which implies
+ *    that they were the old rmapbt blocks.
+ * 5. Commit.
+ *
+ * We use the 'xrep_rmap' prefix for all the rmap functions.
+ */
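/*
 * An illustrative pass through the section (II) loop, with made-up numbers:
 * suppose step (I) stashed NR = 10,000 records and the bload geometry
 * computation asks for RMB = 18 blocks.  Reserving those 18 blocks can
 * reshape the bnobt/cntbt and the AGFL, so step 7 might count AGNR = 25
 * OWN_AG extents; recomputing the geometry for 10,025 records could then
 * ask for RMB' = 19 blocks.  We reserve the one extra block and loop; the
 * next pass typically converges with RMB' == RMB, at which point the OWN_AG
 * rmaps are emitted and we move on to the bulk load in section (III).
 */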
+
+/* Set us up to repair reverse mapping btrees. */
+int
+xrep_setup_ag_rmapbt(
+	struct xfs_scrub	*sc)
+{
+	/* For now this is a placeholder until we land other pieces. */
+	return 0;
+}
+
+/*
+ * Packed rmap record.  The ATTR/BMBT/UNWRITTEN flags are hidden in the upper
+ * bits of offset, just like the on-disk record.
+ */
+struct xrep_rmap_extent {
+	xfs_agblock_t	startblock;
+	xfs_extlen_t	blockcount;
+	uint64_t	owner;
+	uint64_t	offset;
+} __packed;
+
+/* Context for collecting rmaps */
+struct xrep_rmap {
+	/* new rmapbt information */
+	struct xrep_newbt	new_btree;
+
+	/* rmap records generated from primary metadata */
+	struct xfarray		*rmap_records;
+
+	struct xfs_scrub	*sc;
+
+	/* get_records()'s position in the rmap record array. */
+	xfarray_idx_t		array_cur;
+
+	/* inode scan cursor */
+	struct xchk_iscan	iscan;
+
+	/* bnobt/cntbt contribution to btreeblks */
+	xfs_agblock_t		freesp_btblocks;
+
+	/* old agf_rmap_blocks counter */
+	unsigned int		old_rmapbt_fsbcount;
+};
+
+/* Make sure there's nothing funny about this mapping. */
+STATIC int
+xrep_rmap_check_mapping(
+	struct xfs_scrub	*sc,
+	const struct xfs_rmap_irec *rec)
+{
+	enum xbtree_recpacking	outcome;
+	int			error;
+
+	if (xfs_rmap_check_perag_irec(sc->sa.pag, rec) != NULL)
+		return -EFSCORRUPTED;
+
+	/* Make sure this isn't free space. */
+	error = xfs_alloc_has_records(sc->sa.bno_cur, rec->rm_startblock,
+			rec->rm_blockcount, &outcome);
+	if (error)
+		return error;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Store a reverse-mapping record. */
+static inline int
+xrep_rmap_stash(
+	struct xrep_rmap	*rr,
+	xfs_agblock_t		startblock,
+	xfs_extlen_t		blockcount,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags)
+{
+	struct xrep_rmap_extent	rre = {
+		.startblock	= startblock,
+		.blockcount	= blockcount,
+		.owner		= owner,
+	};
+	struct xfs_rmap_irec	rmap = {
+		.rm_startblock	= startblock,
+		.rm_blockcount	= blockcount,
+		.rm_owner	= owner,
+		.rm_offset	= offset,
+		.rm_flags	= flags,
+	};
+	struct xfs_scrub	*sc = rr->sc;
+	int			error = 0;
+
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	trace_xrep_rmap_found(sc->mp, sc->sa.pag->pag_agno, &rmap);
+
+	rre.offset = xfs_rmap_irec_offset_pack(&rmap);
+	return xfarray_append(rr->rmap_records, &rre);
+}
+
+struct xrep_rmap_stash_run {
+	struct xrep_rmap	*rr;
+	uint64_t		owner;
+	unsigned int		rmap_flags;
+};
+
+static int
+xrep_rmap_stash_run(
+	uint64_t			start,
+	uint64_t			len,
+	void				*priv)
+{
+	struct xrep_rmap_stash_run	*rsr = priv;
+	struct xrep_rmap		*rr = rsr->rr;
+
+	return xrep_rmap_stash(rr, start, len, rsr->owner, 0, rsr->rmap_flags);
+}
+
+/*
+ * Emit rmaps for every extent of bits set in the bitmap.  Caller must ensure
+ * that the ranges are in units of FS blocks.
+ */
+STATIC int
+xrep_rmap_stash_bitmap(
+	struct xrep_rmap		*rr,
+	struct xagb_bitmap		*bitmap,
+	const struct xfs_owner_info	*oinfo)
+{
+	struct xrep_rmap_stash_run	rsr = {
+		.rr			= rr,
+		.owner			= oinfo->oi_owner,
+		.rmap_flags		= 0,
+	};
+
+	if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK)
+		rsr.rmap_flags |= XFS_RMAP_ATTR_FORK;
+	if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
+		rsr.rmap_flags |= XFS_RMAP_BMBT_BLOCK;
+
+	return xagb_bitmap_walk(bitmap, xrep_rmap_stash_run, &rsr);
+}
+
+/* Section (I): Finding all file and bmbt extents. */
+
+/* Context for accumulating rmaps for an inode fork. */
+struct xrep_rmap_ifork {
+	/*
+	 * Accumulate rmap data here to turn multiple adjacent bmaps into a
+	 * single rmap.
+	 */
+	struct xfs_rmap_irec	accum;
+
+	/* Bitmap of bmbt blocks in this AG. */
+	struct xagb_bitmap	bmbt_blocks;
+
+	struct xrep_rmap	*rr;
+
+	/* Which inode fork? */
+	int			whichfork;
+};
+
+/* Stash an rmap that we accumulated while walking an inode fork. */
+STATIC int
+xrep_rmap_stash_accumulated(
+	struct xrep_rmap_ifork	*rf)
+{
+	if (rf->accum.rm_blockcount == 0)
+		return 0;
+
+	return xrep_rmap_stash(rf->rr, rf->accum.rm_startblock,
+			rf->accum.rm_blockcount, rf->accum.rm_owner,
+			rf->accum.rm_offset, rf->accum.rm_flags);
+}
+
+/* Accumulate a bmbt record. */
+STATIC int
+xrep_rmap_visit_bmbt(
+	struct xfs_btree_cur	*cur,
+	struct xfs_bmbt_irec	*rec,
+	void			*priv)
+{
+	struct xrep_rmap_ifork	*rf = priv;
+	struct xfs_mount	*mp = rf->rr->sc->mp;
+	struct xfs_rmap_irec	*accum = &rf->accum;
+	xfs_agblock_t		agbno;
+	unsigned int		rmap_flags = 0;
+	int			error;
+
+	if (XFS_FSB_TO_AGNO(mp, rec->br_startblock) !=
+			rf->rr->sc->sa.pag->pag_agno)
+		return 0;
+
+	agbno = XFS_FSB_TO_AGBNO(mp, rec->br_startblock);
+	if (rf->whichfork == XFS_ATTR_FORK)
+		rmap_flags |= XFS_RMAP_ATTR_FORK;
+	if (rec->br_state == XFS_EXT_UNWRITTEN)
+		rmap_flags |= XFS_RMAP_UNWRITTEN;
+
+	/* If this bmap is adjacent to the previous one, just add it. */
+	if (accum->rm_blockcount > 0 &&
+	    rec->br_startoff == accum->rm_offset + accum->rm_blockcount &&
+	    agbno == accum->rm_startblock + accum->rm_blockcount &&
+	    rmap_flags == accum->rm_flags) {
+		accum->rm_blockcount += rec->br_blockcount;
+		return 0;
+	}
+
+	/* Otherwise stash the old rmap and start accumulating a new one. */
+	error = xrep_rmap_stash_accumulated(rf);
+	if (error)
+		return error;
+
+	accum->rm_startblock = agbno;
+	accum->rm_blockcount = rec->br_blockcount;
+	accum->rm_offset = rec->br_startoff;
+	accum->rm_flags = rmap_flags;
+	return 0;
+}
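/*
 * To make the merge case above concrete: a bmbt record can describe at most
 * XFS_MAX_BMBT_EXTLEN (2^21 - 1) blocks, so a physically and logically
 * contiguous three-million-block mapping is stored as two bmbt records.
 * Both pieces carry the same owner and flags and are adjacent in both
 * startblock and offset, so they must be combined into one rmap record
 * instead of being emitted 1:1.
 */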
+
+/* Add a btree block to the bitmap. */
+STATIC int
+xrep_rmap_visit_iroot_btree_block(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	void			*priv)
+{
+	struct xrep_rmap_ifork	*rf = priv;
+	struct xfs_buf		*bp;
+	xfs_fsblock_t		fsbno;
+	xfs_agblock_t		agbno;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
+	if (XFS_FSB_TO_AGNO(cur->bc_mp, fsbno) != rf->rr->sc->sa.pag->pag_agno)
+		return 0;
+
+	agbno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+	return xagb_bitmap_set(&rf->bmbt_blocks, agbno, 1);
+}
+
+/*
+ * Iterate a metadata btree rooted in an inode to collect rmap records for
+ * anything in this fork that matches the AG.
+ */
+STATIC int
+xrep_rmap_scan_iroot_btree(
+	struct xrep_rmap_ifork	*rf,
+	struct xfs_btree_cur	*cur)
+{
+	struct xfs_owner_info	oinfo;
+	struct xrep_rmap	*rr = rf->rr;
+	int			error;
+
+	xagb_bitmap_init(&rf->bmbt_blocks);
+
+	/* Record all the blocks in the btree itself. */
+	error = xfs_btree_visit_blocks(cur, xrep_rmap_visit_iroot_btree_block,
+			XFS_BTREE_VISIT_ALL, rf);
+	if (error)
+		goto out;
+
+	/* Emit rmaps for the btree blocks. */
+	xfs_rmap_ino_bmbt_owner(&oinfo, rf->accum.rm_owner, rf->whichfork);
+	error = xrep_rmap_stash_bitmap(rr, &rf->bmbt_blocks, &oinfo);
+	if (error)
+		goto out;
+
+	/* Stash any remaining accumulated rmaps. */
+	error = xrep_rmap_stash_accumulated(rf);
+out:
+	xagb_bitmap_destroy(&rf->bmbt_blocks);
+	return error;
+}
+
+static inline bool
+is_rt_data_fork(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	return XFS_IS_REALTIME_INODE(ip) && whichfork == XFS_DATA_FORK;
+}
+
+/*
+ * Iterate the block mapping btree to collect rmap records for anything in this
+ * fork that matches the AG.  Sets @mappings_done to true if we've scanned the
+ * block mappings in this fork.
+ */
+STATIC int
+xrep_rmap_scan_bmbt(
+	struct xrep_rmap_ifork	*rf,
+	struct xfs_inode	*ip,
+	bool			*mappings_done)
+{
+	struct xrep_rmap	*rr = rf->rr;
+	struct xfs_btree_cur	*cur;
+	struct xfs_ifork	*ifp;
+	int			error;
+
+	*mappings_done = false;
+	ifp = xfs_ifork_ptr(ip, rf->whichfork);
+	cur = xfs_bmbt_init_cursor(rr->sc->mp, rr->sc->tp, ip, rf->whichfork);
+
+	if (!xfs_ifork_is_realtime(ip, rf->whichfork) &&
+	    xfs_need_iread_extents(ifp)) {
+		/*
+		 * If the incore extent cache isn't loaded, scan the bmbt for
+		 * mapping records.  This avoids loading the incore extent
+		 * tree, which will increase memory pressure at a time when
+		 * we're trying to run as quickly as we possibly can.  Ignore
+		 * realtime extents.
+		 */
+		error = xfs_bmap_query_all(cur, xrep_rmap_visit_bmbt, rf);
+		if (error)
+			goto out_cur;
+
+		*mappings_done = true;
+	}
+
+	/* Scan for the bmbt blocks, which always live on the data device. */
+	error = xrep_rmap_scan_iroot_btree(rf, cur);
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/*
+ * Iterate the in-core extent cache to collect rmap records for anything in
+ * this fork that matches the AG.
+ */
+STATIC int
+xrep_rmap_scan_iext(
+	struct xrep_rmap_ifork	*rf,
+	struct xfs_ifork	*ifp)
+{
+	struct xfs_bmbt_irec	rec;
+	struct xfs_iext_cursor	icur;
+	int			error;
+
+	for_each_xfs_iext(ifp, &icur, &rec) {
+		if (isnullstartblock(rec.br_startblock))
+			continue;
+		error = xrep_rmap_visit_bmbt(NULL, &rec, rf);
+		if (error)
+			return error;
+	}
+
+	return xrep_rmap_stash_accumulated(rf);
+}
+
+/* Find all the extents from a given AG in an inode fork. */
+STATIC int
+xrep_rmap_scan_ifork(
+	struct xrep_rmap	*rr,
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	struct xrep_rmap_ifork	rf = {
+		.accum		= { .rm_owner = ip->i_ino, },
+		.rr		= rr,
+		.whichfork	= whichfork,
+	};
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, whichfork);
+	int			error = 0;
+
+	if (!ifp)
+		return 0;
+
+	if (ifp->if_format == XFS_DINODE_FMT_BTREE) {
+		bool		mappings_done;
+
+		/*
+		 * Scan the bmap btree for data device mappings.  This includes
+		 * the btree blocks themselves, even if this is a realtime
+		 * file.
+		 */
+		error = xrep_rmap_scan_bmbt(&rf, ip, &mappings_done);
+		if (error || mappings_done)
+			return error;
+	} else if (ifp->if_format != XFS_DINODE_FMT_EXTENTS) {
+		return 0;
+	}
+
+	/* Scan incore extent cache if this isn't a realtime file. */
+	if (xfs_ifork_is_realtime(ip, whichfork))
+		return 0;
+
+	return xrep_rmap_scan_iext(&rf, ifp);
+}
+
+/* Record reverse mappings for a file. */
+STATIC int
+xrep_rmap_scan_inode(
+	struct xrep_rmap	*rr,
+	struct xfs_inode	*ip)
+{
+	unsigned int		lock_mode;
+	int			error;
+
+	xfs_ilock(ip, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED);
+	lock_mode = xfs_ilock_data_map_shared(ip);
+
+	/* Check the data fork. */
+	error = xrep_rmap_scan_ifork(rr, ip, XFS_DATA_FORK);
+	if (error)
+		goto out_unlock;
+
+	/* Check the attr fork. */
+	error = xrep_rmap_scan_ifork(rr, ip, XFS_ATTR_FORK);
+	if (error)
+		goto out_unlock;
+
+	/* COW fork extents are "owned" by the refcount btree. */
+
+	xchk_iscan_mark_visited(&rr->iscan, ip);
+out_unlock:
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | lock_mode);
+	return error;
+}
+
+/* Section (I): Find all AG metadata extents except for free space metadata. */
+
+struct xrep_rmap_inodes {
+	struct xrep_rmap	*rr;
+	struct xagb_bitmap	inobt_blocks;	/* INOBIT */
+	struct xagb_bitmap	ichunk_blocks;	/* ICHUNKBIT */
+};
+
+/* Record inode btree rmaps. */
+STATIC int
+xrep_rmap_walk_inobt(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_rec	*rec,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	irec;
+	struct xrep_rmap_inodes		*ri = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	xfs_agblock_t			agbno;
+	xfs_agino_t			agino;
+	xfs_agino_t			iperhole;
+	unsigned int			i;
+	int				error;
+
+	/* Record the inobt blocks. */
+	error = xagb_bitmap_set_btcur_path(&ri->inobt_blocks, cur);
+	if (error)
+		return error;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+	if (xfs_inobt_check_irec(cur, &irec) != NULL)
+		return -EFSCORRUPTED;
+
+	agino = irec.ir_startino;
+
+	/* Record a non-sparse inode chunk. */
+	if (!xfs_inobt_issparse(irec.ir_holemask)) {
+		agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+
+		return xagb_bitmap_set(&ri->ichunk_blocks, agbno,
+				XFS_INODES_PER_CHUNK / mp->m_sb.sb_inopblock);
+	}
+
+	/* Iterate each chunk. */
+	iperhole = max_t(xfs_agino_t, mp->m_sb.sb_inopblock,
+			XFS_INODES_PER_HOLEMASK_BIT);
+	for (i = 0, agino = irec.ir_startino;
+	     i < XFS_INOBT_HOLEMASK_BITS;
+	     i += iperhole / XFS_INODES_PER_HOLEMASK_BIT, agino += iperhole) {
+		/* Skip holes. */
+		if (irec.ir_holemask & (1 << i))
+			continue;
+
+		/* Record the inode chunk otherwise. */
+		agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+		error = xagb_bitmap_set(&ri->ichunk_blocks, agbno,
+				iperhole / mp->m_sb.sb_inopblock);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
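/*
 * Worked example for the sparse-chunk loop above, assuming 4k blocks and
 * 512-byte inodes (sb_inopblock = 8): XFS_INODES_PER_HOLEMASK_BIT is
 * 64 / 16 = 4, so iperhole = max(8, 4) = 8 inodes.  Each pass advances i by
 * 8 / 4 = 2 holemask bits and agino by 8 inodes, and records
 * iperhole / sb_inopblock = 1 block whenever the corresponding holemask bit
 * is clear.  The sixteen holemask bits are therefore covered in eight
 * one-block passes.
 */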
+
+/* Collect rmaps for the blocks containing inode btrees and the inode chunks. */
+STATIC int
+xrep_rmap_find_inode_rmaps(
+	struct xrep_rmap	*rr)
+{
+	struct xrep_rmap_inodes	ri = {
+		.rr		= rr,
+	};
+	struct xfs_scrub	*sc = rr->sc;
+	int			error;
+
+	xagb_bitmap_init(&ri.inobt_blocks);
+	xagb_bitmap_init(&ri.ichunk_blocks);
+
+	/*
+	 * Iterate every record in the inobt so we can capture all the inode
+	 * chunks and the blocks in the inobt itself.
+	 */
+	error = xfs_btree_query_all(sc->sa.ino_cur, xrep_rmap_walk_inobt, &ri);
+	if (error)
+		goto out_bitmap;
+
+	/*
+	 * Note that if there are zero records in the inobt then query_all does
+	 * nothing and we have to account for the empty inobt root manually.
+	 */
+	if (xagb_bitmap_empty(&ri.ichunk_blocks)) {
+		struct xfs_agi	*agi = sc->sa.agi_bp->b_addr;
+
+		error = xagb_bitmap_set(&ri.inobt_blocks,
+				be32_to_cpu(agi->agi_root), 1);
+		if (error)
+			goto out_bitmap;
+	}
+
+	/* Scan the finobt too. */
+	if (xfs_has_finobt(sc->mp)) {
+		error = xagb_bitmap_set_btblocks(&ri.inobt_blocks,
+				sc->sa.fino_cur);
+		if (error)
+			goto out_bitmap;
+	}
+
+	/* Generate rmaps for everything. */
+	error = xrep_rmap_stash_bitmap(rr, &ri.inobt_blocks,
+			&XFS_RMAP_OINFO_INOBT);
+	if (error)
+		goto out_bitmap;
+	error = xrep_rmap_stash_bitmap(rr, &ri.ichunk_blocks,
+			&XFS_RMAP_OINFO_INODES);
+
+out_bitmap:
+	xagb_bitmap_destroy(&ri.inobt_blocks);
+	xagb_bitmap_destroy(&ri.ichunk_blocks);
+	return error;
+}
+
+/* Record a CoW staging extent. */
+STATIC int
+xrep_rmap_walk_cowblocks(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*irec,
+	void				*priv)
+{
+	struct xagb_bitmap		*bitmap = priv;
+
+	if (!xfs_refcount_check_domain(irec) ||
+	    irec->rc_domain != XFS_REFC_DOMAIN_COW)
+		return -EFSCORRUPTED;
+
+	return xagb_bitmap_set(bitmap, irec->rc_startblock, irec->rc_blockcount);
+}
+
+/*
+ * Collect rmaps for the blocks containing the refcount btree, and all CoW
+ * staging extents.
+ */
+STATIC int
+xrep_rmap_find_refcount_rmaps(
+	struct xrep_rmap	*rr)
+{
+	struct xagb_bitmap	refcountbt_blocks;	/* REFCBIT */
+	struct xagb_bitmap	cow_blocks;		/* COWBIT */
+	struct xfs_refcount_irec low = {
+		.rc_startblock	= 0,
+		.rc_domain	= XFS_REFC_DOMAIN_COW,
+	};
+	struct xfs_refcount_irec high = {
+		.rc_startblock	= -1U,
+		.rc_domain	= XFS_REFC_DOMAIN_COW,
+	};
+	struct xfs_scrub	*sc = rr->sc;
+	int			error;
+
+	if (!xfs_has_reflink(sc->mp))
+		return 0;
+
+	xagb_bitmap_init(&refcountbt_blocks);
+	xagb_bitmap_init(&cow_blocks);
+
+	/* refcountbt */
+	error = xagb_bitmap_set_btblocks(&refcountbt_blocks, sc->sa.refc_cur);
+	if (error)
+		goto out_bitmap;
+
+	/* Collect rmaps for CoW staging extents. */
+	error = xfs_refcount_query_range(sc->sa.refc_cur, &low, &high,
+			xrep_rmap_walk_cowblocks, &cow_blocks);
+	if (error)
+		goto out_bitmap;
+
+	/* Generate rmaps for everything. */
+	error = xrep_rmap_stash_bitmap(rr, &cow_blocks, &XFS_RMAP_OINFO_COW);
+	if (error)
+		goto out_bitmap;
+	error = xrep_rmap_stash_bitmap(rr, &refcountbt_blocks,
+			&XFS_RMAP_OINFO_REFC);
+
+out_bitmap:
+	xagb_bitmap_destroy(&cow_blocks);
+	xagb_bitmap_destroy(&refcountbt_blocks);
+	return error;
+}
+
+/* Generate rmaps for the AG headers (SB/AGF/AGI/AGFL) */
+STATIC int
+xrep_rmap_find_agheader_rmaps(
+	struct xrep_rmap	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+
+	/* Create a record for the AG sb->agfl. */
+	return xrep_rmap_stash(rr, XFS_SB_BLOCK(sc->mp),
+			XFS_AGFL_BLOCK(sc->mp) - XFS_SB_BLOCK(sc->mp) + 1,
+			XFS_RMAP_OWN_FS, 0, 0);
+}
+
+/* Generate rmaps for the log, if it's in this AG. */
+STATIC int
+xrep_rmap_find_log_rmaps(
+	struct xrep_rmap	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+
+	if (!xfs_ag_contains_log(sc->mp, sc->sa.pag->pag_agno))
+		return 0;
+
+	return xrep_rmap_stash(rr,
+			XFS_FSB_TO_AGBNO(sc->mp, sc->mp->m_sb.sb_logstart),
+			sc->mp->m_sb.sb_logblocks, XFS_RMAP_OWN_LOG, 0, 0);
+}
+
+/*
+ * Generate all the reverse-mappings for this AG, a list of the old rmapbt
+ * blocks, and the new btreeblks count.  Figure out if we have enough free
+ * space to reconstruct the inode btrees.  The caller must clean up the lists
+ * if anything goes wrong.  This implements section (I) above.
+ */
+STATIC int
+xrep_rmap_find_rmaps(
+	struct xrep_rmap	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+	struct xchk_ag		*sa = &sc->sa;
+	struct xfs_inode	*ip;
+	int			error;
+
+	/* Find all the per-AG metadata. */
+	xrep_ag_btcur_init(sc, &sc->sa);
+
+	error = xrep_rmap_find_inode_rmaps(rr);
+	if (error)
+		goto end_agscan;
+
+	error = xrep_rmap_find_refcount_rmaps(rr);
+	if (error)
+		goto end_agscan;
+
+	error = xrep_rmap_find_agheader_rmaps(rr);
+	if (error)
+		goto end_agscan;
+
+	error = xrep_rmap_find_log_rmaps(rr);
+end_agscan:
+	xchk_ag_btcur_free(&sc->sa);
+	if (error)
+		return error;
+
+	/*
+	 * Set up for a potentially lengthy filesystem scan by reducing our
+	 * transaction resource usage for the duration.  Specifically:
+	 *
+	 * Unlock the AG header buffers and cancel the transaction to release
+	 * the log grant space while we scan the filesystem.
+	 *
+	 * Create a new empty transaction to eliminate the possibility of the
+	 * inode scan deadlocking on cyclical metadata.
+	 *
+	 * We pass the empty transaction to the file scanning function to avoid
+	 * repeatedly cycling empty transactions.  This can be done even though
+	 * we take the IOLOCK to quiesce the file because empty transactions
+	 * do not take sb_internal.
+	 */
+	sa->agf_bp = NULL;
+	sa->agi_bp = NULL;
+	xchk_trans_cancel(sc);
+	error = xchk_trans_alloc_empty(sc);
+	if (error)
+		return error;
+
+	/* Iterate every inode in the filesystem looking for rmaps in this AG. */
+	while ((error = xchk_iscan_iter(sc, &rr->iscan, &ip)) == 1) {
+		error = xrep_rmap_scan_inode(rr, ip);
+		xchk_irele(sc, ip);
+		if (error)
+			break;
+
+		if (xchk_should_terminate(sc, &error))
+			break;
+	}
+	if (error)
+		return error;
+
+	/*
+	 * Switch out for a real transaction and lock the AG headers in
+	 * preparation for building a new tree.
+	 */
+	xchk_trans_cancel(sc);
+	error = xchk_setup_fs(sc);
+	if (error)
+		return error;
+	return xchk_perag_lock(sc);
+}
+
+/* Section (II): Reserving space for new rmapbt and setting free space bitmap */
+
+struct xrep_rmap_agfl {
+	struct xagb_bitmap	*bitmap;
+	xfs_agnumber_t		agno;
+};
+
+/* Add an AGFL block to the rmap list. */
+STATIC int
+xrep_rmap_walk_agfl(
+	struct xfs_mount	*mp,
+	xfs_agblock_t		agbno,
+	void			*priv)
+{
+	struct xrep_rmap_agfl	*ra = priv;
+
+	return xagb_bitmap_set(ra->bitmap, agbno, 1);
+}
+
+/*
+ * Run one round of reserving space for the new rmapbt and recomputing the
+ * number of blocks needed to store the previously observed rmapbt records and
+ * the ones we'll create for the free space metadata.  When we don't need more
+ * blocks, return a bitmap of OWN_AG extents in @freesp_blocks and set @done to
+ * true.
+ */
+STATIC int
+xrep_rmap_try_reserve(
+	struct xrep_rmap	*rr,
+	struct xfs_btree_cur	*rmap_cur,
+	uint64_t		nr_records,
+	struct xagb_bitmap	*freesp_blocks,
+	uint64_t		*blocks_reserved,
+	bool			*done)
+{
+	struct xrep_rmap_agfl	ra = {
+		.bitmap		= freesp_blocks,
+		.agno		= rr->sc->sa.pag->pag_agno,
+	};
+	struct xfs_scrub	*sc = rr->sc;
+	struct xrep_newbt_resv	*resv, *n;
+	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
+	struct xfs_buf		*agfl_bp;
+	uint64_t		nr_blocks;	/* RMB */
+	uint64_t		freesp_records;
+	int			error;
+
+	/*
+	 * We're going to recompute new_btree.bload.nr_blocks at the end of
+	 * this function to reflect however many btree blocks we need to store
+	 * all the rmap records (including the ones that reflect the changes we
+	 * made to support the new rmapbt blocks), so we save the old value
+	 * here so we can decide if we've reserved enough blocks.
+	 */
+	nr_blocks = rr->new_btree.bload.nr_blocks;
+
+	/*
+	 * Make sure we've reserved enough space for the new btree.  This can
+	 * change the shape of the free space btrees, which can cause secondary
+	 * interactions with the rmap records because all three space btrees
+	 * have the same rmap owner.  We'll account for all that below.
+	 */
+	error = xrep_newbt_alloc_blocks(&rr->new_btree,
+			nr_blocks - *blocks_reserved);
+	if (error)
+		return error;
+
+	*blocks_reserved = rr->new_btree.bload.nr_blocks;
+
+	/* Clear everything in the bitmap. */
+	xagb_bitmap_destroy(freesp_blocks);
+
+	/* Set all the bnobt blocks in the bitmap. */
+	sc->sa.bno_cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.pag, XFS_BTNUM_BNO);
+	error = xagb_bitmap_set_btblocks(freesp_blocks, sc->sa.bno_cur);
+	xfs_btree_del_cursor(sc->sa.bno_cur, error);
+	sc->sa.bno_cur = NULL;
+	if (error)
+		return error;
+
+	/* Set all the cntbt blocks in the bitmap. */
+	sc->sa.cnt_cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.pag, XFS_BTNUM_CNT);
+	error = xagb_bitmap_set_btblocks(freesp_blocks, sc->sa.cnt_cur);
+	xfs_btree_del_cursor(sc->sa.cnt_cur, error);
+	sc->sa.cnt_cur = NULL;
+	if (error)
+		return error;
+
+	/* Record our new btreeblks value. */
+	rr->freesp_btblocks = xagb_bitmap_hweight(freesp_blocks) - 2;
+
+	/* Set all the new rmapbt blocks in the bitmap. */
+	for_each_xrep_newbt_reservation(&rr->new_btree, resv, n) {
+		error = xagb_bitmap_set(freesp_blocks, resv->agbno, resv->len);
+		if (error)
+			return error;
+	}
+
+	/* Set all the AGFL blocks in the bitmap. */
+	error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp);
+	if (error)
+		return error;
+
+	error = xfs_agfl_walk(sc->mp, agf, agfl_bp, xrep_rmap_walk_agfl, &ra);
+	if (error)
+		return error;
+
+	/* Count the extents in the bitmap. */
+	freesp_records = xagb_bitmap_count_set_regions(freesp_blocks);
+
+	/* Compute how many blocks we'll need for all the rmaps. */
+	error = xfs_btree_bload_compute_geometry(rmap_cur,
+			&rr->new_btree.bload, nr_records + freesp_records);
+	if (error)
+		return error;
+
+	/* We're done when we don't need more blocks. */
+	*done = nr_blocks >= rr->new_btree.bload.nr_blocks;
+	return 0;
+}
+
+/*
+ * Iteratively reserve space for rmap btree while recording OWN_AG rmaps for
+ * the free space metadata.  This implements section (II) above.
+ */
+STATIC int
+xrep_rmap_reserve_space(
+	struct xrep_rmap	*rr,
+	struct xfs_btree_cur	*rmap_cur)
+{
+	struct xagb_bitmap	freesp_blocks;	/* AGBIT */
+	uint64_t		nr_records;	/* NR */
+	uint64_t		blocks_reserved = 0;
+	bool			done = false;
+	int			error;
+
+	nr_records = xfarray_length(rr->rmap_records);
+
+	/* Compute how many blocks we'll need for the rmaps collected so far. */
+	error = xfs_btree_bload_compute_geometry(rmap_cur,
+			&rr->new_btree.bload, nr_records);
+	if (error)
+		return error;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(rr->sc, &error))
+		return error;
+
+	xagb_bitmap_init(&freesp_blocks);
+
+	/*
+	 * Iteratively reserve space for the new rmapbt and recompute the
+	 * number of blocks needed to store the previously observed rmapbt
+	 * records and the ones we'll create for the free space metadata.
+	 * Finish when we don't need more blocks.
+	 */
+	do {
+		error = xrep_rmap_try_reserve(rr, rmap_cur, nr_records,
+				&freesp_blocks, &blocks_reserved, &done);
+		if (error)
+			goto out_bitmap;
+	} while (!done);
+
+	/* Emit rmaps for everything in the free space bitmap. */
+	xrep_ag_btcur_init(rr->sc, &rr->sc->sa);
+	error = xrep_rmap_stash_bitmap(rr, &freesp_blocks, &XFS_RMAP_OINFO_AG);
+	xchk_ag_btcur_free(&rr->sc->sa);
+
+out_bitmap:
+	xagb_bitmap_destroy(&freesp_blocks);
+	return error;
+}
+
+/* Section (III): Building the new rmap btree. */
+
+/* Update the AGF counters. */
+STATIC int
+xrep_rmap_reset_counters(
+	struct xrep_rmap	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
+	xfs_agblock_t		rmap_btblocks;
+
+	/*
+	 * The AGF header contains extra information related to the reverse
+	 * mapping btree, so we must update those fields here.
+	 */
+	rmap_btblocks = rr->new_btree.afake.af_blocks - 1;
+	agf->agf_btreeblks = cpu_to_be32(rr->freesp_btblocks + rmap_btblocks);
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, XFS_AGF_BTREEBLKS);
+
+	/*
+	 * After we commit the new btree to disk, it is possible that the
+	 * process to reap the old btree blocks will race with the AIL trying
+	 * to checkpoint the old btree blocks into the filesystem.  If the new
+	 * tree is shorter than the old one, the rmapbt write verifier will
+	 * fail and the AIL will shut down the filesystem.
+	 *
+	 * To avoid this, save the old incore btree height values as the alt
+	 * height values before re-initializing the perag info from the updated
+	 * AGF to capture all the new values.
+	 */
+	pag->pagf_alt_levels[XFS_BTNUM_RMAPi] =
+					pag->pagf_levels[XFS_BTNUM_RMAPi];
+
+	/* Reinitialize with the values we just logged. */
+	return xrep_reinit_pagf(sc);
+}
+
+/* Retrieve rmapbt data for bulk load. */
+STATIC int
+xrep_rmap_get_records(
+	struct xfs_btree_cur	*cur,
+	unsigned int		idx,
+	struct xfs_btree_block	*block,
+	unsigned int		nr_wanted,
+	void			*priv)
+{
+	struct xrep_rmap_extent	rec;
+	struct xfs_rmap_irec	*irec = &cur->bc_rec.r;
+	struct xrep_rmap	*rr = priv;
+	union xfs_btree_rec	*block_rec;
+	unsigned int		loaded;
+	int			error;
+
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		error = xfarray_load_next(rr->rmap_records, &rr->array_cur,
+				&rec);
+		if (error)
+			return error;
+
+		irec->rm_startblock = rec.startblock;
+		irec->rm_blockcount = rec.blockcount;
+		irec->rm_owner = rec.owner;
+		if (xfs_rmap_irec_offset_unpack(rec.offset, irec) != NULL)
+			return -EFSCORRUPTED;
+
+		error = xrep_rmap_check_mapping(rr->sc, irec);
+		if (error)
+			return error;
+
+		block_rec = xfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
+}
+
+/* Feed one of the new btree blocks to the bulk loader. */
+STATIC int
+xrep_rmap_claim_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	void			*priv)
+{
+	struct xrep_rmap	*rr = priv;
+	int			error;
+
+	error = xrep_newbt_relog_autoreap(&rr->new_btree);
+	if (error)
+		return error;
+
+	return xrep_newbt_claim_block(cur, &rr->new_btree, ptr);
+}
+
+/* Custom allocation function for new rmap btrees. */
+STATIC int
+xrep_rmap_alloc_vextent(
+	struct xfs_scrub	*sc,
+	struct xfs_alloc_arg	*args)
+{
+	int			error;
+
+	/*
+	 * We don't want an rmap update on the allocation, since we iteratively
+	 * compute the OWN_AG records /after/ allocating blocks for the records
+	 * that we already know we need to store.  Therefore, fix the freelist
+	 * with the NORMAP flag set so that we don't also try to create an rmap
+	 * for new AGFL blocks.
+	 */
+	error = xrep_fix_freelist(sc, XFS_ALLOC_FLAG_NORMAP);
+	if (error)
+		return error;
+
+	/*
+	 * If xrep_fix_freelist fixed the freelist by moving blocks from the
+	 * free space btrees or by removing blocks from the AGFL and queueing
+	 * an EFI to free the block, the transaction will be dirty.  This
+	 * second case is of interest to us.
+	 *
+	 * Later on, we will need to compare gaps in the new recordset against
+	 * the block usage of all OWN_AG owners in order to free the old
+	 * btree's blocks, which means that we can't have EFIs for former AGFL
+	 * blocks attached to the repair transaction when we commit the new
+	 * btree.
+	 *
+	 * xrep_newbt_alloc_blocks guarantees this for us by calling
+	 * xrep_defer_finish to commit anything that fix_freelist may have
+	 * added to the transaction.
+	 */
+	return xfs_alloc_vextent(args);
+}
+
+/*
+ * Use the collected rmap information to stage a new rmap btree.  If this is
+ * successful we'll return with the new btree root information logged to the
+ * repair transaction but not yet committed.  This implements section (III)
+ * above.
+ */
+STATIC int
+xrep_rmap_build_new_tree(
+	struct xrep_rmap	*rr)
+{
+	struct xfs_scrub	*sc = rr->sc;
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
+	struct xfs_btree_cur	*rmap_cur;
+	xfs_fsblock_t		fsbno;
+	int			error;
+
+	/*
+	 * Preserve the old rmapbt block count so that we can adjust the
+	 * per-AG rmapbt reservation after we commit the new btree root and
+	 * want to dispose of the old btree blocks.
+	 */
+	rr->old_rmapbt_fsbcount = be32_to_cpu(agf->agf_rmap_blocks);
+
+	/*
+	 * Prepare to construct the new btree by reserving disk space for the
+	 * new btree and setting up all the accounting information we'll need
+	 * to root the new btree while it's under construction and before we
+	 * attach it to the AG header.  The new blocks are accounted to the
+	 * rmapbt per-AG reservation, which we will adjust further after
+	 * committing the new btree.
+	 */
+	fsbno = XFS_AGB_TO_FSB(sc->mp, pag->pag_agno, XFS_RMAP_BLOCK(sc->mp));
+	xrep_newbt_init_ag(&rr->new_btree, sc, &XFS_RMAP_OINFO_SKIP_UPDATE,
+			fsbno, XFS_AG_RESV_RMAPBT);
+	rr->new_btree.bload.get_records = xrep_rmap_get_records;
+	rr->new_btree.bload.claim_block = xrep_rmap_claim_block;
+	rr->new_btree.alloc_vextent = xrep_rmap_alloc_vextent;
+	rmap_cur = xfs_rmapbt_stage_cursor(sc->mp, &rr->new_btree.afake, pag);
+
+	/*
+	 * Initialize @rr->new_btree, reserve space for the new rmapbt,
+	 * and compute OWN_AG rmaps.
+	 */
+	error = xrep_rmap_reserve_space(rr, rmap_cur);
+	if (error)
+		goto err_cur;
+
+	/*
+	 * Due to btree slack factors, it's possible for a new btree to be one
+	 * level taller than the old btree.  Update the incore btree height so
+	 * that we don't trip the verifiers when writing the new btree blocks
+	 * to disk.
+	 */
+	pag->pagf_alt_levels[XFS_BTNUM_RMAPi] =
+					rr->new_btree.bload.btree_height;
+
+	/* Add all observed rmap records. */
+	rr->array_cur = XFARRAY_CURSOR_INIT;
+	sc->sa.bno_cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.pag, XFS_BTNUM_BNO);
+	error = xfs_btree_bload(rmap_cur, &rr->new_btree.bload, rr);
+	xfs_btree_del_cursor(sc->sa.bno_cur, error);
+	sc->sa.bno_cur = NULL;
+	if (error)
+		goto err_level;
+
+	/*
+	 * Install the new btree in the AG header.  After this point the old
+	 * btree is no longer accessible and the new tree is live.
+	 */
+	xfs_rmapbt_commit_staged_btree(rmap_cur, sc->tp, sc->sa.agf_bp);
+	xfs_btree_del_cursor(rmap_cur, 0);
+
+	/*
+	 * The newly committed rmap recordset includes mappings for the blocks
+	 * that we reserved to build the new btree.  If there is excess space
+	 * reservation to be freed, the corresponding rmap records must also be
+	 * removed.
+	 */
+	rr->new_btree.oinfo = XFS_RMAP_OINFO_AG;
+
+	/* Reset the AGF counters now that we've changed the btree shape. */
+	error = xrep_rmap_reset_counters(rr);
+	if (error)
+		goto err_newbt;
+
+	/* Dispose of any unused blocks and the accounting information. */
+	error = xrep_newbt_commit(&rr->new_btree);
+	if (error)
+		return error;
+
+	return xrep_roll_ag_trans(sc);
+
+err_level:
+	pag->pagf_alt_levels[XFS_BTNUM_RMAPi] = 0;
+err_cur:
+	xfs_btree_del_cursor(rmap_cur, error);
+err_newbt:
+	xrep_newbt_cancel(&rr->new_btree);
+	return error;
+}
+
+/* Section (IV): Reaping the old btree. */
+
+struct xrep_rmap_find_gaps {
+	struct xagb_bitmap	rmap_gaps;
+	xfs_agblock_t		next_agbno;
+};
+
+/* Subtract each free extent in the bnobt from the rmap gaps. */
+STATIC int
+xrep_rmap_find_freesp(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_alloc_rec_incore *rec,
+	void				*priv)
+{
+	struct xrep_rmap_find_gaps	*rfg = priv;
+
+	return xagb_bitmap_clear(&rfg->rmap_gaps, rec->ar_startblock,
+			rec->ar_blockcount);
+}
+
+/*
+ * Reap the old rmapbt blocks.  Now that the rmapbt is fully rebuilt, we make
+ * a list of gaps in the rmap records and a list of the extents mentioned in
+ * the bnobt.  Any block that's in the new rmapbt gap list but not mentioned
+ * in the bnobt is a block from the old rmapbt and can be removed.
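+ *
+ * For example, if the new rmap records cover agbnos [0, 100) and [110, 200),
+ * then [100, 110) goes into the gap list.  If the bnobt says that only
+ * [104, 110) is free space, then agbnos 100-103 are neither owned by anything
+ * in the new rmapbt nor free -- they can only be old rmapbt blocks, so we
+ * reap them.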
+ */
+STATIC int
+xrep_rmap_remove_old_tree(
+	struct xrep_rmap	*rr)
+{
+	struct xrep_rmap_find_gaps rfg = {
+		.next_agbno	= 0,
+	};
+	struct xfs_scrub	*sc = rr->sc;
+	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
+	struct xfs_perag	*pag = sc->sa.pag;
+	xfs_agblock_t		agend;
+	xfarray_idx_t		array_cur;
+	int			error;
+
+	xagb_bitmap_init(&rfg.rmap_gaps);
+
+	/* Compute free space from the new rmapbt. */
+	foreach_xfarray_idx(rr->rmap_records, array_cur) {
+		struct xrep_rmap_extent	rec;
+
+		error = xfarray_load(rr->rmap_records, array_cur, &rec);
+		if (error)
+			goto out_bitmap;
+
+		/* Record the free space we find. */
+		if (rec.startblock > rfg.next_agbno) {
+			error = xagb_bitmap_set(&rfg.rmap_gaps, rfg.next_agbno,
+					rec.startblock - rfg.next_agbno);
+			if (error)
+				goto out_bitmap;
+		}
+		rfg.next_agbno = max_t(xfs_agblock_t, rfg.next_agbno,
+					rec.startblock + rec.blockcount);
+	}
+
+	/* Insert a record for space between the last rmap and EOAG. */
+	agend = be32_to_cpu(agf->agf_length);
+	if (rfg.next_agbno < agend) {
+		error = xagb_bitmap_set(&rfg.rmap_gaps, rfg.next_agbno,
+				agend - rfg.next_agbno);
+		if (error)
+			goto out_bitmap;
+	}
+
+	/* Compute free space from the existing bnobt. */
+	sc->sa.bno_cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.pag, XFS_BTNUM_BNO);
+	error = xfs_alloc_query_all(sc->sa.bno_cur, xrep_rmap_find_freesp,
+			&rfg);
+	xfs_btree_del_cursor(sc->sa.bno_cur, error);
+	sc->sa.bno_cur = NULL;
+	if (error)
+		goto out_bitmap;
+
+	/*
+	 * Free the "free" blocks that the new rmapbt knows about but the bnobt
+	 * doesn't--these are the old rmapbt blocks.  Credit the old rmapbt
+	 * block usage count back to the per-AG rmapbt reservation (and not
+	 * fdblocks, since the rmap btree lives in free space) to keep the
+	 * reservation and free space accounting correct.
+	 */
+	error = xrep_reap_agblocks(sc, &rfg.rmap_gaps,
+			&XFS_RMAP_OINFO_ANY_OWNER, XFS_AG_RESV_IGNORE);
+	if (error)
+		goto out_bitmap;
+	sc->sa.pag->pag_rmapbt_resv.ar_reserved += rr->old_rmapbt_fsbcount;
+
+	/*
+	 * Now that we've zapped all the old rmapbt blocks we can turn off
+	 * the alternate height mechanism and reset the per-AG space
+	 * reservation.
+	 */
+	pag->pagf_alt_levels[XFS_BTNUM_RMAPi] = 0;
+	sc->flags |= XREP_RESET_PERAG_RESV;
+out_bitmap:
+	xagb_bitmap_destroy(&rfg.rmap_gaps);
+	return error;
+}
+
+/* Repair the rmap btree for some AG. */
+int
+xrep_rmapbt(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_rmap	*rr;
+	int			error;
+
+	/* Functionality is not yet complete. */
+	return xrep_notsupported(sc);
+
+	rr = kzalloc(sizeof(struct xrep_rmap), XCHK_GFP_FLAGS);
+	if (!rr)
+		return -ENOMEM;
+	rr->sc = sc;
+
+	/* Set up some storage */
+	error = xfarray_create(sc->mp, "rmap records", 0,
+			sizeof(struct xrep_rmap_extent), &rr->rmap_records);
+	if (error)
+		goto out_rr;
+
+	/* Retry iget every tenth of a second for up to 30 seconds. */
+	xchk_iscan_start(&rr->iscan, 30000, 100);
+
+	/*
+	 * Collect rmaps for everything in this AG that isn't space metadata.
+	 * These rmaps won't change even as we try to allocate blocks.
+	 */
+	error = xrep_rmap_find_rmaps(rr);
+	if (error)
+		goto out_records;
+
+	/* Rebuild the rmap information. */
+	error = xrep_rmap_build_new_tree(rr);
+	if (error)
+		goto out_records;
+
+	/* Kill the old tree. */
+	error = xrep_rmap_remove_old_tree(rr);
+
+out_records:
+	xchk_iscan_finish(&rr->iscan);
+	xfarray_destroy(rr->rmap_records);
+out_rr:
+	kfree(rr);
+	return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 5bbc12649277..f030311fae2b 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -278,7 +278,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.setup	= xchk_setup_ag_rmapbt,
 		.scrub	= xchk_rmapbt,
 		.has	= xfs_has_rmapbt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_rmapbt,
 	},
 	[XFS_SCRUB_TYPE_REFCNTBT] = {	/* refcountbt */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 05b6a6e3d0ab..53aafc70878b 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1426,7 +1426,6 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
 DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
-DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_walk_rmap);
 
 TRACE_EVENT(xrep_abt_found,
@@ -1544,6 +1543,38 @@ TRACE_EVENT(xrep_bmap_found,
 		  __entry->state)
 );
 
+TRACE_EVENT(xrep_rmap_found,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 const struct xfs_rmap_irec *rec),
+	TP_ARGS(mp, agno, rec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(uint64_t, owner)
+		__field(uint64_t, offset)
+		__field(unsigned int, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = rec->rm_startblock;
+		__entry->len = rec->rm_blockcount;
+		__entry->owner = rec->rm_owner;
+		__entry->offset = rec->rm_offset;
+		__entry->flags = rec->rm_flags;
+	),
+	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x owner 0x%llx fileoff 0x%llx flags 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->owner,
+		  __entry->offset,
+		  __entry->flags)
+);
+
 TRACE_EVENT(xrep_findroot_block,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
 		 uint32_t magic, uint16_t level),



* [PATCH 3/4] xfs: create a shadow rmap btree during rmap repair
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 4/4] xfs: hook live rmap operations during a repair operation Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an in-memory btree of rmap records instead of an array.  This
enables us to do live record collection instead of freezing the fs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
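A rough sketch of the change to the stash path (error handling elided;
the xfbtree helpers come from earlier patches in this series): instead
of appending a packed record to an xfarray,

	rre.offset = xfs_rmap_irec_offset_pack(&rmap);
	error = xfarray_append(rr->rmap_records, &rre);

we now insert the record into an xfile-backed in-memory rmap btree,
which keeps the recordset sorted and updatable while the scan runs:

	error = xfbtree_head_read_buf(rr->rmap_btree, sc->tp, &mhead_bp);
	mcur = xfs_rmapbt_mem_cursor(sc->sa.pag, sc->tp, mhead_bp,
			rr->rmap_btree);
	error = xfs_rmap_map_raw(mcur, &rmap);
	xfs_btree_del_cursor(mcur, error);
	error = xfbtree_trans_commit(rr->rmap_btree, sc->tp);
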
 fs/xfs/libxfs/xfs_rmap.c       |   25 ++--
 fs/xfs/libxfs/xfs_rmap_btree.c |  123 ++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |    9 +
 fs/xfs/scrub/repair.c          |   23 ++++
 fs/xfs/scrub/repair.h          |    2 
 fs/xfs/scrub/rmap_repair.c     |  250 +++++++++++++++++++++++++++++-----------
 6 files changed, 353 insertions(+), 79 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 16233bb5be7e..f7587f985aca 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -274,6 +274,8 @@ xfs_rmap_check_irec(
 	struct xfs_btree_cur		*cur,
 	const struct xfs_rmap_irec	*irec)
 {
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfs_rmap_check_perag_irec(cur->bc_mem.pag, irec);
 	return xfs_rmap_check_perag_irec(cur->bc_ag.pag, irec);
 }
 
@@ -285,9 +287,13 @@ xfs_rmap_complain_bad_rec(
 {
 	struct xfs_mount		*mp = cur->bc_mp;
 
-	xfs_warn(mp,
-		"Reverse Mapping BTree record corruption in AG %d detected at %pS!",
-		cur->bc_ag.pag->pag_agno, fa);
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		xfs_warn(mp,
+ "In-Memory Reverse Mapping BTree record corruption detected at %pS!", fa);
+	else
+		xfs_warn(mp,
+ "Reverse Mapping BTree record corruption in AG %d detected at %pS!",
+			cur->bc_ag.pag->pag_agno, fa);
 	xfs_warn(mp,
 		"Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x",
 		irec->rm_owner, irec->rm_flags, irec->rm_startblock,
@@ -2412,15 +2418,12 @@ xfs_rmap_map_raw(
 {
 	struct xfs_owner_info	oinfo;
 
-	oinfo.oi_owner = rmap->rm_owner;
-	oinfo.oi_offset = rmap->rm_offset;
-	oinfo.oi_flags = 0;
-	if (rmap->rm_flags & XFS_RMAP_ATTR_FORK)
-		oinfo.oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
-	if (rmap->rm_flags & XFS_RMAP_BMBT_BLOCK)
-		oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
+	xfs_owner_info_pack(&oinfo, rmap->rm_owner, rmap->rm_offset,
+			rmap->rm_flags);
 
-	if (rmap->rm_flags || XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
+	if ((rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+			       XFS_RMAP_UNWRITTEN)) ||
+	    XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
 		return xfs_rmap_map(cur, rmap->rm_startblock,
 				rmap->rm_blockcount,
 				rmap->rm_flags & XFS_RMAP_UNWRITTEN,
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 103e4c97badc..49e6ecb9fb62 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -21,6 +21,9 @@
 #include "xfs_extent_busy.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+#include "xfs_btree_mem.h"
 
 static struct kmem_cache	*xfs_rmapbt_cur_cache;
 
@@ -555,6 +558,126 @@ xfs_rmapbt_stage_cursor(
 	return cur;
 }
 
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+/*
+ * Validate an in-memory rmap btree block.  Callers are allowed to generate an
+ * in-memory btree even if the ondisk feature is not enabled.
+ */
+static xfs_failaddr_t
+xfs_rmapbt_mem_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	xfs_failaddr_t		fa;
+	unsigned int		level;
+
+	if (!xfs_verify_magic(bp, block->bb_magic))
+		return __this_address;
+
+	fa = xfs_btree_sblock_v5hdr_verify(bp);
+	if (fa)
+		return fa;
+
+	level = be16_to_cpu(block->bb_level);
+	if (xfs_has_rmapbt(mp)) {
+		if (level >= mp->m_rmap_maxlevels)
+			return __this_address;
+	} else {
+		if (level >= xfs_rmapbt_maxlevels_ondisk())
+			return __this_address;
+	}
+
+	return xfbtree_sblock_verify(bp,
+			xfs_rmapbt_maxrecs(xfo_to_b(1), level == 0));
+}
+
+static void
+xfs_rmapbt_mem_rw_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa = xfs_rmapbt_mem_verify(bp);
+
+	if (fa)
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+/* skip crc checks on in-memory btrees to save time */
+static const struct xfs_buf_ops xfs_rmapbt_mem_buf_ops = {
+	.name			= "xfs_rmapbt_mem",
+	.magic			= { 0, cpu_to_be32(XFS_RMAP_CRC_MAGIC) },
+	.verify_read		= xfs_rmapbt_mem_rw_verify,
+	.verify_write		= xfs_rmapbt_mem_rw_verify,
+	.verify_struct		= xfs_rmapbt_mem_verify,
+};
+
+static const struct xfs_btree_ops xfs_rmapbt_mem_ops = {
+	.rec_len		= sizeof(struct xfs_rmap_rec),
+	.key_len		= 2 * sizeof(struct xfs_rmap_key),
+
+	.dup_cursor		= xfbtree_dup_cursor,
+	.set_root		= xfbtree_set_root,
+	.alloc_block		= xfbtree_alloc_block,
+	.free_block		= xfbtree_free_block,
+	.get_minrecs		= xfbtree_get_minrecs,
+	.get_maxrecs		= xfbtree_get_maxrecs,
+	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_rmapbt_init_high_key_from_rec,
+	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfbtree_init_ptr_from_cur,
+	.key_diff		= xfs_rmapbt_key_diff,
+	.buf_ops		= &xfs_rmapbt_mem_buf_ops,
+	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
+	.keys_inorder		= xfs_rmapbt_keys_inorder,
+	.recs_inorder		= xfs_rmapbt_recs_inorder,
+	.keys_contiguous	= xfs_rmapbt_keys_contiguous,
+};
+
+/* Create a cursor for an in-memory btree. */
+struct xfs_btree_cur *
+xfs_rmapbt_mem_cursor(
+	struct xfs_perag	*pag,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*head_bp,
+	struct xfbtree		*xfbtree)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	/* Overlapping btree; 2 keys per pointer. */
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
+			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
+	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING |
+			XFS_BTREE_IN_MEMORY;
+	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
+	cur->bc_ops = &xfs_rmapbt_mem_ops;
+	cur->bc_mem.xfbtree = xfbtree;
+	cur->bc_mem.head_bp = head_bp;
+	cur->bc_nlevels = xfs_btree_mem_head_nlevels(head_bp);
+
+	cur->bc_mem.pag = xfs_perag_bump(pag);
+	return cur;
+}
+
+/* Create an in-memory rmap btree. */
+int
+xfs_rmapbt_mem_create(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	struct xfs_buftarg	*target,
+	struct xfbtree		**xfbtreep)
+{
+	struct xfbtree_config	cfg = {
+		.btree_ops	= &xfs_rmapbt_mem_ops,
+		.target		= target,
+		.btnum		= XFS_BTNUM_RMAP,
+		.owner		= agno,
+	};
+
+	return xfbtree_create(mp, &cfg, xfbtreep);
+}
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
 /*
  * Install a new reverse mapping btree root.  Caller is responsible for
  * invalidating and freeing the old btree blocks.
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 3244715dd111..a27a236111dd 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -64,4 +64,13 @@ unsigned int xfs_rmapbt_maxlevels_ondisk(void);
 int __init xfs_rmapbt_init_cur_cache(void);
 void xfs_rmapbt_destroy_cur_cache(void);
 
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+struct xfbtree;
+struct xfs_btree_cur *xfs_rmapbt_mem_cursor(struct xfs_perag *pag,
+		struct xfs_trans *tp, struct xfs_buf *head_bp,
+		struct xfbtree *xfbtree);
+int xfs_rmapbt_mem_create(struct xfs_mount *mp, xfs_agnumber_t agno,
+		struct xfs_buftarg *target, struct xfbtree **xfbtreep);
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
 #endif /* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 7c242fddac8a..a685161db7bb 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -36,6 +36,7 @@
 #include "scrub/trace.h"
 #include "scrub/repair.h"
 #include "scrub/bitmap.h"
+#include "scrub/xfile.h"
 
 /*
  * Attempt to repair some metadata, if the metadata is corrupt and userspace
@@ -1130,3 +1131,25 @@ xrep_metadata_inode_forks(
 
 	return 0;
 }
+
+/*
+ * Set up an xfile and a buffer cache so that we can use the xfbtree.  Buffer
+ * target initialization registers a shrinker, so we cannot be in transaction
+ * context.  Park our resources in the scrub context and let the teardown
+ * function take care of them at the right time.
+ */
+int
+xrep_setup_buftarg(
+	struct xfs_scrub	*sc,
+	const char		*descr)
+{
+	int			error;
+
+	ASSERT(sc->tp == NULL);
+
+	error = xfile_create(sc->mp, descr, 0, &sc->xfile);
+	if (error)
+		return error;
+
+	return xfs_alloc_memory_buftarg(sc->mp, sc->xfile, &sc->xfile_buftarg);
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 22e8e1ed2de2..87e4827a6fb2 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -71,6 +71,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
 # define xrep_ino_dqattach(sc)			(0)
 #endif /* CONFIG_XFS_QUOTA */
 
+int xrep_setup_buftarg(struct xfs_scrub *sc, const char *descr);
+
 int xrep_ino_ensure_extent_count(struct xfs_scrub *sc, int whichfork,
 		xfs_extnum_t nextents);
 int xrep_reset_perag_resv(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/rmap_repair.c b/fs/xfs/scrub/rmap_repair.c
index 952dae473d4d..ae6015b171d9 100644
--- a/fs/xfs/scrub/rmap_repair.c
+++ b/fs/xfs/scrub/rmap_repair.c
@@ -12,6 +12,7 @@
 #include "xfs_defer.h"
 #include "xfs_btree.h"
 #include "xfs_btree_staging.h"
+#include "xfs_btree_mem.h"
 #include "xfs_bit.h"
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
@@ -41,6 +42,7 @@
 #include "scrub/iscan.h"
 #include "scrub/newbt.h"
 #include "scrub/reap.h"
+#include "scrub/xfbtree.h"
 
 /*
  * Reverse Mapping Btree Repair
@@ -125,37 +127,28 @@ int
 xrep_setup_ag_rmapbt(
 	struct xfs_scrub	*sc)
 {
-	/* For now this is a placeholder until we land other pieces. */
-	return 0;
+	return xrep_setup_buftarg(sc, "rmapbt repair");
 }
 
-/*
- * Packed rmap record.  The ATTR/BMBT/UNWRITTEN flags are hidden in the upper
- * bits of offset, just like the on-disk record.
- */
-struct xrep_rmap_extent {
-	xfs_agblock_t	startblock;
-	xfs_extlen_t	blockcount;
-	uint64_t	owner;
-	uint64_t	offset;
-} __packed;
-
 /* Context for collecting rmaps */
 struct xrep_rmap {
 	/* new rmapbt information */
 	struct xrep_newbt	new_btree;
 
 	/* rmap records generated from primary metadata */
-	struct xfarray		*rmap_records;
+	struct xfbtree		*rmap_btree;
 
 	struct xfs_scrub	*sc;
 
-	/* get_records()'s position in the rmap record array. */
-	xfarray_idx_t		array_cur;
+	/* in-memory btree cursor for the xfs_btree_bload iteration */
+	struct xfs_btree_cur	*mcur;
 
 	/* inode scan cursor */
 	struct xchk_iscan	iscan;
 
+	/* Number of non-freespace records found. */
+	unsigned long long	nr_records;
+
 	/* bnobt/cntbt contribution to btreeblks */
 	xfs_agblock_t		freesp_btblocks;
 
@@ -196,11 +189,6 @@ xrep_rmap_stash(
 	uint64_t		offset,
 	unsigned int		flags)
 {
-	struct xrep_rmap_extent	rre = {
-		.startblock	= startblock,
-		.blockcount	= blockcount,
-		.owner		= owner,
-	};
 	struct xfs_rmap_irec	rmap = {
 		.rm_startblock	= startblock,
 		.rm_blockcount	= blockcount,
@@ -209,6 +197,8 @@ xrep_rmap_stash(
 		.rm_flags	= flags,
 	};
 	struct xfs_scrub	*sc = rr->sc;
+	struct xfs_btree_cur	*mcur;
+	struct xfs_buf		*mhead_bp;
 	int			error = 0;
 
 	if (xchk_should_terminate(sc, &error))
@@ -216,8 +206,22 @@ xrep_rmap_stash(
 
 	trace_xrep_rmap_found(sc->mp, sc->sa.pag->pag_agno, &rmap);
 
-	rre.offset = xfs_rmap_irec_offset_pack(&rmap);
-	return xfarray_append(rr->rmap_records, &rre);
+	error = xfbtree_head_read_buf(rr->rmap_btree, sc->tp, &mhead_bp);
+	if (error)
+		return error;
+
+	mcur = xfs_rmapbt_mem_cursor(sc->sa.pag, sc->tp, mhead_bp,
+			rr->rmap_btree);
+	error = xfs_rmap_map_raw(mcur, &rmap);
+	xfs_btree_del_cursor(mcur, error);
+	if (error)
+		goto out_cancel;
+
+	return xfbtree_trans_commit(rr->rmap_btree, sc->tp);
+
+out_cancel:
+	xfbtree_trans_cancel(rr->rmap_btree, sc->tp);
+	return error;
 }
 
 struct xrep_rmap_stash_run {
@@ -758,6 +762,24 @@ xrep_rmap_find_log_rmaps(
 			sc->mp->m_sb.sb_logblocks, XFS_RMAP_OWN_LOG, 0, 0);
 }
 
+/* Check and count all the records that we gathered. */
+STATIC int
+xrep_rmap_check_record(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_rmap		*rr = priv;
+	int				error;
+
+	error = xrep_rmap_check_mapping(rr->sc, rec);
+	if (error)
+		return error;
+
+	rr->nr_records++;
+	return 0;
+}
+
 /*
  * Generate all the reverse-mappings for this AG, a list of the old rmapbt
  * blocks, and the new btreeblks count.  Figure out if we have enough free
@@ -771,6 +793,8 @@ xrep_rmap_find_rmaps(
 	struct xfs_scrub	*sc = rr->sc;
 	struct xchk_ag		*sa = &sc->sa;
 	struct xfs_inode	*ip;
+	struct xfs_buf		*mhead_bp;
+	struct xfs_btree_cur	*mcur;
 	int			error;
 
 	/* Find all the per-AG metadata. */
@@ -837,7 +861,35 @@ xrep_rmap_find_rmaps(
 	error = xchk_setup_fs(sc);
 	if (error)
 		return error;
-	return xchk_perag_lock(sc);
+	error = xchk_perag_lock(sc);
+	if (error)
+		return error;
+
+	/*
+	 * Now that we have everything locked again, we need to count the
+	 * number of rmap records stashed in the btree.  This should reflect
+	 * all actively-owned space in the filesystem.  At the same time, check
+	 * all our records before we start building a new btree, which requires
+	 * a bnobt cursor.
+	 */
+	error = xfbtree_head_read_buf(rr->rmap_btree, NULL, &mhead_bp);
+	if (error)
+		return error;
+
+	mcur = xfs_rmapbt_mem_cursor(rr->sc->sa.pag, NULL, mhead_bp,
+			rr->rmap_btree);
+	sc->sa.bno_cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.pag, XFS_BTNUM_BNO);
+
+	rr->nr_records = 0;
+	error = xfs_rmap_query_all(mcur, xrep_rmap_check_record, rr);
+
+	xfs_btree_del_cursor(sc->sa.bno_cur, error);
+	sc->sa.bno_cur = NULL;
+	xfs_btree_del_cursor(mcur, error);
+	xfs_buf_relse(mhead_bp);
+
+	return error;
 }
 
 /* Section (II): Reserving space for new rmapbt and setting free space bitmap */
@@ -870,7 +922,6 @@ STATIC int
 xrep_rmap_try_reserve(
 	struct xrep_rmap	*rr,
 	struct xfs_btree_cur	*rmap_cur,
-	uint64_t		nr_records,
 	struct xagb_bitmap	*freesp_blocks,
 	uint64_t		*blocks_reserved,
 	bool			*done)
@@ -954,7 +1005,7 @@ xrep_rmap_try_reserve(
 
 	/* Compute how many blocks we'll need for all the rmaps. */
 	error = xfs_btree_bload_compute_geometry(rmap_cur,
-			&rr->new_btree.bload, nr_records + freesp_records);
+			&rr->new_btree.bload, rr->nr_records + freesp_records);
 	if (error)
 		return error;
 
@@ -973,16 +1024,13 @@ xrep_rmap_reserve_space(
 	struct xfs_btree_cur	*rmap_cur)
 {
 	struct xagb_bitmap	freesp_blocks;	/* AGBIT */
-	uint64_t		nr_records;	/* NR */
 	uint64_t		blocks_reserved = 0;
 	bool			done = false;
 	int			error;
 
-	nr_records = xfarray_length(rr->rmap_records);
-
 	/* Compute how many blocks we'll need for the rmaps collected so far. */
 	error = xfs_btree_bload_compute_geometry(rmap_cur,
-			&rr->new_btree.bload, nr_records);
+			&rr->new_btree.bload, rr->nr_records);
 	if (error)
 		return error;
 
@@ -999,8 +1047,8 @@ xrep_rmap_reserve_space(
 	 * Finish when we don't need more blocks.
 	 */
 	do {
-		error = xrep_rmap_try_reserve(rr, rmap_cur, nr_records,
-				&freesp_blocks, &blocks_reserved, &done);
+		error = xrep_rmap_try_reserve(rr, rmap_cur, &freesp_blocks,
+				&blocks_reserved, &done);
 		if (error)
 			goto out_bitmap;
 	} while (!done);
@@ -1062,28 +1110,25 @@ xrep_rmap_get_records(
 	unsigned int		nr_wanted,
 	void			*priv)
 {
-	struct xrep_rmap_extent	rec;
-	struct xfs_rmap_irec	*irec = &cur->bc_rec.r;
 	struct xrep_rmap	*rr = priv;
 	union xfs_btree_rec	*block_rec;
 	unsigned int		loaded;
 	int			error;
 
 	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
-		error = xfarray_load_next(rr->rmap_records, &rr->array_cur,
-				&rec);
+		int		stat = 0;
+
+		error = xfs_btree_increment(rr->mcur, 0, &stat);
 		if (error)
 			return error;
-
-		irec->rm_startblock = rec.startblock;
-		irec->rm_blockcount = rec.blockcount;
-		irec->rm_owner = rec.owner;
-		if (xfs_rmap_irec_offset_unpack(rec.offset, irec) != NULL)
+		if (!stat)
 			return -EFSCORRUPTED;
 
-		error = xrep_rmap_check_mapping(rr->sc, irec);
+		error = xfs_rmap_get_rec(rr->mcur, &cur->bc_rec.r, &stat);
 		if (error)
 			return error;
+		if (!stat)
+			return -EFSCORRUPTED;
 
 		block_rec = xfs_btree_rec_addr(cur, idx, block);
 		cur->bc_ops->init_rec_from_cur(cur, block_rec);
@@ -1147,6 +1192,29 @@ xrep_rmap_alloc_vextent(
 	return xfs_alloc_vextent(args);
 }
 
+
+/* Count the records in this btree. */
+STATIC int
+xrep_rmap_count_records(
+	struct xfs_btree_cur	*cur,
+	unsigned long long	*nr)
+{
+	int			running = 1;
+	int			error;
+
+	*nr = 0;
+
+	error = xfs_btree_goto_left_edge(cur);
+	if (error)
+		return error;
+
+	while (running && !(error = xfs_btree_increment(cur, 0, &running))) {
+		if (running)
+			(*nr)++;
+	}
+
+	return error;
+}
+
 /*
  * Use the collected rmap information to stage a new rmap btree.  If this is
  * successful we'll return with the new btree root information logged to the
@@ -1161,6 +1229,7 @@ xrep_rmap_build_new_tree(
 	struct xfs_perag	*pag = sc->sa.pag;
 	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
 	struct xfs_btree_cur	*rmap_cur;
+	struct xfs_buf		*mhead_bp;
 	xfs_fsblock_t		fsbno;
 	int			error;
 
@@ -1195,6 +1264,21 @@ xrep_rmap_build_new_tree(
 	if (error)
 		goto err_cur;
 
+	/*
+	 * Count the rmapbt records again, because the space reservation
+	 * for the rmapbt itself probably added more records to the btree.
+	 */
+	error = xfbtree_head_read_buf(rr->rmap_btree, NULL, &mhead_bp);
+	if (error)
+		goto err_cur;
+
+	rr->mcur = xfs_rmapbt_mem_cursor(rr->sc->sa.pag, NULL, mhead_bp,
+			rr->rmap_btree);
+
+	error = xrep_rmap_count_records(rr->mcur, &rr->nr_records);
+	if (error)
+		goto err_mcur;
+
 	/*
 	 * Due to btree slack factors, it's possible for a new btree to be one
 	 * level taller than the old btree.  Update the incore btree height so
@@ -1204,13 +1288,16 @@ xrep_rmap_build_new_tree(
 	pag->pagf_alt_levels[XFS_BTNUM_RMAPi] =
 					rr->new_btree.bload.btree_height;
 
+	/*
+	 * Move the cursor to the left edge of the tree so that the first
+	 * increment in ->get_records positions us at the first record.
+	 */
+	error = xfs_btree_goto_left_edge(rr->mcur);
+	if (error)
+		goto err_level;
+
 	/* Add all observed rmap records. */
-	rr->array_cur = XFARRAY_CURSOR_INIT;
-	sc->sa.bno_cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
-			sc->sa.pag, XFS_BTNUM_BNO);
 	error = xfs_btree_bload(rmap_cur, &rr->new_btree.bload, rr);
-	xfs_btree_del_cursor(sc->sa.bno_cur, error);
-	sc->sa.bno_cur = NULL;
 	if (error)
 		goto err_level;
 
@@ -1220,6 +1307,15 @@ xrep_rmap_build_new_tree(
 	 */
 	xfs_rmapbt_commit_staged_btree(rmap_cur, sc->tp, sc->sa.agf_bp);
 	xfs_btree_del_cursor(rmap_cur, 0);
+	xfs_btree_del_cursor(rr->mcur, 0);
+	rr->mcur = NULL;
+	xfs_buf_relse(mhead_bp);
+
+	/*
+	 * Now that we've written the new btree to disk, we don't need to keep
+	 * updating the in-memory btree.  Abort the scan to stop live updates.
+	 */
+	xchk_iscan_abort(&rr->iscan);
 
 	/*
 	 * The newly committed rmap recordset includes mappings for the blocks
@@ -1243,6 +1339,9 @@ xrep_rmap_build_new_tree(
 
 err_level:
 	pag->pagf_alt_levels[XFS_BTNUM_RMAPi] = 0;
+err_mcur:
+	xfs_btree_del_cursor(rr->mcur, error);
+	xfs_buf_relse(mhead_bp);
 err_cur:
 	xfs_btree_del_cursor(rmap_cur, error);
 err_newbt:
@@ -1270,6 +1369,28 @@ xrep_rmap_find_freesp(
 			rec->ar_blockcount);
 }
 
+/* Record the free space we find, as part of cleaning out the btree. */
+STATIC int
+xrep_rmap_find_gaps(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_rmap_find_gaps	*rfg = priv;
+	int				error;
+
+	if (rec->rm_startblock > rfg->next_agbno) {
+		error = xagb_bitmap_set(&rfg->rmap_gaps, rfg->next_agbno,
+				rec->rm_startblock - rfg->next_agbno);
+		if (error)
+			return error;
+	}
+
+	rfg->next_agbno = max_t(xfs_agblock_t, rfg->next_agbno,
+				rec->rm_startblock + rec->rm_blockcount);
+	return 0;
+}
+
 /*
  * Reap the old rmapbt blocks.  Now that the rmapbt is fully rebuilt, we make
  * a list of gaps in the rmap records and a list of the extents mentioned in
@@ -1286,30 +1407,23 @@ xrep_rmap_remove_old_tree(
 	struct xfs_scrub	*sc = rr->sc;
 	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
 	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_btree_cur	*mcur;
+	struct xfs_buf		*mhead_bp;
 	xfs_agblock_t		agend;
-	xfarray_idx_t		array_cur;
 	int			error;
 
 	xagb_bitmap_init(&rfg.rmap_gaps);
 
 	/* Compute free space from the new rmapbt. */
-	foreach_xfarray_idx(rr->rmap_records, array_cur) {
-		struct xrep_rmap_extent	rec;
+	error = xfbtree_head_read_buf(rr->rmap_btree, NULL, &mhead_bp);
+	if (error)
+		goto out_bitmap;
+
+	mcur = xfs_rmapbt_mem_cursor(rr->sc->sa.pag, NULL, mhead_bp,
+			rr->rmap_btree);
 
-		error = xfarray_load(rr->rmap_records, array_cur, &rec);
-		if (error)
-			goto out_bitmap;
-
-		/* Record the free space we find. */
-		if (rec.startblock > rfg.next_agbno) {
-			error = xagb_bitmap_set(&rfg.rmap_gaps, rfg.next_agbno,
-					rec.startblock - rfg.next_agbno);
-			if (error)
-				goto out_bitmap;
-		}
-		rfg.next_agbno = max_t(xfs_agblock_t, rfg.next_agbno,
-					rec.startblock + rec.blockcount);
-	}
+	error = xfs_rmap_query_all(mcur, xrep_rmap_find_gaps, &rfg);
+	xfs_btree_del_cursor(mcur, error);
+	xfs_buf_relse(mhead_bp);
+	if (error)
+		goto out_bitmap;
 
 	/* Insert a record for space between the last rmap and EOAG. */
 	agend = be32_to_cpu(agf->agf_length);
@@ -1371,9 +1485,9 @@ xrep_rmapbt(
 		return -ENOMEM;
 	rr->sc = sc;
 
-	/* Set up some storage */
-	error = xfarray_create(sc->mp, "rmap records", 0,
-			sizeof(struct xrep_rmap_extent), &rr->rmap_records);
+	/* Set up in-memory rmap btree */
+	error = xfs_rmapbt_mem_create(sc->mp, sc->sa.pag->pag_agno,
+			sc->xfile_buftarg, &rr->rmap_btree);
 	if (error)
 		goto out_rr;
 
@@ -1398,7 +1512,7 @@ xrep_rmapbt(
 
 out_records:
 	xchk_iscan_finish(&rr->iscan);
-	xfarray_destroy(rr->rmap_records);
+	xfbtree_destroy(rr->rmap_btree);
 out_rr:
 	kfree(rr);
 	return error;



* [PATCH 4/4] xfs: hook live rmap operations during a repair operation
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: repair the rmapbt Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/4] xfs: create a helper to decide if a file mapping targets the rt volume Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: create a shadow rmap btree during rmap repair Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Hook the regular rmap code when an rmapbt repair operation is running so
that we can unlock the AGF buffer to scan the filesystem and keep the
in-memory btree up to date during the scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
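A rough sketch of the flow (details and error handling elided): the
repair sets up a notifier on the per-AG hook chain before dropping the
AGF buffer for the inode scan,

	xfs_hook_setup(&rr->hooks.update_hook, xrep_rmapbt_live_update);
	error = xfs_rmap_hook_add(sc->sa.pag, &rr->hooks);

and the regular rmap update paths then call into that chain whenever
they change the ondisk rmapbt,

	xfs_rmap_update_hook(tp, pag, op, bno, len, unwritten, &oinfo);

so that xrep_rmapbt_live_update() can replay the same change into the
in-memory shadow btree via __xfs_rmap_finish_intent().  The hook is
torn down with xfs_rmap_hook_del() when the repair finishes.
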
 fs/xfs/libxfs/xfs_ag.c     |    1 
 fs/xfs/libxfs/xfs_ag.h     |    3 +
 fs/xfs/libxfs/xfs_rmap.c   |  145 ++++++++++++++++++++++++++++++++++----------
 fs/xfs/libxfs/xfs_rmap.h   |   28 ++++++++
 fs/xfs/scrub/common.c      |    3 +
 fs/xfs/scrub/repair.c      |   36 +++++++++++
 fs/xfs/scrub/repair.h      |    4 +
 fs/xfs/scrub/rmap_repair.c |  146 ++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/scrub/scrub.c       |    4 +
 fs/xfs/scrub/scrub.h       |    4 +
 fs/xfs/scrub/trace.c       |    1 
 fs/xfs/scrub/trace.h       |   47 ++++++++++++++
 12 files changed, 381 insertions(+), 41 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index d98ac1a997d9..58d485a76150 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -334,6 +334,7 @@ xfs_initialize_perag(
 		init_waitqueue_head(&pag->pagb_wait);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
+		xfs_hooks_init(&pag->pag_rmap_update_hooks);
 #endif /* __KERNEL__ */
 
 		error = xfs_buf_hash_init(pag);
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index fd663d04bdff..6d5ddefa321e 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -121,6 +121,9 @@ struct xfs_perag {
 	 * inconsistencies.
 	 */
 	struct xfs_drain	pag_intents;
+
+	/* Hook to feed rmapbt updates to an active online repair. */
+	struct xfs_hooks	pag_rmap_update_hooks;
 #endif /* __KERNEL__ */
 };
 
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index f7587f985aca..8da59935780a 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -821,6 +821,77 @@ xfs_rmap_unmap(
 	return error;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of rmapbt live updates.  If
+ * the compiler supports jump labels, the static branch will be replaced by a
+ * nop sled when there are no hook users.  Online fsck is currently the only
+ * caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
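+ *
+ * IOWs, until something calls xfs_rmap_hook_enable(), the check in
+ * xfs_rmap_update_hook() below reduces to a nop sled (or an ordinary
+ * branch if the compiler lacks jump label support).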
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_rmap_hooks_switch);
+
+void
+xfs_rmap_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_rmap_hooks_switch);
+}
+
+void
+xfs_rmap_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_rmap_hooks_switch);
+}
+
+/* Call downstream hooks for a reverse mapping update. */
+static inline void
+xfs_rmap_update_hook(
+	struct xfs_trans		*tp,
+	struct xfs_perag		*pag,
+	enum xfs_rmap_intent_type	op,
+	xfs_agblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	bool				unwritten,
+	const struct xfs_owner_info	*oinfo)
+{
+	if (xfs_hooks_switched_on(&xfs_rmap_hooks_switch)) {
+		struct xfs_rmap_update_params	p = {
+			.startblock	= startblock,
+			.blockcount	= blockcount,
+			.unwritten	= unwritten,
+			.oinfo		= *oinfo, /* struct copy */
+		};
+
+		if (pag)
+			xfs_hooks_call(&pag->pag_rmap_update_hooks, op, &p);
+	}
+}
+
+/* Call the specified function during a reverse mapping update. */
+int
+xfs_rmap_hook_add(
+	struct xfs_perag	*pag,
+	struct xfs_rmap_hook	*hook)
+{
+	return xfs_hooks_add(&pag->pag_rmap_update_hooks, &hook->update_hook);
+}
+
+/* Stop calling the specified function during a reverse mapping update. */
+void
+xfs_rmap_hook_del(
+	struct xfs_perag	*pag,
+	struct xfs_rmap_hook	*hook)
+{
+	xfs_hooks_del(&pag->pag_rmap_update_hooks, &hook->update_hook);
+}
+#else
+# define xfs_rmap_update_hook(t, p, o, s, b, u, oi)	do { } while(0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Remove a reference to an extent in the rmap btree.
  */
@@ -841,7 +912,7 @@ xfs_rmap_free(
 		return 0;
 
 	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, pag);
-
+	xfs_rmap_update_hook(tp, pag, XFS_RMAP_UNMAP, bno, len, false, oinfo);
 	error = xfs_rmap_unmap(cur, bno, len, false, oinfo);
 
 	xfs_btree_del_cursor(cur, error);
@@ -1093,6 +1164,7 @@ xfs_rmap_alloc(
 		return 0;
 
 	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, pag);
+	xfs_rmap_update_hook(tp, pag, XFS_RMAP_MAP, bno, len, false, oinfo);
 	error = xfs_rmap_map(cur, bno, len, false, oinfo);
 
 	xfs_btree_del_cursor(cur, error);
@@ -2512,6 +2584,38 @@ xfs_rmap_finish_one_cleanup(
 		xfs_trans_brelse(tp, agbp);
 }
 
+/* Commit an rmap operation into the ondisk tree. */
+int
+__xfs_rmap_finish_intent(
+	struct xfs_btree_cur		*rcur,
+	enum xfs_rmap_intent_type	op,
+	xfs_agblock_t			bno,
+	xfs_extlen_t			len,
+	const struct xfs_owner_info	*oinfo,
+	bool				unwritten)
+{
+	switch (op) {
+	case XFS_RMAP_ALLOC:
+	case XFS_RMAP_MAP:
+		return xfs_rmap_map(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_MAP_SHARED:
+		return xfs_rmap_map_shared(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_FREE:
+	case XFS_RMAP_UNMAP:
+		return xfs_rmap_unmap(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_UNMAP_SHARED:
+		return xfs_rmap_unmap_shared(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_CONVERT:
+		return xfs_rmap_convert(rcur, bno, len, !unwritten, oinfo);
+	case XFS_RMAP_CONVERT_SHARED:
+		return xfs_rmap_convert_shared(rcur, bno, len, !unwritten,
+				oinfo);
+	default:
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+}
+
 /*
  * Process one of the deferred rmap operations.  We pass back the
  * btree cursor to maintain our lock on the rmapbt between calls.
@@ -2578,39 +2682,14 @@ xfs_rmap_finish_one(
 	unwritten = ri->ri_bmap.br_state == XFS_EXT_UNWRITTEN;
 	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, ri->ri_bmap.br_startblock);
 
-	switch (ri->ri_type) {
-	case XFS_RMAP_ALLOC:
-	case XFS_RMAP_MAP:
-		error = xfs_rmap_map(rcur, bno, ri->ri_bmap.br_blockcount,
-				unwritten, &oinfo);
-		break;
-	case XFS_RMAP_MAP_SHARED:
-		error = xfs_rmap_map_shared(rcur, bno,
-				ri->ri_bmap.br_blockcount, unwritten, &oinfo);
-		break;
-	case XFS_RMAP_FREE:
-	case XFS_RMAP_UNMAP:
-		error = xfs_rmap_unmap(rcur, bno, ri->ri_bmap.br_blockcount,
-				unwritten, &oinfo);
-		break;
-	case XFS_RMAP_UNMAP_SHARED:
-		error = xfs_rmap_unmap_shared(rcur, bno,
-				ri->ri_bmap.br_blockcount, unwritten, &oinfo);
-		break;
-	case XFS_RMAP_CONVERT:
-		error = xfs_rmap_convert(rcur, bno, ri->ri_bmap.br_blockcount,
-				!unwritten, &oinfo);
-		break;
-	case XFS_RMAP_CONVERT_SHARED:
-		error = xfs_rmap_convert_shared(rcur, bno,
-				ri->ri_bmap.br_blockcount, !unwritten, &oinfo);
-		break;
-	default:
-		ASSERT(0);
-		error = -EFSCORRUPTED;
-	}
+	error = __xfs_rmap_finish_intent(rcur, ri->ri_type, bno,
+			ri->ri_bmap.br_blockcount, &oinfo, unwritten);
+	if (error)
+		return error;
 
-	return error;
+	xfs_rmap_update_hook(tp, ri->ri_pag, ri->ri_type, bno,
+			ri->ri_bmap.br_blockcount, unwritten, &oinfo);
+	return 0;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index b7ad51055e13..2a9265218f1d 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -186,6 +186,10 @@ void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
 		struct xfs_btree_cur *rcur, int error);
 int xfs_rmap_finish_one(struct xfs_trans *tp, struct xfs_rmap_intent *ri,
 		struct xfs_btree_cur **pcur);
+int __xfs_rmap_finish_intent(struct xfs_btree_cur *rcur,
+		enum xfs_rmap_intent_type op, xfs_agblock_t bno,
+		xfs_extlen_t len, const struct xfs_owner_info *oinfo,
+		bool unwritten);
 
 int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		uint64_t owner, uint64_t offset, unsigned int flags,
@@ -237,4 +241,28 @@ extern struct kmem_cache	*xfs_rmap_intent_cache;
 int __init xfs_rmap_intent_init_cache(void);
 void xfs_rmap_intent_destroy_cache(void);
 
+/*
+ * Parameters for tracking reverse mapping changes.  The hook function arg
+ * parameter is enum xfs_rmap_intent_type, and the rest is below.
+ */
+struct xfs_rmap_update_params {
+	xfs_agblock_t			startblock;
+	xfs_extlen_t			blockcount;
+	struct xfs_owner_info		oinfo;
+	bool				unwritten;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+
+struct xfs_rmap_hook {
+	struct xfs_hook			update_hook;
+};
+
+void xfs_rmap_hook_disable(void);
+void xfs_rmap_hook_enable(void);
+
+int xfs_rmap_hook_add(struct xfs_perag *pag, struct xfs_rmap_hook *hook);
+void xfs_rmap_hook_del(struct xfs_perag *pag, struct xfs_rmap_hook *hook);
+#endif
+
 #endif	/* __XFS_RMAP_H__ */
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index c436d613521c..bde9159dca4a 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1264,5 +1264,8 @@ xchk_fshooks_enable(
 	if (scrub_fshooks & XCHK_FSHOOKS_NLINKS)
 		xfs_nlink_hook_enable();
 
+	if (scrub_fshooks & XCHK_FSHOOKS_RMAP)
+		xfs_rmap_hook_enable();
+
 	sc->flags |= scrub_fshooks;
 }
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index a685161db7bb..da6bff1fcd86 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -1153,3 +1153,39 @@ xrep_setup_buftarg(
 
 	return xfs_alloc_memory_buftarg(sc->mp, sc->xfile, &sc->xfile_buftarg);
 }
+
+/*
+ * Create a dummy transaction for use in a live update hook function.  This
+ * function MUST NOT be called from regular repair code because the current
+ * process' transaction is saved via the cookie.
+ */
+int
+xrep_trans_alloc_hook_dummy(
+	struct xfs_mount	*mp,
+	void			**cookiep,
+	struct xfs_trans	**tpp)
+{
+	int			error;
+
+	*cookiep = current->journal_info;
+	current->journal_info = NULL;
+
+	error = xfs_trans_alloc_empty(mp, tpp);
+	if (!error)
+		return 0;
+
+	current->journal_info = *cookiep;
+	*cookiep = NULL;
+	return error;
+}
+
+/* Cancel a dummy transaction used by a live update hook function. */
+void
+xrep_trans_cancel_hook_dummy(
+	void			**cookiep,
+	struct xfs_trans	*tp)
+{
+	xfs_trans_cancel(tp);
+	current->journal_info = *cookiep;
+	*cookiep = NULL;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 87e4827a6fb2..34c601aad642 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -131,6 +131,10 @@ int xrep_quotacheck(struct xfs_scrub *sc);
 int xrep_reinit_pagf(struct xfs_scrub *sc);
 int xrep_reinit_pagi(struct xfs_scrub *sc);
 
+int xrep_trans_alloc_hook_dummy(struct xfs_mount *mp, void **cookiep,
+		struct xfs_trans **tpp);
+void xrep_trans_cancel_hook_dummy(void **cookiep, struct xfs_trans *tp);
+
 #else
 
 #define xrep_ino_dqattach(sc)	(0)
diff --git a/fs/xfs/scrub/rmap_repair.c b/fs/xfs/scrub/rmap_repair.c
index ae6015b171d9..ed937e461bf8 100644
--- a/fs/xfs/scrub/rmap_repair.c
+++ b/fs/xfs/scrub/rmap_repair.c
@@ -127,6 +127,8 @@ int
 xrep_setup_ag_rmapbt(
 	struct xfs_scrub	*sc)
 {
+	xchk_fshooks_enable(sc, XCHK_FSHOOKS_RMAP);
+
 	return xrep_setup_buftarg(sc, "rmapbt repair");
 }
 
@@ -135,6 +137,9 @@ struct xrep_rmap {
 	/* new rmapbt information */
 	struct xrep_newbt	new_btree;
 
+	/* lock for the xfbtree and xfile */
+	struct mutex		lock;
+
 	/* rmap records generated from primary metadata */
 	struct xfbtree		*rmap_btree;
 
@@ -143,6 +148,9 @@ struct xrep_rmap {
 	/* in-memory btree cursor for the xfs_btree_bload iteration */
 	struct xfs_btree_cur	*mcur;
 
+	/* Hooks into rmap update code. */
+	struct xfs_rmap_hook	hooks;
+
 	/* inode scan cursor */
 	struct xchk_iscan	iscan;
 
@@ -204,11 +212,15 @@ xrep_rmap_stash(
 	if (xchk_should_terminate(sc, &error))
 		return error;
 
+	if (xchk_iscan_aborted(&rr->iscan))
+		return -EFSCORRUPTED;
+
 	trace_xrep_rmap_found(sc->mp, sc->sa.pag->pag_agno, &rmap);
 
+	mutex_lock(&rr->lock);
 	error = xfbtree_head_read_buf(rr->rmap_btree, sc->tp, &mhead_bp);
 	if (error)
-		return error;
+		goto out_abort;
 
 	mcur = xfs_rmapbt_mem_cursor(sc->sa.pag, sc->tp, mhead_bp,
 			rr->rmap_btree);
@@ -217,10 +229,18 @@ xrep_rmap_stash(
 	if (error)
 		goto out_cancel;
 
-	return xfbtree_trans_commit(rr->rmap_btree, sc->tp);
+	error = xfbtree_trans_commit(rr->rmap_btree, sc->tp);
+	if (error)
+		goto out_abort;
+
+	mutex_unlock(&rr->lock);
+	return 0;
 
 out_cancel:
 	xfbtree_trans_cancel(rr->rmap_btree, sc->tp);
+out_abort:
+	xchk_iscan_abort(&rr->iscan);
+	mutex_unlock(&rr->lock);
 	return error;
 }
 
@@ -865,6 +885,13 @@ xrep_rmap_find_rmaps(
 	if (error)
 		return error;
 
+	/*
+	 * If a hook failed to update the in-memory btree, we lack the data to
+	 * continue the repair.
+	 */
+	if (xchk_iscan_aborted(&rr->iscan))
+		return -EFSCORRUPTED;
+
 	/*
 	 * Now that we have everything locked again, we need to count the
 	 * number of rmap records stashed in the btree.  This should reflect
@@ -1469,6 +1496,97 @@ xrep_rmap_remove_old_tree(
 	return error;
 }
 
+static inline bool
+xrep_rmapbt_want_live_update(
+	struct xchk_iscan		*iscan,
+	const struct xfs_owner_info	*oi)
+{
+	if (xchk_iscan_aborted(iscan))
+		return false;
+
+	/*
+	 * Before unlocking the AG header to perform the inode scan, we
+	 * recorded reverse mappings for all AG metadata except for the OWN_AG
+	 * metadata.  IOWs, the in-memory btree knows about the AG headers, the
+	 * two inode btrees, the CoW staging extents, and the refcount btrees.
+	 * For these types of metadata, we need to record the live updates in
+	 * the in-memory rmap btree.
+	 *
+	 * However, we do not scan the free space btrees or the AGFL until we
+	 * have re-locked the AGF and are ready to reserve space for the new
+	 * rmap btree, so we do not want live updates for OWN_AG metadata.
+	 */
+	if (XFS_RMAP_NON_INODE_OWNER(oi->oi_owner))
+		return oi->oi_owner != XFS_RMAP_OWN_AG;
+
+	/* Ignore updates to files that the scanner hasn't visited yet. */
+	return xchk_iscan_want_live_update(iscan, oi->oi_owner);
+}
+
+/*
+ * Apply a rmapbt update from the regular filesystem into our shadow btree.
+ * We're running from the thread that owns the AGF buffer and is generating
+ * the update, so we must be careful about which parts of the struct
+ * xrep_rmap we change.
+ */
+static int
+xrep_rmapbt_live_update(
+	struct xfs_hook			*hook,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_rmap_update_params	*p = data;
+	struct xrep_rmap		*rr;
+	struct xfs_mount		*mp;
+	struct xfs_btree_cur		*mcur;
+	struct xfs_buf			*mhead_bp;
+	struct xfs_trans		*tp;
+	void				*txcookie;
+	int				error;
+
+	rr = container_of(hook, struct xrep_rmap, hooks.update_hook);
+	mp = rr->sc->mp;
+
+	if (!xrep_rmapbt_want_live_update(&rr->iscan, &p->oinfo))
+		goto out_unlock;
+
+	trace_xrep_rmap_live_update(mp, rr->sc->sa.pag->pag_agno, action, p);
+
+	error = xrep_trans_alloc_hook_dummy(mp, &txcookie, &tp);
+	if (error)
+		goto out_abort;
+
+	mutex_lock(&rr->lock);
+	error = xfbtree_head_read_buf(rr->rmap_btree, tp, &mhead_bp);
+	if (error)
+		goto out_cancel;
+
+	mcur = xfs_rmapbt_mem_cursor(rr->sc->sa.pag, tp, mhead_bp,
+			rr->rmap_btree);
+	error = __xfs_rmap_finish_intent(mcur, action, p->startblock,
+			p->blockcount, &p->oinfo, p->unwritten);
+	xfs_btree_del_cursor(mcur, error);
+	if (error)
+		goto out_cancel;
+
+	error = xfbtree_trans_commit(rr->rmap_btree, tp);
+	if (error)
+		goto out_cancel;
+
+	xrep_trans_cancel_hook_dummy(&txcookie, tp);
+	mutex_unlock(&rr->lock);
+	return NOTIFY_DONE;
+
+out_cancel:
+	xfbtree_trans_cancel(rr->rmap_btree, tp);
+	xrep_trans_cancel_hook_dummy(&txcookie, tp);
+out_abort:
+	mutex_unlock(&rr->lock);
+	xchk_iscan_abort(&rr->iscan);
+out_unlock:
+	return NOTIFY_DONE;
+}
+
 /* Repair the rmap btree for some AG. */
 int
 xrep_rmapbt(
@@ -1477,13 +1595,11 @@ xrep_rmapbt(
 	struct xrep_rmap	*rr;
 	int			error;
 
-	/* Functionality is not yet complete. */
-	return xrep_notsupported(sc);
-
 	rr = kzalloc(sizeof(struct xrep_rmap), XCHK_GFP_FLAGS);
 	if (!rr)
 		return -ENOMEM;
 	rr->sc = sc;
+	mutex_init(&rr->lock);
 
 	/* Set up in-memory rmap btree */
 	error = xfs_rmapbt_mem_create(sc->mp, sc->sa.pag->pag_agno,
@@ -1494,26 +1610,42 @@ xrep_rmapbt(
 	/* Retry iget every tenth of a second for up to 30 seconds. */
 	xchk_iscan_start(&rr->iscan, 30000, 100);
 
+	/*
+	 * Hook into live rmap operations so that we can update our in-memory
+	 * btree to reflect live changes on the filesystem.  Since we drop the
+	 * AGF buffer to scan all the inodes, we need this piece to avoid
+	 * installing a stale btree.
+	 */
+	ASSERT(sc->flags & XCHK_FSHOOKS_RMAP);
+	xfs_hook_setup(&rr->hooks.update_hook, xrep_rmapbt_live_update);
+	error = xfs_rmap_hook_add(sc->sa.pag, &rr->hooks);
+	if (error)
+		goto out_records;
+
 	/*
 	 * Collect rmaps for everything in this AG that isn't space metadata.
 	 * These rmaps won't change even as we try to allocate blocks.
 	 */
 	error = xrep_rmap_find_rmaps(rr);
 	if (error)
-		goto out_records;
+		goto out_abort;
 
 	/* Rebuild the rmap information. */
 	error = xrep_rmap_build_new_tree(rr);
 	if (error)
-		goto out_records;
+		goto out_abort;
 
 	/* Kill the old tree. */
 	error = xrep_rmap_remove_old_tree(rr);
 
+out_abort:
+	xchk_iscan_abort(&rr->iscan);
+	xfs_rmap_hook_del(sc->sa.pag, &rr->hooks);
 out_records:
 	xchk_iscan_finish(&rr->iscan);
 	xfbtree_destroy(rr->rmap_btree);
 out_rr:
+	mutex_destroy(&rr->lock);
 	kfree(rr);
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index f030311fae2b..c6eb692a0822 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -17,6 +17,7 @@
 #include "xfs_scrub.h"
 #include "xfs_btree.h"
 #include "xfs_btree_staging.h"
+#include "xfs_rmap.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -164,6 +165,9 @@ xchk_fshooks_disable(
 	if (sc->flags & XCHK_FSHOOKS_NLINKS)
 		xfs_nlink_hook_disable();
 
+	if (sc->flags & XCHK_FSHOOKS_RMAP)
+		xfs_rmap_hook_disable();
+
 	sc->flags &= ~XCHK_FSHOOKS_ALL;
 }
 
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 6fe59d1a2518..cf18bb4e8b35 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -126,12 +126,14 @@ struct xfs_scrub {
 #define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to use intent drain */
 #define XCHK_FSHOOKS_QUOTA	(1 << 4)  /* quota live update enabled */
 #define XCHK_FSHOOKS_NLINKS	(1 << 5)  /* link count live update enabled */
+#define XCHK_FSHOOKS_RMAP	(1 << 6)  /* rmapbt live update enabled */
 #define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
 #define XCHK_FSHOOKS_ALL	(XCHK_FSHOOKS_DRAIN | \
 				 XCHK_FSHOOKS_QUOTA | \
-				 XCHK_FSHOOKS_NLINKS)
+				 XCHK_FSHOOKS_NLINKS | \
+				 XCHK_FSHOOKS_RMAP)
 
 /* Metadata scrubbers */
 int xchk_tester(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 177fc4c75507..f8f50c5a02c0 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -18,6 +18,7 @@
 #include "xfs_dir2.h"
 #include "xfs_da_format.h"
 #include "xfs_btree_mem.h"
+#include "xfs_rmap.h"
 #include "scrub/scrub.h"
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 53aafc70878b..dc0547691fe4 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -26,6 +26,7 @@ struct xchk_nlink;
 struct xchk_fscounters;
 struct xfbtree;
 struct xfbtree_config;
+struct xfs_rmap_update_params;
 
 /*
  * ftrace's __print_symbolic requires that all enum values be wrapped in the
@@ -121,6 +122,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
 	{ XCHK_NEED_DRAIN,			"need_drain" }, \
 	{ XCHK_FSHOOKS_QUOTA,			"fshooks_quota" }, \
 	{ XCHK_FSHOOKS_NLINKS,			"fshooks_nlinks" }, \
+	{ XCHK_FSHOOKS_RMAP,			"fshooks_rmap" }, \
 	{ XREP_RESET_PERAG_RESV,		"reset_perag_resv" }, \
 	{ XREP_ALREADY_FIXED,			"already_fixed" }
 
@@ -2111,6 +2113,51 @@ DEFINE_EVENT(xfbtree_freesp_class, name, \
 DEFINE_XFBTREE_FREESP_EVENT(xfbtree_alloc_block);
 DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
 
+TRACE_DEFINE_ENUM(XFS_RMAP_MAP);
+TRACE_DEFINE_ENUM(XFS_RMAP_MAP_SHARED);
+TRACE_DEFINE_ENUM(XFS_RMAP_UNMAP);
+TRACE_DEFINE_ENUM(XFS_RMAP_UNMAP_SHARED);
+TRACE_DEFINE_ENUM(XFS_RMAP_CONVERT);
+TRACE_DEFINE_ENUM(XFS_RMAP_CONVERT_SHARED);
+TRACE_DEFINE_ENUM(XFS_RMAP_ALLOC);
+TRACE_DEFINE_ENUM(XFS_RMAP_FREE);
+
+TRACE_EVENT(xrep_rmap_live_update,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, unsigned int op,
+		 const struct xfs_rmap_update_params *p),
+	TP_ARGS(mp, agno, op, p),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(unsigned int, op)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(uint64_t, owner)
+		__field(uint64_t, offset)
+		__field(unsigned int, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->op = op;
+		__entry->agbno = p->startblock;
+		__entry->len = p->blockcount;
+		xfs_owner_info_unpack(&p->oinfo, &__entry->owner,
+				&__entry->offset, &__entry->flags);
+		if (p->unwritten)
+			__entry->flags |= XFS_RMAP_UNWRITTEN;
+	),
+	TP_printk("dev %d:%d agno 0x%x op %d agbno 0x%x fsbcount 0x%x owner 0x%llx fileoff 0x%llx flags 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->op,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->owner,
+		  __entry->offset,
+		  __entry->flags)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 



* [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (17 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/9] xfs: set the btree cursor bc_ops in xfs_btree_alloc_cursor Darrick J. Wong
                     ` (8 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: reduce refcount repair memory usage Darrick J. Wong
                   ` (58 subsequent siblings)
  77 siblings, 9 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This patchset prepares the generic btree code to allow for the creation
of new btree types outside of libxfs.  The end goal here is for online
fsck to be able to create its own in-memory btrees that will be used to
improve the performance (and reduce the memory requirements) of the
refcount btree.

To enable this, I decided that the btree ops structure is the ideal
place to encode all of the geometry information about a btree. The btree
ops structure already contains the buffer ops (and hence the btree block
magic numbers) as well as the key and record sizes, so it doesn't seem
all that farfetched to encode the XFS_BTREE_ flags that determine the
geometry (ROOT_IN_INODE, LONG_PTRS, etc).
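
To make that concrete, here's a rough sketch of where the geometry
information ends up.  This condenses hunks from later in the series and
elides most of the structure; see the individual patches for the real
definitions:

struct xfs_btree_ops {
	size_t		key_len;
	size_t		rec_len;

	/* XFS_BTREE_* flags that determine the geometry of the btree */
	unsigned int	geom_flags;

	/* buffer ops, which also carry the btree block magic numbers */
	const struct xfs_buf_ops *buf_ops;

	/* ... cursor operations ... */
};

/* e.g. the rmap btree: overlapping keys, CRC-enabled blocks */
const struct xfs_btree_ops xfs_rmapbt_ops = {
	.rec_len	= sizeof(struct xfs_rmap_rec),
	.key_len	= 2 * sizeof(struct xfs_rmap_key),
	.geom_flags	= XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING,
	/* ... */
};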

The rest of the patchset cleans up the btree functions that initialize
btree blocks and btree buffers.  The bulk of this work is to replace
btree geometry related function call arguments with a single pointer to
the ops structure, and then clean up everything else around that.  As a
side effect, we rename the functions.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=btree-geometry-in-ops

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=btree-geometry-in-ops
---
 fs/xfs/libxfs/xfs_ag.c             |   33 +++++++----------
 fs/xfs/libxfs/xfs_ag.h             |    2 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |   21 ++++-------
 fs/xfs/libxfs/xfs_bmap.c           |    9 +----
 fs/xfs/libxfs/xfs_bmap_btree.c     |   14 ++-----
 fs/xfs/libxfs/xfs_btree.c          |   70 +++++++++++++++++++-----------------
 fs/xfs/libxfs/xfs_btree.h          |   36 ++++++++-----------
 fs/xfs/libxfs/xfs_btree_mem.h      |    9 -----
 fs/xfs/libxfs/xfs_btree_staging.c  |    6 +--
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   17 ++++-----
 fs/xfs/libxfs/xfs_refcount_btree.c |    8 ++--
 fs/xfs/libxfs/xfs_rmap_btree.c     |   16 ++++----
 fs/xfs/libxfs/xfs_shared.h         |    9 +++++
 fs/xfs/scrub/trace.h               |   10 ++---
 fs/xfs/scrub/xfbtree.c             |   16 +++-----
 15 files changed, 118 insertions(+), 158 deletions(-)



* [PATCH 1/9] xfs: set the btree cursor bc_ops in xfs_btree_alloc_cursor
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 7/9] xfs: remove the unnecessary daddr parameter to _init_block Darrick J. Wong
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

This is a precursor to putting more static data in the btree ops structure.
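
A condensed before/after of one call site (taken from the rmapbt hunk
below):

	/* before: allocate the cursor, then patch in the ops */
	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
	cur->bc_ops = &xfs_rmapbt_ops;

	/* after: the ops are supplied at allocation time */
	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP, &xfs_rmapbt_ops,
			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);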

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |   11 +++++------
 fs/xfs/libxfs/xfs_bmap_btree.c     |    3 +--
 fs/xfs/libxfs/xfs_btree.h          |    2 ++
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   10 ++++++----
 fs/xfs/libxfs/xfs_refcount_btree.c |    4 ++--
 fs/xfs/libxfs/xfs_rmap_btree.c     |    7 +++----
 fs/xfs/scrub/xfbtree.c             |    4 ++--
 7 files changed, 21 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6162223ce18d..b7c369248c9c 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -512,18 +512,17 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_alloc_maxlevels,
-			xfs_allocbt_cur_cache);
-	cur->bc_ag.abt.active = false;
-
 	if (btnum == XFS_BTNUM_CNT) {
-		cur->bc_ops = &xfs_cntbt_ops;
+		cur = xfs_btree_alloc_cursor(mp, tp, btnum, &xfs_cntbt_ops,
+				mp->m_alloc_maxlevels, xfs_allocbt_cur_cache);
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_abtc_2);
 		cur->bc_flags = XFS_BTREE_LASTREC_UPDATE;
 	} else {
-		cur->bc_ops = &xfs_bnobt_ops;
+		cur = xfs_btree_alloc_cursor(mp, tp, btnum, &xfs_bnobt_ops,
+				mp->m_alloc_maxlevels, xfs_allocbt_cur_cache);
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_abtb_2);
 	}
+	cur->bc_ag.abt.active = false;
 
 	cur->bc_ag.pag = xfs_perag_bump(pag);
 
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 4c59da1a977b..8b6ba7bd7a41 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -566,11 +566,10 @@ xfs_bmbt_init_common(
 
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP, &xfs_bmbt_ops,
 			mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
-	cur->bc_ops = &xfs_bmbt_ops;
 	cur->bc_flags = XFS_BTREE_LONG_PTRS | XFS_BTREE_ROOT_IN_INODE;
 	if (xfs_has_crc(mp))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index b2c08a436997..ab12bc10ab25 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -737,12 +737,14 @@ xfs_btree_alloc_cursor(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
 	xfs_btnum_t		btnum,
+	const struct xfs_btree_ops *ops,
 	uint8_t			maxlevels,
 	struct kmem_cache	*cache)
 {
 	struct xfs_btree_cur	*cur;
 
 	cur = kmem_cache_zalloc(cache, GFP_NOFS | __GFP_NOFAIL);
+	cur->bc_ops = ops;
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 2b7571d50afb..01da9c0e71c7 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -455,14 +455,16 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum,
-			M_IGEO(mp)->inobt_maxlevels, xfs_inobt_cur_cache);
 	if (btnum == XFS_BTNUM_INO) {
+		cur = xfs_btree_alloc_cursor(mp, tp, btnum, &xfs_inobt_ops,
+				M_IGEO(mp)->inobt_maxlevels,
+				xfs_inobt_cur_cache);
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
-		cur->bc_ops = &xfs_inobt_ops;
 	} else {
+		cur = xfs_btree_alloc_cursor(mp, tp, btnum, &xfs_finobt_ops,
+				M_IGEO(mp)->inobt_maxlevels,
+				xfs_inobt_cur_cache);
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_fibt_2);
-		cur->bc_ops = &xfs_finobt_ops;
 	}
 
 	if (xfs_has_crc(mp))
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index b75005684aa2..41cf97c15e52 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -359,7 +359,8 @@ xfs_refcountbt_init_common(
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC,
-			mp->m_refc_maxlevels, xfs_refcountbt_cur_cache);
+			&xfs_refcountbt_ops, mp->m_refc_maxlevels,
+			xfs_refcountbt_cur_cache);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
@@ -367,7 +368,6 @@ xfs_refcountbt_init_common(
 	cur->bc_ag.pag = xfs_perag_bump(pag);
 	cur->bc_ag.refc.nr_ops = 0;
 	cur->bc_ag.refc.shape_changes = 0;
-	cur->bc_ops = &xfs_refcountbt_ops;
 	return cur;
 }
 
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 49e6ecb9fb62..6e651a2c562b 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -517,11 +517,10 @@ xfs_rmapbt_init_common(
 	struct xfs_btree_cur	*cur;
 
 	/* Overlapping btree; 2 keys per pointer. */
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP, &xfs_rmapbt_ops,
 			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
-	cur->bc_ops = &xfs_rmapbt_ops;
 
 	cur->bc_ag.pag = xfs_perag_bump(pag);
 	return cur;
@@ -646,11 +645,11 @@ xfs_rmapbt_mem_cursor(
 
 	/* Overlapping btree; 2 keys per pointer. */
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
-			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
+			&xfs_rmapbt_mem_ops, mp->m_rmap_maxlevels,
+			xfs_rmapbt_cur_cache);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING |
 			XFS_BTREE_IN_MEMORY;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
-	cur->bc_ops = &xfs_rmapbt_mem_ops;
 	cur->bc_mem.xfbtree = xfbtree;
 	cur->bc_mem.head_bp = head_bp;
 	cur->bc_nlevels = xfs_btree_mem_head_nlevels(head_bp);
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index 3eeb5110a1cc..4436522705e5 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -258,11 +258,11 @@ xfbtree_dup_cursor(
 	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
 
 	ncur = xfs_btree_alloc_cursor(cur->bc_mp, cur->bc_tp, cur->bc_btnum,
-			cur->bc_maxlevels, cur->bc_cache);
+			cur->bc_ops, cur->bc_maxlevels, cur->bc_cache);
 	ncur->bc_flags = cur->bc_flags;
 	ncur->bc_nlevels = cur->bc_nlevels;
 	ncur->bc_statoff = cur->bc_statoff;
-	ncur->bc_ops = cur->bc_ops;
+
 	memcpy(&ncur->bc_mem, &cur->bc_mem, sizeof(cur->bc_mem));
 
 	if (cur->bc_mem.pag)



* [PATCH 2/9] xfs: encode the default bc_flags in the btree ops structure
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/9] xfs: set the btree cursor bc_ops in xfs_btree_alloc_cursor Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 7/9] xfs: remove the unnecessary daddr parameter to _init_block Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/9] xfs: initialize btree blocks using btree_ops structure Darrick J. Wong
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Certain btree flags never change for the life of a btree cursor because
they describe the geometry of the btree itself.  Encode these in the
btree ops structure and reduce the amount of code required in each btree
type's init_cursor functions.
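
The net effect is that xfs_btree_alloc_cursor becomes the single place
that applies the per-type geometry; roughly, per the xfs_btree.h hunk
below:

	cur = kmem_cache_zalloc(cache, GFP_NOFS | __GFP_NOFAIL);
	cur->bc_ops = ops;
	/* ... */
	cur->bc_flags = ops->geom_flags;
	if (xfs_has_crc(mp))
		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;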

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    6 +-----
 fs/xfs/libxfs/xfs_bmap_btree.c     |    5 +----
 fs/xfs/libxfs/xfs_btree.h          |    6 ++++++
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    3 ---
 fs/xfs/libxfs/xfs_refcount_btree.c |    2 --
 fs/xfs/libxfs/xfs_rmap_btree.c     |    6 +++---
 6 files changed, 11 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index b7c369248c9c..5505cbb75cb6 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -480,6 +480,7 @@ static const struct xfs_btree_ops xfs_bnobt_ops = {
 static const struct xfs_btree_ops xfs_cntbt_ops = {
 	.rec_len		= sizeof(xfs_alloc_rec_t),
 	.key_len		= sizeof(xfs_alloc_key_t),
+	.geom_flags		= XFS_BTREE_LASTREC_UPDATE,
 
 	.dup_cursor		= xfs_allocbt_dup_cursor,
 	.set_root		= xfs_allocbt_set_root,
@@ -516,19 +517,14 @@ xfs_allocbt_init_common(
 		cur = xfs_btree_alloc_cursor(mp, tp, btnum, &xfs_cntbt_ops,
 				mp->m_alloc_maxlevels, xfs_allocbt_cur_cache);
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_abtc_2);
-		cur->bc_flags = XFS_BTREE_LASTREC_UPDATE;
 	} else {
 		cur = xfs_btree_alloc_cursor(mp, tp, btnum, &xfs_bnobt_ops,
 				mp->m_alloc_maxlevels, xfs_allocbt_cur_cache);
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_abtb_2);
 	}
 	cur->bc_ag.abt.active = false;
-
 	cur->bc_ag.pag = xfs_perag_bump(pag);
 
-	if (xfs_has_crc(mp))
-		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
-
 	return cur;
 }
 
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 8b6ba7bd7a41..60bd5f94de3e 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -532,6 +532,7 @@ xfs_bmbt_keys_contiguous(
 static const struct xfs_btree_ops xfs_bmbt_ops = {
 	.rec_len		= sizeof(xfs_bmbt_rec_t),
 	.key_len		= sizeof(xfs_bmbt_key_t),
+	.geom_flags		= XFS_BTREE_LONG_PTRS | XFS_BTREE_ROOT_IN_INODE,
 
 	.dup_cursor		= xfs_bmbt_dup_cursor,
 	.update_cursor		= xfs_bmbt_update_cursor,
@@ -570,10 +571,6 @@ xfs_bmbt_init_common(
 			mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
-	cur->bc_flags = XFS_BTREE_LONG_PTRS | XFS_BTREE_ROOT_IN_INODE;
-	if (xfs_has_crc(mp))
-		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
-
 	cur->bc_ino.ip = ip;
 	cur->bc_ino.allocated = 0;
 	cur->bc_ino.flags = 0;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index ab12bc10ab25..f23d12626a68 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -116,6 +116,9 @@ struct xfs_btree_ops {
 	size_t	key_len;
 	size_t	rec_len;
 
+	/* XFS_BTREE_* flags that determine the geometry of the btree */
+	unsigned int	geom_flags;
+
 	/* cursor operations */
 	struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *);
 	void	(*update_cursor)(struct xfs_btree_cur *src,
@@ -750,6 +753,9 @@ xfs_btree_alloc_cursor(
 	cur->bc_btnum = btnum;
 	cur->bc_maxlevels = maxlevels;
 	cur->bc_cache = cache;
+	cur->bc_flags = ops->geom_flags;
+	if (xfs_has_crc(mp))
+		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 01da9c0e71c7..5a59e105c801 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -467,9 +467,6 @@ xfs_inobt_init_common(
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_fibt_2);
 	}
 
-	if (xfs_has_crc(mp))
-		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
-
 	cur->bc_ag.pag = xfs_perag_bump(pag);
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 41cf97c15e52..8ba9768c0b3b 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -363,8 +363,6 @@ xfs_refcountbt_init_common(
 			xfs_refcountbt_cur_cache);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
-	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
-
 	cur->bc_ag.pag = xfs_perag_bump(pag);
 	cur->bc_ag.refc.nr_ops = 0;
 	cur->bc_ag.refc.shape_changes = 0;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 6e651a2c562b..0040e2620c24 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -489,6 +489,7 @@ xfs_rmapbt_keys_contiguous(
 static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
 	.key_len		= 2 * sizeof(struct xfs_rmap_key),
+	.geom_flags		= XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING,
 
 	.dup_cursor		= xfs_rmapbt_dup_cursor,
 	.set_root		= xfs_rmapbt_set_root,
@@ -519,7 +520,6 @@ xfs_rmapbt_init_common(
 	/* Overlapping btree; 2 keys per pointer. */
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP, &xfs_rmapbt_ops,
 			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
-	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 
 	cur->bc_ag.pag = xfs_perag_bump(pag);
@@ -613,6 +613,8 @@ static const struct xfs_buf_ops xfs_rmapbt_mem_buf_ops = {
 static const struct xfs_btree_ops xfs_rmapbt_mem_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
 	.key_len		= 2 * sizeof(struct xfs_rmap_key),
+	.geom_flags		= XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING |
+				  XFS_BTREE_IN_MEMORY,
 
 	.dup_cursor		= xfbtree_dup_cursor,
 	.set_root		= xfbtree_set_root,
@@ -647,8 +649,6 @@ xfs_rmapbt_mem_cursor(
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
 			&xfs_rmapbt_mem_ops, mp->m_rmap_maxlevels,
 			xfs_rmapbt_cur_cache);
-	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING |
-			XFS_BTREE_IN_MEMORY;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_mem.xfbtree = xfbtree;
 	cur->bc_mem.head_bp = head_bp;



* [PATCH 3/9] xfs: export some of the btree ops structures
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 8/9] xfs: set btree block buffer ops in _init_buf Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 9/9] xfs: remove unnecessary fields in xfbtree_config Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Export these btree ops structures so that we can reference them in the
AG initialization code in the next patch.
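
For reference, one entry of the AG header init table as it looks after
the next patch, which is where these exported ops get used:

	{ /* CNT root block */
		.daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_CNT_BLOCK(mp)),
		.numblks = BTOBB(mp->m_sb.sb_blocksize),
		.ops = &xfs_cntbt_buf_ops,
		.work = &xfs_bnoroot_init,
		.bc_ops = &xfs_cntbt_ops,
		.need_init = true
	},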

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    4 ++--
 fs/xfs/libxfs/xfs_bmap_btree.c     |    2 +-
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    4 ++--
 fs/xfs/libxfs/xfs_refcount_btree.c |    2 +-
 fs/xfs/libxfs/xfs_rmap_btree.c     |    2 +-
 fs/xfs/libxfs/xfs_shared.h         |    9 +++++++++
 6 files changed, 16 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 5505cbb75cb6..6f17fee31872 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -454,7 +454,7 @@ xfs_allocbt_keys_contiguous(
 				 be32_to_cpu(key2->alloc.ar_startblock));
 }
 
-static const struct xfs_btree_ops xfs_bnobt_ops = {
+const struct xfs_btree_ops xfs_bnobt_ops = {
 	.rec_len		= sizeof(xfs_alloc_rec_t),
 	.key_len		= sizeof(xfs_alloc_key_t),
 
@@ -477,7 +477,7 @@ static const struct xfs_btree_ops xfs_bnobt_ops = {
 	.keys_contiguous	= xfs_allocbt_keys_contiguous,
 };
 
-static const struct xfs_btree_ops xfs_cntbt_ops = {
+const struct xfs_btree_ops xfs_cntbt_ops = {
 	.rec_len		= sizeof(xfs_alloc_rec_t),
 	.key_len		= sizeof(xfs_alloc_key_t),
 	.geom_flags		= XFS_BTREE_LASTREC_UPDATE,
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 60bd5f94de3e..82a1e06aeac6 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -529,7 +529,7 @@ xfs_bmbt_keys_contiguous(
 				 be64_to_cpu(key2->bmbt.br_startoff));
 }
 
-static const struct xfs_btree_ops xfs_bmbt_ops = {
+const struct xfs_btree_ops xfs_bmbt_ops = {
 	.rec_len		= sizeof(xfs_bmbt_rec_t),
 	.key_len		= sizeof(xfs_bmbt_key_t),
 	.geom_flags		= XFS_BTREE_LONG_PTRS | XFS_BTREE_ROOT_IN_INODE,
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 5a59e105c801..84094e326a6e 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -399,7 +399,7 @@ xfs_inobt_keys_contiguous(
 				 be32_to_cpu(key2->inobt.ir_startino));
 }
 
-static const struct xfs_btree_ops xfs_inobt_ops = {
+const struct xfs_btree_ops xfs_inobt_ops = {
 	.rec_len		= sizeof(xfs_inobt_rec_t),
 	.key_len		= sizeof(xfs_inobt_key_t),
 
@@ -421,7 +421,7 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
 	.keys_contiguous	= xfs_inobt_keys_contiguous,
 };
 
-static const struct xfs_btree_ops xfs_finobt_ops = {
+const struct xfs_btree_ops xfs_finobt_ops = {
 	.rec_len		= sizeof(xfs_inobt_rec_t),
 	.key_len		= sizeof(xfs_inobt_key_t),
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 8ba9768c0b3b..b037fd949f8c 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -323,7 +323,7 @@ xfs_refcountbt_keys_contiguous(
 				 be32_to_cpu(key2->refc.rc_startblock));
 }
 
-static const struct xfs_btree_ops xfs_refcountbt_ops = {
+const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.rec_len		= sizeof(struct xfs_refcount_rec),
 	.key_len		= sizeof(struct xfs_refcount_key),
 
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 0040e2620c24..31dc40358bf9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -486,7 +486,7 @@ xfs_rmapbt_keys_contiguous(
 				 be32_to_cpu(key2->rmap.rm_startblock));
 }
 
-static const struct xfs_btree_ops xfs_rmapbt_ops = {
+const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
 	.key_len		= 2 * sizeof(struct xfs_rmap_key),
 	.geom_flags		= XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING,
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 57a52fa76a49..d1b3f210326e 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -43,6 +43,15 @@ extern const struct xfs_buf_ops xfs_sb_buf_ops;
 extern const struct xfs_buf_ops xfs_sb_quiet_buf_ops;
 extern const struct xfs_buf_ops xfs_symlink_buf_ops;
 
+/* btree ops */
+extern const struct xfs_btree_ops xfs_bnobt_ops;
+extern const struct xfs_btree_ops xfs_cntbt_ops;
+extern const struct xfs_btree_ops xfs_inobt_ops;
+extern const struct xfs_btree_ops xfs_finobt_ops;
+extern const struct xfs_btree_ops xfs_bmbt_ops;
+extern const struct xfs_btree_ops xfs_refcountbt_ops;
+extern const struct xfs_btree_ops xfs_rmapbt_ops;
+
 /* log size calculation functions */
 int	xfs_log_calc_unit_res(struct xfs_mount *mp, int unit_bytes);
 int	xfs_log_calc_minimum_size(struct xfs_mount *);



* [PATCH 5/9] xfs: rename btree block/buffer init functions
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 4/9] xfs: initialize btree blocks using btree_ops structure Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 6/9] xfs: btree convert xfs_btree_init_block to xfs_btree_init_buf calls Darrick J. Wong
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Rename xfs_btree_init_block_int to xfs_btree_init_block, and
xfs_btree_init_block to xfs_btree_init_buf so that the name suggests the
type that callers are supposed to pass in.
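
A quick before/after of the two call patterns, taken from the hunks
below:

	/* buffer-based init: xfs_btree_init_block -> xfs_btree_init_buf */
	xfs_btree_init_buf(mp, bp, id->bc_ops, 0, 1, id->agno);

	/* block-based init: xfs_btree_init_block_int -> xfs_btree_init_block */
	xfs_btree_init_block(mp, block, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL, 1,
			1, ip->i_ino);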

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c            |    6 +++---
 fs/xfs/libxfs/xfs_bmap.c          |    6 +++---
 fs/xfs/libxfs/xfs_bmap_btree.c    |    2 +-
 fs/xfs/libxfs/xfs_btree.c         |    8 ++++----
 fs/xfs/libxfs/xfs_btree.h         |    4 ++--
 fs/xfs/libxfs/xfs_btree_staging.c |    2 +-
 fs/xfs/scrub/xfbtree.c            |    2 +-
 7 files changed, 15 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index f9e9e6879d53..05d0a97e08c3 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -412,7 +412,7 @@ xfs_btroot_init(
 	struct xfs_buf		*bp,
 	struct aghdr_init_data	*id)
 {
-	xfs_btree_init_block(mp, bp, id->bc_ops, 0, 0, id->agno);
+	xfs_btree_init_buf(mp, bp, id->bc_ops, 0, 0, id->agno);
 }
 
 /* Finish initializing a free space btree. */
@@ -479,7 +479,7 @@ xfs_bnoroot_init(
 	struct xfs_buf		*bp,
 	struct aghdr_init_data	*id)
 {
-	xfs_btree_init_block(mp, bp, id->bc_ops, 0, 1, id->agno);
+	xfs_btree_init_buf(mp, bp, id->bc_ops, 0, 1, id->agno);
 	xfs_freesp_init_recs(mp, bp, id);
 }
 
@@ -495,7 +495,7 @@ xfs_rmaproot_init(
 	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
 	struct xfs_rmap_rec	*rrec;
 
-	xfs_btree_init_block(mp, bp, id->bc_ops, 0, 4, id->agno);
+	xfs_btree_init_buf(mp, bp, id->bc_ops, 0, 4, id->agno);
 
 	/*
 	 * mark the AG header regions as static metadata The BNO
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b91f273ccbec..3ff3202e6e91 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -639,8 +639,8 @@ xfs_bmap_extents_to_btree(
 	 * Fill in the root.
 	 */
 	block = ifp->if_broot;
-	xfs_btree_init_block_int(mp, block, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL,
-			1, 1, ip->i_ino);
+	xfs_btree_init_block(mp, block, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL, 1,
+			1, ip->i_ino);
 	/*
 	 * Need a cursor.  Can't allocate until bb_level is filled in.
 	 */
@@ -696,7 +696,7 @@ xfs_bmap_extents_to_btree(
 	 */
 	abp->b_ops = &xfs_bmbt_buf_ops;
 	ablock = XFS_BUF_TO_BLOCK(abp);
-	xfs_btree_init_block_int(mp, ablock, &xfs_bmbt_ops, xfs_buf_daddr(abp),
+	xfs_btree_init_block(mp, ablock, &xfs_bmbt_ops, xfs_buf_daddr(abp),
 			0, 0, ip->i_ino);
 
 	for_each_xfs_iext(ifp, &icur, &rec) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 2cf6459b7bca..f70194293f54 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -43,7 +43,7 @@ xfs_bmdr_to_bmbt(
 	xfs_bmbt_key_t		*tkp;
 	__be64			*tpp;
 
-	xfs_btree_init_block_int(mp, rblock, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL,
+	xfs_btree_init_block(mp, rblock, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL,
 			0, 0, ip->i_ino);
 	rblock->bb_level = dblock->bb_level;
 	ASSERT(be16_to_cpu(rblock->bb_level) > 0);
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 78c18c027575..fe2f21fa7b21 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1213,7 +1213,7 @@ xfs_btree_set_sibling(
 }
 
 void
-xfs_btree_init_block_int(
+xfs_btree_init_block(
 	struct xfs_mount	*mp,
 	struct xfs_btree_block	*buf,
 	const struct xfs_btree_ops *ops,
@@ -1255,7 +1255,7 @@ xfs_btree_init_block_int(
 }
 
 void
-xfs_btree_init_block(
+xfs_btree_init_buf(
 	struct xfs_mount		*mp,
 	struct xfs_buf			*bp,
 	const struct xfs_btree_ops	*ops,
@@ -1263,7 +1263,7 @@ xfs_btree_init_block(
 	__u16				numrecs,
 	__u64				owner)
 {
-	xfs_btree_init_block_int(mp, XFS_BUF_TO_BLOCK(bp), ops,
+	xfs_btree_init_block(mp, XFS_BUF_TO_BLOCK(bp), ops,
 			xfs_buf_daddr(bp), level, numrecs, owner);
 }
 
@@ -1289,7 +1289,7 @@ xfs_btree_init_block_cur(
 	else
 		owner = cur->bc_ag.pag->pag_agno;
 
-	xfs_btree_init_block_int(cur->bc_mp, XFS_BUF_TO_BLOCK(bp), cur->bc_ops,
+	xfs_btree_init_block(cur->bc_mp, XFS_BUF_TO_BLOCK(bp), cur->bc_ops,
 			xfs_buf_daddr(bp), level, numrecs, owner);
 }
 
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 3145d7e61cb4..5557aa4148e6 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -451,10 +451,10 @@ xfs_btree_reada_bufs(
 /*
  * Initialise a new btree block header
  */
-void xfs_btree_init_block(struct xfs_mount *mp, struct xfs_buf *bp,
+void xfs_btree_init_buf(struct xfs_mount *mp, struct xfs_buf *bp,
 		const struct xfs_btree_ops *ops, __u16 level, __u16 numrecs,
 		__u64 owner);
-void xfs_btree_init_block_int(struct xfs_mount *mp,
+void xfs_btree_init_block(struct xfs_mount *mp,
 		struct xfs_btree_block *buf, const struct xfs_btree_ops *ops,
 		xfs_daddr_t blkno, __u16 level, __u16 numrecs, __u64 owner);
 
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index 0bf20472dd27..de17d333ffb3 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -404,7 +404,7 @@ xfs_btree_bload_prep_block(
 		ifp->if_broot_bytes = (int)new_size;
 
 		/* Initialize it and send it out. */
-		xfs_btree_init_block_int(cur->bc_mp, ifp->if_broot,
+		xfs_btree_init_block(cur->bc_mp, ifp->if_broot,
 				cur->bc_ops, XFS_BUF_DADDR_NULL, level,
 				nr_this_block, cur->bc_ino.ip->i_ino);
 
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index 052fbc1086dc..95cbdd6738ec 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -416,7 +416,7 @@ xfbtree_init_leaf_block(
 	trace_xfbtree_create_root_buf(xfbt, bp);
 
 	bp->b_ops = cfg->btree_ops->buf_ops;
-	xfs_btree_init_block_int(mp, bp->b_addr, cfg->btree_ops, daddr, 0, 0,
+	xfs_btree_init_block(mp, bp->b_addr, cfg->btree_ops, daddr, 0, 0,
 			cfg->owner);
 	error = xfs_bwrite(bp);
 	xfs_buf_relse(bp);



* [PATCH 4/9] xfs: initialize btree blocks using btree_ops structure
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 2/9] xfs: encode the default bc_flags in the btree ops structure Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 5/9] xfs: rename btree block/buffer init functions Darrick J. Wong
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that the btree ops structure encodes the btree geometry flags and
(via the buffer ops) the magic numbers, refactor the btree block
initialization functions to use the btree ops so that we no longer have
to open-code all of that.
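
The key enabler is that the magic number can now be derived from the
ops; condensed from the xfs_btree.c hunk below:

	uint32_t
	xfs_btree_magic(
		struct xfs_mount		*mp,
		const struct xfs_btree_ops	*ops)
	{
		int	idx = xfs_has_crc(mp) ? 1 : 0;
		__be32	magic = ops->buf_ops->magic[idx];

		/* Ensure we asked for crc for crc-only magics. */
		ASSERT(magic != 0);
		return be32_to_cpu(magic);
	}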

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c            |   33 +++++++++--------------
 fs/xfs/libxfs/xfs_ag.h            |    2 +
 fs/xfs/libxfs/xfs_bmap.c          |   10 +++----
 fs/xfs/libxfs/xfs_bmap_btree.c    |    5 +--
 fs/xfs/libxfs/xfs_btree.c         |   53 +++++++++++++++----------------------
 fs/xfs/libxfs/xfs_btree.h         |   28 ++++++--------------
 fs/xfs/libxfs/xfs_btree_staging.c |    5 +--
 fs/xfs/scrub/xfbtree.c            |    8 +-----
 8 files changed, 53 insertions(+), 91 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 58d485a76150..f9e9e6879d53 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -412,7 +412,7 @@ xfs_btroot_init(
 	struct xfs_buf		*bp,
 	struct aghdr_init_data	*id)
 {
-	xfs_btree_init_block(mp, bp, id->type, 0, 0, id->agno);
+	xfs_btree_init_block(mp, bp, id->bc_ops, 0, 0, id->agno);
 }
 
 /* Finish initializing a free space btree. */
@@ -471,7 +471,7 @@ xfs_freesp_init_recs(
 }
 
 /*
- * Alloc btree root block init functions
+ * bnobt/cntbt btree root block init functions
  */
 static void
 xfs_bnoroot_init(
@@ -479,17 +479,7 @@ xfs_bnoroot_init(
 	struct xfs_buf		*bp,
 	struct aghdr_init_data	*id)
 {
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_BNO, 0, 1, id->agno);
-	xfs_freesp_init_recs(mp, bp, id);
-}
-
-static void
-xfs_cntroot_init(
-	struct xfs_mount	*mp,
-	struct xfs_buf		*bp,
-	struct aghdr_init_data	*id)
-{
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_CNT, 0, 1, id->agno);
+	xfs_btree_init_block(mp, bp, id->bc_ops, 0, 1, id->agno);
 	xfs_freesp_init_recs(mp, bp, id);
 }
 
@@ -505,7 +495,7 @@ xfs_rmaproot_init(
 	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
 	struct xfs_rmap_rec	*rrec;
 
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_RMAP, 0, 4, id->agno);
+	xfs_btree_init_block(mp, bp, id->bc_ops, 0, 4, id->agno);
 
 	/*
 	 * mark the AG header regions as static metadata The BNO
@@ -718,7 +708,7 @@ struct xfs_aghdr_grow_data {
 	size_t			numblks;
 	const struct xfs_buf_ops *ops;
 	aghdr_init_work_f	work;
-	xfs_btnum_t		type;
+	const struct xfs_btree_ops *bc_ops;
 	bool			need_init;
 };
 
@@ -772,13 +762,15 @@ xfs_ag_init_headers(
 		.numblks = BTOBB(mp->m_sb.sb_blocksize),
 		.ops = &xfs_bnobt_buf_ops,
 		.work = &xfs_bnoroot_init,
+		.bc_ops = &xfs_bnobt_ops,
 		.need_init = true
 	},
 	{ /* CNT root block */
 		.daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_CNT_BLOCK(mp)),
 		.numblks = BTOBB(mp->m_sb.sb_blocksize),
 		.ops = &xfs_cntbt_buf_ops,
-		.work = &xfs_cntroot_init,
+		.work = &xfs_bnoroot_init,
+		.bc_ops = &xfs_cntbt_ops,
 		.need_init = true
 	},
 	{ /* INO root block */
@@ -786,7 +778,7 @@ xfs_ag_init_headers(
 		.numblks = BTOBB(mp->m_sb.sb_blocksize),
 		.ops = &xfs_inobt_buf_ops,
 		.work = &xfs_btroot_init,
-		.type = XFS_BTNUM_INO,
+		.bc_ops = &xfs_inobt_ops,
 		.need_init = true
 	},
 	{ /* FINO root block */
@@ -794,7 +786,7 @@ xfs_ag_init_headers(
 		.numblks = BTOBB(mp->m_sb.sb_blocksize),
 		.ops = &xfs_finobt_buf_ops,
 		.work = &xfs_btroot_init,
-		.type = XFS_BTNUM_FINO,
+		.bc_ops = &xfs_finobt_ops,
 		.need_init =  xfs_has_finobt(mp)
 	},
 	{ /* RMAP root block */
@@ -802,6 +794,7 @@ xfs_ag_init_headers(
 		.numblks = BTOBB(mp->m_sb.sb_blocksize),
 		.ops = &xfs_rmapbt_buf_ops,
 		.work = &xfs_rmaproot_init,
+		.bc_ops = &xfs_rmapbt_ops,
 		.need_init = xfs_has_rmapbt(mp)
 	},
 	{ /* REFC root block */
@@ -809,7 +802,7 @@ xfs_ag_init_headers(
 		.numblks = BTOBB(mp->m_sb.sb_blocksize),
 		.ops = &xfs_refcountbt_buf_ops,
 		.work = &xfs_btroot_init,
-		.type = XFS_BTNUM_REFC,
+		.bc_ops = &xfs_refcountbt_ops,
 		.need_init = xfs_has_reflink(mp)
 	},
 	{ /* NULL terminating block */
@@ -827,7 +820,7 @@ xfs_ag_init_headers(
 
 		id->daddr = dp->daddr;
 		id->numblks = dp->numblks;
-		id->type = dp->type;
+		id->bc_ops = dp->bc_ops;
 		error = xfs_ag_init_hdr(mp, id, dp->work, dp->ops);
 		if (error)
 			break;
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 6d5ddefa321e..654f58aa831f 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -251,7 +251,7 @@ struct aghdr_init_data {
 	/* per header data */
 	xfs_daddr_t		daddr;		/* header location */
 	size_t			numblks;	/* size of header */
-	xfs_btnum_t		type;		/* type of btree root block */
+	const struct xfs_btree_ops *bc_ops;	/* btree ops */
 };
 
 int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 89fd34fbae7a..b91f273ccbec 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -639,9 +639,8 @@ xfs_bmap_extents_to_btree(
 	 * Fill in the root.
 	 */
 	block = ifp->if_broot;
-	xfs_btree_init_block_int(mp, block, XFS_BUF_DADDR_NULL,
-				 XFS_BTNUM_BMAP, 1, 1, ip->i_ino,
-				 XFS_BTREE_LONG_PTRS);
+	xfs_btree_init_block_int(mp, block, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL,
+			1, 1, ip->i_ino);
 	/*
 	 * Need a cursor.  Can't allocate until bb_level is filled in.
 	 */
@@ -697,9 +696,8 @@ xfs_bmap_extents_to_btree(
 	 */
 	abp->b_ops = &xfs_bmbt_buf_ops;
 	ablock = XFS_BUF_TO_BLOCK(abp);
-	xfs_btree_init_block_int(mp, ablock, xfs_buf_daddr(abp),
-				XFS_BTNUM_BMAP, 0, 0, ip->i_ino,
-				XFS_BTREE_LONG_PTRS);
+	xfs_btree_init_block_int(mp, ablock, &xfs_bmbt_ops, xfs_buf_daddr(abp),
+			0, 0, ip->i_ino);
 
 	for_each_xfs_iext(ifp, &icur, &rec) {
 		if (isnullstartblock(rec.br_startblock))
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 82a1e06aeac6..2cf6459b7bca 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -43,9 +43,8 @@ xfs_bmdr_to_bmbt(
 	xfs_bmbt_key_t		*tkp;
 	__be64			*tpp;
 
-	xfs_btree_init_block_int(mp, rblock, XFS_BUF_DADDR_NULL,
-				 XFS_BTNUM_BMAP, 0, 0, ip->i_ino,
-				 XFS_BTREE_LONG_PTRS);
+	xfs_btree_init_block_int(mp, rblock, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL,
+			0, 0, ip->i_ino);
 	rblock->bb_level = dblock->bb_level;
 	ASSERT(be16_to_cpu(rblock->bb_level) > 0);
 	rblock->bb_numrecs = dblock->bb_numrecs;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 4c5b4d26cd1b..78c18c027575 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -35,24 +35,17 @@
 /*
  * Btree magic numbers.
  */
-static const uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
-	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, 0, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
-	  XFS_FIBT_MAGIC, 0 },
-	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC, XFS_RMAP_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC,
-	  XFS_REFC_CRC_MAGIC }
-};
-
 uint32_t
 xfs_btree_magic(
-	int			crc,
-	xfs_btnum_t		btnum)
+	struct xfs_mount		*mp,
+	const struct xfs_btree_ops	*ops)
 {
-	uint32_t		magic = xfs_magics[crc][btnum];
+	int				idx = xfs_has_crc(mp) ? 1 : 0;
+	__be32				magic = ops->buf_ops->magic[idx];
 
 	/* Ensure we asked for crc for crc-only magics. */
 	ASSERT(magic != 0);
-	return magic;
+	return be32_to_cpu(magic);
 }
 
 /*
@@ -137,7 +130,6 @@ __xfs_btree_check_lblock(
 	struct xfs_buf		*bp)
 {
 	struct xfs_mount	*mp = cur->bc_mp;
-	xfs_btnum_t		btnum = cur->bc_btnum;
 	int			crc = xfs_has_crc(mp);
 	xfs_failaddr_t		fa;
 	xfs_fsblock_t		fsb = NULLFSBLOCK;
@@ -152,7 +144,7 @@ __xfs_btree_check_lblock(
 			return __this_address;
 	}
 
-	if (be32_to_cpu(block->bb_magic) != xfs_btree_magic(crc, btnum))
+	if (be32_to_cpu(block->bb_magic) != xfs_btree_magic(mp, cur->bc_ops))
 		return __this_address;
 	if (be16_to_cpu(block->bb_level) != level)
 		return __this_address;
@@ -208,7 +200,6 @@ __xfs_btree_check_sblock(
 {
 	struct xfs_mount	*mp = cur->bc_mp;
 	struct xfs_perag	*pag = cur->bc_ag.pag;
-	xfs_btnum_t		btnum = cur->bc_btnum;
 	int			crc = xfs_has_crc(mp);
 	xfs_failaddr_t		fa;
 	xfs_agblock_t		agbno = NULLAGBLOCK;
@@ -221,7 +212,7 @@ __xfs_btree_check_sblock(
 			return __this_address;
 	}
 
-	if (be32_to_cpu(block->bb_magic) != xfs_btree_magic(crc, btnum))
+	if (be32_to_cpu(block->bb_magic) != xfs_btree_magic(mp, cur->bc_ops))
 		return __this_address;
 	if (be16_to_cpu(block->bb_level) != level)
 		return __this_address;
@@ -1225,21 +1216,20 @@ void
 xfs_btree_init_block_int(
 	struct xfs_mount	*mp,
 	struct xfs_btree_block	*buf,
+	const struct xfs_btree_ops *ops,
 	xfs_daddr_t		blkno,
-	xfs_btnum_t		btnum,
 	__u16			level,
 	__u16			numrecs,
-	__u64			owner,
-	unsigned int		flags)
+	__u64			owner)
 {
 	int			crc = xfs_has_crc(mp);
-	__u32			magic = xfs_btree_magic(crc, btnum);
+	__u32			magic = xfs_btree_magic(mp, ops);
 
 	buf->bb_magic = cpu_to_be32(magic);
 	buf->bb_level = cpu_to_be16(level);
 	buf->bb_numrecs = cpu_to_be16(numrecs);
 
-	if (flags & XFS_BTREE_LONG_PTRS) {
+	if (ops->geom_flags & XFS_BTREE_LONG_PTRS) {
 		buf->bb_u.l.bb_leftsib = cpu_to_be64(NULLFSBLOCK);
 		buf->bb_u.l.bb_rightsib = cpu_to_be64(NULLFSBLOCK);
 		if (crc) {
@@ -1266,15 +1256,15 @@ xfs_btree_init_block_int(
 
 void
 xfs_btree_init_block(
-	struct xfs_mount *mp,
-	struct xfs_buf	*bp,
-	xfs_btnum_t	btnum,
-	__u16		level,
-	__u16		numrecs,
-	__u64		owner)
+	struct xfs_mount		*mp,
+	struct xfs_buf			*bp,
+	const struct xfs_btree_ops	*ops,
+	__u16				level,
+	__u16				numrecs,
+	__u64				owner)
 {
-	xfs_btree_init_block_int(mp, XFS_BUF_TO_BLOCK(bp), xfs_buf_daddr(bp),
-				 btnum, level, numrecs, owner, 0);
+	xfs_btree_init_block_int(mp, XFS_BUF_TO_BLOCK(bp), ops,
+			xfs_buf_daddr(bp), level, numrecs, owner);
 }
 
 void
@@ -1299,9 +1289,8 @@ xfs_btree_init_block_cur(
 	else
 		owner = cur->bc_ag.pag->pag_agno;
 
-	xfs_btree_init_block_int(cur->bc_mp, XFS_BUF_TO_BLOCK(bp),
-				xfs_buf_daddr(bp), cur->bc_btnum, level,
-				numrecs, owner, cur->bc_flags);
+	xfs_btree_init_block_int(cur->bc_mp, XFS_BUF_TO_BLOCK(bp), cur->bc_ops,
+			xfs_buf_daddr(bp), level, numrecs, owner);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index f23d12626a68..3145d7e61cb4 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -63,7 +63,8 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
 #define	XFS_BTNUM_REFC	((xfs_btnum_t)XFS_BTNUM_REFCi)
 
-uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
+struct xfs_btree_ops;
+uint32_t xfs_btree_magic(struct xfs_mount *mp, const struct xfs_btree_ops *ops);
 
 /*
  * For logging record fields.
@@ -450,25 +451,12 @@ xfs_btree_reada_bufs(
 /*
  * Initialise a new btree block header
  */
-void
-xfs_btree_init_block(
-	struct xfs_mount *mp,
-	struct xfs_buf	*bp,
-	xfs_btnum_t	btnum,
-	__u16		level,
-	__u16		numrecs,
-	__u64		owner);
-
-void
-xfs_btree_init_block_int(
-	struct xfs_mount	*mp,
-	struct xfs_btree_block	*buf,
-	xfs_daddr_t		blkno,
-	xfs_btnum_t		btnum,
-	__u16			level,
-	__u16			numrecs,
-	__u64			owner,
-	unsigned int		flags);
+void xfs_btree_init_block(struct xfs_mount *mp, struct xfs_buf *bp,
+		const struct xfs_btree_ops *ops, __u16 level, __u16 numrecs,
+		__u64 owner);
+void xfs_btree_init_block_int(struct xfs_mount *mp,
+		struct xfs_btree_block *buf, const struct xfs_btree_ops *ops,
+		xfs_daddr_t blkno, __u16 level, __u16 numrecs, __u64 owner);
 
 /*
  * Common btree core entry points.
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index 4cdf7976b7bf..0bf20472dd27 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -405,9 +405,8 @@ xfs_btree_bload_prep_block(
 
 		/* Initialize it and send it out. */
 		xfs_btree_init_block_int(cur->bc_mp, ifp->if_broot,
-				XFS_BUF_DADDR_NULL, cur->bc_btnum, level,
-				nr_this_block, cur->bc_ino.ip->i_ino,
-				cur->bc_flags);
+				cur->bc_ops, XFS_BUF_DADDR_NULL, level,
+				nr_this_block, cur->bc_ino.ip->i_ino);
 
 		*bpp = NULL;
 		*blockp = ifp->if_broot;
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index 4436522705e5..052fbc1086dc 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -407,10 +407,6 @@ xfbtree_init_leaf_block(
 	struct xfs_buf			*bp;
 	xfs_daddr_t			daddr;
 	int				error;
-	unsigned int			bc_flags = 0;
-
-	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
-		bc_flags |= XFS_BTREE_LONG_PTRS;
 
 	daddr = xfo_to_daddr(XFBTREE_INIT_LEAF_BLOCK);
 	error = xfs_buf_get(xfbt->target, daddr, xfbtree_bbsize(), &bp);
@@ -420,8 +416,8 @@ xfbtree_init_leaf_block(
 	trace_xfbtree_create_root_buf(xfbt, bp);
 
 	bp->b_ops = cfg->btree_ops->buf_ops;
-	xfs_btree_init_block_int(mp, bp->b_addr, daddr, cfg->btnum, 0, 0,
-			cfg->owner, bc_flags);
+	xfs_btree_init_block_int(mp, bp->b_addr, cfg->btree_ops, daddr, 0, 0,
+			cfg->owner);
 	error = xfs_bwrite(bp);
 	xfs_buf_relse(bp);
 	if (error)



* [PATCH 6/9] xfs: btree convert xfs_btree_init_block to xfs_btree_init_buf calls
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 5/9] xfs: rename btree block/buffer init functions Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 8/9] xfs: set btree block buffer ops in _init_buf Darrick J. Wong
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Convert any place we call xfs_btree_init_block with a buffer to use the
_init_buf function.
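
In other words, the pattern being replaced (from the xfs_bmap.c hunk
below):

	/* before */
	xfs_btree_init_block(mp, ablock, &xfs_bmbt_ops, xfs_buf_daddr(abp),
			0, 0, ip->i_ino);

	/* after */
	xfs_btree_init_buf(mp, abp, &xfs_bmbt_ops, 0, 0, ip->i_ino);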

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c  |    3 +--
 fs/xfs/libxfs/xfs_btree.c |    3 +--
 fs/xfs/scrub/xfbtree.c    |    3 +--
 3 files changed, 3 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3ff3202e6e91..ac20971a8c18 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -696,8 +696,7 @@ xfs_bmap_extents_to_btree(
 	 */
 	abp->b_ops = &xfs_bmbt_buf_ops;
 	ablock = XFS_BUF_TO_BLOCK(abp);
-	xfs_btree_init_block(mp, ablock, &xfs_bmbt_ops, xfs_buf_daddr(abp),
-			0, 0, ip->i_ino);
+	xfs_btree_init_buf(mp, abp, &xfs_bmbt_ops, 0, 0, ip->i_ino);
 
 	for_each_xfs_iext(ifp, &icur, &rec) {
 		if (isnullstartblock(rec.br_startblock))
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index fe2f21fa7b21..b3f9b7324c93 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1289,8 +1289,7 @@ xfs_btree_init_block_cur(
 	else
 		owner = cur->bc_ag.pag->pag_agno;
 
-	xfs_btree_init_block(cur->bc_mp, XFS_BUF_TO_BLOCK(bp), cur->bc_ops,
-			xfs_buf_daddr(bp), level, numrecs, owner);
+	xfs_btree_init_buf(cur->bc_mp, bp, cur->bc_ops, level, numrecs, owner);
 }
 
 /*
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index 95cbdd6738ec..d42d1bc8e36b 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -416,8 +416,7 @@ xfbtree_init_leaf_block(
 	trace_xfbtree_create_root_buf(xfbt, bp);
 
 	bp->b_ops = cfg->btree_ops->buf_ops;
-	xfs_btree_init_block(mp, bp->b_addr, cfg->btree_ops, daddr, 0, 0,
-			cfg->owner);
+	xfs_btree_init_buf(mp, bp, cfg->btree_ops, 0, 0, cfg->owner);
 	error = xfs_bwrite(bp);
 	xfs_buf_relse(bp);
 	if (error)



* [PATCH 7/9] xfs: remove the unnecessary daddr parameter to _init_block
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/9] xfs: set the btree cursor bc_ops in xfs_btree_alloc_cursor Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/9] xfs: encode the default bc_flags in the btree ops structure Darrick J. Wong
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that all of the callers pass XFS_BUF_DADDR_NULL as the daddr
parameter, we can elide that too.
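
With that, the block-based initializer becomes a thin wrapper; condensed
from the hunk below:

	void
	xfs_btree_init_block(
		struct xfs_mount	*mp,
		struct xfs_btree_block	*block,
		const struct xfs_btree_ops *ops,
		__u16			level,
		__u16			numrecs,
		__u64			owner)
	{
		__xfs_btree_init_block(mp, block, ops, XFS_BUF_DADDR_NULL,
				level, numrecs, owner);
	}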

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c          |    3 +--
 fs/xfs/libxfs/xfs_bmap_btree.c    |    3 +--
 fs/xfs/libxfs/xfs_btree.c         |   19 ++++++++++++++++---
 fs/xfs/libxfs/xfs_btree.h         |    2 +-
 fs/xfs/libxfs/xfs_btree_staging.c |    5 ++---
 5 files changed, 21 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ac20971a8c18..ee94c935702d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -639,8 +639,7 @@ xfs_bmap_extents_to_btree(
 	 * Fill in the root.
 	 */
 	block = ifp->if_broot;
-	xfs_btree_init_block(mp, block, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL, 1,
-			1, ip->i_ino);
+	xfs_btree_init_block(mp, block, &xfs_bmbt_ops, 1, 1, ip->i_ino);
 	/*
 	 * Need a cursor.  Can't allocate until bb_level is filled in.
 	 */
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index f70194293f54..2f9202ed41dd 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -43,8 +43,7 @@ xfs_bmdr_to_bmbt(
 	xfs_bmbt_key_t		*tkp;
 	__be64			*tpp;
 
-	xfs_btree_init_block(mp, rblock, &xfs_bmbt_ops, XFS_BUF_DADDR_NULL,
-			0, 0, ip->i_ino);
+	xfs_btree_init_block(mp, rblock, &xfs_bmbt_ops, 0, 0, ip->i_ino);
 	rblock->bb_level = dblock->bb_level;
 	ASSERT(be16_to_cpu(rblock->bb_level) > 0);
 	rblock->bb_numrecs = dblock->bb_numrecs;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index b3f9b7324c93..54afec1f9121 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1212,8 +1212,8 @@ xfs_btree_set_sibling(
 	}
 }
 
-void
-xfs_btree_init_block(
+static void
+__xfs_btree_init_block(
 	struct xfs_mount	*mp,
 	struct xfs_btree_block	*buf,
 	const struct xfs_btree_ops *ops,
@@ -1254,6 +1254,19 @@ xfs_btree_init_block(
 	}
 }
 
+void
+xfs_btree_init_block(
+	struct xfs_mount	*mp,
+	struct xfs_btree_block	*block,
+	const struct xfs_btree_ops *ops,
+	__u16			level,
+	__u16			numrecs,
+	__u64			owner)
+{
+	__xfs_btree_init_block(mp, block, ops, XFS_BUF_DADDR_NULL, level,
+			numrecs, owner);
+}
+
 void
 xfs_btree_init_buf(
 	struct xfs_mount		*mp,
@@ -1263,7 +1276,7 @@ xfs_btree_init_buf(
 	__u16				numrecs,
 	__u64				owner)
 {
-	xfs_btree_init_block(mp, XFS_BUF_TO_BLOCK(bp), ops,
+	__xfs_btree_init_block(mp, XFS_BUF_TO_BLOCK(bp), ops,
 			xfs_buf_daddr(bp), level, numrecs, owner);
 }
 
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 5557aa4148e6..451263e77144 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -456,7 +456,7 @@ void xfs_btree_init_buf(struct xfs_mount *mp, struct xfs_buf *bp,
 		__u64 owner);
 void xfs_btree_init_block(struct xfs_mount *mp,
 		struct xfs_btree_block *buf, const struct xfs_btree_ops *ops,
-		xfs_daddr_t blkno, __u16 level, __u16 numrecs, __u64 owner);
+		__u16 level, __u16 numrecs, __u64 owner);
 
 /*
  * Common btree core entry points.
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index de17d333ffb3..73d9aaeafead 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -404,9 +404,8 @@ xfs_btree_bload_prep_block(
 		ifp->if_broot_bytes = (int)new_size;
 
 		/* Initialize it and send it out. */
-		xfs_btree_init_block(cur->bc_mp, ifp->if_broot,
-				cur->bc_ops, XFS_BUF_DADDR_NULL, level,
-				nr_this_block, cur->bc_ino.ip->i_ino);
+		xfs_btree_init_block(cur->bc_mp, ifp->if_broot, cur->bc_ops,
+				level, nr_this_block, cur->bc_ino.ip->i_ino);
 
 		*bpp = NULL;
 		*blockp = ifp->if_broot;



* [PATCH 8/9] xfs: set btree block buffer ops in _init_buf
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 6/9] xfs: btree convert xfs_btree_init_block to xfs_btree_init_buf calls Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/9] xfs: export some of the btree ops structures Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 9/9] xfs: remove unnecessary fields in xfbtree_config Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set the btree block buffer ops in xfs_btree_init_buf since we already
have access to that information through the btree ops.
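
In practice this means callers drop the manual assignment; paraphrasing
the hunks below:

	/* callers no longer need to do this before the init call: */
	abp->b_ops = &xfs_bmbt_buf_ops;

	/* because xfs_btree_init_buf() now finishes with: */
	bp->b_ops = ops->buf_ops;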

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c  |    1 -
 fs/xfs/libxfs/xfs_btree.c |    1 +
 fs/xfs/scrub/xfbtree.c    |    1 -
 3 files changed, 1 insertion(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ee94c935702d..9b99669be43d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -693,7 +693,6 @@ xfs_bmap_extents_to_btree(
 	/*
 	 * Fill in the child block.
 	 */
-	abp->b_ops = &xfs_bmbt_buf_ops;
 	ablock = XFS_BUF_TO_BLOCK(abp);
 	xfs_btree_init_buf(mp, abp, &xfs_bmbt_ops, 0, 0, ip->i_ino);
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 54afec1f9121..737342918e11 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1278,6 +1278,7 @@ xfs_btree_init_buf(
 {
 	__xfs_btree_init_block(mp, XFS_BUF_TO_BLOCK(bp), ops,
 			xfs_buf_daddr(bp), level, numrecs, owner);
+	bp->b_ops = ops->buf_ops;
 }
 
 void
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index d42d1bc8e36b..ac7b9f679b56 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -415,7 +415,6 @@ xfbtree_init_leaf_block(
 
 	trace_xfbtree_create_root_buf(xfbt, bp);
 
-	bp->b_ops = cfg->btree_ops->buf_ops;
 	xfs_btree_init_buf(mp, bp, cfg->btree_ops, 0, 0, cfg->owner);
 	error = xfs_bwrite(bp);
 	xfs_buf_relse(bp);



* [PATCH 9/9] xfs: remove unnecessary fields in xfbtree_config
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 3/9] xfs: export some of the btree ops structures Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove these fields now that we get all the info we need from the btree
ops.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_mem.h  |    9 ---------
 fs/xfs/libxfs/xfs_rmap_btree.c |    1 -
 fs/xfs/scrub/trace.h           |   10 ++++------
 fs/xfs/scrub/xfbtree.c         |    4 ++--
 4 files changed, 6 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree_mem.h b/fs/xfs/libxfs/xfs_btree_mem.h
index 5e7b1f20fb5b..ee142b972839 100644
--- a/fs/xfs/libxfs/xfs_btree_mem.h
+++ b/fs/xfs/libxfs/xfs_btree_mem.h
@@ -17,17 +17,8 @@ struct xfbtree_config {
 
 	/* Owner of this btree. */
 	unsigned long long		owner;
-
-	/* Btree type number */
-	xfs_btnum_t			btnum;
-
-	/* XFBTREE_CREATE_* flags */
-	unsigned int			flags;
 };
 
-/* btree has long pointers */
-#define XFBTREE_CREATE_LONG_PTRS	(1U << 0)
-
 #ifdef CONFIG_XFS_IN_MEMORY_BTREE
 unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
 
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 31dc40358bf9..ebd86c559837 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -669,7 +669,6 @@ xfs_rmapbt_mem_create(
 	struct xfbtree_config	cfg = {
 		.btree_ops	= &xfs_rmapbt_mem_ops,
 		.target		= target,
-		.btnum		= XFS_BTNUM_RMAP,
 		.owner		= agno,
 	};
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index dc0547691fe4..213134d812e8 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2013,8 +2013,7 @@ TRACE_EVENT(xfbtree_create,
 		 struct xfbtree *xfbt),
 	TP_ARGS(mp, cfg, xfbt),
 	TP_STRUCT__entry(
-		__field(xfs_btnum_t, btnum)
-		__field(unsigned int, xfbtree_flags)
+		__field(const void *, btree_ops)
 		__field(unsigned long, xfino)
 		__field(unsigned int, leaf_mxr)
 		__field(unsigned int, leaf_mnr)
@@ -2023,8 +2022,7 @@ TRACE_EVENT(xfbtree_create,
 		__field(unsigned long long, owner)
 	),
 	TP_fast_assign(
-		__entry->btnum = cfg->btnum;
-		__entry->xfbtree_flags = cfg->flags;
+		__entry->btree_ops = cfg->btree_ops;
 		__entry->xfino = xfbtree_ino(xfbt);
 		__entry->leaf_mxr = xfbt->maxrecs[0];
 		__entry->node_mxr = xfbt->maxrecs[1];
@@ -2032,9 +2030,9 @@ TRACE_EVENT(xfbtree_create,
 		__entry->node_mnr = xfbt->minrecs[1];
 		__entry->owner = cfg->owner;
 	),
-	TP_printk("xfino 0x%lx btnum %s owner 0x%llx leaf_mxr %u leaf_mnr %u node_mxr %u node_mnr %u",
+	TP_printk("xfino 0x%lx btree_ops %pS owner 0x%llx leaf_mxr %u leaf_mnr %u node_mxr %u node_mnr %u",
 		  __entry->xfino,
-		  __print_symbolic(__entry->btnum, XFS_BTNUM_STRINGS),
+		  __entry->btree_ops,
 		  __entry->owner,
 		  __entry->leaf_mxr,
 		  __entry->leaf_mnr,
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index ac7b9f679b56..072e1d8b813e 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -384,7 +384,7 @@ xfbtree_rec_bytes(
 {
 	unsigned int			blocklen = xfo_to_b(1);
 
-	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS) {
+	if (cfg->btree_ops->geom_flags & XFS_BTREE_LONG_PTRS) {
 		if (xfs_has_crc(mp))
 			return blocklen - XFS_BTREE_LBLOCK_CRC_LEN;
 
@@ -481,7 +481,7 @@ xfbtree_create(
 	xbitmap_init(xfbt->freespace);
 
 	/* Set up min/maxrecs for this btree. */
-	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
+	if (cfg->btree_ops->geom_flags & XFS_BTREE_LONG_PTRS)
 		keyptr_len += sizeof(__be64);
 	else
 		keyptr_len += sizeof(__be32);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/3] xfs: reduce refcount repair memory usage
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (18 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: define an in-memory btree for storing refcount bag info during repairs Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: bmap log intent cleanups Darrick J. Wong
                   ` (57 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

The refcountbt repair code has serious memory usage problems when the
block sharing factor of the filesystem is very high.  This can happen if
a deduplication tool has been run against the filesystem, or if the fs
stores reflinked VM images that have been aging for a long time.

Recall that the original reference counting algorithm walks the reverse
mapping records of the filesystem to generate reference counts.  For any
given block in the AG, the rmap bag structure contains all the rmap
records that cover that block; the refcount is the size of that bag.

For online repair, the bag doesn't need the owner, offset, or state flag
information, so it discards those.  This halves the record size, but the
bag structure still stores one excerpted record for each reverse
mapping.  If the sharing count is high, this will use a LOT of memory
storing redundant records.  In the extreme case, 100k mappings to the
same piece of space will consume 100k*16 bytes = 1.6M of memory.

For offline repair, the bag stores the owner values so that we know
which inodes need to be marked as being reflink inodes.  If a
deduplication tool has been run and there are many blocks within a file
pointing to the same physical space, this will still use a lot of memory
to store redundant records.

The solution to this problem is to deduplicate the bag records when
possible by adding a reference count to the bag record, and changing the
bag add function to detect an existing record to bump the refcount.  In
the above example, the 100k mappings will now use 24 bytes of memory.
These lookups can be done efficiently with a btree, so we create a new
refcount bag btree type (inside of online repair).  This is why we
refactored the btree code in the previous patchset.
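
As an illustration only (not part of the patchset), here is a minimal
userspace sketch of that deduplication idea.  A flat array stands in for
the xfile-backed rcbag btree, and the record layout is a simplification
of struct rcbag_rec, so treat the byte counts it prints as toy numbers
rather than the figures quoted above.

/*
 * Toy model of the deduplicated refcount bag: keep one record per
 * distinct (startblock, blockcount) extent and bump a refcount on
 * duplicates instead of storing one excerpted rmap record per mapping.
 */
#include <stdint.h>
#include <stdio.h>

struct bag_rec {
	uint32_t	startblock;
	uint32_t	blockcount;
	uint64_t	refcount;	/* how many rmaps cover this extent */
};

static struct bag_rec	bag[64];
static unsigned int	bag_len;

/* Add one reverse mapping to the bag, deduplicating identical extents. */
static void
bag_add(
	uint32_t	startblock,
	uint32_t	blockcount)
{
	unsigned int	i;

	for (i = 0; i < bag_len; i++) {
		if (bag[i].startblock == startblock &&
		    bag[i].blockcount == blockcount) {
			bag[i].refcount++;	/* existing record: bump */
			return;
		}
	}

	/* new record: refcount starts at 1 */
	bag[bag_len].startblock = startblock;
	bag[bag_len].blockcount = blockcount;
	bag[bag_len].refcount = 1;
	bag_len++;
}

int
main(void)
{
	int		i;

	/* 100,000 mappings of the same extent collapse into one record. */
	for (i = 0; i < 100000; i++)
		bag_add(12345, 8);

	printf("records %u bytes %zu refcount %llu\n", bag_len,
			bag_len * sizeof(struct bag_rec),
			(unsigned long long)bag[0].refcount);
	return 0;
}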

The btree conversion also dramatically reduces the runtime of the
refcount generation algorithm, because the code to delete all bag
records that end at a given agblock now only has to delete one record
instead of (using the example above) 100k records.  As an added benefit,
record deletion now gives back the unused xfile space, which it did not
do previously.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-refcount-scalability

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-refcount-scalability
---
 fs/xfs/Makefile                |    2 
 fs/xfs/libxfs/xfs_btree.c      |    3 
 fs/xfs/libxfs/xfs_btree.h      |    1 
 fs/xfs/libxfs/xfs_shared.h     |    1 
 fs/xfs/libxfs/xfs_types.h      |    6 -
 fs/xfs/scrub/rcbag.c           |  331 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/rcbag.h           |   28 +++
 fs/xfs/scrub/rcbag_btree.c     |  371 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/rcbag_btree.h     |   83 +++++++++
 fs/xfs/scrub/refcount.c        |   16 +-
 fs/xfs/scrub/refcount_repair.c |  155 ++++++-----------
 fs/xfs/scrub/repair.h          |    2 
 fs/xfs/scrub/trace.h           |    1 
 fs/xfs/xfs_super.c             |   10 +
 fs/xfs/xfs_trace.h             |    1 
 15 files changed, 902 insertions(+), 109 deletions(-)
 create mode 100644 fs/xfs/scrub/rcbag.c
 create mode 100644 fs/xfs/scrub/rcbag.h
 create mode 100644 fs/xfs/scrub/rcbag_btree.c
 create mode 100644 fs/xfs/scrub/rcbag_btree.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/3] xfs: define an in-memory btree for storing refcount bag info during repairs
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: reduce refcount repair memory usage Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: port refcount repair to the new refcount bag structure Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: create refcount bag structure for btree repairs Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new in-memory btree type so that we can store refcount bag info
in a much more memory-efficient format.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/libxfs/xfs_btree.c  |    3 
 fs/xfs/libxfs/xfs_btree.h  |    1 
 fs/xfs/libxfs/xfs_shared.h |    1 
 fs/xfs/libxfs/xfs_types.h  |    6 +
 fs/xfs/scrub/rcbag_btree.c |  313 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/rcbag_btree.h |   76 +++++++++++
 fs/xfs/scrub/trace.h       |    1 
 fs/xfs/xfs_trace.h         |    1 
 9 files changed, 401 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/rcbag_btree.c
 create mode 100644 fs/xfs/scrub/rcbag_btree.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 78ea3a6a0f5b..61bcd7801480 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -192,6 +192,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   inode_repair.o \
 				   newbt.o \
 				   nlinks_repair.o \
+				   rcbag_btree.o \
 				   reap.o \
 				   refcount_repair.o \
 				   repair.o \
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 737342918e11..5176947870f9 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1373,6 +1373,9 @@ xfs_btree_set_refs(
 	case XFS_BTNUM_REFC:
 		xfs_buf_set_ref(bp, XFS_REFC_BTREE_REF);
 		break;
+	case XFS_BTNUM_RCBAG:
+		xfs_buf_set_ref(bp, XFS_RCBAG_BTREE_REF);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 451263e77144..0e12360ae36d 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -62,6 +62,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 #define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
 #define	XFS_BTNUM_REFC	((xfs_btnum_t)XFS_BTNUM_REFCi)
+#define	XFS_BTNUM_RCBAG	((xfs_btnum_t)XFS_BTNUM_RCBAGi)
 
 struct xfs_btree_ops;
 uint32_t xfs_btree_magic(struct xfs_mount *mp, const struct xfs_btree_ops *ops);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index d1b3f210326e..eaabfa52eda6 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -128,6 +128,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define	XFS_ATTR_BTREE_REF	1
 #define	XFS_DQUOT_REF		1
 #define	XFS_REFC_BTREE_REF	1
+#define	XFS_RCBAG_BTREE_REF	1
 #define	XFS_SSB_REF		0
 
 /*
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index c2868e8b6a1e..9a4019f23dd5 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -116,7 +116,8 @@ typedef enum {
  */
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
-	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_REFCi, XFS_BTNUM_MAX
+	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_REFCi, XFS_BTNUM_RCBAGi,
+	XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 #define XFS_BTNUM_STRINGS \
@@ -126,7 +127,8 @@ typedef enum {
 	{ XFS_BTNUM_BMAPi,	"bmbt" }, \
 	{ XFS_BTNUM_INOi,	"inobt" }, \
 	{ XFS_BTNUM_FINOi,	"finobt" }, \
-	{ XFS_BTNUM_REFCi,	"refcbt" }
+	{ XFS_BTNUM_REFCi,	"refcbt" }, \
+	{ XFS_BTNUM_RCBAGi,	"rcbagbt" }
 
 struct xfs_name {
 	const unsigned char	*name;
diff --git a/fs/xfs/scrub/rcbag_btree.c b/fs/xfs/scrub/rcbag_btree.c
new file mode 100644
index 000000000000..1d912069f4d7
--- /dev/null
+++ b/fs/xfs/scrub/rcbag_btree.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_btree_mem.h"
+#include "xfs_error.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+#include "scrub/rcbag_btree.h"
+#include "scrub/trace.h"
+
+static struct kmem_cache	*rcbagbt_cur_cache;
+
+STATIC void
+rcbagbt_init_key_from_rec(
+	union xfs_btree_key		*key,
+	const union xfs_btree_rec	*rec)
+{
+	struct rcbag_key	*bag_key = (struct rcbag_key *)key;
+	const struct rcbag_rec	*bag_rec = (const struct rcbag_rec *)rec;
+
+	BUILD_BUG_ON(sizeof(struct rcbag_key) > sizeof(union xfs_btree_key));
+	BUILD_BUG_ON(sizeof(struct rcbag_rec) > sizeof(union xfs_btree_rec));
+
+	bag_key->rbg_startblock = bag_rec->rbg_startblock;
+	bag_key->rbg_blockcount = bag_rec->rbg_blockcount;
+}
+
+STATIC void
+rcbagbt_init_rec_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*rec)
+{
+	struct rcbag_rec	*bag_rec = (struct rcbag_rec *)rec;
+	struct rcbag_rec	*bag_irec = (struct rcbag_rec *)&cur->bc_rec;
+
+	bag_rec->rbg_startblock = bag_irec->rbg_startblock;
+	bag_rec->rbg_blockcount = bag_irec->rbg_blockcount;
+	bag_rec->rbg_refcount = bag_irec->rbg_refcount;
+}
+
+STATIC int64_t
+rcbagbt_key_diff(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key)
+{
+	struct rcbag_rec		*rec = (struct rcbag_rec *)&cur->bc_rec;
+	const struct rcbag_key		*kp = (const struct rcbag_key *)key;
+
+	if (kp->rbg_startblock > rec->rbg_startblock)
+		return 1;
+	if (kp->rbg_startblock < rec->rbg_startblock)
+		return -1;
+
+	if (kp->rbg_blockcount > rec->rbg_blockcount)
+		return 1;
+	if (kp->rbg_blockcount < rec->rbg_blockcount)
+		return -1;
+
+	return 0;
+}
+
+STATIC int64_t
+rcbagbt_diff_two_keys(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*k1,
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
+{
+	const struct rcbag_key		*kp1 = (const struct rcbag_key *)k1;
+	const struct rcbag_key		*kp2 = (const struct rcbag_key *)k2;
+
+	ASSERT(mask == NULL);
+
+	if (kp1->rbg_startblock > kp2->rbg_startblock)
+		return 1;
+	if (kp1->rbg_startblock < kp2->rbg_startblock)
+		return -1;
+
+	if (kp1->rbg_blockcount > kp2->rbg_blockcount)
+		return 1;
+	if (kp1->rbg_blockcount < kp2->rbg_blockcount)
+		return -1;
+
+	return 0;
+}
+
+STATIC int
+rcbagbt_keys_inorder(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*k1,
+	const union xfs_btree_key	*k2)
+{
+	const struct rcbag_key		*kp1 = (const struct rcbag_key *)k1;
+	const struct rcbag_key		*kp2 = (const struct rcbag_key *)k2;
+
+	if (kp1->rbg_startblock > kp2->rbg_startblock)
+		return 0;
+	if (kp1->rbg_startblock < kp2->rbg_startblock)
+		return 1;
+
+	if (kp1->rbg_blockcount > kp2->rbg_blockcount)
+		return 0;
+	if (kp1->rbg_blockcount < kp2->rbg_blockcount)
+		return 1;
+
+	return 0;
+}
+
+STATIC int
+rcbagbt_recs_inorder(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_rec	*r1,
+	const union xfs_btree_rec	*r2)
+{
+	const struct rcbag_rec		*rp1 = (const struct rcbag_rec *)r1;
+	const struct rcbag_rec		*rp2 = (const struct rcbag_rec *)r2;
+
+	if (rp1->rbg_startblock > rp2->rbg_startblock)
+		return 0;
+	if (rp1->rbg_startblock < rp2->rbg_startblock)
+		return 1;
+
+	if (rp1->rbg_blockcount > rp2->rbg_blockcount)
+		return 0;
+	if (rp1->rbg_blockcount < rp2->rbg_blockcount)
+		return 1;
+
+	return 0;
+}
+
+static xfs_failaddr_t
+rcbagbt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	xfs_failaddr_t		fa;
+	unsigned int		level;
+
+	if (!xfs_verify_magic(bp, block->bb_magic))
+		return __this_address;
+
+	fa = xfs_btree_lblock_v5hdr_verify(bp, XFS_RMAP_OWN_UNKNOWN);
+	if (fa)
+		return fa;
+
+	level = be16_to_cpu(block->bb_level);
+	if (level >= rcbagbt_maxlevels_possible())
+		return __this_address;
+
+	return xfbtree_lblock_verify(bp,
+			rcbagbt_maxrecs(mp, xfo_to_b(1), level == 0));
+}
+
+static void
+rcbagbt_rw_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa = rcbagbt_verify(bp);
+
+	if (fa)
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+/* skip crc checks on in-memory btrees to save time */
+static const struct xfs_buf_ops rcbagbt_mem_buf_ops = {
+	.name			= "rcbagbt_mem",
+	.magic			= { 0, cpu_to_be32(RCBAG_MAGIC) },
+	.verify_read		= rcbagbt_rw_verify,
+	.verify_write		= rcbagbt_rw_verify,
+	.verify_struct		= rcbagbt_verify,
+};
+
+static const struct xfs_btree_ops rcbagbt_mem_ops = {
+	.rec_len		= sizeof(struct rcbag_rec),
+	.key_len		= sizeof(struct rcbag_key),
+	.geom_flags		= XFS_BTREE_CRC_BLOCKS | XFS_BTREE_LONG_PTRS |
+				  XFS_BTREE_IN_MEMORY,
+
+	.dup_cursor		= xfbtree_dup_cursor,
+	.set_root		= xfbtree_set_root,
+	.alloc_block		= xfbtree_alloc_block,
+	.free_block		= xfbtree_free_block,
+	.get_minrecs		= xfbtree_get_minrecs,
+	.get_maxrecs		= xfbtree_get_maxrecs,
+	.init_key_from_rec	= rcbagbt_init_key_from_rec,
+	.init_rec_from_cur	= rcbagbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfbtree_init_ptr_from_cur,
+	.key_diff		= rcbagbt_key_diff,
+	.buf_ops		= &rcbagbt_mem_buf_ops,
+	.diff_two_keys		= rcbagbt_diff_two_keys,
+	.keys_inorder		= rcbagbt_keys_inorder,
+	.recs_inorder		= rcbagbt_recs_inorder,
+};
+
+/* Create a cursor for an in-memory btree. */
+struct xfs_btree_cur *
+rcbagbt_mem_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*head_bp,
+	struct xfbtree		*xfbtree)
+{
+	struct xfs_btree_cur	*cur;
+
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RCBAG, &rcbagbt_mem_ops,
+			rcbagbt_maxlevels_possible(), rcbagbt_cur_cache);
+
+	cur->bc_mem.xfbtree = xfbtree;
+	cur->bc_mem.head_bp = head_bp;
+	cur->bc_nlevels = xfs_btree_mem_head_nlevels(head_bp);
+	return cur;
+}
+
+/* Create an in-memory refcount bag btree. */
+int
+rcbagbt_mem_create(
+	struct xfs_mount	*mp,
+	struct xfs_buftarg	*target,
+	struct xfbtree		**xfbtreep)
+{
+	struct xfbtree_config	cfg = {
+		.btree_ops	= &rcbagbt_mem_ops,
+		.target		= target,
+	};
+
+	return xfbtree_create(mp, &cfg, xfbtreep);
+}
+
+/* Calculate number of records in a refcount bag btree block. */
+static inline unsigned int
+rcbagbt_block_maxrecs(
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	if (leaf)
+		return blocklen / sizeof(struct rcbag_rec);
+	return blocklen /
+		(sizeof(struct rcbag_key) + sizeof(rcbag_ptr_t));
+}
+
+/*
+ * Calculate number of records in a refcount bag btree block, accounting
+ * for the btree block header.
+ */
+unsigned int
+rcbagbt_maxrecs(
+	struct xfs_mount	*mp,
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	blocklen -= RCBAG_BLOCK_LEN;
+	return rcbagbt_block_maxrecs(blocklen, leaf);
+}
+
+#define RCBAGBT_INIT_MINRECS(minrecs) \
+	do { \
+		unsigned int		blocklen; \
+ \
+		blocklen = PAGE_SIZE - XFS_BTREE_LBLOCK_CRC_LEN; \
+ \
+		minrecs[0] = rcbagbt_block_maxrecs(blocklen, true) / 2; \
+		minrecs[1] = rcbagbt_block_maxrecs(blocklen, false) / 2; \
+	} while (0)
+
+/* Compute the max possible height for refcount bag btrees. */
+unsigned int
+rcbagbt_maxlevels_possible(void)
+{
+	unsigned int		minrecs[2];
+
+	RCBAGBT_INIT_MINRECS(minrecs);
+	return xfs_btree_space_to_height(minrecs, ULLONG_MAX);
+}
+
+/* Calculate the refcount bag btree size for some records. */
+unsigned long long
+rcbagbt_calc_size(
+	unsigned long long	nr_records)
+{
+	unsigned int		minrecs[2];
+
+	RCBAGBT_INIT_MINRECS(minrecs);
+	return xfs_btree_calc_size(minrecs, nr_records);
+}
+
+int __init
+rcbagbt_init_cur_cache(void)
+{
+	rcbagbt_cur_cache = kmem_cache_create("xfs_rcbagbt_cur",
+			xfs_btree_cur_sizeof(rcbagbt_maxlevels_possible()),
+			0, 0, NULL);
+
+	if (!rcbagbt_cur_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void
+rcbagbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(rcbagbt_cur_cache);
+	rcbagbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/scrub/rcbag_btree.h b/fs/xfs/scrub/rcbag_btree.h
new file mode 100644
index 000000000000..cc88396aa1e7
--- /dev/null
+++ b/fs/xfs/scrub/rcbag_btree.h
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_RCBAG_BTREE_H__
+#define __XFS_SCRUB_RCBAG_BTREE_H__
+
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+
+struct xfs_buf;
+struct xfs_btree_cur;
+struct xfs_mount;
+
+#define RCBAG_MAGIC	0x74826671	/* 'JRBG' */
+
+struct rcbag_key {
+	uint32_t	rbg_startblock;
+	uint32_t	rbg_blockcount;
+};
+
+struct rcbag_rec {
+	uint32_t	rbg_startblock;
+	uint32_t	rbg_blockcount;
+	uint64_t	rbg_refcount;
+};
+
+typedef __be64 rcbag_ptr_t;
+
+/* reflinks only exist on crc enabled filesystems */
+#define RCBAG_BLOCK_LEN	XFS_BTREE_LBLOCK_CRC_LEN
+
+/*
+ * Record, key, and pointer address macros for btree blocks.
+ *
+ * (note that some of these may appear unused, but they are used in userspace)
+ */
+#define RCBAG_REC_ADDR(block, index) \
+	((struct rcbag_rec *) \
+		((char *)(block) + RCBAG_BLOCK_LEN + \
+		 (((index) - 1) * sizeof(struct rcbag_rec))))
+
+#define RCBAG_KEY_ADDR(block, index) \
+	((struct rcbag_key *) \
+		((char *)(block) + RCBAG_BLOCK_LEN + \
+		 ((index) - 1) * sizeof(struct rcbag_key)))
+
+#define RCBAG_PTR_ADDR(block, index, maxrecs) \
+	((rcbag_ptr_t *) \
+		((char *)(block) + RCBAG_BLOCK_LEN + \
+		 (maxrecs) * sizeof(struct rcbag_key) + \
+		 ((index) - 1) * sizeof(rcbag_ptr_t)))
+
+unsigned int rcbagbt_maxrecs(struct xfs_mount *mp, unsigned int blocklen,
+		bool leaf);
+
+unsigned long long rcbagbt_calc_size(unsigned long long nr_records);
+
+unsigned int rcbagbt_maxlevels_possible(void);
+
+int __init rcbagbt_init_cur_cache(void);
+void rcbagbt_destroy_cur_cache(void);
+
+struct xfbtree;
+struct xfs_btree_cur *rcbagbt_mem_cursor(struct xfs_mount *mp,
+		struct xfs_trans *tp, struct xfs_buf *head_bp,
+		struct xfbtree *xfbtree);
+int rcbagbt_mem_create(struct xfs_mount *mp, struct xfs_buftarg *target,
+		struct xfbtree **xfbtreep);
+
+#else
+# define rcbagbt_init_cur_cache()		0
+# define rcbagbt_destroy_cur_cache()		((void)0)
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
+#endif /* __XFS_SCRUB_RCBAG_BTREE_H__ */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 213134d812e8..558bef72b569 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -41,6 +41,7 @@ TRACE_DEFINE_ENUM(XFS_BTNUM_INOi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_FINOi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_RMAPi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_REFCi);
+TRACE_DEFINE_ENUM(XFS_BTNUM_RCBAGi);
 
 TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_SHARED);
 TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_COW);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d1620ea1c70f..6bb15f820120 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2480,6 +2480,7 @@ TRACE_DEFINE_ENUM(XFS_BTNUM_INOi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_FINOi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_RMAPi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_REFCi);
+TRACE_DEFINE_ENUM(XFS_BTNUM_RCBAGi);
 
 DECLARE_EVENT_CLASS(xfs_btree_cur_class,
 	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp),


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs: create refcount bag structure for btree repairs
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: reduce refcount repair memory usage Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: define an in-memory btree for storing refcount bag info during repairs Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: port refcount repair to the new refcount bag structure Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a bag structure for refcount information that uses the refcount
bag btree defined in the previous patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/rcbag.c       |  331 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/rcbag.h       |   28 ++++
 fs/xfs/scrub/rcbag_btree.c |   58 ++++++++
 fs/xfs/scrub/rcbag_btree.h |    7 +
 5 files changed, 425 insertions(+)
 create mode 100644 fs/xfs/scrub/rcbag.c
 create mode 100644 fs/xfs/scrub/rcbag.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 61bcd7801480..fc83759656c6 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -193,6 +193,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   newbt.o \
 				   nlinks_repair.o \
 				   rcbag_btree.o \
+				   rcbag.o \
 				   reap.o \
 				   refcount_repair.o \
 				   repair.o \
diff --git a/fs/xfs/scrub/rcbag.c b/fs/xfs/scrub/rcbag.c
new file mode 100644
index 000000000000..7bb36e36dc69
--- /dev/null
+++ b/fs/xfs/scrub/rcbag.c
@@ -0,0 +1,331 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_btree_mem.h"
+#include "xfs_error.h"
+#include "scrub/scrub.h"
+#include "scrub/xfile.h"
+#include "scrub/xfbtree.h"
+#include "scrub/rcbag_btree.h"
+#include "scrub/rcbag.h"
+#include "scrub/trace.h"
+
+struct rcbag {
+	struct xfs_mount	*mp;
+	struct xfbtree		*xfbtree;
+	uint64_t		nr_items;
+};
+
+int
+rcbag_init(
+	struct xfs_mount	*mp,
+	struct xfs_buftarg	*target,
+	struct rcbag		**bagp)
+{
+	struct rcbag		*bag;
+	int			error;
+
+	bag = kmalloc(sizeof(struct rcbag), XCHK_GFP_FLAGS);
+	if (!bag)
+		return -ENOMEM;
+
+	bag->nr_items = 0;
+	bag->mp = mp;
+
+	error = rcbagbt_mem_create(mp, target, &bag->xfbtree);
+	if (error)
+		goto out_bag;
+
+	*bagp = bag;
+	return 0;
+
+out_bag:
+	kfree(bag);
+	return error;
+}
+
+void
+rcbag_free(
+	struct rcbag		**bagp)
+{
+	struct rcbag		*bag = *bagp;
+
+	xfbtree_destroy(bag->xfbtree);
+	kfree(bag);
+	*bagp = NULL;
+}
+
+/* Track an rmap in the refcount bag. */
+int
+rcbag_add(
+	struct rcbag			*bag,
+	struct xfs_trans		*tp,
+	const struct xfs_rmap_irec	*rmap)
+{
+	struct rcbag_rec		bagrec;
+	struct xfs_mount		*mp = bag->mp;
+	struct xfs_buf			*head_bp;
+	struct xfs_btree_cur		*cur;
+	int				has;
+	int				error;
+
+	error = xfbtree_head_read_buf(bag->xfbtree, tp, &head_bp);
+	if (error)
+		return error;
+
+	cur = rcbagbt_mem_cursor(mp, tp, head_bp, bag->xfbtree);
+	error = rcbagbt_lookup_eq(cur, rmap, &has);
+	if (error)
+		goto out_cur;
+
+	if (has) {
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error)
+			goto out_cur;
+		if (!has) {
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+
+		bagrec.rbg_refcount++;
+		error = rcbagbt_update(cur, &bagrec);
+		if (error)
+			goto out_cur;
+	} else {
+		bagrec.rbg_startblock = rmap->rm_startblock;
+		bagrec.rbg_blockcount = rmap->rm_blockcount;
+		bagrec.rbg_refcount = 1;
+
+		error = rcbagbt_insert(cur, &bagrec, &has);
+		if (error)
+			goto out_cur;
+		if (!has) {
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+	}
+
+	xfs_btree_del_cursor(cur, 0);
+	xfs_trans_brelse(tp, head_bp);
+
+	error = xfbtree_trans_commit(bag->xfbtree, tp);
+	if (error)
+		return error;
+
+	bag->nr_items++;
+	return 0;
+
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, head_bp);
+	xfbtree_trans_cancel(bag->xfbtree, tp);
+	return error;
+}
+
+uint64_t
+rcbag_count(
+	const struct rcbag	*rcbag)
+{
+	return rcbag->nr_items;
+}
+
+#define BAGREC_NEXT(r)	((r)->rbg_startblock + (r)->rbg_blockcount)
+
+/*
+ * Find the next block where the refcount changes, given the next rmap we
+ * looked at and the ones we're already tracking.
+ */
+int
+rcbag_next_edge(
+	struct rcbag			*bag,
+	struct xfs_trans		*tp,
+	const struct xfs_rmap_irec	*next_rmap,
+	bool				next_valid,
+	uint32_t			*next_bnop)
+{
+	struct rcbag_rec		bagrec;
+	struct xfs_mount		*mp = bag->mp;
+	struct xfs_buf			*head_bp;
+	struct xfs_btree_cur		*cur;
+	uint32_t			next_bno = NULLAGBLOCK;
+	int				has;
+	int				error;
+
+	if (next_valid)
+		next_bno = next_rmap->rm_startblock;
+
+	error = xfbtree_head_read_buf(bag->xfbtree, tp, &head_bp);
+	if (error)
+		return error;
+
+	cur = rcbagbt_mem_cursor(mp, tp, head_bp, bag->xfbtree);
+	error = xfs_btree_goto_left_edge(cur);
+	if (error)
+		goto out_cur;
+
+	while (true) {
+		error = xfs_btree_increment(cur, 0, &has);
+		if (error)
+			goto out_cur;
+		if (!has)
+			break;
+
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error)
+			goto out_cur;
+		if (!has) {
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+
+		next_bno = min(next_bno, BAGREC_NEXT(&bagrec));
+	}
+
+	/*
+	 * We should have found /something/ because either next_rmap is the next
+	 * interesting rmap to look at after emitting this refcount extent, or
+	 * there are other rmaps in the bag contributing to the current
+	 * sharing count.  But if something is seriously wrong, bail out.
+	 */
+	if (next_bno == NULLAGBLOCK) {
+		error = -EFSCORRUPTED;
+		goto out_cur;
+	}
+
+	xfs_btree_del_cursor(cur, 0);
+	xfs_trans_brelse(tp, head_bp);
+
+	*next_bnop = next_bno;
+	return 0;
+
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, head_bp);
+	return error;
+}
+
+/* Pop all refcount bag records that end at next_bno */
+int
+rcbag_remove_ending_at(
+	struct rcbag		*bag,
+	struct xfs_trans	*tp,
+	uint32_t		next_bno)
+{
+	struct rcbag_rec	bagrec;
+	struct xfs_mount	*mp = bag->mp;
+	struct xfs_buf		*head_bp;
+	struct xfs_btree_cur	*cur;
+	int			has;
+	int			error;
+
+	error = xfbtree_head_read_buf(bag->xfbtree, tp, &head_bp);
+	if (error)
+		return error;
+
+	/* go to the right edge of the tree */
+	cur = rcbagbt_mem_cursor(mp, tp, head_bp, bag->xfbtree);
+	memset(&cur->bc_rec, 0xFF, sizeof(cur->bc_rec));
+	error = xfs_btree_lookup(cur, XFS_LOOKUP_GE, &has);
+	if (error)
+		goto out_cur;
+
+	while (true) {
+		error = xfs_btree_decrement(cur, 0, &has);
+		if (error)
+			goto out_cur;
+		if (!has)
+			break;
+
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error)
+			goto out_cur;
+		if (!has) {
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+
+		if (BAGREC_NEXT(&bagrec) != next_bno)
+			continue;
+
+		error = xfs_btree_delete(cur, &has);
+		if (error)
+			goto out_cur;
+		if (!has) {
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+
+		bag->nr_items -= bagrec.rbg_refcount;
+	}
+
+	xfs_btree_del_cursor(cur, 0);
+	xfs_trans_brelse(tp, head_bp);
+	return xfbtree_trans_commit(bag->xfbtree, tp);
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, head_bp);
+	xfbtree_trans_cancel(bag->xfbtree, tp);
+	return error;
+}
+
+/* Dump the rcbag. */
+void
+rcbag_dump(
+	struct rcbag			*bag,
+	struct xfs_trans		*tp)
+{
+	struct rcbag_rec		bagrec;
+	struct xfs_mount		*mp = bag->mp;
+	struct xfs_buf			*head_bp;
+	struct xfs_btree_cur		*cur;
+	unsigned long long		nr = 0;
+	int				has;
+	int				error;
+
+	error = xfbtree_head_read_buf(bag->xfbtree, tp, &head_bp);
+	if (error)
+		return;
+
+	cur = rcbagbt_mem_cursor(mp, tp, head_bp, bag->xfbtree);
+	error = xfs_btree_goto_left_edge(cur);
+	if (error)
+		goto out_cur;
+
+	while (true) {
+		error = xfs_btree_increment(cur, 0, &has);
+		if (error)
+			goto out_cur;
+		if (!has)
+			break;
+
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error)
+			goto out_cur;
+		if (!has) {
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+
+		xfs_err(bag->mp, "[%llu]: bno 0x%x fsbcount 0x%x refcount 0x%llx\n",
+				nr++,
+				(unsigned int)bagrec.rbg_startblock,
+				(unsigned int)bagrec.rbg_blockcount,
+				(unsigned long long)bagrec.rbg_refcount);
+	}
+
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, head_bp);
+}
diff --git a/fs/xfs/scrub/rcbag.h b/fs/xfs/scrub/rcbag.h
new file mode 100644
index 000000000000..f939c7156539
--- /dev/null
+++ b/fs/xfs/scrub/rcbag.h
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_RCBAG_H__
+#define __XFS_SCRUB_RCBAG_H__
+
+struct xfs_mount;
+struct rcbag;
+struct xfs_buftarg;
+
+int rcbag_init(struct xfs_mount *mp, struct xfs_buftarg *target,
+		struct rcbag **bagp);
+void rcbag_free(struct rcbag **bagp);
+int rcbag_add(struct rcbag *bag, struct xfs_trans *tp,
+		const struct xfs_rmap_irec *rmap);
+uint64_t rcbag_count(const struct rcbag *bag);
+
+int rcbag_next_edge(struct rcbag *bag, struct xfs_trans *tp,
+		const struct xfs_rmap_irec *next_rmap, bool next_valid,
+		uint32_t *next_bnop);
+int rcbag_remove_ending_at(struct rcbag *bag, struct xfs_trans *tp,
+		uint32_t next_bno);
+
+void rcbag_dump(struct rcbag *bag, struct xfs_trans *tp);
+
+#endif /* __XFS_SCRUB_RCBAG_H__ */
diff --git a/fs/xfs/scrub/rcbag_btree.c b/fs/xfs/scrub/rcbag_btree.c
index 1d912069f4d7..3aa40149e34d 100644
--- a/fs/xfs/scrub/rcbag_btree.c
+++ b/fs/xfs/scrub/rcbag_btree.c
@@ -311,3 +311,61 @@ rcbagbt_destroy_cur_cache(void)
 	kmem_cache_destroy(rcbagbt_cur_cache);
 	rcbagbt_cur_cache = NULL;
 }
+
+/* Look up the refcount bag record corresponding to this reverse mapping. */
+int
+rcbagbt_lookup_eq(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rmap,
+	int				*success)
+{
+	struct rcbag_rec		*rec = (struct rcbag_rec *)&cur->bc_rec;
+
+	rec->rbg_startblock = rmap->rm_startblock;
+	rec->rbg_blockcount = rmap->rm_blockcount;
+
+	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, success);
+}
+
+/* Get the data from the pointed-to record. */
+int
+rcbagbt_get_rec(
+	struct xfs_btree_cur	*cur,
+	struct rcbag_rec	*rec,
+	int			*has)
+{
+	union xfs_btree_rec	*btrec;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &btrec, has);
+	if (error || !(*has))
+		return error;
+
+	memcpy(rec, btrec, sizeof(struct rcbag_rec));
+	return 0;
+}
+
+/* Update the record referred to by cur to the value given. */
+int
+rcbagbt_update(
+	struct xfs_btree_cur	*cur,
+	const struct rcbag_rec	*rec)
+{
+	union xfs_btree_rec	btrec;
+
+	memcpy(&btrec, rec, sizeof(struct rcbag_rec));
+	return xfs_btree_update(cur, &btrec);
+}
+
+/* Insert the given record at the position pointed to by the cursor. */
+int
+rcbagbt_insert(
+	struct xfs_btree_cur	*cur,
+	const struct rcbag_rec	*rec,
+	int			*success)
+{
+	struct rcbag_rec	*btrec = (struct rcbag_rec *)&cur->bc_rec;
+
+	memcpy(btrec, rec, sizeof(struct rcbag_rec));
+	return xfs_btree_insert(cur, success);
+}
diff --git a/fs/xfs/scrub/rcbag_btree.h b/fs/xfs/scrub/rcbag_btree.h
index cc88396aa1e7..b8ec7fdd49d9 100644
--- a/fs/xfs/scrub/rcbag_btree.h
+++ b/fs/xfs/scrub/rcbag_btree.h
@@ -68,6 +68,13 @@ struct xfs_btree_cur *rcbagbt_mem_cursor(struct xfs_mount *mp,
 int rcbagbt_mem_create(struct xfs_mount *mp, struct xfs_buftarg *target,
 		struct xfbtree **xfbtreep);
 
+int rcbagbt_lookup_eq(struct xfs_btree_cur *cur,
+		const struct xfs_rmap_irec *rmap, int *success);
+int rcbagbt_get_rec(struct xfs_btree_cur *cur, struct rcbag_rec *rec, int *has);
+int rcbagbt_update(struct xfs_btree_cur *cur, const struct rcbag_rec *rec);
+int rcbagbt_insert(struct xfs_btree_cur *cur, const struct rcbag_rec *rec,
+		int *success);
+
 #else
 # define rcbagbt_init_cur_cache()		0
 # define rcbagbt_destroy_cur_cache()		((void)0)


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/3] xfs: port refcount repair to the new refcount bag structure
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: reduce refcount repair memory usage Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: define an in-memory btree for storing refcount bag info during repairs Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: create refcount bag structure for btree repairs Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Port the refcount record generating code to use the new refcount bag
data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/refcount.c        |   16 +++-
 fs/xfs/scrub/refcount_repair.c |  155 +++++++++++++---------------------------
 fs/xfs/scrub/repair.h          |    2 +
 fs/xfs/xfs_super.c             |   10 ++-
 4 files changed, 76 insertions(+), 107 deletions(-)


diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 9d957d2df3e1..413885eca333 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -7,8 +7,11 @@
 #include "xfs_fs.h"
 #include "xfs_shared.h"
 #include "xfs_format.h"
+#include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_ag.h"
 #include "xfs_btree.h"
 #include "xfs_rmap.h"
 #include "xfs_refcount.h"
@@ -16,9 +19,7 @@
 #include "scrub/common.h"
 #include "scrub/btree.h"
 #include "scrub/trace.h"
-#include "xfs_trans_resv.h"
-#include "xfs_mount.h"
-#include "xfs_ag.h"
+#include "scrub/repair.h"
 
 /*
  * Set us up to scrub reference count btrees.
@@ -29,6 +30,15 @@ xchk_setup_ag_refcountbt(
 {
 	if (xchk_need_fshook_drain(sc))
 		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
+	if (xchk_could_repair(sc)) {
+		int		error;
+
+		error = xrep_setup_ag_refcountbt(sc);
+		if (error)
+			return error;
+	}
+
 	return xchk_setup_ag_btree(sc, false);
 }
 
diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c
index 4bf48fbe5285..539548cdc65a 100644
--- a/fs/xfs/scrub/refcount_repair.c
+++ b/fs/xfs/scrub/refcount_repair.c
@@ -37,6 +37,7 @@
 #include "scrub/xfarray.h"
 #include "scrub/newbt.h"
 #include "scrub/reap.h"
+#include "scrub/rcbag.h"
 
 /*
  * Rebuilding the Reference Count Btree
@@ -97,12 +98,6 @@
  * insert all the records.
  */
 
-/* The only parts of the rmap that we care about for computing refcounts. */
-struct xrep_refc_rmap {
-	xfs_agblock_t		startblock;
-	xfs_extlen_t		blockcount;
-} __packed;
-
 struct xrep_refc {
 	/* refcount extents */
 	struct xfarray		*refcount_records;
@@ -122,6 +117,14 @@ struct xrep_refc {
 	xfs_extlen_t		btblocks;
 };
 
+/* Set us up to repair refcount btrees. */
+int
+xrep_setup_ag_refcountbt(
+	struct xfs_scrub	*sc)
+{
+	return xrep_setup_buftarg(sc, "refcount bag");
+}
+
 /* Check for any obvious conflicts with this shared/CoW staging extent. */
 STATIC int
 xrep_refc_check_ext(
@@ -223,10 +226,9 @@ xrep_refc_rmap_shareable(
 STATIC int
 xrep_refc_walk_rmaps(
 	struct xrep_refc	*rr,
-	struct xrep_refc_rmap	*rrm,
+	struct xfs_rmap_irec	*rmap,
 	bool			*have_rec)
 {
-	struct xfs_rmap_irec	rmap;
 	struct xfs_btree_cur	*cur = rr->sc->sa.rmap_cur;
 	struct xfs_mount	*mp = cur->bc_mp;
 	int			have_gt;
@@ -250,7 +252,7 @@ xrep_refc_walk_rmaps(
 		if (!have_gt)
 			return 0;
 
-		error = xfs_rmap_get_rec(cur, &rmap, &have_gt);
+		error = xfs_rmap_get_rec(cur, rmap, &have_gt);
 		if (error)
 			return error;
 		if (XFS_IS_CORRUPT(mp, !have_gt)) {
@@ -258,23 +260,22 @@ xrep_refc_walk_rmaps(
 			return -EFSCORRUPTED;
 		}
 
-		if (rmap.rm_owner == XFS_RMAP_OWN_COW) {
-			error = xrep_refc_stash_cow(rr, rmap.rm_startblock,
-					rmap.rm_blockcount);
+		if (rmap->rm_owner == XFS_RMAP_OWN_COW) {
+			error = xrep_refc_stash_cow(rr, rmap->rm_startblock,
+					rmap->rm_blockcount);
 			if (error)
 				return error;
-		} else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) {
+		} else if (rmap->rm_owner == XFS_RMAP_OWN_REFC) {
 			/* refcountbt block, dump it when we're done. */
-			rr->btblocks += rmap.rm_blockcount;
+			rr->btblocks += rmap->rm_blockcount;
 			error = xagb_bitmap_set(&rr->old_refcountbt_blocks,
-					rmap.rm_startblock, rmap.rm_blockcount);
+					rmap->rm_startblock,
+					rmap->rm_blockcount);
 			if (error)
 				return error;
 		}
-	} while (!xrep_refc_rmap_shareable(mp, &rmap));
+	} while (!xrep_refc_rmap_shareable(mp, rmap));
 
-	rrm->startblock = rmap.rm_startblock;
-	rrm->blockcount = rmap.rm_blockcount;
 	*have_rec = true;
 	return 0;
 }
@@ -356,45 +357,6 @@ xrep_refc_sort_records(
 	return error;
 }
 
-#define RRM_NEXT(r)	((r).startblock + (r).blockcount)
-/*
- * Find the next block where the refcount changes, given the next rmap we
- * looked at and the ones we're already tracking.
- */
-static inline int
-xrep_refc_next_edge(
-	struct xfarray		*rmap_bag,
-	struct xrep_refc_rmap	*next_rrm,
-	bool			next_valid,
-	xfs_agblock_t		*nbnop)
-{
-	struct xrep_refc_rmap	rrm;
-	xfarray_idx_t		array_cur = XFARRAY_CURSOR_INIT;
-	xfs_agblock_t		nbno = NULLAGBLOCK;
-	int			error;
-
-	if (next_valid)
-		nbno = next_rrm->startblock;
-
-	while ((error = xfarray_iter(rmap_bag, &array_cur, &rrm)) == 1)
-		nbno = min_t(xfs_agblock_t, nbno, RRM_NEXT(rrm));
-
-	if (error)
-		return error;
-
-	/*
-	 * We should have found /something/ because either next_rrm is the next
-	 * interesting rmap to look at after emitting this refcount extent, or
-	 * there are other rmaps in rmap_bag contributing to the current
-	 * sharing count.  But if something is seriously wrong, bail out.
-	 */
-	if (nbno == NULLAGBLOCK)
-		return -EFSCORRUPTED;
-
-	*nbnop = nbno;
-	return 0;
-}
-
 /*
  * Walk forward through the rmap btree to collect all rmaps starting at
  * @bno in @rmap_bag.  These represent the file(s) that share ownership of
@@ -404,22 +366,21 @@ xrep_refc_next_edge(
 static int
 xrep_refc_push_rmaps_at(
 	struct xrep_refc	*rr,
-	struct xfarray		*rmap_bag,
+	struct rcbag		*rcstack,
 	xfs_agblock_t		bno,
-	struct xrep_refc_rmap	*rrm,
-	bool			*have,
-	uint64_t		*stack_sz)
+	struct xfs_rmap_irec	*rmap,
+	bool			*have)
 {
 	struct xfs_scrub	*sc = rr->sc;
 	int			have_gt;
 	int			error;
 
-	while (*have && rrm->startblock == bno) {
-		error = xfarray_store_anywhere(rmap_bag, rrm);
+	while (*have && rmap->rm_startblock == bno) {
+		error = rcbag_add(rcstack, rr->sc->tp, rmap);
 		if (error)
 			return error;
-		(*stack_sz)++;
-		error = xrep_refc_walk_rmaps(rr, rrm, have);
+
+		error = xrep_refc_walk_rmaps(rr, rmap, have);
 		if (error)
 			return error;
 	}
@@ -440,11 +401,9 @@ STATIC int
 xrep_refc_find_refcounts(
 	struct xrep_refc	*rr)
 {
-	struct xrep_refc_rmap	rrm;
 	struct xfs_scrub	*sc = rr->sc;
-	struct xfarray		*rmap_bag;
-	uint64_t		old_stack_sz;
-	uint64_t		stack_sz = 0;
+	struct rcbag		*rcstack;
+	uint64_t		old_stack_height;
 	xfs_agblock_t		sbno;
 	xfs_agblock_t		cbno;
 	xfs_agblock_t		nbno;
@@ -454,12 +413,11 @@ xrep_refc_find_refcounts(
 	xrep_ag_btcur_init(sc, &sc->sa);
 
 	/*
-	 * Set up a sparse array to store all the rmap records that we're
-	 * tracking to generate a reference count record.  If this exceeds
+	 * Set up a bag to store all the rmap records that we're tracking to
+	 * generate a reference count record.  If the size of the bag exceeds
 	 * MAXREFCOUNT, we clamp rc_refcount.
 	 */
-	error = xfarray_create(sc->mp, "rmap bag", 0,
-			sizeof(struct xrep_refc_rmap), &rmap_bag);
+	error = rcbag_init(sc->mp, sc->xfile_buftarg, &rcstack);
 	if (error)
 		goto out_cur;
 
@@ -470,62 +428,54 @@ xrep_refc_find_refcounts(
 
 	/* Process reverse mappings into refcount data. */
 	while (xfs_btree_has_more_records(sc->sa.rmap_cur)) {
+		struct xfs_rmap_irec	rmap;
+
 		/* Push all rmaps with pblk == sbno onto the stack */
-		error = xrep_refc_walk_rmaps(rr, &rrm, &have);
+		error = xrep_refc_walk_rmaps(rr, &rmap, &have);
 		if (error)
 			goto out_bag;
 		if (!have)
 			break;
-		sbno = cbno = rrm.startblock;
-		error = xrep_refc_push_rmaps_at(rr, rmap_bag, sbno,
-					&rrm, &have, &stack_sz);
+		sbno = cbno = rmap.rm_startblock;
+		error = xrep_refc_push_rmaps_at(rr, rcstack, sbno, &rmap,
+				&have);
 		if (error)
 			goto out_bag;
 
 		/* Set nbno to the bno of the next refcount change */
-		error = xrep_refc_next_edge(rmap_bag, &rrm, have, &nbno);
+		error = rcbag_next_edge(rcstack, sc->tp, &rmap, have, &nbno);
 		if (error)
 			goto out_bag;
 
 		ASSERT(nbno > sbno);
-		old_stack_sz = stack_sz;
+		old_stack_height = rcbag_count(rcstack);
 
 		/* While stack isn't empty... */
-		while (stack_sz) {
-			xfarray_idx_t	array_cur = XFARRAY_CURSOR_INIT;
-
+		while (rcbag_count(rcstack) > 0) {
 			/* Pop all rmaps that end at nbno */
-			while ((error = xfarray_iter(rmap_bag, &array_cur,
-								&rrm)) == 1) {
-				if (RRM_NEXT(rrm) != nbno)
-					continue;
-				error = xfarray_unset(rmap_bag, array_cur - 1);
-				if (error)
-					goto out_bag;
-				stack_sz--;
-			}
+			error = rcbag_remove_ending_at(rcstack, sc->tp, nbno);
 			if (error)
 				goto out_bag;
 
 			/* Push array items that start at nbno */
-			error = xrep_refc_walk_rmaps(rr, &rrm, &have);
+			error = xrep_refc_walk_rmaps(rr, &rmap, &have);
 			if (error)
 				goto out_bag;
 			if (have) {
-				error = xrep_refc_push_rmaps_at(rr, rmap_bag,
-						nbno, &rrm, &have, &stack_sz);
+				error = xrep_refc_push_rmaps_at(rr, rcstack,
+						nbno, &rmap, &have);
 				if (error)
 					goto out_bag;
 			}
 
 			/* Emit refcount if necessary */
 			ASSERT(nbno > cbno);
-			if (stack_sz != old_stack_sz) {
-				if (old_stack_sz > 1) {
+			if (rcbag_count(rcstack) != old_stack_height) {
+				if (old_stack_height > 1) {
 					error = xrep_refc_stash(rr,
 							XFS_REFC_DOMAIN_SHARED,
 							cbno, nbno - cbno,
-							old_stack_sz);
+							old_stack_height);
 					if (error)
 						goto out_bag;
 				}
@@ -533,13 +483,13 @@ xrep_refc_find_refcounts(
 			}
 
 			/* Stack empty, go find the next rmap */
-			if (stack_sz == 0)
+			if (rcbag_count(rcstack) == 0)
 				break;
-			old_stack_sz = stack_sz;
+			old_stack_height = rcbag_count(rcstack);
 			sbno = nbno;
 
 			/* Set nbno to the bno of the next refcount change */
-			error = xrep_refc_next_edge(rmap_bag, &rrm, have,
+			error = rcbag_next_edge(rcstack, sc->tp, &rmap, have,
 					&nbno);
 			if (error)
 				goto out_bag;
@@ -548,14 +498,13 @@ xrep_refc_find_refcounts(
 		}
 	}
 
-	ASSERT(stack_sz == 0);
+	ASSERT(rcbag_count(rcstack) == 0);
 out_bag:
-	xfarray_destroy(rmap_bag);
+	rcbag_free(&rcstack);
 out_cur:
 	xchk_ag_btcur_free(&sc->sa);
 	return error;
 }
-#undef RRM_NEXT
 
 /* Retrieve refcountbt data for bulk load. */
 STATIC int
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 34c601aad642..3b25f2fa629e 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -79,6 +79,7 @@ int xrep_reset_perag_resv(struct xfs_scrub *sc);
 int xrep_bmap(struct xfs_scrub *sc, int whichfork, bool allow_unwritten);
 int xrep_metadata_inode_forks(struct xfs_scrub *sc);
 int xrep_setup_ag_rmapbt(struct xfs_scrub *sc);
+int xrep_setup_ag_refcountbt(struct xfs_scrub *sc);
 
 /* Repair setup functions */
 int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
@@ -176,6 +177,7 @@ xrep_setup_nothing(
 }
 #define xrep_setup_ag_allocbt		xrep_setup_nothing
 #define xrep_setup_ag_rmapbt		xrep_setup_nothing
+#define xrep_setup_ag_refcountbt	xrep_setup_nothing
 
 #define xrep_setup_inode(sc, imap)	((void)0)
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8841947bdce7..a16d4d1b35d0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -41,6 +41,7 @@
 #include "xfs_attr_item.h"
 #include "xfs_xattr.h"
 #include "xfs_iunlink_item.h"
+#include "scrub/rcbag_btree.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -1995,10 +1996,14 @@ xfs_init_caches(void)
 	if (error)
 		goto out_destroy_log_ticket_cache;
 
-	error = xfs_defer_init_item_caches();
+	error = rcbagbt_init_cur_cache();
 	if (error)
 		goto out_destroy_btree_cur_cache;
 
+	error = xfs_defer_init_item_caches();
+	if (error)
+		goto out_destroy_rcbagbt_cur_cache;
+
 	xfs_da_state_cache = kmem_cache_create("xfs_da_state",
 					      sizeof(struct xfs_da_state),
 					      0, 0, NULL);
@@ -2155,6 +2160,8 @@ xfs_init_caches(void)
 	kmem_cache_destroy(xfs_da_state_cache);
  out_destroy_defer_item_cache:
 	xfs_defer_destroy_item_caches();
+ out_destroy_rcbagbt_cur_cache:
+	rcbagbt_destroy_cur_cache();
  out_destroy_btree_cur_cache:
 	xfs_btree_destroy_cur_caches();
  out_destroy_log_ticket_cache:
@@ -2192,6 +2199,7 @@ xfs_destroy_caches(void)
 	kmem_cache_destroy(xfs_ifork_cache);
 	kmem_cache_destroy(xfs_da_state_cache);
 	xfs_defer_destroy_item_caches();
+	rcbagbt_destroy_cur_cache();
 	xfs_btree_destroy_cur_caches();
 	kmem_cache_destroy(xfs_log_ticket_cache);
 	kmem_cache_destroy(xfs_buf_cache);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/3] xfs: bmap log intent cleanups
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (19 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: reduce refcount repair memory usage Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: remove xfs_trans_set_bmap_flags Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime Darrick J. Wong
                   ` (56 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

The next major target of online repair is metadata that is persisted
in blocks mapped by a file fork.  In other words, we want to repair
directories, extended attributes, symbolic links, and the realtime free
space information.  For file-based metadata, we assume that the space
metadata is correct, which enables repair to construct new versions of
the metadata in a temporary file.  We then need to swap the file fork
mappings of the two files atomically.  With this patchset, we begin
constructing such a facility based on the existing bmap log items and a
new extent swap log item.

This series cleans up a few parts of the file block mapping log intent
code before we start adding support for realtime bmap intents.  Most of
it involves cleaning up tracepoints so that more of the data extraction
logic ends up in the tracepoint code and not the tracepoint call site,
which should reduce overhead further when tracepoints are disabled.
There is also a change to pass bmap intents all the way back to the bmap
code instead of unboxing the intent values and re-boxing them after the
_finish_one function completes.
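
To make the unboxing point concrete, here is a rough userspace sketch
(not from the patches; the struct and function names below are made up,
though the kernel's real structure is struct xfs_bmap_intent): instead
of each call site pulling scalars out of the intent and handing the
tracepoint or _finish_one helper a long argument list, the callee takes
the intent itself and extracts what it needs.

#include <stdint.h>
#include <stdio.h>

enum bmap_intent_type { BI_MAP, BI_UNMAP };

/* Stand-in for the kernel's bmap intent; field names are illustrative. */
struct bmap_intent {
	enum bmap_intent_type	bi_type;
	uint64_t		bi_owner;	/* inode number */
	uint64_t		bi_startoff;	/* file offset, in blocks */
	uint64_t		bi_startblock;	/* filesystem block */
	uint64_t		bi_blockcount;
};

/* Old style: every caller unboxes the intent into scalars. */
static void
trace_bmap_deferred_old(
	int		type,
	uint64_t	owner,
	uint64_t	startoff,
	uint64_t	startblock,
	uint64_t	blockcount)
{
	printf("op %d owner 0x%llx fileoff 0x%llx startblock 0x%llx fsbcount 0x%llx\n",
			type,
			(unsigned long long)owner,
			(unsigned long long)startoff,
			(unsigned long long)startblock,
			(unsigned long long)blockcount);
}

/* New style: pass the intent; the extraction logic lives in the callee. */
static void
trace_bmap_deferred_new(
	const struct bmap_intent	*bi)
{
	printf("op %d owner 0x%llx fileoff 0x%llx startblock 0x%llx fsbcount 0x%llx\n",
			bi->bi_type,
			(unsigned long long)bi->bi_owner,
			(unsigned long long)bi->bi_startoff,
			(unsigned long long)bi->bi_startblock,
			(unsigned long long)bi->bi_blockcount);
}

int
main(void)
{
	struct bmap_intent	bi = {
		.bi_type	= BI_MAP,
		.bi_owner	= 0x85,
		.bi_startoff	= 0,
		.bi_startblock	= 0x40,
		.bi_blockcount	= 8,
	};

	trace_bmap_deferred_old(bi.bi_type, bi.bi_owner, bi.bi_startoff,
			bi.bi_startblock, bi.bi_blockcount);
	trace_bmap_deferred_new(&bi);
	return 0;
}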

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=bmap-intent-cleanups

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=bmap-intent-cleanups
---
 fs/xfs/libxfs/xfs_bmap.c |   19 +--
 fs/xfs/libxfs/xfs_bmap.h |    4 +
 fs/xfs/xfs_bmap_item.c   |   38 ++-----
 fs/xfs/xfs_trace.c       |    1 
 fs/xfs/xfs_trace.h       |  267 +++++++++++++++++++++++++++++-----------------
 5 files changed, 192 insertions(+), 137 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/3] xfs: split tracepoint classes for deferred items
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: bmap log intent cleanups Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: remove xfs_trans_set_bmap_flags Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: clean up bmap log intent item tracepoint callsites Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We're about to start adding support for deferred log intent items for
realtime extents, so split these four types into separate classes so
that we can customize them as the transition happens.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_trace.h |  273 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 177 insertions(+), 96 deletions(-)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6bb15f820120..66f3a2803dd4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2677,94 +2677,6 @@ DEFINE_EVENT(xfs_defer_pending_class, name, \
 	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp), \
 	TP_ARGS(mp, dfp))
 
-DECLARE_EVENT_CLASS(xfs_phys_extent_deferred_class,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 int type, xfs_agblock_t agbno, xfs_extlen_t len),
-	TP_ARGS(mp, agno, type, agbno, len),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_agnumber_t, agno)
-		__field(int, type)
-		__field(xfs_agblock_t, agbno)
-		__field(xfs_extlen_t, len)
-	),
-	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->agno = agno;
-		__entry->type = type;
-		__entry->agbno = agbno;
-		__entry->len = len;
-	),
-	TP_printk("dev %d:%d op %d agno 0x%x agbno 0x%x fsbcount 0x%x",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->type,
-		  __entry->agno,
-		  __entry->agbno,
-		  __entry->len)
-);
-#define DEFINE_PHYS_EXTENT_DEFERRED_EVENT(name) \
-DEFINE_EVENT(xfs_phys_extent_deferred_class, name, \
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
-		 int type, \
-		 xfs_agblock_t bno, \
-		 xfs_extlen_t len), \
-	TP_ARGS(mp, agno, type, bno, len))
-
-DECLARE_EVENT_CLASS(xfs_map_extent_deferred_class,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 int op,
-		 xfs_agblock_t agbno,
-		 xfs_ino_t ino,
-		 int whichfork,
-		 xfs_fileoff_t offset,
-		 xfs_filblks_t len,
-		 xfs_exntst_t state),
-	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state),
-	TP_STRUCT__entry(
-		__field(dev_t, dev)
-		__field(xfs_agnumber_t, agno)
-		__field(xfs_ino_t, ino)
-		__field(xfs_agblock_t, agbno)
-		__field(int, whichfork)
-		__field(xfs_fileoff_t, l_loff)
-		__field(xfs_filblks_t, l_len)
-		__field(xfs_exntst_t, l_state)
-		__field(int, op)
-	),
-	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->agno = agno;
-		__entry->ino = ino;
-		__entry->agbno = agbno;
-		__entry->whichfork = whichfork;
-		__entry->l_loff = offset;
-		__entry->l_len = len;
-		__entry->l_state = state;
-		__entry->op = op;
-	),
-	TP_printk("dev %d:%d op %d agno 0x%x agbno 0x%x owner 0x%llx %s fileoff 0x%llx fsbcount 0x%llx state %d",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->op,
-		  __entry->agno,
-		  __entry->agbno,
-		  __entry->ino,
-		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
-		  __entry->l_loff,
-		  __entry->l_len,
-		  __entry->l_state)
-);
-#define DEFINE_MAP_EXTENT_DEFERRED_EVENT(name) \
-DEFINE_EVENT(xfs_map_extent_deferred_class, name, \
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
-		 int op, \
-		 xfs_agblock_t agbno, \
-		 xfs_ino_t ino, \
-		 int whichfork, \
-		 xfs_fileoff_t offset, \
-		 xfs_filblks_t len, \
-		 xfs_exntst_t state), \
-	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state))
-
 DEFINE_DEFER_EVENT(xfs_defer_cancel);
 DEFINE_DEFER_EVENT(xfs_defer_trans_roll);
 DEFINE_DEFER_EVENT(xfs_defer_trans_abort);
@@ -2780,11 +2692,42 @@ DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_relog_intent);
 
-#define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
-DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
-DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred);
-DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_defer);
-DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_deferred);
+DECLARE_EVENT_CLASS(xfs_free_extent_deferred_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int type, xfs_agblock_t agbno, xfs_extlen_t len),
+	TP_ARGS(mp, agno, type, agbno, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(int, type)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->type = type;
+		__entry->agbno = agbno;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d op %d agno 0x%x agbno 0x%x fsbcount 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len)
+);
+#define DEFINE_FREE_EXTENT_DEFERRED_EVENT(name) \
+DEFINE_EVENT(xfs_free_extent_deferred_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 int type, \
+		 xfs_agblock_t bno, \
+		 xfs_extlen_t len), \
+	TP_ARGS(mp, agno, type, bno, len))
+DEFINE_FREE_EXTENT_DEFERRED_EVENT(xfs_bmap_free_defer);
+DEFINE_FREE_EXTENT_DEFERRED_EVENT(xfs_bmap_free_deferred);
+DEFINE_FREE_EXTENT_DEFERRED_EVENT(xfs_agfl_free_defer);
+DEFINE_FREE_EXTENT_DEFERRED_EVENT(xfs_agfl_free_deferred);
 
 DECLARE_EVENT_CLASS(xfs_defer_pending_item_class,
 	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp,
@@ -2946,7 +2889,60 @@ DEFINE_EVENT(xfs_rmapbt_class, name, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
 
-#define DEFINE_RMAP_DEFERRED_EVENT DEFINE_MAP_EXTENT_DEFERRED_EVENT
+DECLARE_EVENT_CLASS(xfs_rmap_deferred_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int op,
+		 xfs_agblock_t agbno,
+		 xfs_ino_t ino,
+		 int whichfork,
+		 xfs_fileoff_t offset,
+		 xfs_filblks_t len,
+		 xfs_exntst_t state),
+	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_ino_t, ino)
+		__field(xfs_agblock_t, agbno)
+		__field(int, whichfork)
+		__field(xfs_fileoff_t, l_loff)
+		__field(xfs_filblks_t, l_len)
+		__field(xfs_exntst_t, l_state)
+		__field(int, op)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->ino = ino;
+		__entry->agbno = agbno;
+		__entry->whichfork = whichfork;
+		__entry->l_loff = offset;
+		__entry->l_len = len;
+		__entry->l_state = state;
+		__entry->op = op;
+	),
+	TP_printk("dev %d:%d op %d agno 0x%x agbno 0x%x owner 0x%llx %s fileoff 0x%llx fsbcount 0x%llx state %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->op,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->ino,
+		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
+		  __entry->l_loff,
+		  __entry->l_len,
+		  __entry->l_state)
+);
+#define DEFINE_RMAP_DEFERRED_EVENT(name) \
+DEFINE_EVENT(xfs_rmap_deferred_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 int op, \
+		 xfs_agblock_t agbno, \
+		 xfs_ino_t ino, \
+		 int whichfork, \
+		 xfs_fileoff_t offset, \
+		 xfs_filblks_t len, \
+		 xfs_exntst_t state), \
+	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state))
 DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_defer);
 DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_deferred);
 
@@ -2966,7 +2962,60 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_find_right_neighbor_result);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
 /* deferred bmbt updates */
-#define DEFINE_BMAP_DEFERRED_EVENT	DEFINE_RMAP_DEFERRED_EVENT
+DECLARE_EVENT_CLASS(xfs_bmap_deferred_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int op,
+		 xfs_agblock_t agbno,
+		 xfs_ino_t ino,
+		 int whichfork,
+		 xfs_fileoff_t offset,
+		 xfs_filblks_t len,
+		 xfs_exntst_t state),
+	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_ino_t, ino)
+		__field(xfs_agblock_t, agbno)
+		__field(int, whichfork)
+		__field(xfs_fileoff_t, l_loff)
+		__field(xfs_filblks_t, l_len)
+		__field(xfs_exntst_t, l_state)
+		__field(int, op)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->ino = ino;
+		__entry->agbno = agbno;
+		__entry->whichfork = whichfork;
+		__entry->l_loff = offset;
+		__entry->l_len = len;
+		__entry->l_state = state;
+		__entry->op = op;
+	),
+	TP_printk("dev %d:%d op %d agno 0x%x agbno 0x%x owner 0x%llx %s fileoff 0x%llx fsbcount 0x%llx state %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->op,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->ino,
+		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
+		  __entry->l_loff,
+		  __entry->l_len,
+		  __entry->l_state)
+);
+#define DEFINE_BMAP_DEFERRED_EVENT(name) \
+DEFINE_EVENT(xfs_bmap_deferred_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 int op, \
+		 xfs_agblock_t agbno, \
+		 xfs_ino_t ino, \
+		 int whichfork, \
+		 xfs_fileoff_t offset, \
+		 xfs_filblks_t len, \
+		 xfs_exntst_t state), \
+	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state))
 DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_defer);
 DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_deferred);
 
@@ -3343,7 +3392,39 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
-#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
+
+DECLARE_EVENT_CLASS(xfs_refcount_deferred_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int type, xfs_agblock_t agbno, xfs_extlen_t len),
+	TP_ARGS(mp, agno, type, agbno, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(int, type)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->type = type;
+		__entry->agbno = agbno;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d op %d agno 0x%x agbno 0x%x fsbcount 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len)
+);
+#define DEFINE_REFCOUNT_DEFERRED_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_deferred_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 int type, \
+		 xfs_agblock_t bno, \
+		 xfs_extlen_t len), \
+	TP_ARGS(mp, agno, type, bno, len))
 DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
 DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs: clean up bmap log intent item tracepoint callsites
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: bmap log intent cleanups Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: remove xfs_trans_set_bmap_flags Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: split tracepoint classes for deferred items Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Pass the incore bmap structure to the tracepoints instead of open-coding
the argument passing.
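
In other words, the shape of the change is (a sketch; the full diff
follows):

        /* Before: every callsite open-codes nine arguments. */
        trace_xfs_bmap_defer(mp, agno, type, agbno, ino, whichfork,
                        offset, len, state);

        /* After: pass the intent and let the tracepoint pull the fields. */
        trace_xfs_bmap_defer(bi);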

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |   19 +++-------------
 fs/xfs/libxfs/xfs_bmap.h |    4 +++
 fs/xfs/xfs_trace.c       |    1 +
 fs/xfs/xfs_trace.h       |   54 ++++++++++++++++++++--------------------------
 4 files changed, 32 insertions(+), 46 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9b99669be43d..ce12a1fd3209 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6213,15 +6213,6 @@ __xfs_bmap_add(
 {
 	struct xfs_bmap_intent		*bi;
 
-	trace_xfs_bmap_defer(tp->t_mountp,
-			XFS_FSB_TO_AGNO(tp->t_mountp, bmap->br_startblock),
-			type,
-			XFS_FSB_TO_AGBNO(tp->t_mountp, bmap->br_startblock),
-			ip->i_ino, whichfork,
-			bmap->br_startoff,
-			bmap->br_blockcount,
-			bmap->br_state);
-
 	bi = kmem_cache_alloc(xfs_bmap_intent_cache, GFP_NOFS | __GFP_NOFAIL);
 	INIT_LIST_HEAD(&bi->bi_list);
 	bi->bi_type = type;
@@ -6229,6 +6220,8 @@ __xfs_bmap_add(
 	bi->bi_whichfork = whichfork;
 	bi->bi_bmap = *bmap;
 
+	trace_xfs_bmap_defer(bi);
+
 	xfs_bmap_update_get_group(tp->t_mountp, bi);
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list);
 	return 0;
@@ -6274,13 +6267,7 @@ xfs_bmap_finish_one(
 
 	ASSERT(tp->t_firstblock == NULLFSBLOCK);
 
-	trace_xfs_bmap_deferred(tp->t_mountp,
-			XFS_FSB_TO_AGNO(tp->t_mountp, bmap->br_startblock),
-			bi->bi_type,
-			XFS_FSB_TO_AGBNO(tp->t_mountp, bmap->br_startblock),
-			bi->bi_owner->i_ino, bi->bi_whichfork,
-			bmap->br_startoff, bmap->br_blockcount,
-			bmap->br_state);
+	trace_xfs_bmap_deferred(bi);
 
 	if (WARN_ON_ONCE(bi->bi_whichfork != XFS_DATA_FORK)) {
 		xfs_bmap_mark_sick(bi->bi_owner, bi->bi_whichfork);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index bbda4a77cb69..276ffd098c9e 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -226,6 +226,10 @@ enum xfs_bmap_intent_type {
 	XFS_BMAP_UNMAP,
 };
 
+#define XFS_BMAP_INTENT_STRINGS \
+	{ XFS_BMAP_MAP,		"map" }, \
+	{ XFS_BMAP_UNMAP,	"unmap" }
+
 struct xfs_bmap_intent {
 	struct list_head			bi_list;
 	enum xfs_bmap_intent_type		bi_type;
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 2d49310fb912..c9a5d8087b63 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -39,6 +39,7 @@
 #include "scrub/xfile.h"
 #include "scrub/xfbtree.h"
 #include "xfs_btree_mem.h"
+#include "xfs_bmap.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 66f3a2803dd4..b65b7969a1d3 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -77,6 +77,7 @@ struct xfs_inobt_rec_incore;
 union xfs_btree_ptr;
 struct xfs_dqtrx;
 struct xfs_icwalk;
+struct xfs_bmap_intent;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -2962,16 +2963,12 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_find_right_neighbor_result);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
 /* deferred bmbt updates */
+TRACE_DEFINE_ENUM(XFS_BMAP_MAP);
+TRACE_DEFINE_ENUM(XFS_BMAP_UNMAP);
+
 DECLARE_EVENT_CLASS(xfs_bmap_deferred_class,
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 int op,
-		 xfs_agblock_t agbno,
-		 xfs_ino_t ino,
-		 int whichfork,
-		 xfs_fileoff_t offset,
-		 xfs_filblks_t len,
-		 xfs_exntst_t state),
-	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state),
+	TP_PROTO(struct xfs_bmap_intent *bi),
+	TP_ARGS(bi),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
@@ -2984,22 +2981,26 @@ DECLARE_EVENT_CLASS(xfs_bmap_deferred_class,
 		__field(int, op)
 	),
 	TP_fast_assign(
-		__entry->dev = mp->m_super->s_dev;
-		__entry->agno = agno;
-		__entry->ino = ino;
-		__entry->agbno = agbno;
-		__entry->whichfork = whichfork;
-		__entry->l_loff = offset;
-		__entry->l_len = len;
-		__entry->l_state = state;
-		__entry->op = op;
+		struct xfs_inode	*ip = bi->bi_owner;
+
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->agno = XFS_FSB_TO_AGNO(ip->i_mount,
+					bi->bi_bmap.br_startblock);
+		__entry->ino = ip->i_ino;
+		__entry->agbno = XFS_FSB_TO_AGBNO(ip->i_mount,
+					bi->bi_bmap.br_startblock);
+		__entry->whichfork = bi->bi_whichfork;
+		__entry->l_loff = bi->bi_bmap.br_startoff;
+		__entry->l_len = bi->bi_bmap.br_blockcount;
+		__entry->l_state = bi->bi_bmap.br_state;
+		__entry->op = bi->bi_type;
 	),
-	TP_printk("dev %d:%d op %d agno 0x%x agbno 0x%x owner 0x%llx %s fileoff 0x%llx fsbcount 0x%llx state %d",
+	TP_printk("dev %d:%d op %s ino 0x%llx agno 0x%x agbno 0x%x %s fileoff 0x%llx fsbcount 0x%llx state %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->op,
+		  __print_symbolic(__entry->op, XFS_BMAP_INTENT_STRINGS),
+		  __entry->ino,
 		  __entry->agno,
 		  __entry->agbno,
-		  __entry->ino,
 		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
 		  __entry->l_loff,
 		  __entry->l_len,
@@ -3007,15 +3008,8 @@ DECLARE_EVENT_CLASS(xfs_bmap_deferred_class,
 );
 #define DEFINE_BMAP_DEFERRED_EVENT(name) \
 DEFINE_EVENT(xfs_bmap_deferred_class, name, \
-	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
-		 int op, \
-		 xfs_agblock_t agbno, \
-		 xfs_ino_t ino, \
-		 int whichfork, \
-		 xfs_fileoff_t offset, \
-		 xfs_filblks_t len, \
-		 xfs_exntst_t state), \
-	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state))
+	TP_PROTO(struct xfs_bmap_intent *bi), \
+	TP_ARGS(bi))
 DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_defer);
 DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_deferred);
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/3] xfs: remove xfs_trans_set_bmap_flags
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: bmap log intent cleanups Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: clean up bmap log intent item tracepoint callsites Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: split tracepoint classes for deferred items Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove this single-use helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_item.c |   38 +++++++++++++-------------------------
 1 file changed, 13 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index e13184afebaf..b4ecba7c7663 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -281,29 +281,6 @@ xfs_bmap_update_diff_items(
 	return ba->bi_owner->i_ino - bb->bi_owner->i_ino;
 }
 
-/* Set the map extent flags for this mapping. */
-static void
-xfs_trans_set_bmap_flags(
-	struct xfs_map_extent		*map,
-	enum xfs_bmap_intent_type	type,
-	int				whichfork,
-	xfs_exntst_t			state)
-{
-	map->me_flags = 0;
-	switch (type) {
-	case XFS_BMAP_MAP:
-	case XFS_BMAP_UNMAP:
-		map->me_flags = type;
-		break;
-	default:
-		ASSERT(0);
-	}
-	if (state == XFS_EXT_UNWRITTEN)
-		map->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
-	if (whichfork == XFS_ATTR_FORK)
-		map->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
-}
-
 /* Log bmap updates in the intent item. */
 STATIC void
 xfs_bmap_update_log_item(
@@ -329,8 +306,19 @@ xfs_bmap_update_log_item(
 	map->me_startblock = bi->bi_bmap.br_startblock;
 	map->me_startoff = bi->bi_bmap.br_startoff;
 	map->me_len = bi->bi_bmap.br_blockcount;
-	xfs_trans_set_bmap_flags(map, bi->bi_type, bi->bi_whichfork,
-			bi->bi_bmap.br_state);
+
+	switch (bi->bi_type) {
+	case XFS_BMAP_MAP:
+	case XFS_BMAP_UNMAP:
+		map->me_flags = bi->bi_type;
+		break;
+	default:
+		ASSERT(0);
+	}
+	if (bi->bi_bmap.br_state == XFS_EXT_UNWRITTEN)
+		map->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
+	if (bi->bi_whichfork == XFS_ATTR_FORK)
+		map->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
 }
 
 static struct xfs_log_item *


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (20 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: bmap log intent cleanups Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: hoist freeing of rt data fork extent mappings Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] xfs: support attrfork and unwritten BUIs Darrick J. Wong
                   ` (55 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Atomic extent swapping (and later, reverse mapping and reflink) on the
realtime device needs to be able to defer file mapping and extent
freeing work in much the same manner as is required on the data volume.
Make the BUI log items operate on rt extents in preparation for atomic
swapping and realtime rmap.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=realtime-bmap-intents

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=realtime-bmap-intents
---
 fs/xfs/libxfs/xfs_bmap.c       |   24 +++++-------------------
 fs/xfs/libxfs/xfs_log_format.h |    4 +++-
 fs/xfs/libxfs/xfs_rtbitmap.c   |   33 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_item.c         |   17 +++++++++++++++++
 fs/xfs/xfs_rtalloc.h           |    5 +++++
 fs/xfs/xfs_trace.h             |   23 ++++++++++++++++++-----
 6 files changed, 81 insertions(+), 25 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs: fix xfs_bunmapi to allow unmapping of partial rt extents
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: hoist freeing of rt data fork extent mappings Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 4/4] xfs: support recovering bmap intent items targeting realtime extents Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: add a realtime flag to the bmap update log redo items Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When XFS_BMAPI_REMAP is passed to bunmapi, that means that we want to
remove part of a block mapping without touching the allocator.  For
realtime files with rtextsize > 1, that also means that we should skip
all the code that changes a partial remove request into an unwritten
extent conversion.  IOWs, bunmapi in this mode should handle removing
the mapping from the rt file and nothing else.

Note that XFS_BMAPI_REMAP callers are required to decrement the
reference count and/or free the space manually.
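
Not part of the patch, but a sketch of the calling convention this
enables; the helper name and surrounding context are made up, while the
__xfs_bunmapi() arguments mirror the callsites later in this series:

/*
 * Hypothetical example: remove part of an rt file's mapping without
 * freeing the underlying rt space.  The caller holds the ILOCK, has a
 * transaction, and must free the blocks (or drop refcounts) itself.
 */
static int
xfs_example_remap_unmap(
        struct xfs_trans        *tp,
        struct xfs_inode        *ip,
        struct xfs_bmbt_irec    *irec)
{
        xfs_filblks_t           rlen = irec->br_blockcount;

        return __xfs_bunmapi(tp, ip, irec->br_startoff, &rlen,
                        XFS_BMAPI_REMAP, 1);
}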

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ce12a1fd3209..ac5a0d3718f2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5148,15 +5148,14 @@ xfs_bmap_del_extent_real(
 
 	flags = XFS_ILOG_CORE;
 	if (xfs_ifork_is_realtime(ip, whichfork)) {
-		xfs_filblks_t	len;
-		xfs_extlen_t	mod;
-
-		len = div_u64_rem(del->br_blockcount, mp->m_sb.sb_rextsize,
-				  &mod);
-		ASSERT(mod == 0);
-
 		if (!(bflags & XFS_BMAPI_REMAP)) {
 			xfs_fsblock_t	bno;
+			xfs_filblks_t	len;
+			xfs_extlen_t	mod;
+
+			len = div_u64_rem(del->br_blockcount,
+					mp->m_sb.sb_rextsize, &mod);
+			ASSERT(mod == 0);
 
 			bno = div_u64_rem(del->br_startblock,
 					mp->m_sb.sb_rextsize, &mod);
@@ -5165,10 +5164,12 @@ xfs_bmap_del_extent_real(
 			error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
 			if (error)
 				goto done;
+			nblks = len * mp->m_sb.sb_rextsize;
+		} else {
+			nblks = del->br_blockcount;
 		}
 
 		do_fx = 0;
-		nblks = len * mp->m_sb.sb_rextsize;
 		qfield = XFS_TRANS_DQ_RTBCOUNT;
 	} else {
 		do_fx = 1;
@@ -5475,7 +5476,7 @@ __xfs_bunmapi(
 		if (del.br_startoff + del.br_blockcount > end + 1)
 			del.br_blockcount = end + 1 - del.br_startoff;
 
-		if (!isrt)
+		if (!isrt || (flags & XFS_BMAPI_REMAP))
 			goto delete;
 
 		sum = del.br_startblock + del.br_blockcount;
@@ -5493,7 +5494,7 @@ __xfs_bunmapi(
 				 * This piece is unwritten, or we're not
 				 * using unwritten extents.  Skip over it.
 				 */
-				ASSERT(end >= mod);
+				ASSERT((flags & XFS_BMAPI_REMAP) || end >= mod);
 				end -= mod > del.br_blockcount ?
 					del.br_blockcount : mod;
 				if (end < got.br_startoff &&


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs: hoist freeing of rt data fork extent mappings
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/4] xfs: fix xfs_bunmapi to allow unmapping of partial rt extents Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, xfs_bmap_del_extent_real contains a bunch of code to convert
the physical extent of a data fork mapping for a realtime file into rt
extents and pass that to the rt extent freeing function.  Since the
details of this aren't needed when CONFIG_XFS_RT=n, move it to
xfs_rtbitmap.c to reduce code size when realtime isn't enabled.

This will (one day) enable realtime EFIs to reuse the same
unit-converting call with less code duplication.
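
To make the unit conversion concrete, a sketch of a caller (the wrapper
and the example numbers are made up; xfs_rtfree_blocks() itself is
introduced below):

/* Illustrative wrapper, not from the patch. */
static int
xfs_example_rtfree_mapping(
        struct xfs_trans        *tp,
        struct xfs_bmbt_irec    *del)
{
        /*
         * With, say, sb_rextsize = 4 fs blocks per rt extent, freeing
         * br_startblock = 40 and br_blockcount = 8 turns into
         * xfs_rtfree_extent(tp, 10, 2) inside the helper; unaligned values
         * trip the ASSERT and fail with -EIO.
         */
        return xfs_rtfree_blocks(tp, del->br_startblock, del->br_blockcount);
}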

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c     |   21 +++------------------
 fs/xfs/libxfs/xfs_rtbitmap.c |   33 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_rtalloc.h         |    5 +++++
 3 files changed, 41 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ac5a0d3718f2..4c4e8ab42f63 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5149,33 +5149,18 @@ xfs_bmap_del_extent_real(
 	flags = XFS_ILOG_CORE;
 	if (xfs_ifork_is_realtime(ip, whichfork)) {
 		if (!(bflags & XFS_BMAPI_REMAP)) {
-			xfs_fsblock_t	bno;
-			xfs_filblks_t	len;
-			xfs_extlen_t	mod;
-
-			len = div_u64_rem(del->br_blockcount,
-					mp->m_sb.sb_rextsize, &mod);
-			ASSERT(mod == 0);
-
-			bno = div_u64_rem(del->br_startblock,
-					mp->m_sb.sb_rextsize, &mod);
-			ASSERT(mod == 0);
-
-			error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
+			error = xfs_rtfree_blocks(tp, del->br_startblock,
+					del->br_blockcount);
 			if (error)
 				goto done;
-			nblks = len * mp->m_sb.sb_rextsize;
-		} else {
-			nblks = del->br_blockcount;
 		}
-
 		do_fx = 0;
 		qfield = XFS_TRANS_DQ_RTBCOUNT;
 	} else {
 		do_fx = 1;
-		nblks = del->br_blockcount;
 		qfield = XFS_TRANS_DQ_BCOUNT;
 	}
+	nblks = del->br_blockcount;
 
 	del_endblock = del->br_startblock + del->br_blockcount;
 	if (cur) {
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 99a0af8d9028..c0bd7c44a6b8 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1012,6 +1012,39 @@ xfs_rtfree_extent(
 	return 0;
 }
 
+/*
+ * Free some blocks in the realtime subvolume.  rtbno and rtlen are in units of
+ * rt blocks, not rt extents; must be aligned to the rt extent size; and rtlen
+ * cannot exceed XFS_MAX_BMBT_EXTLEN.
+ */
+int
+xfs_rtfree_blocks(
+	struct xfs_trans	*tp,
+	xfs_fsblock_t		rtbno,
+	xfs_filblks_t		rtlen)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_rtblock_t		bno;
+	xfs_filblks_t		len;
+	xfs_extlen_t		mod;
+
+	ASSERT(rtlen <= XFS_MAX_BMBT_EXTLEN);
+
+	len = div_u64_rem(rtlen, mp->m_sb.sb_rextsize, &mod);
+	if (mod) {
+		ASSERT(mod == 0);
+		return -EIO;
+	}
+
+	bno = div_u64_rem(rtbno, mp->m_sb.sb_rextsize, &mod);
+	if (mod) {
+		ASSERT(mod == 0);
+		return -EIO;
+	}
+
+	return xfs_rtfree_extent(tp, bno, len);
+}
+
 /* Find all the free records within a given range. */
 int
 xfs_rtalloc_query_range(
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index 62c7ad79cbb6..3b2f1b499a11 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -58,6 +58,10 @@ xfs_rtfree_extent(
 	xfs_rtblock_t		bno,	/* starting block number to free */
 	xfs_extlen_t		len);	/* length of extent freed */
 
+/* Same as above, but in units of rt blocks. */
+int xfs_rtfree_blocks(struct xfs_trans *tp, xfs_fsblock_t rtbno,
+		xfs_filblks_t rtlen);
+
 /*
  * Initialize realtime fields in the mount structure.
  */
@@ -139,6 +143,7 @@ int xfs_rtalloc_reinit_frextents(struct xfs_mount *mp);
 #else
 # define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb)    (ENOSYS)
 # define xfs_rtfree_extent(t,b,l)                       (ENOSYS)
+# define xfs_rtfree_blocks(t,rb,rl)			(ENOSYS)
 # define xfs_rtpick_extent(m,t,l,rb)                    (ENOSYS)
 # define xfs_growfs_rt(mp,in)                           (ENOSYS)
 # define xfs_rtalloc_query_range(t,l,h,f,p)             (ENOSYS)


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs: add a realtime flag to the bmap update log redo items
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 4/4] xfs: support recovering bmap intent items targeting realtime extents Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Extend the bmap update (BUI) log items with a new realtime flag that
indicates that the updates apply against a realtime file's data fork.
We'll wire up the actual code later.
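
For illustration, decoding the new bit from a recovered intent might look
like this (a sketch only; the real recovery checks land in a later patch):

        /*
         * Sketch: given a struct xfs_map_extent *map recovered from the
         * log, classify the intent by its format flags.
         */
        bool    isrt = map->me_flags & XFS_BMAP_EXTENT_REALTIME;
        int     whichfork = (map->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ?
                                XFS_ATTR_FORK : XFS_DATA_FORK;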

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_log_format.h |    4 +++-
 fs/xfs/xfs_bmap_item.c         |    2 ++
 fs/xfs/xfs_trace.h             |   23 ++++++++++++++++++-----
 3 files changed, 23 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index f13e0809dc63..367f536d9881 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -831,10 +831,12 @@ struct xfs_cud_log_format {
 
 #define XFS_BMAP_EXTENT_ATTR_FORK	(1U << 31)
 #define XFS_BMAP_EXTENT_UNWRITTEN	(1U << 30)
+#define XFS_BMAP_EXTENT_REALTIME	(1U << 29)
 
 #define XFS_BMAP_EXTENT_FLAGS		(XFS_BMAP_EXTENT_TYPE_MASK | \
 					 XFS_BMAP_EXTENT_ATTR_FORK | \
-					 XFS_BMAP_EXTENT_UNWRITTEN)
+					 XFS_BMAP_EXTENT_UNWRITTEN | \
+					 XFS_BMAP_EXTENT_REALTIME)
 
 /*
  * This is the structure used to lay out an bui log item in the
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index b4ecba7c7663..82970413cb85 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -319,6 +319,8 @@ xfs_bmap_update_log_item(
 		map->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
 	if (bi->bi_whichfork == XFS_ATTR_FORK)
 		map->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
+	if (xfs_ifork_is_realtime(bmap->bi_owner, bmap->bi_whichfork))
+		map->me_flags |= XFS_BMAP_EXTENT_REALTIME;
 }
 
 static struct xfs_log_item *
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b65b7969a1d3..15bd6b86b514 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2971,9 +2971,11 @@ DECLARE_EVENT_CLASS(xfs_bmap_deferred_class,
 	TP_ARGS(bi),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(dev_t, opdev)
 		__field(xfs_agnumber_t, agno)
 		__field(xfs_ino_t, ino)
 		__field(xfs_agblock_t, agbno)
+		__field(xfs_fsblock_t, rtbno)
 		__field(int, whichfork)
 		__field(xfs_fileoff_t, l_loff)
 		__field(xfs_filblks_t, l_len)
@@ -2984,23 +2986,34 @@ DECLARE_EVENT_CLASS(xfs_bmap_deferred_class,
 		struct xfs_inode	*ip = bi->bi_owner;
 
 		__entry->dev = ip->i_mount->m_super->s_dev;
-		__entry->agno = XFS_FSB_TO_AGNO(ip->i_mount,
-					bi->bi_bmap.br_startblock);
+		if (xfs_ifork_is_realtime(ip, bi->bi_whichfork)) {
+			__entry->agno = 0;
+			__entry->agbno = 0;
+			__entry->rtbno = bi->bi_bmap.br_startblock;
+			__entry->opdev = ip->i_mount->m_rtdev_targp->bt_dev;
+		} else {
+			__entry->agno = XFS_FSB_TO_AGNO(ip->i_mount,
+						bi->bi_bmap.br_startblock);
+			__entry->agbno = XFS_FSB_TO_AGBNO(ip->i_mount,
+						bi->bi_bmap.br_startblock);
+			__entry->rtbno = 0;
+			__entry->opdev = __entry->dev;
+		}
 		__entry->ino = ip->i_ino;
-		__entry->agbno = XFS_FSB_TO_AGBNO(ip->i_mount,
-					bi->bi_bmap.br_startblock);
 		__entry->whichfork = bi->bi_whichfork;
 		__entry->l_loff = bi->bi_bmap.br_startoff;
 		__entry->l_len = bi->bi_bmap.br_blockcount;
 		__entry->l_state = bi->bi_bmap.br_state;
 		__entry->op = bi->bi_type;
 	),
-	TP_printk("dev %d:%d op %s ino 0x%llx agno 0x%x agbno 0x%x %s fileoff 0x%llx fsbcount 0x%llx state %d",
+	TP_printk("dev %d:%d op %s opdev %d:%d ino 0x%llx agno 0x%x agbno 0x%x rtbno 0x%llx %s fileoff 0x%llx fsbcount 0x%llx state %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __print_symbolic(__entry->op, XFS_BMAP_INTENT_STRINGS),
+		  MAJOR(__entry->opdev), MINOR(__entry->opdev),
 		  __entry->ino,
 		  __entry->agno,
 		  __entry->agbno,
+		  __entry->rtbno,
 		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
 		  __entry->l_loff,
 		  __entry->l_len,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs: support recovering bmap intent items targeting realtime extents
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/4] xfs: hoist freeing of rt data fork extent mappings Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/4] xfs: fix xfs_bunmapi to allow unmapping of partial rt extents Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/4] xfs: add a realtime flag to the bmap update log redo items Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have reflink on the realtime device, bmap intent items have
to support remapping extents on the realtime volume.
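
One detail worth calling out in the recovery check below is the !!
normalization; a minimal illustration (the helper is hypothetical):

/* Hypothetical helper: reduce the flag bit (1U << 29) to a plain bool. */
static inline bool
xfs_example_bui_isrt(const struct xfs_map_extent *map)
{
        /*
         * !! collapses any nonzero mask result to exactly 1, so it can be
         * compared against xfs_ifork_is_realtime()'s boolean result without
         * caring which bit position the flag occupies.
         */
        return !!(map->me_flags & XFS_BMAP_EXTENT_REALTIME);
}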

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_item.c |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 82970413cb85..30a5402bb79c 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -319,7 +319,7 @@ xfs_bmap_update_log_item(
 		map->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
 	if (bi->bi_whichfork == XFS_ATTR_FORK)
 		map->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
-	if (xfs_ifork_is_realtime(bmap->bi_owner, bmap->bi_whichfork))
+	if (xfs_ifork_is_realtime(bi->bi_owner, bi->bi_whichfork))
 		map->me_flags |= XFS_BMAP_EXTENT_REALTIME;
 }
 
@@ -362,6 +362,9 @@ xfs_bmap_update_get_group(
 {
 	xfs_agnumber_t		agno;
 
+	if (xfs_ifork_is_realtime(bi->bi_owner, bi->bi_whichfork))
+		return;
+
 	agno = XFS_FSB_TO_AGNO(mp, bi->bi_bmap.br_startblock);
 	bi->bi_pag = xfs_perag_get(mp, agno);
 
@@ -380,6 +383,9 @@ static inline void
 xfs_bmap_update_put_group(
 	struct xfs_bmap_intent	*bi)
 {
+	if (xfs_ifork_is_realtime(bi->bi_owner, bi->bi_whichfork))
+		return;
+
 	xfs_perag_drop_intents(bi->bi_pag);
 	xfs_perag_put(bi->bi_pag);
 }
@@ -469,6 +475,9 @@ xfs_bui_validate(
 	if (!xfs_verify_fileext(mp, map->me_startoff, map->me_len))
 		return false;
 
+	if (map->me_flags & XFS_BMAP_EXTENT_REALTIME)
+		return xfs_verify_rtext(mp, map->me_startblock, map->me_len);
+
 	return xfs_verify_fsbext(mp, map->me_startblock, map->me_len);
 }
 
@@ -516,6 +525,12 @@ xfs_bui_item_recover(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
+	if (!!(map->me_flags & XFS_BMAP_EXTENT_REALTIME) !=
+	    xfs_ifork_is_realtime(ip, fake.bi_whichfork)) {
+		error = -EFSCORRUPTED;
+		goto err_cancel;
+	}
+
 	if (fake.bi_type == XFS_BMAP_MAP)
 		iext_delta = XFS_IEXT_ADD_NOSPLIT_CNT;
 	else


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] xfs: support attrfork and unwritten BUIs
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (21 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/2] xfs: support deferred bmap updates on the attr fork Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/2] xfs: xfs_bmap_finish_one should map unwritten extents properly Darrick J. Wong
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: clean up symbolic link code Darrick J. Wong
                   ` (54 subsequent siblings)
  77 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

In preparation for atomic extent swapping and the online repair
functionality that wants atomic extent swaps, enhance the BUI code so
that we can support deferred work on the extended attribute fork and on
unwritten extents.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=expand-bmap-intent-usage

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=expand-bmap-intent-usage
---
 fs/xfs/libxfs/xfs_bmap.c |   49 ++++++++++++++++++++--------------------------
 fs/xfs/libxfs/xfs_bmap.h |    4 ++--
 fs/xfs/xfs_bmap_item.c   |    3 ++-
 fs/xfs/xfs_bmap_util.c   |    8 ++++----
 fs/xfs/xfs_reflink.c     |    8 ++++----
 5 files changed, 33 insertions(+), 39 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] xfs: support deferred bmap updates on the attr fork
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] xfs: support attrfork and unwritten BUIs Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/2] xfs: xfs_bmap_finish_one should map unwritten extents properly Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The deferred bmap update log item has always supported the attr fork, so
plumb the fork selection through the helpers so that higher layers can
make use of it.
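
As a sketch of what this opens up for higher layers (the function below
is hypothetical and not part of the patch):

/* Hypothetical example: defer replacing one attr fork mapping. */
static void
xfs_example_swap_attr_mapping(
        struct xfs_trans        *tp,
        struct xfs_inode        *ip,
        struct xfs_bmbt_irec    *old_map,
        struct xfs_bmbt_irec    *new_map)
{
        /* Holes and delalloc mappings are quietly ignored by the helpers. */
        xfs_bmap_unmap_extent(tp, ip, XFS_ATTR_FORK, old_map);
        xfs_bmap_map_extent(tp, ip, XFS_ATTR_FORK, new_map);
}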

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |   47 +++++++++++++++++++---------------------------
 fs/xfs/libxfs/xfs_bmap.h |    4 ++--
 fs/xfs/xfs_bmap_item.c   |    3 ++-
 fs/xfs/xfs_bmap_util.c   |    8 ++++----
 fs/xfs/xfs_reflink.c     |    8 ++++----
 5 files changed, 31 insertions(+), 39 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4c4e8ab42f63..ecead1feabe5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6179,17 +6179,8 @@ xfs_bmap_split_extent(
 	return error;
 }
 
-/* Deferred mapping is only for real extents in the data fork. */
-static bool
-xfs_bmap_is_update_needed(
-	struct xfs_bmbt_irec	*bmap)
-{
-	return  bmap->br_startblock != HOLESTARTBLOCK &&
-		bmap->br_startblock != DELAYSTARTBLOCK;
-}
-
 /* Record a bmap intent. */
-static int
+static void
 __xfs_bmap_add(
 	struct xfs_trans		*tp,
 	enum xfs_bmap_intent_type	type,
@@ -6199,6 +6190,11 @@ __xfs_bmap_add(
 {
 	struct xfs_bmap_intent		*bi;
 
+	if ((whichfork != XFS_DATA_FORK && whichfork != XFS_ATTR_FORK) ||
+	    bmap->br_startblock == HOLESTARTBLOCK ||
+	    bmap->br_startblock == DELAYSTARTBLOCK)
+		return;
+
 	bi = kmem_cache_alloc(xfs_bmap_intent_cache, GFP_NOFS | __GFP_NOFAIL);
 	INIT_LIST_HEAD(&bi->bi_list);
 	bi->bi_type = type;
@@ -6210,7 +6206,6 @@ __xfs_bmap_add(
 
 	xfs_bmap_update_get_group(tp->t_mountp, bi);
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list);
-	return 0;
 }
 
 /* Map an extent into a file. */
@@ -6218,12 +6213,10 @@ void
 xfs_bmap_map_extent(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
+	int			whichfork,
 	struct xfs_bmbt_irec	*PREV)
 {
-	if (!xfs_bmap_is_update_needed(PREV))
-		return;
-
-	__xfs_bmap_add(tp, XFS_BMAP_MAP, ip, XFS_DATA_FORK, PREV);
+	__xfs_bmap_add(tp, XFS_BMAP_MAP, ip, whichfork, PREV);
 }
 
 /* Unmap an extent out of a file. */
@@ -6231,12 +6224,10 @@ void
 xfs_bmap_unmap_extent(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
+	int			whichfork,
 	struct xfs_bmbt_irec	*PREV)
 {
-	if (!xfs_bmap_is_update_needed(PREV))
-		return;
-
-	__xfs_bmap_add(tp, XFS_BMAP_UNMAP, ip, XFS_DATA_FORK, PREV);
+	__xfs_bmap_add(tp, XFS_BMAP_UNMAP, ip, whichfork, PREV);
 }
 
 /*
@@ -6250,29 +6241,29 @@ xfs_bmap_finish_one(
 {
 	struct xfs_bmbt_irec		*bmap = &bi->bi_bmap;
 	int				error = 0;
+	int				flags = 0;
+
+	if (bi->bi_whichfork == XFS_ATTR_FORK)
+		flags |= XFS_BMAPI_ATTRFORK;
 
 	ASSERT(tp->t_firstblock == NULLFSBLOCK);
 
 	trace_xfs_bmap_deferred(bi);
 
-	if (WARN_ON_ONCE(bi->bi_whichfork != XFS_DATA_FORK)) {
-		xfs_bmap_mark_sick(bi->bi_owner, bi->bi_whichfork);
-		return -EFSCORRUPTED;
-	}
-
-	if (XFS_TEST_ERROR(false, tp->t_mountp,
-			XFS_ERRTAG_BMAP_FINISH_ONE))
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_BMAP_FINISH_ONE))
 		return -EIO;
 
 	switch (bi->bi_type) {
 	case XFS_BMAP_MAP:
 		error = xfs_bmapi_remap(tp, bi->bi_owner, bmap->br_startoff,
-				bmap->br_blockcount, bmap->br_startblock, 0);
+				bmap->br_blockcount, bmap->br_startblock,
+				flags);
 		bmap->br_blockcount = 0;
 		break;
 	case XFS_BMAP_UNMAP:
 		error = __xfs_bunmapi(tp, bi->bi_owner, bmap->br_startoff,
-				&bmap->br_blockcount, XFS_BMAPI_REMAP, 1);
+				&bmap->br_blockcount, flags | XFS_BMAPI_REMAP,
+				1);
 		break;
 	default:
 		ASSERT(0);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 276ffd098c9e..cb09a43a2872 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -244,9 +244,9 @@ void xfs_bmap_update_get_group(struct xfs_mount *mp,
 
 int	xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_bmap_intent *bi);
 void	xfs_bmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap);
+		int whichfork, struct xfs_bmbt_irec *imap);
 void	xfs_bmap_unmap_extent(struct xfs_trans *tp, struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap);
+		int whichfork, struct xfs_bmbt_irec *imap);
 
 static inline uint32_t xfs_bmap_fork_to_state(int whichfork)
 {
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 30a5402bb79c..5561c0e1136b 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -560,7 +560,8 @@ xfs_bui_item_recover(
 
 	if (fake.bi_bmap.br_blockcount > 0) {
 		ASSERT(fake.bi_type == XFS_BMAP_UNMAP);
-		xfs_bmap_unmap_extent(tp, ip, &fake.bi_bmap);
+		xfs_bmap_unmap_extent(tp, fake.bi_owner, fake.bi_whichfork,
+				&fake.bi_bmap);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e094932869f6..8621534b749b 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1454,16 +1454,16 @@ xfs_swap_extent_rmap(
 			}
 
 			/* Remove the mapping from the donor file. */
-			xfs_bmap_unmap_extent(tp, tip, &uirec);
+			xfs_bmap_unmap_extent(tp, tip, XFS_DATA_FORK, &uirec);
 
 			/* Remove the mapping from the source file. */
-			xfs_bmap_unmap_extent(tp, ip, &irec);
+			xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &irec);
 
 			/* Map the donor file's blocks into the source file. */
-			xfs_bmap_map_extent(tp, ip, &uirec);
+			xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
 
 			/* Map the source file's blocks into the donor file. */
-			xfs_bmap_map_extent(tp, tip, &irec);
+			xfs_bmap_map_extent(tp, tip, XFS_DATA_FORK, &irec);
 
 			error = xfs_defer_finish(tpp);
 			tp = *tpp;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 55604bbd25a4..0804f0ad6b1c 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -802,7 +802,7 @@ xfs_reflink_end_cow_extent(
 		 * If the extent we're remapping is backed by storage (written
 		 * or not), unmap the extent and drop its refcount.
 		 */
-		xfs_bmap_unmap_extent(tp, ip, &data);
+		xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &data);
 		xfs_refcount_decrease_extent(tp, &data);
 		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
 				-data.br_blockcount);
@@ -826,7 +826,7 @@ xfs_reflink_end_cow_extent(
 	xfs_refcount_free_cow_extent(tp, del.br_startblock, del.br_blockcount);
 
 	/* Map the new blocks into the data fork. */
-	xfs_bmap_map_extent(tp, ip, &del);
+	xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &del);
 
 	/* Charge this new data fork mapping to the on-disk quota. */
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT,
@@ -1290,7 +1290,7 @@ xfs_reflink_remap_extent(
 		 * If the extent we're unmapping is backed by storage (written
 		 * or not), unmap the extent and drop its refcount.
 		 */
-		xfs_bmap_unmap_extent(tp, ip, &smap);
+		xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &smap);
 		xfs_refcount_decrease_extent(tp, &smap);
 		qdelta -= smap.br_blockcount;
 	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
@@ -1315,7 +1315,7 @@ xfs_reflink_remap_extent(
 	 */
 	if (dmap_written) {
 		xfs_refcount_increase_extent(tp, dmap);
-		xfs_bmap_map_extent(tp, ip, dmap);
+		xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, dmap);
 		qdelta += dmap->br_blockcount;
 	}
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] xfs: xfs_bmap_finish_one should map unwritten extents properly
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] xfs: support attrfork and unwritten BUIs Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/2] xfs: support deferred bmap updates on the attr fork Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The deferred bmap work state and the log item can transmit unwritten
state, so the XFS_BMAP_MAP handler must map in extents with that
unwritten state.
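
Taken together with the previous patch, the net flag translation in
xfs_bmap_finish_one() looks like this (sketch of the combined result):

        int     flags = 0;

        if (bi->bi_whichfork == XFS_ATTR_FORK)
                flags |= XFS_BMAPI_ATTRFORK;
        if (bi->bi_type == XFS_BMAP_MAP &&
            bi->bi_bmap.br_state == XFS_EXT_UNWRITTEN)
                flags |= XFS_BMAPI_PREALLOC;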

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |    2 ++
 1 file changed, 2 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ecead1feabe5..7433e1ecdabb 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6255,6 +6255,8 @@ xfs_bmap_finish_one(
 
 	switch (bi->bi_type) {
 	case XFS_BMAP_MAP:
+		if (bi->bi_bmap.br_state == XFS_EXT_UNWRITTEN)
+			flags |= XFS_BMAPI_PREALLOC;
 		error = xfs_bmapi_remap(tp, bi->bi_owner, bmap->br_startoff,
 				bmap->br_blockcount, bmap->br_startblock,
 				flags);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/3] xfs: clean up symbolic link code
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (22 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] xfs: support attrfork and unwritten BUIs Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                   ` (53 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series cleans up a few bits of the symbolic link code as needed for
future projects.  Online repair requires the ability to commit fixed
fork-based filesystem metadata such as directories, xattrs, and symbolic
links atomically, so we need to rearrange the symlink code before we
land the atomic extent swapping.

Accomplish this by moving the remote symlink target block code and
declarations to xfs_symlink_remote.[ch].

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=symlink-cleanups

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=symlink-cleanups
---
 fs/xfs/libxfs/xfs_bmap.c           |    1 
 fs/xfs/libxfs/xfs_inode_fork.c     |    1 
 fs/xfs/libxfs/xfs_shared.h         |   14 ---
 fs/xfs/libxfs/xfs_symlink_remote.c |  155 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_symlink_remote.h |   27 ++++++
 fs/xfs/scrub/inode_repair.c        |    1 
 fs/xfs/scrub/symlink.c             |    3 -
 fs/xfs/xfs_symlink.c               |  144 +--------------------------------
 fs/xfs/xfs_symlink.h               |    1 
 9 files changed, 192 insertions(+), 155 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_symlink_remote.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/3] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: clean up symbolic link code Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: move symlink target write function to libxfs Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: move remote symlink target read " Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move declarations for libxfs symlink functions into a separate header
file like we do for most everything else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c           |    1 +
 fs/xfs/libxfs/xfs_inode_fork.c     |    1 +
 fs/xfs/libxfs/xfs_shared.h         |   14 --------------
 fs/xfs/libxfs/xfs_symlink_remote.c |    2 +-
 fs/xfs/libxfs/xfs_symlink_remote.h |   23 +++++++++++++++++++++++
 fs/xfs/scrub/inode_repair.c        |    1 +
 fs/xfs/scrub/symlink.c             |    1 +
 fs/xfs/xfs_symlink.c               |    1 +
 8 files changed, 29 insertions(+), 15 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_symlink_remote.h


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7433e1ecdabb..cf85f3896afa 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -37,6 +37,7 @@
 #include "xfs_icache.h"
 #include "xfs_iomap.h"
 #include "xfs_health.h"
+#include "xfs_symlink_remote.h"
 
 struct kmem_cache		*xfs_bmap_intent_cache;
 
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index c2cc3c193ffc..ab1bc0e3a595 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -26,6 +26,7 @@
 #include "xfs_types.h"
 #include "xfs_errortag.h"
 #include "xfs_health.h"
+#include "xfs_symlink_remote.h"
 
 struct kmem_cache *xfs_ifork_cache;
 
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index eaabfa52eda6..5127fa88531f 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -138,20 +138,6 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define	XFS_ICHGTIME_CHG	0x2	/* inode field change timestamp */
 #define	XFS_ICHGTIME_CREATE	0x4	/* inode create timestamp */
 
-
-/*
- * Symlink decoding/encoding functions
- */
-int xfs_symlink_blocks(struct xfs_mount *mp, int pathlen);
-int xfs_symlink_hdr_set(struct xfs_mount *mp, xfs_ino_t ino, uint32_t offset,
-			uint32_t size, struct xfs_buf *bp);
-bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset,
-			uint32_t size, struct xfs_buf *bp);
-void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
-				 struct xfs_inode *ip, struct xfs_ifork *ifp);
-xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
-xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
-
 /* Computed inode geometry for the filesystem. */
 struct xfs_ino_geometry {
 	/* Maximum inode count in this filesystem. */
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index 7660a95b1ea9..3ea30adc8220 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -16,7 +16,7 @@
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
-
+#include "xfs_symlink_remote.h"
 
 /*
  * Each contiguous block has a header, so it is not just a simple pathlen
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h
new file mode 100644
index 000000000000..a58d536c8b83
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_symlink_remote.h
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc.
+ * Copyright (c) 2013 Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#ifndef __XFS_SYMLINK_REMOTE_H
+#define __XFS_SYMLINK_REMOTE_H
+
+/*
+ * Symlink decoding/encoding functions
+ */
+int xfs_symlink_blocks(struct xfs_mount *mp, int pathlen);
+int xfs_symlink_hdr_set(struct xfs_mount *mp, xfs_ino_t ino, uint32_t offset,
+			uint32_t size, struct xfs_buf *bp);
+bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset,
+			uint32_t size, struct xfs_buf *bp);
+void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
+				 struct xfs_inode *ip, struct xfs_ifork *ifp);
+xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
+xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
+
+#endif /* __XFS_SYMLINK_REMOTE_H */
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index e5e3daf75fd1..6c889c21ddec 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -35,6 +35,7 @@
 #include "xfs_ag.h"
 #include "xfs_attr_leaf.h"
 #include "xfs_log_priv.h"
+#include "xfs_symlink_remote.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c
index c1c99ffe7408..3c9900b242e4 100644
--- a/fs/xfs/scrub/symlink.c
+++ b/fs/xfs/scrub/symlink.c
@@ -12,6 +12,7 @@
 #include "xfs_log_format.h"
 #include "xfs_inode.h"
 #include "xfs_symlink.h"
+#include "xfs_symlink_remote.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 90de8991cd94..76f80f958381 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -24,6 +24,7 @@
 #include "xfs_ialloc.h"
 #include "xfs_error.h"
 #include "xfs_health.h"
+#include "xfs_symlink_remote.h"
 
 /* ----- Kernel only functions below ----- */
 int


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs: move remote symlink target read function to libxfs
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: clean up symbolic link code Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 3/3] xfs: move symlink target write function to libxfs Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move xfs_readlink_bmap_ilocked to xfs_symlink_remote.c so that the
swapext code can use it to convert a remote format symlink back to
shortform format after a metadata repair.  While we're at it, fix a
broken printf prefix.
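
A sketch of a caller of the moved function (illustrative only; the
function name and locking context here are assumptions, though the
buffer-size and NUL-termination behavior come from the code itself):

/* Hypothetical caller: read a remote symlink target into buf. */
static int
xfs_example_read_symlink(
        struct xfs_inode        *ip,
        char                    *buf)   /* >= ip->i_disk_size + 1 bytes */
{
        int                     error;

        xfs_ilock(ip, XFS_ILOCK_SHARED);
        error = xfs_symlink_remote_read(ip, buf);       /* NUL-terminates buf */
        xfs_iunlock(ip, XFS_ILOCK_SHARED);
        return error;
}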

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_symlink_remote.c |   77 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_symlink_remote.h |    1 
 fs/xfs/scrub/symlink.c             |    2 -
 fs/xfs/xfs_symlink.c               |   75 -----------------------------------
 fs/xfs/xfs_symlink.h               |    1 
 5 files changed, 80 insertions(+), 76 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index 3ea30adc8220..7b4f2b306bc4 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -17,6 +17,9 @@
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 #include "xfs_symlink_remote.h"
+#include "xfs_bit.h"
+#include "xfs_bmap.h"
+#include "xfs_health.h"
 
 /*
  * Each contiguous block has a header, so it is not just a simple pathlen
@@ -238,3 +241,77 @@ xfs_symlink_shortform_verify(
 
 	return xfs_symlink_sf_verify_struct(ifp->if_u1.if_data, ifp->if_bytes);
 }
+
+/* Read a remote symlink target into the buffer. */
+int
+xfs_symlink_remote_read(
+	struct xfs_inode	*ip,
+	char			*link)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_buf		*bp;
+	xfs_daddr_t		d;
+	char			*cur_chunk;
+	int			pathlen = ip->i_disk_size;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			byte_cnt;
+	int			n;
+	int			error = 0;
+	int			fsblocks = 0;
+	int			offset;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
+
+	fsblocks = xfs_symlink_blocks(mp, pathlen);
+	error = xfs_bmapi_read(ip, 0, fsblocks, mval, &nmaps, 0);
+	if (error)
+		goto out;
+
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
+
+		error = xfs_buf_read(mp->m_ddev_targp, d, BTOBB(byte_cnt), 0,
+				&bp, &xfs_symlink_buf_ops);
+		if (xfs_metadata_is_sick(error))
+			xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+		if (error)
+			return error;
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
+		if (pathlen < byte_cnt)
+			byte_cnt = pathlen;
+
+		cur_chunk = bp->b_addr;
+		if (xfs_has_crc(mp)) {
+			if (!xfs_symlink_hdr_ok(ip->i_ino, offset,
+							byte_cnt, bp)) {
+				xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+				error = -EFSCORRUPTED;
+				xfs_alert(mp,
+"symlink header does not match required off/len/owner (0x%x/0x%x,0x%llx)",
+					offset, byte_cnt, ip->i_ino);
+				xfs_buf_relse(bp);
+				goto out;
+
+			}
+
+			cur_chunk += sizeof(struct xfs_dsymlink_hdr);
+		}
+
+		memcpy(link + offset, cur_chunk, byte_cnt);
+
+		pathlen -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_buf_relse(bp);
+	}
+	ASSERT(pathlen == 0);
+
+	link[ip->i_disk_size] = '\0';
+	error = 0;
+
+ out:
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h
index a58d536c8b83..7d3acaee0af0 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.h
+++ b/fs/xfs/libxfs/xfs_symlink_remote.h
@@ -19,5 +19,6 @@ void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
 				 struct xfs_inode *ip, struct xfs_ifork *ifp);
 xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
 xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
+int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */
diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c
index 3c9900b242e4..c134f738bc43 100644
--- a/fs/xfs/scrub/symlink.c
+++ b/fs/xfs/scrub/symlink.c
@@ -60,7 +60,7 @@ xchk_symlink(
 	}
 
 	/* Remote symlink; must read the contents. */
-	error = xfs_readlink_bmap_ilocked(sc->ip, sc->buf);
+	error = xfs_symlink_remote_read(sc->ip, sc->buf);
 	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error))
 		goto out;
 	if (strnlen(sc->buf, XFS_SYMLINK_MAXLEN) < len)
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 76f80f958381..710da8dfb7d3 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -27,79 +27,6 @@
 #include "xfs_symlink_remote.h"
 
 /* ----- Kernel only functions below ----- */
-int
-xfs_readlink_bmap_ilocked(
-	struct xfs_inode	*ip,
-	char			*link)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
-	struct xfs_buf		*bp;
-	xfs_daddr_t		d;
-	char			*cur_chunk;
-	int			pathlen = ip->i_disk_size;
-	int			nmaps = XFS_SYMLINK_MAPS;
-	int			byte_cnt;
-	int			n;
-	int			error = 0;
-	int			fsblocks = 0;
-	int			offset;
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
-
-	fsblocks = xfs_symlink_blocks(mp, pathlen);
-	error = xfs_bmapi_read(ip, 0, fsblocks, mval, &nmaps, 0);
-	if (error)
-		goto out;
-
-	offset = 0;
-	for (n = 0; n < nmaps; n++) {
-		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
-		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
-
-		error = xfs_buf_read(mp->m_ddev_targp, d, BTOBB(byte_cnt), 0,
-				&bp, &xfs_symlink_buf_ops);
-		if (xfs_metadata_is_sick(error))
-			xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
-		if (error)
-			return error;
-		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
-		if (pathlen < byte_cnt)
-			byte_cnt = pathlen;
-
-		cur_chunk = bp->b_addr;
-		if (xfs_has_crc(mp)) {
-			if (!xfs_symlink_hdr_ok(ip->i_ino, offset,
-							byte_cnt, bp)) {
-				xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
-				error = -EFSCORRUPTED;
-				xfs_alert(mp,
-"symlink header does not match required off/len/owner (0x%x/Ox%x,0x%llx)",
-					offset, byte_cnt, ip->i_ino);
-				xfs_buf_relse(bp);
-				goto out;
-
-			}
-
-			cur_chunk += sizeof(struct xfs_dsymlink_hdr);
-		}
-
-		memcpy(link + offset, cur_chunk, byte_cnt);
-
-		pathlen -= byte_cnt;
-		offset += byte_cnt;
-
-		xfs_buf_relse(bp);
-	}
-	ASSERT(pathlen == 0);
-
-	link[ip->i_disk_size] = '\0';
-	error = 0;
-
- out:
-	return error;
-}
-
 int
 xfs_readlink(
 	struct xfs_inode	*ip,
@@ -142,7 +69,7 @@ xfs_readlink(
 		memcpy(link, ip->i_df.if_u1.if_data, pathlen + 1);
 		error = 0;
 	} else {
-		error = xfs_readlink_bmap_ilocked(ip, link);
+		error = xfs_symlink_remote_read(ip, link);
 	}
 
  out:
diff --git a/fs/xfs/xfs_symlink.h b/fs/xfs/xfs_symlink.h
index 2586b7e393f3..526ea98315b5 100644
--- a/fs/xfs/xfs_symlink.h
+++ b/fs/xfs/xfs_symlink.h
@@ -10,7 +10,6 @@
 int xfs_symlink(struct user_namespace *mnt_userns, struct xfs_inode *dp,
 		struct xfs_name *link_name, const char *target_path,
 		umode_t mode, struct xfs_inode **ipp);
-int xfs_readlink_bmap_ilocked(struct xfs_inode *ip, char *link);
 int xfs_readlink(struct xfs_inode *ip, char *link);
 int xfs_inactive_symlink(struct xfs_inode *ip);
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/3] xfs: move symlink target write function to libxfs
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: clean up symbolic link code Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/3] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/3] xfs: move remote symlink target read " Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move xfs_symlink_write_target to xfs_symlink_remote.c so that kernel and
mkfs can share the same function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_symlink_remote.c |   76 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_symlink_remote.h |    3 +
 fs/xfs/xfs_symlink.c               |   68 ++------------------------------
 3 files changed, 83 insertions(+), 64 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index 7b4f2b306bc4..5261f15ea2ed 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -315,3 +315,79 @@ xfs_symlink_remote_read(
  out:
 	return error;
 }
+
+/* Write the symlink target into the inode. */
+int
+xfs_symlink_write_target(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	const char		*target_path,
+	int			pathlen,
+	xfs_fsblock_t		fs_blocks,
+	uint			resblks)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount	*mp = tp->t_mountp;
+	const char		*cur_chunk;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		d;
+	int			byte_cnt;
+	int			nmaps;
+	int			offset = 0;
+	int			n;
+	int			error;
+
+	/*
+	 * If the symlink will fit into the inode, write it inline.
+	 */
+	if (pathlen <= xfs_inode_data_fork_size(ip)) {
+		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
+
+		ip->i_disk_size = pathlen;
+		ip->i_df.if_format = XFS_DINODE_FMT_LOCAL;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+		return 0;
+	}
+
+	nmaps = XFS_SYMLINK_MAPS;
+	error = xfs_bmapi_write(tp, ip, 0, fs_blocks, XFS_BMAPI_METADATA,
+			resblks, mval, &nmaps);
+	if (error)
+		return error;
+
+	ip->i_disk_size = pathlen;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	cur_chunk = target_path;
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		char	*buf;
+
+		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
+		error = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
+				BTOBB(byte_cnt), 0, &bp);
+		if (error)
+			return error;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
+		byte_cnt = min(byte_cnt, pathlen);
+
+		buf = bp->b_addr;
+		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset, byte_cnt,
+				bp);
+
+		memcpy(buf, cur_chunk, byte_cnt);
+
+		cur_chunk += byte_cnt;
+		pathlen -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SYMLINK_BUF);
+		xfs_trans_log_buf(tp, bp, 0, (buf + byte_cnt - 1) -
+						(char *)bp->b_addr);
+	}
+	ASSERT(pathlen == 0);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h
index 7d3acaee0af0..d81461c06b6b 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.h
+++ b/fs/xfs/libxfs/xfs_symlink_remote.h
@@ -20,5 +20,8 @@ void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
 xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
 xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
 int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
+int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
+		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
+		uint resblks);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 710da8dfb7d3..548d9116e0c5 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -94,15 +94,7 @@ xfs_symlink(
 	int			error = 0;
 	int			pathlen;
 	bool                    unlock_dp_on_error = false;
-	xfs_fileoff_t		first_fsb;
 	xfs_filblks_t		fs_blocks;
-	int			nmaps;
-	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
-	xfs_daddr_t		d;
-	const char		*cur_chunk;
-	int			byte_cnt;
-	int			n;
-	struct xfs_buf		*bp;
 	prid_t			prid;
 	struct xfs_dquot	*udqp = NULL;
 	struct xfs_dquot	*gdqp = NULL;
@@ -190,62 +182,10 @@ xfs_symlink(
 	xfs_qm_vop_create_dqattach(tp, ip, udqp, gdqp, pdqp);
 
 	resblks -= XFS_IALLOC_SPACE_RES(mp);
-	/*
-	 * If the symlink will fit into the inode, write it inline.
-	 */
-	if (pathlen <= xfs_inode_data_fork_size(ip)) {
-		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
-
-		ip->i_disk_size = pathlen;
-		ip->i_df.if_format = XFS_DINODE_FMT_LOCAL;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
-	} else {
-		int	offset;
-
-		first_fsb = 0;
-		nmaps = XFS_SYMLINK_MAPS;
-
-		error = xfs_bmapi_write(tp, ip, first_fsb, fs_blocks,
-				  XFS_BMAPI_METADATA, resblks, mval, &nmaps);
-		if (error)
-			goto out_trans_cancel;
-
-		resblks -= fs_blocks;
-		ip->i_disk_size = pathlen;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-
-		cur_chunk = target_path;
-		offset = 0;
-		for (n = 0; n < nmaps; n++) {
-			char	*buf;
-
-			d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
-			byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
-			error = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
-					       BTOBB(byte_cnt), 0, &bp);
-			if (error)
-				goto out_trans_cancel;
-			bp->b_ops = &xfs_symlink_buf_ops;
-
-			byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
-			byte_cnt = min(byte_cnt, pathlen);
-
-			buf = bp->b_addr;
-			buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
-						   byte_cnt, bp);
-
-			memcpy(buf, cur_chunk, byte_cnt);
-
-			cur_chunk += byte_cnt;
-			pathlen -= byte_cnt;
-			offset += byte_cnt;
-
-			xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SYMLINK_BUF);
-			xfs_trans_log_buf(tp, bp, 0, (buf + byte_cnt - 1) -
-							(char *)bp->b_addr);
-		}
-		ASSERT(pathlen == 0);
-	}
+	error = xfs_symlink_write_target(tp, ip, target_path, pathlen,
+			fs_blocks, resblks);
+	if (error)
+		goto out_trans_cancel;
 	i_size_write(VFS_I(ip), ip->i_disk_size);
 
 	/*


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 00/21] xfs: atomic file updates
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (23 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: clean up symbolic link code Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
                     ` (20 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: create temporary files for online repair Darrick J. Wong
                   ` (52 subsequent siblings)
  77 siblings, 21 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

Hi all,

This series creates a new FIEXCHANGE_RANGE system call to exchange
ranges of bytes between two files atomically.  This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes.  A successful call
completion guarantees that the new contents will be seen even if the
system fails.

The ability to swap extent mappings between files in this manner is
critical to supporting online filesystem repair, which is built upon the
strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want, and exchanging the relevant ranges of the temp file with
the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Callers can arrange for the update to be rejected if the
original file has changed in the meantime.
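
To make that flow concrete, here is a rough userspace sketch (not part
of this patchset) of the O_TMPFILE + FICLONE + FIEXCHANGE_RANGE
sequence.  It assumes the <linux/fiexchange.h> header proposed later in
this series; the helper name and the error handling are illustrative
only:

#define _GNU_SOURCE		/* O_TMPFILE */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* FICLONE */
#include <linux/fiexchange.h>	/* proposed FIEXCHANGE_RANGE uapi */

int atomic_rewrite(const char *dir, const char *path,
		   const void *buf, size_t len, off_t off)
{
	struct file_xchg_range fxr = { 0 };	/* pad[] must stay zero */
	int ofd, tfd, ret = -1;

	ofd = open(path, O_RDWR);
	tfd = open(dir, O_TMPFILE | O_RDWR, 0600);
	if (ofd < 0 || tfd < 0)
		goto out;

	/* Share the original contents with the temp file... */
	if (ioctl(tfd, FICLONE, ofd))
		goto out;

	/* ...stage the update in the temp file... */
	if (pwrite(tfd, buf, len, off) != (ssize_t)len)
		goto out;

	/* ...then exchange everything back into the original file. */
	fxr.file1_fd = tfd;	/* temp file holds the new contents */
	fxr.flags = FILE_XCHG_RANGE_TO_EOF | FILE_XCHG_RANGE_FSYNC;
	ret = ioctl(ofd, FIEXCHANGE_RANGE, &fxr);
out:
	if (tfd >= 0)
		close(tfd);
	if (ofd >= 0)
		close(ofd);
	return ret;
}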

The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files.  For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory.  People get it wrong all the time, and $fs hacks abound.
Here is the proposed manual page:

IOCTL-FIEXCHANGE_RANGE(2)  Linux Programmer's Manual  IOCTL-FIEXCHANGE_RANGE(2)

NAME
       ioctl_fiexchange_range  - exchange the contents of parts of two
       files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <linux/fiexchange.h>

       int    ioctl(int     file2_fd,     FIEXCHANGE_RANGE,     struct
       file_xchg_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes the contents of the two ranges.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions, so no userspace-level locks need to be taken  to  obtain
       consistent  results.  Implementations must guarantee that read‐
       ers see either the old contents or the new  contents  in  their
       entirety, even if the system fails.

       The exchange parameters are conveyed in a structure of the fol‐
       lowing form:

           struct file_xchg_range {
               __s64    file1_fd;
               __s64    file1_offset;
               __s64    file2_offset;
                __u64    length;

               __u64    flags;

               __s64    file2_ino;
               __s64    file2_mtime;
               __s64    file2_ctime;
               __s32    file2_mtime_nsec;
               __s32    file2_ctime_nsec;

               __u64    pad[6];
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the  first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       Both files must be from the same filesystem mount.  If the  two
       file  descriptors represent the same file, the byte ranges must
       not overlap.  Most  disk-based  filesystems  require  that  the
       starts  of  both ranges must be aligned to the file block size.
       If this is the case, the ends of the ranges  must  also  be  so
       aligned unless the FILE_XCHG_RANGE_TO_EOF flag is set.

       The field flags control the behavior of the exchange operation.

           FILE_XCHG_RANGE_FILE2_FRESH
                  Check  the  freshness  of file2_fd after locking the
                  file but before exchanging the contents.   The  sup‐
                  plied  file2_ino field must match file2's inode num‐
                  ber, and the supplied file2_mtime, file2_mtime_nsec,
                  file2_ctime,  and file2_ctime_nsec fields must match
                  the modification time and change time of file2.   If
                  they do not match, EBUSY will be returned.

           FILE_XCHG_RANGE_TO_EOF
                   Ignore the length parameter.  All bytes in file1_fd
                   from file1_offset to EOF are moved to file2_fd, and
                   file2's size is set to file2_offset plus (file1's
                   original size - file1_offset).  Meanwhile, all bytes
                   in file2 from file2_offset to EOF are moved to file1,
                   and file1's size is set to file1_offset plus (file2's
                   original size - file2_offset).  This option is not
                   compatible with FILE_XCHG_RANGE_FULL_FILES.

           FILE_XCHG_RANGE_FSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           FILE_XCHG_RANGE_SKIP_FILE1_HOLES
                  Skip  sub-ranges  of  file1_fd that are known not to
                  contain data.  This facility can be used  to  imple‐
                  ment  atomic scatter-gather writes of any complexity
                  for software-defined storage targets.

           FILE_XCHG_RANGE_DRY_RUN
                  Check the parameters and the feasibility of the  op‐
                  eration, but do not change anything.

           FILE_XCHG_RANGE_COMMIT
                  This      flag      is      a     combination     of
                  FILE_XCHG_RANGE_FILE2_FRESH |  FILE_XCHG_RANGE_FSYNC
                  and  can  be  used  to commit changes to file2_fd to
                  persistent storage if and  only  if  file2  has  not
                  changed.

           FILE_XCHG_RANGE_FULL_FILES
                  Require that file1_offset and file2_offset are zero,
                  and that the length field  matches  the  lengths  of
                  both  files.   If  not, EDOM will be returned.  This
                  option      is       not       compatible       with
                  FILE_XCHG_RANGE_TO_EOF.

           FILE_XCHG_RANGE_NONATOMIC
                  This  flag  relaxes the requirement that readers see
                  only the old contents or the new contents  in  their
                  entirety.   If  the system fails before all modified
                  in-core data and metadata updates are  persisted  to
                  disk,  the contents of both file ranges after recov‐
                  ery are not defined and may be a mix of both.

                  Do not use this flag unless  the  contents  of  both
                  ranges  are  known  to be identical and there are no
                  other writers.

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EBUSY  The inode number and timestamps supplied  do  not  match
              file2_fd  and  FILE_XCHG_RANGE_FILE2_FRESH  was  set  in
              flags.

       EDOM   The ranges do not cover the entirety of both files,  and
              FILE_XCHG_RANGE_FULL_FILES was set in flags.

       EINVAL The  parameters  are  not correct for these files.  This
              error can also appear if either file  descriptor  repre‐
              sents  a device, FIFO, or socket.  Disk filesystems gen‐
              erally require the offset and  length  arguments  to  be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The  kernel  was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free space in the filesystem to
              exchange the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd are immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
              filesystem.

CONFORMING TO
       This API is Linux-specific.

USE CASES
       Three use cases are imagined for this system call.

       The  first  is a filesystem defragmenter, which copies the con‐
       tents of a file into another file and wishes  to  exchange  the
       space  mappings  of  the  two files, provided that the original
       file has not changed.  The flags NONATOMIC and FILE2_FRESH  are
       recommended for this application.

       The  second is a data storage program that wants to commit non-
       contiguous updates to a file atomically.  This can be  done  by
       creating a temporary file, calling FICLONE(2) to share the con‐
       tents, and staging the updates into the temporary file.  Either
        of the FULL_FILES or TO_EOF flags is recommended, along with
       FSYNC.  Depending on  the  application's  locking  design,  the
       flags FILE2_FRESH or COMMIT may be applicable here.  The tempo‐
       rary file can be deleted or punched out afterwards.

       The third is a software-defined storage host (e.g. a disk juke‐
       box)  which  implements an atomic scatter-gather write command.
       Provided the exported disk's logical  block  size  matches  the
       file's  allocation  unit  size,  this can be done by creating a
       temporary file and writing the data at the appropriate offsets.
        Use this call with the SKIP_FILE1_HOLES flag to exchange only the
       blocks involved in the write command.  The  use  of  the  FSYNC
       flag is recommended here.  The temporary file should be deleted
       or punched out completely before being reused to stage  another
       write.

NOTES
       Some  filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

Linux                         2022-12-31     IOCTL-FIEXCHANGE_RANGE(2)
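
To make the second use case above concrete, here is a sketch (again,
not part of the patches) of the FILE2_FRESH/COMMIT flow: sample the
original file's identity with fstat() before staging the update, then
let the kernel reject the exchange if the file changed in the meantime.
The helper name is made up and error handling is omitted:

#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fiexchange.h>	/* proposed FIEXCHANGE_RANGE uapi */

int commit_if_unchanged(int tmp_fd, int orig_fd, const struct stat *before)
{
	struct file_xchg_range fxr = { 0 };

	/* 'before' was taken with fstat(orig_fd, ...) before staging. */
	fxr.file1_fd = tmp_fd;
	fxr.flags = FILE_XCHG_RANGE_TO_EOF | FILE_XCHG_RANGE_COMMIT;
	fxr.file2_ino = before->st_ino;
	fxr.file2_mtime = before->st_mtim.tv_sec;
	fxr.file2_mtime_nsec = before->st_mtim.tv_nsec;
	fxr.file2_ctime = before->st_ctim.tv_sec;
	fxr.file2_ctime_nsec = before->st_ctim.tv_nsec;

	/* Fails with EBUSY if orig_fd was modified after the fstat(). */
	return ioctl(orig_fd, FIEXCHANGE_RANGE, &fxr);
}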

The reference implementation in XFS creates a new log incompat feature
and log intent items to track the high-level progress of swapping ranges
of two files and to finish interrupted work if the system goes down.
Sample code to exercise the use cases mentioned above can be found in
the corresponding changes to xfs_io.

Note that this functionality is /not/ the O_DIRECT atomic file write
concept that has also been floating around for years.  This RFC is
implemented entirely in software, which means that there are no
limitations other than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file swap is
implemented as an atomic inode fork swap, which means that we can
implement online reconstruction of extended attributes and directories
by building a new copy in another inode and atomically swapping in the
contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic extent swapping.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be swapped atomically into
the inode being repaired.  This is essential to avoid making corruption
problems worse if the system goes down in the middle of running repair.

This patchset also ports the old XFS extent swap ioctl interface to use
the new extent swap code.

This series also includes the userspace pieces needed to test the new
functionality, along with a sample implementation of atomic file
updates.

Question: Should we really bother with fsdevel bikeshedding?  Most
filesystems cannot support this functionality, so we could keep it
private to XFS for now.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates

xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
 Documentation/filesystems/vfs.rst  |   16 
 fs/ioctl.c                         |   27 +
 fs/remap_range.c                   |  296 ++++++++
 fs/xfs/Makefile                    |    3 
 fs/xfs/libxfs/xfs_bmap.h           |    4 
 fs/xfs/libxfs/xfs_defer.c          |    7 
 fs/xfs/libxfs/xfs_defer.h          |    3 
 fs/xfs/libxfs/xfs_errortag.h       |    4 
 fs/xfs/libxfs/xfs_format.h         |   15 
 fs/xfs/libxfs/xfs_fs.h             |    2 
 fs/xfs/libxfs/xfs_log_format.h     |   80 ++
 fs/xfs/libxfs/xfs_log_recover.h    |    2 
 fs/xfs/libxfs/xfs_sb.c             |    3 
 fs/xfs/libxfs/xfs_swapext.c        | 1258 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_swapext.h        |  170 +++++
 fs/xfs/libxfs/xfs_symlink_remote.c |   47 +
 fs/xfs/libxfs/xfs_symlink_remote.h |    1 
 fs/xfs/libxfs/xfs_trans_space.h    |    4 
 fs/xfs/xfs_bmap_util.c             |  620 ------------------
 fs/xfs/xfs_bmap_util.h             |    3 
 fs/xfs/xfs_error.c                 |    3 
 fs/xfs/xfs_file.c                  |   94 ++-
 fs/xfs/xfs_inode.c                 |   13 
 fs/xfs/xfs_inode.h                 |    6 
 fs/xfs/xfs_ioctl.c                 |  102 +--
 fs/xfs/xfs_ioctl.h                 |    4 
 fs/xfs/xfs_ioctl32.c               |   11 
 fs/xfs/xfs_linux.h                 |    5 
 fs/xfs/xfs_log.c                   |   47 +
 fs/xfs/xfs_log.h                   |   10 
 fs/xfs/xfs_log_priv.h              |    3 
 fs/xfs/xfs_log_recover.c           |    5 
 fs/xfs/xfs_mount.c                 |   11 
 fs/xfs/xfs_mount.h                 |    7 
 fs/xfs/xfs_rtalloc.c               |  136 ++++
 fs/xfs/xfs_rtalloc.h               |    3 
 fs/xfs/xfs_super.c                 |   19 +
 fs/xfs/xfs_swapext_item.c          |  657 +++++++++++++++++++
 fs/xfs/xfs_swapext_item.h          |   56 ++
 fs/xfs/xfs_symlink.c               |   49 -
 fs/xfs/xfs_trace.c                 |    2 
 fs/xfs/xfs_trace.h                 |  351 ++++++++++
 fs/xfs/xfs_xattr.c                 |    6 
 fs/xfs/xfs_xchgrange.c             |  964 ++++++++++++++++++++++++++++
 fs/xfs/xfs_xchgrange.h             |   40 +
 include/linux/fs.h                 |   14 
 include/uapi/linux/fiexchange.h    |  101 +++
 47 files changed, 4473 insertions(+), 811 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_swapext.c
 create mode 100644 fs/xfs/libxfs/xfs_swapext.h
 create mode 100644 fs/xfs/xfs_swapext_item.c
 create mode 100644 fs/xfs/xfs_swapext_item.h
 create mode 100644 fs/xfs/xfs_xchgrange.c
 create mode 100644 fs/xfs/xfs_xchgrange.h
 create mode 100644 include/uapi/linux/fiexchange.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 01/21] vfs: introduce new file range exchange ioctl
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 03/21] xfs: refactor non-power-of-two alignment checks Darrick J. Wong
                     ` (18 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Introduce a new ioctl to handle swapping ranges of bytes between files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/filesystems/vfs.rst |   16 ++
 fs/ioctl.c                        |   27 +++
 fs/remap_range.c                  |  296 +++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_fs.h            |    1 
 include/linux/fs.h                |   14 ++
 include/uapi/linux/fiexchange.h   |  101 +++++++++++++
 6 files changed, 454 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/fiexchange.h


diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 2c15e7053113..cae6dd3a8a0b 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1036,6 +1036,8 @@ This describes how the VFS can manipulate an open file.  As of kernel
 		loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
 					   struct file *file_out, loff_t pos_out,
 					   loff_t len, unsigned int remap_flags);
+                int (*xchg_file_range)(struct file *file1, struct file *file2,
+                                       struct file_xchg_range *fxr);
 		int (*fadvise)(struct file *, loff_t, loff_t, int);
 	};
 
@@ -1154,6 +1156,20 @@ otherwise noted.
 	ok with the implementation shortening the request length to
 	satisfy alignment or EOF requirements (or any other reason).
 
+``xchg_file_range``
+	called by the ioctl(2) system call for FIEXCHANGE_RANGE to exchange the
+	contents of two file ranges.  An implementation should exchange
+	fxr.length bytes starting at fxr.file1_offset in file1 with the same
+	number of bytes starting at fxr.file2_offset in file2.  Refer to
+	fiexchange.h file for more information.  Implementations must call
+	generic_xchg_file_range_prep to prepare the two files prior to taking
+	locks; they must call generic_xchg_file_range_check_fresh once the
+	inode is locked to abort the call if file2 has changed; and they must
+	update the inode change and mod times of both files as part of the
+	metadata update.  The timestamp updates must be done atomically as part
+	of the data exchange operation to ensure correctness of the freshness
+	check.
+
 ``fadvise``
 	possibly called by the fadvise64() system call.
 
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 80ac36aea913..bd636daf8e90 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -259,6 +259,30 @@ static long ioctl_file_clone_range(struct file *file,
 				args.src_length, args.dest_offset);
 }
 
+static long ioctl_file_xchg_range(struct file *file2,
+				  struct file_xchg_range __user *argp)
+{
+	struct file_xchg_range args;
+	struct fd file1;
+	int ret;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+
+	ret = -EXDEV;
+	if (file1.file->f_path.mnt != file2->f_path.mnt)
+		goto fdput;
+
+	ret = vfs_xchg_file_range(file1.file, file2, &args);
+fdput:
+	fdput(file1);
+	return ret;
+}
+
 /*
  * This provides compatibility with legacy XFS pre-allocation ioctls
  * which predate the fallocate syscall.
@@ -825,6 +849,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FIDEDUPERANGE:
 		return ioctl_file_dedupe_range(filp, argp);
 
+	case FIEXCHANGE_RANGE:
+		return ioctl_file_xchg_range(filp, argp);
+
 	case FIONREAD:
 		if (!S_ISREG(inode->i_mode))
 			return vfs_ioctl(filp, cmd, arg);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index 41f60477bb41..469d53fb42e9 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -567,3 +567,299 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	return ret;
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range);
+
+/* Performs necessary checks before doing a range exchange. */
+static int generic_xchg_file_range_checks(struct file *file1,
+					  struct file *file2,
+					  struct file_xchg_range *fxr,
+					  unsigned int blocksize)
+{
+	struct inode *inode1 = file1->f_mapping->host;
+	struct inode *inode2 = file2->f_mapping->host;
+	uint64_t blkmask = blocksize - 1;
+	int64_t test_len;
+	uint64_t blen;
+	loff_t size1, size2;
+	int ret;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
+		return -EPERM;
+	if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2))
+		return -ETXTBSY;
+
+	size1 = i_size_read(inode1);
+	size2 = i_size_read(inode2);
+
+	/* Ranges cannot start after EOF. */
+	if (fxr->file1_offset > size1 || fxr->file2_offset > size2)
+		return -EINVAL;
+
+	/*
+	 * If the caller asked for full files, check that the offset/length
+	 * values cover all of both files.
+	 */
+	if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+	    (fxr->file1_offset != 0 || fxr->file2_offset != 0 ||
+	     fxr->length != size1 || fxr->length != size2))
+		return -EDOM;
+
+	/*
+	 * If the caller said to exchange to EOF, we set the length of the
+	 * request large enough to cover everything to the end of both files.
+	 */
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		fxr->length = max_t(int64_t, size1 - fxr->file1_offset,
+					     size2 - fxr->file2_offset);
+
+	/* The start of both ranges must be aligned to an fs block. */
+	if (!IS_ALIGNED(fxr->file1_offset, blocksize) ||
+	    !IS_ALIGNED(fxr->file2_offset, blocksize))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (fxr->file1_offset + fxr->length < fxr->file1_offset ||
+	    fxr->file2_offset + fxr->length < fxr->file2_offset)
+		return -EINVAL;
+
+	/*
+	 * We require both ranges to be within EOF, unless we're exchanging
+	 * to EOF.  The offset checks above already confirmed that both
+	 * fxr->file1_offset and fxr->file2_offset are within EOF.
+	 */
+	if (!(fxr->flags & FILE_XCHG_RANGE_TO_EOF) &&
+	    (fxr->file1_offset + fxr->length > size1 ||
+	     fxr->file2_offset + fxr->length > size2))
+		return -EINVAL;
+
+	/*
+	 * Make sure we don't hit any file size limits.  If we hit any size
+	 * limits such that test_length was adjusted, we abort the whole
+	 * operation.
+	 */
+	test_len = fxr->length;
+	ret = generic_write_check_limits(file2, fxr->file2_offset, &test_len);
+	if (ret)
+		return ret;
+	ret = generic_write_check_limits(file1, fxr->file1_offset, &test_len);
+	if (ret)
+		return ret;
+	if (test_len != fxr->length)
+		return -EINVAL;
+
+	/*
+	 * If the user wanted us to exchange up to the infile's EOF, round up
+	 * to the next block boundary for this check.  Do the same for the
+	 * outfile.
+	 *
+	 * Otherwise, reject the range length if it's not block aligned.  We
+	 * already confirmed the starting offsets' block alignment.
+	 */
+	if (fxr->file1_offset + fxr->length == size1)
+		blen = ALIGN(size1, blocksize) - fxr->file1_offset;
+	else if (fxr->file2_offset + fxr->length == size2)
+		blen = ALIGN(size2, blocksize) - fxr->file2_offset;
+	else if (!IS_ALIGNED(fxr->length, blocksize))
+		return -EINVAL;
+	else
+		blen = fxr->length;
+
+	/* Don't allow overlapped exchanges within the same file. */
+	if (inode1 == inode2 &&
+	    fxr->file2_offset + blen > fxr->file1_offset &&
+	    fxr->file1_offset + blen > fxr->file2_offset)
+		return -EINVAL;
+
+	/* If we already failed the freshness check, we're done. */
+	ret = generic_xchg_file_range_check_fresh(inode2, fxr);
+	if (ret)
+		return ret;
+
+	/*
+	 * Ensure that we don't exchange a partial EOF block into the middle of
+	 * another file.
+	 */
+	if ((fxr->length & blkmask) == 0)
+		return 0;
+
+	blen = fxr->length;
+	if (fxr->file2_offset + blen < size2)
+		blen &= ~blkmask;
+
+	if (fxr->file1_offset + blen < size1)
+		blen &= ~blkmask;
+
+	return blen == fxr->length ? 0 : -EINVAL;
+}
+
+/*
+ * Check that the two inodes are eligible for range exchanges, the ranges make
+ * sense, and then flush all dirty data.  Caller must ensure that the inodes
+ * have been locked against any other modifications.
+ */
+int generic_xchg_file_range_prep(struct file *file1, struct file *file2,
+				 struct file_xchg_range *fxr,
+				 unsigned int blocksize)
+{
+	struct inode *inode1 = file_inode(file1);
+	struct inode *inode2 = file_inode(file2);
+	bool same_inode = (inode1 == inode2);
+	int ret;
+
+	/* Check that we don't violate system file offset limits. */
+	ret = generic_xchg_file_range_checks(file1, file2, fxr, blocksize);
+	if (ret || fxr->length == 0)
+		return ret;
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode1);
+	if (!same_inode)
+		inode_dio_wait(inode2);
+
+	ret = filemap_write_and_wait_range(inode1->i_mapping, fxr->file1_offset,
+					   fxr->file1_offset + fxr->length - 1);
+	if (ret)
+		return ret;
+
+	ret = filemap_write_and_wait_range(inode2->i_mapping, fxr->file2_offset,
+					   fxr->file2_offset + fxr->length - 1);
+	if (ret)
+		return ret;
+
+	/*
+	 * If the files or inodes involved require synchronous writes, amend
+	 * the request to force the filesystem to flush all data and metadata
+	 * to disk after the operation completes.
+	 */
+	if (((file1->f_flags | file2->f_flags) & (__O_SYNC | O_DSYNC)) ||
+	    IS_SYNC(inode1) || IS_SYNC(inode2))
+		fxr->flags |= FILE_XCHG_RANGE_FSYNC;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(generic_xchg_file_range_prep);
+
+/*
+ * Finish a range exchange operation, if it was successful.  Caller must ensure
+ * that the inodes are still locked against any other modifications.
+ */
+int generic_xchg_file_range_finish(struct file *file1, struct file *file2)
+{
+	int ret;
+
+	ret = file_remove_privs(file1);
+	if (ret)
+		return ret;
+	if (file_inode(file1) == file_inode(file2))
+		return 0;
+
+	return file_remove_privs(file2);
+}
+EXPORT_SYMBOL_GPL(generic_xchg_file_range_finish);
+
+/*
+ * Check that both files' metadata agree with the snapshot that we took for
+ * the range exchange request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+int generic_xchg_file_range_check_fresh(struct inode *inode2,
+					const struct file_xchg_range *fxr)
+{
+	/* Check that file2 hasn't otherwise been modified. */
+	if ((fxr->flags & FILE_XCHG_RANGE_FILE2_FRESH) &&
+	    (fxr->file2_ino        != inode2->i_ino ||
+	     fxr->file2_ctime      != inode2->i_ctime.tv_sec  ||
+	     fxr->file2_ctime_nsec != inode2->i_ctime.tv_nsec ||
+	     fxr->file2_mtime      != inode2->i_mtime.tv_sec  ||
+	     fxr->file2_mtime_nsec != inode2->i_mtime.tv_nsec))
+		return -EBUSY;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(generic_xchg_file_range_check_fresh);
+
+static inline int xchg_range_verify_area(struct file *file, loff_t pos,
+					 struct file_xchg_range *fxr)
+{
+	int64_t len = fxr->length;
+
+	if (pos < 0)
+		return -EINVAL;
+
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		len = min_t(int64_t, len, i_size_read(file_inode(file)) - pos);
+	return remap_verify_area(file, pos, len, true);
+}
+
+int do_xchg_file_range(struct file *file1, struct file *file2,
+		       struct file_xchg_range *fxr)
+{
+	struct inode *inode1 = file_inode(file1);
+	struct inode *inode2 = file_inode(file2);
+	int ret;
+
+	if ((fxr->flags & ~FILE_XCHG_RANGE_ALL_FLAGS) ||
+	    memchr_inv(&fxr->pad, 0, sizeof(fxr->pad)))
+		return -EINVAL;
+
+	if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+	    (fxr->flags & FILE_XCHG_RANGE_TO_EOF))
+		return -EINVAL;
+
+	/*
+	 * The ioctl enforces that src and dest files are on the same mount.
+	 * Practically, they only need to be on the same file system.
+	 */
+	if (inode1->i_sb != inode2->i_sb)
+		return -EXDEV;
+
+	/* This only works for regular files. */
+	if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode))
+		return -EINVAL;
+
+	ret = generic_file_rw_checks(file1, file2);
+	if (ret < 0)
+		return ret;
+
+	ret = generic_file_rw_checks(file2, file1);
+	if (ret < 0)
+		return ret;
+
+	if (!file1->f_op->xchg_file_range)
+		return -EOPNOTSUPP;
+
+	ret = xchg_range_verify_area(file1, fxr->file1_offset, fxr);
+	if (ret)
+		return ret;
+
+	ret = xchg_range_verify_area(file2, fxr->file2_offset, fxr);
+	if (ret)
+		return ret;
+
+	ret = file2->f_op->xchg_file_range(file1, file2, fxr);
+	if (ret)
+		return ret;
+
+	fsnotify_modify(file1);
+	if (file2 != file1)
+		fsnotify_modify(file2);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(do_xchg_file_range);
+
+int vfs_xchg_file_range(struct file *file1, struct file *file2,
+			struct file_xchg_range *fxr)
+{
+	int ret;
+
+	file_start_write(file2);
+	ret = do_xchg_file_range(file1, file2, fxr);
+	file_end_write(file2);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfs_xchg_file_range);
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 400cf68e551e..210c17f5a16c 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -841,6 +841,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
+/*	FIEXCHANGE_RANGE ----------- hoisted 129	 */
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 066555ad1bf8..cd86ac22c339 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -46,6 +46,7 @@
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
+#include <uapi/linux/fiexchange.h>
 
 struct backing_dev_info;
 struct bdi_writeback;
@@ -2125,6 +2126,8 @@ struct file_operations {
 	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
+	int (*xchg_file_range)(struct file *file1, struct file *file2,
+			       struct file_xchg_range *fsr);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
@@ -2205,6 +2208,10 @@ int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t *count, unsigned int remap_flags);
+int generic_xchg_file_range_prep(struct file *file1, struct file *file2,
+				 struct file_xchg_range *fsr,
+				 unsigned int blocksize);
+int generic_xchg_file_range_finish(struct file *file1, struct file *file2);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
@@ -2216,7 +2223,12 @@ extern int vfs_dedupe_file_range(struct file *file,
 extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 					struct file *dst_file, loff_t dst_pos,
 					loff_t len, unsigned int remap_flags);
-
+extern int do_xchg_file_range(struct file *file1, struct file *file2,
+			      struct file_xchg_range *fsr);
+extern int vfs_xchg_file_range(struct file *file1, struct file *file2,
+			       struct file_xchg_range *fsr);
+extern int generic_xchg_file_range_check_fresh(struct inode *inode2,
+					const struct file_xchg_range *fsr);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fiexchange.h b/include/uapi/linux/fiexchange.h
new file mode 100644
index 000000000000..72bc228d4141
--- /dev/null
+++ b/include/uapi/linux/fiexchange.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */
+/*
+ * FIEXCHANGE_RANGE ioctl definitions, to facilitate exchanging parts of files.
+ *
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _LINUX_FIEXCHANGE_H
+#define _LINUX_FIEXCHANGE_H
+
+#include <linux/types.h>
+
+/*
+ * Exchange part of file1 with part of the file that this ioctl is being
+ * called against (which we'll call file2).  Filesystems must be able to
+ * restart and complete the operation even after the system goes down.
+ */
+struct file_xchg_range {
+	__s64		file1_fd;
+	__s64		file1_offset;	/* file1 offset, bytes */
+	__s64		file2_offset;	/* file2 offset, bytes */
+	__u64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see FILE_XCHG_RANGE_* below */
+
+	/* file2 metadata for optional freshness checks */
+	__s64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+
+	__u64		pad[6];		/* must be zeroes */
+};
+
+/*
+ * Atomic exchange operations are not required.  This relaxes the requirement
+ * that the filesystem must be able to complete the operation after a crash.
+ */
+#define FILE_XCHG_RANGE_NONATOMIC	(1 << 0)
+
+/*
+ * Check file2's inode number, mtime, and ctime against the values
+ * provided, and return -EBUSY if there isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FILE2_FRESH	(1 << 1)
+
+/*
+ * Check that file1's length is equal to file1_offset + length, and that
+ * file2's length is equal to file2_offset + length.  Returns -EDOM if there
+ * isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FULL_FILES	(1 << 2)
+
+/*
+ * Exchange file data all the way to the ends of both files, and then exchange
+ * the file sizes.  This flag can be used to replace a file's contents with a
+ * different amount of data.  length will be ignored.
+ */
+#define FILE_XCHG_RANGE_TO_EOF		(1 << 3)
+
+/* Flush all changes in file data and file metadata to disk before returning. */
+#define FILE_XCHG_RANGE_FSYNC		(1 << 4)
+
+/* Dry run; do all the parameter verification but do not change anything. */
+#define FILE_XCHG_RANGE_DRY_RUN		(1 << 5)
+
+/*
+ * Do not exchange any part of the range where file1's mapping is a hole.  This
+ * can be used to emulate scatter-gather atomic writes with a temp file.
+ */
+#define FILE_XCHG_RANGE_SKIP_FILE1_HOLES (1 << 6)
+
+/*
+ * Commit the contents of file1 into file2 if file2 has the same inode number,
+ * mtime, and ctime as the arguments provided to the call.  The old contents of
+ * file2 will be moved to file1.
+ *
+ * With this flag, all committed information can be retrieved even if the
+ * system crashes or is rebooted.  This includes writing through or flushing a
+ * disk cache if present.  The call blocks until the device reports that the
+ * commit is complete.
+ *
+ * This flag should not be combined with NONATOMIC.  It can be combined with
+ * SKIP_FILE1_HOLES.
+ */
+#define FILE_XCHG_RANGE_COMMIT		(FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FSYNC)
+
+#define FILE_XCHG_RANGE_ALL_FLAGS	(FILE_XCHG_RANGE_NONATOMIC | \
+					 FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FULL_FILES | \
+					 FILE_XCHG_RANGE_TO_EOF | \
+					 FILE_XCHG_RANGE_FSYNC | \
+					 FILE_XCHG_RANGE_DRY_RUN | \
+					 FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+
+#define FIEXCHANGE_RANGE	_IOWR('X', 129, struct file_xchg_range)
+
+#endif /* _LINUX_FIEXCHANGE_H */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 02/21] xfs: create a new helper to return a file's allocation unit
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 03/21] xfs: refactor non-power-of-two alignment checks Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 04/21] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
                     ` (16 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Create a new helper function to calculate the fundamental allocation
unit (i.e. the smallest unit of space we can allocate) of a file.
Things are going to get hairy with range-exchange on the realtime
device, so prepare for this now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  |   26 +++++++++-----------------
 fs/xfs/xfs_inode.c |   13 +++++++++++++
 fs/xfs/xfs_inode.h |    1 +
 3 files changed, 23 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4bdadd8fa71..b382380656d7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,27 +44,19 @@ xfs_is_falloc_aligned(
 	loff_t			pos,
 	long long int		len)
 {
-	struct xfs_mount	*mp = ip->i_mount;
-	uint64_t		mask;
+	unsigned int		alloc_unit = xfs_inode_alloc_unitsize(ip);
 
-	if (XFS_IS_REALTIME_INODE(ip)) {
-		if (!is_power_of_2(mp->m_sb.sb_rextsize)) {
-			u64	rextbytes;
-			u32	mod;
+	if (XFS_IS_REALTIME_INODE(ip) && !is_power_of_2(alloc_unit)) {
+		u32	mod;
 
-			rextbytes = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
-			div_u64_rem(pos, rextbytes, &mod);
-			if (mod)
-				return false;
-			div_u64_rem(len, rextbytes, &mod);
-			return mod == 0;
-		}
-		mask = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) - 1;
-	} else {
-		mask = mp->m_sb.sb_blocksize - 1;
+		div_u64_rem(pos, alloc_unit, &mod);
+		if (mod)
+			return false;
+		div_u64_rem(len, alloc_unit, &mod);
+		return mod == 0;
 	}
 
-	return !((pos | len) & mask);
+	return !((pos | len) & (alloc_unit - 1));
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b082222a9061..04ceafb936bc 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3795,3 +3795,16 @@ xfs_inode_count_blocks(
 	xfs_bmap_count_leaves(ifp, rblocks);
 	*dblocks = ip->i_nblocks - *rblocks;
 }
+
+/* Returns the size of fundamental allocation unit for a file, in bytes. */
+unsigned int
+xfs_inode_alloc_unitsize(
+	struct xfs_inode	*ip)
+{
+	unsigned int		blocks = 1;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		blocks = ip->i_mount->m_sb.sb_rextsize;
+
+	return XFS_FSB_TO_B(ip->i_mount, blocks);
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 926e4dd566d0..4b01d078ace2 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -577,6 +577,7 @@ void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
 
 void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
+unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
 
 /*
  * Parameters for tracking bumplink and droplink operations.  The hook


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 03/21] xfs: refactor non-power-of-two alignment checks
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 01/21] vfs: introduce new file range exchange ioctl Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 02/21] xfs: create a new helper to return a file's allocation unit Darrick J. Wong
                     ` (17 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Create a helper function that can compute if a 64-bit number is an
integer multiple of a 32-bit number, where the 32-bit number is not
required to be an even power of two.  This is needed for some new code
for the realtime device, where we can set 37k allocation units and then
have to remap them.
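
As a quick illustration of why the mask trick fails here (a standalone
userspace sketch, not part of this patch): for a 37k allocation unit,
masking with (unit - 1) misreports an exactly aligned offset, while a
remainder check, which is what the new helper does, gets it right.

#include <stdint.h>
#include <stdio.h>

/* Remainder-based alignment test, usable for any allocation unit. */
static int aligned_64(uint64_t x, uint32_t unit)
{
	return x % unit == 0;
}

int main(void)
{
	uint32_t unit = 37 * 1024;	/* 37888: not a power of two */
	uint64_t pos = 37 * 1024;	/* exactly one allocation unit */

	/* Prints 1: 37888 is a multiple of 37888. */
	printf("remainder check: %d\n", aligned_64(pos, unit));

	/* Prints 0: 37888 & 37887 == 36864, so the mask test wrongly
	 * claims the offset is unaligned. */
	printf("mask check:      %d\n", (pos & (uint64_t)(unit - 1)) == 0);
	return 0;
}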

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  |   12 +++---------
 fs/xfs/xfs_linux.h |    5 +++++
 2 files changed, 8 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b382380656d7..78323574021c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -46,15 +46,9 @@ xfs_is_falloc_aligned(
 {
 	unsigned int		alloc_unit = xfs_inode_alloc_unitsize(ip);
 
-	if (XFS_IS_REALTIME_INODE(ip) && !is_power_of_2(alloc_unit)) {
-		u32	mod;
-
-		div_u64_rem(pos, alloc_unit, &mod);
-		if (mod)
-			return false;
-		div_u64_rem(len, alloc_unit, &mod);
-		return mod == 0;
-	}
+	if (XFS_IS_REALTIME_INODE(ip) && !is_power_of_2(alloc_unit))
+		return  isaligned_64(pos, alloc_unit) &&
+			isaligned_64(len, alloc_unit);
 
 	return !((pos | len) & (alloc_unit - 1));
 }
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 3847719c3026..7e9bf03c80a3 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -197,6 +197,11 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
 	return x;
 }
 
+static inline bool isaligned_64(uint64_t x, uint32_t y)
+{
+	return do_div(x, y) == 0;
+}
+
 int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
 		char *data, enum req_op op);
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 04/21] xfs: parameterize all the incompat log feature helpers
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 02/21] xfs: create a new helper to return a file's allocation unit Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 06/21] xfs: introduce a swap-extent log intent item Darrick J. Wong
                     ` (15 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

We're about to define a new XFS_SB_FEAT_INCOMPAT_LOG_ bit, which means
that callers will soon require the ability to toggle on and off
different log incompat feature bits.  Parameterize the
xlog_{use,drop}_incompat_feat and xfs_sb_remove_incompat_log_features
functions so that callers can specify which feature they're trying to
use and so that we can clear individual log incompat bits as needed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h |    5 +++--
 fs/xfs/xfs_log.c           |   34 +++++++++++++++++++++++++---------
 fs/xfs/xfs_log.h           |    9 ++++++---
 fs/xfs/xfs_log_priv.h      |    2 +-
 fs/xfs/xfs_log_recover.c   |    3 ++-
 fs/xfs/xfs_mount.c         |   11 +++++------
 fs/xfs/xfs_mount.h         |    2 +-
 fs/xfs/xfs_xattr.c         |    6 +++---
 8 files changed, 46 insertions(+), 26 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 5ba2dae7aa2f..817adb36cb1e 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -404,9 +404,10 @@ xfs_sb_has_incompat_log_feature(
 
 static inline void
 xfs_sb_remove_incompat_log_features(
-	struct xfs_sb	*sbp)
+	struct xfs_sb	*sbp,
+	uint32_t	feature)
 {
-	sbp->sb_features_log_incompat &= ~XFS_SB_FEAT_INCOMPAT_LOG_ALL;
+	sbp->sb_features_log_incompat &= ~feature;
 }
 
 static inline void
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b32a8e57f576..a0ef09addc84 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1082,7 +1082,7 @@ xfs_log_quiesce(
 	 * failures, though it's not fatal to have a higher log feature
 	 * protection level than the log contents actually require.
 	 */
-	if (xfs_clear_incompat_log_features(mp)) {
+	if (xfs_clear_incompat_log_features(mp, XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
 		int error;
 
 		error = xfs_sync_sb(mp, false);
@@ -1489,6 +1489,7 @@ xlog_clear_incompat(
 	struct xlog		*log)
 {
 	struct xfs_mount	*mp = log->l_mp;
+	uint32_t		incompat_mask = 0;
 
 	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
 				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
@@ -1497,11 +1498,16 @@ xlog_clear_incompat(
 	if (log->l_covered_state != XLOG_STATE_COVER_DONE2)
 		return;
 
-	if (!down_write_trylock(&log->l_incompat_users))
+	if (down_write_trylock(&log->l_incompat_xattrs))
+		incompat_mask |= XFS_SB_FEAT_INCOMPAT_LOG_XATTRS;
+
+	if (!incompat_mask)
 		return;
 
-	xfs_clear_incompat_log_features(mp);
-	up_write(&log->l_incompat_users);
+	xfs_clear_incompat_log_features(mp, incompat_mask);
+
+	if (incompat_mask & XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
+		up_write(&log->l_incompat_xattrs);
 }
 
 /*
@@ -1618,7 +1624,7 @@ xlog_alloc_log(
 	}
 	log->l_sectBBsize = 1 << log2_size;
 
-	init_rwsem(&log->l_incompat_users);
+	init_rwsem(&log->l_incompat_xattrs);
 
 	xlog_get_iclog_buffer_size(mp, log);
 
@@ -3909,15 +3915,25 @@ xfs_log_check_lsn(
  */
 void
 xlog_use_incompat_feat(
-	struct xlog		*log)
+	struct xlog		*log,
+	enum xlog_incompat_feat	what)
 {
-	down_read(&log->l_incompat_users);
+	switch (what) {
+	case XLOG_INCOMPAT_FEAT_XATTRS:
+		down_read(&log->l_incompat_xattrs);
+		break;
+	}
 }
 
 /* Notify the log that we've finished using log incompat features. */
 void
 xlog_drop_incompat_feat(
-	struct xlog		*log)
+	struct xlog		*log,
+	enum xlog_incompat_feat	what)
 {
-	up_read(&log->l_incompat_users);
+	switch (what) {
+	case XLOG_INCOMPAT_FEAT_XATTRS:
+		up_read(&log->l_incompat_xattrs);
+		break;
+	}
 }
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 2728886c2963..d187f6445909 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -159,8 +159,11 @@ bool	xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t);
 xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);
 bool	  xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags);
 
-void xlog_use_incompat_feat(struct xlog *log);
-void xlog_drop_incompat_feat(struct xlog *log);
-int xfs_attr_use_log_assist(struct xfs_mount *mp);
+enum xlog_incompat_feat {
+	XLOG_INCOMPAT_FEAT_XATTRS = XFS_SB_FEAT_INCOMPAT_LOG_XATTRS,
+};
+
+void xlog_use_incompat_feat(struct xlog *log, enum xlog_incompat_feat what);
+void xlog_drop_incompat_feat(struct xlog *log, enum xlog_incompat_feat what);
 
 #endif	/* __XFS_LOG_H__ */
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 1bd2963e8fbd..a13b5b6b744d 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -447,7 +447,7 @@ struct xlog {
 	uint32_t		l_iclog_roundoff;/* padding roundoff */
 
 	/* Users of log incompat features should take a read lock. */
-	struct rw_semaphore	l_incompat_users;
+	struct rw_semaphore	l_incompat_xattrs;
 };
 
 /*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 6b1f37bc3e95..81ce08c23306 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3473,7 +3473,8 @@ xlog_recover_finish(
 	 * longer anything to protect.  We rely on the AIL push to write out the
 	 * updated superblock after everything else.
 	 */
-	if (xfs_clear_incompat_log_features(log->l_mp)) {
+	if (xfs_clear_incompat_log_features(log->l_mp,
+				XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
 		error = xfs_sync_sb(log->l_mp, false);
 		if (error < 0) {
 			xfs_alert(log->l_mp,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 31f49211fdd6..54cd47882991 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1357,13 +1357,13 @@ xfs_add_incompat_log_feature(
  */
 bool
 xfs_clear_incompat_log_features(
-	struct xfs_mount	*mp)
+	struct xfs_mount	*mp,
+	uint32_t		features)
 {
 	bool			ret = false;
 
 	if (!xfs_has_crc(mp) ||
-	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
-				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
+	    !xfs_sb_has_incompat_log_feature(&mp->m_sb, features) ||
 	    xfs_is_shutdown(mp))
 		return false;
 
@@ -1375,9 +1375,8 @@ xfs_clear_incompat_log_features(
 	xfs_buf_lock(mp->m_sb_bp);
 	xfs_buf_hold(mp->m_sb_bp);
 
-	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
-				XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
-		xfs_sb_remove_incompat_log_features(&mp->m_sb);
+	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, features)) {
+		xfs_sb_remove_incompat_log_features(&mp->m_sb, features);
 		ret = true;
 	}
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index ec8b185d45f8..7c48a2b70f6f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -547,7 +547,7 @@ struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
 		int error_class, int error);
 void xfs_force_summary_recalc(struct xfs_mount *mp);
 int xfs_add_incompat_log_feature(struct xfs_mount *mp, uint32_t feature);
-bool xfs_clear_incompat_log_features(struct xfs_mount *mp);
+bool xfs_clear_incompat_log_features(struct xfs_mount *mp, uint32_t feature);
 void xfs_mod_delalloc(struct xfs_mount *mp, int64_t delta);
 
 #endif	/* __XFS_MOUNT_H__ */
diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
index 10aa1fd39d2b..e03f199f50c7 100644
--- a/fs/xfs/xfs_xattr.c
+++ b/fs/xfs/xfs_xattr.c
@@ -37,7 +37,7 @@ xfs_attr_grab_log_assist(
 	 * Protect ourselves from an idle log clearing the logged xattrs log
 	 * incompat feature bit.
 	 */
-	xlog_use_incompat_feat(mp->m_log);
+	xlog_use_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);
 
 	/*
 	 * If log-assisted xattrs are already enabled, the caller can use the
@@ -57,7 +57,7 @@ xfs_attr_grab_log_assist(
 
 	return 0;
 drop_incompat:
-	xlog_drop_incompat_feat(mp->m_log);
+	xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);
 	return error;
 }
 
@@ -65,7 +65,7 @@ static inline void
 xfs_attr_rele_log_assist(
 	struct xfs_mount	*mp)
 {
-	xlog_drop_incompat_feat(mp->m_log);
+	xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_XATTRS);
 }
 
 static inline bool


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 01/21] vfs: introduce new file range exchange ioctl Darrick J. Wong
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Create a log incompat flag so that we only attempt to process swap
extent log items if the filesystem supports them, and a geometry flag
to advertise that support to userspace when the feature is enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h  |    1 +
 fs/xfs/libxfs/xfs_fs.h      |    1 +
 fs/xfs/libxfs/xfs_sb.c      |    3 +++
 fs/xfs/libxfs/xfs_swapext.h |   24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 fs/xfs/libxfs/xfs_swapext.h


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 817adb36cb1e..1424976ec955 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -391,6 +391,7 @@ xfs_sb_has_incompat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS   (1 << 0)	/* Delayed Attributes */
+#define XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT  (1U << 31)	/* file extent swap */
 #define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
 	(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 210c17f5a16c..a39fd65e6ee0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_BIGTIME	(1 << 21) /* 64-bit nsec timestamps */
 #define XFS_FSOP_GEOM_FLAGS_INOBTCNT	(1 << 22) /* inobt btree counter */
 #define XFS_FSOP_GEOM_FLAGS_NREXT64	(1 << 23) /* large extent counters */
+#define XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP	(1U << 31) /* atomic file extent swap */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index b3e8ab247b28..5b6f5939fda1 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -25,6 +25,7 @@
 #include "xfs_da_format.h"
 #include "xfs_health.h"
 #include "xfs_ag.h"
+#include "xfs_swapext.h"
 
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -1197,6 +1198,8 @@ xfs_fs_geometry(
 	}
 	if (xfs_has_large_extent_counts(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
+	if (xfs_swapext_supported(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
 
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
new file mode 100644
index 000000000000..316323339d76
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SWAPEXT_H_
+#define __XFS_SWAPEXT_H_ 1
+
+/*
+ * Decide if this filesystem supports using log items to swap file extents and
+ * restart the operation if the system fails before the operation completes.
+ *
+ * This can be done to individual file extents by using the block mapping log
+ * intent items introduced with reflink and rmap; or to entire file ranges
+ * using swapext log intent items to track the overall progress across multiple
+ * extent mappings.  Realtime is not supported yet.
+ */
+static inline bool xfs_swapext_supported(struct xfs_mount *mp)
+{
+	return (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) &&
+	       !xfs_has_realtime(mp);
+}
+
+#endif /* __XFS_SWAPEXT_H_ */

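A usage note for reviewers, not part of the patch: userspace can probe the new
geometry flag with the existing XFS_IOC_FSGEOMETRY ioctl.  A rough sketch,
assuming xfsprogs headers that already carry the XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP
definition added above:

/* Illustrative userspace sketch; not part of this patch. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* XFS_IOC_FSGEOMETRY, struct xfs_fsop_geom */

int
main(int argc, char *argv[])
{
	struct xfs_fsop_geom	geo = { 0 };
	int			fd;

	fd = open(argc > 1 ? argv[1] : ".", O_RDONLY);
	if (fd < 0)
		return 1;
	if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo)) {
		close(fd);
		return 1;
	}
	close(fd);

	printf("atomic extent swap: %s\n",
	       (geo.flags & XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP) ? "yes" : "no");
	return 0;
}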


* [PATCH 06/21] xfs: introduce a swap-extent log intent item
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 04/21] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 07/21] xfs: create deferred log items for extent swapping Darrick J. Wong
                     ` (14 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Introduce a new intent log item to handle swapping extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/libxfs/xfs_log_format.h  |   51 ++++++
 fs/xfs/libxfs/xfs_log_recover.h |    2 
 fs/xfs/xfs_log_recover.c        |    2 
 fs/xfs/xfs_super.c              |   19 ++
 fs/xfs/xfs_swapext_item.c       |  320 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_swapext_item.h       |   56 +++++++
 7 files changed, 448 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_swapext_item.c
 create mode 100644 fs/xfs/xfs_swapext_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index fc83759656c6..c5cb8cf6ffbb 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -109,6 +109,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_iunlink_item.o \
 				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
+				   xfs_swapext_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
 				   xfs_trans_buf.o
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 367f536d9881..b105a5ef6644 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -117,8 +117,9 @@ struct xfs_unmount_log_format {
 #define XLOG_REG_TYPE_ATTRD_FORMAT	28
 #define XLOG_REG_TYPE_ATTR_NAME	29
 #define XLOG_REG_TYPE_ATTR_VALUE	30
-#define XLOG_REG_TYPE_MAX		30
-
+#define XLOG_REG_TYPE_SXI_FORMAT	31
+#define XLOG_REG_TYPE_SXD_FORMAT	32
+#define XLOG_REG_TYPE_MAX		32
 
 /*
  * Flags to log operation header
@@ -243,6 +244,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_BUD		0x1245
 #define	XFS_LI_ATTRI		0x1246  /* attr set/remove intent*/
 #define	XFS_LI_ATTRD		0x1247  /* attr set/remove done */
+#define	XFS_LI_SXI		0x1248  /* extent swap intent */
+#define	XFS_LI_SXD		0x1249  /* extent swap done */
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -260,7 +263,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
 	{ XFS_LI_BUD,		"XFS_LI_BUD" }, \
 	{ XFS_LI_ATTRI,		"XFS_LI_ATTRI" }, \
-	{ XFS_LI_ATTRD,		"XFS_LI_ATTRD" }
+	{ XFS_LI_ATTRD,		"XFS_LI_ATTRD" }, \
+	{ XFS_LI_SXI,		"XFS_LI_SXI" }, \
+	{ XFS_LI_SXD,		"XFS_LI_SXD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -871,6 +876,46 @@ struct xfs_bud_log_format {
 	uint64_t		bud_bui_id;	/* id of corresponding bui */
 };
 
+/*
+ * SXI/SXD (extent swapping) log format definitions
+ */
+
+struct xfs_swap_extent {
+	uint64_t		sx_inode1;
+	uint64_t		sx_inode2;
+	uint64_t		sx_startoff1;
+	uint64_t		sx_startoff2;
+	uint64_t		sx_blockcount;
+	uint64_t		sx_flags;
+	int64_t			sx_isize1;
+	int64_t			sx_isize2;
+};
+
+#define XFS_SWAP_EXT_FLAGS		(0)
+
+#define XFS_SWAP_EXT_STRINGS
+
+/* This is the structure used to lay out an sxi log item in the log. */
+struct xfs_sxi_log_format {
+	uint16_t		sxi_type;	/* sxi log item type */
+	uint16_t		sxi_size;	/* size of this item */
+	uint32_t		__pad;		/* must be zero */
+	uint64_t		sxi_id;		/* sxi identifier */
+	struct xfs_swap_extent	sxi_extent;	/* extent to swap */
+};
+
+/*
+ * This is the structure used to lay out an sxd log item in the log.
+ * There is no variable-length payload; the sxd_sxi_id field identifies the
+ * corresponding sxi log item.
+ */
+struct xfs_sxd_log_format {
+	uint16_t		sxd_type;	/* sxd log item type */
+	uint16_t		sxd_size;	/* size of this item */
+	uint32_t		__pad;
+	uint64_t		sxd_sxi_id;	/* id of corresponding sxi */
+};
+
 /*
  * Dquot Log format definitions.
  *
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 2420865f3007..6162c93b5d38 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -74,6 +74,8 @@ extern const struct xlog_recover_item_ops xlog_cui_item_ops;
 extern const struct xlog_recover_item_ops xlog_cud_item_ops;
 extern const struct xlog_recover_item_ops xlog_attri_item_ops;
 extern const struct xlog_recover_item_ops xlog_attrd_item_ops;
+extern const struct xlog_recover_item_ops xlog_sxi_item_ops;
+extern const struct xlog_recover_item_ops xlog_sxd_item_ops;
 
 /*
  * Macros, structures, prototypes for internal log manager use.
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 81ce08c23306..006ceff1959d 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1796,6 +1796,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
 	&xlog_bud_item_ops,
 	&xlog_attri_item_ops,
 	&xlog_attrd_item_ops,
+	&xlog_sxi_item_ops,
+	&xlog_sxd_item_ops,
 };
 
 static const struct xlog_recover_item_ops *
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a16d4d1b35d0..4cf26611f46f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -42,6 +42,7 @@
 #include "xfs_xattr.h"
 #include "xfs_iunlink_item.h"
 #include "scrub/rcbag_btree.h"
+#include "xfs_swapext_item.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -2122,8 +2123,24 @@ xfs_init_caches(void)
 	if (!xfs_iunlink_cache)
 		goto out_destroy_attri_cache;
 
+	xfs_sxd_cache = kmem_cache_create("xfs_sxd_item",
+					 sizeof(struct xfs_sxd_log_item),
+					 0, 0, NULL);
+	if (!xfs_sxd_cache)
+		goto out_destroy_iul_cache;
+
+	xfs_sxi_cache = kmem_cache_create("xfs_sxi_item",
+					 sizeof(struct xfs_sxi_log_item),
+					 0, 0, NULL);
+	if (!xfs_sxi_cache)
+		goto out_destroy_sxd_cache;
+
 	return 0;
 
+ out_destroy_sxd_cache:
+	kmem_cache_destroy(xfs_sxd_cache);
+ out_destroy_iul_cache:
+	kmem_cache_destroy(xfs_iunlink_cache);
  out_destroy_attri_cache:
 	kmem_cache_destroy(xfs_attri_cache);
  out_destroy_attrd_cache:
@@ -2180,6 +2197,8 @@ xfs_destroy_caches(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_cache_destroy(xfs_sxd_cache);
+	kmem_cache_destroy(xfs_sxi_cache);
 	kmem_cache_destroy(xfs_iunlink_cache);
 	kmem_cache_destroy(xfs_attri_cache);
 	kmem_cache_destroy(xfs_attrd_cache);
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
new file mode 100644
index 000000000000..ea4a3a8de7e3
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.c
@@ -0,0 +1,320 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_shared.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_swapext_item.h"
+#include "xfs_log.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_log_priv.h"
+#include "xfs_log_recover.h"
+
+struct kmem_cache	*xfs_sxi_cache;
+struct kmem_cache	*xfs_sxd_cache;
+
+static const struct xfs_item_ops xfs_sxi_item_ops;
+
+static inline struct xfs_sxi_log_item *SXI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_sxi_log_item, sxi_item);
+}
+
+STATIC void
+xfs_sxi_item_free(
+	struct xfs_sxi_log_item	*sxi_lip)
+{
+	kmem_free(sxi_lip->sxi_item.li_lv_shadow);
+	kmem_cache_free(xfs_sxi_cache, sxi_lip);
+}
+
+/*
+ * Freeing the SXI requires that we remove it from the AIL if it has already
+ * been placed there. However, the SXI may not yet have been placed in the AIL
+ * when called by xfs_sxi_release() from SXD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the SXI.
+ */
+STATIC void
+xfs_sxi_release(
+	struct xfs_sxi_log_item	*sxi_lip)
+{
+	ASSERT(atomic_read(&sxi_lip->sxi_refcount) > 0);
+	if (atomic_dec_and_test(&sxi_lip->sxi_refcount)) {
+		xfs_trans_ail_delete(&sxi_lip->sxi_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_sxi_item_free(sxi_lip);
+	}
+}
+
+
+STATIC void
+xfs_sxi_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_sxi_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given sxi log
+ * item. We use only 1 iovec, and we point that at the sxi_log_format structure
+ * embedded in the sxi item.
+ */
+STATIC void
+xfs_sxi_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_sxi_log_item	*sxi_lip = SXI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	sxi_lip->sxi_format.sxi_type = XFS_LI_SXI;
+	sxi_lip->sxi_format.sxi_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXI_FORMAT,
+			&sxi_lip->sxi_format,
+			sizeof(struct xfs_sxi_log_format));
+}
+
+/*
+ * The unpin operation is the last place an SXI is manipulated in the log. It
+ * is either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the SXI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the SXI to either construct
+ * and commit the SXD or drop the SXD's reference in the event of error. Simply
+ * drop the log's SXI reference now that the log is done with it.
+ */
+STATIC void
+xfs_sxi_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_sxi_log_item	*sxi_lip = SXI_ITEM(lip);
+
+	xfs_sxi_release(sxi_lip);
+}
+
+/*
+ * The SXI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, an SXD isn't going to be
+ * constructed and thus we free the SXI here directly.
+ */
+STATIC void
+xfs_sxi_item_release(
+	struct xfs_log_item	*lip)
+{
+	xfs_sxi_release(SXI_ITEM(lip));
+}
+
+/* Allocate and initialize an sxi log item. */
+STATIC struct xfs_sxi_log_item *
+xfs_sxi_init(
+	struct xfs_mount	*mp)
+
+{
+	struct xfs_sxi_log_item	*sxi_lip;
+
+	sxi_lip = kmem_cache_zalloc(xfs_sxi_cache, GFP_KERNEL | __GFP_NOFAIL);
+
+	xfs_log_item_init(mp, &sxi_lip->sxi_item, XFS_LI_SXI, &xfs_sxi_item_ops);
+	sxi_lip->sxi_format.sxi_id = (uintptr_t)(void *)sxi_lip;
+	atomic_set(&sxi_lip->sxi_refcount, 2);
+
+	return sxi_lip;
+}
+
+static inline struct xfs_sxd_log_item *SXD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_sxd_log_item, sxd_item);
+}
+
+STATIC void
+xfs_sxd_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_sxd_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given sxd log
+ * item. We use only 1 iovec, and we point that at the sxd_log_format structure
+ * embedded in the sxd item.
+ */
+STATIC void
+xfs_sxd_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_sxd_log_item	*sxd_lip = SXD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	sxd_lip->sxd_format.sxd_type = XFS_LI_SXD;
+	sxd_lip->sxd_format.sxd_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXD_FORMAT, &sxd_lip->sxd_format,
+			sizeof(struct xfs_sxd_log_format));
+}
+
+/*
+ * The SXD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the SXI and free the
+ * SXD.
+ */
+STATIC void
+xfs_sxd_item_release(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_sxd_log_item	*sxd_lip = SXD_ITEM(lip);
+
+	kmem_free(sxd_lip->sxd_item.li_lv_shadow);
+	xfs_sxi_release(sxd_lip->sxd_intent_log_item);
+	kmem_cache_free(xfs_sxd_cache, sxd_lip);
+}
+
+static struct xfs_log_item *
+xfs_sxd_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &SXD_ITEM(lip)->sxd_intent_log_item->sxi_item;
+}
+
+static const struct xfs_item_ops xfs_sxd_item_ops = {
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
+	.iop_size	= xfs_sxd_item_size,
+	.iop_format	= xfs_sxd_item_format,
+	.iop_release	= xfs_sxd_item_release,
+	.iop_intent	= xfs_sxd_item_intent,
+};
+
+/* Process a swapext update intent item that was recovered from the log. */
+STATIC int
+xfs_sxi_item_recover(
+	struct xfs_log_item	*lip,
+	struct list_head	*capture_list)
+{
+	return -EFSCORRUPTED;
+}
+
+STATIC bool
+xfs_sxi_item_match(
+	struct xfs_log_item	*lip,
+	uint64_t		intent_id)
+{
+	return SXI_ITEM(lip)->sxi_format.sxi_id == intent_id;
+}
+
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_sxi_item_relog(
+	struct xfs_log_item	*intent,
+	struct xfs_trans	*tp)
+{
+	ASSERT(0);
+	return NULL;
+}
+
+static const struct xfs_item_ops xfs_sxi_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
+	.iop_size	= xfs_sxi_item_size,
+	.iop_format	= xfs_sxi_item_format,
+	.iop_unpin	= xfs_sxi_item_unpin,
+	.iop_release	= xfs_sxi_item_release,
+	.iop_recover	= xfs_sxi_item_recover,
+	.iop_match	= xfs_sxi_item_match,
+	.iop_relog	= xfs_sxi_item_relog,
+};
+
+/*
+ * This routine is called to create an in-core swapext update item from the
+ * sxi format structure which was logged on disk.  It allocates an in-core
+ * sxi, copies the format structure into it, and adds the sxi
+ * to the AIL with the given LSN.
+ */
+STATIC int
+xlog_recover_sxi_commit_pass2(
+	struct xlog			*log,
+	struct list_head		*buffer_list,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_sxi_log_item		*sxi_lip;
+	struct xfs_sxi_log_format	*sxi_formatp;
+	size_t				len;
+
+	sxi_formatp = item->ri_buf[0].i_addr;
+
+	if (sxi_formatp->__pad != 0) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	len = sizeof(struct xfs_sxi_log_format);
+	if (item->ri_buf[0].i_len != len) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	sxi_lip = xfs_sxi_init(mp);
+	memcpy(&sxi_lip->sxi_format, sxi_formatp, len);
+
+	xfs_trans_ail_insert(log->l_ailp, &sxi_lip->sxi_item, lsn);
+	xfs_sxi_release(sxi_lip);
+	return 0;
+}
+
+const struct xlog_recover_item_ops xlog_sxi_item_ops = {
+	.item_type		= XFS_LI_SXI,
+	.commit_pass2		= xlog_recover_sxi_commit_pass2,
+};
+
+/*
+ * This routine is called when an SXD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding SXI if it
+ * was still in the log. To do this it searches the AIL for the SXI with an id
+ * equal to that in the SXD format structure. If we find it we drop the SXD
+ * reference, which removes the SXI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_sxd_commit_pass2(
+	struct xlog			*log,
+	struct list_head		*buffer_list,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	struct xfs_sxd_log_format	*sxd_formatp;
+
+	sxd_formatp = item->ri_buf[0].i_addr;
+	if (item->ri_buf[0].i_len != sizeof(struct xfs_sxd_log_format)) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	xlog_recover_release_intent(log, XFS_LI_SXI, sxd_formatp->sxd_sxi_id);
+	return 0;
+}
+
+const struct xlog_recover_item_ops xlog_sxd_item_ops = {
+	.item_type		= XFS_LI_SXD,
+	.commit_pass2		= xlog_recover_sxd_commit_pass2,
+};
diff --git a/fs/xfs/xfs_swapext_item.h b/fs/xfs/xfs_swapext_item.h
new file mode 100644
index 000000000000..e3cb59692e50
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef	__XFS_SWAPEXT_ITEM_H__
+#define	__XFS_SWAPEXT_ITEM_H__
+
+/*
+ * The extent swapping intent item helps us perform atomic extent swaps between
+ * two inode forks.  It does this by tracking the range of logical offsets that
+ * still need to be swapped, and relogs as progress happens.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated bmbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * rest of the extent swaps.
+ */
+
+/* kernel only SXI/SXD definitions */
+
+struct xfs_mount;
+struct kmem_cache;
+
+/*
+ * This is the "swapext update intent" log item.  It is used to log the fact
+ * that we are swapping extents between two files.  It is used in conjunction
+ * with the "swapext update done" log item described below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_sxi_log_item {
+	struct xfs_log_item		sxi_item;
+	atomic_t			sxi_refcount;
+	struct xfs_sxi_log_format	sxi_format;
+};
+
+/*
+ * This is the "swapext update done" log item.  It is used to log the fact that
+ * the extent swapping mentioned in an earlier sxi item has been performed.
+ */
+struct xfs_sxd_log_item {
+	struct xfs_log_item		sxd_item;
+	struct xfs_sxi_log_item		*sxd_intent_log_item;
+	struct xfs_sxd_log_format	sxd_format;
+};
+
+extern struct kmem_cache	*xfs_sxi_cache;
+extern struct kmem_cache	*xfs_sxd_cache;
+
+#endif	/* __XFS_SWAPEXT_ITEM_H__ */

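A reviewer-side sanity check of the new on-disk sizes (illustrative only):
struct xfs_swap_extent is eight 64-bit fields, so 64 bytes; the SXI format adds
a 16-byte header for 80 bytes total, and the SXD format is 16 bytes.  In the
style of the compile-time checks in xfs_ondisk.h:

/* Illustrative compile-time size checks; not part of this patch. */
static inline void
xfs_check_swapext_ondisk_sizes(void)
{
	BUILD_BUG_ON(sizeof(struct xfs_swap_extent) != 64);
	BUILD_BUG_ON(sizeof(struct xfs_sxi_log_format) != 80);
	BUILD_BUG_ON(sizeof(struct xfs_sxd_log_format) != 16);
}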


* [PATCH 07/21] xfs: create deferred log items for extent swapping
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 06/21] xfs: introduce a swap-extent log intent item Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 08/21] xfs: enable xlog users to toggle atomic " Darrick J. Wong
                     ` (13 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Now that we've created the skeleton of a log intent item to track and
restart extent swap operations, add the upper level logic to commit
intent items and turn them into concrete work recorded in the log.  We
use the deferred item "multihop" feature that was introduced a few
patches ago to constrain the number of active swap operations to one per
thread.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                 |    2 
 fs/xfs/libxfs/xfs_bmap.h        |    4 
 fs/xfs/libxfs/xfs_defer.c       |    7 
 fs/xfs/libxfs/xfs_defer.h       |    3 
 fs/xfs/libxfs/xfs_format.h      |    6 
 fs/xfs/libxfs/xfs_log_format.h  |   28 +
 fs/xfs/libxfs/xfs_swapext.c     | 1021 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_swapext.h     |  142 +++++
 fs/xfs/libxfs/xfs_trans_space.h |    4 
 fs/xfs/xfs_swapext_item.c       |  357 +++++++++++++-
 fs/xfs/xfs_trace.c              |    1 
 fs/xfs/xfs_trace.h              |  215 ++++++++
 fs/xfs/xfs_xchgrange.c          |   65 ++
 fs/xfs/xfs_xchgrange.h          |   17 +
 14 files changed, 1855 insertions(+), 17 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_swapext.c
 create mode 100644 fs/xfs/xfs_xchgrange.c
 create mode 100644 fs/xfs/xfs_xchgrange.h


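Before diving into the diff, a hypothetical sketch of how a caller might drive
the new request/intent interface (the real callers arrive with xfs_xchgrange.c
later in this series); locking and error handling are abbreviated, and the
lock/join helper below is invented purely for illustration.

/*
 * Hypothetical caller sketch, not part of this patch.  Assumes both inodes
 * have already been flushed and their IOLOCKs taken.
 */
STATIC int
xfs_swapext_example(
	struct xfs_inode	*ip1,
	struct xfs_inode	*ip2,
	xfs_filblks_t		blockcount)
{
	struct xfs_swapext_req	req = {
		.ip1		= ip1,
		.ip2		= ip2,
		.startoff1	= 0,
		.startoff2	= 0,
		.blockcount	= blockcount,
		.whichfork	= XFS_DATA_FORK,
		.req_flags	= XFS_SWAP_REQ_SET_SIZES,
	};
	struct xfs_mount	*mp = ip1->i_mount;
	struct xfs_trans	*tp;
	int			error;

	/* Estimate the exchange count and the block reservation we need. */
	error = xfs_swapext_estimate(&req);
	if (error)
		return error;

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0, 0,
			&tp);
	if (error)
		return error;

	/* hypothetical helper: take the ILOCKs and join both inodes to tp */
	xfs_swapext_example_ilock(tp, ip1, ip2);

	/* Log the first SXI; defer ops will roll transactions as needed. */
	xfs_swapext(tp, &req);

	return xfs_trans_commit(tp);
}
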
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index c5cb8cf6ffbb..23b0c40620cf 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -46,6 +46,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_refcount.o \
 				   xfs_refcount_btree.o \
 				   xfs_sb.o \
+				   xfs_swapext.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_inode.o \
 				   xfs_trans_resv.o \
@@ -92,6 +93,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_sysfs.o \
 				   xfs_trans.o \
 				   xfs_xattr.o \
+				   xfs_xchgrange.o \
 				   kmem.o
 
 # low-level transaction/log code
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index cb09a43a2872..413ec27f2f24 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -144,7 +144,7 @@ static inline int xfs_bmapi_whichfork(uint32_t bmapi_flags)
 	{ BMAP_COWFORK,		"COW" }
 
 /* Return true if the extent is an allocated extent, written or not. */
-static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec)
 {
 	return irec->br_startblock != HOLESTARTBLOCK &&
 		irec->br_startblock != DELAYSTARTBLOCK &&
@@ -155,7 +155,7 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
  * Return true if the extent is a real, allocated extent, or false if it is  a
  * delayed allocation, and unwritten extent or a hole.
  */
-static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec)
 {
 	return xfs_bmap_is_real_extent(irec) &&
 	       irec->br_state != XFS_EXT_UNWRITTEN;
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index bcfb6a4203cd..1619b9b928db 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -26,6 +26,7 @@
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
+#include "xfs_swapext.h"
 
 static struct kmem_cache	*xfs_defer_pending_cache;
 
@@ -189,6 +190,7 @@ static const struct xfs_defer_op_type *defer_op_types[] = {
 	[XFS_DEFER_OPS_TYPE_FREE]	= &xfs_extent_free_defer_type,
 	[XFS_DEFER_OPS_TYPE_AGFL_FREE]	= &xfs_agfl_free_defer_type,
 	[XFS_DEFER_OPS_TYPE_ATTR]	= &xfs_attr_defer_type,
+	[XFS_DEFER_OPS_TYPE_SWAPEXT]	= &xfs_swapext_defer_type,
 };
 
 /*
@@ -913,6 +915,10 @@ xfs_defer_init_item_caches(void)
 	error = xfs_attr_intent_init_cache();
 	if (error)
 		goto err;
+	error = xfs_swapext_intent_init_cache();
+	if (error)
+		goto err;
+
 	return 0;
 err:
 	xfs_defer_destroy_item_caches();
@@ -923,6 +929,7 @@ xfs_defer_init_item_caches(void)
 void
 xfs_defer_destroy_item_caches(void)
 {
+	xfs_swapext_intent_destroy_cache();
 	xfs_attr_intent_destroy_cache();
 	xfs_extfree_intent_destroy_cache();
 	xfs_bmap_intent_destroy_cache();
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 114a3a4930a3..bcc48b0c75c9 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -20,6 +20,7 @@ enum xfs_defer_ops_type {
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_AGFL_FREE,
 	XFS_DEFER_OPS_TYPE_ATTR,
+	XFS_DEFER_OPS_TYPE_SWAPEXT,
 	XFS_DEFER_OPS_TYPE_MAX,
 };
 
@@ -65,7 +66,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 extern const struct xfs_defer_op_type xfs_attr_defer_type;
-
+extern const struct xfs_defer_op_type xfs_swapext_defer_type;
 
 /*
  * Deferred operation item relogging limits.
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 1424976ec955..bb8bff488017 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -425,6 +425,12 @@ static inline bool xfs_sb_version_haslogxattrs(struct xfs_sb *sbp)
 		 XFS_SB_FEAT_INCOMPAT_LOG_XATTRS);
 }
 
+static inline bool xfs_sb_version_haslogswapext(struct xfs_sb *sbp)
+{
+	return xfs_sb_is_v5(sbp) && (sbp->sb_features_log_incompat &
+		 XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT);
+}
+
 static inline bool
 xfs_is_quota_inode(struct xfs_sb *sbp, xfs_ino_t ino)
 {
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index b105a5ef6644..65a84fdefe56 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -891,9 +891,33 @@ struct xfs_swap_extent {
 	int64_t			sx_isize2;
 };
 
-#define XFS_SWAP_EXT_FLAGS		(0)
+/* Swap extents between extended attribute forks. */
+#define XFS_SWAP_EXT_ATTR_FORK		(1ULL << 0)
 
-#define XFS_SWAP_EXT_STRINGS
+/* Set the file sizes when finished. */
+#define XFS_SWAP_EXT_SET_SIZES		(1ULL << 1)
+
+/* Do not swap any part of the range where inode1's mapping is a hole. */
+#define XFS_SWAP_EXT_SKIP_INO1_HOLES	(1ULL << 2)
+
+/* Clear the reflink flag from inode1 after the operation. */
+#define XFS_SWAP_EXT_CLEAR_INO1_REFLINK	(1ULL << 3)
+
+/* Clear the reflink flag from inode2 after the operation. */
+#define XFS_SWAP_EXT_CLEAR_INO2_REFLINK	(1ULL << 4)
+
+#define XFS_SWAP_EXT_FLAGS		(XFS_SWAP_EXT_ATTR_FORK | \
+					 XFS_SWAP_EXT_SET_SIZES | \
+					 XFS_SWAP_EXT_SKIP_INO1_HOLES | \
+					 XFS_SWAP_EXT_CLEAR_INO1_REFLINK | \
+					 XFS_SWAP_EXT_CLEAR_INO2_REFLINK)
+
+#define XFS_SWAP_EXT_STRINGS \
+	{ XFS_SWAP_EXT_ATTR_FORK,		"ATTRFORK" }, \
+	{ XFS_SWAP_EXT_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_SWAP_EXT_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
+	{ XFS_SWAP_EXT_CLEAR_INO1_REFLINK,	"CLEAR_INO1_REFLINK" }, \
+	{ XFS_SWAP_EXT_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }
 
 /* This is the structure used to lay out an sxi log item in the log. */
 struct xfs_sxi_log_format {
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
new file mode 100644
index 000000000000..0bc758c5cf5c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -0,0 +1,1021 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_quota.h"
+#include "xfs_swapext.h"
+#include "xfs_trace.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
+#include "xfs_health.h"
+
+struct kmem_cache	*xfs_swapext_intent_cache;
+
+/* bmbt mappings adjacent to a pair of records. */
+struct xfs_swapext_adjacent {
+	struct xfs_bmbt_irec		left1;
+	struct xfs_bmbt_irec		right1;
+	struct xfs_bmbt_irec		left2;
+	struct xfs_bmbt_irec		right2;
+};
+
+#define ADJACENT_INIT { \
+	.left1  = { .br_startblock = HOLESTARTBLOCK }, \
+	.right1 = { .br_startblock = HOLESTARTBLOCK }, \
+	.left2  = { .br_startblock = HOLESTARTBLOCK }, \
+	.right2 = { .br_startblock = HOLESTARTBLOCK }, \
+}
+
+/* Information to help us reset reflink flag / CoW fork state after a swap. */
+
+/* Previous state of the two inodes' reflink flags. */
+#define XFS_REFLINK_STATE_IP1		(1U << 0)
+#define XFS_REFLINK_STATE_IP2		(1U << 1)
+
+/*
+ * If the reflink flag is set on either inode, make sure it has an incore CoW
+ * fork, since all reflink inodes must have them.  If there's a CoW fork and it
+ * has extents in it, make sure the inodes are tagged appropriately so that
+ * speculative preallocations can be GC'd if we run low on space.
+ */
+static inline void
+xfs_swapext_ensure_cowfork(
+	struct xfs_inode	*ip)
+{
+	struct xfs_ifork	*cfork;
+
+	if (xfs_is_reflink_inode(ip))
+		xfs_ifork_init_cow(ip);
+
+	cfork = xfs_ifork_ptr(ip, XFS_COW_FORK);
+	if (!cfork)
+		return;
+	if (cfork->if_bytes > 0)
+		xfs_inode_set_cowblocks_tag(ip);
+	else
+		xfs_inode_clear_cowblocks_tag(ip);
+}
+
+/* Schedule an atomic extent swap. */
+void
+xfs_swapext_schedule(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	trace_xfs_swapext_defer(tp->t_mountp, sxi);
+	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_SWAPEXT, &sxi->sxi_list);
+}
+
+/*
+ * Adjust the on-disk inode size upwards if needed so that we never map extents
+ * into the file past EOF.  This is crucial so that log recovery won't get
+ * confused by the sudden appearance of post-eof extents.
+ */
+STATIC void
+xfs_swapext_update_size(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fsize_t		new_isize)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_fsize_t		len;
+
+	if (new_isize < 0)
+		return;
+
+	len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount),
+		  new_isize);
+
+	if (len <= ip->i_disk_size)
+		return;
+
+	trace_xfs_swapext_update_inode_size(ip, len);
+
+	ip->i_disk_size = len;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+static inline bool
+sxi_has_more_swap_work(const struct xfs_swapext_intent *sxi)
+{
+	return sxi->sxi_blockcount > 0;
+}
+
+static inline bool
+sxi_has_postop_work(const struct xfs_swapext_intent *sxi)
+{
+	return sxi->sxi_flags & (XFS_SWAP_EXT_CLEAR_INO1_REFLINK |
+				 XFS_SWAP_EXT_CLEAR_INO2_REFLINK);
+}
+
+static inline void
+sxi_advance(
+	struct xfs_swapext_intent	*sxi,
+	const struct xfs_bmbt_irec	*irec)
+{
+	sxi->sxi_startoff1 += irec->br_blockcount;
+	sxi->sxi_startoff2 += irec->br_blockcount;
+	sxi->sxi_blockcount -= irec->br_blockcount;
+}
+
+/* Check all extents to make sure we can actually swap them. */
+int
+xfs_swapext_check_extents(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_ifork		*ifp1, *ifp2;
+
+	/* No fork? */
+	ifp1 = xfs_ifork_ptr(req->ip1, req->whichfork);
+	ifp2 = xfs_ifork_ptr(req->ip2, req->whichfork);
+	if (!ifp1 || !ifp2)
+		return -EINVAL;
+
+	/* We don't know how to swap local format forks. */
+	if (ifp1->if_format == XFS_DINODE_FMT_LOCAL ||
+	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
+		return -EINVAL;
+
+	/* We don't support realtime data forks yet. */
+	if (!XFS_IS_REALTIME_INODE(req->ip1))
+		return 0;
+	if (req->whichfork == XFS_ATTR_FORK)
+		return 0;
+	return -EINVAL;
+}
+
+#ifdef CONFIG_XFS_QUOTA
+/* Log the actual updates to the quota accounting. */
+static inline void
+xfs_swapext_update_quota(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2)
+{
+	int64_t				ip1_delta = 0, ip2_delta = 0;
+	unsigned int			qflag;
+
+	qflag = XFS_IS_REALTIME_INODE(sxi->sxi_ip1) ? XFS_TRANS_DQ_RTBCOUNT :
+						      XFS_TRANS_DQ_BCOUNT;
+
+	if (xfs_bmap_is_real_extent(irec1)) {
+		ip1_delta -= irec1->br_blockcount;
+		ip2_delta += irec1->br_blockcount;
+	}
+
+	if (xfs_bmap_is_real_extent(irec2)) {
+		ip1_delta += irec2->br_blockcount;
+		ip2_delta -= irec2->br_blockcount;
+	}
+
+	xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip1, qflag, ip1_delta);
+	xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip2, qflag, ip2_delta);
+}
+#else
+# define xfs_swapext_update_quota(tp, sxi, irec1, irec2)	((void)0)
+#endif
+
+/*
+ * Walk forward through the file ranges in @sxi until we find two different
+ * mappings to exchange.  If there is work to do, return the mappings;
+ * otherwise we've reached the end of the range and sxi_blockcount will be
+ * zero.
+ *
+ * If the walk skips over a pair of mappings to the same storage, save them as
+ * the left records in @adj (if provided) so that the simulation phase can
+ * avoid an extra lookup.
+ */
+static int
+xfs_swapext_find_mappings(
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2,
+	struct xfs_swapext_adjacent	*adj)
+{
+	int				nimaps;
+	int				bmap_flags;
+	int				error;
+
+	bmap_flags = xfs_bmapi_aflag(xfs_swapext_whichfork(sxi));
+
+	for (; sxi_has_more_swap_work(sxi); sxi_advance(sxi, irec1)) {
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->sxi_ip1, sxi->sxi_startoff1,
+				sxi->sxi_blockcount, irec1, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec1->br_startblock == DELAYSTARTBLOCK ||
+		    irec1->br_startoff != sxi->sxi_startoff1) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * If the caller told us to ignore sparse areas of file1, jump
+		 * ahead to the next region.
+		 */
+		if ((sxi->sxi_flags & XFS_SWAP_EXT_SKIP_INO1_HOLES) &&
+		    irec1->br_startblock == HOLESTARTBLOCK) {
+			trace_xfs_swapext_extent1(sxi->sxi_ip1, irec1);
+			continue;
+		}
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->sxi_ip2, sxi->sxi_startoff2,
+				irec1->br_blockcount, irec2, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec2->br_startblock == DELAYSTARTBLOCK ||
+		    irec2->br_startoff != sxi->sxi_startoff2) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1->br_blockcount = min(irec1->br_blockcount,
+					   irec2->br_blockcount);
+
+		trace_xfs_swapext_extent1(sxi->sxi_ip1, irec1);
+		trace_xfs_swapext_extent2(sxi->sxi_ip2, irec2);
+
+		/* We found something to swap, so return it. */
+		if (irec1->br_startblock != irec2->br_startblock)
+			return 0;
+
+		/*
+		 * Two extents mapped to the same physical block must not have
+		 * different states; that's filesystem corruption.  Move on to
+		 * the next extent if they're both holes or both the same
+		 * physical extent.
+		 */
+		if (irec1->br_state != irec2->br_state) {
+			xfs_bmap_mark_sick(sxi->sxi_ip1,
+					xfs_swapext_whichfork(sxi));
+			xfs_bmap_mark_sick(sxi->sxi_ip2,
+					xfs_swapext_whichfork(sxi));
+			return -EFSCORRUPTED;
+		}
+
+		/*
+		 * Save the mappings if we're estimating work and skipping
+		 * these identical mappings.
+		 */
+		if (adj) {
+			memcpy(&adj->left1, irec1, sizeof(*irec1));
+			memcpy(&adj->left2, irec2, sizeof(*irec2));
+		}
+	}
+
+	return 0;
+}
+
+/* Exchange these two mappings. */
+static void
+xfs_swapext_exchange_mappings(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2)
+{
+	int				whichfork = xfs_swapext_whichfork(sxi);
+
+	xfs_swapext_update_quota(tp, sxi, irec1, irec2);
+
+	/* Remove both mappings. */
+	xfs_bmap_unmap_extent(tp, sxi->sxi_ip1, whichfork, irec1);
+	xfs_bmap_unmap_extent(tp, sxi->sxi_ip2, whichfork, irec2);
+
+	/*
+	 * Re-add both mappings.  We swap the file offsets between the two maps
+	 * and add the opposite map, which has the effect of filling the
+	 * logical offsets we just unmapped, but with the physical mapping
+	 * information swapped.
+	 */
+	swap(irec1->br_startoff, irec2->br_startoff);
+	xfs_bmap_map_extent(tp, sxi->sxi_ip1, whichfork, irec2);
+	xfs_bmap_map_extent(tp, sxi->sxi_ip2, whichfork, irec1);
+
+	/* Make sure we're not mapping extents past EOF. */
+	if (whichfork == XFS_DATA_FORK) {
+		xfs_swapext_update_size(tp, sxi->sxi_ip1, irec2,
+				sxi->sxi_isize1);
+		xfs_swapext_update_size(tp, sxi->sxi_ip2, irec1,
+				sxi->sxi_isize2);
+	}
+
+	/*
+	 * Advance our cursor and exit.   The caller (either defer ops or log
+	 * recovery) will log the SXD item, and if sxi_blockcount is nonzero, it
+	 * will log a new SXI item for the remainder and call us back.
+	 */
+	sxi_advance(sxi, irec1);
+}
+
+static inline void
+xfs_swapext_clear_reflink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	trace_xfs_reflink_unset_inode_flag(ip);
+
+	ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Finish whatever work might come after a swap operation. */
+static int
+xfs_swapext_do_postop_work(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO1_REFLINK) {
+		xfs_swapext_clear_reflink(tp, sxi->sxi_ip1);
+		sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
+	}
+
+	if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO2_REFLINK) {
+		xfs_swapext_clear_reflink(tp, sxi->sxi_ip2);
+		sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO2_REFLINK;
+	}
+
+	return 0;
+}
+
+/* Finish one extent swap, possibly log more. */
+int
+xfs_swapext_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	int				error;
+
+	if (sxi_has_more_swap_work(sxi)) {
+		/*
+		 * If the operation state says that some range of the files
+		 * have not yet been swapped, look for extents in that range to
+		 * swap.  If we find some extents, swap them.
+		 */
+		error = xfs_swapext_find_mappings(sxi, &irec1, &irec2, NULL);
+		if (error)
+			return error;
+
+		if (sxi_has_more_swap_work(sxi))
+			xfs_swapext_exchange_mappings(tp, sxi, &irec1, &irec2);
+
+		/*
+		 * If the caller asked us to exchange the file sizes after the
+		 * swap and either we just swapped the last extents in the
+		 * range or we didn't find anything to swap, update the ondisk
+		 * file sizes.
+		 */
+		if ((sxi->sxi_flags & XFS_SWAP_EXT_SET_SIZES) &&
+		    !sxi_has_more_swap_work(sxi)) {
+			sxi->sxi_ip1->i_disk_size = sxi->sxi_isize1;
+			sxi->sxi_ip2->i_disk_size = sxi->sxi_isize2;
+
+			xfs_trans_log_inode(tp, sxi->sxi_ip1, XFS_ILOG_CORE);
+			xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
+		}
+	} else if (sxi_has_postop_work(sxi)) {
+		/*
+		 * Now that we're finished with the swap operation, complete
+		 * the post-op cleanup work.
+		 */
+		error = xfs_swapext_do_postop_work(tp, sxi);
+		if (error)
+			return error;
+	}
+
+	/* If we still have work to do, ask for a new transaction. */
+	if (sxi_has_more_swap_work(sxi) || sxi_has_postop_work(sxi)) {
+		trace_xfs_swapext_defer(tp->t_mountp, sxi);
+		return -EAGAIN;
+	}
+
+	/*
+	 * If we reach here, we've finished all the swapping work and the post
+	 * operation work.  The last thing we need to do before returning to
+	 * the caller is to make sure that COW forks are set up correctly.
+	 */
+	if (!(sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)) {
+		xfs_swapext_ensure_cowfork(sxi->sxi_ip1);
+		xfs_swapext_ensure_cowfork(sxi->sxi_ip2);
+	}
+
+	return 0;
+}
+
+/*
+ * Compute the amount of bmbt blocks we should reserve for each file.  In the
+ * worst case, each exchange will fill a hole with a new mapping, which could
+ * result in a btree split every time we add a new leaf block.
+ */
+static inline uint64_t
+xfs_swapext_bmbt_blocks(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	return howmany_64(req->nr_exchanges,
+					XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)) *
+			XFS_EXTENTADD_SPACE_RES(mp, req->whichfork);
+}
+
+static inline uint64_t
+xfs_swapext_rmapbt_blocks(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	if (!xfs_has_rmapbt(mp))
+		return 0;
+	if (XFS_IS_REALTIME_INODE(req->ip1))
+		return 0;
+
+	return howmany_64(req->nr_exchanges,
+					XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) *
+			XFS_RMAPADD_SPACE_RES(mp);
+}
+
+/* Estimate the bmbt and rmapbt overhead required to exchange extents. */
+static int
+xfs_swapext_estimate_overhead(
+	struct xfs_swapext_req	*req)
+{
+	struct xfs_mount	*mp = req->ip1->i_mount;
+	xfs_filblks_t		bmbt_blocks;
+	xfs_filblks_t		rmapbt_blocks;
+	xfs_filblks_t		resblks = req->resblks;
+
+	/*
+	 * Compute the number of bmbt and rmapbt blocks we might need to handle
+	 * the estimated number of exchanges.
+	 */
+	bmbt_blocks = xfs_swapext_bmbt_blocks(mp, req);
+	rmapbt_blocks = xfs_swapext_rmapbt_blocks(mp, req);
+
+	trace_xfs_swapext_overhead(mp, bmbt_blocks, rmapbt_blocks);
+
+	/* Make sure the change in file block count doesn't overflow. */
+	if (check_add_overflow(req->ip1_bcount, bmbt_blocks, &req->ip1_bcount))
+		return -EFBIG;
+	if (check_add_overflow(req->ip2_bcount, bmbt_blocks, &req->ip2_bcount))
+		return -EFBIG;
+
+	/*
+	 * Add together the number of blocks we need to handle btree growth,
+	 * then add it to the number of blocks we need to reserve to this
+	 * transaction.
+	 */
+	if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
+		return -ENOSPC;
+
+	/* Can't actually reserve more than UINT_MAX blocks. */
+	if (req->resblks > UINT_MAX)
+		return -ENOSPC;
+
+	req->resblks = resblks;
+	trace_xfs_swapext_final_estimate(req);
+	return 0;
+}
+
+/* Decide if we can merge two real extents. */
+static inline bool
+can_merge(
+	const struct xfs_bmbt_irec	*b1,
+	const struct xfs_bmbt_irec	*b2)
+{
+	/* Don't merge holes. */
+	if (b1->br_startblock == HOLESTARTBLOCK ||
+	    b2->br_startblock == HOLESTARTBLOCK)
+		return false;
+
+	/* Only merge real, allocated extents. */
+	if (!xfs_bmap_is_real_extent(b1) || !xfs_bmap_is_real_extent(b2))
+		return false;
+
+	if (b1->br_startoff   + b1->br_blockcount == b2->br_startoff &&
+	    b1->br_startblock + b1->br_blockcount == b2->br_startblock &&
+	    b1->br_state			  == b2->br_state &&
+	    b1->br_blockcount + b2->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+		return true;
+
+	return false;
+}
+
+#define CLEFT_CONTIG	0x01
+#define CRIGHT_CONTIG	0x02
+#define CHOLE		0x04
+#define CBOTH_CONTIG	(CLEFT_CONTIG | CRIGHT_CONTIG)
+
+#define NLEFT_CONTIG	0x10
+#define NRIGHT_CONTIG	0x20
+#define NHOLE		0x40
+#define NBOTH_CONTIG	(NLEFT_CONTIG | NRIGHT_CONTIG)
+
+/* Estimate the effect of a single swap on extent count. */
+static inline int
+delta_nextents_step(
+	struct xfs_mount		*mp,
+	const struct xfs_bmbt_irec	*left,
+	const struct xfs_bmbt_irec	*curr,
+	const struct xfs_bmbt_irec	*new,
+	const struct xfs_bmbt_irec	*right)
+{
+	bool				lhole, rhole, chole, nhole;
+	unsigned int			state = 0;
+	int				ret = 0;
+
+	lhole = left->br_startblock == HOLESTARTBLOCK;
+	rhole = right->br_startblock == HOLESTARTBLOCK;
+	chole = curr->br_startblock == HOLESTARTBLOCK;
+	nhole = new->br_startblock == HOLESTARTBLOCK;
+
+	if (chole)
+		state |= CHOLE;
+	if (!lhole && !chole && can_merge(left, curr))
+		state |= CLEFT_CONTIG;
+	if (!rhole && !chole && can_merge(curr, right))
+		state |= CRIGHT_CONTIG;
+	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
+	    left->br_blockcount + curr->br_blockcount +
+					right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+		state &= ~CRIGHT_CONTIG;
+
+	if (nhole)
+		state |= NHOLE;
+	if (!lhole && !nhole && can_merge(left, new))
+		state |= NLEFT_CONTIG;
+	if (!rhole && !nhole && can_merge(new, right))
+		state |= NRIGHT_CONTIG;
+	if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
+	    left->br_blockcount + new->br_blockcount +
+					right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+		state &= ~NRIGHT_CONTIG;
+
+	switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) {
+	case CLEFT_CONTIG | CRIGHT_CONTIG:
+		/*
+		 * left/curr/right are the same extent, so deleting curr causes
+		 * 2 new extents to be created.
+		 */
+		ret += 2;
+		break;
+	case 0:
+		/*
+		 * curr is not contiguous with any extent, so we remove curr
+		 * completely
+		 */
+		ret--;
+		break;
+	case CHOLE:
+		/* hole, do nothing */
+		break;
+	case CLEFT_CONTIG:
+	case CRIGHT_CONTIG:
+		/* trim either left or right, no change */
+		break;
+	}
+
+	switch (state & (NLEFT_CONTIG | NRIGHT_CONTIG | NHOLE)) {
+	case NLEFT_CONTIG | NRIGHT_CONTIG:
+		/*
+		 * left/curr/right will become the same extent, so adding
+		 * curr causes the deletion of right.
+		 */
+		ret--;
+		break;
+	case 0:
+		/* new is not contiguous with any extent */
+		ret++;
+		break;
+	case NHOLE:
+		/* hole, do nothing. */
+		break;
+	case NLEFT_CONTIG:
+	case NRIGHT_CONTIG:
+		/* new is absorbed into left or right, no change */
+		break;
+	}
+
+	trace_xfs_swapext_delta_nextents_step(mp, left, curr, new, right, ret,
+			state);
+	return ret;
+}
+
+/* Make sure we don't overflow the extent counters. */
+static inline int
+ensure_delta_nextents(
+	struct xfs_swapext_req	*req,
+	struct xfs_inode	*ip,
+	int64_t			delta)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, req->whichfork);
+	xfs_extnum_t		max_extents;
+	bool			large_extcount;
+
+	if (delta < 0)
+		return 0;
+
+	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REDUCE_MAX_IEXTENTS)) {
+		if (ifp->if_nextents + delta > 10)
+			return -EFBIG;
+	}
+
+	if (req->req_flags & XFS_SWAP_REQ_NREXT64)
+		large_extcount = true;
+	else
+		large_extcount = xfs_inode_has_large_extent_counts(ip);
+
+	max_extents = xfs_iext_max_nextents(large_extcount, req->whichfork);
+	if (ifp->if_nextents + delta <= max_extents)
+		return 0;
+	if (large_extcount)
+		return -EFBIG;
+	if (!xfs_has_large_extent_counts(mp))
+		return -EFBIG;
+
+	max_extents = xfs_iext_max_nextents(true, req->whichfork);
+	if (ifp->if_nextents + delta > max_extents)
+		return -EFBIG;
+
+	req->req_flags |= XFS_SWAP_REQ_NREXT64;
+	return 0;
+}
+
+/* Find the next extent after irec. */
+static inline int
+get_next_ext(
+	struct xfs_inode		*ip,
+	int				bmap_flags,
+	const struct xfs_bmbt_irec	*irec,
+	struct xfs_bmbt_irec		*nrec)
+{
+	xfs_fileoff_t			off;
+	xfs_filblks_t			blockcount;
+	int				nimaps = 1;
+	int				error;
+
+	off = irec->br_startoff + irec->br_blockcount;
+	blockcount = XFS_MAX_FILEOFF - off;
+	error = xfs_bmapi_read(ip, off, blockcount, nrec, &nimaps, bmap_flags);
+	if (error)
+		return error;
+	if (nrec->br_startblock == DELAYSTARTBLOCK ||
+	    nrec->br_startoff != off) {
+		/*
+		 * If we don't get the extent we want, return a zero-length
+		 * mapping, which our estimator function will pretend is a hole.
+		 * We shouldn't get delalloc reservations.
+		 */
+		nrec->br_startblock = HOLESTARTBLOCK;
+	}
+
+	return 0;
+}
+
+int __init
+xfs_swapext_intent_init_cache(void)
+{
+	xfs_swapext_intent_cache = kmem_cache_create("xfs_swapext_intent",
+			sizeof(struct xfs_swapext_intent),
+			0, 0, NULL);
+
+	return xfs_swapext_intent_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_swapext_intent_destroy_cache(void)
+{
+	kmem_cache_destroy(xfs_swapext_intent_cache);
+	xfs_swapext_intent_cache = NULL;
+}
+
+/*
+ * Decide if we will swap the reflink flags between the two files after the
+ * swap.  The only time we want to do this is if we're exchanging all extents
+ * under EOF and the inode reflink flags have different states.
+ */
+static inline bool
+sxi_can_exchange_reflink_flags(
+	const struct xfs_swapext_req	*req,
+	unsigned int			reflink_state)
+{
+	struct xfs_mount		*mp = req->ip1->i_mount;
+
+	if (hweight32(reflink_state) != 1)
+		return false;
+	if (req->startoff1 != 0 || req->startoff2 != 0)
+		return false;
+	if (req->blockcount != XFS_B_TO_FSB(mp, req->ip1->i_disk_size))
+		return false;
+	if (req->blockcount != XFS_B_TO_FSB(mp, req->ip2->i_disk_size))
+		return false;
+	return true;
+}
+
+
+/* Allocate and initialize a new incore intent item from a request. */
+struct xfs_swapext_intent *
+xfs_swapext_init_intent(
+	const struct xfs_swapext_req	*req,
+	unsigned int			*reflink_state)
+{
+	struct xfs_swapext_intent	*sxi;
+	unsigned int			rs = 0;
+
+	sxi = kmem_cache_zalloc(xfs_swapext_intent_cache,
+			GFP_NOFS | __GFP_NOFAIL);
+	INIT_LIST_HEAD(&sxi->sxi_list);
+	sxi->sxi_ip1 = req->ip1;
+	sxi->sxi_ip2 = req->ip2;
+	sxi->sxi_startoff1 = req->startoff1;
+	sxi->sxi_startoff2 = req->startoff2;
+	sxi->sxi_blockcount = req->blockcount;
+	sxi->sxi_isize1 = sxi->sxi_isize2 = -1;
+
+	if (req->whichfork == XFS_ATTR_FORK)
+		sxi->sxi_flags |= XFS_SWAP_EXT_ATTR_FORK;
+
+	if (req->whichfork == XFS_DATA_FORK &&
+	    (req->req_flags & XFS_SWAP_REQ_SET_SIZES)) {
+		sxi->sxi_flags |= XFS_SWAP_EXT_SET_SIZES;
+		sxi->sxi_isize1 = req->ip2->i_disk_size;
+		sxi->sxi_isize2 = req->ip1->i_disk_size;
+	}
+
+	if (req->req_flags & XFS_SWAP_REQ_SKIP_INO1_HOLES)
+		sxi->sxi_flags |= XFS_SWAP_EXT_SKIP_INO1_HOLES;
+
+	if (req->req_flags & XFS_SWAP_REQ_LOGGED)
+		sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_LOGGED;
+	if (req->req_flags & XFS_SWAP_REQ_NREXT64)
+		sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_NREXT64;
+
+	if (req->whichfork == XFS_DATA_FORK) {
+		/*
+		 * Record the state of each inode's reflink flag before the
+		 * operation.
+		 */
+		if (xfs_is_reflink_inode(req->ip1))
+			rs |= XFS_REFLINK_STATE_IP1;
+		if (xfs_is_reflink_inode(req->ip2))
+			rs |= XFS_REFLINK_STATE_IP2;
+
+		/*
+		 * Figure out if we're clearing the reflink flags (which
+		 * effectively swaps them) after the operation.
+		 */
+		if (sxi_can_exchange_reflink_flags(req, rs)) {
+			if (rs & XFS_REFLINK_STATE_IP1)
+				sxi->sxi_flags |=
+						XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
+			if (rs & XFS_REFLINK_STATE_IP2)
+				sxi->sxi_flags |=
+						XFS_SWAP_EXT_CLEAR_INO2_REFLINK;
+		}
+	}
+
+	if (reflink_state)
+		*reflink_state = rs;
+	return sxi;
+}
+
+/*
+ * Estimate the number of exchange operations and the number of file blocks
+ * in each file that will be affected by the exchange operation.
+ */
+int
+xfs_swapext_estimate(
+	struct xfs_swapext_req		*req)
+{
+	struct xfs_swapext_intent	*sxi;
+	struct xfs_bmbt_irec		irec1, irec2;
+	struct xfs_swapext_adjacent	adj = ADJACENT_INIT;
+	xfs_filblks_t			ip1_blocks = 0, ip2_blocks = 0;
+	int64_t				d_nexts1, d_nexts2;
+	int				bmap_flags;
+	int				error;
+
+	ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
+
+	bmap_flags = xfs_bmapi_aflag(req->whichfork);
+	sxi = xfs_swapext_init_intent(req, NULL);
+
+	/*
+	 * To guard against the possibility of overflowing the extent counters,
+	 * we have to estimate an upper bound on the potential increase in that
+	 * counter.  We can split the extent at each end of the range, and for
+	 * each step of the swap we can split the extent that we're working on
+	 * if the extents do not align.
+	 */
+	d_nexts1 = d_nexts2 = 3;
+
+	while (sxi_has_more_swap_work(sxi)) {
+		/*
+		 * Walk through the file ranges until we find something to
+		 * swap.  Because we're simulating the swap, pass in adj to
+		 * capture skipped mappings for correct estimation of bmbt
+		 * record merges.
+		 */
+		error = xfs_swapext_find_mappings(sxi, &irec1, &irec2, &adj);
+		if (error)
+			goto out_free;
+		if (!sxi_has_more_swap_work(sxi))
+			break;
+
+		/* Update accounting. */
+		if (xfs_bmap_is_real_extent(&irec1))
+			ip1_blocks += irec1.br_blockcount;
+		if (xfs_bmap_is_real_extent(&irec2))
+			ip2_blocks += irec2.br_blockcount;
+		req->nr_exchanges++;
+
+		/* Read the next extents from both files. */
+		error = get_next_ext(req->ip1, bmap_flags, &irec1, &adj.right1);
+		if (error)
+			goto out_free;
+
+		error = get_next_ext(req->ip2, bmap_flags, &irec2, &adj.right2);
+		if (error)
+			goto out_free;
+
+		/* Update extent count deltas. */
+		d_nexts1 += delta_nextents_step(req->ip1->i_mount,
+				&adj.left1, &irec1, &irec2, &adj.right1);
+
+		d_nexts2 += delta_nextents_step(req->ip1->i_mount,
+				&adj.left2, &irec2, &irec1, &adj.right2);
+
+		/* Now pretend we swapped the extents. */
+		if (can_merge(&adj.left2, &irec1))
+			adj.left2.br_blockcount += irec1.br_blockcount;
+		else
+			memcpy(&adj.left2, &irec1, sizeof(irec1));
+
+		if (can_merge(&adj.left1, &irec2))
+			adj.left1.br_blockcount += irec2.br_blockcount;
+		else
+			memcpy(&adj.left1, &irec2, sizeof(irec2));
+
+		sxi_advance(sxi, &irec1);
+	}
+
+	/* Account for the blocks that are being exchanged. */
+	if (XFS_IS_REALTIME_INODE(req->ip1) &&
+	    req->whichfork == XFS_DATA_FORK) {
+		req->ip1_rtbcount = ip1_blocks;
+		req->ip2_rtbcount = ip2_blocks;
+	} else {
+		req->ip1_bcount = ip1_blocks;
+		req->ip2_bcount = ip2_blocks;
+	}
+
+	/*
+	 * Make sure that both forks have enough slack left in their extent
+	 * counters that the swap operation will not overflow.
+	 */
+	trace_xfs_swapext_delta_nextents(req, d_nexts1, d_nexts2);
+	if (req->ip1 == req->ip2) {
+		error = ensure_delta_nextents(req, req->ip1,
+				d_nexts1 + d_nexts2);
+	} else {
+		error = ensure_delta_nextents(req, req->ip1, d_nexts1);
+		if (error)
+			goto out_free;
+		error = ensure_delta_nextents(req, req->ip2, d_nexts2);
+	}
+	if (error)
+		goto out_free;
+
+	trace_xfs_swapext_initial_estimate(req);
+	error = xfs_swapext_estimate_overhead(req);
+out_free:
+	kmem_cache_free(xfs_swapext_intent_cache, sxi);
+	return error;
+}
+
+static inline void
+xfs_swapext_set_reflink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	trace_xfs_reflink_set_inode_flag(ip);
+
+	ip->i_diflags2 |= XFS_DIFLAG2_REFLINK;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/*
+ * If either file has shared blocks and we're swapping data forks, we must flag
+ * the other file as having shared blocks so that we get the shared-block rmap
+ * functions if we need to fix up the rmaps.
+ */
+void
+xfs_swapext_ensure_reflink(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_intent	*sxi,
+	unsigned int			reflink_state)
+{
+	if ((reflink_state & XFS_REFLINK_STATE_IP1) &&
+	    !xfs_is_reflink_inode(sxi->sxi_ip2))
+		xfs_swapext_set_reflink(tp, sxi->sxi_ip2);
+
+	if ((reflink_state & XFS_REFLINK_STATE_IP2) &&
+	    !xfs_is_reflink_inode(sxi->sxi_ip1))
+		xfs_swapext_set_reflink(tp, sxi->sxi_ip1);
+}
+
+/* Widen the extent counts of both inodes if necessary. */
+static inline void
+xfs_swapext_upgrade_extent_counts(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_intent	*sxi)
+{
+	if (!(sxi->sxi_op_flags & XFS_SWAP_EXT_OP_NREXT64))
+		return;
+
+	sxi->sxi_ip1->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	xfs_trans_log_inode(tp, sxi->sxi_ip1, XFS_ILOG_CORE);
+
+	sxi->sxi_ip2->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
+}
+
+/*
+ * Schedule a swap of a range of extents from one inode to another.  If the
+ * atomic swap feature is enabled, the operation can be resumed even if
+ * the system goes down.  The caller must commit the transaction to start the
+ * work.
+ *
+ * The caller must ensure that both inodes are joined to the transaction and
+ * ILOCKed; they will still be joined to the transaction at exit.
+ */
+void
+xfs_swapext(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_swapext_intent	*sxi;
+	unsigned int			reflink_state;
+
+	ASSERT(xfs_isilocked(req->ip1, XFS_ILOCK_EXCL));
+	ASSERT(xfs_isilocked(req->ip2, XFS_ILOCK_EXCL));
+	ASSERT(req->whichfork != XFS_COW_FORK);
+	ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
+	if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
+		ASSERT(req->whichfork == XFS_DATA_FORK);
+
+	if (req->blockcount == 0)
+		return;
+
+	sxi = xfs_swapext_init_intent(req, &reflink_state);
+	xfs_swapext_schedule(tp, sxi);
+	xfs_swapext_ensure_reflink(tp, sxi, reflink_state);
+	xfs_swapext_upgrade_extent_counts(tp, sxi);
+}
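
For reference, a minimal sketch of how a caller is expected to drive
xfs_swapext(), using only helpers introduced elsewhere in this series and
assuming a filled-out struct xfs_swapext_req named req; error unwinding is
trimmed:

	struct xfs_mount	*mp = req.ip1->i_mount;
	struct xfs_trans	*tp;
	int			error;

	/* Caller holds IOLOCK/MMAPLOCK; the estimate takes the ILOCKs. */
	error = xfs_xchg_range_estimate(&req);
	if (error)
		return error;

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
			XFS_TRANS_RES_FDBLKS, &tp);
	if (error)
		return error;

	/* ILOCK and join both inodes, then schedule the swap. */
	xfs_xchg_range_ilock(tp, req.ip1, req.ip2);
	xfs_swapext(tp, &req);

	/* Committing the transaction logs the intent and starts the work. */
	error = xfs_trans_commit(tp);
	xfs_xchg_range_iunlock(req.ip1, req.ip2);
	return error;
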
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index 316323339d76..1987897ddc25 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -21,4 +21,146 @@ static inline bool xfs_swapext_supported(struct xfs_mount *mp)
 	       !xfs_has_realtime(mp);
 }
 
+/*
+ * In-core information about an extent swap request between ranges of two
+ * inodes.
+ */
+struct xfs_swapext_intent {
+	/* List of other incore deferred work. */
+	struct list_head	sxi_list;
+
+	/* Inodes participating in the operation. */
+	struct xfs_inode	*sxi_ip1;
+	struct xfs_inode	*sxi_ip2;
+
+	/* File offset range information. */
+	xfs_fileoff_t		sxi_startoff1;
+	xfs_fileoff_t		sxi_startoff2;
+	xfs_filblks_t		sxi_blockcount;
+
+	/* Set these file sizes after the operation, unless negative. */
+	xfs_fsize_t		sxi_isize1;
+	xfs_fsize_t		sxi_isize2;
+
+	/* XFS_SWAP_EXT_* log operation flags */
+	unsigned int		sxi_flags;
+
+	/* XFS_SWAP_EXT_OP_* flags */
+	unsigned int		sxi_op_flags;
+};
+
+/* Use log intent items to track and restart the entire operation. */
+#define XFS_SWAP_EXT_OP_LOGGED	(1U << 0)
+
+/* Upgrade files to have large extent counts before proceeding. */
+#define XFS_SWAP_EXT_OP_NREXT64	(1U << 1)
+
+#define XFS_SWAP_EXT_OP_STRINGS \
+	{ XFS_SWAP_EXT_OP_LOGGED,		"LOGGED" }, \
+	{ XFS_SWAP_EXT_OP_NREXT64,		"NREXT64" }
+
+static inline int
+xfs_swapext_whichfork(const struct xfs_swapext_intent *sxi)
+{
+	if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
+		return XFS_ATTR_FORK;
+	return XFS_DATA_FORK;
+}
+
+/* Parameters for a swapext request. */
+struct xfs_swapext_req {
+	/* Inodes participating in the operation. */
+	struct xfs_inode	*ip1;
+	struct xfs_inode	*ip2;
+
+	/* File offset range information. */
+	xfs_fileoff_t		startoff1;
+	xfs_fileoff_t		startoff2;
+	xfs_filblks_t		blockcount;
+
+	/* Data or attr fork? */
+	int			whichfork;
+
+	/* XFS_SWAP_REQ_* operation flags */
+	unsigned int		req_flags;
+
+	/*
+	 * Fields below this line are filled out by xfs_swapext_estimate;
+	 * callers should initialize this part of the struct to zero.
+	 */
+
+	/*
+	 * Data device blocks to be moved out of ip1, and free space needed to
+	 * handle the bmbt changes.
+	 */
+	xfs_filblks_t		ip1_bcount;
+
+	/*
+	 * Data device blocks to be moved out of ip2, and free space needed to
+	 * handle the bmbt changes.
+	 */
+	xfs_filblks_t		ip2_bcount;
+
+	/* rt blocks to be moved out of ip1. */
+	xfs_filblks_t		ip1_rtbcount;
+
+	/* rt blocks to be moved out of ip2. */
+	xfs_filblks_t		ip2_rtbcount;
+
+	/* Free space needed to handle the bmbt changes */
+	unsigned long long	resblks;
+
+	/* Number of extent swaps needed to complete the operation */
+	unsigned long long	nr_exchanges;
+};
+
+/* Caller has permission to use log intent items for the swapext operation. */
+#define XFS_SWAP_REQ_LOGGED		(1U << 0)
+
+/* Set the file sizes when finished. */
+#define XFS_SWAP_REQ_SET_SIZES		(1U << 1)
+
+/* Do not swap any part of the range where ip1's mapping is a hole. */
+#define XFS_SWAP_REQ_SKIP_INO1_HOLES	(1U << 2)
+
+/* Files need to be upgraded to have large extent counts. */
+#define XFS_SWAP_REQ_NREXT64		(1U << 3)
+
+#define XFS_SWAP_REQ_FLAGS		(XFS_SWAP_REQ_LOGGED | \
+					 XFS_SWAP_REQ_SET_SIZES | \
+					 XFS_SWAP_REQ_SKIP_INO1_HOLES | \
+					 XFS_SWAP_REQ_NREXT64)
+
+#define XFS_SWAP_REQ_STRINGS \
+	{ XFS_SWAP_REQ_LOGGED,			"LOGGED" }, \
+	{ XFS_SWAP_REQ_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_SWAP_REQ_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
+	{ XFS_SWAP_REQ_NREXT64,			"NREXT64" }
+
+unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
+void xfs_swapext_reflink_finish(struct xfs_trans *tp,
+		const struct xfs_swapext_req *req, unsigned int reflink_state);
+
+int xfs_swapext_estimate(struct xfs_swapext_req *req);
+
+extern struct kmem_cache	*xfs_swapext_intent_cache;
+
+int __init xfs_swapext_intent_init_cache(void);
+void xfs_swapext_intent_destroy_cache(void);
+
+struct xfs_swapext_intent *xfs_swapext_init_intent(
+		const struct xfs_swapext_req *req, unsigned int *reflink_state);
+void xfs_swapext_ensure_reflink(struct xfs_trans *tp,
+		const struct xfs_swapext_intent *sxi, unsigned int reflink_state);
+
+void xfs_swapext_schedule(struct xfs_trans *tp,
+		struct xfs_swapext_intent *sxi);
+int xfs_swapext_finish_one(struct xfs_trans *tp,
+		struct xfs_swapext_intent *sxi);
+
+int xfs_swapext_check_extents(struct xfs_mount *mp,
+		const struct xfs_swapext_req *req);
+
+void xfs_swapext(struct xfs_trans *tp, const struct xfs_swapext_req *req);
+
 #endif /* __XFS_SWAPEXT_H_ */
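
As a worked example of the request structure above, a hedged sketch of which
fields the caller fills in and which ones xfs_swapext_estimate() computes;
ip1, ip2, mp, pos1, pos2 and len are assumed locals:

	struct xfs_swapext_req	req = {
		.ip1		= ip1,
		.ip2		= ip2,
		.whichfork	= XFS_DATA_FORK,
		.startoff1	= XFS_B_TO_FSBT(mp, pos1),
		.startoff2	= XFS_B_TO_FSBT(mp, pos2),
		.blockcount	= XFS_B_TO_FSB(mp, len),
		.req_flags	= XFS_SWAP_REQ_LOGGED,
		/* fields below the comment in the struct start at zero */
	};
	int			error;

	xfs_xchg_range_ilock(NULL, ip1, ip2);
	error = xfs_swapext_estimate(&req);
	xfs_xchg_range_iunlock(ip1, ip2);
	if (error)
		return error;

	/*
	 * On success, ip1_bcount/ip2_bcount (or the rtbcount fields for
	 * realtime files), resblks, and nr_exchanges are now valid.
	 */
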
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 87b31c69a773..9640fc232c14 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -10,6 +10,10 @@
  * Components of space reservations.
  */
 
+/* Worst case number of bmaps that can be held in a block. */
+#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)    \
+		(((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0]))
+
 /* Worst case number of rmaps that can be held in a block. */
 #define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)    \
 		(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
index ea4a3a8de7e3..24331a702497 100644
--- a/fs/xfs/xfs_swapext_item.c
+++ b/fs/xfs/xfs_swapext_item.c
@@ -16,13 +16,17 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_swapext_item.h"
+#include "xfs_swapext.h"
 #include "xfs_log.h"
 #include "xfs_bmap.h"
 #include "xfs_icache.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_trans_space.h"
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
 #include "xfs_log_recover.h"
+#include "xfs_xchgrange.h"
+#include "xfs_trace.h"
 
 struct kmem_cache	*xfs_sxi_cache;
 struct kmem_cache	*xfs_sxd_cache;
@@ -206,13 +210,333 @@ static const struct xfs_item_ops xfs_sxd_item_ops = {
 	.iop_intent	= xfs_sxd_item_intent,
 };
 
+static struct xfs_sxd_log_item *
+xfs_trans_get_sxd(
+	struct xfs_trans		*tp,
+	struct xfs_sxi_log_item		*sxi_lip)
+{
+	struct xfs_sxd_log_item		*sxd_lip;
+
+	sxd_lip = kmem_cache_zalloc(xfs_sxd_cache, GFP_KERNEL | __GFP_NOFAIL);
+	xfs_log_item_init(tp->t_mountp, &sxd_lip->sxd_item, XFS_LI_SXD,
+			  &xfs_sxd_item_ops);
+	sxd_lip->sxd_intent_log_item = sxi_lip;
+	sxd_lip->sxd_format.sxd_sxi_id = sxi_lip->sxi_format.sxi_id;
+
+	xfs_trans_add_item(tp, &sxd_lip->sxd_item);
+	return sxd_lip;
+}
+
+/*
+ * Finish a swapext update and log it to the SXD. Note that the transaction is
+ * marked dirty regardless of whether the swapext update succeeds or fails to
+ * support the SXI/SXD lifecycle rules.
+ */
+static int
+xfs_swapext_finish_update(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*done,
+	struct xfs_swapext_intent	*sxi)
+{
+	int				error;
+
+	error = xfs_swapext_finish_one(tp, sxi);
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the SXI and frees the SXD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	if (done)
+		set_bit(XFS_LI_DIRTY, &done->li_flags);
+
+	return error;
+}
+
+/* Log swapext updates in the intent item. */
+STATIC struct xfs_log_item *
+xfs_swapext_create_intent(
+	struct xfs_trans		*tp,
+	struct list_head		*items,
+	unsigned int			count,
+	bool				sort)
+{
+	struct xfs_sxi_log_item		*sxi_lip;
+	struct xfs_swapext_intent	*sxi;
+	struct xfs_swap_extent		*sx;
+
+	ASSERT(count == 1);
+
+	sxi = list_first_entry_or_null(items, struct xfs_swapext_intent,
+			sxi_list);
+
+	/*
+	 * We use the same defer ops control machinery to perform extent swaps
+	 * even if we aren't using the machinery to track the operation status
+	 * through log items.
+	 */
+	if (!(sxi->sxi_op_flags & XFS_SWAP_EXT_OP_LOGGED))
+		return NULL;
+
+	sxi_lip = xfs_sxi_init(tp->t_mountp);
+	xfs_trans_add_item(tp, &sxi_lip->sxi_item);
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &sxi_lip->sxi_item.li_flags);
+
+	sx = &sxi_lip->sxi_format.sxi_extent;
+	sx->sx_inode1 = sxi->sxi_ip1->i_ino;
+	sx->sx_inode2 = sxi->sxi_ip2->i_ino;
+	sx->sx_startoff1 = sxi->sxi_startoff1;
+	sx->sx_startoff2 = sxi->sxi_startoff2;
+	sx->sx_blockcount = sxi->sxi_blockcount;
+	sx->sx_isize1 = sxi->sxi_isize1;
+	sx->sx_isize2 = sxi->sxi_isize2;
+	sx->sx_flags = sxi->sxi_flags;
+
+	return &sxi_lip->sxi_item;
+}
+
+STATIC struct xfs_log_item *
+xfs_swapext_create_done(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*intent,
+	unsigned int			count)
+{
+	if (intent == NULL)
+		return NULL;
+	return &xfs_trans_get_sxd(tp, SXI_ITEM(intent))->sxd_item;
+}
+
+/* Process a deferred swapext update. */
+STATIC int
+xfs_swapext_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*done,
+	struct list_head		*item,
+	struct xfs_btree_cur		**state)
+{
+	struct xfs_swapext_intent	*sxi;
+	int				error;
+
+	sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+
+	/*
+	 * Swap one more extent between the two files.  If there's still more
+	 * work to do, we want to requeue ourselves after all other pending
+	 * deferred operations have finished.  This includes all of the dfops
+	 * that we queued directly as well as any new ones created in the
+	 * process of finishing the others.  Doing so prevents us from queuing
+	 * a large number of SXI log items in kernel memory, which in turn
+	 * prevents us from pinning the tail of the log (while logging those
+	 * new SXI items) until the first SXI items can be processed.
+	 */
+	error = xfs_swapext_finish_update(tp, done, sxi);
+	if (error == -EAGAIN)
+		return error;
+
+	kmem_cache_free(xfs_swapext_intent_cache, sxi);
+	return error;
+}
+
+/* Abort all pending SXIs. */
+STATIC void
+xfs_swapext_abort_intent(
+	struct xfs_log_item		*intent)
+{
+	xfs_sxi_release(SXI_ITEM(intent));
+}
+
+/* Cancel a deferred swapext update. */
+STATIC void
+xfs_swapext_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_swapext_intent	*sxi;
+
+	sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+	kmem_cache_free(xfs_swapext_intent_cache, sxi);
+}
+
+const struct xfs_defer_op_type xfs_swapext_defer_type = {
+	.max_items	= 1,
+	.create_intent	= xfs_swapext_create_intent,
+	.abort_intent	= xfs_swapext_abort_intent,
+	.create_done	= xfs_swapext_create_done,
+	.finish_item	= xfs_swapext_finish_item,
+	.cancel_item	= xfs_swapext_cancel_item,
+};
+
+/* Is this recovered SXI ok? */
+static inline bool
+xfs_sxi_validate(
+	struct xfs_mount		*mp,
+	struct xfs_sxi_log_item		*sxi_lip)
+{
+	struct xfs_swap_extent		*sx = &sxi_lip->sxi_format.sxi_extent;
+
+	if (!xfs_sb_version_haslogswapext(&mp->m_sb))
+		return false;
+
+	if (sxi_lip->sxi_format.__pad != 0)
+		return false;
+
+	if (sx->sx_flags & ~XFS_SWAP_EXT_FLAGS)
+		return false;
+
+	if (!xfs_verify_ino(mp, sx->sx_inode1) ||
+	    !xfs_verify_ino(mp, sx->sx_inode2))
+		return false;
+
+	if ((sx->sx_flags & XFS_SWAP_EXT_SET_SIZES) &&
+	     (sx->sx_isize1 < 0 || sx->sx_isize2 < 0))
+		return false;
+
+	if (!xfs_verify_fileext(mp, sx->sx_startoff1, sx->sx_blockcount))
+		return false;
+
+	return xfs_verify_fileext(mp, sx->sx_startoff2, sx->sx_blockcount);
+}
+
+/*
+ * Use the recovered log state to create a new request, estimate resource
+ * requirements, and create a new incore intent state.
+ */
+STATIC struct xfs_swapext_intent *
+xfs_sxi_item_recover_intent(
+	struct xfs_mount		*mp,
+	const struct xfs_swap_extent	*sx,
+	struct xfs_swapext_req		*req,
+	unsigned int			*reflink_state)
+{
+	struct xfs_inode		*ip1, *ip2;
+	int				error;
+
+	/*
+	 * Grab both inodes and set IRECOVERY to prevent trimming of post-eof
+	 * extents and freeing of unlinked inodes until we're totally done
+	 * processing files.
+	 */
+	error = xlog_recover_iget(mp, sx->sx_inode1, &ip1);
+	if (error)
+		return ERR_PTR(error);
+	error = xlog_recover_iget(mp, sx->sx_inode2, &ip2);
+	if (error)
+		goto err_rele1;
+
+	req->ip1 = ip1;
+	req->ip2 = ip2;
+	req->startoff1 = sx->sx_startoff1;
+	req->startoff2 = sx->sx_startoff2;
+	req->blockcount = sx->sx_blockcount;
+
+	if (sx->sx_flags & XFS_SWAP_EXT_ATTR_FORK)
+		req->whichfork = XFS_ATTR_FORK;
+	else
+		req->whichfork = XFS_DATA_FORK;
+
+	if (sx->sx_flags & XFS_SWAP_EXT_SET_SIZES)
+		req->req_flags |= XFS_SWAP_REQ_SET_SIZES;
+	if (sx->sx_flags & XFS_SWAP_EXT_SKIP_INO1_HOLES)
+		req->req_flags |= XFS_SWAP_REQ_SKIP_INO1_HOLES;
+	req->req_flags |= XFS_SWAP_REQ_LOGGED;
+
+	xfs_xchg_range_ilock(NULL, ip1, ip2);
+	error = xfs_swapext_estimate(req);
+	xfs_xchg_range_iunlock(ip1, ip2);
+	if (error)
+		goto err_rele2;
+
+	return xfs_swapext_init_intent(req, reflink_state);
+
+err_rele2:
+	xfs_irele(ip2);
+err_rele1:
+	xfs_irele(ip1);
+	return ERR_PTR(error);
+}
+
 /* Process a swapext update intent item that was recovered from the log. */
 STATIC int
 xfs_sxi_item_recover(
-	struct xfs_log_item	*lip,
-	struct list_head	*capture_list)
+	struct xfs_log_item		*lip,
+	struct list_head		*capture_list)
 {
-	return -EFSCORRUPTED;
+	struct xfs_swapext_req		req = { .req_flags = 0 };
+	struct xfs_swapext_intent	*sxi;
+	struct xfs_sxi_log_item		*sxi_lip = SXI_ITEM(lip);
+	struct xfs_mount		*mp = lip->li_log->l_mp;
+	struct xfs_swap_extent		*sx = &sxi_lip->sxi_format.sxi_extent;
+	struct xfs_sxd_log_item		*sxd_lip = NULL;
+	struct xfs_trans		*tp;
+	struct xfs_inode		*ip1, *ip2;
+	unsigned int			reflink_state;
+	int				error = 0;
+
+	if (!xfs_sxi_validate(mp, sxi_lip)) {
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+				&sxi_lip->sxi_format,
+				sizeof(sxi_lip->sxi_format));
+		return -EFSCORRUPTED;
+	}
+
+	sxi = xfs_sxi_item_recover_intent(mp, sx, &req, &reflink_state);
+	if (IS_ERR(sxi))
+		return PTR_ERR(sxi);
+
+	trace_xfs_swapext_recover(mp, sxi);
+
+	ip1 = sxi->sxi_ip1;
+	ip2 = sxi->sxi_ip2;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0, 0,
+			&tp);
+	if (error)
+		goto err_rele;
+
+	sxd_lip = xfs_trans_get_sxd(tp, sxi_lip);
+
+	xfs_xchg_range_ilock(tp, ip1, ip2);
+
+	xfs_swapext_ensure_reflink(tp, sxi, reflink_state);
+	error = xfs_swapext_finish_update(tp, &sxd_lip->sxd_item, sxi);
+	if (error == -EAGAIN) {
+		/*
+		 * If there's more extent swapping to be done, we have to
+		 * schedule that as a separate deferred operation to be run
+		 * after we've finished replaying all of the intents we
+		 * recovered from the log.  Transfer ownership of the sxi to
+		 * the transaction.
+		 */
+		xfs_swapext_schedule(tp, sxi);
+		error = 0;
+		sxi = NULL;
+	}
+	if (error == -EFSCORRUPTED)
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, sx,
+				sizeof(*sx));
+	if (error)
+		goto err_cancel;
+
+	/*
+	 * Commit the transaction, which frees it and saves the inodes
+	 * for later replay activities.
+	 */
+	error = xfs_defer_ops_capture_and_commit(tp, capture_list);
+	goto err_unlock;
+
+err_cancel:
+	xfs_trans_cancel(tp);
+err_unlock:
+	xfs_xchg_range_iunlock(ip1, ip2);
+err_rele:
+	if (sxi)
+		kmem_cache_free(xfs_swapext_intent_cache, sxi);
+	xfs_irele(ip2);
+	xfs_irele(ip1);
+	return error;
 }
 
 STATIC bool
@@ -229,8 +553,21 @@ xfs_sxi_item_relog(
 	struct xfs_log_item	*intent,
 	struct xfs_trans	*tp)
 {
-	ASSERT(0);
-	return NULL;
+	struct xfs_sxd_log_item		*sxd_lip;
+	struct xfs_sxi_log_item		*sxi_lip;
+	struct xfs_swap_extent		*sx;
+
+	sx = &SXI_ITEM(intent)->sxi_format.sxi_extent;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	sxd_lip = xfs_trans_get_sxd(tp, SXI_ITEM(intent));
+	set_bit(XFS_LI_DIRTY, &sxd_lip->sxd_item.li_flags);
+
+	sxi_lip = xfs_sxi_init(tp->t_mountp);
+	memcpy(&sxi_lip->sxi_format.sxi_extent, sx, sizeof(*sx));
+	xfs_trans_add_item(tp, &sxi_lip->sxi_item);
+	set_bit(XFS_LI_DIRTY, &sxi_lip->sxi_item.li_flags);
+	return &sxi_lip->sxi_item;
 }
 
 static const struct xfs_item_ops xfs_sxi_item_ops = {
@@ -264,17 +601,17 @@ xlog_recover_sxi_commit_pass2(
 
 	sxi_formatp = item->ri_buf[0].i_addr;
 
-	if (sxi_formatp->__pad != 0) {
-		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
-		return -EFSCORRUPTED;
-	}
-
 	len = sizeof(struct xfs_sxi_log_format);
 	if (item->ri_buf[0].i_len != len) {
 		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
 		return -EFSCORRUPTED;
 	}
 
+	if (sxi_formatp->__pad != 0) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
 	sxi_lip = xfs_sxi_init(mp);
 	memcpy(&sxi_lip->sxi_format, sxi_formatp, len);
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index c9a5d8087b63..b43b973f0e10 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -40,6 +40,7 @@
 #include "scrub/xfbtree.h"
 #include "xfs_btree_mem.h"
 #include "xfs_bmap.h"
+#include "xfs_swapext.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 15bd6b86b514..9ebaa5ffe504 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -78,6 +78,8 @@ union xfs_btree_ptr;
 struct xfs_dqtrx;
 struct xfs_icwalk;
 struct xfs_bmap_intent;
+struct xfs_swapext_intent;
+struct xfs_swapext_req;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -2173,7 +2175,7 @@ TRACE_EVENT(xfs_dir2_leafn_moveents,
 		  __entry->count)
 );
 
-#define XFS_SWAPEXT_INODES \
+#define XFS_SWAP_EXT_INODES \
 	{ 0,	"target" }, \
 	{ 1,	"temp" }
 
@@ -2208,7 +2210,7 @@ DECLARE_EVENT_CLASS(xfs_swap_extent_class,
 		  "broot size %d, forkoff 0x%x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
-		  __print_symbolic(__entry->which, XFS_SWAPEXT_INODES),
+		  __print_symbolic(__entry->which, XFS_SWAP_EXT_INODES),
 		  __print_symbolic(__entry->format, XFS_INODE_FORMAT_STR),
 		  __entry->nex,
 		  __entry->broot_size,
@@ -3761,6 +3763,9 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
 DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
+DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
 
 /* fsmap traces */
 DECLARE_EVENT_CLASS(xfs_fsmap_class,
@@ -4581,6 +4586,212 @@ DEFINE_PERAG_INTENTS_EVENT(xfs_perag_wait_intents);
 
 #endif /* CONFIG_XFS_DRAIN_INTENTS */
 
+TRACE_EVENT(xfs_swapext_overhead,
+	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
+		 unsigned long long rmapbt_blocks),
+	TP_ARGS(mp, bmbt_blocks, rmapbt_blocks),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, bmbt_blocks)
+		__field(unsigned long long, rmapbt_blocks)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->bmbt_blocks = bmbt_blocks;
+		__entry->rmapbt_blocks = rmapbt_blocks;
+	),
+	TP_printk("dev %d:%d bmbt_blocks 0x%llx rmapbt_blocks 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->bmbt_blocks,
+		  __entry->rmapbt_blocks)
+);
+
+DECLARE_EVENT_CLASS(xfs_swapext_estimate_class,
+	TP_PROTO(const struct xfs_swapext_req *req),
+	TP_ARGS(req),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(int, whichfork)
+		__field(unsigned int, req_flags)
+		__field(xfs_filblks_t, ip1_bcount)
+		__field(xfs_filblks_t, ip2_bcount)
+		__field(xfs_filblks_t, ip1_rtbcount)
+		__field(xfs_filblks_t, ip2_rtbcount)
+		__field(unsigned long long, resblks)
+		__field(unsigned long long, nr_exchanges)
+	),
+	TP_fast_assign(
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->startoff1 = req->startoff1;
+		__entry->startoff2 = req->startoff2;
+		__entry->blockcount = req->blockcount;
+		__entry->whichfork = req->whichfork;
+		__entry->req_flags = req->req_flags;
+		__entry->ip1_bcount = req->ip1_bcount;
+		__entry->ip2_bcount = req->ip2_bcount;
+		__entry->ip1_rtbcount = req->ip1_rtbcount;
+		__entry->ip2_rtbcount = req->ip2_rtbcount;
+		__entry->resblks = req->resblks;
+		__entry->nr_exchanges = req->nr_exchanges;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) fork %s bcount1 0x%llx rtbcount1 0x%llx bcount2 0x%llx rtbcount2 0x%llx resblks 0x%llx nr_exchanges %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags(__entry->req_flags, "|", XFS_SWAP_REQ_STRINGS),
+		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
+		  __entry->ip1_bcount,
+		  __entry->ip1_rtbcount,
+		  __entry->ip2_bcount,
+		  __entry->ip2_rtbcount,
+		  __entry->resblks,
+		  __entry->nr_exchanges)
+);
+
+#define DEFINE_SWAPEXT_ESTIMATE_EVENT(name)	\
+DEFINE_EVENT(xfs_swapext_estimate_class, name,	\
+	TP_PROTO(const struct xfs_swapext_req *req), \
+	TP_ARGS(req))
+DEFINE_SWAPEXT_ESTIMATE_EVENT(xfs_swapext_initial_estimate);
+DEFINE_SWAPEXT_ESTIMATE_EVENT(xfs_swapext_final_estimate);
+
+DECLARE_EVENT_CLASS(xfs_swapext_intent_class,
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_swapext_intent *sxi),
+	TP_ARGS(mp, sxi),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(unsigned int, flags)
+		__field(unsigned int, opflags)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(xfs_fsize_t, isize1)
+		__field(xfs_fsize_t, isize2)
+		__field(xfs_fsize_t, new_isize1)
+		__field(xfs_fsize_t, new_isize2)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino1 = sxi->sxi_ip1->i_ino;
+		__entry->ino2 = sxi->sxi_ip2->i_ino;
+		__entry->flags = sxi->sxi_flags;
+		__entry->opflags = sxi->sxi_op_flags;
+		__entry->startoff1 = sxi->sxi_startoff1;
+		__entry->startoff2 = sxi->sxi_startoff2;
+		__entry->blockcount = sxi->sxi_blockcount;
+		__entry->isize1 = sxi->sxi_ip1->i_disk_size;
+		__entry->isize2 = sxi->sxi_ip2->i_disk_size;
+		__entry->new_isize1 = sxi->sxi_isize1;
+		__entry->new_isize2 = sxi->sxi_isize2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) opflags (%s) isize1 0x%llx newisize1 0x%llx isize2 0x%llx newisize2 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags(__entry->flags, "|", XFS_SWAP_EXT_STRINGS),
+		  __print_flags(__entry->opflags, "|", XFS_SWAP_EXT_OP_STRINGS),
+		  __entry->isize1, __entry->new_isize1,
+		  __entry->isize2, __entry->new_isize2)
+);
+
+#define DEFINE_SWAPEXT_INTENT_EVENT(name)	\
+DEFINE_EVENT(xfs_swapext_intent_class, name,	\
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_swapext_intent *sxi), \
+	TP_ARGS(mp, sxi))
+DEFINE_SWAPEXT_INTENT_EVENT(xfs_swapext_defer);
+DEFINE_SWAPEXT_INTENT_EVENT(xfs_swapext_recover);
+
+TRACE_EVENT(xfs_swapext_delta_nextents_step,
+	TP_PROTO(struct xfs_mount *mp,
+		 const struct xfs_bmbt_irec *left,
+		 const struct xfs_bmbt_irec *curr,
+		 const struct xfs_bmbt_irec *new,
+		 const struct xfs_bmbt_irec *right,
+		 int delta, unsigned int state),
+	TP_ARGS(mp, left, curr, new, right, delta, state),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_fileoff_t, loff)
+		__field(xfs_fsblock_t, lstart)
+		__field(xfs_filblks_t, lcount)
+		__field(xfs_fileoff_t, coff)
+		__field(xfs_fsblock_t, cstart)
+		__field(xfs_filblks_t, ccount)
+		__field(xfs_fileoff_t, noff)
+		__field(xfs_fsblock_t, nstart)
+		__field(xfs_filblks_t, ncount)
+		__field(xfs_fileoff_t, roff)
+		__field(xfs_fsblock_t, rstart)
+		__field(xfs_filblks_t, rcount)
+		__field(int, delta)
+		__field(unsigned int, state)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->loff = left->br_startoff;
+		__entry->lstart = left->br_startblock;
+		__entry->lcount = left->br_blockcount;
+		__entry->coff = curr->br_startoff;
+		__entry->cstart = curr->br_startblock;
+		__entry->ccount = curr->br_blockcount;
+		__entry->noff = new->br_startoff;
+		__entry->nstart = new->br_startblock;
+		__entry->ncount = new->br_blockcount;
+		__entry->roff = right->br_startoff;
+		__entry->rstart = right->br_startblock;
+		__entry->rcount = right->br_blockcount;
+		__entry->delta = delta;
+		__entry->state = state;
+	),
+	TP_printk("dev %d:%d left 0x%llx:0x%llx:0x%llx; curr 0x%llx:0x%llx:0x%llx <- new 0x%llx:0x%llx:0x%llx; right 0x%llx:0x%llx:0x%llx delta %d state 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		__entry->loff, __entry->lstart, __entry->lcount,
+		__entry->coff, __entry->cstart, __entry->ccount,
+		__entry->noff, __entry->nstart, __entry->ncount,
+		__entry->roff, __entry->rstart, __entry->rcount,
+		__entry->delta, __entry->state)
+);
+
+TRACE_EVENT(xfs_swapext_delta_nextents,
+	TP_PROTO(const struct xfs_swapext_req *req, int64_t d_nexts1,
+		 int64_t d_nexts2),
+	TP_ARGS(req, d_nexts1, d_nexts2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_extnum_t, nexts1)
+		__field(xfs_extnum_t, nexts2)
+		__field(int64_t, d_nexts1)
+		__field(int64_t, d_nexts2)
+	),
+	TP_fast_assign(
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->nexts1 = xfs_ifork_ptr(req->ip1, req->whichfork)->if_nextents;
+		__entry->nexts2 = xfs_ifork_ptr(req->ip2, req->whichfork)->if_nextents;
+		__entry->d_nexts1 = d_nexts1;
+		__entry->d_nexts2 = d_nexts2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx nexts %llu ino2 0x%llx nexts %llu delta1 %lld delta2 %lld",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->nexts1,
+		  __entry->ino2, __entry->nexts2,
+		  __entry->d_nexts1, __entry->d_nexts2)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
new file mode 100644
index 000000000000..0dba5078c9f7
--- /dev/null
+++ b/fs/xfs/xfs_xchgrange.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
+
+/* Lock (and optionally join) two inodes for a file range exchange. */
+void
+xfs_xchg_range_ilock(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip1 != ip2)
+		xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL,
+				    ip2, XFS_ILOCK_EXCL);
+	else
+		xfs_ilock(ip1, XFS_ILOCK_EXCL);
+	if (tp) {
+		xfs_trans_ijoin(tp, ip1, 0);
+		if (ip2 != ip1)
+			xfs_trans_ijoin(tp, ip2, 0);
+	}
+
+}
+
+/* Unlock two inodes after a file range exchange operation. */
+void
+xfs_xchg_range_iunlock(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip2 != ip1)
+		xfs_iunlock(ip2, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+}
+
+/*
+ * Estimate the resource requirements to exchange file contents between the two
+ * files.  The caller is required to hold the IOLOCK and the MMAPLOCK and to
+ * have flushed both inodes' pagecache and active direct-ios.
+ */
+int
+xfs_xchg_range_estimate(
+	struct xfs_swapext_req	*req)
+{
+	int			error;
+
+	xfs_xchg_range_ilock(NULL, req->ip1, req->ip2);
+	error = xfs_swapext_estimate(req);
+	xfs_xchg_range_iunlock(req->ip1, req->ip2);
+	return error;
+}
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
new file mode 100644
index 000000000000..89320a354efa
--- /dev/null
+++ b/fs/xfs/xfs_xchgrange.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_XCHGRANGE_H__
+#define __XFS_XCHGRANGE_H__
+
+struct xfs_swapext_req;
+
+void xfs_xchg_range_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
+		struct xfs_inode *ip2);
+void xfs_xchg_range_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
+int xfs_xchg_range_estimate(struct xfs_swapext_req *req);
+
+#endif /* __XFS_XCHGRANGE_H__ */



* [PATCH 08/21] xfs: enable xlog users to toggle atomic extent swapping
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 07/21] xfs: create deferred log items for extent swapping Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 11/21] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
                     ` (12 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Plumb the necessary bits into the xlog code so that higher level callers
can enable the atomic extent swapping feature and have it clear
automatically when possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c      |   13 +++++++++++++
 fs/xfs/xfs_log.h      |    1 +
 fs/xfs/xfs_log_priv.h |    1 +
 3 files changed, 15 insertions(+)
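
To illustrate the intended calling convention, a hedged sketch; the real
consumer only appears later in this series and do_logged_swapext_work() is
purely hypothetical.  A user of the feature holds the rwsem for reading for
the duration of the operation so that an idle log cannot clear the incompat
bit underneath it; xlog_clear_incompat() only clears the bit when it can
take the write lock:

	/* Keep an idle log from clearing the incompat bit while we work. */
	xlog_use_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);

	error = do_logged_swapext_work(mp);	/* hypothetical caller body */

	xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
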


diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a0ef09addc84..37e85c1bb913 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1501,11 +1501,17 @@ xlog_clear_incompat(
 	if (down_write_trylock(&log->l_incompat_xattrs))
 		incompat_mask |= XFS_SB_FEAT_INCOMPAT_LOG_XATTRS;
 
+	if (down_write_trylock(&log->l_incompat_swapext))
+		incompat_mask |= XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT;
+
 	if (!incompat_mask)
 		return;
 
 	xfs_clear_incompat_log_features(mp, incompat_mask);
 
+	if (incompat_mask & XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT)
+		up_write(&log->l_incompat_swapext);
+
 	if (incompat_mask & XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
 		up_write(&log->l_incompat_xattrs);
 }
@@ -1625,6 +1631,7 @@ xlog_alloc_log(
 	log->l_sectBBsize = 1 << log2_size;
 
 	init_rwsem(&log->l_incompat_xattrs);
+	init_rwsem(&log->l_incompat_swapext);
 
 	xlog_get_iclog_buffer_size(mp, log);
 
@@ -3922,6 +3929,9 @@ xlog_use_incompat_feat(
 	case XLOG_INCOMPAT_FEAT_XATTRS:
 		down_read(&log->l_incompat_xattrs);
 		break;
+	case XLOG_INCOMPAT_FEAT_SWAPEXT:
+		down_read(&log->l_incompat_swapext);
+		break;
 	}
 }
 
@@ -3935,5 +3945,8 @@ xlog_drop_incompat_feat(
 	case XLOG_INCOMPAT_FEAT_XATTRS:
 		up_read(&log->l_incompat_xattrs);
 		break;
+	case XLOG_INCOMPAT_FEAT_SWAPEXT:
+		up_read(&log->l_incompat_swapext);
+		break;
 	}
 }
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index d187f6445909..30bdbf8ee25c 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -161,6 +161,7 @@ bool	  xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags);
 
 enum xlog_incompat_feat {
 	XLOG_INCOMPAT_FEAT_XATTRS = XFS_SB_FEAT_INCOMPAT_LOG_XATTRS,
+	XLOG_INCOMPAT_FEAT_SWAPEXT = XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT
 };
 
 void xlog_use_incompat_feat(struct xlog *log, enum xlog_incompat_feat what);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a13b5b6b744d..6cbee6996de5 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -448,6 +448,7 @@ struct xlog {
 
 	/* Users of log incompat features should take a read lock. */
 	struct rw_semaphore	l_incompat_xattrs;
+	struct rw_semaphore	l_incompat_swapext;
 };
 
 /*



* [PATCH 09/21] xfs: add a ->xchg_file_range handler
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (9 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 10/21] xfs: add error injection to test swapext recovery Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 14/21] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
                     ` (9 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add a function to handle file range exchange requests from the vfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    1 
 fs/xfs/xfs_file.c      |   70 +++++++++
 fs/xfs/xfs_mount.h     |    5 +
 fs/xfs/xfs_trace.c     |    1 
 fs/xfs/xfs_trace.h     |  120 +++++++++++++++
 fs/xfs/xfs_xchgrange.c |  375 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_xchgrange.h |   23 +++
 7 files changed, 594 insertions(+), 1 deletion(-)
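
From userspace, this handler is reached through the exchange-range call
added on the VFS side of this series.  The sketch below is hedged: the
FIEXCHANGE_RANGE ioctl name and the file1_fd field are assumptions about
the VFS patches, which are not shown here; only file1_offset, file2_offset,
length, and the FILE_XCHG_RANGE_* flags appear in the XFS code below.

	#include <sys/ioctl.h>
	#include <err.h>

	struct file_xchg_range	fxr = {
		.file1_fd	= fd1,	/* assumed field name, not shown here */
		.file1_offset	= 0,
		.file2_offset	= 0,
		.length		= len,
		.flags		= FILE_XCHG_RANGE_FSYNC,
	};

	/* Issued on file2's descriptor; ioctl number assumed. */
	if (ioctl(fd2, FIEXCHANGE_RANGE, &fxr) < 0)
		err(1, "FIEXCHANGE_RANGE");
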


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8621534b749b..d587015aec0e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -28,6 +28,7 @@
 #include "xfs_icache.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_swapext.h"
 
 /* Kernel only BMAP related definitions and functions */
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 78323574021c..b4629c8aa6b7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -24,6 +24,7 @@
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_xchgrange.h"
 
 #include <linux/dax.h>
 #include <linux/falloc.h>
@@ -1150,6 +1151,74 @@ xfs_file_remap_range(
 	return remapped > 0 ? remapped : ret;
 }
 
+STATIC int
+xfs_file_xchg_range(
+	struct file		*file1,
+	struct file		*file2,
+	struct file_xchg_range	*fxr)
+{
+	struct inode		*inode1 = file_inode(file1);
+	struct inode		*inode2 = file_inode(file2);
+	struct xfs_inode	*ip1 = XFS_I(inode1);
+	struct xfs_inode	*ip2 = XFS_I(inode2);
+	struct xfs_mount	*mp = ip1->i_mount;
+	unsigned int		priv_flags = 0;
+	bool			use_logging = false;
+	int			error;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	/* Update cmtime if the fd/inode don't forbid it. */
+	if (likely(!(file1->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode1)))
+		priv_flags |= XFS_XCHG_RANGE_UPD_CMTIME1;
+	if (likely(!(file2->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode2)))
+		priv_flags |= XFS_XCHG_RANGE_UPD_CMTIME2;
+
+	/* Lock both files against IO */
+	error = xfs_ilock2_io_mmap(ip1, ip2);
+	if (error)
+		goto out_err;
+
+	/* Prepare and then exchange file contents. */
+	error = xfs_xchg_range_prep(file1, file2, fxr);
+	if (error)
+		goto out_unlock;
+
+	/* Get permission to use log-assisted file content swaps. */
+	error = xfs_xchg_range_grab_log_assist(mp,
+			!(fxr->flags & FILE_XCHG_RANGE_NONATOMIC),
+			&use_logging);
+	if (error)
+		goto out_unlock;
+	if (use_logging)
+		priv_flags |= XFS_XCHG_RANGE_LOGGED;
+
+	error = xfs_xchg_range(ip1, ip2, fxr, priv_flags);
+	if (error)
+		goto out_drop_feat;
+
+	/*
+	 * Finish the exchange by removing special file privileges like any
+	 * other file write would do.  This may involve turning on support for
+	 * logged xattrs if either file has security capabilities, which is why
+	 * xfs_xchg_range_grab_log_assist must run before xfs_attr_grab_log_assist.
+	 */
+	error = generic_xchg_file_range_finish(file1, file2);
+	if (error)
+		goto out_drop_feat;
+
+out_drop_feat:
+	if (use_logging)
+		xfs_xchg_range_rele_log_assist(mp);
+out_unlock:
+	xfs_iunlock2_io_mmap(ip1, ip2);
+out_err:
+	if (error)
+		trace_xfs_file_xchg_range_error(ip2, error, _RET_IP_);
+	return error;
+}
+
 STATIC int
 xfs_file_open(
 	struct inode	*inode,
@@ -1439,6 +1508,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
+	.xchg_file_range = xfs_file_xchg_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7c48a2b70f6f..3b2601ab954d 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -399,6 +399,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID)
 #define XFS_OPSTATE_WARNED_SHRINK	8
 /* Kernel has logged a warning about logged xattr updates being used. */
 #define XFS_OPSTATE_WARNED_LARP		9
+/* Kernel has logged a warning about extent swapping being used on this fs. */
+#define XFS_OPSTATE_WARNED_SWAPEXT	10
 
 #define __XFS_IS_OPSTATE(name, NAME) \
 static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
@@ -438,7 +440,8 @@ xfs_should_warn(struct xfs_mount *mp, long nr)
 	{ (1UL << XFS_OPSTATE_BLOCKGC_ENABLED),		"blockgc" }, \
 	{ (1UL << XFS_OPSTATE_WARNED_SCRUB),		"wscrub" }, \
 	{ (1UL << XFS_OPSTATE_WARNED_SHRINK),		"wshrink" }, \
-	{ (1UL << XFS_OPSTATE_WARNED_LARP),		"wlarp" }
+	{ (1UL << XFS_OPSTATE_WARNED_LARP),		"wlarp" }, \
+	{ (1UL << XFS_OPSTATE_WARNED_SWAPEXT),		"wswapext" }
 
 /*
  * Max and min values for mount-option defined I/O
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index b43b973f0e10..e38814f4380c 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -41,6 +41,7 @@
 #include "xfs_btree_mem.h"
 #include "xfs_bmap.h"
 #include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9ebaa5ffe504..6841f04ee38d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3763,10 +3763,130 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
 DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+
+/* swapext tracepoints */
+DEFINE_INODE_ERROR_EVENT(xfs_file_xchg_range_error);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
 DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
 
+#define FIEXCHANGE_FLAGS_STRS \
+	{ FILE_XCHG_RANGE_NONATOMIC,		"NONATOMIC" }, \
+	{ FILE_XCHG_RANGE_FILE2_FRESH,		"F2_FRESH" }, \
+	{ FILE_XCHG_RANGE_FULL_FILES,		"FULL" }, \
+	{ FILE_XCHG_RANGE_TO_EOF,		"TO_EOF" }, \
+	{ FILE_XCHG_RANGE_FSYNC,		"FSYNC" }, \
+	{ FILE_XCHG_RANGE_DRY_RUN,		"DRY_RUN" }, \
+	{ FILE_XCHG_RANGE_SKIP_FILE1_HOLES,	"SKIP_F1_HOLES" }
+
+/* file exchange-range tracepoint class */
+DECLARE_EVENT_CLASS(xfs_xchg_range_class,
+	TP_PROTO(struct xfs_inode *ip1, const struct file_xchg_range *fxr,
+		 struct xfs_inode *ip2, unsigned int xchg_flags),
+	TP_ARGS(ip1, fxr, ip2, xchg_flags),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip1_ino)
+		__field(loff_t, ip1_isize)
+		__field(loff_t, ip1_disize)
+		__field(xfs_ino_t, ip2_ino)
+		__field(loff_t, ip2_isize)
+		__field(loff_t, ip2_disize)
+
+		__field(loff_t, file1_offset)
+		__field(loff_t, file2_offset)
+		__field(unsigned long long, length)
+		__field(unsigned long long, vflags)
+		__field(unsigned int, xflags)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip1)->i_sb->s_dev;
+		__entry->ip1_ino = ip1->i_ino;
+		__entry->ip1_isize = VFS_I(ip1)->i_size;
+		__entry->ip1_disize = ip1->i_disk_size;
+		__entry->ip2_ino = ip2->i_ino;
+		__entry->ip2_isize = VFS_I(ip2)->i_size;
+		__entry->ip2_disize = ip2->i_disk_size;
+
+		__entry->file1_offset = fxr->file1_offset;
+		__entry->file2_offset = fxr->file2_offset;
+		__entry->length = fxr->length;
+		__entry->vflags = fxr->flags;
+		__entry->xflags = xchg_flags;
+	),
+	TP_printk("dev %d:%d vfs_flags %s xchg_flags %s bytecount 0x%llx "
+		  "ino1 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx -> "
+		  "ino2 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		   __print_flags(__entry->vflags, "|", FIEXCHANGE_FLAGS_STRS),
+		   __print_flags(__entry->xflags, "|", XCHG_RANGE_FLAGS_STRS),
+		  __entry->length,
+		  __entry->ip1_ino,
+		  __entry->ip1_isize,
+		  __entry->ip1_disize,
+		  __entry->file1_offset,
+		  __entry->ip2_ino,
+		  __entry->ip2_isize,
+		  __entry->ip2_disize,
+		  __entry->file2_offset)
+)
+
+#define DEFINE_XCHG_RANGE_EVENT(name)	\
+DEFINE_EVENT(xfs_xchg_range_class, name,	\
+	TP_PROTO(struct xfs_inode *ip1, const struct file_xchg_range *fxr, \
+		 struct xfs_inode *ip2, unsigned int xchg_flags), \
+	TP_ARGS(ip1, fxr, ip2, xchg_flags))
+DEFINE_XCHG_RANGE_EVENT(xfs_xchg_range_prep);
+DEFINE_XCHG_RANGE_EVENT(xfs_xchg_range_flush);
+DEFINE_XCHG_RANGE_EVENT(xfs_xchg_range);
+
+TRACE_EVENT(xfs_xchg_range_freshness,
+	TP_PROTO(struct xfs_inode *ip2, const struct file_xchg_range *fxr),
+	TP_ARGS(ip2, fxr),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip2_ino)
+		__field(long long, ip2_mtime)
+		__field(long long, ip2_ctime)
+		__field(int, ip2_mtime_nsec)
+		__field(int, ip2_ctime_nsec)
+
+		__field(xfs_ino_t, file2_ino)
+		__field(long long, file2_mtime)
+		__field(long long, file2_ctime)
+		__field(int, file2_mtime_nsec)
+		__field(int, file2_ctime_nsec)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip2)->i_sb->s_dev;
+		__entry->ip2_ino = ip2->i_ino;
+		__entry->ip2_mtime = VFS_I(ip2)->i_mtime.tv_sec;
+		__entry->ip2_ctime = VFS_I(ip2)->i_ctime.tv_sec;
+		__entry->ip2_mtime_nsec = VFS_I(ip2)->i_mtime.tv_nsec;
+		__entry->ip2_ctime_nsec = VFS_I(ip2)->i_ctime.tv_nsec;
+
+		__entry->file2_ino = fxr->file2_ino;
+		__entry->file2_mtime = fxr->file2_mtime;
+		__entry->file2_ctime = fxr->file2_ctime;
+		__entry->file2_mtime_nsec = fxr->file2_mtime_nsec;
+		__entry->file2_ctime_nsec = fxr->file2_ctime_nsec;
+	),
+	TP_printk("dev %d:%d "
+		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ip2_ino,
+		  __entry->ip2_mtime,
+		  __entry->ip2_mtime_nsec,
+		  __entry->ip2_ctime,
+		  __entry->ip2_ctime_nsec,
+		  __entry->file2_ino,
+		  __entry->file2_mtime,
+		  __entry->file2_mtime_nsec,
+		  __entry->file2_ctime,
+		  __entry->file2_ctime_nsec)
+);
+
 /* fsmap traces */
 DECLARE_EVENT_CLASS(xfs_fsmap_class,
 	TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_agnumber_t agno,
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 0dba5078c9f7..9966938134c0 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -13,8 +13,15 @@
 #include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
+#include "xfs_quota.h"
+#include "xfs_bmap_util.h"
+#include "xfs_reflink.h"
+#include "xfs_trace.h"
 #include "xfs_swapext.h"
 #include "xfs_xchgrange.h"
+#include "xfs_sb.h"
+#include "xfs_icache.h"
+#include "xfs_log.h"
 
 /* Lock (and optionally join) two inodes for a file range exchange. */
 void
@@ -63,3 +70,371 @@ xfs_xchg_range_estimate(
 	xfs_xchg_range_iunlock(req->ip1, req->ip2);
 	return error;
 }
+
+/* Prepare two files to have their data exchanged. */
+int
+xfs_xchg_range_prep(
+	struct file		*file1,
+	struct file		*file2,
+	struct file_xchg_range	*fxr)
+{
+	struct xfs_inode	*ip1 = XFS_I(file_inode(file1));
+	struct xfs_inode	*ip2 = XFS_I(file_inode(file2));
+	int			error;
+
+	trace_xfs_xchg_range_prep(ip1, fxr, ip2, 0);
+
+	/* Verify both files are either real-time or non-realtime */
+	if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
+		return -EINVAL;
+
+	/*
+	 * The alignment checks in the VFS helpers cannot deal with allocation
+	 * units that are not powers of 2.  This can happen with the realtime
+	 * volume if the extent size is set.  Note that alignment checks are
+	 * skipped if FULL_FILES is set.
+	 */
+	if (!(fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+	    !is_power_of_2(xfs_inode_alloc_unitsize(ip2)))
+		return -EOPNOTSUPP;
+
+	error = generic_xchg_file_range_prep(file1, file2, fxr,
+			xfs_inode_alloc_unitsize(ip2));
+	if (error || fxr->length == 0)
+		return error;
+
+	/* Attach dquots to both inodes before changing block maps. */
+	error = xfs_qm_dqattach(ip2);
+	if (error)
+		return error;
+	error = xfs_qm_dqattach(ip1);
+	if (error)
+		return error;
+
+	trace_xfs_xchg_range_flush(ip1, fxr, ip2, 0);
+
+	/* Flush the relevant ranges of both files. */
+	error = xfs_flush_unmap_range(ip2, fxr->file2_offset, fxr->length);
+	if (error)
+		return error;
+	error = xfs_flush_unmap_range(ip1, fxr->file1_offset, fxr->length);
+	if (error)
+		return error;
+
+	/*
+	 * Cancel CoW fork preallocations for the ranges of both files.  The
+	 * prep function should have flushed all the dirty data, so the only
+	 * extents remaining should be speculative.
+	 */
+	if (xfs_inode_has_cow_data(ip1)) {
+		error = xfs_reflink_cancel_cow_range(ip1, fxr->file1_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	if (xfs_inode_has_cow_data(ip2)) {
+		error = xfs_reflink_cancel_cow_range(ip2, fxr->file2_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+#define QRETRY_IP1	(0x1)
+#define QRETRY_IP2	(0x2)
+
+/*
+ * Obtain a quota reservation to make sure we don't hit EDQUOT.  We can skip
+ * this if quota enforcement is disabled or if both inodes' dquots are the
+ * same.  The qretry structure must be initialized to zeroes before the first
+ * call to this function.
+ */
+STATIC int
+xfs_xchg_range_reserve_quota(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_req	*req,
+	unsigned int			*qretry)
+{
+	int64_t				ddelta, rdelta;
+	int				ip1_error = 0;
+	int				error;
+
+	/*
+	 * Don't bother with a quota reservation if we're not enforcing them
+	 * or the two inodes have the same dquots.
+	 */
+	if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
+	    (req->ip1->i_udquot == req->ip2->i_udquot &&
+	     req->ip1->i_gdquot == req->ip2->i_gdquot &&
+	     req->ip1->i_pdquot == req->ip2->i_pdquot))
+		return 0;
+
+	*qretry = 0;
+
+	/*
+	 * For each file, compute the net gain in the number of regular blocks
+	 * that will be mapped into that file and reserve that much quota.  The
+	 * quota counts must be able to absorb at least that much space.
+	 */
+	ddelta = req->ip2_bcount - req->ip1_bcount;
+	rdelta = req->ip2_rtbcount - req->ip1_rtbcount;
+	if (ddelta > 0 || rdelta > 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
+				ddelta > 0 ? ddelta : 0,
+				rdelta > 0 ? rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC) {
+			/*
+			 * Save this error and see what happens if we try to
+			 * reserve quota for ip2.  Then report both.
+			 */
+			*qretry |= QRETRY_IP1;
+			ip1_error = error;
+			error = 0;
+		}
+		if (error)
+			return error;
+	}
+	if (ddelta < 0 || rdelta < 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip2,
+				ddelta < 0 ? -ddelta : 0,
+				rdelta < 0 ? -rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC)
+			*qretry |= QRETRY_IP2;
+		if (error)
+			return error;
+	}
+	if (ip1_error)
+		return ip1_error;
+
+	/*
+	 * For each file, forcibly reserve the gross gain in mapped blocks so
+	 * that we don't trip over any quota block reservation assertions.
+	 * We must reserve the gross gain because the quota code subtracts from
+	 * bcount the number of blocks that we unmap; it does not add that
+	 * quantity back to the quota block reservation.
+	 */
+	error = xfs_trans_reserve_quota_nblks(tp, req->ip1, req->ip1_bcount,
+			req->ip1_rtbcount, true);
+	if (error)
+		return error;
+
+	return xfs_trans_reserve_quota_nblks(tp, req->ip2, req->ip2_bcount,
+			req->ip2_rtbcount, true);
+}
+
+/*
+ * Get permission to use log-assisted atomic exchange of file extents.
+ *
+ * Callers must hold the IOLOCK and MMAPLOCK of both files.  They must not be
+ * running any transactions or hold any ILOCKS.  If @use_logging is set after a
+ * successful return, callers must call xfs_xchg_range_rele_log_assist after
+ * the exchange is completed.
+ */
+int
+xfs_xchg_range_grab_log_assist(
+	struct xfs_mount	*mp,
+	bool			force,
+	bool			*use_logging)
+{
+	int			error = 0;
+
+	/*
+	 * Protect ourselves from an idle log clearing the atomic swapext
+	 * log incompat feature bit.
+	 */
+	xlog_use_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
+	*use_logging = true;
+
+	/*
+	 * If log-assisted swapping is already enabled, the caller can use the
+	 * log assisted swap functions with the log-incompat reference we got.
+	 */
+	if (xfs_sb_version_haslogswapext(&mp->m_sb))
+		return 0;
+
+	/*
+	 * If the caller doesn't /require/ log-assisted swapping, drop the
+	 * log-incompat feature protection and exit.  The caller cannot use
+	 * log assisted swapping.
+	 */
+	if (!force)
+		goto drop_incompat;
+
+	/*
+	 * Caller requires log-assisted swapping but the fs feature set isn't
+	 * rich enough to support it.  Bail out.
+	 */
+	if (!xfs_swapext_supported(mp)) {
+		error = -EOPNOTSUPP;
+		goto drop_incompat;
+	}
+
+	error = xfs_add_incompat_log_feature(mp,
+			XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT);
+	if (error)
+		goto drop_incompat;
+
+	xfs_warn_mount(mp, XFS_OPSTATE_WARNED_SWAPEXT,
+ "EXPERIMENTAL atomic file range swap feature in use. Use at your own risk!");
+
+	return 0;
+drop_incompat:
+	xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
+	*use_logging = false;
+	return error;
+}
+
+/* Release permission to use log-assisted extent swapping. */
+void
+xfs_xchg_range_rele_log_assist(
+	struct xfs_mount	*mp)
+{
+	xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
+}
+
+/* Exchange the contents of two files. */
+int
+xfs_xchg_range(
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2,
+	const struct file_xchg_range	*fxr,
+	unsigned int			xchg_flags)
+{
+	struct xfs_mount		*mp = ip1->i_mount;
+	struct xfs_swapext_req		req = {
+		.ip1			= ip1,
+		.ip2			= ip2,
+		.whichfork		= XFS_DATA_FORK,
+		.startoff1		= XFS_B_TO_FSBT(mp, fxr->file1_offset),
+		.startoff2		= XFS_B_TO_FSBT(mp, fxr->file2_offset),
+		.blockcount		= XFS_B_TO_FSB(mp, fxr->length),
+	};
+	struct xfs_trans		*tp;
+	unsigned int			qretry;
+	bool				retried = false;
+	int				error;
+
+	trace_xfs_xchg_range(ip1, fxr, ip2, xchg_flags);
+
+	/*
+	 * This function only supports using log intent items (SXI items if
+	 * atomic exchange is required, or BUI items if not) to exchange file
+	 * data.  The legacy whole-fork swap will be ported in a later patch.
+	 */
+	if (!(xchg_flags & XFS_XCHG_RANGE_LOGGED) && !xfs_swapext_supported(mp))
+		return -EOPNOTSUPP;
+
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		req.req_flags |= XFS_SWAP_REQ_SET_SIZES;
+	if (fxr->flags & FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+		req.req_flags |= XFS_SWAP_REQ_SKIP_INO1_HOLES;
+	if (xchg_flags & XFS_XCHG_RANGE_LOGGED)
+		req.req_flags |= XFS_SWAP_REQ_LOGGED;
+
+	error = xfs_xchg_range_estimate(&req);
+	if (error)
+		return error;
+
+retry:
+	/* Allocate the transaction, lock the inodes, and join them. */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
+			XFS_TRANS_RES_FDBLKS, &tp);
+	if (error)
+		return error;
+
+	xfs_xchg_range_ilock(tp, ip1, ip2);
+
+	trace_xfs_swap_extent_before(ip2, 0);
+	trace_xfs_swap_extent_before(ip1, 1);
+
+	if (fxr->flags & FILE_XCHG_RANGE_FILE2_FRESH)
+		trace_xfs_xchg_range_freshness(ip2, fxr);
+
+	/*
+	 * Now that we've excluded all other inode metadata changes by taking
+	 * the ILOCK, repeat the freshness check.
+	 */
+	error = generic_xchg_file_range_check_fresh(VFS_I(ip2), fxr);
+	if (error)
+		goto out_trans_cancel;
+
+	error = xfs_swapext_check_extents(mp, &req);
+	if (error)
+		goto out_trans_cancel;
+
+	/*
+	 * Reserve ourselves some quota if any of them are in enforcing mode.
+	 * In theory we only need enough to satisfy the change in the number
+	 * of blocks between the two ranges being remapped.
+	 */
+	error = xfs_xchg_range_reserve_quota(tp, &req, &qretry);
+	if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
+		xfs_trans_cancel(tp);
+		xfs_xchg_range_iunlock(ip1, ip2);
+		if (qretry & QRETRY_IP1)
+			xfs_blockgc_free_quota(ip1, 0);
+		if (qretry & QRETRY_IP2)
+			xfs_blockgc_free_quota(ip2, 0);
+		retried = true;
+		goto retry;
+	}
+	if (error)
+		goto out_trans_cancel;
+
+	/* If we got this far on a dry run, all parameters are ok. */
+	if (fxr->flags & FILE_XCHG_RANGE_DRY_RUN)
+		goto out_trans_cancel;
+
+	/* Update the mtime and ctime of both files. */
+	if (xchg_flags & XFS_XCHG_RANGE_UPD_CMTIME1)
+		xfs_trans_ichgtime(tp, ip1,
+				XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+	if (xchg_flags & XFS_XCHG_RANGE_UPD_CMTIME2)
+		xfs_trans_ichgtime(tp, ip2,
+				XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+
+	xfs_swapext(tp, &req);
+
+	/*
+	 * Force the log to persist metadata updates if the caller or the
+	 * administrator requires this.  The VFS prep function already flushed
+	 * the relevant parts of the page cache.
+	 */
+	if (xfs_has_wsync(mp) || (fxr->flags & FILE_XCHG_RANGE_FSYNC))
+		xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+
+	trace_xfs_swap_extent_after(ip2, 0);
+	trace_xfs_swap_extent_after(ip1, 1);
+
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * If the caller wanted us to exchange the contents of two complete
+	 * files of unequal length, exchange the incore sizes now.  This should
+	 * be safe because we flushed both files' page caches, moved all the
+	 * extents, and updated the ondisk sizes.
+	 */
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF) {
+		loff_t	temp;
+
+		temp = i_size_read(VFS_I(ip2));
+		i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
+		i_size_write(VFS_I(ip1), temp);
+	}
+
+out_unlock:
+	xfs_xchg_range_iunlock(ip1, ip2);
+	return error;
+
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	goto out_unlock;
+}
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
index 89320a354efa..a0e64408784a 100644
--- a/fs/xfs/xfs_xchgrange.h
+++ b/fs/xfs/xfs_xchgrange.h
@@ -14,4 +14,27 @@ void xfs_xchg_range_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
 
 int xfs_xchg_range_estimate(struct xfs_swapext_req *req);
 
+int xfs_xchg_range_grab_log_assist(struct xfs_mount *mp, bool force,
+		bool *use_logging);
+void xfs_xchg_range_rele_log_assist(struct xfs_mount *mp);
+
+/* Caller has permission to use log intent items for the exchange operation. */
+#define XFS_XCHG_RANGE_LOGGED		(1U << 0)
+
+/* Update ip1's change and mod time. */
+#define XFS_XCHG_RANGE_UPD_CMTIME1	(1U << 1)
+
+/* Update ip2's change and mod time. */
+#define XFS_XCHG_RANGE_UPD_CMTIME2	(1U << 2)
+
+#define XCHG_RANGE_FLAGS_STRS \
+	{ XFS_XCHG_RANGE_LOGGED,		"LOGGED" }, \
+	{ XFS_XCHG_RANGE_UPD_CMTIME1,		"UPD_CMTIME1" }, \
+	{ XFS_XCHG_RANGE_UPD_CMTIME2,		"UPD_CMTIME2" }
+
+int xfs_xchg_range(struct xfs_inode *ip1, struct xfs_inode *ip2,
+		const struct file_xchg_range *fxr, unsigned int xchg_flags);
+int xfs_xchg_range_prep(struct file *file1, struct file *file2,
+		struct file_xchg_range *fxr);
+
 #endif /* __XFS_XCHGRANGE_H__ */



* [PATCH 10/21] xfs: add error injection to test swapext recovery
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (8 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 11/21] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 09/21] xfs: add a ->xchg_file_range handler Darrick J. Wong
                     ` (10 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add an errortag so that we can test recovery of swapext log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_errortag.h |    4 +++-
 fs/xfs/libxfs/xfs_swapext.c  |    3 +++
 fs/xfs/xfs_error.c           |    3 +++
 3 files changed, 9 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 01a9e86b3037..263d62a8d70f 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -63,7 +63,8 @@
 #define XFS_ERRTAG_ATTR_LEAF_TO_NODE			41
 #define XFS_ERRTAG_WB_DELAY_MS				42
 #define XFS_ERRTAG_WRITE_DELAY_MS			43
-#define XFS_ERRTAG_MAX					44
+#define XFS_ERRTAG_SWAPEXT_FINISH_ONE			44
+#define XFS_ERRTAG_MAX					45
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -111,5 +112,6 @@
 #define XFS_RANDOM_ATTR_LEAF_TO_NODE			1
 #define XFS_RANDOM_WB_DELAY_MS				3000
 #define XFS_RANDOM_WRITE_DELAY_MS			3000
+#define XFS_RANDOM_SWAPEXT_FINISH_ONE			1
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 0bc758c5cf5c..227a08ac5d4b 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -426,6 +426,9 @@ xfs_swapext_finish_one(
 			return error;
 	}
 
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_SWAPEXT_FINISH_ONE))
+		return -EIO;
+
 	/* If we still have work to do, ask for a new transaction. */
 	if (sxi_has_more_swap_work(sxi) || sxi_has_postop_work(sxi)) {
 		trace_xfs_swapext_defer(tp->t_mountp, sxi);
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index ae082808cfed..4b57a809ced5 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -62,6 +62,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_ATTR_LEAF_TO_NODE,
 	XFS_RANDOM_WB_DELAY_MS,
 	XFS_RANDOM_WRITE_DELAY_MS,
+	XFS_RANDOM_SWAPEXT_FINISH_ONE,
 };
 
 struct xfs_errortag_attr {
@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(da_leaf_split,	XFS_ERRTAG_DA_LEAF_SPLIT);
 XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node,	XFS_ERRTAG_ATTR_LEAF_TO_NODE);
 XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(write_delay_ms,	XFS_ERRTAG_WRITE_DELAY_MS);
+XFS_ERRORTAG_ATTR_RW(swapext_finish_one, XFS_ERRTAG_SWAPEXT_FINISH_ONE);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -224,6 +226,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
 	XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
 	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
+	XFS_ERRORTAG_ATTR_LIST(swapext_finish_one),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);
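
As a usage note (an illustration, not part of the patch): the
XFS_ERRORTAG_ATTR_RW() hookup above exposes the new tag as a writable
sysfs knob under /sys/fs/xfs/<device>/errortag/, and the default random
factor of 1 means the injected -EIO fires on every opportunity once the
knob is armed.  A minimal userspace sketch follows; the sda1 device name
is an assumption for illustration.

/* Arm the swapext_finish_one errortag; the sda1 path is an assumption. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *knob = "/sys/fs/xfs/sda1/errortag/swapext_finish_one";
	int fd = open(knob, O_WRONLY);

	if (fd < 0) {
		perror(knob);
		return 1;
	}
	/* Writing 1 injects the error on every call; writing 0 disarms it. */
	if (write(fd, "1\n", 2) != 2) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}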



* [PATCH 11/21] xfs: port xfs_swap_extents_rmap to our new code
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 08/21] xfs: enable xlog users to toggle atomic " Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 10/21] xfs: add error injection to test swapext recovery Darrick J. Wong
                     ` (11 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

The inner loop of xfs_swap_extents_rmap does the same work as
xfs_swapext_finish_one, so adapt the former to use the latter.  Doing so
has the side benefit that the older code path no longer wastes time
remapping shared extents.

This forms the basis of the non-atomic swaprange implementation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  151 +++++-------------------------------------------
 fs/xfs/xfs_trace.h     |    5 --
 2 files changed, 16 insertions(+), 140 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index d587015aec0e..4d4696bf9b08 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1360,138 +1360,6 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
-/*
- * Move extents from one file to another, when rmap is enabled.
- */
-STATIC int
-xfs_swap_extent_rmap(
-	struct xfs_trans		**tpp,
-	struct xfs_inode		*ip,
-	struct xfs_inode		*tip)
-{
-	struct xfs_trans		*tp = *tpp;
-	struct xfs_bmbt_irec		irec;
-	struct xfs_bmbt_irec		uirec;
-	struct xfs_bmbt_irec		tirec;
-	xfs_fileoff_t			offset_fsb;
-	xfs_fileoff_t			end_fsb;
-	xfs_filblks_t			count_fsb;
-	int				error;
-	xfs_filblks_t			ilen;
-	xfs_filblks_t			rlen;
-	int				nimaps;
-	uint64_t			tip_flags2;
-
-	/*
-	 * If the source file has shared blocks, we must flag the donor
-	 * file as having shared blocks so that we get the shared-block
-	 * rmap functions when we go to fix up the rmaps.  The flags
-	 * will be switch for reals later.
-	 */
-	tip_flags2 = tip->i_diflags2;
-	if (ip->i_diflags2 & XFS_DIFLAG2_REFLINK)
-		tip->i_diflags2 |= XFS_DIFLAG2_REFLINK;
-
-	offset_fsb = 0;
-	end_fsb = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
-	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
-
-	while (count_fsb) {
-		/* Read extent from the donor file */
-		nimaps = 1;
-		error = xfs_bmapi_read(tip, offset_fsb, count_fsb, &tirec,
-				&nimaps, 0);
-		if (error)
-			goto out;
-		ASSERT(nimaps == 1);
-		ASSERT(tirec.br_startblock != DELAYSTARTBLOCK);
-
-		trace_xfs_swap_extent_rmap_remap(tip, &tirec);
-		ilen = tirec.br_blockcount;
-
-		/* Unmap the old blocks in the source file. */
-		while (tirec.br_blockcount) {
-			ASSERT(tp->t_firstblock == NULLFSBLOCK);
-			trace_xfs_swap_extent_rmap_remap_piece(tip, &tirec);
-
-			/* Read extent from the source file */
-			nimaps = 1;
-			error = xfs_bmapi_read(ip, tirec.br_startoff,
-					tirec.br_blockcount, &irec,
-					&nimaps, 0);
-			if (error)
-				goto out;
-			ASSERT(nimaps == 1);
-			ASSERT(tirec.br_startoff == irec.br_startoff);
-			trace_xfs_swap_extent_rmap_remap_piece(ip, &irec);
-
-			/* Trim the extent. */
-			uirec = tirec;
-			uirec.br_blockcount = rlen = min_t(xfs_filblks_t,
-					tirec.br_blockcount,
-					irec.br_blockcount);
-			trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
-
-			if (xfs_bmap_is_real_extent(&uirec)) {
-				error = xfs_iext_count_may_overflow(ip,
-						XFS_DATA_FORK,
-						XFS_IEXT_SWAP_RMAP_CNT);
-				if (error == -EFBIG)
-					error = xfs_iext_count_upgrade(tp, ip,
-							XFS_IEXT_SWAP_RMAP_CNT);
-				if (error)
-					goto out;
-			}
-
-			if (xfs_bmap_is_real_extent(&irec)) {
-				error = xfs_iext_count_may_overflow(tip,
-						XFS_DATA_FORK,
-						XFS_IEXT_SWAP_RMAP_CNT);
-				if (error == -EFBIG)
-					error = xfs_iext_count_upgrade(tp, ip,
-							XFS_IEXT_SWAP_RMAP_CNT);
-				if (error)
-					goto out;
-			}
-
-			/* Remove the mapping from the donor file. */
-			xfs_bmap_unmap_extent(tp, tip, XFS_DATA_FORK, &uirec);
-
-			/* Remove the mapping from the source file. */
-			xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &irec);
-
-			/* Map the donor file's blocks into the source file. */
-			xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
-
-			/* Map the source file's blocks into the donor file. */
-			xfs_bmap_map_extent(tp, tip, XFS_DATA_FORK, &irec);
-
-			error = xfs_defer_finish(tpp);
-			tp = *tpp;
-			if (error)
-				goto out;
-
-			tirec.br_startoff += rlen;
-			if (tirec.br_startblock != HOLESTARTBLOCK &&
-			    tirec.br_startblock != DELAYSTARTBLOCK)
-				tirec.br_startblock += rlen;
-			tirec.br_blockcount -= rlen;
-		}
-
-		/* Roll on... */
-		count_fsb -= ilen;
-		offset_fsb += ilen;
-	}
-
-	tip->i_diflags2 = tip_flags2;
-	return 0;
-
-out:
-	trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
-	tip->i_diflags2 = tip_flags2;
-	return error;
-}
-
 /* Swap the extents of two files by swapping data forks. */
 STATIC int
 xfs_swap_extent_forks(
@@ -1775,13 +1643,24 @@ xfs_swap_extents(
 	src_log_flags = XFS_ILOG_CORE;
 	target_log_flags = XFS_ILOG_CORE;
 
-	if (xfs_has_rmapbt(mp))
-		error = xfs_swap_extent_rmap(&tp, ip, tip);
-	else
+	if (xfs_has_rmapbt(mp)) {
+		struct xfs_swapext_req	req = {
+			.ip1		= tip,
+			.ip2		= ip,
+			.whichfork	= XFS_DATA_FORK,
+			.blockcount	= XFS_B_TO_FSB(ip->i_mount,
+						       i_size_read(VFS_I(ip))),
+		};
+
+		xfs_swapext(tp, &req);
+		error = xfs_defer_finish(&tp);
+	} else
 		error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
 				&target_log_flags);
-	if (error)
+	if (error) {
+		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;
+	}
 
 	/* Do we have to swap reflink flags? */
 	if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6841f04ee38d..b0ced76af3b9 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3759,13 +3759,10 @@ DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
 
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 
-/* rmap swapext tracepoints */
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
-DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
 
 /* swapext tracepoints */
 DEFINE_INODE_ERROR_EVENT(xfs_file_xchg_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_error);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
 DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);



* [PATCH 12/21] xfs: consolidate all of the xfs_swap_extent_forks code
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (15 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 16/21] xfs: condense extended attributes " Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 15/21] xfs: remove old swap extents implementation Darrick J. Wong
                     ` (3 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Now that we've moved the old swapext code to use the new log-assisted
extent swap code for rmap filesystems, let's start porting the old
fork-swap implementation to back the new ioctl interface so that, later,
the old ioctl can be reimplemented on top of the new one.

Consolidate the reflink flag swap code and the bmbt owner change
scan code in xfs_swap_extent_forks, since both interfaces are going to
need that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  220 ++++++++++++++++++++++++------------------------
 1 file changed, 108 insertions(+), 112 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 4d4696bf9b08..dbd95d86addb 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1360,19 +1360,61 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
+/*
+ * Fix up the owners of the bmbt blocks to refer to the current inode. The
+ * change owner scan attempts to order all modified buffers in the current
+ * transaction. In the event of ordered buffer failure, the offending buffer is
+ * physically logged as a fallback and the scan returns -EAGAIN. We must roll
+ * the transaction in this case to replenish the fallback log reservation and
+ * restart the scan. This process repeats until the scan completes.
+ */
+static int
+xfs_swap_change_owner(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_inode	*tmpip)
+{
+	int			error;
+	struct xfs_trans	*tp = *tpp;
+
+	do {
+		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
+					      NULL);
+		/* success or fatal error */
+		if (error != -EAGAIN)
+			break;
+
+		error = xfs_trans_roll(tpp);
+		if (error)
+			break;
+		tp = *tpp;
+
+		/*
+		 * Redirty both inodes so they can relog and keep the log tail
+		 * moving forward.
+		 */
+		xfs_trans_ijoin(tp, ip, 0);
+		xfs_trans_ijoin(tp, tmpip, 0);
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
+	} while (true);
+
+	return error;
+}
+
 /* Swap the extents of two files by swapping data forks. */
 STATIC int
 xfs_swap_extent_forks(
-	struct xfs_trans	*tp,
+	struct xfs_trans	**tpp,
 	struct xfs_inode	*ip,
-	struct xfs_inode	*tip,
-	int			*src_log_flags,
-	int			*target_log_flags)
+	struct xfs_inode	*tip)
 {
 	xfs_filblks_t		aforkblks = 0;
 	xfs_filblks_t		taforkblks = 0;
 	xfs_extnum_t		junk;
 	uint64_t		tmp;
+	int			src_log_flags = XFS_ILOG_CORE;
+	int			target_log_flags = XFS_ILOG_CORE;
 	int			error;
 
 	/*
@@ -1380,14 +1422,14 @@ xfs_swap_extent_forks(
 	 */
 	if (xfs_inode_has_attr_fork(ip) && ip->i_af.if_nextents > 0 &&
 	    ip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK, &junk,
+		error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
 				&aforkblks);
 		if (error)
 			return error;
 	}
 	if (xfs_inode_has_attr_fork(tip) && tip->i_af.if_nextents > 0 &&
 	    tip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(tp, tip, XFS_ATTR_FORK, &junk,
+		error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
 				&taforkblks);
 		if (error)
 			return error;
@@ -1402,9 +1444,9 @@ xfs_swap_extent_forks(
 	 */
 	if (xfs_has_v3inodes(ip->i_mount)) {
 		if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			(*target_log_flags) |= XFS_ILOG_DOWNER;
+			target_log_flags |= XFS_ILOG_DOWNER;
 		if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			(*src_log_flags) |= XFS_ILOG_DOWNER;
+			src_log_flags |= XFS_ILOG_DOWNER;
 	}
 
 	/*
@@ -1434,71 +1476,80 @@ xfs_swap_extent_forks(
 
 	switch (ip->i_df.if_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		(*src_log_flags) |= XFS_ILOG_DEXT;
+		src_log_flags |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
 		ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
-		       (*src_log_flags & XFS_ILOG_DOWNER));
-		(*src_log_flags) |= XFS_ILOG_DBROOT;
+		       (src_log_flags & XFS_ILOG_DOWNER));
+		src_log_flags |= XFS_ILOG_DBROOT;
 		break;
 	}
 
 	switch (tip->i_df.if_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		(*target_log_flags) |= XFS_ILOG_DEXT;
+		target_log_flags |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		(*target_log_flags) |= XFS_ILOG_DBROOT;
+		target_log_flags |= XFS_ILOG_DBROOT;
 		ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
-		       (*target_log_flags & XFS_ILOG_DOWNER));
+		       (target_log_flags & XFS_ILOG_DOWNER));
 		break;
 	}
 
+	/* Do we have to swap reflink flags? */
+	if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
+	    (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
+		uint64_t	f;
+
+		f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+		ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+		ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+		tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+		tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
+	}
+
+	/* Swap the cow forks. */
+	if (xfs_has_reflink(ip->i_mount)) {
+		ASSERT(!ip->i_cowfp ||
+		       ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+		ASSERT(!tip->i_cowfp ||
+		       tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+
+		swap(ip->i_cowfp, tip->i_cowfp);
+
+		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(ip);
+		else
+			xfs_inode_clear_cowblocks_tag(ip);
+		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(tip);
+		else
+			xfs_inode_clear_cowblocks_tag(tip);
+	}
+
+	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
+	xfs_trans_log_inode(*tpp, tip, target_log_flags);
+
+	/*
+	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
+	 * have inode number owner values in the bmbt blocks that still refer to
+	 * the old inode. Scan each bmbt to fix up the owner values with the
+	 * inode number of the current inode.
+	 */
+	if (src_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, ip, tip);
+		if (error)
+			return error;
+	}
+	if (target_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, tip, ip);
+		if (error)
+			return error;
+	}
+
 	return 0;
 }
 
-/*
- * Fix up the owners of the bmbt blocks to refer to the current inode. The
- * change owner scan attempts to order all modified buffers in the current
- * transaction. In the event of ordered buffer failure, the offending buffer is
- * physically logged as a fallback and the scan returns -EAGAIN. We must roll
- * the transaction in this case to replenish the fallback log reservation and
- * restart the scan. This process repeats until the scan completes.
- */
-static int
-xfs_swap_change_owner(
-	struct xfs_trans	**tpp,
-	struct xfs_inode	*ip,
-	struct xfs_inode	*tmpip)
-{
-	int			error;
-	struct xfs_trans	*tp = *tpp;
-
-	do {
-		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
-					      NULL);
-		/* success or fatal error */
-		if (error != -EAGAIN)
-			break;
-
-		error = xfs_trans_roll(tpp);
-		if (error)
-			break;
-		tp = *tpp;
-
-		/*
-		 * Redirty both inodes so they can relog and keep the log tail
-		 * moving forward.
-		 */
-		xfs_trans_ijoin(tp, ip, 0);
-		xfs_trans_ijoin(tp, tmpip, 0);
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
-	} while (true);
-
-	return error;
-}
-
 int
 xfs_swap_extents(
 	struct xfs_inode	*ip,	/* target inode */
@@ -1508,9 +1559,7 @@ xfs_swap_extents(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	struct xfs_bstat	*sbp = &sxp->sx_stat;
-	int			src_log_flags, target_log_flags;
 	int			error = 0;
-	uint64_t		f;
 	int			resblks = 0;
 	unsigned int		flags = 0;
 
@@ -1640,9 +1689,6 @@ xfs_swap_extents(
 	 * recovery is going to see the fork as owned by the swapped inode,
 	 * not the pre-swapped inodes.
 	 */
-	src_log_flags = XFS_ILOG_CORE;
-	target_log_flags = XFS_ILOG_CORE;
-
 	if (xfs_has_rmapbt(mp)) {
 		struct xfs_swapext_req	req = {
 			.ip1		= tip,
@@ -1655,62 +1701,12 @@ xfs_swap_extents(
 		xfs_swapext(tp, &req);
 		error = xfs_defer_finish(&tp);
 	} else
-		error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
-				&target_log_flags);
+		error = xfs_swap_extent_forks(&tp, ip, tip);
 	if (error) {
 		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;
 	}
 
-	/* Do we have to swap reflink flags? */
-	if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
-	    (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
-		f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
-		ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
-		ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
-		tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
-		tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
-	}
-
-	/* Swap the cow forks. */
-	if (xfs_has_reflink(mp)) {
-		ASSERT(!ip->i_cowfp ||
-		       ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-		ASSERT(!tip->i_cowfp ||
-		       tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-
-		swap(ip->i_cowfp, tip->i_cowfp);
-
-		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(ip);
-		else
-			xfs_inode_clear_cowblocks_tag(ip);
-		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(tip);
-		else
-			xfs_inode_clear_cowblocks_tag(tip);
-	}
-
-	xfs_trans_log_inode(tp, ip,  src_log_flags);
-	xfs_trans_log_inode(tp, tip, target_log_flags);
-
-	/*
-	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
-	 * have inode number owner values in the bmbt blocks that still refer to
-	 * the old inode. Scan each bmbt to fix up the owner values with the
-	 * inode number of the current inode.
-	 */
-	if (src_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(&tp, ip, tip);
-		if (error)
-			goto out_trans_cancel;
-	}
-	if (target_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(&tp, tip, ip);
-		if (error)
-			goto out_trans_cancel;
-	}
-
 	/*
 	 * If this is a synchronous mount, make sure that the
 	 * transaction goes to disk before returning to the user.
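
A small sketch (not in the patch) of the calling convention this
consolidation creates: because the bmbt owner-change scan may roll the
transaction, xfs_swap_extent_forks() now takes a double pointer, and the
caller has to keep using whatever *tpp points to afterwards.  The
example_fork_swap() wrapper below is hypothetical.

/* Hypothetical caller, for illustration only. */
static int
example_fork_swap(
	struct xfs_trans	**tpp,
	struct xfs_inode	*ip,
	struct xfs_inode	*tip)
{
	int			error;

	error = xfs_swap_extent_forks(tpp, ip, tip);	/* may replace *tpp */
	if (error) {
		xfs_trans_cancel(*tpp);		/* cancel the current tp */
		return error;
	}

	return xfs_trans_commit(*tpp);		/* commit the current tp */
}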



* [PATCH 13/21] xfs: port xfs_swap_extent_forks to use xfs_swapext_req
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (12 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 18/21] xfs: condense symbolic links after an atomic swap Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 17/21] xfs: condense directories after an atomic swap Darrick J. Wong
                     ` (6 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Port the old extent fork swapping function to take an xfs_swapext_req as
input, which aligns it with the new fiexchange interface.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |   21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index dbd95d86addb..9d6337a05544 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1406,9 +1406,10 @@ xfs_swap_change_owner(
 STATIC int
 xfs_swap_extent_forks(
 	struct xfs_trans	**tpp,
-	struct xfs_inode	*ip,
-	struct xfs_inode	*tip)
+	struct xfs_swapext_req	*req)
 {
+	struct xfs_inode	*ip = req->ip2;
+	struct xfs_inode	*tip = req->ip1;
 	xfs_filblks_t		aforkblks = 0;
 	xfs_filblks_t		taforkblks = 0;
 	xfs_extnum_t		junk;
@@ -1556,6 +1557,11 @@ xfs_swap_extents(
 	struct xfs_inode	*tip,	/* tmp inode */
 	struct xfs_swapext	*sxp)
 {
+	struct xfs_swapext_req	req = {
+		.ip1		= tip,
+		.ip2		= ip,
+		.whichfork	= XFS_DATA_FORK,
+	};
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	struct xfs_bstat	*sbp = &sxp->sx_stat;
@@ -1689,19 +1695,12 @@ xfs_swap_extents(
 	 * recovery is going to see the fork as owned by the swapped inode,
 	 * not the pre-swapped inodes.
 	 */
+	req.blockcount = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
 	if (xfs_has_rmapbt(mp)) {
-		struct xfs_swapext_req	req = {
-			.ip1		= tip,
-			.ip2		= ip,
-			.whichfork	= XFS_DATA_FORK,
-			.blockcount	= XFS_B_TO_FSB(ip->i_mount,
-						       i_size_read(VFS_I(ip))),
-		};
-
 		xfs_swapext(tp, &req);
 		error = xfs_defer_finish(&tp);
 	} else
-		error = xfs_swap_extent_forks(&tp, ip, tip);
+		error = xfs_swap_extent_forks(&tp, &req);
 	if (error) {
 		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;



* [PATCH 14/21] xfs: allow xfs_swap_range to use older extent swap algorithms
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (10 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 09/21] xfs: add a ->xchg_file_range handler Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 18/21] xfs: condense symbolic links after an atomic swap Darrick J. Wong
                     ` (8 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

If userspace permits non-atomic swap operations, use the older code
paths to implement the same functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    4 +-
 fs/xfs/xfs_bmap_util.h |    4 ++
 fs/xfs/xfs_xchgrange.c |   96 +++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 92 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 9d6337a05544..e8562c4de7eb 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1261,7 +1261,7 @@ xfs_insert_file_space(
  * reject and log the attempt. basically we are putting the responsibility on
  * userspace to get this right.
  */
-static int
+int
 xfs_swap_extents_check_format(
 	struct xfs_inode	*ip,	/* target inode */
 	struct xfs_inode	*tip)	/* tmp inode */
@@ -1403,7 +1403,7 @@ xfs_swap_change_owner(
 }
 
 /* Swap the extents of two files by swapping data forks. */
-STATIC int
+int
 xfs_swap_extent_forks(
 	struct xfs_trans	**tpp,
 	struct xfs_swapext_req	*req)
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 6888078f5c31..39c71da08403 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -69,6 +69,10 @@ int	xfs_free_eofblocks(struct xfs_inode *ip);
 int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
 			 struct xfs_swapext *sx);
 
+struct xfs_swapext_req;
+int xfs_swap_extent_forks(struct xfs_trans **tpp, struct xfs_swapext_req *req);
+int xfs_swap_extents_check_format(struct xfs_inode *ip, struct xfs_inode *tip);
+
 xfs_daddr_t xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb);
 
 xfs_extnum_t xfs_bmap_count_leaves(struct xfs_ifork *ifp, xfs_filblks_t *count);
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 9966938134c0..2b7aedc49923 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -297,6 +297,33 @@ xfs_xchg_range_rele_log_assist(
 	xlog_drop_incompat_feat(mp->m_log, XLOG_INCOMPAT_FEAT_SWAPEXT);
 }
 
+/* Decide if we can use the old data fork exchange code. */
+static inline bool
+xfs_xchg_use_forkswap(
+	const struct file_xchg_range	*fxr,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2)
+{
+	if (!(fxr->flags & FILE_XCHG_RANGE_NONATOMIC))
+		return false;
+	if (!(fxr->flags & FILE_XCHG_RANGE_FULL_FILES))
+		return false;
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		return false;
+	if (fxr->file1_offset != 0 || fxr->file2_offset != 0)
+		return false;
+	if (fxr->length != ip1->i_disk_size)
+		return false;
+	if (fxr->length != ip2->i_disk_size)
+		return false;
+	return true;
+}
+
+enum xchg_strategy {
+	SWAPEXT		= 1,	/* xfs_swapext() */
+	FORKSWAP	= 2,	/* exchange forks */
+};
+
 /* Exchange the contents of two files. */
 int
 xfs_xchg_range(
@@ -316,19 +343,13 @@ xfs_xchg_range(
 	};
 	struct xfs_trans		*tp;
 	unsigned int			qretry;
+	unsigned int			flags = 0;
 	bool				retried = false;
+	enum xchg_strategy		strategy;
 	int				error;
 
 	trace_xfs_xchg_range(ip1, fxr, ip2, xchg_flags);
 
-	/*
-	 * This function only supports using log intent items (SXI items if
-	 * atomic exchange is required, or BUI items if not) to exchange file
-	 * data.  The legacy whole-fork swap will be ported in a later patch.
-	 */
-	if (!(xchg_flags & XFS_XCHG_RANGE_LOGGED) && !xfs_swapext_supported(mp))
-		return -EOPNOTSUPP;
-
 	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
 		req.req_flags |= XFS_SWAP_REQ_SET_SIZES;
 	if (fxr->flags & FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
@@ -340,10 +361,25 @@ xfs_xchg_range(
 	if (error)
 		return error;
 
+	/*
+	 * We haven't decided which exchange strategy we want to use yet, but
+	 * here we must choose if we want freed blocks during the swap to be
+	 * added to the transaction block reservation (RES_FDBLKS) or freed
+	 * into the global fdblocks.  The legacy fork swap mechanism doesn't
+	 * free any blocks, so it doesn't require it.  It is also the only
+	 * option that works for older filesystems.
+	 *
+	 * The bmap log intent items that were added with rmap and reflink can
+	 * change the bmbt shape, so the intent-based swap strategies require
+	 * us to set RES_FDBLKS.
+	 */
+	if (xfs_has_lazysbcount(mp))
+		flags |= XFS_TRANS_RES_FDBLKS;
+
 retry:
 	/* Allocate the transaction, lock the inodes, and join them. */
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
-			XFS_TRANS_RES_FDBLKS, &tp);
+			flags, &tp);
 	if (error)
 		return error;
 
@@ -386,6 +422,40 @@ xfs_xchg_range(
 	if (error)
 		goto out_trans_cancel;
 
+	if ((xchg_flags & XFS_XCHG_RANGE_LOGGED) || xfs_swapext_supported(mp)) {
+		/*
+		 * xfs_swapext() uses deferred bmap log intent items to swap
+		 * extents between file forks.  If the atomic log swap feature
+		 * is enabled, it will also use swapext log intent items to
+		 * restart the operation in case of failure.
+		 *
+		 * This means that we can use it if we previously obtained
+		 * permission from the log to use log-assisted atomic extent
+		 * swapping; or if the fs supports rmap or reflink and the
+		 * user said NONATOMIC.
+		 */
+		strategy = SWAPEXT;
+	} else if (xfs_xchg_use_forkswap(fxr, ip1, ip2)) {
+		/*
+		 * Exchange the file contents by using the old bmap fork
+		 * exchange code, if we're a defrag tool doing a full file
+		 * swap.
+		 */
+		strategy = FORKSWAP;
+
+		error = xfs_swap_extents_check_format(ip2, ip1);
+		if (error) {
+			xfs_notice(mp,
+		"%s: inode 0x%llx format is incompatible for exchanging.",
+					__func__, ip2->i_ino);
+			goto out_trans_cancel;
+		}
+	} else {
+		/* We cannot exchange the file contents. */
+		error = -EOPNOTSUPP;
+		goto out_trans_cancel;
+	}
+
 	/* If we got this far on a dry run, all parameters are ok. */
 	if (fxr->flags & FILE_XCHG_RANGE_DRY_RUN)
 		goto out_trans_cancel;
@@ -398,7 +468,13 @@ xfs_xchg_range(
 		xfs_trans_ichgtime(tp, ip2,
 				XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
 
-	xfs_swapext(tp, &req);
+	if (strategy == SWAPEXT) {
+		xfs_swapext(tp, &req);
+	} else {
+		error = xfs_swap_extent_forks(&tp, &req);
+		if (error)
+			goto out_trans_cancel;
+	}
 
 	/*
 	 * Force the log to persist metadata updates if the caller or the
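
For context, a rough userspace sketch (not part of the patch) of the kind
of request that reaches the FORKSWAP fallback above: a whole-file,
non-atomic exchange, which is exactly what xfs_xchg_use_forkswap()
accepts.  The FIEXCHANGE_RANGE ioctl name, the uapi header path, and the
convention of issuing the ioctl on file2's descriptor are assumptions
based on the rest of this series; only flags and structure fields that
appear in these patches are used.

#include <sys/ioctl.h>
#include <err.h>
#include <linux/fiexchange.h>	/* assumed uapi header for this series */

/* Hypothetical defrag-style caller; file1_fd is the donor file. */
static void
swap_whole_files(int file1_fd, int file2_fd, unsigned long long file_size)
{
	struct file_xchg_range	fxr = {
		.file1_fd	= file1_fd,
		.file1_offset	= 0,
		.file2_offset	= 0,
		.length		= file_size,	/* must match both i_disk_size */
		.flags		= FILE_XCHG_RANGE_NONATOMIC |
				  FILE_XCHG_RANGE_FULL_FILES,
	};

	if (ioctl(file2_fd, FIEXCHANGE_RANGE, &fxr) < 0)
		err(1, "FIEXCHANGE_RANGE");
}

Adding FILE_XCHG_RANGE_DRY_RUN to the flags would exercise only the
parameter and format checks, since the dry-run case above cancels the
transaction before any extents are touched.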



* [PATCH 15/21] xfs: remove old swap extents implementation
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (16 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 12/21] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 20/21] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
                     ` (2 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Migrate the old XFS_IOC_SWAPEXT implementation to use our shiny new one.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  491 ------------------------------------------------
 fs/xfs/xfs_bmap_util.h |    7 -
 fs/xfs/xfs_ioctl.c     |  102 +++-------
 fs/xfs/xfs_ioctl.h     |    4 
 fs/xfs/xfs_ioctl32.c   |   11 -
 fs/xfs/xfs_xchgrange.c |  299 +++++++++++++++++++++++++++++
 6 files changed, 334 insertions(+), 580 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e8562c4de7eb..47a583a94d58 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1240,494 +1240,3 @@ xfs_insert_file_space(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
-
-/*
- * We need to check that the format of the data fork in the temporary inode is
- * valid for the target inode before doing the swap. This is not a problem with
- * attr1 because of the fixed fork offset, but attr2 has a dynamically sized
- * data fork depending on the space the attribute fork is taking so we can get
- * invalid formats on the target inode.
- *
- * E.g. target has space for 7 extents in extent format, temp inode only has
- * space for 6.  If we defragment down to 7 extents, then the tmp format is a
- * btree, but when swapped it needs to be in extent format. Hence we can't just
- * blindly swap data forks on attr2 filesystems.
- *
- * Note that we check the swap in both directions so that we don't end up with
- * a corrupt temporary inode, either.
- *
- * Note that fixing the way xfs_fsr sets up the attribute fork in the source
- * inode will prevent this situation from occurring, so all we do here is
- * reject and log the attempt. basically we are putting the responsibility on
- * userspace to get this right.
- */
-int
-xfs_swap_extents_check_format(
-	struct xfs_inode	*ip,	/* target inode */
-	struct xfs_inode	*tip)	/* tmp inode */
-{
-	struct xfs_ifork	*ifp = &ip->i_df;
-	struct xfs_ifork	*tifp = &tip->i_df;
-
-	/* User/group/project quota ids must match if quotas are enforced. */
-	if (XFS_IS_QUOTA_ON(ip->i_mount) &&
-	    (!uid_eq(VFS_I(ip)->i_uid, VFS_I(tip)->i_uid) ||
-	     !gid_eq(VFS_I(ip)->i_gid, VFS_I(tip)->i_gid) ||
-	     ip->i_projid != tip->i_projid))
-		return -EINVAL;
-
-	/* Should never get a local format */
-	if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
-	    tifp->if_format == XFS_DINODE_FMT_LOCAL)
-		return -EINVAL;
-
-	/*
-	 * if the target inode has less extents that then temporary inode then
-	 * why did userspace call us?
-	 */
-	if (ifp->if_nextents < tifp->if_nextents)
-		return -EINVAL;
-
-	/*
-	 * If we have to use the (expensive) rmap swap method, we can
-	 * handle any number of extents and any format.
-	 */
-	if (xfs_has_rmapbt(ip->i_mount))
-		return 0;
-
-	/*
-	 * if the target inode is in extent form and the temp inode is in btree
-	 * form then we will end up with the target inode in the wrong format
-	 * as we already know there are less extents in the temp inode.
-	 */
-	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
-	    tifp->if_format == XFS_DINODE_FMT_BTREE)
-		return -EINVAL;
-
-	/* Check temp in extent form to max in target */
-	if (tifp->if_format == XFS_DINODE_FMT_EXTENTS &&
-	    tifp->if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
-		return -EINVAL;
-
-	/* Check target in extent form to max in temp */
-	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
-	    ifp->if_nextents > XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
-		return -EINVAL;
-
-	/*
-	 * If we are in a btree format, check that the temp root block will fit
-	 * in the target and that it has enough extents to be in btree format
-	 * in the target.
-	 *
-	 * Note that we have to be careful to allow btree->extent conversions
-	 * (a common defrag case) which will occur when the temp inode is in
-	 * extent format...
-	 */
-	if (tifp->if_format == XFS_DINODE_FMT_BTREE) {
-		if (xfs_inode_has_attr_fork(ip) &&
-		    XFS_BMAP_BMDR_SPACE(tifp->if_broot) > xfs_inode_fork_boff(ip))
-			return -EINVAL;
-		if (tifp->if_nextents <= XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
-			return -EINVAL;
-	}
-
-	/* Reciprocal target->temp btree format checks */
-	if (ifp->if_format == XFS_DINODE_FMT_BTREE) {
-		if (xfs_inode_has_attr_fork(tip) &&
-		    XFS_BMAP_BMDR_SPACE(ip->i_df.if_broot) > xfs_inode_fork_boff(tip))
-			return -EINVAL;
-		if (ifp->if_nextents <= XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
-			return -EINVAL;
-	}
-
-	return 0;
-}
-
-static int
-xfs_swap_extent_flush(
-	struct xfs_inode	*ip)
-{
-	int	error;
-
-	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
-	if (error)
-		return error;
-	truncate_pagecache_range(VFS_I(ip), 0, -1);
-
-	/* Verify O_DIRECT for ftmp */
-	if (VFS_I(ip)->i_mapping->nrpages)
-		return -EINVAL;
-	return 0;
-}
-
-/*
- * Fix up the owners of the bmbt blocks to refer to the current inode. The
- * change owner scan attempts to order all modified buffers in the current
- * transaction. In the event of ordered buffer failure, the offending buffer is
- * physically logged as a fallback and the scan returns -EAGAIN. We must roll
- * the transaction in this case to replenish the fallback log reservation and
- * restart the scan. This process repeats until the scan completes.
- */
-static int
-xfs_swap_change_owner(
-	struct xfs_trans	**tpp,
-	struct xfs_inode	*ip,
-	struct xfs_inode	*tmpip)
-{
-	int			error;
-	struct xfs_trans	*tp = *tpp;
-
-	do {
-		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
-					      NULL);
-		/* success or fatal error */
-		if (error != -EAGAIN)
-			break;
-
-		error = xfs_trans_roll(tpp);
-		if (error)
-			break;
-		tp = *tpp;
-
-		/*
-		 * Redirty both inodes so they can relog and keep the log tail
-		 * moving forward.
-		 */
-		xfs_trans_ijoin(tp, ip, 0);
-		xfs_trans_ijoin(tp, tmpip, 0);
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
-	} while (true);
-
-	return error;
-}
-
-/* Swap the extents of two files by swapping data forks. */
-int
-xfs_swap_extent_forks(
-	struct xfs_trans	**tpp,
-	struct xfs_swapext_req	*req)
-{
-	struct xfs_inode	*ip = req->ip2;
-	struct xfs_inode	*tip = req->ip1;
-	xfs_filblks_t		aforkblks = 0;
-	xfs_filblks_t		taforkblks = 0;
-	xfs_extnum_t		junk;
-	uint64_t		tmp;
-	int			src_log_flags = XFS_ILOG_CORE;
-	int			target_log_flags = XFS_ILOG_CORE;
-	int			error;
-
-	/*
-	 * Count the number of extended attribute blocks
-	 */
-	if (xfs_inode_has_attr_fork(ip) && ip->i_af.if_nextents > 0 &&
-	    ip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
-				&aforkblks);
-		if (error)
-			return error;
-	}
-	if (xfs_inode_has_attr_fork(tip) && tip->i_af.if_nextents > 0 &&
-	    tip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
-				&taforkblks);
-		if (error)
-			return error;
-	}
-
-	/*
-	 * Btree format (v3) inodes have the inode number stamped in the bmbt
-	 * block headers. We can't start changing the bmbt blocks until the
-	 * inode owner change is logged so recovery does the right thing in the
-	 * event of a crash. Set the owner change log flags now and leave the
-	 * bmbt scan as the last step.
-	 */
-	if (xfs_has_v3inodes(ip->i_mount)) {
-		if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			target_log_flags |= XFS_ILOG_DOWNER;
-		if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			src_log_flags |= XFS_ILOG_DOWNER;
-	}
-
-	/*
-	 * Swap the data forks of the inodes
-	 */
-	swap(ip->i_df, tip->i_df);
-
-	/*
-	 * Fix the on-disk inode values
-	 */
-	tmp = (uint64_t)ip->i_nblocks;
-	ip->i_nblocks = tip->i_nblocks - taforkblks + aforkblks;
-	tip->i_nblocks = tmp + taforkblks - aforkblks;
-
-	/*
-	 * The extents in the source inode could still contain speculative
-	 * preallocation beyond EOF (e.g. the file is open but not modified
-	 * while defrag is in progress). In that case, we need to copy over the
-	 * number of delalloc blocks the data fork in the source inode is
-	 * tracking beyond EOF so that when the fork is truncated away when the
-	 * temporary inode is unlinked we don't underrun the i_delayed_blks
-	 * counter on that inode.
-	 */
-	ASSERT(tip->i_delayed_blks == 0);
-	tip->i_delayed_blks = ip->i_delayed_blks;
-	ip->i_delayed_blks = 0;
-
-	switch (ip->i_df.if_format) {
-	case XFS_DINODE_FMT_EXTENTS:
-		src_log_flags |= XFS_ILOG_DEXT;
-		break;
-	case XFS_DINODE_FMT_BTREE:
-		ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
-		       (src_log_flags & XFS_ILOG_DOWNER));
-		src_log_flags |= XFS_ILOG_DBROOT;
-		break;
-	}
-
-	switch (tip->i_df.if_format) {
-	case XFS_DINODE_FMT_EXTENTS:
-		target_log_flags |= XFS_ILOG_DEXT;
-		break;
-	case XFS_DINODE_FMT_BTREE:
-		target_log_flags |= XFS_ILOG_DBROOT;
-		ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
-		       (target_log_flags & XFS_ILOG_DOWNER));
-		break;
-	}
-
-	/* Do we have to swap reflink flags? */
-	if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
-	    (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
-		uint64_t	f;
-
-		f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
-		ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
-		ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
-		tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
-		tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
-	}
-
-	/* Swap the cow forks. */
-	if (xfs_has_reflink(ip->i_mount)) {
-		ASSERT(!ip->i_cowfp ||
-		       ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-		ASSERT(!tip->i_cowfp ||
-		       tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-
-		swap(ip->i_cowfp, tip->i_cowfp);
-
-		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(ip);
-		else
-			xfs_inode_clear_cowblocks_tag(ip);
-		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(tip);
-		else
-			xfs_inode_clear_cowblocks_tag(tip);
-	}
-
-	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
-	xfs_trans_log_inode(*tpp, tip, target_log_flags);
-
-	/*
-	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
-	 * have inode number owner values in the bmbt blocks that still refer to
-	 * the old inode. Scan each bmbt to fix up the owner values with the
-	 * inode number of the current inode.
-	 */
-	if (src_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(tpp, ip, tip);
-		if (error)
-			return error;
-	}
-	if (target_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(tpp, tip, ip);
-		if (error)
-			return error;
-	}
-
-	return 0;
-}
-
-int
-xfs_swap_extents(
-	struct xfs_inode	*ip,	/* target inode */
-	struct xfs_inode	*tip,	/* tmp inode */
-	struct xfs_swapext	*sxp)
-{
-	struct xfs_swapext_req	req = {
-		.ip1		= tip,
-		.ip2		= ip,
-		.whichfork	= XFS_DATA_FORK,
-	};
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
-	struct xfs_bstat	*sbp = &sxp->sx_stat;
-	int			error = 0;
-	int			resblks = 0;
-	unsigned int		flags = 0;
-
-	/*
-	 * Lock the inodes against other IO, page faults and truncate to
-	 * begin with.  Then we can ensure the inodes are flushed and have no
-	 * page cache safely. Once we have done this we can take the ilocks and
-	 * do the rest of the checks.
-	 */
-	lock_two_nondirectories(VFS_I(ip), VFS_I(tip));
-	filemap_invalidate_lock_two(VFS_I(ip)->i_mapping,
-				    VFS_I(tip)->i_mapping);
-
-	/* Verify that both files have the same format */
-	if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	/* Verify both files are either real-time or non-realtime */
-	if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	error = xfs_qm_dqattach(ip);
-	if (error)
-		goto out_unlock;
-
-	error = xfs_qm_dqattach(tip);
-	if (error)
-		goto out_unlock;
-
-	error = xfs_swap_extent_flush(ip);
-	if (error)
-		goto out_unlock;
-	error = xfs_swap_extent_flush(tip);
-	if (error)
-		goto out_unlock;
-
-	if (xfs_inode_has_cow_data(tip)) {
-		error = xfs_reflink_cancel_cow_range(tip, 0, NULLFILEOFF, true);
-		if (error)
-			goto out_unlock;
-	}
-
-	/*
-	 * Extent "swapping" with rmap requires a permanent reservation and
-	 * a block reservation because it's really just a remap operation
-	 * performed with log redo items!
-	 */
-	if (xfs_has_rmapbt(mp)) {
-		int		w = XFS_DATA_FORK;
-		uint32_t	ipnext = ip->i_df.if_nextents;
-		uint32_t	tipnext	= tip->i_df.if_nextents;
-
-		/*
-		 * Conceptually this shouldn't affect the shape of either bmbt,
-		 * but since we atomically move extents one by one, we reserve
-		 * enough space to rebuild both trees.
-		 */
-		resblks = XFS_SWAP_RMAP_SPACE_RES(mp, ipnext, w);
-		resblks +=  XFS_SWAP_RMAP_SPACE_RES(mp, tipnext, w);
-
-		/*
-		 * If either inode straddles a bmapbt block allocation boundary,
-		 * the rmapbt algorithm triggers repeated allocs and frees as
-		 * extents are remapped. This can exhaust the block reservation
-		 * prematurely and cause shutdown. Return freed blocks to the
-		 * transaction reservation to counter this behavior.
-		 */
-		flags |= XFS_TRANS_RES_FDBLKS;
-	}
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, flags,
-				&tp);
-	if (error)
-		goto out_unlock;
-
-	/*
-	 * Lock and join the inodes to the tansaction so that transaction commit
-	 * or cancel will unlock the inodes from this point onwards.
-	 */
-	xfs_lock_two_inodes(ip, XFS_ILOCK_EXCL, tip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
-	xfs_trans_ijoin(tp, tip, 0);
-
-
-	/* Verify all data are being swapped */
-	if (sxp->sx_offset != 0 ||
-	    sxp->sx_length != ip->i_disk_size ||
-	    sxp->sx_length != tip->i_disk_size) {
-		error = -EFAULT;
-		goto out_trans_cancel;
-	}
-
-	trace_xfs_swap_extent_before(ip, 0);
-	trace_xfs_swap_extent_before(tip, 1);
-
-	/* check inode formats now that data is flushed */
-	error = xfs_swap_extents_check_format(ip, tip);
-	if (error) {
-		xfs_notice(mp,
-		    "%s: inode 0x%llx format is incompatible for exchanging.",
-				__func__, ip->i_ino);
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * Compare the current change & modify times with that
-	 * passed in.  If they differ, we abort this swap.
-	 * This is the mechanism used to ensure the calling
-	 * process that the file was not changed out from
-	 * under it.
-	 */
-	if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
-	    (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
-	    (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
-	    (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
-		error = -EBUSY;
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * Note the trickiness in setting the log flags - we set the owner log
-	 * flag on the opposite inode (i.e. the inode we are setting the new
-	 * owner to be) because once we swap the forks and log that, log
-	 * recovery is going to see the fork as owned by the swapped inode,
-	 * not the pre-swapped inodes.
-	 */
-	req.blockcount = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
-	if (xfs_has_rmapbt(mp)) {
-		xfs_swapext(tp, &req);
-		error = xfs_defer_finish(&tp);
-	} else
-		error = xfs_swap_extent_forks(&tp, &req);
-	if (error) {
-		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * If this is a synchronous mount, make sure that the
-	 * transaction goes to disk before returning to the user.
-	 */
-	if (xfs_has_wsync(mp))
-		xfs_trans_set_sync(tp);
-
-	error = xfs_trans_commit(tp);
-
-	trace_xfs_swap_extent_after(ip, 0);
-	trace_xfs_swap_extent_after(tip, 1);
-
-out_unlock_ilock:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	xfs_iunlock(tip, XFS_ILOCK_EXCL);
-out_unlock:
-	filemap_invalidate_unlock_two(VFS_I(ip)->i_mapping,
-				      VFS_I(tip)->i_mapping);
-	unlock_two_nondirectories(VFS_I(ip), VFS_I(tip));
-	return error;
-
-out_trans_cancel:
-	xfs_trans_cancel(tp);
-	goto out_unlock_ilock;
-}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 39c71da08403..8eb7166aa9d4 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -66,13 +66,6 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
 int	xfs_free_eofblocks(struct xfs_inode *ip);
 
-int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
-			 struct xfs_swapext *sx);
-
-struct xfs_swapext_req;
-int xfs_swap_extent_forks(struct xfs_trans **tpp, struct xfs_swapext_req *req);
-int xfs_swap_extents_check_format(struct xfs_inode *ip, struct xfs_inode *tip);
-
 xfs_daddr_t xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb);
 
 xfs_extnum_t xfs_bmap_count_leaves(struct xfs_ifork *ifp, xfs_filblks_t *count);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4b2a02a08dfa..85c33142c5ab 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1655,81 +1655,43 @@ xfs_ioc_scrub_metadata(
 
 int
 xfs_ioc_swapext(
-	xfs_swapext_t	*sxp)
+	struct xfs_swapext	*sxp)
 {
-	xfs_inode_t     *ip, *tip;
-	struct fd	f, tmp;
-	int		error = 0;
+	struct file_xchg_range	fxr = { 0 };
+	struct fd		fd2, fd1;
+	int			error = 0;
 
-	/* Pull information for the target fd */
-	f = fdget((int)sxp->sx_fdtarget);
-	if (!f.file) {
-		error = -EINVAL;
-		goto out;
-	}
-
-	if (!(f.file->f_mode & FMODE_WRITE) ||
-	    !(f.file->f_mode & FMODE_READ) ||
-	    (f.file->f_flags & O_APPEND)) {
-		error = -EBADF;
-		goto out_put_file;
-	}
+	fd2 = fdget((int)sxp->sx_fdtarget);
+	if (!fd2.file)
+		return -EINVAL;
 
-	tmp = fdget((int)sxp->sx_fdtmp);
-	if (!tmp.file) {
+	fd1 = fdget((int)sxp->sx_fdtmp);
+	if (!fd1.file) {
 		error = -EINVAL;
-		goto out_put_file;
+		goto dest_fdput;
 	}
 
-	if (!(tmp.file->f_mode & FMODE_WRITE) ||
-	    !(tmp.file->f_mode & FMODE_READ) ||
-	    (tmp.file->f_flags & O_APPEND)) {
-		error = -EBADF;
-		goto out_put_tmp_file;
-	}
+	fxr.file1_fd = sxp->sx_fdtmp;
+	fxr.length = sxp->sx_length;
+	fxr.flags = FILE_XCHG_RANGE_NONATOMIC | FILE_XCHG_RANGE_FILE2_FRESH |
+		    FILE_XCHG_RANGE_FULL_FILES;
+	fxr.file2_ino = sxp->sx_stat.bs_ino;
+	fxr.file2_mtime = sxp->sx_stat.bs_mtime.tv_sec;
+	fxr.file2_ctime = sxp->sx_stat.bs_ctime.tv_sec;
+	fxr.file2_mtime_nsec = sxp->sx_stat.bs_mtime.tv_nsec;
+	fxr.file2_ctime_nsec = sxp->sx_stat.bs_ctime.tv_nsec;
 
-	if (IS_SWAPFILE(file_inode(f.file)) ||
-	    IS_SWAPFILE(file_inode(tmp.file))) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
+	error = vfs_xchg_file_range(fd1.file, fd2.file, &fxr);
 
 	/*
-	 * We need to ensure that the fds passed in point to XFS inodes
-	 * before we cast and access them as XFS structures as we have no
-	 * control over what the user passes us here.
+	 * The old implementation returned EFAULT if the swap range was not
+	 * the entirety of both files.
 	 */
-	if (f.file->f_op != &xfs_file_operations ||
-	    tmp.file->f_op != &xfs_file_operations) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	ip = XFS_I(file_inode(f.file));
-	tip = XFS_I(file_inode(tmp.file));
-
-	if (ip->i_mount != tip->i_mount) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	if (ip->i_ino == tip->i_ino) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	if (xfs_is_shutdown(ip->i_mount)) {
-		error = -EIO;
-		goto out_put_tmp_file;
-	}
-
-	error = xfs_swap_extents(ip, tip, sxp);
-
- out_put_tmp_file:
-	fdput(tmp);
- out_put_file:
-	fdput(f);
- out:
+	if (error == -EDOM)
+		error = -EFAULT;
+	fdput(fd1);
+dest_fdput:
+	fdput(fd2);
 	return error;
 }
 
@@ -1988,14 +1950,10 @@ xfs_file_ioctl(
 	case XFS_IOC_SWAPEXT: {
 		struct xfs_swapext	sxp;
 
-		if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
+		if (copy_from_user(&sxp, arg, sizeof(struct xfs_swapext)))
 			return -EFAULT;
-		error = mnt_want_write_file(filp);
-		if (error)
-			return error;
-		error = xfs_ioc_swapext(&sxp);
-		mnt_drop_write_file(filp);
-		return error;
+
+		return xfs_ioc_swapext(&sxp);
 	}
 
 	case XFS_IOC_FSCOUNTS: {
diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h
index d4abba2c13c1..e3f72d816e0e 100644
--- a/fs/xfs/xfs_ioctl.h
+++ b/fs/xfs/xfs_ioctl.h
@@ -10,9 +10,7 @@ struct xfs_bstat;
 struct xfs_ibulk;
 struct xfs_inogrp;
 
-int
-xfs_ioc_swapext(
-	xfs_swapext_t	*sxp);
+int xfs_ioc_swapext(struct xfs_swapext *sxp);
 
 extern int
 xfs_find_handle(
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 2f54b701eead..885d6e58d7ec 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -425,7 +425,6 @@ xfs_file_compat_ioctl(
 	struct inode		*inode = file_inode(filp);
 	struct xfs_inode	*ip = XFS_I(inode);
 	void			__user *arg = compat_ptr(p);
-	int			error;
 
 	trace_xfs_file_compat_ioctl(ip);
 
@@ -435,6 +434,7 @@ xfs_file_compat_ioctl(
 		return xfs_compat_ioc_fsgeometry_v1(ip->i_mount, arg);
 	case XFS_IOC_FSGROWFSDATA_32: {
 		struct xfs_growfs_data	in;
+		int			error;
 
 		if (xfs_compat_growfs_data_copyin(&in, arg))
 			return -EFAULT;
@@ -447,6 +447,7 @@ xfs_file_compat_ioctl(
 	}
 	case XFS_IOC_FSGROWFSRT_32: {
 		struct xfs_growfs_rt	in;
+		int			error;
 
 		if (xfs_compat_growfs_rt_copyin(&in, arg))
 			return -EFAULT;
@@ -471,12 +472,8 @@ xfs_file_compat_ioctl(
 				   offsetof(struct xfs_swapext, sx_stat)) ||
 		    xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
 			return -EFAULT;
-		error = mnt_want_write_file(filp);
-		if (error)
-			return error;
-		error = xfs_ioc_swapext(&sxp);
-		mnt_drop_write_file(filp);
-		return error;
+
+		return xfs_ioc_swapext(&sxp);
 	}
 	case XFS_IOC_FSBULKSTAT_32:
 	case XFS_IOC_FSBULKSTAT_SINGLE_32:
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 2b7aedc49923..27bb88dcf228 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -2,6 +2,11 @@
 /*
  * Copyright (C) 2022 Oracle.  All Rights Reserved.
  * Author: Darrick J. Wong <djwong@kernel.org>
+ *
+ * The xfs_swap_extent_* functions are:
+ * Copyright (c) 2000-2006 Silicon Graphics, Inc.
+ * Copyright (c) 2012 Red Hat, Inc.
+ * All Rights Reserved.
  */
 #include "xfs.h"
 #include "xfs_fs.h"
@@ -15,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_quota.h"
 #include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_trace.h"
 #include "xfs_swapext.h"
@@ -71,6 +77,299 @@ xfs_xchg_range_estimate(
 	return error;
 }
 
+/*
+ * We need to check that the format of the data fork in the temporary inode is
+ * valid for the target inode before doing the swap. This is not a problem with
+ * attr1 because of the fixed fork offset, but attr2 has a dynamically sized
+ * data fork depending on the space the attribute fork is taking so we can get
+ * invalid formats on the target inode.
+ *
+ * E.g. target has space for 7 extents in extent format, temp inode only has
+ * space for 6.  If we defragment down to 7 extents, then the tmp format is a
+ * btree, but when swapped it needs to be in extent format. Hence we can't just
+ * blindly swap data forks on attr2 filesystems.
+ *
+ * Note that we check the swap in both directions so that we don't end up with
+ * a corrupt temporary inode, either.
+ *
+ * Note that fixing the way xfs_fsr sets up the attribute fork in the source
+ * inode will prevent this situation from occurring, so all we do here is
+ * reject and log the attempt. basically we are putting the responsibility on
+ * userspace to get this right.
+ */
+STATIC int
+xfs_swap_extents_check_format(
+	struct xfs_inode	*ip,	/* target inode */
+	struct xfs_inode	*tip)	/* tmp inode */
+{
+	struct xfs_ifork	*ifp = &ip->i_df;
+	struct xfs_ifork	*tifp = &tip->i_df;
+
+	/* User/group/project quota ids must match if quotas are enforced. */
+	if (XFS_IS_QUOTA_ON(ip->i_mount) &&
+	    (!uid_eq(VFS_I(ip)->i_uid, VFS_I(tip)->i_uid) ||
+	     !gid_eq(VFS_I(ip)->i_gid, VFS_I(tip)->i_gid) ||
+	     ip->i_projid != tip->i_projid))
+		return -EINVAL;
+
+	/* Should never get a local format */
+	if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
+	    tifp->if_format == XFS_DINODE_FMT_LOCAL)
+		return -EINVAL;
+
+	/*
+	 * if the target inode has less extents that then temporary inode then
+	 * why did userspace call us?
+	 */
+	if (ifp->if_nextents < tifp->if_nextents)
+		return -EINVAL;
+
+	/*
+	 * If we have to use the (expensive) rmap swap method, we can
+	 * handle any number of extents and any format.
+	 */
+	if (xfs_has_rmapbt(ip->i_mount))
+		return 0;
+
+	/*
+	 * if the target inode is in extent form and the temp inode is in btree
+	 * form then we will end up with the target inode in the wrong format
+	 * as we already know there are less extents in the temp inode.
+	 */
+	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+	    tifp->if_format == XFS_DINODE_FMT_BTREE)
+		return -EINVAL;
+
+	/* Check temp in extent form to max in target */
+	if (tifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+	    tifp->if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
+		return -EINVAL;
+
+	/* Check target in extent form to max in temp */
+	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+	    ifp->if_nextents > XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
+		return -EINVAL;
+
+	/*
+	 * If we are in a btree format, check that the temp root block will fit
+	 * in the target and that it has enough extents to be in btree format
+	 * in the target.
+	 *
+	 * Note that we have to be careful to allow btree->extent conversions
+	 * (a common defrag case) which will occur when the temp inode is in
+	 * extent format...
+	 */
+	if (tifp->if_format == XFS_DINODE_FMT_BTREE) {
+		if (xfs_inode_has_attr_fork(ip) &&
+		    XFS_BMAP_BMDR_SPACE(tifp->if_broot) > xfs_inode_fork_boff(ip))
+			return -EINVAL;
+		if (tifp->if_nextents <= XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
+			return -EINVAL;
+	}
+
+	/* Reciprocal target->temp btree format checks */
+	if (ifp->if_format == XFS_DINODE_FMT_BTREE) {
+		if (xfs_inode_has_attr_fork(tip) &&
+		    XFS_BMAP_BMDR_SPACE(ip->i_df.if_broot) > xfs_inode_fork_boff(tip))
+			return -EINVAL;
+		if (ifp->if_nextents <= XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Fix up the owners of the bmbt blocks to refer to the current inode. The
+ * change owner scan attempts to order all modified buffers in the current
+ * transaction. In the event of ordered buffer failure, the offending buffer is
+ * physically logged as a fallback and the scan returns -EAGAIN. We must roll
+ * the transaction in this case to replenish the fallback log reservation and
+ * restart the scan. This process repeats until the scan completes.
+ */
+static int
+xfs_swap_change_owner(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_inode	*tmpip)
+{
+	int			error;
+	struct xfs_trans	*tp = *tpp;
+
+	do {
+		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
+					      NULL);
+		/* success or fatal error */
+		if (error != -EAGAIN)
+			break;
+
+		error = xfs_trans_roll(tpp);
+		if (error)
+			break;
+		tp = *tpp;
+
+		/*
+		 * Redirty both inodes so they can relog and keep the log tail
+		 * moving forward.
+		 */
+		xfs_trans_ijoin(tp, ip, 0);
+		xfs_trans_ijoin(tp, tmpip, 0);
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
+	} while (true);
+
+	return error;
+}
+
+/* Swap the extents of two files by swapping data forks. */
+STATIC int
+xfs_swap_extent_forks(
+	struct xfs_trans	**tpp,
+	struct xfs_swapext_req	*req)
+{
+	struct xfs_inode	*ip = req->ip2;
+	struct xfs_inode	*tip = req->ip1;
+	xfs_filblks_t		aforkblks = 0;
+	xfs_filblks_t		taforkblks = 0;
+	xfs_extnum_t		junk;
+	uint64_t		tmp;
+	int			src_log_flags = XFS_ILOG_CORE;
+	int			target_log_flags = XFS_ILOG_CORE;
+	int			error;
+
+	/*
+	 * Count the number of extended attribute blocks
+	 */
+	if (xfs_inode_has_attr_fork(ip) && ip->i_af.if_nextents > 0 &&
+	    ip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
+		error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
+				&aforkblks);
+		if (error)
+			return error;
+	}
+	if (xfs_inode_has_attr_fork(tip) && tip->i_af.if_nextents > 0 &&
+	    tip->i_af.if_format != XFS_DINODE_FMT_LOCAL) {
+		error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
+				&taforkblks);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Btree format (v3) inodes have the inode number stamped in the bmbt
+	 * block headers. We can't start changing the bmbt blocks until the
+	 * inode owner change is logged so recovery does the right thing in the
+	 * event of a crash. Set the owner change log flags now and leave the
+	 * bmbt scan as the last step.
+	 */
+	if (xfs_has_v3inodes(ip->i_mount)) {
+		if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
+			target_log_flags |= XFS_ILOG_DOWNER;
+		if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
+			src_log_flags |= XFS_ILOG_DOWNER;
+	}
+
+	/*
+	 * Swap the data forks of the inodes
+	 */
+	swap(ip->i_df, tip->i_df);
+
+	/*
+	 * Fix the on-disk inode values
+	 */
+	tmp = (uint64_t)ip->i_nblocks;
+	ip->i_nblocks = tip->i_nblocks - taforkblks + aforkblks;
+	tip->i_nblocks = tmp + taforkblks - aforkblks;
+
+	/*
+	 * The extents in the source inode could still contain speculative
+	 * preallocation beyond EOF (e.g. the file is open but not modified
+	 * while defrag is in progress). In that case, we need to copy over the
+	 * number of delalloc blocks the data fork in the source inode is
+	 * tracking beyond EOF so that when the fork is truncated away when the
+	 * temporary inode is unlinked we don't underrun the i_delayed_blks
+	 * counter on that inode.
+	 */
+	ASSERT(tip->i_delayed_blks == 0);
+	tip->i_delayed_blks = ip->i_delayed_blks;
+	ip->i_delayed_blks = 0;
+
+	switch (ip->i_df.if_format) {
+	case XFS_DINODE_FMT_EXTENTS:
+		src_log_flags |= XFS_ILOG_DEXT;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
+		       (src_log_flags & XFS_ILOG_DOWNER));
+		src_log_flags |= XFS_ILOG_DBROOT;
+		break;
+	}
+
+	switch (tip->i_df.if_format) {
+	case XFS_DINODE_FMT_EXTENTS:
+		target_log_flags |= XFS_ILOG_DEXT;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		target_log_flags |= XFS_ILOG_DBROOT;
+		ASSERT(!xfs_has_v3inodes(ip->i_mount) ||
+		       (target_log_flags & XFS_ILOG_DOWNER));
+		break;
+	}
+
+	/* Do we have to swap reflink flags? */
+	if ((ip->i_diflags2 & XFS_DIFLAG2_REFLINK) ^
+	    (tip->i_diflags2 & XFS_DIFLAG2_REFLINK)) {
+		uint64_t	f;
+
+		f = ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+		ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+		ip->i_diflags2 |= tip->i_diflags2 & XFS_DIFLAG2_REFLINK;
+		tip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+		tip->i_diflags2 |= f & XFS_DIFLAG2_REFLINK;
+	}
+
+	/* Swap the cow forks. */
+	if (xfs_has_reflink(ip->i_mount)) {
+		ASSERT(!ip->i_cowfp ||
+		       ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+		ASSERT(!tip->i_cowfp ||
+		       tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+
+		swap(ip->i_cowfp, tip->i_cowfp);
+
+		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(ip);
+		else
+			xfs_inode_clear_cowblocks_tag(ip);
+		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(tip);
+		else
+			xfs_inode_clear_cowblocks_tag(tip);
+	}
+
+	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
+	xfs_trans_log_inode(*tpp, tip, target_log_flags);
+
+	/*
+	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
+	 * have inode number owner values in the bmbt blocks that still refer to
+	 * the old inode. Scan each bmbt to fix up the owner values with the
+	 * inode number of the current inode.
+	 */
+	if (src_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, ip, tip);
+		if (error)
+			return error;
+	}
+	if (target_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, tip, ip);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
 /* Prepare two files to have their data exchanged. */
 int
 xfs_xchg_range_prep(



* [PATCH 16/21] xfs: condense extended attributes after an atomic swap
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (14 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 17/21] xfs: condense directories after an atomic swap Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 12/21] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
                     ` (4 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add a new swapext flag that enables us to perform post-swap processing
on file2 once we're done swapping the extent maps.  If we were swapping
the extended attributes, we want to be able to convert file2's attr fork
from block to inline format.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and swap the attr forks when ready.
If one file is in extents format and the other is inline, we will have to
promote both to extents format to perform the swap.  After the swap, we
can try to condense the fixed file's attr fork back down to inline
format if possible.
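
As a rough sketch (not part of this patch), a future xattr repair might
fill out its swap request along these lines.  The structure, fields, and
flag names come from this series; the xrep_* helper, the blockcount
parameter, and the use of sc->tempip are invented here purely for
illustration:

/*
 * Hypothetical sketch: exchange the attr forks and ask the swapext code
 * to condense file2's attr fork back to shortform afterwards.  Only the
 * request structure and its flags are real.
 */
STATIC int
xrep_xattr_swap_sketch(
	struct xfs_scrub	*sc,
	xfs_filblks_t		blockcount)
{
	struct xfs_swapext_req	req = {
		.ip1		= sc->tempip,	/* holds salvaged attrs */
		.ip2		= sc->ip,	/* file being repaired */
		.whichfork	= XFS_ATTR_FORK,
		.startoff1	= 0,
		.startoff2	= 0,
		.blockcount	= blockcount,
		.req_flags	= XFS_SWAP_REQ_CVT_INO2_SF,
	};

	/*
	 * The request would then be handed to the swapext machinery; here
	 * we only validate it.  (Assumes the check helper takes the mount
	 * and the request, as in this series.)
	 */
	return xfs_swapext_check_extents(sc->mp, &req);
}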

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_log_format.h |    9 +++++--
 fs/xfs/libxfs/xfs_swapext.c    |   51 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_swapext.h    |    9 +++++--
 3 files changed, 64 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 65a84fdefe56..378201a70028 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -906,18 +906,23 @@ struct xfs_swap_extent {
 /* Clear the reflink flag from inode2 after the operation. */
 #define XFS_SWAP_EXT_CLEAR_INO2_REFLINK	(1ULL << 4)
 
+/* Try to convert inode2 from block to short format at the end, if possible. */
+#define XFS_SWAP_EXT_CVT_INO2_SF	(1ULL << 5)
+
 #define XFS_SWAP_EXT_FLAGS		(XFS_SWAP_EXT_ATTR_FORK | \
 					 XFS_SWAP_EXT_SET_SIZES | \
 					 XFS_SWAP_EXT_SKIP_INO1_HOLES | \
 					 XFS_SWAP_EXT_CLEAR_INO1_REFLINK | \
-					 XFS_SWAP_EXT_CLEAR_INO2_REFLINK)
+					 XFS_SWAP_EXT_CLEAR_INO2_REFLINK | \
+					 XFS_SWAP_EXT_CVT_INO2_SF)
 
 #define XFS_SWAP_EXT_STRINGS \
 	{ XFS_SWAP_EXT_ATTR_FORK,		"ATTRFORK" }, \
 	{ XFS_SWAP_EXT_SET_SIZES,		"SETSIZES" }, \
 	{ XFS_SWAP_EXT_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
 	{ XFS_SWAP_EXT_CLEAR_INO1_REFLINK,	"CLEAR_INO1_REFLINK" }, \
-	{ XFS_SWAP_EXT_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }
+	{ XFS_SWAP_EXT_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }, \
+	{ XFS_SWAP_EXT_CVT_INO2_SF,		"CVT_INO2_SF" }
 
 /* This is the structure used to lay out an sxi log item in the log. */
 struct xfs_sxi_log_format {
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 227a08ac5d4b..6b5223e73692 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -23,6 +23,10 @@
 #include "xfs_error.h"
 #include "xfs_errortag.h"
 #include "xfs_health.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr.h"
 
 struct kmem_cache	*xfs_swapext_intent_cache;
 
@@ -121,7 +125,8 @@ static inline bool
 sxi_has_postop_work(const struct xfs_swapext_intent *sxi)
 {
 	return sxi->sxi_flags & (XFS_SWAP_EXT_CLEAR_INO1_REFLINK |
-				 XFS_SWAP_EXT_CLEAR_INO2_REFLINK);
+				 XFS_SWAP_EXT_CLEAR_INO2_REFLINK |
+				 XFS_SWAP_EXT_CVT_INO2_SF);
 }
 
 static inline void
@@ -350,6 +355,36 @@ xfs_swapext_exchange_mappings(
 	sxi_advance(sxi, irec1);
 }
 
+/* Convert inode2's leaf attr fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_attr_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_da_args	args = {
+		.dp		= sxi->sxi_ip2,
+		.geo		= tp->t_mountp->m_attr_geo,
+		.whichfork	= XFS_ATTR_FORK,
+		.trans		= tp,
+	};
+	struct xfs_buf		*bp;
+	int			forkoff;
+	int			error;
+
+	if (!xfs_attr_is_leaf(sxi->sxi_ip2))
+		return 0;
+
+	error = xfs_attr3_leaf_read(tp, sxi->sxi_ip2, 0, &bp);
+	if (error)
+		return error;
+
+	forkoff = xfs_attr_shortform_allfit(bp, sxi->sxi_ip2);
+	if (forkoff == 0)
+		return 0;
+
+	return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
+}
+
 static inline void
 xfs_swapext_clear_reflink(
 	struct xfs_trans	*tp,
@@ -367,6 +402,16 @@ xfs_swapext_do_postop_work(
 	struct xfs_trans		*tp,
 	struct xfs_swapext_intent	*sxi)
 {
+	if (sxi->sxi_flags & XFS_SWAP_EXT_CVT_INO2_SF) {
+		int			error = 0;
+
+		if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
+			error = xfs_swapext_attr_to_sf(tp, sxi);
+		sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
+		if (error)
+			return error;
+	}
+
 	if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO1_REFLINK) {
 		xfs_swapext_clear_reflink(tp, sxi->sxi_ip1);
 		sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
@@ -794,6 +839,8 @@ xfs_swapext_init_intent(
 
 	if (req->req_flags & XFS_SWAP_REQ_SKIP_INO1_HOLES)
 		sxi->sxi_flags |= XFS_SWAP_EXT_SKIP_INO1_HOLES;
+	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
+		sxi->sxi_flags |= XFS_SWAP_EXT_CVT_INO2_SF;
 
 	if (req->req_flags & XFS_SWAP_REQ_LOGGED)
 		sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_LOGGED;
@@ -1013,6 +1060,8 @@ xfs_swapext(
 	ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
 	if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
 		ASSERT(req->whichfork == XFS_DATA_FORK);
+	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
+		ASSERT(req->whichfork == XFS_ATTR_FORK);
 
 	if (req->blockcount == 0)
 		return;
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index 1987897ddc25..6b610fea150a 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -126,16 +126,21 @@ struct xfs_swapext_req {
 /* Files need to be upgraded to have large extent counts. */
 #define XFS_SWAP_REQ_NREXT64		(1U << 3)
 
+/* Try to convert inode2's fork to local format, if possible. */
+#define XFS_SWAP_REQ_CVT_INO2_SF	(1U << 4)
+
 #define XFS_SWAP_REQ_FLAGS		(XFS_SWAP_REQ_LOGGED | \
 					 XFS_SWAP_REQ_SET_SIZES | \
 					 XFS_SWAP_REQ_SKIP_INO1_HOLES | \
-					 XFS_SWAP_REQ_NREXT64)
+					 XFS_SWAP_REQ_NREXT64 | \
+					 XFS_SWAP_REQ_CVT_INO2_SF)
 
 #define XFS_SWAP_REQ_STRINGS \
 	{ XFS_SWAP_REQ_LOGGED,			"LOGGED" }, \
 	{ XFS_SWAP_REQ_SET_SIZES,		"SETSIZES" }, \
 	{ XFS_SWAP_REQ_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
-	{ XFS_SWAP_REQ_NREXT64,			"NREXT64" }
+	{ XFS_SWAP_REQ_NREXT64,			"NREXT64" }, \
+	{ XFS_SWAP_REQ_CVT_INO2_SF,		"CVT_INO2_SF" }
 
 unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
 void xfs_swapext_reflink_finish(struct xfs_trans *tp,



* [PATCH 17/21] xfs: condense directories after an atomic swap
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (13 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 13/21] xfs: port xfs_swap_extent_forks to use xfs_swapext_req Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 16/21] xfs: condense extended attributes " Darrick J. Wong
                     ` (5 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

The previous commit added a new swapext flag that enables us to perform
post-swap processing on file2 once we're done swapping the extent maps.
Now add this ability for directories.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and swap the data forks
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the swap.
After the swap, we can try to condense the fixed directory down to
inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_swapext.c |   44 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 6b5223e73692..a52f72a499f4 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -27,6 +27,8 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr_leaf.h"
 #include "xfs_attr.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_dir2.h"
 
 struct kmem_cache	*xfs_swapext_intent_cache;
 
@@ -385,6 +387,42 @@ xfs_swapext_attr_to_sf(
 	return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
 }
 
+/* Convert inode2's block dir fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_dir_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_da_args	args = {
+		.dp		= sxi->sxi_ip2,
+		.geo		= tp->t_mountp->m_dir_geo,
+		.whichfork	= XFS_DATA_FORK,
+		.trans		= tp,
+	};
+	struct xfs_dir2_sf_hdr	sfh;
+	struct xfs_buf		*bp;
+	bool			isblock;
+	int			size;
+	int			error;
+
+	error = xfs_dir2_isblock(&args, &isblock);
+	if (error)
+		return error;
+
+	if (!isblock)
+		return 0;
+
+	error = xfs_dir3_block_read(tp, sxi->sxi_ip2, &bp);
+	if (error)
+		return error;
+
+	size = xfs_dir2_block_sfsize(sxi->sxi_ip2, bp->b_addr, &sfh);
+	if (size > xfs_inode_data_fork_size(sxi->sxi_ip2))
+		return 0;
+
+	return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
+}
+
 static inline void
 xfs_swapext_clear_reflink(
 	struct xfs_trans	*tp,
@@ -407,6 +445,8 @@ xfs_swapext_do_postop_work(
 
 		if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
 			error = xfs_swapext_attr_to_sf(tp, sxi);
+		else if (S_ISDIR(VFS_I(sxi->sxi_ip2)->i_mode))
+			error = xfs_swapext_dir_to_sf(tp, sxi);
 		sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
 		if (error)
 			return error;
@@ -1061,7 +1101,9 @@ xfs_swapext(
 	if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
 		ASSERT(req->whichfork == XFS_DATA_FORK);
 	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
-		ASSERT(req->whichfork == XFS_ATTR_FORK);
+		ASSERT(req->whichfork == XFS_ATTR_FORK ||
+		       (req->whichfork == XFS_DATA_FORK &&
+			S_ISDIR(VFS_I(req->ip2)->i_mode)));
 
 	if (req->blockcount == 0)
 		return;



* [PATCH 18/21] xfs: condense symbolic links after an atomic swap
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (11 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 14/21] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 13/21] xfs: port xfs_swap_extent_forks to use xfs_swapext_req Darrick J. Wong
                     ` (7 subsequent siblings)
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

The previous commit added a new swapext flag that enables us to perform
post-swap processing on file2 once we're done swapping the extent maps.
Now add this ability for symlinks.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online symlink repair feature can
salvage the remote target in a temporary link and swap the data forks
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the swap.
After the swap, we can try to condense the fixed symlink down to inline
format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_swapext.c        |   48 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_symlink_remote.c |   47 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_symlink_remote.h |    1 +
 fs/xfs/xfs_symlink.c               |   49 ++++--------------------------------
 4 files changed, 101 insertions(+), 44 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index a52f72a499f4..b27ceeb93a16 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -29,6 +29,7 @@
 #include "xfs_attr.h"
 #include "xfs_dir2_priv.h"
 #include "xfs_dir2.h"
+#include "xfs_symlink_remote.h"
 
 struct kmem_cache	*xfs_swapext_intent_cache;
 
@@ -423,6 +424,48 @@ xfs_swapext_dir_to_sf(
 	return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
 }
 
+/* Convert inode2's remote symlink target back to shortform, if possible. */
+STATIC int
+xfs_swapext_link_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_inode		*ip = sxi->sxi_ip2;
+	struct xfs_ifork		*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+	char				*buf;
+	int				error;
+
+	if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
+	    ip->i_disk_size > xfs_inode_data_fork_size(ip))
+		return 0;
+
+	/* Read the current symlink target into a buffer. */
+	buf = kmem_alloc(ip->i_disk_size + 1, KM_NOFS);
+	if (!buf) {
+		ASSERT(0);
+		return -ENOMEM;
+	}
+
+	error = xfs_symlink_remote_read(ip, buf);
+	if (error)
+		goto free;
+
+	/* Remove the blocks. */
+	error = xfs_symlink_remote_truncate(tp, ip);
+	if (error)
+		goto free;
+
+	/* Convert fork to local format and log our changes. */
+	xfs_idestroy_fork(ifp);
+	ifp->if_bytes = 0;
+	ifp->if_format = XFS_DINODE_FMT_LOCAL;
+	xfs_init_local_fork(ip, XFS_DATA_FORK, buf, ip->i_disk_size);
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+free:
+	kmem_free(buf);
+	return error;
+}
+
 static inline void
 xfs_swapext_clear_reflink(
 	struct xfs_trans	*tp,
@@ -447,6 +490,8 @@ xfs_swapext_do_postop_work(
 			error = xfs_swapext_attr_to_sf(tp, sxi);
 		else if (S_ISDIR(VFS_I(sxi->sxi_ip2)->i_mode))
 			error = xfs_swapext_dir_to_sf(tp, sxi);
+		else if (S_ISLNK(VFS_I(sxi->sxi_ip2)->i_mode))
+			error = xfs_swapext_link_to_sf(tp, sxi);
 		sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
 		if (error)
 			return error;
@@ -1103,7 +1148,8 @@ xfs_swapext(
 	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
 		ASSERT(req->whichfork == XFS_ATTR_FORK ||
 		       (req->whichfork == XFS_DATA_FORK &&
-			S_ISDIR(VFS_I(req->ip2)->i_mode)));
+			(S_ISDIR(VFS_I(req->ip2)->i_mode) ||
+			 S_ISLNK(VFS_I(req->ip2)->i_mode))));
 
 	if (req->blockcount == 0)
 		return;
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index 5261f15ea2ed..b48dcb893a2a 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -391,3 +391,50 @@ xfs_symlink_write_target(
 	ASSERT(pathlen == 0);
 	return 0;
 }
+
+/* Remove all the blocks from a symlink and invalidate buffers. */
+int
+xfs_symlink_remote_truncate(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*bp;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			done = 0;
+	int			i;
+	int			error;
+
+	/* Read mappings and invalidate buffers. */
+	error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
+	if (error)
+		return error;
+
+	for (i = 0; i < nmaps; i++) {
+		if (!xfs_bmap_is_real_extent(&mval[i]))
+			break;
+
+		error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
+				XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
+				XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
+				&bp);
+		if (error)
+			return error;
+
+		xfs_trans_binval(tp, bp);
+	}
+
+	/* Unmap the remote blocks. */
+	error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
+	if (error)
+		return error;
+	if (!done) {
+		ASSERT(done);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+		return -EFSCORRUPTED;
+	}
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h
index d81461c06b6b..05eb9c3937d9 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.h
+++ b/fs/xfs/libxfs/xfs_symlink_remote.h
@@ -23,5 +23,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
 int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
 		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
 		uint resblks);
+int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 548d9116e0c5..8cf69ca4bd7c 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -250,19 +250,12 @@ xfs_symlink(
  */
 STATIC int
 xfs_inactive_symlink_rmt(
-	struct xfs_inode *ip)
+	struct xfs_inode	*ip)
 {
-	struct xfs_buf	*bp;
-	int		done;
-	int		error;
-	int		i;
-	xfs_mount_t	*mp;
-	xfs_bmbt_irec_t	mval[XFS_SYMLINK_MAPS];
-	int		nmaps;
-	int		size;
-	xfs_trans_t	*tp;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	int			error;
 
-	mp = ip->i_mount;
 	ASSERT(!xfs_need_iread_extents(&ip->i_df));
 	/*
 	 * We're freeing a symlink that has some
@@ -286,44 +279,14 @@ xfs_inactive_symlink_rmt(
 	 * locked for the second transaction.  In the error paths we need it
 	 * held so the cancel won't rele it, see below.
 	 */
-	size = (int)ip->i_disk_size;
 	ip->i_disk_size = 0;
 	VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-	/*
-	 * Find the block(s) so we can inval and unmap them.
-	 */
-	done = 0;
-	nmaps = ARRAY_SIZE(mval);
-	error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
-				mval, &nmaps, 0);
-	if (error)
-		goto error_trans_cancel;
-	/*
-	 * Invalidate the block(s). No validation is done.
-	 */
-	for (i = 0; i < nmaps; i++) {
-		error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
-				XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
-				XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
-				&bp);
-		if (error)
-			goto error_trans_cancel;
-		xfs_trans_binval(tp, bp);
-	}
-	/*
-	 * Unmap the dead block(s) to the dfops.
-	 */
-	error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps, &done);
+
+	error = xfs_symlink_remote_truncate(tp, ip);
 	if (error)
 		goto error_trans_cancel;
-	ASSERT(done);
 
-	/*
-	 * Commit the transaction. This first logs the EFI and the inode, then
-	 * rolls and commits the transaction that frees the extents.
-	 */
-	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	error = xfs_trans_commit(tp);
 	if (error) {
 		ASSERT(xfs_is_shutdown(mp));



* [PATCH 19/21] xfs: make atomic extent swapping support realtime files
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (18 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 20/21] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 21/21] xfs: enable atomic swapext feature Darrick J. Wong
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Now that bmap items support the realtime device, we can add the
necessary pieces to the atomic extent swapping code to support such
things.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_swapext.c |  109 +++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_swapext.h |    5 +-
 fs/xfs/xfs_bmap_util.c      |    2 -
 fs/xfs/xfs_file.c           |    2 -
 fs/xfs/xfs_inode.h          |    5 ++
 fs/xfs/xfs_rtalloc.c        |  136 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_rtalloc.h        |    3 +
 fs/xfs/xfs_trace.h          |   11 ++-
 fs/xfs/xfs_xchgrange.c      |   71 ++++++++++++++++++++++
 fs/xfs/xfs_xchgrange.h      |    2 -
 10 files changed, 329 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index b27ceeb93a16..69812594fd71 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -142,6 +142,108 @@ sxi_advance(
 	sxi->sxi_blockcount -= irec->br_blockcount;
 }
 
+#ifdef DEBUG
+static inline bool
+xfs_swapext_need_rt_conversion(
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_inode		*ip = req->ip2;
+	struct xfs_mount		*mp = ip->i_mount;
+
+	/* xattrs don't live on the rt device */
+	if (req->whichfork == XFS_ATTR_FORK)
+		return false;
+
+	/*
+	 * Caller got permission to use logged swapext, so log recovery will
+	 * finish the swap and not leave us with partially swapped rt extents
+	 * exposed to userspace.
+	 */
+	if (req->req_flags & XFS_SWAP_REQ_LOGGED)
+		return false;
+
+	/*
+	 * If we can't use log intent items at all, the only supported
+	 * operation is full fork swaps.
+	 */
+	if (!xfs_swapext_supported(mp))
+		return false;
+
+	/* Conversion is only needed for realtime files with big rt extents */
+	return xfs_inode_has_bigrtextents(ip);
+}
+
+static inline int
+xfs_swapext_check_rt_extents(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	xfs_fileoff_t			startoff1 = req->startoff1;
+	xfs_fileoff_t			startoff2 = req->startoff2;
+	xfs_filblks_t			blockcount = req->blockcount;
+	uint32_t			mod;
+	int				nimaps;
+	int				error;
+
+	if (!xfs_swapext_need_rt_conversion(req))
+		return 0;
+
+	while (blockcount > 0) {
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip1, startoff1, blockcount,
+				&irec1, &nimaps, 0);
+		if (error)
+			return error;
+		ASSERT(nimaps == 1);
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip2, startoff2,
+				irec1.br_blockcount, &irec2, &nimaps,
+				0);
+		if (error)
+			return error;
+		ASSERT(nimaps == 1);
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1.br_blockcount = min(irec1.br_blockcount,
+					  irec2.br_blockcount);
+
+		/* Both mappings must be aligned to the realtime extent size. */
+		div_u64_rem(irec1.br_startoff, mp->m_sb.sb_rextsize, &mod);
+		if (mod) {
+			ASSERT(mod == 0);
+			return -EINVAL;
+		}
+
+		div_u64_rem(irec2.br_startoff, mp->m_sb.sb_rextsize, &mod);
+		if (mod) {
+			ASSERT(mod == 0);
+			return -EINVAL;
+		}
+
+		div_u64_rem(irec1.br_blockcount, mp->m_sb.sb_rextsize, &mod);
+		if (mod) {
+			ASSERT(mod == 0);
+			return -EINVAL;
+		}
+
+		startoff1 += irec1.br_blockcount;
+		startoff2 += irec1.br_blockcount;
+		blockcount -= irec1.br_blockcount;
+	}
+
+	return 0;
+}
+#else
+# define xfs_swapext_check_rt_extents(mp, req)		(0)
+#endif
+
 /* Check all extents to make sure we can actually swap them. */
 int
 xfs_swapext_check_extents(
@@ -161,12 +263,7 @@ xfs_swapext_check_extents(
 	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
 		return -EINVAL;
 
-	/* We don't support realtime data forks yet. */
-	if (!XFS_IS_REALTIME_INODE(req->ip1))
-		return 0;
-	if (req->whichfork == XFS_ATTR_FORK)
-		return 0;
-	return -EINVAL;
+	return xfs_swapext_check_rt_extents(mp, req);
 }
 
 #ifdef CONFIG_XFS_QUOTA
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index 6b610fea150a..155add23d8e2 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -13,12 +13,11 @@
  * This can be done to individual file extents by using the block mapping log
  * intent items introduced with reflink and rmap; or to entire file ranges
  * using swapext log intent items to track the overall progress across multiple
- * extent mappings.  Realtime is not supported yet.
+ * extent mappings.
  */
 static inline bool xfs_swapext_supported(struct xfs_mount *mp)
 {
-	return (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) &&
-	       !xfs_has_realtime(mp);
+	return xfs_has_reflink(mp) || xfs_has_rmapbt(mp);
 }
 
 /*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 47a583a94d58..3593c0f0ce13 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -989,7 +989,7 @@ xfs_free_file_space(
 	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);
 
 	/* We can only free complete realtime extents. */
-	if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) {
+	if (xfs_inode_has_bigrtextents(ip)) {
 		startoffset_fsb = roundup_64(startoffset_fsb,
 					     mp->m_sb.sb_rextsize);
 		endoffset_fsb = rounddown_64(endoffset_fsb,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b4629c8aa6b7..87dfb05640a8 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1181,7 +1181,7 @@ xfs_file_xchg_range(
 		goto out_err;
 
 	/* Prepare and then exchange file contents. */
-	error = xfs_xchg_range_prep(file1, file2, fxr);
+	error = xfs_xchg_range_prep(file1, file2, fxr, priv_flags);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 4b01d078ace2..444c43571e31 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -287,6 +287,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
 }
 
+static inline bool xfs_inode_has_bigrtextents(struct xfs_inode *ip)
+{
+	return XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1;
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 790191316a32..883333036519 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -21,6 +21,7 @@
 #include "xfs_sb.h"
 #include "xfs_log_priv.h"
 #include "xfs_health.h"
+#include "xfs_trace.h"
 
 /*
  * Read and return the summary information for a given extent size,
@@ -1461,3 +1462,138 @@ xfs_rtpick_extent(
 	*pick = b;
 	return 0;
 }
+
+/*
+ * Decide if this is an unwritten extent that isn't aligned to a rt extent
+ * boundary.  If it is, shorten the mapping so that we're ready to convert
+ * everything up to the next rt extent to a zeroed written extent.  If not,
+ * return false.
+ */
+static inline bool
+xfs_rtfile_want_conversion(
+	struct xfs_mount	*mp,
+	struct xfs_bmbt_irec	*irec)
+{
+	xfs_fileoff_t		rext_next;
+	uint32_t		modoff, modcnt;
+
+	if (irec->br_state != XFS_EXT_UNWRITTEN)
+		return false;
+
+	div_u64_rem(irec->br_startoff, mp->m_sb.sb_rextsize, &modoff);
+	if (modoff == 0) {
+		uint64_t	rexts = div_u64_rem(irec->br_blockcount,
+						mp->m_sb.sb_rextsize, &modcnt);
+
+		if (rexts > 0) {
+			/*
+			 * Unwritten mapping starts at an rt extent boundary
+			 * and is longer than one rt extent.  Round the length
+			 * down to the nearest extent but don't select it for
+			 * conversion.
+			 */
+			irec->br_blockcount -= modcnt;
+			modcnt = 0;
+		}
+
+		/* Unwritten mapping is perfectly aligned, do not convert. */
+		if (modcnt == 0)
+			return false;
+	}
+
+	/*
+	 * Unaligned and unwritten; trim to the current rt extent and select it
+	 * for conversion.
+	 */
+	rext_next = (irec->br_startoff - modoff) + mp->m_sb.sb_rextsize;
+	xfs_trim_extent(irec, irec->br_startoff, rext_next - irec->br_startoff);
+	return true;
+}
+
+/*
+ * For all realtime extents backing the given range of a file, search for
+ * unwritten mappings that do not cover a full rt extent and convert them
+ * to zeroed written mappings.  The goal is to end up with one mapping per rt
+ * extent so that we can perform a remapping operation.  Callers must ensure
+ * that there are no dirty pages in the given range.
+ */
+int
+xfs_rtfile_convert_unwritten(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	uint64_t		len)
+{
+	struct xfs_bmbt_irec	irec;
+	struct xfs_trans	*tp;
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		off;
+	xfs_fileoff_t		endoff;
+	unsigned int		resblks;
+	int			ret;
+
+	if (mp->m_sb.sb_rextsize == 1)
+		return 0;
+
+	off = rounddown_64(XFS_B_TO_FSBT(mp, pos), mp->m_sb.sb_rextsize);
+	endoff = roundup_64(XFS_B_TO_FSB(mp, pos + len), mp->m_sb.sb_rextsize);
+
+	trace_xfs_rtfile_convert_unwritten(ip, pos, len);
+
+	while (off < endoff) {
+		int		nmap = 1;
+
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 1);
+		ret = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
+				&tp);
+		if (ret)
+			return ret;
+
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+
+		/*
+		 * Read the mapping.  If we find an unwritten extent that isn't
+		 * aligned to an rt extent boundary...
+		 */
+		ret = xfs_bmapi_read(ip, off, endoff - off, &irec, &nmap, 0);
+		if (ret)
+			goto err;
+		ASSERT(nmap == 1);
+		ASSERT(irec.br_startoff == off);
+		if (!xfs_rtfile_want_conversion(mp, &irec)) {
+			xfs_trans_cancel(tp);
+			off += irec.br_blockcount;
+			continue;
+		}
+
+		/*
+		 * ...make sure this partially unwritten rt extent gets
+		 * converted to a zeroed written extent that we can remap.
+		 */
+		nmap = 1;
+		ret = xfs_bmapi_write(tp, ip, off, irec.br_blockcount,
+				XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO, 0, &irec,
+				&nmap);
+		if (ret)
+			goto err;
+		ASSERT(nmap == 1);
+		if (irec.br_state != XFS_EXT_NORM) {
+			ASSERT(0);
+			ret = -EIO;
+			goto err;
+		}
+		ret = xfs_trans_commit(tp);
+		if (ret)
+			return ret;
+
+		off += irec.br_blockcount;
+	}
+
+	return 0;
+err:
+	xfs_trans_cancel(tp);
+	return ret;
+}
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index 3b2f1b499a11..e440f793dd98 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -140,6 +140,8 @@ int xfs_rtalloc_extent_is_free(struct xfs_mount *mp, struct xfs_trans *tp,
 			       xfs_rtblock_t start, xfs_extlen_t len,
 			       bool *is_free);
 int xfs_rtalloc_reinit_frextents(struct xfs_mount *mp);
+int xfs_rtfile_convert_unwritten(struct xfs_inode *ip, loff_t pos,
+		uint64_t len);
 #else
 # define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb)    (ENOSYS)
 # define xfs_rtfree_extent(t,b,l)                       (ENOSYS)
@@ -164,6 +166,7 @@ xfs_rtmount_init(
 }
 # define xfs_rtmount_inodes(m)  (((mp)->m_sb.sb_rblocks == 0)? 0 : (ENOSYS))
 # define xfs_rtunmount_inodes(m)
+# define xfs_rtfile_convert_unwritten(ip, pos, len)	(0)
 #endif	/* CONFIG_XFS_RT */
 
 #endif	/* __XFS_RTALLOC_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b0ced76af3b9..0802f078a945 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1519,7 +1519,7 @@ DEFINE_IMAP_EVENT(xfs_iomap_alloc);
 DEFINE_IMAP_EVENT(xfs_iomap_found);
 
 DECLARE_EVENT_CLASS(xfs_simple_io_class,
-	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
+	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, u64 count),
 	TP_ARGS(ip, offset, count),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
@@ -1527,7 +1527,7 @@ DECLARE_EVENT_CLASS(xfs_simple_io_class,
 		__field(loff_t, isize)
 		__field(loff_t, disize)
 		__field(loff_t, offset)
-		__field(size_t, count)
+		__field(u64, count)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
@@ -1538,7 +1538,7 @@ DECLARE_EVENT_CLASS(xfs_simple_io_class,
 		__entry->count = count;
 	),
 	TP_printk("dev %d:%d ino 0x%llx isize 0x%llx disize 0x%llx "
-		  "pos 0x%llx bytecount 0x%zx",
+		  "pos 0x%llx bytecount 0x%llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->isize,
@@ -1549,7 +1549,7 @@ DECLARE_EVENT_CLASS(xfs_simple_io_class,
 
 #define DEFINE_SIMPLE_IO_EVENT(name)	\
 DEFINE_EVENT(xfs_simple_io_class, name,	\
-	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),	\
+	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, u64 count),	\
 	TP_ARGS(ip, offset, count))
 DEFINE_SIMPLE_IO_EVENT(xfs_delalloc_enospc);
 DEFINE_SIMPLE_IO_EVENT(xfs_unwritten_convert);
@@ -3741,6 +3741,9 @@ TRACE_EVENT(xfs_ioctl_clone,
 /* unshare tracepoints */
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_unshare);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_unshare_error);
+#ifdef CONFIG_XFS_RT
+DEFINE_SIMPLE_IO_EVENT(xfs_rtfile_convert_unwritten);
+#endif /* CONFIG_XFS_RT */
 
 /* copy on write */
 DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 27bb88dcf228..6a66d09099b0 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -28,6 +28,7 @@
 #include "xfs_sb.h"
 #include "xfs_icache.h"
 #include "xfs_log.h"
+#include "xfs_rtalloc.h"
 
 /* Lock (and optionally join) two inodes for a file range exchange. */
 void
@@ -370,12 +371,58 @@ xfs_swap_extent_forks(
 	return 0;
 }
 
+/*
+ * There may be partially written rt extents lurking in the ranges to be
+ * swapped.  According to the rules for realtime files with big rt extents, we
+ * must guarantee that an outside observer (an IO thread, realistically) can
+ * never see multiple physical rt extents mapped to the same logical file rt
+ * extent.  The deferred bmap log intent items that we use under the hood
+ * operate on single block mappings and not rt extents, which means we must
+ * have a strategy to ensure that log recovery after a failure won't stop in
+ * the middle of an rt extent.
+ *
+ * The preferred strategy is to use deferred extent swap log intent items to
+ * track the status of the overall swap operation so that we can complete the
+ * work during crash recovery.  If that isn't possible, we fall back to
+ * requiring the selected mappings in both forks to be aligned to rt extent
+ * boundaries.  As an aside, the old fork swap routine didn't have this
+ * requirement, but at an extreme cost in flexibility (full files only, and no
+ * support if rmapbt is enabled).
+ */
+static bool
+xfs_xchg_range_need_rt_conversion(
+	struct xfs_inode		*ip,
+	unsigned int			xchg_flags)
+{
+	struct xfs_mount		*mp = ip->i_mount;
+
+	/*
+	 * Caller got permission to use logged swapext, so log recovery will
+	 * finish the swap and not leave us with partially swapped rt extents
+	 * exposed to userspace.
+	 */
+	if (xchg_flags & XFS_XCHG_RANGE_LOGGED)
+		return false;
+
+	/*
+	 * If we can't use log intent items at all, the only supported
+	 * operation is full fork swaps, so no conversions are needed.
+	 * The range requirements are enforced by the swapext code itself.
+	 */
+	if (!xfs_swapext_supported(mp))
+		return false;
+
+	/* Conversion is only needed for realtime files with big rt extents */
+	return xfs_inode_has_bigrtextents(ip);
+}
+
 /* Prepare two files to have their data exchanged. */
 int
 xfs_xchg_range_prep(
 	struct file		*file1,
 	struct file		*file2,
-	struct file_xchg_range	*fxr)
+	struct file_xchg_range	*fxr,
+	unsigned int		xchg_flags)
 {
 	struct xfs_inode	*ip1 = XFS_I(file_inode(file1));
 	struct xfs_inode	*ip2 = XFS_I(file_inode(file2));
@@ -439,6 +486,19 @@ xfs_xchg_range_prep(
 			return error;
 	}
 
+	/* Convert unwritten sub-extent mappings if required. */
+	if (xfs_xchg_range_need_rt_conversion(ip2, xchg_flags)) {
+		error = xfs_rtfile_convert_unwritten(ip2, fxr->file2_offset,
+				fxr->length);
+		if (error)
+			return error;
+
+		error = xfs_rtfile_convert_unwritten(ip1, fxr->file1_offset,
+				fxr->length);
+		if (error)
+			return error;
+	}
+
 	return 0;
 }
 
@@ -656,6 +716,15 @@ xfs_xchg_range(
 	if (xchg_flags & XFS_XCHG_RANGE_LOGGED)
 		req.req_flags |= XFS_SWAP_REQ_LOGGED;
 
+	/*
+	 * Round the request length up to the nearest fundamental unit of
+	 * allocation.  The prep function already checked that the request
+	 * offsets and length in @fxr are safe to round up.
+	 */
+	if (XFS_IS_REALTIME_INODE(ip2))
+		req.blockcount = roundup_64(req.blockcount,
+					    mp->m_sb.sb_rextsize);
+
 	error = xfs_xchg_range_estimate(&req);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
index a0e64408784a..e356fe09a40c 100644
--- a/fs/xfs/xfs_xchgrange.h
+++ b/fs/xfs/xfs_xchgrange.h
@@ -35,6 +35,6 @@ void xfs_xchg_range_rele_log_assist(struct xfs_mount *mp);
 int xfs_xchg_range(struct xfs_inode *ip1, struct xfs_inode *ip2,
 		const struct file_xchg_range *fxr, unsigned int xchg_flags);
 int xfs_xchg_range_prep(struct file *file1, struct file *file2,
-		struct file_xchg_range *fxr);
+		struct file_xchg_range *fxr, unsigned int xchg_flags);
 
 #endif /* __XFS_XCHGRANGE_H__ */



* [PATCH 20/21] xfs: support non-power-of-two rtextsize with exchange-range
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (17 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 15/21] xfs: remove old swap extents implementation Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 19/21] xfs: make atomic extent swapping support realtime files Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 21/21] xfs: enable atomic swapext feature Darrick J. Wong
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

The VFS exchange-range helpers use (fast) bitmasks to perform block
alignment checks on the exchange parameters.  Unfortunately,
bitmasks require that the alignment size be a power of two.  This isn't
true for realtime devices, so we have to copy-pasta the VFS checks using
long division for this to work properly.
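
To make the power-of-two restriction concrete, here is a tiny standalone
sketch (illustration only, not kernel code) of the two styles of check.
Mask-based checks compute a remainder by ANDing with (align - 1), which
is only a valid modulus when align is a power of two; for a realtime
extent size of, say, 3 blocks, the mask check gives the wrong answer:

#include <stdint.h>
#include <stdio.h>

/* Mask-based check, in the style of the VFS helpers. */
static int aligned_mask(uint64_t x, uint64_t align)
{
	return (x & (align - 1)) == 0;
}

/* Division-based check, in the style of this patch. */
static int aligned_div(uint64_t x, uint64_t align)
{
	return x % align == 0;
}

int main(void)
{
	/* Offset 6 with a 3-block unit: aligned, but the mask disagrees. */
	printf("mask says %d, division says %d\n",
	       aligned_mask(6, 3), aligned_div(6, 3));
	return 0;
}

This prints "mask says 0, division says 1": the mask check claims offset
6 is misaligned even though it is an exact multiple of 3, which is why
the realtime case needs division-based helpers such as isaligned_64().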

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_xchgrange.c |  102 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 91 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 6a66d09099b0..ae030a6f607e 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -416,6 +416,86 @@ xfs_xchg_range_need_rt_conversion(
 	return xfs_inode_has_bigrtextents(ip);
 }
 
+/*
+ * Check the alignment of an exchange request when the allocation unit size
+ * isn't a power of two.  The VFS helpers use (fast) bitmask-based alignment
+ * checks, but here we have to use slow long division.
+ */
+static int
+xfs_xchg_range_check_rtalign(
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2,
+	const struct file_xchg_range	*fxr)
+{
+	struct xfs_mount		*mp = ip1->i_mount;
+	uint32_t			rextbytes;
+	uint64_t			length = fxr->length;
+	uint64_t			blen;
+	loff_t				size1, size2;
+
+	rextbytes = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
+	size1 = i_size_read(VFS_I(ip1));
+	size2 = i_size_read(VFS_I(ip2));
+
+	/* The start of both ranges must be aligned to a rt extent. */
+	if (!isaligned_64(fxr->file1_offset, rextbytes) ||
+	    !isaligned_64(fxr->file2_offset, rextbytes))
+		return -EINVAL;
+
+	/*
+	 * If the caller asked for full files, check that the offset/length
+	 * values cover all of both files.
+	 */
+	if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+	    (fxr->file1_offset != 0 || fxr->file2_offset != 0 ||
+	     fxr->length != size1 || fxr->length != size2))
+		return -EDOM;
+
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		length = max_t(int64_t, size1 - fxr->file1_offset,
+					size2 - fxr->file2_offset);
+
+	/*
+	 * If the user wanted us to exchange up to the infile's EOF, round up
+	 * to the next rt extent boundary for this check.  Do the same for the
+	 * outfile.
+	 *
+	 * Otherwise, reject the range length if it's not rt extent aligned.
+	 * We already confirmed the starting offsets' rt extent block
+	 * alignment.
+	 */
+	if (fxr->file1_offset + length == size1)
+		blen = roundup_64(size1, rextbytes) - fxr->file1_offset;
+	else if (fxr->file2_offset + length == size2)
+		blen = roundup_64(size2, rextbytes) - fxr->file2_offset;
+	else if (!isaligned_64(length, rextbytes))
+		return -EINVAL;
+	else
+		blen = length;
+
+	/* Don't allow overlapped exchanges within the same file. */
+	if (ip1 == ip2 &&
+	    fxr->file2_offset + blen > fxr->file1_offset &&
+	    fxr->file1_offset + blen > fxr->file2_offset)
+		return -EINVAL;
+
+	/*
+	 * Ensure that we don't exchange a partial EOF rt extent into the
+	 * middle of another file.
+	 */
+	if (isaligned_64(length, rextbytes))
+		return 0;
+
+	blen = length;
+	if (fxr->file2_offset + length < size2)
+		blen = rounddown_64(blen, rextbytes);
+
+	if (fxr->file1_offset + blen < size1)
+		blen = rounddown_64(blen, rextbytes);
+
+	return blen == length ? 0 : -EINVAL;
+}
+
 /* Prepare two files to have their data exchanged. */
 int
 xfs_xchg_range_prep(
@@ -426,6 +506,7 @@ xfs_xchg_range_prep(
 {
 	struct xfs_inode	*ip1 = XFS_I(file_inode(file1));
 	struct xfs_inode	*ip2 = XFS_I(file_inode(file2));
+	unsigned int		alloc_unit = xfs_inode_alloc_unitsize(ip2);
 	int			error;
 
 	trace_xfs_xchg_range_prep(ip1, fxr, ip2, 0);
@@ -434,18 +515,17 @@ xfs_xchg_range_prep(
 	if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
 		return -EINVAL;
 
-	/*
-	 * The alignment checks in the VFS helpers cannot deal with allocation
-	 * units that are not powers of 2.  This can happen with the realtime
-	 * volume if the extent size is set.  Note that alignment checks are
-	 * skipped if FULL_FILES is set.
-	 */
-	if (!(fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
-	    !is_power_of_2(xfs_inode_alloc_unitsize(ip2)))
-		return -EOPNOTSUPP;
+	/* Check non-power of two alignment issues, if necessary. */
+	if (XFS_IS_REALTIME_INODE(ip2) && !is_power_of_2(alloc_unit)) {
+		error = xfs_xchg_range_check_rtalign(ip1, ip2, fxr);
+		if (error)
+			return error;
 
-	error = generic_xchg_file_range_prep(file1, file2, fxr,
-			xfs_inode_alloc_unitsize(ip2));
+		/* Do the VFS checks with the regular block alignment. */
+		alloc_unit = ip1->i_mount->m_sb.sb_blocksize;
+	}
+
+	error = generic_xchg_file_range_prep(file1, file2, fxr, alloc_unit);
 	if (error || fxr->length == 0)
 		return error;
 



* [PATCH 21/21] xfs: enable atomic swapext feature
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
                     ` (19 preceding siblings ...)
  2022-12-30 22:13   ` [PATCH 19/21] xfs: make atomic extent swapping support realtime files Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  20 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add the atomic swapext feature to the set of features that we will
permit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index bb8bff488017..0c457905cce5 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -393,7 +393,8 @@ xfs_sb_has_incompat_feature(
 #define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS   (1 << 0)	/* Delayed Attributes */
 #define XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT  (1U << 31)	/* file extent swap */
 #define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
-	(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
+		(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS | \
+		 XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
 xfs_sb_has_incompat_log_feature(



* [PATCHSET v24.0 0/4] xfs: create temporary files for online repair
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (24 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/4] xfs: add the ability to reap entire inode forks Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of realtime summaries Darrick J. Wong
                   ` (51 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

As mentioned earlier, the repair strategy for file-based metadata is to
build a new copy in a temporary file and swap the file fork mappings
with the metadata inode.  We've built the atomic extent swap facility,
so now we need to build a facility for handling private temporary files.

The first step is to teach the filesystem to ignore the temporary files.
We'll mark them as PRIVATE in the VFS so that the kernel security
modules will leave them alone.  The second step is to give the online
repair code the ability to create a temporary file and reap extents from
the temporary file after the extent swap.
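
In outline, a file-based metadata repair is expected to flow roughly as
in the sketch below.  Only xrep_tempfile_create() comes from this
patchset; the stage/swap/reap helper names are placeholders invented for
the illustration:

STATIC int
xrep_filemeta_outline(
	struct xfs_scrub	*sc)
{
	int			error;

	/* Create a private, root-owned, unlinked temporary file. */
	error = xrep_tempfile_create(sc, S_IFREG);
	if (error)
		return error;

	/* Stage the reconstructed metadata in sc->tempip (placeholder). */
	error = xrep_stage_new_metadata(sc);
	if (error)
		return error;

	/* Atomically swap the fork mappings into the damaged file. */
	error = xrep_tempfile_swapext(sc);		/* placeholder */
	if (error)
		return error;

	/* The old blocks now belong to the tempfile; reap them. */
	return xrep_reap_tempfile_fork(sc);		/* placeholder */
}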

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles
---
 fs/xfs/Makefile         |    1 
 fs/xfs/scrub/common.c   |    6 +
 fs/xfs/scrub/parent.c   |    2 
 fs/xfs/scrub/reap.c     |  427 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/scrub/reap.h     |   21 ++
 fs/xfs/scrub/scrub.c    |    3 
 fs/xfs/scrub/scrub.h    |    4 
 fs/xfs/scrub/tempfile.c |  231 +++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.h |   27 +++
 fs/xfs/scrub/trace.h    |   96 +++++++++++
 fs/xfs/xfs_export.c     |    2 
 fs/xfs/xfs_inode.c      |    3 
 fs/xfs/xfs_inode.h      |    2 
 fs/xfs/xfs_itable.c     |    8 +
 14 files changed, 806 insertions(+), 27 deletions(-)
 create mode 100644 fs/xfs/scrub/tempfile.c
 create mode 100644 fs/xfs/scrub/tempfile.h



* [PATCH 1/4] xfs: hide private inodes from bulkstat and handle functions
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: create temporary files for online repair Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 3/4] xfs: refactor stale buffer scanning for repairs Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We're about to start adding functionality that uses internal inodes that
are private to XFS.  What this means is that userspace should never be
able to access any information about these files, and should not be able
to open these files by handle.  Callers are not allowed to link these
files into the directory tree, which should suffice to make these
private inodes actually private.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_export.c |    2 +-
 fs/xfs/xfs_itable.c |    8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
index 1064c2342876..b6ba96e0dd75 100644
--- a/fs/xfs/xfs_export.c
+++ b/fs/xfs/xfs_export.c
@@ -146,7 +146,7 @@ xfs_nfs_get_inode(
 		return ERR_PTR(error);
 	}
 
-	if (VFS_I(ip)->i_generation != generation) {
+	if (VFS_I(ip)->i_generation != generation || IS_PRIVATE(VFS_I(ip))) {
 		xfs_irele(ip);
 		return ERR_PTR(-ESTALE);
 	}
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index a1c2bcf65d37..7a967cc78010 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -86,6 +86,14 @@ xfs_bulkstat_one_int(
 	vfsuid = i_uid_into_vfsuid(mnt_userns, inode);
 	vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
 
+	/* If this is a private inode, don't leak its details to userspace. */
+	if (IS_PRIVATE(inode)) {
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		xfs_irele(ip);
+		error = -EINVAL;
+		goto out_advance;
+	}
+
 	/* xfs_iget returns the following without needing
 	 * further change.
 	 */



* [PATCH 2/4] xfs: create temporary files and directories for online repair
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: create temporary files for online repair Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/4] xfs: add the ability to reap entire inode forks Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/4] xfs: refactor stale buffer scanning for repairs Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/4] xfs: hide private inodes from bulkstat and handle functions Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach the online repair code how to create temporary files or
directories.  These temporary files can be used to stage reconstructed
information until we're ready to perform an atomic extent swap to commit
the new metadata.
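
For example, a hypothetical directory repair setup routine (not in this
patch) would ask for a staging directory like this; the wrapper function
is invented, only xrep_tempfile_create() and sc->tempip are real:

/* Hypothetical caller: stage a rebuilt directory in a private temp dir. */
STATIC int
xrep_dir_setup_tempdir(
	struct xfs_scrub	*sc)
{
	return xrep_tempfile_create(sc, S_IFDIR);
}

On success, sc->tempip points at a root-owned, unlinked, S_PRIVATE
directory that userspace can never see.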

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile         |    1 
 fs/xfs/scrub/common.c   |    6 +
 fs/xfs/scrub/parent.c   |    2 
 fs/xfs/scrub/scrub.c    |    3 +
 fs/xfs/scrub/scrub.h    |    4 +
 fs/xfs/scrub/tempfile.c |  231 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.h |   27 +++++
 fs/xfs/scrub/trace.h    |   33 +++++++
 fs/xfs/xfs_inode.c      |    3 -
 fs/xfs/xfs_inode.h      |    2 
 10 files changed, 308 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/tempfile.c
 create mode 100644 fs/xfs/scrub/tempfile.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 23b0c40620cf..6df1cd3b46ca 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -201,6 +201,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   refcount_repair.o \
 				   repair.o \
 				   rmap_repair.o \
+				   tempfile.o \
 				   xfbtree.o \
 				   )
 
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index bde9159dca4a..7eade2567af6 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -37,6 +37,7 @@
 #include "scrub/trace.h"
 #include "scrub/repair.h"
 #include "scrub/health.h"
+#include "scrub/tempfile.h"
 
 /* Common code for the metadata scrubbers. */
 
@@ -1047,7 +1048,10 @@ xchk_setup_inode_contents(
 	if (error)
 		return error;
 
-	/* Got the inode, lock it and we're ready to go. */
+	/*
+	 * Prepare to scrub the file contents by locking out IO and page
+	 * faults.
+	 */
 	xchk_ilock(sc, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 	error = xchk_trans_alloc(sc, resblks);
 	if (error)
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 84740ffee0d2..58d012252015 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -181,7 +181,7 @@ xchk_parent_validate(
 	}
 	if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
 		return error;
-	if (dp == sc->ip || !S_ISDIR(VFS_I(dp)->i_mode)) {
+	if (dp == sc->ip || dp == sc->tempip || !S_ISDIR(VFS_I(dp)->i_mode)) {
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
 		goto out_rele;
 	}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index c6eb692a0822..7bab30c2766e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -18,12 +18,14 @@
 #include "xfs_btree.h"
 #include "xfs_btree_staging.h"
 #include "xfs_rmap.h"
+#include "xfs_xchgrange.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
 #include "scrub/repair.h"
 #include "scrub/health.h"
 #include "scrub/xfile.h"
+#include "scrub/tempfile.h"
 
 /*
  * Online Scrub and Repair
@@ -211,6 +213,7 @@ xchk_teardown(
 		sc->buf = NULL;
 	}
 
+	xrep_tempfile_rele(sc);
 	xchk_fshooks_disable(sc);
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index cf18bb4e8b35..7892901ad70b 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -105,6 +105,10 @@ struct xfs_scrub {
 	/* Lock flags for @ip. */
 	uint				ilock_flags;
 
+	/* A temporary file on this filesystem, for staging new metadata. */
+	struct xfs_inode		*tempip;
+	uint				temp_ilock_flags;
+
 	/* See the XCHK/XREP state flags below. */
 	unsigned int			flags;
 
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
new file mode 100644
index 000000000000..797088deb7cb
--- /dev/null
+++ b/fs/xfs/scrub/tempfile.c
@@ -0,0 +1,231 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_ialloc.h"
+#include "xfs_quota.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_dir2.h"
+#include "xfs_xchgrange.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/tempfile.h"
+
+/*
+ * Create a temporary file for reconstructing metadata, with the intention of
+ * atomically swapping the temporary file's contents with the file that's
+ * being repaired.
+ */
+int
+xrep_tempfile_create(
+	struct xfs_scrub	*sc,
+	uint16_t		mode)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_trans	*tp = NULL;
+	struct xfs_dquot	*udqp = NULL;
+	struct xfs_dquot	*gdqp = NULL;
+	struct xfs_dquot	*pdqp = NULL;
+	struct xfs_trans_res	*tres;
+	struct xfs_inode	*dp = mp->m_rootip;
+	xfs_ino_t		ino;
+	unsigned int		resblks;
+	bool			is_dir = S_ISDIR(mode);
+	int			error;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+	if (xfs_is_readonly(mp))
+		return -EROFS;
+
+	ASSERT(sc->tp == NULL);
+	ASSERT(sc->tempip == NULL);
+
+	/*
+	 * Make sure that we have allocated dquot(s) on disk.  The temporary
+	 * inode should be completely root owned so that we don't fail due to
+	 * quota limits.
+	 */
+	error = xfs_qm_vop_dqalloc(dp, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
+			XFS_QMOPT_QUOTALL, &udqp, &gdqp, &pdqp);
+	if (error)
+		return error;
+
+	if (is_dir) {
+		resblks = XFS_MKDIR_SPACE_RES(mp, 0);
+		tres = &M_RES(mp)->tr_mkdir;
+	} else {
+		resblks = XFS_IALLOC_SPACE_RES(mp);
+		tres = &M_RES(mp)->tr_create_tmpfile;
+	}
+
+	error = xfs_trans_alloc_icreate(mp, tres, udqp, gdqp, pdqp, resblks,
+			&tp);
+	if (error)
+		goto out_release_dquots;
+
+	/* Allocate inode, set up directory. */
+	error = xfs_dialloc(&tp, dp->i_ino, mode, &ino);
+	if (error)
+		goto out_trans_cancel;
+	error = xfs_init_new_inode(&init_user_ns, tp, dp, ino, mode, 0, 0,
+			0, false, &sc->tempip);
+	if (error)
+		goto out_trans_cancel;
+
+	/* Change the ownership of the inode to root. */
+	VFS_I(sc->tempip)->i_uid = GLOBAL_ROOT_UID;
+	VFS_I(sc->tempip)->i_gid = GLOBAL_ROOT_GID;
+	sc->tempip->i_diflags &= ~(XFS_DIFLAG_REALTIME | XFS_DIFLAG_RTINHERIT);
+	xfs_trans_log_inode(tp, sc->tempip, XFS_ILOG_CORE);
+
+	/*
+	 * Mark our temporary file as private so that LSMs and the ACL code
+	 * don't try to add their own metadata or reason about these files.
+	 * The file should never be exposed to userspace.
+	 */
+	VFS_I(sc->tempip)->i_flags |= S_PRIVATE;
+	VFS_I(sc->tempip)->i_opflags &= ~IOP_XATTR;
+
+	if (is_dir) {
+		error = xfs_dir_init(tp, sc->tempip, dp);
+		if (error)
+			goto out_trans_cancel;
+	}
+
+	/*
+	 * Attach the dquot(s) to the inodes and modify them incore.
+	 * The ids of the inode couldn't have changed since the new
+	 * inode has been locked ever since it was created.
+	 */
+	xfs_qm_vop_create_dqattach(tp, sc->tempip, udqp, gdqp, pdqp);
+
+	/*
+	 * Put our temp file on the unlinked list so it's purged automatically.
+	 * Anything being reconstructed using this file must be atomically
+	 * swapped with the original file because the contents here will be
+	 * purged when the inode is dropped or log recovery cleans out the
+	 * unlinked list.
+	 */
+	error = xfs_iunlink(tp, sc->tempip);
+	if (error)
+		goto out_trans_cancel;
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_release_inode;
+
+	trace_xrep_tempfile_create(sc);
+
+	xfs_qm_dqrele(udqp);
+	xfs_qm_dqrele(gdqp);
+	xfs_qm_dqrele(pdqp);
+
+	/* Finish setting up the incore / vfs context. */
+	xfs_setup_iops(sc->tempip);
+	xfs_finish_inode_setup(sc->tempip);
+
+	sc->temp_ilock_flags = 0;
+	return error;
+
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+out_release_inode:
+	/*
+	 * Wait until after the current transaction is aborted to finish the
+	 * setup of the inode and release the inode.  This prevents recursive
+	 * transactions and deadlocks from xfs_inactive.
+	 */
+	if (sc->tempip) {
+		xfs_finish_inode_setup(sc->tempip);
+		xchk_irele(sc, sc->tempip);
+	}
+out_release_dquots:
+	xfs_qm_dqrele(udqp);
+	xfs_qm_dqrele(gdqp);
+	xfs_qm_dqrele(pdqp);
+
+	return error;
+}
+
+/* Take IOLOCK_EXCL on the temporary file, maybe. */
+bool
+xrep_tempfile_iolock_nowait(
+	struct xfs_scrub	*sc)
+{
+	if (xfs_ilock_nowait(sc->tempip, XFS_IOLOCK_EXCL)) {
+		sc->temp_ilock_flags |= XFS_IOLOCK_EXCL;
+		return true;
+	}
+
+	return false;
+}
+
+/* Release IOLOCK_EXCL on the temporary file. */
+void
+xrep_tempfile_iounlock(
+	struct xfs_scrub	*sc)
+{
+	xfs_iunlock(sc->tempip, XFS_IOLOCK_EXCL);
+	sc->temp_ilock_flags &= ~XFS_IOLOCK_EXCL;
+}
+
+/* Prepare the temporary file for metadata updates by grabbing ILOCK_EXCL. */
+void
+xrep_tempfile_ilock(
+	struct xfs_scrub	*sc)
+{
+	sc->temp_ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_ilock(sc->tempip, XFS_ILOCK_EXCL);
+}
+
+/* Try to grab ILOCK_EXCL on the temporary file. */
+bool
+xrep_tempfile_ilock_nowait(
+	struct xfs_scrub	*sc)
+{
+	if (xfs_ilock_nowait(sc->tempip, XFS_ILOCK_EXCL)) {
+		sc->temp_ilock_flags |= XFS_ILOCK_EXCL;
+		return true;
+	}
+
+	return false;
+}
+
+/* Unlock ILOCK_EXCL on the temporary file after an update. */
+void
+xrep_tempfile_iunlock(
+	struct xfs_scrub	*sc)
+{
+	xfs_iunlock(sc->tempip, XFS_ILOCK_EXCL);
+	sc->temp_ilock_flags &= ~XFS_ILOCK_EXCL;
+}
+
+/* Release the temporary file. */
+void
+xrep_tempfile_rele(
+	struct xfs_scrub	*sc)
+{
+	if (!sc->tempip)
+		return;
+
+	if (sc->temp_ilock_flags) {
+		xfs_iunlock(sc->tempip, sc->temp_ilock_flags);
+		sc->temp_ilock_flags = 0;
+	}
+
+	xchk_irele(sc, sc->tempip);
+	sc->tempip = NULL;
+}
diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h
new file mode 100644
index 000000000000..f00a9ce43a32
--- /dev/null
+++ b/fs/xfs/scrub/tempfile.h
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_TEMPFILE_H__
+#define __XFS_SCRUB_TEMPFILE_H__
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+int xrep_tempfile_create(struct xfs_scrub *sc, uint16_t mode);
+void xrep_tempfile_rele(struct xfs_scrub *sc);
+
+bool xrep_tempfile_iolock_nowait(struct xfs_scrub *sc);
+void xrep_tempfile_iounlock(struct xfs_scrub *sc);
+
+void xrep_tempfile_ilock(struct xfs_scrub *sc);
+bool xrep_tempfile_ilock_nowait(struct xfs_scrub *sc);
+void xrep_tempfile_iunlock(struct xfs_scrub *sc);
+#else
+static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc)
+{
+	xchk_ilock(sc, XFS_IOLOCK_EXCL);
+}
+# define xrep_tempfile_rele(sc)
+#endif /* CONFIG_XFS_ONLINE_REPAIR */
+
+#endif /* __XFS_SCRUB_TEMPFILE_H__ */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 558bef72b569..b1a39a206730 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2157,6 +2157,39 @@ TRACE_EVENT(xrep_rmap_live_update,
 		  __entry->flags)
 );
 
+TRACE_EVENT(xrep_tempfile_create,
+	TP_PROTO(struct xfs_scrub *sc),
+	TP_ARGS(sc),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(unsigned int, type)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_ino_t, inum)
+		__field(unsigned int, gen)
+		__field(unsigned int, flags)
+		__field(xfs_ino_t, temp_inum)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = sc->file ? XFS_I(file_inode(sc->file))->i_ino : 0;
+		__entry->type = sc->sm->sm_type;
+		__entry->agno = sc->sm->sm_agno;
+		__entry->inum = sc->sm->sm_ino;
+		__entry->gen = sc->sm->sm_gen;
+		__entry->flags = sc->sm->sm_flags;
+		__entry->temp_inum = sc->tempip->i_ino;
+	),
+	TP_printk("dev %d:%d ino 0x%llx type %s inum 0x%llx gen 0x%x flags 0x%x temp_inum 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __print_symbolic(__entry->type, XFS_SCRUB_TYPE_STRINGS),
+		  __entry->inum,
+		  __entry->gen,
+		  __entry->flags,
+		  __entry->temp_inum)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 04ceafb936bc..d81c864207bb 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -41,7 +41,6 @@
 
 struct kmem_cache *xfs_inode_cache;
 
-STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
 STATIC int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag,
 	struct xfs_inode *);
 
@@ -2103,7 +2102,7 @@ xfs_iunlink_insert_inode(
  * We place the on-disk inode on a list in the AGI.  It will be pulled from this
  * list when the inode is freed.
  */
-STATIC int
+int
 xfs_iunlink(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 444c43571e31..3f6c63304ca3 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -575,6 +575,8 @@ extern struct kmem_cache	*xfs_inode_cache;
 
 bool xfs_inode_needs_inactive(struct xfs_inode *ip);
 
+int xfs_iunlink(struct xfs_trans *tp, struct xfs_inode *ip);
+
 void xfs_end_io(struct work_struct *work);
 
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs: refactor stale buffer scanning for repairs
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: create temporary files for online repair Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/4] xfs: add the ability to reap entire inode forks Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/4] xfs: create temporary files and directories for online repair Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/4] xfs: hide private inodes from bulkstat and handle functions Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In an upcoming patch, we will need to be able to look for xfs_buf
objects caching file-based metadata blocks without needing to walk the
(possibly corrupt) structures to find all the buffers.  Repair already
has most of the code needed to scan the buffer cache, so hoist these
utility functions.
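
For reference, the hoisted helpers are used in a loop of the following
shape (an illustrative fragment that mirrors the converted
xreap_agextent_binval code; agbno and fsb_remaining stand in for the
caller's extent):

	struct xrep_bufscan	scan = {
		.daddr		= XFS_AGB_TO_DADDR(mp, agno, agbno),
		.max_sectors	= xrep_bufscan_max_sectors(mp, fsb_remaining),
		.daddr_step	= XFS_FSB_TO_BB(mp, 1),
	};
	struct xfs_buf		*bp;

	/* Walk every plausible buffer length starting at this disk address. */
	while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) {
		xfs_trans_bjoin(sc->tp, bp);
		xfs_trans_binval(sc->tp, bp);
	}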

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/reap.c |   73 ++++++++++++++++++++++++++++++++++++---------------
 fs/xfs/scrub/reap.h |   20 ++++++++++++++
 2 files changed, 71 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index ea7a274aa778..b6f89762b00c 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -209,6 +209,48 @@ static inline void xreap_defer_finish_reset(struct xreap_state *rs)
 	rs->force_roll = false;
 }
 
+/*
+ * Compute the maximum length of a buffer cache scan (in units of sectors),
+ * given a quantity of fs blocks.
+ */
+xfs_daddr_t
+xrep_bufscan_max_sectors(
+	struct xfs_mount	*mp,
+	xfs_extlen_t		fsblocks)
+{
+	int			max_fsbs;
+
+	/* Remote xattr values are the largest buffers that we support. */
+	max_fsbs = xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX);
+
+	return XFS_FSB_TO_BB(mp, min_t(xfs_extlen_t, fsblocks, max_fsbs));
+}
+
+/*
+ * Return an incore buffer from a sector scan, or NULL if there are no buffers
+ * left to return.
+ */
+struct xfs_buf *
+xrep_bufscan_advance(
+	struct xfs_mount	*mp,
+	struct xrep_bufscan	*scan)
+{
+	scan->__sector_count += scan->daddr_step;
+	while (scan->__sector_count <= scan->max_sectors) {
+		struct xfs_buf	*bp = NULL;
+		int		error;
+
+		error = xfs_buf_incore(mp->m_ddev_targp, scan->daddr,
+				scan->__sector_count, XBF_BCACHE_SCAN, &bp);
+		if (!error)
+			return bp;
+
+		scan->__sector_count += scan->daddr_step;
+	}
+
+	return NULL;
+}
+
 /* Try to invalidate the incore buffers for an extent that we're freeing. */
 STATIC void
 xreap_agextent_binval(
@@ -239,28 +281,15 @@ xreap_agextent_binval(
 	 * of any plausible size.
 	 */
 	while (bno < agbno_next) {
-		xfs_agblock_t	fsbcount;
-		xfs_agblock_t	max_fsbs;
-
-		/*
-		 * Max buffer size is the max remote xattr buffer size, which
-		 * is one fs block larger than 64k.
-		 */
-		max_fsbs = min_t(xfs_agblock_t, agbno_next - bno,
-				xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX));
-
-		for (fsbcount = 1; fsbcount < max_fsbs; fsbcount++) {
-			struct xfs_buf	*bp = NULL;
-			xfs_daddr_t	daddr;
-			int		error;
-
-			daddr = XFS_AGB_TO_DADDR(mp, agno, bno);
-			error = xfs_buf_incore(mp->m_ddev_targp, daddr,
-					XFS_FSB_TO_BB(mp, fsbcount),
-					XBF_BCACHE_SCAN, &bp);
-			if (error)
-				continue;
-
+		struct xrep_bufscan	scan = {
+			.daddr		= XFS_AGB_TO_DADDR(mp, agno, bno),
+			.max_sectors	= xrep_bufscan_max_sectors(mp,
+							agbno_next - bno),
+			.daddr_step	= XFS_FSB_TO_BB(mp, 1),
+		};
+		struct xfs_buf	*bp;
+
+		while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) {
 			xfs_trans_bjoin(sc->tp, bp);
 			xfs_trans_binval(sc->tp, bp);
 			rs->invalidated++;
diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h
index 29a8b52a1a54..3c31c795fd1a 100644
--- a/fs/xfs/scrub/reap.h
+++ b/fs/xfs/scrub/reap.h
@@ -11,4 +11,24 @@ int xrep_reap_agblocks(struct xfs_scrub *sc, struct xagb_bitmap *bitmap,
 int xrep_reap_fsblocks(struct xfs_scrub *sc, struct xfsb_bitmap *bitmap,
 		const struct xfs_owner_info *oinfo);
 
+/* Buffer cache scan context. */
+struct xrep_bufscan {
+	/* Disk address for the buffers we want to scan. */
+	xfs_daddr_t		daddr;
+
+	/* Maximum number of sectors to scan. */
+	xfs_daddr_t		max_sectors;
+
+	/* Each round, increment the search length by this number of sectors. */
+	xfs_daddr_t		daddr_step;
+
+	/* Internal scan state; initialize to zero. */
+	xfs_daddr_t		__sector_count;
+};
+
+xfs_daddr_t xrep_bufscan_max_sectors(struct xfs_mount *mp,
+		xfs_extlen_t fsblocks);
+struct xfs_buf *xrep_bufscan_advance(struct xfs_mount *mp,
+		struct xrep_bufscan *scan);
+
 #endif /* __XFS_SCRUB_REAP_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs: add the ability to reap entire inode forks
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: create temporary files for online repair Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/4] xfs: create temporary files and directories for online repair Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In preparation for supporting repair of indexed file-based metadata
(such as realtime bitmaps, directories, and extended attribute data),
add a function to reap the old blocks after a metadata repair finishes.
IOWs, this is an elaborate bunmapi call that deals with crosslinked
blocks by unmapping them without freeing them, and also scans for incore
buffers to invalidate.
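
Callers must hold ILOCK_EXCL on the inode, which must be either sc->ip
or sc->tempip.  A hypothetical caller reaping the old contents of a
tempfile's data fork would look roughly like this:

	/* Dispose of every block mapped into the tempfile's data fork. */
	xrep_tempfile_ilock(sc);
	xfs_trans_ijoin(sc->tp, sc->tempip, 0);
	error = xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK);
	if (error)
		return error;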

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/reap.c  |  354 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/reap.h  |    1 
 fs/xfs/scrub/trace.h |   63 +++++++++
 3 files changed, 418 insertions(+)


diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index b6f89762b00c..f43ad4dfc6f7 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -32,6 +32,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "xfs_attr_remote.h"
+#include "xfs_defer.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -670,3 +671,356 @@ xrep_reap_fsblocks(
 
 	return 0;
 }
+
+/*
+ * Metadata files are not supposed to share blocks with anything else.
+ * If blocks are shared, we remove the reverse mapping (thus reducing the
+ * crosslink factor); if blocks are not shared, we also need to free them.
+ *
+ * This first step determines the longest subset of the passed-in imap
+ * (starting at its beginning) that is either crosslinked or not crosslinked.
+ * The blockcount will be adjusted down as needed.
+ */
+STATIC int
+xreap_bmapi_select(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*imap,
+	bool			*crosslinked)
+{
+	struct xfs_owner_info	oinfo;
+	struct xfs_btree_cur	*cur;
+	xfs_filblks_t		len = 1;
+	xfs_agblock_t		bno;
+	xfs_agblock_t		agbno;
+	xfs_agblock_t		agbno_next;
+	int			error;
+
+	agbno = XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock);
+	agbno_next = agbno + imap->br_blockcount;
+
+	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.pag);
+
+	xfs_rmap_ino_owner(&oinfo, ip->i_ino, whichfork, imap->br_startoff);
+	error = xfs_rmap_has_other_keys(cur, agbno, 1, &oinfo, crosslinked);
+	if (error)
+		goto out_cur;
+
+	bno = agbno + 1;
+	while (bno < agbno_next) {
+		bool		also_crosslinked;
+
+		oinfo.oi_offset++;
+		error = xfs_rmap_has_other_keys(cur, bno, 1, &oinfo,
+				&also_crosslinked);
+		if (error)
+			goto out_cur;
+
+		if (also_crosslinked != *crosslinked)
+			break;
+
+		len++;
+		bno++;
+	}
+
+	imap->br_blockcount = len;
+	trace_xreap_bmapi_select(sc->sa.pag, agbno, len, *crosslinked);
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/*
+ * Decide if this buffer can be joined to a transaction.  This is true for most
+ * buffers, but there are two cases that we want to catch: large remote xattr
+ * value buffers are not logged and can overflow the buffer log item dirty
+ * bitmap size; and oversized cached buffers if things have really gone
+ * haywire.
+ */
+static inline bool
+xreap_buf_loggable(
+	const struct xfs_buf	*bp)
+{
+	int			i;
+
+	for (i = 0; i < bp->b_map_count; i++) {
+		int		chunks;
+		int		map_size;
+
+		chunks = DIV_ROUND_UP(BBTOB(bp->b_maps[i].bm_len),
+				XFS_BLF_CHUNK);
+		map_size = DIV_ROUND_UP(chunks, NBWORD);
+		if (map_size > XFS_BLF_DATAMAP_SIZE)
+			return false;
+	}
+
+	return true;
+}
+
+/*
+ * Invalidate any buffers for this file mapping.  The @imap blockcount may be
+ * adjusted downward if we need to roll the transaction.
+ */
+STATIC int
+xreap_bmapi_binval(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*imap)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_perag	*pag = sc->sa.pag;
+	int			bmap_flags = xfs_bmapi_aflag(whichfork);
+	xfs_fileoff_t		off;
+	xfs_fileoff_t		max_off;
+	xfs_extlen_t		scan_blocks;
+	xfs_agnumber_t		agno = sc->sa.pag->pag_agno;
+	xfs_agblock_t		bno;
+	xfs_agblock_t		agbno;
+	xfs_agblock_t		agbno_next;
+	unsigned int		invalidated = 0;
+	int			error;
+
+	/*
+	 * Avoid invalidating AG headers and post-EOFS blocks because we never
+	 * own those.
+	 */
+	agbno = bno = XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock);
+	agbno_next = agbno + imap->br_blockcount;
+	if (!xfs_verify_agbno(pag, agbno) ||
+	    !xfs_verify_agbno(pag, agbno_next - 1))
+		return 0;
+
+	/*
+	 * Buffers for file blocks can span multiple contiguous mappings.  This
+	 * means that for each block in the mapping, there could exist an
+	 * xfs_buf indexed by that block with any length up to the maximum
+	 * buffer size (remote xattr values) or to the next hole in the fork.
+	 * To set up our binval scan, first we need to figure out the location
+	 * of the next hole.
+	 */
+	off = imap->br_startoff + imap->br_blockcount;
+	max_off = off + xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX);
+	while (off < max_off) {
+		struct xfs_bmbt_irec	hmap;
+		int			nhmaps = 1;
+
+		error = xfs_bmapi_read(ip, off, max_off - off, &hmap,
+				&nhmaps, bmap_flags);
+		if (error)
+			return error;
+		if (nhmaps != 1 || hmap.br_startblock == DELAYSTARTBLOCK) {
+			ASSERT(0);
+			return -EFSCORRUPTED;
+		}
+
+		if (!xfs_bmap_is_real_extent(&hmap))
+			break;
+
+		off = hmap.br_startoff + hmap.br_blockcount;
+	}
+	scan_blocks = off - imap->br_startoff;
+
+	trace_xreap_bmapi_binval_scan(sc, imap, scan_blocks);
+
+	/*
+	 * If there are incore buffers for these blocks, invalidate them.  If
+	 * we can't (try)lock the buffer we assume it's owned by someone else
+	 * and leave it alone.  The buffer cache cannot detect aliasing, so
+	 * employ nested loops to detect incore buffers of any plausible size.
+	 */
+	while (bno < agbno_next) {
+		struct xrep_bufscan	scan = {
+			.daddr		= XFS_AGB_TO_DADDR(mp, agno, bno),
+			.max_sectors	= xrep_bufscan_max_sectors(mp,
+								scan_blocks),
+			.daddr_step	= XFS_FSB_TO_BB(mp, 1),
+		};
+		struct xfs_buf		*bp;
+
+		while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) {
+			if (xreap_buf_loggable(bp)) {
+				xfs_trans_bjoin(sc->tp, bp);
+				xfs_trans_binval(sc->tp, bp);
+			} else {
+				xfs_buf_stale(bp);
+				xfs_buf_relse(bp);
+			}
+			invalidated++;
+
+			/*
+			 * Stop invalidating if we've hit the limit; we should
+			 * still have enough reservation left to free however
+			 * much of the mapping we've seen so far.
+			 */
+			if (invalidated > XREAP_MAX_BINVAL) {
+				imap->br_blockcount = agbno_next - bno;
+				goto out;
+			}
+		}
+
+		bno++;
+		scan_blocks--;
+	}
+
+out:
+	trace_xreap_bmapi_binval(sc->sa.pag, agbno, imap->br_blockcount);
+	return 0;
+}
+
+/*
+ * Dispose of as much of this file extent as we can.  Upon successful return,
+ * the imap will reflect the mapping that was removed from the fork.
+ */
+STATIC int
+xreap_ifork_extent(
+	struct xfs_scrub		*sc,
+	struct xfs_inode		*ip,
+	int				whichfork,
+	struct xfs_bmbt_irec		*imap)
+{
+	xfs_agnumber_t			agno;
+	bool				crosslinked;
+	int				error;
+
+	ASSERT(sc->sa.pag == NULL);
+
+	trace_xreap_ifork_extent(sc, ip, whichfork, imap);
+
+	agno = XFS_FSB_TO_AGNO(sc->mp, imap->br_startblock);
+	sc->sa.pag = xfs_perag_get(sc->mp, agno);
+	if (!sc->sa.pag)
+		return -EFSCORRUPTED;
+
+	error = xfs_alloc_read_agf(sc->sa.pag, sc->tp, 0, &sc->sa.agf_bp);
+	if (error)
+		goto out_pag;
+
+	/*
+	 * Decide the fate of the blocks at the beginning of the mapping, then
+	 * update the mapping to use it with the unmap calls.
+	 */
+	error = xreap_bmapi_select(sc, ip, whichfork, imap, &crosslinked);
+	if (error)
+		goto out_agf;
+
+	if (crosslinked) {
+		/*
+		 * If there are other rmappings, this block is cross linked and
+		 * must not be freed.  Remove the reverse mapping, leave the
+		 * buffer cache in its possibly confused state, and move on.
+		 * We don't want to risk discarding valid data buffers from
+		 * anybody else who thinks they own the block, even though that
+		 * runs the risk of stale buffer warnings in the future.
+		 */
+		trace_xreap_dispose_unmap_extent(sc->sa.pag,
+				XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock),
+				imap->br_blockcount);
+
+		/*
+		 * Schedule removal of the mapping from the fork.  We use
+		 * deferred log intents in this function to control the exact
+		 * sequence of metadata updates.
+		 */
+		xfs_bmap_unmap_extent(sc->tp, ip, whichfork, imap);
+		xfs_trans_mod_dquot_byino(sc->tp, ip, XFS_TRANS_DQ_BCOUNT,
+				-(int64_t)imap->br_blockcount);
+		xfs_rmap_unmap_extent(sc->tp, ip, whichfork, imap);
+	} else {
+		/*
+		 * If the block is not crosslinked, we can invalidate all the
+		 * incore buffers for the extent, and then free the extent.
+		 * This is a bit of a mess since we don't detect discontiguous
+		 * buffers that are indexed by a block starting before the
+		 * first block of the extent but overlap anyway.
+		 */
+		trace_xreap_dispose_free_extent(sc->sa.pag,
+				XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock),
+				imap->br_blockcount);
+
+		/*
+		 * Invalidate as many buffers as we can, starting at the
+		 * beginning of this mapping.  If this function sets blockcount
+		 * to zero, the transaction is full of logged buffer
+		 * invalidations, so we need to return early so that we can
+		 * roll and retry.
+		 */
+		error = xreap_bmapi_binval(sc, ip, whichfork, imap);
+		if (error || imap->br_blockcount == 0)
+			goto out_agf;
+
+		/*
+		 * Schedule removal of the mapping from the fork.  We use
+		 * deferred log intents in this function to control the exact
+		 * sequence of metadata updates.
+		 */
+		xfs_bmap_unmap_extent(sc->tp, ip, whichfork, imap);
+		xfs_trans_mod_dquot_byino(sc->tp, ip, XFS_TRANS_DQ_BCOUNT,
+				-(int64_t)imap->br_blockcount);
+		__xfs_free_extent_later(sc->tp, imap->br_startblock,
+				imap->br_blockcount, NULL, true);
+	}
+
+out_agf:
+	xfs_trans_brelse(sc->tp, sc->sa.agf_bp);
+	sc->sa.agf_bp = NULL;
+out_pag:
+	xfs_perag_put(sc->sa.pag);
+	sc->sa.pag = NULL;
+	return error;
+}
+
+/*
+ * Dispose of each block mapped to the given fork of the given file.  Callers
+ * must hold ILOCK_EXCL, and ip can only be sc->ip or sc->tempip.  The fork
+ * must not have any delalloc reservations.
+ */
+int
+xrep_reap_ifork(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	xfs_fileoff_t		off = 0;
+	int			bmap_flags = xfs_bmapi_aflag(whichfork);
+	int			error;
+
+	ASSERT(xfs_has_rmapbt(sc->mp));
+	ASSERT(ip == sc->ip || ip == sc->tempip);
+	ASSERT(whichfork == XFS_ATTR_FORK || !XFS_IS_REALTIME_INODE(ip));
+
+	while (off < XFS_MAX_FILEOFF) {
+		struct xfs_bmbt_irec	imap;
+		int			nimaps = 1;
+
+		/* Read the next extent, skip past holes and delalloc. */
+		error = xfs_bmapi_read(ip, off, XFS_MAX_FILEOFF - off, &imap,
+				&nimaps, bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 || imap.br_startblock == DELAYSTARTBLOCK) {
+			ASSERT(0);
+			return -EFSCORRUPTED;
+		}
+
+		/*
+		 * If this is a real space mapping, reap as much of it as we
+		 * can in a single transaction.
+		 */
+		if (xfs_bmap_is_real_extent(&imap)) {
+			error = xreap_ifork_extent(sc, ip, whichfork, &imap);
+			if (error)
+				return error;
+
+			error = xfs_defer_finish(&sc->tp);
+			if (error)
+				return error;
+		}
+
+		off = imap.br_startoff + imap.br_blockcount;
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h
index 3c31c795fd1a..6606b119b9ec 100644
--- a/fs/xfs/scrub/reap.h
+++ b/fs/xfs/scrub/reap.h
@@ -10,6 +10,7 @@ int xrep_reap_agblocks(struct xfs_scrub *sc, struct xagb_bitmap *bitmap,
 		const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type);
 int xrep_reap_fsblocks(struct xfs_scrub *sc, struct xfsb_bitmap *bitmap,
 		const struct xfs_owner_info *oinfo);
+int xrep_reap_ifork(struct xfs_scrub *sc, struct xfs_inode *ip, int whichfork);
 
 /* Buffer cache scan context. */
 struct xrep_bufscan {
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index b1a39a206730..b80de49ae831 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1356,6 +1356,7 @@ DEFINE_EVENT(xrep_extent_class, name, \
 DEFINE_REPAIR_EXTENT_EVENT(xreap_dispose_unmap_extent);
 DEFINE_REPAIR_EXTENT_EVENT(xreap_dispose_free_extent);
 DEFINE_REPAIR_EXTENT_EVENT(xreap_agextent_binval);
+DEFINE_REPAIR_EXTENT_EVENT(xreap_bmapi_binval);
 DEFINE_REPAIR_EXTENT_EVENT(xrep_agfl_insert);
 
 DECLARE_EVENT_CLASS(xrep_reap_find_class,
@@ -1389,6 +1390,7 @@ DEFINE_EVENT(xrep_reap_find_class, name, \
 		 bool crosslinked), \
 	TP_ARGS(pag, agbno, len, crosslinked))
 DEFINE_REPAIR_REAP_FIND_EVENT(xreap_agextent_select);
+DEFINE_REPAIR_REAP_FIND_EVENT(xreap_bmapi_select);
 
 DECLARE_EVENT_CLASS(xrep_rmap_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
@@ -2190,6 +2192,67 @@ TRACE_EVENT(xrep_tempfile_create,
 		  __entry->temp_inum)
 );
 
+TRACE_EVENT(xreap_ifork_extent,
+	TP_PROTO(struct xfs_scrub *sc, struct xfs_inode *ip, int whichfork,
+		 const struct xfs_bmbt_irec *irec),
+	TP_ARGS(sc, ip, whichfork, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, whichfork)
+		__field(xfs_fileoff_t, fileoff)
+		__field(xfs_filblks_t, len)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(int, state)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->whichfork = whichfork;
+		__entry->fileoff = irec->br_startoff;
+		__entry->len = irec->br_blockcount;
+		__entry->agno = XFS_FSB_TO_AGNO(sc->mp, irec->br_startblock);
+		__entry->agbno = XFS_FSB_TO_AGBNO(sc->mp, irec->br_startblock);
+		__entry->state = irec->br_state;
+	),
+	TP_printk("dev %d:%d ip 0x%llx whichfork %s agno 0x%x agbno 0x%x fileoff 0x%llx fsbcount 0x%llx state 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->fileoff,
+		  __entry->len,
+		  __entry->state)
+);
+
+TRACE_EVENT(xreap_bmapi_binval_scan,
+	TP_PROTO(struct xfs_scrub *sc, const struct xfs_bmbt_irec *irec,
+		 xfs_extlen_t scan_blocks),
+	TP_ARGS(sc, irec, scan_blocks),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_filblks_t, len)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, scan_blocks)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->len = irec->br_blockcount;
+		__entry->agno = XFS_FSB_TO_AGNO(sc->mp, irec->br_startblock);
+		__entry->agbno = XFS_FSB_TO_AGBNO(sc->mp, irec->br_startblock);
+		__entry->scan_blocks = scan_blocks;
+	),
+	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%llx scan_blocks 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->scan_blocks)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/3] xfs: online repair of realtime summaries
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (25 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: create temporary files for online repair Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: support preallocating and copying content into temporary files Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
                   ` (50 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

We now have all the infrastructure we need to repair file metadata.
We'll begin with the realtime summary file, because it is the least
complex data structure.  To support this we need to add three more
pieces to the temporary file code from the previous patchset --
preallocating space in the temp file, formatting metadata into that
space and writing the blocks to disk, and swapping the fork mappings
atomically.

After that, the actual reconstruction of the realtime summary
information is pretty simple, since we can simply write the incore
copy computed by the rtsummary scrubber to the temporary file, swap the
contents, and reap the old blocks.
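
In terms of the helpers added in this and the previous patchset, the
repair boils down to roughly the following sequence (a condensed sketch;
rtsumblocks, fill_fn, and tx are placeholders for the summary file size
in fs blocks, a copyin callback, and a struct xrep_tempswap):

	/* Stage the new rtsummary in the tempfile, then commit it atomically. */
	error = xrep_tempfile_prealloc(sc, 0, rtsumblocks);
	if (error)
		return error;
	error = xrep_tempfile_copyin(sc, 0, rtsumblocks, fill_fn, NULL);
	if (error)
		return error;
	error = xrep_tempswap_trans_reserve(sc, XFS_DATA_FORK, &tx);
	if (error)
		return error;
	error = xrep_tempswap_contents(sc, &tx);
	if (error)
		return error;
	/* The old rtsummary blocks now live in the tempfile; reap them. */
	return xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK);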

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-rtsummary

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-rtsummary
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/scrub/common.c           |    1 
 fs/xfs/scrub/repair.h           |   14 +
 fs/xfs/scrub/rtsummary.c        |   18 +-
 fs/xfs/scrub/rtsummary.h        |   14 +
 fs/xfs/scrub/rtsummary_repair.c |  169 +++++++++++++++++
 fs/xfs/scrub/scrub.c            |    7 +
 fs/xfs/scrub/scrub.h            |    1 
 fs/xfs/scrub/tempfile.c         |  388 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.h         |   15 ++
 fs/xfs/scrub/tempswap.h         |   21 ++
 fs/xfs/scrub/trace.h            |   40 ++++
 12 files changed, 683 insertions(+), 6 deletions(-)
 create mode 100644 fs/xfs/scrub/rtsummary.h
 create mode 100644 fs/xfs/scrub/rtsummary_repair.c
 create mode 100644 fs/xfs/scrub/tempswap.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/3] xfs: support preallocating and copying content into temporary files
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of realtime summaries Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfs: online repair of realtime summaries Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: teach the tempfile to support atomic extent swapping Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the routines we need to preallocate space in a temporary ondisk
file and then copy the contents of an xfile into the tempfile.  The
upcoming rtsummary repair feature will construct the contents of a
realtime summary file in memory, after which it will want to copy all
that into the ondisk temporary file before atomically committing the new
rtsummary contents.
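
A condensed sketch of the intended usage (xrep_foo_fill_block and len
are illustrative; the real rtsummary repair arrives later in this
patchset):

	/* Hypothetical copyin callback: format one block's worth of data. */
	STATIC int
	xrep_foo_fill_block(
		struct xfs_scrub	*sc,
		struct xfs_buf		*bp,
		void			*data)
	{
		/* Copy up to sc->mp->m_sb.sb_blocksize bytes into bp->b_addr. */
		return 0;
	}

	/* Map @len written blocks into the tempfile, then write the data. */
	error = xrep_tempfile_prealloc(sc, 0, len);
	if (error)
		return error;
	error = xrep_tempfile_copyin(sc, 0, len, xrep_foo_fill_block, NULL);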

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/tempfile.c |  195 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.h |   15 ++++
 fs/xfs/scrub/trace.h    |   39 +++++++++
 3 files changed, 249 insertions(+)


diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index 797088deb7cb..d45ed229b8cb 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -14,14 +14,18 @@
 #include "xfs_inode.h"
 #include "xfs_ialloc.h"
 #include "xfs_quota.h"
+#include "xfs_bmap.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_trans_space.h"
 #include "xfs_dir2.h"
 #include "xfs_xchgrange.h"
+#include "xfs_defer.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
+#include "scrub/repair.h"
 #include "scrub/trace.h"
 #include "scrub/tempfile.h"
+#include "scrub/xfile.h"
 
 /*
  * Create a temporary file for reconstructing metadata, with the intention of
@@ -229,3 +233,194 @@ xrep_tempfile_rele(
 	xchk_irele(sc, sc->tempip);
 	sc->tempip = NULL;
 }
+
+/*
+ * Make sure that the given range of the data fork of the temporary file is
+ * mapped to written blocks.  The caller must ensure that both inodes are
+ * joined to the transaction.
+ */
+int
+xrep_tempfile_prealloc(
+	struct xfs_scrub	*sc,
+	xfs_fileoff_t		off,
+	xfs_filblks_t		len)
+{
+	struct xfs_bmbt_irec	map;
+	xfs_fileoff_t		end = off + len;
+	int			error;
+
+	ASSERT(sc->tempip != NULL);
+	ASSERT(!XFS_NOT_DQATTACHED(sc->mp, sc->tempip));
+
+	for (; off < end; off = map.br_startoff + map.br_blockcount) {
+		int		nmaps = 1;
+
+		/*
+		 * If we have a real extent mapping this block then we're
+		 * in ok shape.
+		 */
+		error = xfs_bmapi_read(sc->tempip, off, end - off, &map, &nmaps,
+				XFS_DATA_FORK);
+		if (error)
+			return error;
+		if (nmaps == 0) {
+			ASSERT(nmaps != 0);
+			return -EFSCORRUPTED;
+		}
+
+		if (xfs_bmap_is_written_extent(&map))
+			continue;
+
+		/*
+		 * If we find a delalloc reservation then something is very
+		 * very wrong.  Bail out.
+		 */
+		if (map.br_startblock == DELAYSTARTBLOCK)
+			return -EFSCORRUPTED;
+
+		/*
+		 * Make sure this block has a real zeroed extent allocated to
+		 * it.
+		 */
+		nmaps = 1;
+		error = xfs_bmapi_write(sc->tp, sc->tempip, off, end - off,
+				XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO, 0, &map,
+				&nmaps);
+		if (error)
+			return error;
+
+		trace_xrep_tempfile_prealloc(sc, XFS_DATA_FORK, &map);
+
+		/* Commit new extent and all deferred work. */
+		error = xfs_defer_finish(&sc->tp);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Write data to each block of a file.  The given range of the tempfile's data
+ * fork must already be populated with written extents.
+ */
+int
+xrep_tempfile_copyin(
+	struct xfs_scrub	*sc,
+	xfs_fileoff_t		off,
+	xfs_filblks_t		len,
+	xrep_tempfile_copyin_fn	prep_fn,
+	void			*data)
+{
+	LIST_HEAD(buffers_list);
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*bp;
+	xfs_fileoff_t		flush_mask;
+	xfs_fileoff_t		end = off + len;
+	loff_t			pos = XFS_FSB_TO_B(mp, off);
+	int			error = 0;
+
+	ASSERT(S_ISREG(VFS_I(sc->tempip)->i_mode));
+
+	/* Flush buffers to disk every 512K */
+	flush_mask = XFS_B_TO_FSBT(mp, (1U << 19)) - 1;
+
+	for (; off < end; off++, pos += mp->m_sb.sb_blocksize) {
+		struct xfs_bmbt_irec	map;
+		int			nmaps = 1;
+
+		/* Read block mapping for this file block. */
+		error = xfs_bmapi_read(sc->tempip, off, 1, &map, &nmaps, 0);
+		if (error)
+			goto out_err;
+		if (nmaps == 0 || !xfs_bmap_is_written_extent(&map)) {
+			error = -EFSCORRUPTED;
+			goto out_err;
+		}
+
+		/* Get the metadata buffer for this offset in the file. */
+		error = xfs_trans_get_buf(sc->tp, mp->m_ddev_targp,
+				XFS_FSB_TO_DADDR(mp, map.br_startblock),
+				mp->m_bsize, 0, &bp);
+		if (error)
+			goto out_err;
+
+		trace_xrep_tempfile_copyin(sc, XFS_DATA_FORK, &map);
+
+		/* Read in a block's worth of data from the xfile. */
+		error = prep_fn(sc, bp, data);
+		if (error) {
+			xfs_trans_brelse(sc->tp, bp);
+			goto out_err;
+		}
+
+		/* Queue buffer, and flush if we have too much dirty data. */
+		xfs_buf_delwri_queue_here(bp, &buffers_list);
+		xfs_trans_brelse(sc->tp, bp);
+
+		if (!(off & flush_mask)) {
+			error = xfs_buf_delwri_submit(&buffers_list);
+			if (error)
+				goto out_err;
+		}
+	}
+
+	/*
+	 * Write the new blocks to disk.  If the ordered list isn't empty after
+	 * that, then something went wrong and we have to fail.  This should
+	 * never happen, but we'll check anyway.
+	 */
+	error = xfs_buf_delwri_submit(&buffers_list);
+	if (error)
+		goto out_err;
+
+	if (!list_empty(&buffers_list)) {
+		ASSERT(list_empty(&buffers_list));
+		error = -EIO;
+		goto out_err;
+	}
+
+	return 0;
+
+out_err:
+	xfs_buf_delwri_cancel(&buffers_list);
+	return error;
+}
+
+/*
+ * Set the temporary file's size.  Caller must join the tempfile to the scrub
+ * transaction and is responsible for adjusting block mappings as needed.
+ */
+int
+xrep_tempfile_set_isize(
+	struct xfs_scrub	*sc,
+	unsigned long long	isize)
+{
+	if (sc->tempip->i_disk_size == isize)
+		return 0;
+
+	sc->tempip->i_disk_size = isize;
+	i_size_write(VFS_I(sc->tempip), isize);
+	return xrep_tempfile_roll_trans(sc);
+}
+
+/*
+ * Roll a repair transaction involving the temporary file.  Caller must join
+ * both the temporary file and the file being scrubbed to the transaction.
+ * This function returns with both inodes joined to a new scrub transaction,
+ * or the usual negative errno.
+ */
+int
+xrep_tempfile_roll_trans(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	xfs_trans_log_inode(sc->tp, sc->tempip, XFS_ILOG_CORE);
+	error = xrep_roll_trans(sc);
+	if (error)
+		return error;
+
+	xfs_trans_ijoin(sc->tp, sc->tempip, 0);
+	return 0;
+}
diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h
index f00a9ce43a32..282637f36f3d 100644
--- a/fs/xfs/scrub/tempfile.h
+++ b/fs/xfs/scrub/tempfile.h
@@ -16,6 +16,21 @@ void xrep_tempfile_iounlock(struct xfs_scrub *sc);
 void xrep_tempfile_ilock(struct xfs_scrub *sc);
 bool xrep_tempfile_ilock_nowait(struct xfs_scrub *sc);
 void xrep_tempfile_iunlock(struct xfs_scrub *sc);
+
+int xrep_tempfile_prealloc(struct xfs_scrub *sc, xfs_fileoff_t off,
+		xfs_filblks_t len);
+
+enum xfs_blft;
+
+typedef int (*xrep_tempfile_copyin_fn)(struct xfs_scrub *sc,
+		struct xfs_buf *bp, void *data);
+
+int xrep_tempfile_copyin(struct xfs_scrub *sc, xfs_fileoff_t off,
+		xfs_filblks_t len, xrep_tempfile_copyin_fn fn, void *data);
+
+int xrep_tempfile_set_isize(struct xfs_scrub *sc, unsigned long long isize);
+
+int xrep_tempfile_roll_trans(struct xfs_scrub *sc);
 #else
 static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc)
 {
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index b80de49ae831..292e2c3ece1d 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2192,6 +2192,45 @@ TRACE_EVENT(xrep_tempfile_create,
 		  __entry->temp_inum)
 );
 
+DECLARE_EVENT_CLASS(xrep_tempfile_class,
+	TP_PROTO(struct xfs_scrub *sc, int whichfork,
+		 struct xfs_bmbt_irec *irec),
+	TP_ARGS(sc, whichfork, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, whichfork)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_filblks_t, len)
+		__field(xfs_fsblock_t, pblk)
+		__field(int, state)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = sc->tempip->i_ino;
+		__entry->whichfork = whichfork;
+		__entry->lblk = irec->br_startoff;
+		__entry->len = irec->br_blockcount;
+		__entry->pblk = irec->br_startblock;
+		__entry->state = irec->br_state;
+	),
+	TP_printk("dev %d:%d ino 0x%llx whichfork %s fileoff 0x%llx fsbcount 0x%llx startblock 0x%llx state %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS),
+		  __entry->lblk,
+		  __entry->len,
+		  __entry->pblk,
+		  __entry->state)
+);
+#define DEFINE_XREP_TEMPFILE_EVENT(name) \
+DEFINE_EVENT(xrep_tempfile_class, name, \
+	TP_PROTO(struct xfs_scrub *sc, int whichfork, \
+		 struct xfs_bmbt_irec *irec), \
+	TP_ARGS(sc, whichfork, irec))
+DEFINE_XREP_TEMPFILE_EVENT(xrep_tempfile_prealloc);
+DEFINE_XREP_TEMPFILE_EVENT(xrep_tempfile_copyin);
+
 TRACE_EVENT(xreap_ifork_extent,
 	TP_PROTO(struct xfs_scrub *sc, struct xfs_inode *ip, int whichfork,
 		 const struct xfs_bmbt_irec *irec),


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs: teach the tempfile to support atomic extent swapping
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of realtime summaries Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: support preallocating and copying content into temporary files Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfs: online repair of realtime summaries Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create some new routines to exchange the contents of a temporary file
created to stage a repair with another ondisk file.  This will be used
by the realtime summary repair function to commit atomically the new
rtsummary data, which will be staged in the tempfile.

The rest of XFS coordinates access to the realtime metadata inodes
solely through the ILOCK.  For repair to hold its exclusive access to
the realtime summary file, it has to allocate a single large transaction
and roll it repeatedly throughout the repair while holding the ILOCK.
In turn, this means that for now there's only a partial swapext
implementation for the temporary file, because we can only work within
an existing transaction.  Hence the only tempswap functions needed here
are to estimate the resource requirements of swapext between, reserve
more space/quota to an existing transaction, and kick off the actual
swap.  The rest will be added in a later patch in preparation for
repairing xattrs and directories.
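
Put together, a repair that has rebuilt a fork in the tempfile commits
it with something like the following sketch (error handling trimmed;
the whichfork choice is illustrative):

	struct xrep_tempswap	tx;
	int			error;

	/* During setup, before the transaction gets dirty: */
	error = xrep_tempswap_grab_log_assist(sc);
	if (error)
		return error;

	/* After rebuilding the fork contents in the tempfile: */
	error = xrep_tempswap_trans_reserve(sc, XFS_DATA_FORK, &tx);
	if (error)
		return error;

	/* Exchange the fork mappings between sc->tempip and sc->ip. */
	error = xrep_tempswap_contents(sc, &tx);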

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/scrub.c    |    4 +
 fs/xfs/scrub/scrub.h    |    1 
 fs/xfs/scrub/tempfile.c |  193 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempswap.h |   21 +++++
 fs/xfs/scrub/trace.h    |    1 
 5 files changed, 220 insertions(+)
 create mode 100644 fs/xfs/scrub/tempswap.h


diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 7bab30c2766e..a994710d99ae 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -193,6 +193,10 @@ xchk_teardown(
 		xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
 	}
+	if (sc->flags & XREP_ATOMIC_EXCHANGE) {
+		sc->flags &= ~XREP_ATOMIC_EXCHANGE;
+		xfs_xchg_range_rele_log_assist(sc->mp);
+	}
 	if (sc->flags & XCHK_HAVE_FREEZE_PROT) {
 		sc->flags &= ~XCHK_HAVE_FREEZE_PROT;
 		mnt_drop_write_file(sc->file);
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 7892901ad70b..9c26a6092c52 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -131,6 +131,7 @@ struct xfs_scrub {
 #define XCHK_FSHOOKS_QUOTA	(1 << 4)  /* quota live update enabled */
 #define XCHK_FSHOOKS_NLINKS	(1 << 5)  /* link count live update enabled */
 #define XCHK_FSHOOKS_RMAP	(1 << 6)  /* rmapbt live update enabled */
+#define XREP_ATOMIC_EXCHANGE	(1 << 29) /* uses atomic file content exchange */
 #define XREP_RESET_PERAG_RESV	(1 << 30) /* must reset AG space reservation */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index d45ed229b8cb..7214d2370bc9 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -19,12 +19,14 @@
 #include "xfs_trans_space.h"
 #include "xfs_dir2.h"
 #include "xfs_xchgrange.h"
+#include "xfs_swapext.h"
 #include "xfs_defer.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/repair.h"
 #include "scrub/trace.h"
 #include "scrub/tempfile.h"
+#include "scrub/tempswap.h"
 #include "scrub/xfile.h"
 
 /*
@@ -424,3 +426,194 @@ xrep_tempfile_roll_trans(
 	xfs_trans_ijoin(sc->tp, sc->tempip, 0);
 	return 0;
 }
+
+/* Enable atomic extent swapping. */
+int
+xrep_tempswap_grab_log_assist(
+	struct xfs_scrub	*sc)
+{
+	bool			need_rele = false;
+	int			error;
+
+	ASSERT(!(sc->flags & XREP_ATOMIC_EXCHANGE));
+
+	error = xfs_xchg_range_grab_log_assist(sc->mp, true, &need_rele);
+	if (error)
+		return error;
+	if (!need_rele) {
+		ASSERT(need_rele);
+		return -EOPNOTSUPP;
+	}
+
+	sc->flags |= XREP_ATOMIC_EXCHANGE;
+	return 0;
+}
+
+/*
+ * Fill out the swapext request in preparation for swapping the contents of a
+ * metadata file that we've rebuilt in the temp file.
+ */
+STATIC int
+xrep_tempswap_prep_request(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	struct xrep_tempswap	*tx)
+{
+	struct xfs_swapext_req	*req = &tx->req;
+
+	memset(tx, 0, sizeof(struct xrep_tempswap));
+
+	/* COW forks don't exist on disk. */
+	if (whichfork == XFS_COW_FORK) {
+		ASSERT(0);
+		return -EINVAL;
+	}
+
+	/* Both files should have the relevant forks. */
+	if (!xfs_ifork_ptr(sc->ip, whichfork) ||
+	    !xfs_ifork_ptr(sc->tempip, whichfork)) {
+		ASSERT(0);
+		return -EINVAL;
+	}
+
+	/* Swap all mappings in both forks. */
+	req->ip1 = sc->tempip;
+	req->ip2 = sc->ip;
+	req->startoff1 = 0;
+	req->startoff2 = 0;
+	req->whichfork = whichfork;
+	req->blockcount = XFS_MAX_FILEOFF;
+	req->req_flags = XFS_SWAP_REQ_LOGGED;
+
+	/* Always swap sizes when we're swapping data fork mappings. */
+	if (whichfork == XFS_DATA_FORK)
+		req->req_flags |= XFS_SWAP_REQ_SET_SIZES;
+
+	/*
+	 * If we're repairing symlinks, xattrs, or directories, always try to
+	 * convert ip2 to short format after swapping.
+	 */
+	if (whichfork == XFS_ATTR_FORK || S_ISDIR(VFS_I(sc->ip)->i_mode) ||
+	    S_ISLNK(VFS_I(sc->ip)->i_mode))
+		req->req_flags |= XFS_SWAP_REQ_CVT_INO2_SF;
+
+	return 0;
+}
+
+/*
+ * Obtain a quota reservation to make sure we don't hit EDQUOT.  We can skip
+ * this if quota enforcement is disabled or if both inodes' dquots are the
+ * same.  The qretry structure must be initialized to zeroes before the first
+ * call to this function.
+ */
+STATIC int
+xrep_tempswap_reserve_quota(
+	struct xfs_scrub		*sc,
+	const struct xrep_tempswap	*tx)
+{
+	struct xfs_trans		*tp = sc->tp;
+	const struct xfs_swapext_req	*req = &tx->req;
+	int64_t				ddelta, rdelta;
+	int				error;
+
+	/*
+	 * Don't bother with a quota reservation if we're not enforcing them
+	 * or the two inodes have the same dquots.
+	 */
+	if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
+	    (req->ip1->i_udquot == req->ip2->i_udquot &&
+	     req->ip1->i_gdquot == req->ip2->i_gdquot &&
+	     req->ip1->i_pdquot == req->ip2->i_pdquot))
+		return 0;
+
+	/*
+	 * Quota reservation for each file comes from two sources.  First, we
+	 * need to account for any net gain in mapped blocks during the swap.
+	 * Second, we need reservation for the gross gain in mapped blocks so
+	 * that we don't trip over any quota block reservation assertions.  We
+	 * must reserve the gross gain because the quota code subtracts from
+	 * bcount the number of blocks that we unmap; it does not add that
+	 * quantity back to the quota block reservation.
+	 */
+	ddelta = max_t(int64_t, 0, req->ip2_bcount - req->ip1_bcount);
+	rdelta = max_t(int64_t, 0, req->ip2_rtbcount - req->ip1_rtbcount);
+	error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
+			ddelta + req->ip1_bcount, rdelta + req->ip1_rtbcount,
+			true);
+	if (error)
+		return error;
+
+	ddelta = max_t(int64_t, 0, req->ip1_bcount - req->ip2_bcount);
+	rdelta = max_t(int64_t, 0, req->ip1_rtbcount - req->ip2_rtbcount);
+	return xfs_trans_reserve_quota_nblks(tp, req->ip2,
+			ddelta + req->ip2_bcount, rdelta + req->ip2_rtbcount,
+			true);
+}
+
+/*
+ * Prepare an existing transaction for a swap.  The caller must hold
+ * the ILOCK of both the inode being repaired and the temporary file.
+ * Only use this when those ILOCKs cannot be dropped.
+ *
+ * Fill out the swapext request and resource estimation structures in
+ * preparation for swapping the contents of a metadata file that we've rebuilt
+ * in the temp file, then reserve space and quota to the transaction.
+ */
+int
+xrep_tempswap_trans_reserve(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	struct xrep_tempswap	*tx)
+{
+	int			error;
+
+	ASSERT(sc->tp != NULL);
+	ASSERT(xfs_isilocked(sc->ip, XFS_ILOCK_EXCL));
+	ASSERT(xfs_isilocked(sc->tempip, XFS_ILOCK_EXCL));
+
+	error = xrep_tempswap_prep_request(sc, whichfork, tx);
+	if (error)
+		return error;
+
+	error = xfs_swapext_estimate(&tx->req);
+	if (error)
+		return error;
+
+	error = xfs_trans_reserve_more(sc->tp, tx->req.resblks, 0);
+	if (error)
+		return error;
+
+	return xrep_tempswap_reserve_quota(sc, tx);
+}
+
+/* Swap forks between the file being repaired and the temporary file. */
+int
+xrep_tempswap_contents(
+	struct xfs_scrub	*sc,
+	struct xrep_tempswap	*tx)
+{
+	int			error;
+
+	ASSERT(sc->flags & XREP_ATOMIC_EXCHANGE);
+
+	xfs_swapext(sc->tp, &tx->req);
+	error = xfs_defer_finish(&sc->tp);
+	if (error)
+		return error;
+
+	/*
+	 * If we swapped the ondisk sizes of two metadata files, we must swap
+	 * the incore sizes as well.  Since online fsck doesn't use swapext on
+	 * the data forks of user-accessible files, the two sizes are always
+	 * the same, so we don't need to log the inodes.
+	 */
+	if (tx->req.req_flags & XFS_SWAP_REQ_SET_SIZES) {
+		loff_t	temp;
+
+		temp = i_size_read(VFS_I(sc->ip));
+		i_size_write(VFS_I(sc->ip), i_size_read(VFS_I(sc->tempip)));
+		i_size_write(VFS_I(sc->tempip), temp);
+	}
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/tempswap.h b/fs/xfs/scrub/tempswap.h
new file mode 100644
index 000000000000..62e88cc6d91a
--- /dev/null
+++ b/fs/xfs/scrub/tempswap.h
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_TEMPSWAP_H__
+#define __XFS_SCRUB_TEMPSWAP_H__
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+struct xrep_tempswap {
+	struct xfs_swapext_req	req;
+};
+
+int xrep_tempswap_grab_log_assist(struct xfs_scrub *sc);
+int xrep_tempswap_trans_reserve(struct xfs_scrub *sc, int whichfork,
+		struct xrep_tempswap *ti);
+
+int xrep_tempswap_contents(struct xfs_scrub *sc, struct xrep_tempswap *ti);
+#endif /* CONFIG_XFS_ONLINE_REPAIR */
+
+#endif /* __XFS_SCRUB_TEMPSWAP_H__ */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 292e2c3ece1d..aebfaef07e2d 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -124,6 +124,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
 	{ XCHK_FSHOOKS_QUOTA,			"fshooks_quota" }, \
 	{ XCHK_FSHOOKS_NLINKS,			"fshooks_nlinks" }, \
 	{ XCHK_FSHOOKS_RMAP,			"fshooks_rmap" }, \
+	{ XREP_ATOMIC_EXCHANGE,			"atomic_swapext" }, \
 	{ XREP_RESET_PERAG_RESV,		"reset_perag_resv" }, \
 	{ XREP_ALREADY_FIXED,			"already_fixed" }
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/3] xfs: online repair of realtime summaries
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of realtime summaries Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: support preallocating and copying content into temporary files Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: teach the tempfile to support atomic extent swapping Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Repair the realtime summary data by constructing a new rtsummary file in
the scrub temporary file, then atomically swapping the contents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/scrub/common.c           |    1 
 fs/xfs/scrub/repair.h           |   14 +++
 fs/xfs/scrub/rtsummary.c        |   18 +++-
 fs/xfs/scrub/rtsummary.h        |   14 +++
 fs/xfs/scrub/rtsummary_repair.c |  169 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c            |    3 -
 7 files changed, 214 insertions(+), 6 deletions(-)
 create mode 100644 fs/xfs/scrub/rtsummary.h
 create mode 100644 fs/xfs/scrub/rtsummary_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 6df1cd3b46ca..0abdcc69cd7f 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -207,6 +207,7 @@ xfs-y				+= $(addprefix scrub/, \
 
 xfs-$(CONFIG_XFS_RT)		+= $(addprefix scrub/, \
 				   rtbitmap_repair.o \
+				   rtsummary_repair.o \
 				   )
 
 xfs-$(CONFIG_XFS_QUOTA)		+= $(addprefix scrub/, \
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 7eade2567af6..2fbd8aa01ef7 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -32,6 +32,7 @@
 #include "xfs_ag.h"
 #include "xfs_error.h"
 #include "xfs_quota.h"
+#include "xfs_swapext.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 3b25f2fa629e..086e8e739264 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -80,6 +80,8 @@ int xrep_bmap(struct xfs_scrub *sc, int whichfork, bool allow_unwritten);
 int xrep_metadata_inode_forks(struct xfs_scrub *sc);
 int xrep_setup_ag_rmapbt(struct xfs_scrub *sc);
 int xrep_setup_ag_refcountbt(struct xfs_scrub *sc);
+int xrep_setup_rtsummary(struct xfs_scrub *sc, unsigned int *resblks,
+		size_t *bufsize);
 
 /* Repair setup functions */
 int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
@@ -117,8 +119,10 @@ int xrep_fscounters(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
+int xrep_rtsummary(struct xfs_scrub *sc);
 #else
 # define xrep_rtbitmap			xrep_notsupported
+# define xrep_rtsummary			xrep_notsupported
 #endif /* CONFIG_XFS_RT */
 
 #ifdef CONFIG_XFS_QUOTA
@@ -186,6 +190,15 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 	return 0;
 }
 
+static inline int
+xrep_setup_rtsummary(
+	struct xfs_scrub	*sc,
+	unsigned int		*whatever,
+	size_t			*dontcare)
+{
+	return 0;
+}
+
 #define xrep_revalidate_allocbt		(NULL)
 #define xrep_revalidate_iallocbt	(NULL)
 
@@ -207,6 +220,7 @@ static inline int xrep_setup_rtbitmap(struct xfs_scrub *sc, unsigned int *x)
 #define xrep_quotacheck			xrep_notsupported
 #define xrep_nlinks			xrep_notsupported
 #define xrep_fscounters			xrep_notsupported
+#define xrep_rtsummary			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index 73c75d41ef3c..7d1bc49fb3dd 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -20,6 +20,8 @@
 #include "scrub/common.h"
 #include "scrub/trace.h"
 #include "scrub/xfile.h"
+#include "scrub/repair.h"
+#include "scrub/rtsummary.h"
 
 /*
  * Realtime Summary
@@ -37,8 +39,16 @@ xchk_setup_rtsummary(
 	struct xfs_scrub	*sc)
 {
 	struct xfs_mount	*mp = sc->mp;
+	size_t			bufsize = mp->m_sb.sb_blocksize;
+	unsigned int		resblks = 0;
 	int			error;
 
+	if (xchk_could_repair(sc)) {
+		error = xrep_setup_rtsummary(sc, &resblks, &bufsize);
+		if (error)
+			return error;
+	}
+
 	/*
 	 * Create an xfile to construct a new rtsummary file.  The xfile allows
 	 * us to avoid pinning kernel memory for this purpose.
@@ -48,12 +58,12 @@ xchk_setup_rtsummary(
 	if (error)
 		return error;
 
-	error = xchk_trans_alloc(sc, 0);
+	error = xchk_trans_alloc(sc, resblks);
 	if (error)
 		return error;
 
 	/* Allocate a memory buffer for the summary comparison. */
-	sc->buf = kvmalloc(mp->m_sb.sb_blocksize, XCHK_GFP_FLAGS);
+	sc->buf = kvmalloc(bufsize, XCHK_GFP_FLAGS);
 	if (!sc->buf)
 		return -ENOMEM;
 
@@ -78,8 +88,6 @@ xchk_setup_rtsummary(
 
 /* Helper functions to record suminfo words in an xfile. */
 
-typedef unsigned int xchk_rtsumoff_t;
-
 static inline int
 xfsum_load(
 	struct xfs_scrub	*sc,
@@ -100,7 +108,7 @@ xfsum_store(
 			sumoff << XFS_WORDLOG);
 }
 
-static inline int
+inline int
 xfsum_copyout(
 	struct xfs_scrub	*sc,
 	xchk_rtsumoff_t		sumoff,
diff --git a/fs/xfs/scrub/rtsummary.h b/fs/xfs/scrub/rtsummary.h
new file mode 100644
index 000000000000..e5f3c69c4cbf
--- /dev/null
+++ b/fs/xfs/scrub/rtsummary.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_RTSUMMARY_H__
+#define __XFS_SCRUB_RTSUMMARY_H__
+
+typedef unsigned int xchk_rtsumoff_t;
+
+int xfsum_copyout(struct xfs_scrub *sc, xchk_rtsumoff_t sumoff,
+		xfs_suminfo_t *info, unsigned int nr_words);
+
+#endif /* __XFS_SCRUB_RTSUMMARY_H__ */
diff --git a/fs/xfs/scrub/rtsummary_repair.c b/fs/xfs/scrub/rtsummary_repair.c
new file mode 100644
index 000000000000..f5c14c50ebf3
--- /dev/null
+++ b/fs/xfs/scrub/rtsummary_repair.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_rtalloc.h"
+#include "xfs_inode.h"
+#include "xfs_bit.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_swapext.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/tempfile.h"
+#include "scrub/tempswap.h"
+#include "scrub/reap.h"
+#include "scrub/xfile.h"
+#include "scrub/rtsummary.h"
+
+struct xrep_rtsummary {
+	/* suminfo position of xfile as we write buffers to disk. */
+	xchk_rtsumoff_t		prep_wordoff;
+};
+
+/* Set us up to repair the rtsummary file. */
+int
+xrep_setup_rtsummary(
+	struct xfs_scrub	*sc,
+	unsigned int		*resblks,
+	size_t			*bufsize)
+{
+	struct xfs_mount	*mp = sc->mp;
+	unsigned long long	blocks;
+	int			error;
+
+	*bufsize = max(*bufsize, sizeof(struct xrep_tempswap));
+
+	error = xrep_tempfile_create(sc, S_IFREG);
+	if (error)
+		return error;
+
+	/*
+	 * If we're doing a repair, we reserve enough blocks to write out a
+	 * completely new summary file, plus twice as many blocks as we would
+	 * need if we can only allocate one block per data fork mapping.  This
+	 * should cover the preallocation of the temporary file and swapping
+	 * the extent mappings.
+	 *
+	 * We cannot use xfs_swapext_estimate because we have not yet
+	 * constructed the replacement rtsummary and therefore do not know how
+	 * many extents it will use.  By the time we do, we will have a dirty
+	 * transaction (which we cannot drop because we cannot drop the
+	 * rtsummary ILOCK) and cannot ask for more reservation.
+	 */
+	blocks = XFS_B_TO_FSB(mp, mp->m_rsumsize);
+	blocks += xfs_bmbt_calc_size(mp, blocks) * 2;
+	if (blocks > UINT_MAX)
+		return -EOPNOTSUPP;
+
+	*resblks += blocks;
+
+	/*
+	 * Grab support for atomic extent swapping before we allocate any
+	 * transactions or grab ILOCKs.
+	 */
+	return xrep_tempswap_grab_log_assist(sc);
+}
+
+static int
+xrep_rtsummary_prep_buf(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*bp,
+	void			*data)
+{
+	struct xrep_rtsummary	*rs = data;
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	bp->b_ops = &xfs_rtbuf_ops;
+
+	error = xfsum_copyout(sc, rs->prep_wordoff, bp->b_addr,
+			mp->m_blockwsize);
+	if (error)
+		return error;
+
+	rs->prep_wordoff += mp->m_blockwsize;
+	xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_RTSUMMARY_BUF);
+	return 0;
+}
+
+/* Repair the realtime summary. */
+int
+xrep_rtsummary(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_rtsummary	rs = { .prep_wordoff = 0, };
+	struct xrep_tempswap	*ti = NULL;
+	xfs_filblks_t		rsumblocks;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_has_rmapbt(sc->mp))
+		return -EOPNOTSUPP;
+
+	/* Make sure any problems with the fork are fixed. */
+	error = xrep_metadata_inode_forks(sc);
+	if (error)
+		return error;
+
+	/*
+	 * Try to take ILOCK_EXCL of the temporary file.  We had better be the
+	 * only ones holding onto this inode, but we can't block while holding
+	 * the rtsummary file's ILOCK_EXCL.
+	 */
+	while (!xrep_tempfile_ilock_nowait(sc)) {
+		if (xchk_should_terminate(sc, &error))
+			return error;
+		delay(1);
+	}
+
+	/* Make sure we have space allocated for the entire summary file. */
+	rsumblocks = XFS_B_TO_FSB(sc->mp, sc->mp->m_rsumsize);
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+	xfs_trans_ijoin(sc->tp, sc->tempip, 0);
+	error = xrep_tempfile_prealloc(sc, 0, rsumblocks);
+	if (error)
+		return error;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	/* Copy the rtsummary file that we generated. */
+	error = xrep_tempfile_copyin(sc, 0, rsumblocks,
+			xrep_rtsummary_prep_buf, &rs);
+	if (error)
+		return error;
+	error = xrep_tempfile_set_isize(sc, sc->mp->m_rsumsize);
+	if (error)
+		return error;
+
+	/*
+	 * Now swap the extents.  Nothing in repair uses the temporary buffer,
+	 * so we can reuse it for the tempfile swapext information.
+	 */
+	ti = sc->buf;
+	error = xrep_tempswap_trans_reserve(sc, XFS_DATA_FORK, ti);
+	if (error)
+		return error;
+
+	error = xrep_tempswap_contents(sc, ti);
+	if (error)
+		return error;
+	ti = NULL;
+
+	/* Free the old rtsummary blocks if they're not in use. */
+	return xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK);
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index a994710d99ae..a9030603b424 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -19,6 +19,7 @@
 #include "xfs_btree_staging.h"
 #include "xfs_rmap.h"
 #include "xfs_xchgrange.h"
+#include "xfs_swapext.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -358,7 +359,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.setup	= xchk_setup_rtsummary,
 		.scrub	= xchk_rtsummary,
 		.has	= xfs_has_realtime,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_rtsummary,
 	},
 	[XFS_SCRUB_TYPE_UQUOTA] = {	/* user quota */
 		.type	= ST_FS,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (26 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of realtime summaries Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/9] xfs: use the xfs_da_args owner field to set new dir/attr block owner Darrick J. Wong
                     ` (8 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
                   ` (49 subsequent siblings)
  77 siblings, 9 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

There are a couple of significant changes that need to be made to the
directory and xattr code before we can support online repairs of those
data structures.

The first change is needed because online repair is designed to use
libxfs to create a replacement dir/xattr structure in a temporary file,
and then use atomic extent swapping to commit the corrected structure.
To avoid the performance hit of walking every block of the new structure
to rewrite the owner number before the swap, we instead change libxfs to
let callers of the dir and xattr code set an explicit owner number to be
written into the header fields of any new blocks that are created.  For
regular operation this will be the directory inode number.

The second change is to update the dir/xattr code to actually *check*
the owner number in each block that is read off the disk, since we don't
currently do that.
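
To make that concrete, here's a rough sketch of the lifecycle we're
after.  This is illustrative only; the exact struct types and field
paths vary by block type, and the real code is in the patches:

	/* at block creation time, stamp the caller-provided owner: */
	hdr3->owner = cpu_to_be64(args->owner);

	/* at block read time, verify the block belongs to that inode: */
	if (xfs_has_crc(mp) && be64_to_cpu(hdr3->owner) != owner)
		return __this_address;	/* caller marks the buffer corrupt */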

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=dirattr-validate-owners

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=dirattr-validate-owners
---
 fs/xfs/libxfs/xfs_attr.c        |   10 +-
 fs/xfs/libxfs/xfs_attr_leaf.c   |   59 +++++++++++---
 fs/xfs/libxfs/xfs_attr_leaf.h   |    4 +
 fs/xfs/libxfs/xfs_attr_remote.c |   13 +--
 fs/xfs/libxfs/xfs_bmap.c        |    1 
 fs/xfs/libxfs/xfs_da_btree.c    |  168 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_da_btree.h    |    3 +
 fs/xfs/libxfs/xfs_dir2.c        |    5 +
 fs/xfs/libxfs/xfs_dir2.h        |    4 +
 fs/xfs/libxfs/xfs_dir2_block.c  |   44 ++++++----
 fs/xfs/libxfs/xfs_dir2_data.c   |   17 ++--
 fs/xfs/libxfs/xfs_dir2_leaf.c   |   99 ++++++++++++++++++-----
 fs/xfs/libxfs/xfs_dir2_node.c   |   44 ++++++----
 fs/xfs/libxfs/xfs_dir2_priv.h   |   11 +--
 fs/xfs/libxfs/xfs_swapext.c     |    7 +-
 fs/xfs/scrub/attr.c             |    1 
 fs/xfs/scrub/dabtree.c          |    8 ++
 fs/xfs/scrub/dir.c              |   23 +++--
 fs/xfs/scrub/readdir.c          |    6 +
 fs/xfs/xfs_acl.c                |    2 
 fs/xfs/xfs_attr_item.c          |    1 
 fs/xfs/xfs_attr_list.c          |   35 +++++++-
 fs/xfs/xfs_dir2_readdir.c       |    6 +
 fs/xfs/xfs_ioctl.c              |    2 
 fs/xfs/xfs_iops.c               |    1 
 fs/xfs/xfs_trace.h              |    7 +-
 fs/xfs/xfs_xattr.c              |    2 
 27 files changed, 464 insertions(+), 119 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/9] xfs: add an explicit owner field to xfs_da_args
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 3/9] xfs: validate attr leaf " Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 5/9] xfs: validate dabtree node buffer owners Darrick J. Wong
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add an explicit owner field to xfs_da_args, which will make it easier
for online fsck to set the owner field of the temporary directory and
xattr structures that it builds to repair damaged metadata.
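
As a sketch of how repair intends to use this field (the scrub context
names below are assumptions borrowed from the online fsck series, not
something this patch adds), a rebuild of a damaged directory into a
temporary file would set up its da_args roughly as follows so that every
new block is stamped with the owner it will have after the extent swap:

	struct xfs_da_args	args = {
		.dp	= sc->tempip,		/* temp file being filled */
		.owner	= sc->ip->i_ino,	/* future owner of the blocks */
		.geo	= sc->mp->m_dir_geo,
		.trans	= sc->tp,
	};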

Note: I hope I've found all the xfs_da_args definitions by looking for
automatic stack variable declarations and xfs_da_args.dp assignments:

git grep -E '(args.*dp =|struct xfs_da_args[[:space:]]*[a-z0-9][a-z0-9]*)'

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_leaf.c |    2 ++
 fs/xfs/libxfs/xfs_bmap.c      |    1 +
 fs/xfs/libxfs/xfs_da_btree.h  |    1 +
 fs/xfs/libxfs/xfs_dir2.c      |    5 +++++
 fs/xfs/libxfs/xfs_swapext.c   |    2 ++
 fs/xfs/scrub/attr.c           |    1 +
 fs/xfs/scrub/dabtree.c        |    1 +
 fs/xfs/scrub/dir.c            |    3 ++-
 fs/xfs/scrub/readdir.c        |    2 ++
 fs/xfs/xfs_acl.c              |    2 ++
 fs/xfs/xfs_attr_item.c        |    1 +
 fs/xfs/xfs_dir2_readdir.c     |    1 +
 fs/xfs/xfs_ioctl.c            |    2 ++
 fs/xfs/xfs_iops.c             |    1 +
 fs/xfs/xfs_trace.h            |    7 +++++--
 fs/xfs/xfs_xattr.c            |    2 ++
 16 files changed, 31 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 1ba16c369f26..e771167c8766 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -975,6 +975,7 @@ xfs_attr_shortform_to_leaf(
 	nargs.whichfork = XFS_ATTR_FORK;
 	nargs.trans = args->trans;
 	nargs.op_flags = XFS_DA_OP_OKNOENT;
+	nargs.owner = args->owner;
 
 	sfe = &sf->list[0];
 	for (i = 0; i < sf->hdr.count; i++) {
@@ -1195,6 +1196,7 @@ xfs_attr3_leaf_to_shortform(
 	nargs.whichfork = XFS_ATTR_FORK;
 	nargs.trans = args->trans;
 	nargs.op_flags = XFS_DA_OP_OKNOENT;
+	nargs.owner = args->owner;
 
 	for (i = 0; i < ichdr.count; entry++, i++) {
 		if (entry->flags & XFS_ATTR_INCOMPLETE)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index cf85f3896afa..7cbb96a805a3 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -968,6 +968,7 @@ xfs_bmap_add_attrfork_local(
 		dargs.total = dargs.geo->fsbcount;
 		dargs.whichfork = XFS_DATA_FORK;
 		dargs.trans = tp;
+		dargs.owner = ip->i_ino;
 		return xfs_dir2_sf_to_block(&dargs);
 	}
 
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index ffa3df5b2893..52694dc0cd3c 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -79,6 +79,7 @@ typedef struct xfs_da_args {
 	int		rmtvaluelen2;	/* remote attr value length in bytes */
 	uint32_t	op_flags;	/* operation flags */
 	enum xfs_dacmp	cmpresult;	/* name compare result for lookups */
+	xfs_ino_t	owner;		/* inode that owns the dir/attr data */
 } xfs_da_args_t;
 
 /*
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index f56710005f43..fb0697dc733f 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -249,6 +249,7 @@ xfs_dir_init(
 	args->geo = dp->i_mount->m_dir_geo;
 	args->dp = dp;
 	args->trans = tp;
+	args->owner = dp->i_ino;
 	error = xfs_dir2_sf_create(args, pdp->i_ino);
 	kmem_free(args);
 	return error;
@@ -294,6 +295,7 @@ xfs_dir_createname(
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+	args->owner = dp->i_ino;
 	if (!inum)
 		args->op_flags |= XFS_DA_OP_JUSTCHECK;
 
@@ -388,6 +390,7 @@ xfs_dir_lookup(
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
 	args->op_flags = XFS_DA_OP_OKNOENT;
+	args->owner = dp->i_ino;
 	if (ci_name)
 		args->op_flags |= XFS_DA_OP_CILOOKUP;
 
@@ -461,6 +464,7 @@ xfs_dir_removename(
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
+	args->owner = dp->i_ino;
 
 	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_removename(args);
@@ -522,6 +526,7 @@ xfs_dir_replace(
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
+	args->owner = dp->i_ino;
 
 	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
 		rval = xfs_dir2_sf_replace(args);
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 69812594fd71..65b1ccb162ad 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -466,6 +466,7 @@ xfs_swapext_attr_to_sf(
 		.geo		= tp->t_mountp->m_attr_geo,
 		.whichfork	= XFS_ATTR_FORK,
 		.trans		= tp,
+		.owner		= sxi->sxi_ip2->i_ino,
 	};
 	struct xfs_buf		*bp;
 	int			forkoff;
@@ -496,6 +497,7 @@ xfs_swapext_dir_to_sf(
 		.geo		= tp->t_mountp->m_dir_geo,
 		.whichfork	= XFS_DATA_FORK,
 		.trans		= tp,
+		.owner		= sxi->sxi_ip2->i_ino,
 	};
 	struct xfs_dir2_sf_hdr	sfh;
 	struct xfs_buf		*bp;
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index ea9d0f1a6fd0..090710acc4b6 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -169,6 +169,7 @@ xchk_xattr_listent(
 		.hashval		= xfs_da_hashname(name, namelen),
 		.trans			= context->tp,
 		.valuelen		= valuelen,
+		.owner			= context->dp->i_ino,
 	};
 	struct xchk_xattr_buf		*ab;
 	struct xchk_xattr		*sx;
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index 957a0b1a2f0b..c8274a7e0cfd 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -494,6 +494,7 @@ xchk_da_btree(
 	ds->dargs.whichfork = whichfork;
 	ds->dargs.trans = sc->tp;
 	ds->dargs.op_flags = XFS_DA_OP_OKNOENT;
+	ds->dargs.owner = sc->ip->i_ino;
 	ds->state = xfs_da_state_alloc(&ds->dargs);
 	ds->sc = sc;
 	ds->private = private;
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 5b47d3cc8f78..8b4e58416327 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -614,10 +614,11 @@ xchk_directory_blocks(
 {
 	struct xfs_bmbt_irec	got;
 	struct xfs_da_args	args = {
-		.dp		= sc ->ip,
+		.dp		= sc->ip,
 		.whichfork	= XFS_DATA_FORK,
 		.geo		= sc->mp->m_dir_geo,
 		.trans		= sc->tp,
+		.owner		= sc->ip->i_ino,
 	};
 	struct xfs_ifork	*ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK);
 	struct xfs_mount	*mp = sc->mp;
diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c
index fd888fe5151f..59a658362b2b 100644
--- a/fs/xfs/scrub/readdir.c
+++ b/fs/xfs/scrub/readdir.c
@@ -275,6 +275,7 @@ xchk_dir_walk(
 		.dp		= dp,
 		.geo		= dp->i_mount->m_dir_geo,
 		.trans		= sc->tp,
+		.owner		= dp->i_ino,
 	};
 	bool			isblock;
 	int			error;
@@ -326,6 +327,7 @@ xchk_dir_lookup(
 		.hashval	= xfs_dir2_hashname(dp->i_mount, name),
 		.whichfork	= XFS_DATA_FORK,
 		.op_flags	= XFS_DA_OP_OKNOENT,
+		.owner		= sc->ip->i_ino,
 	};
 	bool			isblock, isleaf;
 	int			error;
diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c
index a05f44eb8178..9108edbb9442 100644
--- a/fs/xfs/xfs_acl.c
+++ b/fs/xfs/xfs_acl.c
@@ -135,6 +135,7 @@ xfs_get_acl(struct inode *inode, int type, bool rcu)
 		.dp		= ip,
 		.attr_filter	= XFS_ATTR_ROOT,
 		.valuelen	= XFS_ACL_MAX_SIZE(mp),
+		.owner		= ip->i_ino,
 	};
 	int			error;
 
@@ -178,6 +179,7 @@ __xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	struct xfs_da_args	args = {
 		.dp		= ip,
 		.attr_filter	= XFS_ATTR_ROOT,
+		.owner		= ip->i_ino,
 	};
 	int			error;
 
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 2788a6f2edcd..0891d739666e 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -593,6 +593,7 @@ xfs_attri_item_recover(
 	args->attr_filter = attrp->alfi_attr_filter & XFS_ATTRI_FILTER_MASK;
 	args->op_flags = XFS_DA_OP_RECOVERY | XFS_DA_OP_OKNOENT |
 			 XFS_DA_OP_LOGGED;
+	args->owner = ip->i_ino;
 
 	ASSERT(xfs_sb_version_haslogxattrs(&mp->m_sb));
 
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 4c061b48da18..fc2524b8f1f1 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -532,6 +532,7 @@ xfs_readdir(
 	args.dp = dp;
 	args.geo = dp->i_mount->m_dir_geo;
 	args.trans = tp;
+	args.owner = dp->i_ino;
 
 	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
 		return xfs_dir2_sf_getdents(&args, ctx);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 85c33142c5ab..df6601eda7ec 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -477,6 +477,7 @@ xfs_attrmulti_attr_get(
 		.name		= name,
 		.namelen	= strlen(name),
 		.valuelen	= *len,
+		.owner		= XFS_I(inode)->i_ino,
 	};
 	int			error;
 
@@ -510,6 +511,7 @@ xfs_attrmulti_attr_set(
 		.attr_flags	= xfs_attr_flags(flags),
 		.name		= name,
 		.namelen	= strlen(name),
+		.owner		= XFS_I(inode)->i_ino,
 	};
 	int			error;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 515318dfbc38..a24bf6bb5094 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -61,6 +61,7 @@ xfs_initxattrs(
 			.namelen	= strlen(xattr->name),
 			.value		= xattr->value,
 			.valuelen	= xattr->value_len,
+			.owner		= ip->i_ino,
 		};
 		error = xfs_attr_change(&args);
 		if (error < 0)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0802f078a945..da6b7461f4d0 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1949,6 +1949,7 @@ DECLARE_EVENT_CLASS(xfs_da_class,
 		__field(xfs_dahash_t, hashval)
 		__field(xfs_ino_t, inumber)
 		__field(uint32_t, op_flags)
+		__field(xfs_ino_t, owner)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(args->dp)->i_sb->s_dev;
@@ -1959,9 +1960,10 @@ DECLARE_EVENT_CLASS(xfs_da_class,
 		__entry->hashval = args->hashval;
 		__entry->inumber = args->inumber;
 		__entry->op_flags = args->op_flags;
+		__entry->owner = args->owner;
 	),
 	TP_printk("dev %d:%d ino 0x%llx name %.*s namelen %d hashval 0x%x "
-		  "inumber 0x%llx op_flags %s",
+		  "inumber 0x%llx op_flags %s owner 0x%llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->namelen,
@@ -1969,7 +1971,8 @@ DECLARE_EVENT_CLASS(xfs_da_class,
 		  __entry->namelen,
 		  __entry->hashval,
 		  __entry->inumber,
-		  __print_flags(__entry->op_flags, "|", XFS_DA_OP_FLAGS))
+		  __print_flags(__entry->op_flags, "|", XFS_DA_OP_FLAGS),
+		  __entry->owner)
 )
 
 #define DEFINE_DIR2_EVENT(name) \
diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
index e03f199f50c7..809882e3978b 100644
--- a/fs/xfs/xfs_xattr.c
+++ b/fs/xfs/xfs_xattr.c
@@ -122,6 +122,7 @@ xfs_xattr_get(const struct xattr_handler *handler, struct dentry *unused,
 		.namelen	= strlen(name),
 		.value		= value,
 		.valuelen	= size,
+		.owner		= XFS_I(inode)->i_ino,
 	};
 	int			error;
 
@@ -145,6 +146,7 @@ xfs_xattr_set(const struct xattr_handler *handler,
 		.namelen	= strlen(name),
 		.value		= (void *)value,
 		.valuelen	= size,
+		.owner		= XFS_I(inode)->i_ino,
 	};
 	int			error;
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/9] xfs: use the xfs_da_args owner field to set new dir/attr block owner
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/9] xfs: validate attr remote value buffer owners Darrick J. Wong
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're creating leaf, data, freespace, or dabtree blocks for
directories and xattrs, fill out the owner field of each new block
header from the explicit owner field in xfs_da_args instead of the
xfs_inode's inumber.  This will enable online repair to construct
replacement data structures in a temporary file without having to change
the owner fields prior to swapping the new and old structures.
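
For reference, the owner being stamped here is the one carried in the v5
block headers.  The directory block/data header in xfs_da_format.h looks
roughly like this (field comments are mine):

	struct xfs_dir3_blk_hdr {
		__be32	magic;	/* magic number */
		__be32	crc;	/* CRC of block */
		__be64	blkno;	/* first block of the buffer */
		__be64	lsn;	/* sequence number of last write */
		uuid_t	uuid;	/* filesystem we belong to */
		__be64	owner;	/* inode that owns the block */
	};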

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_leaf.c   |    2 +-
 fs/xfs/libxfs/xfs_attr_remote.c |    4 ++--
 fs/xfs/libxfs/xfs_da_btree.c    |    2 +-
 fs/xfs/libxfs/xfs_dir2_block.c  |   19 ++++++++++---------
 fs/xfs/libxfs/xfs_dir2_data.c   |    2 +-
 fs/xfs/libxfs/xfs_dir2_leaf.c   |   11 +++++------
 fs/xfs/libxfs/xfs_dir2_node.c   |    2 +-
 7 files changed, 21 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index e771167c8766..9ce886d5e53d 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1332,7 +1332,7 @@ xfs_attr3_leaf_create(
 		ichdr.magic = XFS_ATTR3_LEAF_MAGIC;
 
 		hdr3->blkno = cpu_to_be64(xfs_buf_daddr(bp));
-		hdr3->owner = cpu_to_be64(dp->i_ino);
+		hdr3->owner = cpu_to_be64(args->owner);
 		uuid_copy(&hdr3->uuid, &mp->m_sb.sb_meta_uuid);
 
 		ichdr.freemap[0].base = sizeof(struct xfs_attr3_leaf_hdr);
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index bb4cf1fa0dc2..b8cdd15c4e1a 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -522,8 +522,8 @@ xfs_attr_rmtval_set_value(
 			return error;
 		bp->b_ops = &xfs_attr3_rmt_buf_ops;
 
-		xfs_attr_rmtval_copyin(mp, bp, args->dp->i_ino, &offset,
-				       &valuelen, &src);
+		xfs_attr_rmtval_copyin(mp, bp, args->owner, &offset, &valuelen,
+				&src);
 
 		error = xfs_bwrite(bp);	/* GROT: NOTE: synchronous write */
 		xfs_buf_relse(bp);
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 8d2992413047..b0d1aad1fbbb 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -466,7 +466,7 @@ xfs_da3_node_create(
 		memset(hdr3, 0, sizeof(struct xfs_da3_node_hdr));
 		ichdr.magic = XFS_DA3_NODE_MAGIC;
 		hdr3->info.blkno = cpu_to_be64(xfs_buf_daddr(bp));
-		hdr3->info.owner = cpu_to_be64(args->dp->i_ino);
+		hdr3->info.owner = cpu_to_be64(args->owner);
 		uuid_copy(&hdr3->info.uuid, &mp->m_sb.sb_meta_uuid);
 	} else {
 		ichdr.magic = XFS_DA_NODE_MAGIC;
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 6b3ca2b384cf..6bda6a490671 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -163,12 +163,13 @@ xfs_dir3_block_read(
 
 static void
 xfs_dir3_block_init(
-	struct xfs_mount	*mp,
-	struct xfs_trans	*tp,
-	struct xfs_buf		*bp,
-	struct xfs_inode	*dp)
+	struct xfs_da_args	*args,
+	struct xfs_buf		*bp)
 {
-	struct xfs_dir3_blk_hdr *hdr3 = bp->b_addr;
+	struct xfs_trans	*tp = args->trans;
+	struct xfs_inode	*dp = args->dp;
+	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_dir3_blk_hdr	*hdr3 = bp->b_addr;
 
 	bp->b_ops = &xfs_dir3_block_buf_ops;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DIR_BLOCK_BUF);
@@ -177,7 +178,7 @@ xfs_dir3_block_init(
 		memset(hdr3, 0, sizeof(*hdr3));
 		hdr3->magic = cpu_to_be32(XFS_DIR3_BLOCK_MAGIC);
 		hdr3->blkno = cpu_to_be64(xfs_buf_daddr(bp));
-		hdr3->owner = cpu_to_be64(dp->i_ino);
+		hdr3->owner = cpu_to_be64(args->owner);
 		uuid_copy(&hdr3->uuid, &mp->m_sb.sb_meta_uuid);
 		return;
 
@@ -1009,7 +1010,7 @@ xfs_dir2_leaf_to_block(
 	/*
 	 * Start converting it to block form.
 	 */
-	xfs_dir3_block_init(mp, tp, dbp, dp);
+	xfs_dir3_block_init(args, dbp);
 
 	needlog = 1;
 	needscan = 0;
@@ -1131,7 +1132,7 @@ xfs_dir2_sf_to_block(
 	error = xfs_dir3_data_init(args, blkno, &bp);
 	if (error)
 		goto out_free;
-	xfs_dir3_block_init(mp, tp, bp, dp);
+	xfs_dir3_block_init(args, bp);
 	hdr = bp->b_addr;
 
 	/*
@@ -1171,7 +1172,7 @@ xfs_dir2_sf_to_block(
 	 * Create entry for .
 	 */
 	dep = bp->b_addr + offset;
-	dep->inumber = cpu_to_be64(dp->i_ino);
+	dep->inumber = cpu_to_be64(args->owner);
 	dep->namelen = 1;
 	dep->name[0] = '.';
 	xfs_dir2_data_put_ftype(mp, dep, XFS_DIR3_FT_DIR);
diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index 7a6d965bea71..c3ef720b5ff6 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -725,7 +725,7 @@ xfs_dir3_data_init(
 		memset(hdr3, 0, sizeof(*hdr3));
 		hdr3->magic = cpu_to_be32(XFS_DIR3_DATA_MAGIC);
 		hdr3->blkno = cpu_to_be64(xfs_buf_daddr(bp));
-		hdr3->owner = cpu_to_be64(dp->i_ino);
+		hdr3->owner = cpu_to_be64(args->owner);
 		uuid_copy(&hdr3->uuid, &mp->m_sb.sb_meta_uuid);
 
 	} else
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index 08dda5ce9d91..20ce057d12e8 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -304,12 +304,12 @@ xfs_dir3_leafn_read(
  */
 static void
 xfs_dir3_leaf_init(
-	struct xfs_mount	*mp,
-	struct xfs_trans	*tp,
+	struct xfs_da_args	*args,
 	struct xfs_buf		*bp,
-	xfs_ino_t		owner,
 	uint16_t		type)
 {
+	struct xfs_mount	*mp = args->dp->i_mount;
+	struct xfs_trans	*tp = args->trans;
 	struct xfs_dir2_leaf	*leaf = bp->b_addr;
 
 	ASSERT(type == XFS_DIR2_LEAF1_MAGIC || type == XFS_DIR2_LEAFN_MAGIC);
@@ -323,7 +323,7 @@ xfs_dir3_leaf_init(
 					 ? cpu_to_be16(XFS_DIR3_LEAF1_MAGIC)
 					 : cpu_to_be16(XFS_DIR3_LEAFN_MAGIC);
 		leaf3->info.blkno = cpu_to_be64(xfs_buf_daddr(bp));
-		leaf3->info.owner = cpu_to_be64(owner);
+		leaf3->info.owner = cpu_to_be64(args->owner);
 		uuid_copy(&leaf3->info.uuid, &mp->m_sb.sb_meta_uuid);
 	} else {
 		memset(leaf, 0, sizeof(*leaf));
@@ -356,7 +356,6 @@ xfs_dir3_leaf_get_buf(
 {
 	struct xfs_inode	*dp = args->dp;
 	struct xfs_trans	*tp = args->trans;
-	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_buf		*bp;
 	int			error;
 
@@ -369,7 +368,7 @@ xfs_dir3_leaf_get_buf(
 	if (error)
 		return error;
 
-	xfs_dir3_leaf_init(mp, tp, bp, dp->i_ino, magic);
+	xfs_dir3_leaf_init(args, bp, magic);
 	xfs_dir3_leaf_log_header(args, bp);
 	if (magic == XFS_DIR2_LEAF1_MAGIC)
 		xfs_dir3_leaf_log_tail(args, bp);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index be0b8834028c..1ad7405f9c38 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -349,7 +349,7 @@ xfs_dir3_free_get_buf(
 		hdr.magic = XFS_DIR3_FREE_MAGIC;
 
 		hdr3->hdr.blkno = cpu_to_be64(xfs_buf_daddr(bp));
-		hdr3->hdr.owner = cpu_to_be64(dp->i_ino);
+		hdr3->hdr.owner = cpu_to_be64(args->owner);
 		uuid_copy(&hdr3->hdr.uuid, &mp->m_sb.sb_meta_uuid);
 	} else
 		hdr.magic = XFS_DIR2_FREE_MAGIC;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/9] xfs: validate attr leaf buffer owners
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/9] xfs: use the xfs_da_args owner field to set new dir/attr block owner Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/9] xfs: validate attr remote value buffer owners Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/9] xfs: add an explicit owner field to xfs_da_args Darrick J. Wong
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a leaf block header checking function to validate the owner field
of xattr leaf blocks.
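
Callers of the new check follow the same pattern throughout the rest of
this series: if the header check trips, mark the buffer corrupt, mark
the metadata sick, and bail out.  Roughly, paraphrasing the hunks below:

	fa = xfs_attr3_leaf_header_check(bp, args->owner);
	if (fa) {
		__xfs_buf_mark_corrupt(bp, fa);
		xfs_trans_brelse(args->trans, bp);
		xfs_da_mark_sick(args);
		return -EFSCORRUPTED;
	}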

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c      |   10 ++++---
 fs/xfs/libxfs/xfs_attr_leaf.c |   55 ++++++++++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_attr_leaf.h |    4 ++-
 fs/xfs/libxfs/xfs_da_btree.c  |   42 +++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_da_btree.h  |    1 +
 fs/xfs/libxfs/xfs_swapext.c   |    3 +-
 fs/xfs/scrub/dabtree.c        |    7 +++++
 fs/xfs/xfs_attr_list.c        |   25 ++++++++++++++++---
 8 files changed, 128 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index e28d93d232de..564345a17119 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -647,8 +647,8 @@ xfs_attr_leaf_remove_attr(
 	int				forkoff;
 	int				error;
 
-	error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno,
-				   &bp);
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner,
+			args->blkno, &bp);
 	if (error)
 		return error;
 
@@ -679,7 +679,7 @@ xfs_attr_leaf_shrink(
 	if (!xfs_attr_is_leaf(dp))
 		return 0;
 
-	error = xfs_attr3_leaf_read(args->trans, args->dp, 0, &bp);
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, 0, &bp);
 	if (error)
 		return error;
 
@@ -1208,7 +1208,7 @@ xfs_attr_leaf_try_add(
 	struct xfs_buf		*bp;
 	int			error;
 
-	error = xfs_attr3_leaf_read(args->trans, args->dp, 0, &bp);
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, 0, &bp);
 	if (error)
 		return error;
 
@@ -1256,7 +1256,7 @@ xfs_attr_leaf_hasname(
 {
 	int                     error = 0;
 
-	error = xfs_attr3_leaf_read(args->trans, args->dp, 0, bp);
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, 0, bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 9ce886d5e53d..1f3febeccbe0 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -388,6 +388,26 @@ xfs_attr3_leaf_verify(
 	return NULL;
 }
 
+xfs_failaddr_t
+xfs_attr3_leaf_header_check(
+	struct xfs_buf		*bp,
+	xfs_ino_t		owner)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+
+	if (xfs_has_crc(mp)) {
+		struct xfs_attr3_leafblock *hdr3 = bp->b_addr;
+
+		ASSERT(hdr3->hdr.info.hdr.magic ==
+				cpu_to_be16(XFS_ATTR3_LEAF_MAGIC));
+
+		if (be64_to_cpu(hdr3->hdr.info.owner) != owner)
+			return __this_address;
+	}
+
+	return NULL;
+}
+
 static void
 xfs_attr3_leaf_write_verify(
 	struct xfs_buf	*bp)
@@ -448,16 +468,30 @@ int
 xfs_attr3_leaf_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	xfs_dablk_t		bno,
 	struct xfs_buf		**bpp)
 {
+	xfs_failaddr_t		fa;
 	int			err;
 
 	err = xfs_da_read_buf(tp, dp, bno, 0, bpp, XFS_ATTR_FORK,
 			&xfs_attr3_leaf_buf_ops);
-	if (!err && tp && *bpp)
+	if (err || !(*bpp))
+		return err;
+
+	fa = xfs_attr3_leaf_header_check(*bpp, owner);
+	if (fa) {
+		__xfs_buf_mark_corrupt(*bpp, fa);
+		xfs_trans_brelse(tp, *bpp);
+		*bpp = NULL;
+		xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
+		return -EFSCORRUPTED;
+	}
+
+	if (tp)
 		xfs_trans_buf_set_type(tp, *bpp, XFS_BLFT_ATTR_LEAF_BUF);
-	return err;
+	return 0;
 }
 
 /*========================================================================
@@ -1249,7 +1283,7 @@ xfs_attr3_leaf_to_node(
 	error = xfs_da_grow_inode(args, &blkno);
 	if (error)
 		goto out;
-	error = xfs_attr3_leaf_read(args->trans, dp, 0, &bp1);
+	error = xfs_attr3_leaf_read(args->trans, dp, args->owner, 0, &bp1);
 	if (error)
 		goto out;
 
@@ -2088,7 +2122,7 @@ xfs_attr3_leaf_toosmall(
 		if (blkno == 0)
 			continue;
 		error = xfs_attr3_leaf_read(state->args->trans, state->args->dp,
-					blkno, &bp);
+					state->args->owner, blkno, &bp);
 		if (error)
 			return error;
 
@@ -2811,7 +2845,8 @@ xfs_attr3_leaf_clearflag(
 	/*
 	 * Set up the operation.
 	 */
-	error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno, &bp);
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner,
+			args->blkno, &bp);
 	if (error)
 		return error;
 
@@ -2875,7 +2910,8 @@ xfs_attr3_leaf_setflag(
 	/*
 	 * Set up the operation.
 	 */
-	error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno, &bp);
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner,
+			args->blkno, &bp);
 	if (error)
 		return error;
 
@@ -2934,7 +2970,8 @@ xfs_attr3_leaf_flipflags(
 	/*
 	 * Read the block containing the "old" attr
 	 */
-	error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno, &bp1);
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner,
+			args->blkno, &bp1);
 	if (error)
 		return error;
 
@@ -2942,8 +2979,8 @@ xfs_attr3_leaf_flipflags(
 	 * Read the block containing the "new" attr, if it is different
 	 */
 	if (args->blkno2 != args->blkno) {
-		error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno2,
-					   &bp2);
+		error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner,
+				args->blkno2, &bp2);
 		if (error)
 			return error;
 	} else {
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 0711a448f64c..e585317dc680 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -102,12 +102,14 @@ int	xfs_attr_leaf_order(struct xfs_buf *leaf1_bp,
 				   struct xfs_buf *leaf2_bp);
 int	xfs_attr_leaf_newentsize(struct xfs_da_args *args, int *local);
 int	xfs_attr3_leaf_read(struct xfs_trans *tp, struct xfs_inode *dp,
-			xfs_dablk_t bno, struct xfs_buf **bpp);
+			xfs_ino_t owner, xfs_dablk_t bno, struct xfs_buf **bpp);
 void	xfs_attr3_leaf_hdr_from_disk(struct xfs_da_geometry *geo,
 				     struct xfs_attr3_icleaf_hdr *to,
 				     struct xfs_attr_leafblock *from);
 void	xfs_attr3_leaf_hdr_to_disk(struct xfs_da_geometry *geo,
 				   struct xfs_attr_leafblock *to,
 				   struct xfs_attr3_icleaf_hdr *from);
+xfs_failaddr_t xfs_attr3_leaf_header_check(struct xfs_buf *bp,
+		xfs_ino_t owner);
 
 #endif	/* __XFS_ATTR_LEAF_H__ */
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index b0d1aad1fbbb..0349e10552f6 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -251,6 +251,25 @@ xfs_da3_node_verify(
 	return NULL;
 }
 
+xfs_failaddr_t
+xfs_da3_header_check(
+	struct xfs_buf		*bp,
+	xfs_ino_t		owner)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_da_blkinfo	*hdr = bp->b_addr;
+
+	if (!xfs_has_crc(mp))
+		return NULL;
+
+	switch (hdr->magic) {
+	case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
+		return xfs_attr3_leaf_header_check(bp, owner);
+	}
+
+	return NULL;
+}
+
 static void
 xfs_da3_node_write_verify(
 	struct xfs_buf	*bp)
@@ -1598,6 +1617,7 @@ xfs_da3_node_lookup_int(
 	struct xfs_da_node_entry *btree;
 	struct xfs_da3_icnode_hdr nodehdr;
 	struct xfs_da_args	*args;
+	xfs_failaddr_t		fa;
 	xfs_dablk_t		blkno;
 	xfs_dahash_t		hashval;
 	xfs_dahash_t		btreehashval;
@@ -1636,6 +1656,12 @@ xfs_da3_node_lookup_int(
 
 		if (magic == XFS_ATTR_LEAF_MAGIC ||
 		    magic == XFS_ATTR3_LEAF_MAGIC) {
+			fa = xfs_attr3_leaf_header_check(blk->bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(blk->bp, fa);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			blk->magic = XFS_ATTR_LEAF_MAGIC;
 			blk->hashval = xfs_attr_leaf_lasthash(blk->bp, NULL);
 			break;
@@ -2003,6 +2029,7 @@ xfs_da3_path_shift(
 	struct xfs_da_node_entry *btree;
 	struct xfs_da3_icnode_hdr nodehdr;
 	struct xfs_buf		*bp;
+	xfs_failaddr_t		fa;
 	xfs_dablk_t		blkno = 0;
 	int			level;
 	int			error;
@@ -2094,6 +2121,12 @@ xfs_da3_path_shift(
 			break;
 		case XFS_ATTR_LEAF_MAGIC:
 		case XFS_ATTR3_LEAF_MAGIC:
+			fa = xfs_attr3_leaf_header_check(blk->bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(blk->bp, fa);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			blk->magic = XFS_ATTR_LEAF_MAGIC;
 			ASSERT(level == path->active-1);
 			blk->index = 0;
@@ -2296,6 +2329,7 @@ xfs_da3_swap_lastblock(
 	struct xfs_buf		*last_buf;
 	struct xfs_buf		*sib_buf;
 	struct xfs_buf		*par_buf;
+	xfs_failaddr_t		fa;
 	xfs_dahash_t		dead_hash;
 	xfs_fileoff_t		lastoff;
 	xfs_dablk_t		dead_blkno;
@@ -2332,6 +2366,14 @@ xfs_da3_swap_lastblock(
 	error = xfs_da3_node_read(tp, dp, last_blkno, &last_buf, w);
 	if (error)
 		return error;
+	fa = xfs_da3_header_check(last_buf, args->owner);
+	if (fa) {
+		__xfs_buf_mark_corrupt(last_buf, fa);
+		xfs_trans_brelse(tp, last_buf);
+		xfs_da_mark_sick(args);
+		return -EFSCORRUPTED;
+	}
+
 	/*
 	 * Copy the last block into the dead buffer and log it.
 	 */
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 52694dc0cd3c..0b9e467663b6 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -234,6 +234,7 @@ void	xfs_da3_node_hdr_from_disk(struct xfs_mount *mp,
 		struct xfs_da3_icnode_hdr *to, struct xfs_da_intnode *from);
 void	xfs_da3_node_hdr_to_disk(struct xfs_mount *mp,
 		struct xfs_da_intnode *to, struct xfs_da3_icnode_hdr *from);
+xfs_failaddr_t xfs_da3_header_check(struct xfs_buf *bp, xfs_ino_t owner);
 
 extern struct kmem_cache	*xfs_da_state_cache;
 
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 65b1ccb162ad..063922989d77 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -475,7 +475,8 @@ xfs_swapext_attr_to_sf(
 	if (!xfs_attr_is_leaf(sxi->sxi_ip2))
 		return 0;
 
-	error = xfs_attr3_leaf_read(tp, sxi->sxi_ip2, 0, &bp);
+	error = xfs_attr3_leaf_read(tp, sxi->sxi_ip2, sxi->sxi_ip2->i_ino, 0,
+			&bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index c8274a7e0cfd..e60b4cc96c54 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -320,6 +320,7 @@ xchk_da_btree_block(
 	struct xfs_da3_blkinfo		*hdr3;
 	struct xfs_da_args		*dargs = &ds->dargs;
 	struct xfs_inode		*ip = ds->dargs.dp;
+	xfs_failaddr_t			fa;
 	xfs_ino_t			owner;
 	int				*pmaxrecs;
 	struct xfs_da3_icnode_hdr	nodehdr;
@@ -442,6 +443,12 @@ xchk_da_btree_block(
 		goto out_freebp;
 	}
 
+	fa = xfs_da3_header_check(blk->bp, dargs->owner);
+	if (fa) {
+		xchk_da_set_corrupt(ds, level);
+		goto out_freebp;
+	}
+
 	/*
 	 * If we've been handed a block that is below the dabtree root, does
 	 * its hashval match what the parent block expected to see?
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index dcfa8e8e146a..2954ed7cfaf4 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -215,6 +215,7 @@ xfs_attr_node_list_lookup(
 	struct xfs_mount		*mp = dp->i_mount;
 	struct xfs_trans		*tp = context->tp;
 	struct xfs_buf			*bp;
+	xfs_failaddr_t			fa;
 	int				i;
 	int				error = 0;
 	unsigned int			expected_level = 0;
@@ -274,6 +275,12 @@ xfs_attr_node_list_lookup(
 		}
 	}
 
+	fa = xfs_attr3_leaf_header_check(bp, dp->i_ino);
+	if (fa) {
+		__xfs_buf_mark_corrupt(bp, fa);
+		goto out_releasebuf;
+	}
+
 	if (expected_level != 0)
 		goto out_corruptbuf;
 
@@ -282,6 +289,7 @@ xfs_attr_node_list_lookup(
 
 out_corruptbuf:
 	xfs_buf_mark_corrupt(bp);
+out_releasebuf:
 	xfs_trans_brelse(tp, bp);
 	xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 	return -EFSCORRUPTED;
@@ -298,6 +306,7 @@ xfs_attr_node_list(
 	struct xfs_buf			*bp;
 	struct xfs_inode		*dp = context->dp;
 	struct xfs_mount		*mp = dp->i_mount;
+	xfs_failaddr_t			fa;
 	int				error = 0;
 
 	trace_xfs_attr_node_list(context);
@@ -331,6 +340,15 @@ xfs_attr_node_list(
 			case XFS_ATTR_LEAF_MAGIC:
 			case XFS_ATTR3_LEAF_MAGIC:
 				leaf = bp->b_addr;
+				fa = xfs_attr3_leaf_header_check(bp,
+						dp->i_ino);
+				if (fa) {
+					__xfs_buf_mark_corrupt(bp, fa);
+					xfs_trans_brelse(context->tp, bp);
+					xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
+					bp = NULL;
+					break;
+				}
 				xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo,
 							     &leafhdr, leaf);
 				entries = xfs_attr3_leaf_entryp(leaf);
@@ -381,8 +399,8 @@ xfs_attr_node_list(
 			break;
 		cursor->blkno = leafhdr.forw;
 		xfs_trans_brelse(context->tp, bp);
-		error = xfs_attr3_leaf_read(context->tp, dp, cursor->blkno,
-					    &bp);
+		error = xfs_attr3_leaf_read(context->tp, dp, dp->i_ino,
+				cursor->blkno, &bp);
 		if (error)
 			return error;
 	}
@@ -502,7 +520,8 @@ xfs_attr_leaf_list(
 	trace_xfs_attr_leaf_list(context);
 
 	context->cursor.blkno = 0;
-	error = xfs_attr3_leaf_read(context->tp, context->dp, 0, &bp);
+	error = xfs_attr3_leaf_read(context->tp, context->dp,
+			context->dp->i_ino, 0, &bp);
 	if (error)
 		return error;
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/9] xfs: validate attr remote value buffer owners
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/9] xfs: use the xfs_da_args owner field to set new dir/attr block owner Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/9] xfs: validate attr leaf " Darrick J. Wong
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check the owner field of xattr remote value blocks.
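
On v5 filesystems each remote value block carries its own header, and
the field being checked here is rm_owner in struct xfs_attr3_rmt_hdr,
which looks roughly like this (paraphrased from xfs_da_format.h):

	struct xfs_attr3_rmt_hdr {
		__be32	rm_magic;
		__be32	rm_offset;	/* byte offset into the value */
		__be32	rm_bytes;	/* value bytes stored in this block */
		__be32	rm_crc;
		uuid_t	rm_uuid;
		__be64	rm_owner;	/* inode that owns this value */
		__be64	rm_blkno;
		__be64	rm_lsn;
	};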

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_remote.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index b8cdd15c4e1a..3dd0b6b0956c 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -280,12 +280,12 @@ xfs_attr_rmtval_copyout(
 	struct xfs_mount	*mp,
 	struct xfs_buf		*bp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	int			*offset,
 	int			*valuelen,
 	uint8_t			**dst)
 {
 	char			*src = bp->b_addr;
-	xfs_ino_t		ino = dp->i_ino;
 	xfs_daddr_t		bno = xfs_buf_daddr(bp);
 	int			len = BBTOB(bp->b_length);
 	int			blksize = mp->m_attr_geo->blksize;
@@ -299,11 +299,11 @@ xfs_attr_rmtval_copyout(
 		byte_cnt = min(*valuelen, byte_cnt);
 
 		if (xfs_has_crc(mp)) {
-			if (xfs_attr3_rmt_hdr_ok(src, ino, *offset,
+			if (xfs_attr3_rmt_hdr_ok(src, owner, *offset,
 						  byte_cnt, bno)) {
 				xfs_alert(mp,
 "remote attribute header mismatch bno/off/len/owner (0x%llx/0x%x/Ox%x/0x%llx)",
-					bno, *offset, byte_cnt, ino);
+					bno, *offset, byte_cnt, owner);
 				xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 				return -EFSCORRUPTED;
 			}
@@ -427,8 +427,7 @@ xfs_attr_rmtval_get(
 				return error;
 
 			error = xfs_attr_rmtval_copyout(mp, bp, args->dp,
-							&offset, &valuelen,
-							&dst);
+					args->owner, &offset, &valuelen, &dst);
 			xfs_buf_relse(bp);
 			if (error)
 				return error;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/9] xfs: validate dabtree node buffer owners
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 1/9] xfs: add an explicit owner field to xfs_da_args Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 6/9] xfs: validate directory leaf " Darrick J. Wong
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check the owner field of dabtree node blocks.
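
The owner lives in the v5 block info header at the start of every
dabtree node block; roughly, from xfs_da_format.h:

	struct xfs_da3_blkinfo {
		struct xfs_da_blkinfo	hdr;	/* forw/back/magic/pad */
		__be32			crc;	/* CRC of block */
		__be64			blkno;	/* first block of the buffer */
		__be64			lsn;	/* sequence number of last write */
		uuid_t			uuid;	/* filesystem we belong to */
		__be64			owner;	/* inode that owns the block */
	};

The new helper rejects any node block whose owner doesn't match the
inode we think we're operating on.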

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_da_btree.c |  108 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_da_btree.h |    1 
 fs/xfs/xfs_attr_list.c       |   10 ++++
 3 files changed, 119 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 0349e10552f6..1b4771b0ac50 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -251,6 +251,25 @@ xfs_da3_node_verify(
 	return NULL;
 }
 
+xfs_failaddr_t
+xfs_da3_node_header_check(
+	struct xfs_buf		*bp,
+	xfs_ino_t		owner)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+
+	if (xfs_has_crc(mp)) {
+		struct xfs_da3_blkinfo *hdr3 = bp->b_addr;
+
+		ASSERT(hdr3->hdr.magic == cpu_to_be16(XFS_DA3_NODE_MAGIC));
+
+		if (be64_to_cpu(hdr3->owner) != owner)
+			return __this_address;
+	}
+
+	return NULL;
+}
+
 xfs_failaddr_t
 xfs_da3_header_check(
 	struct xfs_buf		*bp,
@@ -265,6 +284,8 @@ xfs_da3_header_check(
 	switch (hdr->magic) {
 	case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
 		return xfs_attr3_leaf_header_check(bp, owner);
+	case cpu_to_be16(XFS_DA3_NODE_MAGIC):
+		return xfs_da3_node_header_check(bp, owner);
 	}
 
 	return NULL;
@@ -1218,6 +1239,7 @@ xfs_da3_root_join(
 	struct xfs_da3_icnode_hdr oldroothdr;
 	int			error;
 	struct xfs_inode	*dp = state->args->dp;
+	xfs_failaddr_t		fa;
 
 	trace_xfs_da_root_join(state->args);
 
@@ -1244,6 +1266,13 @@ xfs_da3_root_join(
 	error = xfs_da3_node_read(args->trans, dp, child, &bp, args->whichfork);
 	if (error)
 		return error;
+	fa = xfs_da3_header_check(bp, args->owner);
+	if (fa) {
+		__xfs_buf_mark_corrupt(bp, fa);
+		xfs_trans_brelse(args->trans, bp);
+		xfs_da_mark_sick(args);
+		return -EFSCORRUPTED;
+	}
 	xfs_da_blkinfo_onlychild_validate(bp->b_addr, oldroothdr.level);
 
 	/*
@@ -1285,6 +1314,7 @@ xfs_da3_node_toosmall(
 	struct xfs_da_blkinfo	*info;
 	xfs_dablk_t		blkno;
 	struct xfs_buf		*bp;
+	xfs_failaddr_t		fa;
 	struct xfs_da3_icnode_hdr nodehdr;
 	int			count;
 	int			forward;
@@ -1359,6 +1389,13 @@ xfs_da3_node_toosmall(
 				state->args->whichfork);
 		if (error)
 			return error;
+		fa = xfs_da3_node_header_check(bp, state->args->owner);
+		if (fa) {
+			__xfs_buf_mark_corrupt(bp, fa);
+			xfs_trans_brelse(state->args->trans, bp);
+			xfs_da_mark_sick(state->args);
+			return -EFSCORRUPTED;
+		}
 
 		node = bp->b_addr;
 		xfs_da3_node_hdr_from_disk(dp->i_mount, &thdr, node);
@@ -1681,6 +1718,13 @@ xfs_da3_node_lookup_int(
 			return -EFSCORRUPTED;
 		}
 
+		fa = xfs_da3_node_header_check(blk->bp, args->owner);
+		if (fa) {
+			__xfs_buf_mark_corrupt(blk->bp, fa);
+			xfs_da_mark_sick(args);
+			return -EFSCORRUPTED;
+		}
+
 		blk->magic = XFS_DA_NODE_MAGIC;
 
 		/*
@@ -1853,6 +1897,7 @@ xfs_da3_blk_link(
 	struct xfs_da_blkinfo	*tmp_info;
 	struct xfs_da_args	*args;
 	struct xfs_buf		*bp;
+	xfs_failaddr_t		fa;
 	int			before = 0;
 	int			error;
 	struct xfs_inode	*dp = state->args->dp;
@@ -1896,6 +1941,13 @@ xfs_da3_blk_link(
 						&bp, args->whichfork);
 			if (error)
 				return error;
+			fa = xfs_da3_header_check(bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(bp, fa);
+				xfs_trans_brelse(args->trans, bp);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			ASSERT(bp != NULL);
 			tmp_info = bp->b_addr;
 			ASSERT(tmp_info->magic == old_info->magic);
@@ -1917,6 +1969,13 @@ xfs_da3_blk_link(
 						&bp, args->whichfork);
 			if (error)
 				return error;
+			fa = xfs_da3_header_check(bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(bp, fa);
+				xfs_trans_brelse(args->trans, bp);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			ASSERT(bp != NULL);
 			tmp_info = bp->b_addr;
 			ASSERT(tmp_info->magic == old_info->magic);
@@ -1946,6 +2005,7 @@ xfs_da3_blk_unlink(
 	struct xfs_da_blkinfo	*tmp_info;
 	struct xfs_da_args	*args;
 	struct xfs_buf		*bp;
+	xfs_failaddr_t		fa;
 	int			error;
 
 	/*
@@ -1976,6 +2036,13 @@ xfs_da3_blk_unlink(
 						&bp, args->whichfork);
 			if (error)
 				return error;
+			fa = xfs_da3_header_check(bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(bp, fa);
+				xfs_trans_brelse(args->trans, bp);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			ASSERT(bp != NULL);
 			tmp_info = bp->b_addr;
 			ASSERT(tmp_info->magic == save_info->magic);
@@ -1993,6 +2060,13 @@ xfs_da3_blk_unlink(
 						&bp, args->whichfork);
 			if (error)
 				return error;
+			fa = xfs_da3_header_check(bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(bp, fa);
+				xfs_trans_brelse(args->trans, bp);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			ASSERT(bp != NULL);
 			tmp_info = bp->b_addr;
 			ASSERT(tmp_info->magic == save_info->magic);
@@ -2108,6 +2182,12 @@ xfs_da3_path_shift(
 		switch (be16_to_cpu(info->magic)) {
 		case XFS_DA_NODE_MAGIC:
 		case XFS_DA3_NODE_MAGIC:
+			fa = xfs_da3_node_header_check(blk->bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(blk->bp, fa);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			blk->magic = XFS_DA_NODE_MAGIC;
 			xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr,
 						   bp->b_addr);
@@ -2411,6 +2491,13 @@ xfs_da3_swap_lastblock(
 		error = xfs_da3_node_read(tp, dp, sib_blkno, &sib_buf, w);
 		if (error)
 			goto done;
+		fa = xfs_da3_header_check(sib_buf, args->owner);
+		if (fa) {
+			__xfs_buf_mark_corrupt(sib_buf, fa);
+			xfs_da_mark_sick(args);
+			error = -EFSCORRUPTED;
+			goto done;
+		}
 		sib_info = sib_buf->b_addr;
 		if (XFS_IS_CORRUPT(mp,
 				   be32_to_cpu(sib_info->forw) != last_blkno ||
@@ -2432,6 +2519,13 @@ xfs_da3_swap_lastblock(
 		error = xfs_da3_node_read(tp, dp, sib_blkno, &sib_buf, w);
 		if (error)
 			goto done;
+		fa = xfs_da3_header_check(sib_buf, args->owner);
+		if (fa) {
+			__xfs_buf_mark_corrupt(sib_buf, fa);
+			xfs_da_mark_sick(args);
+			error = -EFSCORRUPTED;
+			goto done;
+		}
 		sib_info = sib_buf->b_addr;
 		if (XFS_IS_CORRUPT(mp,
 				   be32_to_cpu(sib_info->back) != last_blkno ||
@@ -2455,6 +2549,13 @@ xfs_da3_swap_lastblock(
 		error = xfs_da3_node_read(tp, dp, par_blkno, &par_buf, w);
 		if (error)
 			goto done;
+		fa = xfs_da3_node_header_check(par_buf, args->owner);
+		if (fa) {
+			__xfs_buf_mark_corrupt(par_buf, fa);
+			xfs_da_mark_sick(args);
+			error = -EFSCORRUPTED;
+			goto done;
+		}
 		par_node = par_buf->b_addr;
 		xfs_da3_node_hdr_from_disk(dp->i_mount, &par_hdr, par_node);
 		if (XFS_IS_CORRUPT(mp,
@@ -2504,6 +2605,13 @@ xfs_da3_swap_lastblock(
 		error = xfs_da3_node_read(tp, dp, par_blkno, &par_buf, w);
 		if (error)
 			goto done;
+		fa = xfs_da3_node_header_check(par_buf, args->owner);
+		if (fa) {
+			__xfs_buf_mark_corrupt(par_buf, fa);
+			xfs_da_mark_sick(args);
+			error = -EFSCORRUPTED;
+			goto done;
+		}
 		par_node = par_buf->b_addr;
 		xfs_da3_node_hdr_from_disk(dp->i_mount, &par_hdr, par_node);
 		if (XFS_IS_CORRUPT(mp, par_hdr.level != level)) {
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 0b9e467663b6..1f5b3c3f0deb 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -235,6 +235,7 @@ void	xfs_da3_node_hdr_from_disk(struct xfs_mount *mp,
 void	xfs_da3_node_hdr_to_disk(struct xfs_mount *mp,
 		struct xfs_da_intnode *to, struct xfs_da3_icnode_hdr *from);
 xfs_failaddr_t xfs_da3_header_check(struct xfs_buf *bp, xfs_ino_t owner);
+xfs_failaddr_t xfs_da3_node_header_check(struct xfs_buf *bp, xfs_ino_t owner);
 
 extern struct kmem_cache	*xfs_da_state_cache;
 
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index 2954ed7cfaf4..24516f3ff2df 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -240,6 +240,10 @@ xfs_attr_node_list_lookup(
 			goto out_corruptbuf;
 		}
 
+		fa = xfs_da3_node_header_check(bp, dp->i_ino);
+		if (fa)
+			goto out_corruptbuf;
+
 		xfs_da3_node_hdr_from_disk(mp, &nodehdr, node);
 
 		/* Tree taller than we can handle; bail out! */
@@ -334,6 +338,12 @@ xfs_attr_node_list(
 			case XFS_DA_NODE_MAGIC:
 			case XFS_DA3_NODE_MAGIC:
 				trace_xfs_attr_list_wrong_blk(context);
+				fa = xfs_da3_node_header_check(bp,
+						dp->i_ino);
+				if (fa) {
+					__xfs_buf_mark_corrupt(bp, fa);
+					xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
+				}
 				xfs_trans_brelse(context->tp, bp);
 				bp = NULL;
 				break;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/9] xfs: validate directory leaf buffer owners
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 5/9] xfs: validate dabtree node buffer owners Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 9/9] xfs: validate explicit directory free block owners Darrick J. Wong
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check the owner field of directory leaf blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_da_btree.c  |   16 ++++++++++
 fs/xfs/libxfs/xfs_dir2.h      |    2 +
 fs/xfs/libxfs/xfs_dir2_leaf.c |   64 +++++++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_dir2_node.c |    3 +-
 fs/xfs/libxfs/xfs_dir2_priv.h |    4 +--
 fs/xfs/scrub/dir.c            |    2 +
 6 files changed, 81 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 1b4771b0ac50..89bf350f3f1c 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -286,8 +286,12 @@ xfs_da3_header_check(
 		return xfs_attr3_leaf_header_check(bp, owner);
 	case cpu_to_be16(XFS_DA3_NODE_MAGIC):
 		return xfs_da3_node_header_check(bp, owner);
+	case cpu_to_be16(XFS_DIR3_LEAF1_MAGIC):
+	case cpu_to_be16(XFS_DIR3_LEAFN_MAGIC):
+		return xfs_dir3_leaf_header_check(bp, owner);
 	}
 
+	ASSERT(0);
 	return NULL;
 }
 
@@ -1706,6 +1710,12 @@ xfs_da3_node_lookup_int(
 
 		if (magic == XFS_DIR2_LEAFN_MAGIC ||
 		    magic == XFS_DIR3_LEAFN_MAGIC) {
+			fa = xfs_dir3_leaf_header_check(blk->bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(blk->bp, fa);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			blk->magic = XFS_DIR2_LEAFN_MAGIC;
 			blk->hashval = xfs_dir2_leaf_lasthash(args->dp,
 							      blk->bp, NULL);
@@ -2214,6 +2224,12 @@ xfs_da3_path_shift(
 			break;
 		case XFS_DIR2_LEAFN_MAGIC:
 		case XFS_DIR3_LEAFN_MAGIC:
+			fa = xfs_dir3_leaf_header_check(blk->bp, args->owner);
+			if (fa) {
+				__xfs_buf_mark_corrupt(blk->bp, fa);
+				xfs_da_mark_sick(args);
+				return -EFSCORRUPTED;
+			}
 			blk->magic = XFS_DIR2_LEAFN_MAGIC;
 			ASSERT(level == path->active-1);
 			blk->index = 0;
diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index 15a36cf7ae87..b5ad503e47b3 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -89,6 +89,8 @@ extern struct xfs_dir2_data_free *xfs_dir2_data_freefind(
 
 extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino);
 
+xfs_failaddr_t xfs_dir3_leaf_header_check(struct xfs_buf *bp, xfs_ino_t owner);
+
 extern const struct xfs_buf_ops xfs_dir3_block_buf_ops;
 extern const struct xfs_buf_ops xfs_dir3_leafn_buf_ops;
 extern const struct xfs_buf_ops xfs_dir3_leaf1_buf_ops;
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index 20ce057d12e8..16a581e225a3 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -208,6 +208,28 @@ xfs_dir3_leaf_verify(
 	return xfs_dir3_leaf_check_int(mp, &leafhdr, bp->b_addr, true);
 }
 
+xfs_failaddr_t
+xfs_dir3_leaf_header_check(
+	struct xfs_buf		*bp,
+	xfs_ino_t		owner)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+
+	if (xfs_has_crc(mp)) {
+		struct xfs_dir3_leaf *hdr3 = bp->b_addr;
+
+		ASSERT(hdr3->hdr.info.hdr.magic ==
+					cpu_to_be16(XFS_DIR3_LEAF1_MAGIC) ||
+		       hdr3->hdr.info.hdr.magic ==
+					cpu_to_be16(XFS_DIR3_LEAFN_MAGIC));
+
+		if (be64_to_cpu(hdr3->hdr.info.owner) != owner)
+			return __this_address;
+	}
+
+	return NULL;
+}
+
 static void
 xfs_dir3_leaf_read_verify(
 	struct xfs_buf  *bp)
@@ -271,32 +293,60 @@ int
 xfs_dir3_leaf_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	xfs_dablk_t		fbno,
 	struct xfs_buf		**bpp)
 {
+	xfs_failaddr_t		fa;
 	int			err;
 
 	err = xfs_da_read_buf(tp, dp, fbno, 0, bpp, XFS_DATA_FORK,
 			&xfs_dir3_leaf1_buf_ops);
-	if (!err && tp && *bpp)
+	if (err || !(*bpp))
+		return err;
+
+	fa = xfs_dir3_leaf_header_check(*bpp, owner);
+	if (fa) {
+		__xfs_buf_mark_corrupt(*bpp, fa);
+		xfs_trans_brelse(tp, *bpp);
+		*bpp = NULL;
+		xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
+		return -EFSCORRUPTED;
+	}
+
+	if (tp)
 		xfs_trans_buf_set_type(tp, *bpp, XFS_BLFT_DIR_LEAF1_BUF);
-	return err;
+	return 0;
 }
 
 int
 xfs_dir3_leafn_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	xfs_dablk_t		fbno,
 	struct xfs_buf		**bpp)
 {
+	xfs_failaddr_t		fa;
 	int			err;
 
 	err = xfs_da_read_buf(tp, dp, fbno, 0, bpp, XFS_DATA_FORK,
 			&xfs_dir3_leafn_buf_ops);
-	if (!err && tp && *bpp)
+	if (err || !(*bpp))
+		return err;
+
+	fa = xfs_dir3_leaf_header_check(*bpp, owner);
+	if (fa) {
+		__xfs_buf_mark_corrupt(*bpp, fa);
+		xfs_trans_brelse(tp, *bpp);
+		*bpp = NULL;
+		xfs_dirattr_mark_sick(dp, XFS_DATA_FORK);
+		return -EFSCORRUPTED;
+	}
+
+	if (tp)
 		xfs_trans_buf_set_type(tp, *bpp, XFS_BLFT_DIR_LEAFN_BUF);
-	return err;
+	return 0;
 }
 
 /*
@@ -646,7 +696,8 @@ xfs_dir2_leaf_addname(
 
 	trace_xfs_dir2_leaf_addname(args);
 
-	error = xfs_dir3_leaf_read(tp, dp, args->geo->leafblk, &lbp);
+	error = xfs_dir3_leaf_read(tp, dp, args->owner, args->geo->leafblk,
+			&lbp);
 	if (error)
 		return error;
 
@@ -1237,7 +1288,8 @@ xfs_dir2_leaf_lookup_int(
 	tp = args->trans;
 	mp = dp->i_mount;
 
-	error = xfs_dir3_leaf_read(tp, dp, args->geo->leafblk, &lbp);
+	error = xfs_dir3_leaf_read(tp, dp, args->owner, args->geo->leafblk,
+			&lbp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 1ad7405f9c38..e21965788188 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -1562,7 +1562,8 @@ xfs_dir2_leafn_toosmall(
 		/*
 		 * Read the sibling leaf block.
 		 */
-		error = xfs_dir3_leafn_read(state->args->trans, dp, blkno, &bp);
+		error = xfs_dir3_leafn_read(state->args->trans, dp,
+				state->args->owner, blkno, &bp);
 		if (error)
 			return error;
 
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index b10859a43776..8a7b5f030a6c 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -95,9 +95,9 @@ void xfs_dir2_leaf_hdr_from_disk(struct xfs_mount *mp,
 void xfs_dir2_leaf_hdr_to_disk(struct xfs_mount *mp, struct xfs_dir2_leaf *to,
 		struct xfs_dir3_icleaf_hdr *from);
 int xfs_dir3_leaf_read(struct xfs_trans *tp, struct xfs_inode *dp,
-		xfs_dablk_t fbno, struct xfs_buf **bpp);
+		xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp);
 int xfs_dir3_leafn_read(struct xfs_trans *tp, struct xfs_inode *dp,
-		xfs_dablk_t fbno, struct xfs_buf **bpp);
+		xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp);
 extern int xfs_dir2_block_to_leaf(struct xfs_da_args *args,
 		struct xfs_buf *dbp);
 extern int xfs_dir2_leaf_addname(struct xfs_da_args *args);
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 8b4e58416327..81348bea8add 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -463,7 +463,7 @@ xchk_directory_leaf1_bestfree(
 	int				error;
 
 	/* Read the free space block. */
-	error = xfs_dir3_leaf_read(sc->tp, sc->ip, lblk, &bp);
+	error = xfs_dir3_leaf_read(sc->tp, sc->ip, sc->ip->i_ino, lblk, &bp);
 	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, &error))
 		return error;
 	xchk_buffer_recheck(sc, bp);



* [PATCH 7/9] xfs: validate explicit directory data buffer owners
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 8/9] xfs: validate explicit directory block buffer owners Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Port the existing directory data header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.
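
To illustrate the calling-convention change, here is a rough sketch (not
an excerpt from the patch; the surrounding declarations and context are
elided):

	/* Before: the expected owner was derived from the inode. */
	fa = xfs_dir3_data_header_check(dp, bp);

	/* After: the expected owner is passed in explicitly... */
	fa = xfs_dir3_data_header_check(bp, args->owner);

	/* ...and reads thread the owner through in the same way. */
	error = xfs_dir3_data_read(tp, dp, args->owner, bno, 0, &bp);

Scrub and readdir callers that do not have an xfs_da_args handy pass the
directory's own inode number (dp->i_ino) instead.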

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_dir2.h       |    1 +
 fs/xfs/libxfs/xfs_dir2_block.c |    3 ++-
 fs/xfs/libxfs/xfs_dir2_data.c  |   15 +++++++++------
 fs/xfs/libxfs/xfs_dir2_leaf.c  |   21 +++++++++++----------
 fs/xfs/libxfs/xfs_dir2_node.c  |    7 +++----
 fs/xfs/libxfs/xfs_dir2_priv.h  |    3 ++-
 fs/xfs/scrub/dir.c             |   14 +++++++-------
 fs/xfs/scrub/readdir.c         |    2 +-
 fs/xfs/xfs_dir2_readdir.c      |    3 ++-
 9 files changed, 38 insertions(+), 31 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index b5ad503e47b3..2c2564c2158d 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -90,6 +90,7 @@ extern struct xfs_dir2_data_free *xfs_dir2_data_freefind(
 extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino);
 
 xfs_failaddr_t xfs_dir3_leaf_header_check(struct xfs_buf *bp, xfs_ino_t owner);
+xfs_failaddr_t xfs_dir3_data_header_check(struct xfs_buf *bp, xfs_ino_t owner);
 
 extern const struct xfs_buf_ops xfs_dir3_block_buf_ops;
 extern const struct xfs_buf_ops xfs_dir3_leafn_buf_ops;
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 6bda6a490671..184341bb1f6a 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -982,7 +982,8 @@ xfs_dir2_leaf_to_block(
 	 * Read the data block if we don't already have it, give up if it fails.
 	 */
 	if (!dbp) {
-		error = xfs_dir3_data_read(tp, dp, args->geo->datablk, 0, &dbp);
+		error = xfs_dir3_data_read(tp, dp, args->owner,
+				args->geo->datablk, 0, &dbp);
 		if (error)
 			return error;
 	}
diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index c3ef720b5ff6..00c2061aed34 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -395,17 +395,19 @@ static const struct xfs_buf_ops xfs_dir3_data_reada_buf_ops = {
 	.verify_write = xfs_dir3_data_write_verify,
 };
 
-static xfs_failaddr_t
+xfs_failaddr_t
 xfs_dir3_data_header_check(
-	struct xfs_inode	*dp,
-	struct xfs_buf		*bp)
+	struct xfs_buf		*bp,
+	xfs_ino_t		owner)
 {
-	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_mount	*mp = bp->b_mount;
 
 	if (xfs_has_crc(mp)) {
 		struct xfs_dir3_data_hdr *hdr3 = bp->b_addr;
 
-		if (be64_to_cpu(hdr3->hdr.owner) != dp->i_ino)
+		ASSERT(hdr3->hdr.magic == cpu_to_be32(XFS_DIR3_DATA_MAGIC));
+
+		if (be64_to_cpu(hdr3->hdr.owner) != owner)
 			return __this_address;
 	}
 
@@ -416,6 +418,7 @@ int
 xfs_dir3_data_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	xfs_dablk_t		bno,
 	unsigned int		flags,
 	struct xfs_buf		**bpp)
@@ -429,7 +432,7 @@ xfs_dir3_data_read(
 		return err;
 
 	/* Check things that we can't do in the verifier. */
-	fa = xfs_dir3_data_header_check(dp, *bpp);
+	fa = xfs_dir3_data_header_check(*bpp, owner);
 	if (fa) {
 		__xfs_buf_mark_corrupt(*bpp, fa);
 		xfs_trans_brelse(tp, *bpp);
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index 16a581e225a3..a6eee2604487 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -884,9 +884,9 @@ xfs_dir2_leaf_addname(
 		 * Already had space in some data block.
 		 * Just read that one in.
 		 */
-		error = xfs_dir3_data_read(tp, dp,
-				   xfs_dir2_db_to_da(args->geo, use_block),
-				   0, &dbp);
+		error = xfs_dir3_data_read(tp, dp, args->owner,
+				xfs_dir2_db_to_da(args->geo, use_block), 0,
+				&dbp);
 		if (error) {
 			xfs_trans_brelse(tp, lbp);
 			return error;
@@ -1327,9 +1327,9 @@ xfs_dir2_leaf_lookup_int(
 		if (newdb != curdb) {
 			if (dbp)
 				xfs_trans_brelse(tp, dbp);
-			error = xfs_dir3_data_read(tp, dp,
-					   xfs_dir2_db_to_da(args->geo, newdb),
-					   0, &dbp);
+			error = xfs_dir3_data_read(tp, dp, args->owner,
+					xfs_dir2_db_to_da(args->geo, newdb), 0,
+					&dbp);
 			if (error) {
 				xfs_trans_brelse(tp, lbp);
 				return error;
@@ -1369,9 +1369,9 @@ xfs_dir2_leaf_lookup_int(
 		ASSERT(cidb != -1);
 		if (cidb != curdb) {
 			xfs_trans_brelse(tp, dbp);
-			error = xfs_dir3_data_read(tp, dp,
-					   xfs_dir2_db_to_da(args->geo, cidb),
-					   0, &dbp);
+			error = xfs_dir3_data_read(tp, dp, args->owner,
+					xfs_dir2_db_to_da(args->geo, cidb), 0,
+					&dbp);
 			if (error) {
 				xfs_trans_brelse(tp, lbp);
 				return error;
@@ -1665,7 +1665,8 @@ xfs_dir2_leaf_trim_data(
 	/*
 	 * Read the offending data block.  We need its buffer.
 	 */
-	error = xfs_dir3_data_read(tp, dp, xfs_dir2_db_to_da(geo, db), 0, &dbp);
+	error = xfs_dir3_data_read(tp, dp, args->owner,
+			xfs_dir2_db_to_da(geo, db), 0, &dbp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index e21965788188..dc85197b8448 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -863,7 +863,7 @@ xfs_dir2_leafn_lookup_for_entry(
 				ASSERT(state->extravalid);
 				curbp = state->extrablk.bp;
 			} else {
-				error = xfs_dir3_data_read(tp, dp,
+				error = xfs_dir3_data_read(tp, dp, args->owner,
 						xfs_dir2_db_to_da(args->geo,
 								  newdb),
 						0, &curbp);
@@ -1949,9 +1949,8 @@ xfs_dir2_node_addname_int(
 						  &freehdr, &findex);
 	} else {
 		/* Read the data block in. */
-		error = xfs_dir3_data_read(tp, dp,
-					   xfs_dir2_db_to_da(args->geo, dbno),
-					   0, &dbp);
+		error = xfs_dir3_data_read(tp, dp, args->owner,
+				xfs_dir2_db_to_da(args->geo, dbno), 0, &dbp);
 	}
 	if (error)
 		return error;
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index 8a7b5f030a6c..22267a6e651a 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -78,7 +78,8 @@ extern void xfs_dir3_data_check(struct xfs_inode *dp, struct xfs_buf *bp);
 extern xfs_failaddr_t __xfs_dir3_data_check(struct xfs_inode *dp,
 		struct xfs_buf *bp);
 int xfs_dir3_data_read(struct xfs_trans *tp, struct xfs_inode *dp,
-		xfs_dablk_t bno, unsigned int flags, struct xfs_buf **bpp);
+		xfs_ino_t owner, xfs_dablk_t bno, unsigned int flags,
+		struct xfs_buf **bpp);
 int xfs_dir3_data_readahead(struct xfs_inode *dp, xfs_dablk_t bno,
 		unsigned int flags);
 
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 81348bea8add..01b00edfeb6d 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -193,8 +193,8 @@ xchk_dir_rec(
 		xchk_da_set_corrupt(ds, level);
 		goto out;
 	}
-	error = xfs_dir3_data_read(ds->dargs.trans, dp, rec_bno,
-			XFS_DABUF_MAP_HOLE_OK, &bp);
+	error = xfs_dir3_data_read(ds->dargs.trans, dp, ds->dargs.owner,
+			rec_bno, XFS_DABUF_MAP_HOLE_OK, &bp);
 	if (!xchk_fblock_process_error(ds->sc, XFS_DATA_FORK, rec_bno,
 			&error))
 		goto out;
@@ -311,7 +311,8 @@ xchk_directory_data_bestfree(
 		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
 	} else {
 		/* dir data format */
-		error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, 0, &bp);
+		error = xfs_dir3_data_read(sc->tp, sc->ip, sc->ip->i_ino, lblk,
+				0, &bp);
 	}
 	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, &error))
 		goto out;
@@ -524,10 +525,9 @@ xchk_directory_leaf1_bestfree(
 	/* Check all the bestfree entries. */
 	for (i = 0; i < bestcount; i++, bestp++) {
 		best = be16_to_cpu(*bestp);
-		error = xfs_dir3_data_read(sc->tp, sc->ip,
+		error = xfs_dir3_data_read(sc->tp, sc->ip, args->owner,
 				xfs_dir2_db_to_da(args->geo, i),
-				XFS_DABUF_MAP_HOLE_OK,
-				&dbp);
+				XFS_DABUF_MAP_HOLE_OK, &dbp);
 		if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk,
 				&error))
 			break;
@@ -590,7 +590,7 @@ xchk_directory_free_bestfree(
 			stale++;
 			continue;
 		}
-		error = xfs_dir3_data_read(sc->tp, sc->ip,
+		error = xfs_dir3_data_read(sc->tp, sc->ip, args->owner,
 				(freehdr.firstdb + i) * args->geo->fsbcount,
 				0, &dbp);
 		if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk,
diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c
index 59a658362b2b..b385325707e0 100644
--- a/fs/xfs/scrub/readdir.c
+++ b/fs/xfs/scrub/readdir.c
@@ -177,7 +177,7 @@ xchk_read_leaf_dir_buf(
 	if (new_off > *curoff)
 		*curoff = new_off;
 
-	return xfs_dir3_data_read(tp, dp, map.br_startoff, 0, bpp);
+	return xfs_dir3_data_read(tp, dp, dp->i_ino, map.br_startoff, 0, bpp);
 }
 
 /* Call a function for every entry in a leaf directory. */
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index fc2524b8f1f1..37a521f47f20 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -284,7 +284,8 @@ xfs_dir2_leaf_readbuf(
 	new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
 	if (new_off > *cur_off)
 		*cur_off = new_off;
-	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
+	error = xfs_dir3_data_read(args->trans, dp, args->owner,
+			map.br_startoff, 0, &bp);
 	if (error)
 		goto out;
 



* [PATCH 8/9] xfs: validate explicit directory block buffer owners
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 9/9] xfs: validate explicit directory free block owners Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 7/9] xfs: validate explicit directory data " Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Port the existing directory block header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_dir2.h       |    1 +
 fs/xfs/libxfs/xfs_dir2_block.c |   22 ++++++++++++++--------
 fs/xfs/libxfs/xfs_dir2_priv.h  |    2 +-
 fs/xfs/libxfs/xfs_swapext.c    |    2 +-
 fs/xfs/scrub/dir.c             |    2 +-
 fs/xfs/scrub/readdir.c         |    2 +-
 fs/xfs/xfs_dir2_readdir.c      |    2 +-
 7 files changed, 20 insertions(+), 13 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index 2c2564c2158d..7322284f61a0 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -91,6 +91,7 @@ extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino);
 
 xfs_failaddr_t xfs_dir3_leaf_header_check(struct xfs_buf *bp, xfs_ino_t owner);
 xfs_failaddr_t xfs_dir3_data_header_check(struct xfs_buf *bp, xfs_ino_t owner);
+xfs_failaddr_t xfs_dir3_block_header_check(struct xfs_buf *bp, xfs_ino_t owner);
 
 extern const struct xfs_buf_ops xfs_dir3_block_buf_ops;
 extern const struct xfs_buf_ops xfs_dir3_leafn_buf_ops;
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 184341bb1f6a..30eef4d9d866 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -115,18 +115,23 @@ const struct xfs_buf_ops xfs_dir3_block_buf_ops = {
 	.verify_struct = xfs_dir3_block_verify,
 };
 
-static xfs_failaddr_t
+xfs_failaddr_t
 xfs_dir3_block_header_check(
-	struct xfs_inode	*dp,
-	struct xfs_buf		*bp)
+	struct xfs_buf		*bp,
+	xfs_ino_t		owner)
 {
-	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_mount	*mp = bp->b_mount;
 
 	if (xfs_has_crc(mp)) {
 		struct xfs_dir3_blk_hdr *hdr3 = bp->b_addr;
 
-		if (be64_to_cpu(hdr3->owner) != dp->i_ino)
+		ASSERT(hdr3->magic == cpu_to_be32(XFS_DIR3_BLOCK_MAGIC));
+
+		if (be64_to_cpu(hdr3->owner) != owner) {
+			xfs_err(NULL, "dir block owner 0x%llx doesn't match block 0x%llx", owner, be64_to_cpu(hdr3->owner));
+			dump_stack();
 			return __this_address;
+		}
 	}
 
 	return NULL;
@@ -136,6 +141,7 @@ int
 xfs_dir3_block_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	struct xfs_buf		**bpp)
 {
 	struct xfs_mount	*mp = dp->i_mount;
@@ -148,7 +154,7 @@ xfs_dir3_block_read(
 		return err;
 
 	/* Check things that we can't do in the verifier. */
-	fa = xfs_dir3_block_header_check(dp, *bpp);
+	fa = xfs_dir3_block_header_check(*bpp, owner);
 	if (fa) {
 		__xfs_buf_mark_corrupt(*bpp, fa);
 		xfs_trans_brelse(tp, *bpp);
@@ -383,7 +389,7 @@ xfs_dir2_block_addname(
 	tp = args->trans;
 
 	/* Read the (one and only) directory block into bp. */
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, args->owner, &bp);
 	if (error)
 		return error;
 
@@ -698,7 +704,7 @@ xfs_dir2_block_lookup_int(
 	dp = args->dp;
 	tp = args->trans;
 
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, args->owner, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index 22267a6e651a..b1dfe3e27357 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -51,7 +51,7 @@ extern int xfs_dir_cilookup_result(struct xfs_da_args *args,
 
 /* xfs_dir2_block.c */
 extern int xfs_dir3_block_read(struct xfs_trans *tp, struct xfs_inode *dp,
-			       struct xfs_buf **bpp);
+			       xfs_ino_t owner, struct xfs_buf **bpp);
 extern int xfs_dir2_block_addname(struct xfs_da_args *args);
 extern int xfs_dir2_block_lookup(struct xfs_da_args *args);
 extern int xfs_dir2_block_removename(struct xfs_da_args *args);
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 063922989d77..12d548aa90cf 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -513,7 +513,7 @@ xfs_swapext_dir_to_sf(
 	if (!isblock)
 		return 0;
 
-	error = xfs_dir3_block_read(tp, sxi->sxi_ip2, &bp);
+	error = xfs_dir3_block_read(tp, sxi->sxi_ip2, sxi->sxi_ip2->i_ino, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 01b00edfeb6d..0db7dab744d0 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -308,7 +308,7 @@ xchk_directory_data_bestfree(
 		/* dir block format */
 		if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET))
 			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
-		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
+		error = xfs_dir3_block_read(sc->tp, sc->ip, sc->ip->i_ino, &bp);
 	} else {
 		/* dir data format */
 		error = xfs_dir3_data_read(sc->tp, sc->ip, sc->ip->i_ino, lblk,
diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c
index b385325707e0..9dc828b736a2 100644
--- a/fs/xfs/scrub/readdir.c
+++ b/fs/xfs/scrub/readdir.c
@@ -101,7 +101,7 @@ xchk_dir_walk_block(
 	unsigned int		off, next_off, end;
 	int			error;
 
-	error = xfs_dir3_block_read(sc->tp, dp, &bp);
+	error = xfs_dir3_block_read(sc->tp, dp, dp->i_ino, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 37a521f47f20..771769f9e404 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -159,7 +159,7 @@ xfs_dir2_block_getdents(
 	if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
 		return 0;
 
-	error = xfs_dir3_block_read(args->trans, dp, &bp);
+	error = xfs_dir3_block_read(args->trans, dp, args->owner, &bp);
 	if (error)
 		return error;
 



* [PATCH 9/9] xfs: validate explicit directory free block owners
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 6/9] xfs: validate directory leaf " Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 8/9] xfs: validate explicit directory block buffer owners Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 7/9] xfs: validate explicit directory data " Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Port the existing directory freespace block header checking function to
accept an owner number instead of an xfs_inode, then update the
callsites to use xfs_da_args.owner when possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_dir2_leaf.c |    3 ++-
 fs/xfs/libxfs/xfs_dir2_node.c |   32 ++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_dir2_priv.h |    2 +-
 fs/xfs/scrub/dir.c            |    2 +-
 4 files changed, 22 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index a6eee2604487..fb78ae79fdc6 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -1805,7 +1805,8 @@ xfs_dir2_node_to_leaf(
 	/*
 	 * Read the freespace block.
 	 */
-	error = xfs_dir2_free_read(tp, dp,  args->geo->freeblk, &fbp);
+	error = xfs_dir2_free_read(tp, dp, args->owner, args->geo->freeblk,
+			&fbp);
 	if (error)
 		return error;
 	xfs_dir2_free_hdr_from_disk(mp, &freehdr, fbp->b_addr);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index dc85197b8448..fe8d4fa13128 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -175,11 +175,11 @@ const struct xfs_buf_ops xfs_dir3_free_buf_ops = {
 /* Everything ok in the free block header? */
 static xfs_failaddr_t
 xfs_dir3_free_header_check(
-	struct xfs_inode	*dp,
-	xfs_dablk_t		fbno,
-	struct xfs_buf		*bp)
+	struct xfs_buf		*bp,
+	xfs_ino_t		owner,
+	xfs_dablk_t		fbno)
 {
-	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_mount	*mp = bp->b_mount;
 	int			maxbests = mp->m_dir_geo->free_max_bests;
 	unsigned int		firstdb;
 
@@ -195,7 +195,7 @@ xfs_dir3_free_header_check(
 			return __this_address;
 		if (be32_to_cpu(hdr3->nvalid) < be32_to_cpu(hdr3->nused))
 			return __this_address;
-		if (be64_to_cpu(hdr3->hdr.owner) != dp->i_ino)
+		if (be64_to_cpu(hdr3->hdr.owner) != owner)
 			return __this_address;
 	} else {
 		struct xfs_dir2_free_hdr *hdr = bp->b_addr;
@@ -214,6 +214,7 @@ static int
 __xfs_dir3_free_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	xfs_dablk_t		fbno,
 	unsigned int		flags,
 	struct xfs_buf		**bpp)
@@ -227,7 +228,7 @@ __xfs_dir3_free_read(
 		return err;
 
 	/* Check things that we can't do in the verifier. */
-	fa = xfs_dir3_free_header_check(dp, fbno, *bpp);
+	fa = xfs_dir3_free_header_check(*bpp, owner, fbno);
 	if (fa) {
 		__xfs_buf_mark_corrupt(*bpp, fa);
 		xfs_trans_brelse(tp, *bpp);
@@ -299,20 +300,23 @@ int
 xfs_dir2_free_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	xfs_dablk_t		fbno,
 	struct xfs_buf		**bpp)
 {
-	return __xfs_dir3_free_read(tp, dp, fbno, 0, bpp);
+	return __xfs_dir3_free_read(tp, dp, owner, fbno, 0, bpp);
 }
 
 static int
 xfs_dir2_free_try_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	xfs_dablk_t		fbno,
 	struct xfs_buf		**bpp)
 {
-	return __xfs_dir3_free_read(tp, dp, fbno, XFS_DABUF_MAP_HOLE_OK, bpp);
+	return __xfs_dir3_free_read(tp, dp, owner, fbno, XFS_DABUF_MAP_HOLE_OK,
+			bpp);
 }
 
 static int
@@ -717,7 +721,7 @@ xfs_dir2_leafn_lookup_for_addname(
 				if (curbp)
 					xfs_trans_brelse(tp, curbp);
 
-				error = xfs_dir2_free_read(tp, dp,
+				error = xfs_dir2_free_read(tp, dp, args->owner,
 						xfs_dir2_db_to_da(args->geo,
 								  newfdb),
 						&curbp);
@@ -1356,8 +1360,8 @@ xfs_dir2_leafn_remove(
 		 * read in the free block.
 		 */
 		fdb = xfs_dir2_db_to_fdb(geo, db);
-		error = xfs_dir2_free_read(tp, dp, xfs_dir2_db_to_da(geo, fdb),
-					   &fbp);
+		error = xfs_dir2_free_read(tp, dp, args->owner,
+				xfs_dir2_db_to_da(geo, fdb), &fbp);
 		if (error)
 			return error;
 		free = fbp->b_addr;
@@ -1716,7 +1720,7 @@ xfs_dir2_node_add_datablk(
 	 * that was just allocated.
 	 */
 	fbno = xfs_dir2_db_to_fdb(args->geo, *dbno);
-	error = xfs_dir2_free_try_read(tp, dp,
+	error = xfs_dir2_free_try_read(tp, dp, args->owner,
 			       xfs_dir2_db_to_da(args->geo, fbno), &fbp);
 	if (error)
 		return error;
@@ -1863,7 +1867,7 @@ xfs_dir2_node_find_freeblk(
 		 * so this might not succeed.  This should be really rare, so
 		 * there's no reason to avoid it.
 		 */
-		error = xfs_dir2_free_try_read(tp, dp,
+		error = xfs_dir2_free_try_read(tp, dp, args->owner,
 				xfs_dir2_db_to_da(args->geo, fbno),
 				&fbp);
 		if (error)
@@ -2302,7 +2306,7 @@ xfs_dir2_node_trim_free(
 	/*
 	 * Read the freespace block.
 	 */
-	error = xfs_dir2_free_try_read(tp, dp, fo, &bp);
+	error = xfs_dir2_free_try_read(tp, dp, args->owner, fo, &bp);
 	if (error)
 		return error;
 	/*
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index b1dfe3e27357..d0178daf915f 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -156,7 +156,7 @@ extern int xfs_dir2_node_replace(struct xfs_da_args *args);
 extern int xfs_dir2_node_trim_free(struct xfs_da_args *args, xfs_fileoff_t fo,
 		int *rvalp);
 extern int xfs_dir2_free_read(struct xfs_trans *tp, struct xfs_inode *dp,
-		xfs_dablk_t fbno, struct xfs_buf **bpp);
+		xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp);
 
 /* xfs_dir2_sf.c */
 xfs_ino_t xfs_dir2_sf_get_ino(struct xfs_mount *mp, struct xfs_dir2_sf_hdr *hdr,
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 0db7dab744d0..ab6daf8549c1 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -570,7 +570,7 @@ xchk_directory_free_bestfree(
 	int				error;
 
 	/* Read the free space block */
-	error = xfs_dir2_free_read(sc->tp, sc->ip, lblk, &bp);
+	error = xfs_dir2_free_read(sc->tp, sc->ip, sc->ip->i_ino, lblk, &bp);
 	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, &error))
 		return error;
 	xchk_buffer_recheck(sc, bp);



* [PATCHSET v24.0 0/5] xfs: online repair of extended attributes
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (27 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/5] xfs: create a blob array data structure Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of directories Darrick J. Wong
                   ` (48 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

Hi all,

This series employs atomic extent swapping to enable safe reconstruction
of extended attribute data attached to a file.  Because xattrs do not
have any redundant information to draw from, the best we can do is
salvage whatever data we can and build a new structure.

Rebuilding an extended attribute structure consists of these three
steps:

First, we walk the existing attributes to salvage as many of them as we
can, by adding them as new attributes attached to the repair tempfile.
We need to add a new xfile-based data structure to hold blobs of
arbitrary length to stage the xattr names and values.

Second, once all of the salvaged attributes have been written into the
temporary file, we use an atomic extent swap to exchange the entire
attribute fork between the two files.

Finally, we reap the old xattr blocks (which are now in the temporary
file) as carefully as we can.
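
Put together, and with all error handling, locking, and transaction
management stripped out, the flow looks roughly like the sketch below.
The three helpers marked "hypothetical" are stand-ins for machinery
introduced by the patches in this series; xrep_tempswap_contents and
struct xrep_tempswap are existing pieces of the temporary repair file
code.

STATIC int
xrep_xattr_sketch(
	struct xfs_scrub	*sc,
	struct xrep_tempswap	*tx)
{
	int			error;

	/* 1: walk the old attr fork, salvaging whatever keys/values we can */
	error = salvage_old_attrs(sc);			/* hypothetical */
	if (error)
		return error;

	/* 2: replay the salvaged attrs into the tempfile, then swap forks */
	error = rebuild_attrs_in_tempfile(sc);		/* hypothetical */
	if (error)
		return error;
	error = xrep_tempswap_contents(sc, tx);
	if (error)
		return error;

	/* 3: reap the old attr blocks, which now belong to the tempfile */
	return reap_old_attr_blocks(sc);		/* hypothetical */
}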

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-xattrs

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-xattrs
---
 fs/xfs/Makefile               |    2 
 fs/xfs/libxfs/xfs_attr.c      |    2 
 fs/xfs/libxfs/xfs_attr.h      |    2 
 fs/xfs/libxfs/xfs_da_format.h |    5 
 fs/xfs/libxfs/xfs_swapext.c   |    2 
 fs/xfs/libxfs/xfs_swapext.h   |    1 
 fs/xfs/scrub/attr.c           |   33 +
 fs/xfs/scrub/attr.h           |    7 
 fs/xfs/scrub/attr_repair.c    | 1158 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/dabtree.c        |   16 +
 fs/xfs/scrub/dabtree.h        |    3 
 fs/xfs/scrub/repair.c         |   45 ++
 fs/xfs/scrub/repair.h         |    8 
 fs/xfs/scrub/scrub.c          |    2 
 fs/xfs/scrub/tempfile.c       |  176 ++++++
 fs/xfs/scrub/tempfile.h       |    2 
 fs/xfs/scrub/tempswap.h       |    2 
 fs/xfs/scrub/trace.h          |  106 ++++
 fs/xfs/scrub/xfarray.c        |   24 +
 fs/xfs/scrub/xfarray.h        |    2 
 fs/xfs/scrub/xfblob.c         |  176 ++++++
 fs/xfs/scrub/xfblob.h         |   27 +
 fs/xfs/xfs_buf.c              |    3 
 fs/xfs/xfs_trace.h            |    2 
 24 files changed, 1801 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/scrub/attr_repair.c
 create mode 100644 fs/xfs/scrub/xfblob.c
 create mode 100644 fs/xfs/scrub/xfblob.h



* [PATCH 1/5] xfs: create a blob array data structure
  2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 5/5] xfs: flag empty xattr leaf blocks for optimization Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata.  For
the intended usage (temporarily storing extended attribute names and
values), we only have to support storing objects and retrieving them.
Use the xfile abstraction to store the attribute information in memory
that can be swapped out.
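
As an illustration of the interface (not part of the patch; the name
buffer and its length are made-up caller data, and error handling is
abbreviated), storing and recalling one object looks roughly like this:

STATIC int
xfblob_example(
	struct xfs_mount	*mp,
	unsigned char		*name,
	uint32_t		namelen)
{
	struct xfblob		*blob;
	xfblob_cookie		cookie;
	int			error;

	error = xfblob_create(mp, "example blobs", &blob);
	if (error)
		return error;

	/* Append the object; the cookie is the handle for retrieval. */
	error = xfblob_store(blob, &cookie, name, namelen);
	if (error)
		goto out;

	/* Read it back into a buffer of at least namelen bytes. */
	error = xfblob_load(blob, cookie, name, namelen);
out:
	xfblob_destroy(blob);
	return error;
}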

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile       |    1 
 fs/xfs/scrub/xfblob.c |  152 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/xfblob.h |   25 ++++++++
 3 files changed, 178 insertions(+)
 create mode 100644 fs/xfs/scrub/xfblob.c
 create mode 100644 fs/xfs/scrub/xfblob.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0abdcc69cd7f..ac3bda492446 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -202,6 +202,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   repair.o \
 				   rmap_repair.o \
 				   tempfile.o \
+				   xfblob.o \
 				   xfbtree.o \
 				   )
 
diff --git a/fs/xfs/scrub/xfblob.c b/fs/xfs/scrub/xfblob.c
new file mode 100644
index 000000000000..c3a646cad5ed
--- /dev/null
+++ b/fs/xfs/scrub/xfblob.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "scrub/scrub.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/xfblob.h"
+
+/*
+ * XFS Blob Storage
+ * ================
+ * Stores and retrieves blobs using an xfile.  Objects are appended to the file
+ * and the offset is returned as a magic cookie for retrieval.
+ */
+
+#define XB_KEY_MAGIC	0xABAADDAD
+struct xb_key {
+	uint32_t		xb_magic;  /* XB_KEY_MAGIC */
+	uint32_t		xb_size;   /* size of the blob, in bytes */
+	loff_t			xb_offset; /* byte offset of this key */
+	/* blob comes after here */
+} __packed;
+
+/* Initialize a blob storage object. */
+int
+xfblob_create(
+	struct xfs_mount	*mp,
+	const char		*description,
+	struct xfblob		**blobp)
+{
+	struct xfblob		*blob;
+	struct xfile		*xfile;
+	int			error;
+
+	error = xfile_create(mp, description, 0, &xfile);
+	if (error)
+		return error;
+
+	blob = kmalloc(sizeof(struct xfblob), XCHK_GFP_FLAGS);
+	if (!blob) {
+		error = -ENOMEM;
+		goto out_xfile;
+	}
+
+	blob->xfile = xfile;
+	blob->last_offset = PAGE_SIZE;
+
+	*blobp = blob;
+	return 0;
+
+out_xfile:
+	xfile_destroy(xfile);
+	return error;
+}
+
+/* Destroy a blob storage object. */
+void
+xfblob_destroy(
+	struct xfblob	*blob)
+{
+	xfile_destroy(blob->xfile);
+	kfree(blob);
+}
+
+/* Retrieve a blob. */
+int
+xfblob_load(
+	struct xfblob	*blob,
+	xfblob_cookie	cookie,
+	void		*ptr,
+	uint32_t	size)
+{
+	struct xb_key	key;
+	int		error;
+
+	error = xfile_obj_load(blob->xfile, &key, sizeof(key), cookie);
+	if (error)
+		return error;
+
+	if (key.xb_magic != XB_KEY_MAGIC || key.xb_offset != cookie) {
+		ASSERT(0);
+		return -ENODATA;
+	}
+	if (size < key.xb_size) {
+		ASSERT(0);
+		return -EFBIG;
+	}
+
+	return xfile_obj_load(blob->xfile, ptr, key.xb_size,
+			cookie + sizeof(key));
+}
+
+/* Store a blob. */
+int
+xfblob_store(
+	struct xfblob	*blob,
+	xfblob_cookie	*cookie,
+	void		*ptr,
+	uint32_t	size)
+{
+	struct xb_key	key = {
+		.xb_offset = blob->last_offset,
+		.xb_magic = XB_KEY_MAGIC,
+		.xb_size = size,
+	};
+	loff_t		pos = blob->last_offset;
+	int		error;
+
+	error = xfile_obj_store(blob->xfile, &key, sizeof(key), pos);
+	if (error)
+		return error;
+
+	pos += sizeof(key);
+	error = xfile_obj_store(blob->xfile, ptr, size, pos);
+	if (error)
+		goto out_err;
+
+	*cookie = blob->last_offset;
+	blob->last_offset += sizeof(key) + size;
+	return 0;
+out_err:
+	xfile_discard(blob->xfile, blob->last_offset, sizeof(key));
+	return error;
+}
+
+/* Free a blob. */
+int
+xfblob_free(
+	struct xfblob	*blob,
+	xfblob_cookie	cookie)
+{
+	struct xb_key	key;
+	int		error;
+
+	error = xfile_obj_load(blob->xfile, &key, sizeof(key), cookie);
+	if (error)
+		return error;
+
+	if (key.xb_magic != XB_KEY_MAGIC || key.xb_offset != cookie) {
+		ASSERT(0);
+		return -ENODATA;
+	}
+
+	xfile_discard(blob->xfile, cookie, sizeof(key) + key.xb_size);
+	return 0;
+}
diff --git a/fs/xfs/scrub/xfblob.h b/fs/xfs/scrub/xfblob.h
new file mode 100644
index 000000000000..2c1810b4a4eb
--- /dev/null
+++ b/fs/xfs/scrub/xfblob.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_XFBLOB_H__
+#define __XFS_SCRUB_XFBLOB_H__
+
+struct xfblob {
+	struct xfile	*xfile;
+	loff_t		last_offset;
+};
+
+typedef loff_t		xfblob_cookie;
+
+int xfblob_create(struct xfs_mount *mp, const char *descr,
+		struct xfblob **blobp);
+void xfblob_destroy(struct xfblob *blob);
+int xfblob_load(struct xfblob *blob, xfblob_cookie cookie, void *ptr,
+		uint32_t size);
+int xfblob_store(struct xfblob *blob, xfblob_cookie *cookie, void *ptr,
+		uint32_t size);
+int xfblob_free(struct xfblob *blob, xfblob_cookie cookie);
+
+#endif /* __XFS_SCRUB_XFBLOB_H__ */



* [PATCH 2/5] xfs: use atomic extent swapping to fix user file fork data
  2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 3/5] xfs: repair extended attributes Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/5] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Build on the code that was recently added to the temporary repair file
machinery so that we can atomically switch the contents of any file fork,
even if the fork is in local format.  The upcoming functions to repair
xattrs, directories, and symlinks will need that capability.

Repair can lock out access to these user files by holding IOLOCK_EXCL on
them.  Therefore, it is safe to drop the ILOCK of both the
file being repaired and the tempfile being used for staging, and cancel
the scrub transaction.  We do this so that we can reuse the resource
estimation and transaction allocation functions used by a regular file
exchange operation.
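
A sketch of how a repair function is expected to use these helpers once
it has finished building the replacement fork in the tempfile (not an
excerpt; error handling is abbreviated, and the caller is assumed to
hold IOLOCK_EXCL on both files and to have already dropped the ILOCKs
and cancelled its scrub transaction, as described above):

STATIC int
xrep_commit_fork_sketch(
	struct xfs_scrub	*sc,
	struct xrep_tempswap	*tx)
{
	int			error;

	/*
	 * Estimate the swap resources, allocate a transaction, and
	 * re-take ILOCK_EXCL on both files.
	 */
	error = xrep_tempswap_trans_alloc(sc, XFS_ATTR_FORK, tx);
	if (error)
		return error;

	/* Exchange the staged fork contents with the file being repaired. */
	return xrep_tempswap_contents(sc, tx);
}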

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_swapext.c |    2 
 fs/xfs/libxfs/xfs_swapext.h |    1 
 fs/xfs/scrub/tempfile.c     |  176 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.h     |    2 
 fs/xfs/scrub/tempswap.h     |    2 
 5 files changed, 182 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 12d548aa90cf..42df372d1a89 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -709,7 +709,7 @@ xfs_swapext_rmapbt_blocks(
 }
 
 /* Estimate the bmbt and rmapbt overhead required to exchange extents. */
-static int
+int
 xfs_swapext_estimate_overhead(
 	struct xfs_swapext_req	*req)
 {
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index 155add23d8e2..13824310f2a2 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -145,6 +145,7 @@ unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
 void xfs_swapext_reflink_finish(struct xfs_trans *tp,
 		const struct xfs_swapext_req *req, unsigned int reflink_state);
 
+int xfs_swapext_estimate_overhead(struct xfs_swapext_req *req);
 int xfs_swapext_estimate(struct xfs_swapext_req *req);
 
 extern struct kmem_cache	*xfs_swapext_intent_cache;
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index 7214d2370bc9..c9a089b169f2 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -219,6 +219,19 @@ xrep_tempfile_iunlock(
 	sc->temp_ilock_flags &= ~XFS_ILOCK_EXCL;
 }
 
+/*
+ * Begin the process of making changes to both the file being scrubbed and
+ * the temporary file by taking ILOCK_EXCL on both.
+ */
+void
+xrep_tempfile_ilock_both(
+	struct xfs_scrub	*sc)
+{
+	xfs_lock_two_inodes(sc->ip, XFS_ILOCK_EXCL, sc->tempip, XFS_ILOCK_EXCL);
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	sc->temp_ilock_flags |= XFS_ILOCK_EXCL;
+}
+
 /* Release the temporary file. */
 void
 xrep_tempfile_rele(
@@ -500,6 +513,78 @@ xrep_tempswap_prep_request(
 	return 0;
 }
 
+/*
+ * Fill out the swapext resource estimation structures in preparation for
+ * swapping the contents of a metadata file that we've rebuilt in the temp
+ * file.  Caller must hold IOLOCK_EXCL but not ILOCK_EXCL on both files.
+ */
+STATIC int
+xrep_tempswap_estimate(
+	struct xfs_scrub	*sc,
+	struct xrep_tempswap	*tx)
+{
+	struct xfs_swapext_req	*req = &tx->req;
+	struct xfs_ifork	*ifp;
+	struct xfs_ifork	*tifp;
+	int			state = 0;
+
+	/*
+	 * Deal with either fork being in local format.  The swapext code only
+	 * knows how to exchange block mappings for regular files, so we only
+	 * have to know about local format for xattrs and directories.
+	 */
+	ifp = xfs_ifork_ptr(sc->ip, req->whichfork);
+	if (ifp->if_format == XFS_DINODE_FMT_LOCAL)
+		state |= 1;
+
+	tifp = xfs_ifork_ptr(sc->tempip, req->whichfork);
+	if (tifp->if_format == XFS_DINODE_FMT_LOCAL)
+		state |= 2;
+
+	switch (state) {
+	case 0:
+		/* Both files have mapped extents; use the regular estimate. */
+		return xfs_xchg_range_estimate(req);
+	case 1:
+		/*
+		 * The file being repaired is in local format, but the temp
+		 * file has mapped extents.  To perform the swap, the file
+		 * being repaired will be reinitialized to have an empty extent
+		 * map, so the number of exchanges is the temporary file's
+		 * extent count.
+		 */
+		req->ip1_bcount = sc->tempip->i_nblocks;
+		req->nr_exchanges = tifp->if_nextents;
+		break;
+	case 2:
+		/*
+		 * The temporary file is in local format, but the file being
+		 * repaired has mapped extents.  To perform the swap, the temp
+		 * file will be converted to have a single block, so the number
+		 * of exchanges is (worst case) the extent count of the file
+		 * being repaired plus one more.
+		 */
+		req->ip1_bcount = 1;
+		req->ip2_bcount = sc->ip->i_nblocks;
+		req->nr_exchanges = ifp->if_nextents;
+		break;
+	case 3:
+		/*
+		 * Both forks are in local format.  To perform the swap, the
+		 * file being repaired will be reinitialized to have an empty
+		 * extent map and the temp file will be converted to have a
+		 * single block.  Only one exchange is required.  Presumably,
+		 * the caller could not exchange the two inode fork areas
+		 * directly.
+		 */
+		req->ip1_bcount = 1;
+		req->nr_exchanges = 1;
+		break;
+	}
+
+	return xfs_swapext_estimate_overhead(req);
+}
+
 /*
  * Obtain a quota reservation to make sure we don't hit EDQUOT.  We can skip
  * this if quota enforcement is disabled or if both inodes' dquots are the
@@ -586,6 +671,49 @@ xrep_tempswap_trans_reserve(
 	return xrep_tempswap_reserve_quota(sc, tx);
 }
 
+/*
+ * Allocate a transaction, ILOCK the temporary file and the file being
+ * repaired, and join them to the transaction in preparation to swap fork
+ * contents as part of a repair operation.
+ */
+int
+xrep_tempswap_trans_alloc(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	struct xrep_tempswap	*tx)
+{
+	unsigned int		flags = 0;
+	int			error;
+
+	ASSERT(sc->tp == NULL);
+
+	error = xrep_tempswap_prep_request(sc, whichfork, tx);
+	if (error)
+		return error;
+
+	error = xrep_tempswap_estimate(sc, tx);
+	if (error)
+		return error;
+
+	if (xfs_has_lazysbcount(sc->mp))
+		flags |= XFS_TRANS_RES_FDBLKS;
+
+	error = xrep_tempswap_grab_log_assist(sc);
+	if (error)
+		return error;
+
+	error = xfs_trans_alloc(sc->mp, &M_RES(sc->mp)->tr_itruncate,
+			tx->req.resblks, 0, flags, &sc->tp);
+	if (error)
+		return error;
+
+	sc->temp_ilock_flags |= XFS_ILOCK_EXCL;
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_xchg_range_ilock(sc->tp, sc->ip, sc->tempip);
+
+	return xrep_tempswap_reserve_quota(sc, tx);
+}
+
 /* Swap forks between the file being repaired and the temporary file. */
 int
 xrep_tempswap_contents(
@@ -617,3 +745,51 @@ xrep_tempswap_contents(
 
 	return 0;
 }
+
+/*
+ * Write local format data from one of the temporary file's forks into the same
+ * fork of file being repaired, and swap the file sizes, if appropriate.
+ * Caller must ensure that the file being repaired has enough fork space to
+ * hold all the bytes.
+ */
+void
+xrep_tempfile_copyout_local(
+	struct xfs_scrub	*sc,
+	int			whichfork)
+{
+	struct xfs_ifork	*temp_ifp;
+	struct xfs_ifork	*ifp;
+	unsigned int		ilog_flags = XFS_ILOG_CORE;
+
+	temp_ifp = xfs_ifork_ptr(sc->tempip, whichfork);
+	ifp = xfs_ifork_ptr(sc->ip, whichfork);
+
+	ASSERT(temp_ifp != NULL);
+	ASSERT(ifp != NULL);
+	ASSERT(temp_ifp->if_format == XFS_DINODE_FMT_LOCAL);
+	ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL);
+
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+		ASSERT(sc->tempip->i_disk_size <= xfs_inode_data_fork_size(sc->ip));
+		break;
+	case XFS_ATTR_FORK:
+		ASSERT(sc->tempip->i_forkoff >= sc->ip->i_forkoff);
+		break;
+	default:
+		ASSERT(0);
+		return;
+	}
+
+	xfs_idestroy_fork(ifp);
+	xfs_init_local_fork(sc->ip, whichfork, temp_ifp->if_u1.if_data,
+			temp_ifp->if_bytes);
+
+	if (whichfork == XFS_DATA_FORK) {
+		i_size_write(VFS_I(sc->ip), i_size_read(VFS_I(sc->tempip)));
+		sc->ip->i_disk_size = sc->tempip->i_disk_size;
+	}
+
+	ilog_flags |= xfs_ilog_fdata(whichfork);
+	xfs_trans_log_inode(sc->tp, sc->ip, ilog_flags);
+}
diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h
index 282637f36f3d..402957f7f2b3 100644
--- a/fs/xfs/scrub/tempfile.h
+++ b/fs/xfs/scrub/tempfile.h
@@ -16,6 +16,7 @@ void xrep_tempfile_iounlock(struct xfs_scrub *sc);
 void xrep_tempfile_ilock(struct xfs_scrub *sc);
 bool xrep_tempfile_ilock_nowait(struct xfs_scrub *sc);
 void xrep_tempfile_iunlock(struct xfs_scrub *sc);
+void xrep_tempfile_ilock_both(struct xfs_scrub *sc);
 
 int xrep_tempfile_prealloc(struct xfs_scrub *sc, xfs_fileoff_t off,
 		xfs_filblks_t len);
@@ -31,6 +32,7 @@ int xrep_tempfile_copyin(struct xfs_scrub *sc, xfs_fileoff_t off,
 int xrep_tempfile_set_isize(struct xfs_scrub *sc, unsigned long long isize);
 
 int xrep_tempfile_roll_trans(struct xfs_scrub *sc);
+void xrep_tempfile_copyout_local(struct xfs_scrub *sc, int whichfork);
 #else
 static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc)
 {
diff --git a/fs/xfs/scrub/tempswap.h b/fs/xfs/scrub/tempswap.h
index 62e88cc6d91a..bef8d2d2134d 100644
--- a/fs/xfs/scrub/tempswap.h
+++ b/fs/xfs/scrub/tempswap.h
@@ -14,6 +14,8 @@ struct xrep_tempswap {
 int xrep_tempswap_grab_log_assist(struct xfs_scrub *sc);
 int xrep_tempswap_trans_reserve(struct xfs_scrub *sc, int whichfork,
 		struct xrep_tempswap *ti);
+int xrep_tempswap_trans_alloc(struct xfs_scrub *sc, int whichfork,
+		struct xrep_tempswap *ti);
 
 int xrep_tempswap_contents(struct xfs_scrub *sc, struct xrep_tempswap *ti);
 #endif /* CONFIG_XFS_ONLINE_REPAIR */



* [PATCH 3/5] xfs: repair extended attributes
  2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/5] xfs: create a blob array data structure Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 5/5] xfs: flag empty xattr leaf blocks for optimization Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/5] xfs: use atomic extent swapping to fix user file fork data Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/5] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, stage a new attribute structure in a
temporary file and use the atomic extent swapping mechanism to commit
the results in bulk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile               |    1 
 fs/xfs/libxfs/xfs_attr.c      |    2 
 fs/xfs/libxfs/xfs_attr.h      |    2 
 fs/xfs/libxfs/xfs_da_format.h |    5 
 fs/xfs/scrub/attr.c           |   20 +
 fs/xfs/scrub/attr.h           |    7 
 fs/xfs/scrub/attr_repair.c    | 1158 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c         |   45 ++
 fs/xfs/scrub/repair.h         |    8 
 fs/xfs/scrub/scrub.c          |    2 
 fs/xfs/scrub/trace.h          |  105 ++++
 fs/xfs/scrub/xfarray.c        |   24 +
 fs/xfs/scrub/xfarray.h        |    2 
 fs/xfs/scrub/xfblob.c         |   24 +
 fs/xfs/scrub/xfblob.h         |    2 
 fs/xfs/xfs_buf.c              |    3 
 fs/xfs/xfs_trace.h            |    2 
 17 files changed, 1408 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/attr_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ac3bda492446..0ae616f25a98 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -188,6 +188,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
+				   attr_repair.o \
 				   bmap_repair.o \
 				   cow_repair.o \
 				   fscounters_repair.o \
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 564345a17119..d38a4c42a912 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -1095,7 +1095,7 @@ xfs_attr_set(
  * External routines when attribute list is inside the inode
  *========================================================================*/
 
-static inline int xfs_attr_sf_totsize(struct xfs_inode *dp)
+int xfs_attr_sf_totsize(struct xfs_inode *dp)
 {
 	struct xfs_attr_shortform *sf;
 
diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h
index 81be9b3e4004..e4f55008552b 100644
--- a/fs/xfs/libxfs/xfs_attr.h
+++ b/fs/xfs/libxfs/xfs_attr.h
@@ -618,4 +618,6 @@ extern struct kmem_cache *xfs_attr_intent_cache;
 int __init xfs_attr_intent_init_cache(void);
 void xfs_attr_intent_destroy_cache(void);
 
+int xfs_attr_sf_totsize(struct xfs_inode *dp);
+
 #endif	/* __XFS_ATTR_H__ */
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 9d332415e0b6..e37de511bc2f 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -706,6 +706,11 @@ struct xfs_attr3_leafblock {
 #define XFS_ATTR_INCOMPLETE	(1u << XFS_ATTR_INCOMPLETE_BIT)
 #define XFS_ATTR_NSP_ONDISK_MASK	(XFS_ATTR_ROOT | XFS_ATTR_SECURE)
 
+#define XFS_ATTR_NAMESPACE_STR \
+	{ XFS_ATTR_LOCAL,	"local" }, \
+	{ XFS_ATTR_ROOT,	"root" }, \
+	{ XFS_ATTR_SECURE,	"secure" }
+
 /*
  * Alignment for namelist and valuelist entries (since they are mixed
  * there can be only one alignment value)
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 090710acc4b6..1401525074a3 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -10,6 +10,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_log_format.h"
+#include "xfs_trans.h"
 #include "xfs_inode.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
@@ -20,6 +21,7 @@
 #include "scrub/common.h"
 #include "scrub/dabtree.h"
 #include "scrub/attr.h"
+#include "scrub/repair.h"
 
 /* Free the buffers linked from the xattr buffer. */
 static void
@@ -35,6 +37,8 @@ xchk_xattr_buf_cleanup(
 	kvfree(ab->value);
 	ab->value = NULL;
 	ab->value_sz = 0;
+	kvfree(ab->name);
+	ab->name = NULL;
 }
 
 /*
@@ -65,7 +69,7 @@ xchk_xattr_want_freemap(
  * reallocating the buffer if necessary.  Buffer contents are not preserved
  * across a reallocation.
  */
-static int
+int
 xchk_setup_xattr_buf(
 	struct xfs_scrub	*sc,
 	size_t			value_size)
@@ -95,6 +99,12 @@ xchk_setup_xattr_buf(
 			return -ENOMEM;
 	}
 
+	if (xchk_could_repair(sc)) {
+		ab->name = kvmalloc(XATTR_NAME_MAX + 1, XCHK_GFP_FLAGS);
+		if (!ab->name)
+			return -ENOMEM;
+	}
+
 resize_value:
 	if (ab->value_sz >= value_size)
 		return 0;
@@ -121,6 +131,12 @@ xchk_setup_xattr(
 {
 	int			error;
 
+	if (xchk_could_repair(sc)) {
+		error = xrep_setup_xattr(sc);
+		if (error)
+			return error;
+	}
+
 	/*
 	 * We failed to get memory while checking attrs, so this time try to
 	 * get all the memory we're ever going to need.  Allocate the buffer
@@ -239,7 +255,7 @@ xchk_xattr_listent(
  * Within a char, the lowest bit of the char represents the byte with
  * the smallest address
  */
-STATIC bool
+bool
 xchk_xattr_set_map(
 	struct xfs_scrub	*sc,
 	unsigned long		*map,
diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h
index 5f6835752738..e90e9195c882 100644
--- a/fs/xfs/scrub/attr.h
+++ b/fs/xfs/scrub/attr.h
@@ -16,9 +16,16 @@ struct xchk_xattr_buf {
 	/* Bitmap of free space in xattr leaf blocks. */
 	unsigned long		*freemap;
 
+	/* Memory buffer used to hold salvaged xattr names. */
+	unsigned char		*name;
+
 	/* Memory buffer used to extract xattr values. */
 	void			*value;
 	size_t			value_sz;
 };
 
+bool xchk_xattr_set_map(struct xfs_scrub *sc, unsigned long *map,
+		unsigned int start, unsigned int len);
+int xchk_setup_xattr_buf(struct xfs_scrub *sc, size_t value_size);
+
 #endif	/* __XFS_SCRUB_ATTR_H__ */
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
new file mode 100644
index 000000000000..3362f784e4e5
--- /dev/null
+++ b/fs/xfs/scrub/attr_repair.c
@@ -0,0 +1,1158 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_attr.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr_sf.h"
+#include "xfs_attr_remote.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
+#include "xfs_acl.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/tempfile.h"
+#include "scrub/tempswap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/xfblob.h"
+#include "scrub/attr.h"
+#include "scrub/reap.h"
+
+/*
+ * Extended Attribute Repair
+ * =========================
+ *
+ * We repair extended attributes by reading the xattr leaf blocks looking for
+ * attributes.  Salvaged attrs are added to a private hidden temporary file.
+ * When we're done salvaging, we rewrite the xattr block owners and use an
+ * atomic extent swap to commit the new xattr blocks to the file being
+ * repaired.
+ */
+
+struct xrep_xattr_key {
+	/* Cookie for retrieval of the xattr name. */
+	xfblob_cookie		name_cookie;
+
+	/* Cookie for retrieval of the xattr value. */
+	xfblob_cookie		value_cookie;
+
+	/* Hash of the dirent name. */
+	unsigned int		hash;
+
+	/* XFS_ATTR_* flags */
+	int			flags;
+
+	/* Length of the value and name. */
+	uint32_t		valuelen;
+	uint16_t		namelen;
+};
+
+struct xrep_xattr {
+	struct xfs_scrub	*sc;
+
+	struct xrep_tempswap	tx;
+
+	/* xattr keys */
+	struct xfarray		*xattr_records;
+
+	/* xattr values */
+	struct xfblob		*xattr_blobs;
+
+	/* Number of attributes that we are salvaging. */
+	unsigned long long	attrs_found;
+};
+
+/* Absorb up to 8 pages of attrs before we flush them to the temp file. */
+#define XREP_XATTR_SALVAGE_BYTES	(PAGE_SIZE * 8)
+
+/* Set up to recreate the extended attributes. */
+int
+xrep_setup_xattr(
+	struct xfs_scrub	*sc)
+{
+	return xrep_tempfile_create(sc, S_IFREG);
+}
+
+/*
+ * Decide if we want to salvage this attribute.  We don't bother with
+ * incomplete or oversized keys or values.
+ */
+STATIC int
+xrep_xattr_want_salvage(
+	int			flags,
+	const void		*name,
+	int			namelen,
+	int			valuelen)
+{
+	if (flags & XFS_ATTR_INCOMPLETE)
+		return false;
+	if (namelen > XATTR_NAME_MAX || namelen <= 0)
+		return false;
+	if (valuelen > XATTR_SIZE_MAX || valuelen < 0)
+		return false;
+	return true;
+}
+
+/* Allocate an in-core record to hold xattrs while we rebuild the xattr data. */
+STATIC int
+xrep_xattr_salvage_key(
+	struct xrep_xattr	*rx,
+	int			flags,
+	unsigned char		*name,
+	int			namelen,
+	unsigned char		*value,
+	int			valuelen)
+{
+	struct xrep_xattr_key	key = {
+		.valuelen	= valuelen,
+		.flags		= flags & (XFS_ATTR_ROOT | XFS_ATTR_SECURE),
+	};
+	unsigned int		i = 0;
+	int			error = 0;
+
+	if (xchk_should_terminate(rx->sc, &error))
+		return error;
+
+	/*
+	 * Truncate the name to the first character that would trip namecheck.
+	 * If we no longer have a name after that, ignore this attribute.
+	 */
+	while (i < namelen && name[i] != 0)
+		i++;
+	if (i == 0)
+		return 0;
+	key.namelen = i;
+	key.hash = xfs_da_hashname(name, key.namelen);
+
+	trace_xrep_xattr_salvage_key(rx->sc->ip, key.flags, name, key.namelen,
+			key.valuelen);
+
+	error = xfblob_store(rx->xattr_blobs, &key.name_cookie, name,
+			key.namelen);
+	if (error)
+		return error;
+
+	error = xfblob_store(rx->xattr_blobs, &key.value_cookie, value,
+			key.valuelen);
+	if (error)
+		return error;
+
+	error = xfarray_append(rx->xattr_records, &key);
+	if (error)
+		return error;
+
+	rx->attrs_found++;
+	return 0;
+}
+
+/*
+ * Record a shortform extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_sf_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_shortform	*sf,
+	struct xfs_attr_sf_entry	*sfe)
+{
+	struct xfs_scrub		*sc = rx->sc;
+	struct xchk_xattr_buf		*ab = sc->buf;
+	unsigned char			*name = sfe->nameval;
+	unsigned char			*value = &sfe->nameval[sfe->namelen];
+
+	if (!xchk_xattr_set_map(sc, ab->usedmap, (char *)name - (char *)sf,
+			sfe->namelen))
+		return 0;
+
+	if (!xchk_xattr_set_map(sc, ab->usedmap, (char *)value - (char *)sf,
+			sfe->valuelen))
+		return 0;
+
+	if (!xrep_xattr_want_salvage(sfe->flags, sfe->nameval, sfe->namelen,
+			sfe->valuelen))
+		return 0;
+
+	return xrep_xattr_salvage_key(rx, sfe->flags, sfe->nameval,
+			sfe->namelen, value, sfe->valuelen);
+}
+
+/*
+ * Record a local format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_local_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_leaf_entry	*ent,
+	unsigned int			nameidx,
+	const char			*buf_end,
+	struct xfs_attr_leaf_name_local	*lentry)
+{
+	struct xchk_xattr_buf		*ab = rx->sc->buf;
+	unsigned char			*value;
+	unsigned int			valuelen;
+	unsigned int			namesize;
+
+	/*
+	 * Decode the leaf local entry format.  If something seems wrong, we
+	 * junk the attribute.
+	 */
+	valuelen = be16_to_cpu(lentry->valuelen);
+	namesize = xfs_attr_leaf_entsize_local(lentry->namelen, valuelen);
+	if ((char *)lentry + namesize > buf_end)
+		return 0;
+	if (!xrep_xattr_want_salvage(ent->flags, lentry->nameval,
+			lentry->namelen, valuelen))
+		return 0;
+	if (!xchk_xattr_set_map(rx->sc, ab->usedmap, nameidx, namesize))
+		return 0;
+
+	/* Try to save this attribute. */
+	value = &lentry->nameval[lentry->namelen];
+	return xrep_xattr_salvage_key(rx, ent->flags, lentry->nameval,
+			lentry->namelen, value, valuelen);
+}
+
+/*
+ * Record a remote format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_remote_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_leaf_entry	*ent,
+	unsigned int			nameidx,
+	const char			*buf_end,
+	struct xfs_attr_leaf_name_remote *rentry,
+	unsigned int			ent_idx,
+	struct xfs_buf			*leaf_bp)
+{
+	struct xfs_da_args		args = {
+		.trans			= rx->sc->tp,
+		.dp			= rx->sc->ip,
+		.index			= ent_idx,
+		.geo			= rx->sc->mp->m_attr_geo,
+		.owner			= rx->sc->ip->i_ino,
+	};
+	struct xchk_xattr_buf		*ab = rx->sc->buf;
+	unsigned int			valuelen;
+	unsigned int			namesize;
+	int				error;
+
+	/*
+	 * Decode the leaf remote entry format.  If something seems wrong, we
+	 * junk the attribute.  Note that we should never find a zero-length
+	 * remote attribute value.
+	 */
+	valuelen = be32_to_cpu(rentry->valuelen);
+	namesize = xfs_attr_leaf_entsize_remote(rentry->namelen);
+	if ((char *)rentry + namesize > buf_end)
+		return 0;
+	if (valuelen == 0 ||
+	    !xrep_xattr_want_salvage(ent->flags, rentry->name, rentry->namelen,
+			valuelen))
+		return 0;
+	if (!xchk_xattr_set_map(rx->sc, ab->usedmap, nameidx, namesize))
+		return 0;
+
+	/*
+	 * Enlarge the buffer (if needed) to hold the value that we're trying
+	 * to salvage from the old extended attribute data.
+	 */
+	error = xchk_setup_xattr_buf(rx->sc, valuelen);
+	if (error == -ENOMEM)
+		error = -EDEADLOCK;
+	if (error)
+		return error;
+
+	/* Look up the remote value and stash it for reconstruction. */
+	args.valuelen = valuelen;
+	args.namelen = rentry->namelen;
+	args.name = rentry->name;
+	args.value = ab->value;
+	error = xfs_attr3_leaf_getvalue(leaf_bp, &args);
+	if (error || args.rmtblkno == 0)
+		goto err_free;
+
+	error = xfs_attr_rmtval_get(&args);
+	if (error)
+		goto err_free;
+
+	/* Try to save this attribute. */
+	error = xrep_xattr_salvage_key(rx, ent->flags, rentry->name,
+			rentry->namelen, ab->value, valuelen);
+err_free:
+	/* remote value was garbage, junk it */
+	if (error == -EFSBADCRC || error == -EFSCORRUPTED)
+		error = 0;
+	return error;
+}
+
+/* Extract every xattr key that we can from this attr fork block. */
+STATIC int
+xrep_xattr_recover_leaf(
+	struct xrep_xattr		*rx,
+	struct xfs_buf			*bp)
+{
+	struct xfs_attr3_icleaf_hdr	leafhdr;
+	struct xfs_scrub		*sc = rx->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_attr_leafblock	*leaf;
+	struct xfs_attr_leaf_name_local	*lentry;
+	struct xfs_attr_leaf_name_remote *rentry;
+	struct xfs_attr_leaf_entry	*ent;
+	struct xfs_attr_leaf_entry	*entries;
+	struct xchk_xattr_buf		*ab = rx->sc->buf;
+	char				*buf_end;
+	size_t				off;
+	unsigned int			nameidx;
+	unsigned int			hdrsize;
+	int				i;
+	int				error = 0;
+
+	bitmap_zero(ab->usedmap, mp->m_attr_geo->blksize);
+
+	/* Check the leaf header */
+	leaf = bp->b_addr;
+	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
+	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
+	xchk_xattr_set_map(sc, ab->usedmap, 0, hdrsize);
+	entries = xfs_attr3_leaf_entryp(leaf);
+
+	buf_end = (char *)bp->b_addr + mp->m_attr_geo->blksize;
+	for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) {
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		/* Skip key if it conflicts with something else? */
+		off = (char *)ent - (char *)leaf;
+		if (!xchk_xattr_set_map(sc, ab->usedmap, off,
+				sizeof(xfs_attr_leaf_entry_t)))
+			continue;
+
+		/* Check the name information. */
+		nameidx = be16_to_cpu(ent->nameidx);
+		if (nameidx < leafhdr.firstused ||
+		    nameidx >= mp->m_attr_geo->blksize)
+			continue;
+
+		if (ent->flags & XFS_ATTR_LOCAL) {
+			lentry = xfs_attr3_leaf_name_local(leaf, i);
+			error = xrep_xattr_salvage_local_attr(rx, ent, nameidx,
+					buf_end, lentry);
+		} else {
+			rentry = xfs_attr3_leaf_name_remote(leaf, i);
+			error = xrep_xattr_salvage_remote_attr(rx, ent, nameidx,
+					buf_end, rentry, i, bp);
+		}
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Try to recover shortform attrs. */
+STATIC int
+xrep_xattr_recover_sf(
+	struct xrep_xattr		*rx)
+{
+	struct xfs_scrub		*sc = rx->sc;
+	struct xchk_xattr_buf		*ab = sc->buf;
+	struct xfs_attr_shortform	*sf;
+	struct xfs_attr_sf_entry	*sfe;
+	struct xfs_attr_sf_entry	*next;
+	struct xfs_ifork		*ifp;
+	unsigned char			*end;
+	int				i;
+	int				error = 0;
+
+	ifp = xfs_ifork_ptr(rx->sc->ip, XFS_ATTR_FORK);
+
+	bitmap_zero(ab->usedmap, ifp->if_bytes);
+	sf = (struct xfs_attr_shortform *)rx->sc->ip->i_af.if_u1.if_data;
+	end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes;
+	xchk_xattr_set_map(sc, ab->usedmap, 0, sizeof(sf->hdr));
+
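+	/* Nothing to salvage if the fork can't hold even the sf header. */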
+	sfe = &sf->list[0];
+	if ((unsigned char *)sfe > end)
+		return 0;
+
+	for (i = 0; i < sf->hdr.count; i++) {
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		next = xfs_attr_sf_nextentry(sfe);
+		if ((unsigned char *)next > end)
+			break;
+
+		if (xchk_xattr_set_map(sc, ab->usedmap,
+				(char *)sfe - (char *)sf,
+				sizeof(struct xfs_attr_sf_entry))) {
+			/*
+			 * No conflicts with the sf entry; let's save this
+			 * attribute.
+			 */
+			error = xrep_xattr_salvage_sf_attr(rx, sf, sfe);
+			if (error)
+				return error;
+		}
+
+		sfe = next;
+	}
+
+	return 0;
+}
+
+/*
+ * Try to return a buffer of xattr data for a given physical extent.
+ *
+ * Because the buffer cache get function complains if it finds a buffer
+ * matching the block number but not matching the length, we must be careful to
+ * look for incore buffers (up to the maximum length of a remote value) that
+ * could be hiding anywhere in the physical range.  If we find an incore
+ * buffer, we can pass that to the caller.  Optionally, read a single block and
+ * pass that back.
+ *
+ * Note the subtlety that remote attr value blocks for which there is no incore
+ * buffer will be passed to the callback one block at a time.  These buffers
+ * will not have any ops attached and must be staled to prevent aliasing with
+ * multiblock buffers once we drop the ILOCK.
+ */
+STATIC int
+xrep_xattr_find_buf(
+	struct xfs_mount	*mp,
+	xfs_fsblock_t		fsbno,
+	xfs_extlen_t		max_len,
+	bool			can_read,
+	struct xfs_buf		**bpp)
+{
+	struct xrep_bufscan	scan = {
+		.daddr		= XFS_FSB_TO_DADDR(mp, fsbno),
+		.max_sectors	= xrep_bufscan_max_sectors(mp, max_len),
+		.daddr_step	= XFS_FSB_TO_BB(mp, 1),
+	};
+	struct xfs_buf		*bp;
+
+	while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) {
+		*bpp = bp;
+		return 0;
+	}
+
+	if (!can_read) {
+		*bpp = NULL;
+		return 0;
+	}
+
+	return xfs_buf_read(mp->m_ddev_targp, scan.daddr, XFS_FSB_TO_BB(mp, 1),
+			XBF_TRYLOCK, bpp, NULL);
+}
+
+/*
+ * Deal with a buffer that we found during our walk of the attr fork.
+ *
+ * Attribute leaf and node blocks are simple -- they're a single block, so we
+ * can walk them one at a time and we never have to worry about discontiguous
+ * multiblock buffers like we do for directories.
+ *
+ * Unfortunately, remote attr blocks add a lot of complexity here.  Each disk
+ * block is totally self contained, in the sense that the v5 header provides no
+ * indication that there could be more data in the next block.  The incore
+ * buffers can span multiple blocks, though they never cross extent records.
+ * However, they don't necessarily start or end on an extent record boundary.
+ * Therefore, we need a special buffer find function to walk the buffer cache
+ * for us.
+ *
+ * The caller must hold the ILOCK on the file being repaired.  We use
+ * XBF_TRYLOCK here to skip any locked buffer on the assumption that we don't
+ * own the block and don't want to hang the system on a potentially garbage
+ * buffer.
+ */
+STATIC int
+xrep_xattr_recover_block(
+	struct xrep_xattr	*rx,
+	xfs_dablk_t		dabno,
+	xfs_fsblock_t		fsbno,
+	xfs_extlen_t		max_len,
+	xfs_extlen_t		*actual_len)
+{
+	struct xfs_da_blkinfo	*info;
+	struct xfs_buf		*bp;
+	int			error;
+
+	error = xrep_xattr_find_buf(rx->sc->mp, fsbno, max_len, true, &bp);
+	if (error)
+		return error;
+	info = bp->b_addr;
+	*actual_len = XFS_BB_TO_FSB(rx->sc->mp, bp->b_length);
+
+	trace_xrep_xattr_recover_leafblock(rx->sc->ip, dabno,
+			be16_to_cpu(info->magic));
+
+	/*
+	 * If the buffer has the right magic number for an attr leaf block and
+	 * passes a structure check (we don't care about checksums), salvage
+	 * as much as we can from the block.
+	 */
+	if (info->magic == cpu_to_be16(XFS_ATTR3_LEAF_MAGIC) &&
+	    xrep_buf_verify_struct(bp, &xfs_attr3_leaf_buf_ops) &&
+	    xfs_attr3_leaf_header_check(bp, rx->sc->ip->i_ino) == NULL)
+		error = xrep_xattr_recover_leaf(rx, bp);
+
+	/*
+	 * If the buffer didn't already have buffer ops set, it was read in by
+	 * the _find_buf function and could very well be /part/ of a multiblock
+	 * remote block.  Mark it stale so that it doesn't hang around in
+	 * memory to cause problems.
+	 */
+	if (bp->b_ops == NULL)
+		xfs_buf_stale(bp);
+
+	xfs_buf_relse(bp);
+	return error;
+}
+
+/* Insert one xattr key/value. */
+STATIC int
+xrep_xattr_insert_rec(
+	struct xrep_xattr		*rx,
+	const struct xrep_xattr_key	*key)
+{
+	struct xfs_da_args		args = {
+		.dp			= rx->sc->tempip,
+		.attr_filter		= key->flags,
+		.attr_flags		= XATTR_CREATE,
+		.namelen		= key->namelen,
+		.valuelen		= key->valuelen,
+		.op_flags		= XFS_DA_OP_NOTIME,
+		.owner			= rx->sc->ip->i_ino,
+	};
+	struct xchk_xattr_buf		*ab = rx->sc->buf;
+	int				error;
+
+	/*
+	 * Grab pointers to the scrub buffer so that we can use them to insert
+	 * attrs into the temp file.
+	 */
+	args.name = ab->name;
+	args.value = ab->value;
+
+	/*
+	 * The attribute name is stored near the end of the in-core buffer,
+	 * though we reserve one more byte to ensure null termination.
+	 */
+	ab->name[XATTR_NAME_MAX] = 0;
+
+	error = xfblob_load(rx->xattr_blobs, key->name_cookie, ab->name,
+			key->namelen);
+	if (error)
+		return error;
+
+	error = xfblob_free(rx->xattr_blobs, key->name_cookie);
+	if (error)
+		return error;
+
+	error = xfblob_load(rx->xattr_blobs, key->value_cookie, args.value,
+			key->valuelen);
+	if (error)
+		return error;
+
+	error = xfblob_free(rx->xattr_blobs, key->value_cookie);
+	if (error)
+		return error;
+
+	ab->name[key->namelen] = 0;
+
+	trace_xrep_xattr_insert_rec(rx->sc->tempip, key->flags, ab->name,
+			key->namelen, key->valuelen);
+
+	/*
+	 * xfs_attr_set creates and commits its own transaction.  If the attr
+	 * already exists, we'll just drop it during the rebuild.
+	 */
+	error = xfs_attr_set(&args);
+	if (error == -EEXIST)
+		error = 0;
+
+	return error;
+}
+
+/*
+ * Periodically flush salvaged attributes to the temporary file.  This is done
+ * to reduce the memory requirements of the xattr rebuild because files can
+ * contain millions of attributes.
+ */
+STATIC int
+xrep_xattr_flush_salvaged(
+	struct xrep_xattr	*rx)
+{
+	xfarray_idx_t		array_cur;
+	int			error;
+
+	/*
+	 * Entering this function, the scrub context has a reference to the
+	 * inode being repaired, the temporary file, and a scrub transaction
+	 * that we use during xattr salvaging to avoid livelocking if there
+	 * are cycles in the xattr structures.  We hold ILOCK_EXCL on the
+	 * inode being repaired, though it is not ijoined to the scrub
+	 * transaction.
+	 *
+	 * To constrain kernel memory use, we occasionally flush salvaged
+	 * xattrs from the xfarray and xfblob structures into the temporary
+	 * file in preparation for swapping the xattr structures at the end.
+	 * Updating the temporary file requires a transaction, so we commit the
+	 * scrub transaction and drop the ILOCK so that xfs_attr_set can
+	 * allocate whatever transaction it wants.
+	 *
+	 * We still hold IOLOCK_EXCL on the inode being repaired, which
+	 * prevents anyone from accessing the damaged xattr data while we
+	 * repair it.
+	 */
+	error = xrep_trans_commit(rx->sc);
+	if (error)
+		return error;
+	xchk_iunlock(rx->sc, XFS_ILOCK_EXCL);
+
+	/*
+	 * Take the IOLOCK of the temporary file while we modify xattrs.  This
+	 * isn't strictly required because the temporary file is never revealed
+	 * to userspace, but we follow the same locking rules.
+	 */
+	while (!xrep_tempfile_iolock_nowait(rx->sc)) {
+		if (xchk_should_terminate(rx->sc, &error))
+			return error;
+		delay(1);
+	}
+
+	/* Add all the salvaged attrs to the temporary file. */
+	foreach_xfarray_idx(rx->xattr_records, array_cur) {
+		struct xrep_xattr_key	key;
+
+		error = xfarray_load(rx->xattr_records, array_cur, &key);
+		if (error)
+			return error;
+
+		error = xrep_xattr_insert_rec(rx, &key);
+		if (error)
+			return error;
+	}
+	xrep_tempfile_iounlock(rx->sc);
+
+	/* Empty out both arrays now that we've added the entries. */
+	xfarray_truncate(rx->xattr_records);
+	xfblob_truncate(rx->xattr_blobs);
+
+	/* Recreate the salvage transaction and relock the inode. */
+	error = xchk_trans_alloc(rx->sc, 0);
+	if (error)
+		return error;
+	xchk_ilock(rx->sc, XFS_ILOCK_EXCL);
+	return 0;
+}
+
+/*
+ * Decide if we need to flush the xattrs we've salvaged to disk to constrain
+ * memory usage.
+ */
+static int
+xrep_xattr_need_flush(
+	struct xrep_xattr	*rx,
+	bool			*need)
+{
+	long long		key_bytes, value_bytes;
+
+	key_bytes = xfarray_bytes(rx->xattr_records);
+	if (key_bytes < 0)
+		return key_bytes;
+
+	value_bytes = xfblob_bytes(rx->xattr_blobs);
+	if (value_bytes < 0)
+		return value_bytes;
+
+	*need = key_bytes + value_bytes >= XREP_XATTR_SALVAGE_BYTES;
+	return 0;
+}
+
+/* Extract as many attribute keys and values as we can. */
+STATIC int
+xrep_xattr_recover(
+	struct xrep_xattr	*rx)
+{
+	struct xfs_bmbt_irec	got;
+	struct xfs_scrub	*sc = rx->sc;
+	struct xfs_da_geometry	*geo = sc->mp->m_attr_geo;
+	xfs_fileoff_t		offset;
+	xfs_extlen_t		len;
+	xfs_dablk_t		dabno;
+	int			nmap;
+	int			error;
+
+	/*
+	 * Iterate each xattr leaf block in the attr fork to scan them for any
+	 * attributes that we might salvage.
+	 */
+	for (offset = 0;
+	     offset < XFS_MAX_FILEOFF;
+	     offset = got.br_startoff + got.br_blockcount) {
+		nmap = 1;
+		error = xfs_bmapi_read(sc->ip, offset, XFS_MAX_FILEOFF - offset,
+				&got, &nmap, XFS_BMAPI_ATTRFORK);
+		if (error)
+			return error;
+		if (nmap != 1)
+			return -EFSCORRUPTED;
+		if (!xfs_bmap_is_written_extent(&got))
+			continue;
+
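+		/* Walk this mapping, looking for salvageable attr data. */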
+		for (dabno = round_up(got.br_startoff, geo->fsbcount);
+		     dabno < got.br_startoff + got.br_blockcount;
+		     dabno += len) {
+			xfs_fileoff_t	curr_offset = dabno - got.br_startoff;
+			xfs_extlen_t	maxlen;
+			bool		need_flush = false;
+
+			if (xchk_should_terminate(rx->sc, &error))
+				return error;
+
+			maxlen = min_t(xfs_filblks_t, INT_MAX,
+					got.br_blockcount - curr_offset);
+			error = xrep_xattr_recover_block(rx, dabno,
+					curr_offset + got.br_startblock,
+					maxlen, &len);
+			if (error)
+				return error;
+
+			error = xrep_xattr_need_flush(rx, &need_flush);
+			if (error)
+				return error;
+
+			if (need_flush) {
+				error = xrep_xattr_flush_salvaged(rx);
+				if (error)
+					return error;
+			}
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Reset the extended attribute fork to a state where we can start re-adding
+ * the salvaged attributes.
+ */
+STATIC int
+xrep_xattr_fork_remove(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip)
+{
+	struct xfs_attr_sf_hdr	*hdr;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK);
+
+	/*
+	 * If the data fork is in btree format, we can't change di_forkoff
+	 * because we could run afoul of the rule that the data fork isn't
+	 * supposed to be in btree format if there's enough space in the fork
+	 * that it could have used extents format.  Instead, reinitialize the
+	 * attr fork to have a shortform structure with zero attributes.
+	 */
+	if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE) {
+		ifp->if_format = XFS_DINODE_FMT_LOCAL;
+		xfs_idata_realloc(ip, (int)sizeof(*hdr) - ifp->if_bytes,
+				XFS_ATTR_FORK);
+		hdr = (struct xfs_attr_sf_hdr *)ifp->if_u1.if_data;
+		hdr->count = 0;
+		hdr->totsize = cpu_to_be16(sizeof(*hdr));
+		xfs_trans_log_inode(sc->tp, ip,
+				XFS_ILOG_CORE | XFS_ILOG_ADATA);
+		return 0;
+	}
+
+	/* If we still have attr fork extents, something's wrong. */
+	if (ifp->if_nextents != 0) {
+		struct xfs_iext_cursor	icur;
+		struct xfs_bmbt_irec	irec;
+		unsigned int		i = 0;
+
+		xfs_emerg(sc->mp,
+	"inode 0x%llx attr fork still has %llu attr extents, format %d?!",
+				ip->i_ino, ifp->if_nextents, ifp->if_format);
+		for_each_xfs_iext(ifp, &icur, &irec) {
+			xfs_err(sc->mp,
+	"[%u]: startoff %llu startblock %llu blockcount %llu state %u",
+					i++, irec.br_startoff,
+					irec.br_startblock, irec.br_blockcount,
+					irec.br_state);
+		}
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	xfs_attr_fork_remove(ip, sc->tp);
+	return 0;
+}
+
+/*
+ * Free all the attribute fork blocks and delete the fork.  The caller must
+ * ILOCK the file being repaired and ijoin it to the transaction.  This
+ * function returns with the inode joined to a clean scrub transaction.
+ */
+int
+xrep_xattr_reset_fork(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	/* Unmap all the attr blocks. */
+	if (xfs_ifork_has_extents(&sc->ip->i_af)) {
+		error = xrep_reap_ifork(sc, sc->ip, XFS_ATTR_FORK);
+		if (error)
+			return error;
+	}
+
+	trace_xrep_xattr_reset_fork(sc->ip, sc->ip);
+
+	error = xrep_xattr_fork_remove(sc, sc->ip);
+	if (error)
+		return error;
+
+	return xfs_trans_roll_inode(&sc->tp, sc->ip);
+}
+
+/*
+ * Find all the extended attributes for this inode by scraping them out of the
+ * attribute key blocks by hand, and flushing them into the temp file.
+ */
+STATIC int
+xrep_xattr_find_attributes(
+	struct xrep_xattr	*rx)
+{
+	struct xfs_inode	*ip = rx->sc->ip;
+	int			error;
+
+	/* Short format xattrs are easy! */
+	if (rx->sc->ip->i_af.if_format == XFS_DINODE_FMT_LOCAL) {
+		error = xrep_xattr_recover_sf(rx);
+		if (error)
+			return error;
+
+		return xrep_xattr_flush_salvaged(rx);
+	}
+
+	/*
+	 * For non-inline xattr structures, the salvage function scans the
+	 * buffer cache looking for potential attr leaf blocks.  The scan
+	 * requires the ability to lock any buffer found and runs independently
+	 * of any transaction <-> buffer item <-> buffer linkage.  Therefore,
+	 * roll the transaction to ensure there are no buffers joined.  We hold
+	 * the ILOCK independently of the transaction.
+	 */
+	error = xfs_trans_roll(&rx->sc->tp);
+	if (error)
+		return error;
+
+	error = xfs_iread_extents(rx->sc->tp, ip, XFS_ATTR_FORK);
+	if (error)
+		return error;
+
+	error = xrep_xattr_recover(rx);
+	if (error)
+		return error;
+
+	return xrep_xattr_flush_salvaged(rx);
+}
+
+/*
+ * Prepare both inodes' attribute forks for extent swapping.  Promote the
+ * tempfile from short format to leaf format, and if the file being repaired
+ * has a short format attr fork, turn it into an empty extent list.
+ */
+STATIC int
+xrep_xattr_swap_prep(
+	struct xfs_scrub	*sc,
+	bool			temp_local,
+	bool			ip_local)
+{
+	int			error;
+
+	/*
+	 * If the tempfile's attributes are in shortform format, convert that
+	 * to a single leaf extent so that we can use the atomic extent swap.
+	 */
+	if (temp_local) {
+		struct xfs_da_args	args = {
+			.dp		= sc->tempip,
+			.geo		= sc->mp->m_attr_geo,
+			.whichfork	= XFS_ATTR_FORK,
+			.trans		= sc->tp,
+			.total		= 1,
+			.owner		= sc->ip->i_ino,
+		};
+
+		error = xfs_attr_shortform_to_leaf(&args);
+		if (error)
+			return error;
+
+		/*
+		 * Roll the deferred log items to get us back to a clean
+		 * transaction.
+		 */
+		error = xfs_defer_finish(&sc->tp);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If the file being repaired had a shortform attribute fork, convert
+	 * that to an empty extent list in preparation for the atomic extent
+	 * swap.
+	 */
+	if (ip_local) {
+		struct xfs_ifork	*ifp;
+
+		ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK);
+
+		xfs_idestroy_fork(ifp);
+		ifp->if_format = XFS_DINODE_FMT_EXTENTS;
+		ifp->if_nextents = 0;
+		ifp->if_bytes = 0;
+		ifp->if_u1.if_root = NULL;
+		ifp->if_height = 0;
+
+		xfs_trans_log_inode(sc->tp, sc->ip,
+				XFS_ILOG_CORE | XFS_ILOG_ADATA);
+	}
+
+	return 0;
+}
+
+/* Swap the temporary file's attribute fork with the one being repaired. */
+STATIC int
+xrep_xattr_swap(
+	struct xrep_xattr	*rx)
+{
+	struct xfs_scrub	*sc = rx->sc;
+	bool			ip_local, temp_local;
+	int			error = 0;
+
+	/*
+	 * Take the IOLOCK on the temporary file so that we can run xattr
+	 * operations with the same locks held as we would for a normal file.
+	 */
+	while (!xrep_tempfile_iolock_nowait(rx->sc)) {
+		if (xchk_should_terminate(rx->sc, &error))
+			return error;
+		delay(1);
+	}
+
+	error = xrep_tempswap_trans_alloc(rx->sc, XFS_ATTR_FORK, &rx->tx);
+	if (error)
+		return error;
+
+	ip_local = sc->ip->i_af.if_format == XFS_DINODE_FMT_LOCAL;
+	temp_local = sc->tempip->i_af.if_format == XFS_DINODE_FMT_LOCAL;
+
+	/*
+	 * If both files have a local format attr fork and the rebuilt
+	 * xattr data would fit in the repaired file's attr fork, just copy
+	 * the contents from the tempfile and declare ourselves done.
+	 */
+	if (ip_local && temp_local) {
+		int	forkoff;
+		int	newsize;
+
+		newsize = xfs_attr_sf_totsize(sc->tempip);
+		forkoff = xfs_attr_shortform_bytesfit(sc->ip, newsize);
+		if (forkoff > 0) {
+			sc->ip->i_forkoff = forkoff;
+			xrep_tempfile_copyout_local(sc, XFS_ATTR_FORK);
+			return 0;
+		}
+	}
+
+	/* Otherwise, make sure both attr forks are in block-mapping mode. */
+	error = xrep_xattr_swap_prep(sc, temp_local, ip_local);
+	if (error)
+		return error;
+
+	return xrep_tempswap_contents(sc, &rx->tx);
+}
+
+/*
+ * Swap the new extended attribute data (which we created in the tempfile) into
+ * the file being repaired.
+ */
+STATIC int
+xrep_xattr_rebuild_tree(
+	struct xrep_xattr	*rx)
+{
+	struct xfs_scrub	*sc = rx->sc;
+	int			error;
+
+	/*
+	 * If we didn't find any attributes to salvage, repair the file by
+	 * zapping the attr fork.
+	 */
+	if (rx->attrs_found == 0) {
+		xfs_trans_ijoin(sc->tp, sc->ip, 0);
+		return xrep_xattr_reset_fork(sc);
+	}
+
+	trace_xrep_xattr_rebuild_tree(sc->ip, sc->tempip);
+
+	/*
+	 * Commit the repair transaction and drop the ILOCKs so that we can use
+	 * the atomic extent swap helper functions to compute the correct
+	 * resource reservations.
+	 *
+	 * We still hold IOLOCK_EXCL (aka i_rwsem) which will prevent xattr
+	 * modifications, but there's nothing to prevent userspace from reading
+	 * the attributes until we're ready for the swap operation.  Reads will
+	 * return -EIO without shutting down the fs, so we're ok with that.
+	 */
+	error = xrep_trans_commit(sc);
+	if (error)
+		return error;
+
+	xchk_iunlock(sc, XFS_ILOCK_EXCL);
+
+	/*
+	 * Swap the tempfile's attr fork with the file being repaired.  This
+	 * recreates the transaction and takes the ILOCKs of both the file
+	 * being repaired and the temporary file.
+	 */
+	error = xrep_xattr_swap(rx);
+	if (error)
+		return error;
+
+	/*
+	 * Wipe out the attr fork of the temp file so that regular inode
+	 * inactivation won't trip over the corrupt attr fork.
+	 */
+	if (xfs_ifork_has_extents(&sc->tempip->i_af)) {
+		error = xrep_reap_ifork(sc, sc->tempip, XFS_ATTR_FORK);
+		if (error)
+			return error;
+	}
+
+	trace_xrep_xattr_reset_fork(sc->ip, sc->tempip);
+
+	error = xrep_xattr_fork_remove(sc, sc->tempip);
+	if (error)
+		return error;
+
+	return xrep_tempfile_roll_trans(sc);
+}
+
+/*
+ * Repair the extended attribute metadata.
+ *
+ * XXX: Remote attribute value buffers encompass the entire (up to 64k) buffer.
+ * The buffer cache in XFS can't handle aliased multiblock buffers, so this
+ * might misbehave if the attr fork is crosslinked with other filesystem
+ * metadata.
+ */
+int
+xrep_xattr(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_xattr	*rx;
+	int			max_len;
+	int			error;
+
+	if (!xfs_inode_hasattr(sc->ip))
+		return -ENOENT;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_has_rmapbt(sc->mp))
+		return -EOPNOTSUPP;
+
+	rx = kzalloc(sizeof(struct xrep_xattr), XCHK_GFP_FLAGS);
+	if (!rx)
+		return -ENOMEM;
+	rx->sc = sc;
+
+	/*
+	 * Make sure we have enough space to handle salvaging and spilling
+	 * every possible local attr value, since we only realloc the buffer
+	 * for remote values.
+	 */
+	max_len = xfs_attr_leaf_entsize_local_max(sc->mp->m_attr_geo->blksize);
+	error = xchk_setup_xattr_buf(rx->sc, max_len);
+	if (error == -ENOMEM)
+		error = -EDEADLOCK;
+	if (error)
+		goto out_rx;
+
+	/* Set up some storage */
+	error = xfarray_create(sc->mp, "xattr keys", 0,
+			sizeof(struct xrep_xattr_key), &rx->xattr_records);
+	if (error)
+		goto out_rx;
+
+	error = xfblob_create(sc->mp, "xattr values", &rx->xattr_blobs);
+	if (error)
+		goto out_keys;
+
+	ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL);
+
+	/*
+	 * Collect extended attributes by parsing raw blocks to salvage
+	 * whatever we can into the tempfile.  When we're done, free the
+	 * staging memory before swapping the xattr structures to reduce memory
+	 * usage.
+	 */
+	error = xrep_xattr_find_attributes(rx);
+	if (error)
+		goto out_values;
+
+	xfblob_destroy(rx->xattr_blobs);
+	xfarray_destroy(rx->xattr_records);
+	rx->xattr_blobs = NULL;
+	rx->xattr_records = NULL;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		goto out_rx;
+
+	/* Swap in the good contents. */
+	error = xrep_xattr_rebuild_tree(rx);
+	if (error)
+		goto out_values;
+
+	/* Invalidate ACLs now that we've reloaded all the xattrs. */
+	xfs_forget_acl(VFS_I(sc->ip), SGI_ACL_FILE);
+	xfs_forget_acl(VFS_I(sc->ip), SGI_ACL_DEFAULT);
+
+out_values:
+	if (rx->xattr_blobs)
+		xfblob_destroy(rx->xattr_blobs);
+out_keys:
+	if (rx->xattr_records)
+		xfarray_destroy(rx->xattr_records);
+out_rx:
+	kfree(rx);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index da6bff1fcd86..e5e5dbdce7c4 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -31,6 +31,9 @@
 #include "xfs_error.h"
 #include "xfs_reflink.h"
 #include "xfs_health.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -1118,6 +1121,17 @@ xrep_metadata_inode_forks(
 			return error;
 	}
 
+	/* Clear the attr forks since metadata shouldn't have that. */
+	if (xfs_inode_hasattr(sc->ip)) {
+		if (!dirty) {
+			dirty = true;
+			xfs_trans_ijoin(sc->tp, sc->ip, 0);
+		}
+		error = xrep_xattr_reset_fork(sc);
+		if (error)
+			return error;
+	}
+
 	/*
 	 * If we modified the inode, roll the transaction but don't rejoin the
 	 * inode to the new transaction because xrep_bmap_data can do that.
@@ -1189,3 +1203,34 @@ xrep_trans_cancel_hook_dummy(
 	current->journal_info = *cookiep;
 	*cookiep = NULL;
 }
+
+/*
+ * See if this buffer can pass the given ->verify_struct() function.
+ *
+ * If the buffer already has ops attached and they're not the ones that were
+ * passed in, we reject the buffer.  Otherwise, we perform the structure test
+ * (note that we do not check CRCs) and return the outcome of the test.  The
+ * buffer ops and error state are left unchanged.
+ */
+bool
+xrep_buf_verify_struct(
+	struct xfs_buf			*bp,
+	const struct xfs_buf_ops	*ops)
+{
+	const struct xfs_buf_ops	*old_ops = bp->b_ops;
+	xfs_failaddr_t			fa;
+	int				old_error;
+
+	if (old_ops) {
+		if (old_ops != ops)
+			return false;
+	}
+
+	old_error = bp->b_error;
+	bp->b_ops = ops;
+	fa = bp->b_ops->verify_struct(bp);
+	bp->b_ops = old_ops;
+	bp->b_error = old_error;
+
+	return fa == NULL;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 086e8e739264..2a79d7a5ba7e 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -82,6 +82,9 @@ int xrep_setup_ag_rmapbt(struct xfs_scrub *sc);
 int xrep_setup_ag_refcountbt(struct xfs_scrub *sc);
 int xrep_setup_rtsummary(struct xfs_scrub *sc, unsigned int *resblks,
 		size_t *bufsize);
+int xrep_setup_xattr(struct xfs_scrub *sc);
+
+int xrep_xattr_reset_fork(struct xfs_scrub *sc);
 
 /* Repair setup functions */
 int xrep_setup_ag_allocbt(struct xfs_scrub *sc);
@@ -116,6 +119,7 @@ int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_bmap_cow(struct xfs_scrub *sc);
 int xrep_nlinks(struct xfs_scrub *sc);
 int xrep_fscounters(struct xfs_scrub *sc);
+int xrep_xattr(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -140,6 +144,8 @@ int xrep_trans_alloc_hook_dummy(struct xfs_mount *mp, void **cookiep,
 		struct xfs_trans **tpp);
 void xrep_trans_cancel_hook_dummy(void **cookiep, struct xfs_trans *tp);
 
+bool xrep_buf_verify_struct(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
+
 #else
 
 #define xrep_ino_dqattach(sc)	(0)
@@ -182,6 +188,7 @@ xrep_setup_nothing(
 #define xrep_setup_ag_allocbt		xrep_setup_nothing
 #define xrep_setup_ag_rmapbt		xrep_setup_nothing
 #define xrep_setup_ag_refcountbt	xrep_setup_nothing
+#define xrep_setup_xattr		xrep_setup_nothing
 
 #define xrep_setup_inode(sc, imap)	((void)0)
 
@@ -221,6 +228,7 @@ xrep_setup_rtsummary(
 #define xrep_nlinks			xrep_notsupported
 #define xrep_fscounters			xrep_notsupported
 #define xrep_rtsummary			xrep_notsupported
+#define xrep_xattr			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index a9030603b424..0ec23fc650be 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -333,7 +333,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_xattr,
 		.scrub	= xchk_xattr,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_xattr,
 	},
 	[XFS_SCRUB_TYPE_SYMLINK] = {	/* symbolic link */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index aebfaef07e2d..8f925889d51a 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2293,6 +2293,111 @@ TRACE_EVENT(xreap_bmapi_binval_scan,
 		  __entry->scan_blocks)
 );
 
+TRACE_EVENT(xrep_xattr_recover_leafblock,
+	TP_PROTO(struct xfs_inode *ip, xfs_dablk_t dabno, uint16_t magic),
+	TP_ARGS(ip, dabno, magic),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_dablk_t, dabno)
+		__field(uint16_t, magic)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->dabno = dabno;
+		__entry->magic = magic;
+	),
+	TP_printk("dev %d:%d ino 0x%llx dablk 0x%x magic 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->dabno,
+		  __entry->magic)
+);
+
+TRACE_EVENT(xrep_xattr_salvage_key,
+	TP_PROTO(struct xfs_inode *ip, unsigned int flags, char *name,
+		 unsigned int namelen, unsigned int valuelen),
+	TP_ARGS(ip, flags, name, namelen, valuelen),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(unsigned int, flags)
+		__field(unsigned int, namelen)
+		__dynamic_array(char, name, namelen)
+		__field(unsigned int, valuelen)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->flags = flags;
+		__entry->namelen = namelen;
+		memcpy(__get_str(name), name, namelen);
+		__entry->valuelen = valuelen;
+	),
+	TP_printk("dev %d:%d ino 0x%llx flags %s name '%.*s' valuelen 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		   __print_flags(__entry->flags, "|", XFS_ATTR_NAMESPACE_STR),
+		  __entry->namelen,
+		  __get_str(name),
+		  __entry->valuelen)
+);
+
+TRACE_EVENT(xrep_xattr_insert_rec,
+	TP_PROTO(struct xfs_inode *ip, unsigned int flags, char *name,
+		 unsigned int namelen, unsigned int valuelen),
+	TP_ARGS(ip, flags, name, namelen, valuelen),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(unsigned int, flags)
+		__field(unsigned int, namelen)
+		__dynamic_array(char, name, namelen)
+		__field(unsigned int, valuelen)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->flags = flags;
+		__entry->namelen = namelen;
+		memcpy(__get_str(name), name, namelen);
+		__entry->valuelen = valuelen;
+	),
+	TP_printk("dev %d:%d ino 0x%llx flags %s name '%.*s' valuelen 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		   __print_flags(__entry->flags, "|", XFS_ATTR_NAMESPACE_STR),
+		  __entry->namelen,
+		  __get_str(name),
+		  __entry->valuelen)
+);
+
+DECLARE_EVENT_CLASS(xrep_xattr_class,
+	TP_PROTO(struct xfs_inode *ip, struct xfs_inode *arg_ip),
+	TP_ARGS(ip, arg_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_ino_t, src_ino)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->src_ino = arg_ip->i_ino;
+	),
+	TP_printk("dev %d:%d ino 0x%llx src 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->src_ino)
+);
+#define DEFINE_XREP_XATTR_CLASS(name) \
+DEFINE_EVENT(xrep_xattr_class, name, \
+	TP_PROTO(struct xfs_inode *ip, struct xfs_inode *arg_ip), \
+	TP_ARGS(ip, arg_ip))
+DEFINE_XREP_XATTR_CLASS(xrep_xattr_rebuild_tree);
+DEFINE_XREP_XATTR_CLASS(xrep_xattr_reset_fork);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 
diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c
index ce1365144209..f5af17fff40d 100644
--- a/fs/xfs/scrub/xfarray.c
+++ b/fs/xfs/scrub/xfarray.c
@@ -1082,3 +1082,27 @@ xfarray_sort(
 	kvfree(si);
 	return error;
 }
+
+/* How many bytes is this array consuming? */
+long long
+xfarray_bytes(
+	struct xfarray		*array)
+{
+	struct xfile_stat	statbuf;
+	int			error;
+
+	error = xfile_stat(array->xfile, &statbuf);
+	if (error)
+		return error;
+
+	return statbuf.bytes;
+}
+
+/* Empty the entire array. */
+void
+xfarray_truncate(
+	struct xfarray	*array)
+{
+	xfile_discard(array->xfile, 0, MAX_LFS_FILESIZE);
+	array->nr = 0;
+}
diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h
index 44c7e7083881..7f4bc4ad28ad 100644
--- a/fs/xfs/scrub/xfarray.h
+++ b/fs/xfs/scrub/xfarray.h
@@ -45,6 +45,8 @@ int xfarray_unset(struct xfarray *array, xfarray_idx_t idx);
 int xfarray_store(struct xfarray *array, xfarray_idx_t idx, const void *ptr);
 int xfarray_store_anywhere(struct xfarray *array, const void *ptr);
 bool xfarray_element_is_null(struct xfarray *array, const void *ptr);
+void xfarray_truncate(struct xfarray *array);
+long long xfarray_bytes(struct xfarray *array);
 
 /*
  * Load an array element, but zero the buffer if there's no data because we
diff --git a/fs/xfs/scrub/xfblob.c b/fs/xfs/scrub/xfblob.c
index c3a646cad5ed..5c1a4e0616c0 100644
--- a/fs/xfs/scrub/xfblob.c
+++ b/fs/xfs/scrub/xfblob.c
@@ -150,3 +150,27 @@ xfblob_free(
 	xfile_discard(blob->xfile, cookie, sizeof(key) + key.xb_size);
 	return 0;
 }
+
+/* How many bytes is this blob storage object consuming? */
+long long
+xfblob_bytes(
+	struct xfblob		*blob)
+{
+	struct xfile_stat	statbuf;
+	int			error;
+
+	error = xfile_stat(blob->xfile, &statbuf);
+	if (error)
+		return error;
+
+	return statbuf.bytes;
+}
+
+/* Drop all the blobs. */
+void
+xfblob_truncate(
+	struct xfblob	*blob)
+{
+	xfile_discard(blob->xfile, 0, MAX_LFS_FILESIZE);
+	blob->last_offset = 0;
+}
diff --git a/fs/xfs/scrub/xfblob.h b/fs/xfs/scrub/xfblob.h
index 2c1810b4a4eb..73051c8616c6 100644
--- a/fs/xfs/scrub/xfblob.h
+++ b/fs/xfs/scrub/xfblob.h
@@ -21,5 +21,7 @@ int xfblob_load(struct xfblob *blob, xfblob_cookie cookie, void *ptr,
 int xfblob_store(struct xfblob *blob, xfblob_cookie *cookie, void *ptr,
 		uint32_t size);
 int xfblob_free(struct xfblob *blob, xfblob_cookie cookie);
+long long xfblob_bytes(struct xfblob *blob);
+void xfblob_truncate(struct xfblob *blob);
 
 #endif /* __XFS_SCRUB_XFBLOB_H__ */
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 410db46e7935..b65dab243130 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -482,6 +482,9 @@ _xfs_buf_obj_cmp(
 		 * it stale has not yet committed. i.e. we are
 		 * reallocating a busy extent. Skip this buffer and
 		 * continue searching for an exact match.
+		 *
+		 * Note: If we're scanning for incore buffers to stale, don't
+		 * complain if we find non-stale buffers.
 		 */
 		if (!(map->bm_flags & XBM_IGNORE_LENGTH_MISMATCH))
 			ASSERT(bp->b_flags & XBF_STALE);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index da6b7461f4d0..147dbdf73d92 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -31,6 +31,8 @@
  * pos: file offset, in bytes
  * bytecount: number of bytes
  *
+ * dablk: directory or xattr block offset, in filesystem blocks
+ *
  * disize: ondisk file size, in bytes
  * isize: incore file size, in bytes
  *


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs: scrub should set preen if attr leaf has holes
  2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 2/5] xfs: use atomic extent swapping to fix user file fork data Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/attr.c    |    2 ++
 fs/xfs/scrub/dabtree.c |   16 ++++++++++++++++
 fs/xfs/scrub/dabtree.h |    1 +
 fs/xfs/scrub/trace.h   |    1 +
 4 files changed, 20 insertions(+)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 1401525074a3..0fb9344c671b 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -420,6 +420,8 @@ xchk_xattr_block(
 		xchk_da_set_corrupt(ds, level);
 	if (!xchk_xattr_set_map(ds->sc, ab->usedmap, 0, hdrsize))
 		xchk_da_set_corrupt(ds, level);
+	if (leafhdr.holes)
+		xchk_da_set_preen(ds, level);
 
 	if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		goto out;
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index e60b4cc96c54..764f7dfd78b5 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -78,6 +78,22 @@ xchk_da_set_corrupt(
 			__return_address);
 }
 
+/* Flag a da btree node in need of optimization. */
+void
+xchk_da_set_preen(
+	struct xchk_da_btree	*ds,
+	int			level)
+{
+	struct xfs_scrub	*sc = ds->sc;
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	trace_xchk_fblock_preen(sc, ds->dargs.whichfork,
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
+}
+
+/* Find an entry at a certain level in a da btree. */
 static struct xfs_da_node_entry *
 xchk_da_btree_node_entry(
 	struct xchk_da_btree		*ds,
diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h
index 1f3515c6d5a8..8066fa00dc1b 100644
--- a/fs/xfs/scrub/dabtree.h
+++ b/fs/xfs/scrub/dabtree.h
@@ -35,6 +35,7 @@ bool xchk_da_process_error(struct xchk_da_btree *ds, int level, int *error);
 
 /* Check for da btree corruption. */
 void xchk_da_set_corrupt(struct xchk_da_btree *ds, int level);
+void xchk_da_set_preen(struct xchk_da_btree *ds, int level);
 
 int xchk_da_btree_hash(struct xchk_da_btree *ds, int level, __be32 *hashp);
 int xchk_da_btree(struct xfs_scrub *sc, int whichfork,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 8f925889d51a..fa67a9451820 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -366,6 +366,7 @@ DEFINE_EVENT(xchk_fblock_error_class, name, \
 
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_error);
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_warning);
+DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_preen);
 
 #ifdef CONFIG_XFS_QUOTA
 TRACE_EVENT(xchk_qcheck_error,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] xfs: flag empty xattr leaf blocks for optimization
  2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/5] xfs: create a blob array data structure Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/5] xfs: repair extended attributes Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Empty xattr leaf blocks at offset zero are a waste of space but
otherwise harmless.  If we encounter one, flag it as an opportunity for
optimization.

If we encounter empty attr leaf blocks anywhere else in the attr fork,
that's corruption.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c    |   11 +++++++++++
 1 file changed, 11 insertions(+)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 0fb9344c671b..a1585862c625 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -412,6 +412,17 @@ xchk_xattr_block(
 	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
 	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
 
+	/*
+	 * Empty xattr leaf blocks mapped at block 0 are probably a byproduct
+	 * of a race between setxattr and a log shutdown.  Empty leaf blocks
+	 * anywhere else in the attr fork are corruption.
+	 */
+	if (leafhdr.count == 0) {
+		if (blk->blkno == 0)
+			xchk_da_set_preen(ds, level);
+		else
+			xchk_da_set_corrupt(ds, level);
+	}
 	if (leafhdr.usedbytes > mp->m_attr_geo->blksize)
 		xchk_da_set_corrupt(ds, level);
 	if (leafhdr.firstused > mp->m_attr_geo->blksize)


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/3] xfs: online repair of directories
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (28 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: online repair of parent pointers Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: move orphan files to lost and found Darrick J. Wong
                   ` (47 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series employs atomic extent swapping to enable safe reconstruction
of directory data.  For now, XFS does not support reverse directory
links (aka parent pointers), so we can only salvage the dirents of a
directory and construct a new structure.

Directory repair therefore consists of five main parts:

First, we walk the existing directory to salvage as many entries as we
can, by adding them as new directory entries to the repair temp dir.

Second, we validate the parent pointer found in the directory.  If one
was not found, we scan the entire filesystem looking for a potential
parent.

Third, we use atomic extent swaps to exchange the entire data fork
between the two directories.

Fourth, we reap the old directory blocks as carefully as we can.

Fifth, to wrap up the directory repair code, we add to the regular
filesystem the ability to free all the data fork blocks in a directory.
This does not change anything with normal directories, since they must
still unlink and shrink one entry at a time.  However, this will
facilitate freeing of partially-inactivated temporary directories during
log recovery.
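
As an outline, one directory repair pass looks roughly like the sketch
below.  The helper names here are illustrative (they are not the actual
function names in the patches), and all of the locking, transaction, and
error handling details are left out:

	/* Illustrative sketch only; see the patches for the real code. */
	int xrep_directory_outline(struct xfs_scrub *sc)
	{
		salvage_dirents_into_tempdir(sc);	/* part 1 */
		validate_or_find_parent(sc);		/* part 2 */
		swap_data_forks_atomically(sc);		/* part 3 */
		reap_old_dir_blocks(sc);		/* part 4 */
		/* part 5: inactivation frees leftover temp dir blocks */
		return 0;
	}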

The second half of this patchset implements repairs for the dotdot
entries of directories.  For now there is only rudimentary support for
this, because there are no directory parent pointers, so the best we can
do is scanning the filesystem and the VFS dcache for answers.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-dirs

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs
---
 fs/xfs/Makefile              |    2 
 fs/xfs/scrub/dir.c           |    9 
 fs/xfs/scrub/dir_repair.c    | 1179 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/iscan.c         |    7 
 fs/xfs/scrub/iscan.h         |    3 
 fs/xfs/scrub/nlinks.c        |   23 +
 fs/xfs/scrub/nlinks_repair.c |    9 
 fs/xfs/scrub/parent.c        |    3 
 fs/xfs/scrub/parent.h        |   17 +
 fs/xfs/scrub/parent_repair.c |  461 ++++++++++++++++
 fs/xfs/scrub/repair.c        |   29 +
 fs/xfs/scrub/repair.h        |    7 
 fs/xfs/scrub/scrub.c         |    4 
 fs/xfs/scrub/tempfile.c      |   13 
 fs/xfs/scrub/tempfile.h      |    2 
 fs/xfs/scrub/trace.h         |  138 +++++
 fs/xfs/xfs_inode.c           |   51 ++
 17 files changed, 1953 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/dir_repair.c
 create mode 100644 fs/xfs/scrub/parent.h
 create mode 100644 fs/xfs/scrub/parent_repair.c


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/3] xfs: online repair of directories
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of directories Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: online repair of parent pointers Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfs: ask the dentry cache if it knows the parent of a directory Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If a directory looks like it's in bad shape, try to sift through the
rubble to find whatever directory entries we can, scan the directory
tree for the parent (if needed), stage the new directory contents in a
temporary file and use the atomic extent swapping mechanism to commit
the results in bulk.  As a side effect of this patch, directory
inactivation will be able to purge any leftover dir blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/dir.c           |    9 
 fs/xfs/scrub/dir_repair.c    | 1152 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/nlinks.c        |   23 +
 fs/xfs/scrub/nlinks_repair.c |    9 
 fs/xfs/scrub/repair.c        |   29 +
 fs/xfs/scrub/repair.h        |    5 
 fs/xfs/scrub/scrub.c         |    2 
 fs/xfs/scrub/tempfile.c      |   13 
 fs/xfs/scrub/tempfile.h      |    2 
 fs/xfs/scrub/trace.h         |  135 +++++
 fs/xfs/xfs_inode.c           |   51 ++
 12 files changed, 1430 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/dir_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0ae616f25a98..43536f1b351e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -191,6 +191,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   attr_repair.o \
 				   bmap_repair.o \
 				   cow_repair.o \
+				   dir_repair.o \
 				   fscounters_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index ab6daf8549c1..218cf43cdf93 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -19,12 +19,21 @@
 #include "scrub/common.h"
 #include "scrub/dabtree.h"
 #include "scrub/readdir.h"
+#include "scrub/repair.h"
 
 /* Set us up to scrub directories. */
 int
 xchk_setup_directory(
 	struct xfs_scrub	*sc)
 {
+	int			error;
+
+	if (xchk_could_repair(sc)) {
+		error = xrep_setup_directory(sc);
+		if (error)
+			return error;
+	}
+
 	return xchk_setup_inode_contents(sc, 0);
 }
 
diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c
new file mode 100644
index 000000000000..99d8ce8528c5
--- /dev/null
+++ b/fs/xfs/scrub/dir_repair.c
@@ -0,0 +1,1152 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_bmap_util.h"
+#include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/tempfile.h"
+#include "scrub/tempswap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
+#include "scrub/xfblob.h"
+#include "scrub/readdir.h"
+#include "scrub/reap.h"
+
+/*
+ * Directory Repair
+ * ================
+ *
+ * We repair directories by reading the directory data blocks looking for
+ * directory entries.  Salvaged entries are added to a private hidden temporary
+ * dir without touching the link counts of the inodes found.  When we're done
+ * salvaging, we rewrite the directory block owners and use an atomic extent
+ * swap to commit the new directory blocks to the directory being repaired.
+ * This will disrupt readdir cursors, but there's not much else we can do.
+ */
+
+/* Directory entry to be restored in the new directory. */
+struct xrep_dirent {
+	/* Cookie for retrieval of the dirent name. */
+	xfblob_cookie		name_cookie;
+
+	/* Target inode number. */
+	xfs_ino_t		ino;
+
+	/* Hash of the dirent name. */
+	unsigned int		hash;
+
+	/* Length of the dirent name. */
+	uint8_t			namelen;
+
+	/* File type of the dirent. */
+	uint8_t			ftype;
+};
+
+struct xrep_dir {
+	struct xfs_scrub	*sc;
+
+	struct xrep_tempswap	tx;
+
+	/* Fixed-size array of xrep_dirent structures. */
+	struct xfarray		*dir_entries;
+
+	/* Blobs containing directory entry names. */
+	struct xfblob		*dir_names;
+
+	/*
+	 * This is the parent that we're going to set on the reconstructed
+	 * directory.
+	 */
+	xfs_ino_t		parent_ino;
+
+	/* nlink value of the corrected directory. */
+	xfs_nlink_t		new_nlink;
+
+	/* Preallocated args struct for performing dir operations */
+	struct xfs_da_args	args;
+
+	/* Directory entry name, plus the trailing null. */
+	char			namebuf[MAXNAMELEN];
+};
+
+/* Absorb up to 8 pages of dirents before we flush them to the temp dir. */
+#define XREP_DIR_SALVAGE_BYTES	(PAGE_SIZE * 8)
+
+/* Set up for a directory repair. */
+int
+xrep_setup_directory(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xrep_tempfile_create(sc, S_IFDIR);
+	if (error)
+		return error;
+
+	sc->buf = kvzalloc(sizeof(struct xrep_dir), XCHK_GFP_FLAGS);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/*
+ * Decide if we want to salvage this entry.  We don't bother with oversized
+ * names or the dot entry.
+ */
+STATIC int
+xrep_dir_want_salvage(
+	struct xrep_dir		*rd,
+	const char		*name,
+	int			namelen,
+	xfs_ino_t		ino)
+{
+	struct xfs_mount	*mp = rd->sc->mp;
+
+	/* No pointers to ourselves or to garbage. */
+	if (ino == rd->sc->ip->i_ino)
+		return false;
+	if (!xfs_verify_dir_ino(mp, ino))
+		return false;
+
+	/* No weird looking names or dot entries. */
+	if (namelen >= MAXNAMELEN || namelen <= 0)
+		return false;
+	if (namelen == 1 && name[0] == '.')
+		return false;
+
+	return true;
+}
+
+/* Allocate an in-core record to hold entries while we rebuild the dir data. */
+STATIC int
+xrep_dir_salvage_entry(
+	struct xrep_dir		*rd,
+	unsigned char		*name,
+	unsigned int		namelen,
+	xfs_ino_t		ino)
+{
+	struct xrep_dirent	entry = {
+		.ino		= ino,
+	};
+	struct xfs_scrub	*sc = rd->sc;
+	struct xfs_inode	*ip;
+	unsigned int		i = 0;
+	int			error = 0;
+
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	/*
+	 * Truncate the name to the first character that would trip namecheck.
+	 * If we no longer have a name after that, ignore this entry.
+	 */
+	while (i < namelen && name[i] != 0 && name[i] != '/')
+		i++;
+	if (i == 0)
+		return 0;
+	entry.namelen = i;
+	entry.hash = xfs_da_hashname(name, entry.namelen);
+
+	/* Ignore '..' entries; we already picked the new parent. */
+	if (entry.namelen == 2 && name[0] == '.' && name[1] == '.') {
+		trace_xrep_dir_salvaged_parent(sc->ip, ino);
+		return 0;
+	}
+
+	trace_xrep_dir_salvage_entry(sc->ip, name, entry.namelen, ino);
+
+	/*
+	 * Compute the ftype or dump the entry if we can't.  We don't lock the
+	 * inode because inodes can't change type while we have a reference.
+	 */
+	error = xchk_iget(sc, ino, &ip);
+	if (error)
+		return 0;
+
+	entry.ftype = xfs_mode_to_ftype(VFS_I(ip)->i_mode);
+	xchk_irele(sc, ip);
+
+	/* Remember this for later. */
+	error = xfblob_store(rd->dir_names, &entry.name_cookie, name,
+			entry.namelen);
+	if (error)
+		return error;
+
+	return xfarray_append(rd->dir_entries, &entry);
+}
+
+/* Record a shortform directory entry for later reinsertion. */
+STATIC int
+xrep_dir_salvage_sf_entry(
+	struct xrep_dir			*rd,
+	struct xfs_dir2_sf_hdr		*sfp,
+	struct xfs_dir2_sf_entry	*sfep)
+{
+	xfs_ino_t			ino;
+
+	ino = xfs_dir2_sf_get_ino(rd->sc->mp, sfp, sfep);
+	if (!xrep_dir_want_salvage(rd, sfep->name, sfep->namelen, ino))
+		return 0;
+
+	return xrep_dir_salvage_entry(rd, sfep->name, sfep->namelen, ino);
+}
+
+/* Record a regular directory entry for later reinsertion. */
+STATIC int
+xrep_dir_salvage_data_entry(
+	struct xrep_dir			*rd,
+	struct xfs_dir2_data_entry	*dep)
+{
+	xfs_ino_t			ino;
+
+	ino = be64_to_cpu(dep->inumber);
+	if (!xrep_dir_want_salvage(rd, dep->name, dep->namelen, ino))
+		return 0;
+
+	return xrep_dir_salvage_entry(rd, dep->name, dep->namelen, ino);
+}
+
+/* Try to recover block/data format directory entries. */
+STATIC int
+xrep_dir_recover_data(
+	struct xrep_dir		*rd,
+	struct xfs_buf		*bp)
+{
+	struct xfs_da_geometry	*geo = rd->sc->mp->m_dir_geo;
+	unsigned int		offset;
+	unsigned int		end;
+	int			error = 0;
+
+	/*
+	 * Loop over the data portion of the block.
+	 * Each object is a real entry (dep) or an unused one (dup).
+	 */
+	offset = geo->data_entry_offset;
+	end = min_t(unsigned int, BBTOB(bp->b_length),
+			xfs_dir3_data_end_offset(geo, bp->b_addr));
+
+	while (offset < end) {
+		struct xfs_dir2_data_unused	*dup = bp->b_addr + offset;
+		struct xfs_dir2_data_entry	*dep = bp->b_addr + offset;
+
+		if (xchk_should_terminate(rd->sc, &error))
+			return error;
+
+		/* Skip unused entries. */
+		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
+			offset += be16_to_cpu(dup->length);
+			continue;
+		}
+
+		/* Don't walk off the end of the block. */
+		offset += xfs_dir2_data_entsize(rd->sc->mp, dep->namelen);
+		if (offset > end)
+			break;
+
+		/* Ok, let's save this entry. */
+		error = xrep_dir_salvage_data_entry(rd, dep);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Try to recover shortform directory entries. */
+STATIC int
+xrep_dir_recover_sf(
+	struct xrep_dir			*rd)
+{
+	struct xfs_dir2_sf_hdr		*sfp;
+	struct xfs_dir2_sf_entry	*sfep;
+	struct xfs_dir2_sf_entry	*next;
+	struct xfs_ifork		*ifp;
+	xfs_ino_t			ino;
+	unsigned char			*end;
+	int				error = 0;
+
+	ifp = xfs_ifork_ptr(rd->sc->ip, XFS_DATA_FORK);
+	sfp = (struct xfs_dir2_sf_hdr *)rd->sc->ip->i_df.if_u1.if_data;
+	end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes;
+
+	ino = xfs_dir2_sf_get_parent_ino(sfp);
+	trace_xrep_dir_salvaged_parent(rd->sc->ip, ino);
+
+	sfep = xfs_dir2_sf_firstentry(sfp);
+	while ((unsigned char *)sfep < end) {
+		if (xchk_should_terminate(rd->sc, &error))
+			return error;
+
+		next = xfs_dir2_sf_nextentry(rd->sc->mp, sfp, sfep);
+		if ((unsigned char *)next > end)
+			break;
+
+		/* Ok, let's save this entry. */
+		error = xrep_dir_salvage_sf_entry(rd, sfp, sfep);
+		if (error)
+			return error;
+
+		sfep = next;
+	}
+
+	return 0;
+}
+
+/*
+ * Try to figure out the format of this directory from the data fork mappings
+ * and the directory size.  If we can be reasonably sure of format, we can be
+ * more aggressive in salvaging directory entries.  On return, @magic_guess
+ * will be set to DIR3_BLOCK_MAGIC if we think this is a "block format"
+ * directory; DIR3_DATA_MAGIC if we think this is a "data format" directory,
+ * and 0 if we can't tell.
+ */
+STATIC void
+xrep_dir_guess_format(
+	struct xrep_dir		*rd,
+	__be32			*magic_guess)
+{
+	struct xfs_inode	*ip = rd->sc->ip;
+	struct xfs_da_geometry	*geo = rd->sc->mp->m_dir_geo;
+	xfs_fileoff_t		last;
+	int			error;
+
+	ASSERT(xfs_has_crc(ip->i_mount));
+
+	*magic_guess = 0;
+
+	/*
+	 * If there's a single directory block and the directory size is
+	 * exactly one block, this has to be a single block format directory.
+	 */
+	error = xfs_bmap_last_offset(ip, &last, XFS_DATA_FORK);
+	if (!error && XFS_FSB_TO_B(ip->i_mount, last) == geo->blksize &&
+	    ip->i_disk_size == geo->blksize) {
+		*magic_guess = cpu_to_be32(XFS_DIR3_BLOCK_MAGIC);
+		return;
+	}
+
+	/*
+	 * If the last extent before the leaf offset matches the directory
+	 * size and the directory size is larger than 1 block, this is a
+	 * data format directory.
+	 */
+	last = geo->leafblk;
+	error = xfs_bmap_last_before(rd->sc->tp, ip, &last, XFS_DATA_FORK);
+	if (!error &&
+	    XFS_FSB_TO_B(ip->i_mount, last) > geo->blksize &&
+	    XFS_FSB_TO_B(ip->i_mount, last) == ip->i_disk_size) {
+		*magic_guess = cpu_to_be32(XFS_DIR3_DATA_MAGIC);
+		return;
+	}
+}
+
+/* Recover directory entries from a specific directory block. */
+STATIC int
+xrep_dir_recover_dirblock(
+	struct xrep_dir		*rd,
+	__be32			magic_guess,
+	xfs_dablk_t		dabno)
+{
+	struct xfs_dir2_data_hdr *hdr;
+	struct xfs_buf		*bp;
+	__be32			oldmagic;
+	int			error;
+
+	/*
+	 * Try to read the buffer.  We invalidate these buffers in a later
+	 * step, so we don't bother to set a buffer type or ops.
+	 */
+	error = xfs_da_read_buf(rd->sc->tp, rd->sc->ip, dabno,
+			XFS_DABUF_MAP_HOLE_OK, &bp, XFS_DATA_FORK, NULL);
+	if (error || !bp)
+		return error;
+
+	hdr = bp->b_addr;
+	oldmagic = hdr->magic;
+
+	trace_xrep_dir_recover_dirblock(rd->sc->ip, dabno,
+			be32_to_cpu(hdr->magic), be32_to_cpu(magic_guess));
+
+	/*
+	 * If we're sure of the block's format, proceed with the salvage
+	 * operation using the specified magic number.
+	 */
+	if (magic_guess) {
+		hdr->magic = magic_guess;
+		goto recover;
+	}
+
+	/*
+	 * If we couldn't guess what type of directory this is, then we will
+	 * only salvage entries from directory blocks that match the magic
+	 * number and pass verifiers.
+	 */
+	switch (hdr->magic) {
+	case cpu_to_be32(XFS_DIR2_BLOCK_MAGIC):
+	case cpu_to_be32(XFS_DIR3_BLOCK_MAGIC):
+		if (!xrep_buf_verify_struct(bp, &xfs_dir3_block_buf_ops))
+			goto out;
+		if (xfs_dir3_block_header_check(bp, rd->sc->ip->i_ino) != NULL)
+			goto out;
+		break;
+	case cpu_to_be32(XFS_DIR2_DATA_MAGIC):
+	case cpu_to_be32(XFS_DIR3_DATA_MAGIC):
+		if (!xrep_buf_verify_struct(bp, &xfs_dir3_data_buf_ops))
+			goto out;
+		if (xfs_dir3_data_header_check(bp, rd->sc->ip->i_ino) != NULL)
+			goto out;
+		break;
+	default:
+		goto out;
+	}
+
+recover:
+	error = xrep_dir_recover_data(rd, bp);
+
+out:
+	hdr->magic = oldmagic;
+	xfs_trans_brelse(rd->sc->tp, bp);
+	return error;
+}
+
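+/* Initialize the da_args fields common to every operation in this repair. */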
+static inline void xrep_dir_init_args(struct xrep_dir *rd)
+{
+	memset(&rd->args, 0, sizeof(struct xfs_da_args));
+	rd->args.geo = rd->sc->mp->m_dir_geo;
+	rd->args.whichfork = XFS_DATA_FORK;
+	rd->args.owner = rd->sc->ip->i_ino;
+	rd->args.trans = rd->sc->tp;
+}
+
+/*
+ * Enter a name in a directory, or check for available space.
+ * If inum is 0, only the available space test is performed.
+ */
+STATIC int
+xrep_dir_createname(
+	struct xrep_dir		*rd,
+	struct xfs_inode	*dp,
+	const struct xfs_name	*name,
+	xfs_ino_t		inum,
+	xfs_extlen_t		total)
+{
+	struct xfs_scrub	*sc = rd->sc;
+	bool			is_block, is_leaf;
+	int			error;
+
+	ASSERT(S_ISDIR(VFS_I(dp)->i_mode));
+
+	if (inum) {
+		error = xfs_dir_ino_validate(sc->mp, inum);
+		if (error)
+			return error;
+	}
+
+	xrep_dir_init_args(rd);
+	rd->args.name = name->name;
+	rd->args.namelen = name->len;
+	rd->args.filetype = name->type;
+	rd->args.hashval = xfs_dir2_hashname(sc->mp, name);
+	rd->args.inumber = inum;
+	rd->args.dp = dp;
+	rd->args.total = total;
+	rd->args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+
+	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
+		return xfs_dir2_sf_addname(&rd->args);
+
+	error = xfs_dir2_isblock(&rd->args, &is_block);
+	if (error)
+		return error;
+	if (is_block)
+		return xfs_dir2_block_addname(&rd->args);
+
+	error = xfs_dir2_isleaf(&rd->args, &is_leaf);
+	if (error)
+		return error;
+	if (is_leaf)
+		return xfs_dir2_leaf_addname(&rd->args);
+
+	return xfs_dir2_node_addname(&rd->args);
+}
+
+/* Insert one dir entry without cycling locks or transactions. */
+STATIC int
+xrep_dir_insert_rec(
+	struct xrep_dir			*rd,
+	const struct xrep_dirent	*entry)
+{
+	struct xfs_name			name = {
+		.len			= entry->namelen,
+		.type			= entry->ftype,
+		.name			= rd->namebuf,
+	};
+	struct xfs_mount		*mp = rd->sc->mp;
+	char				*namebuf = rd->namebuf;
+	xfs_ino_t			ino;
+	uint				resblks;
+	int				error;
+
+	/* The entry name is stored in the in-core buffer. */
+	error = xfblob_load(rd->dir_names, entry->name_cookie, namebuf,
+			entry->namelen);
+	if (error)
+		return error;
+	namebuf[MAXNAMELEN - 1] = 0;
+
+	trace_xrep_dir_insert_rec(rd->sc->tempip, &name, entry->ino);
+
+	resblks = XFS_LINK_SPACE_RES(mp, entry->namelen);
+	error = xchk_trans_alloc(rd->sc, resblks);
+	if (error)
+		return error;
+
+	/*
+	 * Lock the temporary directory and join it to the transaction, then
+	 * make sure this filename isn't already present before we add it.
+	 */
+	xrep_tempfile_ilock(rd->sc);
+	xfs_trans_ijoin(rd->sc->tp, rd->sc->tempip, 0);
+
+	error = xchk_dir_lookup(rd->sc, rd->sc->tempip, &name, &ino);
+	if (error != -ENOENT)
+		goto out_cancel;
+
+	error = xrep_dir_createname(rd, rd->sc->tempip, &name, entry->ino,
+			resblks);
+	if (error)
+		goto out_cancel;
+
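+	/* Count subdirectories so that we can set the link count later. */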
+	if (name.type == XFS_DIR3_FT_DIR)
+		rd->new_nlink++;
+
+	/* Commit and unlock. */
+	error = xrep_trans_commit(rd->sc);
+	if (error)
+		return error;
+
+	xrep_tempfile_iunlock(rd->sc);
+	return 0;
+out_cancel:
+	xchk_trans_cancel(rd->sc);
+	xrep_tempfile_iunlock(rd->sc);
+	return error;
+}
+
+/*
+ * Periodically flush salvaged directory entries to the temporary file.  This
+ * is done to reduce the memory requirements of the directory rebuild, since
+ * directories can contain up to 32GB of directory data.
+ */
+STATIC int
+xrep_dir_flush_salvaged(
+	struct xrep_dir		*rd)
+{
+	xfarray_idx_t		array_cur;
+	int			error;
+
+	/*
+	 * Entering this function, the scrub context has a reference to the
+	 * inode being repaired, the temporary file, and a scrub transaction
+	 * that we use during dirent salvaging to avoid livelocking if there
+	 * are cycles in the directory structures.  We hold ILOCK_EXCL on both
+	 * the inode being repaired and the temporary file, though they are
+	 * not ijoined to the scrub transaction.
+	 *
+	 * To constrain kernel memory use, we occasionally write salvaged
+	 * dirents from the xfarray and xfblob structures into the temporary
+	 * directory in preparation for swapping the directory structures at
+	 * the end.  Updating the temporary file requires a transaction, so we
+	 * commit the scrub transaction and drop the two ILOCKs so that
+	 * we can allocate whatever transaction we want.
+	 *
+	 * We still hold IOLOCK_EXCL on the inode being repaired, which
+	 * prevents anyone from accessing the damaged directory data while we
+	 * repair it.
+	 */
+	error = xrep_trans_commit(rd->sc);
+	if (error)
+		return error;
+	xchk_iunlock(rd->sc, XFS_ILOCK_EXCL);
+
+	/*
+	 * Take the IOLOCK of the temporary file while we modify dirents.  This
+	 * isn't strictly required because the temporary file is never revealed
+	 * to userspace, but we follow the same locking rules.
+	 */
+	while (!xrep_tempfile_iolock_nowait(rd->sc)) {
+		if (xchk_should_terminate(rd->sc, &error))
+			return error;
+		delay(1);
+	}
+
+	/* Add all the salvaged dirents to the temporary directory. */
+	foreach_xfarray_idx(rd->dir_entries, array_cur) {
+		struct xrep_dirent	entry;
+
+		error = xfarray_load(rd->dir_entries, array_cur, &entry);
+		if (error)
+			return error;
+
+		error = xrep_dir_insert_rec(rd, &entry);
+		if (error)
+			return error;
+	}
+	xrep_tempfile_iounlock(rd->sc);
+
+	/* Empty out both arrays now that we've added the entries. */
+	xfarray_truncate(rd->dir_entries);
+	xfblob_truncate(rd->dir_names);
+
+	/* Recreate the salvage transaction and relock both inodes. */
+	error = xchk_trans_alloc(rd->sc, 0);
+	if (error)
+		return error;
+	xchk_ilock(rd->sc, XFS_ILOCK_EXCL);
+	return 0;
+}
+
+/* Extract as many directory entries as we can. */
+STATIC int
+xrep_dir_recover(
+	struct xrep_dir		*rd)
+{
+	struct xfs_bmbt_irec	got;
+	struct xfs_scrub	*sc = rd->sc;
+	struct xfs_da_geometry	*geo = sc->mp->m_dir_geo;
+	xfs_fileoff_t		offset;
+	xfs_dablk_t		dabno;
+	__be32			magic_guess;
+	int			nmap;
+	int			error;
+
+	xrep_dir_guess_format(rd, &magic_guess);
+
+	/* Iterate each directory data block in the data fork. */
+	for (offset = 0;
+	     offset < geo->leafblk;
+	     offset = got.br_startoff + got.br_blockcount) {
+		nmap = 1;
+		error = xfs_bmapi_read(sc->ip, offset, geo->leafblk - offset,
+				&got, &nmap, 0);
+		if (error)
+			return error;
+		if (nmap != 1)
+			return -EFSCORRUPTED;
+		if (!xfs_bmap_is_written_extent(&got))
+			continue;
+
+		for (dabno = round_up(got.br_startoff, geo->fsbcount);
+		     dabno < got.br_startoff + got.br_blockcount;
+		     dabno += geo->fsbcount) {
+			if (xchk_should_terminate(rd->sc, &error))
+				return error;
+
+			error = xrep_dir_recover_dirblock(rd,
+					magic_guess, dabno);
+			if (error)
+				return error;
+
+			/* Flush dirents to constrain memory usage. */
+			if (xfarray_bytes(rd->dir_entries) +
+			    xfblob_bytes(rd->dir_names) <
+			    XREP_DIR_SALVAGE_BYTES)
+				continue;
+
+			error = xrep_dir_flush_salvaged(rd);
+			if (error)
+				return error;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Find all the directory entries for this inode by scraping them out of the
+ * directory data blocks by hand, and flushing them into the temp dir.
+ */
+STATIC int
+xrep_dir_find_entries(
+	struct xrep_dir		*rd)
+{
+	struct xfs_inode	*ip = rd->sc->ip;
+	int			error;
+
+	/*
+	 * Salvage directory entries from the old directory, and write them to
+	 * the temporary directory.
+	 */
+	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		error = xrep_dir_recover_sf(rd);
+	} else {
+		error = xfs_iread_extents(rd->sc->tp, ip, XFS_DATA_FORK);
+		if (error)
+			return error;
+
+		error = xrep_dir_recover(rd);
+	}
+	if (error)
+		return error;
+
+	return xrep_dir_flush_salvaged(rd);
+}
+
+/*
+ * Free all the directory blocks and reset the data fork.  The caller must
+ * join the inode to the transaction.  This function returns with the inode
+ * joined to a clean scrub transaction.
+ */
+STATIC int
+xrep_dir_reset_fork(
+	struct xrep_dir		*rd,
+	xfs_ino_t		parent_ino)
+{
+	struct xfs_scrub	*sc = rd->sc;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(sc->tempip, XFS_DATA_FORK);
+	int			error;
+
+	/* Unmap all the directory buffers. */
+	if (xfs_ifork_has_extents(ifp)) {
+		error = xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK);
+		if (error)
+			return error;
+	}
+
+	trace_xrep_dir_reset_fork(sc->tempip, parent_ino);
+
+	/* Reset the data fork to an empty data fork. */
+	xfs_idestroy_fork(ifp);
+	ifp->if_bytes = 0;
+	sc->tempip->i_disk_size = 0;
+
+	/* Reinitialize the short form directory. */
+	xrep_dir_init_args(rd);
+	rd->args.dp = sc->tempip;
+	error = xfs_dir2_sf_create(&rd->args, parent_ino);
+	if (error)
+		return error;
+
+	return xrep_tempfile_roll_trans(sc);
+}
+
+/*
+ * Prepare both inodes' directory forks for extent swapping.  Promote the
+ * tempfile from short format to block format, and if the file being repaired
+ * has a short format data fork, turn it into an empty extent list.
+ */
+STATIC int
+xrep_dir_swap_prep(
+	struct xfs_scrub	*sc,
+	bool			temp_local,
+	bool			ip_local)
+{
+	int			error;
+
+	/*
+	 * If the tempfile's directory is in shortform format, convert it to a
+	 * single directory block so that we can use the atomic extent swap.
+	 */
+	if (temp_local) {
+		struct xfs_da_args	args = {
+			.dp		= sc->tempip,
+			.geo		= sc->mp->m_dir_geo,
+			.whichfork	= XFS_DATA_FORK,
+			.trans		= sc->tp,
+			.total		= 1,
+			.owner		= sc->ip->i_ino,
+		};
+
+		error = xfs_dir2_sf_to_block(&args);
+		if (error)
+			return error;
+
+		/*
+		 * Roll the deferred log items to get us back to a clean
+		 * transaction.
+		 */
+		error = xfs_defer_finish(&sc->tp);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If the file being repaired had a shortform data fork, convert that
+	 * to an empty extent list in preparation for the atomic extent swap.
+	 */
+	if (ip_local) {
+		struct xfs_ifork	*ifp;
+
+		ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK);
+		xfs_idestroy_fork(ifp);
+		ifp->if_format = XFS_DINODE_FMT_EXTENTS;
+		ifp->if_nextents = 0;
+		ifp->if_bytes = 0;
+		ifp->if_u1.if_root = NULL;
+		ifp->if_height = 0;
+
+		xfs_trans_log_inode(sc->tp, sc->ip,
+				XFS_ILOG_CORE | XFS_ILOG_DDATA);
+	}
+
+	return 0;
+}
+
+/*
+ * Replace the inode number of a directory entry.
+ */
+static int
+xrep_dir_replace(
+	struct xrep_dir		*rd,
+	struct xfs_inode	*dp,
+	const struct xfs_name	*name,
+	xfs_ino_t		inum,
+	xfs_extlen_t		total)
+{
+	struct xfs_scrub	*sc = rd->sc;
+	bool			is_block, is_leaf;
+	int			error;
+
+	ASSERT(S_ISDIR(VFS_I(dp)->i_mode));
+
+	error = xfs_dir_ino_validate(sc->mp, inum);
+	if (error)
+		return error;
+
+	xrep_dir_init_args(rd);
+	rd->args.name = name->name;
+	rd->args.namelen = name->len;
+	rd->args.filetype = name->type;
+	rd->args.hashval = xfs_dir2_hashname(sc->mp, name);
+	rd->args.inumber = inum;
+	rd->args.dp = dp;
+	rd->args.total = total;
+
+	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
+		return xfs_dir2_sf_replace(&rd->args);
+
+	error = xfs_dir2_isblock(&rd->args, &is_block);
+	if (error)
+		return error;
+	if (is_block)
+		return xfs_dir2_block_replace(&rd->args);
+
+	error = xfs_dir2_isleaf(&rd->args, &is_leaf);
+	if (error)
+		return error;
+	if (is_leaf)
+		return xfs_dir2_leaf_replace(&rd->args);
+
+	return xfs_dir2_node_replace(&rd->args);
+}
+
+/* Swap the temporary directory's data fork with the one being repaired. */
+STATIC int
+xrep_dir_swap(
+	struct xrep_dir		*rd)
+{
+	struct xfs_scrub	*sc = rd->sc;
+	bool			ip_local, temp_local;
+	int			error = 0;
+
+	/*
+	 * Take the IOLOCK on the temporary file so that we can run dir
+	 * operations with the same locks held as we would for a normal file.
+	 */
+	while (!xrep_tempfile_iolock_nowait(rd->sc)) {
+		if (xchk_should_terminate(rd->sc, &error))
+			return error;
+		delay(1);
+	}
+
+	error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, &rd->tx);
+	if (error)
+		return error;
+
+	/*
+	 * Reset the temporary directory's '.' entry to point to the directory
+	 * we're repairing.  Note: shortform directories lack the dot entry.
+	 *
+	 * It's possible that this replacement could also expand a sf tempdir
+	 * into block format.
+	 */
+	if (sc->tempip->i_df.if_format != XFS_DINODE_FMT_LOCAL) {
+		error = xrep_dir_replace(rd, sc->tempip, &xfs_name_dot,
+				sc->ip->i_ino, rd->tx.req.resblks);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Reset the temporary directory's '..' entry to point to the parent
+	 * that we found.  The temporary directory was created with the root
+	 * directory as the parent, so we can skip this if repairing a
+	 * subdirectory of the root.
+	 *
+	 * It's possible that this replacement could also expand a sf tempdir
+	 * into block format.
+	 */
+	if (rd->parent_ino != sc->mp->m_rootip->i_ino) {
+		error = xrep_dir_replace(rd, rd->sc->tempip, &xfs_name_dotdot,
+				rd->parent_ino, rd->tx.req.resblks);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Changing the dot and dotdot entries could have changed the shape of
+	 * the directory, so we recompute these.
+	 */
+	ip_local = sc->ip->i_df.if_format == XFS_DINODE_FMT_LOCAL;
+	temp_local = sc->tempip->i_df.if_format == XFS_DINODE_FMT_LOCAL;
+
+	/*
+	 * If both files have a local format data fork and the rebuilt
+	 * directory data would fit in the repaired file's data fork, copy
+	 * the contents from the tempfile and declare ourselves done.
+	 */
+	if (ip_local && temp_local &&
+	    sc->tempip->i_disk_size <= xfs_inode_data_fork_size(sc->ip)) {
+		set_nlink(VFS_I(sc->ip), rd->new_nlink);
+		xrep_tempfile_copyout_local(sc, XFS_DATA_FORK);
+		return 0;
+	}
+
+	/* Clean the transaction before we start working on the extent swap. */
+	error = xrep_tempfile_roll_trans(rd->sc);
+	if (error)
+		return error;
+
+	/* Otherwise, make sure both data forks are in block-mapping mode. */
+	error = xrep_dir_swap_prep(sc, temp_local, ip_local);
+	if (error)
+		return error;
+
+	/*
+	 * Set nlink of the directory under repair to two plus the number of
+	 * subdirectories in the new directory data (for '.' and the parent's
+	 * entry).  Do this in the same transaction sequence that (atomically)
+	 * commits the new data.
+	 */
+	set_nlink(VFS_I(sc->ip), rd->new_nlink);
+
+	return xrep_tempswap_contents(sc, &rd->tx);
+}
+
+/*
+ * Swap the new directory contents (which we created in the tempfile) into the
+ * directory being repaired.
+ */
+STATIC int
+xrep_dir_rebuild_tree(
+	struct xrep_dir		*rd)
+{
+	struct xfs_scrub	*sc = rd->sc;
+	int			error;
+
+	trace_xrep_dir_rebuild_tree(sc->ip, rd->parent_ino);
+
+	/*
+	 * Commit the repair transaction so that we can use the atomic extent
+	 * swap helper functions to compute the correct block reservations and
+	 * re-lock the inodes.
+	 *
+	 * We still hold IOLOCK_EXCL (aka i_rwsem) which will prevent directory
+	 * modifications, but there's nothing to prevent userspace from reading
+	 * the directory until we're ready for the swap operation.  Reads will
+	 * return -EIO without shutting down the fs, so we're ok with that.
+	 */
+	error = xrep_trans_commit(sc);
+	if (error)
+		return error;
+
+	xchk_iunlock(sc, XFS_ILOCK_EXCL);
+
+	/*
+	 * Swap the tempdir's data fork with the file being repaired.  This
+	 * recreates the transaction and re-takes the ILOCK in the scrub
+	 * context.
+	 */
+	error = xrep_dir_swap(rd);
+	if (error)
+		return error;
+
+	/*
+	 * Release the old directory blocks and reset the data fork of the temp
+	 * directory to an empty shortform directory because inactivation does
+	 * nothing for directories.
+	 */
+	return xrep_dir_reset_fork(rd, sc->mp->m_rootip->i_ino);
+}
+
+/*
+ * If we're the root of a directory tree, we are our own parent.  If we're an
+ * unlinked directory, the parent /won't/ have a link to us.  Set the parent
+ * directory to the root for both cases.  Returns NULLFSINO if we don't know
+ * what to do.
+ */
+static inline xfs_ino_t
+xrep_dir_self_parent(
+	struct xrep_dir		*rd)
+{
+	struct xfs_scrub	*sc = rd->sc;
+
+	if (sc->ip->i_ino == sc->mp->m_sb.sb_rootino)
+		return sc->mp->m_sb.sb_rootino;
+
+	if (VFS_I(sc->ip)->i_nlink == 0)
+		return sc->mp->m_sb.sb_rootino;
+
+	return NULLFSINO;
+}
+
+/*
+ * Look up the dotdot entry.  Returns NULLFSINO if we don't know what to do.
+ * The next patch will check this more carefully.
+ */
+static inline xfs_ino_t
+xrep_dir_lookup_parent(
+	struct xrep_dir		*rd)
+{
+	return xrep_dotdot_lookup(rd->sc);
+}
+
+/*
+ * Try to find the parent of the directory being repaired.
+ *
+ * NOTE: This function will someday be augmented by the directory parent repair
+ * code, which will know how to check the parent and scan the filesystem if
+ * we cannot find anything.  Inode scans will have to be done before we start
+ * salvaging directory entries, so we do this now.
+ */
+STATIC int
+xrep_dir_find_parent(
+	struct xrep_dir		*rd)
+{
+	rd->parent_ino = xrep_dir_self_parent(rd);
+	if (rd->parent_ino != NULLFSINO)
+		return 0;
+
+	rd->parent_ino = xrep_dir_lookup_parent(rd);
+	if (rd->parent_ino != NULLFSINO)
+		return 0;
+
+	/* NOTE: A future patch will deal with moving orphans. */
+	return -EFSCORRUPTED;
+}
+
+/*
+ * Repair the directory metadata.
+ *
+ * XXX: Directory entry buffers can be multiple fsblocks in size.  The buffer
+ * cache in XFS can't handle aliased multiblock buffers, so this might
+ * misbehave if the directory blocks are crosslinked with other filesystem
+ * metadata.
+ *
+ * XXX: Is it necessary to check the dcache for this directory to make sure
+ * that we always recreate every cached entry?
+ */
+int
+xrep_directory(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_dir		*rd = sc->buf;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_has_rmapbt(sc->mp))
+		return -EOPNOTSUPP;
+
+	rd->sc = sc;
+	rd->parent_ino = NULLFSINO;
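+	/* An empty directory has two links: '.' and the parent's entry. */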
+	rd->new_nlink = 2;
+
+	/* Set up some staging memory for salvaging dirents. */
+	error = xfarray_create(sc->mp, "directory entries", 0,
+			sizeof(struct xrep_dirent), &rd->dir_entries);
+	if (error)
+		goto out_rd;
+
+	error = xfblob_create(sc->mp, "dirent names", &rd->dir_names);
+	if (error)
+		goto out_arr;
+
+	/*
+	 * Drop the ILOCK and MMAPLOCK on this directory; we don't need to
+	 * hold these to maintain control over the directory we're fixing.
+	 * This should leave us holding only IOLOCK_EXCL.  If we have to scan
+	 * the entire filesystem to find or confirm the parent of this
+	 * directory, we may have to cycle IOLOCK_EXCL.
+	 */
+	if (sc->ilock_flags & XFS_ILOCK_EXCL)
+		xchk_iunlock(sc, XFS_ILOCK_EXCL);
+	xchk_iunlock(sc, XFS_MMAPLOCK_EXCL);
+
+	/* Figure out who is going to be the parent of this directory. */
+	error = xrep_dir_find_parent(rd);
+	if (error)
+		goto out_names;
+
+	/* Re-grab the ILOCK so that we can salvage directory entries. */
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
+
+	/*
+	 * Collect directory entries by parsing raw leaf blocks to salvage
+	 * whatever we can.  When we're done, free the staging memory before
+	 * swapping the directories to reduce memory usage.
+	 */
+	error = xrep_dir_find_entries(rd);
+	if (error)
+		goto out_names;
+
+	xfblob_destroy(rd->dir_names);
+	xfarray_destroy(rd->dir_entries);
+	rd->dir_names = NULL;
+	rd->dir_entries = NULL;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		goto out_rd;
+
+	/* Swap in the good contents. */
+	error = xrep_dir_rebuild_tree(rd);
+
+out_names:
+	if (rd->dir_names)
+		xfblob_destroy(rd->dir_names);
+out_arr:
+	if (rd->dir_entries)
+		xfarray_destroy(rd->dir_entries);
+out_rd:
+	return error;
+}
diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c
index e29d7da2eb32..54aa3dc4dc89 100644
--- a/fs/xfs/scrub/nlinks.c
+++ b/fs/xfs/scrub/nlinks.c
@@ -27,6 +27,7 @@
 #include "scrub/nlinks.h"
 #include "scrub/trace.h"
 #include "scrub/readdir.h"
+#include "scrub/tempfile.h"
 
 /*
  * Live Inode Link Count Checking
@@ -147,6 +148,13 @@ xchk_nlinks_live_update(
 	if (action == XFS_BACKREF_NLINK_DELTA)
 		scan_dir = p->ip;
 
+	/*
+	 * Ignore temporary directories being used to stage dir repairs, since
+	 * we don't bump the link counts of the children.
+	 */
+	if (xrep_is_tempfile(scan_dir))
+		return NOTIFY_DONE;
+
 	/* Ignore the live update if the directory hasn't been scanned yet. */
 	if (!xchk_iscan_want_live_update(&xnc->collect_iscan, scan_dir->i_ino))
 		return NOTIFY_DONE;
@@ -316,6 +324,13 @@ xchk_nlinks_collect_dir(
 	unsigned int		lock_mode;
 	int			error = 0;
 
+	/*
+	 * Ignore temporary directories being used to stage dir repairs, since
+	 * we don't bump the link counts of the children.
+	 */
+	if (xrep_is_tempfile(dp))
+		return 0;
+
 	/* Prevent anyone from changing this directory while we walk it. */
 	xfs_ilock(dp, XFS_IOLOCK_SHARED);
 	lock_mode = xfs_ilock_data_map_shared(dp);
@@ -547,6 +562,14 @@ xchk_nlinks_compare_inode(
 	unsigned int		actual_nlink;
 	int			error;
 
+	/*
+	 * Ignore temporary files being used to stage repairs, since we assume
+	 * they're correct for non-directories, and the directory repair code
+	 * doesn't bump the link counts for the children.
+	 */
+	if (xrep_is_tempfile(ip))
+		return 0;
+
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	mutex_lock(&xnc->lock);
 
diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c
index 2f83abd6eec7..4723b015a1c1 100644
--- a/fs/xfs/scrub/nlinks_repair.c
+++ b/fs/xfs/scrub/nlinks_repair.c
@@ -25,6 +25,7 @@
 #include "scrub/iscan.h"
 #include "scrub/nlinks.h"
 #include "scrub/trace.h"
+#include "scrub/tempfile.h"
 
 /*
  * Live Inode Link Count Repair
@@ -52,6 +53,14 @@ xrep_nlinks_repair_inode(
 	unsigned int		actual_nlink;
 	int			error;
 
+	/*
+	 * Ignore temporary files being used to stage repairs, since we assume
+	 * they're correct for non-directories, and the directory repair code
+	 * doesn't bump the link counts for the children.
+	 */
+	if (xrep_is_tempfile(ip))
+		return 0;
+
 	xfs_ilock(ip, XFS_IOLOCK_EXCL);
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, 0, 0, 0, &sc->tp);
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index e5e5dbdce7c4..b5c5ee7f512b 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -34,6 +34,7 @@
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
+#include "xfs_dir2.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -1234,3 +1235,31 @@ xrep_buf_verify_struct(
 
 	return fa == NULL;
 }
+
+/*
+ * Look up the '..' entry for @sc->ip.  Returns NULLFSINO if sc->ip is not
+ * a directory, the directory is corrupt, or the inode number can't possibly
+ * be valid.
+ */
+xfs_ino_t
+xrep_dotdot_lookup(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_name		dotdot = xfs_name_dotdot;
+	xfs_ino_t		ino;
+	int			error;
+
+	/* sc->ip had better be a directory, so bail out if it isn't */
+	if (!S_ISDIR(VFS_I(sc->ip)->i_mode)) {
+		ASSERT(0);
+		return NULLFSINO;
+	}
+
+	error = xfs_dir_lookup(sc->tp, sc->ip, &dotdot, &ino, NULL);
+	if (error)
+		return NULLFSINO;
+	if (!xfs_verify_dir_ino(sc->mp, ino))
+		return NULLFSINO;
+
+	return ino;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 2a79d7a5ba7e..5fccc9c81d8f 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -83,6 +83,7 @@ int xrep_setup_ag_refcountbt(struct xfs_scrub *sc);
 int xrep_setup_rtsummary(struct xfs_scrub *sc, unsigned int *resblks,
 		size_t *bufsize);
 int xrep_setup_xattr(struct xfs_scrub *sc);
+int xrep_setup_directory(struct xfs_scrub *sc);
 
 int xrep_xattr_reset_fork(struct xfs_scrub *sc);
 
@@ -120,6 +121,7 @@ int xrep_bmap_cow(struct xfs_scrub *sc);
 int xrep_nlinks(struct xfs_scrub *sc);
 int xrep_fscounters(struct xfs_scrub *sc);
 int xrep_xattr(struct xfs_scrub *sc);
+int xrep_directory(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -145,6 +147,7 @@ int xrep_trans_alloc_hook_dummy(struct xfs_mount *mp, void **cookiep,
 void xrep_trans_cancel_hook_dummy(void **cookiep, struct xfs_trans *tp);
 
 bool xrep_buf_verify_struct(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
+xfs_ino_t xrep_dotdot_lookup(struct xfs_scrub *sc);
 
 #else
 
@@ -189,6 +192,7 @@ xrep_setup_nothing(
 #define xrep_setup_ag_rmapbt		xrep_setup_nothing
 #define xrep_setup_ag_refcountbt	xrep_setup_nothing
 #define xrep_setup_xattr		xrep_setup_nothing
+#define xrep_setup_directory		xrep_setup_nothing
 
 #define xrep_setup_inode(sc, imap)	((void)0)
 
@@ -229,6 +233,7 @@ xrep_setup_rtsummary(
 #define xrep_fscounters			xrep_notsupported
 #define xrep_rtsummary			xrep_notsupported
 #define xrep_xattr			xrep_notsupported
+#define xrep_directory			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 0ec23fc650be..1695e9d2f104 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -327,7 +327,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_directory,
 		.scrub	= xchk_directory,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_directory,
 	},
 	[XFS_SCRUB_TYPE_XATTR] = {	/* extended attributes */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index c9a089b169f2..b275c0b764f4 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -793,3 +793,16 @@ xrep_tempfile_copyout_local(
 	ilog_flags |= xfs_ilog_fdata(whichfork);
 	xfs_trans_log_inode(sc->tp, sc->ip, ilog_flags);
 }
+
+/* Decide if a given XFS inode is a temporary file for a repair. */
+bool
+xrep_is_tempfile(
+	const struct xfs_inode	*ip)
+{
+	const struct inode	*inode = &ip->i_vnode;
+
+	if (IS_PRIVATE(inode) && !(inode->i_opflags & IOP_XATTR))
+		return true;
+
+	return false;
+}
diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h
index 402957f7f2b3..4ca35f5d49a5 100644
--- a/fs/xfs/scrub/tempfile.h
+++ b/fs/xfs/scrub/tempfile.h
@@ -33,11 +33,13 @@ int xrep_tempfile_set_isize(struct xfs_scrub *sc, unsigned long long isize);
 
 int xrep_tempfile_roll_trans(struct xfs_scrub *sc);
 void xrep_tempfile_copyout_local(struct xfs_scrub *sc, int whichfork);
+bool xrep_is_tempfile(const struct xfs_inode *ip);
 #else
 static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc)
 {
 	xchk_ilock(sc, XFS_IOLOCK_EXCL);
 }
+# define xrep_is_tempfile(ip)		(false)
 # define xrep_tempfile_rele(sc)
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index fa67a9451820..b35b7d5a3767 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2399,6 +2399,141 @@ DEFINE_EVENT(xrep_xattr_class, name, \
 DEFINE_XREP_XATTR_CLASS(xrep_xattr_rebuild_tree);
 DEFINE_XREP_XATTR_CLASS(xrep_xattr_reset_fork);
 
+TRACE_EVENT(xrep_dir_recover_dirblock,
+	TP_PROTO(struct xfs_inode *dp, xfs_dablk_t dabno, uint32_t magic,
+		 uint32_t magic_guess),
+	TP_ARGS(dp, dabno, magic, magic_guess),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, dir_ino)
+		__field(xfs_dablk_t, dabno)
+		__field(uint32_t, magic)
+		__field(uint32_t, magic_guess)
+	),
+	TP_fast_assign(
+		__entry->dev = dp->i_mount->m_super->s_dev;
+		__entry->dir_ino = dp->i_ino;
+		__entry->dabno = dabno;
+		__entry->magic = magic;
+		__entry->magic_guess = magic_guess;
+	),
+	TP_printk("dev %d:%d dir 0x%llx dablk 0x%x magic 0x%x magic_guess 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dir_ino,
+		  __entry->dabno,
+		  __entry->magic,
+		  __entry->magic_guess)
+);
+
+TRACE_EVENT(xrep_dir_salvage_entry,
+	TP_PROTO(struct xfs_inode *dp, char *name, unsigned int namelen,
+		 xfs_ino_t ino),
+	TP_ARGS(dp, name, namelen, ino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, dir_ino)
+		__field(unsigned int, namelen)
+		__dynamic_array(char, name, namelen + 1)
+		__field(xfs_ino_t, ino)
+	),
+	TP_fast_assign(
+		__entry->dev = dp->i_mount->m_super->s_dev;
+		__entry->dir_ino = dp->i_ino;
+		__entry->namelen = namelen;
+		memcpy(__get_str(name), name, namelen);
+		__get_str(name)[namelen] = 0;
+		__entry->ino = ino;
+	),
+	TP_printk("dev %d:%d dir 0x%llx name '%.*s' ino 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dir_ino,
+		  __entry->namelen,
+		  __get_str(name),
+		  __entry->ino)
+);
+
+DECLARE_EVENT_CLASS(xrep_dir_class,
+	TP_PROTO(struct xfs_inode *dp, xfs_ino_t parent_ino),
+	TP_ARGS(dp, parent_ino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, dir_ino)
+		__field(xfs_ino_t, parent_ino)
+	),
+	TP_fast_assign(
+		__entry->dev = dp->i_mount->m_super->s_dev;
+		__entry->dir_ino = dp->i_ino;
+		__entry->parent_ino = parent_ino;
+	),
+	TP_printk("dev %d:%d dir 0x%llx parent 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dir_ino,
+		  __entry->parent_ino)
+)
+#define DEFINE_XREP_DIR_CLASS(name) \
+DEFINE_EVENT(xrep_dir_class, name, \
+	TP_PROTO(struct xfs_inode *dp, xfs_ino_t parent_ino), \
+	TP_ARGS(dp, parent_ino))
+DEFINE_XREP_DIR_CLASS(xrep_dir_rebuild_tree);
+DEFINE_XREP_DIR_CLASS(xrep_dir_reset_fork);
+
+DECLARE_EVENT_CLASS(xrep_dirent_class,
+	TP_PROTO(struct xfs_inode *dp, struct xfs_name *name, xfs_ino_t ino),
+	TP_ARGS(dp, name, ino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, dir_ino)
+		__field(unsigned int, namelen)
+		__dynamic_array(char, name, name->len)
+		__field(xfs_ino_t, ino)
+		__field(uint8_t, ftype)
+	),
+	TP_fast_assign(
+		__entry->dev = dp->i_mount->m_super->s_dev;
+		__entry->dir_ino = dp->i_ino;
+		__entry->namelen = name->len;
+		memcpy(__get_str(name), name->name, name->len);
+		__entry->ino = ino;
+		__entry->ftype = name->type;
+	),
+	TP_printk("dev %d:%d dir 0x%llx ftype %s name '%.*s' ino 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dir_ino,
+		  __print_symbolic(__entry->ftype, XFS_DIR3_FTYPE_STR),
+		  __entry->namelen,
+		  __get_str(name),
+		  __entry->ino)
+)
+#define DEFINE_XREP_DIRENT_CLASS(name) \
+DEFINE_EVENT(xrep_dirent_class, name, \
+	TP_PROTO(struct xfs_inode *dp, struct xfs_name *name, xfs_ino_t ino), \
+	TP_ARGS(dp, name, ino))
+DEFINE_XREP_DIRENT_CLASS(xrep_dir_insert_rec);
+
+DECLARE_EVENT_CLASS(xrep_parent_salvage_class,
+	TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino),
+	TP_ARGS(dp, ino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, dir_ino)
+		__field(xfs_ino_t, ino)
+	),
+	TP_fast_assign(
+		__entry->dev = dp->i_mount->m_super->s_dev;
+		__entry->dir_ino = dp->i_ino;
+		__entry->ino = ino;
+	),
+	TP_printk("dev %d:%d dir 0x%llx parent 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dir_ino,
+		  __entry->ino)
+)
+#define DEFINE_XREP_PARENT_SALVAGE_CLASS(name) \
+DEFINE_EVENT(xrep_parent_salvage_class, name, \
+	TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino), \
+	TP_ARGS(dp, ino))
+DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_dir_salvaged_parent);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d81c864207bb..ce55dde40d9d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -16,6 +16,7 @@
 #include "xfs_inode.h"
 #include "xfs_dir2.h"
 #include "xfs_attr.h"
+#include "xfs_bit.h"
 #include "xfs_trans_space.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
@@ -1592,6 +1593,51 @@ xfs_release(
 	return error;
 }
 
+/*
+ * Mark all the buffers attached to this directory stale.  In theory we should
+ * never be freeing a directory with any blocks at all, but this covers the
+ * case where we've recovered a directory swap with a "temporary" directory
+ * created by online repair and now need to dump it.
+ */
+STATIC void
+xfs_inactive_dir(
+	struct xfs_inode	*dp)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_mount	*mp = dp->i_mount;
+	struct xfs_da_geometry	*geo = mp->m_dir_geo;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(dp, XFS_DATA_FORK);
+	xfs_fileoff_t		off;
+
+	/*
+	 * Invalidate each directory block.  All directory blocks are of
+	 * fsbcount length and alignment, so we only need to walk those same
+	 * offsets.  We hold the only reference to this inode, so we must wait
+	 * for the buffer locks.
+	 */
+	for_each_xfs_iext(ifp, &icur, &got) {
+		for (off = round_up(got.br_startoff, geo->fsbcount);
+		     off < got.br_startoff + got.br_blockcount;
+		     off += geo->fsbcount) {
+			struct xfs_buf	*bp = NULL;
+			xfs_fsblock_t	fsbno;
+			int		error;
+
+			fsbno = (off - got.br_startoff) + got.br_startblock;
+			error = xfs_buf_incore(mp->m_ddev_targp,
+					XFS_FSB_TO_DADDR(mp, fsbno),
+					XFS_FSB_TO_BB(mp, geo->fsbcount),
+					XBF_BCACHE_SCAN, &bp);
+			if (error)
+				continue;
+
+			xfs_buf_stale(bp);
+			xfs_buf_relse(bp);
+		}
+	}
+}
+
 /*
  * xfs_inactive_truncate
  *
@@ -1887,6 +1933,11 @@ xfs_inactive(
 	if (error)
 		goto out;
 
+	if (S_ISDIR(VFS_I(ip)->i_mode) && ip->i_df.if_nextents > 0) {
+		xfs_inactive_dir(ip);
+		truncate = 1;
+	}
+
 	if (S_ISLNK(VFS_I(ip)->i_mode))
 		error = xfs_inactive_symlink(ip);
 	else if (truncate)


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs: online repair of parent pointers
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of directories Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfs: ask the dentry cache if it knows the parent of a directory Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: online repair of directories Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach the online repair code to fix directory '..' entries (aka
directory parent pointers).  Since this requires us to know how to scan
every dirent in every directory on the filesystem, we can reuse the
parent scanner components to validate (or find!) the correct parent
entry when rebuilding directories too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/dir_repair.c    |   66 +++----
 fs/xfs/scrub/iscan.c         |    7 +
 fs/xfs/scrub/iscan.h         |    3 
 fs/xfs/scrub/parent.c        |    3 
 fs/xfs/scrub/parent.h        |   16 ++
 fs/xfs/scrub/parent_repair.c |  417 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h        |    2 
 fs/xfs/scrub/scrub.c         |    2 
 fs/xfs/scrub/trace.h         |    2 
 10 files changed, 483 insertions(+), 36 deletions(-)
 create mode 100644 fs/xfs/scrub/parent.h
 create mode 100644 fs/xfs/scrub/parent_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 43536f1b351e..5e9ffd9f1583 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -197,6 +197,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   inode_repair.o \
 				   newbt.o \
 				   nlinks_repair.o \
+				   parent_repair.o \
 				   rcbag_btree.o \
 				   rcbag.o \
 				   reap.o \
diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c
index 99d8ce8528c5..e2de2fc24ba0 100644
--- a/fs/xfs/scrub/dir_repair.c
+++ b/fs/xfs/scrub/dir_repair.c
@@ -39,6 +39,7 @@
 #include "scrub/xfblob.h"
 #include "scrub/readdir.h"
 #include "scrub/reap.h"
+#include "scrub/parent.h"
 
 /*
  * Directory Repair
@@ -1010,50 +1011,36 @@ xrep_dir_rebuild_tree(
 }
 
 /*
- * If we're the root of a directory tree, we are our own parent.  If we're an
- * unlinked directory, the parent /won't/ have a link to us.  Set the parent
- * directory to the root for both cases.  Returns NULLFSINO if we don't know
- * what to do.
- */
-static inline xfs_ino_t
-xrep_dir_self_parent(
-	struct xrep_dir		*rd)
-{
-	struct xfs_scrub	*sc = rd->sc;
-
-	if (sc->ip->i_ino == sc->mp->m_sb.sb_rootino)
-		return sc->mp->m_sb.sb_rootino;
-
-	if (VFS_I(sc->ip)->i_nlink == 0)
-		return sc->mp->m_sb.sb_rootino;
-
-	return NULLFSINO;
-}
-
-/*
- * Look up the dotdot entry.  Returns NULLFSINO if we don't know what to do.
- * The next patch will check this more carefully.
+ * Look up the dotdot entry and confirm that it's really the parent.
+ * Returns NULLFSINO if we don't know what to do.
  */
 static inline xfs_ino_t
 xrep_dir_lookup_parent(
 	struct xrep_dir		*rd)
 {
-	return xrep_dotdot_lookup(rd->sc);
+	struct xfs_scrub	*sc = rd->sc;
+	xfs_ino_t		parent_ino;
+	int			error;
+
+	parent_ino = xrep_dotdot_lookup(sc);
+	if (parent_ino == NULLFSINO)
+		return parent_ino;
+
+	error = xrep_parent_confirm(sc, &parent_ino);
+	if (error)
+		return NULLFSINO;
+
+	return parent_ino;
 }
 
-/*
- * Try to find the parent of the directory being repaired.
- *
- * NOTE: This function will someday be augmented by the directory parent repair
- * code, which will know how to check the parent and scan the filesystem if
- * we cannot find anything.  Inode scans will have to be done before we start
- * salvaging directory entries, so we do this now.
- */
+/* Try to find the parent of the directory being repaired. */
 STATIC int
 xrep_dir_find_parent(
 	struct xrep_dir		*rd)
 {
-	rd->parent_ino = xrep_dir_self_parent(rd);
+	int			error;
+
+	rd->parent_ino = xrep_parent_self_reference(rd->sc);
 	if (rd->parent_ino != NULLFSINO)
 		return 0;
 
@@ -1061,6 +1048,19 @@ xrep_dir_find_parent(
 	if (rd->parent_ino != NULLFSINO)
 		return 0;
 
+	/*
+	 * A full filesystem scan is the last resort.  On a busy filesystem,
+	 * the scan can fail with -EBUSY if we cannot grab IOLOCKs.  That means
+	 * that we don't know who the parent is, so we should return to
+	 * userspace.
+	 */
+	error = xrep_parent_scan(rd->sc, &rd->parent_ino);
+	if (error)
+		return error;
+
+	if (rd->parent_ino != NULLFSINO)
+		return 0;
+
 	/* NOTE: A future patch will deal with moving orphans. */
 	return -EFSCORRUPTED;
 }
diff --git a/fs/xfs/scrub/iscan.c b/fs/xfs/scrub/iscan.c
index e3db6a64338b..8cf486dfde19 100644
--- a/fs/xfs/scrub/iscan.c
+++ b/fs/xfs/scrub/iscan.c
@@ -342,8 +342,13 @@ xchk_iscan_iget(
 		 * It's possible that this inode has lost all of its links but
 		 * hasn't yet been inactivated.  If we don't have a transaction
 		 * or it's not writable, flush the inodegc workers and wait.
+		 * Otherwise, we have a dirty transaction in progress and the
+		 * best we can do is to queue the inodegc workers.
 		 */
-		xfs_inodegc_flush(mp);
+		if (!iscan->iget_nowait)
+			xfs_inodegc_flush(mp);
+		else
+			xfs_inodegc_push(mp);
 		return xchk_iscan_iget_retry(mp, iscan, true);
 	}
 
diff --git a/fs/xfs/scrub/iscan.h b/fs/xfs/scrub/iscan.h
index 947176620bc3..f10b71d9cec4 100644
--- a/fs/xfs/scrub/iscan.h
+++ b/fs/xfs/scrub/iscan.h
@@ -32,6 +32,9 @@ struct xchk_iscan {
 
 	/* Wait this many ms to retry an iget. */
 	unsigned int		iget_retry_delay;
+
+	/* True if we cannot allow iget to wait indefinitely. */
+	bool			iget_nowait:1;
 };
 
 /* Set if the scan has been aborted due to some event in the fs. */
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 58d012252015..dfea3102f52f 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -17,6 +17,7 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/readdir.h"
+#include "scrub/parent.h"
 
 /* Set us up to scrub parents. */
 int
@@ -95,7 +96,7 @@ xchk_parent_count_parent_dentries(
  * Try to iolock the parent dir @dp in shared mode and the child dir @sc->ip
  * exclusively.
  */
-STATIC int
+int
 xchk_parent_lock_two_dirs(
 	struct xfs_scrub	*sc,
 	struct xfs_inode	*dp)
diff --git a/fs/xfs/scrub/parent.h b/fs/xfs/scrub/parent.h
new file mode 100644
index 000000000000..e1979f5bb001
--- /dev/null
+++ b/fs/xfs/scrub/parent.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_PARENT_H__
+#define __XFS_SCRUB_PARENT_H__
+
+int xchk_parent_lock_two_dirs(struct xfs_scrub *sc, struct xfs_inode *dp);
+
+int xrep_parent_confirm(struct xfs_scrub *sc, xfs_ino_t *parent_ino);
+int xrep_parent_scan(struct xfs_scrub *sc, xfs_ino_t *parent_ino);
+
+xfs_ino_t xrep_parent_self_reference(struct xfs_scrub *sc);
+
+#endif /* __XFS_SCRUB_PARENT_H__ */
diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c
new file mode 100644
index 000000000000..d275c2129176
--- /dev/null
+++ b/fs/xfs/scrub/parent_repair.c
@@ -0,0 +1,417 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_trans_space.h"
+#include "xfs_health.h"
+#include "xfs_swapext.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/iscan.h"
+#include "scrub/parent.h"
+#include "scrub/readdir.h"
+#include "scrub/tempfile.h"
+
+struct xrep_findparent_info {
+	/* The directory currently being scanned. */
+	struct xfs_inode	*dp;
+
+	/*
+	 * Scrub context.  We're looking for a @dp containing a directory
+	 * entry pointing to sc->ip->i_ino.
+	 */
+	struct xfs_scrub	*sc;
+
+	/*
+	 * Parent that we've found for sc->ip.  If we're scanning the entire
+	 * directory tree, we need this to ensure that we only find /one/
+	 * parent directory.
+	 */
+	xfs_ino_t		found_parent;
+
+	/*
+	 * This is set to true if @found_parent was not observed directly from
+	 * the directory scan but by noticing a change in dotdot entries after
+	 * cycling the sc->ip IOLOCK.
+	 */
+	bool			parent_tentative;
+};
+
+/*
+ * If this directory entry points to the scrub target inode, then the directory
+ * we're scanning is the parent of the scrub target inode.
+ */
+STATIC int
+xrep_findparent_dirent(
+	struct xfs_scrub		*sc,
+	struct xfs_inode		*dp,
+	xfs_dir2_dataptr_t		dapos,
+	const struct xfs_name		*name,
+	xfs_ino_t			ino,
+	void				*priv)
+{
+	struct xrep_findparent_info	*fpi = priv;
+	int				error = 0;
+
+	if (xchk_should_terminate(fpi->sc, &error))
+		return error;
+
+	if (ino != fpi->sc->ip->i_ino)
+		return 0;
+
+	/* Ignore garbage directory entry names. */
+	if (name->len == 0 || !xfs_dir2_namecheck(name->name, name->len))
+		return -EFSCORRUPTED;
+
+	/*
+	 * Ignore dotdot and dot entries -- we're looking for parent -> child
+	 * links only.
+	 */
+	if (name->name[0] == '.' && (name->len == 1 ||
+				     (name->len == 2 && name->name[1] == '.')))
+		return 0;
+
+	/* Uhoh, more than one parent for a dir? */
+	if (fpi->found_parent != NULLFSINO &&
+	    !(fpi->parent_tentative && fpi->found_parent == fpi->dp->i_ino)) {
+		trace_xrep_findparent_dirent(fpi->sc->ip, 0);
+		return -EFSCORRUPTED;
+	}
+
+	/* We found a potential parent; remember this. */
+	trace_xrep_findparent_dirent(fpi->sc->ip, fpi->dp->i_ino);
+	fpi->found_parent = fpi->dp->i_ino;
+	fpi->parent_tentative = false;
+	return 0;
+}
+
+/*
+ * If this is a directory, walk the dirents looking for any that point to the
+ * scrub target inode.
+ */
+STATIC int
+xrep_findparent_walk_directory(
+	struct xrep_findparent_info	*fpi)
+{
+	struct xfs_scrub		*sc = fpi->sc;
+	struct xfs_inode		*dp = fpi->dp;
+	unsigned int			lock_mode;
+	int				error = 0;
+
+	/*
+	 * The inode being scanned cannot be its own parent, nor can any
+	 * temporary directory we created to stage this repair.
+	 */
+	if (dp == sc->ip || dp == sc->tempip)
+		return 0;
+
+	/*
+	 * Similarly, temporary files created to stage a repair cannot be the
+	 * parent of this inode.
+	 */
+	if (xrep_is_tempfile(dp))
+		return 0;
+
+	/* Try to lock dp; if we can, we're ready to scan! */
+	if (!xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) {
+		xfs_ino_t	orig_parent, new_parent;
+
+		/*
+		 * We may have to drop the lock on sc->ip to try to lock dp.
+		 * Therefore, look up the old dotdot entry for sc->ip so that
+		 * we can compare it after we re-lock sc->ip.
+		 */
+		orig_parent = xrep_dotdot_lookup(sc);
+
+		error = xchk_parent_lock_two_dirs(sc, dp);
+		if (error)
+			return error;
+
+		/*
+		 * It is possible that sc->ip got moved elsewhere in the
+		 * directory tree if we dropped sc->ip to grab dp.  Note that
+		 * rename operations replace the dotdot entry without checking
+		 * the old value.
+		 *
+		 * If the dotdot entry was wrong but there really was only one
+		 * parent of sc->ip, then the dotdot entry could now be
+		 * correct.  Record this new parent as a tentative parent and
+		 * keep scanning.  If there are more parents of this directory,
+		 * we must not touch anything.
+		 */
+		new_parent = xrep_dotdot_lookup(sc);
+
+		if (orig_parent != new_parent || VFS_I(sc->ip)->i_nlink == 0) {
+			fpi->found_parent = new_parent;
+			fpi->parent_tentative = true;
+		}
+	}
+
+	/*
+	 * If this directory is known to be sick, we cannot scan it reliably
+	 * and must abort.
+	 */
+	if (xfs_inode_has_sickness(dp, XFS_SICK_INO_CORE |
+				       XFS_SICK_INO_BMBTD |
+				       XFS_SICK_INO_DIR)) {
+		error = -EFSCORRUPTED;
+		goto out_unlock;
+	}
+
+	/*
+	 * Scan the directory to see if it contains an entry pointing to
+	 * the directory that we are repairing.
+	 */
+	lock_mode = xfs_ilock_data_map_shared(dp);
+
+	/*
+	 * We cannot complete our parent pointer scan if a directory looks as
+	 * though it has been zapped by the inode record repair code.
+	 */
+	if (xchk_dir_looks_zapped(dp))
+		error = -EFSCORRUPTED;
+	if (!error)
+		error = xchk_dir_walk(sc, dp, xrep_findparent_dirent, fpi);
+	xfs_iunlock(dp, lock_mode);
+	if (error)
+		goto out_unlock;
+
+out_unlock:
+	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
+	return error;
+}
+
+/*
+ * Confirm that the directory @parent_ino actually contains a directory entry
+ * pointing to the child @sc->ip->i_ino.  This function returns in one of
+ * several ways:
+ *
+ * Returns 0 with @parent_ino unchanged if the parent was confirmed.
+ * Returns 0 with a different @parent_ino if we had to cycle inode locks to
+ * walk the alleged parent and the child's '..' entry was changed in the mean
+ * time.
+ * Returns 0 with @parent_ino set to NULLFSINO if the parent was not valid.
+ * Returns the usual negative errno if something else happened.
+ */
+int
+xrep_parent_confirm(
+	struct xfs_scrub	*sc,
+	xfs_ino_t		*parent_ino)
+{
+	struct xrep_findparent_info fpi = {
+		.sc		= sc,
+		.found_parent	= NULLFSINO,
+	};
+	int			error;
+
+	/*
+	 * The root directory always points to itself.  Unlinked dirs can point
+	 * anywhere, so we point them at the root dir too.
+	 */
+	if (sc->ip == sc->mp->m_rootip || VFS_I(sc->ip)->i_nlink == 0) {
+		*parent_ino = sc->mp->m_sb.sb_rootino;
+		return 0;
+	}
+
+	/* Reject garbage parent inode numbers and self-referential parents. */
+	if (*parent_ino == NULLFSINO)
+		return 0;
+	if (!xfs_verify_dir_ino(sc->mp, *parent_ino) ||
+	    *parent_ino == sc->ip->i_ino) {
+		*parent_ino = NULLFSINO;
+		return 0;
+	}
+
+	error = xchk_iget(sc, *parent_ino, &fpi.dp);
+	if (error)
+		return error;
+
+	if (!S_ISDIR(VFS_I(fpi.dp)->i_mode)) {
+		*parent_ino = NULLFSINO;
+		goto out_rele;
+	}
+
+	error = xrep_findparent_walk_directory(&fpi);
+	if (error)
+		goto out_rele;
+
+	*parent_ino = fpi.found_parent;
+out_rele:
+	xchk_irele(sc, fpi.dp);
+	return error;
+}
+
+/*
+ * Scan the entire filesystem looking for a parent inode for the inode being
+ * scrubbed.  @sc->ip must not be the root of a directory tree.
+ *
+ * Returns 0 with @parent_ino set to the parent that we found, or the current
+ * value of the child's '..' entry, if it changed when we had to drop the
+ * child's IOLOCK.
+ * Returns 0 with @parent_ino set to NULLFSINO if we didn't find anything.
+ * Returns the usual negative errno if something else happened.
+ */
+int
+xrep_parent_scan(
+	struct xfs_scrub		*sc,
+	xfs_ino_t			*parent_ino)
+{
+	struct xrep_findparent_info	fpi = {
+		.sc			= sc,
+		.found_parent		= NULLFSINO,
+	};
+	struct xchk_iscan		iscan = { };
+	int				ret;
+
+	/*
+	 * The caller holds a non-empty transaction and a directory ILOCK.
+	 * Hence we cannot block the system indefinitely in iget, so we will
+	 * retry rapidly for up to five seconds before aborting the operation.
+	 */
+	iscan.iget_nowait = true;
+	xchk_iscan_start(&iscan, 5000, 1);
+
+	while ((ret = xchk_iscan_iter(sc, &iscan, &fpi.dp)) == 1) {
+		if (S_ISDIR(VFS_I(fpi.dp)->i_mode))
+			ret = xrep_findparent_walk_directory(&fpi);
+		else
+			ret = 0;
+		xchk_iscan_mark_visited(&iscan, fpi.dp);
+		xchk_irele(sc, fpi.dp);
+		if (ret)
+			break;
+
+		if (xchk_should_terminate(sc, &ret))
+			break;
+	}
+	xchk_iscan_finish(&iscan);
+	if (ret)
+		return ret;
+
+	*parent_ino = fpi.found_parent;
+	return 0;
+}
+
+/*
+ * If we're the root of a directory tree, we are our own parent.  If we're an
+ * unlinked directory, the parent /won't/ have a link to us.  Set the parent
+ * directory to the root for both cases.  Returns NULLFSINO if we don't know
+ * what to do.
+ */
+xfs_ino_t
+xrep_parent_self_reference(
+	struct xfs_scrub	*sc)
+{
+	if (sc->ip->i_ino == sc->mp->m_sb.sb_rootino)
+		return sc->mp->m_sb.sb_rootino;
+
+	if (VFS_I(sc->ip)->i_nlink == 0)
+		return sc->mp->m_sb.sb_rootino;
+
+	return NULLFSINO;
+}
+
+/*
+ * Repairing The Directory Parent Pointer
+ * ======================================
+ *
+ * Currently, only directories support parent pointers (in the form of '..'
+ * entries), so we simply scan the filesystem and update the '..' entry.
+ *
+ * Note that because the only parent pointer is the dotdot entry, we won't
+ * touch an unhealthy directory, since the directory repair code is perfectly
+ * capable of rebuilding a directory with the proper parent inode.
+ */
+
+/* Replace a directory's parent '..' pointer. */
+STATIC int
+xrep_parent_reset_dir(
+	struct xfs_scrub	*sc,
+	xfs_ino_t		parent_ino)
+{
+	unsigned int		spaceres;
+	int			error;
+
+	trace_xrep_parent_reset_dir(sc->ip, parent_ino);
+
+	/*
+	 * Reserve more space just in case we have to expand the dir.  We're
+	 * allowed to exceed quota to repair inconsistent metadata.
+	 */
+	spaceres = XFS_RENAME_SPACE_RES(sc->mp, 2);
+	error = xfs_trans_reserve_more_inode(sc->tp, sc->ip, spaceres, 0,
+			true);
+	if (error)
+		return error;
+
+	/* Replace the dotdot entry. */
+	return xfs_dir_replace(sc->tp, sc->ip, &xfs_name_dotdot, parent_ino,
+			spaceres);
+}
+
+int
+xrep_parent(
+	struct xfs_scrub	*sc)
+{
+	xfs_ino_t		parent_ino, curr_parent;
+	unsigned int		sick, checked;
+	int			error;
+
+	/*
+	 * Avoid sick directories.  The parent pointer scrubber dropped the
+	 * ILOCK and MMAPLOCK, but we still hold IOLOCK_EXCL on the directory.
+	 * There shouldn't be anyone else clearing the directory's sick status.
+	 */
+	xfs_inode_measure_sickness(sc->ip, &sick, &checked);
+	if (sick & XFS_SICK_INO_DIR)
+		return -EFSCORRUPTED;
+
+	parent_ino = xrep_parent_self_reference(sc);
+	if (parent_ino != NULLFSINO)
+		goto reset_parent;
+
+	/* Scan the entire filesystem for a parent. */
+	error = xrep_parent_scan(sc, &parent_ino);
+	if (error)
+		return error;
+	if (parent_ino == NULLFSINO)
+		return -EFSCORRUPTED;
+
+reset_parent:
+	/* If the '..' entry is already set to the parent inode, we're done. */
+	curr_parent = xrep_dotdot_lookup(sc);
+	if (curr_parent != NULLFSINO && curr_parent == parent_ino)
+		return 0;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	/* Re-take the ILOCK, we're going to need it to modify the dir. */
+	xchk_ilock(sc, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	return xrep_parent_reset_dir(sc, parent_ino);
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 5fccc9c81d8f..acd7fccf8bee 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -122,6 +122,7 @@ int xrep_nlinks(struct xfs_scrub *sc);
 int xrep_fscounters(struct xfs_scrub *sc);
 int xrep_xattr(struct xfs_scrub *sc);
 int xrep_directory(struct xfs_scrub *sc);
+int xrep_parent(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -234,6 +235,7 @@ xrep_setup_rtsummary(
 #define xrep_rtsummary			xrep_notsupported
 #define xrep_xattr			xrep_notsupported
 #define xrep_directory			xrep_notsupported
+#define xrep_parent			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 1695e9d2f104..39ad06e6b2d0 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -345,7 +345,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_parent,
 		.scrub	= xchk_parent,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_parent,
 	},
 	[XFS_SCRUB_TYPE_RTBITMAP] = {	/* realtime bitmap */
 		.type	= ST_FS,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index b35b7d5a3767..b27abaa84d11 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2476,6 +2476,7 @@ DEFINE_EVENT(xrep_dir_class, name, \
 	TP_ARGS(dp, parent_ino))
 DEFINE_XREP_DIR_CLASS(xrep_dir_rebuild_tree);
 DEFINE_XREP_DIR_CLASS(xrep_dir_reset_fork);
+DEFINE_XREP_DIR_CLASS(xrep_parent_reset_dir);
 
 DECLARE_EVENT_CLASS(xrep_dirent_class,
 	TP_PROTO(struct xfs_inode *dp, struct xfs_name *name, xfs_ino_t ino),
@@ -2533,6 +2534,7 @@ DEFINE_EVENT(xrep_parent_salvage_class, name, \
 	TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino), \
 	TP_ARGS(dp, ino))
 DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_dir_salvaged_parent);
+DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_findparent_dirent);
 
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 



* [PATCH 3/3] xfs: ask the dentry cache if it knows the parent of a directory
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of directories Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: online repair of parent pointers Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: online repair of directories Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

It's possible that the dentry cache can tell us the parent of a
directory.  Therefore, when repairing directory dot dot entries, query
the dcache as a last resort before scanning the entire filesystem.
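
As a rough illustration (not the code in this patch, and the helper
name below is made up), the dcache query amounts to walking from any
cached dentry for the child inode to its parent and sanity-checking
whatever comes back:

	/* Illustrative sketch: ask the dcache who our parent might be. */
	static struct inode *
	sketch_dcache_parent(struct inode *child)
	{
		struct dentry	*alias, *parent;
		struct inode	*pinode = NULL;

		alias = d_find_alias(child);	/* any cached name for child */
		if (!alias)
			return NULL;		/* the dcache has no opinion */

		parent = dget_parent(alias);	/* pinned parent dentry */
		if (d_inode(parent) && S_ISDIR(d_inode(parent)->i_mode))
			pinode = igrab(d_inode(parent)); /* may still be NULL */
		dput(parent);
		dput(alias);

		return pinode;			/* caller must iput() this */
	}

Whatever the dcache hands back is only a hint; the repair code below
still confirms the answer against the ondisk directory before using it.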

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/dir_repair.c    |   27 ++++++++++++++++++++++++++
 fs/xfs/scrub/parent.h        |    1 +
 fs/xfs/scrub/parent_repair.c |   44 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/trace.h         |    1 +
 4 files changed, 73 insertions(+)


diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c
index e2de2fc24ba0..871b14c09e86 100644
--- a/fs/xfs/scrub/dir_repair.c
+++ b/fs/xfs/scrub/dir_repair.c
@@ -1033,6 +1033,29 @@ xrep_dir_lookup_parent(
 	return parent_ino;
 }
 
+/*
+ * Look up '..' in the dentry cache and confirm that it's really the parent.
+ * Returns NULLFSINO if the dcache misses or if the hit is implausible.
+ */
+static inline xfs_ino_t
+xrep_dir_dcache_parent(
+	struct xrep_dir		*rd)
+{
+	struct xfs_scrub	*sc = rd->sc;
+	xfs_ino_t		parent_ino;
+	int			error;
+
+	parent_ino = xrep_parent_from_dcache(sc);
+	if (parent_ino == NULLFSINO)
+		return parent_ino;
+
+	error = xrep_parent_confirm(sc, &parent_ino);
+	if (error)
+		return NULLFSINO;
+
+	return parent_ino;
+}
+
 /* Try to find the parent of the directory being repaired. */
 STATIC int
 xrep_dir_find_parent(
@@ -1044,6 +1067,10 @@ xrep_dir_find_parent(
 	if (rd->parent_ino != NULLFSINO)
 		return 0;
 
+	rd->parent_ino = xrep_dir_dcache_parent(rd);
+	if (rd->parent_ino != NULLFSINO)
+		return 0;
+
 	rd->parent_ino = xrep_dir_lookup_parent(rd);
 	if (rd->parent_ino != NULLFSINO)
 		return 0;
diff --git a/fs/xfs/scrub/parent.h b/fs/xfs/scrub/parent.h
index e1979f5bb001..c20673d8f093 100644
--- a/fs/xfs/scrub/parent.h
+++ b/fs/xfs/scrub/parent.h
@@ -10,6 +10,7 @@ int xchk_parent_lock_two_dirs(struct xfs_scrub *sc, struct xfs_inode *dp);
 
 int xrep_parent_confirm(struct xfs_scrub *sc, xfs_ino_t *parent_ino);
 int xrep_parent_scan(struct xfs_scrub *sc, xfs_ino_t *parent_ino);
+xfs_ino_t xrep_parent_from_dcache(struct xfs_scrub *sc);
 
 xfs_ino_t xrep_parent_self_reference(struct xfs_scrub *sc);
 
diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c
index d275c2129176..d83948d1fd05 100644
--- a/fs/xfs/scrub/parent_repair.c
+++ b/fs/xfs/scrub/parent_repair.c
@@ -263,6 +263,44 @@ xrep_parent_confirm(
 	return error;
 }
 
+/* Ask the dentry cache if it knows of a parent for the scrub target. */
+xfs_ino_t
+xrep_parent_from_dcache(
+	struct xfs_scrub	*sc)
+{
+	struct inode		*pip = NULL;
+	struct dentry		*dentry, *parent;
+	xfs_ino_t		ret = NULLFSINO;
+
+	dentry = d_find_alias(VFS_I(sc->ip));
+	if (!dentry)
+		goto out;
+
+	parent = dget_parent(dentry);
+	if (!parent)
+		goto out_dput;
+
+	if (parent->d_sb != sc->ip->i_mount->m_super) {
+		dput(parent);
+		goto out_dput;
+	}
+
+	pip = igrab(d_inode(parent));
+	dput(parent);
+	if (!pip)
+		goto out_dput;
+
+	if (S_ISDIR(pip->i_mode)) {
+		trace_xrep_findparent_from_dcache(sc->ip, XFS_I(pip)->i_ino);
+		ret = XFS_I(pip)->i_ino;
+	}
+
+	xchk_irele(sc, XFS_I(pip));
+
+out_dput:
+	dput(dentry);
+out:
+	return ret;
+}
+
 /*
  * Scan the entire filesystem looking for a parent inode for the inode being
  * scrubbed.  @sc->ip must not be the root of a directory tree.
@@ -392,6 +430,12 @@ xrep_parent(
 	if (parent_ino != NULLFSINO)
 		goto reset_parent;
 
+	/* Does the VFS dcache have an answer for us? */
+	parent_ino = xrep_parent_from_dcache(sc);
+	error = xrep_parent_confirm(sc, &parent_ino);
+	if (!error && parent_ino != NULLFSINO)
+		goto reset_parent;
+
 	/* Scan the entire filesystem for a parent. */
 	error = xrep_parent_scan(sc, &parent_ino);
 	if (error)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index b27abaa84d11..d8223ec24369 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2535,6 +2535,7 @@ DEFINE_EVENT(xrep_parent_salvage_class, name, \
 	TP_ARGS(dp, ino))
 DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_dir_salvaged_parent);
 DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_findparent_dirent);
+DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_findparent_from_dcache);
 
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 



* [PATCHSET v24.0 0/3] xfs: move orphan files to lost and found
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (29 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of directories Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: move orphan files to the orphanage Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of symbolic links Darrick J. Wong
                   ` (46 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Orphaned files are defined to be files with nonzero ondisk link count
but no observable parent directory.  This series enables online repair
to reparent orphaned files into the filesystem directory tree, and wires
up this reparenting ability into the directory, file link count, and
parent pointer repair functions.  This is how we fix files with positive
link count that are not reachable through the directory tree.
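
In rough terms (an illustrative sketch, not the exact helper that the
nlinks repair patch adds), the orphan test that the link count scan
applies is:

	/* Sketch: does this inode need to be adopted by lost+found? */
	static inline bool
	sketch_is_orphan(struct xfs_inode *ip, unsigned int observed_parents,
			 struct xfs_inode *rootip, struct xfs_inode *orphanage)
	{
		/* The root and the orphanage never need reparenting. */
		if (ip == rootip || ip == orphanage)
			return false;
		/* Somebody points at us, so we're reachable. */
		if (observed_parents != 0)
			return false;
		/* Positive link count with no parents: orphaned. */
		return VFS_I(ip)->i_nlink != 0;
	}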

This series will also create the orphanage directory (lost+found) if it
is not present.  In contrast to xfs_repair, we follow e2fsck in creating
the lost+found without group or other-owner access to avoid accidental
disclosure of files that were previously hidden by a 0700 directory.
That's silly security, but people have been known to do it.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/dir_repair.c    |  101 ++++++++
 fs/xfs/scrub/nlinks.c        |   11 +
 fs/xfs/scrub/nlinks.h        |    6 +
 fs/xfs/scrub/nlinks_repair.c |  250 ++++++++++++++++++++-
 fs/xfs/scrub/orphanage.c     |  504 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/orphanage.h     |   79 +++++++
 fs/xfs/scrub/parent.c        |   10 +
 fs/xfs/scrub/parent_repair.c |   92 ++++++++
 fs/xfs/scrub/repair.h        |    4 
 fs/xfs/scrub/scrub.c         |    2 
 fs/xfs/scrub/scrub.h         |    4 
 fs/xfs/scrub/trace.c         |    1 
 fs/xfs/scrub/trace.h         |   59 +++++
 fs/xfs/xfs_inode.c           |    6 -
 fs/xfs/xfs_inode.h           |    1 
 16 files changed, 1119 insertions(+), 12 deletions(-)
 create mode 100644 fs/xfs/scrub/orphanage.c
 create mode 100644 fs/xfs/scrub/orphanage.h



* [PATCH 1/3] xfs: move orphan files to the orphanage
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: move orphan files to lost and found Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: move files to orphanage instead of letting nlinks drop to zero Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfs: ensure dentry consistency when the orphanage adopts a file Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we can't find a parent for a file, move it to the orphanage.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/dir_repair.c    |  101 ++++++++++
 fs/xfs/scrub/orphanage.c     |  414 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/orphanage.h     |   79 ++++++++
 fs/xfs/scrub/parent.c        |   10 +
 fs/xfs/scrub/parent_repair.c |   92 +++++++++
 fs/xfs/scrub/repair.h        |    2 
 fs/xfs/scrub/scrub.c         |    2 
 fs/xfs/scrub/scrub.h         |    4 
 fs/xfs/scrub/trace.h         |    1 
 fs/xfs/xfs_inode.c           |    6 -
 fs/xfs/xfs_inode.h           |    1 
 12 files changed, 707 insertions(+), 6 deletions(-)
 create mode 100644 fs/xfs/scrub/orphanage.c
 create mode 100644 fs/xfs/scrub/orphanage.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5e9ffd9f1583..6d6ca775553f 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -197,6 +197,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   inode_repair.o \
 				   newbt.o \
 				   nlinks_repair.o \
+				   orphanage.o \
 				   parent_repair.o \
 				   rcbag_btree.o \
 				   rcbag.o \
diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c
index 871b14c09e86..7530819e1435 100644
--- a/fs/xfs/scrub/dir_repair.c
+++ b/fs/xfs/scrub/dir_repair.c
@@ -40,6 +40,7 @@
 #include "scrub/readdir.h"
 #include "scrub/reap.h"
 #include "scrub/parent.h"
+#include "scrub/orphanage.h"
 
 /*
  * Directory Repair
@@ -91,9 +92,15 @@ struct xrep_dir {
 	/* nlink value of the corrected directory. */
 	xfs_nlink_t		new_nlink;
 
+	/* Should we move this directory to the orphanage? */
+	bool			move_orphanage;
+
 	/* Preallocated args struct for performing dir operations */
 	struct xfs_da_args	args;
 
+	/* Orphanage reparenting request. */
+	struct xrep_orphanage_req adoption;
+
 	/* Directory entry name, plus the trailing null. */
 	char			namebuf[MAXNAMELEN];
 };
@@ -108,6 +115,10 @@ xrep_setup_directory(
 {
 	int			error;
 
+	error = xrep_orphanage_try_create(sc);
+	if (error)
+		return error;
+
 	error = xrep_tempfile_create(sc, S_IFDIR);
 	if (error)
 		return error;
@@ -1088,8 +1099,76 @@ xrep_dir_find_parent(
 	if (rd->parent_ino != NULLFSINO)
 		return 0;
 
-	/* NOTE: A future patch will deal with moving orphans. */
-	return -EFSCORRUPTED;
+	/*
+	 * Temporarily assign the root dir as the parent; we'll move this to
+	 * the orphanage after swapping the dir contents.
+	 */
+	rd->move_orphanage = true;
+	rd->parent_ino = rd->sc->mp->m_sb.sb_rootino;
+	return 0;
+}
+
+/*
+ * Move the current file to the orphanage.
+ *
+ * Caller must hold IOLOCK_EXCL on @sc->ip, and no other inode locks.  Upon
+ * successful return, the scrub transaction will have enough extra reservation
+ * to make the move; it will hold IOLOCK_EXCL and ILOCK_EXCL of @sc->ip and the
+ * orphanage; and both inodes will be ijoined.
+ */
+STATIC int
+xrep_dir_move_to_orphanage(
+	struct xrep_dir		*rd)
+{
+	struct xfs_scrub	*sc = rd->sc;
+	int			error;
+
+	/* No orphanage?  We can't fix this. */
+	if (!sc->orphanage)
+		return -EFSCORRUPTED;
+
+	/* If we can take the orphanage's iolock then we're ready to move. */
+	if (!xrep_orphanage_ilock_nowait(sc, XFS_IOLOCK_EXCL)) {
+		xfs_ino_t	orig_parent, new_parent;
+
+		/*
+		 * We may have to drop the lock on sc->ip to try to lock the
+		 * orphanage.  Therefore, look up the old dotdot entry for
+		 * sc->ip so that we can compare it after we re-lock sc->ip.
+		 */
+		orig_parent = xrep_dotdot_lookup(sc);
+
+		xchk_iunlock(sc, sc->ilock_flags);
+		error = xrep_orphanage_iolock_two(sc);
+		if (error)
+			return error;
+
+		/*
+		 * If the parent changed or the child was unlinked while the
+		 * child directory was unlocked, we don't need to move the
+		 * child to the orphanage after all.
+		 */
+		new_parent = xrep_dotdot_lookup(sc);
+
+		if (orig_parent != new_parent || VFS_I(sc->ip)->i_nlink == 0)
+			return 0;
+	}
+
+	/*
+	 * Move the directory to the orphanage, and let scrub teardown unlock
+	 * everything for us.
+	 */
+	xrep_orphanage_compute_blkres(sc, &rd->adoption);
+
+	error = xrep_orphanage_compute_name(&rd->adoption, rd->namebuf);
+	if (error)
+		return error;
+
+	error = xrep_orphanage_adoption_prep(&rd->adoption);
+	if (error)
+		return error;
+
+	return xrep_orphanage_adopt(&rd->adoption);
 }
 
 /*
@@ -1167,6 +1246,24 @@ xrep_directory(
 
 	/* Swap in the good contents. */
 	error = xrep_dir_rebuild_tree(rd);
+	if (error || !rd->move_orphanage)
+		goto out_rd;
+
+	/*
+	 * We hold ILOCK_EXCL on both the directory and the tempdir after a
+	 * successful rebuild.  Before we can move the directory to the
+	 * orphanage, we must roll to a clean unjoined transaction and drop the
+	 * ILOCKs on the dir and the temp dir.  We still hold IOLOCK_EXCL on
+	 * the dir, so nobody will be able to access it in the meantime.
+	 */
+	error = xfs_trans_roll(&sc->tp);
+	if (error)
+		goto out_rd;
+
+	xchk_iunlock(sc, XFS_ILOCK_EXCL);
+	xrep_tempfile_iunlock(sc);
+
+	error = xrep_dir_move_to_orphanage(rd);
 
 out_names:
 	if (rd->dir_names)
diff --git a/fs/xfs/scrub/orphanage.c b/fs/xfs/scrub/orphanage.c
new file mode 100644
index 000000000000..1fe7935433bf
--- /dev/null
+++ b/fs/xfs/scrub/orphanage.c
@@ -0,0 +1,414 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_ialloc.h"
+#include "xfs_quota.h"
+#include "xfs_trans_space.h"
+#include "xfs_dir2.h"
+#include "xfs_icache.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/repair.h"
+#include "scrub/trace.h"
+#include "scrub/orphanage.h"
+
+#include <linux/namei.h>
+
+/* Make the orphanage owned by root. */
+STATIC int
+xrep_chown_orphanage(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp)
+{
+	struct xfs_trans	*tp;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_dquot	*udqp = NULL, *gdqp = NULL, *pdqp = NULL;
+	struct xfs_dquot	*oldu = NULL, *oldg = NULL, *oldp = NULL;
+	struct inode		*inode = VFS_I(dp);
+	int			error;
+
+	error = xfs_qm_vop_dqalloc(dp, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
+			XFS_QMOPT_QUOTALL, &udqp, &gdqp, &pdqp);
+	if (error)
+		return error;
+
+	error = xfs_trans_alloc_ichange(dp, udqp, gdqp, pdqp, true, &tp);
+	if (error)
+		goto out_dqrele;
+
+	/*
+	 * Always clear setuid/setgid on the orphanage since we don't normally
+	 * want that functionality on this directory and xfs_repair doesn't
+	 * create it this way either.  Leave the other access bits unchanged.
+	 */
+	inode->i_mode &= ~(S_ISUID | S_ISGID);
+
+	/*
+	 * Change the ownerships and register quota modifications
+	 * in the transaction.
+	 */
+	if (!uid_eq(inode->i_uid, GLOBAL_ROOT_UID)) {
+		if (XFS_IS_UQUOTA_ON(mp))
+			oldu = xfs_qm_vop_chown(tp, dp, &dp->i_udquot, udqp);
+		inode->i_uid = GLOBAL_ROOT_UID;
+	}
+	if (!gid_eq(inode->i_gid, GLOBAL_ROOT_GID)) {
+		if (XFS_IS_GQUOTA_ON(mp))
+			oldg = xfs_qm_vop_chown(tp, dp, &dp->i_gdquot, gdqp);
+		inode->i_gid = GLOBAL_ROOT_GID;
+	}
+	if (dp->i_projid != 0) {
+		if (XFS_IS_PQUOTA_ON(mp))
+			oldp = xfs_qm_vop_chown(tp, dp, &dp->i_pdquot, pdqp);
+		dp->i_projid = 0;
+	}
+
+	dp->i_diflags &= ~(XFS_DIFLAG_REALTIME | XFS_DIFLAG_RTINHERIT);
+	xfs_trans_log_inode(tp, dp, XFS_ILOG_CORE);
+
+	XFS_STATS_INC(mp, xs_ig_attrchg);
+
+	if (xfs_has_wsync(mp))
+		xfs_trans_set_sync(tp);
+	error = xfs_trans_commit(tp);
+
+	xfs_qm_dqrele(oldu);
+	xfs_qm_dqrele(oldg);
+	xfs_qm_dqrele(oldp);
+
+out_dqrele:
+	xfs_qm_dqrele(udqp);
+	xfs_qm_dqrele(gdqp);
+	xfs_qm_dqrele(pdqp);
+	return error;
+}
+
+#define ORPHANAGE	"lost+found"
+
+/* Create the orphanage directory, and set sc->orphanage to it. */
+int
+xrep_orphanage_create(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct dentry		*root_dentry, *orphanage_dentry;
+	struct inode		*root_inode = VFS_I(sc->mp->m_rootip);
+	struct inode		*orphanage_inode;
+	int			error;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+	if (xfs_is_readonly(mp)) {
+		sc->orphanage = NULL;
+		return 0;
+	}
+
+	ASSERT(sc->tp == NULL);
+	ASSERT(sc->orphanage == NULL);
+
+	/* Find the dentry for the root directory... */
+	root_dentry = d_find_alias(root_inode);
+	if (!root_dentry) {
+		error = -EFSCORRUPTED;
+		goto out;
+	}
+
+	/* ...which is a directory, right? */
+	if (!d_is_dir(root_dentry)) {
+		error = -EFSCORRUPTED;
+		goto out_dput_root;
+	}
+
+	/* Try to find the orphanage directory. */
+	inode_lock_nested(root_inode, I_MUTEX_PARENT);
+	orphanage_dentry = lookup_one_len(ORPHANAGE, root_dentry,
+			strlen(ORPHANAGE));
+	if (IS_ERR(orphanage_dentry)) {
+		error = PTR_ERR(orphanage_dentry);
+		goto out_unlock_root;
+	}
+
+	/*
+	 * Nothing found?  Call mkdir to create the orphanage.  Create the
+	 * directory without group or other-user access because we're live and
+	 * someone could have been relying partly on minimal access to a parent
+	 * directory to control access to a file we put in here.
+	 */
+	if (d_really_is_negative(orphanage_dentry)) {
+		error = vfs_mkdir(&init_user_ns, root_inode, orphanage_dentry,
+				0700);
+		if (error)
+			goto out_dput_orphanage;
+	}
+
+	/* Not a directory? Bail out. */
+	if (!d_is_dir(orphanage_dentry)) {
+		error = -ENOTDIR;
+		goto out_dput_orphanage;
+	}
+
+	/*
+	 * Grab a reference to the orphanage.  This /should/ succeed since
+	 * we hold the root directory locked and therefore nobody can delete
+	 * the orphanage.
+	 */
+	orphanage_inode = igrab(d_inode(orphanage_dentry));
+	if (!orphanage_inode) {
+		error = -ENOENT;
+		goto out_dput_orphanage;
+	}
+
+	/* Make sure the orphanage is owned by root. */
+	error = xrep_chown_orphanage(sc, XFS_I(orphanage_inode));
+	if (error)
+		goto out_dput_orphanage;
+
+	/* Stash the reference for later and bail out. */
+	sc->orphanage = XFS_I(orphanage_inode);
+	sc->orphanage_ilock_flags = 0;
+
+out_dput_orphanage:
+	dput(orphanage_dentry);
+out_unlock_root:
+	inode_unlock(VFS_I(sc->mp->m_rootip));
+out_dput_root:
+	dput(root_dentry);
+out:
+	return error;
+}
+
+void
+xrep_orphanage_ilock(
+	struct xfs_scrub	*sc,
+	unsigned int		ilock_flags)
+{
+	sc->orphanage_ilock_flags |= ilock_flags;
+	xfs_ilock(sc->orphanage, ilock_flags);
+}
+
+bool
+xrep_orphanage_ilock_nowait(
+	struct xfs_scrub	*sc,
+	unsigned int		ilock_flags)
+{
+	if (xfs_ilock_nowait(sc->orphanage, ilock_flags)) {
+		sc->orphanage_ilock_flags |= ilock_flags;
+		return true;
+	}
+
+	return false;
+}
+
+void
+xrep_orphanage_iunlock(
+	struct xfs_scrub	*sc,
+	unsigned int		ilock_flags)
+{
+	xfs_iunlock(sc->orphanage, ilock_flags);
+	sc->orphanage_ilock_flags &= ~ilock_flags;
+}
+
+/* Grab the IOLOCK of the orphanage and sc->ip. */
+int
+xrep_orphanage_iolock_two(
+	struct xfs_scrub	*sc)
+{
+	int			error = 0;
+
+	while (true) {
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		/*
+		 * Normal XFS takes the IOLOCK before grabbing a transaction.
+		 * Scrub holds a transaction, which means that we can't block
+		 * on either IOLOCK.
+		 */
+		if (xrep_orphanage_ilock_nowait(sc, XFS_IOLOCK_EXCL)) {
+			if (xchk_ilock_nowait(sc, XFS_IOLOCK_EXCL))
+				break;
+			xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL);
+		}
+		delay(1);
+	}
+
+	return 0;
+}
+
+/* Compute block reservation needed to add sc->ip to the orphanage. */
+void
+xrep_orphanage_compute_blkres(
+	struct xfs_scrub		*sc,
+	struct xrep_orphanage_req	*orph)
+{
+	struct xfs_mount		*mp = sc->mp;
+	bool				isdir = S_ISDIR(VFS_I(sc->ip)->i_mode);
+
+	orph->sc = sc;
+	orph->orphanage_blkres = XFS_LINK_SPACE_RES(mp, MAXNAMELEN);
+	orph->child_blkres = isdir ? XFS_RENAME_SPACE_RES(mp, 2) : 0;
+}
+
+/*
+ * Compute the xfs_name for the directory entry that we're adding to the
+ * orphanage.  Caller must have the IOLOCK of the orphanage and sc->ip.
+ */
+int
+xrep_orphanage_compute_name(
+	struct xrep_orphanage_req	*orph,
+	unsigned char			*namebuf)
+{
+	struct xfs_name			*xname = &orph->xname;
+	struct xfs_scrub		*sc = orph->sc;
+	xfs_ino_t			ino;
+	unsigned int			incr = 0;
+	int				error = 0;
+
+	xname->name = namebuf;
+	xname->len = snprintf(namebuf, MAXNAMELEN, "%llu", sc->ip->i_ino);
+	xname->type = xfs_mode_to_ftype(VFS_I(sc->ip)->i_mode);
+
+	/* Make sure the filename is unique in the lost+found. */
+	error = xfs_dir_lookup(sc->tp, sc->orphanage, xname, &ino, NULL);
+	while (error == 0 && incr < 10000) {
+		xname->len = snprintf(namebuf, MAXNAMELEN, "%llu.%u",
+				sc->ip->i_ino, ++incr);
+		error = xfs_dir_lookup(sc->tp, sc->orphanage, xname, &ino,
+				NULL);
+	}
+	if (error == 0) {
+		/* We already have 10,000 entries in the orphanage? */
+		return -EFSCORRUPTED;
+	}
+
+	if (error != -ENOENT)
+		return error;
+	return 0;
+}
+
+/*
+ * Prepare to send a child to the orphanage.
+ *
+ * Reserve more space in the transaction, take the ILOCKs of the orphanage and
+ * sc->ip, join them to the transaction, and reserve quota to reparent the
+ * latter.
+ */
+int
+xrep_orphanage_adoption_prep(
+	struct xrep_orphanage_req	*orph)
+{
+	struct xfs_scrub		*sc = orph->sc;
+	int				error;
+
+	/*
+	 * Reserve space to the transaction to handle expansion of both the
+	 * orphanage and the child directory.
+	 */
+	error = xfs_trans_reserve_more(sc->tp,
+			orph->orphanage_blkres + orph->child_blkres, 0);
+	if (error)
+		return error;
+
+	xfs_lock_two_inodes(sc->orphanage, XFS_ILOCK_EXCL,
+			    sc->ip, XFS_ILOCK_EXCL);
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	sc->orphanage_ilock_flags |= XFS_ILOCK_EXCL;
+
+	xfs_trans_ijoin(sc->tp, sc->orphanage, 0);
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/*
+	 * Reserve enough quota in the orphan directory to add the new name.
+	 * Normally the orphanage should have user/group/project ids of zero
+	 * and hence is not subject to quota enforcement, but we're allowed to
+	 * exceed quota to reattach disconnected parts of the directory tree.
+	 */
+	error = xfs_trans_reserve_quota_nblks(sc->tp, sc->orphanage,
+			orph->orphanage_blkres, 0, true);
+	if (error)
+		return error;
+
+	/*
+	 * Reserve enough quota in the child directory to change dotdot.
+	 * Here we're also allowed to exceed file quota to repair inconsistent
+	 * metadata.
+	 */
+	if (orph->child_blkres) {
+		error = xfs_trans_reserve_quota_nblks(sc->tp, sc->ip,
+				orph->child_blkres, 0, true);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Move the current file to the orphanage.
+ *
+ * The caller must hold the IOLOCKs and the ILOCKs for both sc->ip and the
+ * orphanage.  The directory entry name must have been computed, and quota
+ * reserved.  The function returns with both inodes joined and ILOCKed to the
+ * transaction.
+ */
+int
+xrep_orphanage_adopt(
+	struct xrep_orphanage_req	*orph)
+{
+	struct xfs_scrub		*sc = orph->sc;
+	struct xfs_name			*xname = &orph->xname;
+	bool				isdir = S_ISDIR(VFS_I(sc->ip)->i_mode);
+	int				error;
+
+	trace_xrep_orphanage_adopt(sc->orphanage, &orph->xname, sc->ip->i_ino);
+
+	/*
+	 * Create the new name in the orphanage, and bump the link count of
+	 * the orphanage if we just added a directory.
+	 */
+	error = xfs_dir_createname(sc->tp, sc->orphanage, xname, sc->ip->i_ino,
+			orph->orphanage_blkres);
+	if (error)
+		return error;
+
+	xfs_trans_ichgtime(sc->tp, sc->orphanage,
+			XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+	if (isdir)
+		xfs_bumplink(sc->tp, sc->orphanage);
+	xfs_trans_log_inode(sc->tp, sc->orphanage, XFS_ILOG_CORE);
+
+	if (!isdir)
+		return 0;
+
+	/* Replace the dotdot entry in the child directory. */
+	return xfs_dir_replace(sc->tp, sc->ip, &xfs_name_dotdot,
+			sc->orphanage->i_ino, orph->child_blkres);
+}
+
+/* Release the orphanage. */
+void
+xrep_orphanage_rele(
+	struct xfs_scrub	*sc)
+{
+	if (!sc->orphanage)
+		return;
+
+	if (sc->orphanage_ilock_flags)
+		xfs_iunlock(sc->orphanage, sc->orphanage_ilock_flags);
+
+	xchk_irele(sc, sc->orphanage);
+	sc->orphanage = NULL;
+}
diff --git a/fs/xfs/scrub/orphanage.h b/fs/xfs/scrub/orphanage.h
new file mode 100644
index 000000000000..6087d68bc68f
--- /dev/null
+++ b/fs/xfs/scrub/orphanage.h
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_ORPHANAGE_H__
+#define __XFS_SCRUB_ORPHANAGE_H__
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+int xrep_orphanage_create(struct xfs_scrub *sc);
+
+/*
+ * If we're doing a repair, ensure that the orphanage exists and attach it to
+ * the scrub context.
+ */
+static inline int
+xrep_orphanage_try_create(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	ASSERT(sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR);
+
+	error = xrep_orphanage_create(sc);
+	switch (error) {
+	case 0:
+	case -ENOENT:
+	case -ENOTDIR:
+	case -ENOSPC:
+		/*
+		 * If the orphanage can't be found or isn't a directory, we'll
+		 * keep going, but we won't be able to attach the file to the
+		 * orphanage if we can't find the parent.
+		 */
+		return 0;
+	}
+
+	return error;
+}
+
+int xrep_orphanage_iolock_two(struct xfs_scrub *sc);
+
+/* Information about a request to add a file to the orphanage. */
+struct xrep_orphanage_req {
+	/* Name structure; caller must provide a buffer separately. */
+	struct xfs_name		xname;
+
+	struct xfs_scrub	*sc;
+
+	/* Block reservations for orphanage and child (if directory). */
+	unsigned int		orphanage_blkres;
+	unsigned int		child_blkres;
+};
+
+static inline size_t
+xrep_orphanage_req_sizeof(void)
+{
+	return sizeof(struct xrep_orphanage_req) + MAXNAMELEN + 1;
+}
+
+void xrep_orphanage_compute_blkres(struct xfs_scrub *sc,
+		struct xrep_orphanage_req *orph);
+int xrep_orphanage_compute_name(struct xrep_orphanage_req *orph,
+		unsigned char *namebuf);
+int xrep_orphanage_adoption_prep(struct xrep_orphanage_req *orph);
+int xrep_orphanage_adopt(struct xrep_orphanage_req *orph);
+
+void xrep_orphanage_ilock(struct xfs_scrub *sc, unsigned int ilock_flags);
+bool xrep_orphanage_ilock_nowait(struct xfs_scrub *sc,
+		unsigned int ilock_flags);
+void xrep_orphanage_iunlock(struct xfs_scrub *sc, unsigned int ilock_flags);
+
+void xrep_orphanage_rele(struct xfs_scrub *sc);
+#else
+struct xrep_orphanage_req { /* empty */ };
+# define xrep_orphanage_rele(sc)
+#endif /* CONFIG_XFS_ONLINE_REPAIR */
+
+#endif /* __XFS_SCRUB_ORPHANAGE_H__ */
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index dfea3102f52f..92866f1757be 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -10,6 +10,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_log_format.h"
+#include "xfs_trans.h"
 #include "xfs_inode.h"
 #include "xfs_icache.h"
 #include "xfs_dir2.h"
@@ -18,12 +19,21 @@
 #include "scrub/common.h"
 #include "scrub/readdir.h"
 #include "scrub/parent.h"
+#include "scrub/repair.h"
 
 /* Set us up to scrub parents. */
 int
 xchk_setup_parent(
 	struct xfs_scrub	*sc)
 {
+	int			error;
+
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) {
+		error = xrep_setup_parent(sc);
+		if (error)
+			return error;
+	}
+
 	return xchk_setup_inode_contents(sc, 0);
 }
 
diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c
index d83948d1fd05..ffef5de0fbe2 100644
--- a/fs/xfs/scrub/parent_repair.c
+++ b/fs/xfs/scrub/parent_repair.c
@@ -33,6 +33,7 @@
 #include "scrub/parent.h"
 #include "scrub/readdir.h"
 #include "scrub/tempfile.h"
+#include "scrub/orphanage.h"
 
 struct xrep_findparent_info {
 	/* The directory currently being scanned. */
@@ -352,6 +353,29 @@ xrep_parent_scan(
 	return 0;
 }
 
+struct xrep_parent {
+	struct xfs_scrub	*sc;
+
+	/* Orphanage reparenting request. */
+	struct xrep_orphanage_req adoption;
+
+	/* Directory entry name, plus the trailing null. */
+	char			namebuf[MAXNAMELEN];
+};
+
+/* Set up for a parent repair. */
+int
+xrep_setup_parent(
+	struct xfs_scrub	*sc)
+{
+	/* We need a buffer for the orphanage request and a name buffer. */
+	sc->buf = kvzalloc(sizeof(struct xrep_parent), XCHK_GFP_FLAGS);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	return xrep_orphanage_try_create(sc);
+}
+
 /*
  * If we're the root of a directory tree, we are our own parent.  If we're an
  * unlinked directory, the parent /won't/ have a link to us.  Set the parent
@@ -409,14 +433,80 @@ xrep_parent_reset_dir(
 			spaceres);
 }
 
+/*
+ * Move the current file to the orphanage.
+ *
+ * Caller must hold IOLOCK_EXCL on @sc->ip, and no other inode locks.  Upon
+ * successful return, the scrub transaction will have enough extra reservation
+ * to make the move; it will hold IOLOCK_EXCL and ILOCK_EXCL of @sc->ip and the
+ * orphanage; and both inodes will be ijoined.
+ */
+STATIC int
+xrep_parent_move_to_orphanage(
+	struct xrep_parent	*rp)
+{
+	struct xfs_scrub	*sc = rp->sc;
+	int			error;
+
+	/* No orphanage?  We can't fix this. */
+	if (!sc->orphanage)
+		return -EFSCORRUPTED;
+
+	/* If we can take the orphanage's iolock then we're ready to move. */
+	if (!xrep_orphanage_ilock_nowait(sc, XFS_IOLOCK_EXCL)) {
+		xfs_ino_t	orig_parent, new_parent;
+
+		/*
+		 * We may have to drop the lock on sc->ip to try to lock the
+		 * orphanage.  Therefore, look up the old dotdot entry for
+		 * sc->ip so that we can compare it after we re-lock sc->ip.
+		 */
+		orig_parent = xrep_dotdot_lookup(sc);
+
+		xchk_iunlock(sc, sc->ilock_flags);
+		error = xrep_orphanage_iolock_two(sc);
+		if (error)
+			return error;
+
+		/*
+		 * If the parent changed or the child was unlinked while the
+		 * child directory was unlocked, we don't need to move the
+		 * child to the orphanage after all.
+		 */
+		new_parent = xrep_dotdot_lookup(sc);
+
+		if (orig_parent != new_parent || VFS_I(sc->ip)->i_nlink == 0)
+			return 0;
+	}
+
+	/*
+	 * Move the directory to the orphanage, and let scrub teardown unlock
+	 * everything for us.
+	 */
+	xrep_orphanage_compute_blkres(sc, &rp->adoption);
+
+	error = xrep_orphanage_compute_name(&rp->adoption, rp->namebuf);
+	if (error)
+		return error;
+
+	error = xrep_orphanage_adoption_prep(&rp->adoption);
+	if (error)
+		return error;
+
+	return xrep_orphanage_adopt(&rp->adoption);
+}
+
 int
 xrep_parent(
 	struct xfs_scrub	*sc)
 {
+	struct xrep_parent	*rp = sc->buf;
 	xfs_ino_t		parent_ino, curr_parent;
 	unsigned int		sick, checked;
 	int			error;
 
+	rp->sc = sc;
+
 	/*
 	 * Avoid sick directories.  The parent pointer scrubber dropped the
 	 * ILOCK and MMAPLOCK, but we still hold IOLOCK_EXCL on the directory.
@@ -441,7 +531,7 @@ xrep_parent(
 	if (error)
 		return error;
 	if (parent_ino == NULLFSINO)
-		return -EFSCORRUPTED;
+		return xrep_parent_move_to_orphanage(rp);
 
 reset_parent:
 	/* If the '..' entry is already set to the parent inode, we're done. */
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index acd7fccf8bee..596993b06256 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -84,6 +84,7 @@ int xrep_setup_rtsummary(struct xfs_scrub *sc, unsigned int *resblks,
 		size_t *bufsize);
 int xrep_setup_xattr(struct xfs_scrub *sc);
 int xrep_setup_directory(struct xfs_scrub *sc);
+int xrep_setup_parent(struct xfs_scrub *sc);
 
 int xrep_xattr_reset_fork(struct xfs_scrub *sc);
 
@@ -194,6 +195,7 @@ xrep_setup_nothing(
 #define xrep_setup_ag_refcountbt	xrep_setup_nothing
 #define xrep_setup_xattr		xrep_setup_nothing
 #define xrep_setup_directory		xrep_setup_nothing
+#define xrep_setup_parent		xrep_setup_nothing
 
 #define xrep_setup_inode(sc, imap)	((void)0)
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 39ad06e6b2d0..b334fd3d7706 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -27,6 +27,7 @@
 #include "scrub/health.h"
 #include "scrub/xfile.h"
 #include "scrub/tempfile.h"
+#include "scrub/orphanage.h"
 
 /*
  * Online Scrub and Repair
@@ -219,6 +220,7 @@ xchk_teardown(
 	}
 
 	xrep_tempfile_rele(sc);
+	xrep_orphanage_rele(sc);
 	xchk_fshooks_disable(sc);
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 9c26a6092c52..d606d4f370c7 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -105,6 +105,10 @@ struct xfs_scrub {
 	/* Lock flags for @ip. */
 	uint				ilock_flags;
 
+	/* The orphanage, for stashing files that have lost their parent. */
+	uint				orphanage_ilock_flags;
+	struct xfs_inode		*orphanage;
+
 	/* A temporary file on this filesystem, for staging new metadata. */
 	struct xfs_inode		*tempip;
 	uint				temp_ilock_flags;
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index d8223ec24369..ae8a5852e258 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2510,6 +2510,7 @@ DEFINE_EVENT(xrep_dirent_class, name, \
 	TP_PROTO(struct xfs_inode *dp, struct xfs_name *name, xfs_ino_t ino), \
 	TP_ARGS(dp, name, ino))
 DEFINE_XREP_DIRENT_CLASS(xrep_dir_insert_rec);
+DEFINE_XREP_DIRENT_CLASS(xrep_orphanage_adopt);
 
 DECLARE_EVENT_CLASS(xrep_parent_salvage_class,
 	TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino),
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ce55dde40d9d..85ce54a09d82 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -930,10 +930,10 @@ xfs_droplink(
 /*
  * Increment the link count on an inode & log the change.
  */
-static void
+void
 xfs_bumplink(
-	xfs_trans_t *tp,
-	xfs_inode_t *ip)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
 {
 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3f6c63304ca3..34f596deb92c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -581,6 +581,7 @@ void xfs_end_io(struct work_struct *work);
 
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
 void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
+void xfs_bumplink(struct xfs_trans *tp, struct xfs_inode *ip);
 
 void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);



* [PATCH 2/3] xfs: move files to orphanage instead of letting nlinks drop to zero
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: move orphan files to lost and found Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: move orphan files to the orphanage Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfs: ensure dentry consistency when the orphanage adopts a file Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we encounter an inode with a nonzero link count but zero observed
links, move it to the orphanage.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/nlinks.c        |   11 ++
 fs/xfs/scrub/nlinks.h        |    6 +
 fs/xfs/scrub/nlinks_repair.c |  250 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/repair.h        |    2 
 fs/xfs/scrub/trace.c         |    1 
 fs/xfs/scrub/trace.h         |   26 ++++
 6 files changed, 290 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c
index 54aa3dc4dc89..dca759d27ac4 100644
--- a/fs/xfs/scrub/nlinks.c
+++ b/fs/xfs/scrub/nlinks.c
@@ -24,6 +24,7 @@
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
 #include "scrub/iscan.h"
+#include "scrub/orphanage.h"
 #include "scrub/nlinks.h"
 #include "scrub/trace.h"
 #include "scrub/readdir.h"
@@ -44,9 +45,17 @@ int
 xchk_setup_nlinks(
 	struct xfs_scrub	*sc)
 {
+	int			error;
+
 	xchk_fshooks_enable(sc, XCHK_FSHOOKS_NLINKS);
 
-	sc->buf = kzalloc(sizeof(struct xchk_nlink_ctrs), XCHK_GFP_FLAGS);
+	if (xchk_could_repair(sc)) {
+		error = xrep_setup_nlinks(sc);
+		if (error)
+			return error;
+	}
+
+	sc->buf = kvzalloc(sizeof(struct xchk_nlink_ctrs), XCHK_GFP_FLAGS);
 	if (!sc->buf)
 		return -ENOMEM;
 
diff --git a/fs/xfs/scrub/nlinks.h b/fs/xfs/scrub/nlinks.h
index 46baef3c2237..f5108369cc2b 100644
--- a/fs/xfs/scrub/nlinks.h
+++ b/fs/xfs/scrub/nlinks.h
@@ -28,6 +28,12 @@ struct xchk_nlink_ctrs {
 	 * from other writer threads.
 	 */
 	struct xfs_nlink_hook	hooks;
+
+	/* Orphanage reparenting request. */
+	struct xrep_orphanage_req adoption;
+
+	/* Directory entry name, plus the trailing null. */
+	char			namebuf[MAXNAMELEN];
 };
 
 /*
diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c
index 4723b015a1c1..f881e5dbd432 100644
--- a/fs/xfs/scrub/nlinks_repair.c
+++ b/fs/xfs/scrub/nlinks_repair.c
@@ -23,6 +23,7 @@
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
 #include "scrub/iscan.h"
+#include "scrub/orphanage.h"
 #include "scrub/nlinks.h"
 #include "scrub/trace.h"
 #include "scrub/tempfile.h"
@@ -37,6 +38,242 @@
  * inode is locked.
  */
 
+/* Set up to repair inode link counts. */
+int
+xrep_setup_nlinks(
+	struct xfs_scrub	*sc)
+{
+	return xrep_orphanage_try_create(sc);
+}
+
+/* Update incore link count information.  Caller must hold the xnc lock. */
+STATIC int
+xrep_nlinks_set_record(
+	struct xchk_nlink_ctrs	*xnc,
+	xfs_ino_t		ino,
+	const struct xchk_nlink	*nl)
+{
+	int			error;
+
+	trace_xrep_nlinks_set_record(xnc->sc->mp, ino, nl);
+
+	error = xfarray_store(xnc->nlinks, ino, nl);
+	if (error == -EFBIG) {
+		/*
+		 * EFBIG means we tried to store data at too high a byte offset
+		 * in the sparse array.  This should be impossible since we
+		 * presumably already stored an nlink count, but we still need
+		 * to fail gracefully.
+		 */
+		return -ECANCELED;
+	}
+
+	return error;
+}
+
+/*
+ * Move to the orphanage any inode that isn't the root directory or the
+ * orphanage itself, has a nonzero link count, and has no observed parents.
+ */
+static inline bool
+xrep_nlinks_is_orphaned(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip,
+	unsigned int		actual_nlink,
+	const struct xchk_nlink	*obs)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (obs->parents != 0)
+		return false;
+	if (ip == mp->m_rootip || ip == sc->orphanage)
+		return false;
+	return actual_nlink != 0;
+}
+
+/*
+ * Correct the link count of the given inode or move it to the orphanage.
+ * Because we have to grab locks and resources in a certain order, it's
+ * possible that this will be a no-op.
+ */
+STATIC int
+xrep_nlinks_repair_and_relink_inode(
+	struct xchk_nlink_ctrs		*xnc)
+{
+	struct xchk_nlink		obs;
+	struct xfs_scrub		*sc = xnc->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_inode		*ip = sc->ip;
+	uint64_t			total_links;
+	unsigned int			actual_nlink;
+	bool				orphan = false;
+	int				error;
+
+	/*
+	 * Ignore temporary files being used to stage repairs, since we assume
+	 * they're correct for non-directories, and the directory repair code
+	 * doesn't bump the link counts for the children.
+	 */
+	if (xrep_is_tempfile(ip))
+		return 0;
+
+	/* Grab the IOLOCK of the orphanage and the child directory. */
+	error = xrep_orphanage_iolock_two(sc);
+	if (error)
+		return error;
+
+	/*
+	 * Allocate a transaction for the adoption.  We'll reserve space for
+	 * the transaction in the adoption preparation step.
+	 */
+	xrep_orphanage_compute_blkres(sc, &xnc->adoption);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, 0, 0, 0, &sc->tp);
+	if (error)
+		goto out_iolock;
+
+	/*
+	 * Before we take the ILOCKs, compute the name of the potential
+	 * orphanage directory entry.
+	 */
+	error = xrep_orphanage_compute_name(&xnc->adoption, xnc->namebuf);
+	if (error)
+		goto out_trans;
+
+	error = xrep_orphanage_adoption_prep(&xnc->adoption);
+	if (error)
+		goto out_trans;
+
+	mutex_lock(&xnc->lock);
+
+	if (xchk_iscan_aborted(&xnc->collect_iscan)) {
+		error = -ECANCELED;
+		goto out_scanlock;
+	}
+
+	error = xfarray_load_sparse(xnc->nlinks, ip->i_ino, &obs);
+	if (error)
+		goto out_scanlock;
+
+	total_links = xchk_nlink_total(&obs);
+	actual_nlink = VFS_I(ip)->i_nlink;
+
+	/* Cannot set more than the maximum possible link count. */
+	if (total_links > U32_MAX) {
+		trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		error = 0;
+		goto out_scanlock;
+	}
+
+	/*
+	 * Linked directories should have at least one "child" (the dot entry)
+	 * pointing up to them.
+	 */
+	if (S_ISDIR(VFS_I(ip)->i_mode) && actual_nlink > 0 &&
+					  obs.children == 0) {
+		trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		error = 0;
+		goto out_scanlock;
+	}
+
+	/* Non-directories cannot have directories pointing up to them. */
+	if (!S_ISDIR(VFS_I(ip)->i_mode) && obs.children > 0) {
+		trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		error = 0;
+		goto out_scanlock;
+	}
+
+	/*
+	 * Decide if we're going to move this file to the orphanage, and fix
+	 * up the incore link counts if we are.
+	 */
+	if (xrep_nlinks_is_orphaned(sc, ip, actual_nlink, &obs)) {
+		obs.parents++;
+		total_links++;
+
+		error = xrep_nlinks_set_record(xnc, ip->i_ino, &obs);
+		if (error)
+			goto out_scanlock;
+
+		orphan = true;
+	}
+
+	/*
+	 * We did not find any links to this inode and we're not planning to
+	 * move it to the orphanage.  If the inode link count is also zero, we
+	 * have nothing further to do.  Otherwise, the situation is unfixable.
+	 */
+	if (total_links == 0) {
+		if (actual_nlink != 0)
+			trace_xrep_nlinks_unfixable_inode(mp, ip, &obs);
+		error = 0;
+		goto out_scanlock;
+	}
+
+	/* If the inode has the correct link count and isn't orphaned, exit. */
+	if (total_links == actual_nlink && !orphan) {
+		error = 0;
+		goto out_scanlock;
+	}
+
+	/* Commit the new link count. */
+	trace_xrep_nlinks_update_inode(mp, ip, &obs);
+
+	/*
+	 * If this is an orphan, create the new name in the orphanage, and bump
+	 * the link count of the orphanage if we just added a directory.  Then
+	 * we can set the correct nlink.
+	 */
+	if (orphan) {
+		error = xrep_orphanage_adopt(&xnc->adoption);
+		if (error)
+			goto out_scanlock;
+
+		/*
+		 * If the child is a directory, we need to bump the incore link
+		 * count of the orphanage to account for the new orphan's
+		 * child subdirectory entry.
+		 */
+		if (S_ISDIR(VFS_I(ip)->i_mode)) {
+			error = xfarray_load_sparse(xnc->nlinks,
+					sc->orphanage->i_ino, &obs);
+			if (error)
+				goto out_scanlock;
+
+			obs.flags |= XCHK_NLINK_WRITTEN;
+			obs.children++;
+
+			error = xrep_nlinks_set_record(xnc,
+					sc->orphanage->i_ino, &obs);
+			if (error)
+				goto out_scanlock;
+		}
+	}
+	set_nlink(VFS_I(ip), total_links);
+	xfs_trans_log_inode(sc->tp, ip, XFS_ILOG_CORE);
+	mutex_unlock(&xnc->lock);
+
+	error = xrep_trans_commit(sc);
+	if (error)
+		goto out_ilock;
+
+	xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+	xrep_orphanage_iunlock(sc, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+	return 0;
+
+out_scanlock:
+	mutex_unlock(&xnc->lock);
+out_trans:
+	xchk_trans_cancel(sc);
+out_ilock:
+	xchk_iunlock(sc, XFS_ILOCK_EXCL);
+	xrep_orphanage_iunlock(sc, XFS_ILOCK_EXCL);
+out_iolock:
+	xchk_iunlock(sc, XFS_IOLOCK_EXCL);
+	xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL);
+	return error;
+}
+
 /*
  * Correct the link count of the given inode.  Because we have to grab locks
  * and resources in a certain order, it's possible that this will be a no-op.
@@ -185,10 +422,10 @@ xrep_nlinks(
 	/*
 	 * We need ftype for an accurate count of the number of child
 	 * subdirectory links.  Child subdirectories with a back link (dotdot
-	 * entry) but no forward link are unfixable, so we cannot repair the
-	 * link count of the parent directory based on the back link count
-	 * alone.  Filesystems without ftype support are rare (old V4) so we
-	 * just skip out here.
+	 * entry) but no forward link are moved to the orphanage, so we cannot
+	 * repair the link count of the parent directory based on the back link
+	 * count alone.  Filesystems without ftype support are rare (old V4) so
+	 * we just skip out here.
 	 */
 	if (!xfs_has_ftype(sc->mp))
 		return -EOPNOTSUPP;
@@ -209,7 +446,10 @@ xrep_nlinks(
 		 */
 		xchk_trans_cancel(sc);
 
-		error = xrep_nlinks_repair_inode(xnc);
+		if (sc->orphanage && sc->ip != sc->orphanage)
+			error = xrep_nlinks_repair_and_relink_inode(xnc);
+		else
+			error = xrep_nlinks_repair_inode(xnc);
 		xchk_iscan_mark_visited(&xnc->compare_iscan, sc->ip);
 		xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 596993b06256..5a7787e3d3a1 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -85,6 +85,7 @@ int xrep_setup_rtsummary(struct xfs_scrub *sc, unsigned int *resblks,
 int xrep_setup_xattr(struct xfs_scrub *sc);
 int xrep_setup_directory(struct xfs_scrub *sc);
 int xrep_setup_parent(struct xfs_scrub *sc);
+int xrep_setup_nlinks(struct xfs_scrub *sc);
 
 int xrep_xattr_reset_fork(struct xfs_scrub *sc);
 
@@ -196,6 +197,7 @@ xrep_setup_nothing(
 #define xrep_setup_xattr		xrep_setup_nothing
 #define xrep_setup_directory		xrep_setup_nothing
 #define xrep_setup_parent		xrep_setup_nothing
+#define xrep_setup_nlinks		xrep_setup_nothing
 
 #define xrep_setup_inode(sc, imap)	((void)0)
 
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index f8f50c5a02c0..2e36fcc12e40 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -23,6 +23,7 @@
 #include "scrub/xfile.h"
 #include "scrub/xfarray.h"
 #include "scrub/iscan.h"
+#include "scrub/orphanage.h"
 #include "scrub/nlinks.h"
 #include "scrub/fscounters.h"
 #include "scrub/xfbtree.h"
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index ae8a5852e258..116f03c2fe48 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2538,6 +2538,32 @@ DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_dir_salvaged_parent);
 DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_findparent_dirent);
 DEFINE_XREP_PARENT_SALVAGE_CLASS(xrep_findparent_from_dcache);
 
+TRACE_EVENT(xrep_nlinks_set_record,
+	TP_PROTO(struct xfs_mount *mp, xfs_ino_t ino,
+		 const struct xchk_nlink *obs),
+	TP_ARGS(mp, ino, obs),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_nlink_t, parents)
+		__field(xfs_nlink_t, backrefs)
+		__field(xfs_nlink_t, children)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino = ino;
+		__entry->parents = obs->parents;
+		__entry->backrefs = obs->backrefs;
+		__entry->children = obs->children;
+	),
+	TP_printk("dev %d:%d ino 0x%llx parents %u backrefs %u children %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->parents,
+		  __entry->backrefs,
+		  __entry->children)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 



* [PATCH 3/3] xfs: ensure dentry consistency when the orphanage adopts a file
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: move orphan files to lost and found Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: move orphan files to the orphanage Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: move files to orphanage instead of letting nlinks drop to zero Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When the orphanage adopts a file, that file becomes a child of the
orphanage.  The dentry cache may have entries for the orphanage
directory and the name we've chosen, so (1) make sure we abort if the
dcache has a positive entry because something's not right; and (2)
invalidate and purge negative dentries if the adoption goes through.
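
Roughly speaking (an illustrative sketch with a made-up helper name,
not the code in this patch), the two dcache rules reduce to:

	/* Sketch: refuse a positive dentry, purge any negative one. */
	static int
	sketch_check_and_purge(struct dentry *d_orphanage, struct qstr *qname)
	{
		struct dentry	*child;

		child = d_hash_and_lookup(d_orphanage, qname);
		if (!child)
			return 0;		/* nothing cached; all good */

		if (d_is_positive(child)) {	/* rule (1): abort loudly */
			dput(child);
			return -EFSCORRUPTED;
		}

		d_invalidate(child);		/* rule (2): purge negative */
		dput(child);
		return 0;
	}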

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/orphanage.c |  100 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/scrub/trace.h     |   32 +++++++++++++++
 2 files changed, 127 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/scrub/orphanage.c b/fs/xfs/scrub/orphanage.c
index 1fe7935433bf..08ff273dbb39 100644
--- a/fs/xfs/scrub/orphanage.c
+++ b/fs/xfs/scrub/orphanage.c
@@ -356,6 +356,89 @@ xrep_orphanage_adoption_prep(
 	return 0;
 }
 
+/*
+ * Make sure the dcache does not have a positive dentry for the name we've
+ * chosen.  The caller should have checked with the ondisk directory, so any
+ * discrepancy is a sign that something is seriously wrong.
+ */
+static int
+xrep_orphanage_check_dcache(
+	struct xrep_orphanage_req	*orph)
+{
+	struct qstr			qname = QSTR_INIT(orph->xname.name,
+							  orph->xname.len);
+	struct dentry			*d_orphanage, *d_child, *dentry;
+	int				error = 0;
+
+	d_orphanage = d_find_alias(VFS_I(orph->sc->orphanage));
+	if (!d_orphanage)
+		return 0;
+
+	d_child = d_hash_and_lookup(d_orphanage, &qname);
+	if (d_child) {
+		trace_xrep_orphanage_check_child(orph->sc->mp, d_child);
+
+		if (d_is_positive(d_child)) {
+			ASSERT(d_is_negative(d_child));
+			error = -EFSCORRUPTED;
+		}
+
+		dput(d_child);
+	}
+
+	dput(d_orphanage);
+	if (error)
+		return error;
+
+	/*
+	 * Do we need to update d_parent of the dentry for the file being
+	 * repaired?  In theory there shouldn't be one since the file had
+	 * nonzero nlink but wasn't connected to any parent dir.
+	 */
+	dentry = d_find_alias(VFS_I(orph->sc->ip));
+	if (dentry) {
+		trace_xrep_orphanage_check_dentry(orph->sc->mp, dentry);
+		ASSERT(dentry->d_parent == NULL);
+
+		dput(dentry);
+		return -EFSCORRUPTED;
+	}
+
+	return 0;
+}
+
+/*
+ * Remove from the dcache any negative dentries for the name we've chosen.
+ * There should not be any positive entries for that name, since we've held
+ * our lock on the orphanage directory the whole time.
+ */
+static int
+xrep_orphanage_zap_dcache(
+	struct xrep_orphanage_req	*orph)
+{
+	struct qstr			qname = QSTR_INIT(orph->xname.name,
+							  orph->xname.len);
+	struct dentry			*d_orphanage, *d_child;
+	int				error = 0;
+
+	d_orphanage = d_find_alias(VFS_I(orph->sc->orphanage));
+	if (!d_orphanage)
+		return 0;
+
+	d_child = d_hash_and_lookup(d_orphanage, &qname);
+	while (d_child != NULL) {
+		trace_xrep_orphanage_zap_child(orph->sc->mp, d_child);
+
+		ASSERT(d_is_negative(d_child));
+		d_invalidate(d_child);
+		dput(d_child);
+		d_child = d_lookup(d_orphanage, &qname);
+	}
+
+	dput(d_orphanage);
+	return error;
+}
+
 /*
  * Move the current file to the orphanage.
  *
@@ -375,6 +458,10 @@ xrep_orphanage_adopt(
 
 	trace_xrep_orphanage_adopt(sc->orphanage, &orph->xname, sc->ip->i_ino);
 
+	error = xrep_orphanage_check_dcache(orph);
+	if (error)
+		return error;
+
 	/*
 	 * Create the new name in the orphanage, and bump the link count of
 	 * the orphanage if we just added a directory.
@@ -390,12 +477,15 @@ xrep_orphanage_adopt(
 		xfs_bumplink(sc->tp, sc->orphanage);
 	xfs_trans_log_inode(sc->tp, sc->orphanage, XFS_ILOG_CORE);
 
-	if (!isdir)
-		return 0;
+	if (isdir) {
+		/* Replace the dotdot entry in the child directory. */
+		error = xfs_dir_replace(sc->tp, sc->ip, &xfs_name_dotdot,
+				sc->orphanage->i_ino, orph->child_blkres);
+		if (error)
+			return error;
+	}
 
-	/* Replace the dotdot entry in the child directory. */
-	return xfs_dir_replace(sc->tp, sc->ip, &xfs_name_dotdot,
-			sc->orphanage->i_ino, orph->child_blkres);
+	return xrep_orphanage_zap_dcache(orph);
 }
 
 /* Release the orphanage. */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 116f03c2fe48..b29ed4dde427 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2564,6 +2564,38 @@ TRACE_EVENT(xrep_nlinks_set_record,
 		  __entry->children)
 );
 
+DECLARE_EVENT_CLASS(xrep_dentry_class,
+	TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry),
+	TP_ARGS(mp, dentry),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(unsigned long, ino)
+		__field(bool, positive)
+		__field(bool, has_parent)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->type = __d_entry_type(dentry);
+		__entry->positive = d_is_positive(dentry);
+		__entry->has_parent = dentry->d_parent != NULL;
+		__entry->ino = d_inode(dentry) ? d_inode(dentry)->i_ino : 0;
+	),
+	TP_printk("dev %d:%d type 0x%x positive? %d parent? %d ino 0x%lx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->positive,
+		  __entry->has_parent,
+		  __entry->ino)
+);
+#define DEFINE_REPAIR_DENTRY_EVENT(name) \
+DEFINE_EVENT(xrep_dentry_class, name, \
+	TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry), \
+	TP_ARGS(mp, dentry))
+DEFINE_REPAIR_DENTRY_EVENT(xrep_orphanage_check_child);
+DEFINE_REPAIR_DENTRY_EVENT(xrep_orphanage_check_dentry);
+DEFINE_REPAIR_DENTRY_EVENT(xrep_orphanage_zap_child);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 



* [PATCHSET v24.0 0/1] xfs: online repair of symbolic links
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (30 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: move orphan files to lost and found Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/1] " Darrick J. Wong
  2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of parent pointers Darrick J. Wong
                   ` (45 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

The sole patch in this set adds the ability to repair the target buffer
of a symbolic link, using the same salvage, rebuild, and swap strategy
used everywhere else.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-symlink
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_bmap.c           |   11 -
 fs/xfs/libxfs/xfs_bmap.h           |    6 
 fs/xfs/libxfs/xfs_symlink_remote.c |    9 -
 fs/xfs/libxfs/xfs_symlink_remote.h |   22 +-
 fs/xfs/scrub/repair.h              |    8 +
 fs/xfs/scrub/scrub.c               |    2 
 fs/xfs/scrub/symlink.c             |   13 +
 fs/xfs/scrub/symlink_repair.c      |  452 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.c            |    5 
 fs/xfs/scrub/trace.h               |   46 ++++
 11 files changed, 560 insertions(+), 15 deletions(-)
 create mode 100644 fs/xfs/scrub/symlink_repair.c



* [PATCH 1/1] xfs: online repair of symbolic links
  2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of symbolic links Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If a symbolic link target looks bad, sift through the rubble to salvage
as much of the target buffer as we can, stage a new target (short or
remote format as needed) in a temporary file, and use the atomic extent
swapping mechanism to commit the results.
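
At a high level (purely a sketch of the flow described above; every
helper named below is hypothetical, not a function in
symlink_repair.c), the repair runs in three steps:

	/* Sketch of the symlink repair flow; all helpers are made up. */
	static int
	sketch_symlink_repair(struct xfs_scrub *sc)
	{
		char	*target;
		int	error;

		/* Salvage whatever readable bytes remain of the target. */
		error = sketch_salvage_target(sc, &target);
		if (error)
			return error;

		/*
		 * Write the salvaged target into the temporary file, using
		 * shortform or remote format as the target length requires.
		 */
		error = sketch_stage_target(sc, sc->tempip, target);
		if (error)
			return error;

		/* Atomically swap the staged fork into the broken symlink. */
		return sketch_swap_extents(sc, sc->tempip);
	}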

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_bmap.c           |   11 -
 fs/xfs/libxfs/xfs_bmap.h           |    6 
 fs/xfs/libxfs/xfs_symlink_remote.c |    9 -
 fs/xfs/libxfs/xfs_symlink_remote.h |   22 +-
 fs/xfs/scrub/repair.h              |    8 +
 fs/xfs/scrub/scrub.c               |    2 
 fs/xfs/scrub/symlink.c             |   13 +
 fs/xfs/scrub/symlink_repair.c      |  452 ++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.c            |    5 
 fs/xfs/scrub/trace.h               |   46 ++++
 11 files changed, 560 insertions(+), 15 deletions(-)
 create mode 100644 fs/xfs/scrub/symlink_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 6d6ca775553f..e7a8a740318a 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -205,6 +205,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   refcount_repair.o \
 				   repair.o \
 				   rmap_repair.o \
+				   symlink_repair.o \
 				   tempfile.o \
 				   xfblob.o \
 				   xfbtree.o \
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7cbb96a805a3..0dfa84993a9e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -766,7 +766,7 @@ xfs_bmap_local_to_extents_empty(
 }
 
 
-STATIC int				/* error */
+int					/* error */
 xfs_bmap_local_to_extents(
 	xfs_trans_t	*tp,		/* transaction pointer */
 	xfs_inode_t	*ip,		/* incore inode pointer */
@@ -776,7 +776,8 @@ xfs_bmap_local_to_extents(
 	void		(*init_fn)(struct xfs_trans *tp,
 				   struct xfs_buf *bp,
 				   struct xfs_inode *ip,
-				   struct xfs_ifork *ifp))
+				   struct xfs_ifork *ifp, void *priv),
+	void		*priv)
 {
 	int		error = 0;
 	int		flags;		/* logging flags returned */
@@ -841,7 +842,7 @@ xfs_bmap_local_to_extents(
 	 * log here. Note that init_fn must also set the buffer log item type
 	 * correctly.
 	 */
-	init_fn(tp, bp, ip, ifp);
+	init_fn(tp, bp, ip, ifp, priv);
 
 	/* account for the change in fork size */
 	xfs_idata_realloc(ip, -ifp->if_bytes, whichfork);
@@ -974,8 +975,8 @@ xfs_bmap_add_attrfork_local(
 
 	if (S_ISLNK(VFS_I(ip)->i_mode))
 		return xfs_bmap_local_to_extents(tp, ip, 1, flags,
-						 XFS_DATA_FORK,
-						 xfs_symlink_local_to_remote);
+				XFS_DATA_FORK, xfs_symlink_local_to_remote,
+				NULL);
 
 	/* should only be called for types that support local format data */
 	ASSERT(0);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 413ec27f2f24..9559f7174bba 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -174,6 +174,12 @@ unsigned int xfs_bmap_compute_attr_offset(struct xfs_mount *mp);
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 void	xfs_bmap_local_to_extents_empty(struct xfs_trans *tp,
 		struct xfs_inode *ip, int whichfork);
+int xfs_bmap_local_to_extents(struct xfs_trans *tp, struct xfs_inode *ip,
+		xfs_extlen_t total, int *logflagsp, int whichfork,
+		void (*init_fn)(struct xfs_trans *tp, struct xfs_buf *bp,
+				struct xfs_inode *ip, struct xfs_ifork *ifp,
+				void *priv),
+		void *priv);
 void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
 int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index b48dcb893a2a..d0c4bd7fb019 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -169,7 +169,8 @@ xfs_symlink_local_to_remote(
 	struct xfs_trans	*tp,
 	struct xfs_buf		*bp,
 	struct xfs_inode	*ip,
-	struct xfs_ifork	*ifp)
+	struct xfs_ifork	*ifp,
+	void			*priv)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	char			*buf;
@@ -318,9 +319,10 @@ xfs_symlink_remote_read(
 
 /* Write the symlink target into the inode. */
 int
-xfs_symlink_write_target(
+__xfs_symlink_write_target(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
+	xfs_ino_t		owner,
 	const char		*target_path,
 	int			pathlen,
 	xfs_fsblock_t		fs_blocks,
@@ -375,8 +377,7 @@ xfs_symlink_write_target(
 		byte_cnt = min(byte_cnt, pathlen);
 
 		buf = bp->b_addr;
-		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset, byte_cnt,
-				bp);
+		buf += xfs_symlink_hdr_set(mp, owner, offset, byte_cnt, bp);
 
 		memcpy(buf, cur_chunk, byte_cnt);
 
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h
index 05eb9c3937d9..45855b78178f 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.h
+++ b/fs/xfs/libxfs/xfs_symlink_remote.h
@@ -16,13 +16,27 @@ int xfs_symlink_hdr_set(struct xfs_mount *mp, xfs_ino_t ino, uint32_t offset,
 bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset,
 			uint32_t size, struct xfs_buf *bp);
 void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
-				 struct xfs_inode *ip, struct xfs_ifork *ifp);
+				 struct xfs_inode *ip, struct xfs_ifork *ifp,
+				 void *priv);
 xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
 xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
 int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
-int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
-		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
-		uint resblks);
+int __xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
+		xfs_ino_t owner, const char *target_path, int pathlen,
+		xfs_fsblock_t fs_blocks, uint resblks);
+
+static inline int
+xfs_symlink_write_target(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	const char		*target_path,
+	int			pathlen,
+	xfs_fsblock_t		fs_blocks,
+	uint			resblks)
+{
+	return __xfs_symlink_write_target(tp, ip, ip->i_ino, target_path,
+			pathlen, fs_blocks, resblks);
+}
 int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 5a7787e3d3a1..c6461acd1112 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -86,6 +86,7 @@ int xrep_setup_xattr(struct xfs_scrub *sc);
 int xrep_setup_directory(struct xfs_scrub *sc);
 int xrep_setup_parent(struct xfs_scrub *sc);
 int xrep_setup_nlinks(struct xfs_scrub *sc);
+int xrep_setup_symlink(struct xfs_scrub *sc, unsigned int *resblks);
 
 int xrep_xattr_reset_fork(struct xfs_scrub *sc);
 
@@ -125,6 +126,7 @@ int xrep_fscounters(struct xfs_scrub *sc);
 int xrep_xattr(struct xfs_scrub *sc);
 int xrep_directory(struct xfs_scrub *sc);
 int xrep_parent(struct xfs_scrub *sc);
+int xrep_symlink(struct xfs_scrub *sc);
 
 #ifdef CONFIG_XFS_RT
 int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -215,6 +217,11 @@ xrep_setup_rtsummary(
 	return 0;
 }
 
+static inline int xrep_setup_symlink(struct xfs_scrub *sc, unsigned int *x)
+{
+	return 0;
+}
+
 #define xrep_revalidate_allocbt		(NULL)
 #define xrep_revalidate_iallocbt	(NULL)
 
@@ -240,6 +247,7 @@ xrep_setup_rtsummary(
 #define xrep_xattr			xrep_notsupported
 #define xrep_directory			xrep_notsupported
 #define xrep_parent			xrep_notsupported
+#define xrep_symlink			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index b334fd3d7706..a596789e463d 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -341,7 +341,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_symlink,
 		.scrub	= xchk_symlink,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_symlink,
 	},
 	[XFS_SCRUB_TYPE_PARENT] = {	/* parent pointers */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c
index c134f738bc43..0938d73838f6 100644
--- a/fs/xfs/scrub/symlink.c
+++ b/fs/xfs/scrub/symlink.c
@@ -10,23 +10,34 @@
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_log_format.h"
+#include "xfs_trans.h"
 #include "xfs_inode.h"
 #include "xfs_symlink.h"
 #include "xfs_symlink_remote.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
+#include "scrub/repair.h"
 
 /* Set us up to scrub a symbolic link. */
 int
 xchk_setup_symlink(
 	struct xfs_scrub	*sc)
 {
+	unsigned int		resblks = 0;
+	int			error;
+
 	/* Allocate the buffer without the inode lock held. */
 	sc->buf = kvzalloc(XFS_SYMLINK_MAXLEN + 1, XCHK_GFP_FLAGS);
 	if (!sc->buf)
 		return -ENOMEM;
 
-	return xchk_setup_inode_contents(sc, 0);
+	if (xchk_could_repair(sc)) {
+		error = xrep_setup_symlink(sc, &resblks);
+		if (error)
+			return error;
+	}
+
+	return xchk_setup_inode_contents(sc, resblks);
 }
 
 /* Symbolic links. */
diff --git a/fs/xfs/scrub/symlink_repair.c b/fs/xfs/scrub/symlink_repair.c
new file mode 100644
index 000000000000..89d9187bc5fe
--- /dev/null
+++ b/fs/xfs/scrub/symlink_repair.c
@@ -0,0 +1,452 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_symlink.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_symlink_remote.h"
+#include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/tempfile.h"
+#include "scrub/tempswap.h"
+#include "scrub/reap.h"
+
+/*
+ * Symbolic Link Repair
+ * ====================
+ *
+ * We repair symbolic links by reading whatever target data we can find, up to
+ * the first NULL byte.  Zero length symlinks are turned into links to the
+ * current directory.  The new target is written into a private hidden
+ * temporary file, and then an atomic extent swap commits the new symlink
+ * target to the file being repaired.
+ */
+
+/* Set us up to repair the symlink file. */
+int
+xrep_setup_symlink(
+	struct xfs_scrub	*sc,
+	unsigned int		*resblks)
+{
+	struct xfs_mount	*mp = sc->mp;
+	unsigned long long	blocks;
+	int			error;
+
+	error = xrep_tempfile_create(sc, S_IFLNK);
+	if (error)
+		return error;
+
+	/*
+	 * If we're doing a repair, we reserve enough blocks to write out a
+	 * completely new symlink file, plus twice as many blocks as we would
+	 * need if we can only allocate one block per data fork mapping.  This
+	 * should cover the preallocation of the temporary file and swapping
+	 * the extent mappings.
+	 *
+	 * We cannot use xfs_swapext_estimate because we have not yet
+	 * constructed the replacement symlink target and therefore do not
+	 * know how many extents it will use.  By the time we do, we will
+	 * have a dirty transaction (which we cannot drop because we cannot
+	 * drop the symlink ILOCK) and cannot ask for more reservation.
+	 */
+	blocks = xfs_symlink_blocks(sc->mp, XFS_SYMLINK_MAXLEN);
+	blocks += xfs_bmbt_calc_size(mp, blocks) * 2;
+	if (blocks > UINT_MAX)
+		return -EOPNOTSUPP;
+
+	*resblks += blocks;
+	return 0;
+}
+
+/* Try to salvage the pathname from rmt blocks. */
+STATIC int
+xrep_symlink_salvage_remote(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_inode	*ip = sc->ip;
+	struct xfs_buf		*bp;
+	char			*target_buf = sc->buf;
+	xfs_failaddr_t		fa;
+	xfs_filblks_t		fsblocks;
+	xfs_daddr_t		d;
+	loff_t			len;
+	loff_t			offset;
+	unsigned int		byte_cnt;
+	bool			magic_ok;
+	bool			hdr_ok;
+	int			n;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			error;
+
+	/* We'll only read until the buffer is full. */
+	len = min_t(loff_t, ip->i_disk_size, XFS_SYMLINK_MAXLEN);
+	fsblocks = xfs_symlink_blocks(sc->mp, len);
+	error = xfs_bmapi_read(ip, 0, fsblocks, mval, &nmaps, 0);
+	if (error)
+		return error;
+
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		struct xfs_dsymlink_hdr	*dsl;
+
+		d = XFS_FSB_TO_DADDR(sc->mp, mval[n].br_startblock);
+
+		/* Read the rmt block.  We'll run the verifiers manually. */
+		error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+				d, XFS_FSB_TO_BB(sc->mp, mval[n].br_blockcount),
+				0, &bp, NULL);
+		if (error)
+			return error;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		/* How many bytes do we expect to get out of this buffer? */
+		byte_cnt = XFS_FSB_TO_B(sc->mp, mval[n].br_blockcount);
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(sc->mp, byte_cnt);
+		byte_cnt = min_t(unsigned int, byte_cnt, len);
+
+		/*
+		 * See if the verifiers accept this block.  We're willing to
+		 * salvage if the offset/byte/ino are ok and either the
+		 * verifier passed or the magic is ok.  Anything else and we
+		 * stop dead in our tracks.
+		 */
+		fa = bp->b_ops->verify_struct(bp);
+		dsl = bp->b_addr;
+		magic_ok = dsl->sl_magic == cpu_to_be32(XFS_SYMLINK_MAGIC);
+		hdr_ok = xfs_symlink_hdr_ok(ip->i_ino, offset, byte_cnt, bp);
+		if (!hdr_ok || (fa != NULL && !magic_ok))
+			break;
+
+		memcpy(target_buf + offset, dsl + 1, byte_cnt);
+
+		len -= byte_cnt;
+		offset += byte_cnt;
+	}
+
+	/* Ensure we have a zero at the end, and /some/ contents. */
+	if (offset == 0 || target_buf[0] == 0)
+		sprintf(target_buf, ".");
+	else
+		target_buf[offset] = 0;
+	return 0;
+}
+
+/*
+ * Try to salvage an inline symlink's contents.  Empty symlinks become a link
+ * to the current directory.
+ */
+STATIC void
+xrep_symlink_salvage_inline(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_inode	*ip = sc->ip;
+	char			*target_buf = sc->buf;
+	struct xfs_ifork	*ifp;
+
+	ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+	if (ifp->if_u1.if_data)
+		strncpy(target_buf, ifp->if_u1.if_data, xfs_inode_data_fork_size(ip));
+	if (target_buf[0] == 0)
+		sprintf(target_buf, ".");
+}
+
+/* Salvage whatever we can of the target. */
+STATIC int
+xrep_symlink_salvage(
+	struct xfs_scrub	*sc)
+{
+	if (sc->ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		xrep_symlink_salvage_inline(sc);
+	} else {
+		int		error = xrep_symlink_salvage_remote(sc);
+
+		if (error)
+			return error;
+	}
+
+	trace_xrep_symlink_salvage_target(sc->ip, sc->buf, strlen(sc->buf));
+	return 0;
+}
+
+STATIC void
+xrep_symlink_local_to_remote(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp,
+	struct xfs_inode	*ip,
+	struct xfs_ifork	*ifp,
+	void			*priv)
+{
+	struct xfs_scrub	*sc = priv;
+	struct xfs_dsymlink_hdr	*dsl = bp->b_addr;
+
+	xfs_symlink_local_to_remote(tp, bp, ip, ifp, NULL);
+
+	if (!xfs_has_crc(sc->mp))
+		return;
+
+	dsl->sl_owner = cpu_to_be64(sc->ip->i_ino);
+	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsymlink_hdr) +
+					ifp->if_bytes - 1);
+}
+
+/*
+ * Prepare both links' data forks for extent swapping.  Promote the tempfile
+ * from local format to extents format, and if the file being repaired has a
+ * short format data fork, turn it into an empty extent list.
+ */
+STATIC int
+xrep_symlink_swap_prep(
+	struct xfs_scrub	*sc,
+	bool			temp_local,
+	bool			ip_local)
+{
+	int			error;
+
+	/*
+	 * If the temp link is in shortform format, convert that to a remote
+	 * target so that we can use the atomic extent swap.
+	 */
+	if (temp_local) {
+		int		logflags = XFS_ILOG_CORE;
+
+		error = xfs_bmap_local_to_extents(sc->tp, sc->tempip, 1,
+				&logflags, XFS_DATA_FORK,
+				xrep_symlink_local_to_remote,
+				sc);
+		if (error)
+			return error;
+
+		xfs_trans_log_inode(sc->tp, sc->ip, 0);
+
+		error = xfs_defer_finish(&sc->tp);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * If the file being repaired had a shortform data fork, convert that
+	 * to an empty extent list in preparation for the atomic extent swap.
+	 */
+	if (ip_local) {
+		struct xfs_ifork	*ifp;
+
+		ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK);
+		xfs_idestroy_fork(ifp);
+		ifp->if_format = XFS_DINODE_FMT_EXTENTS;
+		ifp->if_nextents = 0;
+		ifp->if_bytes = 0;
+		ifp->if_u1.if_root = NULL;
+		ifp->if_height = 0;
+
+		xfs_trans_log_inode(sc->tp, sc->ip,
+				XFS_ILOG_CORE | XFS_ILOG_DDATA);
+	}
+
+	return 0;
+}
+
+/* Swap the temporary link's data fork with the one being repaired. */
+STATIC int
+xrep_symlink_swap(
+	struct xfs_scrub	*sc)
+{
+	struct xrep_tempswap	*tx = sc->buf;
+	bool			ip_local, temp_local;
+	int			error;
+
+	/*
+	 * We're done with the temporary buffer, so we can reuse it for the
+	 * tempfile swap information.
+	 */
+	error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, tx);
+	if (error)
+		return error;
+
+	ip_local = sc->ip->i_df.if_format == XFS_DINODE_FMT_LOCAL;
+	temp_local = sc->tempip->i_df.if_format == XFS_DINODE_FMT_LOCAL;
+
+	/*
+	 * If both links have a local format data fork and the rebuilt
+	 * remote data would fit in the repaired file's data fork, copy the
+	 * contents from the tempfile and declare ourselves done.
+	 */
+	if (ip_local && temp_local &&
+	    sc->tempip->i_disk_size <= xfs_inode_data_fork_size(sc->ip)) {
+		xrep_tempfile_copyout_local(sc, XFS_DATA_FORK);
+		return 0;
+	}
+
+	/* Otherwise, make sure both data forks are in block-mapping mode. */
+	error = xrep_symlink_swap_prep(sc, temp_local, ip_local);
+	if (error)
+		return error;
+
+	return xrep_tempswap_contents(sc, tx);
+}
+
+/*
+ * Free all the remote blocks and reset the data fork.  The caller must join
+ * the inode to the transaction.  This function returns with the inode joined
+ * to a clean scrub transaction.
+ */
+STATIC int
+xrep_symlink_reset_fork(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(sc->tempip, XFS_DATA_FORK);
+	int			error;
+
+	/* Unmap all the remote target buffers. */
+	if (xfs_ifork_has_extents(ifp)) {
+		error = xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK);
+		if (error)
+			return error;
+	}
+
+	trace_xrep_symlink_reset_fork(sc->tempip);
+
+	/* Reset the temp link to have the same dummy content. */
+	xfs_idestroy_fork(ifp);
+	error = xfs_symlink_write_target(sc->tp, sc->tempip, ".", 1, 0, 0);
+	if (error)
+		return error;
+
+	return xrep_tempfile_roll_trans(sc);
+}
+
+/*
+ * Reinitialize a link target.  Caller must ensure the inode is joined to
+ * the transaction.
+ */
+STATIC int
+xrep_symlink_rebuild(
+	struct xfs_scrub	*sc)
+{
+	char			*target_buf = sc->buf;
+	xfs_fsblock_t		fs_blocks;
+	unsigned int		target_len;
+	unsigned int		resblks;
+	int			error;
+
+	/* How many blocks do we need? */
+	target_len = strlen(target_buf);
+	ASSERT(target_len != 0);
+	if (target_len == 0 || target_len > XFS_SYMLINK_MAXLEN)
+		return -EFSCORRUPTED;
+
+	trace_xrep_symlink_rebuild(sc->ip);
+
+	/*
+	 * In preparation to write the new symlink target to the temporary
+	 * file, drop the ILOCK of the file being repaired (it shouldn't be
+	 * joined) and take the ILOCK of the temporary file.
+	 *
+	 * The VFS does not take the IOLOCK while reading a symlink (and new
+	 * symlinks are hidden with INEW until they've been written) so it's
+	 * possible that a readlink() could see the old corrupted contents
+	 * while we're doing this.
+	 */
+	xchk_iunlock(sc, XFS_ILOCK_EXCL);
+	xrep_tempfile_ilock(sc);
+	xfs_trans_ijoin(sc->tp, sc->tempip, 0);
+
+	/*
+	 * Reserve resources to reinitialize the target.  We're allowed to
+	 * exceed file quota to repair inconsistent metadata, though this is
+	 * unlikely.
+	 */
+	fs_blocks = xfs_symlink_blocks(sc->mp, target_len);
+	resblks = XFS_SYMLINK_SPACE_RES(sc->mp, target_len, fs_blocks);
+	error = xfs_trans_reserve_quota_nblks(sc->tp, sc->tempip, resblks, 0,
+			true);
+	if (error)
+		return error;
+
+	/* Erase the dummy target set up by the tempfile initialization. */
+	xfs_idestroy_fork(&sc->tempip->i_df);
+	sc->tempip->i_df.if_bytes = 0;
+	sc->tempip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
+
+	/* Write the salvaged target to the temporary link. */
+	error = __xfs_symlink_write_target(sc->tp, sc->tempip, sc->ip->i_ino,
+			target_buf, target_len, fs_blocks, resblks);
+	if (error)
+		return error;
+
+	/*
+	 * Commit the repair transaction so that we can use the atomic extent
+	 * swap helper functions to compute the correct block reservations and
+	 * re-lock the inodes.
+	 */
+	error = xrep_trans_commit(sc);
+	if (error)
+		return error;
+
+	/* Last chance to abort before we start committing fixes. */
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	xrep_tempfile_iunlock(sc);
+
+	/*
+	 * Swap the temp link's data fork with the file being repaired.  This
+	 * recreates the transaction and takes the ILOCKs of the file being
+	 * repaired and the temporary file.
+	 */
+	error = xrep_symlink_swap(sc);
+	if (error)
+		return error;
+
+	/*
+	 * Release the old symlink blocks and reset the data fork of the temp
+	 * link to an empty shortform link.
+	 */
+	return xrep_symlink_reset_fork(sc);
+}
+
+/* Repair a symbolic link. */
+int
+xrep_symlink(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_has_rmapbt(sc->mp))
+		return -EOPNOTSUPP;
+
+	ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL);
+
+	error = xrep_symlink_salvage(sc);
+	if (error)
+		return error;
+
+	/* Now reset the target. */
+	return xrep_symlink_rebuild(sc);
+}
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index b275c0b764f4..e5087f14343b 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -21,6 +21,7 @@
 #include "xfs_xchgrange.h"
 #include "xfs_swapext.h"
 #include "xfs_defer.h"
+#include "xfs_symlink_remote.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/repair.h"
@@ -109,6 +110,10 @@ xrep_tempfile_create(
 		error = xfs_dir_init(tp, sc->tempip, dp);
 		if (error)
 			goto out_trans_cancel;
+	} else if (S_ISLNK(VFS_I(sc->tempip)->i_mode)) {
+		error = xfs_symlink_write_target(tp, sc->tempip, ".", 1, 0, 0);
+		if (error)
+			goto out_trans_cancel;
 	}
 
 	/*
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index b29ed4dde427..1cacea2ba195 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2596,6 +2596,52 @@ DEFINE_REPAIR_DENTRY_EVENT(xrep_orphanage_check_child);
 DEFINE_REPAIR_DENTRY_EVENT(xrep_orphanage_check_dentry);
 DEFINE_REPAIR_DENTRY_EVENT(xrep_orphanage_zap_child);
 
+TRACE_EVENT(xrep_symlink_salvage_target,
+	TP_PROTO(struct xfs_inode *ip, char *target, unsigned int targetlen),
+	TP_ARGS(ip, target, targetlen),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(unsigned int, targetlen)
+		__dynamic_array(char, target, targetlen + 1)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->targetlen = targetlen;
+		memcpy(__get_str(target), target, targetlen);
+		__get_str(target)[targetlen] = 0;
+	),
+	TP_printk("dev %d:%d ip 0x%llx target '%.*s'",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->targetlen,
+		  __get_str(target))
+);
+
+DECLARE_EVENT_CLASS(xrep_symlink_class,
+	TP_PROTO(struct xfs_inode *ip),
+	TP_ARGS(ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+	),
+	TP_printk("dev %d:%d ip 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino)
+);
+
+#define DEFINE_XREP_SYMLINK_EVENT(name) \
+DEFINE_EVENT(xrep_symlink_class, name, \
+	TP_PROTO(struct xfs_inode *ip), \
+	TP_ARGS(ip))
+DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_rebuild);
+DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_reset_fork);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] xfs: online repair of parent pointers
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (31 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of symbolic links Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/1] xfs: create an xattr iteration function for scrub Darrick J. Wong
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets Darrick J. Wong
                   ` (44 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This is the jumping-off point for rebuilding parent pointer information
and reconstructing directories with parent pointers.  The parent pointer
feature hasn't been merged yet, so this branch contains only a single
patch that refactors the xattr walking code in preparation for that.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-parent-pointers
---
 fs/xfs/Makefile          |    1 
 fs/xfs/scrub/attr.c      |  114 ++++++-----------
 fs/xfs/scrub/listxattr.c |  306 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/listxattr.h |   16 ++
 4 files changed, 364 insertions(+), 73 deletions(-)
 create mode 100644 fs/xfs/scrub/listxattr.c
 create mode 100644 fs/xfs/scrub/listxattr.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: create an xattr iteration function for scrub
  2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of parent pointers Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a streamlined function to walk a file's xattrs, without all the
cursor management stuff in the regular listxattr.
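
As context for reviewers, a consumer of the new iterator supplies a
callback matching the xchk_xattr_fn signature added by this patch.  The
counting actor below is purely illustrative (it is not part of this
patch) and only sketches how a scrubber might drive xchk_xattr_walk():

struct xchk_xattr_count {
	unsigned long long	attrs;
};

/* Illustrative callback: tally one attribute per invocation. */
static int
xchk_count_xattr_actor(
	struct xfs_scrub	*sc,
	struct xfs_inode	*ip,
	unsigned int		attr_flags,
	void			*name,
	unsigned int		namelen,
	unsigned int		valuelen,
	void			*priv)
{
	struct xchk_xattr_count	*xc = priv;
	int			error = 0;

	if (xchk_should_terminate(sc, &error))
		return error;

	xc->attrs++;
	return 0;
}

/* Caller must hold the ILOCK, per the xchk_xattr_walk() contract. */
static int
xchk_count_xattrs(
	struct xfs_scrub	*sc,
	unsigned long long	*attrs)
{
	struct xchk_xattr_count	xc = { };
	int			error;

	error = xchk_xattr_walk(sc, sc->ip, xchk_count_xattr_actor, &xc);
	if (error)
		return error;

	*attrs = xc.attrs;
	return 0;
}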

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile          |    1 
 fs/xfs/scrub/attr.c      |  114 ++++++-----------
 fs/xfs/scrub/listxattr.c |  306 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/listxattr.h |   16 ++
 4 files changed, 364 insertions(+), 73 deletions(-)
 create mode 100644 fs/xfs/scrub/listxattr.c
 create mode 100644 fs/xfs/scrub/listxattr.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e7a8a740318a..33b1ea3e6e6b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -162,6 +162,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   ialloc.o \
 				   inode.o \
 				   iscan.o \
+				   listxattr.o \
 				   nlinks.o \
 				   parent.o \
 				   readdir.o \
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index a1585862c625..971cb6dd7dc8 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -21,6 +21,7 @@
 #include "scrub/common.h"
 #include "scrub/dabtree.h"
 #include "scrub/attr.h"
+#include "scrub/listxattr.h"
 #include "scrub/repair.h"
 
 /* Free the buffers linked from the xattr buffer. */
@@ -153,82 +154,72 @@ xchk_setup_xattr(
 
 /* Extended Attributes */
 
-struct xchk_xattr {
-	struct xfs_attr_list_context	context;
-	struct xfs_scrub		*sc;
-};
-
 /*
  * Check that an extended attribute key can be looked up by hash.
  *
- * We use the XFS attribute list iterator (i.e. xfs_attr_list_ilocked)
- * to call this function for every attribute key in an inode.  Once
- * we're here, we load the attribute value to see if any errors happen,
- * or if we get more or less data than we expected.
+ * We use the extended attribute walk helper to call this function for every
+ * attribute key in an inode.  Once we're here, we load the attribute value to
+ * see if any errors happen, or if we get more or less data than we expected.
  */
-static void
-xchk_xattr_listent(
-	struct xfs_attr_list_context	*context,
-	int				flags,
-	unsigned char			*name,
-	int				namelen,
-	int				valuelen)
+static int
+xchk_xattr_actor(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip,
+	unsigned int		attr_flags,
+	void			*name,
+	unsigned int		namelen,
+	unsigned int		valuelen,
+	void			*priv)
 {
 	struct xfs_da_args		args = {
 		.op_flags		= XFS_DA_OP_NOTIME,
-		.attr_filter		= flags & XFS_ATTR_NSP_ONDISK_MASK,
-		.geo			= context->dp->i_mount->m_attr_geo,
+		.attr_filter		= attr_flags & XFS_ATTR_NSP_ONDISK_MASK,
+		.geo			= sc->mp->m_attr_geo,
 		.whichfork		= XFS_ATTR_FORK,
-		.dp			= context->dp,
+		.dp			= ip,
 		.name			= name,
 		.namelen		= namelen,
 		.hashval		= xfs_da_hashname(name, namelen),
-		.trans			= context->tp,
+		.trans			= sc->tp,
 		.valuelen		= valuelen,
-		.owner			= context->dp->i_ino,
+		.owner			= ip->i_ino,
 	};
 	struct xchk_xattr_buf		*ab;
-	struct xchk_xattr		*sx;
 	int				error = 0;
 
-	sx = container_of(context, struct xchk_xattr, context);
-	ab = sx->sc->buf;
+	ab = sc->buf;
 
-	if (xchk_should_terminate(sx->sc, &error)) {
-		context->seen_enough = error;
-		return;
-	}
+	if (xchk_should_terminate(sc, &error))
+		return error;
 
-	if (flags & XFS_ATTR_INCOMPLETE) {
+	if (attr_flags & XFS_ATTR_INCOMPLETE) {
 		/* Incomplete attr key, just mark the inode for preening. */
-		xchk_ino_set_preen(sx->sc, context->dp->i_ino);
-		return;
+		xchk_ino_set_preen(sc, ip->i_ino);
+		return 0;
 	}
 
 	/* Only one namespace bit allowed. */
-	if (hweight32(flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) {
-		xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno);
-		goto fail_xref;
+	if (hweight32(attr_flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) {
+		xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, args.blkno);
+		return -ECANCELED;
 	}
 
 	/* Does this name make sense? */
 	if (!xfs_attr_namecheck(name, namelen)) {
-		xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno);
-		goto fail_xref;
+		xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, args.blkno);
+		return -ECANCELED;
 	}
 
 	/*
-	 * Try to allocate enough memory to extrat the attr value.  If that
-	 * doesn't work, we overload the seen_enough variable to convey
-	 * the error message back to the main scrub function.
+	 * Try to allocate enough memory to extract the attr value.  If that
+	 * doesn't work, return -EDEADLOCK as a signal to try again with a
+	 * maximally sized buffer.
 	 */
-	error = xchk_setup_xattr_buf(sx->sc, valuelen);
+	error = xchk_setup_xattr_buf(sc, valuelen);
 	if (error == -ENOMEM)
 		error = -EDEADLOCK;
-	if (error) {
-		context->seen_enough = error;
-		return;
-	}
+	if (error)
+		return error;
 
 	args.value = ab->value;
 
@@ -236,16 +227,13 @@ xchk_xattr_listent(
 	/* ENODATA means the hash lookup failed and the attr is bad */
 	if (error == -ENODATA)
 		error = -EFSCORRUPTED;
-	if (!xchk_fblock_process_error(sx->sc, XFS_ATTR_FORK, args.blkno,
+	if (!xchk_fblock_process_error(sc, XFS_ATTR_FORK, args.blkno,
 			&error))
-		goto fail_xref;
+		return error;
 	if (args.valuelen != valuelen)
-		xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK,
-					     args.blkno);
-fail_xref:
-	if (sx->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		context->seen_enough = 1;
-	return;
+		xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, args.blkno);
+
+	return 0;
 }
 
 /*
@@ -615,16 +603,6 @@ int
 xchk_xattr(
 	struct xfs_scrub		*sc)
 {
-	struct xchk_xattr		sx = {
-		.sc			= sc,
-		.context		= {
-			.dp		= sc->ip,
-			.tp		= sc->tp,
-			.resynch	= 1,
-			.put_listent	= xchk_xattr_listent,
-			.allow_incomplete = true,
-		},
-	};
 	xfs_dablk_t			last_checked = -1U;
 	int				error = 0;
 
@@ -653,12 +631,6 @@ xchk_xattr(
 	/*
 	 * Look up every xattr in this file by name and hash.
 	 *
-	 * Use the backend implementation of xfs_attr_list to call
-	 * xchk_xattr_listent on every attribute key in this inode.
-	 * In other words, we use the same iterator/callback mechanism
-	 * that listattr uses to scrub extended attributes, though in our
-	 * _listent function, we check the value of the attribute.
-	 *
 	 * The VFS only locks i_rwsem when modifying attrs, so keep all
 	 * three locks held because that's the only way to ensure we're
 	 * the only thread poking into the da btree.  We traverse the da
@@ -666,13 +638,9 @@ xchk_xattr(
 	 * iteration, which doesn't really follow the usual buffer
 	 * locking order.
 	 */
-	error = xfs_attr_list_ilocked(&sx.context);
+	error = xchk_xattr_walk(sc, sc->ip, xchk_xattr_actor, NULL);
 	if (!xchk_fblock_process_error(sc, XFS_ATTR_FORK, 0, &error))
 		return error;
 
-	/* Did our listent function try to return any errors? */
-	if (sx.context.seen_enough < 0)
-		return sx.context.seen_enough;
-
 	return 0;
 }
diff --git a/fs/xfs/scrub/listxattr.c b/fs/xfs/scrub/listxattr.c
new file mode 100644
index 000000000000..60f0ff9e2776
--- /dev/null
+++ b/fs/xfs/scrub/listxattr.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr_sf.h"
+#include "xfs_trans.h"
+#include "scrub/scrub.h"
+#include "scrub/bitmap.h"
+#include "scrub/listxattr.h"
+
+/* Call a function for every entry in a shortform xattr structure. */
+STATIC int
+xchk_xattr_walk_sf(
+	struct xfs_scrub		*sc,
+	struct xfs_inode		*ip,
+	xchk_xattr_fn			attr_fn,
+	void				*priv)
+{
+	struct xfs_attr_shortform	*sf;
+	struct xfs_attr_sf_entry	*sfe;
+	unsigned int			i;
+	int				error;
+
+	sf = (struct xfs_attr_shortform *)ip->i_af.if_u1.if_data;
+	for (i = 0, sfe = &sf->list[0]; i < sf->hdr.count; i++) {
+		error = attr_fn(sc, ip, sfe->flags, sfe->nameval, sfe->namelen,
+				sfe->valuelen, priv);
+		if (error)
+			return error;
+
+		sfe = xfs_attr_sf_nextentry(sfe);
+	}
+
+	return 0;
+}
+
+/* Call a function for every entry in this xattr leaf block. */
+STATIC int
+xchk_xattr_walk_leaf_entries(
+	struct xfs_scrub		*sc,
+	struct xfs_inode		*ip,
+	xchk_xattr_fn			attr_fn,
+	struct xfs_buf			*bp,
+	void				*priv)
+{
+	struct xfs_attr3_icleaf_hdr	ichdr;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_attr_leafblock	*leaf = bp->b_addr;
+	struct xfs_attr_leaf_entry	*entry;
+	unsigned int			i;
+	int				error;
+
+	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &ichdr, leaf);
+	entry = xfs_attr3_leaf_entryp(leaf);
+
+	for (i = 0; i < ichdr.count; entry++, i++) {
+		char			*name;
+		unsigned int		namelen, valuelen;
+
+		if (entry->flags & XFS_ATTR_LOCAL) {
+			struct xfs_attr_leaf_name_local		*name_loc;
+
+			name_loc = xfs_attr3_leaf_name_local(leaf, i);
+			name = name_loc->nameval;
+			namelen = name_loc->namelen;
+			valuelen = be16_to_cpu(name_loc->valuelen);
+		} else {
+			struct xfs_attr_leaf_name_remote	*name_rmt;
+
+			name_rmt = xfs_attr3_leaf_name_remote(leaf, i);
+			name = name_rmt->name;
+			namelen = name_rmt->namelen;
+			valuelen = be32_to_cpu(name_rmt->valuelen);
+		}
+
+		error = attr_fn(sc, ip, entry->flags, name, namelen, valuelen,
+				priv);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Call a function for every entry in a leaf-format xattr structure.  Avoid
+ * memory allocations for the loop detector since there's only one block.
+ */
+STATIC int
+xchk_xattr_walk_leaf(
+	struct xfs_scrub		*sc,
+	struct xfs_inode		*ip,
+	xchk_xattr_fn			attr_fn,
+	void				*priv)
+{
+	struct xfs_buf			*leaf_bp;
+	int				error;
+
+	error = xfs_attr3_leaf_read(sc->tp, ip, ip->i_ino, 0, &leaf_bp);
+	if (error)
+		return error;
+
+	error = xchk_xattr_walk_leaf_entries(sc, ip, attr_fn, leaf_bp, priv);
+	xfs_trans_brelse(sc->tp, leaf_bp);
+	return error;
+}
+
+/* Find the leftmost leaf in the xattr dabtree. */
+STATIC int
+xchk_xattr_find_leftmost_leaf(
+	struct xfs_scrub		*sc,
+	struct xfs_inode		*ip,
+	struct xbitmap			*seen_blocks,
+	struct xfs_buf			**leaf_bpp)
+{
+	struct xfs_da3_icnode_hdr	nodehdr;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_trans		*tp = sc->tp;
+	struct xfs_da_intnode		*node;
+	struct xfs_da_node_entry	*btree;
+	struct xfs_buf			*bp;
+	xfs_failaddr_t			fa;
+	xfs_dablk_t			blkno = 0;
+	unsigned int			expected_level = 0;
+	int				error;
+
+	for (;;) {
+		uint64_t		len;
+		uint16_t		magic;
+
+		error = xfs_da3_node_read(tp, ip, blkno, &bp, XFS_ATTR_FORK);
+		if (error)
+			return error;
+
+		node = bp->b_addr;
+		magic = be16_to_cpu(node->hdr.info.magic);
+		if (magic == XFS_ATTR_LEAF_MAGIC ||
+		    magic == XFS_ATTR3_LEAF_MAGIC)
+			break;
+
+		error = -EFSCORRUPTED;
+		if (magic != XFS_DA_NODE_MAGIC &&
+		    magic != XFS_DA3_NODE_MAGIC)
+			goto out_buf;
+
+		fa = xfs_da3_node_header_check(bp, ip->i_ino);
+		if (fa)
+			goto out_buf;
+
+		xfs_da3_node_hdr_from_disk(mp, &nodehdr, node);
+
+		if (nodehdr.count == 0 || nodehdr.level >= XFS_DA_NODE_MAXDEPTH)
+			goto out_buf;
+
+		/* Check the level from the root node. */
+		if (blkno == 0)
+			expected_level = nodehdr.level - 1;
+		else if (expected_level != nodehdr.level)
+			goto out_buf;
+		else
+			expected_level--;
+
+		/* Remember that we've seen this node. */
+		error = xbitmap_set(seen_blocks, blkno, 1);
+		if (error)
+			goto out_buf;
+
+		/* Find the next level towards the leaves of the dabtree. */
+		btree = nodehdr.btree;
+		blkno = be32_to_cpu(btree->before);
+		xfs_trans_brelse(tp, bp);
+
+		/* Make sure we haven't seen this new block already. */
+		len = 1;
+		if (xbitmap_test(seen_blocks, blkno, &len))
+			return -EFSCORRUPTED;
+	}
+
+	error = -EFSCORRUPTED;
+	fa = xfs_attr3_leaf_header_check(bp, ip->i_ino);
+	if (fa)
+		goto out_buf;
+
+	if (expected_level != 0)
+		goto out_buf;
+
+	/* Remember that we've seen this leaf. */
+	error = xbitmap_set(seen_blocks, blkno, 1);
+	if (error)
+		goto out_buf;
+
+	*leaf_bpp = bp;
+	return 0;
+
+out_buf:
+	xfs_trans_brelse(tp, bp);
+	return error;
+}
+
+/* Call a function for every entry in a node-format xattr structure. */
+STATIC int
+xchk_xattr_walk_node(
+	struct xfs_scrub		*sc,
+	struct xfs_inode		*ip,
+	xchk_xattr_fn			attr_fn,
+	void				*priv)
+{
+	struct xfs_attr3_icleaf_hdr	leafhdr;
+	struct xbitmap			seen_blocks;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_attr_leafblock	*leaf;
+	struct xfs_buf			*leaf_bp;
+	int				error;
+
+	xbitmap_init(&seen_blocks);
+
+	error = xchk_xattr_find_leftmost_leaf(sc, ip, &seen_blocks, &leaf_bp);
+	if (error)
+		goto out_bitmap;
+
+	for (;;) {
+		uint64_t	len;
+
+		error = xchk_xattr_walk_leaf_entries(sc, ip, attr_fn, leaf_bp,
+				priv);
+		if (error)
+			goto out_leaf;
+
+		/* Find the right sibling of this leaf block. */
+		leaf = leaf_bp->b_addr;
+		xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
+		if (leafhdr.forw == 0)
+			goto out_leaf;
+
+		xfs_trans_brelse(sc->tp, leaf_bp);
+
+		/* Make sure we haven't seen this new leaf already. */
+		len = 1;
+		if (xbitmap_test(&seen_blocks, leafhdr.forw, &len)) {
+			error = -EFSCORRUPTED;
+			goto out_bitmap;
+		}
+
+		error = xfs_attr3_leaf_read(sc->tp, ip, ip->i_ino,
+				leafhdr.forw, &leaf_bp);
+		if (error)
+			goto out_bitmap;
+
+		/* Remember that we've seen this new leaf. */
+		error = xbitmap_set(&seen_blocks, leafhdr.forw, 1);
+		if (error)
+			goto out_leaf;
+	}
+
+out_leaf:
+	xfs_trans_brelse(sc->tp, leaf_bp);
+out_bitmap:
+	xbitmap_destroy(&seen_blocks);
+	return error;
+}
+
+/*
+ * Call a function for every extended attribute in a file.
+ *
+ * Callers must hold the ILOCK.  No validation or cursor restarts allowed.
+ * Returns -EFSCORRUPTED on any problem, including loops in the dabtree.
+ */
+int
+xchk_xattr_walk(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip,
+	xchk_xattr_fn		attr_fn,
+	void			*priv)
+{
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
+
+	if (!xfs_inode_hasattr(ip))
+		return 0;
+
+	if (ip->i_af.if_format == XFS_DINODE_FMT_LOCAL)
+		return xchk_xattr_walk_sf(sc, ip, attr_fn, priv);
+
+	/* attr functions require that the attr fork is loaded */
+	error = xfs_iread_extents(sc->tp, ip, XFS_ATTR_FORK);
+	if (error)
+		return error;
+
+	if (xfs_attr_is_leaf(ip))
+		return xchk_xattr_walk_leaf(sc, ip, attr_fn, priv);
+
+	return xchk_xattr_walk_node(sc, ip, attr_fn, priv);
+}
diff --git a/fs/xfs/scrub/listxattr.h b/fs/xfs/scrub/listxattr.h
new file mode 100644
index 000000000000..3a94573caf86
--- /dev/null
+++ b/fs/xfs/scrub/listxattr.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SCRUB_LISTXATTR_H__
+#define __XFS_SCRUB_LISTXATTR_H__
+
+typedef int (*xchk_xattr_fn)(struct xfs_scrub *sc, struct xfs_inode *ip,
+		unsigned int attr_flags, void *name, unsigned int namelen,
+		unsigned int valuelen, void *priv);
+
+int xchk_xattr_walk(struct xfs_scrub *sc, struct xfs_inode *ip,
+		xchk_xattr_fn attr_fn, void *priv);
+
+#endif /* __XFS_SCRUB_LISTXATTR_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (32 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of parent pointers Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/4] xfs: check AGI unlinked inode buckets Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: cache xfile pages for better performance Darrick J. Wong
                   ` (43 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series enhances the AGI scrub code to check the unlinked inode
bucket lists for errors and to fix them if necessary.  Now that iunlink
pointer updates are virtual log items, we can batch updates pretty
efficiently in the logging code.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-iunlink

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-iunlink
---
 fs/xfs/scrub/agheader.c        |   40 +++
 fs/xfs/scrub/agheader_repair.c |  576 +++++++++++++++++++++++++++++++++++++---
 fs/xfs/scrub/trace.h           |  128 +++++++++
 fs/xfs/xfs_icache.c            |    2 
 fs/xfs/xfs_inode.c             |    5 
 fs/xfs/xfs_inode.h             |   11 +
 6 files changed, 712 insertions(+), 50 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:14   ` [PATCH 3/4] xfs: hoist AGI repair context to a heap object Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI.  An upcoming enhancement to the online AGI fsck code
will need to make this distinction to check and rebuild the AGI unlinked
buckets.

This distinction is necessary because it is not sufficient to check
xfs_inode.i_nlink to decide if an inode is on the unlinked list.
Updates to i_nlink can happen while holding only ILOCK_EXCL, but updates
to an inode's position in the AGI unlinked list require both ILOCK_EXCL
and the AGI buffer lock.
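
Put another way, the new encoding boils down to a predicate like the one
below; the helper name is hypothetical and shown only to illustrate the
convention:

/*
 * i_prev_unlinked == 0:         not on any unlinked list
 * i_prev_unlinked == NULLAGINO: an AGI bucket points at this inode
 * anything else:                the previous inode in the bucket chain
 */
static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
{
	return ip->i_prev_unlinked != 0;
}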

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    2 +-
 fs/xfs/xfs_inode.c  |    3 ++-
 fs/xfs/xfs_inode.h  |   10 +++++++++-
 3 files changed, 12 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e94c193cd417..06b3de67d791 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -114,7 +114,7 @@ xfs_inode_alloc(
 	INIT_LIST_HEAD(&ip->i_ioend_list);
 	spin_lock_init(&ip->i_ioend_lock);
 	ip->i_next_unlinked = NULLAGINO;
-	ip->i_prev_unlinked = NULLAGINO;
+	ip->i_prev_unlinked = 0;
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 85ce54a09d82..3788093fc81d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2143,6 +2143,7 @@ xfs_iunlink_insert_inode(
 	}
 
 	/* Point the head of the list to point to this inode. */
+	ip->i_prev_unlinked = NULLAGINO;
 	return xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index, agino);
 }
 
@@ -2245,7 +2246,7 @@ xfs_iunlink_remove_inode(
 	}
 
 	ip->i_next_unlinked = NULLAGINO;
-	ip->i_prev_unlinked = NULLAGINO;
+	ip->i_prev_unlinked = 0;
 	return error;
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 34f596deb92c..177b027b8803 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -68,7 +68,15 @@ typedef struct xfs_inode {
 	uint64_t		i_diflags2;	/* XFS_DIFLAG2_... */
 	struct timespec64	i_crtime;	/* time created */
 
-	/* unlinked list pointers */
+	/*
+	 * Unlinked list pointers.  These point to the next and previous inodes
+	 * in the AGI unlinked bucket list, respectively.  These fields can
+	 * only be updated with the AGI locked.
+	 *
+	 * i_next_unlinked reflects di_next_unlinked.  If i_prev_unlinked == 0,
+	 * the inode is not on the unlinked list.  If it is NULLAGINO, an AGI
+	 * bucket points at this inode.
+	 */
 	xfs_agino_t		i_next_unlinked;
 	xfs_agino_t		i_prev_unlinked;
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs: check AGI unlinked inode buckets
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/4] xfs: repair AGI unlinked inode bucket lists Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Look for corruptions in the AGI unlinked bucket chains.
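
The key invariant being enforced is that an unlinked inode must hang off
the bucket selected by its AG inode number.  A minimal sketch of that
check (the helper name is illustrative only):

/* agino N belongs to bucket N % XFS_AGI_UNLINKED_BUCKETS (i.e. N % 64). */
static inline bool
xchk_iunlink_bucket_ok(xfs_agino_t agino, unsigned int bucket)
{
	return agino % XFS_AGI_UNLINKED_BUCKETS == bucket;
}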

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader.c |   40 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.c      |    2 +-
 fs/xfs/xfs_inode.h      |    1 +
 3 files changed, 42 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 75de0ba4fcef..fb2f32a2af5d 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -15,6 +15,7 @@
 #include "xfs_ialloc.h"
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
+#include "xfs_inode.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 
@@ -865,6 +866,43 @@ xchk_agi_xref(
 	/* scrub teardown will take care of sc->sa for us */
 }
 
+/*
+ * Check the unlinked buckets for links to bad inodes.  We hold the AGI, so
+ * there cannot be any threads updating unlinked list pointers in this AG.
+ */
+STATIC void
+xchk_iunlink(
+	struct xfs_scrub	*sc,
+	struct xfs_agi		*agi)
+{
+	unsigned int		i;
+	struct xfs_inode	*ip;
+
+	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
+		xfs_agino_t	agino = be32_to_cpu(agi->agi_unlinked[i]);
+
+		while (agino != NULLAGINO) {
+			if (agino % XFS_AGI_UNLINKED_BUCKETS != i) {
+				xchk_block_set_corrupt(sc, sc->sa.agi_bp);
+				return;
+			}
+
+			ip = xfs_iunlink_lookup(sc->sa.pag, agino);
+			if (!ip) {
+				xchk_block_set_corrupt(sc, sc->sa.agi_bp);
+				return;
+			}
+
+			if (ip->i_prev_unlinked == 0) {
+				xchk_block_set_corrupt(sc, sc->sa.agi_bp);
+				return;
+			}
+
+			agino = ip->i_next_unlinked;
+		}
+	}
+}
+
 /* Scrub the AGI. */
 int
 xchk_agi(
@@ -949,6 +987,8 @@ xchk_agi(
 	if (pag->pagi_freecount != be32_to_cpu(agi->agi_freecount))
 		xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 
+	xchk_iunlink(sc, agi);
+
 	xchk_agi_xref(sc);
 out:
 	return error;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 3788093fc81d..af4ac808a0e0 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2010,7 +2010,7 @@ xfs_inactive(
  * only unlinked, referenced inodes can be on the unlinked inode list.  If we
  * don't find the inode in cache, then let the caller handle the situation.
  */
-static struct xfs_inode *
+struct xfs_inode *
 xfs_iunlink_lookup(
 	struct xfs_perag	*pag,
 	xfs_agino_t		agino)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 177b027b8803..be704174fa4f 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -584,6 +584,7 @@ extern struct kmem_cache	*xfs_inode_cache;
 bool xfs_inode_needs_inactive(struct xfs_inode *ip);
 
 int xfs_iunlink(struct xfs_trans *tp, struct xfs_inode *ip);
+struct xfs_inode *xfs_iunlink_lookup(struct xfs_perag *pag, xfs_agino_t agino);
 
 void xfs_end_io(struct work_struct *work);
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs: hoist AGI repair context to a heap object
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/4] xfs: check AGI unlinked inode buckets Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 4/4] xfs: repair AGI unlinked inode bucket lists Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/4] xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Save ~460 bytes of stack space by moving all the repair context to a
heap object.  We're going to add even more context data in the next
patch, which is why we really need to do this now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |  105 ++++++++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 42 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index daeb88cf5825..26ed60c24386 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -803,15 +803,29 @@ enum {
 	XREP_AGI_MAX
 };
 
+struct xrep_agi {
+	struct xfs_scrub		*sc;
+
+	/* AGI buffer, tracked separately */
+	struct xfs_buf			*agi_bp;
+
+	/* context for finding btree roots */
+	struct xrep_find_ag_btree	fab[XREP_AGI_MAX];
+
+	/* old AGI contents in case we have to revert */
+	struct xfs_agi			old_agi;
+};
+
 /*
  * Given the inode btree roots described by *fab, find the roots, check them
  * for sanity, and pass the root data back out via *fab.
  */
 STATIC int
 xrep_agi_find_btrees(
-	struct xfs_scrub		*sc,
-	struct xrep_find_ag_btree	*fab)
+	struct xrep_agi			*ragi)
 {
+	struct xfs_scrub		*sc = ragi->sc;
+	struct xrep_find_ag_btree	*fab = ragi->fab;
 	struct xfs_buf			*agf_bp;
 	struct xfs_mount		*mp = sc->mp;
 	int				error;
@@ -844,10 +858,11 @@ xrep_agi_find_btrees(
  */
 STATIC void
 xrep_agi_init_header(
-	struct xfs_scrub	*sc,
-	struct xfs_buf		*agi_bp,
-	struct xfs_agi		*old_agi)
+	struct xrep_agi		*ragi)
 {
+	struct xfs_scrub	*sc = ragi->sc;
+	struct xfs_buf		*agi_bp = ragi->agi_bp;
+	struct xfs_agi		*old_agi = &ragi->old_agi;
 	struct xfs_agi		*agi = agi_bp->b_addr;
 	struct xfs_mount	*mp = sc->mp;
 
@@ -874,10 +889,12 @@ xrep_agi_init_header(
 /* Set btree root information in an AGI. */
 STATIC void
 xrep_agi_set_roots(
-	struct xfs_scrub		*sc,
-	struct xfs_agi			*agi,
-	struct xrep_find_ag_btree	*fab)
+	struct xrep_agi			*ragi)
 {
+	struct xfs_scrub		*sc = ragi->sc;
+	struct xfs_agi			*agi = ragi->agi_bp->b_addr;
+	struct xrep_find_ag_btree	*fab = ragi->fab;
+
 	agi->agi_root = cpu_to_be32(fab[XREP_AGI_INOBT].root);
 	agi->agi_level = cpu_to_be32(fab[XREP_AGI_INOBT].height);
 
@@ -890,9 +907,10 @@ xrep_agi_set_roots(
 /* Update the AGI counters. */
 STATIC int
 xrep_agi_calc_from_btrees(
-	struct xfs_scrub	*sc,
-	struct xfs_buf		*agi_bp)
+	struct xrep_agi		*ragi)
 {
+	struct xfs_scrub	*sc = ragi->sc;
+	struct xfs_buf		*agi_bp = ragi->agi_bp;
 	struct xfs_btree_cur	*cur;
 	struct xfs_agi		*agi = agi_bp->b_addr;
 	struct xfs_mount	*mp = sc->mp;
@@ -939,9 +957,10 @@ xrep_agi_calc_from_btrees(
 /* Trigger reinitialization of the in-core data. */
 STATIC int
 xrep_agi_commit_new(
-	struct xfs_scrub	*sc,
-	struct xfs_buf		*agi_bp)
+	struct xrep_agi		*ragi)
 {
+	struct xfs_scrub	*sc = ragi->sc;
+	struct xfs_buf		*agi_bp = ragi->agi_bp;
 	struct xfs_perag	*pag;
 	struct xfs_agi		*agi = agi_bp->b_addr;
 
@@ -964,33 +983,36 @@ xrep_agi_commit_new(
 /* Repair the AGI. */
 int
 xrep_agi(
-	struct xfs_scrub		*sc)
+	struct xfs_scrub	*sc)
 {
-	struct xrep_find_ag_btree	fab[XREP_AGI_MAX] = {
-		[XREP_AGI_INOBT] = {
-			.rmap_owner = XFS_RMAP_OWN_INOBT,
-			.buf_ops = &xfs_inobt_buf_ops,
-			.maxlevels = M_IGEO(sc->mp)->inobt_maxlevels,
-		},
-		[XREP_AGI_FINOBT] = {
-			.rmap_owner = XFS_RMAP_OWN_INOBT,
-			.buf_ops = &xfs_finobt_buf_ops,
-			.maxlevels = M_IGEO(sc->mp)->inobt_maxlevels,
-		},
-		[XREP_AGI_END] = {
-			.buf_ops = NULL
-		},
-	};
-	struct xfs_agi			old_agi;
-	struct xfs_mount		*mp = sc->mp;
-	struct xfs_buf			*agi_bp;
-	struct xfs_agi			*agi;
-	int				error;
+	struct xrep_agi		*ragi;
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
 
 	/* We require the rmapbt to rebuild anything. */
 	if (!xfs_has_rmapbt(mp))
 		return -EOPNOTSUPP;
 
+	sc->buf = kzalloc(sizeof(struct xrep_agi), XCHK_GFP_FLAGS);
+	if (!sc->buf)
+		return -ENOMEM;
+	ragi = sc->buf;
+	ragi->sc = sc;
+
+	ragi->fab[XREP_AGI_INOBT] = (struct xrep_find_ag_btree){
+		.rmap_owner	= XFS_RMAP_OWN_INOBT,
+		.buf_ops	= &xfs_inobt_buf_ops,
+		.maxlevels	= M_IGEO(sc->mp)->inobt_maxlevels,
+	};
+	ragi->fab[XREP_AGI_FINOBT] = (struct xrep_find_ag_btree){
+		.rmap_owner	= XFS_RMAP_OWN_INOBT,
+		.buf_ops	= &xfs_finobt_buf_ops,
+		.maxlevels	= M_IGEO(sc->mp)->inobt_maxlevels,
+	};
+	ragi->fab[XREP_AGI_END] = (struct xrep_find_ag_btree){
+		.buf_ops	= NULL,
+	};
+
 	/*
 	 * Make sure we have the AGI buffer, as scrub might have decided it
 	 * was corrupt after xfs_ialloc_read_agi failed with -EFSCORRUPTED.
@@ -998,14 +1020,13 @@ xrep_agi(
 	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
 			XFS_AG_DADDR(mp, sc->sa.pag->pag_agno,
 						XFS_AGI_DADDR(mp)),
-			XFS_FSS_TO_BB(mp, 1), 0, &agi_bp, NULL);
+			XFS_FSS_TO_BB(mp, 1), 0, &ragi->agi_bp, NULL);
 	if (error)
 		return error;
-	agi_bp->b_ops = &xfs_agi_buf_ops;
-	agi = agi_bp->b_addr;
+	ragi->agi_bp->b_ops = &xfs_agi_buf_ops;
 
 	/* Find the AGI btree roots. */
-	error = xrep_agi_find_btrees(sc, fab);
+	error = xrep_agi_find_btrees(ragi);
 	if (error)
 		return error;
 
@@ -1014,18 +1035,18 @@ xrep_agi(
 		return error;
 
 	/* Start rewriting the header and implant the btrees we found. */
-	xrep_agi_init_header(sc, agi_bp, &old_agi);
-	xrep_agi_set_roots(sc, agi, fab);
-	error = xrep_agi_calc_from_btrees(sc, agi_bp);
+	xrep_agi_init_header(ragi);
+	xrep_agi_set_roots(ragi);
+	error = xrep_agi_calc_from_btrees(ragi);
 	if (error)
 		goto out_revert;
 
 	/* Reinitialize in-core state. */
-	return xrep_agi_commit_new(sc, agi_bp);
+	return xrep_agi_commit_new(ragi);
 
 out_revert:
 	/* Mark the incore AGI state stale and revert the AGI. */
 	sc->sa.pag->pagi_init = 0;
-	memcpy(agi, &old_agi, sizeof(old_agi));
+	memcpy(ragi->agi_bp->b_addr, &ragi->old_agi, sizeof(struct xfs_agi));
 	return error;
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs: repair AGI unlinked inode bucket lists
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/4] xfs: check AGI unlinked inode buckets Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/4] xfs: hoist AGI repair context to a heap object Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/4] xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach the AGI repair code to rebuild the unlinked buckets and lists.
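
For reference, forcing this repair from userspace uses the same
XFS_IOC_SCRUB_METADATA ioctl as the symlink sketch earlier in this
deluge, except that the AGI is an AG-scoped scrubber, so the caller
fills in sm_agno instead of sm_ino/sm_gen.  A hedged fragment, with
error handling and setup elided:

	struct xfs_scrub_metadata sm = {
		.sm_type	= XFS_SCRUB_TYPE_AGI,
		.sm_flags	= XFS_SCRUB_IFLAG_REPAIR,
		.sm_agno	= 0,		/* placeholder AG number */
	};

	/* fd is any open file or directory on the target filesystem. */
	if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) == 0 &&
	    (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
		fprintf(stderr, "AGI %u still corrupt after repair\n",
				sm.sm_agno);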

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |  471 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/trace.h           |  128 +++++++++++
 2 files changed, 595 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 26ed60c24386..6170296b087f 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -22,12 +22,16 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
 #include "xfs_ag.h"
+#include "xfs_inode.h"
+#include "xfs_iunlink_item.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
 #include "scrub/repair.h"
 #include "scrub/bitmap.h"
 #include "scrub/reap.h"
+#include "scrub/xfile.h"
+#include "scrub/xfarray.h"
 
 /* Superblock */
 
@@ -803,6 +807,8 @@ enum {
 	XREP_AGI_MAX
 };
 
+#define XREP_AGI_LOOKUP_BATCH		32
+
 struct xrep_agi {
 	struct xfs_scrub		*sc;
 
@@ -814,8 +820,34 @@ struct xrep_agi {
 
 	/* old AGI contents in case we have to revert */
 	struct xfs_agi			old_agi;
+
+	/* bitmap of which inodes are unlinked */
+	struct xbitmap			iunlink_bmp;
+
+	/* heads of the unlinked inode bucket lists */
+	xfs_agino_t			iunlink_heads[XFS_AGI_UNLINKED_BUCKETS];
+
+	/* scratchpad for batched lookups of the radix tree */
+	struct xfs_inode		*lookup_batch[XREP_AGI_LOOKUP_BATCH];
+
+	/* Map of ino -> next_ino for unlinked inode processing. */
+	struct xfarray			*iunlink_next;
+
+	/* Map of ino -> prev_ino for unlinked inode processing. */
+	struct xfarray			*iunlink_prev;
 };
 
+static void
+xrep_agi_buf_cleanup(
+	void		*buf)
+{
+	struct xrep_agi	*ragi = buf;
+
+	xfarray_destroy(ragi->iunlink_prev);
+	xfarray_destroy(ragi->iunlink_next);
+	xbitmap_destroy(&ragi->iunlink_bmp);
+}
+
 /*
  * Given the inode btree roots described by *fab, find the roots, check them
  * for sanity, and pass the root data back out via *fab.
@@ -877,10 +909,6 @@ xrep_agi_init_header(
 	if (xfs_has_crc(mp))
 		uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
 
-	/* We don't know how to fix the unlinked list yet. */
-	memcpy(&agi->agi_unlinked, &old_agi->agi_unlinked,
-			sizeof(agi->agi_unlinked));
-
 	/* Mark the incore AGI data stale until we're done fixing things. */
 	ASSERT(sc->sa.pag->pagi_init);
 	sc->sa.pag->pagi_init = 0;
@@ -954,6 +982,417 @@ xrep_agi_calc_from_btrees(
 	return error;
 }
 
+/* Stage @ip->i_next_unlinked = next_agino */
+static inline int
+xrep_iunlink_store_next(
+	struct xrep_agi		*ragi,
+	struct xfs_inode	*ip,
+	xfs_agino_t		next_agino)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+
+	return xfarray_store(ragi->iunlink_next, agino, &next_agino);
+}
+
+/* Stage @xfs_iget(agino)->i_prev_unlinked = prev_agino */
+static inline int
+xrep_iunlink_store_prev(
+	struct xrep_agi		*ragi,
+	xfs_agino_t		agino,
+	xfs_agino_t		prev_agino)
+{
+	ASSERT(prev_agino != 0);
+
+	return xfarray_store(ragi->iunlink_prev, agino, &prev_agino);
+}
+
+/* Decide if this is an unlinked inode in this AG. */
+STATIC bool
+xrep_iunlink_igrab(
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
+		return false;
+
+	/* Not on the unlinked list yet. */
+	if (ip->i_prev_unlinked == 0)
+		return false;
+
+	return true;
+}
+
+/*
+ * Mark the given inode in the lookup batch in our unlinked inode bitmap, and
+ * remember if this inode is the start of the unlinked chain.
+ */
+STATIC int
+xrep_iunlink_visit(
+	struct xrep_agi		*ragi,
+	unsigned int		batch_idx)
+{
+	struct xfs_mount	*mp = ragi->sc->mp;
+	struct xfs_inode	*ip = ragi->lookup_batch[batch_idx];
+	xfs_agino_t		agino;
+	unsigned int		bucket;
+	int			error;
+
+	ASSERT(XFS_INO_TO_AGNO(mp, ip->i_ino) == ragi->sc->sa.pag->pag_agno);
+	ASSERT(ip->i_prev_unlinked != 0);
+
+	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	bucket = agino % XFS_AGI_UNLINKED_BUCKETS;
+
+	trace_xrep_iunlink_visit(ragi->sc->sa.pag, ragi->iunlink_heads[bucket],
+			ip);
+
+	error = xbitmap_set(&ragi->iunlink_bmp, agino, 1);
+	if (error)
+		return error;
+
+	if (ip->i_prev_unlinked == NULLAGINO) {
+		if (ragi->iunlink_heads[bucket] == NULLAGINO)
+			ragi->iunlink_heads[bucket] = agino;
+	}
+
+	return 0;
+}
+
+/*
+ * Find all incore unlinked inodes so that we can rebuild the unlinked buckets.
+ * We hold the AGI so there should not be any modifications to the unlinked
+ * list.
+ */
+STATIC int
+xrep_iunlink_find(
+	struct xrep_agi		*ragi)
+{
+	struct xfs_perag	*pag = ragi->sc->sa.pag;
+	struct xfs_mount	*mp = pag->pag_mount;
+	uint32_t		first_index = 0;
+	bool			done = false;
+	unsigned int		nr_found = 0;
+
+	do {
+		unsigned int	i;
+		int		error = 0;
+
+		if (xchk_should_terminate(ragi->sc, &error))
+			return error;
+
+		rcu_read_lock();
+
+		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
+				(void **)&ragi->lookup_batch, first_index,
+				XREP_AGI_LOOKUP_BATCH);
+		if (!nr_found) {
+			rcu_read_unlock();
+			return 0;
+		}
+
+		for (i = 0; i < nr_found; i++) {
+			struct xfs_inode *ip = ragi->lookup_batch[i];
+
+			if (done || !xrep_iunlink_igrab(pag, ip))
+				ragi->lookup_batch[i] = NULL;
+
+			/*
+			 * Update the index for the next lookup. Catch
+			 * overflows into the next AG range which can occur if
+			 * we have inodes in the last block of the AG and we
+			 * are currently pointing to the last inode.
+			 *
+			 * Because we may see inodes that are from the wrong AG
+			 * due to RCU freeing and reallocation, only update the
+			 * index if it lies in this AG. It was a race that led
+			 * us to see this inode, so another lookup from the
+			 * same index will not find it again.
+			 */
+			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
+				continue;
+			first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
+			if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
+				done = true;
+		}
+
+		/* unlock now we've grabbed the inodes. */
+		rcu_read_unlock();
+
+		for (i = 0; i < nr_found; i++) {
+			if (!ragi->lookup_batch[i])
+				continue;
+			error = xrep_iunlink_visit(ragi, i);
+			if (error)
+				return error;
+		}
+	} while (!done);
+
+	return 0;
+}
+
+/* Walk an iunlink bucket's list, remembering all the good inodes. */
+STATIC int
+xrep_iunlink_walk_bucket(
+	struct xrep_agi		*ragi,
+	unsigned int		bucket)
+{
+	struct xfs_scrub	*sc = ragi->sc;
+	struct xfs_inode	*prev_ip = NULL;
+	struct xfs_inode	*ip;
+	xfs_agino_t		prev_agino = NULLAGINO;
+	xfs_agino_t		next_agino = ragi->iunlink_heads[bucket];
+	int			error = 0;
+
+	while (next_agino != NULLAGINO) {
+		if (xchk_should_terminate(ragi->sc, &error))
+			return error;
+
+		trace_xrep_iunlink_walk_bucket(sc->sa.pag, bucket, prev_agino,
+				next_agino);
+
+		/* Find the next inode in the chain. */
+		ip = xfs_iunlink_lookup(sc->sa.pag, next_agino);
+		if (!ip) {
+			/* Inode not incore?  Terminate the chain. */
+			next_agino = NULLAGINO;
+			break;
+		}
+
+		if (next_agino % XFS_AGI_UNLINKED_BUCKETS != bucket ||
+		    ip->i_prev_unlinked == 0) {
+			/*
+			 * Inode is in the wrong bucket or isn't unlinked.
+			 * Advance the list, but pretend we didn't see this
+			 * inode.
+			 */
+			next_agino = ip->i_next_unlinked;
+			continue;
+		}
+
+		/*
+		 * Otherwise, this inode's unlinked pointers are ok.  Clear it
+		 * from the unlinked bitmap since we're done with it, and make
+		 * sure the chain is still correct.
+		 */
+		error = xbitmap_clear(&ragi->iunlink_bmp, next_agino, 1);
+		if (error)
+			return error;
+
+		/* Remember the previous inode's next pointer. */
+		if (prev_ip) {
+			error = xrep_iunlink_store_next(ragi, prev_ip,
+					next_agino);
+			if (error)
+				return error;
+		}
+
+		/* Remember this inode's previous pointer. */
+		error = xrep_iunlink_store_prev(ragi, next_agino, prev_agino);
+		if (error)
+			return error;
+
+		/* Advance the list and remember this inode. */
+		prev_agino = next_agino;
+		next_agino = ip->i_next_unlinked;
+		prev_ip = ip;
+	}
+
+	/* Update the previous inode's next pointer. */
+	if (prev_ip) {
+		error = xrep_iunlink_store_next(ragi, prev_ip, next_agino);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Reinsert this unlinked inode into the head of the staged bucket list. */
+STATIC int
+xrep_iunlink_add_to_bucket(
+	struct xrep_agi		*ragi,
+	xfs_agino_t		agino)
+{
+	struct xfs_inode	*ip;
+	xfs_agino_t		current_head;
+	unsigned int		bucket;
+	int			error;
+
+	bucket = agino % XFS_AGI_UNLINKED_BUCKETS;
+
+	ip = xfs_iunlink_lookup(ragi->sc->sa.pag, agino);
+	if (!ip)
+		return -EFSCORRUPTED;
+
+	/* Point this inode at the current head of the bucket list. */
+	current_head = ragi->iunlink_heads[bucket];
+	error = xrep_iunlink_store_next(ragi, ip, current_head);
+	if (error)
+		return error;
+
+	/* Remember the head inode's previous pointer. */
+	if (current_head != NULLAGINO) {
+		error = xrep_iunlink_store_prev(ragi, current_head, agino);
+		if (error)
+			return error;
+	}
+
+	ragi->iunlink_heads[bucket] = agino;
+	return 0;
+}
+
+/* Reinsert unlinked inodes into the staged iunlink buckets. */
+STATIC int
+xrep_iunlink_add_lost_inodes(
+	uint64_t		start,
+	uint64_t		len,
+	void			*priv)
+{
+	struct xrep_agi		*ragi = priv;
+	int			error;
+
+	while (len > 0) {
+		error = xrep_iunlink_add_to_bucket(ragi, start);
+		if (error)
+			return error;
+
+		start++;
+		len--;
+	}
+
+	return 0;
+}
+
+/*
+ * Figure out the iunlink bucket values and find inodes that need to be
+ * reinserted into the list.
+ */
+STATIC int
+xrep_iunlink_find_all(
+	struct xrep_agi		*ragi)
+{
+	unsigned int		i;
+	int			error;
+
+	/* Record all the incore unlinked inodes. */
+	error = xrep_iunlink_find(ragi);
+	if (error)
+		return error;
+
+	/*
+	 * Clear all the unlinked inodes that are correctly linked into their
+	 * incore inode bucket lists.  After this call, iunlink_bmp will
+	 * contain unlinked inodes that are not in the correct list.
+	 */
+	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
+		error = xrep_iunlink_walk_bucket(ragi, i);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Any unlinked inodes that we didn't find through the bucket list
+	 * walk (or that were ignored by the walk) must be inserted into the
+	 * bucket list.  Stage this in memory for now.
+	 */
+	return xbitmap_walk(&ragi->iunlink_bmp,
+			xrep_iunlink_add_lost_inodes, ragi);
+}
+
+/* Update i_next_unlinked for the inode @agino. */
+STATIC int
+xrep_iunlink_relink_next(
+	struct xrep_agi		*ragi,
+	xfarray_idx_t		idx,
+	xfs_agino_t		next_agino)
+{
+	struct xfs_inode	*ip;
+	xfarray_idx_t		agino = idx - 1;
+	int			error;
+
+	ip = xfs_iunlink_lookup(ragi->sc->sa.pag, agino);
+	if (!ip)
+		return -EFSCORRUPTED;
+
+	if (ip->i_next_unlinked == next_agino)
+		return 0;
+
+	trace_xrep_iunlink_relink_next(ip, next_agino);
+
+	error = xfs_iunlink_log_inode(ragi->sc->tp, ip, ragi->sc->sa.pag,
+			next_agino);
+	if (error)
+		return error;
+
+	ip->i_next_unlinked = next_agino;
+	return 0;
+}
+
+/* Update i_prev_unlinked for the inode @agino. */
+STATIC int
+xrep_iunlink_relink_prev(
+	struct xrep_agi		*ragi,
+	xfarray_idx_t		idx,
+	xfs_agino_t		prev_agino)
+{
+	struct xfs_inode	*ip;
+	xfarray_idx_t		agino = idx - 1;
+
+	ASSERT(prev_agino != 0);
+
+	ip = xfs_iunlink_lookup(ragi->sc->sa.pag, agino);
+	if (!ip)
+		return -EFSCORRUPTED;
+
+	if (ip->i_prev_unlinked == prev_agino)
+		return 0;
+
+	trace_xrep_iunlink_relink_prev(ip, prev_agino);
+
+	ip->i_prev_unlinked = prev_agino;
+	return 0;
+}
+
+/* Log all the iunlink updates we need to finish regenerating the AGI. */
+STATIC int
+xrep_iunlink_commit(
+	struct xrep_agi		*ragi)
+{
+	struct xfs_agi		*agi = ragi->agi_bp->b_addr;
+	xfarray_idx_t		idx = XFARRAY_CURSOR_INIT;
+	xfs_agino_t		agino;
+	unsigned int		i;
+	int			error;
+
+	/* Fix all the forward links */
+	while ((error = xfarray_iter(ragi->iunlink_next, &idx, &agino)) == 1) {
+		error = xrep_iunlink_relink_next(ragi, idx, agino);
+		if (error)
+			break;
+	}
+
+	/* Fix all the back links */
+	idx = XFARRAY_CURSOR_INIT;
+	while ((error = xfarray_iter(ragi->iunlink_prev, &idx, &agino)) == 1) {
+		error = xrep_iunlink_relink_prev(ragi, idx, agino);
+		if (error)
+			break;
+	}
+
+	/* Copy the staged iunlink buckets to the new AGI. */
+	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
+		trace_xrep_iunlink_commit_bucket(ragi->sc->sa.pag, i,
+				ragi->iunlink_heads[i]);
+
+		agi->agi_unlinked[i] = cpu_to_be32(ragi->iunlink_heads[i]);
+	}
+
+	return 0;
+}
+
 /* Trigger reinitialization of the in-core data. */
 STATIC int
 xrep_agi_commit_new(
@@ -987,6 +1426,7 @@ xrep_agi(
 {
 	struct xrep_agi		*ragi;
 	struct xfs_mount	*mp = sc->mp;
+	unsigned int		i;
 	int			error;
 
 	/* We require the rmapbt to rebuild anything. */
@@ -1013,6 +1453,22 @@ xrep_agi(
 		.buf_ops	= NULL,
 	};
 
+	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++)
+		ragi->iunlink_heads[i] = NULLAGINO;
+
+	xbitmap_init(&ragi->iunlink_bmp);
+	sc->buf_cleanup = xrep_agi_buf_cleanup;
+
+	error = xfarray_create(sc->mp, "iunlinked next", 0,
+			sizeof(xfs_agino_t), &ragi->iunlink_next);
+	if (error)
+		return error;
+
+	error = xfarray_create(sc->mp, "iunlinked prev", 0,
+			sizeof(xfs_agino_t), &ragi->iunlink_prev);
+	if (error)
+		return error;
+
 	/*
 	 * Make sure we have the AGI buffer, as scrub might have decided it
 	 * was corrupt after xfs_ialloc_read_agi failed with -EFSCORRUPTED.
@@ -1030,6 +1486,10 @@ xrep_agi(
 	if (error)
 		return error;
 
+	error = xrep_iunlink_find_all(ragi);
+	if (error)
+		return error;
+
 	/* Last chance to abort before we start committing fixes. */
 	if (xchk_should_terminate(sc, &error))
 		return error;
@@ -1038,6 +1498,9 @@ xrep_agi(
 	xrep_agi_init_header(ragi);
 	xrep_agi_set_roots(ragi);
 	error = xrep_agi_calc_from_btrees(ragi);
+	if (error)
+		goto out_revert;
+	error = xrep_iunlink_commit(ragi);
 	if (error)
 		goto out_revert;
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 1cacea2ba195..f7e30c6eb1d1 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -2642,6 +2642,134 @@ DEFINE_EVENT(xrep_symlink_class, name, \
 DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_rebuild);
 DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_reset_fork);
 
+TRACE_EVENT(xrep_iunlink_visit,
+	TP_PROTO(struct xfs_perag *pag, xfs_agino_t bucket_agino,
+		 struct xfs_inode *ip),
+	TP_ARGS(pag, bucket_agino, ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, agino)
+		__field(xfs_agino_t, bucket_agino)
+		__field(xfs_agino_t, prev_agino)
+		__field(xfs_agino_t, next_agino)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->agino = XFS_INO_TO_AGINO(pag->pag_mount, ip->i_ino);
+		__entry->bucket_agino = bucket_agino;
+		__entry->prev_agino = ip->i_prev_unlinked;
+		__entry->next_agino = ip->i_next_unlinked;
+	),
+	TP_printk("dev %d:%d agno 0x%x agino 0x%x bucket_agino %u prev_agino 0x%x next_agino 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agino,
+		  __entry->bucket_agino,
+		  __entry->prev_agino,
+		  __entry->next_agino)
+);
+
+TRACE_EVENT(xrep_iunlink_walk_bucket,
+	TP_PROTO(struct xfs_perag *pag, unsigned int bucket,
+		 xfs_agino_t prev_agino, xfs_agino_t next_agino),
+	TP_ARGS(pag, bucket, prev_agino, next_agino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(unsigned int, bucket)
+		__field(xfs_agino_t, prev_agino)
+		__field(xfs_agino_t, next_agino)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->bucket = bucket;
+		__entry->prev_agino = prev_agino;
+		__entry->next_agino = next_agino;
+	),
+	TP_printk("dev %d:%d agno 0x%x bucket %u prev_agino 0x%x next_agino 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->bucket,
+		  __entry->prev_agino,
+		  __entry->next_agino)
+);
+
+TRACE_EVENT(xrep_iunlink_relink_next,
+	TP_PROTO(struct xfs_inode *ip, xfs_agino_t next_agino),
+	TP_ARGS(ip, next_agino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, agino)
+		__field(xfs_agino_t, next_agino)
+		__field(xfs_agino_t, new_next_agino)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->agno = XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino);
+		__entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino);
+		__entry->next_agino = ip->i_next_unlinked;
+		__entry->new_next_agino = next_agino;
+	),
+	TP_printk("dev %d:%d agno 0x%x agino 0x%x next_agino 0x%x -> 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agino,
+		  __entry->next_agino,
+		  __entry->new_next_agino)
+);
+
+TRACE_EVENT(xrep_iunlink_relink_prev,
+	TP_PROTO(struct xfs_inode *ip, xfs_agino_t prev_agino),
+	TP_ARGS(ip, prev_agino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, agino)
+		__field(xfs_agino_t, prev_agino)
+		__field(xfs_agino_t, new_prev_agino)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->agno = XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino);
+		__entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino);
+		__entry->prev_agino = ip->i_prev_unlinked;
+		__entry->new_prev_agino = prev_agino;
+	),
+	TP_printk("dev %d:%d agno 0x%x agino 0x%x prev_agino 0x%x -> 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agino,
+		  __entry->prev_agino,
+		  __entry->new_prev_agino)
+);
+
+TRACE_EVENT(xrep_iunlink_commit_bucket,
+	TP_PROTO(struct xfs_perag *pag, unsigned int bucket,
+		 xfs_agino_t agino),
+	TP_ARGS(pag, bucket, agino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(unsigned int, bucket)
+		__field(xfs_agino_t, agino)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->bucket = bucket;
+		__entry->agino = agino;
+	),
+	TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->bucket,
+		  __entry->agino)
+);
+
 #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
 
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/3] xfs: cache xfile pages for better performance
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (33 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets Darrick J. Wong
@ 2022-12-30 22:14 ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: map xfile pages directly into xfs_buf Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
                   ` (42 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Congratulations!  You have made it to the final patchset of the main
online fsck feature!  This last series improves the performance of
xfile-backed btrees by teaching the buffer cache to directly map pages
from the xfile.  It also speeds up xfarray operations substantially by
implementing a small page cache to avoid repeated kmap/kunmap calls.
Collectively, these can reduce the runtime of online repair functions by
twenty percent or so.
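
For reference, the kmap cache added in patch 3/3 is nothing fancy: each
xfile keeps a tiny MRU of pinned, kmapped pages, roughly this shape
(condensed from the patch below):

struct xfile_cache {
	struct xfile_page	xfpage;	/* pinned shmem page */
	void			*kaddr;	/* kmap()ped address */
};

#define XFILE_CACHE_ENTRIES	4

struct xfile {
	struct file		*file;
	unsigned int		flags;
	struct xfile_cache	cached[XFILE_CACHE_ENTRIES];
};

xfile_pread and xfile_pwrite then memcpy through cached[].kaddr instead of
calling shmem_read_mapping_page_gfp and kmap for every access; a hit is
swapped to slot 0 so hot pages stay mapped across calls.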

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=xfile-page-caching

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=xfile-page-caching
---
 fs/xfs/libxfs/xfs_btree_mem.h  |    6 +
 fs/xfs/libxfs/xfs_rmap_btree.c |    1 
 fs/xfs/scrub/rcbag_btree.c     |    1 
 fs/xfs/scrub/trace.h           |   44 +++++++
 fs/xfs/scrub/xfbtree.c         |   23 +++-
 fs/xfs/scrub/xfile.c           |  254 +++++++++++++++++++++++-----------------
 fs/xfs/scrub/xfile.h           |   15 ++
 fs/xfs/xfs_buf.c               |  229 +++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_buf.h               |   10 ++
 9 files changed, 467 insertions(+), 116 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/3] xfs: map xfile pages directly into xfs_buf
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: cache xfile pages for better performance Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: use b_offset to support direct-mapping pages when blocksize < pagesize Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfile: implement write caching Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Map the xfile pages directly into xfs_buf to reduce memory overhead.
It's silly to use memory to stage changes to shmem pages for ephemeral
btrees that don't care about transactionality.
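
The opt-in is a chain of flags; condensing the hunks below, assuming an
in-memory btree that wants direct mapping:

	/* Creator asks for direct mapping of the ephemeral btree... */
	cfg.flags |= XFBTREE_DIRECT_MAP;

	/* ...xfbtree_create() turns that into a buftarg capability... */
	if (cfg->flags & XFBTREE_DIRECT_MAP)
		xfbt->target->bt_flags |= XFS_BUFTARG_DIRECT_MAP;

	/*
	 * ...and buffer creation maps xfile pages instead of allocating.
	 * -ENOTBLK means "cannot direct-map this one", in which case we fall
	 * back to the usual heap/page allocation paths.
	 */
	if (btp->bt_flags & XFS_BUFTARG_DIRECT_MAP) {
		error = xfs_buf_alloc_direct_pages(new_bp, flags);
		if (error && error != -ENOTBLK)
			goto out_free_buf;
	}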

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_mem.h  |    6 +
 fs/xfs/libxfs/xfs_rmap_btree.c |    1 
 fs/xfs/scrub/rcbag_btree.c     |    1 
 fs/xfs/scrub/xfbtree.c         |   23 ++++-
 fs/xfs/xfs_buf.c               |  202 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_buf.h               |   10 ++
 6 files changed, 236 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree_mem.h b/fs/xfs/libxfs/xfs_btree_mem.h
index ee142b972839..8feb104522b5 100644
--- a/fs/xfs/libxfs/xfs_btree_mem.h
+++ b/fs/xfs/libxfs/xfs_btree_mem.h
@@ -17,8 +17,14 @@ struct xfbtree_config {
 
 	/* Owner of this btree. */
 	unsigned long long		owner;
+
+	/* XFBTREE_* flags */
+	unsigned int			flags;
 };
 
+/* buffers should be directly mapped from memory */
+#define XFBTREE_DIRECT_MAP		(1U << 0)
+
 #ifdef CONFIG_XFS_IN_MEMORY_BTREE
 unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
 
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index ebd86c559837..1b88766ac497 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -670,6 +670,7 @@ xfs_rmapbt_mem_create(
 		.btree_ops	= &xfs_rmapbt_mem_ops,
 		.target		= target,
 		.owner		= agno,
+		.flags		= XFBTREE_DIRECT_MAP,
 	};
 
 	return xfbtree_create(mp, &cfg, xfbtreep);
diff --git a/fs/xfs/scrub/rcbag_btree.c b/fs/xfs/scrub/rcbag_btree.c
index 3aa40149e34d..26cc5a35c378 100644
--- a/fs/xfs/scrub/rcbag_btree.c
+++ b/fs/xfs/scrub/rcbag_btree.c
@@ -232,6 +232,7 @@ rcbagbt_mem_create(
 	struct xfbtree_config	cfg = {
 		.btree_ops	= &rcbagbt_mem_ops,
 		.target		= target,
+		.flags		= XFBTREE_DIRECT_MAP,
 	};
 
 	return xfbtree_create(mp, &cfg, xfbtreep);
diff --git a/fs/xfs/scrub/xfbtree.c b/fs/xfs/scrub/xfbtree.c
index 072e1d8b813e..55d530213d40 100644
--- a/fs/xfs/scrub/xfbtree.c
+++ b/fs/xfs/scrub/xfbtree.c
@@ -473,6 +473,9 @@ xfbtree_create(
 
 	/* Assign our memory file and the free space bitmap. */
 	xfbt->target = cfg->target;
+	if (cfg->flags & XFBTREE_DIRECT_MAP)
+		xfbt->target->bt_flags |= XFS_BUFTARG_DIRECT_MAP;
+
 	xfbt->freespace = kmalloc(sizeof(struct xbitmap), XCHK_GFP_FLAGS);
 	if (!xfbt->freespace) {
 		error = -ENOMEM;
@@ -732,7 +735,7 @@ xfbtree_trans_commit(
 
 		dirty = xfbtree_trans_bdetach(tp, bp);
 		if (dirty && !corrupt) {
-			xfs_failaddr_t	fa = bp->b_ops->verify_struct(bp);
+			xfs_failaddr_t	fa;
 
 			/*
 			 * Because this btree is ephemeral, validate the buffer
@@ -740,16 +743,30 @@ xfbtree_trans_commit(
 			 * corruption errors to the caller without shutting
 			 * down the filesystem.
 			 *
+			 * Buffers that are directly mapped to the xfile do not
+			 * need to be queued for IO at all.  Check if the DRAM
+			 * has been poisoned, however.
+			 *
 			 * If the buffer fails verification, log the failure
 			 * but continue walking the transaction items so that
 			 * we remove all ephemeral btree buffers.
 			 */
+			if (xfs_buf_check_poisoned(bp)) {
+				corrupt = true;
+				xfs_verifier_error(bp, -EFSCORRUPTED,
+						__this_address);
+				continue;
+			}
+
+			fa = bp->b_ops->verify_struct(bp);
 			if (fa) {
 				corrupt = true;
 				xfs_verifier_error(bp, -EFSCORRUPTED, fa);
-			} else {
+				continue;
+			}
+
+			if (!(bp->b_flags & _XBF_DIRECT_MAP))
 				xfs_buf_delwri_queue_here(bp, &buffer_list);
-			}
 		}
 
 		xfs_buf_relse(bp);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b65dab243130..e00682cd8901 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -272,6 +272,61 @@ _xfs_buf_alloc(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+/* Free an xfile page that was directly mapped into the buffer cache. */
+static int
+xfs_buf_free_xfpage(
+	struct xfile		*xfile,
+	loff_t			pos,
+	struct page		**pagep)
+{
+	struct xfile_page	xfpage = {
+		.page		= *pagep,
+		.pos		= round_down(pos, PAGE_SIZE),
+	};
+
+	*pagep = NULL;
+	lock_page(xfpage.page);
+
+	return xfile_put_page(xfile, &xfpage);
+}
+
+/* Unmap all the direct-mapped buffer pages. */
+static void
+xfs_buf_free_direct_pages(
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_map	*map;
+	unsigned int		m, p, n;
+	int			error = 0, err2;
+
+	ASSERT(bp->b_target->bt_flags & XFS_BUFTARG_DIRECT_MAP);
+
+	if (xfs_buf_is_vmapped(bp))
+		vm_unmap_ram(bp->b_addr, bp->b_page_count);
+
+	for (m = 0, p = 0, map = bp->b_maps; m < bp->b_map_count; m++, map++) {
+		for (n = 0; n < map->bm_len; n += BTOBB(PAGE_SIZE)) {
+			err2 = xfs_buf_free_xfpage(bp->b_target->bt_xfile,
+					BBTOB(map->bm_bn + n),
+					&bp->b_pages[p++]);
+			if (!error && err2)
+				error = err2;
+		}
+	}
+
+	if (error)
+		xfs_err(bp->b_mount, "%s failed errno %d", __func__, error);
+
+	if (bp->b_pages != bp->b_page_array)
+		kmem_free(bp->b_pages);
+	bp->b_pages = NULL;
+	bp->b_flags &= ~_XBF_DIRECT_MAP;
+}
+#else
+# define xfs_buf_free_direct_pages(b)	((void)0)
+#endif /* CONFIG_XFS_IN_MEMORY_FILE */
+
 static void
 xfs_buf_free_pages(
 	struct xfs_buf	*bp)
@@ -314,7 +369,9 @@ xfs_buf_free(
 
 	ASSERT(list_empty(&bp->b_lru));
 
-	if (bp->b_flags & _XBF_PAGES)
+	if (bp->b_flags & _XBF_DIRECT_MAP)
+		xfs_buf_free_direct_pages(bp);
+	else if (bp->b_flags & _XBF_PAGES)
 		xfs_buf_free_pages(bp);
 	else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
@@ -411,6 +468,118 @@ xfs_buf_alloc_pages(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+/* Grab the xfile page for this part of the xfile. */
+static int
+xfs_buf_get_xfpage(
+	struct xfile		*xfile,
+	loff_t			pos,
+	unsigned int		len,
+	struct page		**pagep)
+{
+	struct xfile_page	xfpage = { NULL };
+	int			error;
+
+	error = xfile_get_page(xfile, pos, len, &xfpage);
+	if (error)
+		return error;
+
+	/*
+	 * Fall back to regular DRAM buffers if tmpfs gives us fsdata or the
+	 * page pos isn't what we were expecting.
+	 */
+	if (xfpage.fsdata || xfpage.pos != round_down(pos, PAGE_SIZE)) {
+		xfile_put_page(xfile, &xfpage);
+		return -ENOTBLK;
+	}
+
+	/* Unlock the page before we start using it for the buffer cache. */
+	ASSERT(PageUptodate(xfpage.page));
+	unlock_page(xfpage.page);
+
+	*pagep = xfpage.page;
+	return 0;
+}
+
+/*
+ * Try to map storage directly, if the target supports it.  Returns 0 for
+ * success, -ENOTBLK to mean "not supported", or the usual negative errno.
+ */
+static int
+xfs_buf_alloc_direct_pages(
+	struct xfs_buf		*bp,
+	xfs_buf_flags_t		flags)
+{
+	struct xfs_buf_map	*map;
+	gfp_t			gfp_mask = __GFP_NOWARN;
+	const unsigned int	page_align_mask = PAGE_SIZE - 1;
+	unsigned int		m, p, n;
+	int			error;
+
+	ASSERT(bp->b_target->bt_flags & XFS_BUFTARG_IN_MEMORY);
+
+	/* For direct-map buffers, each map has to be page aligned. */
+	for (m = 0, map = bp->b_maps; m < bp->b_map_count; m++, map++)
+		if (BBTOB(map->bm_bn | map->bm_len) & page_align_mask)
+			return -ENOTBLK;
+
+	if (flags & XBF_READ_AHEAD)
+		gfp_mask |= __GFP_NORETRY;
+	else
+		gfp_mask |= GFP_NOFS;
+
+	/* Make sure that we have a page list */
+	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
+	if (bp->b_page_count <= XB_PAGES) {
+		bp->b_pages = bp->b_page_array;
+	} else {
+		bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count,
+					gfp_mask);
+		if (!bp->b_pages)
+			return -ENOMEM;
+	}
+
+	/* Map in the xfile pages. */
+	for (m = 0, p = 0, map = bp->b_maps; m < bp->b_map_count; m++, map++) {
+		for (n = 0; n < map->bm_len; n += BTOBB(PAGE_SIZE)) {
+			unsigned int	len;
+
+			len = min_t(unsigned int, BBTOB(map->bm_len - n),
+					PAGE_SIZE);
+
+			error = xfs_buf_get_xfpage(bp->b_target->bt_xfile,
+					BBTOB(map->bm_bn + n), len,
+					&bp->b_pages[p++]);
+			if (error)
+				goto fail;
+		}
+	}
+
+	bp->b_flags |= _XBF_DIRECT_MAP;
+	return 0;
+
+fail:
+	for (m = 0, p = 0, map = bp->b_maps; m < bp->b_map_count; m++, map++) {
+		for (n = 0; n < map->bm_len; n += BTOBB(PAGE_SIZE)) {
+			if (bp->b_pages[p] == NULL)
+				continue;
+
+			xfs_buf_free_xfpage(bp->b_target->bt_xfile,
+					BBTOB(map->bm_bn + n),
+					&bp->b_pages[p++]);
+		}
+	}
+
+	if (bp->b_pages != bp->b_page_array)
+		kmem_free(bp->b_pages);
+	bp->b_pages = NULL;
+	bp->b_page_count = 0;
+	return error;
+}
+#else
+# define xfs_buf_alloc_direct_pages(b,f)	(-ENOTBLK)
+#endif /* CONFIG_XFS_IN_MEMORY_FILE */
+
 /*
  *	Map buffer into kernel address-space if necessary.
  */
@@ -419,7 +588,8 @@ _xfs_buf_map_pages(
 	struct xfs_buf		*bp,
 	xfs_buf_flags_t		flags)
 {
-	ASSERT(bp->b_flags & _XBF_PAGES);
+	ASSERT(bp->b_flags & (_XBF_PAGES | _XBF_DIRECT_MAP));
+
 	if (bp->b_page_count == 1) {
 		/* A single page buffer is always mappable */
 		bp->b_addr = page_address(bp->b_pages[0]);
@@ -566,7 +736,7 @@ xfs_buf_find_lock(
 	 */
 	if (bp->b_flags & XBF_STALE) {
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
+		bp->b_flags &= _XBF_KMEM | _XBF_PAGES | _XBF_DIRECT_MAP;
 		bp->b_ops = NULL;
 	}
 	return 0;
@@ -625,6 +795,13 @@ xfs_buf_find_insert(
 	if (error)
 		goto out_drop_pag;
 
+	/* Try to map pages directly, or fall back to memory. */
+	if (btp->bt_flags & XFS_BUFTARG_DIRECT_MAP) {
+		error = xfs_buf_alloc_direct_pages(new_bp, flags);
+		if (error && error != -ENOTBLK)
+			goto out_free_buf;
+	}
+
 	/*
 	 * For buffers that fit entirely within a single page, first attempt to
 	 * allocate the memory from the heap to minimise memory usage. If we
@@ -1569,7 +1746,10 @@ xfs_buf_ioapply_in_memory(
 
 	atomic_inc(&bp->b_io_remaining);
 
-	if (bp->b_map_count > 1) {
+	if (bp->b_target->bt_flags & XFS_BUFTARG_DIRECT_MAP) {
+		/* direct mapping means no io necessary */
+		error = 0;
+	} else if (bp->b_map_count > 1) {
 		/* We don't need or support multi-map buffers. */
 		ASSERT(0);
 		error = -EIO;
@@ -1588,6 +1768,20 @@ xfs_buf_ioapply_in_memory(
 		xfs_buf_ioend(bp);
 }
 
+bool
+xfs_buf_check_poisoned(
+	struct xfs_buf		*bp)
+{
+	unsigned int		i;
+
+	for (i = 0; i < bp->b_page_count; i++) {
+		if (PageHWPoison(bp->b_pages[i]))
+			return true;
+	}
+
+	return false;
+}
+
 STATIC void
 _xfs_buf_ioapply(
 	struct xfs_buf	*bp)
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index d74ce9080282..c3ab1c6652dc 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -43,6 +43,7 @@ struct xfile;
 #define _XBF_PAGES	 (1u << 20)/* backed by refcounted pages */
 #define _XBF_KMEM	 (1u << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
+#define _XBF_DIRECT_MAP	 (1u << 23)/* pages directly mapped to storage */
 
 /* flags used only as arguments to access routines */
 /*
@@ -72,6 +73,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
+	{ _XBF_DIRECT_MAP,	"DIRECT_MAP" }, \
 	/* The following interface flags should never be set */ \
 	{ XBF_BCACHE_SCAN,	"BCACHE_SCAN" }, \
 	{ XBF_INCORE,		"INCORE" }, \
@@ -135,6 +137,13 @@ typedef struct xfs_buftarg {
 # define XFS_BUFTARG_IN_MEMORY	(0)
 #endif
 
+/* buffer pages are direct-mapped (implies IN_MEMORY) */
+#ifdef CONFIG_XFS_IN_MEMORY_FILE
+# define XFS_BUFTARG_DIRECT_MAP	(1U << 2)
+#else
+# define XFS_BUFTARG_DIRECT_MAP	(0)
+#endif
+
 static inline bool
 xfs_buftarg_in_memory(
 	struct xfs_buftarg	*btp)
@@ -458,5 +467,6 @@ xfs_buftarg_verify_daddr(
 int xfs_buf_reverify(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
 bool xfs_verify_magic(struct xfs_buf *bp, __be32 dmagic);
 bool xfs_verify_magic16(struct xfs_buf *bp, __be16 dmagic);
+bool xfs_buf_check_poisoned(struct xfs_buf *bp);
 
 #endif	/* __XFS_BUF_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs: use b_offset to support direct-mapping pages when blocksize < pagesize
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: cache xfile pages for better performance Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: map xfile pages directly into xfs_buf Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 3/3] xfile: implement write caching Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Support using directly-mapped pages in the buffer cache when the fs
blocksize is less than the page size.  This is not strictly necessary
since the only user of direct-map buffers always uses page-sized
buffers, but I included it here for completeness.
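
The representative changes, pulled together from the hunks below: record the
buffer's offset within its first xfile page at allocation time, then apply it
wherever the buffer is mapped or unmapped.

	/* A direct-mapped sub-page buffer starts partway into an xfile page. */
	bp->b_offset = offset_in_page(BBTOB(xfs_buf_daddr(bp)));

	/* Mapping adds the offset... */
	bp->b_addr = page_address(bp->b_pages[0]) + bp->b_offset;

	/* ...and unmapping a vmapped buffer must subtract it again. */
	vm_unmap_ram(bp->b_addr - bp->b_offset, bp->b_page_count);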

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf.c |   34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e00682cd8901..db3344a36f14 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -303,7 +303,7 @@ xfs_buf_free_direct_pages(
 	ASSERT(bp->b_target->bt_flags & XFS_BUFTARG_DIRECT_MAP);
 
 	if (xfs_buf_is_vmapped(bp))
-		vm_unmap_ram(bp->b_addr, bp->b_page_count);
+		vm_unmap_ram(bp->b_addr - bp->b_offset, bp->b_page_count);
 
 	for (m = 0, p = 0, map = bp->b_maps; m < bp->b_map_count; m++, map++) {
 		for (n = 0; n < map->bm_len; n += BTOBB(PAGE_SIZE)) {
@@ -336,7 +336,7 @@ xfs_buf_free_pages(
 	ASSERT(bp->b_flags & _XBF_PAGES);
 
 	if (xfs_buf_is_vmapped(bp))
-		vm_unmap_ram(bp->b_addr, bp->b_page_count);
+		vm_unmap_ram(bp->b_addr - bp->b_offset, bp->b_page_count);
 
 	for (i = 0; i < bp->b_page_count; i++) {
 		if (bp->b_pages[i])
@@ -465,6 +465,8 @@ xfs_buf_alloc_pages(
 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
 		memalloc_retry_wait(gfp_mask);
 	}
+
+	bp->b_offset = 0;
 	return 0;
 }
 
@@ -513,16 +515,32 @@ xfs_buf_alloc_direct_pages(
 	struct xfs_buf_map	*map;
 	gfp_t			gfp_mask = __GFP_NOWARN;
 	const unsigned int	page_align_mask = PAGE_SIZE - 1;
+	unsigned int		first_page_offset;
 	unsigned int		m, p, n;
 	int			error;
 
 	ASSERT(bp->b_target->bt_flags & XFS_BUFTARG_IN_MEMORY);
 
-	/* For direct-map buffers, each map has to be page aligned. */
-	for (m = 0, map = bp->b_maps; m < bp->b_map_count; m++, map++)
-		if (BBTOB(map->bm_bn | map->bm_len) & page_align_mask)
+	/*
+	 * For direct-map buffer targets with multiple mappings, the first map
+	 * must end on a page boundary; the last map must start at a page
+	 * boundary; and the maps in between must start and end on a page
+	 * boundary.  For single-mapping buffers, we don't care.
+	 */
+	if (bp->b_map_count > 1) {
+		map = &bp->b_maps[bp->b_map_count - 1];
+		if (BBTOB(map->bm_bn) & page_align_mask)
 			return -ENOTBLK;
 
+		map = &bp->b_maps[0];
+		if (BBTOB(map->bm_bn + map->bm_len) & page_align_mask)
+			return -ENOTBLK;
+
+		for (m = 1, map++; m < bp->b_map_count - 1; m++, map++)
+			if (BBTOB(map->bm_bn | map->bm_len) & page_align_mask)
+				return -ENOTBLK;
+	}
+
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 	else
@@ -540,6 +558,7 @@ xfs_buf_alloc_direct_pages(
 	}
 
 	/* Map in the xfile pages. */
+	first_page_offset = offset_in_page(BBTOB(xfs_buf_daddr(bp)));
 	for (m = 0, p = 0, map = bp->b_maps; m < bp->b_map_count; m++, map++) {
 		for (n = 0; n < map->bm_len; n += BTOBB(PAGE_SIZE)) {
 			unsigned int	len;
@@ -556,6 +575,7 @@ xfs_buf_alloc_direct_pages(
 	}
 
 	bp->b_flags |= _XBF_DIRECT_MAP;
+	bp->b_offset = first_page_offset;
 	return 0;
 
 fail:
@@ -592,7 +612,7 @@ _xfs_buf_map_pages(
 
 	if (bp->b_page_count == 1) {
 		/* A single page buffer is always mappable */
-		bp->b_addr = page_address(bp->b_pages[0]);
+		bp->b_addr = page_address(bp->b_pages[0]) + bp->b_offset;
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
 	} else {
@@ -619,6 +639,8 @@ _xfs_buf_map_pages(
 
 		if (!bp->b_addr)
 			return -ENOMEM;
+
+		bp->b_addr += bp->b_offset;
 	}
 
 	return 0;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/3] xfile: implement write caching
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: cache xfile pages for better performance Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 1/3] xfs: map xfile pages directly into xfs_buf Darrick J. Wong
  2022-12-30 22:14   ` [PATCH 2/3] xfs: use b_offset to support direct-mapping pages when blocksize < pagesize Darrick J. Wong
@ 2022-12-30 22:14   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/trace.h |   44 ++++++++-
 fs/xfs/scrub/xfile.c |  254 +++++++++++++++++++++++++++++---------------------
 fs/xfs/scrub/xfile.h |   15 +++
 fs/xfs/xfs_buf.c     |    1 
 4 files changed, 207 insertions(+), 107 deletions(-)


diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index f7e30c6eb1d1..3652ac4a3eff 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -872,10 +872,52 @@ DEFINE_XFILE_EVENT(xfile_pread);
 DEFINE_XFILE_EVENT(xfile_pwrite);
 DEFINE_XFILE_EVENT(xfile_seek_data);
 DEFINE_XFILE_EVENT(xfile_get_page);
-DEFINE_XFILE_EVENT(xfile_put_page);
 DEFINE_XFILE_EVENT(xfile_discard);
 DEFINE_XFILE_EVENT(xfile_prealloc);
 
+DECLARE_EVENT_CLASS(xfile_page_class,
+	TP_PROTO(struct xfile *xf, loff_t pos, struct page *page),
+	TP_ARGS(xf, pos, page),
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned long long, bytes_used)
+		__field(loff_t, pos)
+		__field(loff_t, size)
+		__field(unsigned long long, bytecount)
+		__field(pgoff_t, pgoff)
+	),
+	TP_fast_assign(
+		struct xfile_stat	statbuf;
+		int			ret;
+
+		ret = xfile_stat(xf, &statbuf);
+		if (!ret) {
+			__entry->bytes_used = statbuf.bytes;
+			__entry->size = statbuf.size;
+		} else {
+			__entry->bytes_used = -1;
+			__entry->size = -1;
+		}
+		__entry->ino = file_inode(xf->file)->i_ino;
+		__entry->pos = pos;
+		__entry->bytecount = page_size(page);
+		__entry->pgoff = page_offset(page);
+	),
+	TP_printk("xfino 0x%lx mem_bytes 0x%llx pos 0x%llx bytecount 0x%llx pgoff 0x%lx isize 0x%llx",
+		  __entry->ino,
+		  __entry->bytes_used,
+		  __entry->pos,
+		  __entry->bytecount,
+		  __entry->pgoff,
+		  __entry->size)
+);
+#define DEFINE_XFILE_PAGE_EVENT(name) \
+DEFINE_EVENT(xfile_page_class, name, \
+	TP_PROTO(struct xfile *xf, loff_t pos, struct page *page), \
+	TP_ARGS(xf, pos, page))
+DEFINE_XFILE_PAGE_EVENT(xfile_got_page);
+DEFINE_XFILE_PAGE_EVENT(xfile_put_page);
+
 TRACE_EVENT(xfarray_create,
 	TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity),
 	TP_ARGS(xfa, required_capacity),
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index b1cbf80f55d7..529266e86e41 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -66,7 +66,7 @@ xfile_create(
 	struct xfile		*xf;
 	int			error = -ENOMEM;
 
-	xf = kmalloc(sizeof(struct xfile), XCHK_GFP_FLAGS);
+	xf = kzalloc(sizeof(struct xfile), XCHK_GFP_FLAGS);
 	if (!xf)
 		return -ENOMEM;
 
@@ -110,6 +110,117 @@ xfile_create(
 	return error;
 }
 
+/* Evict a cache entry and release the page. */
+static inline int
+xfile_cache_evict(
+	struct xfile		*xf,
+	struct xfile_cache	*entry)
+{
+	int			error;
+
+	if (!entry->xfpage.page)
+		return 0;
+
+	lock_page(entry->xfpage.page);
+	kunmap(entry->kaddr);
+
+	error = xfile_put_page(xf, &entry->xfpage);
+	memset(entry, 0, sizeof(struct xfile_cache));
+	return error;
+}
+
+/*
+ * Grab a page, map it into the kernel address space, and fill out the cache
+ * entry.
+ */
+static int
+xfile_cache_fill(
+	struct xfile		*xf,
+	loff_t			key,
+	struct xfile_cache	*entry)
+{
+	int			error;
+
+	error = xfile_get_page(xf, key, PAGE_SIZE, &entry->xfpage);
+	if (error)
+		return error;
+
+	entry->kaddr = kmap(entry->xfpage.page);
+	unlock_page(entry->xfpage.page);
+	return 0;
+}
+
+/* Return the kernel address of a cached position in the xfile. */
+static void *
+xfile_cache_lookup(
+	struct xfile		*xf,
+	loff_t			pos)
+{
+	loff_t			key = round_down(pos, PAGE_SIZE);
+	unsigned int		i;
+	int			ret;
+
+	/* Is it already in the cache? */
+	for (i = 0; i < XFILE_CACHE_ENTRIES; i++) {
+		if (!xf->cached[i].xfpage.page)
+			continue;
+		if (page_offset(xf->cached[i].xfpage.page) != key)
+			continue;
+
+		goto found;
+	}
+
+	/* Find an empty slot, or evict the least recently used entry. */
+	for (i = 0; i < XFILE_CACHE_ENTRIES; i++) {
+		if (!xf->cached[i].xfpage.page)
+			goto insert;
+	}
+	i = min_t(unsigned int, i, XFILE_CACHE_ENTRIES - 1);
+
+	ret = xfile_cache_evict(xf, &xf->cached[i]);
+	if (ret)
+		return ERR_PTR(ret);
+
+insert:
+	ret = xfile_cache_fill(xf, key, &xf->cached[i]);
+	if (ret)
+		return ERR_PTR(ret);
+
+found:
+	/* Stupid MRU moves this cache entry to the front. */
+	if (i != 0)
+		swap(xf->cached[0], xf->cached[i]);
+
+	return xf->cached[0].kaddr;
+}
+
+/* Release the cached page corresponding to a given kernel address. */
+static int
+xfile_cache_rele(
+	struct xfile		*xf,
+	void			*kaddr)
+{
+	unsigned int		i;
+
+	for (i = 0; i < XFILE_CACHE_ENTRIES; i++) {
+		if (xf->cached[i].kaddr == kaddr)
+			return xfile_cache_evict(xf, &xf->cached[i]);
+	}
+
+	return 0;
+}
+
+/* Drop all cached xfile pages. */
+static void
+xfile_cache_drop(
+	struct xfile		*xf)
+{
+	unsigned int		i;
+
+	for (i = 0; i < XFILE_CACHE_ENTRIES; i++)
+		xfile_cache_evict(xf, &xf->cached[i]);
+}
+
 /* Close the file and release all resources. */
 void
 xfile_destroy(
@@ -119,6 +230,8 @@ xfile_destroy(
 
 	trace_xfile_destroy(xf);
 
+	xfile_cache_drop(xf);
+
 	lockdep_set_class(&inode->i_rwsem, &inode->i_sb->s_type->i_mutex_key);
 	fput(xf->file);
 	kfree(xf);
@@ -138,8 +251,6 @@ xfile_pread(
 	loff_t			pos)
 {
 	struct inode		*inode = file_inode(xf->file);
-	struct address_space	*mapping = inode->i_mapping;
-	struct page		*page = NULL;
 	ssize_t			read = 0;
 	unsigned int		pflags;
 	int			error = 0;
@@ -158,37 +269,21 @@ xfile_pread(
 
 		len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
 
-		/*
-		 * In-kernel reads of a shmem file cause it to allocate a page
-		 * if the mapping shows a hole.  Therefore, if we hit ENOMEM
-		 * we can continue by zeroing the caller's buffer.
-		 */
-		page = shmem_read_mapping_page_gfp(mapping, pos >> PAGE_SHIFT,
-				__GFP_NOWARN);
-		if (IS_ERR(page)) {
-			error = PTR_ERR(page);
-			if (error != -ENOMEM)
+		kaddr = xfile_cache_lookup(xf, pos);
+		if (IS_ERR(kaddr)) {
+			error = PTR_ERR(kaddr);
+			break;
+		}
+
+		p = kaddr + offset_in_page(pos);
+		memcpy(buf, p, len);
+
+		if (xf->flags & XFILE_UNCACHED) {
+			error = xfile_cache_rele(xf, kaddr);
+			if (error)
 				break;
-
-			memset(buf, 0, len);
-			goto advance;
-		}
-
-		if (PageUptodate(page)) {
-			/*
-			 * xfile pages must never be mapped into userspace, so
-			 * we skip the dcache flush.
-			 */
-			kaddr = kmap_local_page(page);
-			p = kaddr + offset_in_page(pos);
-			memcpy(buf, p, len);
-			kunmap_local(kaddr);
-		} else {
-			memset(buf, 0, len);
 		}
-		put_page(page);
 
-advance:
 		count -= len;
 		pos += len;
 		buf += len;
@@ -215,9 +310,6 @@ xfile_pwrite(
 	loff_t			pos)
 {
 	struct inode		*inode = file_inode(xf->file);
-	struct address_space	*mapping = inode->i_mapping;
-	const struct address_space_operations *aops = mapping->a_ops;
-	struct page		*page = NULL;
 	ssize_t			written = 0;
 	unsigned int		pflags;
 	int			error = 0;
@@ -231,52 +323,30 @@ xfile_pwrite(
 
 	pflags = memalloc_nofs_save();
 	while (count > 0) {
-		void		*fsdata = NULL;
 		void		*p, *kaddr;
 		unsigned int	len;
-		int		ret;
 
 		len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
 
-		/*
-		 * We call write_begin directly here to avoid all the freezer
-		 * protection lock-taking that happens in the normal path.
-		 * shmem doesn't support fs freeze, but lockdep doesn't know
-		 * that and will trip over that.
-		 */
-		error = aops->write_begin(NULL, mapping, pos, len, &page,
-				&fsdata);
-		if (error)
+		kaddr = xfile_cache_lookup(xf, pos);
+		if (IS_ERR(kaddr)) {
+			error = PTR_ERR(kaddr);
 			break;
-
-		/*
-		 * xfile pages must never be mapped into userspace, so we skip
-		 * the dcache flush.  If the page is not uptodate, zero it
-		 * before writing data.
-		 */
-		kaddr = kmap_local_page(page);
-		if (!PageUptodate(page)) {
-			memset(kaddr, 0, PAGE_SIZE);
-			SetPageUptodate(page);
 		}
+
 		p = kaddr + offset_in_page(pos);
 		memcpy(p, buf, len);
-		kunmap_local(kaddr);
 
-		ret = aops->write_end(NULL, mapping, pos, len, len, page,
-				fsdata);
-		if (ret < 0) {
-			error = ret;
-			break;
+		if (xf->flags & XFILE_UNCACHED) {
+			error = xfile_cache_rele(xf, kaddr);
+			if (error)
+				break;
 		}
 
-		written += ret;
-		if (ret != len)
-			break;
-
-		count -= ret;
-		pos += ret;
-		buf += ret;
+		written += len;
+		count -= len;
+		pos += len;
+		buf += len;
 	}
 	memalloc_nofs_restore(pflags);
 
@@ -293,6 +363,7 @@ xfile_discard(
 	u64			count)
 {
 	trace_xfile_discard(xf, pos, count);
+	xfile_cache_drop(xf);
 	shmem_truncate_range(file_inode(xf->file), pos, pos + count - 1);
 }
 
@@ -304,9 +375,6 @@ xfile_prealloc(
 	u64			count)
 {
 	struct inode		*inode = file_inode(xf->file);
-	struct address_space	*mapping = inode->i_mapping;
-	const struct address_space_operations *aops = mapping->a_ops;
-	struct page		*page = NULL;
 	unsigned int		pflags;
 	int			error = 0;
 
@@ -319,47 +387,21 @@ xfile_prealloc(
 
 	pflags = memalloc_nofs_save();
 	while (count > 0) {
-		void		*fsdata = NULL;
+		void		*kaddr;
 		unsigned int	len;
-		int		ret;
 
 		len = min_t(ssize_t, count, PAGE_SIZE - offset_in_page(pos));
 
-		/*
-		 * We call write_begin directly here to avoid all the freezer
-		 * protection lock-taking that happens in the normal path.
-		 * shmem doesn't support fs freeze, but lockdep doesn't know
-		 * that and will trip over that.
-		 */
-		error = aops->write_begin(NULL, mapping, pos, len, &page,
-				&fsdata);
+		kaddr = xfile_cache_lookup(xf, pos);
+		if (IS_ERR(kaddr)) {
+			error = PTR_ERR(kaddr);
+			break;
+		}
+
+		error = xfile_cache_rele(xf, kaddr);
 		if (error)
 			break;
 
-		/*
-		 * xfile pages must never be mapped into userspace, so we skip
-		 * the dcache flush.  If the page is not uptodate, zero it to
-		 * ensure we never go lacking for space here.
-		 */
-		if (!PageUptodate(page)) {
-			void	*kaddr = kmap_local_page(page);
-
-			memset(kaddr, 0, PAGE_SIZE);
-			SetPageUptodate(page);
-			kunmap_local(kaddr);
-		}
-
-		ret = aops->write_end(NULL, mapping, pos, len, len, page,
-				fsdata);
-		if (ret < 0) {
-			error = ret;
-			break;
-		}
-		if (ret != len) {
-			error = -EIO;
-			break;
-		}
-
 		count -= len;
 		pos += len;
 	}
@@ -490,7 +532,7 @@ xfile_put_page(
 	unsigned int		pflags;
 	int			ret;
 
-	trace_xfile_put_page(xf, xfpage->pos, PAGE_SIZE);
+	trace_xfile_put_page(xf, xfpage->pos, xfpage->page);
 
 	/* Give back the reference that we took in xfile_get_page. */
 	put_page(xfpage->page);
diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h
index bf80bb796e83..9a065bd9ffe2 100644
--- a/fs/xfs/scrub/xfile.h
+++ b/fs/xfs/scrub/xfile.h
@@ -24,10 +24,25 @@ static inline pgoff_t xfile_page_index(const struct xfile_page *xfpage)
 	return xfpage->page->index;
 }
 
+struct xfile_cache {
+	struct xfile_page	xfpage;
+	void			*kaddr;
+};
+
+#define XFILE_CACHE_ENTRIES	4
+
 struct xfile {
 	struct file		*file;
+
+	/* XFILE_* flags */
+	unsigned int		flags;
+
+	struct xfile_cache	cached[XFILE_CACHE_ENTRIES];
 };
 
+/* Do not cache pages for faster access. */
+#define XFILE_UNCACHED		(1U << 0)
+
 int xfile_create(struct xfs_mount *mp, const char *description, loff_t isize,
 		struct xfile **xfilep);
 void xfile_destroy(struct xfile *xf);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index db3344a36f14..c5fcbd54e840 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2387,6 +2387,7 @@ xfs_alloc_memory_buftarg(
 	if (!btp)
 		return -ENOMEM;
 
+	xfile->flags |= XFILE_UNCACHED;
 	btp->bt_xfile = xfile;
 	btp->bt_dev = (dev_t)-1U;
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (34 preceding siblings ...)
  2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: cache xfile pages for better performance Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/5] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: force rebuilding of metadata Darrick J. Wong
                   ` (41 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Before we start merging the online repair functions, let's improve the
bulk loading code a bit.  First, we need to fix a misinteraction between
the AIL and the btree bulkloader wherein the delwri submission at the
end of the bulk load fails to queue a buffer for writeback if it
happens to be on the AIL's buffer list.

Second, we introduce EFIs in the btree bulkloader block allocator to
guarantee that staging blocks are freed if the filesystem goes down
before committing the new btree.

Third, we change the bulkloader itself to copy multiple records into a
block if possible, and add some debugging knobs so that developers can
control the slack factors, just like they can do for xfs_repair.
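
For anyone who hasn't poked at the bulk loader before, the slack factors are
just "records to leave unused per new block"; a rough sketch of how a repair
caller wires them up (local names here are illustrative, not the final
xfs_repair code):

	struct xfs_btree_bload	*bbl = &btr->bload;	/* illustrative */

	/*
	 * 0 packs blocks completely full; -1 lets libxfs pick a default
	 * that lands roughly 75% full.
	 */
	bbl->leaf_slack = leaf_slack;
	bbl->node_slack = node_slack;

	error = -libxfs_btree_bload_compute_geometry(cur, bbl, nr_records);
	if (error)
		return error;

	/* ...reserve bbl->nr_blocks blocks for the new btree, then... */
	error = -libxfs_btree_bload(cur, bbl, priv);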

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-prep-for-bulk-loading
---
 libxfs/libxfs_api_defs.h   |    1 
 libxfs/libxfs_io.h         |   11 +++
 libxfs/xfs_btree.c         |    2 
 libxfs/xfs_btree.h         |    3 +
 libxfs/xfs_btree_staging.c |   67 +++++++++++-----
 libxfs/xfs_btree_staging.h |   32 +++++---
 repair/agbtree.c           |  182 ++++++++++++++++++++++++++++----------------
 7 files changed, 198 insertions(+), 100 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/5] xfs: force all buffers to be written during btree bulk load
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/5] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/5] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 5/5] xfs_repair: bulk load records into new btree blocks Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/5] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffer readers
encountering buffers with DELWRI_Q set, even though the btree bulk load
had already committed and the buffer itself wasn't on any delwri list.

I traced this to a misunderstanding of how the delwri lists work,
particularly with regard to the AIL's buffer list.  If a buffer is
logged and committed, the buffer can end up on that AIL buffer list.  If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up.  This clears DELWRI_Q from the buffer state.

If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block.  Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.

When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't be added to the bulk loader's buffer list.

This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
committing the new btree root pointer.  This ordering requirement is
required for data consistency.

Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.

To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list.
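
For comparison, the kernel-side helper described above cannot simply assert;
it has to wait for whichever list currently owns the buffer to let go before
requeueing it.  A rough sketch of that shape (the libxfs port below only
needs the assert, since userspace has no AIL):

void
xfs_buf_delwri_queue_here(
	struct xfs_buf		*bp,
	struct list_head	*buffer_list)
{
	/* Wait for the previous owner (e.g. the AIL) to drop the buffer. */
	while (!list_empty(&bp->b_list)) {
		xfs_buf_unlock(bp);
		delay(1);
		xfs_buf_lock(bp);
	}

	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));

	xfs_buf_delwri_queue(bp, buffer_list);
}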

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_io.h         |   11 +++++++++++
 libxfs/xfs_btree_staging.c |    4 +---
 2 files changed, 12 insertions(+), 3 deletions(-)


diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index fae86427201..4ffe788d446 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -243,6 +243,17 @@ xfs_buf_delwri_queue(struct xfs_buf *bp, struct list_head *buffer_list)
 	return true;
 }
 
+static inline void
+xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *buffer_list)
+{
+	ASSERT(list_empty(&bp->b_list));
+
+	/* This buffer is uptodate; don't let it get reread. */
+	libxfs_buf_mark_dirty(bp);
+
+	xfs_buf_delwri_queue(bp, buffer_list);
+}
+
 int xfs_buf_delwri_submit(struct list_head *buffer_list);
 void xfs_buf_delwri_cancel(struct list_head *list);
 
diff --git a/libxfs/xfs_btree_staging.c b/libxfs/xfs_btree_staging.c
index a6a90791668..baf7f422603 100644
--- a/libxfs/xfs_btree_staging.c
+++ b/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf(
 	if (*bpp == NULL)
 		return;
 
-	if (!xfs_buf_delwri_queue(*bpp, buffers_list))
-		ASSERT(0);
-
+	xfs_buf_delwri_queue_here(*bpp, buffers_list);
 	xfs_buf_relse(*bpp);
 	*bpp = NULL;
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/5] xfs: implement block reservation accounting for btrees we're staging
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/5] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
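
The structure itself lands in the repair code rather than in this libxfs
hunk; as a rough illustration of the shape being described (field names are
illustrative, not the final API):

struct xrep_newbt {
	struct xfs_scrub		*sc;

	/* Fake root for the staging btree cursor. */
	union {
		struct xbtree_afakeroot	afake;	/* AG-rooted btrees */
		struct xbtree_ifakeroot	ifake;	/* inode-rooted btrees */
	};

	/* rmap owner and allocation hint for the new btree blocks */
	struct xfs_owner_info		oinfo;
	xfs_fsblock_t			alloc_hint;

	/* extents reserved (and not yet consumed) for the new btree */
	struct list_head		resv_list;

	/* geometry and slack knobs for xfs_btree_bload */
	struct xfs_btree_bload		bload;
};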

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_btree_staging.h |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)


diff --git a/libxfs/xfs_btree_staging.h b/libxfs/xfs_btree_staging.h
index f0d2976050a..d6dea3f0088 100644
--- a/libxfs/xfs_btree_staging.h
+++ b/libxfs/xfs_btree_staging.h
@@ -38,11 +38,8 @@ struct xbtree_ifakeroot {
 	/* Number of bytes available for this fork in the inode. */
 	unsigned int		if_fork_size;
 
-	/* Fork format. */
-	unsigned int		if_format;
-
-	/* Number of records. */
-	unsigned int		if_extents;
+	/* Which fork is this btree being built for? */
+	int			if_whichfork;
 };
 
 /* Cursor interactions with fake roots for inode-rooted btrees. */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/5] xfs: move btree bulkload record initialization to ->get_record implementations
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 5/5] xfs_repair: bulk load records into new btree blocks Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're performing a bulk load of a btree, move the code that
actually stores the btree record in the new btree block out of the
generic code and into the individual ->get_record implementations.
This is preparation for being able to store multiple records with a
single indirect call.
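
As a sketch of the per-record contract after this change (the real
implementations follow in the diff below; example_get_records and
fill_next_incore_record are made-up names), each callback now formats
the record directly into the new leaf block:

/* Illustrative only; see get_bnobt_records below for a real example. */
static int
example_get_records(
	struct xfs_btree_cur	*cur,
	unsigned int		idx,
	struct xfs_btree_block	*block,
	unsigned int		nr_wanted,
	void			*priv)
{
	union xfs_btree_rec	*block_rec;

	/* Stage the next record in the cursor's in-core record area. */
	fill_next_incore_record(cur, priv);

	/* Format it straight into slot @idx of the new leaf block. */
	block_rec = libxfs_btree_rec_addr(cur, idx, block);
	cur->bc_ops->init_rec_from_cur(cur, block_rec);

	/* Tell the bulk loader how many records were stored. */
	return 1;
}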

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h   |    1 +
 libxfs/xfs_btree_staging.c |   17 ++++++-------
 libxfs/xfs_btree_staging.h |   15 ++++++++----
 repair/agbtree.c           |   56 +++++++++++++++++++++++++++++++++-----------
 4 files changed, 60 insertions(+), 29 deletions(-)


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index f8efcce777b..5aa9c019d40 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -50,6 +50,7 @@
 #define xfs_btree_bload_compute_geometry libxfs_btree_bload_compute_geometry
 #define xfs_btree_del_cursor		libxfs_btree_del_cursor
 #define xfs_btree_init_block		libxfs_btree_init_block
+#define xfs_btree_rec_addr		libxfs_btree_rec_addr
 #define xfs_buf_delwri_submit		libxfs_buf_delwri_submit
 #define xfs_buf_get			libxfs_buf_get
 #define xfs_buf_get_uncached		libxfs_buf_get_uncached
diff --git a/libxfs/xfs_btree_staging.c b/libxfs/xfs_btree_staging.c
index baf7f422603..97fade90622 100644
--- a/libxfs/xfs_btree_staging.c
+++ b/libxfs/xfs_btree_staging.c
@@ -434,22 +434,19 @@ STATIC int
 xfs_btree_bload_leaf(
 	struct xfs_btree_cur		*cur,
 	unsigned int			recs_this_block,
-	xfs_btree_bload_get_record_fn	get_record,
+	xfs_btree_bload_get_records_fn	get_records,
 	struct xfs_btree_block		*block,
 	void				*priv)
 {
-	unsigned int			j;
+	unsigned int			j = 1;
 	int				ret;
 
 	/* Fill the leaf block with records. */
-	for (j = 1; j <= recs_this_block; j++) {
-		union xfs_btree_rec	*block_rec;
-
-		ret = get_record(cur, priv);
-		if (ret)
+	while (j <= recs_this_block) {
+		ret = get_records(cur, j, block, recs_this_block - j + 1, priv);
+		if (ret < 0)
 			return ret;
-		block_rec = xfs_btree_rec_addr(cur, j, block);
-		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+		j += ret;
 	}
 
 	return 0;
@@ -787,7 +784,7 @@ xfs_btree_bload(
 		trace_xfs_btree_bload_block(cur, level, i, blocks, &ptr,
 				nr_this_block);
 
-		ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_record,
+		ret = xfs_btree_bload_leaf(cur, nr_this_block, bbl->get_records,
 				block, priv);
 		if (ret)
 			goto out;
diff --git a/libxfs/xfs_btree_staging.h b/libxfs/xfs_btree_staging.h
index d6dea3f0088..82a3a8ef0f1 100644
--- a/libxfs/xfs_btree_staging.h
+++ b/libxfs/xfs_btree_staging.h
@@ -50,7 +50,9 @@ void xfs_btree_commit_ifakeroot(struct xfs_btree_cur *cur, struct xfs_trans *tp,
 		int whichfork, const struct xfs_btree_ops *ops);
 
 /* Bulk loading of staged btrees. */
-typedef int (*xfs_btree_bload_get_record_fn)(struct xfs_btree_cur *cur, void *priv);
+typedef int (*xfs_btree_bload_get_records_fn)(struct xfs_btree_cur *cur,
+		unsigned int idx, struct xfs_btree_block *block,
+		unsigned int nr_wanted, void *priv);
 typedef int (*xfs_btree_bload_claim_block_fn)(struct xfs_btree_cur *cur,
 		union xfs_btree_ptr *ptr, void *priv);
 typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur,
@@ -58,11 +60,14 @@ typedef size_t (*xfs_btree_bload_iroot_size_fn)(struct xfs_btree_cur *cur,
 
 struct xfs_btree_bload {
 	/*
-	 * This function will be called nr_records times to load records into
-	 * the btree.  The function does this by setting the cursor's bc_rec
-	 * field in in-core format.  Records must be returned in sort order.
+	 * This function will be called to load @nr_wanted records into the
+	 * btree.  The implementation does this by setting the cursor's bc_rec
+	 * field in in-core format and using init_rec_from_cur to set the
+	 * records in the btree block.  Records must be returned in sort order.
+	 * The function must return the number of records loaded or the usual
+	 * negative errno.
 	 */
-	xfs_btree_bload_get_record_fn	get_record;
+	xfs_btree_bload_get_records_fn	get_records;
 
 	/*
 	 * This function will be called nr_blocks times to obtain a pointer
diff --git a/repair/agbtree.c b/repair/agbtree.c
index 0fd7ef5d351..d90cbcc2f28 100644
--- a/repair/agbtree.c
+++ b/repair/agbtree.c
@@ -209,18 +209,25 @@ get_bno_rec(
 
 /* Grab one bnobt record and put it in the btree cursor. */
 static int
-get_bnobt_record(
+get_bnobt_records(
 	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
 	void				*priv)
 {
 	struct bt_rebuild		*btr = priv;
 	struct xfs_alloc_rec_incore	*arec = &cur->bc_rec.a;
+	union xfs_btree_rec		*block_rec;
 
 	btr->bno_rec = get_bno_rec(cur, btr->bno_rec);
 	arec->ar_startblock = btr->bno_rec->ex_startblock;
 	arec->ar_blockcount = btr->bno_rec->ex_blockcount;
 	btr->freeblks += btr->bno_rec->ex_blockcount;
-	return 0;
+
+	block_rec = libxfs_btree_rec_addr(cur, idx, block);
+	cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	return 1;
 }
 
 void
@@ -247,10 +254,10 @@ init_freespace_cursors(
 	btr_cnt->cur = libxfs_allocbt_stage_cursor(sc->mp,
 			&btr_cnt->newbt.afake, pag, XFS_BTNUM_CNT);
 
-	btr_bno->bload.get_record = get_bnobt_record;
+	btr_bno->bload.get_records = get_bnobt_records;
 	btr_bno->bload.claim_block = rebuild_claim_block;
 
-	btr_cnt->bload.get_record = get_bnobt_record;
+	btr_cnt->bload.get_records = get_bnobt_records;
 	btr_cnt->bload.claim_block = rebuild_claim_block;
 
 	/*
@@ -371,13 +378,17 @@ get_ino_rec(
 
 /* Grab one inobt record. */
 static int
-get_inobt_record(
+get_inobt_records(
 	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
 	void				*priv)
 {
 	struct bt_rebuild		*btr = priv;
 	struct xfs_inobt_rec_incore	*irec = &cur->bc_rec.i;
 	struct ino_tree_node		*ino_rec;
+	union xfs_btree_rec		*block_rec;
 	int				inocnt = 0;
 	int				finocnt = 0;
 	int				k;
@@ -431,7 +442,10 @@ get_inobt_record(
 		btr->first_agino = ino_rec->ino_startnum;
 	btr->freecount += finocnt;
 	btr->count += inocnt;
-	return 0;
+
+	block_rec = libxfs_btree_rec_addr(cur, idx, block);
+	cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	return 1;
 }
 
 /* Initialize both inode btree cursors as needed. */
@@ -490,7 +504,7 @@ init_ino_cursors(
 	btr_ino->cur = libxfs_inobt_stage_cursor(sc->mp, &btr_ino->newbt.afake,
 			pag, XFS_BTNUM_INO);
 
-	btr_ino->bload.get_record = get_inobt_record;
+	btr_ino->bload.get_records = get_inobt_records;
 	btr_ino->bload.claim_block = rebuild_claim_block;
 	btr_ino->first_agino = NULLAGINO;
 
@@ -510,7 +524,7 @@ _("Unable to compute inode btree geometry, error %d.\n"), error);
 	btr_fino->cur = libxfs_inobt_stage_cursor(sc->mp,
 			&btr_fino->newbt.afake, pag, XFS_BTNUM_FINO);
 
-	btr_fino->bload.get_record = get_inobt_record;
+	btr_fino->bload.get_records = get_inobt_records;
 	btr_fino->bload.claim_block = rebuild_claim_block;
 	btr_fino->first_agino = NULLAGINO;
 
@@ -560,16 +574,23 @@ _("Error %d while creating finobt btree for AG %u.\n"), error, agno);
 
 /* Grab one rmap record. */
 static int
-get_rmapbt_record(
+get_rmapbt_records(
 	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
 	void				*priv)
 {
 	struct xfs_rmap_irec		*rec;
 	struct bt_rebuild		*btr = priv;
+	union xfs_btree_rec		*block_rec;
 
 	rec = pop_slab_cursor(btr->slab_cursor);
 	memcpy(&cur->bc_rec.r, rec, sizeof(struct xfs_rmap_irec));
-	return 0;
+
+	block_rec = libxfs_btree_rec_addr(cur, idx, block);
+	cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	return 1;
 }
 
 /* Set up the rmap rebuild parameters. */
@@ -589,7 +610,7 @@ init_rmapbt_cursor(
 	init_rebuild(sc, &XFS_RMAP_OINFO_AG, free_space, btr);
 	btr->cur = libxfs_rmapbt_stage_cursor(sc->mp, &btr->newbt.afake, pag);
 
-	btr->bload.get_record = get_rmapbt_record;
+	btr->bload.get_records = get_rmapbt_records;
 	btr->bload.claim_block = rebuild_claim_block;
 
 	/* Compute how many blocks we'll need. */
@@ -631,16 +652,23 @@ _("Error %d while creating rmap btree for AG %u.\n"), error, agno);
 
 /* Grab one refcount record. */
 static int
-get_refcountbt_record(
+get_refcountbt_records(
 	struct xfs_btree_cur		*cur,
+	unsigned int			idx,
+	struct xfs_btree_block		*block,
+	unsigned int			nr_wanted,
 	void				*priv)
 {
 	struct xfs_refcount_irec	*rec;
 	struct bt_rebuild		*btr = priv;
+	union xfs_btree_rec		*block_rec;
 
 	rec = pop_slab_cursor(btr->slab_cursor);
 	memcpy(&cur->bc_rec.rc, rec, sizeof(struct xfs_refcount_irec));
-	return 0;
+
+	block_rec = libxfs_btree_rec_addr(cur, idx, block);
+	cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	return 1;
 }
 
 /* Set up the refcount rebuild parameters. */
@@ -661,7 +689,7 @@ init_refc_cursor(
 	btr->cur = libxfs_refcountbt_stage_cursor(sc->mp, &btr->newbt.afake,
 			pag);
 
-	btr->bload.get_record = get_refcountbt_record;
+	btr->bload.get_records = get_refcountbt_records;
 	btr->bload.claim_block = rebuild_claim_block;
 
 	/* Compute how many blocks we'll need. */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs: constrain dirty buffers while formatting a staged btree
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/5] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/5] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Constrain the number of dirty buffers that are locked by the btree
staging code at any given time by establishing a threshold at which we
put them all on the delwri queue and push them to disk.  This limits
memory consumption while writing out new btrees.
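
As a usage sketch (kernel-style names shown; the repair code reaches
these through the libxfs_ wrappers, and example_get_records,
example_claim_block, and nr_records are assumed to exist in the caller),
a bulk loader caps its dirty buffer footprint like so:

/* Sketch only: flush the delwri list after ~256KiB of new blocks. */
static int
example_bulk_load(
	struct xfs_btree_cur	*cur,
	uint64_t		nr_records,
	void			*priv)
{
	struct xfs_btree_bload	bload = {
		.get_records	= example_get_records,	/* hypothetical */
		.claim_block	= example_claim_block,	/* hypothetical */
		.max_dirty	= XFS_B_TO_FSBT(cur->bc_mp, 256U << 10),
	};
	int			error;

	error = xfs_btree_bload_compute_geometry(cur, &bload, nr_records);
	if (error)
		return error;

	return xfs_btree_bload(cur, &bload, priv);
}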

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_btree.c         |    2 +-
 libxfs/xfs_btree.h         |    3 +++
 libxfs/xfs_btree_staging.c |   50 ++++++++++++++++++++++++++++++++++----------
 libxfs/xfs_btree_staging.h |   10 +++++++++
 repair/agbtree.c           |    1 +
 5 files changed, 54 insertions(+), 12 deletions(-)


diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index 3402c25c344..7b2df32960c 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -1327,7 +1327,7 @@ xfs_btree_get_buf_block(
  * Read in the buffer at the given ptr and return the buffer and
  * the block pointer within the buffer.
  */
-STATIC int
+int
 xfs_btree_read_buf_block(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_ptr	*ptr,
diff --git a/libxfs/xfs_btree.h b/libxfs/xfs_btree.h
index d8b390e895b..6a565ad5e83 100644
--- a/libxfs/xfs_btree.h
+++ b/libxfs/xfs_btree.h
@@ -701,6 +701,9 @@ void xfs_btree_set_ptr_null(struct xfs_btree_cur *cur,
 int xfs_btree_get_buf_block(struct xfs_btree_cur *cur,
 		const union xfs_btree_ptr *ptr, struct xfs_btree_block **block,
 		struct xfs_buf **bpp);
+int xfs_btree_read_buf_block(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr, int flags,
+		struct xfs_btree_block **block, struct xfs_buf **bpp);
 void xfs_btree_set_sibling(struct xfs_btree_cur *cur,
 		struct xfs_btree_block *block, const union xfs_btree_ptr *ptr,
 		int lr);
diff --git a/libxfs/xfs_btree_staging.c b/libxfs/xfs_btree_staging.c
index 97fade90622..5391d3fead2 100644
--- a/libxfs/xfs_btree_staging.c
+++ b/libxfs/xfs_btree_staging.c
@@ -333,18 +333,35 @@ xfs_btree_commit_ifakeroot(
 /*
  * Put a btree block that we're loading onto the ordered list and release it.
  * The btree blocks will be written to disk when bulk loading is finished.
+ * If we reach the dirty buffer threshold, flush them to disk before
+ * continuing.
  */
-static void
+static int
 xfs_btree_bload_drop_buf(
-	struct list_head	*buffers_list,
-	struct xfs_buf		**bpp)
+	struct xfs_btree_bload		*bbl,
+	struct list_head		*buffers_list,
+	struct xfs_buf			**bpp)
 {
-	if (*bpp == NULL)
-		return;
+	struct xfs_buf			*bp = *bpp;
+	int				error;
 
-	xfs_buf_delwri_queue_here(*bpp, buffers_list);
-	xfs_buf_relse(*bpp);
+	if (!bp)
+		return 0;
+
+	xfs_buf_delwri_queue_here(bp, buffers_list);
+	xfs_buf_relse(bp);
 	*bpp = NULL;
+	bbl->nr_dirty++;
+
+	if (!bbl->max_dirty || bbl->nr_dirty < bbl->max_dirty)
+		return 0;
+
+	error = xfs_buf_delwri_submit(buffers_list);
+	if (error)
+		return error;
+
+	bbl->nr_dirty = 0;
+	return 0;
 }
 
 /*
@@ -416,7 +433,10 @@ xfs_btree_bload_prep_block(
 	 */
 	if (*blockp)
 		xfs_btree_set_sibling(cur, *blockp, &new_ptr, XFS_BB_RIGHTSIB);
-	xfs_btree_bload_drop_buf(buffers_list, bpp);
+
+	ret = xfs_btree_bload_drop_buf(bbl, buffers_list, bpp);
+	if (ret)
+		return ret;
 
 	/* Initialize the new btree block. */
 	xfs_btree_init_block_cur(cur, new_bp, level, nr_this_block);
@@ -480,7 +500,7 @@ xfs_btree_bload_node(
 
 		ASSERT(!xfs_btree_ptr_is_null(cur, child_ptr));
 
-		ret = xfs_btree_get_buf_block(cur, child_ptr, &child_block,
+		ret = xfs_btree_read_buf_block(cur, child_ptr, 0, &child_block,
 				&child_bp);
 		if (ret)
 			return ret;
@@ -759,6 +779,7 @@ xfs_btree_bload(
 	cur->bc_nlevels = bbl->btree_height;
 	xfs_btree_set_ptr_null(cur, &child_ptr);
 	xfs_btree_set_ptr_null(cur, &ptr);
+	bbl->nr_dirty = 0;
 
 	xfs_btree_bload_level_geometry(cur, bbl, level, nr_this_level,
 			&avg_per_block, &blocks, &blocks_with_extra);
@@ -797,7 +818,10 @@ xfs_btree_bload(
 			xfs_btree_copy_ptrs(cur, &child_ptr, &ptr, 1);
 	}
 	total_blocks += blocks;
-	xfs_btree_bload_drop_buf(&buffers_list, &bp);
+
+	ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp);
+	if (ret)
+		goto out;
 
 	/* Populate the internal btree nodes. */
 	for (level = 1; level < cur->bc_nlevels; level++) {
@@ -839,7 +863,11 @@ xfs_btree_bload(
 				xfs_btree_copy_ptrs(cur, &first_ptr, &ptr, 1);
 		}
 		total_blocks += blocks;
-		xfs_btree_bload_drop_buf(&buffers_list, &bp);
+
+		ret = xfs_btree_bload_drop_buf(bbl, &buffers_list, &bp);
+		if (ret)
+			goto out;
+
 		xfs_btree_copy_ptrs(cur, &child_ptr, &first_ptr, 1);
 	}
 
diff --git a/libxfs/xfs_btree_staging.h b/libxfs/xfs_btree_staging.h
index 82a3a8ef0f1..d2eaf4fdc60 100644
--- a/libxfs/xfs_btree_staging.h
+++ b/libxfs/xfs_btree_staging.h
@@ -115,6 +115,16 @@ struct xfs_btree_bload {
 	 * height of the new btree.
 	 */
 	unsigned int			btree_height;
+
+	/*
+	 * Flush the new btree block buffer list to disk after this many blocks
+	 * have been formatted.  Zero prohibits writing any buffers until all
+	 * blocks have been formatted.
+	 */
+	uint16_t			max_dirty;
+
+	/* Number of dirty buffers. */
+	uint16_t			nr_dirty;
 };
 
 int xfs_btree_bload_compute_geometry(struct xfs_btree_cur *cur,
diff --git a/repair/agbtree.c b/repair/agbtree.c
index d90cbcc2f28..70ad042f832 100644
--- a/repair/agbtree.c
+++ b/repair/agbtree.c
@@ -23,6 +23,7 @@ init_rebuild(
 	memset(btr, 0, sizeof(struct bt_rebuild));
 
 	bulkload_init_ag(&btr->newbt, sc, oinfo);
+	btr->bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */
 	bulkload_estimate_ag_slack(sc, &btr->bload, free_space);
 }
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] xfs_repair: bulk load records into new btree blocks
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 1/5] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/5] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Amortize the cost of indirect calls further by loading a batch of
records into a new btree block instead of one record per ->get_record
call.  On an rmap btree with 3.9 million records, this reduces the
runtime of xfs_btree_bload by 3% for xfsprogs.  For the upcoming online
repair functionality, this will reduce runtime by 6% when Spectre
mitigations are enabled in the kernel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/agbtree.c |  161 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 90 insertions(+), 71 deletions(-)


diff --git a/repair/agbtree.c b/repair/agbtree.c
index 70ad042f832..cba67c5fbf4 100644
--- a/repair/agbtree.c
+++ b/repair/agbtree.c
@@ -220,15 +220,19 @@ get_bnobt_records(
 	struct bt_rebuild		*btr = priv;
 	struct xfs_alloc_rec_incore	*arec = &cur->bc_rec.a;
 	union xfs_btree_rec		*block_rec;
+	unsigned int			loaded;
 
-	btr->bno_rec = get_bno_rec(cur, btr->bno_rec);
-	arec->ar_startblock = btr->bno_rec->ex_startblock;
-	arec->ar_blockcount = btr->bno_rec->ex_blockcount;
-	btr->freeblks += btr->bno_rec->ex_blockcount;
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		btr->bno_rec = get_bno_rec(cur, btr->bno_rec);
+		arec->ar_startblock = btr->bno_rec->ex_startblock;
+		arec->ar_blockcount = btr->bno_rec->ex_blockcount;
+		btr->freeblks += btr->bno_rec->ex_blockcount;
 
-	block_rec = libxfs_btree_rec_addr(cur, idx, block);
-	cur->bc_ops->init_rec_from_cur(cur, block_rec);
-	return 1;
+		block_rec = libxfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
 }
 
 void
@@ -388,65 +392,72 @@ get_inobt_records(
 {
 	struct bt_rebuild		*btr = priv;
 	struct xfs_inobt_rec_incore	*irec = &cur->bc_rec.i;
-	struct ino_tree_node		*ino_rec;
-	union xfs_btree_rec		*block_rec;
-	int				inocnt = 0;
-	int				finocnt = 0;
-	int				k;
-
-	btr->ino_rec = ino_rec = get_ino_rec(cur, btr->ino_rec);
-
-	/* Transform the incore record into an on-disk record. */
-	irec->ir_startino = ino_rec->ino_startnum;
-	irec->ir_free = ino_rec->ir_free;
-
-	for (k = 0; k < sizeof(xfs_inofree_t) * NBBY; k++)  {
-		ASSERT(is_inode_confirmed(ino_rec, k));
-
-		if (is_inode_sparse(ino_rec, k))
-			continue;
-		if (is_inode_free(ino_rec, k))
-			finocnt++;
-		inocnt++;
-	}
+	unsigned int			loaded = 0;
+
+	while (loaded < nr_wanted) {
+		struct ino_tree_node	*ino_rec;
+		union xfs_btree_rec	*block_rec;
+		int			inocnt = 0;
+		int			finocnt = 0;
+		int			k;
+
+		btr->ino_rec = ino_rec = get_ino_rec(cur, btr->ino_rec);
 
-	irec->ir_count = inocnt;
-	irec->ir_freecount = finocnt;
-
-	if (xfs_has_sparseinodes(cur->bc_mp)) {
-		uint64_t		sparse;
-		int			spmask;
-		uint16_t		holemask;
-
-		/*
-		 * Convert the 64-bit in-core sparse inode state to the
-		 * 16-bit on-disk holemask.
-		 */
-		holemask = 0;
-		spmask = (1 << XFS_INODES_PER_HOLEMASK_BIT) - 1;
-		sparse = ino_rec->ir_sparse;
-		for (k = 0; k < XFS_INOBT_HOLEMASK_BITS; k++) {
-			if (sparse & spmask) {
-				ASSERT((sparse & spmask) == spmask);
-				holemask |= (1 << k);
-			} else
-				ASSERT((sparse & spmask) == 0);
-			sparse >>= XFS_INODES_PER_HOLEMASK_BIT;
+		/* Transform the incore record into an on-disk record. */
+		irec->ir_startino = ino_rec->ino_startnum;
+		irec->ir_free = ino_rec->ir_free;
+
+		for (k = 0; k < sizeof(xfs_inofree_t) * NBBY; k++)  {
+			ASSERT(is_inode_confirmed(ino_rec, k));
+
+			if (is_inode_sparse(ino_rec, k))
+				continue;
+			if (is_inode_free(ino_rec, k))
+				finocnt++;
+			inocnt++;
 		}
 
-		irec->ir_holemask = holemask;
-	} else {
-		irec->ir_holemask = 0;
-	}
+		irec->ir_count = inocnt;
+		irec->ir_freecount = finocnt;
 
-	if (btr->first_agino == NULLAGINO)
-		btr->first_agino = ino_rec->ino_startnum;
-	btr->freecount += finocnt;
-	btr->count += inocnt;
+		if (xfs_has_sparseinodes(cur->bc_mp)) {
+			uint64_t		sparse;
+			int			spmask;
+			uint16_t		holemask;
+
+			/*
+			 * Convert the 64-bit in-core sparse inode state to the
+			 * 16-bit on-disk holemask.
+			 */
+			holemask = 0;
+			spmask = (1 << XFS_INODES_PER_HOLEMASK_BIT) - 1;
+			sparse = ino_rec->ir_sparse;
+			for (k = 0; k < XFS_INOBT_HOLEMASK_BITS; k++) {
+				if (sparse & spmask) {
+					ASSERT((sparse & spmask) == spmask);
+					holemask |= (1 << k);
+				} else
+					ASSERT((sparse & spmask) == 0);
+				sparse >>= XFS_INODES_PER_HOLEMASK_BIT;
+			}
+
+			irec->ir_holemask = holemask;
+		} else {
+			irec->ir_holemask = 0;
+		}
+
+		if (btr->first_agino == NULLAGINO)
+			btr->first_agino = ino_rec->ino_startnum;
+		btr->freecount += finocnt;
+		btr->count += inocnt;
+
+		block_rec = libxfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+		loaded++;
+		idx++;
+	}
 
-	block_rec = libxfs_btree_rec_addr(cur, idx, block);
-	cur->bc_ops->init_rec_from_cur(cur, block_rec);
-	return 1;
+	return loaded;
 }
 
 /* Initialize both inode btree cursors as needed. */
@@ -585,13 +596,17 @@ get_rmapbt_records(
 	struct xfs_rmap_irec		*rec;
 	struct bt_rebuild		*btr = priv;
 	union xfs_btree_rec		*block_rec;
+	unsigned int			loaded;
 
-	rec = pop_slab_cursor(btr->slab_cursor);
-	memcpy(&cur->bc_rec.r, rec, sizeof(struct xfs_rmap_irec));
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		rec = pop_slab_cursor(btr->slab_cursor);
+		memcpy(&cur->bc_rec.r, rec, sizeof(struct xfs_rmap_irec));
 
-	block_rec = libxfs_btree_rec_addr(cur, idx, block);
-	cur->bc_ops->init_rec_from_cur(cur, block_rec);
-	return 1;
+		block_rec = libxfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
 }
 
 /* Set up the rmap rebuild parameters. */
@@ -663,13 +678,17 @@ get_refcountbt_records(
 	struct xfs_refcount_irec	*rec;
 	struct bt_rebuild		*btr = priv;
 	union xfs_btree_rec		*block_rec;
+	unsigned int			loaded;
 
-	rec = pop_slab_cursor(btr->slab_cursor);
-	memcpy(&cur->bc_rec.rc, rec, sizeof(struct xfs_refcount_irec));
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		rec = pop_slab_cursor(btr->slab_cursor);
+		memcpy(&cur->bc_rec.rc, rec, sizeof(struct xfs_refcount_irec));
 
-	block_rec = libxfs_btree_rec_addr(cur, idx, block);
-	cur->bc_ops->init_rec_from_cur(cur, block_rec);
-	return 1;
+		block_rec = libxfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
 }
 
 /* Set up the refcount rebuild parameters. */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/3] libxfs: force rebuilding of metadata
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (35 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs_scrub: try to use XFS_SCRUB_IFLAG_FORCE_REBUILD Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
                   ` (40 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This patchset adds a new IFLAG to the scrub ioctl so that userspace can
force a rebuild of an otherwise consistent piece of metadata.  This will
eventually enable the use of online repair to relocate metadata during a
filesystem reorganization (e.g. shrink).  For now, it facilitates stress
testing of online repair without needing the debugging knobs to be
enabled.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-force-rebuild

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-force-rebuild

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-force-rebuild
---
 io/scrub.c                          |   22 ++++++++++++-----
 libxfs/xfs_fs.h                     |    6 ++++-
 man/man2/ioctl_xfs_scrub_metadata.2 |    5 ++++
 man/man8/xfs_io.8                   |    3 ++
 scrub/phase1.c                      |   28 ++++++++++++++++++++++
 scrub/scrub.c                       |   45 ++++++++++++++++++-----------------
 scrub/scrub.h                       |    1 +
 scrub/xfs_scrub.c                   |    3 ++
 scrub/xfs_scrub.h                   |    1 +
 9 files changed, 84 insertions(+), 30 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/3] xfs: allow userspace to rebuild metadata structures
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: force rebuilding of metadata Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs_scrub: try to use XFS_SCRUB_IFLAG_FORCE_REBUILD Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/3] xfs_io: support passing the FORCE_REBUILD flag to online repair Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new (superuser-only) flag to the online metadata repair ioctl to
force it to rebuild structures, even if they're not broken.  We will use
this to move metadata structures out of the way during a free space
defragmentation operation.
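
A minimal sketch of how a privileged caller exercises the new flag (the
xfs/xfs_fs.h include path is an assumption; error handling trimmed):

#include <stdint.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>	/* assumed install location of the scrub ABI */

/*
 * Sketch: ask the kernel to rebuild the inode btree of AG @agno even if
 * it is not corrupt.  @fd is any open fd on the target filesystem.
 */
static int
force_rebuild_inobt(int fd, uint32_t agno)
{
	struct xfs_scrub_metadata	sm = {
		.sm_type	= XFS_SCRUB_TYPE_INOBT,
		.sm_agno	= agno,
		.sm_flags	= XFS_SCRUB_IFLAG_REPAIR |
				  XFS_SCRUB_IFLAG_FORCE_REBUILD,
	};

	return ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
}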

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_fs.h                     |    6 +++++-
 man/man2/ioctl_xfs_scrub_metadata.2 |    5 +++++
 2 files changed, 10 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 1cfd5bc6520..920fd4513fc 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -741,7 +741,11 @@ struct xfs_scrub_metadata {
  */
 #define XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED (1u << 7)
 
-#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_IFLAG_REPAIR)
+/* i: Rebuild the data structure. */
+#define XFS_SCRUB_IFLAG_FORCE_REBUILD	(1 << 31)
+
+#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_IFLAG_REPAIR | \
+				 XFS_SCRUB_IFLAG_FORCE_REBUILD)
 #define XFS_SCRUB_FLAGS_OUT	(XFS_SCRUB_OFLAG_CORRUPT | \
 				 XFS_SCRUB_OFLAG_PREEN | \
 				 XFS_SCRUB_OFLAG_XFAIL | \
diff --git a/man/man2/ioctl_xfs_scrub_metadata.2 b/man/man2/ioctl_xfs_scrub_metadata.2
index 046e3e3657b..42bf1e1cac5 100644
--- a/man/man2/ioctl_xfs_scrub_metadata.2
+++ b/man/man2/ioctl_xfs_scrub_metadata.2
@@ -216,6 +216,11 @@ The checker was unable to complete its check of all records.
 The checker encountered a metadata object with potentially problematic
 records.
 However, the records were not obviously corrupt.
+.TP
+.B XFS_SCRUB_IFLAG_FORCE_REBUILD
+Force the kernel to rebuild the specified piece of metadata, even if it's
+healthy.
+This can only be specified by the system administrator.
 .RE
 .PP
 For metadata checkers that operate on inodes or inode metadata, the fields


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/3] xfs_io: support passing the FORCE_REBUILD flag to online repair
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: force rebuilding of metadata Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs_scrub: try to use XFS_SCRUB_IFLAG_FORCE_REBUILD Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/3] xfs: allow userspace to rebuild metadata structures Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a -R option to the scrub and repair commands so that the user can
pass FORCE_REBUILD to force the kernel to rebuild metadata.
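
For example (syntax per the help text below), a command along the lines
of 'xfs_io -x -c "repair -R inobt 3" /mnt' would ask the kernel to
rebuild the inode btree in AG 3 even if it checks out clean.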

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/scrub.c        |   22 +++++++++++++++-------
 man/man8/xfs_io.8 |    3 +++
 2 files changed, 18 insertions(+), 7 deletions(-)


diff --git a/io/scrub.c b/io/scrub.c
index fc22ba49f8b..a74e65fbe8d 100644
--- a/io/scrub.c
+++ b/io/scrub.c
@@ -46,7 +46,8 @@ scrub_ioctl(
 	int				fd,
 	int				type,
 	uint64_t			control,
-	uint32_t			control2)
+	uint32_t			control2,
+	uint32_t			flags)
 {
 	struct xfs_scrub_metadata	meta;
 	const struct xfrog_scrub_descr	*sc;
@@ -69,7 +70,7 @@ scrub_ioctl(
 		/* no control parameters */
 		break;
 	}
-	meta.sm_flags = 0;
+	meta.sm_flags = flags;
 
 	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
 	if (error)
@@ -91,17 +92,21 @@ parse_args(
 	int				argc,
 	char				**argv,
 	struct cmdinfo			*cmdinfo,
-	void				(*fn)(int, int, uint64_t, uint32_t))
+	void				(*fn)(int, int, uint64_t, uint32_t, uint32_t))
 {
 	char				*p;
 	int				type = -1;
 	int				i, c;
 	uint64_t			control = 0;
 	uint32_t			control2 = 0;
+	uint32_t			flags = 0;
 	const struct xfrog_scrub_descr	*d = NULL;
 
-	while ((c = getopt(argc, argv, "")) != EOF) {
+	while ((c = getopt(argc, argv, "R")) != EOF) {
 		switch (c) {
+		case 'R':
+			flags |= XFS_SCRUB_IFLAG_FORCE_REBUILD;
+			break;
 		default:
 			return command_usage(cmdinfo);
 		}
@@ -173,7 +178,7 @@ parse_args(
 		ASSERT(0);
 		break;
 	}
-	fn(file->fd, type, control, control2);
+	fn(file->fd, type, control, control2, flags);
 
 	return 0;
 }
@@ -216,6 +221,8 @@ repair_help(void)
 " or (optionally) take an inode number and generation number to act upon as\n"
 " the second and third parameters.\n"
 "\n"
+" Flags are -R to rebuild metadata.\n"
+"\n"
 " Example:\n"
 " 'repair inobt 3' - repairs the inode btree in AG 3.\n"
 " 'repair bmapbtd 128 13525' - repairs the extent map of inode 128 gen 13525.\n"
@@ -231,7 +238,8 @@ repair_ioctl(
 	int				fd,
 	int				type,
 	uint64_t			control,
-	uint32_t			control2)
+	uint32_t			control2,
+	uint32_t			flags)
 {
 	struct xfs_scrub_metadata	meta;
 	const struct xfrog_scrub_descr	*sc;
@@ -254,7 +262,7 @@ repair_ioctl(
 		/* no control parameters */
 		break;
 	}
-	meta.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
+	meta.sm_flags = flags | XFS_SCRUB_IFLAG_REPAIR;
 
 	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
 	if (error)
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 223b5152314..ae8d0245d87 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1340,6 +1340,9 @@ parameter specifies which type of metadata to repair.
 For AG metadata, one AG number must be specified.
 For file metadata, the repair is applied to the open file unless the
 inode number and generation number are specified.
+The
+.B -R
+option can be specified to force rebuilding of a metadata structure.
 .TP
 .BI "label" " " "[ -c | -s " label " ] "
 On filesystems that support online label manipulation, get, set, or clear the


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/3] xfs_scrub: try to use XFS_SCRUB_IFLAG_FORCE_REBUILD
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: force rebuilding of metadata Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/3] xfs_io: support passing the FORCE_REBUILD flag to online repair Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/3] xfs: allow userspace to rebuild metadata structures Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have a FORCE_REBUILD flag to the scrub ioctl, try to use
that over the (much noisier) error injection knob, which may or may not
even be enabled in the kernel config.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c    |   28 ++++++++++++++++++++++++++++
 scrub/scrub.c     |   45 +++++++++++++++++++++++----------------------
 scrub/scrub.h     |    1 +
 scrub/xfs_scrub.c |    3 +++
 scrub/xfs_scrub.h |    1 +
 5 files changed, 56 insertions(+), 22 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index fd1050c9202..2daf5c7bb38 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -27,6 +27,7 @@
 #include "scrub.h"
 #include "repair.h"
 #include "libfrog/fsgeom.h"
+#include "xfs_errortag.h"
 
 /* Phase 1: Find filesystem geometry (and clean up after) */
 
@@ -68,6 +69,27 @@ scrub_cleanup(
 	return error;
 }
 
+/* Decide if we're using FORCE_REBUILD or injecting FORCE_REPAIR. */
+static int
+enable_force_repair(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_error_injection	inject = {
+		.fd			= ctx->mnt.fd,
+		.errtag			= XFS_ERRTAG_FORCE_SCRUB_REPAIR,
+	};
+	int				error;
+
+	use_force_rebuild = can_force_rebuild(ctx);
+	if (use_force_rebuild)
+		return 0;
+
+	error = ioctl(ctx->mnt.fd, XFS_IOC_ERROR_INJECTION, &inject);
+	if (error)
+		str_errno(ctx, _("force_repair"));
+	return error;
+}
+
 /*
  * Bind to the mountpoint, read the XFS geometry, bind to the block devices.
  * Anything we've already built will be cleaned up by scrub_cleanup.
@@ -156,6 +178,12 @@ _("Kernel metadata repair facility is not available.  Use -n to scrub."));
 		return ECANCELED;
 	}
 
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
+		error = enable_force_repair(ctx);
+		if (error)
+			return error;
+	}
+
 	/* Did we find the log and rt devices, if they're present? */
 	if (ctx->mnt.fsgeom.logstart == 0 && ctx->fsinfo.fs_log == NULL) {
 		str_error(ctx, ctx->mntpoint,
diff --git a/scrub/scrub.c b/scrub/scrub.c
index e83d0d9ce99..d899f75a92f 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -18,7 +18,6 @@
 #include "common.h"
 #include "progress.h"
 #include "scrub.h"
-#include "xfs_errortag.h"
 #include "repair.h"
 #include "descr.h"
 
@@ -500,26 +499,16 @@ static bool
 __scrub_test(
 	struct scrub_ctx		*ctx,
 	unsigned int			type,
-	bool				repair)
+	unsigned int			flags)
 {
 	struct xfs_scrub_metadata	meta = {0};
-	struct xfs_error_injection	inject;
-	static bool			injected;
 	int				error;
 
 	if (debug_tweak_on("XFS_SCRUB_NO_KERNEL"))
 		return false;
-	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
-		inject.fd = ctx->mnt.fd;
-		inject.errtag = XFS_ERRTAG_FORCE_SCRUB_REPAIR;
-		error = ioctl(ctx->mnt.fd, XFS_IOC_ERROR_INJECTION, &inject);
-		if (error == 0)
-			injected = true;
-	}
 
 	meta.sm_type = type;
-	if (repair)
-		meta.sm_flags |= XFS_SCRUB_IFLAG_REPAIR;
+	meta.sm_flags = flags;
 	error = -xfrog_scrub_metadata(&ctx->mnt, &meta);
 	switch (error) {
 	case 0:
@@ -532,13 +521,15 @@ _("Filesystem is mounted read-only; cannot proceed."));
 		str_info(ctx, ctx->mntpoint,
 _("Filesystem is mounted norecovery; cannot proceed."));
 		return false;
+	case EINVAL:
 	case EOPNOTSUPP:
 	case ENOTTY:
 		if (debug || verbose)
 			str_info(ctx, ctx->mntpoint,
 _("Kernel %s %s facility not detected."),
 					_(xfrog_scrubbers[type].descr),
-					repair ? _("repair") : _("scrub"));
+					(flags & XFS_SCRUB_IFLAG_REPAIR) ?
+						_("repair") : _("scrub"));
 		return false;
 	case ENOENT:
 		/* Scrubber says not present on this fs; that's fine. */
@@ -553,56 +544,64 @@ bool
 can_scrub_fs_metadata(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_PROBE, false);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_PROBE, 0);
 }
 
 bool
 can_scrub_inode(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_INODE, false);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_INODE, 0);
 }
 
 bool
 can_scrub_bmap(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_BMBTD, false);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_BMBTD, 0);
 }
 
 bool
 can_scrub_dir(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_DIR, false);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_DIR, 0);
 }
 
 bool
 can_scrub_attr(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_XATTR, false);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_XATTR, 0);
 }
 
 bool
 can_scrub_symlink(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_SYMLINK, false);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_SYMLINK, 0);
 }
 
 bool
 can_scrub_parent(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_PARENT, false);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_PARENT, 0);
 }
 
 bool
 xfs_can_repair(
 	struct scrub_ctx	*ctx)
 {
-	return __scrub_test(ctx, XFS_SCRUB_TYPE_PROBE, true);
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_PROBE, XFS_SCRUB_IFLAG_REPAIR);
+}
+
+bool
+can_force_rebuild(
+	struct scrub_ctx	*ctx)
+{
+	return __scrub_test(ctx, XFS_SCRUB_TYPE_PROBE,
+			XFS_SCRUB_IFLAG_REPAIR | XFS_SCRUB_IFLAG_FORCE_REBUILD);
 }
 
 /* General repair routines. */
@@ -624,6 +623,8 @@ xfs_repair_metadata(
 	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
 	meta.sm_type = aitem->type;
 	meta.sm_flags = aitem->flags | XFS_SCRUB_IFLAG_REPAIR;
+	if (use_force_rebuild)
+		meta.sm_flags |= XFS_SCRUB_IFLAG_FORCE_REBUILD;
 	switch (xfrog_scrubbers[aitem->type].type) {
 	case XFROG_SCRUB_TYPE_AGHEADER:
 	case XFROG_SCRUB_TYPE_PERAG:
diff --git a/scrub/scrub.h b/scrub/scrub.h
index fccd82f2155..023069ee066 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -33,6 +33,7 @@ bool can_scrub_attr(struct scrub_ctx *ctx);
 bool can_scrub_symlink(struct scrub_ctx *ctx);
 bool can_scrub_parent(struct scrub_ctx *ctx);
 bool xfs_can_repair(struct scrub_ctx *ctx);
+bool can_force_rebuild(struct scrub_ctx *ctx);
 
 int scrub_file(struct scrub_ctx *ctx, int fd, const struct xfs_bulkstat *bstat,
 		unsigned int type, struct action_list *alist);
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index 7a0411b0cc8..597be59f9f9 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -157,6 +157,9 @@ bool				stdout_isatty;
  */
 bool				is_service;
 
+/* Set to true if the kernel supports XFS_SCRUB_IFLAG_FORCE_REBUILD */
+bool				use_force_rebuild;
+
 #define SCRUB_RET_SUCCESS	(0)	/* no problems left behind */
 #define SCRUB_RET_CORRUPT	(1)	/* corruption remains on fs */
 #define SCRUB_RET_UNOPTIMIZED	(2)	/* fs could be optimized */
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index f6712d368c6..0d6b9dad2c9 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -21,6 +21,7 @@ extern bool			want_fstrim;
 extern bool			stderr_isatty;
 extern bool			stdout_isatty;
 extern bool			is_service;
+extern bool			use_force_rebuild;
 
 enum scrub_mode {
 	SCRUB_MODE_DRY_RUN,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (36 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: force rebuilding of metadata Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/4] libfrog: promote XFROG_SCRUB_DESCR_SUMMARY to a scrub type Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of quota counters Darrick J. Wong
                   ` (39 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

At this point, we need to clean up the libfrog and xfs_scrub code a
little bit.  First, correct some of the weird naming and organizing
choices I made in libfrog for scrub types and fs summary counter scans.
Second, break out metadata file scans as a separate group, and teach
xfs_scrub that it can ask the kernel to scan them in parallel.  On
filesystems with quota files or realtime volumes, this can speed up that
part significantly.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-metafile-parallel
---
 io/scrub.c      |   39 ++++++++-------
 libfrog/scrub.c |   51 +++++++++----------
 libfrog/scrub.h |   24 +++------
 scrub/phase2.c  |  146 +++++++++++++++++++++++++++++++++++++++++++------------
 scrub/phase4.c  |    2 -
 scrub/phase7.c  |    4 +-
 scrub/scrub.c   |   77 +++++++++++++++++------------
 scrub/scrub.h   |    6 ++
 8 files changed, 219 insertions(+), 130 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] libfrog: rename XFROG_SCRUB_TYPE_* to XFROG_SCRUB_GROUP_*
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/4] libfrog: promote XFROG_SCRUB_DESCR_SUMMARY to a scrub type Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/4] libfrog: rename the scrub "fs" group Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

I didn't do a good job of naming XFROG_SCRUB_TYPE when I created that
enumeration.  The goal of the enum is to group the scrub ioctl's
XFS_SCRUB_TYPE_* codes by principal filesystem object (AG, inode, etc.)
but for some dumb reason I chose to reuse "type".  This is confusing,
so fix this sin.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/scrub.c      |   36 ++++++++++++++++++------------------
 libfrog/scrub.c |   50 +++++++++++++++++++++++++-------------------------
 libfrog/scrub.h |   16 ++++++++--------
 scrub/scrub.c   |   54 +++++++++++++++++++++++++++---------------------------
 4 files changed, 78 insertions(+), 78 deletions(-)


diff --git a/io/scrub.c b/io/scrub.c
index a74e65fbe8d..0cad69253dc 100644
--- a/io/scrub.c
+++ b/io/scrub.c
@@ -56,17 +56,17 @@ scrub_ioctl(
 	sc = &xfrog_scrubbers[type];
 	memset(&meta, 0, sizeof(meta));
 	meta.sm_type = type;
-	switch (sc->type) {
-	case XFROG_SCRUB_TYPE_AGHEADER:
-	case XFROG_SCRUB_TYPE_PERAG:
+	switch (sc->group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
 		meta.sm_agno = control;
 		break;
-	case XFROG_SCRUB_TYPE_INODE:
+	case XFROG_SCRUB_GROUP_INODE:
 		meta.sm_ino = control;
 		meta.sm_gen = control2;
 		break;
-	case XFROG_SCRUB_TYPE_NONE:
-	case XFROG_SCRUB_TYPE_FS:
+	case XFROG_SCRUB_GROUP_NONE:
+	case XFROG_SCRUB_GROUP_FS:
 		/* no control parameters */
 		break;
 	}
@@ -126,8 +126,8 @@ parse_args(
 	}
 	optind++;
 
-	switch (d->type) {
-	case XFROG_SCRUB_TYPE_INODE:
+	switch (d->group) {
+	case XFROG_SCRUB_GROUP_INODE:
 		if (optind == argc) {
 			control = 0;
 			control2 = 0;
@@ -152,8 +152,8 @@ parse_args(
 			return 0;
 		}
 		break;
-	case XFROG_SCRUB_TYPE_AGHEADER:
-	case XFROG_SCRUB_TYPE_PERAG:
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
 		if (optind != argc - 1) {
 			fprintf(stderr,
 				_("Must specify one AG number.\n"));
@@ -166,8 +166,8 @@ parse_args(
 			return 0;
 		}
 		break;
-	case XFROG_SCRUB_TYPE_FS:
-	case XFROG_SCRUB_TYPE_NONE:
+	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_NONE:
 		if (optind != argc) {
 			fprintf(stderr,
 				_("No parameters allowed.\n"));
@@ -248,17 +248,17 @@ repair_ioctl(
 	sc = &xfrog_scrubbers[type];
 	memset(&meta, 0, sizeof(meta));
 	meta.sm_type = type;
-	switch (sc->type) {
-	case XFROG_SCRUB_TYPE_AGHEADER:
-	case XFROG_SCRUB_TYPE_PERAG:
+	switch (sc->group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
 		meta.sm_agno = control;
 		break;
-	case XFROG_SCRUB_TYPE_INODE:
+	case XFROG_SCRUB_GROUP_INODE:
 		meta.sm_ino = control;
 		meta.sm_gen = control2;
 		break;
-	case XFROG_SCRUB_TYPE_NONE:
-	case XFROG_SCRUB_TYPE_FS:
+	case XFROG_SCRUB_GROUP_NONE:
+	case XFROG_SCRUB_GROUP_FS:
 		/* no control parameters */
 		break;
 	}
diff --git a/libfrog/scrub.c b/libfrog/scrub.c
index d900bf2af63..90fc2b1a40c 100644
--- a/libfrog/scrub.c
+++ b/libfrog/scrub.c
@@ -12,127 +12,127 @@ const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR] = {
 	[XFS_SCRUB_TYPE_PROBE] = {
 		.name	= "probe",
 		.descr	= "metadata",
-		.type	= XFROG_SCRUB_TYPE_NONE,
+		.group	= XFROG_SCRUB_GROUP_NONE,
 	},
 	[XFS_SCRUB_TYPE_SB] = {
 		.name	= "sb",
 		.descr	= "superblock",
-		.type	= XFROG_SCRUB_TYPE_AGHEADER,
+		.group	= XFROG_SCRUB_GROUP_AGHEADER,
 	},
 	[XFS_SCRUB_TYPE_AGF] = {
 		.name	= "agf",
 		.descr	= "free space header",
-		.type	= XFROG_SCRUB_TYPE_AGHEADER,
+		.group	= XFROG_SCRUB_GROUP_AGHEADER,
 	},
 	[XFS_SCRUB_TYPE_AGFL] = {
 		.name	= "agfl",
 		.descr	= "free list",
-		.type	= XFROG_SCRUB_TYPE_AGHEADER,
+		.group	= XFROG_SCRUB_GROUP_AGHEADER,
 	},
 	[XFS_SCRUB_TYPE_AGI] = {
 		.name	= "agi",
 		.descr	= "inode header",
-		.type	= XFROG_SCRUB_TYPE_AGHEADER,
+		.group	= XFROG_SCRUB_GROUP_AGHEADER,
 	},
 	[XFS_SCRUB_TYPE_BNOBT] = {
 		.name	= "bnobt",
 		.descr	= "freesp by block btree",
-		.type	= XFROG_SCRUB_TYPE_PERAG,
+		.group	= XFROG_SCRUB_GROUP_PERAG,
 	},
 	[XFS_SCRUB_TYPE_CNTBT] = {
 		.name	= "cntbt",
 		.descr	= "freesp by length btree",
-		.type	= XFROG_SCRUB_TYPE_PERAG,
+		.group	= XFROG_SCRUB_GROUP_PERAG,
 	},
 	[XFS_SCRUB_TYPE_INOBT] = {
 		.name	= "inobt",
 		.descr	= "inode btree",
-		.type	= XFROG_SCRUB_TYPE_PERAG,
+		.group	= XFROG_SCRUB_GROUP_PERAG,
 	},
 	[XFS_SCRUB_TYPE_FINOBT] = {
 		.name	= "finobt",
 		.descr	= "free inode btree",
-		.type	= XFROG_SCRUB_TYPE_PERAG,
+		.group	= XFROG_SCRUB_GROUP_PERAG,
 	},
 	[XFS_SCRUB_TYPE_RMAPBT] = {
 		.name	= "rmapbt",
 		.descr	= "reverse mapping btree",
-		.type	= XFROG_SCRUB_TYPE_PERAG,
+		.group	= XFROG_SCRUB_GROUP_PERAG,
 	},
 	[XFS_SCRUB_TYPE_REFCNTBT] = {
 		.name	= "refcountbt",
 		.descr	= "reference count btree",
-		.type	= XFROG_SCRUB_TYPE_PERAG,
+		.group	= XFROG_SCRUB_GROUP_PERAG,
 	},
 	[XFS_SCRUB_TYPE_INODE] = {
 		.name	= "inode",
 		.descr	= "inode record",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_BMBTD] = {
 		.name	= "bmapbtd",
 		.descr	= "data block map",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_BMBTA] = {
 		.name	= "bmapbta",
 		.descr	= "attr block map",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_BMBTC] = {
 		.name	= "bmapbtc",
 		.descr	= "CoW block map",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_DIR] = {
 		.name	= "directory",
 		.descr	= "directory entries",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_XATTR] = {
 		.name	= "xattr",
 		.descr	= "extended attributes",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_SYMLINK] = {
 		.name	= "symlink",
 		.descr	= "symbolic link",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_PARENT] = {
 		.name	= "parent",
 		.descr	= "parent pointer",
-		.type	= XFROG_SCRUB_TYPE_INODE,
+		.group	= XFROG_SCRUB_GROUP_INODE,
 	},
 	[XFS_SCRUB_TYPE_RTBITMAP] = {
 		.name	= "rtbitmap",
 		.descr	= "realtime bitmap",
-		.type	= XFROG_SCRUB_TYPE_FS,
+		.group	= XFROG_SCRUB_GROUP_FS,
 	},
 	[XFS_SCRUB_TYPE_RTSUM] = {
 		.name	= "rtsummary",
 		.descr	= "realtime summary",
-		.type	= XFROG_SCRUB_TYPE_FS,
+		.group	= XFROG_SCRUB_GROUP_FS,
 	},
 	[XFS_SCRUB_TYPE_UQUOTA] = {
 		.name	= "usrquota",
 		.descr	= "user quotas",
-		.type	= XFROG_SCRUB_TYPE_FS,
+		.group	= XFROG_SCRUB_GROUP_FS,
 	},
 	[XFS_SCRUB_TYPE_GQUOTA] = {
 		.name	= "grpquota",
 		.descr	= "group quotas",
-		.type	= XFROG_SCRUB_TYPE_FS,
+		.group	= XFROG_SCRUB_GROUP_FS,
 	},
 	[XFS_SCRUB_TYPE_PQUOTA] = {
 		.name	= "prjquota",
 		.descr	= "project quotas",
-		.type	= XFROG_SCRUB_TYPE_FS,
+		.group	= XFROG_SCRUB_GROUP_FS,
 	},
 	[XFS_SCRUB_TYPE_FSCOUNTERS] = {
 		.name	= "fscounters",
 		.descr	= "filesystem summary counters",
-		.type	= XFROG_SCRUB_TYPE_FS,
+		.group	= XFROG_SCRUB_GROUP_FS,
 		.flags	= XFROG_SCRUB_DESCR_SUMMARY,
 	},
 };
diff --git a/libfrog/scrub.h b/libfrog/scrub.h
index e43d8c244e4..43a882321f9 100644
--- a/libfrog/scrub.h
+++ b/libfrog/scrub.h
@@ -6,20 +6,20 @@
 #ifndef __LIBFROG_SCRUB_H__
 #define __LIBFROG_SCRUB_H__
 
-/* Type info and names for the scrub types. */
-enum xfrog_scrub_type {
-	XFROG_SCRUB_TYPE_NONE,		/* not metadata */
-	XFROG_SCRUB_TYPE_AGHEADER,	/* per-AG header */
-	XFROG_SCRUB_TYPE_PERAG,		/* per-AG metadata */
-	XFROG_SCRUB_TYPE_FS,		/* per-FS metadata */
-	XFROG_SCRUB_TYPE_INODE,		/* per-inode metadata */
+/* Group the scrub types by principal filesystem object. */
+enum xfrog_scrub_group {
+	XFROG_SCRUB_GROUP_NONE,		/* not metadata */
+	XFROG_SCRUB_GROUP_AGHEADER,	/* per-AG header */
+	XFROG_SCRUB_GROUP_PERAG,	/* per-AG metadata */
+	XFROG_SCRUB_GROUP_FS,		/* per-FS metadata */
+	XFROG_SCRUB_GROUP_INODE,	/* per-inode metadata */
 };
 
 /* Catalog of scrub types and names, indexed by XFS_SCRUB_TYPE_* */
 struct xfrog_scrub_descr {
 	const char		*name;
 	const char		*descr;
-	enum xfrog_scrub_type	type;
+	enum xfrog_scrub_group	group;
 	unsigned int		flags;
 };
 
diff --git a/scrub/scrub.c b/scrub/scrub.c
index d899f75a92f..61a111db080 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -34,21 +34,21 @@ format_scrub_descr(
 	struct xfs_scrub_metadata	*meta = where;
 	const struct xfrog_scrub_descr	*sc = &xfrog_scrubbers[meta->sm_type];
 
-	switch (sc->type) {
-	case XFROG_SCRUB_TYPE_AGHEADER:
-	case XFROG_SCRUB_TYPE_PERAG:
+	switch (sc->group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
 		return snprintf(buf, buflen, _("AG %u %s"), meta->sm_agno,
 				_(sc->descr));
 		break;
-	case XFROG_SCRUB_TYPE_INODE:
+	case XFROG_SCRUB_GROUP_INODE:
 		return scrub_render_ino_descr(ctx, buf, buflen,
 				meta->sm_ino, meta->sm_gen, "%s",
 				_(sc->descr));
 		break;
-	case XFROG_SCRUB_TYPE_FS:
+	case XFROG_SCRUB_GROUP_FS:
 		return snprintf(buf, buflen, _("%s"), _(sc->descr));
 		break;
-	case XFROG_SCRUB_TYPE_NONE:
+	case XFROG_SCRUB_GROUP_NONE:
 		assert(0);
 		break;
 	}
@@ -276,12 +276,12 @@ scrub_save_repair(
 	memset(aitem, 0, sizeof(*aitem));
 	aitem->type = meta->sm_type;
 	aitem->flags = meta->sm_flags;
-	switch (xfrog_scrubbers[meta->sm_type].type) {
-	case XFROG_SCRUB_TYPE_AGHEADER:
-	case XFROG_SCRUB_TYPE_PERAG:
+	switch (xfrog_scrubbers[meta->sm_type].group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
 		aitem->agno = meta->sm_agno;
 		break;
-	case XFROG_SCRUB_TYPE_INODE:
+	case XFROG_SCRUB_GROUP_INODE:
 		aitem->ino = meta->sm_ino;
 		aitem->gen = meta->sm_gen;
 		break;
@@ -336,14 +336,14 @@ scrub_meta_type(
 }
 
 /*
- * Scrub all metadata types that are assigned to the given XFROG_SCRUB_TYPE_*,
+ * Scrub all metadata types that are assigned to the given XFROG_SCRUB_GROUP_*,
  * saving corruption reports for later.  This should not be used for
- * XFROG_SCRUB_TYPE_INODE or for checking summary metadata.
+ * XFROG_SCRUB_GROUP_INODE or for checking summary metadata.
  */
 static bool
-scrub_all_types(
+scrub_group(
 	struct scrub_ctx		*ctx,
-	enum xfrog_scrub_type		scrub_type,
+	enum xfrog_scrub_group		group,
 	xfs_agnumber_t			agno,
 	struct action_list		*alist)
 {
@@ -354,7 +354,7 @@ scrub_all_types(
 	for (type = 0; type < XFS_SCRUB_TYPE_NR; type++, sc++) {
 		int			ret;
 
-		if (sc->type != scrub_type)
+		if (sc->group != group)
 			continue;
 		if (sc->flags & XFROG_SCRUB_DESCR_SUMMARY)
 			continue;
@@ -388,7 +388,7 @@ scrub_ag_headers(
 	xfs_agnumber_t			agno,
 	struct action_list		*alist)
 {
-	return scrub_all_types(ctx, XFROG_SCRUB_TYPE_AGHEADER, agno, alist);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_AGHEADER, agno, alist);
 }
 
 /* Scrub each AG's metadata btrees. */
@@ -398,7 +398,7 @@ scrub_ag_metadata(
 	xfs_agnumber_t			agno,
 	struct action_list		*alist)
 {
-	return scrub_all_types(ctx, XFROG_SCRUB_TYPE_PERAG, agno, alist);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, agno, alist);
 }
 
 /* Scrub whole-FS metadata btrees. */
@@ -407,7 +407,7 @@ scrub_fs_metadata(
 	struct scrub_ctx		*ctx,
 	struct action_list		*alist)
 {
-	return scrub_all_types(ctx, XFROG_SCRUB_TYPE_FS, 0, alist);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_FS, 0, alist);
 }
 
 /* Scrub FS summary metadata. */
@@ -430,12 +430,12 @@ scrub_estimate_ag_work(
 
 	sc = xfrog_scrubbers;
 	for (type = 0; type < XFS_SCRUB_TYPE_NR; type++, sc++) {
-		switch (sc->type) {
-		case XFROG_SCRUB_TYPE_AGHEADER:
-		case XFROG_SCRUB_TYPE_PERAG:
+		switch (sc->group) {
+		case XFROG_SCRUB_GROUP_AGHEADER:
+		case XFROG_SCRUB_GROUP_PERAG:
 			estimate += ctx->mnt.fsgeom.agcount;
 			break;
-		case XFROG_SCRUB_TYPE_FS:
+		case XFROG_SCRUB_GROUP_FS:
 			estimate++;
 			break;
 		default:
@@ -463,7 +463,7 @@ scrub_file(
 	enum check_outcome		fix;
 
 	assert(type < XFS_SCRUB_TYPE_NR);
-	assert(xfrog_scrubbers[type].type == XFROG_SCRUB_TYPE_INODE);
+	assert(xfrog_scrubbers[type].group == XFROG_SCRUB_GROUP_INODE);
 
 	meta.sm_type = type;
 	meta.sm_ino = bstat->bs_ino;
@@ -625,12 +625,12 @@ xfs_repair_metadata(
 	meta.sm_flags = aitem->flags | XFS_SCRUB_IFLAG_REPAIR;
 	if (use_force_rebuild)
 		meta.sm_flags |= XFS_SCRUB_IFLAG_FORCE_REBUILD;
-	switch (xfrog_scrubbers[aitem->type].type) {
-	case XFROG_SCRUB_TYPE_AGHEADER:
-	case XFROG_SCRUB_TYPE_PERAG:
+	switch (xfrog_scrubbers[aitem->type].group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
 		meta.sm_agno = aitem->agno;
 		break;
-	case XFROG_SCRUB_TYPE_INODE:
+	case XFROG_SCRUB_GROUP_INODE:
 		meta.sm_ino = aitem->ino;
 		meta.sm_gen = aitem->gen;
 		break;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] libfrog: promote XFROG_SCRUB_DESCR_SUMMARY to a scrub type
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/4] libfrog: rename XFROG_SCRUB_TYPE_* to XFROG_SCRUB_GROUP_* Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

"Summary" metadata, at least in the scrub context, are metadata whose
values depend on some kind of computation and therefore can only be
checked after we've looked at all the other metadata.  Currently, the
superblock summary counters are the only metadata like this, but
since they run in a totally separate xfs_scrub phase (7 vs. 2), make
them their own group and remove the group+flag mix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/scrub.c      |    3 +++
 libfrog/scrub.c |    3 +--
 libfrog/scrub.h |    8 +-------
 scrub/phase4.c  |    2 +-
 scrub/phase7.c  |    4 ++--
 scrub/scrub.c   |   16 ++++++++++++----
 scrub/scrub.h   |    3 ++-
 7 files changed, 22 insertions(+), 17 deletions(-)


diff --git a/io/scrub.c b/io/scrub.c
index 0cad69253dc..4cfa4bc3c4c 100644
--- a/io/scrub.c
+++ b/io/scrub.c
@@ -67,6 +67,7 @@ scrub_ioctl(
 		break;
 	case XFROG_SCRUB_GROUP_NONE:
 	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_SUMMARY:
 		/* no control parameters */
 		break;
 	}
@@ -168,6 +169,7 @@ parse_args(
 		break;
 	case XFROG_SCRUB_GROUP_FS:
 	case XFROG_SCRUB_GROUP_NONE:
+	case XFROG_SCRUB_GROUP_SUMMARY:
 		if (optind != argc) {
 			fprintf(stderr,
 				_("No parameters allowed.\n"));
@@ -259,6 +261,7 @@ repair_ioctl(
 		break;
 	case XFROG_SCRUB_GROUP_NONE:
 	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_SUMMARY:
 		/* no control parameters */
 		break;
 	}
diff --git a/libfrog/scrub.c b/libfrog/scrub.c
index 90fc2b1a40c..5a5f522a425 100644
--- a/libfrog/scrub.c
+++ b/libfrog/scrub.c
@@ -132,8 +132,7 @@ const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR] = {
 	[XFS_SCRUB_TYPE_FSCOUNTERS] = {
 		.name	= "fscounters",
 		.descr	= "filesystem summary counters",
-		.group	= XFROG_SCRUB_GROUP_FS,
-		.flags	= XFROG_SCRUB_DESCR_SUMMARY,
+		.group	= XFROG_SCRUB_GROUP_SUMMARY,
 	},
 };
 
diff --git a/libfrog/scrub.h b/libfrog/scrub.h
index 43a882321f9..68f1a968103 100644
--- a/libfrog/scrub.h
+++ b/libfrog/scrub.h
@@ -13,6 +13,7 @@ enum xfrog_scrub_group {
 	XFROG_SCRUB_GROUP_PERAG,	/* per-AG metadata */
 	XFROG_SCRUB_GROUP_FS,		/* per-FS metadata */
 	XFROG_SCRUB_GROUP_INODE,	/* per-inode metadata */
+	XFROG_SCRUB_GROUP_SUMMARY,	/* summary metadata */
 };
 
 /* Catalog of scrub types and names, indexed by XFS_SCRUB_TYPE_* */
@@ -20,15 +21,8 @@ struct xfrog_scrub_descr {
 	const char		*name;
 	const char		*descr;
 	enum xfrog_scrub_group	group;
-	unsigned int		flags;
 };
 
-/*
- * The type of metadata checked by this scrubber is a summary of other types
- * of metadata.  This scrubber should be run after all the others.
- */
-#define XFROG_SCRUB_DESCR_SUMMARY	(1 << 0)
-
 extern const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR];
 
 int xfrog_scrub_metadata(struct xfs_fd *xfd, struct xfs_scrub_metadata *meta);
diff --git a/scrub/phase4.c b/scrub/phase4.c
index ecd56056ca2..3c985cdc13a 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -139,7 +139,7 @@ phase4_func(
 	 * counters, so counter repairs have to be put on the list now so that
 	 * they get fixed before we stop retrying unfixed metadata repairs.
 	 */
-	ret = scrub_fs_summary(ctx, &ctx->action_lists[0]);
+	ret = scrub_fs_counters(ctx, &ctx->action_lists[0]);
 	if (ret)
 		return ret;
 
diff --git a/scrub/phase7.c b/scrub/phase7.c
index 8d8034c36af..2b83e0a471d 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -116,9 +116,9 @@ phase7_func(
 	int			ip;
 	int			error;
 
-	/* Check and fix the fs summary counters. */
+	/* Check and fix the summary metadata. */
 	action_list_init(&alist);
-	error = scrub_fs_summary(ctx, &alist);
+	error = scrub_summary_metadata(ctx, &alist);
 	if (error)
 		return error;
 	error = action_list_process(ctx, -1, &alist,
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 61a111db080..b7e120e91d6 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -46,6 +46,7 @@ format_scrub_descr(
 				_(sc->descr));
 		break;
 	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_SUMMARY:
 		return snprintf(buf, buflen, _("%s"), _(sc->descr));
 		break;
 	case XFROG_SCRUB_GROUP_NONE:
@@ -356,8 +357,6 @@ scrub_group(
 
 		if (sc->group != group)
 			continue;
-		if (sc->flags & XFROG_SCRUB_DESCR_SUMMARY)
-			continue;
 
 		ret = scrub_meta_type(ctx, type, agno, alist);
 		if (ret)
@@ -410,9 +409,18 @@ scrub_fs_metadata(
 	return scrub_group(ctx, XFROG_SCRUB_GROUP_FS, 0, alist);
 }
 
-/* Scrub FS summary metadata. */
+/* Scrub all FS summary metadata. */
 int
-scrub_fs_summary(
+scrub_summary_metadata(
+	struct scrub_ctx		*ctx,
+	struct action_list		*alist)
+{
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, 0, alist);
+}
+
+/* Scrub /only/ the superblock summary counters. */
+int
+scrub_fs_counters(
 	struct scrub_ctx		*ctx,
 	struct action_list		*alist)
 {
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 023069ee066..56836cf2ba3 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -23,7 +23,8 @@ int scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
 int scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
 		struct action_list *alist);
 int scrub_fs_metadata(struct scrub_ctx *ctx, struct action_list *alist);
-int scrub_fs_summary(struct scrub_ctx *ctx, struct action_list *alist);
+int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
+int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool can_scrub_inode(struct scrub_ctx *ctx);



* [PATCH 3/4] libfrog: rename the scrub "fs" group
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/4] libfrog: promote XFROG_SCRUB_DESCR_SUMMARY to a scrub type Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/4] libfrog: rename XFROG_SCRUB_TYPE_* to XFROG_SCRUB_GROUP_* Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The only members of XFROG_SCRUB_GROUP_FS are metadata files.  Although
each of these files has full-filesystem scope, let's rename the group
so that it's more obvious that the group scans metadata files, not the
entire fs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/scrub.c      |    6 +++---
 libfrog/scrub.c |   10 +++++-----
 libfrog/scrub.h |    2 +-
 scrub/scrub.c   |    6 +++---
 4 files changed, 12 insertions(+), 12 deletions(-)


diff --git a/io/scrub.c b/io/scrub.c
index 4cfa4bc3c4c..3971a9fedb5 100644
--- a/io/scrub.c
+++ b/io/scrub.c
@@ -66,7 +66,7 @@ scrub_ioctl(
 		meta.sm_gen = control2;
 		break;
 	case XFROG_SCRUB_GROUP_NONE:
-	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
 		/* no control parameters */
 		break;
@@ -167,7 +167,7 @@ parse_args(
 			return 0;
 		}
 		break;
-	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_NONE:
 	case XFROG_SCRUB_GROUP_SUMMARY:
 		if (optind != argc) {
@@ -260,7 +260,7 @@ repair_ioctl(
 		meta.sm_gen = control2;
 		break;
 	case XFROG_SCRUB_GROUP_NONE:
-	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
 		/* no control parameters */
 		break;
diff --git a/libfrog/scrub.c b/libfrog/scrub.c
index 5a5f522a425..2e4d96caaa1 100644
--- a/libfrog/scrub.c
+++ b/libfrog/scrub.c
@@ -107,27 +107,27 @@ const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR] = {
 	[XFS_SCRUB_TYPE_RTBITMAP] = {
 		.name	= "rtbitmap",
 		.descr	= "realtime bitmap",
-		.group	= XFROG_SCRUB_GROUP_FS,
+		.group	= XFROG_SCRUB_GROUP_METAFILES,
 	},
 	[XFS_SCRUB_TYPE_RTSUM] = {
 		.name	= "rtsummary",
 		.descr	= "realtime summary",
-		.group	= XFROG_SCRUB_GROUP_FS,
+		.group	= XFROG_SCRUB_GROUP_METAFILES,
 	},
 	[XFS_SCRUB_TYPE_UQUOTA] = {
 		.name	= "usrquota",
 		.descr	= "user quotas",
-		.group	= XFROG_SCRUB_GROUP_FS,
+		.group	= XFROG_SCRUB_GROUP_METAFILES,
 	},
 	[XFS_SCRUB_TYPE_GQUOTA] = {
 		.name	= "grpquota",
 		.descr	= "group quotas",
-		.group	= XFROG_SCRUB_GROUP_FS,
+		.group	= XFROG_SCRUB_GROUP_METAFILES,
 	},
 	[XFS_SCRUB_TYPE_PQUOTA] = {
 		.name	= "prjquota",
 		.descr	= "project quotas",
-		.group	= XFROG_SCRUB_GROUP_FS,
+		.group	= XFROG_SCRUB_GROUP_METAFILES,
 	},
 	[XFS_SCRUB_TYPE_FSCOUNTERS] = {
 		.name	= "fscounters",
diff --git a/libfrog/scrub.h b/libfrog/scrub.h
index 68f1a968103..14c4857bede 100644
--- a/libfrog/scrub.h
+++ b/libfrog/scrub.h
@@ -11,7 +11,7 @@ enum xfrog_scrub_group {
 	XFROG_SCRUB_GROUP_NONE,		/* not metadata */
 	XFROG_SCRUB_GROUP_AGHEADER,	/* per-AG header */
 	XFROG_SCRUB_GROUP_PERAG,	/* per-AG metadata */
-	XFROG_SCRUB_GROUP_FS,		/* per-FS metadata */
+	XFROG_SCRUB_GROUP_METAFILES,	/* whole-fs metadata files */
 	XFROG_SCRUB_GROUP_INODE,	/* per-inode metadata */
 	XFROG_SCRUB_GROUP_SUMMARY,	/* summary metadata */
 };
diff --git a/scrub/scrub.c b/scrub/scrub.c
index b7e120e91d6..1fcd5b8e85d 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -45,7 +45,7 @@ format_scrub_descr(
 				meta->sm_ino, meta->sm_gen, "%s",
 				_(sc->descr));
 		break;
-	case XFROG_SCRUB_GROUP_FS:
+	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
 		return snprintf(buf, buflen, _("%s"), _(sc->descr));
 		break;
@@ -406,7 +406,7 @@ scrub_fs_metadata(
 	struct scrub_ctx		*ctx,
 	struct action_list		*alist)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_FS, 0, alist);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_METAFILES, 0, alist);
 }
 
 /* Scrub all FS summary metadata. */
@@ -443,7 +443,7 @@ scrub_estimate_ag_work(
 		case XFROG_SCRUB_GROUP_PERAG:
 			estimate += ctx->mnt.fsgeom.agcount;
 			break;
-		case XFROG_SCRUB_GROUP_FS:
+		case XFROG_SCRUB_GROUP_METAFILES:
 			estimate++;
 			break;
 		default:



* [PATCH 4/4] xfs_scrub: scan metadata files in parallel
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 3/4] libfrog: rename the scrub "fs" group Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The realtime bitmap and the three quota files are completely independent
of each other, which means that we ought to be able to scan them in
parallel.  Rework the phase2 code so that we can do this.  Note,
however, that the realtime summary file summarizes the contents of the
realtime bitmap, so we must coordinate the workqueue threads.
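
(Not part of the patch -- just a minimal, self-contained sketch of the
ordering constraint described above, with invented names.  One worker
models the rtbitmap scan and signals completion; the other models the
rtsummary scan and waits for that signal before doing its work.  Build
with "cc -pthread".)

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Completion flag for the "bitmap" scan, protected by a mutex/condvar. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv = PTHREAD_COND_INITIALIZER;
static bool bitmap_done;

/* Stand-in for the rtbitmap scan: do the work, then wake any waiters. */
static void *scan_bitmap(void *arg)
{
	(void)arg;
	printf("scanning bitmap\n");
	pthread_mutex_lock(&lock);
	bitmap_done = true;
	pthread_cond_broadcast(&done_cv);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Stand-in for the rtsummary scan: wait for the bitmap scan to finish. */
static void *scan_summary(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!bitmap_done)
		pthread_cond_wait(&done_cv, &lock);
	pthread_mutex_unlock(&lock);
	printf("scanning summary\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&b, NULL, scan_summary, NULL);
	pthread_create(&a, NULL, scan_bitmap, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

The phase2 changes below follow the same pattern, plus an "aborted"
flag shared by all of the workqueue workers.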

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase2.c |  146 +++++++++++++++++++++++++++++++++++++++++++-------------
 scrub/scrub.c  |    9 ++-
 scrub/scrub.h  |    3 +
 3 files changed, 121 insertions(+), 37 deletions(-)


diff --git a/scrub/phase2.c b/scrub/phase2.c
index 8f82e2a6c04..75c302af075 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -10,6 +10,8 @@
 #include "list.h"
 #include "libfrog/paths.h"
 #include "libfrog/workqueue.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/scrub.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "scrub.h"
@@ -17,6 +19,18 @@
 
 /* Phase 2: Check internal metadata. */
 
+struct scan_ctl {
+	/*
+	 * Control mechanism to signal that the rt bitmap file scan is done and
+	 * wake up any waiters.
+	 */
+	pthread_cond_t		rbm_wait;
+	pthread_mutex_t		rbm_waitlock;
+	bool			rbm_done;
+
+	bool			aborted;
+};
+
 /* Scrub each AG's metadata btrees. */
 static void
 scan_ag_metadata(
@@ -25,7 +39,7 @@ scan_ag_metadata(
 	void				*arg)
 {
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->wq_ctx;
-	bool				*aborted = arg;
+	struct scan_ctl			*sctl = arg;
 	struct action_list		alist;
 	struct action_list		immediate_alist;
 	unsigned long long		broken_primaries;
@@ -33,7 +47,7 @@ scan_ag_metadata(
 	char				descr[DESCR_BUFSZ];
 	int				ret;
 
-	if (*aborted)
+	if (sctl->aborted)
 		return;
 
 	action_list_init(&alist);
@@ -89,32 +103,40 @@ _("Filesystem might not be repairable."));
 	action_list_defer(ctx, agno, &alist);
 	return;
 err:
-	*aborted = true;
+	sctl->aborted = true;
 }
 
-/* Scrub whole-FS metadata btrees. */
+/* Scan one metadata file. */
 static void
-scan_fs_metadata(
-	struct workqueue		*wq,
-	xfs_agnumber_t			agno,
-	void				*arg)
+scan_metafile(
+	struct workqueue	*wq,
+	xfs_agnumber_t		type,
+	void			*arg)
 {
-	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->wq_ctx;
-	bool				*aborted = arg;
-	struct action_list		alist;
-	int				ret;
+	struct action_list	alist;
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
+	struct scan_ctl		*sctl = arg;
+	int			ret;
 
-	if (*aborted)
-		return;
+	if (sctl->aborted)
+		goto out;
 
 	action_list_init(&alist);
-	ret = scrub_fs_metadata(ctx, &alist);
+	ret = scrub_metadata_file(ctx, type, &alist);
 	if (ret) {
-		*aborted = true;
-		return;
+		sctl->aborted = true;
+		goto out;
 	}
 
-	action_list_defer(ctx, agno, &alist);
+	action_list_defer(ctx, 0, &alist);
+
+out:
+	if (type == XFS_SCRUB_TYPE_RTBITMAP) {
+		pthread_mutex_lock(&sctl->rbm_waitlock);
+		sctl->rbm_done = true;
+		pthread_cond_broadcast(&sctl->rbm_wait);
+		pthread_mutex_unlock(&sctl->rbm_waitlock);
+	}
 }
 
 /* Scan all filesystem metadata. */
@@ -122,17 +144,25 @@ int
 phase2_func(
 	struct scrub_ctx	*ctx)
 {
-	struct action_list	alist;
 	struct workqueue	wq;
+	struct scan_ctl		sctl = {
+		.aborted	= false,
+		.rbm_done	= false,
+	};
+	struct action_list	alist;
+	const struct xfrog_scrub_descr *sc = xfrog_scrubbers;
 	xfs_agnumber_t		agno;
-	bool			aborted = false;
+	unsigned int		type;
 	int			ret, ret2;
 
+	pthread_mutex_init(&sctl.rbm_waitlock, NULL);
+	pthread_cond_init(&sctl.rbm_wait, NULL);
+
 	ret = -workqueue_create(&wq, (struct xfs_mount *)ctx,
 			scrub_nproc_workqueue(ctx));
 	if (ret) {
 		str_liberror(ctx, ret, _("creating scrub workqueue"));
-		return ret;
+		goto out_wait;
 	}
 
 	/*
@@ -143,29 +173,76 @@ phase2_func(
 	action_list_init(&alist);
 	ret = scrub_primary_super(ctx, &alist);
 	if (ret)
-		goto out;
+		goto out_wq;
 	ret = action_list_process_or_defer(ctx, 0, &alist);
 	if (ret)
-		goto out;
+		goto out_wq;
 
-	for (agno = 0; !aborted && agno < ctx->mnt.fsgeom.agcount; agno++) {
-		ret = -workqueue_add(&wq, scan_ag_metadata, agno, &aborted);
+	/* Scan each AG in parallel. */
+	for (agno = 0;
+	     agno < ctx->mnt.fsgeom.agcount && !sctl.aborted;
+	     agno++) {
+		ret = -workqueue_add(&wq, scan_ag_metadata, agno, &sctl);
 		if (ret) {
 			str_liberror(ctx, ret, _("queueing per-AG scrub work"));
-			goto out;
+			goto out_wq;
 		}
 	}
 
-	if (aborted)
-		goto out;
+	if (sctl.aborted)
+		goto out_wq;
 
-	ret = -workqueue_add(&wq, scan_fs_metadata, 0, &aborted);
+	/*
+	 * Scan all the metadata files in parallel except for the realtime
+	 * summary file, which must run after the realtime bitmap has been
+	 * scanned.
+	 */
+	for (type = 0; type < XFS_SCRUB_TYPE_NR; type++, sc++) {
+		if (sc->group != XFROG_SCRUB_GROUP_METAFILES)
+			continue;
+		if (type == XFS_SCRUB_TYPE_RTSUM)
+			continue;
+
+		ret = -workqueue_add(&wq, scan_metafile, type, &sctl);
+		if (ret) {
+			str_liberror(ctx, ret,
+	_("queueing metadata file scrub work"));
+			goto out_wq;
+		}
+	}
+
+	if (sctl.aborted)
+		goto out_wq;
+
+	/*
+	 * Wait for the rt bitmap to finish scanning, then scan the rt summary
+	 * since the summary can be regenerated completely from the bitmap.
+	 */
+	ret = pthread_mutex_lock(&sctl.rbm_waitlock);
+	if (ret) {
+		str_liberror(ctx, ret, _("waiting for rtbitmap scrubber"));
+		goto out_wq;
+	}
+	if (!sctl.rbm_done) {
+		ret = pthread_cond_wait(&sctl.rbm_wait, &sctl.rbm_waitlock);
+		if (ret) {
+			str_liberror(ctx, ret,
+	_("waiting for rtbitmap scrubber"));
+			goto out_wq;
+		}
+	}
+	pthread_mutex_unlock(&sctl.rbm_waitlock);
+
+	if (sctl.aborted)
+		goto out_wq;
+
+	ret = -workqueue_add(&wq, scan_metafile, XFS_SCRUB_TYPE_RTSUM, &sctl);
 	if (ret) {
-		str_liberror(ctx, ret, _("queueing per-FS scrub work"));
-		goto out;
+		str_liberror(ctx, ret, _("queueing rtsummary scrub work"));
+		goto out_wq;
 	}
 
-out:
+out_wq:
 	ret2 = -workqueue_terminate(&wq);
 	if (ret2) {
 		str_liberror(ctx, ret2, _("finishing scrub work"));
@@ -173,8 +250,11 @@ phase2_func(
 			ret = ret2;
 	}
 	workqueue_destroy(&wq);
+out_wait:
+	pthread_cond_destroy(&sctl.rbm_wait);
+	pthread_mutex_destroy(&sctl.rbm_waitlock);
 
-	if (!ret && aborted)
+	if (!ret && sctl.aborted)
 		ret = ECANCELED;
 	return ret;
 }
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 1fcd5b8e85d..20067df523f 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -400,13 +400,16 @@ scrub_ag_metadata(
 	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, agno, alist);
 }
 
-/* Scrub whole-FS metadata btrees. */
+/* Scrub one metadata file */
 int
-scrub_fs_metadata(
+scrub_metadata_file(
 	struct scrub_ctx		*ctx,
+	unsigned int			type,
 	struct action_list		*alist)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_METAFILES, 0, alist);
+	ASSERT(xfrog_scrubbers[type].group == XFROG_SCRUB_GROUP_METAFILES);
+
+	return scrub_meta_type(ctx, type, 0, alist);
 }
 
 /* Scrub all FS summary metadata. */
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 56836cf2ba3..a4e36808f34 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -22,7 +22,8 @@ int scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
 		struct action_list *alist);
 int scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
 		struct action_list *alist);
-int scrub_fs_metadata(struct scrub_ctx *ctx, struct action_list *alist);
+int scrub_metadata_file(struct scrub_ctx *ctx, unsigned int scrub_type,
+		struct action_list *alist);
 int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
 



* [PATCHSET v24.0 0/3] libxfs: online repair of quota counters
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (37 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs: implement live quotacheck inode scan Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/2] xfs_repair: rebuild inode fork mappings Darrick J. Wong
                   ` (38 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This series uses the inode scanner and live update hook functionality
introduced in the last patchset to implement quotacheck on a live
filesystem.  The quotacheck scrubber builds an incore copy of the
dquot resource usage counters and compares it to the live dquots to
report discrepancies.

If the user chooses to repair the quota counters, the repair function
visits each incore dquot to update the counts from the live information.
The live update hooks are key to keeping the incore copy up to date.
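
(A rough, self-contained model of that check/repair flow -- every type
and name below is invented for illustration and none of it is the
kernel's dquot code.  The scan accumulates per-id usage into shadow
counters, the check compares each dquot against its shadow value, and
the repair copies the shadow value back in.  The real series also feeds
live quota updates into the shadow counters via hooks so the scan stays
coherent; that part is omitted here.)

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NR_IDS	4

/* Invented stand-in for a dquot's block usage counter. */
struct fake_dquot {
	uint32_t	id;
	uint64_t	bcount;
};

/* "Live" dquots, one of which carries a stale count. */
static struct fake_dquot live[NR_IDS] = {
	{ .id = 0, .bcount = 100 },
	{ .id = 1, .bcount = 42 },
	{ .id = 2, .bcount = 0 },
	{ .id = 3, .bcount = 7 },	/* stale: the scan will see 9 */
};

/* Observations from walking every file: (owner id, blocks) pairs. */
static const struct { uint32_t id; uint64_t blocks; } files[] = {
	{ 0, 60 }, { 0, 40 }, { 1, 42 }, { 3, 9 },
};

int main(void)
{
	uint64_t shadow[NR_IDS] = { 0 };	/* incore copy built by the scan */
	bool repair = true;
	size_t i;

	/* Scan: accumulate observed usage into the shadow counters. */
	for (i = 0; i < sizeof(files) / sizeof(files[0]); i++)
		shadow[files[i].id] += files[i].blocks;

	/* Check: compare each dquot against its shadow counter. */
	for (i = 0; i < NR_IDS; i++) {
		if (live[i].bcount == shadow[live[i].id])
			continue;
		printf("id %u: dquot says %llu, scan says %llu\n",
				(unsigned)live[i].id,
				(unsigned long long)live[i].bcount,
				(unsigned long long)shadow[live[i].id]);
		/* Repair: push the observed value into the dquot. */
		if (repair)
			live[i].bcount = shadow[live[i].id];
	}
	return 0;
}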

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-quotacheck

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-quotacheck
---
 io/scrub.c                      |    3 +++
 libfrog/scrub.c                 |    5 +++++
 libfrog/scrub.h                 |    1 +
 libxfs/xfs_fs.h                 |    4 +++-
 libxfs/xfs_health.h             |    4 +++-
 man/man2/ioctl_xfs_fsgeometry.2 |    3 +++
 scrub/phase4.c                  |   17 ++++++++++++++++
 scrub/phase5.c                  |   22 +++++++++++++++++++-
 scrub/repair.c                  |    3 +++
 scrub/scrub.c                   |   42 +++++++++++++++++++++++++++++++++++++++
 scrub/scrub.h                   |    2 ++
 scrub/xfs_scrub.h               |    1 +
 spaceman/health.c               |    4 ++++
 13 files changed, 107 insertions(+), 4 deletions(-)



* [PATCH 1/3] xfs: report the health of quota counts
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of quota counters Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs: implement live quotacheck inode scan Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/3] libfrog: create a new scrub group for things requiring full inode scans Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Report the health of quota counts.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_fs.h                 |    1 +
 libxfs/xfs_health.h             |    4 +++-
 man/man2/ioctl_xfs_fsgeometry.2 |    3 +++
 spaceman/health.c               |    4 ++++
 4 files changed, 11 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 920fd4513fc..7e86e1db66d 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -195,6 +195,7 @@ struct xfs_fsop_geom {
 #define XFS_FSOP_GEOM_SICK_PQUOTA	(1 << 3)  /* project quota */
 #define XFS_FSOP_GEOM_SICK_RT_BITMAP	(1 << 4)  /* realtime bitmap */
 #define XFS_FSOP_GEOM_SICK_RT_SUMMARY	(1 << 5)  /* realtime summary */
+#define XFS_FSOP_GEOM_SICK_QUOTACHECK	(1 << 6)  /* quota counts */
 
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
diff --git a/libxfs/xfs_health.h b/libxfs/xfs_health.h
index 99e796256c5..1dea286bb15 100644
--- a/libxfs/xfs_health.h
+++ b/libxfs/xfs_health.h
@@ -41,6 +41,7 @@ struct xfs_fsop_geom;
 #define XFS_SICK_FS_UQUOTA	(1 << 1)  /* user quota */
 #define XFS_SICK_FS_GQUOTA	(1 << 2)  /* group quota */
 #define XFS_SICK_FS_PQUOTA	(1 << 3)  /* project quota */
+#define XFS_SICK_FS_QUOTACHECK	(1 << 4)  /* quota counts */
 
 /* Observable health issues for realtime volume metadata. */
 #define XFS_SICK_RT_BITMAP	(1 << 0)  /* realtime bitmap */
@@ -72,7 +73,8 @@ struct xfs_fsop_geom;
 #define XFS_SICK_FS_PRIMARY	(XFS_SICK_FS_COUNTERS | \
 				 XFS_SICK_FS_UQUOTA | \
 				 XFS_SICK_FS_GQUOTA | \
-				 XFS_SICK_FS_PQUOTA)
+				 XFS_SICK_FS_PQUOTA | \
+				 XFS_SICK_FS_QUOTACHECK)
 
 #define XFS_SICK_RT_PRIMARY	(XFS_SICK_RT_BITMAP | \
 				 XFS_SICK_RT_SUMMARY)
diff --git a/man/man2/ioctl_xfs_fsgeometry.2 b/man/man2/ioctl_xfs_fsgeometry.2
index 6b7c83da758..f59a6e8a6a2 100644
--- a/man/man2/ioctl_xfs_fsgeometry.2
+++ b/man/man2/ioctl_xfs_fsgeometry.2
@@ -256,6 +256,9 @@ Free space bitmap for the realtime device.
 .TP
 .B XFS_FSOP_GEOM_SICK_RT_SUMMARY
 Free space summary for the realtime device.
+.TP
+.B XFS_FSOP_GEOM_SICK_QUOTACHECK
+Quota resource usage counters.
 .RE
 
 .SH RETURN VALUE
diff --git a/spaceman/health.c b/spaceman/health.c
index d83c5ccd90d..3318f9d1a7f 100644
--- a/spaceman/health.c
+++ b/spaceman/health.c
@@ -72,6 +72,10 @@ static const struct flag_map fs_flags[] = {
 		.descr = "realtime summary",
 		.has_fn = has_realtime,
 	},
+	{
+		.mask = XFS_FSOP_GEOM_SICK_QUOTACHECK,
+		.descr = "quota counts",
+	},
 	{0},
 };
 



* [PATCH 2/3] libfrog: create a new scrub group for things requiring full inode scans
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of quota counters Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs: implement live quotacheck inode scan Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/3] xfs: report the health of quota counts Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Subsequent patches will add online fsck types (quotacheck, link counts)
that require us to walk every inode in the entire filesystem.  This
requires the AG metadata and the inodes to be in good enough shape to
complete the scan without hitting corruption errors.  As such, they
ought to run after phases 2-4 and before phase 7, which summarizes what
we've found.

Phase 5 seems like a reasonable place to do this, since it already walks
every xattr and directory entry in the filesystem to look for
suspicious-looking names.  Add a new XFROG_SCRUB_GROUP, and add it to
phase 5.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/scrub.c        |    3 +++
 libfrog/scrub.h   |    1 +
 scrub/phase5.c    |   22 ++++++++++++++++++++--
 scrub/scrub.c     |   33 +++++++++++++++++++++++++++++++++
 scrub/scrub.h     |    1 +
 scrub/xfs_scrub.h |    1 +
 6 files changed, 59 insertions(+), 2 deletions(-)


diff --git a/io/scrub.c b/io/scrub.c
index 3971a9fedb5..0ad1b0229cc 100644
--- a/io/scrub.c
+++ b/io/scrub.c
@@ -68,6 +68,7 @@ scrub_ioctl(
 	case XFROG_SCRUB_GROUP_NONE:
 	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
+	case XFROG_SCRUB_GROUP_ISCAN:
 		/* no control parameters */
 		break;
 	}
@@ -170,6 +171,7 @@ parse_args(
 	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_NONE:
 	case XFROG_SCRUB_GROUP_SUMMARY:
+	case XFROG_SCRUB_GROUP_ISCAN:
 		if (optind != argc) {
 			fprintf(stderr,
 				_("No parameters allowed.\n"));
@@ -262,6 +264,7 @@ repair_ioctl(
 	case XFROG_SCRUB_GROUP_NONE:
 	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
+	case XFROG_SCRUB_GROUP_ISCAN:
 		/* no control parameters */
 		break;
 	}
diff --git a/libfrog/scrub.h b/libfrog/scrub.h
index 14c4857bede..a59371fe141 100644
--- a/libfrog/scrub.h
+++ b/libfrog/scrub.h
@@ -13,6 +13,7 @@ enum xfrog_scrub_group {
 	XFROG_SCRUB_GROUP_PERAG,	/* per-AG metadata */
 	XFROG_SCRUB_GROUP_METAFILES,	/* whole-fs metadata files */
 	XFROG_SCRUB_GROUP_INODE,	/* per-inode metadata */
+	XFROG_SCRUB_GROUP_ISCAN,	/* metadata requiring full inode scan */
 	XFROG_SCRUB_GROUP_SUMMARY,	/* summary metadata */
 };
 
diff --git a/scrub/phase5.c b/scrub/phase5.c
index 1ef234bff68..123e3751ca1 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -16,6 +16,8 @@
 #include "list.h"
 #include "libfrog/paths.h"
 #include "libfrog/workqueue.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/scrub.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "inodes.h"
@@ -23,8 +25,9 @@
 #include "scrub.h"
 #include "descr.h"
 #include "unicrash.h"
+#include "repair.h"
 
-/* Phase 5: Check directory connectivity. */
+/* Phase 5: Full inode scans and check directory connectivity. */
 
 /*
  * Warn about problematic bytes in a directory/attribute name.  That means
@@ -385,9 +388,24 @@ int
 phase5_func(
 	struct scrub_ctx	*ctx)
 {
+	struct action_list	alist;
 	bool			aborted = false;
 	int			ret;
 
+	/*
+	 * Check and fix anything that requires a full inode scan.  We do this
+	 * after we've checked all inodes and repaired anything that could get
+	 * in the way of a scan.
+	 */
+	action_list_init(&alist);
+	ret = scrub_iscan_metadata(ctx, &alist);
+	if (ret)
+		return ret;
+	ret = action_list_process(ctx, ctx->mnt.fd, &alist,
+			ALP_COMPLAIN_IF_UNFIXED | ALP_NOPROGRESS);
+	if (ret)
+		return ret;
+
 	if (ctx->corruptions_found || ctx->unfixable_errors) {
 		str_info(ctx, ctx->mntpoint,
 _("Filesystem has errors, skipping connectivity checks."));
@@ -416,7 +434,7 @@ phase5_estimate(
 	unsigned int		*nr_threads,
 	int			*rshift)
 {
-	*items = ctx->mnt_sv.f_files - ctx->mnt_sv.f_ffree;
+	*items = scrub_estimate_iscan_work(ctx);
 	*nr_threads = scrub_nproc(ctx);
 	*rshift = 0;
 	return 0;
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 20067df523f..5c7fc4c2a3a 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -47,6 +47,7 @@ format_scrub_descr(
 		break;
 	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
+	case XFROG_SCRUB_GROUP_ISCAN:
 		return snprintf(buf, buflen, _("%s"), _(sc->descr));
 		break;
 	case XFROG_SCRUB_GROUP_NONE:
@@ -421,6 +422,15 @@ scrub_summary_metadata(
 	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, 0, alist);
 }
 
+/* Scrub all metadata requiring a full inode scan. */
+int
+scrub_iscan_metadata(
+	struct scrub_ctx		*ctx,
+	struct action_list		*alist)
+{
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_ISCAN, 0, alist);
+}
+
 /* Scrub /only/ the superblock summary counters. */
 int
 scrub_fs_counters(
@@ -456,6 +466,29 @@ scrub_estimate_ag_work(
 	return estimate;
 }
 
+/*
+ * How many kernel calls will we make to scrub everything requiring a full
+ * inode scan?
+ */
+unsigned int
+scrub_estimate_iscan_work(
+	struct scrub_ctx		*ctx)
+{
+	const struct xfrog_scrub_descr	*sc;
+	int				type;
+	unsigned int			estimate;
+
+	estimate = ctx->mnt_sv.f_files - ctx->mnt_sv.f_ffree;
+
+	sc = xfrog_scrubbers;
+	for (type = 0; type < XFS_SCRUB_TYPE_NR; type++, sc++) {
+		if (sc->group == XFROG_SCRUB_GROUP_ISCAN)
+			estimate++;
+	}
+
+	return estimate;
+}
+
 /*
  * Scrub file metadata of some sort.  If errors occur, this function will log
  * them and return nonzero.
diff --git a/scrub/scrub.h b/scrub/scrub.h
index a4e36808f34..5d365f99148 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -24,6 +24,7 @@ int scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
 		struct action_list *alist);
 int scrub_metadata_file(struct scrub_ctx *ctx, unsigned int scrub_type,
 		struct action_list *alist);
+int scrub_iscan_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
 
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 0d6b9dad2c9..768f935084f 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -99,6 +99,7 @@ int phase7_func(struct scrub_ctx *ctx);
 
 /* Progress estimator functions */
 unsigned int scrub_estimate_ag_work(struct scrub_ctx *ctx);
+unsigned int scrub_estimate_iscan_work(struct scrub_ctx *ctx);
 int phase2_estimate(struct scrub_ctx *ctx, uint64_t *items,
 		    unsigned int *nr_threads, int *rshift);
 int phase3_estimate(struct scrub_ctx *ctx, uint64_t *items,



* [PATCH 3/3] xfs: implement live quotacheck inode scan
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of quota counters Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/3] libfrog: create a new scrub group for things requiring full inode scans Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/3] xfs: report the health of quota counts Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new trio of scrub functions to check quota counters.  While the
dquots themselves are filesystem metadata and should be checked early,
the dquot counter values are computed from other metadata and are
therefore summary counters.  We don't plug these into the scrub dispatch
just yet, because we still need to be able to watch quota updates while
doing our scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libfrog/scrub.c |    5 +++++
 libxfs/xfs_fs.h |    3 ++-
 scrub/phase4.c  |   17 +++++++++++++++++
 scrub/repair.c  |    3 +++
 scrub/scrub.c   |    9 +++++++++
 scrub/scrub.h   |    1 +
 6 files changed, 37 insertions(+), 1 deletion(-)


diff --git a/libfrog/scrub.c b/libfrog/scrub.c
index 2e4d96caaa1..3718d56eae3 100644
--- a/libfrog/scrub.c
+++ b/libfrog/scrub.c
@@ -134,6 +134,11 @@ const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR] = {
 		.descr	= "filesystem summary counters",
 		.group	= XFROG_SCRUB_GROUP_SUMMARY,
 	},
+	[XFS_SCRUB_TYPE_QUOTACHECK] = {
+		.name	= "quotacheck",
+		.descr	= "quota counters",
+		.group	= XFROG_SCRUB_GROUP_ISCAN,
+	},
 };
 
 /* Invoke the scrub ioctl.  Returns zero or negative error code. */
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 7e86e1db66d..6612c89944d 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -708,9 +708,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_GQUOTA	22	/* group quotas */
 #define XFS_SCRUB_TYPE_PQUOTA	23	/* project quotas */
 #define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
+#define XFS_SCRUB_TYPE_QUOTACHECK 25	/* quota counters */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	25
+#define XFS_SCRUB_TYPE_NR	26
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1u << 0)
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 3c985cdc13a..da9e8759134 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -128,6 +128,7 @@ int
 phase4_func(
 	struct scrub_ctx	*ctx)
 {
+	struct xfs_fsop_geom	fsgeom;
 	int			ret;
 
 	if (!have_action_items(ctx))
@@ -143,6 +144,22 @@ phase4_func(
 	if (ret)
 		return ret;
 
+	/*
+	 * Repair possibly bad quota counts before starting other repairs,
+	 * because wildly incorrect quota counts can cause shutdowns.
+	 * Quotacheck scans all inodes, so we only want to do it if we know
+	 * it's sick.
+	 */
+	ret = xfrog_geometry(ctx->mnt.fd, &fsgeom);
+	if (ret)
+		return ret;
+
+	if (fsgeom.sick & XFS_FSOP_GEOM_SICK_QUOTACHECK) {
+		ret = scrub_quotacheck(ctx, &ctx->action_lists[0]);
+		if (ret)
+			return ret;
+	}
+
 	ret = repair_everything(ctx);
 	if (ret)
 		return ret;
diff --git a/scrub/repair.c b/scrub/repair.c
index 67900ea4208..8a1ae0226a0 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -84,6 +84,9 @@ xfs_action_item_priority(
 	case XFS_SCRUB_TYPE_GQUOTA:
 	case XFS_SCRUB_TYPE_PQUOTA:
 		return PRIO(aitem, XFS_SCRUB_TYPE_UQUOTA);
+	case XFS_SCRUB_TYPE_QUOTACHECK:
+		/* This should always go after [UGP]QUOTA no matter what. */
+		return PRIO(aitem, aitem->type);
 	case XFS_SCRUB_TYPE_FSCOUNTERS:
 		/* This should always go after AG headers no matter what. */
 		return PRIO(aitem, INT_MAX);
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 5c7fc4c2a3a..f2dd9bb9d0b 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -440,6 +440,15 @@ scrub_fs_counters(
 	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, 0, alist);
 }
 
+/* Scrub /only/ the quota counters. */
+int
+scrub_quotacheck(
+	struct scrub_ctx		*ctx,
+	struct action_list		*alist)
+{
+	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, 0, alist);
+}
+
 /* How many items do we have to check? */
 unsigned int
 scrub_estimate_ag_work(
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 5d365f99148..42b91fbc3ed 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -27,6 +27,7 @@ int scrub_metadata_file(struct scrub_ctx *ctx, unsigned int scrub_type,
 int scrub_iscan_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
+int scrub_quotacheck(struct scrub_ctx *ctx, struct action_list *alist);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool can_scrub_inode(struct scrub_ctx *ctx);



* [PATCHSET v24.0 0/2] xfs_repair: rebuild inode fork mappings
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (38 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of quota counters Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/2] xfs_repair: push inode buf and dinode pointers all the way to inode fork processing Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/2] xfs_repair: rebuild block mappings from rmapbt data Darrick J. Wong
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of file link counts Darrick J. Wong
                   ` (37 subsequent siblings)
  77 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Add the ability to regenerate inode fork mappings if the rmapbt
otherwise looks ok.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-rebuild-forks
---
 include/xfs_trans.h      |    2 
 libxfs/libxfs_api_defs.h |   13 +
 libxfs/trans.c           |   48 +++
 repair/Makefile          |    2 
 repair/agbtree.c         |    2 
 repair/bmap_repair.c     |  741 ++++++++++++++++++++++++++++++++++++++++++++++
 repair/bmap_repair.h     |   13 +
 repair/bulkload.c        |  175 +++++++++++
 repair/bulkload.h        |   19 +
 repair/dino_chunks.c     |    5 
 repair/dinode.c          |  142 ++++++---
 repair/dinode.h          |    7 
 repair/rmap.c            |    2 
 repair/rmap.h            |    1 
 14 files changed, 1123 insertions(+), 49 deletions(-)
 create mode 100644 repair/bmap_repair.c
 create mode 100644 repair/bmap_repair.h



* [PATCH 1/2] xfs_repair: push inode buf and dinode pointers all the way to inode fork processing
  2022-12-30 22:17 ` [PATCHSET v24.0 0/2] xfs_repair: rebuild inode fork mappings Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/2] xfs_repair: rebuild block mappings from rmapbt data Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, the process_dinode* family of functions assumes that it has
the buffer backing the inodes locked, and therefore that the dinode
pointer won't ever change.  However, the bmbt rebuilding code in the
next patch will violate that assumption, so we must pass pointers to
the inobp and to the dinode pointer (that is to say, double pointers)
all the way through to process_inode_{data,attr}_fork so that we can
regrab the buffer after the rebuilding step finishes.
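
(If the double-pointer plumbing seems abstract, here is a tiny
self-contained example of the idiom -- none of it is xfs_repair code
and all names are invented.  A callee that may replace the object
behind the caller's pointer has to take a pointer to that pointer so
the caller sees the replacement.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Model of "regrabbing" a buffer: drop the old object and hand back a
 * new one through the double pointer.
 */
static void regrab(char **bufp)
{
	free(*bufp);
	*bufp = strdup("new buffer contents");
}

static void process(char **bufp)
{
	/* ... work that might force the buffer to be rebuilt ... */
	regrab(bufp);
	/* Safe: *bufp now points at the replacement buffer. */
	printf("callee sees: %s\n", *bufp);
}

int main(void)
{
	char *buf = strdup("old buffer contents");

	process(&buf);
	/* The caller's pointer was updated along the way. */
	printf("caller sees: %s\n", buf);
	free(buf);
	return 0;
}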

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/dino_chunks.c |    5 ++-
 repair/dinode.c      |   88 ++++++++++++++++++++++++++++----------------------
 repair/dinode.h      |    7 ++--
 3 files changed, 57 insertions(+), 43 deletions(-)


diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index 0e09132b0b1..5c7799b1888 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -851,10 +851,11 @@ process_inode_chunk(
 		ino_dirty = 0;
 		parent = 0;
 
-		status = process_dinode(mp, dino, agno, agino,
+		status = process_dinode(mp, &dino, agno, agino,
 				is_inode_free(ino_rec, irec_offset),
 				&ino_dirty, &is_used,ino_discovery, check_dups,
-				extra_attr_check, &isa_dir, &parent);
+				extra_attr_check, &isa_dir, &parent,
+				&bplist[bp_index]);
 
 		ASSERT(is_used != 3);
 		if (ino_dirty) {
diff --git a/repair/dinode.c b/repair/dinode.c
index e534a01b500..cea3c1ee5fe 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -1889,17 +1889,19 @@ _("nblocks (%" PRIu64 ") smaller than nextents for inode %" PRIu64 "\n"), nblock
  */
 static int
 process_inode_data_fork(
-	xfs_mount_t		*mp,
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
 	xfs_agino_t		ino,
-	struct xfs_dinode	*dino,
+	struct xfs_dinode	**dinop,
 	int			type,
 	int			*dirty,
 	xfs_rfsblock_t		*totblocks,
 	xfs_extnum_t		*nextents,
 	blkmap_t		**dblkmap,
-	int			check_dups)
+	int			check_dups,
+	struct xfs_buf		**ino_bpp)
 {
+	struct xfs_dinode	*dino = *dinop;
 	xfs_ino_t		lino = XFS_AGINO_TO_INO(mp, agno, ino);
 	int			err = 0;
 	xfs_extnum_t		nex, max_nex;
@@ -2001,20 +2003,22 @@ process_inode_data_fork(
  */
 static int
 process_inode_attr_fork(
-	xfs_mount_t		*mp,
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
 	xfs_agino_t		ino,
-	struct xfs_dinode	*dino,
+	struct xfs_dinode	**dinop,
 	int			type,
 	int			*dirty,
 	xfs_rfsblock_t		*atotblocks,
 	xfs_extnum_t		*anextents,
 	int			check_dups,
 	int			extra_attr_check,
-	int			*retval)
+	int			*retval,
+	struct xfs_buf		**ino_bpp)
 {
 	xfs_ino_t		lino = XFS_AGINO_TO_INO(mp, agno, ino);
-	blkmap_t		*ablkmap = NULL;
+	struct xfs_dinode	*dino = *dinop;
+	struct blkmap		*ablkmap = NULL;
 	int			repair = 0;
 	int			err;
 
@@ -2073,7 +2077,7 @@ process_inode_attr_fork(
 		 * XXX - put the inode onto the "move it" list and
 		 *	log the the attribute scrubbing
 		 */
-		do_warn(_("bad attribute fork in inode %" PRIu64), lino);
+		do_warn(_("bad attribute fork in inode %" PRIu64 "\n"), lino);
 
 		if (!no_modify)  {
 			do_warn(_(", clearing attr fork\n"));
@@ -2272,21 +2276,22 @@ _("Bad extent size hint %u on inode %" PRIu64 ", "),
  * for detailed, info, look at process_dinode() comments.
  */
 static int
-process_dinode_int(xfs_mount_t *mp,
-		struct xfs_dinode *dino,
-		xfs_agnumber_t agno,
-		xfs_agino_t ino,
-		int was_free,		/* 1 if inode is currently free */
-		int *dirty,		/* out == > 0 if inode is now dirty */
-		int *used,		/* out == 1 if inode is in use */
-		int verify_mode,	/* 1 == verify but don't modify inode */
-		int uncertain,		/* 1 == inode is uncertain */
-		int ino_discovery,	/* 1 == check dirs for unknown inodes */
-		int check_dups,		/* 1 == check if inode claims
-					 * duplicate blocks		*/
-		int extra_attr_check, /* 1 == do attribute format and value checks */
-		int *isa_dir,		/* out == 1 if inode is a directory */
-		xfs_ino_t *parent)	/* out -- parent if ino is a dir */
+process_dinode_int(
+	struct xfs_mount	*mp,
+	struct xfs_dinode	**dinop,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		ino,
+	int			was_free,	/* 1 if inode is currently free */
+	int			*dirty,		/* out == > 0 if inode is now dirty */
+	int			*used,		/* out == 1 if inode is in use */
+	int			verify_mode,	/* 1 == verify but don't modify inode */
+	int			uncertain,	/* 1 == inode is uncertain */
+	int			ino_discovery,	/* 1 == check dirs for unknown inodes */
+	int			check_dups,	/* 1 == check if inode claims duplicate blocks */
+	int			extra_attr_check, /* 1 == do attribute format and value checks */
+	int			*isa_dir,	/* out == 1 if inode is a directory */
+	xfs_ino_t		*parent,	/* out -- parent if ino is a dir */
+	struct xfs_buf		**ino_bpp)
 {
 	xfs_rfsblock_t		totblocks = 0;
 	xfs_rfsblock_t		atotblocks = 0;
@@ -2299,6 +2304,7 @@ process_dinode_int(xfs_mount_t *mp,
 	const int		is_free = 0;
 	const int		is_used = 1;
 	blkmap_t		*dblkmap = NULL;
+	struct xfs_dinode	*dino = *dinop;
 	xfs_agino_t		unlinked_ino;
 	struct xfs_perag	*pag;
 
@@ -2322,6 +2328,7 @@ process_dinode_int(xfs_mount_t *mp,
 	 * If uncertain is set, verify_mode MUST be set.
 	 */
 	ASSERT(uncertain == 0 || verify_mode != 0);
+	ASSERT(ino_bpp != NULL || verify_mode != 0);
 
 	/*
 	 * This is the only valid point to check the CRC; after this we may have
@@ -2861,18 +2868,21 @@ _("Bad CoW extent size %u on inode %" PRIu64 ", "),
 	/*
 	 * check data fork -- if it's bad, clear the inode
 	 */
-	if (process_inode_data_fork(mp, agno, ino, dino, type, dirty,
-			&totblocks, &nextents, &dblkmap, check_dups) != 0)
+	if (process_inode_data_fork(mp, agno, ino, dinop, type, dirty,
+			&totblocks, &nextents, &dblkmap, check_dups,
+			ino_bpp) != 0)
 		goto bad_out;
+	dino = *dinop;
 
 	/*
 	 * check attribute fork if necessary.  attributes are
 	 * always stored in the regular filesystem.
 	 */
-	if (process_inode_attr_fork(mp, agno, ino, dino, type, dirty,
+	if (process_inode_attr_fork(mp, agno, ino, dinop, type, dirty,
 			&atotblocks, &anextents, check_dups, extra_attr_check,
-			&retval))
+			&retval, ino_bpp))
 		goto bad_out;
+	dino = *dinop;
 
 	/*
 	 * enforce totblocks is 0 for misc types
@@ -2990,8 +3000,8 @@ _("Bad CoW extent size %u on inode %" PRIu64 ", "),
 
 int
 process_dinode(
-	xfs_mount_t		*mp,
-	struct xfs_dinode	*dino,
+	struct xfs_mount	*mp,
+	struct xfs_dinode	**dinop,
 	xfs_agnumber_t		agno,
 	xfs_agino_t		ino,
 	int			was_free,
@@ -3001,7 +3011,8 @@ process_dinode(
 	int			check_dups,
 	int			extra_attr_check,
 	int			*isa_dir,
-	xfs_ino_t		*parent)
+	xfs_ino_t		*parent,
+	struct xfs_buf		**ino_bpp)
 {
 	const int		verify_mode = 0;
 	const int		uncertain = 0;
@@ -3009,9 +3020,10 @@ process_dinode(
 #ifdef XR_INODE_TRACE
 	fprintf(stderr, _("processing inode %d/%d\n"), agno, ino);
 #endif
-	return process_dinode_int(mp, dino, agno, ino, was_free, dirty, used,
-				verify_mode, uncertain, ino_discovery,
-				check_dups, extra_attr_check, isa_dir, parent);
+	return process_dinode_int(mp, dinop, agno, ino, was_free, dirty, used,
+			verify_mode, uncertain, ino_discovery,
+			check_dups, extra_attr_check, isa_dir, parent,
+			ino_bpp);
 }
 
 /*
@@ -3036,9 +3048,9 @@ verify_dinode(
 	const int		ino_discovery = 0;
 	const int		uncertain = 0;
 
-	return process_dinode_int(mp, dino, agno, ino, 0, &dirty, &used,
-				verify_mode, uncertain, ino_discovery,
-				check_dups, 0, &isa_dir, &parent);
+	return process_dinode_int(mp, &dino, agno, ino, 0, &dirty, &used,
+			verify_mode, uncertain, ino_discovery,
+			check_dups, 0, &isa_dir, &parent, NULL);
 }
 
 /*
@@ -3062,7 +3074,7 @@ verify_uncertain_dinode(
 	const int		ino_discovery = 0;
 	const int		uncertain = 1;
 
-	return process_dinode_int(mp, dino, agno, ino, 0, &dirty, &used,
+	return process_dinode_int(mp, &dino, agno, ino, 0, &dirty, &used,
 				verify_mode, uncertain, ino_discovery,
-				check_dups, 0, &isa_dir, &parent);
+				check_dups, 0, &isa_dir, &parent, NULL);
 }
diff --git a/repair/dinode.h b/repair/dinode.h
index 333d96d26a2..92df83da621 100644
--- a/repair/dinode.h
+++ b/repair/dinode.h
@@ -43,8 +43,8 @@ void
 update_rootino(xfs_mount_t *mp);
 
 int
-process_dinode(xfs_mount_t *mp,
-		struct xfs_dinode *dino,
+process_dinode(struct xfs_mount *mp,
+		struct xfs_dinode **dinop,
 		xfs_agnumber_t agno,
 		xfs_agino_t ino,
 		int was_free,
@@ -54,7 +54,8 @@ process_dinode(xfs_mount_t *mp,
 		int check_dups,
 		int extra_attr_check,
 		int *isa_dir,
-		xfs_ino_t *parent);
+		xfs_ino_t *parent,
+		struct xfs_buf **ino_bpp);
 
 int
 verify_dinode(xfs_mount_t *mp,



* [PATCH 2/2] xfs_repair: rebuild block mappings from rmapbt data
  2022-12-30 22:17 ` [PATCHSET v24.0 0/2] xfs_repair: rebuild inode fork mappings Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/2] xfs_repair: push inode buf and dinode pointers all the way to inode fork processing Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we think the rmap data is reasonably sane, use rmap records to
rebuild corrupt inode forks instead of zapping the whole inode.
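
(Sketch of the core idea only, not the patch itself: every struct,
name, and number below is invented.  Given a pile of reverse-mapping
records, keep the ones owned by the target inode's data fork, sort them
by file offset, and that sorted list becomes the rebuilt fork mapping.)

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

/* Invented, simplified reverse-mapping record. */
struct fake_rmap {
	uint64_t	owner;		/* inode number */
	uint64_t	offset;		/* file offset, in blocks */
	uint64_t	startblock;	/* physical start block */
	uint64_t	blockcount;
	bool		attr_fork;	/* true if this maps the attr fork */
};

/* Invented extent record for the rebuilt data fork. */
struct fake_extent {
	uint64_t	offset;
	uint64_t	startblock;
	uint64_t	blockcount;
};

static int cmp_offset(const void *a, const void *b)
{
	const struct fake_extent	*ea = a, *eb = b;

	if (ea->offset > eb->offset)
		return 1;
	if (ea->offset < eb->offset)
		return -1;
	return 0;
}

int main(void)
{
	/* Pretend these came from walking every AG's rmap btree. */
	static const struct fake_rmap rmaps[] = {
		{ .owner = 131, .offset = 8, .startblock = 900, .blockcount = 4 },
		{ .owner = 131, .offset = 0, .startblock = 500, .blockcount = 8 },
		{ .owner = 131, .offset = 0, .startblock = 777, .blockcount = 1,
		  .attr_fork = true },
		{ .owner = 207, .offset = 0, .startblock = 600, .blockcount = 2 },
	};
	struct fake_extent	ext[8];
	unsigned int		nr = 0, i;
	uint64_t		target = 131;	/* inode being rebuilt */

	/* Keep only records owned by the target inode's data fork. */
	for (i = 0; i < sizeof(rmaps) / sizeof(rmaps[0]); i++) {
		if (rmaps[i].owner != target || rmaps[i].attr_fork)
			continue;
		ext[nr].offset = rmaps[i].offset;
		ext[nr].startblock = rmaps[i].startblock;
		ext[nr].blockcount = rmaps[i].blockcount;
		nr++;
	}

	/* Sort by file offset; this becomes the new fork mapping. */
	qsort(ext, nr, sizeof(ext[0]), cmp_offset);
	for (i = 0; i < nr; i++)
		printf("extent: off %llu -> block %llu, len %llu\n",
				(unsigned long long)ext[i].offset,
				(unsigned long long)ext[i].startblock,
				(unsigned long long)ext[i].blockcount);
	return 0;
}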

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/xfs_trans.h      |    2 
 libxfs/libxfs_api_defs.h |   13 +
 libxfs/trans.c           |   48 +++
 repair/Makefile          |    2 
 repair/agbtree.c         |    2 
 repair/bmap_repair.c     |  741 ++++++++++++++++++++++++++++++++++++++++++++++
 repair/bmap_repair.h     |   13 +
 repair/bulkload.c        |  175 +++++++++++
 repair/bulkload.h        |   19 +
 repair/dinode.c          |   54 +++
 repair/rmap.c            |    2 
 repair/rmap.h            |    1 
 12 files changed, 1066 insertions(+), 6 deletions(-)
 create mode 100644 repair/bmap_repair.c
 create mode 100644 repair/bmap_repair.h


diff --git a/include/xfs_trans.h b/include/xfs_trans.h
index 690759ece3a..ae339df1195 100644
--- a/include/xfs_trans.h
+++ b/include/xfs_trans.h
@@ -91,6 +91,8 @@ int	libxfs_trans_alloc_rollable(struct xfs_mount *mp, uint blocks,
 int	libxfs_trans_alloc_empty(struct xfs_mount *mp, struct xfs_trans **tpp);
 int	libxfs_trans_commit(struct xfs_trans *);
 void	libxfs_trans_cancel(struct xfs_trans *);
+int	libxfs_trans_reserve_more(struct xfs_trans *tp, uint blocks,
+			uint rtextents);
 
 /* cancel dfops associated with a transaction */
 void xfs_defer_cancel(struct xfs_trans *);
diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index 5aa9c019d40..5d73111b508 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -39,11 +39,18 @@
 #define xfs_attr_set			libxfs_attr_set
 
 #define __xfs_bmap_add_free		__libxfs_bmap_add_free
+#define xfs_bmap_validate_extent	libxfs_bmap_validate_extent
 #define xfs_bmapi_read			libxfs_bmapi_read
+#define xfs_bmapi_remap			libxfs_bmapi_remap
 #define xfs_bmapi_write			libxfs_bmapi_write
 #define xfs_bmap_last_offset		libxfs_bmap_last_offset
+#define xfs_bmbt_calc_size		libxfs_bmbt_calc_size
+#define xfs_bmbt_commit_staged_btree	libxfs_bmbt_commit_staged_btree
+#define xfs_bmbt_disk_get_startoff	libxfs_bmbt_disk_get_startoff
+#define xfs_bmbt_disk_set_all		libxfs_bmbt_disk_set_all
 #define xfs_bmbt_maxlevels_ondisk	libxfs_bmbt_maxlevels_ondisk
 #define xfs_bmbt_maxrecs		libxfs_bmbt_maxrecs
+#define xfs_bmbt_stage_cursor		libxfs_bmbt_stage_cursor
 #define xfs_bmdr_maxrecs		libxfs_bmdr_maxrecs
 
 #define xfs_btree_bload			libxfs_btree_bload
@@ -120,8 +127,12 @@
 #define xfs_ialloc_read_agi		libxfs_ialloc_read_agi
 #define xfs_idata_realloc		libxfs_idata_realloc
 #define xfs_idestroy_fork		libxfs_idestroy_fork
+#define xfs_iext_first			libxfs_iext_first
+#define xfs_iext_insert_raw		libxfs_iext_insert_raw
 #define xfs_iext_lookup_extent		libxfs_iext_lookup_extent
+#define xfs_iext_next			libxfs_iext_next
 #define xfs_ifork_zap_attr		libxfs_ifork_zap_attr
+#define xfs_imap_to_bp			libxfs_imap_to_bp
 #define xfs_initialize_perag		libxfs_initialize_perag
 #define xfs_initialize_perag_data	libxfs_initialize_perag_data
 #define xfs_init_local_fork		libxfs_init_local_fork
@@ -162,10 +173,12 @@
 #define xfs_rmapbt_stage_cursor		libxfs_rmapbt_stage_cursor
 #define xfs_rmap_compare		libxfs_rmap_compare
 #define xfs_rmap_get_rec		libxfs_rmap_get_rec
+#define xfs_rmap_ino_bmbt_owner		libxfs_rmap_ino_bmbt_owner
 #define xfs_rmap_irec_offset_pack	libxfs_rmap_irec_offset_pack
 #define xfs_rmap_irec_offset_unpack	libxfs_rmap_irec_offset_unpack
 #define xfs_rmap_lookup_le		libxfs_rmap_lookup_le
 #define xfs_rmap_lookup_le_range	libxfs_rmap_lookup_le_range
+#define xfs_rmap_query_all		libxfs_rmap_query_all
 #define xfs_rmap_query_range		libxfs_rmap_query_range
 
 #define xfs_rtfree_extent		libxfs_rtfree_extent
diff --git a/libxfs/trans.c b/libxfs/trans.c
index 50d9c23de3e..e9430c61562 100644
--- a/libxfs/trans.c
+++ b/libxfs/trans.c
@@ -1046,3 +1046,51 @@ libxfs_trans_alloc_inode(
 	*tpp = tp;
 	return 0;
 }
+
+/*
+ * Try to reserve more blocks for a transaction.  The single use case we
+ * support is for offline repair -- use a transaction to gather data without
+ * fear of btree cycle deadlocks; calculate how many blocks we really need
+ * from that data; and only then start modifying data.  This can fail due to
+ * ENOSPC, so we have to be able to cancel the transaction.
+ */
+int
+libxfs_trans_reserve_more(
+	struct xfs_trans	*tp,
+	uint			blocks,
+	uint			rtextents)
+{
+	int			error = 0;
+
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/*
+	 * Attempt to reserve the needed disk blocks by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (blocks > 0) {
+		if (tp->t_mountp->m_sb.sb_fdblocks < blocks)
+			return -ENOSPC;
+		tp->t_blk_res += blocks;
+	}
+
+	/*
+	 * Attempt to reserve the needed realtime extents by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (rtextents > 0) {
+		if (tp->t_mountp->m_sb.sb_rextents < rtextents) {
+			error = -ENOSPC;
+			goto out_blocks;
+		}
+	}
+
+	return 0;
+out_blocks:
+	if (blocks > 0)
+		tp->t_blk_res -= blocks;
+
+	return error;
+}
diff --git a/repair/Makefile b/repair/Makefile
index 2c40e59a30f..e5014deb0ce 100644
--- a/repair/Makefile
+++ b/repair/Makefile
@@ -16,6 +16,7 @@ HFILES = \
 	avl.h \
 	bulkload.h \
 	bmap.h \
+	bmap_repair.h \
 	btree.h \
 	da_util.h \
 	dinode.h \
@@ -41,6 +42,7 @@ CFILES = \
 	avl.c \
 	bulkload.c \
 	bmap.c \
+	bmap_repair.c \
 	btree.c \
 	da_util.c \
 	dino_chunks.c \
diff --git a/repair/agbtree.c b/repair/agbtree.c
index cba67c5fbf4..23851f17b61 100644
--- a/repair/agbtree.c
+++ b/repair/agbtree.c
@@ -22,7 +22,7 @@ init_rebuild(
 {
 	memset(btr, 0, sizeof(struct bt_rebuild));
 
-	bulkload_init_ag(&btr->newbt, sc, oinfo);
+	bulkload_init_ag(&btr->newbt, sc, oinfo, NULLFSBLOCK);
 	btr->bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */
 	bulkload_estimate_ag_slack(sc, &btr->bload, free_space);
 }
diff --git a/repair/bmap_repair.c b/repair/bmap_repair.c
new file mode 100644
index 00000000000..25a9daa7449
--- /dev/null
+++ b/repair/bmap_repair.c
@@ -0,0 +1,741 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include <libxfs.h>
+#include "btree.h"
+#include "err_protos.h"
+#include "libxlog.h"
+#include "incore.h"
+#include "globals.h"
+#include "dinode.h"
+#include "slab.h"
+#include "rmap.h"
+#include "bulkload.h"
+#include "bmap_repair.h"
+
+#define min_t(type, x, y) ( ((type)(x)) > ((type)(y)) ? ((type)(y)) : ((type)(x)) )
+
+/*
+ * Inode Fork Block Mapping (BMBT) Repair
+ * ======================================
+ *
+ * Gather all the rmap records for the inode and fork we're fixing, reset the
+ * incore fork, then recreate the btree.
+ */
+struct xrep_bmap {
+	/* List of new bmap records. */
+	struct xfs_slab		*bmap_records;
+	struct xfs_slab_cursor	*bmap_cursor;
+
+	/* New fork. */
+	struct bulkload		new_fork_info;
+	struct xfs_btree_bload	bmap_bload;
+
+	struct repair_ctx	*sc;
+
+	/* How many blocks did we find allocated to this file? */
+	xfs_rfsblock_t		nblocks;
+
+	/* How many bmbt blocks did we find for this fork? */
+	xfs_rfsblock_t		old_bmbt_block_count;
+
+	/* Which fork are we fixing? */
+	int			whichfork;
+};
+
+/* Remember this reverse-mapping as a series of bmap records. */
+STATIC int
+xrep_bmap_from_rmap(
+	struct xrep_bmap	*rb,
+	xfs_fileoff_t		startoff,
+	xfs_fsblock_t		startblock,
+	xfs_filblks_t		blockcount,
+	bool			unwritten)
+{
+	struct xfs_bmbt_rec	rbe;
+	struct xfs_bmbt_irec	irec;
+	int			error = 0;
+
+	irec.br_startoff = startoff;
+	irec.br_startblock = startblock;
+	irec.br_state = unwritten ? XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
+
+	do {
+		xfs_failaddr_t	fa;
+
+		irec.br_blockcount = min_t(xfs_filblks_t, blockcount,
+				XFS_MAX_BMBT_EXTLEN);
+
+		fa = libxfs_bmap_validate_extent(rb->sc->ip, rb->whichfork,
+				&irec);
+		if (fa)
+			return -EFSCORRUPTED;
+
+		libxfs_bmbt_disk_set_all(&rbe, &irec);
+
+		error = slab_add(rb->bmap_records, &rbe);
+		if (error)
+			return error;
+
+		irec.br_startblock += irec.br_blockcount;
+		irec.br_startoff += irec.br_blockcount;
+		blockcount -= irec.br_blockcount;
+	} while (blockcount > 0);
+
+	return 0;
+}
+
+/* Check for any obvious errors or conflicts in the file mapping. */
+STATIC int
+xrep_bmap_check_fork_rmap(
+	struct xrep_bmap		*rb,
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec)
+{
+	struct repair_ctx		*sc = rb->sc;
+
+	/*
+	 * Data extents for rt files are never stored on the data device, but
+	 * everything else (xattrs, bmbt blocks) can be.
+	 */
+	if (XFS_IS_REALTIME_INODE(sc->ip) &&
+	    !(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK)))
+		return EFSCORRUPTED;
+
+	/* Check that this is within the AG. */
+	if (!xfs_verify_agbext(cur->bc_ag.pag, rec->rm_startblock,
+				rec->rm_blockcount))
+		return EFSCORRUPTED;
+
+	/* No contradictory flags. */
+	if ((rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK)) &&
+	    (rec->rm_flags & XFS_RMAP_UNWRITTEN))
+		return EFSCORRUPTED;
+
+	/* Check the file offset range. */
+	if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
+	    !xfs_verify_fileext(sc->mp, rec->rm_offset, rec->rm_blockcount))
+		return EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Record extents that belong to this inode's fork. */
+STATIC int
+xrep_bmap_walk_rmap(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	struct xrep_bmap		*rb = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	xfs_fsblock_t			fsbno;
+	int				error;
+
+	/* Skip extents which are not owned by this inode and fork. */
+	if (rec->rm_owner != rb->sc->ip->i_ino)
+		return 0;
+
+	error = xrep_bmap_check_fork_rmap(rb, cur, rec);
+	if (error)
+		return error;
+
+	/*
+	 * Record all blocks allocated to this file even if the extent isn't
+	 * for the fork we're rebuilding so that we can reset di_nblocks later.
+	 */
+	rb->nblocks += rec->rm_blockcount;
+
+	/* If this rmap isn't for the fork we want, we're done. */
+	if (rb->whichfork == XFS_DATA_FORK &&
+	    (rec->rm_flags & XFS_RMAP_ATTR_FORK))
+		return 0;
+	if (rb->whichfork == XFS_ATTR_FORK &&
+	    !(rec->rm_flags & XFS_RMAP_ATTR_FORK))
+		return 0;
+
+	fsbno = XFS_AGB_TO_FSB(mp, cur->bc_ag.pag->pag_agno,
+			rec->rm_startblock);
+
+	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
+		rb->old_bmbt_block_count += rec->rm_blockcount;
+		return 0;
+	}
+
+	return xrep_bmap_from_rmap(rb, rec->rm_offset, fsbno,
+			rec->rm_blockcount,
+			rec->rm_flags & XFS_RMAP_UNWRITTEN);
+}
+
+/* Compare two bmap extents. */
+static int
+xrep_bmap_extent_cmp(
+	const void			*a,
+	const void			*b)
+{
+	xfs_fileoff_t			ao;
+	xfs_fileoff_t			bo;
+
+	ao = libxfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)a);
+	bo = libxfs_bmbt_disk_get_startoff((struct xfs_bmbt_rec *)b);
+
+	if (ao > bo)
+		return 1;
+	else if (ao < bo)
+		return -1;
+	return 0;
+}
+
+/* Scan one AG for reverse mappings that we can turn into extent maps. */
+STATIC int
+xrep_bmap_scan_ag(
+	struct xrep_bmap	*rb,
+	struct xfs_perag	*pag)
+{
+	struct repair_ctx	*sc = rb->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agf_bp = NULL;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = -libxfs_alloc_read_agf(pag, sc->tp, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return ENOMEM;
+	cur = libxfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, pag);
+	error = -libxfs_rmap_query_all(cur, xrep_bmap_walk_rmap, rb);
+	libxfs_btree_del_cursor(cur, error);
+	libxfs_trans_brelse(sc->tp, agf_bp);
+	return error;
+}
+
+/*
+ * Collect block mappings for this fork of this inode and decide if we have
+ * enough space to rebuild.  Caller is responsible for cleaning up the list if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_bmap_find_mappings(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	int			error;
+
+	/* Iterate the rmaps for extents. */
+	for_each_perag(rb->sc->mp, agno, pag) {
+		error = xrep_bmap_scan_ag(rb, pag);
+		if (error) {
+			libxfs_perag_put(pag);
+			return error;
+		}
+	}
+
+	return 0;
+}
+
+/* Retrieve bmap data for bulk load. */
+STATIC int
+xrep_bmap_get_records(
+	struct xfs_btree_cur	*cur,
+	unsigned int		idx,
+	struct xfs_btree_block	*block,
+	unsigned int		nr_wanted,
+	void			*priv)
+{
+	struct xfs_bmbt_rec	*rec;
+	struct xfs_bmbt_irec	*irec = &cur->bc_rec.b;
+	struct xrep_bmap	*rb = priv;
+	union xfs_btree_rec	*block_rec;
+	unsigned int		loaded;
+
+	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
+		rec = pop_slab_cursor(rb->bmap_cursor);
+		libxfs_bmbt_disk_get_all(rec, irec);
+
+		block_rec = libxfs_btree_rec_addr(cur, idx, block);
+		cur->bc_ops->init_rec_from_cur(cur, block_rec);
+	}
+
+	return loaded;
+}
+
+/* Feed one of the new btree blocks to the bulk loader. */
+STATIC int
+xrep_bmap_claim_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	void			*priv)
+{
+	struct xrep_bmap	*rb = priv;
+
+	return bulkload_claim_block(cur, &rb->new_fork_info, ptr);
+}
+
+/* Figure out how much space we need to create the incore btree root block. */
+STATIC size_t
+xrep_bmap_iroot_size(
+	struct xfs_btree_cur	*cur,
+	unsigned int		level,
+	unsigned int		nr_this_level,
+	void			*priv)
+{
+	ASSERT(level > 0);
+
+	return XFS_BMAP_BROOT_SPACE_CALC(cur->bc_mp, nr_this_level);
+}
+
+/* Update the inode counters. */
+STATIC int
+xrep_bmap_reset_counters(
+	struct xrep_bmap	*rb)
+{
+	struct repair_ctx	*sc = rb->sc;
+	struct xbtree_ifakeroot	*ifake = &rb->new_fork_info.ifake;
+	int64_t			delta;
+
+	/*
+	 * Update the inode block counts to reflect the extents we found in the
+	 * rmapbt.
+	 */
+	delta = ifake->if_blocks - rb->old_bmbt_block_count;
+	sc->ip->i_nblocks = rb->nblocks + delta;
+	libxfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+
+	/* Quotas don't exist so we're done. */
+	return 0;
+}
+
+/*
+ * Ensure that the inode being repaired is ready to handle a certain number of
+ * extents, or return EFSCORRUPTED.  Caller must hold the ILOCK of the inode
+ * being repaired and have joined it to the scrub transaction.
+ */
+static int
+xrep_ino_ensure_extent_count(
+	struct repair_ctx	*sc,
+	int			whichfork,
+	xfs_extnum_t		nextents)
+{
+	xfs_extnum_t		max_extents;
+	bool			large_extcount;
+
+	large_extcount = xfs_inode_has_large_extent_counts(sc->ip);
+	max_extents = xfs_iext_max_nextents(large_extcount, whichfork);
+	if (nextents <= max_extents)
+		return 0;
+	if (large_extcount)
+		return EFSCORRUPTED;
+	if (!xfs_has_large_extent_counts(sc->mp))
+		return EFSCORRUPTED;
+
+	max_extents = xfs_iext_max_nextents(true, whichfork);
+	if (nextents > max_extents)
+		return EFSCORRUPTED;
+
+	sc->ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	libxfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	return 0;
+}
+
+/*
+ * Create a new iext tree and load it with block mappings.  If the inode is
+ * in extents format, that's all we need to do to commit the new mappings.
+ * If it is in btree format, this takes care of preloading the incore tree.
+ */
+STATIC int
+xrep_bmap_extents_load(
+	struct xrep_bmap	*rb,
+	struct xfs_btree_cur	*bmap_cur,
+	uint64_t		nextents)
+{
+	struct xfs_iext_cursor	icur;
+	struct xbtree_ifakeroot	*ifake = &rb->new_fork_info.ifake;
+	struct xfs_ifork	*ifp = ifake->if_fork;
+	unsigned int		i;
+	int			error;
+
+	ASSERT(ifp->if_bytes == 0);
+
+	error = init_slab_cursor(rb->bmap_records, xrep_bmap_extent_cmp,
+			&rb->bmap_cursor);
+	if (error)
+		return error;
+
+	/* Add all the mappings to the incore extent tree. */
+	libxfs_iext_first(ifp, &icur);
+	for (i = 0; i < nextents; i++) {
+		struct xfs_bmbt_rec	*rec;
+
+		rec = pop_slab_cursor(rb->bmap_cursor);
+		libxfs_bmbt_disk_get_all(rec, &bmap_cur->bc_rec.b);
+		libxfs_iext_insert_raw(ifp, &icur, &bmap_cur->bc_rec.b);
+		ifp->if_nextents++;
+		libxfs_iext_next(ifp, &icur);
+	}
+	free_slab_cursor(&rb->bmap_cursor);
+
+	return xrep_ino_ensure_extent_count(rb->sc, rb->whichfork,
+			ifp->if_nextents);
+}
+
+/*
+ * Reserve new btree blocks, bulk load the bmap records into the ondisk btree,
+ * and load the incore extent tree.
+ */
+STATIC int
+xrep_bmap_btree_load(
+	struct xrep_bmap	*rb,
+	struct xfs_btree_cur	*bmap_cur,
+	uint64_t		nextents)
+{
+	struct repair_ctx	*sc = rb->sc;
+	int			error;
+
+	rb->bmap_bload.get_records = xrep_bmap_get_records;
+	rb->bmap_bload.claim_block = xrep_bmap_claim_block;
+	rb->bmap_bload.iroot_size = xrep_bmap_iroot_size;
+	rb->bmap_bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */
+	bulkload_estimate_inode_slack(sc->mp, &rb->bmap_bload);
+
+	/* Compute how many blocks we'll need. */
+	error = -libxfs_btree_bload_compute_geometry(bmap_cur, &rb->bmap_bload,
+			nextents);
+	if (error)
+		return error;
+
+	/*
+	 * Guess how many blocks we're going to need to rebuild an entire bmap
+	 * from the number of extents we found, and pump up our transaction to
+	 * have sufficient block reservation.
+	 */
+	error = -libxfs_trans_reserve_more(sc->tp, rb->bmap_bload.nr_blocks, 0);
+	if (error)
+		return error;
+
+	/* Reserve the space we'll need for the new btree. */
+	error = bulkload_alloc_blocks(&rb->new_fork_info,
+			rb->bmap_bload.nr_blocks);
+	if (error)
+		return error;
+
+	/* Add all observed bmap records. */
+	error = init_slab_cursor(rb->bmap_records, xrep_bmap_extent_cmp,
+			&rb->bmap_cursor);
+	if (error)
+		return error;
+	error = -libxfs_btree_bload(bmap_cur, &rb->bmap_bload, rb);
+	free_slab_cursor(&rb->bmap_cursor);
+	if (error)
+		return error;
+
+	/*
+	 * Load the new bmap records into the new incore extent tree to
+	 * preserve delalloc reservations for regular files.  The directory
+	 * code loads the extent tree during xfs_dir_open and assumes
+	 * thereafter that it remains loaded, so we must not violate that
+	 * assumption.
+	 */
+	return xrep_bmap_extents_load(rb, bmap_cur, nextents);
+}
+
+/*
+ * Use the collected bmap information to stage a new bmap fork.  If this is
+ * successful we'll return with the new fork information logged to the repair
+ * transaction but not yet committed.
+ */
+STATIC int
+xrep_bmap_build_new_fork(
+	struct xrep_bmap	*rb)
+{
+	struct xfs_owner_info	oinfo;
+	struct repair_ctx	*sc = rb->sc;
+	struct xfs_btree_cur	*bmap_cur;
+	struct xbtree_ifakeroot	*ifake = &rb->new_fork_info.ifake;
+	uint64_t		nextents;
+	int			error;
+
+	/*
+	 * Sort the bmap extents by file offset to avoid btree splits when we
+	 * rebuild the bmbt btree.
+	 */
+	qsort_slab(rb->bmap_records, xrep_bmap_extent_cmp);
+
+	/*
+	 * Prepare to construct the new fork by initializing the new btree
+	 * structure and creating a fake ifork in the ifakeroot structure.
+	 */
+	libxfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, rb->whichfork);
+	bulkload_init_inode(&rb->new_fork_info, sc, rb->whichfork, &oinfo);
+	bmap_cur = libxfs_bmbt_stage_cursor(sc->mp, sc->ip, ifake);
+
+	/*
+	 * Figure out the size and format of the new fork, then fill it with
+	 * all the bmap records we've found.  Join the inode to the transaction
+	 * so that we can roll the transaction while holding the inode locked.
+	 */
+	libxfs_trans_ijoin(sc->tp, sc->ip, 0);
+	nextents = slab_count(rb->bmap_records);
+	if (nextents <= XFS_IFORK_MAXEXT(sc->ip, rb->whichfork)) {
+		ifake->if_fork->if_format = XFS_DINODE_FMT_EXTENTS;
+		error = xrep_bmap_extents_load(rb, bmap_cur, nextents);
+	} else {
+		ifake->if_fork->if_format = XFS_DINODE_FMT_BTREE;
+		error = xrep_bmap_btree_load(rb, bmap_cur, nextents);
+	}
+	if (error)
+		goto err_cur;
+
+	/*
+	 * Install the new fork in the inode.  After this point the old mapping
+	 * data are no longer accessible and the new tree is live.  We delete
+	 * the cursor immediately after committing the staged root because the
+	 * staged fork might be in extents format.
+	 */
+	libxfs_bmbt_commit_staged_btree(bmap_cur, sc->tp, rb->whichfork);
+	libxfs_btree_del_cursor(bmap_cur, 0);
+
+	/* Reset the inode counters now that we've changed the fork. */
+	error = xrep_bmap_reset_counters(rb);
+	if (error)
+		goto err_newbt;
+
+	/* Dispose of any unused blocks and the accounting information. */
+	bulkload_destroy(&rb->new_fork_info, error);
+
+	return -libxfs_trans_roll_inode(&sc->tp, sc->ip);
+err_cur:
+	if (bmap_cur)
+		libxfs_btree_del_cursor(bmap_cur, error);
+err_newbt:
+	bulkload_destroy(&rb->new_fork_info, error);
+	return error;
+}
+
+/* Check for garbage inputs.  Returns ECANCELED if there's nothing to do. */
+STATIC int
+xrep_bmap_check_inputs(
+	struct repair_ctx	*sc,
+	int			whichfork)
+{
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(sc->ip, whichfork);
+
+	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
+
+	if (!xfs_has_rmapbt(sc->mp))
+		return EOPNOTSUPP;
+
+	/* No fork means nothing to rebuild. */
+	if (!ifp)
+		return ECANCELED;
+
+	/*
+	 * We only know how to repair extent mappings, which is to say that we
+	 * only support extents and btree fork format.  Repairs to a local
+	 * format fork require a higher level repair function, so we do not
+	 * have any work to do here.
+	 */
+	switch (ifp->if_format) {
+	case XFS_DINODE_FMT_DEV:
+	case XFS_DINODE_FMT_LOCAL:
+	case XFS_DINODE_FMT_UUID:
+		return ECANCELED;
+	case XFS_DINODE_FMT_EXTENTS:
+	case XFS_DINODE_FMT_BTREE:
+		break;
+	default:
+		return EFSCORRUPTED;
+	}
+
+	if (whichfork == XFS_ATTR_FORK)
+		return 0;
+
+	/* Only files, symlinks, and directories get to have data forks. */
+	switch (VFS_I(sc->ip)->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+		/* ok */
+		break;
+	default:
+		return EINVAL;
+	}
+
+	/* Don't know how to rebuild realtime data forks. */
+	if (XFS_IS_REALTIME_INODE(sc->ip))
+		return EOPNOTSUPP;
+
+	return 0;
+}
+
+/* Repair an inode fork. */
+STATIC int
+xrep_bmap(
+	struct repair_ctx	*sc,
+	int			whichfork)
+{
+	struct xrep_bmap	*rb;
+	int			error = 0;
+
+	error = xrep_bmap_check_inputs(sc, whichfork);
+	if (error == ECANCELED)
+		return 0;
+	if (error)
+		return error;
+
+	rb = kmem_zalloc(sizeof(struct xrep_bmap), KM_NOFS | KM_MAYFAIL);
+	if (!rb)
+		return ENOMEM;
+	rb->sc = sc;
+	rb->whichfork = whichfork;
+
+	/* Set up some storage */
+	error = init_slab(&rb->bmap_records, sizeof(struct xfs_bmbt_rec));
+	if (error)
+		goto out_rb;
+
+	/* Collect all reverse mappings for this fork's extents. */
+	error = xrep_bmap_find_mappings(rb);
+	if (error)
+		goto out_bitmap;
+
+	/* Rebuild the bmap information. */
+	error = xrep_bmap_build_new_fork(rb);
+
+	/*
+	 * We don't need to free the old bmbt blocks because we're rebuilding
+	 * all the space metadata later.
+	 */
+
+out_bitmap:
+	free_slab(&rb->bmap_records);
+out_rb:
+	kmem_free(rb);
+	return error;
+}
+
+/* Rebuild some inode's bmap. */
+int
+rebuild_bmap(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino,
+	int			whichfork,
+	unsigned long		nr_extents,
+	struct xfs_buf		**ino_bpp,
+	struct xfs_dinode	**dinop,
+	int			*dirty)
+{
+	struct repair_ctx	sc = {
+		.mp		= mp,
+	};
+	const struct xfs_buf_ops *bp_ops;
+	unsigned long		boffset;
+	unsigned long long	resblks;
+	xfs_daddr_t		bp_bn;
+	int			bp_length;
+	int			error, err2;
+
+	bp_bn = xfs_buf_daddr(*ino_bpp);
+	bp_length = (*ino_bpp)->b_length;
+	bp_ops = (*ino_bpp)->b_ops;
+	boffset = (char *)(*dinop) - (char *)(*ino_bpp)->b_addr;
+
+	/*
+	 * Bail out if the inode didn't think it had extents.  Otherwise, zap
+	 * it back to a zero-extents fork so that we can rebuild it.
+	 */
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+		if ((*dinop)->di_nextents == 0)
+			return 0;
+		(*dinop)->di_format = XFS_DINODE_FMT_EXTENTS;
+		(*dinop)->di_nextents = 0;
+		libxfs_dinode_calc_crc(mp, *dinop);
+		*dirty = 1;
+		break;
+	case XFS_ATTR_FORK:
+		if ((*dinop)->di_anextents == 0)
+			return 0;
+		(*dinop)->di_aformat = XFS_DINODE_FMT_EXTENTS;
+		(*dinop)->di_anextents = 0;
+		libxfs_dinode_calc_crc(mp, *dinop);
+		*dirty = 1;
+		break;
+	default:
+		return EINVAL;
+	}
+
+	resblks = libxfs_bmbt_calc_size(mp, nr_extents);
+	error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks, 0,
+			0, &sc.tp);
+	if (error)
+		return error;
+
+	/*
+	 * Repair magic: the caller passed us the inode cluster buffer for the
+	 * inode.  The _iget call grabs the buffer to load the incore inode, so
+	 * the buffer must be attached to the transaction to avoid recursing
+	 * the buffer lock.
+	 *
+	 * Unfortunately, the _iget call drops the buffer once the inode is
+	 * loaded, so if we've made any changes we have to log the buffer, hold
+	 * it, and roll the transaction.  This persists the caller's changes
+	 * and maintains our ownership of the cluster buffer.
+	 */
+	libxfs_trans_bjoin(sc.tp, *ino_bpp);
+	if (*dirty) {
+		unsigned int	end = BBTOB((*ino_bpp)->b_length) - 1;
+
+		libxfs_trans_log_buf(sc.tp, *ino_bpp, 0, end);
+		*dirty = 0;
+
+		libxfs_trans_bhold(sc.tp, *ino_bpp);
+		error = -libxfs_trans_roll(&sc.tp);
+		libxfs_trans_bjoin(sc.tp, *ino_bpp);
+		if (error)
+			goto out_cancel;
+	}
+
+	/* Grab the inode and fix the bmbt. */
+	error = -libxfs_iget(mp, sc.tp, ino, 0, &sc.ip);
+	if (error)
+		goto out_cancel;
+	error = xrep_bmap(&sc, whichfork);
+	if (error)
+		libxfs_trans_cancel(sc.tp);
+	else
+		error = -libxfs_trans_commit(sc.tp);
+
+	/*
+	 * Rebuilding the inode fork rolled the transaction, so we need to
+	 * re-grab the inode cluster buffer and dinode pointer for the caller.
+	 */
+	err2 = -libxfs_imap_to_bp(mp, NULL, &sc.ip->i_imap, ino_bpp);
+	if (err2)
+		do_error(
+ _("Unable to re-grab inode cluster buffer after failed repair of inode %llu, error %d.\n"),
+				(unsigned long long)ino, err2);
+	*dinop = xfs_buf_offset(*ino_bpp, sc.ip->i_imap.im_boffset);
+	libxfs_irele(sc.ip);
+
+	return error;
+
+out_cancel:
+	libxfs_trans_cancel(sc.tp);
+
+	/*
+	 * Try to regrab the old buffer so we have something to return to the
+	 * caller.
+	 */
+	err2 = -libxfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, bp_bn,
+			bp_length, 0, ino_bpp, bp_ops);
+	if (err2)
+		do_error(
+ _("Unable to re-grab inode cluster buffer after failed repair of inode %llu, error %d.\n"),
+				(unsigned long long)ino, err2);
+	*dinop = xfs_buf_offset(*ino_bpp, boffset);
+	return error;
+}
diff --git a/repair/bmap_repair.h b/repair/bmap_repair.h
new file mode 100644
index 00000000000..0f3e016c948
--- /dev/null
+++ b/repair/bmap_repair.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef REBUILD_H_
+#define REBUILD_H_
+
+int rebuild_bmap(struct xfs_mount *mp, xfs_ino_t ino, int whichfork,
+		 unsigned long nr_extents, struct xfs_buf **ino_bpp,
+		 struct xfs_dinode **dinop, int *dirty);
+
+#endif /* REBUILD_H_ */
diff --git a/repair/bulkload.c b/repair/bulkload.c
index 8dd0a0c3908..0a0a60fde44 100644
--- a/repair/bulkload.c
+++ b/repair/bulkload.c
@@ -14,14 +14,30 @@ void
 bulkload_init_ag(
 	struct bulkload			*bkl,
 	struct repair_ctx		*sc,
-	const struct xfs_owner_info	*oinfo)
+	const struct xfs_owner_info	*oinfo,
+	xfs_fsblock_t			alloc_hint)
 {
 	memset(bkl, 0, sizeof(struct bulkload));
 	bkl->sc = sc;
 	bkl->oinfo = *oinfo; /* structure copy */
+	bkl->alloc_hint = alloc_hint;
 	INIT_LIST_HEAD(&bkl->resv_list);
 }
 
+/* Initialize accounting resources for staging a new inode fork btree. */
+void
+bulkload_init_inode(
+	struct bulkload			*bkl,
+	struct repair_ctx		*sc,
+	int				whichfork,
+	const struct xfs_owner_info	*oinfo)
+{
+	bulkload_init_ag(bkl, sc, oinfo, XFS_INO_TO_FSB(sc->mp, sc->ip->i_ino));
+	bkl->ifake.if_fork = kmem_cache_zalloc(xfs_ifork_cache, 0);
+	bkl->ifake.if_fork_size = xfs_inode_fork_size(sc->ip, whichfork);
+	bkl->ifake.if_whichfork = whichfork;
+}
+
 /* Designate specific blocks to be used to build our new btree. */
 int
 bulkload_add_blocks(
@@ -30,6 +46,7 @@ bulkload_add_blocks(
 	xfs_extlen_t		len)
 {
 	struct bulkload_resv	*resv;
+	struct xfs_mount	*mp = bkl->sc->mp;
 
 	resv = kmem_alloc(sizeof(struct bulkload_resv), KM_MAYFAIL);
 	if (!resv)
@@ -39,24 +56,140 @@ bulkload_add_blocks(
 	resv->fsbno = fsbno;
 	resv->len = len;
 	resv->used = 0;
+	resv->pag = libxfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, fsbno));
+
 	list_add_tail(&resv->list, &bkl->resv_list);
 	bkl->nr_reserved += len;
 
 	return 0;
 }
 
+/* Reserve disk space for our new btree. */
+int
+bulkload_alloc_blocks(
+	struct bulkload		*bkl,
+	uint64_t		nr_blocks)
+{
+	struct repair_ctx	*sc = bkl->sc;
+	xfs_alloctype_t		type;
+	int			error = 0;
+
+	type = sc->ip ? XFS_ALLOCTYPE_START_BNO : XFS_ALLOCTYPE_NEAR_BNO;
+
+	while (nr_blocks > 0) {
+		struct xfs_alloc_arg	args = {
+			.tp		= sc->tp,
+			.mp		= sc->mp,
+			.type		= type,
+			.fsbno		= bkl->alloc_hint,
+			.oinfo		= bkl->oinfo,
+			.minlen		= 1,
+			.maxlen		= nr_blocks,
+			.prod		= 1,
+			.resv		= XFS_AG_RESV_NONE,
+		};
+
+		error = -libxfs_alloc_vextent(&args);
+		if (error)
+			return error;
+		if (args.fsbno == NULLFSBLOCK)
+			return ENOSPC;
+
+		error = bulkload_add_blocks(bkl, args.fsbno, args.len);
+		if (error)
+			return error;
+
+		nr_blocks -= args.len;
+
+		error = -libxfs_trans_roll_inode(&sc->tp, sc->ip);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Release blocks that were reserved for a btree repair.  If the repair
+ * succeeded then we log deferred frees for unused blocks.  Otherwise, we try
+ * to free the extents immediately to roll the filesystem back to where it was
+ * before we started.
+ */
+static inline int
+bulkload_destroy_reservation(
+	struct bulkload		*bkl,
+	struct bulkload_resv	*resv,
+	bool			cancel_repair)
+{
+	struct repair_ctx	*sc = bkl->sc;
+
+	if (cancel_repair) {
+		int		error;
+
+		/* Free the extent then roll the transaction. */
+		error = -libxfs_free_extent(sc->tp, resv->pag,
+				XFS_FSB_TO_AGBNO(sc->mp, resv->fsbno),
+				resv->len, &bkl->oinfo, XFS_AG_RESV_NONE);
+		if (error)
+			return error;
+
+		return -libxfs_trans_roll_inode(&sc->tp, sc->ip);
+	}
+
+	/*
+	 * Use the deferred freeing mechanism to schedule for deletion any
+	 * blocks we didn't use to rebuild the tree.  This enables us to log
+	 * them all in the same transaction as the root change.
+	 */
+	resv->fsbno += resv->used;
+	resv->len -= resv->used;
+	resv->used = 0;
+
+	if (resv->len == 0)
+		return 0;
+
+	__xfs_free_extent_later(sc->tp, resv->fsbno, resv->len, &bkl->oinfo,
+			true);
+
+	return 0;
+}
+
 /* Free all the accounting info and disk space we reserved for a new btree. */
 void
 bulkload_destroy(
 	struct bulkload		*bkl,
 	int			error)
 {
+	struct repair_ctx	*sc = bkl->sc;
 	struct bulkload_resv	*resv, *n;
+	int			err2;
 
+	list_for_each_entry_safe(resv, n, &bkl->resv_list, list) {
+		err2 = bulkload_destroy_reservation(bkl, resv, error != 0);
+		if (err2)
+			goto junkit;
+
+		list_del(&resv->list);
+		libxfs_perag_put(resv->pag);
+		kmem_free(resv);
+	}
+
+junkit:
+	/*
+	 * If we still have reservations attached to @bkl, cleanup must have
+	 * failed and the filesystem is about to go down.  Clean up the incore
+	 * reservations.
+	 */
 	list_for_each_entry_safe(resv, n, &bkl->resv_list, list) {
 		list_del(&resv->list);
+		libxfs_perag_put(resv->pag);
 		kmem_free(resv);
 	}
+
+	if (sc->ip) {
+		kmem_cache_free(xfs_ifork_cache, bkl->ifake.if_fork);
+		bkl->ifake.if_fork = NULL;
+	}
 }
 
 /* Feed one of the reserved btree blocks to the bulk loader. */
@@ -138,3 +271,43 @@ bulkload_estimate_ag_slack(
 	if (bload->node_slack < 0)
 		bload->node_slack = 2;
 }
+
+/*
+ * Estimate proper slack values for a btree that's being reloaded.
+ *
+ * Under most circumstances, we'll take whatever default loading value the
+ * btree bulk loading code calculates for us.  However, there are some
+ * exceptions to this rule:
+ *
+ * (1) If someone turned one of the debug knobs.
+ * (2) The FS has less than ~9% space free.
+ *
+ * Note that we actually use 3/32 for the comparison to avoid division.
+ */
+void
+bulkload_estimate_inode_slack(
+	struct xfs_mount	*mp,
+	struct xfs_btree_bload	*bload)
+{
+	/*
+	 * The global values are set to -1 (i.e. take the bload defaults)
+	 * unless someone has set them otherwise, so we just pull the values
+	 * here.
+	 */
+	bload->leaf_slack = bload_leaf_slack;
+	bload->node_slack = bload_node_slack;
+
+	/* No further changes if there's more than 3/32 of the space left. */
+	if (mp->m_sb.sb_fdblocks >= ((mp->m_sb.sb_dblocks * 3) >> 5))
+		return;
+
+	/*
+	 * We're low on space; load the btrees as tightly as possible.  Leave
+	 * a couple of open slots in each btree block so that we don't end up
+	 * splitting the btrees like crazy right after mount.
+	 */
+	if (bload->leaf_slack < 0)
+		bload->leaf_slack = 2;
+	if (bload->node_slack < 0)
+		bload->node_slack = 2;
+}
diff --git a/repair/bulkload.h b/repair/bulkload.h
index a84e99b8c89..b1c0925890b 100644
--- a/repair/bulkload.h
+++ b/repair/bulkload.h
@@ -11,12 +11,16 @@ extern int bload_node_slack;
 
 struct repair_ctx {
 	struct xfs_mount	*mp;
+	struct xfs_inode	*ip;
+	struct xfs_trans	*tp;
 };
 
 struct bulkload_resv {
 	/* Link to list of extents that we've reserved. */
 	struct list_head	list;
 
+	struct xfs_perag	*pag;
+
 	/* FSB of the block we reserved. */
 	xfs_fsblock_t		fsbno;
 
@@ -34,7 +38,10 @@ struct bulkload {
 	struct list_head	resv_list;
 
 	/* Fake root for new btree. */
-	struct xbtree_afakeroot	afake;
+	union {
+		struct xbtree_afakeroot	afake;
+		struct xbtree_ifakeroot	ifake;
+	};
 
 	/* rmap owner of these blocks */
 	struct xfs_owner_info	oinfo;
@@ -42,6 +49,9 @@ struct bulkload {
 	/* The last reservation we allocated from. */
 	struct bulkload_resv	*last_resv;
 
+	/* Hint as to where we should allocate blocks. */
+	xfs_fsblock_t		alloc_hint;
+
 	/* Number of blocks reserved via resv_list. */
 	unsigned int		nr_reserved;
 };
@@ -50,13 +60,18 @@ struct bulkload {
 	list_for_each_entry_safe((resv), (n), &(bkl)->resv_list, list)
 
 void bulkload_init_ag(struct bulkload *bkl, struct repair_ctx *sc,
-		const struct xfs_owner_info *oinfo);
+		const struct xfs_owner_info *oinfo, xfs_fsblock_t alloc_hint);
+void bulkload_init_inode(struct bulkload *bkl, struct repair_ctx *sc,
+		int whichfork, const struct xfs_owner_info *oinfo);
 int bulkload_add_blocks(struct bulkload *bkl, xfs_fsblock_t fsbno,
 		xfs_extlen_t len);
+int bulkload_alloc_blocks(struct bulkload *bkl, uint64_t nr_blocks);
 void bulkload_destroy(struct bulkload *bkl, int error);
 int bulkload_claim_block(struct xfs_btree_cur *cur, struct bulkload *bkl,
 		union xfs_btree_ptr *ptr);
 void bulkload_estimate_ag_slack(struct repair_ctx *sc,
 		struct xfs_btree_bload *bload, unsigned int free);
+void bulkload_estimate_inode_slack(struct xfs_mount *mp,
+		struct xfs_btree_bload *bload);
 
 #endif /* __XFS_REPAIR_BULKLOAD_H__ */
diff --git a/repair/dinode.c b/repair/dinode.c
index cea3c1ee5fe..5e664eab7ea 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -20,6 +20,7 @@
 #include "threads.h"
 #include "slab.h"
 #include "rmap.h"
+#include "bmap_repair.h"
 
 /*
  * gettext lookups for translations of strings use mutexes internally to
@@ -1905,7 +1906,9 @@ process_inode_data_fork(
 	xfs_ino_t		lino = XFS_AGINO_TO_INO(mp, agno, ino);
 	int			err = 0;
 	xfs_extnum_t		nex, max_nex;
+	int			try_rebuild = -1; /* don't know yet */
 
+retry:
 	/*
 	 * extent count on disk is only valid for positive values. The kernel
 	 * uses negative values in memory. hence if we see negative numbers
@@ -1934,11 +1937,15 @@ process_inode_data_fork(
 		*totblocks = 0;
 		break;
 	case XFS_DINODE_FMT_EXTENTS:
+		if (!rmapbt_suspect && try_rebuild == -1)
+			try_rebuild = 1;
 		err = process_exinode(mp, agno, ino, dino, type, dirty,
 			totblocks, nextents, dblkmap, XFS_DATA_FORK,
 			check_dups);
 		break;
 	case XFS_DINODE_FMT_BTREE:
+		if (!rmapbt_suspect && try_rebuild == -1)
+			try_rebuild = 1;
 		err = process_btinode(mp, agno, ino, dino, type, dirty,
 			totblocks, nextents, dblkmap, XFS_DATA_FORK,
 			check_dups);
@@ -1954,8 +1961,28 @@ process_inode_data_fork(
 	if (err)  {
 		do_warn(_("bad data fork in inode %" PRIu64 "\n"), lino);
 		if (!no_modify)  {
+			if (try_rebuild == 1) {
+				do_warn(
+_("rebuilding inode %"PRIu64" data fork\n"),
+					lino);
+				try_rebuild = 0;
+				err = rebuild_bmap(mp, lino, XFS_DATA_FORK,
+						be32_to_cpu(dino->di_nextents),
+						ino_bpp, dinop, dirty);
+				dino = *dinop;
+				if (!err)
+					goto retry;
+				do_warn(
+_("inode %"PRIu64" data fork rebuild failed, error %d, clearing\n"),
+					lino, err);
+			}
 			clear_dinode(mp, dino, lino);
 			*dirty += 1;
+			ASSERT(*dirty > 0);
+		} else if (try_rebuild == 1) {
+			do_warn(
+_("would have tried to rebuild inode %"PRIu64" data fork\n"),
+					lino);
 		}
 		return 1;
 	}
@@ -2021,7 +2048,9 @@ process_inode_attr_fork(
 	struct blkmap		*ablkmap = NULL;
 	int			repair = 0;
 	int			err;
+	int			try_rebuild = -1; /* don't know yet */
 
+retry:
 	if (!dino->di_forkoff) {
 		*anextents = 0;
 		if (dino->di_aformat != XFS_DINODE_FMT_EXTENTS) {
@@ -2048,6 +2077,8 @@ process_inode_attr_fork(
 		err = process_lclinode(mp, agno, ino, dino, XFS_ATTR_FORK);
 		break;
 	case XFS_DINODE_FMT_EXTENTS:
+		if (!rmapbt_suspect && try_rebuild == -1)
+			try_rebuild = 1;
 		ablkmap = blkmap_alloc(*anextents, XFS_ATTR_FORK);
 		*anextents = 0;
 		err = process_exinode(mp, agno, ino, dino, type, dirty,
@@ -2055,6 +2086,8 @@ process_inode_attr_fork(
 				XFS_ATTR_FORK, check_dups);
 		break;
 	case XFS_DINODE_FMT_BTREE:
+		if (!rmapbt_suspect && try_rebuild == -1)
+			try_rebuild = 1;
 		ablkmap = blkmap_alloc(*anextents, XFS_ATTR_FORK);
 		*anextents = 0;
 		err = process_btinode(mp, agno, ino, dino, type, dirty,
@@ -2080,11 +2113,30 @@ process_inode_attr_fork(
 		do_warn(_("bad attribute fork in inode %" PRIu64 "\n"), lino);
 
 		if (!no_modify)  {
+			if (try_rebuild == 1) {
+				do_warn(
+_("rebuilding inode %"PRIu64" attr fork\n"),
+					lino);
+				try_rebuild = 0;
+				err = rebuild_bmap(mp, lino, XFS_ATTR_FORK,
+						be16_to_cpu(dino->di_anextents),
+						ino_bpp, dinop, dirty);
+				dino = *dinop;
+				if (!err)
+					goto retry;
+				do_warn(
+_("inode %"PRIu64" attr fork rebuild failed, error %d"),
+					lino, err);
+			}
 			do_warn(_(", clearing attr fork\n"));
 			*dirty += clear_dinode_attr(mp, dino, lino);
 			dino->di_aformat = XFS_DINODE_FMT_LOCAL;
 			ASSERT(*dirty > 0);
-		} else  {
+		} else if (try_rebuild) {
+			do_warn(
+_("would have tried to rebuild inode %"PRIu64" attr fork or cleared it\n"),
+					lino);
+		} else {
 			do_warn(_(", would clear attr fork\n"));
 		}
 
diff --git a/repair/rmap.c b/repair/rmap.c
index 6dba330fd7b..5fbae50d5b7 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -33,7 +33,7 @@ struct xfs_ag_rmap {
 };
 
 static struct xfs_ag_rmap *ag_rmaps;
-static bool rmapbt_suspect;
+bool rmapbt_suspect;
 static bool refcbt_suspect;
 
 static inline int rmap_compare(const void *a, const void *b)
diff --git a/repair/rmap.h b/repair/rmap.h
index b9177f765e3..782256f8b7e 100644
--- a/repair/rmap.h
+++ b/repair/rmap.h
@@ -7,6 +7,7 @@
 #define RMAP_H_
 
 extern bool collect_rmaps;
+extern bool rmapbt_suspect;
 
 extern bool rmap_needs_work(struct xfs_mount *);
 



* [PATCHSET v24.0 0/3] libxfs: online repair of file link counts
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (39 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/2] xfs_repair: rebuild inode fork mappings Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/3] xfs: report inode link count health Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: indirect health reporting Darrick J. Wong
                   ` (36 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Now that we've created the infrastructure to perform live scans of every
file in the filesystem and the hooks needed to observe live updates, use
it to scan directories, compute the correct link count for each file,
and reset any incorrect link counts.

This patchset creates a tailored readdir implementation for scrub
because the regular version has to cycle ILOCKs to copy information to
userspace.  We can't cycle the ILOCK during the nlink scan and we don't
need all the other VFS support code (maintaining a readdir cursor and
translating XFS structures to VFS structures and back) so it was easier
to duplicate the code.
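
As a rough illustration of what the nlink scan computes (this is a toy
user-space model, not the kernel scanner; the inode numbers, structures,
and helpers below are made up for the example), here's a sketch that
tallies observed links from directory entries and flags any inode whose
stored count disagrees:

#include <stdio.h>

struct toy_inode {
	unsigned long long	ino;
	int			is_dir;
	unsigned int		stored_nlink;	/* what the inode claims */
	unsigned int		observed;	/* what the scan counted */
};

struct toy_dirent {
	unsigned long long	parent;		/* directory holding the entry */
	unsigned long long	child;		/* inode the entry points at */
	int			child_is_dir;
};

static struct toy_inode inodes[] = {
	{ .ino = 128, .is_dir = 1, .stored_nlink = 3 },	/* root */
	{ .ino = 129, .is_dir = 1, .stored_nlink = 2 },	/* subdir */
	{ .ino = 130, .is_dir = 0, .stored_nlink = 2 },	/* file; count is wrong */
};

static struct toy_dirent dirents[] = {
	{ .parent = 128, .child = 129, .child_is_dir = 1 },
	{ .parent = 128, .child = 130, .child_is_dir = 0 },
};

static struct toy_inode *lookup(unsigned long long ino)
{
	unsigned int	i;

	for (i = 0; i < sizeof(inodes) / sizeof(inodes[0]); i++)
		if (inodes[i].ino == ino)
			return &inodes[i];
	return NULL;
}

int main(void)
{
	unsigned int	i;

	/* Every dir links to itself via "."; the root's ".." is also itself. */
	for (i = 0; i < sizeof(inodes) / sizeof(inodes[0]); i++)
		if (inodes[i].is_dir)
			inodes[i].observed += (inodes[i].ino == 128) ? 2 : 1;

	/* Each entry links the child; a subdir's ".." links the parent. */
	for (i = 0; i < sizeof(dirents) / sizeof(dirents[0]); i++) {
		lookup(dirents[i].child)->observed++;
		if (dirents[i].child_is_dir)
			lookup(dirents[i].parent)->observed++;
	}

	for (i = 0; i < sizeof(inodes) / sizeof(inodes[0]); i++)
		printf("ino %llu: stored %u, observed %u%s\n",
				inodes[i].ino, inodes[i].stored_nlink,
				inodes[i].observed,
				inodes[i].stored_nlink != inodes[i].observed ?
						" <- needs repair" : "");
	return 0;
}

The real scanner does the same bookkeeping against live directories
while hooks feed it concurrent link count changes.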


If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-nlinks

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=scrub-nlinks
---
 libfrog/scrub.c                     |    5 +
 libxfs/xfs_da_format.h              |   11 +++
 libxfs/xfs_dir2.c                   |    6 ++
 libxfs/xfs_dir2.h                   |    1 
 libxfs/xfs_fs.h                     |    4 +
 libxfs/xfs_health.h                 |    4 +
 man/man2/ioctl_xfs_scrub_metadata.2 |    4 +
 repair/phase6.c                     |    4 -
 scrub/phase5.c                      |  136 +++++++++++++++++++++++++++++++++--
 scrub/scrub.c                       |   18 ++---
 scrub/scrub.h                       |    1 
 spaceman/health.c                   |    4 +
 12 files changed, 175 insertions(+), 23 deletions(-)



* [PATCH 1/3] xfs: report inode link count health
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of file link counts Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/3] xfs: teach scrub to check file nlinks Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs_scrub: use multiple threads to run in-kernel metadata scrubs that scan inodes Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Report the health of inode link counts.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_fs.h     |    1 +
 libxfs/xfs_health.h |    4 +++-
 spaceman/health.c   |    4 ++++
 3 files changed, 8 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 6612c89944d..2f9f13ba75b 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -196,6 +196,7 @@ struct xfs_fsop_geom {
 #define XFS_FSOP_GEOM_SICK_RT_BITMAP	(1 << 4)  /* realtime bitmap */
 #define XFS_FSOP_GEOM_SICK_RT_SUMMARY	(1 << 5)  /* realtime summary */
 #define XFS_FSOP_GEOM_SICK_QUOTACHECK	(1 << 6)  /* quota counts */
+#define XFS_FSOP_GEOM_SICK_NLINKS	(1 << 7)  /* inode link counts */
 
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
diff --git a/libxfs/xfs_health.h b/libxfs/xfs_health.h
index 1dea286bb15..5571f6cb253 100644
--- a/libxfs/xfs_health.h
+++ b/libxfs/xfs_health.h
@@ -42,6 +42,7 @@ struct xfs_fsop_geom;
 #define XFS_SICK_FS_GQUOTA	(1 << 2)  /* group quota */
 #define XFS_SICK_FS_PQUOTA	(1 << 3)  /* project quota */
 #define XFS_SICK_FS_QUOTACHECK	(1 << 4)  /* quota counts */
+#define XFS_SICK_FS_NLINKS	(1 << 5)  /* inode link counts */
 
 /* Observable health issues for realtime volume metadata. */
 #define XFS_SICK_RT_BITMAP	(1 << 0)  /* realtime bitmap */
@@ -74,7 +75,8 @@ struct xfs_fsop_geom;
 				 XFS_SICK_FS_UQUOTA | \
 				 XFS_SICK_FS_GQUOTA | \
 				 XFS_SICK_FS_PQUOTA | \
-				 XFS_SICK_FS_QUOTACHECK)
+				 XFS_SICK_FS_QUOTACHECK | \
+				 XFS_SICK_FS_NLINKS)
 
 #define XFS_SICK_RT_PRIMARY	(XFS_SICK_RT_BITMAP | \
 				 XFS_SICK_RT_SUMMARY)
diff --git a/spaceman/health.c b/spaceman/health.c
index 3318f9d1a7f..88b12c0b0ea 100644
--- a/spaceman/health.c
+++ b/spaceman/health.c
@@ -76,6 +76,10 @@ static const struct flag_map fs_flags[] = {
 		.mask = XFS_FSOP_GEOM_SICK_QUOTACHECK,
 		.descr = "quota counts",
 	},
+	{
+		.mask = XFS_FSOP_GEOM_SICK_NLINKS,
+		.descr = "inode link counts",
+	},
 	{0},
 };
 



* [PATCH 2/3] xfs: teach scrub to check file nlinks
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of file link counts Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/3] xfs: report inode link count health Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/3] xfs_scrub: use multiple threads to run in-kernel metadata scrubs that scan inodes Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Copy-pasta the online quotacheck code to check inode link counts too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libfrog/scrub.c                     |    5 +++++
 libxfs/xfs_da_format.h              |   11 +++++++++++
 libxfs/xfs_dir2.c                   |    6 ++++++
 libxfs/xfs_dir2.h                   |    1 +
 libxfs/xfs_fs.h                     |    3 ++-
 man/man2/ioctl_xfs_scrub_metadata.2 |    4 ++++
 repair/phase6.c                     |    4 ----
 7 files changed, 29 insertions(+), 5 deletions(-)


diff --git a/libfrog/scrub.c b/libfrog/scrub.c
index 3718d56eae3..95daa78ba65 100644
--- a/libfrog/scrub.c
+++ b/libfrog/scrub.c
@@ -139,6 +139,11 @@ const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR] = {
 		.descr	= "quota counters",
 		.group	= XFROG_SCRUB_GROUP_ISCAN,
 	},
+	[XFS_SCRUB_TYPE_NLINKS] = {
+		.name	= "nlinks",
+		.descr	= "inode link counts",
+		.group	= XFROG_SCRUB_GROUP_ISCAN,
+	},
 };
 
 /* Invoke the scrub ioctl.  Returns zero or negative error code. */
diff --git a/libxfs/xfs_da_format.h b/libxfs/xfs_da_format.h
index 25e2841084e..9d332415e0b 100644
--- a/libxfs/xfs_da_format.h
+++ b/libxfs/xfs_da_format.h
@@ -159,6 +159,17 @@ struct xfs_da3_intnode {
 
 #define XFS_DIR3_FT_MAX			9
 
+#define XFS_DIR3_FTYPE_STR \
+	{ XFS_DIR3_FT_UNKNOWN,	"unknown" }, \
+	{ XFS_DIR3_FT_REG_FILE,	"file" }, \
+	{ XFS_DIR3_FT_DIR,	"directory" }, \
+	{ XFS_DIR3_FT_CHRDEV,	"char" }, \
+	{ XFS_DIR3_FT_BLKDEV,	"block" }, \
+	{ XFS_DIR3_FT_FIFO,	"fifo" }, \
+	{ XFS_DIR3_FT_SOCK,	"sock" }, \
+	{ XFS_DIR3_FT_SYMLINK,	"symlink" }, \
+	{ XFS_DIR3_FT_WHT,	"whiteout" }
+
 /*
  * Byte offset in data block and shortform entry.
  */
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index d6a192963f5..033b6e4c475 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -24,6 +24,12 @@ const struct xfs_name xfs_name_dotdot = {
 	.type	= XFS_DIR3_FT_DIR,
 };
 
+const struct xfs_name xfs_name_dot = {
+	.name	= (unsigned char *)".",
+	.len	= 1,
+	.type	= XFS_DIR3_FT_DIR,
+};
+
 /*
  * Convert inode mode to directory entry filetype
  */
diff --git a/libxfs/xfs_dir2.h b/libxfs/xfs_dir2.h
index dd39f17dd9a..15a36cf7ae8 100644
--- a/libxfs/xfs_dir2.h
+++ b/libxfs/xfs_dir2.h
@@ -22,6 +22,7 @@ struct xfs_dir3_icfree_hdr;
 struct xfs_dir3_icleaf_hdr;
 
 extern const struct xfs_name	xfs_name_dotdot;
+extern const struct xfs_name	xfs_name_dot;
 
 /*
  * Convert inode mode to directory entry filetype
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 2f9f13ba75b..3885c56078f 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -710,9 +710,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_PQUOTA	23	/* project quotas */
 #define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
 #define XFS_SCRUB_TYPE_QUOTACHECK 25	/* quota counters */
+#define XFS_SCRUB_TYPE_NLINKS	26	/* inode link counts */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	26
+#define XFS_SCRUB_TYPE_NR	27
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1u << 0)
diff --git a/man/man2/ioctl_xfs_scrub_metadata.2 b/man/man2/ioctl_xfs_scrub_metadata.2
index 42bf1e1cac5..db238de1bb5 100644
--- a/man/man2/ioctl_xfs_scrub_metadata.2
+++ b/man/man2/ioctl_xfs_scrub_metadata.2
@@ -164,6 +164,10 @@ Examine all user, group, or project quota records for corruption.
 .B XFS_SCRUB_TYPE_FSCOUNTERS
 Examine all filesystem summary counters (free blocks, inode count, free inode
 count) for errors.
+
+.TP
+.B XFS_SCRUB_TYPE_NLINKS
+Scan all inodes in the filesystem to verify each file's link count.
 .RE
 
 .PD 1
diff --git a/repair/phase6.c b/repair/phase6.c
index 0be2c9c9705..a7f658d4267 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -23,10 +23,6 @@ static struct cred		zerocr;
 static struct fsxattr 		zerofsx;
 static xfs_ino_t		orphanage_ino;
 
-static struct xfs_name		xfs_name_dot = {(unsigned char *)".",
-						1,
-						XFS_DIR3_FT_DIR};
-
 /*
  * Data structures used to keep track of directories where the ".."
  * entries are updated. These must be rebuilt after the initial pass



* [PATCH 3/3] xfs_scrub: use multiple threads to run in-kernel metadata scrubs that scan inodes
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of file link counts Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/3] xfs: report inode link count health Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/3] xfs: teach scrub to check file nlinks Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Instead of running the inode link count and quotacheck scanners in
serial, run them in parallel, with a slight delay to stagger the work to
reduce inode resource contention.
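
As a rough sketch of the staggering idea only -- the thread names and
one-second delay below are illustrative, and this uses plain pthreads
rather than the libfrog workqueue that the patch actually uses:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct scan_job {
	const char	*name;
	unsigned int	delay;		/* seconds to wait before starting */
};

static void *scan_worker(void *arg)
{
	struct scan_job	*job = arg;

	if (job->delay)
		sleep(job->delay);
	printf("starting %s scan\n", job->name);
	/* ... the real worker would run the in-kernel scrubber here ... */
	return NULL;
}

int main(void)
{
	struct scan_job	jobs[] = {
		{ .name = "nlinks",     .delay = 0 },	/* faster scan starts first */
		{ .name = "quotacheck", .delay = 1 },	/* staggered by one second */
	};
	pthread_t	tid[2];
	int		i;

	/* build with: cc -pthread stagger.c */
	for (i = 0; i < 2; i++)
		pthread_create(&tid[i], NULL, scan_worker, &jobs[i]);
	for (i = 0; i < 2; i++)
		pthread_join(tid[i], NULL);
	return 0;
}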

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase5.c |  136 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 scrub/scrub.c  |   18 ++++---
 scrub/scrub.h  |    1 
 3 files changed, 138 insertions(+), 17 deletions(-)


diff --git a/scrub/phase5.c b/scrub/phase5.c
index 123e3751ca1..622e58138db 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -383,12 +383,137 @@ check_fs_label(
 	return error;
 }
 
+typedef int (*iscan_item_fn)(struct scrub_ctx *, struct action_list *);
+
+struct iscan_item {
+	struct action_list	alist;
+	bool			*abortedp;
+	iscan_item_fn		scrub_fn;
+};
+
+/* Run one inode-scan scrubber in this thread. */
+static void
+iscan_worker(
+	struct workqueue	*wq,
+	xfs_agnumber_t		nr,
+	void			*arg)
+{
+	struct timespec		tv;
+	struct iscan_item	*item = arg;
+	struct scrub_ctx	*ctx = wq->wq_ctx;
+	int			ret;
+
+	/*
+	 * Delay each successive iscan by a second so that the threads are less
+	 * likely to contend on the inode buffers.
+	 */
+	if (nr) {
+		tv.tv_sec = nr;
+		tv.tv_nsec = 0;
+		nanosleep(&tv, NULL);
+	}
+
+	ret = item->scrub_fn(ctx, &item->alist);
+	if (ret) {
+		str_liberror(ctx, ret, _("checking iscan metadata"));
+		*item->abortedp = true;
+		goto out;
+	}
+
+	ret = action_list_process(ctx, ctx->mnt.fd, &item->alist,
+			ALP_COMPLAIN_IF_UNFIXED | ALP_NOPROGRESS);
+	if (ret) {
+		str_liberror(ctx, ret, _("repairing iscan metadata"));
+		*item->abortedp = true;
+		goto out;
+	}
+
+out:
+	free(item);
+	return;
+}
+
+/* Queue one inode-scan scrubber. */
+static int
+queue_iscan(
+	struct workqueue	*wq,
+	bool			*abortedp,
+	xfs_agnumber_t		nr,
+	iscan_item_fn		scrub_fn)
+{
+	struct iscan_item	*item;
+	struct scrub_ctx	*ctx = wq->wq_ctx;
+	int			ret;
+
+	item = malloc(sizeof(struct iscan_item));
+	if (!item) {
+		ret = ENOMEM;
+		str_liberror(ctx, ret, _("setting up iscan"));
+		return ret;
+	}
+	action_list_init(&item->alist);
+	item->scrub_fn = scrub_fn;
+	item->abortedp = abortedp;
+
+	ret = -workqueue_add(wq, iscan_worker, nr, item);
+	if (ret)
+		str_liberror(ctx, ret, _("queuing iscan work"));
+
+	return ret;
+}
+
+/* Run multiple inode-scan scrubbers at the same time. */
+static int
+run_kernel_iscan_scrubbers(
+	struct scrub_ctx	*ctx)
+{
+	struct workqueue	wq_iscan;
+	unsigned int		nr_threads = scrub_nproc_workqueue(ctx);
+	xfs_agnumber_t		nr = 0;
+	bool			aborted = false;
+	int			ret, ret2;
+
+	ret = -workqueue_create(&wq_iscan, (struct xfs_mount *)ctx,
+			nr_threads);
+	if (ret) {
+		str_liberror(ctx, ret, _("setting up iscan workqueue"));
+		return ret;
+	}
+
+	/*
+	 * The nlinks scanner is much faster than quotacheck because it only
+	 * walks directories, so we start it first.
+	 */
+	ret = queue_iscan(&wq_iscan, &aborted, nr, scrub_nlinks);
+	if (ret)
+		goto wait;
+
+	if (nr_threads > 1)
+		nr++;
+
+	ret = queue_iscan(&wq_iscan, &aborted, nr, scrub_quotacheck);
+	if (ret)
+		goto wait;
+
+wait:
+	ret2 = -workqueue_terminate(&wq_iscan);
+	if (ret2) {
+		str_liberror(ctx, ret2, _("joining iscan workqueue"));
+		if (!ret)
+			ret = ret2;
+	}
+	if (aborted && !ret)
+		ret = ECANCELED;
+
+	workqueue_destroy(&wq_iscan);
+	return ret;
+}
+
 /* Check directory connectivity. */
 int
 phase5_func(
 	struct scrub_ctx	*ctx)
 {
-	struct action_list	alist;
 	bool			aborted = false;
 	int			ret;
 
@@ -397,12 +522,7 @@ phase5_func(
 	 * after we've checked all inodes and repaired anything that could get
 	 * in the way of a scan.
 	 */
-	action_list_init(&alist);
-	ret = scrub_iscan_metadata(ctx, &alist);
-	if (ret)
-		return ret;
-	ret = action_list_process(ctx, ctx->mnt.fd, &alist,
-			ALP_COMPLAIN_IF_UNFIXED | ALP_NOPROGRESS);
+	ret = run_kernel_iscan_scrubbers(ctx);
 	if (ret)
 		return ret;
 
@@ -435,7 +555,7 @@ phase5_estimate(
 	int			*rshift)
 {
 	*items = scrub_estimate_iscan_work(ctx);
-	*nr_threads = scrub_nproc(ctx);
+	*nr_threads = scrub_nproc(ctx) * 2;
 	*rshift = 0;
 	return 0;
 }
diff --git a/scrub/scrub.c b/scrub/scrub.c
index f2dd9bb9d0b..fe5c8ade5d8 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -422,15 +422,6 @@ scrub_summary_metadata(
 	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, 0, alist);
 }
 
-/* Scrub all metadata requiring a full inode scan. */
-int
-scrub_iscan_metadata(
-	struct scrub_ctx		*ctx,
-	struct action_list		*alist)
-{
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_ISCAN, 0, alist);
-}
-
 /* Scrub /only/ the superblock summary counters. */
 int
 scrub_fs_counters(
@@ -449,6 +440,15 @@ scrub_quotacheck(
 	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, 0, alist);
 }
 
+/* Scrub /only/ the file link counters. */
+int
+scrub_nlinks(
+	struct scrub_ctx		*ctx,
+	struct action_list		*alist)
+{
+	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_NLINKS, 0, alist);
+}
+
 /* How many items do we have to check? */
 unsigned int
 scrub_estimate_ag_work(
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 42b91fbc3ed..430ad0fbd83 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -28,6 +28,7 @@ int scrub_iscan_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_quotacheck(struct scrub_ctx *ctx, struct action_list *alist);
+int scrub_nlinks(struct scrub_ctx *ctx, struct action_list *alist);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool can_scrub_inode(struct scrub_ctx *ctx);



* [PATCHSET v24.0 0/4] libxfs: indirect health reporting
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (40 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of file link counts Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/4] xfs_scrub: upload clean bills of health Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
                   ` (35 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This series enables the XFS health reporting infrastructure to remember
indirect health concerns when resources are scarce.  For example, if a
scrub notices that there's something wrong with an inode's metadata but
memory reclaim needs to free the incore inode, we want to record in the
perag data the fact that there was some inode somewhere with an error.
The perag structures never go away.

The first two patches in this series set that up, the third one provides
a means for xfs_scrub to tell the kernel that it can forget the indirect
problem report, and the last one teaches xfs_scrub to send that report
after a clean run.
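
To make the idea concrete, here's a toy model (not the kernel code; the
structures and the single flag below are invented, loosely mirroring the
XFS_SICK_AG_INODES bit added later in this series) of remembering a
dropped sick inode in its AG and then forgetting that memory after a
clean scan:

#include <stdio.h>

#define TOY_SICK_AG_INODES	(1U << 0)	/* indirect: a bad inode was dropped */

struct toy_perag {
	unsigned int	agno;
	unsigned int	sick;		/* sticky health bits; the perag never goes away */
};

struct toy_inode {
	unsigned long long	ino;
	unsigned int		agno;
	int			sick;	/* set by a failed verifier or scrub */
};

/* Called when memory reclaim frees the in-core inode. */
static void toy_inactivate(struct toy_perag *pags, const struct toy_inode *ip)
{
	if (ip->sick)
		pags[ip->agno].sick |= TOY_SICK_AG_INODES;
}

/* Called when a full, clean scrub proves nothing is wrong anymore. */
static void toy_clean_bill_of_health(struct toy_perag *pags, unsigned int agcount)
{
	for (unsigned int i = 0; i < agcount; i++)
		pags[i].sick &= ~TOY_SICK_AG_INODES;
}

int main(void)
{
	struct toy_perag	pags[2] = { { .agno = 0 }, { .agno = 1 } };
	struct toy_inode	ip = { .ino = 133, .agno = 1, .sick = 1 };

	toy_inactivate(pags, &ip);
	printf("AG 1 sick mask after reclaim: 0x%x\n", pags[1].sick);

	toy_clean_bill_of_health(pags, 2);
	printf("AG 1 sick mask after clean scrub: 0x%x\n", pags[1].sick);
	return 0;
}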

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=indirect-health-reporting
---
 libfrog/scrub.c                     |    5 ++++
 libxfs/xfs_fs.h                     |    4 ++-
 libxfs/xfs_health.h                 |   45 +++++++++++++++++++++++++++++++++++
 libxfs/xfs_inode_buf.c              |    2 +-
 man/man2/ioctl_xfs_scrub_metadata.2 |    6 +++++
 scrub/phase1.c                      |   38 ++++++++++++++++++++++++++++++
 scrub/repair.c                      |   15 ++++++++++++
 scrub/repair.h                      |    1 +
 scrub/scrub.c                       |   16 ++++++++----
 scrub/scrub.h                       |    1 +
 spaceman/health.c                   |    4 +++
 11 files changed, 129 insertions(+), 8 deletions(-)



* [PATCH 1/4] xfs: add secondary and indirect classes to the health tracking system
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: indirect health reporting Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/4] xfs_scrub: upload clean bills of health Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/4] xfs: update health status if we get a clean bill of health Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/4] xfs: remember sick inodes that get inactivated Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Establish two more classes of health tracking bits:

 * Indirect problems, which suggest problems in other health domains
   that we weren't able to preserve.

 * Secondary problems, which track state that's related to primary
   evidence of health problems.

The first class we'll use in an upcoming patch to record in the AG
health status the fact that we ran out of memory and had to inactivate
an inode with defective metadata.  The second class we use to indicate
that repair knows that an inode is bad and we need to fix it later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_health.h |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)


diff --git a/libxfs/xfs_health.h b/libxfs/xfs_health.h
index e2e1b95ddfb..b3733f756bb 100644
--- a/libxfs/xfs_health.h
+++ b/libxfs/xfs_health.h
@@ -31,6 +31,19 @@
  *  - !checked && sick  => errors have been observed during normal operation,
  *                         but the metadata has not been checked thoroughly
  *  - !checked && !sick => has not been examined since mount
+ *
+ * Evidence of health problems can be sorted into three basic categories:
+ *
+ * a) Primary evidence, which signals that something is defective within the
+ *    general grouping of metadata.
+ *
+ * b) Secondary evidence, which are side effects of primary problems but are
+ *    not themselves problems.  These can be forgotten when the primary
+ *    health problems are addressed.
+ *
+ * c) Indirect evidence, which points to something being wrong in another
+ *    group, but we had to release resources and this is all that's left of
+ *    that state.
  */
 
 struct xfs_mount;
@@ -105,6 +118,35 @@ struct xfs_da_args;
 				 XFS_SICK_INO_SYMLINK | \
 				 XFS_SICK_INO_PARENT)
 
+/* Secondary state related to (but not primary evidence of) health problems. */
+#define XFS_SICK_FS_SECONDARY	(0)
+#define XFS_SICK_RT_SECONDARY	(0)
+#define XFS_SICK_AG_SECONDARY	(0)
+#define XFS_SICK_INO_SECONDARY	(0)
+
+/* Evidence of health problems elsewhere. */
+#define XFS_SICK_FS_INDIRECT	(0)
+#define XFS_SICK_RT_INDIRECT	(0)
+#define XFS_SICK_AG_INDIRECT	(0)
+#define XFS_SICK_INO_INDIRECT	(0)
+
+/* All health masks. */
+#define XFS_SICK_FS_ALL	(XFS_SICK_FS_PRIMARY | \
+				 XFS_SICK_FS_SECONDARY | \
+				 XFS_SICK_FS_INDIRECT)
+
+#define XFS_SICK_RT_ALL	(XFS_SICK_RT_PRIMARY | \
+				 XFS_SICK_RT_SECONDARY | \
+				 XFS_SICK_RT_INDIRECT)
+
+#define XFS_SICK_AG_ALL	(XFS_SICK_AG_PRIMARY | \
+				 XFS_SICK_AG_SECONDARY | \
+				 XFS_SICK_AG_INDIRECT)
+
+#define XFS_SICK_INO_ALL	(XFS_SICK_INO_PRIMARY | \
+				 XFS_SICK_INO_SECONDARY | \
+				 XFS_SICK_INO_INDIRECT)
+
 /*
  * These functions must be provided by the xfs implementation.  Function
  * behavior with respect to the first argument should be as follows:



* [PATCH 2/4] xfs: remember sick inodes that get inactivated
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: indirect health reporting Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 3/4] xfs: update health status if we get a clean bill of health Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If an unhealthy inode gets inactivated, remember this fact in the
per-fs health summary.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_fs.h        |    1 +
 libxfs/xfs_health.h    |    7 +++++--
 libxfs/xfs_inode_buf.c |    2 +-
 spaceman/health.c      |    4 ++++
 4 files changed, 11 insertions(+), 3 deletions(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 3885c56078f..417cf85c0f7 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -292,6 +292,7 @@ struct xfs_ag_geometry {
 #define XFS_AG_GEOM_SICK_FINOBT	(1 << 7)  /* free inode index */
 #define XFS_AG_GEOM_SICK_RMAPBT	(1 << 8)  /* reverse mappings */
 #define XFS_AG_GEOM_SICK_REFCNTBT (1 << 9)  /* reference counts */
+#define XFS_AG_GEOM_SICK_INODES	(1 << 10) /* bad inodes were seen */
 
 /*
  * Structures for XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG & XFS_IOC_FSGROWFSRT
diff --git a/libxfs/xfs_health.h b/libxfs/xfs_health.h
index b3733f756bb..252334bc048 100644
--- a/libxfs/xfs_health.h
+++ b/libxfs/xfs_health.h
@@ -76,6 +76,7 @@ struct xfs_da_args;
 #define XFS_SICK_AG_FINOBT	(1 << 7)  /* free inode index */
 #define XFS_SICK_AG_RMAPBT	(1 << 8)  /* reverse mappings */
 #define XFS_SICK_AG_REFCNTBT	(1 << 9)  /* reference counts */
+#define XFS_SICK_AG_INODES	(1 << 10) /* inactivated bad inodes */
 
 /* Observable health issues for inode metadata. */
 #define XFS_SICK_INO_CORE	(1 << 0)  /* inode core */
@@ -86,6 +87,8 @@ struct xfs_da_args;
 #define XFS_SICK_INO_XATTR	(1 << 5)  /* extended attributes */
 #define XFS_SICK_INO_SYMLINK	(1 << 6)  /* symbolic link remote target */
 #define XFS_SICK_INO_PARENT	(1 << 7)  /* parent pointers */
+/* Don't propagate sick status to ag health summary during inactivation */
+#define XFS_SICK_INO_FORGET	(1 << 8)
 
 /* Primary evidence of health problems in a given group. */
 #define XFS_SICK_FS_PRIMARY	(XFS_SICK_FS_COUNTERS | \
@@ -122,12 +125,12 @@ struct xfs_da_args;
 #define XFS_SICK_FS_SECONDARY	(0)
 #define XFS_SICK_RT_SECONDARY	(0)
 #define XFS_SICK_AG_SECONDARY	(0)
-#define XFS_SICK_INO_SECONDARY	(0)
+#define XFS_SICK_INO_SECONDARY	(XFS_SICK_INO_FORGET)
 
 /* Evidence of health problems elsewhere. */
 #define XFS_SICK_FS_INDIRECT	(0)
 #define XFS_SICK_RT_INDIRECT	(0)
-#define XFS_SICK_AG_INDIRECT	(0)
+#define XFS_SICK_AG_INDIRECT	(XFS_SICK_AG_INODES)
 #define XFS_SICK_INO_INDIRECT	(0)
 
 /* All health masks. */
diff --git a/libxfs/xfs_inode_buf.c b/libxfs/xfs_inode_buf.c
index aad64c0a2e6..82eb3f91b9d 100644
--- a/libxfs/xfs_inode_buf.c
+++ b/libxfs/xfs_inode_buf.c
@@ -136,7 +136,7 @@ xfs_imap_to_bp(
 			imap->im_len, XBF_UNMAPPED, bpp, &xfs_inode_buf_ops);
 	if (xfs_metadata_is_sick(error))
 		xfs_agno_mark_sick(mp, xfs_daddr_to_agno(mp, imap->im_blkno),
-				XFS_SICK_AG_INOBT);
+				XFS_SICK_AG_INODES);
 	return error;
 }
 
diff --git a/spaceman/health.c b/spaceman/health.c
index 88b12c0b0ea..12fb67bab28 100644
--- a/spaceman/health.c
+++ b/spaceman/health.c
@@ -127,6 +127,10 @@ static const struct flag_map ag_flags[] = {
 		.descr = "reference count btree",
 		.has_fn = has_reflink,
 	},
+	{
+		.mask = XFS_AG_GEOM_SICK_INODES,
+		.descr = "overall inode state",
+	},
 	{0},
 };
 



* [PATCH 3/4] xfs: update health status if we get a clean bill of health
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: indirect health reporting Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/4] xfs_scrub: upload clean bills of health Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/4] xfs: add secondary and indirect classes to the health tracking system Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/4] xfs: remember sick inodes that get inactivated Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If scrub finds that everything is ok with the filesystem, we need a way
to tell the health tracking that it can let go of indirect health flags,
since indirect flags only mean that at some point in the past we lost
some context.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libfrog/scrub.c                     |    5 +++++
 libxfs/xfs_fs.h                     |    3 ++-
 man/man2/ioctl_xfs_scrub_metadata.2 |    6 ++++++
 scrub/scrub.c                       |    7 +------
 4 files changed, 14 insertions(+), 7 deletions(-)


diff --git a/libfrog/scrub.c b/libfrog/scrub.c
index 95daa78ba65..7cd241d9bce 100644
--- a/libfrog/scrub.c
+++ b/libfrog/scrub.c
@@ -144,6 +144,11 @@ const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR] = {
 		.descr	= "inode link counts",
 		.group	= XFROG_SCRUB_GROUP_ISCAN,
 	},
+	[XFS_SCRUB_TYPE_HEALTHY] = {
+		.name	= "healthy",
+		.descr	= "retained health records",
+		.group	= XFROG_SCRUB_GROUP_NONE,
+	},
 };
 
 /* Invoke the scrub ioctl.  Returns zero or negative error code. */
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 417cf85c0f7..400cf68e551 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -712,9 +712,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
 #define XFS_SCRUB_TYPE_QUOTACHECK 25	/* quota counters */
 #define XFS_SCRUB_TYPE_NLINKS	26	/* inode link counts */
+#define XFS_SCRUB_TYPE_HEALTHY	27	/* everything checked out ok */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	27
+#define XFS_SCRUB_TYPE_NR	28
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1u << 0)
diff --git a/man/man2/ioctl_xfs_scrub_metadata.2 b/man/man2/ioctl_xfs_scrub_metadata.2
index db238de1bb5..01f11058839 100644
--- a/man/man2/ioctl_xfs_scrub_metadata.2
+++ b/man/man2/ioctl_xfs_scrub_metadata.2
@@ -168,6 +168,12 @@ count) for errors.
 .TP
 .B XFS_SCRUB_TYPE_NLINKS
 Scan all inodes in the filesystem to verify each file's link count.
+
+.TP
+.B XFS_SCRUB_TYPE_HEALTHY
+Mark everything healthy after a clean scrub run.
+This clears out all the indirect health problem markers that might remain
+in the system.
 .RE
 
 .PD 1
diff --git a/scrub/scrub.c b/scrub/scrub.c
index fe5c8ade5d8..7f80b2de211 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -39,20 +39,15 @@ format_scrub_descr(
 	case XFROG_SCRUB_GROUP_PERAG:
 		return snprintf(buf, buflen, _("AG %u %s"), meta->sm_agno,
 				_(sc->descr));
-		break;
 	case XFROG_SCRUB_GROUP_INODE:
 		return scrub_render_ino_descr(ctx, buf, buflen,
 				meta->sm_ino, meta->sm_gen, "%s",
 				_(sc->descr));
-		break;
 	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
 	case XFROG_SCRUB_GROUP_ISCAN:
-		return snprintf(buf, buflen, _("%s"), _(sc->descr));
-		break;
 	case XFROG_SCRUB_GROUP_NONE:
-		assert(0);
-		break;
+		return snprintf(buf, buflen, _("%s"), _(sc->descr));
 	}
 	return -1;
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs_scrub: upload clean bills of health
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: indirect health reporting Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/4] xfs: add secondary and indirect classes to the health tracking system Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If scrub terminates with a clean bill of health, tell the kernel that
the result of the scan is that everything's healthy.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c |   38 ++++++++++++++++++++++++++++++++++++++
 scrub/repair.c |   15 +++++++++++++++
 scrub/repair.h |    1 +
 scrub/scrub.c  |    9 +++++++++
 scrub/scrub.h  |    1 +
 5 files changed, 64 insertions(+)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 2daf5c7bb38..cecb5e861f4 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -44,6 +44,40 @@ xfs_shutdown_fs(
 		str_errno(ctx, ctx->mntpoint);
 }
 
+/*
+ * If we haven't found /any/ problems at all, tell the kernel that we're giving
+ * the filesystem a clean bill of health.
+ */
+static int
+report_to_kernel(
+	struct scrub_ctx	*ctx)
+{
+	struct action_list	alist;
+	int			ret;
+
+	if (!ctx->scrub_setup_succeeded || ctx->corruptions_found ||
+	    ctx->runtime_errors || ctx->unfixable_errors ||
+	    ctx->warnings_found)
+		return 0;
+
+	action_list_init(&alist);
+	ret = scrub_clean_health(ctx, &alist);
+	if (ret)
+		return ret;
+
+	/*
+	 * Complain if we cannot file the clean bill of health, unless we're
+	 * just testing repairs.
+	 */
+	if (action_list_length(&alist) > 0 &&
+	    !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
+		str_info(ctx, _("Couldn't upload clean bill of health."), NULL);
+		action_list_discard(&alist);
+	}
+
+	return 0;
+}
+
 /* Clean up the XFS-specific state data. */
 int
 scrub_cleanup(
@@ -51,6 +85,10 @@ scrub_cleanup(
 {
 	int			error;
 
+	error = report_to_kernel(ctx);
+	if (error)
+		return error;
+
 	action_lists_free(&ctx->action_lists);
 	if (ctx->fshandle)
 		free_handle(ctx->fshandle, ctx->fshandle_len);
diff --git a/scrub/repair.c b/scrub/repair.c
index 8a1ae0226a0..bb0faceca69 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -172,6 +172,21 @@ action_lists_alloc(
 	return 0;
 }
 
+/* Discard repair list contents. */
+void
+action_list_discard(
+	struct action_list		*alist)
+{
+	struct action_item		*aitem;
+	struct action_item		*n;
+
+	list_for_each_entry_safe(aitem, n, &alist->list, list) {
+		alist->nr--;
+		list_del(&aitem->list);
+		free(aitem);
+	}
+}
+
 /* Free the repair lists. */
 void
 action_lists_free(
diff --git a/scrub/repair.h b/scrub/repair.h
index 102e5779c70..d7a3a9e7cff 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -24,6 +24,7 @@ static inline bool action_list_empty(const struct action_list *alist)
 
 unsigned long long action_list_length(struct action_list *alist);
 void action_list_add(struct action_list *dest, struct action_item *item);
+void action_list_discard(struct action_list *alist);
 void action_list_splice(struct action_list *dest, struct action_list *src);
 
 void action_list_find_mustfix(struct action_list *actions,
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 7f80b2de211..975dd8efbbc 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -444,6 +444,15 @@ scrub_nlinks(
 	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_NLINKS, 0, alist);
 }
 
+/* Update incore health records if we were clean. */
+int
+scrub_clean_health(
+	struct scrub_ctx		*ctx,
+	struct action_list		*alist)
+{
+	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, 0, alist);
+}
+
 /* How many items do we have to check? */
 unsigned int
 scrub_estimate_ag_work(
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 430ad0fbd83..0622677376c 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -29,6 +29,7 @@ int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_quotacheck(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_nlinks(struct scrub_ctx *ctx, struct action_list *alist);
+int scrub_clean_health(struct scrub_ctx *ctx, struct action_list *alist);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool can_scrub_inode(struct scrub_ctx *ctx);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/9] libxfs: support in-memory btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (41 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: indirect health reporting Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/9] libxfs: add xfile support Darrick J. Wong
                     ` (8 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
                   ` (34 subsequent siblings)
  77 siblings, 9 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Online repair of the reverse-mapping btrees presents some unique
challenges.  To construct a new reverse mapping btree, we must scan the
entire filesystem, but we cannot afford to quiesce the entire filesystem
for the potentially lengthy scan.

For rmap btrees, therefore, we relax our requirements of totally atomic
repairs.  Instead, repairs will scan all inodes, construct a new reverse
mapping dataset, format a new btree, and commit it before anyone trips
over the corruption.  This is exactly the same strategy as was used in
the quotacheck and nlink scanners.

Unfortunately, the xfarray cannot perform key-based lookups and is
therefore unsuitable for supporting live updates.  Luckily, we already have a
data structure that maintains an indexed rmap recordset -- the existing
rmap btree code!  Hence we port the existing btree and buffer target
code to be able to create a btree using the xfile we developed earlier.
Live hooks keep the in-memory btree up to date for any resources that
have already been scanned.

This approach is not maximally memory efficient, but we can use the same
rmap code that we do everywhere else, which provides improved stability
without growing the code base even more.  Note that in-memory btree
blocks are always page sized.

This patchset modifies the kernel xfs buffer cache to be capable of
using a xfile (aka a shmem file) as a backing device.  It then augments
the btree code to support creating btree cursors with buffers that come
from a buftarg other than the data device (namely an xfile-backed
buftarg).  For the userspace xfs buffer cache, we instead use a memfd or
an O_TMPFILE file as a backing device.
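
(Purely as a sketch of that userspace fallback -- the real code is in the
xfile patch later in this series -- the backing fd selection boils down to
roughly this:)

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <fcntl.h>

	/* Sketch: pick a backing fd for pageable staging memory. */
	static int open_backing_fd(const char *description)
	{
		int	fd;

		/* Prefer an anonymous memfd; it never appears in a directory. */
		fd = memfd_create(description, MFD_CLOEXEC);
		if (fd >= 0)
			return fd;

		/* Fall back to an unlinked temporary file on a tmpfs mount. */
		return open("/dev/shm", O_TMPFILE | O_CLOEXEC | O_RDWR, 0600);
	}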

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=in-memory-btrees
---
 configure.ac                |    3 
 copy/xfs_copy.c             |    2 
 db/init.c                   |    7 
 db/sb.c                     |    3 
 include/builddefs.in        |    3 
 include/libxfs.h            |    5 
 include/xfs_mount.h         |   10 +
 include/xfs_trace.h         |   15 +
 include/xfs_trans.h         |    1 
 libfrog/bitmap.c            |   64 +++
 libfrog/bitmap.h            |    3 
 libxfs/Makefile             |   15 +
 libxfs/init.c               |  131 +++++--
 libxfs/libxfs_io.h          |   38 ++
 libxfs/libxfs_priv.h        |    2 
 libxfs/rdwr.c               |  117 +++++-
 libxfs/trans.c              |   40 ++
 libxfs/xfbtree.c            |  797 +++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfbtree.h            |   57 +++
 libxfs/xfile.c              |  258 ++++++++++++++
 libxfs/xfile.h              |  108 ++++++
 libxfs/xfs_btree.c          |  173 ++++++++-
 libxfs/xfs_btree.h          |   17 +
 libxfs/xfs_btree_mem.h      |  128 +++++++
 libxfs/xfs_refcount_btree.c |    4 
 libxfs/xfs_rmap_btree.c     |    4 
 logprint/logprint.c         |    2 
 m4/package_libcdev.m4       |   50 +++
 mkfs/xfs_mkfs.c             |    4 
 repair/prefetch.c           |   12 -
 repair/prefetch.h           |    1 
 repair/progress.c           |   14 -
 repair/progress.h           |    2 
 repair/scan.c               |    2 
 repair/xfs_repair.c         |   32 +-
 35 files changed, 1970 insertions(+), 154 deletions(-)
 create mode 100644 libxfs/xfbtree.c
 create mode 100644 libxfs/xfbtree.h
 create mode 100644 libxfs/xfile.c
 create mode 100644 libxfs/xfile.h
 create mode 100644 libxfs/xfs_btree_mem.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/9] libxfs: clean up xfs_da_unmount usage
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/9] libxfs: add xfile support Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/9] libxfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Replace the open-coded xfs_da_unmount usage in libxfs_umount and teach
libxfs_mount not to leak the dir/attr geometry structures when the mount
attempt fails.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/init.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)


diff --git a/libxfs/init.c b/libxfs/init.c
index 93dc1f1c599..f21dbc6732b 100644
--- a/libxfs/init.c
+++ b/libxfs/init.c
@@ -842,7 +842,7 @@ libxfs_mount(
 	if (error) {
 		fprintf(stderr, _("%s: data size check failed\n"), progname);
 		if (!xfs_is_debugger(mp))
-			return NULL;
+			goto out_da;
 	} else
 		libxfs_buf_relse(bp);
 
@@ -856,7 +856,7 @@ libxfs_mount(
 			fprintf(stderr, _("%s: log size checks failed\n"),
 					progname);
 			if (!xfs_is_debugger(mp))
-				return NULL;
+				goto out_da;
 		}
 		if (bp)
 			libxfs_buf_relse(bp);
@@ -865,8 +865,8 @@ libxfs_mount(
 	/* Initialize realtime fields in the mount structure */
 	if (rtmount_init(mp)) {
 		fprintf(stderr, _("%s: realtime device init failed\n"),
-			progname);
-			return NULL;
+				progname);
+			goto out_da;
 	}
 
 	/*
@@ -884,7 +884,7 @@ libxfs_mount(
 			fprintf(stderr, _("%s: read of AG %u failed\n"),
 						progname, sbp->sb_agcount);
 			if (!xfs_is_debugger(mp))
-				return NULL;
+				goto out_da;
 			fprintf(stderr, _("%s: limiting reads to AG 0\n"),
 								progname);
 			sbp->sb_agcount = 1;
@@ -902,6 +902,9 @@ libxfs_mount(
 	xfs_set_perag_data_loaded(mp);
 
 	return mp;
+out_da:
+	xfs_da_unmount(mp);
+	return NULL;
 }
 
 void
@@ -1024,9 +1027,7 @@ libxfs_umount(
 	if (xfs_is_perag_data_loaded(mp))
 		libxfs_free_perag(mp);
 
-	kmem_free(mp->m_attr_geo);
-	kmem_free(mp->m_dir_geo);
-
+	xfs_da_unmount(mp);
 	kmem_free(mp->m_rtdev_targp);
 	if (mp->m_logdev_targp != mp->m_ddev_targp)
 		kmem_free(mp->m_logdev_targp);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/9] libxfs: teach buftargs to maintain their own buffer hashtable
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/9] libxfs: add xfile support Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/9] libxfs: clean up xfs_da_unmount usage Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 9/9] xfbtree: let the buffer cache flush dirty buffers to the xfile Darrick J. Wong
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, cached buffers are indexed with a single global bcache
structure.  This works ok for the limited use case where we only support
reading from the data device, but will fail badly when we want to
support buffers from in-memory btrees.  Move the bcache structure into
the buftarg.
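
(As a before/after illustration -- hypothetical caller, not part of the
patch -- cache maintenance now goes through the mount's buftargs rather
than one process-wide cache:)

	/* Before: one global cache shared by every open device. */
	libxfs_bcache_purge();

	/* After: each buftarg owns a cache, so callers name the mount... */
	libxfs_bcache_purge(mp);	/* purges the data, log, and rt caches */

	/* ...and per-device reporting reaches into the buftarg directly. */
	cache_report(stderr, "libxfs_bcache", mp->m_ddev_targp->bcache);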

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 copy/xfs_copy.c     |    2 +
 db/init.c           |    7 +++-
 db/sb.c             |    3 +-
 include/libxfs.h    |    3 --
 include/xfs_mount.h |    2 +
 libxfs/init.c       |   81 +++++++++++++++++++++++++++++----------------------
 libxfs/libxfs_io.h  |   14 +++++----
 libxfs/rdwr.c       |   40 +++++++++++++++++--------
 logprint/logprint.c |    2 +
 mkfs/xfs_mkfs.c     |    4 +--
 repair/prefetch.c   |   12 +++++---
 repair/prefetch.h   |    1 +
 repair/progress.c   |   14 ++++++---
 repair/progress.h   |    2 +
 repair/scan.c       |    2 +
 repair/xfs_repair.c |   32 +++++++++++---------
 16 files changed, 130 insertions(+), 91 deletions(-)


diff --git a/copy/xfs_copy.c b/copy/xfs_copy.c
index 79f65946709..45f8485799e 100644
--- a/copy/xfs_copy.c
+++ b/copy/xfs_copy.c
@@ -733,7 +733,7 @@ main(int argc, char **argv)
 	memset(&mbuf, 0, sizeof(xfs_mount_t));
 
 	/* We don't yet know the sector size, so read maximal size */
-	libxfs_buftarg_init(&mbuf, xargs.ddev, xargs.logdev, xargs.rtdev);
+	libxfs_buftarg_init(&mbuf, xargs.ddev, xargs.logdev, xargs.rtdev, 0);
 	error = -libxfs_buf_read_uncached(mbuf.m_ddev_targp, XFS_SB_DADDR,
 			1 << (XFS_MAX_SECTORSIZE_LOG - BBSHIFT), 0, &sbp, NULL);
 	if (error) {
diff --git a/db/init.c b/db/init.c
index eec65d0884d..9f045d27076 100644
--- a/db/init.c
+++ b/db/init.c
@@ -97,7 +97,6 @@ init(
 	else
 		x.dname = fsdevice;
 
-	x.bcache_flags = CACHE_MISCOMPARE_PURGE;
 	if (!libxfs_init(&x)) {
 		fputs(_("\nfatal error -- couldn't initialize XFS library\n"),
 			stderr);
@@ -109,7 +108,8 @@ init(
 	 * tool and so need to be able to mount busted filesystems.
 	 */
 	memset(&xmount, 0, sizeof(struct xfs_mount));
-	libxfs_buftarg_init(&xmount, x.ddev, x.logdev, x.rtdev);
+	libxfs_buftarg_init(&xmount, x.ddev, x.logdev, x.rtdev,
+			XFS_BUFTARG_MISCOMPARE_PURGE);
 	error = -libxfs_buf_read_uncached(xmount.m_ddev_targp, XFS_SB_DADDR,
 			1 << (XFS_MAX_SECTORSIZE_LOG - BBSHIFT), 0, &bp, NULL);
 	if (error) {
@@ -134,7 +134,8 @@ init(
 
 	agcount = sbp->sb_agcount;
 	mp = libxfs_mount(&xmount, sbp, x.ddev, x.logdev, x.rtdev,
-			  LIBXFS_MOUNT_DEBUGGER);
+			  LIBXFS_MOUNT_DEBUGGER |
+			  LIBXFS_MOUNT_CACHE_MISCOMPARE_PURGE);
 	if (!mp) {
 		fprintf(stderr,
 			_("%s: device %s unusable (not an XFS filesystem?)\n"),
diff --git a/db/sb.c b/db/sb.c
index 2d508c26a3b..fd81286cd60 100644
--- a/db/sb.c
+++ b/db/sb.c
@@ -233,7 +233,8 @@ sb_logcheck(void)
 		}
 	}
 
-	libxfs_buftarg_init(mp, x.ddev, x.logdev, x.rtdev);
+	libxfs_buftarg_init(mp, x.ddev, x.logdev, x.rtdev,
+			XFS_BUFTARG_MISCOMPARE_PURGE);
 
 	dirty = xlog_is_dirty(mp, mp->m_log, &x, 0);
 	if (dirty == -1) {
diff --git a/include/libxfs.h b/include/libxfs.h
index 915bf511313..b07da6c03ee 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -123,8 +123,6 @@ typedef struct libxfs_xinit {
 	int             dfd;            /* data subvolume file descriptor */
 	int             logfd;          /* log subvolume file descriptor */
 	int             rtfd;           /* realtime subvolume file descriptor */
-	int		icache_flags;	/* cache init flags */
-	int		bcache_flags;	/* cache init flags */
 } libxfs_init_t;
 
 #define LIBXFS_ISREADONLY	0x0002	/* disallow all mounted filesystems */
@@ -141,7 +139,6 @@ extern int	libxfs_device_to_fd (dev_t);
 extern dev_t	libxfs_device_open (char *, int, int, int);
 extern void	libxfs_device_close (dev_t);
 extern int	libxfs_device_alignment (void);
-extern void	libxfs_report(FILE *);
 
 /* check or write log footer: specify device, log size in blocks & uuid */
 typedef char	*(libxfs_get_block_t)(char *, int, void *);
diff --git a/include/xfs_mount.h b/include/xfs_mount.h
index acd9214da3a..6be85bf21d2 100644
--- a/include/xfs_mount.h
+++ b/include/xfs_mount.h
@@ -256,6 +256,8 @@ __XFS_UNSUPP_OPSTATE(shutdown)
 #define LIBXFS_MOUNT_DEBUGGER		(1U << 0)
 /* report metadata corruption to stdout */
 #define LIBXFS_MOUNT_REPORT_CORRUPTION	(1U << 1)
+/* purge buffer cache on miscompares */
+#define LIBXFS_MOUNT_CACHE_MISCOMPARE_PURGE	(1U << 2)
 
 #define LIBXFS_BHASHSIZE(sbp) 		(1<<10)
 
diff --git a/libxfs/init.c b/libxfs/init.c
index f21dbc6732b..5e90bf733b7 100644
--- a/libxfs/init.c
+++ b/libxfs/init.c
@@ -31,7 +31,6 @@ pthread_mutex_t	atomic64_lock = PTHREAD_MUTEX_INITIALIZER;
 
 char *progname = "libxfs";	/* default, changed by each tool */
 
-struct cache *libxfs_bcache;	/* global buffer cache */
 int libxfs_bhash_size;		/* #buckets in bcache */
 
 int	use_xfs_buf_lock;	/* global flag: use xfs_buf locks for MT */
@@ -407,8 +406,6 @@ libxfs_init(libxfs_init_t *a)
 	}
 	if (!libxfs_bhash_size)
 		libxfs_bhash_size = LIBXFS_BHASHSIZE(sbp);
-	libxfs_bcache = cache_init(a->bcache_flags, libxfs_bhash_size,
-				   &libxfs_bcache_operations);
 	use_xfs_buf_lock = a->usebuflock;
 	xfs_dir_startup();
 	init_caches();
@@ -592,9 +589,14 @@ static struct xfs_buftarg *
 libxfs_buftarg_alloc(
 	struct xfs_mount	*mp,
 	dev_t			dev,
-	unsigned long		write_fails)
+	unsigned long		write_fails,
+	unsigned int		buftarg_flags)
 {
 	struct xfs_buftarg	*btp;
+	unsigned int		bcache_flags = 0;
+
+	if (!write_fails)
+		buftarg_flags &= ~XFS_BUFTARG_INJECT_WRITE_FAIL;
 
 	btp = malloc(sizeof(*btp));
 	if (!btp) {
@@ -604,13 +606,15 @@ libxfs_buftarg_alloc(
 	}
 	btp->bt_mount = mp;
 	btp->bt_bdev = dev;
-	btp->flags = 0;
-	if (write_fails) {
-		btp->writes_left = write_fails;
-		btp->flags |= XFS_BUFTARG_INJECT_WRITE_FAIL;
-	}
+	btp->flags = buftarg_flags;
+	btp->writes_left = write_fails;
+	if (btp->flags & XFS_BUFTARG_MISCOMPARE_PURGE)
+		bcache_flags |= CACHE_MISCOMPARE_PURGE;
 	pthread_mutex_init(&btp->lock, NULL);
 
+	btp->bcache = cache_init(bcache_flags, libxfs_bhash_size,
+			&libxfs_bcache_operations);
+
 	return btp;
 }
 
@@ -633,10 +637,12 @@ libxfs_buftarg_init(
 	struct xfs_mount	*mp,
 	dev_t			dev,
 	dev_t			logdev,
-	dev_t			rtdev)
+	dev_t			rtdev,
+	unsigned int		btflags)
 {
 	char			*p = getenv("LIBXFS_DEBUG_WRITE_CRASH");
 	unsigned long		dfail = 0, lfail = 0, rfail = 0;
+	unsigned int		dflags = 0, lflags = 0, rflags = 0;
 
 	/* Simulate utility crash after a certain number of writes. */
 	while (p && *p) {
@@ -650,6 +656,8 @@ libxfs_buftarg_init(
 				exit(1);
 			}
 			dfail = strtoul(val, NULL, 0);
+			if (dfail)
+				dflags |= XFS_BUFTARG_INJECT_WRITE_FAIL;
 			break;
 		case WF_LOG:
 			if (!val) {
@@ -658,6 +666,8 @@ libxfs_buftarg_init(
 				exit(1);
 			}
 			lfail = strtoul(val, NULL, 0);
+			if (lfail)
+				lflags |= XFS_BUFTARG_INJECT_WRITE_FAIL;
 			break;
 		case WF_RT:
 			if (!val) {
@@ -666,6 +676,8 @@ libxfs_buftarg_init(
 				exit(1);
 			}
 			rfail = strtoul(val, NULL, 0);
+			if (rfail)
+				rflags |= XFS_BUFTARG_INJECT_WRITE_FAIL;
 			break;
 		default:
 			fprintf(stderr, _("unknown write fail type %s\n"),
@@ -708,12 +720,15 @@ libxfs_buftarg_init(
 		return;
 	}
 
-	mp->m_ddev_targp = libxfs_buftarg_alloc(mp, dev, dfail);
+	mp->m_ddev_targp = libxfs_buftarg_alloc(mp, dev, dfail,
+			dflags | btflags);
 	if (!logdev || logdev == dev)
 		mp->m_logdev_targp = mp->m_ddev_targp;
 	else
-		mp->m_logdev_targp = libxfs_buftarg_alloc(mp, logdev, lfail);
-	mp->m_rtdev_targp = libxfs_buftarg_alloc(mp, rtdev, rfail);
+		mp->m_logdev_targp = libxfs_buftarg_alloc(mp, logdev, lfail,
+				lflags | btflags);
+	mp->m_rtdev_targp = libxfs_buftarg_alloc(mp, rtdev, rfail,
+			rflags | btflags);
 }
 
 /* Compute maximum possible height for per-AG btree types for this fs. */
@@ -760,14 +775,18 @@ libxfs_mount(
 	struct xfs_buf		*bp;
 	struct xfs_sb		*sbp;
 	xfs_daddr_t		d;
+	unsigned int		btflags = 0;
 	int			error;
 
+
 	mp->m_features = xfs_sb_version_to_features(sb);
 	if (flags & LIBXFS_MOUNT_DEBUGGER)
 		xfs_set_debugger(mp);
 	if (flags & LIBXFS_MOUNT_REPORT_CORRUPTION)
 		xfs_set_reporting_corruption(mp);
-	libxfs_buftarg_init(mp, dev, logdev, rtdev);
+	if (flags & LIBXFS_MOUNT_CACHE_MISCOMPARE_PURGE)
+		btflags |= XFS_BUFTARG_MISCOMPARE_PURGE;
+	libxfs_buftarg_init(mp, dev, logdev, rtdev, btflags);
 
 	mp->m_finobt_nores = true;
 	xfs_set_inode32(mp);
@@ -975,7 +994,7 @@ libxfs_flush_mount(
 	 * LOST_WRITE flag to be set in the buftarg.  Once that's done,
 	 * instruct the disks to persist their write caches.
 	 */
-	libxfs_bcache_flush();
+	libxfs_bcache_flush(mp);
 
 	/* Flush all kernel and disk write caches, and report failures. */
 	if (mp->m_ddev_targp) {
@@ -1001,6 +1020,14 @@ libxfs_flush_mount(
 	return error;
 }
 
+static void
+libxfs_buftarg_free(
+	struct xfs_buftarg	*btp)
+{
+	cache_destroy(btp->bcache);
+	kmem_free(btp);
+}
+
 /*
  * Release any resource obtained during a mount.
  */
@@ -1017,7 +1044,7 @@ libxfs_umount(
 	 * all incore buffers, then pick up the outcome when we tell the disks
 	 * to persist their write caches.
 	 */
-	libxfs_bcache_purge();
+	libxfs_bcache_purge(mp);
 	error = libxfs_flush_mount(mp);
 
 	/*
@@ -1028,10 +1055,10 @@ libxfs_umount(
 		libxfs_free_perag(mp);
 
 	xfs_da_unmount(mp);
-	kmem_free(mp->m_rtdev_targp);
+	libxfs_buftarg_free(mp->m_rtdev_targp);
 	if (mp->m_logdev_targp != mp->m_ddev_targp)
-		kmem_free(mp->m_logdev_targp);
-	kmem_free(mp->m_ddev_targp);
+		libxfs_buftarg_free(mp->m_logdev_targp);
+	libxfs_buftarg_free(mp->m_ddev_targp);
 
 	return error;
 }
@@ -1047,10 +1074,7 @@ libxfs_destroy(
 
 	libxfs_close_devices(li);
 
-	/* Free everything from the buffer cache before freeing buffer cache */
-	libxfs_bcache_purge();
 	libxfs_bcache_free();
-	cache_destroy(libxfs_bcache);
 	leaked = destroy_caches();
 	rcu_unregister_thread();
 	if (getenv("LIBXFS_LEAK_CHECK") && leaked)
@@ -1062,16 +1086,3 @@ libxfs_device_alignment(void)
 {
 	return platform_align_blockdev();
 }
-
-void
-libxfs_report(FILE *fp)
-{
-	time_t t;
-	char *c;
-
-	cache_report(fp, "libxfs_bcache", libxfs_bcache);
-
-	t = time(NULL);
-	c = asctime(localtime(&t));
-	fprintf(fp, "%s", c);
-}
diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 4ffe788d446..3fa9e75dcaa 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -26,6 +26,7 @@ struct xfs_buftarg {
 	unsigned long		writes_left;
 	dev_t			bt_bdev;
 	unsigned int		flags;
+	struct cache		*bcache;	/* global buffer cache */
 };
 
 /* We purged a dirty buffer and lost a write. */
@@ -34,6 +35,8 @@ struct xfs_buftarg {
 #define XFS_BUFTARG_CORRUPT_WRITE	(1 << 1)
 /* Simulate failure after a certain number of writes. */
 #define XFS_BUFTARG_INJECT_WRITE_FAIL	(1 << 2)
+/* purge buffers when lookups find a size mismatch */
+#define XFS_BUFTARG_MISCOMPARE_PURGE	(1 << 3)
 
 /* Simulate the system crashing after a certain number of writes. */
 static inline void
@@ -50,8 +53,8 @@ xfs_buftarg_trip_write(
 	pthread_mutex_unlock(&btp->lock);
 }
 
-extern void	libxfs_buftarg_init(struct xfs_mount *mp, dev_t ddev,
-				    dev_t logdev, dev_t rtdev);
+void libxfs_buftarg_init(struct xfs_mount *mp, dev_t ddev, dev_t logdev,
+		dev_t rtdev, unsigned int btflags);
 int libxfs_blkdev_issue_flush(struct xfs_buftarg *btp);
 
 #define LIBXFS_BBTOOFF64(bbs)	(((xfs_off_t)(bbs)) << BBSHIFT)
@@ -139,7 +142,6 @@ int libxfs_buf_priority(struct xfs_buf *bp);
 
 /* Buffer Cache Interfaces */
 
-extern struct cache	*libxfs_bcache;
 extern struct cache_operations	libxfs_bcache_operations;
 
 #define LIBXFS_GETBUF_TRYLOCK	(1 << 0)
@@ -183,10 +185,10 @@ libxfs_buf_read(
 
 int libxfs_readbuf_verify(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
 struct xfs_buf *libxfs_getsb(struct xfs_mount *mp);
-extern void	libxfs_bcache_purge(void);
+extern void	libxfs_bcache_purge(struct xfs_mount *mp);
 extern void	libxfs_bcache_free(void);
-extern void	libxfs_bcache_flush(void);
-extern int	libxfs_bcache_overflowed(void);
+extern void	libxfs_bcache_flush(struct xfs_mount *mp);
+extern int	libxfs_bcache_overflowed(struct xfs_mount *mp);
 
 /* Buffer (Raw) Interfaces */
 int		libxfs_bwrite(struct xfs_buf *bp);
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index d5aad3ea210..5d63ec4f6de 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -198,18 +198,21 @@ libxfs_bhash(cache_key_t key, unsigned int hashsize, unsigned int hashshift)
 }
 
 static int
-libxfs_bcompare(struct cache_node *node, cache_key_t key)
+libxfs_bcompare(
+	struct cache_node	*node,
+	cache_key_t		key)
 {
 	struct xfs_buf		*bp = container_of(node, struct xfs_buf,
 						   b_node);
 	struct xfs_bufkey	*bkey = (struct xfs_bufkey *)key;
+	struct cache		*bcache = bkey->buftarg->bcache;
 
 	if (bp->b_target->bt_bdev == bkey->buftarg->bt_bdev &&
 	    bp->b_cache_key == bkey->blkno) {
 		if (bp->b_length == bkey->bblen)
 			return CACHE_HIT;
 #ifdef IO_BCOMPARE_CHECK
-		if (!(libxfs_bcache->c_flags & CACHE_MISCOMPARE_PURGE)) {
+		if (!(bcache->c_flags & CACHE_MISCOMPARE_PURGE)) {
 			fprintf(stderr,
 	"%lx: Badness in key lookup (length)\n"
 	"bp=(bno 0x%llx, len %u bytes) key=(bno 0x%llx, len %u bytes)\n",
@@ -399,11 +402,12 @@ __cache_lookup(
 	struct xfs_buf		**bpp)
 {
 	struct cache_node	*cn = NULL;
+	struct cache		*bcache = key->buftarg->bcache;
 	struct xfs_buf		*bp;
 
 	*bpp = NULL;
 
-	cache_node_get(libxfs_bcache, key, &cn);
+	cache_node_get(bcache, key, &cn);
 	if (!cn)
 		return -ENOMEM;
 	bp = container_of(cn, struct xfs_buf, b_node);
@@ -415,7 +419,7 @@ __cache_lookup(
 		if (ret) {
 			ASSERT(ret == EAGAIN);
 			if (flags & LIBXFS_GETBUF_TRYLOCK) {
-				cache_node_put(libxfs_bcache, cn);
+				cache_node_put(bcache, cn);
 				return -EAGAIN;
 			}
 
@@ -434,7 +438,7 @@ __cache_lookup(
 		bp->b_holder = pthread_self();
 	}
 
-	cache_node_set_priority(libxfs_bcache, cn,
+	cache_node_set_priority(bcache, cn,
 			cache_node_get_priority(cn) - CACHE_PREFETCH_PRIORITY);
 	*bpp = bp;
 	return 0;
@@ -550,7 +554,7 @@ libxfs_buf_relse(
 	}
 
 	if (!list_empty(&bp->b_node.cn_hash))
-		cache_node_put(libxfs_bcache, &bp->b_node);
+		cache_node_put(bp->b_target->bcache, &bp->b_node);
 	else if (--bp->b_node.cn_count == 0) {
 		if (bp->b_flags & LIBXFS_B_DIRTY)
 			libxfs_bwrite(bp);
@@ -1004,21 +1008,31 @@ libxfs_bflush(
 }
 
 void
-libxfs_bcache_purge(void)
+libxfs_bcache_purge(struct xfs_mount *mp)
 {
-	cache_purge(libxfs_bcache);
+	if (!mp)
+		return;
+	cache_purge(mp->m_ddev_targp->bcache);
+	cache_purge(mp->m_logdev_targp->bcache);
+	cache_purge(mp->m_rtdev_targp->bcache);
 }
 
 void
-libxfs_bcache_flush(void)
+libxfs_bcache_flush(struct xfs_mount *mp)
 {
-	cache_flush(libxfs_bcache);
+	if (!mp)
+		return;
+	cache_flush(mp->m_ddev_targp->bcache);
+	cache_flush(mp->m_logdev_targp->bcache);
+	cache_flush(mp->m_rtdev_targp->bcache);
 }
 
 int
-libxfs_bcache_overflowed(void)
+libxfs_bcache_overflowed(struct xfs_mount *mp)
 {
-	return cache_overflowed(libxfs_bcache);
+	return cache_overflowed(mp->m_ddev_targp->bcache) ||
+		cache_overflowed(mp->m_logdev_targp->bcache) ||
+		cache_overflowed(mp->m_rtdev_targp->bcache);
 }
 
 struct cache_operations libxfs_bcache_operations = {
@@ -1460,7 +1474,7 @@ libxfs_buf_set_priority(
 	struct xfs_buf	*bp,
 	int		priority)
 {
-	cache_node_set_priority(libxfs_bcache, &bp->b_node, priority);
+	cache_node_set_priority(bp->b_target->bcache, &bp->b_node, priority);
 }
 
 int
diff --git a/logprint/logprint.c b/logprint/logprint.c
index 9a8811f467c..df70553543b 100644
--- a/logprint/logprint.c
+++ b/logprint/logprint.c
@@ -213,7 +213,7 @@ main(int argc, char **argv)
 		exit(1);
 
 	logstat(&mount);
-	libxfs_buftarg_init(&mount, x.ddev, x.logdev, x.rtdev);
+	libxfs_buftarg_init(&mount, x.ddev, x.logdev, x.rtdev, 0);
 
 	logfd = (x.logfd < 0) ? x.dfd : x.logfd;
 
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 31861a2eb3c..638e7ce6ea4 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -4518,7 +4518,7 @@ main(
 	/*
 	 * we need the libxfs buffer cache from here on in.
 	 */
-	libxfs_buftarg_init(mp, xi.ddev, xi.logdev, xi.rtdev);
+	libxfs_buftarg_init(mp, xi.ddev, xi.logdev, xi.rtdev, 0);
 
 	/*
 	 * Before we mount the filesystem we need to make sure the devices have
@@ -4587,7 +4587,7 @@ main(
 	 * Need to drop references to inodes we still hold, first.
 	 */
 	libxfs_rtmount_destroy(mp);
-	libxfs_bcache_purge();
+	libxfs_bcache_purge(mp);
 
 	/*
 	 * Mark the filesystem ok.
diff --git a/repair/prefetch.c b/repair/prefetch.c
index 017750e9a92..5665e0a224c 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -886,10 +886,12 @@ init_prefetch(
 
 prefetch_args_t *
 start_inode_prefetch(
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
 	int			dirs_only,
 	prefetch_args_t		*prev_args)
 {
+	struct cache		*bcache = mp->m_ddev_targp->bcache;
 	prefetch_args_t		*args;
 	long			max_queue;
 	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
@@ -914,7 +916,7 @@ start_inode_prefetch(
 	 * and not any other associated metadata like directories
 	 */
 
-	max_queue = libxfs_bcache->c_maxcount / thread_count / 8;
+	max_queue = bcache->c_maxcount / thread_count / 8;
 	if (igeo->inode_cluster_size > mp->m_sb.sb_blocksize)
 		max_queue = max_queue * igeo->blocks_per_cluster /
 				igeo->ialloc_blks;
@@ -970,14 +972,16 @@ prefetch_ag_range(
 	void			(*func)(struct workqueue *,
 					xfs_agnumber_t, void *))
 {
+	struct xfs_mount	*mp = work->wq_ctx;
 	int			i;
 	struct prefetch_args	*pf_args[2];
 
-	pf_args[start_ag & 1] = start_inode_prefetch(start_ag, dirs_only, NULL);
+	pf_args[start_ag & 1] = start_inode_prefetch(mp, start_ag, dirs_only,
+			NULL);
 	for (i = start_ag; i < end_ag; i++) {
 		/* Don't prefetch end_ag */
 		if (i + 1 < end_ag)
-			pf_args[(~i) & 1] = start_inode_prefetch(i + 1,
+			pf_args[(~i) & 1] = start_inode_prefetch(mp, i + 1,
 						dirs_only, pf_args[i & 1]);
 		func(work, i, pf_args[i & 1]);
 	}
@@ -1027,7 +1031,7 @@ do_inode_prefetch(
 	 * filesystem - it's all in the cache. In that case, run a thread per
 	 * CPU to maximise parallelism of the queue to be processed.
 	 */
-	if (check_cache && !libxfs_bcache_overflowed()) {
+	if (check_cache && !libxfs_bcache_overflowed(mp)) {
 		queue.wq_ctx = mp;
 		create_work_queue(&queue, mp, platform_nproc());
 		for (i = 0; i < mp->m_sb.sb_agcount; i++)
diff --git a/repair/prefetch.h b/repair/prefetch.h
index 54ece48ad22..a8c52a1195b 100644
--- a/repair/prefetch.h
+++ b/repair/prefetch.h
@@ -39,6 +39,7 @@ init_prefetch(
 
 prefetch_args_t *
 start_inode_prefetch(
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
 	int			dirs_only,
 	prefetch_args_t		*prev_args);
diff --git a/repair/progress.c b/repair/progress.c
index f6c4d988444..625dc41c289 100644
--- a/repair/progress.c
+++ b/repair/progress.c
@@ -383,14 +383,18 @@ timediff(int phase)
 **  array.
 */
 char *
-timestamp(int end, int phase, char *buf)
+timestamp(
+	struct xfs_mount	*mp,
+	int			end,
+	int			phase,
+	char			*buf)
 {
 
-	time_t    now;
-	struct tm *tmp;
+	time_t			now;
+	struct tm		*tmp;
 
-	if (verbose > 1)
-		cache_report(stderr, "libxfs_bcache", libxfs_bcache);
+	if (verbose > 1 && mp && mp->m_ddev_targp)
+		cache_report(stderr, "libxfs_bcache", mp->m_ddev_targp->bcache);
 
 	now = time(NULL);
 
diff --git a/repair/progress.h b/repair/progress.h
index 2c1690db1b1..75b751b783b 100644
--- a/repair/progress.h
+++ b/repair/progress.h
@@ -37,7 +37,7 @@ extern void stop_progress_rpt(void);
 extern void summary_report(void);
 extern int  set_progress_msg(int report, uint64_t total);
 extern uint64_t print_final_rpt(void);
-extern char *timestamp(int end, int phase, char *buf);
+extern char *timestamp(struct xfs_mount *mp, int end, int phase, char *buf);
 extern char *duration(int val, char *buf);
 extern int do_parallel;
 
diff --git a/repair/scan.c b/repair/scan.c
index 008ef65ac75..ac2233b93b7 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -42,7 +42,7 @@ struct aghdr_cnts {
 void
 set_mp(xfs_mount_t *mpp)
 {
-	libxfs_bcache_purge();
+	libxfs_bcache_purge(mp);
 	mp = mpp;
 }
 
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index ff29bea9743..e49d4292ad4 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -944,9 +944,11 @@ repair_capture_writeback(
 }
 
 static inline void
-phase_end(int phase)
+phase_end(
+	struct xfs_mount	*mp,
+	int			phase)
 {
-	timestamp(PHASE_END, phase, NULL);
+	timestamp(mp, PHASE_END, phase, NULL);
 
 	/* Fail if someone injected an post-phase error. */
 	if (fail_after_phase && phase == fail_after_phase)
@@ -981,8 +983,8 @@ main(int argc, char **argv)
 
 	msgbuf = malloc(DURATION_BUF_SIZE);
 
-	timestamp(PHASE_START, 0, NULL);
-	phase_end(0);
+	timestamp(temp_mp, PHASE_START, 0, NULL);
+	phase_end(temp_mp, 0);
 
 	/* -f forces this, but let's be nice and autodetect it, as well. */
 	if (!isa_file) {
@@ -1005,7 +1007,7 @@ main(int argc, char **argv)
 
 	/* do phase1 to make sure we have a superblock */
 	phase1(temp_mp);
-	phase_end(1);
+	phase_end(temp_mp, 1);
 
 	if (no_modify && primary_sb_modified)  {
 		do_warn(_("Primary superblock would have been modified.\n"
@@ -1142,8 +1144,8 @@ main(int argc, char **argv)
 		unsigned long	max_mem;
 		struct rlimit	rlim;
 
-		libxfs_bcache_purge();
-		cache_destroy(libxfs_bcache);
+		libxfs_bcache_purge(mp);
+		cache_destroy(mp->m_ddev_targp->bcache);
 
 		mem_used = (mp->m_sb.sb_icount >> (10 - 2)) +
 					(mp->m_sb.sb_dblocks >> (10 + 1)) +
@@ -1203,7 +1205,7 @@ main(int argc, char **argv)
 			do_log(_("        - block cache size set to %d entries\n"),
 				libxfs_bhash_size * HASH_CACHE_RATIO);
 
-		libxfs_bcache = cache_init(0, libxfs_bhash_size,
+		mp->m_ddev_targp->bcache = cache_init(0, libxfs_bhash_size,
 						&libxfs_bcache_operations);
 	}
 
@@ -1231,16 +1233,16 @@ main(int argc, char **argv)
 
 	/* make sure the per-ag freespace maps are ok so we can mount the fs */
 	phase2(mp, phase2_threads);
-	phase_end(2);
+	phase_end(mp, 2);
 
 	if (do_prefetch)
 		init_prefetch(mp);
 
 	phase3(mp, phase2_threads);
-	phase_end(3);
+	phase_end(mp, 3);
 
 	phase4(mp);
-	phase_end(4);
+	phase_end(mp, 4);
 
 	if (no_modify) {
 		printf(_("No modify flag set, skipping phase 5\n"));
@@ -1250,7 +1252,7 @@ main(int argc, char **argv)
 	} else {
 		phase5(mp);
 	}
-	phase_end(5);
+	phase_end(mp, 5);
 
 	/*
 	 * Done with the block usage maps, toss them...
@@ -1260,10 +1262,10 @@ main(int argc, char **argv)
 
 	if (!bad_ino_btree)  {
 		phase6(mp);
-		phase_end(6);
+		phase_end(mp, 6);
 
 		phase7(mp, phase2_threads);
-		phase_end(7);
+		phase_end(mp, 7);
 	} else  {
 		do_warn(
 _("Inode allocation btrees are too corrupted, skipping phases 6 and 7\n"));
@@ -1388,7 +1390,7 @@ _("Note - stripe unit (%d) and width (%d) were copied from a backup superblock.\
 	 * verifiers are run (where we discover the max metadata LSN), reformat
 	 * the log if necessary and unmount.
 	 */
-	libxfs_bcache_flush();
+	libxfs_bcache_flush(mp);
 	format_log_max_lsn(mp);
 
 	if (xfs_sb_version_needsrepair(&mp->m_sb))


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/9] libxfs: add xfile support
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/9] libxfs: clean up xfs_da_unmount usage Darrick J. Wong
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Port the xfile functionality (anonymous, pageable, file-indexed memory) from
the kernel.
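
(A minimal usage sketch of the API added here; error handling is trimmed
and "mp" is assumed to be an already-initialized xfs_mount:)

	struct xfile	*xf;
	uint64_t	val = 42, readback;
	int		error;

	error = xfile_create(mp, "example staging data", &xf);
	if (error)
		return error;

	/* Store an object at byte offset 0 of the xfile, then read it back. */
	error = xfile_obj_store(xf, &val, sizeof(val), 0);
	if (!error)
		error = xfile_obj_load(xf, &readback, sizeof(readback), 0);

	xfile_destroy(xf);
	return error;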

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 configure.ac          |    3 +
 include/builddefs.in  |    3 +
 libxfs/Makefile       |   12 +++
 libxfs/xfile.c        |  224 +++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfile.h        |   56 ++++++++++++
 m4/package_libcdev.m4 |   50 +++++++++++
 6 files changed, 348 insertions(+)
 create mode 100644 libxfs/xfile.c
 create mode 100644 libxfs/xfile.h


diff --git a/configure.ac b/configure.ac
index 20445a98d84..6c704464061 100644
--- a/configure.ac
+++ b/configure.ac
@@ -251,6 +251,9 @@ AC_CHECK_SIZEOF([char *])
 AC_TYPE_UMODE_T
 AC_MANUAL_FORMAT
 AC_HAVE_LIBURCU_ATOMIC64
+AC_HAVE_MEMFD_CLOEXEC
+AC_HAVE_O_TMPFILE
+AC_HAVE_MKOSTEMP_CLOEXEC
 
 AC_CONFIG_FILES([include/builddefs])
 AC_OUTPUT
diff --git a/include/builddefs.in b/include/builddefs.in
index e0a2f3cbc95..60c1320af37 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -127,6 +127,9 @@ SYSTEMD_SYSTEM_UNIT_DIR = @systemd_system_unit_dir@
 HAVE_CROND = @have_crond@
 CROND_DIR = @crond_dir@
 HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@
+HAVE_MEMFD_CLOEXEC = @have_memfd_cloexec@
+HAVE_O_TMPFILE = @have_o_tmpfile@
+HAVE_MKOSTEMP_CLOEXEC = @have_mkostemp_cloexec@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/libxfs/Makefile b/libxfs/Makefile
index 010ee68e229..2007be570ed 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -26,6 +26,7 @@ HFILES = \
 	libxfs_priv.h \
 	linux-err.h \
 	topology.h \
+	xfile.h \
 	xfs_ag_resv.h \
 	xfs_alloc.h \
 	xfs_alloc_btree.h \
@@ -65,6 +66,7 @@ CFILES = cache.c \
 	topology.c \
 	trans.c \
 	util.c \
+	xfile.c \
 	xfs_ag.c \
 	xfs_ag_resv.c \
 	xfs_alloc.c \
@@ -111,6 +113,16 @@ CFILES = cache.c \
 #
 #LCFLAGS +=
 
+ifeq ($(HAVE_MEMFD_CLOEXEC),yes)
+	LCFLAGS += -DHAVE_MEMFD_CLOEXEC
+endif
+ifeq ($(HAVE_O_TMPFILE),yes)
+	LCFLAGS += -DHAVE_O_TMPFILE
+endif
+ifeq ($(HAVE_MKOSTEMP_CLOEXEC),yes)
+	LCFLAGS += -DHAVE_MKOSTEMP_CLOEXEC
+endif
+
 FCFLAGS = -I.
 
 LTLIBS = $(LIBPTHREAD) $(LIBRT)
diff --git a/libxfs/xfile.c b/libxfs/xfile.c
new file mode 100644
index 00000000000..357ffb0077d
--- /dev/null
+++ b/libxfs/xfile.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs_priv.h"
+#include "libxfs.h"
+#include "libxfs/xfile.h"
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+/*
+ * Swappable Temporary Memory
+ * ==========================
+ *
+ * Offline checking sometimes needs to be able to stage a large amount of data
+ * in memory.  This information might not fit in the available memory and it
+ * doesn't all need to be accessible at all times.  In other words, we want an
+ * indexed data buffer to store data that can be paged out.
+ *
+ * memfd files meet those requirements.  Therefore, the xfile mechanism uses
+ * one to store our staging data.  The xfile must be freed with xfile_destroy.
+ *
+ * xfiles assume that the caller will handle all required concurrency
+ * management; file locks are not taken.
+ */
+
+/*
+ * Open a memory-backed fd to back an xfile.  We require close-on-exec here,
+ * because these memfd files function as windowed RAM and hence should never
+ * be shared with other processes.
+ */
+static int
+xfile_create_fd(
+	const char		*description)
+{
+	int			fd = -1;
+
+#ifdef HAVE_MEMFD_CLOEXEC
+	/* memfd_create exists in kernel 3.17 (2014) and glibc 2.27 (2018). */
+	fd = memfd_create(description, MFD_CLOEXEC);
+	if (fd >= 0)
+		return fd;
+#endif
+
+#ifdef HAVE_O_TMPFILE
+	/*
+	 * O_TMPFILE exists as of kernel 3.11 (2013), which means that if we
+	 * find it, we're pretty safe in assuming O_CLOEXEC exists too.
+	 */
+	fd = open("/dev/shm", O_TMPFILE | O_CLOEXEC | O_RDWR, 0600);
+	if (fd >= 0)
+		return fd;
+
+	fd = open("/tmp", O_TMPFILE | O_CLOEXEC | O_RDWR, 0600);
+	if (fd >= 0)
+		return fd;
+#endif
+
+#ifdef HAVE_MKOSTEMP_CLOEXEC
+	/*
+	 * mkostemp exists as of glibc 2.7 (2007) and O_CLOEXEC exists as of
+	 * kernel 2.6.23 (2007).
+	 */
+	fd = mkostemp("libxfsXXXXXX", O_CLOEXEC);
+	if (fd >= 0)
+		return fd;
+#endif
+
+#if !defined(HAVE_MEMFD_CLOEXEC) && \
+    !defined(HAVE_O_TMPFILE) && \
+    !defined(HAVE_MKOSTEMP_CLOEXEC)
+# error System needs memfd_create, O_TMPFILE, or O_CLOEXEC to build!
+#endif
+
+	return fd;
+}
+
+/*
+ * Create an xfile of the given size.  The description will be used in the
+ * trace output.
+ */
+int
+xfile_create(
+	struct xfs_mount	*mp,
+	const char		*description,
+	struct xfile		**xfilep)
+{
+	struct xfile		*xf;
+	char			fname[MAXNAMELEN];
+	int			error;
+
+	snprintf(fname, MAXNAMELEN - 1, "XFS (%s): %s", mp->m_fsname,
+			description);
+	fname[MAXNAMELEN - 1] = 0;
+
+	xf = kmem_alloc(sizeof(struct xfile), KM_MAYFAIL);
+	if (!xf)
+		return -ENOMEM;
+
+	xf->fd = xfile_create_fd(fname);
+	if (xf->fd < 0) {
+		error = -errno;
+		kmem_free(xf);
+		return error;
+	}
+
+	*xfilep = xf;
+	return 0;
+}
+
+/* Close the file and release all resources. */
+void
+xfile_destroy(
+	struct xfile		*xf)
+{
+	close(xf->fd);
+	kmem_free(xf);
+}
+
+static inline loff_t
+xfile_maxbytes(
+	struct xfile		*xf)
+{
+	if (sizeof(loff_t) == 8)
+		return LLONG_MAX;
+	return LONG_MAX;
+}
+
+/*
+ * Read a memory object directly from the xfile's page cache.  Unlike regular
+ * pread, we return -E2BIG and -EFBIG for reads that are too large or at too
+ * high an offset, instead of truncating the read.  Otherwise, we return
+ * bytes read or an error code, like regular pread.
+ */
+ssize_t
+xfile_pread(
+	struct xfile		*xf,
+	void			*buf,
+	size_t			count,
+	loff_t			pos)
+{
+	ssize_t			ret;
+
+	if (count > INT_MAX)
+		return -E2BIG;
+	if (xfile_maxbytes(xf) - pos < count)
+		return -EFBIG;
+
+	ret = pread(xf->fd, buf, count, pos);
+	if (ret >= 0)
+		return ret;
+	return -errno;
+}
+
+/*
+ * Write a memory object directly to the xfile's page cache.  Unlike regular
+ * pwrite, we return -E2BIG and -EFBIG for writes that are too large or at too
+ * high an offset, instead of truncating the write.  Otherwise, we return
+ * bytes written or an error code, like regular pwrite.
+ */
+ssize_t
+xfile_pwrite(
+	struct xfile		*xf,
+	void			*buf,
+	size_t			count,
+	loff_t			pos)
+{
+	ssize_t			ret;
+
+	if (count > INT_MAX)
+		return -E2BIG;
+	if (xfile_maxbytes(xf) - pos < count)
+		return -EFBIG;
+
+	ret = pwrite(xf->fd, buf, count, pos);
+	if (ret >= 0)
+		return ret;
+	return -errno;
+}
+
+/* Query stat information for an xfile. */
+int
+xfile_stat(
+	struct xfile		*xf,
+	struct xfile_stat	*statbuf)
+{
+	struct stat		ks;
+	int			error;
+
+	error = fstat(xf->fd, &ks);
+	if (error)
+		return -errno;
+
+	statbuf->size = ks.st_size;
+	statbuf->bytes = (unsigned long long)ks.st_blocks << 9;
+	return 0;
+}
+
+/* Dump an xfile to stdout. */
+int
+xfile_dump(
+	struct xfile		*xf)
+{
+	char			*argv[] = {"od", "-tx1", "-Ad", "-c", NULL};
+	pid_t			child;
+	int			i;
+
+	child = fork();
+	if (child != 0) {
+		int		wstatus;
+
+		wait(&wstatus);
+		return wstatus == 0 ? 0 : -EIO;
+	}
+
+	/* reroute our xfile to stdin and shut everything else */
+	dup2(xf->fd, 0);
+	for (i = 3; i < 1024; i++)
+		close(i);
+
+	return execvp("od", argv);
+}
diff --git a/libxfs/xfile.h b/libxfs/xfile.h
new file mode 100644
index 00000000000..ad13f62ee0f
--- /dev/null
+++ b/libxfs/xfile.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __LIBXFS_XFILE_H__
+#define __LIBXFS_XFILE_H__
+
+struct xfile {
+	int		fd;
+};
+
+int xfile_create(struct xfs_mount *mp, const char *description,
+		struct xfile **xfilep);
+void xfile_destroy(struct xfile *xf);
+
+ssize_t xfile_pread(struct xfile *xf, void *buf, size_t count, loff_t pos);
+ssize_t xfile_pwrite(struct xfile *xf, void *buf, size_t count, loff_t pos);
+
+/*
+ * Load an object.  Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+static inline int
+xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t pos)
+{
+	ssize_t	ret = xfile_pread(xf, buf, count, pos);
+
+	if (ret < 0 || ret != count)
+		return -ENOMEM;
+	return 0;
+}
+
+/*
+ * Store an object.  Since we're treating this file as "memory", any error or
+ * short IO is treated as a failure to allocate memory.
+ */
+static inline int
+xfile_obj_store(struct xfile *xf, void *buf, size_t count, loff_t pos)
+{
+	ssize_t	ret = xfile_pwrite(xf, buf, count, pos);
+
+	if (ret < 0 || ret != count)
+		return -ENOMEM;
+	return 0;
+}
+
+struct xfile_stat {
+	loff_t			size;
+	unsigned long long	bytes;
+};
+
+int xfile_stat(struct xfile *xf, struct xfile_stat *statbuf);
+int xfile_dump(struct xfile *xf);
+
+#endif /* __LIBXFS_XFILE_H__ */
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index bb1ab49c11e..119d1bda74d 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -507,3 +507,53 @@ AC_DEFUN([AC_PACKAGE_CHECK_LTO],
     AC_SUBST(lto_cflags)
     AC_SUBST(lto_ldflags)
   ])
+
+#
+# Check if we have a memfd_create syscall with a MFD_CLOEXEC flag
+#
+AC_DEFUN([AC_HAVE_MEMFD_CLOEXEC],
+  [ AC_MSG_CHECKING([for memfd_create and MFD_CLOEXEC])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <sys/mman.h>
+    ]], [[
+         return memfd_create("xfs", MFD_CLOEXEC);
+    ]])],[have_memfd_cloexec=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_memfd_cloexec)
+  ])
+
+#
+# Check if we have the O_TMPFILE flag
+#
+AC_DEFUN([AC_HAVE_O_TMPFILE],
+  [ AC_MSG_CHECKING([for O_TMPFILE])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+    ]], [[
+         return open("nowhere", O_TMPFILE, 0600);
+    ]])],[have_o_tmpfile=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_o_tmpfile)
+  ])
+
+#
+# Check if we have mkostemp with the O_CLOEXEC flag
+#
+AC_DEFUN([AC_HAVE_MKOSTEMP_CLOEXEC],
+  [ AC_MSG_CHECKING([for mkostemp and O_CLOEXEC])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdlib.h>
+    ]], [[
+         return mkostemp("nowhere", O_CLOEXEC);
+    ]])],[have_mkostemp_cloexec=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_mkostemp_cloexec)
+  ])


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/9] libxfs: support in-memory buffer cache targets
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 9/9] xfbtree: let the buffer cache flush dirty buffers to the xfile Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 8/9] xfs: connect in-memory btrees to xfiles Darrick J. Wong
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the buffer cache to target in-memory files by connecting it to
xfiles.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_io.h |   14 +++++++++++++-
 libxfs/rdwr.c      |   47 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 56 insertions(+), 5 deletions(-)


diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 3fa9e75dcaa..c002ef058ec 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -24,7 +24,10 @@ struct xfs_buftarg {
 	struct xfs_mount	*bt_mount;
 	pthread_mutex_t		lock;
 	unsigned long		writes_left;
-	dev_t			bt_bdev;
+	union {
+		struct xfile	*bt_xfile;
+		dev_t		bt_bdev;
+	};
 	unsigned int		flags;
 	struct cache		*bcache;	/* global buffer cache */
 };
@@ -37,6 +40,15 @@ struct xfs_buftarg {
 #define XFS_BUFTARG_INJECT_WRITE_FAIL	(1 << 2)
 /* purge buffers when lookups find a size mismatch */
 #define XFS_BUFTARG_MISCOMPARE_PURGE	(1 << 3)
+/* use an xfile (in-memory file) to back this target's buffers */
+#define XFS_BUFTARG_IN_MEMORY		(1 << 4)
+
+static inline bool
+xfs_buftarg_in_memory(
+	struct xfs_buftarg	*btp)
+{
+	return btp->flags & XFS_BUFTARG_IN_MEMORY;
+}
 
 /* Simulate the system crashing after a certain number of writes. */
 static inline void
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 5d63ec4f6de..9d36698bb5c 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -18,7 +18,7 @@
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "libfrog/platform.h"
-
+#include "libxfs/xfile.h"
 #include "libxfs.h"
 
 static void libxfs_brelse(struct cache_node *node);
@@ -68,6 +68,9 @@ libxfs_device_zero(struct xfs_buftarg *btp, xfs_daddr_t start, uint len)
 	char		*z;
 	int		error, fd;
 
+	if (btp->flags & XFS_BUFTARG_IN_MEMORY)
+		return -EOPNOTSUPP;
+
 	fd = libxfs_device_to_fd(btp->bt_bdev);
 	start_offset = LIBXFS_BBTOOFF64(start);
 
@@ -578,6 +581,31 @@ libxfs_balloc(
 	return &bp->b_node;
 }
 
+static inline int
+libxfs_buf_ioapply_in_memory(
+	struct xfs_buf		*bp,
+	bool			is_write)
+{
+	struct xfile		*xfile = bp->b_target->bt_xfile;
+	loff_t			pos = BBTOB(xfs_buf_daddr(bp));
+	size_t			size = BBTOB(bp->b_length);
+	int			error;
+
+	if (bp->b_nmaps > 1) {
+		/* We don't need or support multi-map buffers. */
+		ASSERT(0);
+		error = -EIO;
+	} else if (is_write) {
+		error = xfile_obj_store(xfile, bp->b_addr, size, pos);
+	} else {
+		error = xfile_obj_load(xfile, bp->b_addr, size, pos);
+	}
+	if (error)
+		bp->b_error = error;
+	else if (!is_write)
+		bp->b_flags |= LIBXFS_B_UPTODATE;
+	return error;
+}
 
 static int
 __read_buf(int fd, void *buf, int len, off64_t offset, int flags)
@@ -602,12 +630,16 @@ int
 libxfs_readbufr(struct xfs_buftarg *btp, xfs_daddr_t blkno, struct xfs_buf *bp,
 		int len, int flags)
 {
-	int	fd = libxfs_device_to_fd(btp->bt_bdev);
+	int	fd;
 	int	bytes = BBTOB(len);
 	int	error;
 
 	ASSERT(len <= bp->b_length);
 
+	if (bp->b_target->flags & XFS_BUFTARG_IN_MEMORY)
+		return libxfs_buf_ioapply_in_memory(bp, false);
+
+	fd = libxfs_device_to_fd(btp->bt_bdev);
 	error = __read_buf(fd, bp->b_addr, bytes, LIBXFS_BBTOOFF64(blkno), flags);
 	if (!error &&
 	    bp->b_target->bt_bdev == btp->bt_bdev &&
@@ -640,6 +672,9 @@ libxfs_readbufr_map(struct xfs_buftarg *btp, struct xfs_buf *bp, int flags)
 	void	*buf;
 	int	i;
 
+	if (bp->b_target->flags & XFS_BUFTARG_IN_MEMORY)
+		return libxfs_buf_ioapply_in_memory(bp, false);
+
 	fd = libxfs_device_to_fd(btp->bt_bdev);
 	buf = bp->b_addr;
 	for (i = 0; i < bp->b_nmaps; i++) {
@@ -824,7 +859,7 @@ int
 libxfs_bwrite(
 	struct xfs_buf	*bp)
 {
-	int		fd = libxfs_device_to_fd(bp->b_target->bt_bdev);
+	int		fd;
 
 	/*
 	 * we never write buffers that are marked stale. This indicates they
@@ -859,7 +894,10 @@ libxfs_bwrite(
 		}
 	}
 
-	if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
+	if (bp->b_target->flags & XFS_BUFTARG_IN_MEMORY) {
+		libxfs_buf_ioapply_in_memory(bp, true);
+	} else if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
+		fd = libxfs_device_to_fd(bp->b_target->bt_bdev);
 		bp->b_error = __write_buf(fd, bp->b_addr, BBTOB(bp->b_length),
 				    LIBXFS_BBTOOFF64(xfs_buf_daddr(bp)),
 				    bp->b_flags);
@@ -867,6 +905,7 @@ libxfs_bwrite(
 		int	i;
 		void	*buf = bp->b_addr;
 
+		fd = libxfs_device_to_fd(bp->b_target->bt_bdev);
 		for (i = 0; i < bp->b_nmaps; i++) {
 			off64_t	offset = LIBXFS_BBTOOFF64(bp->b_maps[i].bm_bn);
 			int len = BBTOB(bp->b_maps[i].bm_len);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/9] xfs: consolidate btree block freeing tracepoints
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 8/9] xfs: connect in-memory btrees to xfiles Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 7/9] xfs: support in-memory btrees Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 6/9] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't waste tracepoint segment memory on per-btree block freeing
tracepoints when we can do it from the generic btree code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/xfs_trace.h         |    3 +--
 libxfs/xfs_btree.c          |    2 ++
 libxfs/xfs_refcount_btree.c |    2 --
 libxfs/xfs_rmap_btree.c     |    2 --
 4 files changed, 3 insertions(+), 6 deletions(-)


diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 19b05f6e25e..0a7581b5794 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -61,6 +61,7 @@
 #define trace_xfs_btree_commit_ifakeroot(a)	((void) 0)
 #define trace_xfs_btree_bload_level_geometry(a,b,c,d,e,f,g) ((void) 0)
 #define trace_xfs_btree_bload_block(a,b,c,d,e,f) ((void) 0)
+#define trace_xfs_btree_free_block(...)		((void) 0)
 
 #define trace_xfs_free_extent(a,b,c,d,e,f,g)	((void) 0)
 #define trace_xfs_agf(a,b,c,d)			((void) 0)
@@ -243,7 +244,6 @@
 #define trace_xfs_rmap_find_left_neighbor_result(...)	((void) 0)
 #define trace_xfs_rmap_lookup_le_range_result(...)	((void) 0)
 
-#define trace_xfs_rmapbt_free_block(...)	((void) 0)
 #define trace_xfs_rmapbt_alloc_block(...)	((void) 0)
 
 #define trace_xfs_ag_resv_critical(...)		((void) 0)
@@ -263,7 +263,6 @@
 #define trace_xfs_refcount_insert_error(...)	((void) 0)
 #define trace_xfs_refcount_delete(...)		((void) 0)
 #define trace_xfs_refcount_delete_error(...)	((void) 0)
-#define trace_xfs_refcountbt_free_block(...)	((void) 0)
 #define trace_xfs_refcountbt_alloc_block(...)	((void) 0)
 #define trace_xfs_refcount_rec_order_error(...)	((void) 0)
 
diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index e0b9f075015..d7501da87ce 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -411,6 +411,8 @@ xfs_btree_free_block(
 {
 	int			error;
 
+	trace_xfs_btree_free_block(cur, bp);
+
 	error = cur->bc_ops->free_block(cur, bp);
 	if (!error) {
 		xfs_trans_binval(cur->bc_tp, bp);
diff --git a/libxfs/xfs_refcount_btree.c b/libxfs/xfs_refcount_btree.c
index 0a8e80e705f..c1dd2fe8d37 100644
--- a/libxfs/xfs_refcount_btree.c
+++ b/libxfs/xfs_refcount_btree.c
@@ -107,8 +107,6 @@ xfs_refcountbt_free_block(
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
 	int			error;
 
-	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
-			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
 	error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
diff --git a/libxfs/xfs_rmap_btree.c b/libxfs/xfs_rmap_btree.c
index e8ffb23be42..36f6714ed3f 100644
--- a/libxfs/xfs_rmap_btree.c
+++ b/libxfs/xfs_rmap_btree.c
@@ -123,8 +123,6 @@ xfs_rmapbt_free_block(
 	int			error;
 
 	bno = xfs_daddr_to_agbno(cur->bc_mp, xfs_buf_daddr(bp));
-	trace_xfs_rmapbt_free_block(cur->bc_mp, pag->pag_agno,
-			bno, 1);
 	be32_add_cpu(&agf->agf_rmap_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
 	error = xfs_alloc_put_freelist(pag, cur->bc_tp, agbp, NULL, bno, 1);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/9] xfs: consolidate btree block allocation tracepoints
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 7/9] xfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't waste tracepoint segment memory on per-btree block allocation
tracepoints when we can do it from the generic btree code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/xfs_trace.h         |    4 +---
 libxfs/xfs_btree.c          |   20 +++++++++++++++++---
 libxfs/xfs_refcount_btree.c |    2 --
 libxfs/xfs_rmap_btree.c     |    2 --
 4 files changed, 18 insertions(+), 10 deletions(-)


diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 0a7581b5794..3ca6cda253c 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -62,6 +62,7 @@
 #define trace_xfs_btree_bload_level_geometry(a,b,c,d,e,f,g) ((void) 0)
 #define trace_xfs_btree_bload_block(a,b,c,d,e,f) ((void) 0)
 #define trace_xfs_btree_free_block(...)		((void) 0)
+#define trace_xfs_btree_alloc_block(...)	((void) 0)
 
 #define trace_xfs_free_extent(a,b,c,d,e,f,g)	((void) 0)
 #define trace_xfs_agf(a,b,c,d)			((void) 0)
@@ -244,8 +245,6 @@
 #define trace_xfs_rmap_find_left_neighbor_result(...)	((void) 0)
 #define trace_xfs_rmap_lookup_le_range_result(...)	((void) 0)
 
-#define trace_xfs_rmapbt_alloc_block(...)	((void) 0)
-
 #define trace_xfs_ag_resv_critical(...)		((void) 0)
 #define trace_xfs_ag_resv_needed(...)		((void) 0)
 #define trace_xfs_ag_resv_free(...)		((void) 0)
@@ -263,7 +262,6 @@
 #define trace_xfs_refcount_insert_error(...)	((void) 0)
 #define trace_xfs_refcount_delete(...)		((void) 0)
 #define trace_xfs_refcount_delete_error(...)	((void) 0)
-#define trace_xfs_refcountbt_alloc_block(...)	((void) 0)
 #define trace_xfs_refcount_rec_order_error(...)	((void) 0)
 
 #define trace_xfs_refcount_lookup(...)		((void) 0)
diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index d7501da87ce..cd722d1c830 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -2690,6 +2690,20 @@ xfs_btree_rshift(
 	return error;
 }
 
+static inline int
+xfs_btree_alloc_block(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*hint_block,
+	union xfs_btree_ptr		*new_block,
+	int				*stat)
+{
+	int				error;
+
+	error = cur->bc_ops->alloc_block(cur, hint_block, new_block, stat);
+	trace_xfs_btree_alloc_block(cur, new_block, *stat, error);
+	return error;
+}
+
 /*
  * Split cur/level block in half.
  * Return new block number and the key to its first
@@ -2733,7 +2747,7 @@ __xfs_btree_split(
 	xfs_btree_buf_to_ptr(cur, lbp, &lptr);
 
 	/* Allocate the new block. If we can't do it, we're toast. Give up. */
-	error = cur->bc_ops->alloc_block(cur, &lptr, &rptr, stat);
+	error = xfs_btree_alloc_block(cur, &lptr, &rptr, stat);
 	if (error)
 		goto error0;
 	if (*stat == 0)
@@ -2999,7 +3013,7 @@ xfs_btree_new_iroot(
 	pp = xfs_btree_ptr_addr(cur, 1, block);
 
 	/* Allocate the new block. If we can't do it, we're toast. Give up. */
-	error = cur->bc_ops->alloc_block(cur, pp, &nptr, stat);
+	error = xfs_btree_alloc_block(cur, pp, &nptr, stat);
 	if (error)
 		goto error0;
 	if (*stat == 0)
@@ -3099,7 +3113,7 @@ xfs_btree_new_root(
 	cur->bc_ops->init_ptr_from_cur(cur, &rptr);
 
 	/* Allocate the new block. If we can't do it, we're toast. Give up. */
-	error = cur->bc_ops->alloc_block(cur, &rptr, &lptr, stat);
+	error = xfs_btree_alloc_block(cur, &rptr, &lptr, stat);
 	if (error)
 		goto error0;
 	if (*stat == 0)
diff --git a/libxfs/xfs_refcount_btree.c b/libxfs/xfs_refcount_btree.c
index c1dd2fe8d37..ec30077bd49 100644
--- a/libxfs/xfs_refcount_btree.c
+++ b/libxfs/xfs_refcount_btree.c
@@ -76,8 +76,6 @@ xfs_refcountbt_alloc_block(
 	error = xfs_alloc_vextent(&args);
 	if (error)
 		goto out_error;
-	trace_xfs_refcountbt_alloc_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
-			args.agbno, 1);
 	if (args.fsbno == NULLFSBLOCK) {
 		*stat = 0;
 		return 0;
diff --git a/libxfs/xfs_rmap_btree.c b/libxfs/xfs_rmap_btree.c
index 36f6714ed3f..928f61053b0 100644
--- a/libxfs/xfs_rmap_btree.c
+++ b/libxfs/xfs_rmap_btree.c
@@ -92,8 +92,6 @@ xfs_rmapbt_alloc_block(
 				       &bno, 1);
 	if (error)
 		return error;
-
-	trace_xfs_rmapbt_alloc_block(cur->bc_mp, pag->pag_agno, bno, 1);
 	if (bno == NULLAGBLOCK) {
 		*stat = 0;
 		return 0;



* [PATCH 7/9] xfs: support in-memory btrees
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 5/9] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 6/9] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Adapt the generic btree cursor code so that it can create a btree whose
buffers come from a (presumably in-memory) buftarg, with a header block
that is specific to in-memory btrees.  We'll connect this to the rest of
online scrub in the next patches.

Note that in-memory btrees always have a block size matching the system
memory page size for efficiency reasons.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/libxfs.h       |    2 
 libxfs/Makefile        |    3 
 libxfs/init.c          |    3 
 libxfs/libxfs_io.h     |   10 +
 libxfs/libxfs_priv.h   |    2 
 libxfs/rdwr.c          |   30 ++++
 libxfs/xfbtree.c       |  343 ++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfbtree.h       |   36 +++++
 libxfs/xfile.c         |   18 +++
 libxfs/xfile.h         |   50 +++++++
 libxfs/xfs_btree.c     |  151 +++++++++++++++++----
 libxfs/xfs_btree.h     |   17 ++
 libxfs/xfs_btree_mem.h |   87 ++++++++++++
 13 files changed, 724 insertions(+), 28 deletions(-)
 create mode 100644 libxfs/xfbtree.c
 create mode 100644 libxfs/xfbtree.h
 create mode 100644 libxfs/xfs_btree_mem.h
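
Since the block size always matches the page size, the xfile offset <->
daddr conversions added to xfile.h are plain shifts.  Here's a throwaway
userspace sketch of the arithmetic, assuming the common 4096-byte page
(XFB_BSHIFT == 12); the real shift is derived from sysconf(_SC_PAGESIZE)
in xfile_libinit():

#include <stdio.h>

#define BBSHIFT		9			/* 512-byte basic blocks */
#define XFB_BSHIFT	12			/* assumed 4k xfile blocks */
#define XFB_SHIFT	(XFB_BSHIFT - BBSHIFT)	/* 3: 8 sectors per block */

static unsigned long long xfo_to_daddr(unsigned long long xfoff)
{
	return xfoff << XFB_SHIFT;
}

static unsigned long long daddr_to_xfo(unsigned long long bb)
{
	/* round up to a whole xfile block, like xfs_daddr_to_xfo() */
	return (bb + (1ULL << XFB_SHIFT) - 1) >> XFB_SHIFT;
}

int main(void)
{
	printf("xfo 5 starts at daddr %llu\n", xfo_to_daddr(5));	/* 40 */
	printf("daddr 41 rounds up to xfo %llu\n", daddr_to_xfo(41));	/* 6 */
	return 0;
}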


diff --git a/include/libxfs.h b/include/libxfs.h
index b07da6c03ee..887f57b6171 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -7,6 +7,8 @@
 #ifndef __LIBXFS_H__
 #define __LIBXFS_H__
 
+#define CONFIG_XFS_IN_MEMORY_BTREE
+
 #include "libxfs_api_defs.h"
 #include "platform_defs.h"
 #include "xfs.h"
diff --git a/libxfs/Makefile b/libxfs/Makefile
index 2007be570ed..b4aa9706aaa 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -26,6 +26,7 @@ HFILES = \
 	libxfs_priv.h \
 	linux-err.h \
 	topology.h \
+	xfbtree.h \
 	xfile.h \
 	xfs_ag_resv.h \
 	xfs_alloc.h \
@@ -36,6 +37,7 @@ HFILES = \
 	xfs_bmap.h \
 	xfs_bmap_btree.h \
 	xfs_btree.h \
+	xfs_btree_mem.h \
 	xfs_btree_staging.h \
 	xfs_attr_remote.h \
 	xfs_cksum.h \
@@ -66,6 +68,7 @@ CFILES = cache.c \
 	topology.c \
 	trans.c \
 	util.c \
+	xfbtree.c \
 	xfile.c \
 	xfs_ag.c \
 	xfs_ag_resv.c \
diff --git a/libxfs/init.c b/libxfs/init.c
index 5e90bf733b7..676c6fbd6d2 100644
--- a/libxfs/init.c
+++ b/libxfs/init.c
@@ -22,6 +22,7 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
 #include "libfrog/platform.h"
+#include "xfile.h"
 
 #include "libxfs.h"		/* for now */
 
@@ -321,6 +322,8 @@ libxfs_init(libxfs_init_t *a)
 	a->dsize = a->lbsize = a->rtbsize = 0;
 	a->dbsize = a->logBBsize = a->logBBstart = a->rtsize = 0;
 
+	xfile_libinit();
+
 	fd = -1;
 	flags = (a->isreadonly | a->isdirect);
 
diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index c002ef058ec..fb536c1c3c9 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -271,4 +271,14 @@ xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *buffer_list)
 int xfs_buf_delwri_submit(struct list_head *buffer_list);
 void xfs_buf_delwri_cancel(struct list_head *list);
 
+xfs_daddr_t xfs_buftarg_nr_sectors(struct xfs_buftarg *btp);
+
+static inline bool
+xfs_buftarg_verify_daddr(
+	struct xfs_buftarg	*btp,
+	xfs_daddr_t		daddr)
+{
+	return daddr < xfs_buftarg_nr_sectors(btp);
+}
+
 #endif	/* __LIBXFS_IO_H__ */
diff --git a/libxfs/libxfs_priv.h b/libxfs/libxfs_priv.h
index 8cfdc3e295a..f205d31a305 100644
--- a/libxfs/libxfs_priv.h
+++ b/libxfs/libxfs_priv.h
@@ -37,6 +37,8 @@
 #ifndef __LIBXFS_INTERNAL_XFS_H__
 #define __LIBXFS_INTERNAL_XFS_H__
 
+#define CONFIG_XFS_IN_MEMORY_BTREE
+
 #include "libxfs_api_defs.h"
 #include "platform_defs.h"
 #include "xfs.h"
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 9d36698bb5c..c2dbc51f3f2 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -1544,3 +1544,33 @@ __xfs_buf_mark_corrupt(
 	xfs_buf_corruption_error(bp, fa);
 	xfs_buf_stale(bp);
 }
+
+/* Return the number of sectors for a buffer target. */
+xfs_daddr_t
+xfs_buftarg_nr_sectors(
+	struct xfs_buftarg	*btp)
+{
+	struct stat		sb;
+	int			fd;
+	int			ret;
+
+	if (btp->flags & XFS_BUFTARG_IN_MEMORY)
+		return xfile_size(btp->bt_xfile) >> BBSHIFT;
+
+	fd = libxfs_device_to_fd(btp->bt_bdev);
+	ret = fstat(fd, &sb);
+	if (ret)
+		return 0;
+
+	if (S_ISBLK(sb.st_mode)) {
+		uint64_t	sz;
+
+		ret = ioctl(fd, BLKGETSIZE64, &sz);
+		if (ret)
+			return 0;
+
+		return sz >> BBSHIFT;
+	}
+
+	return sb.st_size >> BBSHIFT;
+}
diff --git a/libxfs/xfbtree.c b/libxfs/xfbtree.c
new file mode 100644
index 00000000000..0481e9ed9f4
--- /dev/null
+++ b/libxfs/xfbtree.c
@@ -0,0 +1,343 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs_priv.h"
+#include "libxfs.h"
+#include "xfile.h"
+#include "xfbtree.h"
+#include "xfs_btree_mem.h"
+
+/* btree ops functions for in-memory btrees. */
+
+static xfs_failaddr_t
+xfs_btree_mem_head_verify(
+	struct xfs_buf			*bp)
+{
+	struct xfs_btree_mem_head	*mhead = bp->b_addr;
+	struct xfs_mount		*mp = bp->b_mount;
+
+	if (!xfs_verify_magic(bp, mhead->mh_magic))
+		return __this_address;
+	if (be32_to_cpu(mhead->mh_nlevels) == 0)
+		return __this_address;
+	if (!uuid_equal(&mhead->mh_uuid, &mp->m_sb.sb_meta_uuid))
+		return __this_address;
+
+	return NULL;
+}
+
+static void
+xfs_btree_mem_head_read_verify(
+	struct xfs_buf		*bp)
+{
+	xfs_failaddr_t		fa = xfs_btree_mem_head_verify(bp);
+
+	if (fa)
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+static void
+xfs_btree_mem_head_write_verify(
+	struct xfs_buf		*bp)
+{
+	xfs_failaddr_t		fa = xfs_btree_mem_head_verify(bp);
+
+	if (fa)
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+static const struct xfs_buf_ops xfs_btree_mem_head_buf_ops = {
+	.name			= "xfs_btree_mem_head",
+	.magic			= { cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC),
+				    cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC) },
+	.verify_read		= xfs_btree_mem_head_read_verify,
+	.verify_write		= xfs_btree_mem_head_write_verify,
+	.verify_struct		= xfs_btree_mem_head_verify,
+};
+
+/* Initialize the header block for an in-memory btree. */
+static inline void
+xfs_btree_mem_head_init(
+	struct xfs_buf			*head_bp,
+	unsigned long long		owner,
+	xfileoff_t			leaf_xfoff)
+{
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+	struct xfs_mount		*mp = head_bp->b_mount;
+
+	mhead->mh_magic = cpu_to_be32(XFS_BTREE_MEM_HEAD_MAGIC);
+	mhead->mh_nlevels = cpu_to_be32(1);
+	mhead->mh_owner = cpu_to_be64(owner);
+	mhead->mh_root = cpu_to_be64(leaf_xfoff);
+	uuid_copy(&mhead->mh_uuid, &mp->m_sb.sb_meta_uuid);
+
+	head_bp->b_ops = &xfs_btree_mem_head_buf_ops;
+}
+
+/* Return tree height from the in-memory btree head. */
+unsigned int
+xfs_btree_mem_head_nlevels(
+	struct xfs_buf			*head_bp)
+{
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+
+	return be32_to_cpu(mhead->mh_nlevels);
+}
+
+/* Extract the buftarg target for this xfile btree. */
+struct xfs_buftarg *
+xfbtree_target(struct xfbtree *xfbtree)
+{
+	return xfbtree->target;
+}
+
+/* Is this daddr (sector offset) contained within the buffer target? */
+static inline bool
+xfbtree_verify_buftarg_xfileoff(
+	struct xfs_buftarg	*btp,
+	xfileoff_t		xfoff)
+{
+	xfs_daddr_t		xfoff_daddr = xfo_to_daddr(xfoff);
+
+	return xfs_buftarg_verify_daddr(btp, xfoff_daddr);
+}
+
+/* Is this btree xfile offset contained within the xfile? */
+bool
+xfbtree_verify_xfileoff(
+	struct xfs_btree_cur	*cur,
+	unsigned long long	xfoff)
+{
+	struct xfs_buftarg	*btp = xfbtree_target(cur->bc_mem.xfbtree);
+
+	return xfbtree_verify_buftarg_xfileoff(btp, xfoff);
+}
+
+/* Check if a btree pointer is reasonable. */
+int
+xfbtree_check_ptr(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*ptr,
+	int				index,
+	int				level)
+{
+	xfileoff_t			bt_xfoff;
+	xfs_failaddr_t			fa = NULL;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		bt_xfoff = be64_to_cpu(ptr->l);
+	else
+		bt_xfoff = be32_to_cpu(ptr->s);
+
+	if (!xfbtree_verify_xfileoff(cur, bt_xfoff))
+		fa = __this_address;
+
+	if (fa) {
+		xfs_err(cur->bc_mp,
+"In-memory: Corrupt btree %d flags 0x%x pointer at level %d index %d fa %pS.",
+				cur->bc_btnum, cur->bc_flags, level, index,
+				fa);
+		return -EFSCORRUPTED;
+	}
+	return 0;
+}
+
+/* Convert a btree pointer to a daddr */
+xfs_daddr_t
+xfbtree_ptr_to_daddr(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*ptr)
+{
+	xfileoff_t			bt_xfoff;
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		bt_xfoff = be64_to_cpu(ptr->l);
+	else
+		bt_xfoff = be32_to_cpu(ptr->s);
+	return xfo_to_daddr(bt_xfoff);
+}
+
+/* Set the pointer to point to this buffer. */
+void
+xfbtree_buf_to_ptr(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp,
+	union xfs_btree_ptr	*ptr)
+{
+	xfileoff_t		xfoff = xfs_daddr_to_xfo(xfs_buf_daddr(bp));
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		ptr->l = cpu_to_be64(xfoff);
+	else
+		ptr->s = cpu_to_be32(xfoff);
+}
+
+/* Return the in-memory btree block size, in units of 512 bytes. */
+unsigned int xfbtree_bbsize(void)
+{
+	return xfo_to_daddr(1);
+}
+
+/* Set the root of an in-memory btree. */
+void
+xfbtree_set_root(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*ptr,
+	int				inc)
+{
+	struct xfs_buf			*head_bp = cur->bc_mem.head_bp;
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		mhead->mh_root = ptr->l;
+	} else {
+		uint32_t		root = be32_to_cpu(ptr->s);
+
+		mhead->mh_root = cpu_to_be64(root);
+	}
+	be32_add_cpu(&mhead->mh_nlevels, inc);
+	xfs_trans_log_buf(cur->bc_tp, head_bp, 0, sizeof(*mhead) - 1);
+}
+
+/* Initialize a pointer from the in-memory btree header. */
+void
+xfbtree_init_ptr_from_cur(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_ptr		*ptr)
+{
+	struct xfs_buf			*head_bp = cur->bc_mem.head_bp;
+	struct xfs_btree_mem_head	*mhead = head_bp->b_addr;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		ptr->l = mhead->mh_root;
+	} else {
+		uint64_t		root = be64_to_cpu(mhead->mh_root);
+
+		ptr->s = cpu_to_be32(root);
+	}
+}
+
+/* Duplicate an in-memory btree cursor. */
+struct xfs_btree_cur *
+xfbtree_dup_cursor(
+	struct xfs_btree_cur		*cur)
+{
+	struct xfs_btree_cur		*ncur;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	ncur = xfs_btree_alloc_cursor(cur->bc_mp, cur->bc_tp, cur->bc_btnum,
+			cur->bc_maxlevels, cur->bc_cache);
+	ncur->bc_flags = cur->bc_flags;
+	ncur->bc_nlevels = cur->bc_nlevels;
+	ncur->bc_statoff = cur->bc_statoff;
+	ncur->bc_ops = cur->bc_ops;
+	memcpy(&ncur->bc_mem, &cur->bc_mem, sizeof(cur->bc_mem));
+
+	if (cur->bc_mem.pag)
+		ncur->bc_mem.pag = xfs_perag_bump(cur->bc_mem.pag);
+
+	return ncur;
+}
+
+/* Check the owner of an in-memory btree block. */
+xfs_failaddr_t
+xfbtree_check_block_owner(
+	struct xfs_btree_cur	*cur,
+	struct xfs_btree_block	*block)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		if (be64_to_cpu(block->bb_u.l.bb_owner) != xfbt->owner)
+			return __this_address;
+
+		return NULL;
+	}
+
+	if (be32_to_cpu(block->bb_u.s.bb_owner) != xfbt->owner)
+		return __this_address;
+
+	return NULL;
+}
+
+/* Return the owner of this in-memory btree. */
+unsigned long long
+xfbtree_owner(
+	struct xfs_btree_cur	*cur)
+{
+	return cur->bc_mem.xfbtree->owner;
+}
+
+/* Return the xfile offset (in blocks) of a btree buffer. */
+unsigned long long
+xfbtree_buf_to_xfoff(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	return xfs_daddr_to_xfo(xfs_buf_daddr(bp));
+}
+
+/* Verify a long-format btree block. */
+xfs_failaddr_t
+xfbtree_lblock_verify(
+	struct xfs_buf		*bp,
+	unsigned int		max_recs)
+{
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_buftarg	*btp = bp->b_target;
+
+	/* numrecs verification */
+	if (be16_to_cpu(block->bb_numrecs) > max_recs)
+		return __this_address;
+
+	/* sibling pointer verification */
+	if (block->bb_u.l.bb_leftsib != cpu_to_be64(NULLFSBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be64_to_cpu(block->bb_u.l.bb_leftsib)))
+		return __this_address;
+
+	if (block->bb_u.l.bb_rightsib != cpu_to_be64(NULLFSBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be64_to_cpu(block->bb_u.l.bb_rightsib)))
+		return __this_address;
+
+	return NULL;
+}
+
+/* Verify a short-format btree block. */
+xfs_failaddr_t
+xfbtree_sblock_verify(
+	struct xfs_buf		*bp,
+	unsigned int		max_recs)
+{
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_buftarg	*btp = bp->b_target;
+
+	/* numrecs verification */
+	if (be16_to_cpu(block->bb_numrecs) > max_recs)
+		return __this_address;
+
+	/* sibling pointer verification */
+	if (block->bb_u.s.bb_leftsib != cpu_to_be32(NULLAGBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be32_to_cpu(block->bb_u.s.bb_leftsib)))
+		return __this_address;
+
+	if (block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK) &&
+	    !xfbtree_verify_buftarg_xfileoff(btp,
+				be32_to_cpu(block->bb_u.s.bb_rightsib)))
+		return __this_address;
+
+	return NULL;
+}
diff --git a/libxfs/xfbtree.h b/libxfs/xfbtree.h
new file mode 100644
index 00000000000..e378b771637
--- /dev/null
+++ b/libxfs/xfbtree.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __LIBXFS_XFBTREE_H__
+#define __LIBXFS_XFBTREE_H__
+
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+
+/* Root block for an in-memory btree. */
+struct xfs_btree_mem_head {
+	__be32				mh_magic;
+	__be32				mh_nlevels;
+	__be64				mh_owner;
+	__be64				mh_root;
+	uuid_t				mh_uuid;
+};
+
+#define XFS_BTREE_MEM_HEAD_MAGIC	0x4341544D	/* "CATM" */
+
+/* in-memory btree header is always block 0 in the backing store */
+#define XFS_BTREE_MEM_HEAD_DADDR	0
+
+/* xfile-backed in-memory btrees */
+
+struct xfbtree {
+	struct xfs_buftarg		*target;
+
+	/* Owner of this btree. */
+	unsigned long long		owner;
+};
+
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
+#endif /* __LIBXFS_XFBTREE_H__ */
diff --git a/libxfs/xfile.c b/libxfs/xfile.c
index 357ffb0077d..5985433749d 100644
--- a/libxfs/xfile.c
+++ b/libxfs/xfile.c
@@ -6,6 +6,7 @@
 #include "libxfs_priv.h"
 #include "libxfs.h"
 #include "libxfs/xfile.h"
+#include "libfrog/util.h"
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/wait.h>
@@ -26,6 +27,23 @@
  * management; file locks are not taken.
  */
 
+/* Figure out the xfile block size here */
+unsigned int		XFB_BLOCKSIZE;
+unsigned int		XFB_BSHIFT;
+
+void
+xfile_libinit(void)
+{
+	long		ret = sysconf(_SC_PAGESIZE);
+
+	/* If we don't find a power-of-two page size, go with 4k. */
+	if (ret < 0 || !is_power_of_2(ret))
+		ret = 4096;
+
+	XFB_BLOCKSIZE = ret;
+	XFB_BSHIFT = libxfs_highbit32(XFB_BLOCKSIZE);
+}
+
 /*
  * Open a memory-backed fd to back an xfile.  We require close-on-exec here,
  * because these memfd files function as windowed RAM and hence should never
diff --git a/libxfs/xfile.h b/libxfs/xfile.h
index ad13f62ee0f..5a1d0104808 100644
--- a/libxfs/xfile.h
+++ b/libxfs/xfile.h
@@ -10,6 +10,8 @@ struct xfile {
 	int		fd;
 };
 
+void xfile_libinit(void);
+
 int xfile_create(struct xfs_mount *mp, const char *description,
 		struct xfile **xfilep);
 void xfile_destroy(struct xfile *xf);
@@ -53,4 +55,52 @@ struct xfile_stat {
 int xfile_stat(struct xfile *xf, struct xfile_stat *statbuf);
 int xfile_dump(struct xfile *xf);
 
+static inline loff_t xfile_size(struct xfile *xf)
+{
+	struct xfile_stat	xs;
+	int			ret;
+
+	ret = xfile_stat(xf, &xs);
+	if (ret)
+		return 0;
+
+	return xs.size;
+}
+
+/* file block (aka system page size) to basic block conversions. */
+typedef unsigned long long	xfileoff_t;
+extern unsigned int		XFB_BLOCKSIZE;
+extern unsigned int		XFB_BSHIFT;
+#define XFB_SHIFT		(XFB_BSHIFT - BBSHIFT)
+
+static inline loff_t xfo_to_b(xfileoff_t xfoff)
+{
+	return xfoff << XFB_BSHIFT;
+}
+
+static inline xfileoff_t b_to_xfo(loff_t pos)
+{
+	return (pos + (XFB_BLOCKSIZE - 1)) >> XFB_BSHIFT;
+}
+
+static inline xfileoff_t b_to_xfot(loff_t pos)
+{
+	return pos >> XFB_BSHIFT;
+}
+
+static inline xfs_daddr_t xfo_to_daddr(xfileoff_t xfoff)
+{
+	return xfoff << XFB_SHIFT;
+}
+
+static inline xfileoff_t xfs_daddr_to_xfo(xfs_daddr_t bb)
+{
+	return (bb + (xfo_to_daddr(1) - 1)) >> XFB_SHIFT;
+}
+
+static inline xfileoff_t xfs_daddr_to_xfot(xfs_daddr_t bb)
+{
+	return bb >> XFB_SHIFT;
+}
+
 #endif /* __LIBXFS_XFILE_H__ */
diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index cd722d1c830..dd189a8baf2 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -25,6 +25,9 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
 #include "xfs_health.h"
+#include "xfile.h"
+#include "xfbtree.h"
+#include "xfs_btree_mem.h"
 
 /*
  * Btree magic numbers.
@@ -79,6 +82,9 @@ xfs_btree_check_lblock_siblings(
 	if (level >= 0) {
 		if (!xfs_btree_check_lptr(cur, sibling, level + 1))
 			return __this_address;
+	} else if (cur && (cur->bc_flags & XFS_BTREE_IN_MEMORY)) {
+		if (!xfbtree_verify_xfileoff(cur, sibling))
+			return __this_address;
 	} else {
 		if (!xfs_verify_fsbno(mp, sibling))
 			return __this_address;
@@ -106,6 +112,9 @@ xfs_btree_check_sblock_siblings(
 	if (level >= 0) {
 		if (!xfs_btree_check_sptr(cur, sibling, level + 1))
 			return __this_address;
+	} else if (cur && (cur->bc_flags & XFS_BTREE_IN_MEMORY)) {
+		if (!xfbtree_verify_xfileoff(cur, sibling))
+			return __this_address;
 	} else {
 		if (!xfs_verify_agbno(pag, sibling))
 			return __this_address;
@@ -148,7 +157,9 @@ __xfs_btree_check_lblock(
 	    cur->bc_ops->get_maxrecs(cur, level))
 		return __this_address;
 
-	if (bp)
+	if ((cur->bc_flags & XFS_BTREE_IN_MEMORY) && bp)
+		fsb = xfbtree_buf_to_xfoff(cur, bp);
+	else if (bp)
 		fsb = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
 
 	fa = xfs_btree_check_lblock_siblings(mp, cur, level, fsb,
@@ -215,8 +226,12 @@ __xfs_btree_check_sblock(
 	    cur->bc_ops->get_maxrecs(cur, level))
 		return __this_address;
 
-	if (bp)
+	if ((cur->bc_flags & XFS_BTREE_IN_MEMORY) && bp) {
+		pag = NULL;
+		agbno = xfbtree_buf_to_xfoff(cur, bp);
+	} else if (bp) {
 		agbno = xfs_daddr_to_agbno(mp, xfs_buf_daddr(bp));
+	}
 
 	fa = xfs_btree_check_sblock_siblings(pag, cur, level, agbno,
 			block->bb_u.s.bb_leftsib);
@@ -273,6 +288,8 @@ xfs_btree_check_lptr(
 {
 	if (level <= 0)
 		return false;
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_verify_xfileoff(cur, fsbno);
 	return xfs_verify_fsbno(cur->bc_mp, fsbno);
 }
 
@@ -285,6 +302,8 @@ xfs_btree_check_sptr(
 {
 	if (level <= 0)
 		return false;
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_verify_xfileoff(cur, agbno);
 	return xfs_verify_agbno(cur->bc_ag.pag, agbno);
 }
 
@@ -299,6 +318,9 @@ xfs_btree_check_ptr(
 	int				index,
 	int				level)
 {
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_check_ptr(cur, ptr, index, level);
+
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		if (xfs_btree_check_lptr(cur, be64_to_cpu((&ptr->l)[index]),
 				level))
@@ -455,11 +477,36 @@ xfs_btree_del_cursor(
 	       xfs_is_shutdown(cur->bc_mp) || error != 0);
 	if (unlikely(cur->bc_flags & XFS_BTREE_STAGING))
 		kmem_free(cur->bc_ops);
-	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
+	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) &&
+	    !(cur->bc_flags & XFS_BTREE_IN_MEMORY) && cur->bc_ag.pag)
 		xfs_perag_put(cur->bc_ag.pag);
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+		if (cur->bc_mem.pag)
+			xfs_perag_put(cur->bc_mem.pag);
+	}
 	kmem_cache_free(cur->bc_cache, cur);
 }
 
+/* Return the buffer target for this btree's buffer. */
+static inline struct xfs_buftarg *
+xfs_btree_buftarg(
+	struct xfs_btree_cur	*cur)
+{
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_target(cur->bc_mem.xfbtree);
+	return cur->bc_mp->m_ddev_targp;
+}
+
+/* Return the block size (in units of 512b sectors) for this btree. */
+static inline unsigned int
+xfs_btree_bbsize(
+	struct xfs_btree_cur	*cur)
+{
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_bbsize();
+	return cur->bc_mp->m_bsize;
+}
+
 /*
  * Duplicate the btree cursor.
  * Allocate a new one, copy the record, re-get the buffers.
@@ -497,10 +544,11 @@ xfs_btree_dup_cursor(
 		new->bc_levels[i].ra = cur->bc_levels[i].ra;
 		bp = cur->bc_levels[i].bp;
 		if (bp) {
-			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
-						   xfs_buf_daddr(bp), mp->m_bsize,
-						   0, &bp,
-						   cur->bc_ops->buf_ops);
+			error = xfs_trans_read_buf(mp, tp,
+					xfs_btree_buftarg(cur),
+					xfs_buf_daddr(bp),
+					xfs_btree_bbsize(cur), 0, &bp,
+					cur->bc_ops->buf_ops);
 			if (xfs_metadata_is_sick(error))
 				xfs_btree_mark_sick(new);
 			if (error) {
@@ -941,6 +989,9 @@ xfs_btree_readahead_lblock(
 	xfs_fsblock_t		left = be64_to_cpu(block->bb_u.l.bb_leftsib);
 	xfs_fsblock_t		right = be64_to_cpu(block->bb_u.l.bb_rightsib);
 
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return 0;
+
 	if ((lr & XFS_BTCUR_LEFTRA) && left != NULLFSBLOCK) {
 		xfs_btree_reada_bufl(cur->bc_mp, left, 1,
 				     cur->bc_ops->buf_ops);
@@ -966,6 +1017,8 @@ xfs_btree_readahead_sblock(
 	xfs_agblock_t		left = be32_to_cpu(block->bb_u.s.bb_leftsib);
 	xfs_agblock_t		right = be32_to_cpu(block->bb_u.s.bb_rightsib);
 
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return 0;
 
 	if ((lr & XFS_BTCUR_LEFTRA) && left != NULLAGBLOCK) {
 		xfs_btree_reada_bufs(cur->bc_mp, cur->bc_ag.pag->pag_agno,
@@ -1027,6 +1080,11 @@ xfs_btree_ptr_to_daddr(
 	if (error)
 		return error;
 
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+		*daddr = xfbtree_ptr_to_daddr(cur, ptr);
+		return 0;
+	}
+
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		fsbno = be64_to_cpu(ptr->l);
 		*daddr = XFS_FSB_TO_DADDR(cur->bc_mp, fsbno);
@@ -1055,8 +1113,9 @@ xfs_btree_readahead_ptr(
 
 	if (xfs_btree_ptr_to_daddr(cur, ptr, &daddr))
 		return;
-	xfs_buf_readahead(cur->bc_mp->m_ddev_targp, daddr,
-			  cur->bc_mp->m_bsize * count, cur->bc_ops->buf_ops);
+	xfs_buf_readahead(xfs_btree_buftarg(cur), daddr,
+			xfs_btree_bbsize(cur) * count,
+			cur->bc_ops->buf_ops);
 }
 
 /*
@@ -1230,7 +1289,9 @@ xfs_btree_init_block_cur(
 	 * change in future, but is safe for current users of the generic btree
 	 * code.
 	 */
-	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		owner = xfbtree_owner(cur);
+	else if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		owner = cur->bc_ino.ip->i_ino;
 	else
 		owner = cur->bc_ag.pag->pag_agno;
@@ -1270,6 +1331,11 @@ xfs_btree_buf_to_ptr(
 	struct xfs_buf		*bp,
 	union xfs_btree_ptr	*ptr)
 {
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY) {
+		xfbtree_buf_to_ptr(cur, bp, ptr);
+		return;
+	}
+
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		ptr->l = cpu_to_be64(XFS_DADDR_TO_FSB(cur->bc_mp,
 					xfs_buf_daddr(bp)));
@@ -1314,15 +1380,14 @@ xfs_btree_get_buf_block(
 	struct xfs_btree_block		**block,
 	struct xfs_buf			**bpp)
 {
-	struct xfs_mount	*mp = cur->bc_mp;
-	xfs_daddr_t		d;
-	int			error;
+	xfs_daddr_t			d;
+	int				error;
 
 	error = xfs_btree_ptr_to_daddr(cur, ptr, &d);
 	if (error)
 		return error;
-	error = xfs_trans_get_buf(cur->bc_tp, mp->m_ddev_targp, d, mp->m_bsize,
-			0, bpp);
+	error = xfs_trans_get_buf(cur->bc_tp, xfs_btree_buftarg(cur), d,
+			xfs_btree_bbsize(cur), 0, bpp);
 	if (error)
 		return error;
 
@@ -1353,9 +1418,9 @@ xfs_btree_read_buf_block(
 	error = xfs_btree_ptr_to_daddr(cur, ptr, &d);
 	if (error)
 		return error;
-	error = xfs_trans_read_buf(mp, cur->bc_tp, mp->m_ddev_targp, d,
-				   mp->m_bsize, flags, bpp,
-				   cur->bc_ops->buf_ops);
+	error = xfs_trans_read_buf(mp, cur->bc_tp, xfs_btree_buftarg(cur), d,
+			xfs_btree_bbsize(cur), flags, bpp,
+			cur->bc_ops->buf_ops);
 	if (xfs_metadata_is_sick(error))
 		xfs_btree_mark_sick(cur);
 	if (error)
@@ -1795,6 +1860,37 @@ xfs_btree_decrement(
 	return error;
 }
 
+/*
+ * Check the btree block owner now that we have the context to know who the
+ * real owner is.
+ */
+static inline xfs_failaddr_t
+xfs_btree_check_block_owner(
+	struct xfs_btree_cur	*cur,
+	struct xfs_btree_block	*block)
+{
+	if (!xfs_has_crc(cur->bc_mp))
+		return NULL;
+
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfbtree_check_block_owner(cur, block);
+
+	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS)) {
+		if (be32_to_cpu(block->bb_u.s.bb_owner) !=
+						cur->bc_ag.pag->pag_agno)
+			return __this_address;
+		return NULL;
+	}
+
+	if (cur->bc_ino.flags & XFS_BTCUR_BMBT_INVALID_OWNER)
+		return NULL;
+
+	if (be64_to_cpu(block->bb_u.l.bb_owner) != cur->bc_ino.ip->i_ino)
+		return __this_address;
+
+	return NULL;
+}
+
 int
 xfs_btree_lookup_get_block(
 	struct xfs_btree_cur		*cur,	/* btree cursor */
@@ -1833,11 +1929,7 @@ xfs_btree_lookup_get_block(
 		return error;
 
 	/* Check the inode owner since the verifiers don't. */
-	if (xfs_has_crc(cur->bc_mp) &&
-	    !(cur->bc_ino.flags & XFS_BTCUR_BMBT_INVALID_OWNER) &&
-	    (cur->bc_flags & XFS_BTREE_LONG_PTRS) &&
-	    be64_to_cpu((*blkp)->bb_u.l.bb_owner) !=
-			cur->bc_ino.ip->i_ino)
+	if (xfs_btree_check_block_owner(cur, *blkp) != NULL)
 		goto out_bad;
 
 	/* Did we get the level we were looking for? */
@@ -4369,7 +4461,7 @@ xfs_btree_visit_block(
 {
 	struct xfs_btree_block		*block;
 	struct xfs_buf			*bp;
-	union xfs_btree_ptr		rptr;
+	union xfs_btree_ptr		rptr, bufptr;
 	int				error;
 
 	/* do right sibling readahead */
@@ -4392,15 +4484,14 @@ xfs_btree_visit_block(
 	 * return the same block without checking if the right sibling points
 	 * back to us and creates a cyclic reference in the btree.
 	 */
+	xfs_btree_buf_to_ptr(cur, bp, &bufptr);
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
-		if (be64_to_cpu(rptr.l) == XFS_DADDR_TO_FSB(cur->bc_mp,
-							xfs_buf_daddr(bp))) {
+		if (rptr.l == bufptr.l) {
 			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
 		}
 	} else {
-		if (be32_to_cpu(rptr.s) == xfs_daddr_to_agbno(cur->bc_mp,
-							xfs_buf_daddr(bp))) {
+		if (rptr.s == bufptr.s) {
 			xfs_btree_mark_sick(cur);
 			return -EFSCORRUPTED;
 		}
@@ -4582,6 +4673,8 @@ xfs_btree_lblock_verify(
 	xfs_fsblock_t		fsb;
 	xfs_failaddr_t		fa;
 
+	ASSERT(!xfs_buftarg_in_memory(bp->b_target));
+
 	/* numrecs verification */
 	if (be16_to_cpu(block->bb_numrecs) > max_recs)
 		return __this_address;
@@ -4637,6 +4730,8 @@ xfs_btree_sblock_verify(
 	xfs_agblock_t		agbno;
 	xfs_failaddr_t		fa;
 
+	ASSERT(!xfs_buftarg_in_memory(bp->b_target));
+
 	/* numrecs verification */
 	if (be16_to_cpu(block->bb_numrecs) > max_recs)
 		return __this_address;
diff --git a/libxfs/xfs_btree.h b/libxfs/xfs_btree.h
index 2fa7a09cab3..6c81fceab74 100644
--- a/libxfs/xfs_btree.h
+++ b/libxfs/xfs_btree.h
@@ -248,6 +248,15 @@ struct xfs_btree_cur_ino {
 #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
 };
 
+/* In-memory btree information */
+struct xfbtree;
+
+struct xfs_btree_cur_mem {
+	struct xfbtree			*xfbtree;
+	struct xfs_buf			*head_bp;
+	struct xfs_perag		*pag;
+};
+
 struct xfs_btree_level {
 	/* buffer pointer */
 	struct xfs_buf		*bp;
@@ -287,6 +296,7 @@ struct xfs_btree_cur
 	union {
 		struct xfs_btree_cur_ag	bc_ag;
 		struct xfs_btree_cur_ino bc_ino;
+		struct xfs_btree_cur_mem bc_mem;
 	};
 
 	/* Must be at the end of the struct! */
@@ -317,6 +327,13 @@ xfs_btree_cur_sizeof(unsigned int nlevels)
  */
 #define XFS_BTREE_STAGING		(1<<5)
 
+/* btree stored in memory; not compatible with ROOT_IN_INODE */
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+# define XFS_BTREE_IN_MEMORY		(1<<7)
+#else
+# define XFS_BTREE_IN_MEMORY		(0)
+#endif
+
 #define	XFS_BTREE_NOERROR	0
 #define	XFS_BTREE_ERROR		1
 
diff --git a/libxfs/xfs_btree_mem.h b/libxfs/xfs_btree_mem.h
new file mode 100644
index 00000000000..6ca9ea64a9a
--- /dev/null
+++ b/libxfs/xfs_btree_mem.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_BTREE_MEM_H__
+#define __XFS_BTREE_MEM_H__
+
+struct xfbtree;
+
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
+
+struct xfs_buftarg *xfbtree_target(struct xfbtree *xfbtree);
+int xfbtree_check_ptr(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr, int index, int level);
+xfs_daddr_t xfbtree_ptr_to_daddr(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr);
+void xfbtree_buf_to_ptr(struct xfs_btree_cur *cur, struct xfs_buf *bp,
+		union xfs_btree_ptr *ptr);
+
+unsigned int xfbtree_bbsize(void);
+
+void xfbtree_set_root(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *ptr, int inc);
+void xfbtree_init_ptr_from_cur(struct xfs_btree_cur *cur,
+		union xfs_btree_ptr *ptr);
+struct xfs_btree_cur *xfbtree_dup_cursor(struct xfs_btree_cur *cur);
+bool xfbtree_verify_xfileoff(struct xfs_btree_cur *cur,
+		unsigned long long xfoff);
+xfs_failaddr_t xfbtree_check_block_owner(struct xfs_btree_cur *cur,
+		struct xfs_btree_block *block);
+unsigned long long xfbtree_owner(struct xfs_btree_cur *cur);
+xfs_failaddr_t xfbtree_lblock_verify(struct xfs_buf *bp, unsigned int max_recs);
+xfs_failaddr_t xfbtree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
+unsigned long long xfbtree_buf_to_xfoff(struct xfs_btree_cur *cur,
+		struct xfs_buf *bp);
+#else
+static inline unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp)
+{
+	return 0;
+}
+
+static inline struct xfs_buftarg *
+xfbtree_target(struct xfbtree *xfbtree)
+{
+	return NULL;
+}
+
+static inline int
+xfbtree_check_ptr(struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr,
+		  int index, int level)
+{
+	return 0;
+}
+
+static inline xfs_daddr_t
+xfbtree_ptr_to_daddr(struct xfs_btree_cur *cur, const union xfs_btree_ptr *ptr)
+{
+	return 0;
+}
+
+static inline void
+xfbtree_buf_to_ptr(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp,
+	union xfs_btree_ptr	*ptr)
+{
+	memset(ptr, 0xFF, sizeof(*ptr));
+}
+
+static inline unsigned int xfbtree_bbsize(void)
+{
+	return 0;
+}
+
+#define xfbtree_set_root			NULL
+#define xfbtree_init_ptr_from_cur		NULL
+#define xfbtree_dup_cursor			NULL
+#define xfbtree_verify_xfileoff(cur, xfoff)	(false)
+#define xfbtree_check_block_owner(cur, block)	NULL
+#define xfbtree_owner(cur)			(0ULL)
+#define xfbtree_buf_to_xfoff(cur, bp)		(-1)
+
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
+#endif /* __XFS_BTREE_MEM_H__ */



* [PATCH 8/9] xfs: connect in-memory btrees to xfiles
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 4/9] libxfs: support in-memory buffer cache targets Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 5/9] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Give our stubbed-out in-memory btrees the ability to connect to an
actual in-memory backing file (aka an xfile), along with the pieces
needed to track free space in the xfile and to flush dirty xfbtree
buffers on demand.  Online repair will need all of this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/xfs_mount.h    |    8 +
 include/xfs_trace.h    |    8 +
 include/xfs_trans.h    |    1 
 libfrog/bitmap.c       |   64 ++++++-
 libfrog/bitmap.h       |    3 
 libxfs/init.c          |   32 +++
 libxfs/trans.c         |   40 ++++
 libxfs/xfbtree.c       |  459 ++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfbtree.h       |   27 +++
 libxfs/xfile.c         |   16 ++
 libxfs/xfile.h         |    2 
 libxfs/xfs_btree_mem.h |   41 ++++
 12 files changed, 690 insertions(+), 11 deletions(-)
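
The free space tracking works like this: xfbtree_free_block sets a bit in
the free-space bitmap for each released xfile block, and
xfbtree_alloc_block reuses the lowest set bit before growing the file.  A
toy sketch of that policy, with a plain 64-bit word standing in for
libfrog's AVL-backed struct bitmap (the toy_* names are made up for
illustration):

#include <stdint.h>
#include <stdio.h>

struct toy_xfbt {
	uint64_t	freespace;	/* bit N set => xfile block N is free */
	uint64_t	xf_used;	/* blocks handed out so far */
};

static uint64_t toy_alloc_block(struct toy_xfbt *xfbt)
{
	if (xfbt->freespace) {
		uint64_t	xfoff = __builtin_ctzll(xfbt->freespace);

		xfbt->freespace &= ~(1ULL << xfoff);	/* bitmap_take_first_set */
		return xfoff;
	}
	return xfbt->xf_used++;				/* grow the xfile */
}

static void toy_free_block(struct toy_xfbt *xfbt, uint64_t xfoff)
{
	xfbt->freespace |= 1ULL << xfoff;		/* bitmap_set */
}

int main(void)
{
	struct toy_xfbt	xfbt = { .xf_used = 2 };	/* head + initial leaf */
	uint64_t	a = toy_alloc_block(&xfbt);	/* 2: grows the file */

	toy_free_block(&xfbt, a);
	printf("first alloc %llu, realloc %llu\n",
			(unsigned long long)a,
			(unsigned long long)toy_alloc_block(&xfbt)); /* reuses 2 */
	return 0;
}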


diff --git a/include/xfs_mount.h b/include/xfs_mount.h
index 6be85bf21d2..906b0573c0b 100644
--- a/include/xfs_mount.h
+++ b/include/xfs_mount.h
@@ -282,4 +282,12 @@ static inline void xfs_perag_drop_intents(struct xfs_perag *pag) { }
 #define xfs_drain_free(dr)		((void)0)
 #define xfs_drain_init(dr)		((void)0)
 
+static inline void libxfs_buftarg_drain(struct xfs_buftarg *btp)
+{
+	cache_purge(btp->bcache);
+}
+void libxfs_buftarg_free(struct xfs_buftarg *btp);
+int libxfs_alloc_memory_buftarg(struct xfs_mount *mp, struct xfile *xfile,
+		struct xfs_buftarg **btpp);
+
 #endif	/* __XFS_MOUNT_H__ */
diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 3ca6cda253c..3c6bd32d4ca 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -6,6 +6,13 @@
 #ifndef __TRACE_H__
 #define __TRACE_H__
 
+#define trace_xfbtree_create(...)		((void) 0)
+#define trace_xfbtree_create_root_buf(...)	((void) 0)
+#define trace_xfbtree_alloc_block(...)		((void) 0)
+#define trace_xfbtree_free_block(...)		((void) 0)
+#define trace_xfbtree_trans_cancel_buf(...)	((void) 0)
+#define trace_xfbtree_trans_commit_buf(...)	((void) 0)
+
 #define trace_xfs_agfl_reset(a,b,c,d)		((void) 0)
 #define trace_xfs_agfl_free_defer(a,b,c,d,e)	((void) 0)
 #define trace_xfs_alloc_cur_check(a,b,c,d,e,f)	((void) 0)
@@ -194,6 +201,7 @@
 #define trace_xfs_trans_cancel(a,b)		((void) 0)
 #define trace_xfs_trans_brelse(a)		((void) 0)
 #define trace_xfs_trans_binval(a)		((void) 0)
+#define trace_xfs_trans_bdetach(a)		((void) 0)
 #define trace_xfs_trans_bjoin(a)		((void) 0)
 #define trace_xfs_trans_bhold(a)		((void) 0)
 #define trace_xfs_trans_bhold_release(a)	((void) 0)
diff --git a/include/xfs_trans.h b/include/xfs_trans.h
index ae339df1195..bfaee7e8fed 100644
--- a/include/xfs_trans.h
+++ b/include/xfs_trans.h
@@ -107,6 +107,7 @@ int	libxfs_trans_roll_inode (struct xfs_trans **, struct xfs_inode *);
 void	libxfs_trans_brelse(struct xfs_trans *, struct xfs_buf *);
 void	libxfs_trans_binval(struct xfs_trans *, struct xfs_buf *);
 void	libxfs_trans_bjoin(struct xfs_trans *, struct xfs_buf *);
+void	libxfs_trans_bdetach(struct xfs_trans *tp, struct xfs_buf *bp);
 void	libxfs_trans_bhold(struct xfs_trans *, struct xfs_buf *);
 void	libxfs_trans_bhold_release(struct xfs_trans *, struct xfs_buf *);
 void	libxfs_trans_dirty_buf(struct xfs_trans *, struct xfs_buf *);
diff --git a/libfrog/bitmap.c b/libfrog/bitmap.c
index 5af5ab8dd6b..e1f3a5e1c84 100644
--- a/libfrog/bitmap.c
+++ b/libfrog/bitmap.c
@@ -233,10 +233,9 @@ bitmap_set(
 	return res;
 }
 
-#if 0	/* Unused, provided for completeness. */
 /* Clear a region of bits. */
-int
-bitmap_clear(
+static int
+__bitmap_clear(
 	struct bitmap		*bmap,
 	uint64_t		start,
 	uint64_t		len)
@@ -251,8 +250,8 @@ bitmap_clear(
 	uint64_t		new_length;
 	struct avl64node	*node;
 	int			stat;
+	int			ret = 0;
 
-	pthread_mutex_lock(&bmap->bt_lock);
 	/* Find any existing nodes over that range. */
 	avl64_findranges(bmap->bt_tree, start, start + len, &firstn, &lastn);
 
@@ -312,10 +311,24 @@ bitmap_clear(
 	}
 
 out:
-	pthread_mutex_unlock(&bmap->bt_lock);
 	return ret;
 }
-#endif
+
+/* Clear a region of bits. */
+int
+bitmap_clear(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	uint64_t		length)
+{
+	int			res;
+
+	pthread_mutex_lock(&bmap->bt_lock);
+	res = __bitmap_clear(bmap, start, length);
+	pthread_mutex_unlock(&bmap->bt_lock);
+
+	return res;
+}
 
 /* Iterate the set regions of this bitmap. */
 int
@@ -438,3 +451,42 @@ bitmap_dump(
 	printf("BITMAP DUMP DONE\n");
 }
 #endif
+
+/*
+ * Find the first set bit in this bitmap, clear it, and return the index of
+ * that bit in @valp.  Returns -ENODATA if no bits were set, or the usual
+ * negative errno.
+ */
+int
+bitmap_take_first_set(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	uint64_t		last,
+	uint64_t		*valp)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct bitmap_node	*ext;
+	uint64_t		val;
+	int			error;
+
+	pthread_mutex_lock(&bmap->bt_lock);
+
+	avl64_findranges(bmap->bt_tree, start, last + 1, &firstn, &lastn);
+
+	if (firstn == NULL && lastn == NULL) {
+		error = -ENODATA;
+		goto out;
+	}
+
+	ext = container_of(firstn, struct bitmap_node, btn_node);
+	val = ext->btn_start;
+	error = __bitmap_clear(bmap, val, 1);
+	if (error)
+		goto out;
+
+	*valp = val;
+out:
+	pthread_mutex_unlock(&bmap->bt_lock);
+	return error;
+}
diff --git a/libfrog/bitmap.h b/libfrog/bitmap.h
index 043b77eece6..896ae01f8f4 100644
--- a/libfrog/bitmap.h
+++ b/libfrog/bitmap.h
@@ -14,6 +14,7 @@ struct bitmap {
 int bitmap_alloc(struct bitmap **bmap);
 void bitmap_free(struct bitmap **bmap);
 int bitmap_set(struct bitmap *bmap, uint64_t start, uint64_t length);
+int bitmap_clear(struct bitmap *bmap, uint64_t start, uint64_t length);
 int bitmap_iterate(struct bitmap *bmap, int (*fn)(uint64_t, uint64_t, void *),
 		void *arg);
 int bitmap_iterate_range(struct bitmap *bmap, uint64_t start, uint64_t length,
@@ -22,5 +23,7 @@ bool bitmap_test(struct bitmap *bmap, uint64_t start,
 		uint64_t len);
 bool bitmap_empty(struct bitmap *bmap);
 void bitmap_dump(struct bitmap *bmap);
+int bitmap_take_first_set(struct bitmap *bmap, uint64_t start, uint64_t last,
+		uint64_t *valp);
 
 #endif /* __LIBFROG_BITMAP_H__ */
diff --git a/libxfs/init.c b/libxfs/init.c
index 676c6fbd6d2..b80f6bfd8fc 100644
--- a/libxfs/init.c
+++ b/libxfs/init.c
@@ -621,6 +621,36 @@ libxfs_buftarg_alloc(
 	return btp;
 }
 
+int
+libxfs_alloc_memory_buftarg(
+	struct xfs_mount	*mp,
+	struct xfile		*xfile,
+	struct xfs_buftarg	**btpp)
+{
+	struct xfs_buftarg	*btp;
+	unsigned int		bcache_flags = 0;
+
+	btp = malloc(sizeof(*btp));
+	if (!btp)
+		return -ENOMEM;
+
+	btp->bt_mount = mp;
+	btp->bt_xfile = xfile;
+	btp->flags = XFS_BUFTARG_IN_MEMORY;
+	btp->writes_left = 0;
+	pthread_mutex_init(&btp->lock, NULL);
+
+	/*
+	 * Keep the bucket count small because the only anticipated caller is
+	 * per-AG in-memory btrees, for which we don't need to scale to handle
+	 * an entire filesystem.
+	 */
+	btp->bcache = cache_init(bcache_flags, 63, &libxfs_bcache_operations);
+
+	*btpp = btp;
+	return 0;
+}
+
 enum libxfs_write_failure_nums {
 	WF_DATA = 0,
 	WF_LOG,
@@ -1023,7 +1053,7 @@ libxfs_flush_mount(
 	return error;
 }
 
-static void
+void
 libxfs_buftarg_free(
 	struct xfs_buftarg	*btp)
 {
diff --git a/libxfs/trans.c b/libxfs/trans.c
index e9430c61562..3120d8b1dea 100644
--- a/libxfs/trans.c
+++ b/libxfs/trans.c
@@ -613,6 +613,46 @@ libxfs_trans_brelse(
 	libxfs_buf_relse(bp);
 }
 
+/*
+ * Forcibly detach a buffer previously joined to the transaction.  The caller
+ * will retain its locked reference to the buffer after this function returns.
+ * The buffer must be completely clean and must not be held to the transaction.
+ */
+void
+libxfs_trans_bdetach(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
+
+	ASSERT(tp != NULL);
+	ASSERT(bp->b_transp == tp);
+	ASSERT(bip->bli_item.li_type == XFS_LI_BUF);
+
+	trace_xfs_trans_bdetach(bip);
+
+	/*
+	 * Erase all recursion count, since we're removing this buffer from the
+	 * transaction.
+	 */
+	bip->bli_recur = 0;
+
+	/*
+	 * The buffer must be completely clean.  Specifically, it had better
+	 * not be dirty, stale, logged, ordered, or held to the transaction.
+	 */
+	ASSERT(!test_bit(XFS_LI_DIRTY, &bip->bli_item.li_flags));
+	ASSERT(!(bip->bli_flags & XFS_BLI_DIRTY));
+	ASSERT(!(bip->bli_flags & XFS_BLI_HOLD));
+	ASSERT(!(bip->bli_flags & XFS_BLI_ORDERED));
+	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
+
+	/* Unlink the log item from the transaction and drop the log item. */
+	xfs_trans_del_item(&bip->bli_item);
+	xfs_buf_item_put(bip);
+	bp->b_transp = NULL;
+}
+
 /*
  * Mark the buffer as not needing to be unlocked when the buf item's
  * iop_unlock() routine is called.  The buffer must already be locked
diff --git a/libxfs/xfbtree.c b/libxfs/xfbtree.c
index 0481e9ed9f4..41851c3b9ae 100644
--- a/libxfs/xfbtree.c
+++ b/libxfs/xfbtree.c
@@ -8,6 +8,7 @@
 #include "xfile.h"
 #include "xfbtree.h"
 #include "xfs_btree_mem.h"
+#include "libfrog/bitmap.h"
 
 /* btree ops functions for in-memory btrees. */
 
@@ -133,9 +134,18 @@ xfbtree_check_ptr(
 	else
 		bt_xfoff = be32_to_cpu(ptr->s);
 
-	if (!xfbtree_verify_xfileoff(cur, bt_xfoff))
+	if (!xfbtree_verify_xfileoff(cur, bt_xfoff)) {
 		fa = __this_address;
+		goto done;
+	}
 
+	/* Can't point to the head or anything before it */
+	if (bt_xfoff < XFBTREE_INIT_LEAF_BLOCK) {
+		fa = __this_address;
+		goto done;
+	}
+
+done:
 	if (fa) {
 		xfs_err(cur->bc_mp,
 "In-memory: Corrupt btree %d flags 0x%x pointer at level %d index %d fa %pS.",
@@ -341,3 +351,450 @@ xfbtree_sblock_verify(
 
 	return NULL;
 }
+
+/* Close the btree xfile and release all resources. */
+void
+xfbtree_destroy(
+	struct xfbtree		*xfbt)
+{
+	bitmap_free(&xfbt->freespace);
+	kmem_free(xfbt->freespace);
+	libxfs_buftarg_drain(xfbt->target);
+	kmem_free(xfbt);
+}
+
+/* Compute the number of bytes available for records. */
+static inline unsigned int
+xfbtree_rec_bytes(
+	struct xfs_mount		*mp,
+	const struct xfbtree_config	*cfg)
+{
+	unsigned int			blocklen = xfo_to_b(1);
+
+	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS) {
+		if (xfs_has_crc(mp))
+			return blocklen - XFS_BTREE_LBLOCK_CRC_LEN;
+
+		return blocklen - XFS_BTREE_LBLOCK_LEN;
+	}
+
+	if (xfs_has_crc(mp))
+		return blocklen - XFS_BTREE_SBLOCK_CRC_LEN;
+
+	return blocklen - XFS_BTREE_SBLOCK_LEN;
+}
+
+/* Initialize an empty leaf block as the btree root. */
+STATIC int
+xfbtree_init_leaf_block(
+	struct xfs_mount		*mp,
+	struct xfbtree			*xfbt,
+	const struct xfbtree_config	*cfg)
+{
+	struct xfs_buf			*bp;
+	xfs_daddr_t			daddr;
+	int				error;
+	unsigned int			bc_flags = 0;
+
+	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
+		bc_flags |= XFS_BTREE_LONG_PTRS;
+
+	daddr = xfo_to_daddr(XFBTREE_INIT_LEAF_BLOCK);
+	error = xfs_buf_get(xfbt->target, daddr, xfbtree_bbsize(), &bp);
+	if (error)
+		return error;
+
+	trace_xfbtree_create_root_buf(xfbt, bp);
+
+	bp->b_ops = cfg->btree_ops->buf_ops;
+	xfs_btree_init_block_int(mp, bp->b_addr, daddr, cfg->btnum, 0, 0,
+			cfg->owner, bc_flags);
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		return error;
+
+	xfbt->xf_used++;
+	return 0;
+}
+
+/* Initialize the in-memory btree header block. */
+STATIC int
+xfbtree_init_head(
+	struct xfbtree		*xfbt)
+{
+	struct xfs_buf		*bp;
+	xfs_daddr_t		daddr;
+	int			error;
+
+	daddr = xfo_to_daddr(XFBTREE_HEAD_BLOCK);
+	error = xfs_buf_get(xfbt->target, daddr, xfbtree_bbsize(), &bp);
+	if (error)
+		return error;
+
+	xfs_btree_mem_head_init(bp, xfbt->owner, XFBTREE_INIT_LEAF_BLOCK);
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		return error;
+
+	xfbt->xf_used++;
+	return 0;
+}
+
+/* Create an xfile btree backing thing that can be used for in-memory btrees. */
+int
+xfbtree_create(
+	struct xfs_mount		*mp,
+	const struct xfbtree_config	*cfg,
+	struct xfbtree			**xfbtreep)
+{
+	struct xfbtree			*xfbt;
+	unsigned int			blocklen = xfbtree_rec_bytes(mp, cfg);
+	unsigned int			keyptr_len = cfg->btree_ops->key_len;
+	int				error;
+
+	/* Requires an xfile-backed buftarg. */
+	if (!(cfg->target->flags & XFS_BUFTARG_IN_MEMORY)) {
+		ASSERT(cfg->target->flags & XFS_BUFTARG_IN_MEMORY);
+		return -EINVAL;
+	}
+
+	xfbt = kmem_zalloc(sizeof(struct xfbtree), KM_NOFS | KM_MAYFAIL);
+	if (!xfbt)
+		return -ENOMEM;
+
+	/* Assign our memory file and the free space bitmap. */
+	xfbt->target = cfg->target;
+	error = bitmap_alloc(&xfbt->freespace);
+	if (error)
+		goto err_buftarg;
+
+	/* Set up min/maxrecs for this btree. */
+	if (cfg->flags & XFBTREE_CREATE_LONG_PTRS)
+		keyptr_len += sizeof(__be64);
+	else
+		keyptr_len += sizeof(__be32);
+	xfbt->maxrecs[0] = blocklen / cfg->btree_ops->rec_len;
+	xfbt->maxrecs[1] = blocklen / keyptr_len;
+	xfbt->minrecs[0] = xfbt->maxrecs[0] / 2;
+	xfbt->minrecs[1] = xfbt->maxrecs[1] / 2;
+	xfbt->owner = cfg->owner;
+
+	/* Initialize the empty btree. */
+	error = xfbtree_init_leaf_block(mp, xfbt, cfg);
+	if (error)
+		goto err_freesp;
+
+	error = xfbtree_init_head(xfbt);
+	if (error)
+		goto err_freesp;
+
+	trace_xfbtree_create(mp, cfg, xfbt);
+
+	*xfbtreep = xfbt;
+	return 0;
+
+err_freesp:
+	bitmap_free(&xfbt->freespace);
+	kmem_free(xfbt->freespace);
+err_buftarg:
+	libxfs_buftarg_drain(xfbt->target);
+	kmem_free(xfbt);
+	return error;
+}
+
+/* Read the in-memory btree head. */
+int
+xfbtree_head_read_buf(
+	struct xfbtree		*xfbt,
+	struct xfs_trans	*tp,
+	struct xfs_buf		**bpp)
+{
+	struct xfs_buftarg	*btp = xfbt->target;
+	struct xfs_mount	*mp = btp->bt_mount;
+	struct xfs_btree_mem_head *mhead;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		daddr;
+	int			error;
+
+	daddr = xfo_to_daddr(XFBTREE_HEAD_BLOCK);
+	error = xfs_trans_read_buf(mp, tp, btp, daddr, xfbtree_bbsize(), 0,
+			&bp, &xfs_btree_mem_head_buf_ops);
+	if (error)
+		return error;
+
+	mhead = bp->b_addr;
+	if (be64_to_cpu(mhead->mh_owner) != xfbt->owner) {
+		xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
+		xfs_trans_brelse(tp, bp);
+		return -EFSCORRUPTED;
+	}
+
+	*bpp = bp;
+	return 0;
+}
+
+static inline struct xfile *xfbtree_xfile(struct xfbtree *xfbt)
+{
+	return xfbt->target->bt_xfile;
+}
+
+/* Allocate a block to our in-memory btree. */
+int
+xfbtree_alloc_block(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*start,
+	union xfs_btree_ptr		*new,
+	int				*stat)
+{
+	struct xfbtree			*xfbt = cur->bc_mem.xfbtree;
+	uint64_t			bt_xfoff;
+	loff_t				pos;
+	int				error;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	/*
+	 * Find the first free block in the free space bitmap and take it.  If
+	 * none are found, seek to end of the file.
+	 */
+	error = bitmap_take_first_set(xfbt->freespace, 0, -1ULL, &bt_xfoff);
+	if (error == -ENODATA) {
+		bt_xfoff = xfbt->xf_used;
+		xfbt->xf_used++;
+	} else if (error) {
+		return error;
+	}
+
+	trace_xfbtree_alloc_block(xfbt, cur, bt_xfoff);
+
+	/* Fail if the block address exceeds the maximum for short pointers. */
+	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && bt_xfoff >= INT_MAX) {
+		*stat = 0;
+		return 0;
+	}
+
+	/* Make sure we actually can write to the block before we return it. */
+	pos = xfo_to_b(bt_xfoff);
+	error = xfile_prealloc(xfbtree_xfile(xfbt), pos, xfo_to_b(1));
+	if (error)
+		return error;
+
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		new->l = cpu_to_be64(bt_xfoff);
+	else
+		new->s = cpu_to_be32(bt_xfoff);
+
+	*stat = 1;
+	return 0;
+}
+
+/* Free a block from our in-memory btree. */
+int
+xfbtree_free_block(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+	xfileoff_t		bt_xfoff, bt_xflen;
+
+	ASSERT(cur->bc_flags & XFS_BTREE_IN_MEMORY);
+
+	bt_xfoff = xfs_daddr_to_xfot(xfs_buf_daddr(bp));
+	bt_xflen = xfs_daddr_to_xfot(bp->b_length);
+
+	trace_xfbtree_free_block(xfbt, cur, bt_xfoff);
+
+	return bitmap_set(xfbt->freespace, bt_xfoff, bt_xflen);
+}
+
+/* Return the minimum number of records for a btree block. */
+int
+xfbtree_get_minrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+
+	return xfbt->minrecs[level != 0];
+}
+
+/* Return the maximum number of records for a btree block. */
+int
+xfbtree_get_maxrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	struct xfbtree		*xfbt = cur->bc_mem.xfbtree;
+
+	return xfbt->maxrecs[level != 0];
+}
+
+/* If this log item is a buffer item that came from the xfbtree, return it. */
+static inline struct xfs_buf *
+xfbtree_buf_match(
+	struct xfbtree			*xfbt,
+	const struct xfs_log_item	*lip)
+{
+	const struct xfs_buf_log_item	*bli;
+	struct xfs_buf			*bp;
+
+	if (lip->li_type != XFS_LI_BUF)
+		return NULL;
+
+	bli = container_of(lip, struct xfs_buf_log_item, bli_item);
+	bp = bli->bli_buf;
+	if (bp->b_target != xfbt->target)
+		return NULL;
+
+	return bp;
+}
+
+/*
+ * Detach this (probably dirty) xfbtree buffer from the transaction by any
+ * means necessary.  Returns true if the buffer needs to be written.
+ */
+STATIC bool
+xfbtree_trans_bdetach(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf_log_item	*bli = bp->b_log_item;
+	bool			dirty;
+
+	ASSERT(bli != NULL);
+
+	dirty = bli->bli_flags & (XFS_BLI_DIRTY | XFS_BLI_ORDERED);
+
+	bli->bli_flags &= ~(XFS_BLI_DIRTY | XFS_BLI_ORDERED |
+			    XFS_BLI_STALE);
+	clear_bit(XFS_LI_DIRTY, &bli->bli_item.li_flags);
+
+	while (bp->b_log_item != NULL)
+		libxfs_trans_bdetach(tp, bp);
+
+	return dirty;
+}
+
+/*
+ * Commit changes to the incore btree immediately by writing all dirty xfbtree
+ * buffers to the backing xfile.  This detaches all xfbtree buffers from the
+ * transaction, even on failure.  The buffer locks are dropped between the
+ * delwri queue and submit, so the caller must synchronize btree access.
+ *
+ * Normally we'd let the buffers commit with the transaction and get written to
+ * the xfile via the log, but online repair stages ephemeral btrees in memory
+ * and uses the btree_staging functions to write new btrees to disk atomically.
+ * The in-memory btree (and its backing store) are discarded at the end of the
+ * repair phase, which means that xfbtree buffers cannot commit with the rest
+ * of a transaction.
+ *
+ * In other words, online repair only needs the transaction to collect buffer
+ * pointers and to avoid buffer deadlocks, not to guarantee consistency of
+ * updates.
+ */
+int
+xfbtree_trans_commit(
+	struct xfbtree		*xfbt,
+	struct xfs_trans	*tp)
+{
+	LIST_HEAD(buffer_list);
+	struct xfs_log_item	*lip, *n;
+	bool			corrupt = false;
+	bool			tp_dirty = false;
+
+	/*
+	 * For each xfbtree buffer attached to the transaction, write the dirty
+	 * buffers to the xfile and release them.
+	 */
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		struct xfs_buf	*bp = xfbtree_buf_match(xfbt, lip);
+		bool		dirty;
+
+		if (!bp) {
+			if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
+				tp_dirty |= true;
+			continue;
+		}
+
+		trace_xfbtree_trans_commit_buf(xfbt, bp);
+
+		dirty = xfbtree_trans_bdetach(tp, bp);
+		if (dirty && !corrupt) {
+			xfs_failaddr_t	fa = bp->b_ops->verify_struct(bp);
+
+			/*
+			 * Because this btree is ephemeral, validate the buffer
+			 * structure before delwri_submit so that we can return
+			 * corruption errors to the caller without shutting
+			 * down the filesystem.
+			 *
+			 * If the buffer fails verification, log the failure
+			 * but continue walking the transaction items so that
+			 * we remove all ephemeral btree buffers.
+			 */
+			if (fa) {
+				corrupt = true;
+				xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+			} else {
+				xfs_buf_delwri_queue_here(bp, &buffer_list);
+			}
+		}
+
+		xfs_buf_relse(bp);
+	}
+
+	/*
+	 * Reset the transaction's dirty flag to reflect the dirty state of the
+	 * log items that are still attached.
+	 */
+	tp->t_flags = (tp->t_flags & ~XFS_TRANS_DIRTY) |
+			(tp_dirty ? XFS_TRANS_DIRTY : 0);
+
+	if (corrupt) {
+		xfs_buf_delwri_cancel(&buffer_list);
+		return -EFSCORRUPTED;
+	}
+
+	if (list_empty(&buffer_list))
+		return 0;
+
+	return xfs_buf_delwri_submit(&buffer_list);
+}
+
+/*
+ * Cancel changes to the incore btree by detaching all the xfbtree buffers.
+ * Changes are not written to the backing store.  This is needed for online
+ * repair btrees, which are by nature ephemeral.
+ */
+void
+xfbtree_trans_cancel(
+	struct xfbtree		*xfbt,
+	struct xfs_trans	*tp)
+{
+	struct xfs_log_item	*lip, *n;
+	bool			tp_dirty = false;
+
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		struct xfs_buf	*bp = xfbtree_buf_match(xfbt, lip);
+
+		if (!bp) {
+			if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
+				tp_dirty |= true;
+			continue;
+		}
+
+		trace_xfbtree_trans_cancel_buf(xfbt, bp);
+
+		xfbtree_trans_bdetach(tp, bp);
+		xfs_buf_relse(bp);
+	}
+
+	/*
+	 * Reset the transaction's dirty flag to reflect the dirty state of the
+	 * log items that are still attached.
+	 */
+	tp->t_flags = (tp->t_flags & ~XFS_TRANS_DIRTY) |
+			(tp_dirty ? XFS_TRANS_DIRTY : 0);
+}
diff --git a/libxfs/xfbtree.h b/libxfs/xfbtree.h
index e378b771637..72f56c69157 100644
--- a/libxfs/xfbtree.h
+++ b/libxfs/xfbtree.h
@@ -19,18 +19,39 @@ struct xfs_btree_mem_head {
 
 #define XFS_BTREE_MEM_HEAD_MAGIC	0x4341544D	/* "CATM" */
 
-/* in-memory btree header is always block 0 in the backing store */
-#define XFS_BTREE_MEM_HEAD_DADDR	0
-
 /* xfile-backed in-memory btrees */
 
 struct xfbtree {
+	/* buffer cache target for the xfile backing this in-memory btree */
 	struct xfs_buftarg		*target;
 
+	/* Bitmap of free space from pos to used */
+	struct bitmap			*freespace;
+
+	/* Number of xfile blocks actually used by this xfbtree. */
+	xfileoff_t			xf_used;
+
 	/* Owner of this btree. */
 	unsigned long long		owner;
+
+	/* Minimum and maximum records per block. */
+	unsigned int			maxrecs[2];
+	unsigned int			minrecs[2];
 };
 
+/* The head of the in-memory btree is always at block 0 */
+#define XFBTREE_HEAD_BLOCK		0
+
+/* in-memory btrees are always created with an empty leaf block at block 1 */
+#define XFBTREE_INIT_LEAF_BLOCK		1
+
+int xfbtree_head_read_buf(struct xfbtree *xfbt, struct xfs_trans *tp,
+		struct xfs_buf **bpp);
+
+void xfbtree_destroy(struct xfbtree *xfbt);
+int xfbtree_trans_commit(struct xfbtree *xfbt, struct xfs_trans *tp);
+void xfbtree_trans_cancel(struct xfbtree *xfbt, struct xfs_trans *tp);
+
 #endif /* CONFIG_XFS_IN_MEMORY_BTREE */
 
 #endif /* __LIBXFS_XFBTREE_H__ */
diff --git a/libxfs/xfile.c b/libxfs/xfile.c
index 5985433749d..c1b8b1c5928 100644
--- a/libxfs/xfile.c
+++ b/libxfs/xfile.c
@@ -240,3 +240,19 @@ xfile_dump(
 
 	return execvp("od", argv);
 }
+
+/* Ensure that there is storage backing the given range. */
+int
+xfile_prealloc(
+	struct xfile	*xf,
+	loff_t		pos,
+	uint64_t	count)
+{
+	int		error;
+
+	count = min(count, xfile_maxbytes(xf) - pos);
+	error = fallocate(xf->fd, 0, pos, count);
+	if (error)
+		return -errno;
+	return 0;
+}
diff --git a/libxfs/xfile.h b/libxfs/xfile.h
index 5a1d0104808..9580de32864 100644
--- a/libxfs/xfile.h
+++ b/libxfs/xfile.h
@@ -47,6 +47,8 @@ xfile_obj_store(struct xfile *xf, void *buf, size_t count, loff_t pos)
 	return 0;
 }
 
+int xfile_prealloc(struct xfile *xf, loff_t pos, uint64_t count);
+
 struct xfile_stat {
 	loff_t			size;
 	unsigned long long	bytes;
diff --git a/libxfs/xfs_btree_mem.h b/libxfs/xfs_btree_mem.h
index 6ca9ea64a9a..5e7b1f20fb5 100644
--- a/libxfs/xfs_btree_mem.h
+++ b/libxfs/xfs_btree_mem.h
@@ -8,6 +8,26 @@
 
 struct xfbtree;
 
+struct xfbtree_config {
+	/* Buffer ops for the btree root block */
+	const struct xfs_btree_ops	*btree_ops;
+
+	/* Buffer target for the xfile backing this btree. */
+	struct xfs_buftarg		*target;
+
+	/* Owner of this btree. */
+	unsigned long long		owner;
+
+	/* Btree type number */
+	xfs_btnum_t			btnum;
+
+	/* XFBTREE_CREATE_* flags */
+	unsigned int			flags;
+};
+
+/* btree has long pointers */
+#define XFBTREE_CREATE_LONG_PTRS	(1U << 0)
+
 #ifdef CONFIG_XFS_IN_MEMORY_BTREE
 unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
 
@@ -35,6 +55,16 @@ xfs_failaddr_t xfbtree_lblock_verify(struct xfs_buf *bp, unsigned int max_recs);
 xfs_failaddr_t xfbtree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
 unsigned long long xfbtree_buf_to_xfoff(struct xfs_btree_cur *cur,
 		struct xfs_buf *bp);
+
+int xfbtree_get_minrecs(struct xfs_btree_cur *cur, int level);
+int xfbtree_get_maxrecs(struct xfs_btree_cur *cur, int level);
+
+int xfbtree_create(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+		struct xfbtree **xfbtreep);
+int xfbtree_alloc_block(struct xfs_btree_cur *cur,
+		const union xfs_btree_ptr *start, union xfs_btree_ptr *ptr,
+		int *stat);
+int xfbtree_free_block(struct xfs_btree_cur *cur, struct xfs_buf *bp);
 #else
 static inline unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp)
 {
@@ -77,11 +107,22 @@ static inline unsigned int xfbtree_bbsize(void)
 #define xfbtree_set_root			NULL
 #define xfbtree_init_ptr_from_cur		NULL
 #define xfbtree_dup_cursor			NULL
+#define xfbtree_get_minrecs			NULL
+#define xfbtree_get_maxrecs			NULL
+#define xfbtree_alloc_block			NULL
+#define xfbtree_free_block			NULL
 #define xfbtree_verify_xfileoff(cur, xfoff)	(false)
 #define xfbtree_check_block_owner(cur, block)	NULL
 #define xfbtree_owner(cur)			(0ULL)
 #define xfbtree_buf_to_xfoff(cur, bp)		(-1)
 
+static inline int
+xfbtree_create(struct xfs_mount *mp, const struct xfbtree_config *cfg,
+		struct xfbtree **xfbtreep)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_XFS_IN_MEMORY_BTREE */
 
 #endif /* __XFS_BTREE_MEM_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 9/9] xfbtree: let the buffer cache flush dirty buffers to the xfile
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 2/9] libxfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/9] libxfs: support in-memory buffer cache targets Darrick J. Wong
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

As a performance optimization, when we're committing xfbtree updates,
let the buffer cache flush the dirty buffers to disk when it's ready
instead of writing everything at every transaction commit.  This is a
bit sketchy but it's an ephemeral tree so we can play fast and loose.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfbtree.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)


diff --git a/libxfs/xfbtree.c b/libxfs/xfbtree.c
index 41851c3b9ae..65d6baea856 100644
--- a/libxfs/xfbtree.c
+++ b/libxfs/xfbtree.c
@@ -699,7 +699,6 @@ xfbtree_trans_commit(
 	struct xfbtree		*xfbt,
 	struct xfs_trans	*tp)
 {
-	LIST_HEAD(buffer_list);
 	struct xfs_log_item	*lip, *n;
 	bool			corrupt = false;
 	bool			tp_dirty = false;
@@ -733,12 +732,16 @@ xfbtree_trans_commit(
 			 * If the buffer fails verification, log the failure
 			 * but continue walking the transaction items so that
 			 * we remove all ephemeral btree buffers.
+			 *
+			 * Since the userspace buffer cache supports marking
+			 * buffers dirty and flushing them later, use this to
+			 * reduce the number of writes to the xfile.
 			 */
 			if (fa) {
 				corrupt = true;
 				xfs_verifier_error(bp, -EFSCORRUPTED, fa);
 			} else {
-				xfs_buf_delwri_queue_here(bp, &buffer_list);
+				libxfs_buf_mark_dirty(bp);
 			}
 		}
 
@@ -752,15 +755,9 @@ xfbtree_trans_commit(
 	tp->t_flags = (tp->t_flags & ~XFS_TRANS_DIRTY) |
 			(tp_dirty ? XFS_TRANS_DIRTY : 0);
 
-	if (corrupt) {
-		xfs_buf_delwri_cancel(&buffer_list);
+	if (corrupt)
 		return -EFSCORRUPTED;
-	}
-
-	if (list_empty(&buffer_list))
-		return 0;
-
-	return xfs_buf_delwri_submit(&buffer_list);
+	return 0;
 }
 
 /*


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (42 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/5] xfs: create a helper to decide if a file mapping targets the rt volume Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
                   ` (33 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

We have now constructed the four tools that we need to scan the
filesystem looking for reverse mappings: an inode scanner, hooks to
receive live updates from other writer threads, the ability to construct
btrees in memory, and a btree bulk loader.

This series glues those four together, enabling us to scan the
filesystem for mappings and keep it up to date while other writers run,
and then commit the new btree to disk atomically.

To reduce the size of each patch, the functionality is left disabled
until the end of the series and broken up into three patches: one to
create the mechanics of scanning the filesystem, a second to transition
to in-memory btrees, and a third to set up the live hooks.
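
For a 10,000 foot view of how those pieces fit together, here is a
rough sketch of the repair flow; every function name below is
illustrative shorthand rather than an actual symbol from this series:

/* Illustrative outline only; the real code lives in the repair patches. */
static int
rmapbt_rebuild_outline(struct xfs_perag *pag)
{
	int	error;

	/* Set up a shadow rmap btree backed by an xfile. */
	error = shadow_rmapbt_create(pag);
	if (error)
		return error;

	/* Hook live rmap updates so concurrent writers keep us current. */
	error = shadow_rmapbt_hooks_attach(pag);
	if (error)
		goto out_shadow;

	/* Scan all primary metadata, adding each mapping to the shadow. */
	error = scan_primary_metadata(pag);
	if (error)
		goto out_hooks;

	/* Bulk-load a replacement rmapbt from the shadow records. */
	error = bulkload_new_rmapbt(pag);
	if (error)
		goto out_hooks;

	/* Commit the new root atomically and reap the old blocks. */
	error = commit_new_rmapbt(pag);

out_hooks:
	shadow_rmapbt_hooks_detach(pag);
out_shadow:
	shadow_rmapbt_destroy(pag);
	return error;
}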

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-rmap-btree

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-rmap-btree
---
 include/xfs_mount.h     |    6 +
 libxfs/xfs_ag.c         |    1 
 libxfs/xfs_ag.h         |    3 +
 libxfs/xfs_bmap.c       |   49 +++++++++++-
 libxfs/xfs_bmap.h       |    8 ++
 libxfs/xfs_inode_fork.c |    9 ++
 libxfs/xfs_inode_fork.h |    1 
 libxfs/xfs_rmap.c       |  192 +++++++++++++++++++++++++++++++++++------------
 libxfs/xfs_rmap.h       |   30 +++++++
 libxfs/xfs_rmap_btree.c |  136 +++++++++++++++++++++++++++++++++
 libxfs/xfs_rmap_btree.h |    9 ++
 man/man8/mkfs.xfs.8.in  |    4 -
 mkfs/xfs_mkfs.c         |    2 
 13 files changed, 392 insertions(+), 58 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/5] xfs: create a helper to decide if a file mapping targets the rt volume
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/5] xfs: repair the rmapbt Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a helper so that we can stop open-coding this decision
everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_bmap.c       |    6 +++---
 libxfs/xfs_inode_fork.c |    9 +++++++++
 libxfs/xfs_inode_fork.h |    1 +
 3 files changed, 13 insertions(+), 3 deletions(-)


diff --git a/libxfs/xfs_bmap.c b/libxfs/xfs_bmap.c
index f88e95056e2..4533db87668 100644
--- a/libxfs/xfs_bmap.c
+++ b/libxfs/xfs_bmap.c
@@ -4905,7 +4905,7 @@ xfs_bmap_del_extent_delay(
 
 	XFS_STATS_INC(mp, xs_del_exlist);
 
-	isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);
+	isrt = xfs_ifork_is_realtime(ip, whichfork);
 	del_endoff = del->br_startoff + del->br_blockcount;
 	got_endoff = got->br_startoff + got->br_blockcount;
 	da_old = startblockval(got->br_startblock);
@@ -5145,7 +5145,7 @@ xfs_bmap_del_extent_real(
 		return -ENOSPC;
 
 	flags = XFS_ILOG_CORE;
-	if (whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip)) {
+	if (xfs_ifork_is_realtime(ip, whichfork)) {
 		xfs_filblks_t	len;
 		xfs_extlen_t	mod;
 
@@ -5410,7 +5410,7 @@ __xfs_bunmapi(
 		return 0;
 	}
 	XFS_STATS_INC(mp, xs_blk_unmap);
-	isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip);
+	isrt = xfs_ifork_is_realtime(ip, whichfork);
 	end = start + len;
 
 	if (!xfs_iext_lookup_extent_before(ip, ifp, &end, &icur, &got)) {
diff --git a/libxfs/xfs_inode_fork.c b/libxfs/xfs_inode_fork.c
index 0dd59f4eb43..9b0c786fab6 100644
--- a/libxfs/xfs_inode_fork.c
+++ b/libxfs/xfs_inode_fork.c
@@ -785,3 +785,12 @@ xfs_iext_count_upgrade(
 
 	return 0;
 }
+
+/* Decide if a file mapping is on the realtime device or not. */
+bool
+xfs_ifork_is_realtime(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	return XFS_IS_REALTIME_INODE(ip) && whichfork != XFS_ATTR_FORK;
+}
diff --git a/libxfs/xfs_inode_fork.h b/libxfs/xfs_inode_fork.h
index 36a9fe3420c..c201d8ad595 100644
--- a/libxfs/xfs_inode_fork.h
+++ b/libxfs/xfs_inode_fork.h
@@ -261,6 +261,7 @@ int xfs_iext_count_may_overflow(struct xfs_inode *ip, int whichfork,
 		int nr_to_add);
 int xfs_iext_count_upgrade(struct xfs_trans *tp, struct xfs_inode *ip,
 		uint nr_to_add);
+bool xfs_ifork_is_realtime(struct xfs_inode *ip, int whichfork);
 
 /* returns true if the fork has extents but they are not read in yet. */
 static inline bool xfs_need_iread_extents(struct xfs_ifork *ifp)


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/5] xfs: repair the rmapbt
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/5] xfs: create a helper to decide if a file mapping targets the rt volume Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/5] xfs: create a shadow rmap btree during rmap repair Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Rebuild the reverse mapping btree from all primary metadata.
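
Only the libxfs pieces are mirrored in this patch, so the repair code
that consumes the new xfs_bmap_query_all() helper is not visible here.
A minimal, made-up example of the interface would look like this:

/* Count mapped blocks by walking every record in a bmbt. */
static int
count_bmap_helper(
	struct xfs_btree_cur	*cur,
	struct xfs_bmbt_irec	*irec,
	void			*priv)
{
	unsigned long long	*blocks = priv;

	*blocks += irec->br_blockcount;
	return 0;
}

static int
count_mapped_blocks(
	struct xfs_btree_cur	*bmbt_cur,
	unsigned long long	*blocks)
{
	*blocks = 0;
	return xfs_bmap_query_all(bmbt_cur, count_bmap_helper, blocks);
}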

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_bmap.c       |   43 +++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_bmap.h       |    8 ++++++++
 libxfs/xfs_rmap.c       |   22 +++++++++++++++-------
 libxfs/xfs_rmap.h       |    2 ++
 libxfs/xfs_rmap_btree.c |   13 ++++++++++++-
 5 files changed, 80 insertions(+), 8 deletions(-)


diff --git a/libxfs/xfs_bmap.c b/libxfs/xfs_bmap.c
index 4533db87668..59c9d53bf31 100644
--- a/libxfs/xfs_bmap.c
+++ b/libxfs/xfs_bmap.c
@@ -6399,3 +6399,46 @@ xfs_bunmapi_range(
 out:
 	return error;
 }
+
+struct xfs_bmap_query_range {
+	xfs_bmap_query_range_fn	fn;
+	void			*priv;
+};
+
+/* Format btree record and pass to our callback. */
+STATIC int
+xfs_bmap_query_range_helper(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_rec	*rec,
+	void				*priv)
+{
+	struct xfs_bmap_query_range	*query = priv;
+	struct xfs_bmbt_irec		irec;
+	xfs_failaddr_t			fa;
+
+	xfs_bmbt_disk_get_all(&rec->bmbt, &irec);
+	fa = xfs_bmap_validate_extent(cur->bc_ino.ip, cur->bc_ino.whichfork,
+			&irec);
+	if (fa) {
+		xfs_btree_mark_sick(cur);
+		return xfs_bmap_complain_bad_rec(cur->bc_ino.ip,
+				cur->bc_ino.whichfork, fa, &irec);
+	}
+
+	return query->fn(cur, &irec, query->priv);
+}
+
+/* Find all bmaps. */
+int
+xfs_bmap_query_all(
+	struct xfs_btree_cur		*cur,
+	xfs_bmap_query_range_fn		fn,
+	void				*priv)
+{
+	struct xfs_bmap_query_range	query = {
+		.priv			= priv,
+		.fn			= fn,
+	};
+
+	return xfs_btree_query_all(cur, xfs_bmap_query_range_helper, &query);
+}
diff --git a/libxfs/xfs_bmap.h b/libxfs/xfs_bmap.h
index 1201ee024c1..bbda4a77cb6 100644
--- a/libxfs/xfs_bmap.h
+++ b/libxfs/xfs_bmap.h
@@ -274,4 +274,12 @@ extern struct kmem_cache	*xfs_bmap_intent_cache;
 int __init xfs_bmap_intent_init_cache(void);
 void xfs_bmap_intent_destroy_cache(void);
 
+typedef int (*xfs_bmap_query_range_fn)(
+	struct xfs_btree_cur	*cur,
+	struct xfs_bmbt_irec	*rec,
+	void			*priv);
+
+int xfs_bmap_query_all(struct xfs_btree_cur *cur, xfs_bmap_query_range_fn fn,
+		void *priv);
+
 #endif	/* __XFS_BMAP_H__ */
diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index 0816f103f1c..18d9dd480b0 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -211,13 +211,12 @@ xfs_rmap_btrec_to_irec(
 			irec);
 }
 
-/* Simple checks for rmap records. */
-xfs_failaddr_t
-xfs_rmap_check_irec(
-	struct xfs_btree_cur		*cur,
+inline xfs_failaddr_t
+xfs_rmap_check_perag_irec(
+	struct xfs_perag		*pag,
 	const struct xfs_rmap_irec	*irec)
 {
-	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_mount		*mp = pag->pag_mount;
 	bool				is_inode;
 	bool				is_unwritten;
 	bool				is_bmbt;
@@ -232,8 +231,8 @@ xfs_rmap_check_irec(
 			return __this_address;
 	} else {
 		/* check for valid extent range, including overflow */
-		if (!xfs_verify_agbext(cur->bc_ag.pag, irec->rm_startblock,
-						       irec->rm_blockcount))
+		if (!xfs_verify_agbext(pag, irec->rm_startblock,
+					    irec->rm_blockcount))
 			return __this_address;
 	}
 
@@ -268,6 +267,15 @@ xfs_rmap_check_irec(
 	return NULL;
 }
 
+/* Simple checks for rmap records. */
+xfs_failaddr_t
+xfs_rmap_check_irec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*irec)
+{
+	return xfs_rmap_check_perag_irec(cur->bc_ag.pag, irec);
+}
+
 static inline int
 xfs_rmap_complain_bad_rec(
 	struct xfs_btree_cur		*cur,
diff --git a/libxfs/xfs_rmap.h b/libxfs/xfs_rmap.h
index ced605d6932..b7ad51055e1 100644
--- a/libxfs/xfs_rmap.h
+++ b/libxfs/xfs_rmap.h
@@ -195,6 +195,8 @@ int xfs_rmap_compare(const struct xfs_rmap_irec *a,
 union xfs_btree_rec;
 xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_rmap_irec *irec);
+xfs_failaddr_t xfs_rmap_check_perag_irec(struct xfs_perag *pag,
+		const struct xfs_rmap_irec *irec);
 xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur,
 		const struct xfs_rmap_irec *irec);
 
diff --git a/libxfs/xfs_rmap_btree.c b/libxfs/xfs_rmap_btree.c
index 928f61053b0..574f2dda5c0 100644
--- a/libxfs/xfs_rmap_btree.c
+++ b/libxfs/xfs_rmap_btree.c
@@ -340,7 +340,18 @@ xfs_rmapbt_verify(
 
 	level = be16_to_cpu(block->bb_level);
 	if (pag && pag->pagf_init) {
-		if (level >= pag->pagf_levels[XFS_BTNUM_RMAPi])
+		unsigned int	maxlevel = pag->pagf_levels[XFS_BTNUM_RMAPi];
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+		/*
+		 * Online repair could be rewriting the free space btrees, so
+		 * we'll validate against the larger of either tree while this
+		 * is going on.
+		 */
+		maxlevel = max_t(unsigned int, maxlevel,
+				pag->pagf_alt_levels[XFS_BTNUM_RMAPi]);
+#endif
+		if (level >= maxlevel)
 			return __this_address;
 	} else if (level >= mp->m_rmap_maxlevels)
 		return __this_address;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/5] xfs: create a shadow rmap btree during rmap repair
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/5] xfs: create a helper to decide if a file mapping targets the rt volume Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/5] xfs: repair the rmapbt Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/5] xfs: hook live rmap operations during a repair operation Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 5/5] mkfs: enable reverse mapping by default Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an in-memory btree of rmap records instead of an array.  This
enables us to do live record collection instead of freezing the fs.
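
As a rough usage sketch (the function name here is made up; the calls
are the ones added below and in the xfbtree headers), a repair caller
would wire the pieces together roughly like this:

/* Sketch: build a shadow rmap btree in an xfile and open a cursor on it. */
static int
rmap_shadow_setup(
	struct xfs_mount	*mp,
	struct xfs_perag	*pag,
	struct xfs_buftarg	*xfile_target,
	struct xfs_trans	*tp,
	struct xfbtree		**xfbtp,
	struct xfs_btree_cur	**curp)
{
	struct xfs_buf		*head_bp;
	int			error;

	error = xfs_rmapbt_mem_create(mp, pag->pag_agno, xfile_target, xfbtp);
	if (error)
		return error;

	/* The btree head always lives in block 0 of the backing xfile. */
	error = xfbtree_head_read_buf(*xfbtp, tp, &head_bp);
	if (error) {
		xfbtree_destroy(*xfbtp);
		return error;
	}

	*curp = xfs_rmapbt_mem_cursor(pag, tp, head_bp, *xfbtp);
	return 0;
}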

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_rmap.c       |   25 +++++-----
 libxfs/xfs_rmap_btree.c |  123 +++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_rmap_btree.h |    9 +++
 3 files changed, 146 insertions(+), 11 deletions(-)


diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index 18d9dd480b0..bbc3546eead 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -273,6 +273,8 @@ xfs_rmap_check_irec(
 	struct xfs_btree_cur		*cur,
 	const struct xfs_rmap_irec	*irec)
 {
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		return xfs_rmap_check_perag_irec(cur->bc_mem.pag, irec);
 	return xfs_rmap_check_perag_irec(cur->bc_ag.pag, irec);
 }
 
@@ -284,9 +286,13 @@ xfs_rmap_complain_bad_rec(
 {
 	struct xfs_mount		*mp = cur->bc_mp;
 
-	xfs_warn(mp,
-		"Reverse Mapping BTree record corruption in AG %d detected at %pS!",
-		cur->bc_ag.pag->pag_agno, fa);
+	if (cur->bc_flags & XFS_BTREE_IN_MEMORY)
+		xfs_warn(mp,
+ "In-Memory Reverse Mapping BTree record corruption detected at %pS!", fa);
+	else
+		xfs_warn(mp,
+ "Reverse Mapping BTree record corruption in AG %d detected at %pS!",
+			cur->bc_ag.pag->pag_agno, fa);
 	xfs_warn(mp,
 		"Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x",
 		irec->rm_owner, irec->rm_flags, irec->rm_startblock,
@@ -2411,15 +2417,12 @@ xfs_rmap_map_raw(
 {
 	struct xfs_owner_info	oinfo;
 
-	oinfo.oi_owner = rmap->rm_owner;
-	oinfo.oi_offset = rmap->rm_offset;
-	oinfo.oi_flags = 0;
-	if (rmap->rm_flags & XFS_RMAP_ATTR_FORK)
-		oinfo.oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
-	if (rmap->rm_flags & XFS_RMAP_BMBT_BLOCK)
-		oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
+	xfs_owner_info_pack(&oinfo, rmap->rm_owner, rmap->rm_offset,
+			rmap->rm_flags);
 
-	if (rmap->rm_flags || XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
+	if ((rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+			       XFS_RMAP_UNWRITTEN)) ||
+	    XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
 		return xfs_rmap_map(cur, rmap->rm_startblock,
 				rmap->rm_blockcount,
 				rmap->rm_flags & XFS_RMAP_UNWRITTEN,
diff --git a/libxfs/xfs_rmap_btree.c b/libxfs/xfs_rmap_btree.c
index 574f2dda5c0..0b10f81d30e 100644
--- a/libxfs/xfs_rmap_btree.c
+++ b/libxfs/xfs_rmap_btree.c
@@ -19,6 +19,9 @@
 #include "xfs_trace.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
+#include "xfile.h"
+#include "xfbtree.h"
+#include "xfs_btree_mem.h"
 
 static struct kmem_cache	*xfs_rmapbt_cur_cache;
 
@@ -553,6 +556,126 @@ xfs_rmapbt_stage_cursor(
 	return cur;
 }
 
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+/*
+ * Validate an in-memory rmap btree block.  Callers are allowed to generate an
+ * in-memory btree even if the ondisk feature is not enabled.
+ */
+static xfs_failaddr_t
+xfs_rmapbt_mem_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	xfs_failaddr_t		fa;
+	unsigned int		level;
+
+	if (!xfs_verify_magic(bp, block->bb_magic))
+		return __this_address;
+
+	fa = xfs_btree_sblock_v5hdr_verify(bp);
+	if (fa)
+		return fa;
+
+	level = be16_to_cpu(block->bb_level);
+	if (xfs_has_rmapbt(mp)) {
+		if (level >= mp->m_rmap_maxlevels)
+			return __this_address;
+	} else {
+		if (level >= xfs_rmapbt_maxlevels_ondisk())
+			return __this_address;
+	}
+
+	return xfbtree_sblock_verify(bp,
+			xfs_rmapbt_maxrecs(xfo_to_b(1), level == 0));
+}
+
+static void
+xfs_rmapbt_mem_rw_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa = xfs_rmapbt_mem_verify(bp);
+
+	if (fa)
+		xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+/* skip crc checks on in-memory btrees to save time */
+static const struct xfs_buf_ops xfs_rmapbt_mem_buf_ops = {
+	.name			= "xfs_rmapbt_mem",
+	.magic			= { 0, cpu_to_be32(XFS_RMAP_CRC_MAGIC) },
+	.verify_read		= xfs_rmapbt_mem_rw_verify,
+	.verify_write		= xfs_rmapbt_mem_rw_verify,
+	.verify_struct		= xfs_rmapbt_mem_verify,
+};
+
+static const struct xfs_btree_ops xfs_rmapbt_mem_ops = {
+	.rec_len		= sizeof(struct xfs_rmap_rec),
+	.key_len		= 2 * sizeof(struct xfs_rmap_key),
+
+	.dup_cursor		= xfbtree_dup_cursor,
+	.set_root		= xfbtree_set_root,
+	.alloc_block		= xfbtree_alloc_block,
+	.free_block		= xfbtree_free_block,
+	.get_minrecs		= xfbtree_get_minrecs,
+	.get_maxrecs		= xfbtree_get_maxrecs,
+	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_rmapbt_init_high_key_from_rec,
+	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfbtree_init_ptr_from_cur,
+	.key_diff		= xfs_rmapbt_key_diff,
+	.buf_ops		= &xfs_rmapbt_mem_buf_ops,
+	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
+	.keys_inorder		= xfs_rmapbt_keys_inorder,
+	.recs_inorder		= xfs_rmapbt_recs_inorder,
+	.keys_contiguous	= xfs_rmapbt_keys_contiguous,
+};
+
+/* Create a cursor for an in-memory btree. */
+struct xfs_btree_cur *
+xfs_rmapbt_mem_cursor(
+	struct xfs_perag	*pag,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*head_bp,
+	struct xfbtree		*xfbtree)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_mount	*mp = pag->pag_mount;
+
+	/* Overlapping btree; 2 keys per pointer. */
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
+			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
+	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING |
+			XFS_BTREE_IN_MEMORY;
+	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
+	cur->bc_ops = &xfs_rmapbt_mem_ops;
+	cur->bc_mem.xfbtree = xfbtree;
+	cur->bc_mem.head_bp = head_bp;
+	cur->bc_nlevels = xfs_btree_mem_head_nlevels(head_bp);
+
+	cur->bc_mem.pag = xfs_perag_bump(pag);
+	return cur;
+}
+
+/* Create an in-memory rmap btree. */
+int
+xfs_rmapbt_mem_create(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	struct xfs_buftarg	*target,
+	struct xfbtree		**xfbtreep)
+{
+	struct xfbtree_config	cfg = {
+		.btree_ops	= &xfs_rmapbt_mem_ops,
+		.target		= target,
+		.btnum		= XFS_BTNUM_RMAP,
+		.owner		= agno,
+	};
+
+	return xfbtree_create(mp, &cfg, xfbtreep);
+}
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
 /*
  * Install a new reverse mapping btree root.  Caller is responsible for
  * invalidating and freeing the old btree blocks.
diff --git a/libxfs/xfs_rmap_btree.h b/libxfs/xfs_rmap_btree.h
index 3244715dd11..a27a236111d 100644
--- a/libxfs/xfs_rmap_btree.h
+++ b/libxfs/xfs_rmap_btree.h
@@ -64,4 +64,13 @@ unsigned int xfs_rmapbt_maxlevels_ondisk(void);
 int __init xfs_rmapbt_init_cur_cache(void);
 void xfs_rmapbt_destroy_cur_cache(void);
 
+#ifdef CONFIG_XFS_IN_MEMORY_BTREE
+struct xfbtree;
+struct xfs_btree_cur *xfs_rmapbt_mem_cursor(struct xfs_perag *pag,
+		struct xfs_trans *tp, struct xfs_buf *head_bp,
+		struct xfbtree *xfbtree);
+int xfs_rmapbt_mem_create(struct xfs_mount *mp, xfs_agnumber_t agno,
+		struct xfs_buftarg *target, struct xfbtree **xfbtreep);
+#endif /* CONFIG_XFS_IN_MEMORY_BTREE */
+
 #endif /* __XFS_RMAP_BTREE_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs: hook live rmap operations during a repair operation
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 3/5] xfs: create a shadow rmap btree during rmap repair Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 5/5] mkfs: enable reverse mapping by default Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Hook the regular rmap code when an rmapbt repair operation is running so
that we can unlock the AGF buffer to scan the filesystem and keep the
in-memory btree up to date during the scan.
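
The consumer side lives in the online repair code and is not part of
this patch; a condensed sketch of how a scan might attach to the
per-AG hook chain (with the callback wiring inside struct
xfs_rmap_hook glossed over) looks something like this:

/* Sketch: subscribe to rmap updates for one AG while a scan runs. */
struct rmap_scan {
	struct xfs_rmap_hook	rhook;
	/* shadow btree cursor and scan state would live here */
};

static int
rmap_scan_hooks_start(
	struct xfs_perag	*pag,
	struct rmap_scan	*rs)
{
	int			error;

	/*
	 * Once added, xfs_rmap_update_hook() feeds this AG's rmapbt changes
	 * to us as struct xfs_rmap_update_params.
	 */
	error = xfs_rmap_hook_add(pag, &rs->rhook);
	if (error)
		return error;

	/* Flip the static key so the notification calls actually happen. */
	xfs_rmap_hook_enable();
	return 0;
}

static void
rmap_scan_hooks_stop(
	struct xfs_perag	*pag,
	struct rmap_scan	*rs)
{
	xfs_rmap_hook_disable();
	xfs_rmap_hook_del(pag, &rs->rhook);
}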

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/xfs_mount.h |    6 ++
 libxfs/xfs_ag.c     |    1 
 libxfs/xfs_ag.h     |    3 +
 libxfs/xfs_rmap.c   |  145 +++++++++++++++++++++++++++++++++++++++------------
 libxfs/xfs_rmap.h   |   28 ++++++++++
 5 files changed, 150 insertions(+), 33 deletions(-)


diff --git a/include/xfs_mount.h b/include/xfs_mount.h
index 906b0573c0b..c67d0237686 100644
--- a/include/xfs_mount.h
+++ b/include/xfs_mount.h
@@ -13,6 +13,12 @@ struct xfs_da_geometry;
 
 typedef void (*buf_writeback_fn)(struct xfs_buf *bp);
 
+/* Stubbed-out functionality from the kernel. */
+struct xfs_hook_chain {
+};
+#define xfs_hook_init(chain)		((void)0)
+#define xfs_hook_call(chain, val, priv)	(0)
+
 /*
  * Define a user-level mount structure with all we need
  * in order to make use of the numerous XFS_* macros.
diff --git a/libxfs/xfs_ag.c b/libxfs/xfs_ag.c
index d01b1c6a7a4..e3585813288 100644
--- a/libxfs/xfs_ag.c
+++ b/libxfs/xfs_ag.c
@@ -332,6 +332,7 @@ xfs_initialize_perag(
 		init_waitqueue_head(&pag->pagb_wait);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
+		xfs_hooks_init(&pag->pag_rmap_update_hooks);
 #endif /* __KERNEL__ */
 
 		error = xfs_buf_hash_init(pag);
diff --git a/libxfs/xfs_ag.h b/libxfs/xfs_ag.h
index fd663d04bdf..6d5ddefa321 100644
--- a/libxfs/xfs_ag.h
+++ b/libxfs/xfs_ag.h
@@ -121,6 +121,9 @@ struct xfs_perag {
 	 * inconsistencies.
 	 */
 	struct xfs_drain	pag_intents;
+
+	/* Hook to feed rmapbt updates to an active online repair. */
+	struct xfs_hooks	pag_rmap_update_hooks;
 #endif /* __KERNEL__ */
 };
 
diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index bbc3546eead..172dac6a2bd 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -820,6 +820,77 @@ xfs_rmap_unmap(
 	return error;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of rmapbt live updates.  If
+ * the compiler supports jump labels, the static branch will be replaced by a
+ * nop sled when there are no hook users.  Online fsck is currently the only
+ * caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_rmap_hooks_switch);
+
+void
+xfs_rmap_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_rmap_hooks_switch);
+}
+
+void
+xfs_rmap_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_rmap_hooks_switch);
+}
+
+/* Call downstream hooks for a reverse mapping update. */
+static inline void
+xfs_rmap_update_hook(
+	struct xfs_trans		*tp,
+	struct xfs_perag		*pag,
+	enum xfs_rmap_intent_type	op,
+	xfs_agblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	bool				unwritten,
+	const struct xfs_owner_info	*oinfo)
+{
+	if (xfs_hooks_switched_on(&xfs_rmap_hooks_switch)) {
+		struct xfs_rmap_update_params	p = {
+			.startblock	= startblock,
+			.blockcount	= blockcount,
+			.unwritten	= unwritten,
+			.oinfo		= *oinfo, /* struct copy */
+		};
+
+		if (pag)
+			xfs_hooks_call(&pag->pag_rmap_update_hooks, op, &p);
+	}
+}
+
+/* Call the specified function during a reverse mapping update. */
+int
+xfs_rmap_hook_add(
+	struct xfs_perag	*pag,
+	struct xfs_rmap_hook	*hook)
+{
+	return xfs_hooks_add(&pag->pag_rmap_update_hooks, &hook->update_hook);
+}
+
+/* Stop calling the specified function during a reverse mapping update. */
+void
+xfs_rmap_hook_del(
+	struct xfs_perag	*pag,
+	struct xfs_rmap_hook	*hook)
+{
+	xfs_hooks_del(&pag->pag_rmap_update_hooks, &hook->update_hook);
+}
+#else
+# define xfs_rmap_update_hook(t, p, o, s, b, u, oi)	do { } while(0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Remove a reference to an extent in the rmap btree.
  */
@@ -840,7 +911,7 @@ xfs_rmap_free(
 		return 0;
 
 	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, pag);
-
+	xfs_rmap_update_hook(tp, pag, XFS_RMAP_UNMAP, bno, len, false, oinfo);
 	error = xfs_rmap_unmap(cur, bno, len, false, oinfo);
 
 	xfs_btree_del_cursor(cur, error);
@@ -1092,6 +1163,7 @@ xfs_rmap_alloc(
 		return 0;
 
 	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, pag);
+	xfs_rmap_update_hook(tp, pag, XFS_RMAP_MAP, bno, len, false, oinfo);
 	error = xfs_rmap_map(cur, bno, len, false, oinfo);
 
 	xfs_btree_del_cursor(cur, error);
@@ -2511,6 +2583,38 @@ xfs_rmap_finish_one_cleanup(
 		xfs_trans_brelse(tp, agbp);
 }
 
+/* Commit an rmap operation into the ondisk tree. */
+int
+__xfs_rmap_finish_intent(
+	struct xfs_btree_cur		*rcur,
+	enum xfs_rmap_intent_type	op,
+	xfs_agblock_t			bno,
+	xfs_extlen_t			len,
+	const struct xfs_owner_info	*oinfo,
+	bool				unwritten)
+{
+	switch (op) {
+	case XFS_RMAP_ALLOC:
+	case XFS_RMAP_MAP:
+		return xfs_rmap_map(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_MAP_SHARED:
+		return xfs_rmap_map_shared(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_FREE:
+	case XFS_RMAP_UNMAP:
+		return xfs_rmap_unmap(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_UNMAP_SHARED:
+		return xfs_rmap_unmap_shared(rcur, bno, len, unwritten, oinfo);
+	case XFS_RMAP_CONVERT:
+		return xfs_rmap_convert(rcur, bno, len, !unwritten, oinfo);
+	case XFS_RMAP_CONVERT_SHARED:
+		return xfs_rmap_convert_shared(rcur, bno, len, !unwritten,
+				oinfo);
+	default:
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+}
+
 /*
  * Process one of the deferred rmap operations.  We pass back the
  * btree cursor to maintain our lock on the rmapbt between calls.
@@ -2577,39 +2681,14 @@ xfs_rmap_finish_one(
 	unwritten = ri->ri_bmap.br_state == XFS_EXT_UNWRITTEN;
 	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, ri->ri_bmap.br_startblock);
 
-	switch (ri->ri_type) {
-	case XFS_RMAP_ALLOC:
-	case XFS_RMAP_MAP:
-		error = xfs_rmap_map(rcur, bno, ri->ri_bmap.br_blockcount,
-				unwritten, &oinfo);
-		break;
-	case XFS_RMAP_MAP_SHARED:
-		error = xfs_rmap_map_shared(rcur, bno,
-				ri->ri_bmap.br_blockcount, unwritten, &oinfo);
-		break;
-	case XFS_RMAP_FREE:
-	case XFS_RMAP_UNMAP:
-		error = xfs_rmap_unmap(rcur, bno, ri->ri_bmap.br_blockcount,
-				unwritten, &oinfo);
-		break;
-	case XFS_RMAP_UNMAP_SHARED:
-		error = xfs_rmap_unmap_shared(rcur, bno,
-				ri->ri_bmap.br_blockcount, unwritten, &oinfo);
-		break;
-	case XFS_RMAP_CONVERT:
-		error = xfs_rmap_convert(rcur, bno, ri->ri_bmap.br_blockcount,
-				!unwritten, &oinfo);
-		break;
-	case XFS_RMAP_CONVERT_SHARED:
-		error = xfs_rmap_convert_shared(rcur, bno,
-				ri->ri_bmap.br_blockcount, !unwritten, &oinfo);
-		break;
-	default:
-		ASSERT(0);
-		error = -EFSCORRUPTED;
-	}
+	error = __xfs_rmap_finish_intent(rcur, ri->ri_type, bno,
+			ri->ri_bmap.br_blockcount, &oinfo, unwritten);
+	if (error)
+		return error;
 
-	return error;
+	xfs_rmap_update_hook(tp, ri->ri_pag, ri->ri_type, bno,
+			ri->ri_bmap.br_blockcount, unwritten, &oinfo);
+	return 0;
 }
 
 /*
diff --git a/libxfs/xfs_rmap.h b/libxfs/xfs_rmap.h
index b7ad51055e1..2a9265218f1 100644
--- a/libxfs/xfs_rmap.h
+++ b/libxfs/xfs_rmap.h
@@ -186,6 +186,10 @@ void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
 		struct xfs_btree_cur *rcur, int error);
 int xfs_rmap_finish_one(struct xfs_trans *tp, struct xfs_rmap_intent *ri,
 		struct xfs_btree_cur **pcur);
+int __xfs_rmap_finish_intent(struct xfs_btree_cur *rcur,
+		enum xfs_rmap_intent_type op, xfs_agblock_t bno,
+		xfs_extlen_t len, const struct xfs_owner_info *oinfo,
+		bool unwritten);
 
 int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		uint64_t owner, uint64_t offset, unsigned int flags,
@@ -237,4 +241,28 @@ extern struct kmem_cache	*xfs_rmap_intent_cache;
 int __init xfs_rmap_intent_init_cache(void);
 void xfs_rmap_intent_destroy_cache(void);
 
+/*
+ * Parameters for tracking reverse mapping changes.  The hook function arg
+ * parameter is enum xfs_rmap_intent_type, and the rest is below.
+ */
+struct xfs_rmap_update_params {
+	xfs_agblock_t			startblock;
+	xfs_extlen_t			blockcount;
+	struct xfs_owner_info		oinfo;
+	bool				unwritten;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+
+struct xfs_rmap_hook {
+	struct xfs_hook			update_hook;
+};
+
+void xfs_rmap_hook_disable(void);
+void xfs_rmap_hook_enable(void);
+
+int xfs_rmap_hook_add(struct xfs_perag *pag, struct xfs_rmap_hook *hook);
+void xfs_rmap_hook_del(struct xfs_perag *pag, struct xfs_rmap_hook *hook);
+#endif
+
 #endif	/* __XFS_RMAP_H__ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] mkfs: enable reverse mapping by default
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 4/5] xfs: hook live rmap operations during a repair operation Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that the scrub part of online fsck is feature complete (scrub and
health reporting are done), there's actually a compelling story for
having the reverse mappings enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 man/man8/mkfs.xfs.8.in |    4 ++--
 mkfs/xfs_mkfs.c        |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)


diff --git a/man/man8/mkfs.xfs.8.in b/man/man8/mkfs.xfs.8.in
index 1c90b5b5595..94f117b6917 100644
--- a/man/man8/mkfs.xfs.8.in
+++ b/man/man8/mkfs.xfs.8.in
@@ -289,8 +289,8 @@ pinpoint exactly which data has been lost when a disk error occurs.
 .IP
 By default,
 .B mkfs.xfs
-will not create reverse mapping btrees.  This feature is only available
-for filesystems created with the (default)
+will create reverse mapping btrees when possible.
+This feature is only available for filesystems created with the (default)
 .B \-m crc=1
 option set. When the option
 .B \-m crc=0
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 638e7ce6ea4..cca3497ab64 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -4306,7 +4306,7 @@ main(
 			.dirftype = true,
 			.finobt = true,
 			.spinodes = true,
-			.rmapbt = false,
+			.rmapbt = true,
 			.reflink = true,
 			.inobtcnt = true,
 			.parent_pointers = false,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (43 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 6/6] xfs_repair: remove the old rmap collection slabs Darrick J. Wong
                     ` (5 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
                   ` (32 subsequent siblings)
  77 siblings, 6 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Now that we've ported support for in-memory btrees to userspace, port
xfs_repair to use them instead of the clunky slab interface that we
currently use.  This has the effect of moving memory consumption for
tracking reverse mappings into a memfd file, which means that we could
(theoretically) reduce the memory requirements by pointing it at an
on-disk file or something.  It also enables us to remove the sorting
step and to avoid having to coalesce adjacent contiguous bmap records
into a single rmap record.
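
Concretely, each AG's worth of observations now sits on a short stack
of userspace infrastructure: an xfile (memfd), a buffer cache target
on top of it, and an in-memory rmap btree on top of that.  The sketch
below condenses the setup helper that appears later in this series
(rmaps_init_ag):

/* Condensed from rmaps_init_ag later in this series. */
static void
init_ag_rmap_store(
	struct xfs_mount	*mp,
	xfs_agnumber_t		agno,
	struct xfbtree		**xfbtp)
{
	struct xfile		*xfile;
	struct xfs_buftarg	*target;
	unsigned long long	maxbytes;
	int			error;

	/* memfd-backed sparse file sized to one AG's worth of blocks */
	maxbytes = XFS_FSB_TO_B(mp, mp->m_sb.sb_agblocks);
	error = xfile_create(mp, maxbytes, "rmapbt repair", &xfile);
	if (error)
		goto nomem;

	/* buffer cache target so the btree code can read/write the xfile */
	error = -libxfs_alloc_memory_buftarg(mp, xfile, &target);
	if (error)
		goto nomem;

	/* in-memory rmap btree that will hold the observed records */
	error = -libxfs_rmapbt_mem_create(mp, agno, target, xfbtp);
	if (error)
		goto nomem;
	return;
nomem:
	do_error(
_("Insufficient memory while allocating reverse mapping btree."));
}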

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-use-in-memory-btrees
---
 libxfs/libxfs_api_defs.h |   13 +
 libxfs/xfile.c           |  169 ++++++++++
 libxfs/xfile.h           |   10 -
 repair/agbtree.c         |   18 +
 repair/agbtree.h         |    1 
 repair/dinode.c          |    9 -
 repair/phase4.c          |   25 --
 repair/phase5.c          |    2 
 repair/rmap.c            |  754 ++++++++++++++++++++++++++++++----------------
 repair/rmap.h            |   32 +-
 repair/scan.c            |    7 
 repair/slab.c            |   49 ++-
 repair/slab.h            |    2 
 repair/xfs_repair.c      |    6 
 14 files changed, 754 insertions(+), 343 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/6] libxfs: partition memfd files to avoid using too many fds
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 4/6] xfs_repair: compute refcount data from in-memory rmap btrees Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 5/6] xfs_repair: reduce rmap bag memory usage when creating refcounts Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/6] xfs_repair: convert regular rmap repair to use in-memory btrees Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that we can partition a memfd file to avoid running out of
file descriptors.
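
For illustration (this helper is made up; only xfile_create and
xfile_destroy come from the patch), two xfiles created with the same
nonzero maxrange can end up as partitions of one shared memfd, while
a zero maxrange still yields a private file:

/* Made-up example: carve two fixed-size partitions out of shared memfds. */
static int
create_partitioned_xfiles(
	struct xfs_mount	*mp,
	unsigned long long	ag_bytes,
	struct xfile		**xfile_a,
	struct xfile		**xfile_b)
{
	int			error;

	/* Nonzero maxrange: these two may share one memfd at different offsets. */
	error = xfile_create(mp, ag_bytes, "rmapbt repair ag 0", xfile_a);
	if (error)
		return error;

	error = xfile_create(mp, ag_bytes, "rmapbt repair ag 1", xfile_b);
	if (error) {
		xfile_destroy(*xfile_a);
		return error;
	}

	/* xfile_pread/pwrite offsets remain relative to each partition. */
	return 0;
}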

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfile.c |  169 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 libxfs/xfile.h |   10 ++-
 2 files changed, 167 insertions(+), 12 deletions(-)


diff --git a/libxfs/xfile.c b/libxfs/xfile.c
index c1b8b1c5928..dfe3c60a9b3 100644
--- a/libxfs/xfile.c
+++ b/libxfs/xfile.c
@@ -95,6 +95,146 @@ xfile_create_fd(
 	return fd;
 }
 
+struct xfile_fcb {
+	struct list_head	fcb_list;
+	int			fd;
+	unsigned int		refcount;
+};
+
+static LIST_HEAD(fcb_list);
+static pthread_mutex_t fcb_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+/* Create a new memfd. */
+static inline int
+xfile_fcb_create(
+	const char		*description,
+	struct xfile_fcb	**fcbp)
+{
+	struct xfile_fcb	*fcb;
+	int			fd;
+
+	fd = xfile_create_fd(description);
+	if (fd < 0)
+		return -errno;
+
+	fcb = malloc(sizeof(struct xfile_fcb));
+	if (!fcb) {
+		close(fd);
+		return -ENOMEM;
+	}
+
+	list_head_init(&fcb->fcb_list);
+	fcb->fd = fd;
+	fcb->refcount = 1;
+
+	*fcbp = fcb;
+	return 0;
+}
+
+/* Release an xfile control block */
+static void
+xfile_fcb_irele(
+	struct xfile_fcb	*fcb,
+	loff_t			pos,
+	uint64_t		len)
+{
+	/*
+	 * If this memfd is linked only to itself, it's private, so we can
+	 * close it without taking any locks.
+	 */
+	if (list_empty(&fcb->fcb_list)) {
+		close(fcb->fd);
+		free(fcb);
+		return;
+	}
+
+	pthread_mutex_lock(&fcb_mutex);
+	if (--fcb->refcount == 0) {
+		/* If we're the last user of this memfd file, kill it fast. */
+		list_del(&fcb->fcb_list);
+		close(fcb->fd);
+		free(fcb);
+	} else if (len > 0) {
+		struct stat	statbuf;
+		int		ret;
+
+		/*
+		 * If we were using the end of a partitioned file, free the
+		 * address space.  IOWs, bonus points if you delete these in
+		 * reverse-order of creation.
+		 */
+		ret = fstat(fcb->fd, &statbuf);
+		if (!ret && statbuf.st_size == pos + len) {
+			ret = ftruncate(fcb->fd, pos);
+		}
+	}
+	pthread_mutex_unlock(&fcb_mutex);
+}
+
+/*
+ * Find a memfd that can accommodate the given amount of address space.
+ */
+static int
+xfile_fcb_find(
+	const char		*description,
+	uint64_t		maxrange,
+	loff_t			*pos,
+	struct xfile_fcb	**fcbp)
+{
+	struct xfile_fcb	*fcb;
+	int			ret;
+	int			error = 0;
+
+	/* No maximum range means that the caller gets a private memfd. */
+	if (maxrange == 0) {
+		*pos = 0;
+		return xfile_fcb_create(description, fcbp);
+	}
+
+	pthread_mutex_lock(&fcb_mutex);
+
+	/*
+	 * If we only need a certain amount of file range, look for a memfd
+	 * with enough space available.
+	 */
+	list_for_each_entry(fcb, &fcb_list, fcb_list) {
+		struct stat	statbuf;
+
+		ret = fstat(fcb->fd, &statbuf);
+		if (ret)
+			continue;
+
+		ret = ftruncate(fcb->fd, statbuf.st_size + maxrange);
+		if (ret)
+			continue;
+
+		fcb->refcount++;
+		*pos = statbuf.st_size;
+		*fcbp = fcb;
+		goto out_unlock;
+	}
+
+	/* Otherwise, open a new memfd and add it to our list. */
+	error = xfile_fcb_create(description, &fcb);
+	if (error)
+		goto out_unlock;
+
+	ret = ftruncate(fcb->fd, maxrange);
+	if (ret) {
+		error = -errno;
+		xfile_fcb_irele(fcb, 0, maxrange);
+		goto out_unlock;
+	}
+
+	list_add_tail(&fcb->fcb_list, &fcb_list);
+	*pos = 0;
+	*fcbp = fcb;
+
+out_unlock:
+	pthread_mutex_unlock(&fcb_mutex);
+	return error;
+}
+
 /*
  * Create an xfile of the given size.  The description will be used in the
  * trace output.
@@ -102,6 +242,7 @@ xfile_create_fd(
 int
 xfile_create(
 	struct xfs_mount	*mp,
+	unsigned long long	maxrange,
 	const char		*description,
 	struct xfile		**xfilep)
 {
@@ -117,13 +258,14 @@ xfile_create(
 	if (!xf)
 		return -ENOMEM;
 
-	xf->fd = xfile_create_fd(fname);
-	if (xf->fd < 0) {
-		error = -errno;
+	error = xfile_fcb_find(description, maxrange, &xf->partition_pos,
+			&xf->fcb);
+	if (error) {
 		kmem_free(xf);
 		return error;
 	}
 
+	xf->partition_bytes = maxrange;
 	*xfilep = xf;
 	return 0;
 }
@@ -133,7 +275,7 @@ void
 xfile_destroy(
 	struct xfile		*xf)
 {
-	close(xf->fd);
+	xfile_fcb_irele(xf->fcb, xf->partition_pos, xf->partition_bytes);
 	kmem_free(xf);
 }
 
@@ -141,6 +283,9 @@ static inline loff_t
 xfile_maxbytes(
 	struct xfile		*xf)
 {
+	if (xf->partition_bytes > 0)
+		return xf->partition_bytes;
+
 	if (sizeof(loff_t) == 8)
 		return LLONG_MAX;
 	return LONG_MAX;
@@ -166,7 +311,7 @@ xfile_pread(
 	if (xfile_maxbytes(xf) - pos < count)
 		return -EFBIG;
 
-	ret = pread(xf->fd, buf, count, pos);
+	ret = pread(xf->fcb->fd, buf, count, pos + xf->partition_pos);
 	if (ret >= 0)
 		return ret;
 	return -errno;
@@ -192,7 +337,7 @@ xfile_pwrite(
 	if (xfile_maxbytes(xf) - pos < count)
 		return -EFBIG;
 
-	ret = pwrite(xf->fd, buf, count, pos);
+	ret = pwrite(xf->fcb->fd, buf, count, pos + xf->partition_pos);
 	if (ret >= 0)
 		return ret;
 	return -errno;
@@ -207,7 +352,13 @@ xfile_stat(
 	struct stat		ks;
 	int			error;
 
-	error = fstat(xf->fd, &ks);
+	if (xf->partition_bytes > 0) {
+		statbuf->size = xf->partition_bytes;
+		statbuf->bytes = xf->partition_bytes;
+		return 0;
+	}
+
+	error = fstat(xf->fcb->fd, &ks);
 	if (error)
 		return -errno;
 
@@ -234,7 +385,7 @@ xfile_dump(
 	}
 
 	/* reroute our xfile to stdin and shut everything else */
-	dup2(xf->fd, 0);
+	dup2(xf->fcb->fd, 0);
 	for (i = 3; i < 1024; i++)
 		close(i);
 
@@ -251,7 +402,7 @@ xfile_prealloc(
 	int		error;
 
 	count = min(count, xfile_maxbytes(xf) - pos);
-	error = fallocate(xf->fd, 0, pos, count);
+	error = fallocate(xf->fcb->fd, 0, pos + xf->partition_pos, count);
 	if (error)
 		return -errno;
 	return 0;
diff --git a/libxfs/xfile.h b/libxfs/xfile.h
index 9580de32864..9923fe8e5ec 100644
--- a/libxfs/xfile.h
+++ b/libxfs/xfile.h
@@ -6,14 +6,18 @@
 #ifndef __LIBXFS_XFILE_H__
 #define __LIBXFS_XFILE_H__
 
+struct xfile_fcb;
+
 struct xfile {
-	int		fd;
+	struct xfile_fcb	*fcb;
+	loff_t			partition_pos;
+	uint64_t		partition_bytes;
 };
 
 void xfile_libinit(void);
 
-int xfile_create(struct xfs_mount *mp, const char *description,
-		struct xfile **xfilep);
+int xfile_create(struct xfs_mount *mp, unsigned long long maxrange,
+		const char *description, struct xfile **xfilep);
 void xfile_destroy(struct xfile *xf);
 
 ssize_t xfile_pread(struct xfile *xf, void *buf, size_t count, loff_t pos);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/6] xfs_repair: convert regular rmap repair to use in-memory btrees
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 5/6] xfs_repair: reduce rmap bag memory usage when creating refcounts Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Convert the rmap btree repair code to use in-memory rmap btrees to store
the observed reverse mapping records.  This will eliminate the need for
a separate record sorting step, as well as eliminate the need for all
the code that turns multiple consecutive bmap records into a single rmap
record.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    9 ++
 repair/agbtree.c         |   18 ++-
 repair/agbtree.h         |    1 
 repair/phase5.c          |    2 
 repair/rmap.c            |  243 ++++++++++++++++++++++++++++++++++++++++++++--
 repair/rmap.h            |   16 +++
 repair/xfs_repair.c      |    6 +
 7 files changed, 278 insertions(+), 17 deletions(-)


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index 5d73111b508..ce5eb27c1fd 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -56,8 +56,13 @@
 #define xfs_btree_bload			libxfs_btree_bload
 #define xfs_btree_bload_compute_geometry libxfs_btree_bload_compute_geometry
 #define xfs_btree_del_cursor		libxfs_btree_del_cursor
+#define xfs_btree_get_block		libxfs_btree_get_block
+#define xfs_btree_goto_left_edge	libxfs_btree_goto_left_edge
+#define xfs_btree_increment		libxfs_btree_increment
 #define xfs_btree_init_block		libxfs_btree_init_block
+#define xfs_btree_mem_head_read_buf	libxfs_btree_mem_head_read_buf
 #define xfs_btree_rec_addr		libxfs_btree_rec_addr
+#define xfs_btree_visit_blocks		libxfs_btree_visit_blocks
 #define xfs_buf_delwri_submit		libxfs_buf_delwri_submit
 #define xfs_buf_get			libxfs_buf_get
 #define xfs_buf_get_uncached		libxfs_buf_get_uncached
@@ -170,6 +175,8 @@
 #define xfs_rmapbt_init_cursor		libxfs_rmapbt_init_cursor
 #define xfs_rmapbt_maxlevels_ondisk	libxfs_rmapbt_maxlevels_ondisk
 #define xfs_rmapbt_maxrecs		libxfs_rmapbt_maxrecs
+#define xfs_rmapbt_mem_create		libxfs_rmapbt_mem_create
+#define xfs_rmapbt_mem_cursor		libxfs_rmapbt_mem_cursor
 #define xfs_rmapbt_stage_cursor		libxfs_rmapbt_stage_cursor
 #define xfs_rmap_compare		libxfs_rmap_compare
 #define xfs_rmap_get_rec		libxfs_rmap_get_rec
@@ -178,6 +185,7 @@
 #define xfs_rmap_irec_offset_unpack	libxfs_rmap_irec_offset_unpack
 #define xfs_rmap_lookup_le		libxfs_rmap_lookup_le
 #define xfs_rmap_lookup_le_range	libxfs_rmap_lookup_le_range
+#define xfs_rmap_map_raw		libxfs_rmap_map_raw
 #define xfs_rmap_query_all		libxfs_rmap_query_all
 #define xfs_rmap_query_range		libxfs_rmap_query_range
 
@@ -225,6 +233,7 @@
 
 #define xfs_validate_stripe_geometry	libxfs_validate_stripe_geometry
 #define xfs_verify_agbno		libxfs_verify_agbno
+#define xfs_verify_agbext		libxfs_verify_agbext
 #define xfs_verify_agino		libxfs_verify_agino
 #define xfs_verify_cksum		libxfs_verify_cksum
 #define xfs_verify_dir_ino		libxfs_verify_dir_ino
diff --git a/repair/agbtree.c b/repair/agbtree.c
index 23851f17b61..26e282d57c8 100644
--- a/repair/agbtree.c
+++ b/repair/agbtree.c
@@ -100,7 +100,8 @@ reserve_agblocks(
 			do_error(_("could not set up btree reservation: %s\n"),
 				strerror(-error));
 
-		error = rmap_add_ag_rec(mp, agno, ext_ptr->ex_startblock, len,
+		error = rmap_add_agbtree_mapping(mp, agno,
+				ext_ptr->ex_startblock, len,
 				btr->newbt.oinfo.oi_owner);
 		if (error)
 			do_error(_("could not set up btree rmaps: %s\n"),
@@ -593,14 +594,19 @@ get_rmapbt_records(
 	unsigned int			nr_wanted,
 	void				*priv)
 {
-	struct xfs_rmap_irec		*rec;
 	struct bt_rebuild		*btr = priv;
 	union xfs_btree_rec		*block_rec;
 	unsigned int			loaded;
+	int				ret;
 
 	for (loaded = 0; loaded < nr_wanted; loaded++, idx++) {
-		rec = pop_slab_cursor(btr->slab_cursor);
-		memcpy(&cur->bc_rec.r, rec, sizeof(struct xfs_rmap_irec));
+		ret = rmap_get_mem_rec(&btr->rmapbt_cursor, &cur->bc_rec.r);
+		if (ret < 0)
+			return ret;
+		if (ret == 0)
+			do_error(
+ _("ran out of records while rebuilding AG %u rmap btree\n"),
+					cur->bc_ag.pag->pag_agno);
 
 		block_rec = libxfs_btree_rec_addr(cur, idx, block);
 		cur->bc_ops->init_rec_from_cur(cur, block_rec);
@@ -648,7 +654,7 @@ build_rmap_tree(
 {
 	int			error;
 
-	error = rmap_init_cursor(agno, &btr->slab_cursor);
+	error = rmap_init_mem_cursor(sc->mp, NULL, agno, &btr->rmapbt_cursor);
 	if (error)
 		do_error(
 _("Insufficient memory to construct rmap cursor.\n"));
@@ -661,7 +667,7 @@ _("Error %d while creating rmap btree for AG %u.\n"), error, agno);
 
 	/* Since we're not writing the AGF yet, no need to commit the cursor */
 	libxfs_btree_del_cursor(btr->cur, 0);
-	free_slab_cursor(&btr->slab_cursor);
+	rmap_free_mem_cursor(NULL, &btr->rmapbt_cursor, 0);
 }
 
 /* rebuild the refcount tree */
diff --git a/repair/agbtree.h b/repair/agbtree.h
index 84f7083de20..4fb8e82e03e 100644
--- a/repair/agbtree.h
+++ b/repair/agbtree.h
@@ -20,6 +20,7 @@ struct bt_rebuild {
 	/* Tree-specific data. */
 	union {
 		struct xfs_slab_cursor	*slab_cursor;
+		struct rmap_mem_cur	rmapbt_cursor;
 		struct {
 			struct extent_tree_node	*bno_rec;
 			unsigned int		freeblks;
diff --git a/repair/phase5.c b/repair/phase5.c
index 40f991b6626..361e5649b29 100644
--- a/repair/phase5.c
+++ b/repair/phase5.c
@@ -703,7 +703,7 @@ phase5(xfs_mount_t *mp)
 	 * the superblock counters.
 	 */
 	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
-		error = rmap_store_ag_btree_rec(mp, agno);
+		error = rmap_commit_agbtree_mappings(mp, agno);
 		if (error)
 			do_error(
 _("unable to add AG %u reverse-mapping data to btree.\n"), agno);
diff --git a/repair/rmap.c b/repair/rmap.c
index 5fbae50d5b7..64a5786faca 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -13,6 +13,8 @@
 #include "slab.h"
 #include "rmap.h"
 #include "libfrog/bitmap.h"
+#include "libxfs/xfile.h"
+#include "libxfs/xfbtree.h"
 
 #undef RMAP_DEBUG
 
@@ -24,6 +26,7 @@
 
 /* per-AG rmap object anchor */
 struct xfs_ag_rmap {
+	struct xfbtree	*ar_xfbtree;		/* rmap observations */
 	struct xfs_slab	*ar_rmaps;		/* rmap observations, p4 */
 	struct xfs_slab	*ar_raw_rmaps;		/* unmerged rmaps */
 	int		ar_flcount;		/* agfl entries from leftover */
@@ -53,6 +56,65 @@ rmap_needs_work(
 	       xfs_has_rmapbt(mp);
 }
 
+/* Destroy an in-memory rmap btree. */
+STATIC void
+rmaps_destroy(
+	struct xfs_mount	*mp,
+	struct xfs_ag_rmap	*ag_rmap)
+{
+	struct xfile		*xfile;
+	struct xfs_buftarg	*target;
+
+	free_slab(&ag_rmap->ar_refcount_items);
+
+	if (!ag_rmap->ar_xfbtree)
+		return;
+
+	target = ag_rmap->ar_xfbtree->target;
+	xfile = target->bt_xfile;
+
+	xfbtree_destroy(ag_rmap->ar_xfbtree);
+	libxfs_buftarg_free(target);
+	xfile_destroy(xfile);
+}
+
+/* Initialize the in-memory rmap btree for collecting per-AG rmap records. */
+STATIC void
+rmaps_init_ag(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	struct xfs_ag_rmap	*ag_rmap)
+{
+	struct xfile		*xfile;
+	struct xfs_buftarg	*target;
+	unsigned long long	maxbytes;
+	int			error;
+
+	maxbytes = XFS_FSB_TO_B(mp, mp->m_sb.sb_agblocks);
+	error = xfile_create(mp, maxbytes, "rmapbt repair", &xfile);
+	if (error)
+		goto nomem;
+
+	error = -libxfs_alloc_memory_buftarg(mp, xfile, &target);
+	if (error)
+		goto nomem;
+
+	error = -libxfs_rmapbt_mem_create(mp, agno, target,
+			&ag_rmap->ar_xfbtree);
+	if (error)
+		goto nomem;
+
+	error = init_slab(&ag_rmap->ar_refcount_items,
+			  sizeof(struct xfs_refcount_irec));
+	if (error)
+		goto nomem;
+
+	return;
+nomem:
+	do_error(
+_("Insufficient memory while allocating realtime reverse mapping btree."));
+}
+
 /*
  * Initialize per-AG reverse map data.
  */
@@ -71,6 +133,8 @@ rmaps_init(
 		do_error(_("couldn't allocate per-AG reverse map roots\n"));
 
 	for (i = 0; i < mp->m_sb.sb_agcount; i++) {
+		rmaps_init_ag(mp, i, &ag_rmaps[i]);
+
 		error = init_slab(&ag_rmaps[i].ar_rmaps,
 				sizeof(struct xfs_rmap_irec));
 		if (error)
@@ -82,11 +146,6 @@ _("Insufficient memory while allocating reverse mapping slabs."));
 			do_error(
 _("Insufficient memory while allocating raw metadata reverse mapping slabs."));
 		ag_rmaps[i].ar_last_rmap.rm_owner = XFS_RMAP_OWN_UNKNOWN;
-		error = init_slab(&ag_rmaps[i].ar_refcount_items,
-				  sizeof(struct xfs_refcount_irec));
-		if (error)
-			do_error(
-_("Insufficient memory while allocating refcount item slabs."));
 	}
 }
 
@@ -105,7 +164,7 @@ rmaps_free(
 	for (i = 0; i < mp->m_sb.sb_agcount; i++) {
 		free_slab(&ag_rmaps[i].ar_rmaps);
 		free_slab(&ag_rmaps[i].ar_raw_rmaps);
-		free_slab(&ag_rmaps[i].ar_refcount_items);
+		rmaps_destroy(mp, &ag_rmaps[i]);
 	}
 	free(ag_rmaps);
 	ag_rmaps = NULL;
@@ -136,6 +195,103 @@ rmaps_are_mergeable(
 	return r1->rm_offset + r1->rm_blockcount == r2->rm_offset;
 }
 
+int
+rmap_init_mem_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	struct rmap_mem_cur	*rmcur)
+{
+	struct xfbtree		*xfbt;
+	struct xfs_perag	*pag;
+	int			error;
+
+	xfbt = ag_rmaps[agno].ar_xfbtree;
+	error = -xfbtree_head_read_buf(xfbt, tp, &rmcur->mhead_bp);
+	if (error)
+		return error;
+
+	pag = libxfs_perag_get(mp, agno);
+	rmcur->mcur = libxfs_rmapbt_mem_cursor(pag, tp, rmcur->mhead_bp, xfbt);
+
+	error = -libxfs_btree_goto_left_edge(rmcur->mcur);
+	if (error)
+		rmap_free_mem_cursor(tp, rmcur, error);
+
+	libxfs_perag_put(pag);
+	return error;
+}
+
+void
+rmap_free_mem_cursor(
+	struct xfs_trans	*tp,
+	struct rmap_mem_cur	*rmcur,
+	int			error)
+{
+	libxfs_btree_del_cursor(rmcur->mcur, error);
+	libxfs_trans_brelse(tp, rmcur->mhead_bp);
+	rmcur->mcur = NULL;
+	rmcur->mhead_bp = NULL;
+}
+
+/*
+ * Retrieve the next record from the in-memory rmap btree.  Returns 1 if irec
+ * has been filled out, 0 if there aren't any more records, or a negative errno
+ * value if an error happened.
+ */
+int
+rmap_get_mem_rec(
+	struct rmap_mem_cur	*rmcur,
+	struct xfs_rmap_irec	*irec)
+{
+	int			stat = 0;
+	int			error;
+
+	error = -libxfs_btree_increment(rmcur->mcur, 0, &stat);
+	if (error)
+		return -error;
+	if (!stat)
+		return 0;
+
+	error = -libxfs_rmap_get_rec(rmcur->mcur, irec, &stat);
+	if (error)
+		return -error;
+
+	return stat;
+}
+
+static void
+rmap_add_mem_rec(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	struct xfs_rmap_irec	*rmap)
+{
+	struct rmap_mem_cur	rmcur;
+	struct xfbtree		*xfbt;
+	struct xfs_trans	*tp;
+	int			error;
+
+	xfbt = ag_rmaps[agno].ar_xfbtree;
+	error = -libxfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		do_error(_("allocating tx for in-memory rmap update\n"));
+
+	error = rmap_init_mem_cursor(mp, tp, agno, &rmcur);
+	if (error)
+		do_error(_("reading in-memory rmap btree head\n"));
+
+	error = -libxfs_rmap_map_raw(rmcur.mcur, rmap);
+	if (error)
+		do_error(_("adding rmap to in-memory btree, err %d\n"), error);
+	rmap_free_mem_cursor(tp, &rmcur, 0);
+
+	error = xfbtree_trans_commit(xfbt, tp);
+	if (error)
+		do_error(_("committing in-memory rmap record\n"));
+
+	libxfs_trans_cancel(tp);
+}
+
 /*
  * Add an observation about a block mapping in an inode's data or attribute
  * fork for later btree reconstruction.
@@ -173,6 +329,9 @@ rmap_add_rec(
 	rmap.rm_blockcount = irec->br_blockcount;
 	if (irec->br_state == XFS_EXT_UNWRITTEN)
 		rmap.rm_flags |= XFS_RMAP_UNWRITTEN;
+
+	rmap_add_mem_rec(mp, agno, &rmap);
+
 	last_rmap = &ag_rmaps[agno].ar_last_rmap;
 	if (last_rmap->rm_owner == XFS_RMAP_OWN_UNKNOWN)
 		*last_rmap = rmap;
@@ -223,6 +382,8 @@ __rmap_add_raw_rec(
 		rmap.rm_flags |= XFS_RMAP_BMBT_BLOCK;
 	rmap.rm_startblock = agbno;
 	rmap.rm_blockcount = len;
+
+	rmap_add_mem_rec(mp, agno, &rmap);
 	return slab_add(ag_rmaps[agno].ar_raw_rmaps, &rmap);
 }
 
@@ -273,6 +434,36 @@ rmap_add_ag_rec(
 	return __rmap_add_raw_rec(mp, agno, agbno, len, owner, false, false);
 }
 
+/*
+ * Add a reverse mapping for a per-AG btree extent.  These are /not/ tracked
+ * in the in-memory rmap btree because they can only be added to the rmap
+ * data after the in-memory btrees have been written to disk.
+ */
+int
+rmap_add_agbtree_mapping(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		len,
+	uint64_t		owner)
+{
+	struct xfs_rmap_irec	rmap = {
+		.rm_owner	= owner,
+		.rm_startblock	= agbno,
+		.rm_blockcount	= len,
+	};
+	struct xfs_perag	*pag;
+
+	if (!rmap_needs_work(mp))
+		return 0;
+
+	pag = libxfs_perag_get(mp, agno);
+	assert(libxfs_verify_agbext(pag, agbno, len));
+	libxfs_perag_put(pag);
+
+	return slab_add(ag_rmaps[agno].ar_raw_rmaps, &rmap);
+}
+
 /*
  * Merge adjacent raw rmaps and add them to the main rmap list.
  */
@@ -441,7 +632,7 @@ rmap_add_fixed_ag_rec(
  * the rmapbt, after which it is fully regenerated.
  */
 int
-rmap_store_ag_btree_rec(
+rmap_commit_agbtree_mappings(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno)
 {
@@ -536,7 +727,7 @@ rmap_store_ag_btree_rec(
 	if (error)
 		goto err;
 
-	/* Create cursors to refcount structures */
+	/* Create cursors to rmap structures */
 	error = init_slab_cursor(ag_rmap->ar_rmaps, rmap_compare, &rm_cur);
 	if (error)
 		goto err;
@@ -870,6 +1061,21 @@ compute_refcounts(
 }
 #undef RMAP_END
 
+static int
+count_btree_records(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	void			*data)
+{
+	uint64_t		*nr = data;
+	struct xfs_btree_block	*block;
+	struct xfs_buf		*bp;
+
+	block = libxfs_btree_get_block(cur, level, &bp);
+	*nr += be16_to_cpu(block->bb_numrecs);
+	return 0;
+}
+
 /*
  * Return the number of rmap objects for an AG.
  */
@@ -878,7 +1084,26 @@ rmap_record_count(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno)
 {
-	return slab_count(ag_rmaps[agno].ar_rmaps);
+	struct rmap_mem_cur	rmcur;
+	uint64_t		nr = 0;
+	int			error;
+
+	if (ag_rmaps[agno].ar_xfbtree == NULL)
+		return 0;
+
+	error = rmap_init_mem_cursor(mp, NULL, agno, &rmcur);
+	if (error)
+		do_error(_("%s while reading in-memory rmap btree\n"),
+				strerror(error));
+
+	error = -libxfs_btree_visit_blocks(rmcur.mcur, count_btree_records,
+			XFS_BTREE_VISIT_RECORDS, &nr);
+	if (error)
+		do_error(_("%s while counting in-memory rmap records\n"),
+				strerror(error));
+
+	rmap_free_mem_cursor(NULL, &rmcur, 0);
+	return nr;
 }
 
 /*
diff --git a/repair/rmap.h b/repair/rmap.h
index 782256f8b7e..d8eec58ab8d 100644
--- a/repair/rmap.h
+++ b/repair/rmap.h
@@ -24,7 +24,10 @@ extern int rmap_fold_raw_recs(struct xfs_mount *mp, xfs_agnumber_t agno);
 extern bool rmaps_are_mergeable(struct xfs_rmap_irec *r1, struct xfs_rmap_irec *r2);
 
 extern int rmap_add_fixed_ag_rec(struct xfs_mount *, xfs_agnumber_t);
-extern int rmap_store_ag_btree_rec(struct xfs_mount *, xfs_agnumber_t);
+
+int rmap_add_agbtree_mapping(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_agblock_t agbno, xfs_extlen_t len, uint64_t owner);
+int rmap_commit_agbtree_mappings(struct xfs_mount *mp, xfs_agnumber_t agno);
 
 uint64_t rmap_record_count(struct xfs_mount *mp, xfs_agnumber_t agno);
 extern int rmap_init_cursor(xfs_agnumber_t, struct xfs_slab_cursor **);
@@ -49,4 +52,15 @@ extern int fix_inode_reflink_flags(struct xfs_mount *, xfs_agnumber_t);
 extern void fix_freelist(struct xfs_mount *, xfs_agnumber_t, bool);
 extern void rmap_store_agflcount(struct xfs_mount *, xfs_agnumber_t, int);
 
+struct rmap_mem_cur {
+	struct xfs_btree_cur	*mcur;
+	struct xfs_buf		*mhead_bp;
+};
+
+int rmap_init_mem_cursor(struct xfs_mount *mp, struct xfs_trans *tp,
+		xfs_agnumber_t agno, struct rmap_mem_cur *rmcur);
+void rmap_free_mem_cursor(struct xfs_trans *tp, struct rmap_mem_cur *rmcur,
+		int error);
+int rmap_get_mem_rec(struct rmap_mem_cur *rmcur, struct xfs_rmap_irec *irec);
+
 #endif /* RMAP_H_ */
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index e49d4292ad4..251a46d11fe 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -913,6 +913,12 @@ repair_capture_writeback(
 	struct xfs_mount	*mp = bp->b_mount;
 	static pthread_mutex_t	wb_mutex = PTHREAD_MUTEX_INITIALIZER;
 
+	/* We only care about ondisk metadata. */
+	if (bp->b_target != mp->m_ddev_targp &&
+	    bp->b_target != mp->m_logdev_targp &&
+	    bp->b_target != mp->m_rtdev_targp)
+		return;
+
 	/*
 	 * This write hook ignores any buffer that looks like a superblock to
 	 * avoid hook recursion when setting NEEDSREPAIR.  Higher level code



* [PATCH 3/6] xfs_repair: verify on-disk rmap btrees with in-memory btree data
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 6/6] xfs_repair: remove the old rmap collection slabs Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/6] xfs_repair: compute refcount data from in-memory rmap btrees Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check the on-disk reverse mappings against the observations we've recorded
in the in-memory btree during the filesystem walk.
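
A condensed sketch of the resulting check loop (AGF and on-disk cursor
setup, the reflink-aware overlapped lookup, and most error handling are
elided; see the full diff below for the real thing):

	struct rmap_mem_cur	rm_cur;
	struct xfs_rmap_irec	rm_rec, tmp;
	int			have, error;

	/* Walk every observation collected in the in-memory rmap btree. */
	error = rmap_init_mem_cursor(mp, NULL, agno, &rm_cur);
	while ((error = rmap_get_mem_rec(&rm_cur, &rm_rec)) == 1) {
		/* Look for the same mapping in the on-disk rmap btree. */
		error = rmap_lookup(bt_cur, &rm_rec, &tmp, &have);
		if (!have)
			do_warn(_("Missing reverse-mapping record\n"));
		else if (!rmap_is_good(&rm_rec, &tmp))
			do_warn(_("Incorrect reverse-mapping record\n"));
	}
	rmap_free_mem_cursor(NULL, &rm_cur, error);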

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/rmap.c |   58 +++++++++++++++++++++++++++------------------------------
 1 file changed, 27 insertions(+), 31 deletions(-)


diff --git a/repair/rmap.c b/repair/rmap.c
index 64a5786faca..a8d973ecd2a 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -1200,11 +1200,11 @@ rmaps_verify_btree(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno)
 {
+	struct rmap_mem_cur	rm_cur;
+	struct xfs_rmap_irec	rm_rec;
 	struct xfs_rmap_irec	tmp;
-	struct xfs_slab_cursor	*rm_cur;
 	struct xfs_btree_cur	*bt_cur = NULL;
 	struct xfs_buf		*agbp = NULL;
-	struct xfs_rmap_irec	*rm_rec;
 	struct xfs_perag	*pag = NULL;
 	int			have;
 	int			error;
@@ -1217,8 +1217,8 @@ rmaps_verify_btree(
 		return;
 	}
 
-	/* Create cursors to refcount structures */
-	error = rmap_init_cursor(agno, &rm_cur);
+	/* Create cursors to rmap structures */
+	error = rmap_init_mem_cursor(mp, NULL, agno, &rm_cur);
 	if (error) {
 		do_warn(_("Not enough memory to check reverse mappings.\n"));
 		return;
@@ -1241,13 +1241,12 @@ rmaps_verify_btree(
 		goto err_agf;
 	}
 
-	rm_rec = pop_slab_cursor(rm_cur);
-	while (rm_rec) {
-		error = rmap_lookup(bt_cur, rm_rec, &tmp, &have);
+	while ((error = rmap_get_mem_rec(&rm_cur, &rm_rec)) == 1) {
+		error = rmap_lookup(bt_cur, &rm_rec, &tmp, &have);
 		if (error) {
 			do_warn(
 _("Could not read reverse-mapping record for (%u/%u).\n"),
-					agno, rm_rec->rm_startblock);
+					agno, rm_rec.rm_startblock);
 			goto err_cur;
 		}
 
@@ -1257,13 +1256,13 @@ _("Could not read reverse-mapping record for (%u/%u).\n"),
 		 * match the observed rmap.
 		 */
 		if (xfs_has_reflink(bt_cur->bc_mp) &&
-				(!have || !rmap_is_good(rm_rec, &tmp))) {
-			error = rmap_lookup_overlapped(bt_cur, rm_rec,
+				(!have || !rmap_is_good(&rm_rec, &tmp))) {
+			error = rmap_lookup_overlapped(bt_cur, &rm_rec,
 					&tmp, &have);
 			if (error) {
 				do_warn(
 _("Could not read reverse-mapping record for (%u/%u).\n"),
-						agno, rm_rec->rm_startblock);
+						agno, rm_rec.rm_startblock);
 				goto err_cur;
 			}
 		}
@@ -1271,21 +1270,21 @@ _("Could not read reverse-mapping record for (%u/%u).\n"),
 			do_warn(
 _("Missing reverse-mapping record for (%u/%u) %slen %u owner %"PRId64" \
 %s%soff %"PRIu64"\n"),
-				agno, rm_rec->rm_startblock,
-				(rm_rec->rm_flags & XFS_RMAP_UNWRITTEN) ?
+				agno, rm_rec.rm_startblock,
+				(rm_rec.rm_flags & XFS_RMAP_UNWRITTEN) ?
 					_("unwritten ") : "",
-				rm_rec->rm_blockcount,
-				rm_rec->rm_owner,
-				(rm_rec->rm_flags & XFS_RMAP_ATTR_FORK) ?
+				rm_rec.rm_blockcount,
+				rm_rec.rm_owner,
+				(rm_rec.rm_flags & XFS_RMAP_ATTR_FORK) ?
 					_("attr ") : "",
-				(rm_rec->rm_flags & XFS_RMAP_BMBT_BLOCK) ?
+				(rm_rec.rm_flags & XFS_RMAP_BMBT_BLOCK) ?
 					_("bmbt ") : "",
-				rm_rec->rm_offset);
-			goto next_loop;
+				rm_rec.rm_offset);
+			continue;
 		}
 
 		/* Compare each refcount observation against the btree's */
-		if (!rmap_is_good(rm_rec, &tmp)) {
+		if (!rmap_is_good(&rm_rec, &tmp)) {
 			do_warn(
 _("Incorrect reverse-mapping: saw (%u/%u) %slen %u owner %"PRId64" %s%soff \
 %"PRIu64"; should be (%u/%u) %slen %u owner %"PRId64" %s%soff %"PRIu64"\n"),
@@ -1299,20 +1298,17 @@ _("Incorrect reverse-mapping: saw (%u/%u) %slen %u owner %"PRId64" %s%soff \
 				(tmp.rm_flags & XFS_RMAP_BMBT_BLOCK) ?
 					_("bmbt ") : "",
 				tmp.rm_offset,
-				agno, rm_rec->rm_startblock,
-				(rm_rec->rm_flags & XFS_RMAP_UNWRITTEN) ?
+				agno, rm_rec.rm_startblock,
+				(rm_rec.rm_flags & XFS_RMAP_UNWRITTEN) ?
 					_("unwritten ") : "",
-				rm_rec->rm_blockcount,
-				rm_rec->rm_owner,
-				(rm_rec->rm_flags & XFS_RMAP_ATTR_FORK) ?
+				rm_rec.rm_blockcount,
+				rm_rec.rm_owner,
+				(rm_rec.rm_flags & XFS_RMAP_ATTR_FORK) ?
 					_("attr ") : "",
-				(rm_rec->rm_flags & XFS_RMAP_BMBT_BLOCK) ?
+				(rm_rec.rm_flags & XFS_RMAP_BMBT_BLOCK) ?
 					_("bmbt ") : "",
-				rm_rec->rm_offset);
-			goto next_loop;
+				rm_rec.rm_offset);
 		}
-next_loop:
-		rm_rec = pop_slab_cursor(rm_cur);
 	}
 
 err_cur:
@@ -1321,7 +1317,7 @@ _("Incorrect reverse-mapping: saw (%u/%u) %slen %u owner %"PRId64" %s%soff \
 	libxfs_buf_relse(agbp);
 err_pag:
 	libxfs_perag_put(pag);
-	free_slab_cursor(&rm_cur);
+	rmap_free_mem_cursor(NULL, &rm_cur, error);
 }
 
 /*



* [PATCH 4/6] xfs_repair: compute refcount data from in-memory rmap btrees
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 6/6] xfs_repair: remove the old rmap collection slabs Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/6] xfs_repair: verify on-disk rmap btrees with in-memory btree data Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/6] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use the in-memory rmap btrees to compute the reference count
information.  Convert the bag implementation to hold actual records
instead of pointers to slab objects.
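
As a quick illustration of the reworked bag API, here is a minimal
sketch (record values invented) using only the helpers declared in
repair/slab.h by this patch; callers now pass the item size up front
and bag_add() copies the record into the bag:

	struct xfs_bag		*bag;
	struct xfs_rmap_irec	rec = {
		.rm_startblock	= 100,
		.rm_blockcount	= 4,
	};
	int			error;

	error = init_bag(&bag, sizeof(struct xfs_rmap_irec));
	if (error)
		return error;

	/* The bag stores a copy, so rec can live on the stack. */
	error = bag_add(bag, &rec);
	if (error)
		goto out;

	/* ... walk the bag with foreach_bag_ptr(), as compute_refcounts does ... */
out:
	free_bag(&bag);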

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    4 +
 repair/phase4.c          |    2 
 repair/rmap.c            |  230 ++++++++++++++++++++++++++++++++++++----------
 repair/slab.c            |   49 ++++++----
 repair/slab.h            |    2 
 5 files changed, 216 insertions(+), 71 deletions(-)


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index ce5eb27c1fd..a03cef515a2 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -55,9 +55,11 @@
 
 #define xfs_btree_bload			libxfs_btree_bload
 #define xfs_btree_bload_compute_geometry libxfs_btree_bload_compute_geometry
+#define xfs_btree_decrement		libxfs_btree_decrement
 #define xfs_btree_del_cursor		libxfs_btree_del_cursor
 #define xfs_btree_get_block		libxfs_btree_get_block
 #define xfs_btree_goto_left_edge	libxfs_btree_goto_left_edge
+#define xfs_btree_has_more_records	libxfs_btree_has_more_records
 #define xfs_btree_increment		libxfs_btree_increment
 #define xfs_btree_init_block		libxfs_btree_init_block
 #define xfs_btree_mem_head_read_buf	libxfs_btree_mem_head_read_buf
@@ -150,6 +152,8 @@
 #define xfs_inode_validate_cowextsize	libxfs_inode_validate_cowextsize
 #define xfs_inode_validate_extsize	libxfs_inode_validate_extsize
 
+#define xfs_internal_inum		libxfs_internal_inum
+
 #define xfs_iread_extents		libxfs_iread_extents
 #define xfs_irele			libxfs_irele
 #define xfs_log_calc_minimum_size	libxfs_log_calc_minimum_size
diff --git a/repair/phase4.c b/repair/phase4.c
index 61e5500631a..b8a6992706a 100644
--- a/repair/phase4.c
+++ b/repair/phase4.c
@@ -188,7 +188,7 @@ compute_ag_refcounts(
 	if (error)
 		do_error(
 _("%s while computing reference count records.\n"),
-			 strerror(-error));
+			 strerror(error));
 }
 
 static void
diff --git a/repair/rmap.c b/repair/rmap.c
index a8d973ecd2a..0a29a615afa 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -934,66 +934,196 @@ refcount_emit(
 _("Insufficient memory while recreating refcount tree."));
 }
 
+#define RMAP_NEXT(r)	((r)->rm_startblock + (r)->rm_blockcount)
+
+/* Decide if an rmap could describe a shared extent. */
+static inline bool
+rmap_shareable(
+	struct xfs_mount		*mp,
+	const struct xfs_rmap_irec	*rmap)
+{
+	/* AG metadata are never shareable */
+	if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
+		return false;
+
+	/* Metadata in files are never shareable */
+	if (libxfs_internal_inum(mp, rmap->rm_owner))
+		return false;
+
+	/* Metadata and unwritten file blocks are not shareable. */
+	if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+			      XFS_RMAP_UNWRITTEN))
+		return false;
+
+	return true;
+}
+
+/* Grab the rmap for the next possible shared extent. */
+STATIC int
+refcount_walk_rmaps(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rmap,
+	bool			*have_rec)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	int			have_gt;
+	int			error = 0;
+
+	*have_rec = false;
+
+	/*
+	 * Loop through the remaining rmaps.  Remember CoW staging
+	 * extents and the refcountbt blocks from the old tree for later
+	 * disposal.  We can only share written data fork extents, so
+	 * keep looping until we find an rmap for one.
+	 */
+	do {
+		error = -libxfs_btree_increment(cur, 0, &have_gt);
+		if (error)
+			return error;
+		if (!have_gt)
+			return 0;
+
+		error = -libxfs_rmap_get_rec(cur, rmap, &have_gt);
+		if (error)
+			return error;
+		if (!have_gt)
+			return EFSCORRUPTED;
+	} while (!rmap_shareable(mp, rmap));
+
+	*have_rec = true;
+	return 0;
+}
+
+/*
+ * Find the next block where the refcount changes, given the next rmap we
+ * looked at and the ones we're already tracking.
+ */
+static inline int
+next_refcount_edge(
+	struct xfs_bag		*stack_top,
+	struct xfs_rmap_irec	*next_rmap,
+	bool			next_valid,
+	xfs_agblock_t		*nbnop)
+{
+	struct xfs_rmap_irec	*rmap;
+	uint64_t		idx;
+	xfs_agblock_t		nbno = NULLAGBLOCK;
+
+	if (next_valid)
+		nbno = next_rmap->rm_startblock;
+
+	foreach_bag_ptr(stack_top, idx, rmap)
+		nbno = min(nbno, RMAP_NEXT(rmap));
+
+	/*
+	 * We should have found /something/ because either next_rmap is the next
+	 * interesting rmap to look at after emitting this refcount extent, or
+	 * there are other rmaps in stack_top contributing to the current
+	 * sharing count.  But if something is seriously wrong, bail out.
+	 */
+	if (nbno == NULLAGBLOCK)
+		return EFSCORRUPTED;
+
+	*nbnop = nbno;
+	return 0;
+}
+
+/*
+ * Walk forward through the rmap btree to collect all rmaps starting at
+ * @bno in @rmap_bag.  These represent the file(s) that share ownership of
+ * the current block.  Upon return, the rmap cursor points to the last record
+ * satisfying the startblock constraint.
+ */
+static int
+refcount_push_rmaps_at(
+	struct rmap_mem_cur	*rmcur,
+	xfs_agnumber_t		agno,
+	struct xfs_bag		*stack_top,
+	xfs_agblock_t		bno,
+	struct xfs_rmap_irec	*irec,
+	bool			*have,
+	const char		*tag)
+{
+	int			have_gt;
+	int			error;
+
+	while (*have && irec->rm_startblock == bno) {
+		rmap_dump(tag, agno, irec);
+		error = bag_add(stack_top, irec);
+		if (error)
+			return error;
+		error = refcount_walk_rmaps(rmcur->mcur, irec, have);
+		if (error)
+			return error;
+	}
+
+	error = -libxfs_btree_decrement(rmcur->mcur, 0, &have_gt);
+	if (error)
+		return error;
+	if (!have_gt)
+		return EFSCORRUPTED;
+
+	return 0;
+}
+
 /*
  * Transform a pile of physical block mapping observations into refcount data
  * for eventual rebuilding of the btrees.
  */
-#define RMAP_END(r)	((r)->rm_startblock + (r)->rm_blockcount)
 int
 compute_refcounts(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t		agno)
 {
+	struct rmap_mem_cur	rmcur;
+	struct xfs_rmap_irec	irec;
 	struct xfs_bag		*stack_top = NULL;
-	struct xfs_slab		*rmaps;
-	struct xfs_slab_cursor	*rmaps_cur;
-	struct xfs_rmap_irec	*array_cur;
 	struct xfs_rmap_irec	*rmap;
-	uint64_t		n, idx;
+	uint64_t		idx;
 	uint64_t		old_stack_nr;
 	xfs_agblock_t		sbno;	/* first bno of this rmap set */
 	xfs_agblock_t		cbno;	/* first bno of this refcount set */
 	xfs_agblock_t		nbno;	/* next bno where rmap set changes */
+	bool			have;
 	int			error;
 
 	if (!xfs_has_reflink(mp))
 		return 0;
 
-	rmaps = ag_rmaps[agno].ar_rmaps;
-
-	error = init_slab_cursor(rmaps, rmap_compare, &rmaps_cur);
+	error = rmap_init_mem_cursor(mp, NULL, agno, &rmcur);
 	if (error)
 		return error;
 
-	error = init_bag(&stack_top);
+	error = init_bag(&stack_top, sizeof(struct xfs_rmap_irec));
 	if (error)
-		goto err;
+		goto out_cur;
 
-	/* While there are rmaps to be processed... */
-	n = 0;
-	while (n < slab_count(rmaps)) {
-		array_cur = peek_slab_cursor(rmaps_cur);
-		sbno = cbno = array_cur->rm_startblock;
+	/* Start the rmapbt cursor to the left of all records. */
+	error = -libxfs_btree_goto_left_edge(rmcur.mcur);
+	if (error)
+		goto out_bag;
+
+
+	/* Process reverse mappings into refcount data. */
+	while (libxfs_btree_has_more_records(rmcur.mcur)) {
 		/* Push all rmaps with pblk == sbno onto the stack */
-		for (;
-		     array_cur && array_cur->rm_startblock == sbno;
-		     array_cur = peek_slab_cursor(rmaps_cur)) {
-			advance_slab_cursor(rmaps_cur); n++;
-			rmap_dump("push0", agno, array_cur);
-			error = bag_add(stack_top, array_cur);
-			if (error)
-				goto err;
-		}
+		error = refcount_walk_rmaps(rmcur.mcur, &irec, &have);
+		if (error)
+			goto out_bag;
+		if (!have)
+			break;
+		sbno = cbno = irec.rm_startblock;
+		error = refcount_push_rmaps_at(&rmcur, agno, stack_top, sbno,
+				&irec, &have, "push0");
+		if (error)
+			goto out_bag;
 		mark_inode_rl(mp, stack_top);
 
 		/* Set nbno to the bno of the next refcount change */
-		if (n < slab_count(rmaps) && array_cur)
-			nbno = array_cur->rm_startblock;
-		else
-			nbno = NULLAGBLOCK;
-		foreach_bag_ptr(stack_top, idx, rmap) {
-			nbno = min(nbno, RMAP_END(rmap));
-		}
+		error = next_refcount_edge(stack_top, &irec, have, &nbno);
+		if (error)
+			goto out_bag;
 
 		/* Emit reverse mappings, if needed */
 		ASSERT(nbno > sbno);
@@ -1003,23 +1133,24 @@ compute_refcounts(
 		while (bag_count(stack_top)) {
 			/* Pop all rmaps that end at nbno */
 			foreach_bag_ptr_reverse(stack_top, idx, rmap) {
-				if (RMAP_END(rmap) != nbno)
+				if (RMAP_NEXT(rmap) != nbno)
 					continue;
 				rmap_dump("pop", agno, rmap);
 				error = bag_remove(stack_top, idx);
 				if (error)
-					goto err;
+					goto out_bag;
 			}
 
 			/* Push array items that start at nbno */
-			for (;
-			     array_cur && array_cur->rm_startblock == nbno;
-			     array_cur = peek_slab_cursor(rmaps_cur)) {
-				advance_slab_cursor(rmaps_cur); n++;
-				rmap_dump("push1", agno, array_cur);
-				error = bag_add(stack_top, array_cur);
+			error = refcount_walk_rmaps(rmcur.mcur, &irec, &have);
+			if (error)
+				goto out_bag;
+			if (have) {
+				error = refcount_push_rmaps_at(&rmcur, agno,
+						stack_top, nbno, &irec, &have,
+						"push1");
 				if (error)
-					goto err;
+					goto out_bag;
 			}
 			mark_inode_rl(mp, stack_top);
 
@@ -1041,25 +1172,22 @@ compute_refcounts(
 			sbno = nbno;
 
 			/* Set nbno to the bno of the next refcount change */
-			if (n < slab_count(rmaps))
-				nbno = array_cur->rm_startblock;
-			else
-				nbno = NULLAGBLOCK;
-			foreach_bag_ptr(stack_top, idx, rmap) {
-				nbno = min(nbno, RMAP_END(rmap));
-			}
+			error = next_refcount_edge(stack_top, &irec, have,
+					&nbno);
+			if (error)
+				goto out_bag;
 
 			/* Emit reverse mappings, if needed */
 			ASSERT(nbno > sbno);
 		}
 	}
-err:
+out_bag:
 	free_bag(&stack_top);
-	free_slab_cursor(&rmaps_cur);
-
+out_cur:
+	rmap_free_mem_cursor(NULL, &rmcur, error);
 	return error;
 }
-#undef RMAP_END
+#undef RMAP_NEXT
 
 static int
 count_btree_records(
diff --git a/repair/slab.c b/repair/slab.c
index 01bc4d426fe..44ca0468eda 100644
--- a/repair/slab.c
+++ b/repair/slab.c
@@ -78,16 +78,26 @@ struct xfs_slab_cursor {
 };
 
 /*
- * Bags -- each bag is an array of pointers items; when a bag fills up, we
- * resize it.
+ * Bags -- each bag is an array of record items; when a bag fills up, we resize
+ * it and hope we don't run out of memory.
  */
 #define MIN_BAG_SIZE	4096
 struct xfs_bag {
 	uint64_t		bg_nr;		/* number of pointers */
 	uint64_t		bg_inuse;	/* number of slots in use */
-	void			**bg_ptrs;	/* pointers */
+	char			*bg_items;	/* pointer to block of items */
+	size_t			bg_item_sz;	/* size of each item */
 };
-#define BAG_END(bag)	(&(bag)->bg_ptrs[(bag)->bg_nr])
+
+static inline void *bag_ptr(struct xfs_bag *bag, uint64_t idx)
+{
+	return &bag->bg_items[bag->bg_item_sz * idx];
+}
+
+static inline void *bag_end(struct xfs_bag *bag)
+{
+	return bag_ptr(bag, bag->bg_nr);
+}
 
 /*
  * Create a slab to hold some objects of a particular size.
@@ -382,15 +392,17 @@ slab_count(
  */
 int
 init_bag(
-	struct xfs_bag	**bag)
+	struct xfs_bag	**bag,
+	size_t		item_sz)
 {
 	struct xfs_bag	*ptr;
 
 	ptr = calloc(1, sizeof(struct xfs_bag));
 	if (!ptr)
 		return -ENOMEM;
-	ptr->bg_ptrs = calloc(MIN_BAG_SIZE, sizeof(void *));
-	if (!ptr->bg_ptrs) {
+	ptr->bg_item_sz = item_sz;
+	ptr->bg_items = calloc(MIN_BAG_SIZE, item_sz);
+	if (!ptr->bg_items) {
 		free(ptr);
 		return -ENOMEM;
 	}
@@ -411,7 +423,7 @@ free_bag(
 	ptr = *bag;
 	if (!ptr)
 		return;
-	free(ptr->bg_ptrs);
+	free(ptr->bg_items);
 	free(ptr);
 	*bag = NULL;
 }
@@ -424,22 +436,23 @@ bag_add(
 	struct xfs_bag	*bag,
 	void		*ptr)
 {
-	void		**p, **x;
+	void		*p, *x;
 
-	p = &bag->bg_ptrs[bag->bg_inuse];
-	if (p == BAG_END(bag)) {
+	p = bag_ptr(bag, bag->bg_inuse);
+	if (p == bag_end(bag)) {
 		/* No free space, alloc more pointers */
 		uint64_t	nr;
 
 		nr = bag->bg_nr * 2;
-		x = realloc(bag->bg_ptrs, nr * sizeof(void *));
+		x = realloc(bag->bg_items, nr * bag->bg_item_sz);
 		if (!x)
 			return -ENOMEM;
-		bag->bg_ptrs = x;
-		memset(BAG_END(bag), 0, bag->bg_nr * sizeof(void *));
+		bag->bg_items = x;
+		memset(bag_end(bag), 0, bag->bg_nr * bag->bg_item_sz);
 		bag->bg_nr = nr;
+		p = bag_ptr(bag, bag->bg_inuse);
 	}
-	bag->bg_ptrs[bag->bg_inuse] = ptr;
+	memcpy(p, ptr, bag->bg_item_sz);
 	bag->bg_inuse++;
 	return 0;
 }
@@ -453,8 +466,8 @@ bag_remove(
 	uint64_t	nr)
 {
 	ASSERT(nr < bag->bg_inuse);
-	memmove(&bag->bg_ptrs[nr], &bag->bg_ptrs[nr + 1],
-		(bag->bg_inuse - nr - 1) * sizeof(void *));
+	memmove(bag_ptr(bag, nr), bag_ptr(bag, nr + 1),
+		(bag->bg_inuse - nr - 1) * bag->bg_item_sz);
 	bag->bg_inuse--;
 	return 0;
 }
@@ -479,5 +492,5 @@ bag_item(
 {
 	if (nr >= bag->bg_inuse)
 		return NULL;
-	return bag->bg_ptrs[nr];
+	return bag_ptr(bag, nr);
 }
diff --git a/repair/slab.h b/repair/slab.h
index 077b4582214..019b169024d 100644
--- a/repair/slab.h
+++ b/repair/slab.h
@@ -28,7 +28,7 @@ void *pop_slab_cursor(struct xfs_slab_cursor *cur);
 
 struct xfs_bag;
 
-int init_bag(struct xfs_bag **bagp);
+int init_bag(struct xfs_bag **bagp, size_t itemsz);
 void free_bag(struct xfs_bag **bagp);
 int bag_add(struct xfs_bag *bag, void *item);
 int bag_remove(struct xfs_bag *bag, uint64_t idx);



* [PATCH 5/6] xfs_repair: reduce rmap bag memory usage when creating refcounts
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 1/6] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/6] xfs_repair: convert regular rmap repair to use in-memory btrees Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The algorithm that computes reference count records uses a "bag"
structure to remember the rmap records corresponding to the current
block.  In the previous patch, as part of preparing this algorithm to
use in-memory rmap btrees, we converted the bag structure to store
actual rmap records instead of pointers to rmap records owned by
another structure.

However, the memory usage of the bag structure is now excessive -- we
only need the physical extent and inode owner information to generate
refcount records and mark inodes that require the reflink flag.  IOWs,
the flags and offset fields are unnecessary.  Create a custom structure
for the bag, which halves its memory usage.
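
For scale, a standalone illustration (types mirrored locally here; exact
sizes depend on padding, but a full struct xfs_rmap_irec is about 32
bytes once the 64-bit offset and the flags are included, while the
trimmed-down record is 16 bytes):

	#include <stdint.h>
	#include <stdio.h>

	/* Local mirror of the trimmed-down bag record from this patch. */
	struct rmap_for_refcount {
		uint32_t	rm_startblock;	/* xfs_agblock_t */
		uint32_t	rm_blockcount;	/* xfs_extlen_t */
		uint64_t	rm_owner;
	};

	int main(void)
	{
		/* prints 16 */
		printf("%zu bytes per bag record\n",
				sizeof(struct rmap_for_refcount));
		return 0;
	}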

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/rmap.c |   74 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 30 deletions(-)


diff --git a/repair/rmap.c b/repair/rmap.c
index 0a29a615afa..87123f5331d 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -35,6 +35,13 @@ struct xfs_ag_rmap {
 	struct xfs_slab	*ar_refcount_items;	/* refcount items, p4-5 */
 };
 
+/* Only the parts of struct xfs_rmap_irec that we need to compute refcounts. */
+struct rmap_for_refcount {
+	xfs_agblock_t	rm_startblock;
+	xfs_extlen_t	rm_blockcount;
+	uint64_t	rm_owner;
+};
+
 static struct xfs_ag_rmap *ag_rmaps;
 bool rmapbt_suspect;
 static bool refcbt_suspect;
@@ -786,16 +793,14 @@ static void
 rmap_dump(
 	const char		*msg,
 	xfs_agnumber_t		agno,
-	struct xfs_rmap_irec	*rmap)
+	const struct rmap_for_refcount *rfr)
 {
-	printf("%s: %p agno=%u pblk=%llu own=%lld lblk=%llu len=%u flags=0x%x\n",
-		msg, rmap,
+	printf("%s: %p agno=%u agbno=%llu owner=%lld fsbcount=%u\n",
+		msg, rfr,
 		(unsigned int)agno,
-		(unsigned long long)rmap->rm_startblock,
-		(unsigned long long)rmap->rm_owner,
-		(unsigned long long)rmap->rm_offset,
-		(unsigned int)rmap->rm_blockcount,
-		(unsigned int)rmap->rm_flags);
+		(unsigned long long)rfr->rm_startblock,
+		(unsigned long long)rfr->rm_owner,
+		(unsigned int)rfr->rm_blockcount);
 }
 #else
 # define rmap_dump(m, a, r)
@@ -874,30 +879,33 @@ rmap_dump(
  */
 static void
 mark_inode_rl(
-	struct xfs_mount		*mp,
+	struct xfs_mount	*mp,
 	struct xfs_bag		*rmaps)
 {
-	xfs_agnumber_t		iagno;
-	struct xfs_rmap_irec	*rmap;
+	struct rmap_for_refcount *rfr;
 	struct ino_tree_node	*irec;
 	int			off;
 	uint64_t		idx;
-	xfs_agino_t		ino;
 
 	if (bag_count(rmaps) < 2)
 		return;
 
 	/* Reflink flag accounting */
-	foreach_bag_ptr(rmaps, idx, rmap) {
-		ASSERT(!XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner));
-		iagno = XFS_INO_TO_AGNO(mp, rmap->rm_owner);
-		ino = XFS_INO_TO_AGINO(mp, rmap->rm_owner);
-		pthread_mutex_lock(&ag_locks[iagno].lock);
-		irec = find_inode_rec(mp, iagno, ino);
-		off = get_inode_offset(mp, rmap->rm_owner, irec);
+	foreach_bag_ptr(rmaps, idx, rfr) {
+		xfs_agnumber_t	agno;
+		xfs_agino_t	agino;
+
+		ASSERT(!XFS_RMAP_NON_INODE_OWNER(rfr->rm_owner));
+
+		agno = XFS_INO_TO_AGNO(mp, rfr->rm_owner);
+		agino = XFS_INO_TO_AGINO(mp, rfr->rm_owner);
+
+		pthread_mutex_lock(&ag_locks[agno].lock);
+		irec = find_inode_rec(mp, agno, agino);
+		off = get_inode_offset(mp, rfr->rm_owner, irec);
 		/* lock here because we might go outside this ag */
 		set_inode_is_rl(irec, off);
-		pthread_mutex_unlock(&ag_locks[iagno].lock);
+		pthread_mutex_unlock(&ag_locks[agno].lock);
 	}
 }
 
@@ -1006,15 +1014,15 @@ next_refcount_edge(
 	bool			next_valid,
 	xfs_agblock_t		*nbnop)
 {
-	struct xfs_rmap_irec	*rmap;
+	struct rmap_for_refcount *rfr;
 	uint64_t		idx;
 	xfs_agblock_t		nbno = NULLAGBLOCK;
 
 	if (next_valid)
 		nbno = next_rmap->rm_startblock;
 
-	foreach_bag_ptr(stack_top, idx, rmap)
-		nbno = min(nbno, RMAP_NEXT(rmap));
+	foreach_bag_ptr(stack_top, idx, rfr)
+		nbno = min(nbno, RMAP_NEXT(rfr));
 
 	/*
 	 * We should have found /something/ because either next_rmap is the next
@@ -1049,8 +1057,14 @@ refcount_push_rmaps_at(
 	int			error;
 
 	while (*have && irec->rm_startblock == bno) {
-		rmap_dump(tag, agno, irec);
-		error = bag_add(stack_top, irec);
+		struct rmap_for_refcount	rfr = {
+			.rm_startblock		= irec->rm_startblock,
+			.rm_blockcount		= irec->rm_blockcount,
+			.rm_owner		= irec->rm_owner,
+		};
+
+		rmap_dump(tag, agno, &rfr);
+		error = bag_add(stack_top, &rfr);
 		if (error)
 			return error;
 		error = refcount_walk_rmaps(rmcur->mcur, irec, have);
@@ -1079,7 +1093,7 @@ compute_refcounts(
 	struct rmap_mem_cur	rmcur;
 	struct xfs_rmap_irec	irec;
 	struct xfs_bag		*stack_top = NULL;
-	struct xfs_rmap_irec	*rmap;
+	struct rmap_for_refcount *rfr;
 	uint64_t		idx;
 	uint64_t		old_stack_nr;
 	xfs_agblock_t		sbno;	/* first bno of this rmap set */
@@ -1095,7 +1109,7 @@ compute_refcounts(
 	if (error)
 		return error;
 
-	error = init_bag(&stack_top, sizeof(struct xfs_rmap_irec));
+	error = init_bag(&stack_top, sizeof(struct rmap_for_refcount));
 	if (error)
 		goto out_cur;
 
@@ -1132,10 +1146,10 @@ compute_refcounts(
 		/* While stack isn't empty... */
 		while (bag_count(stack_top)) {
 			/* Pop all rmaps that end at nbno */
-			foreach_bag_ptr_reverse(stack_top, idx, rmap) {
-				if (RMAP_NEXT(rmap) != nbno)
+			foreach_bag_ptr_reverse(stack_top, idx, rfr) {
+				if (RMAP_NEXT(rfr) != nbno)
 					continue;
-				rmap_dump("pop", agno, rmap);
+				rmap_dump("pop", agno, rfr);
 				error = bag_remove(stack_top, idx);
 				if (error)
 					goto out_bag;



* [PATCH 6/6] xfs_repair: remove the old rmap collection slabs
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/6] xfs_repair: verify on-disk rmap btrees with in-memory btree data Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we've switched the offline repair code to use an in-memory
rmap btree for everything except recording the rmaps for the newly
generated per-AG btrees, get rid of all the old code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/dinode.c |    9 +--
 repair/phase4.c |   23 -------
 repair/rmap.c   |  189 +++++++++----------------------------------------------
 repair/rmap.h   |   16 ++---
 repair/scan.c   |    7 --
 5 files changed, 42 insertions(+), 202 deletions(-)


diff --git a/repair/dinode.c b/repair/dinode.c
index 5e664eab7ea..ee34a62ae8b 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -625,13 +625,8 @@ _("illegal state %d in block map %" PRIu64 "\n"),
 				break;
 			}
 		}
-		if (collect_rmaps) { /* && !check_dups */
-			error = rmap_add_rec(mp, ino, whichfork, &irec);
-			if (error)
-				do_error(
-_("couldn't add reverse mapping\n")
-					);
-		}
+		if (collect_rmaps) /* && !check_dups */
+			rmap_add_rec(mp, ino, whichfork, &irec);
 		*tot += irec.br_blockcount;
 	}
 	error = 0;
diff --git a/repair/phase4.c b/repair/phase4.c
index b8a6992706a..b5e713aaa82 100644
--- a/repair/phase4.c
+++ b/repair/phase4.c
@@ -142,17 +142,7 @@ static void
 process_ags(
 	xfs_mount_t		*mp)
 {
-	xfs_agnumber_t		i;
-	int			error;
-
 	do_inode_prefetch(mp, ag_stride, process_ag_func, true, false);
-	for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-		error = rmap_finish_collecting_fork_recs(mp, i);
-		if (error)
-			do_error(
-_("unable to finish adding attr/data fork reverse-mapping data for AG %u.\n"),
-				i);
-	}
 }
 
 static void
@@ -161,18 +151,7 @@ check_rmap_btrees(
 	xfs_agnumber_t	agno,
 	void		*arg)
 {
-	int		error;
-
-	error = rmap_add_fixed_ag_rec(wq->wq_ctx, agno);
-	if (error)
-		do_error(
-_("unable to add AG %u metadata reverse-mapping data.\n"), agno);
-
-	error = rmap_fold_raw_recs(wq->wq_ctx, agno);
-	if (error)
-		do_error(
-_("unable to merge AG %u metadata reverse-mapping data.\n"), agno);
-
+	rmap_add_fixed_ag_rec(wq->wq_ctx, agno);
 	rmaps_verify_btree(wq->wq_ctx, agno);
 }
 
diff --git a/repair/rmap.c b/repair/rmap.c
index 87123f5331d..e598dd9c9b8 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -27,11 +27,9 @@
 /* per-AG rmap object anchor */
 struct xfs_ag_rmap {
 	struct xfbtree	*ar_xfbtree;		/* rmap observations */
-	struct xfs_slab	*ar_rmaps;		/* rmap observations, p4 */
-	struct xfs_slab	*ar_raw_rmaps;		/* unmerged rmaps */
+	struct xfs_slab	*ar_agbtree_rmaps;	/* rmaps for rebuilt ag btrees */
 	int		ar_flcount;		/* agfl entries from leftover */
 						/* agbt allocations */
-	struct xfs_rmap_irec	ar_last_rmap;	/* last rmap seen */
 	struct xfs_slab	*ar_refcount_items;	/* refcount items, p4-5 */
 };
 
@@ -72,6 +70,7 @@ rmaps_destroy(
 	struct xfile		*xfile;
 	struct xfs_buftarg	*target;
 
+	free_slab(&ag_rmap->ar_agbtree_rmaps);
 	free_slab(&ag_rmap->ar_refcount_items);
 
 	if (!ag_rmap->ar_xfbtree)
@@ -116,6 +115,11 @@ rmaps_init_ag(
 	if (error)
 		goto nomem;
 
+	error = init_slab(&ag_rmap->ar_agbtree_rmaps,
+			sizeof(struct xfs_rmap_irec));
+	if (error)
+		goto nomem;
+
 	return;
 nomem:
 	do_error(
@@ -130,7 +134,6 @@ rmaps_init(
 	struct xfs_mount	*mp)
 {
 	xfs_agnumber_t		i;
-	int			error;
 
 	if (!rmap_needs_work(mp))
 		return;
@@ -139,21 +142,8 @@ rmaps_init(
 	if (!ag_rmaps)
 		do_error(_("couldn't allocate per-AG reverse map roots\n"));
 
-	for (i = 0; i < mp->m_sb.sb_agcount; i++) {
+	for (i = 0; i < mp->m_sb.sb_agcount; i++)
 		rmaps_init_ag(mp, i, &ag_rmaps[i]);
-
-		error = init_slab(&ag_rmaps[i].ar_rmaps,
-				sizeof(struct xfs_rmap_irec));
-		if (error)
-			do_error(
-_("Insufficient memory while allocating reverse mapping slabs."));
-		error = init_slab(&ag_rmaps[i].ar_raw_rmaps,
-				  sizeof(struct xfs_rmap_irec));
-		if (error)
-			do_error(
-_("Insufficient memory while allocating raw metadata reverse mapping slabs."));
-		ag_rmaps[i].ar_last_rmap.rm_owner = XFS_RMAP_OWN_UNKNOWN;
-	}
 }
 
 /*
@@ -168,11 +158,8 @@ rmaps_free(
 	if (!rmap_needs_work(mp))
 		return;
 
-	for (i = 0; i < mp->m_sb.sb_agcount; i++) {
-		free_slab(&ag_rmaps[i].ar_rmaps);
-		free_slab(&ag_rmaps[i].ar_raw_rmaps);
+	for (i = 0; i < mp->m_sb.sb_agcount; i++)
 		rmaps_destroy(mp, &ag_rmaps[i]);
-	}
 	free(ag_rmaps);
 	ag_rmaps = NULL;
 }
@@ -303,7 +290,7 @@ rmap_add_mem_rec(
  * Add an observation about a block mapping in an inode's data or attribute
  * fork for later btree reconstruction.
  */
-int
+void
 rmap_add_rec(
 	struct xfs_mount	*mp,
 	xfs_ino_t		ino,
@@ -313,11 +300,9 @@ rmap_add_rec(
 	struct xfs_rmap_irec	rmap;
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
-	struct xfs_rmap_irec	*last_rmap;
-	int			error = 0;
 
 	if (!rmap_needs_work(mp))
-		return 0;
+		return;
 
 	agno = XFS_FSB_TO_AGNO(mp, irec->br_startblock);
 	agbno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
@@ -338,36 +323,10 @@ rmap_add_rec(
 		rmap.rm_flags |= XFS_RMAP_UNWRITTEN;
 
 	rmap_add_mem_rec(mp, agno, &rmap);
-
-	last_rmap = &ag_rmaps[agno].ar_last_rmap;
-	if (last_rmap->rm_owner == XFS_RMAP_OWN_UNKNOWN)
-		*last_rmap = rmap;
-	else if (rmaps_are_mergeable(last_rmap, &rmap))
-		last_rmap->rm_blockcount += rmap.rm_blockcount;
-	else {
-		error = slab_add(ag_rmaps[agno].ar_rmaps, last_rmap);
-		if (error)
-			return error;
-		*last_rmap = rmap;
-	}
-
-	return error;
-}
-
-/* Finish collecting inode data/attr fork rmaps. */
-int
-rmap_finish_collecting_fork_recs(
-	struct xfs_mount	*mp,
-	xfs_agnumber_t		agno)
-{
-	if (!rmap_needs_work(mp) ||
-	    ag_rmaps[agno].ar_last_rmap.rm_owner == XFS_RMAP_OWN_UNKNOWN)
-		return 0;
-	return slab_add(ag_rmaps[agno].ar_rmaps, &ag_rmaps[agno].ar_last_rmap);
 }
 
 /* add a raw rmap; these will be merged later */
-static int
+static void
 __rmap_add_raw_rec(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
@@ -391,13 +350,12 @@ __rmap_add_raw_rec(
 	rmap.rm_blockcount = len;
 
 	rmap_add_mem_rec(mp, agno, &rmap);
-	return slab_add(ag_rmaps[agno].ar_raw_rmaps, &rmap);
 }
 
 /*
  * Add a reverse mapping for an inode fork's block mapping btree block.
  */
-int
+void
 rmap_add_bmbt_rec(
 	struct xfs_mount	*mp,
 	xfs_ino_t		ino,
@@ -408,7 +366,7 @@ rmap_add_bmbt_rec(
 	xfs_agblock_t		agbno;
 
 	if (!rmap_needs_work(mp))
-		return 0;
+		return;
 
 	agno = XFS_FSB_TO_AGNO(mp, fsbno);
 	agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
@@ -416,14 +374,14 @@ rmap_add_bmbt_rec(
 	ASSERT(agno < mp->m_sb.sb_agcount);
 	ASSERT(agbno + 1 <= mp->m_sb.sb_agblocks);
 
-	return __rmap_add_raw_rec(mp, agno, agbno, 1, ino,
-			whichfork == XFS_ATTR_FORK, true);
+	__rmap_add_raw_rec(mp, agno, agbno, 1, ino, whichfork == XFS_ATTR_FORK,
+			true);
 }
 
 /*
  * Add a reverse mapping for a per-AG fixed metadata extent.
  */
-int
+STATIC void
 rmap_add_ag_rec(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
@@ -432,13 +390,13 @@ rmap_add_ag_rec(
 	uint64_t		owner)
 {
 	if (!rmap_needs_work(mp))
-		return 0;
+		return;
 
 	ASSERT(agno != NULLAGNUMBER);
 	ASSERT(agno < mp->m_sb.sb_agcount);
 	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
 
-	return __rmap_add_raw_rec(mp, agno, agbno, len, owner, false, false);
+	__rmap_add_raw_rec(mp, agno, agbno, len, owner, false, false);
 }
 
 /*
@@ -468,62 +426,7 @@ rmap_add_agbtree_mapping(
 	assert(libxfs_verify_agbext(pag, agbno, len));
 	libxfs_perag_put(pag);
 
-	return slab_add(ag_rmaps[agno].ar_raw_rmaps, &rmap);
-}
-
-/*
- * Merge adjacent raw rmaps and add them to the main rmap list.
- */
-int
-rmap_fold_raw_recs(
-	struct xfs_mount	*mp,
-	xfs_agnumber_t		agno)
-{
-	struct xfs_slab_cursor	*cur = NULL;
-	struct xfs_rmap_irec	*prev, *rec;
-	uint64_t		old_sz;
-	int			error = 0;
-
-	old_sz = slab_count(ag_rmaps[agno].ar_rmaps);
-	if (slab_count(ag_rmaps[agno].ar_raw_rmaps) == 0)
-		goto no_raw;
-	qsort_slab(ag_rmaps[agno].ar_raw_rmaps, rmap_compare);
-	error = init_slab_cursor(ag_rmaps[agno].ar_raw_rmaps, rmap_compare,
-			&cur);
-	if (error)
-		goto err;
-
-	prev = pop_slab_cursor(cur);
-	rec = pop_slab_cursor(cur);
-	while (prev && rec) {
-		if (rmaps_are_mergeable(prev, rec)) {
-			prev->rm_blockcount += rec->rm_blockcount;
-			rec = pop_slab_cursor(cur);
-			continue;
-		}
-		error = slab_add(ag_rmaps[agno].ar_rmaps, prev);
-		if (error)
-			goto err;
-		prev = rec;
-		rec = pop_slab_cursor(cur);
-	}
-	if (prev) {
-		error = slab_add(ag_rmaps[agno].ar_rmaps, prev);
-		if (error)
-			goto err;
-	}
-	free_slab(&ag_rmaps[agno].ar_raw_rmaps);
-	error = init_slab(&ag_rmaps[agno].ar_raw_rmaps,
-			sizeof(struct xfs_rmap_irec));
-	if (error)
-		do_error(
-_("Insufficient memory while allocating raw metadata reverse mapping slabs."));
-no_raw:
-	if (old_sz)
-		qsort_slab(ag_rmaps[agno].ar_rmaps, rmap_compare);
-err:
-	free_slab_cursor(&cur);
-	return error;
+	return slab_add(ag_rmaps[agno].ar_agbtree_rmaps, &rmap);
 }
 
 static int
@@ -560,7 +463,7 @@ popcnt(
  * Add an allocation group's fixed metadata to the rmap list.  This includes
  * sb/agi/agf/agfl headers, inode chunks, and the log.
  */
-int
+void
 rmap_add_fixed_ag_rec(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno)
@@ -569,18 +472,14 @@ rmap_add_fixed_ag_rec(
 	xfs_agblock_t		agbno;
 	ino_tree_node_t		*ino_rec;
 	xfs_agino_t		agino;
-	int			error;
 	int			startidx;
 	int			nr;
 
 	if (!rmap_needs_work(mp))
-		return 0;
+		return;
 
 	/* sb/agi/agf/agfl headers */
-	error = rmap_add_ag_rec(mp, agno, 0, XFS_BNO_BLOCK(mp),
-			XFS_RMAP_OWN_FS);
-	if (error)
-		goto out;
+	rmap_add_ag_rec(mp, agno, 0, XFS_BNO_BLOCK(mp), XFS_RMAP_OWN_FS);
 
 	/* inodes */
 	ino_rec = findfirst_inode_rec(agno);
@@ -598,10 +497,8 @@ rmap_add_fixed_ag_rec(
 		agino = ino_rec->ino_startnum + startidx;
 		agbno = XFS_AGINO_TO_AGBNO(mp, agino);
 		if (XFS_AGINO_TO_OFFSET(mp, agino) == 0) {
-			error = rmap_add_ag_rec(mp, agno, agbno, nr,
+			rmap_add_ag_rec(mp, agno, agbno, nr,
 					XFS_RMAP_OWN_INODES);
-			if (error)
-				goto out;
 		}
 	}
 
@@ -609,13 +506,9 @@ rmap_add_fixed_ag_rec(
 	fsbno = mp->m_sb.sb_logstart;
 	if (fsbno && XFS_FSB_TO_AGNO(mp, fsbno) == agno) {
 		agbno = XFS_FSB_TO_AGBNO(mp, mp->m_sb.sb_logstart);
-		error = rmap_add_ag_rec(mp, agno, agbno, mp->m_sb.sb_logblocks,
+		rmap_add_ag_rec(mp, agno, agbno, mp->m_sb.sb_logblocks,
 				XFS_RMAP_OWN_LOG);
-		if (error)
-			goto out;
 	}
-out:
-	return error;
 }
 
 /*
@@ -656,12 +549,6 @@ rmap_commit_agbtree_mappings(
 	if (!xfs_has_rmapbt(mp))
 		return 0;
 
-	/* Release the ar_rmaps; they were put into the rmapbt during p5. */
-	free_slab(&ag_rmap->ar_rmaps);
-	error = init_slab(&ag_rmap->ar_rmaps, sizeof(struct xfs_rmap_irec));
-	if (error)
-		goto err;
-
 	/* Add the AGFL blocks to the rmap list */
 	error = -libxfs_trans_read_buf(
 			mp, NULL, mp->m_ddev_targp,
@@ -685,7 +572,8 @@ rmap_commit_agbtree_mappings(
 	 * space btree blocks, so we must be careful not to create those
 	 * records again.  Create a bitmap of already-recorded OWN_AG rmaps.
 	 */
-	error = init_slab_cursor(ag_rmap->ar_raw_rmaps, rmap_compare, &rm_cur);
+	error = init_slab_cursor(ag_rmap->ar_agbtree_rmaps, rmap_compare,
+			&rm_cur);
 	if (error)
 		goto err;
 	error = -bitmap_alloc(&own_ag_bitmap);
@@ -718,7 +606,7 @@ rmap_commit_agbtree_mappings(
 
 		agbno = be32_to_cpu(*b);
 		if (!bitmap_test(own_ag_bitmap, agbno, 1)) {
-			error = rmap_add_ag_rec(mp, agno, agbno, 1,
+			error = rmap_add_agbtree_mapping(mp, agno, agbno, 1,
 					XFS_RMAP_OWN_AG);
 			if (error)
 				goto err;
@@ -729,13 +617,9 @@ rmap_commit_agbtree_mappings(
 	agflbp = NULL;
 	bitmap_free(&own_ag_bitmap);
 
-	/* Merge all the raw rmaps into the main list */
-	error = rmap_fold_raw_recs(mp, agno);
-	if (error)
-		goto err;
-
 	/* Create cursors to rmap structures */
-	error = init_slab_cursor(ag_rmap->ar_rmaps, rmap_compare, &rm_cur);
+	error = init_slab_cursor(ag_rmap->ar_agbtree_rmaps, rmap_compare,
+			&rm_cur);
 	if (error)
 		goto err;
 
@@ -1104,6 +988,8 @@ compute_refcounts(
 
 	if (!xfs_has_reflink(mp))
 		return 0;
+	if (ag_rmaps[agno].ar_xfbtree == NULL)
+		return 0;
 
 	error = rmap_init_mem_cursor(mp, NULL, agno, &rmcur);
 	if (error)
@@ -1248,17 +1134,6 @@ rmap_record_count(
 	return nr;
 }
 
-/*
- * Return a slab cursor that will return rmap objects in order.
- */
-int
-rmap_init_cursor(
-	xfs_agnumber_t		agno,
-	struct xfs_slab_cursor	**cur)
-{
-	return init_slab_cursor(ag_rmaps[agno].ar_rmaps, rmap_compare, cur);
-}
-
 /*
  * Disable the refcount btree check.
  */
diff --git a/repair/rmap.h b/repair/rmap.h
index d8eec58ab8d..cb6c32af62c 100644
--- a/repair/rmap.h
+++ b/repair/rmap.h
@@ -14,23 +14,19 @@ extern bool rmap_needs_work(struct xfs_mount *);
 extern void rmaps_init(struct xfs_mount *);
 extern void rmaps_free(struct xfs_mount *);
 
-extern int rmap_add_rec(struct xfs_mount *, xfs_ino_t, int, struct xfs_bmbt_irec *);
-extern int rmap_finish_collecting_fork_recs(struct xfs_mount *mp,
-		xfs_agnumber_t agno);
-extern int rmap_add_ag_rec(struct xfs_mount *, xfs_agnumber_t agno,
-		xfs_agblock_t agbno, xfs_extlen_t len, uint64_t owner);
-extern int rmap_add_bmbt_rec(struct xfs_mount *, xfs_ino_t, int, xfs_fsblock_t);
-extern int rmap_fold_raw_recs(struct xfs_mount *mp, xfs_agnumber_t agno);
-extern bool rmaps_are_mergeable(struct xfs_rmap_irec *r1, struct xfs_rmap_irec *r2);
+void rmap_add_rec(struct xfs_mount *mp, xfs_ino_t ino, int whichfork,
+		struct xfs_bmbt_irec *irec);
+void rmap_add_bmbt_rec(struct xfs_mount *mp, xfs_ino_t ino, int whichfork,
+		xfs_fsblock_t fsbno);
+bool rmaps_are_mergeable(struct xfs_rmap_irec *r1, struct xfs_rmap_irec *r2);
 
-extern int rmap_add_fixed_ag_rec(struct xfs_mount *, xfs_agnumber_t);
+void rmap_add_fixed_ag_rec(struct xfs_mount *mp, xfs_agnumber_t agno);
 
 int rmap_add_agbtree_mapping(struct xfs_mount *mp, xfs_agnumber_t agno,
 		xfs_agblock_t agbno, xfs_extlen_t len, uint64_t owner);
 int rmap_commit_agbtree_mappings(struct xfs_mount *mp, xfs_agnumber_t agno);
 
 uint64_t rmap_record_count(struct xfs_mount *mp, xfs_agnumber_t agno);
-extern int rmap_init_cursor(xfs_agnumber_t, struct xfs_slab_cursor **);
 extern void rmap_avoid_check(void);
 void rmaps_verify_btree(struct xfs_mount *mp, xfs_agnumber_t agno);
 
diff --git a/repair/scan.c b/repair/scan.c
index ac2233b93b7..ff51eb0a602 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -224,7 +224,6 @@ scan_bmapbt(
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
 	int			state;
-	int			error;
 
 	/*
 	 * unlike the ag freeblock btrees, if anything looks wrong
@@ -413,12 +412,8 @@ _("bad state %d, inode %" PRIu64 " bmap block 0x%" PRIx64 "\n"),
 	if (check_dups && collect_rmaps) {
 		agno = XFS_FSB_TO_AGNO(mp, bno);
 		pthread_mutex_lock(&ag_locks[agno].lock);
-		error = rmap_add_bmbt_rec(mp, ino, whichfork, bno);
+		rmap_add_bmbt_rec(mp, ino, whichfork, bno);
 		pthread_mutex_unlock(&ag_locks[agno].lock);
-		if (error)
-			do_error(
-_("couldn't add inode %"PRIu64" bmbt block %"PRIu64" reverse-mapping data."),
-				ino, bno);
 	}
 
 	if (level == 0) {



* [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (44 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/5] xfs_repair: port to the new refcount bag structure Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code Darrick J. Wong
                   ` (31 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

The refcountbt repair code has serious memory usage problems when the
block sharing factor of the filesystem is very high.  This can happen if
a deduplication tool has been run against the filesystem, or if the fs
stores reflinked VM images that have been aging for a long time.

Recall that the original reference counting algorithm walks the reverse
mapping records of the filesystem to generate reference counts.  For any
given block in the AG, the rmap bag structure contains all the rmap
records that cover that block; the refcount is the size of that bag.
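
A tiny worked example (block numbers and owners invented purely for
illustration):

    rmaps in one AG:   (agbno 100, len 4, owner A)
                       (agbno 100, len 2, owner B)
                       (agbno 102, len 2, owner C)

    agbno 100-101:  bag = { A, B }  ->  refcount 2
    agbno 102-103:  bag = { A, C }  ->  refcount 2

Roughly speaking, each time the sweep crosses a block where some rmap
starts or ends, records are pushed into or popped out of the bag, and
ranges where the bag holds more than one record are emitted as refcount
records.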

For online repair, the bag doesn't need the owner, offset, or state flag
information, so it discards those.  This halves the record size, but the
bag structure still stores one excerpted record for each reverse
mapping.  If the sharing count is high, this will use a LOT of memory
storing redundant records.  In the extreme case, 100k mappings to the
same piece of space will consume 100k*16 bytes = 1.6M of memory.

For offline repair, the bag stores the owner values so that we know
which inodes need to be marked as being reflink inodes.  If a
deduplication tool has been run and there are many blocks within a file
pointing to the same physical space, this will still use a lot of memory
to store redundant records.

The solution to this problem is to deduplicate the bag records when
possible by adding a reference count to the bag record, and changing the
bag add function to detect an existing record and bump its refcount.  In
the above example, the 100k mappings will now use 24 bytes of memory.
These lookups can be done efficiently with a btree, so we create a new
refcount bag btree type (inside of online repair).  This is why we
refactored the btree code in the previous patchset.
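
In sketch form (the rbg_* field names match the rcbag records added
later in this series; the field types, the opaque bag type, and the
three helper names are assumptions made up for illustration, not the
real insert path):

	/* One 24-byte record replaces N duplicate excerpted rmaps. */
	struct rcbag_rec {
		uint32_t	rbg_startblock;
		uint32_t	rbg_blockcount;
		uint64_t	rbg_ino;
		uint64_t	rbg_refcount;
	};

	static int
	rcbag_add_sketch(struct rcbag *bag, const struct xfs_rmap_irec *rmap)
	{
		struct rcbag_rec	rec = {
			.rbg_startblock	= rmap->rm_startblock,
			.rbg_blockcount	= rmap->rm_blockcount,
			.rbg_ino	= rmap->rm_owner,
		};

		/* rcbag_find/rcbag_bump/rcbag_insert are hypothetical names. */
		if (rcbag_find(bag, &rec))
			return rcbag_bump(bag, &rec);

		rec.rbg_refcount = 1;
		return rcbag_insert(bag, &rec);
	}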

The btree conversion also dramatically reduces the runtime of the
refcount generation algorithm, because the code to delete all bag
records that end at a given agblock now only has to delete one record
instead of (using the example above) 100k records.  As an added benefit,
record deletion now gives back the unused xfile space, which it did not
do previously.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-refcount-scalability

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-refcount-scalability
---
 libxfs/libxfs_api_defs.h |    8 +
 libxfs/xfs_btree.c       |    3 
 libxfs/xfs_btree.h       |    1 
 libxfs/xfs_shared.h      |    1 
 libxfs/xfs_types.h       |    6 -
 repair/Makefile          |    4 
 repair/rcbag.c           |  404 ++++++++++++++++++++++++++++++++++++++++++++++
 repair/rcbag.h           |   33 ++++
 repair/rcbag_btree.c     |  394 +++++++++++++++++++++++++++++++++++++++++++++
 repair/rcbag_btree.h     |   78 +++++++++
 repair/rmap.c            |  159 +++++-------------
 repair/slab.c            |  130 ---------------
 repair/slab.h            |   19 --
 repair/xfs_repair.c      |    6 +
 14 files changed, 982 insertions(+), 264 deletions(-)
 create mode 100644 repair/rcbag.c
 create mode 100644 repair/rcbag.h
 create mode 100644 repair/rcbag_btree.c
 create mode 100644 repair/rcbag_btree.h



* [PATCH 1/5] xfs: define an in-memory btree for storing refcount bag info
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/5] xfs_repair: port to the new refcount bag structure Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/5] xfs_repair: define an in-memory btree for storing refcount bag info Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/5] xfs_repair: create refcount bag Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 5/5] xfs_repair: remove the old bag implementation Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new in-memory btree type so that we can store refcount bag info
in a much more memory-efficient format.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_btree.c  |    3 +++
 libxfs/xfs_btree.h  |    1 +
 libxfs/xfs_shared.h |    1 +
 libxfs/xfs_types.h  |    6 ++++--
 4 files changed, 9 insertions(+), 2 deletions(-)


diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index 8c7f51f4195..b89e8ce7797 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -1370,6 +1370,9 @@ xfs_btree_set_refs(
 	case XFS_BTNUM_REFC:
 		xfs_buf_set_ref(bp, XFS_REFC_BTREE_REF);
 		break;
+	case XFS_BTNUM_RCBAG:
+		xfs_buf_set_ref(bp, XFS_RCBAG_BTREE_REF);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/libxfs/xfs_btree.h b/libxfs/xfs_btree.h
index 7c2ff1a02dd..c795dd8ee84 100644
--- a/libxfs/xfs_btree.h
+++ b/libxfs/xfs_btree.h
@@ -62,6 +62,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 #define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
 #define	XFS_BTNUM_REFC	((xfs_btnum_t)XFS_BTNUM_REFCi)
+#define	XFS_BTNUM_RCBAG	((xfs_btnum_t)XFS_BTNUM_RCBAGi)
 
 struct xfs_btree_ops;
 uint32_t xfs_btree_magic(struct xfs_mount *mp, const struct xfs_btree_ops *ops);
diff --git a/libxfs/xfs_shared.h b/libxfs/xfs_shared.h
index d1b3f210326..eaabfa52eda 100644
--- a/libxfs/xfs_shared.h
+++ b/libxfs/xfs_shared.h
@@ -128,6 +128,7 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define	XFS_ATTR_BTREE_REF	1
 #define	XFS_DQUOT_REF		1
 #define	XFS_REFC_BTREE_REF	1
+#define	XFS_RCBAG_BTREE_REF	1
 #define	XFS_SSB_REF		0
 
 /*
diff --git a/libxfs/xfs_types.h b/libxfs/xfs_types.h
index c2868e8b6a1..9a4019f23dd 100644
--- a/libxfs/xfs_types.h
+++ b/libxfs/xfs_types.h
@@ -116,7 +116,8 @@ typedef enum {
  */
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
-	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_REFCi, XFS_BTNUM_MAX
+	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_REFCi, XFS_BTNUM_RCBAGi,
+	XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 #define XFS_BTNUM_STRINGS \
@@ -126,7 +127,8 @@ typedef enum {
 	{ XFS_BTNUM_BMAPi,	"bmbt" }, \
 	{ XFS_BTNUM_INOi,	"inobt" }, \
 	{ XFS_BTNUM_FINOi,	"finobt" }, \
-	{ XFS_BTNUM_REFCi,	"refcbt" }
+	{ XFS_BTNUM_REFCi,	"refcbt" }, \
+	{ XFS_BTNUM_RCBAGi,	"rcbagbt" }
 
 struct xfs_name {
 	const unsigned char	*name;



* [PATCH 2/5] xfs_repair: define an in-memory btree for storing refcount bag info
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/5] xfs_repair: port to the new refcount bag structure Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/5] xfs: " Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new in-memory btree type so that we can store refcount bag info
in a much more memory-efficient format.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    3 
 repair/Makefile          |    2 
 repair/rcbag_btree.c     |  335 ++++++++++++++++++++++++++++++++++++++++++++++
 repair/rcbag_btree.h     |   71 ++++++++++
 4 files changed, 411 insertions(+)
 create mode 100644 repair/rcbag_btree.c
 create mode 100644 repair/rcbag_btree.h


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index a03cef515a2..df614182b7b 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -55,6 +55,7 @@
 
 #define xfs_btree_bload			libxfs_btree_bload
 #define xfs_btree_bload_compute_geometry libxfs_btree_bload_compute_geometry
+#define xfs_btree_calc_size		libxfs_btree_calc_size
 #define xfs_btree_decrement		libxfs_btree_decrement
 #define xfs_btree_del_cursor		libxfs_btree_del_cursor
 #define xfs_btree_get_block		libxfs_btree_get_block
@@ -62,8 +63,10 @@
 #define xfs_btree_has_more_records	libxfs_btree_has_more_records
 #define xfs_btree_increment		libxfs_btree_increment
 #define xfs_btree_init_block		libxfs_btree_init_block
+#define xfs_btree_mem_head_nlevels	libxfs_btree_mem_head_nlevels
 #define xfs_btree_mem_head_read_buf	libxfs_btree_mem_head_read_buf
 #define xfs_btree_rec_addr		libxfs_btree_rec_addr
+#define xfs_btree_space_to_height	libxfs_btree_space_to_height
 #define xfs_btree_visit_blocks		libxfs_btree_visit_blocks
 #define xfs_buf_delwri_submit		libxfs_buf_delwri_submit
 #define xfs_buf_get			libxfs_buf_get
diff --git a/repair/Makefile b/repair/Makefile
index e5014deb0ce..5ea8d9618e7 100644
--- a/repair/Makefile
+++ b/repair/Makefile
@@ -28,6 +28,7 @@ HFILES = \
 	progress.h \
 	protos.h \
 	quotacheck.h \
+	rcbag_btree.h \
 	rmap.h \
 	rt.h \
 	scan.h \
@@ -64,6 +65,7 @@ CFILES = \
 	prefetch.c \
 	progress.c \
 	quotacheck.c \
+	rcbag_btree.c \
 	rmap.c \
 	rt.c \
 	sb.c \
diff --git a/repair/rcbag_btree.c b/repair/rcbag_btree.c
new file mode 100644
index 00000000000..c86189806c8
--- /dev/null
+++ b/repair/rcbag_btree.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs.h"
+#include "btree.h"
+#include "err_protos.h"
+#include "libxlog.h"
+#include "incore.h"
+#include "globals.h"
+#include "dinode.h"
+#include "slab.h"
+#include "libfrog/bitmap.h"
+#include "libxfs/xfile.h"
+#include "libxfs/xfbtree.h"
+#include "libxfs/xfs_btree_mem.h"
+#include "rcbag_btree.h"
+
+static struct kmem_cache	*rcbagbt_cur_cache;
+
+STATIC void
+rcbagbt_init_key_from_rec(
+	union xfs_btree_key		*key,
+	const union xfs_btree_rec	*rec)
+{
+	struct rcbag_key	*bag_key = (struct rcbag_key *)key;
+	const struct rcbag_rec	*bag_rec = (const struct rcbag_rec *)rec;
+
+	BUILD_BUG_ON(sizeof(struct rcbag_key) > sizeof(union xfs_btree_key));
+	BUILD_BUG_ON(sizeof(struct rcbag_rec) > sizeof(union xfs_btree_rec));
+
+	bag_key->rbg_startblock = bag_rec->rbg_startblock;
+	bag_key->rbg_blockcount = bag_rec->rbg_blockcount;
+	bag_key->rbg_ino = bag_rec->rbg_ino;
+}
+
+STATIC void
+rcbagbt_init_rec_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*rec)
+{
+	struct rcbag_rec	*bag_rec = (struct rcbag_rec *)rec;
+	struct rcbag_rec	*bag_irec = (struct rcbag_rec *)&cur->bc_rec;
+
+	bag_rec->rbg_startblock = bag_irec->rbg_startblock;
+	bag_rec->rbg_blockcount = bag_irec->rbg_blockcount;
+	bag_rec->rbg_ino = bag_irec->rbg_ino;
+	bag_rec->rbg_refcount = bag_irec->rbg_refcount;
+}
+
+STATIC int64_t
+rcbagbt_key_diff(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key)
+{
+	struct rcbag_rec		*rec = (struct rcbag_rec *)&cur->bc_rec;
+	const struct rcbag_key		*kp = (const struct rcbag_key *)key;
+
+	if (kp->rbg_startblock > rec->rbg_startblock)
+		return 1;
+	if (kp->rbg_startblock < rec->rbg_startblock)
+		return -1;
+
+	if (kp->rbg_blockcount > rec->rbg_blockcount)
+		return 1;
+	if (kp->rbg_blockcount < rec->rbg_blockcount)
+		return -1;
+
+	if (kp->rbg_ino > rec->rbg_ino)
+		return 1;
+	if (kp->rbg_ino < rec->rbg_ino)
+		return -1;
+
+	return 0;
+}
+
+STATIC int64_t
+rcbagbt_diff_two_keys(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*k1,
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
+{
+	const struct rcbag_key		*kp1 = (const struct rcbag_key *)k1;
+	const struct rcbag_key		*kp2 = (const struct rcbag_key *)k2;
+
+	ASSERT(mask == NULL);
+
+	if (kp1->rbg_startblock > kp2->rbg_startblock)
+		return 1;
+	if (kp1->rbg_startblock < kp2->rbg_startblock)
+		return -1;
+
+	if (kp1->rbg_blockcount > kp2->rbg_blockcount)
+		return 1;
+	if (kp1->rbg_blockcount < kp2->rbg_blockcount)
+		return -1;
+
+	if (kp1->rbg_ino > kp2->rbg_ino)
+		return 1;
+	if (kp1->rbg_ino < kp2->rbg_ino)
+		return -1;
+
+	return 0;
+}
+
+STATIC int
+rcbagbt_keys_inorder(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*k1,
+	const union xfs_btree_key	*k2)
+{
+	const struct rcbag_key		*kp1 = (const struct rcbag_key *)k1;
+	const struct rcbag_key		*kp2 = (const struct rcbag_key *)k2;
+
+	if (kp1->rbg_startblock > kp2->rbg_startblock)
+		return 0;
+	if (kp1->rbg_startblock < kp2->rbg_startblock)
+		return 1;
+
+	if (kp1->rbg_blockcount > kp2->rbg_blockcount)
+		return 0;
+	if (kp1->rbg_blockcount < kp2->rbg_blockcount)
+		return 1;
+
+	if (kp1->rbg_ino > kp2->rbg_ino)
+		return 0;
+	if (kp1->rbg_ino < kp2->rbg_ino)
+		return 1;
+
+	return 0;
+}
+
+STATIC int
+rcbagbt_recs_inorder(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_rec	*r1,
+	const union xfs_btree_rec	*r2)
+{
+	const struct rcbag_rec		*rp1 = (const struct rcbag_rec *)r1;
+	const struct rcbag_rec		*rp2 = (const struct rcbag_rec *)r2;
+
+	if (rp1->rbg_startblock > rp2->rbg_startblock)
+		return 0;
+	if (rp1->rbg_startblock < rp2->rbg_startblock)
+		return 1;
+
+	if (rp1->rbg_blockcount > rp2->rbg_blockcount)
+		return 0;
+	if (rp1->rbg_blockcount < rp2->rbg_blockcount)
+		return 1;
+
+	if (rp1->rbg_ino > rp2->rbg_ino)
+		return 0;
+	if (rp1->rbg_ino < rp2->rbg_ino)
+		return 1;
+
+	return 0;
+}
+
+static xfs_failaddr_t
+rcbagbt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	xfs_failaddr_t		fa;
+	unsigned int		level;
+
+	if (!xfs_verify_magic(bp, block->bb_magic))
+		return __this_address;
+
+	fa = xfs_btree_lblock_v5hdr_verify(bp, XFS_RMAP_OWN_UNKNOWN);
+	if (fa)
+		return fa;
+
+	level = be16_to_cpu(block->bb_level);
+	if (level >= rcbagbt_maxlevels_possible())
+		return __this_address;
+
+	return xfbtree_lblock_verify(bp,
+			rcbagbt_maxrecs(mp, xfo_to_b(1), level == 0));
+}
+
+static void
+rcbagbt_rw_verify(
+	struct xfs_buf	*bp)
+{
+	xfs_failaddr_t	fa = rcbagbt_verify(bp);
+
+	if (fa)
+		do_error(_("refcount bag btree block 0x%llx corrupted at %p\n"),
+				(unsigned long long)xfs_buf_daddr(bp), fa);
+}
+
+/* Skip CRC checks on in-memory btrees to save time. */
+static const struct xfs_buf_ops rcbagbt_mem_buf_ops = {
+	.name			= "rcbagbt_mem",
+	.magic			= { 0, cpu_to_be32(RCBAG_MAGIC) },
+	.verify_read		= rcbagbt_rw_verify,
+	.verify_write		= rcbagbt_rw_verify,
+	.verify_struct		= rcbagbt_verify,
+};
+
+static const struct xfs_btree_ops rcbagbt_mem_ops = {
+	.rec_len		= sizeof(struct rcbag_rec),
+	.key_len		= sizeof(struct rcbag_key),
+	.geom_flags		= XFS_BTREE_CRC_BLOCKS | XFS_BTREE_LONG_PTRS |
+				  XFS_BTREE_IN_MEMORY,
+
+	.dup_cursor		= xfbtree_dup_cursor,
+	.set_root		= xfbtree_set_root,
+	.alloc_block		= xfbtree_alloc_block,
+	.free_block		= xfbtree_free_block,
+	.get_minrecs		= xfbtree_get_minrecs,
+	.get_maxrecs		= xfbtree_get_maxrecs,
+	.init_key_from_rec	= rcbagbt_init_key_from_rec,
+	.init_rec_from_cur	= rcbagbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfbtree_init_ptr_from_cur,
+	.key_diff		= rcbagbt_key_diff,
+	.buf_ops		= &rcbagbt_mem_buf_ops,
+	.diff_two_keys		= rcbagbt_diff_two_keys,
+	.keys_inorder		= rcbagbt_keys_inorder,
+	.recs_inorder		= rcbagbt_recs_inorder,
+};
+
+/* Create a cursor for an in-memory btree. */
+struct xfs_btree_cur *
+rcbagbt_mem_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*head_bp,
+	struct xfbtree		*xfbtree)
+{
+	struct xfs_btree_cur	*cur;
+
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RCBAG, &rcbagbt_mem_ops,
+			rcbagbt_maxlevels_possible(), rcbagbt_cur_cache);
+
+	cur->bc_mem.xfbtree = xfbtree;
+	cur->bc_mem.head_bp = head_bp;
+	cur->bc_nlevels = libxfs_btree_mem_head_nlevels(head_bp);
+	return cur;
+}
+
+/* Create an in-memory refcount bag btree. */
+int
+rcbagbt_mem_create(
+	struct xfs_mount	*mp,
+	struct xfs_buftarg	*target,
+	struct xfbtree		**xfbtreep)
+{
+	struct xfbtree_config	cfg = {
+		.btree_ops	= &rcbagbt_mem_ops,
+		.target		= target,
+	};
+
+	return -xfbtree_create(mp, &cfg, xfbtreep);
+}
+
+/* Calculate number of records in a refcount bag btree block. */
+static inline unsigned int
+rcbagbt_block_maxrecs(
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	if (leaf)
+		return blocklen / sizeof(struct rcbag_rec);
+	return blocklen /
+		(sizeof(struct rcbag_key) + sizeof(rcbag_ptr_t));
+}
+
+/*
+ * Calculate the number of records in a refcount bag btree block.
+ */
+unsigned int
+rcbagbt_maxrecs(
+	struct xfs_mount	*mp,
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	blocklen -= RCBAG_BLOCK_LEN;
+	return rcbagbt_block_maxrecs(blocklen, leaf);
+}
+
+#define RCBAGBT_INIT_MINRECS(minrecs) \
+	do { \
+		unsigned int		blocklen; \
+ \
+		blocklen = getpagesize() - XFS_BTREE_LBLOCK_CRC_LEN; \
+ \
+		minrecs[0] = rcbagbt_block_maxrecs(blocklen, true) / 2; \
+		minrecs[1] = rcbagbt_block_maxrecs(blocklen, false) / 2; \
+	} while (0)
+
+/* Compute the max possible height for refcount bag btrees. */
+unsigned int
+rcbagbt_maxlevels_possible(void)
+{
+	unsigned int		minrecs[2];
+
+	RCBAGBT_INIT_MINRECS(minrecs);
+	return libxfs_btree_space_to_height(minrecs, ULLONG_MAX);
+}
+
+/* Calculate the refcount bag btree size for some records. */
+unsigned long long
+rcbagbt_calc_size(
+	unsigned long long	nr_records)
+{
+	unsigned int		minrecs[2];
+
+	RCBAGBT_INIT_MINRECS(minrecs);
+	return libxfs_btree_calc_size(minrecs, nr_records);
+}
+
+int __init
+rcbagbt_init_cur_cache(void)
+{
+	rcbagbt_cur_cache = kmem_cache_create("rcbagbt_cur",
+			xfs_btree_cur_sizeof(rcbagbt_maxlevels_possible()),
+			0, 0, NULL);
+
+	if (!rcbagbt_cur_cache)
+		return ENOMEM;
+	return 0;
+}
+
+void
+rcbagbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(rcbagbt_cur_cache);
+	rcbagbt_cur_cache = NULL;
+}
diff --git a/repair/rcbag_btree.h b/repair/rcbag_btree.h
new file mode 100644
index 00000000000..21329153baf
--- /dev/null
+++ b/repair/rcbag_btree.h
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __RCBAG_BTREE_H__
+#define __RCBAG_BTREE_H__
+
+struct xfs_buf;
+struct xfs_btree_cur;
+struct xfs_mount;
+
+#define RCBAG_MAGIC	0x74826671	/* 'JRBG' */
+
+struct rcbag_key {
+	uint32_t	rbg_startblock;
+	uint32_t	rbg_blockcount;
+	uint64_t	rbg_ino;
+};
+
+struct rcbag_rec {
+	uint32_t	rbg_startblock;
+	uint32_t	rbg_blockcount;
+	uint64_t	rbg_ino;
+	uint64_t	rbg_refcount;
+};
+
+typedef __be64 rcbag_ptr_t;
+
+/* reflinks only exist on crc enabled filesystems */
+#define RCBAG_BLOCK_LEN	XFS_BTREE_LBLOCK_CRC_LEN
+
+/*
+ * Record, key, and pointer address macros for btree blocks.
+ *
+ * (note that some of these may appear unused, but they are used in userspace)
+ */
+#define RCBAG_REC_ADDR(block, index) \
+	((struct rcbag_rec *) \
+		((char *)(block) + RCBAG_BLOCK_LEN + \
+		 (((index) - 1) * sizeof(struct rcbag_rec))))
+
+#define RCBAG_KEY_ADDR(block, index) \
+	((struct rcbag_key *) \
+		((char *)(block) + RCBAG_BLOCK_LEN + \
+		 ((index) - 1) * sizeof(struct rcbag_key)))
+
+#define RCBAG_PTR_ADDR(block, index, maxrecs) \
+	((rcbag_ptr_t *) \
+		((char *)(block) + RCBAG_BLOCK_LEN + \
+		 (maxrecs) * sizeof(struct rcbag_key) + \
+		 ((index) - 1) * sizeof(rcbag_ptr_t)))
+
+unsigned int rcbagbt_maxrecs(struct xfs_mount *mp, unsigned int blocklen,
+		bool leaf);
+
+unsigned long long rcbagbt_calc_size(unsigned long long nr_records);
+
+unsigned int rcbagbt_maxlevels_possible(void);
+
+int __init rcbagbt_init_cur_cache(void);
+void rcbagbt_destroy_cur_cache(void);
+
+struct xfbtree;
+struct xfs_btree_cur *rcbagbt_mem_cursor(struct xfs_mount *mp,
+		struct xfs_trans *tp, struct xfs_buf *head_bp,
+		struct xfbtree *xfbtree);
+int rcbagbt_mem_create(struct xfs_mount *mp, struct xfs_buftarg *target,
+		struct xfbtree **xfbtreep);
+
+#endif /* __RCBAG_BTREE_H__ */



* [PATCH 3/5] xfs_repair: create refcount bag
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 1/5] xfs: " Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 5/5] xfs_repair: remove the old bag implementation Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a bag structure for refcount information that uses the refcount
bag btree defined in the previous patch.
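
For illustration only, not part of the patch: a hedged sketch of the
calling sequence the bag is designed for.  It assumes an
already-initialized struct xfs_mount and a populated reverse mapping,
only builds inside repair/ with the headers added here, and reduces
error handling to do_error() as elsewhere in repair:

#include "libxfs.h"
#include "err_protos.h"
#include "rcbag.h"

/* Hedged sketch of the intended rcbag calling sequence. */
static void
rcbag_usage_sketch(
	struct xfs_mount		*mp,
	const struct xfs_rmap_irec	*rmap,
	uint64_t			nr_rmaps)
{
	struct rcbag		*bag;
	uint32_t		next_bno;
	int			error;

	error = rcbag_init(mp, nr_rmaps, &bag);
	if (error)
		do_error(_("initializing refcount bag\n"));

	/* Track every rmap that covers the block being examined. */
	rcbag_add(bag, rmap);

	/* The bag size is the sharing count of that block. */
	if (rcbag_count(bag) > 1) {
		/* extent is shared; the caller will emit a refcount record */
	}

	/* Find the next block where the sharing count changes... */
	rcbag_next_edge(bag, rmap, true, &next_bno);

	/* ...and drop the rmaps that end there. */
	rcbag_remove_ending_at(bag, next_bno);

	rcbag_free(&bag);
}

The loop that actually drives this sequence lands in the next patch,
which ports compute_refcounts() to the bag.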

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    5 +
 repair/Makefile          |    2 
 repair/rcbag.c           |  404 ++++++++++++++++++++++++++++++++++++++++++++++
 repair/rcbag.h           |   33 ++++
 repair/rcbag_btree.c     |   59 +++++++
 repair/rcbag_btree.h     |    7 +
 6 files changed, 510 insertions(+)
 create mode 100644 repair/rcbag.c
 create mode 100644 repair/rcbag.h


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index df614182b7b..aa71914af97 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -58,14 +58,19 @@
 #define xfs_btree_calc_size		libxfs_btree_calc_size
 #define xfs_btree_decrement		libxfs_btree_decrement
 #define xfs_btree_del_cursor		libxfs_btree_del_cursor
+#define xfs_btree_delete		libxfs_btree_delete
 #define xfs_btree_get_block		libxfs_btree_get_block
+#define xfs_btree_get_rec		libxfs_btree_get_rec
 #define xfs_btree_goto_left_edge	libxfs_btree_goto_left_edge
 #define xfs_btree_has_more_records	libxfs_btree_has_more_records
 #define xfs_btree_increment		libxfs_btree_increment
 #define xfs_btree_init_block		libxfs_btree_init_block
+#define xfs_btree_insert		libxfs_btree_insert
+#define xfs_btree_lookup		libxfs_btree_lookup
 #define xfs_btree_mem_head_nlevels	libxfs_btree_mem_head_nlevels
 #define xfs_btree_mem_head_read_buf	libxfs_btree_mem_head_read_buf
 #define xfs_btree_rec_addr		libxfs_btree_rec_addr
+#define xfs_btree_update		libxfs_btree_update
 #define xfs_btree_space_to_height	libxfs_btree_space_to_height
 #define xfs_btree_visit_blocks		libxfs_btree_visit_blocks
 #define xfs_buf_delwri_submit		libxfs_buf_delwri_submit
diff --git a/repair/Makefile b/repair/Makefile
index 5ea8d9618e7..250c86cca2d 100644
--- a/repair/Makefile
+++ b/repair/Makefile
@@ -29,6 +29,7 @@ HFILES = \
 	protos.h \
 	quotacheck.h \
 	rcbag_btree.h \
+	rcbag.h \
 	rmap.h \
 	rt.h \
 	scan.h \
@@ -66,6 +67,7 @@ CFILES = \
 	progress.c \
 	quotacheck.c \
 	rcbag_btree.c \
+	rcbag.c \
 	rmap.c \
 	rt.c \
 	sb.c \
diff --git a/repair/rcbag.c b/repair/rcbag.c
new file mode 100644
index 00000000000..04958cba460
--- /dev/null
+++ b/repair/rcbag.c
@@ -0,0 +1,404 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs.h"
+#include "btree.h"
+#include "err_protos.h"
+#include "libxlog.h"
+#include "incore.h"
+#include "globals.h"
+#include "dinode.h"
+#include "slab.h"
+#include "libfrog/bitmap.h"
+#include "libxfs/xfile.h"
+#include "libxfs/xfbtree.h"
+#include "libxfs/xfs_btree_mem.h"
+#include "rcbag_btree.h"
+#include "rcbag.h"
+
+struct rcbag {
+	struct xfs_mount	*mp;
+	struct xfbtree		*xfbtree;
+	uint64_t		nr_items;
+};
+
+int
+rcbag_init(
+	struct xfs_mount	*mp,
+	uint64_t		max_rmaps,
+	struct rcbag		**bagp)
+{
+	struct xfile		*xfile;
+	struct xfs_buftarg	*target;
+	struct rcbag		*bag;
+	unsigned long long	maxbytes;
+	int			error;
+
+	bag = malloc(sizeof(struct rcbag));
+	if (!bag)
+		return ENOMEM;
+
+	bag->nr_items = 0;
+	bag->mp = mp;
+
+	/* Need to save space for the head block */
+	maxbytes = (1 + rcbagbt_calc_size(max_rmaps)) * getpagesize();
+	error = -xfile_create(mp, maxbytes, "refcount bag", &xfile);
+	if (error)
+		goto out_bag;
+
+	error = -libxfs_alloc_memory_buftarg(mp, xfile, &target);
+	if (error)
+		goto out_xfile;
+
+	error = rcbagbt_mem_create(mp, target, &bag->xfbtree);
+	if (error)
+		goto out_buftarg;
+
+	*bagp = bag;
+	return 0;
+
+out_buftarg:
+	libxfs_buftarg_free(target);
+out_xfile:
+	xfile_destroy(xfile);
+out_bag:
+	free(bag);
+	return error;
+}
+
+void
+rcbag_free(
+	struct rcbag		**bagp)
+{
+	struct rcbag		*bag = *bagp;
+	struct xfile		*xfile;
+	struct xfs_buftarg	*target;
+
+	target = bag->xfbtree->target;
+	xfile = target->bt_xfile;
+
+	xfbtree_destroy(bag->xfbtree);
+	libxfs_buftarg_free(target);
+	xfile_destroy(xfile);
+
+	free(bag);
+	*bagp = NULL;
+}
+
+/* Track an rmap in the refcount bag. */
+void
+rcbag_add(
+	struct rcbag			*bag,
+	const struct xfs_rmap_irec	*rmap)
+{
+	struct rcbag_rec		bagrec;
+	struct xfs_mount		*mp = bag->mp;
+	struct xfs_trans		*tp;
+	struct xfs_buf			*head_bp;
+	struct xfs_btree_cur		*cur;
+	int				has;
+	int				error;
+
+	error = -libxfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		do_error(_("allocating tx for refcount bag update\n"));
+
+	error = -xfbtree_head_read_buf(bag->xfbtree, tp, &head_bp);
+	if (error)
+		do_error(_("reading refcount bag header\n"));
+
+	cur = rcbagbt_mem_cursor(mp, tp, head_bp, bag->xfbtree);
+	error = rcbagbt_lookup_eq(cur, rmap, &has);
+	if (error)
+		do_error(_("looking up refcount bag records\n"));
+
+	if (has) {
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error || !has)
+			do_error(_("reading refcount bag records\n"));
+
+		bagrec.rbg_refcount++;
+		error = rcbagbt_update(cur, &bagrec);
+		if (error)
+			do_error(_("updating refcount bag record\n"));
+	} else {
+		bagrec.rbg_startblock = rmap->rm_startblock;
+		bagrec.rbg_blockcount = rmap->rm_blockcount;
+		bagrec.rbg_ino = rmap->rm_owner;
+		bagrec.rbg_refcount = 1;
+
+		error = rcbagbt_insert(cur, &bagrec, &has);
+		if (error || !has)
+			do_error(_("adding refcount bag record, err %d\n"),
+					error);
+	}
+
+	libxfs_btree_del_cursor(cur, error);
+	libxfs_trans_brelse(tp, head_bp);
+
+	error = -xfbtree_trans_commit(bag->xfbtree, tp);
+	if (error)
+		do_error(_("committing refcount bag record\n"));
+
+	libxfs_trans_cancel(tp);
+	bag->nr_items++;
+}
+
+uint64_t
+rcbag_count(
+	const struct rcbag	*rcbag)
+{
+	return rcbag->nr_items;
+}
+
+#define BAGREC_NEXT(r)	((r)->rbg_startblock + (r)->rbg_blockcount)
+
+/*
+ * Find the next block where the refcount changes, given the next rmap we
+ * looked at and the ones we're already tracking.
+ */
+void
+rcbag_next_edge(
+	struct rcbag			*bag,
+	const struct xfs_rmap_irec	*next_rmap,
+	bool				next_valid,
+	uint32_t			*next_bnop)
+{
+	struct rcbag_rec		bagrec;
+	struct xfs_mount		*mp = bag->mp;
+	struct xfs_buf			*head_bp;
+	struct xfs_btree_cur		*cur;
+	uint32_t			next_bno = NULLAGBLOCK;
+	int				has;
+	int				error;
+
+	if (next_valid)
+		next_bno = next_rmap->rm_startblock;
+
+	error = -xfbtree_head_read_buf(bag->xfbtree, NULL, &head_bp);
+	if (error)
+		do_error(_("reading refcount bag header\n"));
+
+	cur = rcbagbt_mem_cursor(mp, NULL, head_bp, bag->xfbtree);
+	error = -libxfs_btree_goto_left_edge(cur);
+	if (error)
+		do_error(_("seeking refcount bag btree cursor\n"));
+
+	while (true) {
+		error = -libxfs_btree_increment(cur, 0, &has);
+		if (error)
+			do_error(_("incrementing refcount bag btree cursor\n"));
+		if (!has)
+			break;
+
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error)
+			do_error(_("reading refcount bag btree record\n"));
+		if (!has)
+			do_error(_("refcount bag btree record disappeared?\n"));
+
+		next_bno = min(next_bno, BAGREC_NEXT(&bagrec));
+	}
+
+	/*
+	 * We should have found /something/ because either next_rmap is the next
+	 * interesting rmap to look at after emitting this refcount extent, or
+	 * there are other rmaps in the bag contributing to the current sharing
+	 * count.  But if something is seriously wrong, bail out.
+	 */
+	if (next_bno == NULLAGBLOCK)
+		do_error(_("next refcount bag edge not found?\n"));
+
+	*next_bnop = next_bno;
+
+	libxfs_btree_del_cursor(cur, error);
+	libxfs_trans_brelse(NULL, head_bp);
+}
+
+/* Pop all refcount bag records that end at next_bno */
+void
+rcbag_remove_ending_at(
+	struct rcbag		*bag,
+	uint32_t		next_bno)
+{
+	struct rcbag_rec	bagrec;
+	struct xfs_mount	*mp = bag->mp;
+	struct xfs_trans	*tp;
+	struct xfs_buf		*head_bp;
+	struct xfs_btree_cur	*cur;
+	int			has;
+	int			error;
+
+	error = -libxfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		do_error(_("allocating tx for refcount bag update\n"));
+
+	error = -xfbtree_head_read_buf(bag->xfbtree, tp, &head_bp);
+	if (error)
+		do_error(_("reading refcount bag header\n"));
+
+	/* go to the right edge of the tree */
+	cur = rcbagbt_mem_cursor(mp, tp, head_bp, bag->xfbtree);
+	memset(&cur->bc_rec, 0xFF, sizeof(cur->bc_rec));
+	error = -libxfs_btree_lookup(cur, XFS_LOOKUP_GE, &has);
+	if (error)
+		do_error(_("seeking refcount bag btree cursor\n"));
+
+	while (true) {
+		error = -libxfs_btree_decrement(cur, 0, &has);
+		if (error)
+			do_error(_("decrementing refcount bag btree cursor\n"));
+		if (!has)
+			break;
+
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error)
+			do_error(_("reading refcount bag btree record\n"));
+		if (!has)
+			do_error(_("refcount bag btree record disappeared?\n"));
+
+		if (BAGREC_NEXT(&bagrec) != next_bno)
+			continue;
+
+		error = -libxfs_btree_delete(cur, &has);
+		if (error)
+			do_error(_("deleting refcount bag btree record, err %d\n"),
+					error);
+		if (!has)
+			do_error(_("couldn't delete refcount bag record?\n"));
+
+		bag->nr_items -= bagrec.rbg_refcount;
+	}
+
+	libxfs_btree_del_cursor(cur, error);
+	libxfs_trans_brelse(tp, head_bp);
+
+	error = -xfbtree_trans_commit(bag->xfbtree, tp);
+	if (error)
+		do_error(_("committing refcount bag deletions\n"));
+
+	libxfs_trans_cancel(tp);
+}
+
+/* Prepare to iterate the shared inodes tracked by the refcount bag. */
+void
+rcbag_ino_iter_start(
+	struct rcbag		*bag,
+	struct rcbag_iter	*iter)
+{
+	struct xfs_mount	*mp = bag->mp;
+	int			error;
+
+	memset(iter, 0, sizeof(struct rcbag_iter));
+
+	if (bag->nr_items < 2)
+		return;
+
+	error = -xfbtree_head_read_buf(bag->xfbtree, NULL, &iter->head_bp);
+	if (error)
+		do_error(_("reading refcount bag header\n"));
+
+	iter->cur = rcbagbt_mem_cursor(mp, NULL, iter->head_bp, bag->xfbtree);
+	error = -libxfs_btree_goto_left_edge(iter->cur);
+	if (error)
+		do_error(_("seeking refcount bag btree cursor\n"));
+}
+
+/* Tear down an iteration. */
+void
+rcbag_ino_iter_stop(
+	struct rcbag		*bag,
+	struct rcbag_iter	*iter)
+{
+	if (iter->cur)
+		libxfs_btree_del_cursor(iter->cur, XFS_BTREE_NOERROR);
+	if (iter->head_bp)
+		libxfs_trans_brelse(NULL, iter->head_bp);
+	iter->cur = NULL;
+	iter->head_bp = NULL;
+}
+
+/*
+ * Walk all the shared inodes tracked by the refcount bag.  Returns 1 when
+ * returning a valid iter.ino, and 0 if iteration has completed.  The iter
+ * should be initialized to zeroes before the first call.
+ */
+int
+rcbag_ino_iter(
+	struct rcbag		*bag,
+	struct rcbag_iter	*iter)
+{
+	struct rcbag_rec	bagrec;
+	int			has;
+	int			error;
+
+	if (bag->nr_items < 2)
+		return 0;
+
+	do {
+		error = -libxfs_btree_increment(iter->cur, 0, &has);
+		if (error)
+			do_error(_("incrementing refcount bag btree cursor\n"));
+		if (!has)
+			return 0;
+
+		error = rcbagbt_get_rec(iter->cur, &bagrec, &has);
+		if (error)
+			do_error(_("reading refcount bag btree record\n"));
+		if (!has)
+			do_error(_("refcount bag btree record disappeared?\n"));
+	} while (iter->ino == bagrec.rbg_ino);
+
+	iter->ino = bagrec.rbg_ino;
+	return 1;
+}
+
+/* Dump the rcbag. */
+void
+rcbag_dump(
+	struct rcbag			*bag)
+{
+	struct rcbag_rec		bagrec;
+	struct xfs_mount		*mp = bag->mp;
+	struct xfs_buf			*head_bp;
+	struct xfs_btree_cur		*cur;
+	unsigned long long		nr = 0;
+	int				has;
+	int				error;
+
+	error = -xfbtree_head_read_buf(bag->xfbtree, NULL, &head_bp);
+	if (error)
+		do_error(_("reading refcount bag header\n"));
+
+	cur = rcbagbt_mem_cursor(mp, NULL, head_bp, bag->xfbtree);
+	error = -libxfs_btree_goto_left_edge(cur);
+	if (error)
+		do_error(_("seeking refcount bag btree cursor\n"));
+
+	while (true) {
+		error = -libxfs_btree_increment(cur, 0, &has);
+		if (error)
+			do_error(_("incrementing refcount bag btree cursor\n"));
+		if (!has)
+			break;
+
+		error = rcbagbt_get_rec(cur, &bagrec, &has);
+		if (error)
+			do_error(_("reading refcount bag btree record\n"));
+		if (!has)
+			do_error(_("refcount bag btree record disappeared?\n"));
+
+		printf("[%llu]: bno 0x%x fsbcount 0x%x ino 0x%llx refcount 0x%llx\n",
+				nr++,
+				(unsigned int)bagrec.rbg_startblock,
+				(unsigned int)bagrec.rbg_blockcount,
+				(unsigned long long)bagrec.rbg_ino,
+				(unsigned long long)bagrec.rbg_refcount);
+	}
+
+	libxfs_btree_del_cursor(cur, error);
+	libxfs_trans_brelse(NULL, head_bp);
+}
diff --git a/repair/rcbag.h b/repair/rcbag.h
new file mode 100644
index 00000000000..04f9b5b403e
--- /dev/null
+++ b/repair/rcbag.h
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __RCBAG_H__
+#define __RCBAG_H__
+
+struct xfs_mount;
+struct rcbag;
+
+int rcbag_init(struct xfs_mount *mp, uint64_t max_rmaps, struct rcbag **bagp);
+void rcbag_free(struct rcbag **bagp);
+void rcbag_add(struct rcbag *bag, const struct xfs_rmap_irec *rmap);
+uint64_t rcbag_count(const struct rcbag *bag);
+
+void rcbag_next_edge(struct rcbag *bag, const struct xfs_rmap_irec *next_rmap,
+		bool next_valid, uint32_t *next_bnop);
+void rcbag_remove_ending_at(struct rcbag *bag, uint32_t next_bno);
+
+struct rcbag_iter {
+	struct xfs_buf		*head_bp;
+	struct xfs_btree_cur	*cur;
+	uint64_t		ino;
+};
+
+void rcbag_ino_iter_start(struct rcbag *bag, struct rcbag_iter *iter);
+void rcbag_ino_iter_stop(struct rcbag *bag, struct rcbag_iter *iter);
+int rcbag_ino_iter(struct rcbag *bag, struct rcbag_iter *iter);
+
+void rcbag_dump(struct rcbag *bag);
+
+#endif /* __RCBAG_H__ */
diff --git a/repair/rcbag_btree.c b/repair/rcbag_btree.c
index c86189806c8..13676463f7c 100644
--- a/repair/rcbag_btree.c
+++ b/repair/rcbag_btree.c
@@ -333,3 +333,62 @@ rcbagbt_destroy_cur_cache(void)
 	kmem_cache_destroy(rcbagbt_cur_cache);
 	rcbagbt_cur_cache = NULL;
 }
+
+/* Look up the refcount bag record corresponding to this reverse mapping. */
+int
+rcbagbt_lookup_eq(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rmap,
+	int				*success)
+{
+	struct rcbag_rec		*rec = (struct rcbag_rec *)&cur->bc_rec;
+
+	rec->rbg_startblock = rmap->rm_startblock;
+	rec->rbg_blockcount = rmap->rm_blockcount;
+	rec->rbg_ino = rmap->rm_owner;
+
+	return -libxfs_btree_lookup(cur, XFS_LOOKUP_EQ, success);
+}
+
+/* Get the data from the pointed-to record. */
+int
+rcbagbt_get_rec(
+	struct xfs_btree_cur	*cur,
+	struct rcbag_rec	*rec,
+	int			*has)
+{
+	union xfs_btree_rec	*btrec;
+	int			error;
+
+	error = -libxfs_btree_get_rec(cur, &btrec, has);
+	if (error || !(*has))
+		return error;
+
+	memcpy(rec, btrec, sizeof(struct rcbag_rec));
+	return 0;
+}
+
+/* Update the record referred to by cur to the value given. */
+int
+rcbagbt_update(
+	struct xfs_btree_cur	*cur,
+	const struct rcbag_rec	*rec)
+{
+	union xfs_btree_rec	btrec;
+
+	memcpy(&btrec, rec, sizeof(struct rcbag_rec));
+	return -libxfs_btree_update(cur, &btrec);
+}
+
+/* Insert the given record into the btree. */
+int
+rcbagbt_insert(
+	struct xfs_btree_cur	*cur,
+	const struct rcbag_rec	*rec,
+	int			*success)
+{
+	struct rcbag_rec	*btrec = (struct rcbag_rec *)&cur->bc_rec;
+
+	memcpy(btrec, rec, sizeof(struct rcbag_rec));
+	return -libxfs_btree_insert(cur, success);
+}
diff --git a/repair/rcbag_btree.h b/repair/rcbag_btree.h
index 21329153baf..fcbc4171369 100644
--- a/repair/rcbag_btree.h
+++ b/repair/rcbag_btree.h
@@ -68,4 +68,11 @@ struct xfs_btree_cur *rcbagbt_mem_cursor(struct xfs_mount *mp,
 int rcbagbt_mem_create(struct xfs_mount *mp, struct xfs_buftarg *target,
 		struct xfbtree **xfbtreep);
 
+int rcbagbt_lookup_eq(struct xfs_btree_cur *cur,
+		const struct xfs_rmap_irec *rmap, int *success);
+int rcbagbt_get_rec(struct xfs_btree_cur *cur, struct rcbag_rec *rec, int *has);
+int rcbagbt_update(struct xfs_btree_cur *cur, const struct rcbag_rec *rec);
+int rcbagbt_insert(struct xfs_btree_cur *cur, const struct rcbag_rec *rec,
+		int *success);
+
 #endif /* __RCBAG_BTREE_H__ */



* [PATCH 4/5] xfs_repair: port to the new refcount bag structure
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/5] xfs_repair: define an in-memory btree for storing refcount bag info Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Port the refcount record generating code to use the new refcount bag
data structure.
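
For illustration only, not part of the patch: a tiny standalone model of
what the ported loop computes.  It brute-forces the per-block sharing
count for three made-up rmaps instead of jumping between edges with the
bag, and prints the refcount records that the real sweep would emit:

/*
 * Toy model, not xfs_repair code: compute the per-block sharing count
 * for a few made-up rmaps and print the refcount records that result
 * (records are only emitted where the count is >= 2).
 */
#include <stdio.h>
#include <stdint.h>

struct toy_rmap {
	uint32_t	start;
	uint32_t	len;
};

int main(void)
{
	struct toy_rmap	rmaps[] = { { 10, 20 }, { 10, 10 }, { 25, 5 } };
	uint32_t	bno, cbno = 0;
	unsigned int	prev = 0;

	for (bno = 0; bno <= 40; bno++) {
		unsigned int	depth = 0;
		unsigned int	i;

		for (i = 0; i < sizeof(rmaps) / sizeof(rmaps[0]); i++)
			if (bno >= rmaps[i].start &&
			    bno < rmaps[i].start + rmaps[i].len)
				depth++;

		if (depth != prev) {
			if (prev > 1)
				printf("refcount record: startblock %u blockcount %u refcount %u\n",
						cbno, bno - cbno, prev);
			cbno = bno;
			prev = depth;
		}
	}
	return 0;
}

With these inputs it prints records for blocks 10-19 and 25-29 with
refcount 2; blocks 20-24 are mapped only once, so no record is emitted,
which is what the old_stack_height checks below preserve.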

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/rmap.c       |  152 +++++++++++++++------------------------------------
 repair/xfs_repair.c |    6 ++
 2 files changed, 52 insertions(+), 106 deletions(-)


diff --git a/repair/rmap.c b/repair/rmap.c
index e598dd9c9b8..ef1a599162b 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -15,6 +15,7 @@
 #include "libfrog/bitmap.h"
 #include "libxfs/xfile.h"
 #include "libxfs/xfbtree.h"
+#include "rcbag.h"
 
 #undef RMAP_DEBUG
 
@@ -762,35 +763,32 @@ rmap_dump(
  * reflink inode flag, if the stack depth is greater than 1.
  */
 static void
-mark_inode_rl(
+mark_reflink_inodes(
 	struct xfs_mount	*mp,
-	struct xfs_bag		*rmaps)
+	struct rcbag		*rcstack)
 {
-	struct rmap_for_refcount *rfr;
+	struct rcbag_iter	rciter;
 	struct ino_tree_node	*irec;
-	int			off;
-	uint64_t		idx;
 
-	if (bag_count(rmaps) < 2)
-		return;
-
-	/* Reflink flag accounting */
-	foreach_bag_ptr(rmaps, idx, rfr) {
+	rcbag_ino_iter_start(rcstack, &rciter);
+	while (rcbag_ino_iter(rcstack, &rciter) == 1) {
 		xfs_agnumber_t	agno;
 		xfs_agino_t	agino;
+		int		off;
 
-		ASSERT(!XFS_RMAP_NON_INODE_OWNER(rfr->rm_owner));
+		ASSERT(!XFS_RMAP_NON_INODE_OWNER(rciter.ino));
 
-		agno = XFS_INO_TO_AGNO(mp, rfr->rm_owner);
-		agino = XFS_INO_TO_AGINO(mp, rfr->rm_owner);
+		agno = XFS_INO_TO_AGNO(mp, rciter.ino);
+		agino = XFS_INO_TO_AGINO(mp, rciter.ino);
 
 		pthread_mutex_lock(&ag_locks[agno].lock);
 		irec = find_inode_rec(mp, agno, agino);
-		off = get_inode_offset(mp, rfr->rm_owner, irec);
+		off = get_inode_offset(mp, rciter.ino, irec);
 		/* lock here because we might go outside this ag */
 		set_inode_is_rl(irec, off);
 		pthread_mutex_unlock(&ag_locks[agno].lock);
 	}
+	rcbag_ino_iter_stop(rcstack, &rciter);
 }
 
 /*
@@ -826,8 +824,6 @@ refcount_emit(
 _("Insufficient memory while recreating refcount tree."));
 }
 
-#define RMAP_NEXT(r)	((r)->rm_startblock + (r)->rm_blockcount)
-
 /* Decide if an rmap could describe a shared extent. */
 static inline bool
 rmap_shareable(
@@ -887,40 +883,6 @@ refcount_walk_rmaps(
 	return 0;
 }
 
-/*
- * Find the next block where the refcount changes, given the next rmap we
- * looked at and the ones we're already tracking.
- */
-static inline int
-next_refcount_edge(
-	struct xfs_bag		*stack_top,
-	struct xfs_rmap_irec	*next_rmap,
-	bool			next_valid,
-	xfs_agblock_t		*nbnop)
-{
-	struct rmap_for_refcount *rfr;
-	uint64_t		idx;
-	xfs_agblock_t		nbno = NULLAGBLOCK;
-
-	if (next_valid)
-		nbno = next_rmap->rm_startblock;
-
-	foreach_bag_ptr(stack_top, idx, rfr)
-		nbno = min(nbno, RMAP_NEXT(rfr));
-
-	/*
-	 * We should have found /something/ because either next_rrm is the next
-	 * interesting rmap to look at after emitting this refcount extent, or
-	 * there are other rmaps in rmap_bag contributing to the current
-	 * sharing count.  But if something is seriously wrong, bail out.
-	 */
-	if (nbno == NULLAGBLOCK)
-		return EFSCORRUPTED;
-
-	*nbnop = nbno;
-	return 0;
-}
-
 /*
  * Walk forward through the rmap btree to collect all rmaps starting at
  * @bno in @rmap_bag.  These represent the file(s) that share ownership of
@@ -930,28 +892,19 @@ next_refcount_edge(
 static int
 refcount_push_rmaps_at(
 	struct rmap_mem_cur	*rmcur,
-	xfs_agnumber_t		agno,
-	struct xfs_bag		*stack_top,
+	struct rcbag		*stack,
 	xfs_agblock_t		bno,
-	struct xfs_rmap_irec	*irec,
+	struct xfs_rmap_irec	*rmap,
 	bool			*have,
 	const char		*tag)
 {
 	int			have_gt;
 	int			error;
 
-	while (*have && irec->rm_startblock == bno) {
-		struct rmap_for_refcount	rfr = {
-			.rm_startblock		= irec->rm_startblock,
-			.rm_blockcount		= irec->rm_blockcount,
-			.rm_owner		= irec->rm_owner,
-		};
+	while (*have && rmap->rm_startblock == bno) {
+		rcbag_add(stack, rmap);
 
-		rmap_dump(tag, agno, &rfr);
-		error = bag_add(stack_top, &rfr);
-		if (error)
-			return error;
-		error = refcount_walk_rmaps(rmcur->mcur, irec, have);
+		error = refcount_walk_rmaps(rmcur->mcur, rmap, have);
 		if (error)
 			return error;
 	}
@@ -971,15 +924,14 @@ refcount_push_rmaps_at(
  */
 int
 compute_refcounts(
-	struct xfs_mount		*mp,
+	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno)
 {
+	struct rcbag		*rcstack;
 	struct rmap_mem_cur	rmcur;
-	struct xfs_rmap_irec	irec;
-	struct xfs_bag		*stack_top = NULL;
-	struct rmap_for_refcount *rfr;
-	uint64_t		idx;
-	uint64_t		old_stack_nr;
+	struct xfs_rmap_irec	rmap;
+	uint64_t		nr_rmaps;
+	uint64_t		old_stack_height;
 	xfs_agblock_t		sbno;	/* first bno of this rmap set */
 	xfs_agblock_t		cbno;	/* first bno of this refcount set */
 	xfs_agblock_t		nbno;	/* next bno where rmap set changes */
@@ -991,11 +943,13 @@ compute_refcounts(
 	if (ag_rmaps[agno].ar_xfbtree == NULL)
 		return 0;
 
+	nr_rmaps = rmap_record_count(mp, agno);
+
 	error = rmap_init_mem_cursor(mp, NULL, agno, &rmcur);
 	if (error)
 		return error;
 
-	error = init_bag(&stack_top, sizeof(struct rmap_for_refcount));
+	error = rcbag_init(mp, nr_rmaps, &rcstack);
 	if (error)
 		goto out_cur;
 
@@ -1008,86 +962,72 @@ compute_refcounts(
 	/* Process reverse mappings into refcount data. */
 	while (libxfs_btree_has_more_records(rmcur.mcur)) {
 		/* Push all rmaps with pblk == sbno onto the stack */
-		error = refcount_walk_rmaps(rmcur.mcur, &irec, &have);
+		error = refcount_walk_rmaps(rmcur.mcur, &rmap, &have);
 		if (error)
 			goto out_bag;
 		if (!have)
 			break;
-		sbno = cbno = irec.rm_startblock;
-		error = refcount_push_rmaps_at(&rmcur, agno, stack_top, sbno,
-				&irec, &have, "push0");
+		sbno = cbno = rmap.rm_startblock;
+		error = refcount_push_rmaps_at(&rmcur, rcstack, sbno, &rmap,
+				&have, "push0");
 		if (error)
 			goto out_bag;
-		mark_inode_rl(mp, stack_top);
+		mark_reflink_inodes(mp, rcstack);
 
 		/* Set nbno to the bno of the next refcount change */
-		error = next_refcount_edge(stack_top, &irec, have, &nbno);
-		if (error)
-			goto out_bag;
+		rcbag_next_edge(rcstack, &rmap, have, &nbno);
 
 		/* Emit reverse mappings, if needed */
 		ASSERT(nbno > sbno);
-		old_stack_nr = bag_count(stack_top);
+		old_stack_height = rcbag_count(rcstack);
 
 		/* While stack isn't empty... */
-		while (bag_count(stack_top)) {
+		while (rcbag_count(rcstack) > 0) {
 			/* Pop all rmaps that end at nbno */
-			foreach_bag_ptr_reverse(stack_top, idx, rfr) {
-				if (RMAP_NEXT(rfr) != nbno)
-					continue;
-				rmap_dump("pop", agno, rfr);
-				error = bag_remove(stack_top, idx);
-				if (error)
-					goto out_bag;
-			}
+			rcbag_remove_ending_at(rcstack, nbno);
 
 			/* Push array items that start at nbno */
-			error = refcount_walk_rmaps(rmcur.mcur, &irec, &have);
+			error = refcount_walk_rmaps(rmcur.mcur, &rmap, &have);
 			if (error)
 				goto out_bag;
 			if (have) {
-				error = refcount_push_rmaps_at(&rmcur, agno,
-						stack_top, nbno, &irec, &have,
-						"push1");
+				error = refcount_push_rmaps_at(&rmcur, rcstack,
+						nbno, &rmap, &have, "push1");
 				if (error)
 					goto out_bag;
 			}
-			mark_inode_rl(mp, stack_top);
+			mark_reflink_inodes(mp, rcstack);
 
 			/* Emit refcount if necessary */
 			ASSERT(nbno > cbno);
-			if (bag_count(stack_top) != old_stack_nr) {
-				if (old_stack_nr > 1) {
+			if (rcbag_count(rcstack) != old_stack_height) {
+				if (old_stack_height > 1) {
 					refcount_emit(mp, agno, cbno,
-						      nbno - cbno,
-						      old_stack_nr);
+							nbno - cbno,
+							old_stack_height);
 				}
 				cbno = nbno;
 			}
 
 			/* Stack empty, go find the next rmap */
-			if (bag_count(stack_top) == 0)
+			if (rcbag_count(rcstack) == 0)
 				break;
-			old_stack_nr = bag_count(stack_top);
+			old_stack_height = rcbag_count(rcstack);
 			sbno = nbno;
 
 			/* Set nbno to the bno of the next refcount change */
-			error = next_refcount_edge(stack_top, &irec, have,
-					&nbno);
-			if (error)
-				goto out_bag;
+			rcbag_next_edge(rcstack, &rmap, have, &nbno);
 
 			/* Emit reverse mappings, if needed */
 			ASSERT(nbno > sbno);
 		}
 	}
 out_bag:
-	free_bag(&stack_top);
+	rcbag_free(&rcstack);
 out_cur:
 	rmap_free_mem_cursor(NULL, &rmcur, error);
 	return error;
 }
-#undef RMAP_NEXT
 
 static int
 count_btree_records(
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 251a46d11fe..8e62533ac53 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -26,6 +26,7 @@
 #include "libfrog/platform.h"
 #include "bulkload.h"
 #include "quotacheck.h"
+#include "rcbag_btree.h"
 
 /*
  * option tables for getsubopt calls
@@ -1247,6 +1248,10 @@ main(int argc, char **argv)
 	phase3(mp, phase2_threads);
 	phase_end(mp, 3);
 
+	error = rcbagbt_init_cur_cache();
+	if (error)
+		do_error(_("could not allocate btree cursor memory\n"));
+
 	phase4(mp);
 	phase_end(mp, 4);
 
@@ -1259,6 +1264,7 @@ main(int argc, char **argv)
 		phase5(mp);
 	}
 	phase_end(mp, 5);
+	rcbagbt_destroy_cur_cache();
 
 	/*
 	 * Done with the block usage maps, toss them...



* [PATCH 5/5] xfs_repair: remove the old bag implementation
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 3/5] xfs_repair: create refcount bag Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove the old bag implementation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/rmap.c |    7 ---
 repair/slab.c |  130 ---------------------------------------------------------
 repair/slab.h |   19 --------
 3 files changed, 156 deletions(-)


diff --git a/repair/rmap.c b/repair/rmap.c
index ef1a599162b..f8294cc3e13 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -34,13 +34,6 @@ struct xfs_ag_rmap {
 	struct xfs_slab	*ar_refcount_items;	/* refcount items, p4-5 */
 };
 
-/* Only the parts of struct xfs_rmap_irec that we need to compute refcounts. */
-struct rmap_for_refcount {
-	xfs_agblock_t	rm_startblock;
-	xfs_extlen_t	rm_blockcount;
-	uint64_t	rm_owner;
-};
-
 static struct xfs_ag_rmap *ag_rmaps;
 bool rmapbt_suspect;
 static bool refcbt_suspect;
diff --git a/repair/slab.c b/repair/slab.c
index 44ca0468eda..a0114ac2373 100644
--- a/repair/slab.c
+++ b/repair/slab.c
@@ -77,28 +77,6 @@ struct xfs_slab_cursor {
 	struct xfs_slab_hdr_cursor	hcur[0];	/* per-slab cursors */
 };
 
-/*
- * Bags -- each bag is an array of record items; when a bag fills up, we resize
- * it and hope we don't run out of memory.
- */
-#define MIN_BAG_SIZE	4096
-struct xfs_bag {
-	uint64_t		bg_nr;		/* number of pointers */
-	uint64_t		bg_inuse;	/* number of slots in use */
-	char			*bg_items;	/* pointer to block of items */
-	size_t			bg_item_sz;	/* size of each item */
-};
-
-static inline void *bag_ptr(struct xfs_bag *bag, uint64_t idx)
-{
-	return &bag->bg_items[bag->bg_item_sz * idx];
-}
-
-static inline void *bag_end(struct xfs_bag *bag)
-{
-	return bag_ptr(bag, bag->bg_nr);
-}
-
 /*
  * Create a slab to hold some objects of a particular size.
  */
@@ -386,111 +364,3 @@ slab_count(
 {
 	return slab->s_nr_items;
 }
-
-/*
- * Create a bag to point to some objects.
- */
-int
-init_bag(
-	struct xfs_bag	**bag,
-	size_t		item_sz)
-{
-	struct xfs_bag	*ptr;
-
-	ptr = calloc(1, sizeof(struct xfs_bag));
-	if (!ptr)
-		return -ENOMEM;
-	ptr->bg_item_sz = item_sz;
-	ptr->bg_items = calloc(MIN_BAG_SIZE, item_sz);
-	if (!ptr->bg_items) {
-		free(ptr);
-		return -ENOMEM;
-	}
-	ptr->bg_nr = MIN_BAG_SIZE;
-	*bag = ptr;
-	return 0;
-}
-
-/*
- * Free a bag of pointers.
- */
-void
-free_bag(
-	struct xfs_bag	**bag)
-{
-	struct xfs_bag	*ptr;
-
-	ptr = *bag;
-	if (!ptr)
-		return;
-	free(ptr->bg_items);
-	free(ptr);
-	*bag = NULL;
-}
-
-/*
- * Add an object to the pointer bag.
- */
-int
-bag_add(
-	struct xfs_bag	*bag,
-	void		*ptr)
-{
-	void		*p, *x;
-
-	p = bag_ptr(bag, bag->bg_inuse);
-	if (p == bag_end(bag)) {
-		/* No free space, alloc more pointers */
-		uint64_t	nr;
-
-		nr = bag->bg_nr * 2;
-		x = realloc(bag->bg_items, nr * bag->bg_item_sz);
-		if (!x)
-			return -ENOMEM;
-		bag->bg_items = x;
-		memset(bag_end(bag), 0, bag->bg_nr * bag->bg_item_sz);
-		bag->bg_nr = nr;
-		p = bag_ptr(bag, bag->bg_inuse);
-	}
-	memcpy(p, ptr, bag->bg_item_sz);
-	bag->bg_inuse++;
-	return 0;
-}
-
-/*
- * Remove a pointer from a bag.
- */
-int
-bag_remove(
-	struct xfs_bag	*bag,
-	uint64_t	nr)
-{
-	ASSERT(nr < bag->bg_inuse);
-	memmove(bag_ptr(bag, nr), bag_ptr(bag, nr + 1),
-		(bag->bg_inuse - nr - 1) * bag->bg_item_sz);
-	bag->bg_inuse--;
-	return 0;
-}
-
-/*
- * Return the number of items in a bag.
- */
-uint64_t
-bag_count(
-	struct xfs_bag	*bag)
-{
-	return bag->bg_inuse;
-}
-
-/*
- * Return the nth item in a bag.
- */
-void *
-bag_item(
-	struct xfs_bag	*bag,
-	uint64_t	nr)
-{
-	if (nr >= bag->bg_inuse)
-		return NULL;
-	return bag_ptr(bag, nr);
-}
diff --git a/repair/slab.h b/repair/slab.h
index 019b169024d..77fb32163d5 100644
--- a/repair/slab.h
+++ b/repair/slab.h
@@ -26,23 +26,4 @@ void *peek_slab_cursor(struct xfs_slab_cursor *cur);
 void advance_slab_cursor(struct xfs_slab_cursor *cur);
 void *pop_slab_cursor(struct xfs_slab_cursor *cur);
 
-struct xfs_bag;
-
-int init_bag(struct xfs_bag **bagp, size_t itemsz);
-void free_bag(struct xfs_bag **bagp);
-int bag_add(struct xfs_bag *bag, void *item);
-int bag_remove(struct xfs_bag *bag, uint64_t idx);
-uint64_t bag_count(struct xfs_bag *bag);
-void *bag_item(struct xfs_bag *bag, uint64_t idx);
-
-#define foreach_bag_ptr(bag, idx, ptr) \
-	for ((idx) = 0, (ptr) = bag_item((bag), (idx)); \
-	     (idx) < bag_count(bag); \
-	     (idx)++, (ptr) = bag_item((bag), (idx)))
-
-#define foreach_bag_ptr_reverse(bag, idx, ptr) \
-	for ((idx) = bag_count(bag) - 1, (ptr) = bag_item((bag), (idx)); \
-	     (ptr) != NULL; \
-	     (idx)--, (ptr) = bag_item((bag), (idx)))
-
 #endif /* SLAB_H_ */



* [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (45 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/4] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                   ` (30 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This series cleans up a few bits of the symbolic link code as needed for
future projects.  Online repair requires the ability to commit fixed
fork-based filesystem metadata such as directories, xattrs, and symbolic
links atomically, so we need to rearrange the symlink code before we
land the atomic extent swapping.

Accomplish this by moving the remote symlink target block code and
declarations to xfs_symlink_remote.[ch].

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=symlink-cleanups

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=symlink-cleanups
---
 include/libxfs.h            |    1 
 libxfs/libxfs_api_defs.h    |    1 
 libxfs/xfs_bmap.c           |    1 
 libxfs/xfs_inode_fork.c     |    1 
 libxfs/xfs_shared.h         |   14 ----
 libxfs/xfs_symlink_remote.c |  155 +++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_symlink_remote.h |   27 +++++++
 mkfs/proto.c                |   72 +++++++++++---------
 8 files changed, 223 insertions(+), 49 deletions(-)
 create mode 100644 libxfs/xfs_symlink_remote.h



* [PATCH 1/4] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/4] xfs: move remote symlink target read function to libxfs Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move declarations for libxfs symlink functions into a separate header
file like we do for most everything else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/libxfs.h            |    1 +
 libxfs/xfs_bmap.c           |    1 +
 libxfs/xfs_inode_fork.c     |    1 +
 libxfs/xfs_shared.h         |   14 --------------
 libxfs/xfs_symlink_remote.c |    2 +-
 libxfs/xfs_symlink_remote.h |   23 +++++++++++++++++++++++
 6 files changed, 27 insertions(+), 15 deletions(-)
 create mode 100644 libxfs/xfs_symlink_remote.h


diff --git a/include/libxfs.h b/include/libxfs.h
index 887f57b6171..d4b5d8e564d 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -79,6 +79,7 @@ struct iomap;
 #include "xfs_refcount_btree.h"
 #include "xfs_refcount.h"
 #include "xfs_btree_staging.h"
+#include "xfs_symlink_remote.h"
 
 #ifndef ARRAY_SIZE
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
diff --git a/libxfs/xfs_bmap.c b/libxfs/xfs_bmap.c
index edd3e81fd04..983eb7641fc 100644
--- a/libxfs/xfs_bmap.c
+++ b/libxfs/xfs_bmap.c
@@ -30,6 +30,7 @@
 #include "xfs_ag_resv.h"
 #include "xfs_refcount.h"
 #include "xfs_health.h"
+#include "xfs_symlink_remote.h"
 
 struct kmem_cache		*xfs_bmap_intent_cache;
 
diff --git a/libxfs/xfs_inode_fork.c b/libxfs/xfs_inode_fork.c
index 9b0c786fab6..9d76a6b5c65 100644
--- a/libxfs/xfs_inode_fork.c
+++ b/libxfs/xfs_inode_fork.c
@@ -24,6 +24,7 @@
 #include "xfs_types.h"
 #include "xfs_errortag.h"
 #include "xfs_health.h"
+#include "xfs_symlink_remote.h"
 
 struct kmem_cache *xfs_ifork_cache;
 
diff --git a/libxfs/xfs_shared.h b/libxfs/xfs_shared.h
index eaabfa52eda..5127fa88531 100644
--- a/libxfs/xfs_shared.h
+++ b/libxfs/xfs_shared.h
@@ -138,20 +138,6 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define	XFS_ICHGTIME_CHG	0x2	/* inode field change timestamp */
 #define	XFS_ICHGTIME_CREATE	0x4	/* inode create timestamp */
 
-
-/*
- * Symlink decoding/encoding functions
- */
-int xfs_symlink_blocks(struct xfs_mount *mp, int pathlen);
-int xfs_symlink_hdr_set(struct xfs_mount *mp, xfs_ino_t ino, uint32_t offset,
-			uint32_t size, struct xfs_buf *bp);
-bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset,
-			uint32_t size, struct xfs_buf *bp);
-void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
-				 struct xfs_inode *ip, struct xfs_ifork *ifp);
-xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
-xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
-
 /* Computed inode geometry for the filesystem. */
 struct xfs_ino_geometry {
 	/* Maximum inode count in this filesystem. */
diff --git a/libxfs/xfs_symlink_remote.c b/libxfs/xfs_symlink_remote.c
index b9d446fba9a..e036a8f46fe 100644
--- a/libxfs/xfs_symlink_remote.c
+++ b/libxfs/xfs_symlink_remote.c
@@ -13,7 +13,7 @@
 #include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
-
+#include "xfs_symlink_remote.h"
 
 /*
  * Each contiguous block has a header, so it is not just a simple pathlen
diff --git a/libxfs/xfs_symlink_remote.h b/libxfs/xfs_symlink_remote.h
new file mode 100644
index 00000000000..a58d536c8b8
--- /dev/null
+++ b/libxfs/xfs_symlink_remote.h
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc.
+ * Copyright (c) 2013 Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#ifndef __XFS_SYMLINK_REMOTE_H
+#define __XFS_SYMLINK_REMOTE_H
+
+/*
+ * Symlink decoding/encoding functions
+ */
+int xfs_symlink_blocks(struct xfs_mount *mp, int pathlen);
+int xfs_symlink_hdr_set(struct xfs_mount *mp, xfs_ino_t ino, uint32_t offset,
+			uint32_t size, struct xfs_buf *bp);
+bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset,
+			uint32_t size, struct xfs_buf *bp);
+void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
+				 struct xfs_inode *ip, struct xfs_ifork *ifp);
+xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
+xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
+
+#endif /* __XFS_SYMLINK_REMOTE_H */



* [PATCH 2/4] xfs: move remote symlink target read function to libxfs
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/4] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 3/4] xfs: move symlink target write " Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/4] mkfs: use libxfs to create symlinks Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move xfs_readlink_bmap_ilocked to xfs_symlink_remote.c so that the
swapext code can use it to convert a remote format symlink back to
shortform format after a metadata repair.  While we're at it, fix a
broken printf modifier.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_symlink_remote.c |   77 +++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_symlink_remote.h |    1 +
 2 files changed, 78 insertions(+)


diff --git a/libxfs/xfs_symlink_remote.c b/libxfs/xfs_symlink_remote.c
index e036a8f46fe..db4955adc08 100644
--- a/libxfs/xfs_symlink_remote.c
+++ b/libxfs/xfs_symlink_remote.c
@@ -14,6 +14,9 @@
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_symlink_remote.h"
+#include "xfs_bit.h"
+#include "xfs_bmap.h"
+#include "xfs_health.h"
 
 /*
  * Each contiguous block has a header, so it is not just a simple pathlen
@@ -235,3 +238,77 @@ xfs_symlink_shortform_verify(
 
 	return xfs_symlink_sf_verify_struct(ifp->if_u1.if_data, ifp->if_bytes);
 }
+
+/* Read a remote symlink target into the buffer. */
+int
+xfs_symlink_remote_read(
+	struct xfs_inode	*ip,
+	char			*link)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_buf		*bp;
+	xfs_daddr_t		d;
+	char			*cur_chunk;
+	int			pathlen = ip->i_disk_size;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			byte_cnt;
+	int			n;
+	int			error = 0;
+	int			fsblocks = 0;
+	int			offset;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
+
+	fsblocks = xfs_symlink_blocks(mp, pathlen);
+	error = xfs_bmapi_read(ip, 0, fsblocks, mval, &nmaps, 0);
+	if (error)
+		goto out;
+
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
+
+		error = xfs_buf_read(mp->m_ddev_targp, d, BTOBB(byte_cnt), 0,
+				&bp, &xfs_symlink_buf_ops);
+		if (xfs_metadata_is_sick(error))
+			xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+		if (error)
+			return error;
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
+		if (pathlen < byte_cnt)
+			byte_cnt = pathlen;
+
+		cur_chunk = bp->b_addr;
+		if (xfs_has_crc(mp)) {
+			if (!xfs_symlink_hdr_ok(ip->i_ino, offset,
+							byte_cnt, bp)) {
+				xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+				error = -EFSCORRUPTED;
+				xfs_alert(mp,
+"symlink header does not match required off/len/owner (0x%x/0x%x,0x%llx)",
+					offset, byte_cnt, ip->i_ino);
+				xfs_buf_relse(bp);
+				goto out;
+
+			}
+
+			cur_chunk += sizeof(struct xfs_dsymlink_hdr);
+		}
+
+		memcpy(link + offset, cur_chunk, byte_cnt);
+
+		pathlen -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_buf_relse(bp);
+	}
+	ASSERT(pathlen == 0);
+
+	link[ip->i_disk_size] = '\0';
+	error = 0;
+
+ out:
+	return error;
+}
diff --git a/libxfs/xfs_symlink_remote.h b/libxfs/xfs_symlink_remote.h
index a58d536c8b8..7d3acaee0af 100644
--- a/libxfs/xfs_symlink_remote.h
+++ b/libxfs/xfs_symlink_remote.h
@@ -19,5 +19,6 @@ void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
 				 struct xfs_inode *ip, struct xfs_ifork *ifp);
 xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
 xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
+int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */



* [PATCH 3/4] xfs: move symlink target write function to libxfs
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 1/4] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 2/4] xfs: move remote symlink target read function to libxfs Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 4/4] mkfs: use libxfs to create symlinks Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move xfs_symlink_write_target to xfs_symlink_remote.c so that kernel and
mkfs can share the same function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_symlink_remote.c |   76 +++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_symlink_remote.h |    3 ++
 2 files changed, 79 insertions(+)


diff --git a/libxfs/xfs_symlink_remote.c b/libxfs/xfs_symlink_remote.c
index db4955adc08..740db58d99c 100644
--- a/libxfs/xfs_symlink_remote.c
+++ b/libxfs/xfs_symlink_remote.c
@@ -312,3 +312,79 @@ xfs_symlink_remote_read(
  out:
 	return error;
 }
+
+/* Write the symlink target into the inode. */
+int
+xfs_symlink_write_target(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	const char		*target_path,
+	int			pathlen,
+	xfs_fsblock_t		fs_blocks,
+	uint			resblks)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount	*mp = tp->t_mountp;
+	const char		*cur_chunk;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		d;
+	int			byte_cnt;
+	int			nmaps;
+	int			offset = 0;
+	int			n;
+	int			error;
+
+	/*
+	 * If the symlink will fit into the inode, write it inline.
+	 */
+	if (pathlen <= xfs_inode_data_fork_size(ip)) {
+		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
+
+		ip->i_disk_size = pathlen;
+		ip->i_df.if_format = XFS_DINODE_FMT_LOCAL;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+		return 0;
+	}
+
+	nmaps = XFS_SYMLINK_MAPS;
+	error = xfs_bmapi_write(tp, ip, 0, fs_blocks, XFS_BMAPI_METADATA,
+			resblks, mval, &nmaps);
+	if (error)
+		return error;
+
+	ip->i_disk_size = pathlen;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	cur_chunk = target_path;
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		char	*buf;
+
+		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
+		error = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
+				BTOBB(byte_cnt), 0, &bp);
+		if (error)
+			return error;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
+		byte_cnt = min(byte_cnt, pathlen);
+
+		buf = bp->b_addr;
+		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset, byte_cnt,
+				bp);
+
+		memcpy(buf, cur_chunk, byte_cnt);
+
+		cur_chunk += byte_cnt;
+		pathlen -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SYMLINK_BUF);
+		xfs_trans_log_buf(tp, bp, 0, (buf + byte_cnt - 1) -
+						(char *)bp->b_addr);
+	}
+	ASSERT(pathlen == 0);
+	return 0;
+}
diff --git a/libxfs/xfs_symlink_remote.h b/libxfs/xfs_symlink_remote.h
index 7d3acaee0af..d81461c06b6 100644
--- a/libxfs/xfs_symlink_remote.h
+++ b/libxfs/xfs_symlink_remote.h
@@ -20,5 +20,8 @@ void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp,
 xfs_failaddr_t xfs_symlink_sf_verify_struct(void *sfp, int64_t size);
 xfs_failaddr_t xfs_symlink_shortform_verify(struct xfs_inode *ip);
 int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
+int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
+		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
+		uint resblks);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */



* [PATCH 4/4] mkfs: use libxfs to create symlinks
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 3/4] xfs: move symlink target write " Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we've grabbed the kernel-side symlink writing function, use it
to create symbolic links from protofiles.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    1 +
 mkfs/proto.c             |   72 ++++++++++++++++++++++++----------------------
 2 files changed, 39 insertions(+), 34 deletions(-)


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index aa71914af97..81dfb10575d 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -209,6 +209,7 @@
 #define xfs_sb_version_to_features	libxfs_sb_version_to_features
 #define xfs_symlink_blocks		libxfs_symlink_blocks
 #define xfs_symlink_hdr_ok		libxfs_symlink_hdr_ok
+#define xfs_symlink_write_target	libxfs_symlink_write_target
 
 #define xfs_trans_add_item		libxfs_trans_add_item
 #define xfs_trans_alloc_empty		libxfs_trans_alloc_empty
diff --git a/mkfs/proto.c b/mkfs/proto.c
index 68ecdbf3632..bd306f95568 100644
--- a/mkfs/proto.c
+++ b/mkfs/proto.c
@@ -16,8 +16,6 @@ static char *getstr(char **pp);
 static void fail(char *msg, int i);
 static struct xfs_trans * getres(struct xfs_mount *mp, uint blocks);
 static void rsvfile(xfs_mount_t *mp, xfs_inode_t *ip, long long len);
-static int newfile(xfs_trans_t *tp, xfs_inode_t *ip, int symlink, int logit,
-			char *buf, int len);
 static char *newregfile(char **pp, int *len);
 static void rtinit(xfs_mount_t *mp);
 static long filesize(int fd);
@@ -217,31 +215,42 @@ rsvfile(
 		fail(_("committing space for a file failed"), error);
 }
 
-static int
-newfile(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip,
-	int		symlink,
-	int		logit,
-	char		*buf,
-	int		len)
+static void
+writesymlink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	char			*buf,
+	int			len)
 {
-	struct xfs_buf	*bp;
-	xfs_daddr_t	d;
-	int		error;
-	int		flags;
-	xfs_bmbt_irec_t	map;
-	xfs_mount_t	*mp;
-	xfs_extlen_t	nb;
-	int		nmap;
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_extlen_t		nb = XFS_B_TO_FSB(mp, len);
+	int			error;
+
+	error = -libxfs_symlink_write_target(tp, ip, buf, len, nb, nb);
+	if (error) {
+		fprintf(stderr,
+	_("%s: error %d creating symlink to '%s'.\n"), progname, error, buf);
+		exit(1);
+	}
+}
+
+static void
+writefile(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	char			*buf,
+	int			len)
+{
+	struct xfs_bmbt_irec	map;
+	struct xfs_mount	*mp;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		d;
+	xfs_extlen_t		nb;
+	int			nmap;
+	int			error;
 
-	flags = 0;
 	mp = ip->i_mount;
-	if (symlink && len <= xfs_inode_data_fork_size(ip)) {
-		libxfs_init_local_fork(ip, XFS_DATA_FORK, buf, len);
-		ip->i_df.if_format = XFS_DINODE_FMT_LOCAL;
-		flags = XFS_ILOG_DDATA;
-	} else if (len > 0) {
+	if (len > 0) {
 		int	bcount;
 
 		nb = XFS_B_TO_FSB(mp, len);
@@ -263,7 +272,7 @@ newfile(
 			exit(1);
 		}
 		d = XFS_FSB_TO_DADDR(mp, map.br_startblock);
-		error = -libxfs_trans_get_buf(logit ? tp : NULL, mp->m_dev, d,
+		error = -libxfs_trans_get_buf(NULL, mp->m_dev, d,
 				nb << mp->m_blkbb_log, 0, &bp);
 		if (error) {
 			fprintf(stderr,
@@ -275,15 +284,10 @@ newfile(
 		bcount = BBTOB(bp->b_length);
 		if (len < bcount)
 			memset((char *)bp->b_addr + len, 0, bcount - len);
-		if (logit)
-			libxfs_trans_log_buf(tp, bp, 0, bcount - 1);
-		else {
-			libxfs_buf_mark_dirty(bp);
-			libxfs_buf_relse(bp);
-		}
+		libxfs_buf_mark_dirty(bp);
+		libxfs_buf_relse(bp);
 	}
 	ip->i_disk_size = len;
-	return flags;
 }
 
 static char *
@@ -459,7 +463,7 @@ parseproto(
 					   &creds, fsxp, &ip);
 		if (error)
 			fail(_("Inode allocation failed"), error);
-		flags |= newfile(tp, ip, 0, 0, buf, len);
+		writefile(tp, ip, buf, len);
 		if (buf)
 			free(buf);
 		libxfs_trans_ijoin(tp, pip, 0);
@@ -543,7 +547,7 @@ parseproto(
 				&creds, fsxp, &ip);
 		if (error)
 			fail(_("Inode allocation failed"), error);
-		flags |= newfile(tp, ip, 1, 1, buf, len);
+		writesymlink(tp, ip, buf, len);
 		libxfs_trans_ijoin(tp, pip, 0);
 		xname.type = XFS_DIR3_FT_SYMLINK;
 		newdirent(mp, tp, pip, &xname, ip->i_ino);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 00/19] libxfs: atomic file updates
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (46 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code Darrick J. Wong
@ 2022-12-30 22:17 ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 03/19] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
                     ` (18 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: online fsck of iunlink buckets Darrick J. Wong
                   ` (29 subsequent siblings)
  77 siblings, 19 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This series creates a new FIEXCHANGE_RANGE system call to exchange
ranges of bytes between two files atomically.  This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes.  A successful call
completion guarantees that the new contents will be seen even if the
system fails.

The ability to swap extent mappings between files in this manner is
critical to supporting online filesystem repair, which is built upon the
strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want, and exchanging the relevant ranges of the temp file with the
original file.  If the updates are aligned with the file block size, a
new (since v2) flag provides for exchanging only the written areas.
Callers can arrange for the update to be rejected if the original file
has been changed in the meantime.
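
As a rough sketch of that flow, the fragment below stages an update and
then hands the modified range back to the original file.  This is an
illustration only: do_exchange_range() is a hypothetical wrapper standing
in for the FIEXCHANGE_RANGE ioctl call, whose real structure layout lives
in libfrog/fiexchange.h and io/atomicupdate.c, and error-path cleanup is
omitted for brevity.

#define _GNU_SOURCE		/* O_TMPFILE */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */
#include <unistd.h>

/*
 * Hypothetical wrapper around the FIEXCHANGE_RANGE ioctl; see
 * libfrog/fiexchange.h and io/atomicupdate.c for the real interface.
 */
int do_exchange_range(int tmp_fd, int orig_fd, off_t offset, off_t len);

int
update_file_atomically(const char *path, off_t offset, off_t len,
		const void *new_data)
{
	/* The temp file must live on the same filesystem as the target. */
	int	fd = open(path, O_RDWR);
	int	tmpfd = open(".", O_TMPFILE | O_RDWR, 0600);

	if (fd < 0 || tmpfd < 0)
		return -1;

	/* Share all of the original file's blocks with the temp file. */
	if (ioctl(tmpfd, FICLONE, fd) < 0)
		return -1;

	/* Stage the new contents in the temp file and flush them out. */
	if (pwrite(tmpfd, new_data, len, offset) != (ssize_t)len)
		return -1;
	if (fsync(tmpfd) < 0)
		return -1;

	/* Swap the updated range back into the original file atomically. */
	return do_exchange_range(tmpfd, fd, offset, len);
}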

The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files.  For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory.  People get it wrong all the time, and $fs hacks abound.
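
For contrast, here is a minimal sketch of that traditional dance using
nothing but ordinary POSIX calls (again, error-path cleanup omitted);
note that it has to rewrite the entire file even for a small update:

#define _GNU_SOURCE		/* O_DIRECTORY */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* The classic write-fsync-rename-fsync pattern this series aims to replace. */
int
replace_file_contents(const char *dir, const char *name,
		const void *buf, size_t len)
{
	char	tmppath[4096], path[4096];
	int	fd, dirfd;

	snprintf(tmppath, sizeof(tmppath), "%s/.%s.tmp", dir, name);
	snprintf(path, sizeof(path), "%s/%s", dir, name);

	/* Write the entire new contents to a temporary file... */
	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0)
		return -1;
	close(fd);

	/* ...atomically move it over the old name... */
	if (rename(tmppath, path) < 0)
		return -1;

	/* ...and persist the directory entry, the step everyone forgets. */
	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0 || fsync(dirfd) < 0)
		return -1;
	close(dirfd);
	return 0;
}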

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic file writes concept
that has also been floating around for years.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file swap is
implemented as an atomic inode fork swap, which means that we can
implement online reconstruction of extended attributes and directories
by building a new copy in another inode and atomically swapping the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic extent swapping.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be swapped atomically into
the inode being repaired.  This is essential to avoid making corruption
problems worse if the system goes down in the middle of running repair.

This patchset also ports the old XFS extent swap ioctl interface to use
the new extent swap code.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.

Question: Should we really bother with fsdevel bikeshedding?  Most
filesystems cannot support this functionality, so we could keep it
private to XFS for now.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates

xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
 configure.ac                    |    1 
 fsr/xfs_fsr.c                   |  214 ++++---
 include/builddefs.in            |    1 
 include/jdm.h                   |   24 +
 include/xfs_inode.h             |    5 
 include/xfs_trace.h             |   13 
 io/Makefile                     |    6 
 io/atomicupdate.c               |  387 ++++++++++++
 io/init.c                       |    1 
 io/inject.c                     |    1 
 io/io.h                         |    5 
 io/open.c                       |   27 +
 io/swapext.c                    |  195 +++++-
 libfrog/Makefile                |    6 
 libfrog/fiexchange.h            |  105 +++
 libfrog/file_exchange.c         |  184 ++++++
 libfrog/file_exchange.h         |   16 
 libfrog/fsgeom.c                |   45 +
 libfrog/fsgeom.h                |    7 
 libhandle/jdm.c                 |  117 ++++
 libxfs/Makefile                 |    2 
 libxfs/defer_item.c             |   79 ++
 libxfs/libxfs_priv.h            |   30 +
 libxfs/xfs_bmap.h               |    4 
 libxfs/xfs_defer.c              |    7 
 libxfs/xfs_defer.h              |    3 
 libxfs/xfs_errortag.h           |    4 
 libxfs/xfs_format.h             |   15 
 libxfs/xfs_fs.h                 |    2 
 libxfs/xfs_log_format.h         |   80 ++
 libxfs/xfs_sb.c                 |    3 
 libxfs/xfs_swapext.c            | 1256 +++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_swapext.h            |  170 +++++
 libxfs/xfs_symlink_remote.c     |   47 +
 libxfs/xfs_symlink_remote.h     |    1 
 libxfs/xfs_trans_space.h        |    4 
 logprint/log_misc.c             |   11 
 logprint/log_print_all.c        |   12 
 logprint/log_redo.c             |  128 ++++
 logprint/logprint.h             |    6 
 m4/package_libcdev.m4           |   20 +
 man/man2/ioctl_xfs_fsgeometry.2 |    3 
 man/man8/xfs_io.8               |   87 +++
 43 files changed, 3183 insertions(+), 151 deletions(-)
 create mode 100644 io/atomicupdate.c
 create mode 100644 libfrog/fiexchange.h
 create mode 100644 libfrog/file_exchange.c
 create mode 100644 libfrog/file_exchange.h
 create mode 100644 libxfs/xfs_swapext.c
 create mode 100644 libxfs/xfs_swapext.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 01/19] vfs: introduce new file range exchange ioctl
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 03/19] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 04/19] xfs: introduce a swap-extent log intent item Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 02/19] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
                     ` (15 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Introduce a new ioctl to handle swapping contents between two files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_fs.h |    1 +
 1 file changed, 1 insertion(+)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 400cf68e551..210c17f5a16 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -841,6 +841,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
+/*	FIEXCHANGE_RANGE ----------- hoisted 129	 */
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 02/19] xfs: parameterize all the incompat log feature helpers
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 01/19] vfs: introduce new file range exchange ioctl Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 06/19] xfs: add error injection to test swapext recovery Darrick J. Wong
                     ` (14 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We're about to define a new XFS_SB_FEAT_INCOMPAT_LOG_ bit, which means
that callers will soon require the ability to toggle on and off
different log incompat feature bits.  Parameterize the
xlog_{use,drop}_incompat_feat and xfs_sb_remove_incompat_log_features
functions so that callers can specify which feature they're trying to
use and so that we can clear individual log incompat bits as needed.
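
As a quick illustration (not part of the patch itself), a caller that
only wants to drop the delayed-attributes bit would now look something
like this, using the XFS_SB_FEAT_INCOMPAT_LOG_XATTRS flag that already
exists:

/* Clear only the xattr log incompat bit; leave any other bits set. */
static inline void
drop_logged_xattrs_feature(
	struct xfs_sb		*sbp)
{
	xfs_sb_remove_incompat_log_features(sbp,
			XFS_SB_FEAT_INCOMPAT_LOG_XATTRS);
}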

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_format.h |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/libxfs/xfs_format.h b/libxfs/xfs_format.h
index 5ba2dae7aa2..817adb36cb1 100644
--- a/libxfs/xfs_format.h
+++ b/libxfs/xfs_format.h
@@ -404,9 +404,10 @@ xfs_sb_has_incompat_log_feature(
 
 static inline void
 xfs_sb_remove_incompat_log_features(
-	struct xfs_sb	*sbp)
+	struct xfs_sb	*sbp,
+	uint32_t	feature)
 {
-	sbp->sb_features_log_incompat &= ~XFS_SB_FEAT_INCOMPAT_LOG_ALL;
+	sbp->sb_features_log_incompat &= ~feature;
 }
 
 static inline void


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 03/19] xfs: create a log incompat flag for atomic extent swapping
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 04/19] xfs: introduce a swap-extent log intent item Darrick J. Wong
                     ` (17 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a log incompat flag so that we only attempt to process swap
extent log items if the filesystem supports it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_format.h             |    1 +
 libxfs/xfs_fs.h                 |    1 +
 libxfs/xfs_sb.c                 |    3 +++
 libxfs/xfs_swapext.h            |   24 ++++++++++++++++++++++++
 man/man2/ioctl_xfs_fsgeometry.2 |    3 +++
 5 files changed, 32 insertions(+)
 create mode 100644 libxfs/xfs_swapext.h


diff --git a/libxfs/xfs_format.h b/libxfs/xfs_format.h
index 817adb36cb1..1424976ec95 100644
--- a/libxfs/xfs_format.h
+++ b/libxfs/xfs_format.h
@@ -391,6 +391,7 @@ xfs_sb_has_incompat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS   (1 << 0)	/* Delayed Attributes */
+#define XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT  (1U << 31)	/* file extent swap */
 #define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
 	(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 210c17f5a16..a39fd65e6ee 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_BIGTIME	(1 << 21) /* 64-bit nsec timestamps */
 #define XFS_FSOP_GEOM_FLAGS_INOBTCNT	(1 << 22) /* inobt btree counter */
 #define XFS_FSOP_GEOM_FLAGS_NREXT64	(1 << 23) /* large extent counters */
+#define XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP	(1U << 31) /* atomic file extent swap */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/libxfs/xfs_sb.c b/libxfs/xfs_sb.c
index 11b922789fe..10f699b4e99 100644
--- a/libxfs/xfs_sb.c
+++ b/libxfs/xfs_sb.c
@@ -23,6 +23,7 @@
 #include "xfs_da_format.h"
 #include "xfs_health.h"
 #include "xfs_ag.h"
+#include "xfs_swapext.h"
 
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -1195,6 +1196,8 @@ xfs_fs_geometry(
 	}
 	if (xfs_has_large_extent_counts(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
+	if (xfs_swapext_supported(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
 
diff --git a/libxfs/xfs_swapext.h b/libxfs/xfs_swapext.h
new file mode 100644
index 00000000000..316323339d7
--- /dev/null
+++ b/libxfs/xfs_swapext.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SWAPEXT_H_
+#define __XFS_SWAPEXT_H_ 1
+
+/*
+ * Decide if this filesystem supports using log items to swap file extents and
+ * restart the operation if the system fails before the operation completes.
+ *
+ * This can be done to individual file extents by using the block mapping log
+ * intent items introduced with reflink and rmap; or to entire file ranges
+ * using swapext log intent items to track the overall progress across multiple
+ * extent mappings.  Realtime is not supported yet.
+ */
+static inline bool xfs_swapext_supported(struct xfs_mount *mp)
+{
+	return (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) &&
+	       !xfs_has_realtime(mp);
+}
+
+#endif /* __XFS_SWAPEXT_H_ */
diff --git a/man/man2/ioctl_xfs_fsgeometry.2 b/man/man2/ioctl_xfs_fsgeometry.2
index f59a6e8a6a2..7c563ca0454 100644
--- a/man/man2/ioctl_xfs_fsgeometry.2
+++ b/man/man2/ioctl_xfs_fsgeometry.2
@@ -211,6 +211,9 @@ Filesystem stores reverse mappings of blocks to owners.
 .TP
 .B XFS_FSOP_GEOM_FLAGS_REFLINK
 Filesystem supports sharing blocks between files.
+.TP
+.B XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP
+Filesystem can exchange file contents atomically via FIEXCHANGE_RANGE.
 .RE
 .SH XFS METADATA HEALTH REPORTING
 .PP


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 04/19] xfs: introduce a swap-extent log intent item
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 03/19] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
@ 2022-12-30 22:17   ` Darrick J. Wong
  2022-12-30 22:17   ` [PATCH 01/19] vfs: introduce new file range exchange ioctl Darrick J. Wong
                     ` (16 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:17 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Introduce a new intent log item to handle swapping extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_log_format.h |   51 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)


diff --git a/libxfs/xfs_log_format.h b/libxfs/xfs_log_format.h
index 367f536d988..b105a5ef664 100644
--- a/libxfs/xfs_log_format.h
+++ b/libxfs/xfs_log_format.h
@@ -117,8 +117,9 @@ struct xfs_unmount_log_format {
 #define XLOG_REG_TYPE_ATTRD_FORMAT	28
 #define XLOG_REG_TYPE_ATTR_NAME	29
 #define XLOG_REG_TYPE_ATTR_VALUE	30
-#define XLOG_REG_TYPE_MAX		30
-
+#define XLOG_REG_TYPE_SXI_FORMAT	31
+#define XLOG_REG_TYPE_SXD_FORMAT	32
+#define XLOG_REG_TYPE_MAX		32
 
 /*
  * Flags to log operation header
@@ -243,6 +244,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_BUD		0x1245
 #define	XFS_LI_ATTRI		0x1246  /* attr set/remove intent*/
 #define	XFS_LI_ATTRD		0x1247  /* attr set/remove done */
+#define	XFS_LI_SXI		0x1248  /* extent swap intent */
+#define	XFS_LI_SXD		0x1249  /* extent swap done */
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -260,7 +263,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
 	{ XFS_LI_BUD,		"XFS_LI_BUD" }, \
 	{ XFS_LI_ATTRI,		"XFS_LI_ATTRI" }, \
-	{ XFS_LI_ATTRD,		"XFS_LI_ATTRD" }
+	{ XFS_LI_ATTRD,		"XFS_LI_ATTRD" }, \
+	{ XFS_LI_SXI,		"XFS_LI_SXI" }, \
+	{ XFS_LI_SXD,		"XFS_LI_SXD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -871,6 +876,46 @@ struct xfs_bud_log_format {
 	uint64_t		bud_bui_id;	/* id of corresponding bui */
 };
 
+/*
+ * SXI/SXD (extent swapping) log format definitions
+ */
+
+struct xfs_swap_extent {
+	uint64_t		sx_inode1;
+	uint64_t		sx_inode2;
+	uint64_t		sx_startoff1;
+	uint64_t		sx_startoff2;
+	uint64_t		sx_blockcount;
+	uint64_t		sx_flags;
+	int64_t			sx_isize1;
+	int64_t			sx_isize2;
+};
+
+#define XFS_SWAP_EXT_FLAGS		(0)
+
+#define XFS_SWAP_EXT_STRINGS
+
+/* This is the structure used to lay out an sxi log item in the log. */
+struct xfs_sxi_log_format {
+	uint16_t		sxi_type;	/* sxi log item type */
+	uint16_t		sxi_size;	/* size of this item */
+	uint32_t		__pad;		/* must be zero */
+	uint64_t		sxi_id;		/* sxi identifier */
+	struct xfs_swap_extent	sxi_extent;	/* extent to swap */
+};
+
+/*
+ * This is the structure used to lay out an sxd (extent swap done) log
+ * item in the log.  It only needs to identify the sxi log item that it
+ * completes.
+ */
+struct xfs_sxd_log_format {
+	uint16_t		sxd_type;	/* sxd log item type */
+	uint16_t		sxd_size;	/* size of this item */
+	uint32_t		__pad;
+	uint64_t		sxd_sxi_id;	/* id of corresponding sxi */
+};
+
 /*
  * Dquot Log format definitions.
  *


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 05/19] xfs: create deferred log items for extent swapping
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (10 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 10/19] xfs: make atomic extent swapping support realtime files Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 18/19] xfs_io: enhance swapext to take advantage of new api Darrick J. Wong
                     ` (6 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we've created the skeleton of a log intent item to track and
restart extent swap operations, add the upper level logic to commit
intent items and turn them into concrete work recorded in the log.  We
use the deferred item "multihop" feature that was introduced a few
patches ago to constrain the number of active swap operations to one per
thread.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/xfs_trace.h      |   13 +
 libxfs/Makefile          |    2 
 libxfs/defer_item.c      |   79 ++++
 libxfs/libxfs_priv.h     |   30 +
 libxfs/xfs_bmap.h        |    4 
 libxfs/xfs_defer.c       |    7 
 libxfs/xfs_defer.h       |    3 
 libxfs/xfs_format.h      |    6 
 libxfs/xfs_log_format.h  |   28 +
 libxfs/xfs_swapext.c     | 1018 ++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_swapext.h     |  142 ++++++
 libxfs/xfs_trans_space.h |    4 
 12 files changed, 1331 insertions(+), 5 deletions(-)
 create mode 100644 libxfs/xfs_swapext.c


diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 3c6bd32d4ca..a6ba6fc93bf 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -316,6 +316,9 @@
 #define trace_xfs_refcount_cow_decrease(...)	((void) 0)
 #define trace_xfs_refcount_recover_extent(...)	((void) 0)
 
+#define trace_xfs_reflink_set_inode_flag(...)	((void) 0)
+#define trace_xfs_reflink_unset_inode_flag(...)	((void) 0)
+
 #define trace_xfs_rmap_find_left_neighbor_candidate(...)	((void) 0)
 #define trace_xfs_rmap_find_left_neighbor_query(...)	((void) 0)
 #define trace_xfs_rmap_find_left_neighbor_result(...)	((void) 0)
@@ -329,6 +332,16 @@
 #define trace_xfs_rmap_map_error(...)		((void) 0)
 #define trace_xfs_rmap_delete_error(...)	((void) 0)
 
+#define trace_xfs_swapext_defer(...)		((void) 0)
+#define trace_xfs_swapext_delta_nextents(...)	((void) 0)
+#define trace_xfs_swapext_delta_nextents_step(...)	((void) 0)
+#define trace_xfs_swapext_extent1(...)		((void) 0)
+#define trace_xfs_swapext_extent2(...)		((void) 0)
+#define trace_xfs_swapext_final_estimate(...)	((void) 0)
+#define trace_xfs_swapext_initial_estimate(...)	((void) 0)
+#define trace_xfs_swapext_overhead(...)		((void) 0)
+#define trace_xfs_swapext_update_inode_size(...) ((void) 0)
+
 #define trace_xfs_fs_mark_healthy(a,b)		((void) 0)
 
 /* set c = c to avoid unused var warnings */
diff --git a/libxfs/Makefile b/libxfs/Makefile
index b4aa9706aaa..0e43941948d 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -55,6 +55,7 @@ HFILES = \
 	xfs_rmap_btree.h \
 	xfs_sb.h \
 	xfs_shared.h \
+	xfs_swapext.h \
 	xfs_trans_resv.h \
 	xfs_trans_space.h \
 	xfs_dir2_priv.h
@@ -103,6 +104,7 @@ CFILES = cache.c \
 	xfs_rmap_btree.c \
 	xfs_rtbitmap.c \
 	xfs_sb.c \
+	xfs_swapext.c \
 	xfs_symlink_remote.c \
 	xfs_trans_inode.c \
 	xfs_trans_resv.c \
diff --git a/libxfs/defer_item.c b/libxfs/defer_item.c
index ab61e0fe572..316cc87a802 100644
--- a/libxfs/defer_item.c
+++ b/libxfs/defer_item.c
@@ -24,6 +24,8 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "libxfs.h"
+#include "xfs_ag.h"
+#include "xfs_swapext.h"
 
 /* Dummy defer item ops, since we don't do logging. */
 
@@ -651,3 +653,80 @@ const struct xfs_defer_op_type xfs_attr_defer_type = {
 	.finish_item	= xfs_attr_finish_item,
 	.cancel_item	= xfs_attr_cancel_item,
 };
+
+/* Atomic Swapping of File Ranges */
+
+STATIC struct xfs_log_item *
+xfs_swapext_create_intent(
+	struct xfs_trans		*tp,
+	struct list_head		*items,
+	unsigned int			count,
+	bool				sort)
+{
+	return NULL;
+}
+STATIC struct xfs_log_item *
+xfs_swapext_create_done(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*intent,
+	unsigned int			count)
+{
+	return NULL;
+}
+
+/* Process a deferred swapext update. */
+STATIC int
+xfs_swapext_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*done,
+	struct list_head		*item,
+	struct xfs_btree_cur		**state)
+{
+	struct xfs_swapext_intent	*sxi;
+	int				error;
+
+	sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+
+	/*
+	 * Swap one more extent between the two files.  If there's still more
+	 * work to do, we want to requeue ourselves after all other pending
+	 * deferred operations have finished.  This includes all of the dfops
+	 * that we queued directly as well as any new ones created in the
+	 * process of finishing the others.  Doing so prevents us from queuing
+	 * a large number of SXI log items in kernel memory, which in turn
+	 * prevents us from pinning the tail of the log (while logging those
+	 * new SXI items) until the first SXI items can be processed.
+	 */
+	error = xfs_swapext_finish_one(tp, sxi);
+	if (error == -EAGAIN)
+		return error;
+
+	kmem_cache_free(xfs_swapext_intent_cache, sxi);
+	return error;
+}
+
+/* Abort all pending SXIs. */
+STATIC void
+xfs_swapext_abort_intent(
+	struct xfs_log_item		*intent)
+{
+}
+
+/* Cancel a deferred swapext update. */
+STATIC void
+xfs_swapext_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_swapext_intent	*sxi;
+
+	sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+	kmem_cache_free(xfs_swapext_intent_cache, sxi);
+}
+
+const struct xfs_defer_op_type xfs_swapext_defer_type = {
+	.create_intent	= xfs_swapext_create_intent,
+	.abort_intent	= xfs_swapext_abort_intent,
+	.create_done	= xfs_swapext_create_done,
+	.finish_item	= xfs_swapext_finish_item,
+	.cancel_item	= xfs_swapext_cancel_item,
+};
diff --git a/libxfs/libxfs_priv.h b/libxfs/libxfs_priv.h
index 2c5208f907d..63bc6ea7c2b 100644
--- a/libxfs/libxfs_priv.h
+++ b/libxfs/libxfs_priv.h
@@ -219,6 +219,35 @@ static inline bool WARN_ON(bool expr) {
 	(inode)->i_version = (version);	\
 } while (0)
 
+#define __must_check                    __attribute__((__warn_unused_result__))
+
+/*
+ * Allows for effectively applying __must_check to a macro so we can have
+ * both the type-agnostic benefits of the macros while also being able to
+ * enforce that the return value is, in fact, checked.
+ */
+static inline bool __must_check __must_check_overflow(bool overflow)
+{
+	return unlikely(overflow);
+}
+
+/*
+ * For simplicity and code hygiene, the fallback code below insists on
+ * a, b and *d having the same type (similar to the min() and max()
+ * macros), whereas gcc's type-generic overflow checkers accept
+ * different types. Hence we don't just make check_add_overflow an
+ * alias for __builtin_add_overflow, but add type checks similar to
+ * below.
+ */
+#define check_add_overflow(a, b, d) __must_check_overflow(({	\
+	typeof(a) __a = (a);			\
+	typeof(b) __b = (b);			\
+	typeof(d) __d = (d);			\
+	(void) (&__a == &__b);			\
+	(void) (&__a == __d);			\
+	__builtin_add_overflow(__a, __b, __d);	\
+}))
+
 static inline int __do_div(unsigned long long *n, unsigned base)
 {
 	int __res;
@@ -601,6 +630,7 @@ void xfs_log_item_init(struct xfs_mount *, struct xfs_log_item *, int);
 #define xfs_log_in_recovery(mp)		(false)
 
 /* xfs_icache.c */
+#define xfs_inode_clear_cowblocks_tag(ip)	do { } while (0)
 #define xfs_inode_set_cowblocks_tag(ip)	do { } while (0)
 #define xfs_inode_set_eofblocks_tag(ip)	do { } while (0)
 
diff --git a/libxfs/xfs_bmap.h b/libxfs/xfs_bmap.h
index cb09a43a287..413ec27f2f2 100644
--- a/libxfs/xfs_bmap.h
+++ b/libxfs/xfs_bmap.h
@@ -144,7 +144,7 @@ static inline int xfs_bmapi_whichfork(uint32_t bmapi_flags)
 	{ BMAP_COWFORK,		"COW" }
 
 /* Return true if the extent is an allocated extent, written or not. */
-static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec)
 {
 	return irec->br_startblock != HOLESTARTBLOCK &&
 		irec->br_startblock != DELAYSTARTBLOCK &&
@@ -155,7 +155,7 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
  * Return true if the extent is a real, allocated extent, or false if it is  a
  * delayed allocation, and unwritten extent or a hole.
  */
-static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec)
 {
 	return xfs_bmap_is_real_extent(irec) &&
 	       irec->br_state != XFS_EXT_UNWRITTEN;
diff --git a/libxfs/xfs_defer.c b/libxfs/xfs_defer.c
index 77a94f58f41..47a7c5ed1f5 100644
--- a/libxfs/xfs_defer.c
+++ b/libxfs/xfs_defer.c
@@ -21,6 +21,7 @@
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
+#include "xfs_swapext.h"
 
 static struct kmem_cache	*xfs_defer_pending_cache;
 
@@ -184,6 +185,7 @@ static const struct xfs_defer_op_type *defer_op_types[] = {
 	[XFS_DEFER_OPS_TYPE_FREE]	= &xfs_extent_free_defer_type,
 	[XFS_DEFER_OPS_TYPE_AGFL_FREE]	= &xfs_agfl_free_defer_type,
 	[XFS_DEFER_OPS_TYPE_ATTR]	= &xfs_attr_defer_type,
+	[XFS_DEFER_OPS_TYPE_SWAPEXT]	= &xfs_swapext_defer_type,
 };
 
 /*
@@ -908,6 +910,10 @@ xfs_defer_init_item_caches(void)
 	error = xfs_attr_intent_init_cache();
 	if (error)
 		goto err;
+	error = xfs_swapext_intent_init_cache();
+	if (error)
+		goto err;
+
 	return 0;
 err:
 	xfs_defer_destroy_item_caches();
@@ -918,6 +924,7 @@ xfs_defer_init_item_caches(void)
 void
 xfs_defer_destroy_item_caches(void)
 {
+	xfs_swapext_intent_destroy_cache();
 	xfs_attr_intent_destroy_cache();
 	xfs_extfree_intent_destroy_cache();
 	xfs_bmap_intent_destroy_cache();
diff --git a/libxfs/xfs_defer.h b/libxfs/xfs_defer.h
index 114a3a4930a..bcc48b0c75c 100644
--- a/libxfs/xfs_defer.h
+++ b/libxfs/xfs_defer.h
@@ -20,6 +20,7 @@ enum xfs_defer_ops_type {
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_AGFL_FREE,
 	XFS_DEFER_OPS_TYPE_ATTR,
+	XFS_DEFER_OPS_TYPE_SWAPEXT,
 	XFS_DEFER_OPS_TYPE_MAX,
 };
 
@@ -65,7 +66,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 extern const struct xfs_defer_op_type xfs_attr_defer_type;
-
+extern const struct xfs_defer_op_type xfs_swapext_defer_type;
 
 /*
  * Deferred operation item relogging limits.
diff --git a/libxfs/xfs_format.h b/libxfs/xfs_format.h
index 1424976ec95..bb8bff48801 100644
--- a/libxfs/xfs_format.h
+++ b/libxfs/xfs_format.h
@@ -425,6 +425,12 @@ static inline bool xfs_sb_version_haslogxattrs(struct xfs_sb *sbp)
 		 XFS_SB_FEAT_INCOMPAT_LOG_XATTRS);
 }
 
+static inline bool xfs_sb_version_haslogswapext(struct xfs_sb *sbp)
+{
+	return xfs_sb_is_v5(sbp) && (sbp->sb_features_log_incompat &
+		 XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT);
+}
+
 static inline bool
 xfs_is_quota_inode(struct xfs_sb *sbp, xfs_ino_t ino)
 {
diff --git a/libxfs/xfs_log_format.h b/libxfs/xfs_log_format.h
index b105a5ef664..65a84fdefe5 100644
--- a/libxfs/xfs_log_format.h
+++ b/libxfs/xfs_log_format.h
@@ -891,9 +891,33 @@ struct xfs_swap_extent {
 	int64_t			sx_isize2;
 };
 
-#define XFS_SWAP_EXT_FLAGS		(0)
+/* Swap extents between extended attribute forks. */
+#define XFS_SWAP_EXT_ATTR_FORK		(1ULL << 0)
 
-#define XFS_SWAP_EXT_STRINGS
+/* Set the file sizes when finished. */
+#define XFS_SWAP_EXT_SET_SIZES		(1ULL << 1)
+
+/* Do not swap any part of the range where inode1's mapping is a hole. */
+#define XFS_SWAP_EXT_SKIP_INO1_HOLES	(1ULL << 2)
+
+/* Clear the reflink flag from inode1 after the operation. */
+#define XFS_SWAP_EXT_CLEAR_INO1_REFLINK	(1ULL << 3)
+
+/* Clear the reflink flag from inode2 after the operation. */
+#define XFS_SWAP_EXT_CLEAR_INO2_REFLINK	(1ULL << 4)
+
+#define XFS_SWAP_EXT_FLAGS		(XFS_SWAP_EXT_ATTR_FORK | \
+					 XFS_SWAP_EXT_SET_SIZES | \
+					 XFS_SWAP_EXT_SKIP_INO1_HOLES | \
+					 XFS_SWAP_EXT_CLEAR_INO1_REFLINK | \
+					 XFS_SWAP_EXT_CLEAR_INO2_REFLINK)
+
+#define XFS_SWAP_EXT_STRINGS \
+	{ XFS_SWAP_EXT_ATTR_FORK,		"ATTRFORK" }, \
+	{ XFS_SWAP_EXT_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_SWAP_EXT_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
+	{ XFS_SWAP_EXT_CLEAR_INO1_REFLINK,	"CLEAR_INO1_REFLINK" }, \
+	{ XFS_SWAP_EXT_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }
 
 /* This is the structure used to lay out an sxi log item in the log. */
 struct xfs_sxi_log_format {
diff --git a/libxfs/xfs_swapext.c b/libxfs/xfs_swapext.c
new file mode 100644
index 00000000000..4c3dd4f7c7f
--- /dev/null
+++ b/libxfs/xfs_swapext.c
@@ -0,0 +1,1018 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs_priv.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_bmap.h"
+#include "xfs_swapext.h"
+#include "xfs_trace.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_quota_defs.h"
+#include "xfs_health.h"
+
+struct kmem_cache	*xfs_swapext_intent_cache;
+
+/* bmbt mappings adjacent to a pair of records. */
+struct xfs_swapext_adjacent {
+	struct xfs_bmbt_irec		left1;
+	struct xfs_bmbt_irec		right1;
+	struct xfs_bmbt_irec		left2;
+	struct xfs_bmbt_irec		right2;
+};
+
+#define ADJACENT_INIT { \
+	.left1  = { .br_startblock = HOLESTARTBLOCK }, \
+	.right1 = { .br_startblock = HOLESTARTBLOCK }, \
+	.left2  = { .br_startblock = HOLESTARTBLOCK }, \
+	.right2 = { .br_startblock = HOLESTARTBLOCK }, \
+}
+
+/* Information to help us reset reflink flag / CoW fork state after a swap. */
+
+/* Previous state of the two inodes' reflink flags. */
+#define XFS_REFLINK_STATE_IP1		(1U << 0)
+#define XFS_REFLINK_STATE_IP2		(1U << 1)
+
+/*
+ * If the reflink flag is set on either inode, make sure it has an incore CoW
+ * fork, since all reflink inodes must have them.  If there's a CoW fork and it
+ * has extents in it, make sure the inodes are tagged appropriately so that
+ * speculative preallocations can be GC'd if we run low of space.
+ */
+static inline void
+xfs_swapext_ensure_cowfork(
+	struct xfs_inode	*ip)
+{
+	struct xfs_ifork	*cfork;
+
+	if (xfs_is_reflink_inode(ip))
+		xfs_ifork_init_cow(ip);
+
+	cfork = xfs_ifork_ptr(ip, XFS_COW_FORK);
+	if (!cfork)
+		return;
+	if (cfork->if_bytes > 0)
+		xfs_inode_set_cowblocks_tag(ip);
+	else
+		xfs_inode_clear_cowblocks_tag(ip);
+}
+
+/* Schedule an atomic extent swap. */
+void
+xfs_swapext_schedule(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	trace_xfs_swapext_defer(tp->t_mountp, sxi);
+	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_SWAPEXT, &sxi->sxi_list);
+}
+
+/*
+ * Adjust the on-disk inode size upwards if needed so that we never map extents
+ * into the file past EOF.  This is crucial so that log recovery won't get
+ * confused by the sudden appearance of post-eof extents.
+ */
+STATIC void
+xfs_swapext_update_size(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fsize_t		new_isize)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_fsize_t		len;
+
+	if (new_isize < 0)
+		return;
+
+	len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount),
+		  new_isize);
+
+	if (len <= ip->i_disk_size)
+		return;
+
+	trace_xfs_swapext_update_inode_size(ip, len);
+
+	ip->i_disk_size = len;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+static inline bool
+sxi_has_more_swap_work(const struct xfs_swapext_intent *sxi)
+{
+	return sxi->sxi_blockcount > 0;
+}
+
+static inline bool
+sxi_has_postop_work(const struct xfs_swapext_intent *sxi)
+{
+	return sxi->sxi_flags & (XFS_SWAP_EXT_CLEAR_INO1_REFLINK |
+				 XFS_SWAP_EXT_CLEAR_INO2_REFLINK);
+}
+
+static inline void
+sxi_advance(
+	struct xfs_swapext_intent	*sxi,
+	const struct xfs_bmbt_irec	*irec)
+{
+	sxi->sxi_startoff1 += irec->br_blockcount;
+	sxi->sxi_startoff2 += irec->br_blockcount;
+	sxi->sxi_blockcount -= irec->br_blockcount;
+}
+
+/* Check all extents to make sure we can actually swap them. */
+int
+xfs_swapext_check_extents(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_ifork		*ifp1, *ifp2;
+
+	/* No fork? */
+	ifp1 = xfs_ifork_ptr(req->ip1, req->whichfork);
+	ifp2 = xfs_ifork_ptr(req->ip2, req->whichfork);
+	if (!ifp1 || !ifp2)
+		return -EINVAL;
+
+	/* We don't know how to swap local format forks. */
+	if (ifp1->if_format == XFS_DINODE_FMT_LOCAL ||
+	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
+		return -EINVAL;
+
+	/* We don't support realtime data forks yet. */
+	if (!XFS_IS_REALTIME_INODE(req->ip1))
+		return 0;
+	if (req->whichfork == XFS_ATTR_FORK)
+		return 0;
+	return -EINVAL;
+}
+
+#ifdef CONFIG_XFS_QUOTA
+/* Log the actual updates to the quota accounting. */
+static inline void
+xfs_swapext_update_quota(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2)
+{
+	int64_t				ip1_delta = 0, ip2_delta = 0;
+	unsigned int			qflag;
+
+	qflag = XFS_IS_REALTIME_INODE(sxi->sxi_ip1) ? XFS_TRANS_DQ_RTBCOUNT :
+						      XFS_TRANS_DQ_BCOUNT;
+
+	if (xfs_bmap_is_real_extent(irec1)) {
+		ip1_delta -= irec1->br_blockcount;
+		ip2_delta += irec1->br_blockcount;
+	}
+
+	if (xfs_bmap_is_real_extent(irec2)) {
+		ip1_delta += irec2->br_blockcount;
+		ip2_delta -= irec2->br_blockcount;
+	}
+
+	xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip1, qflag, ip1_delta);
+	xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip2, qflag, ip2_delta);
+}
+#else
+# define xfs_swapext_update_quota(tp, sxi, irec1, irec2)	((void)0)
+#endif
+
+/*
+ * Walk forward through the file ranges in @sxi until we find two different
+ * mappings to exchange.  If there is work to do, return the mappings;
+ * otherwise we've reached the end of the range and sxi_blockcount will be
+ * zero.
+ *
+ * If the walk skips over a pair of mappings to the same storage, save them as
+ * the left records in @adj (if provided) so that the simulation phase can
+ * avoid an extra lookup.
+ */
+static int
+xfs_swapext_find_mappings(
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2,
+	struct xfs_swapext_adjacent	*adj)
+{
+	int				nimaps;
+	int				bmap_flags;
+	int				error;
+
+	bmap_flags = xfs_bmapi_aflag(xfs_swapext_whichfork(sxi));
+
+	for (; sxi_has_more_swap_work(sxi); sxi_advance(sxi, irec1)) {
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->sxi_ip1, sxi->sxi_startoff1,
+				sxi->sxi_blockcount, irec1, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec1->br_startblock == DELAYSTARTBLOCK ||
+		    irec1->br_startoff != sxi->sxi_startoff1) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * If the caller told us to ignore sparse areas of file1, jump
+		 * ahead to the next region.
+		 */
+		if ((sxi->sxi_flags & XFS_SWAP_EXT_SKIP_INO1_HOLES) &&
+		    irec1->br_startblock == HOLESTARTBLOCK) {
+			trace_xfs_swapext_extent1(sxi->sxi_ip1, irec1);
+			continue;
+		}
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->sxi_ip2, sxi->sxi_startoff2,
+				irec1->br_blockcount, irec2, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec2->br_startblock == DELAYSTARTBLOCK ||
+		    irec2->br_startoff != sxi->sxi_startoff2) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1->br_blockcount = min(irec1->br_blockcount,
+					   irec2->br_blockcount);
+
+		trace_xfs_swapext_extent1(sxi->sxi_ip1, irec1);
+		trace_xfs_swapext_extent2(sxi->sxi_ip2, irec2);
+
+		/* We found something to swap, so return it. */
+		if (irec1->br_startblock != irec2->br_startblock)
+			return 0;
+
+		/*
+		 * Two extents mapped to the same physical block must not have
+		 * different states; that's filesystem corruption.  Move on to
+		 * the next extent if they're both holes or both the same
+		 * physical extent.
+		 */
+		if (irec1->br_state != irec2->br_state) {
+			xfs_bmap_mark_sick(sxi->sxi_ip1,
+					xfs_swapext_whichfork(sxi));
+			xfs_bmap_mark_sick(sxi->sxi_ip2,
+					xfs_swapext_whichfork(sxi));
+			return -EFSCORRUPTED;
+		}
+
+		/*
+		 * Save the mappings if we're estimating work and skipping
+		 * these identical mappings.
+		 */
+		if (adj) {
+			memcpy(&adj->left1, irec1, sizeof(*irec1));
+			memcpy(&adj->left2, irec2, sizeof(*irec2));
+		}
+	}
+
+	return 0;
+}
+
+/* Exchange these two mappings. */
+static void
+xfs_swapext_exchange_mappings(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2)
+{
+	int				whichfork = xfs_swapext_whichfork(sxi);
+
+	xfs_swapext_update_quota(tp, sxi, irec1, irec2);
+
+	/* Remove both mappings. */
+	xfs_bmap_unmap_extent(tp, sxi->sxi_ip1, whichfork, irec1);
+	xfs_bmap_unmap_extent(tp, sxi->sxi_ip2, whichfork, irec2);
+
+	/*
+	 * Re-add both mappings.  We swap the file offsets between the two maps
+	 * and add the opposite map, which has the effect of filling the
+	 * logical offsets we just unmapped, but with the physical mapping
+	 * information swapped.
+	 */
+	swap(irec1->br_startoff, irec2->br_startoff);
+	xfs_bmap_map_extent(tp, sxi->sxi_ip1, whichfork, irec2);
+	xfs_bmap_map_extent(tp, sxi->sxi_ip2, whichfork, irec1);
+
+	/* Make sure we're not mapping extents past EOF. */
+	if (whichfork == XFS_DATA_FORK) {
+		xfs_swapext_update_size(tp, sxi->sxi_ip1, irec2,
+				sxi->sxi_isize1);
+		xfs_swapext_update_size(tp, sxi->sxi_ip2, irec1,
+				sxi->sxi_isize2);
+	}
+
+	/*
+	 * Advance our cursor and exit.   The caller (either defer ops or log
+	 * recovery) will log the SXD item, and if *blockcount is nonzero, it
+	 * will log a new SXI item for the remainder and call us back.
+	 */
+	sxi_advance(sxi, irec1);
+}
+
+static inline void
+xfs_swapext_clear_reflink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	trace_xfs_reflink_unset_inode_flag(ip);
+
+	ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Finish whatever work might come after a swap operation. */
+static int
+xfs_swapext_do_postop_work(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO1_REFLINK) {
+		xfs_swapext_clear_reflink(tp, sxi->sxi_ip1);
+		sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
+	}
+
+	if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO2_REFLINK) {
+		xfs_swapext_clear_reflink(tp, sxi->sxi_ip2);
+		sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO2_REFLINK;
+	}
+
+	return 0;
+}
+
+/* Finish one extent swap, possibly log more. */
+int
+xfs_swapext_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	int				error;
+
+	if (sxi_has_more_swap_work(sxi)) {
+		/*
+		 * If the operation state says that some range of the files
+		 * have not yet been swapped, look for extents in that range to
+		 * swap.  If we find some extents, swap them.
+		 */
+		error = xfs_swapext_find_mappings(sxi, &irec1, &irec2, NULL);
+		if (error)
+			return error;
+
+		if (sxi_has_more_swap_work(sxi))
+			xfs_swapext_exchange_mappings(tp, sxi, &irec1, &irec2);
+
+		/*
+		 * If the caller asked us to exchange the file sizes after the
+		 * swap and either we just swapped the last extents in the
+		 * range or we didn't find anything to swap, update the ondisk
+		 * file sizes.
+		 */
+		if ((sxi->sxi_flags & XFS_SWAP_EXT_SET_SIZES) &&
+		    !sxi_has_more_swap_work(sxi)) {
+			sxi->sxi_ip1->i_disk_size = sxi->sxi_isize1;
+			sxi->sxi_ip2->i_disk_size = sxi->sxi_isize2;
+
+			xfs_trans_log_inode(tp, sxi->sxi_ip1, XFS_ILOG_CORE);
+			xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
+		}
+	} else if (sxi_has_postop_work(sxi)) {
+		/*
+		 * Now that we're finished with the swap operation, complete
+		 * the post-op cleanup work.
+		 */
+		error = xfs_swapext_do_postop_work(tp, sxi);
+		if (error)
+			return error;
+	}
+
+	/* If we still have work to do, ask for a new transaction. */
+	if (sxi_has_more_swap_work(sxi) || sxi_has_postop_work(sxi)) {
+		trace_xfs_swapext_defer(tp->t_mountp, sxi);
+		return -EAGAIN;
+	}
+
+	/*
+	 * If we reach here, we've finished all the swapping work and the post
+	 * operation work.  The last thing we need to do before returning to
+	 * the caller is to make sure that COW forks are set up correctly.
+	 */
+	if (!(sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)) {
+		xfs_swapext_ensure_cowfork(sxi->sxi_ip1);
+		xfs_swapext_ensure_cowfork(sxi->sxi_ip2);
+	}
+
+	return 0;
+}
+
+/*
+ * Compute the amount of bmbt blocks we should reserve for each file.  In the
+ * worst case, each exchange will fill a hole with a new mapping, which could
+ * result in a btree split every time we add a new leaf block.
+ */
+static inline uint64_t
+xfs_swapext_bmbt_blocks(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	return howmany_64(req->nr_exchanges,
+					XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)) *
+			XFS_EXTENTADD_SPACE_RES(mp, req->whichfork);
+}
+
+static inline uint64_t
+xfs_swapext_rmapbt_blocks(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	if (!xfs_has_rmapbt(mp))
+		return 0;
+	if (XFS_IS_REALTIME_INODE(req->ip1))
+		return 0;
+
+	return howmany_64(req->nr_exchanges,
+					XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) *
+			XFS_RMAPADD_SPACE_RES(mp);
+}
+
+/* Estimate the bmbt and rmapbt overhead required to exchange extents. */
+static int
+xfs_swapext_estimate_overhead(
+	struct xfs_swapext_req	*req)
+{
+	struct xfs_mount	*mp = req->ip1->i_mount;
+	xfs_filblks_t		bmbt_blocks;
+	xfs_filblks_t		rmapbt_blocks;
+	xfs_filblks_t		resblks = req->resblks;
+
+	/*
+	 * Compute the number of bmbt and rmapbt blocks we might need to handle
+	 * the estimated number of exchanges.
+	 */
+	bmbt_blocks = xfs_swapext_bmbt_blocks(mp, req);
+	rmapbt_blocks = xfs_swapext_rmapbt_blocks(mp, req);
+
+	trace_xfs_swapext_overhead(mp, bmbt_blocks, rmapbt_blocks);
+
+	/* Make sure the change in file block count doesn't overflow. */
+	if (check_add_overflow(req->ip1_bcount, bmbt_blocks, &req->ip1_bcount))
+		return -EFBIG;
+	if (check_add_overflow(req->ip2_bcount, bmbt_blocks, &req->ip2_bcount))
+		return -EFBIG;
+
+	/*
+	 * Add together the number of blocks we need to handle btree growth,
+	 * then add it to the number of blocks we need to reserve to this
+	 * transaction.
+	 */
+	if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
+		return -ENOSPC;
+
+	/* Can't actually reserve more than UINT_MAX blocks. */
+	if (req->resblks > UINT_MAX)
+		return -ENOSPC;
+
+	req->resblks = resblks;
+	trace_xfs_swapext_final_estimate(req);
+	return 0;
+}
+
+/* Decide if we can merge two real extents. */
+static inline bool
+can_merge(
+	const struct xfs_bmbt_irec	*b1,
+	const struct xfs_bmbt_irec	*b2)
+{
+	/* Don't merge holes. */
+	if (b1->br_startblock == HOLESTARTBLOCK ||
+	    b2->br_startblock == HOLESTARTBLOCK)
+		return false;
+
+	/* Don't merge delalloc mappings either. */
+	if (!xfs_bmap_is_real_extent(b1) || !xfs_bmap_is_real_extent(b2))
+		return false;
+
+	if (b1->br_startoff   + b1->br_blockcount == b2->br_startoff &&
+	    b1->br_startblock + b1->br_blockcount == b2->br_startblock &&
+	    b1->br_state			  == b2->br_state &&
+	    b1->br_blockcount + b2->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+		return true;
+
+	return false;
+}
+
+#define CLEFT_CONTIG	0x01
+#define CRIGHT_CONTIG	0x02
+#define CHOLE		0x04
+#define CBOTH_CONTIG	(CLEFT_CONTIG | CRIGHT_CONTIG)
+
+#define NLEFT_CONTIG	0x10
+#define NRIGHT_CONTIG	0x20
+#define NHOLE		0x40
+#define NBOTH_CONTIG	(NLEFT_CONTIG | NRIGHT_CONTIG)
+
+/* Estimate the effect of a single swap on extent count. */
+static inline int
+delta_nextents_step(
+	struct xfs_mount		*mp,
+	const struct xfs_bmbt_irec	*left,
+	const struct xfs_bmbt_irec	*curr,
+	const struct xfs_bmbt_irec	*new,
+	const struct xfs_bmbt_irec	*right)
+{
+	bool				lhole, rhole, chole, nhole;
+	unsigned int			state = 0;
+	int				ret = 0;
+
+	lhole = left->br_startblock == HOLESTARTBLOCK;
+	rhole = right->br_startblock == HOLESTARTBLOCK;
+	chole = curr->br_startblock == HOLESTARTBLOCK;
+	nhole = new->br_startblock == HOLESTARTBLOCK;
+
+	if (chole)
+		state |= CHOLE;
+	if (!lhole && !chole && can_merge(left, curr))
+		state |= CLEFT_CONTIG;
+	if (!rhole && !chole && can_merge(curr, right))
+		state |= CRIGHT_CONTIG;
+	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
+	    left->br_blockcount + curr->br_blockcount +
+				right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+		state &= ~CRIGHT_CONTIG;
+
+	if (nhole)
+		state |= NHOLE;
+	if (!lhole && !nhole && can_merge(left, new))
+		state |= NLEFT_CONTIG;
+	if (!rhole && !nhole && can_merge(new, right))
+		state |= NRIGHT_CONTIG;
+	if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
+	    left->br_blockcount + new->br_blockcount +
+				right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+		state &= ~NRIGHT_CONTIG;
+
+	switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) {
+	case CLEFT_CONTIG | CRIGHT_CONTIG:
+		/*
+		 * left/curr/right are the same extent, so deleting curr causes
+		 * 2 new extents to be created.
+		 */
+		ret += 2;
+		break;
+	case 0:
+		/*
+		 * curr is not contiguous with any extent, so we remove curr
+		 * completely
+		 */
+		ret--;
+		break;
+	case CHOLE:
+		/* hole, do nothing */
+		break;
+	case CLEFT_CONTIG:
+	case CRIGHT_CONTIG:
+		/* trim either left or right, no change */
+		break;
+	}
+
+	switch (state & (NLEFT_CONTIG | NRIGHT_CONTIG | NHOLE)) {
+	case NLEFT_CONTIG | NRIGHT_CONTIG:
+		/*
+		 * left/curr/right will become the same extent, so adding
+		 * curr causes the deletion of right.
+		 */
+		ret--;
+		break;
+	case 0:
+		/* new is not contiguous with any extent */
+		ret++;
+		break;
+	case NHOLE:
+		/* hole, do nothing. */
+		break;
+	case NLEFT_CONTIG:
+	case NRIGHT_CONTIG:
+		/* new is absorbed into left or right, no change */
+		break;
+	}
+
+	trace_xfs_swapext_delta_nextents_step(mp, left, curr, new, right, ret,
+			state);
+	return ret;
+}
+
+/* Make sure we don't overflow the extent counters. */
+static inline int
+ensure_delta_nextents(
+	struct xfs_swapext_req	*req,
+	struct xfs_inode	*ip,
+	int64_t			delta)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, req->whichfork);
+	xfs_extnum_t		max_extents;
+	bool			large_extcount;
+
+	if (delta < 0)
+		return 0;
+
+	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REDUCE_MAX_IEXTENTS)) {
+		if (ifp->if_nextents + delta > 10)
+			return -EFBIG;
+	}
+
+	if (req->req_flags & XFS_SWAP_REQ_NREXT64)
+		large_extcount = true;
+	else
+		large_extcount = xfs_inode_has_large_extent_counts(ip);
+
+	max_extents = xfs_iext_max_nextents(large_extcount, req->whichfork);
+	if (ifp->if_nextents + delta <= max_extents)
+		return 0;
+	if (large_extcount)
+		return -EFBIG;
+	if (!xfs_has_large_extent_counts(mp))
+		return -EFBIG;
+
+	max_extents = xfs_iext_max_nextents(true, req->whichfork);
+	if (ifp->if_nextents + delta > max_extents)
+		return -EFBIG;
+
+	req->req_flags |= XFS_SWAP_REQ_NREXT64;
+	return 0;
+}
+
+/* Find the next extent after irec. */
+static inline int
+get_next_ext(
+	struct xfs_inode		*ip,
+	int				bmap_flags,
+	const struct xfs_bmbt_irec	*irec,
+	struct xfs_bmbt_irec		*nrec)
+{
+	xfs_fileoff_t			off;
+	xfs_filblks_t			blockcount;
+	int				nimaps = 1;
+	int				error;
+
+	off = irec->br_startoff + irec->br_blockcount;
+	blockcount = XFS_MAX_FILEOFF - off;
+	error = xfs_bmapi_read(ip, off, blockcount, nrec, &nimaps, bmap_flags);
+	if (error)
+		return error;
+	if (nrec->br_startblock == DELAYSTARTBLOCK ||
+	    nrec->br_startoff != off) {
+		/*
+		 * If we don't get the extent we want, return a zero-length
+		 * mapping, which our estimator function will pretend is a hole.
+		 * We shouldn't get delalloc reservations.
+		 */
+		nrec->br_startblock = HOLESTARTBLOCK;
+	}
+
+	return 0;
+}
+
+int __init
+xfs_swapext_intent_init_cache(void)
+{
+	xfs_swapext_intent_cache = kmem_cache_create("xfs_swapext_intent",
+			sizeof(struct xfs_swapext_intent),
+			0, 0, NULL);
+
+	return xfs_swapext_intent_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_swapext_intent_destroy_cache(void)
+{
+	kmem_cache_destroy(xfs_swapext_intent_cache);
+	xfs_swapext_intent_cache = NULL;
+}
+
+/*
+ * Decide if we will swap the reflink flags between the two files after the
+ * swap.  The only time we want to do this is if we're exchanging all extents
+ * under EOF and the inode reflink flags have different states.
+ */
+static inline bool
+sxi_can_exchange_reflink_flags(
+	const struct xfs_swapext_req	*req,
+	unsigned int			reflink_state)
+{
+	struct xfs_mount		*mp = req->ip1->i_mount;
+
+	if (hweight32(reflink_state) != 1)
+		return false;
+	if (req->startoff1 != 0 || req->startoff2 != 0)
+		return false;
+	if (req->blockcount != XFS_B_TO_FSB(mp, req->ip1->i_disk_size))
+		return false;
+	if (req->blockcount != XFS_B_TO_FSB(mp, req->ip2->i_disk_size))
+		return false;
+	return true;
+}
+
+
+/* Allocate and initialize a new incore intent item from a request. */
+struct xfs_swapext_intent *
+xfs_swapext_init_intent(
+	const struct xfs_swapext_req	*req,
+	unsigned int			*reflink_state)
+{
+	struct xfs_swapext_intent	*sxi;
+	unsigned int			rs = 0;
+
+	sxi = kmem_cache_zalloc(xfs_swapext_intent_cache,
+			GFP_NOFS | __GFP_NOFAIL);
+	INIT_LIST_HEAD(&sxi->sxi_list);
+	sxi->sxi_ip1 = req->ip1;
+	sxi->sxi_ip2 = req->ip2;
+	sxi->sxi_startoff1 = req->startoff1;
+	sxi->sxi_startoff2 = req->startoff2;
+	sxi->sxi_blockcount = req->blockcount;
+	sxi->sxi_isize1 = sxi->sxi_isize2 = -1;
+
+	if (req->whichfork == XFS_ATTR_FORK)
+		sxi->sxi_flags |= XFS_SWAP_EXT_ATTR_FORK;
+
+	if (req->whichfork == XFS_DATA_FORK &&
+	    (req->req_flags & XFS_SWAP_REQ_SET_SIZES)) {
+		sxi->sxi_flags |= XFS_SWAP_EXT_SET_SIZES;
+		sxi->sxi_isize1 = req->ip2->i_disk_size;
+		sxi->sxi_isize2 = req->ip1->i_disk_size;
+	}
+
+	if (req->req_flags & XFS_SWAP_REQ_SKIP_INO1_HOLES)
+		sxi->sxi_flags |= XFS_SWAP_EXT_SKIP_INO1_HOLES;
+
+	if (req->req_flags & XFS_SWAP_REQ_LOGGED)
+		sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_LOGGED;
+	if (req->req_flags & XFS_SWAP_REQ_NREXT64)
+		sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_NREXT64;
+
+	if (req->whichfork == XFS_DATA_FORK) {
+		/*
+		 * Record the state of each inode's reflink flag before the
+		 * operation.
+		 */
+		if (xfs_is_reflink_inode(req->ip1))
+			rs |= XFS_REFLINK_STATE_IP1;
+		if (xfs_is_reflink_inode(req->ip2))
+			rs |= XFS_REFLINK_STATE_IP2;
+
+		/*
+		 * Figure out if we're clearing the reflink flags (which
+		 * effectively swaps them) after the operation.
+		 */
+		if (sxi_can_exchange_reflink_flags(req, rs)) {
+			if (rs & XFS_REFLINK_STATE_IP1)
+				sxi->sxi_flags |=
+						XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
+			if (rs & XFS_REFLINK_STATE_IP2)
+				sxi->sxi_flags |=
+						XFS_SWAP_EXT_CLEAR_INO2_REFLINK;
+		}
+	}
+
+	if (reflink_state)
+		*reflink_state = rs;
+	return sxi;
+}
+
+/*
+ * Estimate the number of exchange operations and the number of file blocks
+ * in each file that will be affected by the exchange operation.
+ */
+int
+xfs_swapext_estimate(
+	struct xfs_swapext_req		*req)
+{
+	struct xfs_swapext_intent	*sxi;
+	struct xfs_bmbt_irec		irec1, irec2;
+	struct xfs_swapext_adjacent	adj = ADJACENT_INIT;
+	xfs_filblks_t			ip1_blocks = 0, ip2_blocks = 0;
+	int64_t				d_nexts1, d_nexts2;
+	int				bmap_flags;
+	int				error;
+
+	ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
+
+	bmap_flags = xfs_bmapi_aflag(req->whichfork);
+	sxi = xfs_swapext_init_intent(req, NULL);
+
+	/*
+	 * To guard against the possibility of overflowing the extent counters,
+	 * we have to estimate an upper bound on the potential increase in that
+	 * counter.  We can split the extent at each end of the range, and for
+	 * each step of the swap we can split the extent that we're working on
+	 * if the extents do not align.
+	 */
+	d_nexts1 = d_nexts2 = 3;
+
+	while (sxi_has_more_swap_work(sxi)) {
+		/*
+		 * Walk through the file ranges until we find something to
+		 * swap.  Because we're simulating the swap, pass in adj to
+		 * capture skipped mappings for correct estimation of bmbt
+		 * record merges.
+		 */
+		error = xfs_swapext_find_mappings(sxi, &irec1, &irec2, &adj);
+		if (error)
+			goto out_free;
+		if (!sxi_has_more_swap_work(sxi))
+			break;
+
+		/* Update accounting. */
+		if (xfs_bmap_is_real_extent(&irec1))
+			ip1_blocks += irec1.br_blockcount;
+		if (xfs_bmap_is_real_extent(&irec2))
+			ip2_blocks += irec2.br_blockcount;
+		req->nr_exchanges++;
+
+		/* Read the next extents from both files. */
+		error = get_next_ext(req->ip1, bmap_flags, &irec1, &adj.right1);
+		if (error)
+			goto out_free;
+
+		error = get_next_ext(req->ip2, bmap_flags, &irec2, &adj.right2);
+		if (error)
+			goto out_free;
+
+		/* Update extent count deltas. */
+		d_nexts1 += delta_nextents_step(req->ip1->i_mount,
+				&adj.left1, &irec1, &irec2, &adj.right1);
+
+		d_nexts2 += delta_nextents_step(req->ip1->i_mount,
+				&adj.left2, &irec2, &irec1, &adj.right2);
+
+		/* Now pretend we swapped the extents. */
+		if (can_merge(&adj.left2, &irec1))
+			adj.left2.br_blockcount += irec1.br_blockcount;
+		else
+			memcpy(&adj.left2, &irec1, sizeof(irec1));
+
+		if (can_merge(&adj.left1, &irec2))
+			adj.left1.br_blockcount += irec2.br_blockcount;
+		else
+			memcpy(&adj.left1, &irec2, sizeof(irec2));
+
+		sxi_advance(sxi, &irec1);
+	}
+
+	/* Account for the blocks that are being exchanged. */
+	if (XFS_IS_REALTIME_INODE(req->ip1) &&
+	    req->whichfork == XFS_DATA_FORK) {
+		req->ip1_rtbcount = ip1_blocks;
+		req->ip2_rtbcount = ip2_blocks;
+	} else {
+		req->ip1_bcount = ip1_blocks;
+		req->ip2_bcount = ip2_blocks;
+	}
+
+	/*
+	 * Make sure that both forks have enough slack left in their extent
+	 * counters that the swap operation will not overflow.
+	 */
+	trace_xfs_swapext_delta_nextents(req, d_nexts1, d_nexts2);
+	if (req->ip1 == req->ip2) {
+		error = ensure_delta_nextents(req, req->ip1,
+				d_nexts1 + d_nexts2);
+	} else {
+		error = ensure_delta_nextents(req, req->ip1, d_nexts1);
+		if (error)
+			goto out_free;
+		error = ensure_delta_nextents(req, req->ip2, d_nexts2);
+	}
+	if (error)
+		goto out_free;
+
+	trace_xfs_swapext_initial_estimate(req);
+	error = xfs_swapext_estimate_overhead(req);
+out_free:
+	kmem_cache_free(xfs_swapext_intent_cache, sxi);
+	return error;
+}
+
+static inline void
+xfs_swapext_set_reflink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	trace_xfs_reflink_set_inode_flag(ip);
+
+	ip->i_diflags2 |= XFS_DIFLAG2_REFLINK;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/*
+ * If either file has shared blocks and we're swapping data forks, we must flag
+ * the other file as having shared blocks so that we get the shared-block rmap
+ * functions if we need to fix up the rmaps.
+ */
+void
+xfs_swapext_ensure_reflink(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_intent	*sxi,
+	unsigned int			reflink_state)
+{
+	if ((reflink_state & XFS_REFLINK_STATE_IP1) &&
+	    !xfs_is_reflink_inode(sxi->sxi_ip2))
+		xfs_swapext_set_reflink(tp, sxi->sxi_ip2);
+
+	if ((reflink_state & XFS_REFLINK_STATE_IP2) &&
+	    !xfs_is_reflink_inode(sxi->sxi_ip1))
+		xfs_swapext_set_reflink(tp, sxi->sxi_ip1);
+}
+
+/* Widen the extent counts of both inodes if necessary. */
+static inline void
+xfs_swapext_upgrade_extent_counts(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_intent	*sxi)
+{
+	if (!(sxi->sxi_op_flags & XFS_SWAP_EXT_OP_NREXT64))
+		return;
+
+	sxi->sxi_ip1->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	xfs_trans_log_inode(tp, sxi->sxi_ip1, XFS_ILOG_CORE);
+
+	sxi->sxi_ip2->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
+}
+
+/*
+ * Schedule a swap of a range of extents from one inode to another.  If the atomic
+ * swap feature is enabled, then the operation progress can be resumed even if
+ * the system goes down.  The caller must commit the transaction to start the
+ * work.
+ *
+ * The caller must ensure that both inodes are joined to the transaction and
+ * ILOCKed; they will still be joined to the transaction at exit.
+ */
+void
+xfs_swapext(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_swapext_intent	*sxi;
+	unsigned int			reflink_state;
+
+	ASSERT(xfs_isilocked(req->ip1, XFS_ILOCK_EXCL));
+	ASSERT(xfs_isilocked(req->ip2, XFS_ILOCK_EXCL));
+	ASSERT(req->whichfork != XFS_COW_FORK);
+	ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
+	if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
+		ASSERT(req->whichfork == XFS_DATA_FORK);
+
+	if (req->blockcount == 0)
+		return;
+
+	sxi = xfs_swapext_init_intent(req, &reflink_state);
+	xfs_swapext_schedule(tp, sxi);
+	xfs_swapext_ensure_reflink(tp, sxi, reflink_state);
+	xfs_swapext_upgrade_extent_counts(tp, sxi);
+}
diff --git a/libxfs/xfs_swapext.h b/libxfs/xfs_swapext.h
index 316323339d7..1987897ddc2 100644
--- a/libxfs/xfs_swapext.h
+++ b/libxfs/xfs_swapext.h
@@ -21,4 +21,146 @@ static inline bool xfs_swapext_supported(struct xfs_mount *mp)
 	       !xfs_has_realtime(mp);
 }
 
+/*
+ * In-core information about an extent swap request between ranges of two
+ * inodes.
+ */
+struct xfs_swapext_intent {
+	/* List of other incore deferred work. */
+	struct list_head	sxi_list;
+
+	/* Inodes participating in the operation. */
+	struct xfs_inode	*sxi_ip1;
+	struct xfs_inode	*sxi_ip2;
+
+	/* File offset range information. */
+	xfs_fileoff_t		sxi_startoff1;
+	xfs_fileoff_t		sxi_startoff2;
+	xfs_filblks_t		sxi_blockcount;
+
+	/* Set these file sizes after the operation, unless negative. */
+	xfs_fsize_t		sxi_isize1;
+	xfs_fsize_t		sxi_isize2;
+
+	/* XFS_SWAP_EXT_* log operation flags */
+	unsigned int		sxi_flags;
+
+	/* XFS_SWAP_EXT_OP_* flags */
+	unsigned int		sxi_op_flags;
+};
+
+/* Use log intent items to track and restart the entire operation. */
+#define XFS_SWAP_EXT_OP_LOGGED	(1U << 0)
+
+/* Upgrade files to have large extent counts before proceeding. */
+#define XFS_SWAP_EXT_OP_NREXT64	(1U << 1)
+
+#define XFS_SWAP_EXT_OP_STRINGS \
+	{ XFS_SWAP_EXT_OP_LOGGED,		"LOGGED" }, \
+	{ XFS_SWAP_EXT_OP_NREXT64,		"NREXT64" }
+
+static inline int
+xfs_swapext_whichfork(const struct xfs_swapext_intent *sxi)
+{
+	if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
+		return XFS_ATTR_FORK;
+	return XFS_DATA_FORK;
+}
+
+/* Parameters for a swapext request. */
+struct xfs_swapext_req {
+	/* Inodes participating in the operation. */
+	struct xfs_inode	*ip1;
+	struct xfs_inode	*ip2;
+
+	/* File offset range information. */
+	xfs_fileoff_t		startoff1;
+	xfs_fileoff_t		startoff2;
+	xfs_filblks_t		blockcount;
+
+	/* Data or attr fork? */
+	int			whichfork;
+
+	/* XFS_SWAP_REQ_* operation flags */
+	unsigned int		req_flags;
+
+	/*
+	 * Fields below this line are filled out by xfs_swapext_estimate;
+	 * callers should initialize this part of the struct to zero.
+	 */
+
+	/*
+	 * Data device blocks to be moved out of ip1, and free space needed to
+	 * handle the bmbt changes.
+	 */
+	xfs_filblks_t		ip1_bcount;
+
+	/*
+	 * Data device blocks to be moved out of ip2, and free space needed to
+	 * handle the bmbt changes.
+	 */
+	xfs_filblks_t		ip2_bcount;
+
+	/* rt blocks to be moved out of ip1. */
+	xfs_filblks_t		ip1_rtbcount;
+
+	/* rt blocks to be moved out of ip2. */
+	xfs_filblks_t		ip2_rtbcount;
+
+	/* Free space needed to handle the bmbt changes */
+	unsigned long long	resblks;
+
+	/* Number of extent swaps needed to complete the operation */
+	unsigned long long	nr_exchanges;
+};
+
+/* Caller has permission to use log intent items for the swapext operation. */
+#define XFS_SWAP_REQ_LOGGED		(1U << 0)
+
+/* Set the file sizes when finished. */
+#define XFS_SWAP_REQ_SET_SIZES		(1U << 1)
+
+/* Do not swap any part of the range where ip1's mapping is a hole. */
+#define XFS_SWAP_REQ_SKIP_INO1_HOLES	(1U << 2)
+
+/* Files need to be upgraded to have large extent counts. */
+#define XFS_SWAP_REQ_NREXT64		(1U << 3)
+
+#define XFS_SWAP_REQ_FLAGS		(XFS_SWAP_REQ_LOGGED | \
+					 XFS_SWAP_REQ_SET_SIZES | \
+					 XFS_SWAP_REQ_SKIP_INO1_HOLES | \
+					 XFS_SWAP_REQ_NREXT64)
+
+#define XFS_SWAP_REQ_STRINGS \
+	{ XFS_SWAP_REQ_LOGGED,			"LOGGED" }, \
+	{ XFS_SWAP_REQ_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_SWAP_REQ_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
+	{ XFS_SWAP_REQ_NREXT64,			"NREXT64" }
+
+unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
+void xfs_swapext_reflink_finish(struct xfs_trans *tp,
+		const struct xfs_swapext_req *req, unsigned int reflink_state);
+
+int xfs_swapext_estimate(struct xfs_swapext_req *req);
+
+extern struct kmem_cache	*xfs_swapext_intent_cache;
+
+int __init xfs_swapext_intent_init_cache(void);
+void xfs_swapext_intent_destroy_cache(void);
+
+struct xfs_swapext_intent *xfs_swapext_init_intent(
+		const struct xfs_swapext_req *req, unsigned int *reflink_state);
+void xfs_swapext_ensure_reflink(struct xfs_trans *tp,
+		const struct xfs_swapext_intent *sxi, unsigned int reflink_state);
+
+void xfs_swapext_schedule(struct xfs_trans *tp,
+		struct xfs_swapext_intent *sxi);
+int xfs_swapext_finish_one(struct xfs_trans *tp,
+		struct xfs_swapext_intent *sxi);
+
+int xfs_swapext_check_extents(struct xfs_mount *mp,
+		const struct xfs_swapext_req *req);
+
+void xfs_swapext(struct xfs_trans *tp, const struct xfs_swapext_req *req);
+
 #endif /* __XFS_SWAPEXT_H_ */
diff --git a/libxfs/xfs_trans_space.h b/libxfs/xfs_trans_space.h
index 87b31c69a77..9640fc232c1 100644
--- a/libxfs/xfs_trans_space.h
+++ b/libxfs/xfs_trans_space.h
@@ -10,6 +10,10 @@
  * Components of space reservations.
  */
 
+/* Worst case number of bmaps that can be held in a block. */
+#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)    \
+		(((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0]))
+
 /* Worst case number of rmaps that can be held in a block. */
 #define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)    \
 		(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 06/19] xfs: add error injection to test swapext recovery
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:17   ` [PATCH 02/19] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 08/19] xfs: condense directories after an atomic swap Darrick J. Wong
                     ` (13 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add an errortag so that we can test recovery of swapext log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/inject.c           |    1 +
 libxfs/xfs_errortag.h |    4 +++-
 libxfs/xfs_swapext.c  |    4 ++++
 3 files changed, 8 insertions(+), 1 deletion(-)


diff --git a/io/inject.c b/io/inject.c
index 6ef1fc8d2f4..4b0cd76005c 100644
--- a/io/inject.c
+++ b/io/inject.c
@@ -63,6 +63,7 @@ error_tag(char *name)
 		{ XFS_ERRTAG_ATTR_LEAF_TO_NODE,		"attr_leaf_to_node" },
 		{ XFS_ERRTAG_WB_DELAY_MS,		"wb_delay_ms" },
 		{ XFS_ERRTAG_WRITE_DELAY_MS,		"write_delay_ms" },
+		{ XFS_ERRTAG_SWAPEXT_FINISH_ONE,	"swapext_finish_one" },
 		{ XFS_ERRTAG_MAX,			NULL }
 	};
 	int	count;
diff --git a/libxfs/xfs_errortag.h b/libxfs/xfs_errortag.h
index 01a9e86b303..263d62a8d70 100644
--- a/libxfs/xfs_errortag.h
+++ b/libxfs/xfs_errortag.h
@@ -63,7 +63,8 @@
 #define XFS_ERRTAG_ATTR_LEAF_TO_NODE			41
 #define XFS_ERRTAG_WB_DELAY_MS				42
 #define XFS_ERRTAG_WRITE_DELAY_MS			43
-#define XFS_ERRTAG_MAX					44
+#define XFS_ERRTAG_SWAPEXT_FINISH_ONE			44
+#define XFS_ERRTAG_MAX					45
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -111,5 +112,6 @@
 #define XFS_RANDOM_ATTR_LEAF_TO_NODE			1
 #define XFS_RANDOM_WB_DELAY_MS				3000
 #define XFS_RANDOM_WRITE_DELAY_MS			3000
+#define XFS_RANDOM_SWAPEXT_FINISH_ONE			1
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/libxfs/xfs_swapext.c b/libxfs/xfs_swapext.c
index 4c3dd4f7c7f..3557efedeb6 100644
--- a/libxfs/xfs_swapext.c
+++ b/libxfs/xfs_swapext.c
@@ -20,6 +20,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_quota_defs.h"
 #include "xfs_health.h"
+#include "xfs_errortag.h"
 
 struct kmem_cache	*xfs_swapext_intent_cache;
 
@@ -423,6 +424,9 @@ xfs_swapext_finish_one(
 			return error;
 	}
 
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_SWAPEXT_FINISH_ONE))
+		return -EIO;
+
 	/* If we still have work to do, ask for a new transaction. */
 	if (sxi_has_more_swap_work(sxi) || sxi_has_postop_work(sxi)) {
 		trace_xfs_swapext_defer(tp->t_mountp, sxi);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 07/19] xfs: condense extended attributes after an atomic swap
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 12/19] libhandle: add support for bulkstat v5 Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 09/19] xfs: condense symbolic links " Darrick J. Wong
                     ` (9 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new swapext flag that enables us to perform post-swap processing
on file2 once we're done swapping the extent maps.  If we were swapping
the extended attributes, we want to be able to convert file2's attr fork
from block to inline format.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and swap the attr forks when ready.
If one file is in extents format and the other is inline, we will have to
promote both to extents format to perform the swap.  After the swap, we
can try to condense the fixed file's attr fork back down to inline
format if possible.
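
As a rough illustration of how a future repair caller might drive this
(none of this is in the patch: the helper name xrep_swap_attr_forks is
made up, and the caller is assumed to have enabled the logged-swapext
feature and to have both ILOCKed inodes joined to a transaction with
enough reservation, normally sized via xfs_swapext_estimate()):

	/*
	 * Ask for a logged full-fork attr swap and let the swapext code try
	 * to squeeze ip2's attr fork back into the inode afterwards.
	 */
	static void
	xrep_swap_attr_forks(
		struct xfs_trans	*tp,
		struct xfs_inode	*temp_ip,	/* salvaged attrs */
		struct xfs_inode	*target_ip)	/* file being repaired */
	{
		struct xfs_swapext_req	req = {
			.ip1		= temp_ip,
			.ip2		= target_ip,
			.startoff1	= 0,
			.startoff2	= 0,
			/* cover the whole fork; real callers size this exactly */
			.blockcount	= XFS_MAX_FILEOFF,
			.whichfork	= XFS_ATTR_FORK,
			.req_flags	= XFS_SWAP_REQ_LOGGED |
					  XFS_SWAP_REQ_CVT_INO2_SF,
		};

		/* The caller commits tp afterwards to start the work. */
		xfs_swapext(tp, &req);
	}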

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_log_format.h |    9 ++++++--
 libxfs/xfs_swapext.c    |   51 ++++++++++++++++++++++++++++++++++++++++++++++-
 libxfs/xfs_swapext.h    |    9 ++++++--
 3 files changed, 64 insertions(+), 5 deletions(-)


diff --git a/libxfs/xfs_log_format.h b/libxfs/xfs_log_format.h
index 65a84fdefe5..378201a7002 100644
--- a/libxfs/xfs_log_format.h
+++ b/libxfs/xfs_log_format.h
@@ -906,18 +906,23 @@ struct xfs_swap_extent {
 /* Clear the reflink flag from inode2 after the operation. */
 #define XFS_SWAP_EXT_CLEAR_INO2_REFLINK	(1ULL << 4)
 
+/* Try to convert inode2 from block to short format at the end, if possible. */
+#define XFS_SWAP_EXT_CVT_INO2_SF	(1ULL << 5)
+
 #define XFS_SWAP_EXT_FLAGS		(XFS_SWAP_EXT_ATTR_FORK | \
 					 XFS_SWAP_EXT_SET_SIZES | \
 					 XFS_SWAP_EXT_SKIP_INO1_HOLES | \
 					 XFS_SWAP_EXT_CLEAR_INO1_REFLINK | \
-					 XFS_SWAP_EXT_CLEAR_INO2_REFLINK)
+					 XFS_SWAP_EXT_CLEAR_INO2_REFLINK | \
+					 XFS_SWAP_EXT_CVT_INO2_SF)
 
 #define XFS_SWAP_EXT_STRINGS \
 	{ XFS_SWAP_EXT_ATTR_FORK,		"ATTRFORK" }, \
 	{ XFS_SWAP_EXT_SET_SIZES,		"SETSIZES" }, \
 	{ XFS_SWAP_EXT_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
 	{ XFS_SWAP_EXT_CLEAR_INO1_REFLINK,	"CLEAR_INO1_REFLINK" }, \
-	{ XFS_SWAP_EXT_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }
+	{ XFS_SWAP_EXT_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }, \
+	{ XFS_SWAP_EXT_CVT_INO2_SF,		"CVT_INO2_SF" }
 
 /* This is the structure used to lay out an sxi log item in the log. */
 struct xfs_sxi_log_format {
diff --git a/libxfs/xfs_swapext.c b/libxfs/xfs_swapext.c
index 3557efedeb6..68aa34de0ed 100644
--- a/libxfs/xfs_swapext.c
+++ b/libxfs/xfs_swapext.c
@@ -21,6 +21,10 @@
 #include "xfs_quota_defs.h"
 #include "xfs_health.h"
 #include "xfs_errortag.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr.h"
 
 struct kmem_cache	*xfs_swapext_intent_cache;
 
@@ -119,7 +123,8 @@ static inline bool
 sxi_has_postop_work(const struct xfs_swapext_intent *sxi)
 {
 	return sxi->sxi_flags & (XFS_SWAP_EXT_CLEAR_INO1_REFLINK |
-				 XFS_SWAP_EXT_CLEAR_INO2_REFLINK);
+				 XFS_SWAP_EXT_CLEAR_INO2_REFLINK |
+				 XFS_SWAP_EXT_CVT_INO2_SF);
 }
 
 static inline void
@@ -348,6 +353,36 @@ xfs_swapext_exchange_mappings(
 	sxi_advance(sxi, irec1);
 }
 
+/* Convert inode2's leaf attr fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_attr_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_da_args	args = {
+		.dp		= sxi->sxi_ip2,
+		.geo		= tp->t_mountp->m_attr_geo,
+		.whichfork	= XFS_ATTR_FORK,
+		.trans		= tp,
+	};
+	struct xfs_buf		*bp;
+	int			forkoff;
+	int			error;
+
+	if (!xfs_attr_is_leaf(sxi->sxi_ip2))
+		return 0;
+
+	error = xfs_attr3_leaf_read(tp, sxi->sxi_ip2, 0, &bp);
+	if (error)
+		return error;
+
+	forkoff = xfs_attr_shortform_allfit(bp, sxi->sxi_ip2);
+	if (forkoff == 0)
+		return 0;
+
+	return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
+}
+
 static inline void
 xfs_swapext_clear_reflink(
 	struct xfs_trans	*tp,
@@ -365,6 +400,16 @@ xfs_swapext_do_postop_work(
 	struct xfs_trans		*tp,
 	struct xfs_swapext_intent	*sxi)
 {
+	if (sxi->sxi_flags & XFS_SWAP_EXT_CVT_INO2_SF) {
+		int			error = 0;
+
+		if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
+			error = xfs_swapext_attr_to_sf(tp, sxi);
+		sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
+		if (error)
+			return error;
+	}
+
 	if (sxi->sxi_flags & XFS_SWAP_EXT_CLEAR_INO1_REFLINK) {
 		xfs_swapext_clear_reflink(tp, sxi->sxi_ip1);
 		sxi->sxi_flags &= ~XFS_SWAP_EXT_CLEAR_INO1_REFLINK;
@@ -792,6 +837,8 @@ xfs_swapext_init_intent(
 
 	if (req->req_flags & XFS_SWAP_REQ_SKIP_INO1_HOLES)
 		sxi->sxi_flags |= XFS_SWAP_EXT_SKIP_INO1_HOLES;
+	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
+		sxi->sxi_flags |= XFS_SWAP_EXT_CVT_INO2_SF;
 
 	if (req->req_flags & XFS_SWAP_REQ_LOGGED)
 		sxi->sxi_op_flags |= XFS_SWAP_EXT_OP_LOGGED;
@@ -1011,6 +1058,8 @@ xfs_swapext(
 	ASSERT(!(req->req_flags & ~XFS_SWAP_REQ_FLAGS));
 	if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
 		ASSERT(req->whichfork == XFS_DATA_FORK);
+	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
+		ASSERT(req->whichfork == XFS_ATTR_FORK);
 
 	if (req->blockcount == 0)
 		return;
diff --git a/libxfs/xfs_swapext.h b/libxfs/xfs_swapext.h
index 1987897ddc2..6b610fea150 100644
--- a/libxfs/xfs_swapext.h
+++ b/libxfs/xfs_swapext.h
@@ -126,16 +126,21 @@ struct xfs_swapext_req {
 /* Files need to be upgraded to have large extent counts. */
 #define XFS_SWAP_REQ_NREXT64		(1U << 3)
 
+/* Try to convert inode2's fork to local format, if possible. */
+#define XFS_SWAP_REQ_CVT_INO2_SF	(1U << 4)
+
 #define XFS_SWAP_REQ_FLAGS		(XFS_SWAP_REQ_LOGGED | \
 					 XFS_SWAP_REQ_SET_SIZES | \
 					 XFS_SWAP_REQ_SKIP_INO1_HOLES | \
-					 XFS_SWAP_REQ_NREXT64)
+					 XFS_SWAP_REQ_NREXT64 | \
+					 XFS_SWAP_REQ_CVT_INO2_SF)
 
 #define XFS_SWAP_REQ_STRINGS \
 	{ XFS_SWAP_REQ_LOGGED,			"LOGGED" }, \
 	{ XFS_SWAP_REQ_SET_SIZES,		"SETSIZES" }, \
 	{ XFS_SWAP_REQ_SKIP_INO1_HOLES,		"SKIP_INO1_HOLES" }, \
-	{ XFS_SWAP_REQ_NREXT64,			"NREXT64" }
+	{ XFS_SWAP_REQ_NREXT64,			"NREXT64" }, \
+	{ XFS_SWAP_REQ_CVT_INO2_SF,		"CVT_INO2_SF" }
 
 unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
 void xfs_swapext_reflink_finish(struct xfs_trans *tp,


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 08/19] xfs: condense directories after an atomic swap
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 06/19] xfs: add error injection to test swapext recovery Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 11/19] xfs: enable atomic swapext feature Darrick J. Wong
                     ` (12 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The previous commit added a new swapext flag that enables us to perform
post-swap processing on file2 once we're done swapping the extent maps.
Now add this ability for directories.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and swap the data forks
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the swap.
After the swap, we can try to condense the fixed directory down to
inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_swapext.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_swapext.c b/libxfs/xfs_swapext.c
index 68aa34de0ed..cb5bafe3112 100644
--- a/libxfs/xfs_swapext.c
+++ b/libxfs/xfs_swapext.c
@@ -25,6 +25,8 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr_leaf.h"
 #include "xfs_attr.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_dir2.h"
 
 struct kmem_cache	*xfs_swapext_intent_cache;
 
@@ -383,6 +385,42 @@ xfs_swapext_attr_to_sf(
 	return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
 }
 
+/* Convert inode2's block dir fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_dir_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_da_args	args = {
+		.dp		= sxi->sxi_ip2,
+		.geo		= tp->t_mountp->m_dir_geo,
+		.whichfork	= XFS_DATA_FORK,
+		.trans		= tp,
+	};
+	struct xfs_dir2_sf_hdr	sfh;
+	struct xfs_buf		*bp;
+	bool			isblock;
+	int			size;
+	int			error;
+
+	error = xfs_dir2_isblock(&args, &isblock);
+	if (error)
+		return error;
+
+	if (!isblock)
+		return 0;
+
+	error = xfs_dir3_block_read(tp, sxi->sxi_ip2, &bp);
+	if (error)
+		return error;
+
+	size = xfs_dir2_block_sfsize(sxi->sxi_ip2, bp->b_addr, &sfh);
+	if (size > xfs_inode_data_fork_size(sxi->sxi_ip2))
+		return 0;
+
+	return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
+}
+
 static inline void
 xfs_swapext_clear_reflink(
 	struct xfs_trans	*tp,
@@ -405,6 +443,8 @@ xfs_swapext_do_postop_work(
 
 		if (sxi->sxi_flags & XFS_SWAP_EXT_ATTR_FORK)
 			error = xfs_swapext_attr_to_sf(tp, sxi);
+		else if (S_ISDIR(VFS_I(sxi->sxi_ip2)->i_mode))
+			error = xfs_swapext_dir_to_sf(tp, sxi);
 		sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
 		if (error)
 			return error;
@@ -1059,7 +1099,9 @@ xfs_swapext(
 	if (req->req_flags & XFS_SWAP_REQ_SET_SIZES)
 		ASSERT(req->whichfork == XFS_DATA_FORK);
 	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
-		ASSERT(req->whichfork == XFS_ATTR_FORK);
+		ASSERT(req->whichfork == XFS_ATTR_FORK ||
+		       (req->whichfork == XFS_DATA_FORK &&
+			S_ISDIR(VFS_I(req->ip2)->i_mode)));
 
 	if (req->blockcount == 0)
 		return;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 09/19] xfs: condense symbolic links after an atomic swap
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (8 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 07/19] xfs: condense extended attributes after an atomic swap Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 10/19] xfs: make atomic extent swapping support realtime files Darrick J. Wong
                     ` (8 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The previous commit added a new swapext flag that enables us to perform
post-swap processing on file2 once we're done swapping the extent maps.
Now add this ability for symlinks.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online symlink repair feature can
salvage the remote target in a temporary link and swap the data forks
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the swap.
After the swap, we can try to condense the fixed symlink down to inline
format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_swapext.c        |   48 ++++++++++++++++++++++++++++++++++++++++++-
 libxfs/xfs_symlink_remote.c |   47 ++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_symlink_remote.h |    1 +
 3 files changed, 95 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_swapext.c b/libxfs/xfs_swapext.c
index cb5bafe3112..e439b938f10 100644
--- a/libxfs/xfs_swapext.c
+++ b/libxfs/xfs_swapext.c
@@ -27,6 +27,7 @@
 #include "xfs_attr.h"
 #include "xfs_dir2_priv.h"
 #include "xfs_dir2.h"
+#include "xfs_symlink_remote.h"
 
 struct kmem_cache	*xfs_swapext_intent_cache;
 
@@ -421,6 +422,48 @@ xfs_swapext_dir_to_sf(
 	return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
 }
 
+/* Convert inode2's remote symlink target back to shortform, if possible. */
+STATIC int
+xfs_swapext_link_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_inode		*ip = sxi->sxi_ip2;
+	struct xfs_ifork		*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+	char				*buf;
+	int				error;
+
+	if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
+	    ip->i_disk_size > xfs_inode_data_fork_size(ip))
+		return 0;
+
+	/* Read the current symlink target into a buffer. */
+	buf = kmem_alloc(ip->i_disk_size + 1, KM_NOFS);
+	if (!buf) {
+		ASSERT(0);
+		return -ENOMEM;
+	}
+
+	error = xfs_symlink_remote_read(ip, buf);
+	if (error)
+		goto free;
+
+	/* Remove the blocks. */
+	error = xfs_symlink_remote_truncate(tp, ip);
+	if (error)
+		goto free;
+
+	/* Convert fork to local format and log our changes. */
+	xfs_idestroy_fork(ifp);
+	ifp->if_bytes = 0;
+	ifp->if_format = XFS_DINODE_FMT_LOCAL;
+	xfs_init_local_fork(ip, XFS_DATA_FORK, buf, ip->i_disk_size);
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+free:
+	kmem_free(buf);
+	return error;
+}
+
 static inline void
 xfs_swapext_clear_reflink(
 	struct xfs_trans	*tp,
@@ -445,6 +488,8 @@ xfs_swapext_do_postop_work(
 			error = xfs_swapext_attr_to_sf(tp, sxi);
 		else if (S_ISDIR(VFS_I(sxi->sxi_ip2)->i_mode))
 			error = xfs_swapext_dir_to_sf(tp, sxi);
+		else if (S_ISLNK(VFS_I(sxi->sxi_ip2)->i_mode))
+			error = xfs_swapext_link_to_sf(tp, sxi);
 		sxi->sxi_flags &= ~XFS_SWAP_EXT_CVT_INO2_SF;
 		if (error)
 			return error;
@@ -1101,7 +1146,8 @@ xfs_swapext(
 	if (req->req_flags & XFS_SWAP_REQ_CVT_INO2_SF)
 		ASSERT(req->whichfork == XFS_ATTR_FORK ||
 		       (req->whichfork == XFS_DATA_FORK &&
-			S_ISDIR(VFS_I(req->ip2)->i_mode)));
+			(S_ISDIR(VFS_I(req->ip2)->i_mode) ||
+			 S_ISLNK(VFS_I(req->ip2)->i_mode))));
 
 	if (req->blockcount == 0)
 		return;
diff --git a/libxfs/xfs_symlink_remote.c b/libxfs/xfs_symlink_remote.c
index 740db58d99c..3565135bead 100644
--- a/libxfs/xfs_symlink_remote.c
+++ b/libxfs/xfs_symlink_remote.c
@@ -388,3 +388,50 @@ xfs_symlink_write_target(
 	ASSERT(pathlen == 0);
 	return 0;
 }
+
+/* Remove all the blocks from a symlink and invalidate buffers. */
+int
+xfs_symlink_remote_truncate(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*bp;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			done = 0;
+	int			i;
+	int			error;
+
+	/* Read mappings and invalidate buffers. */
+	error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
+	if (error)
+		return error;
+
+	for (i = 0; i < nmaps; i++) {
+		if (!xfs_bmap_is_real_extent(&mval[i]))
+			break;
+
+		error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
+				XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
+				XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
+				&bp);
+		if (error)
+			return error;
+
+		xfs_trans_binval(tp, bp);
+	}
+
+	/* Unmap the remote blocks. */
+	error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
+	if (error)
+		return error;
+	if (!done) {
+		ASSERT(done);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+		return -EFSCORRUPTED;
+	}
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	return 0;
+}
diff --git a/libxfs/xfs_symlink_remote.h b/libxfs/xfs_symlink_remote.h
index d81461c06b6..05eb9c3937d 100644
--- a/libxfs/xfs_symlink_remote.h
+++ b/libxfs/xfs_symlink_remote.h
@@ -23,5 +23,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
 int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
 		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
 		uint resblks);
+int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 10/19] xfs: make atomic extent swapping support realtime files
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (9 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 09/19] xfs: condense symbolic links " Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 05/19] xfs: create deferred log items for extent swapping Darrick J. Wong
                     ` (7 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that bmap items support the realtime device, we can add the
necessary pieces to the atomic extent swapping code to support realtime
files.
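
To make the alignment rule concrete (a sketch, not code from this patch;
the helper name is invented): on a filesystem with sb_rextsize = 16
blocks, every mapping fed into the swap must start and end on a realtime
extent boundary, which is what the debug-only checks below verify with
div_u64_rem():

	/* Does this mapping line up with the realtime extent size? */
	static inline bool
	rtext_mapping_aligned(
		struct xfs_mount		*mp,
		const struct xfs_bmbt_irec	*irec)
	{
		uint32_t			mod;

		div_u64_rem(irec->br_startoff, mp->m_sb.sb_rextsize, &mod);
		if (mod)
			return false;

		div_u64_rem(irec->br_blockcount, mp->m_sb.sb_rextsize, &mod);
		return mod == 0;
	}

An 8-block mapping at file offset 24 fails both tests on that filesystem,
so xfs_swapext_check_extents() bails out with -EINVAL instead of risking
a partially swapped rt extent becoming visible after a crash.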

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/xfs_inode.h  |    5 ++
 libxfs/xfs_swapext.c |  109 +++++++++++++++++++++++++++++++++++++++++++++++---
 libxfs/xfs_swapext.h |    5 +-
 3 files changed, 110 insertions(+), 9 deletions(-)


diff --git a/include/xfs_inode.h b/include/xfs_inode.h
index b0bba1094e7..489fd7d107d 100644
--- a/include/xfs_inode.h
+++ b/include/xfs_inode.h
@@ -232,6 +232,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
 }
 
+static inline bool xfs_inode_has_bigrtextents(struct xfs_inode *ip)
+{
+	return XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1;
+}
+
 /* Always set the child's GID to this value, even if the parent is setgid. */
 #define CRED_FORCE_GID	(1U << 0)
 struct cred {
diff --git a/libxfs/xfs_swapext.c b/libxfs/xfs_swapext.c
index e439b938f10..9be29f610f2 100644
--- a/libxfs/xfs_swapext.c
+++ b/libxfs/xfs_swapext.c
@@ -140,6 +140,108 @@ sxi_advance(
 	sxi->sxi_blockcount -= irec->br_blockcount;
 }
 
+#ifdef DEBUG
+static inline bool
+xfs_swapext_need_rt_conversion(
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_inode		*ip = req->ip2;
+	struct xfs_mount		*mp = ip->i_mount;
+
+	/* xattrs don't live on the rt device */
+	if (req->whichfork == XFS_ATTR_FORK)
+		return false;
+
+	/*
+	 * Caller got permission to use logged swapext, so log recovery will
+	 * finish the swap and not leave us with partially swapped rt extents
+	 * exposed to userspace.
+	 */
+	if (req->req_flags & XFS_SWAP_REQ_LOGGED)
+		return false;
+
+	/*
+	 * If we can't use log intent items at all, the only supported
+	 * operation is full fork swaps.
+	 */
+	if (!xfs_swapext_supported(mp))
+		return false;
+
+	/* Conversion is only needed for realtime files with big rt extents */
+	return xfs_inode_has_bigrtextents(ip);
+}
+
+static inline int
+xfs_swapext_check_rt_extents(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	xfs_fileoff_t			startoff1 = req->startoff1;
+	xfs_fileoff_t			startoff2 = req->startoff2;
+	xfs_filblks_t			blockcount = req->blockcount;
+	uint32_t			mod;
+	int				nimaps;
+	int				error;
+
+	if (!xfs_swapext_need_rt_conversion(req))
+		return 0;
+
+	while (blockcount > 0) {
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip1, startoff1, blockcount,
+				&irec1, &nimaps, 0);
+		if (error)
+			return error;
+		ASSERT(nimaps == 1);
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip2, startoff2,
+				irec1.br_blockcount, &irec2, &nimaps,
+				0);
+		if (error)
+			return error;
+		ASSERT(nimaps == 1);
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1.br_blockcount = min(irec1.br_blockcount,
+					  irec2.br_blockcount);
+
+		/* Both mappings must be aligned to the realtime extent size. */
+		div_u64_rem(irec1.br_startoff, mp->m_sb.sb_rextsize, &mod);
+		if (mod) {
+			ASSERT(mod == 0);
+			return -EINVAL;
+		}
+
+		div_u64_rem(irec2.br_startoff, mp->m_sb.sb_rextsize, &mod);
+		if (mod) {
+			ASSERT(mod == 0);
+			return -EINVAL;
+		}
+
+		div_u64_rem(irec1.br_blockcount, mp->m_sb.sb_rextsize, &mod);
+		if (mod) {
+			ASSERT(mod == 0);
+			return -EINVAL;
+		}
+
+		startoff1 += irec1.br_blockcount;
+		startoff2 += irec1.br_blockcount;
+		blockcount -= irec1.br_blockcount;
+	}
+
+	return 0;
+}
+#else
+# define xfs_swapext_check_rt_extents(mp, req)		(0)
+#endif
+
 /* Check all extents to make sure we can actually swap them. */
 int
 xfs_swapext_check_extents(
@@ -159,12 +261,7 @@ xfs_swapext_check_extents(
 	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
 		return -EINVAL;
 
-	/* We don't support realtime data forks yet. */
-	if (!XFS_IS_REALTIME_INODE(req->ip1))
-		return 0;
-	if (req->whichfork == XFS_ATTR_FORK)
-		return 0;
-	return -EINVAL;
+	return xfs_swapext_check_rt_extents(mp, req);
 }
 
 #ifdef CONFIG_XFS_QUOTA
diff --git a/libxfs/xfs_swapext.h b/libxfs/xfs_swapext.h
index 6b610fea150..155add23d8e 100644
--- a/libxfs/xfs_swapext.h
+++ b/libxfs/xfs_swapext.h
@@ -13,12 +13,11 @@
  * This can be done to individual file extents by using the block mapping log
  * intent items introduced with reflink and rmap; or to entire file ranges
  * using swapext log intent items to track the overall progress across multiple
- * extent mappings.  Realtime is not supported yet.
+ * extent mappings.
  */
 static inline bool xfs_swapext_supported(struct xfs_mount *mp)
 {
-	return (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) &&
-	       !xfs_has_realtime(mp);
+	return xfs_has_reflink(mp) || xfs_has_rmapbt(mp);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 11/19] xfs: enable atomic swapext feature
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 08/19] xfs: condense directories after an atomic swap Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 12/19] libhandle: add support for bulkstat v5 Darrick J. Wong
                     ` (11 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add the atomic swapext feature to the set of features that we will
permit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_format.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_format.h b/libxfs/xfs_format.h
index bb8bff48801..0c457905cce 100644
--- a/libxfs/xfs_format.h
+++ b/libxfs/xfs_format.h
@@ -393,7 +393,8 @@ xfs_sb_has_incompat_feature(
 #define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS   (1 << 0)	/* Delayed Attributes */
 #define XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT  (1U << 31)	/* file extent swap */
 #define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
-	(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
+		(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS | \
+		 XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
 xfs_sb_has_incompat_log_feature(


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 12/19] libhandle: add support for bulkstat v5
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 11/19] xfs: enable atomic swapext feature Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 07/19] xfs: condense extended attributes after an atomic swap Darrick J. Wong
                     ` (10 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add support to libhandle for generating file handles with bulkstat v5
structures.  xfs_fsr will need this to be able to interface with the new
vfs range swap ioctl, and other client programs will probably want this
over time.
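
A rough sketch of the intended usage from a client like xfs_fsr (error
handling trimmed, open_by_bulkstat() is a made-up name, and the header
paths are approximate):

	#include <fcntl.h>
	#include "jdm.h"
	#include "libfrog/fsgeom.h"
	#include "libfrog/bulkstat.h"

	/* Open an inode by handle using its bulkstat v5 record. */
	static int
	open_by_bulkstat(
		struct xfs_fd		*xfd,		/* target filesystem */
		jdm_fshandle_t		*fshandle,	/* from jdm_getfshandle() */
		uint64_t		ino)
	{
		struct xfs_bulkstat	bstat;
		int			ret;

		ret = xfrog_bulkstat_single(xfd, ino, 0, &bstat);
		if (ret)
			return ret;	/* negative errno */

		/* Returns an open fd, or -1 with errno set. */
		return jdm_open_v5(fshandle, &bstat, O_RDONLY);
	}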

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/jdm.h   |   24 +++++++++++
 libhandle/jdm.c |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 141 insertions(+)


diff --git a/include/jdm.h b/include/jdm.h
index c57fcae7fca..445737a6b5f 100644
--- a/include/jdm.h
+++ b/include/jdm.h
@@ -11,6 +11,7 @@ typedef void	jdm_fshandle_t;		/* filesystem handle */
 typedef void	jdm_filehandle_t;	/* filehandle */
 
 struct xfs_bstat;
+struct xfs_bulkstat;
 struct attrlist_cursor;
 struct parent;
 
@@ -23,6 +24,9 @@ jdm_new_filehandle( jdm_filehandle_t **handlep,	/* new filehandle */
 		    jdm_fshandle_t *fshandlep,	/* filesystem filehandle */
 		    struct xfs_bstat *sp);	/* bulkstat info */
 
+extern void jdm_new_filehandle_v5(jdm_filehandle_t **handlep, size_t *hlen,
+		jdm_fshandle_t *fshandlep, struct xfs_bulkstat *sp);
+
 extern void
 jdm_delete_filehandle( jdm_filehandle_t *handlep,/* filehandle to delete */
 		       size_t hlen);		/* filehandle size */
@@ -32,35 +36,55 @@ jdm_open( jdm_fshandle_t *fshandlep,
 	  struct xfs_bstat *sp,
 	  intgen_t oflags);
 
+extern intgen_t jdm_open_v5(jdm_fshandle_t *fshandlep, struct xfs_bulkstat *sp,
+		intgen_t oflags);
+
 extern intgen_t
 jdm_readlink( jdm_fshandle_t *fshandlep,
 	      struct xfs_bstat *sp,
 	      char *bufp,
 	      size_t bufsz);
 
+extern intgen_t jdm_readlink_v5(jdm_fshandle_t *fshandlep,
+		struct xfs_bulkstat *sp, char *bufp, size_t bufsz);
+
 extern intgen_t
 jdm_attr_multi(	jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
 		char *bufp, int rtrvcnt, int flags);
 
+extern intgen_t jdm_attr_multi_v5(jdm_fshandle_t *fshp,
+		struct xfs_bulkstat *statp, char *bufp, int rtrvcnt,
+		int flags);
+
 extern intgen_t
 jdm_attr_list(	jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
 		char *bufp, size_t bufsz, int flags,
 		struct attrlist_cursor *cursor);
 
+extern intgen_t jdm_attr_list_v5(jdm_fshandle_t *fshp,
+		struct xfs_bulkstat *statp, char *bufp, size_t bufsz, int
+		flags, struct attrlist_cursor *cursor);
+
 extern int
 jdm_parents( jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
 		struct parent *bufp, size_t bufsz,
 		unsigned int *count);
 
+extern int jdm_parents_v5(jdm_fshandle_t *fshp, struct xfs_bulkstat *statp,
+		struct parent *bufp, size_t bufsz, unsigned int *count);
+
 extern int
 jdm_parentpaths( jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
 		struct parent *bufp, size_t bufsz,
 		unsigned int *count);
 
+extern int jdm_parentpaths_v5(jdm_fshandle_t *fshp, struct xfs_bulkstat *statp,
+		struct parent *bufp, size_t bufsz, unsigned int *count);
+
 /* macro for determining the size of a structure member */
 #define sizeofmember( t, m )	sizeof( ( ( t * )0 )->m )
 
diff --git a/libhandle/jdm.c b/libhandle/jdm.c
index 07b0c60985e..e21aff2b2c1 100644
--- a/libhandle/jdm.c
+++ b/libhandle/jdm.c
@@ -41,6 +41,19 @@ jdm_fill_filehandle( filehandle_t *handlep,
 	handlep->fh_ino = statp->bs_ino;
 }
 
+static void
+jdm_fill_filehandle_v5(
+	struct filehandle	*handlep,
+	struct fshandle		*fshandlep,
+	struct xfs_bulkstat	*statp)
+{
+	handlep->fh_fshandle = *fshandlep;
+	handlep->fh_sz_following = FILEHANDLE_SZ_FOLLOWING;
+	memset(handlep->fh_pad, 0, FILEHANDLE_SZ_PAD);
+	handlep->fh_gen = statp->bs_gen;
+	handlep->fh_ino = statp->bs_ino;
+}
+
 jdm_fshandle_t *
 jdm_getfshandle( char *mntpnt )
 {
@@ -90,6 +103,22 @@ jdm_new_filehandle( jdm_filehandle_t **handlep,
 		jdm_fill_filehandle(*handlep, (fshandle_t *) fshandlep, statp);
 }
 
+void
+jdm_new_filehandle_v5(
+	jdm_filehandle_t	**handlep,
+	size_t			*hlen,
+	jdm_fshandle_t		*fshandlep,
+	struct xfs_bulkstat	*statp)
+{
+	/* allocate and fill filehandle */
+	*hlen = sizeof(filehandle_t);
+	*handlep = (filehandle_t *) malloc(*hlen);
+	if (!*handlep)
+		return;
+
+	jdm_fill_filehandle_v5(*handlep, (struct fshandle *)fshandlep, statp);
+}
+
 /* ARGSUSED */
 void
 jdm_delete_filehandle( jdm_filehandle_t *handlep, size_t hlen )
@@ -111,6 +140,19 @@ jdm_open( jdm_fshandle_t *fshp, struct xfs_bstat *statp, intgen_t oflags )
 	return fd;
 }
 
+intgen_t
+jdm_open_v5(
+	jdm_fshandle_t		*fshp,
+	struct xfs_bulkstat	*statp,
+	intgen_t		oflags)
+{
+	struct fshandle		*fshandlep = (struct fshandle *)fshp;
+	struct filehandle	filehandle;
+
+	jdm_fill_filehandle_v5(&filehandle, fshandlep, statp);
+	return open_by_fshandle(&filehandle, sizeof(filehandle), oflags);
+}
+
 intgen_t
 jdm_readlink( jdm_fshandle_t *fshp,
 	      struct xfs_bstat *statp,
@@ -128,6 +170,20 @@ jdm_readlink( jdm_fshandle_t *fshp,
 	return rval;
 }
 
+intgen_t
+jdm_readlink_v5(
+	jdm_fshandle_t		*fshp,
+	struct xfs_bulkstat	*statp,
+	char			*bufp,
+	size_t			bufsz)
+{
+	struct fshandle		*fshandlep = (struct fshandle *)fshp;
+	struct filehandle	filehandle;
+
+	jdm_fill_filehandle_v5(&filehandle, fshandlep, statp);
+	return readlink_by_handle(&filehandle, sizeof(filehandle), bufp, bufsz);
+}
+
 int
 jdm_attr_multi(	jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
@@ -145,6 +201,22 @@ jdm_attr_multi(	jdm_fshandle_t *fshp,
 	return rval;
 }
 
+int
+jdm_attr_multi_v5(
+	jdm_fshandle_t		*fshp,
+	struct xfs_bulkstat	*statp,
+	char			*bufp,
+	int			rtrvcnt,
+	int			flags)
+{
+	struct fshandle		*fshandlep = (struct fshandle *)fshp;
+	struct filehandle	filehandle;
+
+	jdm_fill_filehandle_v5(&filehandle, fshandlep, statp);
+	return attr_multi_by_handle(&filehandle, sizeof(filehandle), bufp,
+			rtrvcnt, flags);
+}
+
 int
 jdm_attr_list(	jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
@@ -166,6 +238,27 @@ jdm_attr_list(	jdm_fshandle_t *fshp,
 	return rval;
 }
 
+int
+jdm_attr_list_v5(
+	jdm_fshandle_t		*fshp,
+	struct xfs_bulkstat	*statp,
+	char			*bufp,
+	size_t			bufsz,
+	int			flags,
+	struct attrlist_cursor	*cursor)
+{
+	struct fshandle		*fshandlep = (struct fshandle *)fshp;
+	struct filehandle	filehandle;
+
+	/* prevent needless EINVAL from the kernel */
+	if (bufsz > XFS_XATTR_LIST_MAX)
+		bufsz = XFS_XATTR_LIST_MAX;
+
+	jdm_fill_filehandle_v5(&filehandle, fshandlep, statp);
+	return attr_list_by_handle(&filehandle, sizeof(filehandle), bufp,
+			bufsz, flags, cursor);
+}
+
 int
 jdm_parents( jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
@@ -176,6 +269,18 @@ jdm_parents( jdm_fshandle_t *fshp,
 	return -1;
 }
 
+int
+jdm_parents_v5(
+	jdm_fshandle_t		*fshp,
+	struct xfs_bulkstat	*statp,
+	struct parent		*bufp,
+	size_t			bufsz,
+	unsigned int		*count)
+{
+	errno = EOPNOTSUPP;
+	return -1;
+}
+
 int
 jdm_parentpaths( jdm_fshandle_t *fshp,
 		struct xfs_bstat *statp,
@@ -185,3 +290,15 @@ jdm_parentpaths( jdm_fshandle_t *fshp,
 	errno = EOPNOTSUPP;
 	return -1;
 }
+
+int
+jdm_parentpaths_v5(
+	jdm_fshandle_t		*fshp,
+	struct xfs_bulkstat	*statp,
+	struct parent		*bufp,
+	size_t			bufsz,
+	unsigned int		*count)
+{
+	errno = EOPNOTSUPP;
+	return -1;
+}


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 13/19] libfrog: convert xfs_io swapext command to use new libfrog wrapper
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (17 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 17/19] xfs_fsr: skip the xattr/forkoff levering with the newer swapext implementations Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an abstraction layer for the two swapext ioctls and port xfs_io
to use it.  Now we're insulated from the differences between the XFS v0
ioctl and the new vfs ioctl.
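
For example (a sketch only, not part of this patch; commit_donor_file(),
donor_fd, and donor_size are invented, and target_xfd is assumed to have
been opened and prepared with xfd_prepare_geometry() already), a
defragmenter could commit a rebuilt donor file into its target only if
the target hasn't changed since its contents were sampled:

	/* Swap donor_fd's contents into the file described by target_xfd. */
	static int
	commit_donor_file(
		struct xfs_fd		*target_xfd,
		int			donor_fd,
		int64_t			donor_size)
	{
		struct file_xchg_range	fxr;
		uint64_t		flags = FILE_XCHG_RANGE_COMMIT |
						FILE_XCHG_RANGE_FULL_FILES;
		int			ret;

		/* Sample the target's freshness data and fill out the request. */
		ret = xfrog_file_exchange_prep(target_xfd, flags, 0, donor_fd,
				0, donor_size, &fxr);
		if (ret)
			return ret;

		/* -EBUSY here means the target changed since it was sampled. */
		return xfrog_file_exchange(target_xfd, &fxr);
	}

The prep step records the target's inode number and timestamps so that
the kernel can refuse the exchange if the file was modified in the
meantime.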

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 configure.ac            |    1 
 include/builddefs.in    |    1 
 io/Makefile             |    4 +
 io/swapext.c            |   55 ++++++++-------
 libfrog/Makefile        |    6 ++
 libfrog/fiexchange.h    |  105 ++++++++++++++++++++++++++++++
 libfrog/file_exchange.c |  167 +++++++++++++++++++++++++++++++++++++++++++++++
 libfrog/file_exchange.h |   14 ++++
 libfrog/fsgeom.h        |    6 ++
 m4/package_libcdev.m4   |   20 ++++++
 10 files changed, 352 insertions(+), 27 deletions(-)
 create mode 100644 libfrog/fiexchange.h
 create mode 100644 libfrog/file_exchange.c
 create mode 100644 libfrog/file_exchange.h


diff --git a/configure.ac b/configure.ac
index 6c704464061..f4f1563da8b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -254,6 +254,7 @@ AC_HAVE_LIBURCU_ATOMIC64
 AC_HAVE_MEMFD_CLOEXEC
 AC_HAVE_O_TMPFILE
 AC_HAVE_MKOSTEMP_CLOEXEC
+AC_HAVE_FIEXCHANGE
 
 AC_CONFIG_FILES([include/builddefs])
 AC_OUTPUT
diff --git a/include/builddefs.in b/include/builddefs.in
index 60c1320af37..c0de6000c2a 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -130,6 +130,7 @@ HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@
 HAVE_MEMFD_CLOEXEC = @have_memfd_cloexec@
 HAVE_O_TMPFILE = @have_o_tmpfile@
 HAVE_MKOSTEMP_CLOEXEC = @have_mkostemp_cloexec@
+HAVE_FIEXCHANGE = @have_fiexchange@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/io/Makefile b/io/Makefile
index 498174cfc43..229f8f377b3 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -112,6 +112,10 @@ ifeq ($(HAVE_STATFS_FLAGS),yes)
 LCFLAGS += -DHAVE_STATFS_FLAGS
 endif
 
+ifeq ($(HAVE_FIEXCHANGE),yes)
+LCFLAGS += -DHAVE_FIEXCHANGE
+endif
+
 default: depend $(LTCOMMAND)
 
 include $(BUILDRULES)
diff --git a/io/swapext.c b/io/swapext.c
index a4153bb7d42..3f8a5c7b4d4 100644
--- a/io/swapext.c
+++ b/io/swapext.c
@@ -10,7 +10,8 @@
 #include "io.h"
 #include "libfrog/logging.h"
 #include "libfrog/fsgeom.h"
-#include "libfrog/bulkstat.h"
+#include "libfrog/fiexchange.h"
+#include "libfrog/file_exchange.h"
 
 static cmdinfo_t swapext_cmd;
 
@@ -28,47 +29,47 @@ swapext_f(
 	int			argc,
 	char			**argv)
 {
-	struct xfs_fd		fxfd = XFS_FD_INIT(file->fd);
-	struct xfs_bulkstat	bulkstat;
-	int			fd;
-	int			error;
-	struct xfs_swapext	sx;
+	struct xfs_fd		xfd = XFS_FD_INIT(file->fd);
+	struct file_xchg_range	fxr;
 	struct stat		stat;
+	uint64_t		flags = FILE_XCHG_RANGE_FILE2_FRESH |
+					FILE_XCHG_RANGE_FULL_FILES;
+	int			fd;
+	int			ret;
 
 	/* open the donor file */
 	fd = openfile(argv[1], NULL, 0, 0, NULL);
 	if (fd < 0)
 		return 0;
 
-	/*
-	 * stat the target file to get the inode number and use the latter to
-	 * get the bulkstat info for the swapext cmd.
-	 */
-	error = fstat(file->fd, &stat);
-	if (error) {
+	ret = -xfd_prepare_geometry(&xfd);
+	if (ret) {
+		xfrog_perror(ret, "xfd_prepare_geometry");
+		exitcode = 1;
+		goto out;
+	}
+
+	ret = fstat(file->fd, &stat);
+	if (ret) {
 		perror("fstat");
+		exitcode = 1;
 		goto out;
 	}
 
-	error = -xfrog_bulkstat_single(&fxfd, stat.st_ino, 0, &bulkstat);
-	if (error) {
-		xfrog_perror(error, "bulkstat");
+	ret = xfrog_file_exchange_prep(&xfd, flags, 0, fd, 0, stat.st_size,
+			&fxr);
+	if (ret) {
+		xfrog_perror(ret, "xfrog_file_exchange_prep");
+		exitcode = 1;
 		goto out;
 	}
-	error = -xfrog_bulkstat_v5_to_v1(&fxfd, &sx.sx_stat, &bulkstat);
-	if (error) {
-		xfrog_perror(error, "bulkstat conversion");
+
+	ret = xfrog_file_exchange(&xfd, &fxr);
+	if (ret) {
+		xfrog_perror(ret, "swapext");
+		exitcode = 1;
 		goto out;
 	}
-	sx.sx_version = XFS_SX_VERSION;
-	sx.sx_fdtarget = file->fd;
-	sx.sx_fdtmp = fd;
-	sx.sx_offset = 0;
-	sx.sx_length = stat.st_size;
-	error = ioctl(file->fd, XFS_IOC_SWAPEXT, &sx);
-	if (error)
-		perror("swapext");
-
 out:
 	close(fd);
 	return 0;
diff --git a/libfrog/Makefile b/libfrog/Makefile
index 0110708239a..66d2afe56fe 100644
--- a/libfrog/Makefile
+++ b/libfrog/Makefile
@@ -18,6 +18,7 @@ bitmap.c \
 bulkstat.c \
 convert.c \
 crc32.c \
+file_exchange.c \
 fsgeom.c \
 list_sort.c \
 linux.c \
@@ -39,6 +40,7 @@ crc32c.h \
 crc32cselftest.h \
 crc32defs.h \
 crc32table.h \
+file_exchange.h \
 fsgeom.h \
 logging.h \
 paths.h \
@@ -54,6 +56,10 @@ ifeq ($(HAVE_GETMNTENT),yes)
 LCFLAGS += -DHAVE_GETMNTENT
 endif
 
+ifeq ($(HAVE_FIEXCHANGE),yes)
+LCFLAGS += -DHAVE_FIEXCHANGE
+endif
+
 LDIRT = gen_crc32table crc32table.h crc32selftest
 
 default: crc32selftest ltdepend $(LTLIBRARY)
diff --git a/libfrog/fiexchange.h b/libfrog/fiexchange.h
new file mode 100644
index 00000000000..04ec42777d7
--- /dev/null
+++ b/libfrog/fiexchange.h
@@ -0,0 +1,105 @@
+#ifdef HAVE_FIEXCHANGE
+# include <linux/fiexchange.h>
+#endif
+
+/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */
+/*
+ * FIEXCHANGE ioctl definitions, to facilitate exchanging parts of files.
+ *
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _LINUX_FIEXCHANGE_H
+#define _LINUX_FIEXCHANGE_H
+
+#include <linux/types.h>
+
+/*
+ * Exchange part of file1 with part of the file that this ioctl is being
+ * called against (which we'll call file2).  Filesystems must be able to
+ * restart and complete the operation even after the system goes down.
+ */
+struct file_xchg_range {
+	__s64		file1_fd;
+	__s64		file1_offset;	/* file1 offset, bytes */
+	__s64		file2_offset;	/* file2 offset, bytes */
+	__s64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see FILE_XCHG_RANGE_* below */
+
+	/* file2 metadata for optional freshness checks */
+	__s64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+
+	__u64		pad[6];		/* must be zeroes */
+};
+
+/*
+ * Atomic exchange operations are not required.  This relaxes the requirement
+ * that the filesystem must be able to complete the operation after a crash.
+ */
+#define FILE_XCHG_RANGE_NONATOMIC	(1 << 0)
+
+/*
+ * Check file2's inode number, mtime, and ctime against the values
+ * provided, and return -EBUSY if there isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FILE2_FRESH	(1 << 1)
+
+/*
+ * Check that the file1's length is equal to file1_offset + length, and that
+ * file2's length is equal to file2_offset + length.  Returns -EDOM if there
+ * isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FULL_FILES	(1 << 2)
+
+/*
+ * Exchange file data all the way to the ends of both files, and then exchange
+ * the file sizes.  This flag can be used to replace a file's contents with a
+ * different amount of data.  length will be ignored.
+ */
+#define FILE_XCHG_RANGE_TO_EOF		(1 << 3)
+
+/* Flush all changes in file data and file metadata to disk before returning. */
+#define FILE_XCHG_RANGE_FSYNC		(1 << 4)
+
+/* Dry run; do all the parameter verification but do not change anything. */
+#define FILE_XCHG_RANGE_DRY_RUN		(1 << 5)
+
+/*
+ * Do not exchange any part of the range where file1's mapping is a hole.  This
+ * can be used to emulate scatter-gather atomic writes with a temp file.
+ */
+#define FILE_XCHG_RANGE_SKIP_FILE1_HOLES (1 << 6)
+
+/*
+ * Commit the contents of file1 into file2 if file2 has the same inode number,
+ * mtime, and ctime as the arguments provided to the call.  The old contents of
+ * file2 will be moved to file1.
+ *
+ * With this flag, all committed information can be retrieved even if the
+ * system crashes or is rebooted.  This includes writing through or flushing a
+ * disk cache if present.  The call blocks until the device reports that the
+ * commit is complete.
+ *
+ * This flag should not be combined with NONATOMIC.  It can be combined with
+ * SKIP_FILE1_HOLES.
+ */
+#define FILE_XCHG_RANGE_COMMIT		(FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FSYNC)
+
+#define FILE_XCHG_RANGE_ALL_FLAGS	(FILE_XCHG_RANGE_NONATOMIC | \
+					 FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FULL_FILES | \
+					 FILE_XCHG_RANGE_TO_EOF | \
+					 FILE_XCHG_RANGE_FSYNC | \
+					 FILE_XCHG_RANGE_DRY_RUN | \
+					 FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+
+#define FIEXCHANGE_RANGE	_IOWR('X', 129, struct file_xchg_range)
+
+#endif /* _LINUX_FIEXCHANGE_H */
diff --git a/libfrog/file_exchange.c b/libfrog/file_exchange.c
new file mode 100644
index 00000000000..00277f8f0fc
--- /dev/null
+++ b/libfrog/file_exchange.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <string.h>
+#include "xfs.h"
+#include "fsgeom.h"
+#include "bulkstat.h"
+#include "fiexchange.h"
+#include "file_exchange.h"
+
+/* Prepare the freshness component of a swapext request. */
+static int
+xfrog_file_exchange_prep_freshness(
+	struct xfs_fd		*dest,
+	struct file_xchg_range	*req)
+{
+	struct stat		stat;
+	struct xfs_bulkstat	bulkstat;
+	int			error;
+
+	error = fstat(dest->fd, &stat);
+	if (error)
+		return -errno;
+	req->file2_ino = stat.st_ino;
+
+	/*
+	 * Try to fill out the [cm]time data from bulkstat.  We prefer this
+	 * approach because bulkstat v5 gives us 64-bit time even on 32-bit.
+	 *
+	 * However, we'll take our chances on the C library if the filesystem
+	 * supports 64-bit time but we ended up with bulkstat v5 emulation.
+	 */
+	error = xfrog_bulkstat_single(dest, stat.st_ino, 0, &bulkstat);
+	if (!error &&
+	    !((dest->fsgeom.flags & XFS_FSOP_GEOM_FLAGS_BIGTIME) &&
+	      bulkstat.bs_version < XFS_BULKSTAT_VERSION_V5)) {
+		req->file2_mtime = bulkstat.bs_mtime;
+		req->file2_ctime = bulkstat.bs_ctime;
+		req->file2_mtime_nsec = bulkstat.bs_mtime_nsec;
+		req->file2_ctime_nsec = bulkstat.bs_ctime_nsec;
+		return 0;
+	}
+
+	/* Otherwise, use the stat information and hope for the best. */
+	req->file2_mtime = stat.st_mtime;
+	req->file2_ctime = stat.st_ctime;
+	req->file2_mtime_nsec = stat.st_mtim.tv_nsec;
+	req->file2_ctime_nsec = stat.st_ctim.tv_nsec;
+	return 0;
+}
+
+/* Prepare an extent swap request. */
+int
+xfrog_file_exchange_prep(
+	struct xfs_fd		*dest,
+	uint64_t		flags,
+	int64_t			file2_offset,
+	int			file1_fd,
+	int64_t			file1_offset,
+	int64_t			length,
+	struct file_xchg_range	*req)
+{
+	memset(req, 0, sizeof(*req));
+	req->file1_fd = file1_fd;
+	req->file1_offset = file1_offset;
+	req->length = length;
+	req->file2_offset = file2_offset;
+	req->flags = flags;
+
+	if (flags & FILE_XCHG_RANGE_FILE2_FRESH)
+		return xfrog_file_exchange_prep_freshness(dest, req);
+
+	return 0;
+}
+
+/* Swap two files' extents with the vfs swaprange ioctl. */
+static int
+xfrog_file_exchange_vfs(
+	struct xfs_fd		*xfd,
+	struct file_xchg_range	*req)
+{
+	int			ret;
+
+	ret = ioctl(xfd->fd, FIEXCHANGE_RANGE, req);
+	if (ret) {
+		/* the old swapext ioctl returned EFAULT for bad length */
+		if (errno == EDOM)
+			return -EFAULT;
+		return -errno;
+	}
+	return 0;
+}
+
+/*
+ * The old swapext ioctl did not provide atomic swap; it required that the
+ * supplied offset and length matched both files' lengths; and it also required
+ * that the sx_stat information match the dest file.  It doesn't support any
+ * other flags.
+ */
+#define FILE_XCHG_RANGE_SWAPEXT0	(FILE_XCHG_RANGE_NONATOMIC | \
+					 FILE_XCHG_RANGE_FULL_FILES | \
+					 FILE_XCHG_RANGE_FILE2_FRESH)
+
+/* Swap two files' extents with the old xfs swaprange ioctl. */
+static int
+xfrog_file_exchange0(
+	struct xfs_fd		*xfd,
+	struct file_xchg_range	*req)
+{
+	struct xfs_swapext	sx = {
+		.sx_version	= XFS_SX_VERSION,
+		.sx_fdtarget	= xfd->fd,
+		.sx_fdtmp	= req->file1_fd,
+		.sx_length	= req->length,
+	};
+	int			ret;
+
+	if (req->file1_offset != req->file2_offset)
+		return -EINVAL;
+	if (req->flags != FILE_XCHG_RANGE_SWAPEXT0)
+		return -EOPNOTSUPP;
+
+	sx.sx_stat.bs_ino = req->file2_ino;
+	sx.sx_stat.bs_ctime.tv_sec = req->file2_ctime;
+	sx.sx_stat.bs_ctime.tv_nsec = req->file2_ctime_nsec;
+	sx.sx_stat.bs_mtime.tv_sec = req->file2_mtime;
+	sx.sx_stat.bs_mtime.tv_nsec = req->file2_mtime_nsec;
+
+	ret = ioctl(xfd->fd, XFS_IOC_SWAPEXT, &sx);
+	if (ret)
+		return -errno;
+	return 0;
+}
+
+/* Swap extents between an XFS file and a donor fd. */
+int
+xfrog_file_exchange(
+	struct xfs_fd		*xfd,
+	struct file_xchg_range	*req)
+{
+	int			error;
+
+	if (xfd->flags & XFROG_FLAG_FORCE_SWAPEXT)
+		goto try_v0;
+
+	error = xfrog_file_exchange_vfs(xfd, req);
+	if ((error != -ENOTTY && error != -EOPNOTSUPP) ||
+	    (xfd->flags & XFROG_FLAG_FORCE_FIEXCHANGE))
+		return error;
+
+	/* If the vfs ioctl wasn't found, we punt to v0. */
+	switch (error) {
+	case -EOPNOTSUPP:
+	case -ENOTTY:
+		xfd->flags |= XFROG_FLAG_FORCE_SWAPEXT;
+		break;
+	}
+
+try_v0:
+	return xfrog_file_exchange0(xfd, req);
+}
diff --git a/libfrog/file_exchange.h b/libfrog/file_exchange.h
new file mode 100644
index 00000000000..a77d67514e8
--- /dev/null
+++ b/libfrog/file_exchange.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2022 Oracle, Inc.
+ * All Rights Reserved.
+ */
+#ifndef __LIBFROG_FILE_EXCHANGE_H__
+#define __LIBFROG_FILE_EXCHANGE_H__
+
+int xfrog_file_exchange_prep(struct xfs_fd *file2, uint64_t flags,
+		int64_t file2_offset, int file1_fd, int64_t file1_offset,
+		int64_t length, struct file_xchg_range *req);
+int xfrog_file_exchange(struct xfs_fd *xfd, struct file_xchg_range *req);
+
+#endif	/* __LIBFROG_FILE_EXCHANGE_H__ */
diff --git a/libfrog/fsgeom.h b/libfrog/fsgeom.h
index ca38324e853..9dfa986ff08 100644
--- a/libfrog/fsgeom.h
+++ b/libfrog/fsgeom.h
@@ -50,6 +50,12 @@ struct xfs_fd {
 /* Only use v5 bulkstat/inumbers ioctls. */
 #define XFROG_FLAG_BULKSTAT_FORCE_V5	(1 << 1)
 
+/* Only use the old XFS swapext ioctl for file data exchanges. */
+#define XFROG_FLAG_FORCE_SWAPEXT	(1 << 2)
+
+/* Only use FIEXCHANGE_RANGE for file data exchanges. */
+#define XFROG_FLAG_FORCE_FIEXCHANGE	(1 << 3)
+
 /* Static initializers */
 #define XFS_FD_INIT(_fd)	{ .fd = (_fd), }
 #define XFS_FD_INIT_EMPTY	XFS_FD_INIT(-1)
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 119d1bda74d..062730a1f06 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -557,3 +557,23 @@ AC_DEFUN([AC_HAVE_MKOSTEMP_CLOEXEC],
        AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
     AC_SUBST(have_mkostemp_cloexec)
   ])
+
+#
+# Check if we have a FIEXCHANGE_RANGE ioctl (Linux)
+#
+AC_DEFUN([AC_HAVE_FIEXCHANGE],
+  [ AC_MSG_CHECKING([for FIEXCHANGE_RANGE])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <linux/fs.h>
+#include <linux/fiexchange.h>
+    ]], [[
+         struct file_xchg_range fxr;
+         ioctl(-1, FIEXCHANGE_RANGE, &fxr);
+    ]])],[have_fiexchange=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_fiexchange)
+  ])



* [PATCH 14/19] xfs_logprint: support dumping swapext log items
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (13 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 19/19] xfs_io: add atomic update commands to exercise extent swapping Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 16/19] xfs_fsr: port to new swapext library function Darrick J. Wong
                     ` (3 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Support dumping swapext log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 logprint/log_misc.c      |   11 ++++
 logprint/log_print_all.c |   12 ++++
 logprint/log_redo.c      |  128 ++++++++++++++++++++++++++++++++++++++++++++++
 logprint/logprint.h      |    6 ++
 4 files changed, 157 insertions(+)


diff --git a/logprint/log_misc.c b/logprint/log_misc.c
index 836156e0d58..565e7b76284 100644
--- a/logprint/log_misc.c
+++ b/logprint/log_misc.c
@@ -1052,6 +1052,17 @@ xlog_print_record(
 					be32_to_cpu(op_head->oh_len));
 			break;
 		    }
+		    case XFS_LI_SXI: {
+			skip = xlog_print_trans_sxi(&ptr,
+					be32_to_cpu(op_head->oh_len),
+					continued);
+			break;
+		    }
+		    case XFS_LI_SXD: {
+			skip = xlog_print_trans_sxd(&ptr,
+					be32_to_cpu(op_head->oh_len));
+			break;
+		    }
 		    case XFS_LI_QUOTAOFF: {
 			skip = xlog_print_trans_qoff(&ptr,
 					be32_to_cpu(op_head->oh_len));
diff --git a/logprint/log_print_all.c b/logprint/log_print_all.c
index 8d3ede190e5..6e528fcd097 100644
--- a/logprint/log_print_all.c
+++ b/logprint/log_print_all.c
@@ -440,6 +440,12 @@ xlog_recover_print_logitem(
 	case XFS_LI_BUI:
 		xlog_recover_print_bui(item);
 		break;
+	case XFS_LI_SXD:
+		xlog_recover_print_sxd(item);
+		break;
+	case XFS_LI_SXI:
+		xlog_recover_print_sxi(item);
+		break;
 	case XFS_LI_DQUOT:
 		xlog_recover_print_dquot(item);
 		break;
@@ -498,6 +504,12 @@ xlog_recover_print_item(
 	case XFS_LI_BUI:
 		printf("BUI");
 		break;
+	case XFS_LI_SXD:
+		printf("SXD");
+		break;
+	case XFS_LI_SXI:
+		printf("SXI");
+		break;
 	case XFS_LI_DQUOT:
 		printf("DQ ");
 		break;
diff --git a/logprint/log_redo.c b/logprint/log_redo.c
index edf7e0fbfa9..770485df75d 100644
--- a/logprint/log_redo.c
+++ b/logprint/log_redo.c
@@ -847,3 +847,131 @@ xlog_recover_print_attrd(
 		f->alfd_size,
 		(unsigned long long)f->alfd_alf_id);
 }
+
+/* Atomic Extent Swapping Items */
+
+static int
+xfs_sxi_copy_format(
+	struct xfs_sxi_log_format *sxi,
+	uint			  len,
+	struct xfs_sxi_log_format *dst_fmt,
+	int			  continued)
+{
+	if (len == sizeof(struct xfs_sxi_log_format) || continued) {
+		memcpy(dst_fmt, sxi, len);
+		return 0;
+	}
+	fprintf(stderr, _("%s: bad size of SXI format: %u; expected %zu\n"),
+		progname, len, sizeof(struct xfs_sxi_log_format));
+	return 1;
+}
+
+int
+xlog_print_trans_sxi(
+	char			**ptr,
+	uint			src_len,
+	int			continued)
+{
+	struct xfs_sxi_log_format *src_f, *f = NULL;
+	struct xfs_swap_extent	*ex;
+	int			error = 0;
+
+	src_f = malloc(src_len);
+	if (src_f == NULL) {
+		fprintf(stderr, _("%s: %s: malloc failed\n"),
+			progname, __func__);
+		exit(1);
+	}
+	memcpy(src_f, *ptr, src_len);
+	*ptr += src_len;
+
+	/* convert to native format */
+	if (continued && src_len < sizeof(struct xfs_sxi_log_format)) {
+		printf(_("SXI: Not enough data to decode further\n"));
+		error = 1;
+		goto error;
+	}
+
+	f = malloc(sizeof(struct xfs_sxi_log_format));
+	if (f == NULL) {
+		fprintf(stderr, _("%s: %s: malloc failed\n"),
+			progname, __func__);
+		exit(1);
+	}
+	if (xfs_sxi_copy_format(src_f, src_len, f, continued)) {
+		error = 1;
+		goto error;
+	}
+
+	printf(_("SXI:  #regs: %d	num_extents: 1  id: 0x%llx\n"),
+		f->sxi_size, (unsigned long long)f->sxi_id);
+
+	if (continued) {
+		printf(_("SXI extent data skipped (CONTINUE set, no space)\n"));
+		goto error;
+	}
+
+	ex = &f->sxi_extent;
+	printf("(ino1: 0x%llx, ino2: 0x%llx, off1: %lld, off2: %lld, len: %lld, flags: 0x%llx)\n",
+		(unsigned long long)ex->sx_inode1,
+		(unsigned long long)ex->sx_inode2,
+		(unsigned long long)ex->sx_startoff1,
+		(unsigned long long)ex->sx_startoff2,
+		(unsigned long long)ex->sx_blockcount,
+		(unsigned long long)ex->sx_flags);
+error:
+	free(src_f);
+	free(f);
+	return error;
+}
+
+void
+xlog_recover_print_sxi(
+	struct xlog_recover_item	*item)
+{
+	char				*src_f;
+	uint				src_len;
+
+	src_f = item->ri_buf[0].i_addr;
+	src_len = item->ri_buf[0].i_len;
+
+	xlog_print_trans_sxi(&src_f, src_len, 0);
+}
+
+int
+xlog_print_trans_sxd(
+	char				**ptr,
+	uint				len)
+{
+	struct xfs_sxd_log_format	*f;
+	struct xfs_sxd_log_format	lbuf;
+
+	/* size without extents at end */
+	uint core_size = sizeof(struct xfs_sxd_log_format);
+
+	memcpy(&lbuf, *ptr, min(core_size, len));
+	f = &lbuf;
+	*ptr += len;
+	if (len >= core_size) {
+		printf(_("SXD:  #regs: %d	                 id: 0x%llx\n"),
+			f->sxd_size,
+			(unsigned long long)f->sxd_sxi_id);
+
+		/* don't print extents as they are not used */
+
+		return 0;
+	} else {
+		printf(_("SXD: Not enough data to decode further\n"));
+		return 1;
+	}
+}
+
+void
+xlog_recover_print_sxd(
+	struct xlog_recover_item	*item)
+{
+	char				*f;
+
+	f = item->ri_buf[0].i_addr;
+	xlog_print_trans_sxd(&f, sizeof(struct xfs_sxd_log_format));
+}
diff --git a/logprint/logprint.h b/logprint/logprint.h
index b4479c240d9..892b280b548 100644
--- a/logprint/logprint.h
+++ b/logprint/logprint.h
@@ -65,4 +65,10 @@ extern void xlog_recover_print_attri(struct xlog_recover_item *item);
 extern int xlog_print_trans_attrd(char **ptr, uint len);
 extern void xlog_recover_print_attrd(struct xlog_recover_item *item);
 extern void xlog_print_op_header(xlog_op_header_t *op_head, int i, char **ptr);
+
+extern int xlog_print_trans_sxi(char **ptr, uint src_len, int continued);
+extern void xlog_recover_print_sxi(struct xlog_recover_item *item);
+extern int xlog_print_trans_sxd(char **ptr, uint len);
+extern void xlog_recover_print_sxd(struct xlog_recover_item *item);
+
 #endif	/* LOGPRINT_H */



* [PATCH 15/19] xfs_fsr: convert to bulkstat v5 ioctls
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (15 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 16/19] xfs_fsr: port to new swapext library function Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 17/19] xfs_fsr: skip the xattr/forkoff levering with the newer swapext implementations Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 13/19] libfrog: convert xfs_io swapext command to use new libfrog wrapper Darrick J. Wong
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that libhandle can, er, handle bulkstat information coming from the
v5 bulkstat ioctl, port xfs_fsr to use the new interfaces instead of
repeatedly converting things back and forth.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fsr/xfs_fsr.c    |  148 ++++++++++++++++++++++++++++++------------------------
 libfrog/fsgeom.c |   45 ++++++++++++----
 libfrog/fsgeom.h |    1 
 3 files changed, 117 insertions(+), 77 deletions(-)


diff --git a/fsr/xfs_fsr.c b/fsr/xfs_fsr.c
index ba02506d8e4..8e916faee94 100644
--- a/fsr/xfs_fsr.c
+++ b/fsr/xfs_fsr.c
@@ -65,10 +65,10 @@ static int	pagesize;
 
 void usage(int ret);
 static int  fsrfile(char *fname, xfs_ino_t ino);
-static int  fsrfile_common( char *fname, char *tname, char *mnt,
-                            int fd, struct xfs_bstat *statp);
-static int  packfile(char *fname, char *tname, int fd,
-                     struct xfs_bstat *statp, struct fsxattr *fsxp);
+static int  fsrfile_common(char *fname, char *tname, char *mnt,
+			   struct xfs_fd *file_fd, struct xfs_bulkstat *statp);
+static int  packfile(char *fname, char *tname, struct xfs_fd *file_fd,
+                     struct xfs_bulkstat *statp, struct fsxattr *fsxp);
 static void fsrdir(char *dirname);
 static int  fsrfs(char *mntdir, xfs_ino_t ino, int targetrange);
 static void initallfs(char *mtab);
@@ -80,7 +80,7 @@ int xfs_getrt(int fd, struct statvfs *sfbp);
 char * gettmpname(char *fname);
 char * getparent(char *fname);
 int fsrprintf(const char *fmt, ...);
-int read_fd_bmap(int, struct xfs_bstat *, int *);
+int read_fd_bmap(int, struct xfs_bulkstat *, int *);
 static void tmp_init(char *mnt);
 static char * tmp_next(char *mnt);
 static void tmp_close(char *mnt);
@@ -102,6 +102,26 @@ static int	nfrags = 0;	/* Debug option: Coerse into specific number
 				 * of extents */
 static int	openopts = O_CREAT|O_EXCL|O_RDWR|O_DIRECT;
 
+/*
+ * Open a file on an XFS filesystem from file handle components and fs geometry
+ * data.  Returns zero or a negative error code.
+ */
+static int
+open_handle(
+	struct xfs_fd		*xfd,
+	jdm_fshandle_t		*fshandle,
+	struct xfs_bulkstat	*bulkstat,
+	struct xfs_fsop_geom	*fsgeom,
+	int			flags)
+{
+	xfd->fd = jdm_open_v5(fshandle, bulkstat, flags);
+	if (xfd->fd < 0)
+		return errno;
+
+	xfd_install_geometry(xfd, fsgeom);
+	return 0;
+}
+
 static int
 xfs_swapext(int fd, xfs_swapext_t *sx)
 {
@@ -600,7 +620,6 @@ static int
 fsrfs(char *mntdir, xfs_ino_t startino, int targetrange)
 {
 	struct xfs_fd	fsxfd = XFS_FD_INIT_EMPTY;
-	int	fd;
 	int	count = 0;
 	int	ret;
 	char	fname[64];
@@ -638,10 +657,10 @@ fsrfs(char *mntdir, xfs_ino_t startino, int targetrange)
 	}
 
 	while ((ret = -xfrog_bulkstat(&fsxfd, breq) == 0)) {
-		struct xfs_bstat	bs1;
 		struct xfs_bulkstat	*buf = breq->bulkstat;
 		struct xfs_bulkstat	*p;
 		struct xfs_bulkstat	*endp;
+		struct xfs_fd		file_fd = XFS_FD_INIT_EMPTY;
 		uint32_t		buflenout = breq->hdr.ocount;
 
 		if (buflenout == 0)
@@ -658,15 +677,9 @@ fsrfs(char *mntdir, xfs_ino_t startino, int targetrange)
 			     (p->bs_extents64 < 2))
 				continue;
 
-			ret = -xfrog_bulkstat_v5_to_v1(&fsxfd, &bs1, p);
+			ret = open_handle(&file_fd, fshandlep, p,
+					&fsxfd.fsgeom, O_RDWR | O_DIRECT);
 			if (ret) {
-				fsrprintf(_("bstat conversion error: %s\n"),
-						strerror(ret));
-				continue;
-			}
-
-			fd = jdm_open(fshandlep, &bs1, O_RDWR | O_DIRECT);
-			if (fd < 0) {
 				/* This probably means the file was
 				 * removed while in progress of handling
 				 * it.  Just quietly ignore this file.
@@ -683,11 +696,12 @@ fsrfs(char *mntdir, xfs_ino_t startino, int targetrange)
 			/* Get a tmp file name */
 			tname = tmp_next(mntdir);
 
-			ret = fsrfile_common(fname, tname, mntdir, fd, &bs1);
+			ret = fsrfile_common(fname, tname, mntdir, &file_fd,
+					p);
 
 			leftoffino = p->bs_ino;
 
-			close(fd);
+			xfd_close(&file_fd);
 
 			if (ret == 0) {
 				if (--count <= 0)
@@ -735,9 +749,8 @@ fsrfile(
 {
 	struct xfs_fd		fsxfd = XFS_FD_INIT_EMPTY;
 	struct xfs_bulkstat	bulkstat;
-	struct xfs_bstat	statbuf;
+	struct xfs_fd		file_fd = XFS_FD_INIT_EMPTY;
 	jdm_fshandle_t		*fshandlep;
-	int			fd = -1;
 	int			error = -1;
 	char			*tname;
 
@@ -765,17 +778,12 @@ fsrfile(
 			fname, strerror(error));
 		goto out;
 	}
-	error = -xfrog_bulkstat_v5_to_v1(&fsxfd, &statbuf, &bulkstat);
-	if (error) {
-		fsrprintf(_("bstat conversion error on %s: %s\n"),
-			fname, strerror(error));
-		goto out;
-	}
 
-	fd = jdm_open(fshandlep, &statbuf, O_RDWR|O_DIRECT);
-	if (fd < 0) {
+	error = open_handle(&file_fd, fshandlep, &bulkstat, &fsxfd.fsgeom,
+			O_RDWR | O_DIRECT);
+	if (error) {
 		fsrprintf(_("unable to open handle %s: %s\n"),
-			fname, strerror(errno));
+			fname, strerror(error));
 		goto out;
 	}
 
@@ -783,14 +791,13 @@ fsrfile(
 	memcpy(&fsgeom, &fsxfd.fsgeom, sizeof(fsgeom));
 
 	tname = gettmpname(fname);
-
 	if (tname)
-		error = fsrfile_common(fname, tname, NULL, fd, &statbuf);
+		error = fsrfile_common(fname, tname, NULL, &file_fd,
+				&bulkstat);
 
 out:
 	xfd_close(&fsxfd);
-	if (fd >= 0)
-		close(fd);
+	xfd_close(&file_fd);
 	free(fshandlep);
 
 	return error;
@@ -816,8 +823,8 @@ fsrfile_common(
 	char		*fname,
 	char		*tname,
 	char		*fsname,
-	int		fd,
-	struct xfs_bstat *statp)
+	struct xfs_fd	*file_fd,
+	struct xfs_bulkstat *statp)
 {
 	int		error;
 	struct statvfs  vfss;
@@ -827,7 +834,7 @@ fsrfile_common(
 	if (vflag)
 		fsrprintf("%s\n", fname);
 
-	if (fsync(fd) < 0) {
+	if (fsync(file_fd->fd) < 0) {
 		fsrprintf(_("sync failed: %s: %s\n"), fname, strerror(errno));
 		return -1;
 	}
@@ -851,7 +858,7 @@ fsrfile_common(
 		fl.l_whence = SEEK_SET;
 		fl.l_start = (off_t)0;
 		fl.l_len = 0;
-		if ((fcntl(fd, F_GETLK, &fl)) < 0 ) {
+		if ((fcntl(file_fd->fd, F_GETLK, &fl)) < 0 ) {
 			if (vflag)
 				fsrprintf(_("locking check failed: %s\n"),
 					fname);
@@ -869,7 +876,7 @@ fsrfile_common(
 	/*
 	 * Check if there is room to copy the file.
 	 *
-	 * Note that xfs_bstat.bs_blksize returns the filesystem blocksize,
+	 * Note that xfs_bulkstat.bs_blksize returns the filesystem blocksize,
 	 * not the optimal I/O size as struct stat.
 	 */
 	if (statvfs(fsname ? fsname : fname, &vfss) < 0) {
@@ -886,7 +893,7 @@ fsrfile_common(
 		return 1;
 	}
 
-	if ((ioctl(fd, FS_IOC_FSGETXATTR, &fsx)) < 0) {
+	if ((ioctl(file_fd->fd, FS_IOC_FSGETXATTR, &fsx)) < 0) {
 		fsrprintf(_("failed to get inode attrs: %s\n"), fname);
 		return(-1);
 	}
@@ -902,7 +909,7 @@ fsrfile_common(
 		return(0);
 	}
 	if (fsx.fsx_xflags & FS_XFLAG_REALTIME) {
-		if (xfs_getrt(fd, &vfss) < 0) {
+		if (xfs_getrt(file_fd->fd, &vfss) < 0) {
 			fsrprintf(_("cannot get realtime geometry for: %s\n"),
 				fname);
 			return(-1);
@@ -928,7 +935,7 @@ fsrfile_common(
 	 * file we're defragging, in packfile().
 	 */
 
-	if ((error = packfile(fname, tname, fd, statp, &fsx)))
+	if ((error = packfile(fname, tname, file_fd, statp, &fsx)))
 		return error;
 	return -1; /* no error */
 }
@@ -952,7 +959,7 @@ static int
 fsr_setup_attr_fork(
 	int		fd,
 	int		tfd,
-	struct xfs_bstat *bstatp)
+	struct xfs_bulkstat *bstatp)
 {
 #ifdef HAVE_FSETXATTR
 	struct xfs_fd	txfd = XFS_FD_INIT(tfd);
@@ -1136,23 +1143,28 @@ fsr_setup_attr_fork(
  *  1: No change / No Error
  */
 static int
-packfile(char *fname, char *tname, int fd,
-	 struct xfs_bstat *statp, struct fsxattr *fsxp)
+packfile(
+	char			*fname,
+	char			*tname,
+	struct xfs_fd		*file_fd,
+	struct xfs_bulkstat	*statp,
+	struct fsxattr		*fsxp)
 {
-	int 		tfd = -1;
-	int		srval;
-	int		retval = -1;	/* Failure is the default */
-	int		nextents, extent, cur_nextents, new_nextents;
-	unsigned	blksz_dio;
-	unsigned	dio_min;
-	struct dioattr	dio;
-	static xfs_swapext_t   sx;
-	struct xfs_flock64  space;
-	off64_t 	cnt, pos;
-	void 		*fbuf = NULL;
-	int 		ct, wc, wc_b4;
-	char		ffname[SMBUFSZ];
-	int		ffd = -1;
+	int			tfd = -1;
+	int			srval;
+	int			retval = -1;	/* Failure is the default */
+	int			nextents, extent, cur_nextents, new_nextents;
+	unsigned		blksz_dio;
+	unsigned		dio_min;
+	struct dioattr		dio;
+	static xfs_swapext_t	sx;
+	struct xfs_flock64	space;
+	off64_t			cnt, pos;
+	void			*fbuf = NULL;
+	int			ct, wc, wc_b4;
+	char			ffname[SMBUFSZ];
+	int			ffd = -1;
+	int			error;
 
 	/*
 	 * Work out the extent map - nextents will be set to the
@@ -1160,7 +1172,7 @@ packfile(char *fname, char *tname, int fd,
 	 * into account holes), cur_nextents is the current number
 	 * of extents.
 	 */
-	nextents = read_fd_bmap(fd, statp, &cur_nextents);
+	nextents = read_fd_bmap(file_fd->fd, statp, &cur_nextents);
 
 	if (cur_nextents == 1 || cur_nextents <= nextents) {
 		if (vflag)
@@ -1183,7 +1195,7 @@ packfile(char *fname, char *tname, int fd,
 	unlink(tname);
 
 	/* Setup extended attributes */
-	if (fsr_setup_attr_fork(fd, tfd, statp) != 0) {
+	if (fsr_setup_attr_fork(file_fd->fd, tfd, statp) != 0) {
 		fsrprintf(_("failed to set ATTR fork on tmp: %s:\n"), tname);
 		goto out;
 	}
@@ -1301,7 +1313,7 @@ packfile(char *fname, char *tname, int fd,
 				   tname, strerror(errno));
 				goto out;
 			}
-			if (lseek(fd, outmap[extent].bmv_length, SEEK_CUR) < 0) {
+			if (lseek(file_fd->fd, outmap[extent].bmv_length, SEEK_CUR) < 0) {
 				fsrprintf(_("could not lseek in file: %s : %s\n"),
 				   fname, strerror(errno));
 				goto out;
@@ -1321,7 +1333,7 @@ packfile(char *fname, char *tname, int fd,
 				ct = min(cnt + dio_min - (cnt % dio_min),
 					blksz_dio);
 			}
-			ct = read(fd, fbuf, ct);
+			ct = read(file_fd->fd, fbuf, ct);
 			if (ct == 0) {
 				/* EOF, stop trying to read */
 				extent = nextents;
@@ -1392,9 +1404,15 @@ packfile(char *fname, char *tname, int fd,
 		goto out;
 	}
 
-	sx.sx_stat     = *statp; /* struct copy */
+	error = -xfrog_bulkstat_v5_to_v1(file_fd, &sx.sx_stat, statp);
+	if (error) {
+		fsrprintf(_("bstat conversion error on %s: %s\n"),
+				fname, strerror(error));
+		goto out;
+	}
+
 	sx.sx_version  = XFS_SX_VERSION;
-	sx.sx_fdtarget = fd;
+	sx.sx_fdtarget = file_fd->fd;
 	sx.sx_fdtmp    = tfd;
 	sx.sx_offset   = 0;
 	sx.sx_length   = statp->bs_size;
@@ -1408,7 +1426,7 @@ packfile(char *fname, char *tname, int fd,
         }
 
 	/* Swap the extents */
-	srval = xfs_swapext(fd, &sx);
+	srval = xfs_swapext(file_fd->fd, &sx);
 	if (srval < 0) {
 		if (errno == ENOTSUP) {
 			if (vflag || dflag)
@@ -1504,7 +1522,7 @@ getparent(char *fname)
 #define MAPSIZE	128
 #define	OUTMAP_SIZE_INCREMENT	MAPSIZE
 
-int	read_fd_bmap(int fd, struct xfs_bstat *sin, int *cur_nextents)
+int	read_fd_bmap(int fd, struct xfs_bulkstat *sin, int *cur_nextents)
 {
 	int		i, cnt;
 	struct getbmap	map[MAPSIZE];
diff --git a/libfrog/fsgeom.c b/libfrog/fsgeom.c
index 3e7f0797d8b..6980d3ffab6 100644
--- a/libfrog/fsgeom.c
+++ b/libfrog/fsgeom.c
@@ -102,29 +102,50 @@ xfrog_geometry(
 	return -errno;
 }
 
-/*
- * Prepare xfs_fd structure for future ioctl operations by computing the xfs
- * geometry for @xfd->fd.  Returns zero or a negative error code.
- */
-int
-xfd_prepare_geometry(
+/* Compute conversion factors of an xfs_fd structure. */
+static void
+xfd_compute_conversion_factors(
 	struct xfs_fd		*xfd)
 {
-	int			ret;
-
-	ret = xfrog_geometry(xfd->fd, &xfd->fsgeom);
-	if (ret)
-		return ret;
-
 	xfd->agblklog = log2_roundup(xfd->fsgeom.agblocks);
 	xfd->blocklog = highbit32(xfd->fsgeom.blocksize);
 	xfd->inodelog = highbit32(xfd->fsgeom.inodesize);
 	xfd->inopblog = xfd->blocklog - xfd->inodelog;
 	xfd->aginolog = xfd->agblklog + xfd->inopblog;
 	xfd->blkbb_log = xfd->blocklog - BBSHIFT;
+}
+
+/*
+ * Prepare xfs_fd structure for future ioctl operations by computing the xfs
+ * geometry for @xfd->fd.  Returns zero or a negative error code.
+ */
+int
+xfd_prepare_geometry(
+	struct xfs_fd		*xfd)
+{
+	int			ret;
+
+	ret = xfrog_geometry(xfd->fd, &xfd->fsgeom);
+	if (ret)
+		return ret;
+
+	xfd_compute_conversion_factors(xfd);
 	return 0;
 }
 
+/*
+ * Prepare xfs_fd structure for future ioctl operations by computing the xfs
+ * geometry for @xfd->fd.  Returns zero or a negative error code.
+ */
+void
+xfd_install_geometry(
+	struct xfs_fd		*xfd,
+	struct xfs_fsop_geom	*fsgeom)
+{
+	memcpy(&xfd->fsgeom, fsgeom, sizeof(*fsgeom));
+	xfd_compute_conversion_factors(xfd);
+}
+
 /* Open a file on an XFS filesystem.  Returns zero or a negative error code. */
 int
 xfd_open(
diff --git a/libfrog/fsgeom.h b/libfrog/fsgeom.h
index 9dfa986ff08..bef864fce63 100644
--- a/libfrog/fsgeom.h
+++ b/libfrog/fsgeom.h
@@ -61,6 +61,7 @@ struct xfs_fd {
 #define XFS_FD_INIT_EMPTY	XFS_FD_INIT(-1)
 
 int xfd_prepare_geometry(struct xfs_fd *xfd);
+void xfd_install_geometry(struct xfs_fd *xfd, struct xfs_fsop_geom *fsgeom);
 int xfd_open(struct xfs_fd *xfd, const char *pathname, int flags);
 int xfd_close(struct xfs_fd *xfd);
 



* [PATCH 16/19] xfs_fsr: port to new swapext library function
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (14 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 14/19] xfs_logprint: support dumping swapext log items Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 15/19] xfs_fsr: convert to bulkstat v5 ioctls Darrick J. Wong
                     ` (2 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Port fsr to use the new libfrog library functions to handle swapext.
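
A rough sketch of the call sequence that fsr now relies on (illustrative
only; the helper name and the standalone framing are not part of the patch,
and error handling is trimmed):

#include "libfrog/fsgeom.h"
#include "libfrog/bulkstat.h"
#include "libfrog/fiexchange.h"
#include "libfrog/file_exchange.h"

/*
 * Illustrative only: swap the entire contents of tmp_fd into the file
 * behind @xfd, but only if that file still matches the bulkstat data we
 * sampled earlier.  This mirrors what packfile() does below.
 */
static int
defrag_swap(
	struct xfs_fd		*xfd,
	int			tmp_fd,
	struct xfs_bulkstat	*bstat)
{
	struct file_xchg_range	fxr;
	int			ret;

	ret = xfrog_file_exchange_prep(xfd,
			FILE_XCHG_RANGE_NONATOMIC | FILE_XCHG_RANGE_FULL_FILES,
			0, tmp_fd, 0, bstat->bs_size, &fxr);
	if (ret)
		return ret;

	/* arm the freshness check from the bulkstat we already hold */
	xfrog_file_exchange_require_file2_fresh(&fxr, bstat);

	/* falls back to XFS_IOC_SWAPEXT if FIEXCHANGE_RANGE is unsupported */
	return xfrog_file_exchange(xfd, &fxr);
}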

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fsr/xfs_fsr.c           |   80 ++++++++++++++++++++++++-----------------------
 libfrog/file_exchange.c |   17 ++++++++++
 libfrog/file_exchange.h |    2 +
 3 files changed, 59 insertions(+), 40 deletions(-)


diff --git a/fsr/xfs_fsr.c b/fsr/xfs_fsr.c
index 8e916faee94..bbc7d5fcabb 100644
--- a/fsr/xfs_fsr.c
+++ b/fsr/xfs_fsr.c
@@ -13,6 +13,8 @@
 #include "libfrog/paths.h"
 #include "libfrog/fsgeom.h"
 #include "libfrog/bulkstat.h"
+#include "libfrog/fiexchange.h"
+#include "libfrog/file_exchange.h"
 
 #include <fcntl.h>
 #include <errno.h>
@@ -122,12 +124,6 @@ open_handle(
 	return 0;
 }
 
-static int
-xfs_swapext(int fd, xfs_swapext_t *sx)
-{
-    return ioctl(fd, XFS_IOC_SWAPEXT, sx);
-}
-
 static int
 xfs_fscounts(int fd, xfs_fsop_counts_t *counts)
 {
@@ -1150,14 +1146,13 @@ packfile(
 	struct xfs_bulkstat	*statp,
 	struct fsxattr		*fsxp)
 {
+	struct file_xchg_range	fxr;
 	int			tfd = -1;
-	int			srval;
 	int			retval = -1;	/* Failure is the default */
 	int			nextents, extent, cur_nextents, new_nextents;
 	unsigned		blksz_dio;
 	unsigned		dio_min;
 	struct dioattr		dio;
-	static xfs_swapext_t	sx;
 	struct xfs_flock64	space;
 	off64_t			cnt, pos;
 	void			*fbuf = NULL;
@@ -1194,6 +1189,20 @@ packfile(
 	}
 	unlink(tname);
 
+	/*
+	 * Set up everything in the swap request except for the destination
+	 * freshness check, which we'll do separately since we already have
+	 * a bulkstat.
+	 */
+	error = xfrog_file_exchange_prep(file_fd,
+			FILE_XCHG_RANGE_NONATOMIC | FILE_XCHG_RANGE_FULL_FILES,
+			0, tfd, 0, statp->bs_size, &fxr);
+	if (error) {
+		fsrprintf(_("error %d setting up swapext request\n"), error);
+		goto out;
+	}
+	xfrog_file_exchange_require_file2_fresh(&fxr, statp);
+
 	/* Setup extended attributes */
 	if (fsr_setup_attr_fork(file_fd->fd, tfd, statp) != 0) {
 		fsrprintf(_("failed to set ATTR fork on tmp: %s:\n"), tname);
@@ -1404,19 +1413,6 @@ packfile(
 		goto out;
 	}
 
-	error = -xfrog_bulkstat_v5_to_v1(file_fd, &sx.sx_stat, statp);
-	if (error) {
-		fsrprintf(_("bstat conversion error on %s: %s\n"),
-				fname, strerror(error));
-		goto out;
-	}
-
-	sx.sx_version  = XFS_SX_VERSION;
-	sx.sx_fdtarget = file_fd->fd;
-	sx.sx_fdtmp    = tfd;
-	sx.sx_offset   = 0;
-	sx.sx_length   = statp->bs_size;
-
 	/* switch to the owner's id, to keep quota in line */
         if (fchown(tfd, statp->bs_uid, statp->bs_gid) < 0) {
                 if (vflag)
@@ -1426,25 +1422,29 @@ packfile(
         }
 
 	/* Swap the extents */
-	srval = xfs_swapext(file_fd->fd, &sx);
-	if (srval < 0) {
-		if (errno == ENOTSUP) {
-			if (vflag || dflag)
-			   fsrprintf(_("%s: file type not supported\n"), fname);
-		} else if (errno == EFAULT) {
-			/* The file has changed since we started the copy */
-			if (vflag || dflag)
-			   fsrprintf(_("%s: file modified defrag aborted\n"),
-				     fname);
-		} else if (errno == EBUSY) {
-			/* Timestamp has changed or mmap'ed file */
-			if (vflag || dflag)
-			   fsrprintf(_("%s: file busy\n"), fname);
-		} else {
-			fsrprintf(_("XFS_IOC_SWAPEXT failed: %s: %s\n"),
-				  fname, strerror(errno));
-		}
-		goto out;
+	error = xfrog_file_exchange(file_fd, &fxr);
+	switch (error) {
+	case 0:
+		break;
+	case -ENOTSUP:
+		if (vflag || dflag)
+			fsrprintf(_("%s: file type not supported\n"), fname);
+		break;
+	case -EFAULT:
+	case -EDOM:
+		/* The file has changed since we started the copy */
+		if (vflag || dflag)
+			fsrprintf(_("%s: file modified defrag aborted\n"),
+					fname);
+		break;
+	case -EBUSY:
+		/* Timestamp has changed or mmap'ed file */
+		if (vflag || dflag)
+			fsrprintf(_("%s: file busy\n"), fname);
+		break;
+	default:
+		fsrprintf(_("XFS_IOC_SWAPEXT failed: %s: %s\n"),
+			  fname, strerror(-error));
+	}
 
 	/* Report progress */
diff --git a/libfrog/file_exchange.c b/libfrog/file_exchange.c
index 00277f8f0fc..84795d71db2 100644
--- a/libfrog/file_exchange.c
+++ b/libfrog/file_exchange.c
@@ -55,6 +55,23 @@ xfrog_file_exchange_prep_freshness(
 	return 0;
 }
 
+/*
+ * Enable checking that the target (or destination) file has not been modified
+ * since a particular point in time.
+ */
+void
+xfrog_file_exchange_require_file2_fresh(
+	struct file_xchg_range	*req,
+	struct xfs_bulkstat	*bulkstat)
+{
+	req->flags |= FILE_XCHG_RANGE_FILE2_FRESH;
+	req->file2_ino = bulkstat->bs_ino;
+	req->file2_mtime = bulkstat->bs_mtime;
+	req->file2_ctime = bulkstat->bs_ctime;
+	req->file2_mtime_nsec = bulkstat->bs_mtime_nsec;
+	req->file2_ctime_nsec = bulkstat->bs_ctime_nsec;
+}
+
 /* Prepare an extent swap request. */
 int
 xfrog_file_exchange_prep(
diff --git a/libfrog/file_exchange.h b/libfrog/file_exchange.h
index a77d67514e8..618c2e5aa99 100644
--- a/libfrog/file_exchange.h
+++ b/libfrog/file_exchange.h
@@ -6,6 +6,8 @@
 #ifndef __LIBFROG_FILE_EXCHANGE_H__
 #define __LIBFROG_FILE_EXCHANGE_H__
 
+void xfrog_file_exchange_require_file2_fresh(struct file_xchg_range *req,
+		struct xfs_bulkstat *bulkstat);
 int xfrog_file_exchange_prep(struct xfs_fd *file2, uint64_t flags,
 		int64_t file2_offset, int file1_fd, int64_t file1_offset,
 		int64_t length, struct file_xchg_range *req);



* [PATCH 17/19] xfs_fsr: skip the xattr/forkoff levering with the newer swapext implementations
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (16 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 15/19] xfs_fsr: convert to bulkstat v5 ioctls Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 13/19] libfrog: convert xfs_io swapext command to use new libfrog wrapper Darrick J. Wong
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The newer swapext implementations in the kernel run at a high enough
level (above the bmap layer) that it's no longer required to manipulate
bs_forkoff by creating garbage xattrs to get the extent tree that we
want.  If we detect the newer algorithms, skip this error-prone step.
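
For reference, the detection boils down to a geometry flag check; a minimal
sketch follows (the helper name is invented here, and
XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP is the geometry flag introduced earlier in
this series):

#include <stdbool.h>
#include "libfrog/fsgeom.h"

/*
 * Illustrative helper: true if the kernel will use the higher-level,
 * bmap-based swap (atomic swapext or rmapbt enabled), in which case fsr
 * does not need to massage bs_forkoff with dummy xattrs.
 */
static bool
can_skip_forkoff_levering(
	int			fd)
{
	struct xfs_fsop_geom	fsgeom;

	if (xfrog_geometry(fd, &fsgeom))
		return false;	/* be conservative if we cannot tell */

	return fsgeom.flags & (XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP |
			       XFS_FSOP_GEOM_FLAGS_RMAPBT);
}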

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fsr/xfs_fsr.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)


diff --git a/fsr/xfs_fsr.c b/fsr/xfs_fsr.c
index bbc7d5fcabb..1fb2e5e7529 100644
--- a/fsr/xfs_fsr.c
+++ b/fsr/xfs_fsr.c
@@ -969,6 +969,22 @@ fsr_setup_attr_fork(
 	if (!(bstatp->bs_xflags & FS_XFLAG_HASATTR))
 		return 0;
 
+	/*
+	 * If the filesystem has the ability to perform atomic extent swaps or
+	 * has the reverse mapping btree enabled, the file extent swap
+	 * implementation uses a higher level algorithm that calls into the
+	 * bmap code instead of playing games with swapping the extent forks.
+	 *
+	 * The newer bmap implementation does not require specific values of
+	 * bs_forkoff, unlike the old fork swap code.  Therefore, leave the
+	 * extended attributes alone if we know we're not using the old fork
+	 * swap strategy.  This eliminates a major source of runtime errors
+	 * in fsr.
+	 */
+	if (fsgeom.flags & (XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP |
+			    XFS_FSOP_GEOM_FLAGS_RMAPBT))
+		return 0;
+
 	/*
 	 * use the old method if we have attr1 or the kernel does not yet
 	 * support passing the fork offset in the bulkstat data.



* [PATCH 18/19] xfs_io: enhance swapext to take advantage of new api
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (11 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 05/19] xfs: create deferred log items for extent swapping Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 19/19] xfs_io: add atomic update commands to exercise extent swapping Darrick J. Wong
                     ` (5 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Enhance the swapext command so that we can take advantage of the new
API's features and print some timing information.
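
Roughly, a plain "swapext -a -e -f <donor>" now maps onto the new ioctl as in
the sketch below (illustrative only; the real command also arms the
FILE_XCHG_RANGE_FILE2_FRESH freshness check via the libfrog wrapper unless -u
is given):

#include <string.h>
#include <sys/ioctl.h>
#include "fiexchange.h"		/* header added earlier in this series */

/* Illustrative only: exchange all of donor_fd's contents into file_fd. */
static int
swap_to_eof(
	int			file_fd,
	int			donor_fd)
{
	struct file_xchg_range	fxr;

	memset(&fxr, 0, sizeof(fxr));
	fxr.file1_fd = donor_fd;		/* "file1" is the donor */
	fxr.file1_offset = 0;
	fxr.file2_offset = 0;
	fxr.length = 0;				/* ignored with TO_EOF */
	fxr.flags = FILE_XCHG_RANGE_TO_EOF |	/* -e */
		    FILE_XCHG_RANGE_FSYNC;	/* -f */
	/* leaving out FILE_XCHG_RANGE_NONATOMIC gives the atomic (-a) case */

	return ioctl(file_fd, FIEXCHANGE_RANGE, &fxr);
}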

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/swapext.c      |  156 +++++++++++++++++++++++++++++++++++++++++++++++++----
 man/man8/xfs_io.8 |   54 ++++++++++++++++++
 2 files changed, 197 insertions(+), 13 deletions(-)


diff --git a/io/swapext.c b/io/swapext.c
index 3f8a5c7b4d4..620ce4770a4 100644
--- a/io/swapext.c
+++ b/io/swapext.c
@@ -21,7 +21,36 @@ swapext_help(void)
 	printf(_(
 "\n"
 " Swaps extents between the open file descriptor and the supplied filename.\n"
-"\n"));
+"\n"
+" -a   -- Use atomic extent swapping\n"
+" -C   -- Print timing information in a condensed format\n"
+" -d N -- Start swapping extents at this offset in the open file\n"
+" -e   -- Swap extents to the ends of both files, including the file sizes\n"
+" -f   -- Flush changed file data and metadata to disk\n"
+" -h   -- Do not swap ranges that correspond to holes in the supplied file\n"
+" -l N -- Swap this many bytes between the two files\n"
+" -n   -- Dry run; do all the parameter validation but do not change anything.\n"
+" -s N -- Start swapping extents at this offset in the supplied file\n"
+" -t   -- Print timing information\n"
+" -u   -- Do not compare the open file's timestamps\n"
+" -v   -- 'xfs' for XFS_IOC_SWAPEXT, or 'vfs' for FIEXCHANGE_RANGE\n"));
+}
+
+static void
+set_xfd_flags(
+	struct xfs_fd	*xfd,
+	int		api_ver)
+{
+	switch (api_ver) {
+	case 0:
+		xfd->flags |= XFROG_FLAG_FORCE_SWAPEXT;
+		break;
+	case 1:
+		xfd->flags |= XFROG_FLAG_FORCE_FIEXCHANGE;
+		break;
+	default:
+		break;
+	}
 }
 
 static int
@@ -32,13 +61,101 @@ swapext_f(
 	struct xfs_fd		xfd = XFS_FD_INIT(file->fd);
 	struct file_xchg_range	fxr;
 	struct stat		stat;
-	uint64_t		flags = FILE_XCHG_RANGE_FILE2_FRESH |
+	struct timeval		t1, t2;
+	uint64_t		flags = FILE_XCHG_RANGE_NONATOMIC |
+					FILE_XCHG_RANGE_FILE2_FRESH |
 					FILE_XCHG_RANGE_FULL_FILES;
+	int64_t			src_offset = 0;
+	int64_t			dest_offset = 0;
+	int64_t			length = -1;
+	size_t			fsblocksize, fssectsize;
+	int			condensed = 0, quiet_flag = 1;
+	int			api_ver = -1;
+	int			c;
 	int			fd;
 	int			ret;
 
+	init_cvtnum(&fsblocksize, &fssectsize);
+	while ((c = getopt(argc, argv, "Cad:efhl:ns:tuv:")) != -1) {
+		switch (c) {
+		case 'C':
+			condensed = 1;
+			break;
+		case 'a':
+			flags &= ~FILE_XCHG_RANGE_NONATOMIC;
+			break;
+		case 'd':
+			dest_offset = cvtnum(fsblocksize, fssectsize, optarg);
+			if (dest_offset < 0) {
+				printf(
+			_("non-numeric open file offset argument -- %s\n"),
+						optarg);
+				return 0;
+			}
+			flags &= ~FILE_XCHG_RANGE_FULL_FILES;
+			break;
+		case 'e':
+			flags |= FILE_XCHG_RANGE_TO_EOF;
+			flags &= ~FILE_XCHG_RANGE_FULL_FILES;
+			break;
+		case 'f':
+			flags |= FILE_XCHG_RANGE_FSYNC;
+			break;
+		case 'h':
+			flags |= FILE_XCHG_RANGE_SKIP_FILE1_HOLES;
+			break;
+		case 'l':
+			length = cvtnum(fsblocksize, fssectsize, optarg);
+			if (length < 0) {
+				printf(
+			_("non-numeric length argument -- %s\n"),
+						optarg);
+				return 0;
+			}
+			flags &= ~FILE_XCHG_RANGE_FULL_FILES;
+			break;
+		case 'n':
+			flags |= FILE_XCHG_RANGE_DRY_RUN;
+			break;
+		case 's':
+			src_offset = cvtnum(fsblocksize, fssectsize, optarg);
+			if (src_offset < 0) {
+				printf(
+			_("non-numeric supplied file offset argument -- %s\n"),
+						optarg);
+				return 0;
+			}
+			flags &= ~FILE_XCHG_RANGE_FULL_FILES;
+			break;
+		case 't':
+			quiet_flag = 0;
+			break;
+		case 'u':
+			flags &= ~FILE_XCHG_RANGE_FILE2_FRESH;
+			break;
+		case 'v':
+			if (!strcmp(optarg, "xfs"))
+				api_ver = 0;
+			else if (!strcmp(optarg, "vfs"))
+				api_ver = 1;
+			else {
+				fprintf(stderr,
+			_("version must be 'xfs' or 'vfs'.\n"));
+				return 1;
+			}
+			break;
+		default:
+			swapext_help();
+			return 0;
+		}
+	}
+	if (optind != argc - 1) {
+		swapext_help();
+		return 0;
+	}
+
 	/* open the donor file */
-	fd = openfile(argv[1], NULL, 0, 0, NULL);
+	fd = openfile(argv[optind], NULL, 0, 0, NULL);
 	if (fd < 0)
 		return 0;
 
@@ -49,27 +166,42 @@ swapext_f(
 		goto out;
 	}
 
-	ret = fstat(file->fd, &stat);
-	if (ret) {
-		perror("fstat");
-		exitcode = 1;
-		goto out;
+	if (length < 0) {
+		ret = fstat(file->fd, &stat);
+		if (ret) {
+			perror("fstat");
+			exitcode = 1;
+			goto out;
+		}
+
+		length = stat.st_size;
 	}
 
-	ret = xfrog_file_exchange_prep(&xfd, flags, 0, fd, 0, stat.st_size,
-			&fxr);
+	ret = xfrog_file_exchange_prep(&xfd, flags, dest_offset, fd, src_offset,
+			length, &fxr);
 	if (ret) {
 		xfrog_perror(ret, "xfrog_file_exchange_prep");
 		exitcode = 1;
 		goto out;
 	}
 
+	set_xfd_flags(&xfd, api_ver);
+
+	gettimeofday(&t1, NULL);
 	ret = xfrog_file_exchange(&xfd, &fxr);
 	if (ret) {
 		xfrog_perror(ret, "swapext");
 		exitcode = 1;
 		goto out;
 	}
+	if (quiet_flag)
+		goto out;
+
+	gettimeofday(&t2, NULL);
+	t2 = tsub(t2, t1);
+
+	report_io_times("swapext", &t2, dest_offset, length, length, 1,
+			condensed);
 out:
 	close(fd);
 	return 0;
@@ -81,9 +213,9 @@ swapext_init(void)
 	swapext_cmd.name = "swapext";
 	swapext_cmd.cfunc = swapext_f;
 	swapext_cmd.argmin = 1;
-	swapext_cmd.argmax = 1;
+	swapext_cmd.argmax = -1;
 	swapext_cmd.flags = CMD_NOMAP_OK;
-	swapext_cmd.args = _("<donorfile>");
+	swapext_cmd.args = _("[-a] [-e] [-f] [-u] [-d dest_offset] [-s src_offset] [-l length] [-v xfs|vfs] <donorfile>");
 	swapext_cmd.oneline = _("Swap extents between files.");
 	swapext_cmd.help = swapext_help;
 
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index ae8d0245d87..9a155aaa02c 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -708,10 +708,62 @@ bytes of data.
 .RE
 .PD
 .TP
-.BI swapext " donor_file "
+.BI "swapext [OPTIONS]" " donor_file "
 Swaps extent forks between files. The current open file is the target. The donor
 file is specified by path. Note that file data is not copied (file content moves
 with the fork(s)).
+Options include:
+.RS 1.0i
+.PD 0
+.TP 0.4i
+.B \-a
+Swap extent forks atomically.
+The filesystem must be able to complete the operation even if the system goes
+down.
+.TP
+.B \-C
+Print timing information in a condensed format.
+.TP
+.BI \-d " dest_offset"
+Swap extents with open file beginning at
+.IR dest_offset .
+.TP
+.B \-e
+Swap extents to the ends of both files, including the file sizes.
+.TP
+.B \-f
+Flush changed file data and file metadata to disk.
+.TP
+.B \-h
+Do not swap ranges that correspond to holes in the donor file.
+.TP
+.BI \-l " length"
+Swap up to
+.I length
+bytes of data.
+.TP
+.B \-n
+Perform all the parameter validation checks but don't change anything.
+.TP
+.BI \-s " src_offset"
+Swap extents with donor file beginning at
+.IR src_offset .
+.TP
+.B \-t
+Print timing information.
+.TP
+.B \-u
+Do not snapshot and compare the open file's timestamps.
+.TP
+.B \-v
+Use a particular version of the kernel interface.
+Currently supported values are
+.I xfs
+for the old XFS_IOC_SWAPEXT ioctl, and
+.I vfs
+for the new FIEXCHANGE_RANGE VFS interface.
+.RE
+.PD
 .TP
 .BI "set_encpolicy [ \-c " mode " ] [ \-n " mode " ] [ \-f " flags " ] [ \-v " version " ] [ " keyspec " ]"
 On filesystems that support encryption, assign an encryption policy to the



* [PATCH 19/19] xfs_io: add atomic update commands to exercise extent swapping
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
                     ` (12 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 18/19] xfs_io: enhance swapext to take advantage of new api Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 14/19] xfs_logprint: support dumping swapext log items Darrick J. Wong
                     ` (4 subsequent siblings)
  18 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add three commands to xfs_io so that we can exercise atomic file updates
as provided by reflink and atomic swapext.
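
The flow behind startupdate/commitupdate, stripped of the xfs_io file table
bookkeeping, is roughly the following (illustrative sketch only; the helper
name is invented, error handling is omitted, and @xfd is assumed to have been
set up with xfd_open()):

#define _GNU_SOURCE		/* O_TMPFILE */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include "xfs.h"
#include "libfrog/fsgeom.h"
#include "libfrog/fiexchange.h"
#include "libfrog/file_exchange.h"

static int
atomic_file_update(
	struct xfs_fd		*xfd,	/* file being updated */
	const char		*dir)	/* directory on the same fs */
{
	struct file_xchg_range	fxr;
	struct stat		sb;
	int			temp_fd;

	/* startupdate: create an unlinked temp file on the same fs... */
	temp_fd = open(dir, O_TMPFILE | O_RDWR, 0600);

	/* ...record the original's size and freshness for the commit... */
	fstat(xfd->fd, &sb);
	xfrog_file_exchange_prep(xfd, FILE_XCHG_RANGE_COMMIT, 0,
			temp_fd, 0, sb.st_size, &fxr);

	/* ...and clone the current contents into the staging file */
	ioctl(temp_fd, XFS_IOC_CLONE, xfd->fd);

	/* stage whatever updates we like by writing to temp_fd here */

	/* commitupdate: swap the staged data back atomically, but only if
	 * the original file has not been touched in the meantime */
	fxr.flags |= FILE_XCHG_RANGE_TO_EOF;	/* pick up any size change */
	return xfrog_file_exchange(xfd, &fxr);
}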

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/Makefile       |    2 
 io/atomicupdate.c |  387 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 io/init.c         |    1 
 io/io.h           |    5 +
 io/open.c         |   27 +++-
 man/man8/xfs_io.8 |   33 ++++-
 6 files changed, 448 insertions(+), 7 deletions(-)
 create mode 100644 io/atomicupdate.c


diff --git a/io/Makefile b/io/Makefile
index 229f8f377b3..aa0d216b25f 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -13,7 +13,7 @@ CFILES = init.c \
 	file.c freeze.c fsync.c getrusage.c imap.c inject.c label.c link.c \
 	mmap.c open.c parent.c pread.c prealloc.c pwrite.c reflink.c \
 	resblks.c scrub.c seek.c shutdown.c stat.c swapext.c sync.c \
-	truncate.c utimes.c
+	truncate.c utimes.c atomicupdate.c
 
 LLDLIBS = $(LIBXCMD) $(LIBHANDLE) $(LIBFROG) $(LIBPTHREAD)
 LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE) $(LIBFROG)
diff --git a/io/atomicupdate.c b/io/atomicupdate.c
new file mode 100644
index 00000000000..cb9b664cfed
--- /dev/null
+++ b/io/atomicupdate.c
@@ -0,0 +1,387 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "platform_defs.h"
+#include "command.h"
+#include "init.h"
+#include "io.h"
+#include "input.h"
+#include "libfrog/logging.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/fiexchange.h"
+#include "libfrog/file_exchange.h"
+
+struct update_info {
+	/* File object for the file that we're updating. */
+	struct xfs_fd		file_fd;
+
+	/* FIEXCHANGE_RANGE request to commit the changes. */
+	struct file_xchg_range	xchg_req;
+
+	/* Name of the file we're updating. */
+	char			*old_fname;
+
+	/* fd we're using to stage the updates. */
+	int			temp_fd;
+};
+
+enum finish_how	{
+	FINISH_ABORT,
+	FINISH_COMMIT,
+	FINISH_CHECK
+};
+
+static struct update_info *updates;
+static unsigned int nr_updates;
+
+static void
+startupdate_help(void)
+{
+	printf(_(
+"\n"
+" Prepare for an atomic file update, if supported by the filesystem.\n"
+" A temporary file will be opened for writing and inserted into the file\n"
+" table.  The current file will be changed to this temporary file.  Neither\n"
+" file can be closed for the duration of the update.\n"
+"\n"
+" -e   -- Start with an empty file\n"
+"\n"));
+}
+
+static int
+startupdate_f(
+	int			argc,
+	char			*argv[])
+{
+	struct fsxattr		attr;
+	struct xfs_fsop_geom	fsgeom;
+	struct fs_path		fspath;
+	struct stat		stat;
+	struct update_info	*p;
+	char			*fname;
+	char			*path = NULL, *d;
+	size_t			fname_len;
+	int			flags = IO_TMPFILE | IO_ATOMICUPDATE;
+	int			temp_fd = -1;
+	bool			clone_file = true;
+	int			c;
+	int			ret;
+
+	while ((c = getopt(argc, argv, "e")) != -1) {
+		switch (c) {
+		case 'e':
+			clone_file = false;
+			break;
+		default:
+			startupdate_help();
+			return 0;
+		}
+	}
+	if (optind != argc) {
+		startupdate_help();
+		return 0;
+	}
+
+	/* Allocate a new slot. */
+	p = realloc(updates, (++nr_updates) * sizeof(*p));
+	if (!p) {
+		perror("startupdate realloc");
+		goto fail;
+	}
+	updates = p;
+
+	/* Fill out the update information so that we can commit later. */
+	p = &updates[nr_updates - 1];
+	memset(p, 0, sizeof(*p));
+	p->file_fd.fd = file->fd;
+	ret = xfd_prepare_geometry(&p->file_fd);
+	if (ret) {
+		xfrog_perror(ret, file->name);
+		goto fail;
+	}
+
+	ret = fstat(file->fd, &stat);
+	if (ret) {
+		perror(file->name);
+		goto fail;
+	}
+
+	/* Is the current file realtime?  If so, the temp file must match. */
+	ret = ioctl(file->fd, FS_IOC_FSGETXATTR, &attr);
+	if (ret == 0 && attr.fsx_xflags & FS_XFLAG_REALTIME)
+		flags |= IO_REALTIME;
+
+	/* Compute path to the directory that the current file is in. */
+	path = strdup(file->name);
+	d = strrchr(path, '/');
+	if (!d) {
+		fprintf(stderr, _("%s: cannot compute dirname?"), path);
+		goto fail;
+	}
+	*d = 0;
+
+	/* Open a temporary file to stage the extents. */
+	temp_fd = openfile(path, &fsgeom, flags, 0600, &fspath);
+	if (temp_fd < 0) {
+		perror(path);
+		goto fail;
+	}
+
+	/*
+	 * Snapshot the original file metadata in anticipation of the later
+	 * extent swap request.
+	 */
+	ret = xfrog_file_exchange_prep(&p->file_fd, FILE_XCHG_RANGE_COMMIT, 0,
+			temp_fd, 0, stat.st_size, &p->xchg_req);
+	if (ret) {
+		perror("update prep");
+		goto fail;
+	}
+
+	/* Clone all the data from the original file into the temporary file. */
+	if (clone_file) {
+		ret = ioctl(temp_fd, XFS_IOC_CLONE, p->file_fd.fd);
+		if (ret) {
+			perror(path);
+			goto fail;
+		}
+	}
+
+	/* Prepare a new path string for the duration of the update. */
+#define FILEUPDATE_STR	" (fileupdate)"
+	fname_len = strlen(file->name) + strlen(FILEUPDATE_STR);
+	fname = malloc(fname_len + 1);
+	if (!fname) {
+		perror("new path");
+		goto fail;
+	}
+	snprintf(fname, fname_len + 1, "%s%s", file->name, FILEUPDATE_STR);
+
+	/*
+	 * Install the temporary file into the same slot of the file table as
+	 * the original file.  Ensure that the original file cannot be closed.
+	 */
+	file->flags |= IO_ATOMICUPDATE;
+	p->old_fname = file->name;
+	file->name = fname;
+	p->temp_fd = file->fd = temp_fd;
+
+	free(path);
+	return 0;
+fail:
+	if (temp_fd >= 0)
+		close(temp_fd);
+	free(path);
+	nr_updates--;
+	exitcode = 1;
+	return 1;
+}
+
+static long long
+finish_update(
+	enum finish_how		how,
+	uint64_t		flags,
+	long long		*offset)
+{
+	struct update_info	*p;
+	long long		committed_bytes = 0;
+	size_t			length;
+	unsigned int		i;
+	unsigned int		upd_offset;
+	int			temp_fd;
+	int			ret;
+
+	/* Find our update descriptor. */
+	for (i = 0, p = updates; i < nr_updates; i++, p++) {
+		if (p->temp_fd == file->fd)
+			break;
+	}
+
+	if (i == nr_updates) {
+		fprintf(stderr,
+	_("Current file is not the staging file for an atomic update.\n"));
+		exitcode = 1;
+		return -1;
+	}
+
+	p->xchg_req.flags |= flags;
+
+	/*
+	 * Commit our changes, if desired.  If the extent swap fails, we stop
+	 * processing immediately so that we can run more xfs_io commands.
+	 */
+	switch (how) {
+	case FINISH_CHECK:
+		p->xchg_req.flags |= FILE_XCHG_RANGE_DRY_RUN;
+		fallthrough;
+	case FINISH_COMMIT:
+		ret = xfrog_file_exchange(&p->file_fd, &p->xchg_req);
+		if (ret) {
+			xfrog_perror(ret, _("committing update"));
+			exitcode = 1;
+			return -1;
+		}
+		printf(_("Committed updates to '%s'.\n"), p->old_fname);
+		*offset = p->xchg_req.file2_offset;
+		committed_bytes = p->xchg_req.length;
+		break;
+	case FINISH_ABORT:
+		printf(_("Cancelled updates to '%s'.\n"), p->old_fname);
+		break;
+	}
+
+	/*
+	 * Reset the filetable to point to the original file, and close the
+	 * temporary file.
+	 */
+	free(file->name);
+	file->name = p->old_fname;
+	file->flags &= ~IO_ATOMICUPDATE;
+	temp_fd = file->fd;
+	file->fd = p->file_fd.fd;
+	ret = close(temp_fd);
+	if (ret)
+		perror(_("closing temporary file"));
+
+	/* Remove the atomic update context, shifting things down. */
+	upd_offset = p - updates;
+	length = nr_updates * sizeof(struct update_info);
+	length -= (upd_offset + 1) * sizeof(struct update_info);
+	if (length)
+		memmove(p, p + 1, length);
+
+	nr_updates--;
+	return committed_bytes;
+}
+
+static void
+cancelupdate_help(void)
+{
+	printf(_(
+"\n"
+" Cancels an atomic file update.  The temporary file will be closed, and the\n"
+" current file set back to the original file.\n"
+"\n"));
+}
+
+static int
+cancelupdate_f(
+	int		argc,
+	char		*argv[])
+{
+	return finish_update(FINISH_ABORT, 0, NULL);
+}
+
+static void
+commitupdate_help(void)
+{
+	printf(_(
+"\n"
+" Commits an atomic file update.  File contents written to the temporary file\n"
+" will be swapped atomically with the corresponding range in the original\n"
+" file.  The temporary file will be closed, and the current file set back to\n"
+" the original file.\n"
+"\n"
+" -C   -- Print timing information in a condensed format.\n"
+" -h   -- Do not swap sparse areas of the temporary file.\n"
+" -k   -- Do not change file size.\n"
+" -n   -- Check parameters but do not change anything.\n"
+" -q   -- Do not print timing information at all.\n"));
+}
+
+static int
+commitupdate_f(
+	int		argc,
+	char		*argv[])
+{
+	struct timeval	t1, t2;
+	enum finish_how	how = FINISH_COMMIT;
+	uint64_t	flags = FILE_XCHG_RANGE_TO_EOF;
+	long long	offset, len;
+	int		condensed = 0, quiet_flag = 0;
+	int		c;
+
+	while ((c = getopt(argc, argv, "Chknq")) != -1) {
+		switch (c) {
+		case 'C':
+			condensed = 1;
+			break;
+		case 'h':
+			flags |= FILE_XCHG_RANGE_SKIP_FILE1_HOLES;
+			break;
+		case 'k':
+			flags &= ~FILE_XCHG_RANGE_TO_EOF;
+			break;
+		case 'n':
+			how = FINISH_CHECK;
+			break;
+		case 'q':
+			quiet_flag = 1;
+			break;
+		default:
+			commitupdate_help();
+			return 0;
+		}
+	}
+	if (optind != argc) {
+		commitupdate_help();
+		return 0;
+	}
+
+	gettimeofday(&t1, NULL);
+	len = finish_update(how, flags, &offset);
+	if (len < 0)
+		return 1;
+	if (quiet_flag)
+		return 0;
+
+	gettimeofday(&t2, NULL);
+	t2 = tsub(t2, t1);
+	report_io_times("commitupdate", &t2, offset, len, len, 1, condensed);
+	return 0;
+}
+
+static struct cmdinfo startupdate_cmd = {
+	.name		= "startupdate",
+	.cfunc		= startupdate_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.help		= startupdate_help,
+};
+
+static struct cmdinfo cancelupdate_cmd = {
+	.name		= "cancelupdate",
+	.cfunc		= cancelupdate_f,
+	.argmin		= 0,
+	.argmax		= 0,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.help		= cancelupdate_help,
+};
+
+static struct cmdinfo commitupdate_cmd = {
+	.name		= "commitupdate",
+	.cfunc		= commitupdate_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.help		= commitupdate_help,
+};
+
+void
+atomicupdate_init(void)
+{
+	startupdate_cmd.oneline = _("start an atomic update of a file");
+	startupdate_cmd.args = _("[-e]");
+
+	cancelupdate_cmd.oneline = _("cancel an atomic update");
+
+	commitupdate_cmd.oneline = _("commit a file update atomically");
+	commitupdate_cmd.args = _("[-C] [-h] [-n] [-q]");
+
+	add_command(&startupdate_cmd);
+	add_command(&cancelupdate_cmd);
+	add_command(&commitupdate_cmd);
+}
diff --git a/io/init.c b/io/init.c
index 033ed67d455..96536a25a1f 100644
--- a/io/init.c
+++ b/io/init.c
@@ -44,6 +44,7 @@ init_cvtnum(
 static void
 init_commands(void)
 {
+	atomicupdate_init();
 	attr_init();
 	bmap_init();
 	bulkstat_init();
diff --git a/io/io.h b/io/io.h
index 64b7a663a8c..1cfe8edc2db 100644
--- a/io/io.h
+++ b/io/io.h
@@ -31,6 +31,9 @@
 #define IO_PATH		(1<<10)
 #define IO_NOFOLLOW	(1<<11)
 
+/* undergoing atomic update, do not close */
+#define IO_ATOMICUPDATE	(1<<12)
+
 /*
  * Regular file I/O control
  */
@@ -74,6 +77,7 @@ extern int		openfile(char *, struct xfs_fsop_geom *, int, mode_t,
 				 struct fs_path *);
 extern int		addfile(char *, int , struct xfs_fsop_geom *, int,
 				struct fs_path *);
+extern int		closefile(void);
 extern void		printxattr(uint, int, int, const char *, int, int);
 
 extern unsigned int	recurse_all;
@@ -184,3 +188,4 @@ extern void		scrub_init(void);
 extern void		repair_init(void);
 extern void		crc32cselftest_init(void);
 extern void		bulkstat_init(void);
+extern void		atomicupdate_init(void);
diff --git a/io/open.c b/io/open.c
index d8072664c16..12c486395ce 100644
--- a/io/open.c
+++ b/io/open.c
@@ -337,14 +337,19 @@ open_f(
 	return 0;
 }
 
-static int
-close_f(
-	int		argc,
-	char		**argv)
+int
+closefile(void)
 {
 	size_t		length;
 	unsigned int	offset;
 
+	if (file->flags & IO_ATOMICUPDATE) {
+		fprintf(stderr,
+	_("%s: atomic update in progress, cannot close.\n"),
+			file->name);
+		exitcode = 1;
+		return 0;
+	}
 	if (close(file->fd) < 0) {
 		perror("close");
 		exitcode = 1;
@@ -370,7 +375,19 @@ close_f(
 		free(filetable);
 		file = filetable = NULL;
 	}
-	filelist_f();
+	return 0;
+}
+
+static int
+close_f(
+	int		argc,
+	char		**argv)
+{
+	int		ret;
+
+	ret = closefile();
+	if (!ret)
+		filelist_f();
 	return 0;
 }
 
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 9a155aaa02c..d531cabc3ef 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1042,7 +1042,38 @@ sec uses UNIX timestamp notation and is the seconds elapsed since
 nsec is the nanoseconds since the sec. This value needs to be in
 the range 0-999999999 with UTIME_NOW and UTIME_OMIT being exceptions.
 Each (sec, nsec) pair constitutes a single timestamp value.
-
+.TP
+.BI "startupdate [ " -e ]
+Create a temporary clone of a file in which to stage file updates.
+The
+.B \-e
+option creates an empty staging file.
+.TP
+.B cancelupdate
+Abandon changes from an update staging file.
+.TP
+.BI "commitupdate [" OPTIONS ]
+Commit changes from an update staging file to the real file.
+.RS 1.0i
+.PD 0
+.TP 0.4i
+.B \-C
+Print timing information in a condensed format.
+.TP 0.4i
+.B \-h
+Skip the range swap for all ranges of the update staging file that
+are sparse holes.
+.TP 0.4i
+.B \-k
+Do not change file size.
+.TP 0.4i
+.B \-n
+Check parameters without changing anything.
+.TP 0.4i
+.B \-q
+Do not print timing information at all.
+.PD
+.RE
 
 .SH MEMORY MAPPED I/O COMMANDS
 .TP



* [PATCHSET v24.0 0/1] xfs: online fsck of iunlink buckets
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (47 preceding siblings ...)
  2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/1] xfs_db: dump unlinked buckets Darrick J. Wong
  2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: cache xfile pages for better performance Darrick J. Wong
                   ` (28 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This series enhances the AGI scrub code to check the unlinked inode
bucket lists for errors, and fixes them if necessary.  Now that iunlink
pointer updates are virtual log items, we can batch updates pretty
efficiently in the logging code.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-iunlink

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-iunlink
---
 db/Makefile       |    2 -
 db/command.c      |    1 
 db/command.h      |    1 
 db/unlinked.c     |  204 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_db.8 |   19 +++++
 5 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 db/unlinked.c



* [PATCH 1/1] xfs_db: dump unlinked buckets
  2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: online fsck of iunlink buckets Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new command to dump the resource usage of files in the unlinked
buckets.
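A usage sketch (the device path and the numbers in the example output
are invented; the output format mirrors the dbprintf strings below):

  # xfs_db -c 'unlinked -a 0 -v' /dev/sdf1
  AG 0 bucket 5 agino 133 blocks 12 rtblocks 0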

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 db/Makefile       |    2 -
 db/command.c      |    1 
 db/command.h      |    1 
 db/unlinked.c     |  204 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_db.8 |   19 +++++
 5 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 db/unlinked.c


diff --git a/db/Makefile b/db/Makefile
index de4ab1d4bf5..dbe79a9a1b1 100644
--- a/db/Makefile
+++ b/db/Makefile
@@ -15,7 +15,7 @@ HFILES = addr.h agf.h agfl.h agi.h attr.h attrshort.h bit.h block.h bmap.h \
 	sig.h strvec.h text.h type.h write.h attrset.h symlink.h fsmap.h \
 	fuzz.h
 CFILES = $(HFILES:.h=.c) btdump.c btheight.c convert.c info.c namei.c \
-	timelimit.c bmap_inflate.c
+	timelimit.c bmap_inflate.c unlinked.c
 LSRCFILES = xfs_admin.sh xfs_ncheck.sh xfs_metadump.sh
 
 LLDLIBS	= $(LIBXFS) $(LIBXLOG) $(LIBFROG) $(LIBUUID) $(LIBRT) $(LIBURCU) \
diff --git a/db/command.c b/db/command.c
index 88401ef5b44..be6d045a23a 100644
--- a/db/command.c
+++ b/db/command.c
@@ -142,4 +142,5 @@ init_commands(void)
 	fuzz_init();
 	timelimit_init();
 	bmapinflate_init();
+	unlinked_init();
 }
diff --git a/db/command.h b/db/command.h
index c35258a72a9..85be8b622f0 100644
--- a/db/command.h
+++ b/db/command.h
@@ -35,3 +35,4 @@ extern void		btheight_init(void);
 extern void		timelimit_init(void);
 extern void		namei_init(void);
 extern void		bmapinflate_init(void);
+extern void		unlinked_init(void);
diff --git a/db/unlinked.c b/db/unlinked.c
new file mode 100644
index 00000000000..5b7df811601
--- /dev/null
+++ b/db/unlinked.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs.h"
+#include "command.h"
+#include "output.h"
+#include "init.h"
+
+static xfs_filblks_t
+count_rtblocks(
+	struct xfs_inode	*ip)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	xfs_filblks_t		count = 0;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+	int			error;
+
+	error = -libxfs_iread_extents(NULL, ip, XFS_DATA_FORK);
+	if (error) {
+		dbprintf(
+_("could not read AG %u agino %u extents, err=%d\n"),
+				XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
+				XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
+				error);
+		return 0;
+	}
+
+	for_each_xfs_iext(ifp, &icur, &got)
+		if (!isnullstartblock(got.br_startblock))
+			count += got.br_blockcount;
+	return count;
+}
+
+static xfs_agino_t
+get_next_unlinked(
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino,
+	bool			verbose)
+{
+	struct xfs_buf		*ino_bp;
+	struct xfs_dinode	*dip;
+	struct xfs_inode	*ip;
+	xfs_ino_t		ino;
+	xfs_agino_t		ret;
+	int			error;
+
+	ino = XFS_AGINO_TO_INO(mp, agno, agino);
+	error = -libxfs_iget(mp, NULL, ino, 0, &ip);
+	if (error)
+		goto bad;
+
+	if (verbose) {
+		xfs_filblks_t	blocks, rtblks = 0;
+
+		if (XFS_IS_REALTIME_INODE(ip))
+			rtblks = count_rtblocks(ip);
+		blocks = ip->i_nblocks - rtblks;
+
+		dbprintf(_(" blocks %llu rtblocks %llu\n"),
+				blocks, rtblks);
+	} else {
+		dbprintf("\n");
+	}
+
+	error = -libxfs_imap_to_bp(mp, NULL, &ip->i_imap, &ino_bp);
+	if (error)
+		goto bad;
+
+	dip = xfs_buf_offset(ino_bp, ip->i_imap.im_boffset);
+	ret = be32_to_cpu(dip->di_next_unlinked);
+	libxfs_buf_relse(ino_bp);
+
+	return ret;
+bad:
+	dbprintf(_("AG %u agino %u: %s\n"), agno, agino, strerror(error));
+	return NULLAGINO;
+}
+
+static void
+dump_unlinked_bucket(
+	xfs_agnumber_t	agno,
+	struct xfs_buf	*agi_bp,
+	unsigned int	bucket,
+	bool		quiet,
+	bool		verbose)
+{
+	struct xfs_agi	*agi = agi_bp->b_addr;
+	xfs_agino_t	agino;
+	unsigned int	i = 0;
+
+	agino = be32_to_cpu(agi->agi_unlinked[bucket]);
+	if (agino != NULLAGINO)
+		dbprintf(_("AG %u bucket %u agino %u"), agno, bucket, agino);
+	else if (!quiet && agino == NULLAGINO)
+		dbprintf(_("AG %u bucket %u agino NULL\n"), agno, bucket);
+
+	while (agino != NULLAGINO) {
+		agino = get_next_unlinked(agno, agino, verbose);
+		if (agino != NULLAGINO)
+			dbprintf(_("    [%u] agino %u"), i++, agino);
+		else if (!quiet && agino == NULLAGINO)
+			dbprintf(_("    [%u] agino NULL\n"), i++);
+	}
+}
+
+static void
+dump_unlinked(
+	struct xfs_perag	*pag,
+	unsigned int		bucket,
+	bool			quiet,
+	bool			verbose)
+{
+	struct xfs_buf		*agi_bp;
+	xfs_agnumber_t		agno = pag->pag_agno;
+	int			error;
+
+	error = -libxfs_ialloc_read_agi(pag, NULL, &agi_bp);
+	if (error) {
+		dbprintf(_("AGI %u: %s\n"), agno, strerror(error));
+		return;
+	}
+
+	if (bucket != -1U) {
+		dump_unlinked_bucket(agno, agi_bp, bucket, quiet, verbose);
+		goto relse;
+	}
+
+	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
+		dump_unlinked_bucket(agno, agi_bp, bucket, quiet, verbose);
+	}
+
+relse:
+	libxfs_buf_relse(agi_bp);
+}
+
+static int
+unlinked_f(
+	int			argc,
+	char			**argv)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno = NULLAGNUMBER;
+	unsigned int		bucket = -1U;
+	bool			quiet = false;
+	bool			verbose = false;
+	int			c;
+
+	while ((c = getopt(argc, argv, "a:b:qv")) != EOF) {
+		switch (c) {
+		case 'a':
+			agno = atoi(optarg);
+			if (agno >= mp->m_sb.sb_agcount) {
+				dbprintf(_("Unknown AG %u, agcount is %u.\n"),
+						agno, mp->m_sb.sb_agcount);
+				return 0;
+			}
+			break;
+		case 'b':
+			bucket = atoi(optarg);
+			if (bucket >= XFS_AGI_UNLINKED_BUCKETS) {
+				dbprintf(_("Unknown bucket %u, max is 63.\n"),
+						bucket);
+				return 0;
+			}
+			break;
+		case 'q':
+			quiet = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			dbprintf(_("Bad option for unlinked command.\n"));
+			return 0;
+		}
+	}
+
+	if (agno != NULLAGNUMBER) {
+		struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
+
+		dump_unlinked(pag, bucket, quiet, verbose);
+		libxfs_perag_put(pag);
+		return 0;
+	}
+
+	for_each_perag(mp, agno, pag)
+		dump_unlinked(pag, bucket, quiet, verbose);
+
+	return 0;
+}
+
+static const cmdinfo_t	unlinked_cmd =
+	{ "unlinked", NULL, unlinked_f, 0, -1, 0,
+	  N_("[-a agno] [-b bucket] [-q] [-v]"),
+	  N_("dump chain of unlinked inode buckets"), NULL };
+
+void
+unlinked_init(void)
+{
+	add_command(&unlinked_cmd);
+}
diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
index d67bf1e79da..43c7db5e225 100644
--- a/man/man8/xfs_db.8
+++ b/man/man8/xfs_db.8
@@ -959,6 +959,25 @@ Print the timestamps in the current locale's date and time format instead of
 raw seconds since the Unix epoch.
 .RE
 .TP
+.BI "unlinked [-a " agno " ] [-b " bucket " ] [-q] [-v]"
+Dump the contents of unlinked buckets.
+
+Options include:
+.RS 1.0i
+.TP 0.4i
+.B \-a
+Print only this AG's unlinked buckets.
+.TP 0.4i
+.B \-b
+Print only this bucket within each AGI.
+.TP 0.4i
+.B \-q
+Only print the essentials.
+.TP 0.4i
+.B \-v
+Print resource usage of each file on the unlinked lists.
+.RE
+.TP
 .BI "uuid [" uuid " | " generate " | " rewrite " | " restore ]
 Set the filesystem universally unique identifier (UUID).
 The filesystem UUID can be used by


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] xfs: cache xfile pages for better performance
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (48 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: online fsck of iunlink buckets Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/1] xfs: map xfile pages directly into xfs_buf Darrick J. Wong
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
                   ` (27 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Congratulations!  You have made it to the final patchset of the main
online fsck feature!  This last series improves the performance of
xfile-backed btrees by teaching the buffer cache to directly map pages
from the xfile.  It also speeds up xfarray operations substantially by
implementing a small page cache to avoid repeated kmap/kunmap calls.
Collectively, these can reduce the runtime of online repair functions by
twenty percent or so.
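As a rough illustration of the opt-in knob (this restates the
xfs_rmapbt_mem_create() hunk from the patch that follows, minus the
diff markers; the surrounding declarations are elided), an in-memory
btree asks for direct mapping through its creation config:

	struct xfbtree_config	cfg = {
		.btree_ops	= &xfs_rmapbt_mem_ops,
		.target		= target,
		.owner		= agno,
		/* let xfs_buf map xfile pages directly */
		.flags		= XFBTREE_DIRECT_MAP,
	};

	return xfbtree_create(mp, &cfg, xfbtreep);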

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=xfile-page-caching

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=xfile-page-caching
---
 libxfs/xfs_btree_mem.h  |    6 ++++++
 libxfs/xfs_rmap_btree.c |    1 +
 2 files changed, 7 insertions(+)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: map xfile pages directly into xfs_buf
  2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: cache xfile pages for better performance Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Map the xfile pages directly into xfs_buf to reduce memory overhead.
It's silly to use memory to stage changes to shmem pages for ephemeral
btrees that don't care about transactionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_btree_mem.h  |    6 ++++++
 libxfs/xfs_rmap_btree.c |    1 +
 2 files changed, 7 insertions(+)


diff --git a/libxfs/xfs_btree_mem.h b/libxfs/xfs_btree_mem.h
index ee142b97283..8feb104522b 100644
--- a/libxfs/xfs_btree_mem.h
+++ b/libxfs/xfs_btree_mem.h
@@ -17,8 +17,14 @@ struct xfbtree_config {
 
 	/* Owner of this btree. */
 	unsigned long long		owner;
+
+	/* XFBTREE_* flags */
+	unsigned int			flags;
 };
 
+/* buffers should be directly mapped from memory */
+#define XFBTREE_DIRECT_MAP		(1U << 0)
+
 #ifdef CONFIG_XFS_IN_MEMORY_BTREE
 unsigned int xfs_btree_mem_head_nlevels(struct xfs_buf *head_bp);
 
diff --git a/libxfs/xfs_rmap_btree.c b/libxfs/xfs_rmap_btree.c
index d6db7dfe22f..bc953516a98 100644
--- a/libxfs/xfs_rmap_btree.c
+++ b/libxfs/xfs_rmap_btree.c
@@ -668,6 +668,7 @@ xfs_rmapbt_mem_create(
 		.btree_ops	= &xfs_rmapbt_mem_ops,
 		.target		= target,
 		.owner		= agno,
+		.flags		= XFBTREE_DIRECT_MAP,
 	};
 
 	return xfbtree_create(mp, &cfg, xfbtreep);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (49 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: cache xfile pages for better performance Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/6] xfs_scrub: log when a repair was unnecessary Darrick J. Wong
                     ` (5 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
                   ` (26 subsequent siblings)
  77 siblings, 6 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Now that we've landed the new kernel code, it's time to reorganize the
xfs_scrub code that handles repairs.  Clean up various naming warts and
misleading error messages.  Move the repair code to scrub/repair.c as
the first step.  Then, fix various issues in the repair code before we
start reorganizing things.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-fixes
---
 scrub/phase1.c        |    2 
 scrub/phase2.c        |    3 -
 scrub/phase3.c        |    2 
 scrub/phase4.c        |   22 ++++-
 scrub/phase5.c        |    2 
 scrub/phase6.c        |   13 +++
 scrub/phase7.c        |    2 
 scrub/repair.c        |  169 +++++++++++++++++++++++++++++++++++++++++
 scrub/repair.h        |   16 +++-
 scrub/scrub.c         |  202 +------------------------------------------------
 scrub/scrub.h         |   16 ----
 scrub/scrub_private.h |   55 +++++++++++++
 12 files changed, 273 insertions(+), 231 deletions(-)
 create mode 100644 scrub/scrub_private.h


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/6] xfs_scrub: don't report media errors for space with unknowable owner
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 2/6] xfs_scrub: remove ALP_* flags namespace Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/6] xfs_scrub: move repair functions to repair.c Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 6/6] xfs_scrub: actually try to fix summary counters ahead of repairs Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

On filesystems that don't have the reverse mapping feature enabled, the
GETFSMAP call cannot tell us much about the owner of a space extent --
we're limited to static fs metadata, free space, or "unknown".  In this
case, nothing is corrupt, so str_corrupt is not an appropriate logging
function.  Relax this to str_info so that the user still sees a notice
that media errors were found -- and therefore knows that something bad
happened -- even if the directory tree walker cannot identify the file
that owns the affected space.

Filesystems with rmap enabled are never supposed to return OWN_UNKNOWN
from a GETFSMAP report, so continue to report that as a corruption.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase6.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index afdb16b689c..1a2643bdaf0 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -397,7 +397,18 @@ report_ioerr_fsmap(
 		snprintf(buf, DESCR_BUFSZ, _("disk offset %"PRIu64),
 				(uint64_t)map->fmr_physical + err_off);
 		type = decode_special_owner(map->fmr_owner);
-		str_corrupt(ctx, buf, _("media error in %s."), type);
+		/*
+		 * On filesystems that don't store reverse mappings, the
+		 * GETFSMAP call returns OWNER_UNKNOWN for allocated space.
+		 * We'll have to let the directory tree walker find the file
+		 * that lost data.
+		 */
+		if (!(ctx->mnt.fsgeom.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT) &&
+		    map->fmr_owner == XFS_FMR_OWN_UNKNOWN) {
+			str_info(ctx, buf, _("media error detected."));
+		} else {
+			str_corrupt(ctx, buf, _("media error in %s."), type);
+		}
 	}
 
 	/* Report extent maps */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/6] xfs_scrub: remove ALP_* flags namespace
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/6] xfs_scrub: log when a repair was unnecessary Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/6] xfs_scrub: require primary superblock repairs to complete before proceeding Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/6] xfs_scrub: don't report media errors for space with unknowable owner Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In preparation to move all the repair code to repair.[ch], remove the
ALP_* flags namespace since it mostly overlaps with XRM_*.  Rename the
clunky "COMPLAIN_IF_UNFIXED" flag to "FINAL_WARNING", because that's
what it really means.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase3.c |    2 +-
 scrub/phase4.c |    2 +-
 scrub/phase5.c |    2 +-
 scrub/phase7.c |    2 +-
 scrub/repair.c |    4 ++--
 scrub/repair.h |   16 ++++++++++++----
 scrub/scrub.c  |   10 +++++-----
 scrub/scrub.h  |   10 ----------
 8 files changed, 23 insertions(+), 25 deletions(-)


diff --git a/scrub/phase3.c b/scrub/phase3.c
index 65e903f23d2..ef41ee8049d 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -88,7 +88,7 @@ try_inode_repair(
 		return 0;
 
 	ret = action_list_process(ictx->ctx, fd, alist,
-			ALP_REPAIR_ONLY | ALP_NOPROGRESS);
+			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
 	if (ret)
 		return ret;
 
diff --git a/scrub/phase4.c b/scrub/phase4.c
index da9e8759134..c71b56b7bda 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -54,7 +54,7 @@ repair_ag(
 	} while (unfixed > 0);
 
 	/* Try once more, but this time complain if we can't fix things. */
-	flags |= ALP_COMPLAIN_IF_UNFIXED;
+	flags |= XRM_FINAL_WARNING;
 	ret = action_list_process(ctx, -1, alist, flags);
 	if (ret)
 		*aborted = true;
diff --git a/scrub/phase5.c b/scrub/phase5.c
index 622e58138db..93dd14d50ba 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -421,7 +421,7 @@ iscan_worker(
 	}
 
 	ret = action_list_process(ctx, ctx->mnt.fd, &item->alist,
-			ALP_COMPLAIN_IF_UNFIXED | ALP_NOPROGRESS);
+			XRM_FINAL_WARNING | XRM_NOPROGRESS);
 	if (ret) {
 		str_liberror(ctx, ret, _("repairing iscan metadata"));
 		*item->abortedp = true;
diff --git a/scrub/phase7.c b/scrub/phase7.c
index 2b83e0a471d..e9cb40f48d8 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -122,7 +122,7 @@ phase7_func(
 	if (error)
 		return error;
 	error = action_list_process(ctx, -1, &alist,
-			ALP_COMPLAIN_IF_UNFIXED | ALP_NOPROGRESS);
+			XRM_FINAL_WARNING | XRM_NOPROGRESS);
 	if (error)
 		return error;
 
diff --git a/scrub/repair.c b/scrub/repair.c
index bb0faceca69..54016337896 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -274,7 +274,7 @@ action_list_process(
 		fix = xfs_repair_metadata(ctx, xfdp, aitem, repair_flags);
 		switch (fix) {
 		case CHECK_DONE:
-			if (!(repair_flags & ALP_NOPROGRESS))
+			if (!(repair_flags & XRM_NOPROGRESS))
 				progress_add(1);
 			alist->nr--;
 			list_del(&aitem->list);
@@ -316,7 +316,7 @@ action_list_process_or_defer(
 	int				ret;
 
 	ret = action_list_process(ctx, -1, alist,
-			ALP_REPAIR_ONLY | ALP_NOPROGRESS);
+			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
 	if (ret)
 		return ret;
 
diff --git a/scrub/repair.h b/scrub/repair.h
index d7a3a9e7cff..969871bd8bf 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -32,10 +32,18 @@ void action_list_find_mustfix(struct action_list *actions,
 		unsigned long long *broken_primaries,
 		unsigned long long *broken_secondaries);
 
-/* Passed through to xfs_repair_metadata() */
-#define ALP_REPAIR_ONLY		(XRM_REPAIR_ONLY)
-#define ALP_COMPLAIN_IF_UNFIXED	(XRM_COMPLAIN_IF_UNFIXED)
-#define ALP_NOPROGRESS		(1U << 31)
+/*
+ * Only ask the kernel to repair this object if the kernel directly told us it
+ * was corrupt.  Objects that are only flagged as having cross-referencing
+ * errors or flagged as eligible for optimization are left for later.
+ */
+#define XRM_REPAIR_ONLY		(1U << 0)
+
+/* This is the last repair attempt; complain if still broken even after fix. */
+#define XRM_FINAL_WARNING	(1U << 1)
+
+/* Don't call progress_add after repairing an item. */
+#define XRM_NOPROGRESS		(1U << 2)
 
 int action_list_process(struct scrub_ctx *ctx, int fd,
 		struct action_list *alist, unsigned int repair_flags);
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 975dd8efbbc..7b9c360f796 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -734,7 +734,7 @@ _("Filesystem is shut down, aborting."));
 		 * could fix this, it's at least worth trying the scan
 		 * again to see if another repair fixed it.
 		 */
-		if (!(repair_flags & XRM_COMPLAIN_IF_UNFIXED))
+		if (!(repair_flags & XRM_FINAL_WARNING))
 			return CHECK_RETRY;
 		/*
 		 * If we forced repairs or this is a preen, don't
@@ -771,13 +771,13 @@ _("Read-only filesystem; cannot make changes."));
 		 * to requeue the repair for later and don't say a
 		 * thing.  Otherwise, print error and bail out.
 		 */
-		if (!(repair_flags & XRM_COMPLAIN_IF_UNFIXED))
+		if (!(repair_flags & XRM_FINAL_WARNING))
 			return CHECK_RETRY;
 		str_liberror(ctx, error, descr_render(&dsc));
 		return CHECK_DONE;
 	}
 
-	if (repair_flags & XRM_COMPLAIN_IF_UNFIXED)
+	if (repair_flags & XRM_FINAL_WARNING)
 		scrub_warn_incomplete_scrub(ctx, &dsc, &meta);
 	if (needs_repair(&meta)) {
 		/*
@@ -785,7 +785,7 @@ _("Read-only filesystem; cannot make changes."));
 		 * just requeue this and try again later.  Otherwise we
 		 * log the error loudly and don't try again.
 		 */
-		if (!(repair_flags & XRM_COMPLAIN_IF_UNFIXED))
+		if (!(repair_flags & XRM_FINAL_WARNING))
 			return CHECK_RETRY;
 		str_corrupt(ctx, descr_render(&dsc),
 _("Repair unsuccessful; offline repair required."));
@@ -797,7 +797,7 @@ _("Repair unsuccessful; offline repair required."));
 		 * caller to run xfs_repair; otherwise, we'll keep trying to
 		 * reverify the cross-referencing as repairs progress.
 		 */
-		if (repair_flags & XRM_COMPLAIN_IF_UNFIXED) {
+		if (repair_flags & XRM_FINAL_WARNING) {
 			str_info(ctx, descr_render(&dsc),
  _("Seems correct but cross-referencing failed; offline repair recommended."));
 		} else {
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 0622677376c..751f2c8e9c1 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -54,16 +54,6 @@ struct action_item {
 	__u32			agno;
 };
 
-/*
- * Only ask the kernel to repair this object if the kernel directly told us it
- * was corrupt.  Objects that are only flagged as having cross-referencing
- * errors or flagged as eligible for optimization are left for later.
- */
-#define XRM_REPAIR_ONLY		(1U << 0)
-
-/* Complain if still broken even after fix. */
-#define XRM_COMPLAIN_IF_UNFIXED	(1U << 1)
-
 enum check_outcome xfs_repair_metadata(struct scrub_ctx *ctx,
 		struct xfs_fd *xfdp, struct action_item *aitem,
 		unsigned int repair_flags);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/6] xfs_scrub: move repair functions to repair.c
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 1/6] xfs_scrub: don't report media errors for space with unknowable owner Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 6/6] xfs_scrub: actually try to fix summary counters ahead of repairs Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move all the repair functions to repair.c.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c        |    2 
 scrub/repair.c        |  161 +++++++++++++++++++++++++++++++++++++++
 scrub/scrub.c         |  202 +------------------------------------------------
 scrub/scrub.h         |    6 -
 scrub/scrub_private.h |   55 +++++++++++++
 5 files changed, 222 insertions(+), 204 deletions(-)
 create mode 100644 scrub/scrub_private.h


diff --git a/scrub/phase1.c b/scrub/phase1.c
index cecb5e861f4..2d258a1a182 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -210,7 +210,7 @@ _("Kernel metadata scrubbing facility is not available."));
 	}
 
 	/* Do we need kernel-assisted metadata repair? */
-	if (ctx->mode != SCRUB_MODE_DRY_RUN && !xfs_can_repair(ctx)) {
+	if (ctx->mode != SCRUB_MODE_DRY_RUN && !can_repair(ctx)) {
 		str_error(ctx, ctx->mntpoint,
 _("Kernel metadata repair facility is not available.  Use -n to scrub."));
 		return ECANCELED;
diff --git a/scrub/repair.c b/scrub/repair.c
index 54016337896..a6fbdcb638d 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -10,11 +10,172 @@
 #include <sys/statvfs.h>
 #include "list.h"
 #include "libfrog/paths.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/scrub.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "scrub.h"
 #include "progress.h"
 #include "repair.h"
+#include "descr.h"
+#include "scrub_private.h"
+
+/* General repair routines. */
+
+/* Repair some metadata. */
+static enum check_outcome
+xfs_repair_metadata(
+	struct scrub_ctx		*ctx,
+	struct xfs_fd			*xfdp,
+	struct action_item		*aitem,
+	unsigned int			repair_flags)
+{
+	struct xfs_scrub_metadata	meta = { 0 };
+	struct xfs_scrub_metadata	oldm;
+	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
+	int				error;
+
+	assert(aitem->type < XFS_SCRUB_TYPE_NR);
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	meta.sm_type = aitem->type;
+	meta.sm_flags = aitem->flags | XFS_SCRUB_IFLAG_REPAIR;
+	if (use_force_rebuild)
+		meta.sm_flags |= XFS_SCRUB_IFLAG_FORCE_REBUILD;
+	switch (xfrog_scrubbers[aitem->type].group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
+		meta.sm_agno = aitem->agno;
+		break;
+	case XFROG_SCRUB_GROUP_INODE:
+		meta.sm_ino = aitem->ino;
+		meta.sm_gen = aitem->gen;
+		break;
+	default:
+		break;
+	}
+
+	if (!is_corrupt(&meta) && (repair_flags & XRM_REPAIR_ONLY))
+		return CHECK_RETRY;
+
+	memcpy(&oldm, &meta, sizeof(oldm));
+	descr_set(&dsc, &oldm);
+
+	if (needs_repair(&meta))
+		str_info(ctx, descr_render(&dsc), _("Attempting repair."));
+	else if (debug || verbose)
+		str_info(ctx, descr_render(&dsc),
+				_("Attempting optimization."));
+
+	error = -xfrog_scrub_metadata(xfdp, &meta);
+	switch (error) {
+	case 0:
+		/* No operational errors encountered. */
+		break;
+	case EDEADLOCK:
+	case EBUSY:
+		/* Filesystem is busy, try again later. */
+		if (debug || verbose)
+			str_info(ctx, descr_render(&dsc),
+_("Filesystem is busy, deferring repair."));
+		return CHECK_RETRY;
+	case ESHUTDOWN:
+		/* Filesystem is already shut down, abort. */
+		str_error(ctx, descr_render(&dsc),
+_("Filesystem is shut down, aborting."));
+		return CHECK_ABORT;
+	case ENOTTY:
+	case EOPNOTSUPP:
+		/*
+		 * If we're in no-complain mode, requeue the check for
+		 * later.  It's possible that an error in another
+		 * component caused us to flag an error in this
+		 * component.  Even if the kernel didn't think it
+		 * could fix this, it's at least worth trying the scan
+		 * again to see if another repair fixed it.
+		 */
+		if (!(repair_flags & XRM_FINAL_WARNING))
+			return CHECK_RETRY;
+		/*
+		 * If we forced repairs or this is a preen, don't
+		 * error out if the kernel doesn't know how to fix.
+		 */
+		if (is_unoptimized(&oldm) ||
+		    debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+			return CHECK_DONE;
+		fallthrough;
+	case EINVAL:
+		/* Kernel doesn't know how to repair this? */
+		str_corrupt(ctx, descr_render(&dsc),
+_("Don't know how to fix; offline repair required."));
+		return CHECK_DONE;
+	case EROFS:
+		/* Read-only filesystem, can't fix. */
+		if (verbose || debug || needs_repair(&oldm))
+			str_error(ctx, descr_render(&dsc),
+_("Read-only filesystem; cannot make changes."));
+		return CHECK_ABORT;
+	case ENOENT:
+		/* Metadata not present, just skip it. */
+		return CHECK_DONE;
+	case ENOMEM:
+	case ENOSPC:
+		/* Don't care if preen fails due to low resources. */
+		if (is_unoptimized(&oldm) && !needs_repair(&oldm))
+			return CHECK_DONE;
+		fallthrough;
+	default:
+		/*
+		 * Operational error.  If the caller doesn't want us
+		 * to complain about repair failures, tell the caller
+		 * to requeue the repair for later and don't say a
+		 * thing.  Otherwise, print error and bail out.
+		 */
+		if (!(repair_flags & XRM_FINAL_WARNING))
+			return CHECK_RETRY;
+		str_liberror(ctx, error, descr_render(&dsc));
+		return CHECK_DONE;
+	}
+
+	if (repair_flags & XRM_FINAL_WARNING)
+		scrub_warn_incomplete_scrub(ctx, &dsc, &meta);
+	if (needs_repair(&meta)) {
+		/*
+		 * Still broken; if we've been told not to complain then we
+		 * just requeue this and try again later.  Otherwise we
+		 * log the error loudly and don't try again.
+		 */
+		if (!(repair_flags & XRM_FINAL_WARNING))
+			return CHECK_RETRY;
+		str_corrupt(ctx, descr_render(&dsc),
+ _("Repair unsuccessful; offline repair required."));
+	} else if (xref_failed(&meta)) {
+		/*
+		 * This metadata object itself looks ok, but we still noticed
+		 * inconsistencies when comparing it with the other filesystem
+		 * metadata.  If we're in "final warning" mode, advise the
+		 * caller to run xfs_repair; otherwise, we'll keep trying to
+		 * reverify the cross-referencing as repairs progress.
+		 */
+		if (repair_flags & XRM_FINAL_WARNING) {
+			str_info(ctx, descr_render(&dsc),
+ _("Seems correct but cross-referencing failed; offline repair recommended."));
+		} else {
+			if (verbose)
+				str_info(ctx, descr_render(&dsc),
+ _("Seems correct but cross-referencing failed; will keep checking."));
+			return CHECK_RETRY;
+		}
+	} else {
+		/* Clean operation, no corruption detected. */
+		if (needs_repair(&oldm))
+			record_repair(ctx, descr_render(&dsc),
+					_("Repairs successful."));
+		else
+			record_preen(ctx, descr_render(&dsc),
+					_("Optimization successful."));
+	}
+	return CHECK_DONE;
+}
 
 /*
  * Prioritize action items in order of how long we can wait.
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 7b9c360f796..0f080028879 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -20,11 +20,12 @@
 #include "scrub.h"
 #include "repair.h"
 #include "descr.h"
+#include "scrub_private.h"
 
 /* Online scrub and repair wrappers. */
 
 /* Format a scrub description. */
-static int
+int
 format_scrub_descr(
 	struct scrub_ctx		*ctx,
 	char				*buf,
@@ -52,46 +53,8 @@ format_scrub_descr(
 	return -1;
 }
 
-/* Predicates for scrub flag state. */
-
-static inline bool is_corrupt(struct xfs_scrub_metadata *sm)
-{
-	return sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT;
-}
-
-static inline bool is_unoptimized(struct xfs_scrub_metadata *sm)
-{
-	return sm->sm_flags & XFS_SCRUB_OFLAG_PREEN;
-}
-
-static inline bool xref_failed(struct xfs_scrub_metadata *sm)
-{
-	return sm->sm_flags & XFS_SCRUB_OFLAG_XFAIL;
-}
-
-static inline bool xref_disagrees(struct xfs_scrub_metadata *sm)
-{
-	return sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT;
-}
-
-static inline bool is_incomplete(struct xfs_scrub_metadata *sm)
-{
-	return sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE;
-}
-
-static inline bool is_suspicious(struct xfs_scrub_metadata *sm)
-{
-	return sm->sm_flags & XFS_SCRUB_OFLAG_WARNING;
-}
-
-/* Should we fix it? */
-static inline bool needs_repair(struct xfs_scrub_metadata *sm)
-{
-	return is_corrupt(sm) || xref_disagrees(sm);
-}
-
 /* Warn about strange circumstances after scrub. */
-static inline void
+void
 scrub_warn_incomplete_scrub(
 	struct scrub_ctx		*ctx,
 	struct descr			*dsc,
@@ -647,7 +610,7 @@ can_scrub_parent(
 }
 
 bool
-xfs_can_repair(
+can_repair(
 	struct scrub_ctx	*ctx)
 {
 	return __scrub_test(ctx, XFS_SCRUB_TYPE_PROBE, XFS_SCRUB_IFLAG_REPAIR);
@@ -660,160 +623,3 @@ can_force_rebuild(
 	return __scrub_test(ctx, XFS_SCRUB_TYPE_PROBE,
 			XFS_SCRUB_IFLAG_REPAIR | XFS_SCRUB_IFLAG_FORCE_REBUILD);
 }
-
-/* General repair routines. */
-
-/* Repair some metadata. */
-enum check_outcome
-xfs_repair_metadata(
-	struct scrub_ctx		*ctx,
-	struct xfs_fd			*xfdp,
-	struct action_item		*aitem,
-	unsigned int			repair_flags)
-{
-	struct xfs_scrub_metadata	meta = { 0 };
-	struct xfs_scrub_metadata	oldm;
-	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
-	int				error;
-
-	assert(aitem->type < XFS_SCRUB_TYPE_NR);
-	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
-	meta.sm_type = aitem->type;
-	meta.sm_flags = aitem->flags | XFS_SCRUB_IFLAG_REPAIR;
-	if (use_force_rebuild)
-		meta.sm_flags |= XFS_SCRUB_IFLAG_FORCE_REBUILD;
-	switch (xfrog_scrubbers[aitem->type].group) {
-	case XFROG_SCRUB_GROUP_AGHEADER:
-	case XFROG_SCRUB_GROUP_PERAG:
-		meta.sm_agno = aitem->agno;
-		break;
-	case XFROG_SCRUB_GROUP_INODE:
-		meta.sm_ino = aitem->ino;
-		meta.sm_gen = aitem->gen;
-		break;
-	default:
-		break;
-	}
-
-	if (!is_corrupt(&meta) && (repair_flags & XRM_REPAIR_ONLY))
-		return CHECK_RETRY;
-
-	memcpy(&oldm, &meta, sizeof(oldm));
-	descr_set(&dsc, &oldm);
-
-	if (needs_repair(&meta))
-		str_info(ctx, descr_render(&dsc), _("Attempting repair."));
-	else if (debug || verbose)
-		str_info(ctx, descr_render(&dsc),
-				_("Attempting optimization."));
-
-	error = -xfrog_scrub_metadata(xfdp, &meta);
-	switch (error) {
-	case 0:
-		/* No operational errors encountered. */
-		break;
-	case EDEADLOCK:
-	case EBUSY:
-		/* Filesystem is busy, try again later. */
-		if (debug || verbose)
-			str_info(ctx, descr_render(&dsc),
-_("Filesystem is busy, deferring repair."));
-		return CHECK_RETRY;
-	case ESHUTDOWN:
-		/* Filesystem is already shut down, abort. */
-		str_error(ctx, descr_render(&dsc),
-_("Filesystem is shut down, aborting."));
-		return CHECK_ABORT;
-	case ENOTTY:
-	case EOPNOTSUPP:
-		/*
-		 * If we're in no-complain mode, requeue the check for
-		 * later.  It's possible that an error in another
-		 * component caused us to flag an error in this
-		 * component.  Even if the kernel didn't think it
-		 * could fix this, it's at least worth trying the scan
-		 * again to see if another repair fixed it.
-		 */
-		if (!(repair_flags & XRM_FINAL_WARNING))
-			return CHECK_RETRY;
-		/*
-		 * If we forced repairs or this is a preen, don't
-		 * error out if the kernel doesn't know how to fix.
-		 */
-		if (is_unoptimized(&oldm) ||
-		    debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
-			return CHECK_DONE;
-		fallthrough;
-	case EINVAL:
-		/* Kernel doesn't know how to repair this? */
-		str_corrupt(ctx, descr_render(&dsc),
-_("Don't know how to fix; offline repair required."));
-		return CHECK_DONE;
-	case EROFS:
-		/* Read-only filesystem, can't fix. */
-		if (verbose || debug || needs_repair(&oldm))
-			str_error(ctx, descr_render(&dsc),
-_("Read-only filesystem; cannot make changes."));
-		return CHECK_ABORT;
-	case ENOENT:
-		/* Metadata not present, just skip it. */
-		return CHECK_DONE;
-	case ENOMEM:
-	case ENOSPC:
-		/* Don't care if preen fails due to low resources. */
-		if (is_unoptimized(&oldm) && !needs_repair(&oldm))
-			return CHECK_DONE;
-		fallthrough;
-	default:
-		/*
-		 * Operational error.  If the caller doesn't want us
-		 * to complain about repair failures, tell the caller
-		 * to requeue the repair for later and don't say a
-		 * thing.  Otherwise, print error and bail out.
-		 */
-		if (!(repair_flags & XRM_FINAL_WARNING))
-			return CHECK_RETRY;
-		str_liberror(ctx, error, descr_render(&dsc));
-		return CHECK_DONE;
-	}
-
-	if (repair_flags & XRM_FINAL_WARNING)
-		scrub_warn_incomplete_scrub(ctx, &dsc, &meta);
-	if (needs_repair(&meta)) {
-		/*
-		 * Still broken; if we've been told not to complain then we
-		 * just requeue this and try again later.  Otherwise we
-		 * log the error loudly and don't try again.
-		 */
-		if (!(repair_flags & XRM_FINAL_WARNING))
-			return CHECK_RETRY;
-		str_corrupt(ctx, descr_render(&dsc),
-_("Repair unsuccessful; offline repair required."));
-	} else if (xref_failed(&meta)) {
-		/*
-		 * This metadata object itself looks ok, but we still noticed
-		 * inconsistencies when comparing it with the other filesystem
-		 * metadata.  If we're in "final warning" mode, advise the
-		 * caller to run xfs_repair; otherwise, we'll keep trying to
-		 * reverify the cross-referencing as repairs progress.
-		 */
-		if (repair_flags & XRM_FINAL_WARNING) {
-			str_info(ctx, descr_render(&dsc),
- _("Seems correct but cross-referencing failed; offline repair recommended."));
-		} else {
-			if (verbose)
-				str_info(ctx, descr_render(&dsc),
- _("Seems correct but cross-referencing failed; will keep checking."));
-			return CHECK_RETRY;
-		}
-	} else {
-		/* Clean operation, no corruption detected. */
-		if (needs_repair(&oldm))
-			record_repair(ctx, descr_render(&dsc),
-					_("Repairs successful."));
-		else
-			record_preen(ctx, descr_render(&dsc),
-					_("Optimization successful."));
-	}
-	return CHECK_DONE;
-}
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 751f2c8e9c1..9558c29f32d 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -38,7 +38,7 @@ bool can_scrub_dir(struct scrub_ctx *ctx);
 bool can_scrub_attr(struct scrub_ctx *ctx);
 bool can_scrub_symlink(struct scrub_ctx *ctx);
 bool can_scrub_parent(struct scrub_ctx *ctx);
-bool xfs_can_repair(struct scrub_ctx *ctx);
+bool can_repair(struct scrub_ctx *ctx);
 bool can_force_rebuild(struct scrub_ctx *ctx);
 
 int scrub_file(struct scrub_ctx *ctx, int fd, const struct xfs_bulkstat *bstat,
@@ -54,8 +54,4 @@ struct action_item {
 	__u32			agno;
 };
 
-enum check_outcome xfs_repair_metadata(struct scrub_ctx *ctx,
-		struct xfs_fd *xfdp, struct action_item *aitem,
-		unsigned int repair_flags);
-
 #endif /* XFS_SCRUB_SCRUB_H_ */
diff --git a/scrub/scrub_private.h b/scrub/scrub_private.h
new file mode 100644
index 00000000000..8bc0c521463
--- /dev/null
+++ b/scrub/scrub_private.h
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef XFS_SCRUB_SCRUB_PRIVATE_H_
+#define XFS_SCRUB_SCRUB_PRIVATE_H_
+
+/* Shared code between scrub.c and repair.c. */
+
+int format_scrub_descr(struct scrub_ctx *ctx, char *buf, size_t buflen,
+		void *where);
+
+/* Predicates for scrub flag state. */
+
+static inline bool is_corrupt(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT;
+}
+
+static inline bool is_unoptimized(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_PREEN;
+}
+
+static inline bool xref_failed(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_XFAIL;
+}
+
+static inline bool xref_disagrees(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT;
+}
+
+static inline bool is_incomplete(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE;
+}
+
+static inline bool is_suspicious(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_WARNING;
+}
+
+/* Should we fix it? */
+static inline bool needs_repair(struct xfs_scrub_metadata *sm)
+{
+	return is_corrupt(sm) || xref_disagrees(sm);
+}
+
+void scrub_warn_incomplete_scrub(struct scrub_ctx *ctx, struct descr *dsc,
+		struct xfs_scrub_metadata *meta);
+
+#endif /* XFS_SCRUB_SCRUB_PRIVATE_H_ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/6] xfs_scrub: log when a repair was unnecessary
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/6] xfs_scrub: require primary superblock repairs to complete before proceeding Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the kernel tells us that a filesystem object didn't need repairs, we
should log that with a message specific to that outcome.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/scrub/repair.c b/scrub/repair.c
index a6fbdcb638d..1ca8331bb04 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -165,6 +165,10 @@ _("Read-only filesystem; cannot make changes."));
  _("Seems correct but cross-referencing failed; will keep checking."));
 			return CHECK_RETRY;
 		}
+	} else if (meta.sm_flags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED) {
+		if (verbose)
+			str_info(ctx, descr_render(&dsc),
+					_("No modification needed."));
 	} else {
 		/* Clean operation, no corruption detected. */
 		if (needs_repair(&oldm))


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/6] xfs_scrub: require primary superblock repairs to complete before proceeding
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/6] xfs_scrub: log when a repair was unnecessary Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/6] xfs_scrub: remove ALP_* flags namespace Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Phase 2 of the xfs_scrub program calls the kernel to check the primary
superblock before scanning the rest of the filesystem.  Though doing so
is a no-op now (since the primary super must pass all checks as a
prerequisite for mounting), the goal of this code is to enable future
kernel code to intercept an xfs_scrub run before it actually does
anything.  If this some day involves fixing the primary superblock, it
seems reasonable to require that /all/ repairs complete successfully
before moving on to the rest of the filesystem.

Unfortunately, that's not what xfs_scrub does now -- primary super
repairs that fail are theoretically deferred to phase 4!  So make this
mandatory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase2.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/scrub/phase2.c b/scrub/phase2.c
index 75c302af075..774f3a17e9e 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -174,7 +174,8 @@ phase2_func(
 	ret = scrub_primary_super(ctx, &alist);
 	if (ret)
 		goto out_wq;
-	ret = action_list_process_or_defer(ctx, 0, &alist);
+	ret = action_list_process(ctx, -1, &alist,
+			XRM_FINAL_WARNING | XRM_NOPROGRESS);
 	if (ret)
 		goto out_wq;
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/6] xfs_scrub: actually try to fix summary counters ahead of repairs
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 3/6] xfs_scrub: move repair functions to repair.c Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

A while ago, I decided to make phase 4 check the summary counters before
it starts any other repairs, having observed that repairs of primary
metadata can fail because the summary counters (incorrectly) claim that
there aren't enough free resources in the filesystem.  However, if
problems are found in the summary counters, the repair work will be run
as part of the AG 0 repairs, which means that it runs concurrently with
other scrubbers.  This doesn't quite get us to the intended goal, so try
to fix the counters ahead of time.  If that fails, tough, we'll get
back to it in phase 7 if scrub gets that far.

Fixes: cbaf1c9d91a0 ("xfs_scrub: check summary counters")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase4.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)


diff --git a/scrub/phase4.c b/scrub/phase4.c
index c71b56b7bda..5929df38084 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -129,6 +129,7 @@ phase4_func(
 	struct scrub_ctx	*ctx)
 {
 	struct xfs_fsop_geom	fsgeom;
+	struct action_list	alist;
 	int			ret;
 
 	if (!have_action_items(ctx))
@@ -136,11 +137,13 @@ phase4_func(
 
 	/*
 	 * Check the summary counters early.  Normally we do this during phase
-	 * seven, but some of the cross-referencing requires fairly-accurate
-	 * counters, so counter repairs have to be put on the list now so that
-	 * they get fixed before we stop retrying unfixed metadata repairs.
+	 * seven, but some of the cross-referencing requires fairly accurate
+	 * summary counters.  Check and try to repair them now to minimize the
+	 * chance that repairs of primary metadata fail due to secondary
+	 * metadata.  If repairs fail, we'll come back during phase 7.
 	 */
-	ret = scrub_fs_counters(ctx, &ctx->action_lists[0]);
+	action_list_init(&alist);
+	ret = scrub_fs_counters(ctx, &alist);
 	if (ret)
 		return ret;
 
@@ -155,11 +158,18 @@ phase4_func(
 		return ret;
 
 	if (fsgeom.sick & XFS_FSOP_GEOM_SICK_QUOTACHECK) {
-		ret = scrub_quotacheck(ctx, &ctx->action_lists[0]);
+		ret = scrub_quotacheck(ctx, &alist);
 		if (ret)
 			return ret;
 	}
 
+	/* Repair counters before starting on the rest. */
+	ret = action_list_process(ctx, -1, &alist,
+			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
+	if (ret)
+		return ret;
+	action_list_discard(&alist);
+
 	ret = repair_everything(ctx);
 	if (ret)
 		return ret;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (50 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/6] xfs_scrub: get rid of trivial fs metadata scanner helpers Darrick J. Wong
                     ` (5 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
                   ` (25 subsequent siblings)
  77 siblings, 6 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

While I was poking through the QA results for xfs_scrub, I noticed that
it doesn't warn the user when the primary and secondary realtime
metadata are so out of whack that the chances of a successful repair are
not so high.  I decided that it was worth refactoring the scrub code a
bit so that we could warn the user about these types of things, and
ended up refactoring unnecessary helpers out of existence and fixing
other reporting gaps.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings
---
 scrub/phase1.c |    2 +-
 scrub/phase2.c |   53 +++++++++++++++++++++++++++++++++--------------------
 scrub/phase4.c |    9 +++++----
 scrub/phase5.c |   14 ++++++--------
 scrub/repair.c |   47 ++++++++++++++++++++++++++++++++++++-----------
 scrub/repair.h |   10 +++++++---
 scrub/scrub.c  |   52 +---------------------------------------------------
 scrub/scrub.h  |    7 ++-----
 8 files changed, 91 insertions(+), 103 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/6] xfs_scrub: collapse trivial superblock scrub helpers
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/6] xfs_scrub: get rid of trivial fs metadata scanner helpers Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 6/6] xfs_scrub: warn about difficult repairs to rt and quota metadata Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove the trivial primary super scrub helper function since it makes
tracing code paths difficult and will become annoying in the patches
that follow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase2.c |    9 +++++----
 scrub/scrub.c  |   16 +---------------
 scrub/scrub.h  |    3 ++-
 3 files changed, 8 insertions(+), 20 deletions(-)


diff --git a/scrub/phase2.c b/scrub/phase2.c
index 774f3a17e9e..7b6933a7475 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -166,12 +166,13 @@ phase2_func(
 	}
 
 	/*
-	 * In case we ever use the primary super scrubber to perform fs
-	 * upgrades (followed by a full scrub), do that before we launch
-	 * anything else.
+	 * Scrub primary superblock.  This will be useful if we ever need to
+	 * hook a filesystem-wide pre-scrub activity (e.g. enable filesystem
+	 * upgrades) off of the sb 0 scrubber (which currently does nothing).
+	 * If errors occur, this function will log them and return nonzero.
 	 */
 	action_list_init(&alist);
-	ret = scrub_primary_super(ctx, &alist);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, 0, &alist);
 	if (ret)
 		goto out_wq;
 	ret = action_list_process(ctx, -1, &alist,
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 0f080028879..bd33fcb770c 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -259,7 +259,7 @@ scrub_save_repair(
  * Returns 0 for success.  If errors occur, this function will log them and
  * return a positive error code.
  */
-static int
+int
 scrub_meta_type(
 	struct scrub_ctx		*ctx,
 	unsigned int			type,
@@ -325,20 +325,6 @@ scrub_group(
 	return 0;
 }
 
-/*
- * Scrub primary superblock.  This will be useful if we ever need to hook
- * a filesystem-wide pre-scrub activity off of the sb 0 scrubber (which
- * currently does nothing).  If errors occur, this function will log them and
- * return nonzero.
- */
-int
-scrub_primary_super(
-	struct scrub_ctx		*ctx,
-	struct action_list		*alist)
-{
-	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, 0, alist);
-}
-
 /* Scrub each AG's header blocks. */
 int
 scrub_ag_headers(
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 9558c29f32d..f228ffb89fc 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -17,7 +17,6 @@ enum check_outcome {
 struct action_item;
 
 void scrub_report_preen_triggers(struct scrub_ctx *ctx);
-int scrub_primary_super(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
 		struct action_list *alist);
 int scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
@@ -30,6 +29,8 @@ int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_quotacheck(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_nlinks(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_clean_health(struct scrub_ctx *ctx, struct action_list *alist);
+int scrub_meta_type(struct scrub_ctx *ctx, unsigned int type,
+		xfs_agnumber_t agno, struct action_list *alist);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool can_scrub_inode(struct scrub_ctx *ctx);


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/6] xfs_scrub: get rid of trivial fs metadata scanner helpers
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/6] xfs_scrub: collapse trivial superblock scrub helpers Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Get rid of these pointless wrappers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c |    2 +-
 scrub/phase4.c |    9 +++++----
 scrub/phase5.c |   14 ++++++--------
 scrub/scrub.c  |   36 ------------------------------------
 scrub/scrub.h  |    4 ----
 5 files changed, 12 insertions(+), 53 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 2d258a1a182..047631802e4 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -61,7 +61,7 @@ report_to_kernel(
 		return 0;
 
 	action_list_init(&alist);
-	ret = scrub_clean_health(ctx, &alist);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, 0, &alist);
 	if (ret)
 		return ret;
 
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 5929df38084..df9b066cfd2 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -136,14 +136,14 @@ phase4_func(
 		goto maybe_trim;
 
 	/*
-	 * Check the summary counters early.  Normally we do this during phase
-	 * seven, but some of the cross-referencing requires fairly accurate
+	 * Check the resource usage counters early.  Normally we do this during
+	 * phase 7, but some of the cross-referencing requires fairly accurate
 	 * summary counters.  Check and try to repair them now to minimize the
 	 * chance that repairs of primary metadata fail due to secondary
 	 * metadata.  If repairs fail, we'll come back during phase 7.
 	 */
 	action_list_init(&alist);
-	ret = scrub_fs_counters(ctx, &alist);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, 0, &alist);
 	if (ret)
 		return ret;
 
@@ -158,7 +158,8 @@ phase4_func(
 		return ret;
 
 	if (fsgeom.sick & XFS_FSOP_GEOM_SICK_QUOTACHECK) {
-		ret = scrub_quotacheck(ctx, &alist);
+		ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, 0,
+				&alist);
 		if (ret)
 			return ret;
 	}
diff --git a/scrub/phase5.c b/scrub/phase5.c
index 93dd14d50ba..e598ffd3985 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -383,12 +383,10 @@ check_fs_label(
 	return error;
 }
 
-typedef int (*iscan_item_fn)(struct scrub_ctx *, struct action_list *);
-
 struct iscan_item {
 	struct action_list	alist;
 	bool			*abortedp;
-	iscan_item_fn		scrub_fn;
+	unsigned int		scrub_type;
 };
 
 /* Run one inode-scan scrubber in this thread. */
@@ -413,7 +411,7 @@ iscan_worker(
 		nanosleep(&tv, NULL);
 	}
 
-	ret = item->scrub_fn(ctx, &item->alist);
+	ret = scrub_meta_type(ctx, item->scrub_type, 0, &item->alist);
 	if (ret) {
 		str_liberror(ctx, ret, _("checking iscan metadata"));
 		*item->abortedp = true;
@@ -439,7 +437,7 @@ queue_iscan(
 	struct workqueue	*wq,
 	bool			*abortedp,
 	xfs_agnumber_t		nr,
-	iscan_item_fn		scrub_fn)
+	unsigned int		scrub_type)
 {
 	struct iscan_item	*item;
 	struct scrub_ctx	*ctx = wq->wq_ctx;
@@ -452,7 +450,7 @@ queue_iscan(
 		return ret;
 	}
 	action_list_init(&item->alist);
-	item->scrub_fn = scrub_fn;
+	item->scrub_type = scrub_type;
 	item->abortedp = abortedp;
 
 	ret = -workqueue_add(wq, iscan_worker, nr, item);
@@ -484,14 +482,14 @@ run_kernel_iscan_scrubbers(
 	 * The nlinks scanner is much faster than quotacheck because it only
 	 * walks directories, so we start it first.
 	 */
-	ret = queue_iscan(&wq_iscan, &aborted, nr, scrub_nlinks);
+	ret = queue_iscan(&wq_iscan, &aborted, nr, XFS_SCRUB_TYPE_NLINKS);
 	if (ret)
 		goto wait;
 
 	if (nr_threads > 1)
 		nr++;
 
-	ret = queue_iscan(&wq_iscan, &aborted, nr, scrub_quotacheck);
+	ret = queue_iscan(&wq_iscan, &aborted, nr, XFS_SCRUB_TYPE_QUOTACHECK);
 	if (ret)
 		goto wait;
 
diff --git a/scrub/scrub.c b/scrub/scrub.c
index bd33fcb770c..fe4603f863b 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -366,42 +366,6 @@ scrub_summary_metadata(
 	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, 0, alist);
 }
 
-/* Scrub /only/ the superblock summary counters. */
-int
-scrub_fs_counters(
-	struct scrub_ctx		*ctx,
-	struct action_list		*alist)
-{
-	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, 0, alist);
-}
-
-/* Scrub /only/ the quota counters. */
-int
-scrub_quotacheck(
-	struct scrub_ctx		*ctx,
-	struct action_list		*alist)
-{
-	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, 0, alist);
-}
-
-/* Scrub /only/ the file link counters. */
-int
-scrub_nlinks(
-	struct scrub_ctx		*ctx,
-	struct action_list		*alist)
-{
-	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_NLINKS, 0, alist);
-}
-
-/* Update incore health records if we were clean. */
-int
-scrub_clean_health(
-	struct scrub_ctx		*ctx,
-	struct action_list		*alist)
-{
-	return scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, 0, alist);
-}
-
 /* How many items do we have to check? */
 unsigned int
 scrub_estimate_ag_work(
diff --git a/scrub/scrub.h b/scrub/scrub.h
index f228ffb89fc..b02e8f16815 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -25,10 +25,6 @@ int scrub_metadata_file(struct scrub_ctx *ctx, unsigned int scrub_type,
 		struct action_list *alist);
 int scrub_iscan_metadata(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
-int scrub_fs_counters(struct scrub_ctx *ctx, struct action_list *alist);
-int scrub_quotacheck(struct scrub_ctx *ctx, struct action_list *alist);
-int scrub_nlinks(struct scrub_ctx *ctx, struct action_list *alist);
-int scrub_clean_health(struct scrub_ctx *ctx, struct action_list *alist);
 int scrub_meta_type(struct scrub_ctx *ctx, unsigned int type,
 		xfs_agnumber_t agno, struct action_list *alist);
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/6] xfs_scrub: split up the mustfix repairs and difficulty assessment functions
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 4/6] xfs_scrub: add missing repair types to the mustfix and difficulty assessment Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, action_list_find_mustfix does two things -- it figures out
which repairs must be tried during phase 2 to enable the inode scan in
phase 3; and it figures out if xfs_scrub should warn about secondary and
primary metadata corruption that might make repair difficult.

Split these into separate functions to make each more coherent.  A long
time from now we'll need this to enable warnings about difficult rt
repairs, but for now this is merely a code cleanup.
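
For reference, the resulting interface looks like this (prototypes taken
from the repair.h hunk in the patch below; the comments are only a
summary, not part of the patch):

	/* Move must-fix repairs (inode btrees) onto immediate_alist. */
	void action_list_find_mustfix(struct action_list *actions,
			struct action_list *immediate_alist);

	/* Report whether primary and/or secondary metadata are corrupt. */
	unsigned int action_list_difficulty(const struct action_list *actions);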

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase2.c |   15 +++++++--------
 scrub/repair.c |   38 +++++++++++++++++++++++++++-----------
 scrub/repair.h |   10 +++++++---
 3 files changed, 41 insertions(+), 22 deletions(-)


diff --git a/scrub/phase2.c b/scrub/phase2.c
index 7b6933a7475..360426c5fb0 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -42,9 +42,8 @@ scan_ag_metadata(
 	struct scan_ctl			*sctl = arg;
 	struct action_list		alist;
 	struct action_list		immediate_alist;
-	unsigned long long		broken_primaries;
-	unsigned long long		broken_secondaries;
 	char				descr[DESCR_BUFSZ];
+	unsigned int			difficulty;
 	int				ret;
 
 	if (sctl->aborted)
@@ -79,12 +78,12 @@ scan_ag_metadata(
 	 * the inobt from rmapbt data, but if the rmapbt is broken even
 	 * at this early phase then we are sunk.
 	 */
-	broken_secondaries = 0;
-	broken_primaries = 0;
-	action_list_find_mustfix(&alist, &immediate_alist,
-			&broken_primaries, &broken_secondaries);
-	if (broken_secondaries && !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
-		if (broken_primaries)
+	difficulty = action_list_difficulty(&alist);
+	action_list_find_mustfix(&alist, &immediate_alist);
+
+	if ((difficulty & REPAIR_DIFFICULTY_SECONDARY) &&
+	    !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
+		if (difficulty & REPAIR_DIFFICULTY_PRIMARY)
 			str_info(ctx, descr,
 _("Corrupt primary and secondary block mapping metadata."));
 		else
diff --git a/scrub/repair.c b/scrub/repair.c
index 1ca8331bb04..814a385ce29 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -282,9 +282,7 @@ xfs_action_item_compare(
 void
 action_list_find_mustfix(
 	struct action_list		*alist,
-	struct action_list		*immediate_alist,
-	unsigned long long		*broken_primaries,
-	unsigned long long		*broken_secondaries)
+	struct action_list		*immediate_alist)
 {
 	struct action_item		*n;
 	struct action_item		*aitem;
@@ -293,25 +291,43 @@ action_list_find_mustfix(
 		if (!(aitem->flags & XFS_SCRUB_OFLAG_CORRUPT))
 			continue;
 		switch (aitem->type) {
-		case XFS_SCRUB_TYPE_RMAPBT:
-			(*broken_secondaries)++;
-			break;
 		case XFS_SCRUB_TYPE_FINOBT:
 		case XFS_SCRUB_TYPE_INOBT:
 			alist->nr--;
 			list_move_tail(&aitem->list, &immediate_alist->list);
 			immediate_alist->nr++;
-			fallthrough;
+			break;
+		}
+	}
+}
+
+/* Determine if primary or secondary metadata are inconsistent. */
+unsigned int
+action_list_difficulty(
+	const struct action_list	*alist)
+{
+	struct action_item		*aitem, *n;
+	unsigned int			ret = 0;
+
+	list_for_each_entry_safe(aitem, n, &alist->list, list) {
+		if (!(aitem->flags & XFS_SCRUB_OFLAG_CORRUPT))
+			continue;
+
+		switch (aitem->type) {
+		case XFS_SCRUB_TYPE_RMAPBT:
+			ret |= REPAIR_DIFFICULTY_SECONDARY;
+			break;
+		case XFS_SCRUB_TYPE_FINOBT:
+		case XFS_SCRUB_TYPE_INOBT:
 		case XFS_SCRUB_TYPE_BNOBT:
 		case XFS_SCRUB_TYPE_CNTBT:
 		case XFS_SCRUB_TYPE_REFCNTBT:
-			(*broken_primaries)++;
-			break;
-		default:
-			abort();
+			ret |= REPAIR_DIFFICULTY_PRIMARY;
 			break;
 		}
 	}
+
+	return ret;
 }
 
 /*
diff --git a/scrub/repair.h b/scrub/repair.h
index 969871bd8bf..4c3fd718575 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -28,9 +28,13 @@ void action_list_discard(struct action_list *alist);
 void action_list_splice(struct action_list *dest, struct action_list *src);
 
 void action_list_find_mustfix(struct action_list *actions,
-		struct action_list *immediate_alist,
-		unsigned long long *broken_primaries,
-		unsigned long long *broken_secondaries);
+		struct action_list *immediate_alist);
+
+/* Primary metadata is corrupt */
+#define REPAIR_DIFFICULTY_PRIMARY	(1U << 0)
+/* Secondary metadata is corrupt */
+#define REPAIR_DIFFICULTY_SECONDARY	(1U << 1)
+unsigned int action_list_difficulty(const struct action_list *actions);
 
 /*
  * Only ask the kernel to repair this object if the kernel directly told us it


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/6] xfs_scrub: add missing repair types to the mustfix and difficulty assessment
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 5/6] xfs_scrub: any inconsistency in metadata should trigger difficulty warnings Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/6] xfs_scrub: split up the mustfix repairs and difficulty assessment functions Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a few scrub types that ought to trigger a mustfix (such as AGI
corruption), and add all of the AG space metadata to the repair
difficulty assessment.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c |    7 +++++++
 1 file changed, 7 insertions(+)


diff --git a/scrub/repair.c b/scrub/repair.c
index 814a385ce29..cb0ca50a18e 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -291,6 +291,7 @@ action_list_find_mustfix(
 		if (!(aitem->flags & XFS_SCRUB_OFLAG_CORRUPT))
 			continue;
 		switch (aitem->type) {
+		case XFS_SCRUB_TYPE_AGI:
 		case XFS_SCRUB_TYPE_FINOBT:
 		case XFS_SCRUB_TYPE_INOBT:
 			alist->nr--;
@@ -317,11 +318,17 @@ action_list_difficulty(
 		case XFS_SCRUB_TYPE_RMAPBT:
 			ret |= REPAIR_DIFFICULTY_SECONDARY;
 			break;
+		case XFS_SCRUB_TYPE_SB:
+		case XFS_SCRUB_TYPE_AGF:
+		case XFS_SCRUB_TYPE_AGFL:
+		case XFS_SCRUB_TYPE_AGI:
 		case XFS_SCRUB_TYPE_FINOBT:
 		case XFS_SCRUB_TYPE_INOBT:
 		case XFS_SCRUB_TYPE_BNOBT:
 		case XFS_SCRUB_TYPE_CNTBT:
 		case XFS_SCRUB_TYPE_REFCNTBT:
+		case XFS_SCRUB_TYPE_RTBITMAP:
+		case XFS_SCRUB_TYPE_RTSUM:
 			ret |= REPAIR_DIFFICULTY_PRIMARY;
 			break;
 		}


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/6] xfs_scrub: any inconsistency in metadata should trigger difficulty warnings
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 6/6] xfs_scrub: warn about difficult repairs to rt and quota metadata Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/6] xfs_scrub: add missing repair types to the mustfix and difficulty assessment Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/6] xfs_scrub: split up the mustfix repairs and difficulty assessment functions Darrick J. Wong
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Any inconsistency in the space metadata can be a sign that repairs will
be difficult, so set off the warning if there were cross referencing
problems too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/scrub/repair.c b/scrub/repair.c
index cb0ca50a18e..6be5d7684b3 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -311,7 +311,9 @@ action_list_difficulty(
 	unsigned int			ret = 0;
 
 	list_for_each_entry_safe(aitem, n, &alist->list, list) {
-		if (!(aitem->flags & XFS_SCRUB_OFLAG_CORRUPT))
+		if (!(aitem->flags & (XFS_SCRUB_OFLAG_CORRUPT |
+				      XFS_SCRUB_OFLAG_XCORRUPT |
+				      XFS_SCRUB_OFLAG_XFAIL)))
 			continue;
 
 		switch (aitem->type) {


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/6] xfs_scrub: warn about difficult repairs to rt and quota metadata
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/6] xfs_scrub: get rid of trivial fs metadata scanner helpers Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/6] xfs_scrub: collapse trivial superblock scrub helpers Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/6] xfs_scrub: any inconsistency in metadata should trigger difficulty warnings Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Warn the user if there are problems with the rt or quota metadata that
might make repairs difficult.  For now there aren't any corruption
conditions that would trigger this, but we don't want to leave a gap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase2.c |   37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)


diff --git a/scrub/phase2.c b/scrub/phase2.c
index 360426c5fb0..a78d15aac1f 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -31,6 +31,25 @@ struct scan_ctl {
 	bool			aborted;
 };
 
+/* Warn about the types of mutual inconsistencies that may make repairs hard. */
+static inline void
+warn_repair_difficulties(
+	struct scrub_ctx	*ctx,
+	unsigned int		difficulty,
+	const char		*descr)
+{
+	if (!(difficulty & REPAIR_DIFFICULTY_SECONDARY))
+		return;
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+		return;
+
+	if (difficulty & REPAIR_DIFFICULTY_PRIMARY)
+		str_info(ctx, descr, _("Corrupt primary and secondary metadata."));
+	else
+		str_info(ctx, descr, _("Corrupt secondary metadata."));
+	str_info(ctx, descr, _("Filesystem might not be repairable."));
+}
+
 /* Scrub each AG's metadata btrees. */
 static void
 scan_ag_metadata(
@@ -80,18 +99,7 @@ scan_ag_metadata(
 	 */
 	difficulty = action_list_difficulty(&alist);
 	action_list_find_mustfix(&alist, &immediate_alist);
-
-	if ((difficulty & REPAIR_DIFFICULTY_SECONDARY) &&
-	    !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
-		if (difficulty & REPAIR_DIFFICULTY_PRIMARY)
-			str_info(ctx, descr,
-_("Corrupt primary and secondary block mapping metadata."));
-		else
-			str_info(ctx, descr,
-_("Corrupt secondary block mapping metadata."));
-		str_info(ctx, descr,
-_("Filesystem might not be repairable."));
-	}
+	warn_repair_difficulties(ctx, difficulty, descr);
 
 	/* Repair (inode) btree damage. */
 	ret = action_list_process_or_defer(ctx, agno, &immediate_alist);
@@ -115,6 +123,7 @@ scan_metafile(
 	struct action_list	alist;
 	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
 	struct scan_ctl		*sctl = arg;
+	unsigned int		difficulty;
 	int			ret;
 
 	if (sctl->aborted)
@@ -127,6 +136,10 @@ scan_metafile(
 		goto out;
 	}
 
+	/* Complain about metadata corruptions that might not be fixable. */
+	difficulty = action_list_difficulty(&alist);
+	warn_repair_difficulties(ctx, difficulty, xfrog_scrubbers[type].descr);
+
 	action_list_defer(ctx, 0, &alist);
 
 out:


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (51 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/9] xfs_scrub: remove scrub_metadata_file Darrick J. Wong
                     ` (8 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
                   ` (24 subsequent siblings)
  77 siblings, 9 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Certain kinds of XFS metadata depend on the correctness of lower level
metadata.  For example, directory indexes depend on the directory data
fork, which in turn depends on the directory inode being correct.  The
current scrub code does not strictly preserve these dependencies if it
has to defer a repair until phase 4, because phase 4 prioritizes repairs
by type (corruption, then cross referencing, and then preening) and
loses the ordering established in the previous phases.  This leads to
absurd situations like trying to repair a directory before repairing
its corrupted data fork.

To solve this problem, introduce a repair ticket structure to track all
the repairs pending for a principal object (inode, AG, etc).  This
reduces memory requirements if an object requires more than one type of
repair and makes it very easy to track the data dependencies between
sub-objects of a principal object.  Repair dependencies between object
types (e.g.  bnobt before inodes) must still be encoded statically into
phase 4.
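
As a rough sketch (field names are taken from the patches that follow),
a repair ticket records one principal object plus per-type state:

	struct scrub_item {
		/* principal object; unused fields are set to -1 */
		__u64	sri_ino;
		__u32	sri_gen;
		__u32	sri_agno;

		/* scrub state flags, one byte per XFS_SCRUB_TYPE_* */
		__u8	sri_state[XFS_SCRUB_TYPE_NR];
	};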

A secondary benefit of this new ticket structure is that we can decide
to attempt a repair of an object A that was flagged only for a cross
referencing error during the scan, if a different object B that depends
on A showed definitive signs of corruption.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps
---
 libfrog/scrub.c       |    1 
 scrub/phase1.c        |    9 -
 scrub/phase2.c        |   46 ++--
 scrub/phase3.c        |   77 ++++---
 scrub/phase4.c        |   17 +-
 scrub/phase5.c        |    9 -
 scrub/phase7.c        |    9 -
 scrub/repair.c        |  529 +++++++++++++++++++++++++++++++++----------------
 scrub/repair.h        |   47 +++-
 scrub/scrub.c         |  136 ++++++-------
 scrub/scrub.h         |  108 ++++++++--
 scrub/scrub_private.h |   37 +++
 12 files changed, 663 insertions(+), 362 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/9] xfs_scrub: track repair items by principal, not by individual repairs
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 2/9] xfs_scrub: use repair_item to direct repair activities Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 6/9] xfs_scrub: clean up repair_item_difficulty a little Darrick J. Wong
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new structure to track scrub and repair state by principal
filesystem object (e.g. AG number or inode number/generation) so that
we can more easily examine and enforce repair order
dependencies.  This transposition will eventually enable bulk scrub
operations and will also save a lot of memory if a given object needs a
lot of work.
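
A minimal sketch of how a phase 3 caller ends up using the new object,
abbreviated from the changes below (error handling omitted):

	struct scrub_item	sri;

	/* one scrub_item per principal object -- here, one file */
	scrub_item_init_file(&sri, bstat);

	/* each scrub call records its outcome in sri.sri_state[] */
	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_INODE, &alist, &sri);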

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c        |    4 ++
 scrub/phase2.c        |   14 ++++++--
 scrub/phase3.c        |   19 ++++++-----
 scrub/phase4.c        |    6 ++--
 scrub/phase5.c        |    5 ++-
 scrub/phase7.c        |    4 ++
 scrub/scrub.c         |   68 ++++++++++++++++++++++++++++++++--------
 scrub/scrub.h         |   83 +++++++++++++++++++++++++++++++++++++++++++++----
 scrub/scrub_private.h |   19 +++++++++++
 9 files changed, 185 insertions(+), 37 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 047631802e4..3113fc5ccf6 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -52,6 +52,7 @@ static int
 report_to_kernel(
 	struct scrub_ctx	*ctx)
 {
+	struct scrub_item	sri;
 	struct action_list	alist;
 	int			ret;
 
@@ -60,8 +61,9 @@ report_to_kernel(
 	    ctx->warnings_found)
 		return 0;
 
+	scrub_item_init_fs(&sri);
 	action_list_init(&alist);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, 0, &alist);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, 0, &alist, &sri);
 	if (ret)
 		return ret;
 
diff --git a/scrub/phase2.c b/scrub/phase2.c
index a78d15aac1f..50c2c88276f 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -57,6 +57,7 @@ scan_ag_metadata(
 	xfs_agnumber_t			agno,
 	void				*arg)
 {
+	struct scrub_item		sri;
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->wq_ctx;
 	struct scan_ctl			*sctl = arg;
 	struct action_list		alist;
@@ -68,6 +69,7 @@ scan_ag_metadata(
 	if (sctl->aborted)
 		return;
 
+	scrub_item_init_ag(&sri, agno);
 	action_list_init(&alist);
 	action_list_init(&immediate_alist);
 	snprintf(descr, DESCR_BUFSZ, _("AG %u"), agno);
@@ -76,7 +78,7 @@ scan_ag_metadata(
 	 * First we scrub and fix the AG headers, because we need
 	 * them to work well enough to check the AG btrees.
 	 */
-	ret = scrub_ag_headers(ctx, agno, &alist);
+	ret = scrub_ag_headers(ctx, agno, &alist, &sri);
 	if (ret)
 		goto err;
 
@@ -86,7 +88,7 @@ scan_ag_metadata(
 		goto err;
 
 	/* Now scrub the AG btrees. */
-	ret = scrub_ag_metadata(ctx, agno, &alist);
+	ret = scrub_ag_metadata(ctx, agno, &alist, &sri);
 	if (ret)
 		goto err;
 
@@ -120,6 +122,7 @@ scan_metafile(
 	xfs_agnumber_t		type,
 	void			*arg)
 {
+	struct scrub_item	sri;
 	struct action_list	alist;
 	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
 	struct scan_ctl		*sctl = arg;
@@ -129,8 +132,9 @@ scan_metafile(
 	if (sctl->aborted)
 		goto out;
 
+	scrub_item_init_fs(&sri);
 	action_list_init(&alist);
-	ret = scrub_metadata_file(ctx, type, &alist);
+	ret = scrub_metadata_file(ctx, type, &alist, &sri);
 	if (ret) {
 		sctl->aborted = true;
 		goto out;
@@ -162,6 +166,7 @@ phase2_func(
 		.rbm_done	= false,
 	};
 	struct action_list	alist;
+	struct scrub_item	sri;
 	const struct xfrog_scrub_descr *sc = xfrog_scrubbers;
 	xfs_agnumber_t		agno;
 	unsigned int		type;
@@ -183,8 +188,9 @@ phase2_func(
 	 * upgrades) off of the sb 0 scrubber (which currently does nothing).
 	 * If errors occur, this function will log them and return nonzero.
 	 */
+	scrub_item_init_ag(&sri, 0);
 	action_list_init(&alist);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, 0, &alist);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, 0, &alist, &sri);
 	if (ret)
 		goto out_wq;
 	ret = action_list_process(ctx, -1, &alist,
diff --git a/scrub/phase3.c b/scrub/phase3.c
index ef41ee8049d..ef22a1d11c1 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -105,12 +105,14 @@ scrub_inode(
 	void			*arg)
 {
 	struct action_list	alist;
+	struct scrub_item	sri;
 	struct scrub_inode_ctx	*ictx = arg;
 	struct ptcounter	*icount = ictx->icount;
 	xfs_agnumber_t		agno;
 	int			fd = -1;
 	int			error;
 
+	scrub_item_init_file(&sri, bstat);
 	action_list_init(&alist);
 	agno = cvt_ino_to_agno(&ctx->mnt, bstat->bs_ino);
 	background_sleep();
@@ -143,7 +145,7 @@ scrub_inode(
 		fd = scrub_open_handle(handle);
 
 	/* Scrub the inode. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_INODE, &alist);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_INODE, &alist, &sri);
 	if (error)
 		goto out;
 
@@ -152,13 +154,13 @@ scrub_inode(
 		goto out;
 
 	/* Scrub all block mappings. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTD, &alist);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTD, &alist, &sri);
 	if (error)
 		goto out;
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTA, &alist);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTA, &alist, &sri);
 	if (error)
 		goto out;
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTC, &alist);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTC, &alist, &sri);
 	if (error)
 		goto out;
 
@@ -169,21 +171,22 @@ scrub_inode(
 	if (S_ISLNK(bstat->bs_mode)) {
 		/* Check symlink contents. */
 		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_SYMLINK,
-				&alist);
+				&alist, &sri);
 	} else if (S_ISDIR(bstat->bs_mode)) {
 		/* Check the directory entries. */
-		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_DIR, &alist);
+		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_DIR, &alist,
+				&sri);
 	}
 	if (error)
 		goto out;
 
 	/* Check all the extended attributes. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_XATTR, &alist);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_XATTR, &alist, &sri);
 	if (error)
 		goto out;
 
 	/* Check parent pointers. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_PARENT, &alist);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_PARENT, &alist, &sri);
 	if (error)
 		goto out;
 
diff --git a/scrub/phase4.c b/scrub/phase4.c
index df9b066cfd2..31939653bda 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -130,6 +130,7 @@ phase4_func(
 {
 	struct xfs_fsop_geom	fsgeom;
 	struct action_list	alist;
+	struct scrub_item	sri;
 	int			ret;
 
 	if (!have_action_items(ctx))
@@ -142,8 +143,9 @@ phase4_func(
 	 * chance that repairs of primary metadata fail due to secondary
 	 * metadata.  If repairs fails, we'll come back during phase 7.
 	 */
+	scrub_item_init_fs(&sri);
 	action_list_init(&alist);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, 0, &alist);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, 0, &alist, &sri);
 	if (ret)
 		return ret;
 
@@ -159,7 +161,7 @@ phase4_func(
 
 	if (fsgeom.sick & XFS_FSOP_GEOM_SICK_QUOTACHECK) {
 		ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, 0,
-				&alist);
+				&alist, &sri);
 		if (ret)
 			return ret;
 	}
diff --git a/scrub/phase5.c b/scrub/phase5.c
index e598ffd3985..ea77c2a5298 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -384,6 +384,7 @@ check_fs_label(
 }
 
 struct iscan_item {
+	struct scrub_item	sri;
 	struct action_list	alist;
 	bool			*abortedp;
 	unsigned int		scrub_type;
@@ -411,7 +412,8 @@ iscan_worker(
 		nanosleep(&tv, NULL);
 	}
 
-	ret = scrub_meta_type(ctx, item->scrub_type, 0, &item->alist);
+	ret = scrub_meta_type(ctx, item->scrub_type, 0, &item->alist,
+			&item->sri);
 	if (ret) {
 		str_liberror(ctx, ret, _("checking iscan metadata"));
 		*item->abortedp = true;
@@ -449,6 +451,7 @@ queue_iscan(
 		str_liberror(ctx, ret, _("setting up iscan"));
 		return ret;
 	}
+	scrub_item_init_fs(&item->sri);
 	action_list_init(&item->alist);
 	item->scrub_type = scrub_type;
 	item->abortedp = abortedp;
diff --git a/scrub/phase7.c b/scrub/phase7.c
index e9cb40f48d8..ddc1e3b24e3 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -99,6 +99,7 @@ phase7_func(
 	struct scrub_ctx	*ctx)
 {
 	struct summary_counts	totalcount = {0};
+	struct scrub_item	sri;
 	struct action_list	alist;
 	struct ptvar		*ptvar;
 	unsigned long long	used_data;
@@ -117,8 +118,9 @@ phase7_func(
 	int			error;
 
 	/* Check and fix the summary metadata. */
+	scrub_item_init_fs(&sri);
 	action_list_init(&alist);
-	error = scrub_summary_metadata(ctx, &alist);
+	error = scrub_summary_metadata(ctx, &alist, &sri);
 	if (error)
 		return error;
 	error = action_list_process(ctx, -1, &alist,
diff --git a/scrub/scrub.c b/scrub/scrub.c
index fe4603f863b..55653b31c4c 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -264,7 +264,8 @@ scrub_meta_type(
 	struct scrub_ctx		*ctx,
 	unsigned int			type,
 	xfs_agnumber_t			agno,
-	struct action_list		*alist)
+	struct action_list		*alist,
+	struct scrub_item		*sri)
 {
 	struct xfs_scrub_metadata	meta = {
 		.sm_type		= type,
@@ -283,11 +284,13 @@ scrub_meta_type(
 	case CHECK_ABORT:
 		return ECANCELED;
 	case CHECK_REPAIR:
+		scrub_item_save_state(sri, type, meta.sm_flags);
 		ret = scrub_save_repair(ctx, alist, &meta);
 		if (ret)
 			return ret;
 		fallthrough;
 	case CHECK_DONE:
+		scrub_item_clean_state(sri, type);
 		return 0;
 	default:
 		/* CHECK_RETRY should never happen. */
@@ -305,7 +308,8 @@ scrub_group(
 	struct scrub_ctx		*ctx,
 	enum xfrog_scrub_group		group,
 	xfs_agnumber_t			agno,
-	struct action_list		*alist)
+	struct action_list		*alist,
+	struct scrub_item		*sri)
 {
 	const struct xfrog_scrub_descr	*sc;
 	unsigned int			type;
@@ -317,7 +321,7 @@ scrub_group(
 		if (sc->group != group)
 			continue;
 
-		ret = scrub_meta_type(ctx, type, agno, alist);
+		ret = scrub_meta_type(ctx, type, agno, alist, sri);
 		if (ret)
 			return ret;
 	}
@@ -330,9 +334,10 @@ int
 scrub_ag_headers(
 	struct scrub_ctx		*ctx,
 	xfs_agnumber_t			agno,
-	struct action_list		*alist)
+	struct action_list		*alist,
+	struct scrub_item		*sri)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_AGHEADER, agno, alist);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_AGHEADER, agno, alist, sri);
 }
 
 /* Scrub each AG's metadata btrees. */
@@ -340,9 +345,10 @@ int
 scrub_ag_metadata(
 	struct scrub_ctx		*ctx,
 	xfs_agnumber_t			agno,
-	struct action_list		*alist)
+	struct action_list		*alist,
+	struct scrub_item		*sri)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, agno, alist);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, agno, alist, sri);
 }
 
 /* Scrub one metadata file */
@@ -350,20 +356,22 @@ int
 scrub_metadata_file(
 	struct scrub_ctx		*ctx,
 	unsigned int			type,
-	struct action_list		*alist)
+	struct action_list		*alist,
+	struct scrub_item		*sri)
 {
 	ASSERT(xfrog_scrubbers[type].group == XFROG_SCRUB_GROUP_METAFILES);
 
-	return scrub_meta_type(ctx, type, 0, alist);
+	return scrub_meta_type(ctx, type, 0, alist, sri);
 }
 
 /* Scrub all FS summary metadata. */
 int
 scrub_summary_metadata(
 	struct scrub_ctx		*ctx,
-	struct action_list		*alist)
+	struct action_list		*alist,
+	struct scrub_item		*sri)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, 0, alist);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, 0, alist, sri);
 }
 
 /* How many items do we have to check? */
@@ -425,7 +433,8 @@ scrub_file(
 	int				fd,
 	const struct xfs_bulkstat	*bstat,
 	unsigned int			type,
-	struct action_list		*alist)
+	struct action_list		*alist,
+	struct scrub_item		*sri)
 {
 	struct xfs_scrub_metadata	meta = {0};
 	struct xfs_fd			xfd;
@@ -454,12 +463,45 @@ scrub_file(
 	fix = xfs_check_metadata(ctx, xfdp, &meta, true);
 	if (fix == CHECK_ABORT)
 		return ECANCELED;
-	if (fix == CHECK_DONE)
+	if (fix == CHECK_DONE) {
+		scrub_item_clean_state(sri, type);
 		return 0;
+	}
 
+	scrub_item_save_state(sri, type, meta.sm_flags);
 	return scrub_save_repair(ctx, alist, &meta);
 }
 
+/* Dump a scrub item for debugging purposes. */
+void
+scrub_item_dump(
+	struct scrub_item	*sri,
+	unsigned int		group_mask,
+	const char		*tag)
+{
+	unsigned int		i;
+
+	if (group_mask == 0)
+		group_mask = -1U;
+
+	printf("DUMP SCRUB ITEM FOR %s\n", tag);
+	if (sri->sri_ino != -1ULL)
+		printf("ino 0x%llx gen %u\n", (unsigned long long)sri->sri_ino,
+				sri->sri_gen);
+	if (sri->sri_agno != -1U)
+		printf("agno %u\n", sri->sri_agno);
+
+	foreach_scrub_type(i) {
+		unsigned int	g = 1U << xfrog_scrubbers[i].group;
+
+		if (g & group_mask)
+			printf("[%u]: type '%s' state 0x%x\n", i,
+					xfrog_scrubbers[i].name,
+					sri->sri_state[i]);
+	}
+	fflush(stdout);
+}
+
 /*
  * Test the availability of a kernel scrub command.  If errors occur (or the
  * scrub ioctl is rejected) the errors will be logged and this function will
diff --git a/scrub/scrub.h b/scrub/scrub.h
index b02e8f16815..546651b2818 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -16,17 +16,85 @@ enum check_outcome {
 
 struct action_item;
 
+/*
+ * These flags record the metadata object state that the kernel returned.
+ * We want to remember if the object was corrupt, if the cross-referencing
+ * revealed inconsistencies (xcorrupt), if the cross referencing itself failed
+ * (xfail) or if the object is correct but could be optimised (preen).
+ */
+#define SCRUB_ITEM_CORRUPT	(XFS_SCRUB_OFLAG_CORRUPT)	/* (1 << 1) */
+#define SCRUB_ITEM_PREEN	(XFS_SCRUB_OFLAG_PREEN)		/* (1 << 2) */
+#define SCRUB_ITEM_XFAIL	(XFS_SCRUB_OFLAG_XFAIL)		/* (1 << 3) */
+#define SCRUB_ITEM_XCORRUPT	(XFS_SCRUB_OFLAG_XCORRUPT)	/* (1 << 4) */
+
+/* All of the state flags that we need to prioritize repair work. */
+#define SCRUB_ITEM_REPAIR_ANY	(SCRUB_ITEM_CORRUPT | \
+				 SCRUB_ITEM_PREEN | \
+				 SCRUB_ITEM_XFAIL | \
+				 SCRUB_ITEM_XCORRUPT)
+
+struct scrub_item {
+	/*
+	 * Information we need to call the scrub and repair ioctls.  Per-AG
+	 * items should set the ino/gen fields to -1; per-inode items should
+	 * set sri_agno to -1; and per-fs items should set all three fields to
+	 * -1.  Or use the macros below.
+	 */
+	__u64			sri_ino;
+	__u32			sri_gen;
+	__u32			sri_agno;
+
+	/* Scrub item state flags, one for each XFS_SCRUB_TYPE. */
+	__u8			sri_state[XFS_SCRUB_TYPE_NR];
+};
+
+#define foreach_scrub_type(loopvar) \
+	for ((loopvar) = 0; (loopvar) < XFS_SCRUB_TYPE_NR; (loopvar)++)
+
+static inline void
+scrub_item_init_ag(struct scrub_item *sri, xfs_agnumber_t agno)
+{
+	memset(sri, 0, sizeof(*sri));
+	sri->sri_agno = agno;
+	sri->sri_ino = -1ULL;
+	sri->sri_gen = -1U;
+}
+
+static inline void
+scrub_item_init_fs(struct scrub_item *sri)
+{
+	memset(sri, 0, sizeof(*sri));
+	sri->sri_agno = -1U;
+	sri->sri_ino = -1ULL;
+	sri->sri_gen = -1U;
+}
+
+static inline void
+scrub_item_init_file(struct scrub_item *sri, struct xfs_bulkstat *bstat)
+{
+	memset(sri, 0, sizeof(*sri));
+	sri->sri_agno = -1U;
+	sri->sri_ino = bstat->bs_ino;
+	sri->sri_gen = bstat->bs_gen;
+}
+
+void scrub_item_dump(struct scrub_item *sri, unsigned int group_mask,
+		const char *tag);
+
 void scrub_report_preen_triggers(struct scrub_ctx *ctx);
 int scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
-		struct action_list *alist);
+		struct action_list *alist, struct scrub_item *sri);
 int scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
-		struct action_list *alist);
+		struct action_list *alist, struct scrub_item *sri);
 int scrub_metadata_file(struct scrub_ctx *ctx, unsigned int scrub_type,
-		struct action_list *alist);
-int scrub_iscan_metadata(struct scrub_ctx *ctx, struct action_list *alist);
-int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist);
+		struct action_list *alist, struct scrub_item *sri);
+int scrub_iscan_metadata(struct scrub_ctx *ctx, struct action_list *alist,
+		struct scrub_item *sri);
+int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist,
+		struct scrub_item *sri);
 int scrub_meta_type(struct scrub_ctx *ctx, unsigned int type,
-		xfs_agnumber_t agno, struct action_list *alist);
+		xfs_agnumber_t agno, struct action_list *alist,
+		struct scrub_item *sri);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool can_scrub_inode(struct scrub_ctx *ctx);
@@ -39,7 +107,8 @@ bool can_repair(struct scrub_ctx *ctx);
 bool can_force_rebuild(struct scrub_ctx *ctx);
 
 int scrub_file(struct scrub_ctx *ctx, int fd, const struct xfs_bulkstat *bstat,
-		unsigned int type, struct action_list *alist);
+		unsigned int type, struct action_list *alist,
+		struct scrub_item *sri);
 
 /* Repair parameters are the scrub inputs and retry count. */
 struct action_item {
diff --git a/scrub/scrub_private.h b/scrub/scrub_private.h
index 8bc0c521463..f91c65383d1 100644
--- a/scrub/scrub_private.h
+++ b/scrub/scrub_private.h
@@ -52,4 +52,23 @@ static inline bool needs_repair(struct xfs_scrub_metadata *sm)
 void scrub_warn_incomplete_scrub(struct scrub_ctx *ctx, struct descr *dsc,
 		struct xfs_scrub_metadata *meta);
 
+/* Scrub item functions */
+
+static inline void
+scrub_item_save_state(
+	struct scrub_item		*sri,
+	unsigned  int			scrub_type,
+	unsigned  int			scrub_flags)
+{
+	sri->sri_state[scrub_type] = scrub_flags & SCRUB_ITEM_REPAIR_ANY;
+}
+
+static inline void
+scrub_item_clean_state(
+	struct scrub_item		*sri,
+	unsigned  int			scrub_type)
+{
+	sri->sri_state[scrub_type] = 0;
+}
+
 #endif /* XFS_SCRUB_SCRUB_PRIVATE_H_ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/9] xfs_scrub: use repair_item to direct repair activities
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 3/9] xfs_scrub: remove action lists from phaseX code Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/9] xfs_scrub: track repair items by principal, not by individual repairs Darrick J. Wong
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that the new scrub_item tracks the state of any filesystem object
needing any kind of repair, use it to drive filesystem repairs and
updates to the in-kernel health status when repair finishes.
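
In short, the repair driver now walks the scrub_item state in severity
order; a condensed sketch of the flow implemented by repair_item()
below (error checks between calls omitted):

	/* most severe class first, preening last */
	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_CORRUPT, flags);
	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_XCORRUPT, flags);
	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_XFAIL, flags);
	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_PREEN, flags);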

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c |    2 
 scrub/phase2.c |   24 ++--
 scrub/phase3.c |   57 ++++----
 scrub/phase4.c |    7 -
 scrub/phase5.c |    2 
 scrub/phase7.c |    3 
 scrub/repair.c |  381 +++++++++++++++++++++++++++++++-------------------------
 scrub/repair.h |   45 +++++--
 scrub/scrub.c  |   44 ------
 scrub/scrub.h  |   12 --
 10 files changed, 298 insertions(+), 279 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 3113fc5ccf6..2c0ff7c8327 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -71,7 +71,7 @@ report_to_kernel(
 	 * Complain if we cannot fail the clean bill of health, unless we're
 	 * just testing repairs.
 	 */
-	if (action_list_length(&alist) > 0 &&
+	if (repair_item_count_needsrepair(&sri) != 0 &&
 	    !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
 		str_info(ctx, _("Couldn't upload clean bill of health."), NULL);
 		action_list_discard(&alist);
diff --git a/scrub/phase2.c b/scrub/phase2.c
index 50c2c88276f..83c467347fe 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -58,6 +58,7 @@ scan_ag_metadata(
 	void				*arg)
 {
 	struct scrub_item		sri;
+	struct scrub_item		fix_now;
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->wq_ctx;
 	struct scan_ctl			*sctl = arg;
 	struct action_list		alist;
@@ -83,7 +84,7 @@ scan_ag_metadata(
 		goto err;
 
 	/* Repair header damage. */
-	ret = action_list_process_or_defer(ctx, agno, &alist);
+	ret = repair_item_corruption(ctx, &sri);
 	if (ret)
 		goto err;
 
@@ -99,17 +100,19 @@ scan_ag_metadata(
 	 * the inobt from rmapbt data, but if the rmapbt is broken even
 	 * at this early phase then we are sunk.
 	 */
-	difficulty = action_list_difficulty(&alist);
-	action_list_find_mustfix(&alist, &immediate_alist);
+	difficulty = repair_item_difficulty(&sri);
+	repair_item_mustfix(&sri, &fix_now);
 	warn_repair_difficulties(ctx, difficulty, descr);
 
 	/* Repair (inode) btree damage. */
-	ret = action_list_process_or_defer(ctx, agno, &immediate_alist);
+	ret = repair_item_corruption(ctx, &fix_now);
 	if (ret)
 		goto err;
 
 	/* Everything else gets fixed during phase 4. */
-	action_list_defer(ctx, agno, &alist);
+	ret = repair_item_defer(ctx, &sri);
+	if (ret)
+		goto err;
 	return;
 err:
 	sctl->aborted = true;
@@ -141,10 +144,14 @@ scan_metafile(
 	}
 
 	/* Complain about metadata corruptions that might not be fixable. */
-	difficulty = action_list_difficulty(&alist);
+	difficulty = repair_item_difficulty(&sri);
 	warn_repair_difficulties(ctx, difficulty, xfrog_scrubbers[type].descr);
 
-	action_list_defer(ctx, 0, &alist);
+	ret = repair_item_defer(ctx, &sri);
+	if (ret) {
+		sctl->aborted = true;
+		goto out;
+	}
 
 out:
 	if (type == XFS_SCRUB_TYPE_RTBITMAP) {
@@ -193,8 +200,7 @@ phase2_func(
 	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, 0, &alist, &sri);
 	if (ret)
 		goto out_wq;
-	ret = action_list_process(ctx, -1, &alist,
-			XRM_FINAL_WARNING | XRM_NOPROGRESS);
+	ret = repair_item_completely(ctx, &sri);
 	if (ret)
 		goto out_wq;
 
diff --git a/scrub/phase3.c b/scrub/phase3.c
index ef22a1d11c1..7e09c48ce18 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -55,45 +55,48 @@ report_close_error(
  * Defer all the repairs until phase 4, being careful about locking since the
  * inode scrub threads are not per-AG.
  */
-static void
+static int
 defer_inode_repair(
-	struct scrub_inode_ctx	*ictx,
-	xfs_agnumber_t		agno,
-	struct action_list	*alist)
+	struct scrub_inode_ctx		*ictx,
+	const struct xfs_bulkstat	*bstat,
+	struct scrub_item		*sri)
 {
-	if (alist->nr == 0)
-		return;
+	struct action_item		*aitem = NULL;
+	xfs_agnumber_t			agno;
+	int				ret;
 
+	ret = repair_item_to_action_item(ictx->ctx, sri, &aitem);
+	if (ret || !aitem)
+		return ret;
+
+	agno = cvt_ino_to_agno(&ictx->ctx->mnt, bstat->bs_ino);
 	pthread_mutex_lock(&ictx->locks[agno]);
-	action_list_defer(ictx->ctx, agno, alist);
+	action_list_add(&ictx->ctx->action_lists[agno], aitem);
 	pthread_mutex_unlock(&ictx->locks[agno]);
+	return 0;
 }
 
-/* Run repair actions now and defer unfinished items for later. */
+/* Run repair actions now and leave unfinished items for later. */
 static int
 try_inode_repair(
-	struct scrub_inode_ctx	*ictx,
-	int			fd,
-	xfs_agnumber_t		agno,
-	struct action_list	*alist)
+	struct scrub_inode_ctx		*ictx,
+	struct scrub_item		*sri,
+	int				fd,
+	const struct xfs_bulkstat	*bstat)
 {
-	int			ret;
-
 	/*
 	 * If at the start of phase 3 we already had ag/rt metadata repairs
 	 * queued up for phase 4, leave the action list untouched so that file
-	 * metadata repairs will be deferred in scan order until phase 4.
+	 * metadata repairs will be deferred until phase 4.
 	 */
 	if (ictx->always_defer_repairs)
 		return 0;
 
-	ret = action_list_process(ictx->ctx, fd, alist,
-			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
-	if (ret)
-		return ret;
-
-	defer_inode_repair(ictx, agno, alist);
-	return 0;
+	/*
+	 * Try to repair the file metadata.  Unfixed metadata will remain in
+	 * the scrub item state to be queued as a single action item.
+	 */
+	return repair_file_corruption(ictx->ctx, sri, fd);
 }
 
 /* Verify the contents, xattrs, and extent maps of an inode. */
@@ -108,13 +111,11 @@ scrub_inode(
 	struct scrub_item	sri;
 	struct scrub_inode_ctx	*ictx = arg;
 	struct ptcounter	*icount = ictx->icount;
-	xfs_agnumber_t		agno;
 	int			fd = -1;
 	int			error;
 
 	scrub_item_init_file(&sri, bstat);
 	action_list_init(&alist);
-	agno = cvt_ino_to_agno(&ctx->mnt, bstat->bs_ino);
 	background_sleep();
 
 	/*
@@ -149,7 +150,7 @@ scrub_inode(
 	if (error)
 		goto out;
 
-	error = try_inode_repair(ictx, fd, agno, &alist);
+	error = try_inode_repair(ictx, &sri, fd, bstat);
 	if (error)
 		goto out;
 
@@ -164,7 +165,7 @@ scrub_inode(
 	if (error)
 		goto out;
 
-	error = try_inode_repair(ictx, fd, agno, &alist);
+	error = try_inode_repair(ictx, &sri, fd, bstat);
 	if (error)
 		goto out;
 
@@ -191,7 +192,7 @@ scrub_inode(
 		goto out;
 
 	/* Try to repair the file while it's open. */
-	error = try_inode_repair(ictx, fd, agno, &alist);
+	error = try_inode_repair(ictx, &sri, fd, bstat);
 	if (error)
 		goto out;
 
@@ -208,7 +209,7 @@ scrub_inode(
 	progress_add(1);
 
 	if (!error && !ictx->aborted)
-		defer_inode_repair(ictx, agno, &alist);
+		error = defer_inode_repair(ictx, bstat, &sri);
 
 	if (fd >= 0) {
 		int	err2;
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 31939653bda..3afd04af47e 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -40,7 +40,7 @@ repair_ag(
 
 	/* Repair anything broken until we fail to make progress. */
 	do {
-		ret = action_list_process(ctx, -1, alist, flags);
+		ret = action_list_process(ctx, alist, flags);
 		if (ret) {
 			*aborted = true;
 			return;
@@ -55,7 +55,7 @@ repair_ag(
 
 	/* Try once more, but this time complain if we can't fix things. */
 	flags |= XRM_FINAL_WARNING;
-	ret = action_list_process(ctx, -1, alist, flags);
+	ret = action_list_process(ctx, alist, flags);
 	if (ret)
 		*aborted = true;
 }
@@ -167,8 +167,7 @@ phase4_func(
 	}
 
 	/* Repair counters before starting on the rest. */
-	ret = action_list_process(ctx, -1, &alist,
-			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
+	ret = repair_item_corruption(ctx, &sri);
 	if (ret)
 		return ret;
 	action_list_discard(&alist);
diff --git a/scrub/phase5.c b/scrub/phase5.c
index ea77c2a5298..b7801b46760 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -420,7 +420,7 @@ iscan_worker(
 		goto out;
 	}
 
-	ret = action_list_process(ctx, ctx->mnt.fd, &item->alist,
+	ret = action_list_process(ctx, &item->alist,
 			XRM_FINAL_WARNING | XRM_NOPROGRESS);
 	if (ret) {
 		str_liberror(ctx, ret, _("repairing iscan metadata"));
diff --git a/scrub/phase7.c b/scrub/phase7.c
index ddc1e3b24e3..15540778ffa 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -123,8 +123,7 @@ phase7_func(
 	error = scrub_summary_metadata(ctx, &alist, &sri);
 	if (error)
 		return error;
-	error = action_list_process(ctx, -1, &alist,
-			XRM_FINAL_WARNING | XRM_NOPROGRESS);
+	error = repair_item_completely(ctx, &sri);
 	if (error)
 		return error;
 
diff --git a/scrub/repair.c b/scrub/repair.c
index 6be5d7684b3..cadd2c20627 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -27,7 +27,8 @@ static enum check_outcome
 xfs_repair_metadata(
 	struct scrub_ctx		*ctx,
 	struct xfs_fd			*xfdp,
-	struct action_item		*aitem,
+	unsigned int			scrub_type,
+	struct scrub_item		*sri,
 	unsigned int			repair_flags)
 {
 	struct xfs_scrub_metadata	meta = { 0 };
@@ -35,20 +36,20 @@ xfs_repair_metadata(
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
 	int				error;
 
-	assert(aitem->type < XFS_SCRUB_TYPE_NR);
+	assert(scrub_type < XFS_SCRUB_TYPE_NR);
 	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
-	meta.sm_type = aitem->type;
-	meta.sm_flags = aitem->flags | XFS_SCRUB_IFLAG_REPAIR;
+	meta.sm_type = scrub_type;
+	meta.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
 	if (use_force_rebuild)
 		meta.sm_flags |= XFS_SCRUB_IFLAG_FORCE_REBUILD;
-	switch (xfrog_scrubbers[aitem->type].group) {
+	switch (xfrog_scrubbers[scrub_type].group) {
 	case XFROG_SCRUB_GROUP_AGHEADER:
 	case XFROG_SCRUB_GROUP_PERAG:
-		meta.sm_agno = aitem->agno;
+		meta.sm_agno = sri->sri_agno;
 		break;
 	case XFROG_SCRUB_GROUP_INODE:
-		meta.sm_ino = aitem->ino;
-		meta.sm_gen = aitem->gen;
+		meta.sm_ino = sri->sri_ino;
+		meta.sm_gen = sri->sri_gen;
 		break;
 	default:
 		break;
@@ -58,9 +59,10 @@ xfs_repair_metadata(
 		return CHECK_RETRY;
 
 	memcpy(&oldm, &meta, sizeof(oldm));
+	oldm.sm_flags = sri->sri_state[scrub_type] & SCRUB_ITEM_REPAIR_ANY;
 	descr_set(&dsc, &oldm);
 
-	if (needs_repair(&meta))
+	if (needs_repair(&oldm))
 		str_info(ctx, descr_render(&dsc), _("Attempting repair."));
 	else if (debug || verbose)
 		str_info(ctx, descr_render(&dsc),
@@ -100,13 +102,16 @@ _("Filesystem is shut down, aborting."));
 		 * error out if the kernel doesn't know how to fix.
 		 */
 		if (is_unoptimized(&oldm) ||
-		    debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+		    debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
+			scrub_item_clean_state(sri, scrub_type);
 			return CHECK_DONE;
+		}
 		fallthrough;
 	case EINVAL:
 		/* Kernel doesn't know how to repair this? */
 		str_corrupt(ctx, descr_render(&dsc),
 _("Don't know how to fix; offline repair required."));
+		scrub_item_clean_state(sri, scrub_type);
 		return CHECK_DONE;
 	case EROFS:
 		/* Read-only filesystem, can't fix. */
@@ -116,23 +121,28 @@ _("Read-only filesystem; cannot make changes."));
 		return CHECK_ABORT;
 	case ENOENT:
 		/* Metadata not present, just skip it. */
+		scrub_item_clean_state(sri, scrub_type);
 		return CHECK_DONE;
 	case ENOMEM:
 	case ENOSPC:
 		/* Don't care if preen fails due to low resources. */
-		if (is_unoptimized(&oldm) && !needs_repair(&oldm))
+		if (is_unoptimized(&oldm) && !needs_repair(&oldm)) {
+			scrub_item_clean_state(sri, scrub_type);
 			return CHECK_DONE;
+		}
 		fallthrough;
 	default:
 		/*
-		 * Operational error.  If the caller doesn't want us
-		 * to complain about repair failures, tell the caller
-		 * to requeue the repair for later and don't say a
-		 * thing.  Otherwise, print error and bail out.
+		 * Operational error.  If the caller doesn't want us to
+		 * complain about repair failures, tell the caller to requeue
+		 * the repair for later and don't say a thing.  Otherwise,
+		 * print an error, mark the item clean because we're done with
+		 * trying to repair it, and bail out.
 		 */
 		if (!(repair_flags & XRM_FINAL_WARNING))
 			return CHECK_RETRY;
 		str_liberror(ctx, error, descr_render(&dsc));
+		scrub_item_clean_state(sri, scrub_type);
 		return CHECK_DONE;
 	}
 
@@ -178,12 +188,13 @@ _("Read-only filesystem; cannot make changes."));
 			record_preen(ctx, descr_render(&dsc),
 					_("Optimization successful."));
 	}
+
+	scrub_item_clean_state(sri, scrub_type);
 	return CHECK_DONE;
 }
 
 /*
  * Prioritize action items in order of how long we can wait.
- * 0 = do it now, 10000 = do it later.
  *
  * To minimize the amount of repair work, we want to prioritize metadata
  * objects by perceived corruptness.  If CORRUPT is set, the fields are
@@ -199,104 +210,34 @@ _("Read-only filesystem; cannot make changes."));
  * in order.
  */
 
-/* Sort action items in severity order. */
-static int
-PRIO(
-	struct action_item	*aitem,
-	int			order)
-{
-	if (aitem->flags & XFS_SCRUB_OFLAG_CORRUPT)
-		return order;
-	else if (aitem->flags & XFS_SCRUB_OFLAG_XCORRUPT)
-		return 100 + order;
-	else if (aitem->flags & XFS_SCRUB_OFLAG_XFAIL)
-		return 200 + order;
-	else if (aitem->flags & XFS_SCRUB_OFLAG_PREEN)
-		return 300 + order;
-	abort();
-}
-
-/* Sort the repair items in dependency order. */
-static int
-xfs_action_item_priority(
-	struct action_item	*aitem)
-{
-	switch (aitem->type) {
-	case XFS_SCRUB_TYPE_SB:
-	case XFS_SCRUB_TYPE_AGF:
-	case XFS_SCRUB_TYPE_AGFL:
-	case XFS_SCRUB_TYPE_AGI:
-	case XFS_SCRUB_TYPE_BNOBT:
-	case XFS_SCRUB_TYPE_CNTBT:
-	case XFS_SCRUB_TYPE_INOBT:
-	case XFS_SCRUB_TYPE_FINOBT:
-	case XFS_SCRUB_TYPE_REFCNTBT:
-	case XFS_SCRUB_TYPE_RMAPBT:
-	case XFS_SCRUB_TYPE_INODE:
-	case XFS_SCRUB_TYPE_BMBTD:
-	case XFS_SCRUB_TYPE_BMBTA:
-	case XFS_SCRUB_TYPE_BMBTC:
-		return PRIO(aitem, aitem->type - 1);
-	case XFS_SCRUB_TYPE_DIR:
-	case XFS_SCRUB_TYPE_XATTR:
-	case XFS_SCRUB_TYPE_SYMLINK:
-	case XFS_SCRUB_TYPE_PARENT:
-		return PRIO(aitem, XFS_SCRUB_TYPE_DIR);
-	case XFS_SCRUB_TYPE_RTBITMAP:
-	case XFS_SCRUB_TYPE_RTSUM:
-		return PRIO(aitem, XFS_SCRUB_TYPE_RTBITMAP);
-	case XFS_SCRUB_TYPE_UQUOTA:
-	case XFS_SCRUB_TYPE_GQUOTA:
-	case XFS_SCRUB_TYPE_PQUOTA:
-		return PRIO(aitem, XFS_SCRUB_TYPE_UQUOTA);
-	case XFS_SCRUB_TYPE_QUOTACHECK:
-		/* This should always go after [UGP]QUOTA no matter what. */
-		return PRIO(aitem, aitem->type);
-	case XFS_SCRUB_TYPE_FSCOUNTERS:
-		/* This should always go after AG headers no matter what. */
-		return PRIO(aitem, INT_MAX);
-	}
-	abort();
-}
-
-/* Make sure that btrees get repaired before headers. */
-static int
-xfs_action_item_compare(
-	void				*priv,
-	struct list_head		*a,
-	struct list_head		*b)
-{
-	struct action_item		*ra;
-	struct action_item		*rb;
-
-	ra = container_of(a, struct action_item, list);
-	rb = container_of(b, struct action_item, list);
-
-	return xfs_action_item_priority(ra) - xfs_action_item_priority(rb);
-}
+struct action_item {
+	struct list_head	list;
+	struct scrub_item	sri;
+};
 
 /*
  * Figure out which AG metadata must be fixed before we can move on
  * to the inode scan.
  */
 void
-action_list_find_mustfix(
-	struct action_list		*alist,
-	struct action_list		*immediate_alist)
+repair_item_mustfix(
+	struct scrub_item	*sri,
+	struct scrub_item	*fix_now)
 {
-	struct action_item		*n;
-	struct action_item		*aitem;
+	unsigned int		scrub_type;
 
-	list_for_each_entry_safe(aitem, n, &alist->list, list) {
-		if (!(aitem->flags & XFS_SCRUB_OFLAG_CORRUPT))
+	assert(sri->sri_agno != -1U);
+	scrub_item_init_ag(fix_now, sri->sri_agno);
+
+	foreach_scrub_type(scrub_type) {
+		if (!(sri->sri_state[scrub_type] & SCRUB_ITEM_CORRUPT))
 			continue;
-		switch (aitem->type) {
+
+		switch (scrub_type) {
 		case XFS_SCRUB_TYPE_AGI:
 		case XFS_SCRUB_TYPE_FINOBT:
 		case XFS_SCRUB_TYPE_INOBT:
-			alist->nr--;
-			list_move_tail(&aitem->list, &immediate_alist->list);
-			immediate_alist->nr++;
+			fix_now->sri_state[scrub_type] |= SCRUB_ITEM_CORRUPT;
 			break;
 		}
 	}
@@ -304,19 +245,19 @@ action_list_find_mustfix(
 
 /* Determine if primary or secondary metadata are inconsistent. */
 unsigned int
-action_list_difficulty(
-	const struct action_list	*alist)
+repair_item_difficulty(
+	const struct scrub_item	*sri)
 {
-	struct action_item		*aitem, *n;
-	unsigned int			ret = 0;
+	unsigned int		scrub_type;
+	unsigned int		ret = 0;
 
-	list_for_each_entry_safe(aitem, n, &alist->list, list) {
-		if (!(aitem->flags & (XFS_SCRUB_OFLAG_CORRUPT |
-				      XFS_SCRUB_OFLAG_XCORRUPT |
-				      XFS_SCRUB_OFLAG_XFAIL)))
+	foreach_scrub_type(scrub_type) {
+		if (!(sri->sri_state[scrub_type] & (XFS_SCRUB_OFLAG_CORRUPT |
+						    XFS_SCRUB_OFLAG_XCORRUPT |
+						    XFS_SCRUB_OFLAG_XFAIL)))
 			continue;
 
-		switch (aitem->type) {
+		switch (scrub_type) {
 		case XFS_SCRUB_TYPE_RMAPBT:
 			ret |= REPAIR_DIFFICULTY_SECONDARY;
 			break;
@@ -396,13 +337,19 @@ action_list_init(
 	alist->sorted = false;
 }
 
-/* Number of repairs in this list. */
+/* Number of pending repairs in this list. */
 unsigned long long
 action_list_length(
 	struct action_list		*alist)
 {
-	return alist->nr;
-};
+	struct action_item		*aitem;
+	unsigned long long		ret = 0;
+
+	list_for_each_entry(aitem, &alist->list, list)
+		ret += repair_item_count_needsrepair(&aitem->sri);
+
+	return ret;
+}
 
 /* Add to the list of repairs. */
 void
@@ -415,60 +362,78 @@ action_list_add(
 	alist->sorted = false;
 }
 
-/* Splice two repair lists. */
-void
-action_list_splice(
-	struct action_list		*dest,
-	struct action_list		*src)
-{
-	if (src->nr == 0)
-		return;
-
-	list_splice_tail_init(&src->list, &dest->list);
-	dest->nr += src->nr;
-	src->nr = 0;
-	dest->sorted = false;
-}
-
 /* Repair everything on this list. */
 int
 action_list_process(
 	struct scrub_ctx		*ctx,
-	int				fd,
 	struct action_list		*alist,
 	unsigned int			repair_flags)
+{
+	struct action_item		*aitem;
+	struct action_item		*n;
+	int				ret = 0;
+
+	list_for_each_entry_safe(aitem, n, &alist->list, list) {
+		if (scrub_excessive_errors(ctx))
+			return ECANCELED;
+
+		ret = repair_item(ctx, &aitem->sri, repair_flags);
+		if (ret)
+			break;
+
+		if (repair_item_count_needsrepair(&aitem->sri) == 0) {
+			list_del(&aitem->list);
+			free(aitem);
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * For a given filesystem object, perform all repairs of a given class
+ * (corrupt, xcorrupt, xfail, preen) if the repair item says it's needed.
+ */
+static int
+repair_item_class(
+	struct scrub_ctx		*ctx,
+	struct scrub_item		*sri,
+	int				override_fd,
+	uint8_t				repair_mask,
+	unsigned int			flags)
 {
 	struct xfs_fd			xfd;
 	struct xfs_fd			*xfdp = &ctx->mnt;
-	struct action_item		*aitem;
-	struct action_item		*n;
-	enum check_outcome		fix;
+	unsigned int			scrub_type;
+
+	if (ctx->mode < SCRUB_MODE_REPAIR)
+		return 0;
 
 	/*
 	 * If the caller passed us a file descriptor for a scrub, use it
 	 * instead of scrub-by-handle because this enables the kernel to skip
 	 * costly inode btree lookups.
 	 */
-	if (fd >= 0) {
+	if (override_fd >= 0) {
 		memcpy(&xfd, xfdp, sizeof(xfd));
-		xfd.fd = fd;
+		xfd.fd = override_fd;
 		xfdp = &xfd;
 	}
 
-	if (!alist->sorted) {
-		list_sort(NULL, &alist->list, xfs_action_item_compare);
-		alist->sorted = true;
-	}
+	foreach_scrub_type(scrub_type) {
+		enum check_outcome	fix;
 
-	list_for_each_entry_safe(aitem, n, &alist->list, list) {
-		fix = xfs_repair_metadata(ctx, xfdp, aitem, repair_flags);
+		if (scrub_excessive_errors(ctx))
+			return ECANCELED;
+
+		if (!(sri->sri_state[scrub_type] & repair_mask))
+			continue;
+
+		fix = xfs_repair_metadata(ctx, xfdp, scrub_type, sri, flags);
 		switch (fix) {
 		case CHECK_DONE:
-			if (!(repair_flags & XRM_NOPROGRESS))
+			if (!(flags & XRM_NOPROGRESS))
 				progress_add(1);
-			alist->nr--;
-			list_del(&aitem->list);
-			free(aitem);
 			continue;
 		case CHECK_ABORT:
 			return ECANCELED;
@@ -479,37 +444,113 @@ action_list_process(
 		}
 	}
 
-	if (scrub_excessive_errors(ctx))
-		return ECANCELED;
+	return 0;
+}
+
+/*
+ * Repair all parts (i.e. scrub types) of this filesystem object for which
+ * corruption has been observed directly.  Other types of repair work (fixing
+ * cross referencing problems and preening) are deferred.
+ *
+ * This function should only be called to perform spot repairs of fs objects
+ * during phase 2 and 3 while we still have open handles to those objects.
+ */
+int
+repair_item_corruption(
+	struct scrub_ctx	*ctx,
+	struct scrub_item	*sri)
+{
+	return repair_item_class(ctx, sri, -1, SCRUB_ITEM_CORRUPT,
+			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
+}
+
+/* Repair all parts of this file, similar to repair_item_corruption. */
+int
+repair_file_corruption(
+	struct scrub_ctx	*ctx,
+	struct scrub_item	*sri,
+	int			override_fd)
+{
+	return repair_item_class(ctx, sri, override_fd, SCRUB_ITEM_CORRUPT,
+			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
+}
+
+/*
+ * Repair everything in this filesystem object that needs it.  This includes
+ * cross-referencing and preening.
+ */
+int
+repair_item(
+	struct scrub_ctx	*ctx,
+	struct scrub_item	*sri,
+	unsigned int		flags)
+{
+	int			ret;
+
+	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_CORRUPT, flags);
+	if (ret)
+		return ret;
+
+	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_XCORRUPT, flags);
+	if (ret)
+		return ret;
+
+	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_XFAIL, flags);
+	if (ret)
+		return ret;
+
+	return repair_item_class(ctx, sri, -1, SCRUB_ITEM_PREEN, flags);
+}
+
+/* Create an action item around a scrub item that needs repairs. */
+int
+repair_item_to_action_item(
+	struct scrub_ctx	*ctx,
+	const struct scrub_item	*sri,
+	struct action_item	**aitemp)
+{
+	struct action_item	*aitem;
+
+	if (repair_item_count_needsrepair(sri) == 0)
+		return 0;
+
+	aitem = malloc(sizeof(struct action_item));
+	if (!aitem) {
+		int		error = errno;
+
+		str_liberror(ctx, error, _("creating repair action item"));
+		return error;
+	}
+
+	INIT_LIST_HEAD(&aitem->list);
+	memcpy(&aitem->sri, sri, sizeof(struct scrub_item));
+
+	*aitemp = aitem;
 	return 0;
 }
 
 /* Defer all the repairs until phase 4. */
-void
-action_list_defer(
-	struct scrub_ctx		*ctx,
-	xfs_agnumber_t			agno,
-	struct action_list		*alist)
+int
+repair_item_defer(
+	struct scrub_ctx	*ctx,
+	const struct scrub_item	*sri)
 {
+	struct action_item	*aitem = NULL;
+	unsigned int		agno;
+	int			error;
+
+	error = repair_item_to_action_item(ctx, sri, &aitem);
+	if (error || !aitem)
+		return error;
+
+	if (sri->sri_agno != -1U)
+		agno = sri->sri_agno;
+	else if (sri->sri_ino != -1ULL && sri->sri_gen != -1U)
+		agno = cvt_ino_to_agno(&ctx->mnt, sri->sri_ino);
+	else
+		agno = 0;
 	ASSERT(agno < ctx->mnt.fsgeom.agcount);
 
-	action_list_splice(&ctx->action_lists[agno], alist);
-}
-
-/* Run actions now and defer unfinished items for later. */
-int
-action_list_process_or_defer(
-	struct scrub_ctx		*ctx,
-	xfs_agnumber_t			agno,
-	struct action_list		*alist)
-{
-	int				ret;
-
-	ret = action_list_process(ctx, -1, alist,
-			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
-	if (ret)
-		return ret;
-
-	action_list_defer(ctx, agno, alist);
+	action_list_add(&ctx->action_lists[agno], aitem);
 	return 0;
 }
diff --git a/scrub/repair.h b/scrub/repair.h
index 4c3fd718575..b0b448cef7a 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -12,6 +12,8 @@ struct action_list {
 	bool			sorted;
 };
 
+struct action_item;
+
 int action_lists_alloc(size_t nr, struct action_list **listsp);
 void action_lists_free(struct action_list **listsp);
 
@@ -25,16 +27,14 @@ static inline bool action_list_empty(const struct action_list *alist)
 unsigned long long action_list_length(struct action_list *alist);
 void action_list_add(struct action_list *dest, struct action_item *item);
 void action_list_discard(struct action_list *alist);
-void action_list_splice(struct action_list *dest, struct action_list *src);
 
-void action_list_find_mustfix(struct action_list *actions,
-		struct action_list *immediate_alist);
+void repair_item_mustfix(struct scrub_item *sri, struct scrub_item *fix_now);
 
 /* Primary metadata is corrupt */
 #define REPAIR_DIFFICULTY_PRIMARY	(1U << 0)
 /* Secondary metadata is corrupt */
 #define REPAIR_DIFFICULTY_SECONDARY	(1U << 1)
-unsigned int action_list_difficulty(const struct action_list *actions);
+unsigned int repair_item_difficulty(const struct scrub_item *sri);
 
 /*
  * Only ask the kernel to repair this object if the kernel directly told us it
@@ -49,11 +49,36 @@ unsigned int action_list_difficulty(const struct action_list *actions);
 /* Don't call progress_add after repairing an item. */
 #define XRM_NOPROGRESS		(1U << 2)
 
-int action_list_process(struct scrub_ctx *ctx, int fd,
-		struct action_list *alist, unsigned int repair_flags);
-void action_list_defer(struct scrub_ctx *ctx, xfs_agnumber_t agno,
-		struct action_list *alist);
-int action_list_process_or_defer(struct scrub_ctx *ctx, xfs_agnumber_t agno,
-		struct action_list *alist);
+int action_list_process(struct scrub_ctx *ctx, struct action_list *alist,
+		unsigned int repair_flags);
+int repair_item_corruption(struct scrub_ctx *ctx, struct scrub_item *sri);
+int repair_file_corruption(struct scrub_ctx *ctx, struct scrub_item *sri,
+		int override_fd);
+int repair_item(struct scrub_ctx *ctx, struct scrub_item *sri,
+		unsigned int repair_flags);
+int repair_item_to_action_item(struct scrub_ctx *ctx,
+		const struct scrub_item *sri, struct action_item **aitemp);
+int repair_item_defer(struct scrub_ctx *ctx, const struct scrub_item *sri);
+
+static inline unsigned int
+repair_item_count_needsrepair(
+	const struct scrub_item	*sri)
+{
+	unsigned int		scrub_type;
+	unsigned int		nr = 0;
+
+	foreach_scrub_type(scrub_type)
+		if (sri->sri_state[scrub_type] & SCRUB_ITEM_REPAIR_ANY)
+			nr++;
+	return nr;
+}
+
+static inline int
+repair_item_completely(
+	struct scrub_ctx	*ctx,
+	struct scrub_item	*sri)
+{
+	return repair_item(ctx, sri, XRM_FINAL_WARNING | XRM_NOPROGRESS);
+}
 
 #endif /* XFS_SCRUB_REPAIR_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 55653b31c4c..e3bfee40489 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -217,42 +217,6 @@ _("Optimizations of %s are possible."), _(xfrog_scrubbers[i].descr));
 	}
 }
 
-/* Save a scrub context for later repairs. */
-static int
-scrub_save_repair(
-	struct scrub_ctx		*ctx,
-	struct action_list		*alist,
-	struct xfs_scrub_metadata	*meta)
-{
-	struct action_item		*aitem;
-
-	/* Schedule this item for later repairs. */
-	aitem = malloc(sizeof(struct action_item));
-	if (!aitem) {
-		str_errno(ctx, _("adding item to repair list"));
-		return errno;
-	}
-
-	memset(aitem, 0, sizeof(*aitem));
-	aitem->type = meta->sm_type;
-	aitem->flags = meta->sm_flags;
-	switch (xfrog_scrubbers[meta->sm_type].group) {
-	case XFROG_SCRUB_GROUP_AGHEADER:
-	case XFROG_SCRUB_GROUP_PERAG:
-		aitem->agno = meta->sm_agno;
-		break;
-	case XFROG_SCRUB_GROUP_INODE:
-		aitem->ino = meta->sm_ino;
-		aitem->gen = meta->sm_gen;
-		break;
-	default:
-		break;
-	}
-
-	action_list_add(alist, aitem);
-	return 0;
-}
-
 /*
  * Scrub a single XFS_SCRUB_TYPE_*, saving corruption reports for later.
  *
@@ -272,7 +236,6 @@ scrub_meta_type(
 		.sm_agno		= agno,
 	};
 	enum check_outcome		fix;
-	int				ret;
 
 	background_sleep();
 
@@ -285,10 +248,7 @@ scrub_meta_type(
 		return ECANCELED;
 	case CHECK_REPAIR:
 		scrub_item_save_state(sri, type, meta.sm_flags);
-		ret = scrub_save_repair(ctx, alist, &meta);
-		if (ret)
-			return ret;
-		fallthrough;
+		return 0;
 	case CHECK_DONE:
 		scrub_item_clean_state(sri, type);
 		return 0;
@@ -469,7 +429,7 @@ scrub_file(
 	}
 
 	scrub_item_save_state(sri, type, meta.sm_flags);
-	return scrub_save_repair(ctx, alist, &meta);
+	return 0;
 }
 
 /* Dump a scrub item for debugging purposes. */
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 546651b2818..95882eabedb 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -14,8 +14,6 @@ enum check_outcome {
 	CHECK_RETRY,	/* repair failed, try again later */
 };
 
-struct action_item;
-
 /*
  * These flags record the metadata object state that the kernel returned.
  * We want to remember if the object was corrupt, if the cross-referencing
@@ -110,14 +108,4 @@ int scrub_file(struct scrub_ctx *ctx, int fd, const struct xfs_bulkstat *bstat,
 		unsigned int type, struct action_list *alist,
 		struct scrub_item *sri);
 
-/* Repair parameters are the scrub inputs and retry count. */
-struct action_item {
-	struct list_head	list;
-	__u64			ino;
-	__u32			type;
-	__u32			flags;
-	__u32			gen;
-	__u32			agno;
-};
-
 #endif /* XFS_SCRUB_SCRUB_H_ */



* [PATCH 3/9] xfs_scrub: remove action lists from phaseX code
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/9] xfs_scrub: remove scrub_metadata_file Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/9] xfs_scrub: boost the repair priority of dependencies of damaged items Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/9] xfs_scrub: use repair_item to direct repair activities Darrick J. Wong
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we track repair schedules by filesystem object (and not by
individual repairs), we can get rid of all the on-stack list heads and
whatnot in the phaseX code.
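
For reference, a typical per-phase call pattern after this conversion
reduces to roughly the following sketch (trimmed from the phase7.c hunk
below, so it is illustrative rather than compilable on its own):

	struct scrub_item	sri;
	int			error;

	/* One scrub_item per principal object; no on-stack action lists. */
	scrub_item_init_fs(&sri);
	error = scrub_summary_metadata(ctx, &sri);
	if (error)
		return error;

	/* Any repairs recorded in the scrub_item are issued directly. */
	return repair_item_completely(ctx, &sri);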

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c |    5 +----
 scrub/phase2.c |   16 ++++------------
 scrub/phase3.c |   19 ++++++++-----------
 scrub/phase4.c |    8 ++------
 scrub/phase5.c |    8 ++------
 scrub/phase7.c |    4 +---
 scrub/scrub.c  |   37 ++++++++++++++++++++-----------------
 scrub/scrub.h  |   16 +++++-----------
 8 files changed, 43 insertions(+), 70 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 2c0ff7c8327..6b2f6cdd5fa 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -53,7 +53,6 @@ report_to_kernel(
 	struct scrub_ctx	*ctx)
 {
 	struct scrub_item	sri;
-	struct action_list	alist;
 	int			ret;
 
 	if (!ctx->scrub_setup_succeeded || ctx->corruptions_found ||
@@ -62,8 +61,7 @@ report_to_kernel(
 		return 0;
 
 	scrub_item_init_fs(&sri);
-	action_list_init(&alist);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, 0, &alist, &sri);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, &sri);
 	if (ret)
 		return ret;
 
@@ -74,7 +72,6 @@ report_to_kernel(
 	if (repair_item_count_needsrepair(&sri) != 0 &&
 	    !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
 		str_info(ctx, _("Couldn't upload clean bill of health."), NULL);
-		action_list_discard(&alist);
 	}
 
 	return 0;
diff --git a/scrub/phase2.c b/scrub/phase2.c
index 83c467347fe..656eccce449 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -61,8 +61,6 @@ scan_ag_metadata(
 	struct scrub_item		fix_now;
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->wq_ctx;
 	struct scan_ctl			*sctl = arg;
-	struct action_list		alist;
-	struct action_list		immediate_alist;
 	char				descr[DESCR_BUFSZ];
 	unsigned int			difficulty;
 	int				ret;
@@ -71,15 +69,13 @@ scan_ag_metadata(
 		return;
 
 	scrub_item_init_ag(&sri, agno);
-	action_list_init(&alist);
-	action_list_init(&immediate_alist);
 	snprintf(descr, DESCR_BUFSZ, _("AG %u"), agno);
 
 	/*
 	 * First we scrub and fix the AG headers, because we need
 	 * them to work well enough to check the AG btrees.
 	 */
-	ret = scrub_ag_headers(ctx, agno, &alist, &sri);
+	ret = scrub_ag_headers(ctx, &sri);
 	if (ret)
 		goto err;
 
@@ -89,7 +85,7 @@ scan_ag_metadata(
 		goto err;
 
 	/* Now scrub the AG btrees. */
-	ret = scrub_ag_metadata(ctx, agno, &alist, &sri);
+	ret = scrub_ag_metadata(ctx, &sri);
 	if (ret)
 		goto err;
 
@@ -126,7 +122,6 @@ scan_metafile(
 	void			*arg)
 {
 	struct scrub_item	sri;
-	struct action_list	alist;
 	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
 	struct scan_ctl		*sctl = arg;
 	unsigned int		difficulty;
@@ -136,8 +131,7 @@ scan_metafile(
 		goto out;
 
 	scrub_item_init_fs(&sri);
-	action_list_init(&alist);
-	ret = scrub_metadata_file(ctx, type, &alist, &sri);
+	ret = scrub_metadata_file(ctx, type, &sri);
 	if (ret) {
 		sctl->aborted = true;
 		goto out;
@@ -172,7 +166,6 @@ phase2_func(
 		.aborted	= false,
 		.rbm_done	= false,
 	};
-	struct action_list	alist;
 	struct scrub_item	sri;
 	const struct xfrog_scrub_descr *sc = xfrog_scrubbers;
 	xfs_agnumber_t		agno;
@@ -196,8 +189,7 @@ phase2_func(
 	 * If errors occur, this function will log them and return nonzero.
 	 */
 	scrub_item_init_ag(&sri, 0);
-	action_list_init(&alist);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, 0, &alist, &sri);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, &sri);
 	if (ret)
 		goto out_wq;
 	ret = repair_item_completely(ctx, &sri);
diff --git a/scrub/phase3.c b/scrub/phase3.c
index 7e09c48ce18..01171de64d1 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -107,7 +107,6 @@ scrub_inode(
 	struct xfs_bulkstat	*bstat,
 	void			*arg)
 {
-	struct action_list	alist;
 	struct scrub_item	sri;
 	struct scrub_inode_ctx	*ictx = arg;
 	struct ptcounter	*icount = ictx->icount;
@@ -115,7 +114,6 @@ scrub_inode(
 	int			error;
 
 	scrub_item_init_file(&sri, bstat);
-	action_list_init(&alist);
 	background_sleep();
 
 	/*
@@ -146,7 +144,7 @@ scrub_inode(
 		fd = scrub_open_handle(handle);
 
 	/* Scrub the inode. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_INODE, &alist, &sri);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_INODE, &sri);
 	if (error)
 		goto out;
 
@@ -155,13 +153,13 @@ scrub_inode(
 		goto out;
 
 	/* Scrub all block mappings. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTD, &alist, &sri);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTD, &sri);
 	if (error)
 		goto out;
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTA, &alist, &sri);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTA, &sri);
 	if (error)
 		goto out;
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTC, &alist, &sri);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTC, &sri);
 	if (error)
 		goto out;
 
@@ -172,22 +170,21 @@ scrub_inode(
 	if (S_ISLNK(bstat->bs_mode)) {
 		/* Check symlink contents. */
 		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_SYMLINK,
-				&alist, &sri);
+				&sri);
 	} else if (S_ISDIR(bstat->bs_mode)) {
 		/* Check the directory entries. */
-		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_DIR, &alist,
-				&sri);
+		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_DIR, &sri);
 	}
 	if (error)
 		goto out;
 
 	/* Check all the extended attributes. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_XATTR, &alist, &sri);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_XATTR, &sri);
 	if (error)
 		goto out;
 
 	/* Check parent pointers. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_PARENT, &alist, &sri);
+	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_PARENT, &sri);
 	if (error)
 		goto out;
 
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 3afd04af47e..ee6aa90f326 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -129,7 +129,6 @@ phase4_func(
 	struct scrub_ctx	*ctx)
 {
 	struct xfs_fsop_geom	fsgeom;
-	struct action_list	alist;
 	struct scrub_item	sri;
 	int			ret;
 
@@ -144,8 +143,7 @@ phase4_func(
 	 * metadata.  If repairs fails, we'll come back during phase 7.
 	 */
 	scrub_item_init_fs(&sri);
-	action_list_init(&alist);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, 0, &alist, &sri);
+	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, &sri);
 	if (ret)
 		return ret;
 
@@ -160,8 +158,7 @@ phase4_func(
 		return ret;
 
 	if (fsgeom.sick & XFS_FSOP_GEOM_SICK_QUOTACHECK) {
-		ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, 0,
-				&alist, &sri);
+		ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, &sri);
 		if (ret)
 			return ret;
 	}
@@ -170,7 +167,6 @@ phase4_func(
 	ret = repair_item_corruption(ctx, &sri);
 	if (ret)
 		return ret;
-	action_list_discard(&alist);
 
 	ret = repair_everything(ctx);
 	if (ret)
diff --git a/scrub/phase5.c b/scrub/phase5.c
index b7801b46760..ea32d185751 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -385,7 +385,6 @@ check_fs_label(
 
 struct iscan_item {
 	struct scrub_item	sri;
-	struct action_list	alist;
 	bool			*abortedp;
 	unsigned int		scrub_type;
 };
@@ -412,16 +411,14 @@ iscan_worker(
 		nanosleep(&tv, NULL);
 	}
 
-	ret = scrub_meta_type(ctx, item->scrub_type, 0, &item->alist,
-			&item->sri);
+	ret = scrub_meta_type(ctx, item->scrub_type, &item->sri);
 	if (ret) {
 		str_liberror(ctx, ret, _("checking iscan metadata"));
 		*item->abortedp = true;
 		goto out;
 	}
 
-	ret = action_list_process(ctx, &item->alist,
-			XRM_FINAL_WARNING | XRM_NOPROGRESS);
+	ret = repair_item_completely(ctx, &item->sri);
 	if (ret) {
 		str_liberror(ctx, ret, _("repairing iscan metadata"));
 		*item->abortedp = true;
@@ -452,7 +449,6 @@ queue_iscan(
 		return ret;
 	}
 	scrub_item_init_fs(&item->sri);
-	action_list_init(&item->alist);
 	item->scrub_type = scrub_type;
 	item->abortedp = abortedp;
 
diff --git a/scrub/phase7.c b/scrub/phase7.c
index 15540778ffa..98846a1566b 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -100,7 +100,6 @@ phase7_func(
 {
 	struct summary_counts	totalcount = {0};
 	struct scrub_item	sri;
-	struct action_list	alist;
 	struct ptvar		*ptvar;
 	unsigned long long	used_data;
 	unsigned long long	used_rt;
@@ -119,8 +118,7 @@ phase7_func(
 
 	/* Check and fix the summary metadata. */
 	scrub_item_init_fs(&sri);
-	action_list_init(&alist);
-	error = scrub_summary_metadata(ctx, &alist, &sri);
+	error = scrub_summary_metadata(ctx, &sri);
 	if (error)
 		return error;
 	error = repair_item_completely(ctx, &sri);
diff --git a/scrub/scrub.c b/scrub/scrub.c
index e3bfee40489..5dd5cf67a8e 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -219,6 +219,7 @@ _("Optimizations of %s are possible."), _(xfrog_scrubbers[i].descr));
 
 /*
  * Scrub a single XFS_SCRUB_TYPE_*, saving corruption reports for later.
+ * Do not call this function to repair file metadata.
  *
  * Returns 0 for success.  If errors occur, this function will log them and
  * return a positive error code.
@@ -227,18 +228,29 @@ int
 scrub_meta_type(
 	struct scrub_ctx		*ctx,
 	unsigned int			type,
-	xfs_agnumber_t			agno,
-	struct action_list		*alist,
 	struct scrub_item		*sri)
 {
 	struct xfs_scrub_metadata	meta = {
 		.sm_type		= type,
-		.sm_agno		= agno,
 	};
 	enum check_outcome		fix;
 
 	background_sleep();
 
+	switch (xfrog_scrubbers[type].group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
+		meta.sm_agno = sri->sri_agno;
+		break;
+	case XFROG_SCRUB_GROUP_METAFILES:
+	case XFROG_SCRUB_GROUP_SUMMARY:
+	case XFROG_SCRUB_GROUP_NONE:
+		break;
+	default:
+		assert(0);
+		break;
+	}
+
 	/* Check the item. */
 	fix = xfs_check_metadata(ctx, &ctx->mnt, &meta, false);
 	progress_add(1);
@@ -267,8 +279,6 @@ static bool
 scrub_group(
 	struct scrub_ctx		*ctx,
 	enum xfrog_scrub_group		group,
-	xfs_agnumber_t			agno,
-	struct action_list		*alist,
 	struct scrub_item		*sri)
 {
 	const struct xfrog_scrub_descr	*sc;
@@ -281,7 +291,7 @@ scrub_group(
 		if (sc->group != group)
 			continue;
 
-		ret = scrub_meta_type(ctx, type, agno, alist, sri);
+		ret = scrub_meta_type(ctx, type, sri);
 		if (ret)
 			return ret;
 	}
@@ -293,22 +303,18 @@ scrub_group(
 int
 scrub_ag_headers(
 	struct scrub_ctx		*ctx,
-	xfs_agnumber_t			agno,
-	struct action_list		*alist,
 	struct scrub_item		*sri)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_AGHEADER, agno, alist, sri);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_AGHEADER, sri);
 }
 
 /* Scrub each AG's metadata btrees. */
 int
 scrub_ag_metadata(
 	struct scrub_ctx		*ctx,
-	xfs_agnumber_t			agno,
-	struct action_list		*alist,
 	struct scrub_item		*sri)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, agno, alist, sri);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, sri);
 }
 
 /* Scrub one metadata file */
@@ -316,22 +322,20 @@ int
 scrub_metadata_file(
 	struct scrub_ctx		*ctx,
 	unsigned int			type,
-	struct action_list		*alist,
 	struct scrub_item		*sri)
 {
 	ASSERT(xfrog_scrubbers[type].group == XFROG_SCRUB_GROUP_METAFILES);
 
-	return scrub_meta_type(ctx, type, 0, alist, sri);
+	return scrub_meta_type(ctx, type, sri);
 }
 
 /* Scrub all FS summary metadata. */
 int
 scrub_summary_metadata(
 	struct scrub_ctx		*ctx,
-	struct action_list		*alist,
 	struct scrub_item		*sri)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, 0, alist, sri);
+	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, sri);
 }
 
 /* How many items do we have to check? */
@@ -393,7 +397,6 @@ scrub_file(
 	int				fd,
 	const struct xfs_bulkstat	*bstat,
 	unsigned int			type,
-	struct action_list		*alist,
 	struct scrub_item		*sri)
 {
 	struct xfs_scrub_metadata	meta = {0};
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 95882eabedb..e1e70b38b8e 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -80,18 +80,13 @@ void scrub_item_dump(struct scrub_item *sri, unsigned int group_mask,
 		const char *tag);
 
 void scrub_report_preen_triggers(struct scrub_ctx *ctx);
-int scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
-		struct action_list *alist, struct scrub_item *sri);
-int scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
-		struct action_list *alist, struct scrub_item *sri);
+int scrub_ag_headers(struct scrub_ctx *ctx, struct scrub_item *sri);
+int scrub_ag_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
 int scrub_metadata_file(struct scrub_ctx *ctx, unsigned int scrub_type,
-		struct action_list *alist, struct scrub_item *sri);
-int scrub_iscan_metadata(struct scrub_ctx *ctx, struct action_list *alist,
-		struct scrub_item *sri);
-int scrub_summary_metadata(struct scrub_ctx *ctx, struct action_list *alist,
 		struct scrub_item *sri);
+int scrub_iscan_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
+int scrub_summary_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
 int scrub_meta_type(struct scrub_ctx *ctx, unsigned int type,
-		xfs_agnumber_t agno, struct action_list *alist,
 		struct scrub_item *sri);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
@@ -105,7 +100,6 @@ bool can_repair(struct scrub_ctx *ctx);
 bool can_force_rebuild(struct scrub_ctx *ctx);
 
 int scrub_file(struct scrub_ctx *ctx, int fd, const struct xfs_bulkstat *bstat,
-		unsigned int type, struct action_list *alist,
-		struct scrub_item *sri);
+		unsigned int type, struct scrub_item *sri);
 
 #endif /* XFS_SCRUB_SCRUB_H_ */



* [PATCH 4/9] xfs_scrub: remove scrub_metadata_file
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/9] xfs_scrub: boost the repair priority of dependencies of damaged items Darrick J. Wong
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Collapse this trivial wrapper into scrub_meta_type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase2.c |    2 +-
 scrub/scrub.c  |   12 ------------
 scrub/scrub.h  |    2 --
 3 files changed, 1 insertion(+), 15 deletions(-)


diff --git a/scrub/phase2.c b/scrub/phase2.c
index 656eccce449..138f0f8a8f3 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -131,7 +131,7 @@ scan_metafile(
 		goto out;
 
 	scrub_item_init_fs(&sri);
-	ret = scrub_metadata_file(ctx, type, &sri);
+	ret = scrub_meta_type(ctx, type, &sri);
 	if (ret) {
 		sctl->aborted = true;
 		goto out;
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 5dd5cf67a8e..b970d1cfe90 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -317,18 +317,6 @@ scrub_ag_metadata(
 	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, sri);
 }
 
-/* Scrub one metadata file */
-int
-scrub_metadata_file(
-	struct scrub_ctx		*ctx,
-	unsigned int			type,
-	struct scrub_item		*sri)
-{
-	ASSERT(xfrog_scrubbers[type].group == XFROG_SCRUB_GROUP_METAFILES);
-
-	return scrub_meta_type(ctx, type, sri);
-}
-
 /* Scrub all FS summary metadata. */
 int
 scrub_summary_metadata(
diff --git a/scrub/scrub.h b/scrub/scrub.h
index e1e70b38b8e..6e34ca2d7b3 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -82,8 +82,6 @@ void scrub_item_dump(struct scrub_item *sri, unsigned int group_mask,
 void scrub_report_preen_triggers(struct scrub_ctx *ctx);
 int scrub_ag_headers(struct scrub_ctx *ctx, struct scrub_item *sri);
 int scrub_ag_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
-int scrub_metadata_file(struct scrub_ctx *ctx, unsigned int scrub_type,
-		struct scrub_item *sri);
 int scrub_iscan_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
 int scrub_summary_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
 int scrub_meta_type(struct scrub_ctx *ctx, unsigned int type,



* [PATCH 5/9] xfs_scrub: boost the repair priority of dependencies of damaged items
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/9] xfs_scrub: remove scrub_metadata_file Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/9] xfs_scrub: remove action lists from phaseX code Darrick J. Wong
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In XFS, certain types of metadata objects depend on the correctness of
lower level metadata objects.  For example, directory blocks are stored
in the data fork of directory files, which means that any issues with
the inode core and the data fork should be dealt with before we try to
repair a directory.

xfs_scrub prioritises repairs by the severity of what the kernel scrub
function reports -- anything directly observed to be corrupt gets
repaired first, then anything that had trouble with cross-referencing,
and finally anything that was correct but could be further optimised.
Returning to the above example, if a directory data fork mapping offset
is off by a bit flip, scrub will mark the block mapping as failing
cross-referencing, but it will mark the directory itself as corrupt.
Repair should check out the mapping problem before it tackles the
directory.

Do this by embedding a dependency table and using it to boost the
priority of the repair_item fields as needed.
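
As a rough standalone model of the boost logic (the scrub types, state
bits, and function names below are simplified stand-ins, not the actual
xfs_scrub definitions, but the sketch compiles and runs as plain C):

#include <stdio.h>

/* Simplified stand-ins for a few scrub types and their state bits. */
enum { T_INODE, T_BMBTD, T_DIR, T_NR };

#define S_BOOST		(1U << 0)	/* boosted for a dependent repair */
#define S_CORRUPT	(1U << 1)	/* kernel reported corruption */
#define S_XCORRUPT	(1U << 2)	/* cross-referencing failed */

#define DEP(x)		(1U << (x))
static const unsigned int deps[T_NR] = {
	[T_BMBTD]	= DEP(T_INODE),
	[T_DIR]		= DEP(T_BMBTD),
};

/* Boost the xref-failed dependencies of anything scheduled for repair. */
static void boost_priorities(unsigned int state[T_NR])
{
	for (int t = 0; t < T_NR; t++) {
		unsigned int mask = deps[t];

		if (!(state[t] & (S_CORRUPT | S_XCORRUPT)) || !mask)
			continue;
		for (int b = 0; mask && b < T_NR; b++, mask >>= 1) {
			if (!(mask & 1))
				continue;
			if ((state[b] & S_XCORRUPT) &&
			    !(state[b] & S_CORRUPT))
				state[b] |= S_BOOST;
		}
	}
}

int main(void)
{
	/* The directory is corrupt; its bmap only failed cross-referencing. */
	unsigned int state[T_NR] = {
		[T_DIR]		= S_CORRUPT,
		[T_BMBTD]	= S_XCORRUPT,
	};

	boost_priorities(state);
	printf("bmbtd boosted? %s\n",
			(state[T_BMBTD] & S_BOOST) ? "yes" : "no");
	return 0;
}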

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libfrog/scrub.c       |    1 +
 scrub/repair.c        |   98 ++++++++++++++++++++++++++++++++++++++++++++++++-
 scrub/scrub.h         |   12 ++++++
 scrub/scrub_private.h |    8 ++++
 4 files changed, 116 insertions(+), 3 deletions(-)


diff --git a/libfrog/scrub.c b/libfrog/scrub.c
index 7cd241d9bce..3e322b4717d 100644
--- a/libfrog/scrub.c
+++ b/libfrog/scrub.c
@@ -150,6 +150,7 @@ const struct xfrog_scrub_descr xfrog_scrubbers[XFS_SCRUB_TYPE_NR] = {
 		.group	= XFROG_SCRUB_GROUP_NONE,
 	},
 };
+#undef DEP
 
 /* Invoke the scrub ioctl.  Returns zero or negative error code. */
 int
diff --git a/scrub/repair.c b/scrub/repair.c
index cadd2c20627..16acb0a0f10 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -22,6 +22,28 @@
 
 /* General repair routines. */
 
+/*
+ * Bitmap showing the correctness dependencies between scrub types for repairs.
+ * There are no edges between AG btrees and AG headers because we can't mount
+ * the filesystem if the btree root pointers in the AG headers are wrong.
+ * Dependencies cannot cross scrub groups.
+ */
+#define DEP(x) (1U << (x))
+static const unsigned int repair_deps[XFS_SCRUB_TYPE_NR] = {
+	[XFS_SCRUB_TYPE_BMBTD]		= DEP(XFS_SCRUB_TYPE_INODE),
+	[XFS_SCRUB_TYPE_BMBTA]		= DEP(XFS_SCRUB_TYPE_INODE),
+	[XFS_SCRUB_TYPE_BMBTC]		= DEP(XFS_SCRUB_TYPE_INODE),
+	[XFS_SCRUB_TYPE_DIR]		= DEP(XFS_SCRUB_TYPE_BMBTD),
+	[XFS_SCRUB_TYPE_XATTR]		= DEP(XFS_SCRUB_TYPE_BMBTA),
+	[XFS_SCRUB_TYPE_SYMLINK]	= DEP(XFS_SCRUB_TYPE_BMBTD),
+	[XFS_SCRUB_TYPE_PARENT]		= DEP(XFS_SCRUB_TYPE_BMBTD),
+	[XFS_SCRUB_TYPE_QUOTACHECK]	= DEP(XFS_SCRUB_TYPE_UQUOTA) |
+					  DEP(XFS_SCRUB_TYPE_GQUOTA) |
+					  DEP(XFS_SCRUB_TYPE_PQUOTA),
+	[XFS_SCRUB_TYPE_RTSUM]		= DEP(XFS_SCRUB_TYPE_RTBITMAP),
+};
+#undef DEP
+
 /* Repair some metadata. */
 static enum check_outcome
 xfs_repair_metadata(
@@ -34,8 +56,16 @@ xfs_repair_metadata(
 	struct xfs_scrub_metadata	meta = { 0 };
 	struct xfs_scrub_metadata	oldm;
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
+	bool				repair_only;
 	int				error;
 
+	/*
+	 * If the caller boosted the priority of this scrub type on behalf of a
+	 * higher level repair by setting IFLAG_REPAIR, turn off REPAIR_ONLY.
+	 */
+	repair_only = (repair_flags & XRM_REPAIR_ONLY) &&
+			scrub_item_type_boosted(sri, scrub_type);
+
 	assert(scrub_type < XFS_SCRUB_TYPE_NR);
 	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
 	meta.sm_type = scrub_type;
@@ -55,7 +85,7 @@ xfs_repair_metadata(
 		break;
 	}
 
-	if (!is_corrupt(&meta) && (repair_flags & XRM_REPAIR_ONLY))
+	if (!is_corrupt(&meta) && repair_only)
 		return CHECK_RETRY;
 
 	memcpy(&oldm, &meta, sizeof(oldm));
@@ -215,6 +245,60 @@ struct action_item {
 	struct scrub_item	sri;
 };
 
+/*
+ * The operation of higher level metadata objects depends on the correctness of
+ * lower level metadata objects.  This means that if X depends on Y, we must
+ * investigate and correct all the observed issues with Y before we try to make
+ * a correction to X.  For all scheduled repair activity on X, boost the
+ * priority of repairs on all the Ys to ensure this correctness.
+ */
+static void
+repair_item_boost_priorities(
+	struct scrub_item		*sri)
+{
+	unsigned int			scrub_type;
+
+	foreach_scrub_type(scrub_type) {
+		unsigned int		dep_mask = repair_deps[scrub_type];
+		unsigned int		b;
+
+		if (repair_item_count_needsrepair(sri) == 0 || !dep_mask)
+			continue;
+
+		/*
+		 * Check if the repairs for this scrub type depend on any other
+		 * scrub types that have been flagged with cross-referencing
+		 * errors and are not already tagged for the highest priority
+		 * repair (SCRUB_ITEM_CORRUPT).  If so, boost the priority of
+		 * that scrub type (via SCRUB_ITEM_BOOST_REPAIR) so that any
+		 * problems with the dependencies will (hopefully) be fixed
+		 * before we start repairs on this scrub type.
+		 *
+		 * So far in the history of xfs_scrub we have maintained that
+		 * lower numbered scrub types do not depend on higher numbered
+		 * scrub types, so we need only process the bit mask once.
+		 */
+		for (b = 0; b < XFS_SCRUB_TYPE_NR; b++, dep_mask >>= 1) {
+			if (!dep_mask)
+				break;
+			if (!(dep_mask & 1))
+				continue;
+			if (!(sri->sri_state[b] & SCRUB_ITEM_REPAIR_XREF))
+				continue;
+			if (sri->sri_state[b] & SCRUB_ITEM_CORRUPT)
+				continue;
+			sri->sri_state[b] |= SCRUB_ITEM_BOOST_REPAIR;
+		}
+	}
+}
+
+/*
+ * These are the scrub item state bits that must be copied when scheduling
+ * a (per-AG) scrub type for immediate repairs.  The original state tracking
+ * bits are left untouched to force a rescan in phase 4.
+ */
+#define MUSTFIX_STATES	(SCRUB_ITEM_CORRUPT | \
+			 SCRUB_ITEM_BOOST_REPAIR)
 /*
  * Figure out which AG metadata must be fixed before we can move on
  * to the inode scan.
@@ -227,17 +311,21 @@ repair_item_mustfix(
 	unsigned int		scrub_type;
 
 	assert(sri->sri_agno != -1U);
+	repair_item_boost_priorities(sri);
 	scrub_item_init_ag(fix_now, sri->sri_agno);
 
 	foreach_scrub_type(scrub_type) {
-		if (!(sri->sri_state[scrub_type] & SCRUB_ITEM_CORRUPT))
+		unsigned int	state;
+
+		state = sri->sri_state[scrub_type] & MUSTFIX_STATES;
+		if (!state)
 			continue;
 
 		switch (scrub_type) {
 		case XFS_SCRUB_TYPE_AGI:
 		case XFS_SCRUB_TYPE_FINOBT:
 		case XFS_SCRUB_TYPE_INOBT:
-			fix_now->sri_state[scrub_type] |= SCRUB_ITEM_CORRUPT;
+			fix_now->sri_state[scrub_type] = state;
 			break;
 		}
 	}
@@ -471,6 +559,8 @@ repair_file_corruption(
 	struct scrub_item	*sri,
 	int			override_fd)
 {
+	repair_item_boost_priorities(sri);
+
 	return repair_item_class(ctx, sri, override_fd, SCRUB_ITEM_CORRUPT,
 			XRM_REPAIR_ONLY | XRM_NOPROGRESS);
 }
@@ -487,6 +577,8 @@ repair_item(
 {
 	int			ret;
 
+	repair_item_boost_priorities(sri);
+
 	ret = repair_item_class(ctx, sri, -1, SCRUB_ITEM_CORRUPT, flags);
 	if (ret)
 		return ret;
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 6e34ca2d7b3..0d5738dc692 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -14,6 +14,14 @@ enum check_outcome {
 	CHECK_RETRY,	/* repair failed, try again later */
 };
 
+/*
+ * This flag boosts the repair priority of a scrub item when a dependent scrub
+ * item is scheduled for repair.  Use a separate flag to preserve the
+ * corruption state that we got from the kernel.  Priority boost is cleared the
+ * next time xfs_repair_metadata is called.
+ */
+#define SCRUB_ITEM_BOOST_REPAIR	(1 << 0)
+
 /*
  * These flags record the metadata object state that the kernel returned.
  * We want to remember if the object was corrupt, if the cross-referencing
@@ -31,6 +39,10 @@ enum check_outcome {
 				 SCRUB_ITEM_XFAIL | \
 				 SCRUB_ITEM_XCORRUPT)
 
+/* Cross-referencing failures only. */
+#define SCRUB_ITEM_REPAIR_XREF	(SCRUB_ITEM_XFAIL | \
+				 SCRUB_ITEM_XCORRUPT)
+
 struct scrub_item {
 	/*
 	 * Information we need to call the scrub and repair ioctls.  Per-AG
diff --git a/scrub/scrub_private.h b/scrub/scrub_private.h
index f91c65383d1..eafb750b0d1 100644
--- a/scrub/scrub_private.h
+++ b/scrub/scrub_private.h
@@ -71,4 +71,12 @@ scrub_item_clean_state(
 	sri->sri_state[scrub_type] = 0;
 }
 
+static inline bool
+scrub_item_type_boosted(
+	struct scrub_item		*sri,
+	unsigned  int			scrub_type)
+{
+	return sri->sri_state[scrub_type] & SCRUB_ITEM_BOOST_REPAIR;
+}
+
 #endif /* XFS_SCRUB_SCRUB_PRIVATE_H_ */



* [PATCH 6/9] xfs_scrub: clean up repair_item_difficulty a little
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 1/9] xfs_scrub: track repair items by principal, not by individual repairs Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 8/9] xfs_scrub: retry incomplete repairs Darrick J. Wong
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Document the flags handling in repair_item_difficulty.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)


diff --git a/scrub/repair.c b/scrub/repair.c
index 16acb0a0f10..7ad4f6cfe8a 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -331,6 +331,15 @@ repair_item_mustfix(
 	}
 }
 
+/*
+ * These scrub item states correspond to metadata that is inconsistent in some
+ * way and must be repaired.  If too many metadata objects share these states,
+ * this can make repairs difficult.
+ */
+#define HARDREPAIR_STATES	(SCRUB_ITEM_CORRUPT | \
+				 SCRUB_ITEM_XCORRUPT | \
+				 SCRUB_ITEM_XFAIL)
+
 /* Determine if primary or secondary metadata are inconsistent. */
 unsigned int
 repair_item_difficulty(
@@ -340,9 +349,10 @@ repair_item_difficulty(
 	unsigned int		ret = 0;
 
 	foreach_scrub_type(scrub_type) {
-		if (!(sri->sri_state[scrub_type] & (XFS_SCRUB_OFLAG_CORRUPT |
-						    XFS_SCRUB_OFLAG_XCORRUPT |
-						    XFS_SCRUB_OFLAG_XFAIL)))
+		unsigned int	state;
+
+		state = sri->sri_state[scrub_type] & HARDREPAIR_STATES;
+		if (!state)
 			continue;
 
 		switch (scrub_type) {



* [PATCH 7/9] xfs_scrub: check dependencies of a scrub type before repairing
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 9/9] xfs_scrub: remove unused action_list fields Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have a map from each scrub type to the scrub types it
depends on, use this information to avoid trying to fix higher level
metadata before the lower levels have passed their checks.
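
A minimal standalone sketch of the gating test itself (again using
simplified stand-in types and flags rather than the real xfs_scrub
definitions):

#include <stdbool.h>
#include <stdio.h>

enum { T_INODE, T_BMBTD, T_DIR, T_NR };

#define NEEDSREPAIR	(1U << 0)	/* stand-in for corrupt/xfail/xcorrupt */
#define DEP(x)		(1U << (x))

static const unsigned int deps[T_NR] = {
	[T_BMBTD]	= DEP(T_INODE),
	[T_DIR]		= DEP(T_BMBTD),
};

/* A repair of @type waits until none of its dependencies need repair. */
static bool deps_ok(const unsigned int state[T_NR], int type)
{
	unsigned int mask = deps[type];

	for (int b = 0; mask && b < T_NR; b++, mask >>= 1)
		if ((mask & 1) && (state[b] & NEEDSREPAIR))
			return false;
	return true;
}

int main(void)
{
	unsigned int state[T_NR] = {
		[T_BMBTD]	= NEEDSREPAIR,
		[T_DIR]		= NEEDSREPAIR,
	};

	printf("repair dir now? %s\n", deps_ok(state, T_DIR) ? "yes" : "no");
	state[T_BMBTD] = 0;	/* pretend the bmap repair succeeded */
	printf("repair dir now? %s\n", deps_ok(state, T_DIR) ? "yes" : "no");
	return 0;
}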

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c |   32 ++++++++++++++++++++++++++++++++
 scrub/scrub.h  |    5 +++++
 2 files changed, 37 insertions(+)


diff --git a/scrub/repair.c b/scrub/repair.c
index 7ad4f6cfe8a..8624167246a 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -488,6 +488,29 @@ action_list_process(
 	return ret;
 }
 
+/* Decide if the dependent scrub types of the given scrub type are ok. */
+static bool
+repair_item_dependencies_ok(
+	const struct scrub_item	*sri,
+	unsigned int		scrub_type)
+{
+	unsigned int		dep_mask = repair_deps[scrub_type];
+	unsigned int		b;
+
+	for (b = 0; dep_mask && b < XFS_SCRUB_TYPE_NR; b++, dep_mask >>= 1) {
+		if (!(dep_mask & 1))
+			continue;
+		/*
+		 * If this lower level object also needs repair, we can't fix
+		 * the higher level item.
+		 */
+		if (sri->sri_state[b] & SCRUB_ITEM_NEEDSREPAIR)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * For a given filesystem object, perform all repairs of a given class
  * (corrupt, xcorrupt, xfail, preen) if the repair item says it's needed.
@@ -527,6 +550,15 @@ repair_item_class(
 		if (!(sri->sri_state[scrub_type] & repair_mask))
 			continue;
 
+		/*
+		 * Don't try to repair higher level items if their lower-level
+		 * dependencies haven't been verified, unless this is our last
+		 * chance to fix things without complaint.
+		 */
+		if (!(flags & XRM_FINAL_WARNING) &&
+		    !repair_item_dependencies_ok(sri, scrub_type))
+			continue;
+
 		fix = xfs_repair_metadata(ctx, xfdp, scrub_type, sri, flags);
 		switch (fix) {
 		case CHECK_DONE:
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 0d5738dc692..75595f43ee9 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -43,6 +43,11 @@ enum check_outcome {
 #define SCRUB_ITEM_REPAIR_XREF	(SCRUB_ITEM_XFAIL | \
 				 SCRUB_ITEM_XCORRUPT)
 
+/* Mask of bits signalling that a piece of metadata requires attention. */
+#define SCRUB_ITEM_NEEDSREPAIR	(SCRUB_ITEM_CORRUPT | \
+				 SCRUB_ITEM_XFAIL | \
+				 SCRUB_ITEM_XCORRUPT)
+
 struct scrub_item {
 	/*
 	 * Information we need to call the scrub and repair ioctls.  Per-AG



* [PATCH 8/9] xfs_scrub: retry incomplete repairs
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 6/9] xfs_scrub: clean up repair_item_difficulty a little Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 9/9] xfs_scrub: remove unused action_list fields Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 7/9] xfs_scrub: check dependencies of a scrub type before repairing Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If a repair says it didn't do anything on account of not being able to
complete a scan of the metadata, retry the repair a few times; if even
that doesn't work, we can delay it to phase 4.
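
The interesting part is the retry condition itself; here is a hedged
standalone model of the pattern (the struct and helpers are invented
stand-ins for the xfs_scrub_metadata outcome flags, not the real code):

#include <stdbool.h>
#include <stdio.h>

/* Invented stand-in for the outcome of one scrub/repair call. */
struct outcome {
	bool corrupt;		/* object is obviously corrupt */
	bool xref_bad;		/* cross-referencing disagreed */
	bool incomplete;	/* the scan could not be completed */
};

/* Retry if the scan was incomplete, or xref disagreed without corruption. */
static bool want_retry(const struct outcome *o)
{
	return o->incomplete || (o->xref_bad && !o->corrupt);
}

/* Pretend repair call: the fs is "busy" for the first two attempts. */
static void try_repair(struct outcome *o, int attempt)
{
	o->corrupt = false;
	o->xref_bad = false;
	o->incomplete = attempt < 2;
}

int main(void)
{
	struct outcome o;
	int tries = 0;

	do {
		try_repair(&o, tries);
	} while (want_retry(&o) && ++tries < 10);

	printf("finished after %d calls, incomplete=%d\n",
			tries + 1, o.incomplete);
	return 0;
}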

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c        |   15 ++++++++++++++-
 scrub/scrub.c         |    3 +--
 scrub/scrub_private.h |   10 ++++++++++
 3 files changed, 25 insertions(+), 3 deletions(-)


diff --git a/scrub/repair.c b/scrub/repair.c
index 8624167246a..c1ab03d6f02 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -57,6 +57,7 @@ xfs_repair_metadata(
 	struct xfs_scrub_metadata	oldm;
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
 	bool				repair_only;
+	unsigned int			tries = 0;
 	int				error;
 
 	/*
@@ -98,6 +99,7 @@ xfs_repair_metadata(
 		str_info(ctx, descr_render(&dsc),
 				_("Attempting optimization."));
 
+retry:
 	error = -xfrog_scrub_metadata(xfdp, &meta);
 	switch (error) {
 	case 0:
@@ -176,9 +178,20 @@ _("Read-only filesystem; cannot make changes."));
 		return CHECK_DONE;
 	}
 
+	/*
+	 * If the kernel says the repair was incomplete or that there was a
+	 * cross-referencing discrepancy but no obvious corruption, we'll try
+	 * the repair again, just in case the fs was busy.  Only retry so many
+	 * times.
+	 */
+	if (want_retry(&meta) && tries < 10) {
+		tries++;
+		goto retry;
+	}
+
 	if (repair_flags & XRM_FINAL_WARNING)
 		scrub_warn_incomplete_scrub(ctx, &dsc, &meta);
-	if (needs_repair(&meta)) {
+	if (needs_repair(&meta) || is_incomplete(&meta)) {
 		/*
 		 * Still broken; if we've been told not to complain then we
 		 * just requeue this and try again later.  Otherwise we
diff --git a/scrub/scrub.c b/scrub/scrub.c
index b970d1cfe90..699e9aa3940 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -137,8 +137,7 @@ _("Filesystem is shut down, aborting."));
 	 * we'll try the scan again, just in case the fs was busy.
 	 * Only retry so many times.
 	 */
-	if (tries < 10 && (is_incomplete(meta) ||
-			   (xref_disagrees(meta) && !is_corrupt(meta)))) {
+	if (want_retry(meta) && tries < 10) {
 		tries++;
 		goto retry;
 	}
diff --git a/scrub/scrub_private.h b/scrub/scrub_private.h
index eafb750b0d1..b54384c2091 100644
--- a/scrub/scrub_private.h
+++ b/scrub/scrub_private.h
@@ -49,6 +49,16 @@ static inline bool needs_repair(struct xfs_scrub_metadata *sm)
 	return is_corrupt(sm) || xref_disagrees(sm);
 }
 
+/*
+ * We want to retry an operation if the kernel says it couldn't complete the
+ * scan/repair; or if there were cross-referencing problems but the object was
+ * not obviously corrupt.
+ */
+static inline bool want_retry(struct xfs_scrub_metadata *sm)
+{
+	return is_incomplete(sm) || (xref_disagrees(sm) && !is_corrupt(sm));
+}
+
 void scrub_warn_incomplete_scrub(struct scrub_ctx *ctx, struct descr *dsc,
 		struct xfs_scrub_metadata *meta);
 



* [PATCH 9/9] xfs_scrub: remove unused action_list fields
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 8/9] xfs_scrub: retry incomplete repairs Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 7/9] xfs_scrub: check dependencies of a scrub type before repairing Darrick J. Wong
  8 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove the action_list nr and sorted fields since we don't need them
anymore.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c |    5 -----
 scrub/repair.h |    2 --
 2 files changed, 7 deletions(-)


diff --git a/scrub/repair.c b/scrub/repair.c
index c1ab03d6f02..a552b445e90 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -423,7 +423,6 @@ action_list_discard(
 	struct action_item		*n;
 
 	list_for_each_entry_safe(aitem, n, &alist->list, list) {
-		alist->nr--;
 		list_del(&aitem->list);
 		free(aitem);
 	}
@@ -444,8 +443,6 @@ action_list_init(
 	struct action_list		*alist)
 {
 	INIT_LIST_HEAD(&alist->list);
-	alist->nr = 0;
-	alist->sorted = false;
 }
 
 /* Number of pending repairs in this list. */
@@ -469,8 +466,6 @@ action_list_add(
 	struct action_item		*aitem)
 {
 	list_add_tail(&aitem->list, &alist->list);
-	alist->nr++;
-	alist->sorted = false;
 }
 
 /* Repair everything on this list. */
diff --git a/scrub/repair.h b/scrub/repair.h
index b0b448cef7a..d76bb963cdd 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -8,8 +8,6 @@
 
 struct action_list {
 	struct list_head	list;
-	unsigned long long	nr;
-	bool			sorted;
 };
 
 struct action_item;



* [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (52 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/5] xfs_scrub: refactor scrub_meta_type out of existence Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items Darrick J. Wong
                   ` (23 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Now that we've introduced tickets to track the status of repairs to a
specific principal XFS object (fs, ag, file), use them to track the
scrub state of those same objects.  Ultimately, we want to make it easy
to introduce vectorized repair, where we send a batch of repair requests
to the kernel instead of making millions of ioctl calls.  For now,
however, we'll settle for easier bookkeeping.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking
---
 scrub/phase1.c        |    3 
 scrub/phase2.c        |   12 +-
 scrub/phase3.c        |   39 ++----
 scrub/phase4.c        |   16 +-
 scrub/phase5.c        |    5 -
 scrub/phase7.c        |    5 +
 scrub/repair.c        |   71 +++++------
 scrub/scrub.c         |  321 ++++++++++++++++++++++---------------------------
 scrub/scrub.h         |   40 ++++--
 scrub/scrub_private.h |   14 ++
 10 files changed, 258 insertions(+), 268 deletions(-)



* [PATCH 1/5] xfs_scrub: start tracking scrub state in scrub_item
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 2/5] xfs_scrub: remove enum check_outcome Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Start using the scrub_item to track which metadata objects need
checking by adding a new flag to the scrub_item state set.
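
The resulting caller pattern is schedule-then-check; roughly (a trimmed
sketch of the phase3.c conversion below, not compilable on its own):

	/* Queue up several checks for this file, then issue them in one pass. */
	scrub_item_init_file(&sri, bstat);
	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_BMBTD);
	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_BMBTA);
	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_BMBTC);
	error = scrub_item_check_file(ctx, &sri, fd);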

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c |    3 +
 scrub/phase2.c |   12 +++--
 scrub/phase3.c |   39 +++++----------
 scrub/phase4.c |   16 +++---
 scrub/phase5.c |    5 +-
 scrub/phase7.c |    5 ++
 scrub/scrub.c  |  147 +++++++++++++++++++-------------------------------------
 scrub/scrub.h  |   28 ++++++++---
 8 files changed, 109 insertions(+), 146 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 6b2f6cdd5fa..18b1d5e948e 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -61,7 +61,8 @@ report_to_kernel(
 		return 0;
 
 	scrub_item_init_fs(&sri);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_HEALTHY, &sri);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_HEALTHY);
+	ret = scrub_item_check(ctx, &sri);
 	if (ret)
 		return ret;
 
diff --git a/scrub/phase2.c b/scrub/phase2.c
index 138f0f8a8f3..7b580f4e4dd 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -75,7 +75,8 @@ scan_ag_metadata(
 	 * First we scrub and fix the AG headers, because we need
 	 * them to work well enough to check the AG btrees.
 	 */
-	ret = scrub_ag_headers(ctx, &sri);
+	scrub_item_schedule_group(&sri, XFROG_SCRUB_GROUP_AGHEADER);
+	ret = scrub_item_check(ctx, &sri);
 	if (ret)
 		goto err;
 
@@ -85,7 +86,8 @@ scan_ag_metadata(
 		goto err;
 
 	/* Now scrub the AG btrees. */
-	ret = scrub_ag_metadata(ctx, &sri);
+	scrub_item_schedule_group(&sri, XFROG_SCRUB_GROUP_PERAG);
+	ret = scrub_item_check(ctx, &sri);
 	if (ret)
 		goto err;
 
@@ -131,7 +133,8 @@ scan_metafile(
 		goto out;
 
 	scrub_item_init_fs(&sri);
-	ret = scrub_meta_type(ctx, type, &sri);
+	scrub_item_schedule(&sri, type);
+	ret = scrub_item_check(ctx, &sri);
 	if (ret) {
 		sctl->aborted = true;
 		goto out;
@@ -189,7 +192,8 @@ phase2_func(
 	 * If errors occur, this function will log them and return nonzero.
 	 */
 	scrub_item_init_ag(&sri, 0);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_SB, &sri);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_SB);
+	ret = scrub_item_check(ctx, &sri);
 	if (ret)
 		goto out_wq;
 	ret = repair_item_completely(ctx, &sri);
diff --git a/scrub/phase3.c b/scrub/phase3.c
index 01171de64d1..06a338480e7 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -144,7 +144,8 @@ scrub_inode(
 		fd = scrub_open_handle(handle);
 
 	/* Scrub the inode. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_INODE, &sri);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_INODE);
+	error = scrub_item_check_file(ctx, &sri, fd);
 	if (error)
 		goto out;
 
@@ -153,13 +154,10 @@ scrub_inode(
 		goto out;
 
 	/* Scrub all block mappings. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTD, &sri);
-	if (error)
-		goto out;
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTA, &sri);
-	if (error)
-		goto out;
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_BMBTC, &sri);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_BMBTD);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_BMBTA);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_BMBTC);
+	error = scrub_item_check_file(ctx, &sri, fd);
 	if (error)
 		goto out;
 
@@ -167,24 +165,15 @@ scrub_inode(
 	if (error)
 		goto out;
 
-	if (S_ISLNK(bstat->bs_mode)) {
-		/* Check symlink contents. */
-		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_SYMLINK,
-				&sri);
-	} else if (S_ISDIR(bstat->bs_mode)) {
-		/* Check the directory entries. */
-		error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_DIR, &sri);
-	}
-	if (error)
-		goto out;
+	/* Check everything accessible via file mapping. */
+	if (S_ISLNK(bstat->bs_mode))
+		scrub_item_schedule(&sri, XFS_SCRUB_TYPE_SYMLINK);
+	else if (S_ISDIR(bstat->bs_mode))
+		scrub_item_schedule(&sri, XFS_SCRUB_TYPE_DIR);
 
-	/* Check all the extended attributes. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_XATTR, &sri);
-	if (error)
-		goto out;
-
-	/* Check parent pointers. */
-	error = scrub_file(ctx, fd, bstat, XFS_SCRUB_TYPE_PARENT, &sri);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_XATTR);
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_PARENT);
+	error = scrub_item_check_file(ctx, &sri, fd);
 	if (error)
 		goto out;
 
diff --git a/scrub/phase4.c b/scrub/phase4.c
index ee6aa90f326..aa0b060fcc9 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -143,9 +143,7 @@ phase4_func(
 	 * metadata.  If repairs fails, we'll come back during phase 7.
 	 */
 	scrub_item_init_fs(&sri);
-	ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_FSCOUNTERS, &sri);
-	if (ret)
-		return ret;
+	scrub_item_schedule(&sri, XFS_SCRUB_TYPE_FSCOUNTERS);
 
 	/*
 	 * Repair possibly bad quota counts before starting other repairs,
@@ -157,13 +155,13 @@ phase4_func(
 	if (ret)
 		return ret;
 
-	if (fsgeom.sick & XFS_FSOP_GEOM_SICK_QUOTACHECK) {
-		ret = scrub_meta_type(ctx, XFS_SCRUB_TYPE_QUOTACHECK, &sri);
-		if (ret)
-			return ret;
-	}
+	if (fsgeom.sick & XFS_FSOP_GEOM_SICK_QUOTACHECK)
+		scrub_item_schedule(&sri, XFS_SCRUB_TYPE_QUOTACHECK);
 
-	/* Repair counters before starting on the rest. */
+	/* Check and repair counters before starting on the rest. */
+	ret = scrub_item_check(ctx, &sri);
+	if (ret)
+		return ret;
 	ret = repair_item_corruption(ctx, &sri);
 	if (ret)
 		return ret;
diff --git a/scrub/phase5.c b/scrub/phase5.c
index ea32d185751..96e13ac423f 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -386,7 +386,6 @@ check_fs_label(
 struct iscan_item {
 	struct scrub_item	sri;
 	bool			*abortedp;
-	unsigned int		scrub_type;
 };
 
 /* Run one inode-scan scrubber in this thread. */
@@ -411,7 +410,7 @@ iscan_worker(
 		nanosleep(&tv, NULL);
 	}
 
-	ret = scrub_meta_type(ctx, item->scrub_type, &item->sri);
+	ret = scrub_item_check(ctx, &item->sri);
 	if (ret) {
 		str_liberror(ctx, ret, _("checking iscan metadata"));
 		*item->abortedp = true;
@@ -449,7 +448,7 @@ queue_iscan(
 		return ret;
 	}
 	scrub_item_init_fs(&item->sri);
-	item->scrub_type = scrub_type;
+	scrub_item_schedule(&item->sri, scrub_type);
 	item->abortedp = abortedp;
 
 	ret = -workqueue_add(wq, iscan_worker, nr, item);
diff --git a/scrub/phase7.c b/scrub/phase7.c
index 98846a1566b..75d0ee0fb02 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -10,6 +10,8 @@
 #include <linux/fsmap.h>
 #include "libfrog/paths.h"
 #include "libfrog/ptvar.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/scrub.h"
 #include "list.h"
 #include "xfs_scrub.h"
 #include "common.h"
@@ -118,7 +120,8 @@ phase7_func(
 
 	/* Check and fix the summary metadata. */
 	scrub_item_init_fs(&sri);
-	error = scrub_summary_metadata(ctx, &sri);
+	scrub_item_schedule_group(&sri, XFROG_SCRUB_GROUP_SUMMARY);
+	error = scrub_item_check(ctx, &sri);
 	if (error)
 		return error;
 	error = repair_item_completely(ctx, &sri);
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 699e9aa3940..1a60631eddc 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -86,9 +86,11 @@ xfs_check_metadata(
 	bool				is_inode)
 {
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
+	enum xfrog_scrub_group		group;
 	unsigned int			tries = 0;
 	int				error;
 
+	group = xfrog_scrubbers[meta->sm_type].group;
 	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
 	assert(meta->sm_type < XFS_SCRUB_TYPE_NR);
 	descr_set(&dsc, meta);
@@ -165,7 +167,7 @@ _("Repairs are required."));
 	 */
 	if (is_unoptimized(meta)) {
 		if (ctx->mode != SCRUB_MODE_REPAIR) {
-			if (!is_inode) {
+			if (group != XFROG_SCRUB_GROUP_INODE) {
 				/* AG or FS metadata, always warn. */
 				str_info(ctx, descr_render(&dsc),
 _("Optimization is possible."));
@@ -223,9 +225,10 @@ _("Optimizations of %s are possible."), _(xfrog_scrubbers[i].descr));
  * Returns 0 for success.  If errors occur, this function will log them and
  * return a positive error code.
  */
-int
+static int
 scrub_meta_type(
 	struct scrub_ctx		*ctx,
+	struct xfs_fd			*xfdp,
 	unsigned int			type,
 	struct scrub_item		*sri)
 {
@@ -243,16 +246,20 @@ scrub_meta_type(
 		break;
 	case XFROG_SCRUB_GROUP_METAFILES:
 	case XFROG_SCRUB_GROUP_SUMMARY:
+	case XFROG_SCRUB_GROUP_ISCAN:
 	case XFROG_SCRUB_GROUP_NONE:
 		break;
-	default:
-		assert(0);
+	case XFROG_SCRUB_GROUP_INODE:
+		meta.sm_ino = sri->sri_ino;
+		meta.sm_gen = sri->sri_gen;
 		break;
 	}
 
 	/* Check the item. */
-	fix = xfs_check_metadata(ctx, &ctx->mnt, &meta, false);
-	progress_add(1);
+	fix = xfs_check_metadata(ctx, xfdp, &meta, false);
+
+	if (xfrog_scrubbers[type].group != XFROG_SCRUB_GROUP_INODE)
+		progress_add(1);
 
 	switch (fix) {
 	case CHECK_ABORT:
@@ -269,60 +276,54 @@ scrub_meta_type(
 	}
 }
 
-/*
- * Scrub all metadata types that are assigned to the given XFROG_SCRUB_GROUP_*,
- * saving corruption reports for later.  This should not be used for
- * XFROG_SCRUB_GROUP_INODE or for checking summary metadata.
- */
-static bool
-scrub_group(
-	struct scrub_ctx		*ctx,
-	enum xfrog_scrub_group		group,
-	struct scrub_item		*sri)
+/* Schedule scrub for all metadata of a given group. */
+void
+scrub_item_schedule_group(
+	struct scrub_item		*sri,
+	enum xfrog_scrub_group		group)
 {
-	const struct xfrog_scrub_descr	*sc;
-	unsigned int			type;
-
-	sc = xfrog_scrubbers;
-	for (type = 0; type < XFS_SCRUB_TYPE_NR; type++, sc++) {
-		int			ret;
+	unsigned int			scrub_type;
 
-		if (sc->group != group)
+	foreach_scrub_type(scrub_type) {
+		if (xfrog_scrubbers[scrub_type].group != group)
 			continue;
-
-		ret = scrub_meta_type(ctx, type, sri);
-		if (ret)
-			return ret;
+		scrub_item_schedule(sri, scrub_type);
 	}
-
-	return 0;
 }
 
-/* Scrub each AG's header blocks. */
+/* Run all the incomplete scans on this scrub principal. */
 int
-scrub_ag_headers(
+scrub_item_check_file(
 	struct scrub_ctx		*ctx,
-	struct scrub_item		*sri)
+	struct scrub_item		*sri,
+	int				override_fd)
 {
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_AGHEADER, sri);
-}
+	struct xfs_fd			xfd;
+	struct xfs_fd			*xfdp = &ctx->mnt;
+	unsigned int			scrub_type;
+	int				error;
 
-/* Scrub each AG's metadata btrees. */
-int
-scrub_ag_metadata(
-	struct scrub_ctx		*ctx,
-	struct scrub_item		*sri)
-{
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_PERAG, sri);
-}
+	/*
+	 * If the caller passed us a file descriptor for a scrub, use it
+	 * instead of scrub-by-handle because this enables the kernel to skip
+	 * costly inode btree lookups.
+	 */
+	if (override_fd >= 0) {
+		memcpy(&xfd, xfdp, sizeof(xfd));
+		xfd.fd = override_fd;
+		xfdp = &xfd;
+	}
 
-/* Scrub all FS summary metadata. */
-int
-scrub_summary_metadata(
-	struct scrub_ctx		*ctx,
-	struct scrub_item		*sri)
-{
-	return scrub_group(ctx, XFROG_SCRUB_GROUP_SUMMARY, sri);
+	foreach_scrub_type(scrub_type) {
+		if (!(sri->sri_state[scrub_type] & SCRUB_ITEM_NEEDSCHECK))
+			continue;
+
+		error = scrub_meta_type(ctx, xfdp, scrub_type, sri);
+		if (error)
+			break;
+	}
+
+	return error;
 }
 
 /* How many items do we have to check? */
@@ -374,54 +375,6 @@ scrub_estimate_iscan_work(
 	return estimate;
 }
 
-/*
- * Scrub file metadata of some sort.  If errors occur, this function will log
- * them and return nonzero.
- */
-int
-scrub_file(
-	struct scrub_ctx		*ctx,
-	int				fd,
-	const struct xfs_bulkstat	*bstat,
-	unsigned int			type,
-	struct scrub_item		*sri)
-{
-	struct xfs_scrub_metadata	meta = {0};
-	struct xfs_fd			xfd;
-	struct xfs_fd			*xfdp = &ctx->mnt;
-	enum check_outcome		fix;
-
-	assert(type < XFS_SCRUB_TYPE_NR);
-	assert(xfrog_scrubbers[type].group == XFROG_SCRUB_GROUP_INODE);
-
-	meta.sm_type = type;
-	meta.sm_ino = bstat->bs_ino;
-	meta.sm_gen = bstat->bs_gen;
-
-	/*
-	 * If the caller passed us a file descriptor for a scrub, use it
-	 * instead of scrub-by-handle because this enables the kernel to skip
-	 * costly inode btree lookups.
-	 */
-	if (fd >= 0) {
-		memcpy(&xfd, xfdp, sizeof(xfd));
-		xfd.fd = fd;
-		xfdp = &xfd;
-	}
-
-	/* Scrub the piece of metadata. */
-	fix = xfs_check_metadata(ctx, xfdp, &meta, true);
-	if (fix == CHECK_ABORT)
-		return ECANCELED;
-	if (fix == CHECK_DONE) {
-		scrub_item_clean_state(sri, type);
-		return 0;
-	}
-
-	scrub_item_save_state(sri, type, meta.sm_flags);
-	return 0;
-}
-
 /* Dump a scrub item for debugging purposes. */
 void
 scrub_item_dump(
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 75595f43ee9..1c24b054fd9 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -6,6 +6,8 @@
 #ifndef XFS_SCRUB_SCRUB_H_
 #define XFS_SCRUB_SCRUB_H_
 
+enum xfrog_scrub_group;
+
 /* Online scrub and repair. */
 enum check_outcome {
 	CHECK_DONE,	/* no further processing needed */
@@ -33,6 +35,9 @@ enum check_outcome {
 #define SCRUB_ITEM_XFAIL	(XFS_SCRUB_OFLAG_XFAIL)		/* (1 << 3) */
 #define SCRUB_ITEM_XCORRUPT	(XFS_SCRUB_OFLAG_XCORRUPT)	/* (1 << 4) */
 
+/* This scrub type needs to be checked. */
+#define SCRUB_ITEM_NEEDSCHECK	(1 << 5)
+
 /* All of the state flags that we need to prioritize repair work. */
 #define SCRUB_ITEM_REPAIR_ANY	(SCRUB_ITEM_CORRUPT | \
 				 SCRUB_ITEM_PREEN | \
@@ -96,13 +101,24 @@ scrub_item_init_file(struct scrub_item *sri, struct xfs_bulkstat *bstat)
 void scrub_item_dump(struct scrub_item *sri, unsigned int group_mask,
 		const char *tag);
 
+static inline void
+scrub_item_schedule(struct scrub_item *sri, unsigned int scrub_type)
+{
+	sri->sri_state[scrub_type] = SCRUB_ITEM_NEEDSCHECK;
+}
+
+void scrub_item_schedule_group(struct scrub_item *sri,
+		enum xfrog_scrub_group group);
+int scrub_item_check_file(struct scrub_ctx *ctx, struct scrub_item *sri,
+		int override_fd);
+
+static inline int
+scrub_item_check(struct scrub_ctx *ctx, struct scrub_item *sri)
+{
+	return scrub_item_check_file(ctx, sri, -1);
+}
+
 void scrub_report_preen_triggers(struct scrub_ctx *ctx);
-int scrub_ag_headers(struct scrub_ctx *ctx, struct scrub_item *sri);
-int scrub_ag_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
-int scrub_iscan_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
-int scrub_summary_metadata(struct scrub_ctx *ctx, struct scrub_item *sri);
-int scrub_meta_type(struct scrub_ctx *ctx, unsigned int type,
-		struct scrub_item *sri);
 
 bool can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool can_scrub_inode(struct scrub_ctx *ctx);



* [PATCH 3/5] xfs_scrub: refactor scrub_meta_type out of existence
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/5] xfs_scrub: hoist scrub retry loop to scrub_item_check_file Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove this helper function since it's trivial now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/scrub.c |  124 ++++++++++++++++++++++++---------------------------------
 1 file changed, 53 insertions(+), 71 deletions(-)


diff --git a/scrub/scrub.c b/scrub/scrub.c
index da49285be78..ccc6a7a4047 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -82,31 +82,51 @@ static int
 xfs_check_metadata(
 	struct scrub_ctx		*ctx,
 	struct xfs_fd			*xfdp,
-	struct xfs_scrub_metadata	*meta,
+	unsigned int			scrub_type,
 	struct scrub_item		*sri)
 {
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
+	struct xfs_scrub_metadata	meta = { };
 	enum xfrog_scrub_group		group;
 	unsigned int			tries = 0;
 	int				error;
 
-	group = xfrog_scrubbers[meta->sm_type].group;
+	background_sleep();
+
+	group = xfrog_scrubbers[scrub_type].group;
+	meta.sm_type = scrub_type;
+	switch (group) {
+	case XFROG_SCRUB_GROUP_AGHEADER:
+	case XFROG_SCRUB_GROUP_PERAG:
+		meta.sm_agno = sri->sri_agno;
+		break;
+	case XFROG_SCRUB_GROUP_METAFILES:
+	case XFROG_SCRUB_GROUP_SUMMARY:
+	case XFROG_SCRUB_GROUP_ISCAN:
+	case XFROG_SCRUB_GROUP_NONE:
+		break;
+	case XFROG_SCRUB_GROUP_INODE:
+		meta.sm_ino = sri->sri_ino;
+		meta.sm_gen = sri->sri_gen;
+		break;
+	}
+
 	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
-	assert(meta->sm_type < XFS_SCRUB_TYPE_NR);
-	descr_set(&dsc, meta);
+	assert(scrub_type < XFS_SCRUB_TYPE_NR);
+	descr_set(&dsc, &meta);
 
-	dbg_printf("check %s flags %xh\n", descr_render(&dsc), meta->sm_flags);
+	dbg_printf("check %s flags %xh\n", descr_render(&dsc), meta.sm_flags);
 retry:
-	error = -xfrog_scrub_metadata(xfdp, meta);
+	error = -xfrog_scrub_metadata(xfdp, &meta);
 	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
-		meta->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+		meta.sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	switch (error) {
 	case 0:
 		/* No operational errors encountered. */
 		break;
 	case ENOENT:
 		/* Metadata not present, just skip it. */
-		scrub_item_clean_state(sri, meta->sm_type);
+		scrub_item_clean_state(sri, scrub_type);
 		return 0;
 	case ESHUTDOWN:
 		/* FS already crashed, give up. */
@@ -128,12 +148,12 @@ _("Filesystem is shut down, aborting."));
 		 * Log it and move on.
 		 */
 		str_liberror(ctx, error, _("Kernel bug"));
-		scrub_item_clean_state(sri, meta->sm_type);
+		scrub_item_clean_state(sri, scrub_type);
 		return 0;
 	default:
 		/* Operational error.  Log it and move on. */
 		str_liberror(ctx, error, descr_render(&dsc));
-		scrub_item_clean_state(sri, meta->sm_type);
+		scrub_item_clean_state(sri, scrub_type);
 		return 0;
 	}
 
@@ -143,29 +163,29 @@ _("Filesystem is shut down, aborting."));
 	 * we'll try the scan again, just in case the fs was busy.
 	 * Only retry so many times.
 	 */
-	if (want_retry(meta) && tries < 10) {
+	if (want_retry(&meta) && tries < 10) {
 		tries++;
 		goto retry;
 	}
 
 	/* Complain about incomplete or suspicious metadata. */
-	scrub_warn_incomplete_scrub(ctx, &dsc, meta);
+	scrub_warn_incomplete_scrub(ctx, &dsc, &meta);
 
 	/*
 	 * If we need repairs or there were discrepancies, schedule a
 	 * repair if desired, otherwise complain.
 	 */
-	if (is_corrupt(meta) || xref_disagrees(meta)) {
+	if (is_corrupt(&meta) || xref_disagrees(&meta)) {
 		if (ctx->mode < SCRUB_MODE_REPAIR) {
 			/* Dry-run mode, so log an error and forget it. */
 			str_corrupt(ctx, descr_render(&dsc),
 _("Repairs are required."));
-			scrub_item_clean_state(sri, meta->sm_type);
+			scrub_item_clean_state(sri, scrub_type);
 			return 0;
 		}
 
 		/* Schedule repairs. */
-		scrub_item_save_state(sri, meta->sm_type, meta->sm_flags);
+		scrub_item_save_state(sri, scrub_type, meta.sm_flags);
 		return 0;
 	}
 
@@ -173,26 +193,26 @@ _("Repairs are required."));
 	 * If we could optimize, schedule a repair if desired,
 	 * otherwise complain.
 	 */
-	if (is_unoptimized(meta)) {
+	if (is_unoptimized(&meta)) {
 		if (ctx->mode != SCRUB_MODE_REPAIR) {
 			/* Dry-run mode, so log an error and forget it. */
 			if (group != XFROG_SCRUB_GROUP_INODE) {
 				/* AG or FS metadata, always warn. */
 				str_info(ctx, descr_render(&dsc),
 _("Optimization is possible."));
-			} else if (!ctx->preen_triggers[meta->sm_type]) {
+			} else if (!ctx->preen_triggers[scrub_type]) {
 				/* File metadata, only warn once per type. */
 				pthread_mutex_lock(&ctx->lock);
-				if (!ctx->preen_triggers[meta->sm_type])
-					ctx->preen_triggers[meta->sm_type] = true;
+				if (!ctx->preen_triggers[scrub_type])
+					ctx->preen_triggers[scrub_type] = true;
 				pthread_mutex_unlock(&ctx->lock);
 			}
-			scrub_item_clean_state(sri, meta->sm_type);
+			scrub_item_clean_state(sri, scrub_type);
 			return 0;
 		}
 
 		/* Schedule optimizations. */
-		scrub_item_save_state(sri, meta->sm_type, meta->sm_flags);
+		scrub_item_save_state(sri, scrub_type, meta.sm_flags);
 		return 0;
 	}
 
@@ -203,13 +223,13 @@ _("Optimization is possible."));
 	 * re-examine the object as repairs progress to see if the kernel will
 	 * deem it completely consistent at some point.
 	 */
-	if (xref_failed(meta) && ctx->mode == SCRUB_MODE_REPAIR) {
-		scrub_item_save_state(sri, meta->sm_type, meta->sm_flags);
+	if (xref_failed(&meta) && ctx->mode == SCRUB_MODE_REPAIR) {
+		scrub_item_save_state(sri, scrub_type, meta.sm_flags);
 		return 0;
 	}
 
 	/* Everything is ok. */
-	scrub_item_clean_state(sri, meta->sm_type);
+	scrub_item_clean_state(sri, scrub_type);
 	return 0;
 }
 
@@ -233,52 +253,6 @@ _("Optimizations of %s are possible."), _(xfrog_scrubbers[i].descr));
 	}
 }
 
-/*
- * Scrub a single XFS_SCRUB_TYPE_*, saving corruption reports for later.
- * Do not call this function to repair file metadata.
- *
- * Returns 0 for success.  If errors occur, this function will log them and
- * return a positive error code.
- */
-static int
-scrub_meta_type(
-	struct scrub_ctx		*ctx,
-	struct xfs_fd			*xfdp,
-	unsigned int			type,
-	struct scrub_item		*sri)
-{
-	struct xfs_scrub_metadata	meta = {
-		.sm_type		= type,
-	};
-	int				error;
-
-	background_sleep();
-
-	switch (xfrog_scrubbers[type].group) {
-	case XFROG_SCRUB_GROUP_AGHEADER:
-	case XFROG_SCRUB_GROUP_PERAG:
-		meta.sm_agno = sri->sri_agno;
-		break;
-	case XFROG_SCRUB_GROUP_METAFILES:
-	case XFROG_SCRUB_GROUP_SUMMARY:
-	case XFROG_SCRUB_GROUP_ISCAN:
-	case XFROG_SCRUB_GROUP_NONE:
-		break;
-	case XFROG_SCRUB_GROUP_INODE:
-		meta.sm_ino = sri->sri_ino;
-		meta.sm_gen = sri->sri_gen;
-		break;
-	}
-
-	/* Check the item. */
-	error = xfs_check_metadata(ctx, xfdp, &meta, sri);
-
-	if (xfrog_scrubbers[type].group != XFROG_SCRUB_GROUP_INODE)
-		progress_add(1);
-
-	return error;
-}
-
 /* Schedule scrub for all metadata of a given group. */
 void
 scrub_item_schedule_group(
@@ -321,7 +295,15 @@ scrub_item_check_file(
 		if (!(sri->sri_state[scrub_type] & SCRUB_ITEM_NEEDSCHECK))
 			continue;
 
-		error = scrub_meta_type(ctx, xfdp, scrub_type, sri);
+		error = xfs_check_metadata(ctx, xfdp, scrub_type, sri);
+
+		/*
+		 * Progress is counted by the inode for inode metadata; for
+		 * everything else, it's counted for each scrub call.
+		 */
+		if (sri->sri_ino == -1ULL)
+			progress_add(1);
+
 		if (error)
 			break;
 	}



* [PATCH 2/5] xfs_scrub: remove enum check_outcome
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 4/5] xfs_scrub: hoist repair retry loop to repair_item_class Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/5] xfs_scrub: start tracking scrub state in scrub_item Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Get rid of this enumeration; have the check and repair helpers record
each outcome in the scrub_item state and return ordinary error codes
directly.
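
To make the new calling convention concrete, here's a stand-alone
sketch (not part of the patch; the types and helpers below are made-up
stand-ins, not the real xfs_scrub structures): the checker stashes
whatever the kernel reported in the per-type state and returns either
zero or a positive errno that the caller treats as fatal.

/*
 * Stand-alone illustration only -- scrub_item_sketch, fake_kernel_scrub
 * and check_metadata are made-up stand-ins, not xfs_scrub code.
 */
#include <stdio.h>

#define NR_TYPES        4
#define ITEM_CORRUPT    (1 << 0)

struct scrub_item_sketch {
        unsigned char state[NR_TYPES];
};

/* Pretend kernel call: type 2 is "corrupt", everything else is clean. */
static int fake_kernel_scrub(unsigned int type, unsigned int *oflags)
{
        *oflags = (type == 2) ? ITEM_CORRUPT : 0;
        return 0;
}

/*
 * New-style checker: record the outcome in the per-type state and
 * return zero, or a positive errno (e.g. ECANCELED) to abort the scan.
 */
static int check_metadata(struct scrub_item_sketch *sri, unsigned int type)
{
        unsigned int oflags;
        int error = fake_kernel_scrub(type, &oflags);

        if (error)
                return error;
        sri->state[type] = oflags;
        return 0;
}

int main(void)
{
        struct scrub_item_sketch sri = { { 0 } };
        unsigned int type;

        for (type = 0; type < NR_TYPES; type++)
                if (check_metadata(&sri, type))
                        return 1;

        for (type = 0; type < NR_TYPES; type++)
                printf("type %u state 0x%x\n", type,
                                (unsigned int)sri.state[type]);
        return 0;
}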

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c |   56 ++++++++++++++++++++++----------------------------
 scrub/scrub.c  |   63 +++++++++++++++++++++++++++++---------------------------
 scrub/scrub.h  |    8 -------
 3 files changed, 58 insertions(+), 69 deletions(-)


diff --git a/scrub/repair.c b/scrub/repair.c
index a552b445e90..9cce421223b 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -45,7 +45,7 @@ static const unsigned int repair_deps[XFS_SCRUB_TYPE_NR] = {
 #undef DEP
 
 /* Repair some metadata. */
-static enum check_outcome
+static int
 xfs_repair_metadata(
 	struct scrub_ctx		*ctx,
 	struct xfs_fd			*xfdp,
@@ -87,7 +87,7 @@ xfs_repair_metadata(
 	}
 
 	if (!is_corrupt(&meta) && repair_only)
-		return CHECK_RETRY;
+		return 0;
 
 	memcpy(&oldm, &meta, sizeof(oldm));
 	oldm.sm_flags = sri->sri_state[scrub_type] & SCRUB_ITEM_REPAIR_ANY;
@@ -111,12 +111,12 @@ xfs_repair_metadata(
 		if (debug || verbose)
 			str_info(ctx, descr_render(&dsc),
 _("Filesystem is busy, deferring repair."));
-		return CHECK_RETRY;
+		return 0;
 	case ESHUTDOWN:
 		/* Filesystem is already shut down, abort. */
 		str_error(ctx, descr_render(&dsc),
 _("Filesystem is shut down, aborting."));
-		return CHECK_ABORT;
+		return ECANCELED;
 	case ENOTTY:
 	case EOPNOTSUPP:
 		/*
@@ -128,7 +128,7 @@ _("Filesystem is shut down, aborting."));
 		 * again to see if another repair fixed it.
 		 */
 		if (!(repair_flags & XRM_FINAL_WARNING))
-			return CHECK_RETRY;
+			return 0;
 		/*
 		 * If we forced repairs or this is a preen, don't
 		 * error out if the kernel doesn't know how to fix.
@@ -136,7 +136,7 @@ _("Filesystem is shut down, aborting."));
 		if (is_unoptimized(&oldm) ||
 		    debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
 			scrub_item_clean_state(sri, scrub_type);
-			return CHECK_DONE;
+			return 0;
 		}
 		fallthrough;
 	case EINVAL:
@@ -144,23 +144,23 @@ _("Filesystem is shut down, aborting."));
 		str_corrupt(ctx, descr_render(&dsc),
 _("Don't know how to fix; offline repair required."));
 		scrub_item_clean_state(sri, scrub_type);
-		return CHECK_DONE;
+		return 0;
 	case EROFS:
 		/* Read-only filesystem, can't fix. */
 		if (verbose || debug || needs_repair(&oldm))
 			str_error(ctx, descr_render(&dsc),
 _("Read-only filesystem; cannot make changes."));
-		return CHECK_ABORT;
+		return ECANCELED;
 	case ENOENT:
 		/* Metadata not present, just skip it. */
 		scrub_item_clean_state(sri, scrub_type);
-		return CHECK_DONE;
+		return 0;
 	case ENOMEM:
 	case ENOSPC:
 		/* Don't care if preen fails due to low resources. */
 		if (is_unoptimized(&oldm) && !needs_repair(&oldm)) {
 			scrub_item_clean_state(sri, scrub_type);
-			return CHECK_DONE;
+			return 0;
 		}
 		fallthrough;
 	default:
@@ -172,10 +172,10 @@ _("Read-only filesystem; cannot make changes."));
 		 * trying to repair it, and bail out.
 		 */
 		if (!(repair_flags & XRM_FINAL_WARNING))
-			return CHECK_RETRY;
+			return 0;
 		str_liberror(ctx, error, descr_render(&dsc));
 		scrub_item_clean_state(sri, scrub_type);
-		return CHECK_DONE;
+		return 0;
 	}
 
 	/*
@@ -198,7 +198,7 @@ _("Read-only filesystem; cannot make changes."));
 		 * log the error loudly and don't try again.
 		 */
 		if (!(repair_flags & XRM_FINAL_WARNING))
-			return CHECK_RETRY;
+			return 0;
 		str_corrupt(ctx, descr_render(&dsc),
  _("Repair unsuccessful; offline repair required."));
 	} else if (xref_failed(&meta)) {
@@ -216,7 +216,7 @@ _("Read-only filesystem; cannot make changes."));
 			if (verbose)
 				str_info(ctx, descr_render(&dsc),
  _("Seems correct but cross-referencing failed; will keep checking."));
-			return CHECK_RETRY;
+			return 0;
 		}
 	} else if (meta.sm_flags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED) {
 		if (verbose)
@@ -233,7 +233,7 @@ _("Read-only filesystem; cannot make changes."));
 	}
 
 	scrub_item_clean_state(sri, scrub_type);
-	return CHECK_DONE;
+	return 0;
 }
 
 /*
@@ -534,6 +534,7 @@ repair_item_class(
 	struct xfs_fd			xfd;
 	struct xfs_fd			*xfdp = &ctx->mnt;
 	unsigned int			scrub_type;
+	int				error = 0;
 
 	if (ctx->mode < SCRUB_MODE_REPAIR)
 		return 0;
@@ -550,8 +551,6 @@ repair_item_class(
 	}
 
 	foreach_scrub_type(scrub_type) {
-		enum check_outcome	fix;
-
 		if (scrub_excessive_errors(ctx))
 			return ECANCELED;
 
@@ -567,22 +566,17 @@ repair_item_class(
 		    !repair_item_dependencies_ok(sri, scrub_type))
 			continue;
 
-		fix = xfs_repair_metadata(ctx, xfdp, scrub_type, sri, flags);
-		switch (fix) {
-		case CHECK_DONE:
-			if (!(flags & XRM_NOPROGRESS))
-				progress_add(1);
-			continue;
-		case CHECK_ABORT:
-			return ECANCELED;
-		case CHECK_RETRY:
-			continue;
-		case CHECK_REPAIR:
-			abort();
-		}
+		error = xfs_repair_metadata(ctx, xfdp, scrub_type, sri, flags);
+		if (error)
+			break;
+
+		/* Maybe update progress if we fixed the problem. */
+		if (!(flags & XRM_NOPROGRESS) &&
+		    !(sri->sri_state[scrub_type] & SCRUB_ITEM_REPAIR_ANY))
+			progress_add(1);
 	}
 
-	return 0;
+	return error;
 }
 
 /*
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 1a60631eddc..da49285be78 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -78,12 +78,12 @@ scrub_warn_incomplete_scrub(
 }
 
 /* Do a read-only check of some metadata. */
-static enum check_outcome
+static int
 xfs_check_metadata(
 	struct scrub_ctx		*ctx,
 	struct xfs_fd			*xfdp,
 	struct xfs_scrub_metadata	*meta,
-	bool				is_inode)
+	struct scrub_item		*sri)
 {
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
 	enum xfrog_scrub_group		group;
@@ -106,17 +106,18 @@ xfs_check_metadata(
 		break;
 	case ENOENT:
 		/* Metadata not present, just skip it. */
-		return CHECK_DONE;
+		scrub_item_clean_state(sri, meta->sm_type);
+		return 0;
 	case ESHUTDOWN:
 		/* FS already crashed, give up. */
 		str_error(ctx, descr_render(&dsc),
 _("Filesystem is shut down, aborting."));
-		return CHECK_ABORT;
+		return ECANCELED;
 	case EIO:
 	case ENOMEM:
 		/* Abort on I/O errors or insufficient memory. */
 		str_liberror(ctx, error, descr_render(&dsc));
-		return CHECK_ABORT;
+		return ECANCELED;
 	case EDEADLOCK:
 	case EBUSY:
 	case EFSBADCRC:
@@ -124,13 +125,16 @@ _("Filesystem is shut down, aborting."));
 		/*
 		 * The first two should never escape the kernel,
 		 * and the other two should be reported via sm_flags.
+		 * Log it and move on.
 		 */
 		str_liberror(ctx, error, _("Kernel bug"));
-		return CHECK_DONE;
+		scrub_item_clean_state(sri, meta->sm_type);
+		return 0;
 	default:
-		/* Operational error. */
+		/* Operational error.  Log it and move on. */
 		str_liberror(ctx, error, descr_render(&dsc));
-		return CHECK_DONE;
+		scrub_item_clean_state(sri, meta->sm_type);
+		return 0;
 	}
 
 	/*
@@ -153,12 +157,16 @@ _("Filesystem is shut down, aborting."));
 	 */
 	if (is_corrupt(meta) || xref_disagrees(meta)) {
 		if (ctx->mode < SCRUB_MODE_REPAIR) {
+			/* Dry-run mode, so log an error and forget it. */
 			str_corrupt(ctx, descr_render(&dsc),
 _("Repairs are required."));
-			return CHECK_DONE;
+			scrub_item_clean_state(sri, meta->sm_type);
+			return 0;
 		}
 
-		return CHECK_REPAIR;
+		/* Schedule repairs. */
+		scrub_item_save_state(sri, meta->sm_type, meta->sm_flags);
+		return 0;
 	}
 
 	/*
@@ -167,6 +175,7 @@ _("Repairs are required."));
 	 */
 	if (is_unoptimized(meta)) {
 		if (ctx->mode != SCRUB_MODE_REPAIR) {
+			/* Dry-run mode, so log an error and forget it. */
 			if (group != XFROG_SCRUB_GROUP_INODE) {
 				/* AG or FS metadata, always warn. */
 				str_info(ctx, descr_render(&dsc),
@@ -178,10 +187,13 @@ _("Optimization is possible."));
 					ctx->preen_triggers[meta->sm_type] = true;
 				pthread_mutex_unlock(&ctx->lock);
 			}
-			return CHECK_DONE;
+			scrub_item_clean_state(sri, meta->sm_type);
+			return 0;
 		}
 
-		return CHECK_REPAIR;
+		/* Schedule optimizations. */
+		scrub_item_save_state(sri, meta->sm_type, meta->sm_flags);
+		return 0;
 	}
 
 	/*
@@ -191,11 +203,14 @@ _("Optimization is possible."));
 	 * re-examine the object as repairs progress to see if the kernel will
 	 * deem it completely consistent at some point.
 	 */
-	if (xref_failed(meta) && ctx->mode == SCRUB_MODE_REPAIR)
-		return CHECK_REPAIR;
+	if (xref_failed(meta) && ctx->mode == SCRUB_MODE_REPAIR) {
+		scrub_item_save_state(sri, meta->sm_type, meta->sm_flags);
+		return 0;
+	}
 
 	/* Everything is ok. */
-	return CHECK_DONE;
+	scrub_item_clean_state(sri, meta->sm_type);
+	return 0;
 }
 
 /* Bulk-notify user about things that could be optimized. */
@@ -235,7 +250,7 @@ scrub_meta_type(
 	struct xfs_scrub_metadata	meta = {
 		.sm_type		= type,
 	};
-	enum check_outcome		fix;
+	int				error;
 
 	background_sleep();
 
@@ -256,24 +271,12 @@ scrub_meta_type(
 	}
 
 	/* Check the item. */
-	fix = xfs_check_metadata(ctx, xfdp, &meta, false);
+	error = xfs_check_metadata(ctx, xfdp, &meta, sri);
 
 	if (xfrog_scrubbers[type].group != XFROG_SCRUB_GROUP_INODE)
 		progress_add(1);
 
-	switch (fix) {
-	case CHECK_ABORT:
-		return ECANCELED;
-	case CHECK_REPAIR:
-		scrub_item_save_state(sri, type, meta.sm_flags);
-		return 0;
-	case CHECK_DONE:
-		scrub_item_clean_state(sri, type);
-		return 0;
-	default:
-		/* CHECK_RETRY should never happen. */
-		abort();
-	}
+	return error;
 }
 
 /* Schedule scrub for all metadata of a given group. */
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 1c24b054fd9..797b872246d 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -8,14 +8,6 @@
 
 enum xfrog_scrub_group;
 
-/* Online scrub and repair. */
-enum check_outcome {
-	CHECK_DONE,	/* no further processing needed */
-	CHECK_REPAIR,	/* schedule this for repairs */
-	CHECK_ABORT,	/* end program */
-	CHECK_RETRY,	/* repair failed, try again later */
-};
-
 /*
  * This flag boosts the repair priority of a scrub item when a dependent scrub
  * item is scheduled for repair.  Use a separate flag to preserve the



* [PATCH 4/5] xfs_scrub: hoist repair retry loop to repair_item_class
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/5] xfs_scrub: refactor scrub_meta_type out of existence Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/5] xfs_scrub: hoist scrub retry loop to scrub_item_check_file Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/5] xfs_scrub: remove enum check_outcome Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/5] xfs_scrub: start tracking scrub state in scrub_item Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For metadata repair calls, move the ioctl retry and freeze permission
tracking into scrub_item.  This enables us to move the repair retry loop
out of xfs_repair_metadata and into its caller to remove a long
backwards jump, and gets us closer to vectorizing scrub calls.
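
The shape of the hoisted loop is roughly the stand-alone sketch below.
It is illustration only: struct item, fake_repair and call_again stand
in for scrub_item, xfs_repair_metadata and scrub_item_call_kernel_again,
and the real retry bookkeeping is scrub_item_schedule_retry in the
patch.

/*
 * Stand-alone illustration only: struct item, fake_repair and
 * call_again are simplified stand-ins for scrub_item,
 * xfs_repair_metadata and scrub_item_call_kernel_again.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_RETRIES     10
#define NEEDS_REPAIR    (1 << 0)

struct item {
        unsigned char state;            /* per-type state flags */
        unsigned char tries;            /* retry budget */
};

/* Fake repair call: needs three passes before the object comes clean. */
static int fake_repair(struct item *it)
{
        static int passes_left = 3;

        if (--passes_left > 0) {
                it->tries--;            /* "try again later": burn a retry */
                return 0;
        }
        it->state &= ~NEEDS_REPAIR;     /* repaired */
        return 0;
}

/* Go again only if the last call changed the item or used a retry. */
static bool call_again(const struct item *now, const struct item *old)
{
        if (!(now->state & NEEDS_REPAIR))
                return false;                   /* nothing left to do */
        if (now->state != old->state)
                return true;                    /* made some progress */
        return now->tries != old->tries;        /* retry budget consumed */
}

int main(void)
{
        struct item it = { .state = NEEDS_REPAIR, .tries = MAX_RETRIES };
        struct item old;
        int error;

        do {
                memcpy(&old, &it, sizeof(old));
                error = fake_repair(&it);
                if (error)
                        return error;
        } while (call_again(&it, &old));

        printf("final state 0x%x, %u retries left\n",
                        (unsigned int)it.state, (unsigned int)it.tries);
        return 0;
}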

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c        |   21 ++++++++++++---------
 scrub/scrub.c         |   32 ++++++++++++++++++++++++++++++--
 scrub/scrub.h         |    6 ++++++
 scrub/scrub_private.h |   14 ++++++++++++++
 4 files changed, 62 insertions(+), 11 deletions(-)


diff --git a/scrub/repair.c b/scrub/repair.c
index 9cce421223b..d0434e3d6d1 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -57,7 +57,6 @@ xfs_repair_metadata(
 	struct xfs_scrub_metadata	oldm;
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
 	bool				repair_only;
-	unsigned int			tries = 0;
 	int				error;
 
 	/*
@@ -99,7 +98,6 @@ xfs_repair_metadata(
 		str_info(ctx, descr_render(&dsc),
 				_("Attempting optimization."));
 
-retry:
 	error = -xfrog_scrub_metadata(xfdp, &meta);
 	switch (error) {
 	case 0:
@@ -184,10 +182,8 @@ _("Read-only filesystem; cannot make changes."));
 	 * the repair again, just in case the fs was busy.  Only retry so many
 	 * times.
 	 */
-	if (want_retry(&meta) && tries < 10) {
-		tries++;
-		goto retry;
-	}
+	if (want_retry(&meta) && scrub_item_schedule_retry(sri, scrub_type))
+		return 0;
 
 	if (repair_flags & XRM_FINAL_WARNING)
 		scrub_warn_incomplete_scrub(ctx, &dsc, &meta);
@@ -532,6 +528,7 @@ repair_item_class(
 	unsigned int			flags)
 {
 	struct xfs_fd			xfd;
+	struct scrub_item		old_sri;
 	struct xfs_fd			*xfdp = &ctx->mnt;
 	unsigned int			scrub_type;
 	int				error = 0;
@@ -566,9 +563,15 @@ repair_item_class(
 		    !repair_item_dependencies_ok(sri, scrub_type))
 			continue;
 
-		error = xfs_repair_metadata(ctx, xfdp, scrub_type, sri, flags);
-		if (error)
-			break;
+		sri->sri_tries[scrub_type] = SCRUB_ITEM_MAX_RETRIES;
+		do {
+			memcpy(&old_sri, sri, sizeof(old_sri));
+			error = xfs_repair_metadata(ctx, xfdp, scrub_type, sri,
+					flags);
+			if (error)
+				return error;
+		} while (scrub_item_call_kernel_again(sri, scrub_type,
+					repair_mask, &old_sri));
 
 		/* Maybe update progress if we fixed the problem. */
 		if (!(flags & XRM_NOPROGRESS) &&
diff --git a/scrub/scrub.c b/scrub/scrub.c
index ccc6a7a4047..d3d96a85903 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -268,6 +268,34 @@ scrub_item_schedule_group(
 	}
 }
 
+/* Decide if we call the kernel again to finish scrub/repair activity. */
+bool
+scrub_item_call_kernel_again(
+	struct scrub_item	*sri,
+	unsigned int		scrub_type,
+	uint8_t			work_mask,
+	const struct scrub_item	*old)
+{
+	uint8_t			statex;
+
+	/* If there's nothing to do, we're done. */
+	if (!(sri->sri_state[scrub_type] & work_mask))
+		return false;
+
+	/*
+	 * We are willing to go again if the last call had any effect on the
+	 * state of the scrub item that the caller cares about, if the freeze
+	 * flag got set, or if the kernel asked us to try again...
+	 */
+	statex = sri->sri_state[scrub_type] ^ old->sri_state[scrub_type];
+	if (statex & work_mask)
+		return true;
+	if (sri->sri_tries[scrub_type] != old->sri_tries[scrub_type])
+		return true;
+
+	return false;
+}
+
 /* Run all the incomplete scans on this scrub principal. */
 int
 scrub_item_check_file(
@@ -383,9 +411,9 @@ scrub_item_dump(
 		unsigned int	g = 1U << xfrog_scrubbers[i].group;
 
 		if (g & group_mask)
-			printf("[%u]: type '%s' state 0x%x\n", i,
+			printf("[%u]: type '%s' state 0x%x tries %u\n", i,
 					xfrog_scrubbers[i].name,
-					sri->sri_state[i]);
+					sri->sri_state[i], sri->sri_tries[i]);
 	}
 	fflush(stdout);
 }
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 797b872246d..e1ce6ed2f59 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -45,6 +45,9 @@ enum xfrog_scrub_group;
 				 SCRUB_ITEM_XFAIL | \
 				 SCRUB_ITEM_XCORRUPT)
 
+/* Maximum number of times we'll retry a scrub ioctl call. */
+#define SCRUB_ITEM_MAX_RETRIES	10
+
 struct scrub_item {
 	/*
 	 * Information we need to call the scrub and repair ioctls.  Per-AG
@@ -58,6 +61,9 @@ struct scrub_item {
 
 	/* Scrub item state flags, one for each XFS_SCRUB_TYPE. */
 	__u8			sri_state[XFS_SCRUB_TYPE_NR];
+
+	/* Track scrub and repair call retries for each scrub type. */
+	__u8			sri_tries[XFS_SCRUB_TYPE_NR];
 };
 
 #define foreach_scrub_type(loopvar) \
diff --git a/scrub/scrub_private.h b/scrub/scrub_private.h
index b54384c2091..6380779c90b 100644
--- a/scrub/scrub_private.h
+++ b/scrub/scrub_private.h
@@ -89,4 +89,18 @@ scrub_item_type_boosted(
 	return sri->sri_state[scrub_type] & SCRUB_ITEM_BOOST_REPAIR;
 }
 
+/* Decide if we want to retry this operation and update bookkeeping if yes. */
+static inline bool
+scrub_item_schedule_retry(struct scrub_item *sri, unsigned int scrub_type)
+{
+	if (sri->sri_tries[scrub_type] == 0)
+		return false;
+	sri->sri_tries[scrub_type]--;
+	return true;
+}
+
+bool scrub_item_call_kernel_again(struct scrub_item *sri,
+		unsigned int scrub_type, uint8_t work_mask,
+		const struct scrub_item *old);
+
 #endif /* XFS_SCRUB_SCRUB_PRIVATE_H_ */



* [PATCH 5/5] xfs_scrub: hoist scrub retry loop to scrub_item_check_file
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/5] xfs_scrub: refactor scrub_meta_type out of existence Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/5] xfs_scrub: hoist repair retry loop to repair_item_class Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For metadata check calls, use the ioctl retry and freeze permission
tracking in scrub_item that we created in the last patch.  This enables
us to move the check retry loop out of xfs_check_metadata and into its
caller to remove a long backwards jump, and gets us closer to
vectorizing scrub calls.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/scrub.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)


diff --git a/scrub/scrub.c b/scrub/scrub.c
index d3d96a85903..6f9f182c629 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -88,7 +88,6 @@ xfs_check_metadata(
 	DEFINE_DESCR(dsc, ctx, format_scrub_descr);
 	struct xfs_scrub_metadata	meta = { };
 	enum xfrog_scrub_group		group;
-	unsigned int			tries = 0;
 	int				error;
 
 	background_sleep();
@@ -116,7 +115,7 @@ xfs_check_metadata(
 	descr_set(&dsc, &meta);
 
 	dbg_printf("check %s flags %xh\n", descr_render(&dsc), meta.sm_flags);
-retry:
+
 	error = -xfrog_scrub_metadata(xfdp, &meta);
 	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
 		meta.sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
@@ -163,10 +162,8 @@ _("Filesystem is shut down, aborting."));
 	 * we'll try the scan again, just in case the fs was busy.
 	 * Only retry so many times.
 	 */
-	if (want_retry(&meta) && tries < 10) {
-		tries++;
-		goto retry;
-	}
+	if (want_retry(&meta) && scrub_item_schedule_retry(sri, scrub_type))
+		return 0;
 
 	/* Complain about incomplete or suspicious metadata. */
 	scrub_warn_incomplete_scrub(ctx, &dsc, &meta);
@@ -304,6 +301,7 @@ scrub_item_check_file(
 	int				override_fd)
 {
 	struct xfs_fd			xfd;
+	struct scrub_item		old_sri;
 	struct xfs_fd			*xfdp = &ctx->mnt;
 	unsigned int			scrub_type;
 	int				error;
@@ -323,7 +321,14 @@ scrub_item_check_file(
 		if (!(sri->sri_state[scrub_type] & SCRUB_ITEM_NEEDSCHECK))
 			continue;
 
-		error = xfs_check_metadata(ctx, xfdp, scrub_type, sri);
+		sri->sri_tries[scrub_type] = SCRUB_ITEM_MAX_RETRIES;
+		do {
+			memcpy(&old_sri, sri, sizeof(old_sri));
+			error = xfs_check_metadata(ctx, xfdp, scrub_type, sri);
+			if (error)
+				return error;
+		} while (scrub_item_call_kernel_again(sri, scrub_type,
+					SCRUB_ITEM_NEEDSCHECK, &old_sri));
 
 		/*
 		 * Progress is counted by the inode for inode metadata; for



* [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (53 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/4] libfrog: enhance ptvar to support initializer functions Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
                   ` (22 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Currently, phase 4 of xfs_scrub uses per-AG repair item lists to
schedule repair work across a thread pool.  This scheme is suboptimal
when most of the repairs involve a single AG because all the work gets
dumped on a single pool thread.

Instead, we should create a thread pool with the same number of workers
as CPUs, and dispatch individual repair tickets as separate work items
to maximize parallelization.

However, we also need to ensure that repairs to space metadata and file
metadata are kept in separate queues because file repairs generally
depend on correctness of space metadata.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling
---
 include/list.h        |   14 +++
 libfrog/ptvar.c       |    9 ++
 libfrog/ptvar.h       |    4 +
 scrub/counter.c       |    2 
 scrub/descr.c         |    2 
 scrub/phase1.c        |   15 ++-
 scrub/phase2.c        |   23 ++++-
 scrub/phase3.c        |  106 ++++++++++++++--------
 scrub/phase4.c        |  240 ++++++++++++++++++++++++++++++++++++-------------
 scrub/phase7.c        |    2 
 scrub/read_verify.c   |    2 
 scrub/repair.c        |  172 +++++++++++++++++++++++------------
 scrub/repair.h        |   37 ++++++--
 scrub/scrub.c         |    5 +
 scrub/scrub.h         |   10 ++
 scrub/scrub_private.h |    2 
 scrub/xfs_scrub.h     |    3 -
 17 files changed, 465 insertions(+), 183 deletions(-)



* [PATCH 1/4] libfrog: enhance ptvar to support initializer functions
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/4] xfs_scrub: recheck entire metadata objects after corruption repairs Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Modify the per-thread variable code to support passing in an initializer
function that will set up each thread's variable space when it is
claimed.
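
For reference, here's a minimal usage sketch of the new calling
convention.  It is not taken from this patch; it assumes libfrog's
ptvar.h is on the include path and the binary links against libfrog,
and the thread_stats bits are made-up names.

/*
 * Usage sketch only, not taken from the patch.
 */
#include <stdio.h>

#include "libfrog/ptvar.h"

struct thread_stats {
        unsigned long long items;
};

/* Runs once per slot, the first time ptvar_get() hands it to a thread. */
static void thread_stats_init(void *data)
{
        struct thread_stats *ts = data;

        ts->items = 42;         /* nonzero, to show the initializer ran */
}

int main(void)
{
        struct ptvar *ptv;
        struct thread_stats *ts;
        int ret;

        ret = -ptvar_alloc(1, sizeof(struct thread_stats),
                        thread_stats_init, &ptv);
        if (ret)
                return 1;

        ts = ptvar_get(ptv, &ret);
        if (!ret)
                printf("items starts at %llu\n", ts->items);

        ptvar_free(ptv);
        return ret ? 1 : 0;
}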

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libfrog/ptvar.c     |    9 ++++++++-
 libfrog/ptvar.h     |    4 +++-
 scrub/counter.c     |    2 +-
 scrub/descr.c       |    2 +-
 scrub/phase7.c      |    2 +-
 scrub/read_verify.c |    2 +-
 6 files changed, 15 insertions(+), 6 deletions(-)


diff --git a/libfrog/ptvar.c b/libfrog/ptvar.c
index 7ac8c541862..9d5ae6bc8e3 100644
--- a/libfrog/ptvar.c
+++ b/libfrog/ptvar.c
@@ -26,6 +26,7 @@
 struct ptvar {
 	pthread_key_t	key;
 	pthread_mutex_t	lock;
+	ptvar_init_fn	init_fn;
 	size_t		nr_used;
 	size_t		nr_counters;
 	size_t		data_size;
@@ -38,6 +39,7 @@ int
 ptvar_alloc(
 	size_t		nr,
 	size_t		size,
+	ptvar_init_fn	init_fn,
 	struct ptvar	**pptv)
 {
 	struct ptvar	*ptv;
@@ -58,6 +60,7 @@ ptvar_alloc(
 	ptv->data_size = size;
 	ptv->nr_counters = nr;
 	ptv->nr_used = 0;
+	ptv->init_fn = init_fn;
 	memset(ptv->data, 0, nr * size);
 	ret = -pthread_mutex_init(&ptv->lock, NULL);
 	if (ret)
@@ -98,11 +101,15 @@ ptvar_get(
 	if (!p) {
 		pthread_mutex_lock(&ptv->lock);
 		assert(ptv->nr_used < ptv->nr_counters);
-		p = &ptv->data[(ptv->nr_used++) * ptv->data_size];
+		p = &ptv->data[ptv->nr_used * ptv->data_size];
 		ret = -pthread_setspecific(ptv->key, p);
 		if (ret)
 			goto out_unlock;
+		ptv->nr_used++;
 		pthread_mutex_unlock(&ptv->lock);
+
+		if (ptv->init_fn)
+			ptv->init_fn(p);
 	}
 	*retp = 0;
 	return p;
diff --git a/libfrog/ptvar.h b/libfrog/ptvar.h
index b7d02d6269e..e4a181ffe76 100644
--- a/libfrog/ptvar.h
+++ b/libfrog/ptvar.h
@@ -8,7 +8,9 @@
 
 struct ptvar;
 
-int ptvar_alloc(size_t nr, size_t size, struct ptvar **pptv);
+typedef void (*ptvar_init_fn)(void *data);
+int ptvar_alloc(size_t nr, size_t size, ptvar_init_fn init_fn,
+		struct ptvar **pptv);
 void ptvar_free(struct ptvar *ptv);
 void *ptvar_get(struct ptvar *ptv, int *ret);
 
diff --git a/scrub/counter.c b/scrub/counter.c
index 6d91eb6e015..174f480f882 100644
--- a/scrub/counter.c
+++ b/scrub/counter.c
@@ -38,7 +38,7 @@ ptcounter_alloc(
 	p = malloc(sizeof(struct ptcounter));
 	if (!p)
 		return errno;
-	ret = -ptvar_alloc(nr, sizeof(uint64_t), &p->var);
+	ret = -ptvar_alloc(nr, sizeof(uint64_t), NULL, &p->var);
 	if (ret) {
 		free(p);
 		return ret;
diff --git a/scrub/descr.c b/scrub/descr.c
index e694d01d7b7..152d9150621 100644
--- a/scrub/descr.c
+++ b/scrub/descr.c
@@ -89,7 +89,7 @@ descr_init_phase(
 	int			ret;
 
 	assert(descr_ptvar == NULL);
-	ret = -ptvar_alloc(nr_threads, DESCR_BUFSZ, &descr_ptvar);
+	ret = -ptvar_alloc(nr_threads, DESCR_BUFSZ, NULL, &descr_ptvar);
 	if (ret)
 		str_liberror(ctx, ret, _("creating description buffer"));
 
diff --git a/scrub/phase7.c b/scrub/phase7.c
index 75d0ee0fb02..f3ab1c9d8f6 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -136,7 +136,7 @@ phase7_func(
 	}
 
 	error = -ptvar_alloc(scrub_nproc(ctx), sizeof(struct summary_counts),
-			&ptvar);
+			NULL, &ptvar);
 	if (error) {
 		str_liberror(ctx, error, _("setting up block counter"));
 		return error;
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index be30f2688f9..19f74fae722 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -120,7 +120,7 @@ read_verify_pool_alloc(
 	rvp->disk = disk;
 	rvp->ioerr_fn = ioerr_fn;
 	ret = -ptvar_alloc(submitter_threads, sizeof(struct read_verify),
-			&rvp->rvstate);
+			NULL, &rvp->rvstate);
 	if (ret)
 		goto out_counter;
 	ret = -workqueue_create(&rvp->wq, (struct xfs_mount *)rvp,



* [PATCH 2/4] xfs_scrub: improve thread scheduling of repair items during phase 4
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/4] libfrog: enhance ptvar to support initializer functions Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/4] xfs_scrub: recheck entire metadata objects after corruption repairs Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/4] xfs_scrub: try to repair space metadata before file metadata Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

As it stands, xfs_scrub doesn't do a good job of scheduling repair items
during phase 4.  The repair lists are sharded by AG, and one repair
worker is started for each per-AG repair list.  Consequently, if one AG
requires considerably more work than the others (e.g. inodes are not
spread evenly among the AGs) then phase 4 can stall waiting for that one
worker thread when there's still plenty of CPU power available.

While our initial assumptions were that repairs would be vanishingly
scarce, the reality is that "repairs" can be triggered for optimizations
like gaps in the xattr structures, or clearing the inode reflink flag on
inodes that no longer share data.  In real world testing scenarios, the
lack of balance leads to complaints about excessive runtime of
xfs_scrub.

To fix these balance problems, we replace the per-AG repair item lists
in the scrub context with a single repair item list.  Phase 4 will be
redesigned as follows:

The repair worker will grab a repair item from the main list, try to
repair it, record whether the repair attempt made any progress, and
requeue the item if it was not fully fixed.  A separate repair scheduler
function starts the repair workers, and waits for them all to complete.
Requeued repairs are merged back into the main repair list.  If we made
any forward progress, we'll start another round of repairs with the
repair workers.  Phase 4 retains the behavior that if the pool stops
making forward progress, it will try all the repairs one last time,
serially.

To facilitate this new design, phase 2 will queue repairs of space
metadata items directly to the main list.  Phase 3's worker threads will
queue repair items to per-thread lists and splice those lists into the
main list at the end.

On a filesystem crafted to put all the inodes in a single AG, this
restores xfs_scrub's ability to parallelize repairs.  There seems to be
a slight performance hit for the evenly-spread case, but avoiding a
performance cliff due to an unbalanced fs is more important here.
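
Stripped of the workqueue, locking, and condition variable, the requeue
discipline looks roughly like the stand-alone sketch below; work_item,
try_repair and run_rounds are simplified stand-ins for the action_item
handling in repair_list_worker() and repair_list_schedule().

/*
 * Stand-alone illustration only: work_item, try_repair and run_rounds
 * are simplified stand-ins for the action_item handling in
 * repair_list_worker() and repair_list_schedule().
 */
#include <stdbool.h>
#include <stdio.h>

enum outcome { REPAIRED, REQUEUE, NOPROGRESS };

struct work_item {
        int passes_needed;              /* fake "sub-problems left" counter */
        struct work_item *next;
};

/* Fake repair attempt: each pass fixes one sub-problem. */
static enum outcome try_repair(struct work_item *wi)
{
        if (wi->passes_needed == 0)
                return REPAIRED;
        wi->passes_needed--;
        return wi->passes_needed ? REQUEUE : REPAIRED;
}

static void run_rounds(struct work_item *list)
{
        bool made_progress;

        do {
                struct work_item *requeue = NULL;
                struct work_item *wi;

                made_progress = false;

                /* Drain the main list once. */
                while ((wi = list) != NULL) {
                        list = wi->next;

                        switch (try_repair(wi)) {
                        case REPAIRED:
                                /* Item is clean; drop it. */
                                made_progress = true;
                                break;
                        case REQUEUE:
                                made_progress = true;
                                /* fall through */
                        case NOPROGRESS:
                                /* Put it back for the next round. */
                                wi->next = requeue;
                                requeue = wi;
                                break;
                        }
                }

                /* Leftovers get another round as long as anything improved. */
                list = requeue;
        } while (made_progress && list != NULL);

        printf("done; %s items remain\n", list ? "some" : "no");
}

int main(void)
{
        struct work_item b = { .passes_needed = 3, .next = NULL };
        struct work_item a = { .passes_needed = 1, .next = &b };

        run_rounds(&a);
        return 0;
}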

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/list.h    |   14 +++
 scrub/phase1.c    |    8 +-
 scrub/phase2.c    |   23 +++++
 scrub/phase3.c    |  106 ++++++++++++++++---------
 scrub/phase4.c    |  226 ++++++++++++++++++++++++++++++++++++++---------------
 scrub/repair.c    |  135 +++++++++++++++++---------------
 scrub/repair.h    |   37 +++++++--
 scrub/xfs_scrub.h |    2 
 8 files changed, 377 insertions(+), 174 deletions(-)


diff --git a/include/list.h b/include/list.h
index dab4e23bd10..b1e2d49c8f3 100644
--- a/include/list.h
+++ b/include/list.h
@@ -152,6 +152,20 @@ static inline void list_splice_init(struct list_head *list,
 #define list_first_entry(ptr, type, member) \
 	list_entry((ptr)->next, type, member)
 
+/**
+ * list_first_entry_or_null - get the first element from a list
+ * @ptr:	the list head to take the element from.
+ * @type:	the type of the struct this is embedded in.
+ * @member:	the name of the list_head within the struct.
+ *
+ * Note that if the list is empty, it returns NULL.
+ */
+#define list_first_entry_or_null(ptr, type, member) ({ \
+	struct list_head *head__ = (ptr); \
+	struct list_head *pos__ = head__->next; \
+	pos__ != head__ ? list_entry(pos__, type, member) : NULL; \
+})
+
 #define container_of(ptr, type, member) ({			\
 	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
 	(type *)( (char *)__mptr - offsetof(type,member) );})
diff --git a/scrub/phase1.c b/scrub/phase1.c
index 18b1d5e948e..6ad012042da 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -89,7 +89,8 @@ scrub_cleanup(
 	if (error)
 		return error;
 
-	action_lists_free(&ctx->action_lists);
+	action_list_free(&ctx->action_list);
+
 	if (ctx->fshandle)
 		free_handle(ctx->fshandle, ctx->fshandle_len);
 	if (ctx->rtdev)
@@ -185,10 +186,9 @@ _("Not an XFS filesystem."));
 		return error;
 	}
 
-	error = action_lists_alloc(ctx->mnt.fsgeom.agcount,
-			&ctx->action_lists);
+	error = action_list_alloc(&ctx->action_list);
 	if (error) {
-		str_liberror(ctx, error, _("allocating action lists"));
+		str_liberror(ctx, error, _("allocating repair list"));
 		return error;
 	}
 
diff --git a/scrub/phase2.c b/scrub/phase2.c
index 7b580f4e4dd..e07f7a11be5 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -50,6 +50,25 @@ warn_repair_difficulties(
 	str_info(ctx, descr, _("Filesystem might not be repairable."));
 }
 
+/* Add a scrub item that needs more work to fs metadata repair list. */
+static int
+defer_fs_repair(
+	struct scrub_ctx	*ctx,
+	const struct scrub_item	*sri)
+{
+	struct action_item	*aitem = NULL;
+	int			error;
+
+	error = repair_item_to_action_item(ctx, sri, &aitem);
+	if (error || !aitem)
+		return error;
+
+	pthread_mutex_lock(&ctx->lock);
+	action_list_add(ctx->action_list, aitem);
+	pthread_mutex_unlock(&ctx->lock);
+	return 0;
+}
+
 /* Scrub each AG's metadata btrees. */
 static void
 scan_ag_metadata(
@@ -108,7 +127,7 @@ scan_ag_metadata(
 		goto err;
 
 	/* Everything else gets fixed during phase 4. */
-	ret = repair_item_defer(ctx, &sri);
+	ret = defer_fs_repair(ctx, &sri);
 	if (ret)
 		goto err;
 	return;
@@ -144,7 +163,7 @@ scan_metafile(
 	difficulty = repair_item_difficulty(&sri);
 	warn_repair_difficulties(ctx, difficulty, xfrog_scrubbers[type].descr);
 
-	ret = repair_item_defer(ctx, &sri);
+	ret = defer_fs_repair(ctx, &sri);
 	if (ret) {
 		sctl->aborted = true;
 		goto out;
diff --git a/scrub/phase3.c b/scrub/phase3.c
index 06a338480e7..e000d8fe4c5 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -10,6 +10,7 @@
 #include "list.h"
 #include "libfrog/paths.h"
 #include "libfrog/workqueue.h"
+#include "libfrog/ptvar.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "counter.h"
@@ -26,8 +27,8 @@ struct scrub_inode_ctx {
 	/* Number of inodes scanned. */
 	struct ptcounter	*icount;
 
-	/* per-AG locks to protect the repair lists */
-	pthread_mutex_t		*locks;
+	/* Per-thread lists of file repair items. */
+	struct ptvar		*repair_ptlists;
 
 	/* Set to true to abort all threads. */
 	bool			aborted;
@@ -51,28 +52,28 @@ report_close_error(
 	str_errno(ctx, descr);
 }
 
-/*
- * Defer all the repairs until phase 4, being careful about locking since the
- * inode scrub threads are not per-AG.
- */
+/* Defer all the repairs until phase 4. */
 static int
 defer_inode_repair(
 	struct scrub_inode_ctx		*ictx,
-	const struct xfs_bulkstat	*bstat,
-	struct scrub_item		*sri)
+	const struct scrub_item		*sri)
 {
+	struct action_list		*alist;
 	struct action_item		*aitem = NULL;
-	xfs_agnumber_t			agno;
 	int				ret;
 
 	ret = repair_item_to_action_item(ictx->ctx, sri, &aitem);
 	if (ret || !aitem)
 		return ret;
 
-	agno = cvt_ino_to_agno(&ictx->ctx->mnt, bstat->bs_ino);
-	pthread_mutex_lock(&ictx->locks[agno]);
-	action_list_add(&ictx->ctx->action_lists[agno], aitem);
-	pthread_mutex_unlock(&ictx->locks[agno]);
+	alist = ptvar_get(ictx->repair_ptlists, &ret);
+	if (ret) {
+		str_liberror(ictx->ctx, ret,
+ _("getting per-thread inode repair list"));
+		return ret;
+	}
+
+	action_list_add(alist, aitem);
 	return 0;
 }
 
@@ -81,8 +82,7 @@ static int
 try_inode_repair(
 	struct scrub_inode_ctx		*ictx,
 	struct scrub_item		*sri,
-	int				fd,
-	const struct xfs_bulkstat	*bstat)
+	int				fd)
 {
 	/*
 	 * If at the start of phase 3 we already had ag/rt metadata repairs
@@ -149,7 +149,7 @@ scrub_inode(
 	if (error)
 		goto out;
 
-	error = try_inode_repair(ictx, &sri, fd, bstat);
+	error = try_inode_repair(ictx, &sri, fd);
 	if (error)
 		goto out;
 
@@ -161,7 +161,7 @@ scrub_inode(
 	if (error)
 		goto out;
 
-	error = try_inode_repair(ictx, &sri, fd, bstat);
+	error = try_inode_repair(ictx, &sri, fd);
 	if (error)
 		goto out;
 
@@ -178,7 +178,7 @@ scrub_inode(
 		goto out;
 
 	/* Try to repair the file while it's open. */
-	error = try_inode_repair(ictx, &sri, fd, bstat);
+	error = try_inode_repair(ictx, &sri, fd);
 	if (error)
 		goto out;
 
@@ -195,7 +195,7 @@ scrub_inode(
 	progress_add(1);
 
 	if (!error && !ictx->aborted)
-		error = defer_inode_repair(ictx, bstat, &sri);
+		error = defer_inode_repair(ictx, &sri);
 
 	if (fd >= 0) {
 		int	err2;
@@ -212,6 +212,33 @@ scrub_inode(
 	return error;
 }
 
+/*
+ * Collect all the inode repairs in the file repair list.  No need for locks
+ * here, since we're single-threaded.
+ */
+static int
+collect_repairs(
+	struct ptvar		*ptv,
+	void			*data,
+	void			*foreach_arg)
+{
+	struct scrub_ctx	*ctx = foreach_arg;
+	struct action_list	*alist = data;
+
+	action_list_merge(ctx->action_list, alist);
+	return 0;
+}
+
+/* Initialize this per-thread file repair item list. */
+static void
+action_ptlist_init(
+	void			*priv)
+{
+	struct action_list	*alist = priv;
+
+	action_list_init(alist);
+}
+
 /* Verify all the inodes in a filesystem. */
 int
 phase3_func(
@@ -222,17 +249,18 @@ phase3_func(
 	xfs_agnumber_t		agno;
 	int			err;
 
+	err = -ptvar_alloc(scrub_nproc(ctx), sizeof(struct action_list),
+			action_ptlist_init, &ictx.repair_ptlists);
+	if (err) {
+		str_liberror(ctx, err,
+	_("creating per-thread file repair item lists"));
+		return err;
+	}
+
 	err = ptcounter_alloc(scrub_nproc(ctx), &ictx.icount);
 	if (err) {
 		str_liberror(ctx, err, _("creating scanned inode counter"));
-		return err;
-	}
-
-	ictx.locks = calloc(ctx->mnt.fsgeom.agcount, sizeof(pthread_mutex_t));
-	if (!ictx.locks) {
-		str_errno(ctx, _("creating per-AG repair list locks"));
-		err = ENOMEM;
-		goto out_ptcounter;
+		goto out_ptvar;
 	}
 
 	/*
@@ -241,9 +269,7 @@ phase3_func(
 	 * to repair the space metadata.
 	 */
 	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++) {
-		pthread_mutex_init(&ictx.locks[agno], NULL);
-
-		if (!action_list_empty(&ctx->action_lists[agno]))
+		if (!action_list_empty(ctx->action_list))
 			ictx.always_defer_repairs = true;
 	}
 
@@ -251,22 +277,30 @@ phase3_func(
 	if (!err && ictx.aborted)
 		err = ECANCELED;
 	if (err)
-		goto out_locks;
+		goto out_ptcounter;
+
+	/*
+	 * Combine all of the file repair items into the main repair list.
+	 * We don't need locks here since we're the only thread running now.
+	 */
+	err = -ptvar_foreach(ictx.repair_ptlists, collect_repairs, ctx);
+	if (err) {
+		str_liberror(ctx, err, _("collecting inode repair lists"));
+		goto out_ptcounter;
+	}
 
 	scrub_report_preen_triggers(ctx);
 	err = ptcounter_value(ictx.icount, &val);
 	if (err) {
 		str_liberror(ctx, err, _("summing scanned inode counter"));
-		goto out_locks;
+		goto out_ptcounter;
 	}
 
 	ctx->inodes_checked = val;
-out_locks:
-	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++)
-		pthread_mutex_destroy(&ictx.locks[agno]);
-	free(ictx.locks);
 out_ptcounter:
 	ptcounter_free(ictx.icount);
+out_ptvar:
+	ptvar_free(ictx.repair_ptlists);
 	return err;
 }
 
diff --git a/scrub/phase4.c b/scrub/phase4.c
index aa0b060fcc9..5563c4c5b0b 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -17,57 +17,166 @@
 #include "scrub.h"
 #include "repair.h"
 #include "vfs.h"
+#include "atomic.h"
 
 /* Phase 4: Repair filesystem. */
 
-/* Fix all the problems in our per-AG list. */
+struct repair_list_schedule {
+	struct action_list		*repair_list;
+
+	/* Action items that we could not resolve and want to try again. */
+	struct action_list		requeue_list;
+
+	pthread_mutex_t			lock;
+
+	/* Workers use this to signal the scheduler when all work is done. */
+	pthread_cond_t			done;
+
+	/* Number of workers that are still running. */
+	unsigned int			workers;
+
+	/* Or should we all abort? */
+	bool				aborted;
+
+	/* Did we make any progress this round? */
+	bool				made_progress;
+};
+
+/* Try to repair as many things on our list as we can. */
 static void
-repair_ag(
+repair_list_worker(
 	struct workqueue		*wq,
 	xfs_agnumber_t			agno,
 	void				*priv)
 {
+	struct repair_list_schedule	*rls = priv;
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->wq_ctx;
-	bool				*aborted = priv;
-	struct action_list		*alist;
-	unsigned long long		unfixed;
-	unsigned long long		new_unfixed;
-	unsigned int			flags = 0;
-	int				ret;
-
-	alist = &ctx->action_lists[agno];
-	unfixed = action_list_length(alist);
-
-	/* Repair anything broken until we fail to make progress. */
-	do {
-		ret = action_list_process(ctx, alist, flags);
+
+	pthread_mutex_lock(&rls->lock);
+	while (!rls->aborted) {
+		struct action_item	*aitem;
+		enum tryrepair_outcome	outcome;
+		int			ret;
+
+		aitem = action_list_pop(rls->repair_list);
+		if (!aitem)
+			break;
+
+		pthread_mutex_unlock(&rls->lock);
+		ret = action_item_try_repair(ctx, aitem, &outcome);
+		pthread_mutex_lock(&rls->lock);
+
 		if (ret) {
-			*aborted = true;
-			return;
+			rls->aborted = true;
+			free(aitem);
+			break;
 		}
-		new_unfixed = action_list_length(alist);
-		if (new_unfixed == unfixed)
+
+		switch (outcome) {
+		case TR_REQUEUE:
+			/*
+			 * Partial progress.  Make a note of that and requeue
+			 * this item for the next round.
+			 */
+			rls->made_progress = true;
+			action_list_add(&rls->requeue_list, aitem);
+			break;
+		case TR_NOPROGRESS:
+			/*
+			 * No progress.  Requeue this item for a later round,
+			 * which could happen if something else makes progress.
+			 */
+			action_list_add(&rls->requeue_list, aitem);
 			break;
-		unfixed = new_unfixed;
-		if (*aborted)
-			return;
-	} while (unfixed > 0);
-
-	/* Try once more, but this time complain if we can't fix things. */
-	flags |= XRM_FINAL_WARNING;
-	ret = action_list_process(ctx, alist, flags);
-	if (ret)
-		*aborted = true;
+		case TR_REPAIRED:
+			/* Item is clean.  Free it. */
+			free(aitem);
+			break;
+		}
+	}
+
+	rls->workers--;
+	if (rls->workers == 0)
+		pthread_cond_broadcast(&rls->done);
+	pthread_mutex_unlock(&rls->lock);
+}
+
+/*
+ * Schedule repair list workers.  Returns 1 if we made progress, 0 if we
+ * did not, or -1 if we need to abort everything.
+ */
+static int
+repair_list_schedule(
+	struct scrub_ctx		*ctx,
+	struct workqueue		*wq,
+	struct action_list		*repair_list)
+{
+	struct repair_list_schedule	rls = {
+		.lock			= PTHREAD_MUTEX_INITIALIZER,
+		.done			= PTHREAD_COND_INITIALIZER,
+		.repair_list		= repair_list,
+	};
+	unsigned int			i;
+	unsigned int			nr_workers = scrub_nproc(ctx);
+	bool				made_any_progress = false;
+	int				ret = 0;
+
+	if (action_list_empty(repair_list))
+		return 0;
+
+	action_list_init(&rls.requeue_list);
+
+	/*
+	 * Use the workers to run through the entire repair list once.  Requeue
+	 * anything that did not make progress, and keep trying as long as the
+	 * workers made any kind of progress.
+	 */
+	do {
+		rls.made_progress = false;
+
+		/* Start all the worker threads. */
+		for (i = 0; i < nr_workers; i++) {
+			pthread_mutex_lock(&rls.lock);
+			rls.workers++;
+			pthread_mutex_unlock(&rls.lock);
+
+			ret = -workqueue_add(wq, repair_list_worker, 0, &rls);
+			if (ret) {
+				str_liberror(ctx, ret,
+ _("queueing repair list worker"));
+				pthread_mutex_lock(&rls.lock);
+				rls.workers--;
+				pthread_mutex_unlock(&rls.lock);
+				break;
+			}
+		}
+
+		/* Wait for all worker functions to return. */
+		pthread_mutex_lock(&rls.lock);
+		if (rls.workers > 0)
+			pthread_cond_wait(&rls.done, &rls.lock);
+		pthread_mutex_unlock(&rls.lock);
+
+		action_list_merge(repair_list, &rls.requeue_list);
+
+		if (ret || rls.aborted)
+			return -1;
+		if (rls.made_progress)
+			made_any_progress = true;
+	} while (rls.made_progress && !action_list_empty(repair_list));
+
+	if (made_any_progress)
+	       return 1;
+	return 0;
 }
 
-/* Process all the action items. */
+/* Process both repair lists. */
 static int
 repair_everything(
 	struct scrub_ctx		*ctx)
 {
 	struct workqueue		wq;
-	xfs_agnumber_t			agno;
-	bool				aborted = false;
+	int				fixed_anything;
 	int				ret;
 
 	ret = -workqueue_create(&wq, (struct xfs_mount *)ctx,
@@ -76,41 +185,32 @@ repair_everything(
 		str_liberror(ctx, ret, _("creating repair workqueue"));
 		return ret;
 	}
-	for (agno = 0; !aborted && agno < ctx->mnt.fsgeom.agcount; agno++) {
-		if (action_list_length(&ctx->action_lists[agno]) == 0)
-			continue;
 
-		ret = -workqueue_add(&wq, repair_ag, agno, &aborted);
-		if (ret) {
-			str_liberror(ctx, ret, _("queueing repair work"));
+	/*
+	 * Try to fix everything on the space metadata repair list and then the
+	 * file repair list until we stop making progress.  These repairs can
+	 * be threaded, if the user desires.
+	 */
+	do {
+		fixed_anything = 0;
+
+		ret = repair_list_schedule(ctx, &wq, ctx->action_list);
+		if (ret < 0)
 			break;
-		}
-	}
+		if (ret == 1)
+			fixed_anything++;
+	} while (fixed_anything > 0);
 
 	ret = -workqueue_terminate(&wq);
 	if (ret)
 		str_liberror(ctx, ret, _("finishing repair work"));
 	workqueue_destroy(&wq);
 
-	if (aborted)
-		return ECANCELED;
+	if (ret < 0)
+		return ret;
 
-	return 0;
-}
-
-/* Decide if we have any repair work to do. */
-static inline bool
-have_action_items(
-	struct scrub_ctx	*ctx)
-{
-	xfs_agnumber_t		agno;
-
-	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++) {
-		if (action_list_length(&ctx->action_lists[agno]) > 0)
-			return true;
-	}
-
-	return false;
+	/* Repair everything serially.  Last chance to fix things. */
+	return action_list_process(ctx, ctx->action_list, XRM_FINAL_WARNING);
 }
 
 /* Trim the unused areas of the filesystem if the caller asked us to. */
@@ -132,7 +232,7 @@ phase4_func(
 	struct scrub_item	sri;
 	int			ret;
 
-	if (!have_action_items(ctx))
+	if (action_list_empty(ctx->action_list))
 		goto maybe_trim;
 
 	/*
@@ -190,12 +290,12 @@ phase4_estimate(
 	unsigned int		*nr_threads,
 	int			*rshift)
 {
-	xfs_agnumber_t		agno;
-	unsigned long long	need_fixing = 0;
+	unsigned long long	need_fixing;
 
-	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++)
-		need_fixing += action_list_length(&ctx->action_lists[agno]);
+	/* Everything on the repair list plus FSTRIM. */
+	need_fixing = action_list_length(ctx->action_list);
 	need_fixing++;
+
 	*items = need_fixing;
 	*nr_threads = scrub_nproc(ctx) + 1;
 	*rshift = 0;
diff --git a/scrub/repair.c b/scrub/repair.c
index d0434e3d6d1..dd608629f2a 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -387,58 +387,41 @@ repair_item_difficulty(
 	return ret;
 }
 
-/*
- * Allocate a certain number of repair lists for the scrub context.  Returns
- * zero or a positive error number.
- */
+/* Create a new repair action list. */
 int
-action_lists_alloc(
-	size_t				nr,
-	struct action_list		**listsp)
+action_list_alloc(
+	struct action_list		**listp)
 {
-	struct action_list		*lists;
-	xfs_agnumber_t			agno;
+	struct action_list		*alist;
 
-	lists = calloc(nr, sizeof(struct action_list));
-	if (!lists)
+	alist = malloc(sizeof(struct action_list));
+	if (!alist)
 		return errno;
 
-	for (agno = 0; agno < nr; agno++)
-		action_list_init(&lists[agno]);
-	*listsp = lists;
-
+	action_list_init(alist);
+	*listp = alist;
 	return 0;
 }
 
-/* Discard repair list contents. */
+/* Free the repair lists. */
 void
-action_list_discard(
-	struct action_list		*alist)
+action_list_free(
+	struct action_list		**listp)
 {
+	struct action_list		*alist = *listp;
 	struct action_item		*aitem;
 	struct action_item		*n;
 
+	if (!(*listp))
+		return;
+
 	list_for_each_entry_safe(aitem, n, &alist->list, list) {
 		list_del(&aitem->list);
 		free(aitem);
 	}
-}
 
-/* Free the repair lists. */
-void
-action_lists_free(
-	struct action_list		**listsp)
-{
-	free(*listsp);
-	*listsp = NULL;
-}
-
-/* Initialize repair list */
-void
-action_list_init(
-	struct action_list		*alist)
-{
-	INIT_LIST_HEAD(&alist->list);
+	free(alist);
+	*listp = NULL;
 }
 
 /* Number of pending repairs in this list. */
@@ -455,7 +438,23 @@ action_list_length(
 	return ret;
 }
 
-/* Add to the list of repairs. */
+/* Remove the first action item from the action list. */
+struct action_item *
+action_list_pop(
+	struct action_list		*alist)
+{
+	struct action_item		*aitem;
+
+	aitem = list_first_entry_or_null(&alist->list, struct action_item,
+			list);
+	if (!aitem)
+		return NULL;
+
+	list_del_init(&aitem->list);
+	return aitem;
+}
+
+/* Add an action item to the end of a list. */
 void
 action_list_add(
 	struct action_list		*alist,
@@ -464,6 +463,46 @@ action_list_add(
 	list_add_tail(&aitem->list, &alist->list);
 }
 
+/*
+ * Try to repair a filesystem object and let the caller know what it should do
+ * with the action item.  The caller must be able to requeue action items, so
+ * we don't complain if repairs are not totally successful.
+ */
+int
+action_item_try_repair(
+	struct scrub_ctx	*ctx,
+	struct action_item	*aitem,
+	enum tryrepair_outcome	*outcome)
+{
+	struct scrub_item	*sri = &aitem->sri;
+	unsigned int		before, after;
+	int			ret;
+
+	before = repair_item_count_needsrepair(sri);
+
+	ret = repair_item(ctx, sri, 0);
+	if (ret)
+		return ret;
+
+	after = repair_item_count_needsrepair(sri);
+	if (after > 0) {
+		/*
+		 * The kernel did not complete all of the repairs requested.
+		 * If it made some progress we'll requeue; otherwise, let the
+		 * caller know that nothing got fixed.
+		 */
+		if (before != after)
+			*outcome = TR_REQUEUE;
+		else
+			*outcome = TR_NOPROGRESS;
+		return 0;
+	}
+
+	/* Repairs complete. */
+	*outcome = TR_REPAIRED;
+	return 0;
+}
+
 /* Repair everything on this list. */
 int
 action_list_process(
@@ -667,29 +706,3 @@ repair_item_to_action_item(
 	*aitemp = aitem;
 	return 0;
 }
-
-/* Defer all the repairs until phase 4. */
-int
-repair_item_defer(
-	struct scrub_ctx	*ctx,
-	const struct scrub_item	*sri)
-{
-	struct action_item	*aitem = NULL;
-	unsigned int		agno;
-	int			error;
-
-	error = repair_item_to_action_item(ctx, sri, &aitem);
-	if (error || !aitem)
-		return error;
-
-	if (sri->sri_agno != -1U)
-		agno = sri->sri_agno;
-	else if (sri->sri_ino != -1ULL && sri->sri_gen != -1U)
-		agno = cvt_ino_to_agno(&ctx->mnt, sri->sri_ino);
-	else
-		agno = 0;
-	ASSERT(agno < ctx->mnt.fsgeom.agcount);
-
-	action_list_add(&ctx->action_lists[agno], aitem);
-	return 0;
-}
diff --git a/scrub/repair.h b/scrub/repair.h
index d76bb963cdd..639ab2370e0 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -12,19 +12,43 @@ struct action_list {
 
 struct action_item;
 
-int action_lists_alloc(size_t nr, struct action_list **listsp);
-void action_lists_free(struct action_list **listsp);
+int action_list_alloc(struct action_list **listp);
+void action_list_free(struct action_list **listp);
+static inline void action_list_init(struct action_list *alist)
+{
+	INIT_LIST_HEAD(&alist->list);
+}
 
-void action_list_init(struct action_list *alist);
+unsigned long long action_list_length(struct action_list *alist);
+
+/* Move all the items of @src to the tail of @dst, and reinitialize @src. */
+static inline void
+action_list_merge(
+	struct action_list	*dst,
+	struct action_list	*src)
+{
+	list_splice_tail_init(&src->list, &dst->list);
+}
+
+struct action_item *action_list_pop(struct action_list *alist);
+void action_list_add(struct action_list *alist, struct action_item *aitem);
 
 static inline bool action_list_empty(const struct action_list *alist)
 {
 	return list_empty(&alist->list);
 }
 
-unsigned long long action_list_length(struct action_list *alist);
-void action_list_add(struct action_list *dest, struct action_item *item);
-void action_list_discard(struct action_list *alist);
+enum tryrepair_outcome {
+	/* No progress was made on repairs at all. */
+	TR_NOPROGRESS = 0,
+	/* Some progress was made on repairs; try again soon. */
+	TR_REQUEUE,
+	/* Repairs completely successful. */
+	TR_REPAIRED,
+};
+
+int action_item_try_repair(struct scrub_ctx *ctx, struct action_item *aitem,
+		enum tryrepair_outcome *outcome);
 
 void repair_item_mustfix(struct scrub_item *sri, struct scrub_item *fix_now);
 
@@ -56,7 +80,6 @@ int repair_item(struct scrub_ctx *ctx, struct scrub_item *sri,
 		unsigned int repair_flags);
 int repair_item_to_action_item(struct scrub_ctx *ctx,
 		const struct scrub_item *sri, struct action_item **aitemp);
-int repair_item_defer(struct scrub_ctx *ctx, const struct scrub_item *sri);
 
 static inline unsigned int
 repair_item_count_needsrepair(
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 768f935084f..d5094147327 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -71,7 +71,7 @@ struct scrub_ctx {
 
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
-	struct action_list	*action_lists;
+	struct action_list	*action_list;
 	unsigned long long	max_errors;
 	unsigned long long	runtime_errors;
 	unsigned long long	corruptions_found;



* [PATCH 3/4] xfs_scrub: recheck entire metadata objects after corruption repairs
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/4] libfrog: enhance ptvar to support initializer functions Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/4] xfs_scrub: improve thread scheduling repair items during phase 4 Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/4] xfs_scrub: try to repair space metadata before file metadata Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we've finished making repairs to some domain of filesystem metadata
(file, AG, etc.) to correct an inconsistency, we should recheck all the
other metadata types within that domain to make sure that we neither
made things worse nor introduced more cross-referencing problems.  If we
did, requeue the item to make the repairs.  If the only changes we made
were optimizations, don't bother.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/repair.c        |   37 +++++++++++++++++++++++++++++++++++++
 scrub/scrub.c         |    5 +++--
 scrub/scrub.h         |   10 ++++++++++
 scrub/scrub_private.h |    2 ++
 4 files changed, 52 insertions(+), 2 deletions(-)


diff --git a/scrub/repair.c b/scrub/repair.c
index dd608629f2a..cf17bbd8d0e 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -476,8 +476,10 @@ action_item_try_repair(
 {
 	struct scrub_item	*sri = &aitem->sri;
 	unsigned int		before, after;
+	unsigned int		scrub_type;
 	int			ret;
 
+	BUILD_BUG_ON(sizeof(sri->sri_selected) * NBBY < XFS_SCRUB_TYPE_NR);
 	before = repair_item_count_needsrepair(sri);
 
 	ret = repair_item(ctx, sri, 0);
@@ -498,6 +500,41 @@ action_item_try_repair(
 		return 0;
 	}
 
+	/*
+	 * Nothing in this fs object was marked inconsistent.  This means we
+	 * were merely optimizing metadata and there is no revalidation work to
+	 * be done.
+	 */
+	if (!sri->sri_inconsistent) {
+		*outcome = TR_REPAIRED;
+		return 0;
+	}
+
+	/*
+	 * We fixed inconsistent metadata, so reschedule the entire object for
+	 * immediate revalidation to see if anything else went wrong.
+	 */
+	foreach_scrub_type(scrub_type)
+		if (sri->sri_selected & (1U << scrub_type))
+			sri->sri_state[scrub_type] = SCRUB_ITEM_NEEDSCHECK;
+	sri->sri_inconsistent = false;
+	sri->sri_revalidate = true;
+
+	ret = scrub_item_check(ctx, sri);
+	if (ret)
+		return ret;
+
+	after = repair_item_count_needsrepair(sri);
+	if (after > 0) {
+		/*
+		 * Uhoh, we found something else broken.  Tell the caller that
+		 * this item needs to be queued for more repairs.
+		 */
+		sri->sri_revalidate = false;
+		*outcome = TR_REQUEUE;
+		return 0;
+	}
+
 	/* Repairs complete. */
 	*outcome = TR_REPAIRED;
 	return 0;
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 6f9f182c629..c245e46afa7 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -117,11 +117,12 @@ xfs_check_metadata(
 	dbg_printf("check %s flags %xh\n", descr_render(&dsc), meta.sm_flags);
 
 	error = -xfrog_scrub_metadata(xfdp, &meta);
-	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
-		meta.sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	switch (error) {
 	case 0:
 		/* No operational errors encountered. */
+		if (!sri->sri_revalidate &&
+		    debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+			meta.sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 		break;
 	case ENOENT:
 		/* Metadata not present, just skip it. */
diff --git a/scrub/scrub.h b/scrub/scrub.h
index e1ce6ed2f59..53354099c81 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -59,11 +59,20 @@ struct scrub_item {
 	__u32			sri_gen;
 	__u32			sri_agno;
 
+	/* Bitmask of scrub types that were scheduled here. */
+	__u32			sri_selected;
+
 	/* Scrub item state flags, one for each XFS_SCRUB_TYPE. */
 	__u8			sri_state[XFS_SCRUB_TYPE_NR];
 
 	/* Track scrub and repair call retries for each scrub type. */
 	__u8			sri_tries[XFS_SCRUB_TYPE_NR];
+
+	/* Were there any corruption repairs needed? */
+	bool			sri_inconsistent:1;
+
+	/* Are we revalidating after repairs? */
+	bool			sri_revalidate:1;
 };
 
 #define foreach_scrub_type(loopvar) \
@@ -103,6 +112,7 @@ static inline void
 scrub_item_schedule(struct scrub_item *sri, unsigned int scrub_type)
 {
 	sri->sri_state[scrub_type] = SCRUB_ITEM_NEEDSCHECK;
+	sri->sri_selected |= (1U << scrub_type);
 }
 
 void scrub_item_schedule_group(struct scrub_item *sri,
diff --git a/scrub/scrub_private.h b/scrub/scrub_private.h
index 6380779c90b..c60ea555885 100644
--- a/scrub/scrub_private.h
+++ b/scrub/scrub_private.h
@@ -71,6 +71,8 @@ scrub_item_save_state(
 	unsigned  int			scrub_flags)
 {
 	sri->sri_state[scrub_type] = scrub_flags & SCRUB_ITEM_REPAIR_ANY;
+	if (scrub_flags & SCRUB_ITEM_NEEDSREPAIR)
+		sri->sri_inconsistent = true;
 }
 
 static inline void



* [PATCH 4/4] xfs_scrub: try to repair space metadata before file metadata
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 2/4] xfs_scrub: improve thread scheduling repair items during phase 4 Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Phase 4 (metadata repairs) of xfs_scrub has suffered a mild race
condition since the beginning of its existence.  Repair functions for
higher level metadata such as directories build the new directory blocks
in an unlinked temporary file and use atomic extent swapping to commit
the corrected directory contents into the existing directory.  Atomic
extent swapping requires consistent filesystem space metadata, but phase
4 has never enforced correctness dependencies between space and file
metadata repairs.

Before the previous patch eliminated the per-AG repair lists, this error
was not often hit in testing scenarios because the allocator generally
succeeds in placing file data blocks in the same AG as the inode.  With
pool threads now able to pop file repairs from the repair list before
space repairs complete, this error became much more obvious.

Fortunately, the new phase 4 design makes it easy to try to enforce the
consistency requirements of higher level file metadata repairs.  Split
the repair list into one for space metadata and another for file
metadata.  Phase 4 will now try to fix the space metadata until it stops
making progress on that, and only then will it try to fix file metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c    |   13 ++++++++++---
 scrub/phase2.c    |    2 +-
 scrub/phase3.c    |    4 ++--
 scrub/phase4.c    |   22 +++++++++++++++++-----
 scrub/xfs_scrub.h |    3 ++-
 5 files changed, 32 insertions(+), 12 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 6ad012042da..faa554f1e1e 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -89,7 +89,8 @@ scrub_cleanup(
 	if (error)
 		return error;
 
-	action_list_free(&ctx->action_list);
+	action_list_free(&ctx->file_repair_list);
+	action_list_free(&ctx->fs_repair_list);
 
 	if (ctx->fshandle)
 		free_handle(ctx->fshandle, ctx->fshandle_len);
@@ -186,9 +187,15 @@ _("Not an XFS filesystem."));
 		return error;
 	}
 
-	error = action_list_alloc(&ctx->action_list);
+	error = action_list_alloc(&ctx->fs_repair_list);
 	if (error) {
-		str_liberror(ctx, error, _("allocating repair list"));
+		str_liberror(ctx, error, _("allocating fs repair list"));
+		return error;
+	}
+
+	error = action_list_alloc(&ctx->file_repair_list);
+	if (error) {
+		str_liberror(ctx, error, _("allocating file repair list"));
 		return error;
 	}
 
diff --git a/scrub/phase2.c b/scrub/phase2.c
index e07f7a11be5..ebe3ad3ad5c 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -64,7 +64,7 @@ defer_fs_repair(
 		return error;
 
 	pthread_mutex_lock(&ctx->lock);
-	action_list_add(ctx->action_list, aitem);
+	action_list_add(ctx->fs_repair_list, aitem);
 	pthread_mutex_unlock(&ctx->lock);
 	return 0;
 }
diff --git a/scrub/phase3.c b/scrub/phase3.c
index e000d8fe4c5..c5950b1b9e3 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -225,7 +225,7 @@ collect_repairs(
 	struct scrub_ctx	*ctx = foreach_arg;
 	struct action_list	*alist = data;
 
-	action_list_merge(ctx->action_list, alist);
+	action_list_merge(ctx->file_repair_list, alist);
 	return 0;
 }
 
@@ -269,7 +269,7 @@ phase3_func(
 	 * to repair the space metadata.
 	 */
 	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++) {
-		if (!action_list_empty(ctx->action_list))
+		if (!action_list_empty(ctx->fs_repair_list))
 			ictx.always_defer_repairs = true;
 	}
 
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 5563c4c5b0b..aa9aaea1826 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -194,7 +194,13 @@ repair_everything(
 	do {
 		fixed_anything = 0;
 
-		ret = repair_list_schedule(ctx, &wq, ctx->action_list);
+		ret = repair_list_schedule(ctx, &wq, ctx->fs_repair_list);
+		if (ret < 0)
+			break;
+		if (ret == 1)
+			fixed_anything++;
+
+		ret = repair_list_schedule(ctx, &wq, ctx->file_repair_list);
 		if (ret < 0)
 			break;
 		if (ret == 1)
@@ -209,8 +215,12 @@ repair_everything(
 	if (ret < 0)
 		return ret;
 
-	/* Repair everything serially.  Last chance to fix things. */
-	return action_list_process(ctx, ctx->action_list, XRM_FINAL_WARNING);
+	/*
+	 * Combine both repair lists and repair everything serially.  This is
+	 * the last chance to fix things.
+	 */
+	action_list_merge(ctx->fs_repair_list, ctx->file_repair_list);
+	return action_list_process(ctx, ctx->fs_repair_list, XRM_FINAL_WARNING);
 }
 
 /* Trim the unused areas of the filesystem if the caller asked us to. */
@@ -232,7 +242,8 @@ phase4_func(
 	struct scrub_item	sri;
 	int			ret;
 
-	if (action_list_empty(ctx->action_list))
+	if (action_list_empty(ctx->fs_repair_list) &&
+	    action_list_empty(ctx->file_repair_list))
 		goto maybe_trim;
 
 	/*
@@ -293,7 +304,8 @@ phase4_estimate(
 	unsigned long long	need_fixing;
 
 	/* Everything on the repair list plus FSTRIM. */
-	need_fixing = action_list_length(ctx->action_list);
+	need_fixing = action_list_length(ctx->fs_repair_list) +
+		      action_list_length(ctx->file_repair_list);
 	need_fixing++;
 
 	*items = need_fixing;
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index d5094147327..8d6ee9826cb 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -71,7 +71,8 @@ struct scrub_ctx {
 
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
-	struct action_list	*action_list;
+	struct action_list	*fs_repair_list;
+	struct action_list	*file_repair_list;
 	unsigned long long	max_errors;
 	unsigned long long	runtime_errors;
 	unsigned long long	corruptions_found;



* [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (54 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/7] xfs_scrub: ignore phase 8 if the user disabled fstrim Darrick J. Wong
                     ` (6 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
                   ` (21 subsequent siblings)
  77 siblings, 7 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Back when I originally designed xfs_scrub, all filesystem metadata
checks were complete by the end of phase 3, and phase 4 was where all
the metadata repairs occurred.  On the grounds that the filesystem
should be fully consistent by then, I made a call to FITRIM at the end
of phase 4 to discard empty space in the filesystem.

Unfortunately, that's no longer the case -- summary counters, link
counts, and quota counters are not checked until phase 7.  It's not safe
to instruct the storage to unmap "empty" areas if we don't know where
those empty areas are, so we need to create a phase 8 to trim the fs.
While we're at it, make it more obvious that fstrim only gets to run if
there are no unfixed corruptions and no other runtime errors have
occurred.

Finally, parallelize the fstrim calls by making them per-AG instead of
per-filesystem.  This cuts down the trimming time considerably on more
modern hardware that can handle (or at least combine) concurrent discard
requests.
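
(For a rough sense of what per-AG trimming means here: the FITRIM ranges
that the new phase computes can also be exercised by hand with fstrim(8),
which takes a byte offset and length.  A quick sketch -- the mount point
and AG geometry below are made up, and the real code clamps the last AG
to the end of the data device:)

    # four hypothetical AGs of 256 MiB each on /mnt/scratch
    agsize=$((256 * 1024 * 1024))
    for agno in 0 1 2 3; do
        fstrim --offset $((agno * agsize)) --length "$agsize" /mnt/scratch &
    done
    wait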

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-fstrim-phase
---
 scrub/Makefile    |    1 
 scrub/phase4.c    |   30 +----------
 scrub/phase8.c    |  150 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/vfs.c       |   22 +++++---
 scrub/vfs.h       |    2 -
 scrub/xfs_scrub.c |   11 ++++
 scrub/xfs_scrub.h |    3 +
 7 files changed, 182 insertions(+), 37 deletions(-)
 create mode 100644 scrub/phase8.c



* [PATCH 1/7] xfs_scrub: move FITRIM to phase 8
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/7] xfs_scrub: ignore phase 8 if the user disabled fstrim Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/7] xfs_scrub: collapse trim_filesystem Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 7/7] xfs_scrub: fstrim each AG in parallel Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Issuing discards against the filesystem should be the *last* thing that
xfs_scrub does, after everything else has been checked, repaired, and
found to be clean.  If we can't satisfy all those conditions, we have no
business telling the storage to discard itself.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/Makefile    |    1 +
 scrub/phase4.c    |   30 ++----------------------
 scrub/phase8.c    |   66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrub.h |    3 ++
 4 files changed, 73 insertions(+), 27 deletions(-)
 create mode 100644 scrub/phase8.c


diff --git a/scrub/Makefile b/scrub/Makefile
index aba14ed2310..2dc0fe1935c 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -60,6 +60,7 @@ phase4.c \
 phase5.c \
 phase6.c \
 phase7.c \
+phase8.c \
 progress.c \
 read_verify.c \
 repair.c \
diff --git a/scrub/phase4.c b/scrub/phase4.c
index aa9aaea1826..260f7bb7ac1 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -223,16 +223,6 @@ repair_everything(
 	return action_list_process(ctx, ctx->fs_repair_list, XRM_FINAL_WARNING);
 }
 
-/* Trim the unused areas of the filesystem if the caller asked us to. */
-static void
-trim_filesystem(
-	struct scrub_ctx	*ctx)
-{
-	if (want_fstrim)
-		fstrim(ctx);
-	progress_add(1);
-}
-
 /* Fix everything that needs fixing. */
 int
 phase4_func(
@@ -244,7 +234,7 @@ phase4_func(
 
 	if (action_list_empty(ctx->fs_repair_list) &&
 	    action_list_empty(ctx->file_repair_list))
-		goto maybe_trim;
+		return 0;
 
 	/*
 	 * Check the resource usage counters early.  Normally we do this during
@@ -277,20 +267,7 @@ phase4_func(
 	if (ret)
 		return ret;
 
-	ret = repair_everything(ctx);
-	if (ret)
-		return ret;
-
-	/*
-	 * If errors remain on the filesystem, do not trim anything.  We don't
-	 * have any threads running, so it's ok to skip the ctx lock here.
-	 */
-	if (ctx->corruptions_found || ctx->unfixable_errors != 0)
-		return 0;
-
-maybe_trim:
-	trim_filesystem(ctx);
-	return 0;
+	return repair_everything(ctx);
 }
 
 /* Estimate how much work we're going to do. */
@@ -303,10 +280,9 @@ phase4_estimate(
 {
 	unsigned long long	need_fixing;
 
-	/* Everything on the repair list plus FSTRIM. */
+	/* Everything on the repair lists. */
 	need_fixing = action_list_length(ctx->fs_repair_list) +
 		      action_list_length(ctx->file_repair_list);
-	need_fixing++;
 
 	*items = need_fixing;
 	*nr_threads = scrub_nproc(ctx) + 1;
diff --git a/scrub/phase8.c b/scrub/phase8.c
new file mode 100644
index 00000000000..e2d712990ab
--- /dev/null
+++ b/scrub/phase8.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include <stdint.h>
+#include <dirent.h>
+#include <sys/types.h>
+#include <sys/statvfs.h>
+#include "list.h"
+#include "libfrog/paths.h"
+#include "libfrog/workqueue.h"
+#include "xfs_scrub.h"
+#include "common.h"
+#include "progress.h"
+#include "scrub.h"
+#include "repair.h"
+#include "vfs.h"
+#include "atomic.h"
+
+/* Phase 8: Trim filesystem. */
+
+/* Trim the unused areas of the filesystem if the caller asked us to. */
+static void
+trim_filesystem(
+	struct scrub_ctx	*ctx)
+{
+	fstrim(ctx);
+	progress_add(1);
+}
+
+/* Trim the filesystem, if desired. */
+int
+phase8_func(
+	struct scrub_ctx	*ctx)
+{
+	if (action_list_empty(ctx->fs_repair_list) &&
+	    action_list_empty(ctx->file_repair_list))
+		goto maybe_trim;
+
+	/*
+	 * If errors remain on the filesystem, do not trim anything.  We don't
+	 * have any threads running, so it's ok to skip the ctx lock here.
+	 */
+	if (ctx->corruptions_found || ctx->unfixable_errors != 0)
+		return 0;
+
+maybe_trim:
+	trim_filesystem(ctx);
+	return 0;
+}
+
+/* Estimate how much work we're going to do. */
+int
+phase8_estimate(
+	struct scrub_ctx	*ctx,
+	uint64_t		*items,
+	unsigned int		*nr_threads,
+	int			*rshift)
+{
+	*items = 1;
+	*nr_threads = 1;
+	*rshift = 0;
+	return 0;
+}
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 8d6ee9826cb..004d2d02587 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -97,6 +97,7 @@ int phase4_func(struct scrub_ctx *ctx);
 int phase5_func(struct scrub_ctx *ctx);
 int phase6_func(struct scrub_ctx *ctx);
 int phase7_func(struct scrub_ctx *ctx);
+int phase8_func(struct scrub_ctx *ctx);
 
 /* Progress estimator functions */
 unsigned int scrub_estimate_ag_work(struct scrub_ctx *ctx);
@@ -111,5 +112,7 @@ int phase5_estimate(struct scrub_ctx *ctx, uint64_t *items,
 		    unsigned int *nr_threads, int *rshift);
 int phase6_estimate(struct scrub_ctx *ctx, uint64_t *items,
 		    unsigned int *nr_threads, int *rshift);
+int phase8_estimate(struct scrub_ctx *ctx, uint64_t *items,
+		    unsigned int *nr_threads, int *rshift);
 
 #endif /* XFS_SCRUB_XFS_SCRUB_H_ */



* [PATCH 2/7] xfs_scrub: ignore phase 8 if the user disabled fstrim
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/7] xfs_scrub: collapse trim_filesystem Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the user told us to skip trimming the filesystem, don't run the phase
at all.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)


diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index 597be59f9f9..bdee8e4fdae 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -246,6 +246,7 @@ struct phase_rusage {
 /* Operations for each phase. */
 #define DATASCAN_DUMMY_FN	((void *)1)
 #define REPAIR_DUMMY_FN		((void *)2)
+#define FSTRIM_DUMMY_FN		((void *)3)
 struct phase_ops {
 	char		*descr;
 	int		(*fn)(struct scrub_ctx *ctx);
@@ -426,6 +427,11 @@ run_scrub_phases(
 			.fn = phase7_func,
 			.must_run = true,
 		},
+		{
+			.descr = _("Trim filesystem storage."),
+			.fn = FSTRIM_DUMMY_FN,
+			.estimate_work = phase8_estimate,
+		},
 		{
 			NULL
 		},
@@ -446,6 +452,8 @@ run_scrub_phases(
 		/* Turn on certain phases if user said to. */
 		if (sp->fn == DATASCAN_DUMMY_FN && scrub_data) {
 			sp->fn = phase6_func;
+		} else if (sp->fn == FSTRIM_DUMMY_FN && want_fstrim) {
+			sp->fn = phase8_func;
 		} else if (sp->fn == REPAIR_DUMMY_FN &&
 			   ctx->mode == SCRUB_MODE_REPAIR) {
 			sp->descr = _("Repair filesystem.");
@@ -455,7 +463,8 @@ run_scrub_phases(
 
 		/* Skip certain phases unless they're turned on. */
 		if (sp->fn == REPAIR_DUMMY_FN ||
-		    sp->fn == DATASCAN_DUMMY_FN)
+		    sp->fn == DATASCAN_DUMMY_FN ||
+		    sp->fn == FSTRIM_DUMMY_FN)
 			continue;
 
 		/* Allow debug users to force a particular phase. */



* [PATCH 3/7] xfs_scrub: collapse trim_filesystem
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/7] xfs_scrub: ignore phase 8 if the user disabled fstrim Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/7] xfs_scrub: move FITRIM to phase 8 Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Collapse this two-line helper into the main function since it's trivial.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase8.c |   12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)


diff --git a/scrub/phase8.c b/scrub/phase8.c
index e2d712990ab..80f86f9ecd9 100644
--- a/scrub/phase8.c
+++ b/scrub/phase8.c
@@ -21,15 +21,6 @@
 
 /* Phase 8: Trim filesystem. */
 
-/* Trim the unused areas of the filesystem if the caller asked us to. */
-static void
-trim_filesystem(
-	struct scrub_ctx	*ctx)
-{
-	fstrim(ctx);
-	progress_add(1);
-}
-
 /* Trim the filesystem, if desired. */
 int
 phase8_func(
@@ -47,7 +38,8 @@ phase8_func(
 		return 0;
 
 maybe_trim:
-	trim_filesystem(ctx);
+	fstrim(ctx);
+	progress_add(1);
 	return 0;
 }
 



* [PATCH 4/7] xfs_scrub: fix the work estimation for phase 8
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 6/7] xfs_scrub: don't call FITRIM after runtime errors Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/7] xfs_scrub: report FITRIM errors properly Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If there are latent errors on the filesystem, we aren't going to do any
work during phase 8 and it makes no sense to add that into the work
estimate for the progress bar.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase8.c |   36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)


diff --git a/scrub/phase8.c b/scrub/phase8.c
index 80f86f9ecd9..789ef2b2b4e 100644
--- a/scrub/phase8.c
+++ b/scrub/phase8.c
@@ -21,23 +21,35 @@
 
 /* Phase 8: Trim filesystem. */
 
-/* Trim the filesystem, if desired. */
-int
-phase8_func(
+static inline bool
+fstrim_ok(
 	struct scrub_ctx	*ctx)
 {
-	if (action_list_empty(ctx->fs_repair_list) &&
-	    action_list_empty(ctx->file_repair_list))
-		goto maybe_trim;
-
 	/*
 	 * If errors remain on the filesystem, do not trim anything.  We don't
 	 * have any threads running, so it's ok to skip the ctx lock here.
 	 */
-	if (ctx->corruptions_found || ctx->unfixable_errors != 0)
+	if (!action_list_empty(ctx->fs_repair_list))
+		return false;
+	if (!action_list_empty(ctx->file_repair_list))
+		return false;
+
+	if (ctx->corruptions_found != 0)
+		return false;
+	if (ctx->unfixable_errors != 0)
+		return false;
+
+	return true;
+}
+
+/* Trim the filesystem, if desired. */
+int
+phase8_func(
+	struct scrub_ctx	*ctx)
+{
+	if (!fstrim_ok(ctx))
 		return 0;
 
-maybe_trim:
 	fstrim(ctx);
 	progress_add(1);
 	return 0;
@@ -51,7 +63,11 @@ phase8_estimate(
 	unsigned int		*nr_threads,
 	int			*rshift)
 {
-	*items = 1;
+	*items = 0;
+
+	if (fstrim_ok(ctx))
+		*items = 1;
+
 	*nr_threads = 1;
 	*rshift = 0;
 	return 0;



* [PATCH 5/7] xfs_scrub: report FITRIM errors properly
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 4/7] xfs_scrub: fix the work estimation for phase 8 Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the error reporting for the FITRIM ioctl out of vfs.c and into
phase8.c.  This makes it so that IO errors encountered during trim are
counted as runtime errors instead of being dropped silently.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase8.c |    9 ++++++++-
 scrub/vfs.c    |   12 +++++++-----
 scrub/vfs.h    |    2 +-
 3 files changed, 16 insertions(+), 7 deletions(-)


diff --git a/scrub/phase8.c b/scrub/phase8.c
index 789ef2b2b4e..c54f2cc738b 100644
--- a/scrub/phase8.c
+++ b/scrub/phase8.c
@@ -47,10 +47,17 @@ int
 phase8_func(
 	struct scrub_ctx	*ctx)
 {
+	int			error;
+
 	if (!fstrim_ok(ctx))
 		return 0;
 
-	fstrim(ctx);
+	error = fstrim(ctx);
+	if (error) {
+		str_liberror(ctx, error, _("fstrim"));
+		return error;
+	}
+
 	progress_add(1);
 	return 0;
 }
diff --git a/scrub/vfs.c b/scrub/vfs.c
index 577eb6dc3e8..ca34972d401 100644
--- a/scrub/vfs.c
+++ b/scrub/vfs.c
@@ -296,15 +296,17 @@ struct fstrim_range {
 #endif
 
 /* Call FITRIM to trim all the unused space in a filesystem. */
-void
+int
 fstrim(
 	struct scrub_ctx	*ctx)
 {
 	struct fstrim_range	range = {0};
-	int			error;
 
 	range.len = ULLONG_MAX;
-	error = ioctl(ctx->mnt.fd, FITRIM, &range);
-	if (error && errno != EOPNOTSUPP && errno != ENOTTY)
-		perror(_("fstrim"));
+	if (ioctl(ctx->mnt.fd, FITRIM, &range) == 0)
+		return 0;
+	if (errno == EOPNOTSUPP || errno == ENOTTY)
+		return 0;
+
+	return errno;
 }
diff --git a/scrub/vfs.h b/scrub/vfs.h
index dc1099cf18d..14f2a583eb1 100644
--- a/scrub/vfs.h
+++ b/scrub/vfs.h
@@ -24,6 +24,6 @@ typedef int (*scan_fs_tree_dirent_fn)(struct scrub_ctx *, const char *,
 int scan_fs_tree(struct scrub_ctx *ctx, scan_fs_tree_dir_fn dir_fn,
 		scan_fs_tree_dirent_fn dirent_fn, void *arg);
 
-void fstrim(struct scrub_ctx *ctx);
+int fstrim(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_VFS_H_ */



* [PATCH 6/7] xfs_scrub: don't call FITRIM after runtime errors
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 7/7] xfs_scrub: fstrim each AG in parallel Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/7] xfs_scrub: fix the work estimation for phase 8 Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/7] xfs_scrub: report FITRIM errors properly Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't call FITRIM if there have been runtime errors -- we don't want to
touch anything after any kind of unfixable problem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase8.c |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/scrub/phase8.c b/scrub/phase8.c
index c54f2cc738b..ac667fc91fb 100644
--- a/scrub/phase8.c
+++ b/scrub/phase8.c
@@ -39,6 +39,9 @@ fstrim_ok(
 	if (ctx->unfixable_errors != 0)
 		return false;
 
+	if (ctx->runtime_errors != 0)
+		return false;
+
 	return true;
 }
 



* [PATCH 7/7] xfs_scrub: fstrim each AG in parallel
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 1/7] xfs_scrub: move FITRIM to phase 8 Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 6/7] xfs_scrub: don't call FITRIM after runtime errors Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Newer flash storage devices aren't as bad as the old ones when it comes
to trimming unused storage.  We know that the first block of each AG is
always used, and therefore each AG can be trimmed independently.  Do
them all in parallel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase8.c |   80 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 scrub/vfs.c    |   10 +++++--
 scrub/vfs.h    |    2 +
 3 files changed, 81 insertions(+), 11 deletions(-)


diff --git a/scrub/phase8.c b/scrub/phase8.c
index ac667fc91fb..a8ea8db706b 100644
--- a/scrub/phase8.c
+++ b/scrub/phase8.c
@@ -18,6 +18,7 @@
 #include "repair.h"
 #include "vfs.h"
 #include "atomic.h"
+#include "disk.h"
 
 /* Phase 8: Trim filesystem. */
 
@@ -45,24 +46,89 @@ fstrim_ok(
 	return true;
 }
 
+struct trim_ctl {
+	uint64_t	datadev_end_pos;
+	bool		aborted;
+};
+
+/* Trim each AG. */
+static void
+trim_ag(
+	struct workqueue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
+	struct trim_ctl		*tctl = arg;
+	uint64_t		pos, len, eoag_pos;
+	int			error;
+
+	pos = cvt_agbno_to_b(&ctx->mnt, agno, 0);
+	eoag_pos = cvt_agbno_to_b(&ctx->mnt, agno, ctx->mnt.fsgeom.agblocks);
+	len = min(tctl->datadev_end_pos, eoag_pos) - pos;
+
+	error = fstrim(ctx, pos, len);
+	if (error) {
+		char		descr[DESCR_BUFSZ];
+
+		snprintf(descr, sizeof(descr) - 1, _("fstrim agno %u"), agno);
+		str_liberror(ctx, error, descr);
+		tctl->aborted = true;
+		return;
+	}
+
+	progress_add(1);
+}
+
 /* Trim the filesystem, if desired. */
 int
 phase8_func(
 	struct scrub_ctx	*ctx)
 {
-	int			error;
+	struct workqueue	wq;
+	struct trim_ctl		tctl = {
+		.aborted	= false,
+	};
+	xfs_agnumber_t		agno;
+	int			error, err2;
 
 	if (!fstrim_ok(ctx))
 		return 0;
 
-	error = fstrim(ctx);
+	tctl.datadev_end_pos = cvt_off_fsb_to_b(&ctx->mnt,
+			ctx->mnt.fsgeom.datablocks);
+
+	error = -workqueue_create(&wq, (struct xfs_mount *)ctx,
+			disk_heads(ctx->datadev));
 	if (error) {
-		str_liberror(ctx, error, _("fstrim"));
+		str_liberror(ctx, error, _("creating fstrim workqueue"));
 		return error;
 	}
 
-	progress_add(1);
-	return 0;
+	/* Trim each AG in parallel. */
+	for (agno = 0;
+	     agno < ctx->mnt.fsgeom.agcount && !tctl.aborted;
+	     agno++) {
+		error = -workqueue_add(&wq, trim_ag, agno, &tctl);
+		if (error) {
+			str_liberror(ctx, error,
+					_("queueing per-AG fstrim work"));
+			goto out_wq;
+		}
+	}
+
+out_wq:
+	err2 = -workqueue_terminate(&wq);
+	if (err2) {
+		str_liberror(ctx, err2, _("finishing fstrim work"));
+		if (!error && err2)
+			error = err2;
+	}
+	workqueue_destroy(&wq);
+
+	if (!error && tctl.aborted)
+		return ECANCELED;
+	return error;
 }
 
 /* Estimate how much work we're going to do. */
@@ -76,9 +142,9 @@ phase8_estimate(
 	*items = 0;
 
 	if (fstrim_ok(ctx))
-		*items = 1;
+		*items = ctx->mnt.fsgeom.agcount;
 
-	*nr_threads = 1;
+	*nr_threads = disk_heads(ctx->datadev);
 	*rshift = 0;
 	return 0;
 }
diff --git a/scrub/vfs.c b/scrub/vfs.c
index ca34972d401..85ee2694b00 100644
--- a/scrub/vfs.c
+++ b/scrub/vfs.c
@@ -298,11 +298,15 @@ struct fstrim_range {
 /* Call FITRIM to trim all the unused space in a filesystem. */
 int
 fstrim(
-	struct scrub_ctx	*ctx)
+	struct scrub_ctx	*ctx,
+	uint64_t		offset,
+	uint64_t		len)
 {
-	struct fstrim_range	range = {0};
+	struct fstrim_range	range = {
+		.start		= offset,
+		.len		= len,
+	};
 
-	range.len = ULLONG_MAX;
 	if (ioctl(ctx->mnt.fd, FITRIM, &range) == 0)
 		return 0;
 	if (errno == EOPNOTSUPP || errno == ENOTTY)
diff --git a/scrub/vfs.h b/scrub/vfs.h
index 14f2a583eb1..311c403fa4e 100644
--- a/scrub/vfs.h
+++ b/scrub/vfs.h
@@ -24,6 +24,6 @@ typedef int (*scan_fs_tree_dirent_fn)(struct scrub_ctx *, const char *,
 int scan_fs_tree(struct scrub_ctx *ctx, scan_fs_tree_dir_fn dir_fn,
 		scan_fs_tree_dirent_fn dirent_fn, void *arg);
 
-int fstrim(struct scrub_ctx *ctx);
+int fstrim(struct scrub_ctx *ctx, uint64_t offset, uint64_t len);
 
 #endif /* XFS_SCRUB_VFS_H_ */



* [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (55 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/8] xfs_scrub_fail: escape paths correctly Darrick J. Wong
                     ` (7 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
                   ` (20 subsequent siblings)
  77 siblings, 8 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Fix some glaring deficiencies in the systemd services that were created
to manage background scans.  First, we need to fix various errors in
pathname escaping, because systemd does /not/ like slashes in service
names.  We also need to fix small bugs in the mailer scripts so that
scrub failures are reported effectively.  Finally, fix xfs_scrub_all
to deal with systemd restarts causing it to think that a scrub has
finished before the service has actually completed.
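
(To illustrate the escaping problem: systemd turns "/" into "-" in unit
instance names and escapes other awkward characters, and the --path
variant additionally drops the leading slash.  The mount point below is
made up:)

    $ systemd-escape --path /mnt/my-disk
    mnt-my\x2ddisk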

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-service-fixes
---
 debian/rules                     |    1 
 scrub/xfs_scrub@.service.in      |    5 ++
 scrub/xfs_scrub_all.cron.in      |    5 ++
 scrub/xfs_scrub_all.in           |   83 ++++++++++++++++++++++++++++++++------
 scrub/xfs_scrub_all.service.in   |    5 ++
 scrub/xfs_scrub_all.timer        |    5 ++
 scrub/xfs_scrub_fail             |   37 ++++++++++++++++-
 scrub/xfs_scrub_fail@.service.in |    5 ++
 8 files changed, 130 insertions(+), 16 deletions(-)



* [PATCH 1/8] debian: install scrub services with dh_installsystemd
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/8] xfs_scrub_fail: escape paths correctly Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/8] xfs_scrub_fail: fix sendmail detection Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/8] xfs_scrub_fail: return the failure status of the mailer program Darrick J. Wong
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use dh_installsystemd to handle the installation and activation of the
scrub systemd services.  This requires bumping the compat version to 11.
Note that the services are /not/ activated on installation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 debian/rules |    1 +
 1 file changed, 1 insertion(+)


diff --git a/debian/rules b/debian/rules
index 95df4835b25..57baad625c5 100755
--- a/debian/rules
+++ b/debian/rules
@@ -108,6 +108,7 @@ binary-arch: checkroot built
 	dh_compress
 	dh_fixperms
 	dh_makeshlibs
+	dh_installsystemd -p xfsprogs --no-enable --no-start --no-restart-after-upgrade --no-stop-on-upgrade
 	dh_installdeb
 	dh_shlibdeps
 	dh_gencontrol



* [PATCH 2/8] xfs_scrub: add missing copyrights and spdx headers
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 6/8] xfs_scrub_all: fix argument passing when invoking xfs_scrub manually Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 8/8] xfs_scrub_all: survive systemd restarts when waiting for services Darrick J. Wong
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add missing license and copyright information to the systemd control
files in this directory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub@.service.in      |    5 +++++
 scrub/xfs_scrub_all.cron.in      |    5 +++++
 scrub/xfs_scrub_all.service.in   |    5 +++++
 scrub/xfs_scrub_all.timer        |    5 +++++
 scrub/xfs_scrub_fail             |    5 +++++
 scrub/xfs_scrub_fail@.service.in |    5 +++++
 6 files changed, 30 insertions(+)


diff --git a/scrub/xfs_scrub@.service.in b/scrub/xfs_scrub@.service.in
index 6fb3f6ea2e9..c8662fc85a6 100644
--- a/scrub/xfs_scrub@.service.in
+++ b/scrub/xfs_scrub@.service.in
@@ -1,3 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
 [Unit]
 Description=Online XFS Metadata Check for %I
 OnFailure=xfs_scrub_fail@%i.service
diff --git a/scrub/xfs_scrub_all.cron.in b/scrub/xfs_scrub_all.cron.in
index 3dea9296077..0ef97cc0ca8 100644
--- a/scrub/xfs_scrub_all.cron.in
+++ b/scrub/xfs_scrub_all.cron.in
@@ -1 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+#
 10 3 * * 0 root test -e /run/systemd/system || @sbindir@/xfs_scrub_all
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
index b1b80da40a3..b874eb6f757 100644
--- a/scrub/xfs_scrub_all.service.in
+++ b/scrub/xfs_scrub_all.service.in
@@ -1,3 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
 [Unit]
 Description=Online XFS Metadata Check for All Filesystems
 ConditionACPower=true
diff --git a/scrub/xfs_scrub_all.timer b/scrub/xfs_scrub_all.timer
index 2e4a33b1666..1aef11e18f9 100644
--- a/scrub/xfs_scrub_all.timer
+++ b/scrub/xfs_scrub_all.timer
@@ -1,3 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
 [Unit]
 Description=Periodic XFS Online Metadata Check for All Filesystems
 
diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
index 36dd50e9653..8ada5dbbe06 100755
--- a/scrub/xfs_scrub_fail
+++ b/scrub/xfs_scrub_fail
@@ -1,5 +1,10 @@
 #!/bin/bash
 
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
 # Email logs of failed xfs_scrub unit runs
 
 mailer=/usr/sbin/sendmail
diff --git a/scrub/xfs_scrub_fail@.service.in b/scrub/xfs_scrub_fail@.service.in
index 8d106e9ba4b..ac0cb2e283b 100644
--- a/scrub/xfs_scrub_fail@.service.in
+++ b/scrub/xfs_scrub_fail@.service.in
@@ -1,3 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
 [Unit]
 Description=Online XFS Metadata Check Failure Reporting for %I
 Documentation=man:xfs_scrub(8)



* [PATCH 3/8] xfs_scrub_fail: fix sendmail detection
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/8] xfs_scrub_fail: escape paths correctly Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/8] debian: install scrub services with dh_installsystemd Darrick J. Wong
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

This script emails the results of failed scrub runs to root.  We
shouldn't be hardcoding the path to the mailer program because distros
can change the path according to their whim.  Modify this script to use
command -v to find the program.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_fail |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
index 8ada5dbbe06..a46eb34ee29 100755
--- a/scrub/xfs_scrub_fail
+++ b/scrub/xfs_scrub_fail
@@ -7,13 +7,14 @@
 
 # Email logs of failed xfs_scrub unit runs
 
-mailer=/usr/sbin/sendmail
 recipient="$1"
 test -z "${recipient}" && exit 0
 mntpoint="$2"
 test -z "${mntpoint}" && exit 0
 hostname="$(hostname -f 2>/dev/null)"
 test -z "${hostname}" && hostname="${HOSTNAME}"
+
+mailer="$(command -v sendmail)"
 if [ ! -x "${mailer}" ]; then
 	echo "${mailer}: Mailer program not found."
 	exit 1



* [PATCH 4/8] xfs_scrub_fail: escape paths correctly
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/8] xfs_scrub_fail: fix sendmail detection Darrick J. Wong
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Always escape pathnames correctly so that systemd doesn't complain.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_fail |   28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)


diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
index a46eb34ee29..4ec7e48836a 100755
--- a/scrub/xfs_scrub_fail
+++ b/scrub/xfs_scrub_fail
@@ -20,6 +20,32 @@ if [ ! -x "${mailer}" ]; then
 	exit 1
 fi
 
+# systemd doesn't like unit instance names with slashes in them, so it
+# replaces them with dashes when it invokes the service.  However, it's not
+# smart enough to convert the dashes to something else, so when it unescapes
+# the instance name to feed to xfs_scrub, it turns all dashes into slashes.
+# "/moo-cow" becomes "-moo-cow" becomes "/moo/cow", which is wrong.  systemd
+# actually /can/ escape the dashes correctly if it is told that this is a path
+# (and not a unit name), but it didn't do this prior to January 2017, so fix
+# this for them.
+#
+# systemd path escaping also drops the initial slash so we add that back in so
+# that log messages from the service units preserve the full path and users can
+# look up log messages using full paths.  However, for "/" the escaping rules
+# do /not/ drop the initial slash, so we have to special-case that here.
+escape_path() {
+	local arg="$1"
+
+	if [ "${arg}" = "/" ]; then
+		echo "-"
+		exit 0
+	fi
+
+	echo "-$(systemd-escape --path "${mntpoint}")"
+}
+
+mntpoint_esc="$(escape_path "${mntpoint}")"
+
 (cat << ENDL
 To: $1
 From: <xfs_scrub@${hostname}>
@@ -29,4 +55,4 @@ So sorry, the automatic xfs_scrub of ${mntpoint} on ${hostname} failed.
 
 A log of what happened follows:
 ENDL
-systemctl status --full --lines 4294967295 "xfs_scrub@${mntpoint}") | "${mailer}" -t -i
+systemctl status --full --lines 4294967295 "xfs_scrub@${mntpoint_esc}") | "${mailer}" -t -i



* [PATCH 5/8] xfs_scrub_fail: return the failure status of the mailer program
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 1/8] debian: install scrub services with dh_installsystemd Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 7/8] xfs_scrub_all: escape service names consistently Darrick J. Wong
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We should return the exit code of the mailer program sending the scrub
failure reports, since that's much more important to anyone watching the
system.
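
(The bash mechanism used below: PIPESTATUS records the exit status of
every member of the most recent pipeline, so element 1 is the second
member -- here, the mailer.  A trivial sketch:)

    false | true
    echo "${PIPESTATUS[0]} ${PIPESTATUS[1]}"    # prints "1 0"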

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_fail |    1 +
 1 file changed, 1 insertion(+)


diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
index 4ec7e48836a..fbe30cbc4c6 100755
--- a/scrub/xfs_scrub_fail
+++ b/scrub/xfs_scrub_fail
@@ -56,3 +56,4 @@ So sorry, the automatic xfs_scrub of ${mntpoint} on ${hostname} failed.
 A log of what happened follows:
 ENDL
 systemctl status --full --lines 4294967295 "xfs_scrub@${mntpoint_esc}") | "${mailer}" -t -i
+exit "${PIPESTATUS[1]}"



* [PATCH 6/8] xfs_scrub_all: fix argument passing when invoking xfs_scrub manually
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 7/8] xfs_scrub_all: escape service names consistently Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/8] xfs_scrub: add missing copyrights and spdx headers Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 8/8] xfs_scrub_all: survive systemd restarts when waiting for services Darrick J. Wong
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, xfs_scrub_all will try to invoke xfs_scrub with argv[1] being
"-n -x".  This of course is recognized by C getopt as a weird looking
string, not two individual arguments, and causes the child process to
exit with complaints about CLI usage.

What we really want is to split the string into a proper array and then
add them to the xfs_scrub command line.  The code here isn't strictly
correct, but as @scrub_args@ is controlled by us in the Makefile, it'll
do for now.
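
(Viewed from a shell, the old invocation was effectively the quoted form
below, with a made-up mount point; getopt sees one malformed option
cluster instead of two options:)

    # broken: a single six-character argument "-n -x"
    xfs_scrub '-n -x' /mnt/scratch
    # intended: two separate arguments
    xfs_scrub -n -x /mnt/scratch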

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_all.in |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 5b76b49adab..32bcfb15f5a 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -132,7 +132,9 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 			return
 
 		# Invoke xfs_scrub manually
-		cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt]
+		cmd = ['@sbindir@/xfs_scrub']
+		cmd += '@scrub_args@'.split()
+		cmd += [mnt]
 		ret = run_killable(cmd, None, killfuncs, \
 				lambda proc: proc.terminate())
 		if ret >= 0:



* [PATCH 7/8] xfs_scrub_all: escape service names consistently
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 5/8] xfs_scrub_fail: return the failure status of the mailer program Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 6/8] xfs_scrub_all: fix argument passing when invoking xfs_scrub manually Darrick J. Wong
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In commit 07c6fd59, I fixed this script so that it would escape
pathnames correctly when passing them as unit instance variables to
systemctl start.  Unfortunately, I neglected to do this for systemctl
stop, which leads to warnings if someone hit ^C while the program is
running from a CLI.  Fix that, and name the unit name variable
consistently.

Fixes: 07c6fd59 ("xfs_scrub_all: escape paths being passed to systemd service instances")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_all.in |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)


diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 32bcfb15f5a..2bdbccffd9c 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -58,10 +58,10 @@ def find_mounts():
 
 	return fs
 
-def kill_systemd(unit, proc):
+def kill_systemd(unitname, proc):
 	'''Kill systemd unit.'''
 	proc.terminate()
-	cmd=['systemctl', 'stop', unit]
+	cmd = ['systemctl', 'stop', unitname]
 	x = subprocess.Popen(cmd)
 	x.wait()
 
@@ -119,9 +119,10 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 			return
 
 		# Try it the systemd way
-		cmd=['systemctl', 'start', 'xfs_scrub@%s' % systemd_escape(mnt)]
+		unitname = 'xfs_scrub@%s' % systemd_escape(mnt)
+		cmd = ['systemctl', 'start', unitname]
 		ret = run_killable(cmd, DEVNULL(), killfuncs, \
-				lambda proc: kill_systemd('xfs_scrub@%s' % mnt, proc))
+				lambda proc: kill_systemd(unitname, proc))
 		if ret == 0 or ret == 1:
 			print("Scrubbing %s done, (err=%d)" % (mnt, ret))
 			sys.stdout.flush()



* [PATCH 8/8] xfs_scrub_all: survive systemd restarts when waiting for services
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 2/8] xfs_scrub: add missing copyrights and spdx headers Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  7 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If xfs_scrub_all detects a running systemd, it will use it to invoke
xfs_scrub subprocesses in a sandboxed and resource-controlled
environment.  Unfortunately, if you happen to restart dbus or systemd
while it's running, you get this:

systemd[1]: Reexecuting.
xfs_scrub_all[9958]: Warning! D-Bus connection terminated.
xfs_scrub_all[9956]: Warning! D-Bus connection terminated.
xfs_scrub_all[9956]: Failed to wait for response: Connection reset by peer
xfs_scrub_all[9958]: Failed to wait for response: Connection reset by peer
xfs_scrub_all[9930]: Scrubbing / done, (err=1)
xfs_scrub_all[9930]: Scrubbing /storage done, (err=1)

The xfs_scrub units themselves are still running, it's just that the
`systemctl start' command that xfs_scrub_all uses to start and wait for
the unit lost its connection to dbus and hence is no longer monitoring
sub-services.

When this happens, we don't have great options -- systemctl doesn't have
a command to wait on an activating (aka running) unit.  Emulate the
functionality we normally get by polling the failed/active statuses.
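
(A minimal bash sketch of that polling loop, with a hypothetical escaped
unit name; the real implementation below does the same thing in Python:)

    unit="xfs_scrub@mnt-data.service"
    while true; do
            state="$(systemctl is-active "$unit")"
            case "$state" in
            inactive) exit 0 ;;     # unit finished and did not fail
            failed)   exit 1 ;;     # unit finished with errors
            esac
            sleep 1                 # still activating/active; keep waiting
    done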

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_all.in |   78 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 65 insertions(+), 13 deletions(-)


diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 2bdbccffd9c..3e0c48acb39 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -14,6 +14,7 @@ import time
 import sys
 import os
 import argparse
+from io import TextIOWrapper
 
 retcode = 0
 terminate = False
@@ -58,12 +59,18 @@ def find_mounts():
 
 	return fs
 
-def kill_systemd(unitname, proc):
-	'''Kill systemd unit.'''
-	proc.terminate()
-	cmd = ['systemctl', 'stop', unitname]
-	x = subprocess.Popen(cmd)
-	x.wait()
+def backtick(cmd):
+	'''Generator function that yields lines of a program's stdout.'''
+	p = subprocess.Popen(cmd, stdout = subprocess.PIPE)
+	for line in TextIOWrapper(p.stdout, encoding="utf-8"):
+		yield line.strip()
+
+def remove_killfunc(killfuncs, fn):
+	'''Ensure fn is not in killfuncs.'''
+	try:
+		killfuncs.remove(fn)
+	except:
+		pass
 
 def run_killable(cmd, stdout, killfuncs, kill_fn):
 	'''Run a killable program.  Returns program retcode or -1 if we can't start it.'''
@@ -72,10 +79,7 @@ def run_killable(cmd, stdout, killfuncs, kill_fn):
 		real_kill_fn = lambda: kill_fn(proc)
 		killfuncs.add(real_kill_fn)
 		proc.wait()
-		try:
-			killfuncs.remove(real_kill_fn)
-		except:
-			pass
+		remove_killfunc(killfuncs, real_kill_fn)
 		return proc.returncode
 	except:
 		return -1
@@ -107,6 +111,56 @@ def systemd_escape(path):
 	except:
 		return path
 
+def systemctl_stop(unitname):
+	'''Stop a systemd unit.'''
+	cmd = ['systemctl', 'stop', unitname]
+	x = subprocess.Popen(cmd)
+	x.wait()
+
+def systemctl_start(unitname, killfuncs):
+	'''Start a systemd unit and wait for it to complete.'''
+	stop_fn = None
+	cmd = ['systemctl', 'start', unitname]
+	try:
+		proc = subprocess.Popen(cmd, stdout = DEVNULL())
+		stop_fn = lambda: systemctl_stop(unitname)
+		killfuncs.add(stop_fn)
+		proc.wait()
+		ret = proc.returncode
+	except:
+		if stop_fn is not None:
+			remove_killfunc(killfuncs, stop_fn)
+		return -1
+
+	if ret != 1:
+		remove_killfunc(killfuncs, stop_fn)
+		return ret
+
+	# If systemctl-start returns 1, it's possible that the service failed
+	# or that dbus/systemd restarted and the client program lost its
+	# connection -- according to the systemctl man page, 1 means "unit not
+	# failed".
+	#
+	# Either way, we switch to polling the service status to try to wait
+	# for the service to end.  As of systemd 249, the is-active command
+	# returns any of the following states: active, reloading, inactive,
+	# failed, activating, deactivating, or maintenance.  Apparently these
+	# strings are not localized.
+	while True:
+		try:
+			for l in backtick(['systemctl', 'is-active', unitname]):
+				if l == 'failed':
+					remove_killfunc(killfuncs, stop_fn)
+					return 1
+				if l == 'inactive':
+					remove_killfunc(killfuncs, stop_fn)
+					return 0
+		except:
+			remove_killfunc(killfuncs, stop_fn)
+			return -1
+
+		time.sleep(1)
+
 def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 	'''Run a scrub process.'''
 	global retcode, terminate
@@ -120,9 +174,7 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 
 		# Try it the systemd way
 		unitname = 'xfs_scrub@%s' % systemd_escape(mnt)
-		cmd = ['systemctl', 'start', unitname]
-		ret = run_killable(cmd, DEVNULL(), killfuncs, \
-				lambda proc: kill_systemd(unitname, proc))
+		ret = systemctl_start(unitname, killfuncs)
 		if ret == 0 or ret == 1:
 			print("Scrubbing %s done, (err=%d)" % (mnt, ret))
 			sys.stdout.flush()


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (56 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/5] xfs_scrub.service: reduce CPU usage to 60% when possible Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service Darrick J. Wong
                   ` (19 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

To reduce the risk of the online fsck service suffering some sort of
catastrophic breach that results in attackers reconfiguring the running
system, I embarked on a security audit of the systemd service files.
The result should be that all elements of the background service
(individual scrub jobs, the scrub_all initiator, and the failure
reporting) run with as few privileges and in as strong a sandbox as
possible.

Granted, this does nothing about the potential for the /kernel/ to screw
up, but at least we can prevent obvious container escapes.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-service-security
---
 doc/README-env-vars.txt          |    2 +
 scrub/Makefile                   |    7 +++
 scrub/phase1.c                   |    4 +-
 scrub/system-xfs_scrub.slice     |   30 +++++++++++++
 scrub/vfs.c                      |    2 -
 scrub/xfs_scrub.c                |    9 +++-
 scrub/xfs_scrub.h                |    5 ++
 scrub/xfs_scrub@.service.in      |   85 ++++++++++++++++++++++++++++++++++----
 scrub/xfs_scrub_all.service.in   |   66 ++++++++++++++++++++++++++++++
 scrub/xfs_scrub_fail@.service.in |   60 +++++++++++++++++++++++++++
 10 files changed, 253 insertions(+), 17 deletions(-)
 create mode 100644 scrub/system-xfs_scrub.slice


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/5] xfs_scrub: allow auxiliary pathnames for sandboxing
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/5] xfs_scrub.service: reduce CPU usage to 60% when possible Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/5] xfs_scrub_fail: tighten up the security on the background systemd service Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In the next patch, we'll tighten up the security on the xfs_scrub
service so that it can't escape.  However, sandboxing the service
involves making the host filesystem as inaccessible as possible, with
the filesystem to scrub bind mounted onto a known location within the
sandbox.  Hence we need one path for reporting and a new
SERVICE_MOUNTPOINT environment variable to tell scrub what it should
actually be trying to open.
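
In Python terms (a sketch only; the helper name is illustrative and the
real change is the C hunks below), the fallback added to main() amounts
to "report with the path the user gave us, but open whatever the
service manager points us at":

	import os

	def pick_paths(user_mntpoint):
		# SERVICE_MOUNTPOINT is set by the sandboxed unit file added
		# later in this series; without it, behave exactly as before.
		actual = os.environ.get('SERVICE_MOUNTPOINT', user_mntpoint)
		return user_mntpoint, actual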

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 doc/README-env-vars.txt |    2 ++
 scrub/phase1.c          |    4 ++--
 scrub/vfs.c             |    2 +-
 scrub/xfs_scrub.c       |    9 +++++++--
 scrub/xfs_scrub.h       |    5 ++++-
 5 files changed, 16 insertions(+), 6 deletions(-)


diff --git a/doc/README-env-vars.txt b/doc/README-env-vars.txt
index eec59a82513..d7984df8202 100644
--- a/doc/README-env-vars.txt
+++ b/doc/README-env-vars.txt
@@ -24,3 +24,5 @@ XFS_SCRUB_THREADS            -- start exactly this number of threads
 Available even in non-debug mode:
 SERVICE_MODE                 -- compress all error codes to 1 for LSB
                                 service action compliance
+SERVICE_MOUNTPOINT           -- actual path to open for issuing kernel
+                                scrub calls
diff --git a/scrub/phase1.c b/scrub/phase1.c
index faa554f1e1e..80fd0c6e27c 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -146,7 +146,7 @@ phase1_func(
 	 * CAP_SYS_ADMIN, which we probably need to do anything fancy
 	 * with the (XFS driver) kernel.
 	 */
-	error = -xfd_open(&ctx->mnt, ctx->mntpoint,
+	error = -xfd_open(&ctx->mnt, ctx->actual_mntpoint,
 			O_RDONLY | O_NOATIME | O_DIRECTORY);
 	if (error) {
 		if (error == EPERM)
@@ -199,7 +199,7 @@ _("Not an XFS filesystem."));
 		return error;
 	}
 
-	error = path_to_fshandle(ctx->mntpoint, &ctx->fshandle,
+	error = path_to_fshandle(ctx->actual_mntpoint, &ctx->fshandle,
 			&ctx->fshandle_len);
 	if (error) {
 		str_errno(ctx, _("getting fshandle"));
diff --git a/scrub/vfs.c b/scrub/vfs.c
index 85ee2694b00..c64c6c41105 100644
--- a/scrub/vfs.c
+++ b/scrub/vfs.c
@@ -249,7 +249,7 @@ scan_fs_tree(
 		goto out_cond;
 	}
 
-	ret = queue_subdir(ctx, &sft, &wq, ctx->mntpoint, true);
+	ret = queue_subdir(ctx, &sft, &wq, ctx->actual_mntpoint, true);
 	if (ret) {
 		str_liberror(ctx, ret, _("queueing directory scan"));
 		goto out_wq;
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index bdee8e4fdae..23d8fec5d9b 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -118,6 +118,8 @@
  * Available even in non-debug mode:
  * SERVICE_MODE			-- compress all error codes to 1 for LSB
  *				   service action compliance
+ * SERVICE_MOUNTPOINT		-- actual path to open for issuing kernel
+ *				   scrub calls
  */
 
 /* Program name; needed for libfrog error reports. */
@@ -739,6 +741,9 @@ main(
 		usage();
 
 	ctx.mntpoint = argv[optind];
+	ctx.actual_mntpoint = getenv("SERVICE_MOUNTPOINT");
+	if (!ctx.actual_mntpoint)
+		ctx.actual_mntpoint = ctx.mntpoint;
 
 	stdout_isatty = isatty(STDOUT_FILENO);
 	stderr_isatty = isatty(STDERR_FILENO);
@@ -756,7 +761,7 @@ main(
 		return SCRUB_RET_OPERROR;
 
 	/* Find the mount record for the passed-in argument. */
-	if (stat(argv[optind], &ctx.mnt_sb) < 0) {
+	if (stat(ctx.actual_mntpoint, &ctx.mnt_sb) < 0) {
 		fprintf(stderr,
 			_("%s: could not stat: %s: %s\n"),
 			progname, argv[optind], strerror(errno));
@@ -779,7 +784,7 @@ main(
 	}
 
 	fs_table_initialise(0, NULL, 0, NULL);
-	fsp = fs_table_lookup_mount(ctx.mntpoint);
+	fsp = fs_table_lookup_mount(ctx.actual_mntpoint);
 	if (!fsp) {
 		fprintf(stderr, _("%s: Not a XFS mount point.\n"),
 				ctx.mntpoint);
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 004d2d02587..2ef8b2e5066 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -36,9 +36,12 @@ enum error_action {
 struct scrub_ctx {
 	/* Immutable scrub state. */
 
-	/* Strings we need for presentation */
+	/* Mountpoint we use for presentation */
 	char			*mntpoint;
 
+	/* Actual VFS path to the filesystem */
+	char			*actual_mntpoint;
+
 	/* Mountpoint info */
 	struct stat		mnt_sb;
 	struct statvfs		mnt_sv;


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/5] xfs_scrub.service: reduce CPU usage to 60% when possible
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/5] xfs_scrub: allow auxiliary pathnames for sandboxing Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, the xfs_scrub background service is configured to use -b,
which means that the program runs completely serially.  However, even
using 100% of one CPU with idle priority may be enough to cause thermal
throttling and unwanted fan noise on smaller systems (e.g. laptops) with
fast IO subsystems.

Let's try to avoid this (at least on systemd) by using cgroups to limit
the program's usage to 60% of one CPU and lowering the nice priority in
the scheduler.  What we /really/ want is to run steadily on an
efficiency core, but there doesn't seem to be a means to ask the
scheduler not to ramp up the CPU frequency for a particular task.

While we're at it, group the resource limit directives together.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/Makefile                   |    7 ++++++-
 scrub/system-xfs_scrub.slice     |   30 ++++++++++++++++++++++++++++++
 scrub/xfs_scrub@.service.in      |   12 ++++++++++--
 scrub/xfs_scrub_all.service.in   |    4 ++++
 scrub/xfs_scrub_fail@.service.in |    4 ++++
 5 files changed, 54 insertions(+), 3 deletions(-)
 create mode 100644 scrub/system-xfs_scrub.slice


diff --git a/scrub/Makefile b/scrub/Makefile
index 2dc0fe1935c..1c36621b400 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -15,7 +15,12 @@ XFS_SCRUB_ALL_PROG = xfs_scrub_all
 XFS_SCRUB_ARGS = -b -n
 ifeq ($(HAVE_SYSTEMD),yes)
 INSTALL_SCRUB += install-systemd
-SYSTEMD_SERVICES = xfs_scrub@.service xfs_scrub_all.service xfs_scrub_all.timer xfs_scrub_fail@.service
+SYSTEMD_SERVICES=\
+	xfs_scrub@.service \
+	xfs_scrub_fail@.service \
+	xfs_scrub_all.service \
+	xfs_scrub_all.timer \
+	system-xfs_scrub.slice
 OPTIONAL_TARGETS += $(SYSTEMD_SERVICES)
 endif
 ifeq ($(HAVE_CROND),yes)
diff --git a/scrub/system-xfs_scrub.slice b/scrub/system-xfs_scrub.slice
new file mode 100644
index 00000000000..051cbb14108
--- /dev/null
+++ b/scrub/system-xfs_scrub.slice
@@ -0,0 +1,30 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2022 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=xfs_scrub background service slice
+Before=slices.target
+
+[Slice]
+
+# If the CPU usage cgroup controller is available, don't use more than 60% of a
+# single core for all background processes.
+CPUQuota=60%
+CPUAccounting=true
+
+[Install]
+# As of systemd 249, the systemd cgroupv2 configuration code will drop resource
+# controllers from the root and system.slice cgroups at startup if it doesn't
+# find any direct dependencies that require a given controller.  Newly
+# activated units with resource control directives are created under the system
+# slice but do not cause a reconfiguration of the slice's resource controllers.
+# Hence we cannot put CPUQuota= into the xfs_scrub service units directly.
+#
+# For the CPUQuota directive to have any effect, we must therefore create an
+# explicit definition file for the slice that systemd creates to contain the
+# xfs_scrub instance units (e.g. xfs_scrub@.service) and we must configure this
+# slice as a dependency of the system slice to establish the direct dependency
+# relation.
+WantedBy=system.slice
diff --git a/scrub/xfs_scrub@.service.in b/scrub/xfs_scrub@.service.in
index c8662fc85a6..3c64252de49 100644
--- a/scrub/xfs_scrub@.service.in
+++ b/scrub/xfs_scrub@.service.in
@@ -18,8 +18,16 @@ PrivateTmp=no
 AmbientCapabilities=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
 NoNewPrivileges=yes
 User=nobody
-IOSchedulingClass=idle
-CPUSchedulingPolicy=idle
 Environment=SERVICE_MODE=1
 ExecStart=@sbindir@/xfs_scrub @scrub_args@ %I
 SyslogIdentifier=%N
+
+# Run scrub with minimal CPU and IO priority so that nothing else will starve.
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+CPUAccounting=true
+Nice=19
+
+# Create the service underneath the scrub background service slice so that we
+# can control resource usage.
+Slice=system-xfs_scrub.slice
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
index b874eb6f757..ae4135033dd 100644
--- a/scrub/xfs_scrub_all.service.in
+++ b/scrub/xfs_scrub_all.service.in
@@ -14,3 +14,7 @@ Type=oneshot
 Environment=SERVICE_MODE=1
 ExecStart=@sbindir@/xfs_scrub_all
 SyslogIdentifier=xfs_scrub_all
+
+# Create the service underneath the scrub background service slice so that we
+# can control resource usage.
+Slice=system-xfs_scrub.slice
diff --git a/scrub/xfs_scrub_fail@.service.in b/scrub/xfs_scrub_fail@.service.in
index ac0cb2e283b..591486599ce 100644
--- a/scrub/xfs_scrub_fail@.service.in
+++ b/scrub/xfs_scrub_fail@.service.in
@@ -14,3 +14,7 @@ ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" %I
 User=mail
 Group=mail
 SupplementaryGroups=systemd-journal
+
+# Create the service underneath the scrub background service slice so that we
+# can control resource usage.
+Slice=system-xfs_scrub.slice


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/5] xfs_scrub: tighten up the security on the background systemd service
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 4/5] xfs_scrub_fail: tighten up the security on the background systemd service Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/5] xfs_scrub_all: " Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, xfs_scrub has to run with some elevated privileges.  Minimize
the risk of xfs_scrub escaping its service container or contaminating
the rest of the system by using systemd's sandboxing controls to
prohibit as much access as possible.

The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub@.service' in systemd 249.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub@.service.in |   73 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 65 insertions(+), 8 deletions(-)


diff --git a/scrub/xfs_scrub@.service.in b/scrub/xfs_scrub@.service.in
index 3c64252de49..39af00d4b73 100644
--- a/scrub/xfs_scrub@.service.in
+++ b/scrub/xfs_scrub@.service.in
@@ -10,15 +10,8 @@ Documentation=man:xfs_scrub(8)
 
 [Service]
 Type=oneshot
-PrivateNetwork=true
-ProtectSystem=full
-ProtectHome=read-only
-# Disable private /tmp just in case %i is a path under /tmp.
-PrivateTmp=no
-AmbientCapabilities=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
-NoNewPrivileges=yes
-User=nobody
 Environment=SERVICE_MODE=1
+Environment=SERVICE_MOUNTPOINT=/tmp/scrub
 ExecStart=@sbindir@/xfs_scrub @scrub_args@ %I
 SyslogIdentifier=%N
 
@@ -31,3 +24,67 @@ Nice=19
 # Create the service underneath the scrub background service slice so that we
 # can control resource usage.
 Slice=system-xfs_scrub.slice
+
+# No realtime CPU scheduling
+RestrictRealtime=true
+
+# Dynamically create a user that isn't root
+DynamicUser=true
+
+# Make the entire filesystem readonly and /home inaccessible, then bind mount
+# the filesystem we're supposed to be checking into our private /tmp dir.
+# 'norbind' means that we don't bind anything under that original mount.
+ProtectSystem=strict
+ProtectHome=yes
+PrivateTmp=true
+BindPaths=/%I:/tmp/scrub:norbind
+
+# Don't let scrub complain about paths in /etc/projects that have been hidden
+# by our sandboxing.  scrub doesn't care about project ids anyway.
+InaccessiblePaths=-/etc/projects
+
+# No network access
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=none
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No writable memory pages
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_scrub needs these privileges to run, and no others
+CapabilityBoundingSet=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
+AmbientCapabilities=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
+NoNewPrivileges=true
+
+# xfs_scrub doesn't create files
+UMask=7777
+
+# No access to hardware /dev files except for block devices
+ProtectClock=true
+DevicePolicy=closed
+DeviceAllow=block-*


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs_scrub_fail: tighten up the security on the background systemd service
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/5] xfs_scrub.service: reduce CPU usage to 60% when possible Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/5] xfs_scrub: allow auxiliary pathnames for sandboxing Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/5] xfs_scrub: " Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 5/5] xfs_scrub_all: " Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, xfs_scrub_fail has to run with enough privileges to access
the journal contents for a given scrub run and to send a report via
email.  Minimize the risk of xfs_scrub_fail escaping its service
container or contaminating the rest of the system by using systemd's
sandboxing controls to prohibit as much access as possible.

The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_fail@.service' in systemd 249.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_fail@.service.in |   56 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)


diff --git a/scrub/xfs_scrub_fail@.service.in b/scrub/xfs_scrub_fail@.service.in
index 591486599ce..2c36c47ab02 100644
--- a/scrub/xfs_scrub_fail@.service.in
+++ b/scrub/xfs_scrub_fail@.service.in
@@ -18,3 +18,59 @@ SupplementaryGroups=systemd-journal
 # Create the service underneath the scrub background service slice so that we
 # can control resource usage.
 Slice=system-xfs_scrub.slice
+
+# No realtime scheduling
+RestrictRealtime=true
+
+# Make the entire filesystem readonly and /home inaccessible, then bind mount
+# the filesystem we're supposed to be checking into our private /tmp dir.
+ProtectSystem=full
+ProtectHome=yes
+PrivateTmp=true
+RestrictSUIDSGID=true
+
+# Emailing reports requires network access, but not the ability to change the
+# hostname.
+ProtectHostname=true
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Can't hide /proc because journalctl needs it to find various pieces of log
+# information
+#ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No writable memory pages
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_scrub needs these privileges to run, and no others
+CapabilityBoundingSet=
+NoNewPrivileges=true
+
+# Failure reporting shouldn't create world-readable files
+UMask=0077
+
+# Clean up any IPC objects when this unit stops
+RemoveIPC=true
+
+# No access to hardware device files
+PrivateDevices=true
+ProtectClock=true


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] xfs_scrub_all: tighten up the security on the background systemd service
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 3/5] xfs_scrub: " Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, xfs_scrub_all has to run with enough privileges to find
mounted XFS filesystems and the devices associated with those mounts,
and to start xfs_scrub@<mountpoint> sub-services.
xfs_scrub_all escaping its service container or contaminating the rest
of the system by using systemd's sandboxing controls to prohibit as much
access as possible.

The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_all.service' in systemd 249.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub_all.service.in |   62 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)


diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
index ae4135033dd..c1c6012b47d 100644
--- a/scrub/xfs_scrub_all.service.in
+++ b/scrub/xfs_scrub_all.service.in
@@ -18,3 +18,65 @@ SyslogIdentifier=xfs_scrub_all
 # Create the service underneath the scrub background service slice so that we
 # can control resource usage.
 Slice=system-xfs_scrub.slice
+
+# Run scrub_all with minimal CPU and IO priority so that nothing will starve.
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+CPUAccounting=true
+Nice=19
+
+# No realtime scheduling
+RestrictRealtime=true
+
+# No special privileges, but we still have to run as root so that we can
+# contact the service manager to start the sub-units.
+CapabilityBoundingSet=
+NoNewPrivileges=true
+RestrictSUIDSGID=true
+
+# Make the entire filesystem readonly.  We don't want to hide anything because
+# we need to find all mounted XFS filesystems in the host.
+ProtectSystem=strict
+ProtectHome=read-only
+PrivateTmp=false
+
+# No network access except to the systemd control socket
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=AF_UNIX
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No writable memory pages
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# Media scan stamp file shouldn't be readable by regular users
+UMask=0077
+
+# lsblk ignores mountpoints if it can't find the device files, so we cannot
+# hide them
+#ProtectClock=true
+#PrivateDevices=true


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (57 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/4] xfs_scrub_all: failure reporting for the xfs_scrub_all job Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/2] xfs_scrub: automatic optimization by default Darrick J. Wong
                   ` (18 subsequent siblings)
  77 siblings, 4 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

Now that we've completed the online fsck functionality, there are a few
things that could be improved in the automatic service.  Specifically,
we would like to perform a more intensive metadata + media scan once per
month, to give the user confidence that the filesystem isn't losing data
silently.  To accomplish this, enhance xfs_scrub_all to be able to
trigger media scans.  Next, add a duplicate set of system services that
start the media scans automatically.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service
---
 debian/rules                           |    3 +
 include/builddefs.in                   |    3 +
 man/man8/Makefile                      |    7 ++
 man/man8/xfs_scrub_all.8.in            |   20 ++++++
 scrub/Makefile                         |   14 ++++-
 scrub/xfs_scrub_all.cron.in            |    2 -
 scrub/xfs_scrub_all.in                 |   99 +++++++++++++++++++++++++++++++-
 scrub/xfs_scrub_all.service.in         |    9 ++-
 scrub/xfs_scrub_all_fail.service.in    |   67 ++++++++++++++++++++++
 scrub/xfs_scrub_fail                   |   44 +++++++++++---
 scrub/xfs_scrub_fail@.service.in       |    2 -
 scrub/xfs_scrub_media@.service.in      |   90 +++++++++++++++++++++++++++++
 scrub/xfs_scrub_media_fail@.service.in |   76 +++++++++++++++++++++++++
 13 files changed, 413 insertions(+), 23 deletions(-)
 rename man/man8/{xfs_scrub_all.8 => xfs_scrub_all.8.in} (59%)
 create mode 100644 scrub/xfs_scrub_all_fail.service.in
 create mode 100644 scrub/xfs_scrub_media@.service.in
 create mode 100644 scrub/xfs_scrub_media_fail@.service.in


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs_scrub_all: support metadata+media scans of all filesystems
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/4] xfs_scrub_all: failure reporting for the xfs_scrub_all job Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 3/4] xfs_scrub_all: trigger automatic media scans once per month Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/4] xfs_scrub_all: enable periodic file data scrubs automatically Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add the necessary systemd services and control bits so that
xfs_scrub_all can kick off a metadata+media scan of a filesystem.
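
For illustration (a standalone sketch, not part of the patch, and
assuming the script's systemd_escape() helper wraps `systemd-escape
--path'), the unit naming implemented by the new scrub_unitname()
helper below works out to:

	import subprocess

	def unit_escape(path):
		'''Escape a mount point for use as a unit instance name.'''
		cmd = ['systemd-escape', '--path', path]
		return subprocess.run(cmd, capture_output = True, text = True,
				check = True).stdout.strip()

	def unit_for(mnt, media = False):
		esc = mnt if mnt == '*' else unit_escape(mnt)
		fmt = 'xfs_scrub_media@%s' if media else 'xfs_scrub@%s'
		return fmt % esc

	# unit_for('/storage')               -> 'xfs_scrub@storage'
	# unit_for('/storage', media = True) -> 'xfs_scrub_media@storage'

The media variant then reports failures through the new
xfs_scrub_media_fail@ template via its OnFailure= hook.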

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 man/man8/xfs_scrub_all.8               |    5 +-
 scrub/Makefile                         |    2 +
 scrub/xfs_scrub_all.in                 |   27 ++++++++--
 scrub/xfs_scrub_fail                   |   13 +++--
 scrub/xfs_scrub_fail@.service.in       |    2 -
 scrub/xfs_scrub_media@.service.in      |   90 ++++++++++++++++++++++++++++++++
 scrub/xfs_scrub_media_fail@.service.in |   76 +++++++++++++++++++++++++++
 7 files changed, 204 insertions(+), 11 deletions(-)
 create mode 100644 scrub/xfs_scrub_media@.service.in
 create mode 100644 scrub/xfs_scrub_media_fail@.service.in


diff --git a/man/man8/xfs_scrub_all.8 b/man/man8/xfs_scrub_all.8
index 74548802eda..86a9b3eced2 100644
--- a/man/man8/xfs_scrub_all.8
+++ b/man/man8/xfs_scrub_all.8
@@ -4,7 +4,7 @@ xfs_scrub_all \- scrub all mounted XFS filesystems
 .SH SYNOPSIS
 .B xfs_scrub_all
 [
-.B \-hV
+.B \-hxV
 ]
 .SH DESCRIPTION
 .B xfs_scrub_all
@@ -21,6 +21,9 @@ the same device simultaneously.
 .B \-h
 Display help.
 .TP
+.B \-x
+Read all file data extents to look for disk errors.
+.TP
 .B \-V
 Prints the version number and exits.
 .SH EXIT CODE
diff --git a/scrub/Makefile b/scrub/Makefile
index 1c36621b400..f65148e5469 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -18,6 +18,8 @@ INSTALL_SCRUB += install-systemd
 SYSTEMD_SERVICES=\
 	xfs_scrub@.service \
 	xfs_scrub_fail@.service \
+	xfs_scrub_media@.service \
+	xfs_scrub_media_fail@.service \
 	xfs_scrub_all.service \
 	xfs_scrub_all.timer \
 	system-xfs_scrub.slice
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 3e0c48acb39..eeb52b651b5 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -18,6 +18,7 @@ from io import TextIOWrapper
 
 retcode = 0
 terminate = False
+scrub_media = False
 
 def DEVNULL():
 	'''Return /dev/null in subprocess writable format.'''
@@ -111,6 +112,17 @@ def systemd_escape(path):
 	except:
 		return path
 
+def scrub_unitname(mnt):
+	'''Return the systemd service name.'''
+	global scrub_media
+
+	if mnt != '*':
+		mnt = systemd_escape(mnt)
+
+	if scrub_media:
+		return 'xfs_scrub_media@%s' % mnt
+	return 'xfs_scrub@%s' % mnt
+
 def systemctl_stop(unitname):
 	'''Stop a systemd unit.'''
 	cmd = ['systemctl', 'stop', unitname]
@@ -163,7 +175,7 @@ def systemctl_start(unitname, killfuncs):
 
 def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 	'''Run a scrub process.'''
-	global retcode, terminate
+	global retcode, terminate, scrub_media
 
 	print("Scrubbing %s..." % mnt)
 	sys.stdout.flush()
@@ -173,7 +185,7 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 			return
 
 		# Try it the systemd way
-		unitname = 'xfs_scrub@%s' % systemd_escape(mnt)
+		unitname = scrub_unitname(mnt)
 		ret = systemctl_start(unitname, killfuncs)
 		if ret == 0 or ret == 1:
 			print("Scrubbing %s done, (err=%d)" % (mnt, ret))
@@ -187,6 +199,8 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 		# Invoke xfs_scrub manually
 		cmd = ['@sbindir@/xfs_scrub']
 		cmd += '@scrub_args@'.split()
+		if scrub_media:
+			cmd += ['-x']
 		cmd += [mnt]
 		ret = run_killable(cmd, None, killfuncs, \
 				lambda proc: proc.terminate())
@@ -213,26 +227,31 @@ def main():
 		a = (mnt, cond, running_devs, devs, killfuncs)
 		thr = threading.Thread(target = run_scrub, args = a)
 		thr.start()
-	global retcode, terminate
+	global retcode, terminate, scrub_media
 
 	parser = argparse.ArgumentParser( \
 			description = "Scrub all mounted XFS filesystems.")
 	parser.add_argument("-V", help = "Report version and exit.", \
 			action = "store_true")
+	parser.add_argument("-x", help = "Scrub file data after filesystem metadata.", \
+			action = "store_true")
 	args = parser.parse_args()
 
 	if args.V:
 		print("xfs_scrub_all version @pkg_version@")
 		sys.exit(0)
 
+	scrub_media = args.x
+
 	fs = find_mounts()
 
 	# Tail the journal if we ourselves aren't a service...
 	journalthread = None
 	if 'SERVICE_MODE' not in os.environ:
 		try:
+			unitname = scrub_unitname('*')
 			cmd=['journalctl', '--no-pager', '-q', '-S', 'now', \
-					'-f', '-u', 'xfs_scrub@*', '-o', \
+					'-f', '-u', unitname, '-o', \
 					'cat']
 			journalthread = subprocess.Popen(cmd)
 		except:
diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
index fbe30cbc4c6..58c50abe963 100755
--- a/scrub/xfs_scrub_fail
+++ b/scrub/xfs_scrub_fail
@@ -9,8 +9,11 @@
 
 recipient="$1"
 test -z "${recipient}" && exit 0
-mntpoint="$2"
+service="$2"
+test -z "${service}" && exit 0
+mntpoint="$3"
 test -z "${mntpoint}" && exit 0
+
 hostname="$(hostname -f 2>/dev/null)"
 test -z "${hostname}" && hostname="${HOSTNAME}"
 
@@ -48,12 +51,12 @@ mntpoint_esc="$(escape_path "${mntpoint}")"
 
 (cat << ENDL
 To: $1
-From: <xfs_scrub@${hostname}>
-Subject: xfs_scrub failure on ${mntpoint}
+From: <${service}@${hostname}>
+Subject: ${service} failure on ${mntpoint}
 
-So sorry, the automatic xfs_scrub of ${mntpoint} on ${hostname} failed.
+So sorry, the automatic ${service} of ${mntpoint} on ${hostname} failed.
 
 A log of what happened follows:
 ENDL
-systemctl status --full --lines 4294967295 "xfs_scrub@${mntpoint_esc}") | "${mailer}" -t -i
+systemctl status --full --lines 4294967295 "${service}@${mntpoint_esc}") | "${mailer}" -t -i
 exit "${PIPESTATUS[1]}"
diff --git a/scrub/xfs_scrub_fail@.service.in b/scrub/xfs_scrub_fail@.service.in
index 2c36c47ab02..cba194bad2d 100644
--- a/scrub/xfs_scrub_fail@.service.in
+++ b/scrub/xfs_scrub_fail@.service.in
@@ -10,7 +10,7 @@ Documentation=man:xfs_scrub(8)
 [Service]
 Type=oneshot
 Environment=EMAIL_ADDR=root
-ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" %I
+ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" xfs_scrub %I
 User=mail
 Group=mail
 SupplementaryGroups=systemd-journal
diff --git a/scrub/xfs_scrub_media@.service.in b/scrub/xfs_scrub_media@.service.in
new file mode 100644
index 00000000000..d2b991856df
--- /dev/null
+++ b/scrub/xfs_scrub_media@.service.in
@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2022 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=Online XFS Metadata and Media Check for %I
+OnFailure=xfs_scrub_media_fail@%i.service
+Documentation=man:xfs_scrub(8)
+
+[Service]
+Type=oneshot
+Environment=SERVICE_MODE=1
+Environment=SERVICE_MOUNTPOINT=/tmp/scrub
+ExecStart=@sbindir@/xfs_scrub @scrub_args@ -x %I
+SyslogIdentifier=%N
+
+# Run scrub with minimal CPU and IO priority so that nothing else will starve.
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+CPUAccounting=true
+Nice=19
+
+# Create the service underneath the scrub background service slice so that we
+# can control resource usage.
+Slice=system-xfs_scrub.slice
+
+# No realtime CPU scheduling
+RestrictRealtime=true
+
+# Dynamically create a user that isn't root
+DynamicUser=true
+
+# Make the entire filesystem readonly and /home inaccessible, then bind mount
+# the filesystem we're supposed to be checking into our private /tmp dir.
+# 'norbind' means that we don't bind anything under that original mount.
+ProtectSystem=strict
+ProtectHome=yes
+PrivateTmp=true
+BindPaths=/%I:/tmp/scrub:norbind
+
+# Don't let scrub complain about paths in /etc/projects that have been hidden
+# by our sandboxing.  scrub doesn't care about project ids anyway.
+InaccessiblePaths=-/etc/projects
+
+# No network access
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=none
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No writable memory pages
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_scrub needs these privileges to run, and no others
+CapabilityBoundingSet=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
+AmbientCapabilities=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
+NoNewPrivileges=true
+
+# xfs_scrub doesn't create files
+UMask=7777
+
+# No access to hardware /dev files except for block devices
+ProtectClock=true
+DevicePolicy=closed
+DeviceAllow=block-*
diff --git a/scrub/xfs_scrub_media_fail@.service.in b/scrub/xfs_scrub_media_fail@.service.in
new file mode 100644
index 00000000000..e6c45e72f20
--- /dev/null
+++ b/scrub/xfs_scrub_media_fail@.service.in
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2022 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=Online XFS Metadata and Media Check Failure Reporting for %I
+Documentation=man:xfs_scrub(8)
+
+[Service]
+Type=oneshot
+Environment=EMAIL_ADDR=root
+ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" xfs_scrub_media %I
+User=mail
+Group=mail
+SupplementaryGroups=systemd-journal
+
+# Create the service underneath the scrub background service slice so that we
+# can control resource usage.
+Slice=system-xfs_scrub.slice
+
+# No realtime scheduling
+RestrictRealtime=true
+
+# Make the entire filesystem readonly and /home inaccessible, then bind mount
+# the filesystem we're supposed to be checking into our private /tmp dir.
+ProtectSystem=full
+ProtectHome=yes
+PrivateTmp=true
+RestrictSUIDSGID=true
+
+# Emailing reports requires network access, but not the ability to change the
+# hostname.
+ProtectHostname=true
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Can't hide /proc because journalctl needs it to find various pieces of log
+# information
+#ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No writable memory pages
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_scrub needs these privileges to run, and no others
+CapabilityBoundingSet=
+NoNewPrivileges=true
+
+# Failure reporting shouldn't create world-readable files
+UMask=0077
+
+# Clean up any IPC objects when this unit stops
+RemoveIPC=true
+
+# No access to hardware device files
+PrivateDevices=true
+ProtectClock=true


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs_scrub_all: enable periodic file data scrubs automatically
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:18   ` [PATCH 3/4] xfs_scrub_all: trigger automatic media scans once per month Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Enhance xfs_scrub_all with the ability to initiate a file data scrub
periodically.  The user must specify the period, and they may optionally
specify the path to a file that will record the last time the file data
was scrubbed.
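
A condensed sketch of the decision this patch adds (names here are
illustrative; the real logic lives in enable_automatic_media_scan()
below): force the scan if -x was given, otherwise run it only when the
stamp file is missing or older than the requested interval:

	import time
	from datetime import timedelta
	from pathlib import Path

	def media_scan_due(stamp, interval, forced = False):
		'''Decide whether this run should pass -x to xfs_scrub.'''
		if forced:
			return True
		p = Path(stamp)
		if not p.exists():
			return True
		return p.stat().st_mtime + interval.total_seconds() < time.time()

	# e.g. (assuming the default install paths)
	# media_scan_due('/var/lib/xfsprogs/xfs_scrub_all_media.stamp',
	#                timedelta(days = 30))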

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 debian/rules                   |    3 +-
 include/builddefs.in           |    3 ++
 man/man8/Makefile              |    7 +++-
 man/man8/xfs_scrub_all.8.in    |   15 ++++++++
 scrub/Makefile                 |    3 ++
 scrub/xfs_scrub_all.in         |   74 +++++++++++++++++++++++++++++++++++++++-
 scrub/xfs_scrub_all.service.in |    6 ++-
 7 files changed, 106 insertions(+), 5 deletions(-)
 rename man/man8/{xfs_scrub_all.8 => xfs_scrub_all.8.in} (63%)


diff --git a/debian/rules b/debian/rules
index 57baad625c5..97fbbbfa1ab 100755
--- a/debian/rules
+++ b/debian/rules
@@ -34,7 +34,8 @@ configure_options = \
 	--disable-ubsan \
 	--disable-addrsan \
 	--disable-threadsan \
-	--enable-lto
+	--enable-lto \
+	--localstatedir=/var
 
 options = export DEBUG=-DNDEBUG DISTRIBUTION=debian \
 	  INSTALL_USER=root INSTALL_GROUP=root \
diff --git a/include/builddefs.in b/include/builddefs.in
index c0de6000c2a..50ebb9f75d8 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -59,6 +59,9 @@ PKG_DOC_DIR	= @datadir@/doc/@pkg_name@
 PKG_LOCALE_DIR	= @datadir@/locale
 PKG_DATA_DIR	= @datadir@/@pkg_name@
 MKFS_CFG_DIR	= @datadir@/@pkg_name@/mkfs
+PKG_STATE_DIR	= @localstatedir@/lib/@pkg_name@
+
+XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_STAMP=$(PKG_STATE_DIR)/xfs_scrub_all_media.stamp
 
 CC		= @cc@
 BUILD_CC	= @BUILD_CC@
diff --git a/man/man8/Makefile b/man/man8/Makefile
index 272e45aebc2..5be76ab727a 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -11,11 +11,12 @@ ifneq ("$(ENABLE_SCRUB)","yes")
   MAN_PAGES = $(filter-out xfs_scrub%,$(shell echo *.$(MAN_SECTION)))
 else
   MAN_PAGES = $(shell echo *.$(MAN_SECTION))
+  MAN_PAGES += xfs_scrub_all.8
 endif
 MAN_PAGES	+= mkfs.xfs.8
 MAN_DEST	= $(PKG_MAN_DIR)/man$(MAN_SECTION)
 LSRCFILES	= $(MAN_PAGES)
-DIRT		= mkfs.xfs.8
+DIRT		= mkfs.xfs.8 xfs_scrub_all.8
 
 default : $(MAN_PAGES)
 
@@ -29,4 +30,8 @@ mkfs.xfs.8: mkfs.xfs.8.in
 	@echo "    [SED]    $@"
 	$(Q)$(SED) -e 's|@mkfs_cfg_dir@|$(MKFS_CFG_DIR)|g' < $^ > $@
 
+xfs_scrub_all.8: xfs_scrub_all.8.in
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e 's|@stampfile@|$(XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_STAMP)|g' < $^ > $@
+
 install-dev :
diff --git a/man/man8/xfs_scrub_all.8 b/man/man8/xfs_scrub_all.8.in
similarity index 63%
rename from man/man8/xfs_scrub_all.8
rename to man/man8/xfs_scrub_all.8.in
index 86a9b3eced2..0aa87e23716 100644
--- a/man/man8/xfs_scrub_all.8
+++ b/man/man8/xfs_scrub_all.8.in
@@ -18,6 +18,21 @@ operations can be run in parallel so long as no two scrubbers access
 the same device simultaneously.
 .SH OPTIONS
 .TP
+.B \--auto-media-scan-interval
+Automatically enable the file data scan (i.e. the
+.B -x
+flag) if it has not been run in the specified interval.
+The interval must be a floating point number with an optional unit suffix.
+Supported unit suffixes are
+.IR y ", " q ", " mo ", " w ", " d ", " h ", " m ", and " s
+for years, 90-day quarters, 30-day months, weeks, days, hours, minutes, and
+seconds, respectively.
+If no units are specified, the default is seconds.
+.TP
+.B \--auto-media-scan-stamp
+Path to a file that will record the last time the media scan was run.
+Defaults to @stampfile@.
+.TP
 .B \-h
 Display help.
 .TP
diff --git a/scrub/Makefile b/scrub/Makefile
index f65148e5469..f773995dcd7 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -118,6 +118,7 @@ xfs_scrub_all: xfs_scrub_all.in $(builddefs)
 	@echo "    [SED]    $@"
 	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
 		   -e "s|@pkg_version@|$(PKG_VERSION)|g" \
+		   -e "s|@stampfile@|$(XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_STAMP)|g" \
 		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" < $< > $@
 	$(Q)chmod a+x $@
 
@@ -132,6 +133,7 @@ install: $(INSTALL_SCRUB)
 	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
 		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" \
 		   -e "s|@pkg_lib_dir@|$(PKG_LIB_SCRIPT_DIR)|g" \
+		   -e "s|@pkg_state_dir@|$(PKG_STATE_DIR)|g" \
 		   -e "s|@pkg_name@|$(PKG_NAME)|g" \
 		   < $< > $@
 
@@ -153,6 +155,7 @@ install-scrub: default
 	$(INSTALL) -m 755 -d $(PKG_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_SBIN_DIR)
 	$(INSTALL) -m 755 $(XFS_SCRUB_ALL_PROG) $(PKG_SBIN_DIR)
+	$(INSTALL) -m 755 -d $(PKG_STATE_DIR)
 
 install-dev:
 
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index eeb52b651b5..db307f78ebb 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -15,6 +15,10 @@ import sys
 import os
 import argparse
 from io import TextIOWrapper
+from pathlib import Path
+from datetime import timedelta
+from datetime import datetime
+from datetime import timezone
 
 retcode = 0
 terminate = False
@@ -221,6 +225,63 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 		cond.notify()
 		cond.release()
 
+def scan_interval(string):
+	'''Convert a textual scan interval argument into a time delta.'''
+
+	if string.endswith('y'):
+		year = timedelta(seconds = 31556952)
+		return year * float(string[:-1])
+	if string.endswith('q'):
+		return timedelta(days = 90 * float(string[:-1]))
+	if string.endswith('mo'):
+		return timedelta(days = 30 * float(string[:-2]))
+	if string.endswith('w'):
+		return timedelta(weeks = float(string[:-1]))
+	if string.endswith('d'):
+		return timedelta(days = float(string[:-1]))
+	if string.endswith('h'):
+		return timedelta(hours = float(string[:-1]))
+	if string.endswith('m'):
+		return timedelta(minutes = float(string[:-1]))
+	if string.endswith('s'):
+		return timedelta(seconds = float(string[:-1]))
+	return timedelta(seconds = int(string))
+
+def utcnow():
+	'''Create a representation of the time right now, in UTC.'''
+
+	dt = datetime.utcnow()
+	return dt.replace(tzinfo = timezone.utc)
+
+def enable_automatic_media_scan(args):
+	'''Decide if we enable media scanning automatically.'''
+	already_enabled = args.x
+
+	try:
+		interval = scan_interval(args.auto_media_scan_interval)
+	except Exception as e:
+		raise Exception(f"{args.auto_media_scan_interval}: Invalid media scan interval.")
+
+	p = Path(args.auto_media_scan_stamp)
+	if already_enabled:
+		res = True
+	else:
+		try:
+			last_run = p.stat().st_mtime
+			now = utcnow().timestamp()
+			res = last_run + interval.total_seconds() < now
+		except FileNotFoundError:
+			res = True
+
+	if res:
+		# Truncate the stamp file to update its mtime
+		with p.open('w') as f:
+			pass
+		if not already_enabled:
+			print('Automatically enabling file data scrub.')
+
+	return res
+
 def main():
 	'''Find mounts, schedule scrub runs.'''
 	def thr(mnt, devs):
@@ -235,13 +296,24 @@ def main():
 			action = "store_true")
 	parser.add_argument("-x", help = "Scrub file data after filesystem metadata.", \
 			action = "store_true")
+	parser.add_argument("--auto-media-scan-interval", help = "Automatically scrub file data at this interval.", \
+			default = None)
+	parser.add_argument("--auto-media-scan-stamp", help = "Stamp file for automatic file data scrub.", \
+			default = '@stampfile@')
 	args = parser.parse_args()
 
 	if args.V:
 		print("xfs_scrub_all version @pkg_version@")
 		sys.exit(0)
 
-	scrub_media = args.x
+	if args.auto_media_scan_interval is not None:
+		try:
+			scrub_media = enable_automatic_media_scan(args)
+		except Exception as e:
+			print(e)
+			sys.exit(16)
+	else:
+		scrub_media = args.x
 
 	fs = find_mounts()
 
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
index c1c6012b47d..4938404ee95 100644
--- a/scrub/xfs_scrub_all.service.in
+++ b/scrub/xfs_scrub_all.service.in
@@ -34,11 +34,13 @@ CapabilityBoundingSet=
 NoNewPrivileges=true
 RestrictSUIDSGID=true
 
-# Make the entire filesystem readonly.  We don't want to hide anything because
-# we need to find all mounted XFS filesystems in the host.
+# Make the entire filesystem readonly except for the media scan stamp file
+# directory.  We don't want to hide anything because we need to find all
+# mounted XFS filesystems in the host.
 ProtectSystem=strict
 ProtectHome=read-only
 PrivateTmp=false
+BindPaths=@pkg_state_dir@
 
 # No network access except to the systemd control socket
 PrivateNetwork=true


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs_scrub_all: trigger automatic media scans once per month
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 4/4] xfs_scrub_all: failure reporting for the xfs_scrub_all job Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/4] xfs_scrub_all: support metadata+media scans of all filesystems Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/4] xfs_scrub_all: enable periodic file data scrubs automatically Darrick J. Wong
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach the xfs_scrub_all background service to trigger an automatic scan
of all file data once per month.
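
To make the plumbing concrete (a sketch; scan_interval() here is a
trimmed copy of the parser added in the previous patch): the Makefile
pins the interval to '1mo', the service and cron entries pass it via
--auto-media-scan-interval, and xfs_scrub_all turns that into a 30-day
delta before checking the stamp file:

	from datetime import timedelta

	def scan_interval(s):
		'''Trimmed parser: only the month and bare-seconds cases.'''
		if s.endswith('mo'):
			return timedelta(days = 30 * float(s[:-2]))
		return timedelta(seconds = int(s))

	assert scan_interval('1mo') == timedelta(days = 30)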

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/Makefile                 |    8 +++++++-
 scrub/xfs_scrub_all.cron.in    |    2 +-
 scrub/xfs_scrub_all.service.in |    2 +-
 3 files changed, 9 insertions(+), 3 deletions(-)


diff --git a/scrub/Makefile b/scrub/Makefile
index f773995dcd7..7434ac0ce4e 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -110,6 +110,9 @@ ifeq ($(HAVE_HDIO_GETGEO),yes)
 LCFLAGS += -DHAVE_HDIO_GETGEO
 endif
 
+# Automatically trigger a media scan once per month
+XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_INTERVAL=1mo
+
 LDIRT = $(XFS_SCRUB_ALL_PROG) *.service *.cron
 
 default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(OPTIONAL_TARGETS)
@@ -135,11 +138,14 @@ install: $(INSTALL_SCRUB)
 		   -e "s|@pkg_lib_dir@|$(PKG_LIB_SCRIPT_DIR)|g" \
 		   -e "s|@pkg_state_dir@|$(PKG_STATE_DIR)|g" \
 		   -e "s|@pkg_name@|$(PKG_NAME)|g" \
+		   -e "s|@media_scan_interval@|$(XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_INTERVAL)|g" \
 		   < $< > $@
 
 %.cron: %.cron.in $(builddefs)
 	@echo "    [SED]    $@"
-	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" < $< > $@
+	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
+		   -e "s|@media_scan_interval@|$(XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_INTERVAL)|g" \
+		   < $< > $@
 
 install-systemd: default $(SYSTEMD_SERVICES)
 	$(INSTALL) -m 755 -d $(SYSTEMD_SYSTEM_UNIT_DIR)
diff --git a/scrub/xfs_scrub_all.cron.in b/scrub/xfs_scrub_all.cron.in
index 0ef97cc0ca8..d55d2ad2d89 100644
--- a/scrub/xfs_scrub_all.cron.in
+++ b/scrub/xfs_scrub_all.cron.in
@@ -3,4 +3,4 @@
 # Copyright (C) 2018 Oracle.  All Rights Reserved.
 # Author: Darrick J. Wong <djwong@kernel.org>
 #
-10 3 * * 0 root test -e /run/systemd/system || @sbindir@/xfs_scrub_all
+10 3 * * 0 root test -e /run/systemd/system || @sbindir@/xfs_scrub_all --auto-media-scan-interval @media_scan_interval@
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
index 4938404ee95..e831ad58eb8 100644
--- a/scrub/xfs_scrub_all.service.in
+++ b/scrub/xfs_scrub_all.service.in
@@ -12,7 +12,7 @@ After=paths.target multi-user.target network.target network-online.target system
 [Service]
 Type=oneshot
 Environment=SERVICE_MODE=1
-ExecStart=@sbindir@/xfs_scrub_all
+ExecStart=@sbindir@/xfs_scrub_all --auto-media-scan-interval @media_scan_interval@
 SyslogIdentifier=xfs_scrub_all
 
 # Create the service underneath the scrub background service slice so that we


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs_scrub_all: failure reporting for the xfs_scrub_all job
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/4] xfs_scrub_all: support metadata+media scans of all filesystems Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a failure reporting service for when xfs_scrub_all fails.  This
shouldn't happen often, but let's report it anyway.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/Makefile                      |    1 +
 scrub/xfs_scrub_all.service.in      |    1 +
 scrub/xfs_scrub_all_fail.service.in |   67 +++++++++++++++++++++++++++++++++++
 scrub/xfs_scrub_fail                |   33 ++++++++++++++---
 4 files changed, 96 insertions(+), 6 deletions(-)
 create mode 100644 scrub/xfs_scrub_all_fail.service.in


diff --git a/scrub/Makefile b/scrub/Makefile
index 7434ac0ce4e..f2d0c1aa0bf 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -21,6 +21,7 @@ SYSTEMD_SERVICES=\
 	xfs_scrub_media@.service \
 	xfs_scrub_media_fail@.service \
 	xfs_scrub_all.service \
+	xfs_scrub_all_fail.service \
 	xfs_scrub_all.timer \
 	system-xfs_scrub.slice
 OPTIONAL_TARGETS += $(SYSTEMD_SERVICES)
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
index e831ad58eb8..c11d6a96037 100644
--- a/scrub/xfs_scrub_all.service.in
+++ b/scrub/xfs_scrub_all.service.in
@@ -5,6 +5,7 @@
 
 [Unit]
 Description=Online XFS Metadata Check for All Filesystems
+OnFailure=xfs_scrub_all_fail.service
 ConditionACPower=true
 Documentation=man:xfs_scrub_all(8)
 After=paths.target multi-user.target network.target network-online.target systemd-networkd.service NetworkManager.service connman.service
diff --git a/scrub/xfs_scrub_all_fail.service.in b/scrub/xfs_scrub_all_fail.service.in
new file mode 100644
index 00000000000..002d05b67de
--- /dev/null
+++ b/scrub/xfs_scrub_all_fail.service.in
@@ -0,0 +1,67 @@
+[Unit]
+Description=Online XFS Metadata Check for All Filesystems Failure Reporting
+Documentation=man:xfs_scrub_all(8)
+
+[Service]
+Type=oneshot
+Environment=EMAIL_ADDR=root
+ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" xfs_scrub_all
+User=mail
+Group=mail
+SupplementaryGroups=systemd-journal
+
+# No realtime scheduling
+RestrictRealtime=true
+
+# Make the entire filesystem readonly and /home inaccessible, then bind mount
+# the filesystem we're supposed to be checking into our private /tmp dir.
+ProtectSystem=full
+ProtectHome=yes
+PrivateTmp=true
+RestrictSUIDSGID=true
+
+# Emailing reports requires network access, but not the ability to change the
+# hostname.
+ProtectHostname=true
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Can't hide /proc because journalctl needs it to find various pieces of log
+# information
+#ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No writable memory pages
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_scrub needs these privileges to run, and no others
+CapabilityBoundingSet=
+NoNewPrivileges=true
+
+# Failure reporting shouldn't create world-readable files
+UMask=0077
+
+# Clean up any IPC objects when this unit stops
+RemoveIPC=true
+
+# No access to hardware device files
+PrivateDevices=true
+ProtectClock=true
diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
index 58c50abe963..0739a00dec9 100755
--- a/scrub/xfs_scrub_fail
+++ b/scrub/xfs_scrub_fail
@@ -5,14 +5,13 @@
 # Copyright (C) 2018 Oracle.  All Rights Reserved.
 # Author: Darrick J. Wong <djwong@kernel.org>
 
-# Email logs of failed xfs_scrub unit runs
+# Email logs of failed xfs_scrub and xfs_scrub_all unit runs
 
 recipient="$1"
 test -z "${recipient}" && exit 0
 service="$2"
 test -z "${service}" && exit 0
 mntpoint="$3"
-test -z "${mntpoint}" && exit 0
 
 hostname="$(hostname -f 2>/dev/null)"
 test -z "${hostname}" && hostname="${HOSTNAME}"
@@ -47,10 +46,12 @@ escape_path() {
 	echo "-$(systemd-escape --path "${mntpoint}")"
 }
 
-mntpoint_esc="$(escape_path "${mntpoint}")"
+fail_mail_mntpoint() {
+	local mntpoint_esc
 
-(cat << ENDL
-To: $1
+	mntpoint_esc="$(escape_path "${mntpoint}")"
+	cat << ENDL
+To: ${recipient}
 From: <${service}@${hostname}>
 Subject: ${service} failure on ${mntpoint}
 
@@ -58,5 +59,25 @@ So sorry, the automatic ${service} of ${mntpoint} on ${hostname} failed.
 
 A log of what happened follows:
 ENDL
-systemctl status --full --lines 4294967295 "${service}@${mntpoint_esc}") | "${mailer}" -t -i
+	systemctl status --full --lines 4294967295 "${service}@${mntpoint_esc}"
+}
+
+fail_mail() {
+	cat << ENDL
+To: ${recipient}
+From: <${service}@${hostname}>
+Subject: ${service} failure
+
+So sorry, the automatic ${service} on ${hostname} failed.
+
+A log of what happened follows:
+ENDL
+	systemctl status --full --lines 4294967295 "${service}"
+}
+
+if [ -n "${mntpoint}" ]; then
+	fail_mail_mntpoint | "${mailer}" -t -i
+else
+	fail_mail | "${mailer}" -t -i
+fi
 exit "${PIPESTATUS[1]}"


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] xfs_scrub: automatic optimization by default
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (58 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service Darrick J. Wong
@ 2022-12-30 22:18 ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/2] xfs_scrub: automatic downgrades to dry-run mode in service mode Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/2] xfs_scrub: add an optimization-only mode Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
                   ` (17 subsequent siblings)
  77 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This final patchset in the online fsck series enables the background
service to optimize filesystems by default.  This is the first step
towards enabling repairs by default.
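
Concretely, the default arguments baked into the scrub services switch
from dry-run to optimize-only, so the background service now ends up
running roughly the following (a sketch; the substitution of
XFS_SCRUB_ARGS into the unit templates happens elsewhere in the Makefile):

  $ xfs_scrub -b -p /some/mountpoint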

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-optimize-by-default

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=scrub-optimize-by-default
---
 man/man8/xfs_scrub.8 |    6 +++++-
 scrub/Makefile       |    2 +-
 scrub/phase1.c       |   13 +++++++++++++
 scrub/phase4.c       |    6 ++++++
 scrub/repair.c       |   37 ++++++++++++++++++++++++++++++++++++-
 scrub/repair.h       |    2 ++
 scrub/scrub.c        |    4 ++--
 scrub/xfs_scrub.c    |   21 +++++++++++++++++++--
 scrub/xfs_scrub.h    |    1 +
 9 files changed, 85 insertions(+), 7 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] xfs_scrub: automatic downgrades to dry-run mode in service mode
  2022-12-30 22:18 ` [PATCHSET v24.0 0/2] xfs_scrub: automatic optimization by default Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 2/2] xfs_scrub: add an optimization-only mode Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When service mode is enabled, xfs_scrub runs within the context of a
systemd service.  The service description language has no particularly
good construct for adding a '-n' argument only when the filesystem is
readonly, which means that xfs_scrub is handed a path and must switch to
dry-run mode on its own if the fs is mounted readonly or the kernel
doesn't support repairs.
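
As a sketch of the intended behavior (not captured output, and assuming
SERVICE_MODE is how the units flag service mode; that wiring is not part
of this patch), a service-mode run against a readonly filesystem now
proceeds in dry-run mode instead of bailing out:

  $ SERVICE_MODE=1 xfs_scrub -b /readonly/mount
  /readonly/mount: Filesystem cannot be repaired in service mode, downgrading to dry-run mode.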

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c |   13 +++++++++++++
 scrub/repair.c |   33 +++++++++++++++++++++++++++++++++
 scrub/repair.h |    2 ++
 3 files changed, 48 insertions(+)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index 80fd0c6e27c..7b9caa4258c 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -216,6 +216,19 @@ _("Kernel metadata scrubbing facility is not available."));
 		return ECANCELED;
 	}
 
+	/*
+	 * Normally, callers are required to pass -n if the provided path is a
+	 * readonly filesystem or the kernel wasn't built with online repair
+	 * enabled.  However, systemd services are not scripts and cannot
+	 * determine either of these conditions programmatically.  Change the
+	 * behavior to dry-run mode if either condition is detected.
+	 */
+	if (repair_want_service_downgrade(ctx)) {
+		str_info(ctx, ctx->mntpoint,
+_("Filesystem cannot be repaired in service mode, downgrading to dry-run mode."));
+		ctx->mode = SCRUB_MODE_DRY_RUN;
+	}
+
 	/* Do we need kernel-assisted metadata repair? */
 	if (ctx->mode != SCRUB_MODE_DRY_RUN && !can_repair(ctx)) {
 		str_error(ctx, ctx->mntpoint,
diff --git a/scrub/repair.c b/scrub/repair.c
index cf17bbd8d0e..127055f2f61 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -44,6 +44,39 @@ static const unsigned int repair_deps[XFS_SCRUB_TYPE_NR] = {
 };
 #undef DEP
 
+/*
+ * Decide if we want an automatic downgrade to dry-run mode.  This is only
+ * for service mode, where we are fed a path and have to figure out if the fs
+ * is repairable or not.
+ */
+bool
+repair_want_service_downgrade(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_metadata	meta = {
+		.sm_type		= XFS_SCRUB_TYPE_PROBE,
+		.sm_flags		= XFS_SCRUB_IFLAG_REPAIR,
+	};
+	int				error;
+
+	if (ctx->mode == SCRUB_MODE_DRY_RUN)
+		return false;
+	if (!is_service)
+		return false;
+	if (debug_tweak_on("XFS_SCRUB_NO_KERNEL"))
+		return false;
+
+	error = -xfrog_scrub_metadata(&ctx->mnt, &meta);
+	switch (error) {
+	case EROFS:
+	case ENOTRECOVERABLE:
+	case EOPNOTSUPP:
+		return true;
+	}
+
+	return false;
+}
+
 /* Repair some metadata. */
 static int
 xfs_repair_metadata(
diff --git a/scrub/repair.h b/scrub/repair.h
index 639ab2370e0..c4b9b5799e2 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -102,4 +102,6 @@ repair_item_completely(
 	return repair_item(ctx, sri, XRM_FINAL_WARNING | XRM_NOPROGRESS);
 }
 
+bool repair_want_service_downgrade(struct scrub_ctx *ctx);
+
 #endif /* XFS_SCRUB_REPAIR_H_ */


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] xfs_scrub: add an optimization-only mode
  2022-12-30 22:18 ` [PATCHSET v24.0 0/2] xfs_scrub: automatic optimization by default Darrick J. Wong
  2022-12-30 22:18   ` [PATCH 1/2] xfs_scrub: automatic downgrades to dry-run mode in service mode Darrick J. Wong
@ 2022-12-30 22:18   ` Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:18 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a "preen" mode in which we only optimize filesystem metadata.  If
repairs turn out to be required, report them and exit early without
fixing anything.
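
Usage sketch (the mutual exclusion with -n comes from the getopt changes
below):

  # optimize-only run; corruptions are reported but nothing gets repaired
  $ xfs_scrub -p /mount/point

  # -n and -p cannot be combined
  $ xfs_scrub -n -p /mount/point
  Cannot use -p with -n.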

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 man/man8/xfs_scrub.8 |    6 +++++-
 scrub/Makefile       |    2 +-
 scrub/phase4.c       |    6 ++++++
 scrub/repair.c       |    4 +++-
 scrub/scrub.c        |    4 ++--
 scrub/xfs_scrub.c    |   21 +++++++++++++++++++--
 scrub/xfs_scrub.h    |    1 +
 7 files changed, 37 insertions(+), 7 deletions(-)


diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
index e881ae76acb..cbf12f53b91 100644
--- a/man/man8/xfs_scrub.8
+++ b/man/man8/xfs_scrub.8
@@ -4,7 +4,7 @@ xfs_scrub \- check and repair the contents of a mounted XFS filesystem
 .SH SYNOPSIS
 .B xfs_scrub
 [
-.B \-abCemnTvx
+.B \-abCemnpTvx
 ]
 .I mount-point
 .br
@@ -86,6 +86,10 @@ Search this file for mounted filesystems instead of /etc/mtab.
 Only check filesystem metadata.
 Do not repair or optimize anything.
 .TP
+.B \-p
+Only optimize filesystem metadata.
+If repairs are required, report them and exit.
+.TP
 .BI \-T
 Print timing and memory usage information for each phase.
 .TP
diff --git a/scrub/Makefile b/scrub/Makefile
index f2d0c1aa0bf..04f2494233d 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -12,7 +12,7 @@ ifeq ($(SCRUB_PREREQS),yesyesyes)
 LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 XFS_SCRUB_ALL_PROG = xfs_scrub_all
-XFS_SCRUB_ARGS = -b -n
+XFS_SCRUB_ARGS = -b -p
 ifeq ($(HAVE_SYSTEMD),yes)
 INSTALL_SCRUB += install-systemd
 SYSTEMD_SERVICES=\
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 260f7bb7ac1..74fcc55b379 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -236,6 +236,12 @@ phase4_func(
 	    action_list_empty(ctx->file_repair_list))
 		return 0;
 
+	if (ctx->mode == SCRUB_MODE_PREEN && ctx->corruptions_found) {
+		str_info(ctx, ctx->mntpoint,
+ _("Corruptions found; will not optimize.  Re-run without -p.\n"));
+		return 0;
+	}
+
 	/*
 	 * Check the resource usage counters early.  Normally we do this during
 	 * phase 7, but some of the cross-referencing requires fairly accurate
diff --git a/scrub/repair.c b/scrub/repair.c
index 127055f2f61..6629125578c 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -642,7 +642,9 @@ repair_item_class(
 	unsigned int			scrub_type;
 	int				error = 0;
 
-	if (ctx->mode < SCRUB_MODE_REPAIR)
+	if (ctx->mode == SCRUB_MODE_DRY_RUN)
+		return 0;
+	if (ctx->mode == SCRUB_MODE_PREEN && !(repair_mask & SCRUB_ITEM_PREEN))
 		return 0;
 
 	/*
diff --git a/scrub/scrub.c b/scrub/scrub.c
index c245e46afa7..19c35bfd907 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -174,7 +174,7 @@ _("Filesystem is shut down, aborting."));
 	 * repair if desired, otherwise complain.
 	 */
 	if (is_corrupt(&meta) || xref_disagrees(&meta)) {
-		if (ctx->mode < SCRUB_MODE_REPAIR) {
+		if (ctx->mode != SCRUB_MODE_REPAIR) {
 			/* Dry-run mode, so log an error and forget it. */
 			str_corrupt(ctx, descr_render(&dsc),
 _("Repairs are required."));
@@ -192,7 +192,7 @@ _("Repairs are required."));
 	 * otherwise complain.
 	 */
 	if (is_unoptimized(&meta)) {
-		if (ctx->mode != SCRUB_MODE_REPAIR) {
+		if (ctx->mode == SCRUB_MODE_DRY_RUN) {
 			/* Dry-run mode, so log an error and forget it. */
 			if (group != XFROG_SCRUB_GROUP_INODE) {
 				/* AG or FS metadata, always warn. */
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index 23d8fec5d9b..8104059ebb4 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -181,6 +181,7 @@ usage(void)
 	fprintf(stderr, _("  -k           Do not FITRIM the free space.\n"));
 	fprintf(stderr, _("  -m path      Path to /etc/mtab.\n"));
 	fprintf(stderr, _("  -n           Dry run.  Do not modify anything.\n"));
+	fprintf(stderr, _("  -p           Only optimize, do not fix corruptions.\n"));
 	fprintf(stderr, _("  -T           Display timing/usage information.\n"));
 	fprintf(stderr, _("  -v           Verbose output.\n"));
 	fprintf(stderr, _("  -V           Print version.\n"));
@@ -461,6 +462,11 @@ run_scrub_phases(
 			sp->descr = _("Repair filesystem.");
 			sp->fn = phase4_func;
 			sp->must_run = true;
+		} else if (sp->fn == REPAIR_DUMMY_FN &&
+			   ctx->mode == SCRUB_MODE_PREEN) {
+			sp->descr = _("Optimize filesystem.");
+			sp->fn = phase4_func;
+			sp->must_run = true;
 		}
 
 		/* Skip certain phases unless they're turned on. */
@@ -598,7 +604,7 @@ report_outcome(
 	if (ctx->scrub_setup_succeeded && actionable_errors > 0) {
 		char		*msg;
 
-		if (ctx->mode == SCRUB_MODE_DRY_RUN)
+		if (ctx->mode != SCRUB_MODE_REPAIR)
 			msg = _("%s: Re-run xfs_scrub without -n.\n");
 		else
 			msg = _("%s: Unmount and run xfs_repair.\n");
@@ -646,7 +652,7 @@ main(
 	pthread_mutex_init(&ctx.lock, NULL);
 	ctx.mode = SCRUB_MODE_REPAIR;
 	ctx.error_action = ERRORS_CONTINUE;
-	while ((c = getopt(argc, argv, "a:bC:de:km:nTvxV")) != EOF) {
+	while ((c = getopt(argc, argv, "a:bC:de:km:npTvxV")) != EOF) {
 		switch (c) {
 		case 'a':
 			ctx.max_errors = cvt_u64(optarg, 10);
@@ -694,8 +700,19 @@ main(
 			mtab = optarg;
 			break;
 		case 'n':
+			if (ctx.mode != SCRUB_MODE_REPAIR) {
+				fprintf(stderr, _("Cannot use -n with -p.\n"));
+				usage();
+			}
 			ctx.mode = SCRUB_MODE_DRY_RUN;
 			break;
+		case 'p':
+			if (ctx.mode != SCRUB_MODE_REPAIR) {
+				fprintf(stderr, _("Cannot use -p with -n.\n"));
+				usage();
+			}
+			ctx.mode = SCRUB_MODE_PREEN;
+			break;
 		case 'T':
 			display_rusage = true;
 			break;
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 2ef8b2e5066..7269b231015 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -25,6 +25,7 @@ extern bool			use_force_rebuild;
 
 enum scrub_mode {
 	SCRUB_MODE_DRY_RUN,
+	SCRUB_MODE_PREEN,
 	SCRUB_MODE_REPAIR,
 };
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (59 preceding siblings ...)
  2022-12-30 22:18 ` [PATCHSET v24.0 0/2] xfs_scrub: automatic optimization by default Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] xfs: force rebuilding of metadata Darrick J. Wong
                   ` (16 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

Now that we've created a whole loop control infrastructure to ensure
that we can race fsstress and xfs_scrub's rmapbt scanning code, expand
the testing to scrubs of every other type of space and file metadata.
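
All of the new tests land in the dangerous_fsstress_scrub group added in
this series, so a full soak run is roughly:

  # long-running scrub-vs-fsstress racers; expect these to take a while
  $ ./check -g dangerous_fsstress_scrub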

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-rtsummary

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=scrub-rtsummary
---
 common/fuzzy          |  166 +++++++++++++++++++++++++++-
 common/quota          |   64 +++++++++++
 configure.ac          |    5 +
 doc/group-names.txt   |    1 
 include/builddefs.in  |    4 +
 m4/package_libcdev.m4 |   47 ++++++++
 m4/package_xfslibs.m4 |   16 +++
 src/Makefile          |   10 ++
 src/xfsfind.c         |  290 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/357         |    2 
 tests/xfs/782         |   37 ++++++
 tests/xfs/782.out     |    2 
 tests/xfs/783         |   37 ++++++
 tests/xfs/783.out     |    2 
 tests/xfs/784         |   37 ++++++
 tests/xfs/784.out     |    2 
 tests/xfs/785         |   37 ++++++
 tests/xfs/785.out     |    2 
 tests/xfs/786         |   38 ++++++
 tests/xfs/786.out     |    2 
 tests/xfs/787         |   38 ++++++
 tests/xfs/787.out     |    2 
 tests/xfs/788         |   38 ++++++
 tests/xfs/788.out     |    2 
 tests/xfs/789         |   39 +++++++
 tests/xfs/789.out     |    2 
 tests/xfs/790         |   39 +++++++
 tests/xfs/790.out     |    2 
 tests/xfs/791         |   40 +++++++
 tests/xfs/791.out     |    2 
 tests/xfs/792         |   38 ++++++
 tests/xfs/792.out     |    2 
 tests/xfs/793         |   37 ++++++
 tests/xfs/793.out     |    2 
 tests/xfs/794         |   39 +++++++
 tests/xfs/794.out     |    2 
 tests/xfs/795         |   39 +++++++
 tests/xfs/795.out     |    2 
 tests/xfs/796         |   37 ++++++
 tests/xfs/796.out     |    2 
 tests/xfs/797         |   40 +++++++
 tests/xfs/797.out     |    2 
 tests/xfs/798         |   44 +++++++
 tests/xfs/798.out     |    2 
 tests/xfs/799         |   38 ++++++
 tests/xfs/799.out     |    2 
 tests/xfs/800         |   40 +++++++
 tests/xfs/800.out     |    2 
 tests/xfs/801         |   47 ++++++++
 tests/xfs/801.out     |    2 
 tests/xfs/802         |   40 +++++++
 tests/xfs/802.out     |    2 
 tests/xfs/803         |   40 +++++++
 tests/xfs/803.out     |    2 
 tests/xfs/804         |   40 +++++++
 tests/xfs/804.out     |    2 
 tests/xfs/805         |   38 ++++++
 tests/xfs/805.out     |    2 
 tests/xfs/826         |   38 ++++++
 tests/xfs/826.out     |    2 
 tests/xfs/827         |   39 +++++++
 tests/xfs/827.out     |    2 
 62 files changed, 1662 insertions(+), 9 deletions(-)
 create mode 100644 src/xfsfind.c
 create mode 100755 tests/xfs/782
 create mode 100644 tests/xfs/782.out
 create mode 100755 tests/xfs/783
 create mode 100644 tests/xfs/783.out
 create mode 100755 tests/xfs/784
 create mode 100644 tests/xfs/784.out
 create mode 100755 tests/xfs/785
 create mode 100644 tests/xfs/785.out
 create mode 100755 tests/xfs/786
 create mode 100644 tests/xfs/786.out
 create mode 100755 tests/xfs/787
 create mode 100644 tests/xfs/787.out
 create mode 100755 tests/xfs/788
 create mode 100644 tests/xfs/788.out
 create mode 100755 tests/xfs/789
 create mode 100644 tests/xfs/789.out
 create mode 100755 tests/xfs/790
 create mode 100644 tests/xfs/790.out
 create mode 100755 tests/xfs/791
 create mode 100644 tests/xfs/791.out
 create mode 100755 tests/xfs/792
 create mode 100644 tests/xfs/792.out
 create mode 100755 tests/xfs/793
 create mode 100644 tests/xfs/793.out
 create mode 100755 tests/xfs/794
 create mode 100644 tests/xfs/794.out
 create mode 100755 tests/xfs/795
 create mode 100644 tests/xfs/795.out
 create mode 100755 tests/xfs/796
 create mode 100644 tests/xfs/796.out
 create mode 100755 tests/xfs/797
 create mode 100644 tests/xfs/797.out
 create mode 100755 tests/xfs/798
 create mode 100644 tests/xfs/798.out
 create mode 100755 tests/xfs/799
 create mode 100644 tests/xfs/799.out
 create mode 100755 tests/xfs/800
 create mode 100644 tests/xfs/800.out
 create mode 100755 tests/xfs/801
 create mode 100644 tests/xfs/801.out
 create mode 100755 tests/xfs/802
 create mode 100644 tests/xfs/802.out
 create mode 100755 tests/xfs/803
 create mode 100644 tests/xfs/803.out
 create mode 100755 tests/xfs/804
 create mode 100644 tests/xfs/804.out
 create mode 100755 tests/xfs/805
 create mode 100644 tests/xfs/805.out
 create mode 100755 tests/xfs/826
 create mode 100644 tests/xfs/826.out
 create mode 100755 tests/xfs/827
 create mode 100644 tests/xfs/827.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/5] xfs/357: switch fuzzing to agi 1
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-07 18:46     ` Zorro Lang
  2022-12-30 22:19   ` [PATCH 4/5] fuzzy: allow xfs scrub stress tests to pick preconfigured fsstress configs Darrick J. Wong
  4 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Since we now require a working AGI 0 to mount the filesystem, fuzz AGI 1
instead.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/357 |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/tests/xfs/357 b/tests/xfs/357
index 8a2c920ef4..25af8624db 100755
--- a/tests/xfs/357
+++ b/tests/xfs/357
@@ -25,7 +25,7 @@ echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
 echo "Fuzz AGI"
-_scratch_xfs_fuzz_metadata '' 'online' 'agi 0' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online' 'agi 1' >> $seqres.full
 echo "Done fuzzing AGI"
 
 # success, all done


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-05 13:04     ` Zorro Lang
  2023-02-07 17:02     ` [PATCH v24.1 " Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 5/5] xfs: race fsstress with online scrubbers for file metadata Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For each XFS_SCRUB_TYPE_* that looks at AG or filesystem metadata,
create a test that runs that scrubber in the foreground and fsstress in
the background.
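
Each test is a thin wrapper around the _scratch_xfs_stress_scrub helper
from common/fuzzy; the AGF test, for instance, boils down to the
following (setup boilerplate omitted; as I understand it, the helper
substitutes each AG number for %agno%):

  _scratch_mkfs > "$seqres.full" 2>&1
  _scratch_mount
  _scratch_xfs_stress_scrub -s "scrub agf %agno%"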

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/quota        |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 doc/group-names.txt |    1 +
 tests/xfs/782       |   37 +++++++++++++++++++++++++++++
 tests/xfs/782.out   |    2 ++
 tests/xfs/783       |   37 +++++++++++++++++++++++++++++
 tests/xfs/783.out   |    2 ++
 tests/xfs/784       |   37 +++++++++++++++++++++++++++++
 tests/xfs/784.out   |    2 ++
 tests/xfs/785       |   37 +++++++++++++++++++++++++++++
 tests/xfs/785.out   |    2 ++
 tests/xfs/786       |   38 ++++++++++++++++++++++++++++++
 tests/xfs/786.out   |    2 ++
 tests/xfs/787       |   38 ++++++++++++++++++++++++++++++
 tests/xfs/787.out   |    2 ++
 tests/xfs/788       |   38 ++++++++++++++++++++++++++++++
 tests/xfs/788.out   |    2 ++
 tests/xfs/789       |   39 +++++++++++++++++++++++++++++++
 tests/xfs/789.out   |    2 ++
 tests/xfs/790       |   39 +++++++++++++++++++++++++++++++
 tests/xfs/790.out   |    2 ++
 tests/xfs/791       |   40 ++++++++++++++++++++++++++++++++
 tests/xfs/791.out   |    2 ++
 tests/xfs/798       |   44 +++++++++++++++++++++++++++++++++++
 tests/xfs/798.out   |    2 ++
 tests/xfs/800       |   40 ++++++++++++++++++++++++++++++++
 tests/xfs/800.out   |    2 ++
 tests/xfs/801       |   47 +++++++++++++++++++++++++++++++++++++
 tests/xfs/801.out   |    2 ++
 tests/xfs/802       |   40 ++++++++++++++++++++++++++++++++
 tests/xfs/802.out   |    2 ++
 tests/xfs/803       |   40 ++++++++++++++++++++++++++++++++
 tests/xfs/803.out   |    2 ++
 tests/xfs/804       |   40 ++++++++++++++++++++++++++++++++
 tests/xfs/804.out   |    2 ++
 tests/xfs/805       |   38 ++++++++++++++++++++++++++++++
 tests/xfs/805.out   |    2 ++
 36 files changed, 768 insertions(+)
 create mode 100755 tests/xfs/782
 create mode 100644 tests/xfs/782.out
 create mode 100755 tests/xfs/783
 create mode 100644 tests/xfs/783.out
 create mode 100755 tests/xfs/784
 create mode 100644 tests/xfs/784.out
 create mode 100755 tests/xfs/785
 create mode 100644 tests/xfs/785.out
 create mode 100755 tests/xfs/786
 create mode 100644 tests/xfs/786.out
 create mode 100755 tests/xfs/787
 create mode 100644 tests/xfs/787.out
 create mode 100755 tests/xfs/788
 create mode 100644 tests/xfs/788.out
 create mode 100755 tests/xfs/789
 create mode 100644 tests/xfs/789.out
 create mode 100755 tests/xfs/790
 create mode 100644 tests/xfs/790.out
 create mode 100755 tests/xfs/791
 create mode 100644 tests/xfs/791.out
 create mode 100755 tests/xfs/798
 create mode 100644 tests/xfs/798.out
 create mode 100755 tests/xfs/800
 create mode 100644 tests/xfs/800.out
 create mode 100755 tests/xfs/801
 create mode 100644 tests/xfs/801.out
 create mode 100755 tests/xfs/802
 create mode 100644 tests/xfs/802.out
 create mode 100755 tests/xfs/803
 create mode 100644 tests/xfs/803.out
 create mode 100755 tests/xfs/804
 create mode 100644 tests/xfs/804.out
 create mode 100755 tests/xfs/805
 create mode 100644 tests/xfs/805.out


diff --git a/common/quota b/common/quota
index 24251d092a..96b8d04424 100644
--- a/common/quota
+++ b/common/quota
@@ -53,6 +53,70 @@ _require_xfs_quota()
     [ -n "$XFS_QUOTA_PROG" ] || _notrun "XFS quota user tools not installed"
 }
 
+# Check that a mounted fs has a particular type of quota accounting turned on.
+#
+# The first argument must be the data device of a mounted fs.  It must not be
+# the actual mountpath.
+#
+# The second argument is the quota type ('usrquota', 'grpquota', 'prjquota',
+# 'any', or 'all').
+_xfs_quota_acct_enabled()
+{
+	local dev="$1"
+	local qtype="$2"
+	local f_args=()
+	local any=
+
+	case "$qtype" in
+	"usrquota"|"uquota")	f_args=("-U");;
+	"grpquota"|"gquota")	f_args=("-G");;
+	"prjquota"|"pquota")	f_args=("-P");;
+	"all")			f_args=("-U" "-G" "-P");;
+	"any")			f_args=("-U" "-G" "-P"); any=1;;
+	*)			echo "$qtype: Unknown quota type."; return 1;;
+	esac
+
+	if [ "$any" = "1" ]; then
+		for arg in "${f_args[@]}"; do
+			$here/src/feature "$arg" "$dev" && return 0
+		done
+		return 1
+	fi
+
+	$here/src/feature "${f_args[@]}" "$dev"
+}
+
+# Require that a mounted fs has a particular type of quota turned on.  This
+# takes the same arguments as _xfs_quota_acct_enabled.  If the third argument is
+# '-u' (or is empty and dev is $SCRATCH_DEV) the fs will be unmounted on
+# failure.
+_require_xfs_quota_acct_enabled()
+{
+	local dev="$1"
+	local qtype="$2"
+	local umount="$3"
+	local fsname="$dev"
+
+	_xfs_quota_acct_enabled "$dev" "$qtype" && return 0
+
+	if [ -z "$umount" ] && [ "$dev" = "$SCRATCH_DEV" ]; then
+		umount="-u"
+	fi
+	test "$umount" = "-u" && umount "$dev" &>/dev/null
+
+	case "$dev" in
+	"$TEST_DEV")	fsname="test";;
+	"$SCRATCH_DEV")	fsname="scratch";;
+	esac
+
+	case "$qtype" in
+	"any")		qtype="any quotas";;
+	"all")		qtype="all quotas";;
+	esac
+
+	_notrun "$qtype: accounting not enabled on $fsname filesystem."
+}
+
 #
 # checks that xfs_quota can operate on foreign (non-xfs) filesystems
 # Skips check on xfs filesystems, old xfs_quota is fine there.
diff --git a/doc/group-names.txt b/doc/group-names.txt
index ac219e05b3..771ce937ae 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -35,6 +35,7 @@ dangerous_fuzzers	fuzzers that can crash your computer
 dangerous_norepair	fuzzers to evaluate kernel metadata verifiers
 dangerous_online_repair	fuzzers to evaluate xfs_scrub online repair
 dangerous_fsstress_repair	race fsstress and xfs_scrub online repair
+dangerous_fsstress_scrub	race fsstress and xfs_scrub checking
 dangerous_repair	fuzzers to evaluate xfs_repair offline repair
 dangerous_scrub		fuzzers to evaluate xfs_scrub checking
 data			data loss checkers
diff --git a/tests/xfs/782 b/tests/xfs/782
new file mode 100755
index 0000000000..4801eda4bd
--- /dev/null
+++ b/tests/xfs/782
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 782
+#
+# Race fsstress and superblock scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub sb %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/782.out b/tests/xfs/782.out
new file mode 100644
index 0000000000..6e378f0e53
--- /dev/null
+++ b/tests/xfs/782.out
@@ -0,0 +1,2 @@
+QA output created by 782
+Silence is golden
diff --git a/tests/xfs/783 b/tests/xfs/783
new file mode 100755
index 0000000000..379a9369e5
--- /dev/null
+++ b/tests/xfs/783
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 783
+#
+# Race fsstress and AGF scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub agf %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/783.out b/tests/xfs/783.out
new file mode 100644
index 0000000000..2522395956
--- /dev/null
+++ b/tests/xfs/783.out
@@ -0,0 +1,2 @@
+QA output created by 783
+Silence is golden
diff --git a/tests/xfs/784 b/tests/xfs/784
new file mode 100755
index 0000000000..2b89361c36
--- /dev/null
+++ b/tests/xfs/784
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 784
+#
+# Race fsstress and AGFL scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub agfl %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/784.out b/tests/xfs/784.out
new file mode 100644
index 0000000000..48d9b24dd0
--- /dev/null
+++ b/tests/xfs/784.out
@@ -0,0 +1,2 @@
+QA output created by 784
+Silence is golden
diff --git a/tests/xfs/785 b/tests/xfs/785
new file mode 100755
index 0000000000..34a13b058d
--- /dev/null
+++ b/tests/xfs/785
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 785
+#
+# Race fsstress and AGI scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub agi %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/785.out b/tests/xfs/785.out
new file mode 100644
index 0000000000..6ecb0c61b3
--- /dev/null
+++ b/tests/xfs/785.out
@@ -0,0 +1,2 @@
+QA output created by 785
+Silence is golden
diff --git a/tests/xfs/786 b/tests/xfs/786
new file mode 100755
index 0000000000..157200ea8c
--- /dev/null
+++ b/tests/xfs/786
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 786
+#
+# Race fsstress and freespace by block btree scrub for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub bnobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/786.out b/tests/xfs/786.out
new file mode 100644
index 0000000000..ccb9167df9
--- /dev/null
+++ b/tests/xfs/786.out
@@ -0,0 +1,2 @@
+QA output created by 786
+Silence is golden
diff --git a/tests/xfs/787 b/tests/xfs/787
new file mode 100755
index 0000000000..91eaf5a7af
--- /dev/null
+++ b/tests/xfs/787
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 787
+#
+# Race fsstress and free space by length btree scrub for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub cntbt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/787.out b/tests/xfs/787.out
new file mode 100644
index 0000000000..fa7f038120
--- /dev/null
+++ b/tests/xfs/787.out
@@ -0,0 +1,2 @@
+QA output created by 787
+Silence is golden
diff --git a/tests/xfs/788 b/tests/xfs/788
new file mode 100755
index 0000000000..f1369e5309
--- /dev/null
+++ b/tests/xfs/788
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 788
+#
+# Race fsstress and inode btree scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -x 'dir' -s "scrub inobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/788.out b/tests/xfs/788.out
new file mode 100644
index 0000000000..5ddd661113
--- /dev/null
+++ b/tests/xfs/788.out
@@ -0,0 +1,2 @@
+QA output created by 788
+Silence is golden
diff --git a/tests/xfs/789 b/tests/xfs/789
new file mode 100755
index 0000000000..550ff2c690
--- /dev/null
+++ b/tests/xfs/789
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 789
+#
+# Race fsstress and free inode btree scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" finobt
+_scratch_xfs_stress_scrub -x 'dir' -s "scrub finobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/789.out b/tests/xfs/789.out
new file mode 100644
index 0000000000..da88fc99cb
--- /dev/null
+++ b/tests/xfs/789.out
@@ -0,0 +1,2 @@
+QA output created by 789
+Silence is golden
diff --git a/tests/xfs/790 b/tests/xfs/790
new file mode 100755
index 0000000000..c4e5779ef7
--- /dev/null
+++ b/tests/xfs/790
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 790
+#
+# Race fsstress and reverse mapping btree scrub for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" rmapbt
+_scratch_xfs_stress_scrub -s "scrub rmapbt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/790.out b/tests/xfs/790.out
new file mode 100644
index 0000000000..7102c590f0
--- /dev/null
+++ b/tests/xfs/790.out
@@ -0,0 +1,2 @@
+QA output created by 790
+Silence is golden
diff --git a/tests/xfs/791 b/tests/xfs/791
new file mode 100755
index 0000000000..6939d910c9
--- /dev/null
+++ b/tests/xfs/791
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 791
+#
+# Race fsstress and reference count btree scrub for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" reflink
+_scratch_xfs_stress_scrub -s "scrub refcountbt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/791.out b/tests/xfs/791.out
new file mode 100644
index 0000000000..758905371d
--- /dev/null
+++ b/tests/xfs/791.out
@@ -0,0 +1,2 @@
+QA output created by 791
+Silence is golden
diff --git a/tests/xfs/798 b/tests/xfs/798
new file mode 100755
index 0000000000..c5bdfad50a
--- /dev/null
+++ b/tests/xfs/798
@@ -0,0 +1,44 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 798
+#
+# Race fsstress and fscounter scrub on the realtime device for a while to see
+# if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+
+# Force all files to be allocated on the realtime device
+_xfs_force_bdev realtime $SCRATCH_MNT
+
+_scratch_xfs_stress_scrub -s 'scrub fscounters'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/798.out b/tests/xfs/798.out
new file mode 100644
index 0000000000..216d6e93f4
--- /dev/null
+++ b/tests/xfs/798.out
@@ -0,0 +1,2 @@
+QA output created by 798
+Silence is golden
diff --git a/tests/xfs/800 b/tests/xfs/800
new file mode 100755
index 0000000000..cbcfb5f5a6
--- /dev/null
+++ b/tests/xfs/800
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 800
+#
+# Race fsstress and realtime bitmap scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+_scratch_xfs_stress_scrub -s "scrub rtbitmap"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/800.out b/tests/xfs/800.out
new file mode 100644
index 0000000000..bdfaa2cecd
--- /dev/null
+++ b/tests/xfs/800.out
@@ -0,0 +1,2 @@
+QA output created by 800
+Silence is golden
diff --git a/tests/xfs/801 b/tests/xfs/801
new file mode 100755
index 0000000000..a51fab523b
--- /dev/null
+++ b/tests/xfs/801
@@ -0,0 +1,47 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 801
+#
+# Race fsstress and realtime summary scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+
+# XXX the realtime summary scrubber isn't currently implemented upstream.
+# Don't bother trying to test it on those kernels
+$XFS_IO_PROG -c 'scrub rtsummary' -c 'scrub rtsummary' "$SCRATCH_MNT" 2>&1 | \
+	grep -q 'Scan was not complete' && \
+	_notrun "rtsummary scrub is incomplete"
+
+_scratch_xfs_stress_scrub -s "scrub rtsummary"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/801.out b/tests/xfs/801.out
new file mode 100644
index 0000000000..39481b38e2
--- /dev/null
+++ b/tests/xfs/801.out
@@ -0,0 +1,2 @@
+QA output created by 801
+Silence is golden
diff --git a/tests/xfs/802 b/tests/xfs/802
new file mode 100755
index 0000000000..1f3b83882e
--- /dev/null
+++ b/tests/xfs/802
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 802
+#
+# Race fsstress and user quota scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
+_scratch_xfs_stress_scrub -s "scrub usrquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/802.out b/tests/xfs/802.out
new file mode 100644
index 0000000000..a69c05391f
--- /dev/null
+++ b/tests/xfs/802.out
@@ -0,0 +1,2 @@
+QA output created by 802
+Silence is golden
diff --git a/tests/xfs/803 b/tests/xfs/803
new file mode 100755
index 0000000000..b2bb85672d
--- /dev/null
+++ b/tests/xfs/803
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 803
+#
+# Race fsstress and group quota scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
+_scratch_xfs_stress_scrub -s "scrub grpquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/803.out b/tests/xfs/803.out
new file mode 100644
index 0000000000..38ba741d0f
--- /dev/null
+++ b/tests/xfs/803.out
@@ -0,0 +1,2 @@
+QA output created by 803
+Silence is golden
diff --git a/tests/xfs/804 b/tests/xfs/804
new file mode 100755
index 0000000000..129724eb11
--- /dev/null
+++ b/tests/xfs/804
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 804
+#
+# Race fsstress and project quota scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
+_scratch_xfs_stress_scrub -s "scrub prjquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/804.out b/tests/xfs/804.out
new file mode 100644
index 0000000000..5e0cb437e7
--- /dev/null
+++ b/tests/xfs/804.out
@@ -0,0 +1,2 @@
+QA output created by 804
+Silence is golden
diff --git a/tests/xfs/805 b/tests/xfs/805
new file mode 100755
index 0000000000..aca9b9cdf4
--- /dev/null
+++ b/tests/xfs/805
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 805
+#
+# Race fsstress and summary counters scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub fscounters"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/805.out b/tests/xfs/805.out
new file mode 100644
index 0000000000..ac324c5874
--- /dev/null
+++ b/tests/xfs/805.out
@@ -0,0 +1,2 @@
+QA output created by 805
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 5/5] xfs: race fsstress with online scrubbers for file metadata Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-05 12:57     ` Zorro Lang
  2023-02-07 17:01     ` [PATCH v24.1 " Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/5] xfs/357: switch fuzzing to agi 1 Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 4/5] fuzzy: allow xfs scrub stress tests to pick preconfigured fsstress configs Darrick J. Wong
  4 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create a new find(1)-like utility that doesn't crash on directory tree
changes (as find does, due to bugs in its loop detector) and that
actually implements the custom xfs attribute predicates that we need for
scrub stress tests.  This program will be needed by a future patch that
adds stress tests for scrub and repair of file metadata.
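
A usage sketch, with the option letters taken from print_help() below:

  # quietly list everything under the scratch fs that has xattrs or
  # shared (reflinked) blocks, tolerating churn from background fsstress
  $ src/xfsfind -q -a -s $SCRATCH_MNT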

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 configure.ac          |    5 +
 include/builddefs.in  |    4 +
 m4/package_libcdev.m4 |   47 ++++++++
 m4/package_xfslibs.m4 |   16 +++
 src/Makefile          |   10 ++
 src/xfsfind.c         |  290 +++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 372 insertions(+)
 create mode 100644 src/xfsfind.c


diff --git a/configure.ac b/configure.ac
index cbf8377988..e92bd6b26d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -66,6 +66,11 @@ AC_PACKAGE_WANT_LINUX_FS_H
 AC_PACKAGE_WANT_LIBBTRFSUTIL
 
 AC_HAVE_COPY_FILE_RANGE
+AC_HAVE_SEEK_DATA
+AC_HAVE_BMV_OF_SHARED
+AC_HAVE_NFTW
+AC_HAVE_RLIMIT_NOFILE
+
 AC_CHECK_FUNCS([renameat2])
 AC_CHECK_FUNCS([reallocarray])
 AC_CHECK_TYPES([struct mount_attr], [], [], [[#include <linux/mount.h>]])
diff --git a/include/builddefs.in b/include/builddefs.in
index 6641209f81..dab10c968f 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -68,6 +68,10 @@ HAVE_FIEMAP = @have_fiemap@
 HAVE_FALLOCATE = @have_fallocate@
 HAVE_COPY_FILE_RANGE = @have_copy_file_range@
 HAVE_LIBBTRFSUTIL = @have_libbtrfsutil@
+HAVE_SEEK_DATA = @have_seek_data@
+HAVE_NFTW = @have_nftw@
+HAVE_BMV_OF_SHARED = @have_bmv_of_shared@
+HAVE_RLIMIT_NOFILE = @have_rlimit_nofile@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 5c76c0f73e..e1b381c16f 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -110,3 +110,50 @@ AC_DEFUN([AC_HAVE_COPY_FILE_RANGE],
     AC_SUBST(have_copy_file_range)
   ])
 
+# Check if we have SEEK_DATA
+AC_DEFUN([AC_HAVE_SEEK_DATA],
+  [ AC_MSG_CHECKING([for SEEK_DATA])
+    AC_TRY_LINK([
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <unistd.h>
+    ], [
+         lseek(-1, 0, SEEK_DATA);
+    ], have_seek_data=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_seek_data)
+  ])
+
+# Check if we have nftw
+AC_DEFUN([AC_HAVE_NFTW],
+  [ AC_MSG_CHECKING([for nftw])
+    AC_TRY_LINK([
+#define _GNU_SOURCE
+#include <stddef.h>
+#include <ftw.h>
+    ], [
+         nftw("/", (int (*)(const char *, const struct stat *, int, struct FTW *))1, 0, 0);
+    ], have_nftw=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_nftw)
+  ])
+
+# Check if we have RLIMIT_NOFILE
+AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
+  [ AC_MSG_CHECKING([for RLIMIT_NOFILE])
+    AC_TRY_LINK([
+#define _GNU_SOURCE
+#include <sys/time.h>
+#include <sys/resource.h>
+    ], [
+         struct rlimit rlimit;
+
+         rlimit.rlim_cur = 0;
+         getrlimit(RLIMIT_NOFILE, &rlimit);
+    ], have_rlimit_nofile=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_rlimit_nofile)
+  ])
diff --git a/m4/package_xfslibs.m4 b/m4/package_xfslibs.m4
index 0746cd1dc5..479f30a29b 100644
--- a/m4/package_xfslibs.m4
+++ b/m4/package_xfslibs.m4
@@ -104,3 +104,19 @@ AC_DEFUN([AC_PACKAGE_NEED_XFSCTL_MACRO],
         exit 1
       ])
   ])
+
+# Check if we have BMV_OF_SHARED from the GETBMAPX ioctl
+AC_DEFUN([AC_HAVE_BMV_OF_SHARED],
+  [ AC_MSG_CHECKING([for BMV_OF_SHARED])
+    AC_TRY_LINK([
+#define _GNU_SOURCE
+#include <xfs/xfs.h>
+    ], [
+         struct getbmapx obj;
+         ioctl(-1, XFS_IOC_GETBMAPX, &obj);
+         obj.bmv_oflags |= BMV_OF_SHARED;
+    ], have_bmv_of_shared=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_bmv_of_shared)
+  ])
diff --git a/src/Makefile b/src/Makefile
index afdf6b30c5..7807ca89a5 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -83,6 +83,16 @@ ifeq ($(HAVE_LIBCAP), true)
 LLDLIBS += -lcap
 endif
 
+ifeq ($(HAVE_SEEK_DATA), yes)
+ ifeq ($(HAVE_NFTW), yes)
+  ifeq ($(HAVE_BMV_OF_SHARED), yes)
+   ifeq ($(HAVE_RLIMIT_NOFILE), yes)
+     TARGETS += xfsfind
+   endif
+  endif
+ endif
+endif
+
 CFILES = $(TARGETS:=.c)
 LDIRT = $(TARGETS) fssum
 
diff --git a/src/xfsfind.c b/src/xfsfind.c
new file mode 100644
index 0000000000..6b0a93e793
--- /dev/null
+++ b/src/xfsfind.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * find(1) but with special predicates for finding XFS attributes.
+ * Copyright (C) 2022 Oracle.
+ */
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/types.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <ftw.h>
+#include <linux/fs.h>
+#include <xfs/xfs.h>
+
+#include "global.h"
+
+static int want_anyfile;
+static int want_datafile;
+static int want_attrfile;
+static int want_dir;
+static int want_regfile;
+static int want_sharedfile;
+static int report_errors = 1;
+
+static int
+check_datafile(
+	const char		*path,
+	int			fd)
+{
+	off_t			off;
+
+	off = lseek(fd, 0, SEEK_DATA);
+	if (off >= 0)
+		return 1;
+
+	if (errno == ENXIO)
+		return 0;
+
+	if (report_errors)
+		perror(path);
+
+	return -1;
+}
+
+static int
+check_attrfile(
+	const char		*path,
+	int			fd)
+{
+	struct fsxattr		fsx;
+	int			ret;
+
+	ret = ioctl(fd, XFS_IOC_FSGETXATTR, &fsx);
+	if (ret) {
+		if (report_errors)
+			perror(path);
+		return -1;
+	}
+
+	if (want_attrfile && (fsx.fsx_xflags & XFS_XFLAG_HASATTR))
+		return 1;
+	return 0;
+}
+
+#define BMAP_NR			33
+static struct getbmapx		bmaps[BMAP_NR];
+
+static int
+check_sharedfile(
+	const char		*path,
+	int			fd)
+{
+	struct getbmapx		*key = &bmaps[0];
+	unsigned int		i;
+	int			ret;
+
+	memset(key, 0, sizeof(struct getbmapx));
+	key->bmv_length = ULLONG_MAX;
+	/* no holes and don't flush dirty pages */
+	key->bmv_iflags = BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
+	key->bmv_count = BMAP_NR;
+
+	while ((ret = ioctl(fd, XFS_IOC_GETBMAPX, bmaps)) == 0) {
+		struct getbmapx	*p = &bmaps[1];
+		xfs_off_t	new_off;
+
+		for (i = 0; i < key->bmv_entries; i++, p++) {
+			if (p->bmv_oflags & BMV_OF_SHARED)
+				return 1;
+		}
+
+		if (key->bmv_entries == 0)
+			break;
+		p = key + key->bmv_entries;
+		if (p->bmv_oflags & BMV_OF_LAST)
+			return 0;
+
+		new_off = p->bmv_offset + p->bmv_length;
+		key->bmv_length -= new_off - key->bmv_offset;
+		key->bmv_offset = new_off;
+	}
+	if (ret < 0) {
+		if (report_errors)
+			perror(path);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+print_help(
+	const char		*name)
+{
+	printf("Usage: %s [OPTIONS] path\n", name);
+	printf("\n");
+	printf("Print all file paths matching any of the given predicates.\n");
+	printf("\n");
+	printf("-a	Match files with xattrs.\n");
+	printf("-b	Match files with data blocks.\n");
+	printf("-d	Match directories.\n");
+	printf("-q	Ignore errors while walking directory tree.\n");
+	printf("-r	Match regular files.\n");
+	printf("-s	Match files with shared blocks.\n");
+	printf("\n");
+	printf("If no matching options are given, match all files found.\n");
+}
+
+static int
+visit(
+	const char		*path,
+	const struct stat	*sb,
+	int			typeflag,
+	struct FTW		*ftwbuf)
+{
+	int			printme = 1;
+	int			fd = -1;
+	int			retval = FTW_CONTINUE;
+
+	if (want_anyfile)
+		goto out;
+	if (want_regfile && typeflag == FTW_F)
+		goto out;
+	if (want_dir && typeflag == FTW_D)
+		goto out;
+
+	/*
+	 * We can only open directories and files; screen out everything else.
+	 * Note that nftw lies and reports FTW_F for device files, so check the
+	 * statbuf mode too.
+	 */
+	if (typeflag != FTW_F && typeflag != FTW_D) {
+		printme = 0;
+		goto out;
+	}
+
+	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode)) {
+		printme = 0;
+		goto out;
+	}
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		if (report_errors) {
+			perror(path);
+			return FTW_STOP;
+		}
+
+		return FTW_CONTINUE;
+	}
+
+	if (want_datafile && typeflag == FTW_F) {
+		int ret = check_datafile(path, fd);
+		if (ret < 0 && report_errors) {
+			printme = 0;
+			retval = FTW_STOP;
+			goto out_fd;
+		}
+
+		if (ret == 1)
+			goto out_fd;
+	}
+
+	if (want_attrfile) {
+		int ret = check_attrfile(path, fd);
+		if (ret < 0 && report_errors) {
+			printme = 0;
+			retval = FTW_STOP;
+			goto out_fd;
+		}
+
+		if (ret == 1)
+			goto out_fd;
+	}
+
+	if (want_sharedfile) {
+		int ret = check_sharedfile(path, fd);
+		if (ret < 0 && report_errors) {
+			printme = 0;
+			retval = FTW_STOP;
+			goto out_fd;
+		}
+
+		if (ret == 1)
+			goto out_fd;
+	}
+
+	printme = 0;
+out_fd:
+	close(fd);
+out:
+	if (printme)
+		printf("%s\n", path);
+	return retval;
+}
+
+static void
+handle_sigabrt(
+	int		signal,
+	siginfo_t	*info,
+	void		*ucontext)
+{
+	fprintf(stderr, "Signal %u, exiting.\n", signal);
+	exit(2);
+}
+
+int
+main(
+	int			argc,
+	char			*argv[])
+{
+	struct rlimit		rlimit;
+	struct sigaction	abrt = {
+		.sa_sigaction	= handle_sigabrt,
+		.sa_flags	= SA_SIGINFO,
+	};
+	int			c;
+	int			ret;
+
+	while ((c = getopt(argc, argv, "abdqrs")) >= 0) {
+		switch (c) {
+		case 'a':	want_attrfile = 1;   break;
+		case 'b':	want_datafile = 1;   break;
+		case 'd':	want_dir = 1;        break;
+		case 'q':	report_errors = 0;   break;
+		case 'r':	want_regfile = 1;    break;
+		case 's':	want_sharedfile = 1; break;
+		default:
+			print_help(argv[0]);
+			return 1;
+		}
+	}
+
+	ret = getrlimit(RLIMIT_NOFILE, &rlimit);
+	if (ret) {
+		perror("RLIMIT_NOFILE");
+		return 1;
+	}
+
+	if (!want_attrfile && !want_datafile && !want_dir && !want_regfile &&
+	    !want_sharedfile)
+		want_anyfile = 1;
+
+	/*
+	 * nftw is known to abort() if a directory it is walking disappears out
+	 * from under it.  Handle this with grace if the caller wants us to run
+	 * quietly.
+	 */
+	if (!report_errors) {
+		ret = sigaction(SIGABRT, &abrt, NULL);
+		if (ret) {
+			perror("SIGABRT handler");
+			return 1;
+		}
+	}
+
+	for (c = optind; c < argc; c++) {
+		ret = nftw(argv[c], visit, rlimit.rlim_cur - 5,
+				FTW_ACTIONRETVAL | FTW_CHDIR | FTW_MOUNT |
+				FTW_PHYS);
+		if (ret && report_errors) {
+			perror(argv[c]);
+			break;
+		}
+	}
+
+	if (ret)
+		return 1;
+	return 0;
+}


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] fuzzy: allow xfs scrub stress tests to pick preconfigured fsstress configs
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 1/5] xfs/357: switch fuzzing to agi 1 Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-07 18:48     ` Zorro Lang
  4 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Make it so that xfs_scrub stress tests can select what kind of fsstress
operations they want to run.  This makes it easier for, say, a directory
scrub stress test to configure fsstress to exercise directory tree
changes while skipping file data updates, which are irrelevant to
directory scrubbing.
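
As a rough sketch (the option parsing lands in the common/fuzzy hunk
below), a directory-focused stress test could do something like:

	# focus fsstress on namespace updates while scrubbing directories
	_scratch_xfs_stress_scrub -x 'dir' -s "scrub directory"

Tests that do not pass -x pick up the default focus from the
XFS_SCRUB_STRESS_TARGET environment variable.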

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 74 insertions(+), 3 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index e39f787e78..c4a5bc9261 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -466,6 +466,7 @@ __stress_scrub_fsx_loop() {
 	local end="$1"
 	local runningfile="$2"
 	local remount_period="$3"
+	local stress_tgt="$4"	# ignored
 	local focus=(-q -X)	# quiet, validate file contents
 
 	# As of November 2022, 2 million fsx ops should be enough to keep
@@ -528,10 +529,70 @@ __stress_scrub_fsstress_loop() {
 	local end="$1"
 	local runningfile="$2"
 	local remount_period="$3"
+	local stress_tgt="$4"
+	local focus=()
+
+	case "$stress_tgt" in
+	"dir")
+		focus+=('-z')
+
+		# Create a directory tree rapidly
+		for op in creat link mkdir mknod symlink; do
+			focus+=('-f' "${op}=8")
+		done
+		focus+=('-f' 'rmdir=2' '-f' 'unlink=8')
+
+		# Rename half as often
+		for op in rename rnoreplace rexchange; do
+			focus+=('-f' "${op}=4")
+		done
+
+		# Read and sync occasionally
+		for op in getdents stat fsync; do
+			focus+=('-f' "${op}=1")
+		done
+		;;
+	"xattr")
+		focus+=('-z')
+
+		# Create a directory tree slowly
+		for op in creat ; do
+			focus+=('-f' "${op}=2")
+		done
+		for op in unlink rmdir; do
+			focus+=('-f' "${op}=1")
+		done
+
+		# Create xattrs rapidly
+		for op in attr_set setfattr; do
+			focus+=('-f' "${op}=80")
+		done
+
+		# Remove xattrs 1/4 as quickly
+		for op in attr_remove removefattr; do
+			focus+=('-f' "${op}=20")
+		done
+
+		# Read and sync occasionally
+		for op in listfattr getfattr fsync; do
+			focus+=('-f' "${op}=10")
+		done
+		;;
+	"writeonly")
+		# Only do things that cause filesystem writes
+		focus+=('-w')
+		;;
+	"default")
+		# No new arguments
+		;;
+	*)
+		echo "$stress_tgt: Unrecognized stress target, using defaults."
+		;;
+	esac
 
 	# As of March 2022, 2 million fsstress ops should be enough to keep
 	# any filesystem busy for a couple of hours.
-	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 $FSSTRESS_AVOID)
+	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 "${focus[@]}" $FSSTRESS_AVOID)
 	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
 
 	if [ -n "$remount_period" ]; then
@@ -691,6 +752,14 @@ __stress_scrub_check_commands() {
 # -w	Delay the start of the scrub/repair loop by this number of seconds.
 #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
 #	will be clamped to ten seconds before the end time.
+# -x	Focus on this type of fsstress operation.  Possible values:
+#
+#       'dir': Grow the directory trees as much as possible.
+#       'xattr': Grow extended attributes in a small tree.
+#       'default': Run fsstress with default arguments.
+#       'writeonly': Only perform fs updates, no reads.
+#
+#       The default is 'default' unless XFS_SCRUB_STRESS_TARGET is set.
 # -X	Run this program to exercise the filesystem.  Currently supported
 #       options are 'fsx' and 'fsstress'.  The default is 'fsstress'.
 _scratch_xfs_stress_scrub() {
@@ -703,6 +772,7 @@ _scratch_xfs_stress_scrub() {
 	local exerciser="fsstress"
 	local io_args=()
 	local remount_period="${XFS_SCRUB_STRESS_REMOUNT_PERIOD}"
+	local stress_tgt="${XFS_SCRUB_STRESS_TARGET:-default}"
 
 	__SCRUB_STRESS_FREEZE_PID=""
 	__SCRUB_STRESS_REMOUNT_LOOP=""
@@ -710,7 +780,7 @@ _scratch_xfs_stress_scrub() {
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fi:r:s:S:t:w:X:" c; do
+	while getopts "fi:r:s:S:t:w:x:X:" c; do
 		case "$c" in
 			f) freeze=yes;;
 			i) io_args+=("$OPTARG");;
@@ -719,6 +789,7 @@ _scratch_xfs_stress_scrub() {
 			S) xfs_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
 			w) scrub_delay="$OPTARG";;
+			x) stress_tgt="$OPTARG";;
 			X) exerciser="$OPTARG";;
 			*) return 1; ;;
 		esac
@@ -757,7 +828,7 @@ _scratch_xfs_stress_scrub() {
 	fi
 
 	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" \
-			"$remount_period" &
+			"$remount_period" "$stress_tgt" &
 
 	if [ -n "$freeze" ]; then
 		__stress_scrub_freeze_loop "$end" "$runningfile" &


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/5] xfs: race fsstress with online scrubbers for file metadata
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For each XFS_SCRUB_TYPE_* that looks at file metadata, create a test
that runs that scrubber in the foreground and fsstress in the
background.
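
Roughly speaking (see the changes to __stress_one_scrub_loop below), a
special target such as "%dir%" makes the scrub loop walk the scratch fs
and run the requested command against every path it finds, along the
lines of:

	# approximately what the loop does for -s "scrub directory" -t "%dir%"
	find "$SCRATCH_MNT" -type d -print 2>/dev/null | while read -r dir; do
		$XFS_IO_PROG -x -c "scrub directory" "$dir"
	done

The file selection races with fsstress, so complaints about paths that
have already disappeared are filtered out of the scrub output.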

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy      |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 tests/xfs/792     |   38 +++++++++++++++++++++++
 tests/xfs/792.out |    2 +
 tests/xfs/793     |   37 ++++++++++++++++++++++
 tests/xfs/793.out |    2 +
 tests/xfs/794     |   39 +++++++++++++++++++++++
 tests/xfs/794.out |    2 +
 tests/xfs/795     |   39 +++++++++++++++++++++++
 tests/xfs/795.out |    2 +
 tests/xfs/796     |   37 ++++++++++++++++++++++
 tests/xfs/796.out |    2 +
 tests/xfs/797     |   40 ++++++++++++++++++++++++
 tests/xfs/797.out |    2 +
 tests/xfs/799     |   38 +++++++++++++++++++++++
 tests/xfs/799.out |    2 +
 tests/xfs/826     |   38 +++++++++++++++++++++++
 tests/xfs/826.out |    2 +
 tests/xfs/827     |   39 +++++++++++++++++++++++
 tests/xfs/827.out |    2 +
 19 files changed, 447 insertions(+), 5 deletions(-)
 create mode 100755 tests/xfs/792
 create mode 100644 tests/xfs/792.out
 create mode 100755 tests/xfs/793
 create mode 100644 tests/xfs/793.out
 create mode 100755 tests/xfs/794
 create mode 100644 tests/xfs/794.out
 create mode 100755 tests/xfs/795
 create mode 100644 tests/xfs/795.out
 create mode 100755 tests/xfs/796
 create mode 100644 tests/xfs/796.out
 create mode 100755 tests/xfs/797
 create mode 100644 tests/xfs/797.out
 create mode 100755 tests/xfs/799
 create mode 100644 tests/xfs/799.out
 create mode 100755 tests/xfs/826
 create mode 100644 tests/xfs/826.out
 create mode 100755 tests/xfs/827
 create mode 100644 tests/xfs/827.out


diff --git a/common/fuzzy b/common/fuzzy
index c4a5bc9261..f7f660bc31 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -330,12 +330,20 @@ __stress_freeze_filter_output() {
 
 # Filter scrub output so that we don't tarnish the golden output if the fs is
 # too busy to scrub.  Note: Tests should _notrun if the scrub type is not
-# supported.
+# supported.  Callers can provide extra strings to filter out as function
+# arguments.
 __stress_scrub_filter_output() {
+	local extra_args=()
+
+	for arg in "$@"; do
+		extra_args+=(-e "/${arg}/d")
+	done
+
 	_filter_scratch | \
 		sed -e '/Device or resource busy/d' \
 		    -e '/Optimization possible/d' \
-		    -e '/No space left on device/d'
+		    -e '/No space left on device/d' \
+		    "${extra_args[@]}"
 }
 
 # Decide if the scratch filesystem is still alive.
@@ -401,13 +409,34 @@ __stress_one_scrub_loop() {
 		fi
 	done
 
+	local extra_filters=()
+	local target_cmd=(echo "$scrub_tgt")
+	case "$scrub_tgt" in
+	"%file%"|"%datafile%"|"%attrfile%")
+		extra_filters+=('No such file or directory' 'No such device or address')
+		target_cmd=(find "$SCRATCH_MNT" -print)
+		;;
+	"%dir%")
+		extra_filters+=('No such file or directory' 'Not a directory')
+		target_cmd=(find "$SCRATCH_MNT" -type d -print)
+		;;
+	"%regfile%"|"%cowfile%")
+		extra_filters+=('No such file or directory')
+		target_cmd=(find "$SCRATCH_MNT" -type f -print)
+		;;
+	esac
+
 	while __stress_scrub_running "$scrub_startat" "$runningfile"; do
 		sleep 1
 	done
 
 	while __stress_scrub_running "$end" "$runningfile"; do
-		$XFS_IO_PROG -x "${xfs_io_args[@]}" "$scrub_tgt" 2>&1 | \
-			__stress_scrub_filter_output
+		readarray -t fnames < <("${target_cmd[@]}" 2>/dev/null)
+		for fname in "${fnames[@]}"; do
+			$XFS_IO_PROG -x "${xfs_io_args[@]}" "$fname" 2>&1 | \
+				__stress_scrub_filter_output "${extra_filters[@]}"
+			__stress_scrub_running "$end" "$runningfile" || break
+		done
 	done
 }
 
@@ -585,6 +614,22 @@ __stress_scrub_fsstress_loop() {
 	"default")
 		# No new arguments
 		;;
+	"symlink")
+		focus+=('-z')
+
+		# Only create, read, and delete symbolic links
+		focus+=('-f' 'symlink=4')
+		focus+=('-f' 'readlink=10')
+		focus+=('-f' 'unlink=1')
+		;;
+	"mknod")
+		focus+=('-z')
+
+		# Only create and delete special files
+		focus+=('-f' 'mknod=4')
+		focus+=('-f' 'getdents=100')
+		focus+=('-f' 'unlink=1')
+		;;
 	*)
 		echo "$stress_tgt: Unrecognized stress target, using defaults."
 		;;
@@ -715,9 +760,31 @@ __stress_scrub_check_commands() {
 	local scrub_tgt="$1"
 	shift
 
+	local cooked_tgt="$scrub_tgt"
+	case "$scrub_tgt" in
+	"%file%"|"%dir%")
+		cooked_tgt="$SCRATCH_MNT"
+		;;
+	"%regfile%"|"%datafile%")
+		cooked_tgt="$SCRATCH_MNT/testfile"
+		echo test > "$cooked_tgt"
+		;;
+	"%attrfile%")
+		cooked_tgt="$SCRATCH_MNT/testfile"
+		$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 64k' "$cooked_tgt" &>/dev/null
+		attr -s attrname "$cooked_tgt" < "$cooked_tgt" &>/dev/null
+		;;
+	"%cowfile%")
+		cooked_tgt="$SCRATCH_MNT/testfile"
+		$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 128k' "$cooked_tgt" &>/dev/null
+		_cp_reflink "$cooked_tgt" "$cooked_tgt.1"
+		$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 1' "$cooked_tgt.1" &>/dev/null
+		;;
+	esac
+
 	for arg in "$@"; do
 		local cooked_arg="$(echo "$arg" | sed -e "s/%agno%/0/g")"
-		testio=`$XFS_IO_PROG -x -c "$cooked_arg" $scrub_tgt 2>&1`
+		testio=`$XFS_IO_PROG -x -c "$cooked_arg" "$cooked_tgt" 2>&1`
 		echo $testio | grep -q "Unknown type" && \
 			_notrun "xfs_io scrub subcommand support is missing"
 		echo $testio | grep -q "Inappropriate ioctl" && \
@@ -749,6 +816,16 @@ __stress_scrub_check_commands() {
 # -S	Pass this option to xfs_scrub.  If zero -S options are specified,
 #	xfs_scrub will not be run.  To select repair mode, pass '-k' or '-v'.
 # -t	Run online scrub against this file; $SCRATCH_MNT is the default.
+#	Special values are as follows:
+#
+#	%file%		all files
+#	%regfile%	regular files
+#	%dir%		directories
+#	%datafile%	regular files with data blocks
+#	%attrfile%	regular files with xattr blocks
+#	%cowfile%	regular files with shared blocks
+#
+#	File selection races with fsstress, so the selection is best-effort.
 # -w	Delay the start of the scrub/repair loop by this number of seconds.
 #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
 #	will be clamped to ten seconds before the end time.
@@ -758,6 +835,8 @@ __stress_scrub_check_commands() {
 #       'xattr': Grow extended attributes in a small tree.
 #       'default': Run fsstress with default arguments.
 #       'writeonly': Only perform fs updates, no reads.
+#       'symlink': Only create symbolic links.
+#       'mknod': Only create special files.
 #
 #       The default is 'default' unless XFS_SCRUB_STRESS_TARGET is set.
 # -X	Run this program to exercise the filesystem.  Currently supported
diff --git a/tests/xfs/792 b/tests/xfs/792
new file mode 100755
index 0000000000..0806e87909
--- /dev/null
+++ b/tests/xfs/792
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 792
+#
+# Race fsstress and inode record scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub inode" -t "%file%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/792.out b/tests/xfs/792.out
new file mode 100644
index 0000000000..c9b5ef3a7c
--- /dev/null
+++ b/tests/xfs/792.out
@@ -0,0 +1,2 @@
+QA output created by 792
+Silence is golden
diff --git a/tests/xfs/793 b/tests/xfs/793
new file mode 100755
index 0000000000..41be82d621
--- /dev/null
+++ b/tests/xfs/793
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 793
+#
+# Race fsstress and data fork scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub bmapbtd" -t "%datafile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/793.out b/tests/xfs/793.out
new file mode 100644
index 0000000000..e8a17d4ecb
--- /dev/null
+++ b/tests/xfs/793.out
@@ -0,0 +1,2 @@
+QA output created by 793
+Silence is golden
diff --git a/tests/xfs/794 b/tests/xfs/794
new file mode 100755
index 0000000000..8f4835dbc9
--- /dev/null
+++ b/tests/xfs/794
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 794
+#
+# Race fsstress and attr fork scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/attr
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_attrs
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -x 'xattr' -s "scrub bmapbta" -t "%attrfile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/794.out b/tests/xfs/794.out
new file mode 100644
index 0000000000..bc999c055c
--- /dev/null
+++ b/tests/xfs/794.out
@@ -0,0 +1,2 @@
+QA output created by 794
+Silence is golden
diff --git a/tests/xfs/795 b/tests/xfs/795
new file mode 100755
index 0000000000..ec065bafdd
--- /dev/null
+++ b/tests/xfs/795
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 795
+#
+# Race fsstress and cow fork scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" reflink
+_scratch_xfs_stress_scrub -s "scrub bmapbtc" -t "%cowfile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/795.out b/tests/xfs/795.out
new file mode 100644
index 0000000000..cb357003dd
--- /dev/null
+++ b/tests/xfs/795.out
@@ -0,0 +1,2 @@
+QA output created by 795
+Silence is golden
diff --git a/tests/xfs/796 b/tests/xfs/796
new file mode 100755
index 0000000000..d337701264
--- /dev/null
+++ b/tests/xfs/796
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 796
+#
+# Race fsstress and directory scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -x 'dir' -s "scrub directory" -t "%dir%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/796.out b/tests/xfs/796.out
new file mode 100644
index 0000000000..374e3774a2
--- /dev/null
+++ b/tests/xfs/796.out
@@ -0,0 +1,2 @@
+QA output created by 796
+Silence is golden
diff --git a/tests/xfs/797 b/tests/xfs/797
new file mode 100755
index 0000000000..c68b43be7a
--- /dev/null
+++ b/tests/xfs/797
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 797
+#
+# Race fsstress and extended attributes scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/attr
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_attrs
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -x 'xattr' -s "scrub xattr" -t "%attrfile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/797.out b/tests/xfs/797.out
new file mode 100644
index 0000000000..6b64f4bf21
--- /dev/null
+++ b/tests/xfs/797.out
@@ -0,0 +1,2 @@
+QA output created by 797
+Silence is golden
diff --git a/tests/xfs/799 b/tests/xfs/799
new file mode 100755
index 0000000000..84007ea9c0
--- /dev/null
+++ b/tests/xfs/799
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 799
+#
+# Race fsstress and parent pointers scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub parent" -t "%dir%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/799.out b/tests/xfs/799.out
new file mode 100644
index 0000000000..f3fd9fa2a0
--- /dev/null
+++ b/tests/xfs/799.out
@@ -0,0 +1,2 @@
+QA output created by 799
+Silence is golden
diff --git a/tests/xfs/826 b/tests/xfs/826
new file mode 100755
index 0000000000..7660270571
--- /dev/null
+++ b/tests/xfs/826
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 826
+#
+# Race fsstress and symlink scrub for a while to see if we crash or livelock.
+# We can't open symlink files directly for scrubbing, so we use xfs_scrub(8).
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+XFS_SCRUB_PHASE=3 _scratch_xfs_stress_scrub -x 'symlink' -S '-n'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/826.out b/tests/xfs/826.out
new file mode 100644
index 0000000000..93fae86b82
--- /dev/null
+++ b/tests/xfs/826.out
@@ -0,0 +1,2 @@
+QA output created by 826
+Silence is golden
diff --git a/tests/xfs/827 b/tests/xfs/827
new file mode 100755
index 0000000000..55ec01d1e6
--- /dev/null
+++ b/tests/xfs/827
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 827
+#
+# Race fsstress and special file scrub for a while to see if we crash or
+# livelock.  We can't open special files directly for scrubbing, so we use
+# xfs_scrub(8).
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+XFS_SCRUB_PHASE=3 _scratch_xfs_stress_scrub -x 'mknod' -S '-n'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/827.out b/tests/xfs/827.out
new file mode 100644
index 0000000000..65f29d949d
--- /dev/null
+++ b/tests/xfs/827.out
@@ -0,0 +1,2 @@
+QA output created by 827
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] xfs: force rebuilding of metadata
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (60 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] fuzzy: use FORCE_REBUILD over injecting force_repair Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Darrick J. Wong
                   ` (15 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This patchset adds a new IFLAG to the scrub ioctl so that userspace can
force a rebuild of an otherwise consistent piece of metadata.  This will
eventually enable the use of online repair to relocate metadata during a
filesystem reorganization (e.g. shrink).  For now, it facilitates stress
testing of online repair without needing the debugging knobs to be
enabled.
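
Assuming the xfs_io repair command grows a -R flag for this (which is
what the fstests change below expects), forcing a rebuild by hand would
look something like:

	# probe whether the kernel supports forced rebuilds at all
	$XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT

	# force a rebuild of AG 0's free space btree even if it checks clean
	$XFS_IO_PROG -x -c 'repair -R bnobt 0' $SCRATCH_MNT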

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-force-rebuild

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-force-rebuild

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-force-rebuild
---
 common/fuzzy |   34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] fuzzy: use FORCE_REBUILD over injecting force_repair
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] xfs: force rebuilding of metadata Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-14  8:00     ` Zorro Lang
  0 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For stress testing online repair, prefer the FORCE_REBUILD ioctl flag to
the error injection knobs whenever possible, because the knobs are very
noisy and are not always available.
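
Concretely, when the 'repair -R probe' check below succeeds, the helpers
rewrite each requested repair command before handing it to xfs_io:

	# with SCRUBSTRESS_USE_FORCE_REBUILD set:
	echo 'repair bnobt %agno%' | sed -e 's/^repair/repair -R/g'
	# -> repair -R bnobt %agno%

Otherwise they fall back to the old force_repair error injection knob.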

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index f7f660bc31..14f7fdf03c 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -398,6 +398,9 @@ __stress_one_scrub_loop() {
 
 	local xfs_io_args=()
 	for arg in "$@"; do
+		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
+			arg="$(echo "$arg" | sed -e 's/^repair/repair -R/g')"
+		fi
 		if echo "$arg" | grep -q -w '%agno%'; then
 			# Substitute the AG number
 			for ((agno = 0; agno < agcount; agno++)); do
@@ -695,13 +698,21 @@ _require_xfs_stress_scrub() {
 		_notrun 'xfs scrub stress test requires common/filter'
 }
 
+# Make sure that we can force repairs either by error injection or passing
+# FORCE_REBUILD via ioctl.
+__require_xfs_stress_force_rebuild() {
+	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
+	test -z "$output" && return
+	_require_xfs_io_error_injection "force_repair"
+}
+
 # Make sure we have everything we need to run stress and online repair
 _require_xfs_stress_online_repair() {
 	_require_xfs_stress_scrub
 	_require_xfs_io_command "repair"
 	command -v _require_xfs_io_error_injection &>/dev/null || \
 		_notrun 'xfs repair stress test requires common/inject'
-	_require_xfs_io_error_injection "force_repair"
+	__require_xfs_stress_force_rebuild
 	_require_freeze
 }
 
@@ -783,7 +794,11 @@ __stress_scrub_check_commands() {
 	esac
 
 	for arg in "$@"; do
-		local cooked_arg="$(echo "$arg" | sed -e "s/%agno%/0/g")"
+		local cooked_arg="$arg"
+		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
+			cooked_arg="$(echo "$cooked_arg" | sed -e 's/^repair/repair -R/g')"
+		fi
+		cooked_arg="$(echo "$cooked_arg" | sed -e "s/%agno%/0/g")"
 		testio=`$XFS_IO_PROG -x -c "$cooked_arg" "$cooked_tgt" 2>&1`
 		echo $testio | grep -q "Unknown type" && \
 			_notrun "xfs_io scrub subcommand support is missing"
@@ -943,10 +958,23 @@ _scratch_xfs_stress_scrub() {
 	echo "Loop finished at $(date)" >> $seqres.full
 }
 
+# Decide if we're going to force repairs either by error injection or passing
+# FORCE_REBUILD via ioctl.
+__scratch_xfs_stress_setup_force_rebuild() {
+	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
+
+	if [ -z "$output" ]; then
+		export SCRUBSTRESS_USE_FORCE_REBUILD=1
+		return
+	fi
+
+	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
+}
+
 # Start online repair, freeze, and fsstress in background looping processes,
 # and wait for 30*TIME_FACTOR seconds to see if the filesystem goes down.
 # Same requirements and arguments as _scratch_xfs_stress_scrub.
 _scratch_xfs_stress_online_repair() {
-	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
+	__scratch_xfs_stress_setup_force_rebuild
 	XFS_SCRUB_FORCE_REPAIR=1 _scratch_xfs_stress_scrub "$@"
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] fstests: online repair of AG btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (61 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] xfs: force rebuilding of metadata Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] xfs: stress test ag repair functions Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of inodes Darrick J. Wong
                   ` (14 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

Now that we've spent a lot of time reworking common code in online fsck,
we're ready to start rebuilding the AG space btrees.  This series
implements repair functions for the free space, inode, and refcount
btrees.  Rebuilding the reverse mapping btree is much more intense and
is left for a subsequent patchset.  The fstests counterpart of this
patchset implements stress testing of repair.
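
The new tests follow the existing stress-test pattern, so they can be
run individually or by group, for example:

	# stress bnobt repair on its own
	./check xfs/725

	# or run every test in the new dangerous_fsstress_repair group
	./check -g dangerous_fsstress_repair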

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-ag-btrees
---
 README            |    3 ++
 common/fuzzy      |   39 +++++++++++++++++++--------
 common/rc         |    2 +
 common/xfs        |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/725     |   37 +++++++++++++++++++++++++
 tests/xfs/725.out |    2 +
 tests/xfs/726     |   37 +++++++++++++++++++++++++
 tests/xfs/726.out |    2 +
 tests/xfs/727     |   38 ++++++++++++++++++++++++++
 tests/xfs/727.out |    2 +
 tests/xfs/728     |   37 +++++++++++++++++++++++++
 tests/xfs/728.out |    2 +
 tests/xfs/729     |   37 +++++++++++++++++++++++++
 tests/xfs/729.out |    2 +
 tests/xfs/730     |   37 +++++++++++++++++++++++++
 tests/xfs/730.out |    2 +
 tests/xfs/731     |   37 +++++++++++++++++++++++++
 tests/xfs/731.out |    2 +
 18 files changed, 382 insertions(+), 13 deletions(-)
 create mode 100755 tests/xfs/725
 create mode 100644 tests/xfs/725.out
 create mode 100755 tests/xfs/726
 create mode 100644 tests/xfs/726.out
 create mode 100755 tests/xfs/727
 create mode 100644 tests/xfs/727.out
 create mode 100755 tests/xfs/728
 create mode 100644 tests/xfs/728.out
 create mode 100755 tests/xfs/729
 create mode 100644 tests/xfs/729.out
 create mode 100755 tests/xfs/730
 create mode 100644 tests/xfs/730.out
 create mode 100755 tests/xfs/731
 create mode 100644 tests/xfs/731.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] xfs: test rebuilding the entire filesystem with online fsck
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] xfs: stress test ag repair functions Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:06   ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add a new knob, TEST_XFS_SCRUB_REBUILD, that makes _check_xfs_filesystem
use xfs_scrub to rebuild the ondisk metadata after every test.
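
A sketch of a typical invocation (see the README hunk below):

	# after each test, rebuild all ondisk metadata with xfs_scrub in
	# force_repair mode, then re-verify with xfs_scrub -n and
	# xfs_repair -n
	TEST_XFS_SCRUB_REBUILD=1 ./check -g auto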

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 README       |    3 ++
 common/fuzzy |    1 +
 common/rc    |    2 +-
 common/xfs   |   77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 82 insertions(+), 1 deletion(-)


diff --git a/README b/README
index 4c4f22f853..744317625f 100644
--- a/README
+++ b/README
@@ -191,6 +191,9 @@ Extra XFS specification:
    to check the filesystem. As of August 2021, xfs_repair finds all
    filesystem corruptions found by xfs_check, and more, which means that
    xfs_check is no longer run by default.
+ - Set TEST_XFS_SCRUB_REBUILD=1 to have _check_xfs_filesystem run xfs_scrub in
+   "force_repair" mode to rebuild the filesystem; and xfs_repair -n to check
+   the results of the rebuilding.
  - xfs_scrub, if present, will always check the test and scratch
    filesystems if they are still online at the end of the test. It is no
    longer necessary to set TEST_XFS_SCRUB.
diff --git a/common/fuzzy b/common/fuzzy
index 14f7fdf03c..d8de55250d 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -975,6 +975,7 @@ __scratch_xfs_stress_setup_force_rebuild() {
 # and wait for 30*TIME_FACTOR seconds to see if the filesystem goes down.
 # Same requirements and arguments as _scratch_xfs_stress_scrub.
 _scratch_xfs_stress_online_repair() {
+	touch "$RESULT_DIR/.skip_orebuild"	# no need to test online rebuild
 	__scratch_xfs_stress_setup_force_rebuild
 	XFS_SCRUB_FORCE_REPAIR=1 _scratch_xfs_stress_scrub "$@"
 }
diff --git a/common/rc b/common/rc
index 23530413ec..a1b65f0a7f 100644
--- a/common/rc
+++ b/common/rc
@@ -1685,7 +1685,7 @@ _require_scratch_nocheck()
             exit 1
         fi
     fi
-    rm -f ${RESULT_DIR}/require_scratch
+    rm -f ${RESULT_DIR}/require_scratch "$RESULT_DIR/.skip_orebuild"
 }
 
 # we need the scratch device and it needs to not be an lvm device
diff --git a/common/xfs b/common/xfs
index 436569ba28..804047557b 100644
--- a/common/xfs
+++ b/common/xfs
@@ -692,6 +692,8 @@ _scratch_xfs_mdrestore()
 # run xfs_check and friends on a FS.
 _check_xfs_filesystem()
 {
+	local can_scrub=
+
 	if [ $# -ne 3 ]; then
 		echo "Usage: _check_xfs_filesystem device <logdev>|none <rtdev>|none" 1>&2
 		exit 1
@@ -726,6 +728,8 @@ _check_xfs_filesystem()
 	# Run online scrub if we can.
 	mntpt="$(_is_dev_mounted $device)"
 	if [ -n "$mntpt" ] && _supports_xfs_scrub "$mntpt" "$device"; then
+		can_scrub=1
+
 		# Tests can create a scenario in which a call to syncfs() issued
 		# at the end of the execution of the test script would return an
 		# error code. xfs_scrub internally calls syncfs() before
@@ -842,6 +846,79 @@ _check_xfs_filesystem()
 		_mount_or_remount_rw "$extra_mount_options" $device $mountpoint
 	fi
 
+	# If desired, test the online metadata rebuilding behavior if the
+	# filesystem was mounted when this function was called.
+	if [ -n "$TEST_XFS_SCRUB_REBUILD" ] && [ -n "$can_scrub" ] && [ ! -e "$RESULT_DIR/.skip_orebuild" ]; then
+		orebuild_ok=1
+
+		# Walk the entire directory tree to load directory blocks into
+		# memory and populate the dentry cache, which can speed up the
+		# repairs considerably when the directory tree is very large.
+		find $mntpt &>/dev/null &
+
+		XFS_SCRUB_FORCE_REPAIR=1 "$XFS_SCRUB_PROG" -v -d $mntpt > $tmp.scrub 2>&1
+		if [ $? -ne 0 ]; then
+			if grep -q 'No space left on device' $tmp.scrub; then
+				# It's not an error if the fs does not have
+				# enough space to complete a repair.  We will
+				# check everything, though.
+				echo "*** XFS_SCRUB_FORCE_REPAIR=1 xfs_scrub -v -d ran out of space ***" >> $seqres.full
+				cat $tmp.scrub >> $seqres.full
+				echo "*** end xfs_scrub output" >> $seqres.full
+			else
+				_log_err "_check_xfs_filesystem: filesystem on $device failed scrub orebuild"
+				echo "*** XFS_SCRUB_FORCE_REPAIR=1 xfs_scrub -v -d output ***" >> $seqres.full
+				cat $tmp.scrub >> $seqres.full
+				echo "*** end xfs_scrub output" >> $seqres.full
+				ok=0
+				orebuild_ok=0
+			fi
+		fi
+		rm -f $tmp.scrub
+
+		# Clear force_repair because xfs_scrub could have set it
+		$XFS_IO_PROG -x -c 'inject noerror' "$mntpt" >> $seqres.full
+
+		"$XFS_SCRUB_PROG" -v -d -n $mntpt > $tmp.scrub 2>&1
+		if [ $? -ne 0 ]; then
+			_log_err "_check_xfs_filesystem: filesystem on $device failed scrub orebuild recheck"
+			echo "*** xfs_scrub -v -d -n output ***" >> $seqres.full
+			cat $tmp.scrub >> $seqres.full
+			echo "*** end xfs_scrub output" >> $seqres.full
+			ok=0
+			orebuild_ok=0
+		fi
+		rm -f $tmp.scrub
+
+		mountpoint=`_umount_or_remount_ro $device`
+
+		$XFS_REPAIR_PROG -n $extra_options $extra_log_options $extra_rt_options $device >$tmp.repair 2>&1
+		if [ $? -ne 0 ]; then
+			_log_err "_check_xfs_filesystem: filesystem on $device is inconsistent (orebuild-reverify)"
+			echo "*** xfs_repair -n output ***"	>>$seqres.full
+			cat $tmp.repair				>>$seqres.full
+			echo "*** end xfs_repair output"	>>$seqres.full
+
+			ok=0
+			orebuild_ok=0
+		fi
+		rm -f $tmp.repair
+
+		if [ $ok -eq 0 ]; then
+			echo "*** mount output ***"		>>$seqres.full
+			_mount					>>$seqres.full
+			echo "*** end mount output"		>>$seqres.full
+		elif [ "$type" = "xfs" ]; then
+			_mount_or_remount_rw "$extra_mount_options" $device $mountpoint
+		fi
+
+		if [ "$orebuild_ok" -ne 1 ] && [ "$DUMP_CORRUPT_FS" = "1" ]; then
+			local flatdev="$(basename "$device")"
+			_xfs_metadump "$seqres.$flatdev.orebuild.md" "$device" \
+				"$logdev" compress >> $seqres.full
+		fi
+	fi
+
 	if [ $ok -eq 0 ]; then
 		status=1
 		if [ "$iam" != "check" ]; then


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] xfs: stress test ag repair functions
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] xfs: test rebuilding the entire filesystem with online fsck Darrick J. Wong
  2023-02-18  6:06   ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Race fsstress and various AG repair functions.
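
Among other things, common/fuzzy gains a new -a option so that tests can
start the %agno% substitution at an AG other than 0; for example, the
superblock test below runs:

	_scratch_xfs_stress_online_repair -a 1 -s "repair sb %agno%"

which loops "repair sb" over every AG except AG 0.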

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy      |   38 ++++++++++++++++++++++++++------------
 tests/xfs/725     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/725.out |    2 ++
 tests/xfs/726     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/726.out |    2 ++
 tests/xfs/727     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/727.out |    2 ++
 tests/xfs/728     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/728.out |    2 ++
 tests/xfs/729     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/729.out |    2 ++
 tests/xfs/730     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/730.out |    2 ++
 tests/xfs/731     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/731.out |    2 ++
 15 files changed, 300 insertions(+), 12 deletions(-)
 create mode 100755 tests/xfs/725
 create mode 100644 tests/xfs/725.out
 create mode 100755 tests/xfs/726
 create mode 100644 tests/xfs/726.out
 create mode 100755 tests/xfs/727
 create mode 100644 tests/xfs/727.out
 create mode 100755 tests/xfs/728
 create mode 100644 tests/xfs/728.out
 create mode 100755 tests/xfs/729
 create mode 100644 tests/xfs/729.out
 create mode 100755 tests/xfs/730
 create mode 100644 tests/xfs/730.out
 create mode 100755 tests/xfs/731
 create mode 100644 tests/xfs/731.out


diff --git a/common/fuzzy b/common/fuzzy
index d8de55250d..d4177c3136 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -393,7 +393,8 @@ __stress_one_scrub_loop() {
 	local runningfile="$2"
 	local scrub_tgt="$3"
 	local scrub_startat="$4"
-	shift; shift; shift; shift
+	local start_agno="$5"
+	shift; shift; shift; shift; shift
 	local agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
 
 	local xfs_io_args=()
@@ -403,7 +404,7 @@ __stress_one_scrub_loop() {
 		fi
 		if echo "$arg" | grep -q -w '%agno%'; then
 			# Substitute the AG number
-			for ((agno = 0; agno < agcount; agno++)); do
+			for ((agno = start_agno; agno < agcount; agno++)); do
 				local ag_arg="$(echo "$arg" | sed -e "s|%agno%|$agno|g")"
 				xfs_io_args+=('-c' "$ag_arg")
 			done
@@ -413,28 +414,34 @@ __stress_one_scrub_loop() {
 	done
 
 	local extra_filters=()
-	local target_cmd=(echo "$scrub_tgt")
 	case "$scrub_tgt" in
 	"%file%"|"%datafile%"|"%attrfile%")
 		extra_filters+=('No such file or directory' 'No such device or address')
-		target_cmd=(find "$SCRATCH_MNT" -print)
 		;;
 	"%dir%")
 		extra_filters+=('No such file or directory' 'Not a directory')
-		target_cmd=(find "$SCRATCH_MNT" -type d -print)
 		;;
 	"%regfile%"|"%cowfile%")
 		extra_filters+=('No such file or directory')
-		target_cmd=(find "$SCRATCH_MNT" -type f -print)
 		;;
 	esac
 
+	local target_cmd=(echo "$scrub_tgt")
+	case "$scrub_tgt" in
+	"%file%")	target_cmd=($here/src/xfsfind -q  "$SCRATCH_MNT");;
+	"%attrfile%")	target_cmd=($here/src/xfsfind -qa "$SCRATCH_MNT");;
+	"%datafile%")	target_cmd=($here/src/xfsfind -qb "$SCRATCH_MNT");;
+	"%dir%")	target_cmd=($here/src/xfsfind -qd "$SCRATCH_MNT");;
+	"%regfile%")	target_cmd=($here/src/xfsfind -qr "$SCRATCH_MNT");;
+	"%cowfile%")	target_cmd=($here/src/xfsfind -qs "$SCRATCH_MNT");;
+	esac
+
 	while __stress_scrub_running "$scrub_startat" "$runningfile"; do
 		sleep 1
 	done
 
 	while __stress_scrub_running "$end" "$runningfile"; do
-		readarray -t fnames < <("${target_cmd[@]}" 2>/dev/null)
+		readarray -t fnames < <("${target_cmd[@]}" 2>> $seqres.full)
 		for fname in "${fnames[@]}"; do
 			$XFS_IO_PROG -x "${xfs_io_args[@]}" "$fname" 2>&1 | \
 				__stress_scrub_filter_output "${extra_filters[@]}"
@@ -692,6 +699,7 @@ __stress_scrub_fsstress_loop() {
 # Make sure we have everything we need to run stress and scrub
 _require_xfs_stress_scrub() {
 	_require_xfs_io_command "scrub"
+	_require_test_program "xfsfind"
 	_require_command "$KILLALL_PROG" killall
 	_require_freeze
 	command -v _filter_scratch &>/dev/null || \
@@ -769,7 +777,8 @@ _scratch_xfs_stress_scrub_cleanup() {
 # filesystem before we start running them in a loop.
 __stress_scrub_check_commands() {
 	local scrub_tgt="$1"
-	shift
+	local start_agno="$2"
+	shift; shift
 
 	local cooked_tgt="$scrub_tgt"
 	case "$scrub_tgt" in
@@ -798,7 +807,7 @@ __stress_scrub_check_commands() {
 		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
 			cooked_arg="$(echo "$cooked_arg" | sed -e 's/^repair/repair -R/g')"
 		fi
-		cooked_arg="$(echo "$cooked_arg" | sed -e "s/%agno%/0/g")"
+		cooked_arg="$(echo "$cooked_arg" | sed -e "s/%agno%/$start_agno/g")"
 		testio=`$XFS_IO_PROG -x -c "$cooked_arg" "$cooked_tgt" 2>&1`
 		echo $testio | grep -q "Unknown type" && \
 			_notrun "xfs_io scrub subcommand support is missing"
@@ -817,6 +826,7 @@ __stress_scrub_check_commands() {
 #
 # Various options include:
 #
+# -a	For %agno% substitution, start with this AG instead of AG 0.
 # -f	Run a freeze/thaw loop while we're doing other things.  Defaults to
 #	disabled, unless XFS_SCRUB_STRESS_FREEZE is set.
 # -i	Pass this command to xfs_io to exercise something that is not scrub
@@ -867,6 +877,7 @@ _scratch_xfs_stress_scrub() {
 	local io_args=()
 	local remount_period="${XFS_SCRUB_STRESS_REMOUNT_PERIOD}"
 	local stress_tgt="${XFS_SCRUB_STRESS_TARGET:-default}"
+	local start_agno=0
 
 	__SCRUB_STRESS_FREEZE_PID=""
 	__SCRUB_STRESS_REMOUNT_LOOP=""
@@ -874,8 +885,9 @@ _scratch_xfs_stress_scrub() {
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fi:r:s:S:t:w:x:X:" c; do
+	while getopts "a:fi:r:s:S:t:w:x:X:" c; do
 		case "$c" in
+			a) start_agno="$OPTARG";;
 			f) freeze=yes;;
 			i) io_args+=("$OPTARG");;
 			r) remount_period="$OPTARG";;
@@ -889,7 +901,8 @@ _scratch_xfs_stress_scrub() {
 		esac
 	done
 
-	__stress_scrub_check_commands "$scrub_tgt" "${one_scrub_args[@]}"
+	__stress_scrub_check_commands "$scrub_tgt" "$start_agno" \
+			"${one_scrub_args[@]}"
 
 	if ! command -v "__stress_scrub_${exerciser}_loop" &>/dev/null; then
 		echo "${exerciser}: Unknown fs exercise program."
@@ -936,7 +949,8 @@ _scratch_xfs_stress_scrub() {
 
 	if [ "${#one_scrub_args[@]}" -gt 0 ]; then
 		__stress_one_scrub_loop "$end" "$runningfile" "$scrub_tgt" \
-				"$scrub_startat" "${one_scrub_args[@]}" &
+				"$scrub_startat" "$start_agno" \
+				"${one_scrub_args[@]}" &
 	fi
 
 	if [ "${#xfs_scrub_args[@]}" -gt 0 ]; then
diff --git a/tests/xfs/725 b/tests/xfs/725
new file mode 100755
index 0000000000..8466b4a77f
--- /dev/null
+++ b/tests/xfs/725
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 725
+#
+# Race fsstress and bnobt repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair bnobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/725.out b/tests/xfs/725.out
new file mode 100644
index 0000000000..128709eb38
--- /dev/null
+++ b/tests/xfs/725.out
@@ -0,0 +1,2 @@
+QA output created by 725
+Silence is golden
diff --git a/tests/xfs/726 b/tests/xfs/726
new file mode 100755
index 0000000000..4f34c69ba4
--- /dev/null
+++ b/tests/xfs/726
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 726
+#
+# Race fsstress and inobt repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair inobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/726.out b/tests/xfs/726.out
new file mode 100644
index 0000000000..40767062d2
--- /dev/null
+++ b/tests/xfs/726.out
@@ -0,0 +1,2 @@
+QA output created by 726
+Silence is golden
diff --git a/tests/xfs/727 b/tests/xfs/727
new file mode 100755
index 0000000000..d16bb3ece2
--- /dev/null
+++ b/tests/xfs/727
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 727
+#
+# Race fsstress and refcountbt repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" reflink
+_scratch_xfs_stress_online_repair -s "repair refcountbt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/727.out b/tests/xfs/727.out
new file mode 100644
index 0000000000..2de2b4b2ce
--- /dev/null
+++ b/tests/xfs/727.out
@@ -0,0 +1,2 @@
+QA output created by 727
+Silence is golden
diff --git a/tests/xfs/728 b/tests/xfs/728
new file mode 100755
index 0000000000..f0dd536d49
--- /dev/null
+++ b/tests/xfs/728
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 728
+#
+# Race fsstress and superblock repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -a 1 -s "repair sb %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/728.out b/tests/xfs/728.out
new file mode 100644
index 0000000000..ab39f45fe5
--- /dev/null
+++ b/tests/xfs/728.out
@@ -0,0 +1,2 @@
+QA output created by 728
+Silence is golden
diff --git a/tests/xfs/729 b/tests/xfs/729
new file mode 100755
index 0000000000..85d53b5f0b
--- /dev/null
+++ b/tests/xfs/729
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 729
+#
+# Race fsstress and agf repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair agf %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/729.out b/tests/xfs/729.out
new file mode 100644
index 0000000000..0f175ae2f9
--- /dev/null
+++ b/tests/xfs/729.out
@@ -0,0 +1,2 @@
+QA output created by 729
+Silence is golden
diff --git a/tests/xfs/730 b/tests/xfs/730
new file mode 100755
index 0000000000..a452016bb1
--- /dev/null
+++ b/tests/xfs/730
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 730
+#
+# Race fsstress and agfl repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair agfl %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/730.out b/tests/xfs/730.out
new file mode 100644
index 0000000000..50c3c832f0
--- /dev/null
+++ b/tests/xfs/730.out
@@ -0,0 +1,2 @@
+QA output created by 730
+Silence is golden
diff --git a/tests/xfs/731 b/tests/xfs/731
new file mode 100755
index 0000000000..7d0492a10d
--- /dev/null
+++ b/tests/xfs/731
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 731
+#
+# Race fsstress and agi repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair agi %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/731.out b/tests/xfs/731.out
new file mode 100644
index 0000000000..93b1b2692d
--- /dev/null
+++ b/tests/xfs/731.out
@@ -0,0 +1,2 @@
+QA output created by 731
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: online repair of inodes
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (62 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online repair for inode record metadata Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
                   ` (13 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

In this series, online repair gains the ability to repair inode records.
To do this, we must repair the ondisk inode and fork information enough
to pass the iget verifiers and hence make the inode igettable again.
Once that's done, we can perform higher level repairs on the incore
inode.  The fstests counterpart of this patchset implements stress
testing of repair.
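
On the fstests side, the stress loop drives the new repairer against
every path that fsstress leaves behind; roughly:

	# approximately what xfs/806 below does while fsstress runs
	$here/src/xfsfind -q "$SCRATCH_MNT" | while read -r fname; do
		$XFS_IO_PROG -x -c 'repair inode' "$fname"
	done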

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-inodes

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-inodes
---
 tests/xfs/806     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/806.out |    2 ++
 2 files changed, 40 insertions(+)
 create mode 100755 tests/xfs/806
 create mode 100644 tests/xfs/806.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: race fsstress with online repair for inode record metadata
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of inodes Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:07     ` Zorro Lang
  0 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create a test that runs the inode record repairer in the foreground and
fsstress in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/806     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/806.out |    2 ++
 2 files changed, 40 insertions(+)
 create mode 100755 tests/xfs/806
 create mode 100644 tests/xfs/806.out


diff --git a/tests/xfs/806 b/tests/xfs/806
new file mode 100755
index 0000000000..e07f9f9141
--- /dev/null
+++ b/tests/xfs/806
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 806
+#
+# Race fsstress and inode record repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair inode" -t "%file%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/806.out b/tests/xfs/806.out
new file mode 100644
index 0000000000..463bd7f008
--- /dev/null
+++ b/tests/xfs/806.out
@@ -0,0 +1,2 @@
+QA output created by 806
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (63 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of inodes Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 4/4] xfs: race fsstress with online repair for special file metadata Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota and counters Darrick J. Wong
                   ` (12 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

In this series, online repair gains the ability to rebuild data and attr
fork mappings from the reverse mapping information.  It is at this point
that we reintroduce the ability to reap file extents.

Repair of CoW forks is a little different -- on disk, CoW staging
extents are owned by the refcount btree and cannot be mapped back to
individual files.  Hence we can only detect staging extents that don't
quite look right (missing reverse mappings, shared staging extents) and
replace them with fresh allocations.
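
For reference, the per-file knob that the new test pokes looks roughly
like this (a sketch; the reflinked file path is illustrative):

  # sketch: rebuild one reflinked file's CoW fork while it's under I/O
  $XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
  $XFS_IO_PROG -x -c 'repair bmapbtc' $SCRATCH_MNT/reflinked-file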

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-file-mappings

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-file-mappings
---
 tests/xfs/746     |   85 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/746.out |    2 +
 tests/xfs/807     |   37 +++++++++++++++++++++++
 tests/xfs/807.out |    2 +
 tests/xfs/808     |   39 ++++++++++++++++++++++++
 tests/xfs/808.out |    2 +
 tests/xfs/828     |   38 ++++++++++++++++++++++++
 tests/xfs/828.out |    2 +
 tests/xfs/829     |   39 ++++++++++++++++++++++++
 tests/xfs/829.out |    2 +
 tests/xfs/840     |   72 +++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/840.out |    3 ++
 tests/xfs/846     |   39 ++++++++++++++++++++++++
 tests/xfs/846.out |    2 +
 14 files changed, 364 insertions(+)
 create mode 100755 tests/xfs/746
 create mode 100644 tests/xfs/746.out
 create mode 100755 tests/xfs/807
 create mode 100644 tests/xfs/807.out
 create mode 100755 tests/xfs/808
 create mode 100644 tests/xfs/808.out
 create mode 100755 tests/xfs/828
 create mode 100644 tests/xfs/828.out
 create mode 100755 tests/xfs/829
 create mode 100644 tests/xfs/829.out
 create mode 100755 tests/xfs/840
 create mode 100644 tests/xfs/840.out
 create mode 100755 tests/xfs/846
 create mode 100644 tests/xfs/846.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/4] xfs: test rebuilding xattrs when the data fork is btree format
  2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 4/4] xfs: race fsstress with online repair for special file metadata Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 3/4] xfs: ensure that online file data fork repairs don't hit EDQUOT Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Make sure we handle the case of rebuilding extended attributes properly
when the data fork is in btree format and we therefore cannot zap the
attr fork.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/746     |   85 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/746.out |    2 +
 2 files changed, 87 insertions(+)
 create mode 100755 tests/xfs/746
 create mode 100644 tests/xfs/746.out


diff --git a/tests/xfs/746 b/tests/xfs/746
new file mode 100755
index 0000000000..5853259e84
--- /dev/null
+++ b/tests/xfs/746
@@ -0,0 +1,85 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 746
+#
+# Make sure online repair can handle rebuilding xattrs when the data fork is
+# in btree format and we cannot just zap the attr fork.
+
+. ./common/preamble
+_begin_fstest auto quick online_repair
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+}
+
+# Import common functions.
+. ./common/inject
+. ./common/filter
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_io_error_injection "force_repair"
+_require_xfs_io_command "falloc"
+_require_xfs_io_command "repair"
+_require_test_program "punch-alternating"
+
+_scratch_mkfs > $tmp.mkfs
+_scratch_mount
+
+_supports_xfs_scrub $SCRATCH_MNT $SCRATCH_DEV || _notrun "Scrub not supported"
+
+# Force data device extents so that we can create a file with the exact bmbt
+# that we need regardless of rt configuration.
+_xfs_force_bdev data $SCRATCH_MNT
+
+file=$SCRATCH_MNT/moofile
+touch $file
+
+# Create some xattrs so that we have to rebuild them.
+$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 76' $file.txt >> $seqres.full
+$SETFATTR_PROG -n user.SGI_BCL_FILE -v "$(cat $file.txt)" $file
+
+$SETFATTR_PROG -n user.crtime_usec -v 12345678 $file
+
+blksz=$(_get_file_block_size $SCRATCH_MNT)
+ino=$(stat -c '%i' $file)
+
+# Figure out how many extents we need to have to create a data fork that's in
+# btree format.
+umount $SCRATCH_MNT
+di_forkoff=$(_scratch_xfs_db -c "inode $ino" -c "p core.forkoff" | \
+	awk '{print $3}')
+_scratch_xfs_db -c "inode $ino" -c "p" >> $seqres.full
+_scratch_mount
+
+# Create a data fork in btree format
+min_ext_for_btree=$((di_forkoff * 8 / 16))
+$XFS_IO_PROG -c "falloc 0 $(( (min_ext_for_btree + 1) * 2 * blksz))" $file
+$here/src/punch-alternating $file
+
+# Make sure the data fork is in btree format.
+umount $SCRATCH_MNT
+_scratch_xfs_db -c "inode $ino" -c "p core.format" | grep -q "btree" || \
+	echo "data fork not in btree format?"
+echo "about to start test" >> $seqres.full
+_scratch_xfs_db -c "inode $ino" -c "p" >> $seqres.full
+_scratch_mount
+
+# Force repair the xattr fork
+_scratch_inject_error force_repair
+$XFS_IO_PROG -x -c 'repair xattr' $file 2>&1 | tee $tmp.repair.log
+grep -q 'Operation not supported' $tmp.repair.log && \
+	_notrun "online xattr repair not supported"
+
+# If online repair did it correctly, the filesystem won't be corrupt.  Let the
+# post-test check do its thing.
+
+# success, all done
+echo "Silence is golden."
+status=0
+exit
diff --git a/tests/xfs/746.out b/tests/xfs/746.out
new file mode 100644
index 0000000000..365485b0b3
--- /dev/null
+++ b/tests/xfs/746.out
@@ -0,0 +1,2 @@
+QA output created by 746
+Silence is golden.


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/4] xfs: race fsstress with online repair for inode and fork metadata
  2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 3/4] xfs: ensure that online file data fork repairs don't hit EDQUOT Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:07   ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Zorro Lang
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For each XFS_SCRUB_TYPE_* that looks at inode and data/attr/cow fork
metadata, create a test that runs that repairer in the foreground and
fsstress in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/807     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/807.out |    2 ++
 tests/xfs/808     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/808.out |    2 ++
 tests/xfs/846     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/846.out |    2 ++
 6 files changed, 121 insertions(+)
 create mode 100755 tests/xfs/807
 create mode 100644 tests/xfs/807.out
 create mode 100755 tests/xfs/808
 create mode 100644 tests/xfs/808.out
 create mode 100755 tests/xfs/846
 create mode 100644 tests/xfs/846.out


diff --git a/tests/xfs/807 b/tests/xfs/807
new file mode 100755
index 0000000000..e32a37057d
--- /dev/null
+++ b/tests/xfs/807
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 807
+#
+# Race fsstress and data fork repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair bmapbtd" -t "%datafile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/807.out b/tests/xfs/807.out
new file mode 100644
index 0000000000..3752a5f715
--- /dev/null
+++ b/tests/xfs/807.out
@@ -0,0 +1,2 @@
+QA output created by 807
+Silence is golden
diff --git a/tests/xfs/808 b/tests/xfs/808
new file mode 100755
index 0000000000..378b606427
--- /dev/null
+++ b/tests/xfs/808
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 808
+#
+# Race fsstress and attr fork repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/attr
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_attrs
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -x 'xattr' -s "repair bmapbta" -t "%attrfile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/808.out b/tests/xfs/808.out
new file mode 100644
index 0000000000..8825342849
--- /dev/null
+++ b/tests/xfs/808.out
@@ -0,0 +1,2 @@
+QA output created by 808
+Silence is golden
diff --git a/tests/xfs/846 b/tests/xfs/846
new file mode 100755
index 0000000000..8388a22730
--- /dev/null
+++ b/tests/xfs/846
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 846
+#
+# Race fsstress and CoW fork repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" reflink
+_scratch_xfs_stress_online_repair -s "repair bmapbtc" -t "%cowfile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/846.out b/tests/xfs/846.out
new file mode 100644
index 0000000000..c88a3035c3
--- /dev/null
+++ b/tests/xfs/846.out
@@ -0,0 +1,2 @@
+QA output created by 846
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/4] xfs: ensure that online file data fork repairs don't hit EDQUOT
  2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 4/4] xfs: race fsstress with online repair for special file metadata Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/4] xfs: test rebuilding xattrs when the data fork is btree format Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/4] xfs: race fsstress with online repair for inode and fork metadata Darrick J. Wong
  2023-02-18  6:07   ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Zorro Lang
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add a test to ensure that the sysadmin doesn't get EDQUOT if they try to
repair file data fork metadata when we've already exceeded a quota limit
somewhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/840     |   72 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/840.out |    3 ++
 2 files changed, 75 insertions(+)
 create mode 100755 tests/xfs/840
 create mode 100644 tests/xfs/840.out


diff --git a/tests/xfs/840 b/tests/xfs/840
new file mode 100755
index 0000000000..fff41c5b8a
--- /dev/null
+++ b/tests/xfs/840
@@ -0,0 +1,72 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test 840
+#
+# Ensure that the sysadmin won't hit EDQUOT while repairing file data forks
+# even if the file's quota limits have been exceeded.  This tests the quota
+# reservation handling inside the bmap btree rebuilding code.
+#
+. ./common/preamble
+_begin_fstest online_repair
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/quota
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs xfs
+_require_xfs_io_command "falloc"
+_require_quota
+_require_user
+_require_test_program "punch-alternating"
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_qmount_option usrquota
+_qmount
+
+blocksize=$(_get_block_size $SCRATCH_MNT)
+alloc_unit=$(_get_file_block_size $SCRATCH_MNT)
+
+# Make sure we can actually repair a data fork
+scratchfile=$SCRATCH_MNT/file
+touch $scratchfile
+$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
+__stress_scrub_check_commands "$scratchfile" "" 'repair bmapbtd'
+
+# Compute the number of extent records needed to guarantee btree format,
+# assuming 16 bytes for each ondisk extent record
+bmbt_records=$(( (blocksize / 16) * 5 / 4 ))
+total_size=$(( bmbt_records * 2 * alloc_unit ))
+
+# Create a file with a data fork in bmap btree format
+$XFS_IO_PROG -c "falloc 0 $total_size" $scratchfile >> $seqres.full
+$here/src/punch-alternating $scratchfile
+
+# Set a low quota hardlimit for an unprivileged uid and chown the file to it
+echo "set up quota" >> $seqres.full
+$XFS_QUOTA_PROG -x -c "limit -u bhard=$((alloc_unit * 2)) $qa_user" $SCRATCH_MNT
+chown $qa_user $scratchfile
+$XFS_QUOTA_PROG -x -c 'report -u' $SCRATCH_MNT >> $seqres.full
+
+# Rebuild the data fork
+echo "repairs" >> $seqres.full
+$XFS_IO_PROG -x -c 'inject force_repair' -c 'repair bmapbtd' $scratchfile
+$XFS_QUOTA_PROG -x -c 'report -u' $SCRATCH_MNT >> $seqres.full
+
+# Fail at appending the file as qa_user to ensure quota enforcement works
+echo "fail quota" >> $seqres.full
+su - "$qa_user" -c "$XFS_IO_PROG -c 'pwrite 10g 1' $scratchfile" >> $seqres.full
+$XFS_QUOTA_PROG -x -c 'report -u' $SCRATCH_MNT >> $seqres.full
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/840.out b/tests/xfs/840.out
new file mode 100644
index 0000000000..8c32ec12bb
--- /dev/null
+++ b/tests/xfs/840.out
@@ -0,0 +1,3 @@
+QA output created by 840
+pwrite: Disk quota exceeded
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/4] xfs: race fsstress with online repair for special file metadata
  2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/4] xfs: test rebuilding xattrs when the data fork is btree format Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For each XFS_SCRUB_TYPE_* that looks at symbolic link and special file
metadata, create a test that runs that repairer in the foreground and
fsstress in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/828     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/828.out |    2 ++
 tests/xfs/829     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/829.out |    2 ++
 4 files changed, 81 insertions(+)
 create mode 100755 tests/xfs/828
 create mode 100644 tests/xfs/828.out
 create mode 100755 tests/xfs/829
 create mode 100644 tests/xfs/829.out


diff --git a/tests/xfs/828 b/tests/xfs/828
new file mode 100755
index 0000000000..99020e9b3c
--- /dev/null
+++ b/tests/xfs/828
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 828
+#
+# Race fsstress and symlink repair for a while to see if we crash or livelock.
+# We can't open special files directly for scrubbing, so we use xfs_scrub(8).
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+XFS_SCRUB_PHASE=3 _scratch_xfs_stress_online_repair -x 'symlink' -S '-k'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/828.out b/tests/xfs/828.out
new file mode 100644
index 0000000000..d0291290a9
--- /dev/null
+++ b/tests/xfs/828.out
@@ -0,0 +1,2 @@
+QA output created by 828
+Silence is golden
diff --git a/tests/xfs/829 b/tests/xfs/829
new file mode 100755
index 0000000000..7451f66069
--- /dev/null
+++ b/tests/xfs/829
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 829
+#
+# Race fsstress and special file repair for a while to see if we crash or
+# livelock.  We can't open special files directly for scrubbing, so we use
+# xfs_scrub(8).
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+XFS_SCRUB_PHASE=3 _scratch_xfs_stress_online_repair -x 'mknod' -S '-k'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/829.out b/tests/xfs/829.out
new file mode 100644
index 0000000000..8a43a15204
--- /dev/null
+++ b/tests/xfs/829.out
@@ -0,0 +1,2 @@
+QA output created by 829
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: online repair of quota and counters
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (64 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online scrub and repair for quota metadata Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota counters Darrick J. Wong
                   ` (11 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

XFS stores quota records and free space bitmap information in files.
Add the necessary infrastructure to enable repairing metadata inodes and
their forks, and then make it so that we can repair the file metadata
for the rtbitmap.  Repairing the bitmap contents (and the summary file)
is left for subsequent patchsets.

We also add the ability to repair the file metadata of the quota files.  As
part of these repairs, we also reinitialize the ondisk dquot records as
necessary to get the incore dquots working.  We can also correct
obviously bad dquot record attributes, but we leave checking the
resource usage counts for the next patchsets.
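
In xfs_io terms, the repairs that the new tests drive are roughly (a
sketch; targeting the mount point mirrors what the stress helper does):

  # sketch: force a rebuild of the user quota file's inode metadata
  $XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
  $XFS_IO_PROG -x -c 'repair usrquota' $SCRATCH_MNT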

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-quota

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-quota
---
 tests/xfs/809     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/809.out |    2 ++
 tests/xfs/810     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/810.out |    2 ++
 tests/xfs/811     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/811.out |    2 ++
 6 files changed, 126 insertions(+)
 create mode 100755 tests/xfs/809
 create mode 100644 tests/xfs/809.out
 create mode 100755 tests/xfs/810
 create mode 100644 tests/xfs/810.out
 create mode 100755 tests/xfs/811
 create mode 100644 tests/xfs/811.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: race fsstress with online scrub and repair for quota metadata
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota and counters Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:10     ` Zorro Lang
  2023-02-18  6:12     ` Zorro Lang
  0 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create tests that run dquot repair in the foreground while fsstress
runs in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/809     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/809.out |    2 ++
 tests/xfs/810     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/810.out |    2 ++
 tests/xfs/811     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/811.out |    2 ++
 6 files changed, 126 insertions(+)
 create mode 100755 tests/xfs/809
 create mode 100644 tests/xfs/809.out
 create mode 100755 tests/xfs/810
 create mode 100644 tests/xfs/810.out
 create mode 100755 tests/xfs/811
 create mode 100644 tests/xfs/811.out


diff --git a/tests/xfs/809 b/tests/xfs/809
new file mode 100755
index 0000000000..35ac02ff85
--- /dev/null
+++ b/tests/xfs/809
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 809
+#
+# Race fsstress and user quota repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
+_scratch_xfs_stress_online_repair -s "repair usrquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/809.out b/tests/xfs/809.out
new file mode 100644
index 0000000000..e90865ca8f
--- /dev/null
+++ b/tests/xfs/809.out
@@ -0,0 +1,2 @@
+QA output created by 809
+Silence is golden
diff --git a/tests/xfs/810 b/tests/xfs/810
new file mode 100755
index 0000000000..7387910504
--- /dev/null
+++ b/tests/xfs/810
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 810
+#
+# Race fsstress and group quota repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
+_scratch_xfs_stress_online_repair -s "repair grpquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/810.out b/tests/xfs/810.out
new file mode 100644
index 0000000000..90f12fdd21
--- /dev/null
+++ b/tests/xfs/810.out
@@ -0,0 +1,2 @@
+QA output created by 810
+Silence is golden
diff --git a/tests/xfs/811 b/tests/xfs/811
new file mode 100755
index 0000000000..1e13940b46
--- /dev/null
+++ b/tests/xfs/811
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 811
+#
+# Race fsstress and project quota repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
+_scratch_xfs_stress_online_repair -s "repair prjquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/811.out b/tests/xfs/811.out
new file mode 100644
index 0000000000..cf30f69bdc
--- /dev/null
+++ b/tests/xfs/811.out
@@ -0,0 +1,2 @@
+QA output created by 811
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: online repair of quota counters
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (65 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota and counters Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online scrub and repair for quotacheck Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of file link counts Darrick J. Wong
                   ` (10 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This series uses the inode scanner and live update hook functionality
introduced in the last patchset to implement quotacheck on a live
filesystem.  The quotacheck scrubber builds an incore copy of the
dquot resource usage counters and compares it to the live dquots to
report discrepancies.

If the user chooses to repair the quota counters, the repair function
visits each incore dquot to update the counts from the live information.
The live update hooks are key to keeping the incore copy up to date.
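
The check/repair pair that the tests exercise is roughly (sketch):

  # sketch: compare incore dquot counters against a live scan, then
  # rebuild them from the scan data if they're off
  $XFS_IO_PROG -x -c 'scrub quotacheck' $SCRATCH_MNT
  $XFS_IO_PROG -x -c 'repair quotacheck' $SCRATCH_MNT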

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-quotacheck

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-quotacheck
---
 tests/xfs/715     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/715.out |    2 ++
 tests/xfs/812     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/812.out |    2 ++
 4 files changed, 84 insertions(+)
 create mode 100755 tests/xfs/715
 create mode 100644 tests/xfs/715.out
 create mode 100755 tests/xfs/812
 create mode 100644 tests/xfs/812.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: race fsstress with online scrub and repair for quotacheck
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota counters Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:12     ` Zorro Lang
  0 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create tests that run quota count checking and repair in the foreground
while fsstress runs in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/715     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/715.out |    2 ++
 tests/xfs/812     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/812.out |    2 ++
 4 files changed, 84 insertions(+)
 create mode 100755 tests/xfs/715
 create mode 100644 tests/xfs/715.out
 create mode 100755 tests/xfs/812
 create mode 100644 tests/xfs/812.out


diff --git a/tests/xfs/715 b/tests/xfs/715
new file mode 100755
index 0000000000..eca979b297
--- /dev/null
+++ b/tests/xfs/715
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 715
+#
+# Race fsstress and quotacheck repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" any
+_scratch_xfs_stress_online_repair -s "repair quotacheck"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/715.out b/tests/xfs/715.out
new file mode 100644
index 0000000000..b5947d898b
--- /dev/null
+++ b/tests/xfs/715.out
@@ -0,0 +1,2 @@
+QA output created by 715
+Silence is golden
diff --git a/tests/xfs/812 b/tests/xfs/812
new file mode 100755
index 0000000000..f84494e392
--- /dev/null
+++ b/tests/xfs/812
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 812
+#
+# Race fsstress and quotacheck scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" any
+_scratch_xfs_stress_scrub -s "scrub quotacheck"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/812.out b/tests/xfs/812.out
new file mode 100644
index 0000000000..d8dbb15dc7
--- /dev/null
+++ b/tests/xfs/812.out
@@ -0,0 +1,2 @@
+QA output created by 812
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: online repair of file link counts
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (66 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota counters Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with inode link count check and repair Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters Darrick J. Wong
                   ` (9 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

Now that we've created the infrastructure to perform live scans of every
file in the filesystem and the necessary hook infrastructure to observe
live updates, use it to scan directories to compute the correct link
counts for files in the filesystem, and reset those link counts.

This patchset creates a tailored readdir implementation for scrub
because the regular version has to cycle ILOCKs to copy information to
userspace.  We can't cycle the ILOCK during the nlink scan and we don't
need all the other VFS support code (maintaining a readdir cursor and
translating XFS structures to VFS structures and back) so it was easier
to duplicate the code.
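
In xfs_io terms, the new scrubber and repairer are driven roughly like
this (sketch):

  # sketch: recount every file's link count, then reset any that are wrong
  $XFS_IO_PROG -x -c 'scrub nlinks' $SCRATCH_MNT
  $XFS_IO_PROG -x -c 'repair nlinks' $SCRATCH_MNT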

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-nlinks

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=scrub-nlinks
---
 tests/xfs/772     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/772.out |    2 ++
 tests/xfs/820     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/820.out |    2 ++
 4 files changed, 79 insertions(+)
 create mode 100755 tests/xfs/772
 create mode 100644 tests/xfs/772.out
 create mode 100755 tests/xfs/820
 create mode 100644 tests/xfs/820.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: race fsstress with inode link count check and repair
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of file link counts Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:13     ` Zorro Lang
  0 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Race fsstress with inode link count checking and repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/772     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/772.out |    2 ++
 tests/xfs/820     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/820.out |    2 ++
 4 files changed, 79 insertions(+)
 create mode 100755 tests/xfs/772
 create mode 100644 tests/xfs/772.out
 create mode 100755 tests/xfs/820
 create mode 100644 tests/xfs/820.out


diff --git a/tests/xfs/772 b/tests/xfs/772
new file mode 100755
index 0000000000..a00c2796c5
--- /dev/null
+++ b/tests/xfs/772
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 772
+#
+# Race fsstress and inode link count repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -x "dir" -s "repair nlinks"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/772.out b/tests/xfs/772.out
new file mode 100644
index 0000000000..98c1396896
--- /dev/null
+++ b/tests/xfs/772.out
@@ -0,0 +1,2 @@
+QA output created by 772
+Silence is golden
diff --git a/tests/xfs/820 b/tests/xfs/820
new file mode 100755
index 0000000000..58a5d4cc91
--- /dev/null
+++ b/tests/xfs/820
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 820
+#
+# Race fsstress and nlinks scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -x "dir" -s "scrub nlinks"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/820.out b/tests/xfs/820.out
new file mode 100644
index 0000000000..29ab2e2d8c
--- /dev/null
+++ b/tests/xfs/820.out
@@ -0,0 +1,2 @@
+QA output created by 820
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (67 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of file link counts Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] xfs: test fs summary counter online repair Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of rmap btrees Darrick J. Wong
                   ` (8 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

A longstanding deficiency in the online fs summary counter scrubbing
code is that it has no means to quiesce the incore percpu counters while
it's running.  There is no way to coordinate with other threads that are
reserving or freeing free space simultaneously, which leads to false
error reports.  Right now, if the discrepancy is large, we just sort of
shrug and bail out with an incomplete flag, but this is lame.

For repair activity, we actually /do/ need to stabilize the counters to
get an accurate reading and install it in the percpu counter.  To
improve the former and enable the latter, allow the fscounters online
fsck code to perform an exclusive mini-freeze on the filesystem.  The
exclusivity prevents userspace from thawing while we're running, and the
mini-freeze means that we don't wait for the log to quiesce, which will
make both speedier.
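
The tests drive that path roughly like this (sketch; the mini-freeze
happens inside the kernel call, not in userspace):

  # sketch: recheck the summary counters, then reinstall corrected values
  $XFS_IO_PROG -x -c 'scrub fscounters' $SCRATCH_MNT
  $XFS_IO_PROG -x -c 'repair fscounters' $SCRATCH_MNT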

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-fscounters
---
 tests/xfs/713     |   36 ++++++++++++++++++++++++++++++++++++
 tests/xfs/713.out |    4 ++++
 tests/xfs/714     |   41 +++++++++++++++++++++++++++++++++++++++++
 tests/xfs/714.out |    2 ++
 tests/xfs/762     |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/762.out |    2 ++
 6 files changed, 131 insertions(+)
 create mode 100755 tests/xfs/713
 create mode 100644 tests/xfs/713.out
 create mode 100755 tests/xfs/714
 create mode 100644 tests/xfs/714.out
 create mode 100755 tests/xfs/762
 create mode 100644 tests/xfs/762.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] xfs: test fs summary counter online repair
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] xfs: race fsstress with online repair for summary counters Darrick J. Wong
  2023-02-18  6:14   ` [PATCHSET v24.0 0/2] fstests: online repair for fs " Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Fuzz the fs summary counters in the primary super and see if online
repair can fix them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/713     |   36 ++++++++++++++++++++++++++++++++++++
 tests/xfs/713.out |    4 ++++
 2 files changed, 40 insertions(+)
 create mode 100755 tests/xfs/713
 create mode 100644 tests/xfs/713.out


diff --git a/tests/xfs/713 b/tests/xfs/713
new file mode 100755
index 0000000000..7ac6d1458f
--- /dev/null
+++ b/tests/xfs/713
@@ -0,0 +1,36 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 713
+#
+# Populate a XFS filesystem and fuzz every fscounter field.
+# Use xfs_scrub to fix the corruption.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_online_repair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz fscounters"
+test -z "$SCRATCH_XFS_LIST_METADATA_FIELDS" &&
+	SCRATCH_XFS_LIST_METADATA_FIELDS='icount,ifree,fdblocks'
+export SCRATCH_XFS_LIST_METADATA_FIELDS
+_scratch_xfs_fuzz_metadata '' 'online' 'sb 0' >> $seqres.full
+echo "Done fuzzing fscounters"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/713.out b/tests/xfs/713.out
new file mode 100644
index 0000000000..6dd322f4ca
--- /dev/null
+++ b/tests/xfs/713.out
@@ -0,0 +1,4 @@
+QA output created by 713
+Format and populate
+Fuzz fscounters
+Done fuzzing fscounters


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] xfs: race fsstress with online repair for summary counters
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] xfs: test fs summary counter online repair Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:14   ` [PATCHSET v24.0 0/2] fstests: online repair for fs " Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create tests that run fs summary counter repair in the foreground while
fsstress runs in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/714     |   41 +++++++++++++++++++++++++++++++++++++++++
 tests/xfs/714.out |    2 ++
 tests/xfs/762     |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/762.out |    2 ++
 4 files changed, 91 insertions(+)
 create mode 100755 tests/xfs/714
 create mode 100644 tests/xfs/714.out
 create mode 100755 tests/xfs/762
 create mode 100644 tests/xfs/762.out


diff --git a/tests/xfs/714 b/tests/xfs/714
new file mode 100755
index 0000000000..c1b6cd919a
--- /dev/null
+++ b/tests/xfs/714
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 714
+#
+# Race fsstress and fscounter repair for a while to see if we crash or livelock.
+# Summary counter repair requires us to freeze the filesystem to stop all
+# filesystem activity, so we can't have userspace wandering in and thawing it.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+# Override the default cleanup function.
+_cleanup()
+{
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -rf $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair fscounters"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/714.out b/tests/xfs/714.out
new file mode 100644
index 0000000000..f5ce748b71
--- /dev/null
+++ b/tests/xfs/714.out
@@ -0,0 +1,2 @@
+QA output created by 714
+Silence is golden
diff --git a/tests/xfs/762 b/tests/xfs/762
new file mode 100755
index 0000000000..0f70b632a3
--- /dev/null
+++ b/tests/xfs/762
@@ -0,0 +1,46 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 762
+#
+# Race fsstress and fscounter repair on the realtime device for a while to see
+# if we crash or livelock.  Summary counter repair requires us to freeze the
+# filesystem to stop all filesystem activity, so we can't have userspace
+# wandering in and thawing it.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair realtime
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_realtime
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+
+# Force all files to be allocated on the realtime device
+_xfs_force_bdev realtime $SCRATCH_MNT
+_scratch_xfs_stress_online_repair -s 'scrub fscounters' -s "repair fscounters"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/762.out b/tests/xfs/762.out
new file mode 100644
index 0000000000..fbaeb29706
--- /dev/null
+++ b/tests/xfs/762.out
@@ -0,0 +1,2 @@
+QA output created by 762
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: online repair of rmap btrees
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (68 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] xfs/422: don't freeze while racing rmap repair and fsstress Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Darrick J. Wong
                   ` (7 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

We have now constructed the four tools that we need to scan the
filesystem looking for reverse mappings: an inode scanner, hooks to
receive live updates from other writer threads, the ability to construct
btrees in memory, and a btree bulk loader.

This series glues those four together, enabling us to scan the
filesystem for mappings and keep it up to date while other writers run,
and then commit the new btree to disk atomically.

To reduce the size of each patch, the functionality is left disabled
until the end of the series and broken up into three patches: one to
create the mechanics of scanning the filesystem, a second to transition
to in-memory btrees, and a third to set up the live hooks.
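
The end result is a per-AG repair call, roughly (sketch; the stress
helper substitutes each AG number for %agno% in turn):

  # sketch: rebuild AG 0's rmap btree while the filesystem stays mounted
  $XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
  $XFS_IO_PROG -x -c 'repair rmapbt 0' $SCRATCH_MNT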

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-rmap-btree

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-rmap-btree
---
 tests/xfs/422 |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs/422: don't freeze while racing rmap repair and fsstress
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of rmap btrees Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:15     ` Zorro Lang
  0 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Since we're moving away from freezing the filesystem for rmap repair,
remove the freeze/thaw race from this test to make it more interesting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/422 |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)


diff --git a/tests/xfs/422 b/tests/xfs/422
index 995f612166..339f12976a 100755
--- a/tests/xfs/422
+++ b/tests/xfs/422
@@ -5,8 +5,6 @@
 # FS QA Test No. 422
 #
 # Race fsstress and rmapbt repair for a while to see if we crash or livelock.
-# rmapbt repair requires us to freeze the filesystem to stop all filesystem
-# activity, so we can't have userspace wandering in and thawing it.
 #
 . ./common/preamble
 _begin_fstest online_repair dangerous_fsstress_repair freeze
@@ -31,7 +29,7 @@ _require_xfs_stress_online_repair
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
 _require_xfs_has_feature "$SCRATCH_MNT" rmapbt
-_scratch_xfs_stress_online_repair -f -s "repair rmapbt %agno%"
+_scratch_xfs_stress_online_repair -s "repair rmapbt %agno%"
 
 # success, all done
 echo Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (69 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of rmap btrees Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] populate: take a snapshot of the filesystem if creation fails Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                   ` (6 subsequent siblings)
  77 siblings, 3 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

Before we start on an intense patchset of improving the XFS fuzz testing
framework, let's fix a couple of bugs in the code that creates sample
filesystems with all types of metadata.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fix-populate-problems
---
 common/populate |   70 ++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 46 insertions(+), 24 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] populate: take a snapshot of the filesystem if creation fails
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] populate: fix some weirdness in __populate_check_xfs_agbtree_height Darrick J. Wong
  2023-02-18  6:16   ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

There have been a few bug reports filed about people not being able to
use the filesystem metadata population code to create filesystems with
all types of metadata on them.  Right now this is super-annoying to
debug because we don't capture a metadump for easy debugging.  Fix that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/populate |   59 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 19 deletions(-)


diff --git a/common/populate b/common/populate
index 44b4af1667..e4090a29d3 100644
--- a/common/populate
+++ b/common/populate
@@ -40,6 +40,27 @@ __populate_create_file() {
 	$XFS_IO_PROG -f -c "pwrite -S 0x62 -W -b 1m 0 $sz" "${fname}"
 }
 
+# Fail the test if we failed to create some kind of filesystem metadata.
+# Create a metadata dump of the failed filesystem so that we can analyze
+# how things went wrong.
+__populate_fail() {
+	local flatdev="$(basename "$SCRATCH_DEV")"
+	local metadump="$seqres.$flatdev.populate.md"
+
+	case "$FSTYP" in
+	xfs)
+		_scratch_unmount
+		_scratch_xfs_metadump "$metadump"
+		;;
+	ext4)
+		_scratch_unmount
+		_ext4_metadump "${SCRATCH_DEV}" "$metadump"
+		;;
+	esac
+
+	_fail "$@"
+}
+
 # Punch out every other hole in this file, if it exists.
 #
 # The goal here is to force the creation of a large number of metadata records
@@ -501,7 +522,7 @@ __populate_check_xfs_dformat() {
 	format="$2"
 
 	fmt="$(_scratch_xfs_db -c "inode ${inode}" -c 'p core.format' | sed -e 's/^.*(\([a-z]*\)).*$/\1/g')"
-	test "${format}" = "${fmt}" || _fail "failed to create ino ${inode} dformat expected ${format} saw ${fmt}"
+	test "${format}" = "${fmt}" || __populate_fail "failed to create ino ${inode} dformat expected ${format} saw ${fmt}"
 }
 
 # Check attr fork format of XFS file
@@ -510,7 +531,7 @@ __populate_check_xfs_aformat() {
 	format="$2"
 
 	fmt="$(_scratch_xfs_db -c "inode ${inode}" -c 'p core.aformat' | sed -e 's/^.*(\([a-z]*\)).*$/\1/g')"
-	test "${format}" = "${fmt}" || _fail "failed to create ino ${inode} aformat expected ${format} saw ${fmt}"
+	test "${format}" = "${fmt}" || __populate_fail "failed to create ino ${inode} aformat expected ${format} saw ${fmt}"
 }
 
 # Check structure of XFS directory
@@ -529,21 +550,21 @@ __populate_check_xfs_dir() {
 
 	case "${dtype}" in
 	"shortform"|"inline"|"local")
-		(test "${datab}" -eq 0 && test "${leafb}" -eq 0 && test "${freeb}" -eq 0) || _fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
+		(test "${datab}" -eq 0 && test "${leafb}" -eq 0 && test "${freeb}" -eq 0) || __populate_fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
 		;;
 	"block")
-		(test "${datab}" -eq 1 && test "${leafb}" -eq 0 && test "${freeb}" -eq 0) || _fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
+		(test "${datab}" -eq 1 && test "${leafb}" -eq 0 && test "${freeb}" -eq 0) || __populate_fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
 		;;
 	"leaf")
-		(test "${datab}" -eq 1 && test "${leafb}" -eq 1 && test "${freeb}" -eq 0) || _fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
+		(test "${datab}" -eq 1 && test "${leafb}" -eq 1 && test "${freeb}" -eq 0) || __populate_fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
 		;;
 	"leafn")
 		_scratch_xfs_db -x -c "inode ${inode}" -c "dblock ${leaf_lblk}" -c "p lhdr.info.hdr.magic" | grep -q '0x3dff' && return
 		_scratch_xfs_db -x -c "inode ${inode}" -c "dblock ${leaf_lblk}" -c "p lhdr.info.magic" | grep -q '0xd2ff' && return
-		_fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
+		__populate_fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
 		;;
 	"node"|"btree")
-		(test "${datab}" -eq 1 && test "${leafb}" -eq 1 && test "${freeb}" -eq 1) || _fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
+		(test "${datab}" -eq 1 && test "${leafb}" -eq 1 && test "${freeb}" -eq 1) || __populate_fail "failed to create ${dtype} dir ino ${inode} datab ${datab} leafb ${leafb} freeb ${freeb}"
 		;;
 	*)
 		_fail "Unknown directory type ${dtype}"
@@ -563,13 +584,13 @@ __populate_check_xfs_attr() {
 
 	case "${atype}" in
 	"shortform"|"inline"|"local")
-		(test "${datab}" -eq 0 && test "${leafb}" -eq 0) || _fail "failed to create ${atype} attr ino ${inode} datab ${datab} leafb ${leafb}"
+		(test "${datab}" -eq 0 && test "${leafb}" -eq 0) || __populate_fail "failed to create ${atype} attr ino ${inode} datab ${datab} leafb ${leafb}"
 		;;
 	"leaf")
-		(test "${datab}" -eq 1 && test "${leafb}" -eq 0) || _fail "failed to create ${atype} attr ino ${inode} datab ${datab} leafb ${leafb}"
+		(test "${datab}" -eq 1 && test "${leafb}" -eq 0) || __populate_fail "failed to create ${atype} attr ino ${inode} datab ${datab} leafb ${leafb}"
 		;;
 	"node"|"btree")
-		(test "${datab}" -eq 1 && test "${leafb}" -eq 1) || _fail "failed to create ${atype} attr ino ${inode} datab ${datab} leafb ${leafb}"
+		(test "${datab}" -eq 1 && test "${leafb}" -eq 1) || __populate_fail "failed to create ${atype} attr ino ${inode} datab ${datab} leafb ${leafb}"
 		;;
 	*)
 		_fail "Unknown attribute type ${atype}"
@@ -605,7 +626,7 @@ __populate_check_xfs_agbtree_height() {
 			return 100
 		fi
 	done
-	test $? -eq 100 || _fail "Failed to create ${bt_type} of sufficient height!"
+	test $? -eq 100 || __populate_fail "Failed to create ${bt_type} of sufficient height!"
 	return 1
 }
 
@@ -678,13 +699,13 @@ __populate_check_ext4_dformat() {
 
 	case "${format}" in
 	"blockmap")
-		test "${extents}" -eq 0 || _fail "failed to create ino ${inode} with blockmap"
+		test "${extents}" -eq 0 || __populate_fail "failed to create ino ${inode} with blockmap"
 		;;
 	"extent"|"extents")
-		test "${extents}" -eq 1 || _fail "failed to create ino ${inode} with extents"
+		test "${extents}" -eq 1 || __populate_fail "failed to create ino ${inode} with extents"
 		;;
 	"etree")
-		(test "${extents}" -eq 1 && test "${etree}" -eq 1) || _fail "failed to create ino ${inode} with extent tree"
+		(test "${extents}" -eq 1 && test "${etree}" -eq 1) || __populate_fail "failed to create ino ${inode} with extent tree"
 		;;
 	*)
 		_fail "Unknown dformat ${format}"
@@ -702,10 +723,10 @@ __populate_check_ext4_aformat() {
 
 	case "${format}" in
 	"local"|"inline")
-		test "${ablock}" -eq 0 || _fail "failed to create inode ${inode} with ${format} xattr"
+		test "${ablock}" -eq 0 || __populate_fail "failed to create inode ${inode} with ${format} xattr"
 		;;
 	"block")
-		test "${extents}" -eq 1 || _fail "failed to create inode ${inode} with ${format} xattr"
+		test "${extents}" -eq 1 || __populate_fail "failed to create inode ${inode} with ${format} xattr"
 		;;
 	*)
 		_fail "Unknown aformat ${format}"
@@ -726,13 +747,13 @@ __populate_check_ext4_dir() {
 
 	case "${dtype}" in
 	"inline")
-		(test "${inline}" -eq 1 && test "${htree}" -eq 0) || _fail "failed to create ${dtype} dir ino ${inode} htree ${htree} inline ${inline}"
+		(test "${inline}" -eq 1 && test "${htree}" -eq 0) || __populate_fail "failed to create ${dtype} dir ino ${inode} htree ${htree} inline ${inline}"
 		;;
 	"block")
-		(test "${inline}" -eq 0 && test "${htree}" -eq 0) || _fail "failed to create ${dtype} dir ino ${inode} htree ${htree} inline ${inline}"
+		(test "${inline}" -eq 0 && test "${htree}" -eq 0) || __populate_fail "failed to create ${dtype} dir ino ${inode} htree ${htree} inline ${inline}"
 		;;
 	"htree")
-		(test "${inline}" -eq 0 && test "${htree}" -eq 1) || _fail "failed to create ${dtype} dir ino ${inode} htree ${htree} inline ${inline}"
+		(test "${inline}" -eq 0 && test "${htree}" -eq 1) || __populate_fail "failed to create ${dtype} dir ino ${inode} htree ${htree} inline ${inline}"
 		;;
 	*)
 		_fail "Unknown directory type ${dtype}"


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] populate: fix some weirdness in __populate_check_xfs_agbtree_height
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] populate: take a snapshot of the filesystem if creation fails Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-18  6:16   ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Use a for loop to scan the AGs, and make all the variables local like
you'd expect them to be.
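
For the record, the "weirdness": the old seq|while construct ran the
loop body in a pipeline subshell, so the early return never escaped the
function.  A minimal bash illustration (not part of the patch):

	buggy() {
		seq 0 3 | while read ag; do
			return 100	# only exits the pipeline's subshell
		done
		echo "still in buggy(); pipeline status was $?"
		return 1
	}
	buggy; echo "buggy() returned $?"
	# prints: still in buggy(); pipeline status was 100
	#         buggy() returned 1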

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/populate |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)


diff --git a/common/populate b/common/populate
index e4090a29d3..29ea637ecb 100644
--- a/common/populate
+++ b/common/populate
@@ -599,8 +599,8 @@ __populate_check_xfs_attr() {
 
 # Check that there's at least one per-AG btree with multiple levels
 __populate_check_xfs_agbtree_height() {
-	bt_type="$1"
-	nr_ags=$(_scratch_xfs_db -c 'sb 0' -c 'p agcount' | awk '{print $3}')
+	local bt_type="$1"
+	local agcount=$(_scratch_xfs_db -c 'sb 0' -c 'p agcount' | awk '{print $3}')
 
 	case "${bt_type}" in
 	"bno"|"cnt"|"rmap"|"refcnt")
@@ -620,13 +620,14 @@ __populate_check_xfs_agbtree_height() {
 		;;
 	esac
 
-	seq 0 $((nr_ags - 1)) | while read ag; do
-		bt_level=$(_scratch_xfs_db -c "${hdr} ${ag}" -c "p ${bt_prefix}level" | awk '{print $3}')
+	for ((agno = 0; agno < agcount; agno++)); do
+		bt_level=$(_scratch_xfs_db -c "${hdr} ${agno}" -c "p ${bt_prefix}level" | awk '{print $3}')
+		# "level" is really the btree height
 		if [ "${bt_level}" -gt 1 ]; then
-			return 100
+			return 0
 		fi
 	done
-	test $? -eq 100 || __populate_fail "Failed to create ${bt_type} of sufficient height!"
+	__populate_fail "Failed to create ${bt_type} of sufficient height!"
 	return 1
 }
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (70 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 01/24] fuzzy: disable per-field random fuzzing by default Darrick J. Wong
                     ` (23 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
                   ` (5 subsequent siblings)
  77 siblings, 24 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

There are a ton of improvements to the XFS fuzzing code in this update.
We start by disabling by default two parts of the fuzz testing that
don't lead to predictable golden output: fuzzing with the 'random' verb,
and fuzzing the 'LSN' field.

Next, refactor the inner fuzzing loop so that each of the four repair
strategies is broken out into a separate function, as are the
post-repair attempts to modify the filesystem.  This splitting makes it
much easier to fix some longstanding problems with the fuzzer logic.  We
also revise the strategies a bit so that they more accurately reflect
the intended usage patterns of the repair programs.

Then, strengthen other parts of the fuzzing -- make the post-repair
modification exercises a bit more strenuous by running fsstress on the
repaired fs; add an evaluation of xfs_check vs. xfs_repair; and make it
possible to check the xfs health reporting system.  The xfs_check
comparison was key to the decision to disable xfs_check in fstests in
2021.

Finally, improve the array handling of the xfs fuzz tests so that we
actually know about array indices as a first class concept, instead of
the current mucking around we do with regular expressions.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements
---
 common/fuzzy  |  513 ++++++++++++++++++++++++++++++++++++++++++++-------------
 common/xfs    |   80 +++++++++
 tests/xfs/354 |    7 +
 tests/xfs/355 |    7 +
 tests/xfs/358 |    5 -
 tests/xfs/359 |    5 -
 tests/xfs/360 |    5 -
 tests/xfs/361 |    5 -
 tests/xfs/362 |    5 -
 tests/xfs/363 |    5 -
 tests/xfs/364 |    5 -
 tests/xfs/365 |    5 -
 tests/xfs/366 |    5 -
 tests/xfs/367 |    5 -
 tests/xfs/368 |    5 -
 tests/xfs/369 |    5 -
 tests/xfs/370 |    5 -
 tests/xfs/371 |    5 -
 tests/xfs/372 |    5 -
 tests/xfs/373 |    7 +
 tests/xfs/410 |    5 -
 tests/xfs/411 |    5 -
 tests/xfs/455 |    7 +
 tests/xfs/457 |    5 -
 tests/xfs/458 |    5 -
 tests/xfs/459 |    5 -
 tests/xfs/460 |    5 -
 tests/xfs/461 |    5 -
 tests/xfs/462 |    5 -
 tests/xfs/463 |    5 -
 tests/xfs/464 |    7 +
 tests/xfs/483 |    5 -
 32 files changed, 606 insertions(+), 147 deletions(-)


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 01/24] fuzzy: disable per-field random fuzzing by default
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 02/24] fuzzy: disable timestamp " Darrick J. Wong
                     ` (22 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Don't run the random fuzzer by default so that we can try to stabilize
the output somewhat.
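
If a one-off run really does want the random verb back, the verb list
can still be pinned by hand via the existing override in this helper
(invocation below is just an illustration, assuming the
SCRATCH_XFS_LIST_FUZZ_VERBS hook):

	SCRATCH_XFS_LIST_FUZZ_VERBS="random" ./check xfs/355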

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index d4177c3136..cd6e2a0e08 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -284,7 +284,8 @@ _scratch_xfs_list_fuzz_verbs() {
 		return;
 	fi
 	_scratch_xfs_db -x -c 'sb 0' -c 'fuzz' | grep '^Fuzz commands:' | \
-		sed -e 's/[,.]//g' -e 's/Fuzz commands: //g' -e 's/ /\n/g'
+		sed -e 's/[,.]//g' -e 's/Fuzz commands: //g' -e 's/ /\n/g' | \
+		grep -v '^random$'
 }
 
 # Fuzz some of the fields of some piece of metadata


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 02/24] fuzzy: disable timestamp fuzzing by default
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 01/24] fuzzy: disable per-field random fuzzing by default Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 07/24] fuzzy: don't fuzz xattr namespace flags and values Darrick J. Wong
                     ` (21 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Don't fuzz timestamps since all bit patterns are valid and XFS itself
does not perform any validation on them.
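
A quick way to sanity-check the new filter (field names below are purely
illustrative):

	printf '%s\n' core.atime.sec core.mtime.nsec core.size | \
		__filter_unvalidated_xfs_db_fields
	# only "core.size" should come out the other end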

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index cd6e2a0e08..2798c257a0 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -73,6 +73,14 @@ __filter_xfs_db_keys() {
 	    -e '/pad/d'
 }
 
+# Filter out metadata fields that are completely controlled by userspace
+# or are arbitrary bit sequences.  In other words, fields where the filesystem
+# does no validation.
+__filter_unvalidated_xfs_db_fields() {
+	sed -e '/\.sec/d' \
+	    -e '/\.nsec/d'
+}
+
 # Filter the xfs_db print command's field debug information
 # into field name and type.
 __filter_xfs_db_print_fields() {
@@ -91,7 +99,7 @@ __filter_xfs_db_print_fields() {
 		else
 			echo "${fuzzkey}"
 		fi
-	done | grep -E "${filter}"
+	done | grep -E "${filter}" | __filter_unvalidated_xfs_db_fields
 }
 
 # Navigate to some part of the filesystem and print the field info.


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 03/24] fuzzy: don't fuzz the log sequence number
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 05/24] fuzzy: don't fuzz inode generation numbers Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 06/24] fuzzy: don't fuzz user-controllable inode flags Darrick J. Wong
                     ` (18 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Don't bother fuzzing log sequence numbers since xfs_db doesn't have
the ability to tell us the range of LSNs that would actually cause
validation failures.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 2798c257a0..677e655d68 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -78,7 +78,9 @@ __filter_xfs_db_keys() {
 # does no validation.
 __filter_unvalidated_xfs_db_fields() {
 	sed -e '/\.sec/d' \
-	    -e '/\.nsec/d'
+	    -e '/\.nsec/d' \
+	    -e '/^lsn$/d' \
+	    -e '/\.lsn/d'
 }
 
 # Filter the xfs_db print command's field debug information


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 04/24] fuzzy: don't fuzz obsolete inode fields
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 06/24] fuzzy: don't fuzz user-controllable inode flags Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 12/24] common/fuzzy: fix some problems with the offline repair strategy Darrick J. Wong
                     ` (16 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

We don't really care about inode fields that were used in V4
(deprecated) or DMAPI (unsupported), so don't bother fuzzing them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 677e655d68..d183052a14 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -80,7 +80,10 @@ __filter_unvalidated_xfs_db_fields() {
 	sed -e '/\.sec/d' \
 	    -e '/\.nsec/d' \
 	    -e '/^lsn$/d' \
-	    -e '/\.lsn/d'
+	    -e '/\.lsn/d' \
+	    -e '/^core.flushiter/d' \
+	    -e '/^core.dmevmask/d' \
+	    -e '/^core.dmstate/d'
 }
 
 # Filter the xfs_db print command's field debug information


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 05/24] fuzzy: don't fuzz inode generation numbers
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 07/24] fuzzy: don't fuzz xattr namespace flags and values Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 03/24] fuzzy: don't fuzz the log sequence number Darrick J. Wong
                     ` (19 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

The inode generation number is a randomly selected 32-bit integer that
isn't itself validated anywhere.  No need to fuzz that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index d183052a14..0d7e60a011 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -83,7 +83,8 @@ __filter_unvalidated_xfs_db_fields() {
 	    -e '/\.lsn/d' \
 	    -e '/^core.flushiter/d' \
 	    -e '/^core.dmevmask/d' \
-	    -e '/^core.dmstate/d'
+	    -e '/^core.dmstate/d' \
+	    -e '/^core.gen/d'
 }
 
 # Filter the xfs_db print command's field debug information


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 06/24] fuzzy: don't fuzz user-controllable inode flags
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 03/24] fuzzy: don't fuzz the log sequence number Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 04/24] fuzzy: don't fuzz obsolete inode fields Darrick J. Wong
                     ` (17 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Don't fuzz the inode flags that are controlled by userspace and don't
actually have any other effects on the ondisk metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 0d7e60a011..6f5083041a 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -84,7 +84,15 @@ __filter_unvalidated_xfs_db_fields() {
 	    -e '/^core.flushiter/d' \
 	    -e '/^core.dmevmask/d' \
 	    -e '/^core.dmstate/d' \
-	    -e '/^core.gen/d'
+	    -e '/^core.gen/d' \
+	    -e '/^core.prealloc/d' \
+	    -e '/^core.immutable/d' \
+	    -e '/^core.append/d' \
+	    -e '/^core.sync/d' \
+	    -e '/^core.noatime/d' \
+	    -e '/^core.nodump/d' \
+	    -e '/^core.nodefrag/d' \
+	    -e '/^v3.dax/d'
 }
 
 # Filter the xfs_db print command's field debug information


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 07/24] fuzzy: don't fuzz xattr namespace flags and values
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 01/24] fuzzy: disable per-field random fuzzing by default Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 02/24] fuzzy: disable timestamp " Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 05/24] fuzzy: don't fuzz inode generation numbers Darrick J. Wong
                     ` (20 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Extended attribute namespace flags are controlled by userspace, and
there is no validation imposed on the values.  Don't bother fuzzing
either of these things.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 6f5083041a..866901931e 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -92,7 +92,13 @@ __filter_unvalidated_xfs_db_fields() {
 	    -e '/^core.noatime/d' \
 	    -e '/^core.nodump/d' \
 	    -e '/^core.nodefrag/d' \
-	    -e '/^v3.dax/d'
+	    -e '/^v3.dax/d' \
+	    -e '/^nvlist.*value/d' \
+	    -e '/^entries.*root/d' \
+	    -e '/^entries.*secure/d' \
+	    -e '/^a.sfattr.list.*value/d' \
+	    -e '/^a.sfattr.list.*root/d' \
+	    -e '/^a.sfattr.list.*secure/d'
 }
 
 # Filter the xfs_db print command's field debug information


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 08/24] common/fuzzy: split out each repair strategy into a separate helper
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (8 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 11/24] common/fuzzy: fix some problems with the online " Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 10/24] common/fuzzy: hoist the post-repair fs modification step Darrick J. Wong
                     ` (13 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Refactor __scratch_xfs_fuzz_field_test to split out each repair strategy
into a separate helper function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |  212 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 166 insertions(+), 46 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 866901931e..ef42336fa6 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -194,6 +194,148 @@ __fuzz_notify() {
 	test -w /dev/ttyprintk && echo "$@" >> /dev/ttyprintk
 }
 
+# Perform the online repair part of a fuzz test.
+__scratch_xfs_fuzz_field_online() {
+	local fuzz_action="$1"
+
+	# Mount or else we can't do anything online
+	echo "+ Mount filesystem to try online repair"
+	_try_scratch_mount 2>&1
+	res=$?
+	if [ $res -ne 0 ]; then
+		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
+		return 0
+	fi
+
+	# Make sure online scrub will catch whatever we fuzzed
+	echo "++ Online scrub"
+	_scratch_scrub -n -a 1 -e continue 2>&1
+	res=$?
+	test $res -eq 0 && \
+		(>&2 echo "scrub didn't fail with ${fuzz_action}.")
+
+	# Try fixing the filesystem online
+	__fuzz_notify "++ Try to repair filesystem online"
+	_scratch_scrub 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "online repair failed ($res) with ${fuzz_action}.")
+
+	__scratch_xfs_fuzz_unmount
+
+	# Offline scrub should pass now
+	echo "+ Make sure error is gone (offline)"
+	_scratch_xfs_repair -n 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "offline re-scrub failed ($res) with ${fuzz_action}.")
+
+	return 0
+}
+
+# Perform the offline repair part of a fuzz test.
+__scratch_xfs_fuzz_field_offline() {
+	local fuzz_action="$1"
+
+	# Mount or else we can't do anything offline
+	echo "+ Mount filesystem to try offline repair"
+	_try_scratch_mount 2>&1
+	res=$?
+	if [ $res -ne 0 ]; then
+		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
+		return 0
+	fi
+
+	# Make sure online scrub will catch whatever we fuzzed
+	echo "++ Online scrub"
+	_scratch_scrub -n -a 1 -e continue 2>&1
+	res=$?
+	test $res -eq 0 && \
+		(>&2 echo "scrub didn't fail with ${fuzz_action}.")
+
+	__scratch_xfs_fuzz_unmount
+
+	# Repair the filesystem offline
+	echo "+ Try to repair the filesystem offline"
+	_repair_scratch_fs 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "offline repair failed ($res) with ${fuzz_action}.")
+
+	# See if repair finds a clean fs
+	echo "+ Make sure error is gone (offline)"
+	_scratch_xfs_repair -n 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "offline re-scrub failed ($res) with ${fuzz_action}.")
+
+	return 0
+}
+
+# Perform the no-repair part of a fuzz test.
+__scratch_xfs_fuzz_field_norepair() {
+	local fuzz_action="$1"
+
+	# Mount or else we can't do anything in norepair mode
+	echo "+ Mount filesystem to try no repair"
+	_try_scratch_mount 2>&1
+	res=$?
+	if [ $res -ne 0 ]; then
+		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
+		return 0
+	fi
+
+	__scratch_xfs_fuzz_unmount
+
+	return 0
+}
+
+# Perform the online-then-offline repair part of a fuzz test.
+__scratch_xfs_fuzz_field_both() {
+	local fuzz_action="$1"
+
+	# Mount or else we can't do anything in both repair mode
+	echo "+ Mount filesystem to try both repairs"
+	_try_scratch_mount 2>&1
+	res=$?
+	if [ $res -ne 0 ]; then
+		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
+		return 0
+	fi
+
+	# Make sure online scrub will catch whatever we fuzzed
+	echo "++ Online scrub"
+	_scratch_scrub -n -a 1 -e continue 2>&1
+	res=$?
+	test $res -eq 0 && \
+		(>&2 echo "online scrub didn't fail with ${fuzz_action}.")
+
+	# Try fixing the filesystem online
+	__fuzz_notify "++ Try to repair filesystem online"
+	_scratch_scrub 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "online repair failed ($res) with ${fuzz_action}.")
+
+	__scratch_xfs_fuzz_unmount
+
+	# Repair the filesystem offline?
+	echo "+ Try to repair the filesystem offline"
+	_repair_scratch_fs 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "offline repair failed ($res) with ${fuzz_action}.")
+
+	# See if repair finds a clean fs
+	echo "+ Make sure error is gone (offline)"
+	_scratch_xfs_repair -n 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "offline re-scrub ($res) with ${fuzz_action}.")
+
+	return 0
+}
+
 # Fuzz one field of some piece of metadata.
 # First arg is the field name
 # Second arg is the fuzz verb (ones, zeroes, random, add, sub...)
@@ -211,54 +353,32 @@ __scratch_xfs_fuzz_field_test() {
 	res=$?
 	test $res -ne 0 && return
 
-	# Try to catch the error with scrub
-	echo "+ Try to catch the error"
-	_try_scratch_mount 2>&1
-	res=$?
-	if [ $res -eq 0 ]; then
-		# Try an online scrub unless we're fuzzing ag 0's sb,
-		# which scrub doesn't know how to fix.
-		if [ "${repair}" != "none" ]; then
-			echo "++ Online scrub"
-			if [ "$1" != "sb 0" ]; then
-				_scratch_scrub -n -a 1 -e continue 2>&1
-				res=$?
-				test $res -eq 0 && \
-					(>&2 echo "scrub didn't fail with ${field} = ${fuzzverb}.")
-			fi
-		fi
-
-		# Try fixing the filesystem online?!
-		if [ "${repair}" = "online" ] || [ "${repair}" = "both" ]; then
-			__fuzz_notify "++ Try to repair filesystem online"
-			_scratch_scrub 2>&1
-			res=$?
-			test $res -ne 0 && \
-				(>&2 echo "online repair failed ($res) with ${field} = ${fuzzverb}.")
-		fi
-
-		__scratch_xfs_fuzz_unmount
-	elif [ "${repair}" = "online" ] || [ "${repair}" = "both" ]; then
-		(>&2 echo "mount failed ($res) with ${field} = ${fuzzverb}.")
-	fi
-
-	# Repair the filesystem offline?
-	if [ "${repair}" = "offline" ] || [ "${repair}" = "both" ]; then
-		echo "+ Try to repair the filesystem offline"
-		_repair_scratch_fs 2>&1
+	# Try to catch the error with whatever repair strategy we picked.
+	# The fs should not be mounted before or after the strategy call.
+	local fuzz_action="${field} = ${fuzzverb}"
+	case "${repair}" in
+	"online")
+		__scratch_xfs_fuzz_field_online "${fuzz_action}"
 		res=$?
-		test $res -ne 0 && \
-			(>&2 echo "offline repair failed ($res) with ${field} = ${fuzzverb}.")
-	fi
-
-	# See if repair finds a clean fs
-	if [ "${repair}" != "none" ]; then
-		echo "+ Make sure error is gone (offline)"
-		_scratch_xfs_repair -n 2>&1
+		;;
+	"offline")
+		__scratch_xfs_fuzz_field_offline "${fuzz_action}"
 		res=$?
-		test $res -ne 0 && \
-			(>&2 echo "offline re-scrub ($res) with ${field} = ${fuzzverb}.")
-	fi
+		;;
+	"none")
+		__scratch_xfs_fuzz_field_norepair "${fuzz_action}"
+		res=$?
+		;;
+	"both")
+		__scratch_xfs_fuzz_field_both "${fuzz_action}"
+		res=$?
+		;;
+	*)
+		(>&2 echo "unknown repair strategy ${repair}.")
+		res=2
+		;;
+	esac
+	test $res -eq 0 || return $res
 
 	# See if scrub finds a clean fs
 	echo "+ Make sure error is gone (online)"


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 09/24] common/fuzzy: add an underline to the full log between sections
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (10 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 10/24] common/fuzzy: hoist the post-repair fs modification step Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 13/24] common/fuzzy: fix some problems with the no-repair strategy Darrick J. Wong
                     ` (11 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

The fuzz scripts effectively use __fuzz_notify to log each step in the
fuzz process.  Enhance it to print an "underline" between sections to
ease readability a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index ef42336fa6..7efa5eeaf7 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -190,7 +190,9 @@ __scratch_xfs_fuzz_mdrestore()
 }
 
 __fuzz_notify() {
+	echo '========================================'
 	echo "$@"
+	echo '========================================'
 	test -w /dev/ttyprintk && echo "$@" >> /dev/ttyprintk
 }
 
@@ -348,7 +350,6 @@ __scratch_xfs_fuzz_field_test() {
 
 	# Set the new field value
 	__fuzz_notify "+ Fuzz ${field} = ${fuzzverb}"
-	echo "========================"
 	_scratch_xfs_fuzz_metadata_field "${field}" ${fuzzverb} "$@"
 	res=$?
 	test $res -ne 0 && return


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 10/24] common/fuzzy: hoist the post-repair fs modification step
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (9 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 08/24] common/fuzzy: split out each repair strategy into a separate helper Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 09/24] common/fuzzy: add an underline to the full log between sections Darrick J. Wong
                     ` (12 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Hoist the code that tries to modify an fs after repairing our fuzz
damage into a separate function, so that we can further simplify the
caller.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   77 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 44 insertions(+), 33 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 7efa5eeaf7..e90f414d34 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -338,6 +338,47 @@ __scratch_xfs_fuzz_field_both() {
 	return 0
 }
 
+# Assess the state of the filesystem after a repair strategy has been run by
+# trying to make changes to it.
+_scratch_xfs_fuzz_field_modifyfs() {
+	local fuzz_action="$1"
+	local repair="$2"
+
+	# Try to mount the filesystem
+	echo "+ Make sure error is gone (online)"
+	_try_scratch_mount 2>&1
+	res=$?
+	if [ $res -eq 0 ]; then
+		# Make sure online scrub says the filesystem is clean now
+		if [ "${repair}" != "none" ]; then
+			echo "++ Online scrub"
+			_scratch_scrub -n -e continue 2>&1
+				res=$?
+				test $res -ne 0 && \
+					(>&2 echo "online re-scrub ($res) with ${field} = ${fuzzverb}.")
+			fi
+		fi
+
+		# Try modifying the filesystem again
+		__fuzz_notify "++ Try to write filesystem again"
+		_scratch_fuzz_modify 100 2>&1
+		__scratch_xfs_fuzz_unmount
+	else
+		(>&2 echo "re-mount failed ($res) with ${fuzz_action}.")
+	fi
+
+	# See if repair finds a clean fs
+	if [ "${repair}" != "none" ]; then
+		echo "+ Re-check the filesystem (offline)"
+		_scratch_xfs_repair -n 2>&1
+		res=$?
+		test $res -ne 0 && \
+			(>&2 echo "re-repair failed ($res) with ${field} = ${fuzzverb}.")
+	fi
+
+	return 0
+}
+
 # Fuzz one field of some piece of metadata.
 # First arg is the field name
 # Second arg is the fuzz verb (ones, zeroes, random, add, sub...)
@@ -381,39 +422,9 @@ __scratch_xfs_fuzz_field_test() {
 	esac
 	test $res -eq 0 || return $res
 
-	# See if scrub finds a clean fs
-	echo "+ Make sure error is gone (online)"
-	_try_scratch_mount 2>&1
-	res=$?
-	if [ $res -eq 0 ]; then
-		# Try an online scrub unless we're fuzzing ag 0's sb,
-		# which scrub doesn't know how to fix.
-		if [ "${repair}" != "none" ]; then
-			echo "++ Online scrub"
-			if [ "$1" != "sb 0" ]; then
-				_scratch_scrub -n -e continue 2>&1
-				res=$?
-				test $res -ne 0 && \
-					(>&2 echo "online re-scrub ($res) with ${field} = ${fuzzverb}.")
-			fi
-		fi
-
-		# Try modifying the filesystem again!
-		__fuzz_notify "++ Try to write filesystem again"
-		_scratch_fuzz_modify 100 2>&1
-		__scratch_xfs_fuzz_unmount
-	else
-		(>&2 echo "re-mount failed ($res) with ${field} = ${fuzzverb}.")
-	fi
-
-	# See if repair finds a clean fs
-	if [ "${repair}" != "none" ]; then
-		echo "+ Re-check the filesystem (offline)"
-		_scratch_xfs_repair -n 2>&1
-		res=$?
-		test $res -ne 0 && \
-			(>&2 echo "re-repair failed ($res) with ${field} = ${fuzzverb}.")
-	fi
+	# See what happens when we modify the fs
+	_scratch_xfs_fuzz_field_modifyfs "${fuzz_action}" "${repair}"
+	return $?
 }
 
 # Make sure we have all the pieces we need for field fuzzing


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 11/24] common/fuzzy: fix some problems with the online repair strategy
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 12/24] common/fuzzy: fix some problems with the offline repair strategy Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 08/24] common/fuzzy: split out each repair strategy into a separate helper Darrick J. Wong
                     ` (14 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

While auditing the fuzz tester code, I noticed there were numerous
problems with the online repair strategy -- the stages of the strategy
are not consistently logged to the kernel log, some of the error
messages don't identify /which/ scrubber we're calling, and we don't
actually re-run online scrub after a repair to make sure that it's
verification is ok.  Disable xfs_repair prefetch to reduce the chances
of an OOM kill, and abort the fuzz test if we can't mount.  We also
reorganize the error messages to make reading the golden output easier.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index e90f414d34..8b52d289d1 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -201,36 +201,43 @@ __scratch_xfs_fuzz_field_online() {
 	local fuzz_action="$1"
 
 	# Mount or else we can't do anything online
-	echo "+ Mount filesystem to try online repair"
+	__fuzz_notify "+ Mount filesystem to try online repair"
 	_try_scratch_mount 2>&1
 	res=$?
 	if [ $res -ne 0 ]; then
-		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
-		return 0
+		(>&2 echo "${fuzz_action}: mount failed ($res).")
+		return 1
 	fi
 
 	# Make sure online scrub will catch whatever we fuzzed
-	echo "++ Online scrub"
+	__fuzz_notify "++ Detect fuzzed field (online)"
 	_scratch_scrub -n -a 1 -e continue 2>&1
 	res=$?
 	test $res -eq 0 && \
-		(>&2 echo "scrub didn't fail with ${fuzz_action}.")
+		(>&2 echo "${fuzz_action}: online scrub didn't fail.")
 
 	# Try fixing the filesystem online
-	__fuzz_notify "++ Try to repair filesystem online"
+	__fuzz_notify "++ Try to repair filesystem (online)"
 	_scratch_scrub 2>&1
 	res=$?
 	test $res -ne 0 && \
-		(>&2 echo "online repair failed ($res) with ${fuzz_action}.")
+		(>&2 echo "${fuzz_action}: online repair failed ($res).")
+
+	# Online scrub should pass now
+	__fuzz_notify "++ Make sure error is gone (online)"
+	_scratch_scrub -n -a 1 -e continue 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "${fuzz_action}: online re-scrub failed ($res).")
 
 	__scratch_xfs_fuzz_unmount
 
 	# Offline scrub should pass now
-	echo "+ Make sure error is gone (offline)"
-	_scratch_xfs_repair -n 2>&1
+	__fuzz_notify "+ Make sure error is gone (offline)"
+	_scratch_xfs_repair -P -n 2>&1
 	res=$?
 	test $res -ne 0 && \
-		(>&2 echo "offline re-scrub failed ($res) with ${fuzz_action}.")
+		(>&2 echo "${fuzz_action}: offline re-scrub failed ($res).")
 
 	return 0
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 12/24] common/fuzzy: fix some problems with the offline repair strategy
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 04/24] fuzzy: don't fuzz obsolete inode fields Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 11/24] common/fuzzy: fix some problems with the online " Darrick J. Wong
                     ` (15 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

While auditing the fuzz tester code, I noticed there were numerous
problems with the offline repair strategy -- the stages of the strategy
are not consistently logged to the kernel log, we don't actually check
that repair -n finds the fuzzed field, and since this is an offline
test, we don't need or want to mount or try to run the online scrubber.
Also, disable prefetch to reduce the chance of an OOM kill.  Rework the
error messages to make reading the golden output easier.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   31 ++++++++++---------------------
 1 file changed, 10 insertions(+), 21 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 8b52d289d1..07f597627c 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -246,37 +246,26 @@ __scratch_xfs_fuzz_field_online() {
 __scratch_xfs_fuzz_field_offline() {
 	local fuzz_action="$1"
 
-	# Mount or else we can't do anything offline
-	echo "+ Mount filesystem to try offline repair"
-	_try_scratch_mount 2>&1
-	res=$?
-	if [ $res -ne 0 ]; then
-		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
-		return 0
-	fi
-
-	# Make sure online scrub will catch whatever we fuzzed
-	echo "++ Online scrub"
-	_scratch_scrub -n -a 1 -e continue 2>&1
+	# Make sure offline scrub will catch whatever we fuzzed
+	__fuzz_notify "+ Detect fuzzed field (offline)"
+	_scratch_xfs_repair -P -n 2>&1
 	res=$?
 	test $res -eq 0 && \
-		(>&2 echo "scrub didn't fail with ${fuzz_action}.")
-
-	__scratch_xfs_fuzz_unmount
+		(>&2 echo "${fuzz_action}: offline scrub didn't fail.")
 
 	# Repair the filesystem offline
-	echo "+ Try to repair the filesystem offline"
-	_repair_scratch_fs 2>&1
+	__fuzz_notify "+ Try to repair the filesystem (offline)"
+	_repair_scratch_fs -P 2>&1
 	res=$?
 	test $res -ne 0 && \
-		(>&2 echo "offline repair failed ($res) with ${fuzz_action}.")
+		(>&2 echo "${fuzz_action}: offline repair failed ($res).")
 
 	# See if repair finds a clean fs
-	echo "+ Make sure error is gone (offline)"
-	_scratch_xfs_repair -n 2>&1
+	__fuzz_notify "+ Make sure error is gone (offline)"
+	_scratch_xfs_repair -P -n 2>&1
 	res=$?
 	test $res -ne 0 && \
-		(>&2 echo "offline re-scrub failed ($res) with ${fuzz_action}.")
+		(>&2 echo "${fuzz_action}: offline re-scrub failed ($res).")
 
 	return 0
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 13/24] common/fuzzy: fix some problems with the no-repair strategy
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (11 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 09/24] common/fuzzy: add an underline to the full log between sections Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 14/24] common/fuzzy: fix some problems with the online-then-offline repair strategy Darrick J. Wong
                     ` (10 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

While auditing the fuzz tester code, I noticed there were numerous
problems with the no repair strategy -- the stages of the strategy
are not consistently logged to the kernel log, and we don't actually
verify that either online or offline scrubs notice the fuzz.  Rework the
error messages to make reading the golden output easier.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 07f597627c..16fca67534 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -274,15 +274,29 @@ __scratch_xfs_fuzz_field_offline() {
 __scratch_xfs_fuzz_field_norepair() {
 	local fuzz_action="$1"
 
+	# Make sure offline scrub will catch whatever we fuzzed
+	__fuzz_notify "+ Detect fuzzed field (offline)"
+	_scratch_xfs_repair -P -n 2>&1
+	res=$?
+	test $res -eq 0 && \
+		(>&2 echo "${fuzz_action}: offline scrub didn't fail.")
+
 	# Mount or else we can't do anything in norepair mode
-	echo "+ Mount filesystem to try no repair"
+	__fuzz_notify "+ Mount filesystem to try online scan"
 	_try_scratch_mount 2>&1
 	res=$?
 	if [ $res -ne 0 ]; then
-		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
-		return 0
+		(>&2 echo "${fuzz_action}: mount failed ($res).")
+		return 1
 	fi
 
+	# Make sure online scrub will catch whatever we fuzzed
+	__fuzz_notify "++ Detect fuzzed field (online)"
+	_scratch_scrub -n -a 1 -e continue 2>&1
+	res=$?
+	test $res -eq 0 && \
+		(>&2 echo "${fuzz_action}: online scrub didn't fail.")
+
 	__scratch_xfs_fuzz_unmount
 
 	return 0


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 14/24] common/fuzzy: fix some problems with the online-then-offline repair strategy
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (12 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 13/24] common/fuzzy: fix some problems with the no-repair strategy Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 16/24] xfs/{35[45],455}: fix bogus corruption errors Darrick J. Wong
                     ` (9 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

While auditing the fuzz tester code, I noticed there were numerous
problems with the online-then-offline repair strategy -- the stages of
the strategy are not consistently logged to the kernel log, some of the
error messages don't identify /which/ scrubber we're calling, we don't
do a pre-repair check to make sure we detect the fuzzed fields, and we
don't actually re-run online scrub after a repair to make sure that the
repair worked.  Disable xfs_repair prefetch to reduce the possibility of
OOM kills.
Rework the error messages to make reading the golden output easier.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   80 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 53 insertions(+), 27 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 16fca67534..a33c230b40 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -306,45 +306,71 @@ __scratch_xfs_fuzz_field_norepair() {
 __scratch_xfs_fuzz_field_both() {
 	local fuzz_action="$1"
 
+	# Make sure offline scrub will catch whatever we fuzzed
+	__fuzz_notify "+ Detect fuzzed field (offline)"
+	_scratch_xfs_repair -P -n 2>&1
+	res=$?
+	test $res -eq 0 && \
+		(>&2 echo "${fuzz_action}: offline scrub didn't fail.")
+
 	# Mount or else we can't do anything in both repair mode
-	echo "+ Mount filesystem to try both repairs"
+	__fuzz_notify "+ Mount filesystem to try both repairs"
 	_try_scratch_mount 2>&1
 	res=$?
 	if [ $res -ne 0 ]; then
-		(>&2 echo "mount failed ($res) with ${fuzz_action}.")
-		return 0
+		(>&2 echo "${fuzz_action}: mount failed ($res).")
+	else
+		# Make sure online scrub will catch whatever we fuzzed
+		__fuzz_notify "++ Detect fuzzed field (online)"
+		_scratch_scrub -n -a 1 -e continue 2>&1
+		res=$?
+		test $res -eq 0 && \
+			(>&2 echo "${fuzz_action}: online scrub didn't fail.")
+
+		# Try fixing the filesystem online
+		__fuzz_notify "++ Try to repair filesystem (online)"
+		_scratch_scrub 2>&1
+		res=$?
+		test $res -ne 0 && \
+			(>&2 echo "${fuzz_action}: online repair failed ($res).")
+
+		__scratch_xfs_fuzz_unmount
+	fi
+
+	# Repair the filesystem offline if online repair failed?
+	if [ $res -ne 0 ]; then
+		__fuzz_notify "+ Try to repair the filesystem (offline)"
+		_repair_scratch_fs -P 2>&1
+		res=$?
+		test $res -ne 0 && \
+			(>&2 echo "${fuzz_action}: offline repair failed ($res).")
+	fi
+
+	# See if repair finds a clean fs
+	__fuzz_notify "+ Make sure error is gone (offline)"
+	_scratch_xfs_repair -P -n 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "${fuzz_action}: offline re-scrub failed ($res).")
+
+	# Mount so that we can see what scrub says after we've fixed the fs
+	__fuzz_notify "+ Re-mount filesystem to re-try online scan"
+	_try_scratch_mount 2>&1
+	res=$?
+	if [ $res -ne 0 ]; then
+		(>&2 echo "${fuzz_action}: mount failed ($res).")
+		return 1
 	fi
 
-	# Make sure online scrub will catch whatever we fuzzed
-	echo "++ Online scrub"
+	# Online scrub should pass now
+	__fuzz_notify "++ Make sure error is gone (online)"
 	_scratch_scrub -n -a 1 -e continue 2>&1
 	res=$?
-	test $res -eq 0 && \
-		(>&2 echo "online scrub didn't fail with ${fuzz_action}.")
-
-	# Try fixing the filesystem online
-	__fuzz_notify "++ Try to repair filesystem online"
-	_scratch_scrub 2>&1
-	res=$?
 	test $res -ne 0 && \
-		(>&2 echo "online repair failed ($res) with ${fuzz_action}.")
+		(>&2 echo "${fuzz_action}: online re-scrub failed ($res).")
 
 	__scratch_xfs_fuzz_unmount
 
-	# Repair the filesystem offline?
-	echo "+ Try to repair the filesystem offline"
-	_repair_scratch_fs 2>&1
-	res=$?
-	test $res -ne 0 && \
-		(>&2 echo "offline repair failed ($res) with ${fuzz_action}.")
-
-	# See if repair finds a clean fs
-	echo "+ Make sure error is gone (offline)"
-	_scratch_xfs_repair -n 2>&1
-	res=$?
-	test $res -ne 0 && \
-		(>&2 echo "offline re-scrub ($res) with ${fuzz_action}.")
-
 	return 0
 }
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 15/24] common/fuzzy: fix some problems with the post-repair fs modification code
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (15 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 21/24] fuzzy: compress coredumps created while fuzzing Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 19/24] common/fuzzy: exercise the filesystem a little harder after repairing Darrick J. Wong
                     ` (6 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

While auditing the fuzz tester code, I noticed there were numerous
problems with the code that test-drives the filesystem after we've run
the repair strategy.  Now that we've made sure that the repair strategy
checks its own efficacy, we can rearrange this function to try making
mods and then re-check the filesystem afterwards.  Also, disable
xfs_repair prefetch to reduce the likelihood of OOM kills.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   56 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 25 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index a33c230b40..e9a5d67592 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -380,37 +380,43 @@ _scratch_xfs_fuzz_field_modifyfs() {
 	local fuzz_action="$1"
 	local repair="$2"
 
-	# Try to mount the filesystem
-	echo "+ Make sure error is gone (online)"
+	# Try to mount the filesystem so that we can make changes
+	__fuzz_notify "+ Mount filesystem to make changes"
 	_try_scratch_mount 2>&1
 	res=$?
-	if [ $res -eq 0 ]; then
-		# Make sure online scrub says the filesystem is clean now
-		if [ "${repair}" != "none" ]; then
-			echo "++ Online scrub"
-			_scratch_scrub -n -e continue 2>&1
-				res=$?
-				test $res -ne 0 && \
-					(>&2 echo "online re-scrub ($res) with ${field} = ${fuzzverb}.")
-			fi
-		fi
+	if [ $res -ne 0 ]; then
+		(>&2 echo "${fuzz_action}: pre-mod mount failed ($res).")
+		return $res
+	fi
 
-		# Try modifying the filesystem again
-		__fuzz_notify "++ Try to write filesystem again"
-		_scratch_fuzz_modify 100 2>&1
+	# Try modifying the filesystem again
+	__fuzz_notify "++ Try to write filesystem again"
+	_scratch_fuzz_modify 100 2>&1
+
+	# If we didn't repair anything, there's no point in checking further,
+	# the fs is still corrupt.
+	if [ "${repair}" = "none" ]; then
 		__scratch_xfs_fuzz_unmount
-	else
-		(>&2 echo "re-mount failed ($res) with ${fuzz_action}.")
+		return 0
 	fi
 
-	# See if repair finds a clean fs
-	if [ "${repair}" != "none" ]; then
-		echo "+ Re-check the filesystem (offline)"
-		_scratch_xfs_repair -n 2>&1
-		res=$?
-		test $res -ne 0 && \
-			(>&2 echo "re-repair failed ($res) with ${field} = ${fuzzverb}.")
-	fi
+	# Run an online check to make sure the fs is still ok, unless we
+	# are running the norepair strategy.
+	__fuzz_notify "+ Re-check the filesystem (online)"
+	_scratch_scrub -n -e continue 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "${fuzz_action}: online post-mod scrub failed ($res).")
+
+	__scratch_xfs_fuzz_unmount
+
+	# Run an offline check to make sure the fs is still ok, unless we
+	# are running the norepair strategy.
+	__fuzz_notify "+ Re-check the filesystem (offline)"
+	_scratch_xfs_repair -P -n 2>&1
+	res=$?
+	test $res -ne 0 && \
+		(>&2 echo "${fuzz_action}: offline post-mod scrub failed ($res).")
 
 	return 0
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 16/24] xfs/{35[45],455}: fix bogus corruption errors
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (13 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 14/24] common/fuzzy: fix some problems with the online-then-offline repair strategy Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 21/24] fuzzy: compress coredumps created while fuzzing Darrick J. Wong
                     ` (8 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

The AGFL fuzz tests first fuzz the entire block header, and then
extract flfirst from the AGF header to start a second round of targeted
fuzzing of live bno pointers in the AGFL.  However, flfirst is read (and
the AGFL fields for the second round are detected) after we've already
been fuzz testing, which means that the AGFL might be garbage because
repair failed or was not called.  If that is the case, the test will
fail because the _scratch_xfs_db -c 'agf 0' -c 'p flfirst' call emits
things like this:

Fuzz AGFL flfirst
Metadata corruption detected at 0x55f4f789fbc0, xfs_agfl block 0x3/0x200
Metadata corruption detected at 0x55b7356e0bc0, xfs_agfl block 0x3/0x200
Done fuzzing AGFL flfirst

Fix this by restoring the scratch fs before probing flfirst and starting
the second round of fuzzing.
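
With the restore in place, the probe runs against a sane AGF and should
produce clean output along these lines (value is illustrative):

	$ _scratch_xfs_db -c 'agf 0' -c 'p flfirst'
	flfirst = 0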

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/354 |    7 ++++++-
 tests/xfs/355 |    7 ++++++-
 tests/xfs/455 |    7 ++++++-
 3 files changed, 18 insertions(+), 3 deletions(-)


diff --git a/tests/xfs/354 b/tests/xfs/354
index b10ce1d68f..8abf527ea6 100755
--- a/tests/xfs/354
+++ b/tests/xfs/354
@@ -28,8 +28,13 @@ echo "Fuzz AGFL"
 _scratch_xfs_fuzz_metadata '' 'offline' 'agfl 0' >> $seqres.full
 echo "Done fuzzing AGFL"
 
-echo "Fuzz AGFL flfirst"
+# Restore a correct copy of the filesystem before we start the second round of
+# fuzzing.  This avoids corruption errors from xfs_db when we probe for flfirst
+# in the AGF and later when _scratch_xfs_fuzz_metadata probes the AGFL fields.
+__scratch_xfs_fuzz_mdrestore
 flfirst=$(_scratch_xfs_db -c 'agf 0' -c 'p flfirst' | sed -e 's/flfirst = //g')
+
+echo "Fuzz AGFL flfirst"
 SCRATCH_XFS_LIST_METADATA_FIELDS="bno[${flfirst}]" _scratch_xfs_fuzz_metadata '' 'offline' 'agfl 0' >> $seqres.full
 echo "Done fuzzing AGFL flfirst"
 
diff --git a/tests/xfs/355 b/tests/xfs/355
index 530c9a970a..2d552a591c 100755
--- a/tests/xfs/355
+++ b/tests/xfs/355
@@ -28,8 +28,13 @@ echo "Fuzz AGFL"
 _scratch_xfs_fuzz_metadata '' 'online' 'agfl 0' >> $seqres.full
 echo "Done fuzzing AGFL"
 
-echo "Fuzz AGFL flfirst"
+# Restore a correct copy of the filesystem before we start the second round of
+# fuzzing.  This avoids corruption errors from xfs_db when we probe for flfirst
+# in the AGF and later when _scratch_xfs_fuzz_metadata probes the AGFL fields.
+__scratch_xfs_fuzz_mdrestore
 flfirst=$(_scratch_xfs_db -c 'agf 0' -c 'p flfirst' | sed -e 's/flfirst = //g')
+
+echo "Fuzz AGFL flfirst"
 SCRATCH_XFS_LIST_METADATA_FIELDS="bno[${flfirst}]" _scratch_xfs_fuzz_metadata '' 'online' 'agfl 0' >> $seqres.full
 echo "Done fuzzing AGFL flfirst"
 
diff --git a/tests/xfs/455 b/tests/xfs/455
index 96820bc3b8..9f06c71fa2 100755
--- a/tests/xfs/455
+++ b/tests/xfs/455
@@ -29,8 +29,13 @@ echo "Fuzz AGFL"
 _scratch_xfs_fuzz_metadata '' 'none' 'agfl 0' >> $seqres.full
 echo "Done fuzzing AGFL"
 
-echo "Fuzz AGFL flfirst"
+# Restore a correct copy of the filesystem before we start the second round of
+# fuzzing.  This avoids corruption errors from xfs_db when we probe for flfirst
+# in the AGF and later when _scratch_xfs_fuzz_metadata probes the AGFL fields.
+__scratch_xfs_fuzz_mdrestore
 flfirst=$(_scratch_xfs_db -c 'agf 0' -c 'p flfirst' | sed -e 's/flfirst = //g')
+
+echo "Fuzz AGFL flfirst"
 SCRATCH_XFS_LIST_METADATA_FIELDS="bno[${flfirst}]" _scratch_xfs_fuzz_metadata '' 'none' 'agfl 0' >> $seqres.full
 echo "Done fuzzing AGFL flfirst"
 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 17/24] common/fuzzy: evaluate xfs_check vs xfs_repair
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (18 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 22/24] fuzzy: report the fuzzing repair strategy in seqres.full Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 20/24] fuzzy: dump metadata state before fuzzing Darrick J. Wong
                     ` (3 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

When fuzzing a filesystem and using the offline repair strategy, compare
the outputs of xfs_check against xfs_repair to ensure that the newer
xfs_repair catches at least as many things as xfs_check does.
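
As a rough usage sketch: the comparison is opt-in via an environment
variable added below, so a run might look like this (the test number is
only an example of an offline-repair fuzz test):

  SCRATCH_XFS_FUZZ_CHECK=1 ./check xfs/354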

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   11 +++++++++++
 1 file changed, 11 insertions(+)


diff --git a/common/fuzzy b/common/fuzzy
index e9a5d67592..cf085f8b28 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -253,6 +253,17 @@ __scratch_xfs_fuzz_field_offline() {
 	test $res -eq 0 && \
 		(>&2 echo "${fuzz_action}: offline scrub didn't fail.")
 
+	# Make sure xfs_repair catches at least as many things as the old
+	# xfs_check did.
+	if [ -n "${SCRATCH_XFS_FUZZ_CHECK}" ]; then
+		__fuzz_notify "+ Detect fuzzed field (xfs_check)"
+		_scratch_xfs_check 2>&1
+		res1=$?
+		if [ $res1 -ne 0 ] && [ $res -eq 0 ]; then
+			(>&2 echo "${fuzz_action}: xfs_repair passed but xfs_check failed ($res1).")
+		fi
+	fi
+
 	# Repair the filesystem offline
 	__fuzz_notify "+ Try to repair the filesystem (offline)"
 	_repair_scratch_fs -P 2>&1


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 18/24] common: check xfs health after doing an online scrub
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (20 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 20/24] fuzzy: dump metadata state before fuzzing Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 24/24] fuzzy: for fuzzing ag btrees, find the path to the AG header Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 23/24] xfs: improve metadata array field handling when fuzzing Darrick J. Wong
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

After we've run xfs_scrub -n to perform a check of a mounted
filesystem's metadata, we should check the health reporting system to
make sure that the results got recorded.  Also wire this up to the xfs
fuzz testing helpers.
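
A minimal sketch of the intended flow, using the usual fstests variables
(the scrub invocation here is only illustrative):

  $XFS_SCRUB_PROG -n $SCRATCH_MNT >> $seqres.full 2>&1
  _check_xfs_health $SCRATCH_MNT || \
      echo "health reporting flagged unhealthy metadata"

Under the hood the new helper is little more than running
xfs_spaceman -c 'health -c -q' on the mountpoint, with filtering so that
kernels or userspace without health reporting don't cause false
failures.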

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   27 +++++++++++++++++++++++++++
 common/xfs   |   43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)


diff --git a/common/fuzzy b/common/fuzzy
index cf085f8b28..d841d435eb 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -216,6 +216,15 @@ __scratch_xfs_fuzz_field_online() {
 	test $res -eq 0 && \
 		(>&2 echo "${fuzz_action}: online scrub didn't fail.")
 
+	# Does the health status report reflect the corruption?
+	if [ $res -ne 0 ]; then
+		__fuzz_notify "++ Detect fuzzed field ill-health report"
+		_check_xfs_health $SCRATCH_MNT 2>&1
+		res=$?
+		test $res -ne 1 && \
+			(>&2 echo "${fuzz_action}: online health check failed ($res).")
+	fi
+
 	# Try fixing the filesystem online
 	__fuzz_notify "++ Try to repair filesystem (online)"
 	_scratch_scrub 2>&1
@@ -308,6 +317,15 @@ __scratch_xfs_fuzz_field_norepair() {
 	test $res -eq 0 && \
 		(>&2 echo "${fuzz_action}: online scrub didn't fail.")
 
+	# Does the health status report reflect the corruption?
+	if [ $res -ne 0 ]; then
+		__fuzz_notify "++ Detect fuzzed field ill-health report"
+		_check_xfs_health $SCRATCH_MNT 2>&1
+		res=$?
+		test $res -ne 1 && \
+			(>&2 echo "${fuzz_action}: online health check failed ($res).")
+	fi
+
 	__scratch_xfs_fuzz_unmount
 
 	return 0
@@ -338,6 +356,15 @@ __scratch_xfs_fuzz_field_both() {
 		test $res -eq 0 && \
 			(>&2 echo "${fuzz_action}: online scrub didn't fail.")
 
+		# Does the health status report reflect the corruption?
+		if [ $res -ne 0 ]; then
+			__fuzz_notify "++ Detect fuzzed field ill-health report"
+			_check_xfs_health $SCRATCH_MNT 2>&1
+			res=$?
+			test $res -ne 1 && \
+				(>&2 echo "${fuzz_action}: online health check failed ($res).")
+		fi
+
 		# Try fixing the filesystem online
 		__fuzz_notify "++ Try to repair filesystem (online)"
 		_scratch_scrub 2>&1
diff --git a/common/xfs b/common/xfs
index 804047557b..371618dc7b 100644
--- a/common/xfs
+++ b/common/xfs
@@ -599,6 +599,37 @@ _require_xfs_db_command()
 		_notrun "xfs_db $command support is missing"
 }
 
+# Check the health of a mounted XFS filesystem.  Callers probably want to
+# ensure that xfs_scrub has been run first.  Returns 1 if unhealthy metadata
+# are found or 0 otherwise.
+_check_xfs_health() {
+	local mntpt="$1"
+	local ret=0
+	local t="$tmp.health_helper"
+
+	test -x "$XFS_SPACEMAN_PROG" || return 0
+
+	$XFS_SPACEMAN_PROG -c 'health -c -q' $mntpt > $t.out 2> $t.err
+	test $? -ne 0 && ret=1
+
+	# Don't return error if userspace or kernel don't support health
+	# reporting.
+	grep -q 'command.*health.*not found' $t.err && return 0
+	grep -q 'Inappropriate ioctl for device' $t.err && return 0
+
+	# Filter out the "please run scrub" message if nothing's been checked.
+	sed -e '/Health status has not been/d' -e '/Please run xfs_scrub/d' -i \
+			$t.err
+
+	grep -q unhealthy $t.out && ret=1
+	test $(wc -l < $t.err) -gt 0 && ret=1
+	cat $t.out
+	cat $t.err 1>&2
+	rm -f $t.out $t.err
+
+	return $ret
+}
+
 # Does the filesystem mounted from a particular device support scrub?
 _supports_xfs_scrub()
 {
@@ -750,6 +781,18 @@ _check_xfs_filesystem()
 			ok=0
 		fi
 		rm -f $tmp.scrub
+
+		# Does the health reporting notice anything?
+		_check_xfs_health $mntpt > $tmp.health 2>&1
+		res=$?
+		if [ $((res ^ ok)) -eq 0 ]; then
+			_log_err "_check_xfs_filesystem: filesystem on $device failed health check"
+			echo "*** xfs_spaceman -c 'health -c -q' output ***" >> $seqres.full
+			cat $tmp.health >> $seqres.full
+			echo "*** end xfs_spaceman output" >> $seqres.full
+			ok=0
+		fi
+		rm -f $tmp.health
 	fi
 
 	if [ "$type" = "xfs" ]; then


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 19/24] common/fuzzy: exercise the filesystem a little harder after repairing
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (16 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 15/24] common/fuzzy: fix some problems with the post-repair fs modification code Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 22/24] fuzzy: report the fuzzing repair strategy in seqres.full Darrick J. Wong
                     ` (5 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Use fsstress to exercise the filesystem a little more strenuously after
we've run the fuzzing repair strategy, so that we have a better chance
of tripping over corruption problems.
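
The amount of exercise scales with the usual knobs; the new helper runs
roughly this against the data device (the realtime pass is analogous):

  $FSSTRESS_PROG -n $((TIME_FACTOR * 10000)) -p $((LOAD_FACTOR * 4)) \
      -d $SCRATCH_MNT/data

so bumping TIME_FACTOR or LOAD_FACTOR in the environment makes the
post-repair stress proportionally heavier.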

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   36 ++++++++++++------------------------
 1 file changed, 12 insertions(+), 24 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index d841d435eb..3de6f43dc6 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -6,30 +6,18 @@
 
 # Modify various files after a fuzzing operation
 _scratch_fuzz_modify() {
-	nr="$1"
+	echo "+++ stressing filesystem"
+	mkdir -p $SCRATCH_MNT/data
+	_xfs_force_bdev data $SCRATCH_MNT/data
+	$FSSTRESS_PROG -n $((TIME_FACTOR * 10000)) -p $((LOAD_FACTOR * 4)) -d $SCRATCH_MNT/data
 
-	test -z "${nr}" && nr=50000
-	echo "+++ touch ${nr} files"
-	blk_sz=$(stat -f -c '%s' ${SCRATCH_MNT})
-	$XFS_IO_PROG -f -c "pwrite -S 0x63 0 ${blk_sz}" "/tmp/afile" > /dev/null
-	date="$(date)"
-	find "${SCRATCH_MNT}/" -type f 2> /dev/null | head -n "${nr}" | while read f; do
-		# try to remove append, immutable (and even dax) flag if exists
-		$XFS_IO_PROG -rc 'chattr -x -i -a' "$f" > /dev/null 2>&1
-		setfattr -n "user.date" -v "${date}" "$f"
-		cat "/tmp/afile" >> "$f"
-		mv "$f" "$f.longer"
-	done
-	sync
-	rm -rf "/tmp/afile"
-
-	echo "+++ create files"
-	mkdir -p "${SCRATCH_MNT}/test.moo"
-	$XFS_IO_PROG -f -c 'pwrite -S 0x80 0 65536' "${SCRATCH_MNT}/test.moo/urk" > /dev/null
-	sync
-
-	echo "+++ remove files"
-	rm -rf "${SCRATCH_MNT}/test.moo"
+	if _xfs_has_feature "$SCRATCH_MNT" realtime; then
+		mkdir -p $SCRATCH_MNT/rt
+		_xfs_force_bdev realtime $SCRATCH_MNT/rt
+		$FSSTRESS_PROG -n $((TIME_FACTOR * 10000)) -p $((LOAD_FACTOR * 4)) -d $SCRATCH_MNT/rt
+	else
+		echo "+++ xfs realtime not configured"
+	fi
 }
 
 # Try to access files after fuzzing
@@ -429,7 +417,7 @@ _scratch_xfs_fuzz_field_modifyfs() {
 
 	# Try modifying the filesystem again
 	__fuzz_notify "++ Try to write filesystem again"
-	_scratch_fuzz_modify 100 2>&1
+	_scratch_fuzz_modify 2>&1
 
 	# If we didn't repair anything, there's no point in checking further,
 	# the fs is still corrupt.


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 20/24] fuzzy: dump metadata state before fuzzing
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (19 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 17/24] common/fuzzy: evaluate xfs_check vs xfs_repair Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 18/24] common: check xfs health after doing an online scrub Darrick J. Wong
                     ` (2 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

When we start a fuzz test, dump the metadata to stdout so that anyone
analyzing a failure can see what was in the (supposedly) good image, and
what it turns into after fuzzing.
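
A quick sketch of what the dump amounts to, using AGF 0 as an
illustrative object:

  _scratch_xfs_dump_metadata 'agf 0'

which is just shorthand for chaining the locator commands into xfs_db
and printing, i.e. roughly:

  _scratch_xfs_db -c 'agf 0' -c print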

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   12 ++++++++++++
 1 file changed, 12 insertions(+)


diff --git a/common/fuzzy b/common/fuzzy
index 3de6f43dc6..939f5e5ef2 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -110,6 +110,16 @@ __filter_xfs_db_print_fields() {
 	done | grep -E "${filter}" | __filter_unvalidated_xfs_db_fields
 }
 
+# Dump the current contents of a metadata object.
+# All arguments are xfs_db commands to locate the metadata.
+_scratch_xfs_dump_metadata() {
+	local cmds=()
+	for arg in "$@"; do
+		cmds+=("-c" "${arg}")
+	done
+	_scratch_xfs_db "${cmds[@]}" -c print
+}
+
 # Navigate to some part of the filesystem and print the field info.
 # The first argument is an grep filter for the fields
 # The rest of the arguments are xfs_db commands to locate the metadata.
@@ -534,6 +544,8 @@ _scratch_xfs_fuzz_metadata() {
 	echo $(echo "${fields}")
 	echo "Verbs we propose to fuzz with:"
 	echo $(echo "${verbs}")
+	echo "Current metadata object state:"
+	_scratch_xfs_dump_metadata "$@"
 
 	# Always capture full core dumps from crashing tools
 	ulimit -c unlimited


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 21/24] fuzzy: compress coredumps created while fuzzing
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (14 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 16/24] xfs/{35[45],455}: fix bogus corruption errors Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 15/24] common/fuzzy: fix some problems with the post-repair fs modification code Darrick J. Wong
                     ` (7 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Compress the coredumps and put them in the results directory.
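
Note that picking up "core" / "core.*" from the current directory (as
the hunk below does) assumes the default kernel coredump naming; a quick
way to sanity-check that is:

  sysctl kernel.core_pattern     # prints e.g. "kernel.core_pattern = core"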

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    8 ++++++++
 1 file changed, 8 insertions(+)


diff --git a/common/fuzzy b/common/fuzzy
index 939f5e5ef2..7eaf883c0f 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -554,6 +554,14 @@ _scratch_xfs_fuzz_metadata() {
 		echo "${verbs}" | while read fuzzverb; do
 			__scratch_xfs_fuzz_mdrestore
 			__scratch_xfs_fuzz_field_test "${field}" "${fuzzverb}" "${repair}" "$@"
+
+			# Collect compressed coredumps in the test results
+			# directory if the sysadmin didn't override the default
+			# coredump strategy.
+			for i in core core.*; do
+				test -f "$i" || continue
+				_save_coredump "$i"
+			done
 		done
 	done
 }


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 22/24] fuzzy: report the fuzzing repair strategy in seqres.full
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (17 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 19/24] common/fuzzy: exercise the filesystem a little harder after repairing Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 17/24] common/fuzzy: evaluate xfs_check vs xfs_repair Darrick J. Wong
                     ` (4 subsequent siblings)
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Record in the seqres.full file the filesystem repair strategy that we're
going to use to detect the fuzzed metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 7eaf883c0f..f34fcadefe 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -540,7 +540,7 @@ _scratch_xfs_fuzz_metadata() {
 
 	fields="$(_scratch_xfs_list_metadata_fields "${filter}" "$@")"
 	verbs="$(_scratch_xfs_list_fuzz_verbs)"
-	echo "Fields we propose to fuzz under: $@"
+	echo "Fields we propose to fuzz with the \"${repair}\" repair strategy: $@"
 	echo $(echo "${fields}")
 	echo "Verbs we propose to fuzz with:"
 	echo $(echo "${verbs}")


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 23/24] xfs: improve metadata array field handling when fuzzing
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (22 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 24/24] fuzzy: for fuzzing ag btrees, find the path to the AG header Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Currently, we use some gnarly regular expressions to try to constrain
the amount of time we spend fuzzing each element of a metadata array.
This is pretty inflexible (and buggy): some arrays (e.g. dir hashes) are
limited to the first ten elements, while arrays that xfs_db prints with
compact index ranges (e.g. extent mappings) get collapsed down to just
the first element.

Replace this whole weird mess with logic that can tease out the array
indices, unroll the compact indices if needed, and give the user more
flexible control over which array elements get used.
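
To make the new controls concrete: an array field that xfs_db prints
with a compact range, say u3.bmx[0-1], is now unrolled into u3.bmx[0]
and u3.bmx[1], and the index window defaults to 0-9 but can be adjusted
per run, e.g. (the test number is only an illustration):

  SCRATCH_XFS_MAX_ARRAY_IDX=3 ./check xfs/358
  SCRATCH_XFS_MAX_ARRAY_IDX=none ./check xfs/358   # no practical cap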

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   52 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 9 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index f34fcadefe..53fe22db69 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -53,12 +53,46 @@ _scratch_scrub() {
 	esac
 }
 
-# Filter out any keys with an array index >= 10, collapse any array range
-# ("[1-195]") to the first item, and ignore padding fields.
-__filter_xfs_db_keys() {
-	sed -e '/\([a-z]*\)\[\([0-9][0-9]\+\)\].*/d' \
-	    -e 's/\([a-zA-Z0-9_]*\)\[\([0-9]*\)-[0-9]*\]/\1[\2]/g' \
-	    -e '/pad/d'
+# Expand indexed keys (i.e. arrays) into a long format so that we can filter
+# the array indices individually, and pass regular keys right through.
+#
+# For example, "u3.bmx[0-1] = [foo,bar]" is exploded into:
+# u3.bmx[0] = [foo,bar]
+# u3.bmx[1] = [foo,bar]
+#
+# Note that we restrict array indices to [0-9] to reduce fuzz runtime.  The
+# minimum and maximum array indices can be changed by setting the variables
+# SCRATCH_XFS_{MIN,MAX}_ARRAY_IDX.
+#
+# Also filter padding fields.
+__explode_xfs_db_fields() {
+	local min_idx="${SCRATCH_XFS_MIN_ARRAY_IDX}"
+	local max_idx="${SCRATCH_XFS_MAX_ARRAY_IDX}"
+
+	test -z "${min_idx}" && min_idx=0
+	test -z "${max_idx}" && max_idx=9
+	test "${max_idx}" = "none" && max_idx=99999
+
+	grep ' = ' | \
+	sed -e 's/^\([.a-zA-Z0-9_]*\)\[\([0-9]*\)-\([0-9]*\)\]\(.*\) = \(.*\)$/\1[%d]\4 \2 \3 = \5/g' \
+	    -e 's/^\([.a-zA-Z0-9_]*\)\[\([0-9]*\)\]\(.*\) = \(.*\)$/\1[%d]\3 \2 \2 = \4/g' | \
+	while read name col1 col2 rest; do
+		if [[ "${name}" == *pad* ]]; then
+			continue
+		fi
+
+		if [ "${col1}" = "=" ]; then
+			echo "${name} ${col1} ${col2} ${rest}"
+			continue
+		fi
+
+		test "${min_idx}" -gt "${col1}" && col1="${min_idx}"
+		test "${max_idx}" -lt "${col2}" && col2="${max_idx}"
+
+		seq "${col1}" "${col2}" | while read idx; do
+			printf "${name} %s\n" "${idx}" "${rest}"
+		done
+	done
 }
 
 # Filter out metadata fields that are completely controlled by userspace
@@ -96,14 +130,14 @@ __filter_xfs_db_print_fields() {
 	if [ -z "${filter}" ] || [ "${filter}" = "nofilter" ]; then
 		filter='^'
 	fi
-	grep ' = ' | while read key equals value; do
-		fuzzkey="$(echo "${key}" | __filter_xfs_db_keys)"
+	__explode_xfs_db_fields | while read key equals value; do
+		fuzzkey="$(echo "${key}")"
 		if [ -z "${fuzzkey}" ]; then
 			continue
 		elif [[ "${value}" == "["* ]]; then
 			echo "${value}" | sed -e 's/^.//g' -e 's/.$//g' -e 's/,/\n/g' | while read subfield; do
 				echo "${fuzzkey}.${subfield}"
-			done | __filter_xfs_db_keys
+			done
 		else
 			echo "${fuzzkey}"
 		fi


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 24/24] fuzzy: for fuzzing ag btrees, find the path to the AG header
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
                     ` (21 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 18/24] common: check xfs health after doing an online scrub Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 23/24] xfs: improve metadata array field handling when fuzzing Darrick J. Wong
  23 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

The fs population code creates various btrees with at least two levels
in /some/ allocation group.  These btrees aren't necessarily created in
agno 0, so we need to find the right AG programmatically.  While we're
at it, fix a few of the comments that failed to mention when we're
fuzzing interior nodes and not leaves.
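
Per AG, the new helper boils down to a probe like this one (btree type
and AG number are illustrative):

  _scratch_xfs_db -c 'agf 0' -c 'p bnolevel'

and it echoes the first "agf N" / "agi N" locator whose level (which is
really the btree height) matches the requested value.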

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/xfs    |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/358 |    5 ++++-
 tests/xfs/359 |    5 ++++-
 tests/xfs/360 |    5 ++++-
 tests/xfs/361 |    5 ++++-
 tests/xfs/362 |    5 ++++-
 tests/xfs/363 |    5 ++++-
 tests/xfs/364 |    5 ++++-
 tests/xfs/365 |    5 ++++-
 tests/xfs/366 |    5 ++++-
 tests/xfs/367 |    5 ++++-
 tests/xfs/368 |    5 ++++-
 tests/xfs/369 |    5 ++++-
 tests/xfs/370 |    5 ++++-
 tests/xfs/371 |    5 ++++-
 tests/xfs/372 |    5 ++++-
 tests/xfs/373 |    7 +++++--
 tests/xfs/410 |    5 ++++-
 tests/xfs/411 |    5 ++++-
 tests/xfs/457 |    5 ++++-
 tests/xfs/458 |    5 ++++-
 tests/xfs/459 |    5 ++++-
 tests/xfs/460 |    5 ++++-
 tests/xfs/461 |    5 ++++-
 tests/xfs/462 |    5 ++++-
 tests/xfs/463 |    5 ++++-
 tests/xfs/464 |    7 +++++--
 tests/xfs/483 |    5 ++++-
 28 files changed, 147 insertions(+), 29 deletions(-)


diff --git a/common/xfs b/common/xfs
index 371618dc7b..610730e5ef 100644
--- a/common/xfs
+++ b/common/xfs
@@ -1686,3 +1686,40 @@ _xfs_get_inode_core_bytes()
 		echo 96
 	fi
 }
+
+# Find us the path to the AG header containing a per-AG btree with a specific
+# height.
+_scratch_xfs_find_agbtree_height() {
+	local bt_type="$1"
+	local bt_height="$2"
+	local agcount=$(_xfs_mount_agcount $SCRATCH_DEV)
+
+	case "${bt_type}" in
+	"bno"|"cnt"|"rmap"|"refcnt")
+		hdr="agf"
+		bt_prefix="${bt_type}"
+		;;
+	"ino")
+		hdr="agi"
+		bt_prefix=""
+		;;
+	"fino")
+		hdr="agi"
+		bt_prefix="free_"
+		;;
+	*)
+		_fail "Don't know about AG btree ${bt_type}"
+		;;
+	esac
+
+	for ((agno = 0; agno < agcount; agno++)); do
+		bt_level=$(_scratch_xfs_db -c "${hdr} ${agno}" -c "p ${bt_prefix}level" | awk '{print $3}')
+		# "level" is really the btree height
+		if [ "${bt_level}" -eq "${bt_height}" ]; then
+			echo "${hdr} ${agno}"
+			return 0
+		fi
+	done
+
+	return 1
+}
diff --git a/tests/xfs/358 b/tests/xfs/358
index a00eb6f9cb..92180e5196 100755
--- a/tests/xfs/358
+++ b/tests/xfs/358
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
 echo "Fuzz bnobt recs"
-_scratch_xfs_fuzz_metadata '' 'offline'  'agf 0' 'addr bnoroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr bnoroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing bnobt recs"
 
 # success, all done
diff --git a/tests/xfs/359 b/tests/xfs/359
index f0a82db4b8..0498aaccf5 100755
--- a/tests/xfs/359
+++ b/tests/xfs/359
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
 echo "Fuzz bnobt recs"
-_scratch_xfs_fuzz_metadata '' 'online'  'agf 0' 'addr bnoroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr bnoroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing bnobt recs"
 
 # success, all done
diff --git a/tests/xfs/360 b/tests/xfs/360
index 3942ffd1b2..c34f455403 100755
--- a/tests/xfs/360
+++ b/tests/xfs/360
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
 echo "Fuzz bnobt keyptr"
-_scratch_xfs_fuzz_metadata '' 'offline'  'agf 0' 'addr bnoroot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr bnoroot' >> $seqres.full
 echo "Done fuzzing bnobt keyptr"
 
 # success, all done
diff --git a/tests/xfs/361 b/tests/xfs/361
index b7ee0f6f94..22b1af4ea3 100755
--- a/tests/xfs/361
+++ b/tests/xfs/361
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
 echo "Fuzz bnobt keyptr"
-_scratch_xfs_fuzz_metadata '' 'online'  'agf 0' 'addr bnoroot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr bnoroot' >> $seqres.full
 echo "Done fuzzing bnobt keyptr"
 
 # success, all done
diff --git a/tests/xfs/362 b/tests/xfs/362
index f711661b02..51727edc06 100755
--- a/tests/xfs/362
+++ b/tests/xfs/362
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'cnt' 2)" || \
+	_fail "could not find two-level cntbt"
+
 echo "Fuzz cntbt"
-_scratch_xfs_fuzz_metadata '' 'offline'  'agf 0' 'addr cntroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr cntroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing cntbt"
 
 # success, all done
diff --git a/tests/xfs/363 b/tests/xfs/363
index 6be9109eca..8a62c1c821 100755
--- a/tests/xfs/363
+++ b/tests/xfs/363
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'cnt' 2)" || \
+	_fail "could not find two-level cntbt"
+
 echo "Fuzz cntbt"
-_scratch_xfs_fuzz_metadata '' 'online'  'agf 0' 'addr cntroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr cntroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing cntbt"
 
 # success, all done
diff --git a/tests/xfs/364 b/tests/xfs/364
index fcd18fe686..984ecdafed 100755
--- a/tests/xfs/364
+++ b/tests/xfs/364
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
 echo "Fuzz inobt"
-_scratch_xfs_fuzz_metadata '' 'offline'  'agi 0' 'addr root' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr root' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing inobt"
 
 # success, all done
diff --git a/tests/xfs/365 b/tests/xfs/365
index 6f116f9b9c..e4325c35d1 100755
--- a/tests/xfs/365
+++ b/tests/xfs/365
@@ -24,8 +24,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
 echo "Fuzz inobt"
-_scratch_xfs_fuzz_metadata '' 'online'  'agi 1' 'addr root' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr root' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing inobt"
 
 # success, all done
diff --git a/tests/xfs/366 b/tests/xfs/366
index 4c651288c0..8a52d21a0f 100755
--- a/tests/xfs/366
+++ b/tests/xfs/366
@@ -25,8 +25,11 @@ _require_xfs_finobt
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'fino' 2)" || \
+	_fail "could not find two-level finobt"
+
 echo "Fuzz finobt"
-_scratch_xfs_fuzz_metadata '' 'offline'  'agi 0' 'addr free_root' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr free_root' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing finobt"
 
 # success, all done
diff --git a/tests/xfs/367 b/tests/xfs/367
index c474a9e7d8..d9d07faab2 100755
--- a/tests/xfs/367
+++ b/tests/xfs/367
@@ -25,8 +25,11 @@ _require_xfs_finobt
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'fino' 2)" || \
+	_fail "could not find two-level finobt"
+
 echo "Fuzz finobt"
-_scratch_xfs_fuzz_metadata '' 'online'  'agi 0' 'addr free_root' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr free_root' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing finobt"
 
 # success, all done
diff --git a/tests/xfs/368 b/tests/xfs/368
index b1c1f97664..83499827c9 100755
--- a/tests/xfs/368
+++ b/tests/xfs/368
@@ -25,8 +25,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
 echo "Fuzz rmapbt recs"
-_scratch_xfs_fuzz_metadata '' 'offline' 'agf 0' 'addr rmaproot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline' "$path" 'addr rmaproot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing rmapbt recs"
 
 # success, all done
diff --git a/tests/xfs/369 b/tests/xfs/369
index 5e6d8d9be0..3236b50e00 100755
--- a/tests/xfs/369
+++ b/tests/xfs/369
@@ -25,8 +25,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
 echo "Fuzz rmapbt recs"
-_scratch_xfs_fuzz_metadata '' 'online' 'agf 0' 'addr rmaproot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online' "$path" 'addr rmaproot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing rmapbt recs"
 
 # success, all done
diff --git a/tests/xfs/370 b/tests/xfs/370
index 0a916242e2..891d5e2572 100755
--- a/tests/xfs/370
+++ b/tests/xfs/370
@@ -26,8 +26,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
 echo "Fuzz rmapbt keyptr"
-_scratch_xfs_fuzz_metadata '' 'offline' 'agf 0' 'addr rmaproot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline' "$path" 'addr rmaproot' >> $seqres.full
 echo "Done fuzzing rmapbt keyptr"
 
 # success, all done
diff --git a/tests/xfs/371 b/tests/xfs/371
index a9b914d9f5..f7a336b170 100755
--- a/tests/xfs/371
+++ b/tests/xfs/371
@@ -26,8 +26,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
 echo "Fuzz rmapbt keyptr"
-_scratch_xfs_fuzz_metadata '' 'online' 'agf 0' 'addr rmaproot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online' "$path" 'addr rmaproot' >> $seqres.full
 echo "Done fuzzing rmapbt keyptr"
 
 # success, all done
diff --git a/tests/xfs/372 b/tests/xfs/372
index c39a917500..2250322527 100755
--- a/tests/xfs/372
+++ b/tests/xfs/372
@@ -26,8 +26,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
 echo "Fuzz refcountbt"
-_scratch_xfs_fuzz_metadata '' 'offline'  'agf 0' 'addr refcntroot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr refcntroot' >> $seqres.full
 echo "Done fuzzing refcountbt"
 
 # success, all done
diff --git a/tests/xfs/373 b/tests/xfs/373
index 324aa9fe7d..e0c20044ec 100755
--- a/tests/xfs/373
+++ b/tests/xfs/373
@@ -4,7 +4,7 @@
 #
 # FS QA Test No. 373
 #
-# Populate a XFS filesystem and fuzz every refcountbt field.
+# Populate a XFS filesystem and fuzz every refcountbt key/pointer field.
 # Use xfs_scrub to fix the corruption.
 #
 . ./common/preamble
@@ -26,8 +26,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
 echo "Fuzz refcountbt"
-_scratch_xfs_fuzz_metadata '' 'online'  'agf 0' 'addr refcntroot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr refcntroot' >> $seqres.full
 echo "Done fuzzing refcountbt"
 
 # success, all done
diff --git a/tests/xfs/410 b/tests/xfs/410
index e98a63ebf5..388ed7d190 100755
--- a/tests/xfs/410
+++ b/tests/xfs/410
@@ -26,8 +26,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
 echo "Fuzz refcountbt"
-_scratch_xfs_fuzz_metadata '' 'offline'  'agf 0' 'addr refcntroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr refcntroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing refcountbt"
 
 # success, all done
diff --git a/tests/xfs/411 b/tests/xfs/411
index cfe7796102..a9fc25ce7d 100755
--- a/tests/xfs/411
+++ b/tests/xfs/411
@@ -26,8 +26,11 @@ _require_scratch_xfs_fuzz_fields
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
 echo "Fuzz refcountbt"
-_scratch_xfs_fuzz_metadata '' 'online'  'agf 0' 'addr refcntroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr refcntroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing refcountbt"
 
 # success, all done
diff --git a/tests/xfs/457 b/tests/xfs/457
index 332eeb9837..64cd6b4b82 100755
--- a/tests/xfs/457
+++ b/tests/xfs/457
@@ -25,8 +25,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
 echo "Fuzz bnobt recs"
-_scratch_xfs_fuzz_metadata '' 'none'  'agf 0' 'addr bnoroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr bnoroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing bnobt recs"
 
 # success, all done
diff --git a/tests/xfs/458 b/tests/xfs/458
index ce03d687ab..8d87ec569f 100755
--- a/tests/xfs/458
+++ b/tests/xfs/458
@@ -25,8 +25,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
 echo "Fuzz bnobt keyptr"
-_scratch_xfs_fuzz_metadata '' 'none'  'agf 0' 'addr bnoroot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr bnoroot' >> $seqres.full
 echo "Done fuzzing bnobt keyptr"
 
 # success, all done
diff --git a/tests/xfs/459 b/tests/xfs/459
index d166160f87..5989bc1e6e 100755
--- a/tests/xfs/459
+++ b/tests/xfs/459
@@ -25,8 +25,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'cnt' 2)" || \
+	_fail "could not find two-level cntbt"
+
 echo "Fuzz cntbt"
-_scratch_xfs_fuzz_metadata '' 'none'  'agf 0' 'addr cntroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr cntroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing cntbt"
 
 # success, all done
diff --git a/tests/xfs/460 b/tests/xfs/460
index 0daafa3066..7117477011 100755
--- a/tests/xfs/460
+++ b/tests/xfs/460
@@ -25,8 +25,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
 echo "Fuzz inobt"
-_scratch_xfs_fuzz_metadata '' 'none'  'agi 1' 'addr root' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr root' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing inobt"
 
 # success, all done
diff --git a/tests/xfs/461 b/tests/xfs/461
index 2d20c69d87..7c1327b052 100755
--- a/tests/xfs/461
+++ b/tests/xfs/461
@@ -26,8 +26,11 @@ _require_xfs_finobt
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'fino' 2)" || \
+	_fail "could not find two-level finobt"
+
 echo "Fuzz finobt"
-_scratch_xfs_fuzz_metadata '' 'none'  'agi 0' 'addr free_root' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr free_root' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing finobt"
 
 # success, all done
diff --git a/tests/xfs/462 b/tests/xfs/462
index 587facc03c..1ee4d27e92 100755
--- a/tests/xfs/462
+++ b/tests/xfs/462
@@ -26,8 +26,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
 echo "Fuzz rmapbt recs"
-_scratch_xfs_fuzz_metadata '' 'none' 'agf 0' 'addr rmaproot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none' "$path" 'addr rmaproot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing rmapbt recs"
 
 # success, all done
diff --git a/tests/xfs/463 b/tests/xfs/463
index 7270f7017a..7dd2d37dea 100755
--- a/tests/xfs/463
+++ b/tests/xfs/463
@@ -26,8 +26,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
 echo "Fuzz rmapbt keyptr"
-_scratch_xfs_fuzz_metadata '' 'none' 'agf 0' 'addr rmaproot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none' "$path" 'addr rmaproot' >> $seqres.full
 echo "Done fuzzing rmapbt keyptr"
 
 # success, all done
diff --git a/tests/xfs/464 b/tests/xfs/464
index 59d25ae1c0..719901e66d 100755
--- a/tests/xfs/464
+++ b/tests/xfs/464
@@ -4,7 +4,7 @@
 #
 # FS QA Test No. 464
 #
-# Populate a XFS filesystem and fuzz every refcountbt field.
+# Populate a XFS filesystem and fuzz every refcountbt key/pointer field.
 # Do not fix the filesystem, to test metadata verifiers.
 
 . ./common/preamble
@@ -27,8 +27,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
 echo "Fuzz refcountbt"
-_scratch_xfs_fuzz_metadata '' 'none'  'agf 0' 'addr refcntroot' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr refcntroot' >> $seqres.full
 echo "Done fuzzing refcountbt"
 
 # success, all done
diff --git a/tests/xfs/483 b/tests/xfs/483
index d7b0101a82..56670ba178 100755
--- a/tests/xfs/483
+++ b/tests/xfs/483
@@ -27,8 +27,11 @@ _disable_dmesg_check
 echo "Format and populate"
 _scratch_populate_cached nofill > $seqres.full 2>&1
 
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
 echo "Fuzz refcountbt"
-_scratch_xfs_fuzz_metadata '' 'none'  'agf 0' 'addr refcntroot' 'addr ptrs[1]' >> $seqres.full
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr refcntroot' 'addr ptrs[1]' >> $seqres.full
 echo "Done fuzzing refcountbt"
 
 # success, all done


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (71 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/5] fuzzy: test fuzzing directory block mappings Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
                   ` (4 subsequent siblings)
  77 siblings, 5 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This patchset fills some gaps in our fuzz testing for XFS.  I forgot to
include fuzz testing the data fork mappings of directories and xattrs to
see how scrub responds to nonsensical file maps, and I omitted tests to
fuzz the realtime metadata.  Add those.

Finally, add a new fuzz testing strategy known as bothrepair.  This
simulates what system administrators are most likely to do upon
receiving a health alert about a pet filesystem -- try to repair it with
xfs_scrub, and if that doesn't work, unmount the filesystem and run
xfs_repair.  Between the two, we ought to be able to fix every possible
problem.
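
For reference, a minimal sketch of the sequence that bothrepair
emulates, as you might run it by hand (device and mountpoint are
illustrative):

  xfs_scrub /mnt || {
      umount /mnt
      xfs_repair /dev/sdf
      mount /dev/sdf /mnt
  }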

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing
---
 common/fuzzy       |  139 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 tests/xfs/1500     |   35 +++++++++++++
 tests/xfs/1500.out |    4 +
 tests/xfs/1501     |   35 +++++++++++++
 tests/xfs/1501.out |    4 +
 tests/xfs/1502     |   45 +++++++++++++++++
 tests/xfs/1502.out |    6 ++
 tests/xfs/1503     |   35 +++++++++++++
 tests/xfs/1503.out |    4 +
 tests/xfs/1504     |   38 ++++++++++++++
 tests/xfs/1504.out |    4 +
 tests/xfs/1505     |   38 ++++++++++++++
 tests/xfs/1505.out |    4 +
 tests/xfs/1506     |   38 ++++++++++++++
 tests/xfs/1506.out |    4 +
 tests/xfs/1507     |   38 ++++++++++++++
 tests/xfs/1507.out |    4 +
 tests/xfs/1508     |   39 +++++++++++++++
 tests/xfs/1508.out |    4 +
 tests/xfs/1509     |   39 +++++++++++++++
 tests/xfs/1509.out |    4 +
 tests/xfs/1510     |   39 +++++++++++++++
 tests/xfs/1510.out |    4 +
 tests/xfs/1511     |   40 +++++++++++++++
 tests/xfs/1511.out |    4 +
 tests/xfs/1512     |   40 +++++++++++++++
 tests/xfs/1512.out |    5 ++
 tests/xfs/1513     |   40 +++++++++++++++
 tests/xfs/1513.out |    5 ++
 tests/xfs/1514     |   40 +++++++++++++++
 tests/xfs/1514.out |    5 ++
 tests/xfs/1515     |   42 ++++++++++++++++
 tests/xfs/1515.out |    5 ++
 tests/xfs/1516     |   40 +++++++++++++++
 tests/xfs/1516.out |    5 ++
 tests/xfs/1517     |   40 +++++++++++++++
 tests/xfs/1517.out |    5 ++
 tests/xfs/1518     |   40 +++++++++++++++
 tests/xfs/1518.out |    5 ++
 tests/xfs/1519     |   41 +++++++++++++++
 tests/xfs/1519.out |    5 ++
 tests/xfs/1520     |   42 ++++++++++++++++
 tests/xfs/1520.out |    5 ++
 tests/xfs/1521     |   42 ++++++++++++++++
 tests/xfs/1521.out |    5 ++
 tests/xfs/1522     |   42 ++++++++++++++++
 tests/xfs/1522.out |    5 ++
 tests/xfs/1523     |   42 ++++++++++++++++
 tests/xfs/1523.out |    5 ++
 tests/xfs/1524     |   40 +++++++++++++++
 tests/xfs/1524.out |    5 ++
 tests/xfs/1525     |   40 +++++++++++++++
 tests/xfs/1525.out |    5 ++
 tests/xfs/1526     |   40 +++++++++++++++
 tests/xfs/1526.out |    5 ++
 tests/xfs/1527     |   40 +++++++++++++++
 tests/xfs/1527.out |    5 ++
 tests/xfs/1530     |   40 +++++++++++++++
 tests/xfs/1530.out |    4 +
 tests/xfs/1531     |   40 +++++++++++++++
 tests/xfs/1531.out |    5 ++
 tests/xfs/1532     |   40 +++++++++++++++
 tests/xfs/1532.out |    5 ++
 tests/xfs/1533     |   40 +++++++++++++++
 tests/xfs/1533.out |    5 ++
 tests/xfs/1534     |   38 ++++++++++++++
 tests/xfs/1534.out |    4 +
 tests/xfs/1535     |   38 ++++++++++++++
 tests/xfs/1535.out |    4 +
 tests/xfs/1536     |   38 ++++++++++++++
 tests/xfs/1536.out |    4 +
 tests/xfs/1537     |   41 +++++++++++++++
 tests/xfs/1537.out |    5 ++
 tests/xfs/1554     |   48 ++++++++++++++++++
 tests/xfs/1554.out |   10 ++++
 tests/xfs/1555     |   48 ++++++++++++++++++
 tests/xfs/1555.out |   10 ++++
 tests/xfs/1556     |   48 ++++++++++++++++++
 tests/xfs/1556.out |   10 ++++
 tests/xfs/1557     |   48 ++++++++++++++++++
 tests/xfs/1557.out |   10 ++++
 tests/xfs/1558     |   48 ++++++++++++++++++
 tests/xfs/1558.out |   10 ++++
 tests/xfs/1559     |   48 ++++++++++++++++++
 tests/xfs/1559.out |   10 ++++
 tests/xfs/1560     |   49 ++++++++++++++++++
 tests/xfs/1560.out |   10 ++++
 tests/xfs/1561     |   49 ++++++++++++++++++
 tests/xfs/1561.out |   10 ++++
 tests/xfs/1562     |   41 +++++++++++++++
 tests/xfs/1562.out |    4 +
 tests/xfs/1563     |   41 +++++++++++++++
 tests/xfs/1563.out |    4 +
 tests/xfs/1564     |   41 +++++++++++++++
 tests/xfs/1564.out |    4 +
 tests/xfs/1565     |   41 +++++++++++++++
 tests/xfs/1565.out |    4 +
 tests/xfs/1566     |   42 ++++++++++++++++
 tests/xfs/1566.out |    4 +
 tests/xfs/1567     |   42 ++++++++++++++++
 tests/xfs/1567.out |    4 +
 tests/xfs/1568     |   41 +++++++++++++++
 tests/xfs/1568.out |    4 +
 tests/xfs/1569     |   41 +++++++++++++++
 tests/xfs/1569.out |    4 +
 tests/xfs/1570     |   36 +++++++++++++
 tests/xfs/1570.out |    4 +
 tests/xfs/1571     |   36 +++++++++++++
 tests/xfs/1571.out |    4 +
 tests/xfs/1572     |   38 ++++++++++++++
 tests/xfs/1572.out |    4 +
 tests/xfs/1573     |   37 ++++++++++++++
 tests/xfs/1573.out |    4 +
 113 files changed, 2717 insertions(+), 4 deletions(-)
 create mode 100755 tests/xfs/1500
 create mode 100644 tests/xfs/1500.out
 create mode 100755 tests/xfs/1501
 create mode 100644 tests/xfs/1501.out
 create mode 100755 tests/xfs/1502
 create mode 100644 tests/xfs/1502.out
 create mode 100755 tests/xfs/1503
 create mode 100644 tests/xfs/1503.out
 create mode 100755 tests/xfs/1504
 create mode 100644 tests/xfs/1504.out
 create mode 100755 tests/xfs/1505
 create mode 100644 tests/xfs/1505.out
 create mode 100755 tests/xfs/1506
 create mode 100644 tests/xfs/1506.out
 create mode 100755 tests/xfs/1507
 create mode 100644 tests/xfs/1507.out
 create mode 100755 tests/xfs/1508
 create mode 100644 tests/xfs/1508.out
 create mode 100755 tests/xfs/1509
 create mode 100644 tests/xfs/1509.out
 create mode 100755 tests/xfs/1510
 create mode 100644 tests/xfs/1510.out
 create mode 100755 tests/xfs/1511
 create mode 100644 tests/xfs/1511.out
 create mode 100755 tests/xfs/1512
 create mode 100644 tests/xfs/1512.out
 create mode 100755 tests/xfs/1513
 create mode 100644 tests/xfs/1513.out
 create mode 100755 tests/xfs/1514
 create mode 100644 tests/xfs/1514.out
 create mode 100755 tests/xfs/1515
 create mode 100644 tests/xfs/1515.out
 create mode 100755 tests/xfs/1516
 create mode 100644 tests/xfs/1516.out
 create mode 100755 tests/xfs/1517
 create mode 100644 tests/xfs/1517.out
 create mode 100755 tests/xfs/1518
 create mode 100644 tests/xfs/1518.out
 create mode 100755 tests/xfs/1519
 create mode 100644 tests/xfs/1519.out
 create mode 100755 tests/xfs/1520
 create mode 100644 tests/xfs/1520.out
 create mode 100755 tests/xfs/1521
 create mode 100644 tests/xfs/1521.out
 create mode 100755 tests/xfs/1522
 create mode 100644 tests/xfs/1522.out
 create mode 100755 tests/xfs/1523
 create mode 100644 tests/xfs/1523.out
 create mode 100755 tests/xfs/1524
 create mode 100644 tests/xfs/1524.out
 create mode 100755 tests/xfs/1525
 create mode 100644 tests/xfs/1525.out
 create mode 100755 tests/xfs/1526
 create mode 100644 tests/xfs/1526.out
 create mode 100755 tests/xfs/1527
 create mode 100644 tests/xfs/1527.out
 create mode 100755 tests/xfs/1530
 create mode 100644 tests/xfs/1530.out
 create mode 100755 tests/xfs/1531
 create mode 100644 tests/xfs/1531.out
 create mode 100755 tests/xfs/1532
 create mode 100644 tests/xfs/1532.out
 create mode 100755 tests/xfs/1533
 create mode 100644 tests/xfs/1533.out
 create mode 100755 tests/xfs/1534
 create mode 100644 tests/xfs/1534.out
 create mode 100755 tests/xfs/1535
 create mode 100644 tests/xfs/1535.out
 create mode 100755 tests/xfs/1536
 create mode 100644 tests/xfs/1536.out
 create mode 100755 tests/xfs/1537
 create mode 100644 tests/xfs/1537.out
 create mode 100755 tests/xfs/1554
 create mode 100644 tests/xfs/1554.out
 create mode 100755 tests/xfs/1555
 create mode 100644 tests/xfs/1555.out
 create mode 100755 tests/xfs/1556
 create mode 100644 tests/xfs/1556.out
 create mode 100755 tests/xfs/1557
 create mode 100644 tests/xfs/1557.out
 create mode 100755 tests/xfs/1558
 create mode 100644 tests/xfs/1558.out
 create mode 100755 tests/xfs/1559
 create mode 100644 tests/xfs/1559.out
 create mode 100755 tests/xfs/1560
 create mode 100644 tests/xfs/1560.out
 create mode 100755 tests/xfs/1561
 create mode 100644 tests/xfs/1561.out
 create mode 100755 tests/xfs/1562
 create mode 100644 tests/xfs/1562.out
 create mode 100755 tests/xfs/1563
 create mode 100644 tests/xfs/1563.out
 create mode 100755 tests/xfs/1564
 create mode 100644 tests/xfs/1564.out
 create mode 100755 tests/xfs/1565
 create mode 100644 tests/xfs/1565.out
 create mode 100755 tests/xfs/1566
 create mode 100644 tests/xfs/1566.out
 create mode 100755 tests/xfs/1567
 create mode 100644 tests/xfs/1567.out
 create mode 100755 tests/xfs/1568
 create mode 100644 tests/xfs/1568.out
 create mode 100755 tests/xfs/1569
 create mode 100644 tests/xfs/1569.out
 create mode 100755 tests/xfs/1570
 create mode 100644 tests/xfs/1570.out
 create mode 100755 tests/xfs/1571
 create mode 100644 tests/xfs/1571.out
 create mode 100755 tests/xfs/1572
 create mode 100755 tests/xfs/1572.out
 create mode 100755 tests/xfs/1573
 create mode 100644 tests/xfs/1573.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/5] fuzzy: test fuzzing directory block mappings
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/5] fuzzy: test fuzzing xattr " Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Fuzz the block mappings of directories to see what happens.
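
The set of directory formats exercised can be narrowed per run via an
environment knob added below; e.g. to fuzz only the block and leaf
forms (the test number is just an example):

  SCRATCH_XFS_LIST_FUZZ_DIRTYPE="BLOCK LEAF" ./check xfs/1554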

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy       |   16 ++++++++++++++++
 tests/xfs/1554     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1554.out |   10 ++++++++++
 tests/xfs/1555     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1555.out |   10 ++++++++++
 tests/xfs/1556     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1556.out |   10 ++++++++++
 7 files changed, 190 insertions(+)
 create mode 100755 tests/xfs/1554
 create mode 100644 tests/xfs/1554.out
 create mode 100755 tests/xfs/1555
 create mode 100644 tests/xfs/1555.out
 create mode 100755 tests/xfs/1556
 create mode 100644 tests/xfs/1556.out


diff --git a/common/fuzzy b/common/fuzzy
index 53fe22db69..09f42d9225 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -549,6 +549,22 @@ _require_scratch_xfs_fuzz_fields()
 	_require_xfs_db_command "fuzz"
 }
 
+# Sets the array SCRATCH_XFS_DIR_FUZZ_TYPES to the list of directory formats
+# available for fuzzing.  Each list item must match one of the /S_IFDIR.FMT_*
+# files created by the fs population code.  Users can override this by setting
+# SCRATCH_XFS_LIST_FUZZ_DIRTYPE in the environment.  BTREE is omitted here
+# because that refers to the fork format and does not affect the directory
+# structure at all.
+_scratch_xfs_set_dir_fuzz_types() {
+	if [ -n "${SCRATCH_XFS_LIST_FUZZ_DIRTYPE}" ]; then
+		mapfile -t SCRATCH_XFS_DIR_FUZZ_TYPES < \
+				<(echo "${SCRATCH_XFS_LIST_FUZZ_DIRTYPE}" | tr '[ ,]' '[\n\n]')
+		return
+	fi
+
+	SCRATCH_XFS_DIR_FUZZ_TYPES=(BLOCK LEAF LEAFN NODE)
+}
+
 # Grab the list of available fuzzing verbs
 _scratch_xfs_list_fuzz_verbs() {
 	if [ -n "${SCRATCH_XFS_LIST_FUZZ_VERBS}" ]; then
diff --git a/tests/xfs/1554 b/tests/xfs/1554
new file mode 100755
index 0000000000..b43c705cb9
--- /dev/null
+++ b/tests/xfs/1554
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1554
+#
+# Populate a XFS filesystem and fuzz the data mappings of every directory type.
+# Use xfs_scrub to fix the corruption.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_online_repair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_dir_fuzz_types
+
+# Now fuzz the block maps of each directory type.
+for dirtype in "${SCRATCH_XFS_DIR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${dirtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the directory block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_${dirtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'u*.bmx' 'online'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing dir map ${dirtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1554.out b/tests/xfs/1554.out
new file mode 100644
index 0000000000..9985ed1990
--- /dev/null
+++ b/tests/xfs/1554.out
@@ -0,0 +1,10 @@
+QA output created by 1554
+Format and populate
+Fuzz block map for BLOCK
+Done fuzzing dir map BLOCK
+Fuzz block map for LEAF
+Done fuzzing dir map LEAF
+Fuzz block map for LEAFN
+Done fuzzing dir map LEAFN
+Fuzz block map for NODE
+Done fuzzing dir map NODE
diff --git a/tests/xfs/1555 b/tests/xfs/1555
new file mode 100755
index 0000000000..8dee177a3d
--- /dev/null
+++ b/tests/xfs/1555
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1555
+#
+# Populate a XFS filesystem and fuzz the data mappings of every directory type.
+# Use xfs_repair to fix the corruption.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_repair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_dir_fuzz_types
+
+# Now fuzz the block maps of each directory type.
+for dirtype in "${SCRATCH_XFS_DIR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${dirtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the directory block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_${dirtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'u*.bmx' 'offline'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing dir map ${dirtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1555.out b/tests/xfs/1555.out
new file mode 100644
index 0000000000..38c5714d13
--- /dev/null
+++ b/tests/xfs/1555.out
@@ -0,0 +1,10 @@
+QA output created by 1555
+Format and populate
+Fuzz block map for BLOCK
+Done fuzzing dir map BLOCK
+Fuzz block map for LEAF
+Done fuzzing dir map LEAF
+Fuzz block map for LEAFN
+Done fuzzing dir map LEAFN
+Fuzz block map for NODE
+Done fuzzing dir map NODE
diff --git a/tests/xfs/1556 b/tests/xfs/1556
new file mode 100755
index 0000000000..54df601c11
--- /dev/null
+++ b/tests/xfs/1556
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1556
+#
+# Populate a XFS filesystem and fuzz the data mappings of every directory type.
+# Do not fix the filesystem, to test metadata verifiers.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_norepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_dir_fuzz_types
+
+# Now fuzz the block maps of each directory type.
+for dirtype in "${SCRATCH_XFS_DIR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${dirtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the directory block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_${dirtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'u*.bmx' 'none'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing dir map ${dirtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1556.out b/tests/xfs/1556.out
new file mode 100644
index 0000000000..b947285caa
--- /dev/null
+++ b/tests/xfs/1556.out
@@ -0,0 +1,10 @@
+QA output created by 1556
+Format and populate
+Fuzz block map for BLOCK
+Done fuzzing dir map BLOCK
+Fuzz block map for LEAF
+Done fuzzing dir map LEAF
+Fuzz block map for LEAFN
+Done fuzzing dir map LEAFN
+Fuzz block map for NODE
+Done fuzzing dir map NODE


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/5] fuzzy: test fuzzing xattr block mappings
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/5] fuzzy: test fuzzing directory block mappings Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 5/5] fuzzy: fuzz test key/pointers of inode btrees Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Fuzz the block mappings of extended attributes to see what happens.
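
A minimal usage sketch, assuming the usual fstests ./check setup: the
new _scratch_xfs_set_xattr_fuzz_types helper honors an environment
override, so a run can be narrowed to specific xattr formats, e.g.

  # hypothetical invocation; LEAF and NODE match the ATTR.FMT_* files
  # created by the populate code
  SCRATCH_XFS_LIST_FUZZ_XATTRTYPE="LEAF NODE" ./check xfs/1557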

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy       |   16 ++++++++++++++++
 tests/xfs/1557     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1557.out |   10 ++++++++++
 tests/xfs/1558     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1558.out |   10 ++++++++++
 tests/xfs/1559     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1559.out |   10 ++++++++++
 7 files changed, 190 insertions(+)
 create mode 100755 tests/xfs/1557
 create mode 100644 tests/xfs/1557.out
 create mode 100755 tests/xfs/1558
 create mode 100644 tests/xfs/1558.out
 create mode 100755 tests/xfs/1559
 create mode 100644 tests/xfs/1559.out


diff --git a/common/fuzzy b/common/fuzzy
index 09f42d9225..d8eb7d8b72 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -565,6 +565,22 @@ _scratch_xfs_set_dir_fuzz_types() {
 	SCRATCH_XFS_DIR_FUZZ_TYPES=(BLOCK LEAF LEAFN NODE)
 }
 
+# Sets the array SCRATCH_XFS_XATTR_FUZZ_TYPES to the list of xattr formats
+# available for fuzzing.  Each list item must match one of the /ATTR.FMT_*
+# files created by the fs population code.  Users can override this by setting
+# SCRATCH_XFS_LIST_FUZZ_XATTRTYPE in the environment.  BTREE is omitted here
+# because that refers to the fork format and does not affect the extended
+# attribute structure at all.
+_scratch_xfs_set_xattr_fuzz_types() {
+	if [ -n "${SCRATCH_XFS_LIST_FUZZ_XATTRTYPE}" ]; then
+		mapfile -t SCRATCH_XFS_XATTR_FUZZ_TYPES < \
+				<(echo "${SCRATCH_XFS_LIST_FUZZ_XATTRTYPE}" | tr '[ ,]' '[\n\n]')
+		return
+	fi
+
+	SCRATCH_XFS_XATTR_FUZZ_TYPES=(EXTENTS_REMOTE3K EXTENTS_REMOTE4K LEAF NODE)
+}
+
 # Grab the list of available fuzzing verbs
 _scratch_xfs_list_fuzz_verbs() {
 	if [ -n "${SCRATCH_XFS_LIST_FUZZ_VERBS}" ]; then
diff --git a/tests/xfs/1557 b/tests/xfs/1557
new file mode 100755
index 0000000000..afd5d31f62
--- /dev/null
+++ b/tests/xfs/1557
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1557
+#
+# Populate an XFS filesystem and fuzz the attr mappings of every xattr type.
+# Use xfs_scrub to fix the corruption.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_online_repair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_xattr_fuzz_types
+
+# Now fuzz the block maps of each xattr type.
+for attrtype in "${SCRATCH_XFS_XATTR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${attrtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the xattr block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_${attrtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'a*.bmx' 'online'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing attr map ${attrtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1557.out b/tests/xfs/1557.out
new file mode 100644
index 0000000000..e4d92dd6b8
--- /dev/null
+++ b/tests/xfs/1557.out
@@ -0,0 +1,10 @@
+QA output created by 1557
+Format and populate
+Fuzz block map for EXTENTS_REMOTE3K
+Done fuzzing attr map EXTENTS_REMOTE3K
+Fuzz block map for EXTENTS_REMOTE4K
+Done fuzzing attr map EXTENTS_REMOTE4K
+Fuzz block map for LEAF
+Done fuzzing attr map LEAF
+Fuzz block map for NODE
+Done fuzzing attr map NODE
diff --git a/tests/xfs/1558 b/tests/xfs/1558
new file mode 100755
index 0000000000..0683b06010
--- /dev/null
+++ b/tests/xfs/1558
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1558
+#
+# Populate an XFS filesystem and fuzz the attr mappings of every xattr type.
+# Use xfs_repair to fix the corruption.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_repair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_xattr_fuzz_types
+
+# Now fuzz the block maps of each xattr type.
+for attrtype in "${SCRATCH_XFS_XATTR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${attrtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the xattr block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_${attrtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'a*.bmx' 'offline'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing attr map ${attrtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1558.out b/tests/xfs/1558.out
new file mode 100644
index 0000000000..6fed892cb4
--- /dev/null
+++ b/tests/xfs/1558.out
@@ -0,0 +1,10 @@
+QA output created by 1558
+Format and populate
+Fuzz block map for EXTENTS_REMOTE3K
+Done fuzzing attr map EXTENTS_REMOTE3K
+Fuzz block map for EXTENTS_REMOTE4K
+Done fuzzing attr map EXTENTS_REMOTE4K
+Fuzz block map for LEAF
+Done fuzzing attr map LEAF
+Fuzz block map for NODE
+Done fuzzing attr map NODE
diff --git a/tests/xfs/1559 b/tests/xfs/1559
new file mode 100755
index 0000000000..886c1f2641
--- /dev/null
+++ b/tests/xfs/1559
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1559
+#
+# Populate an XFS filesystem and fuzz the attr mappings of every xattr type.
+# Do not fix the filesystem, to test metadata verifiers.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_norepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_xattr_fuzz_types
+
+# Now fuzz the block maps of each xattr type.
+for attrtype in "${SCRATCH_XFS_XATTR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${attrtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the xattr block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_${attrtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'a*.bmx' 'none'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing attr map ${attrtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1559.out b/tests/xfs/1559.out
new file mode 100644
index 0000000000..19fe4c91df
--- /dev/null
+++ b/tests/xfs/1559.out
@@ -0,0 +1,10 @@
+QA output created by 1559
+Format and populate
+Fuzz block map for EXTENTS_REMOTE3K
+Done fuzzing attr map EXTENTS_REMOTE3K
+Fuzz block map for EXTENTS_REMOTE4K
+Done fuzzing attr map EXTENTS_REMOTE4K
+Fuzz block map for LEAF
+Done fuzzing attr map LEAF
+Fuzz block map for NODE
+Done fuzzing attr map NODE


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/5] fuzzy: test fuzzing realtime free space metadata
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 4/5] xfs: fuzz test both repair strategies Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Fuzz the contents of the realtime bitmap and summary files to see what
happens.
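
The rtbitmap and rtsummary blocks have no field structure, so the
fuzzer treats them as unstructured blobs: _scratch_xfs_list_fuzz_verbs
detects the blob from the xfs_db stack output and substitutes six
blocktrash-based verbs.  As a rough sketch, assuming a 4096-byte
rtbitmap block (bitmax = 4096 * 8 = 32768 bits) and the usual scratch
device, the "middlebit" verb expands to roughly:

  # hypothetical expansion; navigation is via 'sb'/'addr rbmino', or
  # 'path -m /realtime/0.bitmap' on metadir filesystems
  xfs_db -x -c 'sb' -c 'addr rbmino' -c 'dblock 0' \
         -c 'blocktrash -z -2 -o 16384 -x 1 -y 1' $SCRATCH_DEV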

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy       |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 tests/xfs/1562     |   41 ++++++++++++++++++++
 tests/xfs/1562.out |    4 ++
 tests/xfs/1563     |   41 ++++++++++++++++++++
 tests/xfs/1563.out |    4 ++
 tests/xfs/1564     |   41 ++++++++++++++++++++
 tests/xfs/1564.out |    4 ++
 tests/xfs/1565     |   41 ++++++++++++++++++++
 tests/xfs/1565.out |    4 ++
 tests/xfs/1566     |   42 ++++++++++++++++++++
 tests/xfs/1566.out |    4 ++
 tests/xfs/1567     |   42 ++++++++++++++++++++
 tests/xfs/1567.out |    4 ++
 tests/xfs/1568     |   41 ++++++++++++++++++++
 tests/xfs/1568.out |    4 ++
 tests/xfs/1569     |   41 ++++++++++++++++++++
 tests/xfs/1569.out |    4 ++
 17 files changed, 465 insertions(+), 4 deletions(-)
 create mode 100755 tests/xfs/1562
 create mode 100644 tests/xfs/1562.out
 create mode 100755 tests/xfs/1563
 create mode 100644 tests/xfs/1563.out
 create mode 100755 tests/xfs/1564
 create mode 100644 tests/xfs/1564.out
 create mode 100755 tests/xfs/1565
 create mode 100644 tests/xfs/1565.out
 create mode 100755 tests/xfs/1566
 create mode 100644 tests/xfs/1566.out
 create mode 100755 tests/xfs/1567
 create mode 100644 tests/xfs/1567.out
 create mode 100755 tests/xfs/1568
 create mode 100644 tests/xfs/1568.out
 create mode 100755 tests/xfs/1569
 create mode 100644 tests/xfs/1569.out


diff --git a/common/fuzzy b/common/fuzzy
index d8eb7d8b72..ef54f2fe2c 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -154,6 +154,12 @@ _scratch_xfs_dump_metadata() {
 	_scratch_xfs_db "${cmds[@]}" -c print
 }
 
+# Decide from the output of the xfs_db "stack" command if the debugger's io
+# cursor is pointed at a block that is an unstructured data format (blob).
+__scratch_xfs_detect_blob_from_stack() {
+	grep -q -E 'inode.*, type (data|rtsummary|rtbitmap)'
+}
+
 # Navigate to some part of the filesystem and print the field info.
 # The first argument is a grep filter for the fields
 # The rest of the arguments are xfs_db commands to locate the metadata.
@@ -169,7 +175,17 @@ _scratch_xfs_list_metadata_fields() {
 	for arg in "$@"; do
 		cmds+=("-c" "${arg}")
 	done
-	_scratch_xfs_db "${cmds[@]}" -c print | __filter_xfs_db_print_fields "${filter}"
+
+	# Does the path argument point towards something that is an
+	# unstructured blob?
+	if _scratch_xfs_db "${cmds[@]}" -c stack 2>/dev/null | \
+			__scratch_xfs_detect_blob_from_stack; then
+		echo "<blob>"
+		return
+	fi
+
+	_scratch_xfs_db "${cmds[@]}" -c print | \
+		__filter_xfs_db_print_fields "${filter}"
 }
 
 # Fuzz a metadata field
@@ -207,6 +223,70 @@ _scratch_xfs_fuzz_metadata_field() {
 	return 0
 }
 
+# List the fuzzing verbs available for unstructured blobs
+__scratch_xfs_list_blob_fuzz_verbs() {
+		cat << ENDL
+zeroes
+ones
+firstbit
+middlebit
+lastbit
+random
+ENDL
+}
+
+# Fuzz a metadata blob
+# The first arg is a blob fuzzing verb
+# The rest of the arguments are xfs_db commands to find the metadata.
+_scratch_xfs_fuzz_metadata_blob() {
+	local fuzzverb="$1"
+	shift
+	local trashcmd=(blocktrash -z)
+
+	local cmds=()
+	for arg in "$@"; do
+		cmds+=("-c" "${arg}")
+	done
+
+	local bytecount=$(_scratch_xfs_db "${cmds[@]}" -c "stack" | grep 'byte.*length' | awk '{print $5}')
+	local bitmax=$((bytecount * 8))
+
+	case "${fuzzverb}" in
+	"zeroes")
+		trashcmd+=(-0 -o 0 -x "${bitmax}" -y "${bitmax}");;
+	"ones")
+		trashcmd+=(-1 -o 0 -x "${bitmax}" -y "${bitmax}");;
+	"firstbit")
+		trashcmd+=(-2 -o 0 -x 1 -y 1);;
+	"middlebit")
+		trashcmd+=(-2 -o $((bitmax / 2)) -x 1 -y 1);;
+	"lastbit")
+		trashcmd+=(-2 -o "${bitmax}" -x 1 -y 1);;
+	"random")
+		trashcmd+=(-3 -o 0 -x "${bitmax}" -y "${bitmax}");;
+	*)
+		echo "Unknown blob fuzz verb \"${fuzzverb}\"."
+		return 1
+		;;
+	esac
+
+	trashcmd="${trashcmd[@]}"
+	oldval="$(_scratch_xfs_get_metadata_field "" "$@")"
+	while true; do
+		_scratch_xfs_db -x "${cmds[@]}" -c "${trashcmd}"
+		echo
+		newval="$(_scratch_xfs_get_metadata_field "" "$@" 2> /dev/null)"
+		if [ "${fuzzverb}" != "random" ] || [ "${oldval}" != "${newval}" ]; then
+			break;
+		fi
+	done
+	if [ "${oldval}" = "${newval}" ]; then
+		echo "Blob already set to new value, skipping test."
+		return 1
+	fi
+	return 0
+}
+
 # Try to forcibly unmount the scratch fs
 __scratch_xfs_fuzz_unmount()
 {
@@ -503,7 +583,11 @@ __scratch_xfs_fuzz_field_test() {
 
 	# Set the new field value
 	__fuzz_notify "+ Fuzz ${field} = ${fuzzverb}"
-	_scratch_xfs_fuzz_metadata_field "${field}" ${fuzzverb} "$@"
+	if [ "$field" = "<blob>" ]; then
+		_scratch_xfs_fuzz_metadata_blob ${fuzzverb} "$@"
+	else
+		_scratch_xfs_fuzz_metadata_field "${field}" ${fuzzverb} "$@"
+	fi
 	res=$?
 	test $res -ne 0 && return
 
@@ -587,7 +671,22 @@ _scratch_xfs_list_fuzz_verbs() {
 		echo "${SCRATCH_XFS_LIST_FUZZ_VERBS}" | tr '[ ,]' '[\n\n]'
 		return;
 	fi
-	_scratch_xfs_db -x -c 'sb 0' -c 'fuzz' | grep '^Fuzz commands:' | \
+
+	local cmds=()
+	for arg in "$@"; do
+		cmds+=("-c" "${arg}")
+	done
+	test "${#cmds[@]}" -eq 0 && cmds=('-c' 'sb 0')
+
+	# Does the path argument point towards something that is an
+	# unstructured blob?
+	if _scratch_xfs_db "${cmds[@]}" -c stack 2>/dev/null | \
+			__scratch_xfs_detect_blob_from_stack; then
+		__scratch_xfs_list_blob_fuzz_verbs
+		return
+	fi
+
+	_scratch_xfs_db -x "${cmds[@]}" -c 'fuzz' | grep '^Fuzz commands:' | \
 		sed -e 's/[,.]//g' -e 's/Fuzz commands: //g' -e 's/ /\n/g' | \
 		grep -v '^random$'
 }
@@ -605,7 +704,7 @@ _scratch_xfs_fuzz_metadata() {
 	shift; shift
 
 	fields="$(_scratch_xfs_list_metadata_fields "${filter}" "$@")"
-	verbs="$(_scratch_xfs_list_fuzz_verbs)"
+	verbs="$(_scratch_xfs_list_fuzz_verbs "$@")"
 	echo "Fields we propose to fuzz with the \"${repair}\" repair strategy: $@"
 	echo $(echo "${fields}")
 	echo "Verbs we propose to fuzz with:"
diff --git a/tests/xfs/1562 b/tests/xfs/1562
new file mode 100755
index 0000000000..015209eeb2
--- /dev/null
+++ b/tests/xfs/1562
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1562
+#
+# Populate an XFS filesystem and fuzz every realtime bitmap field.
+# Use xfs_scrub to fix the corruption.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_online_repair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtbitmap"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.bitmap')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.bitmap')
+else
+	path=('sb' 'addr rbmino')
+fi
+_scratch_xfs_fuzz_metadata '' 'online' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtbitmap"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1562.out b/tests/xfs/1562.out
new file mode 100644
index 0000000000..63a3bc7600
--- /dev/null
+++ b/tests/xfs/1562.out
@@ -0,0 +1,4 @@
+QA output created by 1562
+Format and populate
+Fuzz rtbitmap
+Done fuzzing rtbitmap
diff --git a/tests/xfs/1563 b/tests/xfs/1563
new file mode 100755
index 0000000000..2be0870a3d
--- /dev/null
+++ b/tests/xfs/1563
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1563
+#
+# Populate an XFS filesystem and fuzz every realtime summary field.
+# Use xfs_scrub to fix the corruption.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_online_repair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtsummary"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.summary')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.summary')
+else
+	path=('sb' 'addr rsumino')
+fi
+_scratch_xfs_fuzz_metadata '' 'online' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtsummary"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1563.out b/tests/xfs/1563.out
new file mode 100644
index 0000000000..e4ca8c3733
--- /dev/null
+++ b/tests/xfs/1563.out
@@ -0,0 +1,4 @@
+QA output created by 1563
+Format and populate
+Fuzz rtsummary
+Done fuzzing rtsummary
diff --git a/tests/xfs/1564 b/tests/xfs/1564
new file mode 100755
index 0000000000..c0d10ff0e9
--- /dev/null
+++ b/tests/xfs/1564
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1564
+#
+# Populate an XFS filesystem and fuzz every realtime bitmap field.
+# Use xfs_repair to fix the corruption.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_repair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtbitmap"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.bitmap')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.bitmap')
+else
+	path=('sb' 'addr rbmino')
+fi
+_scratch_xfs_fuzz_metadata '' 'offline' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtbitmap"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1564.out b/tests/xfs/1564.out
new file mode 100644
index 0000000000..afede817b8
--- /dev/null
+++ b/tests/xfs/1564.out
@@ -0,0 +1,4 @@
+QA output created by 1564
+Format and populate
+Fuzz rtbitmap
+Done fuzzing rtbitmap
diff --git a/tests/xfs/1565 b/tests/xfs/1565
new file mode 100755
index 0000000000..6b4186fb3c
--- /dev/null
+++ b/tests/xfs/1565
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1565
+#
+# Populate an XFS filesystem and fuzz every realtime summary field.
+# Use xfs_repair to fix the corruption.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_repair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtsummary"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.summary')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.summary')
+else
+	path=('sb' 'addr rsumino')
+fi
+_scratch_xfs_fuzz_metadata '' 'offline' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtsummary"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1565.out b/tests/xfs/1565.out
new file mode 100644
index 0000000000..7a8d9d04d6
--- /dev/null
+++ b/tests/xfs/1565.out
@@ -0,0 +1,4 @@
+QA output created by 1565
+Format and populate
+Fuzz rtsummary
+Done fuzzing rtsummary
diff --git a/tests/xfs/1566 b/tests/xfs/1566
new file mode 100755
index 0000000000..8d0f61ae10
--- /dev/null
+++ b/tests/xfs/1566
@@ -0,0 +1,42 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1566
+#
+# Populate an XFS filesystem and fuzz every realtime bitmap field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_bothrepair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtbitmap"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.bitmap')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.bitmap')
+else
+	path=('sb' 'addr rbmino')
+fi
+_scratch_xfs_fuzz_metadata '' 'both' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtbitmap"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1566.out b/tests/xfs/1566.out
new file mode 100644
index 0000000000..d50e1d8539
--- /dev/null
+++ b/tests/xfs/1566.out
@@ -0,0 +1,4 @@
+QA output created by 1566
+Format and populate
+Fuzz rtbitmap
+Done fuzzing rtbitmap
diff --git a/tests/xfs/1567 b/tests/xfs/1567
new file mode 100755
index 0000000000..7dc2012b67
--- /dev/null
+++ b/tests/xfs/1567
@@ -0,0 +1,42 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1567
+#
+# Populate an XFS filesystem and fuzz every realtime summary field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_bothrepair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtsummary"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.summary')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.summary')
+else
+	path=('sb' 'addr rsumino')
+fi
+_scratch_xfs_fuzz_metadata '' 'both' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtsummary"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1567.out b/tests/xfs/1567.out
new file mode 100644
index 0000000000..b88fa0c1b3
--- /dev/null
+++ b/tests/xfs/1567.out
@@ -0,0 +1,4 @@
+QA output created by 1567
+Format and populate
+Fuzz rtsummary
+Done fuzzing rtsummary
diff --git a/tests/xfs/1568 b/tests/xfs/1568
new file mode 100755
index 0000000000..c80640ef97
--- /dev/null
+++ b/tests/xfs/1568
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1568
+#
+# Populate an XFS filesystem and fuzz every realtime bitmap field.
+# Do not fix the filesystem, to test metadata verifiers.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_norepair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtbitmap"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.bitmap')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.bitmap')
+else
+	path=('sb' 'addr rbmino')
+fi
+_scratch_xfs_fuzz_metadata '' 'none' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtbitmap"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1568.out b/tests/xfs/1568.out
new file mode 100644
index 0000000000..a80f579662
--- /dev/null
+++ b/tests/xfs/1568.out
@@ -0,0 +1,4 @@
+QA output created by 1568
+Format and populate
+Fuzz rtbitmap
+Done fuzzing rtbitmap
diff --git a/tests/xfs/1569 b/tests/xfs/1569
new file mode 100755
index 0000000000..e303f08ff5
--- /dev/null
+++ b/tests/xfs/1569
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1569
+#
+# Populate an XFS filesystem and fuzz every realtime summary field.
+# Do not fix the filesystem, to test metadata verifiers.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_norepair realtime
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz rtsummary"
+is_metadir=$(_scratch_xfs_get_metadata_field "core.version" 'path -m /realtime/0.summary')
+if [ -n "$is_metadir" ]; then
+	path=('path -m /realtime/0.summary')
+else
+	path=('sb' 'addr rsumino')
+fi
+_scratch_xfs_fuzz_metadata '' 'none' "${path[@]}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing rtsummary"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1569.out b/tests/xfs/1569.out
new file mode 100644
index 0000000000..d6202cc3af
--- /dev/null
+++ b/tests/xfs/1569.out
@@ -0,0 +1,4 @@
+QA output created by 1569
+Format and populate
+Fuzz rtsummary
+Done fuzzing rtsummary


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/5] xfs: fuzz test both repair strategies
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 5/5] fuzzy: fuzz test key/pointers of inode btrees Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 3/5] fuzzy: test fuzzing realtime free space metadata Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add more fuzz tests to examine the effectiveness of online and then
offline repair.
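
Each of these passes 'both' as the repair strategy to
_scratch_xfs_fuzz_metadata and is tagged with the dangerous_bothrepair
group, so (assuming a kernel with online repair enabled) the whole set
can be run with something like:

  ./check -g dangerous_bothrepair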

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/1500     |   35 +++++++++++++++++++++++++++++++++++
 tests/xfs/1500.out |    4 ++++
 tests/xfs/1501     |   35 +++++++++++++++++++++++++++++++++++
 tests/xfs/1501.out |    4 ++++
 tests/xfs/1502     |   45 +++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1502.out |    6 ++++++
 tests/xfs/1503     |   35 +++++++++++++++++++++++++++++++++++
 tests/xfs/1503.out |    4 ++++
 tests/xfs/1504     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1504.out |    4 ++++
 tests/xfs/1505     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1505.out |    4 ++++
 tests/xfs/1506     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1506.out |    4 ++++
 tests/xfs/1507     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1507.out |    4 ++++
 tests/xfs/1508     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/1508.out |    4 ++++
 tests/xfs/1509     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/1509.out |    4 ++++
 tests/xfs/1510     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/1510.out |    4 ++++
 tests/xfs/1511     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1511.out |    4 ++++
 tests/xfs/1512     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1512.out |    5 +++++
 tests/xfs/1513     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1513.out |    5 +++++
 tests/xfs/1514     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1514.out |    5 +++++
 tests/xfs/1515     |   42 ++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1515.out |    5 +++++
 tests/xfs/1516     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1516.out |    5 +++++
 tests/xfs/1517     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1517.out |    5 +++++
 tests/xfs/1518     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1518.out |    5 +++++
 tests/xfs/1519     |   41 +++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1519.out |    5 +++++
 tests/xfs/1520     |   42 ++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1520.out |    5 +++++
 tests/xfs/1521     |   42 ++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1521.out |    5 +++++
 tests/xfs/1522     |   42 ++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1522.out |    5 +++++
 tests/xfs/1523     |   42 ++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1523.out |    5 +++++
 tests/xfs/1524     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1524.out |    5 +++++
 tests/xfs/1525     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1525.out |    5 +++++
 tests/xfs/1526     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1526.out |    5 +++++
 tests/xfs/1527     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1527.out |    5 +++++
 tests/xfs/1530     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1530.out |    4 ++++
 tests/xfs/1531     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1531.out |    5 +++++
 tests/xfs/1532     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1532.out |    5 +++++
 tests/xfs/1533     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1533.out |    5 +++++
 tests/xfs/1534     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1534.out |    4 ++++
 tests/xfs/1535     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1535.out |    4 ++++
 tests/xfs/1536     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1536.out |    4 ++++
 tests/xfs/1537     |   41 +++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1537.out |    5 +++++
 tests/xfs/1560     |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1560.out |   10 ++++++++++
 tests/xfs/1561     |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1561.out |   10 ++++++++++
 76 files changed, 1709 insertions(+)
 create mode 100755 tests/xfs/1500
 create mode 100644 tests/xfs/1500.out
 create mode 100755 tests/xfs/1501
 create mode 100644 tests/xfs/1501.out
 create mode 100755 tests/xfs/1502
 create mode 100644 tests/xfs/1502.out
 create mode 100755 tests/xfs/1503
 create mode 100644 tests/xfs/1503.out
 create mode 100755 tests/xfs/1504
 create mode 100644 tests/xfs/1504.out
 create mode 100755 tests/xfs/1505
 create mode 100644 tests/xfs/1505.out
 create mode 100755 tests/xfs/1506
 create mode 100644 tests/xfs/1506.out
 create mode 100755 tests/xfs/1507
 create mode 100644 tests/xfs/1507.out
 create mode 100755 tests/xfs/1508
 create mode 100644 tests/xfs/1508.out
 create mode 100755 tests/xfs/1509
 create mode 100644 tests/xfs/1509.out
 create mode 100755 tests/xfs/1510
 create mode 100644 tests/xfs/1510.out
 create mode 100755 tests/xfs/1511
 create mode 100644 tests/xfs/1511.out
 create mode 100755 tests/xfs/1512
 create mode 100644 tests/xfs/1512.out
 create mode 100755 tests/xfs/1513
 create mode 100644 tests/xfs/1513.out
 create mode 100755 tests/xfs/1514
 create mode 100644 tests/xfs/1514.out
 create mode 100755 tests/xfs/1515
 create mode 100644 tests/xfs/1515.out
 create mode 100755 tests/xfs/1516
 create mode 100644 tests/xfs/1516.out
 create mode 100755 tests/xfs/1517
 create mode 100644 tests/xfs/1517.out
 create mode 100755 tests/xfs/1518
 create mode 100644 tests/xfs/1518.out
 create mode 100755 tests/xfs/1519
 create mode 100644 tests/xfs/1519.out
 create mode 100755 tests/xfs/1520
 create mode 100644 tests/xfs/1520.out
 create mode 100755 tests/xfs/1521
 create mode 100644 tests/xfs/1521.out
 create mode 100755 tests/xfs/1522
 create mode 100644 tests/xfs/1522.out
 create mode 100755 tests/xfs/1523
 create mode 100644 tests/xfs/1523.out
 create mode 100755 tests/xfs/1524
 create mode 100644 tests/xfs/1524.out
 create mode 100755 tests/xfs/1525
 create mode 100644 tests/xfs/1525.out
 create mode 100755 tests/xfs/1526
 create mode 100644 tests/xfs/1526.out
 create mode 100755 tests/xfs/1527
 create mode 100644 tests/xfs/1527.out
 create mode 100755 tests/xfs/1530
 create mode 100644 tests/xfs/1530.out
 create mode 100755 tests/xfs/1531
 create mode 100644 tests/xfs/1531.out
 create mode 100755 tests/xfs/1532
 create mode 100644 tests/xfs/1532.out
 create mode 100755 tests/xfs/1533
 create mode 100644 tests/xfs/1533.out
 create mode 100755 tests/xfs/1534
 create mode 100644 tests/xfs/1534.out
 create mode 100755 tests/xfs/1535
 create mode 100644 tests/xfs/1535.out
 create mode 100755 tests/xfs/1536
 create mode 100644 tests/xfs/1536.out
 create mode 100755 tests/xfs/1537
 create mode 100644 tests/xfs/1537.out
 create mode 100755 tests/xfs/1560
 create mode 100644 tests/xfs/1560.out
 create mode 100755 tests/xfs/1561
 create mode 100644 tests/xfs/1561.out


diff --git a/tests/xfs/1500 b/tests/xfs/1500
new file mode 100755
index 0000000000..cae5e05caf
--- /dev/null
+++ b/tests/xfs/1500
@@ -0,0 +1,35 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1500
+#
+# Populate an XFS filesystem and fuzz every superblock field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz superblock"
+_scratch_xfs_fuzz_metadata '' 'both' 'sb 1' >> $seqres.full
+echo "Done fuzzing superblock"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1500.out b/tests/xfs/1500.out
new file mode 100644
index 0000000000..bb485204c7
--- /dev/null
+++ b/tests/xfs/1500.out
@@ -0,0 +1,4 @@
+QA output created by 1500
+Format and populate
+Fuzz superblock
+Done fuzzing superblock
diff --git a/tests/xfs/1501 b/tests/xfs/1501
new file mode 100755
index 0000000000..867584ebbe
--- /dev/null
+++ b/tests/xfs/1501
@@ -0,0 +1,35 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1501
+#
+# Populate an XFS filesystem and fuzz every AGF field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz AGF"
+_scratch_xfs_fuzz_metadata '' 'both' 'agf 0' >> $seqres.full
+echo "Done fuzzing AGF"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1501.out b/tests/xfs/1501.out
new file mode 100644
index 0000000000..0c8b1be2a5
--- /dev/null
+++ b/tests/xfs/1501.out
@@ -0,0 +1,4 @@
+QA output created by 1501
+Format and populate
+Fuzz AGF
+Done fuzzing AGF
diff --git a/tests/xfs/1502 b/tests/xfs/1502
new file mode 100755
index 0000000000..04116d6d0e
--- /dev/null
+++ b/tests/xfs/1502
@@ -0,0 +1,45 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1502
+#
+# Populate an XFS filesystem and fuzz every AGFL field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz AGFL"
+_scratch_xfs_fuzz_metadata '' 'both' 'agfl 0' >> $seqres.full
+echo "Done fuzzing AGFL"
+
+# Restore a correct copy of the filesystem before we start the second round of
+# fuzzing.  This avoids corruption errors from xfs_db when we probe for flfirst
+# in the AGF and later when _scratch_xfs_fuzz_metadata probes the AGFL fields.
+__scratch_xfs_fuzz_mdrestore
+flfirst=$(_scratch_xfs_db -c 'agf 0' -c 'p flfirst' | sed -e 's/flfirst = //g')
+
+echo "Fuzz AGFL flfirst"
+SCRATCH_XFS_LIST_METADATA_FIELDS="bno[${flfirst}]" _scratch_xfs_fuzz_metadata '' 'both' 'agfl 0' >> $seqres.full
+echo "Done fuzzing AGFL flfirst"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1502.out b/tests/xfs/1502.out
new file mode 100644
index 0000000000..99c0facdac
--- /dev/null
+++ b/tests/xfs/1502.out
@@ -0,0 +1,6 @@
+QA output created by 1502
+Format and populate
+Fuzz AGFL
+Done fuzzing AGFL
+Fuzz AGFL flfirst
+Done fuzzing AGFL flfirst
diff --git a/tests/xfs/1503 b/tests/xfs/1503
new file mode 100755
index 0000000000..8c4b125dfa
--- /dev/null
+++ b/tests/xfs/1503
@@ -0,0 +1,35 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1503
+#
+# Populate an XFS filesystem and fuzz every AGI field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Fuzz AGI"
+_scratch_xfs_fuzz_metadata '' 'both' 'agi 0' >> $seqres.full
+echo "Done fuzzing AGI"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1503.out b/tests/xfs/1503.out
new file mode 100644
index 0000000000..0029001317
--- /dev/null
+++ b/tests/xfs/1503.out
@@ -0,0 +1,4 @@
+QA output created by 1503
+Format and populate
+Fuzz AGI
+Done fuzzing AGI
diff --git a/tests/xfs/1504 b/tests/xfs/1504
new file mode 100755
index 0000000000..e2712e646d
--- /dev/null
+++ b/tests/xfs/1504
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1504
+#
+# Populate an XFS filesystem and fuzz every bnobt field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
+echo "Fuzz bnobt recs"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr bnoroot' 'addr ptrs[1]' >> $seqres.full
+echo "Done fuzzing bnobt recs"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1504.out b/tests/xfs/1504.out
new file mode 100644
index 0000000000..2c8162dd31
--- /dev/null
+++ b/tests/xfs/1504.out
@@ -0,0 +1,4 @@
+QA output created by 1504
+Format and populate
+Fuzz bnobt recs
+Done fuzzing bnobt recs
diff --git a/tests/xfs/1505 b/tests/xfs/1505
new file mode 100755
index 0000000000..dbf850f8b4
--- /dev/null
+++ b/tests/xfs/1505
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1505
+#
+# Populate an XFS filesystem and fuzz every bnobt key/pointer.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'bno' 2)" || \
+	_fail "could not find two-level bnobt"
+
+echo "Fuzz bnobt keyptr"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr bnoroot' >> $seqres.full
+echo "Done fuzzing bnobt keyptr"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1505.out b/tests/xfs/1505.out
new file mode 100644
index 0000000000..7c326a36dc
--- /dev/null
+++ b/tests/xfs/1505.out
@@ -0,0 +1,4 @@
+QA output created by 1505
+Format and populate
+Fuzz bnobt keyptr
+Done fuzzing bnobt keyptr
diff --git a/tests/xfs/1506 b/tests/xfs/1506
new file mode 100755
index 0000000000..efce4928fa
--- /dev/null
+++ b/tests/xfs/1506
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1506
+#
+# Populate an XFS filesystem and fuzz every cntbt field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'cnt' 2)" || \
+	_fail "could not find two-level cntbt"
+
+echo "Fuzz cntbt"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr cntroot' 'addr ptrs[1]' >> $seqres.full
+echo "Done fuzzing cntbt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1506.out b/tests/xfs/1506.out
new file mode 100644
index 0000000000..bed44c625b
--- /dev/null
+++ b/tests/xfs/1506.out
@@ -0,0 +1,4 @@
+QA output created by 1506
+Format and populate
+Fuzz cntbt
+Done fuzzing cntbt
diff --git a/tests/xfs/1507 b/tests/xfs/1507
new file mode 100755
index 0000000000..7cece6854d
--- /dev/null
+++ b/tests/xfs/1507
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1507
+#
+# Populate an XFS filesystem and fuzz every inobt field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
+echo "Fuzz inobt"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr root' 'addr ptrs[1]' >> $seqres.full
+echo "Done fuzzing inobt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1507.out b/tests/xfs/1507.out
new file mode 100644
index 0000000000..3607a3a554
--- /dev/null
+++ b/tests/xfs/1507.out
@@ -0,0 +1,4 @@
+QA output created by 1507
+Format and populate
+Fuzz inobt
+Done fuzzing inobt
diff --git a/tests/xfs/1508 b/tests/xfs/1508
new file mode 100755
index 0000000000..e2b38a9242
--- /dev/null
+++ b/tests/xfs/1508
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1508
+#
+# Populate an XFS filesystem and fuzz every finobt field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+_require_xfs_finobt
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'fino' 2)" || \
+	_fail "could not find two-level finobt"
+
+echo "Fuzz finobt"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr free_root' 'addr ptrs[1]' >> $seqres.full
+echo "Done fuzzing finobt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1508.out b/tests/xfs/1508.out
new file mode 100644
index 0000000000..08b8e704b4
--- /dev/null
+++ b/tests/xfs/1508.out
@@ -0,0 +1,4 @@
+QA output created by 1508
+Format and populate
+Fuzz finobt
+Done fuzzing finobt
diff --git a/tests/xfs/1509 b/tests/xfs/1509
new file mode 100755
index 0000000000..c90c4e6784
--- /dev/null
+++ b/tests/xfs/1509
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1509
+#
+# Populate an XFS filesystem and fuzz every rmapbt field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_xfs_scratch_rmapbt
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
+echo "Fuzz rmapbt recs"
+_scratch_xfs_fuzz_metadata '' 'both' "$path" 'addr rmaproot' 'addr ptrs[1]' >> $seqres.full
+echo "Done fuzzing rmapbt recs"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1509.out b/tests/xfs/1509.out
new file mode 100644
index 0000000000..95a25b005f
--- /dev/null
+++ b/tests/xfs/1509.out
@@ -0,0 +1,4 @@
+QA output created by 1509
+Format and populate
+Fuzz rmapbt recs
+Done fuzzing rmapbt recs
diff --git a/tests/xfs/1510 b/tests/xfs/1510
new file mode 100755
index 0000000000..fd8b994a98
--- /dev/null
+++ b/tests/xfs/1510
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1510
+#
+# Populate an XFS filesystem and fuzz every rmapbt key/pointer field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_xfs_scratch_rmapbt
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'rmap' 2)" || \
+	_fail "could not find two-level rmapbt"
+
+echo "Fuzz rmapbt keyptr"
+_scratch_xfs_fuzz_metadata '' 'both' "$path" 'addr rmaproot' >> $seqres.full
+echo "Done fuzzing rmapbt keyptr"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1510.out b/tests/xfs/1510.out
new file mode 100644
index 0000000000..0d5d467e2e
--- /dev/null
+++ b/tests/xfs/1510.out
@@ -0,0 +1,4 @@
+QA output created by 1510
+Format and populate
+Fuzz rmapbt keyptr
+Done fuzzing rmapbt keyptr
diff --git a/tests/xfs/1511 b/tests/xfs/1511
new file mode 100755
index 0000000000..ac945c17ed
--- /dev/null
+++ b/tests/xfs/1511
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1511
+#
+# Populate an XFS filesystem and fuzz every refcountbt key/pointer field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_reflink
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
+echo "Fuzz refcountbt"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr refcntroot' >> $seqres.full
+echo "Done fuzzing refcountbt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1511.out b/tests/xfs/1511.out
new file mode 100644
index 0000000000..0628ebe519
--- /dev/null
+++ b/tests/xfs/1511.out
@@ -0,0 +1,4 @@
+QA output created by 1511
+Format and populate
+Fuzz refcountbt
+Done fuzzing refcountbt
diff --git a/tests/xfs/1512 b/tests/xfs/1512
new file mode 100755
index 0000000000..9e3b859e1b
--- /dev/null
+++ b/tests/xfs/1512
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1512
+#
+# Populate an XFS filesystem and fuzz every btree-format directory inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find btree-format dir inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_BTREE)
+_scratch_unmount
+
+echo "Fuzz inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1512.out b/tests/xfs/1512.out
new file mode 100644
index 0000000000..c28f0fc936
--- /dev/null
+++ b/tests/xfs/1512.out
@@ -0,0 +1,5 @@
+QA output created by 1512
+Format and populate
+Find btree-format dir inode
+Fuzz inode
+Done fuzzing inode
diff --git a/tests/xfs/1513 b/tests/xfs/1513
new file mode 100755
index 0000000000..5b1ed15290
--- /dev/null
+++ b/tests/xfs/1513
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1513
+#
+# Populate an XFS filesystem and fuzz every extents-format file inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find extents-format file inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFREG.FMT_EXTENTS)
+_scratch_unmount
+
+echo "Fuzz inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1513.out b/tests/xfs/1513.out
new file mode 100644
index 0000000000..6ee6f85ac3
--- /dev/null
+++ b/tests/xfs/1513.out
@@ -0,0 +1,5 @@
+QA output created by 1513
+Format and populate
+Find extents-format file inode
+Fuzz inode
+Done fuzzing inode
diff --git a/tests/xfs/1514 b/tests/xfs/1514
new file mode 100755
index 0000000000..8f530466d8
--- /dev/null
+++ b/tests/xfs/1514
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1514
+#
+# Populate an XFS filesystem and fuzz every btree-format file inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find btree-format file inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFREG.FMT_BTREE)
+_scratch_unmount
+
+echo "Fuzz inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1514.out b/tests/xfs/1514.out
new file mode 100644
index 0000000000..9bebc6d1a0
--- /dev/null
+++ b/tests/xfs/1514.out
@@ -0,0 +1,5 @@
+QA output created by 1514
+Format and populate
+Find btree-format file inode
+Fuzz inode
+Done fuzzing inode
diff --git a/tests/xfs/1515 b/tests/xfs/1515
new file mode 100755
index 0000000000..7a67448a40
--- /dev/null
+++ b/tests/xfs/1515
@@ -0,0 +1,42 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1515
+#
+# Populate an XFS filesystem and fuzz every bmbt block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find bmbt block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFREG.FMT_BTREE)
+_scratch_unmount
+
+inode_ver=$(_scratch_xfs_get_metadata_field "core.version" "inode ${inum}")
+
+echo "Fuzz bmbt"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "addr u${inode_ver}.bmbt.ptrs[1]" >> $seqres.full
+echo "Done fuzzing bmbt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1515.out b/tests/xfs/1515.out
new file mode 100644
index 0000000000..239ad87fa9
--- /dev/null
+++ b/tests/xfs/1515.out
@@ -0,0 +1,5 @@
+QA output created by 1515
+Format and populate
+Find bmbt block
+Fuzz bmbt
+Done fuzzing bmbt
diff --git a/tests/xfs/1516 b/tests/xfs/1516
new file mode 100755
index 0000000000..9c89964f85
--- /dev/null
+++ b/tests/xfs/1516
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1516
+#
+# Populate a XFS filesystem and fuzz every symlink remote block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find symlink remote block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFLNK.FMT_EXTENTS)
+_scratch_unmount
+
+echo "Fuzz symlink remote block"
+_scratch_xfs_fuzz_metadata '' 'both' "inode ${inum}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing symlink remote block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1516.out b/tests/xfs/1516.out
new file mode 100644
index 0000000000..043db933d2
--- /dev/null
+++ b/tests/xfs/1516.out
@@ -0,0 +1,5 @@
+QA output created by 1516
+Format and populate
+Find symlink remote block
+Fuzz symlink remote block
+Done fuzzing symlink remote block
diff --git a/tests/xfs/1517 b/tests/xfs/1517
new file mode 100755
index 0000000000..e54f081281
--- /dev/null
+++ b/tests/xfs/1517
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1517
+#
+# Populate a XFS filesystem and fuzz every inline directory inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find inline-format dir inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_INLINE)
+_scratch_unmount
+
+echo "Fuzz inline-format dir inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inline-format dir inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1517.out b/tests/xfs/1517.out
new file mode 100644
index 0000000000..94063af2ab
--- /dev/null
+++ b/tests/xfs/1517.out
@@ -0,0 +1,5 @@
+QA output created by 1517
+Format and populate
+Find inline-format dir inode
+Fuzz inline-format dir inode
+Done fuzzing inline-format dir inode
diff --git a/tests/xfs/1518 b/tests/xfs/1518
new file mode 100755
index 0000000000..14b1d74837
--- /dev/null
+++ b/tests/xfs/1518
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1518
+#
+# Populate a XFS filesystem and fuzz every block-format dir block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find data-format dir block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_BLOCK)
+_scratch_unmount
+
+echo "Fuzz data-format dir block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" 'dblock 0' >> $seqres.full
+echo "Done fuzzing data-format dir block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1518.out b/tests/xfs/1518.out
new file mode 100644
index 0000000000..a3831196fd
--- /dev/null
+++ b/tests/xfs/1518.out
@@ -0,0 +1,5 @@
+QA output created by 1518
+Format and populate
+Find data-format dir block
+Fuzz data-format dir block
+Done fuzzing data-format dir block
diff --git a/tests/xfs/1519 b/tests/xfs/1519
new file mode 100755
index 0000000000..98d719d33f
--- /dev/null
+++ b/tests/xfs/1519
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1519
+#
+# Populate a XFS filesystem and fuzz every data-format dir block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find data-format dir block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_LEAF)
+blk_sz=$(_get_block_size $SCRATCH_MNT)
+_scratch_unmount
+
+echo "Fuzz data-format dir block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "dblock 0" >> $seqres.full
+echo "Done fuzzing data-format dir block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1519.out b/tests/xfs/1519.out
new file mode 100644
index 0000000000..d1607473d5
--- /dev/null
+++ b/tests/xfs/1519.out
@@ -0,0 +1,5 @@
+QA output created by 1519
+Format and populate
+Find data-format dir block
+Fuzz data-format dir block
+Done fuzzing data-format dir block
diff --git a/tests/xfs/1520 b/tests/xfs/1520
new file mode 100755
index 0000000000..fedee5a52f
--- /dev/null
+++ b/tests/xfs/1520
@@ -0,0 +1,42 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1520
+#
+# Populate a XFS filesystem and fuzz every leaf1-format dir block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find leaf1-format dir block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_LEAF)
+blk_sz=$(_get_block_size $SCRATCH_MNT)
+_scratch_unmount
+
+leaf_offset=$(( (2 ** 35) / blk_sz ))
+echo "Fuzz leaf1-format dir block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "dblock ${leaf_offset}" >> $seqres.full
+echo "Done fuzzing leaf1-format dir block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1520.out b/tests/xfs/1520.out
new file mode 100644
index 0000000000..a97bf281f6
--- /dev/null
+++ b/tests/xfs/1520.out
@@ -0,0 +1,5 @@
+QA output created by 1520
+Format and populate
+Find leaf1-format dir block
+Fuzz leaf1-format dir block
+Done fuzzing leaf1-format dir block
diff --git a/tests/xfs/1521 b/tests/xfs/1521
new file mode 100755
index 0000000000..9d9bfcf407
--- /dev/null
+++ b/tests/xfs/1521
@@ -0,0 +1,42 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1521
+#
+# Populate a XFS filesystem and fuzz every leafn-format dir block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find leafn-format dir block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_NODE)
+blk_sz=$(_get_block_size $SCRATCH_MNT)
+_scratch_unmount
+
+leaf_offset=$(( ( (2 ** 35) / blk_sz ) + 1 ))
+echo "Fuzz leafn-format dir block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "dblock ${leaf_offset}" >> $seqres.full
+echo "Done fuzzing leafn-format dir block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1521.out b/tests/xfs/1521.out
new file mode 100644
index 0000000000..c47f0f08c5
--- /dev/null
+++ b/tests/xfs/1521.out
@@ -0,0 +1,5 @@
+QA output created by 1521
+Format and populate
+Find leafn-format dir block
+Fuzz leafn-format dir block
+Done fuzzing leafn-format dir block
diff --git a/tests/xfs/1522 b/tests/xfs/1522
new file mode 100755
index 0000000000..1ae8eac378
--- /dev/null
+++ b/tests/xfs/1522
@@ -0,0 +1,42 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1522
+#
+# Populate a XFS filesystem and fuzz every node-format dir block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find node-format dir block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_NODE)
+blk_sz=$(_get_block_size $SCRATCH_MNT)
+_scratch_unmount
+
+leaf_offset=$(( (2 ** 35) / blk_sz ))
+echo "Fuzz node-format dir block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "dblock ${leaf_offset}" >> $seqres.full
+echo "Done fuzzing node-format dir block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1522.out b/tests/xfs/1522.out
new file mode 100644
index 0000000000..ae26b05740
--- /dev/null
+++ b/tests/xfs/1522.out
@@ -0,0 +1,5 @@
+QA output created by 1522
+Format and populate
+Find node-format dir block
+Fuzz node-format dir block
+Done fuzzing node-format dir block
diff --git a/tests/xfs/1523 b/tests/xfs/1523
new file mode 100755
index 0000000000..345bc6e57a
--- /dev/null
+++ b/tests/xfs/1523
@@ -0,0 +1,42 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1523
+#
+# Populate a XFS filesystem and fuzz every freeindex-format dir block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find freeindex-format dir block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_NODE)
+blk_sz=$(_get_block_size $SCRATCH_MNT)
+_scratch_unmount
+
+leaf_offset=$(( (2 ** 36) / blk_sz ))
+echo "Fuzz freeindex-format dir block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "dblock ${leaf_offset}" >> $seqres.full
+echo "Done fuzzing freeindex-format dir block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1523.out b/tests/xfs/1523.out
new file mode 100644
index 0000000000..b4dd1270ae
--- /dev/null
+++ b/tests/xfs/1523.out
@@ -0,0 +1,5 @@
+QA output created by 1523
+Format and populate
+Find freeindex-format dir block
+Fuzz freeindex-format dir block
+Done fuzzing freeindex-format dir block
diff --git a/tests/xfs/1524 b/tests/xfs/1524
new file mode 100755
index 0000000000..6602fac980
--- /dev/null
+++ b/tests/xfs/1524
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1524
+#
+# Populate a XFS filesystem and fuzz every inline attr inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find inline-format attr inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_LOCAL)
+_scratch_unmount
+
+echo "Fuzz inline-format attr inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inline-format attr inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1524.out b/tests/xfs/1524.out
new file mode 100644
index 0000000000..2afb919ce8
--- /dev/null
+++ b/tests/xfs/1524.out
@@ -0,0 +1,5 @@
+QA output created by 1524
+Format and populate
+Find inline-format attr inode
+Fuzz inline-format attr inode
+Done fuzzing inline-format attr inode
diff --git a/tests/xfs/1525 b/tests/xfs/1525
new file mode 100755
index 0000000000..78ae622739
--- /dev/null
+++ b/tests/xfs/1525
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1525
+#
+# Populate a XFS filesystem and fuzz every leaf-format attr block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find leaf-format attr block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_LEAF)
+_scratch_unmount
+
+echo "Fuzz leaf-format attr block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" 'ablock 0' >> $seqres.full
+echo "Done fuzzing leaf-format attr block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1525.out b/tests/xfs/1525.out
new file mode 100644
index 0000000000..dda2601e13
--- /dev/null
+++ b/tests/xfs/1525.out
@@ -0,0 +1,5 @@
+QA output created by 1525
+Format and populate
+Find leaf-format attr block
+Fuzz leaf-format attr block
+Done fuzzing leaf-format attr block
diff --git a/tests/xfs/1526 b/tests/xfs/1526
new file mode 100755
index 0000000000..7efe2fdbfc
--- /dev/null
+++ b/tests/xfs/1526
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1526
+#
+# Populate a XFS filesystem and fuzz every node-format attr block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find node-format attr block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_NODE)
+_scratch_unmount
+
+echo "Fuzz node-format attr block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "ablock 0" >> $seqres.full
+echo "Done fuzzing node-format attr block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1526.out b/tests/xfs/1526.out
new file mode 100644
index 0000000000..ef7b4c7a9e
--- /dev/null
+++ b/tests/xfs/1526.out
@@ -0,0 +1,5 @@
+QA output created by 1526
+Format and populate
+Find node-format attr block
+Fuzz node-format attr block
+Done fuzzing node-format attr block
diff --git a/tests/xfs/1527 b/tests/xfs/1527
new file mode 100755
index 0000000000..9ebcbe7117
--- /dev/null
+++ b/tests/xfs/1527
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1527
+#
+# Populate a XFS filesystem and fuzz every external attr block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find external attr block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_EXTENTS_REMOTE3K)
+_scratch_unmount
+
+echo "Fuzz external attr block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "ablock 1" >> $seqres.full
+echo "Done fuzzing external attr block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1527.out b/tests/xfs/1527.out
new file mode 100644
index 0000000000..a79038cccf
--- /dev/null
+++ b/tests/xfs/1527.out
@@ -0,0 +1,5 @@
+QA output created by 1527
+Format and populate
+Find external attr block
+Fuzz external attr block
+Done fuzzing external attr block
diff --git a/tests/xfs/1530 b/tests/xfs/1530
new file mode 100755
index 0000000000..6225391078
--- /dev/null
+++ b/tests/xfs/1530
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1530
+#
+# Populate a XFS filesystem and fuzz every refcountbt field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_reflink
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'refcnt' 2)" || \
+	_fail "could not find two-level refcountbt"
+
+echo "Fuzz refcountbt"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr refcntroot' 'addr ptrs[1]' >> $seqres.full
+echo "Done fuzzing refcountbt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1530.out b/tests/xfs/1530.out
new file mode 100644
index 0000000000..4c2f39053e
--- /dev/null
+++ b/tests/xfs/1530.out
@@ -0,0 +1,4 @@
+QA output created by 1530
+Format and populate
+Fuzz refcountbt
+Done fuzzing refcountbt
diff --git a/tests/xfs/1531 b/tests/xfs/1531
new file mode 100755
index 0000000000..43a446e538
--- /dev/null
+++ b/tests/xfs/1531
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1531
+#
+# Populate a XFS filesystem and fuzz every btree-format attr inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find btree-format attr inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_BTREE)
+_scratch_unmount
+
+echo "Fuzz inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1531.out b/tests/xfs/1531.out
new file mode 100644
index 0000000000..6c4deceaf0
--- /dev/null
+++ b/tests/xfs/1531.out
@@ -0,0 +1,5 @@
+QA output created by 1531
+Format and populate
+Find btree-format attr inode
+Fuzz inode
+Done fuzzing inode
diff --git a/tests/xfs/1532 b/tests/xfs/1532
new file mode 100755
index 0000000000..1aa65139a6
--- /dev/null
+++ b/tests/xfs/1532
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1532
+#
+# Populate a XFS filesystem and fuzz every blockdev inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find blockdev inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFBLK)
+_scratch_unmount
+
+echo "Fuzz inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1532.out b/tests/xfs/1532.out
new file mode 100644
index 0000000000..4bb5defc3f
--- /dev/null
+++ b/tests/xfs/1532.out
@@ -0,0 +1,5 @@
+QA output created by 1532
+Format and populate
+Find blockdev inode
+Fuzz inode
+Done fuzzing inode
diff --git a/tests/xfs/1533 b/tests/xfs/1533
new file mode 100755
index 0000000000..e873432a0e
--- /dev/null
+++ b/tests/xfs/1533
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1533
+#
+# Populate a XFS filesystem and fuzz every local-format symlink inode field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find local-format symlink inode"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFLNK.FMT_LOCAL)
+_scratch_unmount
+
+echo "Fuzz inode"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" >> $seqres.full
+echo "Done fuzzing inode"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1533.out b/tests/xfs/1533.out
new file mode 100644
index 0000000000..198e5266fc
--- /dev/null
+++ b/tests/xfs/1533.out
@@ -0,0 +1,5 @@
+QA output created by 1533
+Format and populate
+Find local-format symlink inode
+Fuzz inode
+Done fuzzing inode
diff --git a/tests/xfs/1534 b/tests/xfs/1534
new file mode 100755
index 0000000000..20a0faa56f
--- /dev/null
+++ b/tests/xfs/1534
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1534
+#
+# Populate a XFS filesystem and fuzz every user dquot field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+_require_quota
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+echo "${MOUNT_OPTIONS}" | grep -q 'usrquota' || _notrun "user quota disabled"
+
+echo "Fuzz user 0 dquot"
+_scratch_xfs_fuzz_metadata '' 'both'  "dquot -u 0" >> $seqres.full
+echo "Done fuzzing dquot"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1534.out b/tests/xfs/1534.out
new file mode 100644
index 0000000000..0e5484adfc
--- /dev/null
+++ b/tests/xfs/1534.out
@@ -0,0 +1,4 @@
+QA output created by 1534
+Format and populate
+Fuzz user 0 dquot
+Done fuzzing dquot
diff --git a/tests/xfs/1535 b/tests/xfs/1535
new file mode 100755
index 0000000000..c6b268621c
--- /dev/null
+++ b/tests/xfs/1535
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1535
+#
+# Populate a XFS filesystem and fuzz every group dquot field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+_require_quota
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+echo "${MOUNT_OPTIONS}" | grep -q 'grpquota' || _notrun "group quota disabled"
+
+echo "Fuzz group 0 dquot"
+_scratch_xfs_fuzz_metadata '' 'both'  "dquot -g 0" >> $seqres.full
+echo "Done fuzzing dquot"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1535.out b/tests/xfs/1535.out
new file mode 100644
index 0000000000..ad317e3673
--- /dev/null
+++ b/tests/xfs/1535.out
@@ -0,0 +1,4 @@
+QA output created by 1535
+Format and populate
+Fuzz group 0 dquot
+Done fuzzing dquot
diff --git a/tests/xfs/1536 b/tests/xfs/1536
new file mode 100755
index 0000000000..20d054df6f
--- /dev/null
+++ b/tests/xfs/1536
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1536
+#
+# Populate a XFS filesystem and fuzz every project dquot field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+_require_quota
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+echo "${MOUNT_OPTIONS}" | grep -q 'prjquota' || _notrun "project quota disabled"
+
+echo "Fuzz project 0 dquot"
+_scratch_xfs_fuzz_metadata '' 'both'  "dquot -p 0" >> $seqres.full
+echo "Done fuzzing dquot"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1536.out b/tests/xfs/1536.out
new file mode 100644
index 0000000000..8738d707dd
--- /dev/null
+++ b/tests/xfs/1536.out
@@ -0,0 +1,4 @@
+QA output created by 1536
+Format and populate
+Fuzz project 0 dquot
+Done fuzzing dquot
diff --git a/tests/xfs/1537 b/tests/xfs/1537
new file mode 100755
index 0000000000..21f962dfaf
--- /dev/null
+++ b/tests/xfs/1537
@@ -0,0 +1,41 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1537
+#
+# Populate a XFS filesystem and fuzz every single-leafn-format dir block field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+echo "Find single-leafn-format dir block"
+_scratch_mount
+inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_LEAFN)
+blk_sz=$(_get_block_size $SCRATCH_MNT)
+_scratch_unmount
+
+leaf_offset=$(( (2 ** 35) / blk_sz ))
+echo "Fuzz single-leafn-format dir block"
+_scratch_xfs_fuzz_metadata '' 'both'  "inode ${inum}" "dblock ${leaf_offset}" >> $seqres.full
+echo "Done fuzzing single-leafn-format dir block"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1537.out b/tests/xfs/1537.out
new file mode 100644
index 0000000000..f298ecce6b
--- /dev/null
+++ b/tests/xfs/1537.out
@@ -0,0 +1,5 @@
+QA output created by 1537
+Format and populate
+Find single-leafn-format dir block
+Fuzz single-leafn-format dir block
+Done fuzzing single-leafn-format dir block
diff --git a/tests/xfs/1560 b/tests/xfs/1560
new file mode 100755
index 0000000000..456f079919
--- /dev/null
+++ b/tests/xfs/1560
@@ -0,0 +1,49 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1560
+#
+# Populate a XFS filesystem and fuzz the data mappings of every directory type.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_dir_fuzz_types
+
+# Now fuzz the block maps of each directory type.
+for dirtype in "${SCRATCH_XFS_DIR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${dirtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the directory block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/S_IFDIR.FMT_${dirtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'u*.bmx' 'both'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing dir map ${dirtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1560.out b/tests/xfs/1560.out
new file mode 100644
index 0000000000..429cd3e0ce
--- /dev/null
+++ b/tests/xfs/1560.out
@@ -0,0 +1,10 @@
+QA output created by 1560
+Format and populate
+Fuzz block map for BLOCK
+Done fuzzing dir map BLOCK
+Fuzz block map for LEAF
+Done fuzzing dir map LEAF
+Fuzz block map for LEAFN
+Done fuzzing dir map LEAFN
+Fuzz block map for NODE
+Done fuzzing dir map NODE
diff --git a/tests/xfs/1561 b/tests/xfs/1561
new file mode 100755
index 0000000000..936e4c264f
--- /dev/null
+++ b/tests/xfs/1561
@@ -0,0 +1,49 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1561
+#
+# Populate a XFS filesystem and fuzz the attr mappings of every xattr type.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+_scratch_xfs_set_xattr_fuzz_types
+
+# Now fuzz the block maps of each xattr type.
+for attrtype in "${SCRATCH_XFS_XATTR_FUZZ_TYPES[@]}"; do
+	echo "Fuzz block map for ${attrtype}" | tee -a $seqres.full
+
+	# Restore a correct copy of the filesystem before we start a round of
+	# fuzzing.  This avoids corruption errors from xfs_db when
+	# _scratch_xfs_fuzz_metadata probes the xattr block fields.
+	__scratch_xfs_fuzz_mdrestore
+
+	_scratch_mount
+	inum=$(stat -c '%i' $SCRATCH_MNT/ATTR.FMT_${attrtype})
+	_scratch_unmount
+
+	_scratch_xfs_fuzz_metadata 'a*.bmx' 'both'  "inode ${inum}" >> $seqres.full
+	echo "Done fuzzing attr map ${attrtype}"
+done
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1561.out b/tests/xfs/1561.out
new file mode 100644
index 0000000000..39e5dcba71
--- /dev/null
+++ b/tests/xfs/1561.out
@@ -0,0 +1,10 @@
+QA output created by 1561
+Format and populate
+Fuzz block map for EXTENTS_REMOTE3K
+Done fuzzing attr map EXTENTS_REMOTE3K
+Fuzz block map for EXTENTS_REMOTE4K
+Done fuzzing attr map EXTENTS_REMOTE4K
+Fuzz block map for LEAF
+Done fuzzing attr map LEAF
+Fuzz block map for NODE
+Done fuzzing attr map NODE



* [PATCH 5/5] fuzzy: fuzz test key/pointers of inode btrees
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/5] fuzzy: test fuzzing directory block mappings Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/5] fuzzy: test fuzzing xattr " Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 4/5] xfs: fuzz test both repair strategies Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 3/5] fuzzy: test fuzzing realtime free space metadata Darrick J. Wong
  4 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Test what happens when we fuzz the key/pointer blocks (aka the interior
nodes) of the inode btree.
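
All four tests below share the same skeleton, built from the helpers in
common/populate and common/fuzzy; roughly:

	# find an AG whose inobt has two levels, then fuzz the root block's
	# keys and pointers with the chosen repair strategy
	path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
		_fail "could not find two-level inobt"
	_scratch_xfs_fuzz_metadata '' 'offline' "$path" 'addr root'

with 'offline', 'online', 'both', or 'none' selecting how (or whether) the
resulting corruption gets repaired.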

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/1570     |   36 ++++++++++++++++++++++++++++++++++++
 tests/xfs/1570.out |    4 ++++
 tests/xfs/1571     |   36 ++++++++++++++++++++++++++++++++++++
 tests/xfs/1571.out |    4 ++++
 tests/xfs/1572     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1572.out |    4 ++++
 tests/xfs/1573     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/1573.out |    4 ++++
 8 files changed, 163 insertions(+)
 create mode 100755 tests/xfs/1570
 create mode 100644 tests/xfs/1570.out
 create mode 100755 tests/xfs/1571
 create mode 100644 tests/xfs/1571.out
 create mode 100755 tests/xfs/1572
 create mode 100644 tests/xfs/1572.out
 create mode 100755 tests/xfs/1573
 create mode 100644 tests/xfs/1573.out


diff --git a/tests/xfs/1570 b/tests/xfs/1570
new file mode 100755
index 0000000000..c2d144e298
--- /dev/null
+++ b/tests/xfs/1570
@@ -0,0 +1,36 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 1570
+#
+# Populate a XFS filesystem and fuzz every inobt key/pointer field.
+# Use xfs_repair to fix the corruption.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_repair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
+echo "Fuzz inobt"
+_scratch_xfs_fuzz_metadata '' 'offline'  "$path" 'addr root' >> $seqres.full
+echo "Done fuzzing inobt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1570.out b/tests/xfs/1570.out
new file mode 100644
index 0000000000..b3977dca57
--- /dev/null
+++ b/tests/xfs/1570.out
@@ -0,0 +1,4 @@
+QA output created by 1570
+Format and populate
+Fuzz inobt
+Done fuzzing inobt
diff --git a/tests/xfs/1571 b/tests/xfs/1571
new file mode 100755
index 0000000000..c64b321ff6
--- /dev/null
+++ b/tests/xfs/1571
@@ -0,0 +1,36 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 1571
+#
+# Populate a XFS filesystem and fuzz every inobt key/pointer field.
+# Use xfs_scrub to fix the corruption.
+#
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_online_repair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
+echo "Fuzz inobt"
+_scratch_xfs_fuzz_metadata '' 'online'  "$path" 'addr root' >> $seqres.full
+echo "Done fuzzing inobt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1571.out b/tests/xfs/1571.out
new file mode 100644
index 0000000000..292e8bdec4
--- /dev/null
+++ b/tests/xfs/1571.out
@@ -0,0 +1,4 @@
+QA output created by 1571
+Format and populate
+Fuzz inobt
+Done fuzzing inobt
diff --git a/tests/xfs/1572 b/tests/xfs/1572
new file mode 100755
index 0000000000..abcdc2397f
--- /dev/null
+++ b/tests/xfs/1572
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1572
+#
+# Populate a XFS filesystem and fuzz every inobt key/pointer field.
+# Try online repair and, if necessary, offline repair,
+# to test the most likely usage pattern.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_bothrepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
+echo "Fuzz inobt"
+_scratch_xfs_fuzz_metadata '' 'both'  "$path" 'addr root' >> $seqres.full
+echo "Done fuzzing inobt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1572.out b/tests/xfs/1572.out
new file mode 100644
index 0000000000..8afa3ea075
--- /dev/null
+++ b/tests/xfs/1572.out
@@ -0,0 +1,4 @@
+QA output created by 1572
+Format and populate
+Fuzz inobt
+Done fuzzing inobt
diff --git a/tests/xfs/1573 b/tests/xfs/1573
new file mode 100755
index 0000000000..7a816e59b8
--- /dev/null
+++ b/tests/xfs/1573
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All rights reserved.
+#
+# FS QA Test No. 1573
+#
+# Populate a XFS filesystem and fuzz every inobt key/pointer field.
+# Do not fix the filesystem, to test metadata verifiers.
+
+. ./common/preamble
+_begin_fstest dangerous_fuzzers dangerous_norepair
+
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch_xfs_fuzz_fields
+_disable_dmesg_check
+
+echo "Format and populate"
+_scratch_populate_cached nofill > $seqres.full 2>&1
+
+path="$(_scratch_xfs_find_agbtree_height 'ino' 2)" || \
+	_fail "could not find two-level inobt"
+
+echo "Fuzz inobt"
+_scratch_xfs_fuzz_metadata '' 'none'  "$path" 'addr root' >> $seqres.full
+echo "Done fuzzing inobt"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1573.out b/tests/xfs/1573.out
new file mode 100644
index 0000000000..cef5aef758
--- /dev/null
+++ b/tests/xfs/1573.out
@@ -0,0 +1,4 @@
+QA output created by 1573
+Format and populate
+Fuzz inobt
+Done fuzzing inobt



* [PATCHSET v24.0 0/7] fstests: atomic file updates
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (72 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/7] xfs/122: fix for swapext log items Darrick J. Wong
                     ` (6 more replies)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of realtime summaries Darrick J. Wong
                   ` (3 subsequent siblings)
  77 siblings, 7 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This series creates a new FIEXCHANGE_RANGE system call to exchange
ranges of bytes between two files atomically.  This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes.  A successful call
completion guarantees that the new contents will be seen even if the
system fails.

The ability to swap extent mappings between files in this manner is
critical to supporting online filesystem repair, which is built upon the
strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchanging the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Callers can arrange for the update to be rejected if the
original file has been changed.
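
As a rough sketch of that flow in shell, using the same xfs_io swapext
command that generic/1200-1202 below exercise (the path, the staging file
name, and the pwrite are made up for illustration, and a named staging file
stands in for the O_TMPFILE):

	orig=/mnt/data/file
	tmp=$orig.stage

	# start the staging file as a cheap reflink copy of the original
	cp --reflink=always "$orig" "$tmp"

	# ...rewrite whatever parts of the staging copy need updating...
	xfs_io -c 'pwrite -S 0x59 0 64k' "$tmp"

	# atomically exchange the updated contents back into the original;
	# readers see either the old bytes or the new bytes, never a mix
	xfs_io -c "swapext $tmp" "$orig"
	rm -f "$tmp"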

The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files.  For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory.  People get it wrong all the time, and $fs hacks abound.
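For contrast, the status quo looks something like this sketch (error
handling omitted; $target and generate_new_contents are stand-ins, and
fsyncing the directory through xfs_io is assumed to work for your setup):

	new="$target.new.$$"
	generate_new_contents > "$new"
	xfs_io -c fsync "$new"                      # persist the new file's data
	mv "$new" "$target"                         # rename over the old name
	xfs_io -r -c fsync "$(dirname "$target")"   # persist the directory entry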

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic file writes concept
that has also been floating around for years.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file swap is
implemented as an atomic inode fork swap, which means that we can
implement online reconstruction of extended attributes and directories
by building a new one in another inode and atomically swapping the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic extent swapping.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be swapped atomically into
the inode being repaired.  This is essential to avoid making corruption
problems worse if the system goes down in the middle of running repair.

This patchset also ports the old XFS extent swap ioctl interface to use
the new extent swap code.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.

Question: Should we really bother with fsdevel bikeshedding?  Most
filesystems cannot support this functionality, so we could keep it
private to XFS for now.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates

xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
 common/rc              |   24 +++++++
 common/reflink         |   33 +++++++++
 common/xfs             |   19 +++++
 configure.ac           |    1 
 doc/group-names.txt    |    2 +
 include/builddefs.in   |    1 
 ltp/Makefile           |    4 +
 ltp/fsstress.c         |  168 ++++++++++++++++++++++++++++++++++++++++++++++++
 ltp/fsx.c              |  160 +++++++++++++++++++++++++++++++++++++++++++++-
 m4/package_libcdev.m4  |   21 ++++++
 src/fiexchange.h       |  101 +++++++++++++++++++++++++++++
 src/global.h           |    6 ++
 tests/generic/1200     |   55 ++++++++++++++++
 tests/generic/1200.out |    3 +
 tests/generic/1201     |   53 +++++++++++++++
 tests/generic/1201.out |    4 +
 tests/generic/1202     |   47 +++++++++++++
 tests/generic/1202.out |    2 +
 tests/generic/1203     |   58 +++++++++++++++++
 tests/generic/1203.out |    2 +
 tests/generic/1204     |  100 +++++++++++++++++++++++++++++
 tests/generic/1204.out |   86 +++++++++++++++++++++++++
 tests/generic/1205     |  116 +++++++++++++++++++++++++++++++++
 tests/generic/1205.out |   90 ++++++++++++++++++++++++++
 tests/generic/1206     |   76 ++++++++++++++++++++++
 tests/generic/1206.out |   32 +++++++++
 tests/generic/1207     |  122 +++++++++++++++++++++++++++++++++++
 tests/generic/1207.out |   48 ++++++++++++++
 tests/generic/1209     |  101 +++++++++++++++++++++++++++++
 tests/generic/1209.out |   33 +++++++++
 tests/generic/1210     |   48 ++++++++++++++
 tests/generic/1210.out |    6 ++
 tests/generic/1211     |  105 ++++++++++++++++++++++++++++++
 tests/generic/1211.out |   40 +++++++++++
 tests/generic/1212     |   58 +++++++++++++++++
 tests/generic/1212.out |    2 +
 tests/generic/1213     |  126 ++++++++++++++++++++++++++++++++++++
 tests/generic/1213.out |   48 ++++++++++++++
 tests/generic/1214     |   63 ++++++++++++++++++
 tests/generic/1214.out |    2 +
 tests/generic/1215     |   70 ++++++++++++++++++++
 tests/generic/1215.out |    2 +
 tests/generic/1216     |   53 +++++++++++++++
 tests/generic/1216.out |    6 ++
 tests/generic/1217     |   56 ++++++++++++++++
 tests/generic/1217.out |    4 +
 tests/generic/1218     |  115 +++++++++++++++++++++++++++++++++
 tests/generic/1218.out |   49 ++++++++++++++
 tests/generic/1219     |   83 ++++++++++++++++++++++++
 tests/generic/1219.out |   17 +++++
 tests/xfs/1202         |   59 +++++++++++++++++
 tests/xfs/1202.out     |   12 +++
 tests/xfs/1208         |   62 ++++++++++++++++++
 tests/xfs/1208.out     |   10 +++
 tests/xfs/1211         |   59 +++++++++++++++++
 tests/xfs/1211.out     |    7 ++
 tests/xfs/1212         |   61 +++++++++++++++++
 tests/xfs/1212.out     |    5 +
 tests/xfs/122.out      |    3 +
 tests/xfs/537          |    2 -
 60 files changed, 2798 insertions(+), 3 deletions(-)
 create mode 100644 src/fiexchange.h
 create mode 100755 tests/generic/1200
 create mode 100644 tests/generic/1200.out
 create mode 100755 tests/generic/1201
 create mode 100644 tests/generic/1201.out
 create mode 100755 tests/generic/1202
 create mode 100644 tests/generic/1202.out
 create mode 100755 tests/generic/1203
 create mode 100644 tests/generic/1203.out
 create mode 100755 tests/generic/1204
 create mode 100644 tests/generic/1204.out
 create mode 100755 tests/generic/1205
 create mode 100644 tests/generic/1205.out
 create mode 100755 tests/generic/1206
 create mode 100644 tests/generic/1206.out
 create mode 100755 tests/generic/1207
 create mode 100644 tests/generic/1207.out
 create mode 100755 tests/generic/1209
 create mode 100644 tests/generic/1209.out
 create mode 100755 tests/generic/1210
 create mode 100644 tests/generic/1210.out
 create mode 100755 tests/generic/1211
 create mode 100644 tests/generic/1211.out
 create mode 100755 tests/generic/1212
 create mode 100644 tests/generic/1212.out
 create mode 100755 tests/generic/1213
 create mode 100644 tests/generic/1213.out
 create mode 100755 tests/generic/1214
 create mode 100644 tests/generic/1214.out
 create mode 100755 tests/generic/1215
 create mode 100644 tests/generic/1215.out
 create mode 100755 tests/generic/1216
 create mode 100644 tests/generic/1216.out
 create mode 100755 tests/generic/1217
 create mode 100644 tests/generic/1217.out
 create mode 100755 tests/generic/1218
 create mode 100644 tests/generic/1218.out
 create mode 100755 tests/generic/1219
 create mode 100644 tests/generic/1219.out
 create mode 100755 tests/xfs/1202
 create mode 100644 tests/xfs/1202.out
 create mode 100755 tests/xfs/1208
 create mode 100644 tests/xfs/1208.out
 create mode 100755 tests/xfs/1211
 create mode 100644 tests/xfs/1211.out
 create mode 100755 tests/xfs/1212
 create mode 100644 tests/xfs/1212.out



* [PATCH 1/7] xfs/122: fix for swapext log items
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 5/7] generic: test that file privilege gets dropped with FIEXCHANGE_RANGE Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add entries for the extent swapping log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/122.out |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/tests/xfs/122.out b/tests/xfs/122.out
index 95e53c5081..21549db7fd 100644
--- a/tests/xfs/122.out
+++ b/tests/xfs/122.out
@@ -117,6 +117,9 @@ sizeof(struct xfs_rtrmap_root) = 4
 sizeof(struct xfs_rud_log_format) = 16
 sizeof(struct xfs_rui_log_format) = 16
 sizeof(struct xfs_scrub_metadata) = 64
+sizeof(struct xfs_swap_extent) = 64
+sizeof(struct xfs_sxd_log_format) = 16
+sizeof(struct xfs_sxi_log_format) = 80
 sizeof(struct xfs_unmount_log_format) = 8
 sizeof(xfs_agf_t) = 224
 sizeof(xfs_agfl_t) = 36



* [PATCH 2/7] generic: test old xfs extent swapping ioctl
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 3/7] generic: test new vfs swapext ioctl Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 7/7] fsstress: update for FIEXCHANGE_RANGE Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 6/7] fsx: support FIEXCHANGE_RANGE Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add some tests to check the operation of the old xfs swapext ioctl.
There aren't any xfs-specific pieces in here, so they're in generic/

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/rc              |   11 +++++++++
 doc/group-names.txt    |    2 ++
 tests/generic/1200     |   55 +++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1200.out |    3 ++
 tests/generic/1201     |   53 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1201.out |    4 +++
 tests/generic/1202     |   47 ++++++++++++++++++++++++++++++++++++++
 tests/generic/1202.out |    2 ++
 tests/xfs/1202         |   59 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1202.out     |   12 ++++++++++
 tests/xfs/537          |    2 +-
 11 files changed, 249 insertions(+), 1 deletion(-)
 create mode 100755 tests/generic/1200
 create mode 100644 tests/generic/1200.out
 create mode 100755 tests/generic/1201
 create mode 100644 tests/generic/1201.out
 create mode 100755 tests/generic/1202
 create mode 100644 tests/generic/1202.out
 create mode 100755 tests/xfs/1202
 create mode 100644 tests/xfs/1202.out


diff --git a/common/rc b/common/rc
index a1b65f0a7f..88ecc1837d 100644
--- a/common/rc
+++ b/common/rc
@@ -2575,6 +2575,17 @@ _require_xfs_io_command()
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command support is missing"
 		;;
+	"swapext")
+		$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 128k -b 128k' $testfile > /dev/null
+		$XFS_IO_PROG -f -c 'truncate 128k' $testfile.1 > /dev/null
+		testio=`$XFS_IO_PROG -c "$command $param $testfile.1" $testfile 2>&1`
+		echo $testio | grep -q "bad argument count" && \
+			_notrun "xfs_io $command $param support is missing"
+		echo $testio | grep -q "Inappropriate ioctl" && \
+			_notrun "xfs_io $command $param ioctl support is missing"
+		rm -f $testfile.1
+		param_checked="$param"
+		;;
 	"utimes" )
 		testio=`$XFS_IO_PROG -f -c "utimes 0 0 0 0" $testfile 2>&1`
 		;;
diff --git a/doc/group-names.txt b/doc/group-names.txt
index 771ce937ae..e88dcc0fdd 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -51,6 +51,7 @@ enospc			ENOSPC error reporting
 exportfs		file handles
 fiemap			fiemap ioctl
 filestreams		XFS filestreams allocator
+fiexchange		FIEXCHANGE_RANGE ioctl
 freeze			filesystem freeze tests
 fsck			general fsck tests
 fsmap			FS_IOC_GETFSMAP ioctl
@@ -121,6 +122,7 @@ splice			splice system call
 stress			fsstress filesystem exerciser
 subvol			btrfs subvolumes
 swap			swap files
+swapext			XFS_IOC_SWAPEXT ioctl
 symlink			symbolic links
 tape			dump and restore with a tape
 thin			thin provisioning
diff --git a/tests/generic/1200 b/tests/generic/1200
new file mode 100755
index 0000000000..58b93e2b6e
--- /dev/null
+++ b/tests/generic/1200
@@ -0,0 +1,55 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1200
+#
+# Can we use swapext to make the quota accounting incorrect?
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext quota
+
+# Import common functions.
+. ./common/filter
+. ./common/quota
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_user
+_require_nobody
+_require_quota
+_require_xfs_quota
+_require_scratch
+
+# Format filesystem and set up quota limits
+_scratch_mkfs > $seqres.full
+_qmount_option "usrquota,grpquota"
+_qmount >> $seqres.full
+
+# Set up initial files
+$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 256k -b 1m' $SCRATCH_MNT/a >> $seqres.full
+chown $qa_user $SCRATCH_MNT/a
+$XFS_IO_PROG -f -c 'pwrite -S 0x59 0 64k -b 64k' -c 'truncate 256k' $SCRATCH_MNT/b >> $seqres.full
+chown nobody $SCRATCH_MNT/b
+
+echo before swapext >> $seqres.full
+$XFS_QUOTA_PROG -x -c 'report -a' $SCRATCH_MNT >> $seqres.full
+stat $SCRATCH_MNT/* >> $seqres.full
+
+# Now try to swap the extents of the two files.  The command is allowed to
+# fail with -EINVAL (since that's what the first kernel fix does) or succeed
+# (because subsequent rewrites can handle quota).  Whatever the outcome, the
+# quota usage check at the end should never show a discrepancy.
+$XFS_IO_PROG -c "swapext $SCRATCH_MNT/b" $SCRATCH_MNT/a &> $tmp.swap
+cat $tmp.swap >> $seqres.full
+grep -v 'Invalid argument' $tmp.swap
+
+echo after swapext >> $seqres.full
+$XFS_QUOTA_PROG -x -c 'report -a' $SCRATCH_MNT >> $seqres.full
+stat $SCRATCH_MNT/* >> $seqres.full
+
+_check_quota_usage
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1200.out b/tests/generic/1200.out
new file mode 100644
index 0000000000..b1a6357719
--- /dev/null
+++ b/tests/generic/1200.out
@@ -0,0 +1,3 @@
+QA output created by 1200
+Comparing user usage
+Comparing group usage
diff --git a/tests/generic/1201 b/tests/generic/1201
new file mode 100755
index 0000000000..91e3773eaa
--- /dev/null
+++ b/tests/generic/1201
@@ -0,0 +1,53 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1201
+#
+# Can we use swapext to exceed the quota enforcement?
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext quota
+
+# Import common functions.
+. ./common/filter
+. ./common/quota
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_user
+_require_nobody
+_require_quota
+_require_xfs_quota
+_require_scratch
+
+# Format filesystem and set up quota limits
+_scratch_mkfs > $seqres.full
+_qmount_option "usrquota,grpquota"
+_qmount >> $seqres.full
+
+# Set up initial files
+$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 256k -b 1m' $SCRATCH_MNT/a >> $seqres.full
+chown $qa_user $SCRATCH_MNT/a
+$XFS_IO_PROG -f -c 'pwrite -S 0x59 0 64k -b 64k' -c 'truncate 256k' $SCRATCH_MNT/b >> $seqres.full
+chown nobody $SCRATCH_MNT/b
+
+# Set up a quota limit
+$XFS_QUOTA_PROG -x -c "limit -u bhard=70k nobody" $SCRATCH_MNT
+
+echo before swapext >> $seqres.full
+$XFS_QUOTA_PROG -x -c 'report -a' $SCRATCH_MNT >> $seqres.full
+stat $SCRATCH_MNT/* >> $seqres.full
+
+# Now try to swapext
+$XFS_IO_PROG -c "swapext $SCRATCH_MNT/b" $SCRATCH_MNT/a
+
+echo after swapext >> $seqres.full
+$XFS_QUOTA_PROG -x -c 'report -a' $SCRATCH_MNT >> $seqres.full
+stat $SCRATCH_MNT/* >> $seqres.full
+
+_check_quota_usage
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1201.out b/tests/generic/1201.out
new file mode 100644
index 0000000000..9001a59958
--- /dev/null
+++ b/tests/generic/1201.out
@@ -0,0 +1,4 @@
+QA output created by 1201
+swapext: Disk quota exceeded
+Comparing user usage
+Comparing group usage
diff --git a/tests/generic/1202 b/tests/generic/1202
new file mode 100755
index 0000000000..2afa546f1f
--- /dev/null
+++ b/tests/generic/1202
@@ -0,0 +1,47 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1202
+#
+# Make sure that swapext won't touch a swap file.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	test -e "$dir/a" && swapoff $dir/a
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_test
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+
+# Set up a fragmented swapfile and a dummy donor file.
+$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 32m -b 1m' -c fsync $dir/a >> $seqres.full
+$here/src/punch-alternating $dir/a
+$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 32m -b 1m' -c fsync $dir/a >> $seqres.full
+$MKSWAP_PROG $dir/a >> $seqres.full
+
+$XFS_IO_PROG -f -c 'pwrite -S 0x59 0 32m -b 1m' $dir/b >> $seqres.full
+
+swapon $dir/a || _notrun 'failed to swapon'
+
+# Now try to swapext.  The old code would return EINVAL for swapfiles
+# even though everything else in the VFS returns ETXTBSY.
+$XFS_IO_PROG -c "swapext $dir/b" $dir/a 2>&1 | \
+	sed -e 's/swapext: Invalid argument/swapext: Text file busy/g'
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1202.out b/tests/generic/1202.out
new file mode 100644
index 0000000000..029d65e4d0
--- /dev/null
+++ b/tests/generic/1202.out
@@ -0,0 +1,2 @@
+QA output created by 1202
+swapext: Text file busy
diff --git a/tests/xfs/1202 b/tests/xfs/1202
new file mode 100755
index 0000000000..73e56641af
--- /dev/null
+++ b/tests/xfs/1202
@@ -0,0 +1,59 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1202
+#
+# Simple tests of the old xfs swapext ioctl
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+_supported_fs xfs
+_require_xfs_io_command swapext '-v vfs'
+_require_test
+
+# We can't do any reasonable swapping if the files we're going to create are
+# realtime files with an rt extent size greater than 1 block, unless atomic
+# extent swapping is available to guarantee that partially written extents
+# are swapped in their entirety.
+file_blksz=$(_get_file_block_size $TEST_DIR)
+fs_blksz=$(_get_block_size $TEST_DIR)
+if (( $file_blksz != $fs_blksz )); then
+	_xfs_has_feature $TEST_DIR reflink || \
+		_notrun "test requires atomic extent swapping for rextsize=$((file_blksz / fs_blksz))"
+fi
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+
+$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 256k -b 1m' $dir/a >> $seqres.full
+$XFS_IO_PROG -f -c 'pwrite -S 0x59 0 256k -b 1m' $dir/b >> $seqres.full
+$XFS_IO_PROG -f -c 'pwrite -S 0x60 0 256k -b 1m' $dir/c >> $seqres.full
+$XFS_IO_PROG -f -c 'pwrite -S 0x61 0 128k -b 1m' $dir/d >> $seqres.full
+md5sum $dir/a $dir/b $dir/c $dir/d | _filter_test_dir
+
+# Swap two files that are the same length
+echo swap
+$XFS_IO_PROG -c "swapext $dir/b" $dir/a
+md5sum $dir/a $dir/b | _filter_test_dir
+
+# Try to swap two files that are not the same length
+echo fail swap
+$XFS_IO_PROG -c "swapext $dir/c" $dir/d
+md5sum $dir/c $dir/d | _filter_test_dir
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1202.out b/tests/xfs/1202.out
new file mode 100644
index 0000000000..23127eb7b7
--- /dev/null
+++ b/tests/xfs/1202.out
@@ -0,0 +1,12 @@
+QA output created by 1202
+af5c6e2d6c297f3139a4e99df396c072  TEST_DIR/test-1202/a
+fba5c83875b2ab054e06a0284346ebdf  TEST_DIR/test-1202/b
+40c08c6f2aca19bb0d2cf1dbd8bc1b1c  TEST_DIR/test-1202/c
+81615449a98aaaad8dc179b3bec87f38  TEST_DIR/test-1202/d
+swap
+fba5c83875b2ab054e06a0284346ebdf  TEST_DIR/test-1202/a
+af5c6e2d6c297f3139a4e99df396c072  TEST_DIR/test-1202/b
+fail swap
+swapext: Bad address
+40c08c6f2aca19bb0d2cf1dbd8bc1b1c  TEST_DIR/test-1202/c
+81615449a98aaaad8dc179b3bec87f38  TEST_DIR/test-1202/d
diff --git a/tests/xfs/537 b/tests/xfs/537
index 7e11488799..6364db9b5d 100755
--- a/tests/xfs/537
+++ b/tests/xfs/537
@@ -7,7 +7,7 @@
 # Verify that XFS does not cause inode fork's extent count to overflow when
 # swapping forks between files
 . ./common/preamble
-_begin_fstest auto quick collapse
+_begin_fstest auto quick collapse swapext
 
 # Import common functions.
 . ./common/filter


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 3/7] generic: test new vfs swapext ioctl
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 4/7] generic, xfs: test scatter-gather atomic file updates Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/7] generic: test old xfs extent swapping ioctl Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Test the new vfs swapext ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
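For reviewers skimming the diffstat, here is a minimal sketch of the xfs_io
invocation pattern the new tests build on.  The swapext options come from the
xfs_io support added earlier in this series; judging from generic/1206 below,
-s addresses the file named on the swapext command line, -d addresses the file
that xfs_io has open, and -l sets the length of the exchange in bytes.  The
paths and the blksz variable are placeholders:

	# whole-file swap: exchange the contents of the open file with the
	# named donor file through the vfs (FIEXCHANGE_RANGE) interface
	$XFS_IO_PROG -c "swapext -v vfs $dir/donor" $dir/target

	# partial swap: exchange 7 blocks starting at block 37 of the donor
	# with the 7 blocks starting at block 47 of the open file
	blksz=65536
	$XFS_IO_PROG \
		-c "swapext -v vfs -s $((blksz * 37)) -d $((blksz * 47)) -l $((blksz * 7)) $dir/donor" \
		$dir/target
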
 common/rc              |   13 +++++
 common/reflink         |   33 +++++++++++++
 tests/generic/1203     |   58 ++++++++++++++++++++++
 tests/generic/1203.out |    2 +
 tests/generic/1204     |  100 ++++++++++++++++++++++++++++++++++++++
 tests/generic/1204.out |   86 +++++++++++++++++++++++++++++++++
 tests/generic/1205     |  116 ++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1205.out |   90 ++++++++++++++++++++++++++++++++++
 tests/generic/1206     |   76 +++++++++++++++++++++++++++++
 tests/generic/1206.out |   32 ++++++++++++
 tests/generic/1207     |  122 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1207.out |   48 ++++++++++++++++++
 tests/generic/1209     |  101 ++++++++++++++++++++++++++++++++++++++
 tests/generic/1209.out |   33 +++++++++++++
 tests/generic/1210     |   48 ++++++++++++++++++
 tests/generic/1210.out |    6 ++
 tests/generic/1211     |  105 ++++++++++++++++++++++++++++++++++++++++
 tests/generic/1211.out |   40 +++++++++++++++
 tests/generic/1212     |   58 ++++++++++++++++++++++
 tests/generic/1212.out |    2 +
 tests/generic/1213     |  126 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1213.out |   48 ++++++++++++++++++
 tests/generic/1214     |   63 ++++++++++++++++++++++++
 tests/generic/1214.out |    2 +
 tests/generic/1215     |   70 +++++++++++++++++++++++++++
 tests/generic/1215.out |    2 +
 tests/xfs/1208         |   62 ++++++++++++++++++++++++
 tests/xfs/1208.out     |   10 ++++
 28 files changed, 1552 insertions(+)
 create mode 100755 tests/generic/1203
 create mode 100644 tests/generic/1203.out
 create mode 100755 tests/generic/1204
 create mode 100644 tests/generic/1204.out
 create mode 100755 tests/generic/1205
 create mode 100644 tests/generic/1205.out
 create mode 100755 tests/generic/1206
 create mode 100644 tests/generic/1206.out
 create mode 100755 tests/generic/1207
 create mode 100644 tests/generic/1207.out
 create mode 100755 tests/generic/1209
 create mode 100644 tests/generic/1209.out
 create mode 100755 tests/generic/1210
 create mode 100644 tests/generic/1210.out
 create mode 100755 tests/generic/1211
 create mode 100644 tests/generic/1211.out
 create mode 100755 tests/generic/1212
 create mode 100644 tests/generic/1212.out
 create mode 100755 tests/generic/1213
 create mode 100644 tests/generic/1213.out
 create mode 100755 tests/generic/1214
 create mode 100644 tests/generic/1214.out
 create mode 100755 tests/generic/1215
 create mode 100644 tests/generic/1215.out
 create mode 100755 tests/xfs/1208
 create mode 100644 tests/xfs/1208.out
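
Several of the new tests (generic/1207, 1211, and 1213) drive the atomic file
commit commands rather than raw swapext.  As a rough sketch of that flow,
using the startupdate/commitupdate/cancelupdate commands added to xfs_io
earlier in this series ($dir/target is a placeholder path):

	# Stage an update of the open file, write new contents into the
	# staging file, then swap the staged contents back in atomically.
	# Substituting 'cancelupdate' for 'commitupdate' discards the staged
	# changes instead.
	$XFS_IO_PROG \
		-c 'startupdate' \
		-c 'pwrite -S 0x60 44k 55k' \
		-c 'commitupdate' \
		$dir/target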


diff --git a/common/rc b/common/rc
index 88ecc1837d..cfe765de2e 100644
--- a/common/rc
+++ b/common/rc
@@ -2575,6 +2575,17 @@ _require_xfs_io_command()
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command support is missing"
 		;;
+	"startupdate"|"commitupdate"|"cancelupdate")
+		$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 128k -b 128k' $testfile > /dev/null
+		testio=$($XFS_IO_PROG -c "startupdate $param" \
+				-c 'pwrite -S 0x59 0 192k' \
+				-c 'commitupdate' $testfile 2>&1)
+		echo $testio | grep -q "Inappropriate ioctl" && \
+			_notrun "xfs_io $command $param support is missing"
+		echo $testio | grep -q "Operation not supported" && \
+			_notrun "xfs_io $command $param kernel support is missing"
+		param_checked="$param"
+		;;
 	"swapext")
 		$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 128k -b 128k' $testfile > /dev/null
 		$XFS_IO_PROG -f -c 'truncate 128k' $testfile.1 > /dev/null
@@ -2583,6 +2594,8 @@ _require_xfs_io_command()
 			_notrun "xfs_io $command $param support is missing"
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command $param ioctl support is missing"
+		echo $testio | grep -q "Operation not supported" && \
+			_notrun "xfs_io $command $param kernel support is missing"
 		rm -f $testfile.1
 		param_checked="$param"
 		;;
diff --git a/common/reflink b/common/reflink
index 76e9cb7d32..22adc4449b 100644
--- a/common/reflink
+++ b/common/reflink
@@ -325,6 +325,39 @@ _weave_reflink_regular() {
 	done
 }
 
+# Create a file of interleaved holes, unwritten blocks, and regular blocks.
+_weave_file_rainbow() {
+	blksz=$1
+	nr=$2
+	dfile=$3
+
+	$XFS_IO_PROG -f -c "truncate $((blksz * nr))" $dfile
+	_pwrite_byte 0x00 0 $((blksz * nr)) $dfile.chk
+	# 0 blocks are unwritten
+	seq 1 5 $((nr - 1)) | while read i; do
+		$XFS_IO_PROG -f -c "falloc $((blksz * i)) $blksz" $dfile
+		_pwrite_byte 0x00 $((blksz * i)) $blksz $dfile.chk
+	done
+	# 1 blocks are holes
+	seq 2 5 $((nr - 1)) | while read i; do
+		_pwrite_byte 0x00 $((blksz * i)) $blksz $dfile.chk
+	done
+	# 2 blocks are regular
+	seq 3 5 $((nr - 1)) | while read i; do
+		_pwrite_byte 0x71 $((blksz * i)) $blksz $dfile
+		_pwrite_byte 0x71 $((blksz * i)) $blksz $dfile.chk
+	done
+	# 3 blocks are holes
+	seq 2 5 $((nr - 1)) | while read i; do
+		_pwrite_byte 0x00 $((blksz * i)) $blksz $dfile.chk
+	done
+	# 4 blocks are delalloc
+	seq 4 5 $((nr - 1)) | while read i; do
+		_pwrite_byte 0x62 $((blksz * i)) $blksz $dfile
+		_pwrite_byte 0x62 $((blksz * i)) $blksz $dfile.chk
+	done
+}
+
 # Create a file of interleaved holes, unwritten blocks, regular blocks, and
 # reflinked blocks
 _weave_reflink_rainbow() {
diff --git a/tests/generic/1203 b/tests/generic/1203
new file mode 100755
index 0000000000..890b0b4c86
--- /dev/null
+++ b/tests/generic/1203
@@ -0,0 +1,58 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1203
+#
+# Make sure that swapext modifies ctime and not mtime of the file.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+_require_test_program punch-alternating
+_require_xfs_io_command swapext '-v vfs'
+_require_test
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+
+# Set up initial files
+$XFS_IO_PROG -f -c 'pwrite -S 0x58 0 256k -b 1m' $dir/a >> $seqres.full
+$here/src/punch-alternating $dir/a
+$XFS_IO_PROG -f -c 'pwrite -S 0x59 0 256k -b 1m' $dir/b >> $seqres.full
+
+# Snapshot the 'a' file before we swap
+echo before >> $seqres.full
+md5sum $dir/a $dir/b >> $seqres.full
+old_mtime="$(echo $(stat -c '%y' $dir/a $dir/b))"
+old_ctime="$(echo $(stat -c '%z' $dir/a $dir/b))"
+stat -c '%y %Y %z %Z' $dir/a $dir/b >> $seqres.full
+
+# Now try to swapext
+$XFS_IO_PROG -c "swapext $dir/b" $dir/a
+
+# Snapshot the 'a' file after we swap
+echo after >> $seqres.full
+md5sum $dir/a $dir/b >> $seqres.full
+new_mtime="$(echo $(stat -c '%y' $dir/a $dir/b))"
+new_ctime="$(echo $(stat -c '%z' $dir/a $dir/b))"
+stat -c '%y %Y %z %Z' $dir/a $dir/b >> $seqres.full
+
+test "$new_mtime" = "$old_mtime" && echo "mtime: $new_mtime == $old_mtime"
+test "$new_ctime" = "$old_ctime" && echo "ctime: $new_ctime == $old_ctime"
+
+# success, all done
+echo Silence is golden.
+status=0
+exit
diff --git a/tests/generic/1203.out b/tests/generic/1203.out
new file mode 100644
index 0000000000..904b25beb4
--- /dev/null
+++ b/tests/generic/1203.out
@@ -0,0 +1,2 @@
+QA output created by 1203
+Silence is golden.
diff --git a/tests/generic/1204 b/tests/generic/1204
new file mode 100755
index 0000000000..5db3878520
--- /dev/null
+++ b/tests/generic/1204
@@ -0,0 +1,100 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1204
+#
+# Test swapext between ranges of two different files.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs -s 64k -l 64k'
+_require_xfs_io_command "falloc"
+_require_test
+
+filesnap() {
+	echo "$1"
+	if [ "$2" != "$3" ]; then
+		md5sum $2 $3 | _filter_test_dir
+	else
+		md5sum $2 | _filter_test_dir
+	fi
+}
+
+test_swapext_once() {
+	filesnap "$1: before swapext" $dir/$3 $dir/$4
+	$XFS_IO_PROG -c "swapext -v vfs $2 $dir/$3" $dir/$4
+	filesnap "$1: after swapext" $dir/$3 $dir/$4
+	_test_cycle_mount
+	filesnap "$1: after cycling mount" $dir/$3 $dir/$4
+	echo
+}
+
+test_swapext_two() {
+	# swapext the same range of two files
+	test_swapext_once "$@: samerange" \
+		"-s $((blksz * 3)) -d $((blksz * 3)) -l $((blksz * 5))" b a
+
+	# swapext different ranges of two files
+	test_swapext_once "$@: diffrange" \
+		"-s $((blksz * 37)) -d $((blksz * 47)) -l $((blksz * 7))" b a
+
+	# swapext overlapping ranges of two files
+	test_swapext_once "$@: overlap" \
+		"-s $((blksz * 17)) -d $((blksz * 23)) -l $((blksz * 7))" b a
+}
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+nrblks=57
+_require_congruent_file_oplen $TEST_DIR $blksz
+
+# Set up some simple files for a first test.
+rm -rf $dir/a $dir/b
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+_pwrite_byte 0x59 0 $((blksz * nrblks)) $dir/b >> $seqres.full
+test_swapext_two "simple"
+
+# Make some files that don't end an aligned offset.
+rm -rf $dir/a $dir/b
+_pwrite_byte 0x58 0 $(( (blksz * nrblks) + 37)) $dir/a >> $seqres.full
+_pwrite_byte 0x59 0 $(( (blksz * nrblks) + 37)) $dir/b >> $seqres.full
+test_swapext_once "unalignedeof" "" a b
+
+# Set up some crazy rainbow files
+rm -rf $dir/a $dir/b
+_weave_file_rainbow $blksz $nrblks $dir/a >> $seqres.full
+_weave_file_rainbow $blksz $nrblks $dir/b >> $seqres.full
+test_swapext_two "rainbow"
+
+# Now set up a simple file for testing within the same file
+rm -rf $dir/c
+$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((blksz * nrblks))" \
+	-c "pwrite -S 0x59 $((blksz * nrblks)) $((blksz * nrblks))" \
+	$dir/c >> $seqres.full
+
+# swapext the same offset into the 'X' and 'Y' regions of the file
+test_swapext_once "single: sameXandY" \
+	"-s $((blksz * 3)) -d $((blksz * (nrblks + 3))) -l $((blksz * 5))" c c
+
+# swapext overlapping ranges within the same file
+test_swapext_once "single: overlap" \
+	"-s $((blksz * 13)) -d $((blksz * 17)) -l $((blksz * 5))" c c
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1204.out b/tests/generic/1204.out
new file mode 100644
index 0000000000..cdf9f7315d
--- /dev/null
+++ b/tests/generic/1204.out
@@ -0,0 +1,86 @@
+QA output created by 1204
+simple: samerange: before swapext
+db85d578204631f2b4eb1e73974253c2  TEST_DIR/test-1204/b
+d0425612f15c6071022cf7127620f63d  TEST_DIR/test-1204/a
+simple: samerange: after swapext
+20beef1c9ed7de02e4229c69bd43bd8f  TEST_DIR/test-1204/b
+e7697fa99d08f7eb76fa3fb963fe916a  TEST_DIR/test-1204/a
+simple: samerange: after cycling mount
+20beef1c9ed7de02e4229c69bd43bd8f  TEST_DIR/test-1204/b
+e7697fa99d08f7eb76fa3fb963fe916a  TEST_DIR/test-1204/a
+
+simple: diffrange: before swapext
+20beef1c9ed7de02e4229c69bd43bd8f  TEST_DIR/test-1204/b
+e7697fa99d08f7eb76fa3fb963fe916a  TEST_DIR/test-1204/a
+simple: diffrange: after swapext
+cd32ce54c295fcdf571ce7f8220fac56  TEST_DIR/test-1204/b
+d9771c5bb6d9db00b9abe65a4410e1a6  TEST_DIR/test-1204/a
+simple: diffrange: after cycling mount
+cd32ce54c295fcdf571ce7f8220fac56  TEST_DIR/test-1204/b
+d9771c5bb6d9db00b9abe65a4410e1a6  TEST_DIR/test-1204/a
+
+simple: overlap: before swapext
+cd32ce54c295fcdf571ce7f8220fac56  TEST_DIR/test-1204/b
+d9771c5bb6d9db00b9abe65a4410e1a6  TEST_DIR/test-1204/a
+simple: overlap: after swapext
+e0fff655f6a08fc2f03ee01e4767060c  TEST_DIR/test-1204/b
+ec7d764c85e583e305028c9cba5b25b6  TEST_DIR/test-1204/a
+simple: overlap: after cycling mount
+e0fff655f6a08fc2f03ee01e4767060c  TEST_DIR/test-1204/b
+ec7d764c85e583e305028c9cba5b25b6  TEST_DIR/test-1204/a
+
+unalignedeof: before swapext
+9f8c731a4f1946ffdda8c33e82417f2d  TEST_DIR/test-1204/a
+7a5d2ba7508226751c835292e28cd227  TEST_DIR/test-1204/b
+unalignedeof: after swapext
+7a5d2ba7508226751c835292e28cd227  TEST_DIR/test-1204/a
+9f8c731a4f1946ffdda8c33e82417f2d  TEST_DIR/test-1204/b
+unalignedeof: after cycling mount
+7a5d2ba7508226751c835292e28cd227  TEST_DIR/test-1204/a
+9f8c731a4f1946ffdda8c33e82417f2d  TEST_DIR/test-1204/b
+
+rainbow: samerange: before swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/a
+rainbow: samerange: after swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/a
+rainbow: samerange: after cycling mount
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/a
+
+rainbow: diffrange: before swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/a
+rainbow: diffrange: after swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/a
+rainbow: diffrange: after cycling mount
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/a
+
+rainbow: overlap: before swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1204/a
+rainbow: overlap: after swapext
+6753bc585e3c71d53bfaae11d2ffee99  TEST_DIR/test-1204/b
+39597abd4d9d0c9ceac22b77eb00c373  TEST_DIR/test-1204/a
+rainbow: overlap: after cycling mount
+6753bc585e3c71d53bfaae11d2ffee99  TEST_DIR/test-1204/b
+39597abd4d9d0c9ceac22b77eb00c373  TEST_DIR/test-1204/a
+
+single: sameXandY: before swapext
+39e17753fa9e923a3b5928e13775e358  TEST_DIR/test-1204/c
+single: sameXandY: after swapext
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1204/c
+single: sameXandY: after cycling mount
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1204/c
+
+single: overlap: before swapext
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1204/c
+swapext: Invalid argument
+single: overlap: after swapext
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1204/c
+single: overlap: after cycling mount
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1204/c
+
diff --git a/tests/generic/1205 b/tests/generic/1205
new file mode 100755
index 0000000000..dfa5adf0e1
--- /dev/null
+++ b/tests/generic/1205
@@ -0,0 +1,116 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1205
+#
+# Test swapext between ranges of two different files, when one of the files
+# is shared.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_xfs_io_command "falloc"
+_require_test_reflink
+
+filesnap() {
+	echo "$1"
+	if [ "$2" != "$3" ]; then
+		md5sum $2 $3 | _filter_test_dir
+	else
+		md5sum $2 | _filter_test_dir
+	fi
+}
+
+test_swapext_once() {
+	filesnap "$1: before swapext" $dir/$3 $dir/$4
+	$XFS_IO_PROG -c "swapext -v vfs $2 $dir/$3" $dir/$4
+	filesnap "$1: after swapext" $dir/$3 $dir/$4
+	_test_cycle_mount
+	filesnap "$1: after cycling mount" $dir/$3 $dir/$4
+	echo
+}
+
+test_swapext_two() {
+	# swapext the same range of two files
+	test_swapext_once "$@: samerange" \
+		"-s $((blksz * 3)) -d $((blksz * 3)) -l $((blksz * 5))" b a
+
+	# swapext different ranges of two files
+	test_swapext_once "$@: diffrange" \
+		"-s $((blksz * 37)) -d $((blksz * 47)) -l $((blksz * 7))" b a
+
+	# swapext overlapping ranges of two files
+	test_swapext_once "$@: overlap" \
+		"-s $((blksz * 17)) -d $((blksz * 23)) -l $((blksz * 7))" b a
+
+	# Now let's overwrite both A and B entirely to make sure COW works
+	echo "overwrite A and B entirely"
+	md5sum $dir/sharea | _filter_test_dir
+	$XFS_IO_PROG -c "pwrite -S 0x60 0 $((blksz * nrblks))" $dir/a >> $seqres.full
+	$XFS_IO_PROG -c "pwrite -S 0x60 0 $((blksz * nrblks))" $dir/b >> $seqres.full
+	md5sum $dir/sharea | _filter_test_dir
+	_test_cycle_mount
+	md5sum $dir/sharea | _filter_test_dir
+	echo
+}
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+_require_congruent_file_oplen $TEST_DIR $blksz
+nrblks=57
+
+# Set up some simple files for a first test.
+rm -f $dir/a $dir/b $dir/sharea
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+_pwrite_byte 0x59 0 $((blksz * nrblks)) $dir/b >> $seqres.full
+_cp_reflink $dir/a $dir/sharea
+test_swapext_two "simple"
+
+# Set up some crazy rainbow files
+rm -f $dir/a $dir/b $dir/sharea
+_weave_file_rainbow $blksz $nrblks $dir/a >> $seqres.full
+_weave_file_rainbow $blksz $nrblks $dir/b >> $seqres.full
+_cp_reflink $dir/a $dir/sharea
+test_swapext_two "rainbow"
+
+# Now set up a simple file for testing within the same file
+rm -f $dir/c $dir/sharec
+$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((blksz * nrblks))" \
+	-c "pwrite -S 0x59 $((blksz * nrblks)) $((blksz * nrblks))" \
+	$dir/c >> $seqres.full
+_cp_reflink $dir/c $dir/sharec
+
+# swapext the same offset into the 'X' and 'Y' regions of the file
+test_swapext_once "single: sameXandY" \
+	"-s $((blksz * 3)) -d $((blksz * (nrblks + 3))) -l $((blksz * 5))" c c
+
+# swapext overlapping ranges within the same file
+test_swapext_once "single: overlap" \
+	"-s $((blksz * 13)) -d $((blksz * 17)) -l $((blksz * 5))" c c
+
+# Now let's overwrite C entirely to make sure COW works
+echo "overwrite C entirely"
+md5sum $dir/sharec | _filter_test_dir
+$XFS_IO_PROG -c "pwrite -S 0x60 0 $((blksz * nrblks))" $dir/c >> $seqres.full
+md5sum $dir/sharec | _filter_test_dir
+_test_cycle_mount
+md5sum $dir/sharec | _filter_test_dir
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1205.out b/tests/generic/1205.out
new file mode 100644
index 0000000000..9a95742e78
--- /dev/null
+++ b/tests/generic/1205.out
@@ -0,0 +1,90 @@
+QA output created by 1205
+simple: samerange: before swapext
+db85d578204631f2b4eb1e73974253c2  TEST_DIR/test-1205/b
+d0425612f15c6071022cf7127620f63d  TEST_DIR/test-1205/a
+simple: samerange: after swapext
+20beef1c9ed7de02e4229c69bd43bd8f  TEST_DIR/test-1205/b
+e7697fa99d08f7eb76fa3fb963fe916a  TEST_DIR/test-1205/a
+simple: samerange: after cycling mount
+20beef1c9ed7de02e4229c69bd43bd8f  TEST_DIR/test-1205/b
+e7697fa99d08f7eb76fa3fb963fe916a  TEST_DIR/test-1205/a
+
+simple: diffrange: before swapext
+20beef1c9ed7de02e4229c69bd43bd8f  TEST_DIR/test-1205/b
+e7697fa99d08f7eb76fa3fb963fe916a  TEST_DIR/test-1205/a
+simple: diffrange: after swapext
+cd32ce54c295fcdf571ce7f8220fac56  TEST_DIR/test-1205/b
+d9771c5bb6d9db00b9abe65a4410e1a6  TEST_DIR/test-1205/a
+simple: diffrange: after cycling mount
+cd32ce54c295fcdf571ce7f8220fac56  TEST_DIR/test-1205/b
+d9771c5bb6d9db00b9abe65a4410e1a6  TEST_DIR/test-1205/a
+
+simple: overlap: before swapext
+cd32ce54c295fcdf571ce7f8220fac56  TEST_DIR/test-1205/b
+d9771c5bb6d9db00b9abe65a4410e1a6  TEST_DIR/test-1205/a
+simple: overlap: after swapext
+e0fff655f6a08fc2f03ee01e4767060c  TEST_DIR/test-1205/b
+ec7d764c85e583e305028c9cba5b25b6  TEST_DIR/test-1205/a
+simple: overlap: after cycling mount
+e0fff655f6a08fc2f03ee01e4767060c  TEST_DIR/test-1205/b
+ec7d764c85e583e305028c9cba5b25b6  TEST_DIR/test-1205/a
+
+overwrite A and B entirely
+d0425612f15c6071022cf7127620f63d  TEST_DIR/test-1205/sharea
+d0425612f15c6071022cf7127620f63d  TEST_DIR/test-1205/sharea
+d0425612f15c6071022cf7127620f63d  TEST_DIR/test-1205/sharea
+
+rainbow: samerange: before swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/a
+rainbow: samerange: after swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/a
+rainbow: samerange: after cycling mount
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/a
+
+rainbow: diffrange: before swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/a
+rainbow: diffrange: after swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/a
+rainbow: diffrange: after cycling mount
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/a
+
+rainbow: overlap: before swapext
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/b
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/a
+rainbow: overlap: after swapext
+6753bc585e3c71d53bfaae11d2ffee99  TEST_DIR/test-1205/b
+39597abd4d9d0c9ceac22b77eb00c373  TEST_DIR/test-1205/a
+rainbow: overlap: after cycling mount
+6753bc585e3c71d53bfaae11d2ffee99  TEST_DIR/test-1205/b
+39597abd4d9d0c9ceac22b77eb00c373  TEST_DIR/test-1205/a
+
+overwrite A and B entirely
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/sharea
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/sharea
+48b41ee1970eb71064b77181f42634cf  TEST_DIR/test-1205/sharea
+
+single: sameXandY: before swapext
+39e17753fa9e923a3b5928e13775e358  TEST_DIR/test-1205/c
+single: sameXandY: after swapext
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1205/c
+single: sameXandY: after cycling mount
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1205/c
+
+single: overlap: before swapext
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1205/c
+swapext: Invalid argument
+single: overlap: after swapext
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1205/c
+single: overlap: after cycling mount
+8262c617070703fb0e2a28d8f05e3112  TEST_DIR/test-1205/c
+
+overwrite C entirely
+39e17753fa9e923a3b5928e13775e358  TEST_DIR/test-1205/sharec
+39e17753fa9e923a3b5928e13775e358  TEST_DIR/test-1205/sharec
+39e17753fa9e923a3b5928e13775e358  TEST_DIR/test-1205/sharec
diff --git a/tests/generic/1206 b/tests/generic/1206
new file mode 100755
index 0000000000..a4d9dd083e
--- /dev/null
+++ b/tests/generic/1206
@@ -0,0 +1,76 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1206
+#
+# Test swapext between two files of unlike size.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs -s 64k -l 64k'
+_require_test
+
+filesnap() {
+	echo "$1"
+	if [ "$2" != "$3" ]; then
+		md5sum $2 $3 | _filter_test_dir
+	else
+		md5sum $2 | _filter_test_dir
+	fi
+}
+
+test_swapext_once() {
+	local tag=$1
+	local a_len=$2
+	local b_len=$3
+	local a_off=$4
+	local b_off=$5
+	local len=$6
+
+	# len is either a block count or -e to swap to EOF
+	if [ "$len" != "-e" ]; then
+		len="-l $((blksz * len))"
+	fi
+
+	rm -f $dir/a $dir/b
+	_pwrite_byte 0x58 0 $((blksz * a_len)) $dir/a >> $seqres.full
+	_pwrite_byte 0x59 0 $((blksz * b_len)) $dir/b >> $seqres.full
+	filesnap "$tag: before swapext" $dir/a $dir/b
+
+	cmd="swapext -v vfs -s $((blksz * a_off)) -d $((blksz * b_off)) $len $dir/a"
+	echo "$cmd" >> $seqres.full
+	$XFS_IO_PROG -c "$cmd" $dir/b
+	filesnap "$tag: after swapext" $dir/a $dir/b
+
+	_test_cycle_mount
+	filesnap "$tag: after cycling mount" $dir/a $dir/b
+	echo
+}
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+
+test_swapext_once "last 5 blocks" 27 37 22 32 5
+
+test_swapext_once "whole file to eof" 27 37 0 0 -e
+
+test_swapext_once "blocks 30-40" 27 37 30 30 10
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1206.out b/tests/generic/1206.out
new file mode 100644
index 0000000000..82f4185684
--- /dev/null
+++ b/tests/generic/1206.out
@@ -0,0 +1,32 @@
+QA output created by 1206
+last 5 blocks: before swapext
+207ea56e0ccbf50d38fd3a2d842aa170  TEST_DIR/test-1206/a
+eb58941d31f5be1e4e22df8c536dd490  TEST_DIR/test-1206/b
+last 5 blocks: after swapext
+3f34470fe9feb8513d5f3a8538f2c5f3  TEST_DIR/test-1206/a
+c3daca7dd9218371cd0dc64f11e4b0bf  TEST_DIR/test-1206/b
+last 5 blocks: after cycling mount
+3f34470fe9feb8513d5f3a8538f2c5f3  TEST_DIR/test-1206/a
+c3daca7dd9218371cd0dc64f11e4b0bf  TEST_DIR/test-1206/b
+
+whole file to eof: before swapext
+207ea56e0ccbf50d38fd3a2d842aa170  TEST_DIR/test-1206/a
+eb58941d31f5be1e4e22df8c536dd490  TEST_DIR/test-1206/b
+whole file to eof: after swapext
+eb58941d31f5be1e4e22df8c536dd490  TEST_DIR/test-1206/a
+207ea56e0ccbf50d38fd3a2d842aa170  TEST_DIR/test-1206/b
+whole file to eof: after cycling mount
+eb58941d31f5be1e4e22df8c536dd490  TEST_DIR/test-1206/a
+207ea56e0ccbf50d38fd3a2d842aa170  TEST_DIR/test-1206/b
+
+blocks 30-40: before swapext
+207ea56e0ccbf50d38fd3a2d842aa170  TEST_DIR/test-1206/a
+eb58941d31f5be1e4e22df8c536dd490  TEST_DIR/test-1206/b
+swapext: Invalid argument
+blocks 30-40: after swapext
+207ea56e0ccbf50d38fd3a2d842aa170  TEST_DIR/test-1206/a
+eb58941d31f5be1e4e22df8c536dd490  TEST_DIR/test-1206/b
+blocks 30-40: after cycling mount
+207ea56e0ccbf50d38fd3a2d842aa170  TEST_DIR/test-1206/a
+eb58941d31f5be1e4e22df8c536dd490  TEST_DIR/test-1206/b
+
diff --git a/tests/generic/1207 b/tests/generic/1207
new file mode 100755
index 0000000000..3a69a1ec57
--- /dev/null
+++ b/tests/generic/1207
@@ -0,0 +1,122 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1207
+#
+# Test atomic file updates when (a) the length is the same; (b) the length
+# is different; and (c) someone modifies the original file and we need to
+# cancel the update.  The file contents are cloned into the staging file,
+# and some of the contents are updated.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_xfs_io_command startupdate
+_require_test_reflink
+_require_test
+
+filesnap() {
+	echo "$1"
+	md5sum $2 | _filter_test_dir
+}
+
+mkfile() {
+	rm -f $dir/a
+	_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+	sync
+}
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+nrblks=64
+
+# Use the atomic file update staging prototype in xfs_io to update a file.
+mkfile
+filesnap "before commit" $dir/a
+
+$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c 'pwrite -S 0x60 44k 55k -b 1m' \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after commit" $dir/a
+echo
+
+# Use the atomic file updates to replace a file with a shorter file.
+mkfile
+filesnap "before shorten commit" $dir/a
+
+$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c 'truncate 55k' \
+	-c 'pwrite -S 0x60 0 55k' \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after shorten commit" $dir/a
+echo
+
+# Use the atomic file updates to replace a file with a longer file.
+mkfile
+filesnap "before lengthen commit" $dir/a
+
+$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c "pwrite -S 0x60 0 $(( (blksz * nrblks) + 37373 ))" \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after lengthen commit" $dir/a
+echo
+
+# Use the atomic file update staging prototype in xfs_io to cancel updating a
+# file.
+mkfile
+filesnap "before cancel" $dir/a
+
+$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c 'pwrite -S 0x60 44k 55k -b 1m' \
+	-c 'cancelupdate' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after cancel" $dir/a
+echo
+
+# Now try the update but with the A file open separately so that we clobber
+# mtime and fail the update.
+mkfile
+filesnap "before fail commit" $dir/a
+
+$XFS_IO_PROG \
+	-c "open $dir/a" \
+	-c 'startupdate' \
+	-c 'pwrite -S 0x58 44k 55k -b 1m' \
+	-c 'file 0' \
+	-c 'close' \
+	-c 'pwrite -S 0x61 22k 11k -b 1m' \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after fail commit" $dir/a
+echo
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1207.out b/tests/generic/1207.out
new file mode 100644
index 0000000000..ee33564067
--- /dev/null
+++ b/tests/generic/1207.out
@@ -0,0 +1,48 @@
+QA output created by 1207
+before commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1207/a
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1207/a'.
+after commit
+bedbd22b58a680219a1225353f6195fa  TEST_DIR/test-1207/a
+
+before shorten commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1207/a
+wrote 56320/56320 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1207/a'.
+after shorten commit
+52353039d89c5f2b76b9003464e5276a  TEST_DIR/test-1207/a
+
+before lengthen commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1207/a
+wrote 4231677/4231677 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1207/a'.
+after lengthen commit
+1839e7c6bf616160dc51b12179db2642  TEST_DIR/test-1207/a
+
+before cancel
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1207/a
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Cancelled updates to 'TEST_DIR/test-1207/a'.
+after cancel
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1207/a
+
+before fail commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1207/a
+committing update: Device or resource busy
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[000] TEST_DIR/test-1207/a (xfs,non-sync,non-direct,read-write)
+ 001  TEST_DIR/test-1207/a (fileupdate) (xfs,non-sync,non-direct,read-write)
+[000] TEST_DIR/test-1207/a (fileupdate) (xfs,non-sync,non-direct,read-write)
+wrote 11264/11264 bytes at offset 22528
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+after fail commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1207/a
+
diff --git a/tests/generic/1209 b/tests/generic/1209
new file mode 100755
index 0000000000..6b287654ca
--- /dev/null
+++ b/tests/generic/1209
@@ -0,0 +1,101 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1209
+#
+# Try invalid parameters to see if they fail.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_xfs_io_command startupdate
+_require_test
+_require_scratch
+_require_chattr i
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+nrblks=64
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/b >> $seqres.full
+
+echo Immutable files
+$XFS_IO_PROG -c 'chattr +i' -c "swapext $dir/b" $dir/a
+$CHATTR_PROG -i $dir/a
+
+echo Readonly files
+$XFS_IO_PROG -r -c "swapext $dir/b" $dir/a
+
+echo Directories
+$XFS_IO_PROG -c "swapext $dir/b" $dir
+
+echo Unaligned ranges
+$XFS_IO_PROG -c "swapext -s 37 -d 61 -l 17 $dir/b" $dir/a
+
+echo file1 range entirely beyond EOF
+$XFS_IO_PROG -c "swapext -s $(( blksz * (nrblks + 500) )) -d 0 -l $blksz $dir/b" $dir/a
+
+echo file2 range entirely beyond EOF
+$XFS_IO_PROG -c "swapext -d $(( blksz * (nrblks + 500) )) -s 0 -l $blksz $dir/b" $dir/a
+
+echo Both ranges entirely beyond EOF
+$XFS_IO_PROG -c "swapext -d $(( blksz * (nrblks + 500) )) -s $(( blksz * (nrblks + 500) )) -l $blksz $dir/b" $dir/a
+
+echo file1 range crossing EOF
+$XFS_IO_PROG -c "swapext -s $(( blksz * (nrblks - 1) )) -d 0 -l $((2 * blksz)) $dir/b" $dir/a
+
+echo file2 range crossing EOF
+$XFS_IO_PROG -c "swapext -d $(( blksz * (nrblks  - 1) )) -s 0 -l $((2 * blksz)) $dir/b" $dir/a
+
+echo Both ranges crossing EOF
+$XFS_IO_PROG -c "swapext -d $(( blksz * (nrblks - 1) )) -s $(( blksz * (nrblks - 1) )) -l $((blksz * 2)) $dir/b" $dir/a
+
+echo file1 unaligned EOF to file2 nowhere near EOF
+_pwrite_byte 0x58 $((blksz * nrblks)) 37 $dir/a >> $seqres.full
+_pwrite_byte 0x59 $((blksz * nrblks)) 37 $dir/b >> $seqres.full
+$XFS_IO_PROG -c "swapext -d 0 -s $(( blksz * nrblks )) -l 37 $dir/b" $dir/a
+
+echo file2 unaligned EOF to file1 nowhere near EOF
+$XFS_IO_PROG -c "swapext -s 0 -d $(( blksz * nrblks )) -l 37 $dir/b" $dir/a
+
+echo Files of unequal length
+_pwrite_byte 0x58 $((blksz * nrblks)) $((blksz * 2)) $dir/a >> $seqres.full
+_pwrite_byte 0x59 $((blksz * nrblks)) $blksz $dir/b >> $seqres.full
+$XFS_IO_PROG -c "swapext $dir/b" $dir/a
+
+echo Files on different filesystems
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $SCRATCH_MNT/c >> $seqres.full
+$XFS_IO_PROG -c "swapext $SCRATCH_MNT/c" $dir/a
+
+echo Files on different mounts
+mkdir -p $SCRATCH_MNT/xyz
+mount --bind $dir $SCRATCH_MNT/xyz
+_pwrite_byte 0x60 0 $((blksz * (nrblks + 2))) $dir/c >> $seqres.full
+$XFS_IO_PROG -c "swapext $SCRATCH_MNT/xyz/c" $dir/a
+umount $SCRATCH_MNT/xyz
+
+echo Swapping a file with itself
+$XFS_IO_PROG -c "swapext $dir/a" $dir/a
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1209.out b/tests/generic/1209.out
new file mode 100644
index 0000000000..a051ed97d9
--- /dev/null
+++ b/tests/generic/1209.out
@@ -0,0 +1,33 @@
+QA output created by 1209
+Immutable files
+swapext: Operation not permitted
+Readonly files
+swapext: Bad file descriptor
+Directories
+swapext: Is a directory
+Unaligned ranges
+swapext: Invalid argument
+file1 range entirely beyond EOF
+swapext: Invalid argument
+file2 range entirely beyond EOF
+swapext: Invalid argument
+Both ranges entirely beyond EOF
+swapext: Invalid argument
+file1 range crossing EOF
+swapext: Invalid argument
+file2 range crossing EOF
+swapext: Invalid argument
+Both ranges crossing EOF
+swapext: Invalid argument
+file1 unaligned EOF to file2 nowhere near EOF
+swapext: Invalid argument
+file2 unaligned EOF to file1 nowhere near EOF
+swapext: Invalid argument
+Files of unequal length
+swapext: Bad address
+Files on different filesystems
+swapext: Invalid cross-device link
+Files on different mounts
+swapext: Invalid cross-device link
+Swapping a file with itself
+swapext: Invalid argument
diff --git a/tests/generic/1210 b/tests/generic/1210
new file mode 100755
index 0000000000..93a325ad7a
--- /dev/null
+++ b/tests/generic/1210
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1210
+#
+# Make sure swapext honors RLIMIT_FSIZE.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_test
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+nrblks=64
+
+# Create some 4M files to test swapext
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+_pwrite_byte 0x59 0 $((blksz * nrblks)) $dir/b >> $seqres.full
+sync
+md5sum $dir/a $dir/b | _filter_test_dir
+
+# Set FSIZE to twice the blocksize (IOWs, 128k)
+ulimit -f $(( (blksz * 2) / 512))
+ulimit -a >> $seqres.full
+
+# Now try to swapext
+$XFS_IO_PROG -c "swapext $dir/b" $dir/a
+md5sum $dir/a $dir/b | _filter_test_dir
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1210.out b/tests/generic/1210.out
new file mode 100644
index 0000000000..02d41b8838
--- /dev/null
+++ b/tests/generic/1210.out
@@ -0,0 +1,6 @@
+QA output created by 1210
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1210/a
+901e136269b8d283d311697b7c6dc1f2  TEST_DIR/test-1210/b
+swapext: Invalid argument
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1210/a
+901e136269b8d283d311697b7c6dc1f2  TEST_DIR/test-1210/b
diff --git a/tests/generic/1211 b/tests/generic/1211
new file mode 100755
index 0000000000..f7b8a8d280
--- /dev/null
+++ b/tests/generic/1211
@@ -0,0 +1,105 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1211
+#
+# Test atomic file replacement when (a) the length is the same; (b) the length
+# is different; and (c) someone modifies the original file and we need to
+# cancel the update.  The staging file is created empty, which implies that the
+# caller wants a full file replacement.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_xfs_io_command startupdate '-e'
+_require_test
+
+filesnap() {
+	echo "$1"
+	md5sum $2 | _filter_test_dir
+}
+
+mkfile() {
+	rm -f $dir/a
+	_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+	sync
+}
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+nrblks=64
+
+# Use the atomic file update staging prototype in xfs_io to update a file.
+mkfile
+filesnap "before commit" $dir/a
+
+$XFS_IO_PROG \
+	-c 'startupdate -e' \
+	-c "pwrite -S 0x60 0 $((blksz * nrblks))" \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after commit" $dir/a
+echo
+
+# Use the atomic file updates to replace a file with a shorter file.
+mkfile
+filesnap "before shorten commit" $dir/a
+
+$XFS_IO_PROG \
+	-c 'startupdate -e' \
+	-c 'pwrite -S 0x60 0 55k' \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after shorten commit" $dir/a
+echo
+
+# Use the atomic file updates to replace a file with a longer file.
+mkfile
+filesnap "before lengthen commit" $dir/a
+
+$XFS_IO_PROG \
+	-c 'startupdate -e' \
+	-c "pwrite -S 0x60 0 $(( (blksz * nrblks) + 37373 ))" \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after lengthen commit" $dir/a
+echo
+
+# Now try the update but with the A file open separately so that we clobber
+# mtime and fail the update.
+mkfile
+filesnap "before fail commit" $dir/a
+
+$XFS_IO_PROG \
+	-c "open $dir/a" \
+	-c 'startupdate -e' \
+	-c 'pwrite -S 0x58 44k 55k -b 1m' \
+	-c 'file 0' \
+	-c 'close' \
+	-c 'pwrite -S 0x61 22k 11k -b 1m' \
+	-c 'commitupdate -q' \
+	"$dir/a" 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after fail commit" $dir/a
+echo
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1211.out b/tests/generic/1211.out
new file mode 100644
index 0000000000..a149de198f
--- /dev/null
+++ b/tests/generic/1211.out
@@ -0,0 +1,40 @@
+QA output created by 1211
+before commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1211/a
+wrote 4194304/4194304 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1211/a'.
+after commit
+0558063c531ca7c7864fc5a4923f7144  TEST_DIR/test-1211/a
+
+before shorten commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1211/a
+wrote 56320/56320 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1211/a'.
+after shorten commit
+52353039d89c5f2b76b9003464e5276a  TEST_DIR/test-1211/a
+
+before lengthen commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1211/a
+wrote 4231677/4231677 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1211/a'.
+after lengthen commit
+1839e7c6bf616160dc51b12179db2642  TEST_DIR/test-1211/a
+
+before fail commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1211/a
+committing update: Device or resource busy
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[000] TEST_DIR/test-1211/a (xfs,non-sync,non-direct,read-write)
+ 001  TEST_DIR/test-1211/a (fileupdate) (xfs,non-sync,non-direct,read-write)
+[000] TEST_DIR/test-1211/a (fileupdate) (xfs,non-sync,non-direct,read-write)
+wrote 11264/11264 bytes at offset 22528
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+after fail commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1211/a
+
diff --git a/tests/generic/1212 b/tests/generic/1212
new file mode 100755
index 0000000000..c6599e3a35
--- /dev/null
+++ b/tests/generic/1212
@@ -0,0 +1,58 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1212
+#
+# Stress testing with a lot of extents.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs'
+_require_test_program punch-alternating
+_require_test
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=$(_get_file_block_size $TEST_DIR)
+nrblks=$((LOAD_FACTOR * 100000))
+
+_require_fs_space $TEST_DIR $(( (2 * blksz * nrblks) / 1024 ))
+
+# Create some big swiss cheese files to test swapext with a lot of extents
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+$here/src/punch-alternating $dir/a
+_pwrite_byte 0x59 0 $((blksz * nrblks)) $dir/b >> $seqres.full
+$here/src/punch-alternating -o 1 $dir/b
+filefrag -v $dir/a $dir/b >> $seqres.full
+
+# Now try to swapext
+md5_a="$(md5sum < $dir/a)"
+md5_b="$(md5sum < $dir/b)"
+date >> $seqres.full
+$XFS_IO_PROG -c "swapext $dir/b" $dir/a
+date >> $seqres.full
+
+echo "md5_a=$md5_a" >> $seqres.full
+echo "md5_b=$md5_b" >> $seqres.full
+md5sum $dir/a $dir/b >> $seqres.full
+
+test "$(md5sum < $dir/b)" = "$md5_a" || echo "file b does not match former a"
+test "$(md5sum < $dir/a)" = "$md5_b" || echo "file a does not match former b"
+
+echo "Silence is golden!"
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1212.out b/tests/generic/1212.out
new file mode 100644
index 0000000000..fb775977ca
--- /dev/null
+++ b/tests/generic/1212.out
@@ -0,0 +1,2 @@
+QA output created by 1212
+Silence is golden!
diff --git a/tests/generic/1213 b/tests/generic/1213
new file mode 100755
index 0000000000..d466263a9d
--- /dev/null
+++ b/tests/generic/1213
@@ -0,0 +1,126 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1213
+#
+# Test non-root atomic file updates when (a) the file contents are cloned into
+# the staging file; and (b) when the staging file is created empty.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_xfs_io_command startupdate
+_require_test_reflink
+_require_test
+_require_user
+
+filesnap() {
+	echo "$1"
+	md5sum $2 | _filter_test_dir
+}
+
+mkfile() {
+	rm -f $dir/a
+	_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+	chown $qa_user $dir/a $dir/
+	sync
+}
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+nrblks=64
+
+# Use the atomic file update staging prototype in xfs_io to update a file.
+mkfile
+filesnap "before commit" $dir/a
+
+cmd="$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c 'pwrite -S 0x60 44k 55k -b 1m' \
+	-c 'commitupdate -q' \
+	\"$dir/a\""
+su -s /bin/bash -c "$cmd" $qa_user 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after commit" $dir/a
+echo
+
+# Use the atomic file updates to replace a file with a shorter file.
+mkfile
+filesnap "before shorten commit" $dir/a
+
+cmd="$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c 'truncate 55k' \
+	-c 'pwrite -S 0x60 0 55k' \
+	-c 'commitupdate -q' \
+	\"$dir/a\""
+su -s /bin/bash -c "$cmd" $qa_user 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after shorten commit" $dir/a
+echo
+
+# Use the atomic file updates to replace a file with a longer file.
+mkfile
+filesnap "before lengthen commit" $dir/a
+
+cmd="$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c \"pwrite -S 0x60 0 $(( (blksz * nrblks) + 37373 ))\" \
+	-c 'commitupdate -q' \
+	\"$dir/a\""
+su -s /bin/bash -c "$cmd" $qa_user 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after lengthen commit" $dir/a
+echo
+
+# Use the atomic file update staging prototype in xfs_io to cancel updating a
+# file.
+mkfile
+filesnap "before cancel" $dir/a
+
+cmd="$XFS_IO_PROG \
+	-c 'startupdate' \
+	-c 'pwrite -S 0x60 44k 55k -b 1m' \
+	-c 'cancelupdate' \
+	\"$dir/a\""
+su -s /bin/bash -c "$cmd" $qa_user 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after cancel" $dir/a
+echo
+
+# Now try the update but with the A file open separately so that we clobber
+# mtime and fail the update.
+mkfile
+filesnap "before fail commit" $dir/a
+
+cmd="$XFS_IO_PROG \
+	-c \"open $dir/a\" \
+	-c 'startupdate' \
+	-c 'pwrite -S 0x58 44k 55k -b 1m' \
+	-c 'file 0' \
+	-c 'close' \
+	-c 'pwrite -S 0x61 22k 11k -b 1m' \
+	-c 'commitupdate -q' \
+	\"$dir/a\""
+su -s /bin/bash -c "$cmd" $qa_user 2>&1 | _filter_xfs_io | _filter_test_dir
+
+filesnap "after fail commit" $dir/a
+echo
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1213.out b/tests/generic/1213.out
new file mode 100644
index 0000000000..b5a140560d
--- /dev/null
+++ b/tests/generic/1213.out
@@ -0,0 +1,48 @@
+QA output created by 1213
+before commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1213/a
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1213/a'.
+after commit
+bedbd22b58a680219a1225353f6195fa  TEST_DIR/test-1213/a
+
+before shorten commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1213/a
+wrote 56320/56320 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1213/a'.
+after shorten commit
+52353039d89c5f2b76b9003464e5276a  TEST_DIR/test-1213/a
+
+before lengthen commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1213/a
+wrote 4231677/4231677 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Committed updates to 'TEST_DIR/test-1213/a'.
+after lengthen commit
+1839e7c6bf616160dc51b12179db2642  TEST_DIR/test-1213/a
+
+before cancel
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1213/a
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Cancelled updates to 'TEST_DIR/test-1213/a'.
+after cancel
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1213/a
+
+before fail commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1213/a
+committing update: Device or resource busy
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 56320/56320 bytes at offset 45056
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+[000] TEST_DIR/test-1213/a (xfs,non-sync,non-direct,read-write)
+ 001  TEST_DIR/test-1213/a (fileupdate) (xfs,non-sync,non-direct,read-write)
+[000] TEST_DIR/test-1213/a (fileupdate) (xfs,non-sync,non-direct,read-write)
+wrote 11264/11264 bytes at offset 22528
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+after fail commit
+d712f003e9d467e063cda1baf319b928  TEST_DIR/test-1213/a
+
diff --git a/tests/generic/1214 b/tests/generic/1214
new file mode 100755
index 0000000000..0cacc57e9c
--- /dev/null
+++ b/tests/generic/1214
@@ -0,0 +1,63 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1214
+#
+# Test that swapext with the fsync flag flushes everything to disk before the
+# call returns.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_test_program "punch-alternating"
+_require_xfs_io_command swapext '-v vfs -a'
+_require_scratch
+_require_scratch_shutdown
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_pwrite_byte 0x58 0 2m $SCRATCH_MNT/a >> $seqres.full
+_pwrite_byte 0x59 0 2m $SCRATCH_MNT/b >> $seqres.full
+$here/src/punch-alternating $SCRATCH_MNT/a
+$here/src/punch-alternating $SCRATCH_MNT/b
+
+old_a=$(md5sum $SCRATCH_MNT/a | awk '{print $1}')
+old_b=$(md5sum $SCRATCH_MNT/b | awk '{print $1}')
+echo "md5 a: $old_a md5 b: $old_b" >> $seqres.full
+
+od -tx1 -Ad -c $SCRATCH_MNT/a > $tmp.a0
+od -tx1 -Ad -c $SCRATCH_MNT/b > $tmp.b0
+
+echo swap >> $seqres.full
+$XFS_IO_PROG -c "swapext -v vfs -a -e -f -u $SCRATCH_MNT/a" $SCRATCH_MNT/b
+_scratch_shutdown
+_scratch_cycle_mount
+
+new_a=$(md5sum $SCRATCH_MNT/a | awk '{print $1}')
+new_b=$(md5sum $SCRATCH_MNT/b | awk '{print $1}')
+echo "md5 a: $new_a md5 b: $new_b" >> $seqres.full
+
+test $old_a = $new_b || echo "scratch file B doesn't match old file A"
+test $old_b = $new_a || echo "scratch file A doesn't match old file B"
+
+od -tx1 -Ad -c $SCRATCH_MNT/a > $tmp.a1
+od -tx1 -Ad -c $SCRATCH_MNT/b > $tmp.b1
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/generic/1214.out b/tests/generic/1214.out
new file mode 100644
index 0000000000..a529e42333
--- /dev/null
+++ b/tests/generic/1214.out
@@ -0,0 +1,2 @@
+QA output created by 1214
+Silence is golden
diff --git a/tests/generic/1215 b/tests/generic/1215
new file mode 100755
index 0000000000..28eb8356d8
--- /dev/null
+++ b/tests/generic/1215
@@ -0,0 +1,70 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1215
+#
+# Test that swapext with the dry run flag doesn't change anything.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_require_test_program "punch-alternating"
+_require_xfs_io_command swapext '-v vfs'
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_pwrite_byte 0x58 0 1m $SCRATCH_MNT/a >> $seqres.full
+_pwrite_byte 0x59 0 2m $SCRATCH_MNT/b >> $seqres.full
+$XFS_IO_PROG -c 'truncate 2m' $SCRATCH_MNT/a
+$here/src/punch-alternating $SCRATCH_MNT/a
+$here/src/punch-alternating $SCRATCH_MNT/b
+
+old_a=$(md5sum $SCRATCH_MNT/a | awk '{print $1}')
+old_b=$(md5sum $SCRATCH_MNT/b | awk '{print $1}')
+echo "md5 a: $old_a md5 b: $old_b" >> $seqres.full
+
+# Test swapext with the -n option, which will do all the input parameter
+# checking and return 0 without changing anything.
+echo dry run swap >> $seqres.full
+$XFS_IO_PROG -c "swapext -v vfs -n -f -u $SCRATCH_MNT/a" $SCRATCH_MNT/b
+_scratch_cycle_mount
+
+new_a=$(md5sum $SCRATCH_MNT/a | awk '{print $1}')
+new_b=$(md5sum $SCRATCH_MNT/b | awk '{print $1}')
+echo "md5 a: $new_a md5 b: $new_b" >> $seqres.full
+
+test $old_a = $new_a || echo "scratch file A should not have swapped"
+test $old_b = $new_b || echo "scratch file B should not have swapped"
+
+# Do it again, but without the -n option, to prove that we can actually
+# swap the file contents.
+echo actual swap >> $seqres.full
+$XFS_IO_PROG -c "swapext -v vfs -f -u $SCRATCH_MNT/a" $SCRATCH_MNT/b
+_scratch_cycle_mount
+
+new_a=$(md5sum $SCRATCH_MNT/a | awk '{print $1}')
+new_b=$(md5sum $SCRATCH_MNT/b | awk '{print $1}')
+echo "md5 a: $new_a md5 b: $new_b" >> $seqres.full
+
+test $old_a = $new_b || echo "scratch file A should have swapped"
+test $old_b = $new_a || echo "scratch file B should have swapped"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/generic/1215.out b/tests/generic/1215.out
new file mode 100644
index 0000000000..f68a8e6998
--- /dev/null
+++ b/tests/generic/1215.out
@@ -0,0 +1,2 @@
+QA output created by 1215
+Silence is golden
diff --git a/tests/xfs/1208 b/tests/xfs/1208
new file mode 100755
index 0000000000..3106a4036c
--- /dev/null
+++ b/tests/xfs/1208
@@ -0,0 +1,62 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1208
+#
+# Make sure an atomic swapext actually runs to completion even if we shut
+# down the filesystem midway through.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/reflink
+. ./common/inject
+
+# real QA test starts here
+_supported_fs xfs
+_require_xfs_io_command swapext '-v vfs -a'
+_require_test_program "punch-alternating"
+_require_xfs_io_command startupdate
+_require_xfs_io_error_injection "bmap_finish_one"
+_require_test
+
+filesnap() {
+	echo "$1"
+	md5sum $dir/a $dir/b $dir/c | _filter_test_dir
+}
+
+dir=$TEST_DIR/test-$seq
+mkdir -p $dir
+blksz=65536
+nrblks=137
+
+# Create two files to swap, and try to fragment the first file.
+rm -f $dir/a
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+$here/src/punch-alternating $dir/a
+_pwrite_byte 0x59 0 $((blksz * nrblks)) $dir/b >> $seqres.full
+_pwrite_byte 0x59 0 $((blksz * nrblks)) $dir/c >> $seqres.full
+_pwrite_byte 0x58 0 $((blksz * nrblks)) $dir/a >> $seqres.full
+sync
+
+# Inject a bmap error and trigger it via swapext.
+filesnap "before commit"
+$XFS_IO_PROG -x -c 'inject bmap_finish_one' -c "swapext $dir/b" $dir/a
+
+# Check the file afterwards.
+_test_cycle_mount
+filesnap "after commit"
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1208.out b/tests/xfs/1208.out
new file mode 100644
index 0000000000..b6aa1b6231
--- /dev/null
+++ b/tests/xfs/1208.out
@@ -0,0 +1,10 @@
+QA output created by 1208
+before commit
+c7221b1494117327570a0958b0abca51  TEST_DIR/test-1208/a
+30cc2b6b307081e10972317013efb0f3  TEST_DIR/test-1208/b
+30cc2b6b307081e10972317013efb0f3  TEST_DIR/test-1208/c
+swapext: Input/output error
+after commit
+30cc2b6b307081e10972317013efb0f3  TEST_DIR/test-1208/a
+c7221b1494117327570a0958b0abca51  TEST_DIR/test-1208/b
+30cc2b6b307081e10972317013efb0f3  TEST_DIR/test-1208/c


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 4/7] generic, xfs: test scatter-gather atomic file updates
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/7] xfs/122: fix for swapext log items Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 5/7] generic: test that file privilege gets dropped with FIEXCHANGE_RANGE Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 3/7] generic: test new vfs swapext ioctl Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Make sure that FILE_XCHG_RANGE_SKIP_FILE1_HOLES does what we want it to
do -- provide a means to implement scatter-gather atomic file writes.
That means we create a temporary file, write whatever sparse bits to it
that we want, and swap the non-hole parts of the temp file.
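
The usage model is roughly the following (an illustrative sketch only --
the paths and offsets here are made up, and it assumes an xfs_io built
with the vfs swapext support added earlier in this patchset):

  # stage only the pieces we want to change in a sparse temp file
  xfs_io -f -c 'pwrite -S 0x59 64k 64k' -c 'pwrite -S 0x57 768k 64k' /mnt/.tmpfile
  # swap just the written (non-hole) ranges of the temp file into the target;
  # -h skips holes in the donor file and -e operates all the way to EOF
  xfs_io -c 'swapext -v vfs -f -u -h -e -a /mnt/.tmpfile' /mnt/target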

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/xfs             |   19 +++++++++++++++
 tests/generic/1216     |   53 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1216.out |    6 +++++
 tests/generic/1217     |   56 ++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1217.out |    4 +++
 tests/xfs/1211         |   59 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1211.out     |    7 ++++++
 tests/xfs/1212         |   61 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1212.out     |    5 ++++
 9 files changed, 270 insertions(+)
 create mode 100755 tests/generic/1216
 create mode 100644 tests/generic/1216.out
 create mode 100755 tests/generic/1217
 create mode 100644 tests/generic/1217.out
 create mode 100755 tests/xfs/1211
 create mode 100644 tests/xfs/1211.out
 create mode 100755 tests/xfs/1212
 create mode 100644 tests/xfs/1212.out


diff --git a/common/xfs b/common/xfs
index 610730e5ef..8b365ad18b 100644
--- a/common/xfs
+++ b/common/xfs
@@ -1723,3 +1723,22 @@ _scratch_xfs_find_agbtree_height() {
 
 	return 1
 }
+
+_require_xfs_mkfs_atomicswap()
+{
+	# atomicswap can be activated on rmap or reflink filesystems.
+	# reflink is newer (4.9 for reflink vs. 4.8 for rmap) so test that.
+	_scratch_mkfs_xfs_supported -m reflink=1 >/dev/null 2>&1 || \
+		_notrun "mkfs.xfs doesn't have atomicswap dependent features"
+}
+
+_require_xfs_scratch_atomicswap()
+{
+	_require_xfs_mkfs_atomicswap
+	_require_scratch
+	_require_xfs_io_command swapext '-v vfs -a'
+	_scratch_mkfs -m reflink=1 > /dev/null
+	_try_scratch_mount || \
+		_notrun "atomicswap dependencies not supported by scratch filesystem type: $FSTYP"
+	_scratch_unmount
+}
diff --git a/tests/generic/1216 b/tests/generic/1216
new file mode 100755
index 0000000000..802cfe14d4
--- /dev/null
+++ b/tests/generic/1216
@@ -0,0 +1,53 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1216
+#
+# Test scatter-gather atomic file writes.  We create a temporary file, write
+# sparsely to it, then use the FILE_XCHG_RANGE_SKIP_FILE1_HOLES flag to swap
+# atomically only the ranges that we wrote.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs -a'
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_congruent_file_oplen $SCRATCH_MNT 65536
+
+# Create original file
+_pwrite_byte 0x58 0 1m $SCRATCH_MNT/a >> $seqres.full
+
+# Create the donor file
+_pwrite_byte 0x59 64k 64k $SCRATCH_MNT/b >> $seqres.full
+_pwrite_byte 0x57 768k 64k $SCRATCH_MNT/b >> $seqres.full
+$XFS_IO_PROG -c 'truncate 1m' $SCRATCH_MNT/b
+
+md5sum $SCRATCH_MNT/a | _filter_scratch
+md5sum $SCRATCH_MNT/b | _filter_scratch
+
+# Test swapext.  -h means skip holes in /b, and -e means operate to EOF
+echo swap | tee -a $seqres.full
+$XFS_IO_PROG -c "swapext -v vfs -f -u -h -e -a $SCRATCH_MNT/b" $SCRATCH_MNT/a
+_scratch_cycle_mount
+
+md5sum $SCRATCH_MNT/a | _filter_scratch
+md5sum $SCRATCH_MNT/b | _filter_scratch
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1216.out b/tests/generic/1216.out
new file mode 100644
index 0000000000..2b49d5cfa8
--- /dev/null
+++ b/tests/generic/1216.out
@@ -0,0 +1,6 @@
+QA output created by 1216
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+c9fb827e2e3e579dc2a733ddad486d1d  SCRATCH_MNT/b
+swap
+e9cbfe8489a68efaa5fcf40cf3106118  SCRATCH_MNT/a
+faf8ed02a5b0638096a817abcc6c2127  SCRATCH_MNT/b
diff --git a/tests/generic/1217 b/tests/generic/1217
new file mode 100755
index 0000000000..78c49751f7
--- /dev/null
+++ b/tests/generic/1217
@@ -0,0 +1,56 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1217
+#
+# Test scatter-gather atomic file commits.  Use the startupdate command to
+# create a temporary file, write sparsely to it, then use commitupdate -h to
+# perform the scattered update.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+_require_xfs_io_command swapext '-v vfs -a'
+_require_xfs_io_command startupdate '-e'
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_congruent_file_oplen $SCRATCH_MNT 65536
+
+# Create original file
+_pwrite_byte 0x58 0 1m $SCRATCH_MNT/a >> $seqres.full
+sync
+md5sum $SCRATCH_MNT/a | _filter_scratch
+
+# Test atomic scatter-gather file commits.
+echo commit | tee -a $seqres.full
+$XFS_IO_PROG \
+	-c 'bmap -elpv' \
+	-c 'startupdate -e' \
+	-c 'truncate 1m' \
+	-c 'pwrite -S 0x59 64k 64k' \
+	-c 'pwrite -S 0x57 768k 64k' \
+	-c 'bmap -elpv' \
+	-c 'commitupdate -h -k' \
+	-c 'bmap -elpv' \
+	$SCRATCH_MNT/a >> $seqres.full
+_scratch_cycle_mount
+
+md5sum $SCRATCH_MNT/a | _filter_scratch
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1217.out b/tests/generic/1217.out
new file mode 100644
index 0000000000..73bf89895d
--- /dev/null
+++ b/tests/generic/1217.out
@@ -0,0 +1,4 @@
+QA output created by 1217
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+commit
+e9cbfe8489a68efaa5fcf40cf3106118  SCRATCH_MNT/a
diff --git a/tests/xfs/1211 b/tests/xfs/1211
new file mode 100755
index 0000000000..6297b8ad67
--- /dev/null
+++ b/tests/xfs/1211
@@ -0,0 +1,59 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1211
+#
+# Test scatter-gather atomic file writes.  We create a temporary file, write
+# sparsely to it, then use the FILE_XCHG_RANGE_SKIP_FILE1_HOLES flag to swap
+# atomically only the ranges that we wrote.  Inject an error so that we can
+# test that log recovery finishes the swap.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/inject
+
+# real QA test starts here
+_supported_fs xfs
+_require_xfs_io_command swapext '-v vfs -a'
+_require_xfs_scratch_atomicswap
+_require_xfs_io_error_injection "bmap_finish_one"
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_congruent_file_oplen $SCRATCH_MNT 65536
+
+# Create original file
+_pwrite_byte 0x58 0 1m $SCRATCH_MNT/a >> $seqres.full
+
+# Create the donor file
+_pwrite_byte 0x59 64k 64k $SCRATCH_MNT/b >> $seqres.full
+_pwrite_byte 0x57 768k 64k $SCRATCH_MNT/b >> $seqres.full
+$XFS_IO_PROG -c 'truncate 1m' $SCRATCH_MNT/b
+sync
+
+md5sum $SCRATCH_MNT/a | _filter_scratch
+md5sum $SCRATCH_MNT/b | _filter_scratch
+
+# Test swapext.  -h means skip holes in /b, and -e means operate to EOF
+echo swap | tee -a $seqres.full
+$XFS_IO_PROG -x -c 'inject bmap_finish_one' \
+	-c "swapext -v vfs -f -u -h -e -a $SCRATCH_MNT/b" $SCRATCH_MNT/a
+_scratch_cycle_mount
+
+md5sum $SCRATCH_MNT/a | _filter_scratch
+md5sum $SCRATCH_MNT/b | _filter_scratch
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1211.out b/tests/xfs/1211.out
new file mode 100644
index 0000000000..19412c21a1
--- /dev/null
+++ b/tests/xfs/1211.out
@@ -0,0 +1,7 @@
+QA output created by 1211
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+c9fb827e2e3e579dc2a733ddad486d1d  SCRATCH_MNT/b
+swap
+swapext: Input/output error
+e9cbfe8489a68efaa5fcf40cf3106118  SCRATCH_MNT/a
+faf8ed02a5b0638096a817abcc6c2127  SCRATCH_MNT/b
diff --git a/tests/xfs/1212 b/tests/xfs/1212
new file mode 100755
index 0000000000..d2292d65a2
--- /dev/null
+++ b/tests/xfs/1212
@@ -0,0 +1,61 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1212
+#
+# Test scatter-gather atomic file commits.  Use the startupdate command to
+# create a temporary file, write sparsely to it, then use commitupdate -h to
+# perform the scattered update.  Inject an error so that we can test that log
+# recovery finishes the swap.
+
+. ./common/preamble
+_begin_fstest auto quick fiexchange swapext
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.* $dir
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/inject
+
+# real QA test starts here
+_supported_fs xfs
+_require_xfs_io_command swapext '-v vfs -a'
+_require_xfs_io_command startupdate '-e'
+_require_xfs_scratch_atomicswap
+_require_xfs_io_error_injection "bmap_finish_one"
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+# Create original file
+_pwrite_byte 0x58 0 1m $SCRATCH_MNT/a >> $seqres.full
+sync
+md5sum $SCRATCH_MNT/a | _filter_scratch
+
+# Test atomic scatter-gather file commits.
+echo commit | tee -a $seqres.full
+$XFS_IO_PROG -x \
+	-c 'bmap -elpv' \
+	-c 'startupdate -e' \
+	-c 'truncate 1m' \
+	-c 'pwrite -S 0x59 64k 64k' \
+	-c 'pwrite -S 0x57 768k 64k' \
+	-c 'sync' \
+	-c 'bmap -elpv' \
+	-c 'inject bmap_finish_one' \
+	-c 'commitupdate -h -k' \
+	$SCRATCH_MNT/a >> $seqres.full
+_scratch_cycle_mount
+
+$XFS_IO_PROG -c 'bmap -elpv' $SCRATCH_MNT/a >> $seqres.full
+md5sum $SCRATCH_MNT/a | _filter_scratch
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1212.out b/tests/xfs/1212.out
new file mode 100644
index 0000000000..29718e2024
--- /dev/null
+++ b/tests/xfs/1212.out
@@ -0,0 +1,5 @@
+QA output created by 1212
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+commit
+committing update: Input/output error
+e9cbfe8489a68efaa5fcf40cf3106118  SCRATCH_MNT/a


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 5/7] generic: test that file privilege gets dropped with FIEXCHANGE_RANGE
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/7] xfs/122: fix for swapext log items Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 4/7] generic, xfs: test scatter-gather atomic file updates Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Make sure that we clear the suid and sgid bits and capabilities during a
FIEXCHANGE_RANGE call just like we would for a regular file write.
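
In other words, something like the following should leave the file
without its setuid bit (a sketch only -- the path and the unprivileged
user are placeholders, and it assumes an xfs_io that has the
startupdate/commitupdate commands from this patchset):

  chmod a+rws /mnt/file
  su - some_user -c "xfs_io -c 'startupdate' -c 'pwrite -S 0x57 0 1m' -c 'commitupdate' /mnt/file"
  stat -c '%a %A' /mnt/file   # setuid should be gone, just as after a regular write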

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/generic/1218     |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1218.out |   49 ++++++++++++++++++++
 tests/generic/1219     |   83 +++++++++++++++++++++++++++++++++++
 tests/generic/1219.out |   17 +++++++
 4 files changed, 264 insertions(+)
 create mode 100755 tests/generic/1218
 create mode 100644 tests/generic/1218.out
 create mode 100755 tests/generic/1219
 create mode 100644 tests/generic/1219.out


diff --git a/tests/generic/1218 b/tests/generic/1218
new file mode 100755
index 0000000000..e6c170351e
--- /dev/null
+++ b/tests/generic/1218
@@ -0,0 +1,115 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1218
+#
+# Functional test for dropping suid and sgid bits as part of an atomic file
+# commit.
+#
+. ./common/preamble
+_begin_fstest auto fiexchange swapext quick
+
+# Override the default cleanup function.
+# _cleanup()
+# {
+# 	cd /
+# 	rm -r -f $tmp.*
+# }
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_require_user
+_require_xfs_io_command swapext '-v vfs -a'
+_require_xfs_io_command startupdate
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_congruent_file_oplen $SCRATCH_MNT 1048576
+chmod a+rw $SCRATCH_MNT/
+
+setup_testfile() {
+	rm -f $SCRATCH_MNT/a
+	_pwrite_byte 0x58 0 1m $SCRATCH_MNT/a >> $seqres.full
+	sync
+}
+
+commit_and_check() {
+	local user="$1"
+
+	md5sum $SCRATCH_MNT/a | _filter_scratch
+	stat -c '%a %A %n' $SCRATCH_MNT/a | _filter_scratch
+
+	local cmd="$XFS_IO_PROG -c 'startupdate' -c 'pwrite -S 0x57 0 1m' -c 'commitupdate' $SCRATCH_MNT/a"
+	if [ -n "$user" ]; then
+		su - "$user" -c "$cmd" >> $seqres.full
+	else
+		$SHELL -c "$cmd" >> $seqres.full
+	fi
+
+	_scratch_cycle_mount
+	md5sum $SCRATCH_MNT/a | _filter_scratch
+	stat -c '%a %A %n' $SCRATCH_MNT/a | _filter_scratch
+
+	# Blank line in output
+	echo
+}
+
+# Commit to a non-exec file by an unprivileged user clears suid and
+# sgid.
+echo "Test 1 - qa_user, non-exec file"
+setup_testfile
+chmod a+rws $SCRATCH_MNT/a
+commit_and_check "$qa_user"
+
+# Commit to a group-exec file by an unprivileged user clears suid and sgid.
+echo "Test 2 - qa_user, group-exec file"
+setup_testfile
+chmod g+x,a+rws $SCRATCH_MNT/a
+commit_and_check "$qa_user"
+
+# Commit to a user-exec file by an unprivileged user clears suid and sgid.
+echo "Test 3 - qa_user, user-exec file"
+setup_testfile
+chmod u+x,a+rws,g-x $SCRATCH_MNT/a
+commit_and_check "$qa_user"
+
+# Commit to an all-exec file by an unprivileged user clears suid and sgid.
+echo "Test 4 - qa_user, all-exec file"
+setup_testfile
+chmod a+rwxs $SCRATCH_MNT/a
+commit_and_check "$qa_user"
+
+# Commit to a non-exec file by root leaves suid and sgid.
+echo "Test 5 - root, non-exec file"
+setup_testfile
+chmod a+rws $SCRATCH_MNT/a
+commit_and_check
+
+# Commit to a group-exec file by root leaves suid and sgid.
+echo "Test 6 - root, group-exec file"
+setup_testfile
+chmod g+x,a+rws $SCRATCH_MNT/a
+commit_and_check
+
+# Commit to a user-exec file by root leaves suid and sgid.
+echo "Test 7 - root, user-exec file"
+setup_testfile
+chmod u+x,a+rws,g-x $SCRATCH_MNT/a
+commit_and_check
+
+# Commit to an all-exec file by root leaves suid and sgid.
+echo "Test 8 - root, all-exec file"
+setup_testfile
+chmod a+rwxs $SCRATCH_MNT/a
+commit_and_check
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1218.out b/tests/generic/1218.out
new file mode 100644
index 0000000000..8f4469aad4
--- /dev/null
+++ b/tests/generic/1218.out
@@ -0,0 +1,49 @@
+QA output created by 1218
+Test 1 - qa_user, non-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6666 -rwSrwSrw- SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+666 -rw-rw-rw- SCRATCH_MNT/a
+
+Test 2 - qa_user, group-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6676 -rwSrwsrw- SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+676 -rw-rwxrw- SCRATCH_MNT/a
+
+Test 3 - qa_user, user-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6766 -rwsrwSrw- SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+766 -rwxrw-rw- SCRATCH_MNT/a
+
+Test 4 - qa_user, all-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6777 -rwsrwsrwx SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+777 -rwxrwxrwx SCRATCH_MNT/a
+
+Test 5 - root, non-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6666 -rwSrwSrw- SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+6666 -rwSrwSrw- SCRATCH_MNT/a
+
+Test 6 - root, group-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6676 -rwSrwsrw- SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+6676 -rwSrwsrw- SCRATCH_MNT/a
+
+Test 7 - root, user-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6766 -rwsrwSrw- SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+6766 -rwsrwSrw- SCRATCH_MNT/a
+
+Test 8 - root, all-exec file
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+6777 -rwsrwsrwx SCRATCH_MNT/a
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+6777 -rwsrwsrwx SCRATCH_MNT/a
+
diff --git a/tests/generic/1219 b/tests/generic/1219
new file mode 100755
index 0000000000..fe20475058
--- /dev/null
+++ b/tests/generic/1219
@@ -0,0 +1,83 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1219
+#
+# Functional test for dropping capability bits as part of an atomic file
+# commit.
+#
+. ./common/preamble
+_begin_fstest auto fiexchange swapext quick
+
+# Override the default cleanup function.
+# _cleanup()
+# {
+# 	cd /
+# 	rm -r -f $tmp.*
+# }
+
+# Import common functions.
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_require_user
+_require_command "$GETCAP_PROG" getcap
+_require_command "$SETCAP_PROG" setcap
+_require_xfs_io_command swapext '-v vfs -a'
+_require_xfs_io_command startupdate
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_congruent_file_oplen $SCRATCH_MNT 1048576
+chmod a+rw $SCRATCH_MNT/
+
+setup_testfile() {
+	rm -f $SCRATCH_MNT/a $SCRATCH_MNT/b
+	_pwrite_byte 0x58 0 1m $SCRATCH_MNT/a >> $seqres.full
+	_pwrite_byte 0x57 0 1m $SCRATCH_MNT/b >> $seqres.full
+	chmod a+rw $SCRATCH_MNT/a $SCRATCH_MNT/b
+	$SETCAP_PROG cap_setgid,cap_setuid+ep $SCRATCH_MNT/a
+	sync
+}
+
+commit_and_check() {
+	local user="$1"
+
+	md5sum $SCRATCH_MNT/a | _filter_scratch
+	stat -c '%a %A %n' $SCRATCH_MNT/a | _filter_scratch
+	_getcap -v $SCRATCH_MNT/a | _filter_scratch
+
+	local cmd="$XFS_IO_PROG -c 'startupdate' -c 'pwrite -S 0x57 0 1m' -c 'commitupdate' $SCRATCH_MNT/a"
+	if [ -n "$user" ]; then
+		su - "$user" -c "$cmd" >> $seqres.full
+	else
+		$SHELL -c "$cmd" >> $seqres.full
+	fi
+
+	_scratch_cycle_mount
+	md5sum $SCRATCH_MNT/a | _filter_scratch
+	stat -c '%a %A %n' $SCRATCH_MNT/a | _filter_scratch
+	_getcap -v $SCRATCH_MNT/a | _filter_scratch
+
+	# Blank line in output
+	echo
+}
+
+# Commit by an unprivileged user clears capability bits.
+echo "Test 1 - qa_user"
+setup_testfile
+commit_and_check "$qa_user"
+
+# Commit by root leaves capability bits.
+echo "Test 2 - root"
+setup_testfile
+commit_and_check
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1219.out b/tests/generic/1219.out
new file mode 100644
index 0000000000..a925b4ec4f
--- /dev/null
+++ b/tests/generic/1219.out
@@ -0,0 +1,17 @@
+QA output created by 1219
+Test 1 - qa_user
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+666 -rw-rw-rw- SCRATCH_MNT/a
+SCRATCH_MNT/a cap_setgid,cap_setuid=ep
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+666 -rw-rw-rw- SCRATCH_MNT/a
+SCRATCH_MNT/a
+
+Test 2 - root
+310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
+666 -rw-rw-rw- SCRATCH_MNT/a
+SCRATCH_MNT/a cap_setgid,cap_setuid=ep
+3784de23efab7a2074c9ec66901e39e5  SCRATCH_MNT/a
+666 -rw-rw-rw- SCRATCH_MNT/a
+SCRATCH_MNT/a
+


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 6/7] fsx: support FIEXCHANGE_RANGE
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 7/7] fsstress: update for FIEXCHANGE_RANGE Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2023-02-28  1:55     ` Zorro Lang
  6 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Upgrade fsx to support exchanging file contents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 configure.ac          |    1 
 include/builddefs.in  |    1 
 ltp/Makefile          |    4 +
 ltp/fsx.c             |  160 ++++++++++++++++++++++++++++++++++++++++++++++++-
 m4/package_libcdev.m4 |   21 ++++++
 src/fiexchange.h      |  101 +++++++++++++++++++++++++++++++
 src/global.h          |    6 ++
 7 files changed, 292 insertions(+), 2 deletions(-)
 create mode 100644 src/fiexchange.h


diff --git a/configure.ac b/configure.ac
index e92bd6b26d..4687d8a3c0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -70,6 +70,7 @@ AC_HAVE_SEEK_DATA
 AC_HAVE_BMV_OF_SHARED
 AC_HAVE_NFTW
 AC_HAVE_RLIMIT_NOFILE
+AC_HAVE_FIEXCHANGE
 
 AC_CHECK_FUNCS([renameat2])
 AC_CHECK_FUNCS([reallocarray])
diff --git a/include/builddefs.in b/include/builddefs.in
index dab10c968f..969acf0da2 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -72,6 +72,7 @@ HAVE_SEEK_DATA = @have_seek_data@
 HAVE_NFTW = @have_nftw@
 HAVE_BMV_OF_SHARED = @have_bmv_of_shared@
 HAVE_RLIMIT_NOFILE = @have_rlimit_nofile@
+HAVE_FIEXCHANGE = @have_fiexchange@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 
diff --git a/ltp/Makefile b/ltp/Makefile
index 85f634145c..c2b70d896e 100644
--- a/ltp/Makefile
+++ b/ltp/Makefile
@@ -36,6 +36,10 @@ ifeq ($(HAVE_COPY_FILE_RANGE),yes)
 LCFLAGS += -DHAVE_COPY_FILE_RANGE
 endif
 
+ifeq ($(HAVE_FIEXCHANGE),yes)
+LCFLAGS += -DHAVE_FIEXCHANGE
+endif
+
 default: depend $(TARGETS)
 
 depend: .dep
diff --git a/ltp/fsx.c b/ltp/fsx.c
index 12c2cc33bf..ee4b8fe45d 100644
--- a/ltp/fsx.c
+++ b/ltp/fsx.c
@@ -111,6 +111,7 @@ enum {
 	OP_CLONE_RANGE,
 	OP_DEDUPE_RANGE,
 	OP_COPY_RANGE,
+	OP_EXCHANGE_RANGE,
 	OP_MAX_FULL,
 
 	/* integrity operations */
@@ -175,6 +176,7 @@ int	check_file = 0;			/* -X flag enables */
 int	clone_range_calls = 1;		/* -J flag disables */
 int	dedupe_range_calls = 1;		/* -B flag disables */
 int	copy_range_calls = 1;		/* -E flag disables */
+int	xchg_range_calls = 1;		/* -0 flag disables */
 int	integrity = 0;			/* -i flag */
 int	fsxgoodfd = 0;
 int	o_direct;			/* -Z */
@@ -268,6 +270,7 @@ static const char *op_names[] = {
 	[OP_DEDUPE_RANGE] = "dedupe_range",
 	[OP_COPY_RANGE] = "copy_range",
 	[OP_FSYNC] = "fsync",
+	[OP_EXCHANGE_RANGE] = "xchg_range",
 };
 
 static const char *op_name(int operation)
@@ -452,6 +455,20 @@ logdump(void)
 			if (overlap)
 				prt("\t******IIII");
 			break;
+		case OP_EXCHANGE_RANGE:
+			prt("XCHG 0x%x thru 0x%x\t(0x%x bytes) to 0x%x thru 0x%x",
+			    lp->args[0], lp->args[0] + lp->args[1] - 1,
+			    lp->args[1],
+			    lp->args[2], lp->args[2] + lp->args[1] - 1);
+			overlap2 = badoff >= lp->args[2] &&
+				  badoff < lp->args[2] + lp->args[1];
+			if (overlap && overlap2)
+				prt("\tXXXX**XXXX");
+			else if (overlap)
+				prt("\tXXXX******");
+			else if (overlap2)
+				prt("\t******XXXX");
+			break;
 		case OP_CLONE_RANGE:
 			prt("CLONE 0x%x thru 0x%x\t(0x%x bytes) to 0x%x thru 0x%x",
 			    lp->args[0], lp->args[0] + lp->args[1] - 1,
@@ -1369,6 +1386,116 @@ do_insert_range(unsigned offset, unsigned length)
 }
 #endif
 
+#ifdef FIEXCHANGE_RANGE
+static __u64 swap_flags = 0;
+
+int
+test_xchg_range(void)
+{
+	struct file_xchg_range	fsr = {
+		.file1_fd = fd,
+		.flags = FILE_XCHG_RANGE_DRY_RUN | swap_flags,
+	};
+	int ret, e;
+
+retry:
+	ret = ioctl(fd, FIEXCHANGE_RANGE, &fsr);
+	e = ret < 0 ? errno : 0;
+	if (e == EOPNOTSUPP && !(swap_flags & FILE_XCHG_RANGE_NONATOMIC)) {
+		/*
+		 * If the call fails with atomic mode, try again with non
+		 * atomic mode.
+		 */
+		swap_flags = FILE_XCHG_RANGE_NONATOMIC;
+		fsr.flags |= swap_flags;
+		goto retry;
+	}
+	if (e == EOPNOTSUPP || e == ENOTTY) {
+		if (!quiet)
+			fprintf(stderr,
+				"main: filesystem does not support "
+				"exchange range, disabling!\n");
+		return 0;
+	}
+
+	return 1;
+}
+
+void
+do_xchg_range(unsigned offset, unsigned length, unsigned dest)
+{
+	struct file_xchg_range	fsr = {
+		.file1_fd = fd,
+		.file1_offset = offset,
+		.file2_offset = dest,
+		.length = length,
+		.flags = swap_flags,
+	};
+	void *p;
+
+	if (length == 0) {
+		if (!quiet && testcalls > simulatedopcount)
+			prt("skipping zero length exchange range\n");
+		log5(OP_EXCHANGE_RANGE, offset, length, dest, FL_SKIPPED);
+		return;
+	}
+
+	if ((loff_t)offset >= file_size || (loff_t)dest >= file_size) {
+		if (!quiet && testcalls > simulatedopcount)
+			prt("skipping exchange range behind EOF\n");
+		log5(OP_EXCHANGE_RANGE, offset, length, dest, FL_SKIPPED);
+		return;
+	}
+
+	p = malloc(length);
+	if (!p) {
+		if (!quiet && testcalls > simulatedopcount)
+			prt("skipping exchange range due to ENOMEM\n");
+		log5(OP_EXCHANGE_RANGE, offset, length, dest, FL_SKIPPED);
+		return;
+	}
+
+	log5(OP_EXCHANGE_RANGE, offset, length, dest, FL_NONE);
+
+	if (testcalls <= simulatedopcount)
+		goto out_free;
+
+	if ((progressinterval && testcalls % progressinterval == 0) ||
+	    (debug && (monitorstart == -1 || monitorend == -1 ||
+		       dest <= monitorstart || dest + length <= monitorend))) {
+		prt("%lu swap\tfrom 0x%x to 0x%x, (0x%x bytes) at 0x%x\n",
+			testcalls, offset, offset+length, length, dest);
+	}
+
+	if (ioctl(fd, FIEXCHANGE_RANGE, &fsr) == -1) {
+		prt("exchange range: 0x%x to 0x%x at 0x%x\n", offset,
+				offset + length, dest);
+		prterr("do_xchg_range: FIEXCHANGE_RANGE");
+		report_failure(161);
+		goto out_free;
+	}
+
+	memcpy(p, good_buf + offset, length);
+	memcpy(good_buf + offset, good_buf + dest, length);
+	memcpy(good_buf + dest, p, length);
+out_free:
+	free(p);
+}
+
+#else
+int
+test_xchg_range(void)
+{
+	return 0;
+}
+
+void
+do_xchg_range(unsigned offset, unsigned length, unsigned dest)
+{
+	return;
+}
+#endif
+
 #ifdef FICLONERANGE
 int
 test_clone_range(void)
@@ -1856,6 +1983,7 @@ static int
 op_args_count(int operation)
 {
 	switch (operation) {
+	case OP_EXCHANGE_RANGE:
 	case OP_CLONE_RANGE:
 	case OP_DEDUPE_RANGE:
 	case OP_COPY_RANGE:
@@ -2053,6 +2181,9 @@ test(void)
 	case OP_COPY_RANGE:
 		generate_dest_range(true, maxfilelen, &offset, &size, &offset2);
 		break;
+	case OP_EXCHANGE_RANGE:
+		generate_dest_range(false, file_size, &offset, &size, &offset2);
+		break;
 	}
 
 have_op:
@@ -2096,6 +2227,12 @@ test(void)
 			goto out;
 		}
 		break;
+	case OP_EXCHANGE_RANGE:
+		if (!xchg_range_calls) {
+			log5(op, offset, size, offset2, FL_SKIPPED);
+			goto out;
+		}
+		break;
 	case OP_CLONE_RANGE:
 		if (!clone_range_calls) {
 			log5(op, offset, size, offset2, FL_SKIPPED);
@@ -2180,6 +2317,18 @@ test(void)
 
 		do_insert_range(offset, size);
 		break;
+	case OP_EXCHANGE_RANGE:
+		if (size == 0) {
+			log5(OP_EXCHANGE_RANGE, offset, size, offset2, FL_SKIPPED);
+			goto out;
+		}
+		if (offset2 + size > maxfilelen) {
+			log5(OP_EXCHANGE_RANGE, offset, size, offset2, FL_SKIPPED);
+			goto out;
+		}
+
+		do_xchg_range(offset, size, offset2);
+		break;
 	case OP_CLONE_RANGE:
 		if (size == 0) {
 			log5(OP_CLONE_RANGE, offset, size, offset2, FL_SKIPPED);
@@ -2294,6 +2443,9 @@ usage(void)
 #ifdef HAVE_COPY_FILE_RANGE
 "	-E: Do not use copy range calls\n"
 #endif
+#ifdef FIEXCHANGE_RANGE
+"	-0: Do not use exchange range calls\n"
+#endif
 "	-L: fsxLite - no file creations & no file size changes\n\
 	-N numops: total # operations to do (default infinity)\n\
 	-O: use oplen (see -o flag) for every op (default random)\n\
@@ -2608,12 +2760,11 @@ main(int argc, char **argv)
 	page_size = getpagesize();
 	page_mask = page_size - 1;
 	mmap_mask = page_mask;
-	
 
 	setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
 
 	while ((ch = getopt_long(argc, argv,
-				 "b:c:dfg:i:j:kl:m:no:p:qr:s:t:w:xyABD:EFJKHzCILN:OP:RS:UWXZ",
+				 "0b:c:dfg:i:j:kl:m:no:p:qr:s:t:w:xyABD:EFJKHzCILN:OP:RS:UWXZ",
 				 longopts, NULL)) != EOF)
 		switch (ch) {
 		case 'b':
@@ -2747,6 +2898,9 @@ main(int argc, char **argv)
 		case 'I':
 			insert_range_calls = 0;
 			break;
+		case '0':
+			xchg_range_calls = 0;
+			break;
 		case 'J':
 			clone_range_calls = 0;
 			break;
@@ -2988,6 +3142,8 @@ main(int argc, char **argv)
 		dedupe_range_calls = test_dedupe_range();
 	if (copy_range_calls)
 		copy_range_calls = test_copy_range();
+	if (xchg_range_calls)
+		xchg_range_calls = test_xchg_range();
 
 	while (numops == -1 || numops--)
 		if (!test())
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index e1b381c16f..db663970c2 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -157,3 +157,24 @@ AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
        AC_MSG_RESULT(no))
     AC_SUBST(have_rlimit_nofile)
   ])
+
+#
+# Check if we have a FIEXCHANGE_RANGE ioctl (Linux)
+#
+AC_DEFUN([AC_HAVE_FIEXCHANGE],
+  [ AC_MSG_CHECKING([for FIEXCHANGE_RANGE])
+    AC_TRY_LINK([
+#define _GNU_SOURCE
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <linux/fs.h>
+#include <linux/fiexchange.h>
+    ], [
+         struct file_xchg_range fxr;
+         ioctl(-1, FIEXCHANGE_RANGE, &fxr);
+    ], have_fiexchange=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_fiexchange)
+  ])
diff --git a/src/fiexchange.h b/src/fiexchange.h
new file mode 100644
index 0000000000..29b3ac0ff5
--- /dev/null
+++ b/src/fiexchange.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */
+/*
+ * FIEXCHANGE ioctl definitions, to facilitate exchanging parts of files.
+ *
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _LINUX_FIEXCHANGE_H
+#define _LINUX_FIEXCHANGE_H
+
+#include <linux/types.h>
+
+/*
+ * Exchange part of file1 with part of the file that this ioctl that is being
+ * called against (which we'll call file2).  Filesystems must be able to
+ * restart and complete the operation even after the system goes down.
+ */
+struct file_xchg_range {
+	__s64		file1_fd;
+	__s64		file1_offset;	/* file1 offset, bytes */
+	__s64		file2_offset;	/* file2 offset, bytes */
+	__s64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see FILE_XCHG_RANGE_* below */
+
+	/* file2 metadata for optional freshness checks */
+	__s64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+
+	__u64		pad[6];		/* must be zeroes */
+};
+
+/*
+ * Atomic exchange operations are not required.  This relaxes the requirement
+ * that the filesystem must be able to complete the operation after a crash.
+ */
+#define FILE_XCHG_RANGE_NONATOMIC	(1 << 0)
+
+/*
+ * Check file2's inode number, mtime, and ctime against the values
+ * provided, and return -EBUSY if there isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FILE2_FRESH	(1 << 1)
+
+/*
+ * Check that file1's length is equal to file1_offset + length, and that
+ * file2's length is equal to file2_offset + length.  Returns -EDOM if there
+ * isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FULL_FILES	(1 << 2)
+
+/*
+ * Exchange file data all the way to the ends of both files, and then exchange
+ * the file sizes.  This flag can be used to replace a file's contents with a
+ * different amount of data.  length will be ignored.
+ */
+#define FILE_XCHG_RANGE_TO_EOF		(1 << 3)
+
+/* Flush all changes in file data and file metadata to disk before returning. */
+#define FILE_XCHG_RANGE_FSYNC		(1 << 4)
+
+/* Dry run; do all the parameter verification but do not change anything. */
+#define FILE_XCHG_RANGE_DRY_RUN		(1 << 5)
+
+/*
+ * Do not exchange any part of the range where file1's mapping is a hole.  This
+ * can be used to emulate scatter-gather atomic writes with a temp file.
+ */
+#define FILE_XCHG_RANGE_SKIP_FILE1_HOLES (1 << 6)
+
+/*
+ * Commit the contents of file1 into file2 if file2 has the same inode number,
+ * mtime, and ctime as the arguments provided to the call.  The old contents of
+ * file2 will be moved to file1.
+ *
+ * With this flag, all committed information can be retrieved even if the
+ * system crashes or is rebooted.  This includes writing through or flushing a
+ * disk cache if present.  The call blocks until the device reports that the
+ * commit is complete.
+ *
+ * This flag should not be combined with NONATOMIC.  It can be combined with
+ * SKIP_FILE1_HOLES.
+ */
+#define FILE_XCHG_RANGE_COMMIT		(FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FSYNC)
+
+#define FILE_XCHG_RANGE_ALL_FLAGS	(FILE_XCHG_RANGE_NONATOMIC | \
+					 FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FULL_FILES | \
+					 FILE_XCHG_RANGE_TO_EOF | \
+					 FILE_XCHG_RANGE_FSYNC | \
+					 FILE_XCHG_RANGE_DRY_RUN | \
+					 FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+
+#define FIEXCHANGE_RANGE	_IOWR('X', 129, struct file_xchg_range)
+
+#endif /* _LINUX_FIEXCHANGE_H */
diff --git a/src/global.h b/src/global.h
index b44070993c..49570ef117 100644
--- a/src/global.h
+++ b/src/global.h
@@ -171,6 +171,12 @@
 #include <sys/mman.h>
 #endif
 
+#ifdef HAVE_FIEXCHANGE
+# include <linux/fiexchange.h>
+#else
+# include "fiexchange.h"
+#endif
+
 static inline unsigned long long
 rounddown_64(unsigned long long x, unsigned int y)
 {


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 7/7] fsstress: update for FIEXCHANGE_RANGE
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 2/7] generic: test old xfs extent swapping ioctl Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 6/7] fsx: support FIEXCHANGE_RANGE Darrick J. Wong
  6 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Teach this stress tool to use the file content exchange ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 ltp/fsstress.c |  168 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 168 insertions(+)


diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index 10608fb554..0fba3d92a0 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -143,6 +143,7 @@ typedef enum {
 	OP_URING_WRITE,
 	OP_WRITE,
 	OP_WRITEV,
+	OP_XCHGRANGE,
 	OP_LAST
 } opty_t;
 
@@ -272,6 +273,8 @@ void	uring_read_f(opnum_t, long);
 void	uring_write_f(opnum_t, long);
 void	write_f(opnum_t, long);
 void	writev_f(opnum_t, long);
+void	xchgrange_f(opnum_t, long);
+
 char	*xattr_flag_to_string(int);
 
 struct opdesc	ops[OP_LAST]	= {
@@ -340,6 +343,7 @@ struct opdesc	ops[OP_LAST]	= {
 	[OP_URING_WRITE]   = {"uring_write",   uring_write_f,	1, 1 },
 	[OP_WRITE]	   = {"write",	       write_f,		4, 1 },
 	[OP_WRITEV]	   = {"writev",	       writev_f,	4, 1 },
+	[OP_XCHGRANGE]	   = {"xchgrange",     xchgrange_f,	4, 1 },
 }, *ops_end;
 
 flist_t	flist[FT_nft] = {
@@ -2494,6 +2498,170 @@ chown_f(opnum_t opno, long r)
 	free_pathname(&f);
 }
 
+/* exchange an arbitrary range of f1 with an arbitrary range of f2. */
+void
+xchgrange_f(
+	opnum_t			opno,
+	long			r)
+{
+#ifdef FIEXCHANGE_RANGE
+	struct file_xchg_range	fxr = { 0 };
+	static __u64		swap_flags = 0;
+	struct pathname		fpath1;
+	struct pathname		fpath2;
+	struct stat64		stat1;
+	struct stat64		stat2;
+	char			inoinfo1[1024];
+	char			inoinfo2[1024];
+	off64_t			lr;
+	off64_t			off1;
+	off64_t			off2;
+	off64_t			max_off2;
+	size_t			len;
+	int			v1;
+	int			v2;
+	int			fd1;
+	int			fd2;
+	int			ret;
+	int			tries = 0;
+	int			e;
+
+	/* Load paths */
+	init_pathname(&fpath1);
+	if (!get_fname(FT_REGm, r, &fpath1, NULL, NULL, &v1)) {
+		if (v1)
+			printf("%d/%lld: xchgrange read - no filename\n",
+				procid, opno);
+		goto out_fpath1;
+	}
+
+	init_pathname(&fpath2);
+	if (!get_fname(FT_REGm, random(), &fpath2, NULL, NULL, &v2)) {
+		if (v2)
+			printf("%d/%lld: xchgrange write - no filename\n",
+				procid, opno);
+		goto out_fpath2;
+	}
+
+	/* Open files */
+	fd1 = open_path(&fpath1, O_RDONLY);
+	e = fd1 < 0 ? errno : 0;
+	check_cwd();
+	if (fd1 < 0) {
+		if (v1)
+			printf("%d/%lld: xchgrange read - open %s failed %d\n",
+				procid, opno, fpath1.path, e);
+		goto out_fpath2;
+	}
+
+	fd2 = open_path(&fpath2, O_WRONLY);
+	e = fd2 < 0 ? errno : 0;
+	check_cwd();
+	if (fd2 < 0) {
+		if (v2)
+			printf("%d/%lld: xchgrange write - open %s failed %d\n",
+				procid, opno, fpath2.path, e);
+		goto out_fd1;
+	}
+
+	/* Get file stats */
+	if (fstat64(fd1, &stat1) < 0) {
+		if (v1)
+			printf("%d/%lld: xchgrange read - fstat64 %s failed %d\n",
+				procid, opno, fpath1.path, errno);
+		goto out_fd2;
+	}
+	inode_info(inoinfo1, sizeof(inoinfo1), &stat1, v1);
+
+	if (fstat64(fd2, &stat2) < 0) {
+		if (v2)
+			printf("%d/%lld: xchgrange write - fstat64 %s failed %d\n",
+				procid, opno, fpath2.path, errno);
+		goto out_fd2;
+	}
+	inode_info(inoinfo2, sizeof(inoinfo2), &stat2, v2);
+
+	if (stat1.st_size < (stat1.st_blksize * 2) ||
+	    stat2.st_size < (stat2.st_blksize * 2)) {
+		if (v2)
+			printf("%d/%lld: xchgrange - files are too small\n",
+				procid, opno);
+		goto out_fd2;
+	}
+
+	/* Never let us swap more than 1/4 of the files. */
+	len = (random() % FILELEN_MAX) + 1;
+	if (len > stat1.st_size / 4)
+		len = stat1.st_size / 4;
+	if (len > stat2.st_size / 4)
+		len = stat2.st_size / 4;
+	len = rounddown_64(len, stat1.st_blksize);
+	if (len == 0)
+		len = stat1.st_blksize;
+
+	/* Calculate offsets */
+	lr = ((int64_t)random() << 32) + random();
+	if (stat1.st_size == len)
+		off1 = 0;
+	else
+		off1 = (off64_t)(lr % MIN(stat1.st_size - len, MAXFSIZE));
+	off1 %= maxfsize;
+	off1 = rounddown_64(off1, stat1.st_blksize);
+
+	/*
+	 * If srcfile == destfile, randomly generate destination ranges
+	 * until we find one that doesn't overlap the source range.
+	 */
+	max_off2 = MIN(stat2.st_size  - len, MAXFSIZE);
+	do {
+		lr = ((int64_t)random() << 32) + random();
+		if (stat2.st_size == len)
+			off2 = 0;
+		else
+			off2 = (off64_t)(lr % max_off2);
+		off2 %= maxfsize;
+		off2 = rounddown_64(off2, stat2.st_blksize);
+	} while (stat1.st_ino == stat2.st_ino &&
+		 llabs(off2 - off1) < len &&
+		 tries++ < 10);
+
+	/* Swap data blocks */
+	fxr.file1_fd = fd1;
+	fxr.file1_offset = off1;
+	fxr.length = len;
+	fxr.file2_offset = off2;
+	fxr.flags = swap_flags;
+
+retry:
+	ret = ioctl(fd2, FIEXCHANGE_RANGE, &fxr);
+	e = ret < 0 ? errno : 0;
+	if (e == EOPNOTSUPP && !(swap_flags & FILE_XCHG_RANGE_NONATOMIC)) {
+		swap_flags = FILE_XCHG_RANGE_NONATOMIC;
+		fxr.flags |= swap_flags;
+		goto retry;
+	}
+	if (v1 || v2) {
+		printf("%d/%lld: xchgrange %s%s [%lld,%lld] -> %s%s [%lld,%lld]",
+			procid, opno,
+			fpath1.path, inoinfo1, (long long)off1, (long long)len,
+			fpath2.path, inoinfo2, (long long)off2, (long long)len);
+
+		if (ret < 0)
+			printf(" error %d", e);
+		printf("\n");
+	}
+
+out_fd2:
+	close(fd2);
+out_fd1:
+	close(fd1);
+out_fpath2:
+	free_pathname(&fpath2);
+out_fpath1:
+	free_pathname(&fpath1);
+#endif
+}
+
 /* reflink some arbitrary range of f1 to f2. */
 void
 clonerange_f(


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: online repair of realtime summaries
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (73 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online repair of realtime summary files Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of extended attributes Darrick J. Wong
                   ` (2 subsequent siblings)
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

We now have all the infrastructure we need to repair file metadata.
We'll begin with the realtime summary file, because it is the least
complex data structure.  To support this we need to add three more
pieces to the temporary file code from the previous patchset --
preallocating space in the temp file, formatting metadata into that
space and writing the blocks to disk, and swapping the fork mappings
atomically.

After that, the actual reconstruction of the realtime summary
information is pretty simple, since we can simply write the incore
copy computed by the rtsummary scrubber to the temporary file, swap the
contents, and reap the old blocks.
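
From userspace, exercising the new code looks like any other online
repair; roughly (a sketch only -- the mount point is a placeholder, and
it assumes a kernel built with this patchset and online repair enabled):

  # check the realtime summary, then ask the kernel to rebuild it
  xfs_io -c 'scrub rtsummary' /mnt
  xfs_io -x -c 'repair rtsummary' /mnt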

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-rtsummary

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-rtsummary
---
 tests/xfs/813     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/813.out |    2 ++
 2 files changed, 50 insertions(+)
 create mode 100755 tests/xfs/813
 create mode 100644 tests/xfs/813.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: race fsstress with online repair of realtime summary files
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of realtime summaries Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create a test that races online repair of the realtime summary file
against fsstress running in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/813     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/813.out |    2 ++
 2 files changed, 50 insertions(+)
 create mode 100755 tests/xfs/813
 create mode 100644 tests/xfs/813.out


diff --git a/tests/xfs/813 b/tests/xfs/813
new file mode 100755
index 0000000000..5efe923c75
--- /dev/null
+++ b/tests/xfs/813
@@ -0,0 +1,48 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 813
+#
+# Race fsstress and realtime summary repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+_xfs_force_bdev realtime $SCRATCH_MNT
+
+# XXX the realtime summary scrubber isn't currently implemented upstream.
+# Don't bother trying to repair it on kernels where the scrub is incomplete.
+$XFS_IO_PROG -c 'scrub rtsummary' -c 'scrub rtsummary' "$SCRATCH_MNT" 2>&1 | \
+	grep -q 'Scan was not complete' && \
+	_notrun "rtsummary scrub is incomplete"
+
+_scratch_xfs_stress_online_repair -s "repair rtsummary"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/813.out b/tests/xfs/813.out
new file mode 100644
index 0000000000..f0c2a12bea
--- /dev/null
+++ b/tests/xfs/813.out
@@ -0,0 +1,2 @@
+QA output created by 813
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: online repair of extended attributes
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (74 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of realtime summaries Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online repair of extended attribute data Darrick J. Wong
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of directories Darrick J. Wong
  2022-12-30 22:20 ` [PATCHSET v24.0 0/1] fstests: test automatic scrub optimization by default Darrick J. Wong
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This series employs atomic extent swapping to enable safe reconstruction
of extended attribute data attached to a file.  Because xattrs do not
have any redundant information to draw off of, we can at best salvage
as much data as we can and build a new structure.

Rebuilding an extended attribute structure consists of these three
steps:

First, we walk the existing attributes to salvage as many of them as we
can, by adding them as new attributes attached to the repair tempfile.
We need to add a new xfile-based data structure to hold blobs of
arbitrary length to stage the xattr names and values.

Second, we write the salvaged attributes to a temporary file, and use
atomic extent swaps to exchange the entire attribute fork between the
two files.

Finally, we reap the old xattr blocks (which are now in the temporary
file) as carefully as we can.
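
As with the other repairs, this can be driven from userspace with
xfs_io; roughly (a sketch only -- the path is a placeholder, and the
force_repair error injection knob is only needed to force a rebuild of
attr structures that aren't actually corrupt, as the tests below do):

  xfs_io -x -c 'inject force_repair' -c 'repair xattr' /mnt/somefile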

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-xattrs

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-xattrs
---
 tests/xfs/814     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/814.out |    2 ++
 2 files changed, 42 insertions(+)
 create mode 100755 tests/xfs/814
 create mode 100644 tests/xfs/814.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: race fsstress with online repair of extended attribute data
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of extended attributes Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create a test that races online repair of extended attribute data
against fsstress running in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/814     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/814.out |    2 ++
 2 files changed, 42 insertions(+)
 create mode 100755 tests/xfs/814
 create mode 100644 tests/xfs/814.out


diff --git a/tests/xfs/814 b/tests/xfs/814
new file mode 100755
index 0000000000..96abb13691
--- /dev/null
+++ b/tests/xfs/814
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 814
+#
+# Race fsstress and extended attribute repair for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/attr
+
+# real QA test starts here
+_supported_fs xfs
+_require_attrs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -x 'xattr' -s "repair xattr" -t "%attrfile%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/814.out b/tests/xfs/814.out
new file mode 100644
index 0000000000..95532d3da7
--- /dev/null
+++ b/tests/xfs/814.out
@@ -0,0 +1,2 @@
+QA output created by 814
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/2] fstests: online repair of directories
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (75 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of extended attributes Darrick J. Wong
@ 2022-12-30 22:19 ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] xfs: ensure that online directory repairs don't hit EDQUOT Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] xfs: race fsstress with online repair of dirs and parent pointers Darrick J. Wong
  2022-12-30 22:20 ` [PATCHSET v24.0 0/1] fstests: test automatic scrub optimization by default Darrick J. Wong
  77 siblings, 2 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This series employs atomic extent swapping to enable safe reconstruction
of directory data.  For now, XFS does not support reverse directory
links (aka parent pointers), so we can only salvage the dirents of a
directory and construct a new structure.

Directory repair therefore consists of five main parts:

First, we walk the existing directory to salvage as many entries as we
can, by adding them as new directory entries to the repair temp dir.

Second, we validate the parent pointer found in the directory.  If one
was not found, we scan the entire filesystem looking for a potential
parent.

Third, we use atomic extent swaps to exchange the entire data fork
between the two directories.

Fourth, we reap the old directory blocks as carefully as we can.

Fifth, to wrap up the directory repair code, we need to add to the regular
filesystem the ability to free all the data fork blocks in a directory.
This does not change anything with normal directories, since they must
still unlink and shrink one entry at a time.  However, this will
facilitate freeing of partially-inactivated temporary directories during
log recovery.

The second half of this patchset implements repairs for the dotdot
entries of directories.  For now there is only rudimentary support for
this, because there are no directory parent pointers, so the best we can
do is scanning the filesystem and the VFS dcache for answers.
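
Driving this from userspace looks roughly like the following (a sketch
only -- the path is a placeholder, force_repair is the injection knob
the tests use to rebuild uncorrupted metadata, and the exact name of the
dotdot repair command here is my assumption):

  xfs_io -x -c 'inject force_repair' -c 'repair directory' /mnt/somedir
  # and for the dotdot entry (command name assumed):
  xfs_io -x -c 'inject force_repair' -c 'repair parent' /mnt/somedir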

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-dirs

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs
---
 tests/xfs/815     |   37 +++++++++++++++++++++++++
 tests/xfs/815.out |    2 +
 tests/xfs/816     |   38 ++++++++++++++++++++++++++
 tests/xfs/816.out |    2 +
 tests/xfs/841     |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/841.out |    3 ++
 6 files changed, 159 insertions(+)
 create mode 100755 tests/xfs/815
 create mode 100644 tests/xfs/815.out
 create mode 100755 tests/xfs/816
 create mode 100644 tests/xfs/816.out
 create mode 100755 tests/xfs/841
 create mode 100644 tests/xfs/841.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/2] xfs: ensure that online directory repairs don't hit EDQUOT
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of directories Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] xfs: race fsstress with online repair of dirs and parent pointers Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add a test to ensure that the sysadmin doesn't get EDQUOT if they try to
repair directory metadata when we've already exceeded a quota limit
somewhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/841     |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/841.out |    3 ++
 2 files changed, 80 insertions(+)
 create mode 100755 tests/xfs/841
 create mode 100644 tests/xfs/841.out


diff --git a/tests/xfs/841 b/tests/xfs/841
new file mode 100755
index 0000000000..f743454971
--- /dev/null
+++ b/tests/xfs/841
@@ -0,0 +1,77 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test 841
+#
+# Ensure that the sysadmin won't hit EDQUOT while repairing directory metadata
+# even if the owning user's quota limits have been exceeded.  This tests the quota
+# reservation handling inside the swapext code used by repair.
+#
+. ./common/preamble
+_begin_fstest online_repair
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/quota
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs xfs
+_require_quota
+_require_user
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_qmount_option usrquota
+_qmount
+
+blocksize=$(_get_block_size $SCRATCH_MNT)
+alloc_unit=$(_xfs_get_dir_blocksize $SCRATCH_MNT)
+
+# Make sure we can actually repair a directory
+scratchdir=$SCRATCH_MNT/dir
+scratchfile=$SCRATCH_MNT/file
+mkdir $scratchdir
+touch $scratchfile
+$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
+__stress_scrub_check_commands "$scratchdir" "" 'repair directory'
+
+# Create a 2-dirblock directory
+total_size=$((alloc_unit * 2))
+dirents=$((total_size / 255))
+
+for ((i = 0; i < dirents; i++)); do
+	name=$(printf "x%0254d" $i)
+	ln $scratchfile $scratchdir/$name
+done
+
+# Set a low quota hardlimit for an unprivileged uid and chown the files to it
+echo "set up quota" >> $seqres.full
+$XFS_QUOTA_PROG -x -c "limit -u bhard=$alloc_unit $qa_user" $SCRATCH_MNT
+chown $qa_user $scratchdir $scratchfile
+$XFS_QUOTA_PROG -x -c 'report -u' $SCRATCH_MNT >> $seqres.full
+
+# Rebuild the directory
+echo "repairs" >> $seqres.full
+$XFS_IO_PROG -x -c 'inject force_repair' -c 'repair directory' $scratchdir
+$XFS_QUOTA_PROG -x -c 'report -u' $SCRATCH_MNT >> $seqres.full
+
+# Appending to the directory as qa_user should now fail with EDQUOT
+echo "fail quota" >> $seqres.full
+for ((i = 0; i < dirents; i++)); do
+	name=$(printf "y%0254d" $i)
+	su - "$qa_user" -c "ln $scratchfile $scratchdir/$name" 2>&1 | \
+		_filter_scratch | sed -e 's/y[0-9]*/yXXX/g'
+	test "${PIPESTATUS[0]}" -ne 0 && break
+done
+$XFS_QUOTA_PROG -x -c 'report -u' $SCRATCH_MNT >> $seqres.full
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/841.out b/tests/xfs/841.out
new file mode 100644
index 0000000000..e8e169111a
--- /dev/null
+++ b/tests/xfs/841.out
@@ -0,0 +1,3 @@
+QA output created by 841
+ln: failed to create hard link 'SCRATCH_MNT/dir/yXXX': Disk quota exceeded
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCH 2/2] xfs: race fsstress with online repair of dirs and parent pointers
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of directories Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] xfs: ensure that online directory repairs don't hit EDQUOT Darrick J. Wong
@ 2022-12-30 22:19   ` Darrick J. Wong
  1 sibling, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:19 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create tests that run directory repair and directory parent pointer
repair while fsstress runs in the background.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/815     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/815.out |    2 ++
 tests/xfs/816     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/816.out |    2 ++
 4 files changed, 79 insertions(+)
 create mode 100755 tests/xfs/815
 create mode 100644 tests/xfs/815.out
 create mode 100755 tests/xfs/816
 create mode 100644 tests/xfs/816.out


diff --git a/tests/xfs/815 b/tests/xfs/815
new file mode 100755
index 0000000000..745afec792
--- /dev/null
+++ b/tests/xfs/815
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 815
+#
+# Race fsstress and directory repair for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -x 'dir' -s "repair directory" -t "%dir%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/815.out b/tests/xfs/815.out
new file mode 100644
index 0000000000..6ea462f3f7
--- /dev/null
+++ b/tests/xfs/815.out
@@ -0,0 +1,2 @@
+QA output created by 815
+Silence is golden
diff --git a/tests/xfs/816 b/tests/xfs/816
new file mode 100755
index 0000000000..25a79005f8
--- /dev/null
+++ b/tests/xfs/816
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 816
+#
+# Race fsstress and parent pointers repair for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -s "repair parent" -t "%dir%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/816.out b/tests/xfs/816.out
new file mode 100644
index 0000000000..a9d8f943c8
--- /dev/null
+++ b/tests/xfs/816.out
@@ -0,0 +1,2 @@
+QA output created by 816
+Silence is golden


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* [PATCHSET v24.0 0/1] fstests: test automatic scrub optimization by default
  2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
                   ` (76 preceding siblings ...)
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of directories Darrick J. Wong
@ 2022-12-30 22:20 ` Darrick J. Wong
  2022-12-30 22:20   ` [PATCH 1/1] xfs: test xfs_scrub dry run, preen, and repair mode Darrick J. Wong
  77 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:20 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This final patchset in the online fsck series enables the background
service to optimize filesystems by default.  This is the first step
towards enabling repairs by default.
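
For reference, here's what the three modes look like from the command
line (a sketch only; -p is the preen switch that the test below probes
for before running, and the mount point is just an example):

  xfs_scrub -n /mnt    # dry run: report problems, change nothing
  xfs_scrub -p /mnt    # preen: apply optimizations, leave corruptions alone
  xfs_scrub /mnt       # repair mode: fix corruptions and apply optimizations

The test fuzzes a couple of secondary superblocks and then checks that
each mode only touches what it is allowed to.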

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-optimize-by-default

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=scrub-optimize-by-default
---
 tests/xfs/850     |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/850.out |   32 ++++++++++++++++
 2 files changed, 137 insertions(+)
 create mode 100755 tests/xfs/850
 create mode 100644 tests/xfs/850.out


^ permalink raw reply	[flat|nested] 469+ messages in thread

* [PATCH 1/1] xfs: test xfs_scrub dry run, preen, and repair mode
  2022-12-30 22:20 ` [PATCHSET v24.0 0/1] fstests: test automatic scrub optimization by default Darrick J. Wong
@ 2022-12-30 22:20   ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:20 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For each of the three operational modes of xfs_scrub, make sure that we
/only/ repair that which we're supposed to.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/850     |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/850.out |   32 ++++++++++++++++
 2 files changed, 137 insertions(+)
 create mode 100755 tests/xfs/850
 create mode 100644 tests/xfs/850.out


diff --git a/tests/xfs/850 b/tests/xfs/850
new file mode 100755
index 0000000000..bb46915c89
--- /dev/null
+++ b/tests/xfs/850
@@ -0,0 +1,105 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test 850
+#
+# Make sure that xfs_scrub dry run, preen, and repair modes only modify the
+# things that they're allowed to touch.
+#
+. ./common/preamble
+_begin_fstest auto quick online_repair
+
+# Import common functions.
+. ./common/fuzzy
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs xfs
+_require_scratch_nocheck
+_require_scrub
+_require_xfs_db_command "fuzz"
+_require_xfs_io_command "repair"
+
+# Make sure the kernel actually supports online repair
+output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
+test -z "$output" && _notrun 'kernel does not support repair'
+
+# Make sure xfs_scrub is new enough to support -p(reen)
+$XFS_SCRUB_PROG -p 2>&1 | grep -q 'invalid option' && \
+	_notrun 'scrub does not support -p'
+
+_scratch_mkfs | _filter_mkfs 2>$tmp.mkfs >/dev/null
+. $tmp.mkfs
+
+test $agcount -ge 3 || _notrun 'filesystem must have at least 3 AGs'
+
+AWK_PROG='
+{
+	if ($1 ~ /Optimized:/)
+		optimized++;
+	else if ($1 ~ /Repaired:/)
+		repaired++;
+	else if ($1 ~ /Corruption:/)
+		corruption++;
+}
+END {
+	printf("corruption: %u optimized: %u repaired: %u\n",
+			corruption, optimized, repaired);
+}
+'
+
+test_scrub() {
+	local mode="$1"
+	local scrub_arg="$2"
+	local db_args=(-x)
+
+	# Fuzz secondary superblocks because this won't cause mount failures
+	if [[ $mode =~ c ]]; then
+		db_args+=(-c 'sb 1' -c 'fuzz -d dblocks add')
+	fi
+	if [[ $mode =~ o ]]; then
+		db_args+=(-c 'sb 2' -c 'fuzz -d fname random')
+	fi
+
+	echo "testing mode? $mode scrub_arg $scrub_arg"
+	echo "db_args:${db_args[@]}:scrub_arg:$scrub_arg:$mode:" >> $seqres.full
+	echo "----------------" >> $seqres.full
+
+	_scratch_mkfs >> $seqres.full
+
+	# Make sure there's nothing left to optimize, at least according to
+	# xfs_scrub.  This clears the way for us to make targeted changes to
+	# the filesystem.
+	_scratch_mount
+	_scratch_scrub $scrub_arg >> /dev/null
+	_scratch_unmount
+
+	# Modify the filesystem as needed to trip up xfs_scrub
+	_scratch_xfs_db "${db_args[@]}" >> $seqres.full
+
+	# See how many optimizations, repairs, and corruptions it'll report
+	_scratch_mount
+	_scratch_scrub $scrub_arg 2>&1 | awk "$AWK_PROG"
+	test "${PIPESTATUS[0]}" -eq 0 || echo "xfs_scrub returned ${PIPESTATUS[0]}?"
+	echo
+	_scratch_unmount
+}
+
+test_scrub 'o' -n	# dry run with possible optimizations
+test_scrub 'o' -p	# preen
+test_scrub 'o' 		# repair
+
+test_scrub 'co' -n	# dry run with corruptions and optimizations
+test_scrub 'co' -p	# preen
+test_scrub 'co' 	# repair
+
+test_scrub 'c' -n	# dry run with corruptions
+test_scrub 'c' -p	# preen
+test_scrub 'c' 		# repair
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/850.out b/tests/xfs/850.out
new file mode 100644
index 0000000000..6e43fd919d
--- /dev/null
+++ b/tests/xfs/850.out
@@ -0,0 +1,32 @@
+QA output created by 850
+testing mode? o scrub_arg -n
+corruption: 0 optimized: 0 repaired: 0
+
+testing mode? o scrub_arg -p
+corruption: 0 optimized: 1 repaired: 0
+
+testing mode? o scrub_arg 
+corruption: 0 optimized: 1 repaired: 0
+
+testing mode? co scrub_arg -n
+corruption: 1 optimized: 0 repaired: 0
+xfs_scrub returned 1?
+
+testing mode? co scrub_arg -p
+corruption: 1 optimized: 0 repaired: 0
+xfs_scrub returned 1?
+
+testing mode? co scrub_arg 
+corruption: 0 optimized: 1 repaired: 1
+
+testing mode? c scrub_arg -n
+corruption: 1 optimized: 0 repaired: 0
+xfs_scrub returned 1?
+
+testing mode? c scrub_arg -p
+corruption: 1 optimized: 0 repaired: 0
+xfs_scrub returned 1?
+
+testing mode? c scrub_arg 
+corruption: 0 optimized: 0 repaired: 1
+


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* Re: [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests
  2022-12-30 22:19   ` [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests Darrick J. Wong
@ 2023-02-05 12:57     ` Zorro Lang
  2023-02-07 16:57       ` Darrick J. Wong
  2023-02-07 17:01     ` [PATCH v24.1 " Darrick J. Wong
  1 sibling, 1 reply; 469+ messages in thread
From: Zorro Lang @ 2023-02-05 12:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:06PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a new find(1) like utility that doesn't crash on directory tree
> changes (like find does due to bugs in its loop detector) and actually
> implements the custom xfs attribute predicates that we need for scrub
> stress tests.  This program will be needed for a future patch where we
> add stress tests for scrub and repair of file metadata.
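
(For anyone following along, a hypothetical invocation, assuming the
helper ends up built as src/xfsfind in the fstests tree, could look like:

  # print files that have xattrs or shared (reflinked) blocks, and keep
  # quiet about paths that vanish while the tree is being walked
  ./src/xfsfind -q -a -s $SCRATCH_MNT

The switches match the help text further down in this patch.)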
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  configure.ac          |    5 +
>  include/builddefs.in  |    4 +
>  m4/package_libcdev.m4 |   47 ++++++++
>  m4/package_xfslibs.m4 |   16 +++
>  src/Makefile          |   10 ++
>  src/xfsfind.c         |  290 +++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 372 insertions(+)
>  create mode 100644 src/xfsfind.c
> 
> 
> diff --git a/configure.ac b/configure.ac
> index cbf8377988..e92bd6b26d 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -66,6 +66,11 @@ AC_PACKAGE_WANT_LINUX_FS_H
>  AC_PACKAGE_WANT_LIBBTRFSUTIL
>  
>  AC_HAVE_COPY_FILE_RANGE
> +AC_HAVE_SEEK_DATA
> +AC_HAVE_BMV_OF_SHARED
> +AC_HAVE_NFTW
> +AC_HAVE_RLIMIT_NOFILE
> +
>  AC_CHECK_FUNCS([renameat2])
>  AC_CHECK_FUNCS([reallocarray])
>  AC_CHECK_TYPES([struct mount_attr], [], [], [[#include <linux/mount.h>]])
> diff --git a/include/builddefs.in b/include/builddefs.in
> index 6641209f81..dab10c968f 100644
> --- a/include/builddefs.in
> +++ b/include/builddefs.in
> @@ -68,6 +68,10 @@ HAVE_FIEMAP = @have_fiemap@
>  HAVE_FALLOCATE = @have_fallocate@
>  HAVE_COPY_FILE_RANGE = @have_copy_file_range@
>  HAVE_LIBBTRFSUTIL = @have_libbtrfsutil@
> +HAVE_SEEK_DATA = @have_seek_data@
> +HAVE_NFTW = @have_nftw@
> +HAVE_BMV_OF_SHARED = @have_bmv_of_shared@
> +HAVE_RLIMIT_NOFILE = @have_rlimit_nofile@
>  
>  GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
>  
> diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
> index 5c76c0f73e..e1b381c16f 100644
> --- a/m4/package_libcdev.m4
> +++ b/m4/package_libcdev.m4
> @@ -110,3 +110,50 @@ AC_DEFUN([AC_HAVE_COPY_FILE_RANGE],
>      AC_SUBST(have_copy_file_range)
>    ])
>  
> +# Check if we have SEEK_DATA
> +AC_DEFUN([AC_HAVE_SEEK_DATA],
> +  [ AC_MSG_CHECKING([for SEEK_DATA])
> +    AC_TRY_LINK([

AC_TRY_LINK has been obsoleted by autoconf; refer to:
https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Obsolete-Macros.html

So, as the above link suggests, we'd better replace:
          Macro: AC_TRY_LINK (includes, function-body, [action-if-true], [action-if-false])
with:
          AC_LINK_IFELSE(
            [AC_LANG_PROGRAM([[includes]],
               [[function-body]])],
            [action-if-true],
            [action-if-false])

For example (I hope it's right):

diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index e1b381c1..7f1767a4 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -113,13 +113,13 @@ AC_DEFUN([AC_HAVE_COPY_FILE_RANGE],
 # Check if we have SEEK_DATA
 AC_DEFUN([AC_HAVE_SEEK_DATA],
   [ AC_MSG_CHECKING([for SEEK_DATA])
-    AC_TRY_LINK([
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
 #define _GNU_SOURCE
 #include <sys/types.h>
 #include <unistd.h>
-    ], [
+    ]], [[
          lseek(-1, 0, SEEK_DATA);
-    ], have_seek_data=yes
+    ]])], have_seek_data=yes
        AC_MSG_RESULT(yes),
        AC_MSG_RESULT(no))
     AC_SUBST(have_seek_data)

> +#define _GNU_SOURCE
> +#include <sys/types.h>
> +#include <unistd.h>
> +    ], [
> +         lseek(-1, 0, SEEK_DATA);
> +    ], have_seek_data=yes
> +       AC_MSG_RESULT(yes),
> +       AC_MSG_RESULT(no))
> +    AC_SUBST(have_seek_data)
> +  ])
> +
> +# Check if we have nftw
> +AC_DEFUN([AC_HAVE_NFTW],
> +  [ AC_MSG_CHECKING([for nftw])
> +    AC_TRY_LINK([

Same as above

> +#define _GNU_SOURCE
> +#include <stddef.h>
> +#include <ftw.h>
> +    ], [
> +         nftw("/", (int (*)(const char *, const struct stat *, int, struct FTW *))1, 0, 0);
> +    ], have_nftw=yes
> +       AC_MSG_RESULT(yes),
> +       AC_MSG_RESULT(no))
> +    AC_SUBST(have_nftw)
> +  ])
> +
> +# Check if we have RLIMIT_NOFILE
> +AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
> +  [ AC_MSG_CHECKING([for RLIMIT_NOFILE])
> +    AC_TRY_LINK([

Same as above

> +#define _GNU_SOURCE
> +#include <sys/time.h>
> +#include <sys/resource.h>
> +    ], [
> +         struct rlimit rlimit;
> +
> +         rlimit.rlim_cur = 0;
> +         getrlimit(RLIMIT_NOFILE, &rlimit);
> +    ], have_rlimit_nofile=yes
> +       AC_MSG_RESULT(yes),
> +       AC_MSG_RESULT(no))
> +    AC_SUBST(have_rlimit_nofile)
> +  ])
> diff --git a/m4/package_xfslibs.m4 b/m4/package_xfslibs.m4
> index 0746cd1dc5..479f30a29b 100644
> --- a/m4/package_xfslibs.m4
> +++ b/m4/package_xfslibs.m4
> @@ -104,3 +104,19 @@ AC_DEFUN([AC_PACKAGE_NEED_XFSCTL_MACRO],
>          exit 1
>        ])
>    ])
> +
> +# Check if we have BMV_OF_SHARED from the GETBMAPX ioctl
> +AC_DEFUN([AC_HAVE_BMV_OF_SHARED],
> +  [ AC_MSG_CHECKING([for BMV_OF_SHARED])
> +    AC_TRY_LINK([

Same as above

Thanks,
Zorro

> +#define _GNU_SOURCE
> +#include <xfs/xfs.h>
> +    ], [
> +         struct getbmapx obj;
> +         ioctl(-1, XFS_IOC_GETBMAPX, &obj);
> +         obj.bmv_oflags |= BMV_OF_SHARED;
> +    ], have_bmv_of_shared=yes
> +       AC_MSG_RESULT(yes),
> +       AC_MSG_RESULT(no))
> +    AC_SUBST(have_bmv_of_shared)
> +  ])
> diff --git a/src/Makefile b/src/Makefile
> index afdf6b30c5..7807ca89a5 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -83,6 +83,16 @@ ifeq ($(HAVE_LIBCAP), true)
>  LLDLIBS += -lcap
>  endif
>  
> +ifeq ($(HAVE_SEEK_DATA), yes)
> + ifeq ($(HAVE_NFTW), yes)
> +  ifeq ($(HAVE_BMV_OF_SHARED), yes)
> +   ifeq ($(HAVE_RLIMIT_NOFILE), yes)
> +     TARGETS += xfsfind
> +   endif
> +  endif
> + endif
> +endif
> +
>  CFILES = $(TARGETS:=.c)
>  LDIRT = $(TARGETS) fssum
>  
> diff --git a/src/xfsfind.c b/src/xfsfind.c
> new file mode 100644
> index 0000000000..6b0a93e793
> --- /dev/null
> +++ b/src/xfsfind.c
> @@ -0,0 +1,290 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * find(1) but with special predicates for finding XFS attributes.
> + * Copyright (C) 2022 Oracle.
> + */
> +#include <sys/time.h>
> +#include <sys/resource.h>
> +#include <sys/types.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <ftw.h>
> +#include <linux/fs.h>
> +#include <xfs/xfs.h>
> +
> +#include "global.h"
> +
> +static int want_anyfile;
> +static int want_datafile;
> +static int want_attrfile;
> +static int want_dir;
> +static int want_regfile;
> +static int want_sharedfile;
> +static int report_errors = 1;
> +
> +static int
> +check_datafile(
> +	const char		*path,
> +	int			fd)
> +{
> +	off_t			off;
> +
> +	off = lseek(fd, 0, SEEK_DATA);
> +	if (off >= 0)
> +		return 1;
> +
> +	if (errno == ENXIO)
> +		return 0;
> +
> +	if (report_errors)
> +		perror(path);
> +
> +	return -1;
> +}
> +
> +static int
> +check_attrfile(
> +	const char		*path,
> +	int			fd)
> +{
> +	struct fsxattr		fsx;
> +	int			ret;
> +
> +	ret = ioctl(fd, XFS_IOC_FSGETXATTR, &fsx);
> +	if (ret) {
> +		if (report_errors)
> +			perror(path);
> +		return -1;
> +	}
> +
> +	if (want_attrfile && (fsx.fsx_xflags & XFS_XFLAG_HASATTR))
> +		return 1;
> +	return 0;
> +}
> +
> +#define BMAP_NR			33
> +static struct getbmapx		bmaps[BMAP_NR];
> +
> +static int
> +check_sharedfile(
> +	const char		*path,
> +	int			fd)
> +{
> +	struct getbmapx		*key = &bmaps[0];
> +	unsigned int		i;
> +	int			ret;
> +
> +	memset(key, 0, sizeof(struct getbmapx));
> +	key->bmv_length = ULLONG_MAX;
> +	/* no holes and don't flush dirty pages */
> +	key->bmv_iflags = BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
> +	key->bmv_count = BMAP_NR;
> +
> +	while ((ret = ioctl(fd, XFS_IOC_GETBMAPX, bmaps)) == 0) {
> +		struct getbmapx	*p = &bmaps[1];
> +		xfs_off_t	new_off;
> +
> +		for (i = 0; i < key->bmv_entries; i++, p++) {
> +			if (p->bmv_oflags & BMV_OF_SHARED)
> +				return 1;
> +		}
> +
> +		if (key->bmv_entries == 0)
> +			break;
> +		p = key + key->bmv_entries;
> +		if (p->bmv_oflags & BMV_OF_LAST)
> +			return 0;
> +
> +		new_off = p->bmv_offset + p->bmv_length;
> +		key->bmv_length -= new_off - key->bmv_offset;
> +		key->bmv_offset = new_off;
> +	}
> +	if (ret < 0) {
> +		if (report_errors)
> +			perror(path);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void
> +print_help(
> +	const char		*name)
> +{
> +	printf("Usage: %s [OPTIONS] path\n", name);
> +	printf("\n");
> +	printf("Print all file paths matching any of the given predicates.\n");
> +	printf("\n");
> +	printf("-a	Match files with xattrs.\n");
> +	printf("-b	Match files with data blocks.\n");
> +	printf("-d	Match directories.\n");
> +	printf("-q	Ignore errors while walking directory tree.\n");
> +	printf("-r	Match regular files.\n");
> +	printf("-s	Match files with shared blocks.\n");
> +	printf("\n");
> +	printf("If no matching options are given, match all files found.\n");
> +}
> +
> +static int
> +visit(
> +	const char		*path,
> +	const struct stat	*sb,
> +	int			typeflag,
> +	struct FTW		*ftwbuf)
> +{
> +	int			printme = 1;
> +	int			fd = -1;
> +	int			retval = FTW_CONTINUE;
> +
> +	if (want_anyfile)
> +		goto out;
> +	if (want_regfile && typeflag == FTW_F)
> +		goto out;
> +	if (want_dir && typeflag == FTW_D)
> +		goto out;
> +
> +	/*
> +	 * We can only open directories and files; screen out everything else.
> +	 * Note that nftw lies and reports FTW_F for device files, so check the
> +	 * statbuf mode too.
> +	 */
> +	if (typeflag != FTW_F && typeflag != FTW_D) {
> +		printme = 0;
> +		goto out;
> +	}
> +
> +	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode)) {
> +		printme = 0;
> +		goto out;
> +	}
> +
> +	fd = open(path, O_RDONLY);
> +	if (fd < 0) {
> +		if (report_errors) {
> +			perror(path);
> +			return FTW_STOP;
> +		}
> +
> +		return FTW_CONTINUE;
> +	}
> +
> +	if (want_datafile && typeflag == FTW_F) {
> +		int ret = check_datafile(path, fd);
> +		if (ret < 0 && report_errors) {
> +			printme = 0;
> +			retval = FTW_STOP;
> +			goto out_fd;
> +		}
> +
> +		if (ret == 1)
> +			goto out_fd;
> +	}
> +
> +	if (want_attrfile) {
> +		int ret = check_attrfile(path, fd);
> +		if (ret < 0 && report_errors) {
> +			printme = 0;
> +			retval = FTW_STOP;
> +			goto out_fd;
> +		}
> +
> +		if (ret == 1)
> +			goto out_fd;
> +	}
> +
> +	if (want_sharedfile) {
> +		int ret = check_sharedfile(path, fd);
> +		if (ret < 0 && report_errors) {
> +			printme = 0;
> +			retval = FTW_STOP;
> +			goto out_fd;
> +		}
> +
> +		if (ret == 1)
> +			goto out_fd;
> +	}
> +
> +	printme = 0;
> +out_fd:
> +	close(fd);
> +out:
> +	if (printme)
> +		printf("%s\n", path);
> +	return retval;
> +}
> +
> +static void
> +handle_sigabrt(
> +	int		signal,
> +	siginfo_t	*info,
> +	void		*ucontext)
> +{
> +	fprintf(stderr, "Signal %u, exiting.\n", signal);
> +	exit(2);
> +}
> +
> +int
> +main(
> +	int			argc,
> +	char			*argv[])
> +{
> +	struct rlimit		rlimit;
> +	struct sigaction	abrt = {
> +		.sa_sigaction	= handle_sigabrt,
> +		.sa_flags	= SA_SIGINFO,
> +	};
> +	int			c;
> +	int			ret;
> +
> +	while ((c = getopt(argc, argv, "abdqrs")) >= 0) {
> +		switch (c) {
> +		case 'a':	want_attrfile = 1;   break;
> +		case 'b':	want_datafile = 1;   break;
> +		case 'd':	want_dir = 1;        break;
> +		case 'q':	report_errors = 0;   break;
> +		case 'r':	want_regfile = 1;    break;
> +		case 's':	want_sharedfile = 1; break;
> +		default:
> +			print_help(argv[0]);
> +			return 1;
> +		}
> +	}
> +
> +	ret = getrlimit(RLIMIT_NOFILE, &rlimit);
> +	if (ret) {
> +		perror("RLIMIT_NOFILE");
> +		return 1;
> +	}
> +
> +	if (!want_attrfile && !want_datafile && !want_dir && !want_regfile &&
> +	    !want_sharedfile)
> +		want_anyfile = 1;
> +
> +	/*
> +	 * nftw is known to abort() if a directory it is walking disappears out
> +	 * from under it.  Handle this with grace if the caller wants us to run
> +	 * quietly.
> +	 */
> +	if (!report_errors) {
> +		ret = sigaction(SIGABRT, &abrt, NULL);
> +		if (ret) {
> +			perror("SIGABRT handler");
> +			return 1;
> +		}
> +	}
> +
> +	for (c = optind; c < argc; c++) {
> +		ret = nftw(argv[c], visit, rlimit.rlim_cur - 5,
> +				FTW_ACTIONRETVAL | FTW_CHDIR | FTW_MOUNT |
> +				FTW_PHYS);
> +		if (ret && report_errors) {
> +			perror(argv[c]);
> +			break;
> +		}
> +	}
> +
> +	if (ret)
> +		return 1;
> +	return 0;
> +}
> 


^ permalink raw reply related	[flat|nested] 469+ messages in thread

* Re: [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata
  2022-12-30 22:19   ` [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata Darrick J. Wong
@ 2023-02-05 13:04     ` Zorro Lang
  2023-02-07 16:58       ` Darrick J. Wong
  2023-02-07 17:02     ` [PATCH v24.1 " Darrick J. Wong
  1 sibling, 1 reply; 469+ messages in thread
From: Zorro Lang @ 2023-02-05 13:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:06PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> For each XFS_SCRUB_TYPE_* that looks at AG or filesystem metadata,
> create a test that runs that scrubber in the foreground and fsstress in
> the background.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  common/quota        |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  doc/group-names.txt |    1 +

[snip]

> diff --git a/doc/group-names.txt b/doc/group-names.txt
> index ac219e05b3..771ce937ae 100644
> --- a/doc/group-names.txt
> +++ b/doc/group-names.txt
> @@ -35,6 +35,7 @@ dangerous_fuzzers	fuzzers that can crash your computer
>  dangerous_norepair	fuzzers to evaluate kernel metadata verifiers
>  dangerous_online_repair	fuzzers to evaluate xfs_scrub online repair
>  dangerous_fsstress_repair	race fsstress and xfs_scrub online repair
> +dangerous_fsstress_scrub	race fsstress and xfs_scrub checking

We've already added this group name, so this patch will hit a conflict.  But I
think I can use `git am --3way ...` to apply it anyway :)
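
Something like this, I mean (the patch file name below is made up, just
for illustration):

  # --3way falls back to a three-way merge using the blob IDs recorded
  # in the patch when the context no longer applies cleanly
  git am --3way 0002-xfs-race-fsstress-with-online-scrubbers.patch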

Thanks,
Zorro

>  dangerous_repair	fuzzers to evaluate xfs_repair offline repair
>  dangerous_scrub		fuzzers to evaluate xfs_scrub checking
>  data			data loss checkers
> diff --git a/tests/xfs/782 b/tests/xfs/782
> new file mode 100755
> index 0000000000..4801eda4bd
> --- /dev/null
> +++ b/tests/xfs/782
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 782
> +#
> +# Race fsstress and superblock scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub sb %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/782.out b/tests/xfs/782.out
> new file mode 100644
> index 0000000000..6e378f0e53
> --- /dev/null
> +++ b/tests/xfs/782.out
> @@ -0,0 +1,2 @@
> +QA output created by 782
> +Silence is golden
> diff --git a/tests/xfs/783 b/tests/xfs/783
> new file mode 100755
> index 0000000000..379a9369e5
> --- /dev/null
> +++ b/tests/xfs/783
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 783
> +#
> +# Race fsstress and AGF scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub agf %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/783.out b/tests/xfs/783.out
> new file mode 100644
> index 0000000000..2522395956
> --- /dev/null
> +++ b/tests/xfs/783.out
> @@ -0,0 +1,2 @@
> +QA output created by 783
> +Silence is golden
> diff --git a/tests/xfs/784 b/tests/xfs/784
> new file mode 100755
> index 0000000000..2b89361c36
> --- /dev/null
> +++ b/tests/xfs/784
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 784
> +#
> +# Race fsstress and AGFL scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub agfl %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/784.out b/tests/xfs/784.out
> new file mode 100644
> index 0000000000..48d9b24dd0
> --- /dev/null
> +++ b/tests/xfs/784.out
> @@ -0,0 +1,2 @@
> +QA output created by 784
> +Silence is golden
> diff --git a/tests/xfs/785 b/tests/xfs/785
> new file mode 100755
> index 0000000000..34a13b058d
> --- /dev/null
> +++ b/tests/xfs/785
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 785
> +#
> +# Race fsstress and AGI scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub agi %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/785.out b/tests/xfs/785.out
> new file mode 100644
> index 0000000000..6ecb0c61b3
> --- /dev/null
> +++ b/tests/xfs/785.out
> @@ -0,0 +1,2 @@
> +QA output created by 785
> +Silence is golden
> diff --git a/tests/xfs/786 b/tests/xfs/786
> new file mode 100755
> index 0000000000..157200ea8c
> --- /dev/null
> +++ b/tests/xfs/786
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 786
> +#
> +# Race fsstress and freespace by block btree scrub for a while to see if we
> +# crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub bnobt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/786.out b/tests/xfs/786.out
> new file mode 100644
> index 0000000000..ccb9167df9
> --- /dev/null
> +++ b/tests/xfs/786.out
> @@ -0,0 +1,2 @@
> +QA output created by 786
> +Silence is golden
> diff --git a/tests/xfs/787 b/tests/xfs/787
> new file mode 100755
> index 0000000000..91eaf5a7af
> --- /dev/null
> +++ b/tests/xfs/787
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 787
> +#
> +# Race fsstress and free space by length btree scrub for a while to see if we
> +# crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub cntbt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/787.out b/tests/xfs/787.out
> new file mode 100644
> index 0000000000..fa7f038120
> --- /dev/null
> +++ b/tests/xfs/787.out
> @@ -0,0 +1,2 @@
> +QA output created by 787
> +Silence is golden
> diff --git a/tests/xfs/788 b/tests/xfs/788
> new file mode 100755
> index 0000000000..f1369e5309
> --- /dev/null
> +++ b/tests/xfs/788
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 788
> +#
> +# Race fsstress and inode btree scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -x 'dir' -s "scrub inobt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/788.out b/tests/xfs/788.out
> new file mode 100644
> index 0000000000..5ddd661113
> --- /dev/null
> +++ b/tests/xfs/788.out
> @@ -0,0 +1,2 @@
> +QA output created by 788
> +Silence is golden
> diff --git a/tests/xfs/789 b/tests/xfs/789
> new file mode 100755
> index 0000000000..550ff2c690
> --- /dev/null
> +++ b/tests/xfs/789
> @@ -0,0 +1,39 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 789
> +#
> +# Race fsstress and free inode btree scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" finobt
> +_scratch_xfs_stress_scrub -x 'dir' -s "scrub finobt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/789.out b/tests/xfs/789.out
> new file mode 100644
> index 0000000000..da88fc99cb
> --- /dev/null
> +++ b/tests/xfs/789.out
> @@ -0,0 +1,2 @@
> +QA output created by 789
> +Silence is golden
> diff --git a/tests/xfs/790 b/tests/xfs/790
> new file mode 100755
> index 0000000000..c4e5779ef7
> --- /dev/null
> +++ b/tests/xfs/790
> @@ -0,0 +1,39 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 790
> +#
> +# Race fsstress and reverse mapping btree scrub for a while to see if we crash
> +# or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" rmapbt
> +_scratch_xfs_stress_scrub -s "scrub rmapbt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/790.out b/tests/xfs/790.out
> new file mode 100644
> index 0000000000..7102c590f0
> --- /dev/null
> +++ b/tests/xfs/790.out
> @@ -0,0 +1,2 @@
> +QA output created by 790
> +Silence is golden
> diff --git a/tests/xfs/791 b/tests/xfs/791
> new file mode 100755
> index 0000000000..6939d910c9
> --- /dev/null
> +++ b/tests/xfs/791
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 791
> +#
> +# Race fsstress and reference count btree scrub for a while to see if we crash
> +# or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/reflink
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" reflink
> +_scratch_xfs_stress_scrub -s "scrub refcountbt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/791.out b/tests/xfs/791.out
> new file mode 100644
> index 0000000000..758905371d
> --- /dev/null
> +++ b/tests/xfs/791.out
> @@ -0,0 +1,2 @@
> +QA output created by 791
> +Silence is golden
> diff --git a/tests/xfs/798 b/tests/xfs/798
> new file mode 100755
> index 0000000000..c5bdfad50a
> --- /dev/null
> +++ b/tests/xfs/798
> @@ -0,0 +1,44 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2022 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 798
> +#
> +# Race fsstress and fscounter scrub on the realtime device for a while to see
> +# if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_realtime
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> +
> +# Force all files to be allocated on the realtime device
> +_xfs_force_bdev realtime $SCRATCH_MNT
> +
> +_scratch_xfs_stress_scrub -s 'scrub fscounters'
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/798.out b/tests/xfs/798.out
> new file mode 100644
> index 0000000000..216d6e93f4
> --- /dev/null
> +++ b/tests/xfs/798.out
> @@ -0,0 +1,2 @@
> +QA output created by 798
> +Silence is golden
> diff --git a/tests/xfs/800 b/tests/xfs/800
> new file mode 100755
> index 0000000000..cbcfb5f5a6
> --- /dev/null
> +++ b/tests/xfs/800
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 800
> +#
> +# Race fsstress and realtime bitmap scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_realtime
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> +_scratch_xfs_stress_scrub -s "scrub rtbitmap"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/800.out b/tests/xfs/800.out
> new file mode 100644
> index 0000000000..bdfaa2cecd
> --- /dev/null
> +++ b/tests/xfs/800.out
> @@ -0,0 +1,2 @@
> +QA output created by 800
> +Silence is golden
> diff --git a/tests/xfs/801 b/tests/xfs/801
> new file mode 100755
> index 0000000000..a51fab523b
> --- /dev/null
> +++ b/tests/xfs/801
> @@ -0,0 +1,47 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 801
> +#
> +# Race fsstress and realtime summary scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_realtime
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> +
> +# XXX the realtime summary scrubber isn't currently implemented upstream.
> +# Don't bother trying to test it on those kernels
> +$XFS_IO_PROG -c 'scrub rtsummary' -c 'scrub rtsummary' "$SCRATCH_MNT" 2>&1 | \
> +	grep -q 'Scan was not complete' && \
> +	_notrun "rtsummary scrub is incomplete"
> +
> +_scratch_xfs_stress_scrub -s "scrub rtsummary"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/801.out b/tests/xfs/801.out
> new file mode 100644
> index 0000000000..39481b38e2
> --- /dev/null
> +++ b/tests/xfs/801.out
> @@ -0,0 +1,2 @@
> +QA output created by 801
> +Silence is golden
> diff --git a/tests/xfs/802 b/tests/xfs/802
> new file mode 100755
> index 0000000000..1f3b83882e
> --- /dev/null
> +++ b/tests/xfs/802
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 802
> +#
> +# Race fsstress and user quota scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
> +_scratch_xfs_stress_scrub -s "scrub usrquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/802.out b/tests/xfs/802.out
> new file mode 100644
> index 0000000000..a69c05391f
> --- /dev/null
> +++ b/tests/xfs/802.out
> @@ -0,0 +1,2 @@
> +QA output created by 802
> +Silence is golden
> diff --git a/tests/xfs/803 b/tests/xfs/803
> new file mode 100755
> index 0000000000..b2bb85672d
> --- /dev/null
> +++ b/tests/xfs/803
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 803
> +#
> +# Race fsstress and group quota scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
> +_scratch_xfs_stress_scrub -s "scrub grpquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/803.out b/tests/xfs/803.out
> new file mode 100644
> index 0000000000..38ba741d0f
> --- /dev/null
> +++ b/tests/xfs/803.out
> @@ -0,0 +1,2 @@
> +QA output created by 803
> +Silence is golden
> diff --git a/tests/xfs/804 b/tests/xfs/804
> new file mode 100755
> index 0000000000..129724eb11
> --- /dev/null
> +++ b/tests/xfs/804
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 804
> +#
> +# Race fsstress and project quota scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
> +_scratch_xfs_stress_scrub -s "scrub prjquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/804.out b/tests/xfs/804.out
> new file mode 100644
> index 0000000000..5e0cb437e7
> --- /dev/null
> +++ b/tests/xfs/804.out
> @@ -0,0 +1,2 @@
> +QA output created by 804
> +Silence is golden
> diff --git a/tests/xfs/805 b/tests/xfs/805
> new file mode 100755
> index 0000000000..aca9b9cdf4
> --- /dev/null
> +++ b/tests/xfs/805
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 805
> +#
> +# Race fsstress and summary counters scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub fscounters"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/805.out b/tests/xfs/805.out
> new file mode 100644
> index 0000000000..ac324c5874
> --- /dev/null
> +++ b/tests/xfs/805.out
> @@ -0,0 +1,2 @@
> +QA output created by 805
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests
  2023-02-05 12:57     ` Zorro Lang
@ 2023-02-07 16:57       ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2023-02-07 16:57 UTC (permalink / raw)
  To: Zorro Lang; +Cc: linux-xfs, fstests

On Sun, Feb 05, 2023 at 08:57:47PM +0800, Zorro Lang wrote:
> On Fri, Dec 30, 2022 at 02:19:06PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Create a new find(1) like utility that doesn't crash on directory tree
> > changes (like find does due to bugs in its loop detector) and actually
> > implements the custom xfs attribute predicates that we need for scrub
> > stress tests.  This program will be needed for a future patch where we
> > add stress tests for scrub and repair of file metadata.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  configure.ac          |    5 +
> >  include/builddefs.in  |    4 +
> >  m4/package_libcdev.m4 |   47 ++++++++
> >  m4/package_xfslibs.m4 |   16 +++
> >  src/Makefile          |   10 ++
> >  src/xfsfind.c         |  290 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 372 insertions(+)
> >  create mode 100644 src/xfsfind.c
> > 
> > 
> > diff --git a/configure.ac b/configure.ac
> > index cbf8377988..e92bd6b26d 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -66,6 +66,11 @@ AC_PACKAGE_WANT_LINUX_FS_H
> >  AC_PACKAGE_WANT_LIBBTRFSUTIL
> >  
> >  AC_HAVE_COPY_FILE_RANGE
> > +AC_HAVE_SEEK_DATA
> > +AC_HAVE_BMV_OF_SHARED
> > +AC_HAVE_NFTW
> > +AC_HAVE_RLIMIT_NOFILE
> > +
> >  AC_CHECK_FUNCS([renameat2])
> >  AC_CHECK_FUNCS([reallocarray])
> >  AC_CHECK_TYPES([struct mount_attr], [], [], [[#include <linux/mount.h>]])
> > diff --git a/include/builddefs.in b/include/builddefs.in
> > index 6641209f81..dab10c968f 100644
> > --- a/include/builddefs.in
> > +++ b/include/builddefs.in
> > @@ -68,6 +68,10 @@ HAVE_FIEMAP = @have_fiemap@
> >  HAVE_FALLOCATE = @have_fallocate@
> >  HAVE_COPY_FILE_RANGE = @have_copy_file_range@
> >  HAVE_LIBBTRFSUTIL = @have_libbtrfsutil@
> > +HAVE_SEEK_DATA = @have_seek_data@
> > +HAVE_NFTW = @have_nftw@
> > +HAVE_BMV_OF_SHARED = @have_bmv_of_shared@
> > +HAVE_RLIMIT_NOFILE = @have_rlimit_nofile@
> >  
> >  GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
> >  
> > diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
> > index 5c76c0f73e..e1b381c16f 100644
> > --- a/m4/package_libcdev.m4
> > +++ b/m4/package_libcdev.m4
> > @@ -110,3 +110,50 @@ AC_DEFUN([AC_HAVE_COPY_FILE_RANGE],
> >      AC_SUBST(have_copy_file_range)
> >    ])
> >  
> > +# Check if we have SEEK_DATA
> > +AC_DEFUN([AC_HAVE_SEEK_DATA],
> > +  [ AC_MSG_CHECKING([for SEEK_DATA])
> > +    AC_TRY_LINK([
> 
> AC_TRY_LINK has been obsoleted by autoconf; refer to:
> https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Obsolete-Macros.html
> 
> So, as the above link suggests, we'd better replace:
>           Macro: AC_TRY_LINK (includes, function-body, [action-if-true], [action-if-false])
> with:
>           AC_LINK_IFELSE(
>             [AC_LANG_PROGRAM([[includes]],
>                [[function-body]])],
>             [action-if-true],
>             [action-if-false])
> 
> For example (I hope it's right):

Yeah... this patch was written so long ago I wasn't even aware of the
deprecations.  I've run autoupdate to fix the problems and will repost.
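
(For the record, that amounts to something like the following, assuming
GNU autoconf is installed; regenerate configure afterwards in the usual
way:

  # rewrite obsolete constructs such as AC_TRY_LINK into their
  # AC_LINK_IFELSE equivalents; autoupdate backs up the originals
  autoupdate m4/package_libcdev.m4 m4/package_xfslibs.m4
)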

--D

> diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
> index e1b381c1..7f1767a4 100644
> --- a/m4/package_libcdev.m4
> +++ b/m4/package_libcdev.m4
> @@ -113,13 +113,13 @@ AC_DEFUN([AC_HAVE_COPY_FILE_RANGE],
>  # Check if we have SEEK_DATA
>  AC_DEFUN([AC_HAVE_SEEK_DATA],
>    [ AC_MSG_CHECKING([for SEEK_DATA])
> -    AC_TRY_LINK([
> +    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
>  #define _GNU_SOURCE
>  #include <sys/types.h>
>  #include <unistd.h>
> -    ], [
> +    ]], [[
>           lseek(-1, 0, SEEK_DATA);
> -    ], have_seek_data=yes
> +    ]])], have_seek_data=yes
>         AC_MSG_RESULT(yes),
>         AC_MSG_RESULT(no))
>      AC_SUBST(have_seek_data)
> 
> > +#define _GNU_SOURCE
> > +#include <sys/types.h>
> > +#include <unistd.h>
> > +    ], [
> > +         lseek(-1, 0, SEEK_DATA);
> > +    ], have_seek_data=yes
> > +       AC_MSG_RESULT(yes),
> > +       AC_MSG_RESULT(no))
> > +    AC_SUBST(have_seek_data)
> > +  ])
> > +
> > +# Check if we have nftw
> > +AC_DEFUN([AC_HAVE_NFTW],
> > +  [ AC_MSG_CHECKING([for nftw])
> > +    AC_TRY_LINK([
> 
> Same as above
> 
> > +#define _GNU_SOURCE
> > +#include <stddef.h>
> > +#include <ftw.h>
> > +    ], [
> > +         nftw("/", (int (*)(const char *, const struct stat *, int, struct FTW *))1, 0, 0);
> > +    ], have_nftw=yes
> > +       AC_MSG_RESULT(yes),
> > +       AC_MSG_RESULT(no))
> > +    AC_SUBST(have_nftw)
> > +  ])
> > +
> > +# Check if we have RLIMIT_NOFILE
> > +AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
> > +  [ AC_MSG_CHECKING([for RLIMIT_NOFILE])
> > +    AC_TRY_LINK([
> 
> Same as above
> 
> > +#define _GNU_SOURCE
> > +#include <sys/time.h>
> > +#include <sys/resource.h>
> > +    ], [
> > +         struct rlimit rlimit;
> > +
> > +         rlimit.rlim_cur = 0;
> > +         getrlimit(RLIMIT_NOFILE, &rlimit);
> > +    ], have_rlimit_nofile=yes
> > +       AC_MSG_RESULT(yes),
> > +       AC_MSG_RESULT(no))
> > +    AC_SUBST(have_rlimit_nofile)
> > +  ])
> > diff --git a/m4/package_xfslibs.m4 b/m4/package_xfslibs.m4
> > index 0746cd1dc5..479f30a29b 100644
> > --- a/m4/package_xfslibs.m4
> > +++ b/m4/package_xfslibs.m4
> > @@ -104,3 +104,19 @@ AC_DEFUN([AC_PACKAGE_NEED_XFSCTL_MACRO],
> >          exit 1
> >        ])
> >    ])
> > +
> > +# Check if we have BMV_OF_SHARED from the GETBMAPX ioctl
> > +AC_DEFUN([AC_HAVE_BMV_OF_SHARED],
> > +  [ AC_MSG_CHECKING([for BMV_OF_SHARED])
> > +    AC_TRY_LINK([
> 
> Same as above
> 
> Thanks,
> Zorro
> 
> > +#define _GNU_SOURCE
> > +#include <xfs/xfs.h>
> > +    ], [
> > +         struct getbmapx obj;
> > +         ioctl(-1, XFS_IOC_GETBMAPX, &obj);
> > +         obj.bmv_oflags |= BMV_OF_SHARED;
> > +    ], have_bmv_of_shared=yes
> > +       AC_MSG_RESULT(yes),
> > +       AC_MSG_RESULT(no))
> > +    AC_SUBST(have_bmv_of_shared)
> > +  ])
> > diff --git a/src/Makefile b/src/Makefile
> > index afdf6b30c5..7807ca89a5 100644
> > --- a/src/Makefile
> > +++ b/src/Makefile
> > @@ -83,6 +83,16 @@ ifeq ($(HAVE_LIBCAP), true)
> >  LLDLIBS += -lcap
> >  endif
> >  
> > +ifeq ($(HAVE_SEEK_DATA), yes)
> > + ifeq ($(HAVE_NFTW), yes)
> > +  ifeq ($(HAVE_BMV_OF_SHARED), yes)
> > +   ifeq ($(HAVE_RLIMIT_NOFILE), yes)
> > +     TARGETS += xfsfind
> > +   endif
> > +  endif
> > + endif
> > +endif
> > +
> >  CFILES = $(TARGETS:=.c)
> >  LDIRT = $(TARGETS) fssum
> >  
> > diff --git a/src/xfsfind.c b/src/xfsfind.c
> > new file mode 100644
> > index 0000000000..6b0a93e793
> > --- /dev/null
> > +++ b/src/xfsfind.c
> > @@ -0,0 +1,290 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * find(1) but with special predicates for finding XFS attributes.
> > + * Copyright (C) 2022 Oracle.
> > + */
> > +#include <sys/time.h>
> > +#include <sys/resource.h>
> > +#include <sys/types.h>
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +#include <ftw.h>
> > +#include <linux/fs.h>
> > +#include <xfs/xfs.h>
> > +
> > +#include "global.h"
> > +
> > +static int want_anyfile;
> > +static int want_datafile;
> > +static int want_attrfile;
> > +static int want_dir;
> > +static int want_regfile;
> > +static int want_sharedfile;
> > +static int report_errors = 1;
> > +
> > +static int
> > +check_datafile(
> > +	const char		*path,
> > +	int			fd)
> > +{
> > +	off_t			off;
> > +
> > +	off = lseek(fd, 0, SEEK_DATA);
> > +	if (off >= 0)
> > +		return 1;
> > +
> > +	if (errno == ENXIO)
> > +		return 0;
> > +
> > +	if (report_errors)
> > +		perror(path);
> > +
> > +	return -1;
> > +}
> > +
> > +static int
> > +check_attrfile(
> > +	const char		*path,
> > +	int			fd)
> > +{
> > +	struct fsxattr		fsx;
> > +	int			ret;
> > +
> > +	ret = ioctl(fd, XFS_IOC_FSGETXATTR, &fsx);
> > +	if (ret) {
> > +		if (report_errors)
> > +			perror(path);
> > +		return -1;
> > +	}
> > +
> > +	if (want_attrfile && (fsx.fsx_xflags & XFS_XFLAG_HASATTR))
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +#define BMAP_NR			33
> > +static struct getbmapx		bmaps[BMAP_NR];
> > +
> > +static int
> > +check_sharedfile(
> > +	const char		*path,
> > +	int			fd)
> > +{
> > +	struct getbmapx		*key = &bmaps[0];
> > +	unsigned int		i;
> > +	int			ret;
> > +
> > +	memset(key, 0, sizeof(struct getbmapx));
> > +	key->bmv_length = ULLONG_MAX;
> > +	/* no holes and don't flush dirty pages */
> > +	key->bmv_iflags = BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
> > +	key->bmv_count = BMAP_NR;
> > +
> > +	while ((ret = ioctl(fd, XFS_IOC_GETBMAPX, bmaps)) == 0) {
> > +		struct getbmapx	*p = &bmaps[1];
> > +		xfs_off_t	new_off;
> > +
> > +		for (i = 0; i < key->bmv_entries; i++, p++) {
> > +			if (p->bmv_oflags & BMV_OF_SHARED)
> > +				return 1;
> > +		}
> > +
> > +		if (key->bmv_entries == 0)
> > +			break;
> > +		p = key + key->bmv_entries;
> > +		if (p->bmv_oflags & BMV_OF_LAST)
> > +			return 0;
> > +
> > +		new_off = p->bmv_offset + p->bmv_length;
> > +		key->bmv_length -= new_off - key->bmv_offset;
> > +		key->bmv_offset = new_off;
> > +	}
> > +	if (ret < 0) {
> > +		if (report_errors)
> > +			perror(path);
> > +		return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void
> > +print_help(
> > +	const char		*name)
> > +{
> > +	printf("Usage: %s [OPTIONS] path\n", name);
> > +	printf("\n");
> > +	printf("Print all file paths matching any of the given predicates.\n");
> > +	printf("\n");
> > +	printf("-a	Match files with xattrs.\n");
> > +	printf("-b	Match files with data blocks.\n");
> > +	printf("-d	Match directories.\n");
> > +	printf("-q	Ignore errors while walking directory tree.\n");
> > +	printf("-r	Match regular files.\n");
> > +	printf("-s	Match files with shared blocks.\n");
> > +	printf("\n");
> > +	printf("If no matching options are given, match all files found.\n");
> > +}
> > +
> > +static int
> > +visit(
> > +	const char		*path,
> > +	const struct stat	*sb,
> > +	int			typeflag,
> > +	struct FTW		*ftwbuf)
> > +{
> > +	int			printme = 1;
> > +	int			fd = -1;
> > +	int			retval = FTW_CONTINUE;
> > +
> > +	if (want_anyfile)
> > +		goto out;
> > +	if (want_regfile && typeflag == FTW_F)
> > +		goto out;
> > +	if (want_dir && typeflag == FTW_D)
> > +		goto out;
> > +
> > +	/*
> > +	 * We can only open directories and files; screen out everything else.
> > +	 * Note that nftw lies and reports FTW_F for device files, so check the
> > +	 * statbuf mode too.
> > +	 */
> > +	if (typeflag != FTW_F && typeflag != FTW_D) {
> > +		printme = 0;
> > +		goto out;
> > +	}
> > +
> > +	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode)) {
> > +		printme = 0;
> > +		goto out;
> > +	}
> > +
> > +	fd = open(path, O_RDONLY);
> > +	if (fd < 0) {
> > +		if (report_errors) {
> > +			perror(path);
> > +			return FTW_STOP;
> > +		}
> > +
> > +		return FTW_CONTINUE;
> > +	}
> > +
> > +	if (want_datafile && typeflag == FTW_F) {
> > +		int ret = check_datafile(path, fd);
> > +		if (ret < 0 && report_errors) {
> > +			printme = 0;
> > +			retval = FTW_STOP;
> > +			goto out_fd;
> > +		}
> > +
> > +		if (ret == 1)
> > +			goto out_fd;
> > +	}
> > +
> > +	if (want_attrfile) {
> > +		int ret = check_attrfile(path, fd);
> > +		if (ret < 0 && report_errors) {
> > +			printme = 0;
> > +			retval = FTW_STOP;
> > +			goto out_fd;
> > +		}
> > +
> > +		if (ret == 1)
> > +			goto out_fd;
> > +	}
> > +
> > +	if (want_sharedfile) {
> > +		int ret = check_sharedfile(path, fd);
> > +		if (ret < 0 && report_errors) {
> > +			printme = 0;
> > +			retval = FTW_STOP;
> > +			goto out_fd;
> > +		}
> > +
> > +		if (ret == 1)
> > +			goto out_fd;
> > +	}
> > +
> > +	printme = 0;
> > +out_fd:
> > +	close(fd);
> > +out:
> > +	if (printme)
> > +		printf("%s\n", path);
> > +	return retval;
> > +}
> > +
> > +static void
> > +handle_sigabrt(
> > +	int		signal,
> > +	siginfo_t	*info,
> > +	void		*ucontext)
> > +{
> > +	fprintf(stderr, "Signal %u, exiting.\n", signal);
> > +	exit(2);
> > +}
> > +
> > +int
> > +main(
> > +	int			argc,
> > +	char			*argv[])
> > +{
> > +	struct rlimit		rlimit;
> > +	struct sigaction	abrt = {
> > +		.sa_sigaction	= handle_sigabrt,
> > +		.sa_flags	= SA_SIGINFO,
> > +	};
> > +	int			c;
> > +	int			ret;
> > +
> > +	while ((c = getopt(argc, argv, "abdqrs")) >= 0) {
> > +		switch (c) {
> > +		case 'a':	want_attrfile = 1;   break;
> > +		case 'b':	want_datafile = 1;   break;
> > +		case 'd':	want_dir = 1;        break;
> > +		case 'q':	report_errors = 0;   break;
> > +		case 'r':	want_regfile = 1;    break;
> > +		case 's':	want_sharedfile = 1; break;
> > +		default:
> > +			print_help(argv[0]);
> > +			return 1;
> > +		}
> > +	}
> > +
> > +	ret = getrlimit(RLIMIT_NOFILE, &rlimit);
> > +	if (ret) {
> > +		perror("RLIMIT_NOFILE");
> > +		return 1;
> > +	}
> > +
> > +	if (!want_attrfile && !want_datafile && !want_dir && !want_regfile &&
> > +	    !want_sharedfile)
> > +		want_anyfile = 1;
> > +
> > +	/*
> > +	 * nftw is known to abort() if a directory it is walking disappears out
> > +	 * from under it.  Handle this with grace if the caller wants us to run
> > +	 * quietly.
> > +	 */
> > +	if (!report_errors) {
> > +		ret = sigaction(SIGABRT, &abrt, NULL);
> > +		if (ret) {
> > +			perror("SIGABRT handler");
> > +			return 1;
> > +		}
> > +	}
> > +
> > +	for (c = optind; c < argc; c++) {
> > +		ret = nftw(argv[c], visit, rlimit.rlim_cur - 5,
> > +				FTW_ACTIONRETVAL | FTW_CHDIR | FTW_MOUNT |
> > +				FTW_PHYS);
> > +		if (ret && report_errors) {
> > +			perror(argv[c]);
> > +			break;
> > +		}
> > +	}
> > +
> > +	if (ret)
> > +		return 1;
> > +	return 0;
> > +}
> > 
> 


* Re: [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata
  2023-02-05 13:04     ` Zorro Lang
@ 2023-02-07 16:58       ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2023-02-07 16:58 UTC (permalink / raw)
  To: Zorro Lang; +Cc: linux-xfs, fstests

On Sun, Feb 05, 2023 at 09:04:12PM +0800, Zorro Lang wrote:
> On Fri, Dec 30, 2022 at 02:19:06PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > For each XFS_SCRUB_TYPE_* that looks at AG or filesystem metadata,
> > create a test that runs that scrubber in the foreground and fsstress in
> > the background.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  common/quota        |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  doc/group-names.txt |    1 +
> 
> [snip]
> 
> > diff --git a/doc/group-names.txt b/doc/group-names.txt
> > index ac219e05b3..771ce937ae 100644
> > --- a/doc/group-names.txt
> > +++ b/doc/group-names.txt
> > @@ -35,6 +35,7 @@ dangerous_fuzzers	fuzzers that can crash your computer
> >  dangerous_norepair	fuzzers to evaluate kernel metadata verifiers
> >  dangerous_online_repair	fuzzers to evaluate xfs_scrub online repair
> >  dangerous_fsstress_repair	race fsstress and xfs_scrub online repair
> > +dangerous_fsstress_scrub	race fsstress and xfs_scrub checking
> 
> We've added this group name, so this patch will hit a conflict. But I think I
> can use `git am --3way ...` to apply this patch forcibly :)

Or I'll just repost this patch, since I already/always rebase everything.
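
For reference, the forcible apply mentioned above is git's ordinary 3-way
merge fallback; a minimal sketch (the mbox file name here is made up):

	git am -3 xfs-race-fsstress-scrub.mbox
	# on a conflict: fix up the files, `git add` them, then
	git am --continue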

--D

> Thanks,
> Zorro
> 
> >  dangerous_repair	fuzzers to evaluate xfs_repair offline repair
> >  dangerous_scrub		fuzzers to evaluate xfs_scrub checking
> >  data			data loss checkers
> > diff --git a/tests/xfs/782 b/tests/xfs/782
> > new file mode 100755
> > index 0000000000..4801eda4bd
> > --- /dev/null
> > +++ b/tests/xfs/782
> > @@ -0,0 +1,37 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 782
> > +#
> > +# Race fsstress and superblock scrub for a while to see if we crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -s "scrub sb %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/782.out b/tests/xfs/782.out
> > new file mode 100644
> > index 0000000000..6e378f0e53
> > --- /dev/null
> > +++ b/tests/xfs/782.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 782
> > +Silence is golden
> > diff --git a/tests/xfs/783 b/tests/xfs/783
> > new file mode 100755
> > index 0000000000..379a9369e5
> > --- /dev/null
> > +++ b/tests/xfs/783
> > @@ -0,0 +1,37 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 783
> > +#
> > +# Race fsstress and AGF scrub for a while to see if we crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -s "scrub agf %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/783.out b/tests/xfs/783.out
> > new file mode 100644
> > index 0000000000..2522395956
> > --- /dev/null
> > +++ b/tests/xfs/783.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 783
> > +Silence is golden
> > diff --git a/tests/xfs/784 b/tests/xfs/784
> > new file mode 100755
> > index 0000000000..2b89361c36
> > --- /dev/null
> > +++ b/tests/xfs/784
> > @@ -0,0 +1,37 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 784
> > +#
> > +# Race fsstress and AGFL scrub for a while to see if we crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -s "scrub agfl %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/784.out b/tests/xfs/784.out
> > new file mode 100644
> > index 0000000000..48d9b24dd0
> > --- /dev/null
> > +++ b/tests/xfs/784.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 784
> > +Silence is golden
> > diff --git a/tests/xfs/785 b/tests/xfs/785
> > new file mode 100755
> > index 0000000000..34a13b058d
> > --- /dev/null
> > +++ b/tests/xfs/785
> > @@ -0,0 +1,37 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 785
> > +#
> > +# Race fsstress and AGI scrub for a while to see if we crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -s "scrub agi %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/785.out b/tests/xfs/785.out
> > new file mode 100644
> > index 0000000000..6ecb0c61b3
> > --- /dev/null
> > +++ b/tests/xfs/785.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 785
> > +Silence is golden
> > diff --git a/tests/xfs/786 b/tests/xfs/786
> > new file mode 100755
> > index 0000000000..157200ea8c
> > --- /dev/null
> > +++ b/tests/xfs/786
> > @@ -0,0 +1,38 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 786
> > +#
> > +# Race fsstress and freespace by block btree scrub for a while to see if we
> > +# crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -s "scrub bnobt %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/786.out b/tests/xfs/786.out
> > new file mode 100644
> > index 0000000000..ccb9167df9
> > --- /dev/null
> > +++ b/tests/xfs/786.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 786
> > +Silence is golden
> > diff --git a/tests/xfs/787 b/tests/xfs/787
> > new file mode 100755
> > index 0000000000..91eaf5a7af
> > --- /dev/null
> > +++ b/tests/xfs/787
> > @@ -0,0 +1,38 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 787
> > +#
> > +# Race fsstress and free space by length btree scrub for a while to see if we
> > +# crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -s "scrub cntbt %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/787.out b/tests/xfs/787.out
> > new file mode 100644
> > index 0000000000..fa7f038120
> > --- /dev/null
> > +++ b/tests/xfs/787.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 787
> > +Silence is golden
> > diff --git a/tests/xfs/788 b/tests/xfs/788
> > new file mode 100755
> > index 0000000000..f1369e5309
> > --- /dev/null
> > +++ b/tests/xfs/788
> > @@ -0,0 +1,38 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 788
> > +#
> > +# Race fsstress and inode btree scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -x 'dir' -s "scrub inobt %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/788.out b/tests/xfs/788.out
> > new file mode 100644
> > index 0000000000..5ddd661113
> > --- /dev/null
> > +++ b/tests/xfs/788.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 788
> > +Silence is golden
> > diff --git a/tests/xfs/789 b/tests/xfs/789
> > new file mode 100755
> > index 0000000000..550ff2c690
> > --- /dev/null
> > +++ b/tests/xfs/789
> > @@ -0,0 +1,39 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 789
> > +#
> > +# Race fsstress and free inode btree scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_has_feature "$SCRATCH_MNT" finobt
> > +_scratch_xfs_stress_scrub -x 'dir' -s "scrub finobt %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/789.out b/tests/xfs/789.out
> > new file mode 100644
> > index 0000000000..da88fc99cb
> > --- /dev/null
> > +++ b/tests/xfs/789.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 789
> > +Silence is golden
> > diff --git a/tests/xfs/790 b/tests/xfs/790
> > new file mode 100755
> > index 0000000000..c4e5779ef7
> > --- /dev/null
> > +++ b/tests/xfs/790
> > @@ -0,0 +1,39 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 790
> > +#
> > +# Race fsstress and reverse mapping btree scrub for a while to see if we crash
> > +# or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_has_feature "$SCRATCH_MNT" rmapbt
> > +_scratch_xfs_stress_scrub -s "scrub rmapbt %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/790.out b/tests/xfs/790.out
> > new file mode 100644
> > index 0000000000..7102c590f0
> > --- /dev/null
> > +++ b/tests/xfs/790.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 790
> > +Silence is golden
> > diff --git a/tests/xfs/791 b/tests/xfs/791
> > new file mode 100755
> > index 0000000000..6939d910c9
> > --- /dev/null
> > +++ b/tests/xfs/791
> > @@ -0,0 +1,40 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 791
> > +#
> > +# Race fsstress and reference count btree scrub for a while to see if we crash
> > +# or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +. ./common/reflink
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_has_feature "$SCRATCH_MNT" reflink
> > +_scratch_xfs_stress_scrub -s "scrub refcountbt %agno%"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/791.out b/tests/xfs/791.out
> > new file mode 100644
> > index 0000000000..758905371d
> > --- /dev/null
> > +++ b/tests/xfs/791.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 791
> > +Silence is golden
> > diff --git a/tests/xfs/798 b/tests/xfs/798
> > new file mode 100755
> > index 0000000000..c5bdfad50a
> > --- /dev/null
> > +++ b/tests/xfs/798
> > @@ -0,0 +1,44 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0-or-later
> > +# Copyright (c) 2022 Oracle.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 798
> > +#
> > +# Race fsstress and fscounter scrub on the realtime device for a while to see
> > +# if we crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_realtime
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> > +
> > +# Force all files to be allocated on the realtime device
> > +_xfs_force_bdev realtime $SCRATCH_MNT
> > +
> > +_scratch_xfs_stress_scrub -s 'scrub fscounters'
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/798.out b/tests/xfs/798.out
> > new file mode 100644
> > index 0000000000..216d6e93f4
> > --- /dev/null
> > +++ b/tests/xfs/798.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 798
> > +Silence is golden
> > diff --git a/tests/xfs/800 b/tests/xfs/800
> > new file mode 100755
> > index 0000000000..cbcfb5f5a6
> > --- /dev/null
> > +++ b/tests/xfs/800
> > @@ -0,0 +1,40 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 800
> > +#
> > +# Race fsstress and realtime bitmap scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_realtime
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> > +_scratch_xfs_stress_scrub -s "scrub rtbitmap"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/800.out b/tests/xfs/800.out
> > new file mode 100644
> > index 0000000000..bdfaa2cecd
> > --- /dev/null
> > +++ b/tests/xfs/800.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 800
> > +Silence is golden
> > diff --git a/tests/xfs/801 b/tests/xfs/801
> > new file mode 100755
> > index 0000000000..a51fab523b
> > --- /dev/null
> > +++ b/tests/xfs/801
> > @@ -0,0 +1,47 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 801
> > +#
> > +# Race fsstress and realtime summary scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_realtime
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> > +
> > +# XXX the realtime summary scrubber isn't currently implemented upstream.
> > +# Don't bother trying to test it on those kernels
> > +$XFS_IO_PROG -c 'scrub rtsummary' -c 'scrub rtsummary' "$SCRATCH_MNT" 2>&1 | \
> > +	grep -q 'Scan was not complete' && \
> > +	_notrun "rtsummary scrub is incomplete"
> > +
> > +_scratch_xfs_stress_scrub -s "scrub rtsummary"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/801.out b/tests/xfs/801.out
> > new file mode 100644
> > index 0000000000..39481b38e2
> > --- /dev/null
> > +++ b/tests/xfs/801.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 801
> > +Silence is golden
> > diff --git a/tests/xfs/802 b/tests/xfs/802
> > new file mode 100755
> > index 0000000000..1f3b83882e
> > --- /dev/null
> > +++ b/tests/xfs/802
> > @@ -0,0 +1,40 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 802
> > +#
> > +# Race fsstress and user quota scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +. ./common/quota
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
> > +_scratch_xfs_stress_scrub -s "scrub usrquota"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/802.out b/tests/xfs/802.out
> > new file mode 100644
> > index 0000000000..a69c05391f
> > --- /dev/null
> > +++ b/tests/xfs/802.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 802
> > +Silence is golden
> > diff --git a/tests/xfs/803 b/tests/xfs/803
> > new file mode 100755
> > index 0000000000..b2bb85672d
> > --- /dev/null
> > +++ b/tests/xfs/803
> > @@ -0,0 +1,40 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 803
> > +#
> > +# Race fsstress and group quota scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +. ./common/quota
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
> > +_scratch_xfs_stress_scrub -s "scrub grpquota"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/803.out b/tests/xfs/803.out
> > new file mode 100644
> > index 0000000000..38ba741d0f
> > --- /dev/null
> > +++ b/tests/xfs/803.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 803
> > +Silence is golden
> > diff --git a/tests/xfs/804 b/tests/xfs/804
> > new file mode 100755
> > index 0000000000..129724eb11
> > --- /dev/null
> > +++ b/tests/xfs/804
> > @@ -0,0 +1,40 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 804
> > +#
> > +# Race fsstress and project quota scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +. ./common/quota
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
> > +_scratch_xfs_stress_scrub -s "scrub prjquota"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/804.out b/tests/xfs/804.out
> > new file mode 100644
> > index 0000000000..5e0cb437e7
> > --- /dev/null
> > +++ b/tests/xfs/804.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 804
> > +Silence is golden
> > diff --git a/tests/xfs/805 b/tests/xfs/805
> > new file mode 100755
> > index 0000000000..aca9b9cdf4
> > --- /dev/null
> > +++ b/tests/xfs/805
> > @@ -0,0 +1,38 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 805
> > +#
> > +# Race fsstress and summary counters scrub for a while to see if we crash or
> > +# livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> > +
> > +_cleanup() {
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	cd /
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -s "scrub fscounters"
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/805.out b/tests/xfs/805.out
> > new file mode 100644
> > index 0000000000..ac324c5874
> > --- /dev/null
> > +++ b/tests/xfs/805.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 805
> > +Silence is golden
> > 
> 


* [PATCH v24.1 3/5] fuzzy: add a custom xfs find utility for scrub stress tests
  2022-12-30 22:19   ` [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests Darrick J. Wong
  2023-02-05 12:57     ` Zorro Lang
@ 2023-02-07 17:01     ` Darrick J. Wong
  2023-02-07 18:42       ` Zorro Lang
  1 sibling, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2023-02-07 17:01 UTC (permalink / raw)
  To: zlang; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create a new find(1)-like utility that doesn't crash on directory tree
changes (like find does due to bugs in its loop detector) and actually
implements the custom xfs attribute predicates that we need for scrub
stress tests.  This program will be needed for a future patch where we
add stress tests for scrub and repair of file metadata.
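
A usage sketch (the paths below are made up; the option letters come straight
from the program's built-in help):

	# print files that have xattrs or shared blocks, staying quiet about
	# directories that disappear mid-walk
	./src/xfsfind -a -s -q /mnt/scratch

	# print every regular file and directory under the mount
	./src/xfsfind -r -d /mnt/scratch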

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v24.1: apply some autoupdate love to the m4 macros
---
 configure.ac          |    5 +
 include/builddefs.in  |    4 +
 m4/package_libcdev.m4 |   44 +++++++
 m4/package_xfslibs.m4 |   15 +++
 src/Makefile          |   10 ++
 src/xfsfind.c         |  290 +++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 368 insertions(+)
 create mode 100644 src/xfsfind.c

diff --git a/configure.ac b/configure.ac
index cbf8377988..e92bd6b26d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -66,6 +66,11 @@ AC_PACKAGE_WANT_LINUX_FS_H
 AC_PACKAGE_WANT_LIBBTRFSUTIL
 
 AC_HAVE_COPY_FILE_RANGE
+AC_HAVE_SEEK_DATA
+AC_HAVE_BMV_OF_SHARED
+AC_HAVE_NFTW
+AC_HAVE_RLIMIT_NOFILE
+
 AC_CHECK_FUNCS([renameat2])
 AC_CHECK_FUNCS([reallocarray])
 AC_CHECK_TYPES([struct mount_attr], [], [], [[#include <linux/mount.h>]])
diff --git a/include/builddefs.in b/include/builddefs.in
index 6641209f81..dab10c968f 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -68,6 +68,10 @@ HAVE_FIEMAP = @have_fiemap@
 HAVE_FALLOCATE = @have_fallocate@
 HAVE_COPY_FILE_RANGE = @have_copy_file_range@
 HAVE_LIBBTRFSUTIL = @have_libbtrfsutil@
+HAVE_SEEK_DATA = @have_seek_data@
+HAVE_NFTW = @have_nftw@
+HAVE_BMV_OF_SHARED = @have_bmv_of_shared@
+HAVE_RLIMIT_NOFILE = @have_rlimit_nofile@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 5c76c0f73e..98572aecd9 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -110,3 +110,47 @@ AC_DEFUN([AC_HAVE_COPY_FILE_RANGE],
     AC_SUBST(have_copy_file_range)
   ])
 
+# Check if we have SEEK_DATA
+AC_DEFUN([AC_HAVE_SEEK_DATA],
+  [ AC_MSG_CHECKING([for SEEK_DATA])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <unistd.h>
+    ]], [[
+         lseek(-1, 0, SEEK_DATA);
+    ]])],[have_seek_data=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_seek_data)
+  ])
+
+# Check if we have nftw
+AC_DEFUN([AC_HAVE_NFTW],
+  [ AC_MSG_CHECKING([for nftw])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <stddef.h>
+#include <ftw.h>
+    ]], [[
+         nftw("/", (int (*)(const char *, const struct stat *, int, struct FTW *))1, 0, 0);
+    ]])],[have_nftw=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_nftw)
+  ])
+
+# Check if we have RLIMIT_NOFILE
+AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
+  [ AC_MSG_CHECKING([for RLIMIT_NOFILE])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <sys/time.h>
+#include <sys/resource.h>
+    ]], [[
+         struct rlimit rlimit;
+
+         rlimit.rlim_cur = 0;
+         getrlimit(RLIMIT_NOFILE, &rlimit);
+    ]])],[have_rlimit_nofile=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_rlimit_nofile)
+  ])
diff --git a/m4/package_xfslibs.m4 b/m4/package_xfslibs.m4
index 0746cd1dc5..8ef58cc064 100644
--- a/m4/package_xfslibs.m4
+++ b/m4/package_xfslibs.m4
@@ -104,3 +104,18 @@ AC_DEFUN([AC_PACKAGE_NEED_XFSCTL_MACRO],
         exit 1
       ])
   ])
+
+# Check if we have BMV_OF_SHARED from the GETBMAPX ioctl
+AC_DEFUN([AC_HAVE_BMV_OF_SHARED],
+  [ AC_MSG_CHECKING([for BMV_OF_SHARED])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <xfs/xfs.h>
+    ]], [[
+         struct getbmapx obj;
+         ioctl(-1, XFS_IOC_GETBMAPX, &obj);
+         obj.bmv_oflags |= BMV_OF_SHARED;
+    ]])],[have_bmv_of_shared=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(have_bmv_of_shared)
+  ])
diff --git a/src/Makefile b/src/Makefile
index f270015ce8..a574f7bd03 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -83,6 +83,16 @@ ifeq ($(HAVE_LIBCAP), true)
 LLDLIBS += -lcap
 endif
 
+ifeq ($(HAVE_SEEK_DATA), yes)
+ ifeq ($(HAVE_NFTW), yes)
+  ifeq ($(HAVE_BMV_OF_SHARED), yes)
+   ifeq ($(HAVE_RLIMIT_NOFILE), yes)
+     TARGETS += xfsfind
+   endif
+  endif
+ endif
+endif
+
 CFILES = $(TARGETS:=.c)
 LDIRT = $(TARGETS) fssum
 
diff --git a/src/xfsfind.c b/src/xfsfind.c
new file mode 100644
index 0000000000..6b0a93e793
--- /dev/null
+++ b/src/xfsfind.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * find(1) but with special predicates for finding XFS attributes.
+ * Copyright (C) 2022 Oracle.
+ */
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/types.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <ftw.h>
+#include <linux/fs.h>
+#include <xfs/xfs.h>
+
+#include "global.h"
+
+static int want_anyfile;
+static int want_datafile;
+static int want_attrfile;
+static int want_dir;
+static int want_regfile;
+static int want_sharedfile;
+static int report_errors = 1;
+
+static int
+check_datafile(
+	const char		*path,
+	int			fd)
+{
+	off_t			off;
+
+	off = lseek(fd, 0, SEEK_DATA);
+	if (off >= 0)
+		return 1;
+
+	if (errno == ENXIO)
+		return 0;
+
+	if (report_errors)
+		perror(path);
+
+	return -1;
+}
+
+static int
+check_attrfile(
+	const char		*path,
+	int			fd)
+{
+	struct fsxattr		fsx;
+	int			ret;
+
+	ret = ioctl(fd, XFS_IOC_FSGETXATTR, &fsx);
+	if (ret) {
+		if (report_errors)
+			perror(path);
+		return -1;
+	}
+
+	if (want_attrfile && (fsx.fsx_xflags & XFS_XFLAG_HASATTR))
+		return 1;
+	return 0;
+}
+
+#define BMAP_NR			33
+static struct getbmapx		bmaps[BMAP_NR];
+
+static int
+check_sharedfile(
+	const char		*path,
+	int			fd)
+{
+	struct getbmapx		*key = &bmaps[0];
+	unsigned int		i;
+	int			ret;
+
+	memset(key, 0, sizeof(struct getbmapx));
+	key->bmv_length = ULLONG_MAX;
+	/* no holes and don't flush dirty pages */
+	key->bmv_iflags = BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
+	key->bmv_count = BMAP_NR;
+
+	while ((ret = ioctl(fd, XFS_IOC_GETBMAPX, bmaps)) == 0) {
+		struct getbmapx	*p = &bmaps[1];
+		xfs_off_t	new_off;
+
+		for (i = 0; i < key->bmv_entries; i++, p++) {
+			if (p->bmv_oflags & BMV_OF_SHARED)
+				return 1;
+		}
+
+		if (key->bmv_entries == 0)
+			break;
+		p = key + key->bmv_entries;
+		if (p->bmv_oflags & BMV_OF_LAST)
+			return 0;
+
+		new_off = p->bmv_offset + p->bmv_length;
+		key->bmv_length -= new_off - key->bmv_offset;
+		key->bmv_offset = new_off;
+	}
+	if (ret < 0) {
+		if (report_errors)
+			perror(path);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+print_help(
+	const char		*name)
+{
+	printf("Usage: %s [OPTIONS] path\n", name);
+	printf("\n");
+	printf("Print all file paths matching any of the given predicates.\n");
+	printf("\n");
+	printf("-a	Match files with xattrs.\n");
+	printf("-b	Match files with data blocks.\n");
+	printf("-d	Match directories.\n");
+	printf("-q	Ignore errors while walking directory tree.\n");
+	printf("-r	Match regular files.\n");
+	printf("-s	Match files with shared blocks.\n");
+	printf("\n");
+	printf("If no matching options are given, match all files found.\n");
+}
+
+static int
+visit(
+	const char		*path,
+	const struct stat	*sb,
+	int			typeflag,
+	struct FTW		*ftwbuf)
+{
+	int			printme = 1;
+	int			fd = -1;
+	int			retval = FTW_CONTINUE;
+
+	if (want_anyfile)
+		goto out;
+	if (want_regfile && typeflag == FTW_F)
+		goto out;
+	if (want_dir && typeflag == FTW_D)
+		goto out;
+
+	/*
+	 * We can only open directories and files; screen out everything else.
+	 * Note that nftw lies and reports FTW_F for device files, so check the
+	 * statbuf mode too.
+	 */
+	if (typeflag != FTW_F && typeflag != FTW_D) {
+		printme = 0;
+		goto out;
+	}
+
+	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode)) {
+		printme = 0;
+		goto out;
+	}
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		if (report_errors) {
+			perror(path);
+			return FTW_STOP;
+		}
+
+		return FTW_CONTINUE;
+	}
+
+	if (want_datafile && typeflag == FTW_F) {
+		int ret = check_datafile(path, fd);
+		if (ret < 0 && report_errors) {
+			printme = 0;
+			retval = FTW_STOP;
+			goto out_fd;
+		}
+
+		if (ret == 1)
+			goto out_fd;
+	}
+
+	if (want_attrfile) {
+		int ret = check_attrfile(path, fd);
+		if (ret < 0 && report_errors) {
+			printme = 0;
+			retval = FTW_STOP;
+			goto out_fd;
+		}
+
+		if (ret == 1)
+			goto out_fd;
+	}
+
+	if (want_sharedfile) {
+		int ret = check_sharedfile(path, fd);
+		if (ret < 0 && report_errors) {
+			printme = 0;
+			retval = FTW_STOP;
+			goto out_fd;
+		}
+
+		if (ret == 1)
+			goto out_fd;
+	}
+
+	printme = 0;
+out_fd:
+	close(fd);
+out:
+	if (printme)
+		printf("%s\n", path);
+	return retval;
+}
+
+static void
+handle_sigabrt(
+	int		signal,
+	siginfo_t	*info,
+	void		*ucontext)
+{
+	fprintf(stderr, "Signal %u, exiting.\n", signal);
+	exit(2);
+}
+
+int
+main(
+	int			argc,
+	char			*argv[])
+{
+	struct rlimit		rlimit;
+	struct sigaction	abrt = {
+		.sa_sigaction	= handle_sigabrt,
+		.sa_flags	= SA_SIGINFO,
+	};
+	int			c;
+	int			ret;
+
+	while ((c = getopt(argc, argv, "abdqrs")) >= 0) {
+		switch (c) {
+		case 'a':	want_attrfile = 1;   break;
+		case 'b':	want_datafile = 1;   break;
+		case 'd':	want_dir = 1;        break;
+		case 'q':	report_errors = 0;   break;
+		case 'r':	want_regfile = 1;    break;
+		case 's':	want_sharedfile = 1; break;
+		default:
+			print_help(argv[0]);
+			return 1;
+		}
+	}
+
+	ret = getrlimit(RLIMIT_NOFILE, &rlimit);
+	if (ret) {
+		perror("RLIMIT_NOFILE");
+		return 1;
+	}
+
+	if (!want_attrfile && !want_datafile && !want_dir && !want_regfile &&
+	    !want_sharedfile)
+		want_anyfile = 1;
+
+	/*
+	 * nftw is known to abort() if a directory it is walking disappears out
+	 * from under it.  Handle this with grace if the caller wants us to run
+	 * quietly.
+	 */
+	if (!report_errors) {
+		ret = sigaction(SIGABRT, &abrt, NULL);
+		if (ret) {
+			perror("SIGABRT handler");
+			return 1;
+		}
+	}
+
+	for (c = optind; c < argc; c++) {
+		ret = nftw(argv[c], visit, rlimit.rlim_cur - 5,
+				FTW_ACTIONRETVAL | FTW_CHDIR | FTW_MOUNT |
+				FTW_PHYS);
+		if (ret && report_errors) {
+			perror(argv[c]);
+			break;
+		}
+	}
+
+	if (ret)
+		return 1;
+	return 0;
+}


* [PATCH v24.1 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata
  2022-12-30 22:19   ` [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata Darrick J. Wong
  2023-02-05 13:04     ` Zorro Lang
@ 2023-02-07 17:02     ` Darrick J. Wong
  2023-02-07 18:45       ` Zorro Lang
  1 sibling, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2023-02-07 17:02 UTC (permalink / raw)
  To: zlang; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For each XFS_SCRUB_TYPE_* that looks at AG or filesystem metadata,
create a test that runs that scrubber in the foreground and fsstress in
the background.
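
As a sketch of the race only: fsstress churns the scratch fs in the background
while the foreground repeatedly invokes the kernel scrubber for one metadata
type, along the lines of

	$XFS_IO_PROG -c "scrub agf 0" $SCRATCH_MNT

for each AG.  The actual loop lives in the _scratch_xfs_stress_scrub helper
that these tests call; the command above only illustrates what an
-s "scrub agf %agno%" argument boils down to.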

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v24.1: rebase against newer upstream to get rid of merge conflicts
---
 common/quota      |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/782     |   37 +++++++++++++++++++++++++++++++
 tests/xfs/782.out |    2 ++
 tests/xfs/783     |   37 +++++++++++++++++++++++++++++++
 tests/xfs/783.out |    2 ++
 tests/xfs/784     |   37 +++++++++++++++++++++++++++++++
 tests/xfs/784.out |    2 ++
 tests/xfs/785     |   37 +++++++++++++++++++++++++++++++
 tests/xfs/785.out |    2 ++
 tests/xfs/786     |   38 +++++++++++++++++++++++++++++++
 tests/xfs/786.out |    2 ++
 tests/xfs/787     |   38 +++++++++++++++++++++++++++++++
 tests/xfs/787.out |    2 ++
 tests/xfs/788     |   38 +++++++++++++++++++++++++++++++
 tests/xfs/788.out |    2 ++
 tests/xfs/789     |   39 ++++++++++++++++++++++++++++++++
 tests/xfs/789.out |    2 ++
 tests/xfs/790     |   39 ++++++++++++++++++++++++++++++++
 tests/xfs/790.out |    2 ++
 tests/xfs/791     |   40 +++++++++++++++++++++++++++++++++
 tests/xfs/791.out |    2 ++
 tests/xfs/798     |   44 ++++++++++++++++++++++++++++++++++++
 tests/xfs/798.out |    2 ++
 tests/xfs/800     |   40 +++++++++++++++++++++++++++++++++
 tests/xfs/800.out |    2 ++
 tests/xfs/801     |   47 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/801.out |    2 ++
 tests/xfs/802     |   40 +++++++++++++++++++++++++++++++++
 tests/xfs/802.out |    2 ++
 tests/xfs/803     |   40 +++++++++++++++++++++++++++++++++
 tests/xfs/803.out |    2 ++
 tests/xfs/804     |   40 +++++++++++++++++++++++++++++++++
 tests/xfs/804.out |    2 ++
 tests/xfs/805     |   38 +++++++++++++++++++++++++++++++
 tests/xfs/805.out |    2 ++
 35 files changed, 767 insertions(+)
 create mode 100755 tests/xfs/782
 create mode 100644 tests/xfs/782.out
 create mode 100755 tests/xfs/783
 create mode 100644 tests/xfs/783.out
 create mode 100755 tests/xfs/784
 create mode 100644 tests/xfs/784.out
 create mode 100755 tests/xfs/785
 create mode 100644 tests/xfs/785.out
 create mode 100755 tests/xfs/786
 create mode 100644 tests/xfs/786.out
 create mode 100755 tests/xfs/787
 create mode 100644 tests/xfs/787.out
 create mode 100755 tests/xfs/788
 create mode 100644 tests/xfs/788.out
 create mode 100755 tests/xfs/789
 create mode 100644 tests/xfs/789.out
 create mode 100755 tests/xfs/790
 create mode 100644 tests/xfs/790.out
 create mode 100755 tests/xfs/791
 create mode 100644 tests/xfs/791.out
 create mode 100755 tests/xfs/798
 create mode 100644 tests/xfs/798.out
 create mode 100755 tests/xfs/800
 create mode 100644 tests/xfs/800.out
 create mode 100755 tests/xfs/801
 create mode 100644 tests/xfs/801.out
 create mode 100755 tests/xfs/802
 create mode 100644 tests/xfs/802.out
 create mode 100755 tests/xfs/803
 create mode 100644 tests/xfs/803.out
 create mode 100755 tests/xfs/804
 create mode 100644 tests/xfs/804.out
 create mode 100755 tests/xfs/805
 create mode 100644 tests/xfs/805.out

diff --git a/common/quota b/common/quota
index 24251d092a..96b8d04424 100644
--- a/common/quota
+++ b/common/quota
@@ -53,6 +53,70 @@ _require_xfs_quota()
     [ -n "$XFS_QUOTA_PROG" ] || _notrun "XFS quota user tools not installed"
 }
 
+# Check that a mounted fs has a particular type of quota accounting turned on.
+#
+# The first argument must be the data device of a mounted fs.  It must not be
+# the actual mountpath.
+#
+# The second argument is the quota type ('usrquota', 'grpquota', 'prjquota',
+# 'any', or 'all').
+_xfs_quota_acct_enabled()
+{
+	local dev="$1"
+	local qtype="$2"
+	local f_args=()
+	local any=
+
+	case "$qtype" in
+	"usrquota"|"uquota")	f_args=("-U");;
+	"grpquota"|"gquota")	f_args=("-G");;
+	"prjquota"|"pquota")	f_args=("-P");;
+	"all")			f_args=("-U" "-G" "-P");;
+	"any")			f_args=("-U" "-G" "-P"); any=1;;
+	*)			echo "$qtype: Unknown quota type."; return 1;;
+	esac
+
+	if [ "$any" = "1" ]; then
+		for arg in "${f_args[@]}"; do
+			$here/src/feature "$arg" "$dev" && return 0
+		done
+		return 1
+	fi
+
+	$here/src/feature "${f_args[@]}" "$dev"
+}
+
+# Require that a mounted fs has a particular type of quota turned on.  This
+# takes the same arguments as _xfs_quota_acct_enabled.  If the third argument is
+# '-u' (or is empty and dev is $SCRATCH_DEV) the fs will be unmounted on
+# failure.
+_require_xfs_quota_acct_enabled()
+{
+	local dev="$1"
+	local qtype="$2"
+	local umount="$3"
+	local fsname="$dev"
+
+	_xfs_quota_acct_enabled "$dev" "$qtype" && return 0
+
+	if [ -z "$umount" ] && [ "$dev" = "$SCRATCH_DEV" ]; then
+		umount="-u"
+	fi
+	test "$umount" = "-u" && umount "$dev" &>/dev/null
+
+	case "$dev" in
+	"$TEST_DEV")	fsname="test";;
+	"$SCRATCH_DEV")	fsname="scratch";;
+	esac
+
+	case "$qtype" in
+	"any")		qtype="any quotas";;
+	"all")		qtype="all quotas";;
+	esac
+
+	_notrun "$qtype: accounting not enabled on $fsname filesystem."
+}
+
 #
 # checks that xfs_quota can operate on foreign (non-xfs) filesystems
 # Skips check on xfs filesystems, old xfs_quota is fine there.
diff --git a/tests/xfs/782 b/tests/xfs/782
new file mode 100755
index 0000000000..4801eda4bd
--- /dev/null
+++ b/tests/xfs/782
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 782
+#
+# Race fsstress and superblock scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub sb %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/782.out b/tests/xfs/782.out
new file mode 100644
index 0000000000..6e378f0e53
--- /dev/null
+++ b/tests/xfs/782.out
@@ -0,0 +1,2 @@
+QA output created by 782
+Silence is golden
diff --git a/tests/xfs/783 b/tests/xfs/783
new file mode 100755
index 0000000000..379a9369e5
--- /dev/null
+++ b/tests/xfs/783
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 783
+#
+# Race fsstress and AGF scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub agf %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/783.out b/tests/xfs/783.out
new file mode 100644
index 0000000000..2522395956
--- /dev/null
+++ b/tests/xfs/783.out
@@ -0,0 +1,2 @@
+QA output created by 783
+Silence is golden
diff --git a/tests/xfs/784 b/tests/xfs/784
new file mode 100755
index 0000000000..2b89361c36
--- /dev/null
+++ b/tests/xfs/784
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 784
+#
+# Race fsstress and AGFL scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub agfl %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/784.out b/tests/xfs/784.out
new file mode 100644
index 0000000000..48d9b24dd0
--- /dev/null
+++ b/tests/xfs/784.out
@@ -0,0 +1,2 @@
+QA output created by 784
+Silence is golden
diff --git a/tests/xfs/785 b/tests/xfs/785
new file mode 100755
index 0000000000..34a13b058d
--- /dev/null
+++ b/tests/xfs/785
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 785
+#
+# Race fsstress and AGI scrub for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub agi %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/785.out b/tests/xfs/785.out
new file mode 100644
index 0000000000..6ecb0c61b3
--- /dev/null
+++ b/tests/xfs/785.out
@@ -0,0 +1,2 @@
+QA output created by 785
+Silence is golden
diff --git a/tests/xfs/786 b/tests/xfs/786
new file mode 100755
index 0000000000..157200ea8c
--- /dev/null
+++ b/tests/xfs/786
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 786
+#
+# Race fsstress and freespace by block btree scrub for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub bnobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/786.out b/tests/xfs/786.out
new file mode 100644
index 0000000000..ccb9167df9
--- /dev/null
+++ b/tests/xfs/786.out
@@ -0,0 +1,2 @@
+QA output created by 786
+Silence is golden
diff --git a/tests/xfs/787 b/tests/xfs/787
new file mode 100755
index 0000000000..91eaf5a7af
--- /dev/null
+++ b/tests/xfs/787
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 787
+#
+# Race fsstress and free space by length btree scrub for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub cntbt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/787.out b/tests/xfs/787.out
new file mode 100644
index 0000000000..fa7f038120
--- /dev/null
+++ b/tests/xfs/787.out
@@ -0,0 +1,2 @@
+QA output created by 787
+Silence is golden
diff --git a/tests/xfs/788 b/tests/xfs/788
new file mode 100755
index 0000000000..f1369e5309
--- /dev/null
+++ b/tests/xfs/788
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 788
+#
+# Race fsstress and inode btree scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -x 'dir' -s "scrub inobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/788.out b/tests/xfs/788.out
new file mode 100644
index 0000000000..5ddd661113
--- /dev/null
+++ b/tests/xfs/788.out
@@ -0,0 +1,2 @@
+QA output created by 788
+Silence is golden
diff --git a/tests/xfs/789 b/tests/xfs/789
new file mode 100755
index 0000000000..550ff2c690
--- /dev/null
+++ b/tests/xfs/789
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 789
+#
+# Race fsstress and free inode btree scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" finobt
+_scratch_xfs_stress_scrub -x 'dir' -s "scrub finobt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/789.out b/tests/xfs/789.out
new file mode 100644
index 0000000000..da88fc99cb
--- /dev/null
+++ b/tests/xfs/789.out
@@ -0,0 +1,2 @@
+QA output created by 789
+Silence is golden
diff --git a/tests/xfs/790 b/tests/xfs/790
new file mode 100755
index 0000000000..c4e5779ef7
--- /dev/null
+++ b/tests/xfs/790
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 790
+#
+# Race fsstress and reverse mapping btree scrub for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" rmapbt
+_scratch_xfs_stress_scrub -s "scrub rmapbt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/790.out b/tests/xfs/790.out
new file mode 100644
index 0000000000..7102c590f0
--- /dev/null
+++ b/tests/xfs/790.out
@@ -0,0 +1,2 @@
+QA output created by 790
+Silence is golden
diff --git a/tests/xfs/791 b/tests/xfs/791
new file mode 100755
index 0000000000..6939d910c9
--- /dev/null
+++ b/tests/xfs/791
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 791
+#
+# Race fsstress and reference count btree scrub for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" reflink
+_scratch_xfs_stress_scrub -s "scrub refcountbt %agno%"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/791.out b/tests/xfs/791.out
new file mode 100644
index 0000000000..758905371d
--- /dev/null
+++ b/tests/xfs/791.out
@@ -0,0 +1,2 @@
+QA output created by 791
+Silence is golden
diff --git a/tests/xfs/798 b/tests/xfs/798
new file mode 100755
index 0000000000..c5bdfad50a
--- /dev/null
+++ b/tests/xfs/798
@@ -0,0 +1,44 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 798
+#
+# Race fsstress and fscounter scrub on the realtime device for a while to see
+# if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+
+# Force all files to be allocated on the realtime device
+_xfs_force_bdev realtime $SCRATCH_MNT
+
+_scratch_xfs_stress_scrub -s 'scrub fscounters'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/798.out b/tests/xfs/798.out
new file mode 100644
index 0000000000..216d6e93f4
--- /dev/null
+++ b/tests/xfs/798.out
@@ -0,0 +1,2 @@
+QA output created by 798
+Silence is golden
diff --git a/tests/xfs/800 b/tests/xfs/800
new file mode 100755
index 0000000000..cbcfb5f5a6
--- /dev/null
+++ b/tests/xfs/800
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 800
+#
+# Race fsstress and realtime bitmap scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+_scratch_xfs_stress_scrub -s "scrub rtbitmap"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/800.out b/tests/xfs/800.out
new file mode 100644
index 0000000000..bdfaa2cecd
--- /dev/null
+++ b/tests/xfs/800.out
@@ -0,0 +1,2 @@
+QA output created by 800
+Silence is golden
diff --git a/tests/xfs/801 b/tests/xfs/801
new file mode 100755
index 0000000000..a51fab523b
--- /dev/null
+++ b/tests/xfs/801
@@ -0,0 +1,47 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 801
+#
+# Race fsstress and realtime summary scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_realtime
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" realtime
+
+# XXX the realtime summary scrubber isn't currently implemented upstream.
+# Don't bother trying to test it on kernels that lack it.
+$XFS_IO_PROG -c 'scrub rtsummary' -c 'scrub rtsummary' "$SCRATCH_MNT" 2>&1 | \
+	grep -q 'Scan was not complete' && \
+	_notrun "rtsummary scrub is incomplete"
+
+_scratch_xfs_stress_scrub -s "scrub rtsummary"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/801.out b/tests/xfs/801.out
new file mode 100644
index 0000000000..39481b38e2
--- /dev/null
+++ b/tests/xfs/801.out
@@ -0,0 +1,2 @@
+QA output created by 801
+Silence is golden
diff --git a/tests/xfs/802 b/tests/xfs/802
new file mode 100755
index 0000000000..1f3b83882e
--- /dev/null
+++ b/tests/xfs/802
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 802
+#
+# Race fsstress and user quota scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
+_scratch_xfs_stress_scrub -s "scrub usrquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/802.out b/tests/xfs/802.out
new file mode 100644
index 0000000000..a69c05391f
--- /dev/null
+++ b/tests/xfs/802.out
@@ -0,0 +1,2 @@
+QA output created by 802
+Silence is golden
diff --git a/tests/xfs/803 b/tests/xfs/803
new file mode 100755
index 0000000000..b2bb85672d
--- /dev/null
+++ b/tests/xfs/803
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 803
+#
+# Race fsstress and group quota scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
+_scratch_xfs_stress_scrub -s "scrub grpquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/803.out b/tests/xfs/803.out
new file mode 100644
index 0000000000..38ba741d0f
--- /dev/null
+++ b/tests/xfs/803.out
@@ -0,0 +1,2 @@
+QA output created by 803
+Silence is golden
diff --git a/tests/xfs/804 b/tests/xfs/804
new file mode 100755
index 0000000000..129724eb11
--- /dev/null
+++ b/tests/xfs/804
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 804
+#
+# Race fsstress and project quota scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+. ./common/quota
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
+_scratch_xfs_stress_scrub -s "scrub prjquota"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/804.out b/tests/xfs/804.out
new file mode 100644
index 0000000000..5e0cb437e7
--- /dev/null
+++ b/tests/xfs/804.out
@@ -0,0 +1,2 @@
+QA output created by 804
+Silence is golden
diff --git a/tests/xfs/805 b/tests/xfs/805
new file mode 100755
index 0000000000..aca9b9cdf4
--- /dev/null
+++ b/tests/xfs/805
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
+#
+# FS QA Test No. 805
+#
+# Race fsstress and summary counters scrub for a while to see if we crash or
+# livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -s "scrub fscounters"
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/805.out b/tests/xfs/805.out
new file mode 100644
index 0000000000..ac324c5874
--- /dev/null
+++ b/tests/xfs/805.out
@@ -0,0 +1,2 @@
+QA output created by 805
+Silence is golden

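(Not part of the patch itself: a quick sketch of how these new tests might be
driven locally.  The group name comes from _begin_fstest above; the
XFS_SCRUB_STRESS_* knobs are the ones documented for common/fuzzy later in this
series, and the values shown here are only examples.)

	# run every fsstress-vs-scrub stress test in the new group
	./check -g dangerous_fsstress_scrub

	# run one test, delaying the scrub loop and cycling mounts periodically
	XFS_SCRUB_STRESS_DELAY=5 XFS_SCRUB_STRESS_REMOUNT_PERIOD=10 ./check xfs/786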

* Re: [PATCH v24.1 3/5] fuzzy: add a custom xfs find utility for scrub stress tests
  2023-02-07 17:01     ` [PATCH v24.1 " Darrick J. Wong
@ 2023-02-07 18:42       ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-07 18:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Tue, Feb 07, 2023 at 09:01:41AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a new find(1)-like utility that doesn't crash on directory tree
> changes (like find does due to bugs in its loop detector) and actually
> implements the custom xfs attribute predicates that we need for scrub
> stress tests.  This program will be needed for a future patch where we
> add stress tests for scrub and repair of file metadata.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> v24.1: apply some autoupdate love to the m4 macros
> ---

This version looks better to me.  I'll merge it with a small change to the
.gitignore file to filter out src/xfsfind.

Reviewed-by: Zorro Lang <zlang@redhat.com>

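(For reference, the .gitignore tweak mentioned above amounts to something like
the following; the exact entry and its placement follow whatever the file
already uses:)

	echo '/src/xfsfind' >> .gitignore
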
>  configure.ac          |    5 +
>  include/builddefs.in  |    4 +
>  m4/package_libcdev.m4 |   44 +++++++
>  m4/package_xfslibs.m4 |   15 +++
>  src/Makefile          |   10 ++
>  src/xfsfind.c         |  290 +++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 368 insertions(+)
>  create mode 100644 src/xfsfind.c
> 
> diff --git a/configure.ac b/configure.ac
> index cbf8377988..e92bd6b26d 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -66,6 +66,11 @@ AC_PACKAGE_WANT_LINUX_FS_H
>  AC_PACKAGE_WANT_LIBBTRFSUTIL
>  
>  AC_HAVE_COPY_FILE_RANGE
> +AC_HAVE_SEEK_DATA
> +AC_HAVE_BMV_OF_SHARED
> +AC_HAVE_NFTW
> +AC_HAVE_RLIMIT_NOFILE
> +
>  AC_CHECK_FUNCS([renameat2])
>  AC_CHECK_FUNCS([reallocarray])
>  AC_CHECK_TYPES([struct mount_attr], [], [], [[#include <linux/mount.h>]])
> diff --git a/include/builddefs.in b/include/builddefs.in
> index 6641209f81..dab10c968f 100644
> --- a/include/builddefs.in
> +++ b/include/builddefs.in
> @@ -68,6 +68,10 @@ HAVE_FIEMAP = @have_fiemap@
>  HAVE_FALLOCATE = @have_fallocate@
>  HAVE_COPY_FILE_RANGE = @have_copy_file_range@
>  HAVE_LIBBTRFSUTIL = @have_libbtrfsutil@
> +HAVE_SEEK_DATA = @have_seek_data@
> +HAVE_NFTW = @have_nftw@
> +HAVE_BMV_OF_SHARED = @have_bmv_of_shared@
> +HAVE_RLIMIT_NOFILE = @have_rlimit_nofile@
>  
>  GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
>  
> diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
> index 5c76c0f73e..98572aecd9 100644
> --- a/m4/package_libcdev.m4
> +++ b/m4/package_libcdev.m4
> @@ -110,3 +110,47 @@ AC_DEFUN([AC_HAVE_COPY_FILE_RANGE],
>      AC_SUBST(have_copy_file_range)
>    ])
>  
> +# Check if we have SEEK_DATA
> +AC_DEFUN([AC_HAVE_SEEK_DATA],
> +  [ AC_MSG_CHECKING([for SEEK_DATA])
> +    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
> +#define _GNU_SOURCE
> +#include <sys/types.h>
> +#include <unistd.h>
> +    ]], [[
> +         lseek(-1, 0, SEEK_DATA);
> +    ]])],[have_seek_data=yes
> +       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
> +    AC_SUBST(have_seek_data)
> +  ])
> +
> +# Check if we have nftw
> +AC_DEFUN([AC_HAVE_NFTW],
> +  [ AC_MSG_CHECKING([for nftw])
> +    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
> +#define _GNU_SOURCE
> +#include <stddef.h>
> +#include <ftw.h>
> +    ]], [[
> +         nftw("/", (int (*)(const char *, const struct stat *, int, struct FTW *))1, 0, 0);
> +    ]])],[have_nftw=yes
> +       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
> +    AC_SUBST(have_nftw)
> +  ])
> +
> +# Check if we have RLIMIT_NOFILE
> +AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
> +  [ AC_MSG_CHECKING([for RLIMIT_NOFILE])
> +    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
> +#define _GNU_SOURCE
> +#include <sys/time.h>
> +#include <sys/resource.h>
> +    ]], [[
> +         struct rlimit rlimit;
> +
> +         rlimit.rlim_cur = 0;
> +         getrlimit(RLIMIT_NOFILE, &rlimit);
> +    ]])],[have_rlimit_nofile=yes
> +       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
> +    AC_SUBST(have_rlimit_nofile)
> +  ])
> diff --git a/m4/package_xfslibs.m4 b/m4/package_xfslibs.m4
> index 0746cd1dc5..8ef58cc064 100644
> --- a/m4/package_xfslibs.m4
> +++ b/m4/package_xfslibs.m4
> @@ -104,3 +104,18 @@ AC_DEFUN([AC_PACKAGE_NEED_XFSCTL_MACRO],
>          exit 1
>        ])
>    ])
> +
> +# Check if we have BMV_OF_SHARED from the GETBMAPX ioctl
> +AC_DEFUN([AC_HAVE_BMV_OF_SHARED],
> +  [ AC_MSG_CHECKING([for BMV_OF_SHARED])
> +    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
> +#define _GNU_SOURCE
> +#include <xfs/xfs.h>
> +    ]], [[
> +         struct getbmapx obj;
> +         ioctl(-1, XFS_IOC_GETBMAPX, &obj);
> +         obj.bmv_oflags |= BMV_OF_SHARED;
> +    ]])],[have_bmv_of_shared=yes
> +       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
> +    AC_SUBST(have_bmv_of_shared)
> +  ])
> diff --git a/src/Makefile b/src/Makefile
> index f270015ce8..a574f7bd03 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -83,6 +83,16 @@ ifeq ($(HAVE_LIBCAP), true)
>  LLDLIBS += -lcap
>  endif
>  
> +ifeq ($(HAVE_SEEK_DATA), yes)
> + ifeq ($(HAVE_NFTW), yes)
> +  ifeq ($(HAVE_BMV_OF_SHARED), yes)
> +   ifeq ($(HAVE_RLIMIT_NOFILE), yes)
> +     TARGETS += xfsfind
> +   endif
> +  endif
> + endif
> +endif
> +
>  CFILES = $(TARGETS:=.c)
>  LDIRT = $(TARGETS) fssum
>  
> diff --git a/src/xfsfind.c b/src/xfsfind.c
> new file mode 100644
> index 0000000000..6b0a93e793
> --- /dev/null
> +++ b/src/xfsfind.c
> @@ -0,0 +1,290 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * find(1) but with special predicates for finding XFS attributes.
> + * Copyright (C) 2022 Oracle.
> + */
> +#include <sys/time.h>
> +#include <sys/resource.h>
> +#include <sys/types.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <ftw.h>
> +#include <linux/fs.h>
> +#include <xfs/xfs.h>
> +
> +#include "global.h"
> +
> +static int want_anyfile;
> +static int want_datafile;
> +static int want_attrfile;
> +static int want_dir;
> +static int want_regfile;
> +static int want_sharedfile;
> +static int report_errors = 1;
> +
> +static int
> +check_datafile(
> +	const char		*path,
> +	int			fd)
> +{
> +	off_t			off;
> +
> +	off = lseek(fd, 0, SEEK_DATA);
> +	if (off >= 0)
> +		return 1;
> +
> +	if (errno == ENXIO)
> +		return 0;
> +
> +	if (report_errors)
> +		perror(path);
> +
> +	return -1;
> +}
> +
> +static int
> +check_attrfile(
> +	const char		*path,
> +	int			fd)
> +{
> +	struct fsxattr		fsx;
> +	int			ret;
> +
> +	ret = ioctl(fd, XFS_IOC_FSGETXATTR, &fsx);
> +	if (ret) {
> +		if (report_errors)
> +			perror(path);
> +		return -1;
> +	}
> +
> +	if (want_attrfile && (fsx.fsx_xflags & XFS_XFLAG_HASATTR))
> +		return 1;
> +	return 0;
> +}
> +
> +#define BMAP_NR			33
> +static struct getbmapx		bmaps[BMAP_NR];
> +
> +static int
> +check_sharedfile(
> +	const char		*path,
> +	int			fd)
> +{
> +	struct getbmapx		*key = &bmaps[0];
> +	unsigned int		i;
> +	int			ret;
> +
> +	memset(key, 0, sizeof(struct getbmapx));
> +	key->bmv_length = ULLONG_MAX;
> +	/* no holes and don't flush dirty pages */
> +	key->bmv_iflags = BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
> +	key->bmv_count = BMAP_NR;
> +
> +	while ((ret = ioctl(fd, XFS_IOC_GETBMAPX, bmaps)) == 0) {
> +		struct getbmapx	*p = &bmaps[1];
> +		xfs_off_t	new_off;
> +
> +		for (i = 0; i < key->bmv_entries; i++, p++) {
> +			if (p->bmv_oflags & BMV_OF_SHARED)
> +				return 1;
> +		}
> +
> +		if (key->bmv_entries == 0)
> +			break;
> +		p = key + key->bmv_entries;
> +		if (p->bmv_oflags & BMV_OF_LAST)
> +			return 0;
> +
> +		new_off = p->bmv_offset + p->bmv_length;
> +		key->bmv_length -= new_off - key->bmv_offset;
> +		key->bmv_offset = new_off;
> +	}
> +	if (ret < 0) {
> +		if (report_errors)
> +			perror(path);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void
> +print_help(
> +	const char		*name)
> +{
> +	printf("Usage: %s [OPTIONS] path\n", name);
> +	printf("\n");
> +	printf("Print all file paths matching any of the given predicates.\n");
> +	printf("\n");
> +	printf("-a	Match files with xattrs.\n");
> +	printf("-b	Match files with data blocks.\n");
> +	printf("-d	Match directories.\n");
> +	printf("-q	Ignore errors while walking directory tree.\n");
> +	printf("-r	Match regular files.\n");
> +	printf("-s	Match files with shared blocks.\n");
> +	printf("\n");
> +	printf("If no matching options are given, match all files found.\n");
> +}
> +
> +static int
> +visit(
> +	const char		*path,
> +	const struct stat	*sb,
> +	int			typeflag,
> +	struct FTW		*ftwbuf)
> +{
> +	int			printme = 1;
> +	int			fd = -1;
> +	int			retval = FTW_CONTINUE;
> +
> +	if (want_anyfile)
> +		goto out;
> +	if (want_regfile && typeflag == FTW_F)
> +		goto out;
> +	if (want_dir && typeflag == FTW_D)
> +		goto out;
> +
> +	/*
> +	 * We can only open directories and files; screen out everything else.
> +	 * Note that nftw lies and reports FTW_F for device files, so check the
> +	 * statbuf mode too.
> +	 */
> +	if (typeflag != FTW_F && typeflag != FTW_D) {
> +		printme = 0;
> +		goto out;
> +	}
> +
> +	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode)) {
> +		printme = 0;
> +		goto out;
> +	}
> +
> +	fd = open(path, O_RDONLY);
> +	if (fd < 0) {
> +		if (report_errors) {
> +			perror(path);
> +			return FTW_STOP;
> +		}
> +
> +		return FTW_CONTINUE;
> +	}
> +
> +	if (want_datafile && typeflag == FTW_F) {
> +		int ret = check_datafile(path, fd);
> +		if (ret < 0 && report_errors) {
> +			printme = 0;
> +			retval = FTW_STOP;
> +			goto out_fd;
> +		}
> +
> +		if (ret == 1)
> +			goto out_fd;
> +	}
> +
> +	if (want_attrfile) {
> +		int ret = check_attrfile(path, fd);
> +		if (ret < 0 && report_errors) {
> +			printme = 0;
> +			retval = FTW_STOP;
> +			goto out_fd;
> +		}
> +
> +		if (ret == 1)
> +			goto out_fd;
> +	}
> +
> +	if (want_sharedfile) {
> +		int ret = check_sharedfile(path, fd);
> +		if (ret < 0 && report_errors) {
> +			printme = 0;
> +			retval = FTW_STOP;
> +			goto out_fd;
> +		}
> +
> +		if (ret == 1)
> +			goto out_fd;
> +	}
> +
> +	printme = 0;
> +out_fd:
> +	close(fd);
> +out:
> +	if (printme)
> +		printf("%s\n", path);
> +	return retval;
> +}
> +
> +static void
> +handle_sigabrt(
> +	int		signal,
> +	siginfo_t	*info,
> +	void		*ucontext)
> +{
> +	fprintf(stderr, "Signal %u, exiting.\n", signal);
> +	exit(2);
> +}
> +
> +int
> +main(
> +	int			argc,
> +	char			*argv[])
> +{
> +	struct rlimit		rlimit;
> +	struct sigaction	abrt = {
> +		.sa_sigaction	= handle_sigabrt,
> +		.sa_flags	= SA_SIGINFO,
> +	};
> +	int			c;
> +	int			ret;
> +
> +	while ((c = getopt(argc, argv, "abdqrs")) >= 0) {
> +		switch (c) {
> +		case 'a':	want_attrfile = 1;   break;
> +		case 'b':	want_datafile = 1;   break;
> +		case 'd':	want_dir = 1;        break;
> +		case 'q':	report_errors = 0;   break;
> +		case 'r':	want_regfile = 1;    break;
> +		case 's':	want_sharedfile = 1; break;
> +		default:
> +			print_help(argv[0]);
> +			return 1;
> +		}
> +	}
> +
> +	ret = getrlimit(RLIMIT_NOFILE, &rlimit);
> +	if (ret) {
> +		perror("RLIMIT_NOFILE");
> +		return 1;
> +	}
> +
> +	if (!want_attrfile && !want_datafile && !want_dir && !want_regfile &&
> +	    !want_sharedfile)
> +		want_anyfile = 1;
> +
> +	/*
> +	 * nftw is known to abort() if a directory it is walking disappears out
> +	 * from under it.  Handle this with grace if the caller wants us to run
> +	 * quietly.
> +	 */
> +	if (!report_errors) {
> +		ret = sigaction(SIGABRT, &abrt, NULL);
> +		if (ret) {
> +			perror("SIGABRT handler");
> +			return 1;
> +		}
> +	}
> +
> +	for (c = optind; c < argc; c++) {
> +		ret = nftw(argv[c], visit, rlimit.rlim_cur - 5,
> +				FTW_ACTIONRETVAL | FTW_CHDIR | FTW_MOUNT |
> +				FTW_PHYS);
> +		if (ret && report_errors) {
> +			perror(argv[c]);
> +			break;
> +		}
> +	}
> +
> +	if (ret)
> +		return 1;
> +	return 0;
> +}
> 
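
(A usage sketch for the new helper, assuming it is built as src/xfsfind per the
Makefile hunk above; the flags are the ones listed in print_help():)

	# quietly print paths carrying xattrs or shared blocks, tolerating
	# concurrent directory tree changes from fsstress
	./src/xfsfind -q -a -s $SCRATCH_MNT

	# with no predicate flags, every file and directory found is printed
	./src/xfsfind $SCRATCH_MNT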



* Re: [PATCH v24.1 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata
  2023-02-07 17:02     ` [PATCH v24.1 " Darrick J. Wong
@ 2023-02-07 18:45       ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-07 18:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Tue, Feb 07, 2023 at 09:02:12AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> For each XFS_SCRUB_TYPE_* that looks at AG or filesystem metadata,
> create a test that runs that scrubber in the foreground and fsstress in
> the background.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> v24.1: rebase against newer upstream to get rid of merge conflicts
> ---

This version looks good to me.

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  common/quota      |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/782     |   37 +++++++++++++++++++++++++++++++
>  tests/xfs/782.out |    2 ++
>  tests/xfs/783     |   37 +++++++++++++++++++++++++++++++
>  tests/xfs/783.out |    2 ++
>  tests/xfs/784     |   37 +++++++++++++++++++++++++++++++
>  tests/xfs/784.out |    2 ++
>  tests/xfs/785     |   37 +++++++++++++++++++++++++++++++
>  tests/xfs/785.out |    2 ++
>  tests/xfs/786     |   38 +++++++++++++++++++++++++++++++
>  tests/xfs/786.out |    2 ++
>  tests/xfs/787     |   38 +++++++++++++++++++++++++++++++
>  tests/xfs/787.out |    2 ++
>  tests/xfs/788     |   38 +++++++++++++++++++++++++++++++
>  tests/xfs/788.out |    2 ++
>  tests/xfs/789     |   39 ++++++++++++++++++++++++++++++++
>  tests/xfs/789.out |    2 ++
>  tests/xfs/790     |   39 ++++++++++++++++++++++++++++++++
>  tests/xfs/790.out |    2 ++
>  tests/xfs/791     |   40 +++++++++++++++++++++++++++++++++
>  tests/xfs/791.out |    2 ++
>  tests/xfs/798     |   44 ++++++++++++++++++++++++++++++++++++
>  tests/xfs/798.out |    2 ++
>  tests/xfs/800     |   40 +++++++++++++++++++++++++++++++++
>  tests/xfs/800.out |    2 ++
>  tests/xfs/801     |   47 +++++++++++++++++++++++++++++++++++++++
>  tests/xfs/801.out |    2 ++
>  tests/xfs/802     |   40 +++++++++++++++++++++++++++++++++
>  tests/xfs/802.out |    2 ++
>  tests/xfs/803     |   40 +++++++++++++++++++++++++++++++++
>  tests/xfs/803.out |    2 ++
>  tests/xfs/804     |   40 +++++++++++++++++++++++++++++++++
>  tests/xfs/804.out |    2 ++
>  tests/xfs/805     |   38 +++++++++++++++++++++++++++++++
>  tests/xfs/805.out |    2 ++
>  35 files changed, 767 insertions(+)
>  create mode 100755 tests/xfs/782
>  create mode 100644 tests/xfs/782.out
>  create mode 100755 tests/xfs/783
>  create mode 100644 tests/xfs/783.out
>  create mode 100755 tests/xfs/784
>  create mode 100644 tests/xfs/784.out
>  create mode 100755 tests/xfs/785
>  create mode 100644 tests/xfs/785.out
>  create mode 100755 tests/xfs/786
>  create mode 100644 tests/xfs/786.out
>  create mode 100755 tests/xfs/787
>  create mode 100644 tests/xfs/787.out
>  create mode 100755 tests/xfs/788
>  create mode 100644 tests/xfs/788.out
>  create mode 100755 tests/xfs/789
>  create mode 100644 tests/xfs/789.out
>  create mode 100755 tests/xfs/790
>  create mode 100644 tests/xfs/790.out
>  create mode 100755 tests/xfs/791
>  create mode 100644 tests/xfs/791.out
>  create mode 100755 tests/xfs/798
>  create mode 100644 tests/xfs/798.out
>  create mode 100755 tests/xfs/800
>  create mode 100644 tests/xfs/800.out
>  create mode 100755 tests/xfs/801
>  create mode 100644 tests/xfs/801.out
>  create mode 100755 tests/xfs/802
>  create mode 100644 tests/xfs/802.out
>  create mode 100755 tests/xfs/803
>  create mode 100644 tests/xfs/803.out
>  create mode 100755 tests/xfs/804
>  create mode 100644 tests/xfs/804.out
>  create mode 100755 tests/xfs/805
>  create mode 100644 tests/xfs/805.out
> 
> diff --git a/common/quota b/common/quota
> index 24251d092a..96b8d04424 100644
> --- a/common/quota
> +++ b/common/quota
> @@ -53,6 +53,70 @@ _require_xfs_quota()
>      [ -n "$XFS_QUOTA_PROG" ] || _notrun "XFS quota user tools not installed"
>  }
>  
> +# Check that a mounted fs has a particular type of quota accounting turned on.
> +#
> +# The first argument must be the data device of a mounted fs.  It must not be
> +# the actual mountpath.
> +#
> +# The second argument is the quota type ('usrquota', 'grpquota', 'prjquota',
> +# 'any', or 'all').
> +_xfs_quota_acct_enabled()
> +{
> +	local dev="$1"
> +	local qtype="$2"
> +	local f_args=()
> +	local any=
> +
> +	case "$qtype" in
> +	"usrquota"|"uquota")	f_args=("-U");;
> +	"grpquota"|"gquota")	f_args=("-G");;
> +	"prjquota"|"pquota")	f_args=("-P");;
> +	"all")			f_args=("-U" "-G" "-P");;
> +	"any")			f_args=("-U" "-G" "-P"); any=1;;
> +	*)			echo "$qtype: Unknown quota type."; return 1;;
> +	esac
> +
> +	if [ "$any" = "1" ]; then
> +		for arg in "${f_args[@]}"; do
> +			$here/src/feature "$arg" "$dev" && return 0
> +		done
> +		return 1
> +	fi
> +
> +	$here/src/feature "${f_args[@]}" "$dev"
> +}
> +
> +# Require that a mounted fs has a particular type of quota turned on.  This
> +# takes the same arguments as _xfs_quota_acct_enabled.  If the third argument is
> +# '-u' (or is empty and dev is $SCRATCH_DEV) the fs will be unmounted on
> +# failure.
> +_require_xfs_quota_acct_enabled()
> +{
> +	local dev="$1"
> +	local qtype="$2"
> +	local umount="$3"
> +	local fsname="$dev"
> +
> +	_xfs_quota_acct_enabled "$dev" "$qtype" && return 0
> +
> +	if [ -z "$umount" ] && [ "$dev" = "$SCRATCH_DEV" ]; then
> +		umount="-u"
> +	fi
> +	test "$umount" = "-u" && umount "$dev" &>/dev/null
> +
> +	case "$dev" in
> +	"$TEST_DEV")	fsname="test";;
> +	"$SCRATCH_DEV")	fsname="scratch";;
> +	esac
> +
> +	case "$qtype" in
> +	"any")		qtype="any quotas";;
> +	"all")		qtype="all quotas";;
> +	esac
> +
> +	_notrun "$qtype: accounting not enabled on $fsname filesystem."
> +}
> +
>  #
>  # checks that xfs_quota can operate on foreign (non-xfs) filesystems
>  # Skips check on xfs filesystems, old xfs_quota is fine there.
> diff --git a/tests/xfs/782 b/tests/xfs/782
> new file mode 100755
> index 0000000000..4801eda4bd
> --- /dev/null
> +++ b/tests/xfs/782
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 782
> +#
> +# Race fsstress and superblock scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub sb %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/782.out b/tests/xfs/782.out
> new file mode 100644
> index 0000000000..6e378f0e53
> --- /dev/null
> +++ b/tests/xfs/782.out
> @@ -0,0 +1,2 @@
> +QA output created by 782
> +Silence is golden
> diff --git a/tests/xfs/783 b/tests/xfs/783
> new file mode 100755
> index 0000000000..379a9369e5
> --- /dev/null
> +++ b/tests/xfs/783
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 783
> +#
> +# Race fsstress and AGF scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub agf %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/783.out b/tests/xfs/783.out
> new file mode 100644
> index 0000000000..2522395956
> --- /dev/null
> +++ b/tests/xfs/783.out
> @@ -0,0 +1,2 @@
> +QA output created by 783
> +Silence is golden
> diff --git a/tests/xfs/784 b/tests/xfs/784
> new file mode 100755
> index 0000000000..2b89361c36
> --- /dev/null
> +++ b/tests/xfs/784
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 784
> +#
> +# Race fsstress and AGFL scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub agfl %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/784.out b/tests/xfs/784.out
> new file mode 100644
> index 0000000000..48d9b24dd0
> --- /dev/null
> +++ b/tests/xfs/784.out
> @@ -0,0 +1,2 @@
> +QA output created by 784
> +Silence is golden
> diff --git a/tests/xfs/785 b/tests/xfs/785
> new file mode 100755
> index 0000000000..34a13b058d
> --- /dev/null
> +++ b/tests/xfs/785
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 785
> +#
> +# Race fsstress and AGI scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub agi %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/785.out b/tests/xfs/785.out
> new file mode 100644
> index 0000000000..6ecb0c61b3
> --- /dev/null
> +++ b/tests/xfs/785.out
> @@ -0,0 +1,2 @@
> +QA output created by 785
> +Silence is golden
> diff --git a/tests/xfs/786 b/tests/xfs/786
> new file mode 100755
> index 0000000000..157200ea8c
> --- /dev/null
> +++ b/tests/xfs/786
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 786
> +#
> +# Race fsstress and free space by block btree scrub for a while to see if we
> +# crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub bnobt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/786.out b/tests/xfs/786.out
> new file mode 100644
> index 0000000000..ccb9167df9
> --- /dev/null
> +++ b/tests/xfs/786.out
> @@ -0,0 +1,2 @@
> +QA output created by 786
> +Silence is golden
> diff --git a/tests/xfs/787 b/tests/xfs/787
> new file mode 100755
> index 0000000000..91eaf5a7af
> --- /dev/null
> +++ b/tests/xfs/787
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 787
> +#
> +# Race fsstress and free space by length btree scrub for a while to see if we
> +# crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub cntbt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/787.out b/tests/xfs/787.out
> new file mode 100644
> index 0000000000..fa7f038120
> --- /dev/null
> +++ b/tests/xfs/787.out
> @@ -0,0 +1,2 @@
> +QA output created by 787
> +Silence is golden
> diff --git a/tests/xfs/788 b/tests/xfs/788
> new file mode 100755
> index 0000000000..f1369e5309
> --- /dev/null
> +++ b/tests/xfs/788
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 788
> +#
> +# Race fsstress and inode btree scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -x 'dir' -s "scrub inobt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/788.out b/tests/xfs/788.out
> new file mode 100644
> index 0000000000..5ddd661113
> --- /dev/null
> +++ b/tests/xfs/788.out
> @@ -0,0 +1,2 @@
> +QA output created by 788
> +Silence is golden
> diff --git a/tests/xfs/789 b/tests/xfs/789
> new file mode 100755
> index 0000000000..550ff2c690
> --- /dev/null
> +++ b/tests/xfs/789
> @@ -0,0 +1,39 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 789
> +#
> +# Race fsstress and free inode btree scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" finobt
> +_scratch_xfs_stress_scrub -x 'dir' -s "scrub finobt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/789.out b/tests/xfs/789.out
> new file mode 100644
> index 0000000000..da88fc99cb
> --- /dev/null
> +++ b/tests/xfs/789.out
> @@ -0,0 +1,2 @@
> +QA output created by 789
> +Silence is golden
> diff --git a/tests/xfs/790 b/tests/xfs/790
> new file mode 100755
> index 0000000000..c4e5779ef7
> --- /dev/null
> +++ b/tests/xfs/790
> @@ -0,0 +1,39 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 790
> +#
> +# Race fsstress and reverse mapping btree scrub for a while to see if we crash
> +# or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" rmapbt
> +_scratch_xfs_stress_scrub -s "scrub rmapbt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/790.out b/tests/xfs/790.out
> new file mode 100644
> index 0000000000..7102c590f0
> --- /dev/null
> +++ b/tests/xfs/790.out
> @@ -0,0 +1,2 @@
> +QA output created by 790
> +Silence is golden
> diff --git a/tests/xfs/791 b/tests/xfs/791
> new file mode 100755
> index 0000000000..6939d910c9
> --- /dev/null
> +++ b/tests/xfs/791
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 791
> +#
> +# Race fsstress and reference count btree scrub for a while to see if we crash
> +# or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/reflink
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" reflink
> +_scratch_xfs_stress_scrub -s "scrub refcountbt %agno%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/791.out b/tests/xfs/791.out
> new file mode 100644
> index 0000000000..758905371d
> --- /dev/null
> +++ b/tests/xfs/791.out
> @@ -0,0 +1,2 @@
> +QA output created by 791
> +Silence is golden
> diff --git a/tests/xfs/798 b/tests/xfs/798
> new file mode 100755
> index 0000000000..c5bdfad50a
> --- /dev/null
> +++ b/tests/xfs/798
> @@ -0,0 +1,44 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2022 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 798
> +#
> +# Race fsstress and fscounter scrub on the realtime device for a while to see
> +# if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_realtime
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> +
> +# Force all files to be allocated on the realtime device
> +_xfs_force_bdev realtime $SCRATCH_MNT
> +
> +_scratch_xfs_stress_scrub -s 'scrub fscounters'
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/798.out b/tests/xfs/798.out
> new file mode 100644
> index 0000000000..216d6e93f4
> --- /dev/null
> +++ b/tests/xfs/798.out
> @@ -0,0 +1,2 @@
> +QA output created by 798
> +Silence is golden
> diff --git a/tests/xfs/800 b/tests/xfs/800
> new file mode 100755
> index 0000000000..cbcfb5f5a6
> --- /dev/null
> +++ b/tests/xfs/800
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 800
> +#
> +# Race fsstress and realtime bitmap scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_realtime
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> +_scratch_xfs_stress_scrub -s "scrub rtbitmap"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/800.out b/tests/xfs/800.out
> new file mode 100644
> index 0000000000..bdfaa2cecd
> --- /dev/null
> +++ b/tests/xfs/800.out
> @@ -0,0 +1,2 @@
> +QA output created by 800
> +Silence is golden
> diff --git a/tests/xfs/801 b/tests/xfs/801
> new file mode 100755
> index 0000000000..a51fab523b
> --- /dev/null
> +++ b/tests/xfs/801
> @@ -0,0 +1,47 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 801
> +#
> +# Race fsstress and realtime summary scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_realtime
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_has_feature "$SCRATCH_MNT" realtime
> +
> +# XXX the realtime summary scrubber isn't currently implemented upstream.
> +# Don't bother trying to test it on kernels that lack it.
> +$XFS_IO_PROG -c 'scrub rtsummary' -c 'scrub rtsummary' "$SCRATCH_MNT" 2>&1 | \
> +	grep -q 'Scan was not complete' && \
> +	_notrun "rtsummary scrub is incomplete"
> +
> +_scratch_xfs_stress_scrub -s "scrub rtsummary"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/801.out b/tests/xfs/801.out
> new file mode 100644
> index 0000000000..39481b38e2
> --- /dev/null
> +++ b/tests/xfs/801.out
> @@ -0,0 +1,2 @@
> +QA output created by 801
> +Silence is golden
> diff --git a/tests/xfs/802 b/tests/xfs/802
> new file mode 100755
> index 0000000000..1f3b83882e
> --- /dev/null
> +++ b/tests/xfs/802
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 802
> +#
> +# Race fsstress and user quota scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
> +_scratch_xfs_stress_scrub -s "scrub usrquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/802.out b/tests/xfs/802.out
> new file mode 100644
> index 0000000000..a69c05391f
> --- /dev/null
> +++ b/tests/xfs/802.out
> @@ -0,0 +1,2 @@
> +QA output created by 802
> +Silence is golden
> diff --git a/tests/xfs/803 b/tests/xfs/803
> new file mode 100755
> index 0000000000..b2bb85672d
> --- /dev/null
> +++ b/tests/xfs/803
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 803
> +#
> +# Race fsstress and group quota scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
> +_scratch_xfs_stress_scrub -s "scrub grpquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/803.out b/tests/xfs/803.out
> new file mode 100644
> index 0000000000..38ba741d0f
> --- /dev/null
> +++ b/tests/xfs/803.out
> @@ -0,0 +1,2 @@
> +QA output created by 803
> +Silence is golden
> diff --git a/tests/xfs/804 b/tests/xfs/804
> new file mode 100755
> index 0000000000..129724eb11
> --- /dev/null
> +++ b/tests/xfs/804
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 804
> +#
> +# Race fsstress and project quota scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
> +_scratch_xfs_stress_scrub -s "scrub prjquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/804.out b/tests/xfs/804.out
> new file mode 100644
> index 0000000000..5e0cb437e7
> --- /dev/null
> +++ b/tests/xfs/804.out
> @@ -0,0 +1,2 @@
> +QA output created by 804
> +Silence is golden
> diff --git a/tests/xfs/805 b/tests/xfs/805
> new file mode 100755
> index 0000000000..aca9b9cdf4
> --- /dev/null
> +++ b/tests/xfs/805
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 805
> +#
> +# Race fsstress and summary counters scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -s "scrub fscounters"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/805.out b/tests/xfs/805.out
> new file mode 100644
> index 0000000000..ac324c5874
> --- /dev/null
> +++ b/tests/xfs/805.out
> @@ -0,0 +1,2 @@
> +QA output created by 805
> +Silence is golden
> 
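
(A short sketch of how the new quota helpers are meant to be called from a
test; the mount options shown are only an example:)

	_scratch_mount -o usrquota,grpquota

	# plain boolean check, no _notrun
	_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota && echo "group accounting on"

	# _notrun the test unless at least one quota type is being accounted
	_require_xfs_quota_acct_enabled "$SCRATCH_DEV" any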



* Re: [PATCH 1/5] xfs/357: switch fuzzing to agi 1
  2022-12-30 22:19   ` [PATCH 1/5] xfs/357: switch fuzzing to agi 1 Darrick J. Wong
@ 2023-02-07 18:46     ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-07 18:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:06PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Since we now require a working AGI 0 to mount the filesystem, fuzz AGI 1
> instead.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Looks good to me,
Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/357 |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> 
> diff --git a/tests/xfs/357 b/tests/xfs/357
> index 8a2c920ef4..25af8624db 100755
> --- a/tests/xfs/357
> +++ b/tests/xfs/357
> @@ -25,7 +25,7 @@ echo "Format and populate"
>  _scratch_populate_cached nofill > $seqres.full 2>&1
>  
>  echo "Fuzz AGI"
> -_scratch_xfs_fuzz_metadata '' 'online' 'agi 0' >> $seqres.full
> +_scratch_xfs_fuzz_metadata '' 'online' 'agi 1' >> $seqres.full
>  echo "Done fuzzing AGI"
>  
>  # success, all done
> 



* Re: [PATCH 4/5] fuzzy: allow xfs scrub stress tests to pick preconfigured fsstress configs
  2022-12-30 22:19   ` [PATCH 4/5] fuzzy: allow xfs scrub stress tests to pick preconfigured fsstress configs Darrick J. Wong
@ 2023-02-07 18:48     ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-07 18:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:06PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make it so that xfs_scrub stress tests can select what kind of fsstress
> operations they want to run.  This will make it easier for, say,
> directory scrub tests to configure fsstress to exercise directory tree
> changes while skipping file data updates, because those are irrelevant.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  common/fuzzy |   77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 74 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/common/fuzzy b/common/fuzzy
> index e39f787e78..c4a5bc9261 100644
> --- a/common/fuzzy
> +++ b/common/fuzzy
> @@ -466,6 +466,7 @@ __stress_scrub_fsx_loop() {
>  	local end="$1"
>  	local runningfile="$2"
>  	local remount_period="$3"
> +	local stress_tgt="$4"	# ignored
>  	local focus=(-q -X)	# quiet, validate file contents
>  
>  	# As of November 2022, 2 million fsx ops should be enough to keep
> @@ -528,10 +529,70 @@ __stress_scrub_fsstress_loop() {
>  	local end="$1"
>  	local runningfile="$2"
>  	local remount_period="$3"
> +	local stress_tgt="$4"
> +	local focus=()
> +
> +	case "$stress_tgt" in
> +	"dir")
> +		focus+=('-z')
> +
> +		# Create a directory tree rapidly
> +		for op in creat link mkdir mknod symlink; do
> +			focus+=('-f' "${op}=8")
> +		done
> +		focus+=('-f' 'rmdir=2' '-f' 'unlink=8')
> +
> +		# Rename half as often
> +		for op in rename rnoreplace rexchange; do
> +			focus+=('-f' "${op}=4")
> +		done
> +
> +		# Read and sync occasionally
> +		for op in getdents stat fsync; do
> +			focus+=('-f' "${op}=1")
> +		done
> +		;;
> +	"xattr")
> +		focus+=('-z')
> +
> +		# Create a directory tree slowly
> +		for op in creat ; do
> +			focus+=('-f' "${op}=2")
> +		done
> +		for op in unlink rmdir; do
> +			focus+=('-f' "${op}=1")
> +		done
> +
> +		# Create xattrs rapidly
> +		for op in attr_set setfattr; do
> +			focus+=('-f' "${op}=80")
> +		done
> +
> +		# Remove xattrs 1/4 as quickly
> +		for op in attr_remove removefattr; do
> +			focus+=('-f' "${op}=20")
> +		done
> +
> +		# Read and sync occasionally
> +		for op in listfattr getfattr fsync; do
> +			focus+=('-f' "${op}=10")
> +		done
> +		;;
> +	"writeonly")
> +		# Only do things that cause filesystem writes
> +		focus+=('-w')
> +		;;
> +	"default")
> +		# No new arguments
> +		;;
> +	*)
> +		echo "$stress_tgt: Unrecognized stress target, using defaults."
> +		;;
> +	esac
>  
>  	# As of March 2022, 2 million fsstress ops should be enough to keep
>  	# any filesystem busy for a couple of hours.
> -	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 $FSSTRESS_AVOID)
> +	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 "${focus[@]}" $FSSTRESS_AVOID)
>  	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
>  
>  	if [ -n "$remount_period" ]; then
> @@ -691,6 +752,14 @@ __stress_scrub_check_commands() {
>  # -w	Delay the start of the scrub/repair loop by this number of seconds.
>  #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
>  #	will be clamped to ten seconds before the end time.
> +# -x	Focus on this type of fsstress operation.  Possible values:
> +#
> +#       'dir': Grow the directory trees as much as possible.
> +#       'xattr': Grow extended attributes in a small tree.
> +#       'default': Run fsstress with default arguments.
> +#       'writeonly': Only perform fs updates, no reads.
> +#
> +#       The default is 'default' unless XFS_SCRUB_STRESS_TARGET is set.
>  # -X	Run this program to exercise the filesystem.  Currently supported
>  #       options are 'fsx' and 'fsstress'.  The default is 'fsstress'.
>  _scratch_xfs_stress_scrub() {
> @@ -703,6 +772,7 @@ _scratch_xfs_stress_scrub() {
>  	local exerciser="fsstress"
>  	local io_args=()
>  	local remount_period="${XFS_SCRUB_STRESS_REMOUNT_PERIOD}"
> +	local stress_tgt="${XFS_SCRUB_STRESS_TARGET:-default}"
>  
>  	__SCRUB_STRESS_FREEZE_PID=""
>  	__SCRUB_STRESS_REMOUNT_LOOP=""
> @@ -710,7 +780,7 @@ _scratch_xfs_stress_scrub() {
>  	touch "$runningfile"
>  
>  	OPTIND=1
> -	while getopts "fi:r:s:S:t:w:X:" c; do
> +	while getopts "fi:r:s:S:t:w:x:X:" c; do
>  		case "$c" in
>  			f) freeze=yes;;
>  			i) io_args+=("$OPTARG");;
> @@ -719,6 +789,7 @@ _scratch_xfs_stress_scrub() {
>  			S) xfs_scrub_args+=("$OPTARG");;
>  			t) scrub_tgt="$OPTARG";;
>  			w) scrub_delay="$OPTARG";;
> +			x) stress_tgt="$OPTARG";;
>  			X) exerciser="$OPTARG";;
>  			*) return 1; ;;
>  		esac
> @@ -757,7 +828,7 @@ _scratch_xfs_stress_scrub() {
>  	fi
>  
>  	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" \
> -			"$remount_period" &
> +			"$remount_period" "$stress_tgt" &
>  
>  	if [ -n "$freeze" ]; then
>  		__stress_scrub_freeze_loop "$end" "$runningfile" &
> 
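For reference, a repair stress test can then pass the new knob straight
through; a minimal sketch, modeled on the nlinks tests later in this
series (the usual test boilerplate is omitted):

    # Focus fsstress on directory tree updates while racing inode
    # link count repair; "dir" is one of the preconfigured targets
    # added by this patch.
    _scratch_mkfs > "$seqres.full" 2>&1
    _scratch_mount
    _scratch_xfs_stress_online_repair -x "dir" -s "repair nlinks"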


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] fuzzy: use FORCE_REBUILD over injecting force_repair
  2022-12-30 22:19   ` [PATCH 1/1] fuzzy: use FORCE_REBUILD over injecting force_repair Darrick J. Wong
@ 2023-02-14  8:00     ` Zorro Lang
  2023-02-14 18:18       ` Darrick J. Wong
  0 siblings, 1 reply; 469+ messages in thread
From: Zorro Lang @ 2023-02-14  8:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:09PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> For stress testing online repair, try to use the FORCE_REBUILD ioctl
> flag over the error injection knobs whenever possible because the knobs
> are very noisy and are not always available.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  common/fuzzy |   34 +++++++++++++++++++++++++++++++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/common/fuzzy b/common/fuzzy
> index f7f660bc31..14f7fdf03c 100644
> --- a/common/fuzzy
> +++ b/common/fuzzy
> @@ -398,6 +398,9 @@ __stress_one_scrub_loop() {
>  
>  	local xfs_io_args=()
>  	for arg in "$@"; do
> +		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
> +			arg="$(echo "$arg" | sed -e 's/^repair/repair -R/g')"
> +		fi
>  		if echo "$arg" | grep -q -w '%agno%'; then
>  			# Substitute the AG number
>  			for ((agno = 0; agno < agcount; agno++)); do
> @@ -695,13 +698,21 @@ _require_xfs_stress_scrub() {
>  		_notrun 'xfs scrub stress test requires common/filter'
>  }
>  
> +# Make sure that we can force repairs either by error injection or passing
> +# FORCE_REBUILD via ioctl.
> +__require_xfs_stress_force_rebuild() {
> +	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
> +	test -z "$output" && return
> +	_require_xfs_io_error_injection "force_repair"
> +}
> +
>  # Make sure we have everything we need to run stress and online repair
>  _require_xfs_stress_online_repair() {
>  	_require_xfs_stress_scrub
>  	_require_xfs_io_command "repair"
>  	command -v _require_xfs_io_error_injection &>/dev/null || \
>  		_notrun 'xfs repair stress test requires common/inject'
> -	_require_xfs_io_error_injection "force_repair"
> +	__require_xfs_stress_force_rebuild
>  	_require_freeze
>  }
>  
> @@ -783,7 +794,11 @@ __stress_scrub_check_commands() {
>  	esac
>  
>  	for arg in "$@"; do
> -		local cooked_arg="$(echo "$arg" | sed -e "s/%agno%/0/g")"
> +		local cooked_arg="$arg"
> +		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
> +			cooked_arg="$(echo "$cooked_arg" | sed -e 's/^repair/repair -R/g')"
> +		fi
> +		cooked_arg="$(echo "$cooked_arg" | sed -e "s/%agno%/0/g")"
>  		testio=`$XFS_IO_PROG -x -c "$cooked_arg" "$cooked_tgt" 2>&1`
>  		echo $testio | grep -q "Unknown type" && \
>  			_notrun "xfs_io scrub subcommand support is missing"
> @@ -943,10 +958,23 @@ _scratch_xfs_stress_scrub() {
>  	echo "Loop finished at $(date)" >> $seqres.full
>  }
>  
> +# Decide if we're going to force repairs either by error injection or passing
> +# FORCE_REBUILD via ioctl.
> +__scratch_xfs_stress_setup_force_rebuild() {
> +	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
> +
> +	if [ -z "$output" ]; then
> +		export SCRUBSTRESS_USE_FORCE_REBUILD=1

Do you need to use this parameter ^^ in another child process? Is the "export"
necessary?

Thanks,
Zorro

> +		return
> +	fi
> +
> +	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
> +}
> +
>  # Start online repair, freeze, and fsstress in background looping processes,
>  # and wait for 30*TIME_FACTOR seconds to see if the filesystem goes down.
>  # Same requirements and arguments as _scratch_xfs_stress_scrub.
>  _scratch_xfs_stress_online_repair() {
> -	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
> +	__scratch_xfs_stress_setup_force_rebuild
>  	XFS_SCRUB_FORCE_REPAIR=1 _scratch_xfs_stress_scrub "$@"
>  }
> 
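For anyone poking at this by hand, the two mechanisms the patch chooses
between look roughly like this (a sketch reusing the helper's own xfs_io
commands, with an arbitrary AG number):

    # Preferred: ask the kernel for a rebuild directly via FORCE_REBUILD.
    $XFS_IO_PROG -x -c 'repair -R rmapbt 0' $SCRATCH_MNT

    # Fallback: flip the (noisy) error injection knob, then repair.
    $XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
    $XFS_IO_PROG -x -c 'repair rmapbt 0' $SCRATCH_MNT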


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] fuzzy: use FORCE_REBUILD over injecting force_repair
  2023-02-14  8:00     ` Zorro Lang
@ 2023-02-14 18:18       ` Darrick J. Wong
  2023-02-16 14:57         ` Zorro Lang
  0 siblings, 1 reply; 469+ messages in thread
From: Darrick J. Wong @ 2023-02-14 18:18 UTC (permalink / raw)
  To: Zorro Lang; +Cc: linux-xfs, fstests

On Tue, Feb 14, 2023 at 04:00:07PM +0800, Zorro Lang wrote:
> On Fri, Dec 30, 2022 at 02:19:09PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > For stress testing online repair, try to use the FORCE_REBUILD ioctl
> > flag over the error injection knobs whenever possible because the knobs
> > are very noisy and are not always available.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  common/fuzzy |   34 +++++++++++++++++++++++++++++++---
> >  1 file changed, 31 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/common/fuzzy b/common/fuzzy
> > index f7f660bc31..14f7fdf03c 100644
> > --- a/common/fuzzy
> > +++ b/common/fuzzy
> > @@ -398,6 +398,9 @@ __stress_one_scrub_loop() {
> >  
> >  	local xfs_io_args=()
> >  	for arg in "$@"; do
> > +		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
> > +			arg="$(echo "$arg" | sed -e 's/^repair/repair -R/g')"
> > +		fi
> >  		if echo "$arg" | grep -q -w '%agno%'; then
> >  			# Substitute the AG number
> >  			for ((agno = 0; agno < agcount; agno++)); do
> > @@ -695,13 +698,21 @@ _require_xfs_stress_scrub() {
> >  		_notrun 'xfs scrub stress test requires common/filter'
> >  }
> >  
> > +# Make sure that we can force repairs either by error injection or passing
> > +# FORCE_REBUILD via ioctl.
> > +__require_xfs_stress_force_rebuild() {
> > +	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
> > +	test -z "$output" && return
> > +	_require_xfs_io_error_injection "force_repair"
> > +}
> > +
> >  # Make sure we have everything we need to run stress and online repair
> >  _require_xfs_stress_online_repair() {
> >  	_require_xfs_stress_scrub
> >  	_require_xfs_io_command "repair"
> >  	command -v _require_xfs_io_error_injection &>/dev/null || \
> >  		_notrun 'xfs repair stress test requires common/inject'
> > -	_require_xfs_io_error_injection "force_repair"
> > +	__require_xfs_stress_force_rebuild
> >  	_require_freeze
> >  }
> >  
> > @@ -783,7 +794,11 @@ __stress_scrub_check_commands() {
> >  	esac
> >  
> >  	for arg in "$@"; do
> > -		local cooked_arg="$(echo "$arg" | sed -e "s/%agno%/0/g")"
> > +		local cooked_arg="$arg"
> > +		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
> > +			cooked_arg="$(echo "$cooked_arg" | sed -e 's/^repair/repair -R/g')"
> > +		fi
> > +		cooked_arg="$(echo "$cooked_arg" | sed -e "s/%agno%/0/g")"
> >  		testio=`$XFS_IO_PROG -x -c "$cooked_arg" "$cooked_tgt" 2>&1`
> >  		echo $testio | grep -q "Unknown type" && \
> >  			_notrun "xfs_io scrub subcommand support is missing"
> > @@ -943,10 +958,23 @@ _scratch_xfs_stress_scrub() {
> >  	echo "Loop finished at $(date)" >> $seqres.full
> >  }
> >  
> > +# Decide if we're going to force repairs either by error injection or passing
> > +# FORCE_REBUILD via ioctl.
> > +__scratch_xfs_stress_setup_force_rebuild() {
> > +	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
> > +
> > +	if [ -z "$output" ]; then
> > +		export SCRUBSTRESS_USE_FORCE_REBUILD=1
> 
> Do you need to use this parameter ^^ in another child process? Is the "export"
> necessary?

Nope, it's not required.
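
Both readers of that variable (__stress_one_scrub_loop and
__stress_scrub_check_commands) run as shell functions in, or backgrounded
from, the same test process, and bash subshells forked with '&' inherit
unexported variables anyway; export would only matter if a separately
exec'd program had to see it.  A quick illustration of the difference:

    FOO=bar                             # not exported
    ( echo "subshell sees: $FOO" ) &    # prints "subshell sees: bar"
    wait
    bash -c 'echo "program sees: $FOO"' # prints nothing for $FOO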

--D

> Thanks,
> Zorro
> 
> > +		return
> > +	fi
> > +
> > +	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
> > +}
> > +
> >  # Start online repair, freeze, and fsstress in background looping processes,
> >  # and wait for 30*TIME_FACTOR seconds to see if the filesystem goes down.
> >  # Same requirements and arguments as _scratch_xfs_stress_scrub.
> >  _scratch_xfs_stress_online_repair() {
> > -	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
> > +	__scratch_xfs_stress_setup_force_rebuild
> >  	XFS_SCRUB_FORCE_REPAIR=1 _scratch_xfs_stress_scrub "$@"
> >  }
> > 
> 

^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] fuzzy: use FORCE_REBUILD over injecting force_repair
  2023-02-14 18:18       ` Darrick J. Wong
@ 2023-02-16 14:57         ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-16 14:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Tue, Feb 14, 2023 at 10:18:20AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 14, 2023 at 04:00:07PM +0800, Zorro Lang wrote:
> > On Fri, Dec 30, 2022 at 02:19:09PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > For stress testing online repair, try to use the FORCE_REBUILD ioctl
> > > flag over the error injection knobs whenever possible because the knobs
> > > are very noisy and are not always available.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  common/fuzzy |   34 +++++++++++++++++++++++++++++++---
> > >  1 file changed, 31 insertions(+), 3 deletions(-)
> > > 
> > > 
> > > diff --git a/common/fuzzy b/common/fuzzy
> > > index f7f660bc31..14f7fdf03c 100644
> > > --- a/common/fuzzy
> > > +++ b/common/fuzzy
> > > @@ -398,6 +398,9 @@ __stress_one_scrub_loop() {
> > >  
> > >  	local xfs_io_args=()
> > >  	for arg in "$@"; do
> > > +		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
> > > +			arg="$(echo "$arg" | sed -e 's/^repair/repair -R/g')"
> > > +		fi
> > >  		if echo "$arg" | grep -q -w '%agno%'; then
> > >  			# Substitute the AG number
> > >  			for ((agno = 0; agno < agcount; agno++)); do
> > > @@ -695,13 +698,21 @@ _require_xfs_stress_scrub() {
> > >  		_notrun 'xfs scrub stress test requires common/filter'
> > >  }
> > >  
> > > +# Make sure that we can force repairs either by error injection or passing
> > > +# FORCE_REBUILD via ioctl.
> > > +__require_xfs_stress_force_rebuild() {
> > > +	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
> > > +	test -z "$output" && return
> > > +	_require_xfs_io_error_injection "force_repair"
> > > +}
> > > +
> > >  # Make sure we have everything we need to run stress and online repair
> > >  _require_xfs_stress_online_repair() {
> > >  	_require_xfs_stress_scrub
> > >  	_require_xfs_io_command "repair"
> > >  	command -v _require_xfs_io_error_injection &>/dev/null || \
> > >  		_notrun 'xfs repair stress test requires common/inject'
> > > -	_require_xfs_io_error_injection "force_repair"
> > > +	__require_xfs_stress_force_rebuild
> > >  	_require_freeze
> > >  }
> > >  
> > > @@ -783,7 +794,11 @@ __stress_scrub_check_commands() {
> > >  	esac
> > >  
> > >  	for arg in "$@"; do
> > > -		local cooked_arg="$(echo "$arg" | sed -e "s/%agno%/0/g")"
> > > +		local cooked_arg="$arg"
> > > +		if [ -n "$SCRUBSTRESS_USE_FORCE_REBUILD" ]; then
> > > +			cooked_arg="$(echo "$cooked_arg" | sed -e 's/^repair/repair -R/g')"
> > > +		fi
> > > +		cooked_arg="$(echo "$cooked_arg" | sed -e "s/%agno%/0/g")"
> > >  		testio=`$XFS_IO_PROG -x -c "$cooked_arg" "$cooked_tgt" 2>&1`
> > >  		echo $testio | grep -q "Unknown type" && \
> > >  			_notrun "xfs_io scrub subcommand support is missing"
> > > @@ -943,10 +958,23 @@ _scratch_xfs_stress_scrub() {
> > >  	echo "Loop finished at $(date)" >> $seqres.full
> > >  }
> > >  
> > > +# Decide if we're going to force repairs either by error injection or passing
> > > +# FORCE_REBUILD via ioctl.
> > > +__scratch_xfs_stress_setup_force_rebuild() {
> > > +	local output="$($XFS_IO_PROG -x -c 'repair -R probe' $SCRATCH_MNT 2>&1)"
> > > +
> > > +	if [ -z "$output" ]; then
> > > +		export SCRUBSTRESS_USE_FORCE_REBUILD=1
> > 
> > Do you need to use this parameter ^^ in another child process? Is the "export"
> > necessary?
> 
> Nope, it's not required.

OK, the rest looks good to me; I'll merge this patch without that "export".

Reviewed-by: Zorro Lang <zlang@redhat.com>

> 
> --D
> 
> > Thanks,
> > Zorro
> > 
> > > +		return
> > > +	fi
> > > +
> > > +	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
> > > +}
> > > +
> > >  # Start online repair, freeze, and fsstress in background looping processes,
> > >  # and wait for 30*TIME_FACTOR seconds to see if the filesystem goes down.
> > >  # Same requirements and arguments as _scratch_xfs_stress_scrub.
> > >  _scratch_xfs_stress_online_repair() {
> > > -	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
> > > +	__scratch_xfs_stress_setup_force_rebuild
> > >  	XFS_SCRUB_FORCE_REPAIR=1 _scratch_xfs_stress_scrub "$@"
> > >  }
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCHSET v24.0 0/2] fstests: online repair of AG btrees
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] xfs: stress test ag repair functions Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] xfs: test rebuilding the entire filesystem with online fsck Darrick J. Wong
@ 2023-02-18  6:06   ` Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:12PM -0800, Darrick J. Wong wrote:
> Hi all,
> 
> Now that we've spent a lot of time reworking common code in online fsck,
> we're ready to start rebuilding the AG space btrees.  This series
> implements repair functions for the free space, inode, and refcount
> btrees.  Rebuilding the reverse mapping btree is much more intense and
> is left for a subsequent patchset.  The fstests counterpart of this
> patchset implements stress testing of repair.
> 
> If you're going to start using this mess, you probably ought to just
> pull from my git trees, which are linked below.
> 
> This is an extraordinary way to destroy everything.  Enjoy!
> Comments and questions are, as always, welcome.
> 
> --D

LGTM,

Reviewed-by: Zorro Lang <zlang@redhat.com>

> 
> kernel git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees
> 
> xfsprogs git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-ag-btrees
> 
> fstests git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-ag-btrees
> ---
>  README            |    3 ++
>  common/fuzzy      |   39 +++++++++++++++++++--------
>  common/rc         |    2 +
>  common/xfs        |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/725     |   37 +++++++++++++++++++++++++
>  tests/xfs/725.out |    2 +
>  tests/xfs/726     |   37 +++++++++++++++++++++++++
>  tests/xfs/726.out |    2 +
>  tests/xfs/727     |   38 ++++++++++++++++++++++++++
>  tests/xfs/727.out |    2 +
>  tests/xfs/728     |   37 +++++++++++++++++++++++++
>  tests/xfs/728.out |    2 +
>  tests/xfs/729     |   37 +++++++++++++++++++++++++
>  tests/xfs/729.out |    2 +
>  tests/xfs/730     |   37 +++++++++++++++++++++++++
>  tests/xfs/730.out |    2 +
>  tests/xfs/731     |   37 +++++++++++++++++++++++++
>  tests/xfs/731.out |    2 +
>  18 files changed, 382 insertions(+), 13 deletions(-)
>  create mode 100755 tests/xfs/725
>  create mode 100644 tests/xfs/725.out
>  create mode 100755 tests/xfs/726
>  create mode 100644 tests/xfs/726.out
>  create mode 100755 tests/xfs/727
>  create mode 100644 tests/xfs/727.out
>  create mode 100755 tests/xfs/728
>  create mode 100644 tests/xfs/728.out
>  create mode 100755 tests/xfs/729
>  create mode 100644 tests/xfs/729.out
>  create mode 100755 tests/xfs/730
>  create mode 100644 tests/xfs/730.out
>  create mode 100755 tests/xfs/731
>  create mode 100644 tests/xfs/731.out
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] xfs: race fsstress with online repair for inode record metadata
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online repair for inode record metadata Darrick J. Wong
@ 2023-02-18  6:07     ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:15PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a test that runs the inode record repairer in the foreground and
> fsstress in the background.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

LGTM,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/806     |   38 ++++++++++++++++++++++++++++++++++++++
>  tests/xfs/806.out |    2 ++
>  2 files changed, 40 insertions(+)
>  create mode 100755 tests/xfs/806
>  create mode 100644 tests/xfs/806.out
> 
> 
> diff --git a/tests/xfs/806 b/tests/xfs/806
> new file mode 100755
> index 0000000000..e07f9f9141
> --- /dev/null
> +++ b/tests/xfs/806
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 806
> +#
> +# Race fsstress and inode record repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_online_repair -s "repair inode" -t "%file%"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/806.out b/tests/xfs/806.out
> new file mode 100644
> index 0000000000..463bd7f008
> --- /dev/null
> +++ b/tests/xfs/806.out
> @@ -0,0 +1,2 @@
> +QA output created by 806
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings
  2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:19   ` [PATCH 2/4] xfs: race fsstress with online repair for inode and fork metadata Darrick J. Wong
@ 2023-02-18  6:07   ` Zorro Lang
  4 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:18PM -0800, Darrick J. Wong wrote:
> Hi all,
> 
> In this series, online repair gains the ability to rebuild data and attr
> fork mappings from the reverse mapping information.  It is at this point
> where we reintroduce the ability to reap file extents.
> 
> Repair of CoW forks is a little different -- on disk, CoW staging
> extents are owned by the refcount btree and cannot be mapped back to
> individual files.  Hence we can only detect staging extents that don't
> quite look right (missing reverse mappings, shared staging extents) and
> replace them with fresh allocations.
> 
> If you're going to start using this mess, you probably ought to just
> pull from my git trees, which are linked below.
> 
> This is an extraordinary way to destroy everything.  Enjoy!
> Comments and questions are, as always, welcome.
> 
> --D

Ack, will merge this patchset

Reviewed-by: Zorro Lang <zlang@redhat.com>

> 
> kernel git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings
> 
> xfsprogs git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=repair-file-mappings
> 
> fstests git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-file-mappings
> ---
>  tests/xfs/746     |   85 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/746.out |    2 +
>  tests/xfs/807     |   37 +++++++++++++++++++++++
>  tests/xfs/807.out |    2 +
>  tests/xfs/808     |   39 ++++++++++++++++++++++++
>  tests/xfs/808.out |    2 +
>  tests/xfs/828     |   38 ++++++++++++++++++++++++
>  tests/xfs/828.out |    2 +
>  tests/xfs/829     |   39 ++++++++++++++++++++++++
>  tests/xfs/829.out |    2 +
>  tests/xfs/840     |   72 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/840.out |    3 ++
>  tests/xfs/846     |   39 ++++++++++++++++++++++++
>  tests/xfs/846.out |    2 +
>  14 files changed, 364 insertions(+)
>  create mode 100755 tests/xfs/746
>  create mode 100644 tests/xfs/746.out
>  create mode 100755 tests/xfs/807
>  create mode 100644 tests/xfs/807.out
>  create mode 100755 tests/xfs/808
>  create mode 100644 tests/xfs/808.out
>  create mode 100755 tests/xfs/828
>  create mode 100644 tests/xfs/828.out
>  create mode 100755 tests/xfs/829
>  create mode 100644 tests/xfs/829.out
>  create mode 100755 tests/xfs/840
>  create mode 100644 tests/xfs/840.out
>  create mode 100755 tests/xfs/846
>  create mode 100644 tests/xfs/846.out
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] xfs: race fsstress with online scrub and repair for quota metadata
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online scrub and repair for quota metadata Darrick J. Wong
@ 2023-02-18  6:10     ` Zorro Lang
  2023-02-18  6:12     ` Zorro Lang
  1 sibling, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:21PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create tests to race dquot repair against fsstress running in the
> background.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/809     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/809.out |    2 ++
>  tests/xfs/810     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/810.out |    2 ++
>  tests/xfs/811     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/811.out |    2 ++
>  6 files changed, 126 insertions(+)
>  create mode 100755 tests/xfs/809
>  create mode 100644 tests/xfs/809.out
>  create mode 100755 tests/xfs/810
>  create mode 100644 tests/xfs/810.out
>  create mode 100755 tests/xfs/811
>  create mode 100644 tests/xfs/811.out
> 
> 
> diff --git a/tests/xfs/809 b/tests/xfs/809
> new file mode 100755
> index 0000000000..35ac02ff85
> --- /dev/null
> +++ b/tests/xfs/809
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 809
> +#
> +# Race fsstress and user quota repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
> +_scratch_xfs_stress_online_repair -s "repair usrquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/809.out b/tests/xfs/809.out
> new file mode 100644
> index 0000000000..e90865ca8f
> --- /dev/null
> +++ b/tests/xfs/809.out
> @@ -0,0 +1,2 @@
> +QA output created by 809
> +Silence is golden
> diff --git a/tests/xfs/810 b/tests/xfs/810
> new file mode 100755
> index 0000000000..7387910504
> --- /dev/null
> +++ b/tests/xfs/810
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 810
> +#
> +# Race fsstress and group quota repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
> +_scratch_xfs_stress_online_repair -s "repair grpquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/810.out b/tests/xfs/810.out
> new file mode 100644
> index 0000000000..90f12fdd21
> --- /dev/null
> +++ b/tests/xfs/810.out
> @@ -0,0 +1,2 @@
> +QA output created by 810
> +Silence is golden
> diff --git a/tests/xfs/811 b/tests/xfs/811
> new file mode 100755
> index 0000000000..1e13940b46
> --- /dev/null
> +++ b/tests/xfs/811
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 811
> +#
> +# Race fsstress and project quota repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
> +_scratch_xfs_stress_online_repair -s "repair prjquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/811.out b/tests/xfs/811.out
> new file mode 100644
> index 0000000000..cf30f69bdc
> --- /dev/null
> +++ b/tests/xfs/811.out
> @@ -0,0 +1,2 @@
> +QA output created by 811
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] xfs: race fsstress with online scrub and repair for quota metadata
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online scrub and repair for quota metadata Darrick J. Wong
  2023-02-18  6:10     ` Zorro Lang
@ 2023-02-18  6:12     ` Zorro Lang
  1 sibling, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:21PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create tests to race dquot repair against fsstress running in the
> background.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/809     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/809.out |    2 ++
>  tests/xfs/810     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/810.out |    2 ++
>  tests/xfs/811     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/811.out |    2 ++
>  6 files changed, 126 insertions(+)
>  create mode 100755 tests/xfs/809
>  create mode 100644 tests/xfs/809.out
>  create mode 100755 tests/xfs/810
>  create mode 100644 tests/xfs/810.out
>  create mode 100755 tests/xfs/811
>  create mode 100644 tests/xfs/811.out
> 
> 
> diff --git a/tests/xfs/809 b/tests/xfs/809
> new file mode 100755
> index 0000000000..35ac02ff85
> --- /dev/null
> +++ b/tests/xfs/809
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 809
> +#
> +# Race fsstress and user quota repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" usrquota
> +_scratch_xfs_stress_online_repair -s "repair usrquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/809.out b/tests/xfs/809.out
> new file mode 100644
> index 0000000000..e90865ca8f
> --- /dev/null
> +++ b/tests/xfs/809.out
> @@ -0,0 +1,2 @@
> +QA output created by 809
> +Silence is golden
> diff --git a/tests/xfs/810 b/tests/xfs/810
> new file mode 100755
> index 0000000000..7387910504
> --- /dev/null
> +++ b/tests/xfs/810
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 810
> +#
> +# Race fsstress and group quota repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" grpquota
> +_scratch_xfs_stress_online_repair -s "repair grpquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/810.out b/tests/xfs/810.out
> new file mode 100644
> index 0000000000..90f12fdd21
> --- /dev/null
> +++ b/tests/xfs/810.out
> @@ -0,0 +1,2 @@
> +QA output created by 810
> +Silence is golden
> diff --git a/tests/xfs/811 b/tests/xfs/811
> new file mode 100755
> index 0000000000..1e13940b46
> --- /dev/null
> +++ b/tests/xfs/811
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 811
> +#
> +# Race fsstress and project quota repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" prjquota
> +_scratch_xfs_stress_online_repair -s "repair prjquota"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/811.out b/tests/xfs/811.out
> new file mode 100644
> index 0000000000..cf30f69bdc
> --- /dev/null
> +++ b/tests/xfs/811.out
> @@ -0,0 +1,2 @@
> +QA output created by 811
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] xfs: race fsstress with online scrub and repair for quotacheck
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online scrub and repair for quotacheck Darrick J. Wong
@ 2023-02-18  6:12     ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:24PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create tests to race quotacheck scrub and repair against fsstress
> running in the background.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/715     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/715.out |    2 ++
>  tests/xfs/812     |   40 ++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/812.out |    2 ++
>  4 files changed, 84 insertions(+)
>  create mode 100755 tests/xfs/715
>  create mode 100644 tests/xfs/715.out
>  create mode 100755 tests/xfs/812
>  create mode 100644 tests/xfs/812.out
> 
> 
> diff --git a/tests/xfs/715 b/tests/xfs/715
> new file mode 100755
> index 0000000000..eca979b297
> --- /dev/null
> +++ b/tests/xfs/715
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 715
> +#
> +# Race fsstress and quotacheck repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" any
> +_scratch_xfs_stress_online_repair -s "repair quotacheck"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/715.out b/tests/xfs/715.out
> new file mode 100644
> index 0000000000..b5947d898b
> --- /dev/null
> +++ b/tests/xfs/715.out
> @@ -0,0 +1,2 @@
> +QA output created by 715
> +Silence is golden
> diff --git a/tests/xfs/812 b/tests/xfs/812
> new file mode 100755
> index 0000000000..f84494e392
> --- /dev/null
> +++ b/tests/xfs/812
> @@ -0,0 +1,40 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle. Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 812
> +#
> +# Race fsstress and quotacheck scrub for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +. ./common/quota
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_require_xfs_quota_acct_enabled "$SCRATCH_DEV" any
> +_scratch_xfs_stress_scrub -s "scrub quotacheck"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/812.out b/tests/xfs/812.out
> new file mode 100644
> index 0000000000..d8dbb15dc7
> --- /dev/null
> +++ b/tests/xfs/812.out
> @@ -0,0 +1,2 @@
> +QA output created by 812
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] xfs: race fsstress with inode link count check and repair
  2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with inode link count check and repair Darrick J. Wong
@ 2023-02-18  6:13     ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:27PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Race fsstress with inode link count checking and repair.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/772     |   38 ++++++++++++++++++++++++++++++++++++++
>  tests/xfs/772.out |    2 ++
>  tests/xfs/820     |   37 +++++++++++++++++++++++++++++++++++++
>  tests/xfs/820.out |    2 ++
>  4 files changed, 79 insertions(+)
>  create mode 100755 tests/xfs/772
>  create mode 100644 tests/xfs/772.out
>  create mode 100755 tests/xfs/820
>  create mode 100644 tests/xfs/820.out
> 
> 
> diff --git a/tests/xfs/772 b/tests/xfs/772
> new file mode 100755
> index 0000000000..a00c2796c5
> --- /dev/null
> +++ b/tests/xfs/772
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2022 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 772
> +#
> +# Race fsstress and inode link count repair for a while to see if we crash or
> +# livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_online_repair -x "dir" -s "repair nlinks"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/772.out b/tests/xfs/772.out
> new file mode 100644
> index 0000000000..98c1396896
> --- /dev/null
> +++ b/tests/xfs/772.out
> @@ -0,0 +1,2 @@
> +QA output created by 772
> +Silence is golden
> diff --git a/tests/xfs/820 b/tests/xfs/820
> new file mode 100755
> index 0000000000..58a5d4cc91
> --- /dev/null
> +++ b/tests/xfs/820
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2022 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 820
> +#
> +# Race fsstress and nlinks scrub for a while to see if we crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub
> +
> +_cleanup() {
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	cd /
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -x "dir" -s "scrub nlinks"
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/820.out b/tests/xfs/820.out
> new file mode 100644
> index 0000000000..29ab2e2d8c
> --- /dev/null
> +++ b/tests/xfs/820.out
> @@ -0,0 +1,2 @@
> +QA output created by 820
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] xfs: test fs summary counter online repair Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] xfs: race fsstress with online repair for summary counters Darrick J. Wong
@ 2023-02-18  6:14   ` Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:30PM -0800, Darrick J. Wong wrote:
> Hi all,
> 
> A longstanding deficiency in the online fs summary counter scrubbing
> code is that it hasn't any means to quiesce the incore percpu counters
> while it's running.  There is no way to coordinate with other threads that
> are reserving or freeing free space simultaneously, which leads to false
> error reports.  Right now, if the discrepancy is large, we just sort of
> shrug and bail out with an incomplete flag, but this is lame.
> 
> For repair activity, we actually /do/ need to stabilize the counters to
> get an accurate reading and install it in the percpu counter.  To
> improve the former and enable the latter, allow the fscounters online
> fsck code to perform an exclusive mini-freeze on the filesystem.  The
> exclusivity prevents userspace from thawing while we're running, and the
> mini-freeze means that we don't wait for the log to quiesce, which will
> make both speedier.
> 
> If you're going to start using this mess, you probably ought to just
> pull from my git trees, which are linked below.
> 
> This is an extraordinary way to destroy everything.  Enjoy!
> Comments and questions are, as always, welcome.
> 
> --D

This patchset looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

> 
> kernel git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters
> 
> fstests git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-fscounters
> ---
>  tests/xfs/713     |   36 ++++++++++++++++++++++++++++++++++++
>  tests/xfs/713.out |    4 ++++
>  tests/xfs/714     |   41 +++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/714.out |    2 ++
>  tests/xfs/762     |   46 ++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/762.out |    2 ++
>  6 files changed, 131 insertions(+)
>  create mode 100755 tests/xfs/713
>  create mode 100644 tests/xfs/713.out
>  create mode 100755 tests/xfs/714
>  create mode 100644 tests/xfs/714.out
>  create mode 100755 tests/xfs/762
>  create mode 100644 tests/xfs/762.out
> 
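For reference, the repair racer in this set presumably follows the same
shape as the quota racers earlier in the thread; a rough sketch (the
"fscounters" scrub type name is an assumption here, and the usual test
boilerplate is omitted):

    # Race fsstress and summary counter repair for a while.
    _scratch_mkfs > "$seqres.full" 2>&1
    _scratch_mount
    _scratch_xfs_stress_online_repair -s "repair fscounters"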


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 1/1] xfs/422: don't freeze while racing rmap repair and fsstress
  2022-12-30 22:19   ` [PATCH 1/1] xfs/422: don't freeze while racing rmap repair and fsstress Darrick J. Wong
@ 2023-02-18  6:15     ` Zorro Lang
  0 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests, guan

On Fri, Dec 30, 2022 at 02:19:33PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Since we're moving away from freezing the filesystem for rmap repair,
> remove the freeze/thaw race from this test to make it more interesting.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Looks good to me,
Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/422 |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> 
> diff --git a/tests/xfs/422 b/tests/xfs/422
> index 995f612166..339f12976a 100755
> --- a/tests/xfs/422
> +++ b/tests/xfs/422
> @@ -5,8 +5,6 @@
>  # FS QA Test No. 422
>  #
>  # Race fsstress and rmapbt repair for a while to see if we crash or livelock.
> -# rmapbt repair requires us to freeze the filesystem to stop all filesystem
> -# activity, so we can't have userspace wandering in and thawing it.
>  #
>  . ./common/preamble
>  _begin_fstest online_repair dangerous_fsstress_repair freeze
> @@ -31,7 +29,7 @@ _require_xfs_stress_online_repair
>  _scratch_mkfs > "$seqres.full" 2>&1
>  _scratch_mount
>  _require_xfs_has_feature "$SCRATCH_MNT" rmapbt
> -_scratch_xfs_stress_online_repair -f -s "repair rmapbt %agno%"
> +_scratch_xfs_stress_online_repair -s "repair rmapbt %agno%"
>  
>  # success, all done
>  echo Silence is golden
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population
  2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 1/2] populate: take a snapshot of the filesystem if creation fails Darrick J. Wong
  2022-12-30 22:19   ` [PATCH 2/2] populate: fix some weirdness in __populate_check_xfs_agbtree_height Darrick J. Wong
@ 2023-02-18  6:16   ` Zorro Lang
  2 siblings, 0 replies; 469+ messages in thread
From: Zorro Lang @ 2023-02-18  6:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:36PM -0800, Darrick J. Wong wrote:
> Hi all,
> 
> Before we start on an intense patchset of improving the XFS fuzz testing
> framework, let's fix a couple of bugs in the code that creates sample
> filesystems with all types of metadata.
> 
> If you're going to start using this mess, you probably ought to just
> pull from my git trees, which are linked below.
> 
> This is an extraordinary way to destroy everything.  Enjoy!
> Comments and questions are, as always, welcome.
> 
> --D

Ack,

Reviewed-by: Zorro Lang <zlang@redhat.com>

> 
> fstests git tree:
> https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fix-populate-problems
> ---
>  common/populate |   70 ++++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 46 insertions(+), 24 deletions(-)
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 6/7] fsx: support FIEXCHANGE_RANGE
  2022-12-30 22:19   ` [PATCH 6/7] fsx: support FIEXCHANGE_RANGE Darrick J. Wong
@ 2023-02-28  1:55     ` Zorro Lang
  2023-03-01  2:56       ` Darrick J. Wong
  0 siblings, 1 reply; 469+ messages in thread
From: Zorro Lang @ 2023-02-28  1:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:19:49PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Upgrade fsx to support exchanging file contents.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Hi Darrick,

I've merged most of the patches of [NYE DELUGE 2/4]; this time I'm trying
to merge the rest of it.

This patch gets build warning [1] from autoconf.  Can you rebase this patch
onto the current for-next branch and use autoupdate to update configure.ac
and lib/autoconf/general.m4?
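
In case it helps, the conversion is probably just a matter of re-running
the autotools after updating the macro; something like this, assuming a
recent autoconf (which is what emits the warning below and ships the
autoupdate tool):

    # Rewrite obsolete constructs (AC_TRY_LINK -> AC_LINK_IFELSE) in
    # place, then regenerate ./configure.
    autoupdate configure.ac m4/package_libcdev.m4
    autoconf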

Thanks,
Zorro

[1]
autoconf
configure.ac:73: warning: The macro `AC_TRY_LINK' is obsolete.
configure.ac:73: You should run autoupdate.
./lib/autoconf/general.m4:2920: AC_TRY_LINK is expanded from...
m4/package_libcdev.m4:161: AC_HAVE_FIEXCHANGE is expanded from...
configure.ac:73: the top level
./configure \   
                --libexecdir=/usr/lib \
                --exec_prefix=/var/lib

>  configure.ac          |    1 
>  include/builddefs.in  |    1 
>  ltp/Makefile          |    4 +
>  ltp/fsx.c             |  160 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  m4/package_libcdev.m4 |   21 ++++++
>  src/fiexchange.h      |  101 +++++++++++++++++++++++++++++++
>  src/global.h          |    6 ++
>  7 files changed, 292 insertions(+), 2 deletions(-)
>  create mode 100644 src/fiexchange.h
> 
> 
> diff --git a/configure.ac b/configure.ac
> index e92bd6b26d..4687d8a3c0 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -70,6 +70,7 @@ AC_HAVE_SEEK_DATA
>  AC_HAVE_BMV_OF_SHARED
>  AC_HAVE_NFTW
>  AC_HAVE_RLIMIT_NOFILE
> +AC_HAVE_FIEXCHANGE

[snip]

>  	setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
>  
>  	while ((ch = getopt_long(argc, argv,
> -				 "b:c:dfg:i:j:kl:m:no:p:qr:s:t:w:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> +				 "0b:c:dfg:i:j:kl:m:no:p:qr:s:t:w:xyABD:EFJKHzCILN:OP:RS:UWXZ",

Looks like we've nearly used up the available option letters in fsx for
avoiding operations.

Maybe we can use a single option (e.g. -a meaning "avoid") with suboptions to
help with that.  For example, "-a xchg_range,clone_range,dedupe_range" would
avoid those three operations.  Or we could use long options, e.g.
--no-xchg-range and --no-clone-range, to replace the short ones.

What do you think? (Anyway, that's not a problem with this patch.)

Thanks,
Zorro

>  				 longopts, NULL)) != EOF)
>  		switch (ch) {
>  		case 'b':
> @@ -2747,6 +2898,9 @@ main(int argc, char **argv)
>  		case 'I':
>  			insert_range_calls = 0;
>  			break;
> +		case '0':
> +			xchg_range_calls = 0;
> +			break;
>  		case 'J':
>  			clone_range_calls = 0;
>  			break;
> @@ -2988,6 +3142,8 @@ main(int argc, char **argv)
>  		dedupe_range_calls = test_dedupe_range();
>  	if (copy_range_calls)
>  		copy_range_calls = test_copy_range();
> +	if (xchg_range_calls)
> +		xchg_range_calls = test_xchg_range();
>  
>  	while (numops == -1 || numops--)
>  		if (!test())
> diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
> index e1b381c16f..db663970c2 100644
> --- a/m4/package_libcdev.m4
> +++ b/m4/package_libcdev.m4
> @@ -157,3 +157,24 @@ AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
>         AC_MSG_RESULT(no))
>      AC_SUBST(have_rlimit_nofile)
>    ])
> +
> +#
> +# Check if we have a FIEXCHANGE_RANGE ioctl (Linux)
> +#
> +AC_DEFUN([AC_HAVE_FIEXCHANGE],
> +  [ AC_MSG_CHECKING([for FIEXCHANGE_RANGE])
> +    AC_TRY_LINK([
> +#define _GNU_SOURCE
> +#include <sys/syscall.h>
> +#include <sys/ioctl.h>
> +#include <unistd.h>
> +#include <linux/fs.h>
> +#include <linux/fiexchange.h>
> +    ], [
> +         struct file_xchg_range fxr;
> +         ioctl(-1, FIEXCHANGE_RANGE, &fxr);
> +    ], have_fiexchange=yes
> +       AC_MSG_RESULT(yes),
> +       AC_MSG_RESULT(no))
> +    AC_SUBST(have_fiexchange)
> +  ])
> diff --git a/src/fiexchange.h b/src/fiexchange.h
> new file mode 100644
> index 0000000000..29b3ac0ff5
> --- /dev/null
> +++ b/src/fiexchange.h
> @@ -0,0 +1,101 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */
> +/*
> + * FIEXCHANGE ioctl definitions, to facilitate exchanging parts of files.
> + *
> + * Copyright (C) 2022 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#ifndef _LINUX_FIEXCHANGE_H
> +#define _LINUX_FIEXCHANGE_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * Exchange part of file1 with part of the file that this ioctl is being
> + * called against (which we'll call file2).  Filesystems must be able to
> + * restart and complete the operation even after the system goes down.
> + */
> +struct file_xchg_range {
> +	__s64		file1_fd;
> +	__s64		file1_offset;	/* file1 offset, bytes */
> +	__s64		file2_offset;	/* file2 offset, bytes */
> +	__s64		length;		/* bytes to exchange */
> +
> +	__u64		flags;		/* see FILE_XCHG_RANGE_* below */
> +
> +	/* file2 metadata for optional freshness checks */
> +	__s64		file2_ino;	/* inode number */
> +	__s64		file2_mtime;	/* modification time */
> +	__s64		file2_ctime;	/* change time */
> +	__s32		file2_mtime_nsec; /* mod time, nsec */
> +	__s32		file2_ctime_nsec; /* change time, nsec */
> +
> +	__u64		pad[6];		/* must be zeroes */
> +};
> +
> +/*
> + * Atomic exchange operations are not required.  This relaxes the requirement
> + * that the filesystem must be able to complete the operation after a crash.
> + */
> +#define FILE_XCHG_RANGE_NONATOMIC	(1 << 0)
> +
> +/*
> + * Check file2's inode number, mtime, and ctime against the values
> + * provided, and return -EBUSY if there isn't an exact match.
> + */
> +#define FILE_XCHG_RANGE_FILE2_FRESH	(1 << 1)
> +
> +/*
> + * Check that the file1's length is equal to file1_offset + length, and that
> + * file2's length is equal to file2_offset + length.  Returns -EDOM if there
> + * isn't an exact match.
> + */
> +#define FILE_XCHG_RANGE_FULL_FILES	(1 << 2)
> +
> +/*
> + * Exchange file data all the way to the ends of both files, and then exchange
> + * the file sizes.  This flag can be used to replace a file's contents with a
> + * different amount of data.  length will be ignored.
> + */
> +#define FILE_XCHG_RANGE_TO_EOF		(1 << 3)
> +
> +/* Flush all changes in file data and file metadata to disk before returning. */
> +#define FILE_XCHG_RANGE_FSYNC		(1 << 4)
> +
> +/* Dry run; do all the parameter verification but do not change anything. */
> +#define FILE_XCHG_RANGE_DRY_RUN		(1 << 5)
> +
> +/*
> + * Do not exchange any part of the range where file1's mapping is a hole.  This
> + * can be used to emulate scatter-gather atomic writes with a temp file.
> + */
> +#define FILE_XCHG_RANGE_SKIP_FILE1_HOLES (1 << 6)
> +
> +/*
> + * Commit the contents of file1 into file2 if file2 has the same inode number,
> + * mtime, and ctime as the arguments provided to the call.  The old contents of
> + * file2 will be moved to file1.
> + *
> + * With this flag, all committed information can be retrieved even if the
> + * system crashes or is rebooted.  This includes writing through or flushing a
> + * disk cache if present.  The call blocks until the device reports that the
> + * commit is complete.
> + *
> + * This flag should not be combined with NONATOMIC.  It can be combined with
> + * SKIP_FILE1_HOLES.
> + */
> +#define FILE_XCHG_RANGE_COMMIT		(FILE_XCHG_RANGE_FILE2_FRESH | \
> +					 FILE_XCHG_RANGE_FSYNC)
> +
> +#define FILE_XCHG_RANGE_ALL_FLAGS	(FILE_XCHG_RANGE_NONATOMIC | \
> +					 FILE_XCHG_RANGE_FILE2_FRESH | \
> +					 FILE_XCHG_RANGE_FULL_FILES | \
> +					 FILE_XCHG_RANGE_TO_EOF | \
> +					 FILE_XCHG_RANGE_FSYNC | \
> +					 FILE_XCHG_RANGE_DRY_RUN | \
> +					 FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
> +
> +#define FIEXCHANGE_RANGE	_IOWR('X', 129, struct file_xchg_range)
> +
> +#endif /* _LINUX_FIEXCHANGE_H */
> diff --git a/src/global.h b/src/global.h
> index b44070993c..49570ef117 100644
> --- a/src/global.h
> +++ b/src/global.h
> @@ -171,6 +171,12 @@
>  #include <sys/mman.h>
>  #endif
>  
> +#ifdef HAVE_FIEXCHANGE
> +# include <linux/fiexchange.h>
> +#else
> +# include "fiexchange.h"
> +#endif
> +
>  static inline unsigned long long
>  rounddown_64(unsigned long long x, unsigned int y)
>  {
> 


^ permalink raw reply	[flat|nested] 469+ messages in thread

* Re: [PATCH 6/7] fsx: support FIEXCHANGE_RANGE
  2023-02-28  1:55     ` Zorro Lang
@ 2023-03-01  2:56       ` Darrick J. Wong
  0 siblings, 0 replies; 469+ messages in thread
From: Darrick J. Wong @ 2023-03-01  2:56 UTC (permalink / raw)
  To: Zorro Lang; +Cc: linux-xfs, fstests

On Tue, Feb 28, 2023 at 09:55:28AM +0800, Zorro Lang wrote:
> On Fri, Dec 30, 2022 at 02:19:49PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Upgrade fsx to support exchanging file contents.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> 
> Hi Darrick,
> 
> I've merged most of the patches of [NYE DELUGE 2/4]; this time I'm trying
> to merge the rest of it.
> 
> This patch gets build warning [1] from autoconf.  Can you rebase this patch
> onto the current for-next branch and use autoupdate to update configure.ac
> and lib/autoconf/general.m4?

Will do.  Thanks for merging this all, I really appreciate it!

--D

> Thanks,
> Zorro
> 
> [1]
> autoconf
> configure.ac:73: warning: The macro `AC_TRY_LINK' is obsolete.
> configure.ac:73: You should run autoupdate.
> ./lib/autoconf/general.m4:2920: AC_TRY_LINK is expanded from...
> m4/package_libcdev.m4:161: AC_HAVE_FIEXCHANGE is expanded from...
> configure.ac:73: the top level
> ./configure \   
>                 --libexecdir=/usr/lib \
>                 --exec_prefix=/var/lib
> 
> >  configure.ac          |    1 
> >  include/builddefs.in  |    1 
> >  ltp/Makefile          |    4 +
> >  ltp/fsx.c             |  160 ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  m4/package_libcdev.m4 |   21 ++++++
> >  src/fiexchange.h      |  101 +++++++++++++++++++++++++++++++
> >  src/global.h          |    6 ++
> >  7 files changed, 292 insertions(+), 2 deletions(-)
> >  create mode 100644 src/fiexchange.h
> > 
> > 
> > diff --git a/configure.ac b/configure.ac
> > index e92bd6b26d..4687d8a3c0 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -70,6 +70,7 @@ AC_HAVE_SEEK_DATA
> >  AC_HAVE_BMV_OF_SHARED
> >  AC_HAVE_NFTW
> >  AC_HAVE_RLIMIT_NOFILE
> > +AC_HAVE_FIEXCHANGE
> 
> [snip]
> 
> >  	setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
> >  
> >  	while ((ch = getopt_long(argc, argv,
> > -				 "b:c:dfg:i:j:kl:m:no:p:qr:s:t:w:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> > +				 "0b:c:dfg:i:j:kl:m:no:p:qr:s:t:w:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> 
> Looks like we've nearly used up the single-letter options in fsx for
> disabling operations.
> 
> Maybe we could use a single option (e.g. -a for "avoid") with suboptions to
> handle that, for example "-a xchg_range,clone_range,dedupe_range" to avoid
> those three operations.  Or we could use long options, e.g. --no-xchg-range
> and --no-clone-range, to replace the short ones (see the sketch below).
> 
> What do you think? (Anyway, that's not the problem of this patch)
> 
> Thanks,
> Zorro
> 
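
Purely to illustrate the long-option alternative Zorro sketches above (none of
this is in the patch, and the option names are made up), fsx's existing
getopt_long() call could take flag-setting entries such as:

        #include <getopt.h>

        /* Counters like these already exist in fsx.c with this patch applied. */
        static int xchg_range_calls = 1;
        static int clone_range_calls = 1;

        /*
         * Flag-setting long options: getopt_long() stores 0 through the
         * pointer and returns 0, so the switch (ch) below needs no new case.
         */
        static const struct option longopts[] = {
                { "no-xchg-range",  no_argument, &xchg_range_calls,  0 },
                { "no-clone-range", no_argument, &clone_range_calls, 0 },
                { NULL, 0, NULL, 0 },
        };

The patch as posted takes the simpler route of consuming the spare '0' short
option instead.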
> >  				 longopts, NULL)) != EOF)
> >  		switch (ch) {
> >  		case 'b':
> > @@ -2747,6 +2898,9 @@ main(int argc, char **argv)
> >  		case 'I':
> >  			insert_range_calls = 0;
> >  			break;
> > +		case '0':
> > +			xchg_range_calls = 0;
> > +			break;
> >  		case 'J':
> >  			clone_range_calls = 0;
> >  			break;
> > @@ -2988,6 +3142,8 @@ main(int argc, char **argv)
> >  		dedupe_range_calls = test_dedupe_range();
> >  	if (copy_range_calls)
> >  		copy_range_calls = test_copy_range();
> > +	if (xchg_range_calls)
> > +		xchg_range_calls = test_xchg_range();
> >  
> >  	while (numops == -1 || numops--)
> >  		if (!test())
> > diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
> > index e1b381c16f..db663970c2 100644
> > --- a/m4/package_libcdev.m4
> > +++ b/m4/package_libcdev.m4
> > @@ -157,3 +157,24 @@ AC_DEFUN([AC_HAVE_RLIMIT_NOFILE],
> >         AC_MSG_RESULT(no))
> >      AC_SUBST(have_rlimit_nofile)
> >    ])
> > +
> > +#
> > +# Check if we have a FIEXCHANGE_RANGE ioctl (Linux)
> > +#
> > +AC_DEFUN([AC_HAVE_FIEXCHANGE],
> > +  [ AC_MSG_CHECKING([for FIEXCHANGE_RANGE])
> > +    AC_TRY_LINK([
> > +#define _GNU_SOURCE
> > +#include <sys/syscall.h>
> > +#include <sys/ioctl.h>
> > +#include <unistd.h>
> > +#include <linux/fs.h>
> > +#include <linux/fiexchange.h>
> > +    ], [
> > +         struct file_xchg_range fxr;
> > +         ioctl(-1, FIEXCHANGE_RANGE, &fxr);
> > +    ], have_fiexchange=yes
> > +       AC_MSG_RESULT(yes),
> > +       AC_MSG_RESULT(no))
> > +    AC_SUBST(have_fiexchange)
> > +  ])
> > diff --git a/src/fiexchange.h b/src/fiexchange.h
> > new file mode 100644
> > index 0000000000..29b3ac0ff5
> > --- /dev/null
> > +++ b/src/fiexchange.h
> > @@ -0,0 +1,101 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */
> > +/*
> > + * FIEXCHANGE ioctl definitions, to facilitate exchanging parts of files.
> > + *
> > + * Copyright (C) 2022 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <djwong@kernel.org>
> > + */
> > +#ifndef _LINUX_FIEXCHANGE_H
> > +#define _LINUX_FIEXCHANGE_H
> > +
> > +#include <linux/types.h>
> > +
> > +/*
> > + * Exchange part of file1 with part of the file that this ioctl is being
> > + * called against (which we'll call file2).  Filesystems must be able to
> > + * restart and complete the operation even after the system goes down.
> > + */
> > +struct file_xchg_range {
> > +	__s64		file1_fd;
> > +	__s64		file1_offset;	/* file1 offset, bytes */
> > +	__s64		file2_offset;	/* file2 offset, bytes */
> > +	__s64		length;		/* bytes to exchange */
> > +
> > +	__u64		flags;		/* see FILE_XCHG_RANGE_* below */
> > +
> > +	/* file2 metadata for optional freshness checks */
> > +	__s64		file2_ino;	/* inode number */
> > +	__s64		file2_mtime;	/* modification time */
> > +	__s64		file2_ctime;	/* change time */
> > +	__s32		file2_mtime_nsec; /* mod time, nsec */
> > +	__s32		file2_ctime_nsec; /* change time, nsec */
> > +
> > +	__u64		pad[6];		/* must be zeroes */
> > +};
> > +
> > +/*
> > + * Atomic exchange operations are not required.  This relaxes the requirement
> > + * that the filesystem must be able to complete the operation after a crash.
> > + */
> > +#define FILE_XCHG_RANGE_NONATOMIC	(1 << 0)
> > +
> > +/*
> > + * Check file2's inode number, mtime, and ctime against the values
> > + * provided, and return -EBUSY if there isn't an exact match.
> > + */
> > +#define FILE_XCHG_RANGE_FILE2_FRESH	(1 << 1)
> > +
> > +/*
> > + * Check that file1's length is equal to file1_offset + length, and that
> > + * file2's length is equal to file2_offset + length.  Returns -EDOM if there
> > + * isn't an exact match.
> > + */
> > +#define FILE_XCHG_RANGE_FULL_FILES	(1 << 2)
> > +
> > +/*
> > + * Exchange file data all the way to the ends of both files, and then exchange
> > + * the file sizes.  This flag can be used to replace a file's contents with a
> > + * different amount of data.  length will be ignored.
> > + */
> > +#define FILE_XCHG_RANGE_TO_EOF		(1 << 3)
> > +
> > +/* Flush all changes in file data and file metadata to disk before returning. */
> > +#define FILE_XCHG_RANGE_FSYNC		(1 << 4)
> > +
> > +/* Dry run; do all the parameter verification but do not change anything. */
> > +#define FILE_XCHG_RANGE_DRY_RUN		(1 << 5)
> > +
> > +/*
> > + * Do not exchange any part of the range where file1's mapping is a hole.  This
> > + * can be used to emulate scatter-gather atomic writes with a temp file.
> > + */
> > +#define FILE_XCHG_RANGE_SKIP_FILE1_HOLES (1 << 6)
> > +
> > +/*
> > + * Commit the contents of file1 into file2 if file2 has the same inode number,
> > + * mtime, and ctime as the arguments provided to the call.  The old contents of
> > + * file2 will be moved to file1.
> > + *
> > + * With this flag, all committed information can be retrieved even if the
> > + * system crashes or is rebooted.  This includes writing through or flushing a
> > + * disk cache if present.  The call blocks until the device reports that the
> > + * commit is complete.
> > + *
> > + * This flag should not be combined with NONATOMIC.  It can be combined with
> > + * SKIP_FILE1_HOLES.
> > + */
> > +#define FILE_XCHG_RANGE_COMMIT		(FILE_XCHG_RANGE_FILE2_FRESH | \
> > +					 FILE_XCHG_RANGE_FSYNC)
> > +
> > +#define FILE_XCHG_RANGE_ALL_FLAGS	(FILE_XCHG_RANGE_NONATOMIC | \
> > +					 FILE_XCHG_RANGE_FILE2_FRESH | \
> > +					 FILE_XCHG_RANGE_FULL_FILES | \
> > +					 FILE_XCHG_RANGE_TO_EOF | \
> > +					 FILE_XCHG_RANGE_FSYNC | \
> > +					 FILE_XCHG_RANGE_DRY_RUN | \
> > +					 FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
> > +
> > +#define FIEXCHANGE_RANGE	_IOWR('X', 129, struct file_xchg_range)
> > +
> > +#endif /* _LINUX_FIEXCHANGE_H */
> > diff --git a/src/global.h b/src/global.h
> > index b44070993c..49570ef117 100644
> > --- a/src/global.h
> > +++ b/src/global.h
> > @@ -171,6 +171,12 @@
> >  #include <sys/mman.h>
> >  #endif
> >  
> > +#ifdef HAVE_FIEXCHANGE
> > +# include <linux/fiexchange.h>
> > +#else
> > +# include "fiexchange.h"
> > +#endif
> > +
> >  static inline unsigned long long
> >  rounddown_64(unsigned long long x, unsigned int y)
> >  {
> > 
> 
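
To make the freshness checks above a bit more concrete: a hypothetical sketch,
not part of the patch, of the FILE_XCHG_RANGE_COMMIT pattern.  The caller is
assumed to have sampled file2 with fstat() when it last read the file and to
have staged the replacement contents in a temporary file; combining TO_EOF
with COMMIT is an assumption the header does not rule out.

        #include <string.h>
        #include <sys/ioctl.h>
        #include <sys/stat.h>
        #include "fiexchange.h"

        /*
         * Commit temp_fd's contents into target_fd only if target_fd still
         * matches the sampled identity; the kernel returns EBUSY if file2
         * was modified in the meantime.
         */
        static int commit_if_unchanged(int temp_fd, int target_fd,
                                       const struct stat *sampled)
        {
                struct file_xchg_range fxr;

                memset(&fxr, 0, sizeof(fxr));   /* pad[] must be zeroes */
                fxr.file1_fd = temp_fd;
                fxr.flags = FILE_XCHG_RANGE_TO_EOF | FILE_XCHG_RANGE_COMMIT;
                fxr.file2_ino = sampled->st_ino;
                fxr.file2_mtime = sampled->st_mtim.tv_sec;
                fxr.file2_mtime_nsec = sampled->st_mtim.tv_nsec;
                fxr.file2_ctime = sampled->st_ctim.tv_sec;
                fxr.file2_ctime_nsec = sampled->st_ctim.tv_nsec;

                return ioctl(target_fd, FIEXCHANGE_RANGE, &fxr);
        }

Adding FILE_XCHG_RANGE_SKIP_FILE1_HOLES would turn this into the scatter-gather
update described in the header comment: only the ranges actually written into
the temporary file would land in file2.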


end of thread, other threads:[~2023-03-01  2:57 UTC | newest]

Thread overview: 469+ messages
2022-12-30 21:14 [NYE DELUGE 2/4] xfs: online repair in its entirety Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/9] xfs: fix online repair block reaping Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/9] xfs: move the post-repair block reaping code to a separate file Darrick J. Wong
2022-12-30 22:12   ` [PATCH 8/9] xfs: reap large AG metadata extents when possible Darrick J. Wong
2022-12-30 22:12   ` [PATCH 5/9] xfs: use deferred frees to reap old btree blocks Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/9] xfs: only invalidate blocks if we're going to free them Darrick J. Wong
2022-12-30 22:12   ` [PATCH 6/9] xfs: rearrange xrep_reap_block to make future code flow easier Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/9] xfs: cull repair code that will never get used Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/9] xfs: only allow reaping of per-AG blocks in xrep_reap_extents Darrick J. Wong
2022-12-30 22:12   ` [PATCH 7/9] xfs: ignore stale buffers when scanning the buffer cache Darrick J. Wong
2022-12-30 22:12   ` [PATCH 9/9] xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: prepare repair for bulk loading Darrick J. Wong
2022-12-30 22:12   ` [PATCH 5/6] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/6] xfs: add debug knobs to control btree bulk load slack factors Darrick J. Wong
2022-12-30 22:12   ` [PATCH 6/6] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/6] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/6] xfs: log EFIs for all btree blocks being used to stage a btree Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/6] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/7] xfs: stage repair information in pageable memory Darrick J. Wong
2022-12-30 22:12   ` [PATCH 5/7] xfs: speed up xfarray sort by sorting xfile page contents directly Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/7] xfs: convert xfarray insertion sort to heapsort using scratchpad memory Darrick J. Wong
2022-12-30 22:12   ` [PATCH 7/7] xfs: improve xfarray quicksort pivot Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/7] xfs: enable sorting of xfile-backed arrays Darrick J. Wong
2022-12-30 22:12   ` [PATCH 6/7] xfs: cache pages used for xfarray quicksort convergence Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/7] xfs: create a big array data structure Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/7] xfs: teach xfile to pass back direct-map pages to caller Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online scrubbing of realtime summary files Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/4] xfs: get our own reference to inodes that we want to scrub Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/4] xfs: move the realtime summary file scrubber to a separate source file Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/4] xfs: implement online scrubbing of rtsummary info Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/4] xfs: wrap ilock/iunlock operations on sc->ip Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: miscellaneous repair tweaks Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/2] xfs: allow the user to cancel repairs before we start writing Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/2] xfs: always rescan allegedly healthy per-ag metadata after repair Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/2] xfs: force rebuilding of metadata Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/2] xfs: allow userspace to rebuild metadata structures Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/2] xfs: don't complain about unfixed metadata when repairs were injected Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of AG btrees Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/5] xfs: clear pagf_agflreset when repairing the AGFL Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/5] xfs: repair free space btrees Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/5] xfs: repair inode btrees Darrick J. Wong
2022-12-30 22:12   ` [PATCH 5/5] xfs: repair refcount btrees Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/5] xfs: rewrite xfs_icache_inode_is_allocated Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/6] xfs: online repair of inodes and forks Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/6] xfs: repair inode records Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/6] xfs: try to attach dquots to files before repairing them Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/6] xfs: zap broken inode forks Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/6] xfs: disable online repair quota helpers when quota not enabled Darrick J. Wong
2022-12-30 22:12   ` [PATCH 5/6] xfs: abort directory parent scrub scans if we encounter a zapped directory Darrick J. Wong
2022-12-30 22:12   ` [PATCH 6/6] xfs: repair obviously broken inode modes Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/5] xfs: online repair of file fork mappings Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/5] xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/5] xfs: refactor repair forcing tests into a repair.c helper Darrick J. Wong
2022-12-30 22:12   ` [PATCH 5/5] xfs: repair problems in CoW forks Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/5] xfs: create a ranged query function for refcount btrees Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/5] xfs: repair inode fork block mapping data structures Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: online repair of quota and rt metadata files Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/4] xfs: online repair of realtime bitmaps Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/4] xfs: create a new inode fork block unmap helper Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/4] xfs: repair the inode core and forks of a metadata inode Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/4] xfs: repair quotas Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: live inode scans for online fsck Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/4] xfs: allow scrub to hook metadata updates in other writers Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/4] xfs: implement live inode scan for scrub Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/4] xfs: speed up xfs_iwalk_adjust_start a little bit Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/4] xfs: allow blocking notifier chains with filesystem hooks Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of quota counters Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/5] xfs: implement live quotacheck inode scan Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/5] xfs: repair cannot update the summary counters when logging quota flags Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/5] xfs: report the health of quota counts Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/5] xfs: track quota updates during live quotacheck Darrick J. Wong
2022-12-30 22:13   ` [PATCH 5/5] xfs: repair dquots based on live quotacheck results Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/5] xfs: online repair of file link counts Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/5] xfs: report health of inode " Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/5] xfs: teach scrub to check file nlinks Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/5] xfs: streamline the directory iteration code for scrub Darrick J. Wong
2022-12-30 22:13   ` [PATCH 5/5] xfs: teach repair to fix file nlinks Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/5] xfs: track file link count updates during live nlinks fsck Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 00/11] xfs: report corruption to the health trackers Darrick J. Wong
2022-12-30 22:13   ` [PATCH 01/11] xfs: separate the marking of sick and checked metadata Darrick J. Wong
2022-12-30 22:13   ` [PATCH 02/11] xfs: report fs corruption errors to the health tracking system Darrick J. Wong
2022-12-30 22:13   ` [PATCH 08/11] xfs: report inode corruption errors to the health system Darrick J. Wong
2022-12-30 22:13   ` [PATCH 06/11] xfs: report dir/attr block " Darrick J. Wong
2022-12-30 22:13   ` [PATCH 09/11] xfs: report quota " Darrick J. Wong
2022-12-30 22:13   ` [PATCH 03/11] xfs: report ag header corruption errors to the health tracking system Darrick J. Wong
2022-12-30 22:13   ` [PATCH 07/11] xfs: report symlink block corruption errors to the health system Darrick J. Wong
2022-12-30 22:13   ` [PATCH 05/11] xfs: report btree " Darrick J. Wong
2022-12-30 22:13   ` [PATCH 04/11] xfs: report block map corruption errors to the health tracking system Darrick J. Wong
2022-12-30 22:13   ` [PATCH 10/11] xfs: report realtime metadata corruption errors to the health system Darrick J. Wong
2022-12-30 22:13   ` [PATCH 11/11] xfs: report XFS_IS_CORRUPT " Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: indirect health reporting Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/3] xfs: update health status if we get a clean bill of health Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/3] xfs: remember sick inodes that get inactivated Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/3] xfs: add secondary and indirect classes to the health tracking system Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: online repair for fs summary counters Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/3] xfs: repair " Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/3] xfs: stabilize fs summary counters for online fsck Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/3] xfs: remove XCHK_REAPING_DISABLED from scrub Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/7] xfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/7] xfs: dump xfiles for debugging purposes Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/7] xfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
2022-12-30 22:13   ` [PATCH 5/7] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
2022-12-30 22:13   ` [PATCH 6/7] xfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/7] xfs: support in-memory buffer cache targets Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/7] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
2022-12-30 22:13   ` [PATCH 7/7] xfs: connect in-memory btrees to xfiles Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: online repair of rmap btrees Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/4] xfs: repair the rmapbt Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/4] xfs: create a helper to decide if a file mapping targets the rt volume Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/4] xfs: hook live rmap operations during a repair operation Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/4] xfs: create a shadow rmap btree during rmap repair Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/9] xfs: move btree geometry to ops struct Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/9] xfs: set the btree cursor bc_ops in xfs_btree_alloc_cursor Darrick J. Wong
2022-12-30 22:13   ` [PATCH 7/9] xfs: remove the unnecessary daddr paramter to _init_block Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/9] xfs: encode the default bc_flags in the btree ops structure Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/9] xfs: initialize btree blocks using btree_ops structure Darrick J. Wong
2022-12-30 22:13   ` [PATCH 5/9] xfs: rename btree block/buffer init functions Darrick J. Wong
2022-12-30 22:13   ` [PATCH 6/9] xfs: btree convert xfs_btree_init_block to xfs_btree_init_buf calls Darrick J. Wong
2022-12-30 22:13   ` [PATCH 8/9] xfs: set btree block buffer ops in _init_buf Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/9] xfs: export some of the btree ops structures Darrick J. Wong
2022-12-30 22:13   ` [PATCH 9/9] xfs: remove unnecessary fields in xfbtree_config Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: reduce refcount repair memory usage Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/3] xfs: define an in-memory btree for storing refcount bag info during repairs Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/3] xfs: port refcount repair to the new refcount bag structure Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/3] xfs: create refcount bag structure for btree repairs Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: bmap log intent cleanups Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/3] xfs: remove xfs_trans_set_bmap_flags Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/3] xfs: clean up bmap log intent item tracepoint callsites Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/3] xfs: split tracepoint classes for deferred items Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/4] xfs: widen BUI formats to support realtime Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/4] xfs: hoist freeing of rt data fork extent mappings Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/4] xfs: fix xfs_bunmapi to allow unmapping of partial rt extents Darrick J. Wong
2022-12-30 22:13   ` [PATCH 4/4] xfs: support recovering bmap intent items targetting realtime extents Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/4] xfs: add a realtime flag to the bmap update log redo items Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/2] xfs: support attrfork and unwritten BUIs Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/2] xfs: support deferred bmap updates on the attr fork Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/2] xfs: xfs_bmap_finish_one should map unwritten extents properly Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/3] xfs: clean up symbolic link code Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/3] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
2022-12-30 22:13   ` [PATCH 3/3] xfs: move symlink target write function to libxfs Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/3] xfs: move remote symlink target read " Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 00/21] xfs: atomic file updates Darrick J. Wong
2022-12-30 22:13   ` [PATCH 05/21] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
2022-12-30 22:13   ` [PATCH 01/21] vfs: introduce new file range exchange ioctl Darrick J. Wong
2022-12-30 22:13   ` [PATCH 03/21] xfs: refactor non-power-of-two alignment checks Darrick J. Wong
2022-12-30 22:13   ` [PATCH 02/21] xfs: create a new helper to return a file's allocation unit Darrick J. Wong
2022-12-30 22:13   ` [PATCH 04/21] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
2022-12-30 22:13   ` [PATCH 06/21] xfs: introduce a swap-extent log intent item Darrick J. Wong
2022-12-30 22:13   ` [PATCH 07/21] xfs: create deferred log items for extent swapping Darrick J. Wong
2022-12-30 22:13   ` [PATCH 08/21] xfs: enable xlog users to toggle atomic " Darrick J. Wong
2022-12-30 22:13   ` [PATCH 11/21] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
2022-12-30 22:13   ` [PATCH 10/21] xfs: add error injection to test swapext recovery Darrick J. Wong
2022-12-30 22:13   ` [PATCH 09/21] xfs: add a ->xchg_file_range handler Darrick J. Wong
2022-12-30 22:13   ` [PATCH 14/21] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
2022-12-30 22:13   ` [PATCH 18/21] xfs: condense symbolic links after an atomic swap Darrick J. Wong
2022-12-30 22:13   ` [PATCH 13/21] xfs: port xfs_swap_extent_forks to use xfs_swapext_req Darrick J. Wong
2022-12-30 22:13   ` [PATCH 17/21] xfs: condense directories after an atomic swap Darrick J. Wong
2022-12-30 22:13   ` [PATCH 16/21] xfs: condense extended attributes " Darrick J. Wong
2022-12-30 22:13   ` [PATCH 12/21] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
2022-12-30 22:13   ` [PATCH 15/21] xfs: remove old swap extents implementation Darrick J. Wong
2022-12-30 22:13   ` [PATCH 20/21] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
2022-12-30 22:13   ` [PATCH 19/21] xfs: make atomic extent swapping support realtime files Darrick J. Wong
2022-12-30 22:13   ` [PATCH 21/21] xfs: enable atomic swapext feature Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: create temporary files for online repair Darrick J. Wong
2022-12-30 22:14   ` [PATCH 4/4] xfs: add the ability to reap entire inode forks Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/4] xfs: create temporary files and directories for online repair Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/4] xfs: refactor stale buffer scanning for repairs Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/4] xfs: hide private inodes from bulkstat and handle functions Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of realtime summaries Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/3] xfs: support preallocating and copying content into temporary files Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/3] xfs: online repair of realtime summaries Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/3] xfs: teach the tempfile to support atomic extent swapping Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/9] xfs: set and validate dir/attr block owners Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/9] xfs: use the xfs_da_args owner field to set new dir/attr block owner Darrick J. Wong
2022-12-30 22:14   ` [PATCH 4/9] xfs: validate attr remote value buffer owners Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/9] xfs: validate attr leaf " Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/9] xfs: add an explicit owner field to xfs_da_args Darrick J. Wong
2022-12-30 22:14   ` [PATCH 5/9] xfs: validate dabtree node buffer owners Darrick J. Wong
2022-12-30 22:14   ` [PATCH 6/9] xfs: validate directory leaf " Darrick J. Wong
2022-12-30 22:14   ` [PATCH 9/9] xfs: validate explicit directory free block owners Darrick J. Wong
2022-12-30 22:14   ` [PATCH 8/9] xfs: validate explicit directory block buffer owners Darrick J. Wong
2022-12-30 22:14   ` [PATCH 7/9] xfs: validate explicit directory data " Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/5] xfs: online repair of extended attributes Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/5] xfs: create a blob array data structure Darrick J. Wong
2022-12-30 22:14   ` [PATCH 5/5] xfs: flag empty xattr leaf blocks for optimization Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/5] xfs: repair extended attributes Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/5] xfs: use atomic extent swapping to fix user file fork data Darrick J. Wong
2022-12-30 22:14   ` [PATCH 4/5] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: online repair of directories Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/3] xfs: online repair of parent pointers Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/3] xfs: ask the dentry cache if it knows the parent of a directory Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/3] xfs: online repair of directories Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: move orphan files to lost and found Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/3] xfs: move orphan files to the orphanage Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/3] xfs: move files to orphanage instead of letting nlinks drop to zero Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/3] xfs: ensure dentry consistency when the orphanage adopts a file Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of symbolic links Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/1] " Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/1] xfs: online repair of parent pointers Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/1] xfs: create an xattr iteration function for scrub Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/4] xfs: online fsck of iunlink buckets Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/4] xfs: check AGI unlinked inode buckets Darrick J. Wong
2022-12-30 22:14   ` [PATCH 4/4] xfs: repair AGI unlinked inode bucket lists Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/4] xfs: hoist AGI repair context to a heap object Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/4] xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list Darrick J. Wong
2022-12-30 22:14 ` [PATCHSET v24.0 0/3] xfs: cache xfile pages for better performance Darrick J. Wong
2022-12-30 22:14   ` [PATCH 1/3] xfs: map xfile pages directly into xfs_buf Darrick J. Wong
2022-12-30 22:14   ` [PATCH 2/3] xfs: use b_offset to support direct-mapping pages when blocksize < pagesize Darrick J. Wong
2022-12-30 22:14   ` [PATCH 3/3] xfile: implement write caching Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: prepare repair for bulk loading Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/5] xfs: implement block reservation accounting for btrees we're staging Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/5] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/5] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2022-12-30 22:17   ` [PATCH 5/5] xfs_repair: bulk load records into new btree blocks Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/5] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: force rebuilding of metadata Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/3] xfs_scrub: try to use XFS_SCRUB_IFLAG_FORCE_REBUILD Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/3] xfs_io: support passing the FORCE_REBUILD flag to online repair Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/3] xfs: allow userspace to rebuild metadata structures Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/4] libfrog: promote XFROG_SCRUB_DESCR_SUMMARY to a scrub type Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/4] libfrog: rename XFROG_SCRUB_TYPE_* to XFROG_SCRUB_GROUP_* Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/4] libfrog: rename the scrub "fs" group Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/4] xfs_scrub: scan metadata files in parallel Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of quota counters Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/3] xfs: implement live quotacheck inode scan Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/3] libfrog: create a new scrub group for things requiring full inode scans Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/3] xfs: report the health of quota counts Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/2] xfs_repair: rebuild inode fork mappings Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/2] xfs_repair: push inode buf and dinode pointers all the way to inode fork processing Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/2] xfs_repair: rebuild block mappings from rmapbt data Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/3] libxfs: online repair of file link counts Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/3] xfs: report inode link count health Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/3] xfs: teach scrub to check file nlinks Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/3] xfs_scrub: use multiple threads to run in-kernel metadata scrubs that scan inodes Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: indirect health reporting Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/4] xfs_scrub: upload clean bills of health Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/4] xfs: add secondary and indirect classes to the health tracking system Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/4] xfs: update health status if we get a clean bill of health Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/4] xfs: remember sick inodes that get inactivated Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/9] libxfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/9] libxfs: add xfile support Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/9] libxfs: clean up xfs_da_unmount usage Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/9] libxfs: teach buftargs to maintain their own buffer hashtable Darrick J. Wong
2022-12-30 22:17   ` [PATCH 9/9] xfbtree: let the buffer cache flush dirty buffers to the xfile Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/9] libxfs: support in-memory buffer cache targets Darrick J. Wong
2022-12-30 22:17   ` [PATCH 8/9] xfs: connect in-memory btrees to xfiles Darrick J. Wong
2022-12-30 22:17   ` [PATCH 5/9] xfs: consolidate btree block freeing tracepoints Darrick J. Wong
2022-12-30 22:17   ` [PATCH 7/9] xfs: support in-memory btrees Darrick J. Wong
2022-12-30 22:17   ` [PATCH 6/9] xfs: consolidate btree block allocation tracepoints Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/5] libxfs: online repair of rmap btrees Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/5] xfs: create a helper to decide if a file mapping targets the rt volume Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/5] xfs: repair the rmapbt Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/5] xfs: create a shadow rmap btree during rmap repair Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/5] xfs: hook live rmap operations during a repair operation Darrick J. Wong
2022-12-30 22:17   ` [PATCH 5/5] mkfs: enable reverse mapping by default Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/6] xfs_repair: use in-memory rmap btrees Darrick J. Wong
2022-12-30 22:17   ` [PATCH 6/6] xfs_repair: remove the old rmap collection slabs Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/6] xfs_repair: verify on-disk rmap btrees with in-memory btree data Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/6] xfs_repair: compute refcount data from in-memory rmap btrees Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/6] libxfs: partition memfd files to avoid using too many fds Darrick J. Wong
2022-12-30 22:17   ` [PATCH 5/6] xfs_repair: reduce rmap bag memory usage when creating refcounts Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/6] xfs_repair: convert regular rmap repair to use in-memory btrees Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/5] xfs_repair: reduce refcount repair memory usage Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/5] xfs_repair: port to the new refcount bag structure Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/5] xfs_repair: define an in-memory btree for storing refcount bag info Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/5] xfs: " Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/5] xfs_repair: create refcount bag Darrick J. Wong
2022-12-30 22:17   ` [PATCH 5/5] xfs_repair: remove the old bag implementation Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 0/4] libxfs: clean up symbolic link code Darrick J. Wong
2022-12-30 22:17   ` [PATCH 1/4] xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h Darrick J. Wong
2022-12-30 22:17   ` [PATCH 2/4] xfs: move remote symlink target read function to libxfs Darrick J. Wong
2022-12-30 22:17   ` [PATCH 3/4] xfs: move symlink target write " Darrick J. Wong
2022-12-30 22:17   ` [PATCH 4/4] mkfs: use libxfs to create symlinks Darrick J. Wong
2022-12-30 22:17 ` [PATCHSET v24.0 00/19] libxfs: atomic file updates Darrick J. Wong
2022-12-30 22:17   ` [PATCH 03/19] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
2022-12-30 22:17   ` [PATCH 04/19] xfs: introduce a swap-extent log intent item Darrick J. Wong
2022-12-30 22:17   ` [PATCH 01/19] vfs: introduce new file range exchange ioctl Darrick J. Wong
2022-12-30 22:17   ` [PATCH 02/19] xfs: parameterize all the incompat log feature helpers Darrick J. Wong
2022-12-30 22:18   ` [PATCH 06/19] xfs: add error injection to test swapext recovery Darrick J. Wong
2022-12-30 22:18   ` [PATCH 08/19] xfs: condense directories after an atomic swap Darrick J. Wong
2022-12-30 22:18   ` [PATCH 11/19] xfs: enable atomic swapext feature Darrick J. Wong
2022-12-30 22:18   ` [PATCH 12/19] libhandle: add support for bulkstat v5 Darrick J. Wong
2022-12-30 22:18   ` [PATCH 07/19] xfs: condense extended attributes after an atomic swap Darrick J. Wong
2022-12-30 22:18   ` [PATCH 09/19] xfs: condense symbolic links " Darrick J. Wong
2022-12-30 22:18   ` [PATCH 10/19] xfs: make atomic extent swapping support realtime files Darrick J. Wong
2022-12-30 22:18   ` [PATCH 05/19] xfs: create deferred log items for extent swapping Darrick J. Wong
2022-12-30 22:18   ` [PATCH 18/19] xfs_io: enhance swapext to take advantage of new api Darrick J. Wong
2022-12-30 22:18   ` [PATCH 19/19] xfs_io: add atomic update commands to exercise extent swapping Darrick J. Wong
2022-12-30 22:18   ` [PATCH 14/19] xfs_logprint: support dumping swapext log items Darrick J. Wong
2022-12-30 22:18   ` [PATCH 16/19] xfs_fsr: port to new swapext library function Darrick J. Wong
2022-12-30 22:18   ` [PATCH 15/19] xfs_fsr: convert to bulkstat v5 ioctls Darrick J. Wong
2022-12-30 22:18   ` [PATCH 17/19] xfs_fsr: skip the xattr/forkoff levering with the newer swapext implementations Darrick J. Wong
2022-12-30 22:18   ` [PATCH 13/19] libfrog: convert xfs_io swapext command to use new libfrog wrapper Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: online fsck of iunlink buckets Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/1] xfs_db: dump unlinked buckets Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/1] xfs: cache xfile pages for better performance Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/1] xfs: map xfile pages directly into xfs_buf Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: fixes to the repair code Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/6] xfs_scrub: log when a repair was unnecessary Darrick J. Wong
2022-12-30 22:18   ` [PATCH 5/6] xfs_scrub: require primary superblock repairs to complete before proceeding Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/6] xfs_scrub: remove ALP_* flags namespace Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/6] xfs_scrub: don't report media errors for space with unknowable owner Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/6] xfs_scrub: move repair functions to repair.c Darrick J. Wong
2022-12-30 22:18   ` [PATCH 6/6] xfs_scrub: actually try to fix summary counters ahead of repairs Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/6] xfs_scrub: improve warnings about difficult repairs Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/6] xfs_scrub: get rid of trivial fs metadata scanner helpers Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/6] xfs_scrub: collapse trivial superblock scrub helpers Darrick J. Wong
2022-12-30 22:18   ` [PATCH 6/6] xfs_scrub: warn about difficult repairs to rt and quota metadata Darrick J. Wong
2022-12-30 22:18   ` [PATCH 5/6] xfs_scrub: any inconsistency in metadata should trigger difficulty warnings Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/6] xfs_scrub: add missing repair types to the mustfix and difficulty assessment Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/6] xfs_scrub: split up the mustfix repairs and difficulty assessment functions Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/9] xfs_scrub: track data dependencies for repairs Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/9] xfs_scrub: remove scrub_metadata_file Darrick J. Wong
2022-12-30 22:18   ` [PATCH 5/9] xfs_scrub: boost the repair priority of dependencies of damaged items Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/9] xfs_scrub: remove action lists from phaseX code Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/9] xfs_scrub: use repair_item to direct repair activities Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/9] xfs_scrub: track repair items by principal, not by individual repairs Darrick J. Wong
2022-12-30 22:18   ` [PATCH 6/9] xfs_scrub: clean up repair_item_difficulty a little Darrick J. Wong
2022-12-30 22:18   ` [PATCH 8/9] xfs_scrub: retry incomplete repairs Darrick J. Wong
2022-12-30 22:18   ` [PATCH 9/9] xfs_scrub: remove unused action_list fields Darrick J. Wong
2022-12-30 22:18   ` [PATCH 7/9] xfs_scrub: check dependencies of a scrub type before repairing Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: use scrub_item to track check progress Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/5] xfs_scrub: refactor scrub_meta_type out of existence Darrick J. Wong
2022-12-30 22:18   ` [PATCH 5/5] xfs_scrub: hoist scrub retry loop to scrub_item_check_file Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/5] xfs_scrub: hoist repair retry loop to repair_item_class Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/5] xfs_scrub: remove enum check_outcome Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/5] xfs_scrub: start tracking scrub state in scrub_item Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub: improve scheduling of repair items Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/4] libfrog: enhance ptvar to support initializer functions Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/4] xfs_scrub: recheck entire metadata objects after corruption repairs Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/4] xfs_scrub: improve thread scheduling repair items during phase 4 Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/4] xfs_scrub: try to repair space metadata before file metadata Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/7] xfs_scrub: move fstrim to a separate phase Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/7] xfs_scrub: ignore phase 8 if the user disabled fstrim Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/7] xfs_scrub: collapse trim_filesystem Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/7] xfs_scrub: move FITRIM to phase 8 Darrick J. Wong
2022-12-30 22:18   ` [PATCH 7/7] xfs_scrub: fstrim each AG in parallel Darrick J. Wong
2022-12-30 22:18   ` [PATCH 6/7] xfs_scrub: don't call FITRIM after runtime errors Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/7] xfs_scrub: fix the work estimation for phase 8 Darrick J. Wong
2022-12-30 22:18   ` [PATCH 5/7] xfs_scrub: report FITRIM errors properly Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/8] xfs_scrub: fixes for systemd services Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/8] xfs_scrub_fail: escape paths correctly Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/8] xfs_scrub_fail: fix sendmail detection Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/8] debian: install scrub services with dh_installsystemd Darrick J. Wong
2022-12-30 22:18   ` [PATCH 5/8] xfs_scrub_fail: return the failure status of the mailer program Darrick J. Wong
2022-12-30 22:18   ` [PATCH 7/8] xfs_scrub_all: escape service names consistently Darrick J. Wong
2022-12-30 22:18   ` [PATCH 6/8] xfs_scrub_all: fix argument passing when invoking xfs_scrub manually Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/8] xfs_scrub: add missing copyrights and spdx headers Darrick J. Wong
2022-12-30 22:18   ` [PATCH 8/8] xfs_scrub_all: survive systemd restarts when waiting for services Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/5] xfs_scrub: tighten security of systemd services Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/5] xfs_scrub.service: reduce CPU usage to 60% when possible Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/5] xfs_scrub: allow auxiliary pathnames for sandboxing Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/5] xfs_scrub_fail: tighten up the security on the background systemd service Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/5] xfs_scrub: " Darrick J. Wong
2022-12-30 22:18   ` [PATCH 5/5] xfs_scrub_all: " Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/4] xfs_scrub_all: automatic media scan service Darrick J. Wong
2022-12-30 22:18   ` [PATCH 4/4] xfs_scrub_all: failure reporting for the xfs_scrub_all job Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/4] xfs_scrub_all: support metadata+media scans of all filesystems Darrick J. Wong
2022-12-30 22:18   ` [PATCH 3/4] xfs_scrub_all: trigger automatic media scans once per month Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/4] xfs_scrub_all: enable periodic file data scrubs automatically Darrick J. Wong
2022-12-30 22:18 ` [PATCHSET v24.0 0/2] xfs_scrub: automatic optimization by default Darrick J. Wong
2022-12-30 22:18   ` [PATCH 1/2] xfs_scrub: automatic downgrades to dry-run mode in service mode Darrick J. Wong
2022-12-30 22:18   ` [PATCH 2/2] xfs_scrub: add an optimization-only mode Darrick J. Wong
2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: race online scrub with fsstress Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/5] xfs: race fsstress with online scrubbers for AG and fs metadata Darrick J. Wong
2023-02-05 13:04     ` Zorro Lang
2023-02-07 16:58       ` Darrick J. Wong
2023-02-07 17:02     ` [PATCH v24.1 " Darrick J. Wong
2023-02-07 18:45       ` Zorro Lang
2022-12-30 22:19   ` [PATCH 5/5] xfs: race fsstress with online scrubbers for file metadata Darrick J. Wong
2022-12-30 22:19   ` [PATCH 3/5] fuzzy: add a custom xfs find utility for scrub stress tests Darrick J. Wong
2023-02-05 12:57     ` Zorro Lang
2023-02-07 16:57       ` Darrick J. Wong
2023-02-07 17:01     ` [PATCH v24.1 " Darrick J. Wong
2023-02-07 18:42       ` Zorro Lang
2022-12-30 22:19   ` [PATCH 1/5] xfs/357: switch fuzzing to agi 1 Darrick J. Wong
2023-02-07 18:46     ` Zorro Lang
2022-12-30 22:19   ` [PATCH 4/5] fuzzy: allow xfs scrub stress tests to pick preconfigured fsstress configs Darrick J. Wong
2023-02-07 18:48     ` Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] xfs: force rebuilding of metadata Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] fuzzy: use FORCE_REBUILD over injecting force_repair Darrick J. Wong
2023-02-14  8:00     ` Zorro Lang
2023-02-14 18:18       ` Darrick J. Wong
2023-02-16 14:57         ` Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/2] xfs: stress test ag repair functions Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/2] xfs: test rebuilding the entire filesystem with online fsck Darrick J. Wong
2023-02-18  6:06   ` [PATCHSET v24.0 0/2] fstests: online repair of AG btrees Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of inodes Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online repair for inode record metadata Darrick J. Wong
2023-02-18  6:07     ` Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Darrick J. Wong
2022-12-30 22:19   ` [PATCH 4/4] xfs: race fsstress with online repair for special file metadata Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/4] xfs: test rebuilding xattrs when the data fork is btree format Darrick J. Wong
2022-12-30 22:19   ` [PATCH 3/4] xfs: ensure that online file data fork repairs don't hit EDQUOT Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/4] xfs: race fsstress with online repair for inode and fork metadata Darrick J. Wong
2023-02-18  6:07   ` [PATCHSET v24.0 0/4] fstests: online repair of file fork mappings Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota and counters Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online scrub and repair for quota metadata Darrick J. Wong
2023-02-18  6:10     ` Zorro Lang
2023-02-18  6:12     ` Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of quota counters Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online scrub and repair for quotacheck Darrick J. Wong
2023-02-18  6:12     ` Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of file link counts Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with inode link count check and repair Darrick J. Wong
2023-02-18  6:13     ` Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair for fs summary counters Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/2] xfs: test fs summary counter online repair Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/2] xfs: race fsstress with online repair for summary counters Darrick J. Wong
2023-02-18  6:14   ` [PATCHSET v24.0 0/2] fstests: online repair for fs " Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of rmap btrees Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] xfs/422: don't freeze while racing rmap repair and fsstress Darrick J. Wong
2023-02-18  6:15     ` Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/2] populate: take a snapshot of the filesystem if creation fails Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/2] populate: fix some weirdness in __populate_check_xfs_agbtree_height Darrick J. Wong
2023-02-18  6:16   ` [PATCHSET v24.0 0/2] fstests: fix a few bugs in fs population Zorro Lang
2022-12-30 22:19 ` [PATCHSET v24.0 00/24] fstests: improve xfs fuzzing Darrick J. Wong
2022-12-30 22:19   ` [PATCH 01/24] fuzzy: disable per-field random fuzzing by default Darrick J. Wong
2022-12-30 22:19   ` [PATCH 02/24] fuzzy: disable timstamp " Darrick J. Wong
2022-12-30 22:19   ` [PATCH 07/24] fuzzy: don't fuzz xattr namespace flags and values Darrick J. Wong
2022-12-30 22:19   ` [PATCH 05/24] fuzzy: don't fuzz inode generation numbers Darrick J. Wong
2022-12-30 22:19   ` [PATCH 03/24] fuzzy: don't fuzz the log sequence number Darrick J. Wong
2022-12-30 22:19   ` [PATCH 06/24] fuzzy: don't fuzz user-controllable inode flags Darrick J. Wong
2022-12-30 22:19   ` [PATCH 04/24] fuzzy: don't fuzz obsolete inode fields Darrick J. Wong
2022-12-30 22:19   ` [PATCH 12/24] common/fuzzy: fix some problems with the offline repair strategy Darrick J. Wong
2022-12-30 22:19   ` [PATCH 11/24] common/fuzzy: fix some problems with the online " Darrick J. Wong
2022-12-30 22:19   ` [PATCH 08/24] common/fuzzy: split out each repair strategy into a separate helper Darrick J. Wong
2022-12-30 22:19   ` [PATCH 10/24] common/fuzzy: hoist the post-repair fs modification step Darrick J. Wong
2022-12-30 22:19   ` [PATCH 09/24] common/fuzzy: add an underline to the full log between sections Darrick J. Wong
2022-12-30 22:19   ` [PATCH 13/24] common/fuzzy: fix some problems with the no-repair strategy Darrick J. Wong
2022-12-30 22:19   ` [PATCH 14/24] common/fuzzy: fix some problems with the online-then-offline repair strategy Darrick J. Wong
2022-12-30 22:19   ` [PATCH 16/24] xfs/{35[45],455}: fix bogus corruption errors Darrick J. Wong
2022-12-30 22:19   ` [PATCH 21/24] fuzzy: compress coredumps created while fuzzing Darrick J. Wong
2022-12-30 22:19   ` [PATCH 15/24] common/fuzzy: fix some problems with the post-repair fs modification code Darrick J. Wong
2022-12-30 22:19   ` [PATCH 19/24] common/fuzzy: exercise the filesystem a little harder after repairing Darrick J. Wong
2022-12-30 22:19   ` [PATCH 22/24] fuzzy: report the fuzzing repair strategy in seqres.full Darrick J. Wong
2022-12-30 22:19   ` [PATCH 17/24] common/fuzzy: evaluate xfs_check vs xfs_repair Darrick J. Wong
2022-12-30 22:19   ` [PATCH 20/24] fuzzy: dump metadata state before fuzzing Darrick J. Wong
2022-12-30 22:19   ` [PATCH 18/24] common: check xfs health after doing an online scrub Darrick J. Wong
2022-12-30 22:19   ` [PATCH 24/24] fuzzy: for fuzzing ag btrees, find the path to the AG header Darrick J. Wong
2022-12-30 22:19   ` [PATCH 23/24] xfs: improve metadata array field handling when fuzzing Darrick J. Wong
2022-12-30 22:19 ` [PATCHSET v24.0 0/5] fstests: strengthen fuzz testing Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/5] fuzzy: test fuzzing directory block mappings Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/5] fuzzy: test fuzzing xattr " Darrick J. Wong
2022-12-30 22:19   ` [PATCH 5/5] fuzzy: fuzz test key/pointers of inode btrees Darrick J. Wong
2022-12-30 22:19   ` [PATCH 4/5] xfs: fuzz test both repair strategies Darrick J. Wong
2022-12-30 22:19   ` [PATCH 3/5] fuzzy: test fuzzing realtime free space metadata Darrick J. Wong
2022-12-30 22:19 ` [PATCHSET v24.0 0/7] fstests: atomic file updates Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/7] xfs/122: fix for swapext log items Darrick J. Wong
2022-12-30 22:19   ` [PATCH 5/7] generic: test that file privilege gets dropped with FIEXCHANGE_RANGE Darrick J. Wong
2022-12-30 22:19   ` [PATCH 4/7] generic, xfs: test scatter-gather atomic file updates Darrick J. Wong
2022-12-30 22:19   ` [PATCH 3/7] generic: test new vfs swapext ioctl Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/7] generic: test old xfs extent swapping ioctl Darrick J. Wong
2022-12-30 22:19   ` [PATCH 7/7] fsstress: update for FIEXCHANGE_RANGE Darrick J. Wong
2022-12-30 22:19   ` [PATCH 6/7] fsx: support FIEXCHANGE_RANGE Darrick J. Wong
2023-02-28  1:55     ` Zorro Lang
2023-03-01  2:56       ` Darrick J. Wong
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of realtime summaries Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online repair of realtime summary files Darrick J. Wong
2022-12-30 22:19 ` [PATCHSET v24.0 0/1] fstests: online repair of extended attributes Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/1] xfs: race fsstress with online repair of extended attribute data Darrick J. Wong
2022-12-30 22:19 ` [PATCHSET v24.0 0/2] fstests: online repair of directories Darrick J. Wong
2022-12-30 22:19   ` [PATCH 1/2] xfs: ensure that online directory repairs don't hit EDQUOT Darrick J. Wong
2022-12-30 22:19   ` [PATCH 2/2] xfs: race fsstress with online repair of dirs and parent pointers Darrick J. Wong
2022-12-30 22:20 ` [PATCHSET v24.0 0/1] fstests: test automatic scrub optimization by default Darrick J. Wong
2022-12-30 22:20   ` [PATCH 1/1] xfs: test xfs_scrub dry run, preen, and repair mode Darrick J. Wong
