linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection
@ 2022-10-27 17:14 Darrick J. Wong
  2022-10-27 17:14 ` [PATCH 01/12] xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents Darrick J. Wong
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

Hi all,

Fuzz testing of the refcount btree demonstrated a weakness in validation
of refcount btree records during normal runtime.  The idea of using the
upper bit of the rc_startblock field to separate the refcount records
into one group for shared space and another for CoW staging extents was
added at the last minute.  The incore struct left this bit encoded in
the upper bit of the startblock field, which makes it all too easy for
arithmetic operations to overflow if we don't detect the cowflag
properly.

When I ran a norepair fuzz tester, I was able to crash the kernel on one
of these accidental overflows by fuzzing a key record in a node block,
which broke lookups.  To fix the problem, make the domain (shared/cow) a
separate field in the incore record.

Unfortunately, a customer also hit this once in production.  Due to bugs
in the kernel running on the VM host, writes to the disk image would
occasionally be lost.  Given sufficient memory pressure on the VM guest,
a refcountbt xfs_buf could be reclaimed and later reloaded from the
stale copy on the virtual disk.  The stale disk contents were a refcount
btree leaf block full of records for the wrong domain, and this caused
an infinite loop in the guest VM.

v2: actually include the refcount adjust loop invariant checking patch;
    move the deferred refcount continuation checks earlier in the series;
    break up the megapatch into smaller pieces; fix an uninitialized list
    error.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=refcount-cow-domain-6.1
---
 fs/xfs/libxfs/xfs_format.h         |   22 ---
 fs/xfs/libxfs/xfs_refcount.c       |  297 ++++++++++++++++++++++++++----------
 fs/xfs/libxfs/xfs_refcount.h       |   40 ++++-
 fs/xfs/libxfs/xfs_refcount_btree.c |   15 +-
 fs/xfs/libxfs/xfs_types.h          |   30 ++++
 fs/xfs/scrub/refcount.c            |   74 ++++-----
 fs/xfs/xfs_trace.h                 |   48 +++++-
 7 files changed, 362 insertions(+), 164 deletions(-)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 01/12] xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 20:41   ` Dave Chinner
  2022-10-27 17:14 ` [PATCH 02/12] xfs: check deferred refcount op continuation parameters Darrick J. Wong
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen
such that the end of the range would not be in the middle of a refcount
record.  If this is no longer the case, something is seriously wrong
with the btree.  Bail out with a corruption error.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 64b910caafaa..831353ba96dc 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -986,15 +986,29 @@ xfs_refcount_adjust_extents(
 			(*agbno) += tmp.rc_blockcount;
 			(*aglen) -= tmp.rc_blockcount;
 
+			/* Stop if there's nothing left to modify */
+			if (*aglen == 0 || !xfs_refcount_still_have_space(cur))
+				break;
+
+			/* Move the cursor to the start of ext. */
 			error = xfs_refcount_lookup_ge(cur, *agbno,
 					&found_rec);
 			if (error)
 				goto out_error;
 		}
 
-		/* Stop if there's nothing left to modify */
-		if (*aglen == 0 || !xfs_refcount_still_have_space(cur))
-			break;
+		/*
+		 * A previous step trimmed agbno/aglen such that the end of the
+		 * range would not be in the middle of the record.  If this is
+		 * no longer the case, something is seriously wrong with the
+		 * btree.  Make sure we never feed the synthesized record into
+		 * the processing loop below.
+		 */
+		if (XFS_IS_CORRUPT(cur->bc_mp, ext.rc_blockcount == 0) ||
+		    XFS_IS_CORRUPT(cur->bc_mp, ext.rc_blockcount > *aglen)) {
+			error = -EFSCORRUPTED;
+			goto out_error;
+		}
 
 		/*
 		 * Adjust the reference count and either update the tree


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 02/12] xfs: check deferred refcount op continuation parameters
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
  2022-10-27 17:14 ` [PATCH 01/12] xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 20:49   ` Dave Chinner
  2022-10-27 21:54   ` [PATCH v2.1 " Darrick J. Wong
  2022-10-27 17:14 ` [PATCH 03/12] xfs: move _irec structs to xfs_types.h Darrick J. Wong
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we're in the middle of a deferred refcount operation and decide to
roll the transaction to avoid overflowing the transaction space, we need
to check the new agbno/aglen parameters that we're about to record in
the new intent.  Specifically, we need to check that the new extent is
completely within the filesystem, and that continuation does not put us
into a different AG.

If the keys of a node block are wrong, the lookup to resume an
xfs_refcount_adjust_extents operation can put us into the wrong record
block.  If this happens, we might not find that we run out of aglen at
an exact record boundary, which will cause the loop control to do the
wrong thing.

The previous patch should take care of that problem, but let's add this
extra sanity check to stop corruption problems sooner than later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   48 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 831353ba96dc..c6aa832a8713 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1138,6 +1138,44 @@ xfs_refcount_finish_one_cleanup(
 		xfs_trans_brelse(tp, agbp);
 }
 
+/*
+ * Set up a continuation a deferred refcount operation by updating the intent.
+ * Checks to make sure we're not going to run off the end of the AG.
+ */
+static inline int
+xfs_refcount_continue_op(
+	struct xfs_btree_cur		*cur,
+	xfs_fsblock_t			startblock,
+	xfs_agblock_t			new_agbno,
+	xfs_extlen_t			new_len,
+	xfs_fsblock_t			*fsbp)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_perag		*pag = cur->bc_ag.pag;
+	xfs_fsblock_t			new_fsbno;
+	xfs_agnumber_t			old_agno;
+
+	old_agno = XFS_FSB_TO_AGNO(mp, startblock);
+	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
+
+	/*
+	 * If we don't have any work left to do, then there's no need
+	 * to perform the validation of the new parameters.
+	 */
+	if (!new_len)
+		goto done;
+
+	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
+		return -EFSCORRUPTED;
+
+	if (XFS_IS_CORRUPT(mp, old_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
+		return -EFSCORRUPTED;
+
+done:
+	*fsbp = new_fsbno;
+	return 0;
+}
+
 /*
  * Process one of the deferred refcount operations.  We pass back the
  * btree cursor to maintain our lock on the btree between calls.
@@ -1205,12 +1243,18 @@ xfs_refcount_finish_one(
 	case XFS_REFCOUNT_INCREASE:
 		error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
 				new_len, XFS_REFCOUNT_ADJUST_INCREASE);
-		*new_fsb = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
+		if (error)
+			goto out_drop;
+		error = xfs_refcount_continue_op(rcur, startblock, new_agbno,
+				*new_len, new_fsb);
 		break;
 	case XFS_REFCOUNT_DECREASE:
 		error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
 				new_len, XFS_REFCOUNT_ADJUST_DECREASE);
-		*new_fsb = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
+		if (error)
+			goto out_drop;
+		error = xfs_refcount_continue_op(rcur, startblock, new_agbno,
+				*new_len, new_fsb);
 		break;
 	case XFS_REFCOUNT_ALLOC_COW:
 		*new_fsb = startblock + blockcount;


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 03/12] xfs: move _irec structs to xfs_types.h
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
  2022-10-27 17:14 ` [PATCH 01/12] xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents Darrick J. Wong
  2022-10-27 17:14 ` [PATCH 02/12] xfs: check deferred refcount op continuation parameters Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 17:14 ` [PATCH 04/12] xfs: refactor refcount record usage in xchk_refcountbt_rec Darrick J. Wong
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Structure definitions for incore objects do not belong in the ondisk
format header.  Move them to the incore types header where they belong.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h |   20 --------------------
 fs/xfs/libxfs/xfs_types.h  |   20 ++++++++++++++++++++
 2 files changed, 20 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index b55bdfa9c8a8..005dd65b71cd 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1564,20 +1564,6 @@ struct xfs_rmap_rec {
 #define RMAPBT_UNUSED_OFFSET_BITLEN	7
 #define RMAPBT_OFFSET_BITLEN		54
 
-#define XFS_RMAP_ATTR_FORK		(1 << 0)
-#define XFS_RMAP_BMBT_BLOCK		(1 << 1)
-#define XFS_RMAP_UNWRITTEN		(1 << 2)
-#define XFS_RMAP_KEY_FLAGS		(XFS_RMAP_ATTR_FORK | \
-					 XFS_RMAP_BMBT_BLOCK)
-#define XFS_RMAP_REC_FLAGS		(XFS_RMAP_UNWRITTEN)
-struct xfs_rmap_irec {
-	xfs_agblock_t	rm_startblock;	/* extent start block */
-	xfs_extlen_t	rm_blockcount;	/* extent length */
-	uint64_t	rm_owner;	/* extent owner */
-	uint64_t	rm_offset;	/* offset within the owner */
-	unsigned int	rm_flags;	/* state flags */
-};
-
 /*
  * Key structure
  *
@@ -1640,12 +1626,6 @@ struct xfs_refcount_key {
 	__be32		rc_startblock;	/* starting block number */
 };
 
-struct xfs_refcount_irec {
-	xfs_agblock_t	rc_startblock;	/* starting block number */
-	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
-	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
-};
-
 #define MAXREFCOUNT	((xfs_nlink_t)~0U)
 #define MAXREFCEXTLEN	((xfs_extlen_t)~0U)
 
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index a6b7d98cf68f..2d9ebc7338b1 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -166,6 +166,26 @@ typedef struct xfs_bmbt_irec
 	xfs_exntst_t	br_state;	/* extent state */
 } xfs_bmbt_irec_t;
 
+struct xfs_refcount_irec {
+	xfs_agblock_t	rc_startblock;	/* starting block number */
+	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
+	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
+};
+
+#define XFS_RMAP_ATTR_FORK		(1 << 0)
+#define XFS_RMAP_BMBT_BLOCK		(1 << 1)
+#define XFS_RMAP_UNWRITTEN		(1 << 2)
+#define XFS_RMAP_KEY_FLAGS		(XFS_RMAP_ATTR_FORK | \
+					 XFS_RMAP_BMBT_BLOCK)
+#define XFS_RMAP_REC_FLAGS		(XFS_RMAP_UNWRITTEN)
+struct xfs_rmap_irec {
+	xfs_agblock_t	rm_startblock;	/* extent start block */
+	xfs_extlen_t	rm_blockcount;	/* extent length */
+	uint64_t	rm_owner;	/* extent owner */
+	uint64_t	rm_offset;	/* offset within the owner */
+	unsigned int	rm_flags;	/* state flags */
+};
+
 /* per-AG block reservation types */
 enum xfs_ag_resv_type {
 	XFS_AG_RESV_NONE = 0,


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 04/12] xfs: refactor refcount record usage in xchk_refcountbt_rec
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 03/12] xfs: move _irec structs to xfs_types.h Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 17:14 ` [PATCH 05/12] xfs: track cow/shared record domains explicitly in xfs_refcount_irec Darrick J. Wong
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Consolidate the open-coded xfs_refcount_irec fields into an actual
struct and use the existing _btrec_to_irec to decode the ondisk record.
This will reduce code churn in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/refcount.c |   58 +++++++++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 32 deletions(-)


diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index c68b767dc08f..8ab55e791ea8 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -269,15 +269,13 @@ xchk_refcountbt_process_rmap_fragments(
 STATIC void
 xchk_refcountbt_xref_rmap(
 	struct xfs_scrub		*sc,
-	xfs_agblock_t			bno,
-	xfs_extlen_t			len,
-	xfs_nlink_t			refcount)
+	const struct xfs_refcount_irec	*irec)
 {
 	struct xchk_refcnt_check	refchk = {
-		.sc = sc,
-		.bno = bno,
-		.len = len,
-		.refcount = refcount,
+		.sc			= sc,
+		.bno			= irec->rc_startblock,
+		.len			= irec->rc_blockcount,
+		.refcount		= irec->rc_refcount,
 		.seen = 0,
 	};
 	struct xfs_rmap_irec		low;
@@ -291,9 +289,9 @@ xchk_refcountbt_xref_rmap(
 
 	/* Cross-reference with the rmapbt to confirm the refcount. */
 	memset(&low, 0, sizeof(low));
-	low.rm_startblock = bno;
+	low.rm_startblock = irec->rc_startblock;
 	memset(&high, 0xFF, sizeof(high));
-	high.rm_startblock = bno + len - 1;
+	high.rm_startblock = irec->rc_startblock + irec->rc_blockcount - 1;
 
 	INIT_LIST_HEAD(&refchk.fragments);
 	error = xfs_rmap_query_range(sc->sa.rmap_cur, &low, &high,
@@ -302,7 +300,7 @@ xchk_refcountbt_xref_rmap(
 		goto out_free;
 
 	xchk_refcountbt_process_rmap_fragments(&refchk);
-	if (refcount != refchk.seen)
+	if (irec->rc_refcount != refchk.seen)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
 
 out_free:
@@ -315,17 +313,16 @@ xchk_refcountbt_xref_rmap(
 /* Cross-reference with the other btrees. */
 STATIC void
 xchk_refcountbt_xref(
-	struct xfs_scrub	*sc,
-	xfs_agblock_t		agbno,
-	xfs_extlen_t		len,
-	xfs_nlink_t		refcount)
+	struct xfs_scrub		*sc,
+	const struct xfs_refcount_irec	*irec)
 {
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		return;
 
-	xchk_xref_is_used_space(sc, agbno, len);
-	xchk_xref_is_not_inode_chunk(sc, agbno, len);
-	xchk_refcountbt_xref_rmap(sc, agbno, len, refcount);
+	xchk_xref_is_used_space(sc, irec->rc_startblock, irec->rc_blockcount);
+	xchk_xref_is_not_inode_chunk(sc, irec->rc_startblock,
+			irec->rc_blockcount);
+	xchk_refcountbt_xref_rmap(sc, irec);
 }
 
 /* Scrub a refcountbt record. */
@@ -334,35 +331,32 @@ xchk_refcountbt_rec(
 	struct xchk_btree	*bs,
 	const union xfs_btree_rec *rec)
 {
+	struct xfs_refcount_irec irec;
 	xfs_agblock_t		*cow_blocks = bs->private;
 	struct xfs_perag	*pag = bs->cur->bc_ag.pag;
-	xfs_agblock_t		bno;
-	xfs_extlen_t		len;
-	xfs_nlink_t		refcount;
 	bool			has_cowflag;
 
-	bno = be32_to_cpu(rec->refc.rc_startblock);
-	len = be32_to_cpu(rec->refc.rc_blockcount);
-	refcount = be32_to_cpu(rec->refc.rc_refcount);
+	xfs_refcount_btrec_to_irec(rec, &irec);
 
 	/* Only CoW records can have refcount == 1. */
-	has_cowflag = (bno & XFS_REFC_COW_START);
-	if ((refcount == 1 && !has_cowflag) || (refcount != 1 && has_cowflag))
+	has_cowflag = (irec.rc_startblock & XFS_REFC_COW_START);
+	if ((irec.rc_refcount == 1 && !has_cowflag) ||
+	    (irec.rc_refcount != 1 && has_cowflag))
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 	if (has_cowflag)
-		(*cow_blocks) += len;
+		(*cow_blocks) += irec.rc_blockcount;
 
 	/* Check the extent. */
-	bno &= ~XFS_REFC_COW_START;
-	if (bno + len <= bno ||
-	    !xfs_verify_agbno(pag, bno) ||
-	    !xfs_verify_agbno(pag, bno + len - 1))
+	irec.rc_startblock &= ~XFS_REFC_COW_START;
+	if (irec.rc_startblock + irec.rc_blockcount <= irec.rc_startblock ||
+	    !xfs_verify_agbno(pag, irec.rc_startblock) ||
+	    !xfs_verify_agbno(pag, irec.rc_startblock + irec.rc_blockcount - 1))
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 
-	if (refcount == 0)
+	if (irec.rc_refcount == 0)
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 
-	xchk_refcountbt_xref(bs->sc, bno, len, refcount);
+	xchk_refcountbt_xref(bs->sc, &irec);
 
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 05/12] xfs: track cow/shared record domains explicitly in xfs_refcount_irec
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 04/12] xfs: refactor refcount record usage in xchk_refcountbt_rec Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 21:03   ` Dave Chinner
  2022-10-27 17:14 ` [PATCH 06/12] xfs: report refcount domain in tracepoints Darrick J. Wong
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Just prior to committing the reflink code into upstream, the xfs
maintainer at the time requested that I find a way to shard the refcount
records into two domains -- one for records tracking shared extents, and
a second for tracking CoW staging extents.  The idea here was to
minimize mount time CoW reclamation by pushing all the CoW records to
the right edge of the keyspace, and it was accomplished by setting the
upper bit in rc_startblock.  We don't allow AGs to have more than 2^31
blocks, so the bit was free.

Unfortunately, this was a very late addition to the codebase, so most of
the refcount record processing code still treats rc_startblock as a u32
and pays no attention to whether or not the upper bit (the cow flag) is
set.  This is a weakness is theoretically exploitable, since we're not
fully validating the incoming metadata records.

Fuzzing demonstrates practical exploits of this weakness.  If the cow
flag of a node block key record is corrupted, a lookup operation can go
to the wrong record block and start returning records from the wrong
cow/shared domain.  This causes the math to go all wrong (since cow
domain is still implicit in the upper bit of rc_startblock) and we can
crash the kernel by tricking xfs into jumping into a nonexistent AG and
tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.

To fix this, start tracking the domain as an explicit part of struct
xfs_refcount_irec, adjust all refcount functions to check the domain
of a returned record, and alter the function definitions to accept them
where necessary.

Found by fuzzing keys[2].cowflag = add in xfs/464.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c       |  151 ++++++++++++++++++++++++------------
 fs/xfs/libxfs/xfs_refcount.h       |   28 ++++++-
 fs/xfs/libxfs/xfs_refcount_btree.c |   15 +++-
 fs/xfs/libxfs/xfs_types.h          |    6 +
 fs/xfs/scrub/refcount.c            |   22 ++---
 5 files changed, 154 insertions(+), 68 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index c6aa832a8713..8d0417565b9d 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -46,13 +46,16 @@ STATIC int __xfs_refcount_cow_free(struct xfs_btree_cur *rcur,
 int
 xfs_refcount_lookup_le(
 	struct xfs_btree_cur	*cur,
+	enum xfs_refc_domain	domain,
 	xfs_agblock_t		bno,
 	int			*stat)
 {
-	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
+	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
+			xfs_refcount_encode_startblock(bno, domain),
 			XFS_LOOKUP_LE);
 	cur->bc_rec.rc.rc_startblock = bno;
 	cur->bc_rec.rc.rc_blockcount = 0;
+	cur->bc_rec.rc.rc_domain = domain;
 	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
 }
 
@@ -63,13 +66,16 @@ xfs_refcount_lookup_le(
 int
 xfs_refcount_lookup_ge(
 	struct xfs_btree_cur	*cur,
+	enum xfs_refc_domain	domain,
 	xfs_agblock_t		bno,
 	int			*stat)
 {
-	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
+	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
+			xfs_refcount_encode_startblock(bno, domain),
 			XFS_LOOKUP_GE);
 	cur->bc_rec.rc.rc_startblock = bno;
 	cur->bc_rec.rc.rc_blockcount = 0;
+	cur->bc_rec.rc.rc_domain = domain;
 	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
 }
 
@@ -80,13 +86,16 @@ xfs_refcount_lookup_ge(
 int
 xfs_refcount_lookup_eq(
 	struct xfs_btree_cur	*cur,
+	enum xfs_refc_domain	domain,
 	xfs_agblock_t		bno,
 	int			*stat)
 {
-	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
+	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
+			xfs_refcount_encode_startblock(bno, domain),
 			XFS_LOOKUP_LE);
 	cur->bc_rec.rc.rc_startblock = bno;
 	cur->bc_rec.rc.rc_blockcount = 0;
+	cur->bc_rec.rc.rc_domain = domain;
 	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
 }
 
@@ -96,7 +105,17 @@ xfs_refcount_btrec_to_irec(
 	const union xfs_btree_rec	*rec,
 	struct xfs_refcount_irec	*irec)
 {
-	irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
+	__u32				start;
+
+	start = be32_to_cpu(rec->refc.rc_startblock);
+	if (start & XFS_REFC_COW_START) {
+		start &= ~XFS_REFC_COW_START;
+		irec->rc_domain = XFS_REFC_DOMAIN_COW;
+	} else {
+		irec->rc_domain = XFS_REFC_DOMAIN_SHARED;
+	}
+
+	irec->rc_startblock = start;
 	irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
 	irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
 }
@@ -114,7 +133,6 @@ xfs_refcount_get_rec(
 	struct xfs_perag		*pag = cur->bc_ag.pag;
 	union xfs_btree_rec		*rec;
 	int				error;
-	xfs_agblock_t			realstart;
 
 	error = xfs_btree_get_rec(cur, &rec, stat);
 	if (error || !*stat)
@@ -124,22 +142,18 @@ xfs_refcount_get_rec(
 	if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN)
 		goto out_bad_rec;
 
-	/* handle special COW-staging state */
-	realstart = irec->rc_startblock;
-	if (realstart & XFS_REFC_COW_START) {
-		if (irec->rc_refcount != 1)
-			goto out_bad_rec;
-		realstart &= ~XFS_REFC_COW_START;
-	} else if (irec->rc_refcount < 2) {
+	/* handle special COW-staging domain */
+	if (irec->rc_domain == XFS_REFC_DOMAIN_COW && irec->rc_refcount != 1)
+		goto out_bad_rec;
+	if (irec->rc_domain == XFS_REFC_DOMAIN_SHARED && irec->rc_refcount < 2)
 		goto out_bad_rec;
-	}
 
 	/* check for valid extent range, including overflow */
-	if (!xfs_verify_agbno(pag, realstart))
+	if (!xfs_verify_agbno(pag, irec->rc_startblock))
 		goto out_bad_rec;
-	if (realstart > realstart + irec->rc_blockcount)
+	if (irec->rc_startblock > irec->rc_startblock + irec->rc_blockcount)
 		goto out_bad_rec;
-	if (!xfs_verify_agbno(pag, realstart + irec->rc_blockcount - 1))
+	if (!xfs_verify_agbno(pag, irec->rc_startblock + irec->rc_blockcount - 1))
 		goto out_bad_rec;
 
 	if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT)
@@ -169,12 +183,17 @@ xfs_refcount_update(
 	struct xfs_refcount_irec	*irec)
 {
 	union xfs_btree_rec	rec;
+	__u32			start;
 	int			error;
 
 	trace_xfs_refcount_update(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec);
-	rec.refc.rc_startblock = cpu_to_be32(irec->rc_startblock);
+
+	start = xfs_refcount_encode_startblock(irec->rc_startblock,
+			irec->rc_domain);
+	rec.refc.rc_startblock = cpu_to_be32(start);
 	rec.refc.rc_blockcount = cpu_to_be32(irec->rc_blockcount);
 	rec.refc.rc_refcount = cpu_to_be32(irec->rc_refcount);
+
 	error = xfs_btree_update(cur, &rec);
 	if (error)
 		trace_xfs_refcount_update_error(cur->bc_mp,
@@ -196,9 +215,12 @@ xfs_refcount_insert(
 	int				error;
 
 	trace_xfs_refcount_insert(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec);
+
 	cur->bc_rec.rc.rc_startblock = irec->rc_startblock;
 	cur->bc_rec.rc.rc_blockcount = irec->rc_blockcount;
 	cur->bc_rec.rc.rc_refcount = irec->rc_refcount;
+	cur->bc_rec.rc.rc_domain = irec->rc_domain;
+
 	error = xfs_btree_insert(cur, i);
 	if (error)
 		goto out_error;
@@ -244,7 +266,8 @@ xfs_refcount_delete(
 	}
 	if (error)
 		goto out_error;
-	error = xfs_refcount_lookup_ge(cur, irec.rc_startblock, &found_rec);
+	error = xfs_refcount_lookup_ge(cur, irec.rc_domain, irec.rc_startblock,
+			&found_rec);
 out_error:
 	if (error)
 		trace_xfs_refcount_delete_error(cur->bc_mp,
@@ -343,6 +366,7 @@ xfs_refc_next(
 STATIC int
 xfs_refcount_split_extent(
 	struct xfs_btree_cur		*cur,
+	enum xfs_refc_domain		domain,
 	xfs_agblock_t			agbno,
 	bool				*shape_changed)
 {
@@ -351,7 +375,7 @@ xfs_refcount_split_extent(
 	int				error;
 
 	*shape_changed = false;
-	error = xfs_refcount_lookup_le(cur, agbno, &found_rec);
+	error = xfs_refcount_lookup_le(cur, domain, agbno, &found_rec);
 	if (error)
 		goto out_error;
 	if (!found_rec)
@@ -364,6 +388,7 @@ xfs_refcount_split_extent(
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
+
 	if (rcext.rc_startblock == agbno || xfs_refc_next(&rcext) <= agbno)
 		return 0;
 
@@ -423,8 +448,8 @@ xfs_refcount_merge_center_extents(
 	 * call removes the center and the second one removes the right
 	 * extent.
 	 */
-	error = xfs_refcount_lookup_ge(cur, center->rc_startblock,
-			&found_rec);
+	error = xfs_refcount_lookup_ge(cur, center->rc_domain,
+			center->rc_startblock, &found_rec);
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
@@ -451,8 +476,8 @@ xfs_refcount_merge_center_extents(
 	}
 
 	/* Enlarge the left extent. */
-	error = xfs_refcount_lookup_le(cur, left->rc_startblock,
-			&found_rec);
+	error = xfs_refcount_lookup_le(cur, left->rc_domain,
+			left->rc_startblock, &found_rec);
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
@@ -493,8 +518,8 @@ xfs_refcount_merge_left_extent(
 
 	/* If the extent at agbno (cleft) wasn't synthesized, remove it. */
 	if (cleft->rc_refcount > 1) {
-		error = xfs_refcount_lookup_le(cur, cleft->rc_startblock,
-				&found_rec);
+		error = xfs_refcount_lookup_le(cur, cleft->rc_domain,
+				cleft->rc_startblock, &found_rec);
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
@@ -512,8 +537,8 @@ xfs_refcount_merge_left_extent(
 	}
 
 	/* Enlarge the left extent. */
-	error = xfs_refcount_lookup_le(cur, left->rc_startblock,
-			&found_rec);
+	error = xfs_refcount_lookup_le(cur, left->rc_domain,
+			left->rc_startblock, &found_rec);
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
@@ -557,8 +582,8 @@ xfs_refcount_merge_right_extent(
 	 * remove it.
 	 */
 	if (cright->rc_refcount > 1) {
-		error = xfs_refcount_lookup_le(cur, cright->rc_startblock,
-			&found_rec);
+		error = xfs_refcount_lookup_le(cur, cright->rc_domain,
+				cright->rc_startblock, &found_rec);
 		if (error)
 			goto out_error;
 		if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
@@ -576,8 +601,8 @@ xfs_refcount_merge_right_extent(
 	}
 
 	/* Enlarge the right extent. */
-	error = xfs_refcount_lookup_le(cur, right->rc_startblock,
-			&found_rec);
+	error = xfs_refcount_lookup_le(cur, right->rc_domain,
+			right->rc_startblock, &found_rec);
 	if (error)
 		goto out_error;
 	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec != 1)) {
@@ -616,11 +641,17 @@ xfs_refcount_find_left_extents(
 	int				flags)
 {
 	struct xfs_refcount_irec	tmp;
+	enum xfs_refc_domain		domain;
 	int				error;
 	int				found_rec;
 
+	if (flags & XFS_FIND_RCEXT_SHARED)
+		domain = XFS_REFC_DOMAIN_SHARED;
+	else
+		domain = XFS_REFC_DOMAIN_COW;
+
 	left->rc_startblock = cleft->rc_startblock = NULLAGBLOCK;
-	error = xfs_refcount_lookup_le(cur, agbno - 1, &found_rec);
+	error = xfs_refcount_lookup_le(cur, domain, agbno - 1, &found_rec);
 	if (error)
 		goto out_error;
 	if (!found_rec)
@@ -671,6 +702,7 @@ xfs_refcount_find_left_extents(
 			cleft->rc_blockcount = min(aglen,
 					tmp.rc_startblock - agbno);
 			cleft->rc_refcount = 1;
+			cleft->rc_domain = domain;
 		}
 	} else {
 		/*
@@ -680,6 +712,7 @@ xfs_refcount_find_left_extents(
 		cleft->rc_startblock = agbno;
 		cleft->rc_blockcount = aglen;
 		cleft->rc_refcount = 1;
+		cleft->rc_domain = domain;
 	}
 	trace_xfs_refcount_find_left_extent(cur->bc_mp, cur->bc_ag.pag->pag_agno,
 			left, cleft, agbno);
@@ -705,11 +738,17 @@ xfs_refcount_find_right_extents(
 	int				flags)
 {
 	struct xfs_refcount_irec	tmp;
+	enum xfs_refc_domain		domain;
 	int				error;
 	int				found_rec;
 
+	if (flags & XFS_FIND_RCEXT_SHARED)
+		domain = XFS_REFC_DOMAIN_SHARED;
+	else
+		domain = XFS_REFC_DOMAIN_COW;
+
 	right->rc_startblock = cright->rc_startblock = NULLAGBLOCK;
-	error = xfs_refcount_lookup_ge(cur, agbno + aglen, &found_rec);
+	error = xfs_refcount_lookup_ge(cur, domain, agbno + aglen, &found_rec);
 	if (error)
 		goto out_error;
 	if (!found_rec)
@@ -760,6 +799,7 @@ xfs_refcount_find_right_extents(
 			cright->rc_blockcount = right->rc_startblock -
 					cright->rc_startblock;
 			cright->rc_refcount = 1;
+			cright->rc_domain = domain;
 		}
 	} else {
 		/*
@@ -769,6 +809,7 @@ xfs_refcount_find_right_extents(
 		cright->rc_startblock = agbno;
 		cright->rc_blockcount = aglen;
 		cright->rc_refcount = 1;
+		cright->rc_domain = domain;
 	}
 	trace_xfs_refcount_find_right_extent(cur->bc_mp, cur->bc_ag.pag->pag_agno,
 			cright, right, agbno + aglen);
@@ -933,7 +974,8 @@ xfs_refcount_adjust_extents(
 	if (*aglen == 0)
 		return 0;
 
-	error = xfs_refcount_lookup_ge(cur, *agbno, &found_rec);
+	error = xfs_refcount_lookup_ge(cur, XFS_REFC_DOMAIN_SHARED, *agbno,
+			&found_rec);
 	if (error)
 		goto out_error;
 
@@ -945,6 +987,7 @@ xfs_refcount_adjust_extents(
 			ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
 			ext.rc_blockcount = 0;
 			ext.rc_refcount = 0;
+			ext.rc_domain = XFS_REFC_DOMAIN_SHARED;
 		}
 
 		/*
@@ -957,6 +1000,8 @@ xfs_refcount_adjust_extents(
 			tmp.rc_blockcount = min(*aglen,
 					ext.rc_startblock - *agbno);
 			tmp.rc_refcount = 1 + adj;
+			tmp.rc_domain = XFS_REFC_DOMAIN_SHARED;
+
 			trace_xfs_refcount_modify_extent(cur->bc_mp,
 					cur->bc_ag.pag->pag_agno, &tmp);
 
@@ -991,7 +1036,8 @@ xfs_refcount_adjust_extents(
 				break;
 
 			/* Move the cursor to the start of ext. */
-			error = xfs_refcount_lookup_ge(cur, *agbno,
+			error = xfs_refcount_lookup_ge(cur,
+					XFS_REFC_DOMAIN_SHARED, *agbno,
 					&found_rec);
 			if (error)
 				goto out_error;
@@ -1084,13 +1130,15 @@ xfs_refcount_adjust(
 	/*
 	 * Ensure that no rcextents cross the boundary of the adjustment range.
 	 */
-	error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
+	error = xfs_refcount_split_extent(cur, XFS_REFC_DOMAIN_SHARED,
+			agbno, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
 		shape_changes++;
 
-	error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
+	error = xfs_refcount_split_extent(cur, XFS_REFC_DOMAIN_SHARED,
+			agbno + aglen, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
@@ -1365,7 +1413,8 @@ xfs_refcount_find_shared(
 	*flen = 0;
 
 	/* Try to find a refcount extent that crosses the start */
-	error = xfs_refcount_lookup_le(cur, agbno, &have);
+	error = xfs_refcount_lookup_le(cur, XFS_REFC_DOMAIN_SHARED, agbno,
+			&have);
 	if (error)
 		goto out_error;
 	if (!have) {
@@ -1513,17 +1562,18 @@ xfs_refcount_adjust_cow_extents(
 		return 0;
 
 	/* Find any overlapping refcount records */
-	error = xfs_refcount_lookup_ge(cur, agbno, &found_rec);
+	error = xfs_refcount_lookup_ge(cur, XFS_REFC_DOMAIN_COW, agbno,
+			&found_rec);
 	if (error)
 		goto out_error;
 	error = xfs_refcount_get_rec(cur, &ext, &found_rec);
 	if (error)
 		goto out_error;
 	if (!found_rec) {
-		ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks +
-				XFS_REFC_COW_START;
+		ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
 		ext.rc_blockcount = 0;
 		ext.rc_refcount = 0;
+		ext.rc_domain = XFS_REFC_DOMAIN_COW;
 	}
 
 	switch (adj) {
@@ -1538,6 +1588,8 @@ xfs_refcount_adjust_cow_extents(
 		tmp.rc_startblock = agbno;
 		tmp.rc_blockcount = aglen;
 		tmp.rc_refcount = 1;
+		tmp.rc_domain = XFS_REFC_DOMAIN_COW;
+
 		trace_xfs_refcount_modify_extent(cur->bc_mp,
 				cur->bc_ag.pag->pag_agno, &tmp);
 
@@ -1600,16 +1652,16 @@ xfs_refcount_adjust_cow(
 	bool			shape_changed;
 	int			error;
 
-	agbno += XFS_REFC_COW_START;
-
 	/*
 	 * Ensure that no rcextents cross the boundary of the adjustment range.
 	 */
-	error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
+	error = xfs_refcount_split_extent(cur, XFS_REFC_DOMAIN_COW,
+			agbno, &shape_changed);
 	if (error)
 		goto out_error;
 
-	error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
+	error = xfs_refcount_split_extent(cur, XFS_REFC_DOMAIN_COW,
+			agbno + aglen, &shape_changed);
 	if (error)
 		goto out_error;
 
@@ -1745,7 +1797,6 @@ xfs_refcount_recover_cow_leftovers(
 	union xfs_btree_irec		low;
 	union xfs_btree_irec		high;
 	xfs_fsblock_t			fsb;
-	xfs_agblock_t			agbno;
 	int				error;
 
 	if (mp->m_sb.sb_agblocks >= XFS_REFC_COW_START)
@@ -1775,7 +1826,7 @@ xfs_refcount_recover_cow_leftovers(
 	/* Find all the leftover CoW staging extents. */
 	memset(&low, 0, sizeof(low));
 	memset(&high, 0, sizeof(high));
-	low.rc.rc_startblock = XFS_REFC_COW_START;
+	low.rc.rc_domain = high.rc.rc_domain = XFS_REFC_DOMAIN_COW;
 	high.rc.rc_startblock = -1U;
 	error = xfs_btree_query_range(cur, &low, &high,
 			xfs_refcount_recover_extent, &debris);
@@ -1796,8 +1847,8 @@ xfs_refcount_recover_cow_leftovers(
 				&rr->rr_rrec);
 
 		/* Free the orphan record */
-		agbno = rr->rr_rrec.rc_startblock - XFS_REFC_COW_START;
-		fsb = XFS_AGB_TO_FSB(mp, pag->pag_agno, agbno);
+		fsb = XFS_AGB_TO_FSB(mp, pag->pag_agno,
+				rr->rr_rrec.rc_startblock);
 		xfs_refcount_free_cow_extent(tp, fsb,
 				rr->rr_rrec.rc_blockcount);
 
@@ -1828,6 +1879,7 @@ xfs_refcount_recover_cow_leftovers(
 int
 xfs_refcount_has_record(
 	struct xfs_btree_cur	*cur,
+	enum xfs_refc_domain	domain,
 	xfs_agblock_t		bno,
 	xfs_extlen_t		len,
 	bool			*exists)
@@ -1839,6 +1891,7 @@ xfs_refcount_has_record(
 	low.rc.rc_startblock = bno;
 	memset(&high, 0xFF, sizeof(high));
 	high.rc.rc_startblock = bno + len - 1;
+	low.rc.rc_domain = high.rc.rc_domain = domain;
 
 	return xfs_btree_has_record(cur, &low, &high, exists);
 }
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index e8b322de7f3d..ca3ffe11a452 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -14,14 +14,33 @@ struct xfs_bmbt_irec;
 struct xfs_refcount_irec;
 
 extern int xfs_refcount_lookup_le(struct xfs_btree_cur *cur,
-		xfs_agblock_t bno, int *stat);
+		enum xfs_refc_domain domain, xfs_agblock_t bno, int *stat);
 extern int xfs_refcount_lookup_ge(struct xfs_btree_cur *cur,
-		xfs_agblock_t bno, int *stat);
+		enum xfs_refc_domain domain, xfs_agblock_t bno, int *stat);
 extern int xfs_refcount_lookup_eq(struct xfs_btree_cur *cur,
-		xfs_agblock_t bno, int *stat);
+		enum xfs_refc_domain domain, xfs_agblock_t bno, int *stat);
 extern int xfs_refcount_get_rec(struct xfs_btree_cur *cur,
 		struct xfs_refcount_irec *irec, int *stat);
 
+static inline __u32
+xfs_refcount_encode_startblock(
+	xfs_agblock_t		startblock,
+	enum xfs_refc_domain	domain)
+{
+	__u32			start;
+
+	/*
+	 * low level btree operations need to handle the generic btree range
+	 * query functions (which set rc_domain == -1U), so we check that the
+	 * domain is /not/ shared.
+	 */
+	start = startblock & ~XFS_REFC_COW_START;
+	if (domain != XFS_REFC_DOMAIN_SHARED)
+		start |= XFS_REFC_COW_START;
+
+	return start;
+}
+
 enum xfs_refcount_intent_type {
 	XFS_REFCOUNT_INCREASE = 1,
 	XFS_REFCOUNT_DECREASE,
@@ -79,7 +98,8 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
 #define XFS_REFCOUNT_ITEM_OVERHEAD	32
 
 extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
-		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
+		enum xfs_refc_domain domain, xfs_agblock_t bno,
+		xfs_extlen_t len, bool *exists);
 union xfs_btree_rec;
 extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_refcount_irec *irec);
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 316c1ec0c3c2..c57cddfcf434 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -13,6 +13,7 @@
 #include "xfs_btree.h"
 #include "xfs_btree_staging.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_refcount.h"
 #include "xfs_alloc.h"
 #include "xfs_error.h"
 #include "xfs_trace.h"
@@ -160,7 +161,12 @@ xfs_refcountbt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
 {
-	rec->refc.rc_startblock = cpu_to_be32(cur->bc_rec.rc.rc_startblock);
+	const struct xfs_refcount_irec *irec = &cur->bc_rec.rc;
+	__u32			start;
+
+	start = xfs_refcount_encode_startblock(irec->rc_startblock,
+			irec->rc_domain);
+	rec->refc.rc_startblock = cpu_to_be32(start);
 	rec->refc.rc_blockcount = cpu_to_be32(cur->bc_rec.rc.rc_blockcount);
 	rec->refc.rc_refcount = cpu_to_be32(cur->bc_rec.rc.rc_refcount);
 }
@@ -182,10 +188,13 @@ xfs_refcountbt_key_diff(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*key)
 {
-	struct xfs_refcount_irec	*rec = &cur->bc_rec.rc;
 	const struct xfs_refcount_key	*kp = &key->refc;
+	const struct xfs_refcount_irec	*irec = &cur->bc_rec.rc;
+	__u32				start;
 
-	return (int64_t)be32_to_cpu(kp->rc_startblock) - rec->rc_startblock;
+	start = xfs_refcount_encode_startblock(irec->rc_startblock,
+			irec->rc_domain);
+	return (int64_t)be32_to_cpu(kp->rc_startblock) - start;
 }
 
 STATIC int64_t
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 2d9ebc7338b1..eb9a98338bb9 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -166,10 +166,16 @@ typedef struct xfs_bmbt_irec
 	xfs_exntst_t	br_state;	/* extent state */
 } xfs_bmbt_irec_t;
 
+enum xfs_refc_domain {
+	XFS_REFC_DOMAIN_SHARED = 0,
+	XFS_REFC_DOMAIN_COW,
+};
+
 struct xfs_refcount_irec {
 	xfs_agblock_t	rc_startblock;	/* starting block number */
 	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
 	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
+	enum xfs_refc_domain	rc_domain; /* shared or cow staging extent? */
 };
 
 #define XFS_RMAP_ATTR_FORK		(1 << 0)
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 8ab55e791ea8..6adc645060e7 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -334,20 +334,19 @@ xchk_refcountbt_rec(
 	struct xfs_refcount_irec irec;
 	xfs_agblock_t		*cow_blocks = bs->private;
 	struct xfs_perag	*pag = bs->cur->bc_ag.pag;
-	bool			has_cowflag;
 
 	xfs_refcount_btrec_to_irec(rec, &irec);
 
 	/* Only CoW records can have refcount == 1. */
-	has_cowflag = (irec.rc_startblock & XFS_REFC_COW_START);
-	if ((irec.rc_refcount == 1 && !has_cowflag) ||
-	    (irec.rc_refcount != 1 && has_cowflag))
+	if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED && irec.rc_refcount == 1)
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-	if (has_cowflag)
+	if (irec.rc_domain == XFS_REFC_DOMAIN_COW) {
+		if (irec.rc_refcount != 1)
+			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 		(*cow_blocks) += irec.rc_blockcount;
+	}
 
 	/* Check the extent. */
-	irec.rc_startblock &= ~XFS_REFC_COW_START;
 	if (irec.rc_startblock + irec.rc_blockcount <= irec.rc_startblock ||
 	    !xfs_verify_agbno(pag, irec.rc_startblock) ||
 	    !xfs_verify_agbno(pag, irec.rc_startblock + irec.rc_blockcount - 1))
@@ -420,7 +419,6 @@ xchk_xref_is_cow_staging(
 	xfs_extlen_t			len)
 {
 	struct xfs_refcount_irec	rc;
-	bool				has_cowflag;
 	int				has_refcount;
 	int				error;
 
@@ -428,8 +426,8 @@ xchk_xref_is_cow_staging(
 		return;
 
 	/* Find the CoW staging extent. */
-	error = xfs_refcount_lookup_le(sc->sa.refc_cur,
-			agbno + XFS_REFC_COW_START, &has_refcount);
+	error = xfs_refcount_lookup_le(sc->sa.refc_cur, XFS_REFC_DOMAIN_COW,
+			agbno, &has_refcount);
 	if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur))
 		return;
 	if (!has_refcount) {
@@ -446,8 +444,7 @@ xchk_xref_is_cow_staging(
 	}
 
 	/* CoW flag must be set, refcount must be 1. */
-	has_cowflag = (rc.rc_startblock & XFS_REFC_COW_START);
-	if (!has_cowflag || rc.rc_refcount != 1)
+	if (rc.rc_domain != XFS_REFC_DOMAIN_COW || rc.rc_refcount != 1)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0);
 
 	/* Must be at least as long as what was passed in */
@@ -471,7 +468,8 @@ xchk_xref_is_not_shared(
 	if (!sc->sa.refc_cur || xchk_skip_xref(sc->sm))
 		return;
 
-	error = xfs_refcount_has_record(sc->sa.refc_cur, agbno, len, &shared);
+	error = xfs_refcount_has_record(sc->sa.refc_cur, XFS_REFC_DOMAIN_SHARED,
+			agbno, len, &shared);
 	if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur))
 		return;
 	if (shared)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 06/12] xfs: report refcount domain in tracepoints
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 05/12] xfs: track cow/shared record domains explicitly in xfs_refcount_irec Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 21:05   ` Dave Chinner
  2022-10-27 17:14 ` [PATCH 07/12] xfs: refactor domain and refcount checking Darrick J. Wong
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we've broken out the startblock and shared/cow domain in the
incore refcount extent record structure, update the tracepoints to
report the domain.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_types.h |    4 ++++
 fs/xfs/xfs_trace.h        |   48 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 43 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index eb9a98338bb9..5ebdda7e1078 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -171,6 +171,10 @@ enum xfs_refc_domain {
 	XFS_REFC_DOMAIN_COW,
 };
 
+#define XFS_REFC_DOMAIN_STRINGS \
+	{ XFS_REFC_DOMAIN_SHARED,	"shared" }, \
+	{ XFS_REFC_DOMAIN_COW,		"cow" }
+
 struct xfs_refcount_irec {
 	xfs_agblock_t	rc_startblock;	/* starting block number */
 	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index cb7c81ba7fa3..372d871bccc5 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -799,6 +799,9 @@ TRACE_DEFINE_ENUM(PE_SIZE_PTE);
 TRACE_DEFINE_ENUM(PE_SIZE_PMD);
 TRACE_DEFINE_ENUM(PE_SIZE_PUD);
 
+TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_SHARED);
+TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_COW);
+
 TRACE_EVENT(xfs_filemap_fault,
 	TP_PROTO(struct xfs_inode *ip, enum page_entry_size pe_size,
 		 bool write_fault),
@@ -2925,6 +2928,7 @@ DECLARE_EVENT_CLASS(xfs_refcount_extent_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
+		__field(enum xfs_refc_domain, domain)
 		__field(xfs_agblock_t, startblock)
 		__field(xfs_extlen_t, blockcount)
 		__field(xfs_nlink_t, refcount)
@@ -2932,13 +2936,15 @@ DECLARE_EVENT_CLASS(xfs_refcount_extent_class,
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
+		__entry->domain = irec->rc_domain;
 		__entry->startblock = irec->rc_startblock;
 		__entry->blockcount = irec->rc_blockcount;
 		__entry->refcount = irec->rc_refcount;
 	),
-	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u",
+	TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
+		  __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->startblock,
 		  __entry->blockcount,
 		  __entry->refcount)
@@ -2958,6 +2964,7 @@ DECLARE_EVENT_CLASS(xfs_refcount_extent_at_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
+		__field(enum xfs_refc_domain, domain)
 		__field(xfs_agblock_t, startblock)
 		__field(xfs_extlen_t, blockcount)
 		__field(xfs_nlink_t, refcount)
@@ -2966,14 +2973,16 @@ DECLARE_EVENT_CLASS(xfs_refcount_extent_at_class,
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
+		__entry->domain = irec->rc_domain;
 		__entry->startblock = irec->rc_startblock;
 		__entry->blockcount = irec->rc_blockcount;
 		__entry->refcount = irec->rc_refcount;
 		__entry->agbno = agbno;
 	),
-	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u @ agbno 0x%x",
+	TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u @ agbno 0x%x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
+		  __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->startblock,
 		  __entry->blockcount,
 		  __entry->refcount,
@@ -2994,9 +3003,11 @@ DECLARE_EVENT_CLASS(xfs_refcount_double_extent_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
+		__field(enum xfs_refc_domain, i1_domain)
 		__field(xfs_agblock_t, i1_startblock)
 		__field(xfs_extlen_t, i1_blockcount)
 		__field(xfs_nlink_t, i1_refcount)
+		__field(enum xfs_refc_domain, i2_domain)
 		__field(xfs_agblock_t, i2_startblock)
 		__field(xfs_extlen_t, i2_blockcount)
 		__field(xfs_nlink_t, i2_refcount)
@@ -3004,20 +3015,24 @@ DECLARE_EVENT_CLASS(xfs_refcount_double_extent_class,
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
+		__entry->i1_domain = i1->rc_domain;
 		__entry->i1_startblock = i1->rc_startblock;
 		__entry->i1_blockcount = i1->rc_blockcount;
 		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_domain = i2->rc_domain;
 		__entry->i2_startblock = i2->rc_startblock;
 		__entry->i2_blockcount = i2->rc_blockcount;
 		__entry->i2_refcount = i2->rc_refcount;
 	),
-	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u -- "
-		  "agbno 0x%x fsbcount 0x%x refcount %u",
+	TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u -- "
+		  "dom %s agbno 0x%x fsbcount 0x%x refcount %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
+		  __print_symbolic(__entry->i1_domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->i1_startblock,
 		  __entry->i1_blockcount,
 		  __entry->i1_refcount,
+		  __print_symbolic(__entry->i2_domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->i2_startblock,
 		  __entry->i2_blockcount,
 		  __entry->i2_refcount)
@@ -3038,9 +3053,11 @@ DECLARE_EVENT_CLASS(xfs_refcount_double_extent_at_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
+		__field(enum xfs_refc_domain, i1_domain)
 		__field(xfs_agblock_t, i1_startblock)
 		__field(xfs_extlen_t, i1_blockcount)
 		__field(xfs_nlink_t, i1_refcount)
+		__field(enum xfs_refc_domain, i2_domain)
 		__field(xfs_agblock_t, i2_startblock)
 		__field(xfs_extlen_t, i2_blockcount)
 		__field(xfs_nlink_t, i2_refcount)
@@ -3049,21 +3066,25 @@ DECLARE_EVENT_CLASS(xfs_refcount_double_extent_at_class,
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
+		__entry->i1_domain = i1->rc_domain;
 		__entry->i1_startblock = i1->rc_startblock;
 		__entry->i1_blockcount = i1->rc_blockcount;
 		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_domain = i2->rc_domain;
 		__entry->i2_startblock = i2->rc_startblock;
 		__entry->i2_blockcount = i2->rc_blockcount;
 		__entry->i2_refcount = i2->rc_refcount;
 		__entry->agbno = agbno;
 	),
-	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u -- "
-		  "agbno 0x%x fsbcount 0x%x refcount %u @ agbno 0x%x",
+	TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u -- "
+		  "dom %s agbno 0x%x fsbcount 0x%x refcount %u @ agbno 0x%x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
+		  __print_symbolic(__entry->i1_domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->i1_startblock,
 		  __entry->i1_blockcount,
 		  __entry->i1_refcount,
+		  __print_symbolic(__entry->i2_domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->i2_startblock,
 		  __entry->i2_blockcount,
 		  __entry->i2_refcount,
@@ -3086,12 +3107,15 @@ DECLARE_EVENT_CLASS(xfs_refcount_triple_extent_class,
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
+		__field(enum xfs_refc_domain, i1_domain)
 		__field(xfs_agblock_t, i1_startblock)
 		__field(xfs_extlen_t, i1_blockcount)
 		__field(xfs_nlink_t, i1_refcount)
+		__field(enum xfs_refc_domain, i2_domain)
 		__field(xfs_agblock_t, i2_startblock)
 		__field(xfs_extlen_t, i2_blockcount)
 		__field(xfs_nlink_t, i2_refcount)
+		__field(enum xfs_refc_domain, i3_domain)
 		__field(xfs_agblock_t, i3_startblock)
 		__field(xfs_extlen_t, i3_blockcount)
 		__field(xfs_nlink_t, i3_refcount)
@@ -3099,27 +3123,33 @@ DECLARE_EVENT_CLASS(xfs_refcount_triple_extent_class,
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
 		__entry->agno = agno;
+		__entry->i1_domain = i1->rc_domain;
 		__entry->i1_startblock = i1->rc_startblock;
 		__entry->i1_blockcount = i1->rc_blockcount;
 		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_domain = i2->rc_domain;
 		__entry->i2_startblock = i2->rc_startblock;
 		__entry->i2_blockcount = i2->rc_blockcount;
 		__entry->i2_refcount = i2->rc_refcount;
+		__entry->i3_domain = i3->rc_domain;
 		__entry->i3_startblock = i3->rc_startblock;
 		__entry->i3_blockcount = i3->rc_blockcount;
 		__entry->i3_refcount = i3->rc_refcount;
 	),
-	TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%x refcount %u -- "
-		  "agbno 0x%x fsbcount 0x%x refcount %u -- "
-		  "agbno 0x%x fsbcount 0x%x refcount %u",
+	TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u -- "
+		  "dom %s agbno 0x%x fsbcount 0x%x refcount %u -- "
+		  "dom %s agbno 0x%x fsbcount 0x%x refcount %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
+		  __print_symbolic(__entry->i1_domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->i1_startblock,
 		  __entry->i1_blockcount,
 		  __entry->i1_refcount,
+		  __print_symbolic(__entry->i2_domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->i2_startblock,
 		  __entry->i2_blockcount,
 		  __entry->i2_refcount,
+		  __print_symbolic(__entry->i3_domain, XFS_REFC_DOMAIN_STRINGS),
 		  __entry->i3_startblock,
 		  __entry->i3_blockcount,
 		  __entry->i3_refcount)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 07/12] xfs: refactor domain and refcount checking
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (5 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 06/12] xfs: report refcount domain in tracepoints Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 21:07   ` Dave Chinner
  2022-10-27 17:14 ` [PATCH 08/12] xfs: remove XFS_FIND_RCEXT_SHARED and _COW Darrick J. Wong
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a helper function to ensure that CoW staging extent records have
a single refcount and that shared extent records have more than 1
refcount.  We'll put this to more use in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |    5 +----
 fs/xfs/libxfs/xfs_refcount.h |   12 ++++++++++++
 fs/xfs/scrub/refcount.c      |   10 ++++------
 3 files changed, 17 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 8d0417565b9d..586c58b5a557 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -142,10 +142,7 @@ xfs_refcount_get_rec(
 	if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN)
 		goto out_bad_rec;
 
-	/* handle special COW-staging domain */
-	if (irec->rc_domain == XFS_REFC_DOMAIN_COW && irec->rc_refcount != 1)
-		goto out_bad_rec;
-	if (irec->rc_domain == XFS_REFC_DOMAIN_SHARED && irec->rc_refcount < 2)
+	if (!xfs_refcount_check_domain(irec))
 		goto out_bad_rec;
 
 	/* check for valid extent range, including overflow */
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index ca3ffe11a452..d666a0e32659 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -55,6 +55,18 @@ struct xfs_refcount_intent {
 	xfs_fsblock_t				ri_startblock;
 };
 
+/* Check that the refcount is appropriate for the record domain. */
+static inline bool
+xfs_refcount_check_domain(
+	const struct xfs_refcount_irec	*irec)
+{
+	if (irec->rc_domain == XFS_REFC_DOMAIN_COW && irec->rc_refcount != 1)
+		return false;
+	if (irec->rc_domain == XFS_REFC_DOMAIN_SHARED && irec->rc_refcount < 2)
+		return false;
+	return true;
+}
+
 void xfs_refcount_increase_extent(struct xfs_trans *tp,
 		struct xfs_bmbt_irec *irec);
 void xfs_refcount_decrease_extent(struct xfs_trans *tp,
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 6adc645060e7..98c033072120 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -337,14 +337,12 @@ xchk_refcountbt_rec(
 
 	xfs_refcount_btrec_to_irec(rec, &irec);
 
-	/* Only CoW records can have refcount == 1. */
-	if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED && irec.rc_refcount == 1)
+	/* Check the domain and refcount are not incompatible. */
+	if (!xfs_refcount_check_domain(&irec))
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-	if (irec.rc_domain == XFS_REFC_DOMAIN_COW) {
-		if (irec.rc_refcount != 1)
-			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	if (irec.rc_domain == XFS_REFC_DOMAIN_COW)
 		(*cow_blocks) += irec.rc_blockcount;
-	}
 
 	/* Check the extent. */
 	if (irec.rc_startblock + irec.rc_blockcount <= irec.rc_startblock ||


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 08/12] xfs: remove XFS_FIND_RCEXT_SHARED and _COW
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (6 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 07/12] xfs: refactor domain and refcount checking Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 21:11   ` Dave Chinner
  2022-10-27 17:14 ` [PATCH 09/12] xfs: check record domain when accessing refcount records Darrick J. Wong
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have an explicit enum for shared and CoW staging extents, we
can get rid of the old FIND_RCEXT flags.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   48 +++++++++++++++---------------------------
 1 file changed, 17 insertions(+), 31 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 586c58b5a557..3b1cb0578770 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -622,8 +622,6 @@ xfs_refcount_merge_right_extent(
 	return error;
 }
 
-#define XFS_FIND_RCEXT_SHARED	1
-#define XFS_FIND_RCEXT_COW	2
 /*
  * Find the left extent and the one after it (cleft).  This function assumes
  * that we've already split any extent crossing agbno.
@@ -633,20 +631,14 @@ xfs_refcount_find_left_extents(
 	struct xfs_btree_cur		*cur,
 	struct xfs_refcount_irec	*left,
 	struct xfs_refcount_irec	*cleft,
+	enum xfs_refc_domain		domain,
 	xfs_agblock_t			agbno,
-	xfs_extlen_t			aglen,
-	int				flags)
+	xfs_extlen_t			aglen)
 {
 	struct xfs_refcount_irec	tmp;
-	enum xfs_refc_domain		domain;
 	int				error;
 	int				found_rec;
 
-	if (flags & XFS_FIND_RCEXT_SHARED)
-		domain = XFS_REFC_DOMAIN_SHARED;
-	else
-		domain = XFS_REFC_DOMAIN_COW;
-
 	left->rc_startblock = cleft->rc_startblock = NULLAGBLOCK;
 	error = xfs_refcount_lookup_le(cur, domain, agbno - 1, &found_rec);
 	if (error)
@@ -664,9 +656,9 @@ xfs_refcount_find_left_extents(
 
 	if (xfs_refc_next(&tmp) != agbno)
 		return 0;
-	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
+	if (domain == XFS_REFC_DOMAIN_SHARED && tmp.rc_refcount < 2)
 		return 0;
-	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
+	if (domain == XFS_REFC_DOMAIN_COW && tmp.rc_refcount > 1)
 		return 0;
 	/* We have a left extent; retrieve (or invent) the next right one */
 	*left = tmp;
@@ -730,20 +722,14 @@ xfs_refcount_find_right_extents(
 	struct xfs_btree_cur		*cur,
 	struct xfs_refcount_irec	*right,
 	struct xfs_refcount_irec	*cright,
+	enum xfs_refc_domain		domain,
 	xfs_agblock_t			agbno,
-	xfs_extlen_t			aglen,
-	int				flags)
+	xfs_extlen_t			aglen)
 {
 	struct xfs_refcount_irec	tmp;
-	enum xfs_refc_domain		domain;
 	int				error;
 	int				found_rec;
 
-	if (flags & XFS_FIND_RCEXT_SHARED)
-		domain = XFS_REFC_DOMAIN_SHARED;
-	else
-		domain = XFS_REFC_DOMAIN_COW;
-
 	right->rc_startblock = cright->rc_startblock = NULLAGBLOCK;
 	error = xfs_refcount_lookup_ge(cur, domain, agbno + aglen, &found_rec);
 	if (error)
@@ -761,9 +747,9 @@ xfs_refcount_find_right_extents(
 
 	if (tmp.rc_startblock != agbno + aglen)
 		return 0;
-	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
+	if (domain == XFS_REFC_DOMAIN_SHARED && tmp.rc_refcount < 2)
 		return 0;
-	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
+	if (domain == XFS_REFC_DOMAIN_COW && tmp.rc_refcount > 1)
 		return 0;
 	/* We have a right extent; retrieve (or invent) the next left one */
 	*right = tmp;
@@ -832,10 +818,10 @@ xfs_refc_valid(
 STATIC int
 xfs_refcount_merge_extents(
 	struct xfs_btree_cur	*cur,
+	enum xfs_refc_domain	domain,
 	xfs_agblock_t		*agbno,
 	xfs_extlen_t		*aglen,
 	enum xfs_refc_adjust_op adjust,
-	int			flags,
 	bool			*shape_changed)
 {
 	struct xfs_refcount_irec	left = {0}, cleft = {0};
@@ -850,12 +836,12 @@ xfs_refcount_merge_extents(
 	 * just below (agbno + aglen) [cright], and just above (agbno + aglen)
 	 * [right].
 	 */
-	error = xfs_refcount_find_left_extents(cur, &left, &cleft, *agbno,
-			*aglen, flags);
+	error = xfs_refcount_find_left_extents(cur, &left, &cleft, domain,
+			*agbno, *aglen);
 	if (error)
 		return error;
-	error = xfs_refcount_find_right_extents(cur, &right, &cright, *agbno,
-			*aglen, flags);
+	error = xfs_refcount_find_right_extents(cur, &right, &cright, domain,
+			*agbno, *aglen);
 	if (error)
 		return error;
 
@@ -1144,8 +1130,8 @@ xfs_refcount_adjust(
 	/*
 	 * Try to merge with the left or right extents of the range.
 	 */
-	error = xfs_refcount_merge_extents(cur, new_agbno, new_aglen, adj,
-			XFS_FIND_RCEXT_SHARED, &shape_changed);
+	error = xfs_refcount_merge_extents(cur, XFS_REFC_DOMAIN_SHARED,
+			new_agbno, new_aglen, adj, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
@@ -1665,8 +1651,8 @@ xfs_refcount_adjust_cow(
 	/*
 	 * Try to merge with the left or right extents of the range.
 	 */
-	error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
-			XFS_FIND_RCEXT_COW, &shape_changed);
+	error = xfs_refcount_merge_extents(cur, XFS_REFC_DOMAIN_COW, &agbno,
+			&aglen, adj, &shape_changed);
 	if (error)
 		goto out_error;
 


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 09/12] xfs: check record domain when accessing refcount records
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (7 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 08/12] xfs: remove XFS_FIND_RCEXT_SHARED and _COW Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 21:15   ` Dave Chinner
  2022-10-27 17:14 ` [PATCH 10/12] xfs: fix agblocks check in the cow leftover recovery function Darrick J. Wong
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we've separated the startblock and CoW/shared extent domain in
the incore refcount record structure, check the domain whenever we
retrieve a record to ensure that it's still in the domain that we want.
Depending on the circumstances, a change in domain either means we're
done processing or that we've found a corruption and need to fail out.

The refcount check in xchk_xref_is_cow_staging is redundant since
_get_rec has done that for a long time now, so we can get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   53 ++++++++++++++++++++++++++++++++----------
 fs/xfs/scrub/refcount.c      |    4 ++-
 2 files changed, 43 insertions(+), 14 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 3b1cb0578770..608a122eef16 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -386,6 +386,8 @@ xfs_refcount_split_extent(
 		goto out_error;
 	}
 
+	if (rcext.rc_domain != domain)
+		return 0;
 	if (rcext.rc_startblock == agbno || xfs_refc_next(&rcext) <= agbno)
 		return 0;
 
@@ -434,6 +436,9 @@ xfs_refcount_merge_center_extents(
 	int				error;
 	int				found_rec;
 
+	ASSERT(left->rc_domain == center->rc_domain);
+	ASSERT(right->rc_domain == center->rc_domain);
+
 	trace_xfs_refcount_merge_center_extents(cur->bc_mp,
 			cur->bc_ag.pag->pag_agno, left, center, right);
 
@@ -510,6 +515,8 @@ xfs_refcount_merge_left_extent(
 	int				error;
 	int				found_rec;
 
+	ASSERT(left->rc_domain == cleft->rc_domain);
+
 	trace_xfs_refcount_merge_left_extent(cur->bc_mp,
 			cur->bc_ag.pag->pag_agno, left, cleft);
 
@@ -571,6 +578,8 @@ xfs_refcount_merge_right_extent(
 	int				error;
 	int				found_rec;
 
+	ASSERT(right->rc_domain == cright->rc_domain);
+
 	trace_xfs_refcount_merge_right_extent(cur->bc_mp,
 			cur->bc_ag.pag->pag_agno, cright, right);
 
@@ -654,12 +663,10 @@ xfs_refcount_find_left_extents(
 		goto out_error;
 	}
 
+	if (tmp.rc_domain != domain)
+		return 0;
 	if (xfs_refc_next(&tmp) != agbno)
 		return 0;
-	if (domain == XFS_REFC_DOMAIN_SHARED && tmp.rc_refcount < 2)
-		return 0;
-	if (domain == XFS_REFC_DOMAIN_COW && tmp.rc_refcount > 1)
-		return 0;
 	/* We have a left extent; retrieve (or invent) the next right one */
 	*left = tmp;
 
@@ -675,6 +682,9 @@ xfs_refcount_find_left_extents(
 			goto out_error;
 		}
 
+		if (tmp.rc_domain != domain)
+			goto not_found;
+
 		/* if tmp starts at the end of our range, just use that */
 		if (tmp.rc_startblock == agbno)
 			*cleft = tmp;
@@ -694,6 +704,7 @@ xfs_refcount_find_left_extents(
 			cleft->rc_domain = domain;
 		}
 	} else {
+not_found:
 		/*
 		 * No extents, so pretend that there's one covering the whole
 		 * range.
@@ -745,12 +756,10 @@ xfs_refcount_find_right_extents(
 		goto out_error;
 	}
 
+	if (tmp.rc_domain != domain)
+		return 0;
 	if (tmp.rc_startblock != agbno + aglen)
 		return 0;
-	if (domain == XFS_REFC_DOMAIN_SHARED && tmp.rc_refcount < 2)
-		return 0;
-	if (domain == XFS_REFC_DOMAIN_COW && tmp.rc_refcount > 1)
-		return 0;
 	/* We have a right extent; retrieve (or invent) the next left one */
 	*right = tmp;
 
@@ -766,6 +775,9 @@ xfs_refcount_find_right_extents(
 			goto out_error;
 		}
 
+		if (tmp.rc_domain != domain)
+			goto not_found;
+
 		/* if tmp ends at the end of our range, just use that */
 		if (xfs_refc_next(&tmp) == agbno + aglen)
 			*cright = tmp;
@@ -785,6 +797,7 @@ xfs_refcount_find_right_extents(
 			cright->rc_domain = domain;
 		}
 	} else {
+not_found:
 		/*
 		 * No extents, so pretend that there's one covering the whole
 		 * range.
@@ -894,7 +907,7 @@ xfs_refcount_merge_extents(
 				aglen);
 	}
 
-	return error;
+	return 0;
 }
 
 /*
@@ -966,7 +979,7 @@ xfs_refcount_adjust_extents(
 		error = xfs_refcount_get_rec(cur, &ext, &found_rec);
 		if (error)
 			goto out_error;
-		if (!found_rec) {
+		if (!found_rec || ext.rc_domain != XFS_REFC_DOMAIN_SHARED) {
 			ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
 			ext.rc_blockcount = 0;
 			ext.rc_refcount = 0;
@@ -1415,6 +1428,8 @@ xfs_refcount_find_shared(
 		error = -EFSCORRUPTED;
 		goto out_error;
 	}
+	if (tmp.rc_domain != XFS_REFC_DOMAIN_SHARED)
+		goto done;
 
 	/* If the extent ends before the start, look at the next one */
 	if (tmp.rc_startblock + tmp.rc_blockcount <= agbno) {
@@ -1430,6 +1445,8 @@ xfs_refcount_find_shared(
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
+		if (tmp.rc_domain != XFS_REFC_DOMAIN_SHARED)
+			goto done;
 	}
 
 	/* If the extent starts after the range we want, bail out */
@@ -1461,7 +1478,8 @@ xfs_refcount_find_shared(
 			error = -EFSCORRUPTED;
 			goto out_error;
 		}
-		if (tmp.rc_startblock >= agbno + aglen ||
+		if (tmp.rc_domain != XFS_REFC_DOMAIN_SHARED ||
+		    tmp.rc_startblock >= agbno + aglen ||
 		    tmp.rc_startblock != *fbno + *flen)
 			break;
 		*flen = min(*flen + tmp.rc_blockcount, agbno + aglen - *fbno);
@@ -1552,6 +1570,11 @@ xfs_refcount_adjust_cow_extents(
 	error = xfs_refcount_get_rec(cur, &ext, &found_rec);
 	if (error)
 		goto out_error;
+	if (XFS_IS_CORRUPT(cur->bc_mp, found_rec &&
+				ext.rc_domain != XFS_REFC_DOMAIN_COW)) {
+		error = -EFSCORRUPTED;
+		goto out_error;
+	}
 	if (!found_rec) {
 		ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
 		ext.rc_blockcount = 0;
@@ -1761,8 +1784,14 @@ xfs_refcount_recover_extent(
 
 	rr = kmem_alloc(sizeof(struct xfs_refcount_recovery), 0);
 	xfs_refcount_btrec_to_irec(rec, &rr->rr_rrec);
+
+	if (XFS_IS_CORRUPT(cur->bc_mp,
+			   rr->rr_rrec.rc_domain != XFS_REFC_DOMAIN_COW)) {
+		kmem_free(rr);
+		return -EFSCORRUPTED;
+	}
+
 	list_add_tail(&rr->rr_list, debris);
-
 	return 0;
 }
 
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 98c033072120..8b06dd0bc955 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -441,8 +441,8 @@ xchk_xref_is_cow_staging(
 		return;
 	}
 
-	/* CoW flag must be set, refcount must be 1. */
-	if (rc.rc_domain != XFS_REFC_DOMAIN_COW || rc.rc_refcount != 1)
+	/* CoW lookup returned a shared extent record? */
+	if (rc.rc_domain != XFS_REFC_DOMAIN_COW)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0);
 
 	/* Must be at least as long as what was passed in */


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 10/12] xfs: fix agblocks check in the cow leftover recovery function
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (8 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 09/12] xfs: check record domain when accessing refcount records Darrick J. Wong
@ 2022-10-27 17:14 ` Darrick J. Wong
  2022-10-27 21:22   ` Dave Chinner
  2022-10-27 17:15 ` [PATCH 11/12] xfs: fix uninitialized list head in struct xfs_refcount_recovery Darrick J. Wong
  2022-10-27 17:15 ` [PATCH 12/12] xfs: rename XFS_REFC_COW_START to _COWFLAG Darrick J. Wong
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:14 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

As we've seen, refcount records use the upper bit of the rc_startblock
field to ensure that all the refcount records are at the right side of
the refcount btree.  This works because an AG is never allowed to have
more than (1U << 31) blocks in it.  If we ever encounter a filesystem
claiming to have that many blocks, we absolutely do not want reflink
touching it at all.

However, this test at the start of xfs_refcount_recover_cow_leftovers is
slightly incorrect -- it /should/ be checking that agblocks isn't larger
than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the
constant is never large enough to conflict with that CoW flag.

Note that the V5 superblock verifier has not historically rejected
filesystems where agblocks <= XFS_MAX_CRC_AG_BLOCKS, which is why this
ended up in the COW recovery routine.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 608a122eef16..529968b3b9c3 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1811,7 +1811,9 @@ xfs_refcount_recover_cow_leftovers(
 	xfs_fsblock_t			fsb;
 	int				error;
 
-	if (mp->m_sb.sb_agblocks >= XFS_REFC_COW_START)
+	/* reflink filesystems mustn't have AGs larger than 2^31-1 blocks */
+	BUILD_BUG_ON(XFS_MAX_CRC_AG_BLOCKS >= XFS_REFC_COW_START);
+	if (mp->m_sb.sb_agblocks > XFS_MAX_CRC_AG_BLOCKS)
 		return -EOPNOTSUPP;
 
 	INIT_LIST_HEAD(&debris);


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 11/12] xfs: fix uninitialized list head in struct xfs_refcount_recovery
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (9 preceding siblings ...)
  2022-10-27 17:14 ` [PATCH 10/12] xfs: fix agblocks check in the cow leftover recovery function Darrick J. Wong
@ 2022-10-27 17:15 ` Darrick J. Wong
  2022-10-27 21:24   ` Dave Chinner
  2022-10-27 17:15 ` [PATCH 12/12] xfs: rename XFS_REFC_COW_START to _COWFLAG Darrick J. Wong
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:15 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We're supposed to initialize the list head of an object before adding it
to another list.  Fix that, and stop using the kmem_{alloc,free} calls
from the Irix days.

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 529968b3b9c3..05fa18b9e0ed 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1782,12 +1782,14 @@ xfs_refcount_recover_extent(
 			   be32_to_cpu(rec->refc.rc_refcount) != 1))
 		return -EFSCORRUPTED;
 
-	rr = kmem_alloc(sizeof(struct xfs_refcount_recovery), 0);
+	rr = kmalloc(sizeof(struct xfs_refcount_recovery),
+			GFP_KERNEL | __GFP_NOFAIL);
+	INIT_LIST_HEAD(&rr->rr_list);
 	xfs_refcount_btrec_to_irec(rec, &rr->rr_rrec);
 
 	if (XFS_IS_CORRUPT(cur->bc_mp,
 			   rr->rr_rrec.rc_domain != XFS_REFC_DOMAIN_COW)) {
-		kmem_free(rr);
+		kfree(rr);
 		return -EFSCORRUPTED;
 	}
 
@@ -1874,7 +1876,7 @@ xfs_refcount_recover_cow_leftovers(
 			goto out_free;
 
 		list_del(&rr->rr_list);
-		kmem_free(rr);
+		kfree(rr);
 	}
 
 	return error;
@@ -1884,7 +1886,7 @@ xfs_refcount_recover_cow_leftovers(
 	/* Free the leftover list */
 	list_for_each_entry_safe(rr, n, &debris, rr_list) {
 		list_del(&rr->rr_list);
-		kmem_free(rr);
+		kfree(rr);
 	}
 	return error;
 }


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 12/12] xfs: rename XFS_REFC_COW_START to _COWFLAG
  2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
                   ` (10 preceding siblings ...)
  2022-10-27 17:15 ` [PATCH 11/12] xfs: fix uninitialized list head in struct xfs_refcount_recovery Darrick J. Wong
@ 2022-10-27 17:15 ` Darrick J. Wong
  2022-10-27 21:25   ` Dave Chinner
  11 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 17:15 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We've been (ab)using XFS_REFC_COW_START as both an integer quantity and
a bit flag, even though it's *only* a bit flag.  Rename the variable to
reflect its nature and update the cast target since we're not supposed
to be comparing it to xfs_agblock_t now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h   |    2 +-
 fs/xfs/libxfs/xfs_refcount.c |    6 +++---
 fs/xfs/libxfs/xfs_refcount.h |    4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 005dd65b71cd..371dc07233e0 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1612,7 +1612,7 @@ unsigned int xfs_refc_block(struct xfs_mount *mp);
  * on the startblock.  This speeds up mount time deletion of stale
  * staging extents because they're all at the right side of the tree.
  */
-#define XFS_REFC_COW_START		((xfs_agblock_t)(1U << 31))
+#define XFS_REFC_COWFLAG		(1U << 31)
 #define REFCNTBT_COWFLAG_BITLEN		1
 #define REFCNTBT_AGBLOCK_BITLEN		31
 
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 05fa18b9e0ed..740959445170 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -108,8 +108,8 @@ xfs_refcount_btrec_to_irec(
 	__u32				start;
 
 	start = be32_to_cpu(rec->refc.rc_startblock);
-	if (start & XFS_REFC_COW_START) {
-		start &= ~XFS_REFC_COW_START;
+	if (start & XFS_REFC_COWFLAG) {
+		start &= ~XFS_REFC_COWFLAG;
 		irec->rc_domain = XFS_REFC_DOMAIN_COW;
 	} else {
 		irec->rc_domain = XFS_REFC_DOMAIN_SHARED;
@@ -1814,7 +1814,7 @@ xfs_refcount_recover_cow_leftovers(
 	int				error;
 
 	/* reflink filesystems mustn't have AGs larger than 2^31-1 blocks */
-	BUILD_BUG_ON(XFS_MAX_CRC_AG_BLOCKS >= XFS_REFC_COW_START);
+	BUILD_BUG_ON(XFS_MAX_CRC_AG_BLOCKS >= XFS_REFC_COWFLAG);
 	if (mp->m_sb.sb_agblocks > XFS_MAX_CRC_AG_BLOCKS)
 		return -EOPNOTSUPP;
 
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index d666a0e32659..cb42365596b2 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -34,9 +34,9 @@ xfs_refcount_encode_startblock(
 	 * query functions (which set rc_domain == -1U), so we check that the
 	 * domain is /not/ shared.
 	 */
-	start = startblock & ~XFS_REFC_COW_START;
+	start = startblock & ~XFS_REFC_COWFLAG;
 	if (domain != XFS_REFC_DOMAIN_SHARED)
-		start |= XFS_REFC_COW_START;
+		start |= XFS_REFC_COWFLAG;
 
 	return start;
 }


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/12] xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents
  2022-10-27 17:14 ` [PATCH 01/12] xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents Darrick J. Wong
@ 2022-10-27 20:41   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 20:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:09AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen
> such that the end of the range would not be in the middle of a refcount
> record.  If this is no longer the case, something is seriously wrong
> with the btree.  Bail out with a corruption error.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_refcount.c |   20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)

Looks reasonable - I can't see any reason why we'd want to look up
the next extent if we've already finished processing the current
range.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/12] xfs: check deferred refcount op continuation parameters
  2022-10-27 17:14 ` [PATCH 02/12] xfs: check deferred refcount op continuation parameters Darrick J. Wong
@ 2022-10-27 20:49   ` Dave Chinner
  2022-10-27 21:32     ` Darrick J. Wong
  2022-10-27 21:54   ` [PATCH v2.1 " Darrick J. Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 20:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:14AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> If we're in the middle of a deferred refcount operation and decide to
> roll the transaction to avoid overflowing the transaction space, we need
> to check the new agbno/aglen parameters that we're about to record in
> the new intent.  Specifically, we need to check that the new extent is
> completely within the filesystem, and that continuation does not put us
> into a different AG.
> 
> If the keys of a node block are wrong, the lookup to resume an
> xfs_refcount_adjust_extents operation can put us into the wrong record
> block.  If this happens, we might not find that we run out of aglen at
> an exact record boundary, which will cause the loop control to do the
> wrong thing.
> 
> The previous patch should take care of that problem, but let's add this
> extra sanity check to stop corruption problems sooner than later.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_refcount.c |   48 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 831353ba96dc..c6aa832a8713 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -1138,6 +1138,44 @@ xfs_refcount_finish_one_cleanup(
>  		xfs_trans_brelse(tp, agbp);
>  }
>  
> +/*
> + * Set up a continuation a deferred refcount operation by updating the intent.
> + * Checks to make sure we're not going to run off the end of the AG.
> + */
> +static inline int
> +xfs_refcount_continue_op(
> +	struct xfs_btree_cur		*cur,
> +	xfs_fsblock_t			startblock,
> +	xfs_agblock_t			new_agbno,
> +	xfs_extlen_t			new_len,
> +	xfs_fsblock_t			*fsbp)
> +{
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	struct xfs_perag		*pag = cur->bc_ag.pag;
> +	xfs_fsblock_t			new_fsbno;
> +	xfs_agnumber_t			old_agno;
> +
> +	old_agno = XFS_FSB_TO_AGNO(mp, startblock);
> +	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
> +
> +	/*
> +	 * If we don't have any work left to do, then there's no need
> +	 * to perform the validation of the new parameters.
> +	 */
> +	if (!new_len)
> +		goto done;

Shouldn't we be validating new_fsbno rather than just returning
whatever we calculated here?

> +	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
> +		return -EFSCORRUPTED;
> +
> +	if (XFS_IS_CORRUPT(mp, old_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
> +		return -EFSCORRUPTED;

We already know what agno new_fsbno sits in - we calculated it
directly from pag->pag_agno above, so this can jsut check against
pag->pag_agno directly, right?

i.e.

	if (XFS_IS_CORRUPT(mp,
			XFS_FSB_TO_AGNO(mp, startblock) != pag->pag_agno))
		return -EFSCORRUPTED;

and we don't need the local variable for it....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 05/12] xfs: track cow/shared record domains explicitly in xfs_refcount_irec
  2022-10-27 17:14 ` [PATCH 05/12] xfs: track cow/shared record domains explicitly in xfs_refcount_irec Darrick J. Wong
@ 2022-10-27 21:03   ` Dave Chinner
  2022-10-27 21:10     ` Darrick J. Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:31AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Just prior to committing the reflink code into upstream, the xfs
> maintainer at the time requested that I find a way to shard the refcount
> records into two domains -- one for records tracking shared extents, and
> a second for tracking CoW staging extents.  The idea here was to
> minimize mount time CoW reclamation by pushing all the CoW records to
> the right edge of the keyspace, and it was accomplished by setting the
> upper bit in rc_startblock.  We don't allow AGs to have more than 2^31
> blocks, so the bit was free.
> 
> Unfortunately, this was a very late addition to the codebase, so most of
> the refcount record processing code still treats rc_startblock as a u32
> and pays no attention to whether or not the upper bit (the cow flag) is
> set.  This is a weakness is theoretically exploitable, since we're not
> fully validating the incoming metadata records.
> 
> Fuzzing demonstrates practical exploits of this weakness.  If the cow
> flag of a node block key record is corrupted, a lookup operation can go
> to the wrong record block and start returning records from the wrong
> cow/shared domain.  This causes the math to go all wrong (since cow
> domain is still implicit in the upper bit of rc_startblock) and we can
> crash the kernel by tricking xfs into jumping into a nonexistent AG and
> tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.
> 
> To fix this, start tracking the domain as an explicit part of struct
> xfs_refcount_irec, adjust all refcount functions to check the domain
> of a returned record, and alter the function definitions to accept them
> where necessary.
> 
> Found by fuzzing keys[2].cowflag = add in xfs/464.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Couple of minor things.

> @@ -169,12 +183,17 @@ xfs_refcount_update(
>  	struct xfs_refcount_irec	*irec)
>  {
>  	union xfs_btree_rec	rec;
> +	__u32			start;
>  	int			error;

Why __u32 and not, say, u32 or uint32_t? u32 is used 10x more often
than __u32 in the kernel code, and in XFS only seem to use the __
variants in UAPI structures.

> @@ -364,6 +388,7 @@ xfs_refcount_split_extent(
>  		error = -EFSCORRUPTED;
>  		goto out_error;
>  	}
> +
>  	if (rcext.rc_startblock == agbno || xfs_refc_next(&rcext) <= agbno)
>  		return 0;
>  

Random new whitespace?

Other than that it looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 06/12] xfs: report refcount domain in tracepoints
  2022-10-27 17:14 ` [PATCH 06/12] xfs: report refcount domain in tracepoints Darrick J. Wong
@ 2022-10-27 21:05   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:37AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we've broken out the startblock and shared/cow domain in the
> incore refcount extent record structure, update the tracepoints to
> report the domain.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_types.h |    4 ++++
>  fs/xfs/xfs_trace.h        |   48 +++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 43 insertions(+), 9 deletions(-)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 07/12] xfs: refactor domain and refcount checking
  2022-10-27 17:14 ` [PATCH 07/12] xfs: refactor domain and refcount checking Darrick J. Wong
@ 2022-10-27 21:07   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:42AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a helper function to ensure that CoW staging extent records have
> a single refcount and that shared extent records have more than 1
> refcount.  We'll put this to more use in the next patch.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_refcount.c |    5 +----
>  fs/xfs/libxfs/xfs_refcount.h |   12 ++++++++++++
>  fs/xfs/scrub/refcount.c      |   10 ++++------
>  3 files changed, 17 insertions(+), 10 deletions(-)

Nice cleanup - just a little change but it really improves the
readability of this code.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 05/12] xfs: track cow/shared record domains explicitly in xfs_refcount_irec
  2022-10-27 21:03   ` Dave Chinner
@ 2022-10-27 21:10     ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 21:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Oct 28, 2022 at 08:03:07AM +1100, Dave Chinner wrote:
> On Thu, Oct 27, 2022 at 10:14:31AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Just prior to committing the reflink code into upstream, the xfs
> > maintainer at the time requested that I find a way to shard the refcount
> > records into two domains -- one for records tracking shared extents, and
> > a second for tracking CoW staging extents.  The idea here was to
> > minimize mount time CoW reclamation by pushing all the CoW records to
> > the right edge of the keyspace, and it was accomplished by setting the
> > upper bit in rc_startblock.  We don't allow AGs to have more than 2^31
> > blocks, so the bit was free.
> > 
> > Unfortunately, this was a very late addition to the codebase, so most of
> > the refcount record processing code still treats rc_startblock as a u32
> > and pays no attention to whether or not the upper bit (the cow flag) is
> > set.  This is a weakness is theoretically exploitable, since we're not
> > fully validating the incoming metadata records.
> > 
> > Fuzzing demonstrates practical exploits of this weakness.  If the cow
> > flag of a node block key record is corrupted, a lookup operation can go
> > to the wrong record block and start returning records from the wrong
> > cow/shared domain.  This causes the math to go all wrong (since cow
> > domain is still implicit in the upper bit of rc_startblock) and we can
> > crash the kernel by tricking xfs into jumping into a nonexistent AG and
> > tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.
> > 
> > To fix this, start tracking the domain as an explicit part of struct
> > xfs_refcount_irec, adjust all refcount functions to check the domain
> > of a returned record, and alter the function definitions to accept them
> > where necessary.
> > 
> > Found by fuzzing keys[2].cowflag = add in xfs/464.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
> Couple of minor things.
> 
> > @@ -169,12 +183,17 @@ xfs_refcount_update(
> >  	struct xfs_refcount_irec	*irec)
> >  {
> >  	union xfs_btree_rec	rec;
> > +	__u32			start;
> >  	int			error;
> 
> Why __u32 and not, say, u32 or uint32_t? u32 is used 10x more often
> than __u32 in the kernel code, and in XFS only seem to use the __
> variants in UAPI structures.

Force of habit, I guess.  I'll rework it as uint32_t.

> > @@ -364,6 +388,7 @@ xfs_refcount_split_extent(
> >  		error = -EFSCORRUPTED;
> >  		goto out_error;
> >  	}
> > +
> >  	if (rcext.rc_startblock == agbno || xfs_refc_next(&rcext) <= agbno)
> >  		return 0;
> >  
> 
> Random new whitespace?

Oops.  I overlooked that when I was slicing and dicing yesterday.

> Other than that it looks good.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Thanks!

--D

> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 08/12] xfs: remove XFS_FIND_RCEXT_SHARED and _COW
  2022-10-27 17:14 ` [PATCH 08/12] xfs: remove XFS_FIND_RCEXT_SHARED and _COW Darrick J. Wong
@ 2022-10-27 21:11   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:48AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have an explicit enum for shared and CoW staging extents, we
> can get rid of the old FIND_RCEXT flags.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_refcount.c |   48 +++++++++++++++---------------------------
>  1 file changed, 17 insertions(+), 31 deletions(-)

Looks ok.

Minor thing below, otherwise

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> @@ -761,9 +747,9 @@ xfs_refcount_find_right_extents(
>  
>  	if (tmp.rc_startblock != agbno + aglen)
>  		return 0;
> -	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
> +	if (domain == XFS_REFC_DOMAIN_SHARED && tmp.rc_refcount < 2)
>  		return 0;
> -	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
> +	if (domain == XFS_REFC_DOMAIN_COW && tmp.rc_refcount > 1)
>  		return 0;

Wouldn't this:

	if (!xfs_refcount_check_domain(domain, &tmp))
		return 0;

Do the same thing?

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 09/12] xfs: check record domain when accessing refcount records
  2022-10-27 17:14 ` [PATCH 09/12] xfs: check record domain when accessing refcount records Darrick J. Wong
@ 2022-10-27 21:15   ` Dave Chinner
  2022-10-27 21:33     ` Darrick J. Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:53AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we've separated the startblock and CoW/shared extent domain in
> the incore refcount record structure, check the domain whenever we
> retrieve a record to ensure that it's still in the domain that we want.
> Depending on the circumstances, a change in domain either means we're
> done processing or that we've found a corruption and need to fail out.
> 
> The refcount check in xchk_xref_is_cow_staging is redundant since
> _get_rec has done that for a long time now, so we can get rid of it.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_refcount.c |   53 ++++++++++++++++++++++++++++++++----------
>  fs/xfs/scrub/refcount.c      |    4 ++-
>  2 files changed, 43 insertions(+), 14 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 3b1cb0578770..608a122eef16 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -386,6 +386,8 @@ xfs_refcount_split_extent(
>  		goto out_error;
>  	}
>  
> +	if (rcext.rc_domain != domain)
> +		return 0;
>  	if (rcext.rc_startblock == agbno || xfs_refc_next(&rcext) <= agbno)
>  		return 0;
>  
> @@ -434,6 +436,9 @@ xfs_refcount_merge_center_extents(
>  	int				error;
>  	int				found_rec;
>  
> +	ASSERT(left->rc_domain == center->rc_domain);
> +	ASSERT(right->rc_domain == center->rc_domain);
> +
>  	trace_xfs_refcount_merge_center_extents(cur->bc_mp,
>  			cur->bc_ag.pag->pag_agno, left, center, right);

Can you move the asserts to after the trace points? That way we if
need to debug the assert, we'll have a tracepoint with the record
information that triggered the assert emitted immediately before it
fires...

>  
> @@ -510,6 +515,8 @@ xfs_refcount_merge_left_extent(
>  	int				error;
>  	int				found_rec;
>  
> +	ASSERT(left->rc_domain == cleft->rc_domain);
> +
>  	trace_xfs_refcount_merge_left_extent(cur->bc_mp,
>  			cur->bc_ag.pag->pag_agno, left, cleft);
>  
> @@ -571,6 +578,8 @@ xfs_refcount_merge_right_extent(
>  	int				error;
>  	int				found_rec;
>  
> +	ASSERT(right->rc_domain == cright->rc_domain);
> +
>  	trace_xfs_refcount_merge_right_extent(cur->bc_mp,
>  			cur->bc_ag.pag->pag_agno, cright, right);
>  
> @@ -654,12 +663,10 @@ xfs_refcount_find_left_extents(
>  		goto out_error;
>  	}
>  
> +	if (tmp.rc_domain != domain)
> +		return 0;
>  	if (xfs_refc_next(&tmp) != agbno)
>  		return 0;
> -	if (domain == XFS_REFC_DOMAIN_SHARED && tmp.rc_refcount < 2)
> -		return 0;
> -	if (domain == XFS_REFC_DOMAIN_COW && tmp.rc_refcount > 1)
> -		return 0;

Ah, as these go away, you can ignore my comment about them in the
previous patches... :)

Otherwise, looks ok.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 10/12] xfs: fix agblocks check in the cow leftover recovery function
  2022-10-27 17:14 ` [PATCH 10/12] xfs: fix agblocks check in the cow leftover recovery function Darrick J. Wong
@ 2022-10-27 21:22   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:14:59AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> As we've seen, refcount records use the upper bit of the rc_startblock
> field to ensure that all the refcount records are at the right side of
> the refcount btree.  This works because an AG is never allowed to have
> more than (1U << 31) blocks in it.  If we ever encounter a filesystem
> claiming to have that many blocks, we absolutely do not want reflink
> touching it at all.
> 
> However, this test at the start of xfs_refcount_recover_cow_leftovers is
> slightly incorrect -- it /should/ be checking that agblocks isn't larger
> than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the
> constant is never large enough to conflict with that CoW flag.
> 
> Note that the V5 superblock verifier has not historically rejected
> filesystems where agblocks <= XFS_MAX_CRC_AG_BLOCKS, which is why this
ITYM                         >=

> ended up in the COW recovery routine.

I think we should probably fix that - I didn't realise we had this
superblock geometry check buried deep in the reflink recovery code.

That said, for the moment adding an extra check to the reflink
recovery code is fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 11/12] xfs: fix uninitialized list head in struct xfs_refcount_recovery
  2022-10-27 17:15 ` [PATCH 11/12] xfs: fix uninitialized list head in struct xfs_refcount_recovery Darrick J. Wong
@ 2022-10-27 21:24   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:15:05AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> We're supposed to initialize the list head of an object before adding it
> to another list.  Fix that, and stop using the kmem_{alloc,free} calls
> from the Irix days.
> 
> Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_refcount.c |   10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 12/12] xfs: rename XFS_REFC_COW_START to _COWFLAG
  2022-10-27 17:15 ` [PATCH 12/12] xfs: rename XFS_REFC_COW_START to _COWFLAG Darrick J. Wong
@ 2022-10-27 21:25   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 21:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 10:15:10AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> We've been (ab)using XFS_REFC_COW_START as both an integer quantity and
> a bit flag, even though it's *only* a bit flag.  Rename the variable to
> reflect its nature and update the cast target since we're not supposed
> to be comparing it to xfs_agblock_t now.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Yay!

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/12] xfs: check deferred refcount op continuation parameters
  2022-10-27 20:49   ` Dave Chinner
@ 2022-10-27 21:32     ` Darrick J. Wong
  2022-10-27 21:42       ` Darrick J. Wong
  2022-10-27 22:24       ` Dave Chinner
  0 siblings, 2 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 21:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Oct 28, 2022 at 07:49:57AM +1100, Dave Chinner wrote:
> On Thu, Oct 27, 2022 at 10:14:14AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > If we're in the middle of a deferred refcount operation and decide to
> > roll the transaction to avoid overflowing the transaction space, we need
> > to check the new agbno/aglen parameters that we're about to record in
> > the new intent.  Specifically, we need to check that the new extent is
> > completely within the filesystem, and that continuation does not put us
> > into a different AG.
> > 
> > If the keys of a node block are wrong, the lookup to resume an
> > xfs_refcount_adjust_extents operation can put us into the wrong record
> > block.  If this happens, we might not find that we run out of aglen at
> > an exact record boundary, which will cause the loop control to do the
> > wrong thing.
> > 
> > The previous patch should take care of that problem, but let's add this
> > extra sanity check to stop corruption problems sooner than later.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_refcount.c |   48 ++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 46 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index 831353ba96dc..c6aa832a8713 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -1138,6 +1138,44 @@ xfs_refcount_finish_one_cleanup(
> >  		xfs_trans_brelse(tp, agbp);
> >  }
> >  
> > +/*
> > + * Set up a continuation a deferred refcount operation by updating the intent.
> > + * Checks to make sure we're not going to run off the end of the AG.
> > + */
> > +static inline int
> > +xfs_refcount_continue_op(
> > +	struct xfs_btree_cur		*cur,
> > +	xfs_fsblock_t			startblock,
> > +	xfs_agblock_t			new_agbno,
> > +	xfs_extlen_t			new_len,
> > +	xfs_fsblock_t			*fsbp)
> > +{
> > +	struct xfs_mount		*mp = cur->bc_mp;
> > +	struct xfs_perag		*pag = cur->bc_ag.pag;
> > +	xfs_fsblock_t			new_fsbno;
> > +	xfs_agnumber_t			old_agno;
> > +
> > +	old_agno = XFS_FSB_TO_AGNO(mp, startblock);
> > +	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
> > +
> > +	/*
> > +	 * If we don't have any work left to do, then there's no need
> > +	 * to perform the validation of the new parameters.
> > +	 */
> > +	if (!new_len)
> > +		goto done;
> 
> Shouldn't we be validating new_fsbno rather than just returning
> whatever we calculated here?

No.  Imagine that the deferred work is performed against the last 30
blocks of the last AG in the filesystem.  Let's say that the last AG is
AG 3 and the AG has 100 blocks.  fsblock 3:99 is the last fsblock in the
filesystem.

Before we start the deferred work, startblock == 3:70 and
blockcount == 30.  We adjust the refcount of those 30 blocks, so we're
done now.  The adjust function passes out new_agbno == 70 + 30 and
new_len == 30 - 30.

The agbno to fsbno conversion sets new_fsbno to 3:100 and new_len is 0.
However, fsblock 3/100 is one block past the end of both AG 3 and the
filesystem, so the check below will fail:

> > +	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
> > +		return -EFSCORRUPTED;
> > +
> > +	if (XFS_IS_CORRUPT(mp, old_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
> > +		return -EFSCORRUPTED;
> 
> We already know what agno new_fsbno sits in - we calculated it
> directly from pag->pag_agno above, so this can jsut check against
> pag->pag_agno directly, right?

We don't actually know what agno new_fsbno sits in because of the way
that the agblock -> fsblock conversion works:

#define XFS_AGB_TO_FSB(mp,agno,agbno)	\
	(((xfs_fsblock_t)(agno) << (mp)->m_sb.sb_agblklog) | (agbno))

Notice how we don't mask off the bits of agbno above sb_agblklog?  If
sb_agblklog is (say) 20 but agbno has bit 31 set, that bit 31 will bump
the AG number by 2^11 AGs.

The genesis of this patch was from the old days when rc_startblock had
bit 31 set on COW staging extents.  Splitting out the cow/shared domain
later in the series makes this patch sort of redundant, but I still want
to catch the cases where agbno has some bit set above agblklog due to
corruption/bitflips/shenanigans.

> i.e.
> 
> 	if (XFS_IS_CORRUPT(mp,
> 			XFS_FSB_TO_AGNO(mp, startblock) != pag->pag_agno))
> 		return -EFSCORRUPTED;
> 
> and we don't need the local variable for it....

Not quite -- we need to compare new_fsbno's agnumber against the perag
structure.  But you're right that @old_agno isn't necessary.

	if (XFS_IS_CORRUPT(mp,
			XFS_FSB_TO_AGNO(mp, new_fsbno) != pag->pag_agno))
		/* fail */

I'll change it to the above.

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 09/12] xfs: check record domain when accessing refcount records
  2022-10-27 21:15   ` Dave Chinner
@ 2022-10-27 21:33     ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 21:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Oct 28, 2022 at 08:15:31AM +1100, Dave Chinner wrote:
> On Thu, Oct 27, 2022 at 10:14:53AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Now that we've separated the startblock and CoW/shared extent domain in
> > the incore refcount record structure, check the domain whenever we
> > retrieve a record to ensure that it's still in the domain that we want.
> > Depending on the circumstances, a change in domain either means we're
> > done processing or that we've found a corruption and need to fail out.
> > 
> > The refcount check in xchk_xref_is_cow_staging is redundant since
> > _get_rec has done that for a long time now, so we can get rid of it.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_refcount.c |   53 ++++++++++++++++++++++++++++++++----------
> >  fs/xfs/scrub/refcount.c      |    4 ++-
> >  2 files changed, 43 insertions(+), 14 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index 3b1cb0578770..608a122eef16 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -386,6 +386,8 @@ xfs_refcount_split_extent(
> >  		goto out_error;
> >  	}
> >  
> > +	if (rcext.rc_domain != domain)
> > +		return 0;
> >  	if (rcext.rc_startblock == agbno || xfs_refc_next(&rcext) <= agbno)
> >  		return 0;
> >  
> > @@ -434,6 +436,9 @@ xfs_refcount_merge_center_extents(
> >  	int				error;
> >  	int				found_rec;
> >  
> > +	ASSERT(left->rc_domain == center->rc_domain);
> > +	ASSERT(right->rc_domain == center->rc_domain);
> > +
> >  	trace_xfs_refcount_merge_center_extents(cur->bc_mp,
> >  			cur->bc_ag.pag->pag_agno, left, center, right);
> 
> Can you move the asserts to after the trace points? That way we if
> need to debug the assert, we'll have a tracepoint with the record
> information that triggered the assert emitted immediately before it
> fires...

Done.

> >  
> > @@ -510,6 +515,8 @@ xfs_refcount_merge_left_extent(
> >  	int				error;
> >  	int				found_rec;
> >  
> > +	ASSERT(left->rc_domain == cleft->rc_domain);
> > +
> >  	trace_xfs_refcount_merge_left_extent(cur->bc_mp,
> >  			cur->bc_ag.pag->pag_agno, left, cleft);
> >  
> > @@ -571,6 +578,8 @@ xfs_refcount_merge_right_extent(
> >  	int				error;
> >  	int				found_rec;
> >  
> > +	ASSERT(right->rc_domain == cright->rc_domain);
> > +
> >  	trace_xfs_refcount_merge_right_extent(cur->bc_mp,
> >  			cur->bc_ag.pag->pag_agno, cright, right);
> >  
> > @@ -654,12 +663,10 @@ xfs_refcount_find_left_extents(
> >  		goto out_error;
> >  	}
> >  
> > +	if (tmp.rc_domain != domain)
> > +		return 0;
> >  	if (xfs_refc_next(&tmp) != agbno)
> >  		return 0;
> > -	if (domain == XFS_REFC_DOMAIN_SHARED && tmp.rc_refcount < 2)
> > -		return 0;
> > -	if (domain == XFS_REFC_DOMAIN_COW && tmp.rc_refcount > 1)
> > -		return 0;
> 
> Ah, as these go away, you can ignore my comment about them in the
> previous patches... :)
> 
> Otherwise, looks ok.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Cool, thanks!

--D

> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/12] xfs: check deferred refcount op continuation parameters
  2022-10-27 21:32     ` Darrick J. Wong
@ 2022-10-27 21:42       ` Darrick J. Wong
  2022-10-27 22:24       ` Dave Chinner
  1 sibling, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 21:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 02:32:42PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 28, 2022 at 07:49:57AM +1100, Dave Chinner wrote:
> > On Thu, Oct 27, 2022 at 10:14:14AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > If we're in the middle of a deferred refcount operation and decide to
> > > roll the transaction to avoid overflowing the transaction space, we need
> > > to check the new agbno/aglen parameters that we're about to record in
> > > the new intent.  Specifically, we need to check that the new extent is
> > > completely within the filesystem, and that continuation does not put us
> > > into a different AG.
> > > 
> > > If the keys of a node block are wrong, the lookup to resume an
> > > xfs_refcount_adjust_extents operation can put us into the wrong record
> > > block.  If this happens, we might not find that we run out of aglen at
> > > an exact record boundary, which will cause the loop control to do the
> > > wrong thing.
> > > 
> > > The previous patch should take care of that problem, but let's add this
> > > extra sanity check to stop corruption problems sooner than later.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_refcount.c |   48 ++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 46 insertions(+), 2 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > > index 831353ba96dc..c6aa832a8713 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount.c
> > > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > > @@ -1138,6 +1138,44 @@ xfs_refcount_finish_one_cleanup(
> > >  		xfs_trans_brelse(tp, agbp);
> > >  }
> > >  
> > > +/*
> > > + * Set up a continuation a deferred refcount operation by updating the intent.
> > > + * Checks to make sure we're not going to run off the end of the AG.
> > > + */
> > > +static inline int
> > > +xfs_refcount_continue_op(
> > > +	struct xfs_btree_cur		*cur,
> > > +	xfs_fsblock_t			startblock,
> > > +	xfs_agblock_t			new_agbno,
> > > +	xfs_extlen_t			new_len,
> > > +	xfs_fsblock_t			*fsbp)
> > > +{
> > > +	struct xfs_mount		*mp = cur->bc_mp;
> > > +	struct xfs_perag		*pag = cur->bc_ag.pag;
> > > +	xfs_fsblock_t			new_fsbno;
> > > +	xfs_agnumber_t			old_agno;
> > > +
> > > +	old_agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > +	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
> > > +
> > > +	/*
> > > +	 * If we don't have any work left to do, then there's no need
> > > +	 * to perform the validation of the new parameters.
> > > +	 */
> > > +	if (!new_len)
> > > +		goto done;
> > 
> > Shouldn't we be validating new_fsbno rather than just returning
> > whatever we calculated here?

Thinking further about this, let's elide the call if there isn't any
work to continue.

IOWs, the function shortens to:

static inline int
xfs_refcount_continue_op(
	struct xfs_btree_cur		*cur,
	xfs_fsblock_t			startblock,
	xfs_agblock_t			new_agbno,
	xfs_extlen_t			new_len,
	xfs_fsblock_t			*fsbp)
{
	struct xfs_mount		*mp = cur->bc_mp;
	struct xfs_perag		*pag = cur->bc_ag.pag;
	xfs_fsblock_t			new_fsbno;

	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);

	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
		return -EFSCORRUPTED;

	if (XFS_IS_CORRUPT(mp, pag->pag_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
		return -EFSCORRUPTED;

	*fsbp = new_fsbno;
	return 0;
}

and the call sites now look like:

	error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
			new_len, XFS_REFCOUNT_ADJUST_INCREASE);
	if (error)
		goto out_drop;
	if (*new_len > 0)
		error = xfs_refcount_continue_op(rcur, startblock,
				new_agbno, *new_len, new_fsb);

--D

> No.  Imagine that the deferred work is performed against the last 30
> blocks of the last AG in the filesystem.  Let's say that the last AG is
> AG 3 and the AG has 100 blocks.  fsblock 3:99 is the last fsblock in the
> filesystem.
> 
> Before we start the deferred work, startblock == 3:70 and
> blockcount == 30.  We adjust the refcount of those 30 blocks, so we're
> done now.  The adjust function passes out new_agbno == 70 + 30 and
> new_len == 30 - 30.
> 
> The agbno to fsbno conversion sets new_fsbno to 3:100 and new_len is 0.
> However, fsblock 3/100 is one block past the end of both AG 3 and the
> filesystem, so the check below will fail:
> 
> > > +	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
> > > +		return -EFSCORRUPTED;
> > > +
> > > +	if (XFS_IS_CORRUPT(mp, old_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
> > > +		return -EFSCORRUPTED;
> > 
> > We already know what agno new_fsbno sits in - we calculated it
> > directly from pag->pag_agno above, so this can jsut check against
> > pag->pag_agno directly, right?
> 
> We don't actually know what agno new_fsbno sits in because of the way
> that the agblock -> fsblock conversion works:
> 
> #define XFS_AGB_TO_FSB(mp,agno,agbno)	\
> 	(((xfs_fsblock_t)(agno) << (mp)->m_sb.sb_agblklog) | (agbno))
> 
> Notice how we don't mask off the bits of agbno above sb_agblklog?  If
> sb_agblklog is (say) 20 but agbno has bit 31 set, that bit 31 will bump
> the AG number by 2^11 AGs.
> 
> The genesis of this patch was from the old days when rc_startblock had
> bit 31 set on COW staging extents.  Splitting out the cow/shared domain
> later in the series makes this patch sort of redundant, but I still want
> to catch the cases where agbno has some bit set above agblklog due to
> corruption/bitflips/shenanigans.
> 
> > i.e.
> > 
> > 	if (XFS_IS_CORRUPT(mp,
> > 			XFS_FSB_TO_AGNO(mp, startblock) != pag->pag_agno))
> > 		return -EFSCORRUPTED;
> > 
> > and we don't need the local variable for it....
> 
> Not quite -- we need to compare new_fsbno's agnumber against the perag
> structure.  But you're right that @old_agno isn't necessary.
> 
> 	if (XFS_IS_CORRUPT(mp,
> 			XFS_FSB_TO_AGNO(mp, new_fsbno) != pag->pag_agno))
> 		/* fail */
> 
> I'll change it to the above.
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > 
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2.1 02/12] xfs: check deferred refcount op continuation parameters
  2022-10-27 17:14 ` [PATCH 02/12] xfs: check deferred refcount op continuation parameters Darrick J. Wong
  2022-10-27 20:49   ` Dave Chinner
@ 2022-10-27 21:54   ` Darrick J. Wong
  1 sibling, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 21:54 UTC (permalink / raw)
  To: linux-xfs, Dave Chinner

From: Darrick J. Wong <djwong@kernel.org>

If we're in the middle of a deferred refcount operation and decide to
roll the transaction to avoid overflowing the transaction space, we need
to check the new agbno/aglen parameters that we're about to record in
the new intent.  Specifically, we need to check that the new extent is
completely within the filesystem, and that continuation does not put us
into a different AG.

If the keys of a node block are wrong, the lookup to resume an
xfs_refcount_adjust_extents operation can put us into the wrong record
block.  If this happens, we might not find that we run out of aglen at
an exact record boundary, which will cause the loop control to do the
wrong thing.

The previous patch should take care of that problem, but let's add this
extra sanity check to stop corruption problems sooner than later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v2.1: reduce local variables, elide calls when there's no continuation
---
 fs/xfs/libxfs/xfs_refcount.c |   40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 831353ba96dc..06de33a0a684 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1138,6 +1138,34 @@ xfs_refcount_finish_one_cleanup(
 		xfs_trans_brelse(tp, agbp);
 }
 
+/*
+ * Set up a continuation a deferred refcount operation by updating the intent.
+ * Checks to make sure we're not going to run off the end of the AG.
+ */
+static inline int
+xfs_refcount_continue_op(
+	struct xfs_btree_cur		*cur,
+	xfs_fsblock_t			startblock,
+	xfs_agblock_t			new_agbno,
+	xfs_extlen_t			new_len,
+	xfs_fsblock_t			*fsbp)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_perag		*pag = cur->bc_ag.pag;
+	xfs_fsblock_t			new_fsbno;
+
+	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
+
+	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
+		return -EFSCORRUPTED;
+
+	if (XFS_IS_CORRUPT(mp, pag->pag_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
+		return -EFSCORRUPTED;
+
+	*fsbp = new_fsbno;
+	return 0;
+}
+
 /*
  * Process one of the deferred refcount operations.  We pass back the
  * btree cursor to maintain our lock on the btree between calls.
@@ -1205,12 +1233,20 @@ xfs_refcount_finish_one(
 	case XFS_REFCOUNT_INCREASE:
 		error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
 				new_len, XFS_REFCOUNT_ADJUST_INCREASE);
-		*new_fsb = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
+		if (error)
+			goto out_drop;
+		if (*new_len > 0)
+			error = xfs_refcount_continue_op(rcur, startblock,
+					new_agbno, *new_len, new_fsb);
 		break;
 	case XFS_REFCOUNT_DECREASE:
 		error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
 				new_len, XFS_REFCOUNT_ADJUST_DECREASE);
-		*new_fsb = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
+		if (error)
+			goto out_drop;
+		if (*new_len > 0)
+			error = xfs_refcount_continue_op(rcur, startblock,
+					new_agbno, *new_len, new_fsb);
 		break;
 	case XFS_REFCOUNT_ALLOC_COW:
 		*new_fsb = startblock + blockcount;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/12] xfs: check deferred refcount op continuation parameters
  2022-10-27 21:32     ` Darrick J. Wong
  2022-10-27 21:42       ` Darrick J. Wong
@ 2022-10-27 22:24       ` Dave Chinner
  2022-10-27 23:25         ` Darrick J. Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2022-10-27 22:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Oct 27, 2022 at 02:32:42PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 28, 2022 at 07:49:57AM +1100, Dave Chinner wrote:
> > On Thu, Oct 27, 2022 at 10:14:14AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > If we're in the middle of a deferred refcount operation and decide to
> > > roll the transaction to avoid overflowing the transaction space, we need
> > > to check the new agbno/aglen parameters that we're about to record in
> > > the new intent.  Specifically, we need to check that the new extent is
> > > completely within the filesystem, and that continuation does not put us
> > > into a different AG.
> > > 
> > > If the keys of a node block are wrong, the lookup to resume an
> > > xfs_refcount_adjust_extents operation can put us into the wrong record
> > > block.  If this happens, we might not find that we run out of aglen at
> > > an exact record boundary, which will cause the loop control to do the
> > > wrong thing.
> > > 
> > > The previous patch should take care of that problem, but let's add this
> > > extra sanity check to stop corruption problems sooner than later.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_refcount.c |   48 ++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 46 insertions(+), 2 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > > index 831353ba96dc..c6aa832a8713 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount.c
> > > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > > @@ -1138,6 +1138,44 @@ xfs_refcount_finish_one_cleanup(
> > >  		xfs_trans_brelse(tp, agbp);
> > >  }
> > >  
> > > +/*
> > > + * Set up a continuation a deferred refcount operation by updating the intent.
> > > + * Checks to make sure we're not going to run off the end of the AG.
> > > + */
> > > +static inline int
> > > +xfs_refcount_continue_op(
> > > +	struct xfs_btree_cur		*cur,
> > > +	xfs_fsblock_t			startblock,
> > > +	xfs_agblock_t			new_agbno,
> > > +	xfs_extlen_t			new_len,
> > > +	xfs_fsblock_t			*fsbp)
> > > +{
> > > +	struct xfs_mount		*mp = cur->bc_mp;
> > > +	struct xfs_perag		*pag = cur->bc_ag.pag;
> > > +	xfs_fsblock_t			new_fsbno;
> > > +	xfs_agnumber_t			old_agno;
> > > +
> > > +	old_agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > +	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
> > > +
> > > +	/*
> > > +	 * If we don't have any work left to do, then there's no need
> > > +	 * to perform the validation of the new parameters.
> > > +	 */
> > > +	if (!new_len)
> > > +		goto done;
> > 
> > Shouldn't we be validating new_fsbno rather than just returning
> > whatever we calculated here?
> 
> No.  Imagine that the deferred work is performed against the last 30
> blocks of the last AG in the filesystem.  Let's say that the last AG is
> AG 3 and the AG has 100 blocks.  fsblock 3:99 is the last fsblock in the
> filesystem.
> 
> Before we start the deferred work, startblock == 3:70 and
> blockcount == 30.  We adjust the refcount of those 30 blocks, so we're
> done now.  The adjust function passes out new_agbno == 70 + 30 and
> new_len == 30 - 30.
> 
> The agbno to fsbno conversion sets new_fsbno to 3:100 and new_len is 0.
> However, fsblock 3/100 is one block past the end of both AG 3 and the
> filesystem, so the check below will fail:

Sure, but my point here is that the function returns this invalid
fsbno in *fsbp and assumes that the caller will handle it correctly.

If the caller knows that we aren't going to continue past the
"new_len == 0" condition, then why is it even calling this function?
i.e. this isn't a "decide if we are going to continue" function,
it's a "calculate and validate next fsbno" function...

i.e. the intent doesn't match the name of the function.

> > > +	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
> > > +		return -EFSCORRUPTED;
> > > +
> > > +	if (XFS_IS_CORRUPT(mp, old_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
> > > +		return -EFSCORRUPTED;
> > 
> > We already know what agno new_fsbno sits in - we calculated it
> > directly from pag->pag_agno above, so this can jsut check against
> > pag->pag_agno directly, right?
> 
> We don't actually know what agno new_fsbno sits in because of the way
> that the agblock -> fsblock conversion works:
> 
> #define XFS_AGB_TO_FSB(mp,agno,agbno)	\
> 	(((xfs_fsblock_t)(agno) << (mp)->m_sb.sb_agblklog) | (agbno))

Sure, but FSBs are *sparse* and there is unused, unchecked address
space between the AGs that agbno overruns can fall into. And when we
look at XFS_FSB_TO_AGNO():

#define XFS_FSB_TO_AGNO(mp,fsbno)       \
        ((xfs_agnumber_t)((fsbno) >> (mp)->m_sb.sb_agblklog))

we can see that it simply truncates away the agbno portion to get
back to the agno.

IOWs:

	0			sb_agblocks
	+--------------------------+------------+
					(1 << sb_agblklog)
				   +------------+
				   invalid agbnos!

Hence the agbno needs to be checked agains sb_agblocks to capture AG
overruns, not converted to a FSB and back to an AGNO as this will
claim agbnos in the inaccessible address space region between AGs
are valid....

> Notice how we don't mask off the bits of agbno above sb_agblklog?  If
> sb_agblklog is (say) 20 but agbno has bit 31 set, that bit 31 will bump
> the AG number by 2^11 AGs.

Yes, but that's only a side effect of the agbno having the high bit
set - it could have many other bits set and still be out of range.
i.e. coverting to fsb and back to agno doesn't actually capture all
cases of the next calculated agbno/fsbno could be invalid.

xfs_verify_fsbext() may capture this by chance because it checks
the entire agbno portion of the fsb (via XFS_FSB_TO_AGBNO) against
xfs_ag_block_count(agno), but it won't capture the overruns that
only bump the AGNO portion of the FSB.

Hence I really think we should be checking new_agbno for validity
here, not relying on side effects of coverting to/from FSBs and
verifying fsb extents to capture ag block count overruns in the
supplied agbno....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/12] xfs: check deferred refcount op continuation parameters
  2022-10-27 22:24       ` Dave Chinner
@ 2022-10-27 23:25         ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2022-10-27 23:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Oct 28, 2022 at 09:24:03AM +1100, Dave Chinner wrote:
> On Thu, Oct 27, 2022 at 02:32:42PM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 28, 2022 at 07:49:57AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 27, 2022 at 10:14:14AM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > If we're in the middle of a deferred refcount operation and decide to
> > > > roll the transaction to avoid overflowing the transaction space, we need
> > > > to check the new agbno/aglen parameters that we're about to record in
> > > > the new intent.  Specifically, we need to check that the new extent is
> > > > completely within the filesystem, and that continuation does not put us
> > > > into a different AG.
> > > > 
> > > > If the keys of a node block are wrong, the lookup to resume an
> > > > xfs_refcount_adjust_extents operation can put us into the wrong record
> > > > block.  If this happens, we might not find that we run out of aglen at
> > > > an exact record boundary, which will cause the loop control to do the
> > > > wrong thing.
> > > > 
> > > > The previous patch should take care of that problem, but let's add this
> > > > extra sanity check to stop corruption problems sooner than later.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_refcount.c |   48 ++++++++++++++++++++++++++++++++++++++++--
> > > >  1 file changed, 46 insertions(+), 2 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > > > index 831353ba96dc..c6aa832a8713 100644
> > > > --- a/fs/xfs/libxfs/xfs_refcount.c
> > > > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > > > @@ -1138,6 +1138,44 @@ xfs_refcount_finish_one_cleanup(
> > > >  		xfs_trans_brelse(tp, agbp);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Set up a continuation a deferred refcount operation by updating the intent.
> > > > + * Checks to make sure we're not going to run off the end of the AG.
> > > > + */
> > > > +static inline int
> > > > +xfs_refcount_continue_op(
> > > > +	struct xfs_btree_cur		*cur,
> > > > +	xfs_fsblock_t			startblock,
> > > > +	xfs_agblock_t			new_agbno,
> > > > +	xfs_extlen_t			new_len,
> > > > +	xfs_fsblock_t			*fsbp)
> > > > +{
> > > > +	struct xfs_mount		*mp = cur->bc_mp;
> > > > +	struct xfs_perag		*pag = cur->bc_ag.pag;
> > > > +	xfs_fsblock_t			new_fsbno;
> > > > +	xfs_agnumber_t			old_agno;
> > > > +
> > > > +	old_agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > > +	new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
> > > > +
> > > > +	/*
> > > > +	 * If we don't have any work left to do, then there's no need
> > > > +	 * to perform the validation of the new parameters.
> > > > +	 */
> > > > +	if (!new_len)
> > > > +		goto done;
> > > 
> > > Shouldn't we be validating new_fsbno rather than just returning
> > > whatever we calculated here?
> > 
> > No.  Imagine that the deferred work is performed against the last 30
> > blocks of the last AG in the filesystem.  Let's say that the last AG is
> > AG 3 and the AG has 100 blocks.  fsblock 3:99 is the last fsblock in the
> > filesystem.
> > 
> > Before we start the deferred work, startblock == 3:70 and
> > blockcount == 30.  We adjust the refcount of those 30 blocks, so we're
> > done now.  The adjust function passes out new_agbno == 70 + 30 and
> > new_len == 30 - 30.
> > 
> > The agbno to fsbno conversion sets new_fsbno to 3:100 and new_len is 0.
> > However, fsblock 3/100 is one block past the end of both AG 3 and the
> > filesystem, so the check below will fail:
> 
> Sure, but my point here is that the function returns this invalid
> fsbno in *fsbp and assumes that the caller will handle it correctly.
> 
> If the caller knows that we aren't going to continue past the
> "new_len == 0" condition, then why is it even calling this function?
> i.e. this isn't a "decide if we are going to continue" function,
> it's a "calculate and validate next fsbno" function...
> 
> i.e. the intent doesn't match the name of the function.

<nod> Well I've already moved the if test to the callsite, so I hope
that'll be less confusing.

> 
> > > > +	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, new_fsbno, new_len)))
> > > > +		return -EFSCORRUPTED;
> > > > +
> > > > +	if (XFS_IS_CORRUPT(mp, old_agno != XFS_FSB_TO_AGNO(mp, new_fsbno)))
> > > > +		return -EFSCORRUPTED;
> > > 
> > > We already know what agno new_fsbno sits in - we calculated it
> > > directly from pag->pag_agno above, so this can jsut check against
> > > pag->pag_agno directly, right?
> > 
> > We don't actually know what agno new_fsbno sits in because of the way
> > that the agblock -> fsblock conversion works:
> > 
> > #define XFS_AGB_TO_FSB(mp,agno,agbno)	\
> > 	(((xfs_fsblock_t)(agno) << (mp)->m_sb.sb_agblklog) | (agbno))
> 
> Sure, but FSBs are *sparse* and there is unused, unchecked address
> space between the AGs that agbno overruns can fall into. And when we
> look at XFS_FSB_TO_AGNO():
> 
> #define XFS_FSB_TO_AGNO(mp,fsbno)       \
>         ((xfs_agnumber_t)((fsbno) >> (mp)->m_sb.sb_agblklog))
> 
> we can see that it simply truncates away the agbno portion to get
> back to the agno.
> 
> IOWs:
> 
> 	0			sb_agblocks
> 	+--------------------------+------------+
> 					(1 << sb_agblklog)
> 				   +------------+
> 				   invalid agbnos!
> 
> Hence the agbno needs to be checked agains sb_agblocks to capture AG
> overruns, not converted to a FSB and back to an AGNO as this will
> claim agbnos in the inaccessible address space region between AGs
> are valid....
> 
> > Notice how we don't mask off the bits of agbno above sb_agblklog?  If
> > sb_agblklog is (say) 20 but agbno has bit 31 set, that bit 31 will bump
> > the AG number by 2^11 AGs.
> 
> Yes, but that's only a side effect of the agbno having the high bit
> set - it could have many other bits set and still be out of range.
> i.e. coverting to fsb and back to agno doesn't actually capture all
> cases of the next calculated agbno/fsbno could be invalid.
> 
> xfs_verify_fsbext() may capture this by chance because it checks
> the entire agbno portion of the fsb (via XFS_FSB_TO_AGBNO) against
> xfs_ag_block_count(agno), but it won't capture the overruns that
> only bump the AGNO portion of the FSB.
> 
> Hence I really think we should be checking new_agbno for validity
> here, not relying on side effects of coverting to/from FSBs and
> verifying fsb extents to capture ag block count overruns in the
> supplied agbno....

Oh, ok.  So you want to check the new agbno and new aglen to make sure
that both are within the filesystem and whatnot *before* we call
XFS_AGB_TO_FSBNO, rather than checking the fsblock after the conversion?

I can do that.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-10-27 23:25 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-27 17:14 [PATCHSET v2 00/12] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
2022-10-27 17:14 ` [PATCH 01/12] xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents Darrick J. Wong
2022-10-27 20:41   ` Dave Chinner
2022-10-27 17:14 ` [PATCH 02/12] xfs: check deferred refcount op continuation parameters Darrick J. Wong
2022-10-27 20:49   ` Dave Chinner
2022-10-27 21:32     ` Darrick J. Wong
2022-10-27 21:42       ` Darrick J. Wong
2022-10-27 22:24       ` Dave Chinner
2022-10-27 23:25         ` Darrick J. Wong
2022-10-27 21:54   ` [PATCH v2.1 " Darrick J. Wong
2022-10-27 17:14 ` [PATCH 03/12] xfs: move _irec structs to xfs_types.h Darrick J. Wong
2022-10-27 17:14 ` [PATCH 04/12] xfs: refactor refcount record usage in xchk_refcountbt_rec Darrick J. Wong
2022-10-27 17:14 ` [PATCH 05/12] xfs: track cow/shared record domains explicitly in xfs_refcount_irec Darrick J. Wong
2022-10-27 21:03   ` Dave Chinner
2022-10-27 21:10     ` Darrick J. Wong
2022-10-27 17:14 ` [PATCH 06/12] xfs: report refcount domain in tracepoints Darrick J. Wong
2022-10-27 21:05   ` Dave Chinner
2022-10-27 17:14 ` [PATCH 07/12] xfs: refactor domain and refcount checking Darrick J. Wong
2022-10-27 21:07   ` Dave Chinner
2022-10-27 17:14 ` [PATCH 08/12] xfs: remove XFS_FIND_RCEXT_SHARED and _COW Darrick J. Wong
2022-10-27 21:11   ` Dave Chinner
2022-10-27 17:14 ` [PATCH 09/12] xfs: check record domain when accessing refcount records Darrick J. Wong
2022-10-27 21:15   ` Dave Chinner
2022-10-27 21:33     ` Darrick J. Wong
2022-10-27 17:14 ` [PATCH 10/12] xfs: fix agblocks check in the cow leftover recovery function Darrick J. Wong
2022-10-27 21:22   ` Dave Chinner
2022-10-27 17:15 ` [PATCH 11/12] xfs: fix uninitialized list head in struct xfs_refcount_recovery Darrick J. Wong
2022-10-27 21:24   ` Dave Chinner
2022-10-27 17:15 ` [PATCH 12/12] xfs: rename XFS_REFC_COW_START to _COWFLAG Darrick J. Wong
2022-10-27 21:25   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).