All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v16 00/21] xfs-4.19: online repair support
@ 2018-06-24 19:23 Darrick J. Wong
  2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
                   ` (20 more replies)
  0 siblings, 21 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This is the sixteenth revision of a patchset that adds to XFS kernel
support for online metadata scrubbing and repair.  There aren't any
on-disk format changes.

New for this version of the patch series are: a renaming of the 'repair
freeze' code into the 'scrub freeze' code so that scrubbers can freeze
the filesystem to check or repair metadata; a new ioctl flag for
userspace to grant the kernel permission to freeze the filesystem to do
work; a centralized iput helper that delays iput processing until after
the scrub freeze is lifted; and a new fs summary counter scrub/repair
function that can check the global icount/ifree/fdblock counters.

The first patch fixes a soon-to-be-invalid assumption in the rmap code
that the rmapbt can never be empty.  This won't be true at all for
realtime rmap and is briefly untrue for rmap repairs, so we might as
well restructure that assumption out of the code.

The next two patches create two predicates that decide if an inode has
CoW staging blocks that need to be freed or post-eof blocks that need to
be freed.

Patches 4-7 implement reconstruction of the AGF/AGI/AGFL headers, the
free space btrees, and the inode btrees.  These are the same patches as
v15.

Patches 8-9 implement the deferred iput code -- first the scrub-specific
iput helper that enables us delay iputting of inodes that have post-eof
or CoW blocks that need to be freed until we no longer have a
transaction and can let the iput work proceed.  The second patch fleshes
out the scrub iget/iput tracepoints for easier debugging.

Next comes the scrub freeze code which enables us to pause all other
work in the filesystem and introduces the FREEZE_OK ioctl flag for
userspace to control freeze policy.

Patch 11-19 implement online rmap, refcount, inode, ifork, bmap,
symlink, extended attribute, and quota repairs.  Patch 20 implements
online quotacheck via the scrub freezer.

Patch 21 implements the filesystem summary counter check and repair
code.  This turned out to be easier to implement than I had thought --
we gather the icount, ifree, and fdblocks counts from each of the AGs.
Next we adjust fdblocks by all the in-core reservations: resblks, per-AG
reservations, and all delayed allocation blocks of all in-core inodes.
Then we can compare our counts against the superblock's counts and
adjust accordingly.

If you're going to start using this mess, you probably ought to just
pull from my git trees.  The kernel patches[1] should apply against
4.18-rc2.  xfsprogs[2] and xfstests[3] can be found in their usual
places.  The git trees contain all four series' worth of changes.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
[3] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
@ 2018-06-24 19:23 ` Darrick J. Wong
  2018-06-27  0:54   ` Dave Chinner
  2018-06-28 21:11   ` Allison Henderson
  2018-06-24 19:23 ` [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks Darrick J. Wong
                   ` (19 subsequent siblings)
  20 siblings, 2 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

The original rmap code assumed that there would always be at least one
rmap in the rmapbt (the AG sb/agf/agi) and so errored out if it didn't
find one.  This assumption isn't true for the rmapbt repair function
(and it won't be true for realtime rmap either), so remove the check and
just deal with the situation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index d4460b0d2d81..8b2a2f81d110 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -753,19 +753,19 @@ xfs_rmap_map(
 			&have_lt);
 	if (error)
 		goto out_error;
-	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
-
-	error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
-	if (error)
-		goto out_error;
-	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
-	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
-			cur->bc_private.a.agno, ltrec.rm_startblock,
-			ltrec.rm_blockcount, ltrec.rm_owner,
-			ltrec.rm_offset, ltrec.rm_flags);
+	if (have_lt) {
+		error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
+		trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
+				cur->bc_private.a.agno, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags);
 
-	if (!xfs_rmap_is_mergeable(&ltrec, owner, flags))
-		have_lt = 0;
+		if (!xfs_rmap_is_mergeable(&ltrec, owner, flags))
+			have_lt = 0;
+	}
 
 	XFS_WANT_CORRUPTED_GOTO(mp,
 		have_lt == 0 ||


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
  2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
@ 2018-06-24 19:23 ` Darrick J. Wong
  2018-06-27  1:02   ` Dave Chinner
  2018-06-28 21:12   ` Allison Henderson
  2018-06-24 19:23 ` [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks Darrick J. Wong
                   ` (18 subsequent siblings)
  20 siblings, 2 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a helper to decide if an inode has real or unwritten extents in the
CoW fork.  The upcoming repair freeze functionality will have to know if
it's safe to iput an inode -- if the inode has incore any state that
would require a transaction to unwind during iput, we'll have to defer
the iput.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |   19 +++++++++++++++++++
 fs/xfs/xfs_inode.h |    1 +
 2 files changed, 20 insertions(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7a96c4e0ab5c..e6859dfc29af 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3689,3 +3689,22 @@ xfs_iflush_int(
 corrupt_out:
 	return -EFSCORRUPTED;
 }
+
+/* Decide if there are real or unwritten extents in the CoW fork. */
+bool
+xfs_inode_has_cow_blocks(
+	struct xfs_inode		*ip)
+{
+	struct xfs_iext_cursor		icur;
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+
+	if (!ifp)
+		return false;
+
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		if (!isnullstartblock(irec.br_startblock))
+			return true;
+	}
+	return false;
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 2ed63a49e890..735d0788bfdb 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -503,5 +503,6 @@ extern struct kmem_zone	*xfs_inode_zone;
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
 bool xfs_inode_verify_forks(struct xfs_inode *ip);
+bool xfs_inode_has_cow_blocks(struct xfs_inode *ip);
 
 #endif	/* __XFS_INODE_H__ */


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
  2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
  2018-06-24 19:23 ` [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks Darrick J. Wong
@ 2018-06-24 19:23 ` Darrick J. Wong
  2018-06-28 21:13   ` Allison Henderson
  2018-06-24 19:23 ` [PATCH 04/21] xfs: repair the AGF and AGFL Darrick J. Wong
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the part of _free_eofblocks that decides if it's really going
to truncate post-EOF blocks into a separate helper function.  The
upcoming repair freeze patch requires us to defer iput of an inode if
disposing of that inode would have to start another transaction to
unwind incore state.  No functionality changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  101 ++++++++++++++++++++----------------------------
 fs/xfs/xfs_inode.c     |   32 +++++++++++++++
 fs/xfs/xfs_inode.h     |    1 
 3 files changed, 75 insertions(+), 59 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c94d376e4152..0f38acbb200f 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -805,78 +805,61 @@ xfs_free_eofblocks(
 	struct xfs_inode	*ip)
 {
 	struct xfs_trans	*tp;
-	int			error;
-	xfs_fileoff_t		end_fsb;
-	xfs_fileoff_t		last_fsb;
-	xfs_filblks_t		map_len;
-	int			nimaps;
-	struct xfs_bmbt_irec	imap;
 	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
 	/*
-	 * Figure out if there are any blocks beyond the end
-	 * of the file.  If not, then there is nothing to do.
+	 * If there are blocks after the end of file, truncate the file to its
+	 * current size to free them up.
 	 */
-	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
-	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
-	if (last_fsb <= end_fsb)
+	if (!xfs_inode_has_posteof_blocks(ip))
 		return 0;
-	map_len = last_fsb - end_fsb;
-
-	nimaps = 1;
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	/*
-	 * If there are blocks after the end of file, truncate the file to its
-	 * current size to free them up.
+	 * Attach the dquots to the inode up front.
 	 */
-	if (!error && (nimaps != 0) &&
-	    (imap.br_startblock != HOLESTARTBLOCK ||
-	     ip->i_delayed_blks)) {
-		/*
-		 * Attach the dquots to the inode up front.
-		 */
-		error = xfs_qm_dqattach(ip);
-		if (error)
-			return error;
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
 
-		/* wait on dio to ensure i_size has settled */
-		inode_dio_wait(VFS_I(ip));
+	/* wait on dio to ensure i_size has settled */
+	inode_dio_wait(VFS_I(ip));
 
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
-				&tp);
-		if (error) {
-			ASSERT(XFS_FORCED_SHUTDOWN(mp));
-			return error;
-		}
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error) {
+		ASSERT(XFS_FORCED_SHUTDOWN(mp));
+		return error;
+	}
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, ip, 0);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
 
-		/*
-		 * Do not update the on-disk file size.  If we update the
-		 * on-disk file size and then the system crashes before the
-		 * contents of the file are flushed to disk then the files
-		 * may be full of holes (ie NULL files bug).
-		 */
-		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
-					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
-		if (error) {
-			/*
-			 * If we get an error at this point we simply don't
-			 * bother truncating the file.
-			 */
-			xfs_trans_cancel(tp);
-		} else {
-			error = xfs_trans_commit(tp);
-			if (!error)
-				xfs_inode_clear_eofblocks_tag(ip);
-		}
+	/*
+	 * Do not update the on-disk file size.  If we update the
+	 * on-disk file size and then the system crashes before the
+	 * contents of the file are flushed to disk then the files
+	 * may be full of holes (ie NULL files bug).
+	 */
+	error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
+				XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
+	if (error)
+		goto err_cancel;
 
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	}
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_unlock;
+
+	xfs_inode_clear_eofblocks_tag(ip);
+	goto out_unlock;
+
+err_cancel:
+	/*
+	 * If we get an error at this point we simply don't
+	 * bother truncating the file.
+	 */
+	xfs_trans_cancel(tp);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e6859dfc29af..368ac0528727 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3708,3 +3708,35 @@ xfs_inode_has_cow_blocks(
 	}
 	return false;
 }
+
+/*
+ * Decide if this inode have post-EOF blocks.  The caller is responsible
+ * for knowing / caring about the PREALLOC/APPEND flags.
+ */
+bool
+xfs_inode_has_posteof_blocks(
+	struct xfs_inode	*ip)
+{
+	struct xfs_bmbt_irec	imap;
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		end_fsb;
+	xfs_fileoff_t		last_fsb;
+	xfs_filblks_t		map_len;
+	int			nimaps;
+	int			error;
+
+	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
+	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
+	if (last_fsb <= end_fsb)
+		return false;
+	map_len = last_fsb - end_fsb;
+
+	nimaps = 1;
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+	return !error && (nimaps != 0) &&
+	       (imap.br_startblock != HOLESTARTBLOCK ||
+	        ip->i_delayed_blks);
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 735d0788bfdb..a041fffa1b33 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -504,5 +504,6 @@ extern struct kmem_zone	*xfs_inode_zone;
 
 bool xfs_inode_verify_forks(struct xfs_inode *ip);
 bool xfs_inode_has_cow_blocks(struct xfs_inode *ip);
+bool xfs_inode_has_posteof_blocks(struct xfs_inode *ip);
 
 #endif	/* __XFS_INODE_H__ */


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2018-06-24 19:23 ` [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks Darrick J. Wong
@ 2018-06-24 19:23 ` Darrick J. Wong
  2018-06-27  2:19   ` Dave Chinner
  2018-06-28 21:14   ` Allison Henderson
  2018-06-24 19:24 ` [PATCH 05/21] xfs: repair the AGI Darrick J. Wong
                   ` (16 subsequent siblings)
  20 siblings, 2 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:23 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Regenerate the AGF and AGFL from the rmap data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/agheader_repair.c |  644 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c          |  106 ++++++-
 fs/xfs/scrub/repair.h          |   15 +
 fs/xfs/scrub/scrub.c           |    4 
 fs/xfs/xfs_trans.c             |   54 +++
 fs/xfs/xfs_trans.h             |    2 
 6 files changed, 814 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 117eedac53df..90e5e6cbc911 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -17,12 +17,18 @@
 #include "xfs_sb.h"
 #include "xfs_inode.h"
 #include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
 #include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
 #include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
+#include "scrub/repair.h"
 
 /* Superblock */
 
@@ -54,3 +60,641 @@ xfs_repair_superblock(
 	xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1);
 	return error;
 }
+
+/* AGF */
+
+struct xfs_repair_agf_allocbt {
+	struct xfs_scrub_context	*sc;
+	xfs_agblock_t			freeblks;
+	xfs_agblock_t			longest;
+};
+
+/* Record free space shape information. */
+STATIC int
+xfs_repair_agf_walk_allocbt(
+	struct xfs_btree_cur		*cur,
+	struct xfs_alloc_rec_incore	*rec,
+	void				*priv)
+{
+	struct xfs_repair_agf_allocbt	*raa = priv;
+	int				error = 0;
+
+	if (xfs_scrub_should_terminate(raa->sc, &error))
+		return error;
+
+	raa->freeblks += rec->ar_blockcount;
+	if (rec->ar_blockcount > raa->longest)
+		raa->longest = rec->ar_blockcount;
+	return error;
+}
+
+/* Does this AGFL block look sane? */
+STATIC int
+xfs_repair_agf_check_agfl_block(
+	struct xfs_mount		*mp,
+	xfs_agblock_t			agbno,
+	void				*priv)
+{
+	struct xfs_scrub_context	*sc = priv;
+
+	if (!xfs_verify_agbno(mp, sc->sa.agno, agbno))
+		return -EFSCORRUPTED;
+	return 0;
+}
+
+/* Information for finding AGF-rooted btrees */
+enum {
+	REPAIR_AGF_BNOBT = 0,
+	REPAIR_AGF_CNTBT,
+	REPAIR_AGF_RMAPBT,
+	REPAIR_AGF_REFCOUNTBT,
+	REPAIR_AGF_END,
+	REPAIR_AGF_MAX
+};
+
+static const struct xfs_repair_find_ag_btree repair_agf[] = {
+	[REPAIR_AGF_BNOBT] = {
+		.rmap_owner = XFS_RMAP_OWN_AG,
+		.buf_ops = &xfs_allocbt_buf_ops,
+		.magic = XFS_ABTB_CRC_MAGIC,
+	},
+	[REPAIR_AGF_CNTBT] = {
+		.rmap_owner = XFS_RMAP_OWN_AG,
+		.buf_ops = &xfs_allocbt_buf_ops,
+		.magic = XFS_ABTC_CRC_MAGIC,
+	},
+	[REPAIR_AGF_RMAPBT] = {
+		.rmap_owner = XFS_RMAP_OWN_AG,
+		.buf_ops = &xfs_rmapbt_buf_ops,
+		.magic = XFS_RMAP_CRC_MAGIC,
+	},
+	[REPAIR_AGF_REFCOUNTBT] = {
+		.rmap_owner = XFS_RMAP_OWN_REFC,
+		.buf_ops = &xfs_refcountbt_buf_ops,
+		.magic = XFS_REFC_CRC_MAGIC,
+	},
+	[REPAIR_AGF_END] = {
+		.buf_ops = NULL,
+	},
+};
+
+/*
+ * Find the btree roots.  This is /also/ a chicken and egg problem because we
+ * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
+ * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
+ * corruptions in those btrees we'll bail out.
+ */
+STATIC int
+xfs_repair_agf_find_btrees(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agf_bp,
+	struct xfs_repair_find_ag_btree	*fab,
+	struct xfs_buf			*agfl_bp)
+{
+	struct xfs_agf			*old_agf = XFS_BUF_TO_AGF(agf_bp);
+	int				error;
+
+	/* Go find the root data. */
+	memcpy(fab, repair_agf, sizeof(repair_agf));
+	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, agfl_bp);
+	if (error)
+		return error;
+
+	/* We must find the bnobt, cntbt, and rmapbt roots. */
+	if (fab[REPAIR_AGF_BNOBT].root == NULLAGBLOCK ||
+	    fab[REPAIR_AGF_BNOBT].height > XFS_BTREE_MAXLEVELS ||
+	    fab[REPAIR_AGF_CNTBT].root == NULLAGBLOCK ||
+	    fab[REPAIR_AGF_CNTBT].height > XFS_BTREE_MAXLEVELS ||
+	    fab[REPAIR_AGF_RMAPBT].root == NULLAGBLOCK ||
+	    fab[REPAIR_AGF_RMAPBT].height > XFS_BTREE_MAXLEVELS)
+		return -EFSCORRUPTED;
+
+	/*
+	 * We relied on the rmapbt to reconstruct the AGF.  If we get a
+	 * different root then something's seriously wrong.
+	 */
+	if (fab[REPAIR_AGF_RMAPBT].root !=
+	    be32_to_cpu(old_agf->agf_roots[XFS_BTNUM_RMAPi]))
+		return -EFSCORRUPTED;
+
+	/* We must find the refcountbt root if that feature is enabled. */
+	if (xfs_sb_version_hasreflink(&sc->mp->m_sb) &&
+	    (fab[REPAIR_AGF_REFCOUNTBT].root == NULLAGBLOCK ||
+	     fab[REPAIR_AGF_REFCOUNTBT].height > XFS_BTREE_MAXLEVELS))
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Set btree root information in an AGF. */
+STATIC void
+xfs_repair_agf_set_roots(
+	struct xfs_scrub_context	*sc,
+	struct xfs_agf			*agf,
+	struct xfs_repair_find_ag_btree	*fab)
+{
+	agf->agf_roots[XFS_BTNUM_BNOi] =
+			cpu_to_be32(fab[REPAIR_AGF_BNOBT].root);
+	agf->agf_levels[XFS_BTNUM_BNOi] =
+			cpu_to_be32(fab[REPAIR_AGF_BNOBT].height);
+
+	agf->agf_roots[XFS_BTNUM_CNTi] =
+			cpu_to_be32(fab[REPAIR_AGF_CNTBT].root);
+	agf->agf_levels[XFS_BTNUM_CNTi] =
+			cpu_to_be32(fab[REPAIR_AGF_CNTBT].height);
+
+	agf->agf_roots[XFS_BTNUM_RMAPi] =
+			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].root);
+	agf->agf_levels[XFS_BTNUM_RMAPi] =
+			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].height);
+
+	if (xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
+		agf->agf_refcount_root =
+				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].root);
+		agf->agf_refcount_level =
+				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].height);
+	}
+}
+
+/*
+ * Reinitialize the AGF header, making an in-core copy of the old contents so
+ * that we know which in-core state needs to be reinitialized.
+ */
+STATIC void
+xfs_repair_agf_init_header(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agf_bp,
+	struct xfs_agf			*old_agf)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
+
+	memcpy(old_agf, agf, sizeof(*old_agf));
+	memset(agf, 0, BBTOB(agf_bp->b_length));
+	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
+	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
+	agf->agf_seqno = cpu_to_be32(sc->sa.agno);
+	agf->agf_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
+	agf->agf_flfirst = old_agf->agf_flfirst;
+	agf->agf_fllast = old_agf->agf_fllast;
+	agf->agf_flcount = old_agf->agf_flcount;
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
+}
+
+/* Update the AGF btree counters by walking the btrees. */
+STATIC int
+xfs_repair_agf_update_btree_counters(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agf_bp)
+{
+	struct xfs_repair_agf_allocbt	raa = { .sc = sc };
+	struct xfs_btree_cur		*cur = NULL;
+	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
+	struct xfs_mount		*mp = sc->mp;
+	xfs_agblock_t			btreeblks;
+	xfs_agblock_t			blocks;
+	int				error;
+
+	/* Update the AGF counters from the bnobt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_BNO);
+	error = xfs_alloc_query_all(cur, xfs_repair_agf_walk_allocbt, &raa);
+	if (error)
+		goto err;
+	error = xfs_btree_count_blocks(cur, &blocks);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	btreeblks = blocks - 1;
+	agf->agf_freeblks = cpu_to_be32(raa.freeblks);
+	agf->agf_longest = cpu_to_be32(raa.longest);
+
+	/* Update the AGF counters from the cntbt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_CNT);
+	error = xfs_btree_count_blocks(cur, &blocks);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	btreeblks += blocks - 1;
+
+	/* Update the AGF counters from the rmapbt. */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
+	error = xfs_btree_count_blocks(cur, &blocks);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	agf->agf_rmap_blocks = cpu_to_be32(blocks);
+	btreeblks += blocks - 1;
+
+	agf->agf_btreeblks = cpu_to_be32(btreeblks);
+
+	/* Update the AGF counters from the refcountbt. */
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		cur = xfs_refcountbt_init_cursor(mp, sc->tp, agf_bp,
+				sc->sa.agno, NULL);
+		error = xfs_btree_count_blocks(cur, &blocks);
+		if (error)
+			goto err;
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		agf->agf_refcount_blocks = cpu_to_be32(blocks);
+	}
+
+	return 0;
+err:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Trigger reinitialization of the in-core data. */
+STATIC int
+xfs_repair_agf_reinit_incore(
+	struct xfs_scrub_context	*sc,
+	struct xfs_agf			*agf,
+	const struct xfs_agf		*old_agf)
+{
+	struct xfs_perag		*pag;
+
+	/* XXX: trigger fdblocks recalculation */
+
+	/* Now reinitialize the in-core counters if necessary. */
+	pag = sc->sa.pag;
+	if (!pag->pagf_init)
+		return 0;
+
+	pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
+	pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks);
+	pag->pagf_longest = be32_to_cpu(agf->agf_longest);
+	pag->pagf_levels[XFS_BTNUM_BNOi] =
+			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
+	pag->pagf_levels[XFS_BTNUM_CNTi] =
+			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
+	pag->pagf_levels[XFS_BTNUM_RMAPi] =
+			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
+	pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
+
+	return 0;
+}
+
+/* Repair the AGF. */
+int
+xfs_repair_agf(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_repair_find_ag_btree	fab[REPAIR_AGF_MAX];
+	struct xfs_agf			old_agf;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agf_bp;
+	struct xfs_buf			*agfl_bp;
+	struct xfs_agf			*agf;
+	int				error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGF_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0, &agf_bp, NULL);
+	if (error)
+		return error;
+	agf_bp->b_ops = &xfs_agf_buf_ops;
+	agf = XFS_BUF_TO_AGF(agf_bp);
+
+	/*
+	 * Load the AGFL so that we can screen out OWN_AG blocks that are on
+	 * the AGFL now; these blocks might have once been part of the
+	 * bno/cnt/rmap btrees but are not now.  This is a chicken and egg
+	 * problem: the AGF is corrupt, so we have to trust the AGFL contents
+	 * because we can't do any serious cross-referencing with any of the
+	 * btrees rooted in the AGF.  If the AGFL contents are obviously bad
+	 * then we'll bail out.
+	 */
+	error = xfs_alloc_read_agfl(mp, sc->tp, sc->sa.agno, &agfl_bp);
+	if (error)
+		return error;
+
+	/*
+	 * Spot-check the AGFL blocks; if they're obviously corrupt then
+	 * there's nothing we can do but bail out.
+	 */
+	error = xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(agf_bp), agfl_bp,
+			xfs_repair_agf_check_agfl_block, sc);
+	if (error)
+		return error;
+
+	/*
+	 * Find the AGF btree roots.  See the comment for this function for
+	 * more information about the limitations of this repairer; this is
+	 * also a chicken-and-egg situation.
+	 */
+	error = xfs_repair_agf_find_btrees(sc, agf_bp, fab, agfl_bp);
+	if (error)
+		return error;
+
+	/* Start rewriting the header and implant the btrees we found. */
+	xfs_repair_agf_init_header(sc, agf_bp, &old_agf);
+	xfs_repair_agf_set_roots(sc, agf, fab);
+	error = xfs_repair_agf_update_btree_counters(sc, agf_bp);
+	if (error)
+		goto out_revert;
+
+	/* Reinitialize in-core state. */
+	error = xfs_repair_agf_reinit_incore(sc, agf, &old_agf);
+	if (error)
+		goto out_revert;
+
+	/* Write this to disk. */
+	xfs_trans_buf_set_type(sc->tp, agf_bp, XFS_BLFT_AGF_BUF);
+	xfs_trans_log_buf(sc->tp, agf_bp, 0, BBTOB(agf_bp->b_length) - 1);
+	return 0;
+
+out_revert:
+	memcpy(agf, &old_agf, sizeof(old_agf));
+	return error;
+}
+
+/* AGFL */
+
+struct xfs_repair_agfl {
+	struct xfs_repair_extent_list	agmeta_list;
+	struct xfs_repair_extent_list	*freesp_list;
+	struct xfs_scrub_context	*sc;
+};
+
+/* Record all freespace information. */
+STATIC int
+xfs_repair_agfl_rmap_fn(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_agfl		*ra = priv;
+	xfs_fsblock_t			fsb;
+	int				error = 0;
+
+	if (xfs_scrub_should_terminate(ra->sc, &error))
+		return error;
+
+	/* Record all the OWN_AG blocks. */
+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		error = xfs_repair_collect_btree_extent(ra->sc,
+				ra->freesp_list, fsb, rec->rm_blockcount);
+		if (error)
+			return error;
+	}
+
+	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
+			xfs_repair_collect_btree_cur_blocks_in_extent_list,
+			&ra->agmeta_list);
+}
+
+/* Add a btree block to the agmeta list. */
+STATIC int
+xfs_repair_agfl_visit_btblock(
+	struct xfs_btree_cur		*cur,
+	int				level,
+	void				*priv)
+{
+	struct xfs_repair_agfl		*ra = priv;
+	struct xfs_buf			*bp;
+	xfs_fsblock_t			fsb;
+	int				error = 0;
+
+	if (xfs_scrub_should_terminate(ra->sc, &error))
+		return error;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+	return xfs_repair_collect_btree_extent(ra->sc, &ra->agmeta_list,
+			fsb, 1);
+}
+
+/*
+ * Map out all the non-AGFL OWN_AG space in this AG so that we can deduce
+ * which blocks belong to the AGFL.
+ */
+STATIC int
+xfs_repair_agfl_find_extents(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agf_bp,
+	struct xfs_repair_extent_list	*agfl_extents,
+	xfs_agblock_t			*flcount)
+{
+	struct xfs_repair_agfl		ra;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_btree_cur		*cur;
+	struct xfs_repair_extent	*rae;
+	int				error;
+
+	ra.sc = sc;
+	ra.freesp_list = agfl_extents;
+	xfs_repair_init_extent_list(&ra.agmeta_list);
+
+	/* Find all space used by the free space btrees & rmapbt. */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xfs_repair_agfl_rmap_fn, &ra);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Find all space used by bnobt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_BNO);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Find all space used by cntbt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_CNT);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
+	if (error)
+		goto err;
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/*
+	 * Drop the freesp meta blocks that are in use by btrees.
+	 * The remaining blocks /should/ be AGFL blocks.
+	 */
+	error = xfs_repair_subtract_extents(sc, agfl_extents, &ra.agmeta_list);
+	xfs_repair_cancel_btree_extents(sc, &ra.agmeta_list);
+	if (error)
+		return error;
+
+	/* Calculate the new AGFL size. */
+	*flcount = 0;
+	for_each_xfs_repair_extent(rae, agfl_extents) {
+		*flcount += rae->len;
+		if (*flcount > xfs_agfl_size(mp))
+			break;
+	}
+	if (*flcount > xfs_agfl_size(mp))
+		*flcount = xfs_agfl_size(mp);
+	return 0;
+
+err:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Update the AGF and reset the in-core state. */
+STATIC int
+xfs_repair_agfl_update_agf(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agf_bp,
+	xfs_agblock_t			flcount)
+{
+	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
+
+	/* XXX: trigger fdblocks recalculation */
+
+	/* Update the AGF counters. */
+	if (sc->sa.pag->pagf_init)
+		sc->sa.pag->pagf_flcount = flcount;
+	agf->agf_flfirst = cpu_to_be32(0);
+	agf->agf_flcount = cpu_to_be32(flcount);
+	agf->agf_fllast = cpu_to_be32(flcount - 1);
+
+	xfs_alloc_log_agf(sc->tp, agf_bp,
+			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
+	return 0;
+}
+
+/* Write out a totally new AGFL. */
+STATIC void
+xfs_repair_agfl_init_header(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agfl_bp,
+	struct xfs_repair_extent_list	*agfl_extents,
+	xfs_agblock_t			flcount)
+{
+	struct xfs_mount		*mp = sc->mp;
+	__be32				*agfl_bno;
+	struct xfs_repair_extent	*rae;
+	struct xfs_repair_extent	*n;
+	struct xfs_agfl			*agfl;
+	xfs_agblock_t			agbno;
+	unsigned int			fl_off;
+
+	/* Start rewriting the header. */
+	agfl = XFS_BUF_TO_AGFL(agfl_bp);
+	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
+	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
+	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
+	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
+
+	/* Fill the AGFL with the remaining blocks. */
+	fl_off = 0;
+	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
+	for_each_xfs_repair_extent_safe(rae, n, agfl_extents) {
+		agbno = XFS_FSB_TO_AGBNO(mp, rae->fsbno);
+
+		trace_xfs_repair_agfl_insert(mp, sc->sa.agno, agbno, rae->len);
+
+		while (rae->len > 0 && fl_off < flcount) {
+			agfl_bno[fl_off] = cpu_to_be32(agbno);
+			fl_off++;
+			agbno++;
+			rae->fsbno++;
+			rae->len--;
+		}
+
+		if (rae->len)
+			break;
+		list_del(&rae->list);
+		kmem_free(rae);
+	}
+
+	/* Write AGF and AGFL to disk. */
+	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
+	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
+}
+
+/* Repair the AGFL. */
+int
+xfs_repair_agfl(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_owner_info		oinfo;
+	struct xfs_repair_extent_list	agfl_extents;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agf_bp;
+	struct xfs_buf			*agfl_bp;
+	xfs_agblock_t			flcount;
+	int				error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+	xfs_repair_init_extent_list(&agfl_extents);
+
+	/*
+	 * Read the AGF so that we can query the rmapbt.  We hope that there's
+	 * nothing wrong with the AGF, but all the AG header repair functions
+	 * have this chicken-and-egg problem.
+	 */
+	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return -ENOMEM;
+
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGFL_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0, &agfl_bp, NULL);
+	if (error)
+		return error;
+	agfl_bp->b_ops = &xfs_agfl_buf_ops;
+
+	/*
+	 * Compute the set of old AGFL blocks by subtracting from the list of
+	 * OWN_AG blocks the list of blocks owned by all other OWN_AG metadata
+	 * (bnobt, cntbt, rmapbt).  These are the old AGFL blocks, so return
+	 * that list and the number of blocks we're actually going to put back
+	 * on the AGFL.
+	 */
+	error = xfs_repair_agfl_find_extents(sc, agf_bp, &agfl_extents,
+			&flcount);
+	if (error)
+		goto err;
+
+	/*
+	 * Update AGF and AGFL.  We reset the global free block counter when
+	 * we adjust the AGF flcount (which can fail) so avoid updating any
+	 * bufers until we know that part works.
+	 */
+	error = xfs_repair_agfl_update_agf(sc, agf_bp, flcount);
+	if (error)
+		goto err;
+	xfs_repair_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
+
+	/*
+	 * Ok, the AGFL should be ready to go now.  Roll the transaction so
+	 * that we can free any AGFL overflow.
+	 */
+	sc->sa.agf_bp = agf_bp;
+	sc->sa.agfl_bp = agfl_bp;
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		goto err;
+
+	/* Dump any AGFL overflow. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	return xfs_repair_reap_btree_extents(sc, &agfl_extents, &oinfo,
+			XFS_AG_RESV_AGFL);
+err:
+	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 326be4e8b71e..bcdaa8df18f6 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -127,9 +127,12 @@ xfs_repair_roll_ag_trans(
 	int				error;
 
 	/* Keep the AG header buffers locked so we can keep going. */
-	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
-	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
-	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
+	if (sc->sa.agi_bp)
+		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
+	if (sc->sa.agf_bp)
+		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
+	if (sc->sa.agfl_bp)
+		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
 
 	/* Roll the transaction. */
 	error = xfs_trans_roll(&sc->tp);
@@ -137,9 +140,12 @@ xfs_repair_roll_ag_trans(
 		goto out_release;
 
 	/* Join AG headers to the new transaction. */
-	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
-	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
-	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
+	if (sc->sa.agi_bp)
+		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
+	if (sc->sa.agf_bp)
+		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
+	if (sc->sa.agfl_bp)
+		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
 
 	return 0;
 
@@ -149,9 +155,12 @@ xfs_repair_roll_ag_trans(
 	 * buffers will be released during teardown on our way out
 	 * of the kernel.
 	 */
-	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
-	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
-	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
+	if (sc->sa.agi_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
+	if (sc->sa.agf_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
+	if (sc->sa.agfl_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
 
 	return error;
 }
@@ -408,6 +417,85 @@ xfs_repair_collect_btree_extent(
 	return 0;
 }
 
+/*
+ * Help record all btree blocks seen while iterating all records of a btree.
+ *
+ * We know that the btree query_all function starts at the left edge and walks
+ * towards the right edge of the tree.  Therefore, we know that we can walk up
+ * the btree cursor towards the root; if the pointer for a given level points
+ * to the first record/key in that block, we haven't seen this block before;
+ * and therefore we need to remember that we saw this block in the btree.
+ *
+ * So if our btree is:
+ *
+ *    4
+ *  / | \
+ * 1  2  3
+ *
+ * Pretend for this example that each leaf block has 100 btree records.  For
+ * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
+ * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
+ * block 4.  The list is [1, 4].
+ *
+ * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
+ * loop.  The list remains [1, 4].
+ *
+ * For the 101st btree record, we've moved onto leaf block 2.  Now
+ * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
+ * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
+ *
+ * For the 102nd record, bc_ptrs[0] == 2, so we continue.
+ *
+ * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
+ * we add 3 to the list.  Now it is [1, 4, 2, 3].
+ *
+ * For the 300th record we just exit, with the list being [1, 4, 2, 3].
+ *
+ * The *iter_fn can return XFS_BTREE_QUERY_RANGE_ABORT to stop, 0 to keep
+ * iterating, or the usual negative error code.
+ */
+int
+xfs_repair_collect_btree_cur_blocks(
+	struct xfs_scrub_context	*sc,
+	struct xfs_btree_cur		*cur,
+	int				(*iter_fn)(struct xfs_scrub_context *sc,
+						   xfs_fsblock_t fsbno,
+						   xfs_fsblock_t len,
+						   void *priv),
+	void				*priv)
+{
+	struct xfs_buf			*bp;
+	xfs_fsblock_t			fsb;
+	int				i;
+	int				error;
+
+	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
+		xfs_btree_get_block(cur, i, &bp);
+		if (!bp)
+			continue;
+		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+		error = iter_fn(sc, fsb, 1, priv);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Simple adapter to connect xfs_repair_collect_btree_extent to
+ * xfs_repair_collect_btree_cur_blocks.
+ */
+int
+xfs_repair_collect_btree_cur_blocks_in_extent_list(
+	struct xfs_scrub_context	*sc,
+	xfs_fsblock_t			fsbno,
+	xfs_fsblock_t			len,
+	void				*priv)
+{
+	return xfs_repair_collect_btree_extent(sc, priv, fsbno, len);
+}
+
 /*
  * An error happened during the rebuild so the transaction will be cancelled.
  * The fs will shut down, and the administrator has to unmount and run repair.
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index ef47826b6725..f2af5923aa75 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -48,9 +48,20 @@ xfs_repair_init_extent_list(
 
 #define for_each_xfs_repair_extent_safe(rbe, n, exlist) \
 	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
+#define for_each_xfs_repair_extent(rbe, exlist) \
+	list_for_each_entry((rbe), &(exlist)->list, list)
 int xfs_repair_collect_btree_extent(struct xfs_scrub_context *sc,
 		struct xfs_repair_extent_list *btlist, xfs_fsblock_t fsbno,
 		xfs_extlen_t len);
+int xfs_repair_collect_btree_cur_blocks(struct xfs_scrub_context *sc,
+		struct xfs_btree_cur *cur,
+		int (*iter_fn)(struct xfs_scrub_context *sc,
+			       xfs_fsblock_t fsbno, xfs_fsblock_t len,
+			       void *priv),
+		void *priv);
+int xfs_repair_collect_btree_cur_blocks_in_extent_list(
+		struct xfs_scrub_context *sc, xfs_fsblock_t fsbno,
+		xfs_fsblock_t len, void *priv);
 void xfs_repair_cancel_btree_extents(struct xfs_scrub_context *sc,
 		struct xfs_repair_extent_list *btlist);
 int xfs_repair_subtract_extents(struct xfs_scrub_context *sc,
@@ -89,6 +100,8 @@ int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
 
 int xfs_repair_probe(struct xfs_scrub_context *sc);
 int xfs_repair_superblock(struct xfs_scrub_context *sc);
+int xfs_repair_agf(struct xfs_scrub_context *sc);
+int xfs_repair_agfl(struct xfs_scrub_context *sc);
 
 #else
 
@@ -112,6 +125,8 @@ xfs_repair_calc_ag_resblks(
 
 #define xfs_repair_probe		xfs_repair_notsupported
 #define xfs_repair_superblock		xfs_repair_notsupported
+#define xfs_repair_agf			xfs_repair_notsupported
+#define xfs_repair_agfl			xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 58ae76b3a421..8e11c3c699fb 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -208,13 +208,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xfs_scrub_setup_fs,
 		.scrub	= xfs_scrub_agf,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_agf,
 	},
 	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
 		.type	= ST_PERAG,
 		.setup	= xfs_scrub_setup_fs,
 		.scrub	= xfs_scrub_agfl,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_agfl,
 	},
 	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 524f543c5b82..c08785cf83a9 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -126,6 +126,60 @@ xfs_trans_dup(
 	return ntp;
 }
 
+/*
+ * Try to reserve more blocks for a transaction.  The single use case we
+ * support is for online repair -- use a transaction to gather data without
+ * fear of btree cycle deadlocks; calculate how many blocks we really need
+ * from that data; and only then start modifying data.  This can fail due to
+ * ENOSPC, so we have to be able to cancel the transaction.
+ */
+int
+xfs_trans_reserve_more(
+	struct xfs_trans	*tp,
+	uint			blocks,
+	uint			rtextents)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
+	int			error = 0;
+
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/*
+	 * Attempt to reserve the needed disk blocks by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (blocks > 0) {
+		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
+		if (error)
+			return -ENOSPC;
+		tp->t_blk_res += blocks;
+	}
+
+	/*
+	 * Attempt to reserve the needed realtime extents by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (rtextents > 0) {
+		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
+		if (error) {
+			error = -ENOSPC;
+			goto out_blocks;
+		}
+		tp->t_rtx_res += rtextents;
+	}
+
+	return 0;
+out_blocks:
+	if (blocks > 0) {
+		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
+		tp->t_blk_res -= blocks;
+	}
+	return error;
+}
+
 /*
  * This is called to reserve free disk blocks and log space for the
  * given transaction.  This must be done before allocating any resources
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 6526314f0b8f..bdbd3d5fd7b0 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -153,6 +153,8 @@ typedef struct xfs_trans {
 int		xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
 			uint blocks, uint rtextents, uint flags,
 			struct xfs_trans **tpp);
+int		xfs_trans_reserve_more(struct xfs_trans *tp, uint blocks,
+			uint rtextents);
 int		xfs_trans_alloc_empty(struct xfs_mount *mp,
 			struct xfs_trans **tpp);
 void		xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 05/21] xfs: repair the AGI
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2018-06-24 19:23 ` [PATCH 04/21] xfs: repair the AGF and AGFL Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-06-27  2:22   ` Dave Chinner
  2018-06-28 21:15   ` Allison Henderson
  2018-06-24 19:24 ` [PATCH 06/21] xfs: repair free space btrees Darrick J. Wong
                   ` (15 subsequent siblings)
  20 siblings, 2 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Rebuild the AGI header items with some help from the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/agheader_repair.c |  211 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h          |    2 
 fs/xfs/scrub/scrub.c           |    2 
 3 files changed, 214 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 90e5e6cbc911..61e0134f6f9f 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -698,3 +698,214 @@ xfs_repair_agfl(
 	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
 	return error;
 }
+
+/* AGI */
+
+enum {
+	REPAIR_AGI_INOBT = 0,
+	REPAIR_AGI_FINOBT,
+	REPAIR_AGI_END,
+	REPAIR_AGI_MAX
+};
+
+static const struct xfs_repair_find_ag_btree repair_agi[] = {
+	[REPAIR_AGI_INOBT] = {
+		.rmap_owner = XFS_RMAP_OWN_INOBT,
+		.buf_ops = &xfs_inobt_buf_ops,
+		.magic = XFS_IBT_CRC_MAGIC,
+	},
+	[REPAIR_AGI_FINOBT] = {
+		.rmap_owner = XFS_RMAP_OWN_INOBT,
+		.buf_ops = &xfs_inobt_buf_ops,
+		.magic = XFS_FIBT_CRC_MAGIC,
+	},
+	[REPAIR_AGI_END] = {
+		.buf_ops = NULL
+	},
+};
+
+/* Find the inode btree roots from the rmap data. */
+STATIC int
+xfs_repair_agi_find_btrees(
+	struct xfs_scrub_context	*sc,
+	struct xfs_repair_find_ag_btree	*fab)
+{
+	struct xfs_buf			*agf_bp;
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	memcpy(fab, repair_agi, sizeof(repair_agi));
+
+	/* Read the AGF. */
+	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return -ENOMEM;
+
+	/* Find the btree roots. */
+	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, NULL);
+	if (error)
+		return error;
+
+	/* We must find the inobt root. */
+	if (fab[REPAIR_AGI_INOBT].root == NULLAGBLOCK ||
+	    fab[REPAIR_AGI_INOBT].height > XFS_BTREE_MAXLEVELS)
+		return -EFSCORRUPTED;
+
+	/* We must find the finobt root if that feature is enabled. */
+	if (xfs_sb_version_hasfinobt(&mp->m_sb) &&
+	    (fab[REPAIR_AGI_FINOBT].root == NULLAGBLOCK ||
+	     fab[REPAIR_AGI_FINOBT].height > XFS_BTREE_MAXLEVELS))
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/*
+ * Reinitialize the AGI header, making an in-core copy of the old contents so
+ * that we know which in-core state needs to be reinitialized.
+ */
+STATIC void
+xfs_repair_agi_init_header(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agi_bp,
+	struct xfs_agi			*old_agi)
+{
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agi_bp);
+	struct xfs_mount		*mp = sc->mp;
+
+	memcpy(old_agi, agi, sizeof(*old_agi));
+	memset(agi, 0, BBTOB(agi_bp->b_length));
+	agi->agi_magicnum = cpu_to_be32(XFS_AGI_MAGIC);
+	agi->agi_versionnum = cpu_to_be32(XFS_AGI_VERSION);
+	agi->agi_seqno = cpu_to_be32(sc->sa.agno);
+	agi->agi_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
+	agi->agi_newino = cpu_to_be32(NULLAGINO);
+	agi->agi_dirino = cpu_to_be32(NULLAGINO);
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
+
+	/* We don't know how to fix the unlinked list yet. */
+	memcpy(&agi->agi_unlinked, &old_agi->agi_unlinked,
+			sizeof(agi->agi_unlinked));
+}
+
+/* Set btree root information in an AGI. */
+STATIC void
+xfs_repair_agi_set_roots(
+	struct xfs_scrub_context	*sc,
+	struct xfs_agi			*agi,
+	struct xfs_repair_find_ag_btree	*fab)
+{
+	agi->agi_root = cpu_to_be32(fab[REPAIR_AGI_INOBT].root);
+	agi->agi_level = cpu_to_be32(fab[REPAIR_AGI_INOBT].height);
+
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb)) {
+		agi->agi_free_root = cpu_to_be32(fab[REPAIR_AGI_FINOBT].root);
+		agi->agi_free_level =
+				cpu_to_be32(fab[REPAIR_AGI_FINOBT].height);
+	}
+}
+
+/* Update the AGI counters. */
+STATIC int
+xfs_repair_agi_update_btree_counters(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*agi_bp)
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agi_bp);
+	struct xfs_mount		*mp = sc->mp;
+	xfs_agino_t			count;
+	xfs_agino_t			freecount;
+	int				error;
+
+	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, sc->sa.agno,
+			XFS_BTNUM_INO);
+	error = xfs_ialloc_count_inodes(cur, &count, &freecount);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	agi->agi_count = cpu_to_be32(count);
+	agi->agi_freecount = cpu_to_be32(freecount);
+	return 0;
+err:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Trigger reinitialization of the in-core data. */
+STATIC int
+xfs_repair_agi_reinit_incore(
+	struct xfs_scrub_context	*sc,
+	struct xfs_agi			*agi,
+	const struct xfs_agi		*old_agi)
+{
+	struct xfs_perag		*pag;
+
+	/* XXX: trigger inode count recalculation */
+
+	/* Now reinitialize the in-core counters if necessary. */
+	pag = sc->sa.pag;
+	if (!pag->pagf_init)
+		return 0;
+
+	sc->sa.pag->pagi_count = be32_to_cpu(agi->agi_count);
+	sc->sa.pag->pagi_freecount = be32_to_cpu(agi->agi_freecount);
+
+	return 0;
+}
+
+/* Repair the AGI. */
+int
+xfs_repair_agi(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_repair_find_ag_btree	fab[REPAIR_AGI_MAX];
+	struct xfs_agi			old_agi;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agi_bp;
+	struct xfs_agi			*agi;
+	int				error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGI_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0, &agi_bp, NULL);
+	if (error)
+		return error;
+	agi_bp->b_ops = &xfs_agi_buf_ops;
+	agi = XFS_BUF_TO_AGI(agi_bp);
+
+	/* Find the AGI btree roots. */
+	error = xfs_repair_agi_find_btrees(sc, fab);
+	if (error)
+		return error;
+
+	/* Start rewriting the header and implant the btrees we found. */
+	xfs_repair_agi_init_header(sc, agi_bp, &old_agi);
+	xfs_repair_agi_set_roots(sc, agi, fab);
+	error = xfs_repair_agi_update_btree_counters(sc, agi_bp);
+	if (error)
+		goto out_revert;
+
+	/* Reinitialize in-core state. */
+	error = xfs_repair_agi_reinit_incore(sc, agi, &old_agi);
+	if (error)
+		goto out_revert;
+
+	/* Write this to disk. */
+	xfs_trans_buf_set_type(sc->tp, agi_bp, XFS_BLFT_AGI_BUF);
+	xfs_trans_log_buf(sc->tp, agi_bp, 0, BBTOB(agi_bp->b_length) - 1);
+	return error;
+
+out_revert:
+	memcpy(agi, &old_agi, sizeof(old_agi));
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index f2af5923aa75..d541c1586d0a 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -102,6 +102,7 @@ int xfs_repair_probe(struct xfs_scrub_context *sc);
 int xfs_repair_superblock(struct xfs_scrub_context *sc);
 int xfs_repair_agf(struct xfs_scrub_context *sc);
 int xfs_repair_agfl(struct xfs_scrub_context *sc);
+int xfs_repair_agi(struct xfs_scrub_context *sc);
 
 #else
 
@@ -127,6 +128,7 @@ xfs_repair_calc_ag_resblks(
 #define xfs_repair_superblock		xfs_repair_notsupported
 #define xfs_repair_agf			xfs_repair_notsupported
 #define xfs_repair_agfl			xfs_repair_notsupported
+#define xfs_repair_agi			xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 8e11c3c699fb..0f036aab2551 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -220,7 +220,7 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xfs_scrub_setup_fs,
 		.scrub	= xfs_scrub_agi,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_agi,
 	},
 	[XFS_SCRUB_TYPE_BNOBT] = {	/* bnobt */
 		.type	= ST_PERAG,


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 06/21] xfs: repair free space btrees
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 05/21] xfs: repair the AGI Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-06-27  3:21   ` Dave Chinner
  2018-06-30 17:36   ` Allison Henderson
  2018-06-24 19:24 ` [PATCH 07/21] xfs: repair inode btrees Darrick J. Wong
                   ` (14 subsequent siblings)
  20 siblings, 2 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Rebuild the free space btrees from the gaps in the rmap btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/scrub/alloc.c        |    1 
 fs/xfs/scrub/alloc_repair.c |  561 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.c       |    8 +
 fs/xfs/scrub/repair.h       |    2 
 fs/xfs/scrub/scrub.c        |    4 
 fs/xfs/xfs_extent_busy.c    |   14 +
 fs/xfs/xfs_extent_busy.h    |    4 
 8 files changed, 591 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/alloc_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a36cccbec169..841e0824eeb6 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -164,6 +164,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   alloc_repair.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index 50e4f7fa06f0..e2514c84cb7a 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -15,7 +15,6 @@
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
 #include "xfs_sb.h"
-#include "xfs_alloc.h"
 #include "xfs_rmap.h"
 #include "xfs_alloc.h"
 #include "scrub/xfs_scrub.h"
diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
new file mode 100644
index 000000000000..c25a2b0d71f1
--- /dev/null
+++ b/fs/xfs/scrub/alloc_repair.c
@@ -0,0 +1,561 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_refcount.h"
+#include "xfs_extent_busy.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Free Space Btree Repair
+ * =======================
+ *
+ * The reverse mappings are supposed to record all space usage for the entire
+ * AG.  Therefore, we can recalculate the free extents in an AG by looking for
+ * gaps in the physical extents recorded in the rmapbt.  On a reflink
+ * filesystem this is a little more tricky in that we have to be aware that
+ * the rmap records are allowed to overlap.
+ *
+ * We derive which blocks belonged to the old bnobt/cntbt by recording all the
+ * OWN_AG extents and subtracting out the blocks owned by all other OWN_AG
+ * metadata: the rmapbt blocks visited while iterating the reverse mappings
+ * and the AGFL blocks.
+ *
+ * Once we have both of those pieces, we can reconstruct the bnobt and cntbt
+ * by blowing out the free block state and freeing all the extents that we
+ * found.  This adds the requirement that we can't have any busy extents in
+ * the AG because the busy code cannot handle duplicate records.
+ *
+ * Note that we can only rebuild both free space btrees at the same time
+ * because the regular extent freeing infrastructure loads both btrees at the
+ * same time.
+ */
+
+struct xfs_repair_alloc_extent {
+	struct list_head		list;
+	xfs_agblock_t			bno;
+	xfs_extlen_t			len;
+};
+
+struct xfs_repair_alloc {
+	struct xfs_repair_extent_list	nobtlist; /* rmapbt/agfl blocks */
+	struct xfs_repair_extent_list	*btlist;  /* OWN_AG blocks */
+	struct list_head		*extlist; /* free extents */
+	struct xfs_scrub_context	*sc;
+	uint64_t			nr_records; /* length of extlist */
+	xfs_agblock_t			next_bno; /* next bno we want to see */
+	xfs_agblock_t			nr_blocks; /* free blocks in extlist */
+};
+
+/* Record extents that aren't in use from gaps in the rmap records. */
+STATIC int
+xfs_repair_alloc_extent_fn(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_alloc		*ra = priv;
+	struct xfs_repair_alloc_extent	*rae;
+	xfs_fsblock_t			fsb;
+	int				error;
+
+	/* Record all the OWN_AG blocks... */
+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		error = xfs_repair_collect_btree_extent(ra->sc,
+				ra->btlist, fsb, rec->rm_blockcount);
+		if (error)
+			return error;
+	}
+
+	/* ...and all the rmapbt blocks... */
+	error = xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
+			xfs_repair_collect_btree_cur_blocks_in_extent_list,
+			&ra->nobtlist);
+	if (error)
+		return error;
+
+	/* ...and all the free space. */
+	if (rec->rm_startblock > ra->next_bno) {
+		trace_xfs_repair_alloc_extent_fn(cur->bc_mp,
+				cur->bc_private.a.agno,
+				ra->next_bno, rec->rm_startblock - ra->next_bno,
+				XFS_RMAP_OWN_NULL, 0, 0);
+
+		rae = kmem_alloc(sizeof(struct xfs_repair_alloc_extent),
+				KM_MAYFAIL);
+		if (!rae)
+			return -ENOMEM;
+		INIT_LIST_HEAD(&rae->list);
+		rae->bno = ra->next_bno;
+		rae->len = rec->rm_startblock - ra->next_bno;
+		list_add_tail(&rae->list, ra->extlist);
+		ra->nr_records++;
+		ra->nr_blocks += rae->len;
+	}
+	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
+			rec->rm_startblock + rec->rm_blockcount);
+	return 0;
+}
+
+/* Collect an AGFL block for the not-to-release list. */
+static int
+xfs_repair_collect_agfl_block(
+	struct xfs_mount		*mp,
+	xfs_agblock_t			bno,
+	void				*priv)
+{
+	struct xfs_repair_alloc		*ra = priv;
+	xfs_fsblock_t			fsb;
+
+	fsb = XFS_AGB_TO_FSB(mp, ra->sc->sa.agno, bno);
+	return xfs_repair_collect_btree_extent(ra->sc, &ra->nobtlist, fsb, 1);
+}
+
+/* Compare two btree extents. */
+static int
+xfs_repair_allocbt_extent_cmp(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_repair_alloc_extent	*ap;
+	struct xfs_repair_alloc_extent	*bp;
+
+	ap = container_of(a, struct xfs_repair_alloc_extent, list);
+	bp = container_of(b, struct xfs_repair_alloc_extent, list);
+
+	if (ap->bno > bp->bno)
+		return 1;
+	else if (ap->bno < bp->bno)
+		return -1;
+	return 0;
+}
+
+/* Put an extent onto the free list. */
+STATIC int
+xfs_repair_allocbt_free_extent(
+	struct xfs_scrub_context	*sc,
+	xfs_fsblock_t			fsbno,
+	xfs_extlen_t			len,
+	struct xfs_owner_info		*oinfo)
+{
+	int				error;
+
+	error = xfs_free_extent(sc->tp, fsbno, len, oinfo, 0);
+	if (error)
+		return error;
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		return error;
+	return xfs_mod_fdblocks(sc->mp, -(int64_t)len, false);
+}
+
+/* Find the longest free extent in the list. */
+static struct xfs_repair_alloc_extent *
+xfs_repair_allocbt_get_longest(
+	struct list_head		*free_extents)
+{
+	struct xfs_repair_alloc_extent	*rae;
+	struct xfs_repair_alloc_extent	*res = NULL;
+
+	list_for_each_entry(rae, free_extents, list) {
+		if (!res || rae->len > res->len)
+			res = rae;
+	}
+	return res;
+}
+
+/* Find the shortest free extent in the list. */
+static struct xfs_repair_alloc_extent *
+xfs_repair_allocbt_get_shortest(
+	struct list_head		*free_extents)
+{
+	struct xfs_repair_alloc_extent	*rae;
+	struct xfs_repair_alloc_extent	*res = NULL;
+
+	list_for_each_entry(rae, free_extents, list) {
+		if (!res || rae->len < res->len)
+			res = rae;
+		if (res->len == 1)
+			break;
+	}
+	return res;
+}
+
+/*
+ * Allocate a block from the (cached) shortest extent in the AG.  In theory
+ * this should never fail, since we already checked that there was enough
+ * space to handle the new btrees.
+ */
+STATIC xfs_fsblock_t
+xfs_repair_allocbt_alloc_block(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*free_extents,
+	struct xfs_repair_alloc_extent	**cached_result)
+{
+	struct xfs_repair_alloc_extent	*ext = *cached_result;
+	xfs_fsblock_t			fsb;
+
+	/* No cached result, see if we can find another. */
+	if (!ext) {
+		ext = xfs_repair_allocbt_get_shortest(free_extents);
+		ASSERT(ext);
+		if (!ext)
+			return NULLFSBLOCK;
+	}
+
+	/* Subtract one block. */
+	fsb = XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, ext->bno);
+	ext->bno++;
+	ext->len--;
+	if (ext->len == 0) {
+		list_del(&ext->list);
+		kmem_free(ext);
+		ext = NULL;
+	}
+
+	*cached_result = ext;
+	return fsb;
+}
+
+/* Free every record in the extent list. */
+STATIC void
+xfs_repair_allocbt_cancel_freelist(
+	struct list_head		*extlist)
+{
+	struct xfs_repair_alloc_extent	*rae;
+	struct xfs_repair_alloc_extent	*n;
+
+	list_for_each_entry_safe(rae, n, extlist, list) {
+		list_del(&rae->list);
+		kmem_free(rae);
+	}
+}
+
+/*
+ * Iterate all reverse mappings to find (1) the free extents, (2) the OWN_AG
+ * extents, (3) the rmapbt blocks, and (4) the AGFL blocks.  The free space is
+ * (1) + (2) - (3) - (4).  Figure out if we have enough free space to
+ * reconstruct the free space btrees.  Caller must clean up the input lists
+ * if something goes wrong.
+ */
+STATIC int
+xfs_repair_allocbt_find_freespace(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*free_extents,
+	struct xfs_repair_extent_list	*old_allocbt_blocks)
+{
+	struct xfs_repair_alloc		ra;
+	struct xfs_repair_alloc_extent	*rae;
+	struct xfs_btree_cur		*cur;
+	struct xfs_mount		*mp = sc->mp;
+	xfs_agblock_t			agend;
+	xfs_agblock_t			nr_blocks;
+	int				error;
+
+	ra.extlist = free_extents;
+	ra.btlist = old_allocbt_blocks;
+	xfs_repair_init_extent_list(&ra.nobtlist);
+	ra.next_bno = 0;
+	ra.nr_records = 0;
+	ra.nr_blocks = 0;
+	ra.sc = sc;
+
+	/*
+	 * Iterate all the reverse mappings to find gaps in the physical
+	 * mappings, all the OWN_AG blocks, and all the rmapbt extents.
+	 */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xfs_repair_alloc_extent_fn, &ra);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	cur = NULL;
+
+	/* Insert a record for space between the last rmap and EOAG. */
+	agend = be32_to_cpu(XFS_BUF_TO_AGF(sc->sa.agf_bp)->agf_length);
+	if (ra.next_bno < agend) {
+		rae = kmem_alloc(sizeof(struct xfs_repair_alloc_extent),
+				KM_MAYFAIL);
+		if (!rae) {
+			error = -ENOMEM;
+			goto err;
+		}
+		INIT_LIST_HEAD(&rae->list);
+		rae->bno = ra.next_bno;
+		rae->len = agend - ra.next_bno;
+		list_add_tail(&rae->list, free_extents);
+		ra.nr_records++;
+	}
+
+	/* Collect all the AGFL blocks. */
+	error = xfs_agfl_walk(mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
+			sc->sa.agfl_bp, xfs_repair_collect_agfl_block, &ra);
+	if (error)
+		goto err;
+
+	/* Do we actually have enough space to do this? */
+	nr_blocks = 2 * xfs_allocbt_calc_size(mp, ra.nr_records);
+	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE) ||
+	    ra.nr_blocks < nr_blocks) {
+		error = -ENOSPC;
+		goto err;
+	}
+
+	/* Compute the old bnobt/cntbt blocks. */
+	error = xfs_repair_subtract_extents(sc, old_allocbt_blocks,
+			&ra.nobtlist);
+	if (error)
+		goto err;
+	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
+	return 0;
+
+err:
+	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
+	if (cur)
+		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/*
+ * Reset the global free block counter and the per-AG counters to make it look
+ * like this AG has no free space.
+ */
+STATIC int
+xfs_repair_allocbt_reset_counters(
+	struct xfs_scrub_context	*sc,
+	int				*log_flags)
+{
+	struct xfs_perag		*pag = sc->sa.pag;
+	struct xfs_agf			*agf;
+	xfs_extlen_t			oldf;
+	xfs_agblock_t			rmap_blocks;
+	int				error;
+
+	/*
+	 * Since we're abandoning the old bnobt/cntbt, we have to
+	 * decrease fdblocks by the # of blocks in those trees.
+	 * btreeblks counts the non-root blocks of the free space
+	 * and rmap btrees.  Do this before resetting the AGF counters.
+	 */
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	rmap_blocks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
+	oldf = pag->pagf_btreeblks + 2;
+	oldf -= rmap_blocks;
+	error = xfs_mod_fdblocks(sc->mp, -(int64_t)oldf, false);
+	if (error)
+		return error;
+
+	/* Reset the per-AG info, both incore and ondisk. */
+	pag->pagf_btreeblks = rmap_blocks;
+	pag->pagf_freeblks = 0;
+	pag->pagf_longest = 0;
+
+	agf->agf_btreeblks = cpu_to_be32(pag->pagf_btreeblks);
+	agf->agf_freeblks = 0;
+	agf->agf_longest = 0;
+	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
+
+	return 0;
+}
+
+/* Initialize new bnobt/cntbt roots and implant them into the AGF. */
+STATIC int
+xfs_repair_allocbt_reset_btrees(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*free_extents,
+	int				*log_flags)
+{
+	struct xfs_owner_info		oinfo;
+	struct xfs_repair_alloc_extent	*cached = NULL;
+	struct xfs_buf			*bp;
+	struct xfs_perag		*pag = sc->sa.pag;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_agf			*agf;
+	xfs_fsblock_t			bnofsb;
+	xfs_fsblock_t			cntfsb;
+	int				error;
+
+	/* Allocate new bnobt root. */
+	bnofsb = xfs_repair_allocbt_alloc_block(sc, free_extents, &cached);
+	if (bnofsb == NULLFSBLOCK)
+		return -ENOSPC;
+
+	/* Allocate new cntbt root. */
+	cntfsb = xfs_repair_allocbt_alloc_block(sc, free_extents, &cached);
+	if (cntfsb == NULLFSBLOCK)
+		return -ENOSPC;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	/* Initialize new bnobt root. */
+	error = xfs_repair_init_btblock(sc, bnofsb, &bp, XFS_BTNUM_BNO,
+			&xfs_allocbt_buf_ops);
+	if (error)
+		return error;
+	agf->agf_roots[XFS_BTNUM_BNOi] =
+			cpu_to_be32(XFS_FSB_TO_AGBNO(mp, bnofsb));
+	agf->agf_levels[XFS_BTNUM_BNOi] = cpu_to_be32(1);
+
+	/* Initialize new cntbt root. */
+	error = xfs_repair_init_btblock(sc, cntfsb, &bp, XFS_BTNUM_CNT,
+			&xfs_allocbt_buf_ops);
+	if (error)
+		return error;
+	agf->agf_roots[XFS_BTNUM_CNTi] =
+			cpu_to_be32(XFS_FSB_TO_AGBNO(mp, cntfsb));
+	agf->agf_levels[XFS_BTNUM_CNTi] = cpu_to_be32(1);
+
+	/* Add rmap records for the btree roots */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
+			XFS_FSB_TO_AGBNO(mp, bnofsb), 1, &oinfo);
+	if (error)
+		return error;
+	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
+			XFS_FSB_TO_AGBNO(mp, cntfsb), 1, &oinfo);
+	if (error)
+		return error;
+
+	/* Reset the incore state. */
+	pag->pagf_levels[XFS_BTNUM_BNOi] = 1;
+	pag->pagf_levels[XFS_BTNUM_CNTi] = 1;
+
+	*log_flags |=  XFS_AGF_ROOTS | XFS_AGF_LEVELS;
+	return 0;
+}
+
+/* Build new free space btrees and dispose of the old one. */
+STATIC int
+xfs_repair_allocbt_rebuild_trees(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*free_extents,
+	struct xfs_repair_extent_list	*old_allocbt_blocks)
+{
+	struct xfs_owner_info		oinfo;
+	struct xfs_repair_alloc_extent	*rae;
+	struct xfs_repair_alloc_extent	*n;
+	struct xfs_repair_alloc_extent	*longest;
+	int				error;
+
+	xfs_rmap_skip_owner_update(&oinfo);
+
+	/*
+	 * Insert the longest free extent in case it's necessary to
+	 * refresh the AGFL with multiple blocks.  If there is no longest
+	 * extent, we had exactly the free space we needed; we're done.
+	 */
+	longest = xfs_repair_allocbt_get_longest(free_extents);
+	if (!longest)
+		goto done;
+	error = xfs_repair_allocbt_free_extent(sc,
+			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
+			longest->len, &oinfo);
+	list_del(&longest->list);
+	kmem_free(longest);
+	if (error)
+		return error;
+
+	/* Insert records into the new btrees. */
+	list_sort(NULL, free_extents, xfs_repair_allocbt_extent_cmp);
+	list_for_each_entry_safe(rae, n, free_extents, list) {
+		error = xfs_repair_allocbt_free_extent(sc,
+				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
+				rae->len, &oinfo);
+		if (error)
+			return error;
+		list_del(&rae->list);
+		kmem_free(rae);
+	}
+
+done:
+	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	return xfs_repair_reap_btree_extents(sc, old_allocbt_blocks, &oinfo,
+			XFS_AG_RESV_NONE);
+}
+
+/* Repair the freespace btrees for some AG. */
+int
+xfs_repair_allocbt(
+	struct xfs_scrub_context	*sc)
+{
+	struct list_head		free_extents;
+	struct xfs_repair_extent_list	old_allocbt_blocks;
+	struct xfs_mount		*mp = sc->mp;
+	int				log_flags = 0;
+	int				error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+
+	/*
+	 * Make sure the busy extent list is clear because we can't put
+	 * extents on there twice.
+	 */
+	if (!xfs_extent_busy_list_empty(sc->sa.pag))
+		return -EDEADLOCK;
+
+	/* Collect the free space data and find the old btree blocks. */
+	INIT_LIST_HEAD(&free_extents);
+	xfs_repair_init_extent_list(&old_allocbt_blocks);
+	error = xfs_repair_allocbt_find_freespace(sc, &free_extents,
+			&old_allocbt_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old free space btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xfs_repair_allocbt_reset_counters(sc, &log_flags);
+	if (error)
+		goto out;
+	error = xfs_repair_allocbt_reset_btrees(sc, &free_extents, &log_flags);
+	if (error)
+		goto out;
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+
+	/* Invalidate the old freespace btree blocks and commit. */
+	error = xfs_repair_invalidate_blocks(sc, &old_allocbt_blocks);
+	if (error)
+		goto out;
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the freespace information. */
+	error = xfs_repair_allocbt_rebuild_trees(sc, &free_extents,
+			&old_allocbt_blocks);
+out:
+	xfs_repair_allocbt_cancel_freelist(&free_extents);
+	xfs_repair_cancel_btree_extents(sc, &old_allocbt_blocks);
+	return error;
+}
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 70e70c69f83f..c1132a40a366 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -623,8 +623,14 @@ xfs_scrub_setup_ag_btree(
 	 * expensive operation should be performed infrequently and only
 	 * as a last resort.  Any caller that sets force_log should
 	 * document why they need to do so.
+	 *
+	 * Force everything in memory out to disk if we're repairing.
+	 * This ensures we won't get tripped up by btree blocks sitting
+	 * in memory waiting to have LSNs stamped in.  The AGF/AGI repair
+	 * routines use any available rmap data to try to find a btree
+	 * root that also passes the read verifiers.
 	 */
-	if (force_log) {
+	if (force_log || (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)) {
 		error = xfs_scrub_checkpoint_log(mp);
 		if (error)
 			return error;
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index d541c1586d0a..e5f67fc68e9a 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -103,6 +103,7 @@ int xfs_repair_superblock(struct xfs_scrub_context *sc);
 int xfs_repair_agf(struct xfs_scrub_context *sc);
 int xfs_repair_agfl(struct xfs_scrub_context *sc);
 int xfs_repair_agi(struct xfs_scrub_context *sc);
+int xfs_repair_allocbt(struct xfs_scrub_context *sc);
 
 #else
 
@@ -129,6 +130,7 @@ xfs_repair_calc_ag_resblks(
 #define xfs_repair_agf			xfs_repair_notsupported
 #define xfs_repair_agfl			xfs_repair_notsupported
 #define xfs_repair_agi			xfs_repair_notsupported
+#define xfs_repair_allocbt		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 0f036aab2551..7a55b20b7e4e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -226,13 +226,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xfs_scrub_setup_ag_allocbt,
 		.scrub	= xfs_scrub_bnobt,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_allocbt,
 	},
 	[XFS_SCRUB_TYPE_CNTBT] = {	/* cntbt */
 		.type	= ST_PERAG,
 		.setup	= xfs_scrub_setup_ag_allocbt,
 		.scrub	= xfs_scrub_cntbt,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_allocbt,
 	},
 	[XFS_SCRUB_TYPE_INOBT] = {	/* inobt */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 0ed68379e551..82f99633a597 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
 		diff = b1->bno - b2->bno;
 	return diff;
 }
+
+/* Are there any busy extents in this AG? */
+bool
+xfs_extent_busy_list_empty(
+	struct xfs_perag	*pag)
+{
+	spin_lock(&pag->pagb_lock);
+	if (pag->pagb_tree.rb_node) {
+		spin_unlock(&pag->pagb_lock);
+		return false;
+	}
+	spin_unlock(&pag->pagb_lock);
+	return true;
+}
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 990ab3891971..df1ea61df16e 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -65,4 +65,8 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
 	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
 }
 
+bool
+xfs_extent_busy_list_empty(
+	struct xfs_perag	*pag);
+
 #endif /* __XFS_EXTENT_BUSY_H__ */


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 07/21] xfs: repair inode btrees
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 06/21] xfs: repair free space btrees Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-06-28  0:55   ` Dave Chinner
  2018-06-30 17:36   ` Allison Henderson
  2018-06-24 19:24 ` [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running Darrick J. Wong
                   ` (13 subsequent siblings)
  20 siblings, 2 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/ialloc_repair.c |  585 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h        |    2 
 fs/xfs/scrub/scrub.c         |    4 
 4 files changed, 590 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/ialloc_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 841e0824eeb6..837fd4a95f6f 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -165,6 +165,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
+				   ialloc_repair.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
new file mode 100644
index 000000000000..29c736466bba
--- /dev/null
+++ b/fs/xfs/scrub/ialloc_repair.c
@@ -0,0 +1,585 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_log.h"
+#include "xfs_trans_priv.h"
+#include "xfs_error.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Inode Btree Repair
+ * ==================
+ *
+ * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
+ * records.  The OWN_INOBT records are the old inode btree blocks and will be
+ * cleared out after we've rebuilt the tree.  Each possible inode chunk within
+ * an OWN_INODES record will be read in and the freemask calculated from the
+ * i_mode data in the inode chunk.  For sparse inodes the holemask will be
+ * calculated by creating the properly aligned inobt record and punching out
+ * any chunk that's missing.  Inode allocations and frees grab the AGI first,
+ * so repair protects itself from concurrent access by locking the AGI.
+ *
+ * Once we've reconstructed all the inode records, we can create new inode
+ * btree roots and reload the btrees.  We rebuild both inode trees at the same
+ * time because they have the same rmap owner and it would be more complex to
+ * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
+ * blocks it owns.  We have all the data we need to build both, so dump
+ * everything and start over.
+ */
+
+struct xfs_repair_ialloc_extent {
+	struct list_head		list;
+	xfs_inofree_t			freemask;
+	xfs_agino_t			startino;
+	unsigned int			count;
+	unsigned int			usedcount;
+	uint16_t			holemask;
+};
+
+struct xfs_repair_ialloc {
+	struct list_head		*extlist;
+	struct xfs_repair_extent_list	*btlist;
+	struct xfs_scrub_context	*sc;
+	uint64_t			nr_records;
+};
+
+/*
+ * Is this inode in use?  If the inode is in memory we can tell from i_mode,
+ * otherwise we have to check di_mode in the on-disk buffer.  We only care
+ * that the high (i.e. non-permission) bits of _mode are zero.  This should be
+ * safe because repair keeps all AG headers locked until the end, and process
+ * trying to perform an inode allocation/free must lock the AGI.
+ */
+STATIC int
+xfs_repair_ialloc_check_free(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*bp,
+	xfs_ino_t			fsino,
+	xfs_agino_t			bpino,
+	bool				*inuse)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_dinode		*dip;
+	int				error;
+
+	/* Will the in-core inode tell us if it's in use? */
+	error = xfs_icache_inode_is_allocated(mp, sc->tp, fsino, inuse);
+	if (!error)
+		return 0;
+
+	/* Inode uncached or half assembled, read disk buffer */
+	dip = xfs_buf_offset(bp, bpino * mp->m_sb.sb_inodesize);
+	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC)
+		return -EFSCORRUPTED;
+
+	if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino)
+		return -EFSCORRUPTED;
+
+	*inuse = dip->di_mode != 0;
+	return 0;
+}
+
+/*
+ * For each cluster in this blob of inode, we must calculate the
+ * properly aligned startino of that cluster, then iterate each
+ * cluster to fill in used and filled masks appropriately.  We
+ * then use the (startino, used, filled) information to construct
+ * the appropriate inode records.
+ */
+STATIC int
+xfs_repair_ialloc_process_cluster(
+	struct xfs_repair_ialloc	*ri,
+	xfs_agblock_t			agbno,
+	int				blks_per_cluster,
+	xfs_agino_t			rec_agino)
+{
+	struct xfs_imap			imap;
+	struct xfs_repair_ialloc_extent	*rie;
+	struct xfs_dinode		*dip;
+	struct xfs_buf			*bp;
+	struct xfs_scrub_context	*sc = ri->sc;
+	struct xfs_mount		*mp = sc->mp;
+	xfs_ino_t			fsino;
+	xfs_inofree_t			usedmask;
+	xfs_agino_t			nr_inodes;
+	xfs_agino_t			startino;
+	xfs_agino_t			clusterino;
+	xfs_agino_t			clusteroff;
+	xfs_agino_t			agino;
+	uint16_t			fillmask;
+	bool				inuse;
+	int				usedcount;
+	int				error;
+
+	/* The per-AG inum of this inode cluster. */
+	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
+
+	/* The per-AG inum of the inobt record. */
+	startino = rec_agino + rounddown(agino - rec_agino,
+			XFS_INODES_PER_CHUNK);
+
+	/* The per-AG inum of the cluster within the inobt record. */
+	clusteroff = agino - startino;
+
+	/* Every inode in this holemask slot is filled. */
+	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
+	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
+			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
+
+	/* Grab the inode cluster buffer. */
+	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
+	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
+	imap.im_boffset = 0;
+
+	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
+			XFS_IGET_UNTRUSTED);
+	if (error)
+		return error;
+
+	usedmask = 0;
+	usedcount = 0;
+	/* Which inodes within this cluster are free? */
+	for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
+		fsino = XFS_AGINO_TO_INO(mp, sc->sa.agno, agino + clusterino);
+		error = xfs_repair_ialloc_check_free(sc, bp, fsino,
+				clusterino, &inuse);
+		if (error) {
+			xfs_trans_brelse(sc->tp, bp);
+			return error;
+		}
+		if (inuse) {
+			usedcount++;
+			usedmask |= XFS_INOBT_MASK(clusteroff + clusterino);
+		}
+	}
+	xfs_trans_brelse(sc->tp, bp);
+
+	/*
+	 * If the last item in the list is our chunk record,
+	 * update that.
+	 */
+	if (!list_empty(ri->extlist)) {
+		rie = list_last_entry(ri->extlist,
+				struct xfs_repair_ialloc_extent, list);
+		if (rie->startino + XFS_INODES_PER_CHUNK > startino) {
+			rie->freemask &= ~usedmask;
+			rie->holemask &= ~fillmask;
+			rie->count += nr_inodes;
+			rie->usedcount += usedcount;
+			return 0;
+		}
+	}
+
+	/* New inode chunk; add to the list. */
+	rie = kmem_alloc(sizeof(struct xfs_repair_ialloc_extent), KM_MAYFAIL);
+	if (!rie)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&rie->list);
+	rie->startino = startino;
+	rie->freemask = XFS_INOBT_ALL_FREE & ~usedmask;
+	rie->holemask = XFS_INOBT_ALL_FREE & ~fillmask;
+	rie->count = nr_inodes;
+	rie->usedcount = usedcount;
+	list_add_tail(&rie->list, ri->extlist);
+	ri->nr_records++;
+
+	return 0;
+}
+
+/* Record extents that belong to inode btrees. */
+STATIC int
+xfs_repair_ialloc_extent_fn(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_ialloc	*ri = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	xfs_fsblock_t			fsbno;
+	xfs_agblock_t			agbno = rec->rm_startblock;
+	xfs_agino_t			inoalign;
+	xfs_agino_t			agino;
+	xfs_agino_t			rec_agino;
+	int				blks_per_cluster;
+	int				error = 0;
+
+	if (xfs_scrub_should_terminate(ri->sc, &error))
+		return error;
+
+	/* Fragment of the old btrees; dispose of them later. */
+	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
+		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
+		return xfs_repair_collect_btree_extent(ri->sc, ri->btlist,
+				fsbno, rec->rm_blockcount);
+	}
+
+	/* Skip extents which are not owned by this inode and fork. */
+	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
+		return 0;
+
+	blks_per_cluster = xfs_icluster_size_fsb(mp);
+
+	if (agbno % blks_per_cluster != 0)
+		return -EFSCORRUPTED;
+
+	trace_xfs_repair_ialloc_extent_fn(mp, ri->sc->sa.agno,
+			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
+			rec->rm_offset, rec->rm_flags);
+
+	/*
+	 * Determine the inode block alignment, and where the block
+	 * ought to start if it's aligned properly.  On a sparse inode
+	 * system the rmap doesn't have to start on an alignment boundary,
+	 * but the record does.  On pre-sparse filesystems, we /must/
+	 * start both rmap and inobt on an alignment boundary.
+	 */
+	inoalign = xfs_ialloc_cluster_alignment(mp);
+	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
+	rec_agino = XFS_OFFBNO_TO_AGINO(mp, rounddown(agbno, inoalign), 0);
+	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) && agino != rec_agino)
+		return -EFSCORRUPTED;
+
+	/* Set up the free/hole masks for each cluster in this inode chunk. */
+	for (;
+	     agbno < rec->rm_startblock + rec->rm_blockcount;
+	     agbno += blks_per_cluster) {
+		error = xfs_repair_ialloc_process_cluster(ri, agbno,
+				blks_per_cluster, rec_agino);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Compare two ialloc extents. */
+static int
+xfs_repair_ialloc_extent_cmp(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_repair_ialloc_extent	*ap;
+	struct xfs_repair_ialloc_extent	*bp;
+
+	ap = container_of(a, struct xfs_repair_ialloc_extent, list);
+	bp = container_of(b, struct xfs_repair_ialloc_extent, list);
+
+	if (ap->startino > bp->startino)
+		return 1;
+	else if (ap->startino < bp->startino)
+		return -1;
+	return 0;
+}
+
+/* Insert an inode chunk record into a given btree. */
+static int
+xfs_repair_iallocbt_insert_btrec(
+	struct xfs_btree_cur		*cur,
+	struct xfs_repair_ialloc_extent	*rie)
+{
+	int				stat;
+	int				error;
+
+	error = xfs_inobt_lookup(cur, rie->startino, XFS_LOOKUP_EQ, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 0);
+	error = xfs_inobt_insert_rec(cur, rie->holemask, rie->count,
+			rie->count - rie->usedcount, rie->freemask, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
+	return error;
+}
+
+/* Insert an inode chunk record into both inode btrees. */
+static int
+xfs_repair_iallocbt_insert_rec(
+	struct xfs_scrub_context	*sc,
+	struct xfs_repair_ialloc_extent	*rie)
+{
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	trace_xfs_repair_ialloc_insert(sc->mp, sc->sa.agno, rie->startino,
+			rie->holemask, rie->count, rie->count - rie->usedcount,
+			rie->freemask);
+
+	/* Insert into the inobt. */
+	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
+			XFS_BTNUM_INO);
+	error = xfs_repair_iallocbt_insert_btrec(cur, rie);
+	if (error)
+		goto out_cur;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Insert into the finobt if chunk has free inodes. */
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb) &&
+	    rie->count != rie->usedcount) {
+		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
+				sc->sa.agno, XFS_BTNUM_FINO);
+		error = xfs_repair_iallocbt_insert_btrec(cur, rie);
+		if (error)
+			goto out_cur;
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	}
+
+	return xfs_repair_roll_ag_trans(sc);
+out_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Free every record in the inode list. */
+STATIC void
+xfs_repair_iallocbt_cancel_inorecs(
+	struct list_head		*reclist)
+{
+	struct xfs_repair_ialloc_extent	*rie;
+	struct xfs_repair_ialloc_extent	*n;
+
+	list_for_each_entry_safe(rie, n, reclist, list) {
+		list_del(&rie->list);
+		kmem_free(rie);
+	}
+}
+
+/*
+ * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode
+ * btrees (OWN_INOBT).  Figure out if we have enough free space to reconstruct
+ * the inode btrees.  The caller must clean up the lists if anything goes
+ * wrong.
+ */
+STATIC int
+xfs_repair_iallocbt_find_inodes(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*inode_records,
+	struct xfs_repair_extent_list	*old_iallocbt_blocks)
+{
+	struct xfs_repair_ialloc	ri;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_btree_cur		*cur;
+	xfs_agblock_t			nr_blocks;
+	int				error;
+
+	/* Collect all reverse mappings for inode blocks. */
+	ri.extlist = inode_records;
+	ri.btlist = old_iallocbt_blocks;
+	ri.nr_records = 0;
+	ri.sc = sc;
+
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xfs_repair_ialloc_extent_fn, &ri);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Do we actually have enough space to do this? */
+	nr_blocks = xfs_iallocbt_calc_size(mp, ri.nr_records);
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		nr_blocks *= 2;
+	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE))
+		return -ENOSPC;
+
+	return 0;
+
+err:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Update the AGI counters. */
+STATIC int
+xfs_repair_iallocbt_reset_counters(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*inode_records,
+	int				*log_flags)
+{
+	struct xfs_agi			*agi;
+	struct xfs_repair_ialloc_extent	*rie;
+	unsigned int			count = 0;
+	unsigned int			usedcount = 0;
+	unsigned int			freecount;
+
+	/* Figure out the new counters. */
+	list_for_each_entry(rie, inode_records, list) {
+		count += rie->count;
+		usedcount += rie->usedcount;
+	}
+
+	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+	freecount = count - usedcount;
+
+	/* XXX: trigger inode count recalculation */
+
+	/* Reset the per-AG info, both incore and ondisk. */
+	sc->sa.pag->pagi_count = count;
+	sc->sa.pag->pagi_freecount = freecount;
+	agi->agi_count = cpu_to_be32(count);
+	agi->agi_freecount = cpu_to_be32(freecount);
+	*log_flags |= XFS_AGI_COUNT | XFS_AGI_FREECOUNT;
+
+	return 0;
+}
+
+/* Initialize new inobt/finobt roots and implant them into the AGI. */
+STATIC int
+xfs_repair_iallocbt_reset_btrees(
+	struct xfs_scrub_context	*sc,
+	struct xfs_owner_info		*oinfo,
+	int				*log_flags)
+{
+	struct xfs_agi			*agi;
+	struct xfs_buf			*bp;
+	struct xfs_mount		*mp = sc->mp;
+	xfs_fsblock_t			inofsb;
+	xfs_fsblock_t			finofsb;
+	enum xfs_ag_resv_type		resv;
+	int				error;
+
+	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+
+	/* Initialize new inobt root. */
+	resv = XFS_AG_RESV_NONE;
+	error = xfs_repair_alloc_ag_block(sc, oinfo, &inofsb, resv);
+	if (error)
+		return error;
+	error = xfs_repair_init_btblock(sc, inofsb, &bp, XFS_BTNUM_INO,
+			&xfs_inobt_buf_ops);
+	if (error)
+		return error;
+	agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, inofsb));
+	agi->agi_level = cpu_to_be32(1);
+	*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
+
+	/* Initialize new finobt root. */
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return 0;
+
+	resv = mp->m_inotbt_nores ? XFS_AG_RESV_NONE : XFS_AG_RESV_METADATA;
+	error = xfs_repair_alloc_ag_block(sc, oinfo, &finofsb, resv);
+	if (error)
+		return error;
+	error = xfs_repair_init_btblock(sc, finofsb, &bp, XFS_BTNUM_FINO,
+			&xfs_inobt_buf_ops);
+	if (error)
+		return error;
+	agi->agi_free_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, finofsb));
+	agi->agi_free_level = cpu_to_be32(1);
+	*log_flags |= XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
+
+	return 0;
+}
+
+/* Build new inode btrees and dispose of the old one. */
+STATIC int
+xfs_repair_iallocbt_rebuild_trees(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*inode_records,
+	struct xfs_owner_info		*oinfo,
+	struct xfs_repair_extent_list	*old_iallocbt_blocks)
+{
+	struct xfs_repair_ialloc_extent	*rie;
+	struct xfs_repair_ialloc_extent	*n;
+	int				error;
+
+	/* Add all records. */
+	list_sort(NULL, inode_records, xfs_repair_ialloc_extent_cmp);
+	list_for_each_entry_safe(rie, n, inode_records, list) {
+		error = xfs_repair_iallocbt_insert_rec(sc, rie);
+		if (error)
+			return error;
+
+		list_del(&rie->list);
+		kmem_free(rie);
+	}
+
+	/* Free the old inode btree blocks if they're not in use. */
+	return xfs_repair_reap_btree_extents(sc, old_iallocbt_blocks, oinfo,
+			XFS_AG_RESV_NONE);
+}
+
+/* Repair both inode btrees. */
+int
+xfs_repair_iallocbt(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_owner_info		oinfo;
+	struct list_head		inode_records;
+	struct xfs_repair_extent_list	old_iallocbt_blocks;
+	struct xfs_mount		*mp = sc->mp;
+	int				log_flags = 0;
+	int				error = 0;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+
+	/* Collect the free space data and find the old btree blocks. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
+	INIT_LIST_HEAD(&inode_records);
+	xfs_repair_init_extent_list(&old_iallocbt_blocks);
+	error = xfs_repair_iallocbt_find_inodes(sc, &inode_records,
+			&old_iallocbt_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old inode btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xfs_repair_iallocbt_reset_counters(sc, &inode_records,
+			&log_flags);
+	if (error)
+		goto out;
+	error = xfs_repair_iallocbt_reset_btrees(sc, &oinfo, &log_flags);
+	if (error)
+		goto out;
+	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
+
+	/* Invalidate all the inobt/finobt blocks in btlist. */
+	error = xfs_repair_invalidate_blocks(sc, &old_iallocbt_blocks);
+	if (error)
+		goto out;
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the inode information. */
+	error = xfs_repair_iallocbt_rebuild_trees(sc, &inode_records, &oinfo,
+			&old_iallocbt_blocks);
+out:
+	xfs_repair_cancel_btree_extents(sc, &old_iallocbt_blocks);
+	xfs_repair_iallocbt_cancel_inorecs(&inode_records);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index e5f67fc68e9a..dcfa5eb18940 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -104,6 +104,7 @@ int xfs_repair_agf(struct xfs_scrub_context *sc);
 int xfs_repair_agfl(struct xfs_scrub_context *sc);
 int xfs_repair_agi(struct xfs_scrub_context *sc);
 int xfs_repair_allocbt(struct xfs_scrub_context *sc);
+int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
 
 #else
 
@@ -131,6 +132,7 @@ xfs_repair_calc_ag_resblks(
 #define xfs_repair_agfl			xfs_repair_notsupported
 #define xfs_repair_agi			xfs_repair_notsupported
 #define xfs_repair_allocbt		xfs_repair_notsupported
+#define xfs_repair_iallocbt		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 7a55b20b7e4e..fec0e130f19e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -238,14 +238,14 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xfs_scrub_setup_ag_iallocbt,
 		.scrub	= xfs_scrub_inobt,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
 		.type	= ST_PERAG,
 		.setup	= xfs_scrub_setup_ag_iallocbt,
 		.scrub	= xfs_scrub_finobt,
 		.has	= xfs_sb_version_hasfinobt,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
 		.type	= ST_PERAG,


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 07/21] xfs: repair inode btrees Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-06-28 23:37   ` Dave Chinner
  2018-06-24 19:24 ` [PATCH 09/21] xfs: finish our set of inode get/put tracepoints for scrub Darrick J. Wong
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Destroying an incore inode sometimes requires some work to be done on
the inode.  For example, post-EOF blocks on a non-PREALLOC inode are
trimmed, and copy-on-write staging extents are freed.  This work is done
in separate transactions, which is bad for scrub and repair because (a)
we already have a transaction and can't nest them, and (b) if we've
frozen the filesystem for scrub/repair work, that (regular) transaction
allocation will block on the freeze.

Therefore, if we detect that work has to be done to destroy the incore
inode, we'll just hang on to the reference until after the scrub is
finished.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/common.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h |    1 +
 fs/xfs/scrub/dir.c    |    2 +-
 fs/xfs/scrub/parent.c |    6 +++---
 fs/xfs/scrub/scrub.c  |   20 +++++++++++++++++++
 fs/xfs/scrub/scrub.h  |    9 ++++++++
 fs/xfs/scrub/trace.h  |   30 ++++++++++++++++++++++++++++
 7 files changed, 116 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index c1132a40a366..9740c28384b6 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -22,6 +22,7 @@
 #include "xfs_alloc_btree.h"
 #include "xfs_bmap.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_bmap_util.h"
 #include "xfs_ialloc.h"
 #include "xfs_ialloc_btree.h"
 #include "xfs_refcount.h"
@@ -890,3 +891,54 @@ xfs_scrub_ilock_inverted(
 	}
 	return -EDEADLOCK;
 }
+
+/*
+ * Release a reference to an inode while the fs is running a scrub or repair.
+ * If we anticipate that destroying the incore inode will require work to be
+ * done, we'll defer the iput until after the scrub/repair releases the
+ * transaction.
+ */
+void
+xfs_scrub_iput(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	/*
+	 * If this file doesn't have any blocks to be freed at release time,
+	 * go straight to iput.
+	 */
+	if (!xfs_can_free_eofblocks(ip, true))
+		goto iput;
+
+	/*
+	 * Any real/unwritten extents in the CoW fork will have to be freed
+	 * so iput if there aren't any.
+	 */
+	if (!xfs_inode_has_cow_blocks(ip))
+		goto iput;
+
+	/*
+	 * Any blocks after the end of the file will have to be freed so iput
+	 * if there aren't any.
+	 */
+	if (!xfs_inode_has_posteof_blocks(ip))
+		goto iput;
+
+	/*
+	 * There are no other users of i_private in XFS so if it's non-NULL
+	 * this inode is already on the deferred iput list and we can release
+	 * this reference.
+	 */
+	if (VFS_I(ip)->i_private)
+		goto iput;
+
+	/* Otherwise, add it to the deferred iput list. */
+	trace_xfs_scrub_iput_defer(ip, __return_address);
+	VFS_I(ip)->i_private = sc->deferred_iput_list;
+	sc->deferred_iput_list = VFS_I(ip);
+	return;
+
+iput:
+	trace_xfs_scrub_iput_now(ip, __return_address);
+	iput(VFS_I(ip));
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 2172bd5361e2..ca9e15af2a4f 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -140,5 +140,6 @@ static inline bool xfs_scrub_skip_xref(struct xfs_scrub_metadata *sm)
 
 int xfs_scrub_metadata_inode_forks(struct xfs_scrub_context *sc);
 int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
+void xfs_scrub_iput(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 86324775fc9b..5cb371576732 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -87,7 +87,7 @@ xfs_scrub_dir_check_ftype(
 			xfs_mode_to_ftype(VFS_I(ip)->i_mode));
 	if (ino_dtype != dtype)
 		xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-	iput(VFS_I(ip));
+	xfs_scrub_iput(sdc->sc, ip);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index e2bda58c32f0..fd0b2bfb8f18 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -230,11 +230,11 @@ xfs_scrub_parent_validate(
 
 	/* Drat, parent changed.  Try again! */
 	if (dnum != dp->i_ino) {
-		iput(VFS_I(dp));
+		xfs_scrub_iput(sc, dp);
 		*try_again = true;
 		return 0;
 	}
-	iput(VFS_I(dp));
+	xfs_scrub_iput(sc, dp);
 
 	/*
 	 * '..' didn't change, so check that there was only one entry
@@ -247,7 +247,7 @@ xfs_scrub_parent_validate(
 out_unlock:
 	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 out_rele:
-	iput(VFS_I(dp));
+	xfs_scrub_iput(sc, dp);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index fec0e130f19e..b66cfbc56a34 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -157,6 +157,24 @@ xfs_scrub_probe(
 
 /* Scrub setup and teardown */
 
+/* Release all references to inodes we encountered needing deferred iput. */
+STATIC void
+xfs_scrub_iput_deferred(
+	struct xfs_scrub_context	*sc)
+{
+	struct inode			*inode, *next;
+
+	inode = sc->deferred_iput_list;
+	while (inode != (struct inode *)sc) {
+		next = inode->i_private;
+		inode->i_private = NULL;
+		trace_xfs_scrub_iput_deferred(XFS_I(inode), __return_address);
+		iput(inode);
+		inode = next;
+	}
+	sc->deferred_iput_list = sc;
+}
+
 /* Free all the resources and finish the transactions. */
 STATIC int
 xfs_scrub_teardown(
@@ -180,6 +198,7 @@ xfs_scrub_teardown(
 			iput(VFS_I(sc->ip));
 		sc->ip = NULL;
 	}
+	xfs_scrub_iput_deferred(sc);
 	if (sc->has_quotaofflock)
 		mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
 	if (sc->buf) {
@@ -506,6 +525,7 @@ xfs_scrub_metadata(
 	sc.ops = &meta_scrub_ops[sm->sm_type];
 	sc.try_harder = try_harder;
 	sc.sa.agno = NULLAGNUMBER;
+	sc.deferred_iput_list = &sc;
 	error = sc.ops->setup(&sc, ip);
 	if (error)
 		goto out_teardown;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index b295edd5fc0e..69eee2ffed29 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -65,6 +65,15 @@ struct xfs_scrub_context {
 	bool				try_harder;
 	bool				has_quotaofflock;
 
+	/*
+	 * List of inodes which cannot be released (by scrub) until after the
+	 * scrub operation concludes because we'd have to do some work to the
+	 * inode to destroy its incore representation (cow blocks, posteof
+	 * blocks, etc.).  Each inode's i_private points to the next inode, or
+	 * to the scrub context as a sentinel for the end of the list.
+	 */
+	void				*deferred_iput_list;
+
 	/* State tracking for single-AG operations. */
 	struct xfs_scrub_ag		sa;
 };
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index cec3e5ece5a1..a050a00fc258 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -480,6 +480,36 @@ TRACE_EVENT(xfs_scrub_xref_error,
 		  __entry->ret_ip)
 );
 
+DECLARE_EVENT_CLASS(xfs_scrub_iref_class,
+	TP_PROTO(struct xfs_inode *ip, xfs_failaddr_t caller_ip),
+	TP_ARGS(ip, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, count)
+		__field(xfs_failaddr_t, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->count = atomic_read(&VFS_I(ip)->i_count);
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d ino 0x%llx count %d caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->count,
+		  __entry->caller_ip)
+)
+
+#define DEFINE_SCRUB_IREF_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_iref_class, name, \
+	TP_PROTO(struct xfs_inode *ip, xfs_failaddr_t caller_ip), \
+	TP_ARGS(ip, caller_ip))
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_deferred);
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_defer);
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_now);
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 09/21] xfs: finish our set of inode get/put tracepoints for scrub
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-06-24 19:24 ` [PATCH 10/21] xfs: introduce online scrub freeze Darrick J. Wong
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Finish setting up tracepoints to track inode get/put operations when
running them through the scrub code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/common.c |    2 ++
 fs/xfs/scrub/dir.c    |    1 +
 fs/xfs/scrub/parent.c |    1 +
 fs/xfs/scrub/scrub.c  |    1 +
 fs/xfs/scrub/trace.h  |    3 +++
 5 files changed, 8 insertions(+)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 9740c28384b6..6dcd83944ab6 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -717,7 +717,9 @@ xfs_scrub_get_inode(
 				error, __return_address);
 		return error;
 	}
+	trace_xfs_scrub_iget_target(ip, __this_address);
 	if (VFS_I(ip)->i_generation != sc->sm->sm_gen) {
+		trace_xfs_scrub_iput_target(ip, __this_address);
 		iput(VFS_I(ip));
 		return -ENOENT;
 	}
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 5cb371576732..a11dde63535c 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -81,6 +81,7 @@ xfs_scrub_dir_check_ftype(
 	if (!xfs_scrub_fblock_xref_process_error(sdc->sc, XFS_DATA_FORK, offset,
 			&error))
 		goto out;
+	trace_xfs_scrub_iget(ip, __this_address);
 
 	/* Convert mode to the DT_* values that dir_emit uses. */
 	ino_dtype = xfs_dir3_get_dtype(mp,
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index fd0b2bfb8f18..c5ce6622d4f9 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -170,6 +170,7 @@ xfs_scrub_parent_validate(
 	}
 	if (!xfs_scrub_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
 		goto out;
+	trace_xfs_scrub_iget(dp, __this_address);
 	if (dp == sc->ip || !S_ISDIR(VFS_I(dp)->i_mode)) {
 		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
 		goto out_rele;
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index b66cfbc56a34..b24b37b34d85 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -191,6 +191,7 @@ xfs_scrub_teardown(
 		sc->tp = NULL;
 	}
 	if (sc->ip) {
+		trace_xfs_scrub_iput_target(sc->ip, __this_address);
 		if (sc->ilock_flags)
 			xfs_iunlock(sc->ip, sc->ilock_flags);
 		if (sc->ip != ip_in &&
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index a050a00fc258..2a561689cecb 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -509,6 +509,9 @@ DEFINE_EVENT(xfs_scrub_iref_class, name, \
 DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_deferred);
 DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_defer);
 DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_now);
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iget);
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iget_target);
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_target);
 
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 10/21] xfs: introduce online scrub freeze
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 09/21] xfs: finish our set of inode get/put tracepoints for scrub Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-06-24 19:24 ` [PATCH 11/21] xfs: repair the rmapbt Darrick J. Wong
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Introduce a new 'online scrub freeze' that we can use to lock out all
filesystem modifications and background activity so that we can perform
global scans in order to rebuild metadata.  This introduces a new IFLAG
to the scrub ioctl to indicate that userspace is willing to allow a
freeze.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |    6 +++
 fs/xfs/scrub/common.c  |   87 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/common.h  |    2 +
 fs/xfs/scrub/repair.c  |   21 ++++++++++++
 fs/xfs/scrub/repair.h  |    1 +
 fs/xfs/scrub/scrub.c   |    8 ++++
 fs/xfs/scrub/scrub.h   |    6 +++
 fs/xfs/xfs_mount.h     |    6 +++
 fs/xfs/xfs_super.c     |   53 +++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.c     |    5 ++-
 10 files changed, 192 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index f3aa59302fef..e93f9432d2a6 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -536,7 +536,11 @@ struct xfs_scrub_metadata {
  */
 #define XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED (1 << 7)
 
-#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_IFLAG_REPAIR)
+/* i: Allow scrub to freeze the filesystem to perform global scans. */
+#define XFS_SCRUB_IFLAG_FREEZE_OK	(1 << 8)
+
+#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_IFLAG_REPAIR | \
+				 XFS_SCRUB_IFLAG_FREEZE_OK)
 #define XFS_SCRUB_FLAGS_OUT	(XFS_SCRUB_OFLAG_CORRUPT | \
 				 XFS_SCRUB_OFLAG_PREEN | \
 				 XFS_SCRUB_OFLAG_XFAIL | \
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 6dcd83944ab6..257cb13d36e3 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -590,9 +590,13 @@ xfs_scrub_trans_alloc(
 	struct xfs_scrub_context	*sc,
 	uint				resblks)
 {
+	uint				flags = 0;
+
+	if (sc->fs_frozen)
+		flags |= XFS_TRANS_NO_WRITECOUNT;
 	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
 		return xfs_trans_alloc(sc->mp, &M_RES(sc->mp)->tr_itruncate,
-				resblks, 0, 0, &sc->tp);
+				resblks, 0, flags, &sc->tp);
 
 	return xfs_trans_alloc_empty(sc->mp, &sc->tp);
 }
@@ -944,3 +948,84 @@ xfs_scrub_iput(
 	trace_xfs_scrub_iput_now(ip, __return_address);
 	iput(VFS_I(ip));
 }
+
+/*
+ * Exclusive Filesystem Access During Scrub and Repair
+ * ===================================================
+ *
+ * While most scrub activity can occur while the filesystem is live, there
+ * are certain scenarios where we cannot tolerate concurrent metadata updates.
+ * We therefore must freeze the filesystem against all other changes.
+ *
+ * The typical scenarios envisioned for scrub freezes are (a) to lock out all
+ * other filesystem changes in order to check the global summary counters,
+ * and anything else that requires unusual behavioral semantics.
+ *
+ * The typical scenarios envisioned for repair freezes are (a) to avoid ABBA
+ * deadlocks when need to take locks in an unusual order; or (b) to update
+ * global filesystem state.  For example, reconstruction of a damaged reverse
+ * mapping btree requires us to hold the AG header locks while scanning
+ * inodes, which goes against the usual inode -> AG header locking order.
+ *
+ * A note about inode reclaim: when we freeze the filesystem, users can't
+ * modify things and periodic background reclaim of speculative preallocations
+ * and copy-on-write staging extents is stopped.  However, the scrub/repair
+ * thread must be careful about evicting an inode from memory -- if the
+ * eviction would require a transaction, we must defer the iput until after
+ * the scrub freeze.  The reasons for this are twofold: first, scrub/repair
+ * already have a transaction and xfs can't nest transactions; and second, we
+ * froze the fs to prevent modifications that we can't control directly.
+ *
+ * Userspace is prevented from freezing or thawing the filesystem during a
+ * repair freeze by the ->freeze_super and ->thaw_super superblock operations,
+ * which block any changes to the freeze state while a repair freeze is
+ * running through the use of the m_scrub_freeze mutex.  It only makes sense
+ * to run one scrub/repair freeze at a time, so the mutex is fine.
+ *
+ * Scrub/repair freezes cannot be initiated during a regular freeze because
+ * freeze_super does not allow nested freeze.  Repair activity that does not
+ * require a repair freeze is also prevented from running during a regular
+ * freeze because transaction allocation blocks on the regular freeze.  We
+ * assume that the only other users of XFS_TRANS_NO_WRITECOUNT transactions
+ * either aren't modifying space metadata in a way that would affect repair,
+ * or that we can inhibit any of the ones that do.
+ *
+ * Note that thaw_super and freeze_super can call deactivate_locked_super
+ * which can free the xfs_mount.  This can happen if someone freezes the block
+ * device, unmounts the filesystem, and thaws the block device.  Therefore, we
+ * must be careful about who gets to unlock the repair freeze mutex.  See the
+ * comments in xfs_fs_put_super.
+ */
+
+/* Start a scrub/repair freeze. */
+int
+xfs_scrub_fs_freeze(
+	struct xfs_scrub_context	*sc)
+{
+	int				error;
+
+	if (!(sc->sm->sm_flags & XFS_SCRUB_IFLAG_FREEZE_OK))
+		return -EUSERS;
+
+	mutex_lock(&sc->mp->m_scrub_freeze);
+	error = freeze_super(sc->mp->m_super);
+	if (error) {
+		mutex_unlock(&sc->mp->m_scrub_freeze);
+		return error;
+	}
+	sc->fs_frozen = true;
+	return 0;
+}
+
+/* Release a scrub/repair freeze and iput all the deferred inodes. */
+int
+xfs_scrub_fs_thaw(
+	struct xfs_scrub_context	*sc)
+{
+	int				error;
+
+	sc->fs_frozen = false;
+	error = thaw_super(sc->mp->m_super);
+	mutex_unlock(&sc->mp->m_scrub_freeze);
+	return error;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index ca9e15af2a4f..e8c4e41139ca 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -141,5 +141,7 @@ static inline bool xfs_scrub_skip_xref(struct xfs_scrub_metadata *sm)
 int xfs_scrub_metadata_inode_forks(struct xfs_scrub_context *sc);
 int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
 void xfs_scrub_iput(struct xfs_scrub_context *sc, struct xfs_inode *ip);
+int xfs_scrub_fs_freeze(struct xfs_scrub_context *sc);
+int xfs_scrub_fs_thaw(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index bcdaa8df18f6..85ec872093e6 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -1161,3 +1161,24 @@ xfs_repair_ino_dqattach(
 
 	return error;
 }
+
+/* Read all AG headers and attach to this transaction. */
+int
+xfs_repair_grab_all_ag_headers(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agi;
+	struct xfs_buf			*agf;
+	struct xfs_buf			*agfl;
+	xfs_agnumber_t			agno;
+	int				error = 0;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = xfs_scrub_ag_read_headers(sc, agno, &agi, &agf, &agfl);
+		if (error)
+			break;
+	}
+
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index dcfa5eb18940..1cdf457e41da 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -95,6 +95,7 @@ int xfs_repair_find_ag_btree_roots(struct xfs_scrub_context *sc,
 		struct xfs_buf *agfl_bp);
 void xfs_repair_force_quotacheck(struct xfs_scrub_context *sc, uint dqtype);
 int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
+int xfs_repair_grab_all_ag_headers(struct xfs_scrub_context *sc);
 
 /* Metadata repairers */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index b24b37b34d85..424f01130f14 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -182,6 +182,8 @@ xfs_scrub_teardown(
 	struct xfs_inode		*ip_in,
 	int				error)
 {
+	int				err2;
+
 	xfs_scrub_ag_free(sc, &sc->sa);
 	if (sc->tp) {
 		if (error == 0 && (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR))
@@ -199,6 +201,12 @@ xfs_scrub_teardown(
 			iput(VFS_I(sc->ip));
 		sc->ip = NULL;
 	}
+	if (sc->fs_frozen) {
+		err2 = xfs_scrub_fs_thaw(sc);
+		if (!error && err2)
+			error = err2;
+		sc->fs_frozen = false;
+	}
 	xfs_scrub_iput_deferred(sc);
 	if (sc->has_quotaofflock)
 		mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 69eee2ffed29..93a4a0b22273 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -65,6 +65,12 @@ struct xfs_scrub_context {
 	bool				try_harder;
 	bool				has_quotaofflock;
 
+	/*
+	 * Do we own the current scrub freeze?  It is critical that we
+	 * release it before exiting to userspace!
+	 */
+	bool				fs_frozen;
+
 	/*
 	 * List of inodes which cannot be released (by scrub) until after the
 	 * scrub operation concludes because we'd have to do some work to the
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 245349d1e23f..b2b947a0e44a 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -193,6 +193,12 @@ typedef struct xfs_mount {
 	unsigned int		*m_errortag;
 	struct xfs_kobj		m_errortag_kobj;
 #endif
+	/*
+	 * Only allow one thread to initiate a repair freeze at a time.  We
+	 * also use this to block userspace from changing the freeze state
+	 * while a repair freeze is in progress.
+	 */
+	struct mutex		m_scrub_freeze;
 } xfs_mount_t;
 
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9d791f158dfe..c446d800bb79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1445,6 +1445,42 @@ xfs_fs_unfreeze(
 	return 0;
 }
 
+/*
+ * Don't let userspace freeze while scrub has the filesystem frozen.  Note
+ * that freeze_super can free the xfs_mount, so we must be careful to recheck
+ * XFS_M before trying to access anything in the xfs_mount afterwards.
+ */
+STATIC int
+xfs_fs_freeze_super(
+	struct super_block	*sb)
+{
+	int			error;
+
+	mutex_lock(&XFS_M(sb)->m_scrub_freeze);
+	error = freeze_super(sb);
+	if (XFS_M(sb))
+		mutex_unlock(&XFS_M(sb)->m_scrub_freeze);
+	return error;
+}
+
+/*
+ * Don't let userspace thaw while scrub has the filesystem frozen.  Note that
+ * thaw_super can free the xfs_mount, so we must be careful to recheck XFS_M
+ * before trying to access anything in the xfs_mount afterwards.
+ */
+STATIC int
+xfs_fs_thaw_super(
+	struct super_block	*sb)
+{
+	int			error;
+
+	mutex_lock(&XFS_M(sb)->m_scrub_freeze);
+	error = thaw_super(sb);
+	if (XFS_M(sb))
+		mutex_unlock(&XFS_M(sb)->m_scrub_freeze);
+	return error;
+}
+
 STATIC int
 xfs_fs_show_options(
 	struct seq_file		*m,
@@ -1582,6 +1618,7 @@ xfs_mount_alloc(
 	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
+	mutex_init(&mp->m_scrub_freeze);
 	atomic_set(&mp->m_active_trans, 0);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
@@ -1768,6 +1805,7 @@ xfs_fs_fill_super(
  out_free_fsname:
 	sb->s_fs_info = NULL;
 	xfs_free_fsname(mp);
+	mutex_destroy(&mp->m_scrub_freeze);
 	kfree(mp);
  out:
 	return error;
@@ -1800,6 +1838,19 @@ xfs_fs_put_super(
 
 	sb->s_fs_info = NULL;
 	xfs_free_fsname(mp);
+	/*
+	 * fs freeze takes an active reference to the filesystem and fs thaw
+	 * drops it.  If a filesystem on a frozen (dm) block device is
+	 * unmounted before the block device is thawed, we can end up tearing
+	 * down the super from within thaw_super when the device is thawed.
+	 * xfs_fs_thaw_super grabbed the scrub repair mutex before calling
+	 * thaw_super, so we must avoid freeing a locked mutex.  At this point
+	 * we know we're the only user of the filesystem, so we can safely
+	 * unlock the scrub/repair mutex if it's locked.
+	 */
+	if (mutex_is_locked(&mp->m_scrub_freeze))
+		mutex_unlock(&mp->m_scrub_freeze);
+	mutex_destroy(&mp->m_scrub_freeze);
 	kfree(mp);
 }
 
@@ -1846,6 +1897,8 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
+	.freeze_super		= xfs_fs_freeze_super,
+	.thaw_super		= xfs_fs_thaw_super,
 };
 
 static struct file_system_type xfs_fs_type = {
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index c08785cf83a9..1f0ae57d1e8c 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -314,9 +314,12 @@ xfs_trans_alloc(
 
 	/*
 	 * Zero-reservation ("empty") transactions can't modify anything, so
-	 * they're allowed to run while we're frozen.
+	 * they're allowed to run while we're frozen.  Scrub is allowed to
+	 * freeze the filesystem in order to obtain exclusive access to the
+	 * filesystem.
 	 */
 	WARN_ON(resp->tr_logres > 0 &&
+	        !mutex_is_locked(&mp->m_scrub_freeze) &&
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	atomic_inc(&mp->m_active_trans);
 


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 11/21] xfs: repair the rmapbt
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 10/21] xfs: introduce online scrub freeze Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-07-03  5:32   ` Dave Chinner
  2018-06-24 19:24 ` [PATCH 12/21] xfs: repair refcount btrees Darrick J. Wong
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Rebuild the reverse mapping btree from all primary metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/repair.h      |   11 
 fs/xfs/scrub/rmap.c        |    6 
 fs/xfs/scrub/rmap_repair.c | 1036 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c       |    2 
 5 files changed, 1054 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/rmap_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 837fd4a95f6f..c71c5deef4c9 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -167,6 +167,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   ialloc_repair.o \
 				   repair.o \
+				   rmap_repair.o \
 				   )
 endif
 endif
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 1cdf457e41da..3d9e064147ec 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -96,6 +96,7 @@ int xfs_repair_find_ag_btree_roots(struct xfs_scrub_context *sc,
 void xfs_repair_force_quotacheck(struct xfs_scrub_context *sc, uint dqtype);
 int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
 int xfs_repair_grab_all_ag_headers(struct xfs_scrub_context *sc);
+int xfs_repair_rmapbt_setup(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
 /* Metadata repairers */
 
@@ -106,6 +107,7 @@ int xfs_repair_agfl(struct xfs_scrub_context *sc);
 int xfs_repair_agi(struct xfs_scrub_context *sc);
 int xfs_repair_allocbt(struct xfs_scrub_context *sc);
 int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
+int xfs_repair_rmapbt(struct xfs_scrub_context *sc);
 
 #else
 
@@ -127,6 +129,14 @@ xfs_repair_calc_ag_resblks(
 	return 0;
 }
 
+static inline int xfs_repair_rmapbt_setup(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	/* We don't support rmap repair, but we can still do a scan. */
+	return xfs_scrub_setup_ag_btree(sc, ip, false);
+}
+
 #define xfs_repair_probe		xfs_repair_notsupported
 #define xfs_repair_superblock		xfs_repair_notsupported
 #define xfs_repair_agf			xfs_repair_notsupported
@@ -134,6 +144,7 @@ xfs_repair_calc_ag_resblks(
 #define xfs_repair_agi			xfs_repair_notsupported
 #define xfs_repair_allocbt		xfs_repair_notsupported
 #define xfs_repair_iallocbt		xfs_repair_notsupported
+#define xfs_repair_rmapbt		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index c6d763236ba7..dd1cccfbb31a 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -24,6 +24,7 @@
 #include "scrub/common.h"
 #include "scrub/btree.h"
 #include "scrub/trace.h"
+#include "scrub/repair.h"
 
 /*
  * Set us up to scrub reverse mapping btrees.
@@ -33,7 +34,10 @@ xfs_scrub_setup_ag_rmapbt(
 	struct xfs_scrub_context	*sc,
 	struct xfs_inode		*ip)
 {
-	return xfs_scrub_setup_ag_btree(sc, ip, false);
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+		return xfs_repair_rmapbt_setup(sc, ip);
+	else
+		return xfs_scrub_setup_ag_btree(sc, ip, false);
 }
 
 /* Reverse-mapping scrubber. */
diff --git a/fs/xfs/scrub/rmap_repair.c b/fs/xfs/scrub/rmap_repair.c
new file mode 100644
index 000000000000..2ade606060c8
--- /dev/null
+++ b/fs/xfs/scrub/rmap_repair.c
@@ -0,0 +1,1036 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Reverse Mapping Btree Repair
+ * ============================
+ *
+ * This is the most involved of all the AG space btree rebuilds.  Everywhere
+ * else in XFS we lock inodes and then AG data structures, but generating the
+ * list of rmap records requires that we be able to scan both block mapping
+ * btrees of every inode in the filesystem to see if it owns any extents in
+ * this AG.  We can't tolerate any inode updates while we do this, so we
+ * freeze the filesystem to lock everyone else out, and grant ourselves
+ * special privileges to run transactions with regular background reclamation
+ * turned off.
+ *
+ * We also have to be very careful not to allow inode reclaim to start a
+ * transaction because all transactions (other than our own) will block.
+ *
+ * So basically we scan all primary per-AG metadata and all block maps of all
+ * inodes to generate a huge list of reverse map records.  Next we look for
+ * gaps in the rmap records to calculate all the unclaimed free space (1).
+ * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
+ * the space used by those btrees from (1), and also subtract the free space
+ * listed in the bnobt from (1).  What's left are the gaps in assigned space
+ * that the new rmapbt knows about but the existing bnobt doesn't; these are
+ * the blocks from the old rmapbt and they can be freed.
+ */
+
+/* Set us up to repair reverse mapping btrees. */
+int
+xfs_repair_rmapbt_setup(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	int				error;
+
+	/*
+	 * Freeze out anything that can lock an inode.  We reconstruct
+	 * the rmapbt by reading inode bmaps with the AGF held, which is
+	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
+	 * inodes.
+	 */
+	error = xfs_scrub_fs_freeze(sc);
+	if (error)
+		return error;
+
+	/* Check the AG number and set up the scrub context. */
+	error = xfs_scrub_setup_fs(sc, ip);
+	if (error)
+		return error;
+
+	/*
+	 * Lock all the AG header buffers so that we can read all the
+	 * per-AG metadata too.
+	 */
+	error = xfs_repair_grab_all_ag_headers(sc);
+	if (error)
+		return error;
+
+	return xfs_scrub_ag_init(sc, sc->sm->sm_agno, &sc->sa);
+}
+
+struct xfs_repair_rmapbt_extent {
+	struct list_head		list;
+	struct xfs_rmap_irec		rmap;
+};
+
+/* Context for collecting rmaps */
+struct xfs_repair_rmapbt {
+	struct list_head		*rmaplist;
+	struct xfs_scrub_context	*sc;
+	uint64_t			owner;
+	xfs_agblock_t			btblocks;
+	uint64_t			nr_records;
+};
+
+/* Context for calculating old rmapbt blocks */
+struct xfs_repair_rmapbt_freesp {
+	struct xfs_repair_extent_list	rmap_freelist;
+	struct xfs_repair_extent_list	bno_freelist;
+	struct xfs_scrub_context	*sc;
+	xfs_agblock_t			next_bno;
+};
+
+/* Initialize an rmap. */
+static inline int
+xfs_repair_rmapbt_new_rmap(
+	struct xfs_repair_rmapbt	*rr,
+	xfs_agblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	uint64_t			owner,
+	uint64_t			offset,
+	unsigned int			flags)
+{
+	struct xfs_repair_rmapbt_extent	*rre;
+	int				error = 0;
+
+	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
+			startblock, blockcount, owner, offset, flags);
+
+	if (xfs_scrub_should_terminate(rr->sc, &error))
+		return error;
+
+	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
+	if (!rre)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&rre->list);
+	rre->rmap.rm_startblock = startblock;
+	rre->rmap.rm_blockcount = blockcount;
+	rre->rmap.rm_owner = owner;
+	rre->rmap.rm_offset = offset;
+	rre->rmap.rm_flags = flags;
+	list_add_tail(&rre->list, rr->rmaplist);
+	rr->nr_records++;
+
+	return 0;
+}
+
+/* Add an AGFL block to the rmap list. */
+STATIC int
+xfs_repair_rmapbt_walk_agfl(
+	struct xfs_mount		*mp,
+	xfs_agblock_t			bno,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+
+	return xfs_repair_rmapbt_new_rmap(rr, bno, 1, XFS_RMAP_OWN_AG, 0, 0);
+}
+
+/* Add a btree block to the rmap list. */
+STATIC int
+xfs_repair_rmapbt_visit_btblock(
+	struct xfs_btree_cur		*cur,
+	int				level,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_buf			*bp;
+	xfs_fsblock_t			fsb;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	rr->btblocks++;
+	fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+	return xfs_repair_rmapbt_new_rmap(rr, XFS_FSB_TO_AGBNO(cur->bc_mp, fsb),
+			1, rr->owner, 0, 0);
+}
+
+STATIC int
+xfs_repair_rmapbt_stash_btree_rmap(
+	struct xfs_scrub_context	*sc,
+	xfs_fsblock_t			fsbno,
+	xfs_fsblock_t			len,
+	void				*priv)
+{
+	return xfs_repair_rmapbt_new_rmap(priv, XFS_FSB_TO_AGBNO(sc->mp, fsbno),
+			len, XFS_RMAP_OWN_INOBT, 0, 0);
+}
+
+/* Record inode btree rmaps. */
+STATIC int
+xfs_repair_rmapbt_inodes(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	irec;
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	xfs_agino_t			agino;
+	xfs_agino_t			iperhole;
+	unsigned int			i;
+	int				error;
+
+	/* Record the inobt blocks. */
+	error = xfs_repair_collect_btree_cur_blocks(rr->sc, cur,
+			xfs_repair_rmapbt_stash_btree_rmap, rr);
+	if (error)
+		return error;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+
+	/* Record a non-sparse inode chunk. */
+	if (irec.ir_holemask == XFS_INOBT_HOLEMASK_FULL)
+		return xfs_repair_rmapbt_new_rmap(rr,
+				XFS_AGINO_TO_AGBNO(mp, irec.ir_startino),
+				XFS_INODES_PER_CHUNK / mp->m_sb.sb_inopblock,
+				XFS_RMAP_OWN_INODES, 0, 0);
+
+	/* Iterate each chunk. */
+	iperhole = max_t(xfs_agino_t, mp->m_sb.sb_inopblock,
+			XFS_INODES_PER_HOLEMASK_BIT);
+	for (i = 0, agino = irec.ir_startino;
+	     i < XFS_INOBT_HOLEMASK_BITS;
+	     i += iperhole / XFS_INODES_PER_HOLEMASK_BIT, agino += iperhole) {
+		/* Skip holes. */
+		if (irec.ir_holemask & (1 << i))
+			continue;
+
+		/* Record the inode chunk otherwise. */
+		error = xfs_repair_rmapbt_new_rmap(rr,
+				XFS_AGINO_TO_AGBNO(mp, agino),
+				iperhole / mp->m_sb.sb_inopblock,
+				XFS_RMAP_OWN_INODES, 0, 0);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Record a CoW staging extent. */
+STATIC int
+xfs_repair_rmapbt_refcount(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_refcount_irec	refc;
+
+	xfs_refcount_btrec_to_irec(rec, &refc);
+	if (refc.rc_refcount != 1)
+		return -EFSCORRUPTED;
+
+	return xfs_repair_rmapbt_new_rmap(rr,
+			refc.rc_startblock - XFS_REFC_COW_START,
+			refc.rc_blockcount, XFS_RMAP_OWN_COW, 0, 0);
+}
+
+/* Add a bmbt block to the rmap list. */
+STATIC int
+xfs_repair_rmapbt_visit_bmbt(
+	struct xfs_btree_cur		*cur,
+	int				level,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_buf			*bp;
+	xfs_fsblock_t			fsb;
+	unsigned int			flags = XFS_RMAP_BMBT_BLOCK;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+	if (XFS_FSB_TO_AGNO(cur->bc_mp, fsb) != rr->sc->sa.agno)
+		return 0;
+
+	if (cur->bc_private.b.whichfork == XFS_ATTR_FORK)
+		flags |= XFS_RMAP_ATTR_FORK;
+	return xfs_repair_rmapbt_new_rmap(rr,
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsb), 1,
+			cur->bc_private.b.ip->i_ino, 0, flags);
+}
+
+/* Determine rmap flags from fork and bmbt state. */
+static inline unsigned int
+xfs_repair_rmapbt_bmap_flags(
+	int			whichfork,
+	xfs_exntst_t		state)
+{
+	return  (whichfork == XFS_ATTR_FORK ? XFS_RMAP_ATTR_FORK : 0) |
+		(state == XFS_EXT_UNWRITTEN ? XFS_RMAP_UNWRITTEN : 0);
+}
+
+/* Find all the extents from a given AG in an inode fork. */
+STATIC int
+xfs_repair_rmapbt_scan_ifork(
+	struct xfs_repair_rmapbt	*rr,
+	struct xfs_inode		*ip,
+	int				whichfork)
+{
+	struct xfs_bmbt_irec		rec;
+	struct xfs_iext_cursor		icur;
+	struct xfs_mount		*mp = rr->sc->mp;
+	struct xfs_btree_cur		*cur = NULL;
+	struct xfs_ifork		*ifp;
+	unsigned int			rflags;
+	int				fmt;
+	int				error = 0;
+
+	/* Do we even have data mapping extents? */
+	fmt = XFS_IFORK_FORMAT(ip, whichfork);
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+	switch (fmt) {
+	case XFS_DINODE_FMT_BTREE:
+		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
+			if (error)
+				return error;
+		}
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		break;
+	default:
+		return 0;
+	}
+	if (!ifp)
+		return 0;
+
+	/* Find all the BMBT blocks in the AG. */
+	if (fmt == XFS_DINODE_FMT_BTREE) {
+		cur = xfs_bmbt_init_cursor(mp, rr->sc->tp, ip, whichfork);
+		error = xfs_btree_visit_blocks(cur,
+				xfs_repair_rmapbt_visit_bmbt, rr);
+		if (error)
+			goto out;
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		cur = NULL;
+	}
+
+	/* We're done if this is an rt inode's data fork. */
+	if (whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip))
+		return 0;
+
+	/* Find all the extents in the AG. */
+	for_each_xfs_iext(ifp, &icur, &rec) {
+		if (isnullstartblock(rec.br_startblock))
+			continue;
+		/* Stash non-hole extent. */
+		if (XFS_FSB_TO_AGNO(mp, rec.br_startblock) == rr->sc->sa.agno) {
+			rflags = xfs_repair_rmapbt_bmap_flags(whichfork,
+					rec.br_state);
+			error = xfs_repair_rmapbt_new_rmap(rr,
+					XFS_FSB_TO_AGBNO(mp, rec.br_startblock),
+					rec.br_blockcount, ip->i_ino,
+					rec.br_startoff, rflags);
+			if (error)
+				goto out;
+		}
+	}
+out:
+	if (cur)
+		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Iterate all the inodes in an AG group. */
+STATIC int
+xfs_repair_rmapbt_scan_inobt(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	irec;
+	struct xfs_repair_rmapbt	*rr = priv;
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_inode		*ip = NULL;
+	xfs_ino_t			ino;
+	xfs_agino_t			agino;
+	int				chunkidx;
+	int				lock_mode = 0;
+	int				error = 0;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+
+	for (chunkidx = 0, agino = irec.ir_startino;
+	     chunkidx < XFS_INODES_PER_CHUNK;
+	     chunkidx++, agino++) {
+		bool	inuse;
+
+		/* Skip if this inode is free */
+		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
+			continue;
+		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
+
+		/* Back off and try again if an inode is being reclaimed */
+		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
+				&inuse);
+		if (error == -EAGAIN)
+			return -EDEADLOCK;
+
+		/*
+		 * Grab inode for scanning.  We cannot use DONTCACHE here
+		 * because we already have a transaction so the iput must not
+		 * trigger inode reclaim (which might allocate a transaction
+		 * to clean up posteof blocks).
+		 */
+		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);
+		if (error)
+			return error;
+		trace_xfs_scrub_iget(ip, __this_address);
+
+		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
+		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
+		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
+		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
+			lock_mode = XFS_ILOCK_EXCL;
+		else
+			lock_mode = XFS_ILOCK_SHARED;
+		if (!xfs_ilock_nowait(ip, lock_mode)) {
+			error = -EBUSY;
+			goto out_rele;
+		}
+
+		/* Check the data fork. */
+		error = xfs_repair_rmapbt_scan_ifork(rr, ip, XFS_DATA_FORK);
+		if (error)
+			goto out_unlock;
+
+		/* Check the attr fork. */
+		error = xfs_repair_rmapbt_scan_ifork(rr, ip, XFS_ATTR_FORK);
+		if (error)
+			goto out_unlock;
+
+		xfs_iunlock(ip, lock_mode);
+		xfs_scrub_iput(rr->sc, ip);
+		ip = NULL;
+	}
+
+	return error;
+out_unlock:
+	xfs_iunlock(ip, lock_mode);
+out_rele:
+	iput(VFS_I(ip));
+	return error;
+}
+
+/* Record extents that aren't in use from gaps in the rmap records. */
+STATIC int
+xfs_repair_rmapbt_record_rmap_freesp(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt_freesp	*rrf = priv;
+	xfs_fsblock_t			fsb;
+	int				error;
+
+	/* Record the free space we find. */
+	if (rec->rm_startblock > rrf->next_bno) {
+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+				rrf->next_bno);
+		error = xfs_repair_collect_btree_extent(rrf->sc,
+				&rrf->rmap_freelist, fsb,
+				rec->rm_startblock - rrf->next_bno);
+		if (error)
+			return error;
+	}
+	rrf->next_bno = max_t(xfs_agblock_t, rrf->next_bno,
+			rec->rm_startblock + rec->rm_blockcount);
+	return 0;
+}
+
+/* Record extents that aren't in use from the bnobt records. */
+STATIC int
+xfs_repair_rmapbt_record_bno_freesp(
+	struct xfs_btree_cur		*cur,
+	struct xfs_alloc_rec_incore	*rec,
+	void				*priv)
+{
+	struct xfs_repair_rmapbt_freesp	*rrf = priv;
+	xfs_fsblock_t			fsb;
+
+	/* Record the free space we find. */
+	fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+			rec->ar_startblock);
+	return xfs_repair_collect_btree_extent(rrf->sc, &rrf->bno_freelist,
+			fsb, rec->ar_blockcount);
+}
+
+/* Compare two rmapbt extents. */
+static int
+xfs_repair_rmapbt_extent_cmp(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_repair_rmapbt_extent	*ap;
+	struct xfs_repair_rmapbt_extent	*bp;
+
+	ap = container_of(a, struct xfs_repair_rmapbt_extent, list);
+	bp = container_of(b, struct xfs_repair_rmapbt_extent, list);
+	return xfs_rmap_compare(&ap->rmap, &bp->rmap);
+}
+
+/* Generate rmaps for the AG headers (AGI/AGF/AGFL) */
+STATIC int
+xfs_repair_rmapbt_generate_agheader_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	int				error;
+
+	/* Create a record for the AG sb->agfl. */
+	error = xfs_repair_rmapbt_new_rmap(rr, XFS_SB_BLOCK(sc->mp),
+			XFS_AGFL_BLOCK(sc->mp) - XFS_SB_BLOCK(sc->mp) + 1,
+			XFS_RMAP_OWN_FS, 0, 0);
+	if (error)
+		return error;
+
+	/* Generate rmaps for the blocks in the AGFL. */
+	return xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
+			sc->sa.agfl_bp, xfs_repair_rmapbt_walk_agfl, rr);
+}
+
+/* Generate rmaps for the log, if it's in this AG. */
+STATIC int
+xfs_repair_rmapbt_generate_log_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+
+	if (sc->mp->m_sb.sb_logstart == 0 ||
+	    XFS_FSB_TO_AGNO(sc->mp, sc->mp->m_sb.sb_logstart) != sc->sa.agno)
+		return 0;
+
+	return xfs_repair_rmapbt_new_rmap(rr,
+			XFS_FSB_TO_AGBNO(sc->mp, sc->mp->m_sb.sb_logstart),
+			sc->mp->m_sb.sb_logblocks, XFS_RMAP_OWN_LOG, 0, 0);
+}
+
+/* Collect rmaps for the blocks containing the free space btrees. */
+STATIC int
+xfs_repair_rmapbt_generate_freesp_rmaps(
+	struct xfs_repair_rmapbt	*rr,
+	xfs_agblock_t			*new_btreeblks)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	rr->owner = XFS_RMAP_OWN_AG;
+	rr->btblocks = 0;
+
+	/* bnobt */
+	cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, XFS_BTNUM_BNO);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* cntbt */
+	cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, XFS_BTNUM_CNT);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* btreeblks doesn't include the bnobt/cntbt btree roots */
+	*new_btreeblks = rr->btblocks - 2;
+	return 0;
+err:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Collect rmaps for the blocks containing inode btrees and the inode chunks. */
+STATIC int
+xfs_repair_rmapbt_generate_inobt_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	rr->owner = XFS_RMAP_OWN_INOBT;
+
+	/*
+	 * Iterate every record in the inobt so we can capture all the inode
+	 * chunks and the blocks in the inobt itself.  Note that if there are
+	 * zero records in the inobt then query_all does nothing and we have
+	 * to account the empty inobt root manually.
+	 */
+	if (sc->sa.pag->pagi_count > 0) {
+		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
+				sc->sa.agno, XFS_BTNUM_INO);
+		error = xfs_btree_query_all(cur, xfs_repair_rmapbt_inodes, rr);
+		if (error)
+			goto err_cur;
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	} else {
+		struct xfs_agi		*agi;
+
+		agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+		error = xfs_repair_rmapbt_new_rmap(rr,
+				be32_to_cpu(agi->agi_root), 1,
+				XFS_RMAP_OWN_INOBT, 0, 0);
+		if (error)
+			goto err;
+	}
+
+	/* finobt */
+	if (!xfs_sb_version_hasfinobt(&sc->mp->m_sb))
+		return 0;
+
+	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
+			XFS_BTNUM_FINO);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err_cur;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+err:
+	return error;
+}
+
+/*
+ * Collect rmaps for the blocks containing the refcount btree, and all CoW
+ * staging extents.
+ */
+STATIC int
+xfs_repair_rmapbt_generate_refcountbt_rmaps(
+	struct xfs_repair_rmapbt	*rr)
+{
+	union xfs_btree_irec		low;
+	union xfs_btree_irec		high;
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	if (!xfs_sb_version_hasreflink(&sc->mp->m_sb))
+		return 0;
+
+	rr->owner = XFS_RMAP_OWN_REFC;
+
+	/* refcountbt */
+	cur = xfs_refcountbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, NULL);
+	error = xfs_btree_visit_blocks(cur, xfs_repair_rmapbt_visit_btblock,
+			rr);
+	if (error)
+		goto err_cur;
+
+	/* Collect rmaps for CoW staging extents. */
+	memset(&low, 0, sizeof(low));
+	low.rc.rc_startblock = XFS_REFC_COW_START;
+	memset(&high, 0xFF, sizeof(high));
+	error = xfs_btree_query_range(cur, &low, &high,
+			xfs_repair_rmapbt_refcount, rr);
+	if (error)
+		goto err_cur;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Collect rmaps for all block mappings for every inode in this AG. */
+STATIC int
+xfs_repair_rmapbt_generate_aginode_rmaps(
+	struct xfs_repair_rmapbt	*rr,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_scrub_context	*sc = rr->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agi_bp;
+	int				error;
+
+	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
+	if (error)
+		return error;
+	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
+	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(sc->tp, agi_bp);
+	return error;
+}
+
+/*
+ * Generate all the reverse-mappings for this AG, a list of the old rmapbt
+ * blocks, and the new btreeblks count.  Figure out if we have enough free
+ * space to reconstruct the inode btrees.  The caller must clean up the lists
+ * if anything goes wrong.
+ */
+STATIC int
+xfs_repair_rmapbt_find_rmaps(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*rmap_records,
+	xfs_agblock_t			*new_btreeblks)
+{
+	struct xfs_repair_rmapbt	rr;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	rr.rmaplist = rmap_records;
+	rr.sc = sc;
+	rr.nr_records = 0;
+
+	/* Generate rmaps for AG space metadata */
+	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
+	if (error)
+		return error;
+	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
+	if (error)
+		return error;
+
+	/* Iterate all AGs for inodes rmaps. */
+	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
+		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
+		if (error)
+			return error;
+	}
+
+	/* Do we actually have enough space to do this? */
+	if (!xfs_repair_ag_has_space(sc->sa.pag,
+			xfs_rmapbt_calc_size(sc->mp, rr.nr_records),
+			XFS_AG_RESV_RMAPBT))
+		return -ENOSPC;
+
+	return 0;
+}
+
+/* Update the AGF counters. */
+STATIC int
+xfs_repair_rmapbt_reset_counters(
+	struct xfs_scrub_context	*sc,
+	xfs_agblock_t			new_btreeblks,
+	int				*log_flags)
+{
+	struct xfs_agf			*agf;
+	struct xfs_perag		*pag = sc->sa.pag;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	pag->pagf_btreeblks = new_btreeblks;
+	agf->agf_btreeblks = cpu_to_be32(new_btreeblks);
+	*log_flags |= XFS_AGF_BTREEBLKS;
+
+	return 0;
+}
+
+/* Initialize a new rmapbt root and implant it into the AGF. */
+STATIC int
+xfs_repair_rmapbt_reset_btree(
+	struct xfs_scrub_context	*sc,
+	struct xfs_owner_info		*oinfo,
+	int				*log_flags)
+{
+	struct xfs_buf			*bp;
+	struct xfs_agf			*agf;
+	struct xfs_perag		*pag = sc->sa.pag;
+	xfs_fsblock_t			btfsb;
+	int				error;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+
+	/* Initialize a new rmapbt root. */
+	error = xfs_repair_alloc_ag_block(sc, oinfo, &btfsb,
+			XFS_AG_RESV_RMAPBT);
+	if (error)
+		return error;
+
+	/* The root block is not a btreeblks block. */
+	be32_add_cpu(&agf->agf_btreeblks, -1);
+	pag->pagf_btreeblks--;
+	*log_flags |= XFS_AGF_BTREEBLKS;
+
+	error = xfs_repair_init_btblock(sc, btfsb, &bp, XFS_BTNUM_RMAP,
+			&xfs_rmapbt_buf_ops);
+	if (error)
+		return error;
+
+	agf->agf_roots[XFS_BTNUM_RMAPi] =
+			cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, btfsb));
+	agf->agf_levels[XFS_BTNUM_RMAPi] = cpu_to_be32(1);
+	agf->agf_rmap_blocks = cpu_to_be32(1);
+	pag->pagf_levels[XFS_BTNUM_RMAPi] = 1;
+	*log_flags |= XFS_AGF_ROOTS | XFS_AGF_LEVELS | XFS_AGF_RMAP_BLOCKS;
+
+	return 0;
+}
+
+/*
+ * Roll and fix the free list while reloading the rmapbt.  Do not shrink the
+ * freelist because the rmapbt is not fully set up yet.
+ */
+STATIC int
+xfs_repair_rmapbt_fix_freelist(
+	struct xfs_scrub_context	*sc)
+{
+	int				error;
+
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		return error;
+	return xfs_repair_fix_freelist(sc, false);
+}
+
+/* Insert all the rmaps we collected. */
+STATIC int
+xfs_repair_rmapbt_rebuild_tree(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*rmap_records)
+{
+	struct xfs_repair_rmapbt_extent	*rre;
+	struct xfs_repair_rmapbt_extent	*n;
+	struct xfs_btree_cur		*cur;
+	struct xfs_mount		*mp = sc->mp;
+	uint32_t			old_flcount;
+	int				error;
+
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	old_flcount = sc->sa.pag->pagf_flcount;
+
+	list_sort(NULL, rmap_records, xfs_repair_rmapbt_extent_cmp);
+	list_for_each_entry_safe(rre, n, rmap_records, list) {
+		/* Add the rmap. */
+		error = xfs_rmap_map_raw(cur, &rre->rmap);
+		if (error)
+			goto err_cur;
+		list_del(&rre->list);
+		kmem_free(rre);
+
+		/*
+		 * If the flcount changed because the rmap btree changed shape
+		 * then we need to fix the freelist to keep it full enough to
+		 * handle a total btree split.  We'll roll this transaction to
+		 * get it out of the way and then fix the freelist in a fresh
+		 * transaction.
+		 *
+		 * However, two things we must be careful about: (1) fixing
+		 * the freelist changes the rmapbt so drop the rmapbt cursor
+		 * and (2) we can't let the freelist shrink.  The rmapbt isn't
+		 * fully set up yet, which means that the current AGFL blocks
+		 * might not be reflected in the rmapbt, which is a problem if
+		 * we want to unmap blocks from the AGFL.
+		 */
+		if (sc->sa.pag->pagf_flcount == old_flcount)
+			continue;
+		if (list_empty(rmap_records))
+			break;
+
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		error = xfs_repair_rmapbt_fix_freelist(sc);
+		if (error)
+			goto err;
+		old_flcount = sc->sa.pag->pagf_flcount;
+		cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp,
+				sc->sa.agno);
+	}
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Fix the freelist once more, if necessary. */
+	if (sc->sa.pag->pagf_flcount != old_flcount) {
+		error = xfs_repair_rmapbt_fix_freelist(sc);
+		if (error)
+			goto err;
+	}
+	return 0;
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+err:
+	return error;
+}
+
+/* Cancel every rmapbt record. */
+STATIC void
+xfs_repair_rmapbt_cancel_rmaps(
+	struct list_head	*reclist)
+{
+	struct xfs_repair_rmapbt_extent	*rre;
+	struct xfs_repair_rmapbt_extent	*n;
+
+	list_for_each_entry_safe(rre, n, reclist, list) {
+		list_del(&rre->list);
+		kmem_free(rre);
+	}
+}
+
+/*
+ * Reap the old rmapbt blocks.  Now that the rmapbt is fully rebuilt, we make
+ * a list of gaps in the rmap records and a list of the extents mentioned in
+ * the bnobt.  Any block that's in the new rmapbt gap list but not mentioned
+ * in the bnobt is a block from the old rmapbt and can be removed.
+ */
+STATIC int
+xfs_repair_rmapbt_reap_old_blocks(
+	struct xfs_scrub_context	*sc,
+	struct xfs_owner_info		*oinfo)
+{
+	struct xfs_repair_rmapbt_freesp	rrf;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_agf			*agf;
+	struct xfs_btree_cur		*cur;
+	xfs_fsblock_t			btfsb;
+	xfs_agblock_t			agend;
+	int				error;
+
+	xfs_repair_init_extent_list(&rrf.rmap_freelist);
+	xfs_repair_init_extent_list(&rrf.bno_freelist);
+	rrf.next_bno = 0;
+	rrf.sc = sc;
+
+	/* Compute free space from the new rmapbt. */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xfs_repair_rmapbt_record_rmap_freesp,
+			&rrf);
+	if (error)
+		goto err_cur;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/* Insert a record for space between the last rmap and EOAG. */
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	agend = be32_to_cpu(agf->agf_length);
+	if (rrf.next_bno < agend) {
+		btfsb = XFS_AGB_TO_FSB(mp, sc->sa.agno, rrf.next_bno);
+		error = xfs_repair_collect_btree_extent(sc, &rrf.rmap_freelist,
+				btfsb, agend - rrf.next_bno);
+		if (error)
+			goto err;
+	}
+
+	/* Compute free space from the existing bnobt. */
+	cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno, XFS_BTNUM_BNO);
+	error = xfs_alloc_query_all(cur, xfs_repair_rmapbt_record_bno_freesp,
+			&rrf);
+	if (error)
+		goto err_lists;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+
+	/*
+	 * Free the "free" blocks that the new rmapbt knows about but
+	 * the old bnobt doesn't.  These are the old rmapbt blocks.
+	 */
+	error = xfs_repair_subtract_extents(sc, &rrf.rmap_freelist,
+			&rrf.bno_freelist);
+	xfs_repair_cancel_btree_extents(sc, &rrf.bno_freelist);
+	if (error)
+		goto err;
+	error = xfs_repair_invalidate_blocks(sc, &rrf.rmap_freelist);
+	if (error)
+		goto err;
+	return xfs_repair_reap_btree_extents(sc, &rrf.rmap_freelist, oinfo,
+			XFS_AG_RESV_RMAPBT);
+err_lists:
+	xfs_repair_cancel_btree_extents(sc, &rrf.bno_freelist);
+err_cur:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+err:
+	return error;
+}
+
+/* Repair the rmap btree for some AG. */
+int
+xfs_repair_rmapbt(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_owner_info		oinfo;
+	struct list_head		rmap_records;
+	xfs_extlen_t			new_btreeblks;
+	int				log_flags = 0;
+	int				error;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+
+	/* Collect rmaps for all AG headers. */
+	INIT_LIST_HEAD(&rmap_records);
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
+	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old rmap btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xfs_repair_rmapbt_reset_counters(sc, new_btreeblks, &log_flags);
+	if (error)
+		goto out;
+	error = xfs_repair_rmapbt_reset_btree(sc, &oinfo, &log_flags);
+	if (error)
+		goto out;
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the rmap information. */
+	error = xfs_repair_rmapbt_rebuild_tree(sc, &rmap_records);
+	if (error)
+		goto out;
+
+	/* Find and destroy the blocks from the old rmapbt. */
+	error = xfs_repair_rmapbt_reap_old_blocks(sc, &oinfo);
+out:
+	xfs_repair_rmapbt_cancel_rmaps(&rmap_records);
+	return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 424f01130f14..3f8036ee3971 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -280,7 +280,7 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_ag_rmapbt,
 		.scrub	= xfs_scrub_rmapbt,
 		.has	= xfs_sb_version_hasrmapbt,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_rmapbt,
 	},
 	[XFS_SCRUB_TYPE_REFCNTBT] = {	/* refcountbt */
 		.type	= ST_PERAG,


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 12/21] xfs: repair refcount btrees
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 11/21] xfs: repair the rmapbt Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-07-03  5:50   ` Dave Chinner
  2018-06-24 19:24 ` [PATCH 13/21] xfs: repair inode records Darrick J. Wong
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Reconstruct the refcount data from the rmap btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/scrub/refcount_repair.c |  592 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h          |    2 
 fs/xfs/scrub/scrub.c           |    2 
 4 files changed, 596 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/refcount_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index c71c5deef4c9..eeb03b9f30f6 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -166,6 +166,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
 				   ialloc_repair.o \
+				   refcount_repair.o \
 				   repair.o \
 				   rmap_repair.o \
 				   )
diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c
new file mode 100644
index 000000000000..63fcd1d76789
--- /dev/null
+++ b/fs/xfs/scrub/refcount_repair.c
@@ -0,0 +1,592 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_itable.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_error.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Rebuilding the Reference Count Btree
+ * ====================================
+ *
+ * This algorithm is "borrowed" from xfs_repair.  Imagine the rmap
+ * entries as rectangles representing extents of physical blocks, and
+ * that the rectangles can be laid down to allow them to overlap each
+ * other; then we know that we must emit a refcnt btree entry wherever
+ * the amount of overlap changes, i.e. the emission stimulus is
+ * level-triggered:
+ *
+ *                 -    ---
+ *       --      ----- ----   ---        ------
+ * --   ----     ----------- ----     ---------
+ * -------------------------------- -----------
+ * ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
+ * 2 1  23 21    3 43 234  2123  1 01 2  3     0
+ *
+ * For our purposes, a rmap is a tuple (startblock, len, fileoff, owner).
+ *
+ * Note that in the actual refcnt btree we don't store the refcount < 2
+ * cases because the bnobt tells us which blocks are free; single-use
+ * blocks aren't recorded in the bnobt or the refcntbt.  If the rmapbt
+ * supports storing multiple entries covering a given block we could
+ * theoretically dispense with the refcntbt and simply count rmaps, but
+ * that's inefficient in the (hot) write path, so we'll take the cost of
+ * the extra tree to save time.  Also there's no guarantee that rmap
+ * will be enabled.
+ *
+ * Given an array of rmaps sorted by physical block number, a starting
+ * physical block (sp), a bag to hold rmaps that cover sp, and the next
+ * physical block where the level changes (np), we can reconstruct the
+ * refcount btree as follows:
+ *
+ * While there are still unprocessed rmaps in the array,
+ *  - Set sp to the physical block (pblk) of the next unprocessed rmap.
+ *  - Add to the bag all rmaps in the array where startblock == sp.
+ *  - Set np to the physical block where the bag size will change.  This
+ *    is the minimum of (the pblk of the next unprocessed rmap) and
+ *    (startblock + len of each rmap in the bag).
+ *  - Record the bag size as old_bag_size.
+ *
+ *  - While the bag isn't empty,
+ *     - Remove from the bag all rmaps where startblock + len == np.
+ *     - Add to the bag all rmaps in the array where startblock == np.
+ *     - If the bag size isn't old_bag_size, store the refcount entry
+ *       (sp, np - sp, bag_size) in the refcnt btree.
+ *     - If the bag is empty, break out of the inner loop.
+ *     - Set old_bag_size to the bag size
+ *     - Set sp = np.
+ *     - Set np to the physical block where the bag size will change.
+ *       This is the minimum of (the pblk of the next unprocessed rmap)
+ *       and (startblock + len of each rmap in the bag).
+ *
+ * Like all the other repairers, we make a list of all the refcount
+ * records we need, then reinitialize the refcount btree root and
+ * insert all the records.
+ */
+
+struct xfs_repair_refc_rmap {
+	struct list_head		list;
+	struct xfs_rmap_irec		rmap;
+};
+
+struct xfs_repair_refc_extent {
+	struct list_head		list;
+	struct xfs_refcount_irec	refc;
+};
+
+struct xfs_repair_refc {
+	struct list_head		rmap_bag;  /* rmaps we're tracking */
+	struct list_head		rmap_idle; /* idle rmaps */
+	struct list_head		*extlist;  /* refcount extents */
+	struct xfs_repair_extent_list	*btlist;   /* old refcountbt blocks */
+	struct xfs_scrub_context	*sc;
+	unsigned long			nr_records;/* nr refcount extents */
+	xfs_extlen_t			btblocks;  /* # of refcountbt blocks */
+};
+
+/* Grab the next record from the rmapbt. */
+STATIC int
+xfs_repair_refcountbt_next_rmap(
+	struct xfs_btree_cur		*cur,
+	struct xfs_repair_refc		*rr,
+	struct xfs_rmap_irec		*rec,
+	bool				*have_rec)
+{
+	struct xfs_rmap_irec		rmap;
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_repair_refc_extent	*rre;
+	xfs_fsblock_t			fsbno;
+	int				have_gt;
+	int				error = 0;
+
+	*have_rec = false;
+	/*
+	 * Loop through the remaining rmaps.  Remember CoW staging
+	 * extents and the refcountbt blocks from the old tree for later
+	 * disposal.  We can only share written data fork extents, so
+	 * keep looping until we find an rmap for one.
+	 */
+	do {
+		if (xfs_scrub_should_terminate(rr->sc, &error))
+			goto out_error;
+
+		error = xfs_btree_increment(cur, 0, &have_gt);
+		if (error)
+			goto out_error;
+		if (!have_gt)
+			return 0;
+
+		error = xfs_rmap_get_rec(cur, &rmap, &have_gt);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
+
+		if (rmap.rm_owner == XFS_RMAP_OWN_COW) {
+			/* Pass CoW staging extents right through. */
+			rre = kmem_alloc(sizeof(struct xfs_repair_refc_extent),
+					KM_MAYFAIL);
+			if (!rre)
+				goto out_error;
+
+			INIT_LIST_HEAD(&rre->list);
+			rre->refc.rc_startblock = rmap.rm_startblock +
+					XFS_REFC_COW_START;
+			rre->refc.rc_blockcount = rmap.rm_blockcount;
+			rre->refc.rc_refcount = 1;
+			list_add_tail(&rre->list, rr->extlist);
+		} else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) {
+			/* refcountbt block, dump it when we're done. */
+			rr->btblocks += rmap.rm_blockcount;
+			fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
+					cur->bc_private.a.agno,
+					rmap.rm_startblock);
+			error = xfs_repair_collect_btree_extent(rr->sc,
+					rr->btlist, fsbno, rmap.rm_blockcount);
+			if (error)
+				goto out_error;
+		}
+	} while (XFS_RMAP_NON_INODE_OWNER(rmap.rm_owner) ||
+		 xfs_internal_inum(mp, rmap.rm_owner) ||
+		 (rmap.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+				   XFS_RMAP_UNWRITTEN)));
+
+	*rec = rmap;
+	*have_rec = true;
+	return 0;
+
+out_error:
+	return error;
+}
+
+/* Recycle an idle rmap or allocate a new one. */
+static struct xfs_repair_refc_rmap *
+xfs_repair_refcountbt_get_rmap(
+	struct xfs_repair_refc		*rr)
+{
+	struct xfs_repair_refc_rmap	*rrm;
+
+	if (list_empty(&rr->rmap_idle)) {
+		rrm = kmem_alloc(sizeof(struct xfs_repair_refc_rmap),
+				KM_MAYFAIL);
+		if (!rrm)
+			return NULL;
+		INIT_LIST_HEAD(&rrm->list);
+		return rrm;
+	}
+
+	rrm = list_first_entry(&rr->rmap_idle, struct xfs_repair_refc_rmap,
+			list);
+	list_del_init(&rrm->list);
+	return rrm;
+}
+
+/* Compare two btree extents. */
+static int
+xfs_repair_refcount_extent_cmp(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_repair_refc_extent	*ap;
+	struct xfs_repair_refc_extent	*bp;
+
+	ap = container_of(a, struct xfs_repair_refc_extent, list);
+	bp = container_of(b, struct xfs_repair_refc_extent, list);
+
+	if (ap->refc.rc_startblock > bp->refc.rc_startblock)
+		return 1;
+	else if (ap->refc.rc_startblock < bp->refc.rc_startblock)
+		return -1;
+	return 0;
+}
+
+/* Record a reference count extent. */
+STATIC int
+xfs_repair_refcountbt_new_refc(
+	struct xfs_scrub_context	*sc,
+	struct xfs_repair_refc		*rr,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			len,
+	xfs_nlink_t			refcount)
+{
+	struct xfs_repair_refc_extent	*rre;
+	struct xfs_refcount_irec	irec;
+
+	irec.rc_startblock = agbno;
+	irec.rc_blockcount = len;
+	irec.rc_refcount = refcount;
+
+	trace_xfs_repair_refcount_extent_fn(sc->mp, sc->sa.agno,
+			&irec);
+
+	rre = kmem_alloc(sizeof(struct xfs_repair_refc_extent),
+			KM_MAYFAIL);
+	if (!rre)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&rre->list);
+	rre->refc = irec;
+	list_add_tail(&rre->list, rr->extlist);
+
+	return 0;
+}
+
+/* Iterate all the rmap records to generate reference count data. */
+#define RMAP_NEXT(r)	((r).rm_startblock + (r).rm_blockcount)
+STATIC int
+xfs_repair_refcountbt_generate_refcounts(
+	struct xfs_scrub_context	*sc,
+	struct xfs_repair_refc		*rr)
+{
+	struct xfs_rmap_irec		rmap;
+	struct xfs_btree_cur		*cur;
+	struct xfs_repair_refc_rmap	*rrm;
+	struct xfs_repair_refc_rmap	*n;
+	xfs_agblock_t			sbno;
+	xfs_agblock_t			cbno;
+	xfs_agblock_t			nbno;
+	size_t				old_stack_sz;
+	size_t				stack_sz = 0;
+	bool				have;
+	int				have_gt;
+	int				error;
+
+	/* Start the rmapbt cursor to the left of all records. */
+	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno);
+	error = xfs_rmap_lookup_le(cur, 0, 0, 0, 0, 0, &have_gt);
+	if (error)
+		goto out;
+	ASSERT(have_gt == 0);
+
+	/* Process reverse mappings into refcount data. */
+	while (xfs_btree_has_more_records(cur)) {
+		/* Push all rmaps with pblk == sbno onto the stack */
+		error = xfs_repair_refcountbt_next_rmap(cur, rr, &rmap, &have);
+		if (error)
+			goto out;
+		if (!have)
+			break;
+		sbno = cbno = rmap.rm_startblock;
+		while (have && rmap.rm_startblock == sbno) {
+			rrm = xfs_repair_refcountbt_get_rmap(rr);
+			if (!rrm)
+				goto out;
+			rrm->rmap = rmap;
+			list_add_tail(&rrm->list, &rr->rmap_bag);
+			stack_sz++;
+			error = xfs_repair_refcountbt_next_rmap(cur, rr, &rmap,
+					&have);
+			if (error)
+				goto out;
+		}
+		error = xfs_btree_decrement(cur, 0, &have_gt);
+		if (error)
+			goto out;
+		XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out);
+
+		/* Set nbno to the bno of the next refcount change */
+		nbno = have ? rmap.rm_startblock : NULLAGBLOCK;
+		list_for_each_entry(rrm, &rr->rmap_bag, list)
+			nbno = min_t(xfs_agblock_t, nbno, RMAP_NEXT(rrm->rmap));
+
+		ASSERT(nbno > sbno);
+		old_stack_sz = stack_sz;
+
+		/* While stack isn't empty... */
+		while (stack_sz) {
+			/* Pop all rmaps that end at nbno */
+			list_for_each_entry_safe(rrm, n, &rr->rmap_bag, list) {
+				if (RMAP_NEXT(rrm->rmap) != nbno)
+					continue;
+				stack_sz--;
+				list_move(&rrm->list, &rr->rmap_idle);
+			}
+
+			/* Push array items that start at nbno */
+			error = xfs_repair_refcountbt_next_rmap(cur, rr, &rmap,
+					&have);
+			if (error)
+				goto out;
+			while (have && rmap.rm_startblock == nbno) {
+				rrm = xfs_repair_refcountbt_get_rmap(rr);
+				if (!rrm)
+					goto out;
+				rrm->rmap = rmap;
+				list_add_tail(&rrm->list, &rr->rmap_bag);
+				stack_sz++;
+				error = xfs_repair_refcountbt_next_rmap(cur,
+						rr, &rmap, &have);
+				if (error)
+					goto out;
+			}
+			error = xfs_btree_decrement(cur, 0, &have_gt);
+			if (error)
+				goto out;
+			XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out);
+
+			/* Emit refcount if necessary */
+			ASSERT(nbno > cbno);
+			if (stack_sz != old_stack_sz) {
+				if (old_stack_sz > 1) {
+					error = xfs_repair_refcountbt_new_refc(
+							sc, rr, cbno,
+							nbno - cbno,
+							old_stack_sz);
+					if (error)
+						goto out;
+					rr->nr_records++;
+				}
+				cbno = nbno;
+			}
+
+			/* Stack empty, go find the next rmap */
+			if (stack_sz == 0)
+				break;
+			old_stack_sz = stack_sz;
+			sbno = nbno;
+
+			/* Set nbno to the bno of the next refcount change */
+			nbno = have ? rmap.rm_startblock : NULLAGBLOCK;
+			list_for_each_entry(rrm, &rr->rmap_bag, list)
+				nbno = min_t(xfs_agblock_t, nbno,
+						RMAP_NEXT(rrm->rmap));
+
+			ASSERT(nbno > sbno);
+		}
+	}
+
+	/* Free all the leftover rmap records. */
+	list_for_each_entry_safe(rrm, n, &rr->rmap_idle, list) {
+		list_del(&rrm->list);
+		kmem_free(rrm);
+	}
+
+	ASSERT(list_empty(&rr->rmap_bag));
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	return 0;
+out:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	return error;
+}
+#undef RMAP_NEXT
+
+/*
+ * Generate all the reference counts for this AG and a list of the old
+ * refcount btree blocks.  Figure out if we have enough free space to
+ * reconstruct the inode btrees.  The caller must clean up the lists if
+ * anything goes wrong.
+ */
+STATIC int
+xfs_repair_refcountbt_find_refcounts(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*refcount_records,
+	struct xfs_repair_extent_list	*old_refcountbt_blocks)
+{
+	struct xfs_repair_refc		rr;
+	struct xfs_repair_refc_rmap	*rrm;
+	struct xfs_repair_refc_rmap	*n;
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	INIT_LIST_HEAD(&rr.rmap_bag);
+	INIT_LIST_HEAD(&rr.rmap_idle);
+	rr.extlist = refcount_records;
+	rr.btlist = old_refcountbt_blocks;
+	rr.btblocks = 0;
+	rr.sc = sc;
+	rr.nr_records = 0;
+
+	/* Generate all the refcount records. */
+	error = xfs_repair_refcountbt_generate_refcounts(sc, &rr);
+	if (error)
+		goto out;
+
+	/* Do we actually have enough space to do this? */
+	if (!xfs_repair_ag_has_space(sc->sa.pag,
+			xfs_refcountbt_calc_size(mp, rr.nr_records),
+			XFS_AG_RESV_METADATA)) {
+		error = -ENOSPC;
+		goto out;
+	}
+
+out:
+	list_for_each_entry_safe(rrm, n, &rr.rmap_idle, list) {
+		list_del(&rrm->list);
+		kmem_free(rrm);
+	}
+	list_for_each_entry_safe(rrm, n, &rr.rmap_bag, list) {
+		list_del(&rrm->list);
+		kmem_free(rrm);
+	}
+	return error;
+}
+
+/* Initialize new refcountbt root and implant it into the AGF. */
+STATIC int
+xfs_repair_refcountbt_reset_btree(
+	struct xfs_scrub_context	*sc,
+	struct xfs_owner_info		*oinfo,
+	int				*log_flags)
+{
+	struct xfs_buf			*bp;
+	struct xfs_agf			*agf;
+	xfs_fsblock_t			btfsb;
+	int				error;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+
+	/* Initialize a new refcountbt root. */
+	error = xfs_repair_alloc_ag_block(sc, oinfo, &btfsb,
+			XFS_AG_RESV_METADATA);
+	if (error)
+		return error;
+	error = xfs_repair_init_btblock(sc, btfsb, &bp, XFS_BTNUM_REFC,
+			&xfs_refcountbt_buf_ops);
+	if (error)
+		return error;
+	agf->agf_refcount_root = cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, btfsb));
+	agf->agf_refcount_level = cpu_to_be32(1);
+	agf->agf_refcount_blocks = cpu_to_be32(1);
+	*log_flags |= XFS_AGF_REFCOUNT_BLOCKS | XFS_AGF_REFCOUNT_ROOT |
+		      XFS_AGF_REFCOUNT_LEVEL;
+
+	return 0;
+}
+
+/* Build new refcount btree and dispose of the old one. */
+STATIC int
+xfs_repair_refcountbt_rebuild_tree(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*refcount_records,
+	struct xfs_owner_info		*oinfo,
+	struct xfs_repair_extent_list	*old_refcountbt_blocks)
+{
+	struct xfs_repair_refc_extent	*rre;
+	struct xfs_repair_refc_extent	*n;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_btree_cur		*cur;
+	int				have_gt;
+	int				error;
+
+	/* Add all records. */
+	list_sort(NULL, refcount_records, xfs_repair_refcount_extent_cmp);
+	list_for_each_entry_safe(rre, n, refcount_records, list) {
+		/* Insert into the refcountbt. */
+		cur = xfs_refcountbt_init_cursor(mp, sc->tp, sc->sa.agf_bp,
+				sc->sa.agno, NULL);
+		error = xfs_refcount_lookup_eq(cur, rre->refc.rc_startblock,
+				&have_gt);
+		if (error)
+			return error;
+		XFS_WANT_CORRUPTED_RETURN(mp, have_gt == 0);
+		error = xfs_refcount_insert(cur, &rre->refc, &have_gt);
+		if (error)
+			return error;
+		XFS_WANT_CORRUPTED_RETURN(mp, have_gt == 1);
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		cur = NULL;
+
+		error = xfs_repair_roll_ag_trans(sc);
+		if (error)
+			return error;
+
+		list_del(&rre->list);
+		kmem_free(rre);
+	}
+
+	/* Free the old refcountbt blocks if they're not in use. */
+	return xfs_repair_reap_btree_extents(sc, old_refcountbt_blocks, oinfo,
+			XFS_AG_RESV_METADATA);
+}
+
+/* Free every record in the refcount list. */
+STATIC void
+xfs_repair_refcountbt_cancel_refcrecs(
+	struct list_head	*recs)
+{
+	struct xfs_repair_refc_extent	*rre;
+	struct xfs_repair_refc_extent	*n;
+
+	list_for_each_entry_safe(rre, n, recs, list) {
+		list_del(&rre->list);
+		kmem_free(rre);
+	}
+}
+
+/* Rebuild the refcount btree. */
+int
+xfs_repair_refcountbt(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_owner_info		oinfo;
+	struct list_head		refcount_records;
+	struct xfs_repair_extent_list	old_refcountbt_blocks;
+	struct xfs_mount		*mp = sc->mp;
+	int				log_flags = 0;
+	int				error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xfs_scrub_perag_get(sc->mp, &sc->sa);
+
+	/* Collect all reference counts. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
+	INIT_LIST_HEAD(&refcount_records);
+	xfs_repair_init_extent_list(&old_refcountbt_blocks);
+	error = xfs_repair_refcountbt_find_refcounts(sc, &refcount_records,
+			&old_refcountbt_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old refcount btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xfs_repair_refcountbt_reset_btree(sc, &oinfo, &log_flags);
+	if (error)
+		goto out;
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+
+	/* Invalidate all the inobt/finobt blocks in btlist. */
+	error = xfs_repair_invalidate_blocks(sc, &old_refcountbt_blocks);
+	if (error)
+		goto out;
+	error = xfs_repair_roll_ag_trans(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the refcount information. */
+	return xfs_repair_refcountbt_rebuild_tree(sc, &refcount_records,
+			&oinfo, &old_refcountbt_blocks);
+out:
+	xfs_repair_cancel_btree_extents(sc, &old_refcountbt_blocks);
+	xfs_repair_refcountbt_cancel_refcrecs(&refcount_records);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 3d9e064147ec..23c160558892 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -108,6 +108,7 @@ int xfs_repair_agi(struct xfs_scrub_context *sc);
 int xfs_repair_allocbt(struct xfs_scrub_context *sc);
 int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
 int xfs_repair_rmapbt(struct xfs_scrub_context *sc);
+int xfs_repair_refcountbt(struct xfs_scrub_context *sc);
 
 #else
 
@@ -145,6 +146,7 @@ static inline int xfs_repair_rmapbt_setup(
 #define xfs_repair_allocbt		xfs_repair_notsupported
 #define xfs_repair_iallocbt		xfs_repair_notsupported
 #define xfs_repair_rmapbt		xfs_repair_notsupported
+#define xfs_repair_refcountbt		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 3f8036ee3971..e4801fb6e632 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -287,7 +287,7 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_ag_refcountbt,
 		.scrub	= xfs_scrub_refcountbt,
 		.has	= xfs_sb_version_hasreflink,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_refcountbt,
 	},
 	[XFS_SCRUB_TYPE_INODE] = {	/* inode record */
 		.type	= ST_INODE,


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 13/21] xfs: repair inode records
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 12/21] xfs: repair refcount btrees Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-07-03  6:17   ` Dave Chinner
  2018-06-24 19:24 ` [PATCH 14/21] xfs: zap broken inode forks Darrick J. Wong
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Try to reinitialize corrupt inodes, or clear the reflink flag
if it's not needed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/scrub/inode_repair.c |  483 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h       |    2 
 fs/xfs/scrub/scrub.c        |    2 
 4 files changed, 487 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/inode_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index eeb03b9f30f6..f47f0fe0e70a 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -166,6 +166,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
 				   ialloc_repair.o \
+				   inode_repair.o \
 				   refcount_repair.o \
 				   repair.o \
 				   rmap_repair.o \
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
new file mode 100644
index 000000000000..4ac43c1b1eb0
--- /dev/null
+++ b/fs/xfs/scrub/inode_repair.c
@@ -0,0 +1,483 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_inode_buf.h"
+#include "xfs_inode_fork.h"
+#include "xfs_ialloc.h"
+#include "xfs_da_format.h"
+#include "xfs_reflink.h"
+#include "xfs_rmap.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_dir2.h"
+#include "xfs_quota_defs.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/* Make sure this buffer can pass the inode buffer verifier. */
+STATIC void
+xfs_repair_inode_buf(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*bp)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_trans		*tp = sc->tp;
+	struct xfs_dinode		*dip;
+	xfs_agnumber_t			agno;
+	xfs_agino_t			agino;
+	int				ioff;
+	int				i;
+	int				ni;
+	int				di_ok;
+	bool				unlinked_ok;
+
+	ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock;
+	agno = xfs_daddr_to_agno(mp, XFS_BUF_ADDR(bp));
+	for (i = 0; i < ni; i++) {
+		ioff = i << mp->m_sb.sb_inodelog;
+		dip = xfs_buf_offset(bp, ioff);
+		agino = be32_to_cpu(dip->di_next_unlinked);
+		unlinked_ok = (agino == NULLAGINO ||
+			       xfs_verify_agino(sc->mp, agno, agino));
+		di_ok = dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
+			xfs_dinode_good_version(mp, dip->di_version);
+		if (di_ok && unlinked_ok)
+			continue;
+		dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+		dip->di_version = 3;
+		if (!unlinked_ok)
+			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
+		xfs_dinode_calc_crc(mp, dip);
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
+		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);
+	}
+}
+
+/* Reinitialize things that never change in an inode. */
+STATIC void
+xfs_repair_inode_header(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip)
+{
+	dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+	if (!xfs_dinode_good_version(sc->mp, dip->di_version))
+		dip->di_version = 3;
+	dip->di_ino = cpu_to_be64(sc->sm->sm_ino);
+	uuid_copy(&dip->di_uuid, &sc->mp->m_sb.sb_meta_uuid);
+	dip->di_gen = cpu_to_be32(sc->sm->sm_gen);
+}
+
+/*
+ * Turn di_mode into /something/ recognizable.
+ *
+ * XXX: Ideally we'd try to read data block 0 to see if it's a directory.
+ */
+STATIC void
+xfs_repair_inode_mode(
+	struct xfs_dinode	*dip)
+{
+	uint16_t		mode;
+
+	mode = be16_to_cpu(dip->di_mode);
+	if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN)
+		return;
+
+	/* bad mode, so we set it to a file that only root can read */
+	mode = S_IFREG;
+	dip->di_mode = cpu_to_be16(mode);
+	dip->di_uid = 0;
+	dip->di_gid = 0;
+}
+
+/* Fix any conflicting flags that the verifiers complain about. */
+STATIC void
+xfs_repair_inode_flags(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip)
+{
+	struct xfs_mount		*mp = sc->mp;
+	uint64_t			flags2;
+	uint16_t			mode;
+	uint16_t			flags;
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	if (xfs_sb_version_hasreflink(&mp->m_sb) && S_ISREG(mode))
+		flags2 |= XFS_DIFLAG2_REFLINK;
+	else
+		flags2 &= ~(XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE);
+	if (flags & XFS_DIFLAG_REALTIME)
+		flags2 &= ~XFS_DIFLAG2_REFLINK;
+	if (flags2 & XFS_DIFLAG2_REFLINK)
+		flags2 &= ~XFS_DIFLAG2_DAX;
+	dip->di_flags = cpu_to_be16(flags);
+	dip->di_flags2 = cpu_to_be64(flags2);
+}
+
+/* Make sure we don't have a garbage file size. */
+STATIC void
+xfs_repair_inode_size(
+	struct xfs_dinode	*dip)
+{
+	uint64_t		size;
+	uint16_t		mode;
+
+	mode = be16_to_cpu(dip->di_mode);
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		/* di_size can't be nonzero for special files */
+		dip->di_size = 0;
+		break;
+	case S_IFREG:
+		/* Regular files can't be larger than 2^63-1 bytes. */
+		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
+		break;
+	case S_IFLNK:
+		/* Catch over- or under-sized symlinks. */
+		if (size > XFS_SYMLINK_MAXLEN)
+			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
+		else if (size == 0)
+			dip->di_size = cpu_to_be64(1);
+		break;
+	case S_IFDIR:
+		/* Directories can't have a size larger than 32G. */
+		if (size > XFS_DIR2_SPACE_SIZE)
+			dip->di_size = cpu_to_be64(XFS_DIR2_SPACE_SIZE);
+		else if (size == 0)
+			dip->di_size = cpu_to_be64(1);
+		break;
+	}
+}
+
+/* Fix extent size hints. */
+STATIC void
+xfs_repair_inode_extsize_hints(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip)
+{
+	struct xfs_mount		*mp = sc->mp;
+	uint64_t			flags2;
+	uint16_t			flags;
+	uint16_t			mode;
+	xfs_failaddr_t			fa;
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	fa = xfs_inode_validate_extsize(mp, be32_to_cpu(dip->di_extsize),
+			mode, flags);
+	if (fa) {
+		dip->di_extsize = 0;
+		dip->di_flags &= ~cpu_to_be16(XFS_DIFLAG_EXTSIZE |
+					      XFS_DIFLAG_EXTSZINHERIT);
+	}
+
+	if (dip->di_version < 3)
+		return;
+
+	fa = xfs_inode_validate_cowextsize(mp, be32_to_cpu(dip->di_cowextsize),
+			mode, flags, flags2);
+	if (fa) {
+		dip->di_cowextsize = 0;
+		dip->di_flags2 &= ~cpu_to_be64(XFS_DIFLAG2_COWEXTSIZE);
+	}
+}
+
+/* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
+STATIC int
+xfs_repair_inode_core(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_imap			imap;
+	struct xfs_buf			*bp;
+	struct xfs_dinode		*dip;
+	xfs_ino_t			ino;
+	int				error;
+
+	/* Map & read inode. */
+	ino = sc->sm->sm_ino;
+	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
+	if (error)
+		return error;
+
+	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+			imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp, NULL);
+	if (error)
+		return error;
+
+	/* Make sure we can pass the inode buffer verifier. */
+	xfs_repair_inode_buf(sc, bp);
+	bp->b_ops = &xfs_inode_buf_ops;
+
+	/* Fix everything the verifier will complain about. */
+	dip = xfs_buf_offset(bp, imap.im_boffset);
+	xfs_repair_inode_header(sc, dip);
+	xfs_repair_inode_mode(dip);
+	xfs_repair_inode_flags(sc, dip);
+	xfs_repair_inode_size(dip);
+	xfs_repair_inode_extsize_hints(sc, dip);
+
+	/* Write out the inode... */
+	xfs_dinode_calc_crc(sc->mp, dip);
+	xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_DINO_BUF);
+	xfs_trans_log_buf(sc->tp, bp, imap.im_boffset,
+			imap.im_boffset + sc->mp->m_sb.sb_inodesize - 1);
+	error = xfs_trans_commit(sc->tp);
+	if (error)
+		return error;
+	sc->tp = NULL;
+
+	/* ...and reload it? */
+	error = xfs_iget(sc->mp, sc->tp, ino,
+			XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &sc->ip);
+	if (error)
+		return error;
+	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_ilock(sc->ip, sc->ilock_flags);
+	error = xfs_scrub_trans_alloc(sc, 0);
+	if (error)
+		return error;
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+
+	return 0;
+}
+
+/* Fix everything xfs_dinode_verify cares about. */
+STATIC int
+xfs_repair_inode_problems(
+	struct xfs_scrub_context	*sc)
+{
+	int				error;
+
+	error = xfs_repair_inode_core(sc);
+	if (error)
+		return error;
+
+	/* We had to fix a totally busted inode, schedule quotacheck. */
+	if (XFS_IS_UQUOTA_ON(sc->mp))
+		xfs_repair_force_quotacheck(sc, XFS_DQ_USER);
+	if (XFS_IS_GQUOTA_ON(sc->mp))
+		xfs_repair_force_quotacheck(sc, XFS_DQ_GROUP);
+	if (XFS_IS_PQUOTA_ON(sc->mp))
+		xfs_repair_force_quotacheck(sc, XFS_DQ_PROJ);
+
+	return 0;
+}
+
+/*
+ * Fix problems that the verifiers don't care about.  In general these are
+ * errors that don't cause problems elsewhere in the kernel that we can easily
+ * detect, so we don't check them all that rigorously.
+ */
+
+/* Make sure block and extent counts are ok. */
+STATIC int
+xfs_repair_inode_unchecked_blockcounts(
+	struct xfs_scrub_context	*sc)
+{
+	xfs_filblks_t			count;
+	xfs_filblks_t			acount;
+	xfs_extnum_t			nextents;
+	int				error;
+
+	/* di_nblocks/di_nextents/di_anextents don't match up? */
+	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_DATA_FORK,
+			&nextents, &count);
+	if (error)
+		return error;
+	sc->ip->i_d.di_nextents = nextents;
+
+	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
+			&nextents, &acount);
+	if (error)
+		return error;
+	sc->ip->i_d.di_anextents = nextents;
+
+	sc->ip->i_d.di_nblocks = count + acount;
+	if (sc->ip->i_d.di_anextents != 0 && sc->ip->i_d.di_forkoff == 0)
+		sc->ip->i_d.di_anextents = 0;
+	return 0;
+}
+
+/* Check for invalid uid/gid.  Note that a -1U projid is allowed. */
+STATIC void
+xfs_repair_inode_unchecked_ids(
+	struct xfs_scrub_context	*sc)
+{
+	if (sc->ip->i_d.di_uid == -1U) {
+		sc->ip->i_d.di_uid = 0;
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_UQUOTA_ON(sc->mp))
+			xfs_repair_force_quotacheck(sc, XFS_DQ_USER);
+	}
+
+	if (sc->ip->i_d.di_gid == -1U) {
+		sc->ip->i_d.di_gid = 0;
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_GQUOTA_ON(sc->mp))
+			xfs_repair_force_quotacheck(sc, XFS_DQ_GROUP);
+	}
+}
+
+/* Nanosecond counters can't have more than 1 billion. */
+STATIC void
+xfs_repair_inode_unchecked_timestamps(
+	struct xfs_inode		*ip)
+{
+	if ((unsigned long)VFS_I(ip)->i_atime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_atime.tv_nsec = 0;
+	if ((unsigned long)VFS_I(ip)->i_mtime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_mtime.tv_nsec = 0;
+	if ((unsigned long)VFS_I(ip)->i_ctime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_ctime.tv_nsec = 0;
+	if (ip->i_d.di_version > 2 &&
+	    (unsigned long)ip->i_d.di_crtime.t_nsec >= NSEC_PER_SEC)
+		ip->i_d.di_crtime.t_nsec = 0;
+}
+
+/* Fix inode flags that don't make sense together. */
+STATIC void
+xfs_repair_inode_unchecked_flags(
+	struct xfs_scrub_context	*sc)
+{
+	uint16_t			mode;
+
+	mode = VFS_I(sc->ip)->i_mode;
+
+	/* Clear junk flags */
+	if (sc->ip->i_d.di_flags & ~XFS_DIFLAG_ANY)
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_ANY;
+
+	/* NEWRTBM only applies to realtime bitmaps */
+	if (sc->ip->i_ino == sc->mp->m_sb.sb_rbmino)
+		sc->ip->i_d.di_flags |= XFS_DIFLAG_NEWRTBM;
+	else
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_NEWRTBM;
+
+	/* These only make sense for directories. */
+	if (!S_ISDIR(mode))
+		sc->ip->i_d.di_flags &= ~(XFS_DIFLAG_RTINHERIT |
+					  XFS_DIFLAG_EXTSZINHERIT |
+					  XFS_DIFLAG_PROJINHERIT |
+					  XFS_DIFLAG_NOSYMLINKS);
+
+	/* These only make sense for files. */
+	if (!S_ISREG(mode))
+		sc->ip->i_d.di_flags &= ~(XFS_DIFLAG_REALTIME |
+					  XFS_DIFLAG_EXTSIZE);
+
+	/* These only make sense for non-rt files. */
+	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_FILESTREAM;
+
+	/* Immutable and append only?  Drop the append. */
+	if ((sc->ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE) &&
+	    (sc->ip->i_d.di_flags & XFS_DIFLAG_APPEND))
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_APPEND;
+
+	if (sc->ip->i_d.di_version < 3)
+		return;
+
+	/* Clear junk flags. */
+	if (sc->ip->i_d.di_flags2 & ~XFS_DIFLAG2_ANY)
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_ANY;
+
+	/* No reflink flag unless we support it and it's a file. */
+	if (!xfs_sb_version_hasreflink(&sc->mp->m_sb) ||
+	    !S_ISREG(mode))
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* DAX only applies to files and dirs. */
+	if (!(S_ISREG(mode) || S_ISDIR(mode)))
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_DAX;
+
+	/* No reflink files on the realtime device. */
+	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* No mixing reflink and DAX yet. */
+	if (sc->ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_DAX;
+}
+
+/* Fix any irregularities in an inode that the verifiers don't catch. */
+STATIC int
+xfs_repair_inode_unchecked(
+	struct xfs_scrub_context	*sc)
+{
+	int				error;
+
+	error = xfs_repair_inode_unchecked_blockcounts(sc);
+	if (error)
+		return error;
+	xfs_repair_inode_unchecked_timestamps(sc->ip);
+	xfs_repair_inode_unchecked_flags(sc);
+	xfs_repair_inode_unchecked_ids(sc);
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	return xfs_trans_roll_inode(&sc->tp, sc->ip);
+}
+
+/* Repair an inode's fields. */
+int
+xfs_repair_inode(
+	struct xfs_scrub_context	*sc)
+{
+	int				error = 0;
+
+	/*
+	 * No inode?  That means we failed the _iget verifiers.  Repair all
+	 * the things that the inode verifiers care about, then retry _iget.
+	 */
+	if (!sc->ip) {
+		error = xfs_repair_inode_problems(sc);
+		if (error)
+			goto out;
+	}
+
+	/* By this point we had better have a working incore inode. */
+	ASSERT(sc->ip);
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* If we found corruption of any kind, try to fix it. */
+	if ((sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ||
+	    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)) {
+		error = xfs_repair_inode_unchecked(sc);
+		if (error)
+			goto out;
+	}
+
+	/* See if we can clear the reflink flag. */
+	if (xfs_is_reflink_inode(sc->ip))
+		return xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 23c160558892..e3a763540780 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -109,6 +109,7 @@ int xfs_repair_allocbt(struct xfs_scrub_context *sc);
 int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
 int xfs_repair_rmapbt(struct xfs_scrub_context *sc);
 int xfs_repair_refcountbt(struct xfs_scrub_context *sc);
+int xfs_repair_inode(struct xfs_scrub_context *sc);
 
 #else
 
@@ -147,6 +148,7 @@ static inline int xfs_repair_rmapbt_setup(
 #define xfs_repair_iallocbt		xfs_repair_notsupported
 #define xfs_repair_rmapbt		xfs_repair_notsupported
 #define xfs_repair_refcountbt		xfs_repair_notsupported
+#define xfs_repair_inode		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index e4801fb6e632..77cbb955d8a8 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -293,7 +293,7 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xfs_scrub_setup_inode,
 		.scrub	= xfs_scrub_inode,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_inode,
 	},
 	[XFS_SCRUB_TYPE_BMBTD] = {	/* inode data fork */
 		.type	= ST_INODE,


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 14/21] xfs: zap broken inode forks
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 13/21] xfs: repair inode records Darrick J. Wong
@ 2018-06-24 19:24 ` Darrick J. Wong
  2018-07-04  2:07   ` Dave Chinner
  2018-06-24 19:25 ` [PATCH 15/21] xfs: repair inode block maps Darrick J. Wong
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:24 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true.  Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_attr_leaf.c |   32 ++-
 fs/xfs/libxfs/xfs_attr_leaf.h |    2 
 fs/xfs/libxfs/xfs_bmap.c      |   21 ++
 fs/xfs/libxfs/xfs_bmap.h      |    2 
 fs/xfs/scrub/inode_repair.c   |  399 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 437 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b3c19339e1b5..f6c458104934 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -894,23 +894,16 @@ xfs_attr_shortform_allfit(
 	return xfs_attr_shortform_bytesfit(dp, bytes);
 }
 
-/* Verify the consistency of an inline attribute fork. */
+/* Verify the consistency of a raw inline attribute fork. */
 xfs_failaddr_t
-xfs_attr_shortform_verify(
-	struct xfs_inode		*ip)
+xfs_attr_shortform_verify_struct(
+	struct xfs_attr_shortform	*sfp,
+	size_t				size)
 {
-	struct xfs_attr_shortform	*sfp;
 	struct xfs_attr_sf_entry	*sfep;
 	struct xfs_attr_sf_entry	*next_sfep;
 	char				*endp;
-	struct xfs_ifork		*ifp;
 	int				i;
-	int				size;
-
-	ASSERT(ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL);
-	ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
-	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
-	size = ifp->if_bytes;
 
 	/*
 	 * Give up if the attribute is way too short.
@@ -968,6 +961,23 @@ xfs_attr_shortform_verify(
 	return NULL;
 }
 
+/* Verify the consistency of an inline attribute fork. */
+xfs_failaddr_t
+xfs_attr_shortform_verify(
+	struct xfs_inode		*ip)
+{
+	struct xfs_attr_shortform	*sfp;
+	struct xfs_ifork		*ifp;
+	int				size;
+
+	ASSERT(ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL);
+	ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
+	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
+	size = ifp->if_bytes;
+
+	return xfs_attr_shortform_verify_struct(sfp, size);
+}
+
 /*
  * Convert a leaf attribute list to shortform attribute list
  */
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 7b74e18becff..728af25a1738 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -41,6 +41,8 @@ int	xfs_attr_shortform_to_leaf(struct xfs_da_args *args,
 int	xfs_attr_shortform_remove(struct xfs_da_args *args);
 int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
 int	xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
+xfs_failaddr_t xfs_attr_shortform_verify_struct(struct xfs_attr_shortform *sfp,
+		size_t size);
 xfs_failaddr_t xfs_attr_shortform_verify(struct xfs_inode *ip);
 void	xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b7f094e19bab..b1254e6c17b5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6223,18 +6223,16 @@ xfs_bmap_finish_one(
 	return error;
 }
 
-/* Check that an inode's extent does not have invalid flags or bad ranges. */
+/* Check that an extent does not have invalid flags or bad ranges. */
 xfs_failaddr_t
-xfs_bmap_validate_extent(
-	struct xfs_inode	*ip,
+xfs_bmbt_validate_extent(
+	struct xfs_mount	*mp,
+	bool			isrt,
 	int			whichfork,
 	struct xfs_bmbt_irec	*irec)
 {
-	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fsblock_t		endfsb;
-	bool			isrt;
 
-	isrt = XFS_IS_REALTIME_INODE(ip);
 	endfsb = irec->br_startblock + irec->br_blockcount - 1;
 	if (isrt) {
 		if (!xfs_verify_rtbno(mp, irec->br_startblock))
@@ -6258,3 +6256,14 @@ xfs_bmap_validate_extent(
 	}
 	return NULL;
 }
+
+/* Check that an inode's extent does not have invalid flags or bad ranges. */
+xfs_failaddr_t
+xfs_bmap_validate_extent(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*irec)
+{
+	return xfs_bmbt_validate_extent(ip->i_mount, XFS_IS_REALTIME_INODE(ip),
+			whichfork, irec);
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 9b49ddf99c41..7e3659604fa6 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -284,6 +284,8 @@ static inline int xfs_bmap_fork_to_state(int whichfork)
 	}
 }
 
+xfs_failaddr_t xfs_bmbt_validate_extent(struct xfs_mount *mp, bool isrt,
+		int whichfork, struct xfs_bmbt_irec *irec);
 xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork,
 		struct xfs_bmbt_irec *irec);
 
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 4ac43c1b1eb0..b941f21d7667 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -22,11 +22,15 @@
 #include "xfs_ialloc.h"
 #include "xfs_da_format.h"
 #include "xfs_reflink.h"
+#include "xfs_alloc.h"
 #include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
 #include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_bmap_util.h"
 #include "xfs_dir2.h"
 #include "xfs_quota_defs.h"
+#include "xfs_attr_leaf.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -113,7 +117,8 @@ xfs_repair_inode_mode(
 STATIC void
 xfs_repair_inode_flags(
 	struct xfs_scrub_context	*sc,
-	struct xfs_dinode		*dip)
+	struct xfs_dinode		*dip,
+	bool				is_rt_file)
 {
 	struct xfs_mount		*mp = sc->mp;
 	uint64_t			flags2;
@@ -132,6 +137,10 @@ xfs_repair_inode_flags(
 		flags2 &= ~XFS_DIFLAG2_REFLINK;
 	if (flags2 & XFS_DIFLAG2_REFLINK)
 		flags2 &= ~XFS_DIFLAG2_DAX;
+	if (is_rt_file)
+		flags |= XFS_DIFLAG_REALTIME;
+	else
+		flags &= ~XFS_DIFLAG_REALTIME;
 	dip->di_flags = cpu_to_be16(flags);
 	dip->di_flags2 = cpu_to_be64(flags2);
 }
@@ -210,17 +219,402 @@ xfs_repair_inode_extsize_hints(
 	}
 }
 
+struct xfs_repair_inode_fork_counters {
+	struct xfs_scrub_context	*sc;
+	xfs_rfsblock_t			data_blocks;
+	xfs_rfsblock_t			rt_blocks;
+	xfs_rfsblock_t			attr_blocks;
+	xfs_extnum_t			data_extents;
+	xfs_extnum_t			rt_extents;
+	xfs_aextnum_t			attr_extents;
+};
+
+/* Count extents and blocks for an inode given an rmap. */
+STATIC int
+xfs_repair_inode_count_rmap(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_inode_fork_counters	*rifc = priv;
+
+	/* Is this even the right fork? */
+	if (rec->rm_owner != rifc->sc->sm->sm_ino)
+		return 0;
+	if (rec->rm_flags & XFS_RMAP_ATTR_FORK) {
+		rifc->attr_blocks += rec->rm_blockcount;
+		if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+			rifc->attr_extents++;
+	} else {
+		rifc->data_blocks += rec->rm_blockcount;
+		if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+			rifc->data_extents++;
+	}
+	return 0;
+}
+
+/* Count extents and blocks for an inode from all AG rmap data. */
+STATIC int
+xfs_repair_inode_count_ag_rmaps(
+	struct xfs_repair_inode_fork_counters	*rifc,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agf;
+	int				error;
+
+	error = xfs_alloc_read_agf(rifc->sc->mp, rifc->sc->tp, agno, 0, &agf);
+	if (error)
+		return error;
+
+	cur = xfs_rmapbt_init_cursor(rifc->sc->mp, rifc->sc->tp, agf, agno);
+	if (!cur) {
+		error = -ENOMEM;
+		goto out_agf;
+	}
+
+	error = xfs_rmap_query_all(cur, xfs_repair_inode_count_rmap, rifc);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+out_agf:
+	xfs_trans_brelse(rifc->sc->tp, agf);
+	return error;
+}
+
+/* Count extents and blocks for a given inode from all rmap data. */
+STATIC int
+xfs_repair_inode_count_rmaps(
+	struct xfs_repair_inode_fork_counters	*rifc)
+{
+	xfs_agnumber_t			agno;
+	int				error;
+
+	if (!xfs_sb_version_hasrmapbt(&rifc->sc->mp->m_sb) ||
+	    xfs_sb_version_hasrealtime(&rifc->sc->mp->m_sb))
+		return -EOPNOTSUPP;
+
+	/* XXX: find rt blocks too */
+
+	for (agno = 0; agno < rifc->sc->mp->m_sb.sb_agcount; agno++) {
+		error = xfs_repair_inode_count_ag_rmaps(rifc, agno);
+		if (error)
+			return error;
+	}
+
+	/* Can't have extents on both the rt and the data device. */
+	if (rifc->data_extents && rifc->rt_extents)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Figure out if we need to zap this extents format fork. */
+STATIC bool
+xfs_repair_inode_core_check_extents_fork(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip,
+	int				dfork_size,
+	int				whichfork)
+{
+	struct xfs_bmbt_irec		new;
+	struct xfs_bmbt_rec		*dp;
+	bool				isrt;
+	int				i;
+	int				nex;
+	int				fork_size;
+
+	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
+	fork_size = nex * sizeof(struct xfs_bmbt_rec);
+	if (fork_size < 0 || fork_size > dfork_size)
+		return true;
+	dp = (struct xfs_bmbt_rec *)XFS_DFORK_PTR(dip, whichfork);
+
+	isrt = dip->di_flags & cpu_to_be16(XFS_DIFLAG_REALTIME);
+	for (i = 0; i < nex; i++, dp++) {
+		xfs_failaddr_t	fa;
+
+		xfs_bmbt_disk_get_all(dp, &new);
+		fa = xfs_bmbt_validate_extent(sc->mp, isrt, whichfork, &new);
+		if (fa)
+			return true;
+	}
+
+	return false;
+}
+
+/* Figure out if we need to zap this btree format fork. */
+STATIC bool
+xfs_repair_inode_core_check_btree_fork(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip,
+	int				dfork_size,
+	int				whichfork)
+{
+	struct xfs_bmdr_block		*dfp;
+	int				nrecs;
+	int				level;
+
+	if (XFS_DFORK_NEXTENTS(dip, whichfork) <=
+			dfork_size / sizeof(struct xfs_bmbt_irec))
+		return true;
+
+	dfp = (struct xfs_bmdr_block *)XFS_DFORK_PTR(dip, whichfork);
+	nrecs = be16_to_cpu(dfp->bb_numrecs);
+	level = be16_to_cpu(dfp->bb_level);
+
+	if (nrecs == 0 || XFS_BMDR_SPACE_CALC(nrecs) > dfork_size)
+		return true;
+	if (level == 0 || level > XFS_BTREE_MAXLEVELS)
+		return true;
+	return false;
+}
+
+/*
+ * Check the data fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xfs_repair_inode_core_check_data_fork(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip,
+	uint16_t			mode)
+{
+	uint64_t			size;
+	int				dfork_size;
+
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		if (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK) != XFS_DINODE_FMT_DEV)
+			return true;
+		break;
+	case S_IFREG:
+	case S_IFLNK:
+	case S_IFDIR:
+		switch (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK)) {
+		case XFS_DINODE_FMT_LOCAL:
+		case XFS_DINODE_FMT_EXTENTS:
+		case XFS_DINODE_FMT_BTREE:
+			break;
+		default:
+			return true;
+		}
+		break;
+	default:
+		return true;
+	}
+	dfork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_DATA_FORK);
+	switch (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK)) {
+	case XFS_DINODE_FMT_DEV:
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		if (size > dfork_size)
+			return true;
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xfs_repair_inode_core_check_extents_fork(sc, dip,
+				dfork_size, XFS_DATA_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xfs_repair_inode_core_check_btree_fork(sc, dip,
+				dfork_size, XFS_DATA_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+/* Reset the data fork to something sane. */
+STATIC void
+xfs_repair_inode_core_zap_data_fork(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip,
+	uint16_t			mode,
+	struct xfs_repair_inode_fork_counters	*rifc)
+{
+	char				*p;
+	const struct xfs_dir_ops	*ops;
+	struct xfs_dir2_sf_hdr		*sfp;
+	int				i8count;
+
+	/* Special files always get reset to DEV */
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		dip->di_format = XFS_DINODE_FMT_DEV;
+		dip->di_size = 0;
+		return;
+	}
+
+	/*
+	 * If we have data extents, reset to an empty map and hope the user
+	 * will run the bmapbtd checker next.
+	 */
+	if (rifc->data_extents || rifc->rt_extents || S_ISREG(mode)) {
+		dip->di_format = XFS_DINODE_FMT_EXTENTS;
+		dip->di_nextents = 0;
+		return;
+	}
+
+	/* Otherwise, reset the local format to the minimum. */
+	switch (mode & S_IFMT) {
+	case S_IFLNK:
+		/* Blow out symlink; now it points to root dir */
+		dip->di_format = XFS_DINODE_FMT_LOCAL;
+		dip->di_size = cpu_to_be64(1);
+		p = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+		*p = '/';
+		break;
+	case S_IFDIR:
+		/*
+		 * Blow out dir, make it point to the root.  In the
+		 * future the direction repair will reconstruct this
+		 * dir for us.
+		 */
+		dip->di_format = XFS_DINODE_FMT_LOCAL;
+		i8count = sc->mp->m_sb.sb_rootino > XFS_DIR2_MAX_SHORT_INUM;
+		ops = xfs_dir_get_ops(sc->mp, NULL);
+		sfp = (struct xfs_dir2_sf_hdr *)XFS_DFORK_PTR(dip,
+				XFS_DATA_FORK);
+		sfp->count = 0;
+		sfp->i8count = i8count;
+		ops->sf_put_parent_ino(sfp, sc->mp->m_sb.sb_rootino);
+		dip->di_size = cpu_to_be64(xfs_dir2_sf_hdr_size(i8count));
+		break;
+	}
+}
+
+/*
+ * Check the attr fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xfs_repair_inode_core_check_attr_fork(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip)
+{
+	struct xfs_attr_shortform	*sfp;
+	int				size;
+
+	if (XFS_DFORK_BOFF(dip) == 0)
+		return dip->di_aformat != XFS_DINODE_FMT_EXTENTS ||
+		       dip->di_anextents != 0;
+
+	size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
+	switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) {
+	case XFS_DINODE_FMT_LOCAL:
+		sfp = (struct xfs_attr_shortform *)XFS_DFORK_PTR(dip,
+				XFS_ATTR_FORK);
+		return xfs_attr_shortform_verify_struct(sfp, size) != NULL;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xfs_repair_inode_core_check_extents_fork(sc, dip, size,
+				XFS_ATTR_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xfs_repair_inode_core_check_btree_fork(sc, dip, size,
+				XFS_ATTR_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+/* Reset the attr fork to something sane. */
+STATIC void
+xfs_repair_inode_core_zap_attr_fork(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip,
+	struct xfs_repair_inode_fork_counters	*rifc)
+{
+	dip->di_aformat = XFS_DINODE_FMT_EXTENTS;
+	dip->di_anextents = 0;
+	/*
+	 * We leave a nonzero forkoff so that the bmap scrub will look for
+	 * attr rmaps.
+	 */
+	dip->di_forkoff = rifc->attr_extents ? 1 : 0;
+}
+
+/*
+ * Zap the data/attr forks if we spot anything that isn't going to pass the
+ * ifork verifiers or the ifork formatters, because we need to get the inode
+ * into good enough shape that the higher level repair functions can run.
+ */
+STATIC void
+xfs_repair_inode_core_zap_forks(
+	struct xfs_scrub_context	*sc,
+	struct xfs_dinode		*dip,
+	struct xfs_repair_inode_fork_counters	*rifc)
+{
+	uint16_t			mode;
+	bool				zap_datafork = false;
+	bool				zap_attrfork = false;
+
+	mode = be16_to_cpu(dip->di_mode);
+
+	/* Inode counters don't make sense? */
+	if (be32_to_cpu(dip->di_nextents) > be64_to_cpu(dip->di_nblocks))
+		zap_datafork = true;
+	if (be16_to_cpu(dip->di_anextents) > be64_to_cpu(dip->di_nblocks))
+		zap_attrfork = true;
+	if (be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
+			be64_to_cpu(dip->di_nblocks))
+		zap_datafork = zap_attrfork = true;
+
+	if (!zap_datafork)
+		zap_datafork = xfs_repair_inode_core_check_data_fork(sc, dip,
+				mode);
+	if (!zap_attrfork)
+		zap_attrfork = xfs_repair_inode_core_check_attr_fork(sc, dip);
+
+	/* Zap whatever's bad. */
+	if (zap_attrfork)
+		xfs_repair_inode_core_zap_attr_fork(sc, dip, rifc);
+	if (zap_datafork)
+		xfs_repair_inode_core_zap_data_fork(sc, dip, mode, rifc);
+	dip->di_nblocks = 0;
+	if (!zap_attrfork)
+		be64_add_cpu(&dip->di_nblocks, rifc->attr_blocks);
+	if (!zap_datafork) {
+		be64_add_cpu(&dip->di_nblocks, rifc->data_blocks);
+		be64_add_cpu(&dip->di_nblocks, rifc->rt_blocks);
+	}
+}
+
 /* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
 STATIC int
 xfs_repair_inode_core(
 	struct xfs_scrub_context	*sc)
 {
+	struct xfs_repair_inode_fork_counters	rifc;
 	struct xfs_imap			imap;
 	struct xfs_buf			*bp;
 	struct xfs_dinode		*dip;
 	xfs_ino_t			ino;
 	int				error;
 
+	/* Figure out what this inode had mapped in both forks. */
+	memset(&rifc, 0, sizeof(rifc));
+	rifc.sc = sc;
+	error = xfs_repair_inode_count_rmaps(&rifc);
+	if (error)
+		return error;
+
 	/* Map & read inode. */
 	ino = sc->sm->sm_ino;
 	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
@@ -240,9 +634,10 @@ xfs_repair_inode_core(
 	dip = xfs_buf_offset(bp, imap.im_boffset);
 	xfs_repair_inode_header(sc, dip);
 	xfs_repair_inode_mode(dip);
-	xfs_repair_inode_flags(sc, dip);
+	xfs_repair_inode_flags(sc, dip, rifc.rt_extents > 0);
 	xfs_repair_inode_size(dip);
 	xfs_repair_inode_extsize_hints(sc, dip);
+	xfs_repair_inode_core_zap_forks(sc, dip, &rifc);
 
 	/* Write out the inode... */
 	xfs_dinode_calc_crc(sc->mp, dip);


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 15/21] xfs: repair inode block maps
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (13 preceding siblings ...)
  2018-06-24 19:24 ` [PATCH 14/21] xfs: zap broken inode forks Darrick J. Wong
@ 2018-06-24 19:25 ` Darrick J. Wong
  2018-07-04  3:00   ` Dave Chinner
  2018-06-24 19:25 ` [PATCH 16/21] xfs: repair damaged symlinks Darrick J. Wong
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:25 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the reverse-mapping btree information to rebuild an inode fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/bmap.c        |    8 +
 fs/xfs/scrub/bmap_repair.c |  488 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h      |    4 
 fs/xfs/scrub/scrub.c       |    4 
 5 files changed, 503 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/bmap_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index f47f0fe0e70a..928c7dd0a28d 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -165,6 +165,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
+				   bmap_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   refcount_repair.o \
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 3d08589f5c60..cf40d65398e6 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -57,6 +57,14 @@ xfs_scrub_setup_inode_bmap(
 		error = filemap_write_and_wait(VFS_I(sc->ip)->i_mapping);
 		if (error)
 			goto out;
+
+		/* Drop the page cache if we're repairing block mappings. */
+		if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) {
+			error = invalidate_inode_pages2(
+					VFS_I(sc->ip)->i_mapping);
+			if (error)
+				goto out;
+		}
 	}
 
 	/* Got the inode, lock it and we're ready to go. */
diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c
new file mode 100644
index 000000000000..def391a897b6
--- /dev/null
+++ b/fs/xfs/scrub/bmap_repair.c
@@ -0,0 +1,488 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_quota.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/* Inode fork block mapping (BMBT) repair. */
+
+struct xfs_repair_bmap_extent {
+	struct list_head		list;
+	struct xfs_rmap_irec		rmap;
+	xfs_agnumber_t			agno;
+};
+
+struct xfs_repair_bmap {
+	struct list_head		*extlist;
+	struct xfs_repair_extent_list	*btlist;
+	struct xfs_scrub_context	*sc;
+	xfs_ino_t			ino;
+	xfs_rfsblock_t			otherfork_blocks;
+	xfs_rfsblock_t			bmbt_blocks;
+	xfs_extnum_t			extents;
+	int				whichfork;
+};
+
+/* Record extents that belong to this inode's fork. */
+STATIC int
+xfs_repair_bmap_extent_fn(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_bmap		*rb = priv;
+	struct xfs_repair_bmap_extent	*rbe;
+	struct xfs_mount		*mp = cur->bc_mp;
+	xfs_fsblock_t			fsbno;
+	int				error = 0;
+
+	if (xfs_scrub_should_terminate(rb->sc, &error))
+		return error;
+
+	/* Skip extents which are not owned by this inode and fork. */
+	if (rec->rm_owner != rb->ino) {
+		return 0;
+	} else if (rb->whichfork == XFS_DATA_FORK &&
+		 (rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
+		rb->otherfork_blocks += rec->rm_blockcount;
+		return 0;
+	} else if (rb->whichfork == XFS_ATTR_FORK &&
+		 !(rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
+		rb->otherfork_blocks += rec->rm_blockcount;
+		return 0;
+	}
+
+	rb->extents++;
+
+	/* Delete the old bmbt blocks later. */
+	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
+		fsbno = XFS_AGB_TO_FSB(mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		rb->bmbt_blocks += rec->rm_blockcount;
+		return xfs_repair_collect_btree_extent(rb->sc, rb->btlist,
+				fsbno, rec->rm_blockcount);
+	}
+
+	/* Remember this rmap. */
+	trace_xfs_repair_bmap_extent_fn(mp, cur->bc_private.a.agno,
+			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
+			rec->rm_offset, rec->rm_flags);
+
+	rbe = kmem_alloc(sizeof(struct xfs_repair_bmap_extent), KM_MAYFAIL);
+	if (!rbe)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&rbe->list);
+	rbe->rmap = *rec;
+	rbe->agno = cur->bc_private.a.agno;
+	list_add_tail(&rbe->list, rb->extlist);
+
+	return 0;
+}
+
+/* Compare two bmap extents. */
+static int
+xfs_repair_bmap_extent_cmp(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_repair_bmap_extent	*ap;
+	struct xfs_repair_bmap_extent	*bp;
+
+	ap = container_of(a, struct xfs_repair_bmap_extent, list);
+	bp = container_of(b, struct xfs_repair_bmap_extent, list);
+
+	if (ap->rmap.rm_offset > bp->rmap.rm_offset)
+		return 1;
+	else if (ap->rmap.rm_offset < bp->rmap.rm_offset)
+		return -1;
+	return 0;
+}
+
+/* Scan one AG for reverse mappings that we can turn into extent maps. */
+STATIC int
+xfs_repair_bmap_scan_ag(
+	struct xfs_repair_bmap		*rb,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_scrub_context	*sc = rb->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agf_bp = NULL;
+	struct xfs_btree_cur		*cur;
+	int				error;
+
+	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return -ENOMEM;
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, agno);
+	error = xfs_rmap_query_all(cur, xfs_repair_bmap_extent_fn, rb);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR :
+			XFS_BTREE_NOERROR);
+	xfs_trans_brelse(sc->tp, agf_bp);
+	return error;
+}
+
+/* Insert bmap records into an inode fork, given an rmap. */
+STATIC int
+xfs_repair_bmap_insert_rec(
+	struct xfs_scrub_context	*sc,
+	struct xfs_repair_bmap_extent	*rbe,
+	int				baseflags)
+{
+	struct xfs_bmbt_irec		bmap;
+	struct xfs_defer_ops		dfops;
+	xfs_fsblock_t			firstfsb;
+	xfs_extlen_t			extlen;
+	int				flags;
+	int				error = 0;
+
+	/* Form the "new" mapping... */
+	bmap.br_startblock = XFS_AGB_TO_FSB(sc->mp, rbe->agno,
+			rbe->rmap.rm_startblock);
+	bmap.br_startoff = rbe->rmap.rm_offset;
+
+	flags = 0;
+	if (rbe->rmap.rm_flags & XFS_RMAP_UNWRITTEN)
+		flags = XFS_BMAPI_PREALLOC;
+	while (rbe->rmap.rm_blockcount > 0) {
+		xfs_defer_init(&dfops, &firstfsb);
+		extlen = min_t(xfs_extlen_t, rbe->rmap.rm_blockcount,
+				MAXEXTLEN);
+		bmap.br_blockcount = extlen;
+
+		/* Re-add the extent to the fork. */
+		error = xfs_bmapi_remap(sc->tp, sc->ip,
+				bmap.br_startoff, extlen,
+				bmap.br_startblock, &dfops,
+				baseflags | flags);
+		if (error)
+			goto out_cancel;
+
+		bmap.br_startblock += extlen;
+		bmap.br_startoff += extlen;
+		rbe->rmap.rm_blockcount -= extlen;
+		error = xfs_defer_ijoin(&dfops, sc->ip);
+		if (error)
+			goto out_cancel;
+		error = xfs_defer_finish(&sc->tp, &dfops);
+		if (error)
+			goto out;
+		/* Make sure we roll the transaction. */
+		error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+		if (error)
+			goto out;
+	}
+
+	return 0;
+out_cancel:
+	xfs_defer_cancel(&dfops);
+out:
+	return error;
+}
+
+/* Check for garbage inputs. */
+STATIC int
+xfs_repair_bmap_check_inputs(
+	struct xfs_scrub_context	*sc,
+	int				whichfork)
+{
+	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
+
+	/* Don't know how to repair the other fork formats. */
+	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
+		return -EOPNOTSUPP;
+
+	/* Only files, symlinks, and directories get to have data forks. */
+	if (whichfork == XFS_DATA_FORK && !S_ISREG(VFS_I(sc->ip)->i_mode) &&
+	    !S_ISDIR(VFS_I(sc->ip)->i_mode) && !S_ISLNK(VFS_I(sc->ip)->i_mode))
+		return -EINVAL;
+
+	/* If we somehow have delalloc extents, forget it. */
+	if (whichfork == XFS_DATA_FORK && sc->ip->i_delayed_blks)
+		return -EBUSY;
+
+	/*
+	 * If there's no attr fork area in the inode, there's
+	 * no attr fork to rebuild.
+	 */
+	if (whichfork == XFS_ATTR_FORK && !XFS_IFORK_Q(sc->ip))
+		return -ENOENT;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&sc->mp->m_sb))
+		return -EOPNOTSUPP;
+
+	/* Don't know how to rebuild realtime data forks. */
+	if (XFS_IS_REALTIME_INODE(sc->ip) && whichfork == XFS_DATA_FORK)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+/*
+ * Collect block mappings for this fork of this inode and decide if we have
+ * enough space to rebuild.  Caller is responsible for cleaning up the list if
+ * anything goes wrong.
+ */
+STATIC int
+xfs_repair_bmap_find_mappings(
+	struct xfs_scrub_context	*sc,
+	int				whichfork,
+	struct list_head		*mapping_records,
+	struct xfs_repair_extent_list	*old_bmbt_blocks,
+	xfs_rfsblock_t			*old_bmbt_block_count,
+	xfs_rfsblock_t			*otherfork_blocks)
+{
+	struct xfs_repair_bmap		rb;
+	xfs_agnumber_t			agno;
+	unsigned int			resblks;
+	int				error;
+
+	memset(&rb, 0, sizeof(rb));
+	rb.extlist = mapping_records;
+	rb.btlist = old_bmbt_blocks;
+	rb.ino = sc->ip->i_ino;
+	rb.whichfork = whichfork;
+	rb.sc = sc;
+
+	/* Iterate the rmaps for extents. */
+	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
+		error = xfs_repair_bmap_scan_ag(&rb, agno);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Guess how many blocks we're going to need to rebuild an entire bmap
+	 * from the number of extents we found, and pump up our transaction to
+	 * have sufficient block reservation.
+	 */
+	resblks = xfs_bmbt_calc_size(sc->mp, rb.extents);
+	error = xfs_trans_reserve_more(sc->tp, resblks, 0);
+	if (error)
+		return error;
+
+	*otherfork_blocks = rb.otherfork_blocks;
+	*old_bmbt_block_count = rb.bmbt_blocks;
+	return 0;
+}
+
+/* Update the inode counters. */
+STATIC int
+xfs_repair_bmap_reset_counters(
+	struct xfs_scrub_context	*sc,
+	xfs_rfsblock_t			old_bmbt_block_count,
+	xfs_rfsblock_t			otherfork_blocks,
+	int				*log_flags)
+{
+	int				error;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/*
+	 * Drop the block counts associated with this fork since we'll re-add
+	 * them with the bmap routines later.
+	 */
+	sc->ip->i_d.di_nblocks = otherfork_blocks;
+	*log_flags |= XFS_ILOG_CORE;
+
+	if (!old_bmbt_block_count)
+		return 0;
+
+	/* Release quota counts for the old bmbt blocks. */
+	error = xfs_repair_ino_dqattach(sc);
+	if (error)
+		return error;
+	xfs_trans_mod_dquot_byino(sc->tp, sc->ip, XFS_TRANS_DQ_BCOUNT,
+			-(int64_t)old_bmbt_block_count);
+	return 0;
+}
+
+/* Initialize a new fork and implant it in the inode. */
+STATIC void
+xfs_repair_bmap_reset_fork(
+	struct xfs_scrub_context	*sc,
+	int				whichfork,
+	bool				has_mappings,
+	int				*log_flags)
+{
+	/* Set us back to extents format with zero records. */
+	XFS_IFORK_FMT_SET(sc->ip, whichfork, XFS_DINODE_FMT_EXTENTS);
+	XFS_IFORK_NEXT_SET(sc->ip, whichfork, 0);
+
+	/* Reinitialize the on-disk fork. */
+	if (XFS_IFORK_PTR(sc->ip, whichfork) != NULL)
+		xfs_idestroy_fork(sc->ip, whichfork);
+	if (whichfork == XFS_DATA_FORK) {
+		memset(&sc->ip->i_df, 0, sizeof(struct xfs_ifork));
+		sc->ip->i_df.if_flags |= XFS_IFEXTENTS;
+	} else if (whichfork == XFS_ATTR_FORK) {
+		if (has_mappings) {
+			sc->ip->i_afp = NULL;
+		} else {
+			sc->ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone,
+					KM_SLEEP);
+			sc->ip->i_afp->if_flags |= XFS_IFEXTENTS;
+		}
+	}
+	*log_flags |= XFS_ILOG_CORE;
+}
+
+/* Build new fork mappings and dispose of the old bmbt blocks. */
+STATIC int
+xfs_repair_bmap_rebuild_tree(
+	struct xfs_scrub_context	*sc,
+	int				whichfork,
+	struct list_head		*mapping_records,
+	struct xfs_repair_extent_list	*old_bmbt_blocks)
+{
+	struct xfs_owner_info		oinfo;
+	struct xfs_repair_bmap_extent	*rbe;
+	struct xfs_repair_bmap_extent	*n;
+	int				baseflags;
+	int				error;
+
+	baseflags = XFS_BMAPI_NORMAP;
+	if (whichfork == XFS_ATTR_FORK)
+		baseflags |= XFS_BMAPI_ATTRFORK;
+
+	/* "Remap" the extents into the fork. */
+	list_sort(NULL, mapping_records, xfs_repair_bmap_extent_cmp);
+	list_for_each_entry_safe(rbe, n, mapping_records, list) {
+		error = xfs_repair_bmap_insert_rec(sc, rbe, baseflags);
+		if (error)
+			return error;
+		list_del(&rbe->list);
+		kmem_free(rbe);
+	}
+
+	/* Dispose of all the old bmbt blocks. */
+	xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, whichfork);
+	return xfs_repair_reap_btree_extents(sc, old_bmbt_blocks, &oinfo,
+			XFS_AG_RESV_NONE);
+}
+
+/* Free every record in the mapping list. */
+STATIC void
+xfs_repair_bmap_cancel_bmbtrecs(
+	struct list_head		*recs)
+{
+	struct xfs_repair_bmap_extent	*rbe;
+	struct xfs_repair_bmap_extent	*n;
+
+	list_for_each_entry_safe(rbe, n, recs, list) {
+		list_del(&rbe->list);
+		kmem_free(rbe);
+	}
+}
+
+/* Repair an inode fork. */
+STATIC int
+xfs_repair_bmap(
+	struct xfs_scrub_context	*sc,
+	int				whichfork)
+{
+	struct list_head		mapping_records;
+	struct xfs_repair_extent_list	old_bmbt_blocks;
+	struct xfs_inode		*ip = sc->ip;
+	xfs_rfsblock_t			old_bmbt_block_count;
+	xfs_rfsblock_t			otherfork_blocks;
+	int				log_flags = 0;
+	int				error = 0;
+
+	error = xfs_repair_bmap_check_inputs(sc, whichfork);
+	if (error)
+		return error;
+
+	/*
+	 * If this is a file data fork, wait for all pending directio to
+	 * complete, then tear everything out of the page cache.
+	 */
+	if (S_ISREG(VFS_I(ip)->i_mode) && whichfork == XFS_DATA_FORK) {
+		inode_dio_wait(VFS_I(ip));
+		truncate_inode_pages(VFS_I(ip)->i_mapping, 0);
+	}
+
+	/* Collect all reverse mappings for this fork's extents. */
+	INIT_LIST_HEAD(&mapping_records);
+	xfs_repair_init_extent_list(&old_bmbt_blocks);
+	error = xfs_repair_bmap_find_mappings(sc, whichfork, &mapping_records,
+			&old_bmbt_blocks, &old_bmbt_block_count,
+			&otherfork_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the in-core fork and zero the on-disk fork.  This is the
+	 * point at which we are no longer able to bail out gracefully.
+	 */
+	error = xfs_repair_bmap_reset_counters(sc, old_bmbt_block_count,
+			otherfork_blocks, &log_flags);
+	if (error)
+		goto out;
+	xfs_repair_bmap_reset_fork(sc, whichfork, list_empty(&mapping_records),
+			&log_flags);
+	xfs_trans_log_inode(sc->tp, sc->ip, log_flags);
+	error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+	if (error)
+		goto out;
+
+	/* Now rebuild the fork extent map information. */
+	error = xfs_repair_bmap_rebuild_tree(sc, whichfork, &mapping_records,
+			&old_bmbt_blocks);
+out:
+	xfs_repair_cancel_btree_extents(sc, &old_bmbt_blocks);
+	xfs_repair_bmap_cancel_bmbtrecs(&mapping_records);
+	return error;
+}
+
+/* Repair an inode's data fork. */
+int
+xfs_repair_bmap_data(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_repair_bmap(sc, XFS_DATA_FORK);
+}
+
+/* Repair an inode's attr fork. */
+int
+xfs_repair_bmap_attr(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_repair_bmap(sc, XFS_ATTR_FORK);
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index e3a763540780..a832ed485e4e 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -110,6 +110,8 @@ int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
 int xfs_repair_rmapbt(struct xfs_scrub_context *sc);
 int xfs_repair_refcountbt(struct xfs_scrub_context *sc);
 int xfs_repair_inode(struct xfs_scrub_context *sc);
+int xfs_repair_bmap_data(struct xfs_scrub_context *sc);
+int xfs_repair_bmap_attr(struct xfs_scrub_context *sc);
 
 #else
 
@@ -149,6 +151,8 @@ static inline int xfs_repair_rmapbt_setup(
 #define xfs_repair_rmapbt		xfs_repair_notsupported
 #define xfs_repair_refcountbt		xfs_repair_notsupported
 #define xfs_repair_inode		xfs_repair_notsupported
+#define xfs_repair_bmap_data		xfs_repair_notsupported
+#define xfs_repair_bmap_attr		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 77cbb955d8a8..eecb96fe2feb 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -299,13 +299,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xfs_scrub_setup_inode_bmap,
 		.scrub	= xfs_scrub_bmap_data,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_bmap_data,
 	},
 	[XFS_SCRUB_TYPE_BMBTA] = {	/* inode attr fork */
 		.type	= ST_INODE,
 		.setup	= xfs_scrub_setup_inode_bmap,
 		.scrub	= xfs_scrub_bmap_attr,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_bmap_attr,
 	},
 	[XFS_SCRUB_TYPE_BMBTC] = {	/* inode CoW fork */
 		.type	= ST_INODE,


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 16/21] xfs: repair damaged symlinks
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (14 preceding siblings ...)
  2018-06-24 19:25 ` [PATCH 15/21] xfs: repair inode block maps Darrick J. Wong
@ 2018-06-24 19:25 ` Darrick J. Wong
  2018-07-04  5:45   ` Dave Chinner
  2018-06-24 19:25 ` [PATCH 17/21] xfs: repair extended attributes Darrick J. Wong
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:25 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Repair inconsistent symbolic link data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile               |    1 
 fs/xfs/scrub/repair.h         |    2 
 fs/xfs/scrub/scrub.c          |    2 
 fs/xfs/scrub/symlink.c        |    2 
 fs/xfs/scrub/symlink_repair.c |  301 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 306 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/symlink_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 928c7dd0a28d..36156166fef0 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -171,6 +171,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   refcount_repair.o \
 				   repair.o \
 				   rmap_repair.o \
+				   symlink_repair.o \
 				   )
 endif
 endif
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index a832ed485e4e..14fa8cf89799 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -112,6 +112,7 @@ int xfs_repair_refcountbt(struct xfs_scrub_context *sc);
 int xfs_repair_inode(struct xfs_scrub_context *sc);
 int xfs_repair_bmap_data(struct xfs_scrub_context *sc);
 int xfs_repair_bmap_attr(struct xfs_scrub_context *sc);
+int xfs_repair_symlink(struct xfs_scrub_context *sc);
 
 #else
 
@@ -153,6 +154,7 @@ static inline int xfs_repair_rmapbt_setup(
 #define xfs_repair_inode		xfs_repair_notsupported
 #define xfs_repair_bmap_data		xfs_repair_notsupported
 #define xfs_repair_bmap_attr		xfs_repair_notsupported
+#define xfs_repair_symlink		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index eecb96fe2feb..6d7ae6e0e165 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -329,7 +329,7 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xfs_scrub_setup_symlink,
 		.scrub	= xfs_scrub_symlink,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_symlink,
 	},
 	[XFS_SCRUB_TYPE_PARENT] = {	/* parent pointers */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c
index 570a89812116..7f2ba4082705 100644
--- a/fs/xfs/scrub/symlink.c
+++ b/fs/xfs/scrub/symlink.c
@@ -34,7 +34,7 @@ xfs_scrub_setup_symlink(
 	if (!sc->buf)
 		return -ENOMEM;
 
-	return xfs_scrub_setup_inode_contents(sc, ip, 0);
+	return xfs_scrub_setup_inode_contents(sc, ip, XFS_SYMLINK_MAPS);
 }
 
 /* Symbolic links. */
diff --git a/fs/xfs/scrub/symlink_repair.c b/fs/xfs/scrub/symlink_repair.c
new file mode 100644
index 000000000000..acc51eb3a879
--- /dev/null
+++ b/fs/xfs/scrub/symlink_repair.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_symlink.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Symbolic Link Repair
+ * ====================
+ *
+ * There's not much we can do to repair symbolic links -- we truncate them to
+ * the first NULL byte and fix up the remote target block headers if they're
+ * incorrect.  Zero-length symlinks are turned into links to /.
+ */
+
+/* Blow out the whole symlink; replace contents. */
+STATIC int
+xfs_repair_symlink_rewrite(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	const char		*target_path,
+	int			pathlen)
+{
+	struct xfs_defer_ops	dfops;
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_ifork	*ifp;
+	const char		*cur_chunk;
+	struct xfs_mount	*mp = (*tpp)->t_mountp;
+	struct xfs_buf		*bp;
+	xfs_fsblock_t		first_block;
+	xfs_fileoff_t		first_fsb;
+	xfs_filblks_t		fs_blocks;
+	xfs_daddr_t		d;
+	int			byte_cnt;
+	int			n;
+	int			nmaps;
+	int			offset;
+	int			error = 0;
+
+	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+
+	/* Truncate the whole data fork if it wasn't inline. */
+	if (!(ifp->if_flags & XFS_IFINLINE)) {
+		error = xfs_itruncate_extents(tpp, ip, XFS_DATA_FORK, 0);
+		if (error)
+			goto out;
+	}
+
+	/* Blow out the in-core fork and zero the on-disk fork. */
+	xfs_idestroy_fork(ip, XFS_DATA_FORK);
+	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
+	ip->i_d.di_nextents = 0;
+	memset(&ip->i_df, 0, sizeof(struct xfs_ifork));
+	ip->i_df.if_flags |= XFS_IFEXTENTS;
+
+	/* Rewrite an inline symlink. */
+	if (pathlen <= XFS_IFORK_DSIZE(ip)) {
+		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
+
+		i_size_write(VFS_I(ip), pathlen);
+		ip->i_d.di_size = pathlen;
+		ip->i_d.di_format = XFS_DINODE_FMT_LOCAL;
+		xfs_trans_log_inode(*tpp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+		goto out;
+
+	}
+
+	/* Rewrite a remote symlink. */
+	fs_blocks = xfs_symlink_blocks(mp, pathlen);
+	first_fsb = 0;
+	nmaps = XFS_SYMLINK_MAPS;
+
+	/* Reserve quota for new blocks. */
+	error = xfs_trans_reserve_quota_nblks(*tpp, ip, fs_blocks, 0,
+			XFS_QMOPT_RES_REGBLKS);
+	if (error)
+		goto out;
+
+	/* Map blocks, write symlink target. */
+	xfs_defer_init(&dfops, &first_block);
+
+	error = xfs_bmapi_write(*tpp, ip, first_fsb, fs_blocks,
+			  XFS_BMAPI_METADATA, &first_block, fs_blocks,
+			  mval, &nmaps, &dfops);
+	if (error)
+		goto out_bmap_cancel;
+
+	ip->i_d.di_size = pathlen;
+	i_size_write(VFS_I(ip), pathlen);
+	xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
+
+	cur_chunk = target_path;
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		char	*buf;
+
+		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
+		bp = xfs_trans_get_buf(*tpp, mp->m_ddev_targp, d,
+				       BTOBB(byte_cnt), 0);
+		if (!bp) {
+			error = -ENOMEM;
+			goto out_bmap_cancel;
+		}
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
+		byte_cnt = min(byte_cnt, pathlen);
+
+		buf = bp->b_addr;
+		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
+					   byte_cnt, bp);
+
+		memcpy(buf, cur_chunk, byte_cnt);
+
+		cur_chunk += byte_cnt;
+		pathlen -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_trans_buf_set_type(*tpp, bp, XFS_BLFT_SYMLINK_BUF);
+		xfs_trans_log_buf(*tpp, bp, 0, (buf + byte_cnt - 1) -
+						(char *)bp->b_addr);
+	}
+	ASSERT(pathlen == 0);
+
+	error = xfs_defer_finish(tpp, &dfops);
+	if (error)
+		goto out_bmap_cancel;
+
+	return 0;
+
+out_bmap_cancel:
+	xfs_defer_cancel(&dfops);
+out:
+	return error;
+}
+
+/* Fix everything that fails the verifiers in the remote blocks. */
+STATIC int
+xfs_repair_symlink_fix_remotes(
+	struct xfs_scrub_context	*sc,
+	loff_t				len)
+{
+	struct xfs_bmbt_irec		mval[XFS_SYMLINK_MAPS];
+	struct xfs_buf			*bp;
+	xfs_filblks_t			fsblocks;
+	xfs_daddr_t			d;
+	loff_t				offset;
+	unsigned int			byte_cnt;
+	int				n;
+	int				nmaps = XFS_SYMLINK_MAPS;
+	int				nr;
+	int				error;
+
+	fsblocks = xfs_symlink_blocks(sc->mp, len);
+	error = xfs_bmapi_read(sc->ip, 0, fsblocks, mval, &nmaps, 0);
+	if (error)
+		return error;
+
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		d = XFS_FSB_TO_DADDR(sc->mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(sc->mp, mval[n].br_blockcount);
+
+		error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+				d, BTOBB(byte_cnt), 0, &bp, NULL);
+		if (error)
+			return error;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(sc->mp, byte_cnt);
+		if (len < byte_cnt)
+			byte_cnt = len;
+
+		nr = xfs_symlink_hdr_set(sc->mp, sc->ip->i_ino, offset,
+				byte_cnt, bp);
+
+		len -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SYMLINK_BUF);
+		xfs_trans_log_buf(sc->tp, bp, 0, nr - 1);
+		xfs_trans_brelse(sc->tp, bp);
+	}
+	if (len != 0)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Fix this inline symlink. */
+STATIC int
+xfs_repair_symlink_inline(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_inode		*ip = sc->ip;
+	struct xfs_ifork		*ifp;
+	loff_t				len;
+	size_t				newlen;
+
+	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	len = i_size_read(VFS_I(ip));
+	xfs_trans_ijoin(sc->tp, ip, 0);
+
+	if (ifp->if_u1.if_data) {
+		newlen = strnlen(ifp->if_u1.if_data, XFS_IFORK_DSIZE(ip));
+	} else {
+		/* Zero length symlink becomes a root symlink. */
+		ifp->if_u1.if_data = kmem_alloc(4, KM_SLEEP);
+		snprintf(ifp->if_u1.if_data, 4, "/");
+		newlen = 1;
+	}
+
+	if (len > newlen) {
+		i_size_write(VFS_I(ip), newlen);
+		ip->i_d.di_size = newlen;
+		xfs_trans_log_inode(sc->tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+	}
+
+	return 0;
+}
+
+/* Repair a remote symlink. */
+STATIC int
+xfs_repair_symlink_remote(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_inode		*ip = sc->ip;
+	loff_t				len;
+	size_t				newlen;
+	int				error = 0;
+
+	len = i_size_read(VFS_I(ip));
+	xfs_trans_ijoin(sc->tp, ip, 0);
+
+	error = xfs_repair_symlink_fix_remotes(sc, len);
+	if (error)
+		return error;
+
+	/* Roll transaction, release buffers. */
+	error = xfs_trans_roll_inode(&sc->tp, ip);
+	if (error)
+		return error;
+
+	/* Size set correctly? */
+	len = i_size_read(VFS_I(ip));
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	error = xfs_readlink(ip, sc->buf);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	if (error)
+		return error;
+
+	/*
+	 * Figure out the new target length.  We can't handle zero-length
+	 * symlinks, so make sure that we don't write that out.
+	 */
+	newlen = strnlen(sc->buf, XFS_SYMLINK_MAXLEN);
+	if (newlen == 0) {
+		*((char *)sc->buf) = '/';
+		newlen = 1;
+	}
+
+	if (len > newlen)
+		return xfs_repair_symlink_rewrite(&sc->tp, ip, sc->buf, newlen);
+	return 0;
+}
+
+/* Repair a symbolic link. */
+int
+xfs_repair_symlink(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_ifork		*ifp;
+
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	if (ifp->if_flags & XFS_IFINLINE)
+		return xfs_repair_symlink_inline(sc);
+	return xfs_repair_symlink_remote(sc);
+}


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 17/21] xfs: repair extended attributes
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (15 preceding siblings ...)
  2018-06-24 19:25 ` [PATCH 16/21] xfs: repair damaged symlinks Darrick J. Wong
@ 2018-06-24 19:25 ` Darrick J. Wong
  2018-07-06  1:03   ` Dave Chinner
  2018-06-24 19:25 ` [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:25 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, zap the attr tree, and re-add the
values.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/attr.c        |    2 
 fs/xfs/scrub/attr_repair.c |  568 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h      |    2 
 fs/xfs/scrub/scrub.c       |    2 
 fs/xfs/scrub/scrub.h       |    3 
 6 files changed, 576 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/attr_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 36156166fef0..24d8b19a837b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -164,6 +164,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   attr_repair.o \
 				   alloc_repair.o \
 				   bmap_repair.o \
 				   ialloc_repair.o \
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index de51cf8a8516..50715617eb1e 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -125,7 +125,7 @@ xfs_scrub_xattr_listent(
  * Within a char, the lowest bit of the char represents the byte with
  * the smallest address
  */
-STATIC bool
+bool
 xfs_scrub_xattr_set_map(
 	struct xfs_scrub_context	*sc,
 	unsigned long			*map,
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
new file mode 100644
index 000000000000..5f2e4dad92b7
--- /dev/null
+++ b/fs/xfs/scrub/attr_repair.c
@@ -0,0 +1,568 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_attr.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr_sf.h"
+#include "xfs_attr_remote.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Extended Attribute Repair
+ * =========================
+ *
+ * We repair extended attributes by reading the attribute fork blocks looking
+ * for keys and values, then truncate the entire attr fork and reinsert all
+ * the attributes.  Unfortunately, there's no secondary copy of most extended
+ * attribute data, which means that if we blow up midway through there's
+ * little we can do.
+ */
+
+struct xfs_attr_key {
+	struct list_head		list;
+	unsigned char			*value;
+	int				valuelen;
+	int				flags;
+	int				namelen;
+	unsigned char			name[0];
+};
+
+#define XFS_ATTR_KEY_LEN(namelen) (sizeof(struct xfs_attr_key) + (namelen) + 1)
+
+struct xfs_repair_xattr {
+	struct list_head		*attrlist;
+	struct xfs_scrub_context	*sc;
+};
+
+/* Iterate each block in an attr fork extent */
+#define for_each_xfs_attr_block(mp, irec, dabno) \
+	for ((dabno) = roundup((xfs_dablk_t)(irec)->br_startoff, \
+			(mp)->m_attr_geo->fsbcount); \
+	     (dabno) < (irec)->br_startoff + (irec)->br_blockcount; \
+	     (dabno) += (mp)->m_attr_geo->fsbcount)
+
+/*
+ * Record an extended attribute key & value for later reinsertion into the
+ * inode.  Use the helpers below, don't call this directly.
+ */
+STATIC int
+__xfs_repair_xattr_salvage_attr(
+	struct xfs_repair_xattr		*rx,
+	struct xfs_buf			*bp,
+	int				flags,
+	int				idx,
+	unsigned char			*name,
+	int				namelen,
+	unsigned char			*value,
+	int				valuelen)
+{
+	struct xfs_attr_key		*key;
+	struct xfs_da_args		args;
+	int				error = -ENOMEM;
+
+	/* Ignore incomplete or oversized attributes. */
+	if ((flags & XFS_ATTR_INCOMPLETE) ||
+	    namelen > XATTR_NAME_MAX || namelen < 0 ||
+	    valuelen > XATTR_SIZE_MAX || valuelen < 0)
+		return 0;
+
+	/* Store attr key. */
+	key = kmem_alloc(XFS_ATTR_KEY_LEN(namelen), KM_MAYFAIL);
+	if (!key)
+		goto err;
+	INIT_LIST_HEAD(&key->list);
+	key->value = kmem_zalloc_large(valuelen, KM_MAYFAIL);
+	if (!key->value)
+		goto err_key;
+	key->valuelen = valuelen;
+	key->flags = flags & (ATTR_ROOT | ATTR_SECURE);
+	key->namelen = namelen;
+	key->name[namelen] = 0;
+	memcpy(key->name, name, namelen);
+
+	/* Caller already had the value, so copy it and exit. */
+	if (value) {
+		memcpy(key->value, value, valuelen);
+		goto out_ok;
+	}
+
+	/* Otherwise look up the remote value directly. */
+	memset(&args, 0, sizeof(args));
+	args.geo = rx->sc->mp->m_attr_geo;
+	args.index = idx;
+	args.namelen = namelen;
+	args.name = key->name;
+	args.valuelen = valuelen;
+	args.value = key->value;
+	args.dp = rx->sc->ip;
+	args.trans = rx->sc->tp;
+	error = xfs_attr3_leaf_getvalue(bp, &args);
+	if (error || args.rmtblkno == 0)
+		goto err_value;
+
+	error = xfs_attr_rmtval_get(&args);
+	switch (error) {
+	case 0:
+		break;
+	case -EFSBADCRC:
+	case -EFSCORRUPTED:
+		error = 0;
+		/* fall through */
+	default:
+		goto err_value;
+	}
+
+out_ok:
+	list_add_tail(&key->list, rx->attrlist);
+	return 0;
+
+err_value:
+	kmem_free(key->value);
+err_key:
+	kmem_free(key);
+err:
+	return error;
+}
+
+/*
+ * Record a local format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+static inline int
+xfs_repair_xattr_salvage_local_attr(
+	struct xfs_repair_xattr		*rx,
+	int				flags,
+	unsigned char			*name,
+	int				namelen,
+	unsigned char			*value,
+	int				valuelen)
+{
+	return __xfs_repair_xattr_salvage_attr(rx, NULL, flags, 0, name,
+			namelen, value, valuelen);
+}
+
+/*
+ * Record a remote format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+static inline int
+xfs_repair_xattr_salvage_remote_attr(
+	struct xfs_repair_xattr		*rx,
+	int				flags,
+	unsigned char			*name,
+	int				namelen,
+	struct xfs_buf			*leaf_bp,
+	int				idx,
+	int				valuelen)
+{
+	return __xfs_repair_xattr_salvage_attr(rx, leaf_bp, flags, idx,
+			name, namelen, NULL, valuelen);
+}
+
+/* Extract every xattr key that we can from this attr fork block. */
+STATIC int
+xfs_repair_xattr_recover_leaf(
+	struct xfs_repair_xattr		*rx,
+	struct xfs_buf			*bp)
+{
+	struct xfs_attr3_icleaf_hdr	leafhdr;
+	struct xfs_scrub_context	*sc = rx->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_attr_leafblock	*leaf;
+	unsigned long			*usedmap = sc->buf;
+	struct xfs_attr_leaf_name_local	*lentry;
+	struct xfs_attr_leaf_name_remote *rentry;
+	struct xfs_attr_leaf_entry	*ent;
+	struct xfs_attr_leaf_entry	*entries;
+	char				*buf_end;
+	char				*name;
+	char				*name_end;
+	char				*value;
+	size_t				off;
+	unsigned int			nameidx;
+	unsigned int			namesize;
+	unsigned int			hdrsize;
+	unsigned int			namelen;
+	unsigned int			valuelen;
+	int				i;
+	int				error;
+
+	bitmap_zero(usedmap, mp->m_attr_geo->blksize);
+
+	/* Check the leaf header */
+	leaf = bp->b_addr;
+	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
+	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
+	xfs_scrub_xattr_set_map(sc, usedmap, 0, hdrsize);
+	entries = xfs_attr3_leaf_entryp(leaf);
+
+	buf_end = (char *)bp->b_addr + mp->m_attr_geo->blksize;
+	for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) {
+		/* Skip key if it conflicts with something else? */
+		off = (char *)ent - (char *)leaf;
+		if (!xfs_scrub_xattr_set_map(sc, usedmap, off,
+				sizeof(xfs_attr_leaf_entry_t)))
+			continue;
+
+		/* Check the name information. */
+		nameidx = be16_to_cpu(ent->nameidx);
+		if (nameidx < leafhdr.firstused ||
+		    nameidx >= mp->m_attr_geo->blksize)
+			continue;
+
+		if (ent->flags & XFS_ATTR_LOCAL) {
+			lentry = xfs_attr3_leaf_name_local(leaf, i);
+			namesize = xfs_attr_leaf_entsize_local(lentry->namelen,
+					be16_to_cpu(lentry->valuelen));
+			name_end = (char *)lentry + namesize;
+			if (lentry->namelen == 0)
+				continue;
+			name = lentry->nameval;
+			namelen = lentry->namelen;
+			valuelen = be16_to_cpu(lentry->valuelen);
+			value = &name[namelen];
+		} else {
+			rentry = xfs_attr3_leaf_name_remote(leaf, i);
+			namesize = xfs_attr_leaf_entsize_remote(rentry->namelen);
+			name_end = (char *)rentry + namesize;
+			if (rentry->namelen == 0 || rentry->valueblk == 0)
+				continue;
+			name = rentry->name;
+			namelen = rentry->namelen;
+			valuelen = be32_to_cpu(rentry->valuelen);
+			value = NULL;
+		}
+		if (name_end > buf_end)
+			continue;
+		if (!xfs_scrub_xattr_set_map(sc, usedmap, nameidx, namesize))
+			continue;
+
+		/* Ok, let's save this key/value. */
+		if (ent->flags & XFS_ATTR_LOCAL)
+			error = xfs_repair_xattr_salvage_local_attr(rx,
+				ent->flags, name, namelen, value, valuelen);
+		else
+			error = xfs_repair_xattr_salvage_remote_attr(rx,
+				ent->flags, name, namelen, bp, i, valuelen);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Try to recover shortform attrs. */
+STATIC int
+xfs_repair_xattr_recover_sf(
+	struct xfs_repair_xattr		*rx)
+{
+	struct xfs_attr_shortform	*sf;
+	struct xfs_attr_sf_entry	*sfe;
+	struct xfs_attr_sf_entry	*next;
+	struct xfs_ifork		*ifp;
+	unsigned char			*end;
+	int				i;
+	int				error;
+
+	ifp = XFS_IFORK_PTR(rx->sc->ip, XFS_ATTR_FORK);
+	sf = (struct xfs_attr_shortform *)rx->sc->ip->i_afp->if_u1.if_data;
+	end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes;
+
+	for (i = 0, sfe = &sf->list[0]; i < sf->hdr.count; i++) {
+		next = XFS_ATTR_SF_NEXTENTRY(sfe);
+		if ((unsigned char *)next > end)
+			break;
+
+		/* Ok, let's save this key/value. */
+		error = xfs_repair_xattr_salvage_local_attr(rx, sfe->flags,
+				sfe->nameval, sfe->namelen,
+				&sfe->nameval[sfe->namelen], sfe->valuelen);
+		if (error)
+			return error;
+
+		sfe = next;
+	}
+
+	return 0;
+}
+
+/* Extract as many attribute keys and values as we can. */
+STATIC int
+xfs_repair_xattr_recover(
+	struct xfs_repair_xattr		*rx)
+{
+	struct xfs_iext_cursor		icur;
+	struct xfs_bmbt_irec		got;
+	struct xfs_scrub_context	*sc = rx->sc;
+	struct xfs_ifork		*ifp;
+	struct xfs_da_blkinfo		*info;
+	struct xfs_buf			*bp;
+	xfs_dablk_t			dabno;
+	int				error = 0;
+
+	if (sc->ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL)
+		return xfs_repair_xattr_recover_sf(rx);
+
+	/* Iterate each attr block in the attr fork. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	for_each_xfs_iext(ifp, &icur, &got) {
+		for_each_xfs_attr_block(sc->mp, &got, dabno) {
+			/*
+			 * Try to read buffer.  We invalidate them in the next
+			 * step so we don't bother to set a buffer type or
+			 * ops.
+			 */
+			error = xfs_da_read_buf(sc->tp, sc->ip, dabno, -1, &bp,
+					XFS_ATTR_FORK, NULL);
+			if (error || !bp)
+				continue;
+
+			/* Screen out non-leaves & other garbage. */
+			info = bp->b_addr;
+			if (info->magic != cpu_to_be16(XFS_ATTR3_LEAF_MAGIC) ||
+			    xfs_attr3_leaf_buf_ops.verify_struct(bp) != NULL)
+				continue;
+
+			error = xfs_repair_xattr_recover_leaf(rx, bp);
+			if (error)
+				return error;
+		}
+	}
+
+	return 0;
+}
+
+/* Free all the attribute fork blocks and delete the fork. */
+STATIC int
+xfs_repair_xattr_reset_btree(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_iext_cursor		icur;
+	struct xfs_bmbt_irec		got;
+	struct xfs_ifork		*ifp;
+	struct xfs_buf			*bp;
+	xfs_fileoff_t			lblk;
+	int				error;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	if (sc->ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL)
+		goto out_fork_remove;
+
+	/* Invalidate each attr block in the attr fork. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	for_each_xfs_iext(ifp, &icur, &got) {
+		for_each_xfs_attr_block(sc->mp, &got, lblk) {
+			error = xfs_da_get_buf(sc->tp, sc->ip, lblk, -1, &bp,
+					XFS_ATTR_FORK);
+			if (error || !bp)
+				continue;
+			xfs_trans_binval(sc->tp, bp);
+			error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+			if (error)
+				return error;
+		}
+	}
+
+	error = xfs_itruncate_extents(&sc->tp, sc->ip, XFS_ATTR_FORK, 0);
+	if (error)
+		return error;
+
+out_fork_remove:
+	/* Reset the attribute fork - this also destroys the in-core fork */
+	xfs_attr_fork_remove(sc->ip, sc->tp);
+	return 0;
+}
+
+/*
+ * Compare two xattr keys.  ATTR_SECURE keys come before ATTR_ROOT and
+ * ATTR_ROOT keys come before user attrs.  Otherwise sort in hash order.
+ */
+static int
+xfs_repair_xattr_key_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xfs_attr_key	*ap;
+	struct xfs_attr_key	*bp;
+	uint			ahash, bhash;
+
+	ap = container_of(a, struct xfs_attr_key, list);
+	bp = container_of(b, struct xfs_attr_key, list);
+
+	if (ap->flags > bp->flags)
+		return 1;
+	else if (ap->flags < bp->flags)
+		return -1;
+
+	ahash = xfs_da_hashname(ap->name, ap->namelen);
+	bhash = xfs_da_hashname(bp->name, bp->namelen);
+	if (ahash > bhash)
+		return 1;
+	else if (ahash < bhash)
+		return -1;
+	return 0;
+}
+
+/*
+ * Find all the extended attributes for this inode by scraping them out of the
+ * attribute key blocks by hand.  The caller must clean up the lists if
+ * anything goes wrong.
+ */
+STATIC int
+xfs_repair_xattr_find_attributes(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*attributes)
+{
+	struct xfs_repair_xattr		rx;
+	struct xfs_ifork		*ifp;
+	int				error;
+
+	error = xfs_repair_ino_dqattach(sc);
+	if (error)
+		return error;
+
+	/* Extent map should be loaded. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	if (XFS_IFORK_FORMAT(sc->ip, XFS_ATTR_FORK) != XFS_DINODE_FMT_LOCAL &&
+	    !(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(sc->tp, sc->ip, XFS_ATTR_FORK);
+		if (error)
+			return error;
+	}
+
+	rx.attrlist = attributes;
+	rx.sc = sc;
+
+	/* Read every attr key and value and record them in memory. */
+	return xfs_repair_xattr_recover(&rx);
+}
+
+/* Free all the attributes. */
+STATIC void
+xfs_repair_xattr_cancel_attrs(
+	struct list_head	*attributes)
+{
+	struct xfs_attr_key	*key;
+	struct xfs_attr_key	*n;
+
+	list_for_each_entry_safe(key, n, attributes, list) {
+		list_del(&key->list);
+		kmem_free(key->value);
+		kmem_free(key);
+	}
+}
+
+/*
+ * Insert all the attributes that we collected.
+ *
+ * Commit the repair transaction and drop the ilock because the attribute
+ * setting code needs to be able to allocate special transactions and take the
+ * ilock on its own.  Some day we'll have deferred attribute setting, at which
+ * point we'll be able to use that to replace the attributes atomically and
+ * safely.
+ */
+STATIC int
+xfs_repair_xattr_rebuild_tree(
+	struct xfs_scrub_context	*sc,
+	struct list_head		*attributes)
+{
+	struct xfs_attr_key		*key;
+	struct xfs_attr_key		*n;
+	int				error;
+
+	error = xfs_trans_commit(sc->tp);
+	sc->tp = NULL;
+	if (error)
+		return error;
+
+	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL);
+	sc->ilock_flags &= ~XFS_ILOCK_EXCL;
+
+	/* Re-add every attr to the file. */
+	list_sort(NULL, attributes, xfs_repair_xattr_key_cmp);
+	list_for_each_entry_safe(key, n, attributes, list) {
+		error = xfs_attr_set(sc->ip, key->name, key->value,
+				key->valuelen, key->flags);
+		if (error)
+			return error;
+
+		/*
+		 * If the attr value is larger than a single page, free the
+		 * key now so that we aren't hogging memory while doing a lot
+		 * of metadata updates.  Otherwise, we want to spend as little
+		 * time reconstructing the attrs as we possibly can.
+		 */
+		if (key->valuelen <= PAGE_SIZE)
+			continue;
+		list_del(&key->list);
+		kmem_free(key->value);
+		kmem_free(key);
+	}
+
+	xfs_repair_xattr_cancel_attrs(attributes);
+	return 0;
+}
+
+/*
+ * Repair the extended attribute metadata.
+ *
+ * XXX: Remote attribute value buffers encompass the entire (up to 64k) buffer
+ * and we can't handle those 100% until the buffer cache learns how to deal
+ * with that.
+ */
+int
+xfs_repair_xattr(
+	struct xfs_scrub_context	*sc)
+{
+	struct list_head		attributes;
+	int				error;
+
+	if (!xfs_inode_hasattr(sc->ip))
+		return -ENOENT;
+
+	/* Collect extended attributes by parsing raw blocks. */
+	INIT_LIST_HEAD(&attributes);
+	error = xfs_repair_xattr_find_attributes(sc, &attributes);
+	if (error)
+		goto out;
+
+	/*
+	 * Invalidate and truncate all attribute fork extents.  This is the
+	 * point at which we are no longer able to bail out gracefully.
+	 * We commit the transaction here because xfs_attr_set allocates its
+	 * own transactions.
+	 */
+	error = xfs_repair_xattr_reset_btree(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the attribute information. */
+	error = xfs_repair_xattr_rebuild_tree(sc, &attributes);
+out:
+	xfs_repair_xattr_cancel_attrs(&attributes);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 14fa8cf89799..05a63fcc2364 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -113,6 +113,7 @@ int xfs_repair_inode(struct xfs_scrub_context *sc);
 int xfs_repair_bmap_data(struct xfs_scrub_context *sc);
 int xfs_repair_bmap_attr(struct xfs_scrub_context *sc);
 int xfs_repair_symlink(struct xfs_scrub_context *sc);
+int xfs_repair_xattr(struct xfs_scrub_context *sc);
 
 #else
 
@@ -155,6 +156,7 @@ static inline int xfs_repair_rmapbt_setup(
 #define xfs_repair_bmap_data		xfs_repair_notsupported
 #define xfs_repair_bmap_attr		xfs_repair_notsupported
 #define xfs_repair_symlink		xfs_repair_notsupported
+#define xfs_repair_xattr		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6d7ae6e0e165..857197c89729 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -323,7 +323,7 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xfs_scrub_setup_xattr,
 		.scrub	= xfs_scrub_xattr,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_xattr,
 	},
 	[XFS_SCRUB_TYPE_SYMLINK] = {	/* symbolic link */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 93a4a0b22273..270098bb3225 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -155,4 +155,7 @@ void xfs_scrub_xref_is_used_rt_space(struct xfs_scrub_context *sc,
 # define xfs_scrub_xref_is_used_rt_space(sc, rtbno, len) do { } while (0)
 #endif
 
+bool xfs_scrub_xattr_set_map(struct xfs_scrub_context *sc, unsigned long *map,
+		unsigned int start, unsigned int len);
+
 #endif	/* __XFS_SCRUB_SCRUB_H__ */


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (16 preceding siblings ...)
  2018-06-24 19:25 ` [PATCH 17/21] xfs: repair extended attributes Darrick J. Wong
@ 2018-06-24 19:25 ` Darrick J. Wong
  2018-06-29  2:52   ` Dave Chinner
  2018-06-24 19:25 ` [PATCH 19/21] xfs: repair quotas Darrick J. Wong
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:25 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/attr.c    |    2 ++
 fs/xfs/scrub/dabtree.c |   15 +++++++++++++++
 fs/xfs/scrub/dabtree.h |    1 +
 fs/xfs/scrub/trace.h   |    1 +
 4 files changed, 19 insertions(+)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 50715617eb1e..56894045f147 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -293,6 +293,8 @@ xfs_scrub_xattr_block(
 		xfs_scrub_da_set_corrupt(ds, level);
 	if (!xfs_scrub_xattr_set_map(ds->sc, usedmap, 0, hdrsize))
 		xfs_scrub_da_set_corrupt(ds, level);
+	if (leafhdr.holes)
+		xfs_scrub_da_set_preen(ds, level);
 
 	if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		goto out;
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index d700c4d4d4ef..ccf2d92b2756 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -85,6 +85,21 @@ xfs_scrub_da_set_corrupt(
 			__return_address);
 }
 
+/* Flag a da btree node in need of optimization. */
+void
+xfs_scrub_da_set_preen(
+	struct xfs_scrub_da_btree	*ds,
+	int				level)
+{
+	struct xfs_scrub_context	*sc = ds->sc;
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	trace_xfs_scrub_fblock_preen(sc, ds->dargs.whichfork,
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
+}
+
 /* Find an entry at a certain level in a da btree. */
 STATIC void *
 xfs_scrub_da_btree_entry(
diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h
index 365f9f0019e6..92916e543549 100644
--- a/fs/xfs/scrub/dabtree.h
+++ b/fs/xfs/scrub/dabtree.h
@@ -36,6 +36,7 @@ bool xfs_scrub_da_process_error(struct xfs_scrub_da_btree *ds, int level, int *e
 
 /* Check for da btree corruption. */
 void xfs_scrub_da_set_corrupt(struct xfs_scrub_da_btree *ds, int level);
+void xfs_scrub_da_set_preen(struct xfs_scrub_da_btree *ds, int level);
 
 int xfs_scrub_da_btree_hash(struct xfs_scrub_da_btree *ds, int level,
 			    __be32 *hashp);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 2a561689cecb..0212d273ca8b 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -230,6 +230,7 @@ DEFINE_EVENT(xfs_scrub_fblock_error_class, name, \
 
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xfs_scrub_fblock_error);
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xfs_scrub_fblock_warning);
+DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xfs_scrub_fblock_preen);
 
 TRACE_EVENT(xfs_scrub_incomplete,
 	TP_PROTO(struct xfs_scrub_context *sc, void *ret_ip),


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 19/21] xfs: repair quotas
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (17 preceding siblings ...)
  2018-06-24 19:25 ` [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
@ 2018-06-24 19:25 ` Darrick J. Wong
  2018-07-06  1:50   ` Dave Chinner
  2018-06-24 19:25 ` [PATCH 20/21] xfs: implement live quotacheck as part of quota repair Darrick J. Wong
  2018-06-24 19:25 ` [PATCH 21/21] xfs: add online scrub/repair for superblock counters Darrick J. Wong
  20 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:25 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Fix anything that causes the quota verifiers to fail.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/scrub/attr_repair.c  |    2 
 fs/xfs/scrub/common.h       |    8 +
 fs/xfs/scrub/quota.c        |    2 
 fs/xfs/scrub/quota_repair.c |  365 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c       |   58 +++++++
 fs/xfs/scrub/repair.h       |    8 +
 fs/xfs/scrub/scrub.c        |   11 +
 fs/xfs/scrub/scrub.h        |    1 
 9 files changed, 448 insertions(+), 8 deletions(-)
 create mode 100644 fs/xfs/scrub/quota_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 24d8b19a837b..0392bca6f5fe 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -174,5 +174,6 @@ xfs-y				+= $(addprefix scrub/, \
 				   rmap_repair.o \
 				   symlink_repair.o \
 				   )
+xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota_repair.o
 endif
 endif
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
index 5f2e4dad92b7..aba5ac671a28 100644
--- a/fs/xfs/scrub/attr_repair.c
+++ b/fs/xfs/scrub/attr_repair.c
@@ -355,7 +355,7 @@ xfs_repair_xattr_recover(
 }
 
 /* Free all the attribute fork blocks and delete the fork. */
-STATIC int
+int
 xfs_repair_xattr_reset_btree(
 	struct xfs_scrub_context	*sc)
 {
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index e8c4e41139ca..b0cca36de2de 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -138,6 +138,14 @@ static inline bool xfs_scrub_skip_xref(struct xfs_scrub_metadata *sm)
 			       XFS_SCRUB_OFLAG_XCORRUPT);
 }
 
+/* Do we need to invoke the repair tool? */
+static inline bool xfs_scrub_needs_repair(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+			       XFS_SCRUB_OFLAG_XCORRUPT |
+			       XFS_SCRUB_OFLAG_PREEN);
+}
+
 int xfs_scrub_metadata_inode_forks(struct xfs_scrub_context *sc);
 int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
 void xfs_scrub_iput(struct xfs_scrub_context *sc, struct xfs_inode *ip);
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 6ff906aa0a3b..ab0f0f7fde2d 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -29,7 +29,7 @@
 #include "scrub/trace.h"
 
 /* Convert a scrub type code to a DQ flag, or return 0 if error. */
-static inline uint
+uint
 xfs_scrub_quota_to_dqtype(
 	struct xfs_scrub_context	*sc)
 {
diff --git a/fs/xfs/scrub/quota_repair.c b/fs/xfs/scrub/quota_repair.c
new file mode 100644
index 000000000000..4e7af44a2ba1
--- /dev/null
+++ b/fs/xfs/scrub/quota_repair.c
@@ -0,0 +1,365 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_dquot.h"
+#include "xfs_dquot_item.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Quota Repair
+ * ============
+ *
+ * Quota repairs are fairly simplistic; we fix everything that the dquot
+ * verifiers complain about, cap any counters or limits that make no sense,
+ * and schedule a quotacheck if we had to fix anything.  We also repair any
+ * data fork extent records that don't apply to metadata files.
+ */
+
+struct xfs_repair_quota_info {
+	struct xfs_scrub_context	*sc;
+	bool				need_quotacheck;
+};
+
+/* Scrub the fields in an individual quota item. */
+STATIC int
+xfs_repair_quota_item(
+	struct xfs_dquot		*dq,
+	uint				dqtype,
+	void				*priv)
+{
+	struct xfs_repair_quota_info	*rqi = priv;
+	struct xfs_scrub_context	*sc = rqi->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_disk_dquot		*d = &dq->q_core;
+	unsigned long long		bsoft;
+	unsigned long long		isoft;
+	unsigned long long		rsoft;
+	unsigned long long		bhard;
+	unsigned long long		ihard;
+	unsigned long long		rhard;
+	unsigned long long		bcount;
+	unsigned long long		icount;
+	unsigned long long		rcount;
+	xfs_ino_t			fs_icount;
+	bool				dirty = false;
+	int				error;
+
+	/* Did we get the dquot type we wanted? */
+	if (dqtype != (d->d_flags & XFS_DQ_ALLTYPES)) {
+		d->d_flags = dqtype;
+		dirty = true;
+	}
+
+	if (d->d_pad0 || d->d_pad) {
+		d->d_pad0 = 0;
+		d->d_pad = 0;
+		dirty = true;
+	}
+
+	/* Check the limits. */
+	bhard = be64_to_cpu(d->d_blk_hardlimit);
+	ihard = be64_to_cpu(d->d_ino_hardlimit);
+	rhard = be64_to_cpu(d->d_rtb_hardlimit);
+
+	bsoft = be64_to_cpu(d->d_blk_softlimit);
+	isoft = be64_to_cpu(d->d_ino_softlimit);
+	rsoft = be64_to_cpu(d->d_rtb_softlimit);
+
+	if (bsoft > bhard) {
+		d->d_blk_softlimit = d->d_blk_hardlimit;
+		dirty = true;
+	}
+
+	if (isoft > ihard) {
+		d->d_ino_softlimit = d->d_ino_hardlimit;
+		dirty = true;
+	}
+
+	if (rsoft > rhard) {
+		d->d_rtb_softlimit = d->d_rtb_hardlimit;
+		dirty = true;
+	}
+
+	/* Check the resource counts. */
+	bcount = be64_to_cpu(d->d_bcount);
+	icount = be64_to_cpu(d->d_icount);
+	rcount = be64_to_cpu(d->d_rtbcount);
+	fs_icount = percpu_counter_sum(&mp->m_icount);
+
+	/*
+	 * Check that usage doesn't exceed physical limits.  However, on
+	 * a reflink filesystem we're allowed to exceed physical space
+	 * if there are no quota limits.  We don't know what the real number
+	 * is, but we can make quotacheck find out for us.
+	 */
+	if (!xfs_sb_version_hasreflink(&mp->m_sb) &&
+	    mp->m_sb.sb_dblocks < bcount) {
+		dq->q_res_bcount -= be64_to_cpu(dq->q_core.d_bcount);
+		dq->q_res_bcount += mp->m_sb.sb_dblocks;
+		d->d_bcount = cpu_to_be64(mp->m_sb.sb_dblocks);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	if (icount > fs_icount) {
+		dq->q_res_icount -= be64_to_cpu(dq->q_core.d_icount);
+		dq->q_res_icount += fs_icount;
+		d->d_icount = cpu_to_be64(fs_icount);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	if (rcount > mp->m_sb.sb_rblocks) {
+		dq->q_res_rtbcount -= be64_to_cpu(dq->q_core.d_rtbcount);
+		dq->q_res_rtbcount += mp->m_sb.sb_rblocks;
+		d->d_rtbcount = cpu_to_be64(mp->m_sb.sb_rblocks);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+
+	if (!dirty)
+		return 0;
+
+	dq->dq_flags |= XFS_DQ_DIRTY;
+	xfs_trans_dqjoin(sc->tp, dq);
+	xfs_trans_log_dquot(sc->tp, dq);
+	error = xfs_trans_roll(&sc->tp);
+	xfs_dqlock(dq);
+	return error;
+}
+
+/* Fix a quota timer so that we can pass the verifier. */
+STATIC void
+xfs_repair_quota_fix_timer(
+	__be64			softlimit,
+	__be64			countnow,
+	__be32			*timer,
+	time_t			timelimit)
+{
+	uint64_t		soft = be64_to_cpu(softlimit);
+	uint64_t		count = be64_to_cpu(countnow);
+
+	if (soft && count > soft && *timer == 0)
+		*timer = cpu_to_be32(get_seconds() + timelimit);
+}
+
+/* Fix anything the verifiers complain about. */
+STATIC int
+xfs_repair_quota_block(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*bp,
+	uint				dqtype,
+	xfs_dqid_t			id)
+{
+	struct xfs_dqblk		*d = (struct xfs_dqblk *)bp->b_addr;
+	struct xfs_disk_dquot		*ddq;
+	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
+	enum xfs_blft			buftype = 0;
+	int				i;
+
+	bp->b_ops = &xfs_dquot_buf_ops;
+	for (i = 0; i < qi->qi_dqperchunk; i++) {
+		ddq = &d[i].dd_diskdq;
+
+		ddq->d_magic = cpu_to_be16(XFS_DQUOT_MAGIC);
+		ddq->d_version = XFS_DQUOT_VERSION;
+		ddq->d_flags = dqtype;
+		ddq->d_id = cpu_to_be32(id + i);
+
+		xfs_repair_quota_fix_timer(ddq->d_blk_softlimit,
+				ddq->d_bcount, &ddq->d_btimer,
+				qi->qi_btimelimit);
+		xfs_repair_quota_fix_timer(ddq->d_ino_softlimit,
+				ddq->d_icount, &ddq->d_itimer,
+				qi->qi_itimelimit);
+		xfs_repair_quota_fix_timer(ddq->d_rtb_softlimit,
+				ddq->d_rtbcount, &ddq->d_rtbtimer,
+				qi->qi_rtbtimelimit);
+
+		if (xfs_sb_version_hascrc(&sc->mp->m_sb)) {
+			uuid_copy(&d->dd_uuid, &sc->mp->m_sb.sb_meta_uuid);
+			xfs_update_cksum((char *)d, sizeof(struct xfs_dqblk),
+					 XFS_DQUOT_CRC_OFF);
+		} else {
+			memset(&d->dd_uuid, 0, sizeof(d->dd_uuid));
+			d->dd_lsn = 0;
+			d->dd_crc = 0;
+		}
+	}
+	switch (dqtype) {
+	case XFS_DQ_USER:
+		buftype = XFS_BLFT_UDQUOT_BUF;
+		break;
+	case XFS_DQ_GROUP:
+		buftype = XFS_BLFT_GDQUOT_BUF;
+		break;
+	case XFS_DQ_PROJ:
+		buftype = XFS_BLFT_PDQUOT_BUF;
+		break;
+	}
+	xfs_trans_buf_set_type(sc->tp, bp, buftype);
+	xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1);
+	return xfs_trans_roll(&sc->tp);
+}
+
+/* Repair quota's data fork. */
+STATIC int
+xfs_repair_quota_data_fork(
+	struct xfs_scrub_context	*sc,
+	uint				dqtype)
+{
+	struct xfs_bmbt_irec		irec = { 0 };
+	struct xfs_iext_cursor		icur;
+	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
+	struct xfs_ifork		*ifp;
+	struct xfs_buf			*bp;
+	struct xfs_dqblk		*d;
+	xfs_dqid_t			id;
+	xfs_fileoff_t			max_dqid_off;
+	xfs_fileoff_t			off;
+	xfs_fsblock_t			fsbno;
+	bool				truncate = false;
+	int				error = 0;
+
+	error = xfs_repair_metadata_inode_forks(sc);
+	if (error)
+		goto out;
+
+	/* Check for data fork problems that apply only to quota files. */
+	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		if (isnullstartblock(irec.br_startblock)) {
+			error = -EFSCORRUPTED;
+			goto out;
+		}
+
+		if (irec.br_startoff > max_dqid_off ||
+		    irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) {
+			truncate = true;
+			break;
+		}
+	}
+	if (truncate) {
+		error = xfs_itruncate_extents(&sc->tp, sc->ip, XFS_DATA_FORK,
+				max_dqid_off * sc->mp->m_sb.sb_blocksize);
+		if (error)
+			goto out;
+	}
+
+	/* Now go fix anything that fails the verifiers. */
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		for (fsbno = irec.br_startblock, off = irec.br_startoff;
+		     fsbno < irec.br_startblock + irec.br_blockcount;
+		     fsbno += XFS_DQUOT_CLUSTER_SIZE_FSB,
+				off += XFS_DQUOT_CLUSTER_SIZE_FSB) {
+			id = off * qi->qi_dqperchunk;
+			error = xfs_trans_read_buf(sc->mp, sc->tp,
+					sc->mp->m_ddev_targp,
+					XFS_FSB_TO_DADDR(sc->mp, fsbno),
+					qi->qi_dqchunklen,
+					0, &bp, &xfs_dquot_buf_ops);
+			if (error == 0) {
+				d = (struct xfs_dqblk *)bp->b_addr;
+				if (id == be32_to_cpu(d->dd_diskdq.d_id))
+					continue;
+				error = -EFSCORRUPTED;
+			}
+			if (error != -EFSBADCRC && error != -EFSCORRUPTED)
+				goto out;
+
+			/* Failed verifier, try again. */
+			error = xfs_trans_read_buf(sc->mp, sc->tp,
+					sc->mp->m_ddev_targp,
+					XFS_FSB_TO_DADDR(sc->mp, fsbno),
+					qi->qi_dqchunklen,
+					0, &bp, NULL);
+			if (error)
+				goto out;
+			error = xfs_repair_quota_block(sc, bp, dqtype, id);
+		}
+	}
+
+	/*
+	 * Roll the transaction so that we unlock all of the buffers we
+	 * touched while doing raw dquot buffer checks.  Subsequent parts of
+	 * quota repair will use the regular dquot APIs, which re-read the
+	 * dquot buffers to construct in-core dquots without a transaction.
+	 * This causes deadlocks if we haven't released the dquot buffers.
+	 */
+	error = xfs_trans_roll(&sc->tp);
+out:
+	return error;
+}
+
+/*
+ * Go fix anything in the quota items that we could have been mad about.  Now
+ * that we've checked the quota inode data fork we have to drop ILOCK_EXCL to
+ * use the regular dquot functions.
+ */
+STATIC int
+xfs_repair_quota_problems(
+	struct xfs_scrub_context	*sc,
+	uint				dqtype)
+{
+	struct xfs_repair_quota_info	rqi;
+	int				error;
+
+	rqi.sc = sc;
+	rqi.need_quotacheck = false;
+	error = xfs_qm_dqiterate(sc->mp, dqtype, xfs_repair_quota_item, &rqi);
+	if (error)
+		return error;
+
+	/* Make a quotacheck happen. */
+	if (rqi.need_quotacheck)
+		xfs_repair_force_quotacheck(sc, dqtype);
+	return 0;
+}
+
+/* Repair all of a quota type's items. */
+int
+xfs_repair_quota(
+	struct xfs_scrub_context	*sc)
+{
+	uint				dqtype;
+	int				error;
+
+	dqtype = xfs_scrub_quota_to_dqtype(sc);
+
+	/* Fix problematic data fork mappings. */
+	error = xfs_repair_quota_data_fork(sc, dqtype);
+	if (error)
+		goto out;
+
+	/* Unlock quota inode; we play only with dquots from now on. */
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	sc->ilock_flags = 0;
+
+	/* Fix anything the dquot verifiers complain about. */
+	error = xfs_repair_quota_problems(sc, dqtype);
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 85ec872093e6..399a98e1d8f7 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -29,6 +29,8 @@
 #include "xfs_ag_resv.h"
 #include "xfs_trans_space.h"
 #include "xfs_quota.h"
+#include "xfs_attr.h"
+#include "xfs_reflink.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -1182,3 +1184,59 @@ xfs_repair_grab_all_ag_headers(
 
 	return error;
 }
+
+/*
+ * Repair the attr/data forks of a metadata inode.  The metadata inode must be
+ * pointed to by sc->ip and the ILOCK must be held.
+ */
+int
+xfs_repair_metadata_inode_forks(
+	struct xfs_scrub_context	*sc)
+{
+	__u32				smtype;
+	__u32				smflags;
+	int				error;
+
+	smtype = sc->sm->sm_type;
+	smflags = sc->sm->sm_flags;
+
+	/* Let's see if the forks need repair. */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	error = xfs_scrub_metadata_inode_forks(sc);
+	if (error || !xfs_scrub_needs_repair(sc->sm))
+		goto out;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* Clear the reflink flag & attr forks that we shouldn't have. */
+	if (xfs_is_reflink_inode(sc->ip)) {
+		error = xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+		if (error)
+			goto out;
+	}
+
+	if (xfs_inode_hasattr(sc->ip)) {
+		error = xfs_repair_xattr_reset_btree(sc);
+		if (error)
+			goto out;
+	}
+
+	/* Repair the data fork. */
+	sc->sm->sm_type = XFS_SCRUB_TYPE_BMBTD;
+	error = xfs_repair_bmap_data(sc);
+	sc->sm->sm_type = smtype;
+	if (error)
+		goto out;
+
+	/* Bail out if we still need repairs. */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	error = xfs_scrub_metadata_inode_forks(sc);
+	if (error)
+		goto out;
+	if (xfs_scrub_needs_repair(sc->sm))
+		error = -EFSCORRUPTED;
+out:
+	sc->sm->sm_type = smtype;
+	sc->sm->sm_flags = smflags;
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 05a63fcc2364..083ab63624eb 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -97,6 +97,8 @@ void xfs_repair_force_quotacheck(struct xfs_scrub_context *sc, uint dqtype);
 int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
 int xfs_repair_grab_all_ag_headers(struct xfs_scrub_context *sc);
 int xfs_repair_rmapbt_setup(struct xfs_scrub_context *sc, struct xfs_inode *ip);
+int xfs_repair_xattr_reset_btree(struct xfs_scrub_context *sc);
+int xfs_repair_metadata_inode_forks(struct xfs_scrub_context *sc);
 
 /* Metadata repairers */
 
@@ -114,6 +116,11 @@ int xfs_repair_bmap_data(struct xfs_scrub_context *sc);
 int xfs_repair_bmap_attr(struct xfs_scrub_context *sc);
 int xfs_repair_symlink(struct xfs_scrub_context *sc);
 int xfs_repair_xattr(struct xfs_scrub_context *sc);
+#ifdef CONFIG_XFS_QUOTA
+int xfs_repair_quota(struct xfs_scrub_context *sc);
+#else
+# define xfs_repair_quota		xfs_repair_notsupported
+#endif /* CONFIG_XFS_QUOTA */
 
 #else
 
@@ -157,6 +164,7 @@ static inline int xfs_repair_rmapbt_setup(
 #define xfs_repair_bmap_attr		xfs_repair_notsupported
 #define xfs_repair_symlink		xfs_repair_notsupported
 #define xfs_repair_xattr		xfs_repair_notsupported
+#define xfs_repair_quota		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 857197c89729..f57ec412a617 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -355,19 +355,19 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xfs_scrub_setup_quota,
 		.scrub	= xfs_scrub_quota,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_quota,
 	},
 	[XFS_SCRUB_TYPE_GQUOTA] = {	/* group quota */
 		.type	= ST_FS,
 		.setup	= xfs_scrub_setup_quota,
 		.scrub	= xfs_scrub_quota,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_quota,
 	},
 	[XFS_SCRUB_TYPE_PQUOTA] = {	/* project quota */
 		.type	= ST_FS,
 		.setup	= xfs_scrub_setup_quota,
 		.scrub	= xfs_scrub_quota,
-		.repair	= xfs_repair_notsupported,
+		.repair	= xfs_repair_quota,
 	},
 };
 
@@ -562,9 +562,8 @@ xfs_scrub_metadata(
 		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
 			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 
-		needs_fix = (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
-						XFS_SCRUB_OFLAG_XCORRUPT |
-						XFS_SCRUB_OFLAG_PREEN));
+		needs_fix = xfs_scrub_needs_repair(sc.sm);
+
 		/*
 		 * If userspace asked for a repair but it wasn't necessary,
 		 * report that back to userspace.
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 270098bb3225..43c4189ea549 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -157,5 +157,6 @@ void xfs_scrub_xref_is_used_rt_space(struct xfs_scrub_context *sc,
 
 bool xfs_scrub_xattr_set_map(struct xfs_scrub_context *sc, unsigned long *map,
 		unsigned int start, unsigned int len);
+uint xfs_scrub_quota_to_dqtype(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 20/21] xfs: implement live quotacheck as part of quota repair
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (18 preceding siblings ...)
  2018-06-24 19:25 ` [PATCH 19/21] xfs: repair quotas Darrick J. Wong
@ 2018-06-24 19:25 ` Darrick J. Wong
  2018-06-24 19:25 ` [PATCH 21/21] xfs: add online scrub/repair for superblock counters Darrick J. Wong
  20 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:25 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the fs freezing mechanism we developed for the rmapbt repair to
freeze the fs, this time to scan the fs for a live quotacheck.  We add a
new dqget variant to use the existing scrub transaction to allocate an
on-disk dquot block if it is missing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/quota.c        |   20 +++
 fs/xfs/scrub/quota_repair.c |  296 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_dquot.c          |   59 ++++++++-
 fs/xfs/xfs_dquot.h          |    3 
 4 files changed, 370 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index ab0f0f7fde2d..34a6dfffbbc2 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -27,6 +27,7 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
+#include "scrub/repair.h"
 
 /* Convert a scrub type code to a DQ flag, or return 0 if error. */
 uint
@@ -64,12 +65,29 @@ xfs_scrub_setup_quota(
 	mutex_lock(&sc->mp->m_quotainfo->qi_quotaofflock);
 	if (!xfs_this_quota_on(sc->mp, dqtype))
 		return -ENOENT;
+	/*
+	 * Freeze out anything that can alter an inode because we reconstruct
+	 * the quota counts by iterating all the inodes in the system.
+	 */
+	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
+	    (sc->try_harder || XFS_QM_NEED_QUOTACHECK(sc->mp))) {
+		error = xfs_scrub_fs_freeze(sc);
+		if (error)
+			return error;
+	}
 	error = xfs_scrub_setup_fs(sc, ip);
 	if (error)
 		return error;
 	sc->ip = xfs_quota_inode(sc->mp, dqtype);
-	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
 	sc->ilock_flags = XFS_ILOCK_EXCL;
+	/*
+	 * Pretend to be an ILOCK parent to shut up lockdep if we're going to
+	 * do a full inode scan of the fs.  Quota inodes do not count towards
+	 * quota accounting, so we shouldn't deadlock on ourselves.
+	 */
+	if (sc->fs_frozen)
+		sc->ilock_flags |= XFS_ILOCK_PARENT;
+	xfs_ilock(sc->ip, sc->ilock_flags);
 	return 0;
 }
 
diff --git a/fs/xfs/scrub/quota_repair.c b/fs/xfs/scrub/quota_repair.c
index 4e7af44a2ba1..efdbe14a7ecb 100644
--- a/fs/xfs/scrub/quota_repair.c
+++ b/fs/xfs/scrub/quota_repair.c
@@ -16,13 +16,20 @@
 #include "xfs_trans.h"
 #include "xfs_sb.h"
 #include "xfs_inode.h"
+#include "xfs_icache.h"
 #include "xfs_inode_fork.h"
 #include "xfs_alloc.h"
 #include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_dquot.h"
 #include "xfs_dquot_item.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -37,6 +44,11 @@
  * verifiers complain about, cap any counters or limits that make no sense,
  * and schedule a quotacheck if we had to fix anything.  We also repair any
  * data fork extent records that don't apply to metadata files.
+ *
+ * Online quotacheck is fairly straightforward.  We engage a repair freeze,
+ * zero all the dquots, and scan every inode in the system to recalculate the
+ * appropriate quota charges.  Finally, we log all the dquots to disk and
+ * set the _CHKD flags.
  */
 
 struct xfs_repair_quota_info {
@@ -314,6 +326,272 @@ xfs_repair_quota_data_fork(
 	return error;
 }
 
+/* Online Quotacheck */
+
+/*
+ * Add this inode's resource usage to the dquot.  We adjust the in-core and
+ * the (cached) on-disk copies of the counters and leave the dquot dirty.  A
+ * subsequent pass through the dquots logs them all to disk.  Fortunately we
+ * froze the filesystem before starting so at least we don't have to deal
+ * with chown/chproj races.
+ */
+STATIC int
+xfs_repair_quotacheck_dqadjust(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip,
+	uint				type,
+	xfs_qcnt_t			nblks,
+	xfs_qcnt_t			rtblks)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_dquot		*dqp;
+	xfs_dqid_t			id;
+	int				error;
+
+	/* Try to read in the dquot. */
+	id = xfs_qm_id_for_quotatype(ip, type);
+	error = xfs_qm_dqget(mp, id, type, false, &dqp);
+	if (error == -ENOENT) {
+		/* Allocate a dquot using our special transaction. */
+		error = xfs_qm_dqget_alloc(&sc->tp, id, type, &dqp);
+		if (error)
+			return error;
+		error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+	}
+	if (error) {
+		/*
+		 * Shouldn't be able to turn off quotas here.
+		 */
+		ASSERT(error != -ESRCH);
+		ASSERT(error != -ENOENT);
+		return error;
+	}
+
+	/*
+	 * Adjust the inode count and the block count to reflect this inode's
+	 * resource usage.
+	 */
+	be64_add_cpu(&dqp->q_core.d_icount, 1);
+	dqp->q_res_icount++;
+	if (nblks) {
+		be64_add_cpu(&dqp->q_core.d_bcount, nblks);
+		dqp->q_res_bcount += nblks;
+	}
+	if (rtblks) {
+		be64_add_cpu(&dqp->q_core.d_rtbcount, rtblks);
+		dqp->q_res_rtbcount += rtblks;
+	}
+
+	/*
+	 * Set default limits, adjust timers (since we changed usages)
+	 *
+	 * There are no timers for the default values set in the root dquot.
+	 */
+	if (dqp->q_core.d_id) {
+		xfs_qm_adjust_dqlimits(mp, dqp);
+		xfs_qm_adjust_dqtimers(mp, &dqp->q_core);
+	}
+
+	dqp->dq_flags |= XFS_DQ_DIRTY;
+	xfs_qm_dqput(dqp);
+	return 0;
+}
+
+/* Record this inode's quota use. */
+STATIC int
+xfs_repair_quotacheck_inode(
+	struct xfs_scrub_context	*sc,
+	uint				dqtype,
+	struct xfs_inode		*ip)
+{
+	struct xfs_ifork		*ifp;
+	xfs_filblks_t			rtblks = 0;	/* total rt blks */
+	xfs_qcnt_t			nblks;
+	int				error;
+
+	/* Count the realtime blocks. */
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+
+		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+			error = xfs_iread_extents(sc->tp, ip, XFS_DATA_FORK);
+			if (error)
+				return error;
+		}
+
+		xfs_bmap_count_leaves(ifp, &rtblks);
+	}
+
+	nblks = (xfs_qcnt_t)ip->i_d.di_nblocks - rtblks;
+
+	/* Adjust the dquot. */
+	return xfs_repair_quotacheck_dqadjust(sc, ip, dqtype, nblks, rtblks);
+}
+
+struct xfs_repair_quotacheck {
+	struct xfs_scrub_context	*sc;
+	uint				dqtype;
+};
+
+/* Iterate all the inodes in an AG group. */
+STATIC int
+xfs_repair_quotacheck_inobt(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_inobt_rec_incore	irec;
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_inode		*ip = NULL;
+	struct xfs_repair_quotacheck	*rq = priv;
+	xfs_ino_t			ino;
+	xfs_agino_t			agino;
+	int				chunkidx;
+	int				error = 0;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+
+	for (chunkidx = 0, agino = irec.ir_startino;
+	     chunkidx < XFS_INODES_PER_CHUNK;
+	     chunkidx++, agino++) {
+		bool	inuse;
+
+		/* Skip if this inode is free */
+		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
+			continue;
+		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
+		if (xfs_is_quota_inode(&mp->m_sb, ino))
+			continue;
+
+		/* Back off and try again if an inode is being reclaimed */
+		error = xfs_icache_inode_is_allocated(mp, NULL, ino, &inuse);
+		if (error == -EAGAIN)
+			return -EDEADLOCK;
+
+		/*
+		 * Grab inode for scanning.  We cannot use DONTCACHE here
+		 * because we already have a transaction so the iput must not
+		 * trigger inode reclaim (which might allocate a transaction
+		 * to clean up posteof blocks).
+		 */
+		error = xfs_iget(mp, NULL, ino, 0, XFS_ILOCK_EXCL, &ip);
+		if (error)
+			return error;
+		trace_xfs_scrub_iget(ip, __this_address);
+
+		error = xfs_repair_quotacheck_inode(rq->sc, rq->dqtype, ip);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		xfs_scrub_iput(rq->sc, ip);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Zero a dquot prior to regenerating the counts. */
+static int
+xfs_repair_quotacheck_zero_dquot(
+	struct xfs_dquot		*dq,
+	uint				dqtype,
+	void				*priv)
+{
+	dq->q_res_bcount -= be64_to_cpu(dq->q_core.d_bcount);
+	dq->q_core.d_bcount = 0;
+	dq->q_res_icount -= be64_to_cpu(dq->q_core.d_icount);
+	dq->q_core.d_icount = 0;
+	dq->q_res_rtbcount -= be64_to_cpu(dq->q_core.d_rtbcount);
+	dq->q_core.d_rtbcount = 0;
+	dq->dq_flags |= XFS_DQ_DIRTY;
+	return 0;
+}
+
+/* Log a dirty dquot after we regenerated the counters. */
+static int
+xfs_repair_quotacheck_log_dquot(
+	struct xfs_dquot		*dq,
+	uint				dqtype,
+	void				*priv)
+{
+	struct xfs_scrub_context	*sc = priv;
+	int				error;
+
+	xfs_trans_dqjoin(sc->tp, dq);
+	xfs_trans_log_dquot(sc->tp, dq);
+	error = xfs_trans_roll(&sc->tp);
+	xfs_dqlock(dq);
+	return error;
+}
+
+/* Execute an online quotacheck. */
+STATIC int
+xfs_repair_quotacheck(
+	struct xfs_scrub_context	*sc,
+	uint				dqtype)
+{
+	struct xfs_repair_quotacheck	rq;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*bp;
+	struct xfs_btree_cur		*cur;
+	xfs_agnumber_t			ag;
+	uint				flag;
+	int				error;
+
+	/*
+	 * Commit the transaction so that we can allocate new quota ip
+	 * mappings if we have to.  If we crash after this point, the sb
+	 * still has the CHKD flags cleared, so mount quotacheck will fix
+	 * all of this up.
+	 */
+	error = xfs_trans_commit(sc->tp);
+	sc->tp = NULL;
+	if (error)
+		return error;
+
+	/* Zero all the quota items. */
+	error = xfs_qm_dqiterate(mp, dqtype, xfs_repair_quotacheck_zero_dquot,
+			sc);
+	if (error)
+		goto out;
+
+	rq.sc = sc;
+	rq.dqtype = dqtype;
+
+	/* Iterate all AGs for inodes. */
+	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
+		error = xfs_ialloc_read_agi(mp, NULL, ag, &bp);
+		if (error)
+			goto out;
+		cur = xfs_inobt_init_cursor(mp, NULL, bp, ag, XFS_BTNUM_INO);
+		error = xfs_btree_query_all(cur, xfs_repair_quotacheck_inobt,
+				&rq);
+		xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR :
+						  XFS_BTREE_NOERROR);
+		xfs_buf_relse(bp);
+		if (error)
+			goto out;
+	}
+
+	/* Log dquots. */
+	error = xfs_scrub_trans_alloc(sc, 0);
+	if (error)
+		goto out;
+	error = xfs_qm_dqiterate(mp, dqtype, xfs_repair_quotacheck_log_dquot,
+			sc);
+	if (error)
+		goto out;
+
+	/* Set quotachecked flag. */
+	flag = xfs_quota_chkd_flag(dqtype);
+	sc->mp->m_qflags |= flag;
+	spin_lock(&sc->mp->m_sb_lock);
+	sc->mp->m_sb.sb_qflags |= flag;
+	spin_unlock(&sc->mp->m_sb_lock);
+	xfs_log_sb(sc->tp);
+out:
+	return error;
+}
+
 /*
  * Go fix anything in the quota items that we could have been mad about.  Now
  * that we've checked the quota inode data fork we have to drop ILOCK_EXCL to
@@ -334,7 +612,8 @@ xfs_repair_quota_problems(
 		return error;
 
 	/* Make a quotacheck happen. */
-	if (rqi.need_quotacheck)
+	if (rqi.need_quotacheck ||
+	    XFS_TEST_ERROR(false, sc->mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
 		xfs_repair_force_quotacheck(sc, dqtype);
 	return 0;
 }
@@ -345,6 +624,7 @@ xfs_repair_quota(
 	struct xfs_scrub_context	*sc)
 {
 	uint				dqtype;
+	uint				flag;
 	int				error;
 
 	dqtype = xfs_scrub_quota_to_dqtype(sc);
@@ -360,6 +640,20 @@ xfs_repair_quota(
 
 	/* Fix anything the dquot verifiers complain about. */
 	error = xfs_repair_quota_problems(sc, dqtype);
+	if (error)
+		goto out;
+
+	/* Do we need a quotacheck?  Did we need one? */
+	flag = xfs_quota_chkd_flag(dqtype);
+	if (!(flag & sc->mp->m_qflags)) {
+		/* We need to freeze the fs before we can scan inodes. */
+		if (!sc->fs_frozen) {
+			error = -EDEADLOCK;
+			goto out;
+		}
+
+		error = xfs_repair_quotacheck(sc, dqtype);
+	}
 out:
 	return error;
 }
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 0973a0423bed..709f4c70916b 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -534,6 +534,7 @@ xfs_dquot_from_disk(
 static int
 xfs_qm_dqread_alloc(
 	struct xfs_mount	*mp,
+	struct xfs_trans	**tpp,
 	struct xfs_dquot	*dqp,
 	struct xfs_buf		**bpp)
 {
@@ -541,6 +542,18 @@ xfs_qm_dqread_alloc(
 	struct xfs_buf		*bp;
 	int			error;
 
+	/*
+	 * The caller passed in a transaction which we don't control, so
+	 * release the hold before passing back the buffer.
+	 */
+	if (tpp) {
+		error = xfs_dquot_disk_alloc(tpp, dqp, &bp);
+		if (error)
+			return error;
+		xfs_trans_bhold_release(*tpp, bp);
+		return 0;
+	}
+
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_dqalloc,
 			XFS_QM_DQALLOC_SPACE_RES(mp), 0, 0, &tp);
 	if (error)
@@ -576,6 +589,7 @@ xfs_qm_dqread_alloc(
 static int
 xfs_qm_dqread(
 	struct xfs_mount	*mp,
+	struct xfs_trans	**tpp,
 	xfs_dqid_t		id,
 	uint			type,
 	bool			can_alloc,
@@ -591,7 +605,7 @@ xfs_qm_dqread(
 	/* Try to read the buffer, allocating if necessary. */
 	error = xfs_dquot_disk_read(mp, dqp, &bp);
 	if (error == -ENOENT && can_alloc)
-		error = xfs_qm_dqread_alloc(mp, dqp, &bp);
+		error = xfs_qm_dqread_alloc(mp, tpp, dqp, &bp);
 	if (error)
 		goto err;
 
@@ -775,9 +789,10 @@ xfs_qm_dqget_checks(
  * Given the file system, id, and type (UDQUOT/GDQUOT), return a a locked
  * dquot, doing an allocation (if requested) as needed.
  */
-int
-xfs_qm_dqget(
+static int
+__xfs_qm_dqget(
 	struct xfs_mount	*mp,
+	struct xfs_trans	**tpp,
 	xfs_dqid_t		id,
 	uint			type,
 	bool			can_alloc,
@@ -799,7 +814,7 @@ xfs_qm_dqget(
 		return 0;
 	}
 
-	error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
+	error = xfs_qm_dqread(mp, NULL, id, type, can_alloc, &dqp);
 	if (error)
 		return error;
 
@@ -838,7 +853,39 @@ xfs_qm_dqget_uncached(
 	if (error)
 		return error;
 
-	return xfs_qm_dqread(mp, id, type, 0, dqpp);
+	return xfs_qm_dqread(mp, NULL, id, type, 0, dqpp);
+}
+
+/*
+ * Given the file system, id, and type (UDQUOT/GDQUOT), return a a locked
+ * dquot, doing an allocation (if requested) as needed.
+ */
+int
+xfs_qm_dqget(
+	struct xfs_mount	*mp,
+	xfs_dqid_t		id,
+	uint			type,
+	bool			can_alloc,
+	struct xfs_dquot	**O_dqpp)
+{
+	return __xfs_qm_dqget(mp, NULL, id, type, can_alloc, O_dqpp);
+}
+
+/*
+ * Given the file system, id, and type (UDQUOT/GDQUOT) and a hole in the quota
+ * data where the on-disk dquot is supposed to live, return a locked dquot
+ * having allocated blocks with the transaction.  This is a corner case
+ * required by online repair, which already has a transaction and has to pass
+ * that into dquot_setup.
+ */
+int
+xfs_qm_dqget_alloc(
+	struct xfs_trans	**tpp,
+	xfs_dqid_t		id,
+	uint			type,
+	struct xfs_dquot	**dqpp)
+{
+	return __xfs_qm_dqget((*tpp)->t_mountp, tpp, id, type, true, dqpp);
 }
 
 /* Return the quota id for a given inode and type. */
@@ -902,7 +949,7 @@ xfs_qm_dqget_inode(
 	 * we re-acquire the lock.
 	 */
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
+	error = xfs_qm_dqread(mp, NULL, id, type, can_alloc, &dqp);
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 64bd8640f6e8..a25d98d2d1c8 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -168,6 +168,9 @@ extern int		xfs_qm_dqget_next(struct xfs_mount *mp, xfs_dqid_t id,
 extern int		xfs_qm_dqget_uncached(struct xfs_mount *mp,
 					xfs_dqid_t id, uint type,
 					struct xfs_dquot **dqpp);
+extern int		xfs_qm_dqget_alloc(struct xfs_trans **tpp,
+					xfs_dqid_t id, uint type,
+					struct xfs_dquot **dqpp);
 extern void		xfs_qm_dqput(xfs_dquot_t *);
 
 extern void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 21/21] xfs: add online scrub/repair for superblock counters
  2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
                   ` (19 preceding siblings ...)
  2018-06-24 19:25 ` [PATCH 20/21] xfs: implement live quotacheck as part of quota repair Darrick J. Wong
@ 2018-06-24 19:25 ` Darrick J. Wong
  20 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-24 19:25 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Teach online scrub and repair how to check and reset the superblock
inode and block counters.  The AG rebuilding functions will need these
to adjust the counts if they need to change as a part of recovering from
corruption.  We must use the repair freeze mechanism to prevent any
other changes while we do this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                  |    2 
 fs/xfs/libxfs/xfs_fs.h           |    3 
 fs/xfs/libxfs/xfs_types.c        |   34 +++++
 fs/xfs/libxfs/xfs_types.h        |    1 
 fs/xfs/scrub/common.c            |  146 ++++++++++++++++++++
 fs/xfs/scrub/common.h            |    4 +
 fs/xfs/scrub/fscounters.c        |  276 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/fscounters_repair.c |  100 ++++++++++++++
 fs/xfs/scrub/repair.h            |    2 
 fs/xfs/scrub/scrub.c             |    6 +
 fs/xfs/scrub/scrub.h             |    7 +
 fs/xfs/scrub/trace.h             |   63 ++++++++-
 12 files changed, 639 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters.c
 create mode 100644 fs/xfs/scrub/fscounters_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0392bca6f5fe..50876f164b73 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -148,6 +148,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   common.o \
 				   dabtree.o \
 				   dir.o \
+				   fscounters.o \
 				   ialloc.o \
 				   inode.o \
 				   parent.o \
@@ -167,6 +168,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   attr_repair.o \
 				   alloc_repair.o \
 				   bmap_repair.o \
+				   fscounters_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   refcount_repair.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index e93f9432d2a6..0f0e2948866c 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -502,9 +502,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_UQUOTA	21	/* user quotas */
 #define XFS_SCRUB_TYPE_GQUOTA	22	/* group quotas */
 #define XFS_SCRUB_TYPE_PQUOTA	23	/* project quotas */
+#define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	24
+#define XFS_SCRUB_TYPE_NR	25
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c
index 2e2a243cef2e..2e9c0c25ccb6 100644
--- a/fs/xfs/libxfs/xfs_types.c
+++ b/fs/xfs/libxfs/xfs_types.c
@@ -171,3 +171,37 @@ xfs_verify_rtbno(
 {
 	return rtbno < mp->m_sb.sb_rblocks;
 }
+
+/* Calculate the range of valid icount values. */
+static void
+xfs_icount_range(
+	struct xfs_mount	*mp,
+	unsigned long long	*min,
+	unsigned long long	*max)
+{
+	unsigned long long	nr_inos = 0;
+	xfs_agnumber_t		agno;
+
+	/* root, rtbitmap, rtsum all live in the first chunk */
+	*min = XFS_INODES_PER_CHUNK;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		xfs_agino_t	first, last;
+
+		xfs_agino_range(mp, agno, &first, &last);
+		nr_inos += first - last + 1;
+	}
+	*max = nr_inos;
+}
+
+/* Sanity-checking of inode counts. */
+bool
+xfs_verify_icount(
+	struct xfs_mount	*mp,
+	unsigned long long	icount)
+{
+	unsigned long long	min, max;
+
+	xfs_icount_range(mp, &min, &max);
+	return icount >= min && icount < max;
+}
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 4055d62f690c..b9e6c89284c3 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -165,5 +165,6 @@ bool xfs_verify_ino(struct xfs_mount *mp, xfs_ino_t ino);
 bool xfs_internal_inum(struct xfs_mount *mp, xfs_ino_t ino);
 bool xfs_verify_dir_ino(struct xfs_mount *mp, xfs_ino_t ino);
 bool xfs_verify_rtbno(struct xfs_mount *mp, xfs_rtblock_t rtbno);
+bool xfs_verify_icount(struct xfs_mount *mp, unsigned long long icount);
 
 #endif	/* __XFS_TYPES_H__ */
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 257cb13d36e3..4c4a6a2d5480 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1029,3 +1029,149 @@ xfs_scrub_fs_thaw(
 	mutex_unlock(&sc->mp->m_scrub_freeze);
 	return error;
 }
+
+/* Decide if we're going to grab this inode for iteration. */
+STATIC int
+xfs_scrub_foreach_live_inode_ag_grab(
+	struct xfs_inode	*ip)
+{
+	struct inode		*inode = VFS_I(ip);
+
+	ASSERT(rcu_read_lock_held());
+
+	/*
+	 * check for stale RCU freed inode
+	 *
+	 * If the inode has been reallocated, it doesn't matter if it's not in
+	 * the AG we are walking - we are walking for writeback, so if it
+	 * passes all the "valid inode" checks and is dirty, then we'll write
+	 * it back anyway.  If it has been reallocated and still being
+	 * initialised, the XFS_INEW check below will catch it.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	if (!ip->i_ino)
+		goto out_unlock_noent;
+
+	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
+	if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
+		goto out_unlock_noent;
+	spin_unlock(&ip->i_flags_lock);
+
+	/* nothing to sync during shutdown */
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
+		return -EFSCORRUPTED;
+
+	/* If we can't grab the inode, it must on it's way to reclaim. */
+	if (!igrab(inode))
+		return -ENOENT;
+	trace_xfs_scrub_iget(ip, __this_address);
+
+	/* inode is valid */
+	return 0;
+
+out_unlock_noent:
+	spin_unlock(&ip->i_flags_lock);
+	return -ENOENT;
+}
+
+#define XFS_LOOKUP_BATCH 32
+/*
+ * Iterate all in-core inodes of an AG.  We will not wait for inodes that are
+ * new or reclaimable, and the filesystem should be frozen by the caller.
+ */
+STATIC int
+xfs_scrub_foreach_live_inode_ag(
+	struct xfs_scrub_context *sc,
+	struct xfs_perag	*pag,
+	int			(*execute)(struct xfs_inode *ip, void *priv),
+	void			*priv)
+{
+	struct xfs_mount	*mp = sc->mp;
+	uint32_t		first_index = 0;
+	int			done = 0;
+	int			nr_found = 0;
+	int			error = 0;
+
+	do {
+		struct xfs_inode *batch[XFS_LOOKUP_BATCH];
+		int		i;
+
+		rcu_read_lock();
+
+		nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
+				(void **)batch, first_index, XFS_LOOKUP_BATCH);
+		if (!nr_found) {
+			rcu_read_unlock();
+			break;
+		}
+
+		/*
+		 * Grab the inodes before we drop the lock. if we found
+		 * nothing, nr == 0 and the loop will be skipped.
+		 */
+		for (i = 0; i < nr_found; i++) {
+			struct xfs_inode *ip = batch[i];
+
+			if (done || xfs_scrub_foreach_live_inode_ag_grab(ip))
+				batch[i] = NULL;
+
+			/*
+			 * Update the index for the next lookup. Catch
+			 * overflows into the next AG range which can occur if
+			 * we have inodes in the last block of the AG and we
+			 * are currently pointing to the last inode.
+			 *
+			 * Because we may see inodes that are from the wrong AG
+			 * due to RCU freeing and reallocation, only update the
+			 * index if it lies in this AG. It was a race that lead
+			 * us to see this inode, so another lookup from the
+			 * same index will not find it again.
+			 */
+			if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno)
+				continue;
+			first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
+			if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
+				done = 1;
+		}
+
+		/* unlock now we've grabbed the inodes. */
+		rcu_read_unlock();
+
+		for (i = 0; i < nr_found; i++) {
+			if (!batch[i])
+				continue;
+			if (!error)
+				error = execute(batch[i], priv);
+			xfs_scrub_iput(sc, batch[i]);
+		}
+
+		if (error)
+			break;
+	} while (nr_found && !done);
+
+	return error;
+}
+
+/*
+ * Iterate all in-core inodes.  We will not wait for inodes that are
+ * new or reclaimable, and the filesystem should be frozen by the caller.
+ */
+int
+xfs_scrub_foreach_live_inode(
+	struct xfs_scrub_context *sc,
+	int			(*execute)(struct xfs_inode *ip, void *priv),
+	void			*priv)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+	int			error = 0;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount && !error; agno++) {
+		pag = xfs_perag_get(mp, agno);
+		error = xfs_scrub_foreach_live_inode_ag(sc, pag, execute, priv);
+		xfs_perag_put(pag);
+	}
+
+	return error;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index b0cca36de2de..aed12c4fb2f5 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -105,6 +105,8 @@ xfs_scrub_setup_quota(struct xfs_scrub_context *sc, struct xfs_inode *ip)
 	return -ENOENT;
 }
 #endif
+int xfs_scrub_setup_fscounters(struct xfs_scrub_context *sc,
+			       struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
@@ -151,5 +153,7 @@ int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
 void xfs_scrub_iput(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 int xfs_scrub_fs_freeze(struct xfs_scrub_context *sc);
 int xfs_scrub_fs_thaw(struct xfs_scrub_context *sc);
+int xfs_scrub_foreach_live_inode(struct xfs_scrub_context *sc,
+		int (*execute)(struct xfs_inode *ip, void *priv), void *priv);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
new file mode 100644
index 000000000000..32661e1951ba
--- /dev/null
+++ b/fs/xfs/scrub/fscounters.c
@@ -0,0 +1,276 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * FS Summary Counters
+ * ===================
+ *
+ * Filesystem summary counters are a tricky beast to check.  We cannot have
+ * anyone changing the superblock fields, the percpu counters, or the AG
+ * headers while we do the global check.  This means that we must freeze the
+ * filesystem for the entire duration.   Once that's done, we compute what the
+ * incore counters /should/ be based on the counters in the AG headers
+ * (presumably we checked those in an earlier part of scrub) and the in-core
+ * free space reservations (both the user-changeable one and the per-AG ones).
+ *
+ * From there we compare the computed incore counts to the actual ones and
+ * complain if they're off.  For repair we compute the deltas needed to
+ * correct the counters and then update the incore and ondisk counters
+ * accordingly.
+ */
+
+/* Summary counter checks require a frozen fs. */
+int
+xfs_scrub_setup_fscounters(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	int				error;
+
+	/* Save counters across runs. */
+	sc->buf = kmem_zalloc(sizeof(struct xfs_scrub_fscounters), KM_SLEEP);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	/*
+	 * We need to prevent any other thread from changing the global fs
+	 * summary counters while we're scrubbing or repairing them.  This
+	 * requires the fs to be frozen.
+	 *
+	 * Scrub can do some basic sanity checks if userspace does not permit
+	 * us to freeze the filesystem.
+	 */
+	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
+	    !(sc->sm->sm_flags & XFS_SCRUB_IFLAG_FREEZE_OK))
+		return -EUSERS;
+
+	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_FREEZE_OK) {
+		error = xfs_scrub_fs_freeze(sc);
+		if (error)
+			return error;
+	}
+
+	/* Set up the scrub context. */
+	return xfs_scrub_trans_alloc(sc, 0);
+}
+
+/*
+ * Record the number of blocks reserved for this inode for future writes but
+ * not yet allocated to real space.  In other words, we're looking for all
+ * subtractions from fdblocks that aren't backed by actual space allocations
+ * while we recalculate fdlbocks.
+ */
+STATIC int
+xfs_scrub_fscounters_count_del(
+	struct xfs_inode	*ip,
+	void			*priv)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	rec;
+	struct xfs_ifork	*ifp;
+	uint64_t		*d = priv;
+	int64_t			delblks = ip->i_delayed_blks;
+
+	if (delblks == 0)
+		return 0;
+
+	/* Add the indlen blocks for each data fork reservation. */
+	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	for_each_xfs_iext(ifp, &icur, &rec) {
+		if (!isnullstartblock(rec.br_startblock))
+			continue;
+		delblks += startblockval(rec.br_startblock);
+	}
+
+	/*
+	 * Add the indlen blocks for each CoW fork reservation.  Remember
+	 * that we count real/unwritten extents in the CoW fork towards
+	 * i_delayed_blks, so we have to subtract those.
+	 */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	if (ifp) {
+		for_each_xfs_iext(ifp, &icur, &rec) {
+			if (!isnullstartblock(rec.br_startblock)) {
+				/* real/unwritten extent */
+				delblks -= rec.br_blockcount;
+				continue;
+			}
+			delblks += startblockval(rec.br_startblock);
+		}
+	}
+
+	/* No, we can't have negative reservations. */
+	if (delblks < 0)
+		return -EFSCORRUPTED;
+
+	*d += delblks;
+	return 0;
+}
+
+/*
+ * Calculate what the global in-core counters ought to be from the AG header
+ * contents.  Callers can compare this to the actual in-core counters to
+ * calculate by how much both in-core and on-disk counters need to be
+ * adjusted.
+ */
+STATIC int
+xfs_scrub_fscounters_calc(
+	struct xfs_scrub_context	*sc,
+	struct xfs_scrub_fscounters	*fsc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agi_bp;
+	struct xfs_buf			*agf_bp;
+	struct xfs_agi			*agi;
+	struct xfs_agf			*agf;
+	struct xfs_perag		*pag;
+	uint64_t			delayed = 0;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	ASSERT(sc->fs_frozen);
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		/* Count all the inodes */
+		error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
+		if (error)
+			return error;
+		agi = XFS_BUF_TO_AGI(agi_bp);
+		fsc->icount += be32_to_cpu(agi->agi_count);
+		fsc->ifree += be32_to_cpu(agi->agi_freecount);
+
+		/* Add up the free/freelist/bnobt/cntbt blocks */
+		error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, &agf_bp);
+		if (error)
+			return error;
+		if (!agf_bp)
+			return -ENOMEM;
+		agf = XFS_BUF_TO_AGF(agf_bp);
+		fsc->fdblocks += be32_to_cpu(agf->agf_freeblks);
+		fsc->fdblocks += be32_to_cpu(agf->agf_flcount);
+		fsc->fdblocks += be32_to_cpu(agf->agf_btreeblks);
+
+		/*
+		 * Per-AG reservations are taken out of the incore counters,
+		 * so count them out.
+		 */
+		pag = xfs_perag_get(mp, agno);
+		fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
+		fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
+		xfs_perag_put(pag);
+	}
+
+	/*
+	 * The global space reservation is taken out of the incore counters,
+	 * so count that out too.
+	 */
+	fsc->fdblocks -= mp->m_resblks_avail;
+
+	/*
+	 * Delayed allocation reservations are taken out of the incore counters
+	 * but not recorded on disk, so count them out too.
+	 */
+	error = xfs_scrub_foreach_live_inode(sc, xfs_scrub_fscounters_count_del,
+			&delayed);
+	if (error)
+		return error;
+	fsc->fdblocks -= delayed;
+
+	trace_xfs_scrub_fscounters_calc(mp, fsc->icount, fsc->ifree,
+			fsc->fdblocks, delayed);
+
+	/* Bail out if the values we compute are totally nonsense. */
+	if (!xfs_verify_icount(mp, fsc->icount) ||
+	    fsc->fdblocks > mp->m_sb.sb_dblocks ||
+	    fsc->ifree > fsc->icount)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/*
+ * Check the superblock counters.
+ *
+ * The filesystem must be frozen so that the counters do not change while
+ * we're computing the summary counters.
+ */
+int
+xfs_scrub_fscounters(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_scrub_fscounters	*fsc = sc->buf;
+	int				error;
+
+	/* See if icount is obviously wrong. */
+	if (!xfs_verify_icount(mp, mp->m_sb.sb_icount))
+		xfs_scrub_block_set_corrupt(sc, mp->m_sb_bp);
+
+	/* See if fdblocks / ifree are obviously wrong. */
+	if (mp->m_sb.sb_fdblocks > mp->m_sb.sb_dblocks)
+		xfs_scrub_block_set_corrupt(sc, mp->m_sb_bp);
+	if (mp->m_sb.sb_ifree > mp->m_sb.sb_icount)
+		xfs_scrub_block_set_corrupt(sc, mp->m_sb_bp);
+
+	/*
+	 * If we're only checking for corruption and we found it, exit now.
+	 *
+	 * Repair depends on the counter values we collect here, so if the
+	 * IFLAG_REPAIR flag is set we must continue to calculate the correct
+	 * counter values.
+	 */
+	if (!(sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
+	    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
+		return 0;
+
+	/* Bail out if we need to be frozen to do the hard checks. */
+	if (!sc->fs_frozen) {
+		xfs_scrub_set_incomplete(sc);
+		return -EUSERS;
+	}
+
+	/* Counters seem ok, but let's count them. */
+	error = xfs_scrub_fscounters_calc(sc, fsc);
+	if (!xfs_scrub_process_error(sc, 0, XFS_SB_BLOCK(sc->mp), &error))
+		return error;
+
+	/*
+	 * Compare the in-core counters.  In theory we sync'd the superblock
+	 * when we did the repair freeze, so they should be the same as the
+	 * percpu counters.
+	 */
+	spin_lock(&mp->m_sb_lock);
+	if (mp->m_sb.sb_icount != fsc->icount)
+		xfs_scrub_block_set_corrupt(sc, mp->m_sb_bp);
+	if (mp->m_sb.sb_ifree != fsc->ifree)
+		xfs_scrub_block_set_corrupt(sc, mp->m_sb_bp);
+	if (mp->m_sb.sb_fdblocks != fsc->fdblocks)
+		xfs_scrub_block_set_corrupt(sc, mp->m_sb_bp);
+	spin_unlock(&mp->m_sb_lock);
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/fscounters_repair.c b/fs/xfs/scrub/fscounters_repair.c
new file mode 100644
index 000000000000..8893fb9d4813
--- /dev/null
+++ b/fs/xfs/scrub/fscounters_repair.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * FS Summary Counters
+ * ===================
+ *
+ * To repair the filesystem summary counters we compute the correct values,
+ * take the difference between those values and the ones in m_sb, and modify
+ * both the percpu and the m_sb counters by the corresponding amounts.  The
+ * filesystem must be frozen to do anything.
+ */
+
+/*
+ * Reset the superblock counters.
+ *
+ * The filesystem must be frozen so that the counters do not change while
+ * we're computing the summary counters.
+ */
+int
+xfs_repair_fscounters(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_scrub_fscounters	*fsc = sc->buf;
+	int64_t				delta_icount;
+	int64_t				delta_ifree;
+	int64_t				delta_fdblocks;
+	int				error;
+
+	/*
+	 * Reinitialize the counters.  We know that the counters in mp->m_sb
+	 * are supposed to match the counters we calculated, so we therefore
+	 * need to calculate the deltas...
+	 */
+	spin_lock(&mp->m_sb_lock);
+	delta_icount = (int64_t)fsc->icount - mp->m_sb.sb_icount;
+	delta_ifree = (int64_t)fsc->ifree - mp->m_sb.sb_ifree;
+	delta_fdblocks = (int64_t)fsc->fdblocks - mp->m_sb.sb_fdblocks;
+	spin_unlock(&mp->m_sb_lock);
+
+	trace_xfs_repair_reset_counters(mp, delta_icount, delta_ifree,
+			delta_fdblocks);
+
+	/* ...and then update the per-cpu counters... */
+	if (delta_icount) {
+		error = xfs_mod_icount(mp, delta_icount);
+		if (error)
+			return error;
+	}
+	if (delta_ifree) {
+		error = xfs_mod_ifree(mp, delta_ifree);
+		if (error)
+			goto err_icount;
+	}
+	if (delta_fdblocks) {
+		error = xfs_mod_fdblocks(mp, delta_fdblocks, false);
+		if (error)
+			goto err_ifree;
+	}
+
+	/* ...and finally log the superblock changes. */
+	spin_lock(&mp->m_sb_lock);
+	mp->m_sb.sb_icount = fsc->icount;
+	mp->m_sb.sb_ifree = fsc->ifree;
+	mp->m_sb.sb_fdblocks = fsc->fdblocks;
+	spin_unlock(&mp->m_sb_lock);
+	xfs_log_sb(sc->tp);
+
+	return 0;
+err_icount:
+	xfs_mod_icount(mp, -delta_icount);
+err_ifree:
+	xfs_mod_ifree(mp, -delta_ifree);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 083ab63624eb..7e3fee59b517 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -121,6 +121,7 @@ int xfs_repair_quota(struct xfs_scrub_context *sc);
 #else
 # define xfs_repair_quota		xfs_repair_notsupported
 #endif /* CONFIG_XFS_QUOTA */
+int xfs_repair_fscounters(struct xfs_scrub_context *sc);
 
 #else
 
@@ -165,6 +166,7 @@ static inline int xfs_repair_rmapbt_setup(
 #define xfs_repair_symlink		xfs_repair_notsupported
 #define xfs_repair_xattr		xfs_repair_notsupported
 #define xfs_repair_quota		xfs_repair_notsupported
+#define xfs_repair_fscounters		xfs_repair_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index f57ec412a617..42b23d831c9e 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -369,6 +369,12 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.scrub	= xfs_scrub_quota,
 		.repair	= xfs_repair_quota,
 	},
+	[XFS_SCRUB_TYPE_FSCOUNTERS] = {	/* fs summary counters */
+		.type	= ST_FS,
+		.setup	= xfs_scrub_setup_fscounters,
+		.scrub	= xfs_scrub_fscounters,
+		.repair	= xfs_repair_fscounters,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 43c4189ea549..1a1b0cad64a8 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -128,6 +128,7 @@ xfs_scrub_quota(struct xfs_scrub_context *sc)
 	return -ENOENT;
 }
 #endif
+int xfs_scrub_fscounters(struct xfs_scrub_context *sc);
 
 /* cross-referencing helpers */
 void xfs_scrub_xref_is_used_space(struct xfs_scrub_context *sc,
@@ -159,4 +160,10 @@ bool xfs_scrub_xattr_set_map(struct xfs_scrub_context *sc, unsigned long *map,
 		unsigned int start, unsigned int len);
 uint xfs_scrub_quota_to_dqtype(struct xfs_scrub_context *sc);
 
+struct xfs_scrub_fscounters {
+	uint64_t		icount;
+	uint64_t		ifree;
+	uint64_t		fdblocks;
+};
+
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 0212d273ca8b..a1608a27cb29 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -514,6 +514,50 @@ DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iget);
 DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iget_target);
 DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_target);
 
+TRACE_EVENT(xfs_scrub_fscounters_calc,
+	TP_PROTO(struct xfs_mount *mp, uint64_t icount, uint64_t ifree,
+		 uint64_t fdblocks, uint64_t delalloc),
+	TP_ARGS(mp, icount, ifree, fdblocks, delalloc),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int64_t, icount_sb)
+		__field(int64_t, icount_percpu)
+		__field(uint64_t, icount_calculated)
+		__field(int64_t, ifree_sb)
+		__field(int64_t, ifree_percpu)
+		__field(uint64_t, ifree_calculated)
+		__field(int64_t, fdblocks_sb)
+		__field(int64_t, fdblocks_percpu)
+		__field(uint64_t, fdblocks_calculated)
+		__field(uint64_t, delalloc)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->icount_sb = mp->m_sb.sb_icount;
+		__entry->icount_percpu = percpu_counter_sum(&mp->m_icount);
+		__entry->icount_calculated = icount;
+		__entry->ifree_sb = mp->m_sb.sb_ifree;
+		__entry->ifree_percpu = percpu_counter_sum(&mp->m_ifree);
+		__entry->ifree_calculated = ifree;
+		__entry->fdblocks_sb = mp->m_sb.sb_fdblocks;
+		__entry->fdblocks_percpu = percpu_counter_sum(&mp->m_fdblocks);
+		__entry->fdblocks_calculated = fdblocks;
+		__entry->delalloc = delalloc;
+	),
+	TP_printk("dev %d:%d icount %lld:%lld:%llu ifree %lld:%lld:%llu fdblocks %lld:%lld:%llu delalloc %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->icount_sb,
+		  __entry->icount_percpu,
+		  __entry->icount_calculated,
+		  __entry->ifree_sb,
+		  __entry->ifree_percpu,
+		  __entry->ifree_calculated,
+		  __entry->fdblocks_sb,
+		  __entry->fdblocks_percpu,
+		  __entry->fdblocks_calculated,
+		  __entry->delalloc)
+)
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 
@@ -722,17 +766,28 @@ TRACE_EVENT(xfs_repair_calc_ag_resblks_btsize,
 		  __entry->rmapbt_sz,
 		  __entry->refcbt_sz)
 )
+
 TRACE_EVENT(xfs_repair_reset_counters,
-	TP_PROTO(struct xfs_mount *mp),
-	TP_ARGS(mp),
+	TP_PROTO(struct xfs_mount *mp, int64_t icount_adj, int64_t ifree_adj,
+		 int64_t fdblocks_adj),
+	TP_ARGS(mp, icount_adj, ifree_adj, fdblocks_adj),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
+		__field(int64_t, icount_adj)
+		__field(int64_t, ifree_adj)
+		__field(int64_t, fdblocks_adj)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
+		__entry->icount_adj = icount_adj;
+		__entry->ifree_adj = ifree_adj;
+		__entry->fdblocks_adj = fdblocks_adj;
 	),
-	TP_printk("dev %d:%d",
-		  MAJOR(__entry->dev), MINOR(__entry->dev))
+	TP_printk("dev %d:%d icount %lld ifree %lld fdblocks %lld",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->icount_adj,
+		  __entry->ifree_adj,
+		  __entry->fdblocks_adj)
 )
 
 TRACE_EVENT(xfs_repair_ialloc_insert,


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap
  2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
@ 2018-06-27  0:54   ` Dave Chinner
  2018-06-28 21:11   ` Allison Henderson
  1 sibling, 0 replies; 77+ messages in thread
From: Dave Chinner @ 2018-06-27  0:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:23:36PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The original rmap code assumed that there would always be at least one
> rmap in the rmapbt (the AG sb/agf/agi) and so errored out if it didn't
> find one.  This assumption isn't true for the rmapbt repair function
> (and it won't be true for realtime rmap either), so remove the check and
> just deal with the situation.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks reasonable.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks
  2018-06-24 19:23 ` [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks Darrick J. Wong
@ 2018-06-27  1:02   ` Dave Chinner
  2018-06-28 21:12   ` Allison Henderson
  1 sibling, 0 replies; 77+ messages in thread
From: Dave Chinner @ 2018-06-27  1:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:23:42PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Add a helper to decide if an inode has real or unwritten extents in the
> CoW fork.  The upcoming repair freeze functionality will have to know if
> it's safe to iput an inode -- if the inode has incore any state that
> would require a transaction to unwind during iput, we'll have to defer
> the iput.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Seems ok, but I really can't review this in isolation because at
this point I've got no idea what it's calling context is.

i.e. I'm not going to say this is OK until I see how/why/where
delayed iput()s are used.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-24 19:23 ` [PATCH 04/21] xfs: repair the AGF and AGFL Darrick J. Wong
@ 2018-06-27  2:19   ` Dave Chinner
  2018-06-27 16:44     ` Allison Henderson
  2018-06-28 17:25     ` Allison Henderson
  2018-06-28 21:14   ` Allison Henderson
  1 sibling, 2 replies; 77+ messages in thread
From: Dave Chinner @ 2018-06-27  2:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:23:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Regenerate the AGF and AGFL from the rmap data.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[...]

> +/* Information for finding AGF-rooted btrees */
> +enum {
> +	REPAIR_AGF_BNOBT = 0,
> +	REPAIR_AGF_CNTBT,
> +	REPAIR_AGF_RMAPBT,
> +	REPAIR_AGF_REFCOUNTBT,
> +	REPAIR_AGF_END,
> +	REPAIR_AGF_MAX
> +};

Why can't you just use XFS_BTNUM_* for these btree type descriptors?

> +
> +static const struct xfs_repair_find_ag_btree repair_agf[] = {
> +	[REPAIR_AGF_BNOBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_AG,
> +		.buf_ops = &xfs_allocbt_buf_ops,
> +		.magic = XFS_ABTB_CRC_MAGIC,
> +	},
> +	[REPAIR_AGF_CNTBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_AG,
> +		.buf_ops = &xfs_allocbt_buf_ops,
> +		.magic = XFS_ABTC_CRC_MAGIC,
> +	},

I had to stop and think about why this only supports the v5 types.
i.e. we're rebuilding from rmap info, so this will never run on v4
filesystems, hence we only care about v5 types (i.e. *CRC_MAGIC).
Perhaps a one-line comment to remind readers of this?

> +	[REPAIR_AGF_RMAPBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_AG,
> +		.buf_ops = &xfs_rmapbt_buf_ops,
> +		.magic = XFS_RMAP_CRC_MAGIC,
> +	},
> +	[REPAIR_AGF_REFCOUNTBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_REFC,
> +		.buf_ops = &xfs_refcountbt_buf_ops,
> +		.magic = XFS_REFC_CRC_MAGIC,
> +	},
> +	[REPAIR_AGF_END] = {
> +		.buf_ops = NULL,
> +	},
> +};
> +
> +/*
> + * Find the btree roots.  This is /also/ a chicken and egg problem because we
> + * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
> + * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
> + * corruptions in those btrees we'll bail out.
> + */
> +STATIC int
> +xfs_repair_agf_find_btrees(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	struct xfs_repair_find_ag_btree	*fab,
> +	struct xfs_buf			*agfl_bp)
> +{
> +	struct xfs_agf			*old_agf = XFS_BUF_TO_AGF(agf_bp);
> +	int				error;
> +
> +	/* Go find the root data. */
> +	memcpy(fab, repair_agf, sizeof(repair_agf));

Why are we initialising fab here, instead of in the caller where it
is declared and passed to various functions? Given there is only a
single declaration of this structure, why do we need a global static
const table initialiser just to copy it here - why isn't it
initialised at the declaration point?

> +	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, agfl_bp);
> +	if (error)
> +		return error;
> +
> +	/* We must find the bnobt, cntbt, and rmapbt roots. */
> +	if (fab[REPAIR_AGF_BNOBT].root == NULLAGBLOCK ||
> +	    fab[REPAIR_AGF_BNOBT].height > XFS_BTREE_MAXLEVELS ||
> +	    fab[REPAIR_AGF_CNTBT].root == NULLAGBLOCK ||
> +	    fab[REPAIR_AGF_CNTBT].height > XFS_BTREE_MAXLEVELS ||
> +	    fab[REPAIR_AGF_RMAPBT].root == NULLAGBLOCK ||
> +	    fab[REPAIR_AGF_RMAPBT].height > XFS_BTREE_MAXLEVELS)
> +		return -EFSCORRUPTED;
> +
> +	/*
> +	 * We relied on the rmapbt to reconstruct the AGF.  If we get a
> +	 * different root then something's seriously wrong.
> +	 */
> +	if (fab[REPAIR_AGF_RMAPBT].root !=
> +	    be32_to_cpu(old_agf->agf_roots[XFS_BTNUM_RMAPi]))
> +		return -EFSCORRUPTED;
> +
> +	/* We must find the refcountbt root if that feature is enabled. */
> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb) &&
> +	    (fab[REPAIR_AGF_REFCOUNTBT].root == NULLAGBLOCK ||
> +	     fab[REPAIR_AGF_REFCOUNTBT].height > XFS_BTREE_MAXLEVELS))
> +		return -EFSCORRUPTED;
> +
> +	return 0;
> +}
> +
> +/* Set btree root information in an AGF. */
> +STATIC void
> +xfs_repair_agf_set_roots(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_agf			*agf,
> +	struct xfs_repair_find_ag_btree	*fab)
> +{
> +	agf->agf_roots[XFS_BTNUM_BNOi] =
> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].root);
> +	agf->agf_levels[XFS_BTNUM_BNOi] =
> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].height);
> +
> +	agf->agf_roots[XFS_BTNUM_CNTi] =
> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].root);
> +	agf->agf_levels[XFS_BTNUM_CNTi] =
> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].height);
> +
> +	agf->agf_roots[XFS_BTNUM_RMAPi] =
> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].root);
> +	agf->agf_levels[XFS_BTNUM_RMAPi] =
> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].height);
> +
> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
> +		agf->agf_refcount_root =
> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].root);
> +		agf->agf_refcount_level =
> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].height);
> +	}
> +}
> +
> +/*
> + * Reinitialize the AGF header, making an in-core copy of the old contents so
> + * that we know which in-core state needs to be reinitialized.
> + */
> +STATIC void
> +xfs_repair_agf_init_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	struct xfs_agf			*old_agf)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> +
> +	memcpy(old_agf, agf, sizeof(*old_agf));
> +	memset(agf, 0, BBTOB(agf_bp->b_length));
> +	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
> +	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
> +	agf->agf_seqno = cpu_to_be32(sc->sa.agno);
> +	agf->agf_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
> +	agf->agf_flfirst = old_agf->agf_flfirst;
> +	agf->agf_fllast = old_agf->agf_fllast;
> +	agf->agf_flcount = old_agf->agf_flcount;
> +	if (xfs_sb_version_hascrc(&mp->m_sb))
> +		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
> +}

Do we need to clear pag->pagf_init here so that it gets
re-initialised next time someone reads the AGF?

> +
> +/* Update the AGF btree counters by walking the btrees. */
> +STATIC int
> +xfs_repair_agf_update_btree_counters(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp)
> +{
> +	struct xfs_repair_agf_allocbt	raa = { .sc = sc };
> +	struct xfs_btree_cur		*cur = NULL;
> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_agblock_t			btreeblks;
> +	xfs_agblock_t			blocks;
> +	int				error;
> +
> +	/* Update the AGF counters from the bnobt. */
> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_BNO);
> +	error = xfs_alloc_query_all(cur, xfs_repair_agf_walk_allocbt, &raa);
> +	if (error)
> +		goto err;
> +	error = xfs_btree_count_blocks(cur, &blocks);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	btreeblks = blocks - 1;
> +	agf->agf_freeblks = cpu_to_be32(raa.freeblks);
> +	agf->agf_longest = cpu_to_be32(raa.longest);

This function updates more than the AGF btree counters. :P

> +
> +	/* Update the AGF counters from the cntbt. */
> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_CNT);
> +	error = xfs_btree_count_blocks(cur, &blocks);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	btreeblks += blocks - 1;
> +
> +	/* Update the AGF counters from the rmapbt. */
> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
> +	error = xfs_btree_count_blocks(cur, &blocks);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	agf->agf_rmap_blocks = cpu_to_be32(blocks);
> +	btreeblks += blocks - 1;
> +
> +	agf->agf_btreeblks = cpu_to_be32(btreeblks);
> +
> +	/* Update the AGF counters from the refcountbt. */
> +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> +		cur = xfs_refcountbt_init_cursor(mp, sc->tp, agf_bp,
> +				sc->sa.agno, NULL);
> +		error = xfs_btree_count_blocks(cur, &blocks);
> +		if (error)
> +			goto err;
> +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +		agf->agf_refcount_blocks = cpu_to_be32(blocks);
> +	}
> +
> +	return 0;
> +err:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
> +
> +/* Trigger reinitialization of the in-core data. */
> +STATIC int
> +xfs_repair_agf_reinit_incore(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_agf			*agf,
> +	const struct xfs_agf		*old_agf)
> +{
> +	struct xfs_perag		*pag;
> +
> +	/* XXX: trigger fdblocks recalculation */
> +
> +	/* Now reinitialize the in-core counters if necessary. */
> +	pag = sc->sa.pag;
> +	if (!pag->pagf_init)
> +		return 0;
> +
> +	pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
> +	pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks);
> +	pag->pagf_longest = be32_to_cpu(agf->agf_longest);
> +	pag->pagf_levels[XFS_BTNUM_BNOi] =
> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
> +	pag->pagf_levels[XFS_BTNUM_CNTi] =
> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
> +	pag->pagf_levels[XFS_BTNUM_RMAPi] =
> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
> +	pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);

Ok, so we reinit the pagf bits here, but....

> +
> +	return 0;
> +}
> +
> +/* Repair the AGF. */
> +int
> +xfs_repair_agf(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_repair_find_ag_btree	fab[REPAIR_AGF_MAX];
> +	struct xfs_agf			old_agf;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*agf_bp;
> +	struct xfs_buf			*agfl_bp;
> +	struct xfs_agf			*agf;
> +	int				error;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGF_DADDR(mp)),
> +			XFS_FSS_TO_BB(mp, 1), 0, &agf_bp, NULL);
> +	if (error)
> +		return error;
> +	agf_bp->b_ops = &xfs_agf_buf_ops;
> +	agf = XFS_BUF_TO_AGF(agf_bp);
> +
> +	/*
> +	 * Load the AGFL so that we can screen out OWN_AG blocks that are on
> +	 * the AGFL now; these blocks might have once been part of the
> +	 * bno/cnt/rmap btrees but are not now.  This is a chicken and egg
> +	 * problem: the AGF is corrupt, so we have to trust the AGFL contents
> +	 * because we can't do any serious cross-referencing with any of the
> +	 * btrees rooted in the AGF.  If the AGFL contents are obviously bad
> +	 * then we'll bail out.
> +	 */
> +	error = xfs_alloc_read_agfl(mp, sc->tp, sc->sa.agno, &agfl_bp);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Spot-check the AGFL blocks; if they're obviously corrupt then
> +	 * there's nothing we can do but bail out.
> +	 */
> +	error = xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(agf_bp), agfl_bp,
> +			xfs_repair_agf_check_agfl_block, sc);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Find the AGF btree roots.  See the comment for this function for
> +	 * more information about the limitations of this repairer; this is
> +	 * also a chicken-and-egg situation.
> +	 */
> +	error = xfs_repair_agf_find_btrees(sc, agf_bp, fab, agfl_bp);
> +	if (error)
> +		return error;

Comment could be better written.

	/*
	 * Find the AGF btree roots. This is also a chicken-and-egg
	 * situation - see xfs_repair_agf_find_btrees() for details.
	 */

> +
> +	/* Start rewriting the header and implant the btrees we found. */
> +	xfs_repair_agf_init_header(sc, agf_bp, &old_agf);
> +	xfs_repair_agf_set_roots(sc, agf, fab);
> +	error = xfs_repair_agf_update_btree_counters(sc, agf_bp);
> +	if (error)
> +		goto out_revert;

If we fail here, the pagf information is invalid, hence I think we
really do need to clear pagf_init before we start rebuilding the new
AGF. Yes, I can see we revert the AGF info, but this seems like a
landmine waiting to be tripped over.

> +	/* Reinitialize in-core state. */
> +	error = xfs_repair_agf_reinit_incore(sc, agf, &old_agf);
> +	if (error)
> +		goto out_revert;
> +
> +	/* Write this to disk. */
> +	xfs_trans_buf_set_type(sc->tp, agf_bp, XFS_BLFT_AGF_BUF);
> +	xfs_trans_log_buf(sc->tp, agf_bp, 0, BBTOB(agf_bp->b_length) - 1);
> +	return 0;
> +
> +out_revert:
> +	memcpy(agf, &old_agf, sizeof(old_agf));
> +	return error;
> +}
> +
> +/* AGFL */
> +
> +struct xfs_repair_agfl {
> +	struct xfs_repair_extent_list	agmeta_list;
> +	struct xfs_repair_extent_list	*freesp_list;
> +	struct xfs_scrub_context	*sc;
> +};
> +
> +/* Record all freespace information. */
> +STATIC int
> +xfs_repair_agfl_rmap_fn(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_rmap_irec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_repair_agfl		*ra = priv;
> +	xfs_fsblock_t			fsb;
> +	int				error = 0;
> +
> +	if (xfs_scrub_should_terminate(ra->sc, &error))
> +		return error;
> +
> +	/* Record all the OWN_AG blocks. */
> +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> +				rec->rm_startblock);
> +		error = xfs_repair_collect_btree_extent(ra->sc,
> +				ra->freesp_list, fsb, rec->rm_blockcount);
> +		if (error)
> +			return error;
> +	}
> +
> +	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
> +			xfs_repair_collect_btree_cur_blocks_in_extent_list,

Urk. The function name lengths is getting out of hand. I'm very
tempted to suggest we should shorten the namespace of all this
like s/xfs_repair_/xr_/ and s/xfs_scrub_/xs_/, etc just to make them
shorter and easier to read.

Oh, wait, did I say that out loud? :P

Something to think about, anyway.

> +			&ra->agmeta_list);
> +}
> +
> +/* Add a btree block to the agmeta list. */
> +STATIC int
> +xfs_repair_agfl_visit_btblock(

I find the name a bit confusing - AGFLs don't have btree blocks.
Yes, I know that it's a xfs_btree_visit_blocks() callback but I
think s/visit/collect/ makes more sense. i.e. it tells us what we
are doing with the btree block, rather than making it sound like we
are walking AGFL btree blocks...

> +/*
> + * Map out all the non-AGFL OWN_AG space in this AG so that we can deduce
> + * which blocks belong to the AGFL.
> + */
> +STATIC int
> +xfs_repair_agfl_find_extents(

Same here - xr_agfl_collect_free_extents()?

> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	struct xfs_repair_extent_list	*agfl_extents,
> +	xfs_agblock_t			*flcount)
> +{
> +	struct xfs_repair_agfl		ra;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_repair_extent	*rae;
> +	int				error;
> +
> +	ra.sc = sc;
> +	ra.freesp_list = agfl_extents;
> +	xfs_repair_init_extent_list(&ra.agmeta_list);
> +
> +	/* Find all space used by the free space btrees & rmapbt. */
> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
> +	error = xfs_rmap_query_all(cur, xfs_repair_agfl_rmap_fn, &ra);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/* Find all space used by bnobt. */

Needs clarification.

	/* Find all the in use bnobt blocks */

> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_BNO);
> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/* Find all space used by cntbt. */

	/* Find all the in use cntbt blocks */

> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_CNT);
> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
> +	if (error)
> +		goto err;
> +
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/*
> +	 * Drop the freesp meta blocks that are in use by btrees.
> +	 * The remaining blocks /should/ be AGFL blocks.
> +	 */
> +	error = xfs_repair_subtract_extents(sc, agfl_extents, &ra.agmeta_list);
> +	xfs_repair_cancel_btree_extents(sc, &ra.agmeta_list);
> +	if (error)
> +		return error;
> +
> +	/* Calculate the new AGFL size. */
> +	*flcount = 0;
> +	for_each_xfs_repair_extent(rae, agfl_extents) {
> +		*flcount += rae->len;
> +		if (*flcount > xfs_agfl_size(mp))
> +			break;
> +	}
> +	if (*flcount > xfs_agfl_size(mp))
> +		*flcount = xfs_agfl_size(mp);

Ok, so flcount is clamped here. What happens to all the remaining
agfl_extents beyond flcount?

> +	return 0;
> +
> +err:

Ok, what cleans up all the extents we've recorded in ra on error?

> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
> +
> +/* Update the AGF and reset the in-core state. */
> +STATIC int
> +xfs_repair_agfl_update_agf(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	xfs_agblock_t			flcount)
> +{
> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> +
	ASSERT(*flcount <= xfs_agfl_size(mp));

> +	/* XXX: trigger fdblocks recalculation */
> +
> +	/* Update the AGF counters. */
> +	if (sc->sa.pag->pagf_init)
> +		sc->sa.pag->pagf_flcount = flcount;
> +	agf->agf_flfirst = cpu_to_be32(0);
> +	agf->agf_flcount = cpu_to_be32(flcount);
> +	agf->agf_fllast = cpu_to_be32(flcount - 1);
> +
> +	xfs_alloc_log_agf(sc->tp, agf_bp,
> +			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
> +	return 0;
> +}
> +
> +/* Write out a totally new AGFL. */
> +STATIC void
> +xfs_repair_agfl_init_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agfl_bp,
> +	struct xfs_repair_extent_list	*agfl_extents,
> +	xfs_agblock_t			flcount)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	__be32				*agfl_bno;
> +	struct xfs_repair_extent	*rae;
> +	struct xfs_repair_extent	*n;
> +	struct xfs_agfl			*agfl;
> +	xfs_agblock_t			agbno;
> +	unsigned int			fl_off;
> +
	ASSERT(*flcount <= xfs_agfl_size(mp));

> +	/* Start rewriting the header. */
> +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> +
> +	/* Fill the AGFL with the remaining blocks. */
> +	fl_off = 0;
> +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> +	for_each_xfs_repair_extent_safe(rae, n, agfl_extents) {
> +		agbno = XFS_FSB_TO_AGBNO(mp, rae->fsbno);
> +
> +		trace_xfs_repair_agfl_insert(mp, sc->sa.agno, agbno, rae->len);
> +
> +		while (rae->len > 0 && fl_off < flcount) {
> +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> +			fl_off++;
> +			agbno++;
> +			rae->fsbno++;
> +			rae->len--;
> +		}

This only works correctly if flcount <= xfs_agfl_size, which is why
I'm suggesting some asserts.

> +
> +		if (rae->len)
> +			break;
> +		list_del(&rae->list);
> +		kmem_free(rae);
> +	}
> +
> +	/* Write AGF and AGFL to disk. */
> +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> +}
> +
> +/* Repair the AGFL. */
> +int
> +xfs_repair_agfl(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct xfs_repair_extent_list	agfl_extents;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*agf_bp;
> +	struct xfs_buf			*agfl_bp;
> +	xfs_agblock_t			flcount;
> +	int				error;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +	xfs_repair_init_extent_list(&agfl_extents);
> +
> +	/*
> +	 * Read the AGF so that we can query the rmapbt.  We hope that there's
> +	 * nothing wrong with the AGF, but all the AG header repair functions
> +	 * have this chicken-and-egg problem.
> +	 */
> +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
> +	if (error)
> +		return error;
> +	if (!agf_bp)
> +		return -ENOMEM;
> +
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGFL_DADDR(mp)),
> +			XFS_FSS_TO_BB(mp, 1), 0, &agfl_bp, NULL);
> +	if (error)
> +		return error;
> +	agfl_bp->b_ops = &xfs_agfl_buf_ops;
> +
> +	/*
> +	 * Compute the set of old AGFL blocks by subtracting from the list of
> +	 * OWN_AG blocks the list of blocks owned by all other OWN_AG metadata
> +	 * (bnobt, cntbt, rmapbt).  These are the old AGFL blocks, so return
> +	 * that list and the number of blocks we're actually going to put back
> +	 * on the AGFL.
> +	 */

That comment belongs on the function, not here. All we need here is
something like:

	/* Gather all the extents we're going to put on the new AGFL. */

> +	error = xfs_repair_agfl_find_extents(sc, agf_bp, &agfl_extents,
> +			&flcount);
> +	if (error)
> +		goto err;
> +
> +	/*
> +	 * Update AGF and AGFL.  We reset the global free block counter when
> +	 * we adjust the AGF flcount (which can fail) so avoid updating any
> +	 * bufers until we know that part works.

buffers

> +	 */
> +	error = xfs_repair_agfl_update_agf(sc, agf_bp, flcount);
> +	if (error)
> +		goto err;
> +	xfs_repair_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
> +
> +	/*
> +	 * Ok, the AGFL should be ready to go now.  Roll the transaction so
> +	 * that we can free any AGFL overflow.
> +	 */

Why does rolling the transaction allow us to free the overflow?
Shouldn't the comment say something like "Roll to the transaction to
make the new AGFL permanent before we start using it when returning
the residual AGFL freespace overflow back to the AGF freespace
btrees."

> +	sc->sa.agf_bp = agf_bp;
> +	sc->sa.agfl_bp = agfl_bp;
> +	error = xfs_repair_roll_ag_trans(sc);
> +	if (error)
> +		goto err;
> +
> +	/* Dump any AGFL overflow. */
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> +	return xfs_repair_reap_btree_extents(sc, &agfl_extents, &oinfo,
> +			XFS_AG_RESV_AGFL);
> +err:
> +	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 326be4e8b71e..bcdaa8df18f6 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -127,9 +127,12 @@ xfs_repair_roll_ag_trans(
>  	int				error;
>  
>  	/* Keep the AG header buffers locked so we can keep going. */
> -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
>  
>  	/* Roll the transaction. */
>  	error = xfs_trans_roll(&sc->tp);
> @@ -137,9 +140,12 @@ xfs_repair_roll_ag_trans(
>  		goto out_release;
>  
>  	/* Join AG headers to the new transaction. */
> -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
>  
>  	return 0;
>  
> @@ -149,9 +155,12 @@ xfs_repair_roll_ag_trans(
>  	 * buffers will be released during teardown on our way out
>  	 * of the kernel.
>  	 */
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
>  
>  	return error;
>  }
> @@ -408,6 +417,85 @@ xfs_repair_collect_btree_extent(
>  	return 0;
>  }
>  
> +/*
> + * Help record all btree blocks seen while iterating all records of a btree.
> + *
> + * We know that the btree query_all function starts at the left edge and walks
> + * towards the right edge of the tree.  Therefore, we know that we can walk up
> + * the btree cursor towards the root; if the pointer for a given level points
> + * to the first record/key in that block, we haven't seen this block before;
> + * and therefore we need to remember that we saw this block in the btree.
> + *
> + * So if our btree is:
> + *
> + *    4
> + *  / | \
> + * 1  2  3
> + *
> + * Pretend for this example that each leaf block has 100 btree records.  For
> + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> + * block 4.  The list is [1, 4].
> + *
> + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> + * loop.  The list remains [1, 4].
> + *
> + * For the 101st btree record, we've moved onto leaf block 2.  Now
> + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> + *
> + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> + *
> + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> + *
> + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> + *
> + * The *iter_fn can return XFS_BTREE_QUERY_RANGE_ABORT to stop, 0 to keep
> + * iterating, or the usual negative error code.
> + */
> +int
> +xfs_repair_collect_btree_cur_blocks(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_btree_cur		*cur,
> +	int				(*iter_fn)(struct xfs_scrub_context *sc,
> +						   xfs_fsblock_t fsbno,
> +						   xfs_fsblock_t len,
> +						   void *priv),
> +	void				*priv)
> +{
> +	struct xfs_buf			*bp;
> +	xfs_fsblock_t			fsb;
> +	int				i;
> +	int				error;
> +
> +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> +		xfs_btree_get_block(cur, i, &bp);
> +		if (!bp)
> +			continue;
> +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> +		error = iter_fn(sc, fsb, 1, priv);
> +		if (error)
> +			return error;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Simple adapter to connect xfs_repair_collect_btree_extent to
> + * xfs_repair_collect_btree_cur_blocks.
> + */
> +int
> +xfs_repair_collect_btree_cur_blocks_in_extent_list(
> +	struct xfs_scrub_context	*sc,
> +	xfs_fsblock_t			fsbno,
> +	xfs_fsblock_t			len,
> +	void				*priv)
> +{
> +	return xfs_repair_collect_btree_extent(sc, priv, fsbno, len);
> +}
> +
>  /*
>   * An error happened during the rebuild so the transaction will be cancelled.
>   * The fs will shut down, and the administrator has to unmount and run repair.
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index ef47826b6725..f2af5923aa75 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -48,9 +48,20 @@ xfs_repair_init_extent_list(
>  
>  #define for_each_xfs_repair_extent_safe(rbe, n, exlist) \
>  	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
> +#define for_each_xfs_repair_extent(rbe, exlist) \
> +	list_for_each_entry((rbe), &(exlist)->list, list)
>  int xfs_repair_collect_btree_extent(struct xfs_scrub_context *sc,
>  		struct xfs_repair_extent_list *btlist, xfs_fsblock_t fsbno,
>  		xfs_extlen_t len);
> +int xfs_repair_collect_btree_cur_blocks(struct xfs_scrub_context *sc,
> +		struct xfs_btree_cur *cur,
> +		int (*iter_fn)(struct xfs_scrub_context *sc,
> +			       xfs_fsblock_t fsbno, xfs_fsblock_t len,
> +			       void *priv),
> +		void *priv);
> +int xfs_repair_collect_btree_cur_blocks_in_extent_list(
> +		struct xfs_scrub_context *sc, xfs_fsblock_t fsbno,
> +		xfs_fsblock_t len, void *priv);
>  void xfs_repair_cancel_btree_extents(struct xfs_scrub_context *sc,
>  		struct xfs_repair_extent_list *btlist);
>  int xfs_repair_subtract_extents(struct xfs_scrub_context *sc,
> @@ -89,6 +100,8 @@ int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
>  
>  int xfs_repair_probe(struct xfs_scrub_context *sc);
>  int xfs_repair_superblock(struct xfs_scrub_context *sc);
> +int xfs_repair_agf(struct xfs_scrub_context *sc);
> +int xfs_repair_agfl(struct xfs_scrub_context *sc);
>  
>  #else
>  
> @@ -112,6 +125,8 @@ xfs_repair_calc_ag_resblks(
>  
>  #define xfs_repair_probe		xfs_repair_notsupported
>  #define xfs_repair_superblock		xfs_repair_notsupported
> +#define xfs_repair_agf			xfs_repair_notsupported
> +#define xfs_repair_agfl			xfs_repair_notsupported
>  
>  #endif /* CONFIG_XFS_ONLINE_REPAIR */
>  
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 58ae76b3a421..8e11c3c699fb 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -208,13 +208,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>  		.type	= ST_PERAG,
>  		.setup	= xfs_scrub_setup_fs,
>  		.scrub	= xfs_scrub_agf,
> -		.repair	= xfs_repair_notsupported,
> +		.repair	= xfs_repair_agf,
>  	},
>  	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
>  		.type	= ST_PERAG,
>  		.setup	= xfs_scrub_setup_fs,
>  		.scrub	= xfs_scrub_agfl,
> -		.repair	= xfs_repair_notsupported,
> +		.repair	= xfs_repair_agfl,
>  	},
>  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
>  		.type	= ST_PERAG,
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 524f543c5b82..c08785cf83a9 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -126,6 +126,60 @@ xfs_trans_dup(
>  	return ntp;
>  }
>  
> +/*
> + * Try to reserve more blocks for a transaction.  The single use case we
> + * support is for online repair -- use a transaction to gather data without
> + * fear of btree cycle deadlocks; calculate how many blocks we really need
> + * from that data; and only then start modifying data.  This can fail due to
> + * ENOSPC, so we have to be able to cancel the transaction.
> + */
> +int
> +xfs_trans_reserve_more(
> +	struct xfs_trans	*tp,
> +	uint			blocks,
> +	uint			rtextents)

This isn't used in this patch - seems out of place here. Committed
to the wrong patch?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 05/21] xfs: repair the AGI
  2018-06-24 19:24 ` [PATCH 05/21] xfs: repair the AGI Darrick J. Wong
@ 2018-06-27  2:22   ` Dave Chinner
  2018-06-28 21:15   ` Allison Henderson
  1 sibling, 0 replies; 77+ messages in thread
From: Dave Chinner @ 2018-06-27  2:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the AGI header items with some help from the rmapbt.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/agheader_repair.c |  211 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/repair.h          |    2 
>  fs/xfs/scrub/scrub.c           |    2 
>  3 files changed, 214 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> index 90e5e6cbc911..61e0134f6f9f 100644
> --- a/fs/xfs/scrub/agheader_repair.c
> +++ b/fs/xfs/scrub/agheader_repair.c
> @@ -698,3 +698,214 @@ xfs_repair_agfl(
>  	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
>  	return error;
>  }
> +
> +/* AGI */
> +
> +enum {
> +	REPAIR_AGI_INOBT = 0,
> +	REPAIR_AGI_FINOBT,
> +	REPAIR_AGI_END,
> +	REPAIR_AGI_MAX
> +};

XFS_BTNUM_INOBT?

Basically same infrastructure questions as last patch.

.....
> +/*
> + * Reinitialize the AGI header, making an in-core copy of the old contents so
> + * that we know which in-core state needs to be reinitialized.
> + */
> +STATIC void
> +xfs_repair_agi_init_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agi_bp,
> +	struct xfs_agi			*old_agi)
> +{
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agi_bp);
> +	struct xfs_mount		*mp = sc->mp;
> +
> +	memcpy(old_agi, agi, sizeof(*old_agi));
> +	memset(agi, 0, BBTOB(agi_bp->b_length));
> +	agi->agi_magicnum = cpu_to_be32(XFS_AGI_MAGIC);
> +	agi->agi_versionnum = cpu_to_be32(XFS_AGI_VERSION);
> +	agi->agi_seqno = cpu_to_be32(sc->sa.agno);
> +	agi->agi_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
> +	agi->agi_newino = cpu_to_be32(NULLAGINO);
> +	agi->agi_dirino = cpu_to_be32(NULLAGINO);
> +	if (xfs_sb_version_hascrc(&mp->m_sb))
> +		uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
> +
> +	/* We don't know how to fix the unlinked list yet. */
> +	memcpy(&agi->agi_unlinked, &old_agi->agi_unlinked,
> +			sizeof(agi->agi_unlinked));
> +}

and clear pagi_init?

> +/* Update the AGI counters. */
> +STATIC int
> +xfs_repair_agi_update_btree_counters(

_update_counts

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 06/21] xfs: repair free space btrees
  2018-06-24 19:24 ` [PATCH 06/21] xfs: repair free space btrees Darrick J. Wong
@ 2018-06-27  3:21   ` Dave Chinner
  2018-07-04  2:15     ` Darrick J. Wong
  2018-06-30 17:36   ` Allison Henderson
  1 sibling, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-06-27  3:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:07PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the free space btrees from the gaps in the rmap btree.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
......
> +
> +/* Collect an AGFL block for the not-to-release list. */
> +static int
> +xfs_repair_collect_agfl_block(
> +	struct xfs_mount		*mp,
> +	xfs_agblock_t			bno,
> +	void				*priv)

/me now gets confused by agfl code (xfs_repair_agfl_...) collecting btree
blocks, and now the btree code (xfs_repair_collect_agfl... )
collecting agfl blocks.

The naming/namespace collisions is not that nice. I think this needs
to be xr_allocbt_collect_agfl_blocks().

/me idly wonders about consistently renaming everything abt, bnbt, cnbt,
fibt, ibt, rmbt and rcbt...

> +/*
> + * Iterate all reverse mappings to find (1) the free extents, (2) the OWN_AG
> + * extents, (3) the rmapbt blocks, and (4) the AGFL blocks.  The free space is
> + * (1) + (2) - (3) - (4).  Figure out if we have enough free space to
> + * reconstruct the free space btrees.  Caller must clean up the input lists
> + * if something goes wrong.
> + */
> +STATIC int
> +xfs_repair_allocbt_find_freespace(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*free_extents,
> +	struct xfs_repair_extent_list	*old_allocbt_blocks)
> +{
> +	struct xfs_repair_alloc		ra;
> +	struct xfs_repair_alloc_extent	*rae;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_agblock_t			agend;
> +	xfs_agblock_t			nr_blocks;
> +	int				error;
> +
> +	ra.extlist = free_extents;
> +	ra.btlist = old_allocbt_blocks;
> +	xfs_repair_init_extent_list(&ra.nobtlist);
> +	ra.next_bno = 0;
> +	ra.nr_records = 0;
> +	ra.nr_blocks = 0;
> +	ra.sc = sc;
> +
> +	/*
> +	 * Iterate all the reverse mappings to find gaps in the physical
> +	 * mappings, all the OWN_AG blocks, and all the rmapbt extents.
> +	 */
> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
> +	error = xfs_rmap_query_all(cur, xfs_repair_alloc_extent_fn, &ra);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	cur = NULL;
> +
> +	/* Insert a record for space between the last rmap and EOAG. */
> +	agend = be32_to_cpu(XFS_BUF_TO_AGF(sc->sa.agf_bp)->agf_length);
> +	if (ra.next_bno < agend) {
> +		rae = kmem_alloc(sizeof(struct xfs_repair_alloc_extent),
> +				KM_MAYFAIL);
> +		if (!rae) {
> +			error = -ENOMEM;
> +			goto err;
> +		}
> +		INIT_LIST_HEAD(&rae->list);
> +		rae->bno = ra.next_bno;
> +		rae->len = agend - ra.next_bno;
> +		list_add_tail(&rae->list, free_extents);
> +		ra.nr_records++;
> +	}
> +
> +	/* Collect all the AGFL blocks. */
> +	error = xfs_agfl_walk(mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
> +			sc->sa.agfl_bp, xfs_repair_collect_agfl_block, &ra);
> +	if (error)
> +		goto err;
> +
> +	/* Do we actually have enough space to do this? */
> +	nr_blocks = 2 * xfs_allocbt_calc_size(mp, ra.nr_records);

	/* Do we have enough space to rebuild both freespace trees? */

(explains the multiplication by 2)

> +	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE) ||
> +	    ra.nr_blocks < nr_blocks) {
> +		error = -ENOSPC;
> +		goto err;
> +	}
> +
> +	/* Compute the old bnobt/cntbt blocks. */
> +	error = xfs_repair_subtract_extents(sc, old_allocbt_blocks,
> +			&ra.nobtlist);
> +	if (error)
> +		goto err;
> +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> +	return 0;
> +
> +err:
> +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> +	if (cur)
> +		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;

Error stacking here can be cleaned up - we don't need an extra stack
as the cursor is NULL when finished with. Hence it could just be:

	/* Compute the old bnobt/cntbt blocks. */
	error = xfs_repair_subtract_extents(sc, old_allocbt_blocks,
			&ra.nobtlist);
err:
	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
	if (cur)
		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
	return error;
}


> +}
> +
> +/*
> + * Reset the global free block counter and the per-AG counters to make it look
> + * like this AG has no free space.
> + */
> +STATIC int
> +xfs_repair_allocbt_reset_counters(
> +	struct xfs_scrub_context	*sc,
> +	int				*log_flags)
> +{
> +	struct xfs_perag		*pag = sc->sa.pag;
> +	struct xfs_agf			*agf;
> +	xfs_extlen_t			oldf;
> +	xfs_agblock_t			rmap_blocks;
> +	int				error;
> +
> +	/*
> +	 * Since we're abandoning the old bnobt/cntbt, we have to
> +	 * decrease fdblocks by the # of blocks in those trees.
> +	 * btreeblks counts the non-root blocks of the free space
> +	 * and rmap btrees.  Do this before resetting the AGF counters.

Comment can use 80 columns.

> +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> +	rmap_blocks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
> +	oldf = pag->pagf_btreeblks + 2;
> +	oldf -= rmap_blocks;

Convoluted. The comment really didn't help me understand what oldf
is accounting.

Ah, rmap_blocks is actually the new btreeblks count. OK.

	/*
	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
	 * non-root blocks of the free space and rmap btrees.  Do this before
	 * resetting the AGF counters.
	 */

	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);

	/* rmap_blocks accounts root block, btreeblks doesn't */
	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;

	/* btreeblks doesn't account bno/cnt root blocks */
	to_free = pag->pagf_btreeblks + 2;

	/* and don't account for the blocks we aren't freeing */
	to_free -= new_btblks;


> +	error = xfs_mod_fdblocks(sc->mp, -(int64_t)oldf, false);
> +	if (error)
> +		return error;
> +
> +	/* Reset the per-AG info, both incore and ondisk. */
> +	pag->pagf_btreeblks = rmap_blocks;
> +	pag->pagf_freeblks = 0;
> +	pag->pagf_longest = 0;
> +
> +	agf->agf_btreeblks = cpu_to_be32(pag->pagf_btreeblks);

I'd prefer that you use new_btblks here, too. Easier to see at a
glance that the on-disk agf is being set to the new value....


> +	agf->agf_freeblks = 0;
> +	agf->agf_longest = 0;
> +	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
> +
> +	return 0;
> +}
> +
> +/* Initialize new bnobt/cntbt roots and implant them into the AGF. */
> +STATIC int
> +xfs_repair_allocbt_reset_btrees(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*free_extents,
> +	int				*log_flags)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct xfs_repair_alloc_extent	*cached = NULL;
> +	struct xfs_buf			*bp;
> +	struct xfs_perag		*pag = sc->sa.pag;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_agf			*agf;
> +	xfs_fsblock_t			bnofsb;
> +	xfs_fsblock_t			cntfsb;
> +	int				error;
> +
> +	/* Allocate new bnobt root. */
> +	bnofsb = xfs_repair_allocbt_alloc_block(sc, free_extents, &cached);
> +	if (bnofsb == NULLFSBLOCK)
> +		return -ENOSPC;

Does this happen after the free extent list has been sorted by bno
order? It really should, that way the new root is as close to the
the AGF as possible, and the new btree blocks will also tend to
cluster towards the lower AG offsets.

> +	/* Allocate new cntbt root. */
> +	cntfsb = xfs_repair_allocbt_alloc_block(sc, free_extents, &cached);
> +	if (cntfsb == NULLFSBLOCK)
> +		return -ENOSPC;
> +
> +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> +	/* Initialize new bnobt root. */
> +	error = xfs_repair_init_btblock(sc, bnofsb, &bp, XFS_BTNUM_BNO,
> +			&xfs_allocbt_buf_ops);
> +	if (error)
> +		return error;
> +	agf->agf_roots[XFS_BTNUM_BNOi] =
> +			cpu_to_be32(XFS_FSB_TO_AGBNO(mp, bnofsb));
> +	agf->agf_levels[XFS_BTNUM_BNOi] = cpu_to_be32(1);
> +
> +	/* Initialize new cntbt root. */
> +	error = xfs_repair_init_btblock(sc, cntfsb, &bp, XFS_BTNUM_CNT,
> +			&xfs_allocbt_buf_ops);
> +	if (error)
> +		return error;
> +	agf->agf_roots[XFS_BTNUM_CNTi] =
> +			cpu_to_be32(XFS_FSB_TO_AGBNO(mp, cntfsb));
> +	agf->agf_levels[XFS_BTNUM_CNTi] = cpu_to_be32(1);
> +
> +	/* Add rmap records for the btree roots */
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
> +			XFS_FSB_TO_AGBNO(mp, bnofsb), 1, &oinfo);
> +	if (error)
> +		return error;
> +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
> +			XFS_FSB_TO_AGBNO(mp, cntfsb), 1, &oinfo);
> +	if (error)
> +		return error;
> +
> +	/* Reset the incore state. */
> +	pag->pagf_levels[XFS_BTNUM_BNOi] = 1;
> +	pag->pagf_levels[XFS_BTNUM_CNTi] = 1;
> +
> +	*log_flags |=  XFS_AGF_ROOTS | XFS_AGF_LEVELS;
> +	return 0;

Rather than duplicating all this init code twice, would factoring it
make sense? The only difference between the alloc/init of the two
btrees is the array index that info is stored in....

> +}
> +
> +/* Build new free space btrees and dispose of the old one. */
> +STATIC int
> +xfs_repair_allocbt_rebuild_trees(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*free_extents,
> +	struct xfs_repair_extent_list	*old_allocbt_blocks)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct xfs_repair_alloc_extent	*rae;
> +	struct xfs_repair_alloc_extent	*n;
> +	struct xfs_repair_alloc_extent	*longest;
> +	int				error;
> +
> +	xfs_rmap_skip_owner_update(&oinfo);
> +
> +	/*
> +	 * Insert the longest free extent in case it's necessary to
> +	 * refresh the AGFL with multiple blocks.  If there is no longest
> +	 * extent, we had exactly the free space we needed; we're done.
> +	 */
> +	longest = xfs_repair_allocbt_get_longest(free_extents);
> +	if (!longest)
> +		goto done;
> +	error = xfs_repair_allocbt_free_extent(sc,
> +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> +			longest->len, &oinfo);
> +	list_del(&longest->list);
> +	kmem_free(longest);
> +	if (error)
> +		return error;
> +
> +	/* Insert records into the new btrees. */
> +	list_sort(NULL, free_extents, xfs_repair_allocbt_extent_cmp);

Hmmm. I guess list sorting doesn't occur before allocating new root
blocks. Can this get moved?

....

> +bool
> +xfs_extent_busy_list_empty(
> +	struct xfs_perag	*pag);

One line form for header prototypes, please.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-27  2:19   ` Dave Chinner
@ 2018-06-27 16:44     ` Allison Henderson
  2018-06-27 23:37       ` Dave Chinner
  2018-06-28 17:25     ` Allison Henderson
  1 sibling, 1 reply; 77+ messages in thread
From: Allison Henderson @ 2018-06-27 16:44 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong; +Cc: linux-xfs



On 06/26/2018 07:19 PM, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:23:54PM -0700, Darrick J. Wong wrote:
>> From: Darrick J. Wong <darrick.wong@oracle.com>
>>
>> Regenerate the AGF and AGFL from the rmap data.
>>
>> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [...]
> 
>> +/* Information for finding AGF-rooted btrees */
>> +enum {
>> +	REPAIR_AGF_BNOBT = 0,
>> +	REPAIR_AGF_CNTBT,
>> +	REPAIR_AGF_RMAPBT,
>> +	REPAIR_AGF_REFCOUNTBT,
>> +	REPAIR_AGF_END,
>> +	REPAIR_AGF_MAX
>> +};
> 
> Why can't you just use XFS_BTNUM_* for these btree type descriptors?
> 
>> +
>> +static const struct xfs_repair_find_ag_btree repair_agf[] = {
>> +	[REPAIR_AGF_BNOBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>> +		.buf_ops = &xfs_allocbt_buf_ops,
>> +		.magic = XFS_ABTB_CRC_MAGIC,
>> +	},
>> +	[REPAIR_AGF_CNTBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>> +		.buf_ops = &xfs_allocbt_buf_ops,
>> +		.magic = XFS_ABTC_CRC_MAGIC,
>> +	},
> 
> I had to stop and think about why this only supports the v5 types.
> i.e. we're rebuilding from rmap info, so this will never run on v4
> filesystems, hence we only care about v5 types (i.e. *CRC_MAGIC).
> Perhaps a one-line comment to remind readers of this?
> 
>> +	[REPAIR_AGF_RMAPBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>> +		.buf_ops = &xfs_rmapbt_buf_ops,
>> +		.magic = XFS_RMAP_CRC_MAGIC,
>> +	},
>> +	[REPAIR_AGF_REFCOUNTBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_REFC,
>> +		.buf_ops = &xfs_refcountbt_buf_ops,
>> +		.magic = XFS_REFC_CRC_MAGIC,
>> +	},
>> +	[REPAIR_AGF_END] = {
>> +		.buf_ops = NULL,
>> +	},
>> +};
>> +
>> +/*
>> + * Find the btree roots.  This is /also/ a chicken and egg problem because we
>> + * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
>> + * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
>> + * corruptions in those btrees we'll bail out.
>> + */
>> +STATIC int
>> +xfs_repair_agf_find_btrees(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	struct xfs_repair_find_ag_btree	*fab,
>> +	struct xfs_buf			*agfl_bp)
>> +{
>> +	struct xfs_agf			*old_agf = XFS_BUF_TO_AGF(agf_bp);
>> +	int				error;
>> +
>> +	/* Go find the root data. */
>> +	memcpy(fab, repair_agf, sizeof(repair_agf));
> 
> Why are we initialising fab here, instead of in the caller where it
> is declared and passed to various functions? Given there is only a
> single declaration of this structure, why do we need a global static
> const table initialiser just to copy it here - why isn't it
> initialised at the declaration point?
> 
>> +	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, agfl_bp);
>> +	if (error)
>> +		return error;
>> +
>> +	/* We must find the bnobt, cntbt, and rmapbt roots. */
>> +	if (fab[REPAIR_AGF_BNOBT].root == NULLAGBLOCK ||
>> +	    fab[REPAIR_AGF_BNOBT].height > XFS_BTREE_MAXLEVELS ||
>> +	    fab[REPAIR_AGF_CNTBT].root == NULLAGBLOCK ||
>> +	    fab[REPAIR_AGF_CNTBT].height > XFS_BTREE_MAXLEVELS ||
>> +	    fab[REPAIR_AGF_RMAPBT].root == NULLAGBLOCK ||
>> +	    fab[REPAIR_AGF_RMAPBT].height > XFS_BTREE_MAXLEVELS)
>> +		return -EFSCORRUPTED;
>> +
>> +	/*
>> +	 * We relied on the rmapbt to reconstruct the AGF.  If we get a
>> +	 * different root then something's seriously wrong.
>> +	 */
>> +	if (fab[REPAIR_AGF_RMAPBT].root !=
>> +	    be32_to_cpu(old_agf->agf_roots[XFS_BTNUM_RMAPi]))
>> +		return -EFSCORRUPTED;
>> +
>> +	/* We must find the refcountbt root if that feature is enabled. */
>> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb) &&
>> +	    (fab[REPAIR_AGF_REFCOUNTBT].root == NULLAGBLOCK ||
>> +	     fab[REPAIR_AGF_REFCOUNTBT].height > XFS_BTREE_MAXLEVELS))
>> +		return -EFSCORRUPTED;
>> +
>> +	return 0;
>> +}
>> +
>> +/* Set btree root information in an AGF. */
>> +STATIC void
>> +xfs_repair_agf_set_roots(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_agf			*agf,
>> +	struct xfs_repair_find_ag_btree	*fab)
>> +{
>> +	agf->agf_roots[XFS_BTNUM_BNOi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].root);
>> +	agf->agf_levels[XFS_BTNUM_BNOi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].height);
>> +
>> +	agf->agf_roots[XFS_BTNUM_CNTi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].root);
>> +	agf->agf_levels[XFS_BTNUM_CNTi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].height);
>> +
>> +	agf->agf_roots[XFS_BTNUM_RMAPi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].root);
>> +	agf->agf_levels[XFS_BTNUM_RMAPi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].height);
>> +
>> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
>> +		agf->agf_refcount_root =
>> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].root);
>> +		agf->agf_refcount_level =
>> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].height);
>> +	}
>> +}
>> +
>> +/*
>> + * Reinitialize the AGF header, making an in-core copy of the old contents so
>> + * that we know which in-core state needs to be reinitialized.
>> + */
>> +STATIC void
>> +xfs_repair_agf_init_header(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	struct xfs_agf			*old_agf)
>> +{
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
>> +
>> +	memcpy(old_agf, agf, sizeof(*old_agf));
>> +	memset(agf, 0, BBTOB(agf_bp->b_length));
>> +	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
>> +	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
>> +	agf->agf_seqno = cpu_to_be32(sc->sa.agno);
>> +	agf->agf_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
>> +	agf->agf_flfirst = old_agf->agf_flfirst;
>> +	agf->agf_fllast = old_agf->agf_fllast;
>> +	agf->agf_flcount = old_agf->agf_flcount;
>> +	if (xfs_sb_version_hascrc(&mp->m_sb))
>> +		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
>> +}
> 
> Do we need to clear pag->pagf_init here so that it gets
> re-initialised next time someone reads the AGF?
> 
>> +
>> +/* Update the AGF btree counters by walking the btrees. */
>> +STATIC int
>> +xfs_repair_agf_update_btree_counters(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp)
>> +{
>> +	struct xfs_repair_agf_allocbt	raa = { .sc = sc };
>> +	struct xfs_btree_cur		*cur = NULL;
>> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
>> +	struct xfs_mount		*mp = sc->mp;
>> +	xfs_agblock_t			btreeblks;
>> +	xfs_agblock_t			blocks;
>> +	int				error;
>> +
>> +	/* Update the AGF counters from the bnobt. */
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_BNO);
>> +	error = xfs_alloc_query_all(cur, xfs_repair_agf_walk_allocbt, &raa);
>> +	if (error)
>> +		goto err;
>> +	error = xfs_btree_count_blocks(cur, &blocks);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +	btreeblks = blocks - 1;
>> +	agf->agf_freeblks = cpu_to_be32(raa.freeblks);
>> +	agf->agf_longest = cpu_to_be32(raa.longest);
> 
> This function updates more than the AGF btree counters. :P
> 
>> +
>> +	/* Update the AGF counters from the cntbt. */
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_CNT);
>> +	error = xfs_btree_count_blocks(cur, &blocks);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +	btreeblks += blocks - 1;
>> +
>> +	/* Update the AGF counters from the rmapbt. */
>> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
>> +	error = xfs_btree_count_blocks(cur, &blocks);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +	agf->agf_rmap_blocks = cpu_to_be32(blocks);
>> +	btreeblks += blocks - 1;
>> +
>> +	agf->agf_btreeblks = cpu_to_be32(btreeblks);
>> +
>> +	/* Update the AGF counters from the refcountbt. */
>> +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
>> +		cur = xfs_refcountbt_init_cursor(mp, sc->tp, agf_bp,
>> +				sc->sa.agno, NULL);
>> +		error = xfs_btree_count_blocks(cur, &blocks);
>> +		if (error)
>> +			goto err;
>> +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +		agf->agf_refcount_blocks = cpu_to_be32(blocks);
>> +	}
>> +
>> +	return 0;
>> +err:
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>> +	return error;
>> +}
>> +
>> +/* Trigger reinitialization of the in-core data. */
>> +STATIC int
>> +xfs_repair_agf_reinit_incore(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_agf			*agf,
>> +	const struct xfs_agf		*old_agf)
>> +{
>> +	struct xfs_perag		*pag;
>> +
>> +	/* XXX: trigger fdblocks recalculation */
>> +
>> +	/* Now reinitialize the in-core counters if necessary. */
>> +	pag = sc->sa.pag;
>> +	if (!pag->pagf_init)
>> +		return 0;
>> +
>> +	pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
>> +	pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks);
>> +	pag->pagf_longest = be32_to_cpu(agf->agf_longest);
>> +	pag->pagf_levels[XFS_BTNUM_BNOi] =
>> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
>> +	pag->pagf_levels[XFS_BTNUM_CNTi] =
>> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
>> +	pag->pagf_levels[XFS_BTNUM_RMAPi] =
>> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
>> +	pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
> 
> Ok, so we reinit the pagf bits here, but....
> 
>> +
>> +	return 0;
>> +}
>> +
>> +/* Repair the AGF. */
>> +int
>> +xfs_repair_agf(
>> +	struct xfs_scrub_context	*sc)
>> +{
>> +	struct xfs_repair_find_ag_btree	fab[REPAIR_AGF_MAX];
>> +	struct xfs_agf			old_agf;
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_buf			*agf_bp;
>> +	struct xfs_buf			*agfl_bp;
>> +	struct xfs_agf			*agf;
>> +	int				error;
>> +
>> +	/* We require the rmapbt to rebuild anything. */
>> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
>> +		return -EOPNOTSUPP;
>> +
>> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
>> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
>> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGF_DADDR(mp)),
>> +			XFS_FSS_TO_BB(mp, 1), 0, &agf_bp, NULL);
>> +	if (error)
>> +		return error;
>> +	agf_bp->b_ops = &xfs_agf_buf_ops;
>> +	agf = XFS_BUF_TO_AGF(agf_bp);
>> +
>> +	/*
>> +	 * Load the AGFL so that we can screen out OWN_AG blocks that are on
>> +	 * the AGFL now; these blocks might have once been part of the
>> +	 * bno/cnt/rmap btrees but are not now.  This is a chicken and egg
>> +	 * problem: the AGF is corrupt, so we have to trust the AGFL contents
>> +	 * because we can't do any serious cross-referencing with any of the
>> +	 * btrees rooted in the AGF.  If the AGFL contents are obviously bad
>> +	 * then we'll bail out.
>> +	 */
>> +	error = xfs_alloc_read_agfl(mp, sc->tp, sc->sa.agno, &agfl_bp);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Spot-check the AGFL blocks; if they're obviously corrupt then
>> +	 * there's nothing we can do but bail out.
>> +	 */
>> +	error = xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(agf_bp), agfl_bp,
>> +			xfs_repair_agf_check_agfl_block, sc);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Find the AGF btree roots.  See the comment for this function for
>> +	 * more information about the limitations of this repairer; this is
>> +	 * also a chicken-and-egg situation.
>> +	 */
>> +	error = xfs_repair_agf_find_btrees(sc, agf_bp, fab, agfl_bp);
>> +	if (error)
>> +		return error;
> 
> Comment could be better written.
> 
> 	/*
> 	 * Find the AGF btree roots. This is also a chicken-and-egg
> 	 * situation - see xfs_repair_agf_find_btrees() for details.
> 	 */
> 
>> +
>> +	/* Start rewriting the header and implant the btrees we found. */
>> +	xfs_repair_agf_init_header(sc, agf_bp, &old_agf);
>> +	xfs_repair_agf_set_roots(sc, agf, fab);
>> +	error = xfs_repair_agf_update_btree_counters(sc, agf_bp);
>> +	if (error)
>> +		goto out_revert;
> 
> If we fail here, the pagf information is invalid, hence I think we
> really do need to clear pagf_init before we start rebuilding the new
> AGF. Yes, I can see we revert the AGF info, but this seems like a
> landmine waiting to be tripped over.
> 
>> +	/* Reinitialize in-core state. */
>> +	error = xfs_repair_agf_reinit_incore(sc, agf, &old_agf);
>> +	if (error)
>> +		goto out_revert;
>> +
>> +	/* Write this to disk. */
>> +	xfs_trans_buf_set_type(sc->tp, agf_bp, XFS_BLFT_AGF_BUF);
>> +	xfs_trans_log_buf(sc->tp, agf_bp, 0, BBTOB(agf_bp->b_length) - 1);
>> +	return 0;
>> +
>> +out_revert:
>> +	memcpy(agf, &old_agf, sizeof(old_agf));
>> +	return error;
>> +}
>> +
>> +/* AGFL */
>> +
>> +struct xfs_repair_agfl {
>> +	struct xfs_repair_extent_list	agmeta_list;
>> +	struct xfs_repair_extent_list	*freesp_list;
>> +	struct xfs_scrub_context	*sc;
>> +};
>> +
>> +/* Record all freespace information. */
>> +STATIC int
>> +xfs_repair_agfl_rmap_fn(
>> +	struct xfs_btree_cur		*cur,
>> +	struct xfs_rmap_irec		*rec,
>> +	void				*priv)
>> +{
>> +	struct xfs_repair_agfl		*ra = priv;
>> +	xfs_fsblock_t			fsb;
>> +	int				error = 0;
>> +
>> +	if (xfs_scrub_should_terminate(ra->sc, &error))
>> +		return error;
>> +
>> +	/* Record all the OWN_AG blocks. */
>> +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
>> +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
>> +				rec->rm_startblock);
>> +		error = xfs_repair_collect_btree_extent(ra->sc,
>> +				ra->freesp_list, fsb, rec->rm_blockcount);
>> +		if (error)
>> +			return error;
>> +	}
>> +
>> +	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
>> +			xfs_repair_collect_btree_cur_blocks_in_extent_list,
> 
> Urk. The function name lengths is getting out of hand. I'm very
> tempted to suggest we should shorten the namespace of all this
> like s/xfs_repair_/xr_/ and s/xfs_scrub_/xs_/, etc just to make them
> shorter and easier to read.
> 
> Oh, wait, did I say that out loud? :P
> 
> Something to think about, anyway.
> 
Well they are sort of long, but TBH I think i still kind of appreciate 
the extra verbiage.  I have seen other projects do things like adopt a 
sort of 3 or 4 letter abbreviation (like maybe xfs_scrb or xfs_repr). 
Helps to cut down on the verbosity while still not loosing too much of 
what it is supposed to mean.  Just another idea to consider. :-)

>> +			&ra->agmeta_list);
>> +}
>> +
>> +/* Add a btree block to the agmeta list. */
>> +STATIC int
>> +xfs_repair_agfl_visit_btblock(
> 
> I find the name a bit confusing - AGFLs don't have btree blocks.
> Yes, I know that it's a xfs_btree_visit_blocks() callback but I
> think s/visit/collect/ makes more sense. i.e. it tells us what we
> are doing with the btree block, rather than making it sound like we
> are walking AGFL btree blocks...
> 
>> +/*
>> + * Map out all the non-AGFL OWN_AG space in this AG so that we can deduce
>> + * which blocks belong to the AGFL.
>> + */
>> +STATIC int
>> +xfs_repair_agfl_find_extents(
> 
> Same here - xr_agfl_collect_free_extents()?
> 
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	struct xfs_repair_extent_list	*agfl_extents,
>> +	xfs_agblock_t			*flcount)
>> +{
>> +	struct xfs_repair_agfl		ra;
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_btree_cur		*cur;
>> +	struct xfs_repair_extent	*rae;
>> +	int				error;
>> +
>> +	ra.sc = sc;
>> +	ra.freesp_list = agfl_extents;
>> +	xfs_repair_init_extent_list(&ra.agmeta_list);
>> +
>> +	/* Find all space used by the free space btrees & rmapbt. */
>> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
>> +	error = xfs_rmap_query_all(cur, xfs_repair_agfl_rmap_fn, &ra);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +
>> +	/* Find all space used by bnobt. */
> 
> Needs clarification.
> 
> 	/* Find all the in use bnobt blocks */
> 
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_BNO);
>> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +
>> +	/* Find all space used by cntbt. */
> 
> 	/* Find all the in use cntbt blocks */
> 
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_CNT);
>> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
>> +	if (error)
>> +		goto err;
>> +
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +
>> +	/*
>> +	 * Drop the freesp meta blocks that are in use by btrees.
>> +	 * The remaining blocks /should/ be AGFL blocks.
>> +	 */
>> +	error = xfs_repair_subtract_extents(sc, agfl_extents, &ra.agmeta_list);
>> +	xfs_repair_cancel_btree_extents(sc, &ra.agmeta_list);
>> +	if (error)
>> +		return error;
>> +
>> +	/* Calculate the new AGFL size. */
>> +	*flcount = 0;
>> +	for_each_xfs_repair_extent(rae, agfl_extents) {
>> +		*flcount += rae->len;
>> +		if (*flcount > xfs_agfl_size(mp))
>> +			break;
>> +	}
>> +	if (*flcount > xfs_agfl_size(mp))
>> +		*flcount = xfs_agfl_size(mp);
> 
> Ok, so flcount is clamped here. What happens to all the remaining
> agfl_extents beyond flcount?
> 
>> +	return 0;
>> +
>> +err:
> 
> Ok, what cleans up all the extents we've recorded in ra on error?
> 
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>> +	return error;
>> +}
>> +
>> +/* Update the AGF and reset the in-core state. */
>> +STATIC int
>> +xfs_repair_agfl_update_agf(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	xfs_agblock_t			flcount)
>> +{
>> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
>> +
> 	ASSERT(*flcount <= xfs_agfl_size(mp));
> 
>> +	/* XXX: trigger fdblocks recalculation */
>> +
>> +	/* Update the AGF counters. */
>> +	if (sc->sa.pag->pagf_init)
>> +		sc->sa.pag->pagf_flcount = flcount;
>> +	agf->agf_flfirst = cpu_to_be32(0);
>> +	agf->agf_flcount = cpu_to_be32(flcount);
>> +	agf->agf_fllast = cpu_to_be32(flcount - 1);
>> +
>> +	xfs_alloc_log_agf(sc->tp, agf_bp,
>> +			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
>> +	return 0;
>> +}
>> +
>> +/* Write out a totally new AGFL. */
>> +STATIC void
>> +xfs_repair_agfl_init_header(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agfl_bp,
>> +	struct xfs_repair_extent_list	*agfl_extents,
>> +	xfs_agblock_t			flcount)
>> +{
>> +	struct xfs_mount		*mp = sc->mp;
>> +	__be32				*agfl_bno;
>> +	struct xfs_repair_extent	*rae;
>> +	struct xfs_repair_extent	*n;
>> +	struct xfs_agfl			*agfl;
>> +	xfs_agblock_t			agbno;
>> +	unsigned int			fl_off;
>> +
> 	ASSERT(*flcount <= xfs_agfl_size(mp));
> 
>> +	/* Start rewriting the header. */
>> +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
>> +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
>> +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
>> +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
>> +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
>> +
>> +	/* Fill the AGFL with the remaining blocks. */
>> +	fl_off = 0;
>> +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
>> +	for_each_xfs_repair_extent_safe(rae, n, agfl_extents) {
>> +		agbno = XFS_FSB_TO_AGBNO(mp, rae->fsbno);
>> +
>> +		trace_xfs_repair_agfl_insert(mp, sc->sa.agno, agbno, rae->len);
>> +
>> +		while (rae->len > 0 && fl_off < flcount) {
>> +			agfl_bno[fl_off] = cpu_to_be32(agbno);
>> +			fl_off++;
>> +			agbno++;
>> +			rae->fsbno++;
>> +			rae->len--;
>> +		}
> 
> This only works correctly if flcount <= xfs_agfl_size, which is why
> I'm suggesting some asserts.
> 
>> +
>> +		if (rae->len)
>> +			break;
>> +		list_del(&rae->list);
>> +		kmem_free(rae);
>> +	}
>> +
>> +	/* Write AGF and AGFL to disk. */
>> +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
>> +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
>> +}
>> +
>> +/* Repair the AGFL. */
>> +int
>> +xfs_repair_agfl(
>> +	struct xfs_scrub_context	*sc)
>> +{
>> +	struct xfs_owner_info		oinfo;
>> +	struct xfs_repair_extent_list	agfl_extents;
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_buf			*agf_bp;
>> +	struct xfs_buf			*agfl_bp;
>> +	xfs_agblock_t			flcount;
>> +	int				error;
>> +
>> +	/* We require the rmapbt to rebuild anything. */
>> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
>> +		return -EOPNOTSUPP;
>> +
>> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
>> +	xfs_repair_init_extent_list(&agfl_extents);
>> +
>> +	/*
>> +	 * Read the AGF so that we can query the rmapbt.  We hope that there's
>> +	 * nothing wrong with the AGF, but all the AG header repair functions
>> +	 * have this chicken-and-egg problem.
>> +	 */
>> +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
>> +	if (error)
>> +		return error;
>> +	if (!agf_bp)
>> +		return -ENOMEM;
>> +
>> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
>> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGFL_DADDR(mp)),
>> +			XFS_FSS_TO_BB(mp, 1), 0, &agfl_bp, NULL);
>> +	if (error)
>> +		return error;
>> +	agfl_bp->b_ops = &xfs_agfl_buf_ops;
>> +
>> +	/*
>> +	 * Compute the set of old AGFL blocks by subtracting from the list of
>> +	 * OWN_AG blocks the list of blocks owned by all other OWN_AG metadata
>> +	 * (bnobt, cntbt, rmapbt).  These are the old AGFL blocks, so return
>> +	 * that list and the number of blocks we're actually going to put back
>> +	 * on the AGFL.
>> +	 */
> 
> That comment belongs on the function, not here. All we need here is
> something like:
> 
> 	/* Gather all the extents we're going to put on the new AGFL. */
> 
>> +	error = xfs_repair_agfl_find_extents(sc, agf_bp, &agfl_extents,
>> +			&flcount);
>> +	if (error)
>> +		goto err;
>> +
>> +	/*
>> +	 * Update AGF and AGFL.  We reset the global free block counter when
>> +	 * we adjust the AGF flcount (which can fail) so avoid updating any
>> +	 * bufers until we know that part works.
> 
> buffers
> 
>> +	 */
>> +	error = xfs_repair_agfl_update_agf(sc, agf_bp, flcount);
>> +	if (error)
>> +		goto err;
>> +	xfs_repair_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
>> +
>> +	/*
>> +	 * Ok, the AGFL should be ready to go now.  Roll the transaction so
>> +	 * that we can free any AGFL overflow.
>> +	 */
> 
> Why does rolling the transaction allow us to free the overflow?
> Shouldn't the comment say something like "Roll to the transaction to
> make the new AGFL permanent before we start using it when returning
> the residual AGFL freespace overflow back to the AGF freespace
> btrees."
> 
>> +	sc->sa.agf_bp = agf_bp;
>> +	sc->sa.agfl_bp = agfl_bp;
>> +	error = xfs_repair_roll_ag_trans(sc);
>> +	if (error)
>> +		goto err;
>> +
>> +	/* Dump any AGFL overflow. */
>> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
>> +	return xfs_repair_reap_btree_extents(sc, &agfl_extents, &oinfo,
>> +			XFS_AG_RESV_AGFL);
>> +err:
>> +	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
>> +	return error;
>> +}
>> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
>> index 326be4e8b71e..bcdaa8df18f6 100644
>> --- a/fs/xfs/scrub/repair.c
>> +++ b/fs/xfs/scrub/repair.c
>> @@ -127,9 +127,12 @@ xfs_repair_roll_ag_trans(
>>   	int				error;
>>   
>>   	/* Keep the AG header buffers locked so we can keep going. */
>> -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
>> -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
>> -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
>> +	if (sc->sa.agi_bp)
>> +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
>> +	if (sc->sa.agf_bp)
>> +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
>> +	if (sc->sa.agfl_bp)
>> +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
>>   
>>   	/* Roll the transaction. */
>>   	error = xfs_trans_roll(&sc->tp);
>> @@ -137,9 +140,12 @@ xfs_repair_roll_ag_trans(
>>   		goto out_release;
>>   
>>   	/* Join AG headers to the new transaction. */
>> -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
>> -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
>> -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
>> +	if (sc->sa.agi_bp)
>> +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
>> +	if (sc->sa.agf_bp)
>> +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
>> +	if (sc->sa.agfl_bp)
>> +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
>>   
>>   	return 0;
>>   
>> @@ -149,9 +155,12 @@ xfs_repair_roll_ag_trans(
>>   	 * buffers will be released during teardown on our way out
>>   	 * of the kernel.
>>   	 */
>> -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
>> -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
>> -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
>> +	if (sc->sa.agi_bp)
>> +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
>> +	if (sc->sa.agf_bp)
>> +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
>> +	if (sc->sa.agfl_bp)
>> +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
>>   
>>   	return error;
>>   }
>> @@ -408,6 +417,85 @@ xfs_repair_collect_btree_extent(
>>   	return 0;
>>   }
>>   
>> +/*
>> + * Help record all btree blocks seen while iterating all records of a btree.
>> + *
>> + * We know that the btree query_all function starts at the left edge and walks
>> + * towards the right edge of the tree.  Therefore, we know that we can walk up
>> + * the btree cursor towards the root; if the pointer for a given level points
>> + * to the first record/key in that block, we haven't seen this block before;
>> + * and therefore we need to remember that we saw this block in the btree.
>> + *
>> + * So if our btree is:
>> + *
>> + *    4
>> + *  / | \
>> + * 1  2  3
>> + *
>> + * Pretend for this example that each leaf block has 100 btree records.  For
>> + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
>> + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
>> + * block 4.  The list is [1, 4].
>> + *
>> + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
>> + * loop.  The list remains [1, 4].
>> + *
>> + * For the 101st btree record, we've moved onto leaf block 2.  Now
>> + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
>> + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
>> + *
>> + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
>> + *
>> + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
>> + * we add 3 to the list.  Now it is [1, 4, 2, 3].
>> + *
>> + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
>> + *
>> + * The *iter_fn can return XFS_BTREE_QUERY_RANGE_ABORT to stop, 0 to keep
>> + * iterating, or the usual negative error code.
>> + */
>> +int
>> +xfs_repair_collect_btree_cur_blocks(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_btree_cur		*cur,
>> +	int				(*iter_fn)(struct xfs_scrub_context *sc,
>> +						   xfs_fsblock_t fsbno,
>> +						   xfs_fsblock_t len,
>> +						   void *priv),
>> +	void				*priv)
>> +{
>> +	struct xfs_buf			*bp;
>> +	xfs_fsblock_t			fsb;
>> +	int				i;
>> +	int				error;
>> +
>> +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
>> +		xfs_btree_get_block(cur, i, &bp);
>> +		if (!bp)
>> +			continue;
>> +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
>> +		error = iter_fn(sc, fsb, 1, priv);
>> +		if (error)
>> +			return error;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Simple adapter to connect xfs_repair_collect_btree_extent to
>> + * xfs_repair_collect_btree_cur_blocks.
>> + */
>> +int
>> +xfs_repair_collect_btree_cur_blocks_in_extent_list(
>> +	struct xfs_scrub_context	*sc,
>> +	xfs_fsblock_t			fsbno,
>> +	xfs_fsblock_t			len,
>> +	void				*priv)
>> +{
>> +	return xfs_repair_collect_btree_extent(sc, priv, fsbno, len);
>> +}
>> +
>>   /*
>>    * An error happened during the rebuild so the transaction will be cancelled.
>>    * The fs will shut down, and the administrator has to unmount and run repair.
>> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
>> index ef47826b6725..f2af5923aa75 100644
>> --- a/fs/xfs/scrub/repair.h
>> +++ b/fs/xfs/scrub/repair.h
>> @@ -48,9 +48,20 @@ xfs_repair_init_extent_list(
>>   
>>   #define for_each_xfs_repair_extent_safe(rbe, n, exlist) \
>>   	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
>> +#define for_each_xfs_repair_extent(rbe, exlist) \
>> +	list_for_each_entry((rbe), &(exlist)->list, list)
>>   int xfs_repair_collect_btree_extent(struct xfs_scrub_context *sc,
>>   		struct xfs_repair_extent_list *btlist, xfs_fsblock_t fsbno,
>>   		xfs_extlen_t len);
>> +int xfs_repair_collect_btree_cur_blocks(struct xfs_scrub_context *sc,
>> +		struct xfs_btree_cur *cur,
>> +		int (*iter_fn)(struct xfs_scrub_context *sc,
>> +			       xfs_fsblock_t fsbno, xfs_fsblock_t len,
>> +			       void *priv),
>> +		void *priv);
>> +int xfs_repair_collect_btree_cur_blocks_in_extent_list(
>> +		struct xfs_scrub_context *sc, xfs_fsblock_t fsbno,
>> +		xfs_fsblock_t len, void *priv);
>>   void xfs_repair_cancel_btree_extents(struct xfs_scrub_context *sc,
>>   		struct xfs_repair_extent_list *btlist);
>>   int xfs_repair_subtract_extents(struct xfs_scrub_context *sc,
>> @@ -89,6 +100,8 @@ int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
>>   
>>   int xfs_repair_probe(struct xfs_scrub_context *sc);
>>   int xfs_repair_superblock(struct xfs_scrub_context *sc);
>> +int xfs_repair_agf(struct xfs_scrub_context *sc);
>> +int xfs_repair_agfl(struct xfs_scrub_context *sc);
>>   
>>   #else
>>   
>> @@ -112,6 +125,8 @@ xfs_repair_calc_ag_resblks(
>>   
>>   #define xfs_repair_probe		xfs_repair_notsupported
>>   #define xfs_repair_superblock		xfs_repair_notsupported
>> +#define xfs_repair_agf			xfs_repair_notsupported
>> +#define xfs_repair_agfl			xfs_repair_notsupported
>>   
>>   #endif /* CONFIG_XFS_ONLINE_REPAIR */
>>   
>> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
>> index 58ae76b3a421..8e11c3c699fb 100644
>> --- a/fs/xfs/scrub/scrub.c
>> +++ b/fs/xfs/scrub/scrub.c
>> @@ -208,13 +208,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>>   		.type	= ST_PERAG,
>>   		.setup	= xfs_scrub_setup_fs,
>>   		.scrub	= xfs_scrub_agf,
>> -		.repair	= xfs_repair_notsupported,
>> +		.repair	= xfs_repair_agf,
>>   	},
>>   	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
>>   		.type	= ST_PERAG,
>>   		.setup	= xfs_scrub_setup_fs,
>>   		.scrub	= xfs_scrub_agfl,
>> -		.repair	= xfs_repair_notsupported,
>> +		.repair	= xfs_repair_agfl,
>>   	},
>>   	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
>>   		.type	= ST_PERAG,
>> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
>> index 524f543c5b82..c08785cf83a9 100644
>> --- a/fs/xfs/xfs_trans.c
>> +++ b/fs/xfs/xfs_trans.c
>> @@ -126,6 +126,60 @@ xfs_trans_dup(
>>   	return ntp;
>>   }
>>   
>> +/*
>> + * Try to reserve more blocks for a transaction.  The single use case we
>> + * support is for online repair -- use a transaction to gather data without
>> + * fear of btree cycle deadlocks; calculate how many blocks we really need
>> + * from that data; and only then start modifying data.  This can fail due to
>> + * ENOSPC, so we have to be able to cancel the transaction.
>> + */
>> +int
>> +xfs_trans_reserve_more(
>> +	struct xfs_trans	*tp,
>> +	uint			blocks,
>> +	uint			rtextents)
> 
> This isn't used in this patch - seems out of place here. Committed
> to the wrong patch?
> 
> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-27 16:44     ` Allison Henderson
@ 2018-06-27 23:37       ` Dave Chinner
  2018-06-29 15:14         ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-06-27 23:37 UTC (permalink / raw)
  To: Allison Henderson; +Cc: Darrick J. Wong, linux-xfs

On Wed, Jun 27, 2018 at 09:44:53AM -0700, Allison Henderson wrote:
> On 06/26/2018 07:19 PM, Dave Chinner wrote:
> >On Sun, Jun 24, 2018 at 12:23:54PM -0700, Darrick J. Wong wrote:
> >>From: Darrick J. Wong <darrick.wong@oracle.com>
> >>
> >>Regenerate the AGF and AGFL from the rmap data.
> >>
> >>Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> >
> >[...]

> >>+	/* Record all the OWN_AG blocks. */
> >>+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> >>+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> >>+				rec->rm_startblock);
> >>+		error = xfs_repair_collect_btree_extent(ra->sc,
> >>+				ra->freesp_list, fsb, rec->rm_blockcount);
> >>+		if (error)
> >>+			return error;
> >>+	}
> >>+
> >>+	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
> >>+			xfs_repair_collect_btree_cur_blocks_in_extent_list,
> >
> >Urk. The function name lengths is getting out of hand. I'm very
> >tempted to suggest we should shorten the namespace of all this
> >like s/xfs_repair_/xr_/ and s/xfs_scrub_/xs_/, etc just to make them
> >shorter and easier to read.
> >
> >Oh, wait, did I say that out loud? :P
> >
> >Something to think about, anyway.
> >
> Well they are sort of long, but TBH I think i still kind of
> appreciate the extra verbiage.  I have seen other projects do things
> like adopt a sort of 3 or 4 letter abbreviation (like maybe xfs_scrb
> or xfs_repr). Helps to cut down on the verbosity while still not
> loosing too much of what it is supposed to mean.  Just another idea
> to consider. :-)

We've got that in places, too, like "xlog_" prefixes for all the log
code, so that's not an unreasonable thing to suggest. After all, in
many cases we're talking about a tradeoff between readabilty and the
amount of typing necessary.

However, IMO, function names so long they need a line of their own
indicates we have a structural problem in our code, not a
readability problem. We should not need names that long to document
what the function does - it should be obvious from the context, the
abstraction that is being used and a short name....

e.g. how many of these different "collect extent" operations could
be abstracted into a common extent list structure and generic
callbacks? It seems there's a lot of similarity in them, and we're
really only differentiating them by adding more namespace and
context specific information into the structure and function names.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 07/21] xfs: repair inode btrees
  2018-06-24 19:24 ` [PATCH 07/21] xfs: repair inode btrees Darrick J. Wong
@ 2018-06-28  0:55   ` Dave Chinner
  2018-07-04  2:22     ` Darrick J. Wong
  2018-06-30 17:36   ` Allison Henderson
  1 sibling, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-06-28  0:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:13PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Use the rmapbt to find inode chunks, query the chunks to compute
> hole and free masks, and with that information rebuild the inobt
> and finobt.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

[....]

> +/*
> + * For each cluster in this blob of inode, we must calculate the
> + * properly aligned startino of that cluster, then iterate each
> + * cluster to fill in used and filled masks appropriately.  We
> + * then use the (startino, used, filled) information to construct
> + * the appropriate inode records.
> + */
> +STATIC int
> +xfs_repair_ialloc_process_cluster(
> +	struct xfs_repair_ialloc	*ri,
> +	xfs_agblock_t			agbno,
> +	int				blks_per_cluster,
> +	xfs_agino_t			rec_agino)
> +{
> +	struct xfs_imap			imap;
> +	struct xfs_repair_ialloc_extent	*rie;
> +	struct xfs_dinode		*dip;
> +	struct xfs_buf			*bp;
> +	struct xfs_scrub_context	*sc = ri->sc;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_ino_t			fsino;
> +	xfs_inofree_t			usedmask;
> +	xfs_agino_t			nr_inodes;
> +	xfs_agino_t			startino;
> +	xfs_agino_t			clusterino;
> +	xfs_agino_t			clusteroff;
> +	xfs_agino_t			agino;
> +	uint16_t			fillmask;
> +	bool				inuse;
> +	int				usedcount;
> +	int				error;
> +
> +	/* The per-AG inum of this inode cluster. */
> +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> +
> +	/* The per-AG inum of the inobt record. */
> +	startino = rec_agino + rounddown(agino - rec_agino,
> +			XFS_INODES_PER_CHUNK);
> +
> +	/* The per-AG inum of the cluster within the inobt record. */
> +	clusteroff = agino - startino;
> +
> +	/* Every inode in this holemask slot is filled. */
> +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> +	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
> +			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
> +
> +	/* Grab the inode cluster buffer. */
> +	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
> +	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
> +	imap.im_boffset = 0;
> +
> +	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
> +			XFS_IGET_UNTRUSTED);

This is going to error out if the cluster we are asking to be mapped
has no record in the inobt. Aren't we trying to rebuild the inobt
here from the rmap's idea of on-disk clusters? So how do we rebuild
the inobt record if we can't already find the chunk record in the
inobt?

At minimum, this needs a comment explaining why it works.

> +/* Initialize new inobt/finobt roots and implant them into the AGI. */
> +STATIC int
> +xfs_repair_iallocbt_reset_btrees(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_owner_info		*oinfo,
> +	int				*log_flags)
> +{
> +	struct xfs_agi			*agi;
> +	struct xfs_buf			*bp;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_fsblock_t			inofsb;
> +	xfs_fsblock_t			finofsb;
> +	enum xfs_ag_resv_type		resv;
> +	int				error;
> +
> +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> +
> +	/* Initialize new inobt root. */
> +	resv = XFS_AG_RESV_NONE;
> +	error = xfs_repair_alloc_ag_block(sc, oinfo, &inofsb, resv);
> +	if (error)
> +		return error;
> +	error = xfs_repair_init_btblock(sc, inofsb, &bp, XFS_BTNUM_INO,
> +			&xfs_inobt_buf_ops);
> +	if (error)
> +		return error;
> +	agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, inofsb));
> +	agi->agi_level = cpu_to_be32(1);
> +	*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
> +
> +	/* Initialize new finobt root. */
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return 0;
> +
> +	resv = mp->m_inotbt_nores ? XFS_AG_RESV_NONE : XFS_AG_RESV_METADATA;

Comment explaining this?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-27  2:19   ` Dave Chinner
  2018-06-27 16:44     ` Allison Henderson
@ 2018-06-28 17:25     ` Allison Henderson
  2018-06-29 15:08       ` Darrick J. Wong
  1 sibling, 1 reply; 77+ messages in thread
From: Allison Henderson @ 2018-06-28 17:25 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong; +Cc: linux-xfs


On 06/26/2018 07:19 PM, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:23:54PM -0700, Darrick J. Wong wrote:
>> From: Darrick J. Wong <darrick.wong@oracle.com>
>>
>> Regenerate the AGF and AGFL from the rmap data.
>>
>> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [...]
> 
>> +/* Information for finding AGF-rooted btrees */
>> +enum {
>> +	REPAIR_AGF_BNOBT = 0,
>> +	REPAIR_AGF_CNTBT,
>> +	REPAIR_AGF_RMAPBT,
>> +	REPAIR_AGF_REFCOUNTBT,
>> +	REPAIR_AGF_END,
>> +	REPAIR_AGF_MAX
>> +};
> 
> Why can't you just use XFS_BTNUM_* for these btree type descriptors?

Well, I know Darrick hasn't responded yet, but I actually have seen 
other projects intentionally redefine scopes like this (even if its 
repetitive).  The reason being for example, to help prevent people from 
mistakenly indexing an element of the below array that may not be 
defined since XFS_BTNUM_* defines more types than are being used here. 
(and its easy to overlook because by belonging to the same name space, 
it doest look out of place)  So basically in redefining only the types 
meant to be used, we may help people to avoid mistakenly mishandling it.

I've also seen such practices generate a lot of extra code too.  Both 
solutions will work.  But in response to your comment: it looks to me 
like a question of cutting down code vs using more a defensive coding style.


> 
>> +
>> +static const struct xfs_repair_find_ag_btree repair_agf[] = {
>> +	[REPAIR_AGF_BNOBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>> +		.buf_ops = &xfs_allocbt_buf_ops,
>> +		.magic = XFS_ABTB_CRC_MAGIC,
>> +	},
>> +	[REPAIR_AGF_CNTBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>> +		.buf_ops = &xfs_allocbt_buf_ops,
>> +		.magic = XFS_ABTC_CRC_MAGIC,
>> +	},
> 
> I had to stop and think about why this only supports the v5 types.
> i.e. we're rebuilding from rmap info, so this will never run on v4
> filesystems, hence we only care about v5 types (i.e. *CRC_MAGIC).
> Perhaps a one-line comment to remind readers of this?
> 
>> +	[REPAIR_AGF_RMAPBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>> +		.buf_ops = &xfs_rmapbt_buf_ops,
>> +		.magic = XFS_RMAP_CRC_MAGIC,
>> +	},
>> +	[REPAIR_AGF_REFCOUNTBT] = {
>> +		.rmap_owner = XFS_RMAP_OWN_REFC,
>> +		.buf_ops = &xfs_refcountbt_buf_ops,
>> +		.magic = XFS_REFC_CRC_MAGIC,
>> +	},
>> +	[REPAIR_AGF_END] = {
>> +		.buf_ops = NULL,
>> +	},
>> +};
>> +
>> +/*
>> + * Find the btree roots.  This is /also/ a chicken and egg problem because we
>> + * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
>> + * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
>> + * corruptions in those btrees we'll bail out.
>> + */
>> +STATIC int
>> +xfs_repair_agf_find_btrees(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	struct xfs_repair_find_ag_btree	*fab,
>> +	struct xfs_buf			*agfl_bp)
>> +{
>> +	struct xfs_agf			*old_agf = XFS_BUF_TO_AGF(agf_bp);
>> +	int				error;
>> +
>> +	/* Go find the root data. */
>> +	memcpy(fab, repair_agf, sizeof(repair_agf));
> 
> Why are we initialising fab here, instead of in the caller where it
> is declared and passed to various functions? Given there is only a
> single declaration of this structure, why do we need a global static
> const table initialiser just to copy it here - why isn't it
> initialised at the declaration point?
> 
>> +	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, agfl_bp);
>> +	if (error)
>> +		return error;
>> +
>> +	/* We must find the bnobt, cntbt, and rmapbt roots. */
>> +	if (fab[REPAIR_AGF_BNOBT].root == NULLAGBLOCK ||
>> +	    fab[REPAIR_AGF_BNOBT].height > XFS_BTREE_MAXLEVELS ||
>> +	    fab[REPAIR_AGF_CNTBT].root == NULLAGBLOCK ||
>> +	    fab[REPAIR_AGF_CNTBT].height > XFS_BTREE_MAXLEVELS ||
>> +	    fab[REPAIR_AGF_RMAPBT].root == NULLAGBLOCK ||
>> +	    fab[REPAIR_AGF_RMAPBT].height > XFS_BTREE_MAXLEVELS)
>> +		return -EFSCORRUPTED;
>> +
>> +	/*
>> +	 * We relied on the rmapbt to reconstruct the AGF.  If we get a
>> +	 * different root then something's seriously wrong.
>> +	 */
>> +	if (fab[REPAIR_AGF_RMAPBT].root !=
>> +	    be32_to_cpu(old_agf->agf_roots[XFS_BTNUM_RMAPi]))
>> +		return -EFSCORRUPTED;
>> +
>> +	/* We must find the refcountbt root if that feature is enabled. */
>> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb) &&
>> +	    (fab[REPAIR_AGF_REFCOUNTBT].root == NULLAGBLOCK ||
>> +	     fab[REPAIR_AGF_REFCOUNTBT].height > XFS_BTREE_MAXLEVELS))
>> +		return -EFSCORRUPTED;
>> +
>> +	return 0;
>> +}
>> +
>> +/* Set btree root information in an AGF. */
>> +STATIC void
>> +xfs_repair_agf_set_roots(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_agf			*agf,
>> +	struct xfs_repair_find_ag_btree	*fab)
>> +{
>> +	agf->agf_roots[XFS_BTNUM_BNOi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].root);
>> +	agf->agf_levels[XFS_BTNUM_BNOi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].height);
>> +
>> +	agf->agf_roots[XFS_BTNUM_CNTi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].root);
>> +	agf->agf_levels[XFS_BTNUM_CNTi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].height);
>> +
>> +	agf->agf_roots[XFS_BTNUM_RMAPi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].root);
>> +	agf->agf_levels[XFS_BTNUM_RMAPi] =
>> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].height);
>> +
>> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
>> +		agf->agf_refcount_root =
>> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].root);
>> +		agf->agf_refcount_level =
>> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].height);
>> +	}
>> +}
>> +
>> +/*
>> + * Reinitialize the AGF header, making an in-core copy of the old contents so
>> + * that we know which in-core state needs to be reinitialized.
>> + */
>> +STATIC void
>> +xfs_repair_agf_init_header(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	struct xfs_agf			*old_agf)
>> +{
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
>> +
>> +	memcpy(old_agf, agf, sizeof(*old_agf));
>> +	memset(agf, 0, BBTOB(agf_bp->b_length));
>> +	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
>> +	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
>> +	agf->agf_seqno = cpu_to_be32(sc->sa.agno);
>> +	agf->agf_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
>> +	agf->agf_flfirst = old_agf->agf_flfirst;
>> +	agf->agf_fllast = old_agf->agf_fllast;
>> +	agf->agf_flcount = old_agf->agf_flcount;
>> +	if (xfs_sb_version_hascrc(&mp->m_sb))
>> +		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
>> +}
> 
> Do we need to clear pag->pagf_init here so that it gets
> re-initialised next time someone reads the AGF?
> 
>> +
>> +/* Update the AGF btree counters by walking the btrees. */
>> +STATIC int
>> +xfs_repair_agf_update_btree_counters(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp)
>> +{
>> +	struct xfs_repair_agf_allocbt	raa = { .sc = sc };
>> +	struct xfs_btree_cur		*cur = NULL;
>> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
>> +	struct xfs_mount		*mp = sc->mp;
>> +	xfs_agblock_t			btreeblks;
>> +	xfs_agblock_t			blocks;
>> +	int				error;
>> +
>> +	/* Update the AGF counters from the bnobt. */
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_BNO);
>> +	error = xfs_alloc_query_all(cur, xfs_repair_agf_walk_allocbt, &raa);
>> +	if (error)
>> +		goto err;
>> +	error = xfs_btree_count_blocks(cur, &blocks);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +	btreeblks = blocks - 1;
>> +	agf->agf_freeblks = cpu_to_be32(raa.freeblks);
>> +	agf->agf_longest = cpu_to_be32(raa.longest);
> 
> This function updates more than the AGF btree counters. :P
> 
>> +
>> +	/* Update the AGF counters from the cntbt. */
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_CNT);
>> +	error = xfs_btree_count_blocks(cur, &blocks);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +	btreeblks += blocks - 1;
>> +
>> +	/* Update the AGF counters from the rmapbt. */
>> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
>> +	error = xfs_btree_count_blocks(cur, &blocks);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +	agf->agf_rmap_blocks = cpu_to_be32(blocks);
>> +	btreeblks += blocks - 1;
>> +
>> +	agf->agf_btreeblks = cpu_to_be32(btreeblks);
>> +
>> +	/* Update the AGF counters from the refcountbt. */
>> +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
>> +		cur = xfs_refcountbt_init_cursor(mp, sc->tp, agf_bp,
>> +				sc->sa.agno, NULL);
>> +		error = xfs_btree_count_blocks(cur, &blocks);
>> +		if (error)
>> +			goto err;
>> +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +		agf->agf_refcount_blocks = cpu_to_be32(blocks);
>> +	}
>> +
>> +	return 0;
>> +err:
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>> +	return error;
>> +}
>> +
>> +/* Trigger reinitialization of the in-core data. */
>> +STATIC int
>> +xfs_repair_agf_reinit_incore(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_agf			*agf,
>> +	const struct xfs_agf		*old_agf)
>> +{
>> +	struct xfs_perag		*pag;
>> +
>> +	/* XXX: trigger fdblocks recalculation */
>> +
>> +	/* Now reinitialize the in-core counters if necessary. */
>> +	pag = sc->sa.pag;
>> +	if (!pag->pagf_init)
>> +		return 0;
>> +
>> +	pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
>> +	pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks);
>> +	pag->pagf_longest = be32_to_cpu(agf->agf_longest);
>> +	pag->pagf_levels[XFS_BTNUM_BNOi] =
>> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
>> +	pag->pagf_levels[XFS_BTNUM_CNTi] =
>> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
>> +	pag->pagf_levels[XFS_BTNUM_RMAPi] =
>> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
>> +	pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
> 
> Ok, so we reinit the pagf bits here, but....
> 
>> +
>> +	return 0;
>> +}
>> +
>> +/* Repair the AGF. */
>> +int
>> +xfs_repair_agf(
>> +	struct xfs_scrub_context	*sc)
>> +{
>> +	struct xfs_repair_find_ag_btree	fab[REPAIR_AGF_MAX];
>> +	struct xfs_agf			old_agf;
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_buf			*agf_bp;
>> +	struct xfs_buf			*agfl_bp;
>> +	struct xfs_agf			*agf;
>> +	int				error;
>> +
>> +	/* We require the rmapbt to rebuild anything. */
>> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
>> +		return -EOPNOTSUPP;
>> +
>> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
>> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
>> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGF_DADDR(mp)),
>> +			XFS_FSS_TO_BB(mp, 1), 0, &agf_bp, NULL);
>> +	if (error)
>> +		return error;
>> +	agf_bp->b_ops = &xfs_agf_buf_ops;
>> +	agf = XFS_BUF_TO_AGF(agf_bp);
>> +
>> +	/*
>> +	 * Load the AGFL so that we can screen out OWN_AG blocks that are on
>> +	 * the AGFL now; these blocks might have once been part of the
>> +	 * bno/cnt/rmap btrees but are not now.  This is a chicken and egg
>> +	 * problem: the AGF is corrupt, so we have to trust the AGFL contents
>> +	 * because we can't do any serious cross-referencing with any of the
>> +	 * btrees rooted in the AGF.  If the AGFL contents are obviously bad
>> +	 * then we'll bail out.
>> +	 */
>> +	error = xfs_alloc_read_agfl(mp, sc->tp, sc->sa.agno, &agfl_bp);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Spot-check the AGFL blocks; if they're obviously corrupt then
>> +	 * there's nothing we can do but bail out.
>> +	 */
>> +	error = xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(agf_bp), agfl_bp,
>> +			xfs_repair_agf_check_agfl_block, sc);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Find the AGF btree roots.  See the comment for this function for
>> +	 * more information about the limitations of this repairer; this is
>> +	 * also a chicken-and-egg situation.
>> +	 */
>> +	error = xfs_repair_agf_find_btrees(sc, agf_bp, fab, agfl_bp);
>> +	if (error)
>> +		return error;
> 
> Comment could be better written.
> 
> 	/*
> 	 * Find the AGF btree roots. This is also a chicken-and-egg
> 	 * situation - see xfs_repair_agf_find_btrees() for details.
> 	 */
> 
>> +
>> +	/* Start rewriting the header and implant the btrees we found. */
>> +	xfs_repair_agf_init_header(sc, agf_bp, &old_agf);
>> +	xfs_repair_agf_set_roots(sc, agf, fab);
>> +	error = xfs_repair_agf_update_btree_counters(sc, agf_bp);
>> +	if (error)
>> +		goto out_revert;
> 
> If we fail here, the pagf information is invalid, hence I think we
> really do need to clear pagf_init before we start rebuilding the new
> AGF. Yes, I can see we revert the AGF info, but this seems like a
> landmine waiting to be tripped over.
> 
>> +	/* Reinitialize in-core state. */
>> +	error = xfs_repair_agf_reinit_incore(sc, agf, &old_agf);
>> +	if (error)
>> +		goto out_revert;
>> +
>> +	/* Write this to disk. */
>> +	xfs_trans_buf_set_type(sc->tp, agf_bp, XFS_BLFT_AGF_BUF);
>> +	xfs_trans_log_buf(sc->tp, agf_bp, 0, BBTOB(agf_bp->b_length) - 1);
>> +	return 0;
>> +
>> +out_revert:
>> +	memcpy(agf, &old_agf, sizeof(old_agf));
>> +	return error;
>> +}
>> +
>> +/* AGFL */
>> +
>> +struct xfs_repair_agfl {
>> +	struct xfs_repair_extent_list	agmeta_list;
>> +	struct xfs_repair_extent_list	*freesp_list;
>> +	struct xfs_scrub_context	*sc;
>> +};
>> +
>> +/* Record all freespace information. */
>> +STATIC int
>> +xfs_repair_agfl_rmap_fn(
>> +	struct xfs_btree_cur		*cur,
>> +	struct xfs_rmap_irec		*rec,
>> +	void				*priv)
>> +{
>> +	struct xfs_repair_agfl		*ra = priv;
>> +	xfs_fsblock_t			fsb;
>> +	int				error = 0;
>> +
>> +	if (xfs_scrub_should_terminate(ra->sc, &error))
>> +		return error;
>> +
>> +	/* Record all the OWN_AG blocks. */
>> +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
>> +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
>> +				rec->rm_startblock);
>> +		error = xfs_repair_collect_btree_extent(ra->sc,
>> +				ra->freesp_list, fsb, rec->rm_blockcount);
>> +		if (error)
>> +			return error;
>> +	}
>> +
>> +	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
>> +			xfs_repair_collect_btree_cur_blocks_in_extent_list,
> 
> Urk. The function name lengths is getting out of hand. I'm very
> tempted to suggest we should shorten the namespace of all this
> like s/xfs_repair_/xr_/ and s/xfs_scrub_/xs_/, etc just to make them
> shorter and easier to read.
> 
> Oh, wait, did I say that out loud? :P
> 
> Something to think about, anyway.
> 
>> +			&ra->agmeta_list);
>> +}
>> +
>> +/* Add a btree block to the agmeta list. */
>> +STATIC int
>> +xfs_repair_agfl_visit_btblock(
> 
> I find the name a bit confusing - AGFLs don't have btree blocks.
> Yes, I know that it's a xfs_btree_visit_blocks() callback but I
> think s/visit/collect/ makes more sense. i.e. it tells us what we
> are doing with the btree block, rather than making it sound like we
> are walking AGFL btree blocks...
> 
>> +/*
>> + * Map out all the non-AGFL OWN_AG space in this AG so that we can deduce
>> + * which blocks belong to the AGFL.
>> + */
>> +STATIC int
>> +xfs_repair_agfl_find_extents(
> 
> Same here - xr_agfl_collect_free_extents()?
> 
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	struct xfs_repair_extent_list	*agfl_extents,
>> +	xfs_agblock_t			*flcount)
>> +{
>> +	struct xfs_repair_agfl		ra;
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_btree_cur		*cur;
>> +	struct xfs_repair_extent	*rae;
>> +	int				error;
>> +
>> +	ra.sc = sc;
>> +	ra.freesp_list = agfl_extents;
>> +	xfs_repair_init_extent_list(&ra.agmeta_list);
>> +
>> +	/* Find all space used by the free space btrees & rmapbt. */
>> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
>> +	error = xfs_rmap_query_all(cur, xfs_repair_agfl_rmap_fn, &ra);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +
>> +	/* Find all space used by bnobt. */
> 
> Needs clarification.
> 
> 	/* Find all the in use bnobt blocks */
> 
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_BNO);
>> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
>> +	if (error)
>> +		goto err;
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +
>> +	/* Find all space used by cntbt. */
> 
> 	/* Find all the in use cntbt blocks */
> 
>> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
>> +			XFS_BTNUM_CNT);
>> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
>> +	if (error)
>> +		goto err;
>> +
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>> +
>> +	/*
>> +	 * Drop the freesp meta blocks that are in use by btrees.
>> +	 * The remaining blocks /should/ be AGFL blocks.
>> +	 */
>> +	error = xfs_repair_subtract_extents(sc, agfl_extents, &ra.agmeta_list);
>> +	xfs_repair_cancel_btree_extents(sc, &ra.agmeta_list);
>> +	if (error)
>> +		return error;
>> +
>> +	/* Calculate the new AGFL size. */
>> +	*flcount = 0;
>> +	for_each_xfs_repair_extent(rae, agfl_extents) {
>> +		*flcount += rae->len;
>> +		if (*flcount > xfs_agfl_size(mp))
>> +			break;
>> +	}
>> +	if (*flcount > xfs_agfl_size(mp))
>> +		*flcount = xfs_agfl_size(mp);
> 
> Ok, so flcount is clamped here. What happens to all the remaining
> agfl_extents beyond flcount?
> 
>> +	return 0;
>> +
>> +err:
> 
> Ok, what cleans up all the extents we've recorded in ra on error?
> 
>> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>> +	return error;
>> +}
>> +
>> +/* Update the AGF and reset the in-core state. */
>> +STATIC int
>> +xfs_repair_agfl_update_agf(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agf_bp,
>> +	xfs_agblock_t			flcount)
>> +{
>> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
>> +
> 	ASSERT(*flcount <= xfs_agfl_size(mp));
> 
>> +	/* XXX: trigger fdblocks recalculation */
>> +
>> +	/* Update the AGF counters. */
>> +	if (sc->sa.pag->pagf_init)
>> +		sc->sa.pag->pagf_flcount = flcount;
>> +	agf->agf_flfirst = cpu_to_be32(0);
>> +	agf->agf_flcount = cpu_to_be32(flcount);
>> +	agf->agf_fllast = cpu_to_be32(flcount - 1);
>> +
>> +	xfs_alloc_log_agf(sc->tp, agf_bp,
>> +			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
>> +	return 0;
>> +}
>> +
>> +/* Write out a totally new AGFL. */
>> +STATIC void
>> +xfs_repair_agfl_init_header(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_buf			*agfl_bp,
>> +	struct xfs_repair_extent_list	*agfl_extents,
>> +	xfs_agblock_t			flcount)
>> +{
>> +	struct xfs_mount		*mp = sc->mp;
>> +	__be32				*agfl_bno;
>> +	struct xfs_repair_extent	*rae;
>> +	struct xfs_repair_extent	*n;
>> +	struct xfs_agfl			*agfl;
>> +	xfs_agblock_t			agbno;
>> +	unsigned int			fl_off;
>> +
> 	ASSERT(*flcount <= xfs_agfl_size(mp));
> 
>> +	/* Start rewriting the header. */
>> +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
>> +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
>> +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
>> +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
>> +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
>> +
>> +	/* Fill the AGFL with the remaining blocks. */
>> +	fl_off = 0;
>> +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
>> +	for_each_xfs_repair_extent_safe(rae, n, agfl_extents) {
>> +		agbno = XFS_FSB_TO_AGBNO(mp, rae->fsbno);
>> +
>> +		trace_xfs_repair_agfl_insert(mp, sc->sa.agno, agbno, rae->len);
>> +
>> +		while (rae->len > 0 && fl_off < flcount) {
>> +			agfl_bno[fl_off] = cpu_to_be32(agbno);
>> +			fl_off++;
>> +			agbno++;
>> +			rae->fsbno++;
>> +			rae->len--;
>> +		}
> 
> This only works correctly if flcount <= xfs_agfl_size, which is why
> I'm suggesting some asserts.
> 
>> +
>> +		if (rae->len)
>> +			break;
>> +		list_del(&rae->list);
>> +		kmem_free(rae);
>> +	}
>> +
>> +	/* Write AGF and AGFL to disk. */
>> +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
>> +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
>> +}
>> +
>> +/* Repair the AGFL. */
>> +int
>> +xfs_repair_agfl(
>> +	struct xfs_scrub_context	*sc)
>> +{
>> +	struct xfs_owner_info		oinfo;
>> +	struct xfs_repair_extent_list	agfl_extents;
>> +	struct xfs_mount		*mp = sc->mp;
>> +	struct xfs_buf			*agf_bp;
>> +	struct xfs_buf			*agfl_bp;
>> +	xfs_agblock_t			flcount;
>> +	int				error;
>> +
>> +	/* We require the rmapbt to rebuild anything. */
>> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
>> +		return -EOPNOTSUPP;
>> +
>> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
>> +	xfs_repair_init_extent_list(&agfl_extents);
>> +
>> +	/*
>> +	 * Read the AGF so that we can query the rmapbt.  We hope that there's
>> +	 * nothing wrong with the AGF, but all the AG header repair functions
>> +	 * have this chicken-and-egg problem.
>> +	 */
>> +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
>> +	if (error)
>> +		return error;
>> +	if (!agf_bp)
>> +		return -ENOMEM;
>> +
>> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
>> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGFL_DADDR(mp)),
>> +			XFS_FSS_TO_BB(mp, 1), 0, &agfl_bp, NULL);
>> +	if (error)
>> +		return error;
>> +	agfl_bp->b_ops = &xfs_agfl_buf_ops;
>> +
>> +	/*
>> +	 * Compute the set of old AGFL blocks by subtracting from the list of
>> +	 * OWN_AG blocks the list of blocks owned by all other OWN_AG metadata
>> +	 * (bnobt, cntbt, rmapbt).  These are the old AGFL blocks, so return
>> +	 * that list and the number of blocks we're actually going to put back
>> +	 * on the AGFL.
>> +	 */
> 
> That comment belongs on the function, not here. All we need here is
> something like:
> 
> 	/* Gather all the extents we're going to put on the new AGFL. */
> 
>> +	error = xfs_repair_agfl_find_extents(sc, agf_bp, &agfl_extents,
>> +			&flcount);
>> +	if (error)
>> +		goto err;
>> +
>> +	/*
>> +	 * Update AGF and AGFL.  We reset the global free block counter when
>> +	 * we adjust the AGF flcount (which can fail) so avoid updating any
>> +	 * bufers until we know that part works.
> 
> buffers
> 
>> +	 */
>> +	error = xfs_repair_agfl_update_agf(sc, agf_bp, flcount);
>> +	if (error)
>> +		goto err;
>> +	xfs_repair_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
>> +
>> +	/*
>> +	 * Ok, the AGFL should be ready to go now.  Roll the transaction so
>> +	 * that we can free any AGFL overflow.
>> +	 */
> 
> Why does rolling the transaction allow us to free the overflow?
> Shouldn't the comment say something like "Roll to the transaction to
> make the new AGFL permanent before we start using it when returning
> the residual AGFL freespace overflow back to the AGF freespace
> btrees."
> 
>> +	sc->sa.agf_bp = agf_bp;
>> +	sc->sa.agfl_bp = agfl_bp;
>> +	error = xfs_repair_roll_ag_trans(sc);
>> +	if (error)
>> +		goto err;
>> +
>> +	/* Dump any AGFL overflow. */
>> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
>> +	return xfs_repair_reap_btree_extents(sc, &agfl_extents, &oinfo,
>> +			XFS_AG_RESV_AGFL);
>> +err:
>> +	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
>> +	return error;
>> +}
>> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
>> index 326be4e8b71e..bcdaa8df18f6 100644
>> --- a/fs/xfs/scrub/repair.c
>> +++ b/fs/xfs/scrub/repair.c
>> @@ -127,9 +127,12 @@ xfs_repair_roll_ag_trans(
>>   	int				error;
>>   
>>   	/* Keep the AG header buffers locked so we can keep going. */
>> -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
>> -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
>> -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
>> +	if (sc->sa.agi_bp)
>> +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
>> +	if (sc->sa.agf_bp)
>> +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
>> +	if (sc->sa.agfl_bp)
>> +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
>>   
>>   	/* Roll the transaction. */
>>   	error = xfs_trans_roll(&sc->tp);
>> @@ -137,9 +140,12 @@ xfs_repair_roll_ag_trans(
>>   		goto out_release;
>>   
>>   	/* Join AG headers to the new transaction. */
>> -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
>> -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
>> -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
>> +	if (sc->sa.agi_bp)
>> +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
>> +	if (sc->sa.agf_bp)
>> +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
>> +	if (sc->sa.agfl_bp)
>> +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
>>   
>>   	return 0;
>>   
>> @@ -149,9 +155,12 @@ xfs_repair_roll_ag_trans(
>>   	 * buffers will be released during teardown on our way out
>>   	 * of the kernel.
>>   	 */
>> -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
>> -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
>> -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
>> +	if (sc->sa.agi_bp)
>> +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
>> +	if (sc->sa.agf_bp)
>> +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
>> +	if (sc->sa.agfl_bp)
>> +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
>>   
>>   	return error;
>>   }
>> @@ -408,6 +417,85 @@ xfs_repair_collect_btree_extent(
>>   	return 0;
>>   }
>>   
>> +/*
>> + * Help record all btree blocks seen while iterating all records of a btree.
>> + *
>> + * We know that the btree query_all function starts at the left edge and walks
>> + * towards the right edge of the tree.  Therefore, we know that we can walk up
>> + * the btree cursor towards the root; if the pointer for a given level points
>> + * to the first record/key in that block, we haven't seen this block before;
>> + * and therefore we need to remember that we saw this block in the btree.
>> + *
>> + * So if our btree is:
>> + *
>> + *    4
>> + *  / | \
>> + * 1  2  3
>> + *
>> + * Pretend for this example that each leaf block has 100 btree records.  For
>> + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
>> + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
>> + * block 4.  The list is [1, 4].
>> + *
>> + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
>> + * loop.  The list remains [1, 4].
>> + *
>> + * For the 101st btree record, we've moved onto leaf block 2.  Now
>> + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
>> + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
>> + *
>> + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
>> + *
>> + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
>> + * we add 3 to the list.  Now it is [1, 4, 2, 3].
>> + *
>> + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
>> + *
>> + * The *iter_fn can return XFS_BTREE_QUERY_RANGE_ABORT to stop, 0 to keep
>> + * iterating, or the usual negative error code.
>> + */
>> +int
>> +xfs_repair_collect_btree_cur_blocks(
>> +	struct xfs_scrub_context	*sc,
>> +	struct xfs_btree_cur		*cur,
>> +	int				(*iter_fn)(struct xfs_scrub_context *sc,
>> +						   xfs_fsblock_t fsbno,
>> +						   xfs_fsblock_t len,
>> +						   void *priv),
>> +	void				*priv)
>> +{
>> +	struct xfs_buf			*bp;
>> +	xfs_fsblock_t			fsb;
>> +	int				i;
>> +	int				error;
>> +
>> +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
>> +		xfs_btree_get_block(cur, i, &bp);
>> +		if (!bp)
>> +			continue;
>> +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
>> +		error = iter_fn(sc, fsb, 1, priv);
>> +		if (error)
>> +			return error;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Simple adapter to connect xfs_repair_collect_btree_extent to
>> + * xfs_repair_collect_btree_cur_blocks.
>> + */
>> +int
>> +xfs_repair_collect_btree_cur_blocks_in_extent_list(
>> +	struct xfs_scrub_context	*sc,
>> +	xfs_fsblock_t			fsbno,
>> +	xfs_fsblock_t			len,
>> +	void				*priv)
>> +{
>> +	return xfs_repair_collect_btree_extent(sc, priv, fsbno, len);
>> +}
>> +
>>   /*
>>    * An error happened during the rebuild so the transaction will be cancelled.
>>    * The fs will shut down, and the administrator has to unmount and run repair.
>> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
>> index ef47826b6725..f2af5923aa75 100644
>> --- a/fs/xfs/scrub/repair.h
>> +++ b/fs/xfs/scrub/repair.h
>> @@ -48,9 +48,20 @@ xfs_repair_init_extent_list(
>>   
>>   #define for_each_xfs_repair_extent_safe(rbe, n, exlist) \
>>   	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
>> +#define for_each_xfs_repair_extent(rbe, exlist) \
>> +	list_for_each_entry((rbe), &(exlist)->list, list)
>>   int xfs_repair_collect_btree_extent(struct xfs_scrub_context *sc,
>>   		struct xfs_repair_extent_list *btlist, xfs_fsblock_t fsbno,
>>   		xfs_extlen_t len);
>> +int xfs_repair_collect_btree_cur_blocks(struct xfs_scrub_context *sc,
>> +		struct xfs_btree_cur *cur,
>> +		int (*iter_fn)(struct xfs_scrub_context *sc,
>> +			       xfs_fsblock_t fsbno, xfs_fsblock_t len,
>> +			       void *priv),
>> +		void *priv);
>> +int xfs_repair_collect_btree_cur_blocks_in_extent_list(
>> +		struct xfs_scrub_context *sc, xfs_fsblock_t fsbno,
>> +		xfs_fsblock_t len, void *priv);
>>   void xfs_repair_cancel_btree_extents(struct xfs_scrub_context *sc,
>>   		struct xfs_repair_extent_list *btlist);
>>   int xfs_repair_subtract_extents(struct xfs_scrub_context *sc,
>> @@ -89,6 +100,8 @@ int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
>>   
>>   int xfs_repair_probe(struct xfs_scrub_context *sc);
>>   int xfs_repair_superblock(struct xfs_scrub_context *sc);
>> +int xfs_repair_agf(struct xfs_scrub_context *sc);
>> +int xfs_repair_agfl(struct xfs_scrub_context *sc);
>>   
>>   #else
>>   
>> @@ -112,6 +125,8 @@ xfs_repair_calc_ag_resblks(
>>   
>>   #define xfs_repair_probe		xfs_repair_notsupported
>>   #define xfs_repair_superblock		xfs_repair_notsupported
>> +#define xfs_repair_agf			xfs_repair_notsupported
>> +#define xfs_repair_agfl			xfs_repair_notsupported
>>   
>>   #endif /* CONFIG_XFS_ONLINE_REPAIR */
>>   
>> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
>> index 58ae76b3a421..8e11c3c699fb 100644
>> --- a/fs/xfs/scrub/scrub.c
>> +++ b/fs/xfs/scrub/scrub.c
>> @@ -208,13 +208,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>>   		.type	= ST_PERAG,
>>   		.setup	= xfs_scrub_setup_fs,
>>   		.scrub	= xfs_scrub_agf,
>> -		.repair	= xfs_repair_notsupported,
>> +		.repair	= xfs_repair_agf,
>>   	},
>>   	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
>>   		.type	= ST_PERAG,
>>   		.setup	= xfs_scrub_setup_fs,
>>   		.scrub	= xfs_scrub_agfl,
>> -		.repair	= xfs_repair_notsupported,
>> +		.repair	= xfs_repair_agfl,
>>   	},
>>   	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
>>   		.type	= ST_PERAG,
>> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
>> index 524f543c5b82..c08785cf83a9 100644
>> --- a/fs/xfs/xfs_trans.c
>> +++ b/fs/xfs/xfs_trans.c
>> @@ -126,6 +126,60 @@ xfs_trans_dup(
>>   	return ntp;
>>   }
>>   
>> +/*
>> + * Try to reserve more blocks for a transaction.  The single use case we
>> + * support is for online repair -- use a transaction to gather data without
>> + * fear of btree cycle deadlocks; calculate how many blocks we really need
>> + * from that data; and only then start modifying data.  This can fail due to
>> + * ENOSPC, so we have to be able to cancel the transaction.
>> + */
>> +int
>> +xfs_trans_reserve_more(
>> +	struct xfs_trans	*tp,
>> +	uint			blocks,
>> +	uint			rtextents)
> 
> This isn't used in this patch - seems out of place here. Committed
> to the wrong patch?
> 
> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap
  2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
  2018-06-27  0:54   ` Dave Chinner
@ 2018-06-28 21:11   ` Allison Henderson
  2018-06-29 14:39     ` Darrick J. Wong
  1 sibling, 1 reply; 77+ messages in thread
From: Allison Henderson @ 2018-06-28 21:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The original rmap code assumed that there would always be at least one
> rmap in the rmapbt (the AG sb/agf/agi) and so errored out if it didn't
> find one.  This assumption isn't true for the rmapbt repair function
> (and it won't be true for realtime rmap either), so remove the check and
> just deal with the situation.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/xfs/libxfs/xfs_rmap.c |   24 ++++++++++++------------
>   1 file changed, 12 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index d4460b0d2d81..8b2a2f81d110 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -753,19 +753,19 @@ xfs_rmap_map(
>   			&have_lt);
>   	if (error)
>   		goto out_error;
> -	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> -
> -	error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
> -	if (error)
> -		goto out_error;
> -	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> -	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> -			cur->bc_private.a.agno, ltrec.rm_startblock,
> -			ltrec.rm_blockcount, ltrec.rm_owner,
> -			ltrec.rm_offset, ltrec.rm_flags);
> +	if (have_lt) {
> +		error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
> +		if (error)
> +			goto out_error;
> +		XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> +		trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> +				cur->bc_private.a.agno, ltrec.rm_startblock,
> +				ltrec.rm_blockcount, ltrec.rm_owner,
> +				ltrec.rm_offset, ltrec.rm_flags);
>   
> -	if (!xfs_rmap_is_mergeable(&ltrec, owner, flags))
> -		have_lt = 0;
> +		if (!xfs_rmap_is_mergeable(&ltrec, owner, flags))
> +			have_lt = 0;
> +	}
>   
>   	XFS_WANT_CORRUPTED_GOTO(mp,
>   		have_lt == 0 ||
> 

Alrighty, looks ok after some digging around. I'm still a little puzzled 
as to why the original code raised the assert without checking to see 
whats on the other side of the cursor?  Assuming the error condition
was supposed to be the case when the tree was empty.  In any case, it 
looks correct now.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=Q2PdVMOGp7_huNLFbP6xty0mgocZk65leUyLVRvSsSY&s=6-FSoklyIhTEtg811gAG43N9-7Z-sYFsm7zv33EadgQ&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks
  2018-06-24 19:23 ` [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks Darrick J. Wong
  2018-06-27  1:02   ` Dave Chinner
@ 2018-06-28 21:12   ` Allison Henderson
  1 sibling, 0 replies; 77+ messages in thread
From: Allison Henderson @ 2018-06-28 21:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Add a helper to decide if an inode has real or unwritten extents in the
> CoW fork.  The upcoming repair freeze functionality will have to know if
> it's safe to iput an inode -- if the inode has incore any state that
> would require a transaction to unwind during iput, we'll have to defer
> the iput.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/xfs/xfs_inode.c |   19 +++++++++++++++++++
>   fs/xfs/xfs_inode.h |    1 +
>   2 files changed, 20 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 7a96c4e0ab5c..e6859dfc29af 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3689,3 +3689,22 @@ xfs_iflush_int(
>   corrupt_out:
>   	return -EFSCORRUPTED;
>   }
> +
> +/* Decide if there are real or unwritten extents in the CoW fork. */
> +bool
> +xfs_inode_has_cow_blocks(
> +	struct xfs_inode		*ip)
> +{
> +	struct xfs_iext_cursor		icur;
> +	struct xfs_bmbt_irec		irec;
> +	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> +
> +	if (!ifp)
> +		return false;
> +
> +	for_each_xfs_iext(ifp, &icur, &irec) {
> +		if (!isnullstartblock(irec.br_startblock))
> +			return true;
> +	}
> +	return false;
> +}
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 2ed63a49e890..735d0788bfdb 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -503,5 +503,6 @@ extern struct kmem_zone	*xfs_inode_zone;
>   #define XFS_DEFAULT_COWEXTSZ_HINT 32
>   
>   bool xfs_inode_verify_forks(struct xfs_inode *ip);
> +bool xfs_inode_has_cow_blocks(struct xfs_inode *ip);
>   
>   #endif	/* __XFS_INODE_H__ */
> 

Ok, this one looks pretty straight forward.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=gITPMUadlG9a62qcCKtadzb41y_zIzqRNXMCANKITeM&s=hmiK2_87KL5oO6gl-LQ00avYqc5HcijzCgyErYAM1Cc&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks
  2018-06-24 19:23 ` [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks Darrick J. Wong
@ 2018-06-28 21:13   ` Allison Henderson
  0 siblings, 0 replies; 77+ messages in thread
From: Allison Henderson @ 2018-06-28 21:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor the part of _free_eofblocks that decides if it's really going
> to truncate post-EOF blocks into a separate helper function.  The
> upcoming repair freeze patch requires us to defer iput of an inode if
> disposing of that inode would have to start another transaction to
> unwind incore state.  No functionality changes.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/xfs/xfs_bmap_util.c |  101 ++++++++++++++++++++----------------------------
>   fs/xfs/xfs_inode.c     |   32 +++++++++++++++
>   fs/xfs/xfs_inode.h     |    1
>   3 files changed, 75 insertions(+), 59 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c94d376e4152..0f38acbb200f 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -805,78 +805,61 @@ xfs_free_eofblocks(
>   	struct xfs_inode	*ip)
>   {
>   	struct xfs_trans	*tp;
> -	int			error;
> -	xfs_fileoff_t		end_fsb;
> -	xfs_fileoff_t		last_fsb;
> -	xfs_filblks_t		map_len;
> -	int			nimaps;
> -	struct xfs_bmbt_irec	imap;
>   	struct xfs_mount	*mp = ip->i_mount;
> +	int			error;
>   
>   	/*
> -	 * Figure out if there are any blocks beyond the end
> -	 * of the file.  If not, then there is nothing to do.
> +	 * If there are blocks after the end of file, truncate the file to its
> +	 * current size to free them up.
>   	 */
> -	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> -	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
> -	if (last_fsb <= end_fsb)
> +	if (!xfs_inode_has_posteof_blocks(ip))
>   		return 0;
> -	map_len = last_fsb - end_fsb;
> -
> -	nimaps = 1;
> -	xfs_ilock(ip, XFS_ILOCK_SHARED);
> -	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
> -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>   
>   	/*
> -	 * If there are blocks after the end of file, truncate the file to its
> -	 * current size to free them up.
> +	 * Attach the dquots to the inode up front.
>   	 */
> -	if (!error && (nimaps != 0) &&
> -	    (imap.br_startblock != HOLESTARTBLOCK ||
> -	     ip->i_delayed_blks)) {
> -		/*
> -		 * Attach the dquots to the inode up front.
> -		 */
> -		error = xfs_qm_dqattach(ip);
> -		if (error)
> -			return error;
> +	error = xfs_qm_dqattach(ip);
> +	if (error)
> +		return error;
>   
> -		/* wait on dio to ensure i_size has settled */
> -		inode_dio_wait(VFS_I(ip));
> +	/* wait on dio to ensure i_size has settled */
> +	inode_dio_wait(VFS_I(ip));
>   
> -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
> -				&tp);
> -		if (error) {
> -			ASSERT(XFS_FORCED_SHUTDOWN(mp));
> -			return error;
> -		}
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +	if (error) {
> +		ASSERT(XFS_FORCED_SHUTDOWN(mp));
> +		return error;
> +	}
>   
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		xfs_trans_ijoin(tp, ip, 0);
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
>   
> -		/*
> -		 * Do not update the on-disk file size.  If we update the
> -		 * on-disk file size and then the system crashes before the
> -		 * contents of the file are flushed to disk then the files
> -		 * may be full of holes (ie NULL files bug).
> -		 */
> -		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
> -					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
> -		if (error) {
> -			/*
> -			 * If we get an error at this point we simply don't
> -			 * bother truncating the file.
> -			 */
> -			xfs_trans_cancel(tp);
> -		} else {
> -			error = xfs_trans_commit(tp);
> -			if (!error)
> -				xfs_inode_clear_eofblocks_tag(ip);
> -		}
> +	/*
> +	 * Do not update the on-disk file size.  If we update the
> +	 * on-disk file size and then the system crashes before the
> +	 * contents of the file are flushed to disk then the files
> +	 * may be full of holes (ie NULL files bug).
> +	 */
> +	error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
> +				XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
> +	if (error)
> +		goto err_cancel;
>   
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	}
> +	error = xfs_trans_commit(tp);
> +	if (error)
> +		goto out_unlock;
> +
> +	xfs_inode_clear_eofblocks_tag(ip);
> +	goto out_unlock;
> +
> +err_cancel:
> +	/*
> +	 * If we get an error at this point we simply don't
> +	 * bother truncating the file.
> +	 */
> +	xfs_trans_cancel(tp);
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   	return error;
>   }
>   
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e6859dfc29af..368ac0528727 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3708,3 +3708,35 @@ xfs_inode_has_cow_blocks(
>   	}
>   	return false;
>   }
> +
> +/*
> + * Decide if this inode have post-EOF blocks.  The caller is responsible
typo: have->has

> + * for knowing / caring about the PREALLOC/APPEND flags.
Why do the flags effect weather or not we have post-EOF blocks?

Other than minor nit picks, it looks like most of the same functionality.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> + */
> +bool
> +xfs_inode_has_posteof_blocks(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_bmbt_irec	imap;
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fileoff_t		end_fsb;
> +	xfs_fileoff_t		last_fsb;
> +	xfs_filblks_t		map_len;
> +	int			nimaps;
> +	int			error;
> +
> +	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
> +	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
> +	if (last_fsb <= end_fsb)
> +		return false;
> +	map_len = last_fsb - end_fsb;
> +
> +	nimaps = 1;
> +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> +	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
> +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +
> +	return !error && (nimaps != 0) &&
> +	       (imap.br_startblock != HOLESTARTBLOCK ||
> +	        ip->i_delayed_blks);
> +}
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 735d0788bfdb..a041fffa1b33 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -504,5 +504,6 @@ extern struct kmem_zone	*xfs_inode_zone;
>   
>   bool xfs_inode_verify_forks(struct xfs_inode *ip);
>   bool xfs_inode_has_cow_blocks(struct xfs_inode *ip);
> +bool xfs_inode_has_posteof_blocks(struct xfs_inode *ip);
>   
>   #endif	/* __XFS_INODE_H__ */
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=R-xXcRCE5SuhINVgkBSyy6kqhqF-khnVaJl7k1QF97c&s=OiUh10Yxjc51nxdlFMf79yknDmf7_kmhH2kyJkpa4AM&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-24 19:23 ` [PATCH 04/21] xfs: repair the AGF and AGFL Darrick J. Wong
  2018-06-27  2:19   ` Dave Chinner
@ 2018-06-28 21:14   ` Allison Henderson
  2018-06-28 23:21     ` Dave Chinner
  1 sibling, 1 reply; 77+ messages in thread
From: Allison Henderson @ 2018-06-28 21:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Regenerate the AGF and AGFL from the rmap data.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/xfs/scrub/agheader_repair.c |  644 ++++++++++++++++++++++++++++++++++++++++
>   fs/xfs/scrub/repair.c          |  106 ++++++-
>   fs/xfs/scrub/repair.h          |   15 +
>   fs/xfs/scrub/scrub.c           |    4
>   fs/xfs/xfs_trans.c             |   54 +++
>   fs/xfs/xfs_trans.h             |    2
>   6 files changed, 814 insertions(+), 11 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> index 117eedac53df..90e5e6cbc911 100644
> --- a/fs/xfs/scrub/agheader_repair.c
> +++ b/fs/xfs/scrub/agheader_repair.c
> @@ -17,12 +17,18 @@
>   #include "xfs_sb.h"
>   #include "xfs_inode.h"
>   #include "xfs_alloc.h"
> +#include "xfs_alloc_btree.h"
>   #include "xfs_ialloc.h"
> +#include "xfs_ialloc_btree.h"
>   #include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_refcount.h"
> +#include "xfs_refcount_btree.h"
>   #include "scrub/xfs_scrub.h"
>   #include "scrub/scrub.h"
>   #include "scrub/common.h"
>   #include "scrub/trace.h"
> +#include "scrub/repair.h"
>   
>   /* Superblock */
>   
> @@ -54,3 +60,641 @@ xfs_repair_superblock(
>   	xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1);
>   	return error;
>   }
> +
> +/* AGF */
> +
> +struct xfs_repair_agf_allocbt {
> +	struct xfs_scrub_context	*sc;
> +	xfs_agblock_t			freeblks;
> +	xfs_agblock_t			longest;
> +};
> +
> +/* Record free space shape information. */
> +STATIC int
> +xfs_repair_agf_walk_allocbt(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_alloc_rec_incore	*rec,
> +	void				*priv)
> +{
> +	struct xfs_repair_agf_allocbt	*raa = priv;
> +	int				error = 0;
> +
> +	if (xfs_scrub_should_terminate(raa->sc, &error))
> +		return error;
> +
> +	raa->freeblks += rec->ar_blockcount;
> +	if (rec->ar_blockcount > raa->longest)
> +		raa->longest = rec->ar_blockcount;
> +	return error;
> +}
> +
> +/* Does this AGFL block look sane? */
> +STATIC int
> +xfs_repair_agf_check_agfl_block(
> +	struct xfs_mount		*mp,
> +	xfs_agblock_t			agbno,
> +	void				*priv)
> +{
> +	struct xfs_scrub_context	*sc = priv;
> +
> +	if (!xfs_verify_agbno(mp, sc->sa.agno, agbno))
> +		return -EFSCORRUPTED;
> +	return 0;
> +}
> +
> +/* Information for finding AGF-rooted btrees */
> +enum {
> +	REPAIR_AGF_BNOBT = 0,
> +	REPAIR_AGF_CNTBT,
> +	REPAIR_AGF_RMAPBT,
> +	REPAIR_AGF_REFCOUNTBT,
> +	REPAIR_AGF_END,
> +	REPAIR_AGF_MAX
> +};
> +
> +static const struct xfs_repair_find_ag_btree repair_agf[] = {
> +	[REPAIR_AGF_BNOBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_AG,
> +		.buf_ops = &xfs_allocbt_buf_ops,
> +		.magic = XFS_ABTB_CRC_MAGIC,
> +	},
> +	[REPAIR_AGF_CNTBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_AG,
> +		.buf_ops = &xfs_allocbt_buf_ops,
> +		.magic = XFS_ABTC_CRC_MAGIC,
> +	},
> +	[REPAIR_AGF_RMAPBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_AG,
> +		.buf_ops = &xfs_rmapbt_buf_ops,
> +		.magic = XFS_RMAP_CRC_MAGIC,
> +	},
> +	[REPAIR_AGF_REFCOUNTBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_REFC,
> +		.buf_ops = &xfs_refcountbt_buf_ops,
> +		.magic = XFS_REFC_CRC_MAGIC,
> +	},
> +	[REPAIR_AGF_END] = {
> +		.buf_ops = NULL,
> +	},
> +};
> +
> +/*
> + * Find the btree roots.  This is /also/ a chicken and egg problem because we
> + * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
> + * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
> + * corruptions in those btrees we'll bail out.
> + */
It would help if maybe we could put the /*IN*/ or /*OUT*/ on the 
parameters here?  And maybe a blurb about their usage.  From looking at 
how their used in the memcpy, I'm guessing that agf_bp is IN and fab is 
OUT.  But it's otherwise its not really clear on how they're meant to be 
used with out going the function to see how it handles them.

> +STATIC int
> +xfs_repair_agf_find_btrees(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	struct xfs_repair_find_ag_btree	*fab,
> +	struct xfs_buf			*agfl_bp)
> +{
> +	struct xfs_agf			*old_agf = XFS_BUF_TO_AGF(agf_bp);
> +	int				error;
> +
> +	/* Go find the root data. */
> +	memcpy(fab, repair_agf, sizeof(repair_agf));
> +	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, agfl_bp);
> +	if (error)
> +		return error;
> +
> +	/* We must find the bnobt, cntbt, and rmapbt roots. */
> +	if (fab[REPAIR_AGF_BNOBT].root == NULLAGBLOCK ||
> +	    fab[REPAIR_AGF_BNOBT].height > XFS_BTREE_MAXLEVELS ||
> +	    fab[REPAIR_AGF_CNTBT].root == NULLAGBLOCK ||
> +	    fab[REPAIR_AGF_CNTBT].height > XFS_BTREE_MAXLEVELS ||
> +	    fab[REPAIR_AGF_RMAPBT].root == NULLAGBLOCK ||
> +	    fab[REPAIR_AGF_RMAPBT].height > XFS_BTREE_MAXLEVELS)
> +		return -EFSCORRUPTED;
> +
> +	/*
> +	 * We relied on the rmapbt to reconstruct the AGF.  If we get a
> +	 * different root then something's seriously wrong.
> +	 */
> +	if (fab[REPAIR_AGF_RMAPBT].root !=
> +	    be32_to_cpu(old_agf->agf_roots[XFS_BTNUM_RMAPi]))
> +		return -EFSCORRUPTED;
> +
> +	/* We must find the refcountbt root if that feature is enabled. */
> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb) &&
> +	    (fab[REPAIR_AGF_REFCOUNTBT].root == NULLAGBLOCK ||
> +	     fab[REPAIR_AGF_REFCOUNTBT].height > XFS_BTREE_MAXLEVELS))
> +		return -EFSCORRUPTED;
> +
> +	return 0;
> +}
> +
> +/* Set btree root information in an AGF. */
> +STATIC void
> +xfs_repair_agf_set_roots(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_agf			*agf,
> +	struct xfs_repair_find_ag_btree	*fab)
> +{
> +	agf->agf_roots[XFS_BTNUM_BNOi] =
> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].root);
> +	agf->agf_levels[XFS_BTNUM_BNOi] =
> +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].height);
> +
> +	agf->agf_roots[XFS_BTNUM_CNTi] =
> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].root);
> +	agf->agf_levels[XFS_BTNUM_CNTi] =
> +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].height);
> +
> +	agf->agf_roots[XFS_BTNUM_RMAPi] =
> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].root);
> +	agf->agf_levels[XFS_BTNUM_RMAPi] =
> +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].height);
> +
> +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
> +		agf->agf_refcount_root =
> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].root);
> +		agf->agf_refcount_level =
> +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].height);
> +	}
> +}
> +
> +/*
> + * Reinitialize the AGF header, making an in-core copy of the old contents so
> + * that we know which in-core state needs to be reinitialized.
> + */
> +STATIC void
> +xfs_repair_agf_init_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	struct xfs_agf			*old_agf)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> +
> +	memcpy(old_agf, agf, sizeof(*old_agf));
> +	memset(agf, 0, BBTOB(agf_bp->b_length));
> +	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
> +	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
> +	agf->agf_seqno = cpu_to_be32(sc->sa.agno);
> +	agf->agf_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
> +	agf->agf_flfirst = old_agf->agf_flfirst;
> +	agf->agf_fllast = old_agf->agf_fllast;
> +	agf->agf_flcount = old_agf->agf_flcount;
> +	if (xfs_sb_version_hascrc(&mp->m_sb))
> +		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
> +}
> +
> +/* Update the AGF btree counters by walking the btrees. */
> +STATIC int
> +xfs_repair_agf_update_btree_counters(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp)
> +{
> +	struct xfs_repair_agf_allocbt	raa = { .sc = sc };
> +	struct xfs_btree_cur		*cur = NULL;
> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_agblock_t			btreeblks;
> +	xfs_agblock_t			blocks;
> +	int				error;
> +
> +	/* Update the AGF counters from the bnobt. */
> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_BNO);
> +	error = xfs_alloc_query_all(cur, xfs_repair_agf_walk_allocbt, &raa);
> +	if (error)
> +		goto err;
> +	error = xfs_btree_count_blocks(cur, &blocks);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	btreeblks = blocks - 1;
> +	agf->agf_freeblks = cpu_to_be32(raa.freeblks);
> +	agf->agf_longest = cpu_to_be32(raa.longest);
> +
> +	/* Update the AGF counters from the cntbt. */
> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_CNT);
> +	error = xfs_btree_count_blocks(cur, &blocks);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	btreeblks += blocks - 1;
> +
> +	/* Update the AGF counters from the rmapbt. */
> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
> +	error = xfs_btree_count_blocks(cur, &blocks);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	agf->agf_rmap_blocks = cpu_to_be32(blocks);
> +	btreeblks += blocks - 1;
> +
> +	agf->agf_btreeblks = cpu_to_be32(btreeblks);
> +
> +	/* Update the AGF counters from the refcountbt. */
> +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> +		cur = xfs_refcountbt_init_cursor(mp, sc->tp, agf_bp,
> +				sc->sa.agno, NULL);
> +		error = xfs_btree_count_blocks(cur, &blocks);
> +		if (error)
> +			goto err;
> +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +		agf->agf_refcount_blocks = cpu_to_be32(blocks);
> +	}
> +
> +	return 0;
> +err:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
> +
> +/* Trigger reinitialization of the in-core data. */
> +STATIC int
> +xfs_repair_agf_reinit_incore(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_agf			*agf,
> +	const struct xfs_agf		*old_agf)
> +{
> +	struct xfs_perag		*pag;
> +
> +	/* XXX: trigger fdblocks recalculation */
> +
> +	/* Now reinitialize the in-core counters if necessary. */
> +	pag = sc->sa.pag;
> +	if (!pag->pagf_init)
> +		return 0;
> +
> +	pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
> +	pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks);
> +	pag->pagf_longest = be32_to_cpu(agf->agf_longest);
> +	pag->pagf_levels[XFS_BTNUM_BNOi] =
> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
> +	pag->pagf_levels[XFS_BTNUM_CNTi] =
> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
> +	pag->pagf_levels[XFS_BTNUM_RMAPi] =
> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
> +	pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
> +
> +	return 0;
> +}
> +
> +/* Repair the AGF. */
> +int
> +xfs_repair_agf(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_repair_find_ag_btree	fab[REPAIR_AGF_MAX];
> +	struct xfs_agf			old_agf;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*agf_bp;
> +	struct xfs_buf			*agfl_bp;
> +	struct xfs_agf			*agf;
> +	int				error;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGF_DADDR(mp)),
> +			XFS_FSS_TO_BB(mp, 1), 0, &agf_bp, NULL);
> +	if (error)
> +		return error;
> +	agf_bp->b_ops = &xfs_agf_buf_ops;
> +	agf = XFS_BUF_TO_AGF(agf_bp);
> +
> +	/*
> +	 * Load the AGFL so that we can screen out OWN_AG blocks that are on
> +	 * the AGFL now; these blocks might have once been part of the
> +	 * bno/cnt/rmap btrees but are not now.  This is a chicken and egg
> +	 * problem: the AGF is corrupt, so we have to trust the AGFL contents
> +	 * because we can't do any serious cross-referencing with any of the
> +	 * btrees rooted in the AGF.  If the AGFL contents are obviously bad
> +	 * then we'll bail out.
> +	 */
> +	error = xfs_alloc_read_agfl(mp, sc->tp, sc->sa.agno, &agfl_bp);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Spot-check the AGFL blocks; if they're obviously corrupt then
> +	 * there's nothing we can do but bail out.
> +	 */
> +	error = xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(agf_bp), agfl_bp,
> +			xfs_repair_agf_check_agfl_block, sc);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Find the AGF btree roots.  See the comment for this function for
> +	 * more information about the limitations of this repairer; this is
> +	 * also a chicken-and-egg situation.
> +	 */
> +	error = xfs_repair_agf_find_btrees(sc, agf_bp, fab, agfl_bp);
> +	if (error)
> +		return error;
> +
> +	/* Start rewriting the header and implant the btrees we found. */
> +	xfs_repair_agf_init_header(sc, agf_bp, &old_agf);
> +	xfs_repair_agf_set_roots(sc, agf, fab);
> +	error = xfs_repair_agf_update_btree_counters(sc, agf_bp);
> +	if (error)
> +		goto out_revert;
> +
> +	/* Reinitialize in-core state. */
> +	error = xfs_repair_agf_reinit_incore(sc, agf, &old_agf);
> +	if (error)
> +		goto out_revert;
> +
> +	/* Write this to disk. */
> +	xfs_trans_buf_set_type(sc->tp, agf_bp, XFS_BLFT_AGF_BUF);
> +	xfs_trans_log_buf(sc->tp, agf_bp, 0, BBTOB(agf_bp->b_length) - 1);
> +	return 0;
> +
> +out_revert:
> +	memcpy(agf, &old_agf, sizeof(old_agf));
> +	return error;
> +}
> +
> +/* AGFL */
> +
> +struct xfs_repair_agfl {
> +	struct xfs_repair_extent_list	agmeta_list;
> +	struct xfs_repair_extent_list	*freesp_list;
> +	struct xfs_scrub_context	*sc;
> +};
> +
> +/* Record all freespace information. */
So I've gathered that *_fn routines are call back functions, but it
would be nice to have a blurb or something here that describes what it's
a call back for.  Just to help make clear what it's doing in the greater
scheme of things.  Some of these will make more sense when we see where 
and how they're used.

> +STATIC int
> +xfs_repair_agfl_rmap_fn(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_rmap_irec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_repair_agfl		*ra = priv;
> +	xfs_fsblock_t			fsb;
> +	int				error = 0;
> +
> +	if (xfs_scrub_should_terminate(ra->sc, &error))
> +		return error;
> +
> +	/* Record all the OWN_AG blocks. */
> +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> +				rec->rm_startblock);
> +		error = xfs_repair_collect_btree_extent(ra->sc,
> +				ra->freesp_list, fsb, rec->rm_blockcount);
> +		if (error)
> +			return error;
> +	}
> +
> +	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
> +			xfs_repair_collect_btree_cur_blocks_in_extent_list,
> +			&ra->agmeta_list);
> +}
> +
> +/* Add a btree block to the agmeta list. */
> +STATIC int
> +xfs_repair_agfl_visit_btblock(
> +	struct xfs_btree_cur		*cur,
> +	int				level,
> +	void				*priv)
> +{
> +	struct xfs_repair_agfl		*ra = priv;
> +	struct xfs_buf			*bp;
> +	xfs_fsblock_t			fsb;
> +	int				error = 0;
> +
> +	if (xfs_scrub_should_terminate(ra->sc, &error))
> +		return error;
> +
> +	xfs_btree_get_block(cur, level, &bp);
> +	if (!bp)
> +		return 0;
> +
> +	fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> +	return xfs_repair_collect_btree_extent(ra->sc, &ra->agmeta_list,
> +			fsb, 1);
> +}
> +
> +/*
> + * Map out all the non-AGFL OWN_AG space in this AG so that we can deduce
> + * which blocks belong to the AGFL.
> + */
> +STATIC int
> +xfs_repair_agfl_find_extents(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	struct xfs_repair_extent_list	*agfl_extents,
> +	xfs_agblock_t			*flcount)
> +{
> +	struct xfs_repair_agfl		ra;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_repair_extent	*rae;
> +	int				error;
> +
> +	ra.sc = sc;
> +	ra.freesp_list = agfl_extents;
> +	xfs_repair_init_extent_list(&ra.agmeta_list);
> +
> +	/* Find all space used by the free space btrees & rmapbt. */
> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
> +	error = xfs_rmap_query_all(cur, xfs_repair_agfl_rmap_fn, &ra);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/* Find all space used by bnobt. */
> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_BNO);
> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/* Find all space used by cntbt. */
> +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> +			XFS_BTNUM_CNT);
> +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
> +	if (error)
> +		goto err;
> +
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/*
> +	 * Drop the freesp meta blocks that are in use by btrees.
> +	 * The remaining blocks /should/ be AGFL blocks.
> +	 */
> +	error = xfs_repair_subtract_extents(sc, agfl_extents, &ra.agmeta_list);
> +	xfs_repair_cancel_btree_extents(sc, &ra.agmeta_list);
> +	if (error)
> +		return error;
> +
> +	/* Calculate the new AGFL size. */
> +	*flcount = 0;
> +	for_each_xfs_repair_extent(rae, agfl_extents) {
> +		*flcount += rae->len;
> +		if (*flcount > xfs_agfl_size(mp))
> +			break;
> +	}
> +	if (*flcount > xfs_agfl_size(mp))
> +		*flcount = xfs_agfl_size(mp);
> +	return 0;
> +
> +err:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
Thx for the comments, they help!

> +
> +/* Update the AGF and reset the in-core state. */
> +STATIC int
> +xfs_repair_agfl_update_agf(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agf_bp,
> +	xfs_agblock_t			flcount)
> +{
> +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> +
> +	/* XXX: trigger fdblocks recalculation */
> +
> +	/* Update the AGF counters. */
> +	if (sc->sa.pag->pagf_init)
> +		sc->sa.pag->pagf_flcount = flcount;
> +	agf->agf_flfirst = cpu_to_be32(0);
> +	agf->agf_flcount = cpu_to_be32(flcount);
> +	agf->agf_fllast = cpu_to_be32(flcount - 1);
> +
> +	xfs_alloc_log_agf(sc->tp, agf_bp,
> +			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
> +	return 0;
> +}
> +
> +/* Write out a totally new AGFL. */
> +STATIC void
> +xfs_repair_agfl_init_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agfl_bp,
> +	struct xfs_repair_extent_list	*agfl_extents,
> +	xfs_agblock_t			flcount)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	__be32				*agfl_bno;
> +	struct xfs_repair_extent	*rae;
> +	struct xfs_repair_extent	*n;
> +	struct xfs_agfl			*agfl;
> +	xfs_agblock_t			agbno;
> +	unsigned int			fl_off;
> +
> +	/* Start rewriting the header. */
> +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> +
> +	/* Fill the AGFL with the remaining blocks. */
> +	fl_off = 0;
> +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> +	for_each_xfs_repair_extent_safe(rae, n, agfl_extents) {
> +		agbno = XFS_FSB_TO_AGBNO(mp, rae->fsbno);
> +
> +		trace_xfs_repair_agfl_insert(mp, sc->sa.agno, agbno, rae->len);
> +
> +		while (rae->len > 0 && fl_off < flcount) {
> +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> +			fl_off++;
> +			agbno++;
> +			rae->fsbno++;
> +			rae->len--;
> +		}
> +
> +		if (rae->len)
> +			break;
> +		list_del(&rae->list);
> +		kmem_free(rae);
> +	}
> +
> +	/* Write AGF and AGFL to disk. */
> +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> +}
> +
> +/* Repair the AGFL. */
> +int
> +xfs_repair_agfl(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct xfs_repair_extent_list	agfl_extents;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*agf_bp;
> +	struct xfs_buf			*agfl_bp;
> +	xfs_agblock_t			flcount;
> +	int				error;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +	xfs_repair_init_extent_list(&agfl_extents);
> +
> +	/*
> +	 * Read the AGF so that we can query the rmapbt.  We hope that there's
> +	 * nothing wrong with the AGF, but all the AG header repair functions
> +	 * have this chicken-and-egg problem.
> +	 */
> +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
> +	if (error)
> +		return error;
> +	if (!agf_bp)
> +		return -ENOMEM;
> +
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGFL_DADDR(mp)),
> +			XFS_FSS_TO_BB(mp, 1), 0, &agfl_bp, NULL);
> +	if (error)
> +		return error;
> +	agfl_bp->b_ops = &xfs_agfl_buf_ops;
> +
> +	/*
> +	 * Compute the set of old AGFL blocks by subtracting from the list of
> +	 * OWN_AG blocks the list of blocks owned by all other OWN_AG metadata
> +	 * (bnobt, cntbt, rmapbt).  These are the old AGFL blocks, so return
> +	 * that list and the number of blocks we're actually going to put back
> +	 * on the AGFL.
> +	 */
> +	error = xfs_repair_agfl_find_extents(sc, agf_bp, &agfl_extents,
> +			&flcount);
> +	if (error)
> +		goto err;
> +
> +	/*
> +	 * Update AGF and AGFL.  We reset the global free block counter when
> +	 * we adjust the AGF flcount (which can fail) so avoid updating any
> +	 * bufers until we know that part works.
> +	 */
> +	error = xfs_repair_agfl_update_agf(sc, agf_bp, flcount);
> +	if (error)
> +		goto err;
> +	xfs_repair_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
> +
> +	/*
> +	 * Ok, the AGFL should be ready to go now.  Roll the transaction so
> +	 * that we can free any AGFL overflow.
> +	 */
> +	sc->sa.agf_bp = agf_bp;
> +	sc->sa.agfl_bp = agfl_bp;
> +	error = xfs_repair_roll_ag_trans(sc);
> +	if (error)
> +		goto err;
> +
> +	/* Dump any AGFL overflow. */
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> +	return xfs_repair_reap_btree_extents(sc, &agfl_extents, &oinfo,
> +			XFS_AG_RESV_AGFL);
> +err:
> +	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 326be4e8b71e..bcdaa8df18f6 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -127,9 +127,12 @@ xfs_repair_roll_ag_trans(
>   	int				error;
>   
>   	/* Keep the AG header buffers locked so we can keep going. */
> -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
>   
>   	/* Roll the transaction. */
>   	error = xfs_trans_roll(&sc->tp);
> @@ -137,9 +140,12 @@ xfs_repair_roll_ag_trans(
>   		goto out_release;
>   
>   	/* Join AG headers to the new transaction. */
> -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
>   
>   	return 0;
>   
> @@ -149,9 +155,12 @@ xfs_repair_roll_ag_trans(
>   	 * buffers will be released during teardown on our way out
>   	 * of the kernel.
>   	 */
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
>   
>   	return error;
>   }
> @@ -408,6 +417,85 @@ xfs_repair_collect_btree_extent(
>   	return 0;
>   }
>   
> +/*
> + * Help record all btree blocks seen while iterating all records of a btree.
> + *
> + * We know that the btree query_all function starts at the left edge and walks
> + * towards the right edge of the tree.  Therefore, we know that we can walk up
> + * the btree cursor towards the root; if the pointer for a given level points
> + * to the first record/key in that block, we haven't seen this block before;
> + * and therefore we need to remember that we saw this block in the btree.
> + *
> + * So if our btree is:
> + *
> + *    4
> + *  / | \
> + * 1  2  3
> + *
> + * Pretend for this example that each leaf block has 100 btree records.  For
> + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> + * block 4.  The list is [1, 4].
> + *
> + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> + * loop.  The list remains [1, 4].
> + *
> + * For the 101st btree record, we've moved onto leaf block 2.  Now
> + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> + *
> + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> + *
> + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> + *
> + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> + *
> + * The *iter_fn can return XFS_BTREE_QUERY_RANGE_ABORT to stop, 0 to keep
> + * iterating, or the usual negative error code.
> + */
Thank you, the comment block helps a lot!

> +int
> +xfs_repair_collect_btree_cur_blocks(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_btree_cur		*cur,
> +	int				(*iter_fn)(struct xfs_scrub_context *sc,
> +						   xfs_fsblock_t fsbno,
> +						   xfs_fsblock_t len,
> +						   void *priv),
> +	void				*priv)
> +{
> +	struct xfs_buf			*bp;
> +	xfs_fsblock_t			fsb;
> +	int				i;
> +	int				error;
> +
> +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> +		xfs_btree_get_block(cur, i, &bp);
> +		if (!bp)
> +			continue;
> +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> +		error = iter_fn(sc, fsb, 1, priv);
> +		if (error)
> +			return error;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Simple adapter to connect xfs_repair_collect_btree_extent to
> + * xfs_repair_collect_btree_cur_blocks.
> + */
> +int
> +xfs_repair_collect_btree_cur_blocks_in_extent_list(
> +	struct xfs_scrub_context	*sc,
> +	xfs_fsblock_t			fsbno,
> +	xfs_fsblock_t			len,
> +	void				*priv)
> +{
> +	return xfs_repair_collect_btree_extent(sc, priv, fsbno, len);
> +}
> +
>   /*
>    * An error happened during the rebuild so the transaction will be cancelled.
>    * The fs will shut down, and the administrator has to unmount and run repair.
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index ef47826b6725..f2af5923aa75 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -48,9 +48,20 @@ xfs_repair_init_extent_list(
>   
>   #define for_each_xfs_repair_extent_safe(rbe, n, exlist) \
>   	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
> +#define for_each_xfs_repair_extent(rbe, exlist) \
> +	list_for_each_entry((rbe), &(exlist)->list, list)
>   int xfs_repair_collect_btree_extent(struct xfs_scrub_context *sc,
>   		struct xfs_repair_extent_list *btlist, xfs_fsblock_t fsbno,
>   		xfs_extlen_t len);
> +int xfs_repair_collect_btree_cur_blocks(struct xfs_scrub_context *sc,
> +		struct xfs_btree_cur *cur,
> +		int (*iter_fn)(struct xfs_scrub_context *sc,
> +			       xfs_fsblock_t fsbno, xfs_fsblock_t len,
> +			       void *priv),
> +		void *priv);
> +int xfs_repair_collect_btree_cur_blocks_in_extent_list(
> +		struct xfs_scrub_context *sc, xfs_fsblock_t fsbno,
> +		xfs_fsblock_t len, void *priv);
>   void xfs_repair_cancel_btree_extents(struct xfs_scrub_context *sc,
>   		struct xfs_repair_extent_list *btlist);
>   int xfs_repair_subtract_extents(struct xfs_scrub_context *sc,
> @@ -89,6 +100,8 @@ int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
>   
>   int xfs_repair_probe(struct xfs_scrub_context *sc);
>   int xfs_repair_superblock(struct xfs_scrub_context *sc);
> +int xfs_repair_agf(struct xfs_scrub_context *sc);
> +int xfs_repair_agfl(struct xfs_scrub_context *sc);
>   
>   #else
>   
> @@ -112,6 +125,8 @@ xfs_repair_calc_ag_resblks(
>   
>   #define xfs_repair_probe		xfs_repair_notsupported
>   #define xfs_repair_superblock		xfs_repair_notsupported
> +#define xfs_repair_agf			xfs_repair_notsupported
> +#define xfs_repair_agfl			xfs_repair_notsupported
>   
>   #endif /* CONFIG_XFS_ONLINE_REPAIR */
>   
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 58ae76b3a421..8e11c3c699fb 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -208,13 +208,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>   		.type	= ST_PERAG,
>   		.setup	= xfs_scrub_setup_fs,
>   		.scrub	= xfs_scrub_agf,
> -		.repair	= xfs_repair_notsupported,
> +		.repair	= xfs_repair_agf,
>   	},
>   	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
>   		.type	= ST_PERAG,
>   		.setup	= xfs_scrub_setup_fs,
>   		.scrub	= xfs_scrub_agfl,
> -		.repair	= xfs_repair_notsupported,
> +		.repair	= xfs_repair_agfl,
>   	},
>   	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
>   		.type	= ST_PERAG,
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 524f543c5b82..c08785cf83a9 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -126,6 +126,60 @@ xfs_trans_dup(
>   	return ntp;
>   }
>   
> +/*
> + * Try to reserve more blocks for a transaction.  The single use case we
> + * support is for online repair -- use a transaction to gather data without
> + * fear of btree cycle deadlocks; calculate how many blocks we really need
> + * from that data; and only then start modifying data.  This can fail due to
> + * ENOSPC, so we have to be able to cancel the transaction.
> + */
> +int
> +xfs_trans_reserve_more(
> +	struct xfs_trans	*tp,
> +	uint			blocks,
> +	uint			rtextents)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
> +	int			error = 0;
> +
> +	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> +
> +	/*
> +	 * Attempt to reserve the needed disk blocks by decrementing
> +	 * the number needed from the number available.  This will
> +	 * fail if the count would go below zero.
> +	 */
> +	if (blocks > 0) {
> +		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
> +		if (error)
> +			return -ENOSPC;
> +		tp->t_blk_res += blocks;
> +	}
> +
> +	/*
> +	 * Attempt to reserve the needed realtime extents by decrementing
> +	 * the number needed from the number available.  This will
> +	 * fail if the count would go below zero.
> +	 */
> +	if (rtextents > 0) {
> +		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
> +		if (error) {
> +			error = -ENOSPC;
> +			goto out_blocks;
> +		}
> +		tp->t_rtx_res += rtextents;
> +	}
> +
> +	return 0;
> +out_blocks:
> +	if (blocks > 0) {
> +		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
> +		tp->t_blk_res -= blocks;
> +	}
> +	return error;
> +}
> +
>   /*
>    * This is called to reserve free disk blocks and log space for the
>    * given transaction.  This must be done before allocating any resources
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 6526314f0b8f..bdbd3d5fd7b0 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -153,6 +153,8 @@ typedef struct xfs_trans {
>   int		xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
>   			uint blocks, uint rtextents, uint flags,
>   			struct xfs_trans **tpp);
> +int		xfs_trans_reserve_more(struct xfs_trans *tp, uint blocks,
> +			uint rtextents);
>   int		xfs_trans_alloc_empty(struct xfs_mount *mp,
>   			struct xfs_trans **tpp);
>   void		xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);
> 

Ok, so it definetly took some digging to understand how it all is meant 
to come together, but I think I understand the overall idea.  It is a 
pretty big patch, if it's possible to divide the agf and  agfl routines 
into separate patches, that may be helpful too.  It sounds like from the 
other commenatry there may still be yet another revision coming, so I'll 
look for it.  Thanks!

Allison

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=xFHkNjPgIq7q-Z-WsIORKsznsj4hd7ccICbZU9JUcnY&s=b1AmGsYafE0NhNDJSrBlzO8O4iogYj6CxVDp_P5q5ts&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 05/21] xfs: repair the AGI
  2018-06-24 19:24 ` [PATCH 05/21] xfs: repair the AGI Darrick J. Wong
  2018-06-27  2:22   ` Dave Chinner
@ 2018-06-28 21:15   ` Allison Henderson
  1 sibling, 0 replies; 77+ messages in thread
From: Allison Henderson @ 2018-06-28 21:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/24/2018 12:24 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the AGI header items with some help from the rmapbt.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/xfs/scrub/agheader_repair.c |  211 ++++++++++++++++++++++++++++++++++++++++
>   fs/xfs/scrub/repair.h          |    2
>   fs/xfs/scrub/scrub.c           |    2
>   3 files changed, 214 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> index 90e5e6cbc911..61e0134f6f9f 100644
> --- a/fs/xfs/scrub/agheader_repair.c
> +++ b/fs/xfs/scrub/agheader_repair.c
> @@ -698,3 +698,214 @@ xfs_repair_agfl(
>   	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
>   	return error;
>   }
> +
> +/* AGI */
> +
> +enum {
> +	REPAIR_AGI_INOBT = 0,
> +	REPAIR_AGI_FINOBT,
> +	REPAIR_AGI_END,
> +	REPAIR_AGI_MAX
> +};
> +
> +static const struct xfs_repair_find_ag_btree repair_agi[] = {
> +	[REPAIR_AGI_INOBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_INOBT,
> +		.buf_ops = &xfs_inobt_buf_ops,
> +		.magic = XFS_IBT_CRC_MAGIC,
> +	},
> +	[REPAIR_AGI_FINOBT] = {
> +		.rmap_owner = XFS_RMAP_OWN_INOBT,
> +		.buf_ops = &xfs_inobt_buf_ops,
> +		.magic = XFS_FIBT_CRC_MAGIC,
> +	},
> +	[REPAIR_AGI_END] = {
> +		.buf_ops = NULL
> +	},
> +};
> +
> +/* Find the inode btree roots from the rmap data. */
> +STATIC int
> +xfs_repair_agi_find_btrees(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_repair_find_ag_btree	*fab)
> +{
> +	struct xfs_buf			*agf_bp;
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	memcpy(fab, repair_agi, sizeof(repair_agi));
> +
> +	/* Read the AGF. */
> +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
> +	if (error)
> +		return error;
> +	if (!agf_bp)
> +		return -ENOMEM;
> +
> +	/* Find the btree roots. */
> +	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, NULL);
> +	if (error)
> +		return error;
> +
> +	/* We must find the inobt root. */
> +	if (fab[REPAIR_AGI_INOBT].root == NULLAGBLOCK ||
> +	    fab[REPAIR_AGI_INOBT].height > XFS_BTREE_MAXLEVELS)
> +		return -EFSCORRUPTED;
> +
> +	/* We must find the finobt root if that feature is enabled. */
> +	if (xfs_sb_version_hasfinobt(&mp->m_sb) &&
> +	    (fab[REPAIR_AGI_FINOBT].root == NULLAGBLOCK ||
> +	     fab[REPAIR_AGI_FINOBT].height > XFS_BTREE_MAXLEVELS))
> +		return -EFSCORRUPTED;
> +
> +	return 0;
> +}
> +
> +/*
> + * Reinitialize the AGI header, making an in-core copy of the old contents so
> + * that we know which in-core state needs to be reinitialized.
> + */
> +STATIC void
> +xfs_repair_agi_init_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agi_bp,
> +	struct xfs_agi			*old_agi)
> +{
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agi_bp);
> +	struct xfs_mount		*mp = sc->mp;
> +
> +	memcpy(old_agi, agi, sizeof(*old_agi));
> +	memset(agi, 0, BBTOB(agi_bp->b_length));
> +	agi->agi_magicnum = cpu_to_be32(XFS_AGI_MAGIC);
> +	agi->agi_versionnum = cpu_to_be32(XFS_AGI_VERSION);
> +	agi->agi_seqno = cpu_to_be32(sc->sa.agno);
> +	agi->agi_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
> +	agi->agi_newino = cpu_to_be32(NULLAGINO);
> +	agi->agi_dirino = cpu_to_be32(NULLAGINO);
> +	if (xfs_sb_version_hascrc(&mp->m_sb))
> +		uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
> +
> +	/* We don't know how to fix the unlinked list yet. */
> +	memcpy(&agi->agi_unlinked, &old_agi->agi_unlinked,
> +			sizeof(agi->agi_unlinked));
> +}
> +
> +/* Set btree root information in an AGI. */
> +STATIC void
> +xfs_repair_agi_set_roots(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_agi			*agi,
> +	struct xfs_repair_find_ag_btree	*fab)
> +{
> +	agi->agi_root = cpu_to_be32(fab[REPAIR_AGI_INOBT].root);
> +	agi->agi_level = cpu_to_be32(fab[REPAIR_AGI_INOBT].height);
> +
> +	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb)) {
> +		agi->agi_free_root = cpu_to_be32(fab[REPAIR_AGI_FINOBT].root);
> +		agi->agi_free_level =
> +				cpu_to_be32(fab[REPAIR_AGI_FINOBT].height);
> +	}
> +}
> +
> +/* Update the AGI counters. */
> +STATIC int
> +xfs_repair_agi_update_btree_counters(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*agi_bp)
> +{
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_agi			*agi = XFS_BUF_TO_AGI(agi_bp);
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_agino_t			count;
> +	xfs_agino_t			freecount;
> +	int				error;
> +
> +	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, sc->sa.agno,
> +			XFS_BTNUM_INO);
> +	error = xfs_ialloc_count_inodes(cur, &count, &freecount);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	agi->agi_count = cpu_to_be32(count);
> +	agi->agi_freecount = cpu_to_be32(freecount);
> +	return 0;
> +err:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
> +
> +/* Trigger reinitialization of the in-core data. */
> +STATIC int
> +xfs_repair_agi_reinit_incore(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_agi			*agi,
> +	const struct xfs_agi		*old_agi)
> +{
> +	struct xfs_perag		*pag;
> +
> +	/* XXX: trigger inode count recalculation */
> +
> +	/* Now reinitialize the in-core counters if necessary. */
> +	pag = sc->sa.pag;
> +	if (!pag->pagf_init)
> +		return 0;
> +
> +	sc->sa.pag->pagi_count = be32_to_cpu(agi->agi_count);
> +	sc->sa.pag->pagi_freecount = be32_to_cpu(agi->agi_freecount);
> +
> +	return 0;
> +}
> +
> +/* Repair the AGI. */
> +int
> +xfs_repair_agi(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_repair_find_ag_btree	fab[REPAIR_AGI_MAX];
> +	struct xfs_agi			old_agi;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*agi_bp;
> +	struct xfs_agi			*agi;
> +	int				error;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGI_DADDR(mp)),
> +			XFS_FSS_TO_BB(mp, 1), 0, &agi_bp, NULL);
> +	if (error)
> +		return error;
> +	agi_bp->b_ops = &xfs_agi_buf_ops;
> +	agi = XFS_BUF_TO_AGI(agi_bp);
> +
> +	/* Find the AGI btree roots. */
> +	error = xfs_repair_agi_find_btrees(sc, fab);
> +	if (error)
> +		return error;
> +
> +	/* Start rewriting the header and implant the btrees we found. */
> +	xfs_repair_agi_init_header(sc, agi_bp, &old_agi);
> +	xfs_repair_agi_set_roots(sc, agi, fab);
> +	error = xfs_repair_agi_update_btree_counters(sc, agi_bp);
> +	if (error)
> +		goto out_revert;
> +
> +	/* Reinitialize in-core state. */
> +	error = xfs_repair_agi_reinit_incore(sc, agi, &old_agi);
> +	if (error)
> +		goto out_revert;
> +
> +	/* Write this to disk. */
> +	xfs_trans_buf_set_type(sc->tp, agi_bp, XFS_BLFT_AGI_BUF);
> +	xfs_trans_log_buf(sc->tp, agi_bp, 0, BBTOB(agi_bp->b_length) - 1);
> +	return error;
> +
> +out_revert:
> +	memcpy(agi, &old_agi, sizeof(old_agi));
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index f2af5923aa75..d541c1586d0a 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -102,6 +102,7 @@ int xfs_repair_probe(struct xfs_scrub_context *sc);
>   int xfs_repair_superblock(struct xfs_scrub_context *sc);
>   int xfs_repair_agf(struct xfs_scrub_context *sc);
>   int xfs_repair_agfl(struct xfs_scrub_context *sc);
> +int xfs_repair_agi(struct xfs_scrub_context *sc);
>   
>   #else
>   
> @@ -127,6 +128,7 @@ xfs_repair_calc_ag_resblks(
>   #define xfs_repair_superblock		xfs_repair_notsupported
>   #define xfs_repair_agf			xfs_repair_notsupported
>   #define xfs_repair_agfl			xfs_repair_notsupported
> +#define xfs_repair_agi			xfs_repair_notsupported
>   
>   #endif /* CONFIG_XFS_ONLINE_REPAIR */
>   
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 8e11c3c699fb..0f036aab2551 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -220,7 +220,7 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>   		.type	= ST_PERAG,
>   		.setup	= xfs_scrub_setup_fs,
>   		.scrub	= xfs_scrub_agi,
> -		.repair	= xfs_repair_notsupported,
> +		.repair	= xfs_repair_agi,
>   	},
>   	[XFS_SCRUB_TYPE_BNOBT] = {	/* bnobt */
>   		.type	= ST_PERAG,
> 

It looks ok in terms of aligning with the infrastructure in the last 
patch. It does look pretty similar to the xfs_repair_ag* routines, so if 
there's common code that can be factored out, that would be great. 
Realistically though in studying what that solution might look like, I 
think the AGF/AGFL/AGI code paths have just enough differences such that 
trying to push it all through a common code path may end up generating a 
lot of switch like statements, so it may not be worth it.  Just a 
suggestion to consider.

Allison

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=vvDQam1-gFYigwUy0mxQkiHsCjC2HRjT0eB_fDn9esQ&s=H326jzffjTika3ys5iwsQGiICINSSbtbj3LaL2VbjJY&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-28 21:14   ` Allison Henderson
@ 2018-06-28 23:21     ` Dave Chinner
  2018-06-29  1:35       ` Allison Henderson
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-06-28 23:21 UTC (permalink / raw)
  To: Allison Henderson; +Cc: Darrick J. Wong, linux-xfs

On Thu, Jun 28, 2018 at 02:14:51PM -0700, Allison Henderson wrote:
> On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
> >+static const struct xfs_repair_find_ag_btree repair_agf[] = {
> >+	[REPAIR_AGF_BNOBT] = {
> >+		.rmap_owner = XFS_RMAP_OWN_AG,
> >+		.buf_ops = &xfs_allocbt_buf_ops,
> >+		.magic = XFS_ABTB_CRC_MAGIC,
> >+	},
> >+	[REPAIR_AGF_CNTBT] = {
> >+		.rmap_owner = XFS_RMAP_OWN_AG,
> >+		.buf_ops = &xfs_allocbt_buf_ops,
> >+		.magic = XFS_ABTC_CRC_MAGIC,
> >+	},
> >+	[REPAIR_AGF_RMAPBT] = {
> >+		.rmap_owner = XFS_RMAP_OWN_AG,
> >+		.buf_ops = &xfs_rmapbt_buf_ops,
> >+		.magic = XFS_RMAP_CRC_MAGIC,
> >+	},
> >+	[REPAIR_AGF_REFCOUNTBT] = {
> >+		.rmap_owner = XFS_RMAP_OWN_REFC,
> >+		.buf_ops = &xfs_refcountbt_buf_ops,
> >+		.magic = XFS_REFC_CRC_MAGIC,
> >+	},
> >+	[REPAIR_AGF_END] = {
> >+		.buf_ops = NULL,
> >+	},
> >+};
> >+
> >+/*
> >+ * Find the btree roots.  This is /also/ a chicken and egg problem because we
> >+ * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
> >+ * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
> >+ * corruptions in those btrees we'll bail out.
> >+ */
> It would help if maybe we could put the /*IN*/ or /*OUT*/ on the
> parameters here?  And maybe a blurb about their usage.  From looking
> at how their used in the memcpy, I'm guessing that agf_bp is IN and
> fab is OUT.  But it's otherwise its not really clear on how they're
> meant to be used with out going the function to see how it handles
> them.

IMO, that's what kerneldoc format comments are for. I'd much prefer
we use kerneldoc format than go back to the bad old terse 3-4 word
post-variable declaration comments that we used to have.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running
  2018-06-24 19:24 ` [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running Darrick J. Wong
@ 2018-06-28 23:37   ` Dave Chinner
  2018-06-29 14:49     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-06-28 23:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:20PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Destroying an incore inode sometimes requires some work to be done on
> the inode.  For example, post-EOF blocks on a non-PREALLOC inode are
> trimmed, and copy-on-write staging extents are freed.  This work is done
> in separate transactions, which is bad for scrub and repair because (a)
> we already have a transaction and can't nest them, and (b) if we've
> frozen the filesystem for scrub/repair work, that (regular) transaction
> allocation will block on the freeze.
> 
> Therefore, if we detect that work has to be done to destroy the incore
> inode, we'll just hang on to the reference until after the scrub is
> finished.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Darrick, I'll just repeat what we discussed on #xfs here so we have
in it the archive and everyone else knows why this is probably going
to be done differently.

I think we should move deferred inode inactivation processing into
the background reclaim radix tree walker rather than introduce a
special new "don't iput this inode yet" state. We're really only
trying to prevent the transactions that xfs_inactive() may run
throught iput() when the filesystem is frozen, and we already stop
background reclaim processing when the fs is frozen.

I've always intended that xfs_fs_destroy_inode() basically becomes a
no-op that just queues the inode for final inactivation, freeing and
reclaim - right now it ony does the reclaim work in the background.
I first proposed this back in ~2008 here:

http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink

At this point, it really only requires a new inode flag to indicate
that it has an inactivation pending - we set that if xfs_inactive
needs to do work before the inode can be reclaimed, and have a
separate per-ag work queue that walks the inode radix tree finding
reclaimable inodes that have the NEED_INACTIVATION inode flag set.
This way background reclaim doesn't get stuck on them.

This has benefits for many operations e.g. bulk processing of
inode inactivation and freeing either concurrently or after rm -rf
rather than at unlink syscall exit, VFS inode cache shrinker never
blocks on inactivation needing to run transactions, etc.

It also allows us to turn off inactivation on a per-AG basis,
meaning that when we are rebuilding an AG structure in repair (e.g.
the rmap btree) we can turn off inode inactivation and reclaim for
that AG rather than needing to freeze the entire filesystem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-28 23:21     ` Dave Chinner
@ 2018-06-29  1:35       ` Allison Henderson
  2018-06-29 14:55         ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Allison Henderson @ 2018-06-29  1:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On 06/28/2018 04:21 PM, Dave Chinner wrote:
> On Thu, Jun 28, 2018 at 02:14:51PM -0700, Allison Henderson wrote:
>> On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
>>> +static const struct xfs_repair_find_ag_btree repair_agf[] = {
>>> +	[REPAIR_AGF_BNOBT] = {
>>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>>> +		.buf_ops = &xfs_allocbt_buf_ops,
>>> +		.magic = XFS_ABTB_CRC_MAGIC,
>>> +	},
>>> +	[REPAIR_AGF_CNTBT] = {
>>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>>> +		.buf_ops = &xfs_allocbt_buf_ops,
>>> +		.magic = XFS_ABTC_CRC_MAGIC,
>>> +	},
>>> +	[REPAIR_AGF_RMAPBT] = {
>>> +		.rmap_owner = XFS_RMAP_OWN_AG,
>>> +		.buf_ops = &xfs_rmapbt_buf_ops,
>>> +		.magic = XFS_RMAP_CRC_MAGIC,
>>> +	},
>>> +	[REPAIR_AGF_REFCOUNTBT] = {
>>> +		.rmap_owner = XFS_RMAP_OWN_REFC,
>>> +		.buf_ops = &xfs_refcountbt_buf_ops,
>>> +		.magic = XFS_REFC_CRC_MAGIC,
>>> +	},
>>> +	[REPAIR_AGF_END] = {
>>> +		.buf_ops = NULL,
>>> +	},
>>> +};
>>> +
>>> +/*
>>> + * Find the btree roots.  This is /also/ a chicken and egg problem because we
>>> + * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
>>> + * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
>>> + * corruptions in those btrees we'll bail out.
>>> + */
>> It would help if maybe we could put the /*IN*/ or /*OUT*/ on the
>> parameters here?  And maybe a blurb about their usage.  From looking
>> at how their used in the memcpy, I'm guessing that agf_bp is IN and
>> fab is OUT.  But it's otherwise its not really clear on how they're
>> meant to be used with out going the function to see how it handles
>> them.
> 
> IMO, that's what kerneldoc format comments are for. I'd much prefer
> we use kerneldoc format than go back to the bad old terse 3-4 word
> post-variable declaration comments that we used to have.
> 
> Cheers,
> 
> Dave.
> 

Sure, that sounds like it would be fine too :-)

Allison

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes
  2018-06-24 19:25 ` [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
@ 2018-06-29  2:52   ` Dave Chinner
  0 siblings, 0 replies; 77+ messages in thread
From: Dave Chinner @ 2018-06-29  2:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:25:29PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> If an attr block indicates that it could use compaction, set the preen
> flag to have the attr fork rebuilt, since the attr fork rebuilder can
> take care of that for us.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Oh, that's an easy one :P

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap
  2018-06-28 21:11   ` Allison Henderson
@ 2018-06-29 14:39     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-29 14:39 UTC (permalink / raw)
  To: Allison Henderson; +Cc: linux-xfs

On Thu, Jun 28, 2018 at 02:11:38PM -0700, Allison Henderson wrote:
> On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > The original rmap code assumed that there would always be at least one
> > rmap in the rmapbt (the AG sb/agf/agi) and so errored out if it didn't
> > find one.  This assumption isn't true for the rmapbt repair function
> > (and it won't be true for realtime rmap either), so remove the check and
> > just deal with the situation.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >   fs/xfs/libxfs/xfs_rmap.c |   24 ++++++++++++------------
> >   1 file changed, 12 insertions(+), 12 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index d4460b0d2d81..8b2a2f81d110 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -753,19 +753,19 @@ xfs_rmap_map(
> >   			&have_lt);
> >   	if (error)
> >   		goto out_error;
> > -	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> > -
> > -	error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
> > -	if (error)
> > -		goto out_error;
> > -	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> > -	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> > -			cur->bc_private.a.agno, ltrec.rm_startblock,
> > -			ltrec.rm_blockcount, ltrec.rm_owner,
> > -			ltrec.rm_offset, ltrec.rm_flags);
> > +	if (have_lt) {
> > +		error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
> > +		if (error)
> > +			goto out_error;
> > +		XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> > +		trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> > +				cur->bc_private.a.agno, ltrec.rm_startblock,
> > +				ltrec.rm_blockcount, ltrec.rm_owner,
> > +				ltrec.rm_offset, ltrec.rm_flags);
> > -	if (!xfs_rmap_is_mergeable(&ltrec, owner, flags))
> > -		have_lt = 0;
> > +		if (!xfs_rmap_is_mergeable(&ltrec, owner, flags))
> > +			have_lt = 0;
> > +	}
> >   	XFS_WANT_CORRUPTED_GOTO(mp,
> >   		have_lt == 0 ||
> > 
> 
> Alrighty, looks ok after some digging around. I'm still a little puzzled as
> to why the original code raised the assert without checking to see whats on
> the other side of the cursor?  Assuming the error condition
> was supposed to be the case when the tree was empty.  In any case, it looks
> correct now.

At the time (~2014?) I don't think either Dave or I were thinking about
rmapbt being extended into the realtime device, so we thought that
assumption was a reasonable one to make.  That was, of course, long
before I got far enough along in designing online check to realize that
"hey, maybe we should be able to rebuild things from scratch too"... :)

Anyway, thank you both for the review.

--D

> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=Q2PdVMOGp7_huNLFbP6xty0mgocZk65leUyLVRvSsSY&s=6-FSoklyIhTEtg811gAG43N9-7Z-sYFsm7zv33EadgQ&e=
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running
  2018-06-28 23:37   ` Dave Chinner
@ 2018-06-29 14:49     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-29 14:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jun 29, 2018 at 09:37:21AM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Destroying an incore inode sometimes requires some work to be done on
> > the inode.  For example, post-EOF blocks on a non-PREALLOC inode are
> > trimmed, and copy-on-write staging extents are freed.  This work is done
> > in separate transactions, which is bad for scrub and repair because (a)
> > we already have a transaction and can't nest them, and (b) if we've
> > frozen the filesystem for scrub/repair work, that (regular) transaction
> > allocation will block on the freeze.
> > 
> > Therefore, if we detect that work has to be done to destroy the incore
> > inode, we'll just hang on to the reference until after the scrub is
> > finished.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Darrick, I'll just repeat what we discussed on #xfs here so we have
> in it the archive and everyone else knows why this is probably going
> to be done differently.
> 
> I think we should move deferred inode inactivation processing into
> the background reclaim radix tree walker rather than introduce a
> special new "don't iput this inode yet" state. We're really only
> trying to prevent the transactions that xfs_inactive() may run
> throught iput() when the filesystem is frozen, and we already stop
> background reclaim processing when the fs is frozen.
> 
> I've always intended that xfs_fs_destroy_inode() basically becomes a
> no-op that just queues the inode for final inactivation, freeing and
> reclaim - right now it ony does the reclaim work in the background.
> I first proposed this back in ~2008 here:
> 
> http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink
> 
> At this point, it really only requires a new inode flag to indicate
> that it has an inactivation pending - we set that if xfs_inactive
> needs to do work before the inode can be reclaimed, and have a
> separate per-ag work queue that walks the inode radix tree finding
> reclaimable inodes that have the NEED_INACTIVATION inode flag set.
> This way background reclaim doesn't get stuck on them.
> 
> This has benefits for many operations e.g. bulk processing of
> inode inactivation and freeing either concurrently or after rm -rf
> rather than at unlink syscall exit, VFS inode cache shrinker never
> blocks on inactivation needing to run transactions, etc.
> 
> It also allows us to turn off inactivation on a per-AG basis,
> meaning that when we are rebuilding an AG structure in repair (e.g.
> the rmap btree) we can turn off inode inactivation and reclaim for
> that AG rather than needing to freeze the entire filesystem....

So although I've been off playing a JavaScript monkey this week, I should
note that the past few months I've also been slowly combing through all
the past online repair fuzz test output to see what's still majorly
broken.  I've noticed that the bmbt fuzzers have a particular failure
pattern that leads to shutdown, which is:

1) Fuzz a bmbt.br_blockcount value to a large enough value that we now
have a giant post-eof extent.

2) Mount filesystem.

3) Run xfs_scrub, which loads said inode, checks the bad bmbt, and tells
userspace it's broken...

4) ...and releases the inode.

5) Memory reclaim or someone comes along and calls xfs_inactive, which
says "Hey, nice post-EOF extent, let's trim that off!"  The extent free
code then freaks out "ZOMG, that extent is already free!"

6) Bam, filesystem shuts down.

7) xfs_scrub retries the bmbt scrub, but this time with IFLAG_REPAIR
set, but by now the fs has already gone down, and sadness.

I've had a thought lurking around in my head for a while that perhaps we
should have a second SKIP_INACTIVATION iflag that indicates that the
inode is corrupt and we should skip post-eof inactivation to avoid fs
shutdowns.  We'd still have to take the risk of cleaning out the cow
fork (because that metadata are never persisted) but we could at least
avoid a shutdown.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-29  1:35       ` Allison Henderson
@ 2018-06-29 14:55         ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-29 14:55 UTC (permalink / raw)
  To: Allison Henderson; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 28, 2018 at 06:35:06PM -0700, Allison Henderson wrote:
> On 06/28/2018 04:21 PM, Dave Chinner wrote:
> > On Thu, Jun 28, 2018 at 02:14:51PM -0700, Allison Henderson wrote:
> > > On 06/24/2018 12:23 PM, Darrick J. Wong wrote:
> > > > +static const struct xfs_repair_find_ag_btree repair_agf[] = {
> > > > +	[REPAIR_AGF_BNOBT] = {
> > > > +		.rmap_owner = XFS_RMAP_OWN_AG,
> > > > +		.buf_ops = &xfs_allocbt_buf_ops,
> > > > +		.magic = XFS_ABTB_CRC_MAGIC,
> > > > +	},
> > > > +	[REPAIR_AGF_CNTBT] = {
> > > > +		.rmap_owner = XFS_RMAP_OWN_AG,
> > > > +		.buf_ops = &xfs_allocbt_buf_ops,
> > > > +		.magic = XFS_ABTC_CRC_MAGIC,
> > > > +	},
> > > > +	[REPAIR_AGF_RMAPBT] = {
> > > > +		.rmap_owner = XFS_RMAP_OWN_AG,
> > > > +		.buf_ops = &xfs_rmapbt_buf_ops,
> > > > +		.magic = XFS_RMAP_CRC_MAGIC,
> > > > +	},
> > > > +	[REPAIR_AGF_REFCOUNTBT] = {
> > > > +		.rmap_owner = XFS_RMAP_OWN_REFC,
> > > > +		.buf_ops = &xfs_refcountbt_buf_ops,
> > > > +		.magic = XFS_REFC_CRC_MAGIC,
> > > > +	},
> > > > +	[REPAIR_AGF_END] = {
> > > > +		.buf_ops = NULL,
> > > > +	},
> > > > +};
> > > > +
> > > > +/*
> > > > + * Find the btree roots.  This is /also/ a chicken and egg problem because we
> > > > + * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
> > > > + * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
> > > > + * corruptions in those btrees we'll bail out.
> > > > + */
> > > It would help if maybe we could put the /*IN*/ or /*OUT*/ on the
> > > parameters here?  And maybe a blurb about their usage.  From looking
> > > at how their used in the memcpy, I'm guessing that agf_bp is IN and
> > > fab is OUT.  But it's otherwise its not really clear on how they're
> > > meant to be used with out going the function to see how it handles
> > > them.
> > 
> > IMO, that's what kerneldoc format comments are for. I'd much prefer
> > we use kerneldoc format than go back to the bad old terse 3-4 word
> > post-variable declaration comments that we used to have.
> > 

<nod> I'll straighten this mess out.  The pattern underneath this is
that we partially fill out the structure with the characteristics of the
block we want the code to help us look for, and then the code fills in
the rest (the block number and btree height) of the structure as it
rummages around the AG.

I thought that having a static template that could be memcpy'd into the
stack variable was a useful thing, but seeing as it just complicates
matters I'll just move this template to the stack variable's definition.
At the time I was planning for more than one user of each template, but
that never came to fruition so it's better to simplify the data
structure lifetime and initialization semantics.

--D

> > Cheers,
> > 
> > Dave.
> > 
> 
> Sure, that sounds like it would be fine too :-)
> 
> Allison
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-28 17:25     ` Allison Henderson
@ 2018-06-29 15:08       ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-29 15:08 UTC (permalink / raw)
  To: Allison Henderson; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 28, 2018 at 10:25:22AM -0700, Allison Henderson wrote:
> 
> On 06/26/2018 07:19 PM, Dave Chinner wrote:
> > On Sun, Jun 24, 2018 at 12:23:54PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Regenerate the AGF and AGFL from the rmap data.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > [...]
> > 
> > > +/* Information for finding AGF-rooted btrees */
> > > +enum {
> > > +	REPAIR_AGF_BNOBT = 0,
> > > +	REPAIR_AGF_CNTBT,
> > > +	REPAIR_AGF_RMAPBT,
> > > +	REPAIR_AGF_REFCOUNTBT,
> > > +	REPAIR_AGF_END,
> > > +	REPAIR_AGF_MAX
> > > +};
> > 
> > Why can't you just use XFS_BTNUM_* for these btree type descriptors?
> Well, I know Darrick hasn't responded yet, but I actually have seen other
> projects intentionally redefine scopes like this (even if its repetitive).
> The reason being for example, to help prevent people from mistakenly
> indexing an element of the below array that may not be defined since
> XFS_BTNUM_* defines more types than are being used here. (and its easy to
> overlook because by belonging to the same name space, it doest look out of
> place)  So basically in redefining only the types meant to be used, we may
> help people to avoid mistakenly mishandling it.
> 
> I've also seen such practices generate a lot of extra code too.  Both
> solutions will work.  But in response to your comment: it looks to me like a
> question of cutting down code vs using more a defensive coding style.

I was trying to avoid having repair_agf[] be a sparse array (which it
would be if I used BTNUM) and avoid adding a btnum number to struct
xfs_r_f_a_b which would require me to write a lookup function....
so, yes. :)

--D

> 
> 
> > 
> > > +
> > > +static const struct xfs_repair_find_ag_btree repair_agf[] = {
> > > +	[REPAIR_AGF_BNOBT] = {
> > > +		.rmap_owner = XFS_RMAP_OWN_AG,
> > > +		.buf_ops = &xfs_allocbt_buf_ops,
> > > +		.magic = XFS_ABTB_CRC_MAGIC,
> > > +	},
> > > +	[REPAIR_AGF_CNTBT] = {
> > > +		.rmap_owner = XFS_RMAP_OWN_AG,
> > > +		.buf_ops = &xfs_allocbt_buf_ops,
> > > +		.magic = XFS_ABTC_CRC_MAGIC,
> > > +	},
> > 
> > I had to stop and think about why this only supports the v5 types.
> > i.e. we're rebuilding from rmap info, so this will never run on v4
> > filesystems, hence we only care about v5 types (i.e. *CRC_MAGIC).
> > Perhaps a one-line comment to remind readers of this?
> > 
> > > +	[REPAIR_AGF_RMAPBT] = {
> > > +		.rmap_owner = XFS_RMAP_OWN_AG,
> > > +		.buf_ops = &xfs_rmapbt_buf_ops,
> > > +		.magic = XFS_RMAP_CRC_MAGIC,
> > > +	},
> > > +	[REPAIR_AGF_REFCOUNTBT] = {
> > > +		.rmap_owner = XFS_RMAP_OWN_REFC,
> > > +		.buf_ops = &xfs_refcountbt_buf_ops,
> > > +		.magic = XFS_REFC_CRC_MAGIC,
> > > +	},
> > > +	[REPAIR_AGF_END] = {
> > > +		.buf_ops = NULL,
> > > +	},
> > > +};
> > > +
> > > +/*
> > > + * Find the btree roots.  This is /also/ a chicken and egg problem because we
> > > + * have to use the rmapbt (rooted in the AGF) to find the btrees rooted in the
> > > + * AGF.  We also have no idea if the btrees make any sense.  If we hit obvious
> > > + * corruptions in those btrees we'll bail out.
> > > + */
> > > +STATIC int
> > > +xfs_repair_agf_find_btrees(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_buf			*agf_bp,
> > > +	struct xfs_repair_find_ag_btree	*fab,
> > > +	struct xfs_buf			*agfl_bp)
> > > +{
> > > +	struct xfs_agf			*old_agf = XFS_BUF_TO_AGF(agf_bp);
> > > +	int				error;
> > > +
> > > +	/* Go find the root data. */
> > > +	memcpy(fab, repair_agf, sizeof(repair_agf));
> > 
> > Why are we initialising fab here, instead of in the caller where it
> > is declared and passed to various functions? Given there is only a
> > single declaration of this structure, why do we need a global static
> > const table initialiser just to copy it here - why isn't it
> > initialised at the declaration point?
> > 
> > > +	error = xfs_repair_find_ag_btree_roots(sc, agf_bp, fab, agfl_bp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/* We must find the bnobt, cntbt, and rmapbt roots. */
> > > +	if (fab[REPAIR_AGF_BNOBT].root == NULLAGBLOCK ||
> > > +	    fab[REPAIR_AGF_BNOBT].height > XFS_BTREE_MAXLEVELS ||
> > > +	    fab[REPAIR_AGF_CNTBT].root == NULLAGBLOCK ||
> > > +	    fab[REPAIR_AGF_CNTBT].height > XFS_BTREE_MAXLEVELS ||
> > > +	    fab[REPAIR_AGF_RMAPBT].root == NULLAGBLOCK ||
> > > +	    fab[REPAIR_AGF_RMAPBT].height > XFS_BTREE_MAXLEVELS)
> > > +		return -EFSCORRUPTED;
> > > +
> > > +	/*
> > > +	 * We relied on the rmapbt to reconstruct the AGF.  If we get a
> > > +	 * different root then something's seriously wrong.
> > > +	 */
> > > +	if (fab[REPAIR_AGF_RMAPBT].root !=
> > > +	    be32_to_cpu(old_agf->agf_roots[XFS_BTNUM_RMAPi]))
> > > +		return -EFSCORRUPTED;
> > > +
> > > +	/* We must find the refcountbt root if that feature is enabled. */
> > > +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb) &&
> > > +	    (fab[REPAIR_AGF_REFCOUNTBT].root == NULLAGBLOCK ||
> > > +	     fab[REPAIR_AGF_REFCOUNTBT].height > XFS_BTREE_MAXLEVELS))
> > > +		return -EFSCORRUPTED;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/* Set btree root information in an AGF. */
> > > +STATIC void
> > > +xfs_repair_agf_set_roots(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_agf			*agf,
> > > +	struct xfs_repair_find_ag_btree	*fab)
> > > +{
> > > +	agf->agf_roots[XFS_BTNUM_BNOi] =
> > > +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].root);
> > > +	agf->agf_levels[XFS_BTNUM_BNOi] =
> > > +			cpu_to_be32(fab[REPAIR_AGF_BNOBT].height);
> > > +
> > > +	agf->agf_roots[XFS_BTNUM_CNTi] =
> > > +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].root);
> > > +	agf->agf_levels[XFS_BTNUM_CNTi] =
> > > +			cpu_to_be32(fab[REPAIR_AGF_CNTBT].height);
> > > +
> > > +	agf->agf_roots[XFS_BTNUM_RMAPi] =
> > > +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].root);
> > > +	agf->agf_levels[XFS_BTNUM_RMAPi] =
> > > +			cpu_to_be32(fab[REPAIR_AGF_RMAPBT].height);
> > > +
> > > +	if (xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
> > > +		agf->agf_refcount_root =
> > > +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].root);
> > > +		agf->agf_refcount_level =
> > > +				cpu_to_be32(fab[REPAIR_AGF_REFCOUNTBT].height);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * Reinitialize the AGF header, making an in-core copy of the old contents so
> > > + * that we know which in-core state needs to be reinitialized.
> > > + */
> > > +STATIC void
> > > +xfs_repair_agf_init_header(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_buf			*agf_bp,
> > > +	struct xfs_agf			*old_agf)
> > > +{
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> > > +
> > > +	memcpy(old_agf, agf, sizeof(*old_agf));
> > > +	memset(agf, 0, BBTOB(agf_bp->b_length));
> > > +	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
> > > +	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
> > > +	agf->agf_seqno = cpu_to_be32(sc->sa.agno);
> > > +	agf->agf_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
> > > +	agf->agf_flfirst = old_agf->agf_flfirst;
> > > +	agf->agf_fllast = old_agf->agf_fllast;
> > > +	agf->agf_flcount = old_agf->agf_flcount;
> > > +	if (xfs_sb_version_hascrc(&mp->m_sb))
> > > +		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
> > > +}
> > 
> > Do we need to clear pag->pagf_init here so that it gets
> > re-initialised next time someone reads the AGF?
> > 
> > > +
> > > +/* Update the AGF btree counters by walking the btrees. */
> > > +STATIC int
> > > +xfs_repair_agf_update_btree_counters(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_buf			*agf_bp)
> > > +{
> > > +	struct xfs_repair_agf_allocbt	raa = { .sc = sc };
> > > +	struct xfs_btree_cur		*cur = NULL;
> > > +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	xfs_agblock_t			btreeblks;
> > > +	xfs_agblock_t			blocks;
> > > +	int				error;
> > > +
> > > +	/* Update the AGF counters from the bnobt. */
> > > +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> > > +			XFS_BTNUM_BNO);
> > > +	error = xfs_alloc_query_all(cur, xfs_repair_agf_walk_allocbt, &raa);
> > > +	if (error)
> > > +		goto err;
> > > +	error = xfs_btree_count_blocks(cur, &blocks);
> > > +	if (error)
> > > +		goto err;
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > +	btreeblks = blocks - 1;
> > > +	agf->agf_freeblks = cpu_to_be32(raa.freeblks);
> > > +	agf->agf_longest = cpu_to_be32(raa.longest);
> > 
> > This function updates more than the AGF btree counters. :P
> > 
> > > +
> > > +	/* Update the AGF counters from the cntbt. */
> > > +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> > > +			XFS_BTNUM_CNT);
> > > +	error = xfs_btree_count_blocks(cur, &blocks);
> > > +	if (error)
> > > +		goto err;
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > +	btreeblks += blocks - 1;
> > > +
> > > +	/* Update the AGF counters from the rmapbt. */
> > > +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
> > > +	error = xfs_btree_count_blocks(cur, &blocks);
> > > +	if (error)
> > > +		goto err;
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > +	agf->agf_rmap_blocks = cpu_to_be32(blocks);
> > > +	btreeblks += blocks - 1;
> > > +
> > > +	agf->agf_btreeblks = cpu_to_be32(btreeblks);
> > > +
> > > +	/* Update the AGF counters from the refcountbt. */
> > > +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> > > +		cur = xfs_refcountbt_init_cursor(mp, sc->tp, agf_bp,
> > > +				sc->sa.agno, NULL);
> > > +		error = xfs_btree_count_blocks(cur, &blocks);
> > > +		if (error)
> > > +			goto err;
> > > +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > +		agf->agf_refcount_blocks = cpu_to_be32(blocks);
> > > +	}
> > > +
> > > +	return 0;
> > > +err:
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > > +	return error;
> > > +}
> > > +
> > > +/* Trigger reinitialization of the in-core data. */
> > > +STATIC int
> > > +xfs_repair_agf_reinit_incore(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_agf			*agf,
> > > +	const struct xfs_agf		*old_agf)
> > > +{
> > > +	struct xfs_perag		*pag;
> > > +
> > > +	/* XXX: trigger fdblocks recalculation */
> > > +
> > > +	/* Now reinitialize the in-core counters if necessary. */
> > > +	pag = sc->sa.pag;
> > > +	if (!pag->pagf_init)
> > > +		return 0;
> > > +
> > > +	pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
> > > +	pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks);
> > > +	pag->pagf_longest = be32_to_cpu(agf->agf_longest);
> > > +	pag->pagf_levels[XFS_BTNUM_BNOi] =
> > > +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
> > > +	pag->pagf_levels[XFS_BTNUM_CNTi] =
> > > +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
> > > +	pag->pagf_levels[XFS_BTNUM_RMAPi] =
> > > +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
> > > +	pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
> > 
> > Ok, so we reinit the pagf bits here, but....
> > 
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/* Repair the AGF. */
> > > +int
> > > +xfs_repair_agf(
> > > +	struct xfs_scrub_context	*sc)
> > > +{
> > > +	struct xfs_repair_find_ag_btree	fab[REPAIR_AGF_MAX];
> > > +	struct xfs_agf			old_agf;
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	struct xfs_buf			*agf_bp;
> > > +	struct xfs_buf			*agfl_bp;
> > > +	struct xfs_agf			*agf;
> > > +	int				error;
> > > +
> > > +	/* We require the rmapbt to rebuild anything. */
> > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> > > +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> > > +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGF_DADDR(mp)),
> > > +			XFS_FSS_TO_BB(mp, 1), 0, &agf_bp, NULL);
> > > +	if (error)
> > > +		return error;
> > > +	agf_bp->b_ops = &xfs_agf_buf_ops;
> > > +	agf = XFS_BUF_TO_AGF(agf_bp);
> > > +
> > > +	/*
> > > +	 * Load the AGFL so that we can screen out OWN_AG blocks that are on
> > > +	 * the AGFL now; these blocks might have once been part of the
> > > +	 * bno/cnt/rmap btrees but are not now.  This is a chicken and egg
> > > +	 * problem: the AGF is corrupt, so we have to trust the AGFL contents
> > > +	 * because we can't do any serious cross-referencing with any of the
> > > +	 * btrees rooted in the AGF.  If the AGFL contents are obviously bad
> > > +	 * then we'll bail out.
> > > +	 */
> > > +	error = xfs_alloc_read_agfl(mp, sc->tp, sc->sa.agno, &agfl_bp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/*
> > > +	 * Spot-check the AGFL blocks; if they're obviously corrupt then
> > > +	 * there's nothing we can do but bail out.
> > > +	 */
> > > +	error = xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(agf_bp), agfl_bp,
> > > +			xfs_repair_agf_check_agfl_block, sc);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/*
> > > +	 * Find the AGF btree roots.  See the comment for this function for
> > > +	 * more information about the limitations of this repairer; this is
> > > +	 * also a chicken-and-egg situation.
> > > +	 */
> > > +	error = xfs_repair_agf_find_btrees(sc, agf_bp, fab, agfl_bp);
> > > +	if (error)
> > > +		return error;
> > 
> > Comment could be better written.
> > 
> > 	/*
> > 	 * Find the AGF btree roots. This is also a chicken-and-egg
> > 	 * situation - see xfs_repair_agf_find_btrees() for details.
> > 	 */
> > 
> > > +
> > > +	/* Start rewriting the header and implant the btrees we found. */
> > > +	xfs_repair_agf_init_header(sc, agf_bp, &old_agf);
> > > +	xfs_repair_agf_set_roots(sc, agf, fab);
> > > +	error = xfs_repair_agf_update_btree_counters(sc, agf_bp);
> > > +	if (error)
> > > +		goto out_revert;
> > 
> > If we fail here, the pagf information is invalid, hence I think we
> > really do need to clear pagf_init before we start rebuilding the new
> > AGF. Yes, I can see we revert the AGF info, but this seems like a
> > landmine waiting to be tripped over.
> > 
> > > +	/* Reinitialize in-core state. */
> > > +	error = xfs_repair_agf_reinit_incore(sc, agf, &old_agf);
> > > +	if (error)
> > > +		goto out_revert;
> > > +
> > > +	/* Write this to disk. */
> > > +	xfs_trans_buf_set_type(sc->tp, agf_bp, XFS_BLFT_AGF_BUF);
> > > +	xfs_trans_log_buf(sc->tp, agf_bp, 0, BBTOB(agf_bp->b_length) - 1);
> > > +	return 0;
> > > +
> > > +out_revert:
> > > +	memcpy(agf, &old_agf, sizeof(old_agf));
> > > +	return error;
> > > +}
> > > +
> > > +/* AGFL */
> > > +
> > > +struct xfs_repair_agfl {
> > > +	struct xfs_repair_extent_list	agmeta_list;
> > > +	struct xfs_repair_extent_list	*freesp_list;
> > > +	struct xfs_scrub_context	*sc;
> > > +};
> > > +
> > > +/* Record all freespace information. */
> > > +STATIC int
> > > +xfs_repair_agfl_rmap_fn(
> > > +	struct xfs_btree_cur		*cur,
> > > +	struct xfs_rmap_irec		*rec,
> > > +	void				*priv)
> > > +{
> > > +	struct xfs_repair_agfl		*ra = priv;
> > > +	xfs_fsblock_t			fsb;
> > > +	int				error = 0;
> > > +
> > > +	if (xfs_scrub_should_terminate(ra->sc, &error))
> > > +		return error;
> > > +
> > > +	/* Record all the OWN_AG blocks. */
> > > +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> > > +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> > > +				rec->rm_startblock);
> > > +		error = xfs_repair_collect_btree_extent(ra->sc,
> > > +				ra->freesp_list, fsb, rec->rm_blockcount);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +
> > > +	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
> > > +			xfs_repair_collect_btree_cur_blocks_in_extent_list,
> > 
> > Urk. The function name lengths is getting out of hand. I'm very
> > tempted to suggest we should shorten the namespace of all this
> > like s/xfs_repair_/xr_/ and s/xfs_scrub_/xs_/, etc just to make them
> > shorter and easier to read.
> > 
> > Oh, wait, did I say that out loud? :P
> > 
> > Something to think about, anyway.
> > 
> > > +			&ra->agmeta_list);
> > > +}
> > > +
> > > +/* Add a btree block to the agmeta list. */
> > > +STATIC int
> > > +xfs_repair_agfl_visit_btblock(
> > 
> > I find the name a bit confusing - AGFLs don't have btree blocks.
> > Yes, I know that it's a xfs_btree_visit_blocks() callback but I
> > think s/visit/collect/ makes more sense. i.e. it tells us what we
> > are doing with the btree block, rather than making it sound like we
> > are walking AGFL btree blocks...
> > 
> > > +/*
> > > + * Map out all the non-AGFL OWN_AG space in this AG so that we can deduce
> > > + * which blocks belong to the AGFL.
> > > + */
> > > +STATIC int
> > > +xfs_repair_agfl_find_extents(
> > 
> > Same here - xr_agfl_collect_free_extents()?
> > 
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_buf			*agf_bp,
> > > +	struct xfs_repair_extent_list	*agfl_extents,
> > > +	xfs_agblock_t			*flcount)
> > > +{
> > > +	struct xfs_repair_agfl		ra;
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	struct xfs_btree_cur		*cur;
> > > +	struct xfs_repair_extent	*rae;
> > > +	int				error;
> > > +
> > > +	ra.sc = sc;
> > > +	ra.freesp_list = agfl_extents;
> > > +	xfs_repair_init_extent_list(&ra.agmeta_list);
> > > +
> > > +	/* Find all space used by the free space btrees & rmapbt. */
> > > +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
> > > +	error = xfs_rmap_query_all(cur, xfs_repair_agfl_rmap_fn, &ra);
> > > +	if (error)
> > > +		goto err;
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > +
> > > +	/* Find all space used by bnobt. */
> > 
> > Needs clarification.
> > 
> > 	/* Find all the in use bnobt blocks */
> > 
> > > +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> > > +			XFS_BTNUM_BNO);
> > > +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
> > > +	if (error)
> > > +		goto err;
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > +
> > > +	/* Find all space used by cntbt. */
> > 
> > 	/* Find all the in use cntbt blocks */
> > 
> > > +	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
> > > +			XFS_BTNUM_CNT);
> > > +	error = xfs_btree_visit_blocks(cur, xfs_repair_agfl_visit_btblock, &ra);
> > > +	if (error)
> > > +		goto err;
> > > +
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > +
> > > +	/*
> > > +	 * Drop the freesp meta blocks that are in use by btrees.
> > > +	 * The remaining blocks /should/ be AGFL blocks.
> > > +	 */
> > > +	error = xfs_repair_subtract_extents(sc, agfl_extents, &ra.agmeta_list);
> > > +	xfs_repair_cancel_btree_extents(sc, &ra.agmeta_list);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/* Calculate the new AGFL size. */
> > > +	*flcount = 0;
> > > +	for_each_xfs_repair_extent(rae, agfl_extents) {
> > > +		*flcount += rae->len;
> > > +		if (*flcount > xfs_agfl_size(mp))
> > > +			break;
> > > +	}
> > > +	if (*flcount > xfs_agfl_size(mp))
> > > +		*flcount = xfs_agfl_size(mp);
> > 
> > Ok, so flcount is clamped here. What happens to all the remaining
> > agfl_extents beyond flcount?
> > 
> > > +	return 0;
> > > +
> > > +err:
> > 
> > Ok, what cleans up all the extents we've recorded in ra on error?
> > 
> > > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > > +	return error;
> > > +}
> > > +
> > > +/* Update the AGF and reset the in-core state. */
> > > +STATIC int
> > > +xfs_repair_agfl_update_agf(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_buf			*agf_bp,
> > > +	xfs_agblock_t			flcount)
> > > +{
> > > +	struct xfs_agf			*agf = XFS_BUF_TO_AGF(agf_bp);
> > > +
> > 	ASSERT(*flcount <= xfs_agfl_size(mp));
> > 
> > > +	/* XXX: trigger fdblocks recalculation */
> > > +
> > > +	/* Update the AGF counters. */
> > > +	if (sc->sa.pag->pagf_init)
> > > +		sc->sa.pag->pagf_flcount = flcount;
> > > +	agf->agf_flfirst = cpu_to_be32(0);
> > > +	agf->agf_flcount = cpu_to_be32(flcount);
> > > +	agf->agf_fllast = cpu_to_be32(flcount - 1);
> > > +
> > > +	xfs_alloc_log_agf(sc->tp, agf_bp,
> > > +			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
> > > +	return 0;
> > > +}
> > > +
> > > +/* Write out a totally new AGFL. */
> > > +STATIC void
> > > +xfs_repair_agfl_init_header(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_buf			*agfl_bp,
> > > +	struct xfs_repair_extent_list	*agfl_extents,
> > > +	xfs_agblock_t			flcount)
> > > +{
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	__be32				*agfl_bno;
> > > +	struct xfs_repair_extent	*rae;
> > > +	struct xfs_repair_extent	*n;
> > > +	struct xfs_agfl			*agfl;
> > > +	xfs_agblock_t			agbno;
> > > +	unsigned int			fl_off;
> > > +
> > 	ASSERT(*flcount <= xfs_agfl_size(mp));
> > 
> > > +	/* Start rewriting the header. */
> > > +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> > > +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> > > +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> > > +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> > > +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> > > +
> > > +	/* Fill the AGFL with the remaining blocks. */
> > > +	fl_off = 0;
> > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> > > +	for_each_xfs_repair_extent_safe(rae, n, agfl_extents) {
> > > +		agbno = XFS_FSB_TO_AGBNO(mp, rae->fsbno);
> > > +
> > > +		trace_xfs_repair_agfl_insert(mp, sc->sa.agno, agbno, rae->len);
> > > +
> > > +		while (rae->len > 0 && fl_off < flcount) {
> > > +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> > > +			fl_off++;
> > > +			agbno++;
> > > +			rae->fsbno++;
> > > +			rae->len--;
> > > +		}
> > 
> > This only works correctly if flcount <= xfs_agfl_size, which is why
> > I'm suggesting some asserts.
> > 
> > > +
> > > +		if (rae->len)
> > > +			break;
> > > +		list_del(&rae->list);
> > > +		kmem_free(rae);
> > > +	}
> > > +
> > > +	/* Write AGF and AGFL to disk. */
> > > +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> > > +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> > > +}
> > > +
> > > +/* Repair the AGFL. */
> > > +int
> > > +xfs_repair_agfl(
> > > +	struct xfs_scrub_context	*sc)
> > > +{
> > > +	struct xfs_owner_info		oinfo;
> > > +	struct xfs_repair_extent_list	agfl_extents;
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	struct xfs_buf			*agf_bp;
> > > +	struct xfs_buf			*agfl_bp;
> > > +	xfs_agblock_t			flcount;
> > > +	int				error;
> > > +
> > > +	/* We require the rmapbt to rebuild anything. */
> > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> > > +	xfs_repair_init_extent_list(&agfl_extents);
> > > +
> > > +	/*
> > > +	 * Read the AGF so that we can query the rmapbt.  We hope that there's
> > > +	 * nothing wrong with the AGF, but all the AG header repair functions
> > > +	 * have this chicken-and-egg problem.
> > > +	 */
> > > +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
> > > +	if (error)
> > > +		return error;
> > > +	if (!agf_bp)
> > > +		return -ENOMEM;
> > > +
> > > +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> > > +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGFL_DADDR(mp)),
> > > +			XFS_FSS_TO_BB(mp, 1), 0, &agfl_bp, NULL);
> > > +	if (error)
> > > +		return error;
> > > +	agfl_bp->b_ops = &xfs_agfl_buf_ops;
> > > +
> > > +	/*
> > > +	 * Compute the set of old AGFL blocks by subtracting from the list of
> > > +	 * OWN_AG blocks the list of blocks owned by all other OWN_AG metadata
> > > +	 * (bnobt, cntbt, rmapbt).  These are the old AGFL blocks, so return
> > > +	 * that list and the number of blocks we're actually going to put back
> > > +	 * on the AGFL.
> > > +	 */
> > 
> > That comment belongs on the function, not here. All we need here is
> > something like:
> > 
> > 	/* Gather all the extents we're going to put on the new AGFL. */
> > 
> > > +	error = xfs_repair_agfl_find_extents(sc, agf_bp, &agfl_extents,
> > > +			&flcount);
> > > +	if (error)
> > > +		goto err;
> > > +
> > > +	/*
> > > +	 * Update AGF and AGFL.  We reset the global free block counter when
> > > +	 * we adjust the AGF flcount (which can fail) so avoid updating any
> > > +	 * bufers until we know that part works.
> > 
> > buffers
> > 
> > > +	 */
> > > +	error = xfs_repair_agfl_update_agf(sc, agf_bp, flcount);
> > > +	if (error)
> > > +		goto err;
> > > +	xfs_repair_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
> > > +
> > > +	/*
> > > +	 * Ok, the AGFL should be ready to go now.  Roll the transaction so
> > > +	 * that we can free any AGFL overflow.
> > > +	 */
> > 
> > Why does rolling the transaction allow us to free the overflow?
> > Shouldn't the comment say something like "Roll to the transaction to
> > make the new AGFL permanent before we start using it when returning
> > the residual AGFL freespace overflow back to the AGF freespace
> > btrees."
> > 
> > > +	sc->sa.agf_bp = agf_bp;
> > > +	sc->sa.agfl_bp = agfl_bp;
> > > +	error = xfs_repair_roll_ag_trans(sc);
> > > +	if (error)
> > > +		goto err;
> > > +
> > > +	/* Dump any AGFL overflow. */
> > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > +	return xfs_repair_reap_btree_extents(sc, &agfl_extents, &oinfo,
> > > +			XFS_AG_RESV_AGFL);
> > > +err:
> > > +	xfs_repair_cancel_btree_extents(sc, &agfl_extents);
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > > index 326be4e8b71e..bcdaa8df18f6 100644
> > > --- a/fs/xfs/scrub/repair.c
> > > +++ b/fs/xfs/scrub/repair.c
> > > @@ -127,9 +127,12 @@ xfs_repair_roll_ag_trans(
> > >   	int				error;
> > >   	/* Keep the AG header buffers locked so we can keep going. */
> > > -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > > +	if (sc->sa.agi_bp)
> > > +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > +	if (sc->sa.agf_bp)
> > > +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > +	if (sc->sa.agfl_bp)
> > > +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > >   	/* Roll the transaction. */
> > >   	error = xfs_trans_roll(&sc->tp);
> > > @@ -137,9 +140,12 @@ xfs_repair_roll_ag_trans(
> > >   		goto out_release;
> > >   	/* Join AG headers to the new transaction. */
> > > -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > > +	if (sc->sa.agi_bp)
> > > +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > +	if (sc->sa.agf_bp)
> > > +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > +	if (sc->sa.agfl_bp)
> > > +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > >   	return 0;
> > > @@ -149,9 +155,12 @@ xfs_repair_roll_ag_trans(
> > >   	 * buffers will be released during teardown on our way out
> > >   	 * of the kernel.
> > >   	 */
> > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > > +	if (sc->sa.agi_bp)
> > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > +	if (sc->sa.agf_bp)
> > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > +	if (sc->sa.agfl_bp)
> > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > >   	return error;
> > >   }
> > > @@ -408,6 +417,85 @@ xfs_repair_collect_btree_extent(
> > >   	return 0;
> > >   }
> > > +/*
> > > + * Help record all btree blocks seen while iterating all records of a btree.
> > > + *
> > > + * We know that the btree query_all function starts at the left edge and walks
> > > + * towards the right edge of the tree.  Therefore, we know that we can walk up
> > > + * the btree cursor towards the root; if the pointer for a given level points
> > > + * to the first record/key in that block, we haven't seen this block before;
> > > + * and therefore we need to remember that we saw this block in the btree.
> > > + *
> > > + * So if our btree is:
> > > + *
> > > + *    4
> > > + *  / | \
> > > + * 1  2  3
> > > + *
> > > + * Pretend for this example that each leaf block has 100 btree records.  For
> > > + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> > > + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> > > + * block 4.  The list is [1, 4].
> > > + *
> > > + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> > > + * loop.  The list remains [1, 4].
> > > + *
> > > + * For the 101st btree record, we've moved onto leaf block 2.  Now
> > > + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> > > + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> > > + *
> > > + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> > > + *
> > > + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> > > + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> > > + *
> > > + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> > > + *
> > > + * The *iter_fn can return XFS_BTREE_QUERY_RANGE_ABORT to stop, 0 to keep
> > > + * iterating, or the usual negative error code.
> > > + */
> > > +int
> > > +xfs_repair_collect_btree_cur_blocks(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_btree_cur		*cur,
> > > +	int				(*iter_fn)(struct xfs_scrub_context *sc,
> > > +						   xfs_fsblock_t fsbno,
> > > +						   xfs_fsblock_t len,
> > > +						   void *priv),
> > > +	void				*priv)
> > > +{
> > > +	struct xfs_buf			*bp;
> > > +	xfs_fsblock_t			fsb;
> > > +	int				i;
> > > +	int				error;
> > > +
> > > +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> > > +		xfs_btree_get_block(cur, i, &bp);
> > > +		if (!bp)
> > > +			continue;
> > > +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > +		error = iter_fn(sc, fsb, 1, priv);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Simple adapter to connect xfs_repair_collect_btree_extent to
> > > + * xfs_repair_collect_btree_cur_blocks.
> > > + */
> > > +int
> > > +xfs_repair_collect_btree_cur_blocks_in_extent_list(
> > > +	struct xfs_scrub_context	*sc,
> > > +	xfs_fsblock_t			fsbno,
> > > +	xfs_fsblock_t			len,
> > > +	void				*priv)
> > > +{
> > > +	return xfs_repair_collect_btree_extent(sc, priv, fsbno, len);
> > > +}
> > > +
> > >   /*
> > >    * An error happened during the rebuild so the transaction will be cancelled.
> > >    * The fs will shut down, and the administrator has to unmount and run repair.
> > > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > > index ef47826b6725..f2af5923aa75 100644
> > > --- a/fs/xfs/scrub/repair.h
> > > +++ b/fs/xfs/scrub/repair.h
> > > @@ -48,9 +48,20 @@ xfs_repair_init_extent_list(
> > >   #define for_each_xfs_repair_extent_safe(rbe, n, exlist) \
> > >   	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
> > > +#define for_each_xfs_repair_extent(rbe, exlist) \
> > > +	list_for_each_entry((rbe), &(exlist)->list, list)
> > >   int xfs_repair_collect_btree_extent(struct xfs_scrub_context *sc,
> > >   		struct xfs_repair_extent_list *btlist, xfs_fsblock_t fsbno,
> > >   		xfs_extlen_t len);
> > > +int xfs_repair_collect_btree_cur_blocks(struct xfs_scrub_context *sc,
> > > +		struct xfs_btree_cur *cur,
> > > +		int (*iter_fn)(struct xfs_scrub_context *sc,
> > > +			       xfs_fsblock_t fsbno, xfs_fsblock_t len,
> > > +			       void *priv),
> > > +		void *priv);
> > > +int xfs_repair_collect_btree_cur_blocks_in_extent_list(
> > > +		struct xfs_scrub_context *sc, xfs_fsblock_t fsbno,
> > > +		xfs_fsblock_t len, void *priv);
> > >   void xfs_repair_cancel_btree_extents(struct xfs_scrub_context *sc,
> > >   		struct xfs_repair_extent_list *btlist);
> > >   int xfs_repair_subtract_extents(struct xfs_scrub_context *sc,
> > > @@ -89,6 +100,8 @@ int xfs_repair_ino_dqattach(struct xfs_scrub_context *sc);
> > >   int xfs_repair_probe(struct xfs_scrub_context *sc);
> > >   int xfs_repair_superblock(struct xfs_scrub_context *sc);
> > > +int xfs_repair_agf(struct xfs_scrub_context *sc);
> > > +int xfs_repair_agfl(struct xfs_scrub_context *sc);
> > >   #else
> > > @@ -112,6 +125,8 @@ xfs_repair_calc_ag_resblks(
> > >   #define xfs_repair_probe		xfs_repair_notsupported
> > >   #define xfs_repair_superblock		xfs_repair_notsupported
> > > +#define xfs_repair_agf			xfs_repair_notsupported
> > > +#define xfs_repair_agfl			xfs_repair_notsupported
> > >   #endif /* CONFIG_XFS_ONLINE_REPAIR */
> > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > index 58ae76b3a421..8e11c3c699fb 100644
> > > --- a/fs/xfs/scrub/scrub.c
> > > +++ b/fs/xfs/scrub/scrub.c
> > > @@ -208,13 +208,13 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
> > >   		.type	= ST_PERAG,
> > >   		.setup	= xfs_scrub_setup_fs,
> > >   		.scrub	= xfs_scrub_agf,
> > > -		.repair	= xfs_repair_notsupported,
> > > +		.repair	= xfs_repair_agf,
> > >   	},
> > >   	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
> > >   		.type	= ST_PERAG,
> > >   		.setup	= xfs_scrub_setup_fs,
> > >   		.scrub	= xfs_scrub_agfl,
> > > -		.repair	= xfs_repair_notsupported,
> > > +		.repair	= xfs_repair_agfl,
> > >   	},
> > >   	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
> > >   		.type	= ST_PERAG,
> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index 524f543c5b82..c08785cf83a9 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -126,6 +126,60 @@ xfs_trans_dup(
> > >   	return ntp;
> > >   }
> > > +/*
> > > + * Try to reserve more blocks for a transaction.  The single use case we
> > > + * support is for online repair -- use a transaction to gather data without
> > > + * fear of btree cycle deadlocks; calculate how many blocks we really need
> > > + * from that data; and only then start modifying data.  This can fail due to
> > > + * ENOSPC, so we have to be able to cancel the transaction.
> > > + */
> > > +int
> > > +xfs_trans_reserve_more(
> > > +	struct xfs_trans	*tp,
> > > +	uint			blocks,
> > > +	uint			rtextents)
> > 
> > This isn't used in this patch - seems out of place here. Committed
> > to the wrong patch?
> > 
> > Cheers,
> > 
> > Dave.
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 04/21] xfs: repair the AGF and AGFL
  2018-06-27 23:37       ` Dave Chinner
@ 2018-06-29 15:14         ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-29 15:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Allison Henderson, linux-xfs

On Thu, Jun 28, 2018 at 09:37:20AM +1000, Dave Chinner wrote:
> On Wed, Jun 27, 2018 at 09:44:53AM -0700, Allison Henderson wrote:
> > On 06/26/2018 07:19 PM, Dave Chinner wrote:
> > >On Sun, Jun 24, 2018 at 12:23:54PM -0700, Darrick J. Wong wrote:
> > >>From: Darrick J. Wong <darrick.wong@oracle.com>
> > >>
> > >>Regenerate the AGF and AGFL from the rmap data.
> > >>
> > >>Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > >
> > >[...]
> 
> > >>+	/* Record all the OWN_AG blocks. */
> > >>+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> > >>+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> > >>+				rec->rm_startblock);
> > >>+		error = xfs_repair_collect_btree_extent(ra->sc,
> > >>+				ra->freesp_list, fsb, rec->rm_blockcount);
> > >>+		if (error)
> > >>+			return error;
> > >>+	}
> > >>+
> > >>+	return xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
> > >>+			xfs_repair_collect_btree_cur_blocks_in_extent_list,
> > >
> > >Urk. The function name lengths is getting out of hand. I'm very
> > >tempted to suggest we should shorten the namespace of all this
> > >like s/xfs_repair_/xr_/ and s/xfs_scrub_/xs_/, etc just to make them
> > >shorter and easier to read.
> > >
> > >Oh, wait, did I say that out loud? :P
> > >
> > >Something to think about, anyway.
> > >
> > Well they are sort of long, but TBH I think i still kind of
> > appreciate the extra verbiage.  I have seen other projects do things
> > like adopt a sort of 3 or 4 letter abbreviation (like maybe xfs_scrb
> > or xfs_repr). Helps to cut down on the verbosity while still not
> > loosing too much of what it is supposed to mean.  Just another idea
> > to consider. :-)
> 
> We've got that in places, too, like "xlog_" prefixes for all the log
> code, so that's not an unreasonable thing to suggest. After all, in
> many cases we're talking about a tradeoff between readabilty and the
> amount of typing necessary.

I propose(d on IRC) to shorten the prefixes to xrep_ and xchk_.

I'll also take a look at condensing the non-prefix parts of the names.
Agreed that they're too long now.

> However, IMO, function names so long they need a line of their own
> indicates we have a structural problem in our code, not a
> readability problem. We should not need names that long to document
> what the function does - it should be obvious from the context, the
> abstraction that is being used and a short name....
> 
> e.g. how many of these different "collect extent" operations could
> be abstracted into a common extent list structure and generic
> callbacks? It seems there's a lot of similarity in them, and we're
> really only differentiating them by adding more namespace and
> context specific information into the structure and function names.

I'd /really/ like to convert this into a proper incore bitmap and then
turn the operations into proper bitmasking operations, since
collect_btree_extents is merely setting ranges of bits in two bitmaps
and subtract_extents takes the union of them both.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 06/21] xfs: repair free space btrees
  2018-06-24 19:24 ` [PATCH 06/21] xfs: repair free space btrees Darrick J. Wong
  2018-06-27  3:21   ` Dave Chinner
@ 2018-06-30 17:36   ` Allison Henderson
  1 sibling, 0 replies; 77+ messages in thread
From: Allison Henderson @ 2018-06-30 17:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/24/2018 12:24 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the free space btrees from the gaps in the rmap btree.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/xfs/Makefile             |    1
>   fs/xfs/scrub/alloc.c        |    1
>   fs/xfs/scrub/alloc_repair.c |  561 +++++++++++++++++++++++++++++++++++++++++++
>   fs/xfs/scrub/common.c       |    8 +
>   fs/xfs/scrub/repair.h       |    2
>   fs/xfs/scrub/scrub.c        |    4
>   fs/xfs/xfs_extent_busy.c    |   14 +
>   fs/xfs/xfs_extent_busy.h    |    4
>   8 files changed, 591 insertions(+), 4 deletions(-)
>   create mode 100644 fs/xfs/scrub/alloc_repair.c
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index a36cccbec169..841e0824eeb6 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -164,6 +164,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
>   ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
>   xfs-y				+= $(addprefix scrub/, \
>   				   agheader_repair.o \
> +				   alloc_repair.o \
>   				   repair.o \
>   				   )
>   endif
> diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
> index 50e4f7fa06f0..e2514c84cb7a 100644
> --- a/fs/xfs/scrub/alloc.c
> +++ b/fs/xfs/scrub/alloc.c
> @@ -15,7 +15,6 @@
>   #include "xfs_log_format.h"
>   #include "xfs_trans.h"
>   #include "xfs_sb.h"
> -#include "xfs_alloc.h"
>   #include "xfs_rmap.h"
>   #include "xfs_alloc.h"
>   #include "scrub/xfs_scrub.h"
> diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> new file mode 100644
> index 000000000000..c25a2b0d71f1
> --- /dev/null
> +++ b/fs/xfs/scrub/alloc_repair.c
> @@ -0,0 +1,561 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2018 Oracle.  All Rights Reserved.
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_btree.h"
> +#include "xfs_bit.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans.h"
> +#include "xfs_sb.h"
> +#include "xfs_alloc.h"
> +#include "xfs_alloc_btree.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_inode.h"
> +#include "xfs_refcount.h"
> +#include "xfs_extent_busy.h"
> +#include "scrub/xfs_scrub.h"
> +#include "scrub/scrub.h"
> +#include "scrub/common.h"
> +#include "scrub/btree.h"
> +#include "scrub/trace.h"
> +#include "scrub/repair.h"
> +
> +/*
> + * Free Space Btree Repair
> + * =======================
> + *
> + * The reverse mappings are supposed to record all space usage for the entire
> + * AG.  Therefore, we can recalculate the free extents in an AG by looking for
> + * gaps in the physical extents recorded in the rmapbt.  On a reflink
> + * filesystem this is a little more tricky in that we have to be aware that
> + * the rmap records are allowed to overlap.
> + *
> + * We derive which blocks belonged to the old bnobt/cntbt by recording all the
> + * OWN_AG extents and subtracting out the blocks owned by all other OWN_AG
> + * metadata: the rmapbt blocks visited while iterating the reverse mappings
> + * and the AGFL blocks.
> + *
> + * Once we have both of those pieces, we can reconstruct the bnobt and cntbt
> + * by blowing out the free block state and freeing all the extents that we
> + * found.  This adds the requirement that we can't have any busy extents in
> + * the AG because the busy code cannot handle duplicate records.
> + *
> + * Note that we can only rebuild both free space btrees at the same time
> + * because the regular extent freeing infrastructure loads both btrees at the
> + * same time.
> + */
> +
> +struct xfs_repair_alloc_extent {
> +	struct list_head		list;
> +	xfs_agblock_t			bno;
> +	xfs_extlen_t			len;
> +};
> +
> +struct xfs_repair_alloc {
> +	struct xfs_repair_extent_list	nobtlist; /* rmapbt/agfl blocks */
> +	struct xfs_repair_extent_list	*btlist;  /* OWN_AG blocks */
> +	struct list_head		*extlist; /* free extents */
> +	struct xfs_scrub_context	*sc;
> +	uint64_t			nr_records; /* length of extlist */
> +	xfs_agblock_t			next_bno; /* next bno we want to see */
> +	xfs_agblock_t			nr_blocks; /* free blocks in extlist */
Align the comments on the right to a common column?

> +};
> +
> +/* Record extents that aren't in use from gaps in the rmap records. */
> +STATIC int
> +xfs_repair_alloc_extent_fn(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_rmap_irec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_repair_alloc		*ra = priv;
> +	struct xfs_repair_alloc_extent	*rae;
> +	xfs_fsblock_t			fsb;
> +	int				error;
> +
> +	/* Record all the OWN_AG blocks... */
> +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> +				rec->rm_startblock);
> +		error = xfs_repair_collect_btree_extent(ra->sc,
> +				ra->btlist, fsb, rec->rm_blockcount);
> +		if (error)
> +			return error;
> +	}
> +
> +	/* ...and all the rmapbt blocks... */
> +	error = xfs_repair_collect_btree_cur_blocks(ra->sc, cur,
> +			xfs_repair_collect_btree_cur_blocks_in_extent_list,
> +			&ra->nobtlist);
> +	if (error)
> +		return error;
> +
> +	/* ...and all the free space. */
> +	if (rec->rm_startblock > ra->next_bno) {
> +		trace_xfs_repair_alloc_extent_fn(cur->bc_mp,
> +				cur->bc_private.a.agno,
> +				ra->next_bno, rec->rm_startblock - ra->next_bno,
> +				XFS_RMAP_OWN_NULL, 0, 0);
> +
> +		rae = kmem_alloc(sizeof(struct xfs_repair_alloc_extent),
> +				KM_MAYFAIL);
> +		if (!rae)
> +			return -ENOMEM;
> +		INIT_LIST_HEAD(&rae->list);
> +		rae->bno = ra->next_bno;
> +		rae->len = rec->rm_startblock - ra->next_bno;
> +		list_add_tail(&rae->list, ra->extlist);
> +		ra->nr_records++;
> +		ra->nr_blocks += rae->len;
> +	}
> +	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
> +			rec->rm_startblock + rec->rm_blockcount);
> +	return 0;
> +}
Alrighty, seems to follow the commentary.  Thx!

> +
> +/* Collect an AGFL block for the not-to-release list. */
> +static int
> +xfs_repair_collect_agfl_block(
> +	struct xfs_mount		*mp,
> +	xfs_agblock_t			bno,
> +	void				*priv)
> +{
> +	struct xfs_repair_alloc		*ra = priv;
> +	xfs_fsblock_t			fsb;
> +
> +	fsb = XFS_AGB_TO_FSB(mp, ra->sc->sa.agno, bno);
> +	return xfs_repair_collect_btree_extent(ra->sc, &ra->nobtlist, fsb, 1);
> +}
> +
> +/* Compare two btree extents. */
> +static int
> +xfs_repair_allocbt_extent_cmp(
> +	void				*priv,
> +	struct list_head		*a,
> +	struct list_head		*b)
> +{
> +	struct xfs_repair_alloc_extent	*ap;
> +	struct xfs_repair_alloc_extent	*bp;
> +
> +	ap = container_of(a, struct xfs_repair_alloc_extent, list);
> +	bp = container_of(b, struct xfs_repair_alloc_extent, list);
> +
> +	if (ap->bno > bp->bno)
> +		return 1;
> +	else if (ap->bno < bp->bno)
> +		return -1;
> +	return 0;
> +}
> +
> +/* Put an extent onto the free list. */
> +STATIC int
> +xfs_repair_allocbt_free_extent(
While on the topic of name shortening, I've noticed other places
in the code shorten "extent" to "ext", and it seems pretty readable.
Just a suggestion if it helps :-)


> +	struct xfs_scrub_context	*sc,
> +	xfs_fsblock_t			fsbno,
> +	xfs_extlen_t			len,
> +	struct xfs_owner_info		*oinfo)
> +{
> +	int				error;
> +
> +	error = xfs_free_extent(sc->tp, fsbno, len, oinfo, 0);
> +	if (error)
> +		return error;
> +	error = xfs_repair_roll_ag_trans(sc);
> +	if (error)
> +		return error;
> +	return xfs_mod_fdblocks(sc->mp, -(int64_t)len, false);
> +}
> +
> +/* Find the longest free extent in the list. */
> +static struct xfs_repair_alloc_extent *
> +xfs_repair_allocbt_get_longest(
> +	struct list_head		*free_extents)
> +{
> +	struct xfs_repair_alloc_extent	*rae;
> +	struct xfs_repair_alloc_extent	*res = NULL;
> +
> +	list_for_each_entry(rae, free_extents, list) {
> +		if (!res || rae->len > res->len)
> +			res = rae;
> +	}
> +	return res;
> +}
> +
> +/* Find the shortest free extent in the list. */
> +static struct xfs_repair_alloc_extent *
> +xfs_repair_allocbt_get_shortest(
> +	struct list_head		*free_extents)
> +{
> +	struct xfs_repair_alloc_extent	*rae;
> +	struct xfs_repair_alloc_extent	*res = NULL;
> +
> +	list_for_each_entry(rae, free_extents, list) {
> +		if (!res || rae->len < res->len)
> +			res = rae;
> +		if (res->len == 1)
> +			break;
> +	}
> +	return res;
> +}
> +
> +/*
> + * Allocate a block from the (cached) shortest extent in the AG.  In theory
> + * this should never fail, since we already checked that there was enough
> + * space to handle the new btrees.
> + */
> +STATIC xfs_fsblock_t
> +xfs_repair_allocbt_alloc_block(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*free_extents,
> +	struct xfs_repair_alloc_extent	**cached_result)
> +{
> +	struct xfs_repair_alloc_extent	*ext = *cached_result;
> +	xfs_fsblock_t			fsb;
> +
> +	/* No cached result, see if we can find another. */
> +	if (!ext) {
> +		ext = xfs_repair_allocbt_get_shortest(free_extents);
> +		ASSERT(ext);
> +		if (!ext)
> +			return NULLFSBLOCK;
> +	}
> +
> +	/* Subtract one block. */
> +	fsb = XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, ext->bno);
> +	ext->bno++;
> +	ext->len--;
> +	if (ext->len == 0) {
> +		list_del(&ext->list);
> +		kmem_free(ext);
> +		ext = NULL;
> +	}
> +
> +	*cached_result = ext;
> +	return fsb;
> +}
> +
> +/* Free every record in the extent list. */
> +STATIC void
> +xfs_repair_allocbt_cancel_freelist(
> +	struct list_head		*extlist)
> +{
> +	struct xfs_repair_alloc_extent	*rae;
> +	struct xfs_repair_alloc_extent	*n;
> +
> +	list_for_each_entry_safe(rae, n, extlist, list) {
> +		list_del(&rae->list);
> +		kmem_free(rae);
> +	}
> +}
> +
> +/*
> + * Iterate all reverse mappings to find (1) the free extents, (2) the OWN_AG
> + * extents, (3) the rmapbt blocks, and (4) the AGFL blocks.  The free space is
> + * (1) + (2) - (3) - (4).  Figure out if we have enough free space to
> + * reconstruct the free space btrees.  Caller must clean up the input lists
> + * if something goes wrong.
> + */
> +STATIC int
> +xfs_repair_allocbt_find_freespace(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*free_extents,
> +	struct xfs_repair_extent_list	*old_allocbt_blocks)
> +{
> +	struct xfs_repair_alloc		ra;
> +	struct xfs_repair_alloc_extent	*rae;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_agblock_t			agend;
> +	xfs_agblock_t			nr_blocks;
> +	int				error;
> +
> +	ra.extlist = free_extents;
> +	ra.btlist = old_allocbt_blocks;
> +	xfs_repair_init_extent_list(&ra.nobtlist);
> +	ra.next_bno = 0;
> +	ra.nr_records = 0;
> +	ra.nr_blocks = 0;
> +	ra.sc = sc;
> +
> +	/*
> +	 * Iterate all the reverse mappings to find gaps in the physical
> +	 * mappings, all the OWN_AG blocks, and all the rmapbt extents.
> +	 */
> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
> +	error = xfs_rmap_query_all(cur, xfs_repair_alloc_extent_fn, &ra);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	cur = NULL;
> +
> +	/* Insert a record for space between the last rmap and EOAG. */
> +	agend = be32_to_cpu(XFS_BUF_TO_AGF(sc->sa.agf_bp)->agf_length);
> +	if (ra.next_bno < agend) {
> +		rae = kmem_alloc(sizeof(struct xfs_repair_alloc_extent),
> +				KM_MAYFAIL);
> +		if (!rae) {
> +			error = -ENOMEM;
> +			goto err;
> +		}
> +		INIT_LIST_HEAD(&rae->list);
> +		rae->bno = ra.next_bno;
> +		rae->len = agend - ra.next_bno;
> +		list_add_tail(&rae->list, free_extents);
> +		ra.nr_records++;
> +	}
> +
> +	/* Collect all the AGFL blocks. */
> +	error = xfs_agfl_walk(mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
> +			sc->sa.agfl_bp, xfs_repair_collect_agfl_block, &ra);
> +	if (error)
> +		goto err;
> +
> +	/* Do we actually have enough space to do this? */
> +	nr_blocks = 2 * xfs_allocbt_calc_size(mp, ra.nr_records);
> +	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE) ||
> +	    ra.nr_blocks < nr_blocks) {
> +		error = -ENOSPC;
> +		goto err;
> +	}
> +
> +	/* Compute the old bnobt/cntbt blocks. */
> +	error = xfs_repair_subtract_extents(sc, old_allocbt_blocks,
> +			&ra.nobtlist);
> +	if (error)
> +		goto err;
> +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> +	return 0;
> +
> +err:
> +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> +	if (cur)
> +		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
Ok, makes sense after some digging. I might not have figured out the 2
had Dave not pointed that out though.  But for the most part the in body
comments help a lot.  Thx!

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=nNxagNoo077f7e1qascS_gP9gvh_A31xun9uDjsIiRw&s=pV06fEkJolQtBTE33dKzWHVyIvrswKx5pwP148R8jcs&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 07/21] xfs: repair inode btrees
  2018-06-24 19:24 ` [PATCH 07/21] xfs: repair inode btrees Darrick J. Wong
  2018-06-28  0:55   ` Dave Chinner
@ 2018-06-30 17:36   ` Allison Henderson
  2018-06-30 18:30     ` Darrick J. Wong
  1 sibling, 1 reply; 77+ messages in thread
From: Allison Henderson @ 2018-06-30 17:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/24/2018 12:24 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Use the rmapbt to find inode chunks, query the chunks to compute
> hole and free masks, and with that information rebuild the inobt
> and finobt.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   fs/xfs/Makefile              |    1
>   fs/xfs/scrub/ialloc_repair.c |  585 ++++++++++++++++++++++++++++++++++++++++++
>   fs/xfs/scrub/repair.h        |    2
>   fs/xfs/scrub/scrub.c         |    4
>   4 files changed, 590 insertions(+), 2 deletions(-)
>   create mode 100644 fs/xfs/scrub/ialloc_repair.c
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 841e0824eeb6..837fd4a95f6f 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -165,6 +165,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
>   xfs-y				+= $(addprefix scrub/, \
>   				   agheader_repair.o \
>   				   alloc_repair.o \
> +				   ialloc_repair.o \
>   				   repair.o \
>   				   )
>   endif
> diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
> new file mode 100644
> index 000000000000..29c736466bba
> --- /dev/null
> +++ b/fs/xfs/scrub/ialloc_repair.c
> @@ -0,0 +1,585 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2018 Oracle.  All Rights Reserved.
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_btree.h"
> +#include "xfs_bit.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans.h"
> +#include "xfs_sb.h"
> +#include "xfs_inode.h"
> +#include "xfs_alloc.h"
> +#include "xfs_ialloc.h"
> +#include "xfs_ialloc_btree.h"
> +#include "xfs_icache.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_log.h"
> +#include "xfs_trans_priv.h"
> +#include "xfs_error.h"
> +#include "scrub/xfs_scrub.h"
> +#include "scrub/scrub.h"
> +#include "scrub/common.h"
> +#include "scrub/btree.h"
> +#include "scrub/trace.h"
> +#include "scrub/repair.h"
> +
> +/*
> + * Inode Btree Repair
> + * ==================
> + *
> + * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
> + * records.  The OWN_INOBT records are the old inode btree blocks and will be
> + * cleared out after we've rebuilt the tree.  Each possible inode chunk within
> + * an OWN_INODES record will be read in and the freemask calculated from the
> + * i_mode data in the inode chunk.  For sparse inodes the holemask will be
> + * calculated by creating the properly aligned inobt record and punching out
> + * any chunk that's missing.  Inode allocations and frees grab the AGI first,
> + * so repair protects itself from concurrent access by locking the AGI.
> + *
> + * Once we've reconstructed all the inode records, we can create new inode
> + * btree roots and reload the btrees.  We rebuild both inode trees at the same
> + * time because they have the same rmap owner and it would be more complex to
> + * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
> + * blocks it owns.  We have all the data we need to build both, so dump
> + * everything and start over.
> + */
> +
> +struct xfs_repair_ialloc_extent {
> +	struct list_head		list;
> +	xfs_inofree_t			freemask;
> +	xfs_agino_t			startino;
> +	unsigned int			count;
> +	unsigned int			usedcount;
> +	uint16_t			holemask;
> +};
> +
> +struct xfs_repair_ialloc {
> +	struct list_head		*extlist;
> +	struct xfs_repair_extent_list	*btlist;
> +	struct xfs_scrub_context	*sc;
> +	uint64_t			nr_records;
> +};
> +
> +/*
> + * Is this inode in use?  If the inode is in memory we can tell from i_mode,
> + * otherwise we have to check di_mode in the on-disk buffer.  We only care
> + * that the high (i.e. non-permission) bits of _mode are zero.  This should be
> + * safe because repair keeps all AG headers locked until the end, and process
> + * trying to perform an inode allocation/free must lock the AGI.
> + */
> +STATIC int
> +xfs_repair_ialloc_check_free(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*bp,
> +	xfs_ino_t			fsino,
> +	xfs_agino_t			bpino,
> +	bool				*inuse)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_dinode		*dip;
> +	int				error;
> +
> +	/* Will the in-core inode tell us if it's in use? */
> +	error = xfs_icache_inode_is_allocated(mp, sc->tp, fsino, inuse);
> +	if (!error)
> +		return 0;
> +
> +	/* Inode uncached or half assembled, read disk buffer */
> +	dip = xfs_buf_offset(bp, bpino * mp->m_sb.sb_inodesize);
> +	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC)
> +		return -EFSCORRUPTED;
> +
> +	if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino)
> +		return -EFSCORRUPTED;
> +
> +	*inuse = dip->di_mode != 0;
> +	return 0;
> +}
> +
> +/*
> + * For each cluster in this blob of inode, we must calculate the
Ok, so I've been over this one a few times, and I still don't feel
like I've figured out what a blob of an inode is. So I'm gonna have
to break and ask for clarification on that one?  Thx! :-)

> + * properly aligned startino of that cluster, then iterate each
> + * cluster to fill in used and filled masks appropriately.  We
> + * then use the (startino, used, filled) information to construct
> + * the appropriate inode records.
> + */
> +STATIC int
> +xfs_repair_ialloc_process_cluster(
> +	struct xfs_repair_ialloc	*ri,
> +	xfs_agblock_t			agbno,
> +	int				blks_per_cluster,
> +	xfs_agino_t			rec_agino)
> +{
> +	struct xfs_imap			imap;
> +	struct xfs_repair_ialloc_extent	*rie;
> +	struct xfs_dinode		*dip;
> +	struct xfs_buf			*bp;
> +	struct xfs_scrub_context	*sc = ri->sc;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_ino_t			fsino;
> +	xfs_inofree_t			usedmask;
> +	xfs_agino_t			nr_inodes;
> +	xfs_agino_t			startino;
> +	xfs_agino_t			clusterino;
> +	xfs_agino_t			clusteroff;
> +	xfs_agino_t			agino;
> +	uint16_t			fillmask;
> +	bool				inuse;
> +	int				usedcount;
> +	int				error;
> +
> +	/* The per-AG inum of this inode cluster. */
> +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> +
> +	/* The per-AG inum of the inobt record. */
> +	startino = rec_agino + rounddown(agino - rec_agino,
> +			XFS_INODES_PER_CHUNK);
> +
> +	/* The per-AG inum of the cluster within the inobt record. */
> +	clusteroff = agino - startino;
> +
> +	/* Every inode in this holemask slot is filled. */
> +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> +	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
> +			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
> +
> +	/* Grab the inode cluster buffer. */
> +	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
> +	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
> +	imap.im_boffset = 0;
> +
> +	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
> +			XFS_IGET_UNTRUSTED);
> +	if (error)
> +		return error;
> +
> +	usedmask = 0;
> +	usedcount = 0;
> +	/* Which inodes within this cluster are free? */
> +	for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
> +		fsino = XFS_AGINO_TO_INO(mp, sc->sa.agno, agino + clusterino);
> +		error = xfs_repair_ialloc_check_free(sc, bp, fsino,
> +				clusterino, &inuse);
> +		if (error) {
> +			xfs_trans_brelse(sc->tp, bp);
> +			return error;
> +		}
> +		if (inuse) {
> +			usedcount++;
> +			usedmask |= XFS_INOBT_MASK(clusteroff + clusterino);
> +		}
> +	}
> +	xfs_trans_brelse(sc->tp, bp);
> +
> +	/*
> +	 * If the last item in the list is our chunk record,
> +	 * update that.
> +	 */
> +	if (!list_empty(ri->extlist)) {
> +		rie = list_last_entry(ri->extlist,
> +				struct xfs_repair_ialloc_extent, list);
> +		if (rie->startino + XFS_INODES_PER_CHUNK > startino) {
> +			rie->freemask &= ~usedmask;
> +			rie->holemask &= ~fillmask;
> +			rie->count += nr_inodes;
> +			rie->usedcount += usedcount;
> +			return 0;
> +		}
> +	}
> +
> +	/* New inode chunk; add to the list. */
> +	rie = kmem_alloc(sizeof(struct xfs_repair_ialloc_extent), KM_MAYFAIL);
> +	if (!rie)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&rie->list);
> +	rie->startino = startino;
> +	rie->freemask = XFS_INOBT_ALL_FREE & ~usedmask;
> +	rie->holemask = XFS_INOBT_ALL_FREE & ~fillmask;
> +	rie->count = nr_inodes;
> +	rie->usedcount = usedcount;
> +	list_add_tail(&rie->list, ri->extlist);
> +	ri->nr_records++;
> +
> +	return 0;
> +}
> +
> +/* Record extents that belong to inode btrees. */
> +STATIC int
> +xfs_repair_ialloc_extent_fn(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_rmap_irec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_repair_ialloc	*ri = priv;
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	xfs_fsblock_t			fsbno;
> +	xfs_agblock_t			agbno = rec->rm_startblock;
> +	xfs_agino_t			inoalign;
> +	xfs_agino_t			agino;
> +	xfs_agino_t			rec_agino;
> +	int				blks_per_cluster;
> +	int				error = 0;
> +
> +	if (xfs_scrub_should_terminate(ri->sc, &error))
> +		return error;
> +
> +	/* Fragment of the old btrees; dispose of them later. */
> +	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
> +		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
> +		return xfs_repair_collect_btree_extent(ri->sc, ri->btlist,
> +				fsbno, rec->rm_blockcount);
> +	}
> +
> +	/* Skip extents which are not owned by this inode and fork. */
> +	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
> +		return 0;
> +
> +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> +
> +	if (agbno % blks_per_cluster != 0)
> +		return -EFSCORRUPTED;
> +
> +	trace_xfs_repair_ialloc_extent_fn(mp, ri->sc->sa.agno,
> +			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
> +			rec->rm_offset, rec->rm_flags);
> +
> +	/*
> +	 * Determine the inode block alignment, and where the block
> +	 * ought to start if it's aligned properly.  On a sparse inode
> +	 * system the rmap doesn't have to start on an alignment boundary,
> +	 * but the record does.  On pre-sparse filesystems, we /must/
> +	 * start both rmap and inobt on an alignment boundary.
> +	 */
> +	inoalign = xfs_ialloc_cluster_alignment(mp);
> +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> +	rec_agino = XFS_OFFBNO_TO_AGINO(mp, rounddown(agbno, inoalign), 0);
> +	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) && agino != rec_agino)
> +		return -EFSCORRUPTED;
> +
> +	/* Set up the free/hole masks for each cluster in this inode chunk. */
By chunk you did you mean record?  Please try to keep terminology
consistent as best you can.  Thx! :-)

> +	for (;
> +	     agbno < rec->rm_startblock + rec->rm_blockcount;
> +	     agbno += blks_per_cluster) {
> +		error = xfs_repair_ialloc_process_cluster(ri, agbno,
> +				blks_per_cluster, rec_agino);
> +		if (error)
> +			return error;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Compare two ialloc extents. */
> +static int
> +xfs_repair_ialloc_extent_cmp(
> +	void				*priv,
> +	struct list_head		*a,
> +	struct list_head		*b)
> +{
> +	struct xfs_repair_ialloc_extent	*ap;
> +	struct xfs_repair_ialloc_extent	*bp;
> +
> +	ap = container_of(a, struct xfs_repair_ialloc_extent, list);
> +	bp = container_of(b, struct xfs_repair_ialloc_extent, list);
> +
> +	if (ap->startino > bp->startino)
> +		return 1;
> +	else if (ap->startino < bp->startino)
> +		return -1;
> +	return 0;
> +}
> +
> +/* Insert an inode chunk record into a given btree. */
> +static int
> +xfs_repair_iallocbt_insert_btrec(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_repair_ialloc_extent	*rie)
> +{
> +	int				stat;
> +	int				error;
> +
> +	error = xfs_inobt_lookup(cur, rie->startino, XFS_LOOKUP_EQ, &stat);
> +	if (error)
> +		return error;
> +	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 0);
> +	error = xfs_inobt_insert_rec(cur, rie->holemask, rie->count,
> +			rie->count - rie->usedcount, rie->freemask, &stat);
> +	if (error)
> +		return error;
> +	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
> +	return error;
> +}
> +
> +/* Insert an inode chunk record into both inode btrees. */
> +static int
> +xfs_repair_iallocbt_insert_rec(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_repair_ialloc_extent	*rie)
> +{
> +	struct xfs_btree_cur		*cur;
> +	int				error;
> +
> +	trace_xfs_repair_ialloc_insert(sc->mp, sc->sa.agno, rie->startino,
> +			rie->holemask, rie->count, rie->count - rie->usedcount,
> +			rie->freemask);
> +
> +	/* Insert into the inobt. */
> +	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
> +			XFS_BTNUM_INO);
> +	error = xfs_repair_iallocbt_insert_btrec(cur, rie);
> +	if (error)
> +		goto out_cur;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/* Insert into the finobt if chunk has free inodes. */
> +	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb) &&
> +	    rie->count != rie->usedcount) {
> +		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
> +				sc->sa.agno, XFS_BTNUM_FINO);
> +		error = xfs_repair_iallocbt_insert_btrec(cur, rie);
> +		if (error)
> +			goto out_cur;
> +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	}
> +
> +	return xfs_repair_roll_ag_trans(sc);
> +out_cur:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
> +
> +/* Free every record in the inode list. */
> +STATIC void
> +xfs_repair_iallocbt_cancel_inorecs(
> +	struct list_head		*reclist)
> +{
> +	struct xfs_repair_ialloc_extent	*rie;
> +	struct xfs_repair_ialloc_extent	*n;
> +
> +	list_for_each_entry_safe(rie, n, reclist, list) {
> +		list_del(&rie->list);
> +		kmem_free(rie);
> +	}
> +}
> +
> +/*
> + * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode
> + * btrees (OWN_INOBT).  Figure out if we have enough free space to reconstruct
> + * the inode btrees.  The caller must clean up the lists if anything goes
> + * wrong.
> + */
> +STATIC int
> +xfs_repair_iallocbt_find_inodes(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*inode_records,
> +	struct xfs_repair_extent_list	*old_iallocbt_blocks)
> +{
> +	struct xfs_repair_ialloc	ri;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_btree_cur		*cur;
> +	xfs_agblock_t			nr_blocks;
> +	int				error;
> +
> +	/* Collect all reverse mappings for inode blocks. */
> +	ri.extlist = inode_records;
> +	ri.btlist = old_iallocbt_blocks;
> +	ri.nr_records = 0;
> +	ri.sc = sc;
> +
> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
> +	error = xfs_rmap_query_all(cur, xfs_repair_ialloc_extent_fn, &ri);
> +	if (error)
> +		goto err;
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +
> +	/* Do we actually have enough space to do this? */
> +	nr_blocks = xfs_iallocbt_calc_size(mp, ri.nr_records);
> +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> +		nr_blocks *= 2;
> +	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE))
> +		return -ENOSPC;
> +
> +	return 0;
> +
> +err:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	return error;
> +}
> +
> +/* Update the AGI counters. */
> +STATIC int
> +xfs_repair_iallocbt_reset_counters(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*inode_records,
> +	int				*log_flags)
> +{
> +	struct xfs_agi			*agi;
> +	struct xfs_repair_ialloc_extent	*rie;
> +	unsigned int			count = 0;
> +	unsigned int			usedcount = 0;
> +	unsigned int			freecount;
> +
> +	/* Figure out the new counters. */
> +	list_for_each_entry(rie, inode_records, list) {
> +		count += rie->count;
> +		usedcount += rie->usedcount;
> +	}
> +
> +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> +	freecount = count - usedcount;
> +
> +	/* XXX: trigger inode count recalculation */
> +
> +	/* Reset the per-AG info, both incore and ondisk. */
> +	sc->sa.pag->pagi_count = count;
> +	sc->sa.pag->pagi_freecount = freecount;
> +	agi->agi_count = cpu_to_be32(count);
> +	agi->agi_freecount = cpu_to_be32(freecount);
> +	*log_flags |= XFS_AGI_COUNT | XFS_AGI_FREECOUNT;
> +
> +	return 0;
> +}
> +
> +/* Initialize new inobt/finobt roots and implant them into the AGI. */
> +STATIC int
> +xfs_repair_iallocbt_reset_btrees(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_owner_info		*oinfo,
> +	int				*log_flags)
> +{
> +	struct xfs_agi			*agi;
> +	struct xfs_buf			*bp;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_fsblock_t			inofsb;
> +	xfs_fsblock_t			finofsb;
> +	enum xfs_ag_resv_type		resv;
> +	int				error;
> +
> +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> +
> +	/* Initialize new inobt root. */
> +	resv = XFS_AG_RESV_NONE;
> +	error = xfs_repair_alloc_ag_block(sc, oinfo, &inofsb, resv);
> +	if (error)
> +		return error;
> +	error = xfs_repair_init_btblock(sc, inofsb, &bp, XFS_BTNUM_INO,
> +			&xfs_inobt_buf_ops);
> +	if (error)
> +		return error;
> +	agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, inofsb));
> +	agi->agi_level = cpu_to_be32(1);
> +	*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
> +
> +	/* Initialize new finobt root. */
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return 0;
> +
> +	resv = mp->m_inotbt_nores ? XFS_AG_RESV_NONE : XFS_AG_RESV_METADATA;
> +	error = xfs_repair_alloc_ag_block(sc, oinfo, &finofsb, resv);
> +	if (error)
> +		return error;
> +	error = xfs_repair_init_btblock(sc, finofsb, &bp, XFS_BTNUM_FINO,
> +			&xfs_inobt_buf_ops);
> +	if (error)
> +		return error;
> +	agi->agi_free_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, finofsb));
> +	agi->agi_free_level = cpu_to_be32(1);
> +	*log_flags |= XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
> +
> +	return 0;
> +}
> +
> +/* Build new inode btrees and dispose of the old one. */
> +STATIC int
> +xfs_repair_iallocbt_rebuild_trees(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*inode_records,
> +	struct xfs_owner_info		*oinfo,
> +	struct xfs_repair_extent_list	*old_iallocbt_blocks)
> +{
> +	struct xfs_repair_ialloc_extent	*rie;
> +	struct xfs_repair_ialloc_extent	*n;
> +	int				error;
> +
> +	/* Add all records. */
> +	list_sort(NULL, inode_records, xfs_repair_ialloc_extent_cmp);
> +	list_for_each_entry_safe(rie, n, inode_records, list) {
> +		error = xfs_repair_iallocbt_insert_rec(sc, rie);
> +		if (error)
> +			return error;
> +
> +		list_del(&rie->list);
> +		kmem_free(rie);
> +	}
> +
> +	/* Free the old inode btree blocks if they're not in use. */
> +	return xfs_repair_reap_btree_extents(sc, old_iallocbt_blocks, oinfo,
> +			XFS_AG_RESV_NONE);
> +}
> +
> +/* Repair both inode btrees. */
> +int
> +xfs_repair_iallocbt(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct list_head		inode_records;
> +	struct xfs_repair_extent_list	old_iallocbt_blocks;
> +	struct xfs_mount		*mp = sc->mp;
> +	int				log_flags = 0;
> +	int				error = 0;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +
> +	/* Collect the free space data and find the old btree blocks. */
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> +	INIT_LIST_HEAD(&inode_records);
> +	xfs_repair_init_extent_list(&old_iallocbt_blocks);
> +	error = xfs_repair_iallocbt_find_inodes(sc, &inode_records,
> +			&old_iallocbt_blocks);
> +	if (error)
> +		goto out;
> +
> +	/*
> +	 * Blow out the old inode btrees.  This is the point at which
> +	 * we are no longer able to bail out gracefully.
> +	 */
> +	error = xfs_repair_iallocbt_reset_counters(sc, &inode_records,
> +			&log_flags);
> +	if (error)
> +		goto out;
> +	error = xfs_repair_iallocbt_reset_btrees(sc, &oinfo, &log_flags);
> +	if (error)
> +		goto out;
> +	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
> +
> +	/* Invalidate all the inobt/finobt blocks in btlist. */
> +	error = xfs_repair_invalidate_blocks(sc, &old_iallocbt_blocks);
> +	if (error)
> +		goto out;
> +	error = xfs_repair_roll_ag_trans(sc);
> +	if (error)
> +		goto out;
> +
> +	/* Now rebuild the inode information. */
> +	error = xfs_repair_iallocbt_rebuild_trees(sc, &inode_records, &oinfo,
> +			&old_iallocbt_blocks);
> +out:
> +	xfs_repair_cancel_btree_extents(sc, &old_iallocbt_blocks);
> +	xfs_repair_iallocbt_cancel_inorecs(&inode_records);
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index e5f67fc68e9a..dcfa5eb18940 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -104,6 +104,7 @@ int xfs_repair_agf(struct xfs_scrub_context *sc);
>   int xfs_repair_agfl(struct xfs_scrub_context *sc);
>   int xfs_repair_agi(struct xfs_scrub_context *sc);
>   int xfs_repair_allocbt(struct xfs_scrub_context *sc);
> +int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
>   
>   #else
>   
> @@ -131,6 +132,7 @@ xfs_repair_calc_ag_resblks(
>   #define xfs_repair_agfl			xfs_repair_notsupported
>   #define xfs_repair_agi			xfs_repair_notsupported
>   #define xfs_repair_allocbt		xfs_repair_notsupported
> +#define xfs_repair_iallocbt		xfs_repair_notsupported
>   
>   #endif /* CONFIG_XFS_ONLINE_REPAIR */
>   
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 7a55b20b7e4e..fec0e130f19e 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -238,14 +238,14 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>   		.type	= ST_PERAG,
>   		.setup	= xfs_scrub_setup_ag_iallocbt,
>   		.scrub	= xfs_scrub_inobt,
> -		.repair	= xfs_repair_notsupported,
> +		.repair	= xfs_repair_iallocbt,
>   	},
>   	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
>   		.type	= ST_PERAG,
>   		.setup	= xfs_scrub_setup_ag_iallocbt,
>   		.scrub	= xfs_scrub_finobt,
>   		.has	= xfs_sb_version_hasfinobt,
> -		.repair	= xfs_repair_notsupported,
> +		.repair	= xfs_repair_iallocbt,
>   	},
>   	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
>   		.type	= ST_PERAG,
> 

Ok, some parts took some time to figure out, but I think I understand
the overall idea.  The comments help, and if you could add in a little
extra detail describing the function parameters, I think it would help
to add more supporting context to your comments.  Thx!

Allison

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=fIL2s7bIVyQHhkt6FVjoAC9YFnsVQMVUbz6DfuinhZs&s=m56pNZbCxuiPzbhEv3nD5G2PqN_7BLoQhkXF1E-CTzY&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 07/21] xfs: repair inode btrees
  2018-06-30 17:36   ` Allison Henderson
@ 2018-06-30 18:30     ` Darrick J. Wong
  2018-07-01  0:45       ` Allison Henderson
  0 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-06-30 18:30 UTC (permalink / raw)
  To: Allison Henderson; +Cc: linux-xfs

On Sat, Jun 30, 2018 at 10:36:23AM -0700, Allison Henderson wrote:
> On 06/24/2018 12:24 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Use the rmapbt to find inode chunks, query the chunks to compute
> > hole and free masks, and with that information rebuild the inobt
> > and finobt.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >   fs/xfs/Makefile              |    1
> >   fs/xfs/scrub/ialloc_repair.c |  585 ++++++++++++++++++++++++++++++++++++++++++
> >   fs/xfs/scrub/repair.h        |    2
> >   fs/xfs/scrub/scrub.c         |    4
> >   4 files changed, 590 insertions(+), 2 deletions(-)
> >   create mode 100644 fs/xfs/scrub/ialloc_repair.c
> > 
> > 
> > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > index 841e0824eeb6..837fd4a95f6f 100644
> > --- a/fs/xfs/Makefile
> > +++ b/fs/xfs/Makefile
> > @@ -165,6 +165,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
> >   xfs-y				+= $(addprefix scrub/, \
> >   				   agheader_repair.o \
> >   				   alloc_repair.o \
> > +				   ialloc_repair.o \
> >   				   repair.o \
> >   				   )
> >   endif
> > diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
> > new file mode 100644
> > index 000000000000..29c736466bba
> > --- /dev/null
> > +++ b/fs/xfs/scrub/ialloc_repair.c
> > @@ -0,0 +1,585 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +/*
> > + * Copyright (C) 2018 Oracle.  All Rights Reserved.
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_btree.h"
> > +#include "xfs_bit.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_sb.h"
> > +#include "xfs_inode.h"
> > +#include "xfs_alloc.h"
> > +#include "xfs_ialloc.h"
> > +#include "xfs_ialloc_btree.h"
> > +#include "xfs_icache.h"
> > +#include "xfs_rmap.h"
> > +#include "xfs_rmap_btree.h"
> > +#include "xfs_log.h"
> > +#include "xfs_trans_priv.h"
> > +#include "xfs_error.h"
> > +#include "scrub/xfs_scrub.h"
> > +#include "scrub/scrub.h"
> > +#include "scrub/common.h"
> > +#include "scrub/btree.h"
> > +#include "scrub/trace.h"
> > +#include "scrub/repair.h"
> > +
> > +/*
> > + * Inode Btree Repair
> > + * ==================
> > + *
> > + * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
> > + * records.  The OWN_INOBT records are the old inode btree blocks and will be
> > + * cleared out after we've rebuilt the tree.  Each possible inode chunk within
> > + * an OWN_INODES record will be read in and the freemask calculated from the
> > + * i_mode data in the inode chunk.  For sparse inodes the holemask will be
> > + * calculated by creating the properly aligned inobt record and punching out
> > + * any chunk that's missing.  Inode allocations and frees grab the AGI first,
> > + * so repair protects itself from concurrent access by locking the AGI.
> > + *
> > + * Once we've reconstructed all the inode records, we can create new inode
> > + * btree roots and reload the btrees.  We rebuild both inode trees at the same
> > + * time because they have the same rmap owner and it would be more complex to
> > + * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
> > + * blocks it owns.  We have all the data we need to build both, so dump
> > + * everything and start over.
> > + */
> > +
> > +struct xfs_repair_ialloc_extent {
> > +	struct list_head		list;
> > +	xfs_inofree_t			freemask;
> > +	xfs_agino_t			startino;
> > +	unsigned int			count;
> > +	unsigned int			usedcount;
> > +	uint16_t			holemask;
> > +};
> > +
> > +struct xfs_repair_ialloc {
> > +	struct list_head		*extlist;
> > +	struct xfs_repair_extent_list	*btlist;
> > +	struct xfs_scrub_context	*sc;
> > +	uint64_t			nr_records;
> > +};
> > +
> > +/*
> > + * Is this inode in use?  If the inode is in memory we can tell from i_mode,
> > + * otherwise we have to check di_mode in the on-disk buffer.  We only care
> > + * that the high (i.e. non-permission) bits of _mode are zero.  This should be
> > + * safe because repair keeps all AG headers locked until the end, and process
> > + * trying to perform an inode allocation/free must lock the AGI.
> > + */
> > +STATIC int
> > +xfs_repair_ialloc_check_free(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_buf			*bp,
> > +	xfs_ino_t			fsino,
> > +	xfs_agino_t			bpino,
> > +	bool				*inuse)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_dinode		*dip;
> > +	int				error;
> > +
> > +	/* Will the in-core inode tell us if it's in use? */
> > +	error = xfs_icache_inode_is_allocated(mp, sc->tp, fsino, inuse);
> > +	if (!error)
> > +		return 0;
> > +
> > +	/* Inode uncached or half assembled, read disk buffer */
> > +	dip = xfs_buf_offset(bp, bpino * mp->m_sb.sb_inodesize);
> > +	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC)
> > +		return -EFSCORRUPTED;
> > +
> > +	if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino)
> > +		return -EFSCORRUPTED;
> > +
> > +	*inuse = dip->di_mode != 0;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * For each cluster in this blob of inode, we must calculate the
> Ok, so I've been over this one a few times, and I still don't feel
> like I've figured out what a blob of an inode is. So I'm gonna have
> to break and ask for clarification on that one?  Thx! :-)

Heh, sorry.

"For each inode cluster covering the physical extent recorded by the
rmapbt, we must calculate..."

> > + * properly aligned startino of that cluster, then iterate each
> > + * cluster to fill in used and filled masks appropriately.  We
> > + * then use the (startino, used, filled) information to construct
> > + * the appropriate inode records.
> > + */
> > +STATIC int
> > +xfs_repair_ialloc_process_cluster(
> > +	struct xfs_repair_ialloc	*ri,
> > +	xfs_agblock_t			agbno,
> > +	int				blks_per_cluster,
> > +	xfs_agino_t			rec_agino)
> > +{
> > +	struct xfs_imap			imap;
> > +	struct xfs_repair_ialloc_extent	*rie;
> > +	struct xfs_dinode		*dip;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_scrub_context	*sc = ri->sc;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_ino_t			fsino;
> > +	xfs_inofree_t			usedmask;
> > +	xfs_agino_t			nr_inodes;
> > +	xfs_agino_t			startino;
> > +	xfs_agino_t			clusterino;
> > +	xfs_agino_t			clusteroff;
> > +	xfs_agino_t			agino;
> > +	uint16_t			fillmask;
> > +	bool				inuse;
> > +	int				usedcount;
> > +	int				error;
> > +
> > +	/* The per-AG inum of this inode cluster. */
> > +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> > +
> > +	/* The per-AG inum of the inobt record. */
> > +	startino = rec_agino + rounddown(agino - rec_agino,
> > +			XFS_INODES_PER_CHUNK);
> > +
> > +	/* The per-AG inum of the cluster within the inobt record. */
> > +	clusteroff = agino - startino;
> > +
> > +	/* Every inode in this holemask slot is filled. */
> > +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> > +	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
> > +			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
> > +
> > +	/* Grab the inode cluster buffer. */
> > +	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
> > +	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
> > +	imap.im_boffset = 0;
> > +
> > +	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
> > +			XFS_IGET_UNTRUSTED);
> > +	if (error)
> > +		return error;
> > +
> > +	usedmask = 0;
> > +	usedcount = 0;
> > +	/* Which inodes within this cluster are free? */
> > +	for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
> > +		fsino = XFS_AGINO_TO_INO(mp, sc->sa.agno, agino + clusterino);
> > +		error = xfs_repair_ialloc_check_free(sc, bp, fsino,
> > +				clusterino, &inuse);
> > +		if (error) {
> > +			xfs_trans_brelse(sc->tp, bp);
> > +			return error;
> > +		}
> > +		if (inuse) {
> > +			usedcount++;
> > +			usedmask |= XFS_INOBT_MASK(clusteroff + clusterino);
> > +		}
> > +	}
> > +	xfs_trans_brelse(sc->tp, bp);
> > +
> > +	/*
> > +	 * If the last item in the list is our chunk record,
> > +	 * update that.
> > +	 */
> > +	if (!list_empty(ri->extlist)) {
> > +		rie = list_last_entry(ri->extlist,
> > +				struct xfs_repair_ialloc_extent, list);
> > +		if (rie->startino + XFS_INODES_PER_CHUNK > startino) {
> > +			rie->freemask &= ~usedmask;
> > +			rie->holemask &= ~fillmask;
> > +			rie->count += nr_inodes;
> > +			rie->usedcount += usedcount;
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	/* New inode chunk; add to the list. */
> > +	rie = kmem_alloc(sizeof(struct xfs_repair_ialloc_extent), KM_MAYFAIL);
> > +	if (!rie)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&rie->list);
> > +	rie->startino = startino;
> > +	rie->freemask = XFS_INOBT_ALL_FREE & ~usedmask;
> > +	rie->holemask = XFS_INOBT_ALL_FREE & ~fillmask;
> > +	rie->count = nr_inodes;
> > +	rie->usedcount = usedcount;
> > +	list_add_tail(&rie->list, ri->extlist);
> > +	ri->nr_records++;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Record extents that belong to inode btrees. */
> > +STATIC int
> > +xfs_repair_ialloc_extent_fn(
> > +	struct xfs_btree_cur		*cur,
> > +	struct xfs_rmap_irec		*rec,
> > +	void				*priv)
> > +{
> > +	struct xfs_repair_ialloc	*ri = priv;
> > +	struct xfs_mount		*mp = cur->bc_mp;
> > +	xfs_fsblock_t			fsbno;
> > +	xfs_agblock_t			agbno = rec->rm_startblock;
> > +	xfs_agino_t			inoalign;
> > +	xfs_agino_t			agino;
> > +	xfs_agino_t			rec_agino;
> > +	int				blks_per_cluster;
> > +	int				error = 0;
> > +
> > +	if (xfs_scrub_should_terminate(ri->sc, &error))
> > +		return error;
> > +
> > +	/* Fragment of the old btrees; dispose of them later. */
> > +	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
> > +		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
> > +		return xfs_repair_collect_btree_extent(ri->sc, ri->btlist,
> > +				fsbno, rec->rm_blockcount);
> > +	}
> > +
> > +	/* Skip extents which are not owned by this inode and fork. */
> > +	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
> > +		return 0;
> > +
> > +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> > +
> > +	if (agbno % blks_per_cluster != 0)
> > +		return -EFSCORRUPTED;
> > +
> > +	trace_xfs_repair_ialloc_extent_fn(mp, ri->sc->sa.agno,
> > +			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
> > +			rec->rm_offset, rec->rm_flags);
> > +
> > +	/*
> > +	 * Determine the inode block alignment, and where the block
> > +	 * ought to start if it's aligned properly.  On a sparse inode
> > +	 * system the rmap doesn't have to start on an alignment boundary,
> > +	 * but the record does.  On pre-sparse filesystems, we /must/
> > +	 * start both rmap and inobt on an alignment boundary.
> > +	 */
> > +	inoalign = xfs_ialloc_cluster_alignment(mp);
> > +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> > +	rec_agino = XFS_OFFBNO_TO_AGINO(mp, rounddown(agbno, inoalign), 0);
> > +	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) && agino != rec_agino)
> > +		return -EFSCORRUPTED;
> > +
> > +	/* Set up the free/hole masks for each cluster in this inode chunk. */
> By chunk you did you mean record?  Please try to keep terminology
> consistent as best you can.  Thx! :-)

Yikes, that /is/ a misleading comment.

"Set up the free/hole masks for each inode cluster that could be mapped
by this rmap record."

> > +	for (;
> > +	     agbno < rec->rm_startblock + rec->rm_blockcount;
> > +	     agbno += blks_per_cluster) {
> > +		error = xfs_repair_ialloc_process_cluster(ri, agbno,
> > +				blks_per_cluster, rec_agino);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/* Compare two ialloc extents. */
> > +static int
> > +xfs_repair_ialloc_extent_cmp(
> > +	void				*priv,
> > +	struct list_head		*a,
> > +	struct list_head		*b)
> > +{
> > +	struct xfs_repair_ialloc_extent	*ap;
> > +	struct xfs_repair_ialloc_extent	*bp;
> > +
> > +	ap = container_of(a, struct xfs_repair_ialloc_extent, list);
> > +	bp = container_of(b, struct xfs_repair_ialloc_extent, list);
> > +
> > +	if (ap->startino > bp->startino)
> > +		return 1;
> > +	else if (ap->startino < bp->startino)
> > +		return -1;
> > +	return 0;
> > +}
> > +
> > +/* Insert an inode chunk record into a given btree. */
> > +static int
> > +xfs_repair_iallocbt_insert_btrec(
> > +	struct xfs_btree_cur		*cur,
> > +	struct xfs_repair_ialloc_extent	*rie)
> > +{
> > +	int				stat;
> > +	int				error;
> > +
> > +	error = xfs_inobt_lookup(cur, rie->startino, XFS_LOOKUP_EQ, &stat);
> > +	if (error)
> > +		return error;
> > +	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 0);
> > +	error = xfs_inobt_insert_rec(cur, rie->holemask, rie->count,
> > +			rie->count - rie->usedcount, rie->freemask, &stat);
> > +	if (error)
> > +		return error;
> > +	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
> > +	return error;
> > +}
> > +
> > +/* Insert an inode chunk record into both inode btrees. */
> > +static int
> > +xfs_repair_iallocbt_insert_rec(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_repair_ialloc_extent	*rie)
> > +{
> > +	struct xfs_btree_cur		*cur;
> > +	int				error;
> > +
> > +	trace_xfs_repair_ialloc_insert(sc->mp, sc->sa.agno, rie->startino,
> > +			rie->holemask, rie->count, rie->count - rie->usedcount,
> > +			rie->freemask);
> > +
> > +	/* Insert into the inobt. */
> > +	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
> > +			XFS_BTNUM_INO);
> > +	error = xfs_repair_iallocbt_insert_btrec(cur, rie);
> > +	if (error)
> > +		goto out_cur;
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +
> > +	/* Insert into the finobt if chunk has free inodes. */
> > +	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb) &&
> > +	    rie->count != rie->usedcount) {
> > +		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
> > +				sc->sa.agno, XFS_BTNUM_FINO);
> > +		error = xfs_repair_iallocbt_insert_btrec(cur, rie);
> > +		if (error)
> > +			goto out_cur;
> > +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +	}
> > +
> > +	return xfs_repair_roll_ag_trans(sc);
> > +out_cur:
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > +	return error;
> > +}
> > +
> > +/* Free every record in the inode list. */
> > +STATIC void
> > +xfs_repair_iallocbt_cancel_inorecs(
> > +	struct list_head		*reclist)
> > +{
> > +	struct xfs_repair_ialloc_extent	*rie;
> > +	struct xfs_repair_ialloc_extent	*n;
> > +
> > +	list_for_each_entry_safe(rie, n, reclist, list) {
> > +		list_del(&rie->list);
> > +		kmem_free(rie);
> > +	}
> > +}
> > +
> > +/*
> > + * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode
> > + * btrees (OWN_INOBT).  Figure out if we have enough free space to reconstruct
> > + * the inode btrees.  The caller must clean up the lists if anything goes
> > + * wrong.
> > + */
> > +STATIC int
> > +xfs_repair_iallocbt_find_inodes(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*inode_records,
> > +	struct xfs_repair_extent_list	*old_iallocbt_blocks)
> > +{
> > +	struct xfs_repair_ialloc	ri;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_btree_cur		*cur;
> > +	xfs_agblock_t			nr_blocks;
> > +	int				error;
> > +
> > +	/* Collect all reverse mappings for inode blocks. */
> > +	ri.extlist = inode_records;
> > +	ri.btlist = old_iallocbt_blocks;
> > +	ri.nr_records = 0;
> > +	ri.sc = sc;
> > +
> > +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
> > +	error = xfs_rmap_query_all(cur, xfs_repair_ialloc_extent_fn, &ri);
> > +	if (error)
> > +		goto err;
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +
> > +	/* Do we actually have enough space to do this? */
> > +	nr_blocks = xfs_iallocbt_calc_size(mp, ri.nr_records);
> > +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> > +		nr_blocks *= 2;
> > +	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE))
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +
> > +err:
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > +	return error;
> > +}
> > +
> > +/* Update the AGI counters. */
> > +STATIC int
> > +xfs_repair_iallocbt_reset_counters(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*inode_records,
> > +	int				*log_flags)
> > +{
> > +	struct xfs_agi			*agi;
> > +	struct xfs_repair_ialloc_extent	*rie;
> > +	unsigned int			count = 0;
> > +	unsigned int			usedcount = 0;
> > +	unsigned int			freecount;
> > +
> > +	/* Figure out the new counters. */
> > +	list_for_each_entry(rie, inode_records, list) {
> > +		count += rie->count;
> > +		usedcount += rie->usedcount;
> > +	}
> > +
> > +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> > +	freecount = count - usedcount;
> > +
> > +	/* XXX: trigger inode count recalculation */
> > +
> > +	/* Reset the per-AG info, both incore and ondisk. */
> > +	sc->sa.pag->pagi_count = count;
> > +	sc->sa.pag->pagi_freecount = freecount;
> > +	agi->agi_count = cpu_to_be32(count);
> > +	agi->agi_freecount = cpu_to_be32(freecount);
> > +	*log_flags |= XFS_AGI_COUNT | XFS_AGI_FREECOUNT;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Initialize new inobt/finobt roots and implant them into the AGI. */
> > +STATIC int
> > +xfs_repair_iallocbt_reset_btrees(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_owner_info		*oinfo,
> > +	int				*log_flags)
> > +{
> > +	struct xfs_agi			*agi;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_fsblock_t			inofsb;
> > +	xfs_fsblock_t			finofsb;
> > +	enum xfs_ag_resv_type		resv;
> > +	int				error;
> > +
> > +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> > +
> > +	/* Initialize new inobt root. */
> > +	resv = XFS_AG_RESV_NONE;
> > +	error = xfs_repair_alloc_ag_block(sc, oinfo, &inofsb, resv);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_init_btblock(sc, inofsb, &bp, XFS_BTNUM_INO,
> > +			&xfs_inobt_buf_ops);
> > +	if (error)
> > +		return error;
> > +	agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, inofsb));
> > +	agi->agi_level = cpu_to_be32(1);
> > +	*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
> > +
> > +	/* Initialize new finobt root. */
> > +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> > +		return 0;
> > +
> > +	resv = mp->m_inotbt_nores ? XFS_AG_RESV_NONE : XFS_AG_RESV_METADATA;
> > +	error = xfs_repair_alloc_ag_block(sc, oinfo, &finofsb, resv);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_init_btblock(sc, finofsb, &bp, XFS_BTNUM_FINO,
> > +			&xfs_inobt_buf_ops);
> > +	if (error)
> > +		return error;
> > +	agi->agi_free_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, finofsb));
> > +	agi->agi_free_level = cpu_to_be32(1);
> > +	*log_flags |= XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Build new inode btrees and dispose of the old one. */
> > +STATIC int
> > +xfs_repair_iallocbt_rebuild_trees(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*inode_records,
> > +	struct xfs_owner_info		*oinfo,
> > +	struct xfs_repair_extent_list	*old_iallocbt_blocks)
> > +{
> > +	struct xfs_repair_ialloc_extent	*rie;
> > +	struct xfs_repair_ialloc_extent	*n;
> > +	int				error;
> > +
> > +	/* Add all records. */
> > +	list_sort(NULL, inode_records, xfs_repair_ialloc_extent_cmp);
> > +	list_for_each_entry_safe(rie, n, inode_records, list) {
> > +		error = xfs_repair_iallocbt_insert_rec(sc, rie);
> > +		if (error)
> > +			return error;
> > +
> > +		list_del(&rie->list);
> > +		kmem_free(rie);
> > +	}
> > +
> > +	/* Free the old inode btree blocks if they're not in use. */
> > +	return xfs_repair_reap_btree_extents(sc, old_iallocbt_blocks, oinfo,
> > +			XFS_AG_RESV_NONE);
> > +}
> > +
> > +/* Repair both inode btrees. */
> > +int
> > +xfs_repair_iallocbt(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct list_head		inode_records;
> > +	struct xfs_repair_extent_list	old_iallocbt_blocks;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				log_flags = 0;
> > +	int				error = 0;
> > +
> > +	/* We require the rmapbt to rebuild anything. */
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> > +
> > +	/* Collect the free space data and find the old btree blocks. */
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> > +	INIT_LIST_HEAD(&inode_records);
> > +	xfs_repair_init_extent_list(&old_iallocbt_blocks);
> > +	error = xfs_repair_iallocbt_find_inodes(sc, &inode_records,
> > +			&old_iallocbt_blocks);
> > +	if (error)
> > +		goto out;
> > +
> > +	/*
> > +	 * Blow out the old inode btrees.  This is the point at which
> > +	 * we are no longer able to bail out gracefully.
> > +	 */
> > +	error = xfs_repair_iallocbt_reset_counters(sc, &inode_records,
> > +			&log_flags);
> > +	if (error)
> > +		goto out;
> > +	error = xfs_repair_iallocbt_reset_btrees(sc, &oinfo, &log_flags);
> > +	if (error)
> > +		goto out;
> > +	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
> > +
> > +	/* Invalidate all the inobt/finobt blocks in btlist. */
> > +	error = xfs_repair_invalidate_blocks(sc, &old_iallocbt_blocks);
> > +	if (error)
> > +		goto out;
> > +	error = xfs_repair_roll_ag_trans(sc);
> > +	if (error)
> > +		goto out;
> > +
> > +	/* Now rebuild the inode information. */
> > +	error = xfs_repair_iallocbt_rebuild_trees(sc, &inode_records, &oinfo,
> > +			&old_iallocbt_blocks);
> > +out:
> > +	xfs_repair_cancel_btree_extents(sc, &old_iallocbt_blocks);
> > +	xfs_repair_iallocbt_cancel_inorecs(&inode_records);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > index e5f67fc68e9a..dcfa5eb18940 100644
> > --- a/fs/xfs/scrub/repair.h
> > +++ b/fs/xfs/scrub/repair.h
> > @@ -104,6 +104,7 @@ int xfs_repair_agf(struct xfs_scrub_context *sc);
> >   int xfs_repair_agfl(struct xfs_scrub_context *sc);
> >   int xfs_repair_agi(struct xfs_scrub_context *sc);
> >   int xfs_repair_allocbt(struct xfs_scrub_context *sc);
> > +int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
> >   #else
> > @@ -131,6 +132,7 @@ xfs_repair_calc_ag_resblks(
> >   #define xfs_repair_agfl			xfs_repair_notsupported
> >   #define xfs_repair_agi			xfs_repair_notsupported
> >   #define xfs_repair_allocbt		xfs_repair_notsupported
> > +#define xfs_repair_iallocbt		xfs_repair_notsupported
> >   #endif /* CONFIG_XFS_ONLINE_REPAIR */
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 7a55b20b7e4e..fec0e130f19e 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -238,14 +238,14 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
> >   		.type	= ST_PERAG,
> >   		.setup	= xfs_scrub_setup_ag_iallocbt,
> >   		.scrub	= xfs_scrub_inobt,
> > -		.repair	= xfs_repair_notsupported,
> > +		.repair	= xfs_repair_iallocbt,
> >   	},
> >   	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
> >   		.type	= ST_PERAG,
> >   		.setup	= xfs_scrub_setup_ag_iallocbt,
> >   		.scrub	= xfs_scrub_finobt,
> >   		.has	= xfs_sb_version_hasfinobt,
> > -		.repair	= xfs_repair_notsupported,
> > +		.repair	= xfs_repair_iallocbt,
> >   	},
> >   	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
> >   		.type	= ST_PERAG,
> > 
> 
> Ok, some parts took some time to figure out, but I think I understand
> the overall idea.  The comments help, and if you could add in a little
> extra detail describing the function parameters, I think it would help
> to add more supporting context to your comments.  Thx!

Every time I go wandering through the ialloc code my head also gets
twisted in knots over inode chunks and inode clusters.  I think for the
next round I'll try to make some ascii art diagrams that I can refer
back to the next time I have to go digging through here (which will
probably be not that long from now, rumor has it the ialloc scrub don't
quite work right on systems with 64K pagesize.

--D

> Allison
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=fIL2s7bIVyQHhkt6FVjoAC9YFnsVQMVUbz6DfuinhZs&s=m56pNZbCxuiPzbhEv3nD5G2PqN_7BLoQhkXF1E-CTzY&e=
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 07/21] xfs: repair inode btrees
  2018-06-30 18:30     ` Darrick J. Wong
@ 2018-07-01  0:45       ` Allison Henderson
  0 siblings, 0 replies; 77+ messages in thread
From: Allison Henderson @ 2018-07-01  0:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 06/30/2018 11:30 AM, Darrick J. Wong wrote:
> On Sat, Jun 30, 2018 at 10:36:23AM -0700, Allison Henderson wrote:
>> On 06/24/2018 12:24 PM, Darrick J. Wong wrote:
>>> From: Darrick J. Wong <darrick.wong@oracle.com>
>>>
>>> Use the rmapbt to find inode chunks, query the chunks to compute
>>> hole and free masks, and with that information rebuild the inobt
>>> and finobt.
>>>
>>> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
>>> ---
>>>    fs/xfs/Makefile              |    1
>>>    fs/xfs/scrub/ialloc_repair.c |  585 ++++++++++++++++++++++++++++++++++++++++++
>>>    fs/xfs/scrub/repair.h        |    2
>>>    fs/xfs/scrub/scrub.c         |    4
>>>    4 files changed, 590 insertions(+), 2 deletions(-)
>>>    create mode 100644 fs/xfs/scrub/ialloc_repair.c
>>>
>>>
>>> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
>>> index 841e0824eeb6..837fd4a95f6f 100644
>>> --- a/fs/xfs/Makefile
>>> +++ b/fs/xfs/Makefile
>>> @@ -165,6 +165,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
>>>    xfs-y				+= $(addprefix scrub/, \
>>>    				   agheader_repair.o \
>>>    				   alloc_repair.o \
>>> +				   ialloc_repair.o \
>>>    				   repair.o \
>>>    				   )
>>>    endif
>>> diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
>>> new file mode 100644
>>> index 000000000000..29c736466bba
>>> --- /dev/null
>>> +++ b/fs/xfs/scrub/ialloc_repair.c
>>> @@ -0,0 +1,585 @@
>>> +// SPDX-License-Identifier: GPL-2.0+
>>> +/*
>>> + * Copyright (C) 2018 Oracle.  All Rights Reserved.
>>> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
>>> + */
>>> +#include "xfs.h"
>>> +#include "xfs_fs.h"
>>> +#include "xfs_shared.h"
>>> +#include "xfs_format.h"
>>> +#include "xfs_trans_resv.h"
>>> +#include "xfs_mount.h"
>>> +#include "xfs_defer.h"
>>> +#include "xfs_btree.h"
>>> +#include "xfs_bit.h"
>>> +#include "xfs_log_format.h"
>>> +#include "xfs_trans.h"
>>> +#include "xfs_sb.h"
>>> +#include "xfs_inode.h"
>>> +#include "xfs_alloc.h"
>>> +#include "xfs_ialloc.h"
>>> +#include "xfs_ialloc_btree.h"
>>> +#include "xfs_icache.h"
>>> +#include "xfs_rmap.h"
>>> +#include "xfs_rmap_btree.h"
>>> +#include "xfs_log.h"
>>> +#include "xfs_trans_priv.h"
>>> +#include "xfs_error.h"
>>> +#include "scrub/xfs_scrub.h"
>>> +#include "scrub/scrub.h"
>>> +#include "scrub/common.h"
>>> +#include "scrub/btree.h"
>>> +#include "scrub/trace.h"
>>> +#include "scrub/repair.h"
>>> +
>>> +/*
>>> + * Inode Btree Repair
>>> + * ==================
>>> + *
>>> + * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
>>> + * records.  The OWN_INOBT records are the old inode btree blocks and will be
>>> + * cleared out after we've rebuilt the tree.  Each possible inode chunk within
>>> + * an OWN_INODES record will be read in and the freemask calculated from the
>>> + * i_mode data in the inode chunk.  For sparse inodes the holemask will be
>>> + * calculated by creating the properly aligned inobt record and punching out
>>> + * any chunk that's missing.  Inode allocations and frees grab the AGI first,
>>> + * so repair protects itself from concurrent access by locking the AGI.
>>> + *
>>> + * Once we've reconstructed all the inode records, we can create new inode
>>> + * btree roots and reload the btrees.  We rebuild both inode trees at the same
>>> + * time because they have the same rmap owner and it would be more complex to
>>> + * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
>>> + * blocks it owns.  We have all the data we need to build both, so dump
>>> + * everything and start over.
>>> + */
>>> +
>>> +struct xfs_repair_ialloc_extent {
>>> +	struct list_head		list;
>>> +	xfs_inofree_t			freemask;
>>> +	xfs_agino_t			startino;
>>> +	unsigned int			count;
>>> +	unsigned int			usedcount;
>>> +	uint16_t			holemask;
>>> +};
>>> +
>>> +struct xfs_repair_ialloc {
>>> +	struct list_head		*extlist;
>>> +	struct xfs_repair_extent_list	*btlist;
>>> +	struct xfs_scrub_context	*sc;
>>> +	uint64_t			nr_records;
>>> +};
>>> +
>>> +/*
>>> + * Is this inode in use?  If the inode is in memory we can tell from i_mode,
>>> + * otherwise we have to check di_mode in the on-disk buffer.  We only care
>>> + * that the high (i.e. non-permission) bits of _mode are zero.  This should be
>>> + * safe because repair keeps all AG headers locked until the end, and process
>>> + * trying to perform an inode allocation/free must lock the AGI.
>>> + */
>>> +STATIC int
>>> +xfs_repair_ialloc_check_free(
>>> +	struct xfs_scrub_context	*sc,
>>> +	struct xfs_buf			*bp,
>>> +	xfs_ino_t			fsino,
>>> +	xfs_agino_t			bpino,
>>> +	bool				*inuse)
>>> +{
>>> +	struct xfs_mount		*mp = sc->mp;
>>> +	struct xfs_dinode		*dip;
>>> +	int				error;
>>> +
>>> +	/* Will the in-core inode tell us if it's in use? */
>>> +	error = xfs_icache_inode_is_allocated(mp, sc->tp, fsino, inuse);
>>> +	if (!error)
>>> +		return 0;
>>> +
>>> +	/* Inode uncached or half assembled, read disk buffer */
>>> +	dip = xfs_buf_offset(bp, bpino * mp->m_sb.sb_inodesize);
>>> +	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC)
>>> +		return -EFSCORRUPTED;
>>> +
>>> +	if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino)
>>> +		return -EFSCORRUPTED;
>>> +
>>> +	*inuse = dip->di_mode != 0;
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>> + * For each cluster in this blob of inode, we must calculate the
>> Ok, so I've been over this one a few times, and I still don't feel
>> like I've figured out what a blob of an inode is. So I'm gonna have
>> to break and ask for clarification on that one?  Thx! :-)
> 
> Heh, sorry.
> 
> "For each inode cluster covering the physical extent recorded by the
> rmapbt, we must calculate..."
> 
>>> + * properly aligned startino of that cluster, then iterate each
>>> + * cluster to fill in used and filled masks appropriately.  We
>>> + * then use the (startino, used, filled) information to construct
>>> + * the appropriate inode records.
>>> + */
>>> +STATIC int
>>> +xfs_repair_ialloc_process_cluster(
>>> +	struct xfs_repair_ialloc	*ri,
>>> +	xfs_agblock_t			agbno,
>>> +	int				blks_per_cluster,
>>> +	xfs_agino_t			rec_agino)
>>> +{
>>> +	struct xfs_imap			imap;
>>> +	struct xfs_repair_ialloc_extent	*rie;
>>> +	struct xfs_dinode		*dip;
>>> +	struct xfs_buf			*bp;
>>> +	struct xfs_scrub_context	*sc = ri->sc;
>>> +	struct xfs_mount		*mp = sc->mp;
>>> +	xfs_ino_t			fsino;
>>> +	xfs_inofree_t			usedmask;
>>> +	xfs_agino_t			nr_inodes;
>>> +	xfs_agino_t			startino;
>>> +	xfs_agino_t			clusterino;
>>> +	xfs_agino_t			clusteroff;
>>> +	xfs_agino_t			agino;
>>> +	uint16_t			fillmask;
>>> +	bool				inuse;
>>> +	int				usedcount;
>>> +	int				error;
>>> +
>>> +	/* The per-AG inum of this inode cluster. */
>>> +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
>>> +
>>> +	/* The per-AG inum of the inobt record. */
>>> +	startino = rec_agino + rounddown(agino - rec_agino,
>>> +			XFS_INODES_PER_CHUNK);
>>> +
>>> +	/* The per-AG inum of the cluster within the inobt record. */
>>> +	clusteroff = agino - startino;
>>> +
>>> +	/* Every inode in this holemask slot is filled. */
>>> +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
>>> +	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
>>> +			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
>>> +
>>> +	/* Grab the inode cluster buffer. */
>>> +	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
>>> +	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
>>> +	imap.im_boffset = 0;
>>> +
>>> +	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
>>> +			XFS_IGET_UNTRUSTED);
>>> +	if (error)
>>> +		return error;
>>> +
>>> +	usedmask = 0;
>>> +	usedcount = 0;
>>> +	/* Which inodes within this cluster are free? */
>>> +	for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
>>> +		fsino = XFS_AGINO_TO_INO(mp, sc->sa.agno, agino + clusterino);
>>> +		error = xfs_repair_ialloc_check_free(sc, bp, fsino,
>>> +				clusterino, &inuse);
>>> +		if (error) {
>>> +			xfs_trans_brelse(sc->tp, bp);
>>> +			return error;
>>> +		}
>>> +		if (inuse) {
>>> +			usedcount++;
>>> +			usedmask |= XFS_INOBT_MASK(clusteroff + clusterino);
>>> +		}
>>> +	}
>>> +	xfs_trans_brelse(sc->tp, bp);
>>> +
>>> +	/*
>>> +	 * If the last item in the list is our chunk record,
>>> +	 * update that.
>>> +	 */
>>> +	if (!list_empty(ri->extlist)) {
>>> +		rie = list_last_entry(ri->extlist,
>>> +				struct xfs_repair_ialloc_extent, list);
>>> +		if (rie->startino + XFS_INODES_PER_CHUNK > startino) {
>>> +			rie->freemask &= ~usedmask;
>>> +			rie->holemask &= ~fillmask;
>>> +			rie->count += nr_inodes;
>>> +			rie->usedcount += usedcount;
>>> +			return 0;
>>> +		}
>>> +	}
>>> +
>>> +	/* New inode chunk; add to the list. */
>>> +	rie = kmem_alloc(sizeof(struct xfs_repair_ialloc_extent), KM_MAYFAIL);
>>> +	if (!rie)
>>> +		return -ENOMEM;
>>> +
>>> +	INIT_LIST_HEAD(&rie->list);
>>> +	rie->startino = startino;
>>> +	rie->freemask = XFS_INOBT_ALL_FREE & ~usedmask;
>>> +	rie->holemask = XFS_INOBT_ALL_FREE & ~fillmask;
>>> +	rie->count = nr_inodes;
>>> +	rie->usedcount = usedcount;
>>> +	list_add_tail(&rie->list, ri->extlist);
>>> +	ri->nr_records++;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/* Record extents that belong to inode btrees. */
>>> +STATIC int
>>> +xfs_repair_ialloc_extent_fn(
>>> +	struct xfs_btree_cur		*cur,
>>> +	struct xfs_rmap_irec		*rec,
>>> +	void				*priv)
>>> +{
>>> +	struct xfs_repair_ialloc	*ri = priv;
>>> +	struct xfs_mount		*mp = cur->bc_mp;
>>> +	xfs_fsblock_t			fsbno;
>>> +	xfs_agblock_t			agbno = rec->rm_startblock;
>>> +	xfs_agino_t			inoalign;
>>> +	xfs_agino_t			agino;
>>> +	xfs_agino_t			rec_agino;
>>> +	int				blks_per_cluster;
>>> +	int				error = 0;
>>> +
>>> +	if (xfs_scrub_should_terminate(ri->sc, &error))
>>> +		return error;
>>> +
>>> +	/* Fragment of the old btrees; dispose of them later. */
>>> +	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
>>> +		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
>>> +		return xfs_repair_collect_btree_extent(ri->sc, ri->btlist,
>>> +				fsbno, rec->rm_blockcount);
>>> +	}
>>> +
>>> +	/* Skip extents which are not owned by this inode and fork. */
>>> +	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
>>> +		return 0;
>>> +
>>> +	blks_per_cluster = xfs_icluster_size_fsb(mp);
>>> +
>>> +	if (agbno % blks_per_cluster != 0)
>>> +		return -EFSCORRUPTED;
>>> +
>>> +	trace_xfs_repair_ialloc_extent_fn(mp, ri->sc->sa.agno,
>>> +			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
>>> +			rec->rm_offset, rec->rm_flags);
>>> +
>>> +	/*
>>> +	 * Determine the inode block alignment, and where the block
>>> +	 * ought to start if it's aligned properly.  On a sparse inode
>>> +	 * system the rmap doesn't have to start on an alignment boundary,
>>> +	 * but the record does.  On pre-sparse filesystems, we /must/
>>> +	 * start both rmap and inobt on an alignment boundary.
>>> +	 */
>>> +	inoalign = xfs_ialloc_cluster_alignment(mp);
>>> +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
>>> +	rec_agino = XFS_OFFBNO_TO_AGINO(mp, rounddown(agbno, inoalign), 0);
>>> +	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) && agino != rec_agino)
>>> +		return -EFSCORRUPTED;
>>> +
>>> +	/* Set up the free/hole masks for each cluster in this inode chunk. */
>> By chunk you did you mean record?  Please try to keep terminology
>> consistent as best you can.  Thx! :-)
> 
> Yikes, that /is/ a misleading comment.
> 
> "Set up the free/hole masks for each inode cluster that could be mapped
> by this rmap record."
> 
>>> +	for (;
>>> +	     agbno < rec->rm_startblock + rec->rm_blockcount;
>>> +	     agbno += blks_per_cluster) {
>>> +		error = xfs_repair_ialloc_process_cluster(ri, agbno,
>>> +				blks_per_cluster, rec_agino);
>>> +		if (error)
>>> +			return error;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/* Compare two ialloc extents. */
>>> +static int
>>> +xfs_repair_ialloc_extent_cmp(
>>> +	void				*priv,
>>> +	struct list_head		*a,
>>> +	struct list_head		*b)
>>> +{
>>> +	struct xfs_repair_ialloc_extent	*ap;
>>> +	struct xfs_repair_ialloc_extent	*bp;
>>> +
>>> +	ap = container_of(a, struct xfs_repair_ialloc_extent, list);
>>> +	bp = container_of(b, struct xfs_repair_ialloc_extent, list);
>>> +
>>> +	if (ap->startino > bp->startino)
>>> +		return 1;
>>> +	else if (ap->startino < bp->startino)
>>> +		return -1;
>>> +	return 0;
>>> +}
>>> +
>>> +/* Insert an inode chunk record into a given btree. */
>>> +static int
>>> +xfs_repair_iallocbt_insert_btrec(
>>> +	struct xfs_btree_cur		*cur,
>>> +	struct xfs_repair_ialloc_extent	*rie)
>>> +{
>>> +	int				stat;
>>> +	int				error;
>>> +
>>> +	error = xfs_inobt_lookup(cur, rie->startino, XFS_LOOKUP_EQ, &stat);
>>> +	if (error)
>>> +		return error;
>>> +	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 0);
>>> +	error = xfs_inobt_insert_rec(cur, rie->holemask, rie->count,
>>> +			rie->count - rie->usedcount, rie->freemask, &stat);
>>> +	if (error)
>>> +		return error;
>>> +	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
>>> +	return error;
>>> +}
>>> +
>>> +/* Insert an inode chunk record into both inode btrees. */
>>> +static int
>>> +xfs_repair_iallocbt_insert_rec(
>>> +	struct xfs_scrub_context	*sc,
>>> +	struct xfs_repair_ialloc_extent	*rie)
>>> +{
>>> +	struct xfs_btree_cur		*cur;
>>> +	int				error;
>>> +
>>> +	trace_xfs_repair_ialloc_insert(sc->mp, sc->sa.agno, rie->startino,
>>> +			rie->holemask, rie->count, rie->count - rie->usedcount,
>>> +			rie->freemask);
>>> +
>>> +	/* Insert into the inobt. */
>>> +	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
>>> +			XFS_BTNUM_INO);
>>> +	error = xfs_repair_iallocbt_insert_btrec(cur, rie);
>>> +	if (error)
>>> +		goto out_cur;
>>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>>> +
>>> +	/* Insert into the finobt if chunk has free inodes. */
>>> +	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb) &&
>>> +	    rie->count != rie->usedcount) {
>>> +		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
>>> +				sc->sa.agno, XFS_BTNUM_FINO);
>>> +		error = xfs_repair_iallocbt_insert_btrec(cur, rie);
>>> +		if (error)
>>> +			goto out_cur;
>>> +		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>>> +	}
>>> +
>>> +	return xfs_repair_roll_ag_trans(sc);
>>> +out_cur:
>>> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>>> +	return error;
>>> +}
>>> +
>>> +/* Free every record in the inode list. */
>>> +STATIC void
>>> +xfs_repair_iallocbt_cancel_inorecs(
>>> +	struct list_head		*reclist)
>>> +{
>>> +	struct xfs_repair_ialloc_extent	*rie;
>>> +	struct xfs_repair_ialloc_extent	*n;
>>> +
>>> +	list_for_each_entry_safe(rie, n, reclist, list) {
>>> +		list_del(&rie->list);
>>> +		kmem_free(rie);
>>> +	}
>>> +}
>>> +
>>> +/*
>>> + * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode
>>> + * btrees (OWN_INOBT).  Figure out if we have enough free space to reconstruct
>>> + * the inode btrees.  The caller must clean up the lists if anything goes
>>> + * wrong.
>>> + */
>>> +STATIC int
>>> +xfs_repair_iallocbt_find_inodes(
>>> +	struct xfs_scrub_context	*sc,
>>> +	struct list_head		*inode_records,
>>> +	struct xfs_repair_extent_list	*old_iallocbt_blocks)
>>> +{
>>> +	struct xfs_repair_ialloc	ri;
>>> +	struct xfs_mount		*mp = sc->mp;
>>> +	struct xfs_btree_cur		*cur;
>>> +	xfs_agblock_t			nr_blocks;
>>> +	int				error;
>>> +
>>> +	/* Collect all reverse mappings for inode blocks. */
>>> +	ri.extlist = inode_records;
>>> +	ri.btlist = old_iallocbt_blocks;
>>> +	ri.nr_records = 0;
>>> +	ri.sc = sc;
>>> +
>>> +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
>>> +	error = xfs_rmap_query_all(cur, xfs_repair_ialloc_extent_fn, &ri);
>>> +	if (error)
>>> +		goto err;
>>> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>>> +
>>> +	/* Do we actually have enough space to do this? */
>>> +	nr_blocks = xfs_iallocbt_calc_size(mp, ri.nr_records);
>>> +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
>>> +		nr_blocks *= 2;
>>> +	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE))
>>> +		return -ENOSPC;
>>> +
>>> +	return 0;
>>> +
>>> +err:
>>> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>>> +	return error;
>>> +}
>>> +
>>> +/* Update the AGI counters. */
>>> +STATIC int
>>> +xfs_repair_iallocbt_reset_counters(
>>> +	struct xfs_scrub_context	*sc,
>>> +	struct list_head		*inode_records,
>>> +	int				*log_flags)
>>> +{
>>> +	struct xfs_agi			*agi;
>>> +	struct xfs_repair_ialloc_extent	*rie;
>>> +	unsigned int			count = 0;
>>> +	unsigned int			usedcount = 0;
>>> +	unsigned int			freecount;
>>> +
>>> +	/* Figure out the new counters. */
>>> +	list_for_each_entry(rie, inode_records, list) {
>>> +		count += rie->count;
>>> +		usedcount += rie->usedcount;
>>> +	}
>>> +
>>> +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
>>> +	freecount = count - usedcount;
>>> +
>>> +	/* XXX: trigger inode count recalculation */
>>> +
>>> +	/* Reset the per-AG info, both incore and ondisk. */
>>> +	sc->sa.pag->pagi_count = count;
>>> +	sc->sa.pag->pagi_freecount = freecount;
>>> +	agi->agi_count = cpu_to_be32(count);
>>> +	agi->agi_freecount = cpu_to_be32(freecount);
>>> +	*log_flags |= XFS_AGI_COUNT | XFS_AGI_FREECOUNT;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/* Initialize new inobt/finobt roots and implant them into the AGI. */
>>> +STATIC int
>>> +xfs_repair_iallocbt_reset_btrees(
>>> +	struct xfs_scrub_context	*sc,
>>> +	struct xfs_owner_info		*oinfo,
>>> +	int				*log_flags)
>>> +{
>>> +	struct xfs_agi			*agi;
>>> +	struct xfs_buf			*bp;
>>> +	struct xfs_mount		*mp = sc->mp;
>>> +	xfs_fsblock_t			inofsb;
>>> +	xfs_fsblock_t			finofsb;
>>> +	enum xfs_ag_resv_type		resv;
>>> +	int				error;
>>> +
>>> +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
>>> +
>>> +	/* Initialize new inobt root. */
>>> +	resv = XFS_AG_RESV_NONE;
>>> +	error = xfs_repair_alloc_ag_block(sc, oinfo, &inofsb, resv);
>>> +	if (error)
>>> +		return error;
>>> +	error = xfs_repair_init_btblock(sc, inofsb, &bp, XFS_BTNUM_INO,
>>> +			&xfs_inobt_buf_ops);
>>> +	if (error)
>>> +		return error;
>>> +	agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, inofsb));
>>> +	agi->agi_level = cpu_to_be32(1);
>>> +	*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
>>> +
>>> +	/* Initialize new finobt root. */
>>> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
>>> +		return 0;
>>> +
>>> +	resv = mp->m_inotbt_nores ? XFS_AG_RESV_NONE : XFS_AG_RESV_METADATA;
>>> +	error = xfs_repair_alloc_ag_block(sc, oinfo, &finofsb, resv);
>>> +	if (error)
>>> +		return error;
>>> +	error = xfs_repair_init_btblock(sc, finofsb, &bp, XFS_BTNUM_FINO,
>>> +			&xfs_inobt_buf_ops);
>>> +	if (error)
>>> +		return error;
>>> +	agi->agi_free_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, finofsb));
>>> +	agi->agi_free_level = cpu_to_be32(1);
>>> +	*log_flags |= XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/* Build new inode btrees and dispose of the old one. */
>>> +STATIC int
>>> +xfs_repair_iallocbt_rebuild_trees(
>>> +	struct xfs_scrub_context	*sc,
>>> +	struct list_head		*inode_records,
>>> +	struct xfs_owner_info		*oinfo,
>>> +	struct xfs_repair_extent_list	*old_iallocbt_blocks)
>>> +{
>>> +	struct xfs_repair_ialloc_extent	*rie;
>>> +	struct xfs_repair_ialloc_extent	*n;
>>> +	int				error;
>>> +
>>> +	/* Add all records. */
>>> +	list_sort(NULL, inode_records, xfs_repair_ialloc_extent_cmp);
>>> +	list_for_each_entry_safe(rie, n, inode_records, list) {
>>> +		error = xfs_repair_iallocbt_insert_rec(sc, rie);
>>> +		if (error)
>>> +			return error;
>>> +
>>> +		list_del(&rie->list);
>>> +		kmem_free(rie);
>>> +	}
>>> +
>>> +	/* Free the old inode btree blocks if they're not in use. */
>>> +	return xfs_repair_reap_btree_extents(sc, old_iallocbt_blocks, oinfo,
>>> +			XFS_AG_RESV_NONE);
>>> +}
>>> +
>>> +/* Repair both inode btrees. */
>>> +int
>>> +xfs_repair_iallocbt(
>>> +	struct xfs_scrub_context	*sc)
>>> +{
>>> +	struct xfs_owner_info		oinfo;
>>> +	struct list_head		inode_records;
>>> +	struct xfs_repair_extent_list	old_iallocbt_blocks;
>>> +	struct xfs_mount		*mp = sc->mp;
>>> +	int				log_flags = 0;
>>> +	int				error = 0;
>>> +
>>> +	/* We require the rmapbt to rebuild anything. */
>>> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
>>> +
>>> +	/* Collect the free space data and find the old btree blocks. */
>>> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
>>> +	INIT_LIST_HEAD(&inode_records);
>>> +	xfs_repair_init_extent_list(&old_iallocbt_blocks);
>>> +	error = xfs_repair_iallocbt_find_inodes(sc, &inode_records,
>>> +			&old_iallocbt_blocks);
>>> +	if (error)
>>> +		goto out;
>>> +
>>> +	/*
>>> +	 * Blow out the old inode btrees.  This is the point at which
>>> +	 * we are no longer able to bail out gracefully.
>>> +	 */
>>> +	error = xfs_repair_iallocbt_reset_counters(sc, &inode_records,
>>> +			&log_flags);
>>> +	if (error)
>>> +		goto out;
>>> +	error = xfs_repair_iallocbt_reset_btrees(sc, &oinfo, &log_flags);
>>> +	if (error)
>>> +		goto out;
>>> +	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
>>> +
>>> +	/* Invalidate all the inobt/finobt blocks in btlist. */
>>> +	error = xfs_repair_invalidate_blocks(sc, &old_iallocbt_blocks);
>>> +	if (error)
>>> +		goto out;
>>> +	error = xfs_repair_roll_ag_trans(sc);
>>> +	if (error)
>>> +		goto out;
>>> +
>>> +	/* Now rebuild the inode information. */
>>> +	error = xfs_repair_iallocbt_rebuild_trees(sc, &inode_records, &oinfo,
>>> +			&old_iallocbt_blocks);
>>> +out:
>>> +	xfs_repair_cancel_btree_extents(sc, &old_iallocbt_blocks);
>>> +	xfs_repair_iallocbt_cancel_inorecs(&inode_records);
>>> +	return error;
>>> +}
>>> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
>>> index e5f67fc68e9a..dcfa5eb18940 100644
>>> --- a/fs/xfs/scrub/repair.h
>>> +++ b/fs/xfs/scrub/repair.h
>>> @@ -104,6 +104,7 @@ int xfs_repair_agf(struct xfs_scrub_context *sc);
>>>    int xfs_repair_agfl(struct xfs_scrub_context *sc);
>>>    int xfs_repair_agi(struct xfs_scrub_context *sc);
>>>    int xfs_repair_allocbt(struct xfs_scrub_context *sc);
>>> +int xfs_repair_iallocbt(struct xfs_scrub_context *sc);
>>>    #else
>>> @@ -131,6 +132,7 @@ xfs_repair_calc_ag_resblks(
>>>    #define xfs_repair_agfl			xfs_repair_notsupported
>>>    #define xfs_repair_agi			xfs_repair_notsupported
>>>    #define xfs_repair_allocbt		xfs_repair_notsupported
>>> +#define xfs_repair_iallocbt		xfs_repair_notsupported
>>>    #endif /* CONFIG_XFS_ONLINE_REPAIR */
>>> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
>>> index 7a55b20b7e4e..fec0e130f19e 100644
>>> --- a/fs/xfs/scrub/scrub.c
>>> +++ b/fs/xfs/scrub/scrub.c
>>> @@ -238,14 +238,14 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>>>    		.type	= ST_PERAG,
>>>    		.setup	= xfs_scrub_setup_ag_iallocbt,
>>>    		.scrub	= xfs_scrub_inobt,
>>> -		.repair	= xfs_repair_notsupported,
>>> +		.repair	= xfs_repair_iallocbt,
>>>    	},
>>>    	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
>>>    		.type	= ST_PERAG,
>>>    		.setup	= xfs_scrub_setup_ag_iallocbt,
>>>    		.scrub	= xfs_scrub_finobt,
>>>    		.has	= xfs_sb_version_hasfinobt,
>>> -		.repair	= xfs_repair_notsupported,
>>> +		.repair	= xfs_repair_iallocbt,
>>>    	},
>>>    	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
>>>    		.type	= ST_PERAG,
>>>
>>
>> Ok, some parts took some time to figure out, but I think I understand
>> the overall idea.  The comments help, and if you could add in a little
>> extra detail describing the function parameters, I think it would help
>> to add more supporting context to your comments.  Thx!
> 
> Every time I go wandering through the ialloc code my head also gets
> twisted in knots over inode chunks and inode clusters.  I think for the
> next round I'll try to make some ascii art diagrams that I can refer
> back to the next time I have to go digging through here (which will
> probably be not that long from now, rumor has it the ialloc scrub don't
> quite work right on systems with 64K pagesize.
> 
> --D
> 
Alrighty, that sounds like it would be really helpful.  Thank you!!

Allison


>> Allison
>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwICaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=fIL2s7bIVyQHhkt6FVjoAC9YFnsVQMVUbz6DfuinhZs&s=m56pNZbCxuiPzbhEv3nD5G2PqN_7BLoQhkXF1E-CTzY&e=
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=FnZPawl2adtmmmdjeP-K9vg8vYqtPL1U11LWrPTgikw&s=FB3xLOk3MV-xD-i4C58Dm4yenRzJ1FswSOXlr71kAUc&e=
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LHZQ8fHvy6wDKXGTWcm97burZH5sQKHRDMaY1UthQxc&m=FnZPawl2adtmmmdjeP-K9vg8vYqtPL1U11LWrPTgikw&s=FB3xLOk3MV-xD-i4C58Dm4yenRzJ1FswSOXlr71kAUc&e=
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-06-24 19:24 ` [PATCH 11/21] xfs: repair the rmapbt Darrick J. Wong
@ 2018-07-03  5:32   ` Dave Chinner
  2018-07-03 23:59     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-03  5:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the reverse mapping btree from all primary metadata.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

....

> +static inline int xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	/* We don't support rmap repair, but we can still do a scan. */
> +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> +}

This comment seems at odds with the commit message....

....

> +/*
> + * Reverse Mapping Btree Repair
> + * ============================
> + *
> + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> + * else in XFS we lock inodes and then AG data structures, but generating the
> + * list of rmap records requires that we be able to scan both block mapping
> + * btrees of every inode in the filesystem to see if it owns any extents in
> + * this AG.  We can't tolerate any inode updates while we do this, so we
> + * freeze the filesystem to lock everyone else out, and grant ourselves
> + * special privileges to run transactions with regular background reclamation
> + * turned off.

Hmmm. This implies we are going to scan the entire filesystem for
every AG we need to rebuild the rmap tree in. That seems like an
awful lot of work if there's more than one rmap btree that needs
rebuild.

> + * We also have to be very careful not to allow inode reclaim to start a
> + * transaction because all transactions (other than our own) will block.

What happens when we run out of memory? Inode reclaim will need to
run at that point, right?

> + * So basically we scan all primary per-AG metadata and all block maps of all
> + * inodes to generate a huge list of reverse map records.  Next we look for
> + * gaps in the rmap records to calculate all the unclaimed free space (1).
> + * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
> + * the space used by those btrees from (1), and also subtract the free space
> + * listed in the bnobt from (1).  What's left are the gaps in assigned space
> + * that the new rmapbt knows about but the existing bnobt doesn't; these are
> + * the blocks from the old rmapbt and they can be freed.

THis looks like a lot of repeated work. We've already scanned a
bunch of these trees to repair them, then thrown away the scan
results. Now we do another scan of what we've rebuilt.....

... hold on. Chicken and egg.

We verify and rebuild all the other trees from the rmap information
- how do we do determine that the rmap needs to rebuilt and that the
metadata it's being rebuilt from is valid?

Given that we've effetively got to shut down access to the
filesystem for the entire rmap rebuild while we do an entire
filesystem scan, why would we do this online? It's going to be
faster to do this rebuild offline (because of all the prefetching,
rebuilding all AG trees from the state gathered in the full
filesystem passes, etc) and we don't have to hack around potential
transaction and memory reclaim deadlock situations, either?

So why do rmap rebuilds online at all?

> + */
> +
> +/* Set us up to repair reverse mapping btrees. */
> +int
> +xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	int				error;
> +
> +	/*
> +	 * Freeze out anything that can lock an inode.  We reconstruct
> +	 * the rmapbt by reading inode bmaps with the AGF held, which is
> +	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
> +	 * inodes.
> +	 */
> +	error = xfs_scrub_fs_freeze(sc);
> +	if (error)
> +		return error;
> +
> +	/* Check the AG number and set up the scrub context. */
> +	error = xfs_scrub_setup_fs(sc, ip);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Lock all the AG header buffers so that we can read all the
> +	 * per-AG metadata too.
> +	 */
> +	error = xfs_repair_grab_all_ag_headers(sc);
> +	if (error)
> +		return error;

So if we have thousands of AGs (think PB scale filesystems) then
we're going hold many thousands of locked buffers here? Just so we
can rebuild the rmapbt in one AG?

What does holding these buffers locked protect us against that
an active freeze doesn't?

> +xfs_repair_rmapbt_new_rmap(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agblock_t			startblock,
> +	xfs_extlen_t			blockcount,
> +	uint64_t			owner,
> +	uint64_t			offset,
> +	unsigned int			flags)
> +{
> +	struct xfs_repair_rmapbt_extent	*rre;
> +	int				error = 0;
> +
> +	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
> +			startblock, blockcount, owner, offset, flags);
> +
> +	if (xfs_scrub_should_terminate(rr->sc, &error))
> +		return error;
> +
> +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> +	if (!rre)
> +		return -ENOMEM;

This seems like a likely thing to happen given the "no reclaim"
state of the filesystem and the memory demand a rmapbt rebuild
can have. If we've got GBs of rmap info in the AG that needs to be
rebuilt, how much RAM are we going to need to index it all as we
scan the filesystem?

> +xfs_repair_rmapbt_scan_ifork(
> +	struct xfs_repair_rmapbt	*rr,
> +	struct xfs_inode		*ip,
> +	int				whichfork)
> +{
> +	struct xfs_bmbt_irec		rec;
> +	struct xfs_iext_cursor		icur;
> +	struct xfs_mount		*mp = rr->sc->mp;
> +	struct xfs_btree_cur		*cur = NULL;
> +	struct xfs_ifork		*ifp;
> +	unsigned int			rflags;
> +	int				fmt;
> +	int				error = 0;
> +
> +	/* Do we even have data mapping extents? */
> +	fmt = XFS_IFORK_FORMAT(ip, whichfork);
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +	switch (fmt) {
> +	case XFS_DINODE_FMT_BTREE:
> +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> +			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
> +			if (error)
> +				return error;
> +		}

Ok, so we need inodes locked to do this....

....
> +/* Iterate all the inodes in an AG group. */
> +STATIC int
> +xfs_repair_rmapbt_scan_inobt(
> +	struct xfs_btree_cur		*cur,
> +	union xfs_btree_rec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_inobt_rec_incore	irec;
> +	struct xfs_repair_rmapbt	*rr = priv;
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	struct xfs_inode		*ip = NULL;
> +	xfs_ino_t			ino;
> +	xfs_agino_t			agino;
> +	int				chunkidx;
> +	int				lock_mode = 0;
> +	int				error = 0;
> +
> +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> +
> +	for (chunkidx = 0, agino = irec.ir_startino;
> +	     chunkidx < XFS_INODES_PER_CHUNK;
> +	     chunkidx++, agino++) {
> +		bool	inuse;
> +
> +		/* Skip if this inode is free */
> +		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
> +			continue;
> +		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
> +
> +		/* Back off and try again if an inode is being reclaimed */
> +		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
> +				&inuse);
> +		if (error == -EAGAIN)
> +			return -EDEADLOCK;

And we can get inode access errors here.....

FWIW, how is the inode being reclaimed if the filesystem is frozen?

> +
> +		/*
> +		 * Grab inode for scanning.  We cannot use DONTCACHE here
> +		 * because we already have a transaction so the iput must not
> +		 * trigger inode reclaim (which might allocate a transaction
> +		 * to clean up posteof blocks).
> +		 */
> +		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);

So if there are enough inodes in the AG, we'll run out of memory
here because we aren't reclaiming inodes from the cache but instead
putting them all on the defferred iput list?

> +		if (error)
> +			return error;
> +		trace_xfs_scrub_iget(ip, __this_address);
> +
> +		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
> +		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
> +			lock_mode = XFS_ILOCK_EXCL;
> +		else
> +			lock_mode = XFS_ILOCK_SHARED;
> +		if (!xfs_ilock_nowait(ip, lock_mode)) {
> +			error = -EBUSY;
> +			goto out_rele;
> +		}

And in what situation do we get inodes stuck with the ilock held on
frozen filesysetms?

....

> +out_unlock:
> +	xfs_iunlock(ip, lock_mode);
> +out_rele:
> +	iput(VFS_I(ip));
> +	return error;

calling iput in the error path is a bug - it will trigger all the
paths you're trying to avoid by using te deferred iput list.

....


> +/* Collect rmaps for all block mappings for every inode in this AG. */
> +STATIC int
> +xfs_repair_rmapbt_generate_aginode_rmaps(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agnumber_t			agno)
> +{
> +	struct xfs_scrub_context	*sc = rr->sc;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_buf			*agi_bp;
> +	int				error;
> +
> +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> +	if (error)
> +		return error;
> +	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
> +	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);

So if we get a locked or reclaiming inode anywhere in the
filesystem we see EDEADLOCK/EBUSY here without having scanned all
the inodes in the AG, right?

> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_trans_brelse(sc->tp, agi_bp);
> +	return error;
> +}
> +
> +/*
> + * Generate all the reverse-mappings for this AG, a list of the old rmapbt
> + * blocks, and the new btreeblks count.  Figure out if we have enough free
> + * space to reconstruct the inode btrees.  The caller must clean up the lists
> + * if anything goes wrong.
> + */
> +STATIC int
> +xfs_repair_rmapbt_find_rmaps(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*rmap_records,
> +	xfs_agblock_t			*new_btreeblks)
> +{
> +	struct xfs_repair_rmapbt	rr;
> +	xfs_agnumber_t			agno;
> +	int				error;
> +
> +	rr.rmaplist = rmap_records;
> +	rr.sc = sc;
> +	rr.nr_records = 0;
> +
> +	/* Generate rmaps for AG space metadata */
> +	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
> +	if (error)
> +		return error;
> +
> +	/* Iterate all AGs for inodes rmaps. */
> +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
> +		if (error)
> +			return error;

And that means we abort here....

> +/* Repair the rmap btree for some AG. */
> +int
> +xfs_repair_rmapbt(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct list_head		rmap_records;
> +	xfs_extlen_t			new_btreeblks;
> +	int				log_flags = 0;
> +	int				error;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +
> +	/* Collect rmaps for all AG headers. */
> +	INIT_LIST_HEAD(&rmap_records);
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
> +	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
> +	if (error)
> +		goto out;

And we drop out here. So, essentially, any ENOMEM, locked inode or
inode in reclaim anywhere in the filesystem will prevent rmap
rebuild. Which says to me that rebuilding the rmap on
on any substantial filesystem is likely to fail.

Which brings me back to my original question: why attempt to do
rmap rebuild online given how complex it is, the performance
implications of a full filesystem scan per AG that needs rebuild,
and all the ways it could easily fail?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 12/21] xfs: repair refcount btrees
  2018-06-24 19:24 ` [PATCH 12/21] xfs: repair refcount btrees Darrick J. Wong
@ 2018-07-03  5:50   ` Dave Chinner
  2018-07-04  2:23     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-03  5:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:45PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Reconstruct the refcount data from the rmap btree.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Seems reasonable, though my brain turn to mush when trying to work
out the code that turned the rmap records into refcounts :/

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 13/21] xfs: repair inode records
  2018-06-24 19:24 ` [PATCH 13/21] xfs: repair inode records Darrick J. Wong
@ 2018-07-03  6:17   ` Dave Chinner
  2018-07-04  0:16     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-03  6:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:51PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Try to reinitialize corrupt inodes, or clear the reflink flag
> if it's not needed.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

A comment somewhere that this is only attmepting to repair inodes
that have failed verifier checks on read would be good.

......
> +/* Make sure this buffer can pass the inode buffer verifier. */
> +STATIC void
> +xfs_repair_inode_buf(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*bp)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_trans		*tp = sc->tp;
> +	struct xfs_dinode		*dip;
> +	xfs_agnumber_t			agno;
> +	xfs_agino_t			agino;
> +	int				ioff;
> +	int				i;
> +	int				ni;
> +	int				di_ok;
> +	bool				unlinked_ok;
> +
> +	ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock;
> +	agno = xfs_daddr_to_agno(mp, XFS_BUF_ADDR(bp));
> +	for (i = 0; i < ni; i++) {
> +		ioff = i << mp->m_sb.sb_inodelog;
> +		dip = xfs_buf_offset(bp, ioff);
> +		agino = be32_to_cpu(dip->di_next_unlinked);
> +		unlinked_ok = (agino == NULLAGINO ||
> +			       xfs_verify_agino(sc->mp, agno, agino));
> +		di_ok = dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
> +			xfs_dinode_good_version(mp, dip->di_version);
> +		if (di_ok && unlinked_ok)
> +			continue;

Readability woul dbe better with:

		unlinked_ok = false;
		if (agino == NULLAGINO || xfs_verify_agino(sc->mp, agno, agino))
			unlinked_ok = true;

		di_ok = false;
		if (dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
		    xfs_dinode_good_version(mp, dip->di_version))
			di_ok = true;

		if (di_ok && unlinked_ok)
			continue;


Also, is there a need to check the inode CRC here?

> +		dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> +		dip->di_version = 3;
> +		if (!unlinked_ok)
> +			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
> +		xfs_dinode_calc_crc(mp, dip);
> +		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
> +		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);

Hmmmm. how does this interact with other transactions in repair that
might have logged changes to the same in-core inode? If it was just
changing the unlinked pointer, then that would be ok, but
magic/version are overwritten by the inode item recovery...

> +/* Reinitialize things that never change in an inode. */
> +STATIC void
> +xfs_repair_inode_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_dinode		*dip)
> +{
> +	dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> +	if (!xfs_dinode_good_version(sc->mp, dip->di_version))
> +		dip->di_version = 3;
> +	dip->di_ino = cpu_to_be64(sc->sm->sm_ino);
> +	uuid_copy(&dip->di_uuid, &sc->mp->m_sb.sb_meta_uuid);
> +	dip->di_gen = cpu_to_be32(sc->sm->sm_gen);
> +}
> +
> +/*
> + * Turn di_mode into /something/ recognizable.
> + *
> + * XXX: Ideally we'd try to read data block 0 to see if it's a directory.
> + */
> +STATIC void
> +xfs_repair_inode_mode(
> +	struct xfs_dinode	*dip)
> +{
> +	uint16_t		mode;
> +
> +	mode = be16_to_cpu(dip->di_mode);
> +	if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN)
> +		return;
> +
> +	/* bad mode, so we set it to a file that only root can read */
> +	mode = S_IFREG;
> +	dip->di_mode = cpu_to_be16(mode);
> +	dip->di_uid = 0;
> +	dip->di_gid = 0;

Not sure that's a good idea - if the mode is bad I don't think we
should expose it to anyone. Perhaps we need an orphan type

> +}
> +
> +/* Fix any conflicting flags that the verifiers complain about. */
> +STATIC void
> +xfs_repair_inode_flags(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_dinode		*dip)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	uint64_t			flags2;
> +	uint16_t			mode;
> +	uint16_t			flags;
> +
> +	mode = be16_to_cpu(dip->di_mode);
> +	flags = be16_to_cpu(dip->di_flags);
> +	flags2 = be64_to_cpu(dip->di_flags2);
> +
> +	if (xfs_sb_version_hasreflink(&mp->m_sb) && S_ISREG(mode))
> +		flags2 |= XFS_DIFLAG2_REFLINK;
> +	else
> +		flags2 &= ~(XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE);
> +	if (flags & XFS_DIFLAG_REALTIME)
> +		flags2 &= ~XFS_DIFLAG2_REFLINK;
> +	if (flags2 & XFS_DIFLAG2_REFLINK)
> +		flags2 &= ~XFS_DIFLAG2_DAX;
> +	dip->di_flags = cpu_to_be16(flags);
> +	dip->di_flags2 = cpu_to_be64(flags2);
> +}
> +
> +/* Make sure we don't have a garbage file size. */
> +STATIC void
> +xfs_repair_inode_size(
> +	struct xfs_dinode	*dip)
> +{
> +	uint64_t		size;
> +	uint16_t		mode;
> +
> +	mode = be16_to_cpu(dip->di_mode);
> +	size = be64_to_cpu(dip->di_size);
> +	switch (mode & S_IFMT) {
> +	case S_IFIFO:
> +	case S_IFCHR:
> +	case S_IFBLK:
> +	case S_IFSOCK:
> +		/* di_size can't be nonzero for special files */
> +		dip->di_size = 0;
> +		break;
> +	case S_IFREG:
> +		/* Regular files can't be larger than 2^63-1 bytes. */
> +		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
> +		break;
> +	case S_IFLNK:
> +		/* Catch over- or under-sized symlinks. */
> +		if (size > XFS_SYMLINK_MAXLEN)
> +			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
> +		else if (size == 0)
> +			dip->di_size = cpu_to_be64(1);

Not sure this is valid - if the inode is in extent format then a
size of 1 is invalid and means the symlink will point to the
first byte in the data fork, and that could be anything....

> +		break;
> +	case S_IFDIR:
> +		/* Directories can't have a size larger than 32G. */
> +		if (size > XFS_DIR2_SPACE_SIZE)
> +			dip->di_size = cpu_to_be64(XFS_DIR2_SPACE_SIZE);
> +		else if (size == 0)
> +			dip->di_size = cpu_to_be64(1);

Similar. A size of 1 is not valid for a directory.

> +		break;
> +	}
> +}
.....
> +
> +/* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
> +STATIC int
> +xfs_repair_inode_core(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_imap			imap;
> +	struct xfs_buf			*bp;
> +	struct xfs_dinode		*dip;
> +	xfs_ino_t			ino;
> +	int				error;
> +
> +	/* Map & read inode. */
> +	ino = sc->sm->sm_ino;
> +	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
> +	if (error)
> +		return error;
> +
> +	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
> +			imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp, NULL);
> +	if (error)
> +		return error;

I'd like to see this check the inode isn't in-core after we've read
and locked the inode buffer, just to ensure we haven't raced with
another access.

> +
> +	/* Make sure we can pass the inode buffer verifier. */
> +	xfs_repair_inode_buf(sc, bp);
> +	bp->b_ops = &xfs_inode_buf_ops;
> +
> +	/* Fix everything the verifier will complain about. */
> +	dip = xfs_buf_offset(bp, imap.im_boffset);
> +	xfs_repair_inode_header(sc, dip);
> +	xfs_repair_inode_mode(dip);
> +	xfs_repair_inode_flags(sc, dip);
> +	xfs_repair_inode_size(dip);
> +	xfs_repair_inode_extsize_hints(sc, dip);

what if the inode failed the fork verifiers rather than the dinode
verifier?

> + * Fix problems that the verifiers don't care about.  In general these are
> + * errors that don't cause problems elsewhere in the kernel that we can easily
> + * detect, so we don't check them all that rigorously.
> + */
> +
> +/* Make sure block and extent counts are ok. */
> +STATIC int
> +xfs_repair_inode_unchecked_blockcounts(
> +	struct xfs_scrub_context	*sc)
> +{
> +	xfs_filblks_t			count;
> +	xfs_filblks_t			acount;
> +	xfs_extnum_t			nextents;
> +	int				error;
> +
> +	/* di_nblocks/di_nextents/di_anextents don't match up? */
> +	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_DATA_FORK,
> +			&nextents, &count);
> +	if (error)
> +		return error;
> +	sc->ip->i_d.di_nextents = nextents;
> +
> +	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
> +			&nextents, &acount);
> +	if (error)
> +		return error;
> +	sc->ip->i_d.di_anextents = nextents;

Should the returned extent/block counts be validity checked?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-03  5:32   ` Dave Chinner
@ 2018-07-03 23:59     ` Darrick J. Wong
  2018-07-04  8:44       ` Carlos Maiolino
  2018-07-04 23:21       ` Dave Chinner
  0 siblings, 2 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-03 23:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Rebuild the reverse mapping btree from all primary metadata.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> ....
> 
> > +static inline int xfs_repair_rmapbt_setup(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	/* We don't support rmap repair, but we can still do a scan. */
> > +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> > +}
> 
> This comment seems at odds with the commit message....

This is the Kconfig shim needed if CONFIG_XFS_ONLINE_REPAIR=n.

> 
> ....
> 
> > +/*
> > + * Reverse Mapping Btree Repair
> > + * ============================
> > + *
> > + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> > + * else in XFS we lock inodes and then AG data structures, but generating the
> > + * list of rmap records requires that we be able to scan both block mapping
> > + * btrees of every inode in the filesystem to see if it owns any extents in
> > + * this AG.  We can't tolerate any inode updates while we do this, so we
> > + * freeze the filesystem to lock everyone else out, and grant ourselves
> > + * special privileges to run transactions with regular background reclamation
> > + * turned off.
> 
> Hmmm. This implies we are going to scan the entire filesystem for
> every AG we need to rebuild the rmap tree in. That seems like an
> awful lot of work if there's more than one rmap btree that needs
> rebuild.

[some of this Dave and I discussed on IRC, so I'll summarize for
everyone else here...]

For this initial v0 iteration of the rmap repair code, yes, we have to
freeze the fs and iterate everything.  However, unless your computer and
storage are particularly untrustworthy, rmapbt reconstruction should be
a very infrequent thing.  Now that we have a FREEZE_OK flag, userspace
has to opt-in to slow repairs, and presumably it could choose instead to
unmount and run xfs_repair if that's too dear or there are too many
broken AGs, etc.  More on that later.

In the long run I don't see a need to freeze the filesystem to scan
every inode for bmbt entries in the damaged AG.  In fact, we can improve
the performance of all the AG repair functions in general with the
scheme I'm about to outline:

Create a "shut down this AG" primitive.  Once set, block and inode
allocation routines will bypass this AG.  Unlinked inodes are moved to
the unlinked list to avoid touching as much of the AGI as we practically
can.  Unmapped/freed blocks can be moved to a hidden inode (in another
AG) to be freed later.  Growfs operation in that AG can be rejected.

With AG shutdown in place, we no longer need to freeze the whole
filesystem.  Instead we traverse the live fs looking for mappings into
the afflicted AG, and when we're done rebuilding the rmapbt we can then
clean out the unlinked inodes and blocks, which catches us up to the
rest of the system.

(It would be awesome if we could capture the log intent items for the
removals until we're done rebuilding the rmapbt, but I suspect that
could fatally pin the log on us.)

However, given that adding AG shutdowns would be a lot of work I'd
prefer to get this launched even if it's reasonably likely early
versions will fail due to ENOMEM.  I tried to structure the rmapbt
repair so that we collect all the records and touch all the inodes
before we start writing anything so that if we run out of memory or
encounter a locked/reclaimable/whatever inode we'll just free all the
memory and bail out.  After that, the admin can take the fs offline and
run xfs_repair, so it shouldn't be much worse than today.

> > + * We also have to be very careful not to allow inode reclaim to start a
> > + * transaction because all transactions (other than our own) will block.
> 
> What happens when we run out of memory? Inode reclaim will need to
> run at that point, right?

Yes.

> > + * So basically we scan all primary per-AG metadata and all block maps of all
> > + * inodes to generate a huge list of reverse map records.  Next we look for
> > + * gaps in the rmap records to calculate all the unclaimed free space (1).
> > + * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
> > + * the space used by those btrees from (1), and also subtract the free space
> > + * listed in the bnobt from (1).  What's left are the gaps in assigned space
> > + * that the new rmapbt knows about but the existing bnobt doesn't; these are
> > + * the blocks from the old rmapbt and they can be freed.
> 
> THis looks like a lot of repeated work. We've already scanned a
> bunch of these trees to repair them, then thrown away the scan
> results. Now we do another scan of what we've rebuilt.....
> 
> ... hold on. Chicken and egg.
> 
> We verify and rebuild all the other trees from the rmap information
> - how do we do determine that the rmap needs to rebuilt and that the
> metadata it's being rebuilt from is valid?

Userspace should invoke the other scrubbers for the AG before deciding
to re-run the rmap scrubber with IFLAG_REPAIR set.  xfs_scrub won't try
to repair rmap btrees until after it's already checked everything else
in the filesystem.  Currently xfs_scrub will complain if it finds
corruption in the primary data & the rmapbt; as the code matures I will
probably change it to error out when this happens.

> Given that we've effetively got to shut down access to the
> filesystem for the entire rmap rebuild while we do an entire
> filesystem scan, why would we do this online? It's going to be
> faster to do this rebuild offline (because of all the prefetching,
> rebuilding all AG trees from the state gathered in the full
> filesystem passes, etc) and we don't have to hack around potential
> transaction and memory reclaim deadlock situations, either?
> 
> So why do rmap rebuilds online at all?

The thing is, xfs_scrub will warm the xfs_buf cache during phases 2 and
3 while it checks everything.  By the time it gets to rmapbt repairs
towards the end of phase 4 (if there's enough memory) those blocks will
still be in cache and online repair doesn't have to wait for the disk.

If instead you unmount and run xfs_repair then xfs_repair has to reload
all that metadata and recheck it, all of which happens with the fs
offline.

So except for the extra complexity of avoiding deadlocks (which I
readily admit is not a small task) I at least don't think it's a
clear-cut downtime win to rely on xfs_repair.

> > + */
> > +
> > +/* Set us up to repair reverse mapping btrees. */
> > +int
> > +xfs_repair_rmapbt_setup(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	int				error;
> > +
> > +	/*
> > +	 * Freeze out anything that can lock an inode.  We reconstruct
> > +	 * the rmapbt by reading inode bmaps with the AGF held, which is
> > +	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
> > +	 * inodes.
> > +	 */
> > +	error = xfs_scrub_fs_freeze(sc);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Check the AG number and set up the scrub context. */
> > +	error = xfs_scrub_setup_fs(sc, ip);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Lock all the AG header buffers so that we can read all the
> > +	 * per-AG metadata too.
> > +	 */
> > +	error = xfs_repair_grab_all_ag_headers(sc);
> > +	if (error)
> > +		return error;
> 
> So if we have thousands of AGs (think PB scale filesystems) then
> we're going hold many thousands of locked buffers here? Just so we
> can rebuild the rmapbt in one AG?
> 
> What does holding these buffers locked protect us against that
> an active freeze doesn't?

Nothing, since we're frozen.  I think that's been around since before
rmapbt repair learned to freeze the fs.

> > +xfs_repair_rmapbt_new_rmap(
> > +	struct xfs_repair_rmapbt	*rr,
> > +	xfs_agblock_t			startblock,
> > +	xfs_extlen_t			blockcount,
> > +	uint64_t			owner,
> > +	uint64_t			offset,
> > +	unsigned int			flags)
> > +{
> > +	struct xfs_repair_rmapbt_extent	*rre;
> > +	int				error = 0;
> > +
> > +	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
> > +			startblock, blockcount, owner, offset, flags);
> > +
> > +	if (xfs_scrub_should_terminate(rr->sc, &error))
> > +		return error;
> > +
> > +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> > +	if (!rre)
> > +		return -ENOMEM;
> 
> This seems like a likely thing to happen given the "no reclaim"
> state of the filesystem and the memory demand a rmapbt rebuild
> can have. If we've got GBs of rmap info in the AG that needs to be
> rebuilt, how much RAM are we going to need to index it all as we
> scan the filesystem?

More than I'd like -- at least 24 bytes per record (which at least is no
larger than the size of the on-disk btree) plus a list_head until I can
move the repairers away from creating huge lists.

> > +xfs_repair_rmapbt_scan_ifork(
> > +	struct xfs_repair_rmapbt	*rr,
> > +	struct xfs_inode		*ip,
> > +	int				whichfork)
> > +{
> > +	struct xfs_bmbt_irec		rec;
> > +	struct xfs_iext_cursor		icur;
> > +	struct xfs_mount		*mp = rr->sc->mp;
> > +	struct xfs_btree_cur		*cur = NULL;
> > +	struct xfs_ifork		*ifp;
> > +	unsigned int			rflags;
> > +	int				fmt;
> > +	int				error = 0;
> > +
> > +	/* Do we even have data mapping extents? */
> > +	fmt = XFS_IFORK_FORMAT(ip, whichfork);
> > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > +	switch (fmt) {
> > +	case XFS_DINODE_FMT_BTREE:
> > +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> > +			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
> > +			if (error)
> > +				return error;
> > +		}
> 
> Ok, so we need inodes locked to do this....
> 
> ....
> > +/* Iterate all the inodes in an AG group. */
> > +STATIC int
> > +xfs_repair_rmapbt_scan_inobt(
> > +	struct xfs_btree_cur		*cur,
> > +	union xfs_btree_rec		*rec,
> > +	void				*priv)
> > +{
> > +	struct xfs_inobt_rec_incore	irec;
> > +	struct xfs_repair_rmapbt	*rr = priv;
> > +	struct xfs_mount		*mp = cur->bc_mp;
> > +	struct xfs_inode		*ip = NULL;
> > +	xfs_ino_t			ino;
> > +	xfs_agino_t			agino;
> > +	int				chunkidx;
> > +	int				lock_mode = 0;
> > +	int				error = 0;
> > +
> > +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> > +
> > +	for (chunkidx = 0, agino = irec.ir_startino;
> > +	     chunkidx < XFS_INODES_PER_CHUNK;
> > +	     chunkidx++, agino++) {
> > +		bool	inuse;
> > +
> > +		/* Skip if this inode is free */
> > +		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
> > +			continue;
> > +		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
> > +
> > +		/* Back off and try again if an inode is being reclaimed */
> > +		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
> > +				&inuse);
> > +		if (error == -EAGAIN)
> > +			return -EDEADLOCK;
> 
> And we can get inode access errors here.....
> 
> FWIW, how is the inode being reclaimed if the filesystem is frozen?

Memory reclaim from some other process?  Maybe? :)

I'll do some research to learn if other processes can get into inode
reclaim on our frozen fs.

> > +
> > +		/*
> > +		 * Grab inode for scanning.  We cannot use DONTCACHE here
> > +		 * because we already have a transaction so the iput must not
> > +		 * trigger inode reclaim (which might allocate a transaction
> > +		 * to clean up posteof blocks).
> > +		 */
> > +		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);
> 
> So if there are enough inodes in the AG, we'll run out of memory
> here because we aren't reclaiming inodes from the cache but instead
> putting them all on the defferred iput list?

Enough inodes in the AG that also have post-eof blocks or cow blocks to
free.  If we don't have to do any inactive work then they just go away.

> > +		if (error)
> > +			return error;
> > +		trace_xfs_scrub_iget(ip, __this_address);
> > +
> > +		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
> > +		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
> > +		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
> > +		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
> > +			lock_mode = XFS_ILOCK_EXCL;
> > +		else
> > +			lock_mode = XFS_ILOCK_SHARED;
> > +		if (!xfs_ilock_nowait(ip, lock_mode)) {
> > +			error = -EBUSY;
> > +			goto out_rele;
> > +		}
> 
> And in what situation do we get inodes stuck with the ilock held on
> frozen filesysetms?

I think we do not, and that this is a relic of pre-freeze rmap repair.

> ....
> 
> > +out_unlock:
> > +	xfs_iunlock(ip, lock_mode);
> > +out_rele:
> > +	iput(VFS_I(ip));
> > +	return error;
> 
> calling iput in the error path is a bug - it will trigger all the
> paths you're trying to avoid by using te deferred iput list.

Oops, that should be our xfs_scrub_repair_iput.

> ....
> 
> 
> > +/* Collect rmaps for all block mappings for every inode in this AG. */
> > +STATIC int
> > +xfs_repair_rmapbt_generate_aginode_rmaps(
> > +	struct xfs_repair_rmapbt	*rr,
> > +	xfs_agnumber_t			agno)
> > +{
> > +	struct xfs_scrub_context	*sc = rr->sc;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_buf			*agi_bp;
> > +	int				error;
> > +
> > +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> > +	if (error)
> > +		return error;
> > +	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
> > +	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);
> 
> So if we get a locked or reclaiming inode anywhere in the
> filesystem we see EDEADLOCK/EBUSY here without having scanned all
> the inodes in the AG, right?

Right.

> > +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > +	xfs_trans_brelse(sc->tp, agi_bp);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Generate all the reverse-mappings for this AG, a list of the old rmapbt
> > + * blocks, and the new btreeblks count.  Figure out if we have enough free
> > + * space to reconstruct the inode btrees.  The caller must clean up the lists
> > + * if anything goes wrong.
> > + */
> > +STATIC int
> > +xfs_repair_rmapbt_find_rmaps(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*rmap_records,
> > +	xfs_agblock_t			*new_btreeblks)
> > +{
> > +	struct xfs_repair_rmapbt	rr;
> > +	xfs_agnumber_t			agno;
> > +	int				error;
> > +
> > +	rr.rmaplist = rmap_records;
> > +	rr.sc = sc;
> > +	rr.nr_records = 0;
> > +
> > +	/* Generate rmaps for AG space metadata */
> > +	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Iterate all AGs for inodes rmaps. */
> > +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
> > +		if (error)
> > +			return error;
> 
> And that means we abort here....
> 
> > +/* Repair the rmap btree for some AG. */
> > +int
> > +xfs_repair_rmapbt(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct list_head		rmap_records;
> > +	xfs_extlen_t			new_btreeblks;
> > +	int				log_flags = 0;
> > +	int				error;
> > +
> > +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> > +
> > +	/* Collect rmaps for all AG headers. */
> > +	INIT_LIST_HEAD(&rmap_records);
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
> > +	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
> > +	if (error)
> > +		goto out;
> 
> And we drop out here. So, essentially, any ENOMEM, locked inode or
> inode in reclaim anywhere in the filesystem will prevent rmap
> rebuild. Which says to me that rebuilding the rmap on
> on any substantial filesystem is likely to fail.
> 
> Which brings me back to my original question: why attempt to do
> rmap rebuild online given how complex it is, the performance
> implications of a full filesystem scan per AG that needs rebuild,
> and all the ways it could easily fail?

Right.  If we run out of memory or hit a locked/in-reclaim inode we'll
bounce back out to userspace having not touched anything.  Userspace can
decide if it wants to migrate or shut down other services and retry the
online scrub, or if it wants to unmount and run xfs_repair instead.
My mid-term goals for the repair patchset is to minimize the memory use
and minimize the amount of time we spend in the freezer, but for that I
need to add a few more (largeish) tools to XFS and an trying to avoid
snowballing a ton of code in front of repair. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 13/21] xfs: repair inode records
  2018-07-03  6:17   ` Dave Chinner
@ 2018-07-04  0:16     ` Darrick J. Wong
  2018-07-04  1:03       ` Dave Chinner
  0 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04  0:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jul 03, 2018 at 04:17:18PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:51PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Try to reinitialize corrupt inodes, or clear the reflink flag
> > if it's not needed.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> A comment somewhere that this is only attmepting to repair inodes
> that have failed verifier checks on read would be good.

There are a few comments in the callers, e.g.

"Repair all the things that the inode verifiers care about"

"Fix everything xfs_dinode_verify cares about."

"Make sure we can pass the inode buffer verifier."

Hmm, I think maybe you meant that I need to make it more obvious which
functions exist to make the verifiers happy (and so there won't be any
in-core inodes while they run) vs. which ones fix irregularities that
aren't caught as a condition for setting up in-core inodes?

xrep_inode_unchecked_* are the ones that run on in-core inodes; the rest
run on inodes so damaged they can't be _iget'd.

> ......
> > +/* Make sure this buffer can pass the inode buffer verifier. */
> > +STATIC void
> > +xfs_repair_inode_buf(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_buf			*bp)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_trans		*tp = sc->tp;
> > +	struct xfs_dinode		*dip;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agino_t			agino;
> > +	int				ioff;
> > +	int				i;
> > +	int				ni;
> > +	int				di_ok;
> > +	bool				unlinked_ok;
> > +
> > +	ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock;
> > +	agno = xfs_daddr_to_agno(mp, XFS_BUF_ADDR(bp));
> > +	for (i = 0; i < ni; i++) {
> > +		ioff = i << mp->m_sb.sb_inodelog;
> > +		dip = xfs_buf_offset(bp, ioff);
> > +		agino = be32_to_cpu(dip->di_next_unlinked);
> > +		unlinked_ok = (agino == NULLAGINO ||
> > +			       xfs_verify_agino(sc->mp, agno, agino));
> > +		di_ok = dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
> > +			xfs_dinode_good_version(mp, dip->di_version);
> > +		if (di_ok && unlinked_ok)
> > +			continue;
> 
> Readability woul dbe better with:
> 
> 		unlinked_ok = false;
> 		if (agino == NULLAGINO || xfs_verify_agino(sc->mp, agno, agino))
> 			unlinked_ok = true;
> 
> 		di_ok = false;
> 		if (dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
> 		    xfs_dinode_good_version(mp, dip->di_version))
> 			di_ok = true;
> 
> 		if (di_ok && unlinked_ok)
> 			continue;
> 

Ok.

> Also, is there a need to check the inode CRC here?

We already know the inode core is bad, so why not just reset it?

> > +		dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> > +		dip->di_version = 3;
> > +		if (!unlinked_ok)
> > +			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
> > +		xfs_dinode_calc_crc(mp, dip);
> > +		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
> > +		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);
> 
> Hmmmm. how does this interact with other transactions in repair that
> might have logged changes to the same in-core inode? If it was just
> changing the unlinked pointer, then that would be ok, but
> magic/version are overwritten by the inode item recovery...

There shouldn't be an in-core inode; this function should only get
called if we failed to _iget the inode, which implies that nobody else
has an in-core inode.

> 
> > +/* Reinitialize things that never change in an inode. */
> > +STATIC void
> > +xfs_repair_inode_header(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_dinode		*dip)
> > +{
> > +	dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> > +	if (!xfs_dinode_good_version(sc->mp, dip->di_version))
> > +		dip->di_version = 3;
> > +	dip->di_ino = cpu_to_be64(sc->sm->sm_ino);
> > +	uuid_copy(&dip->di_uuid, &sc->mp->m_sb.sb_meta_uuid);
> > +	dip->di_gen = cpu_to_be32(sc->sm->sm_gen);
> > +}
> > +
> > +/*
> > + * Turn di_mode into /something/ recognizable.
> > + *
> > + * XXX: Ideally we'd try to read data block 0 to see if it's a directory.
> > + */
> > +STATIC void
> > +xfs_repair_inode_mode(
> > +	struct xfs_dinode	*dip)
> > +{
> > +	uint16_t		mode;
> > +
> > +	mode = be16_to_cpu(dip->di_mode);
> > +	if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN)
> > +		return;
> > +
> > +	/* bad mode, so we set it to a file that only root can read */
> > +	mode = S_IFREG;
> > +	dip->di_mode = cpu_to_be16(mode);
> > +	dip->di_uid = 0;
> > +	dip->di_gid = 0;
> 
> Not sure that's a good idea - if the mode is bad I don't think we
> should expose it to anyone. Perhaps we need an orphan type

Agreed.

> > +}
> > +
> > +/* Fix any conflicting flags that the verifiers complain about. */
> > +STATIC void
> > +xfs_repair_inode_flags(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_dinode		*dip)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	uint64_t			flags2;
> > +	uint16_t			mode;
> > +	uint16_t			flags;
> > +
> > +	mode = be16_to_cpu(dip->di_mode);
> > +	flags = be16_to_cpu(dip->di_flags);
> > +	flags2 = be64_to_cpu(dip->di_flags2);
> > +
> > +	if (xfs_sb_version_hasreflink(&mp->m_sb) && S_ISREG(mode))
> > +		flags2 |= XFS_DIFLAG2_REFLINK;
> > +	else
> > +		flags2 &= ~(XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE);
> > +	if (flags & XFS_DIFLAG_REALTIME)
> > +		flags2 &= ~XFS_DIFLAG2_REFLINK;
> > +	if (flags2 & XFS_DIFLAG2_REFLINK)
> > +		flags2 &= ~XFS_DIFLAG2_DAX;
> > +	dip->di_flags = cpu_to_be16(flags);
> > +	dip->di_flags2 = cpu_to_be64(flags2);
> > +}
> > +
> > +/* Make sure we don't have a garbage file size. */
> > +STATIC void
> > +xfs_repair_inode_size(
> > +	struct xfs_dinode	*dip)
> > +{
> > +	uint64_t		size;
> > +	uint16_t		mode;
> > +
> > +	mode = be16_to_cpu(dip->di_mode);
> > +	size = be64_to_cpu(dip->di_size);
> > +	switch (mode & S_IFMT) {
> > +	case S_IFIFO:
> > +	case S_IFCHR:
> > +	case S_IFBLK:
> > +	case S_IFSOCK:
> > +		/* di_size can't be nonzero for special files */
> > +		dip->di_size = 0;
> > +		break;
> > +	case S_IFREG:
> > +		/* Regular files can't be larger than 2^63-1 bytes. */
> > +		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
> > +		break;
> > +	case S_IFLNK:
> > +		/* Catch over- or under-sized symlinks. */
> > +		if (size > XFS_SYMLINK_MAXLEN)
> > +			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
> > +		else if (size == 0)
> > +			dip->di_size = cpu_to_be64(1);
> 
> Not sure this is valid - if the inode is in extent format then a
> size of 1 is invalid and means the symlink will point to the
> first byte in the data fork, and that could be anything....

I picked these wonky looking formats so that we'd always trigger the
higher level repair functions to have a look at the link/dir without
blowing up elsewhere in the code if we tried to use them.  Not that we
can do much for broken symlinks, but directories could be rebuilt.

But maybe directories should simply be reset to an empty inline
directory, and eventually grow an iflag that will always trigger
directory reconstruction (when parent pointers become a thing).

> > +		break;
> > +	case S_IFDIR:
> > +		/* Directories can't have a size larger than 32G. */
> > +		if (size > XFS_DIR2_SPACE_SIZE)
> > +			dip->di_size = cpu_to_be64(XFS_DIR2_SPACE_SIZE);
> > +		else if (size == 0)
> > +			dip->di_size = cpu_to_be64(1);
> 
> Similar. A size of 1 is not valid for a directory.
> 
> > +		break;
> > +	}
> > +}
> .....
> > +
> > +/* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
> > +STATIC int
> > +xfs_repair_inode_core(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_imap			imap;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_dinode		*dip;
> > +	xfs_ino_t			ino;
> > +	int				error;
> > +
> > +	/* Map & read inode. */
> > +	ino = sc->sm->sm_ino;
> > +	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
> > +	if (error)
> > +		return error;
> > +
> > +	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
> > +			imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp, NULL);
> > +	if (error)
> > +		return error;
> 
> I'd like to see this check the inode isn't in-core after we've read
> and locked the inode buffer, just to ensure we haven't raced with
> another access.

Ok.

> > +
> > +	/* Make sure we can pass the inode buffer verifier. */
> > +	xfs_repair_inode_buf(sc, bp);
> > +	bp->b_ops = &xfs_inode_buf_ops;
> > +
> > +	/* Fix everything the verifier will complain about. */
> > +	dip = xfs_buf_offset(bp, imap.im_boffset);
> > +	xfs_repair_inode_header(sc, dip);
> > +	xfs_repair_inode_mode(dip);
> > +	xfs_repair_inode_flags(sc, dip);
> > +	xfs_repair_inode_size(dip);
> > +	xfs_repair_inode_extsize_hints(sc, dip);
> 
> what if the inode failed the fork verifiers rather than the dinode
> verifier?

That's coming up in the next patch.  Want me to put in an XXX comment to
that effect?

> > + * Fix problems that the verifiers don't care about.  In general these are
> > + * errors that don't cause problems elsewhere in the kernel that we can easily
> > + * detect, so we don't check them all that rigorously.
> > + */
> > +
> > +/* Make sure block and extent counts are ok. */
> > +STATIC int
> > +xfs_repair_inode_unchecked_blockcounts(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	xfs_filblks_t			count;
> > +	xfs_filblks_t			acount;
> > +	xfs_extnum_t			nextents;
> > +	int				error;
> > +
> > +	/* di_nblocks/di_nextents/di_anextents don't match up? */
> > +	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_DATA_FORK,
> > +			&nextents, &count);
> > +	if (error)
> > +		return error;
> > +	sc->ip->i_d.di_nextents = nextents;
> > +
> > +	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
> > +			&nextents, &acount);
> > +	if (error)
> > +		return error;
> > +	sc->ip->i_d.di_anextents = nextents;
> 
> Should the returned extent/block counts be validity checked?

Er... yes.  Good catch. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 13/21] xfs: repair inode records
  2018-07-04  0:16     ` Darrick J. Wong
@ 2018-07-04  1:03       ` Dave Chinner
  2018-07-04  1:30         ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-04  1:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jul 03, 2018 at 05:16:12PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 03, 2018 at 04:17:18PM +1000, Dave Chinner wrote:
> > On Sun, Jun 24, 2018 at 12:24:51PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Try to reinitialize corrupt inodes, or clear the reflink flag
> > > if it's not needed.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > A comment somewhere that this is only attmepting to repair inodes
> > that have failed verifier checks on read would be good.
> 
> There are a few comments in the callers, e.g.
> 
> "Repair all the things that the inode verifiers care about"
> 
> "Fix everything xfs_dinode_verify cares about."
> 
> "Make sure we can pass the inode buffer verifier."
> 
> Hmm, I think maybe you meant that I need to make it more obvious which
> functions exist to make the verifiers happy (and so there won't be any
> in-core inodes while they run) vs. which ones fix irregularities that
> aren't caught as a condition for setting up in-core inodes?

Well, that too. My main point is that various one-liners don't
explain the overall picture of what, why and how the repair is being
done....

> xrep_inode_unchecked_* are the ones that run on in-core inodes; the rest
> run on inodes so damaged they can't be _iget'd.

It's completely ambiguous, though: "unchecked" by what, exactly? :P

> > Also, is there a need to check the inode CRC here?
> 
> We already know the inode core is bad, so why not just reset it?

But you don't recalculate it if di_ok and unlinked_ok are true. It
only gets recalc'd if when changes need to be made. hence an inode
that failed the verifier because of a CRC error still won't pass the
verifier after going through this function.

> > > +		dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> > > +		dip->di_version = 3;
> > > +		if (!unlinked_ok)
> > > +			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
> > > +		xfs_dinode_calc_crc(mp, dip);
> > > +		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
> > > +		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);
> > 
> > Hmmmm. how does this interact with other transactions in repair that
> > might have logged changes to the same in-core inode? If it was just
> > changing the unlinked pointer, then that would be ok, but
> > magic/version are overwritten by the inode item recovery...
> 
> There shouldn't be an in-core inode; this function should only get
> called if we failed to _iget the inode, which implies that nobody else
> has an in-core inode.

OK - so we've held the buffer locked across a check for in-core
inodes we are trying to repair?

> > > +	switch (mode & S_IFMT) {
> > > +	case S_IFIFO:
> > > +	case S_IFCHR:
> > > +	case S_IFBLK:
> > > +	case S_IFSOCK:
> > > +		/* di_size can't be nonzero for special files */
> > > +		dip->di_size = 0;
> > > +		break;
> > > +	case S_IFREG:
> > > +		/* Regular files can't be larger than 2^63-1 bytes. */
> > > +		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
> > > +		break;
> > > +	case S_IFLNK:
> > > +		/* Catch over- or under-sized symlinks. */
> > > +		if (size > XFS_SYMLINK_MAXLEN)
> > > +			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
> > > +		else if (size == 0)
> > > +			dip->di_size = cpu_to_be64(1);
> > 
> > Not sure this is valid - if the inode is in extent format then a
> > size of 1 is invalid and means the symlink will point to the
> > first byte in the data fork, and that could be anything....
> 
> I picked these wonky looking formats so that we'd always trigger the
> higher level repair functions to have a look at the link/dir without
> blowing up elsewhere in the code if we tried to use them.  Not that we
> can do much for broken symlinks, but directories could be rebuilt.

Change the symlink to an inline symlink that points to "zero length
symlink repaired by online repair" and set the c/mtimes to the
current time?

> But maybe directories should simply be reset to an empty inline
> directory, and eventually grow an iflag that will always trigger
> directory reconstruction (when parent pointers become a thing).

Yeah, I think if we are going to do anything here we should be
setting the inodes to valid "empty" state. Or at least comment that
it's being set to a state that will detected and rebuilt by the
upcoming fork repair pass.

> > what if the inode failed the fork verifiers rather than the dinode
> > verifier?
> 
> That's coming up in the next patch.  Want me to put in an XXX comment to
> that effect?

Not if all you do is remove it in the next patch - better to
document where we are making changes that the fork rebuild will
detect and fix up. :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 13/21] xfs: repair inode records
  2018-07-04  1:03       ` Dave Chinner
@ 2018-07-04  1:30         ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04  1:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 04, 2018 at 11:03:44AM +1000, Dave Chinner wrote:
> On Tue, Jul 03, 2018 at 05:16:12PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 03, 2018 at 04:17:18PM +1000, Dave Chinner wrote:
> > > On Sun, Jun 24, 2018 at 12:24:51PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Try to reinitialize corrupt inodes, or clear the reflink flag
> > > > if it's not needed.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > A comment somewhere that this is only attmepting to repair inodes
> > > that have failed verifier checks on read would be good.
> > 
> > There are a few comments in the callers, e.g.
> > 
> > "Repair all the things that the inode verifiers care about"
> > 
> > "Fix everything xfs_dinode_verify cares about."
> > 
> > "Make sure we can pass the inode buffer verifier."
> > 
> > Hmm, I think maybe you meant that I need to make it more obvious which
> > functions exist to make the verifiers happy (and so there won't be any
> > in-core inodes while they run) vs. which ones fix irregularities that
> > aren't caught as a condition for setting up in-core inodes?
> 
> Well, that too. My main point is that various one-liners don't
> explain the overall picture of what, why and how the repair is being
> done....

Ok.

> > xrep_inode_unchecked_* are the ones that run on in-core inodes; the rest
> > run on inodes so damaged they can't be _iget'd.
> 
> It's completely ambiguous, though: "unchecked" by what, exactly? :P
> 
> > > Also, is there a need to check the inode CRC here?
> > 
> > We already know the inode core is bad, so why not just reset it?
> 
> But you don't recalculate it if di_ok and unlinked_ok are true. It
> only gets recalc'd if when changes need to be made. hence an inode
> that failed the verifier because of a CRC error still won't pass the
> verifier after going through this function.

D'oh.  Thank you for catching that. :)

> > > > +		dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> > > > +		dip->di_version = 3;
> > > > +		if (!unlinked_ok)
> > > > +			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
> > > > +		xfs_dinode_calc_crc(mp, dip);
> > > > +		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
> > > > +		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);
> > > 
> > > Hmmmm. how does this interact with other transactions in repair that
> > > might have logged changes to the same in-core inode? If it was just
> > > changing the unlinked pointer, then that would be ok, but
> > > magic/version are overwritten by the inode item recovery...
> > 
> > There shouldn't be an in-core inode; this function should only get
> > called if we failed to _iget the inode, which implies that nobody else
> > has an in-core inode.
> 
> OK - so we've held the buffer locked across a check for in-core
> inodes we are trying to repair?

That's the intent, anyway. :)

> > > > +	switch (mode & S_IFMT) {
> > > > +	case S_IFIFO:
> > > > +	case S_IFCHR:
> > > > +	case S_IFBLK:
> > > > +	case S_IFSOCK:
> > > > +		/* di_size can't be nonzero for special files */
> > > > +		dip->di_size = 0;
> > > > +		break;
> > > > +	case S_IFREG:
> > > > +		/* Regular files can't be larger than 2^63-1 bytes. */
> > > > +		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
> > > > +		break;
> > > > +	case S_IFLNK:
> > > > +		/* Catch over- or under-sized symlinks. */
> > > > +		if (size > XFS_SYMLINK_MAXLEN)
> > > > +			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
> > > > +		else if (size == 0)
> > > > +			dip->di_size = cpu_to_be64(1);
> > > 
> > > Not sure this is valid - if the inode is in extent format then a
> > > size of 1 is invalid and means the symlink will point to the
> > > first byte in the data fork, and that could be anything....
> > 
> > I picked these wonky looking formats so that we'd always trigger the
> > higher level repair functions to have a look at the link/dir without
> > blowing up elsewhere in the code if we tried to use them.  Not that we
> > can do much for broken symlinks, but directories could be rebuilt.
> 
> Change the symlink to an inline symlink that points to "zero length
> symlink repaired by online repair" and set the c/mtimes to the
> current time?

Ok.

> > But maybe directories should simply be reset to an empty inline
> > directory, and eventually grow an iflag that will always trigger
> > directory reconstruction (when parent pointers become a thing).
> 
> Yeah, I think if we are going to do anything here we should be
> setting the inodes to valid "empty" state. Or at least comment that
> it's being set to a state that will detected and rebuilt by the
> upcoming fork repair pass.

Ok.

> > > what if the inode failed the fork verifiers rather than the dinode
> > > verifier?
> > 
> > That's coming up in the next patch.  Want me to put in an XXX comment to
> > that effect?
> 
> Not if all you do is remove it in the next patch - better to
> document where we are making changes that the fork rebuild will
> detect and fix up. :P

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 14/21] xfs: zap broken inode forks
  2018-06-24 19:24 ` [PATCH 14/21] xfs: zap broken inode forks Darrick J. Wong
@ 2018-07-04  2:07   ` Dave Chinner
  2018-07-04  3:26     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-04  2:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:24:57PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Determine if inode fork damage is responsible for the inode being unable
> to pass the ifork verifiers in xfs_iget and zap the fork contents if
> this is true.  Once this is done the fork will be empty but we'll be
> able to construct an in-core inode, and a subsequent call to the inode
> fork repair ioctl will search the rmapbt to rebuild the records that
> were in the fork.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_attr_leaf.c |   32 ++-
>  fs/xfs/libxfs/xfs_attr_leaf.h |    2 
>  fs/xfs/libxfs/xfs_bmap.c      |   21 ++
>  fs/xfs/libxfs/xfs_bmap.h      |    2 
>  fs/xfs/scrub/inode_repair.c   |  399 +++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 437 insertions(+), 19 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index b3c19339e1b5..f6c458104934 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -894,23 +894,16 @@ xfs_attr_shortform_allfit(
>  	return xfs_attr_shortform_bytesfit(dp, bytes);
>  }
>  
> -/* Verify the consistency of an inline attribute fork. */
> +/* Verify the consistency of a raw inline attribute fork. */
>  xfs_failaddr_t
> -xfs_attr_shortform_verify(
> -	struct xfs_inode		*ip)
> +xfs_attr_shortform_verify_struct(
> +	struct xfs_attr_shortform	*sfp,
> +	size_t				size)

The internal structure checking functions in the directory code
use the naming convention xfs_dir3_<type>_check(). I think we should use
the same here. i.e. xfs_attr_sf_check().

> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index b7f094e19bab..b1254e6c17b5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -6223,18 +6223,16 @@ xfs_bmap_finish_one(
>  	return error;
>  }
>  
> -/* Check that an inode's extent does not have invalid flags or bad ranges. */
> +/* Check that an extent does not have invalid flags or bad ranges. */
>  xfs_failaddr_t
> -xfs_bmap_validate_extent(
> -	struct xfs_inode	*ip,
> +xfs_bmbt_validate_extent(

xfs_bmbt_ prefixes should only appear in xfs_bmap_btree.c, not
xfs_bmap.c....

So either it needs to get moved or renamed to something like
xfs_bmap_validate_irec()?


> diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
> index 4ac43c1b1eb0..b941f21d7667 100644
> --- a/fs/xfs/scrub/inode_repair.c
> +++ b/fs/xfs/scrub/inode_repair.c
> @@ -22,11 +22,15 @@
>  #include "xfs_ialloc.h"
>  #include "xfs_da_format.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
>  #include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
>  #include "xfs_bmap.h"
> +#include "xfs_bmap_btree.h"
>  #include "xfs_bmap_util.h"
>  #include "xfs_dir2.h"
>  #include "xfs_quota_defs.h"
> +#include "xfs_attr_leaf.h"
>  #include "scrub/xfs_scrub.h"
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
> @@ -113,7 +117,8 @@ xfs_repair_inode_mode(
>  STATIC void
>  xfs_repair_inode_flags(
>  	struct xfs_scrub_context	*sc,
> -	struct xfs_dinode		*dip)
> +	struct xfs_dinode		*dip,
> +	bool				is_rt_file)
>  {
>  	struct xfs_mount		*mp = sc->mp;
>  	uint64_t			flags2;
> @@ -132,6 +137,10 @@ xfs_repair_inode_flags(
>  		flags2 &= ~XFS_DIFLAG2_REFLINK;
>  	if (flags2 & XFS_DIFLAG2_REFLINK)
>  		flags2 &= ~XFS_DIFLAG2_DAX;
> +	if (is_rt_file)
> +		flags |= XFS_DIFLAG_REALTIME;
> +	else
> +		flags &= ~XFS_DIFLAG_REALTIME;

This needs to be done first. i.e. before we check things like rt vs
reflink flags.

> @@ -210,17 +219,402 @@ xfs_repair_inode_extsize_hints(
>  	}
>  }
>  
> +struct xfs_repair_inode_fork_counters {
> +	struct xfs_scrub_context	*sc;
> +	xfs_rfsblock_t			data_blocks;
> +	xfs_rfsblock_t			rt_blocks;

inode is either data or rt, not both. Why do you need two separate
counters? Oh, bmbt blocks are always data, right? Coment, perhaps?

> +	xfs_rfsblock_t			attr_blocks;
> +	xfs_extnum_t			data_extents;
> +	xfs_extnum_t			rt_extents;

but bmbt blocks are not data extents, so only one counter here?

> +/* Count extents and blocks for a given inode from all rmap data. */
> +STATIC int
> +xfs_repair_inode_count_rmaps(
> +	struct xfs_repair_inode_fork_counters	*rifc)
> +{
> +	xfs_agnumber_t			agno;
> +	int				error;
> +
> +	if (!xfs_sb_version_hasrmapbt(&rifc->sc->mp->m_sb) ||
> +	    xfs_sb_version_hasrealtime(&rifc->sc->mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	/* XXX: find rt blocks too */

Ok, needs a comment up front that realtime repair isn't supported,
rather than hiding it down here.

> +	for (agno = 0; agno < rifc->sc->mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_repair_inode_count_ag_rmaps(rifc, agno);
> +		if (error)
> +			return error;
> +	}

	/* rt not supported yet */
	ASSERT(rifc->rt_extents == 0);

> +	/* Can't have extents on both the rt and the data device. */
> +	if (rifc->data_extents && rifc->rt_extents)
> +		return -EFSCORRUPTED;
> +
> +	return 0;
> +}
> +
> +/* Figure out if we need to zap this extents format fork. */
> +STATIC bool
> +xfs_repair_inode_core_check_extents_fork(


Urk. Not sure what this function is supposed to be doing from the
name. xrep_ifork_extent_check()? And then the next function
becomes xrep_ifork_btree_check()?

Also document the return values.

> +	struct xfs_scrub_context	*sc,
> +	struct xfs_dinode		*dip,
> +	int				dfork_size,
> +	int				whichfork)
> +{
> +	struct xfs_bmbt_irec		new;
> +	struct xfs_bmbt_rec		*dp;
> +	bool				isrt;
> +	int				i;
> +	int				nex;
> +	int				fork_size;
> +
> +	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	fork_size = nex * sizeof(struct xfs_bmbt_rec);
> +	if (fork_size < 0 || fork_size > dfork_size)
> +		return true;

Check nex against dip->di_nextents?

> +	dp = (struct xfs_bmbt_rec *)XFS_DFORK_PTR(dip, whichfork);
> +
> +	isrt = dip->di_flags & cpu_to_be16(XFS_DIFLAG_REALTIME);
> +	for (i = 0; i < nex; i++, dp++) {
> +		xfs_failaddr_t	fa;
> +
> +		xfs_bmbt_disk_get_all(dp, &new);
> +		fa = xfs_bmbt_validate_extent(sc->mp, isrt, whichfork, &new);
> +		if (fa)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/* Figure out if we need to zap this btree format fork. */
> +STATIC bool
> +xfs_repair_inode_core_check_btree_fork(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_dinode		*dip,
> +	int				dfork_size,
> +	int				whichfork)
> +{
> +	struct xfs_bmdr_block		*dfp;
> +	int				nrecs;
> +	int				level;
> +
> +	if (XFS_DFORK_NEXTENTS(dip, whichfork) <=
> +			dfork_size / sizeof(struct xfs_bmbt_irec))
> +		return true;

check against dip->di_nextents?

> +	dfp = (struct xfs_bmdr_block *)XFS_DFORK_PTR(dip, whichfork);
> +	nrecs = be16_to_cpu(dfp->bb_numrecs);
> +	level = be16_to_cpu(dfp->bb_level);
> +
> +	if (nrecs == 0 || XFS_BMDR_SPACE_CALC(nrecs) > dfork_size)
> +		return true;
> +	if (level == 0 || level > XFS_BTREE_MAXLEVELS)
> +		return true;

Should this visit the bmbt blocks to check the level is actually
correct?

> +	return false;
> +}
> +
> +/*
> + * Check the data fork for things that will fail the ifork verifiers or the
> + * ifork formatters.
> + */
> +STATIC bool
> +xfs_repair_inode_core_check_data_fork(

xrep_ifork_check_data()

> +	struct xfs_scrub_context	*sc,
> +	struct xfs_dinode		*dip,
> +	uint16_t			mode)
> +{
> +	uint64_t			size;
> +	int				dfork_size;
> +
> +	size = be64_to_cpu(dip->di_size);
> +	switch (mode & S_IFMT) {
> +	case S_IFIFO:
> +	case S_IFCHR:
> +	case S_IFBLK:
> +	case S_IFSOCK:
> +		if (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK) != XFS_DINODE_FMT_DEV)
> +			return true;
> +		break;
> +	case S_IFREG:
> +	case S_IFLNK:
> +	case S_IFDIR:
> +		switch (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK)) {
> +		case XFS_DINODE_FMT_LOCAL:

local format is not valid for S_IFREG.

> +		case XFS_DINODE_FMT_EXTENTS:
> +		case XFS_DINODE_FMT_BTREE:
> +			break;
> +		default:
> +			return true;
> +		}
> +		break;
> +	default:
> +		return true;
> +	}
> +	dfork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_DATA_FORK);
> +	switch (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK)) {
> +	case XFS_DINODE_FMT_DEV:
> +		break;
> +	case XFS_DINODE_FMT_LOCAL:
> +		if (size > dfork_size)
> +			return true;
> +		break;
> +	case XFS_DINODE_FMT_EXTENTS:
> +		if (xfs_repair_inode_core_check_extents_fork(sc, dip,
> +				dfork_size, XFS_DATA_FORK))
> +			return true;
> +		break;
> +	case XFS_DINODE_FMT_BTREE:
> +		if (xfs_repair_inode_core_check_btree_fork(sc, dip,
> +				dfork_size, XFS_DATA_FORK))
> +			return true;
> +		break;
> +	default:
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +/* Reset the data fork to something sane. */
> +STATIC void
> +xfs_repair_inode_core_zap_data_fork(

xrep_ifork_zap_data()

(you get the idea :P)

> +	struct xfs_scrub_context	*sc,
> +	struct xfs_dinode		*dip,
> +	uint16_t			mode,
> +	struct xfs_repair_inode_fork_counters	*rifc)

(structure names can change, too :)

> +{
> +	char				*p;
> +	const struct xfs_dir_ops	*ops;
> +	struct xfs_dir2_sf_hdr		*sfp;
> +	int				i8count;
> +
> +	/* Special files always get reset to DEV */
> +	switch (mode & S_IFMT) {
> +	case S_IFIFO:
> +	case S_IFCHR:
> +	case S_IFBLK:
> +	case S_IFSOCK:
> +		dip->di_format = XFS_DINODE_FMT_DEV;
> +		dip->di_size = 0;
> +		return;
> +	}
> +
> +	/*
> +	 * If we have data extents, reset to an empty map and hope the user
> +	 * will run the bmapbtd checker next.
> +	 */
> +	if (rifc->data_extents || rifc->rt_extents || S_ISREG(mode)) {
> +		dip->di_format = XFS_DINODE_FMT_EXTENTS;
> +		dip->di_nextents = 0;
> +		return;
> +	}

Does the userspace tool run the bmapbtd checker next?

> +	/* Otherwise, reset the local format to the minimum. */
> +	switch (mode & S_IFMT) {
> +	case S_IFLNK:
> +		/* Blow out symlink; now it points to root dir */
> +		dip->di_format = XFS_DINODE_FMT_LOCAL;
> +		dip->di_size = cpu_to_be64(1);
> +		p = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
> +		*p = '/';

Maybe factor this so zero length symlinks can be set directly to the
same thing? FWIW, would making it point at '.' be better than '/' so
it always points within the same filesystem?

> +		break;
> +	case S_IFDIR:
> +		/*
> +		 * Blow out dir, make it point to the root.  In the
> +		 * future the direction repair will reconstruct this
> +		 * dir for us.
> +		 */

s/the direction//

> +		dip->di_format = XFS_DINODE_FMT_LOCAL;
> +		i8count = sc->mp->m_sb.sb_rootino > XFS_DIR2_MAX_SHORT_INUM;
> +		ops = xfs_dir_get_ops(sc->mp, NULL);
> +		sfp = (struct xfs_dir2_sf_hdr *)XFS_DFORK_PTR(dip,
> +				XFS_DATA_FORK);
> +		sfp->count = 0;
> +		sfp->i8count = i8count;
> +		ops->sf_put_parent_ino(sfp, sc->mp->m_sb.sb_rootino);
> +		dip->di_size = cpu_to_be64(xfs_dir2_sf_hdr_size(i8count));

What happens now if this dir has an ancestor still pointing at it?
Haven't we just screwed the directory structure? How does this
interact with the dentry cache, esp. w.r.t. disconnected dentries
(filehandle lookups)?

> +/*
> + * Check the attr fork for things that will fail the ifork verifiers or the
> + * ifork formatters.
> + */
> +STATIC bool
> +xfs_repair_inode_core_check_attr_fork(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_dinode		*dip)
> +{
> +	struct xfs_attr_shortform	*sfp;
> +	int				size;
> +
> +	if (XFS_DFORK_BOFF(dip) == 0)
> +		return dip->di_aformat != XFS_DINODE_FMT_EXTENTS ||
> +		       dip->di_anextents != 0;
> +
> +	size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
> +	switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) {
> +	case XFS_DINODE_FMT_LOCAL:
> +		sfp = (struct xfs_attr_shortform *)XFS_DFORK_PTR(dip,
> +				XFS_ATTR_FORK);

As a side note, we should make XFS_DFORK_PTR() return a void * so we
don't need casts like this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 06/21] xfs: repair free space btrees
  2018-06-27  3:21   ` Dave Chinner
@ 2018-07-04  2:15     ` Darrick J. Wong
  2018-07-04  2:25       ` Dave Chinner
  0 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04  2:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 27, 2018 at 01:21:23PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:07PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Rebuild the free space btrees from the gaps in the rmap btree.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> ......
> > +
> > +/* Collect an AGFL block for the not-to-release list. */
> > +static int
> > +xfs_repair_collect_agfl_block(
> > +	struct xfs_mount		*mp,
> > +	xfs_agblock_t			bno,
> > +	void				*priv)

Whoah, I never replied to this.  Oops. :(

> /me now gets confused by agfl code (xfs_repair_agfl_...) collecting btree
> blocks, and now the btree code (xfs_repair_collect_agfl... )
> collecting agfl blocks.
> 
> The naming/namespace collisions is not that nice. I think this needs
> to be xr_allocbt_collect_agfl_blocks().

> /me idly wonders about consistently renaming everything abt, bnbt, cnbt,
> fibt, ibt, rmbt and rcbt...

Hmm, I'll think about a mass rename. :)

xfs_repair_refcountbt_fiddle_faddle(...);

xrep_rcbt_fiddle_faddle(...);

xrpr_rcbt_fiddle_faddle(...);

xrprrcbt_fiddle_faddle(...);

Yeah, maybe that third one.

> 
> > +/*
> > + * Iterate all reverse mappings to find (1) the free extents, (2) the OWN_AG
> > + * extents, (3) the rmapbt blocks, and (4) the AGFL blocks.  The free space is
> > + * (1) + (2) - (3) - (4).  Figure out if we have enough free space to
> > + * reconstruct the free space btrees.  Caller must clean up the input lists
> > + * if something goes wrong.
> > + */
> > +STATIC int
> > +xfs_repair_allocbt_find_freespace(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*free_extents,
> > +	struct xfs_repair_extent_list	*old_allocbt_blocks)
> > +{
> > +	struct xfs_repair_alloc		ra;
> > +	struct xfs_repair_alloc_extent	*rae;
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_agblock_t			agend;
> > +	xfs_agblock_t			nr_blocks;
> > +	int				error;
> > +
> > +	ra.extlist = free_extents;
> > +	ra.btlist = old_allocbt_blocks;
> > +	xfs_repair_init_extent_list(&ra.nobtlist);
> > +	ra.next_bno = 0;
> > +	ra.nr_records = 0;
> > +	ra.nr_blocks = 0;
> > +	ra.sc = sc;
> > +
> > +	/*
> > +	 * Iterate all the reverse mappings to find gaps in the physical
> > +	 * mappings, all the OWN_AG blocks, and all the rmapbt extents.
> > +	 */
> > +	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
> > +	error = xfs_rmap_query_all(cur, xfs_repair_alloc_extent_fn, &ra);
> > +	if (error)
> > +		goto err;
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +	cur = NULL;
> > +
> > +	/* Insert a record for space between the last rmap and EOAG. */
> > +	agend = be32_to_cpu(XFS_BUF_TO_AGF(sc->sa.agf_bp)->agf_length);
> > +	if (ra.next_bno < agend) {
> > +		rae = kmem_alloc(sizeof(struct xfs_repair_alloc_extent),
> > +				KM_MAYFAIL);
> > +		if (!rae) {
> > +			error = -ENOMEM;
> > +			goto err;
> > +		}
> > +		INIT_LIST_HEAD(&rae->list);
> > +		rae->bno = ra.next_bno;
> > +		rae->len = agend - ra.next_bno;
> > +		list_add_tail(&rae->list, free_extents);
> > +		ra.nr_records++;
> > +	}
> > +
> > +	/* Collect all the AGFL blocks. */
> > +	error = xfs_agfl_walk(mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
> > +			sc->sa.agfl_bp, xfs_repair_collect_agfl_block, &ra);
> > +	if (error)
> > +		goto err;
> > +
> > +	/* Do we actually have enough space to do this? */
> > +	nr_blocks = 2 * xfs_allocbt_calc_size(mp, ra.nr_records);
> 
> 	/* Do we have enough space to rebuild both freespace trees? */
> 
> (explains the multiplication by 2)

Yep, will fix.

> > +	if (!xfs_repair_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE) ||
> > +	    ra.nr_blocks < nr_blocks) {
> > +		error = -ENOSPC;
> > +		goto err;
> > +	}
> > +
> > +	/* Compute the old bnobt/cntbt blocks. */
> > +	error = xfs_repair_subtract_extents(sc, old_allocbt_blocks,
> > +			&ra.nobtlist);
> > +	if (error)
> > +		goto err;
> > +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> > +	return 0;
> > +
> > +err:
> > +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> > +	if (cur)
> > +		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > +	return error;
> 
> Error stacking here can be cleaned up - we don't need an extra stack
> as the cursor is NULL when finished with. Hence it could just be:
> 
> 	/* Compute the old bnobt/cntbt blocks. */
> 	error = xfs_repair_subtract_extents(sc, old_allocbt_blocks,
> 			&ra.nobtlist);
> err:
> 	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> 	if (cur)
> 		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);

TBH I've been tempted for years to refactor this thing to take error
directly rather than require this XFS_BTREE_{NO,}ERROR business.
There's only two choices, and we nearly always decide using error == 0.

> 	return error;
> }
> 
> 
> > +}
> > +
> > +/*
> > + * Reset the global free block counter and the per-AG counters to make it look
> > + * like this AG has no free space.
> > + */
> > +STATIC int
> > +xfs_repair_allocbt_reset_counters(
> > +	struct xfs_scrub_context	*sc,
> > +	int				*log_flags)
> > +{
> > +	struct xfs_perag		*pag = sc->sa.pag;
> > +	struct xfs_agf			*agf;
> > +	xfs_extlen_t			oldf;
> > +	xfs_agblock_t			rmap_blocks;
> > +	int				error;
> > +
> > +	/*
> > +	 * Since we're abandoning the old bnobt/cntbt, we have to
> > +	 * decrease fdblocks by the # of blocks in those trees.
> > +	 * btreeblks counts the non-root blocks of the free space
> > +	 * and rmap btrees.  Do this before resetting the AGF counters.
> 
> Comment can use 80 columns.
> 
> > +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > +	rmap_blocks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
> > +	oldf = pag->pagf_btreeblks + 2;
> > +	oldf -= rmap_blocks;
> 
> Convoluted. The comment really didn't help me understand what oldf
> is accounting.
> 
> Ah, rmap_blocks is actually the new btreeblks count. OK.
> 
> 	/*
> 	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
> 	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
> 	 * non-root blocks of the free space and rmap btrees.  Do this before
> 	 * resetting the AGF counters.
> 	 */
> 
> 	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> 
> 	/* rmap_blocks accounts root block, btreeblks doesn't */
> 	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
> 
> 	/* btreeblks doesn't account bno/cnt root blocks */
> 	to_free = pag->pagf_btreeblks + 2;
> 
> 	/* and don't account for the blocks we aren't freeing */
> 	to_free -= new_btblks;

Ok, I'll do that.

> 
> > +	error = xfs_mod_fdblocks(sc->mp, -(int64_t)oldf, false);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Reset the per-AG info, both incore and ondisk. */
> > +	pag->pagf_btreeblks = rmap_blocks;
> > +	pag->pagf_freeblks = 0;
> > +	pag->pagf_longest = 0;
> > +
> > +	agf->agf_btreeblks = cpu_to_be32(pag->pagf_btreeblks);
> 
> I'd prefer that you use new_btblks here, too. Easier to see at a
> glance that the on-disk agf is being set to the new value....

Ok.

> 
> 
> > +	agf->agf_freeblks = 0;
> > +	agf->agf_longest = 0;
> > +	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Initialize new bnobt/cntbt roots and implant them into the AGF. */
> > +STATIC int
> > +xfs_repair_allocbt_reset_btrees(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*free_extents,
> > +	int				*log_flags)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct xfs_repair_alloc_extent	*cached = NULL;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_perag		*pag = sc->sa.pag;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_agf			*agf;
> > +	xfs_fsblock_t			bnofsb;
> > +	xfs_fsblock_t			cntfsb;
> > +	int				error;
> > +
> > +	/* Allocate new bnobt root. */
> > +	bnofsb = xfs_repair_allocbt_alloc_block(sc, free_extents, &cached);
> > +	if (bnofsb == NULLFSBLOCK)
> > +		return -ENOSPC;
> 
> Does this happen after the free extent list has been sorted by bno
> order? It really should, that way the new root is as close to the
> the AGF as possible, and the new btree blocks will also tend to
> cluster towards the lower AG offsets.

Will do.

> > +	/* Allocate new cntbt root. */
> > +	cntfsb = xfs_repair_allocbt_alloc_block(sc, free_extents, &cached);
> > +	if (cntfsb == NULLFSBLOCK)
> > +		return -ENOSPC;
> > +
> > +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > +	/* Initialize new bnobt root. */
> > +	error = xfs_repair_init_btblock(sc, bnofsb, &bp, XFS_BTNUM_BNO,
> > +			&xfs_allocbt_buf_ops);
> > +	if (error)
> > +		return error;
> > +	agf->agf_roots[XFS_BTNUM_BNOi] =
> > +			cpu_to_be32(XFS_FSB_TO_AGBNO(mp, bnofsb));
> > +	agf->agf_levels[XFS_BTNUM_BNOi] = cpu_to_be32(1);
> > +
> > +	/* Initialize new cntbt root. */
> > +	error = xfs_repair_init_btblock(sc, cntfsb, &bp, XFS_BTNUM_CNT,
> > +			&xfs_allocbt_buf_ops);
> > +	if (error)
> > +		return error;
> > +	agf->agf_roots[XFS_BTNUM_CNTi] =
> > +			cpu_to_be32(XFS_FSB_TO_AGBNO(mp, cntfsb));
> > +	agf->agf_levels[XFS_BTNUM_CNTi] = cpu_to_be32(1);
> > +
> > +	/* Add rmap records for the btree roots */
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
> > +			XFS_FSB_TO_AGBNO(mp, bnofsb), 1, &oinfo);
> > +	if (error)
> > +		return error;
> > +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
> > +			XFS_FSB_TO_AGBNO(mp, cntfsb), 1, &oinfo);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Reset the incore state. */
> > +	pag->pagf_levels[XFS_BTNUM_BNOi] = 1;
> > +	pag->pagf_levels[XFS_BTNUM_CNTi] = 1;
> > +
> > +	*log_flags |=  XFS_AGF_ROOTS | XFS_AGF_LEVELS;
> > +	return 0;
> 
> Rather than duplicating all this init code twice, would factoring it
> make sense? The only difference between the alloc/init of the two
> btrees is the array index that info is stored in....

Yeah, it would.

> > +}
> > +
> > +/* Build new free space btrees and dispose of the old one. */
> > +STATIC int
> > +xfs_repair_allocbt_rebuild_trees(
> > +	struct xfs_scrub_context	*sc,
> > +	struct list_head		*free_extents,
> > +	struct xfs_repair_extent_list	*old_allocbt_blocks)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct xfs_repair_alloc_extent	*rae;
> > +	struct xfs_repair_alloc_extent	*n;
> > +	struct xfs_repair_alloc_extent	*longest;
> > +	int				error;
> > +
> > +	xfs_rmap_skip_owner_update(&oinfo);
> > +
> > +	/*
> > +	 * Insert the longest free extent in case it's necessary to
> > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > +	 * extent, we had exactly the free space we needed; we're done.
> > +	 */
> > +	longest = xfs_repair_allocbt_get_longest(free_extents);
> > +	if (!longest)
> > +		goto done;
> > +	error = xfs_repair_allocbt_free_extent(sc,
> > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > +			longest->len, &oinfo);
> > +	list_del(&longest->list);
> > +	kmem_free(longest);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Insert records into the new btrees. */
> > +	list_sort(NULL, free_extents, xfs_repair_allocbt_extent_cmp);
> 
> Hmmm. I guess list sorting doesn't occur before allocating new root
> blocks. Can this get moved?

Certainly.

> ....
> 
> > +bool
> > +xfs_extent_busy_list_empty(
> > +	struct xfs_perag	*pag);
> 
> One line form for header prototypes, please.

Ok.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 07/21] xfs: repair inode btrees
  2018-06-28  0:55   ` Dave Chinner
@ 2018-07-04  2:22     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04  2:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 28, 2018 at 10:55:16AM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:13PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Use the rmapbt to find inode chunks, query the chunks to compute
> > hole and free masks, and with that information rebuild the inobt
> > and finobt.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> [....]
> 
> > +/*
> > + * For each cluster in this blob of inode, we must calculate the
> > + * properly aligned startino of that cluster, then iterate each
> > + * cluster to fill in used and filled masks appropriately.  We
> > + * then use the (startino, used, filled) information to construct
> > + * the appropriate inode records.
> > + */
> > +STATIC int
> > +xfs_repair_ialloc_process_cluster(
> > +	struct xfs_repair_ialloc	*ri,
> > +	xfs_agblock_t			agbno,
> > +	int				blks_per_cluster,
> > +	xfs_agino_t			rec_agino)
> > +{
> > +	struct xfs_imap			imap;
> > +	struct xfs_repair_ialloc_extent	*rie;
> > +	struct xfs_dinode		*dip;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_scrub_context	*sc = ri->sc;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_ino_t			fsino;
> > +	xfs_inofree_t			usedmask;
> > +	xfs_agino_t			nr_inodes;
> > +	xfs_agino_t			startino;
> > +	xfs_agino_t			clusterino;
> > +	xfs_agino_t			clusteroff;
> > +	xfs_agino_t			agino;
> > +	uint16_t			fillmask;
> > +	bool				inuse;
> > +	int				usedcount;
> > +	int				error;
> > +
> > +	/* The per-AG inum of this inode cluster. */
> > +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> > +
> > +	/* The per-AG inum of the inobt record. */
> > +	startino = rec_agino + rounddown(agino - rec_agino,
> > +			XFS_INODES_PER_CHUNK);
> > +
> > +	/* The per-AG inum of the cluster within the inobt record. */
> > +	clusteroff = agino - startino;
> > +
> > +	/* Every inode in this holemask slot is filled. */
> > +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> > +	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
> > +			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
> > +
> > +	/* Grab the inode cluster buffer. */
> > +	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
> > +	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
> > +	imap.im_boffset = 0;
> > +
> > +	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
> > +			XFS_IGET_UNTRUSTED);
> 
> This is going to error out if the cluster we are asking to be mapped
> has no record in the inobt.

It does?  xfs_imap_to_bp is a straightforward wrapper around
xfs_trans_read_buf and xfs_buf_offset; it never consults the inobt.
If the inode buffer verifiers trigger then yes we'll blow out to
userspace, but the inobt can be totally trashed and that won't cause
this to fail.

<confused>

> Aren't we trying to rebuild the inobt here from the rmap's idea of
> on-disk clusters? So how do we rebuild the inobt record if we can't
> already find the chunk record in the inobt?
> 
> At minimum, this needs a comment explaining why it works.

/*
 * Having manually mapped part of a reverse-mapping record to an inode
 * cluster map, use the map to read the inode cluster directly off the
 * disk.
 */

> > +/* Initialize new inobt/finobt roots and implant them into the AGI. */
> > +STATIC int
> > +xfs_repair_iallocbt_reset_btrees(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_owner_info		*oinfo,
> > +	int				*log_flags)
> > +{
> > +	struct xfs_agi			*agi;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_fsblock_t			inofsb;
> > +	xfs_fsblock_t			finofsb;
> > +	enum xfs_ag_resv_type		resv;
> > +	int				error;
> > +
> > +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> > +
> > +	/* Initialize new inobt root. */
> > +	resv = XFS_AG_RESV_NONE;
> > +	error = xfs_repair_alloc_ag_block(sc, oinfo, &inofsb, resv);
> > +	if (error)
> > +		return error;
> > +	error = xfs_repair_init_btblock(sc, inofsb, &bp, XFS_BTNUM_INO,
> > +			&xfs_inobt_buf_ops);
> > +	if (error)
> > +		return error;
> > +	agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, inofsb));
> > +	agi->agi_level = cpu_to_be32(1);
> > +	*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
> > +
> > +	/* Initialize new finobt root. */
> > +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> > +		return 0;
> > +
> > +	resv = mp->m_inotbt_nores ? XFS_AG_RESV_NONE : XFS_AG_RESV_METADATA;
> 
> Comment explaining this?

m_inotbt_nores (which, ugh, why isn't that xfs_finobt_nores?) indicates
if we suceeded at making per-AG reservations for finobt expansion.  If
not, then don't bother.

/*
 * If we successfully reserved space for finobt expansion, use that
 * reservation for the rebuilt btree.
 */

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 12/21] xfs: repair refcount btrees
  2018-07-03  5:50   ` Dave Chinner
@ 2018-07-04  2:23     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04  2:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jul 03, 2018 at 03:50:04PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:45PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Reconstruct the refcount data from the rmap btree.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Seems reasonable, though my brain turn to mush when trying to work
> out the code that turned the rmap records into refcounts :/

Ok, thanks for the review!  Enjoy your vacation if I don't make it back
on here before you go. :)

--D

> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 06/21] xfs: repair free space btrees
  2018-07-04  2:15     ` Darrick J. Wong
@ 2018-07-04  2:25       ` Dave Chinner
  0 siblings, 0 replies; 77+ messages in thread
From: Dave Chinner @ 2018-07-04  2:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jul 03, 2018 at 07:15:04PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 27, 2018 at 01:21:23PM +1000, Dave Chinner wrote:
> > On Sun, Jun 24, 2018 at 12:24:07PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > ......
> > > +	if (error)
> > > +		goto err;
> > > +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> > > +	return 0;
> > > +
> > > +err:
> > > +	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> > > +	if (cur)
> > > +		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > > +	return error;
> > 
> > Error stacking here can be cleaned up - we don't need an extra stack
> > as the cursor is NULL when finished with. Hence it could just be:
> > 
> > 	/* Compute the old bnobt/cntbt blocks. */
> > 	error = xfs_repair_subtract_extents(sc, old_allocbt_blocks,
> > 			&ra.nobtlist);
> > err:
> > 	xfs_repair_cancel_btree_extents(sc, &ra.nobtlist);
> > 	if (cur)
> > 		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> 
> TBH I've been tempted for years to refactor this thing to take error
> directly rather than require this XFS_BTREE_{NO,}ERROR business.
> There's only two choices, and we nearly always decide using error == 0.

Yeah, that would make for a nice cleanup. Add it to the TODO list?
Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 15/21] xfs: repair inode block maps
  2018-06-24 19:25 ` [PATCH 15/21] xfs: repair inode block maps Darrick J. Wong
@ 2018-07-04  3:00   ` Dave Chinner
  2018-07-04  3:41     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-04  3:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:25:04PM -0700, Darrick J. Wong wrote:
> +#include "scrub/repair.h"
> +
> +/* Inode fork block mapping (BMBT) repair. */
> +
> +struct xfs_repair_bmap_extent {
> +	struct list_head		list;
> +	struct xfs_rmap_irec		rmap;
> +	xfs_agnumber_t			agno;
> +};
> +
> +struct xfs_repair_bmap {
> +	struct list_head		*extlist;
> +	struct xfs_repair_extent_list	*btlist;
> +	struct xfs_scrub_context	*sc;
> +	xfs_ino_t			ino;
> +	xfs_rfsblock_t			otherfork_blocks;
> +	xfs_rfsblock_t			bmbt_blocks;
> +	xfs_extnum_t			extents;
> +	int				whichfork;
> +};
> +
> +/* Record extents that belong to this inode's fork. */
> +STATIC int
> +xfs_repair_bmap_extent_fn(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_rmap_irec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_repair_bmap		*rb = priv;
> +	struct xfs_repair_bmap_extent	*rbe;
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	xfs_fsblock_t			fsbno;
> +	int				error = 0;
> +
> +	if (xfs_scrub_should_terminate(rb->sc, &error))
> +		return error;
> +
> +	/* Skip extents which are not owned by this inode and fork. */
> +	if (rec->rm_owner != rb->ino) {
> +		return 0;
> +	} else if (rb->whichfork == XFS_DATA_FORK &&
> +		 (rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
> +		rb->otherfork_blocks += rec->rm_blockcount;
> +		return 0;
> +	} else if (rb->whichfork == XFS_ATTR_FORK &&
> +		 !(rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
> +		rb->otherfork_blocks += rec->rm_blockcount;
> +		return 0;
> +	}
> +
> +	rb->extents++;

Shouldn't this be incremented after we've checked for and processed
old BMBT blocks?

> +	/* Delete the old bmbt blocks later. */
> +	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
> +		fsbno = XFS_AGB_TO_FSB(mp, cur->bc_private.a.agno,
> +				rec->rm_startblock);
> +		rb->bmbt_blocks += rec->rm_blockcount;
> +		return xfs_repair_collect_btree_extent(rb->sc, rb->btlist,
> +				fsbno, rec->rm_blockcount);
> +	}
....
> +
> +/* Check for garbage inputs. */
> +STATIC int
> +xfs_repair_bmap_check_inputs(
> +	struct xfs_scrub_context	*sc,
> +	int				whichfork)
> +{
> +	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
> +
> +	/* Don't know how to repair the other fork formats. */
> +	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
> +	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
> +		return -EOPNOTSUPP;
> +
> +	/* Only files, symlinks, and directories get to have data forks. */
> +	if (whichfork == XFS_DATA_FORK && !S_ISREG(VFS_I(sc->ip)->i_mode) &&
> +	    !S_ISDIR(VFS_I(sc->ip)->i_mode) && !S_ISLNK(VFS_I(sc->ip)->i_mode))
> +		return -EINVAL;

That'd be nicer as a switch statement.

> +
> +	/* If we somehow have delalloc extents, forget it. */
> +	if (whichfork == XFS_DATA_FORK && sc->ip->i_delayed_blks)
> +		return -EBUSY;

and this can be rolled into the same if (datafork) branch.

....
> +	if (!xfs_sb_version_hasrmapbt(&sc->mp->m_sb))
> +		return -EOPNOTSUPP;

Do this first?

Hmmm, and if you do the attr fork check second then the rest
of the code is all data fork. i.e.

	if (!rmap)
		return -EOPNOTSUPP
	if (attrfork) {
		if (no attr fork)
			return ....
		return 0
	}
	/* now do all data fork checks */

This becomes a lot easier to follow.

> +/*
> + * Collect block mappings for this fork of this inode and decide if we have
> + * enough space to rebuild.  Caller is responsible for cleaning up the list if
> + * anything goes wrong.
> + */
> +STATIC int
> +xfs_repair_bmap_find_mappings(
> +	struct xfs_scrub_context	*sc,
> +	int				whichfork,
> +	struct list_head		*mapping_records,
> +	struct xfs_repair_extent_list	*old_bmbt_blocks,
> +	xfs_rfsblock_t			*old_bmbt_block_count,
> +	xfs_rfsblock_t			*otherfork_blocks)
> +{
> +	struct xfs_repair_bmap		rb;
> +	xfs_agnumber_t			agno;
> +	unsigned int			resblks;
> +	int				error;
> +
> +	memset(&rb, 0, sizeof(rb));
> +	rb.extlist = mapping_records;
> +	rb.btlist = old_bmbt_blocks;
> +	rb.ino = sc->ip->i_ino;
> +	rb.whichfork = whichfork;
> +	rb.sc = sc;
> +
> +	/* Iterate the rmaps for extents. */
> +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_repair_bmap_scan_ag(&rb, agno);
> +		if (error)
> +			return error;
> +	}
> +
> +	/*
> +	 * Guess how many blocks we're going to need to rebuild an entire bmap
> +	 * from the number of extents we found, and pump up our transaction to
> +	 * have sufficient block reservation.
> +	 */
> +	resblks = xfs_bmbt_calc_size(sc->mp, rb.extents);
> +	error = xfs_trans_reserve_more(sc->tp, resblks, 0);
> +	if (error)
> +		return error;

I don't really like this, but I can't think of a way around needing
it at the moment.

> +
> +	*otherfork_blocks = rb.otherfork_blocks;
> +	*old_bmbt_block_count = rb.bmbt_blocks;
> +	return 0;
> +}
> +
> +/* Update the inode counters. */
> +STATIC int
> +xfs_repair_bmap_reset_counters(
> +	struct xfs_scrub_context	*sc,
> +	xfs_rfsblock_t			old_bmbt_block_count,
> +	xfs_rfsblock_t			otherfork_blocks,
> +	int				*log_flags)
> +{
> +	int				error;
> +
> +	xfs_trans_ijoin(sc->tp, sc->ip, 0);
> +
> +	/*
> +	 * Drop the block counts associated with this fork since we'll re-add
> +	 * them with the bmap routines later.
> +	 */
> +	sc->ip->i_d.di_nblocks = otherfork_blocks;

This needs a little more explanation. i.e. that the rmap walk we
just performed for this fork also counted all the data and bmbt
blocks for the other fork so this is really only zeroing the block
count for the fork we are about to rebuild.

> +/* Initialize a new fork and implant it in the inode. */
> +STATIC void
> +xfs_repair_bmap_reset_fork(
> +	struct xfs_scrub_context	*sc,
> +	int				whichfork,
> +	bool				has_mappings,
> +	int				*log_flags)
> +{
> +	/* Set us back to extents format with zero records. */
> +	XFS_IFORK_FMT_SET(sc->ip, whichfork, XFS_DINODE_FMT_EXTENTS);
> +	XFS_IFORK_NEXT_SET(sc->ip, whichfork, 0);
> +
> +	/* Reinitialize the on-disk fork. */

I don't think this touches the on-disk fork - it's re-initialising
the in-memory fork.

> +	if (XFS_IFORK_PTR(sc->ip, whichfork) != NULL)
> +		xfs_idestroy_fork(sc->ip, whichfork);
> +	if (whichfork == XFS_DATA_FORK) {
> +		memset(&sc->ip->i_df, 0, sizeof(struct xfs_ifork));
> +		sc->ip->i_df.if_flags |= XFS_IFEXTENTS;
> +	} else if (whichfork == XFS_ATTR_FORK) {
> +		if (has_mappings) {
> +			sc->ip->i_afp = NULL;
> +		} else {
> +			sc->ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone,
> +					KM_SLEEP);
> +			sc->ip->i_afp->if_flags |= XFS_IFEXTENTS;
> +		}
> +	}
> +	*log_flags |= XFS_ILOG_CORE;
> +}
......

> +/* Repair an inode fork. */
> +STATIC int
> +xfs_repair_bmap(
> +	struct xfs_scrub_context	*sc,
> +	int				whichfork)
> +{
> +	struct list_head		mapping_records;
> +	struct xfs_repair_extent_list	old_bmbt_blocks;
> +	struct xfs_inode		*ip = sc->ip;
> +	xfs_rfsblock_t			old_bmbt_block_count;
> +	xfs_rfsblock_t			otherfork_blocks;
> +	int				log_flags = 0;
> +	int				error = 0;
> +
> +	error = xfs_repair_bmap_check_inputs(sc, whichfork);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * If this is a file data fork, wait for all pending directio to
> +	 * complete, then tear everything out of the page cache.
> +	 */
> +	if (S_ISREG(VFS_I(ip)->i_mode) && whichfork == XFS_DATA_FORK) {
> +		inode_dio_wait(VFS_I(ip));
> +		truncate_inode_pages(VFS_I(ip)->i_mapping, 0);
> +	}

Why would we be waiting only for DIO here? Haven't we already locked
up the inode, flushed dirty data, waited for dio and invalidated the
page cache when we called xfs_scrub_setup_inode_bmap() prior to
doing this work?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 14/21] xfs: zap broken inode forks
  2018-07-04  2:07   ` Dave Chinner
@ 2018-07-04  3:26     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04  3:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 04, 2018 at 12:07:06PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:57PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Determine if inode fork damage is responsible for the inode being unable
> > to pass the ifork verifiers in xfs_iget and zap the fork contents if
> > this is true.  Once this is done the fork will be empty but we'll be
> > able to construct an in-core inode, and a subsequent call to the inode
> > fork repair ioctl will search the rmapbt to rebuild the records that
> > were in the fork.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_attr_leaf.c |   32 ++-
> >  fs/xfs/libxfs/xfs_attr_leaf.h |    2 
> >  fs/xfs/libxfs/xfs_bmap.c      |   21 ++
> >  fs/xfs/libxfs/xfs_bmap.h      |    2 
> >  fs/xfs/scrub/inode_repair.c   |  399 +++++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 437 insertions(+), 19 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> > index b3c19339e1b5..f6c458104934 100644
> > --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> > +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> > @@ -894,23 +894,16 @@ xfs_attr_shortform_allfit(
> >  	return xfs_attr_shortform_bytesfit(dp, bytes);
> >  }
> >  
> > -/* Verify the consistency of an inline attribute fork. */
> > +/* Verify the consistency of a raw inline attribute fork. */
> >  xfs_failaddr_t
> > -xfs_attr_shortform_verify(
> > -	struct xfs_inode		*ip)
> > +xfs_attr_shortform_verify_struct(
> > +	struct xfs_attr_shortform	*sfp,
> > +	size_t				size)
> 
> The internal structure checking functions in the directory code
> use the naming convention xfs_dir3_<type>_check(). I think we should use
> the same here. i.e. xfs_attr_sf_check().

Ok.

> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index b7f094e19bab..b1254e6c17b5 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -6223,18 +6223,16 @@ xfs_bmap_finish_one(
> >  	return error;
> >  }
> >  
> > -/* Check that an inode's extent does not have invalid flags or bad ranges. */
> > +/* Check that an extent does not have invalid flags or bad ranges. */
> >  xfs_failaddr_t
> > -xfs_bmap_validate_extent(
> > -	struct xfs_inode	*ip,
> > +xfs_bmbt_validate_extent(
> 
> xfs_bmbt_ prefixes should only appear in xfs_bmap_btree.c, not
> xfs_bmap.c....
> 
> So either it needs to get moved or renamed to something like
> xfs_bmap_validate_irec()?

Hmm, well the only difference between the two functions is that one
gets what it needs out of struct xfs_inode and the other takes all the
raw inputs.  They both take xfs_bmbt_irec, so...

xfs_failaddr_t
__xfs_bmbt_validate_irec(
	struct xfs_mount	*mp,
	bool			isrt,
	int			whichfork,
	struct xfs_bmbt_irec	*irec)

xfs_failaddr_t
xfs_bmap_validate_irec(
	struct xfs_inode	*ip,
	int			whichfork,
	struct xfs_bmbt_irec	*irec)

?

> 
> > diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
> > index 4ac43c1b1eb0..b941f21d7667 100644
> > --- a/fs/xfs/scrub/inode_repair.c
> > +++ b/fs/xfs/scrub/inode_repair.c
> > @@ -22,11 +22,15 @@
> >  #include "xfs_ialloc.h"
> >  #include "xfs_da_format.h"
> >  #include "xfs_reflink.h"
> > +#include "xfs_alloc.h"
> >  #include "xfs_rmap.h"
> > +#include "xfs_rmap_btree.h"
> >  #include "xfs_bmap.h"
> > +#include "xfs_bmap_btree.h"
> >  #include "xfs_bmap_util.h"
> >  #include "xfs_dir2.h"
> >  #include "xfs_quota_defs.h"
> > +#include "xfs_attr_leaf.h"
> >  #include "scrub/xfs_scrub.h"
> >  #include "scrub/scrub.h"
> >  #include "scrub/common.h"
> > @@ -113,7 +117,8 @@ xfs_repair_inode_mode(
> >  STATIC void
> >  xfs_repair_inode_flags(
> >  	struct xfs_scrub_context	*sc,
> > -	struct xfs_dinode		*dip)
> > +	struct xfs_dinode		*dip,
> > +	bool				is_rt_file)
> >  {
> >  	struct xfs_mount		*mp = sc->mp;
> >  	uint64_t			flags2;
> > @@ -132,6 +137,10 @@ xfs_repair_inode_flags(
> >  		flags2 &= ~XFS_DIFLAG2_REFLINK;
> >  	if (flags2 & XFS_DIFLAG2_REFLINK)
> >  		flags2 &= ~XFS_DIFLAG2_DAX;
> > +	if (is_rt_file)
> > +		flags |= XFS_DIFLAG_REALTIME;
> > +	else
> > +		flags &= ~XFS_DIFLAG_REALTIME;
> 
> This needs to be done first. i.e. before we check things like rt vs
> reflink flags.

Ok.

> > @@ -210,17 +219,402 @@ xfs_repair_inode_extsize_hints(
> >  	}
> >  }
> >  
> > +struct xfs_repair_inode_fork_counters {
> > +	struct xfs_scrub_context	*sc;
> > +	xfs_rfsblock_t			data_blocks;
> > +	xfs_rfsblock_t			rt_blocks;
> 
> inode is either data or rt, not both. Why do you need two separate
> counters? Oh, bmbt blocks are always data, right? Coment, perhaps?

Ok.

/* blocks on the data device, including bmbt blocks. */
	xfs_rfsblock_t			data_blocks;

/* rt_blocks: blocks on the realtime device, if any. */
	xfs_rfsblock_t			rt_blocks;

> > +	xfs_rfsblock_t			attr_blocks;
> > +	xfs_extnum_t			data_extents;
> > +	xfs_extnum_t			rt_extents;
> 
> but bmbt blocks are not data extents, so only one counter here?

In theory we'd look on all the rmapbt's and error out if we find data
fork blocks on both...

> > +/* Count extents and blocks for a given inode from all rmap data. */
> > +STATIC int
> > +xfs_repair_inode_count_rmaps(
> > +	struct xfs_repair_inode_fork_counters	*rifc)
> > +{
> > +	xfs_agnumber_t			agno;
> > +	int				error;
> > +
> > +	if (!xfs_sb_version_hasrmapbt(&rifc->sc->mp->m_sb) ||
> > +	    xfs_sb_version_hasrealtime(&rifc->sc->mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	/* XXX: find rt blocks too */
> 
> Ok, needs a comment up front that realtime repair isn't supported,
> rather than hiding it down here.

...but all this is clumsily roped off while there's no rt rmap. :)

> > +	for (agno = 0; agno < rifc->sc->mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_repair_inode_count_ag_rmaps(rifc, agno);
> > +		if (error)
> > +			return error;
> > +	}
> 
> 	/* rt not supported yet */
> 	ASSERT(rifc->rt_extents == 0);

I'll move the feature checking up to the start of xfs_repair_inode.

> > +	/* Can't have extents on both the rt and the data device. */
> > +	if (rifc->data_extents && rifc->rt_extents)
> > +		return -EFSCORRUPTED;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Figure out if we need to zap this extents format fork. */
> > +STATIC bool
> > +xfs_repair_inode_core_check_extents_fork(
> 
> 
> Urk. Not sure what this function is supposed to be doing from the
> name. xrep_ifork_extent_check()? And then the next function
> becomes xrep_ifork_btree_check()?
> 
> Also document the return values.


/*
 * Decide if this extents-format inode fork looks like garbage.  If so,
 * return true.
 */

> 
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_dinode		*dip,
> > +	int				dfork_size,
> > +	int				whichfork)
> > +{
> > +	struct xfs_bmbt_irec		new;
> > +	struct xfs_bmbt_rec		*dp;
> > +	bool				isrt;
> > +	int				i;
> > +	int				nex;
> > +	int				fork_size;
> > +
> > +	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	fork_size = nex * sizeof(struct xfs_bmbt_rec);
> > +	if (fork_size < 0 || fork_size > dfork_size)
> > +		return true;
> 
> Check nex against dip->di_nextents?

Isn't XFS_DFORK_NEXTENTS just dip->di_nextents?

Nevertheless, it should be range checked.

> > +	dp = (struct xfs_bmbt_rec *)XFS_DFORK_PTR(dip, whichfork);
> > +
> > +	isrt = dip->di_flags & cpu_to_be16(XFS_DIFLAG_REALTIME);
> > +	for (i = 0; i < nex; i++, dp++) {
> > +		xfs_failaddr_t	fa;
> > +
> > +		xfs_bmbt_disk_get_all(dp, &new);
> > +		fa = xfs_bmbt_validate_extent(sc->mp, isrt, whichfork, &new);
> > +		if (fa)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/* Figure out if we need to zap this btree format fork. */
> > +STATIC bool
> > +xfs_repair_inode_core_check_btree_fork(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_dinode		*dip,
> > +	int				dfork_size,
> > +	int				whichfork)
> > +{
> > +	struct xfs_bmdr_block		*dfp;
> > +	int				nrecs;
> > +	int				level;
> > +
> > +	if (XFS_DFORK_NEXTENTS(dip, whichfork) <=
> > +			dfork_size / sizeof(struct xfs_bmbt_irec))
> > +		return true;
> 
> check against dip->di_nextents?

Yes.

> > +	dfp = (struct xfs_bmdr_block *)XFS_DFORK_PTR(dip, whichfork);
> > +	nrecs = be16_to_cpu(dfp->bb_numrecs);
> > +	level = be16_to_cpu(dfp->bb_level);
> > +
> > +	if (nrecs == 0 || XFS_BMDR_SPACE_CALC(nrecs) > dfork_size)
> > +		return true;
> > +	if (level == 0 || level > XFS_BTREE_MAXLEVELS)
> > +		return true;
> 
> Should this visit the bmbt blocks to check the level is actually
> correct?

The bmbt checker will detect and repair that.

> > +	return false;
> > +}
> > +
> > +/*
> > + * Check the data fork for things that will fail the ifork verifiers or the
> > + * ifork formatters.
> > + */
> > +STATIC bool
> > +xfs_repair_inode_core_check_data_fork(
> 
> xrep_ifork_check_data()

<nod>

> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_dinode		*dip,
> > +	uint16_t			mode)
> > +{
> > +	uint64_t			size;
> > +	int				dfork_size;
> > +
> > +	size = be64_to_cpu(dip->di_size);
> > +	switch (mode & S_IFMT) {
> > +	case S_IFIFO:
> > +	case S_IFCHR:
> > +	case S_IFBLK:
> > +	case S_IFSOCK:
> > +		if (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK) != XFS_DINODE_FMT_DEV)
> > +			return true;
> > +		break;
> > +	case S_IFREG:
> > +	case S_IFLNK:
> > +	case S_IFDIR:
> > +		switch (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK)) {
> > +		case XFS_DINODE_FMT_LOCAL:
> 
> local format is not valid for S_IFREG.

<nod>

> > +		case XFS_DINODE_FMT_EXTENTS:
> > +		case XFS_DINODE_FMT_BTREE:
> > +			break;
> > +		default:
> > +			return true;
> > +		}
> > +		break;
> > +	default:
> > +		return true;
> > +	}
> > +	dfork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_DATA_FORK);
> > +	switch (XFS_DFORK_FORMAT(dip, XFS_DATA_FORK)) {
> > +	case XFS_DINODE_FMT_DEV:
> > +		break;
> > +	case XFS_DINODE_FMT_LOCAL:
> > +		if (size > dfork_size)
> > +			return true;
> > +		break;
> > +	case XFS_DINODE_FMT_EXTENTS:
> > +		if (xfs_repair_inode_core_check_extents_fork(sc, dip,
> > +				dfork_size, XFS_DATA_FORK))
> > +			return true;
> > +		break;
> > +	case XFS_DINODE_FMT_BTREE:
> > +		if (xfs_repair_inode_core_check_btree_fork(sc, dip,
> > +				dfork_size, XFS_DATA_FORK))
> > +			return true;
> > +		break;
> > +	default:
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/* Reset the data fork to something sane. */
> > +STATIC void
> > +xfs_repair_inode_core_zap_data_fork(
> 
> xrep_ifork_zap_data()
> 
> (you get the idea :P)
> 
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_dinode		*dip,
> > +	uint16_t			mode,
> > +	struct xfs_repair_inode_fork_counters	*rifc)
> 
> (structure names can change, too :)

Yay!

	struct xrep_ifork_counters	*rifc;

> > +{
> > +	char				*p;
> > +	const struct xfs_dir_ops	*ops;
> > +	struct xfs_dir2_sf_hdr		*sfp;
> > +	int				i8count;
> > +
> > +	/* Special files always get reset to DEV */
> > +	switch (mode & S_IFMT) {
> > +	case S_IFIFO:
> > +	case S_IFCHR:
> > +	case S_IFBLK:
> > +	case S_IFSOCK:
> > +		dip->di_format = XFS_DINODE_FMT_DEV;
> > +		dip->di_size = 0;
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * If we have data extents, reset to an empty map and hope the user
> > +	 * will run the bmapbtd checker next.
> > +	 */
> > +	if (rifc->data_extents || rifc->rt_extents || S_ISREG(mode)) {
> > +		dip->di_format = XFS_DINODE_FMT_EXTENTS;
> > +		dip->di_nextents = 0;
> > +		return;
> > +	}
> 
> Does the userspace tool run the bmapbtd checker next?

Yes.

> > +	/* Otherwise, reset the local format to the minimum. */
> > +	switch (mode & S_IFMT) {
> > +	case S_IFLNK:
> > +		/* Blow out symlink; now it points to root dir */
> > +		dip->di_format = XFS_DINODE_FMT_LOCAL;
> > +		dip->di_size = cpu_to_be64(1);
> > +		p = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
> > +		*p = '/';
> 
> Maybe factor this so zero length symlinks can be set directly to the
> same thing? FWIW, would making it point at '.' be better than '/' so
> it always points within the same filesystem?

Perhaps?  I'll take any other suggestions the list has for appropriate
targets that don't point anywhere.

I though your earlier suggestion of setting the link target to "broken
link repaired by xfs_scrub" was quite amusing. :)

> > +		break;
> > +	case S_IFDIR:
> > +		/*
> > +		 * Blow out dir, make it point to the root.  In the
> > +		 * future the direction repair will reconstruct this
> > +		 * dir for us.
> > +		 */
> 
> s/the direction//

Ok.

> > +		dip->di_format = XFS_DINODE_FMT_LOCAL;
> > +		i8count = sc->mp->m_sb.sb_rootino > XFS_DIR2_MAX_SHORT_INUM;
> > +		ops = xfs_dir_get_ops(sc->mp, NULL);
> > +		sfp = (struct xfs_dir2_sf_hdr *)XFS_DFORK_PTR(dip,
> > +				XFS_DATA_FORK);
> > +		sfp->count = 0;
> > +		sfp->i8count = i8count;
> > +		ops->sf_put_parent_ino(sfp, sc->mp->m_sb.sb_rootino);
> > +		dip->di_size = cpu_to_be64(xfs_dir2_sf_hdr_size(i8count));
> 
> What happens now if this dir has an ancestor still pointing at it?
> Haven't we just screwed the directory structure? How does this
> interact with the dentry cache, esp. w.r.t. disconnected dentries
> (filehandle lookups)?

Hmm, good question. :)

> > +/*
> > + * Check the attr fork for things that will fail the ifork verifiers or the
> > + * ifork formatters.
> > + */
> > +STATIC bool
> > +xfs_repair_inode_core_check_attr_fork(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_dinode		*dip)
> > +{
> > +	struct xfs_attr_shortform	*sfp;
> > +	int				size;
> > +
> > +	if (XFS_DFORK_BOFF(dip) == 0)
> > +		return dip->di_aformat != XFS_DINODE_FMT_EXTENTS ||
> > +		       dip->di_anextents != 0;
> > +
> > +	size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
> > +	switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) {
> > +	case XFS_DINODE_FMT_LOCAL:
> > +		sfp = (struct xfs_attr_shortform *)XFS_DFORK_PTR(dip,
> > +				XFS_ATTR_FORK);
> 
> As a side note, we should make XFS_DFORK_PTR() return a void * so we
> don't need casts like this.

Ok.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 15/21] xfs: repair inode block maps
  2018-07-04  3:00   ` Dave Chinner
@ 2018-07-04  3:41     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04  3:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 04, 2018 at 01:00:22PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:25:04PM -0700, Darrick J. Wong wrote:
> > +#include "scrub/repair.h"
> > +
> > +/* Inode fork block mapping (BMBT) repair. */
> > +
> > +struct xfs_repair_bmap_extent {
> > +	struct list_head		list;
> > +	struct xfs_rmap_irec		rmap;
> > +	xfs_agnumber_t			agno;
> > +};
> > +
> > +struct xfs_repair_bmap {
> > +	struct list_head		*extlist;
> > +	struct xfs_repair_extent_list	*btlist;
> > +	struct xfs_scrub_context	*sc;
> > +	xfs_ino_t			ino;
> > +	xfs_rfsblock_t			otherfork_blocks;
> > +	xfs_rfsblock_t			bmbt_blocks;
> > +	xfs_extnum_t			extents;
> > +	int				whichfork;
> > +};
> > +
> > +/* Record extents that belong to this inode's fork. */
> > +STATIC int
> > +xfs_repair_bmap_extent_fn(
> > +	struct xfs_btree_cur		*cur,
> > +	struct xfs_rmap_irec		*rec,
> > +	void				*priv)
> > +{
> > +	struct xfs_repair_bmap		*rb = priv;
> > +	struct xfs_repair_bmap_extent	*rbe;
> > +	struct xfs_mount		*mp = cur->bc_mp;
> > +	xfs_fsblock_t			fsbno;
> > +	int				error = 0;
> > +
> > +	if (xfs_scrub_should_terminate(rb->sc, &error))
> > +		return error;
> > +
> > +	/* Skip extents which are not owned by this inode and fork. */
> > +	if (rec->rm_owner != rb->ino) {
> > +		return 0;
> > +	} else if (rb->whichfork == XFS_DATA_FORK &&
> > +		 (rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
> > +		rb->otherfork_blocks += rec->rm_blockcount;
> > +		return 0;
> > +	} else if (rb->whichfork == XFS_ATTR_FORK &&
> > +		 !(rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
> > +		rb->otherfork_blocks += rec->rm_blockcount;
> > +		return 0;
> > +	}
> > +
> > +	rb->extents++;
> 
> Shouldn't this be incremented after we've checked for and processed
> old BMBT blocks?

Yes.

> > +	/* Delete the old bmbt blocks later. */
> > +	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
> > +		fsbno = XFS_AGB_TO_FSB(mp, cur->bc_private.a.agno,
> > +				rec->rm_startblock);
> > +		rb->bmbt_blocks += rec->rm_blockcount;
> > +		return xfs_repair_collect_btree_extent(rb->sc, rb->btlist,
> > +				fsbno, rec->rm_blockcount);
> > +	}
> ....
> > +
> > +/* Check for garbage inputs. */
> > +STATIC int
> > +xfs_repair_bmap_check_inputs(
> > +	struct xfs_scrub_context	*sc,
> > +	int				whichfork)
> > +{
> > +	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
> > +
> > +	/* Don't know how to repair the other fork formats. */
> > +	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
> > +	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
> > +		return -EOPNOTSUPP;
> > +
> > +	/* Only files, symlinks, and directories get to have data forks. */
> > +	if (whichfork == XFS_DATA_FORK && !S_ISREG(VFS_I(sc->ip)->i_mode) &&
> > +	    !S_ISDIR(VFS_I(sc->ip)->i_mode) && !S_ISLNK(VFS_I(sc->ip)->i_mode))
> > +		return -EINVAL;
> 
> That'd be nicer as a switch statement.

Will fix.

> > +
> > +	/* If we somehow have delalloc extents, forget it. */
> > +	if (whichfork == XFS_DATA_FORK && sc->ip->i_delayed_blks)
> > +		return -EBUSY;
> 
> and this can be rolled into the same if (datafork) branch.
> 
> ....
> > +	if (!xfs_sb_version_hasrmapbt(&sc->mp->m_sb))
> > +		return -EOPNOTSUPP;
> 
> Do this first?

It's redundant, see xfs_repair_bmap_check_inputs.  Will remove this one.

> Hmmm, and if you do the attr fork check second then the rest
> of the code is all data fork. i.e.
> 
> 	if (!rmap)
> 		return -EOPNOTSUPP
> 	if (attrfork) {
> 		if (no attr fork)
> 			return ....
> 		return 0
> 	}
> 	/* now do all data fork checks */
> 
> This becomes a lot easier to follow.

Ok.

> > +/*
> > + * Collect block mappings for this fork of this inode and decide if we have
> > + * enough space to rebuild.  Caller is responsible for cleaning up the list if
> > + * anything goes wrong.
> > + */
> > +STATIC int
> > +xfs_repair_bmap_find_mappings(
> > +	struct xfs_scrub_context	*sc,
> > +	int				whichfork,
> > +	struct list_head		*mapping_records,
> > +	struct xfs_repair_extent_list	*old_bmbt_blocks,
> > +	xfs_rfsblock_t			*old_bmbt_block_count,
> > +	xfs_rfsblock_t			*otherfork_blocks)
> > +{
> > +	struct xfs_repair_bmap		rb;
> > +	xfs_agnumber_t			agno;
> > +	unsigned int			resblks;
> > +	int				error;
> > +
> > +	memset(&rb, 0, sizeof(rb));
> > +	rb.extlist = mapping_records;
> > +	rb.btlist = old_bmbt_blocks;
> > +	rb.ino = sc->ip->i_ino;
> > +	rb.whichfork = whichfork;
> > +	rb.sc = sc;
> > +
> > +	/* Iterate the rmaps for extents. */
> > +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_repair_bmap_scan_ag(&rb, agno);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	/*
> > +	 * Guess how many blocks we're going to need to rebuild an entire bmap
> > +	 * from the number of extents we found, and pump up our transaction to
> > +	 * have sufficient block reservation.
> > +	 */
> > +	resblks = xfs_bmbt_calc_size(sc->mp, rb.extents);
> > +	error = xfs_trans_reserve_more(sc->tp, resblks, 0);
> > +	if (error)
> > +		return error;
> 
> I don't really like this, but I can't think of a way around needing
> it at the moment.

Me neither.

(That is to say, I can't think of a way around it that doesn't involve
backing all the way out to the setup function, which would be pretty
gruesome.)

> > +
> > +	*otherfork_blocks = rb.otherfork_blocks;
> > +	*old_bmbt_block_count = rb.bmbt_blocks;
> > +	return 0;
> > +}
> > +
> > +/* Update the inode counters. */
> > +STATIC int
> > +xfs_repair_bmap_reset_counters(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_rfsblock_t			old_bmbt_block_count,
> > +	xfs_rfsblock_t			otherfork_blocks,
> > +	int				*log_flags)
> > +{
> > +	int				error;
> > +
> > +	xfs_trans_ijoin(sc->tp, sc->ip, 0);
> > +
> > +	/*
> > +	 * Drop the block counts associated with this fork since we'll re-add
> > +	 * them with the bmap routines later.
> > +	 */
> > +	sc->ip->i_d.di_nblocks = otherfork_blocks;
> 
> This needs a little more explanation. i.e. that the rmap walk we
> just performed for this fork also counted all the data and bmbt
> blocks for the other fork so this is really only zeroing the block
> count for the fork we are about to rebuild.

/*
 * We're going to use the bmap routines to reconstruct a fork from rmap
 * records.  Those functions increment di_nblocks for us, so we need to
 * subtract out all the data and bmbt blocks from the fork we're about
 * to rebuild.  otherfork_blocks reflects all the data and bmbt blocks
 * for the other fork, so this assignment effectively performs the
 * subtraction for us.
 */

> 
> > +/* Initialize a new fork and implant it in the inode. */
> > +STATIC void
> > +xfs_repair_bmap_reset_fork(
> > +	struct xfs_scrub_context	*sc,
> > +	int				whichfork,
> > +	bool				has_mappings,
> > +	int				*log_flags)
> > +{
> > +	/* Set us back to extents format with zero records. */
> > +	XFS_IFORK_FMT_SET(sc->ip, whichfork, XFS_DINODE_FMT_EXTENTS);
> > +	XFS_IFORK_NEXT_SET(sc->ip, whichfork, 0);
> > +
> > +	/* Reinitialize the on-disk fork. */
> 
> I don't think this touches the on-disk fork - it's re-initialising
> the in-memory fork.

Will fix.

> > +	if (XFS_IFORK_PTR(sc->ip, whichfork) != NULL)
> > +		xfs_idestroy_fork(sc->ip, whichfork);
> > +	if (whichfork == XFS_DATA_FORK) {
> > +		memset(&sc->ip->i_df, 0, sizeof(struct xfs_ifork));
> > +		sc->ip->i_df.if_flags |= XFS_IFEXTENTS;
> > +	} else if (whichfork == XFS_ATTR_FORK) {
> > +		if (has_mappings) {
> > +			sc->ip->i_afp = NULL;
> > +		} else {
> > +			sc->ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone,
> > +					KM_SLEEP);
> > +			sc->ip->i_afp->if_flags |= XFS_IFEXTENTS;
> > +		}
> > +	}

/*
 * Now that we've reinitialized the in-memory fork and set the inode
 * back to extents format with zero extents, any extents that we
 * subsequently map into the file will reinitialize the on-disk fork
 * area for us.  All we have to do is log the inode core to preserve
 * the format and extent count fields.
 */

> > +	*log_flags |= XFS_ILOG_CORE;
> > +}
> ......
> 
> > +/* Repair an inode fork. */
> > +STATIC int
> > +xfs_repair_bmap(
> > +	struct xfs_scrub_context	*sc,
> > +	int				whichfork)
> > +{
> > +	struct list_head		mapping_records;
> > +	struct xfs_repair_extent_list	old_bmbt_blocks;
> > +	struct xfs_inode		*ip = sc->ip;
> > +	xfs_rfsblock_t			old_bmbt_block_count;
> > +	xfs_rfsblock_t			otherfork_blocks;
> > +	int				log_flags = 0;
> > +	int				error = 0;
> > +
> > +	error = xfs_repair_bmap_check_inputs(sc, whichfork);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * If this is a file data fork, wait for all pending directio to
> > +	 * complete, then tear everything out of the page cache.
> > +	 */
> > +	if (S_ISREG(VFS_I(ip)->i_mode) && whichfork == XFS_DATA_FORK) {
> > +		inode_dio_wait(VFS_I(ip));
> > +		truncate_inode_pages(VFS_I(ip)->i_mapping, 0);
> > +	}
> 
> Why would we be waiting only for DIO here? Haven't we already locked
> up the inode, flushed dirty data, waited for dio and invalidated the
> page cache when we called xfs_scrub_setup_inode_bmap() prior to
> doing this work?

Extra paranoia?  IOWs I don't know why. :)

Probably we should xfs_break_layouts here though.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 16/21] xfs: repair damaged symlinks
  2018-06-24 19:25 ` [PATCH 16/21] xfs: repair damaged symlinks Darrick J. Wong
@ 2018-07-04  5:45   ` Dave Chinner
  2018-07-04 18:45     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-04  5:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:25:16PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Repair inconsistent symbolic link data.
.....
> +/*
> + * Symbolic Link Repair
> + * ====================
> + *
> + * There's not much we can do to repair symbolic links -- we truncate them to
> + * the first NULL byte and fix up the remote target block headers if they're
> + * incorrect.  Zero-length symlinks are turned into links to /.
> + */
> +
> +/* Blow out the whole symlink; replace contents. */
> +STATIC int
> +xfs_repair_symlink_rewrite(
> +	struct xfs_trans	**tpp,
> +	struct xfs_inode	*ip,
> +	const char		*target_path,
> +	int			pathlen)
> +{
> +	struct xfs_defer_ops	dfops;
> +	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
> +	struct xfs_ifork	*ifp;
> +	const char		*cur_chunk;
> +	struct xfs_mount	*mp = (*tpp)->t_mountp;
> +	struct xfs_buf		*bp;
> +	xfs_fsblock_t		first_block;
> +	xfs_fileoff_t		first_fsb;
> +	xfs_filblks_t		fs_blocks;
> +	xfs_daddr_t		d;
> +	int			byte_cnt;
> +	int			n;
> +	int			nmaps;
> +	int			offset;
> +	int			error = 0;
> +
> +	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> +
> +	/* Truncate the whole data fork if it wasn't inline. */
> +	if (!(ifp->if_flags & XFS_IFINLINE)) {
> +		error = xfs_itruncate_extents(tpp, ip, XFS_DATA_FORK, 0);
> +		if (error)
> +			goto out;
> +	}
> +
> +	/* Blow out the in-core fork and zero the on-disk fork. */
> +	xfs_idestroy_fork(ip, XFS_DATA_FORK);
> +	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
> +	ip->i_d.di_nextents = 0;
> +	memset(&ip->i_df, 0, sizeof(struct xfs_ifork));
> +	ip->i_df.if_flags |= XFS_IFEXTENTS;

This looks familiar - doesn't the fork zapping code do exactly this,
too? factor into a helper?

> +
> +	/* Rewrite an inline symlink. */
> +	if (pathlen <= XFS_IFORK_DSIZE(ip)) {
> +		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
> +
> +		i_size_write(VFS_I(ip), pathlen);
> +		ip->i_d.di_size = pathlen;
> +		ip->i_d.di_format = XFS_DINODE_FMT_LOCAL;
> +		xfs_trans_log_inode(*tpp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
> +		goto out;
> +
> +	}

Might make sense to separate inline vs remote into separate
functions - we tend to do that everywhere else in the symlink code.

> +
> +	/* Rewrite a remote symlink. */
> +	fs_blocks = xfs_symlink_blocks(mp, pathlen);
> +	first_fsb = 0;
> +	nmaps = XFS_SYMLINK_MAPS;
> +
> +	/* Reserve quota for new blocks. */
> +	error = xfs_trans_reserve_quota_nblks(*tpp, ip, fs_blocks, 0,
> +			XFS_QMOPT_RES_REGBLKS);
> +	if (error)
> +		goto out;
> +
> +	/* Map blocks, write symlink target. */
> +	xfs_defer_init(&dfops, &first_block);
> +
> +	error = xfs_bmapi_write(*tpp, ip, first_fsb, fs_blocks,
> +			  XFS_BMAPI_METADATA, &first_block, fs_blocks,
> +			  mval, &nmaps, &dfops);
> +	if (error)
> +		goto out_bmap_cancel;
> +
> +	ip->i_d.di_size = pathlen;
> +	i_size_write(VFS_I(ip), pathlen);
> +	xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
> +
> +	cur_chunk = target_path;
> +	offset = 0;
> +	for (n = 0; n < nmaps; n++) {
> +		char	*buf;
> +
> +		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
> +		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
> +		bp = xfs_trans_get_buf(*tpp, mp->m_ddev_targp, d,
> +				       BTOBB(byte_cnt), 0);
> +		if (!bp) {
> +			error = -ENOMEM;
> +			goto out_bmap_cancel;
> +		}
> +		bp->b_ops = &xfs_symlink_buf_ops;
> +
> +		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
> +		byte_cnt = min(byte_cnt, pathlen);
> +
> +		buf = bp->b_addr;
> +		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
> +					   byte_cnt, bp);
> +
> +		memcpy(buf, cur_chunk, byte_cnt);
> +
> +		cur_chunk += byte_cnt;
> +		pathlen -= byte_cnt;
> +		offset += byte_cnt;
> +
> +		xfs_trans_buf_set_type(*tpp, bp, XFS_BLFT_SYMLINK_BUF);
> +		xfs_trans_log_buf(*tpp, bp, 0, (buf + byte_cnt - 1) -
> +						(char *)bp->b_addr);
> +	}
> +	ASSERT(pathlen == 0);

This just looks like a copynpaste of main loop in xfs_symlink() -
can you factor that into a helper, please?

> +
> +	error = xfs_defer_finish(tpp, &dfops);
> +	if (error)
> +		goto out_bmap_cancel;
> +
> +	return 0;
> +
> +out_bmap_cancel:
> +	xfs_defer_cancel(&dfops);
> +out:
> +	return error;
> +}
> +
> +/* Fix everything that fails the verifiers in the remote blocks. */
> +STATIC int
> +xfs_repair_symlink_fix_remotes(
> +	struct xfs_scrub_context	*sc,
> +	loff_t				len)
> +{
> +	struct xfs_bmbt_irec		mval[XFS_SYMLINK_MAPS];
> +	struct xfs_buf			*bp;
> +	xfs_filblks_t			fsblocks;
> +	xfs_daddr_t			d;
> +	loff_t				offset;
> +	unsigned int			byte_cnt;
> +	int				n;
> +	int				nmaps = XFS_SYMLINK_MAPS;
> +	int				nr;
> +	int				error;
> +
> +	fsblocks = xfs_symlink_blocks(sc->mp, len);
> +	error = xfs_bmapi_read(sc->ip, 0, fsblocks, mval, &nmaps, 0);
> +	if (error)
> +		return error;
> +
> +	offset = 0;
> +	for (n = 0; n < nmaps; n++) {
> +		d = XFS_FSB_TO_DADDR(sc->mp, mval[n].br_startblock);
> +		byte_cnt = XFS_FSB_TO_B(sc->mp, mval[n].br_blockcount);
> +
> +		error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
> +				d, BTOBB(byte_cnt), 0, &bp, NULL);
> +		if (error)
> +			return error;
> +		bp->b_ops = &xfs_symlink_buf_ops;
> +
> +		byte_cnt = XFS_SYMLINK_BUF_SPACE(sc->mp, byte_cnt);
> +		if (len < byte_cnt)
> +			byte_cnt = len;

can we make this the same as the other functions? i.e.

		byte_cnt = min(byte_cnt, len);

> +
> +		nr = xfs_symlink_hdr_set(sc->mp, sc->ip->i_ino, offset,
> +				byte_cnt, bp);
> +
> +		len -= byte_cnt;
> +		offset += byte_cnt;
> +
> +		xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SYMLINK_BUF);
> +		xfs_trans_log_buf(sc->tp, bp, 0, nr - 1);
> +		xfs_trans_brelse(sc->tp, bp);

xfs_trans_brelse() is a no-op here because the buffer has been
logged. It can be removed.

> +	}
> +	if (len != 0)
> +		return -EFSCORRUPTED;
> +
> +	return 0;
> +}
> +
> +/* Fix this inline symlink. */
> +STATIC int
> +xfs_repair_symlink_inline(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_inode		*ip = sc->ip;
> +	struct xfs_ifork		*ifp;
> +	loff_t				len;
> +	size_t				newlen;
> +
> +	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> +	len = i_size_read(VFS_I(ip));
> +	xfs_trans_ijoin(sc->tp, ip, 0);
> +
> +	if (ifp->if_u1.if_data) {
> +		newlen = strnlen(ifp->if_u1.if_data, XFS_IFORK_DSIZE(ip));
> +	} else {
> +		/* Zero length symlink becomes a root symlink. */
> +		ifp->if_u1.if_data = kmem_alloc(4, KM_SLEEP);
> +		snprintf(ifp->if_u1.if_data, 4, "/");
> +		newlen = 1;

helper function shared with the fork zapping code?

> +	}
> +
> +	if (len > newlen) {

shouldn't this be 'if (len != newlen) {' ?

> +		i_size_write(VFS_I(ip), newlen);
> +		ip->i_d.di_size = newlen;
> +		xfs_trans_log_inode(sc->tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
> +	}
> +
> +	return 0;
> +}
> +
> +/* Repair a remote symlink. */
> +STATIC int
> +xfs_repair_symlink_remote(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_inode		*ip = sc->ip;
> +	loff_t				len;
> +	size_t				newlen;
> +	int				error = 0;
> +
> +	len = i_size_read(VFS_I(ip));
> +	xfs_trans_ijoin(sc->tp, ip, 0);
> +
> +	error = xfs_repair_symlink_fix_remotes(sc, len);
> +	if (error)
> +		return error;
> +
> +	/* Roll transaction, release buffers. */
> +	error = xfs_trans_roll_inode(&sc->tp, ip);
> +	if (error)
> +		return error;
> +
> +	/* Size set correctly? */
> +	len = i_size_read(VFS_I(ip));
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	error = xfs_readlink(ip, sc->buf);
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);

Can we pass a "need_lock" flag to xfs_readlink() rather than
creating a race condition with anything that might be blocked on the
ilock waiting for repair to complete?

> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Figure out the new target length.  We can't handle zero-length
> +	 * symlinks, so make sure that we don't write that out.
> +	 */
> +	newlen = strnlen(sc->buf, XFS_SYMLINK_MAXLEN);
> +	if (newlen == 0) {
> +		*((char *)sc->buf) = '/';
> +		newlen = 1;

We really need to do set the name of the repaired path for zero
length symlinks in only one place. It really seems to me that it
should done in xfs_repair_symlink_rewrite() if newlen is 0, not
here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-03 23:59     ` Darrick J. Wong
@ 2018-07-04  8:44       ` Carlos Maiolino
  2018-07-04 18:40         ` Darrick J. Wong
  2018-07-04 23:21       ` Dave Chinner
  1 sibling, 1 reply; 77+ messages in thread
From: Carlos Maiolino @ 2018-07-04  8:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

> [some of this Dave and I discussed on IRC, so I'll summarize for
> everyone else here...]
> 
> For this initial v0 iteration of the rmap repair code, yes, we have to
> freeze the fs and iterate everything.  However, unless your computer and
> storage are particularly untrustworthy, rmapbt reconstruction should be
> a very infrequent thing.  Now that we have a FREEZE_OK flag, userspace
> has to opt-in to slow repairs, and presumably it could choose instead to
> unmount and run xfs_repair if that's too dear or there are too many
> broken AGs, etc.  More on that later.
> 
> In the long run I don't see a need to freeze the filesystem to scan
> every inode for bmbt entries in the damaged AG.  In fact, we can improve
> the performance of all the AG repair functions in general with the
> scheme I'm about to outline:
> 
> Create a "shut down this AG" primitive.  Once set, block and inode
> allocation routines will bypass this AG.  Unlinked inodes are moved to
> the unlinked list to avoid touching as much of the AGI as we practically
> can.  Unmapped/freed blocks can be moved to a hidden inode (in another
> AG) to be freed later.  Growfs operation in that AG can be rejected.
> 

Does it mean that new block allocation requestsvfor inodes already existing in
the frozen AG will block until the AG is thawed, or these block allocations
will be redirected to another AG? I'm just asking because in either case, we
should document it well. The repair case is certainly (or should be) a rare
case, but if there is any heavy workload going on on the frozen AG, and we
redirect it to another AG, it can end up heavily fragmenting the files on the
frozen AG.
So, I wonder if any operation to the AG under repair should actually be blocked
too?

Cheers


-- 
Carlos

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-04  8:44       ` Carlos Maiolino
@ 2018-07-04 18:40         ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04 18:40 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: cmaiolino

On Wed, Jul 04, 2018 at 10:44:38AM +0200, Carlos Maiolino wrote:
> > [some of this Dave and I discussed on IRC, so I'll summarize for
> > everyone else here...]
> > 
> > For this initial v0 iteration of the rmap repair code, yes, we have to
> > freeze the fs and iterate everything.  However, unless your computer and
> > storage are particularly untrustworthy, rmapbt reconstruction should be
> > a very infrequent thing.  Now that we have a FREEZE_OK flag, userspace
> > has to opt-in to slow repairs, and presumably it could choose instead to
> > unmount and run xfs_repair if that's too dear or there are too many
> > broken AGs, etc.  More on that later.
> > 
> > In the long run I don't see a need to freeze the filesystem to scan
> > every inode for bmbt entries in the damaged AG.  In fact, we can improve
> > the performance of all the AG repair functions in general with the
> > scheme I'm about to outline:
> > 
> > Create a "shut down this AG" primitive.  Once set, block and inode
> > allocation routines will bypass this AG.  Unlinked inodes are moved to
> > the unlinked list to avoid touching as much of the AGI as we practically
> > can.  Unmapped/freed blocks can be moved to a hidden inode (in another
> > AG) to be freed later.  Growfs operation in that AG can be rejected.
> > 
> 
> Does it mean that new block allocation requestsvfor inodes already existing in
> the frozen AG will block until the AG is thawed, or these block allocations
> will be redirected to another AG? I'm just asking because in either case, we
> should document it well. The repair case is certainly (or should be) a rare
> case, but if there is any heavy workload going on on the frozen AG, and we
> redirect it to another AG, it can end up heavily fragmenting the files on the
> frozen AG.
> So, I wonder if any operation to the AG under repair should actually be blocked
> too?

I don't think that will be possible for rmapbt repair -- we need to be
able to take locks in the wrong order (agf -> inodes) without
deadlocking with a regular operation that's blocked on the AG (inodes ->
agf).  The freezer mechanism eliminates the deadlock possibility by
eliminating the regular IO paths, so this proposed AG shutdown would
also have to protect against that by absorbing operations.

--D

> Cheers
> 
> 
> -- 
> Carlos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 16/21] xfs: repair damaged symlinks
  2018-07-04  5:45   ` Dave Chinner
@ 2018-07-04 18:45     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-04 18:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 04, 2018 at 03:45:33PM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:25:16PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Repair inconsistent symbolic link data.
> .....
> > +/*
> > + * Symbolic Link Repair
> > + * ====================
> > + *
> > + * There's not much we can do to repair symbolic links -- we truncate them to
> > + * the first NULL byte and fix up the remote target block headers if they're
> > + * incorrect.  Zero-length symlinks are turned into links to /.
> > + */
> > +
> > +/* Blow out the whole symlink; replace contents. */
> > +STATIC int
> > +xfs_repair_symlink_rewrite(
> > +	struct xfs_trans	**tpp,
> > +	struct xfs_inode	*ip,
> > +	const char		*target_path,
> > +	int			pathlen)
> > +{
> > +	struct xfs_defer_ops	dfops;
> > +	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
> > +	struct xfs_ifork	*ifp;
> > +	const char		*cur_chunk;
> > +	struct xfs_mount	*mp = (*tpp)->t_mountp;
> > +	struct xfs_buf		*bp;
> > +	xfs_fsblock_t		first_block;
> > +	xfs_fileoff_t		first_fsb;
> > +	xfs_filblks_t		fs_blocks;
> > +	xfs_daddr_t		d;
> > +	int			byte_cnt;
> > +	int			n;
> > +	int			nmaps;
> > +	int			offset;
> > +	int			error = 0;
> > +
> > +	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> > +
> > +	/* Truncate the whole data fork if it wasn't inline. */
> > +	if (!(ifp->if_flags & XFS_IFINLINE)) {
> > +		error = xfs_itruncate_extents(tpp, ip, XFS_DATA_FORK, 0);
> > +		if (error)
> > +			goto out;
> > +	}
> > +
> > +	/* Blow out the in-core fork and zero the on-disk fork. */
> > +	xfs_idestroy_fork(ip, XFS_DATA_FORK);
> > +	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
> > +	ip->i_d.di_nextents = 0;
> > +	memset(&ip->i_df, 0, sizeof(struct xfs_ifork));
> > +	ip->i_df.if_flags |= XFS_IFEXTENTS;
> 
> This looks familiar - doesn't the fork zapping code do exactly this,
> too? factor into a helper?

Yeah, helpers clearly needed somewhere.

> > +
> > +	/* Rewrite an inline symlink. */
> > +	if (pathlen <= XFS_IFORK_DSIZE(ip)) {
> > +		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
> > +
> > +		i_size_write(VFS_I(ip), pathlen);
> > +		ip->i_d.di_size = pathlen;
> > +		ip->i_d.di_format = XFS_DINODE_FMT_LOCAL;
> > +		xfs_trans_log_inode(*tpp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
> > +		goto out;
> > +
> > +	}
> 
> Might make sense to separate inline vs remote into separate
> functions - we tend to do that everywhere else in the symlink code.

Ok.

> > +
> > +	/* Rewrite a remote symlink. */
> > +	fs_blocks = xfs_symlink_blocks(mp, pathlen);
> > +	first_fsb = 0;
> > +	nmaps = XFS_SYMLINK_MAPS;
> > +
> > +	/* Reserve quota for new blocks. */
> > +	error = xfs_trans_reserve_quota_nblks(*tpp, ip, fs_blocks, 0,
> > +			XFS_QMOPT_RES_REGBLKS);
> > +	if (error)
> > +		goto out;
> > +
> > +	/* Map blocks, write symlink target. */
> > +	xfs_defer_init(&dfops, &first_block);
> > +
> > +	error = xfs_bmapi_write(*tpp, ip, first_fsb, fs_blocks,
> > +			  XFS_BMAPI_METADATA, &first_block, fs_blocks,
> > +			  mval, &nmaps, &dfops);
> > +	if (error)
> > +		goto out_bmap_cancel;
> > +
> > +	ip->i_d.di_size = pathlen;
> > +	i_size_write(VFS_I(ip), pathlen);
> > +	xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
> > +
> > +	cur_chunk = target_path;
> > +	offset = 0;
> > +	for (n = 0; n < nmaps; n++) {
> > +		char	*buf;
> > +
> > +		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
> > +		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
> > +		bp = xfs_trans_get_buf(*tpp, mp->m_ddev_targp, d,
> > +				       BTOBB(byte_cnt), 0);
> > +		if (!bp) {
> > +			error = -ENOMEM;
> > +			goto out_bmap_cancel;
> > +		}
> > +		bp->b_ops = &xfs_symlink_buf_ops;
> > +
> > +		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
> > +		byte_cnt = min(byte_cnt, pathlen);
> > +
> > +		buf = bp->b_addr;
> > +		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
> > +					   byte_cnt, bp);
> > +
> > +		memcpy(buf, cur_chunk, byte_cnt);
> > +
> > +		cur_chunk += byte_cnt;
> > +		pathlen -= byte_cnt;
> > +		offset += byte_cnt;
> > +
> > +		xfs_trans_buf_set_type(*tpp, bp, XFS_BLFT_SYMLINK_BUF);
> > +		xfs_trans_log_buf(*tpp, bp, 0, (buf + byte_cnt - 1) -
> > +						(char *)bp->b_addr);
> > +	}
> > +	ASSERT(pathlen == 0);
> 
> This just looks like a copynpaste of main loop in xfs_symlink() -
> can you factor that into a helper, please?

Ok.

> > +
> > +	error = xfs_defer_finish(tpp, &dfops);
> > +	if (error)
> > +		goto out_bmap_cancel;
> > +
> > +	return 0;
> > +
> > +out_bmap_cancel:
> > +	xfs_defer_cancel(&dfops);
> > +out:
> > +	return error;
> > +}
> > +
> > +/* Fix everything that fails the verifiers in the remote blocks. */
> > +STATIC int
> > +xfs_repair_symlink_fix_remotes(
> > +	struct xfs_scrub_context	*sc,
> > +	loff_t				len)
> > +{
> > +	struct xfs_bmbt_irec		mval[XFS_SYMLINK_MAPS];
> > +	struct xfs_buf			*bp;
> > +	xfs_filblks_t			fsblocks;
> > +	xfs_daddr_t			d;
> > +	loff_t				offset;
> > +	unsigned int			byte_cnt;
> > +	int				n;
> > +	int				nmaps = XFS_SYMLINK_MAPS;
> > +	int				nr;
> > +	int				error;
> > +
> > +	fsblocks = xfs_symlink_blocks(sc->mp, len);
> > +	error = xfs_bmapi_read(sc->ip, 0, fsblocks, mval, &nmaps, 0);
> > +	if (error)
> > +		return error;
> > +
> > +	offset = 0;
> > +	for (n = 0; n < nmaps; n++) {
> > +		d = XFS_FSB_TO_DADDR(sc->mp, mval[n].br_startblock);
> > +		byte_cnt = XFS_FSB_TO_B(sc->mp, mval[n].br_blockcount);
> > +
> > +		error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
> > +				d, BTOBB(byte_cnt), 0, &bp, NULL);
> > +		if (error)
> > +			return error;
> > +		bp->b_ops = &xfs_symlink_buf_ops;
> > +
> > +		byte_cnt = XFS_SYMLINK_BUF_SPACE(sc->mp, byte_cnt);
> > +		if (len < byte_cnt)
> > +			byte_cnt = len;
> 
> can we make this the same as the other functions? i.e.
> 
> 		byte_cnt = min(byte_cnt, len);

<nod>

> > +
> > +		nr = xfs_symlink_hdr_set(sc->mp, sc->ip->i_ino, offset,
> > +				byte_cnt, bp);
> > +
> > +		len -= byte_cnt;
> > +		offset += byte_cnt;
> > +
> > +		xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SYMLINK_BUF);
> > +		xfs_trans_log_buf(sc->tp, bp, 0, nr - 1);
> > +		xfs_trans_brelse(sc->tp, bp);
> 
> xfs_trans_brelse() is a no-op here because the buffer has been
> logged. It can be removed.

<nod>

> > +	}
> > +	if (len != 0)
> > +		return -EFSCORRUPTED;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Fix this inline symlink. */
> > +STATIC int
> > +xfs_repair_symlink_inline(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_inode		*ip = sc->ip;
> > +	struct xfs_ifork		*ifp;
> > +	loff_t				len;
> > +	size_t				newlen;
> > +
> > +	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> > +	len = i_size_read(VFS_I(ip));
> > +	xfs_trans_ijoin(sc->tp, ip, 0);
> > +
> > +	if (ifp->if_u1.if_data) {
> > +		newlen = strnlen(ifp->if_u1.if_data, XFS_IFORK_DSIZE(ip));
> > +	} else {
> > +		/* Zero length symlink becomes a root symlink. */
> > +		ifp->if_u1.if_data = kmem_alloc(4, KM_SLEEP);
> > +		snprintf(ifp->if_u1.if_data, 4, "/");
> > +		newlen = 1;
> 
> helper function shared with the fork zapping code?

Yes.

> > +	}
> > +
> > +	if (len > newlen) {
> 
> shouldn't this be 'if (len != newlen) {' ?

Yes.

> > +		i_size_write(VFS_I(ip), newlen);
> > +		ip->i_d.di_size = newlen;
> > +		xfs_trans_log_inode(sc->tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/* Repair a remote symlink. */
> > +STATIC int
> > +xfs_repair_symlink_remote(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_inode		*ip = sc->ip;
> > +	loff_t				len;
> > +	size_t				newlen;
> > +	int				error = 0;
> > +
> > +	len = i_size_read(VFS_I(ip));
> > +	xfs_trans_ijoin(sc->tp, ip, 0);
> > +
> > +	error = xfs_repair_symlink_fix_remotes(sc, len);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Roll transaction, release buffers. */
> > +	error = xfs_trans_roll_inode(&sc->tp, ip);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Size set correctly? */
> > +	len = i_size_read(VFS_I(ip));
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	error = xfs_readlink(ip, sc->buf);
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> 
> Can we pass a "need_lock" flag to xfs_readlink() rather than
> creating a race condition with anything that might be blocked on the
> ilock waiting for repair to complete?

LOL, there already is a locked version of that, which I added two years
ago in preparation for online symlink scrub.  Will switch to
xfs_readlink_bmap_ilocked.

> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Figure out the new target length.  We can't handle zero-length
> > +	 * symlinks, so make sure that we don't write that out.
> > +	 */
> > +	newlen = strnlen(sc->buf, XFS_SYMLINK_MAXLEN);
> > +	if (newlen == 0) {
> > +		*((char *)sc->buf) = '/';
> > +		newlen = 1;
> 
> We really need to do set the name of the repaired path for zero
> length symlinks in only one place. It really seems to me that it
> should done in xfs_repair_symlink_rewrite() if newlen is 0, not
> here.

Agreed, will work on that.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-03 23:59     ` Darrick J. Wong
  2018-07-04  8:44       ` Carlos Maiolino
@ 2018-07-04 23:21       ` Dave Chinner
  2018-07-05  3:48         ` Darrick J. Wong
  1 sibling, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-04 23:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Rebuild the reverse mapping btree from all primary metadata.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > ....
> > 
> > > +static inline int xfs_repair_rmapbt_setup(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_inode		*ip)
> > > +{
> > > +	/* We don't support rmap repair, but we can still do a scan. */
> > > +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> > > +}
> > 
> > This comment seems at odds with the commit message....
> 
> This is the Kconfig shim needed if CONFIG_XFS_ONLINE_REPAIR=n.

Ok, that wasn't clear from the patch context.

> > > + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> > > + * else in XFS we lock inodes and then AG data structures, but generating the
> > > + * list of rmap records requires that we be able to scan both block mapping
> > > + * btrees of every inode in the filesystem to see if it owns any extents in
> > > + * this AG.  We can't tolerate any inode updates while we do this, so we
> > > + * freeze the filesystem to lock everyone else out, and grant ourselves
> > > + * special privileges to run transactions with regular background reclamation
> > > + * turned off.
> > 
> > Hmmm. This implies we are going to scan the entire filesystem for
> > every AG we need to rebuild the rmap tree in. That seems like an
> > awful lot of work if there's more than one rmap btree that needs
> > rebuild.
> 
> [some of this Dave and I discussed on IRC, so I'll summarize for
> everyone else here...]

....

> > Given that we've effetively got to shut down access to the
> > filesystem for the entire rmap rebuild while we do an entire
> > filesystem scan, why would we do this online? It's going to be
> > faster to do this rebuild offline (because of all the prefetching,
> > rebuilding all AG trees from the state gathered in the full
> > filesystem passes, etc) and we don't have to hack around potential
> > transaction and memory reclaim deadlock situations, either?
> > 
> > So why do rmap rebuilds online at all?
> 
> The thing is, xfs_scrub will warm the xfs_buf cache during phases 2 and
> 3 while it checks everything.  By the time it gets to rmapbt repairs
> towards the end of phase 4 (if there's enough memory) those blocks will
> still be in cache and online repair doesn't have to wait for the disk.

Therein lies the problem: "if there's enough memory". If there's
enough memory to cache at the filesystem metadata, track all the
bits repair needs to track, and there's no other memory pressure
then it will hit the cache. But populating that cache is still going
to be slower than an offline repair because IO patterns (see below)
and there is competing IO from other work being done on the
system (i.e. online repair competes for IO resources and memory
resources).

As such, I don't see that we're going to have everything we need
cached for any significantly sized or busy filesystem, and that
means we actually have to care about how much IO online repair
algorithms require. We also have to take into account that much of
that IO is going to be synchronous single metadata block reads.
This will be a limitation on any sort of high IO latency storage
(spinning rust, network based block devices, slow SSDs, etc).

> If instead you unmount and run xfs_repair then xfs_repair has to reload
> all that metadata and recheck it, all of which happens with the fs
> offline.

xfs_repair has all sorts of concurrency and prefetching
optimisations that allow it to scan and process metadata orders of
magnitude faster than online repair, especially on slow storage.
i.e. online repair is going to be IO seek bound, while offline
repair is typically IO bandwidth and/or CPU bound.  Offline repair
can do full filesystem metadata scans measured in GB/s; as long as
online repair does serialised synchronous single structure walks it
will be orders of magnitude slower than an offline repair.

> So except for the extra complexity of avoiding deadlocks (which I
> readily admit is not a small task) I at least don't think it's a
> clear-cut downtime win to rely on xfs_repair.

Back then - as it is still now - I couldn't see how the IO load
required by synchronous full filesystem scans one structure at a
time was going to reduce filesystem downtime compared to an offline
repair doing optimised "all metadata types at once" concurrent
linear AG scans.

Keep in mind that online repair will never guarantee that it can fix
all problems, so we're always going to have to offline repair. iWhat
we want to achieve is minimising downtime for users when a repair is
required. With the above IO limitations in mind, I've always
considered that online repair would just be for all the simple,
quick, easy to fix stuff, because complex stuff that required huge
amounts of RAM and full filesystem scans to resolve would always be
done faster offline.

That's why I think that offline repair will be a better choice for
users for the forseeable future if repairing the damage requires
full filesystem metadata scans.

> > > +
> > > +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> > > +	if (!rre)
> > > +		return -ENOMEM;
> > 
> > This seems like a likely thing to happen given the "no reclaim"
> > state of the filesystem and the memory demand a rmapbt rebuild
> > can have. If we've got GBs of rmap info in the AG that needs to be
> > rebuilt, how much RAM are we going to need to index it all as we
> > scan the filesystem?
> 
> More than I'd like -- at least 24 bytes per record (which at least is no
> larger than the size of the on-disk btree) plus a list_head until I can
> move the repairers away from creating huge lists.

Ok, it kinda sounds a bit like we need to be able to create the new
btree on the fly, rather than as a single operation at the end. e.g.
if the list builds up to, say, 100k records, we push them into the
new tree and can free them. e.g. can we iteratively build the new
tree on disk as we go, then do a root block swap at the end to
switch from the old tree to the new tree?

If that's a potential direction, then maybe we can look at this as a
future direction? It also leads to the posibility of
pausing/continuing repair from where the last chunk of records were
processed, so if we do run out of memory we don't have to start from
the beginning again?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-04 23:21       ` Dave Chinner
@ 2018-07-05  3:48         ` Darrick J. Wong
  2018-07-05  7:03           ` Dave Chinner
  0 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-05  3:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > > On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Rebuild the reverse mapping btree from all primary metadata.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > ....
> > > 
> > > > +static inline int xfs_repair_rmapbt_setup(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	struct xfs_inode		*ip)
> > > > +{
> > > > +	/* We don't support rmap repair, but we can still do a scan. */
> > > > +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> > > > +}
> > > 
> > > This comment seems at odds with the commit message....
> > 
> > This is the Kconfig shim needed if CONFIG_XFS_ONLINE_REPAIR=n.
> 
> Ok, that wasn't clear from the patch context.

I'll add a noisier comment in that section about ...

/*
 * Compatibility shims for CONFIG_XFS_ONLINE_REPAIR=n.
 */

> > > > + * This is the most involved of all the AG space btree rebuilds.  Everywhere
> > > > + * else in XFS we lock inodes and then AG data structures, but generating the
> > > > + * list of rmap records requires that we be able to scan both block mapping
> > > > + * btrees of every inode in the filesystem to see if it owns any extents in
> > > > + * this AG.  We can't tolerate any inode updates while we do this, so we
> > > > + * freeze the filesystem to lock everyone else out, and grant ourselves
> > > > + * special privileges to run transactions with regular background reclamation
> > > > + * turned off.
> > > 
> > > Hmmm. This implies we are going to scan the entire filesystem for
> > > every AG we need to rebuild the rmap tree in. That seems like an
> > > awful lot of work if there's more than one rmap btree that needs
> > > rebuild.
> > 
> > [some of this Dave and I discussed on IRC, so I'll summarize for
> > everyone else here...]
> 
> ....
> 
> > > Given that we've effetively got to shut down access to the
> > > filesystem for the entire rmap rebuild while we do an entire
> > > filesystem scan, why would we do this online? It's going to be
> > > faster to do this rebuild offline (because of all the prefetching,
> > > rebuilding all AG trees from the state gathered in the full
> > > filesystem passes, etc) and we don't have to hack around potential
> > > transaction and memory reclaim deadlock situations, either?
> > > 
> > > So why do rmap rebuilds online at all?
> > 
> > The thing is, xfs_scrub will warm the xfs_buf cache during phases 2 and
> > 3 while it checks everything.  By the time it gets to rmapbt repairs
> > towards the end of phase 4 (if there's enough memory) those blocks will
> > still be in cache and online repair doesn't have to wait for the disk.
> 
> Therein lies the problem: "if there's enough memory". If there's
> enough memory to cache at the filesystem metadata, track all the
> bits repair needs to track, and there's no other memory pressure
> then it will hit the cache.

Fair enough, it is true that the memory requirements of online repair
are high (and higher than they could be), though online repair only
requires enough memory to store all the incore rmap records for the AG
it's repairing, whereas xfs_repair will try to store all rmaps for the
entire filesystem.  It's not a straightforward comparison since
xfs_repair memory can be swapped and its slabs are much more efficient
than the linked lists that online repair has to use, but some of this
seems solvable.

> But populating that cache is still going to be slower than an offline
> repair because IO patterns (see below) and there is competing IO from
> other work being done on the system (i.e. online repair competes for
> IO resources and memory resources).
>
> As such, I don't see that we're going to have everything we need
> cached for any significantly sized or busy filesystem, and that
> means we actually have to care about how much IO online repair
> algorithms require. We also have to take into account that much of
> that IO is going to be synchronous single metadata block reads.
> This will be a limitation on any sort of high IO latency storage
> (spinning rust, network based block devices, slow SSDs, etc).

TBH I'm not sure online repair is a good match for spinning rust, nbd,
or usb sticks since there's a general presumption that it won't consume
so much storage time that everything else sees huge latency spikes.
Repairs will cause much larger spikes but again should be infrequent and
still an improvement over the fs exploding.

> > If instead you unmount and run xfs_repair then xfs_repair has to reload
> > all that metadata and recheck it, all of which happens with the fs
> > offline.
> 
> xfs_repair has all sorts of concurrency and prefetching
> optimisations that allow it to scan and process metadata orders of
> magnitude faster than online repair, especially on slow storage.
> i.e. online repair is going to be IO seek bound, while offline
> repair is typically IO bandwidth and/or CPU bound.  Offline repair
> can do full filesystem metadata scans measured in GB/s; as long as
> online repair does serialised synchronous single structure walks it
> will be orders of magnitude slower than an offline repair.

My long view of online repair is that it satisfies a different set of
constraints than xfs_repair.  Offline repair needs to be as efficient
with resources (particularly CPU) as it can be, because the fs is 100%
offline while it runs.  It needs to run in as little time as possible
and we optimize heavily for that.

Compare this to the situation facing online repair -- we've sharded the
FS into a number of allocation groups, and (in the future) we can take
an AG offline to repair the rmapbt.  Assuming there's sufficient free
space in the surviving AGs that the filesystem can handle everything
asked of it, there are users who prefer to lose 1/4 (or 1/32, or
whatever) of the FS for 10x the time that they would lose all of it.

Even if online repair is bound by uncached synchronous random reads,
it can still be a win.  I have a few customers in mind who have told me
that losing pieces of the storage for long(er) periods of time are
easier for them to handle than all of it, even for a short time.

> > So except for the extra complexity of avoiding deadlocks (which I
> > readily admit is not a small task) I at least don't think it's a
> > clear-cut downtime win to rely on xfs_repair.
> 
> Back then - as it is still now - I couldn't see how the IO load
> required by synchronous full filesystem scans one structure at a
> time was going to reduce filesystem downtime compared to an offline
> repair doing optimised "all metadata types at once" concurrent
> linear AG scans.

<nod> For the scrub half of this story I deliberately avoided anything
anything that required hard fs downtime, at least until we got to the
fscounters thing.  I'm hoping that background scrubs will be gentle
enough that it will be a better option than what ext4 online scrub does
(which is to say it creates lvm snapshots and fscks them).

> Keep in mind that online repair will never guarantee that it can fix
> all problems, so we're always going to have to offline repair. iWhat
> we want to achieve is minimising downtime for users when a repair is
> required. With the above IO limitations in mind, I've always
> considered that online repair would just be for all the simple,
> quick, easy to fix stuff, because complex stuff that required huge
> amounts of RAM and full filesystem scans to resolve would always be
> done faster offline.
>
> That's why I think that offline repair will be a better choice for
> users for the forseeable future if repairing the damage requires
> full filesystem metadata scans.

That might be, but how else can we determine that than to merge it under
an experimental banner, iterate it for a while, and try to persuade a
wider band of users than ourselves to try it on a non-production system
to see if it better solves their problems? :)

At worst, if future us decide that we will never figure out how to make
online rmapbt repair robust I'm fully prepared to withdraw it.

> > > > +
> > > > +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> > > > +	if (!rre)
> > > > +		return -ENOMEM;
> > > 
> > > This seems like a likely thing to happen given the "no reclaim"
> > > state of the filesystem and the memory demand a rmapbt rebuild
> > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > rebuilt, how much RAM are we going to need to index it all as we
> > > scan the filesystem?
> > 
> > More than I'd like -- at least 24 bytes per record (which at least is no
> > larger than the size of the on-disk btree) plus a list_head until I can
> > move the repairers away from creating huge lists.
> 
> Ok, it kinda sounds a bit like we need to be able to create the new
> btree on the fly, rather than as a single operation at the end. e.g.
> if the list builds up to, say, 100k records, we push them into the
> new tree and can free them. e.g. can we iteratively build the new
> tree on disk as we go, then do a root block swap at the end to
> switch from the old tree to the new tree?

Seeing as we have a bunch of storage at our disposal, I think we could
push the records out to disk (or just write them straight into a new
btree) when we detect low memory conditions or consume more than some
threshold of memory.  For v0 I was focusing on getting it to work at all
even with a largeish memory cost. :)

> If that's a potential direction, then maybe we can look at this as a
> future direction? It also leads to the posibility of
> pausing/continuing repair from where the last chunk of records were
> processed, so if we do run out of memory we don't have to start from
> the beginning again?

That could be difficult to do in practice -- we'll continue to
accumulate deferred frees for that AG in the meantime.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-05  3:48         ` Darrick J. Wong
@ 2018-07-05  7:03           ` Dave Chinner
  2018-07-06  0:47             ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-05  7:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jul 04, 2018 at 08:48:58PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> > On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > Keep in mind that online repair will never guarantee that it can fix
> > all problems, so we're always going to have to offline repair. iWhat
> > we want to achieve is minimising downtime for users when a repair is
> > required. With the above IO limitations in mind, I've always
> > considered that online repair would just be for all the simple,
> > quick, easy to fix stuff, because complex stuff that required huge
> > amounts of RAM and full filesystem scans to resolve would always be
> > done faster offline.
> >
> > That's why I think that offline repair will be a better choice for
> > users for the forseeable future if repairing the damage requires
> > full filesystem metadata scans.
> 
> That might be, but how else can we determine that than to merge it under
> an experimental banner, iterate it for a while, and try to persuade a
> wider band of users than ourselves to try it on a non-production system
> to see if it better solves their problems? :)

I think "deploy and idea to users and see if they have problems" is
a horrible development model. That's where btrfs went off the rails
almost 10 years ago....

That aside, I'd like to make sure we're on the same page - we don't
need online repair to fix /everything/ before it gets merged. Even
without rmap rebuild, it will still be usable and useful to the vast
majority of users wanting the filesystem to self repair the typical
one-off corruption problems filesystems encounter during their
typical production life.

What I'm worried about is that the online repair algorithms have a
fundmental dependence on rmap lookups to avoid the need to build the
global cross-references needed to isolate and repair corruptions.
This use of rmap, from one angle, can be seen as a performance
optimisation, but as I've come understand more of the algorithms the
online repair code uses, it looks more like an intractable problem
than one of performance optimisation.

That is, if the rmap is corrupt, or we even suspect that it's
corrupt, then we cannot use it to validate the state and contents of
the other on-disk structures. We could propagate rmap corruptions to
those other structures and it would go undetected. And because we
can't determine the validity of all the other structures, the only
way we can correctly repair the filesystem is to build a global
cross-reference, resolve all the inconsistencies, and then rewrite
all the structures. i.e. we need to do what xfs_repair does in phase
3, 4 and 5.

IOWs, ISTM that the online scrub/repair algorithms break down in the
face of rmap corruption and it is correctable only by building a
global cross-reference which we can then reconcile and use to
rebuild all the per-AG metadata btrees in the filesystem. Trying to
do it online via scanning of unverifiable structures risks making
the problem worse and there is a good possibility that we can't
repair the inconsistencies in a single pass.

So at this point, I'd prefer that we stop, discuss and work out a
solution to the rmap rebuild problem rather than merge a rmap
rebuild algorithm prematurely. This doesn't need to hold up any of
the other online repair functionality - none of that is dependent on
this particular piece of the puzzle being implemented.

> At worst, if future us decide that we will never figure out how to make
> online rmapbt repair robust I'm fully prepared to withdraw it.

Merging new features before sorting out the fundamental principles,
behaviours and algorithms of those features is how btrfs ended up
with a pile of stuff that doesn't quite work which no-one
understands quite well enough to fix.

> > > > This seems like a likely thing to happen given the "no reclaim"
> > > > state of the filesystem and the memory demand a rmapbt rebuild
> > > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > > rebuilt, how much RAM are we going to need to index it all as we
> > > > scan the filesystem?
> > > 
> > > More than I'd like -- at least 24 bytes per record (which at least is no
> > > larger than the size of the on-disk btree) plus a list_head until I can
> > > move the repairers away from creating huge lists.
> > 
> > Ok, it kinda sounds a bit like we need to be able to create the new
> > btree on the fly, rather than as a single operation at the end. e.g.
> > if the list builds up to, say, 100k records, we push them into the
> > new tree and can free them. e.g. can we iteratively build the new
> > tree on disk as we go, then do a root block swap at the end to
> > switch from the old tree to the new tree?
> 
> Seeing as we have a bunch of storage at our disposal, I think we could
> push the records out to disk (or just write them straight into a new
> btree) when we detect low memory conditions or consume more than some
> threshold of memory. i

*nod* That's kinda along the lines of what I was thinking of.

> For v0 I was focusing on getting it to work at all
> even with a largeish memory cost. :)

Right, I'm not suggesting that this needs to be done before the
initial merge - I just want to make sure we don't back ourselves
into a corner we can't get out of.

> > If that's a potential direction, then maybe we can look at this as a
> > future direction? It also leads to the posibility of
> > pausing/continuing repair from where the last chunk of records were
> > processed, so if we do run out of memory we don't have to start from
> > the beginning again?
> 
> That could be difficult to do in practice -- we'll continue to
> accumulate deferred frees for that AG in the meantime.

I think it depends on the way we record deferred frees, but it's a
bit premature to be discussing this right now :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-05  7:03           ` Dave Chinner
@ 2018-07-06  0:47             ` Darrick J. Wong
  2018-07-06  1:08               ` Dave Chinner
  0 siblings, 1 reply; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-06  0:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 05, 2018 at 05:03:24PM +1000, Dave Chinner wrote:
> On Wed, Jul 04, 2018 at 08:48:58PM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 05, 2018 at 09:21:15AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > > Keep in mind that online repair will never guarantee that it can fix
> > > all problems, so we're always going to have to offline repair. iWhat
> > > we want to achieve is minimising downtime for users when a repair is
> > > required. With the above IO limitations in mind, I've always
> > > considered that online repair would just be for all the simple,
> > > quick, easy to fix stuff, because complex stuff that required huge
> > > amounts of RAM and full filesystem scans to resolve would always be
> > > done faster offline.
> > >
> > > That's why I think that offline repair will be a better choice for
> > > users for the forseeable future if repairing the damage requires
> > > full filesystem metadata scans.
> > 
> > That might be, but how else can we determine that than to merge it under
> > an experimental banner, iterate it for a while, and try to persuade a
> > wider band of users than ourselves to try it on a non-production system
> > to see if it better solves their problems? :)
> 
> I think "deploy and idea to users and see if they have problems" is
> a horrible development model. That's where btrfs went off the rails
> almost 10 years ago....

I would only test-deploy to a very small and carefully selected group of
tightly-controlled internal users.  For everyone else, it hides behind a
default=N kconfig option and we disavow any support of EXPERIMENTAL
features. :)

> That aside, I'd like to make sure we're on the same page - we don't
> need online repair to fix /everything/ before it gets merged. Even

Indeed.  I'm fine with it merging the non-freezing repairers in their
current form.  I regularly run the repair tool fuzz xfstests to compare
the effectiveness of the xfs_scrub / xfs_repair.  They both have
deficiencies here and there, which will keep us busy for years to come.
:P

> without rmap rebuild, it will still be usable and useful to the vast
> majority of users wanting the filesystem to self repair the typical
> one-off corruption problems filesystems encounter during their
> typical production life.
> 
> What I'm worried about is that the online repair algorithms have a
> fundmental dependence on rmap lookups to avoid the need to build the
> global cross-references needed to isolate and repair corruptions.

Yes, the primary metadata repairers are absolutely dependent on the
secondary data to be faster than a full scan, and repairing secondary
metadata was never going to be easy to do online.

> This use of rmap, from one angle, can be seen as a performance
> optimisation, but as I've come understand more of the algorithms the
> online repair code uses, it looks more like an intractable problem
> than one of performance optimisation.
> 
> That is, if the rmap is corrupt, or we even suspect that it's
> corrupt, then we cannot use it to validate the state and contents of
> the other on-disk structures. We could propagate rmap corruptions to
> those other structures and it would go undetected. And because we
> can't determine the validity of all the other structures, the only
> way we can correctly repair the filesystem is to build a global
> cross-reference, resolve all the inconsistencies, and then rewrite
> all the structures. i.e. we need to do what xfs_repair does in phase
> 3, 4 and 5.
> 
> IOWs, ISTM that the online scrub/repair algorithms break down in the
> face of rmap corruption and it is correctable only by building a
> global cross-reference which we can then reconcile and use to
> rebuild all the per-AG metadata btrees in the filesystem. Trying to
> do it online via scanning of unverifiable structures risks making
> the problem worse and there is a good possibility that we can't
> repair the inconsistencies in a single pass.

Yeah.  I've pondered whether or not the primary repairers ought to
require at least a quick rmap scan before they touch anything, but up
till now I've preferred to keep that knowledge in xfs_scrub.

> So at this point, I'd prefer that we stop, discuss and work out a
> solution to the rmap rebuild problem rather than merge a rmap
> rebuild algorithm prematurely. This doesn't need to hold up any of
> the other online repair functionality - none of that is dependent on
> this particular piece of the puzzle being implemented.

I didn't think it would hold up the others repairers.  I always knew
that the rmapbt repair was going to be a tough project. :)

So, seeing as you're about to head off on vacation, let's set a rough
time to discuss the rmap rebuild problem in detail once you're back.
Does that sound good?

> > At worst, if future us decide that we will never figure out how to make
> > online rmapbt repair robust I'm fully prepared to withdraw it.
> 
> Merging new features before sorting out the fundamental principles,
> behaviours and algorithms of those features is how btrfs ended up
> with a pile of stuff that doesn't quite work which no-one
> understands quite well enough to fix.

Ok ok it can stay out of tree until we're all fairly confident it'll
survive on our bombardment tests.  I do have a test, xfs/422, to
simulate repairing an rmapbt under heavy load by running fsstress,
fsfreeze, and xfs_io injecting and force-repairing rmapbts in a loop.
So far it hasn't blown up or failed to regenerate the rmapbt...

> > > > > This seems like a likely thing to happen given the "no reclaim"
> > > > > state of the filesystem and the memory demand a rmapbt rebuild
> > > > > can have. If we've got GBs of rmap info in the AG that needs to be
> > > > > rebuilt, how much RAM are we going to need to index it all as we
> > > > > scan the filesystem?
> > > > 
> > > > More than I'd like -- at least 24 bytes per record (which at least is no
> > > > larger than the size of the on-disk btree) plus a list_head until I can
> > > > move the repairers away from creating huge lists.
> > > 
> > > Ok, it kinda sounds a bit like we need to be able to create the new
> > > btree on the fly, rather than as a single operation at the end. e.g.
> > > if the list builds up to, say, 100k records, we push them into the
> > > new tree and can free them. e.g. can we iteratively build the new
> > > tree on disk as we go, then do a root block swap at the end to
> > > switch from the old tree to the new tree?
> > 
> > Seeing as we have a bunch of storage at our disposal, I think we could
> > push the records out to disk (or just write them straight into a new
> > btree) when we detect low memory conditions or consume more than some
> > threshold of memory. i
> 
> *nod* That's kinda along the lines of what I was thinking of.
> 
> > For v0 I was focusing on getting it to work at all
> > even with a largeish memory cost. :)
> 
> Right, I'm not suggesting that this needs to be done before the
> initial merge - I just want to make sure we don't back ourselves
> into a corner we can't get out of.

Agreed.  I don't want to get stuck in a corner either, but I don't know
that I know where all those corners are. :)

> > > If that's a potential direction, then maybe we can look at this as a
> > > future direction? It also leads to the posibility of
> > > pausing/continuing repair from where the last chunk of records were
> > > processed, so if we do run out of memory we don't have to start from
> > > the beginning again?
> > 
> > That could be difficult to do in practice -- we'll continue to
> > accumulate deferred frees for that AG in the meantime.
> 
> I think it depends on the way we record deferred frees, but it's a
> bit premature to be discussing this right now :P

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 17/21] xfs: repair extended attributes
  2018-06-24 19:25 ` [PATCH 17/21] xfs: repair extended attributes Darrick J. Wong
@ 2018-07-06  1:03   ` Dave Chinner
  2018-07-06  3:10     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-06  1:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:25:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> If the extended attributes look bad, try to sift through the rubble to
> find whatever keys/values we can, zap the attr tree, and re-add the
> values.
.....
> 
> +/*
> + * Extended Attribute Repair
> + * =========================
> + *
> + * We repair extended attributes by reading the attribute fork blocks looking
> + * for keys and values, then truncate the entire attr fork and reinsert all
> + * the attributes.  Unfortunately, there's no secondary copy of most extended
> + * attribute data, which means that if we blow up midway through there's
> + * little we can do.
> + */
> +
> +struct xfs_attr_key {
> +	struct list_head		list;
> +	unsigned char			*value;
> +	int				valuelen;
> +	int				flags;
> +	int				namelen;
> +	unsigned char			name[0];
> +};
> +
> +#define XFS_ATTR_KEY_LEN(namelen) (sizeof(struct xfs_attr_key) + (namelen) + 1)
> +
> +struct xfs_repair_xattr {
> +	struct list_head		*attrlist;
> +	struct xfs_scrub_context	*sc;
> +};
> +
> +/* Iterate each block in an attr fork extent */
> +#define for_each_xfs_attr_block(mp, irec, dabno) \
> +	for ((dabno) = roundup((xfs_dablk_t)(irec)->br_startoff, \
> +			(mp)->m_attr_geo->fsbcount); \
> +	     (dabno) < (irec)->br_startoff + (irec)->br_blockcount; \
> +	     (dabno) += (mp)->m_attr_geo->fsbcount)

What's the roundup() for? The attribute fsbcount is only ever going
to be 1 (single block), so it's not obvious what this is doing...

> +/*
> + * Record an extended attribute key & value for later reinsertion into the
> + * inode.  Use the helpers below, don't call this directly.
> + */
> +STATIC int
> +__xfs_repair_xattr_salvage_attr(
> +	struct xfs_repair_xattr		*rx,
> +	struct xfs_buf			*bp,
> +	int				flags,
> +	int				idx,
> +	unsigned char			*name,
> +	int				namelen,
> +	unsigned char			*value,
> +	int				valuelen)
> +{
> +	struct xfs_attr_key		*key;
> +	struct xfs_da_args		args;
> +	int				error = -ENOMEM;
> +
> +	/* Ignore incomplete or oversized attributes. */
> +	if ((flags & XFS_ATTR_INCOMPLETE) ||
> +	    namelen > XATTR_NAME_MAX || namelen < 0 ||
> +	    valuelen > XATTR_SIZE_MAX || valuelen < 0)
> +		return 0;
> +
> +	/* Store attr key. */
> +	key = kmem_alloc(XFS_ATTR_KEY_LEN(namelen), KM_MAYFAIL);
> +	if (!key)
> +		goto err;
> +	INIT_LIST_HEAD(&key->list);
> +	key->value = kmem_zalloc_large(valuelen, KM_MAYFAIL);

Why zero this? Also, it looks like valuelen can be zero? Should be
we be allocating a buffer in that case?

> +	if (!key->value)
> +		goto err_key;
> +	key->valuelen = valuelen;
> +	key->flags = flags & (ATTR_ROOT | ATTR_SECURE);
> +	key->namelen = namelen;
> +	key->name[namelen] = 0;
> +	memcpy(key->name, name, namelen);
> +
> +	/* Caller already had the value, so copy it and exit. */
> +	if (value) {
> +		memcpy(key->value, value, valuelen);
> +		goto out_ok;

memcpy of a zero length buffer into a zero length pointer does what?

> +	}
> +
> +	/* Otherwise look up the remote value directly. */

It's not at all obvious why we are looking up a remote xattr at this
point in the function.

> +	memset(&args, 0, sizeof(args));
> +	args.geo = rx->sc->mp->m_attr_geo;
> +	args.index = idx;
> +	args.namelen = namelen;
> +	args.name = key->name;
> +	args.valuelen = valuelen;
> +	args.value = key->value;
> +	args.dp = rx->sc->ip;
> +	args.trans = rx->sc->tp;
> +	error = xfs_attr3_leaf_getvalue(bp, &args);
> +	if (error || args.rmtblkno == 0)
> +		goto err_value;
> +
> +	error = xfs_attr_rmtval_get(&args);
> +	switch (error) {
> +	case 0:
> +		break;
> +	case -EFSBADCRC:
> +	case -EFSCORRUPTED:
> +		error = 0;
> +		/* fall through */
> +	default:
> +		goto err_value;

So here we can return with error = 0, but no actual extended
attribute. Isn't this a silent failure?

> +	}
> +
> +out_ok:
> +	list_add_tail(&key->list, rx->attrlist);
> +	return 0;
> +
> +err_value:
> +	kmem_free(key->value);
> +err_key:
> +	kmem_free(key);
> +err:
> +	return error;
> +}
> +
> +/*
> + * Record a local format extended attribute key & value for later reinsertion
> + * into the inode.
> + */
> +static inline int
> +xfs_repair_xattr_salvage_local_attr(
> +	struct xfs_repair_xattr		*rx,
> +	int				flags,
> +	unsigned char			*name,
> +	int				namelen,
> +	unsigned char			*value,
> +	int				valuelen)
> +{
> +	return __xfs_repair_xattr_salvage_attr(rx, NULL, flags, 0, name,
> +			namelen, value, valuelen);
> +}
> +
> +/*
> + * Record a remote format extended attribute key & value for later reinsertion
> + * into the inode.
> + */
> +static inline int
> +xfs_repair_xattr_salvage_remote_attr(
> +	struct xfs_repair_xattr		*rx,
> +	int				flags,
> +	unsigned char			*name,
> +	int				namelen,
> +	struct xfs_buf			*leaf_bp,
> +	int				idx,
> +	int				valuelen)
> +{
> +	return __xfs_repair_xattr_salvage_attr(rx, leaf_bp, flags, idx,
> +			name, namelen, NULL, valuelen);
> +}

Oh, this is why __xfs_repair_xattr_salvage_attr() has two completely
separate sets of code in it. Can we factor this differently? i.e a
helper function to do all the validity checking and key allocation,
and then leave the local versus remote attr handling in these
functions?

> +
> +/* Extract every xattr key that we can from this attr fork block. */
> +STATIC int
> +xfs_repair_xattr_recover_leaf(
> +	struct xfs_repair_xattr		*rx,
> +	struct xfs_buf			*bp)
> +{
> +	struct xfs_attr3_icleaf_hdr	leafhdr;
> +	struct xfs_scrub_context	*sc = rx->sc;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_attr_leafblock	*leaf;
> +	unsigned long			*usedmap = sc->buf;
> +	struct xfs_attr_leaf_name_local	*lentry;
> +	struct xfs_attr_leaf_name_remote *rentry;
> +	struct xfs_attr_leaf_entry	*ent;
> +	struct xfs_attr_leaf_entry	*entries;
> +	char				*buf_end;
> +	char				*name;
> +	char				*name_end;
> +	char				*value;
> +	size_t				off;
> +	unsigned int			nameidx;
> +	unsigned int			namesize;
> +	unsigned int			hdrsize;
> +	unsigned int			namelen;
> +	unsigned int			valuelen;
> +	int				i;
> +	int				error;

Can we scope all these variables inside the blocks that use them?

> +
> +	bitmap_zero(usedmap, mp->m_attr_geo->blksize);
> +
> +	/* Check the leaf header */
> +	leaf = bp->b_addr;
> +	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
> +	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
> +	xfs_scrub_xattr_set_map(sc, usedmap, 0, hdrsize);
> +	entries = xfs_attr3_leaf_entryp(leaf);
> +
> +	buf_end = (char *)bp->b_addr + mp->m_attr_geo->blksize;
> +	for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) {
> +		/* Skip key if it conflicts with something else? */
> +		off = (char *)ent - (char *)leaf;
> +		if (!xfs_scrub_xattr_set_map(sc, usedmap, off,
> +				sizeof(xfs_attr_leaf_entry_t)))
> +			continue;
> +
> +		/* Check the name information. */
> +		nameidx = be16_to_cpu(ent->nameidx);
> +		if (nameidx < leafhdr.firstused ||
> +		    nameidx >= mp->m_attr_geo->blksize)
> +			continue;
> +
> +		if (ent->flags & XFS_ATTR_LOCAL) {
> +			lentry = xfs_attr3_leaf_name_local(leaf, i);
> +			namesize = xfs_attr_leaf_entsize_local(lentry->namelen,
> +					be16_to_cpu(lentry->valuelen));
> +			name_end = (char *)lentry + namesize;
> +			if (lentry->namelen == 0)
> +				continue;
> +			name = lentry->nameval;
> +			namelen = lentry->namelen;
> +			valuelen = be16_to_cpu(lentry->valuelen);
> +			value = &name[namelen];

It seems cumbersome to do a bunch of special local/remote attr
decoding into a set of semi-common variables, only to then pass the
specific local/remote variables back to specific local/remote
processing functions.

i.e. I'd prefer to see the attr decoding done inside the salvage
function so this looks something like:

	if (ent->flags & XFS_ATTR_LOCAL) {
		lentry = xfs_attr3_leaf_name_local(leaf, i);
		error = xfs_repair_xattr_salvage_local_attr(rx,
					lentry, ...);
	} else {
		rentry = xfs_attr3_leaf_name_remote(leaf, i);
		error = xfs_repair_xattr_salvage_local_attr(rx,
					rentry, ....);
	}

......

> +
> +/* Try to recover shortform attrs. */
> +STATIC int
> +xfs_repair_xattr_recover_sf(
> +	struct xfs_repair_xattr		*rx)
> +{
> +	struct xfs_attr_shortform	*sf;
> +	struct xfs_attr_sf_entry	*sfe;
> +	struct xfs_attr_sf_entry	*next;
> +	struct xfs_ifork		*ifp;
> +	unsigned char			*end;
> +	int				i;
> +	int				error;
> +
> +	ifp = XFS_IFORK_PTR(rx->sc->ip, XFS_ATTR_FORK);
> +	sf = (struct xfs_attr_shortform *)rx->sc->ip->i_afp->if_u1.if_data;

	sf = (struct xfs_attr_shortform *)ifp->if_u1.if_data;

....
> +/*
> + * Repair the extended attribute metadata.
> + *
> + * XXX: Remote attribute value buffers encompass the entire (up to 64k) buffer
> + * and we can't handle those 100% until the buffer cache learns how to deal
> + * with that.

I'm not sure what this comment means/implies.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 11/21] xfs: repair the rmapbt
  2018-07-06  0:47             ` Darrick J. Wong
@ 2018-07-06  1:08               ` Dave Chinner
  0 siblings, 0 replies; 77+ messages in thread
From: Dave Chinner @ 2018-07-06  1:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jul 05, 2018 at 05:47:48PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 05, 2018 at 05:03:24PM +1000, Dave Chinner wrote:
> > IOWs, ISTM that the online scrub/repair algorithms break down in the
> > face of rmap corruption and it is correctable only by building a
> > global cross-reference which we can then reconcile and use to
> > rebuild all the per-AG metadata btrees in the filesystem. Trying to
> > do it online via scanning of unverifiable structures risks making
> > the problem worse and there is a good possibility that we can't
> > repair the inconsistencies in a single pass.
> 
> Yeah.  I've pondered whether or not the primary repairers ought to
> require at least a quick rmap scan before they touch anything, but up
> till now I've preferred to keep that knowledge in xfs_scrub.

I think that's fine - scrub is most likely going to be the trigger
to run online repair, anyway.

> > So at this point, I'd prefer that we stop, discuss and work out a
> > solution to the rmap rebuild problem rather than merge a rmap
> > rebuild algorithm prematurely. This doesn't need to hold up any of
> > the other online repair functionality - none of that is dependent on
> > this particular piece of the puzzle being implemented.
> 
> I didn't think it would hold up the others repairers.  I always knew
> that the rmapbt repair was going to be a tough project. :)
> 
> So, seeing as you're about to head off on vacation, let's set a rough
> time to discuss the rmap rebuild problem in detail once you're back.
> Does that sound good?

Yes. Gives us some thinking time, too :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 19/21] xfs: repair quotas
  2018-06-24 19:25 ` [PATCH 19/21] xfs: repair quotas Darrick J. Wong
@ 2018-07-06  1:50   ` Dave Chinner
  2018-07-06  3:16     ` Darrick J. Wong
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Chinner @ 2018-07-06  1:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sun, Jun 24, 2018 at 12:25:35PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Fix anything that causes the quota verifiers to fail.
> +/* Fix anything the verifiers complain about. */
> +STATIC int
> +xfs_repair_quota_block(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*bp,
> +	uint				dqtype,
> +	xfs_dqid_t			id)
> +{
> +	struct xfs_dqblk		*d = (struct xfs_dqblk *)bp->b_addr;
> +	struct xfs_disk_dquot		*ddq;
> +	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
> +	enum xfs_blft			buftype = 0;
> +	int				i;
> +
> +	bp->b_ops = &xfs_dquot_buf_ops;
> +	for (i = 0; i < qi->qi_dqperchunk; i++) {
> +		ddq = &d[i].dd_diskdq;
> +
> +		ddq->d_magic = cpu_to_be16(XFS_DQUOT_MAGIC);
> +		ddq->d_version = XFS_DQUOT_VERSION;
> +		ddq->d_flags = dqtype;
> +		ddq->d_id = cpu_to_be32(id + i);
> +
> +		xfs_repair_quota_fix_timer(ddq->d_blk_softlimit,
> +				ddq->d_bcount, &ddq->d_btimer,
> +				qi->qi_btimelimit);
> +		xfs_repair_quota_fix_timer(ddq->d_ino_softlimit,
> +				ddq->d_icount, &ddq->d_itimer,
> +				qi->qi_itimelimit);
> +		xfs_repair_quota_fix_timer(ddq->d_rtb_softlimit,
> +				ddq->d_rtbcount, &ddq->d_rtbtimer,
> +				qi->qi_rtbtimelimit);
> +
> +		if (xfs_sb_version_hascrc(&sc->mp->m_sb)) {
> +			uuid_copy(&d->dd_uuid, &sc->mp->m_sb.sb_meta_uuid);
> +			xfs_update_cksum((char *)d, sizeof(struct xfs_dqblk),
> +					 XFS_DQUOT_CRC_OFF);

Hmmm - Do we need to reset the lsn here, too? 

What makes me think - do we check that the lsn is valid in verifiers
and so catch/fix problems where the lsn is outside the valid range
of the current log LSNs?

> +STATIC int
> +xfs_repair_quota_data_fork(
> +	struct xfs_scrub_context	*sc,
> +	uint				dqtype)
> +{
> +	struct xfs_bmbt_irec		irec = { 0 };
> +	struct xfs_iext_cursor		icur;
> +	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
> +	struct xfs_ifork		*ifp;
> +	struct xfs_buf			*bp;
> +	struct xfs_dqblk		*d;
> +	xfs_dqid_t			id;
> +	xfs_fileoff_t			max_dqid_off;
> +	xfs_fileoff_t			off;
> +	xfs_fsblock_t			fsbno;
> +	bool				truncate = false;
> +	int				error = 0;
> +
> +	error = xfs_repair_metadata_inode_forks(sc);
> +	if (error)
> +		goto out;
> +
> +	/* Check for data fork problems that apply only to quota files. */
> +	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
> +	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
> +	for_each_xfs_iext(ifp, &icur, &irec) {
> +		if (isnullstartblock(irec.br_startblock)) {
> +			error = -EFSCORRUPTED;
> +			goto out;
> +		}
> +
> +		if (irec.br_startoff > max_dqid_off ||
> +		    irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) {
> +			truncate = true;
> +			break;
> +		}
> +	}
> +	if (truncate) {
> +		error = xfs_itruncate_extents(&sc->tp, sc->ip, XFS_DATA_FORK,
> +				max_dqid_off * sc->mp->m_sb.sb_blocksize);
> +		if (error)
> +			goto out;
> +	}
> +
> +	/* Now go fix anything that fails the verifiers. */
> +	for_each_xfs_iext(ifp, &icur, &irec) {
> +		for (fsbno = irec.br_startblock, off = irec.br_startoff;
> +		     fsbno < irec.br_startblock + irec.br_blockcount;
> +		     fsbno += XFS_DQUOT_CLUSTER_SIZE_FSB,
> +				off += XFS_DQUOT_CLUSTER_SIZE_FSB) {
> +			id = off * qi->qi_dqperchunk;
> +			error = xfs_trans_read_buf(sc->mp, sc->tp,
> +					sc->mp->m_ddev_targp,
> +					XFS_FSB_TO_DADDR(sc->mp, fsbno),
> +					qi->qi_dqchunklen,
> +					0, &bp, &xfs_dquot_buf_ops);
> +			if (error == 0) {
> +				d = (struct xfs_dqblk *)bp->b_addr;
> +				if (id == be32_to_cpu(d->dd_diskdq.d_id))
> +					continue;

Need to release the buffer here - it's clean and passes the
verifier, so no need to hold on to them as we may have thousands of
them to walk here.


Otherwise looks ok.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 17/21] xfs: repair extended attributes
  2018-07-06  1:03   ` Dave Chinner
@ 2018-07-06  3:10     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-06  3:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 06, 2018 at 11:03:24AM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:25:23PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > If the extended attributes look bad, try to sift through the rubble to
> > find whatever keys/values we can, zap the attr tree, and re-add the
> > values.
> .....
> > 
> > +/*
> > + * Extended Attribute Repair
> > + * =========================
> > + *
> > + * We repair extended attributes by reading the attribute fork blocks looking
> > + * for keys and values, then truncate the entire attr fork and reinsert all
> > + * the attributes.  Unfortunately, there's no secondary copy of most extended
> > + * attribute data, which means that if we blow up midway through there's
> > + * little we can do.
> > + */
> > +
> > +struct xfs_attr_key {
> > +	struct list_head		list;
> > +	unsigned char			*value;
> > +	int				valuelen;
> > +	int				flags;
> > +	int				namelen;
> > +	unsigned char			name[0];
> > +};
> > +
> > +#define XFS_ATTR_KEY_LEN(namelen) (sizeof(struct xfs_attr_key) + (namelen) + 1)
> > +
> > +struct xfs_repair_xattr {
> > +	struct list_head		*attrlist;
> > +	struct xfs_scrub_context	*sc;
> > +};
> > +
> > +/* Iterate each block in an attr fork extent */
> > +#define for_each_xfs_attr_block(mp, irec, dabno) \
> > +	for ((dabno) = roundup((xfs_dablk_t)(irec)->br_startoff, \
> > +			(mp)->m_attr_geo->fsbcount); \
> > +	     (dabno) < (irec)->br_startoff + (irec)->br_blockcount; \
> > +	     (dabno) += (mp)->m_attr_geo->fsbcount)
> 
> What's the roundup() for? The attribute fsbcount is only ever going
> to be 1 (single block), so it's not obvious what this is doing...

I was trying to write defensively in case the attribute fsbcount ever
/does/ become larger than 1.

> > +/*
> > + * Record an extended attribute key & value for later reinsertion into the
> > + * inode.  Use the helpers below, don't call this directly.
> > + */
> > +STATIC int
> > +__xfs_repair_xattr_salvage_attr(
> > +	struct xfs_repair_xattr		*rx,
> > +	struct xfs_buf			*bp,
> > +	int				flags,
> > +	int				idx,
> > +	unsigned char			*name,
> > +	int				namelen,
> > +	unsigned char			*value,
> > +	int				valuelen)
> > +{
> > +	struct xfs_attr_key		*key;
> > +	struct xfs_da_args		args;
> > +	int				error = -ENOMEM;
> > +
> > +	/* Ignore incomplete or oversized attributes. */
> > +	if ((flags & XFS_ATTR_INCOMPLETE) ||
> > +	    namelen > XATTR_NAME_MAX || namelen < 0 ||
> > +	    valuelen > XATTR_SIZE_MAX || valuelen < 0)
> > +		return 0;
> > +
> > +	/* Store attr key. */
> > +	key = kmem_alloc(XFS_ATTR_KEY_LEN(namelen), KM_MAYFAIL);
> > +	if (!key)
> > +		goto err;
> > +	INIT_LIST_HEAD(&key->list);
> > +	key->value = kmem_zalloc_large(valuelen, KM_MAYFAIL);
> 
> Why zero this? Also, it looks like valuelen can be zero? Should be
> we be allocating a buffer in that case?

Good point, we don't need to zero it and this does need to handle the
zero-length attr value case.

> > +	if (!key->value)
> > +		goto err_key;
> > +	key->valuelen = valuelen;
> > +	key->flags = flags & (ATTR_ROOT | ATTR_SECURE);
> > +	key->namelen = namelen;
> > +	key->name[namelen] = 0;
> > +	memcpy(key->name, name, namelen);
> > +
> > +	/* Caller already had the value, so copy it and exit. */
> > +	if (value) {
> > +		memcpy(key->value, value, valuelen);
> > +		goto out_ok;
> 
> memcpy of a zero length buffer into a zero length pointer does what?
> 
> > +	}
> > +
> > +	/* Otherwise look up the remote value directly. */
> 
> It's not at all obvious why we are looking up a remote xattr at this
> point in the function.

We're iterating the attr leaves looking for key/values that can be
stashed in memory while we reset the attr fork and re-add the
salvageable key/value pairs.

Sooooo, reword comment as such:

/*
 * This attribute has a remote value.  Look up the remote value so that
 * we can stash it for later reconstruction.
 */

> > +	memset(&args, 0, sizeof(args));
> > +	args.geo = rx->sc->mp->m_attr_geo;
> > +	args.index = idx;
> > +	args.namelen = namelen;
> > +	args.name = key->name;
> > +	args.valuelen = valuelen;
> > +	args.value = key->value;
> > +	args.dp = rx->sc->ip;
> > +	args.trans = rx->sc->tp;
> > +	error = xfs_attr3_leaf_getvalue(bp, &args);
> > +	if (error || args.rmtblkno == 0)
> > +		goto err_value;
> > +
> > +	error = xfs_attr_rmtval_get(&args);
> > +	switch (error) {
> > +	case 0:
> > +		break;
> > +	case -EFSBADCRC:
> > +	case -EFSCORRUPTED:
> > +		error = 0;
> > +		/* fall through */
> > +	default:
> > +		goto err_value;
> 
> So here we can return with error = 0, but no actual extended
> attribute. Isn't this a silent failure?

We're trying to salvage whatever attributes are readable in the attr
fork, so if we can't retrieve a remote value then we don't bother
reconstructing it later.

Granted there ought to be /some/ notification, maybe it's time for a new
OFLAG that says we couldn't recover everything(?)

> > +	}
> > +
> > +out_ok:
> > +	list_add_tail(&key->list, rx->attrlist);
> > +	return 0;
> > +
> > +err_value:
> > +	kmem_free(key->value);
> > +err_key:
> > +	kmem_free(key);
> > +err:
> > +	return error;
> > +}
> > +
> > +/*
> > + * Record a local format extended attribute key & value for later reinsertion
> > + * into the inode.
> > + */
> > +static inline int
> > +xfs_repair_xattr_salvage_local_attr(
> > +	struct xfs_repair_xattr		*rx,
> > +	int				flags,
> > +	unsigned char			*name,
> > +	int				namelen,
> > +	unsigned char			*value,
> > +	int				valuelen)
> > +{
> > +	return __xfs_repair_xattr_salvage_attr(rx, NULL, flags, 0, name,
> > +			namelen, value, valuelen);
> > +}
> > +
> > +/*
> > + * Record a remote format extended attribute key & value for later reinsertion
> > + * into the inode.
> > + */
> > +static inline int
> > +xfs_repair_xattr_salvage_remote_attr(
> > +	struct xfs_repair_xattr		*rx,
> > +	int				flags,
> > +	unsigned char			*name,
> > +	int				namelen,
> > +	struct xfs_buf			*leaf_bp,
> > +	int				idx,
> > +	int				valuelen)
> > +{
> > +	return __xfs_repair_xattr_salvage_attr(rx, leaf_bp, flags, idx,
> > +			name, namelen, NULL, valuelen);
> > +}
> 
> Oh, this is why __xfs_repair_xattr_salvage_attr() has two completely
> separate sets of code in it. Can we factor this differently? i.e a
> helper function to do all the validity checking and key allocation,
> and then leave the local versus remote attr handling in these
> functions?

Ok.

> > +
> > +/* Extract every xattr key that we can from this attr fork block. */
> > +STATIC int
> > +xfs_repair_xattr_recover_leaf(
> > +	struct xfs_repair_xattr		*rx,
> > +	struct xfs_buf			*bp)
> > +{
> > +	struct xfs_attr3_icleaf_hdr	leafhdr;
> > +	struct xfs_scrub_context	*sc = rx->sc;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_attr_leafblock	*leaf;
> > +	unsigned long			*usedmap = sc->buf;
> > +	struct xfs_attr_leaf_name_local	*lentry;
> > +	struct xfs_attr_leaf_name_remote *rentry;
> > +	struct xfs_attr_leaf_entry	*ent;
> > +	struct xfs_attr_leaf_entry	*entries;
> > +	char				*buf_end;
> > +	char				*name;
> > +	char				*name_end;
> > +	char				*value;
> > +	size_t				off;
> > +	unsigned int			nameidx;
> > +	unsigned int			namesize;
> > +	unsigned int			hdrsize;
> > +	unsigned int			namelen;
> > +	unsigned int			valuelen;
> > +	int				i;
> > +	int				error;
> 
> Can we scope all these variables inside the blocks that use them?

I'll see if I can split this function up too.

> > +
> > +	bitmap_zero(usedmap, mp->m_attr_geo->blksize);
> > +
> > +	/* Check the leaf header */
> > +	leaf = bp->b_addr;
> > +	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
> > +	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
> > +	xfs_scrub_xattr_set_map(sc, usedmap, 0, hdrsize);
> > +	entries = xfs_attr3_leaf_entryp(leaf);
> > +
> > +	buf_end = (char *)bp->b_addr + mp->m_attr_geo->blksize;
> > +	for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) {
> > +		/* Skip key if it conflicts with something else? */
> > +		off = (char *)ent - (char *)leaf;
> > +		if (!xfs_scrub_xattr_set_map(sc, usedmap, off,
> > +				sizeof(xfs_attr_leaf_entry_t)))
> > +			continue;
> > +
> > +		/* Check the name information. */
> > +		nameidx = be16_to_cpu(ent->nameidx);
> > +		if (nameidx < leafhdr.firstused ||
> > +		    nameidx >= mp->m_attr_geo->blksize)
> > +			continue;
> > +
> > +		if (ent->flags & XFS_ATTR_LOCAL) {
> > +			lentry = xfs_attr3_leaf_name_local(leaf, i);
> > +			namesize = xfs_attr_leaf_entsize_local(lentry->namelen,
> > +					be16_to_cpu(lentry->valuelen));
> > +			name_end = (char *)lentry + namesize;
> > +			if (lentry->namelen == 0)
> > +				continue;
> > +			name = lentry->nameval;
> > +			namelen = lentry->namelen;
> > +			valuelen = be16_to_cpu(lentry->valuelen);
> > +			value = &name[namelen];
> 
> It seems cumbersome to do a bunch of special local/remote attr
> decoding into a set of semi-common variables, only to then pass the
> specific local/remote variables back to specific local/remote
> processing functions.
> 
> i.e. I'd prefer to see the attr decoding done inside the salvage
> function so this looks something like:
> 
> 	if (ent->flags & XFS_ATTR_LOCAL) {
> 		lentry = xfs_attr3_leaf_name_local(leaf, i);
> 		error = xfs_repair_xattr_salvage_local_attr(rx,
> 					lentry, ...);
> 	} else {
> 		rentry = xfs_attr3_leaf_name_remote(leaf, i);
> 		error = xfs_repair_xattr_salvage_local_attr(rx,
> 					rentry, ....);
> 	}
> 
> ......

Seems like it'd be better than the current mess. :0

> > +
> > +/* Try to recover shortform attrs. */
> > +STATIC int
> > +xfs_repair_xattr_recover_sf(
> > +	struct xfs_repair_xattr		*rx)
> > +{
> > +	struct xfs_attr_shortform	*sf;
> > +	struct xfs_attr_sf_entry	*sfe;
> > +	struct xfs_attr_sf_entry	*next;
> > +	struct xfs_ifork		*ifp;
> > +	unsigned char			*end;
> > +	int				i;
> > +	int				error;
> > +
> > +	ifp = XFS_IFORK_PTR(rx->sc->ip, XFS_ATTR_FORK);
> > +	sf = (struct xfs_attr_shortform *)rx->sc->ip->i_afp->if_u1.if_data;
> 
> 	sf = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
> 
> ....
> > +/*
> > + * Repair the extended attribute metadata.
> > + *
> > + * XXX: Remote attribute value buffers encompass the entire (up to 64k) buffer
> > + * and we can't handle those 100% until the buffer cache learns how to deal
> > + * with that.
> 
> I'm not sure what this comment means/implies.

It's another manifestation of the problem that the xfs_buf cache doesn't
do a very good job of handling multiblock xfs_bufs that alias another
xfs_buf that's already in the cache.  If the attr fork is crosslinked
with something else we'll have problems, but that's not fixed so
easily...

/*
 * XXX: Remote attribute value buffers encompass the entire (up to 64k)
 * buffer.  The buffer cache in XFS can't handle aliased multiblock
 * buffers, so this might misbehave if the attr fork is crosslinked with
 * other filesystem metadata.
 */

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 19/21] xfs: repair quotas
  2018-07-06  1:50   ` Dave Chinner
@ 2018-07-06  3:16     ` Darrick J. Wong
  0 siblings, 0 replies; 77+ messages in thread
From: Darrick J. Wong @ 2018-07-06  3:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 06, 2018 at 11:50:25AM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:25:35PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Fix anything that causes the quota verifiers to fail.
> > +/* Fix anything the verifiers complain about. */
> > +STATIC int
> > +xfs_repair_quota_block(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_buf			*bp,
> > +	uint				dqtype,
> > +	xfs_dqid_t			id)
> > +{
> > +	struct xfs_dqblk		*d = (struct xfs_dqblk *)bp->b_addr;
> > +	struct xfs_disk_dquot		*ddq;
> > +	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
> > +	enum xfs_blft			buftype = 0;
> > +	int				i;
> > +
> > +	bp->b_ops = &xfs_dquot_buf_ops;
> > +	for (i = 0; i < qi->qi_dqperchunk; i++) {
> > +		ddq = &d[i].dd_diskdq;
> > +
> > +		ddq->d_magic = cpu_to_be16(XFS_DQUOT_MAGIC);
> > +		ddq->d_version = XFS_DQUOT_VERSION;
> > +		ddq->d_flags = dqtype;
> > +		ddq->d_id = cpu_to_be32(id + i);
> > +
> > +		xfs_repair_quota_fix_timer(ddq->d_blk_softlimit,
> > +				ddq->d_bcount, &ddq->d_btimer,
> > +				qi->qi_btimelimit);
> > +		xfs_repair_quota_fix_timer(ddq->d_ino_softlimit,
> > +				ddq->d_icount, &ddq->d_itimer,
> > +				qi->qi_itimelimit);
> > +		xfs_repair_quota_fix_timer(ddq->d_rtb_softlimit,
> > +				ddq->d_rtbcount, &ddq->d_rtbtimer,
> > +				qi->qi_rtbtimelimit);
> > +
> > +		if (xfs_sb_version_hascrc(&sc->mp->m_sb)) {
> > +			uuid_copy(&d->dd_uuid, &sc->mp->m_sb.sb_meta_uuid);
> > +			xfs_update_cksum((char *)d, sizeof(struct xfs_dqblk),
> > +					 XFS_DQUOT_CRC_OFF);
> 
> Hmmm - Do we need to reset the lsn here, too? 

Probably, but...

> What makes me think - do we check that the lsn is valid in verifiers
> and so catch/fix problems where the lsn is outside the valid range
> of the current log LSNs?

...while we use them to skip old dquot logs during log recovery, we
don't check them otherwise AFAICT.

> > +STATIC int
> > +xfs_repair_quota_data_fork(
> > +	struct xfs_scrub_context	*sc,
> > +	uint				dqtype)
> > +{
> > +	struct xfs_bmbt_irec		irec = { 0 };
> > +	struct xfs_iext_cursor		icur;
> > +	struct xfs_quotainfo		*qi = sc->mp->m_quotainfo;
> > +	struct xfs_ifork		*ifp;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_dqblk		*d;
> > +	xfs_dqid_t			id;
> > +	xfs_fileoff_t			max_dqid_off;
> > +	xfs_fileoff_t			off;
> > +	xfs_fsblock_t			fsbno;
> > +	bool				truncate = false;
> > +	int				error = 0;
> > +
> > +	error = xfs_repair_metadata_inode_forks(sc);
> > +	if (error)
> > +		goto out;
> > +
> > +	/* Check for data fork problems that apply only to quota files. */
> > +	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
> > +	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
> > +	for_each_xfs_iext(ifp, &icur, &irec) {
> > +		if (isnullstartblock(irec.br_startblock)) {
> > +			error = -EFSCORRUPTED;
> > +			goto out;
> > +		}
> > +
> > +		if (irec.br_startoff > max_dqid_off ||
> > +		    irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) {
> > +			truncate = true;
> > +			break;
> > +		}
> > +	}
> > +	if (truncate) {
> > +		error = xfs_itruncate_extents(&sc->tp, sc->ip, XFS_DATA_FORK,
> > +				max_dqid_off * sc->mp->m_sb.sb_blocksize);
> > +		if (error)
> > +			goto out;
> > +	}
> > +
> > +	/* Now go fix anything that fails the verifiers. */
> > +	for_each_xfs_iext(ifp, &icur, &irec) {
> > +		for (fsbno = irec.br_startblock, off = irec.br_startoff;
> > +		     fsbno < irec.br_startblock + irec.br_blockcount;
> > +		     fsbno += XFS_DQUOT_CLUSTER_SIZE_FSB,
> > +				off += XFS_DQUOT_CLUSTER_SIZE_FSB) {
> > +			id = off * qi->qi_dqperchunk;
> > +			error = xfs_trans_read_buf(sc->mp, sc->tp,
> > +					sc->mp->m_ddev_targp,
> > +					XFS_FSB_TO_DADDR(sc->mp, fsbno),
> > +					qi->qi_dqchunklen,
> > +					0, &bp, &xfs_dquot_buf_ops);
> > +			if (error == 0) {
> > +				d = (struct xfs_dqblk *)bp->b_addr;
> > +				if (id == be32_to_cpu(d->dd_diskdq.d_id))
> > +					continue;
> 
> Need to release the buffer here - it's clean and passes the
> verifier, so no need to hold on to them as we may have thousands of
> them to walk here.

Ok.

> Otherwise looks ok.

Thank you for the review!

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2018-07-06  3:16 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-24 19:23 [PATCH v16 00/21] xfs-4.19: online repair support Darrick J. Wong
2018-06-24 19:23 ` [PATCH 01/21] xfs: don't assume a left rmap when allocating a new rmap Darrick J. Wong
2018-06-27  0:54   ` Dave Chinner
2018-06-28 21:11   ` Allison Henderson
2018-06-29 14:39     ` Darrick J. Wong
2018-06-24 19:23 ` [PATCH 02/21] xfs: add helper to decide if an inode has allocated cow blocks Darrick J. Wong
2018-06-27  1:02   ` Dave Chinner
2018-06-28 21:12   ` Allison Henderson
2018-06-24 19:23 ` [PATCH 03/21] xfs: refactor part of xfs_free_eofblocks Darrick J. Wong
2018-06-28 21:13   ` Allison Henderson
2018-06-24 19:23 ` [PATCH 04/21] xfs: repair the AGF and AGFL Darrick J. Wong
2018-06-27  2:19   ` Dave Chinner
2018-06-27 16:44     ` Allison Henderson
2018-06-27 23:37       ` Dave Chinner
2018-06-29 15:14         ` Darrick J. Wong
2018-06-28 17:25     ` Allison Henderson
2018-06-29 15:08       ` Darrick J. Wong
2018-06-28 21:14   ` Allison Henderson
2018-06-28 23:21     ` Dave Chinner
2018-06-29  1:35       ` Allison Henderson
2018-06-29 14:55         ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 05/21] xfs: repair the AGI Darrick J. Wong
2018-06-27  2:22   ` Dave Chinner
2018-06-28 21:15   ` Allison Henderson
2018-06-24 19:24 ` [PATCH 06/21] xfs: repair free space btrees Darrick J. Wong
2018-06-27  3:21   ` Dave Chinner
2018-07-04  2:15     ` Darrick J. Wong
2018-07-04  2:25       ` Dave Chinner
2018-06-30 17:36   ` Allison Henderson
2018-06-24 19:24 ` [PATCH 07/21] xfs: repair inode btrees Darrick J. Wong
2018-06-28  0:55   ` Dave Chinner
2018-07-04  2:22     ` Darrick J. Wong
2018-06-30 17:36   ` Allison Henderson
2018-06-30 18:30     ` Darrick J. Wong
2018-07-01  0:45       ` Allison Henderson
2018-06-24 19:24 ` [PATCH 08/21] xfs: defer iput on certain inodes while scrub / repair are running Darrick J. Wong
2018-06-28 23:37   ` Dave Chinner
2018-06-29 14:49     ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 09/21] xfs: finish our set of inode get/put tracepoints for scrub Darrick J. Wong
2018-06-24 19:24 ` [PATCH 10/21] xfs: introduce online scrub freeze Darrick J. Wong
2018-06-24 19:24 ` [PATCH 11/21] xfs: repair the rmapbt Darrick J. Wong
2018-07-03  5:32   ` Dave Chinner
2018-07-03 23:59     ` Darrick J. Wong
2018-07-04  8:44       ` Carlos Maiolino
2018-07-04 18:40         ` Darrick J. Wong
2018-07-04 23:21       ` Dave Chinner
2018-07-05  3:48         ` Darrick J. Wong
2018-07-05  7:03           ` Dave Chinner
2018-07-06  0:47             ` Darrick J. Wong
2018-07-06  1:08               ` Dave Chinner
2018-06-24 19:24 ` [PATCH 12/21] xfs: repair refcount btrees Darrick J. Wong
2018-07-03  5:50   ` Dave Chinner
2018-07-04  2:23     ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 13/21] xfs: repair inode records Darrick J. Wong
2018-07-03  6:17   ` Dave Chinner
2018-07-04  0:16     ` Darrick J. Wong
2018-07-04  1:03       ` Dave Chinner
2018-07-04  1:30         ` Darrick J. Wong
2018-06-24 19:24 ` [PATCH 14/21] xfs: zap broken inode forks Darrick J. Wong
2018-07-04  2:07   ` Dave Chinner
2018-07-04  3:26     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 15/21] xfs: repair inode block maps Darrick J. Wong
2018-07-04  3:00   ` Dave Chinner
2018-07-04  3:41     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 16/21] xfs: repair damaged symlinks Darrick J. Wong
2018-07-04  5:45   ` Dave Chinner
2018-07-04 18:45     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 17/21] xfs: repair extended attributes Darrick J. Wong
2018-07-06  1:03   ` Dave Chinner
2018-07-06  3:10     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 18/21] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
2018-06-29  2:52   ` Dave Chinner
2018-06-24 19:25 ` [PATCH 19/21] xfs: repair quotas Darrick J. Wong
2018-07-06  1:50   ` Dave Chinner
2018-07-06  3:16     ` Darrick J. Wong
2018-06-24 19:25 ` [PATCH 20/21] xfs: implement live quotacheck as part of quota repair Darrick J. Wong
2018-06-24 19:25 ` [PATCH 21/21] xfs: add online scrub/repair for superblock counters Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.