* [PATCH 0/3] xfs: scrub filesystem summary counters
@ 2019-04-17  1:40 Darrick J. Wong
  2019-04-17  1:40 ` [PATCH 1/3] xfs: track delayed allocation reservations across Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-17  1:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This patch series introduces a totally new filesystem summary counter
online scrub feature.  Whereas previous iterations froze the filesystem
to count inodes and free blocks, this one drastically reduces overhead
by loosening its precision somewhat.  Instead of freezing the fs, we
race the expected summary counter calculation with normal fs operations
and use thresholding to check that the counters are in the ballpark,
where ballpark means "off by less than 6% or so".

The first patch implements a per-cpu counter of the number of blocks
being held in delayed allocation reservations.  This should represent
the difference between the incore fdblocks counter and the one that
would be recorded on disk if one were to iterate all the committed
metadata structures.

Patch #2 enables the scrub code to pause the posteof and cowblocks
background reclamation workers temporarily to reduce perturbations in
the summary counters while the scrubber runs.  This isn't totally
foolproof since they can be re-armed, but we only need ballpark
correctness.

Finally, patch #3 implements the actual fs summary counter scrubber
code.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-summary-counters

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-summary-counters

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] xfs: track delayed allocation reservations across
  2019-04-17  1:40 [PATCH 0/3] xfs: scrub filesystem summary counters Darrick J. Wong
@ 2019-04-17  1:40 ` Darrick J. Wong
  2019-04-17 21:40   ` Dave Chinner
  2019-04-17  1:40 ` [PATCH 2/3] xfs: allow scrubbers to pause background reclaim Darrick J. Wong
  2019-04-17  1:40 ` [PATCH 3/3] xfs: add online scrub/repair for superblock counters Darrick J. Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-17  1:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a percpu counter to track the number of blocks directly reserved for
delayed allocations on the data device.  This counter (in contrast to
i_delayed_blks) does not track allocated CoW staging extents or anything
going on with the realtime device.  It will be used in the upcoming
summary counter scrub function to check the free block counts without
having to freeze the filesystem or walk all the inodes to find the
delayed allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   17 ++++++++++++++---
 fs/xfs/xfs_mount.h       |   15 +++++++++++++++
 fs/xfs/xfs_super.c       |    9 +++++++++
 3 files changed, 38 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4637ae1ae91c..356ebd1cbe82 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2009,6 +2009,9 @@ xfs_bmap_add_extent_delay_real(
 			goto done;
 	}
 
+	if (da_new != da_old)
+		xfs_mod_delalloc(mp, (int64_t)da_new - da_old);
+
 	if (bma->cur) {
 		da_new += bma->cur->bc_private.b.allocated;
 		bma->cur->bc_private.b.allocated = 0;
@@ -2640,6 +2643,7 @@ xfs_bmap_add_extent_hole_delay(
 		/*
 		 * Nothing to do for disk quota accounting here.
 		 */
+		xfs_mod_delalloc(ip->i_mount, (int64_t)newlen - oldlen);
 	}
 }
 
@@ -3352,8 +3356,10 @@ xfs_bmap_btalloc_accounting(
 		 * already have quota reservation and there's nothing to do
 		 * yet.
 		 */
-		if (ap->wasdel)
+		if (ap->wasdel) {
+			xfs_mod_delalloc(ap->ip->i_mount, -(int64_t)args->len);
 			return;
+		}
 
 		/*
 		 * Otherwise, we've allocated blocks in a hole. The transaction
@@ -3372,8 +3378,10 @@ xfs_bmap_btalloc_accounting(
 	/* data/attr fork only */
 	ap->ip->i_d.di_nblocks += args->len;
 	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
-	if (ap->wasdel)
+	if (ap->wasdel) {
 		ap->ip->i_delayed_blks -= args->len;
+		xfs_mod_delalloc(ap->ip->i_mount, -(int64_t)args->len);
+	}
 	xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
 		ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT : XFS_TRANS_DQ_BCOUNT,
 		args->len);
@@ -3969,6 +3977,7 @@ xfs_bmapi_reserve_delalloc(
 
 
 	ip->i_delayed_blks += alen;
+	xfs_mod_delalloc(ip->i_mount, alen + indlen);
 
 	got->br_startoff = aoff;
 	got->br_startblock = nullstartblock(indlen);
@@ -4840,8 +4849,10 @@ xfs_bmap_del_extent_delay(
 	da_diff = da_old - da_new;
 	if (!isrt)
 		da_diff += del->br_blockcount;
-	if (da_diff)
+	if (da_diff) {
 		xfs_mod_fdblocks(mp, da_diff, false);
+		xfs_mod_delalloc(mp, -da_diff);
+	}
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 14fba76ab811..f5cf4eb22d8e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -81,6 +81,12 @@ typedef struct xfs_mount {
 	struct percpu_counter	m_icount;	/* allocated inodes counter */
 	struct percpu_counter	m_ifree;	/* free inodes counter */
 	struct percpu_counter	m_fdblocks;	/* free block counter */
+	/*
+	 * Count of data device blocks reserved for delayed allocations,
+	 * including indlen blocks.  Does not include allocated CoW staging
+	 * extents or anything related to the rt device.
+	 */
+	struct percpu_counter	m_delalloc_blks;
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_fsname;	/* filesystem name */
@@ -476,4 +482,13 @@ struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
 		int error_class, int error);
 void xfs_force_summary_recalc(struct xfs_mount *mp);
 
+/* Update the in-core delayed block counter. */
+static inline void
+xfs_mod_delalloc(
+	struct xfs_mount	*mp,
+	int64_t			delta)
+{
+	percpu_counter_add(&mp->m_delalloc_blks, delta);
+}
+
 #endif	/* __XFS_MOUNT_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index df917f41ca46..3a870a194cd9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1538,8 +1538,14 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_ifree;
 
+	error = percpu_counter_init(&mp->m_delalloc_blks, 0, GFP_KERNEL);
+	if (error)
+		goto free_fdblocks;
+
 	return 0;
 
+free_fdblocks:
+	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
 	percpu_counter_destroy(&mp->m_ifree);
 free_icount:
@@ -1563,6 +1569,9 @@ xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_icount);
 	percpu_counter_destroy(&mp->m_ifree);
 	percpu_counter_destroy(&mp->m_fdblocks);
+	ASSERT(XFS_FORCED_SHUTDOWN(mp) ||
+	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
+	percpu_counter_destroy(&mp->m_delalloc_blks);
 }
 
 static struct xfs_mount *

* [PATCH 2/3] xfs: allow scrubbers to pause background reclaim
  2019-04-17  1:40 [PATCH 0/3] xfs: scrub filesystem summary counters Darrick J. Wong
  2019-04-17  1:40 ` [PATCH 1/3] xfs: track delayed allocation reservations across Darrick J. Wong
@ 2019-04-17  1:40 ` Darrick J. Wong
  2019-04-17 21:52   ` Dave Chinner
  2019-04-17  1:40 ` [PATCH 3/3] xfs: add online scrub/repair for superblock counters Darrick J. Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-17  1:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

The forthcoming summary counter patch races with regular filesystem
activity to compute rough expected values for the counters.  This design
was chosen to avoid having to freeze the entire filesystem to check the
counters, but while that's running we'd prefer to minimize background
reclamation activity to reduce the perturbations to the incore free
block count.  Therefore, provide a way for scrubbers to disable
background posteof and cowblock reclamation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/common.c |   18 ++++++++++++++++++
 fs/xfs/scrub/common.h |    2 ++
 fs/xfs/scrub/scrub.c  |    2 ++
 fs/xfs/scrub/scrub.h  |    1 +
 4 files changed, 23 insertions(+)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 7076d5c98151..a406a22a734f 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -894,3 +894,21 @@ xchk_ilock_inverted(
 	}
 	return -EDEADLOCK;
 }
+
+/* Pause background reclamation and inactivation. */
+void
+xchk_disable_reclaim(
+	struct xfs_scrub	*sc)
+{
+	sc->flags |= XCHK_RECLAIM_DISABLED;
+	xfs_icache_disable_reclaim(sc->mp);
+}
+
+/* Unpause background reclamation and inactivation. */
+void
+xchk_enable_reclaim(
+	struct xfs_scrub	*sc)
+{
+	xfs_icache_enable_reclaim(sc->mp);
+	sc->flags &= ~XCHK_RECLAIM_DISABLED;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index e26a430bd466..2288f45a5606 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -137,5 +137,7 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm)
 
 int xchk_metadata_inode_forks(struct xfs_scrub *sc);
 int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
+void xchk_disable_reclaim(struct xfs_scrub *sc);
+void xchk_enable_reclaim(struct xfs_scrub *sc);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 93b0f075a4d3..421c22a0bf39 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -187,6 +187,8 @@ xchk_teardown(
 			xfs_irele(sc->ip);
 		sc->ip = NULL;
 	}
+	if (sc->flags & XCHK_RECLAIM_DISABLED)
+		xchk_enable_reclaim(sc);
 	if (sc->flags & XCHK_HAS_QUOTAOFFLOCK) {
 		mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
 		sc->flags &= ~XCHK_HAS_QUOTAOFFLOCK;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 1b280f8f185a..1f6de7bbb9f5 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -80,6 +80,7 @@ struct xfs_scrub {
 /* XCHK state flags grow up from zero, XREP state flags grow down from 2^31 */
 #define XCHK_TRY_HARDER		(1 << 0)  /* can't get resources, try again */
 #define XCHK_HAS_QUOTAOFFLOCK	(1 << 1)  /* we hold the quotaoff lock */
+#define XCHK_RECLAIM_DISABLED	(1 << 2)  /* background reclaim is paused */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
 /* Metadata scrubbers */

* [PATCH 3/3] xfs: add online scrub/repair for superblock counters
  2019-04-17  1:40 [PATCH 0/3] xfs: scrub filesystem summary counters Darrick J. Wong
  2019-04-17  1:40 ` [PATCH 1/3] xfs: track delayed allocation reservations across Darrick J. Wong
  2019-04-17  1:40 ` [PATCH 2/3] xfs: allow scrubbers to pause background reclaim Darrick J. Wong
@ 2019-04-17  1:40 ` Darrick J. Wong
  2019-04-17 22:30   ` Dave Chinner
  2 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-17  1:40 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Teach online scrub and repair how to check and reset the superblock
inode and block counters.  The AG rebuilding functions will need these
to adjust the counts if they need to change as a part of recovering from
corruption.  We must use the repair freeze mechanism to prevent any
other changes while we do this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile           |    1 
 fs/xfs/libxfs/xfs_fs.h    |    3 -
 fs/xfs/scrub/common.h     |    1 
 fs/xfs/scrub/fscounters.c |  229 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/health.c     |    1 
 fs/xfs/scrub/scrub.c      |    6 +
 fs/xfs/scrub/scrub.h      |    7 +
 fs/xfs/scrub/trace.h      |   48 +++++++++
 8 files changed, 294 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/fscounters.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b20964e26a22..1dfc6df2e2bd 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -143,6 +143,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   common.o \
 				   dabtree.o \
 				   dir.o \
+				   fscounters.o \
 				   health.o \
 				   ialloc.o \
 				   inode.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 43a53b03247b..e7382c780ed7 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -578,9 +578,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_UQUOTA	21	/* user quotas */
 #define XFS_SCRUB_TYPE_GQUOTA	22	/* group quotas */
 #define XFS_SCRUB_TYPE_PQUOTA	23	/* project quotas */
+#define XFS_SCRUB_TYPE_FSCOUNTERS 24	/* fs summary counters */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	24
+#define XFS_SCRUB_TYPE_NR	25
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 2288f45a5606..7de945eace00 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -105,6 +105,7 @@ xchk_setup_quota(struct xfs_scrub *sc, struct xfs_inode *ip)
 	return -ENOENT;
 }
 #endif
+int xchk_setup_fscounters(struct xfs_scrub *sc, struct xfs_inode *ip);
 
 void xchk_ag_free(struct xfs_scrub *sc, struct xchk_ag *sa);
 int xchk_ag_init(struct xfs_scrub *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
new file mode 100644
index 000000000000..c809213d8cfe
--- /dev/null
+++ b/fs/xfs/scrub/fscounters.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
+#include "xfs_icache.h"
+#include "xfs_health.h"
+#include "xfs_bmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * FS Summary Counters
+ * ===================
+ *
+ * The basics of filesystem summary counter checking are that we iterate the
+ * AGs counting the number of free blocks, free space btree blocks, per-AG
+ * reservations, inodes, delayed allocation reservations, and free inodes.
+ * Then we compare what we computed against the in-core counters.
+ *
+ * However, the reality is that summary counters are a tricky beast to check.
+ * While we /could/ freeze the filesystem and scramble around the AGs counting
+ * the free blocks, in practice we prefer not do that for a scan because
+ * freezing is costly.  To get around this, we added a per-cpu counter of the
+ * delalloc reservations so that we can rotor around the AGs relatively
+ * quickly, and we allow the counts to be slightly off because we're not
+ * taking any locks while we do this.
+ */
+
+int
+xchk_setup_fscounters(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip)
+{
+	sc->buf = kmem_zalloc(sizeof(struct xchk_fscounters), KM_SLEEP);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	/*
+	 * Pause background reclaim while we're scrubbing to reduce the
+	 * likelihood of background perturbations to the counters throwing
+	 * off our calculations.
+	 */
+	xchk_disable_reclaim(sc);
+
+	return xchk_trans_alloc(sc, 0);
+}
+
+/*
+ * Calculate what the global in-core counters ought to be from the AG header
+ * contents.  Callers can compare this to the actual in-core counters to
+ * calculate by how much both in-core and on-disk counters need to be
+ * adjusted.
+ */
+STATIC int
+xchk_fscounters_calc(
+	struct xfs_scrub	*sc,
+	struct xchk_fscounters	*fsc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agi_bp;
+	struct xfs_buf		*agf_bp;
+	struct xfs_agi		*agi;
+	struct xfs_agf		*agf;
+	struct xfs_perag	*pag;
+	uint64_t		delayed;
+	xfs_agnumber_t		agno;
+	int			error;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		/* Lock both AG headers. */
+		error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
+		if (error)
+			return error;
+		error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, &agf_bp);
+		if (error)
+			return error;
+		if (!agf_bp)
+			return -ENOMEM;
+
+		/* Count all the inodes */
+		agi = XFS_BUF_TO_AGI(agi_bp);
+		fsc->icount += be32_to_cpu(agi->agi_count);
+		fsc->ifree += be32_to_cpu(agi->agi_freecount);
+
+		/* Add up the free/freelist/bnobt/cntbt blocks */
+		agf = XFS_BUF_TO_AGF(agf_bp);
+		fsc->fdblocks += be32_to_cpu(agf->agf_freeblks);
+		fsc->fdblocks += be32_to_cpu(agf->agf_flcount);
+		fsc->fdblocks += be32_to_cpu(agf->agf_btreeblks);
+
+		/*
+		 * Per-AG reservations are taken out of the incore counters,
+		 * so they must be left out of the free blocks computation.
+		 */
+		pag = xfs_perag_get(mp, agno);
+		fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
+		fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
+		xfs_perag_put(pag);
+
+		xfs_trans_brelse(sc->tp, agf_bp);
+		xfs_trans_brelse(sc->tp, agi_bp);
+	}
+
+	/*
+	 * The global incore space reservation is taken from the incore
+	 * counters, so leave that out of the computation.
+	 */
+	fsc->fdblocks -= mp->m_resblks_avail;
+
+	/*
+	 * Delayed allocation reservations are taken out of the incore counters
+	 * but not recorded on disk, so leave them and their indlen blocks out
+	 * of the computation.
+	 */
+	delayed = percpu_counter_sum(&mp->m_delalloc_blks);
+	fsc->fdblocks -= delayed;
+
+	trace_xchk_fscounters_calc(mp, fsc->icount, fsc->ifree, fsc->fdblocks,
+			delayed);
+
+	/* Bail out if the values we compute are totally nonsense. */
+	if (!xfs_verify_icount(mp, fsc->icount) ||
+	    fsc->fdblocks > mp->m_sb.sb_dblocks ||
+	    fsc->ifree > fsc->icount)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/*
+ * Is the @counter within an acceptable range of @expected?
+ *
+ * Currently that means the larger of 1/16th (~6%) of @expected or @nr_range.
+ */
+static inline bool
+xchk_fscounter_within_range(
+	struct xfs_scrub	*sc,
+	struct percpu_counter	*counter,
+	uint64_t		expected,
+	uint64_t		nr_range)
+{
+	int64_t			value = percpu_counter_sum(counter);
+	uint64_t		range;
+
+	range = max_t(uint64_t, expected >> 4, nr_range);
+	if (value < 0)
+		return false;
+	if (range < expected && value < expected - range)
+		return false;
+	if ((int64_t)(expected + range) >= 0 && value > expected + range)
+		return false;
+	return true;
+}
+
+/* Check the superblock counters. */
+int
+xchk_fscounters(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xchk_fscounters	*fsc = sc->buf;
+	int64_t			icount, ifree, fdblocks;
+	int			error;
+
+	icount = percpu_counter_sum(&sc->mp->m_icount);
+	ifree = percpu_counter_sum(&sc->mp->m_ifree);
+	fdblocks = percpu_counter_sum(&sc->mp->m_fdblocks);
+
+	if (icount < 0 || ifree < 0 || fdblocks < 0)
+		xchk_block_set_corrupt(sc, mp->m_sb_bp);
+
+	/* See if icount is obviously wrong. */
+	if (!xfs_verify_icount(mp, icount))
+		xchk_block_set_corrupt(sc, mp->m_sb_bp);
+
+	/* See if fdblocks / ifree are obviously wrong. */
+	if (fdblocks > mp->m_sb.sb_dblocks)
+		xchk_block_set_corrupt(sc, mp->m_sb_bp);
+	if (ifree > icount)
+		xchk_block_set_corrupt(sc, mp->m_sb_bp);
+
+	/* If we already know it's bad, we can skip the AG iteration. */
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return 0;
+
+	/* Counters seem ok, but let's count them. */
+	error = xchk_fscounters_calc(sc, fsc);
+	if (!xchk_process_error(sc, 0, XFS_SB_BLOCK(sc->mp), &error))
+		return error;
+
+	/*
+	 * Compare the in-core counters with whatever we counted.  We'll
+	 * consider the inode counts ok if they're within 1024 inodes, and the
+	 * free block counts if they're within 1/64th of the filesystem size.
+	 */
+	if (!xchk_fscounter_within_range(sc, &mp->m_icount, fsc->icount, 1024))
+		xchk_block_set_corrupt(sc, mp->m_sb_bp);
+
+	if (!xchk_fscounter_within_range(sc, &mp->m_ifree, fsc->ifree, 1024))
+		xchk_block_set_corrupt(sc, mp->m_sb_bp);
+
+	if (!xchk_fscounter_within_range(sc, &mp->m_fdblocks, fsc->fdblocks,
+			mp->m_sb.sb_dblocks >> 6))
+		xchk_block_set_corrupt(sc, mp->m_sb_bp);
+
+	return 0;
+}
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index 16b536aa125e..23cf8e2f25db 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -109,6 +109,7 @@ static const struct xchk_health_map type_to_health_flag[XFS_SCRUB_TYPE_NR] = {
 	[XFS_SCRUB_TYPE_UQUOTA]		= { XHG_FS,  XFS_SICK_FS_UQUOTA },
 	[XFS_SCRUB_TYPE_GQUOTA]		= { XHG_FS,  XFS_SICK_FS_GQUOTA },
 	[XFS_SCRUB_TYPE_PQUOTA]		= { XHG_FS,  XFS_SICK_FS_PQUOTA },
+	[XFS_SCRUB_TYPE_FSCOUNTERS]	= { XHG_FS,  XFS_SICK_FS_COUNTERS },
 };
 
 /* Return the health status mask for this scrub type. */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 421c22a0bf39..4d5d00d35ef7 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -352,6 +352,12 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.scrub	= xchk_quota,
 		.repair	= xrep_notsupported,
 	},
+	[XFS_SCRUB_TYPE_FSCOUNTERS] = {	/* fs summary counters */
+		.type	= ST_FS,
+		.setup	= xchk_setup_fscounters,
+		.scrub	= xchk_fscounters,
+		.repair	= xrep_notsupported,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 1f6de7bbb9f5..caa90ea5a22e 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -127,6 +127,7 @@ xchk_quota(struct xfs_scrub *sc)
 	return -ENOENT;
 }
 #endif
+int xchk_fscounters(struct xfs_scrub *sc);
 
 /* cross-referencing helpers */
 void xchk_xref_is_used_space(struct xfs_scrub *sc, xfs_agblock_t agbno,
@@ -152,4 +153,10 @@ void xchk_xref_is_used_rt_space(struct xfs_scrub *sc, xfs_rtblock_t rtbno,
 # define xchk_xref_is_used_rt_space(sc, rtbno, len) do { } while (0)
 #endif
 
+struct xchk_fscounters {
+	uint64_t		icount;
+	uint64_t		ifree;
+	uint64_t		fdblocks;
+};
+
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 3c83e8b3b39c..7120aee4a506 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -50,6 +50,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_RTSUM);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_UQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_GQUOTA);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PQUOTA);
+TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 
 #define XFS_SCRUB_TYPE_STRINGS \
 	{ XFS_SCRUB_TYPE_PROBE,		"probe" }, \
@@ -75,7 +76,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PQUOTA);
 	{ XFS_SCRUB_TYPE_RTSUM,		"rtsummary" }, \
 	{ XFS_SCRUB_TYPE_UQUOTA,	"usrquota" }, \
 	{ XFS_SCRUB_TYPE_GQUOTA,	"grpquota" }, \
-	{ XFS_SCRUB_TYPE_PQUOTA,	"prjquota" }
+	{ XFS_SCRUB_TYPE_PQUOTA,	"prjquota" }, \
+	{ XFS_SCRUB_TYPE_FSCOUNTERS,	"fscounters" }
 
 DECLARE_EVENT_CLASS(xchk_class,
 	TP_PROTO(struct xfs_inode *ip, struct xfs_scrub_metadata *sm,
@@ -590,6 +592,50 @@ TRACE_EVENT(xchk_iallocbt_check_cluster,
 		  __entry->cluster_ino)
 )
 
+TRACE_EVENT(xchk_fscounters_calc,
+	TP_PROTO(struct xfs_mount *mp, uint64_t icount, uint64_t ifree,
+		 uint64_t fdblocks, uint64_t delalloc),
+	TP_ARGS(mp, icount, ifree, fdblocks, delalloc),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int64_t, icount_sb)
+		__field(int64_t, icount_percpu)
+		__field(uint64_t, icount_calculated)
+		__field(int64_t, ifree_sb)
+		__field(int64_t, ifree_percpu)
+		__field(uint64_t, ifree_calculated)
+		__field(int64_t, fdblocks_sb)
+		__field(int64_t, fdblocks_percpu)
+		__field(uint64_t, fdblocks_calculated)
+		__field(uint64_t, delalloc)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->icount_sb = mp->m_sb.sb_icount;
+		__entry->icount_percpu = percpu_counter_sum(&mp->m_icount);
+		__entry->icount_calculated = icount;
+		__entry->ifree_sb = mp->m_sb.sb_ifree;
+		__entry->ifree_percpu = percpu_counter_sum(&mp->m_ifree);
+		__entry->ifree_calculated = ifree;
+		__entry->fdblocks_sb = mp->m_sb.sb_fdblocks;
+		__entry->fdblocks_percpu = percpu_counter_sum(&mp->m_fdblocks);
+		__entry->fdblocks_calculated = fdblocks;
+		__entry->delalloc = delalloc;
+	),
+	TP_printk("dev %d:%d icount %lld:%lld:%llu ifree %lld:%lld:%llu fdblocks %lld:%lld:%llu delalloc %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->icount_sb,
+		  __entry->icount_percpu,
+		  __entry->icount_calculated,
+		  __entry->ifree_sb,
+		  __entry->ifree_percpu,
+		  __entry->ifree_calculated,
+		  __entry->fdblocks_sb,
+		  __entry->fdblocks_percpu,
+		  __entry->fdblocks_calculated,
+		  __entry->delalloc)
+)
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 

* Re: [PATCH 1/3] xfs: track delayed allocation reservations across
  2019-04-17  1:40 ` [PATCH 1/3] xfs: track delayed allocation reservations across Darrick J. Wong
@ 2019-04-17 21:40   ` Dave Chinner
  2019-04-18  0:07     ` Darrick J. Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2019-04-17 21:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 16, 2019 at 06:40:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Add a percpu counter to track the number of blocks directly reserved for
> delayed allocations on the data device.  This counter (in contrast to
> i_delayed_blks) does not track allocated CoW staging extents or anything
> going on with the realtime device.  It will be used in the upcoming
> summary counter scrub function to check the free block counts without
> having to freeze the filesystem or walk all the inodes to find the
> delayed allocations.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Hmmm. Given that a lot of modifications to this counter are going to
be large quantities (much greater than the percpu counter batch size
of 32) this effectively just degrades to a spin-lock-protected global
counter, yes? This is the reason we have XFS_FDBLOCKS_BATCH (1024)
for the free block percpu counter so that changes from frequent
updates via lots of smallish write()s don't need to take the global
aggregation lock very often...

So I'm guessing that this counter needs similar treatment for
scalability on large machines. If it doesn't then an atomic64_t
is all that is necessary here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/3] xfs: allow scrubbers to pause background reclaim
  2019-04-17  1:40 ` [PATCH 2/3] xfs: allow scrubbers to pause background reclaim Darrick J. Wong
@ 2019-04-17 21:52   ` Dave Chinner
  2019-04-17 22:29     ` Darrick J. Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2019-04-17 21:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 16, 2019 at 06:40:12PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The forthcoming summary counter patch races with regular filesystem
> activity to compute rough expected values for the counters.  This design
> was chosen to avoid having to freeze the entire filesystem to check the
> counters, but while that's running we'd prefer to minimize background
> reclamation activity to reduce the perturbations to the incore free
> block count.  Therefore, provide a way for scrubbers to disable
> background posteof and cowblock reclamation.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/common.c |   18 ++++++++++++++++++
>  fs/xfs/scrub/common.h |    2 ++
>  fs/xfs/scrub/scrub.c  |    2 ++
>  fs/xfs/scrub/scrub.h  |    1 +
>  4 files changed, 23 insertions(+)
> 
> 
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index 7076d5c98151..a406a22a734f 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -894,3 +894,21 @@ xchk_ilock_inverted(
>  	}
>  	return -EDEADLOCK;
>  }
> +
> +/* Pause background reclamation and inactivation. */
> +void
> +xchk_disable_reclaim(
> +	struct xfs_scrub	*sc)
> +{
> +	sc->flags |= XCHK_RECLAIM_DISABLED;
> +	xfs_icache_disable_reclaim(sc->mp);
> +}

Hmmm. Poorly named function.  "reclaim" in the context of
xfs_icache.c means inode cache reclaim...

We're not actually disabling inode reclaim here; we are
disabling background block reaping.

Can we change this all to be named more appropriately (including the
icache.c functions)? Then it is all fine because I don't look at
it and think "that'll break under memory pressure"....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/3] xfs: allow scrubbers to pause background reclaim
  2019-04-17 21:52   ` Dave Chinner
@ 2019-04-17 22:29     ` Darrick J. Wong
  2019-04-17 22:45       ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-17 22:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Apr 18, 2019 at 07:52:28AM +1000, Dave Chinner wrote:
> On Tue, Apr 16, 2019 at 06:40:12PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > The forthcoming summary counter patch races with regular filesystem
> > activity to compute rough expected values for the counters.  This design
> > was chosen to avoid having to freeze the entire filesystem to check the
> > counters, but while that's running we'd prefer to minimize background
> > reclamation activity to reduce the perturbations to the incore free
> > block count.  Therefore, provide a way for scrubbers to disable
> > background posteof and cowblock reclamation.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/common.c |   18 ++++++++++++++++++
> >  fs/xfs/scrub/common.h |    2 ++
> >  fs/xfs/scrub/scrub.c  |    2 ++
> >  fs/xfs/scrub/scrub.h  |    1 +
> >  4 files changed, 23 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > index 7076d5c98151..a406a22a734f 100644
> > --- a/fs/xfs/scrub/common.c
> > +++ b/fs/xfs/scrub/common.c
> > @@ -894,3 +894,21 @@ xchk_ilock_inverted(
> >  	}
> >  	return -EDEADLOCK;
> >  }
> > +
> > +/* Pause background reclamation and inactivation. */
> > +void
> > +xchk_disable_reclaim(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	sc->flags |= XCHK_RECLAIM_DISABLED;
> > +	xfs_icache_disable_reclaim(sc->mp);
> > +}
> 
> Hmmm. Poorly named function.  "reclaim" in the context of
> xfs_icache.c means inode cache reclaim...
> 
> We're not actually disabling inode reclaim here, we
> are disabling background block reaping.
> 
> Can we change this all to be named more appropriately (including the
> icache.c functions? Then it is all fine because I don't look at
> it and think "that'll break under memory pressure"....

Ah, reaping!  That's the word I was looking for. :(

xfs_icache_disable_speculative_reaping
xchk_disable_speculative_reaping

Sort of a long name but eh not that many people will use it.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 3/3] xfs: add online scrub/repair for superblock counters
  2019-04-17  1:40 ` [PATCH 3/3] xfs: add online scrub/repair for superblock counters Darrick J. Wong
@ 2019-04-17 22:30   ` Dave Chinner
  2019-04-18  0:32     ` Darrick J. Wong
  2019-04-18 23:39     ` Darrick J. Wong
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2019-04-17 22:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Apr 16, 2019 at 06:40:21PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Teach online scrub and repair how to check and reset the superblock
> inode and block counters.  The AG rebuilding functions will need these
> to adjust the counts if they need to change as a part of recovering from
> corruption.  We must use the repair freeze mechanism to prevent any
> other changes while we do this.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
.....
> +/*
> + * FS Summary Counters
> + * ===================
> + *
> + * The basics of filesystem summary counter checking are that we iterate the
> + * AGs counting the number of free blocks, free space btree blocks, per-AG
> + * reservations, inodes, delayed allocation reservations, and free inodes.
> + * Then we compare what we computed against the in-core counters.
> + *
> + * However, the reality is that summary counters are a tricky beast to check.
> + * While we /could/ freeze the filesystem and scramble around the AGs counting
> + * the free blocks, in practice we prefer not to do that for a scan because
> + * freezing is costly.  To get around this, we added a per-cpu counter of the
> + * delalloc reservations so that we can rotor around the AGs relatively
> + * quickly, and we allow the counts to be slightly off because we're not
> + * taking any locks while we do this.
> + */
> +
> +int
> +xchk_setup_fscounters(
> +	struct xfs_scrub	*sc,
> +	struct xfs_inode	*ip)
> +{
> +	sc->buf = kmem_zalloc(sizeof(struct xchk_fscounters), KM_SLEEP);
> +	if (!sc->buf)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Pause background reclaim while we're scrubbing to reduce the
> +	 * likelihood of background perturbations to the counters throwing
> +	 * off our calculations.
> +	 */
> +	xchk_disable_reclaim(sc);

Naming :)

> +
> +	return xchk_trans_alloc(sc, 0);
> +}
> +
> +/*
> + * Calculate what the global in-core counters ought to be from the AG header
> + * contents.  Callers can compare this to the actual in-core counters to
> + * calculate by how much both in-core and on-disk counters need to be
> + * adjusted.
> + */
> +STATIC int
> +xchk_fscounters_calc(
> +	struct xfs_scrub	*sc,
> +	struct xchk_fscounters	*fsc)
> +{
> +	struct xfs_mount	*mp = sc->mp;
> +	struct xfs_buf		*agi_bp;
> +	struct xfs_buf		*agf_bp;
> +	struct xfs_agi		*agi;
> +	struct xfs_agf		*agf;
> +	struct xfs_perag	*pag;
> +	uint64_t		delayed;
> +	xfs_agnumber_t		agno;
> +	int			error;
> +
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		/* Lock both AG headers. */
> +		error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> +		if (error)
> +			return error;
> +		error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, &agf_bp);
> +		if (error)
> +			return error;
> +		if (!agf_bp)
> +			return -ENOMEM;
> +
> +		/* Count all the inodes */
> +		agi = XFS_BUF_TO_AGI(agi_bp);
> +		fsc->icount += be32_to_cpu(agi->agi_count);
> +		fsc->ifree += be32_to_cpu(agi->agi_freecount);
> +
> +		/* Add up the free/freelist/bnobt/cntbt blocks */
> +		agf = XFS_BUF_TO_AGF(agf_bp);
> +		fsc->fdblocks += be32_to_cpu(agf->agf_freeblks);
> +		fsc->fdblocks += be32_to_cpu(agf->agf_flcount);
> +		fsc->fdblocks += be32_to_cpu(agf->agf_btreeblks);
> +
> +		/*
> +		 * Per-AG reservations are taken out of the incore counters,
> +		 * so they must be left out of the free blocks computation.
> +		 */
> +		pag = xfs_perag_get(mp, agno);
> +		fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
> +		fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
> +		xfs_perag_put(pag);
> +
> +		xfs_trans_brelse(sc->tp, agf_bp);
> +		xfs_trans_brelse(sc->tp, agi_bp);
> +	}

Hmmmm. Do we have all these counters in the perag? If we do, we've
already checked them against the on-disk structures, yes? So can we
just do a pass across the perags to sum the space usage?

> And if we don't have them all in the perag, should we add them?

> +
> +	/*
> +	 * The global incore space reservation is taken from the incore
> +	 * counters, so leave that out of the computation.
> +	 */
> +	fsc->fdblocks -= mp->m_resblks_avail;
> +
> +	/*
> +	 * Delayed allocation reservations are taken out of the incore counters
> +	 * but not recorded on disk, so leave them and their indlen blocks out
> +	 * of the computation.
> +	 */
> +	delayed = percpu_counter_sum(&mp->m_delalloc_blks);
> +	fsc->fdblocks -= delayed;
> +
> +	trace_xchk_fscounters_calc(mp, fsc->icount, fsc->ifree, fsc->fdblocks,
> +			delayed);
> +
> +	/* Bail out if the values we compute are totally nonsense. */
> +	if (!xfs_verify_icount(mp, fsc->icount) ||
> +	    fsc->fdblocks > mp->m_sb.sb_dblocks ||
> +	    fsc->ifree > fsc->icount)
> +		return -EFSCORRUPTED;

I suspect we need some tolerance here on ifree vs icount as icount
can decrease as we free inode chunks....

> +/*
> + * Is the @counter within an acceptable range of @expected?
> + *
> + * Currently that means 1/16th (6%) or @nr_range of the @expected value.
> + */

6% is a lot for large filesystems, especially for block counts. That
can be entire AGs missing. I suspect the tolerance should be
related to AG count in some way....
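To put that 6% figure in concrete terms, here is a back-of-envelope sketch (purely illustrative, not kernel code; it assumes 4 KiB blocks, so a maximally sized 1 TiB AG holds 2^28 blocks):

```c
#include <stdint.h>

/*
 * Illustrative arithmetic only: how many maximally sized AGs fit
 * inside the 1/16th (~6%) tolerance window?  AG_MAX_BLOCKS assumes
 * 4 KiB blocks and the 1 TiB AG size cap.
 */
#define AG_MAX_BLOCKS ((uint64_t)1 << 28)

static uint64_t tolerance_blocks(uint64_t expected)
{
	return expected >> 4;	/* 1/16th of the expected value */
}

static uint64_t ags_inside_tolerance(uint64_t fs_blocks)
{
	return tolerance_blocks(fs_blocks) / AG_MAX_BLOCKS;
}
```

For a mostly free 100-AG (100 TiB) filesystem the window swallows six entire AGs' worth of blocks, which illustrates why an AG-count-relative tolerance looks more appropriate.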

> +static inline bool
> +xchk_fscounter_within_range(
> +	struct xfs_scrub	*sc,
> +	struct percpu_counter	*counter,
> +	uint64_t		expected,
> +	uint64_t		nr_range)
> +{
> +	int64_t			value = percpu_counter_sum(counter);
> +	uint64_t		range;
> +
> +	range = max_t(uint64_t, expected >> 4, nr_range);
> +	if (value < 0)
> +		return false;
> +	if (range < expected && value < expected - range)
> +		return false;
> +	if ((int64_t)(expected + range) >= 0 && value > expected + range)
> +		return false;
> +	return true;
> +}
> +
> +/* Check the superblock counters. */
> +int
> +xchk_fscounters(
> +	struct xfs_scrub	*sc)
> +{
> +	struct xfs_mount	*mp = sc->mp;
> +	struct xchk_fscounters	*fsc = sc->buf;
> +	int64_t			icount, ifree, fdblocks;
> +	int			error;
> +
> +	icount = percpu_counter_sum(&sc->mp->m_icount);
> +	ifree = percpu_counter_sum(&sc->mp->m_ifree);
> +	fdblocks = percpu_counter_sum(&sc->mp->m_fdblocks);

We have a local mp var in this function :)

> +
> +	if (icount < 0 || ifree < 0 || fdblocks < 0)
> +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> +
> +	/* See if icount is obviously wrong. */
> +	if (!xfs_verify_icount(mp, icount))
> +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> +
> +	/* See if fdblocks / ifree are obviously wrong. */
> +	if (fdblocks > mp->m_sb.sb_dblocks)
> +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> +	if (ifree > icount)
> +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> +
> +	/* If we already know it's bad, we can skip the AG iteration. */
> +	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> +		return 0;
> +
> +	/* Counters seem ok, but let's count them. */
> +	error = xchk_fscounters_calc(sc, fsc);
> +	if (!xchk_process_error(sc, 0, XFS_SB_BLOCK(sc->mp), &error))
> +		return error;
> +
> +	/*
> +	 * Compare the in-core counters with whatever we counted.  We'll
> +	 * consider the inode counts ok if they're within 1024 inodes, and the
> +	 * free block counts if they're within 1/64th of the filesystem size.
> +	 */
> +	if (!xchk_fscounter_within_range(sc, &mp->m_icount, fsc->icount, 1024))
> +		xchk_block_set_corrupt(sc, mp->m_sb_bp);

We've already summed the percpu counters at this point - why do we
pass them into xchk_fscounter_within_range() and then sum them
again?

Also, what's the magic 1024 here?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: allow scrubbers to pause background reclaim
  2019-04-17 22:29     ` Darrick J. Wong
@ 2019-04-17 22:45       ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2019-04-17 22:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Apr 17, 2019 at 03:29:53PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 18, 2019 at 07:52:28AM +1000, Dave Chinner wrote:
> > On Tue, Apr 16, 2019 at 06:40:12PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > The forthcoming summary counter patch races with regular filesystem
> > > activity to compute rough expected values for the counters.  This design
> > > was chosen to avoid having to freeze the entire filesystem to check the
> > > counters, but while that's running we'd prefer to minimize background
> > > reclamation activity to reduce the perturbations to the incore free
> > > block count.  Therefore, provide a way for scrubbers to disable
> > > background posteof and cowblock reclamation.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/scrub/common.c |   18 ++++++++++++++++++
> > >  fs/xfs/scrub/common.h |    2 ++
> > >  fs/xfs/scrub/scrub.c  |    2 ++
> > >  fs/xfs/scrub/scrub.h  |    1 +
> > >  4 files changed, 23 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > > index 7076d5c98151..a406a22a734f 100644
> > > --- a/fs/xfs/scrub/common.c
> > > +++ b/fs/xfs/scrub/common.c
> > > @@ -894,3 +894,21 @@ xchk_ilock_inverted(
> > >  	}
> > >  	return -EDEADLOCK;
> > >  }
> > > +
> > > +/* Pause background reclamation and inactivation. */
> > > +void
> > > +xchk_disable_reclaim(
> > > +	struct xfs_scrub	*sc)
> > > +{
> > > +	sc->flags |= XCHK_RECLAIM_DISABLED;
> > > +	xfs_icache_disable_reclaim(sc->mp);
> > > +}
> > 
> > Hmmm. Poorly named function.  "reclaim" in the context of
> > xfs_icache.c means inode cache reclaim...
> > 
> > We're not actually disabling inode reclaim here, we
> > are disabling background block reaping.
> > 
> > Can we change this all to be named more appropriately (including the
> > icache.c functions)? Then it is all fine because I don't look at
> > it and think "that'll break under memory pressure"....
> 
> Ah, reaping!  That's the word I was looking for. :(
> 
> xfs_icache_disable_speculative_reaping
> xchk_disable_speculative_reaping

disable_block_reaping?

but, really, purple or pink at this point...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: track delayed allocation reservations across
  2019-04-17 21:40   ` Dave Chinner
@ 2019-04-18  0:07     ` Darrick J. Wong
  0 siblings, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-18  0:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Apr 18, 2019 at 07:40:31AM +1000, Dave Chinner wrote:
> On Tue, Apr 16, 2019 at 06:40:06PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Add a percpu counter to track the number of blocks directly reserved for
> > delayed allocations on the data device.  This counter (in contrast to
> > i_delayed_blks) does not track allocated CoW staging extents or anything
> > going on with the realtime device.  It will be used in the upcoming
> > summary counter scrub function to check the free block counts without
> > having to freeze the filesystem or walk all the inodes to find the
> > delayed allocations.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Hmmm. Given that a lot of modifications to this counter are going to
> be large quantities (much greater than the percpu counter batch size
> of 32) this effectively just degrades to a spin-lock-protected global
> counter, yes? This is the reason we have XFS_FDBLOCKS_BATCH (1024)
> for the free block percpu counter so that changes from frequent
> updates via lots of smallish write()s don't need to take the global
> aggregation lock very often...
> 
> So I'm guessing that this counter needs similar treatment for
> scalability on large machines. If it doesn't then an atomic64_t
> is all that is necessary here...
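To make the batching trade-off Dave describes concrete, here is a toy userspace model (an illustration only; the real implementation lives in lib/percpu_counter.c and handles preemption, CPU hotplug, and more):

```c
#include <stdint.h>

/*
 * Toy model of percpu counter batching: each "CPU" accumulates a
 * local delta and only folds it into the shared total (which would be
 * spinlock-protected in the kernel) once the local magnitude reaches
 * the batch size.  A large batch like XFS_FDBLOCKS_BATCH (1024) means
 * infrequent lock-protected folds; a small batch degrades toward a
 * plain locked global counter.
 */
#define TOY_NCPUS 4

struct toy_counter {
	int64_t		global;			/* lock-protected in real life */
	int64_t		local[TOY_NCPUS];	/* per-cpu deltas */
	unsigned long	folds;			/* global-lock acquisitions */
};

static void toy_add(struct toy_counter *c, int cpu, int64_t amount,
		    int64_t batch)
{
	c->local[cpu] += amount;
	if (c->local[cpu] >= batch || c->local[cpu] <= -batch) {
		c->global += c->local[cpu];	/* the expensive fold */
		c->local[cpu] = 0;
		c->folds++;
	}
}

static int64_t toy_sum(const struct toy_counter *c)
{
	int64_t sum = c->global;

	for (int cpu = 0; cpu < TOY_NCPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

Running 1024 increments of 8 through this model takes 256 folds with a batch of 32 but only 8 folds with a batch of 1024, which is the scalability difference being discussed.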

I wrote a silly little test to grow the delalloc count by one from a
number of threads.  It does cause an observable increase in time waiting
for spinlocks.  Since the only users of this counter are unmount check
and the fscounter scrubber I think it's safe to use XFS_FDBLOCKS_BATCH
for this because we aren't making any decisions that need high precision
to maintain correct functioning of the system.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 3/3] xfs: add online scrub/repair for superblock counters
  2019-04-17 22:30   ` Dave Chinner
@ 2019-04-18  0:32     ` Darrick J. Wong
  2019-04-18 23:39     ` Darrick J. Wong
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-18  0:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Apr 18, 2019 at 08:30:52AM +1000, Dave Chinner wrote:
> On Tue, Apr 16, 2019 at 06:40:21PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Teach online scrub and repair how to check and reset the superblock
> > inode and block counters.  The AG rebuilding functions will need these
> > to adjust the counts if they need to change as a part of recovering from
> > corruption.  We must use the repair freeze mechanism to prevent any
> > other changes while we do this.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> .....
> > +/*
> > + * FS Summary Counters
> > + * ===================
> > + *
> > + * The basics of filesystem summary counter checking are that we iterate the
> > + * AGs counting the number of free blocks, free space btree blocks, per-AG
> > + * reservations, inodes, delayed allocation reservations, and free inodes.
> > + * Then we compare what we computed against the in-core counters.
> > + *
> > + * However, the reality is that summary counters are a tricky beast to check.
> > + * While we /could/ freeze the filesystem and scramble around the AGs counting
> > + * the free blocks, in practice we prefer not to do that for a scan because
> > + * freezing is costly.  To get around this, we added a per-cpu counter of the
> > + * delalloc reservations so that we can rotor around the AGs relatively
> > + * quickly, and we allow the counts to be slightly off because we're not
> > + * taking any locks while we do this.
> > + */
> > +
> > +int
> > +xchk_setup_fscounters(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_inode	*ip)
> > +{
> > +	sc->buf = kmem_zalloc(sizeof(struct xchk_fscounters), KM_SLEEP);
> > +	if (!sc->buf)
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * Pause background reclaim while we're scrubbing to reduce the
> > +	 * likelihood of background perturbations to the counters throwing
> > +	 * off our calculations.
> > +	 */
> > +	xchk_disable_reclaim(sc);
> 
> Naming :)

Fixed.

> > +
> > +	return xchk_trans_alloc(sc, 0);
> > +}
> > +
> > +/*
> > + * Calculate what the global in-core counters ought to be from the AG header
> > + * contents.  Callers can compare this to the actual in-core counters to
> > + * calculate by how much both in-core and on-disk counters need to be
> > + * adjusted.
> > + */
> > +STATIC int
> > +xchk_fscounters_calc(
> > +	struct xfs_scrub	*sc,
> > +	struct xchk_fscounters	*fsc)
> > +{
> > +	struct xfs_mount	*mp = sc->mp;
> > +	struct xfs_buf		*agi_bp;
> > +	struct xfs_buf		*agf_bp;
> > +	struct xfs_agi		*agi;
> > +	struct xfs_agf		*agf;
> > +	struct xfs_perag	*pag;
> > +	uint64_t		delayed;
> > +	xfs_agnumber_t		agno;
> > +	int			error;
> > +
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		/* Lock both AG headers. */
> > +		error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> > +		if (error)
> > +			return error;
> > +		error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, &agf_bp);
> > +		if (error)
> > +			return error;
> > +		if (!agf_bp)
> > +			return -ENOMEM;
> > +
> > +		/* Count all the inodes */
> > +		agi = XFS_BUF_TO_AGI(agi_bp);
> > +		fsc->icount += be32_to_cpu(agi->agi_count);
> > +		fsc->ifree += be32_to_cpu(agi->agi_freecount);
> > +
> > +		/* Add up the free/freelist/bnobt/cntbt blocks */
> > +		agf = XFS_BUF_TO_AGF(agf_bp);
> > +		fsc->fdblocks += be32_to_cpu(agf->agf_freeblks);
> > +		fsc->fdblocks += be32_to_cpu(agf->agf_flcount);
> > +		fsc->fdblocks += be32_to_cpu(agf->agf_btreeblks);
> > +
> > +		/*
> > +		 * Per-AG reservations are taken out of the incore counters,
> > +		 * so they must be left out of the free blocks computation.
> > +		 */
> > +		pag = xfs_perag_get(mp, agno);
> > +		fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
> > +		fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
> > +		xfs_perag_put(pag);
> > +
> > +		xfs_trans_brelse(sc->tp, agf_bp);
> > +		xfs_trans_brelse(sc->tp, agi_bp);
> > +	}
> 
> Hmmmm. Do we have all these counters in the perag?

Yes.

> If we do, we've already checked them against the on-disk structures,
> yes?

Not necessarily; if someone calls the fscounter scrubber without calling
the ag header scrubbers then they could be wrong.  Granted, the incore
counters are supposed to match the ondisk counters, and unless there's a
software bug or bad memory they're going to match.  Ok, I think I'm
convinced that we can use the incore counters here, along with a warmup
function in the setup function to make sure pagf_init and pagi_init == 1
and spot check the correspondence.

Oh, hey, the ag header scrubbers don't check the incore counters either.
I'll add that too.

> So can we just do a pass across the perags to sum the space usage?

That sounds like a nice way to speed this up further.

> And if we don't have them all in the perag, should we add them?
> 
> > +
> > +	/*
> > +	 * The global incore space reservation is taken from the incore
> > +	 * counters, so leave that out of the computation.
> > +	 */
> > +	fsc->fdblocks -= mp->m_resblks_avail;
> > +
> > +	/*
> > +	 * Delayed allocation reservations are taken out of the incore counters
> > +	 * but not recorded on disk, so leave them and their indlen blocks out
> > +	 * of the computation.
> > +	 */
> > +	delayed = percpu_counter_sum(&mp->m_delalloc_blks);
> > +	fsc->fdblocks -= delayed;
> > +
> > +	trace_xchk_fscounters_calc(mp, fsc->icount, fsc->ifree, fsc->fdblocks,
> > +			delayed);
> > +
> > +	/* Bail out if the values we compute are totally nonsense. */
> > +	if (!xfs_verify_icount(mp, fsc->icount) ||
> > +	    fsc->fdblocks > mp->m_sb.sb_dblocks ||
> > +	    fsc->ifree > fsc->icount)
> > +		return -EFSCORRUPTED;
> 
> I suspect we need some tolerance here on ifree vs icount as icount
> can decrease as we free inode chunks....

TBH I was half convinced that the ifree check here only needed to make
sure that the value isn't larger than the number of possible inodes in
the filesystem, since we do all the thresholding stuff later anyway.

> > +/*
> > + * Is the @counter within an acceptable range of @expected?
> > + *
> > + * Currently that means 1/16th (6%) or @nr_range of the @expected value.
> > + */
> 
> 6% is a lot for large filesystems, especially for block counts. That
> can be entire AGs missing. I suspect the tolerance should be
> related to AG count in some way....

Yeah, I'm going to ponder this while I make dinner...

> > +static inline bool
> > +xchk_fscounter_within_range(
> > +	struct xfs_scrub	*sc,
> > +	struct percpu_counter	*counter,
> > +	uint64_t		expected,
> > +	uint64_t		nr_range)
> > +{
> > +	int64_t			value = percpu_counter_sum(counter);
> > +	uint64_t		range;
> > +
> > +	range = max_t(uint64_t, expected >> 4, nr_range);
> > +	if (value < 0)
> > +		return false;
> > +	if (range < expected && value < expected - range)
> > +		return false;
> > +	if ((int64_t)(expected + range) >= 0 && value > expected + range)
> > +		return false;
> > +	return true;
> > +}
> > +
> > +/* Check the superblock counters. */
> > +int
> > +xchk_fscounters(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	struct xfs_mount	*mp = sc->mp;
> > +	struct xchk_fscounters	*fsc = sc->buf;
> > +	int64_t			icount, ifree, fdblocks;
> > +	int			error;
> > +
> > +	icount = percpu_counter_sum(&sc->mp->m_icount);
> > +	ifree = percpu_counter_sum(&sc->mp->m_ifree);
> > +	fdblocks = percpu_counter_sum(&sc->mp->m_fdblocks);
> 
> We have a local mp var in this function :)

Ok.

> > +
> > +	if (icount < 0 || ifree < 0 || fdblocks < 0)
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +
> > +	/* See if icount is obviously wrong. */
> > +	if (!xfs_verify_icount(mp, icount))
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +
> > +	/* See if fdblocks / ifree are obviously wrong. */
> > +	if (fdblocks > mp->m_sb.sb_dblocks)
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +	if (ifree > icount)
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +
> > +	/* If we already know it's bad, we can skip the AG iteration. */
> > +	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> > +		return 0;
> > +
> > +	/* Counters seem ok, but let's count them. */
> > +	error = xchk_fscounters_calc(sc, fsc);
> > +	if (!xchk_process_error(sc, 0, XFS_SB_BLOCK(sc->mp), &error))
> > +		return error;
> > +
> > +	/*
> > +	 * Compare the in-core counters with whatever we counted.  We'll
> > +	 * consider the inode counts ok if they're within 1024 inodes, and the
> > +	 * free block counts if they're within 1/64th of the filesystem size.
> > +	 */
> > +	if (!xchk_fscounter_within_range(sc, &mp->m_icount, fsc->icount, 1024))
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> 
> We've already summed the percpu counters at this point - why do we
> pass them into xchk_fscounter_within_range() and then sum them
> again?

Hmm.  Yeah, we don't need to do that again.

> Also, what's the magic 1024 here?

100% Magic. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 3/3] xfs: add online scrub/repair for superblock counters
  2019-04-17 22:30   ` Dave Chinner
  2019-04-18  0:32     ` Darrick J. Wong
@ 2019-04-18 23:39     ` Darrick J. Wong
  1 sibling, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2019-04-18 23:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

[This reply is only about the thresholding stuff...]

On Thu, Apr 18, 2019 at 08:30:52AM +1000, Dave Chinner wrote:
> On Tue, Apr 16, 2019 at 06:40:21PM -0700, Darrick J. Wong wrote:

<snip>

> > +/*
> > + * Is the @counter within an acceptable range of @expected?
> > + *
> > + * Currently that means 1/16th (6%) or @nr_range of the @expected value.
> > + */
> 
> 6% is a lot for large filesystems, especially for block counts. That
> can be entire AGs missing. I suspect the tolerance should be
> related to AG count in some way....

Agreed, 6% is a lot, especially since that's large enough to swallow
several entire AGs.  The other approach (which I've also been testing)
is that we base the threshold on the delta between the
percpu_counter_sum at the start of xchk_fscounters and the second call
in _within_range -- the more that it changes while we're racing to
compute the expected value, the more we let the counter be off by, with
some minimum amount of variance that we tolerate.

Prior to this review, the runtime of the _calc function varied quite a
bit when the fs was running a heavy load because of buffer lock
contention, which made the amount of variance fairly unstable even with
a fairly steady IO load on the filesystem, so I sent the simpler
version.

However, your suggestion of using only the incore perag counters cuts
the runtime down to nearly zero even on crazy-agcount filesystems since
that slashes the synchronization overhead, which means that the counter
variance has stabilized and the approach no longer seems quite so crazy.

Now we have:

counter = percpu_counter_sum()
range = max(512, 2 * (old_counter - counter))
counter >= (expected - range) && counter <= (expected + range)

Granted that 1024 (now 512) value that I use now is more or less
arbitrarily picked to prevent complaints while providing a solid check
that we're at least in the ballpark.
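Expressed as a sketch (the names are mine, and I'm reading the "minimum amount of variance that we tolerate" as a 512-unit floor on the window, not the final kernel code):

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the revised tolerance check: widen the
 * acceptable window by twice the amount the live counter moved while
 * the expected value was being computed, with a floor of 512 to
 * absorb small unavoidable drift.  Returns nonzero if @curr is close
 * enough to @expected.
 */
static int fscounter_in_range(int64_t curr, int64_t old, int64_t expected)
{
	int64_t delta = curr > old ? curr - old : old - curr;
	int64_t range = 2 * delta;

	if (range < 512)	/* minimum tolerated variance */
		range = 512;
	return curr >= expected - range && curr <= expected + range;
}
```

The idea is that a filesystem which changed a lot during the scan earns a proportionally wider window, while an idle one is held to the tight 512 floor.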

Patches soon,

--D

> > +static inline bool
> > +xchk_fscounter_within_range(
> > +	struct xfs_scrub	*sc,
> > +	struct percpu_counter	*counter,
> > +	uint64_t		expected,
> > +	uint64_t		nr_range)
> > +{
> > +	int64_t			value = percpu_counter_sum(counter);
> > +	uint64_t		range;
> > +
> > +	range = max_t(uint64_t, expected >> 4, nr_range);
> > +	if (value < 0)
> > +		return false;
> > +	if (range < expected && value < expected - range)
> > +		return false;
> > +	if ((int64_t)(expected + range) >= 0 && value > expected + range)
> > +		return false;
> > +	return true;
> > +}
> > +
> > +/* Check the superblock counters. */
> > +int
> > +xchk_fscounters(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	struct xfs_mount	*mp = sc->mp;
> > +	struct xchk_fscounters	*fsc = sc->buf;
> > +	int64_t			icount, ifree, fdblocks;
> > +	int			error;
> > +
> > +	icount = percpu_counter_sum(&sc->mp->m_icount);
> > +	ifree = percpu_counter_sum(&sc->mp->m_ifree);
> > +	fdblocks = percpu_counter_sum(&sc->mp->m_fdblocks);
> 
> We have a local mp var in this function :)
> 
> > +
> > +	if (icount < 0 || ifree < 0 || fdblocks < 0)
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +
> > +	/* See if icount is obviously wrong. */
> > +	if (!xfs_verify_icount(mp, icount))
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +
> > +	/* See if fdblocks / ifree are obviously wrong. */
> > +	if (fdblocks > mp->m_sb.sb_dblocks)
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +	if (ifree > icount)
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> > +
> > +	/* If we already know it's bad, we can skip the AG iteration. */
> > +	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> > +		return 0;
> > +
> > +	/* Counters seem ok, but let's count them. */
> > +	error = xchk_fscounters_calc(sc, fsc);
> > +	if (!xchk_process_error(sc, 0, XFS_SB_BLOCK(sc->mp), &error))
> > +		return error;
> > +
> > +	/*
> > +	 * Compare the in-core counters with whatever we counted.  We'll
> > +	 * consider the inode counts ok if they're within 1024 inodes, and the
> > +	 * free block counts if they're within 1/64th of the filesystem size.
> > +	 */
> > +	if (!xchk_fscounter_within_range(sc, &mp->m_icount, fsc->icount, 1024))
> > +		xchk_block_set_corrupt(sc, mp->m_sb_bp);
> 
> We've already summed the percpu counters at this point - why do we
> pass them into xchk_fscounter_within_range() and then sum them
> again?
> 
> Also, what's the magic 1024 here?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


end of thread, other threads:[~2019-04-18 23:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-17  1:40 [PATCH 0/3] xfs: scrub filesystem summary counters Darrick J. Wong
2019-04-17  1:40 ` [PATCH 1/3] xfs: track delayed allocation reservations across Darrick J. Wong
2019-04-17 21:40   ` Dave Chinner
2019-04-18  0:07     ` Darrick J. Wong
2019-04-17  1:40 ` [PATCH 2/3] xfs: allow scrubbers to pause background reclaim Darrick J. Wong
2019-04-17 21:52   ` Dave Chinner
2019-04-17 22:29     ` Darrick J. Wong
2019-04-17 22:45       ` Dave Chinner
2019-04-17  1:40 ` [PATCH 3/3] xfs: add online scrub/repair for superblock counters Darrick J. Wong
2019-04-17 22:30   ` Dave Chinner
2019-04-18  0:32     ` Darrick J. Wong
2019-04-18 23:39     ` Darrick J. Wong
