linux-xfs.vger.kernel.org archive mirror
* [PATCH v6 0/9] xfsprogs: online scrub/repair support
@ 2017-03-10 23:24 Darrick J. Wong
  2017-03-10 23:24 ` [PATCH 1/9] xfs_repair: rebuild bmbt from rmapbt data Darrick J. Wong
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:24 UTC
  To: sandeen, darrick.wong; +Cc: linux-xfs

Hi all,

This is the sixth revision of a patchset that adds to XFS userland tools
support for online metadata scrubbing and repair.  There aren't any
on-disk format changes, and the main overview is in the cover letter for
the kernel patches.  Since v5 I have removed from xfs_scrub the ability
to scrub non-XFS filesystems (i.e. XFS is now required) and have added
the ability to run xfs_scrub periodically and in a contained environment
if systemd is active.  None of the systemd functionality is active by
default.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
The xfsprogs patches should apply against 4.10.

This is an extraordinary way to eat your data.  Enjoy! 
Comments and questions are, as always, welcome.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
[3] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel


* [PATCH 1/9] xfs_repair: rebuild bmbt from rmapbt data
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
@ 2017-03-10 23:24 ` Darrick J. Wong
  2017-03-10 23:24 ` [PATCH 2/9] xfs_db: introduce fuzz command Darrick J. Wong
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:24 UTC
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Use rmap records to rebuild corrupt inode forks instead of zapping
the whole inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/libxfs_api_defs.h |    2 
 repair/Makefile          |    5 -
 repair/dino_chunks.c     |    7 +
 repair/dinode.c          |   41 +++++++
 repair/rebuild.c         |  277 ++++++++++++++++++++++++++++++++++++++++++++++
 repair/rebuild.h         |   26 ++++
 repair/rmap.c            |    2 
 repair/rmap.h            |    1 
 8 files changed, 357 insertions(+), 4 deletions(-)
 create mode 100644 repair/rebuild.c
 create mode 100644 repair/rebuild.h
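
Not part of the patch: a minimal standalone sketch of the replay step in
xfs_repair_bmap(), assuming the reverse mappings for one fork have already
been collected.  It sorts the records by file offset and splits each one
into mappings no longer than a maximum extent length.  The names
demo_extent, demo_extent_cmp and demo_replay_extents are hypothetical, and
the maximum length is a caller-supplied parameter rather than the real
MAXEXTLEN.  No libxfs calls are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one reverse-mapping record of a single fork. */
struct demo_extent {
	uint64_t	offset;		/* logical file offset, in blocks */
	uint64_t	startblock;	/* physical start block */
	uint64_t	len;		/* length, in blocks */
};

/* Order records by file offset, like the patch's comparison function. */
static int
demo_extent_cmp(const void *a, const void *b)
{
	const struct demo_extent	*ap = a;
	const struct demo_extent	*bp = b;

	if (ap->offset > bp->offset)
		return 1;
	if (ap->offset < bp->offset)
		return -1;
	return 0;
}

/*
 * Sort recs by offset, then emit mappings of at most maxlen blocks each.
 * Returns the number of mappings written to out, or (size_t)-1 if out
 * is too small to hold them all.
 */
static size_t
demo_replay_extents(struct demo_extent *recs, size_t nrecs, uint64_t maxlen,
		struct demo_extent *out, size_t out_cap)
{
	size_t	i, n = 0;

	assert(maxlen > 0);
	qsort(recs, nrecs, sizeof(*recs), demo_extent_cmp);

	for (i = 0; i < nrecs; i++) {
		struct demo_extent	cur = recs[i];

		while (cur.len > 0) {
			uint64_t chunk = cur.len < maxlen ? cur.len : maxlen;

			if (n == out_cap)
				return (size_t)-1;
			out[n].offset = cur.offset;
			out[n].startblock = cur.startblock;
			out[n].len = chunk;
			n++;

			/* Advance both addresses past the chunk just emitted. */
			cur.offset += chunk;
			cur.startblock += chunk;
			cur.len -= chunk;
		}
	}
	return n;
}
```

In the patch itself each chunk is handed to libxfs_bmapi_write() and the
transaction is rolled afterwards; the sketch only shows the ordering and
splitting.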


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index d299b7a..f01fff0 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -146,5 +146,7 @@
 #define xfs_rmap_lookup_le_range	libxfs_rmap_lookup_le_range
 #define xfs_refc_block			libxfs_refc_block
 #define xfs_rmap_compare		libxfs_rmap_compare
+#define xfs_bmbt_calc_size		libxfs_bmbt_calc_size
+#define xfs_rmap_query_all		libxfs_rmap_query_all
 
 #endif /* __LIBXFS_API_DEFS_H__ */
diff --git a/repair/Makefile b/repair/Makefile
index b7e8fd5..9edaf18 100644
--- a/repair/Makefile
+++ b/repair/Makefile
@@ -11,14 +11,15 @@ LTCOMMAND = xfs_repair
 
 HFILES = agheader.h attr_repair.h avl.h avl64.h bmap.h btree.h \
 	da_util.h dinode.h dir2.h err_protos.h globals.h incore.h protos.h \
-	rt.h progress.h scan.h versions.h prefetch.h rmap.h slab.h threads.h
+	rt.h progress.h scan.h versions.h prefetch.h rmap.h slab.h threads.h \
+	rebuild.h
 
 CFILES = agheader.c attr_repair.c avl.c avl64.c bmap.c btree.c \
 	da_util.c dino_chunks.c dinode.c dir2.c globals.c incore.c \
 	incore_bmc.c init.c incore_ext.c incore_ino.c phase1.c \
 	phase2.c phase3.c phase4.c phase5.c phase6.c phase7.c \
 	progress.c prefetch.c rmap.c rt.c sb.c scan.c slab.c threads.c \
-	versions.c xfs_repair.c
+	versions.c rebuild.c xfs_repair.c
 
 LLDLIBS = $(LIBXFS) $(LIBXLOG) $(LIBXCMD) $(LIBUUID) \
 	$(LIBRT) $(LIBPTHREAD) $(LIBBLKID)
diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index a3909ac..c479f2c 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -697,6 +697,13 @@ process_inode_chunk(
 		irec_offset += mp->m_sb.sb_inopblock * blks_per_cluster;
 		agbno += blks_per_cluster;
 	}
+	/*
+	 * Allow the buffer to be re-locked by this thread in case
+	 * we want to rebuild an inode fork.
+	 */
+	for (bp_index = 0; bp_index < cluster_count; bp_index++)
+		if (bplist[bp_index])
+			bplist[bp_index]->b_flags |= LIBXFS_B_RECURSIVE_LOCK;
 	agbno = XFS_AGINO_TO_AGBNO(mp, first_irec->ino_startnum);
 
 	/*
diff --git a/repair/dinode.c b/repair/dinode.c
index d664f87..6f71c2f 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -32,6 +32,7 @@
 #include "threads.h"
 #include "slab.h"
 #include "rmap.h"
+#include "rebuild.h"
 
 /*
  * gettext lookups for translations of strings use mutexes internally to
@@ -1915,7 +1916,9 @@ process_inode_data_fork(
 	xfs_ino_t	lino = XFS_AGINO_TO_INO(mp, agno, ino);
 	int		err = 0;
 	int		nex;
+	bool		try_rebuild = !rmapbt_suspect;
 
+retry:
 	/*
 	 * extent count on disk is only valid for positive values. The kernel
 	 * uses negative values in memory. hence if we see negative numbers
@@ -1961,8 +1964,25 @@ process_inode_data_fork(
 	if (err)  {
 		do_warn(_("bad data fork in inode %" PRIu64 "\n"), lino);
 		if (!no_modify)  {
+			if (try_rebuild) {
+				do_warn(
+_("rebuilding inode %"PRIu64" data fork\n"),
+					lino);
+				try_rebuild = false;
+				err = rebuild_bmap(mp, lino, XFS_DATA_FORK,
+						be32_to_cpu(dino->di_nextents));
+				if (!err)
+					goto retry;
+				do_warn(
+_("inode %"PRIu64" data fork rebuild failed, error %d\n"),
+					lino, err);
+			}
 			*dirty += clear_dinode(mp, dino, lino);
 			ASSERT(*dirty > 0);
+		} else if (try_rebuild) {
+			do_warn(
+_("would have tried to rebuild inode %"PRIu64" data fork or cleared the inode\n"),
+					lino);
 		}
 		return 1;
 	}
@@ -2026,7 +2046,9 @@ process_inode_attr_fork(
 	blkmap_t	*ablkmap = NULL;
 	int		repair = 0;
 	int		err;
+	bool		try_rebuild = !rmapbt_suspect;
 
+retry:
 	if (!XFS_DFORK_Q(dino)) {
 		*anextents = 0;
 		if (dino->di_aformat != XFS_DINODE_FMT_EXTENTS) {
@@ -2085,6 +2107,19 @@ process_inode_attr_fork(
 		do_warn(_("bad attribute fork in inode %" PRIu64), lino);
 
 		if (!no_modify)  {
+			if (try_rebuild) {
+				try_rebuild = false;
+				do_warn(
+_("rebuilding inode %"PRIu64" attr fork\n"),
+					lino);
+				err = rebuild_bmap(mp, lino, XFS_ATTR_FORK,
+						be16_to_cpu(dino->di_anextents));
+				if (!err)
+					goto retry;
+				do_warn(
+_("inode %"PRIu64" attr fork rebuild failed, error %d\n"),
+					lino, err);
+			}
 			if (delete_attr_ok)  {
 				do_warn(_(", clearing attr fork\n"));
 				*dirty += clear_dinode_attr(mp, dino, lino);
@@ -2094,7 +2129,11 @@ process_inode_attr_fork(
 				*dirty += clear_dinode(mp, dino, lino);
 			}
 			ASSERT(*dirty > 0);
-		} else  {
+		} else if (try_rebuild) {
+			do_warn(
+_("would have tried to rebuild inode %"PRIu64" attr fork or cleared it\n"),
+					lino);
+		} else {
 			do_warn(_(", would clear attr fork\n"));
 		}
 
diff --git a/repair/rebuild.c b/repair/rebuild.c
new file mode 100644
index 0000000..bd5d6a8
--- /dev/null
+++ b/repair/rebuild.c
@@ -0,0 +1,277 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include <libxfs.h>
+#include "btree.h"
+#include "err_protos.h"
+#include "libxlog.h"
+#include "incore.h"
+#include "globals.h"
+#include "dinode.h"
+#include "slab.h"
+#include "rmap.h"
+
+/* Borrowed routines from xfs_scrub.c */
+
+struct xfs_repair_bmap_extent {
+	struct xfs_rmap_irec		rmap;
+	xfs_agnumber_t			agno;
+};
+
+struct xfs_repair_bmap {
+	struct xfs_slab			*extslab;
+	xfs_ino_t			ino;
+	xfs_rfsblock_t			bmbt_blocks;
+	int				whichfork;
+};
+
+/* Record extents that belong to this inode's fork. */
+STATIC int
+xfs_repair_bmap_extent_fn(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_repair_bmap		*rb = priv;
+	struct xfs_repair_bmap_extent	rbe;
+
+	/* Skip extents which are not owned by this inode and fork. */
+	if (rec->rm_owner != rb->ino)
+		return 0;
+	else if (rb->whichfork == XFS_DATA_FORK &&
+		 (rec->rm_flags & XFS_RMAP_ATTR_FORK))
+		return 0;
+	else if (rb->whichfork == XFS_ATTR_FORK &&
+		 !(rec->rm_flags & XFS_RMAP_ATTR_FORK))
+		return 0;
+	else if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
+		rb->bmbt_blocks += rec->rm_blockcount;
+		return 0;
+	}
+
+	rbe.rmap = *rec;
+	rbe.agno = cur->bc_private.a.agno;
+	return slab_add(rb->extslab, &rbe);
+}
+
+/* Compare two bmap extents. */
+static int
+xfs_repair_bmap_extent_cmp(
+	const void				*a,
+	const void				*b)
+{
+	const struct xfs_repair_bmap_extent	*ap = a;
+	const struct xfs_repair_bmap_extent	*bp = b;
+
+	if (ap->rmap.rm_offset > bp->rmap.rm_offset)
+		return 1;
+	else if (ap->rmap.rm_offset < bp->rmap.rm_offset)
+		return -1;
+	return 0;
+}
+
+/* Repair an inode fork. */
+STATIC int
+xfs_repair_bmap(
+	struct xfs_inode		*ip,
+	struct xfs_trans		**tpp,
+	int				whichfork)
+{
+	struct xfs_repair_bmap		rb = {0};
+	struct xfs_bmbt_irec		bmap;
+	struct xfs_defer_ops		dfops;
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_buf			*agf_bp = NULL;
+	struct xfs_repair_bmap_extent	*rbe;
+	struct xfs_btree_cur		*cur;
+	struct xfs_slab_cursor		*scur = NULL;
+	xfs_fsblock_t			firstfsb;
+	xfs_agnumber_t			agno;
+	xfs_extlen_t			extlen;
+	int				baseflags;
+	int				flags;
+	int				nimaps;
+	int				error = 0;
+
+	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
+
+	/* Don't know how to repair the other fork formats. */
+	if (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	    XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE)
+		return ENOTTY;
+
+	/* Only files, symlinks, and directories get to have data forks. */
+	if (whichfork == XFS_DATA_FORK && !S_ISREG(VFS_I(ip)->i_mode) &&
+	    !S_ISDIR(VFS_I(ip)->i_mode) && !S_ISLNK(VFS_I(ip)->i_mode))
+		return EINVAL;
+
+	/* If we somehow have delalloc extents, forget it. */
+	if (whichfork == XFS_DATA_FORK && ip->i_delayed_blks)
+		return EBUSY;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return EOPNOTSUPP;
+
+	/* Don't know how to rebuild realtime data forks. */
+	if (XFS_IS_REALTIME_INODE(ip) && whichfork == XFS_DATA_FORK)
+		return EOPNOTSUPP;
+
+	/* Collect all reverse mappings for this fork's extents. */
+	init_slab(&rb.extslab, sizeof(*rbe));
+	rb.ino = ip->i_ino;
+	rb.whichfork = whichfork;
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = -libxfs_alloc_read_agf(mp, *tpp, agno, 0, &agf_bp);
+		if (error)
+			goto out;
+		cur = libxfs_rmapbt_init_cursor(mp, *tpp, agf_bp, agno);
+		error = -libxfs_rmap_query_all(cur, xfs_repair_bmap_extent_fn, &rb);
+		libxfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR :
+				XFS_BTREE_NOERROR);
+		if (error)
+			goto out;
+	}
+
+	/* Blow out the in-core fork and zero the on-disk fork. */
+	libxfs_trans_ijoin(*tpp, ip, 0);
+	if (XFS_IFORK_PTR(ip, whichfork) != NULL)
+		libxfs_idestroy_fork(ip, whichfork);
+	XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_EXTENTS);
+	XFS_IFORK_NEXT_SET(ip, whichfork, 0);
+
+	/* Reinitialize the on-disk fork. */
+	if (whichfork == XFS_DATA_FORK) {
+		memset(&ip->i_df, 0, sizeof(struct xfs_ifork));
+		ip->i_df.if_flags |= XFS_IFEXTENTS;
+	} else if (whichfork == XFS_ATTR_FORK) {
+		if (slab_count(rb.extslab) == 0)
+			ip->i_afp = NULL;
+		else {
+			ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_NOFS);
+			ip->i_afp->if_flags |= XFS_IFEXTENTS;
+		}
+	}
+	libxfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
+	error = -libxfs_trans_roll(tpp, ip);
+	if (error)
+		goto out;
+
+	baseflags = XFS_BMAPI_REMAP | XFS_BMAPI_NORMAP;
+	if (whichfork == XFS_ATTR_FORK)
+		baseflags |= XFS_BMAPI_ATTRFORK;
+
+	/* "Remap" the extents into the fork. */
+	init_slab_cursor(rb.extslab, xfs_repair_bmap_extent_cmp, &scur);
+	rbe = pop_slab_cursor(scur);
+	while (rbe != NULL) {
+		/* Form the "new" mapping... */
+		bmap.br_startblock = XFS_AGB_TO_FSB(mp, rbe->agno,
+				rbe->rmap.rm_startblock);
+		bmap.br_startoff = rbe->rmap.rm_offset;
+		flags = 0;
+		if (rbe->rmap.rm_flags & XFS_RMAP_UNWRITTEN)
+			flags = XFS_BMAPI_PREALLOC;
+		while (rbe->rmap.rm_blockcount > 0) {
+			libxfs_defer_init(&dfops, &firstfsb);
+			extlen = min(rbe->rmap.rm_blockcount, MAXEXTLEN);
+			bmap.br_blockcount = extlen;
+
+			/* Drop the block counter... */
+			ip->i_d.di_nblocks -= extlen;
+
+			/* Re-add the extent to the fork. */
+			nimaps = 1;
+			firstfsb = bmap.br_startblock;
+			error = -libxfs_bmapi_write(*tpp, ip,
+					bmap.br_startoff,
+					extlen, baseflags | flags, &firstfsb,
+					extlen, &bmap, &nimaps,
+					&dfops);
+			if (error)
+				goto out;
+
+			bmap.br_startblock += extlen;
+			bmap.br_startoff += extlen;
+			rbe->rmap.rm_blockcount -= extlen;
+			error = -libxfs_defer_finish(tpp, &dfops, ip);
+			if (error)
+				goto out;
+			/* Make sure we roll the transaction. */
+			error = -libxfs_trans_roll(tpp, ip);
+			if (error)
+				goto out;
+		}
+		rbe = pop_slab_cursor(scur);
+	}
+	free_slab_cursor(&scur);
+	free_slab(&rb.extslab);
+
+	/* Decrease nblocks to reflect the freed bmbt blocks. */
+	if (rb.bmbt_blocks) {
+		ip->i_d.di_nblocks -= rb.bmbt_blocks;
+		libxfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
+		error = -libxfs_trans_roll(tpp, ip);
+		if (error)
+			goto out;
+	}
+
+	return error;
+out:
+	if (scur)
+		free_slab_cursor(&scur);
+	if (rb.extslab)
+		free_slab(&rb.extslab);
+	return error;
+}
+
+/* Rebuild some inode's bmap. */
+int
+rebuild_bmap(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino,
+	int			whichfork,
+	unsigned long		nr_extents)
+{
+	struct xfs_inode	*ip;
+	struct xfs_trans	*tp;
+	unsigned long long	resblks;
+	int			error;
+
+	resblks = libxfs_bmbt_calc_size(mp, nr_extents);
+	error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+			resblks, 0, 0, &tp);
+	if (error)
+		return error;
+	error = -libxfs_iget(mp, NULL, ino, 0, &ip);
+	if (error)
+		goto out_trans;
+	error = xfs_repair_bmap(ip, &tp, whichfork);
+	if (error)
+		goto out_irele;
+
+	error = -libxfs_trans_commit(tp);
+	IRELE(ip);
+	return error;
+out_irele:
+	IRELE(ip);
+out_trans:
+	libxfs_trans_cancel(tp);
+	return error;
+}
diff --git a/repair/rebuild.h b/repair/rebuild.h
new file mode 100644
index 0000000..51a44ea
--- /dev/null
+++ b/repair/rebuild.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef REBUILD_H_
+#define REBUILD_H_
+
+int rebuild_bmap(struct xfs_mount *mp, xfs_ino_t ino, int whichfork,
+		 unsigned long nr_extents);
+
+#endif /* REBUILD_H_ */
diff --git a/repair/rmap.c b/repair/rmap.c
index ab6e583..af37829 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -46,7 +46,7 @@ struct xfs_ag_rmap {
 };
 
 static struct xfs_ag_rmap *ag_rmaps;
-static bool rmapbt_suspect;
+bool rmapbt_suspect;
 static bool refcbt_suspect;
 
 static inline int rmap_compare(const void *a, const void *b)
diff --git a/repair/rmap.h b/repair/rmap.h
index 752ece8..c970942 100644
--- a/repair/rmap.h
+++ b/repair/rmap.h
@@ -21,6 +21,7 @@
 #define RMAP_H_
 
 extern bool collect_rmaps;
+extern bool rmapbt_suspect;
 
 extern bool rmap_needs_work(struct xfs_mount *);
 



* [PATCH 2/9] xfs_db: introduce fuzz command
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
  2017-03-10 23:24 ` [PATCH 1/9] xfs_repair: rebuild bmbt from rmapbt data Darrick J. Wong
@ 2017-03-10 23:24 ` Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 3/9] xfs_db: print attribute remote value blocks Darrick J. Wong
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:24 UTC
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Introduce a new 'fuzz' command to write creative values into
disk structure fields.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 db/Makefile       |    3 
 db/bit.c          |   17 +-
 db/bit.h          |    5 -
 db/command.c      |    2 
 db/fuzz.c         |  461 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 db/fuzz.h         |   21 ++
 db/io.c           |    9 +
 db/io.h           |    1 
 db/type.c         |   44 ++++-
 db/type.h         |    1 
 man/man8/xfs_db.8 |   55 ++++++
 11 files changed, 598 insertions(+), 21 deletions(-)
 create mode 100644 db/fuzz.c
 create mode 100644 db/fuzz.h
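
Not part of the patch: a minimal standalone sketch of the bit-flipping
verbs (firstbit, middlebit, lastbit), assuming bits are numbered from the
most significant bit of the first byte, as the "presumably big-endian"
comments in fuzz.c suggest.  All demo_* names are hypothetical and do not
exist in xfs_db.  Each flip reads and writes the same bit position.

```c
#include <assert.h>
#include <limits.h>

/* Read bit number 'bit', counting from the MSB of buf[0]. */
static int
demo_getbit(const unsigned char *buf, int bit)
{
	int	shift = CHAR_BIT - 1 - (bit % CHAR_BIT);

	return (buf[bit / CHAR_BIT] >> shift) & 1;
}

/* Set or clear bit number 'bit', counting from the MSB of buf[0]. */
static void
demo_setbit(unsigned char *buf, int bit, int val)
{
	int		shift = CHAR_BIT - 1 - (bit % CHAR_BIT);
	unsigned char	mask = (unsigned char)(1u << shift);

	if (val)
		buf[bit / CHAR_BIT] |= mask;
	else
		buf[bit / CHAR_BIT] &= (unsigned char)~mask;
}

/* Invert one bit: read it, then write back the opposite value. */
static void
demo_flipbit(unsigned char *buf, int bit)
{
	assert(bit >= 0);
	demo_setbit(buf, bit, !demo_getbit(buf, bit));
}

/* Flip the first (most significant) bit of the field. */
static void
demo_flip_first(unsigned char *buf, int bitoff, int nbits)
{
	(void)nbits;
	demo_flipbit(buf, bitoff);
}

/* Flip the middle bit of the field. */
static void
demo_flip_middle(unsigned char *buf, int bitoff, int nbits)
{
	demo_flipbit(buf, bitoff + nbits / 2);
}

/* Flip the last (least significant) bit of the field. */
static void
demo_flip_last(unsigned char *buf, int bitoff, int nbits)
{
	demo_flipbit(buf, bitoff + nbits - 1);
}
```

Applying the same flip twice restores the original value, which only holds
if the bit that is read is the bit that is written.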


diff --git a/db/Makefile b/db/Makefile
index cdc0b99..feeacf6 100644
--- a/db/Makefile
+++ b/db/Makefile
@@ -12,7 +12,8 @@ HFILES = addr.h agf.h agfl.h agi.h attr.h attrshort.h bit.h block.h bmap.h \
 	dir2.h dir2sf.h dquot.h echo.h faddr.h field.h \
 	flist.h fprint.h frag.h freesp.h hash.h help.h init.h inode.h input.h \
 	io.h logformat.h malloc.h metadump.h output.h print.h quit.h sb.h \
-	 sig.h strvec.h text.h type.h write.h attrset.h symlink.h fsmap.h
+	 sig.h strvec.h text.h type.h write.h attrset.h symlink.h fsmap.h \
+	fuzz.h
 CFILES = $(HFILES:.h=.c)
 LSRCFILES = xfs_admin.sh xfs_ncheck.sh xfs_metadump.sh
 
diff --git a/db/bit.c b/db/bit.c
index 24872bf..3fcb085 100644
--- a/db/bit.c
+++ b/db/bit.c
@@ -19,13 +19,8 @@
 #include "libxfs.h"
 #include "bit.h"
 
-#undef setbit	/* defined in param.h on Linux */
-
-static int	getbit(char *ptr, int bit);
-static void	setbit(char *ptr, int bit, int val);
-
-static int
-getbit(
+int
+getbit_l(
 	char	*ptr,
 	int	bit)
 {
@@ -39,8 +34,8 @@ getbit(
 	return (*ptr & mask) >> shift;
 }
 
-static void
-setbit(
+void
+setbit_l(
 	char *ptr,
 	int  bit,
 	int  val)
@@ -106,7 +101,7 @@ getbitval(
 
 
 	for (i = 0, rval = 0LL; i < nbits; i++) {
-		if (getbit(p, bit + i)) {
+		if (getbit_l(p, bit + i)) {
 			/* If the last bit is on and we care about sign
 			 * bits and we don't have a full 64 bit
 			 * container, turn all bits on between the
@@ -162,7 +157,7 @@ setbitval(
 
 	if (bitoff % NBBY || nbits % NBBY) {
 		for (bit = 0; bit < nbits; bit++)
-			setbit(out, bit + bitoff, getbit(in, bit));
+			setbit_l(out, bit + bitoff, getbit_l(in, bit));
 	} else
 		memcpy(out + byteize(bitoff), in, byteize(nbits));
 }
diff --git a/db/bit.h b/db/bit.h
index 80ba24c..4506679 100644
--- a/db/bit.h
+++ b/db/bit.h
@@ -21,9 +21,12 @@
 #define	bitszof(x,y)	bitize(szof(x,y))
 #define	byteize(s)	((s) / NBBY)
 #define	bitoffs(s)	((s) % NBBY)
+#define	byteize_up(s)	(((s) + NBBY - 1) / NBBY)
 
 #define	BVUNSIGNED	0
 #define	BVSIGNED	1
 
 extern __int64_t	getbitval(void *obj, int bitoff, int nbits, int flags);
-extern void             setbitval(void *obuf, int bitoff, int nbits, void *ibuf);
+extern void		setbitval(void *obuf, int bitoff, int nbits, void *ibuf);
+extern int		getbit_l(char *ptr, int bit);
+extern void		setbit_l(char *ptr, int bit, int val);
diff --git a/db/command.c b/db/command.c
index 3d7cfd7..0eb4944 100644
--- a/db/command.c
+++ b/db/command.c
@@ -51,6 +51,7 @@
 #include "dquot.h"
 #include "fsmap.h"
 #include "crc.h"
+#include "fuzz.h"
 
 cmdinfo_t	*cmdtab;
 int		ncmds;
@@ -146,4 +147,5 @@ init_commands(void)
 	type_init();
 	write_init();
 	dquot_init();
+	fuzz_init();
 }
diff --git a/db/fuzz.c b/db/fuzz.c
new file mode 100644
index 0000000..061ecd1
--- /dev/null
+++ b/db/fuzz.c
@@ -0,0 +1,461 @@
+/*
+ * Copyright (c) 2000-2002,2005 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "libxfs.h"
+#include <ctype.h>
+#include <time.h>
+#include "bit.h"
+#include "block.h"
+#include "command.h"
+#include "type.h"
+#include "faddr.h"
+#include "fprint.h"
+#include "field.h"
+#include "flist.h"
+#include "io.h"
+#include "init.h"
+#include "output.h"
+#include "print.h"
+#include "write.h"
+#include "malloc.h"
+
+static int	fuzz_f(int argc, char **argv);
+static void     fuzz_help(void);
+
+static const cmdinfo_t	fuzz_cmd =
+	{ "fuzz", NULL, fuzz_f, 0, -1, 0, N_("[-c] [-d] field fuzzcmd..."),
+	  N_("fuzz values on disk"), fuzz_help };
+
+void
+fuzz_init(void)
+{
+	if (!expert_mode)
+		return;
+
+	add_command(&fuzz_cmd);
+	srand48(clock());
+}
+
+static void
+fuzz_help(void)
+{
+	dbprintf(_(
+"\n"
+" The 'fuzz' command fuzzes fields in any on-disk data structure.  For\n"
+" block fuzzing, see the 'blocktrash' or 'write' commands.\n"
+"\n"
+" Examples:\n"
+"  Struct mode: 'fuzz core.uid zeroes'    - set an inode uid field to 0.\n"
+"               'fuzz crc ones'           - set a crc field to all ones.\n"
+"               'fuzz bno[11] firstbit'   - flip the high bit of a block array entry.\n"
+"               'fuzz keys[5].startblock add'    - increase a btree key value.\n"
+"               'fuzz uuid random'        - randomize the superblock uuid.\n"
+"\n"
+" In data mode type 'fuzz' by itself for a list of specific commands.\n\n"
+" Specifying the -c option will allow writes of invalid (corrupt) data with\n"
+" an invalid CRC. Specifying the -d option will allow writes of invalid data,\n"
+" but still recalculate the CRC so we are forced to check and detect the\n"
+" invalid data appropriately.\n\n"
+));
+
+}
+
+static int
+fuzz_f(
+	int		argc,
+	char		**argv)
+{
+	pfunc_t	pf;
+	extern char *progname;
+	int c;
+	bool corrupt = false;	/* Allow write of bad data w/ invalid CRC */
+	bool invalid_data = false; /* Allow write of bad data w/ valid CRC */
+	struct xfs_buf_ops local_ops;
+	const struct xfs_buf_ops *stashed_ops = NULL;
+
+	if (x.isreadonly & LIBXFS_ISREADONLY) {
+		dbprintf(_("%s started in read only mode, fuzzing disabled\n"),
+			progname);
+		return 0;
+	}
+
+	if (cur_typ == NULL) {
+		dbprintf(_("no current type\n"));
+		return 0;
+	}
+
+	pf = cur_typ->pfunc;
+	if (pf == NULL) {
+		dbprintf(_("no handler function for type %s, fuzz unsupported.\n"),
+			 cur_typ->name);
+		return 0;
+	}
+
+	while ((c = getopt(argc, argv, "cd")) != EOF) {
+		switch (c) {
+		case 'c':
+			corrupt = true;
+			break;
+		case 'd':
+			invalid_data = true;
+			break;
+		default:
+			dbprintf(_("bad option for fuzz command\n"));
+			return 0;
+		}
+	}
+
+	if (corrupt && invalid_data) {
+		dbprintf(_("Cannot specify both -c and -d options\n"));
+		return 0;
+	}
+
+	if (invalid_data && iocur_top->typ->crc_off == TYP_F_NO_CRC_OFF &&
+			!iocur_top->ino_buf) {
+		dbprintf(_("Cannot recalculate CRCs on this type of object\n"));
+		return 0;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	/*
+	 * If the buffer has no verifier or we are using standard verifier
+	 * paths, then just fuzz it and return
+	 */
+	if (!iocur_top->bp->b_ops ||
+	    !(corrupt || invalid_data)) {
+		(*pf)(DB_FUZZ, cur_typ->fields, argc, argv);
+		return 0;
+	}
+
+
+	/* Temporarily remove write verifier to write bad data */
+	stashed_ops = iocur_top->bp->b_ops;
+	local_ops.verify_read = stashed_ops->verify_read;
+	iocur_top->bp->b_ops = &local_ops;
+
+	if (corrupt) {
+		local_ops.verify_write = xfs_dummy_verify;
+		dbprintf(_("Allowing fuzz of corrupted data and bad CRC\n"));
+	} else if (iocur_top->ino_buf) {
+		local_ops.verify_write = xfs_verify_recalc_inode_crc;
+		dbprintf(_("Allowing fuzz of corrupted inode with good CRC\n"));
+	} else { /* invalid data */
+		local_ops.verify_write = xfs_verify_recalc_crc;
+		dbprintf(_("Allowing fuzz of corrupted data with good CRC\n"));
+	}
+
+	(*pf)(DB_FUZZ, cur_typ->fields, argc, argv);
+
+	iocur_top->bp->b_ops = stashed_ops;
+
+	return 0;
+}
+
+/* Write zeroes to the field */
+static bool
+fuzz_zeroes(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	char		*out = buf;
+	int		bit;
+
+	if (bitoff % NBBY || nbits % NBBY) {
+		for (bit = 0; bit < nbits; bit++)
+			setbit_l(out, bit + bitoff, 0);
+	} else
+		memset(out + byteize(bitoff), 0, byteize(nbits));
+	return true;
+}
+
+/* Write ones to the field */
+static bool
+fuzz_ones(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	char		*out = buf;
+	int		bit;
+
+	if (bitoff % NBBY || nbits % NBBY) {
+		for (bit = 0; bit < nbits; bit++)
+			setbit_l(out, bit + bitoff, 1);
+	} else
+		memset(out + byteize(bitoff), 0xFF, byteize(nbits));
+	return true;
+}
+
+/* Flip the high bit in the (presumably big-endian) field */
+static bool
+fuzz_firstbit(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	setbit_l((char *)buf, bitoff, !getbit_l((char *)buf, bitoff));
+	return true;
+}
+
+/* Flip the low bit in the (presumably big-endian) field */
+static bool
+fuzz_lastbit(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	setbit_l((char *)buf, bitoff + nbits - 1,
+			!getbit_l((char *)buf, bitoff + nbits - 1));
+	return true;
+}
+
+/* Flip the middle bit in the (presumably big-endian) field */
+static bool
+fuzz_middlebit(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	setbit_l((char *)buf, bitoff + nbits / 2,
+			!getbit_l((char *)buf, bitoff + nbits / 2));
+	return true;
+}
+
+/* Format and shift a number into a buffer for setbitval. */
+static char *
+format_number(
+	uint64_t	val,
+	__be64		*out,
+	int		bit_length)
+{
+	int		offset;
+	char		*rbuf = (char *)out;
+
+	/*
+	 * If the length of the field is not a multiple of a byte, push
+	 * the bits up in the field, so the most significant field bit is
+	 * the most significant bit in the byte:
+	 *
+	 * before:
+	 * val  |----|----|----|----|----|--MM|mmmm|llll|
+	 * after
+	 * val  |----|----|----|----|----|MMmm|mmll|ll00|
+	 */
+	offset = bit_length % NBBY;
+	if (offset)
+		val <<= (NBBY - offset);
+
+	/*
+	 * convert to big endian and copy into the array
+	 * rbuf |----|----|----|----|----|MMmm|mmll|ll00|
+	 */
+	*out = cpu_to_be64(val);
+
+	/*
+	 * Align the array to point to the field in the array.
+	 *  rbuf  = |MMmm|mmll|ll00|
+	 */
+	offset = sizeof(__be64) - 1 - ((bit_length - 1) / NBBY);
+	return rbuf + offset;
+}
+
+/* Increase the value by some small prime number. */
+static bool
+fuzz_add(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	uint64_t	val;
+	__be64		out;
+	char		*b;
+
+	if (nbits > 64)
+		return false;
+
+	val = getbitval(buf, bitoff, nbits, BVUNSIGNED);
+	val += (nbits > 8 ? 2017 : 137);
+	b = format_number(val, &out, nbits);
+	setbitval(buf, bitoff, nbits, b);
+
+	return true;
+}
+
+/* Decrease the value by some small prime number. */
+static bool
+fuzz_sub(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	uint64_t	val;
+	__be64		out;
+	char		*b;
+
+	if (nbits > 64)
+		return false;
+
+	val = getbitval(buf, bitoff, nbits, BVUNSIGNED);
+	val -= (nbits > 8 ? 2017 : 137);
+	b = format_number(val, &out, nbits);
+	setbitval(buf, bitoff, nbits, b);
+
+	return true;
+}
+
+/* Randomize the field. */
+static bool
+fuzz_random(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	int		i, bytes;
+	char		*b, *rbuf;
+
+	bytes = byteize_up(nbits);
+	rbuf = b = malloc(bytes);
+	if (!b) {
+		perror("fuzz_random");
+		return false;
+	}
+
+	for (i = 0; i < bytes; i++)
+		*b++ = (char)lrand48();
+
+	setbitval(buf, bitoff, nbits, rbuf);
+	free(rbuf);
+
+	return true;
+}
+
+struct fuzzcmd {
+	const char	*verb;
+	bool		(*fn)(void *buf, int bitoff, int nbits);
+};
+
+/* Keep these verbs in sync with enum fuzzcmds. */
+static struct fuzzcmd fuzzverbs[] = {
+	{"zeroes",		fuzz_zeroes},
+	{"ones",		fuzz_ones},
+	{"firstbit",		fuzz_firstbit},
+	{"middlebit",		fuzz_middlebit},
+	{"lastbit",		fuzz_lastbit},
+	{"add",			fuzz_add},
+	{"sub",			fuzz_sub},
+	{"random",		fuzz_random},
+	{NULL,			NULL},
+};
+
+/* ARGSUSED */
+void
+fuzz_struct(
+	const field_t	*fields,
+	int		argc,
+	char		**argv)
+{
+	const ftattr_t	*fa;
+	flist_t		*fl;
+	flist_t		*sfl;
+	int		bit_length;
+	struct fuzzcmd	*fc;
+	bool		success;
+	int		parentoffset;
+
+	if (argc != 2) {
+		dbprintf(_("Usage: fuzz fieldname verb\n"));
+		dbprintf("Verbs: %s", fuzzverbs->verb);
+		for (fc = fuzzverbs + 1; fc->verb != NULL; fc++)
+			dbprintf(", %s", fc->verb);
+		dbprintf(".\n");
+		return;
+	}
+
+	fl = flist_scan(argv[0]);
+	if (!fl) {
+		dbprintf(_("unable to parse '%s'.\n"), argv[0]);
+		return;
+	}
+
+	/* Find our fuzz verb */
+	for (fc = fuzzverbs; fc->verb != NULL; fc++)
+		if (!strcmp(fc->verb, argv[1]))
+			break;
+	if (fc->fn == NULL) {
+		dbprintf(_("Unknown fuzz command '%s'.\n"), argv[1]);
+		return;
+	}
+
+	/* if we're a root field type, go down 1 layer to get field list */
+	if (fields->name[0] == '\0') {
+		fa = &ftattrtab[fields->ftyp];
+		ASSERT(fa->ftyp == fields->ftyp);
+		fields = fa->subfld;
+	}
+
+	/* run down the field list and set offsets into the data */
+	if (!flist_parse(fields, fl, iocur_top->data, 0)) {
+		flist_free(fl);
+		dbprintf(_("parsing error\n"));
+		return;
+	}
+
+	sfl = fl;
+	parentoffset = 0;
+	while (sfl->child) {
+		parentoffset = sfl->offset;
+		sfl = sfl->child;
+	}
+
+	/*
+	 * For structures, fsize * fcount tells us the size of the region we are
+	 * modifying, which is usually a single structure member and is pointed
+	 * to by the last child in the list.
+	 *
+	 * However, if the base structure is an array and we have a direct index
+	 * into the array (e.g. write bno[5]) then we are returned a single
+	 * flist object with the offset pointing directly at the location we
+	 * need to modify. The length of the object we are modifying is then
+	 * determined by the size of the individual array entry (fsize) and the
+	 * indexes defined in the object, not the overall size of the array
+	 * (which is what fcount returns).
+	 */
+	bit_length = fsize(sfl->fld, iocur_top->data, parentoffset, 0);
+	if (sfl->fld->flags & FLD_ARRAY)
+		bit_length *= sfl->high - sfl->low + 1;
+	else
+		bit_length *= fcount(sfl->fld, iocur_top->data, parentoffset);
+
+	/* Fuzz the value */
+	success = fc->fn(iocur_top->data, sfl->offset, bit_length);
+	if (!success) {
+		dbprintf(_("unable to fuzz field '%s'\n"), argv[0]);
+		flist_free(fl);
+		return;
+	}
+
+	/* Write the fuzzed value back */
+	write_cur();
+
+	flist_print(fl);
+	print_flist(fl);
+	flist_free(fl);
+}
diff --git a/db/fuzz.h b/db/fuzz.h
new file mode 100644
index 0000000..c203eb5
--- /dev/null
+++ b/db/fuzz.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+extern void	fuzz_init(void);
+extern void	fuzz_struct(const field_t *fields, int argc, char **argv);
diff --git a/db/io.c b/db/io.c
index f398195..1f316d8 100644
--- a/db/io.c
+++ b/db/io.c
@@ -465,6 +465,15 @@ xfs_dummy_verify(
 }
 
 void
+xfs_verify_recalc_inode_crc(
+	struct xfs_buf *bp)
+{
+	ASSERT(iocur_top->ino_buf);
+	libxfs_dinode_calc_crc(mp, iocur_top->data);
+	iocur_top->ino_crc_ok = 1;
+}
+
+void
 xfs_verify_recalc_crc(
 	struct xfs_buf *bp)
 {
diff --git a/db/io.h b/db/io.h
index c69e9ce..12d96c2 100644
--- a/db/io.h
+++ b/db/io.h
@@ -64,6 +64,7 @@ extern void	set_cur(const struct typ *t, __int64_t d, int c, int ring_add,
 extern void     ring_add(void);
 extern void	set_iocur_type(const struct typ *t);
 extern void	xfs_dummy_verify(struct xfs_buf *bp);
+extern void	xfs_verify_recalc_inode_crc(struct xfs_buf *bp);
 extern void	xfs_verify_recalc_crc(struct xfs_buf *bp);
 
 /*
diff --git a/db/type.c b/db/type.c
index 10fa54e..adab10a 100644
--- a/db/type.c
+++ b/db/type.c
@@ -39,6 +39,7 @@
 #include "dir2.h"
 #include "text.h"
 #include "symlink.h"
+#include "fuzz.h"
 
 static const typ_t	*findtyp(char *name);
 static int		type_f(int argc, char **argv);
@@ -254,10 +255,17 @@ handle_struct(
 	int           argc,
 	char          **argv)
 {
-	if (action == DB_WRITE)
+	switch (action) {
+	case DB_FUZZ:
+		fuzz_struct(fields, argc, argv);
+		break;
+	case DB_WRITE:
 		write_struct(fields, argc, argv);
-	else
+		break;
+	case DB_READ:
 		print_struct(fields, argc, argv);
+		break;
+	}
 }
 
 void
@@ -267,10 +275,17 @@ handle_string(
 	int           argc,
 	char          **argv)
 {
-	if (action == DB_WRITE)
+	switch (action) {
+	case DB_WRITE:
 		write_string(fields, argc, argv);
-	else
+		break;
+	case DB_READ:
 		print_string(fields, argc, argv);
+		break;
+	case DB_FUZZ:
+		dbprintf(_("string fuzzing not supported.\n"));
+		break;
+	}
 }
 
 void
@@ -280,10 +295,17 @@ handle_block(
 	int           argc,
 	char          **argv)
 {
-	if (action == DB_WRITE)
+	switch (action) {
+	case DB_WRITE:
 		write_block(fields, argc, argv);
-	else
+		break;
+	case DB_READ:
 		print_block(fields, argc, argv);
+		break;
+	case DB_FUZZ:
+		dbprintf(_("use 'blocktrash' or 'write' to fuzz a block.\n"));
+		break;
+	}
 }
 
 void
@@ -293,6 +315,14 @@ handle_text(
 	int           argc,
 	char          **argv)
 {
-	if (action != DB_WRITE)
+	switch (action) {
+	case DB_FUZZ:
+		/* fall through */
+	case DB_WRITE:
+		dbprintf(_("text writing/fuzzing not supported.\n"));
+		break;
+	case DB_READ:
 		print_text(fields, argc, argv);
+		break;
+	}
 }
diff --git a/db/type.h b/db/type.h
index 87ff107..a50d705 100644
--- a/db/type.h
+++ b/db/type.h
@@ -30,6 +30,7 @@ typedef enum typnm
 	TYP_TEXT, TYP_FINOBT, TYP_NONE
 } typnm_t;
 
+#define DB_FUZZ  2
 #define DB_WRITE 1
 #define DB_READ  0
 
diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
index 460d89d..55e0629 100644
--- a/man/man8/xfs_db.8
+++ b/man/man8/xfs_db.8
@@ -594,6 +594,55 @@ in units of 512-byte blocks, no matter what the filesystem's block size is.
 .BI "The optional " start " and " end " arguments can be used to constrain
 the output to a particular range of disk blocks.
 .TP
+.BI "fuzz [\-c] [\-d] " "field action"
+Write garbage into a specific structure field on disk.
+Expert mode must be enabled to use this command.
+The operation happens immediately; there is no buffering.
+.IP
+The fuzz command can take the following
+.IR action "s"
+against a field:
+.RS 1.0i
+.TP 0.4i
+.B zeroes
+Clears all bits in the field.
+.TP 0.4i
+.B ones
+Sets all bits in the field.
+.TP 0.4i
+.B firstbit
+Flips the first bit in the field.
+For a scalar value, this is the highest bit.
+.TP 0.4i
+.B middlebit
+Flips the middle bit in the field.
+.TP 0.4i
+.B lastbit
+Flips the last bit in the field.
+For a scalar value, this is the lowest bit.
+.TP 0.4i
+.B add
+Adds a small value to a scalar field.
+.TP 0.4i
+.B sub
+Subtracts a small value from a scalar field.
+.TP 0.4i
+.B random
+Randomizes the contents of the field.
+.RE
+.IP
+The following switches affect the write behavior:
+.RS 1.0i
+.TP 0.4i
+.B \-c
+Skip write verifiers and CRC recalculation; allows invalid data to be written
+to disk.
+.TP 0.4i
+.B \-d
+Skip write verifiers but perform CRC recalculation; allows invalid data to be
+written to disk to test detection of invalid data.
+.RE
+.TP
 .BI hash " string
 Prints the hash value of
 .I string
@@ -755,7 +804,7 @@ and
 bits respectively, and their string equivalent reported
 (but no modifications are made).
 .TP
-.BI "write [\-c] [" "field value" "] ..."
+.BI "write [\-c] [\-d] [" "field value" "] ..."
 Write a value to disk.
 Specific fields can be set in structures (struct mode),
 or a block can be set to data values (data mode),
@@ -778,6 +827,10 @@ with no arguments gives more information on the allowed commands.
 .B \-c
 Skip write verifiers and CRC recalculation; allows invalid data to be written
 to disk.
+.TP 0.4i
+.B \-d
+Skip write verifiers but perform CRC recalculation; allows invalid data to be
+written to disk to test detection of invalid data.
 .RE
 .SH TYPES
 This section gives the fields in each structure type and their meanings.
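Note on the bit-flipping actions documented in the fuzz man page text above:
the sketch below shows one way firstbit, middlebit, and lastbit could act on a
field described by a bit offset and a bit length.  It assumes MSB-first bit
numbering within each byte, so that the "first" bit of a big-endian scalar is
its highest bit, and it assumes the "middle" bit is the one at bit_length / 2.
The names flip_bit, fuzz_firstbit, fuzz_middlebit, and fuzz_lastbit are
hypothetical and are not taken from db/fuzz.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flip one bit; bit 0 is the most significant bit of buf[0]. */
static void flip_bit(uint8_t *buf, size_t bitno)
{
	assert(buf != NULL);
	buf[bitno / 8] ^= (uint8_t)(0x80u >> (bitno % 8));
}

/* Each helper returns 1 on success and 0 if the field is empty. */
static int fuzz_firstbit(uint8_t *buf, size_t bit_offset, size_t bit_length)
{
	if (bit_length == 0)
		return 0;
	flip_bit(buf, bit_offset);
	return 1;
}

static int fuzz_middlebit(uint8_t *buf, size_t bit_offset, size_t bit_length)
{
	if (bit_length == 0)
		return 0;
	/* Assumption: "middle" means the bit at half the field length. */
	flip_bit(buf, bit_offset + bit_length / 2);
	return 1;
}

static int fuzz_lastbit(uint8_t *buf, size_t bit_offset, size_t bit_length)
{
	if (bit_length == 0)
		return 0;
	flip_bit(buf, bit_offset + bit_length - 1);
	return 1;
}
```

The (offset, length) pair plays the same role as sfl->offset and bit_length in
fuzz_struct(): the verb only needs to know where the field starts and how many
bits it spans.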



* [PATCH 3/9] xfs_db: print attribute remote value blocks
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
  2017-03-10 23:24 ` [PATCH 1/9] xfs_repair: rebuild bmbt from rmapbt data Darrick J. Wong
  2017-03-10 23:24 ` [PATCH 2/9] xfs_db: introduce fuzz command Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 4/9] xfs_db: write / fuzz bad values into dir/attr blocks with good CRCs Darrick J. Wong
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Teach xfs_db how to print the contents of xattr remote value blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
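Note: the data-length logic in attr3_remote_count() can be read as a clamp.
The number of value bytes to print is rm_bytes, but never more than what fits
in the block after the header.  A standalone sketch of that rule follows; it
uses host-endian integers and a hypothetical hdr_size parameter instead of the
real struct xfs_attr3_rmt_hdr, and remote_value_count is an illustrative name
only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many value bytes to display for one remote xattr block.
 * magic_ok is nonzero if the block carries the expected magic number.
 */
static uint32_t remote_value_count(int magic_ok, uint32_t rm_bytes,
				   uint32_t hdr_size, uint32_t blocksize)
{
	assert(hdr_size <= blocksize);
	if (!magic_ok)
		return 0;
	/* Do the sum in 64 bits so a corrupt rm_bytes cannot wrap around. */
	if ((uint64_t)rm_bytes + hdr_size > blocksize)
		return blocksize - hdr_size;
	return rm_bytes;
}
```

With a 4096-byte block and a 56-byte header, a (possibly corrupt) rm_bytes of
5000 would be clamped to 4040 bytes, while a value of 100 is printed in full.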
 db/attr.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 db/attr.h  |    1 +
 db/field.c |    3 +++
 db/field.h |    1 +
 4 files changed, 64 insertions(+)


diff --git a/db/attr.c b/db/attr.c
index e26ac67..0fffbc2 100644
--- a/db/attr.c
+++ b/db/attr.c
@@ -41,6 +41,9 @@ static int	attr_leaf_nvlist_offset(void *obj, int startoff, int idx);
 static int	attr_node_btree_count(void *obj, int startoff);
 static int	attr_node_hdr_count(void *obj, int startoff);
 
+static int	attr_remote_count(void *obj, int startoff);
+static int	attr3_remote_count(void *obj, int startoff);
+
 const field_t	attr_hfld[] = {
 	{ "", FLDT_ATTR, OI(0), C1, 0, TYP_NONE },
 	{ NULL }
@@ -53,6 +56,7 @@ const field_t	attr_flds[] = {
 	  FLD_COUNT, TYP_NONE },
 	{ "hdr", FLDT_ATTR_NODE_HDR, OI(NOFF(hdr)), attr_node_hdr_count,
 	  FLD_COUNT, TYP_NONE },
+	{ "data", FLDT_CHARNS, OI(0), attr_remote_count, FLD_COUNT, TYP_NONE },
 	{ "entries", FLDT_ATTR_LEAF_ENTRY, OI(LOFF(entries)),
 	  attr_leaf_entries_count, FLD_ARRAY|FLD_COUNT, TYP_NONE },
 	{ "btree", FLDT_ATTR_NODE_ENTRY, OI(NOFF(__btree)), attr_node_btree_count,
@@ -197,6 +201,33 @@ attr3_leaf_hdr_count(
 	return be16_to_cpu(leaf->hdr.info.hdr.magic) == XFS_ATTR3_LEAF_MAGIC;
 }
 
+static int
+attr_remote_count(
+	void		*obj,
+	int		startoff)
+{
+	if (attr_leaf_hdr_count(obj, startoff) == 0 &&
+	    attr_node_hdr_count(obj, startoff) == 0)
+		return mp->m_sb.sb_blocksize;
+	return 0;
+}
+
+static int
+attr3_remote_count(
+	void		*obj,
+	int		startoff)
+{
+	struct xfs_attr3_rmt_hdr	*hdr = obj;
+
+	ASSERT(startoff == 0);
+
+	if (hdr->rm_magic != cpu_to_be32(XFS_ATTR3_RMT_MAGIC))
+		return 0;
+	if (be32_to_cpu(hdr->rm_bytes) + sizeof(*hdr) > mp->m_sb.sb_blocksize)
+		return mp->m_sb.sb_blocksize - sizeof(*hdr);
+	return be32_to_cpu(hdr->rm_bytes);
+}
+
 typedef int (*attr_leaf_entry_walk_f)(struct xfs_attr_leafblock *,
 				      struct xfs_attr_leaf_entry *, int);
 static int
@@ -477,6 +508,17 @@ attr3_node_hdr_count(
 	return be16_to_cpu(node->hdr.info.hdr.magic) == XFS_DA3_NODE_MAGIC;
 }
 
+static int
+attr3_remote_hdr_count(
+	void			*obj,
+	int			startoff)
+{
+	struct xfs_attr3_rmt_hdr	*node = obj;
+
+	ASSERT(startoff == 0);
+	return be32_to_cpu(node->rm_magic) == XFS_ATTR3_RMT_MAGIC;
+}
+
 int
 attr_size(
 	void	*obj,
@@ -501,6 +543,8 @@ const field_t	attr3_flds[] = {
 	  FLD_COUNT, TYP_NONE },
 	{ "hdr", FLDT_DA3_NODE_HDR, OI(N3OFF(hdr)), attr3_node_hdr_count,
 	  FLD_COUNT, TYP_NONE },
+	{ "hdr", FLDT_ATTR3_REMOTE_HDR, OI(0), attr3_remote_hdr_count,
+	  FLD_COUNT, TYP_NONE },
 	{ "entries", FLDT_ATTR_LEAF_ENTRY, OI(L3OFF(entries)),
 	  attr3_leaf_entries_count, FLD_ARRAY|FLD_COUNT, TYP_NONE },
 	{ "btree", FLDT_ATTR_NODE_ENTRY, OI(N3OFF(__btree)),
@@ -523,6 +567,21 @@ const field_t	attr3_leaf_hdr_flds[] = {
 	{ NULL }
 };
 
+#define	RM3OFF(f)	bitize(offsetof(struct xfs_attr3_rmt_hdr, rm_ ## f))
+const struct field	attr3_remote_crc_flds[] = {
+	{ "magic", FLDT_UINT32X, OI(RM3OFF(magic)), C1, 0, TYP_NONE },
+	{ "offset", FLDT_UINT32D, OI(RM3OFF(offset)), C1, 0, TYP_NONE },
+	{ "bytes", FLDT_UINT32D, OI(RM3OFF(bytes)), C1, 0, TYP_NONE },
+	{ "crc", FLDT_CRC, OI(RM3OFF(crc)), C1, 0, TYP_NONE },
+	{ "uuid", FLDT_UUID, OI(RM3OFF(uuid)), C1, 0, TYP_NONE },
+	{ "owner", FLDT_INO, OI(RM3OFF(owner)), C1, 0, TYP_NONE },
+	{ "bno", FLDT_DFSBNO, OI(RM3OFF(blkno)), C1, 0, TYP_BMAPBTD },
+	{ "lsn", FLDT_UINT64X, OI(RM3OFF(lsn)), C1, 0, TYP_NONE },
+	{ "data", FLDT_CHARNS, OI(bitize(sizeof(struct xfs_attr3_rmt_hdr))),
+		attr3_remote_count, FLD_COUNT, TYP_NONE },
+	{ NULL }
+};
+
 /*
  * Special read verifier for attribute buffers. Detect the magic number
  * appropriately and set the correct verifier and call it.
diff --git a/db/attr.h b/db/attr.h
index bc3431f..d7bb579 100644
--- a/db/attr.h
+++ b/db/attr.h
@@ -30,6 +30,7 @@ extern const field_t	attr3_flds[];
 extern const field_t	attr3_hfld[];
 extern const field_t	attr3_leaf_hdr_flds[];
 extern const field_t	attr3_node_hdr_flds[];
+extern const field_t	attr3_remote_crc_flds[];
 
 extern int	attr_leaf_name_size(void *obj, int startoff, int idx);
 extern int	attr_size(void *obj, int startoff, int idx);
diff --git a/db/field.c b/db/field.c
index 1968dd5..e8bbbe3 100644
--- a/db/field.c
+++ b/db/field.c
@@ -97,6 +97,9 @@ const ftattr_t	ftattrtab[] = {
 	{ FLDT_ATTR3_NODE_HDR, "attr3_node_hdr", NULL,
 	  (char *)da3_node_hdr_flds, SI(bitsz(struct xfs_da3_node_hdr)),
 	  0, NULL, da3_node_hdr_flds },
+	{ FLDT_ATTR3_REMOTE_HDR, "attr3_remote_hdr", NULL,
+	  (char *)attr3_remote_crc_flds, attr_size, FTARG_SIZE, NULL,
+	  attr3_remote_crc_flds },
 
 	{ FLDT_BMAPBTA, "bmapbta", NULL, (char *)bmapbta_flds, btblock_size,
 	  FTARG_SIZE, NULL, bmapbta_flds },
diff --git a/db/field.h b/db/field.h
index 53616f1..e5a943b 100644
--- a/db/field.h
+++ b/db/field.h
@@ -46,6 +46,7 @@ typedef enum fldt	{
 	FLDT_ATTR3,
 	FLDT_ATTR3_LEAF_HDR,
 	FLDT_ATTR3_NODE_HDR,
+	FLDT_ATTR3_REMOTE_HDR,
 
 	FLDT_BMAPBTA,
 	FLDT_BMAPBTA_CRC,



* [PATCH 4/9] xfs_db: write / fuzz bad values into dir/attr blocks with good CRCs
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2017-03-10 23:25 ` [PATCH 3/9] xfs_db: print attribute remote value blocks Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 5/9] xfs_io: provide an interface to the scrub ioctls Darrick J. Wong
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Extend typ_t to (optionally) store a pointer to a function to calculate
the CRC of the block, provide functions to do this for the dir3 and
attr3 types, and then wire up the fuzz and write commands so that we can
effectively fuzz directory and extended attribute block fields.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
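Note: the dispatch that this patch adds to fuzz_f() and write_f() comes down
to one test.  If the type's crc_off holds the TYP_F_CRC_FUNC sentinel, the
per-type set_crc function is used as the write verifier; otherwise the generic
offset-based recalculation is used.  A simplified standalone sketch follows.
struct buf, struct type_desc, pick_crc_updater, and the two updater functions
are hypothetical stand-ins, not the xfs_db definitions.

```c
#include <assert.h>
#include <stddef.h>

#define NO_CRC_OFF	(-1UL)	/* type has no CRC field */
#define CRC_FUNC	(-2UL)	/* CRC location depends on block contents */

/* Stand-in for a buffer; the counters only record which path ran. */
struct buf {
	int	func_updates;
	int	generic_updates;
};

typedef void (*crc_fn)(struct buf *);

struct type_desc {
	const char	*name;
	unsigned long	crc_off;
	crc_fn		set_crc;	/* only valid when crc_off == CRC_FUNC */
};

static void generic_recalc(struct buf *bp)
{
	bp->generic_updates++;
}

static void dir_set_crc(struct buf *bp)
{
	bp->func_updates++;
}

/* Choose the write-time CRC updater for a type. */
static crc_fn pick_crc_updater(const struct type_desc *t)
{
	assert(t != NULL);
	if (t->crc_off == CRC_FUNC)
		return t->set_crc;
	return generic_recalc;
}
```

Using a sentinel in the existing crc_off field keeps every other table entry
unchanged; only the dir3 and attr3 rows need the extra function pointer.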
 db/attr.c  |   32 ++++++++++++++++++++++++++++++++
 db/attr.h  |    1 +
 db/dir2.c  |   37 +++++++++++++++++++++++++++++++++++++
 db/dir2.h  |    1 +
 db/fuzz.c  |    3 +++
 db/type.c  |    8 ++++----
 db/type.h  |    2 ++
 db/write.c |    3 +++
 8 files changed, 83 insertions(+), 4 deletions(-)


diff --git a/db/attr.c b/db/attr.c
index 0fffbc2..5a97925 100644
--- a/db/attr.c
+++ b/db/attr.c
@@ -582,6 +582,38 @@ const struct field	attr3_remote_crc_flds[] = {
 	{ NULL }
 };
 
+/* Set the CRC. */
+void
+xfs_attr3_set_crc(
+	struct xfs_buf		*bp)
+{
+	__be32			magic32;
+	__be16			magic16;
+
+	magic32 = *(__be32 *)bp->b_addr;
+	magic16 = ((struct xfs_da_blkinfo *)bp->b_addr)->magic;
+
+	switch (magic16) {
+	case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
+		xfs_buf_update_cksum(bp, XFS_ATTR3_LEAF_CRC_OFF);
+		return;
+	case cpu_to_be16(XFS_DA3_NODE_MAGIC):
+		xfs_buf_update_cksum(bp, XFS_DA3_NODE_CRC_OFF);
+		return;
+	default:
+		break;
+	}
+
+	switch (magic32) {
+	case cpu_to_be32(XFS_ATTR3_RMT_MAGIC):
+		xfs_buf_update_cksum(bp, XFS_ATTR3_RMT_CRC_OFF);
+		return;
+	default:
+		dbprintf(_("Unknown attribute buffer type!\n"));
+		break;
+	}
+}
+
 /*
  * Special read verifier for attribute buffers. Detect the magic number
  * appropriately and set the correct verifier and call it.
diff --git a/db/attr.h b/db/attr.h
index d7bb579..9ea7429 100644
--- a/db/attr.h
+++ b/db/attr.h
@@ -34,5 +34,6 @@ extern const field_t	attr3_remote_crc_flds[];
 
 extern int	attr_leaf_name_size(void *obj, int startoff, int idx);
 extern int	attr_size(void *obj, int startoff, int idx);
+extern void	xfs_attr3_set_crc(struct xfs_buf *bp);
 
 extern const struct xfs_buf_ops xfs_attr3_db_buf_ops;
diff --git a/db/dir2.c b/db/dir2.c
index 533f705..3e21a7b 100644
--- a/db/dir2.c
+++ b/db/dir2.c
@@ -981,6 +981,43 @@ const field_t	da3_node_hdr_flds[] = {
 	{ NULL }
 };
 
+/* Set the CRC. */
+void
+xfs_dir3_set_crc(
+	struct xfs_buf		*bp)
+{
+	__be32			magic32;
+	__be16			magic16;
+
+	magic32 = *(__be32 *)bp->b_addr;
+	magic16 = ((struct xfs_da_blkinfo *)bp->b_addr)->magic;
+
+	switch (magic32) {
+	case cpu_to_be32(XFS_DIR3_BLOCK_MAGIC):
+	case cpu_to_be32(XFS_DIR3_DATA_MAGIC):
+		xfs_buf_update_cksum(bp, XFS_DIR3_DATA_CRC_OFF);
+		return;
+	case cpu_to_be32(XFS_DIR3_FREE_MAGIC):
+		xfs_buf_update_cksum(bp, XFS_DIR3_FREE_CRC_OFF);
+		return;
+	default:
+		break;
+	}
+
+	switch (magic16) {
+	case cpu_to_be16(XFS_DIR3_LEAF1_MAGIC):
+	case cpu_to_be16(XFS_DIR3_LEAFN_MAGIC):
+		xfs_buf_update_cksum(bp, XFS_DIR3_LEAF_CRC_OFF);
+		return;
+	case cpu_to_be16(XFS_DA3_NODE_MAGIC):
+		xfs_buf_update_cksum(bp, XFS_DA3_NODE_CRC_OFF);
+		return;
+	default:
+		dbprintf(_("Unknown directory buffer type! %x %x\n"), magic32, magic16);
+		break;
+	}
+}
+
 /*
  * Special read verifier for directory buffers. Detect the magic number
  * appropriately and set the correct verifier and call it.
diff --git a/db/dir2.h b/db/dir2.h
index 0c2a62e..1b87cd2 100644
--- a/db/dir2.h
+++ b/db/dir2.h
@@ -60,5 +60,6 @@ static inline uint8_t *xfs_dir2_sf_inumberp(xfs_dir2_sf_entry_t *sfep)
 
 extern int	dir2_data_union_size(void *obj, int startoff, int idx);
 extern int	dir2_size(void *obj, int startoff, int idx);
+extern void	xfs_dir3_set_crc(struct xfs_buf *bp);
 
 extern const struct xfs_buf_ops xfs_dir3_db_buf_ops;
diff --git a/db/fuzz.c b/db/fuzz.c
index 061ecd1..f294331 100644
--- a/db/fuzz.c
+++ b/db/fuzz.c
@@ -156,6 +156,9 @@ fuzz_f(
 	} else if (iocur_top->ino_buf) {
 		local_ops.verify_write = xfs_verify_recalc_inode_crc;
 		dbprintf(_("Allowing fuzz of corrupted inode with good CRC\n"));
+	} else if (iocur_top->typ->crc_off == TYP_F_CRC_FUNC) {
+		local_ops.verify_write = iocur_top->typ->set_crc;
+		dbprintf(_("Allowing fuzz of corrupted data with good CRC\n"));
 	} else { /* invalid data */
 		local_ops.verify_write = xfs_verify_recalc_crc;
 		dbprintf(_("Allowing fuzz of corrupted data with good CRC\n"));
diff --git a/db/type.c b/db/type.c
index adab10a..740adc0 100644
--- a/db/type.c
+++ b/db/type.c
@@ -88,7 +88,7 @@ static const typ_t	__typtab_crc[] = {
 	{ TYP_AGI, "agi", handle_struct, agi_hfld, &xfs_agi_buf_ops,
 		XFS_AGI_CRC_OFF },
 	{ TYP_ATTR, "attr3", handle_struct, attr3_hfld,
-		&xfs_attr3_db_buf_ops, TYP_F_NO_CRC_OFF },
+		&xfs_attr3_db_buf_ops, TYP_F_CRC_FUNC, xfs_attr3_set_crc },
 	{ TYP_BMAPBTA, "bmapbta", handle_struct, bmapbta_crc_hfld,
 		&xfs_bmbt_buf_ops, XFS_BTREE_LBLOCK_CRC_OFF },
 	{ TYP_BMAPBTD, "bmapbtd", handle_struct, bmapbtd_crc_hfld,
@@ -103,7 +103,7 @@ static const typ_t	__typtab_crc[] = {
 		&xfs_refcountbt_buf_ops, XFS_BTREE_SBLOCK_CRC_OFF },
 	{ TYP_DATA, "data", handle_block, NULL, NULL, TYP_F_NO_CRC_OFF },
 	{ TYP_DIR2, "dir3", handle_struct, dir3_hfld,
-		&xfs_dir3_db_buf_ops, TYP_F_NO_CRC_OFF },
+		&xfs_dir3_db_buf_ops, TYP_F_CRC_FUNC, xfs_dir3_set_crc },
 	{ TYP_DQBLK, "dqblk", handle_struct, dqblk_hfld,
 		&xfs_dquot_buf_ops, TYP_F_NO_CRC_OFF },
 	{ TYP_INOBT, "inobt", handle_struct, inobt_crc_hfld,
@@ -132,7 +132,7 @@ static const typ_t	__typtab_spcrc[] = {
 	{ TYP_AGI, "agi", handle_struct, agi_hfld, &xfs_agi_buf_ops ,
 		XFS_AGI_CRC_OFF },
 	{ TYP_ATTR, "attr3", handle_struct, attr3_hfld,
-		&xfs_attr3_db_buf_ops, TYP_F_NO_CRC_OFF },
+		&xfs_attr3_db_buf_ops, TYP_F_CRC_FUNC, xfs_attr3_set_crc },
 	{ TYP_BMAPBTA, "bmapbta", handle_struct, bmapbta_crc_hfld,
 		&xfs_bmbt_buf_ops, XFS_BTREE_LBLOCK_CRC_OFF },
 	{ TYP_BMAPBTD, "bmapbtd", handle_struct, bmapbtd_crc_hfld,
@@ -147,7 +147,7 @@ static const typ_t	__typtab_spcrc[] = {
 		&xfs_refcountbt_buf_ops, XFS_BTREE_SBLOCK_CRC_OFF },
 	{ TYP_DATA, "data", handle_block, NULL, NULL, TYP_F_NO_CRC_OFF },
 	{ TYP_DIR2, "dir3", handle_struct, dir3_hfld,
-		&xfs_dir3_db_buf_ops, TYP_F_NO_CRC_OFF },
+		&xfs_dir3_db_buf_ops, TYP_F_CRC_FUNC, xfs_dir3_set_crc },
 	{ TYP_DQBLK, "dqblk", handle_struct, dqblk_hfld,
 		&xfs_dquot_buf_ops, TYP_F_NO_CRC_OFF },
 	{ TYP_INOBT, "inobt", handle_struct, inobt_spcrc_hfld,
diff --git a/db/type.h b/db/type.h
index a50d705..3971975 100644
--- a/db/type.h
+++ b/db/type.h
@@ -46,6 +46,8 @@ typedef struct typ
 	const struct xfs_buf_ops *bops;
 	unsigned long		crc_off;
 #define TYP_F_NO_CRC_OFF	(-1UL)
+#define TYP_F_CRC_FUNC		(-2UL)
+	void			(*set_crc)(struct xfs_buf *);
 } typ_t;
 extern const typ_t	*typtab, *cur_typ;
 
diff --git a/db/write.c b/db/write.c
index 5c83874..ea87b40 100644
--- a/db/write.c
+++ b/db/write.c
@@ -164,6 +164,9 @@ write_f(
 	if (corrupt) {
 		local_ops.verify_write = xfs_dummy_verify;
 		dbprintf(_("Allowing write of corrupted data and bad CRC\n"));
+	} else if (iocur_top->typ->crc_off == TYP_F_CRC_FUNC) {
+		local_ops.verify_write = iocur_top->typ->set_crc;
+		dbprintf(_("Allowing write of corrupted data with good CRC\n"));
 	} else { /* invalid data */
 		local_ops.verify_write = xfs_verify_recalc_crc;
 		dbprintf(_("Allowing write of corrupted data with good CRC\n"));



* [PATCH 5/9] xfs_io: provide an interface to the scrub ioctls
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2017-03-10 23:25 ` [PATCH 4/9] xfs_db: write / fuzz bad values into dir/attr blocks with good CRCs Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 6/9] xfs_scrub: create online filesystem scrub program Darrick J. Wong
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a new xfs_io command to call the new XFS metadata scrub ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
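Note: the argument handling in io/scrub.c follows a common pattern.  The type
name is looked up in a NULL-terminated table, and numeric arguments are
accepted only if strtoull() consumed the whole string.  A reduced sketch with
a hypothetical three-entry table and hypothetical helper names (find_type,
parse_u64) follows; it is somewhat stricter than the patch in that it also
rejects empty strings and out-of-range values.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum kind { K_PERAG, K_INODE, K_FS };

struct descr {
	const char	*name;
	enum kind	kind;
};

static const struct descr table[] = {
	{ "inobt",	K_PERAG },
	{ "bmapbtd",	K_INODE },
	{ "rtbitmap",	K_FS },
	{ NULL,		K_FS },
};

/* Return the table index for name, or -1 if the name is unknown. */
static int find_type(const char *name)
{
	int	i;

	assert(name != NULL);
	for (i = 0; table[i].name; i++)
		if (strcmp(table[i].name, name) == 0)
			return i;
	return -1;
}

/* Parse a whole-string unsigned number; return 0 on success, -1 on error. */
static int parse_u64(const char *s, uint64_t *out)
{
	char			*end;
	unsigned long long	v;

	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno || end == s || *end != '\0')
		return -1;
	*out = v;
	return 0;
}
```

The table index doubles as the scrub type number, which is why the table
order has to match the kernel's XFS_SCRUB_TYPE_ values in the real code.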
 io/Makefile       |    2 
 io/init.c         |    2 
 io/inject.c       |    4 -
 io/io.h           |    3 
 io/scrub.c        |  330 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_io.8 |   20 +++
 6 files changed, 359 insertions(+), 2 deletions(-)
 create mode 100644 io/scrub.c


diff --git a/io/Makefile b/io/Makefile
index b5fe83d..e146cc0 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -11,7 +11,7 @@ HFILES = init.h io.h
 CFILES = init.c \
 	attr.c bmap.c cowextsize.c encrypt.c file.c freeze.c fsync.c \
 	getrusage.c imap.c link.c mmap.c open.c parent.c pread.c prealloc.c \
-	pwrite.c seek.c shutdown.c sync.c truncate.c utimes.c
+	pwrite.c scrub.c seek.c shutdown.c sync.c truncate.c utimes.c
 
 LLDLIBS = $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
 LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE)
diff --git a/io/init.c b/io/init.c
index 1532149..bef6f62 100644
--- a/io/init.c
+++ b/io/init.c
@@ -83,7 +83,9 @@ init_commands(void)
 	quit_init();
 	readdir_init();
 	reflink_init();
+	repair_init();
 	resblks_init();
+	scrub_init();
 	seek_init();
 	sendfile_init();
 	shutdown_init();
diff --git a/io/inject.c b/io/inject.c
index 25c7021..91b5fd7 100644
--- a/io/inject.c
+++ b/io/inject.c
@@ -86,7 +86,9 @@ error_tag(char *name)
 		{ XFS_ERRTAG_BMAP_FINISH_ONE,		"bmap_finish_one" },
 #define XFS_ERRTAG_AG_RESV_CRITICAL			27
 		{ XFS_ERRTAG_AG_RESV_CRITICAL,		"ag_resv_critical" },
-#define XFS_ERRTAG_MAX                                  28
+#define XFS_ERRTAG_FORCE_REPAIR				28
+		{ XFS_ERRTAG_FORCE_REPAIR,		"force_repair" },
+#define XFS_ERRTAG_MAX                                  29
 		{ XFS_ERRTAG_MAX,			NULL }
 	};
 	int	count;
diff --git a/io/io.h b/io/io.h
index 43c82d8..ee8a927 100644
--- a/io/io.h
+++ b/io/io.h
@@ -187,3 +187,6 @@ extern void		fsmap_init(void);
 #else
 # define fsmap_init()	do { } while (0)
 #endif
+
+extern void		scrub_init(void);
+extern void		repair_init(void);
diff --git a/io/scrub.c b/io/scrub.c
new file mode 100644
index 0000000..caa965e
--- /dev/null
+++ b/io/scrub.c
@@ -0,0 +1,330 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <sys/uio.h>
+#include <xfs/xfs.h>
+#include "command.h"
+#include "input.h"
+#include "init.h"
+#include "path.h"
+#include "io.h"
+
+static struct cmdinfo scrub_cmd;
+static struct cmdinfo repair_cmd;
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+	ST_NONE,	/* disabled */
+	ST_PERAG,	/* per-AG metadata */
+	ST_FS,		/* per-FS metadata */
+	ST_INODE,	/* per-inode metadata */
+};
+
+/* These must correspond with XFS_SCRUB_TYPE_ */
+struct scrub_descr {
+	const char	*name;
+	enum scrub_type	type;
+};
+
+static const struct scrub_descr scrubbers[] = {
+	{"dummy",	ST_NONE},
+	{"sb",		ST_PERAG},
+	{"agf",		ST_PERAG},
+	{"agfl",	ST_PERAG},
+	{"agi",		ST_PERAG},
+	{"bnobt",	ST_PERAG},
+	{"cntbt",	ST_PERAG},
+	{"inobt",	ST_PERAG},
+	{"finobt",	ST_PERAG},
+	{"rmapbt",	ST_PERAG},
+	{"refcountbt",	ST_PERAG},
+	{"inode",	ST_INODE},
+	{"bmapbtd",	ST_INODE},
+	{"bmapbta",	ST_INODE},
+	{"bmapbtc",	ST_INODE},
+	{"directory",	ST_INODE},
+	{"xattr",	ST_INODE},
+	{"symlink",	ST_INODE},
+	{"rtbitmap",	ST_FS},
+	{"rtsummary",	ST_FS},
+	{NULL,		ST_NONE},
+};
+
+static void
+scrub_help(void)
+{
+	const struct scrub_descr	*d;
+
+	printf(_("\n\
+ Scrubs a piece of XFS filesystem metadata.  The first argument is the type\n\
+ of metadata to examine.  Allocation group number(s) can be specified to\n\
+ restrict the scrub operation to a subset of allocation groups.\n\
+ Certain metadata types do not take AG numbers.\n\
+\n\
+ Example:\n\
+ 'scrub inobt 3' - scrubs the inode btree in AG 3.\n\
+ 'scrub bmapbtd 128 13525' - scrubs the extent map of inode 128 gen 13525.\n\
+\n\
+ Known metadata scrub types are:"));
+	for (d = scrubbers; d->name; d++)
+		printf(" %s", d->name);
+	printf("\n");
+}
+
+static void
+scrub_ioctl(
+	int				fd,
+	int				type,
+	uint64_t			control,
+	uint32_t			control2)
+{
+	struct xfs_scrub_metadata	meta;
+	const struct scrub_descr	*sc;
+	int				error;
+
+	sc = &scrubbers[type];
+	memset(&meta, 0, sizeof(meta));
+	meta.sm_type = type;
+	switch (sc->type) {
+	case ST_PERAG:
+		meta.sm_agno = control;
+		break;
+	case ST_INODE:
+		meta.sm_ino = control;
+		meta.sm_gen = control2;
+		break;
+	case ST_NONE:
+	case ST_FS:
+		/* no control parameters */
+		break;
+	}
+	meta.sm_flags = 0;
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+	if (error)
+		perror("scrub");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+		printf(_("Corruption detected.\n"));
+	if (meta.sm_flags & XFS_SCRUB_FLAG_PREEN)
+		printf(_("Optimization possible.\n"));
+	if (meta.sm_flags & XFS_SCRUB_FLAG_XFAIL)
+		printf(_("Cross-referencing failed.\n"));
+	if (meta.sm_flags & XFS_SCRUB_FLAG_XCORRUPT)
+		printf(_("Corruption detected during cross-referencing.\n"));
+}
+
+static int
+parse_args(
+	int				argc,
+	char				**argv,
+	struct cmdinfo			*cmdinfo,
+	void				(*fn)(int, int, uint64_t, uint32_t))
+{
+	char				*p;
+	int				type = -1;
+	int				i, c;
+	uint64_t			control = 0;
+	uint32_t			control2 = 0;
+	const struct scrub_descr	*d = NULL;
+
+	while ((c = getopt(argc, argv, "")) != EOF) {
+		switch (c) {
+		default:
+			return command_usage(cmdinfo);
+		}
+	}
+	if (optind > argc - 1)
+		return command_usage(cmdinfo);
+
+	for (i = 0, d = scrubbers; d->name; i++, d++) {
+		if (strcmp(d->name, argv[optind]) == 0) {
+			type = i;
+			break;
+		}
+	}
+	optind++;
+
+	if (type < 0)
+		return command_usage(cmdinfo);
+
+	switch (d->type) {
+	case ST_INODE:
+		if (optind == argc) {
+			control = 0;
+			control2 = 0;
+		} else if (optind == argc - 2) {
+			control = strtoull(argv[optind], &p, 0);
+			if (*p != '\0') {
+				fprintf(stderr,
+					_("Bad inode number %s.\n"), argv[optind]);
+				return 0;
+			}
+			control2 = strtoul(argv[optind + 1], &p, 0);
+			if (*p != '\0') {
+				fprintf(stderr,
+					_("Bad generation number %s.\n"), argv[optind + 1]);
+				return 0;
+			}
+		} else {
+			fprintf(stderr,
+				_("Must specify inode number and generation.\n"));
+			return 0;
+		}
+		break;
+	case ST_PERAG:
+	case ST_NONE:
+		if (optind != argc - 1) {
+			fprintf(stderr,
+				_("Must specify AG number.\n"));
+			return 0;
+		}
+		control = strtoul(argv[optind], &p, 0);
+		if (*p != '\0') {
+			fprintf(stderr,
+				_("Bad AG number %s.\n"), argv[optind]);
+			return 0;
+		}
+		break;
+	default:
+		if (optind != argc) {
+			fprintf(stderr,
+				_("No parameters allowed.\n"));
+			return 0;
+		}
+	}
+	fn(file->fd, type, control, control2);
+
+	return 0;
+}
+
+static int
+scrub_f(
+	int				argc,
+	char				**argv)
+{
+	return parse_args(argc, argv, &scrub_cmd, scrub_ioctl);
+}
+
+void
+scrub_init(void)
+{
+	scrub_cmd.name = "scrub";
+	scrub_cmd.altname = "sc";
+	scrub_cmd.cfunc = scrub_f;
+	scrub_cmd.argmin = 1;
+	scrub_cmd.argmax = -1;
+	scrub_cmd.flags = CMD_NOMAP_OK;
+	scrub_cmd.args =
+_("type [agno...]");
+	scrub_cmd.oneline =
+		_("scrubs filesystem metadata");
+	scrub_cmd.help = scrub_help;
+
+	add_command(&scrub_cmd);
+}
+
+static void
+repair_help(void)
+{
+	const struct scrub_descr	*d;
+
+	printf(_("\n\
+ Repairs a piece of XFS filesystem metadata.  The first argument is the type\n\
+ of metadata to examine.  Allocation group number(s) can be specified to\n\
+ restrict the repair operation to a subset of allocation groups.\n\
+ Certain metadata types do not take AG numbers.\n\
+\n\
+ Example:\n\
+ 'repair inobt 3 5 7' - repairs the inode btree in groups 3, 5, and 7.\n\
+\n\
+ Known metadata repair types are:"));
+	for (d = scrubbers; d->name; d++)
+		printf(" %s", d->name);
+	printf("\n");
+}
+
+static void
+repair_ioctl(
+	int				fd,
+	int				type,
+	uint64_t			control,
+	uint32_t			control2)
+{
+	struct xfs_scrub_metadata	meta;
+	const struct scrub_descr	*sc;
+	int				error;
+
+	sc = &scrubbers[type];
+	memset(&meta, 0, sizeof(meta));
+	meta.sm_type = type;
+	switch (sc->type) {
+	case ST_PERAG:
+		meta.sm_agno = control;
+		break;
+	case ST_INODE:
+		meta.sm_ino = control;
+		meta.sm_gen = control2;
+		break;
+	case ST_NONE:
+	case ST_FS:
+		/* no control parameters */
+		break;
+	}
+	meta.sm_flags = XFS_SCRUB_FLAG_REPAIR;
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+	if (error)
+		perror("repair");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+		printf(_("Corruption remains.\n"));
+	if (meta.sm_flags & XFS_SCRUB_FLAG_PREEN)
+		printf(_("Optimization possible.\n"));
+	if (meta.sm_flags & XFS_SCRUB_FLAG_XFAIL)
+		printf(_("Cross-referencing failed.\n"));
+	if (meta.sm_flags & XFS_SCRUB_FLAG_XCORRUPT)
+		printf(_("Corruption still detected during cross-referencing.\n"));
+}
+
+static int
+repair_f(
+	int				argc,
+	char				**argv)
+{
+	return parse_args(argc, argv, &repair_cmd, repair_ioctl);
+}
+
+void
+repair_init(void)
+{
+	if (!expert)
+		return;
+	repair_cmd.name = "repair";
+	repair_cmd.altname = "fix";
+	repair_cmd.cfunc = repair_f;
+	repair_cmd.argmin = 1;
+	repair_cmd.argmax = -1;
+	repair_cmd.flags = CMD_NOMAP_OK;
+	repair_cmd.args =
+_("type [agno...]");
+	repair_cmd.oneline =
+		_("repairs filesystem metadata");
+	repair_cmd.help = repair_help;
+
+	add_command(&repair_cmd);
+}
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 7ca3fdc..049f32c 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -998,6 +998,26 @@ version of policy structure (numeric)
 .BR get_encpolicy
 On filesystems that support encryption, display the encryption policy of the
 current file.
+.TP
+.BI "scrub " type " [ " agnumber... " | " "ino" " " "gen" " ]"
+Scrub internal XFS filesystem metadata.  The
+.BI type
+parameter specifies which type of metadata to scrub.
+For AG metadata, AG numbers can optionally be specified to restrict the
+scrub operation to a particular set of allocation groups.
+By default, all allocation groups are scrubbed.
+For file metadata, the scrub is applied to the open file unless the
+inode number and generation number are specified.
+.TP
+.BI "repair " type " [ " agnumber... " | " "ino" " " "gen" " ]"
+Repair internal XFS filesystem metadata.  The
+.BI type
+parameter specifies which type of metadata to repair.
+For AG metadata, AG numbers can optionally be specified to restrict the
+repair operation to a particular set of allocation groups.
+By default, all allocation groups are repaired.
+For file metadata, the repair is applied to the open file unless the
+inode number and generation number are specified.
 
 .SH SEE ALSO
 .BR mkfs.xfs (8),



* [PATCH 6/9] xfs_scrub: create online filesystem scrub program
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2017-03-10 23:25 ` [PATCH 5/9] xfs_io: provide an interface to the scrub ioctls Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 7/9] xfs_scrub: add XFS-specific scrubbing functionality Darrick J. Wong
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a filesystem scrubbing tool that walks the directory tree and
queries every file's extents, extended attributes, and stat data.  For
generic (non-XFS) filesystems this depends on the kernel to do nearly
all the validation.  Optionally, we can (try to) read all the file
data.

This patch provides some helper components that will be used by the
various backends to walk the metadata, perform media scans, etc.  Actual
filesystem drivers will be in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Makefile              |    3 
 configure.ac          |    6 
 include/builddefs.in  |    6 
 m4/package_libcdev.m4 |   88 +++++
 man/man8/xfs_scrub.8  |  109 ++++++
 scrub/Makefile        |   51 +++
 scrub/bitmap.c        |  425 ++++++++++++++++++++++
 scrub/bitmap.h        |   42 ++
 scrub/disk.c          |  288 +++++++++++++++
 scrub/disk.h          |   41 ++
 scrub/iocmd.c         |  239 ++++++++++++
 scrub/iocmd.h         |   50 +++
 scrub/read_verify.c   |  316 ++++++++++++++++
 scrub/read_verify.h   |   59 +++
 scrub/scrub.c         |  950 +++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/scrub.h         |  127 +++++++
 16 files changed, 2799 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/xfs_scrub.8
 create mode 100644 scrub/Makefile
 create mode 100644 scrub/bitmap.c
 create mode 100644 scrub/bitmap.h
 create mode 100644 scrub/disk.c
 create mode 100644 scrub/disk.h
 create mode 100644 scrub/iocmd.c
 create mode 100644 scrub/iocmd.h
 create mode 100644 scrub/read_verify.c
 create mode 100644 scrub/read_verify.h
 create mode 100644 scrub/scrub.c
 create mode 100644 scrub/scrub.h


diff --git a/Makefile b/Makefile
index 3a4872a..e6d79af 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ HDR_SUBDIRS = include libxfs
 DLIB_SUBDIRS = libxlog libxcmd libhandle
 LIB_SUBDIRS = libxfs $(DLIB_SUBDIRS)
 TOOL_SUBDIRS = copy db estimate fsck growfs io logprint mkfs quota \
-		mdrestore repair rtcp m4 man doc debian spaceman
+		mdrestore repair rtcp m4 man doc debian spaceman scrub
 
 ifneq ("$(PKG_PLATFORM)","darwin")
 TOOL_SUBDIRS += fsr
@@ -89,6 +89,7 @@ repair: libxlog libxcmd
 copy: libxlog
 mkfs: libxcmd
 spaceman: libxcmd
+scrub: libhandle libxcmd repair
 
 ifeq ($(HAVE_BUILDDEFS), yes)
 include $(BUILDRULES)
diff --git a/configure.ac b/configure.ac
index 103aeba..ccd7460 100644
--- a/configure.ac
+++ b/configure.ac
@@ -146,6 +146,12 @@ AC_HAVE_SYS_FICLONE
 AC_HAVE_SYS_FICLONERANGE
 AC_HAVE_SYS_FIDEDUPERANGE
 AC_HAVE_SYS_GETFSMAP
+AC_HAVE_MALLINFO
+AC_HAVE_SG_IO
+AC_HAVE_HDIO_GETGEO
+AC_HAVE_OPENAT
+AC_HAVE_SYNCFS
+AC_HAVE_FSTATAT
 
 if test "$enable_blkid" = yes; then
 AC_HAVE_BLKID_TOPO
diff --git a/include/builddefs.in b/include/builddefs.in
index 603d900..9d478d3 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -116,6 +116,12 @@ HAVE_SYS_FICLONE = @have_sys_ficlone@
 HAVE_SYS_FICLONERANGE = @have_sys_ficlonerange@
 HAVE_SYS_FIDEDUPERANGE = @have_sys_fideduperange@
 HAVE_SYS_GETFSMAP = @have_sys_getfsmap@
+HAVE_MALLINFO = @have_mallinfo@
+HAVE_SG_IO = @have_sg_io@
+HAVE_HDIO_GETGEO = @have_hdio_getgeo@
+HAVE_OPENAT = @have_openat@
+HAVE_SYNCFS = @have_syncfs@
+HAVE_FSTATAT = @have_fstatat@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 971e390..da15041 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -352,3 +352,91 @@ AC_DEFUN([AC_HAVE_SYS_GETFSMAP],
        AC_MSG_RESULT(no))
     AC_SUBST(have_sys_getfsmap)
   ])
+
+#
+# Check if we have a mallinfo libc call
+#
+AC_DEFUN([AC_HAVE_MALLINFO],
+  [ AC_MSG_CHECKING([for mallinfo ])
+    AC_TRY_COMPILE([
+#include <malloc.h>
+    ], [
+         struct mallinfo test;
+
+         test.arena = 0; test.hblkhd = 0; test.uordblks = 0; test.fordblks = 0;
+         test = mallinfo();
+    ], have_mallinfo=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_mallinfo)
+  ])
+
+#
+# Check if we have the SG_IO ioctl
+#
+AC_DEFUN([AC_HAVE_SG_IO],
+  [ AC_MSG_CHECKING([for struct sg_io_hdr ])
+    AC_TRY_COMPILE([#include <scsi/sg.h>],
+    [
+         struct sg_io_hdr hdr;
+         ioctl(0, SG_IO, &hdr);
+    ], have_sg_io=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_sg_io)
+  ])
+
+#
+# Check if we have the HDIO_GETGEO ioctl
+#
+AC_DEFUN([AC_HAVE_HDIO_GETGEO],
+  [ AC_MSG_CHECKING([for struct hd_geometry ])
+    AC_TRY_COMPILE([#include <linux/hdreg.h>],
+    [
+         struct hd_geometry hdr;
+         ioctl(0, HDIO_GETGEO, &hdr);
+    ], have_hdio_getgeo=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_hdio_getgeo)
+  ])
+
+#
+# Check if we have an openat call
+#
+AC_DEFUN([AC_HAVE_OPENAT],
+  [ AC_CHECK_DECL([openat],
+       have_openat=yes,
+       [],
+       [#include <sys/types.h>
+        #include <sys/stat.h>
+        #include <fcntl.h>]
+       )
+    AC_SUBST(have_openat)
+  ])
+
+#
+# Check if we have a syncfs call
+#
+AC_DEFUN([AC_HAVE_SYNCFS],
+  [ AC_CHECK_DECL([syncfs],
+       have_syncfs=yes,
+       [],
+       [#define _GNU_SOURCE
+       #include <unistd.h>])
+    AC_SUBST(have_syncfs)
+  ])
+
+#
+# Check if we have an fstatat call
+#
+AC_DEFUN([AC_HAVE_FSTATAT],
+  [ AC_CHECK_DECL([fstatat],
+       have_fstatat=yes,
+       [],
+       [#define _GNU_SOURCE
+       #include <sys/types.h>
+       #include <sys/stat.h>
+       #include <unistd.h>])
+    AC_SUBST(have_fstatat)
+  ])
diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
new file mode 100644
index 0000000..60cdf61
--- /dev/null
+++ b/man/man8/xfs_scrub.8
@@ -0,0 +1,109 @@
+.TH xfs_scrub 8
+.SH NAME
+xfs_scrub \- scrub the contents of an XFS filesystem
+.SH SYNOPSIS
+.B xfs_scrub
+[
+.B \-ademnTvVxy
+]
+.I mountpoint
+.br
+.B xfs_scrub \-V
+.SH DESCRIPTION
+.B xfs_scrub
+attempts to check and repair all metadata in a mounted XFS filesystem.
+.PP
+If an XFS filesystem is detected, then
+.B xfs_scrub
+will ask the kernel to perform more rigorous scrubbing of the
+internal metadata.
+The in-kernel scrubbers also cross-reference each data structure's
+records against the other filesystem metadata.
+.PP
+This utility does not know how to correct all errors.
+If the tool cannot fix the detected errors, you must unmount the
+filesystem and run the appropriate repair tool.
+If this tool is run without either of the
+.B \-n
+or
+.B \-y
+options, then it will preen and optimize the filesystem when possible,
+though it will not try to fix errors.
+.SH OPTIONS
+.TP
+.BI \-a " errors"
+Abort if more than this many errors are found on the filesystem.
+.TP
+.B \-d
+Enable debugging mode, which augments error reports with the exact file
+and line where the scrub failure occurred.
+This also enables verbose mode.
+.TP
+.B \-e
+Specifies what happens when errors are detected.
+If
+.IR shutdown
+is given, the filesystem will be taken offline if errors are found.
+Not all backends can shut down a filesystem.
+If
+.IR continue
+is given, no action is taken if errors are found.
+This is the default.
+.TP
+.BI \-m " file"
+Search this file for mounted filesystems instead of /etc/mtab.
+.TP
+.B \-n
+Dry run, do not modify anything in the filesystem.  This disables
+all preening and optimization behaviors, and disables calling
+FITRIM on the free space after a successful run.
+.TP
+.BI \-T
+Print timing and memory usage information for each phase.
+.TP
+.B \-v
+Enable verbose mode, which prints periodic status updates.
+.TP
+.B \-V
+Prints the version number and exits.
+.TP
+.B \-x
+Scrub file data.  This reads every block of every file on disk.
+If the filesystem reports file extent mappings or physical extent
+mappings and is backed by a block device,
+.B xfs_scrub
+will issue O_DIRECT reads to the block device directly.
+If the block device is a SCSI disk, it will issue READ VERIFY commands
+directly to the disk.
+.TP
+.B \-y
+Try to repair all filesystem errors.  If the errors cannot be fixed
+online, then the filesystem must be taken offline for repair.
+.SH EXIT CODE
+The exit code returned by
+.B xfs_scrub
+is the sum of the following conditions:
+.br
+\	0\	\-\ No errors
+.br
+\	1\	\-\ File system errors left uncorrected
+.br
+\	2\	\-\ File system optimizations possible
+.br
+\	4\	\-\ Operational error
+.br
+\	8\	\-\ Usage or syntax error
+.br
+.SH CAVEATS
+.B xfs_scrub
+is an immature utility!
+This program takes advantage of in-kernel scrubbing to verify a
+given data structure with locks held.
+This can tie up the system for a while.
+The kernel must support the BULKSTAT, FSGEOMETRY, FSCOUNTS, GET_RESBLKS,
+GET_AG_RESBLKS, GETBMAPX, GETFSMAP, INUMBERS, and SCRUB_METADATA ioctls.
+.PP
+If errors are found and cannot be repaired, the filesystem should be
+taken offline and repaired.
+.SH SEE ALSO
+.BR xfs_repair (8).
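
The EXIT CODE section of the man page above describes the status as a
sum of conditions, so a caller can treat it as a bitmask.  A minimal
sketch of such a decoder follows.  Only the numeric values come from the
man page; the macro names and the describe_exit() helper are made up for
illustration and are not part of this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Values from the EXIT CODE section; the macro names are hypothetical. */
#define EXIT_UNCORRECTED	1	/* file system errors left uncorrected */
#define EXIT_UNOPTIMIZED	2	/* file system optimizations possible */
#define EXIT_OPERROR		4	/* operational error */
#define EXIT_SYNTAX		8	/* usage or syntax error */

/* Write a comma-separated description of an exit status into buf. */
static void
describe_exit(
	int			status,
	char			*buf,
	size_t			len)
{
	static const struct {
		int		bit;
		const char	*msg;
	} conds[] = {
		{ EXIT_UNCORRECTED,	"errors left uncorrected" },
		{ EXIT_UNOPTIMIZED,	"optimizations possible" },
		{ EXIT_OPERROR,		"operational error" },
		{ EXIT_SYNTAX,		"usage or syntax error" },
	};
	size_t			i;

	assert(len > 0);
	buf[0] = '\0';

	/* Zero means that none of the conditions were raised. */
	if (status == 0) {
		snprintf(buf, len, "no errors");
		return;
	}

	/* Each set bit contributes one message to the description. */
	for (i = 0; i < sizeof(conds) / sizeof(conds[0]); i++) {
		if (!(status & conds[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, ", ", len - strlen(buf) - 1);
		strncat(buf, conds[i].msg, len - strlen(buf) - 1);
	}
}
```

A wrapper script or service could call something like this on the
status returned by wait(2) to decide whether to schedule an offline
repair.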
diff --git a/scrub/Makefile b/scrub/Makefile
new file mode 100644
index 0000000..b1ff86a
--- /dev/null
+++ b/scrub/Makefile
@@ -0,0 +1,51 @@
+#
+# Copyright (c) 2017 Oracle.  All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+SCRUB_PREREQS=$(HAVE_OPENAT)$(HAVE_FSTATAT)
+
+ifeq ($(SCRUB_PREREQS),yesyes)
+LTCOMMAND = xfs_scrub
+INSTALL_SCRUB = install-scrub
+endif	# scrub_prereqs
+
+HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h
+CFILES = ../repair/avl64.c disk.c bitmap.c iocmd.c \
+	 read_verify.c scrub.c ../repair/threads.c
+
+LLDLIBS += $(LIBBLKID) $(LIBXFS) $(LIBXCMD) $(LIBUUID) $(LIBRT) $(LIBPTHREAD) $(LIBHANDLE)
+LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBHANDLE)
+LLDFLAGS = -static-libtool-libs
+
+ifeq ($(HAVE_MALLINFO),yes)
+LCFLAGS += -DHAVE_MALLINFO
+endif
+
+ifeq ($(HAVE_SG_IO),yes)
+LCFLAGS += -DHAVE_SG_IO
+endif
+
+ifeq ($(HAVE_HDIO_GETGEO),yes)
+LCFLAGS += -DHAVE_HDIO_GETGEO
+endif
+
+ifeq ($(HAVE_SYNCFS),yes)
+LCFLAGS += -DHAVE_SYNCFS
+endif
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: default $(INSTALL_SCRUB)
+
+install-scrub:
+	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
+	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+
+install-dev:
+
+-include .dep
diff --git a/scrub/bitmap.c b/scrub/bitmap.c
new file mode 100644
index 0000000..0146c49
--- /dev/null
+++ b/scrub/bitmap.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include "../repair/avl64.h"
+#include "bitmap.h"
+
+#define avl_for_each_range_safe(pos, n, l, first, last) \
+	for (pos = (first), n = pos->avl_nextino, l = (last)->avl_nextino; pos != (l); \
+			pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each_safe(tree, pos, n) \
+	for (pos = (tree)->avl_firstino, n = pos ? pos->avl_nextino : NULL; \
+			pos != NULL; \
+			pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each(tree, pos) \
+	for (pos = (tree)->avl_firstino; pos != NULL; pos = pos->avl_nextino)
+
+struct bitmap_node {
+	struct avl64node	btn_node;
+	uint64_t		btn_start;
+	uint64_t		btn_length;
+};
+
+static __uint64_t
+extent_start(
+	struct avl64node	*node)
+{
+	struct bitmap_node	*btn;
+
+	btn = container_of(node, struct bitmap_node, btn_node);
+	return btn->btn_start;
+}
+
+static __uint64_t
+extent_end(
+	struct avl64node	*node)
+{
+	struct bitmap_node	*btn;
+
+	btn = container_of(node, struct bitmap_node, btn_node);
+	return btn->btn_start + btn->btn_length;
+}
+
+static struct avl64ops bitmap_ops = {
+	extent_start,
+	extent_end,
+};
+
+/* Initialize an extent tree. */
+bool
+bitmap_init(
+	struct bitmap		*tree)
+{
+	tree->bt_tree = malloc(sizeof(struct avl64tree_desc));
+	if (!tree->bt_tree)
+		return false;
+
+	pthread_mutex_init(&tree->bt_lock, NULL);
+	avl64_init_tree(tree->bt_tree, &bitmap_ops);
+
+	return true;
+}
+
+/* Free an extent tree. */
+void
+bitmap_free(
+	struct bitmap		*tree)
+{
+	struct avl64node	*node;
+	struct avl64node	*n;
+	struct bitmap_node	*ext;
+
+	if (!tree->bt_tree)
+		return;
+
+	avl_for_each_safe(tree->bt_tree, node, n) {
+		ext = container_of(node, struct bitmap_node, btn_node);
+		free(ext);
+	}
+	free(tree->bt_tree);
+	tree->bt_tree = NULL;
+}
+
+/* Create a new extent. */
+static struct bitmap_node *
+bitmap_node_init(
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct bitmap_node	*ext;
+
+	ext = malloc(sizeof(struct bitmap_node));
+	if (!ext)
+		return NULL;
+
+	ext->btn_node.avl_nextino = NULL;
+	ext->btn_start = start;
+	ext->btn_length = len;
+
+	return ext;
+}
+
+/* Add an extent (locked). */
+static bool
+__bitmap_add(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		length)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct bitmap_node	*ext;
+	uint64_t		new_start;
+	uint64_t		new_length;
+	struct avl64node	*node;
+	bool			res = true;
+
+	/* Find any existing nodes adjacent or within that range. */
+	avl64_findranges(tree->bt_tree, start - 1, start + length + 1,
+			&firstn, &lastn);
+
+	/* Nothing, just insert a new extent. */
+	if (firstn == NULL && lastn == NULL) {
+		ext = bitmap_node_init(start, length);
+		if (!ext)
+			return false;
+
+		node = avl64_insert(tree->bt_tree, &ext->btn_node);
+		if (node == NULL) {
+			free(ext);
+			errno = EEXIST;
+			return false;
+		}
+
+		return true;
+	}
+
+	ASSERT(firstn != NULL && lastn != NULL);
+	new_start = start;
+	new_length = length;
+
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		ext = container_of(pos, struct bitmap_node, btn_node);
+
+		/* Bail if the new extent is contained within an old one. */
+		if (ext->btn_start <= start &&
+		    ext->btn_start + ext->btn_length >= start + length)
+			return res;
+
+		/* Check for overlapping and adjacent extents. */
+		if (ext->btn_start + ext->btn_length >= start ||
+		    ext->btn_start <= start + length) {
+			if (ext->btn_start < start) {
+				new_start = ext->btn_start;
+				new_length += ext->btn_length;
+			}
+
+			if (ext->btn_start + ext->btn_length >
+			    new_start + new_length)
+				new_length = ext->btn_start + ext->btn_length -
+						new_start;
+
+			avl64_delete(tree->bt_tree, pos);
+			free(ext);
+		}
+	}
+
+	ext = bitmap_node_init(new_start, new_length);
+	if (!ext)
+		return false;
+
+	node = avl64_insert(tree->bt_tree, &ext->btn_node);
+	if (node == NULL) {
+		free(ext);
+		errno = EEXIST;
+		return false;
+	}
+
+	return res;
+}
+
+/* Add an extent. */
+bool
+bitmap_add(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		length)
+{
+	bool			res;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	res = __bitmap_add(tree, start, length);
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return res;
+}
+
+/* Remove an extent. */
+bool
+bitmap_remove(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct bitmap_node	*ext;
+	uint64_t		new_start;
+	uint64_t		new_length;
+	struct avl64node	*node;
+	int			stat;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	/* Find any existing nodes over that range. */
+	avl64_findranges(tree->bt_tree, start, start + len, &firstn, &lastn);
+
+	/* Nothing, we're done. */
+	if (firstn == NULL && lastn == NULL) {
+		pthread_mutex_unlock(&tree->bt_lock);
+		return true;
+	}
+
+	ASSERT(firstn != NULL && lastn != NULL);
+
+	/* Delete or truncate everything in sight. */
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		ext = container_of(pos, struct bitmap_node, btn_node);
+
+		stat = 0;
+		if (ext->btn_start < start)
+			stat |= 1;
+		if (ext->btn_start + ext->btn_length > start + len)
+			stat |= 2;
+		switch (stat) {
+		case 0:
+			/* Extent totally within range; delete. */
+			avl64_delete(tree->bt_tree, pos);
+			free(ext);
+			break;
+		case 1:
+			/* Extent is left-adjacent; truncate. */
+			ext->btn_length = start - ext->btn_start;
+			break;
+		case 2:
+			/* Extent is right-adjacent; move it. */
+			ext->btn_length = ext->btn_start + ext->btn_length -
+					(start + len);
+			ext->btn_start = start + len;
+			break;
+		case 3:
+			/* Extent overlaps both ends; split it in two. */
+			new_start = start + len;
+			new_length = ext->btn_start + ext->btn_length -
+					new_start;
+			ext->btn_length = start - ext->btn_start;
+
+			ext = bitmap_node_init(new_start, new_length);
+			if (!ext)
+				return false;
+
+			node = avl64_insert(tree->bt_tree, &ext->btn_node);
+			if (node == NULL) {
+				errno = EEXIST;
+				return false;
+			}
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&tree->bt_lock);
+	return true;
+}
+
+/* Iterate an extent tree. */
+bool
+bitmap_iterate(
+	struct bitmap		*tree,
+	bool			(*fn)(uint64_t, uint64_t, void *),
+	void			*arg)
+{
+	struct avl64node	*node;
+	struct bitmap_node	*ext;
+	bool			moveon = true;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	avl_for_each(tree->bt_tree, node) {
+		ext = container_of(node, struct bitmap_node, btn_node);
+		moveon = fn(ext->btn_start, ext->btn_length, arg);
+		if (!moveon)
+			break;
+	}
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return moveon;
+}
+
+/* Do any extents overlap the given one?  (locked) */
+static bool
+__bitmap_has_extent(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+
+	/* Find any existing nodes over that range. */
+	avl64_findranges(tree->bt_tree, start, start + len, &firstn, &lastn);
+
+	return firstn != NULL && lastn != NULL;
+}
+
+/* Do any extents overlap the given one? */
+bool
+bitmap_has_extent(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		len)
+{
+	bool			res;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	res = __bitmap_has_extent(tree, start, len);
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return res;
+}
+
+/* Ensure that the extent is set, and return the old value. */
+bool
+bitmap_test_and_set(
+	struct bitmap		*tree,
+	uint64_t		start,
+	bool			*was_set)
+{
+	bool			res = true;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	*was_set = __bitmap_has_extent(tree, start, 1);
+	if (!(*was_set))
+		res = __bitmap_add(tree, start, 1);
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return res;
+}
+
+/* Is it empty? */
+bool
+bitmap_empty(
+	struct bitmap		*tree)
+{
+	return tree->bt_tree->avl_firstino == NULL;
+}
+
+static bool
+merge_helper(
+	uint64_t		start,
+	uint64_t		length,
+	void			*arg)
+{
+	struct bitmap		*thistree = arg;
+
+	return __bitmap_add(thistree, start, length);
+}
+
+/* Merge another tree with this one. */
+bool
+bitmap_merge(
+	struct bitmap		*thistree,
+	struct bitmap		*tree)
+{
+	bool			res;
+
+	assert(thistree != tree);
+
+	pthread_mutex_lock(&thistree->bt_lock);
+	res = bitmap_iterate(tree, merge_helper, thistree);
+	pthread_mutex_unlock(&thistree->bt_lock);
+
+	return res;
+}
+
+static bool
+bitmap_dump_fn(
+	uint64_t		startblock,
+	uint64_t		blockcount,
+	void			*arg)
+{
+	printf("%"PRIu64":%"PRIu64"\n", startblock, blockcount);
+	return true;
+}
+
+/* Dump extent tree. */
+void
+bitmap_dump(
+	struct bitmap		*tree)
+{
+	printf("BITMAP DUMP %p\n", tree);
+	bitmap_iterate(tree, bitmap_dump_fn, NULL);
+	printf("BITMAP DUMP DONE\n");
+}
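
The four cases in bitmap_remove() above come down to one question: which
parts of an existing extent survive once a range is punched out of it?
The sketch below shows that arithmetic on a plain struct, without the
AVL tree or the locking.  The struct extent type and the extent_punch()
helper are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

struct extent {
	uint64_t		start;
	uint64_t		length;
};

/*
 * Remove [start, start + len) from *ext, which must overlap that range.
 * Returns the number of surviving pieces (0, 1, or 2) stored in out[].
 */
static int
extent_punch(
	const struct extent	*ext,
	uint64_t		start,
	uint64_t		len,
	struct extent		out[2])
{
	uint64_t		ext_end = ext->start + ext->length;
	uint64_t		end = start + len;
	int			nr = 0;

	/* The arithmetic below only makes sense for overlapping ranges. */
	assert(ext->start < end && ext_end > start);

	/* Whatever lies to the left of the hole survives. */
	if (ext->start < start) {
		out[nr].start = ext->start;
		out[nr].length = start - ext->start;
		nr++;
	}

	/* Whatever lies to the right of the hole survives. */
	if (ext_end > end) {
		out[nr].start = end;
		out[nr].length = ext_end - end;
		nr++;
	}

	return nr;
}
```

Both surviving pieces are computed from the untouched original extent,
which is why the order of the assignments matters in the "overlaps both
ends" case.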
diff --git a/scrub/bitmap.h b/scrub/bitmap.h
new file mode 100644
index 0000000..e3702aa
--- /dev/null
+++ b/scrub/bitmap.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef BITMAP_H_
+#define BITMAP_H_
+
+struct bitmap {
+	pthread_mutex_t		bt_lock;
+	struct avl64tree_desc	*bt_tree;
+};
+
+bool bitmap_init(struct bitmap *tree);
+void bitmap_free(struct bitmap *tree);
+bool bitmap_add(struct bitmap *tree, uint64_t start, uint64_t length);
+bool bitmap_remove(struct bitmap *tree, uint64_t start,
+		uint64_t len);
+bool bitmap_iterate(struct bitmap *tree,
+		bool (*fn)(uint64_t, uint64_t, void *), void *arg);
+bool bitmap_has_extent(struct bitmap *tree, uint64_t start,
+		uint64_t len);
+bool bitmap_test_and_set(struct bitmap *tree, uint64_t start, bool *was_set);
+bool bitmap_empty(struct bitmap *tree);
+bool bitmap_merge(struct bitmap *thistree, struct bitmap *tree);
+void bitmap_dump(struct bitmap *tree);
+
+#endif /* BITMAP_H_ */
diff --git a/scrub/disk.c b/scrub/disk.c
new file mode 100644
index 0000000..c1136aa
--- /dev/null
+++ b/scrub/disk.c
@@ -0,0 +1,288 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#ifdef HAVE_SG_IO
+# include <scsi/sg.h>
+#endif
+#ifdef HAVE_HDIO_GETGEO
+# include <linux/hdreg.h>
+#endif
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+
+/* Figure out how many disk heads are available. */
+static unsigned int
+__disk_heads(
+	struct disk		*disk)
+{
+	int			iomin;
+	int			ioopt;
+	unsigned short		rot;
+	int			error;
+
+	/* If it's not a block device, throw all the CPUs at it. */
+	if (!S_ISBLK(disk->d_sb.st_mode))
+		return libxfs_nproc();
+
+	/* Non-rotational device?  Throw all the CPUs. */
+	rot = 1;
+	error = ioctl(disk->d_fd, BLKROTATIONAL, &rot);
+	if (error == 0 && rot == 0)
+		return libxfs_nproc();
+
+	/*
+	 * Sometimes we can infer the number of devices from the
+	 * min/optimal IO sizes.
+	 */
+	iomin = ioopt = 0;
+	if (ioctl(disk->d_fd, BLKIOMIN, &iomin) == 0 &&
+	    ioctl(disk->d_fd, BLKIOOPT, &ioopt) == 0 &&
+	    iomin > 0 && ioopt > 0) {
+		return min(libxfs_nproc(), max(1, ioopt / iomin));
+	}
+
+	/* Rotating device?  I guess? */
+	return 2;
+}
+
+/* Figure out how many disk heads are available. */
+unsigned int
+disk_heads(
+	struct disk		*disk)
+{
+	if (nr_threads < 0)
+		return __disk_heads(disk);
+	return min(__disk_heads(disk), nr_threads);
+}
+
+/* Execute a SCSI VERIFY(16).  We hope. */
+#ifdef HAVE_SG_IO
+# define SENSE_BUF_LEN		64
+# define VERIFY16_CMDLEN	16
+# define VERIFY16_CMD		0x8F
+
+# ifndef SG_FLAG_Q_AT_TAIL
+#  define SG_FLAG_Q_AT_TAIL	0x10
+# endif
+static int
+disk_scsi_verify(
+	struct disk		*disk,
+	uint64_t		startblock, /* lba */
+	uint64_t		blockcount) /* lba */
+{
+	struct sg_io_hdr	iohdr;
+	unsigned char		cdb[VERIFY16_CMDLEN];
+	unsigned char		sense[SENSE_BUF_LEN];
+	uint64_t		llba;
+	uint64_t		veri_len = blockcount;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"));
+
+	llba = startblock + (disk->d_start >> BBSHIFT);
+
+	/* Borrowed from sg_verify */
+	cdb[0] = VERIFY16_CMD;
+	cdb[1] = 0; /* skip PI, DPO, and byte check. */
+	cdb[2] = (llba >> 56) & 0xff;
+	cdb[3] = (llba >> 48) & 0xff;
+	cdb[4] = (llba >> 40) & 0xff;
+	cdb[5] = (llba >> 32) & 0xff;
+	cdb[6] = (llba >> 24) & 0xff;
+	cdb[7] = (llba >> 16) & 0xff;
+	cdb[8] = (llba >> 8) & 0xff;
+	cdb[9] = llba & 0xff;
+	cdb[10] = (veri_len >> 24) & 0xff;
+	cdb[11] = (veri_len >> 16) & 0xff;
+	cdb[12] = (veri_len >> 8) & 0xff;
+	cdb[13] = veri_len & 0xff;
+	cdb[14] = 0;
+	cdb[15] = 0;
+	memset(sense, 0, SENSE_BUF_LEN);
+
+	/* v3 SG_IO */
+	memset(&iohdr, 0, sizeof(iohdr));
+	iohdr.interface_id = 'S';
+	iohdr.dxfer_direction = SG_DXFER_NONE;
+	iohdr.cmdp = cdb;
+	iohdr.cmd_len = VERIFY16_CMDLEN;
+	iohdr.sbp = sense;
+	iohdr.mx_sb_len = SENSE_BUF_LEN;
+	iohdr.flags |= SG_FLAG_Q_AT_TAIL;
+	iohdr.timeout = 30000; /* 30s */
+
+	error = ioctl(disk->d_fd, SG_IO, &iohdr);
+	if (error)
+		return error;
+
+	dbg_printf("VERIFY(16) fd %d lba %"PRIu64" len %"PRIu64" info %x "
+			"status %d masked %d msg %d host %d driver %d "
+			"duration %d resid %d\n",
+			disk->d_fd, startblock, blockcount, iohdr.info,
+			iohdr.status, iohdr.masked_status, iohdr.msg_status,
+			iohdr.host_status, iohdr.driver_status, iohdr.duration,
+			iohdr.resid);
+
+	if (iohdr.info & SG_INFO_CHECK) {
+		dbg_printf("status: msg %x host %x driver %x\n",
+				iohdr.msg_status, iohdr.host_status,
+				iohdr.driver_status);
+		errno = EIO;
+		return -1;
+	}
+
+	return error;
+}
+#else
+# define disk_scsi_verify(...)		(ENOTTY)
+#endif /* HAVE_SG_IO */
+
+/* Test whether this device accepts SCSI VERIFY commands. */
+static bool
+disk_can_scsi_verify(
+	struct disk		*disk)
+{
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"))
+		return false;
+
+	error = disk_scsi_verify(disk, 0, 1);
+	return error == 0;
+}
+
+/* Open a disk device and discover its geometry. */
+int
+disk_open(
+	const char		*pathname,
+	struct disk		*disk)
+{
+#ifdef HAVE_HDIO_GETGEO
+	struct hd_geometry	bdgeo;
+#endif
+	bool			suspicious_disk = false;
+	int			lba_sz;
+	int			error;
+
+	disk->d_fd = open(pathname, O_RDONLY | O_DIRECT | O_NOATIME);
+	if (disk->d_fd < 0)
+		return -1;
+
+	/* Try to get LBA size. */
+	error = ioctl(disk->d_fd, BLKSSZGET, &lba_sz);
+	if (error)
+		lba_sz = 512;
+	disk->d_lbalog = libxfs_log2_roundup(lba_sz);
+
+	/* Obtain disk's stat info. */
+	error = fstat(disk->d_fd, &disk->d_sb);
+	if (error) {
+		error = errno;
+		close(disk->d_fd);
+		errno = error;
+		disk->d_fd = -1;
+		return -1;
+	}
+
+	/* Determine bdev size, block size, and offset. */
+	if (S_ISBLK(disk->d_sb.st_mode)) {
+		error = ioctl(disk->d_fd, BLKGETSIZE64, &disk->d_size);
+		if (error)
+			disk->d_size = 0;
+		error = ioctl(disk->d_fd, BLKBSZGET, &disk->d_blksize);
+		if (error)
+			disk->d_blksize = 0;
+#ifdef HAVE_HDIO_GETGEO
+		error = ioctl(disk->d_fd, HDIO_GETGEO, &bdgeo);
+		if (!error) {
+			/*
+			 * dm devices will pass through ioctls, which means
+			 * we can't use SCSI VERIFY unless the start is 0.
+			 * Most dm devices don't set geometry (unlike scsi
+			 * and nvme) so use a zeroed out CHS to screen them
+			 * out.
+			 */
+			if (bdgeo.start != 0 &&
+			    (unsigned long long)bdgeo.heads * bdgeo.sectors *
+					bdgeo.cylinders == 0)
+				suspicious_disk = true;
+			disk->d_start = bdgeo.start << BBSHIFT;
+		} else
+#endif
+			disk->d_start = 0;
+	} else {
+		disk->d_size = disk->d_sb.st_size;
+		disk->d_blksize = disk->d_sb.st_blksize;
+		disk->d_start = 0;
+	}
+
+	/* Can we issue SCSI VERIFY? */
+	if (!suspicious_disk && disk_can_scsi_verify(disk))
+		disk->d_flags |= DISK_FLAG_SCSI_VERIFY;
+
+	return 0;
+}
+
+/* Close a disk device. */
+int
+disk_close(
+	struct disk		*disk)
+{
+	int			error = 0;
+
+	if (disk->d_fd >= 0)
+		error = close(disk->d_fd);
+	disk->d_fd = -1;
+	return error;
+}
+
+/* Is this device open? */
+bool
+disk_is_open(
+	struct disk		*disk)
+{
+	return disk->d_fd >= 0;
+}
+
+#define BTOLBAT(d, bytes)	((uint64_t)(bytes) >> (d)->d_lbalog)
+#define LBASIZE(d)		(1ULL << (d)->d_lbalog)
+#define BTOLBA(d, bytes)	(((uint64_t)(bytes) + LBASIZE(d) - 1) >> (d)->d_lbalog)
+
+/* Read-verify an extent of a disk device. */
+ssize_t
+disk_read_verify(
+	struct disk		*disk,
+	void			*buf,
+	uint64_t		start,
+	uint64_t		length)
+{
+	/* Convert to logical block size. */
+	if (disk->d_flags & DISK_FLAG_SCSI_VERIFY)
+		return disk_scsi_verify(disk, BTOLBAT(disk, start),
+				BTOLBA(disk, length));
+
+	return pread(disk->d_fd, buf, length, start);
+}
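
The byte-by-byte CDB setup in disk_scsi_verify() above packs a 64-bit
starting LBA and a 32-bit block count in big-endian order.  The sketch
below does the same packing with loops so that the layout is easier to
check.  The verify16_fill_cdb() helper is a hypothetical name and is not
part of this patch.  It asserts that the block count fits in 32 bits,
whereas the code above silently truncates a larger count.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VERIFY16_CMDLEN	16
#define VERIFY16_CMD	0x8F

/* Fill cdb[] with a VERIFY(16) command for the given LBA range. */
static void
verify16_fill_cdb(
	unsigned char		cdb[VERIFY16_CMDLEN],
	uint64_t		lba,
	uint64_t		nr_blocks)
{
	int			i;

	/* The verification length field is only four bytes wide. */
	assert(nr_blocks <= UINT32_MAX);

	/* Byte 1 (flags), byte 14 and byte 15 all stay zero. */
	memset(cdb, 0, VERIFY16_CMDLEN);
	cdb[0] = VERIFY16_CMD;

	/* Bytes 2-9: starting LBA, most significant byte first. */
	for (i = 0; i < 8; i++)
		cdb[2 + i] = (lba >> (56 - 8 * i)) & 0xff;

	/* Bytes 10-13: number of logical blocks to verify. */
	for (i = 0; i < 4; i++)
		cdb[10 + i] = (nr_blocks >> (24 - 8 * i)) & 0xff;
}
```

The resulting buffer is what would be handed to SG_IO through
sg_io_hdr.cmdp, as in the function above.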
diff --git a/scrub/disk.h b/scrub/disk.h
new file mode 100644
index 0000000..8930075
--- /dev/null
+++ b/scrub/disk.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef DISK_H_
+#define DISK_H_
+
+#define DISK_FLAG_SCSI_VERIFY	0x1
+struct disk {
+	struct stat	d_sb;
+	int		d_fd;
+	int		d_lbalog;
+	unsigned int	d_flags;
+	unsigned int	d_blksize;	/* bytes */
+	uint64_t	d_size;		/* bytes */
+	uint64_t	d_start;	/* bytes */
+};
+
+unsigned int disk_heads(struct disk *disk);
+bool disk_is_open(struct disk *disk);
+int disk_open(const char *pathname, struct disk *disk);
+int disk_close(struct disk *disk);
+ssize_t disk_read_verify(struct disk *disk, void *buf, uint64_t start,
+		uint64_t length);
+
+#endif /* DISK_H_ */
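
The d_lbalog field declared above drives the byte-to-LBA conversion done
by the BTOLBAT and BTOLBA macros in scrub/disk.c: the start offset is
rounded down and the length is rounded up.  The sketch below restates
the two conversions as functions.  The function names are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round a byte offset down to a logical block number. */
static uint64_t
bytes_to_lba_trunc(
	unsigned int		lbalog,
	uint64_t		bytes)
{
	assert(lbalog < 64);
	return bytes >> lbalog;
}

/* Round a byte count up to a whole number of logical blocks. */
static uint64_t
bytes_to_lba_roundup(
	unsigned int		lbalog,
	uint64_t		bytes)
{
	assert(lbalog < 64);
	return (bytes + (1ULL << lbalog) - 1) >> lbalog;
}
```

With 512-byte logical blocks (lbalog of 9), a 1000-byte length rounds up
to two blocks.  Because the start and the length are rounded
independently, a range whose start is not LBA-aligned may fall one block
short at the tail.  Callers presumably pass aligned ranges, but that is
an assumption rather than something this patch states.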
diff --git a/scrub/iocmd.c b/scrub/iocmd.c
new file mode 100644
index 0000000..504396c
--- /dev/null
+++ b/scrub/iocmd.c
@@ -0,0 +1,239 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/xattr.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "iocmd.h"
+
+#define NR_EXTENTS	512
+
+/* Scan a filesystem tree. */
+struct scan_fs_tree {
+	unsigned int		nr_dirs;
+	pthread_mutex_t		lock;
+	pthread_cond_t		wakeup;
+	struct stat		root_sb;
+	bool			moveon;
+	bool			(*dir_fn)(struct scrub_ctx *, const char *,
+					  int, void *);
+	bool			(*dirent_fn)(struct scrub_ctx *, const char *,
+					     int, struct dirent *,
+					     struct stat *, void *);
+	void			*arg;
+};
+
+/* Per-work-item scan context. */
+struct scan_fs_tree_dir {
+	char			*path;
+	struct scan_fs_tree	*sft;
+	bool			rootdir;
+};
+
+/* Scan a directory sub tree. */
+static void
+scan_fs_dir(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct scan_fs_tree_dir	*sftd = arg;
+	struct scan_fs_tree	*sft = sftd->sft;
+	DIR			*dir;
+	struct dirent		*dirent;
+	char			newpath[PATH_MAX];
+	struct scan_fs_tree_dir	*new_sftd;
+	struct stat		sb;
+	int			dir_fd;
+	int			error;
+
+	/* Open the directory. */
+	dir_fd = open(sftd->path, O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (dir_fd < 0) {
+		if (errno != ENOENT)
+			str_errno(ctx, sftd->path);
+		goto out;
+	}
+
+	/* Caller-specific directory checks. */
+	if (sft->dir_fn && !sft->dir_fn(ctx, sftd->path, dir_fd, sft->arg)) {
+		sft->moveon = false;
+		goto out;
+	}
+
+	/* Caller-specific directory entry function on the rootdir. */
+	if (sftd->rootdir) {
+		/* Get the stat info for this directory entry. */
+		error = fstat(dir_fd, &sb);
+		if (error) {
+			str_errno(ctx, sftd->path);
+			goto out;
+		}
+		if (!sft->dirent_fn(ctx, sftd->path, dir_fd, NULL, &sb,
+				sft->arg)) {
+			sft->moveon = false;
+			goto out;
+		}
+	}
+
+	/* Iterate the directory entries. */
+	dir = fdopendir(dir_fd);
+	if (!dir) {
+		str_errno(ctx, sftd->path);
+		goto out;
+	}
+	rewinddir(dir);
+	for (dirent = readdir(dir); dirent != NULL; dirent = readdir(dir)) {
+		snprintf(newpath, PATH_MAX, "%s/%s", sftd->path,
+				dirent->d_name);
+
+		/* Get the stat info for this directory entry. */
+		error = fstatat(dir_fd, dirent->d_name, &sb,
+				AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW);
+		if (error) {
+			str_errno(ctx, newpath);
+			continue;
+		}
+
+		/* Ignore files on other filesystems. */
+		if (sb.st_dev != sft->root_sb.st_dev)
+			continue;
+
+		/* Caller-specific directory entry function. */
+		if (!sft->dirent_fn(ctx, newpath, dir_fd, dirent, &sb,
+				sft->arg)) {
+			sft->moveon = false;
+			break;
+		}
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			sft->moveon = false;
+			break;
+		}
+
+		/* If directory, call ourselves recursively. */
+		if (S_ISDIR(sb.st_mode) && strcmp(".", dirent->d_name) &&
+		    strcmp("..", dirent->d_name)) {
+			new_sftd = malloc(sizeof(struct scan_fs_tree_dir));
+			if (!new_sftd) {
+				str_errno(ctx, newpath);
+				sft->moveon = false;
+				break;
+			}
+			new_sftd->path = strdup(newpath);
+			new_sftd->sft = sft;
+			new_sftd->rootdir = false;
+			pthread_mutex_lock(&sft->lock);
+			sft->nr_dirs++;
+			pthread_mutex_unlock(&sft->lock);
+			queue_work(wq, scan_fs_dir, 0, new_sftd);
+		}
+	}
+
+	/* Close dir, go away. */
+	error = closedir(dir);
+	if (error)
+		str_errno(ctx, sftd->path);
+
+out:
+	pthread_mutex_lock(&sft->lock);
+	sft->nr_dirs--;
+	if (sft->nr_dirs == 0)
+		pthread_cond_signal(&sft->wakeup);
+	pthread_mutex_unlock(&sft->lock);
+
+	free(sftd->path);
+	free(sftd);
+}
+
+/* Scan the entire filesystem. */
+bool
+scan_fs_tree(
+	struct scrub_ctx	*ctx,
+	bool			(*dir_fn)(struct scrub_ctx *, const char *,
+					  int, void *),
+	bool			(*dirent_fn)(struct scrub_ctx *, const char *,
+						int, struct dirent *,
+						struct stat *, void *),
+	void			*arg)
+{
+	struct work_queue	wq;
+	struct scan_fs_tree	sft;
+	struct scan_fs_tree_dir	*sftd;
+
+	sft.moveon = true;
+	sft.nr_dirs = 1;
+	sft.root_sb = ctx->mnt_sb;
+	sft.dir_fn = dir_fn;
+	sft.dirent_fn = dirent_fn;
+	sft.arg = arg;
+	pthread_mutex_init(&sft.lock, NULL);
+	pthread_cond_init(&sft.wakeup, NULL);
+
+	sftd = malloc(sizeof(struct scan_fs_tree_dir));
+	if (!sftd) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	sftd->path = strdup(ctx->mntpoint);
+	sftd->sft = &sft;
+	sftd->rootdir = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	queue_work(&wq, scan_fs_dir, 0, sftd);
+
+	pthread_mutex_lock(&sft.lock);
+	while (sft.nr_dirs > 0)
+		pthread_cond_wait(&sft.wakeup, &sft.lock);
+	pthread_mutex_unlock(&sft.lock);
+	destroy_work_queue(&wq);
+
+	return sft.moveon;
+}
+
+#ifndef FITRIM
+struct fstrim_range {
+	__u64 start;
+	__u64 len;
+	__u64 minlen;
+};
+#define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#endif
+
+/* Call FITRIM to trim all the unused space in a filesystem. */
+void
+fstrim(
+	struct scrub_ctx	*ctx)
+{
+	struct fstrim_range	range = {0};
+	int			error;
+
+	range.len = ULLONG_MAX;
+	error = ioctl(ctx->mnt_fd, FITRIM, &range);
+	if (error && errno != EOPNOTSUPP && errno != ENOTTY)
+		perror(_("fstrim"));
+}
diff --git a/scrub/iocmd.h b/scrub/iocmd.h
new file mode 100644
index 0000000..047e5fc
--- /dev/null
+++ b/scrub/iocmd.h
@@ -0,0 +1,50 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef IOCMD_H_
+#define IOCMD_H_
+
+struct fiemap_extent;
+
+bool
+scan_fs_tree(
+	struct scrub_ctx	*ctx,
+	bool			(*dir_fn)(struct scrub_ctx *, const char *,
+					  int, void *),
+	bool			(*dirent_fn)(struct scrub_ctx *, const char *,
+						int, struct dirent *,
+						struct stat *, void *),
+	void			*arg);
+
+bool
+fiemap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	bool			attr_fork,
+	bool			fibmap,
+	bool			(*fn)(struct scrub_ctx *, const char *,
+				      struct fiemap_extent *, void *),
+	void			*arg);
+
+void
+fstrim(
+	struct scrub_ctx	*ctx);
+
+#endif /* IOCMD_H_ */
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
new file mode 100644
index 0000000..48de6e1
--- /dev/null
+++ b/scrub/read_verify.c
@@ -0,0 +1,316 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+
+/* How many bytes have we verified? */
+static pthread_mutex_t		verified_lock = PTHREAD_MUTEX_INITIALIZER;
+static unsigned long long	verified_bytes;
+
+/* Tolerate 64k holes in adjacent read verify requests. */
+#define IO_BATCH_LOCALITY	(65536)
+
+/* Create a thread pool to run read verifiers. */
+bool
+read_verify_pool_init(
+	struct read_verify_pool		*rvp,
+	struct scrub_ctx		*ctx,
+	void				*readbuf,
+	size_t				readbufsz,
+	size_t				min_io_sz,
+	read_verify_ioend_fn_t		ioend_fn,
+	unsigned int			nproc)
+{
+	rvp->rvp_readbuf = readbuf;
+	rvp->rvp_readbufsz = readbufsz;
+	rvp->rvp_ctx = ctx;
+	rvp->rvp_min_io_size = min_io_sz;
+	rvp->ioend_fn = ioend_fn;
+	rvp->rvp_nproc = nproc;
+	create_work_queue(&rvp->rvp_wq, (struct xfs_mount *)rvp, nproc);
+	return true;
+}
+
+/* How many bytes has this process verified? */
+unsigned long long
+read_verify_bytes(void)
+{
+	return verified_bytes;
+}
+
+/* Finish up any read verification work and tear it down. */
+void
+read_verify_pool_destroy(
+	struct read_verify_pool		*rvp)
+{
+	destroy_work_queue(&rvp->rvp_wq);
+	memset(&rvp->rvp_wq, 0, sizeof(struct work_queue));
+}
+
+/*
+ * Issue a read-verify IO in big batches.
+ */
+static void
+read_verify(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct read_verify		*rv = arg;
+	struct read_verify_pool		*rvp;
+	unsigned long long		verified = 0;
+	ssize_t				sz;
+	ssize_t				len;
+
+	rvp = (struct read_verify_pool *)wq->mp;
+	while (rv->io_length > 0) {
+		len = min(rv->io_length, rvp->rvp_readbufsz);
+		dbg_printf("diskverify %d %"PRIu64" %zd\n", rv->io_disk->d_fd,
+				rv->io_start, len);
+		sz = disk_read_verify(rv->io_disk, rvp->rvp_readbuf,
+				rv->io_start, len);
+		if (sz < 0) {
+			dbg_printf("IOERR %d %"PRIu64" %zd\n",
+					rv->io_disk->d_fd,
+					rv->io_start, len);
+			rvp->ioend_fn(rvp, rv->io_disk, rv->io_start,
+					rvp->rvp_min_io_size,
+					errno, rv->io_end_arg);
+			len = min(rv->io_length, rvp->rvp_min_io_size);
+		}
+
+		verified += len;
+		rv->io_start += len;
+		rv->io_length -= len;
+	}
+
+	free(rv);
+	pthread_mutex_lock(&verified_lock);
+	verified_bytes += verified;
+	pthread_mutex_unlock(&verified_lock);
+}
+
+/* Queue a read verify request. */
+static void
+read_verify_queue(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv)
+{
+	struct read_verify		*tmp;
+
+	dbg_printf("verify fd %d start %"PRIu64" len %"PRIu64"\n",
+			rv->io_disk->d_fd, rv->io_start, rv->io_length);
+
+	tmp = malloc(sizeof(struct read_verify));
+	if (!tmp) {
+		rvp->ioend_fn(rvp, rv->io_disk, rv->io_start, rv->io_length,
+				errno, rv->io_end_arg);
+		return;
+	}
+	*tmp = *rv;
+
+	queue_work(&rvp->rvp_wq, read_verify, 0, tmp);
+}
+
+/*
+ * Issue an IO request.  We'll batch subsequent requests if they're
+ * within 64k of each other.
+ */
+void
+read_verify_schedule(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length,
+	void				*end_arg)
+{
+	uint64_t			ve_end;
+	uint64_t			io_end;
+
+	assert(rvp->rvp_readbuf);
+	ve_end = start + length;
+	io_end = rv->io_start + rv->io_length;
+
+	/*
+	 * If we have a stashed IO, we haven't changed fds, the error
+	 * reporting is the same, and the two extents are close,
+	 * we can combine them.
+	 */
+	if (rv->io_length > 0 && disk == rv->io_disk &&
+	    end_arg == rv->io_end_arg &&
+	    ((start >= rv->io_start && start <= io_end + IO_BATCH_LOCALITY) ||
+	     (rv->io_start >= start &&
+	      rv->io_start <= ve_end + IO_BATCH_LOCALITY))) {
+		rv->io_start = min(rv->io_start, start);
+		rv->io_length = max(ve_end, io_end) - rv->io_start;
+	} else  {
+		/* Otherwise, issue the stashed IO (if there is one) */
+		if (rv->io_length > 0)
+			read_verify_queue(rvp, rv);
+
+		/* Stash the new IO. */
+		rv->io_disk = disk;
+		rv->io_start = start;
+		rv->io_length = length;
+		rv->io_end_arg = end_arg;
+	}
+}
+
+/* Force any stashed IOs into the verifier. */
+void
+read_verify_force(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv)
+{
+	assert(rvp->rvp_readbuf);
+	if (rv->io_length == 0)
+		return;
+
+	read_verify_queue(rvp, rv);
+	rv->io_length = 0;
+}
+
+/* Read all the data in a file. */
+bool
+read_verify_file(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb)
+{
+	off_t			data_end = 0;
+	off_t			data_start;
+	off_t			start;
+	ssize_t			sz;
+	size_t			count;
+	unsigned long long	verified = 0;
+	bool			reports_holes = true;
+	bool			direct_io = false;
+	bool			moveon = true;
+	int			flags;
+	int			error;
+
+	/*
+	 * Try to force the kernel to read file data from disk.  First
+	 * we try to set O_DIRECT.  If that fails, try to purge the page
+	 * cache.
+	 */
+	flags = fcntl(fd, F_GETFL);
+	error = fcntl(fd, F_SETFL, flags | O_DIRECT);
+	if (error)
+		posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_DONTNEED);
+	else
+		direct_io = true;
+
+	/* See if SEEK_DATA/SEEK_HOLE work... */
+	data_start = lseek(fd, data_end, SEEK_DATA);
+	if (data_start < 0) {
+		/* ENXIO for SEEK_DATA means no file data anywhere. */
+		if (errno == ENXIO)
+			return true;
+		reports_holes = false;
+	}
+
+	if (reports_holes) {
+		data_end = lseek(fd, data_start, SEEK_HOLE);
+		if (data_end < 0)
+			reports_holes = false;
+	}
+
+	/* ...or just read everything if they don't. */
+	if (!reports_holes) {
+		data_start = 0;
+		data_end = sb->st_size;
+	}
+
+	if (!direct_io) {
+		posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_SEQUENTIAL);
+		posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_WILLNEED);
+	}
+	/* Read the non-hole areas. */
+	while (data_start < data_end) {
+		start = data_start;
+
+		if (direct_io && (start & (page_size - 1)))
+			start &= ~(page_size - 1);
+		count = min(IO_MAX_SIZE, data_end - start);
+		if (direct_io && (count & (page_size - 1)))
+			count = (count + page_size) & ~(page_size - 1);
+		sz = pread(fd, ctx->readbuf, count, start);
+		if (sz < 0) {
+			str_errno(ctx, descr);
+			break;
+		} else if (sz == 0) {
+			str_error(ctx, descr,
+_("Read zero bytes, expected %zu."),
+					count);
+			break;
+		} else if (sz != count && start + sz != data_end) {
+			str_warn(ctx, descr,
+_("Short read of %zd bytes, expected %zu."),
+					sz, count);
+		}
+		verified += sz;
+		data_start = start + sz;
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			moveon = false;
+			break;
+		}
+
+		if (data_start >= data_end && reports_holes) {
+			data_start = lseek(fd, data_end, SEEK_DATA);
+			if (data_start < 0) {
+				if (errno != ENXIO)
+					str_errno(ctx, descr);
+				break;
+			}
+			data_end = lseek(fd, data_start, SEEK_HOLE);
+			if (data_end < 0) {
+				if (errno != ENXIO)
+					str_errno(ctx, descr);
+				break;
+			}
+		}
+	}
+
+	/* Turn off O_DIRECT. */
+	if (direct_io) {
+		flags = fcntl(fd, F_GETFL);
+		error = fcntl(fd, F_SETFL, flags & ~O_DIRECT);
+		if (error)
+			str_errno(ctx, descr);
+	}
+
+	pthread_mutex_lock(&verified_lock);
+	verified_bytes += verified;
+	pthread_mutex_unlock(&verified_lock);
+
+	return moveon;
+}
diff --git a/scrub/read_verify.h b/scrub/read_verify.h
new file mode 100644
index 0000000..a10ba8c
--- /dev/null
+++ b/scrub/read_verify.h
@@ -0,0 +1,59 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef READ_VERIFY_H_
+#define READ_VERIFY_H_
+
+struct read_verify_pool;
+
+typedef void (*read_verify_ioend_fn_t)(struct read_verify_pool *rvp,
+		struct disk *disk, uint64_t start, uint64_t length,
+		int error, void *arg);
+typedef void (*read_verify_ioend_arg_free_fn_t)(void *arg);
+
+struct read_verify_pool {
+	struct work_queue	rvp_wq;
+	struct scrub_ctx	*rvp_ctx;
+	void			*rvp_readbuf;
+	read_verify_ioend_fn_t	ioend_fn;
+	read_verify_ioend_arg_free_fn_t	ioend_arg_free_fn;
+	size_t			rvp_readbufsz;		/* bytes */
+	size_t			rvp_min_io_size;	/* bytes */
+	int			rvp_nproc;
+};
+
+bool read_verify_pool_init(struct read_verify_pool *rvp, struct scrub_ctx *ctx,
+		void *readbuf, size_t readbufsz, size_t min_io_sz,
+		read_verify_ioend_fn_t ioend_fn, unsigned int nproc);
+void read_verify_pool_destroy(struct read_verify_pool *rvp);
+
+struct read_verify {
+	void			*io_end_arg;
+	struct disk		*io_disk;
+	uint64_t		io_start;	/* bytes */
+	uint64_t		io_length;	/* bytes */
+};
+
+void read_verify_schedule(struct read_verify_pool *rvp, struct read_verify *rv,
+		struct disk *disk, uint64_t start, uint64_t length,
+		void *end_arg);
+void read_verify_force(struct read_verify_pool *rvp, struct read_verify *rv);
+unsigned long long read_verify_bytes(void);
+
+#endif /* READ_VERIFY_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
new file mode 100644
index 0000000..013559a
--- /dev/null
+++ b/scrub/scrub.c
@@ -0,0 +1,950 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <mntent.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/statvfs.h>
+#include <sys/vfs.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+
+#define _PATH_PROC_MOUNTS	"/proc/mounts"
+
+bool				verbose;
+int				debug;
+bool				scrub_data;
+bool				dumpcore;
+bool				display_rusage;
+long				page_size;
+int				nr_threads = -1;
+enum errors_action		error_action = ERRORS_CONTINUE;
+static unsigned long		max_errors;
+
+static void __attribute__((noreturn))
+usage(void)
+{
+	fprintf(stderr, _("Usage: %s [OPTIONS] mountpoint\n"), progname);
+	fprintf(stderr, _("-a:\tStop after this many errors are found.\n"));
+	fprintf(stderr, _("-d:\tRun program in debug mode.\n"));
+	fprintf(stderr, _("-e:\tWhat to do if errors are found.\n"));
+	fprintf(stderr, _("-j:\tStart no more than this many threads.\n"));
+	fprintf(stderr, _("-m:\tPath to /etc/mtab.\n"));
+	fprintf(stderr, _("-n:\tDry run.  Do not modify anything.\n"));
+	fprintf(stderr, _("-T:\tDisplay timing/usage information.\n"));
+	fprintf(stderr, _("-v:\tVerbose output.\n"));
+	fprintf(stderr, _("-V:\tPrint version.\n"));
+	fprintf(stderr, _("-x:\tScrub file data too.\n"));
+	fprintf(stderr, _("-y:\tRepair all errors.\n"));
+
+	exit(16);
+}
+
+/*
+ * Check if the argument is either the device name or mountpoint of a mounted
+ * filesystem.
+ */
+#define MNTTYPE_XFS	"xfs"
+static bool
+find_mountpoint_check(
+	struct stat		*sb,
+	struct mntent		*t)
+{
+	struct stat		ms;
+
+	if (S_ISDIR(sb->st_mode)) {		/* mount point */
+		if (stat(t->mnt_dir, &ms) < 0)
+			return false;
+		if (sb->st_ino != ms.st_ino)
+			return false;
+		if (sb->st_dev != ms.st_dev)
+			return false;
+		if (strcmp(t->mnt_type, MNTTYPE_XFS) != 0)
+			return false;
+	} else {				/* device */
+		if (stat(t->mnt_fsname, &ms) < 0)
+			return false;
+		if (sb->st_rdev != ms.st_rdev)
+			return false;
+		if (strcmp(t->mnt_type, MNTTYPE_XFS) != 0)
+			return false;
+		/*
+		 * Make sure the mountpoint given by mtab is accessible
+		 * before using it.
+		 */
+		if (stat(t->mnt_dir, &ms) < 0)
+			return false;
+	}
+
+	return true;
+}
+
+/* Check that our alleged mountpoint is in mtab */
+static bool
+find_mountpoint(
+	char			*mtab,
+	struct scrub_ctx	*ctx)
+{
+	struct mntent_cursor	cursor;
+	struct mntent		*t = NULL;
+	bool			found = false;
+
+	if (platform_mntent_open(&cursor, mtab) != 0) {
+		fprintf(stderr, _("Error: can't get mntent entries.\n"));
+		exit(1);
+	}
+
+	while ((t = platform_mntent_next(&cursor)) != NULL) {
+		/*
+		 * Keep jotting down matching mount details; newer mounts are
+		 * towards the end of the file (hopefully).
+		 */
+		if (find_mountpoint_check(&ctx->mnt_sb, t)) {
+			ctx->mntpoint = strdup(t->mnt_dir);
+			ctx->mnt_type = strdup(t->mnt_type);
+			ctx->blkdev = strdup(t->mnt_fsname);
+			found = true;
+		}
+	}
+	platform_mntent_close(&cursor);
+	return found;
+}
+
+/* Too many errors? Bail out. */
+bool
+xfs_scrub_excessive_errors(
+	struct scrub_ctx	*ctx)
+{
+	bool			ret;
+
+	pthread_mutex_lock(&ctx->lock);
+	ret = max_errors > 0 && ctx->errors_found >= max_errors;
+	pthread_mutex_unlock(&ctx->lock);
+
+	return ret;
+}
+
+/* Print a string and whatever error is stored in errno. */
+void
+__str_errno(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line)
+{
+	char			buf[DESCR_BUFSZ];
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: %s.", str, strerror_r(errno, buf, DESCR_BUFSZ));
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->errors_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some error text. */
+void
+__str_error(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: ", str);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->errors_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some warning text. */
+void
+__str_warn(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: ", str);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->warnings_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some informational text. */
+void
+__str_info(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stdout, "%s: ", str);
+	va_start(args, format);
+	vfprintf(stdout, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stdout, " (%s line %d)", file, line);
+	fprintf(stdout, "\n");
+	fflush(stdout);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Increment the repair count. */
+void
+__record_repair(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: ", str);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->repairs++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Increment the optimization (preening) count. */
+void
+__record_preen(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	if (debug || verbose) {
+		fprintf(stdout, "%s: ", str);
+		va_start(args, format);
+		vfprintf(stdout, format, args);
+		va_end(args);
+		if (debug)
+			fprintf(stdout, " (%s line %d)", file, line);
+		fprintf(stdout, "\n");
+		fflush(stdout);
+	}
+	ctx->preens++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+void __attribute__((noreturn))
+do_error(char const *msg, ...)
+{
+	va_list args;
+
+	fprintf(stderr, _("\nfatal error -- "));
+
+	va_start(args, msg);
+	vfprintf(stderr, msg, args);
+	if (dumpcore)
+		abort();
+	exit(1);
+}
+
+/* How many threads to kick off? */
+unsigned int
+scrub_nproc(
+	struct scrub_ctx	*ctx)
+{
+	if (nr_threads < 0)
+		return ctx->nr_io_threads;
+	return min(ctx->nr_io_threads, nr_threads);
+}
+
+/* Decide if a value is within +/- (n/d) of a desired value. */
+bool
+within_range(
+	struct scrub_ctx	*ctx,
+	unsigned long long	value,
+	unsigned long long	desired,
+	unsigned long long	diff_threshold,
+	unsigned int		n,
+	unsigned int		d,
+	const char		*descr)
+{
+	assert(n < d);
+
+	/* Don't complain if difference does not exceed an absolute value. */
+	if (value < desired && desired - value < diff_threshold)
+		return true;
+	if (value > desired && value - desired < diff_threshold)
+		return true;
+
+	/* Complain if the difference exceeds a certain percentage. */
+	if (value < desired * (d - n) / d) {
+		str_warn(ctx, ctx->mntpoint,
+_("Found fewer %s than reported"), descr);
+		return false;
+	}
+	if (value > desired * (d + n) / d) {
+		str_warn(ctx, ctx->mntpoint,
+_("Found more %s than reported"), descr);
+		return false;
+	}
+	return true;
+}
+
+static double
+timeval_subtract(
+	struct timeval		*tv1,
+	struct timeval		*tv2)
+{
+	return ((tv1->tv_sec - tv2->tv_sec) +
+		((double) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
+}
+
+/* Produce human readable disk space output. */
+double
+auto_space_units(
+	unsigned long long	bytes,
+	char			**units)
+{
+	if (debug > 1)
+		goto no_prefix;
+	if (bytes > (1ULL << 40)) {
+		*units = "TiB";
+		return (double)bytes / (1ULL << 40);
+	} else if (bytes > (1ULL << 30)) {
+		*units = "GiB";
+		return (double)bytes / (1ULL << 30);
+	} else if (bytes > (1ULL << 20)) {
+		*units = "MiB";
+		return (double)bytes / (1ULL << 20);
+	} else if (bytes > (1ULL << 10)) {
+		*units = "KiB";
+		return (double)bytes / (1ULL << 10);
+	} else {
+no_prefix:
+		*units = "B";
+		return bytes;
+	}
+}
+
+/* Produce human readable discrete number output. */
+double
+auto_units(
+	unsigned long long	number,
+	char			**units)
+{
+	if (debug > 1)
+		goto no_prefix;
+	if (number > 1000000000000ULL) {
+		*units = "T";
+		return number / 1000000000000.0;
+	} else if (number > 1000000000ULL) {
+		*units = "G";
+		return number / 1000000000.0;
+	} else if (number > 1000000ULL) {
+		*units = "M";
+		return number / 1000000.0;
+	} else if (number > 1000ULL) {
+		*units = "K";
+		return number / 1000.0;
+	} else {
+no_prefix:
+		*units = "";
+		return number;
+	}
+}
+
+/*
+ * Given a directory fd and (possibly) a dirent, open the file associated
+ * with the entry.  If the entry is null, just duplicate the dir_fd.
+ */
+int
+dirent_open(
+	int			dir_fd,
+	struct dirent		*dirent)
+{
+	if (!dirent)
+		return dup(dir_fd);
+	return openat(dir_fd, dirent->d_name,
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+}
+
+#ifndef RUSAGE_BOTH
+# define RUSAGE_BOTH		(-2)
+#endif
+
+/* Get resource usage for ourselves and all children. */
+int
+scrub_getrusage(
+	struct rusage		*usage)
+{
+	struct rusage		cusage;
+	int			err;
+
+	err = getrusage(RUSAGE_BOTH, usage);
+	if (!err)
+		return err;
+
+	err = getrusage(RUSAGE_SELF, usage);
+	if (err)
+		return err;
+
+	err = getrusage(RUSAGE_CHILDREN, &cusage);
+	if (err)
+		return err;
+
+	usage->ru_minflt += cusage.ru_minflt;
+	usage->ru_majflt += cusage.ru_majflt;
+	usage->ru_nswap += cusage.ru_nswap;
+	usage->ru_inblock += cusage.ru_inblock;
+	usage->ru_oublock += cusage.ru_oublock;
+	usage->ru_msgsnd += cusage.ru_msgsnd;
+	usage->ru_msgrcv += cusage.ru_msgrcv;
+	usage->ru_nsignals += cusage.ru_nsignals;
+	usage->ru_nvcsw += cusage.ru_nvcsw;
+	usage->ru_nivcsw += cusage.ru_nivcsw;
+	return 0;
+}
+
+struct phase_info {
+	struct rusage		ruse;
+	struct timeval		time;
+	unsigned long long	verified_bytes;
+	void			*brk_start;
+	const char		*tag;
+};
+
+/* Start tracking resource usage for a phase. */
+static bool
+phase_start(
+	struct phase_info	*pi,
+	const char		*tag,
+	const char		*descr)
+{
+	int			error;
+
+	error = scrub_getrusage(&pi->ruse);
+	if (error) {
+		perror(_("getrusage"));
+		return false;
+	}
+	pi->brk_start = sbrk(0);
+
+	error = gettimeofday(&pi->time, NULL);
+	if (error) {
+		perror(_("gettimeofday"));
+		return false;
+	}
+	pi->tag = tag;
+
+	pi->verified_bytes = read_verify_bytes();
+
+	if ((verbose || display_rusage) && descr) {
+		fprintf(stdout, _("%s%s\n"), pi->tag, descr);
+		fflush(stdout);
+	}
+	return true;
+}
+
+/* Report usage stats. */
+static bool
+phase_end(
+	struct phase_info	*pi)
+{
+	struct rusage		ruse_now;
+#ifdef HAVE_MALLINFO
+	struct mallinfo		mall_now;
+#endif
+	struct timeval		time_now;
+	double			dt;
+	unsigned long long	verified;
+	long			in, out;
+	long			io;
+	double			i, o, t;
+	double			din, dout, dtot;
+	char			*iu, *ou, *tu, *dinu, *doutu, *dtotu;
+	double			v, dv;
+	char			*vu, *dvu;
+	int			error;
+
+	if (!display_rusage)
+		return true;
+
+	error = gettimeofday(&time_now, NULL);
+	if (error) {
+		perror(_("gettimeofday"));
+		return false;
+	}
+	dt = timeval_subtract(&time_now, &pi->time);
+
+	error = scrub_getrusage(&ruse_now);
+	if (error) {
+		perror(_("getrusage"));
+		return false;
+	}
+
+#define kbytes(x)	(((unsigned long)(x) + 1023) / 1024)
+#ifdef HAVE_MALLINFO
+
+	mall_now = mallinfo();
+	fprintf(stdout, _("%sMemory used: %luk/%luk (%luk/%luk), "), pi->tag,
+		kbytes(mall_now.arena), kbytes(mall_now.hblkhd),
+		kbytes(mall_now.uordblks), kbytes(mall_now.fordblks));
+#else
+	fprintf(stdout, _("%sMemory used: %luk, "), pi->tag,
+		(unsigned long) kbytes(((char *) sbrk(0)) -
+				       ((char *) pi->brk_start)));
+#endif
+#undef kbytes
+
+	fprintf(stdout, _("time: %5.2f/%5.2f/%5.2fs\n"),
+		timeval_subtract(&time_now, &pi->time),
+		timeval_subtract(&ruse_now.ru_utime, &pi->ruse.ru_utime),
+		timeval_subtract(&ruse_now.ru_stime, &pi->ruse.ru_stime));
+
+	/* I/O usage */
+	in =  (ruse_now.ru_inblock - pi->ruse.ru_inblock) << BBSHIFT;
+	out = (ruse_now.ru_oublock - pi->ruse.ru_oublock) << BBSHIFT;
+	io = in + out;
+	if (io) {
+		i = auto_space_units(in, &iu);
+		o = auto_space_units(out, &ou);
+		t = auto_space_units(io, &tu);
+		din = auto_space_units(in / dt, &dinu);
+		dout = auto_space_units(out / dt, &doutu);
+		dtot = auto_space_units(io / dt, &dtotu);
+		fprintf(stdout,
+_("%sI/O: %.1f%s in, %.1f%s out, %.1f%s tot\n"),
+			pi->tag, i, iu, o, ou, t, tu);
+		fprintf(stdout,
+_("%sI/O rate: %.1f%s/s in, %.1f%s/s out, %.1f%s/s tot\n"),
+			pi->tag, din, dinu, dout, doutu, dtot, dtotu);
+	}
+
+	/* How many bytes were read-verified? */
+	verified = read_verify_bytes() - pi->verified_bytes;
+	if (verified) {
+		v = auto_space_units(verified, &vu);
+		dv = auto_space_units(verified / dt, &dvu);
+		fprintf(stdout, _("%sVerify: %.1f%s, rate: %.1f%s/s\n"),
+			pi->tag, v, vu, dv, dvu);
+	}
+	fflush(stdout);
+
+	return true;
+}
+
+/* Find filesystem geometry and perform any other setup functions. */
+static bool
+find_geo(
+	struct scrub_ctx	*ctx)
+{
+	bool			moveon = true;
+	int			error;
+
+	/*
+	 * Open the directory with O_NOATIME.  For mountpoints owned
+	 * by root, this should be sufficient to ensure that we have
+	 * CAP_SYS_ADMIN, which we probably need to do anything fancy
+	 * with the kernel's XFS driver.
+	 */
+	ctx->mnt_fd = open(ctx->mntpoint, O_RDONLY | O_NOATIME | O_DIRECTORY);
+	if (ctx->mnt_fd < 0) {
+		if (errno == EPERM)
+			str_info(ctx, ctx->mntpoint,
+_("Must be root to run scrub."));
+		else
+			str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = disk_open(ctx->blkdev, &ctx->datadev);
+	if (error && errno != ENOENT)
+		str_errno(ctx, ctx->blkdev);
+
+	error = fstat(ctx->mnt_fd, &ctx->mnt_sb);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = fstatvfs(ctx->mnt_fd, &ctx->mnt_sv);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = fstatfs(ctx->mnt_fd, &ctx->mnt_sf);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	if (disk_is_open(&ctx->datadev))
+		ctx->nr_io_threads = disk_heads(&ctx->datadev);
+	else
+		ctx->nr_io_threads = libxfs_nproc();
+	if (verbose) {
+		fprintf(stdout, _("%s: using %d threads to scrub.\n"),
+				ctx->mntpoint, scrub_nproc(ctx));
+		fflush(stdout);
+	}
+
+out:
+	return moveon;
+}
+
+struct scrub_phase {
+	char		*descr;
+	bool		(*fn)(struct scrub_ctx *);
+};
+
+/* Run the preening phase if there are no errors. */
+static bool
+preen(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->errors_found) {
+		str_info(ctx, ctx->mntpoint,
+_("Errors found, please re-run with -y."));
+		return true;
+	}
+
+	return false;
+}
+
+/* Run all the phases of the scrubber. */
+#define REPAIR_DUMMY_FN		((void *)1)
+#define DATASCAN_DUMMY_FN	((void *)2)
+static bool
+run_scrub_phases(
+	struct scrub_ctx	*ctx)
+{
+	struct scrub_phase	phases[] = {
+		{_("Find filesystem geometry."),   find_geo},
+		{_("Check internal metadata."),	   NULL},
+		{_("Scan all inodes."),		   NULL},
+		{NULL, REPAIR_DUMMY_FN},
+		{_("Verify data file integrity."), DATASCAN_DUMMY_FN},
+		{_("Check summary counters."),	   NULL},
+		{NULL, NULL},
+	};
+	struct phase_info	pi;
+	char			buf[DESCR_BUFSZ];
+	struct scrub_phase	*phase;
+	bool			moveon;
+	int			c;
+
+	/* Run all phases of the scrub tool. */
+	for (c = 1, phase = phases; phase->fn; phase++, c++) {
+		/* Inject the repair/preen function. */
+		if (phase->fn == REPAIR_DUMMY_FN) {
+			if (ctx->mode == SCRUB_MODE_PREEN) {
+				phase->descr = _("Preen filesystem.");
+				phase->fn = preen;
+			} else if (ctx->mode == SCRUB_MODE_REPAIR) {
+				phase->descr = _("Repair filesystem.");
+			}
+		} else if (phase->fn == DATASCAN_DUMMY_FN && scrub_data)
+			;
+
+		if (phase->fn == REPAIR_DUMMY_FN ||
+		    phase->fn == DATASCAN_DUMMY_FN) {
+			c--;
+			continue;
+		} else if (phase->descr)
+			snprintf(buf, DESCR_BUFSZ, _("Phase %d: "), c);
+		else
+			buf[0] = 0;
+		moveon = phase_start(&pi, buf, phase->descr);
+		if (!moveon)
+			return false;
+		moveon = phase->fn(ctx);
+		if (!moveon)
+			return false;
+		moveon = phase_end(&pi);
+		if (!moveon)
+			return false;
+
+		/* Too many errors? */
+		if (xfs_scrub_excessive_errors(ctx))
+			return false;
+	}
+
+	return true;
+}
+
+int
+main(
+	int			argc,
+	char			**argv)
+{
+	int			c;
+	char			*mtab = NULL;
+	struct scrub_ctx	ctx = {0};
+	struct phase_info	all_pi;
+	long			arg;
+	bool			ismnt;
+	bool			moveon = true;
+	static bool		injected;
+	int			ret;
+	int			error;
+
+	progname = basename(argv[0]);
+	setlocale(LC_ALL, "");
+	bindtextdomain(PACKAGE, LOCALEDIR);
+	textdomain(PACKAGE);
+
+	pthread_mutex_init(&ctx.lock, NULL);
+	ctx.datadev.d_fd = -1;
+	ctx.mode = SCRUB_MODE_DEFAULT;
+	while ((c = getopt(argc, argv, "a:de:j:m:nTvxVy")) != EOF) {
+		switch (c) {
+		case 'a':
+			max_errors = strtoull(optarg, NULL, 10);
+			if (errno) {
+				perror("max_errors");
+				usage();
+			}
+			break;
+		case 'd':
+			debug++;
+			dumpcore = true;
+			break;
+		case 'e':
+			if (!strcmp("continue", optarg))
+				error_action = ERRORS_CONTINUE;
+			else if (!strcmp("shutdown", optarg))
+				error_action = ERRORS_SHUTDOWN;
+			else
+				usage();
+			break;
+		case 'j':
+			arg = strtol(optarg, NULL, 10);
+			if (errno || arg < 0 || arg > INT_MAX) {
+				perror("nr_threads");
+				usage();
+			}
+			nr_threads = arg;
+			break;
+		case 'm':
+			mtab = optarg;
+			break;
+		case 'n':
+			if (ctx.mode != SCRUB_MODE_DEFAULT) {
+				fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+				return 1;
+			}
+			ctx.mode = SCRUB_MODE_DRY_RUN;
+			break;
+		case 'T':
+			display_rusage = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		case 'x':
+			scrub_data = true;
+			break;
+		case 'V':
+			fprintf(stdout, _("%s version %s\n"), progname,
+					VERSION);
+			fflush(stdout);
+			exit(0);
+		case 'y':
+			if (ctx.mode != SCRUB_MODE_DEFAULT) {
+				fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+				return 1;
+			}
+			ctx.mode = SCRUB_MODE_REPAIR;
+			break;
+		case '?':
+			/* fall through */
+		default:
+			usage();
+		}
+	}
+
+	if (optind != argc - 1)
+		usage();
+
+	ctx.mntpoint = argv[optind];
+
+	/* Find the mount record for the passed-in argument. */
+
+	if (stat(argv[optind], &ctx.mnt_sb) < 0) {
+		fprintf(stderr,
+			_("%s: could not stat: %s: %s\n"),
+			progname, argv[optind], strerror(errno));
+		ret = 8;
+		goto end;
+	}
+
+	/*
+	 * If the user did not specify an explicit mount table, try to use
+	 * /proc/mounts if it is available, else /etc/mtab.  We prefer
+	 * /proc/mounts because it is kernel controlled, while /etc/mtab
+	 * may contain garbage that userspace tools like pam_mounts wrote
+	 * into it.
+	 */
+	if (!mtab) {
+		if (access(_PATH_PROC_MOUNTS, R_OK) == 0)
+			mtab = _PATH_PROC_MOUNTS;
+		else
+			mtab = _PATH_MOUNTED;
+	}
+
+	ismnt = find_mountpoint(mtab, &ctx);
+	if (!ismnt) {
+		fprintf(stderr, _("%s: Not a mount point or block device.\n"),
+			ctx.mntpoint);
+		ret = 8;
+		goto end;
+	}
+
+	/* Initialize overall phase stats. */
+	moveon = phase_start(&all_pi, "", NULL);
+	if (!moveon)
+		goto out;
+
+	/* Set up a page-aligned buffer for read verification. */
+	page_size = sysconf(_SC_PAGESIZE);
+	if (page_size < 0) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
+	/* Try to allocate a read buffer if we don't have one. */
+	error = posix_memalign((void **)&ctx.readbuf, page_size,
+			IO_MAX_SIZE);
+	if (error || !ctx.readbuf) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
+	/* Flush everything out to disk before we start. */
+	error = syncfs(ctx.mnt_fd);
+	if (error) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+		ctx.mode = SCRUB_MODE_REPAIR;
+		injected = true;
+	}
+
+	/* Scrub a filesystem. */
+	moveon = run_scrub_phases(&ctx);
+	if (!moveon)
+		goto out;
+
+out:
+	if (xfs_scrub_excessive_errors(&ctx))
+		str_info(&ctx, ctx.mntpoint, _("Too many errors; aborting."));
+
+	if (debug_tweak_on("XFS_SCRUB_FORCE_ERROR"))
+		str_error(&ctx, ctx.mntpoint, _("Injecting error."));
+
+	ret = 0;
+	if (!moveon)
+		ret |= 4;
+
+	if (ctx.repairs && ctx.preens)
+		fprintf(stdout,
+_("%s: %lu repairs and %lu optimizations made.\n"),
+			ctx.mntpoint, ctx.repairs, ctx.preens);
+	else if (ctx.repairs && ctx.preens == 0)
+		fprintf(stdout,
+_("%s: %lu repairs made.\n"),
+			ctx.mntpoint, ctx.repairs);
+	else if (ctx.repairs == 0 && ctx.preens)
+		fprintf(stdout,
+_("%s: %lu optimizations made.\n"),
+			ctx.mntpoint, ctx.preens);
+
+	if (ctx.errors_found && ctx.warnings_found)
+		fprintf(stderr,
+_("%s: %lu errors and %lu warnings found.  Unmount and run xfs_repair.\n"),
+			ctx.mntpoint, ctx.errors_found, ctx.warnings_found);
+	else if (ctx.errors_found && ctx.warnings_found == 0)
+		fprintf(stderr,
+_("%s: %lu errors found.  Unmount and run xfs_repair.\n"),
+			ctx.mntpoint, ctx.errors_found);
+	else if (ctx.errors_found == 0 && ctx.warnings_found)
+		fprintf(stderr,
+_("%s: %lu warnings found.\n"),
+			ctx.mntpoint, ctx.warnings_found);
+	if (ctx.errors_found) {
+		ret |= 1;
+	}
+	if (ctx.warnings_found) {
+		ret |= 2;
+	}
+	phase_end(&all_pi);
+	close(ctx.mnt_fd);
+	disk_close(&ctx.datadev);
+
+	free(ctx.blkdev);
+	free(ctx.readbuf);
+	free(ctx.mntpoint);
+	free(ctx.mnt_type);
+end:
+	return ret;
+}
diff --git a/scrub/scrub.h b/scrub/scrub.h
new file mode 100644
index 0000000..114310d
--- /dev/null
+++ b/scrub/scrub.h
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef SCRUB_H_
+#define SCRUB_H_
+
+#define DESCR_BUFSZ		256
+
+/*
+ * Perform all IO in 32M chunks.  This cannot exceed 65536 sectors
+ * because that's the biggest SCSI VERIFY(16) we dare to send.
+ */
+#define IO_MAX_SIZE		33554432
+#define IO_MAX_SECTORS		(IO_MAX_SIZE >> BBSHIFT)
+
+enum scrub_mode {
+	SCRUB_MODE_DRY_RUN,
+	SCRUB_MODE_PREEN,
+	SCRUB_MODE_REPAIR,
+};
+#define SCRUB_MODE_DEFAULT			SCRUB_MODE_PREEN
+
+struct scrub_ctx {
+	/* Immutable scrub state. */
+	char			*mntpoint;
+	char			*blkdev;
+	char			*mnt_type;
+	void			*readbuf;
+	int			mnt_fd;
+	enum scrub_mode		mode;
+	unsigned int		nr_io_threads;
+	struct disk		datadev;
+	struct stat		mnt_sb;
+	struct statvfs		mnt_sv;
+	struct statfs		mnt_sf;
+
+	/* Mutable scrub state; use lock. */
+	pthread_mutex_t		lock;
+	unsigned long		errors_found;
+	unsigned long		warnings_found;
+	unsigned long		repairs;
+	unsigned long		preens;
+};
+
+enum errors_action {
+	ERRORS_CONTINUE,
+	ERRORS_SHUTDOWN,
+};
+
+extern bool			verbose;
+extern int			debug;
+extern bool			scrub_data;
+extern long			page_size;
+extern enum errors_action	error_action;
+extern int			nr_threads;
+
+bool xfs_scrub_excessive_errors(struct scrub_ctx *ctx);
+
+void __str_errno(struct scrub_ctx *, const char *, const char *, int);
+void __str_error(struct scrub_ctx *, const char *, const char *, int,
+		 const char *, ...);
+void __str_warn(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+void __str_info(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+void __record_repair(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+void __record_preen(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+
+#define str_errno(ctx, str)		__str_errno(ctx, str, __FILE__, __LINE__)
+#define str_error(ctx, str, ...)	__str_error(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_warn(ctx, str, ...)		__str_warn(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_info(ctx, str, ...)		__str_info(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_repair(ctx, str, ...)	__record_repair(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_preen(ctx, str, ...)	__record_preen(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define dbg_printf(fmt, ...)		{if (debug > 1) {printf(fmt, __VA_ARGS__);}}
+
+#ifndef container_of
+# define container_of(ptr, type, member) ({			\
+	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
+		(type *)( (char *)__mptr - offsetof(type,member) );})
+#endif
+
+/* Is this debug tweak enabled? */
+static inline bool
+debug_tweak_on(
+	const char		*name)
+{
+	return debug && getenv(name) != NULL;
+}
+
+/* Miscellaneous utility functions */
+unsigned int scrub_nproc(struct scrub_ctx *ctx);
+bool within_range(struct scrub_ctx *ctx, unsigned long long value,
+		unsigned long long desired, unsigned long long diff_threshold,
+		unsigned int n, unsigned int d, const char *descr);
+double auto_space_units(unsigned long long kilobytes, char **units);
+double auto_units(unsigned long long number, char **units);
+const char *repair_tool(struct scrub_ctx *ctx);
+int dirent_open(int dir_fd, struct dirent *dirent);
+
+#ifndef HAVE_SYNCFS
+static inline int syncfs(int fd)
+{
+	sync();
+	return 0;
+}
+#endif
+
+#endif /* SCRUB_H_ */


* [PATCH 7/9] xfs_scrub: add XFS-specific scrubbing functionality
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2017-03-10 23:25 ` [PATCH 6/9] xfs_scrub: create online filesystem scrub program Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 8/9] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 9/9] xfs_scrub: integrate services with systemd Darrick J. Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

For XFS, we perform sequential scans of each AG's metadata, inodes,
extent maps, and file data.  Being XFS specific, we can work with the
in-kernel scrubbers to perform much stronger metadata checking and
cross-referencing.  We can also take advantage of newer ioctls such as
GETFSMAP to perform faster read verification.
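
As a rough illustration of the per-AG scan ranges, the sketch below
derives the first inode number and the first data-device byte of an AG
with the same arithmetic that xfs_scan_ag_inodes() and
xfs_scan_ag_blocks() use in this patch.  struct toy_geo and both helper
names are made up for the example; the real code takes these values
from the filesystem geometry.

```c
#include <stdint.h>

/* Hypothetical stand-in for the few geometry fields used here. */
struct toy_geo {
	uint32_t	agcount;	/* number of allocation groups */
	uint32_t	agblocks;	/* blocks per AG */
	uint32_t	blocksize;	/* bytes per block */
	unsigned int	agblklog;	/* log2(agblocks), rounded up */
	unsigned int	inopblog;	/* log2(inodes per block) */
};

/* First inode number that can live in this AG. */
static uint64_t
ag_first_ino(const struct toy_geo *geo, uint32_t agno)
{
	return (uint64_t)agno << (geo->inopblog + geo->agblklog);
}

/* First byte of this AG on the data device. */
static uint64_t
ag_first_byte(const struct toy_geo *geo, uint32_t agno)
{
	return (uint64_t)agno * geo->agblocks * geo->blocksize;
}
```

The last inode (or byte) of AG n is then the first inode (or byte) of
AG n + 1, minus one, which is how the scan keys are bounded.  With 16
bits of AG block number and 3 bits of inode-within-block, AG 1 starts
at inode 1 << 19.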

In the future we will be able to take advantage of (still unwritten)
features such as parent directory pointers to fully validate all
metadata.  The scrub tool can shut down the filesystem if errors are
found.  This is not a default option since scrubbing is very immature at
this time.
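
The shutdown behaviour is opt-in through the -e option.  A minimal
sketch of that selection, loosely mirroring the getopt handling in
scrub.c; parse_error_action() is a hypothetical helper and not a
function from this series:

```c
#include <stdbool.h>
#include <string.h>

enum errors_action {
	ERRORS_CONTINUE,	/* keep the filesystem online */
	ERRORS_SHUTDOWN,	/* shut the filesystem down on errors */
};

/*
 * Hypothetical helper: translate the -e argument into an action.
 * Returns false for an unrecognized argument so that the caller can
 * print usage and exit.
 */
static bool
parse_error_action(const char *arg, enum errors_action *action)
{
	if (!strcmp("continue", arg))
		*action = ERRORS_CONTINUE;
	else if (!strcmp("shutdown", arg))
		*action = ERRORS_SHUTDOWN;
	else
		return false;
	return true;
}
```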

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile    |    4 
 scrub/scrub.c     |   21 +
 scrub/scrub.h     |   26 +
 scrub/xfs.c       | 1517 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/xfs_ioctl.c |  968 ++++++++++++++++++++++++++++++++++
 scrub/xfs_ioctl.h |  103 ++++
 6 files changed, 2632 insertions(+), 7 deletions(-)
 create mode 100644 scrub/xfs.c
 create mode 100644 scrub/xfs_ioctl.c
 create mode 100644 scrub/xfs_ioctl.h
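
For anyone scripting around xfs_scrub: with this patch applied, the
exit status assembled at the end of main() behaves like a bitmask.  The
sketch below decodes it.  The bit meanings are read off the scrub.c
code in this series, but the SCRUB_RET_* names and describe_exit() are
invented for this example and may not match what eventually gets
documented.

```c
#include <stdio.h>

/* Illustrative names only; scrub.c uses bare numbers. */
#define SCRUB_RET_ERRORS	(1 << 0)	/* errors were found */
#define SCRUB_RET_WARNINGS	(1 << 1)	/* warnings were found */
#define SCRUB_RET_ABORTED	(1 << 2)	/* a phase asked us to stop */
#define SCRUB_RET_SETUP		(1 << 3)	/* setup or cleanup failed */

/* Write a short summary of an xfs_scrub exit status into buf. */
static void
describe_exit(int ret, char *buf, size_t len)
{
	if (ret == 0) {
		snprintf(buf, len, "clean");
		return;
	}
	snprintf(buf, len, "%s%s%s%s",
		(ret & SCRUB_RET_ERRORS) ? "errors " : "",
		(ret & SCRUB_RET_WARNINGS) ? "warnings " : "",
		(ret & SCRUB_RET_ABORTED) ? "aborted " : "",
		(ret & SCRUB_RET_SETUP) ? "setup/cleanup-failure " : "");
}
```

Because the bits are ORed together, a caller should test individual
bits rather than compare the status against a single value.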


diff --git a/scrub/Makefile b/scrub/Makefile
index b1ff86a..bae2fa1 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -12,9 +12,9 @@ LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 endif	# scrub_prereqs
 
-HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h
+HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h xfs_ioctl.h
 CFILES = ../repair/avl64.c disk.c bitmap.c iocmd.c \
-	 read_verify.c scrub.c ../repair/threads.c
+	 read_verify.c scrub.c ../repair/threads.c xfs.c xfs_ioctl.c
 
 LLDLIBS += $(LIBBLKID) $(LIBXFS) $(LIBXCMD) $(LIBUUID) $(LIBRT) $(LIBPTHREAD) $(LIBHANDLE)
 LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBHANDLE)
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 013559a..a363ac1 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -638,6 +638,9 @@ _("Must be root to run scrub."));
 		ctx->nr_io_threads = disk_heads(&ctx->datadev);
 	else
 		ctx->nr_io_threads = libxfs_nproc();
+	moveon = xfs_scan_fs(ctx);
+	if (!moveon)
+		goto out;
 	if (verbose) {
 		fprintf(stdout, _("%s: using %d threads to scrub.\n"),
 				ctx->mntpoint, scrub_nproc(ctx));
@@ -664,7 +667,7 @@ _("Errors found, please re-run with -y."));
 		return true;
 	}
 
-	return false;
+	return xfs_repair_fs(ctx);
 }
 
 /* Run all the phases of the scrubber. */
@@ -676,11 +679,11 @@ run_scrub_phases(
 {
 	struct scrub_phase	phases[] = {
 		{_("Find filesystem geometry."),   find_geo},
-		{_("Check internal metadata."),	   NULL},
-		{_("Scan all inodes."),		   NULL},
+		{_("Check internal metadata."),	   xfs_scan_metadata},
+		{_("Scan all inodes."),		   xfs_scan_inodes},
 		{NULL, REPAIR_DUMMY_FN},
 		{_("Verify data file integrity."), DATASCAN_DUMMY_FN},
-		{_("Check summary counters."),	   NULL},
+		{_("Check summary counters."),	   xfs_check_summary},
 		{NULL, NULL},
 	};
 	struct phase_info	pi;
@@ -698,9 +701,10 @@ run_scrub_phases(
 				phase->fn = preen;
 			} else if (ctx->mode == SCRUB_MODE_REPAIR) {
 				phase->descr = _("Repair filesystem.");
+				phase->fn = xfs_repair_fs;
 			}
 		} else if (phase->fn == DATASCAN_DUMMY_FN && scrub_data)
-			;
+			phase->fn = xfs_scan_blocks;
 
 		if (phase->fn == REPAIR_DUMMY_FN ||
 		    phase->fn == DATASCAN_DUMMY_FN) {
@@ -906,6 +910,11 @@ _("Only one of the options -n or -y may be specified.\n"));
 	if (!moveon)
 		ret |= 4;
 
+	/* Clean up scan data. */
+	moveon = xfs_cleanup(&ctx);
+	if (!moveon)
+		ret |= 8;
+
 	if (ctx.repairs && ctx.preens)
 		fprintf(stdout,
 _("%s: %lu repairs and %lu optimizations made.\n"),
@@ -932,6 +941,8 @@ _("%s: %lu errors found.  Unmount and run xfs_repair.\n"),
 _("%s: %lu warnings found.\n"),
 			ctx.mntpoint, ctx.warnings_found);
 	if (ctx.errors_found) {
+		if (error_action == ERRORS_SHUTDOWN)
+			xfs_shutdown_fs(&ctx);
 		ret |= 1;
 	}
 	if (ctx.warnings_found) {
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 114310d..c3ced73 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -56,6 +56,22 @@ struct scrub_ctx {
 	unsigned long		warnings_found;
 	unsigned long		repairs;
 	unsigned long		preens;
+
+	/* FS specific stuff */
+	struct xfs_fsop_geom	geo;
+	struct fs_path		fsinfo;
+	unsigned int		agblklog;
+	unsigned int		blocklog;
+	unsigned int		inodelog;
+	unsigned int		inopblog;
+	struct disk		logdev;
+	struct disk		rtdev;
+	void			*fshandle;
+	size_t			fshandle_len;
+	unsigned long long	capabilities;	/* see below */
+	struct read_verify_pool	rvp;
+	struct list_head	repair_list;
+	bool			preen_triggers[XFS_SCRUB_TYPE_MAX + 1];
 };
 
 enum errors_action {
@@ -124,4 +140,14 @@ static inline int syncfs(int fd)
 }
 #endif
 
+/* FS-specific functions */
+bool xfs_cleanup(struct scrub_ctx *ctx);
+bool xfs_scan_fs(struct scrub_ctx *ctx);
+bool xfs_scan_inodes(struct scrub_ctx *ctx);
+bool xfs_scan_metadata(struct scrub_ctx *ctx);
+bool xfs_check_summary(struct scrub_ctx *ctx);
+bool xfs_scan_blocks(struct scrub_ctx *ctx);
+bool xfs_repair_fs(struct scrub_ctx *ctx);
+void xfs_shutdown_fs(struct scrub_ctx *ctx);
+
 #endif /* SCRUB_H_ */
diff --git a/scrub/xfs.c b/scrub/xfs.c
new file mode 100644
index 0000000..7d6a249
--- /dev/null
+++ b/scrub/xfs.c
@@ -0,0 +1,1517 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <attr/attributes.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "iocmd.h"
+#include "scrub.h"
+#include "xfs_ioctl.h"
+#include "xfs_fs.h"
+
+/*
+ * XFS Scrubbing Strategy
+ *
+ * The XFS scrubber uses custom XFS ioctls to probe more deeply into the
+ * internals of the filesystem.  It takes advantage of scrubbing ioctls
+ * to check all the records stored in a metadata btree and to
+ * cross-reference those records against the other metadata btrees.
+ *
+ * The "find geometry" phase queries XFS for the filesystem geometry.
+ * The block devices for the data, realtime, and log devices are opened.
+ * Kernel ioctls are queried to see if they are implemented, and a data
+ * file read-verify strategy is selected.
+ *
+ * In the "check internal metadata" phase, we call the SCRUB_METADATA
+ * ioctl to check the filesystem's internal per-AG btrees.  This
+ * includes the AG superblock, AGF, AGFL, and AGI headers, freespace
+ * btrees, the regular and free inode btrees, the reverse mapping
+ * btrees, and the reference counting btrees.  If the realtime device is
+ * enabled, the realtime bitmap and reverse mapping btrees are also checked.
+ * Each AG (and the realtime device) has its metadata checked in a
+ * separate thread for better performance.
+ *
+ * The "scan inodes" phase uses BULKSTAT to scan all the inodes in an
+ * AG in disk order.  From the BULKSTAT information, a file handle is
+ * constructed and the following items are checked:
+ *
+ *     - If it's a symlink, the target is read but not validated.
+ *     - Bulkstat data is checked.
+ *     - If the inode is a file or a directory, a file descriptor is
+ *       opened to pin the inode and for further analysis.
+ *     - Extended attribute names and values are read via the file
+ *       handle.  If this fails and we have a file descriptor open, we
+ *       retry with the generic extended attribute APIs.
+ *     - If the inode is not a file or directory, we're done.
+ *     - Extent maps are scanned to ensure that the records make sense.
+ *       We also use the SCRUB_METADATA ioctl for better checking of the
+ *       block mapping records.
+ *     - If the inode is a directory, open the directory and check that
+ *       the dirent type code and inode numbers match the stat output.
+ *
+ * Multiple threads are started to check the inodes of each AG in
+ * parallel.
+ *
+ * In the "verify data file integrity" phase, we employ GETFSMAP to read
+ * the reverse-mappings of all AGs and issue direct-reads of the
+ * underlying disk blocks.  We rely on the underlying storage to have
+ * checksummed the data blocks appropriately.
+ *
+ * Multiple threads are started to check each AG in parallel.  A
+ * separate thread pool is used to handle the direct reads.
+ *
+ * In the "check summary counters" phase, use GETFSMAP to tally up the
+ * blocks and BULKSTAT to tally up the inodes we saw and compare that to
+ * the statfs output.  This gives the user a rough estimate of how
+ * thorough the scrub was.
+ */
+
+/* Routines to scrub an XFS filesystem. */
+
+#define XFS_SCRUB_CAP_PARENT_PTR	(1ULL << 0)	/* can find parent? */
+
+#define XFS_SCRUB_CAPABILITY_FUNCS(name, flagname) \
+static inline bool \
+xfs_scrub_can_##name(struct scrub_ctx *ctx) \
+{ \
+	return ctx->capabilities & XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_set_##name(struct scrub_ctx *ctx) \
+{ \
+	ctx->capabilities |= XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_clear_##name(struct scrub_ctx *ctx) \
+{ \
+	ctx->capabilities &= ~(XFS_SCRUB_CAP_##flagname); \
+}
+XFS_SCRUB_CAPABILITY_FUNCS(getparent,		PARENT_PTR)
+
+/* Find the disk for a given device identifier. */
+static struct disk *
+xfs_dev_to_disk(
+	struct scrub_ctx	*ctx,
+	dev_t			dev)
+{
+	if (dev == ctx->fsinfo.fs_datadev)
+		return &ctx->datadev;
+	else if (dev == ctx->fsinfo.fs_logdev)
+		return &ctx->logdev;
+	else if (dev == ctx->fsinfo.fs_rtdev)
+		return &ctx->rtdev;
+	abort();
+}
+
+/* Find the device identifier for a given disk. */
+static dev_t
+xfs_disk_to_dev(
+	struct scrub_ctx	*ctx,
+	struct disk		*disk)
+{
+	if (disk == &ctx->datadev)
+		return ctx->fsinfo.fs_datadev;
+	else if (disk == &ctx->logdev)
+		return ctx->fsinfo.fs_logdev;
+	else if (disk == &ctx->rtdev)
+		return ctx->fsinfo.fs_rtdev;
+	abort();
+}
+
+/* Shortcut to creating a read-verify thread pool. */
+static inline bool
+xfs_read_verify_pool_init(
+	struct scrub_ctx	*ctx,
+	read_verify_ioend_fn_t	ioend_fn)
+{
+	return read_verify_pool_init(&ctx->rvp, ctx, ctx->readbuf,
+			IO_MAX_SIZE, ctx->geo.blocksize, ioend_fn,
+			disk_heads(&ctx->datadev));
+}
+
+struct owner_decode {
+	uint64_t		owner;
+	const char		*descr;
+};
+
+static const struct owner_decode special_owners[] = {
+	{XFS_FMR_OWN_FREE,	"free space"},
+	{XFS_FMR_OWN_UNKNOWN,	"unknown owner"},
+	{XFS_FMR_OWN_FS,	"static FS metadata"},
+	{XFS_FMR_OWN_LOG,	"journalling log"},
+	{XFS_FMR_OWN_AG,	"per-AG metadata"},
+	{XFS_FMR_OWN_INOBT,	"inode btree blocks"},
+	{XFS_FMR_OWN_INODES,	"inodes"},
+	{XFS_FMR_OWN_REFC,	"refcount btree"},
+	{XFS_FMR_OWN_COW,	"CoW staging"},
+	{XFS_FMR_OWN_DEFECTIVE,	"bad blocks"},
+	{0, NULL},
+};
+
+/* Decode a special owner. */
+static const char *
+xfs_decode_special_owner(
+	uint64_t			owner)
+{
+	const struct owner_decode	*od = special_owners;
+
+	while (od->descr) {
+		if (od->owner == owner)
+			return od->descr;
+		od++;
+	}
+
+	return NULL;
+}
+
+/* BULKSTAT wrapper routines. */
+struct xfs_scan_inodes {
+	xfs_inode_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Scan all the inodes in an AG. */
+static void
+xfs_scan_ag_inodes(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct xfs_scan_inodes	*si = arg;
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	uint64_t		ag_ino;
+	uint64_t		next_ag_ino;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u inodes"),
+				major(ctx->fsinfo.fs_datadev),
+				minor(ctx->fsinfo.fs_datadev),
+				agno);
+
+	ag_ino = (__u64)agno << (ctx->inopblog + ctx->agblklog);
+	next_ag_ino = (__u64)(agno + 1) << (ctx->inopblog + ctx->agblklog);
+
+	fn_arg = ((char *)si->arg) + si->array_arg_size * agno;
+	moveon = xfs_iterate_inodes(ctx, descr, ctx->fshandle, ag_ino,
+			next_ag_ino - 1, si->fn, fn_arg);
+	if (!moveon)
+		si->moveon = false;
+}
+
+/* How many array elements should we create to scan all the inodes? */
+static inline size_t
+xfs_scan_all_inodes_array_size(
+	struct scrub_ctx	*ctx)
+{
+	return ctx->geo.agcount;
+}
+
+/* Scan all the inodes in a filesystem. */
+static bool
+xfs_scan_all_inodes_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	struct xfs_scan_inodes	si;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+
+	si.moveon = true;
+	si.fn = fn;
+	si.arg = arg;
+	si.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	for (agno = 0; agno < ctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_inodes, agno, &si);
+	destroy_work_queue(&wq);
+
+	return si.moveon;
+}
+#define xfs_scan_all_inodes(ctx, fn) \
+	xfs_scan_all_inodes_array_arg((ctx), (fn), NULL, 0)
+#define xfs_scan_all_inodes_arg(ctx, fn, arg) \
+	xfs_scan_all_inodes_array_arg((ctx), (fn), (arg), 0)
+
+/* GETFSMAP wrapper routines. */
+struct xfs_scan_blocks {
+	xfs_fsmap_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Iterate all the reverse mappings of an AG. */
+static void
+xfs_scan_ag_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scan_blocks	*sbx = arg;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	struct fsmap		keys[2];
+	off64_t			bperag;
+	bool			moveon;
+
+	bperag = (off64_t)ctx->geo.agblocks *
+		 (off64_t)ctx->geo.blocksize;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u fsmap"),
+				major(ctx->fsinfo.fs_datadev),
+				minor(ctx->fsinfo.fs_datadev),
+				agno);
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = ctx->fsinfo.fs_datadev;
+	keys->fmr_physical = agno * bperag;
+	(keys + 1)->fmr_device = ctx->fsinfo.fs_datadev;
+	(keys + 1)->fmr_physical = ((agno + 1) * bperag) - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * agno;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of a standalone device. */
+static void
+xfs_scan_dev_blocks(
+	struct scrub_ctx	*ctx,
+	int			idx,
+	dev_t			dev,
+	struct xfs_scan_blocks	*sbx)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	void			*fn_arg;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d fsmap"),
+			major(dev), minor(dev));
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = ULLONG_MAX;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * idx;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of the realtime device. */
+static void
+xfs_scan_rt_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+
+	xfs_scan_dev_blocks(ctx, agno, ctx->fsinfo.fs_rtdev, arg);
+}
+
+/* Iterate all the reverse mappings of the log device. */
+static void
+xfs_scan_log_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+
+	xfs_scan_dev_blocks(ctx, agno, ctx->fsinfo.fs_logdev, arg);
+}
+
+/* How many array elements should we create to scan all the blocks? */
+static size_t
+xfs_scan_all_blocks_array_size(
+	struct scrub_ctx	*ctx)
+{
+	return ctx->geo.agcount + 2;
+}
+
+/* Scan all the blocks in a filesystem. */
+static bool
+xfs_scan_all_blocks_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	struct xfs_scan_blocks	sbx;
+
+	sbx.moveon = true;
+	sbx.fn = fn;
+	sbx.arg = arg;
+	sbx.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	if (ctx->fsinfo.fs_rt)
+		queue_work(&wq, xfs_scan_rt_blocks, ctx->geo.agcount,
+				&sbx);
+	if (ctx->fsinfo.fs_log)
+		queue_work(&wq, xfs_scan_log_blocks, ctx->geo.agcount + 1,
+				&sbx);
+	for (agno = 0; agno < ctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_blocks, agno, &sbx);
+	destroy_work_queue(&wq);
+
+	return sbx.moveon;
+}
+
+/* Routines to translate bad physical extents into file paths and offsets. */
+
+struct xfs_verify_error_info {
+	struct bitmap			*d_bad;		/* bytes */
+	struct bitmap			*r_bad;		/* bytes */
+};
+
+/* Report if this extent overlaps a bad region. */
+static bool
+xfs_report_verify_inode_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct xfs_verify_error_info	*vei = arg;
+	struct bitmap			*tree;
+
+	/* Only report errors for real extents. */
+	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
+		return true;
+
+	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+		tree = vei->r_bad;
+	else
+		tree = vei->d_bad;
+
+	if (!bitmap_has_extent(tree, bmap->bm_physical, bmap->bm_length))
+		return true;
+
+	str_error(ctx, descr,
+_("offset %llu failed read verification."), bmap->bm_offset);
+	return true;
+}
+
+/* Iterate the extent mappings of a file to report errors. */
+static bool
+xfs_report_verify_fd(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	void				*arg)
+{
+	struct xfs_bmap			key = {0};
+	bool				moveon;
+
+	/* data fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+
+	/* attr fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+	return true;
+}
+
+/* Report read verify errors in unlinked (but still open) files. */
+static int
+xfs_report_verify_inode(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	char				descr[DESCR_BUFSZ];
+	char				buf[DESCR_BUFSZ];
+	bool				moveon;
+	int				fd;
+	int				error;
+
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu (unlinked)"), bstat->bs_ino);
+
+	/* Ignore linked files and things we can't open. */
+	if (bstat->bs_nlink != 0)
+		return 0;
+	if (!S_ISREG(bstat->bs_mode) && !S_ISDIR(bstat->bs_mode))
+		return 0;
+
+	/* Try to open the inode. */
+	fd = open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		error = errno;
+		if (error == ESTALE)
+			return error;
+
+		str_warn(ctx, descr, "%s", strerror_r(error, buf, DESCR_BUFSZ));
+		return error;
+	}
+
+	/* Go find the badness. */
+	moveon = xfs_report_verify_fd(ctx, descr, fd, arg);
+	close(fd);
+
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Scan the inode associated with a directory entry. */
+static bool
+xfs_report_verify_dirent(
+	struct scrub_ctx	*ctx,
+	const char		*path,
+	int			dir_fd,
+	struct dirent		*dirent,
+	struct stat		*sb,
+	void			*arg)
+{
+	bool			moveon;
+	int			fd;
+
+	/* Ignore things we can't open. */
+	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode))
+		return true;
+	/* Ignore . and .. */
+	if (dirent && (!strcmp(".", dirent->d_name) ||
+		       !strcmp("..", dirent->d_name)))
+		return true;
+
+	/* Open the file */
+	fd = dirent_open(dir_fd, dirent);
+	if (fd < 0)
+		return true;
+
+	/* Go find the badness. */
+	moveon = xfs_report_verify_fd(ctx, path, fd, arg);
+	if (moveon)
+		goto out;
+
+out:
+	close(fd);
+
+	return moveon;
+}
+
+/* Given bad extent lists for the data & rtdev, find bad files. */
+static bool
+xfs_report_verify_errors(
+	struct scrub_ctx		*ctx,
+	struct bitmap			*d_bad,
+	struct bitmap			*r_bad)
+{
+	struct xfs_verify_error_info	vei;
+	bool				moveon;
+
+	vei.d_bad = d_bad;
+	vei.r_bad = r_bad;
+
+	/* Scan the directory tree to get file paths. */
+	moveon = scan_fs_tree(ctx, NULL, xfs_report_verify_dirent, &vei);
+	if (!moveon)
+		return false;
+
+	/* Scan for unlinked files. */
+	return xfs_scan_all_inodes_arg(ctx, xfs_report_verify_inode, &vei);
+}
+
+/* Phase 1: Find filesystem geometry. */
+
+/* Clean up the XFS-specific state data. */
+bool
+xfs_cleanup(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->fshandle)
+		free_handle(ctx->fshandle, ctx->fshandle_len);
+	disk_close(&ctx->rtdev);
+	disk_close(&ctx->logdev);
+	disk_close(&ctx->datadev);
+
+	return true;
+}
+
+/* Read the XFS geometry. */
+bool
+xfs_scan_fs(
+	struct scrub_ctx		*ctx)
+{
+	struct fs_path			*fsp;
+	int				error;
+
+	if (!platform_test_xfs_fd(ctx->mnt_fd)) {
+		str_error(ctx, ctx->mntpoint,
+_("Does not appear to be an XFS filesystem!"));
+		return false;
+	}
+
+	/*
+	 * Flush everything out to disk before we start checking.
+	 * This seems to reduce the incidence of stale file handle
+	 * errors when we open things by handle.
+	 */
+	error = syncfs(ctx->mnt_fd);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	INIT_LIST_HEAD(&ctx->repair_list);
+	ctx->datadev.d_fd = ctx->logdev.d_fd = ctx->rtdev.d_fd = -1;
+
+	/* Retrieve XFS geometry. */
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSGEOMETRY, &ctx->geo);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		goto err;
+	}
+
+	ctx->agblklog = libxfs_log2_roundup(ctx->geo.agblocks);
+	ctx->blocklog = libxfs_highbit32(ctx->geo.blocksize);
+	ctx->inodelog = libxfs_highbit32(ctx->geo.inodesize);
+	ctx->inopblog = ctx->blocklog - ctx->inodelog;
+
+	error = path_to_fshandle(ctx->mntpoint, &ctx->fshandle,
+			&ctx->fshandle_len);
+	if (error) {
+		perror(_("getting fshandle"));
+		goto err;
+	}
+
+	/* Do we have bulkstat? */
+	if (!xfs_can_iterate_inodes(ctx)) {
+		str_info(ctx, ctx->mntpoint, _("BULKSTAT is required."));
+		goto err;
+	}
+
+	/* Do we have getbmapx? */
+	if (!xfs_can_iterate_bmap(ctx)) {
+		str_info(ctx, ctx->mntpoint, _("GETBMAPX is required."));
+		goto err;
+	}
+
+	/* Do we have getfsmap? */
+	if (!xfs_can_iterate_fsmap(ctx)) {
+		str_info(ctx, ctx->mntpoint, _("GETFSMAP is required."));
+		goto err;
+	}
+
+	/* Do we have kernel-assisted metadata scrubbing? */
+	if (!xfs_can_scrub_fs_metadata(ctx) || !xfs_can_scrub_inode(ctx) ||
+	    !xfs_can_scrub_bmap(ctx) || !xfs_can_scrub_dir(ctx) ||
+	    !xfs_can_scrub_attr(ctx) || !xfs_can_scrub_symlink(ctx)) {
+		str_info(ctx, ctx->mntpoint,
+_("kernel metadata scrub is required."));
+		goto err;
+	}
+
+	/* Go find the XFS devices if we have a usable fsmap. */
+	fs_table_initialise(0, NULL, 0, NULL);
+	errno = 0;
+	fsp = fs_table_lookup(ctx->mntpoint, FS_MOUNT_POINT);
+	if (!fsp) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find XFS information."));
+		goto err;
+	}
+	memcpy(&ctx->fsinfo, fsp, sizeof(struct fs_path));
+
+	/* Did we find the log and rt devices, if they're present? */
+	if (ctx->geo.logstart == 0 && ctx->fsinfo.fs_log == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find log device path."));
+		goto err;
+	}
+	if (ctx->geo.rtblocks && ctx->fsinfo.fs_rt == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find realtime device path."));
+		goto err;
+	}
+
+	/* Open the raw devices. */
+	error = disk_open(ctx->fsinfo.fs_name, &ctx->datadev);
+	if (error) {
+		str_errno(ctx, ctx->fsinfo.fs_name);
+		goto err;
+	}
+	ctx->nr_io_threads = libxfs_nproc();
+
+	if (ctx->fsinfo.fs_log) {
+		error = disk_open(ctx->fsinfo.fs_log, &ctx->logdev);
+		if (error) {
+			str_errno(ctx, ctx->fsinfo.fs_log);
+			goto err;
+		}
+	}
+	if (ctx->fsinfo.fs_rt) {
+		error = disk_open(ctx->fsinfo.fs_rt, &ctx->rtdev);
+		if (error) {
+			str_errno(ctx, ctx->fsinfo.fs_rt);
+			goto err;
+		}
+	}
+
+	return true;
+err:
+	return false;
+}
+
+/* Phase 2: Check internal metadata. */
+
+/* Defer all the repairs until phase 4. */
+static void
+xfs_defer_repairs(
+	struct scrub_ctx	*ctx,
+	struct list_head	*repairs)
+{
+	if (list_empty(repairs))
+		return;
+
+	pthread_mutex_lock(&ctx->lock);
+	list_splice_tail_init(repairs, &ctx->repair_list);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Repair some AG metadata; broken things are remembered for later. */
+static bool
+xfs_quick_repair(
+	struct scrub_ctx	*ctx,
+	struct list_head	*repairs)
+{
+	bool			moveon;
+
+	moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd, repairs,
+			XRML_REPAIR_ONLY);
+	if (!moveon)
+		return moveon;
+
+	xfs_defer_repairs(ctx, repairs);
+	return true;
+}
+
+/* Scrub each AG's metadata btrees. */
+static void
+xfs_scan_ag_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	bool				*pmoveon = arg;
+	struct repair_item		*n;
+	struct repair_item		*ri;
+	struct list_head		repairs;
+	struct list_head		repair_now;
+	unsigned int			broken_primaries;
+	unsigned int			broken_secondaries;
+	bool				moveon;
+	char				descr[DESCR_BUFSZ];
+
+	INIT_LIST_HEAD(&repairs);
+	INIT_LIST_HEAD(&repair_now);
+	snprintf(descr, DESCR_BUFSZ, _("AG %u"), agno);
+
+	/*
+	 * First we scrub and fix the AG headers, because we need
+	 * them to work well enough to check the AG btrees.
+	 */
+	moveon = xfs_scrub_ag_headers(ctx, agno, &repairs);
+	if (!moveon)
+		goto err;
+
+	/* Repair header damage. */
+	moveon = xfs_quick_repair(ctx, &repairs);
+	if (!moveon)
+		goto err;
+
+	/* Now scrub the AG btrees. */
+	moveon = xfs_scrub_ag_metadata(ctx, agno, &repairs);
+	if (!moveon)
+		goto err;
+
+	/*
+	 * Figure out if we need to perform early fixing.  The only
+	 * reason we need to do this is if the inobt is broken, which
+	 * prevents phase 3 (inode scan) from running.  We can rebuild
+	 * the inobt from rmapbt data, but if the rmapbt is broken even
+	 * at this early phase then we are sunk.
+	 */
+	broken_secondaries = 0;
+	broken_primaries = 0;
+	list_for_each_entry_safe(ri, n, &repairs, list) {
+		switch (ri->op.sm_type) {
+		case XFS_SCRUB_TYPE_RMAPBT:
+			broken_secondaries++;
+			break;
+		case XFS_SCRUB_TYPE_FINOBT:
+		case XFS_SCRUB_TYPE_INOBT:
+			list_del(&ri->list);
+			list_add_tail(&ri->list, &repair_now);
+			/* fall through */
+		case XFS_SCRUB_TYPE_BNOBT:
+		case XFS_SCRUB_TYPE_CNTBT:
+		case XFS_SCRUB_TYPE_REFCNTBT:
+			broken_primaries++;
+			break;
+		default:
+			ASSERT(false);
+			break;
+		}
+	}
+	if (broken_secondaries && !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
+		if (broken_primaries)
+			str_warn(ctx, descr,
+_("Corrupt primary and secondary block mapping metadata."));
+		else
+			str_warn(ctx, descr,
+_("Corrupt secondary block mapping metadata."));
+		str_warn(ctx, descr,
+_("Filesystem might not be repairable."));
+	}
+
+	/* Repair (inode) btree damage. */
+	moveon = xfs_quick_repair(ctx, &repair_now);
+	if (!moveon)
+		goto err;
+
+	/* Everything else gets fixed during phase 4. */
+	xfs_defer_repairs(ctx, &repairs);
+
+	return;
+err:
+	*pmoveon = false;
+	return;
+}
+
+/* Scrub whole-FS metadata btrees. */
+static void
+xfs_scan_fs_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	bool				*pmoveon = arg;
+	struct list_head		repairs;
+	bool				moveon;
+
+	INIT_LIST_HEAD(&repairs);
+	moveon = xfs_scrub_fs_metadata(ctx, &repairs);
+	if (!moveon)
+		*pmoveon = false;
+
+	pthread_mutex_lock(&ctx->lock);
+	list_splice_tail_init(&repairs, &ctx->repair_list);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Scrub the whole-fs and per-AG metadata in parallel. */
+bool
+xfs_scan_metadata(
+	struct scrub_ctx	*ctx)
+{
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	bool			moveon = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	queue_work(&wq, xfs_scan_fs_metadata, 0, &moveon);
+	for (agno = 0; agno < ctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_metadata, agno, &moveon);
+	destroy_work_queue(&wq);
+
+	return moveon;
+}
+
+/* Phase 3: Scan all inodes. */
+
+/*
+ * Scrub part of a file.  If the user passes in a valid fd we assume
+ * that's the file to check; otherwise, pass in the inode number and
+ * let the kernel sort it out.
+ */
+static bool
+xfs_scrub_fd(
+	struct scrub_ctx	*ctx,
+	bool			(*fn)(struct scrub_ctx *, uint64_t,
+				      uint32_t, int, struct list_head *),
+	struct xfs_bstat	*bs,
+	int			fd,
+	struct list_head	*repairs)
+{
+	if (fd < 0)
+		fd = ctx->mnt_fd;
+	return fn(ctx, bs->bs_ino, bs->bs_gen, fd, repairs);
+}
+
+/* Verify the contents, xattrs, and extent maps of an inode. */
+static int
+xfs_scrub_inode(
+	struct scrub_ctx	*ctx,
+	struct xfs_handle	*handle,
+	struct xfs_bstat	*bstat,
+	void			*arg)
+{
+	struct list_head	repairs;
+	char			descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	int			fd = -1;
+	int			error = 0;
+
+	INIT_LIST_HEAD(&repairs);
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu"), bstat->bs_ino);
+
+	/* Try to open the inode to pin it. */
+	if (S_ISREG(bstat->bs_mode) || S_ISDIR(bstat->bs_mode)) {
+		fd = open_by_fshandle(handle, sizeof(*handle),
+				O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+		if (fd < 0) {
+			error = errno;
+			if (error != ESTALE)
+				str_errno(ctx, descr);
+			goto out;
+		}
+	}
+
+	/* Scrub the inode. */
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_inode_fields, bstat, fd,
+			&repairs);
+	if (!moveon)
+		goto out;
+
+	moveon = xfs_quick_repair(ctx, &repairs);
+	if (!moveon)
+		goto out;
+
+	/* Scrub all block mappings. */
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_data_fork, bstat, fd,
+			&repairs);
+	if (!moveon)
+		goto out;
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr_fork, bstat, fd,
+			&repairs);
+	if (!moveon)
+		goto out;
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_cow_fork, bstat, fd,
+			&repairs);
+	if (!moveon)
+		goto out;
+
+	moveon = xfs_quick_repair(ctx, &repairs);
+	if (!moveon)
+		goto out;
+
+	/* XXX: Some day, check child -> parent dir -> child. */
+
+	if (S_ISLNK(bstat->bs_mode)) {
+		/* Check symlink contents. */
+		moveon = xfs_scrub_symlink(ctx, bstat->bs_ino,
+				bstat->bs_gen, ctx->mnt_fd, &repairs);
+	} else if (S_ISDIR(bstat->bs_mode)) {
+		/* Check the directory entries. */
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_dir, bstat, fd, &repairs);
+	}
+	if (!moveon)
+		goto out;
+
+	/*
+	 * Read all the extended attributes.  If any of the read
+	 * functions decline to move on, we can try again with the
+	 * VFS functions if we have a file descriptor.
+	 */
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr, bstat, fd, &repairs);
+	if (!moveon)
+		goto out;
+
+	moveon = xfs_quick_repair(ctx, &repairs);
+
+out:
+	xfs_defer_repairs(ctx, &repairs);
+	if (fd >= 0)
+		close(fd);
+	if (error)
+		return error;
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Verify all the inodes in a filesystem. */
+bool
+xfs_scan_inodes(
+	struct scrub_ctx	*ctx)
+{
+	if (!xfs_scan_all_inodes(ctx, xfs_scrub_inode))
+		return false;
+	xfs_scrub_report_preen_triggers(ctx);
+	return true;
+}
+
+/* Phase 4: Repair filesystem. */
+
+static int
+list_length(
+	struct list_head		*head)
+{
+	struct list_head		*pos;
+	int				nr = 0;
+
+	list_for_each(pos, head) {
+		nr++;
+	}
+
+	return nr;
+}
+
+/* Fix the per-AG and per-FS metadata. */
+bool
+xfs_repair_fs(
+	struct scrub_ctx		*ctx)
+{
+	int				len;
+	int				old_len;
+	bool				moveon;
+
+	/* Repair anything broken until we fail to make progress. */
+	len = list_length(&ctx->repair_list);
+	do {
+		old_len = len;
+		moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd,
+				&ctx->repair_list, 0);
+		if (!moveon)
+			return false;
+		len = list_length(&ctx->repair_list);
+	} while (old_len > len);
+
+	/* Try once more, but this time complain if we can't fix things. */
+	moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd,
+			&ctx->repair_list, XRML_NOFIX_COMPLAIN);
+	if (!moveon)
+		return false;
+
+	fstrim(ctx);
+	return true;
+}
+
+/* Phase 5: Verify data file integrity. */
+
+/* Verify disk blocks with GETFSMAP. */
+
+struct xfs_verify_extent {
+	/* Maintain state for the lazy read verifier. */
+	struct read_verify	rv;
+
+	/* Store bad extents if we don't have parent pointers. */
+	struct bitmap		*d_bad;		/* bytes */
+	struct bitmap		*r_bad;		/* bytes */
+
+	/* Track the last extent we saw. */
+	uint64_t		laststart;	/* bytes */
+	uint64_t		lastlength;	/* bytes */
+	bool			lastshared;	/* was it shared? */
+};
+
+/* Report an IO error resulting from read-verify based off getfsmap. */
+static bool
+xfs_check_rmap_error_report(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*map,
+	void			*arg)
+{
+	const char		*type;
+	char			buf[32];
+	uint64_t		err_physical = *(uint64_t *)arg;
+	uint64_t		err_off;
+
+	if (err_physical > map->fmr_physical)
+		err_off = err_physical - map->fmr_physical;
+	else
+		err_off = 0;
+
+	snprintf(buf, 32, _("disk offset %llu"),
+			BTOBB(map->fmr_physical + err_off));
+
+	if (map->fmr_flags & FMR_OF_SPECIAL_OWNER) {
+		type = xfs_decode_special_owner(map->fmr_owner);
+		str_error(ctx, buf,
+_("%s failed read verification."),
+				type);
+	} else if (xfs_scrub_can_getparent(ctx)) {
+		/* XXX: go find the parent path */
+		str_error(ctx, buf,
+_("XXX: inode %lld offset %llu failed read verification."),
+				map->fmr_owner, map->fmr_offset + err_off);
+	}
+	return true;
+}
+
+/* Handle a read error in the rmap-based read verify. */
+void
+xfs_check_rmap_ioerr(
+	struct read_verify_pool	*rvp,
+	struct disk		*disk,
+	uint64_t		start,
+	uint64_t		length,
+	int			error,
+	void			*arg)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	struct scrub_ctx	*ctx = rvp->rvp_ctx;
+	struct xfs_verify_extent	*ve;
+	struct bitmap		*tree;
+	dev_t			dev;
+	bool			moveon;
+
+	ve = arg;
+	dev = xfs_disk_to_dev(ctx, disk);
+
+	/*
+	 * If we don't have parent pointers, save the bad extent for
+	 * later rescanning.
+	 */
+	if (!xfs_scrub_can_getparent(ctx)) {
+		if (dev == ctx->fsinfo.fs_datadev)
+			tree = ve->d_bad;
+		else if (dev == ctx->fsinfo.fs_rtdev)
+			tree = ve->r_bad;
+		else
+			tree = NULL;
+		if (tree) {
+			moveon = bitmap_add(tree, start, length);
+			if (!moveon)
+				str_errno(ctx, ctx->mntpoint);
+		}
+	}
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d ioerr @ %"PRIu64":%"PRIu64" "),
+			major(dev), minor(dev), start, length);
+
+	/* Go figure out which blocks are bad from the fsmap. */
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	keys->fmr_physical = start;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = start + length - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+	xfs_iterate_fsmap(ctx, descr, keys, xfs_check_rmap_error_report,
+			&start);
+}
+
+/* Read verify a (data block) extent. */
+static bool
+xfs_check_rmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*map,
+	void				*arg)
+{
+	struct xfs_verify_extent	*ve = arg;
+	struct disk			*disk;
+
+	dbg_printf("rmap dev %d:%d phys %llu owner %lld offset %llu "
+			"len %llu flags 0x%x\n", major(map->fmr_device),
+			minor(map->fmr_device), map->fmr_physical,
+			map->fmr_owner, map->fmr_offset,
+			map->fmr_length, map->fmr_flags);
+
+	/* Remember this extent. */
+	ve->lastshared = (map->fmr_flags & FMR_OF_SHARED);
+	ve->laststart = map->fmr_physical;
+	ve->lastlength = map->fmr_length;
+
+	/* "Unknown" extents should be verified; they could be data. */
+	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+			map->fmr_owner == XFS_FMR_OWN_UNKNOWN)
+		map->fmr_flags &= ~FMR_OF_SPECIAL_OWNER;
+
+	/*
+	 * We only care about read-verifying data extents that have been
+	 * written to disk.  This means we can skip "special" owners
+	 * (metadata), xattr blocks, unwritten extents, and extent maps.
+	 * These should all get checked elsewhere in the scrubber.
+	 */
+	if (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+			      FMR_OF_EXTENT_MAP | FMR_OF_SPECIAL_OWNER))
+		goto out;
+
+	/* XXX: Filter out directory data blocks. */
+
+	/* Schedule the read verify command for (eventual) running. */
+	disk = xfs_dev_to_disk(ctx, map->fmr_device);
+
+	read_verify_schedule(&ctx->rvp, &ve->rv, disk, map->fmr_physical,
+			map->fmr_length, ve);
+
+out:
+	/* Is this the last extent?  Fire off the read. */
+	if (map->fmr_flags & FMR_OF_LAST)
+		read_verify_force(&ctx->rvp, &ve->rv);
+
+	return true;
+}
+
+/* Verify all the blocks in a filesystem. */
+bool
+xfs_scan_blocks(
+	struct scrub_ctx		*ctx)
+{
+	struct bitmap			d_bad;
+	struct bitmap			r_bad;
+	struct xfs_verify_extent	*ve;
+	struct xfs_verify_extent	*v;
+	int				i;
+	unsigned int			groups;
+	bool				moveon;
+
+	/*
+	 * Initialize our per-thread context.  By convention,
+	 * the AGs come first, then the rt device, and then
+	 * the log device.
+	 */
+	groups = xfs_scan_all_blocks_array_size(ctx);
+	ve = calloc(groups, sizeof(struct xfs_verify_extent));
+	if (!ve) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&d_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_ve;
+	}
+
+	moveon = bitmap_init(&r_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dbad;
+	}
+
+	for (i = 0, v = ve; i < groups; i++, v++) {
+		v->d_bad = &d_bad;
+		v->r_bad = &r_bad;
+	}
+
+	moveon = xfs_read_verify_pool_init(ctx, xfs_check_rmap_ioerr);
+	if (!moveon)
+		goto out_rbad;
+	moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_check_rmap,
+			ve, sizeof(*ve));
+	if (!moveon)
+		goto out_pool;
+
+	for (i = 0, v = ve; i < groups; i++, v++)
+		read_verify_force(&ctx->rvp, &v->rv);
+	read_verify_pool_destroy(&ctx->rvp);
+
+	/* Scan the whole dir tree to see what matches the bad extents. */
+	if (!bitmap_empty(&d_bad) || !bitmap_empty(&r_bad))
+		moveon = xfs_report_verify_errors(ctx, &d_bad, &r_bad);
+
+	bitmap_free(&r_bad);
+	bitmap_free(&d_bad);
+	free(ve);
+	return moveon;
+
+out_pool:
+	read_verify_pool_destroy(&ctx->rvp);
+out_rbad:
+	bitmap_free(&r_bad);
+out_dbad:
+	bitmap_free(&d_bad);
+out_ve:
+	free(ve);
+	return moveon;
+}
+
+/* Phase 6: Check summary counters. */
+
+struct xfs_summary_counts {
+	unsigned long long	inodes;		/* number of inodes */
+	unsigned long long	dbytes;		/* data dev bytes */
+	unsigned long long	rbytes;		/* rt dev bytes */
+	unsigned long long	next_phys;	/* next phys bytes we see? */
+	unsigned long long	agbytes;	/* freespace bytes */
+	struct bitmap		dext;		/* data block extent bitmap */
+	struct bitmap		rext;		/* rt block extent bitmap */
+};
+
+struct xfs_inode_fork_summary {
+	struct bitmap		*tree;
+	unsigned long long	bytes;
+};
+
+/* Record inode and block usage. */
+static int
+xfs_record_inode_summary(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	struct xfs_summary_counts	*counts = arg;
+
+	counts->inodes++;
+	return 0;
+}
+
+/* Record block usage. */
+static bool
+xfs_record_block_summary(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*fsmap,
+	void				*arg)
+{
+	struct xfs_summary_counts	*counts = arg;
+	unsigned long long		len;
+
+	if (fsmap->fmr_device == ctx->fsinfo.fs_logdev)
+		return true;
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == XFS_FMR_OWN_FREE)
+		return true;
+
+	len = fsmap->fmr_length;
+
+	/* freesp btrees live in free space, need to adjust counters later. */
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == XFS_FMR_OWN_AG) {
+		counts->agbytes += fsmap->fmr_length;
+	}
+	if (fsmap->fmr_device == ctx->fsinfo.fs_rtdev) {
+		/* Count realtime extents. */
+		counts->rbytes += len;
+	} else {
+		/* Count datadev extents. */
+		if (counts->next_phys >= fsmap->fmr_physical + len)
+			return true;
+		else if (counts->next_phys > fsmap->fmr_physical)
+			len = fsmap->fmr_physical + len - counts->next_phys;
+		counts->dbytes += len;
+		counts->next_phys = fsmap->fmr_physical + fsmap->fmr_length;
+	}
+
+	return true;
+}
+
+/* Count all inodes and blocks in the filesystem, compare to superblock. */
+bool
+xfs_check_summary(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_fsop_counts		fc;
+	struct xfs_fsop_resblks		rb;
+	struct xfs_fsop_ag_resblks	arb;
+	struct statvfs			sfs;
+	struct xfs_summary_counts	*summary;
+	unsigned long long		fd;
+	unsigned long long		fr;
+	unsigned long long		fi;
+	unsigned long long		sd;
+	unsigned long long		sr;
+	unsigned long long		si;
+	unsigned long long		absdiff;
+	xfs_agnumber_t			agno;
+	bool				moveon;
+	bool				complain;
+	unsigned int			groups;
+	int				error;
+
+	groups = xfs_scan_all_blocks_array_size(ctx);
+	summary = calloc(groups, sizeof(struct xfs_summary_counts));
+	if (!summary) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/* Flush everything out to disk before we start counting. */
+	moveon = (syncfs(ctx->mnt_fd) == 0);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out;
+	}
+
+	/* Use fsmap to count blocks. */
+	moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_record_block_summary,
+			summary, sizeof(*summary));
+	if (!moveon)
+		goto out;
+
+	/* Scan the whole fs. */
+	moveon = xfs_scan_all_inodes_array_arg(ctx, xfs_record_inode_summary,
+			summary, sizeof(*summary));
+	if (!moveon)
+		goto out;
+
+	/* Sum the counts. */
+	for (agno = 1; agno < groups; agno++) {
+		summary[0].inodes += summary[agno].inodes;
+		summary[0].dbytes += summary[agno].dbytes;
+		summary[0].rbytes += summary[agno].rbytes;
+		summary[0].agbytes += summary[agno].agbytes;
+	}
+
+	/* Fetch the filesystem counters. */
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSCOUNTS, &fc);
+	if (error)
+		str_errno(ctx, ctx->mntpoint);
+
+	/* Grab the fstatvfs counters, since it has to report accurately. */
+	moveon = (fstatvfs(ctx->mnt_fd, &sfs) == 0);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out;
+	}
+
+	/*
+	 * XFS reserves some blocks to prevent hard ENOSPC, so add those
+	 * blocks back to the free data counts.
+	 */
+	error = ioctl(ctx->mnt_fd, XFS_IOC_GET_RESBLKS, &rb);
+	if (error)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += rb.resblks_avail;
+
+	/*
+	 * XFS with rmap or reflink reserves blocks in each AG to
+	 * prevent the AG from running out of space for metadata blocks.
+	 * Add those back to the free data counts.
+	 */
+	memset(&arb, 0, sizeof(arb));
+	error = ioctl(ctx->mnt_fd, XFS_IOC_GET_AG_RESBLKS, &arb);
+	if (error && errno != ENOTTY)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += arb.resblks;
+
+	/*
+	 * If we counted blocks with fsmap, then dblocks includes
+	 * blocks for the AGFL and the freespace/rmap btrees.  The
+	 * filesystem treats them as "free", but since we scanned
+	 * them, we'll consider them used.
+	 */
+	sfs.f_bfree -= summary[0].agbytes >> ctx->blocklog;
+
+	/* Report on what we found. */
+	fd = (ctx->geo.datablocks - sfs.f_bfree) << ctx->blocklog;
+	fr = (ctx->geo.rtblocks - fc.freertx) << ctx->blocklog;
+	fi = sfs.f_files - sfs.f_ffree;
+	sd = summary[0].dbytes;
+	sr = summary[0].rbytes;
+	si = summary[0].inodes;
+
+	/*
+	 * Complain if the counts are off by more than 10% unless
+	 * the inaccuracy is less than 32MB worth of blocks or 100 inodes.
+	 */
+	absdiff = 1ULL << 25;
+	complain = !within_range(ctx, sd, fd, absdiff, 1, 10, _("data blocks"));
+	complain |= !within_range(ctx, sr, fr, absdiff, 1, 10, _("realtime blocks"));
+	complain |= !within_range(ctx, si, fi, 100, 1, 10, _("inodes"));
+
+	if (complain || verbose) {
+		double		d, r, i;
+		char		*du, *ru, *iu;
+
+		if (fr || sr) {
+			d = auto_space_units(fd, &du);
+			r = auto_space_units(fr, &ru);
+			i = auto_units(fi, &iu);
+			fprintf(stdout,
+_("%.1f%s data used;  %.1f%s realtime data used;  %.2f%s inodes used.\n"),
+					d, du, r, ru, i, iu);
+			d = auto_space_units(sd, &du);
+			r = auto_space_units(sr, &ru);
+			i = auto_units(si, &iu);
+			fprintf(stdout,
+_("%.1f%s data found; %.1f%s realtime data found; %.2f%s inodes found.\n"),
+					d, du, r, ru, i, iu);
+		} else {
+			d = auto_space_units(fd, &du);
+			i = auto_units(fi, &iu);
+			fprintf(stdout,
+_("%.1f%s data used;  %.1f%s inodes used.\n"),
+					d, du, i, iu);
+			d = auto_space_units(sd, &du);
+			i = auto_units(si, &iu);
+			fprintf(stdout,
+_("%.1f%s data found; %.1f%s inodes found.\n"),
+					d, du, i, iu);
+		}
+		fflush(stdout);
+	}
+	moveon = true;
+
+out:
+	for (agno = 0; agno < groups; agno++) {
+		bitmap_free(&summary[agno].dext);
+		bitmap_free(&summary[agno].rext);
+	}
+	free(summary);
+	return moveon;
+}
+
+/* Shut down the filesystem. */
+void
+xfs_shutdown_fs(
+	struct scrub_ctx		*ctx)
+{
+	int				flag;
+
+	flag = XFS_FSOP_GOING_FLAGS_LOGFLUSH;
+	str_info(ctx, ctx->mntpoint, _("Shutting down filesystem!"));
+	if (ioctl(ctx->mnt_fd, XFS_IOC_GOINGDOWN, &flag))
+		str_errno(ctx, ctx->mntpoint);
+}
diff --git a/scrub/xfs_ioctl.c b/scrub/xfs_ioctl.c
new file mode 100644
index 0000000..f71e1f7
--- /dev/null
+++ b/scrub/xfs_ioctl.c
@@ -0,0 +1,968 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "xfs_ioctl.h"
+
+#define FSMAP_NR		65536
+#define BMAP_NR			2048
+
+/* Call the handler function. */
+static int
+xfs_iterate_inode_func(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	struct xfs_bstat	*bs,
+	struct xfs_handle	*handle,
+	void			*arg)
+{
+	int			error;
+
+	handle->ha_fid.fid_ino = bs->bs_ino;
+	handle->ha_fid.fid_gen = bs->bs_gen;
+	error = fn(ctx, handle, bs, arg);
+	if (error)
+		return error;
+	if (xfs_scrub_excessive_errors(ctx))
+		return XFS_ITERATE_INODES_ABORT;
+	return 0;
+}
+
+/* Iterate a range of inodes. */
+bool
+xfs_iterate_inodes(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	void			*fshandle,
+	uint64_t		first_ino,
+	uint64_t		last_ino,
+	xfs_inode_iter_fn	fn,
+	void			*arg)
+{
+	struct xfs_fsop_bulkreq	igrpreq = {0};
+	struct xfs_fsop_bulkreq	bulkreq = {0};
+	struct xfs_fsop_bulkreq	onereq = {0};
+	struct xfs_handle	handle;
+	struct xfs_inogrp	inogrp;
+	struct xfs_bstat	bstat[XFS_INODES_PER_CHUNK] = {0};
+	char			idescr[DESCR_BUFSZ];
+	char			buf[DESCR_BUFSZ];
+	struct xfs_bstat	*bs;
+	__u64			last_stale = first_ino - 1;
+	__u64			igrp_ino;
+	__u64			oneino;
+	__u64			ino;
+	__s32			bulklen = 0;
+	__s32			onelen = 0;
+	__s32			igrplen = 0;
+	bool			moveon = true;
+	int			i;
+	int			error;
+	int			stale_count = 0;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"));
+
+	onereq.lastip  = &oneino;
+	onereq.icount  = 1;
+	onereq.ocount  = &onelen;
+
+	bulkreq.lastip  = &ino;
+	bulkreq.icount  = XFS_INODES_PER_CHUNK;
+	bulkreq.ubuffer = &bstat;
+	bulkreq.ocount  = &bulklen;
+
+	igrpreq.lastip  = &igrp_ino;
+	igrpreq.icount  = 1;
+	igrpreq.ubuffer = &inogrp;
+	igrpreq.ocount  = &igrplen;
+
+	memcpy(&handle.ha_fsid, fshandle, sizeof(handle.ha_fsid));
+	handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
+			sizeof(handle.ha_fid.fid_len);
+	handle.ha_fid.fid_pad = 0;
+
+	/* Find the inode chunk & alloc mask */
+	igrp_ino = first_ino;
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSINUMBERS, &igrpreq);
+	while (!error && igrplen) {
+		/* Load the inodes. */
+		ino = inogrp.xi_startino - 1;
+		bulkreq.icount = inogrp.xi_alloccount;
+		error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT, &bulkreq);
+		if (error)
+			str_warn(ctx, descr, "%s", strerror_r(errno,
+						buf, DESCR_BUFSZ));
+
+		/* Did we get exactly the inodes we expected? */
+		for (i = 0, bs = bstat; i < XFS_INODES_PER_CHUNK; i++) {
+			if (!(inogrp.xi_allocmask & (1ULL << i)))
+				continue;
+			if (bs->bs_ino == inogrp.xi_startino + i) {
+				bs++;
+				continue;
+			}
+
+			/* Load the one inode. */
+			oneino = inogrp.xi_startino + i;
+			onereq.ubuffer = bs;
+			error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT_SINGLE,
+					&onereq);
+			if (error || bs->bs_ino != inogrp.xi_startino + i) {
+				memset(bs, 0, sizeof(struct xfs_bstat));
+				bs->bs_ino = inogrp.xi_startino + i;
+				bs->bs_blksize = ctx->mnt_sv.f_frsize;
+			}
+			bs++;
+		}
+
+		/* Iterate all the inodes. */
+		for (i = 0, bs = bstat; i < inogrp.xi_alloccount; i++, bs++) {
+			if (bs->bs_ino > last_ino)
+				goto out;
+
+			error = xfs_iterate_inode_func(ctx, fn, bs, &handle,
+					arg);
+			switch (error) {
+			case 0:
+				break;
+			case ESTALE:
+				if (last_stale == inogrp.xi_startino)
+					stale_count++;
+				else {
+					last_stale = inogrp.xi_startino;
+					stale_count = 0;
+				}
+				if (stale_count < 30) {
+					igrp_ino = inogrp.xi_startino;
+					goto igrp_retry;
+				}
+				snprintf(idescr, DESCR_BUFSZ, "inode %llu",
+						bs->bs_ino);
+				str_warn(ctx, idescr, "%s", strerror_r(error,
+						buf, DESCR_BUFSZ));
+				break;
+			case XFS_ITERATE_INODES_ABORT:
+				error = 0;
+				/* fall thru */
+			default:
+				moveon = false;
+				errno = error;
+				goto err;
+			}
+		}
+
+igrp_retry:
+		error = ioctl(ctx->mnt_fd, XFS_IOC_FSINUMBERS, &igrpreq);
+	}
+
+err:
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	return moveon;
+}
+
+/* Does the kernel support bulkstat?  An empty request draws EINVAL if so. */
+bool
+xfs_can_iterate_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_fsop_bulkreq	bulkreq;
+	__u64			lastino;
+	__s32			bulklen = 0;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"))
+		return false;
+
+	lastino = 0;
+	memset(&bulkreq, 0, sizeof(bulkreq));
+	bulkreq.lastip = (__u64 *)&lastino;
+	bulkreq.icount  = 0;
+	bulkreq.ubuffer = NULL;
+	bulkreq.ocount  = &bulklen;
+
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT, &bulkreq);
+	return error == -1 && errno == EINVAL;
+}
+
+/* Iterate all the extent block mappings between the two keys. */
+bool
+xfs_iterate_bmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	int			whichfork,
+	struct xfs_bmap		*key,
+	xfs_bmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsxattr		fsx;
+	struct getbmapx		*map;
+	struct getbmapx		*p;
+	struct xfs_bmap		bmap;
+	char			bmap_descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	xfs_off_t		new_off;
+	int			getxattr_type;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BMAP"));
+
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s attr"), descr);
+		break;
+	case XFS_COW_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s CoW"), descr);
+		break;
+	case XFS_DATA_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s data"), descr);
+		break;
+	default:
+		assert(0);
+	}
+
+	map = calloc(BMAP_NR, sizeof(struct getbmapx));
+	if (!map) {
+		str_errno(ctx, bmap_descr);
+		return false;
+	}
+
+	map->bmv_offset = BTOBB(key->bm_offset);
+	map->bmv_block = BTOBB(key->bm_physical);
+	if (key->bm_length == 0)
+		map->bmv_length = ULLONG_MAX;
+	else
+		map->bmv_length = BTOBB(key->bm_length);
+	map->bmv_count = BMAP_NR;
+	map->bmv_iflags = BMV_IF_NO_DMAPI_READ | BMV_IF_PREALLOC |
+			  BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		getxattr_type = XFS_IOC_FSGETXATTRA;
+		map->bmv_iflags |= BMV_IF_ATTRFORK;
+		break;
+	case XFS_COW_FORK:
+		map->bmv_iflags |= BMV_IF_COWFORK;
+		getxattr_type = FS_IOC_FSGETXATTR;
+		break;
+	case XFS_DATA_FORK:
+		getxattr_type = FS_IOC_FSGETXATTR;
+		break;
+	default:
+		abort();
+	}
+
+	error = ioctl(fd, getxattr_type, &fsx);
+	if (error < 0) {
+		str_errno(ctx, bmap_descr);
+		moveon = false;
+		goto out;
+	}
+
+	while ((error = ioctl(fd, XFS_IOC_GETBMAPX, map)) == 0) {
+		for (i = 0, p = &map[i + 1]; i < map->bmv_entries; i++, p++) {
+			bmap.bm_offset = BBTOB(p->bmv_offset);
+			bmap.bm_physical = BBTOB(p->bmv_block);
+			bmap.bm_length = BBTOB(p->bmv_length);
+			bmap.bm_flags = p->bmv_oflags;
+			moveon = fn(ctx, bmap_descr, fd, whichfork, &fsx,
+					&bmap, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (map->bmv_entries == 0)
+			break;
+		p = map + map->bmv_entries;
+		if (p->bmv_oflags & BMV_OF_LAST)
+			break;
+
+		new_off = p->bmv_offset + p->bmv_length;
+		map->bmv_length -= new_off - map->bmv_offset;
+		map->bmv_offset = new_off;
+	}
+
+	/* Pre-reflink filesystems don't know about CoW forks. */
+	if (whichfork == XFS_COW_FORK && error && errno == EINVAL)
+		error = 0;
+
+	if (error)
+		str_errno(ctx, bmap_descr);
+out:
+	memcpy(key, map, sizeof(struct getbmapx));
+	free(map);
+	return moveon;
+}
+
+/* Does the kernel support getbmapx? */
+bool
+xfs_can_iterate_bmap(
+	struct scrub_ctx	*ctx)
+{
+	struct getbmapx		bsm[2];
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BMAP"))
+		return false;
+
+	memset(bsm, 0, sizeof(struct getbmapx));
+	bsm->bmv_length = ULLONG_MAX;
+	bsm->bmv_count = 2;
+	error = ioctl(ctx->mnt_fd, XFS_IOC_GETBMAPX, bsm);
+	return error == 0;
+}
+
+/* Iterate all the fs block mappings between the two keys. */
+bool
+xfs_iterate_fsmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*keys,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsmap_head	*head;
+	struct fsmap		*p;
+	bool			moveon = true;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_FSMAP"));
+
+	head = malloc(fsmap_sizeof(FSMAP_NR));
+	if (!head) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	memset(head, 0, sizeof(*head));
+	memcpy(head->fmh_keys, keys, sizeof(struct fsmap) * 2);
+	head->fmh_count = FSMAP_NR;
+
+	while ((error = ioctl(ctx->mnt_fd, FS_IOC_GETFSMAP, head)) == 0) {
+		for (i = 0, p = head->fmh_recs;
+		     i < head->fmh_entries;
+		     i++, p++) {
+			moveon = fn(ctx, descr, p, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (head->fmh_entries == 0)
+			break;
+		p = &head->fmh_recs[head->fmh_entries - 1];
+		if (p->fmr_flags & FMR_OF_LAST)
+			break;
+		fsmap_advance(head);
+	}
+
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	free(head);
+	return moveon;
+}
+
+/* Does the kernel support getfsmap? */
+bool
+xfs_can_iterate_fsmap(
+	struct scrub_ctx	*ctx)
+{
+	struct fsmap_head	head;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_FSMAP"))
+		return false;
+
+	memset(&head, 0, sizeof(struct fsmap_head));
+	head.fmh_keys[1].fmr_device = UINT_MAX;
+	head.fmh_keys[1].fmr_physical = ULLONG_MAX;
+	head.fmh_keys[1].fmr_owner = ULLONG_MAX;
+	head.fmh_keys[1].fmr_offset = ULLONG_MAX;
+	error = ioctl(ctx->mnt_fd, FS_IOC_GETFSMAP, &head);
+	return error == 0 && (head.fmh_oflags & FMH_OF_DEV_T);
+}
+
+/* Online scrub and repair. */
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+	ST_NONE,	/* disabled */
+	ST_AGHEADER,	/* per-AG header */
+	ST_PERAG,	/* per-AG metadata */
+	ST_FS,		/* per-FS metadata */
+	ST_INODE,	/* per-inode metadata */
+};
+struct scrub_descr {
+	const char	*name;
+	enum scrub_type	type;
+};
+
+/* These must correspond to XFS_SCRUB_TYPE_ */
+static const struct scrub_descr scrubbers[] = {
+	{"dummy",				ST_NONE},
+	{"superblock",				ST_AGHEADER},
+	{"free space header",			ST_AGHEADER},
+	{"free list",				ST_AGHEADER},
+	{"inode header",			ST_AGHEADER},
+	{"freesp by block btree",		ST_PERAG},
+	{"freesp by length btree",		ST_PERAG},
+	{"inode btree",				ST_PERAG},
+	{"free inode btree",			ST_PERAG},
+	{"reverse mapping btree",		ST_PERAG},
+	{"reference count btree",		ST_PERAG},
+	{"inode record",			ST_INODE},
+	{"data block map",			ST_INODE},
+	{"attr block map",			ST_INODE},
+	{"CoW block map",			ST_INODE},
+	{"directory entries",			ST_INODE},
+	{"extended attributes",			ST_INODE},
+	{"symbolic link",			ST_INODE},
+	{"realtime bitmap",			ST_FS},
+	{"realtime summary",			ST_FS},
+};
+
+/* Format a scrub description. */
+static void
+format_scrub_descr(
+	char				*buf,
+	size_t				buflen,
+	struct xfs_scrub_metadata	*meta,
+	const struct scrub_descr	*sc)
+{
+	switch (sc->type) {
+	case ST_AGHEADER:
+	case ST_PERAG:
+		snprintf(buf, buflen, _("AG %u %s"), meta->sm_agno,
+				_(sc->name));
+		break;
+	case ST_INODE:
+		snprintf(buf, buflen, _("Inode %llu %s"), meta->sm_ino,
+				_(sc->name));
+		break;
+	case ST_FS:
+		snprintf(buf, buflen, _("%s"), _(sc->name));
+		break;
+	case ST_NONE:
+		assert(0);
+		break;
+	}
+}
+
+static inline bool
+IS_CORRUPT(
+	__u32				flags)
+{
+	return flags & (XFS_SCRUB_FLAG_CORRUPT | XFS_SCRUB_FLAG_XCORRUPT);
+}
+
+/* Do we need to repair something? */
+static inline bool
+xfs_scrub_needs_repair(
+	struct xfs_scrub_metadata	*sm)
+{
+	return IS_CORRUPT(sm->sm_flags);
+}
+
+/* Can we optimize something? */
+static inline bool
+xfs_scrub_needs_preen(
+	struct xfs_scrub_metadata	*sm)
+{
+	return sm->sm_flags & XFS_SCRUB_FLAG_PREEN;
+}
+
+/* Do a read-only check of some metadata. */
+static enum check_outcome
+xfs_check_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct xfs_scrub_metadata	*meta,
+	bool				is_inode)
+{
+	char				buf[DESCR_BUFSZ];
+	int				error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	assert(meta->sm_type <= XFS_SCRUB_TYPE_MAX);
+	format_scrub_descr(buf, DESCR_BUFSZ, meta, &scrubbers[meta->sm_type]);
+
+	dbg_printf("check %s flags %xh\n", buf, meta->sm_flags);
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, meta);
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
+		meta->sm_flags |= XFS_SCRUB_FLAG_PREEN;
+	if (error) {
+		/* Metadata not present, just skip it. */
+		if (errno == ENOENT)
+			return CHECK_DONE;
+		else if (errno == ESHUTDOWN) {
+			/* FS already crashed, give up. */
+			str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+			return CHECK_ABORT;
+		}
+
+		/* Operational error. */
+		str_errno(ctx, buf);
+		return CHECK_DONE;
+	} else if (!xfs_scrub_needs_repair(meta) &&
+		   !xfs_scrub_needs_preen(meta)) {
+		/* Clean operation, no corruption or preening detected. */
+		return CHECK_DONE;
+	} else if (xfs_scrub_needs_repair(meta) &&
+		   ctx->mode < SCRUB_MODE_REPAIR) {
+		/* Corrupt, but we're not in repair mode. */
+		str_error(ctx, buf, _("Repairs are required."));
+		return CHECK_DONE;
+	} else if (xfs_scrub_needs_preen(meta) &&
+		   ctx->mode < SCRUB_MODE_PREEN) {
+		/* Preenable, but we're not in preen mode. */
+		if (!is_inode) {
+			/* AG or FS metadata, always warn. */
+			str_info(ctx, buf, _("Optimization is possible."));
+		} else if (!ctx->preen_triggers[meta->sm_type]) {
+			/* File metadata, only warn once per type. */
+			pthread_mutex_lock(&ctx->lock);
+			if (!ctx->preen_triggers[meta->sm_type])
+				ctx->preen_triggers[meta->sm_type] = true;
+			pthread_mutex_unlock(&ctx->lock);
+		}
+		return CHECK_DONE;
+	}
+
+	return CHECK_REPAIR;
+}
+
+/* Bulk-notify user about things that could be optimized. */
+void
+xfs_scrub_report_preen_triggers(
+	struct scrub_ctx		*ctx)
+{
+	int				i;
+
+	for (i = 0; i <= XFS_SCRUB_TYPE_MAX; i++) {
+		pthread_mutex_lock(&ctx->lock);
+		if (ctx->preen_triggers[i]) {
+			ctx->preen_triggers[i] = false;
+			pthread_mutex_unlock(&ctx->lock);
+			str_info(ctx, ctx->mntpoint,
+_("Optimizations of %s are possible."), scrubbers[i].name);
+		} else {
+			pthread_mutex_unlock(&ctx->lock);
+		}
+	}
+}
+
+/* Scrub metadata, saving corruption reports for later. */
+static bool
+xfs_scrub_metadata(
+	struct scrub_ctx		*ctx,
+	enum scrub_type			scrub_type,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	const struct scrub_descr	*sc;
+	struct repair_item		*ri;
+	enum check_outcome		fix;
+	int				type;
+
+	sc = scrubbers;
+	for (type = 0; type <= XFS_SCRUB_TYPE_MAX; type++, sc++) {
+		if (sc->type != scrub_type)
+			continue;
+
+		meta.sm_type = type;
+		meta.sm_flags = 0;
+		meta.sm_agno = agno;
+
+		/* Check the item. */
+		fix = xfs_check_metadata(ctx, ctx->mnt_fd, &meta, false);
+		if (fix == CHECK_ABORT)
+			return false;
+		if (fix == CHECK_DONE)
+			continue;
+
+		/* Schedule this item for later repairs. */
+		ri = malloc(sizeof(struct repair_item));
+		if (!ri) {
+			str_errno(ctx, _("repair list"));
+			return false;
+		}
+		ri->op = meta;
+		list_add_tail(&ri->list, repair_list);
+	}
+
+	return true;
+}
+
+/* Scrub each AG's header blocks. */
+bool
+xfs_scrub_ag_headers(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_AGHEADER, agno, repair_list);
+}
+
+/* Scrub each AG's metadata btrees. */
+bool
+xfs_scrub_ag_metadata(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_PERAG, agno, repair_list);
+}
+
+/* Scrub whole-FS metadata (realtime bitmap and summary). */
+bool
+xfs_scrub_fs_metadata(
+	struct scrub_ctx		*ctx,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_FS, 0, repair_list);
+}
+
+/* Scrub inode metadata. */
+static bool
+__xfs_scrub_file(
+	struct scrub_ctx		*ctx,
+	uint64_t			ino,
+	uint32_t			gen,
+	int				fd,
+	unsigned int			type,
+	struct list_head		*repair_list)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	struct repair_item		*ri;
+	enum check_outcome		fix;
+
+	assert(type <= XFS_SCRUB_TYPE_MAX);
+	assert(scrubbers[type].type == ST_INODE);
+
+	meta.sm_type = type;
+	meta.sm_ino = ino;
+	meta.sm_gen = gen;
+
+	/* Scrub the piece of metadata. */
+	fix = xfs_check_metadata(ctx, fd, &meta, true);
+	if (fix == CHECK_ABORT)
+		return false;
+	if (fix == CHECK_DONE)
+		return true;
+
+	/* Schedule this item for later repairs. */
+	ri = malloc(sizeof(struct repair_item));
+	if (!ri) {
+		str_errno(ctx, _("repair list"));
+		return false;
+	}
+	ri->op = meta;
+	list_add_tail(&ri->list, repair_list);
+	return true;
+}
+
+#define XFS_SCRUB_FILE_PART(name, flagname) \
+bool \
+xfs_scrub_##name( \
+	struct scrub_ctx		*ctx, \
+	uint64_t			ino, \
+	uint32_t			gen, \
+	int				fd, \
+	struct list_head		*repair_list) \
+{ \
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_##flagname, \
+			repair_list); \
+}
+XFS_SCRUB_FILE_PART(inode_fields,	INODE)
+XFS_SCRUB_FILE_PART(data_fork,		BMBTD)
+XFS_SCRUB_FILE_PART(attr_fork,		BMBTA)
+XFS_SCRUB_FILE_PART(cow_fork,		BMBTC)
+XFS_SCRUB_FILE_PART(dir,		DIR)
+XFS_SCRUB_FILE_PART(attr,		XATTR)
+XFS_SCRUB_FILE_PART(symlink,		SYMLINK)
+
+/*
+ * Prioritize repair items in order of how long we can wait.
+ * 0 = do it now, 10000 = do it later.
+ *
+ * To minimize the amount of repair work, we want to prioritize metadata
+ * objects by perceived corruptness.  If CORRUPT is set, the fields are
+ * just plain bad; try fixing that first.  Otherwise if XCORRUPT is set,
+ * the fields could be bad, but the xref data could also be bad; we'll
+ * try fixing that next.  Finally, if XFAIL is set, some other metadata
+ * structure failed validation during xref, so we'll recheck this
+ * metadata last since it was probably fine.
+ *
+ * For metadata that lie in the critical path of checking other metadata
+ * (superblock, AG{F,I,FL}, inobt) we scrub and fix those things before
+ * we even get to handling their dependencies, so things should progress
+ * in order.
+ */
+static int
+PRIO(
+	struct xfs_scrub_metadata	*op,
+	int				order)
+{
+	if (op->sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+		return order;
+	else if (op->sm_flags & XFS_SCRUB_FLAG_XCORRUPT)
+		return 100 + order;
+	else if (op->sm_flags & XFS_SCRUB_FLAG_XFAIL)
+		return 200 + order;
+	else if (op->sm_flags & XFS_SCRUB_FLAG_PREEN)
+		return 300 + order;
+	abort();
+}
+
+static int
+xfs_repair_item_priority(
+	struct repair_item		*ri)
+{
+	switch (ri->op.sm_type) {
+	case XFS_SCRUB_TYPE_SB:
+		return PRIO(&ri->op, 0);
+	case XFS_SCRUB_TYPE_AGF:
+		return PRIO(&ri->op, 1);
+	case XFS_SCRUB_TYPE_AGFL:
+		return PRIO(&ri->op, 2);
+	case XFS_SCRUB_TYPE_AGI:
+		return PRIO(&ri->op, 3);
+	case XFS_SCRUB_TYPE_BNOBT:
+	case XFS_SCRUB_TYPE_CNTBT:
+	case XFS_SCRUB_TYPE_INOBT:
+	case XFS_SCRUB_TYPE_FINOBT:
+	case XFS_SCRUB_TYPE_REFCNTBT:
+		return PRIO(&ri->op, 4);
+	case XFS_SCRUB_TYPE_RMAPBT:
+		return PRIO(&ri->op, 5);
+	case XFS_SCRUB_TYPE_INODE:
+		return PRIO(&ri->op, 6);
+	case XFS_SCRUB_TYPE_BMBTD:
+	case XFS_SCRUB_TYPE_BMBTA:
+	case XFS_SCRUB_TYPE_BMBTC:
+		return PRIO(&ri->op, 7);
+	case XFS_SCRUB_TYPE_DIR:
+	case XFS_SCRUB_TYPE_XATTR:
+	case XFS_SCRUB_TYPE_SYMLINK:
+		return PRIO(&ri->op, 8);
+	case XFS_SCRUB_TYPE_RTBITMAP:
+	case XFS_SCRUB_TYPE_RTSUM:
+		return PRIO(&ri->op, 9);
+	}
+	abort();
+}
+
+/* Sort by repair priority so that headers get repaired before btrees. */
+static int
+xfs_repair_item_compare(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct repair_item		*ra;
+	struct repair_item		*rb;
+
+	ra = container_of(a, struct repair_item, list);
+	rb = container_of(b, struct repair_item, list);
+
+	return xfs_repair_item_priority(ra) - xfs_repair_item_priority(rb);
+}
+
+/* Repair some metadata. */
+static enum check_outcome
+xfs_repair_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct xfs_scrub_metadata	*meta,
+	bool				complain_if_still_broken)
+{
+	char				buf[DESCR_BUFSZ];
+	__u32				oldf = meta->sm_flags;
+	int				error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	meta->sm_flags |= XFS_SCRUB_FLAG_REPAIR;
+	assert(meta->sm_type <= XFS_SCRUB_TYPE_MAX);
+	format_scrub_descr(buf, DESCR_BUFSZ, meta, &scrubbers[meta->sm_type]);
+
+	if (xfs_scrub_needs_repair(meta))
+		str_info(ctx, buf, _("Attempting repair."));
+	else if (debug || verbose)
+		str_info(ctx, buf, _("Attempting optimization."));
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, meta);
+	if (error) {
+		switch (errno) {
+		case ESHUTDOWN:
+			/* Filesystem is already shut down, abort. */
+			str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+			return CHECK_ABORT;
+		case ENOTTY:
+		case EOPNOTSUPP:
+			/*
+			 * If we forced repairs, don't complain if kernel
+			 * doesn't know how to fix.
+			 */
+			if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+				return CHECK_DONE;
+			/* fall through */
+		case EINVAL:
+			/* Kernel doesn't know how to repair this? */
+			if (complain_if_still_broken)
+				str_error(ctx, buf,
+_("Don't know how to fix; offline repair required."));
+			return CHECK_REPAIR;
+		case EROFS:
+			/* Read-only filesystem, can't fix. */
+			if (verbose || debug || IS_CORRUPT(oldf))
+				str_info(ctx, buf,
+_("Read-only filesystem; cannot make changes."));
+			return CHECK_DONE;
+		case ENOENT:
+			/* Metadata not present, just skip it. */
+			return CHECK_DONE;
+		case ENOMEM:
+		case ENOSPC:
+			/* Don't care if preen fails due to low resources. */
+			if (oldf & XFS_SCRUB_FLAG_PREEN)
+				return CHECK_DONE;
+			/* fall through */
+		default:
+			/* Operational error. */
+			str_errno(ctx, buf);
+			return CHECK_DONE;
+		}
+	} else if (xfs_scrub_needs_repair(meta)) {
+		/* Still broken, try again or fix offline. */
+		if (complain_if_still_broken)
+			str_error(ctx, buf,
+_("Repair unsuccessful; offline repair required."));
+		return CHECK_REPAIR;
+	} else {
+		/* Clean operation, no corruption detected. */
+		if (IS_CORRUPT(oldf))
+			record_repair(ctx, buf, _("Repairs successful."));
+		else
+			record_preen(ctx, buf, _("Optimization successful."));
+		return CHECK_DONE;
+	}
+}
+
+/* Repair everything on this list. */
+bool
+xfs_repair_metadata_list(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct list_head		*repair_list,
+	unsigned int			flags)
+{
+	struct repair_item		*ri;
+	struct repair_item		*n;
+	enum check_outcome		fix;
+
+	list_sort(NULL, repair_list, xfs_repair_item_compare);
+
+	list_for_each_entry_safe(ri, n, repair_list, list) {
+		if (!IS_CORRUPT(ri->op.sm_flags) &&
+		    (flags & XRML_REPAIR_ONLY))
+			continue;
+		fix = xfs_repair_metadata(ctx, fd, &ri->op,
+				flags & XRML_NOFIX_COMPLAIN);
+		if (fix == CHECK_ABORT)
+			return false;
+		else if (fix == CHECK_REPAIR)
+			continue;
+
+		list_del(&ri->list);
+		free(ri);
+	}
+
+	return !xfs_scrub_excessive_errors(ctx);
+}
+
+/* Test the availability of a kernel scrub command. */
+static bool
+__xfs_scrub_test(
+	struct scrub_ctx		*ctx,
+	unsigned int			type)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	struct xfs_error_injection	inject;
+	static bool			injected;
+	int				error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_KERNEL"))
+		return false;
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+		inject.fd = ctx->mnt_fd;
+#define XFS_ERRTAG_FORCE_REPAIR	28
+		inject.errtag = XFS_ERRTAG_FORCE_REPAIR;
+		error = ioctl(ctx->mnt_fd,
+				XFS_IOC_ERROR_INJECTION, &inject);
+		if (error == 0)
+			injected = true;
+	}
+
+	meta.sm_type = type;
+	error = ioctl(ctx->mnt_fd, XFS_IOC_SCRUB_METADATA, &meta);
+	return error == 0 || (error && errno != EOPNOTSUPP && errno != ENOTTY);
+}
+
+#define XFS_CAN_SCRUB_TEST(name, flagname) \
+bool \
+xfs_can_scrub_##name( \
+	struct scrub_ctx		*ctx) \
+{ \
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_##flagname); \
+}
+XFS_CAN_SCRUB_TEST(fs_metadata,		SB)
+XFS_CAN_SCRUB_TEST(inode,		INODE)
+XFS_CAN_SCRUB_TEST(bmap,		BMBTD)
+XFS_CAN_SCRUB_TEST(dir,			DIR)
+XFS_CAN_SCRUB_TEST(attr,		XATTR)
+XFS_CAN_SCRUB_TEST(symlink,		SYMLINK)
diff --git a/scrub/xfs_ioctl.h b/scrub/xfs_ioctl.h
new file mode 100644
index 0000000..78eec51
--- /dev/null
+++ b/scrub/xfs_ioctl.h
@@ -0,0 +1,103 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_IOCTL_H_
+#define XFS_IOCTL_H_
+
+/* inode iteration */
+#define XFS_ITERATE_INODES_ABORT	(-1)
+typedef int (*xfs_inode_iter_fn)(struct scrub_ctx *ctx,
+		struct xfs_handle *handle, struct xfs_bstat *bs, void *arg);
+bool xfs_iterate_inodes(struct scrub_ctx *ctx, const char *descr,
+		void *fshandle, uint64_t first_ino, uint64_t last_ino,
+		xfs_inode_iter_fn fn, void *arg);
+bool xfs_can_iterate_inodes(struct scrub_ctx *ctx);
+
+/* inode fork block mapping */
+struct xfs_bmap {
+	uint64_t	bm_offset;	/* file offset of segment in bytes */
+	uint64_t	bm_physical;	/* physical starting byte  */
+	uint64_t	bm_length;	/* length of segment, bytes */
+	uint32_t	bm_flags;	/* output flags */
+};
+
+typedef bool (*xfs_bmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		int fd, int whichfork, struct fsxattr *fsx,
+		struct xfs_bmap *bmap, void *arg);
+
+bool xfs_iterate_bmap(struct scrub_ctx *ctx, const char *descr, int fd,
+		int whichfork, struct xfs_bmap *key, xfs_bmap_iter_fn fn,
+		void *arg);
+bool xfs_can_iterate_bmap(struct scrub_ctx *ctx);
+
+/* filesystem reverse mapping */
+typedef bool (*xfs_fsmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *fsr, void *arg);
+bool xfs_iterate_fsmap(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *keys, xfs_fsmap_iter_fn fn, void *arg);
+bool xfs_can_iterate_fsmap(struct scrub_ctx *ctx);
+
+/* Online scrub and repair. */
+enum check_outcome {
+	CHECK_DONE,
+	CHECK_REPAIR,
+	CHECK_ABORT,
+};
+
+struct repair_item {
+	struct list_head		list;
+	struct xfs_scrub_metadata	op;
+};
+
+void xfs_scrub_report_preen_triggers(struct scrub_ctx *ctx);
+bool xfs_scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct list_head *repair_list);
+bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct list_head *repair_list);
+bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx,
+		struct list_head *repair_list);
+
+#define XRML_REPAIR_ONLY	1 /* no optimizations */
+#define XRML_NOFIX_COMPLAIN	2 /* complain if still corrupt */
+bool xfs_repair_metadata_list(struct scrub_ctx *ctx, int fd,
+		struct list_head *repair_list, unsigned int flags);
+
+bool xfs_can_scrub_fs_metadata(struct scrub_ctx *ctx);
+bool xfs_can_scrub_inode(struct scrub_ctx *ctx);
+bool xfs_can_scrub_bmap(struct scrub_ctx *ctx);
+bool xfs_can_scrub_dir(struct scrub_ctx *ctx);
+bool xfs_can_scrub_attr(struct scrub_ctx *ctx);
+bool xfs_can_scrub_symlink(struct scrub_ctx *ctx);
+
+bool xfs_scrub_inode_fields(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_data_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_attr_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_cow_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_dir(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_attr(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_symlink(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+
+#endif /* XFS_IOCTL_H_ */

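For reviewers, here is a rough Python model of the repair ordering that
PRIO() and xfs_repair_item_priority() implement in the patch above.  It is
a sketch, not code from the patch.  The flag values and the short type
names are placeholders picked for illustration; only the per-type order
(0-9) and the severity offsets (0/100/200/300) come from the C code.

```python
# Illustrative model of xfs_repair_item_priority(); not code from the patch.
# These flag values are placeholders, NOT the kernel's XFS_SCRUB_FLAG_* bits.
CORRUPT, XCORRUPT, XFAIL, PREEN = 1, 2, 4, 8

# Per-type order copied from the switch statement in the patch.
# The short names are hypothetical stand-ins for XFS_SCRUB_TYPE_*.
TYPE_ORDER = {
    "sb": 0, "agf": 1, "agfl": 2, "agi": 3,
    "bnobt": 4, "cntbt": 4, "inobt": 4, "finobt": 4, "refcntbt": 4,
    "rmapbt": 5, "inode": 6,
    "bmbtd": 7, "bmbta": 7, "bmbtc": 7,
    "dir": 8, "xattr": 8, "symlink": 8,
    "rtbitmap": 9, "rtsum": 9,
}

def priority(item):
    """Return the sort key for a (type_name, flags) repair item."""
    type_name, flags = item
    order = TYPE_ORDER[type_name]
    # Severity classes are tested in the same order as PRIO().
    for base, flag in ((0, CORRUPT), (100, XCORRUPT),
                       (200, XFAIL), (300, PREEN)):
        if flags & flag:
            return base + order
    # PRIO() calls abort() here; raise instead in this sketch.
    raise ValueError("item has no repairable flag set")

def sort_repairs(items):
    """Stable sort, lowest priority value (most urgent) first."""
    return sorted(items, key=priority)
```

With this model, a plainly corrupt directory (8) still sorts ahead of an
AGF that only failed cross-referencing (101), because severity dominates
the per-type order.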

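The tail of xfs_check_metadata() decides whether an item is finished or
gets queued for repair.  The following Python sketch models only that
decision, for a scrub ioctl that returned without error.  The mode
constants and their numeric ordering (dry run < preen < repair) are
assumptions inferred from the comparisons in the C code, and the function
name is made up for this example.

```python
# Illustrative model of the decision tail of xfs_check_metadata().
# Mode names and ordering are assumptions made for this sketch.
MODE_DRY_RUN, MODE_PREEN, MODE_REPAIR = 0, 1, 2

def check_outcome(corrupt, preen, mode):
    """Return (outcome, message) for a scrub call that succeeded.

    corrupt: CORRUPT or XCORRUPT was set in the returned flags.
    preen:   PREEN was set in the returned flags.
    """
    if not corrupt and not preen:
        # Clean: nothing to report, nothing to queue.
        return ("done", None)
    if corrupt and mode < MODE_REPAIR:
        # Corrupt, but the user did not ask for repairs.
        return ("done", "Repairs are required.")
    if preen and mode < MODE_PREEN:
        # Could be optimized, but preening is off.  (For inodes the
        # patch defers this message and reports it once per type.)
        return ("done", "Optimization is possible.")
    # Otherwise the item goes onto the repair list.
    return ("repair", None)
```

One consequence worth noting: in preen mode a corrupt item is reported
but not queued, while a preen-only item is queued.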

* [PATCH 8/9] xfs_scrub: create a script to scrub all xfs filesystems
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2017-03-10 23:25 ` [PATCH 7/9] xfs_scrub: add XFS-specific scrubbing functionality Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
  2017-03-10 23:25 ` [PATCH 9/9] xfs_scrub: integrate services with systemd Darrick J. Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create an xfs_scrub_all command to find all XFS filesystems
and run an online scrub against them all.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 debian/control           |    3 +
 debian/rules             |    1 
 man/man8/xfs_scrub_all.8 |   32 ++++++++++
 scrub/Makefile           |   15 ++++-
 scrub/xfs_scrub_all.in   |  148 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 195 insertions(+), 4 deletions(-)
 create mode 100644 man/man8/xfs_scrub_all.8
 create mode 100644 scrub/xfs_scrub_all.in


diff --git a/debian/control b/debian/control
index ad81662..d7c0316 100644
--- a/debian/control
+++ b/debian/control
@@ -3,12 +3,13 @@ Section: admin
 Priority: optional
 Maintainer: XFS Development Team <linux-xfs@vger.kernel.org>
 Uploaders: Nathan Scott <nathans@debian.org>, Anibal Monsalve Salazar <anibal@debian.org>
-Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev
+Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev, dh-python
 Standards-Version: 3.9.1
 Homepage: http://xfs.org/
 
 Package: xfsprogs
 Depends: ${shlibs:Depends}, ${misc:Depends}
+Recommends: ${python3:Depends}, util-linux
 Provides: fsck-backend
 Suggests: xfsdump, acl, attr, quota
 Breaks: xfsdump (<< 3.0.0)
diff --git a/debian/rules b/debian/rules
index c673380..a870944 100755
--- a/debian/rules
+++ b/debian/rules
@@ -76,6 +76,7 @@ binary-arch: checkroot built
 	$(pkgdi)  $(MAKE) -C debian install-d-i
 	$(pkgme)  $(MAKE) dist
 	rmdir debian/xfslibs-dev/usr/share/doc/xfsprogs
+	dh_python3
 	dh_installdocs
 	dh_installchangelogs
 	dh_strip
diff --git a/man/man8/xfs_scrub_all.8 b/man/man8/xfs_scrub_all.8
new file mode 100644
index 0000000..5e1420b
--- /dev/null
+++ b/man/man8/xfs_scrub_all.8
@@ -0,0 +1,32 @@
+.TH xfs_scrub_all 8
+.SH NAME
+xfs_scrub_all \- scrub all mounted XFS filesystems
+.SH SYNOPSIS
+.B xfs_scrub_all
+.SH DESCRIPTION
+.B xfs_scrub_all
+attempts to read and check all the metadata on all mounted XFS filesystems.
+The online scrub is performed via the
+.B xfs_scrub
+tool, either by running it directly or by using systemd to start it
+in a restricted fashion.
+Mounted filesystems are mapped to physical storage devices so that scrub
+operations can be run in parallel so long as no two scrubbers access
+the same device simultaneously.
+.SH EXIT CODE
+The exit code returned by
+.B xfs_scrub_all
+is the sum of the following conditions:
+.br
+\	0\	\-\ No errors
+.br
+\	4\	\-\ File system errors left uncorrected
+.br
+\	8\	\-\ Operational error
+.br
+\	16\	\-\ Usage or syntax error
+.TP
+These are the same error codes returned by xfs_scrub.
+.br
+.SH SEE ALSO
+.BR xfs_scrub (8).
diff --git a/scrub/Makefile b/scrub/Makefile
index bae2fa1..78e119f 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -10,6 +10,8 @@ SCRUB_PREREQS=$(HAVE_OPENAT)$(HAVE_FSTATAT)
 ifeq ($(SCRUB_PREREQS),yesyes)
 LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
+XFS_SCRUB_ALL_PROG = xfs_scrub_all
+XFS_SCRUB_ARGS = -Tvn
 endif	# scrub_prereqs
 
 HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h xfs_ioctl.h
@@ -36,15 +38,22 @@ ifeq ($(HAVE_SYNCFS),yes)
 LCFLAGS += -DHAVE_SYNCFS
 endif
 
-default: depend $(LTCOMMAND)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG)
+
+xfs_scrub_all: xfs_scrub_all.in
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e "s|@sbindir@|$(PKG_ROOT_SBIN_DIR)|g" \
+		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" < $< > $@
+	$(Q)chmod a+x $@
 
 include $(BUILDRULES)
 
-install: default $(INSTALL_SCRUB)
+install: $(INSTALL_SCRUB)
 
-install-scrub:
+install-scrub: default
 	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+	$(INSTALL) -m 755 $(XFS_SCRUB_ALL_PROG) $(PKG_ROOT_SBIN_DIR)
 
 install-dev:
 
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
new file mode 100644
index 0000000..2215720
--- /dev/null
+++ b/scrub/xfs_scrub_all.in
@@ -0,0 +1,148 @@
+#!/usr/bin/env python3
+
+# Run online scrubbers in parallel, but avoid thrashing.
+#
+# Copyright (C) 2017 Oracle.  All rights reserved.
+#
+# Author: Darrick J. Wong <darrick.wong@oracle.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+
+import subprocess
+import json
+import threading
+import time
+import sys
+
+retcode = 0
+terminate = False
+
+def find_mounts():
+	'''Map mountpoints to physical disks.'''
+
+	fs = {}
+	cmd=['lsblk', '-o', 'KNAME,TYPE,FSTYPE,MOUNTPOINT', '-J']
+	result = subprocess.Popen(cmd, stdout=subprocess.PIPE)
+	# communicate() drains the pipe so a long listing cannot stall lsblk.
+	stdout, _ = result.communicate()
+	if result.returncode != 0:
+		return fs
+	output = stdout.decode('utf-8')
+	bdevdata = json.loads(output)
+	for bdev in bdevdata['blockdevices']:
+		if bdev['type'] == 'disk':
+			lastdisk = bdev['kname']
+		elif bdev['fstype'] == 'xfs':
+			mnt = bdev['mountpoint']
+			if mnt is not None:
+				if mnt in fs:
+					fs[mnt].add(lastdisk)
+				else:
+					fs[mnt] = set([lastdisk])
+	return fs
+
+def run_killable(cmd, stdout, killfuncs, kill_fn):
+	'''Run a killable program.  Returns program retcode or -1 if we can't start it.'''
+	try:
+		proc = subprocess.Popen(cmd, stdout = stdout)
+		real_kill_fn = lambda: kill_fn(proc)
+		killfuncs.add(real_kill_fn)
+		proc.wait()
+		try:
+			killfuncs.remove(real_kill_fn)
+		except KeyError:
+			pass
+		return proc.returncode
+	except:
+		return -1
+
+def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
+	'''Run a scrub process.'''
+	global retcode, terminate
+
+	print("Scrubbing %s..." % mnt)
+
+	try:
+		if terminate:
+			return
+
+		# Invoke xfs_scrub manually
+		cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt]
+		ret = run_killable(cmd, None, killfuncs, \
+				lambda proc: proc.terminate())
+		if ret >= 0:
+			print("Scrubbing %s done (err=%d)" % (mnt, ret))
+			retcode |= ret
+			return
+
+		if terminate:
+			return
+
+		print("Unable to start scrub tool.")
+	finally:
+		running_devs -= mntdevs
+		cond.acquire()
+		cond.notify()
+		cond.release()
+
+def main():
+	'''Find mounts, schedule scrub runs.'''
+	def thr(mnt, devs):
+		a = (mnt, cond, running_devs, devs, killfuncs)
+		thr = threading.Thread(target = run_scrub, args = a)
+		thr.start()
+	global retcode, terminate
+
+	fs = find_mounts()
+
+	# Schedule scrub jobs...
+	running_devs = set()
+	killfuncs = set()
+	cond = threading.Condition()
+	while len(fs) > 0:
+		if len(running_devs) == 0:
+			mnt, devs = fs.popitem()
+			running_devs.update(devs)
+			thr(mnt, devs)
+		poppers = set()
+		for mnt in fs:
+			devs = fs[mnt]
+			can_run = True
+			for dev in devs:
+				if dev in running_devs:
+					can_run = False
+					break
+			if can_run:
+				running_devs.update(devs)
+				poppers.add(mnt)
+				thr(mnt, devs)
+		for p in poppers:
+			fs.pop(p)
+		cond.acquire()
+		try:
+			cond.wait()
+		except KeyboardInterrupt:
+			terminate = True
+			print("Terminating...")
+			while len(killfuncs) > 0:
+				fn = killfuncs.pop()
+				fn()
+			fs = []
+		cond.release()
+
+	sys.exit(retcode)
+
+if __name__ == '__main__':
+	main()

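Reviewer's note, not part of the patch: the scheduling loop in main()
starts a scrub only when none of the mount's underlying disks is already
being scrubbed. A minimal sketch of that selection step follows. The
function name pick_runnable and the sorted iteration order are
assumptions made for illustration; the script itself walks the dict in
its native order and starts a thread per selected mount.

```python
def pick_runnable(fs, running_devs):
    """Select mounts whose disks are all idle.

    fs: dict mapping mountpoint -> set of disk names (as find_mounts() builds).
    running_devs: set of disks currently being scrubbed; updated in place.
    Returns the list of mountpoints selected in this pass and removes them
    from fs, mirroring the 'poppers' handling in main().
    """
    started = []
    # sorted() is an assumption here, used only to make the result deterministic.
    for mnt in sorted(fs):
        devs = fs[mnt]
        if devs.isdisjoint(running_devs):
            # Claim the disks so later mounts on the same disk must wait.
            running_devs |= devs
            started.append(mnt)
    for mnt in started:
        del fs[mnt]
    return started


if __name__ == '__main__':
    mounts = {'/': {'sda'}, '/home': {'sda'}, '/data': {'sdb'}}
    busy = set()
    print(pick_runnable(mounts, busy))
    # Emulate the scrub of '/' finishing, as run_scrub() does in its finally block.
    busy -= {'sda'}
    print(pick_runnable(mounts, busy))
```

Two filesystems that share a disk are therefore scrubbed one after the
other, while filesystems on separate disks can run in parallel.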


* [PATCH 9/9] xfs_scrub: integrate services with systemd
  2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2017-03-10 23:25 ` [PATCH 8/9] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
  To: sandeen, darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

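Create systemd units so that we can run the online scrubber under
systemd with (somewhat) appropriate containment: a per-mountpoint
xfs_scrub@.service, an xfs_scrub_all service with a weekly timer, and an
xfs_scrub_fail@ service that mails the unit log when a scrub fails.

When SERVICE_MODE is set in the environment, xfs_scrub adds 16 to any
nonzero exit code to avoid conflicting with service return codes, and
xfs_scrub_all removes that offset again before reporting the result.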

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 configure.ac                     |   15 +++++++++++++
 include/builddefs.in             |    3 +++
 scrub/Makefile                   |   21 +++++++++++++++++-
 scrub/scrub.c                    |   20 +++++++++++++++++
 scrub/xfs_scrub@.service.in      |   18 ++++++++++++++++
 scrub/xfs_scrub_all.in           |   44 ++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrub_all.service.in   |    8 +++++++
 scrub/xfs_scrub_all.timer        |   11 ++++++++++
 scrub/xfs_scrub_fail             |   26 ++++++++++++++++++++++
 scrub/xfs_scrub_fail@.service.in |   10 +++++++++
 10 files changed, 175 insertions(+), 1 deletion(-)
 create mode 100644 scrub/xfs_scrub@.service.in
 create mode 100644 scrub/xfs_scrub_all.service.in
 create mode 100644 scrub/xfs_scrub_all.timer
 create mode 100755 scrub/xfs_scrub_fail
 create mode 100644 scrub/xfs_scrub_fail@.service.in

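Reviewer's note, not part of the patch: the exit-code convention shared
by scrub.c and xfs_scrub_all can be sketched as below. The helper names
and the constant name are hypothetical; only the offset of 16 and the
accepted range (0, or 16 through 32) are taken from the diff.

```python
# Hypothetical names; the value 16 comes from scrub.c in this patch.
SERVICE_EXIT_OFFSET = 16


def encode_service_exit(ret, is_service):
    """Mirror the end of main() in scrub.c.

    A nonzero return code is raised by 16 when running as a service;
    zero and non-service runs are passed through unchanged.
    """
    if is_service and ret:
        return ret + SERVICE_EXIT_OFFSET
    return ret


def decode_service_exit(ret):
    """Mirror the check in run_scrub().

    Returns the scrubber's own code when the value is 0 or lies in
    16..32, and None when the script would fall back to running
    xfs_scrub directly.
    """
    if ret == 0:
        return 0
    if 16 <= ret <= 32:
        return ret - SERVICE_EXIT_OFFSET
    return None


if __name__ == '__main__':
    for code in (0, 1, 4):
        wire = encode_service_exit(code, True)
        print(code, wire, decode_service_exit(wire))
```

Values outside that range are treated as "could not run via systemd",
which is why run_scrub() then tries the direct invocation.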

diff --git a/configure.ac b/configure.ac
index ccd7460..e89aea0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -103,6 +103,21 @@ esac
 AC_SUBST([root_sbindir])
 AC_SUBST([root_libdir])
 
+# Where do systemd services go?
+pkg_systemdsystemunitdir="$(pkg-config --variable=systemdsystemunitdir systemd 2>/dev/null)"
+case "${pkg_systemdsystemunitdir}" in
+"")
+	systemdsystemunitdir=""
+	have_systemd=no
+	;;
+*)
+	systemdsystemunitdir="${pkg_systemdsystemunitdir}"
+	have_systemd=yes
+	;;
+esac
+AC_SUBST([have_systemd])
+AC_SUBST([systemdsystemunitdir])
+
 # Find localized files.  Don't descend into any "dot directories"
 # (like .git or .pc from quilt).  Strangely, the "-print" argument
 # to "find" is required, to avoid including such directories in the
diff --git a/include/builddefs.in b/include/builddefs.in
index 9d478d3..d99c402 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -123,6 +123,9 @@ HAVE_OPENAT = @have_openat@
 HAVE_SYNCFS = @have_syncfs@
 HAVE_FSTATAT = @have_fstatat@
 
+HAVE_SYSTEMD = @have_systemd@
+SYSTEMDSYSTEMUNITDIR = @systemdsystemunitdir@
+
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
 
diff --git a/scrub/Makefile b/scrub/Makefile
index 78e119f..d34e3d4 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -12,6 +12,12 @@ LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 XFS_SCRUB_ALL_PROG = xfs_scrub_all
 XFS_SCRUB_ARGS = -Tvn
+
+ifeq ($(HAVE_SYSTEMD),yes)
+INSTALL_SCRUB += install-systemd
+SYSTEMDSERVICES = xfs_scrub@.service xfs_scrub_all.service xfs_scrub_all.timer xfs_scrub_fail@.service
+endif
+
 endif	# scrub_prereqs
 
 HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h xfs_ioctl.h
@@ -38,7 +44,7 @@ ifeq ($(HAVE_SYNCFS),yes)
 LCFLAGS += -DHAVE_SYNCFS
 endif
 
-default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(SYSTEMDSERVICES)
 
 xfs_scrub_all: xfs_scrub_all.in
 	@echo "    [SED]    $@"
@@ -50,6 +56,19 @@ include $(BUILDRULES)
 
 install: $(INSTALL_SCRUB)
 
+%.service: %.service.in
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e "s|@sbindir@|$(PKG_ROOT_SBIN_DIR)|g" \
+		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" \
+		   -e "s|@pkg_lib_dir@|$(PKG_LIB_DIR)|g" \
+		   -e "s|@pkg_name@|$(PKG_NAME)|g" < $< > $@
+
+install-systemd: default
+	$(INSTALL) -m 755 -d $(SYSTEMDSYSTEMUNITDIR)
+	$(INSTALL) -m 644 $(SYSTEMDSERVICES) $(SYSTEMDSYSTEMUNITDIR)
+	$(INSTALL) -m 755 -d $(PKG_LIB_DIR)/$(PKG_NAME)
+	$(INSTALL) -m 755 xfs_scrub_fail $(PKG_LIB_DIR)/$(PKG_NAME)
+
 install-scrub: default
 	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
diff --git a/scrub/scrub.c b/scrub/scrub.c
index a363ac1..0b6a11d 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -44,6 +44,7 @@ bool				dumpcore;
 bool				display_rusage;
 long				page_size;
 int				nr_threads = -1;
+bool				is_service;
 enum errors_action		error_action = ERRORS_CONTINUE;
 static unsigned long		max_errors;
 
@@ -830,6 +831,9 @@ _("Only one of the options -n or -y may be specified.\n"));
 
 	ctx.mntpoint = argv[optind];
 
+	if (getenv("SERVICE_MODE"))
+		is_service = true;
+
 	/* Find the mount record for the passed-in argument. */
 
 	if (stat(argv[optind], &ctx.mnt_sb) < 0) {
@@ -957,5 +961,21 @@ _("%s: %lu warnings found.\n"),
 	free(ctx.mntpoint);
 	free(ctx.mnt_type);
 end:
+	/*
+	 * If we're running as a service, bump return code up by 16 to
+	 * avoid conflicting with service return codes.
+	 */
+	if (is_service) {
+		/*
+		 * journald queries /proc as part of taking in log
+		 * messages; it uses this information to associate the
+		 * message with systemd units, etc.  This races with
+		 * process exit, so delay that a couple of seconds so
+		 * that we capture the summary outputs in the job log.
+		 */
+		sleep(2);
+		if (ret)
+			ret += 16;
+	}
 	return ret;
 }
diff --git a/scrub/xfs_scrub@.service.in b/scrub/xfs_scrub@.service.in
new file mode 100644
index 0000000..6b6992d
--- /dev/null
+++ b/scrub/xfs_scrub@.service.in
@@ -0,0 +1,18 @@
+[Unit]
+Description=Online XFS Metadata Check for %I
+OnFailure=xfs_scrub_fail@%i.service
+
+[Service]
+Type=oneshot
+WorkingDirectory=%I
+PrivateNetwork=true
+ProtectSystem=full
+ProtectHome=read-only
+PrivateTmp=yes
+AmbientCapabilities=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
+NoNewPrivileges=yes
+User=nobody
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+Environment=SERVICE_MODE=1
+ExecStart=@sbindir@/xfs_scrub @scrub_args@ %I
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 2215720..81e0cc2 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -25,6 +25,7 @@ import json
 import threading
 import time
 import sys
+import os
 
 retcode = 0
 terminate = False
@@ -53,6 +54,13 @@ def find_mounts():
 					fs[mnt] = set([lastdisk])
 	return fs
 
+def kill_systemd(unit, proc):
+	'''Kill systemd unit.'''
+	proc.terminate()
+	cmd=['systemctl', 'stop', unit]
+	x = subprocess.Popen(cmd)
+	x.wait()
+
 def run_killable(cmd, stdout, killfuncs, kill_fn):
 	'''Run a killable program.  Returns program retcode or -1 if we can't start it.'''
 	try:
@@ -78,6 +86,20 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 		if terminate:
 			return
 
+		# Try it the systemd way
+		cmd=['systemctl', 'start', 'xfs_scrub@%s' % mnt]
+		ret = run_killable(cmd, subprocess.DEVNULL, killfuncs, \
+				lambda proc: kill_systemd('xfs_scrub@%s' % mnt, proc))
+		if ret == 0 or (ret >= 16 and ret <= 32):
+			if ret != 0:
+				ret -= 16
+			print("Scrubbing %s done, (err=%d)" % (mnt, ret))
+			retcode |= ret
+			return
+
+		if terminate:
+			return
+
 		# Invoke xfs_scrub manually
 		cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt]
 		ret = run_killable(cmd, None, killfuncs, \
@@ -107,6 +129,17 @@ def main():
 
 	fs = find_mounts()
 
+	# Tail the journal if we ourselves aren't a service...
+	journalthread = None
+	if 'SERVICE_MODE' not in os.environ:
+		try:
+			cmd=['journalctl', '--no-pager', '-q', '-S', 'now', \
+					'-f', '-u', 'xfs_scrub@*', '-o', \
+					'cat']
+			journalthread = subprocess.Popen(cmd)
+		except:
+			pass
+
 	# Schedule scrub jobs...
 	running_devs = set()
 	killfuncs = set()
@@ -142,6 +175,17 @@ def main():
 			fs = []
 		cond.release()
 
+	if journalthread is not None:
+		journalthread.terminate()
+
+	# journald queries /proc as part of taking in log
+	# messages; it uses this information to associate the
+	# message with systemd units, etc.  This races with
+	# process exit, so delay that a couple of seconds so
+	# that we capture the summary outputs in the job log.
+	if 'SERVICE_MODE' in os.environ:
+		time.sleep(2)
+
 	sys.exit(retcode)
 
 if __name__ == '__main__':
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
new file mode 100644
index 0000000..15b0af9
--- /dev/null
+++ b/scrub/xfs_scrub_all.service.in
@@ -0,0 +1,8 @@
+[Unit]
+Description=Online XFS Metadata Check for All Filesystems
+ConditionACPower=true
+
+[Service]
+Type=oneshot
+Environment=SERVICE_MODE=1
+ExecStart=@sbindir@/xfs_scrub_all
diff --git a/scrub/xfs_scrub_all.timer b/scrub/xfs_scrub_all.timer
new file mode 100644
index 0000000..efc13a6
--- /dev/null
+++ b/scrub/xfs_scrub_all.timer
@@ -0,0 +1,11 @@
+[Unit]
+Description=Periodic XFS Online Metadata Check for All Filesystems
+
+[Timer]
+# Run on Sunday at 2am
+OnCalendar=Sun *-*-* 02:00:00
+RandomizedDelaySec=60
+Persistent=true
+
+[Install]
+WantedBy=timers.target
diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
new file mode 100755
index 0000000..36dd50e
--- /dev/null
+++ b/scrub/xfs_scrub_fail
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+# Email logs of failed xfs_scrub unit runs
+
+mailer=/usr/sbin/sendmail
+recipient="$1"
+test -z "${recipient}" && exit 0
+mntpoint="$2"
+test -z "${mntpoint}" && exit 0
+hostname="$(hostname -f 2>/dev/null)"
+test -z "${hostname}" && hostname="${HOSTNAME}"
+if [ ! -x "${mailer}" ]; then
+	echo "${mailer}: Mailer program not found."
+	exit 1
+fi
+
+(cat << ENDL
+To: $1
+From: <xfs_scrub@${hostname}>
+Subject: xfs_scrub failure on ${mntpoint}
+
+So sorry, the automatic xfs_scrub of ${mntpoint} on ${hostname} failed.
+
+A log of what happened follows:
+ENDL
+systemctl status --full --lines 4294967295 "xfs_scrub@${mntpoint}") | "${mailer}" -t -i
diff --git a/scrub/xfs_scrub_fail@.service.in b/scrub/xfs_scrub_fail@.service.in
new file mode 100644
index 0000000..785f881
--- /dev/null
+++ b/scrub/xfs_scrub_fail@.service.in
@@ -0,0 +1,10 @@
+[Unit]
+Description=Online XFS Metadata Check Failure Reporting for %I
+
+[Service]
+Type=oneshot
+Environment=EMAIL_ADDR=root
+ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" %I
+User=mail
+Group=mail
+SupplementaryGroups=systemd-journal

