* [PATCH v2 00/39] xfsprogs: online scrub/repair support
@ 2016-11-05  0:24 Darrick J. Wong
  2016-11-05  0:24 ` [PATCH 01/39] xfs: plumb in needed functions for range querying of the freespace btrees Darrick J. Wong
                   ` (38 more replies)
  0 siblings, 39 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Hi all,

This is the second revision of a patchset that adds online metadata
scrubbing and repair support to the XFS userland tools.  There are no
on-disk format changes.

Online scrub/repair support consists of four major pieces -- first, an
ioctl that maps physical extents to their owners; second, various
in-kernel metadata scrubbing ioctls to examine metadata records and
cross-reference them with other filesystem metadata; third, an in-kernel
mechanism for rebuilding damaged metadata objects and btrees; and
fourth, a userspace component to initiate kernel scrubbing, walk all
inodes and the directory tree, scrub data extents, and ask the kernel to
repair anything that is broken.

This new utility, xfs_scrub, is separate from the existing offline
xfs_repair tool.  Scrub has three main modes of operation -- in its most
powerful mode, it iterates all XFS metadata and asks the kernel to check
the metadata and repair it if necessary.  The second most powerful mode
can use certain VFS methods and XFS ioctls (BULKSTAT, GETBMAP, and
GETFSMAP) to check as much metadata as it reasonably can from userspace.
It cannot repair anything.  The least powerful mode uses only VFS
functions to access as much of the directory/file/xattr graph as
possible.  It has no mechanism to check internal metadata and also
cannot repair anything.  This is good enough for scrubbing non-XFS
filesystems, but the first mode is the one intended for regular use.

Most of the patches in this series are direct imports of the libxfs
changes that the kernel scrubber needed to operate.  The changes to
userspace programs are limited to wiring up ioctl support in xfs_io,
support for per-field fuzzing in xfs_db, and the final patch, which
creates the xfs_scrub program.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
The kernel patches in the git trees should apply to 4.9-rc3, the
xfsprogs patches to for-next, and the xfstests patches to master.

The patches have survived all auto group xfstests in both scrub-only
mode and a special xfs_scrub debugging mode that forces it to rebuild
the metadata structures even if they're not damaged.  Note that I
haven't thoroughly run the new tests in [3] that try to fuzz every
field in every data structure on disk.

This is an extraordinary way to eat your data.  Enjoy! 
Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/djwong-devel
[2] https://github.com/djwong/xfsprogs/tree/djwong-devel
[3] https://github.com/djwong/xfstests/tree/djwong-devel


* [PATCH 01/39] xfs: plumb in needed functions for range querying of the freespace btrees
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
@ 2016-11-05  0:24 ` Darrick J. Wong
  2016-11-05  0:24 ` [PATCH 02/39] xfs: provide a query_range function for " Darrick J. Wong
                   ` (37 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Plumb in the pieces (init_high_key, diff_two_keys) necessary to call
query_range on the free space btrees.  Remove the debugging asserts
so that we can make queries starting from block 0.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc_btree.c |  156 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 117 insertions(+), 39 deletions(-)


diff --git a/libxfs/xfs_alloc_btree.c b/libxfs/xfs_alloc_btree.c
index ff4bae4..273ea5b 100644
--- a/libxfs/xfs_alloc_btree.c
+++ b/libxfs/xfs_alloc_btree.c
@@ -203,19 +203,28 @@ xfs_allocbt_init_key_from_rec(
 	union xfs_btree_key	*key,
 	union xfs_btree_rec	*rec)
 {
-	ASSERT(rec->alloc.ar_startblock != 0);
-
 	key->alloc.ar_startblock = rec->alloc.ar_startblock;
 	key->alloc.ar_blockcount = rec->alloc.ar_blockcount;
 }
 
 STATIC void
+xfs_bnobt_init_high_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	__u32			x;
+
+	x = be32_to_cpu(rec->alloc.ar_startblock);
+	x += be32_to_cpu(rec->alloc.ar_blockcount) - 1;
+	key->alloc.ar_startblock = cpu_to_be32(x);
+	key->alloc.ar_blockcount = 0;
+}
+
+STATIC void
 xfs_allocbt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
 {
-	ASSERT(cur->bc_rec.a.ar_startblock != 0);
-
 	rec->alloc.ar_startblock = cpu_to_be32(cur->bc_rec.a.ar_startblock);
 	rec->alloc.ar_blockcount = cpu_to_be32(cur->bc_rec.a.ar_blockcount);
 }
@@ -234,18 +243,24 @@ xfs_allocbt_init_ptr_from_cur(
 }
 
 STATIC __int64_t
-xfs_allocbt_key_diff(
+xfs_bnobt_key_diff(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_key	*key)
 {
 	xfs_alloc_rec_incore_t	*rec = &cur->bc_rec.a;
 	xfs_alloc_key_t		*kp = &key->alloc;
-	__int64_t		diff;
 
-	if (cur->bc_btnum == XFS_BTNUM_BNO) {
-		return (__int64_t)be32_to_cpu(kp->ar_startblock) -
-				rec->ar_startblock;
-	}
+	return (__int64_t)be32_to_cpu(kp->ar_startblock) - rec->ar_startblock;
+}
+
+STATIC __int64_t
+xfs_cntbt_key_diff(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*key)
+{
+	xfs_alloc_rec_incore_t	*rec = &cur->bc_rec.a;
+	xfs_alloc_key_t		*kp = &key->alloc;
+	__int64_t		diff;
 
 	diff = (__int64_t)be32_to_cpu(kp->ar_blockcount) - rec->ar_blockcount;
 	if (diff)
@@ -254,6 +269,33 @@ xfs_allocbt_key_diff(
 	return (__int64_t)be32_to_cpu(kp->ar_startblock) - rec->ar_startblock;
 }
 
+STATIC __int64_t
+xfs_bnobt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be32_to_cpu(k1->alloc.ar_startblock) -
+			  be32_to_cpu(k2->alloc.ar_startblock);
+}
+
+STATIC __int64_t
+xfs_cntbt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	__int64_t		diff;
+
+	diff = (__int64_t)be32_to_cpu(k1->alloc.ar_blockcount) -
+		be32_to_cpu(k2->alloc.ar_blockcount);
+	if (diff)
+		return diff;
+
+	return (__int64_t)be32_to_cpu(k1->alloc.ar_startblock) -
+		be32_to_cpu(k2->alloc.ar_startblock);
+}
+
 static bool
 xfs_allocbt_verify(
 	struct xfs_buf		*bp)
@@ -344,44 +386,78 @@ const struct xfs_buf_ops xfs_allocbt_buf_ops = {
 
 #if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
-xfs_allocbt_keys_inorder(
+xfs_bnobt_keys_inorder(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_key	*k1,
 	union xfs_btree_key	*k2)
 {
-	if (cur->bc_btnum == XFS_BTNUM_BNO) {
-		return be32_to_cpu(k1->alloc.ar_startblock) <
-		       be32_to_cpu(k2->alloc.ar_startblock);
-	} else {
-		return be32_to_cpu(k1->alloc.ar_blockcount) <
-			be32_to_cpu(k2->alloc.ar_blockcount) ||
-			(k1->alloc.ar_blockcount == k2->alloc.ar_blockcount &&
-			 be32_to_cpu(k1->alloc.ar_startblock) <
-			 be32_to_cpu(k2->alloc.ar_startblock));
-	}
+	return be32_to_cpu(k1->alloc.ar_startblock) <
+	       be32_to_cpu(k2->alloc.ar_startblock);
 }
 
 STATIC int
-xfs_allocbt_recs_inorder(
+xfs_bnobt_recs_inorder(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*r1,
 	union xfs_btree_rec	*r2)
 {
-	if (cur->bc_btnum == XFS_BTNUM_BNO) {
-		return be32_to_cpu(r1->alloc.ar_startblock) +
-			be32_to_cpu(r1->alloc.ar_blockcount) <=
-			be32_to_cpu(r2->alloc.ar_startblock);
-	} else {
-		return be32_to_cpu(r1->alloc.ar_blockcount) <
-			be32_to_cpu(r2->alloc.ar_blockcount) ||
-			(r1->alloc.ar_blockcount == r2->alloc.ar_blockcount &&
-			 be32_to_cpu(r1->alloc.ar_startblock) <
-			 be32_to_cpu(r2->alloc.ar_startblock));
-	}
+	return be32_to_cpu(r1->alloc.ar_startblock) +
+		be32_to_cpu(r1->alloc.ar_blockcount) <=
+		be32_to_cpu(r2->alloc.ar_startblock);
+}
+
+STATIC int
+xfs_cntbt_keys_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return be32_to_cpu(k1->alloc.ar_blockcount) <
+		be32_to_cpu(k2->alloc.ar_blockcount) ||
+		(k1->alloc.ar_blockcount == k2->alloc.ar_blockcount &&
+		 be32_to_cpu(k1->alloc.ar_startblock) <
+		 be32_to_cpu(k2->alloc.ar_startblock));
+}
+
+STATIC int
+xfs_cntbt_recs_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*r1,
+	union xfs_btree_rec	*r2)
+{
+	return be32_to_cpu(r1->alloc.ar_blockcount) <
+		be32_to_cpu(r2->alloc.ar_blockcount) ||
+		(r1->alloc.ar_blockcount == r2->alloc.ar_blockcount &&
+		 be32_to_cpu(r1->alloc.ar_startblock) <
+		 be32_to_cpu(r2->alloc.ar_startblock));
 }
-#endif	/* DEBUG */
+#endif /* DEBUG */
+
+static const struct xfs_btree_ops xfs_bnobt_ops = {
+	.rec_len		= sizeof(xfs_alloc_rec_t),
+	.key_len		= sizeof(xfs_alloc_key_t),
+
+	.dup_cursor		= xfs_allocbt_dup_cursor,
+	.set_root		= xfs_allocbt_set_root,
+	.alloc_block		= xfs_allocbt_alloc_block,
+	.free_block		= xfs_allocbt_free_block,
+	.update_lastrec		= xfs_allocbt_update_lastrec,
+	.get_minrecs		= xfs_allocbt_get_minrecs,
+	.get_maxrecs		= xfs_allocbt_get_maxrecs,
+	.init_key_from_rec	= xfs_allocbt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_bnobt_init_high_key_from_rec,
+	.init_rec_from_cur	= xfs_allocbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_allocbt_init_ptr_from_cur,
+	.key_diff		= xfs_bnobt_key_diff,
+	.buf_ops		= &xfs_allocbt_buf_ops,
+	.diff_two_keys		= xfs_bnobt_diff_two_keys,
+#if defined(DEBUG) || defined(XFS_WARN)
+	.keys_inorder		= xfs_bnobt_keys_inorder,
+	.recs_inorder		= xfs_bnobt_recs_inorder,
+#endif
+};
 
-static const struct xfs_btree_ops xfs_allocbt_ops = {
+static const struct xfs_btree_ops xfs_cntbt_ops = {
 	.rec_len		= sizeof(xfs_alloc_rec_t),
 	.key_len		= sizeof(xfs_alloc_key_t),
 
@@ -395,11 +471,12 @@ static const struct xfs_btree_ops xfs_allocbt_ops = {
 	.init_key_from_rec	= xfs_allocbt_init_key_from_rec,
 	.init_rec_from_cur	= xfs_allocbt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_allocbt_init_ptr_from_cur,
-	.key_diff		= xfs_allocbt_key_diff,
+	.key_diff		= xfs_cntbt_key_diff,
 	.buf_ops		= &xfs_allocbt_buf_ops,
+	.diff_two_keys		= xfs_cntbt_diff_two_keys,
 #if defined(DEBUG) || defined(XFS_WARN)
-	.keys_inorder		= xfs_allocbt_keys_inorder,
-	.recs_inorder		= xfs_allocbt_recs_inorder,
+	.keys_inorder		= xfs_cntbt_keys_inorder,
+	.recs_inorder		= xfs_cntbt_recs_inorder,
 #endif
 };
 
@@ -425,12 +502,13 @@ xfs_allocbt_init_cursor(
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
-	cur->bc_ops = &xfs_allocbt_ops;
 
 	if (btnum == XFS_BTNUM_CNT) {
+		cur->bc_ops = &xfs_cntbt_ops;
 		cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
 		cur->bc_flags = XFS_BTREE_LASTREC_UPDATE;
 	} else {
+		cur->bc_ops = &xfs_bnobt_ops;
 		cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
 	}
 


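The high-key and two-key comparison helpers in the patch above can be
illustrated in isolation.  The following is only a sketch: the
demo_* names are hypothetical, and the records are kept in host byte
order so that the be32 conversions drop out.  It shows that the high
key of a bnobt record is the last block the extent covers, and that
widening to 64 bits before subtracting is what lets the comparison
return a negative value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-endian stand-in for a free space record or key. */
struct demo_alloc_rec {
	uint32_t	startblock;	/* first free block of the extent */
	uint32_t	blockcount;	/* length of the extent in blocks */
};

/* High key of a bnobt record: the last block the extent covers. */
static uint32_t
demo_bnobt_high_key(
	const struct demo_alloc_rec	*rec)
{
	return rec->startblock + rec->blockcount - 1;
}

/* Order two bnobt keys by start block: <0, 0, or >0. */
static int64_t
demo_bnobt_diff_two_keys(
	const struct demo_alloc_rec	*k1,
	const struct demo_alloc_rec	*k2)
{
	/* Widen first; a 32-bit unsigned subtraction can never go negative. */
	return (int64_t)k1->startblock - k2->startblock;
}

/* Order two cntbt keys by length first, then by start block. */
static int64_t
demo_cntbt_diff_two_keys(
	const struct demo_alloc_rec	*k1,
	const struct demo_alloc_rec	*k2)
{
	int64_t				diff;

	diff = (int64_t)k1->blockcount - k2->blockcount;
	if (diff)
		return diff;

	return (int64_t)k1->startblock - k2->startblock;
}
```

With an extent of 8 blocks starting at block 100, the high key is
block 107.  An extent starting at block 0 is handled the same way,
which is why the start-block asserts had to go.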

* [PATCH 02/39] xfs: provide a query_range function for freespace btrees
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
  2016-11-05  0:24 ` [PATCH 01/39] xfs: plumb in needed functions for range querying of the freespace btrees Darrick J. Wong
@ 2016-11-05  0:24 ` Darrick J. Wong
  2016-11-05  0:24 ` [PATCH 03/39] xfs: create a function to query all records in a btree Darrick J. Wong
                   ` (36 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Implement a query_range function for the bnobt and cntbt.  This will
be used for getfsmap fallback if there is no rmapbt and by the online
scrub and repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_alloc.h |   10 ++++++++++
 2 files changed, 52 insertions(+)


diff --git a/libxfs/xfs_alloc.c b/libxfs/xfs_alloc.c
index 1ca3268..382a507 100644
--- a/libxfs/xfs_alloc.c
+++ b/libxfs/xfs_alloc.c
@@ -2927,3 +2927,45 @@ xfs_free_extent(
 	xfs_trans_brelse(tp, agbp);
 	return error;
 }
+
+struct xfs_alloc_query_range_info {
+	xfs_alloc_query_range_fn	fn;
+	void				*priv;
+};
+
+/* Format btree record and pass to our callback. */
+STATIC int
+xfs_alloc_query_range_helper(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct xfs_alloc_query_range_info	*query = priv;
+	struct xfs_alloc_rec_incore		irec;
+
+	irec.ar_startblock = be32_to_cpu(rec->alloc.ar_startblock);
+	irec.ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount);
+	return query->fn(cur, &irec, query->priv);
+}
+
+/* Find all free space within a given range of blocks. */
+int
+xfs_alloc_query_range(
+	struct xfs_btree_cur			*cur,
+	struct xfs_alloc_rec_incore		*low_rec,
+	struct xfs_alloc_rec_incore		*high_rec,
+	xfs_alloc_query_range_fn		fn,
+	void					*priv)
+{
+	union xfs_btree_irec			low_brec;
+	union xfs_btree_irec			high_brec;
+	struct xfs_alloc_query_range_info	query;
+
+	ASSERT(cur->bc_btnum == XFS_BTNUM_BNO);
+	low_brec.a = *low_rec;
+	high_brec.a = *high_rec;
+	query.priv = priv;
+	query.fn = fn;
+	return xfs_btree_query_range(cur, &low_brec, &high_brec,
+			xfs_alloc_query_range_helper, &query);
+}
diff --git a/libxfs/xfs_alloc.h b/libxfs/xfs_alloc.h
index 7c404a6..f9f8b81 100644
--- a/libxfs/xfs_alloc.h
+++ b/libxfs/xfs_alloc.h
@@ -223,4 +223,14 @@ int xfs_free_extent_fix_freelist(struct xfs_trans *tp, xfs_agnumber_t agno,
 
 xfs_extlen_t xfs_prealloc_blocks(struct xfs_mount *mp);
 
+typedef int (*xfs_alloc_query_range_fn)(
+	struct xfs_btree_cur		*cur,
+	struct xfs_alloc_rec_incore	*rec,
+	void				*priv);
+
+int xfs_alloc_query_range(struct xfs_btree_cur *cur,
+		struct xfs_alloc_rec_incore *low_rec,
+		struct xfs_alloc_rec_incore *high_rec,
+		xfs_alloc_query_range_fn fn, void *priv);
+
 #endif	/* __XFS_ALLOC_H__ */


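The callback pattern that xfs_alloc_query_range uses can be sketched
without a btree.  The example below is an illustration only: it walks
a sorted array instead of the bnobt, all demo_* names are hypothetical,
and the overlap and early-exit rules are my assumptions about how a
range query over non-overlapping records would behave.  A nonzero
return from the callback stops the walk and is passed back to the
caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-core free space record. */
struct demo_free_rec {
	uint32_t	startblock;
	uint32_t	blockcount;
};

/* Callback type modelled on xfs_alloc_query_range_fn. */
typedef int (*demo_query_fn)(const struct demo_free_rec *rec, void *priv);

/*
 * Call fn for every record overlapping [low, high].  recs[] must be
 * sorted by startblock and must not overlap, like bnobt records.
 */
static int
demo_query_range(
	const struct demo_free_rec	*recs,
	size_t				nr,
	uint32_t			low,
	uint32_t			high,
	demo_query_fn			fn,
	void				*priv)
{
	size_t				i;
	int				error;

	if (low > high)
		return -EINVAL;

	for (i = 0; i < nr; i++) {
		/* The high key of a record is its last block. */
		uint32_t last = recs[i].startblock + recs[i].blockcount - 1;

		if (last < low)
			continue;	/* entirely below the range */
		if (recs[i].startblock > high)
			break;		/* entirely above the range */
		error = fn(&recs[i], priv);
		if (error)
			return error;	/* callback asked us to stop */
	}
	return 0;
}

/* Example callback: total up the free blocks that were visited. */
static int
demo_sum_blocks(
	const struct demo_free_rec	*rec,
	void				*priv)
{
	uint64_t			*total = priv;

	*total += rec->blockcount;
	return 0;
}
```

The priv pointer carries caller state through the walk, which is the
role struct xfs_alloc_query_range_info plays in the patch.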

* [PATCH 03/39] xfs: create a function to query all records in a btree
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
  2016-11-05  0:24 ` [PATCH 01/39] xfs: plumb in needed functions for range querying of the freespace btrees Darrick J. Wong
  2016-11-05  0:24 ` [PATCH 02/39] xfs: provide a query_range function for " Darrick J. Wong
@ 2016-11-05  0:24 ` Darrick J. Wong
  2016-11-05  0:24 ` [PATCH 04/39] xfs: introduce the XFS_IOC_GETFSMAP ioctl Darrick J. Wong
                   ` (35 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Create a helper function that will query all records in a btree.
This will be used by the online repair functions to examine every
record in a btree to rebuild a second btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc.c |   15 +++++++++++++++
 libxfs/xfs_alloc.h |    2 ++
 libxfs/xfs_btree.c |   14 ++++++++++++++
 libxfs/xfs_btree.h |    2 ++
 libxfs/xfs_rmap.c  |   28 +++++++++++++++++++++-------
 libxfs/xfs_rmap.h  |    2 ++
 6 files changed, 56 insertions(+), 7 deletions(-)


diff --git a/libxfs/xfs_alloc.c b/libxfs/xfs_alloc.c
index 382a507..3bfca12 100644
--- a/libxfs/xfs_alloc.c
+++ b/libxfs/xfs_alloc.c
@@ -2969,3 +2969,18 @@ xfs_alloc_query_range(
 	return xfs_btree_query_range(cur, &low_brec, &high_brec,
 			xfs_alloc_query_range_helper, &query);
 }
+
+/* Find all free space records. */
+int
+xfs_alloc_query_all(
+	struct xfs_btree_cur			*cur,
+	xfs_alloc_query_range_fn		fn,
+	void					*priv)
+{
+	struct xfs_alloc_query_range_info	query;
+
+	ASSERT(cur->bc_btnum == XFS_BTNUM_BNO);
+	query.priv = priv;
+	query.fn = fn;
+	return xfs_btree_query_all(cur, xfs_alloc_query_range_helper, &query);
+}
diff --git a/libxfs/xfs_alloc.h b/libxfs/xfs_alloc.h
index f9f8b81..0dc34bf 100644
--- a/libxfs/xfs_alloc.h
+++ b/libxfs/xfs_alloc.h
@@ -232,5 +232,7 @@ int xfs_alloc_query_range(struct xfs_btree_cur *cur,
 		struct xfs_alloc_rec_incore *low_rec,
 		struct xfs_alloc_rec_incore *high_rec,
 		xfs_alloc_query_range_fn fn, void *priv);
+int xfs_alloc_query_all(struct xfs_btree_cur *cur, xfs_alloc_query_range_fn fn,
+		void *priv);
 
 #endif	/* __XFS_ALLOC_H__ */
diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index 3dea6bd..ee2e489 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -4802,6 +4802,20 @@ xfs_btree_query_range(
 			fn, priv);
 }
 
+/* Query a btree for all records. */
+int
+xfs_btree_query_all(
+	struct xfs_btree_cur		*cur,
+	xfs_btree_query_range_fn	fn,
+	void				*priv)
+{
+	union xfs_btree_irec		low_rec = {0};
+	union xfs_btree_irec		high_rec;
+
+	memset(&high_rec, 0xFF, sizeof(high_rec));
+	return xfs_btree_query_range(cur, &low_rec, &high_rec, fn, priv);
+}
+
 /*
  * Calculate the number of blocks needed to store a given number of records
  * in a short-format (per-AG metadata) btree.
diff --git a/libxfs/xfs_btree.h b/libxfs/xfs_btree.h
index eb20376..6ef5373 100644
--- a/libxfs/xfs_btree.h
+++ b/libxfs/xfs_btree.h
@@ -529,6 +529,8 @@ typedef int (*xfs_btree_query_range_fn)(struct xfs_btree_cur *cur,
 int xfs_btree_query_range(struct xfs_btree_cur *cur,
 		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
 		xfs_btree_query_range_fn fn, void *priv);
+int xfs_btree_query_all(struct xfs_btree_cur *cur, xfs_btree_query_range_fn fn,
+		void *priv);
 
 typedef int (*xfs_btree_visit_blocks_fn)(struct xfs_btree_cur *cur, int level,
 		void *data);
diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index 7a75e26..7738f50 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -1999,14 +1999,14 @@ xfs_rmap_query_range_helper(
 /* Find all rmaps between two keys. */
 int
 xfs_rmap_query_range(
-	struct xfs_btree_cur		*cur,
-	struct xfs_rmap_irec		*low_rec,
-	struct xfs_rmap_irec		*high_rec,
-	xfs_rmap_query_range_fn	fn,
-	void				*priv)
+	struct xfs_btree_cur			*cur,
+	struct xfs_rmap_irec			*low_rec,
+	struct xfs_rmap_irec			*high_rec,
+	xfs_rmap_query_range_fn			fn,
+	void					*priv)
 {
-	union xfs_btree_irec		low_brec;
-	union xfs_btree_irec		high_brec;
+	union xfs_btree_irec			low_brec;
+	union xfs_btree_irec			high_brec;
 	struct xfs_rmap_query_range_info	query;
 
 	low_brec.r = *low_rec;
@@ -2017,6 +2017,20 @@ xfs_rmap_query_range(
 			xfs_rmap_query_range_helper, &query);
 }
 
+/* Find all rmaps. */
+int
+xfs_rmap_query_all(
+	struct xfs_btree_cur			*cur,
+	xfs_rmap_query_range_fn			fn,
+	void					*priv)
+{
+	struct xfs_rmap_query_range_info	query;
+
+	query.priv = priv;
+	query.fn = fn;
+	return xfs_btree_query_all(cur, xfs_rmap_query_range_helper, &query);
+}
+
 /* Clean up after calling xfs_rmap_finish_one. */
 void
 xfs_rmap_finish_one_cleanup(
diff --git a/libxfs/xfs_rmap.h b/libxfs/xfs_rmap.h
index 7899305..faf2c1a 100644
--- a/libxfs/xfs_rmap.h
+++ b/libxfs/xfs_rmap.h
@@ -162,6 +162,8 @@ typedef int (*xfs_rmap_query_range_fn)(
 int xfs_rmap_query_range(struct xfs_btree_cur *cur,
 		struct xfs_rmap_irec *low_rec, struct xfs_rmap_irec *high_rec,
 		xfs_rmap_query_range_fn fn, void *priv);
+int xfs_rmap_query_all(struct xfs_btree_cur *cur, xfs_rmap_query_range_fn fn,
+		void *priv);
 
 enum xfs_rmap_intent_type {
 	XFS_RMAP_MAP,


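xfs_btree_query_all reduces "every record" to a range query whose low
bound is all zero bytes and whose high bound is all 0xFF bytes.  The
sketch below shows why that brackets every possible key; the demo_*
names are hypothetical and the struct is only a stand-in for union
xfs_btree_irec.  It relies on the key fields being unsigned integers,
so all-ones bytes are the largest representable value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-core key; the patch uses union xfs_btree_irec. */
struct demo_irec {
	uint32_t	startblock;
	uint32_t	blockcount;
};

/* Fill in bounds that bracket every possible record. */
static void
demo_query_all_bounds(
	struct demo_irec	*low,
	struct demo_irec	*high)
{
	memset(low, 0, sizeof(*low));		/* smallest possible key */
	memset(high, 0xFF, sizeof(*high));	/* largest possible key */
}

/* Does a record starting at startblock sort within [low, high]? */
static int
demo_in_bounds(
	const struct demo_irec	*low,
	const struct demo_irec	*high,
	uint32_t		startblock)
{
	return startblock >= low->startblock &&
	       startblock <= high->startblock;
}
```

Because the bounds do not depend on the record format, the same
helper can serve the free space and rmap wrappers alike.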

* [PATCH 04/39] xfs: introduce the XFS_IOC_GETFSMAP ioctl
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2016-11-05  0:24 ` [PATCH 03/39] xfs: create a function to query all records in a btree Darrick J. Wong
@ 2016-11-05  0:24 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 05/39] xfs_io: support the new getfsmap ioctl Darrick J. Wong
                   ` (34 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.

v2: shorten the device field to u32 since that's all we need for
dev_t.  Support reporting reverse mapping information for all the
devices that XFS supports (data, log).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h |   95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index df58c1c..6857355 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -117,6 +117,100 @@ struct getbmapx {
 #define BMV_OF_SHARED		0x8	/* segment shared with another file */
 
 /*
+ *	Structure for XFS_IOC_GETFSMAP.
+ *
+ *	The memory layout for this call is the scalar values defined in
+ *	struct fsmap_head, followed by two struct fsmap that describe
+ *	the lower and upper bound of mappings to return, followed by an
+ *	array of struct fsmap mappings.
+ *
+ *	fmh_iflags control the output of the call, whereas fmh_oflags report
+ *	on the overall record output.  fmh_count should be set to the
+ *	length of the fmh_recs array, and fmh_entries will be set to the
+ *	number of entries filled out during each call.  If fmh_count is
+ *	zero, the number of reverse mappings will be returned in
+ *	fmh_entries, though no mappings will be returned.  fmh_reserved
+ *	must be set to zero.
+ *
+ *	The two elements in the fmh_keys array are used to constrain the
+ *	output.  The first element in the array should represent the
+ *	lowest disk mapping ("low key") that the user wants to learn
+ *	about.  If this value is all zeroes, the filesystem will return
+ *	the first entry it knows about.  For a subsequent call, the
+ *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
+ *	copied into fmh_keys[0] to have the kernel start where it left off.
+ *
+ *	The second element in the fmh_keys array should represent the
+ *	highest disk mapping ("high key") that the user wants to learn
+ *	about.  If this value is all ones, the filesystem will not stop
+ *	until it runs out of mappings to return or runs out of space in
+ *	fmh_recs.
+ *
+ *	fmr_device can be either a 32-bit cookie representing a device, or
+ *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
+ *	fmr_offset, and fmr_length are expressed in units of bytes.
+ *	fmr_owner is either an inode number, or a special value if
+ *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
+ */
+#ifndef HAVE_GETFSMAP
+struct fsmap {
+	__u32		fmr_device;	/* device id */
+	__u32		fmr_flags;	/* mapping flags */
+	__u64		fmr_physical;	/* device offset of segment */
+	__u64		fmr_owner;	/* owner id */
+	__u64		fmr_offset;	/* file offset of segment */
+	__u64		fmr_length;	/* length of segment */
+	__u64		fmr_reserved[3];	/* must be zero */
+};
+
+struct fsmap_head {
+	__u32		fmh_iflags;	/* control flags */
+	__u32		fmh_oflags;	/* output flags */
+	__u32		fmh_count;	/* # of entries in array incl. input */
+	__u32		fmh_entries;	/* # of entries filled in (output). */
+	__u64		fmh_reserved[6];	/* must be zero */
+
+	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
+	struct fsmap	fmh_recs[];	/* returned records */
+};
+
+/* Size of an fsmap_head with room for nr records. */
+static inline size_t
+fsmap_sizeof(
+	unsigned int	nr)
+{
+	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
+}
+#endif
+
+/*	fmh_iflags values - set by XFS_IOC_GETFSMAP caller in the header. */
+/* no flags defined yet */
+#define FMH_IF_VALID		0
+
+/*	fmh_oflags values - returned in the header segment only. */
+#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
+
+/*	fmr_flags values - returned for each non-header segment */
+#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
+#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
+#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
+#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
+#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
+#define FMR_OF_LAST		0x20	/* segment is the last in the FS */
+
+/*	fmr_owner special values */
+#define FMR_OWN_FREE		(-1ULL)	/* free space */
+#define FMR_OWN_UNKNOWN		(-2ULL)	/* unknown owner */
+#define FMR_OWN_FS		(-3ULL)	/* static fs metadata */
+#define FMR_OWN_LOG		(-4ULL)	/* journalling log */
+#define FMR_OWN_AG		(-5ULL)	/* per-AG metadata */
+#define FMR_OWN_INOBT		(-6ULL)	/* inode btree blocks */
+#define FMR_OWN_INODES		(-7ULL)	/* inodes */
+#define FMR_OWN_REFC		(-8ULL) /* refcount tree */
+#define FMR_OWN_COW		(-9ULL) /* cow staging */
+#define FMR_OWN_DEFECTIVE	(-10ULL) /* bad blocks */
+
+/*
  * Structure for XFS_IOC_FSSETDM.
  * For use by backup and restore programs to set the XFS on-disk inode
  * fields di_dmevmask and di_dmstate.  These must be set to exactly and
@@ -523,6 +617,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_GETBMAPX	_IOWR('X', 56, struct getbmap)
 #define XFS_IOC_ZERO_RANGE	_IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS	_IOR ('X', 58, struct xfs_fs_eofblocks)
+#define XFS_IOC_GETFSMAP	_IOWR('X', 59, struct fsmap_head)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s


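A caller of the new ioctl has to size the buffer, fill in the two
keys, and advance the low key between calls.  The sketch below shows
only that bookkeeping and never issues the ioctl.  The demo_* structs
are local copies of the layouts in the patch with <stdint.h> types,
and the helper names are hypothetical.  Two things are assumptions on
my part: that "all ones" for the high key means the device, flags,
physical, owner and offset fields, with the length and the
must-be-zero reserved fields left at zero; and that the record to
copy into fmh_keys[0] is the last one filled in, which is
fmh_recs[fmh_count - 1] whenever the array came back full.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copy of struct fsmap from the patch. */
struct demo_fsmap {
	uint32_t	fmr_device;	/* device id */
	uint32_t	fmr_flags;	/* mapping flags */
	uint64_t	fmr_physical;	/* device offset of segment */
	uint64_t	fmr_owner;	/* owner id */
	uint64_t	fmr_offset;	/* file offset of segment */
	uint64_t	fmr_length;	/* length of segment */
	uint64_t	fmr_reserved[3];	/* must be zero */
};

/* Local copy of struct fsmap_head from the patch. */
struct demo_fsmap_head {
	uint32_t	fmh_iflags;	/* control flags */
	uint32_t	fmh_oflags;	/* output flags */
	uint32_t	fmh_count;	/* # of entries in fmh_recs */
	uint32_t	fmh_entries;	/* # of entries filled in */
	uint64_t	fmh_reserved[6];	/* must be zero */
	struct demo_fsmap fmh_keys[2];	/* low and high keys */
	struct demo_fsmap fmh_recs[];	/* returned records */
};

/* Size of a header with room for nr records, as in fsmap_sizeof(). */
static size_t
demo_fsmap_sizeof(
	unsigned int		nr)
{
	return sizeof(struct demo_fsmap_head) + nr * sizeof(struct demo_fsmap);
}

/* Allocate a zeroed request for up to nr mappings of the whole fs. */
static struct demo_fsmap_head *
demo_fsmap_alloc(
	unsigned int		nr)
{
	struct demo_fsmap_head	*head = calloc(1, demo_fsmap_sizeof(nr));
	struct demo_fsmap	*high;

	if (!head)
		return NULL;
	head->fmh_count = nr;

	/* Low key stays all zeroes: start at the first mapping. */
	high = &head->fmh_keys[1];
	high->fmr_device = UINT32_MAX;
	high->fmr_flags = UINT32_MAX;
	high->fmr_physical = UINT64_MAX;
	high->fmr_owner = UINT64_MAX;
	high->fmr_offset = UINT64_MAX;
	return head;
}

/* Set up the next call; returns 0 if the last call returned nothing. */
static int
demo_fsmap_advance(
	struct demo_fsmap_head	*head)
{
	if (head->fmh_entries == 0)
		return 0;
	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
	return 1;
}
```

In a real caller the loop would be: allocate, issue
ioctl(fd, XFS_IOC_GETFSMAP, head), process fmh_entries records, then
advance and repeat until nothing more comes back or a record carries
FMR_OF_LAST.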

* [PATCH 05/39] xfs_io: support the new getfsmap ioctl
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2016-11-05  0:24 ` [PATCH 04/39] xfs: introduce the XFS_IOC_GETFSMAP ioctl Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 06/39] xfs: use GPF_NOFS when allocating btree cursors Darrick J. Wong
                   ` (33 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 io/Makefile       |    2 
 io/fsmap.c        |  518 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 io/init.c         |    1 
 io/io.h           |    1 
 man/man8/xfs_io.8 |   47 +++++
 5 files changed, 568 insertions(+), 1 deletion(-)
 create mode 100644 io/fsmap.c


diff --git a/io/Makefile b/io/Makefile
index 1072e74..d65bafc 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -11,7 +11,7 @@ HFILES = init.h io.h
 CFILES = init.c \
 	attr.c bmap.c file.c freeze.c fsync.c getrusage.c imap.c link.c \
 	mmap.c open.c parent.c pread.c prealloc.c pwrite.c seek.c shutdown.c \
-	sync.c truncate.c reflink.c cowextsize.c
+	sync.c truncate.c reflink.c cowextsize.c fsmap.c
 
 LLDLIBS = $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
 LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE)
diff --git a/io/fsmap.c b/io/fsmap.c
new file mode 100644
index 0000000..bd6ec65
--- /dev/null
+++ b/io/fsmap.c
@@ -0,0 +1,518 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "platform_defs.h"
+#include "command.h"
+#include "init.h"
+#include "io.h"
+#include "input.h"
+#include "path.h"
+
+static cmdinfo_t	fsmap_cmd;
+static dev_t		xfs_data_dev;
+
+static void
+fsmap_help(void)
+{
+	printf(_(
+"\n"
+" prints the block mapping for an XFS filesystem"
+"\n"
+" Example:\n"
+" 'fsmap -vn' - tabular format verbose map, including unwritten extents\n"
+"\n"
+" fsmap prints the map of disk blocks used by the whole filesystem.\n"
+" The map lists each extent used by the file, as well as regions in the\n"
+" filesystem that do not have any corresponding blocks (free space).\n"
+" By default, each line of the listing takes the following form:\n"
+"     extent: [startoffset..endoffset] owner startblock..endblock\n"
+" All the file offsets and disk blocks are in units of 512-byte blocks.\n"
+" -n -- query n extents.\n"
+" -v -- Verbose information, specify ag info.  Show flags legend on 2nd -v\n"
+"\n"));
+}
+
+static int
+numlen(
+	off64_t	val)
+{
+	off64_t	tmp;
+	int	len;
+
+	for (len = 0, tmp = val; tmp > 0; tmp = tmp/10)
+		len++;
+	return (len == 0 ? 1 : len);
+}
+
+static const char *
+special_owner(
+	__int64_t	owner)
+{
+	switch (owner) {
+	case FMR_OWN_FREE:
+		return _("free space");
+	case FMR_OWN_UNKNOWN:
+		return _("unknown");
+	case FMR_OWN_FS:
+		return _("static fs metadata");
+	case FMR_OWN_LOG:
+		return _("journalling log");
+	case FMR_OWN_AG:
+		return _("per-AG metadata");
+	case FMR_OWN_INOBT:
+		return _("inode btree");
+	case FMR_OWN_INODES:
+		return _("inodes");
+	case FMR_OWN_REFC:
+		return _("refcount btree");
+	case FMR_OWN_COW:
+		return _("cow reservation");
+	case FMR_OWN_DEFECTIVE:
+		return _("defective");
+	default:
+		return _("unknown");
+	}
+}
+
+static void
+dump_map(
+	unsigned long long	*nr,
+	struct fsmap_head	*head)
+{
+	unsigned long long	i;
+	struct fsmap		*p;
+
+	for (i = 0, p = head->fmh_recs; i < head->fmh_entries; i++, p++) {
+		printf("\t%llu: %u:%u [%lld..%lld]: ", i + (*nr),
+			major(p->fmr_device), minor(p->fmr_device),
+			(long long)BTOBBT(p->fmr_physical),
+			(long long)BTOBBT(p->fmr_physical + p->fmr_length - 1));
+		if (p->fmr_flags & FMR_OF_SPECIAL_OWNER)
+			printf("%s", special_owner(p->fmr_owner));
+		else if (p->fmr_flags & FMR_OF_EXTENT_MAP)
+			printf(_("inode %lld extent map"),
+				(long long) p->fmr_owner);
+		else
+			printf(_("inode %lld %lld..%lld"),
+				(long long)p->fmr_owner,
+				(long long)BTOBBT(p->fmr_offset),
+				(long long)BTOBBT(p->fmr_offset + p->fmr_length - 1));
+		printf(_(" %lld blocks\n"),
+			(long long)BTOBBT(p->fmr_length));
+	}
+
+	(*nr) += head->fmh_entries;
+}
+
+/*
+ * Verbose mode displays:
+ *   extent: major:minor [startblock..endblock]: startoffset..endoffset \
+ *	ag# (agoffset..agendoffset) totalbbs flags
+ */
+#define MINRANGE_WIDTH	16
+#define MINAG_WIDTH	2
+#define MINTOT_WIDTH	5
+#define NFLG		7		/* count of flags */
+#define	FLG_NULL	00000000	/* Null flag */
+#define	FLG_SHARED	01000000	/* shared extent */
+#define	FLG_ATTR_FORK	00100000	/* attribute fork */
+#define	FLG_PRE		00010000	/* Unwritten extent */
+#define	FLG_BSU		00001000	/* Not on begin of stripe unit  */
+#define	FLG_ESU		00000100	/* Not on end   of stripe unit  */
+#define	FLG_BSW		00000010	/* Not on begin of stripe width */
+#define	FLG_ESW		00000001	/* Not on end   of stripe width */
+static void
+dump_map_verbose(
+	unsigned long long	*nr,
+	struct fsmap_head	*head,
+	bool			*dumped_flags,
+	struct xfs_fsop_geom	*fsgeo)
+{
+	unsigned long long	i;
+	struct fsmap		*p;
+	int			agno;
+	off64_t			agoff, bperag;
+	int			foff_w, boff_w, aoff_w, tot_w, agno_w, own_w;
+	int			nr_w, dev_w;
+	char			rbuf[32], bbuf[32], abuf[32], obuf[32];
+	char			nbuf[32], dbuf[32], gbuf[32];
+	int			sunit, swidth;
+	int			flg = 0;
+
+	foff_w = boff_w = aoff_w = own_w = MINRANGE_WIDTH;
+	dev_w = 3;
+	nr_w = 4;
+	tot_w = MINTOT_WIDTH;
+	bperag = (off64_t)fsgeo->agblocks *
+		  (off64_t)fsgeo->blocksize;
+	sunit = (fsgeo->sunit * fsgeo->blocksize);
+	swidth = (fsgeo->swidth * fsgeo->blocksize);
+
+	/*
+	 * Go through the extents and figure out the width
+	 * needed for all columns.
+	 */
+	for (i = 0, p = head->fmh_recs; i < head->fmh_entries; i++, p++) {
+		if (p->fmr_flags & FMR_OF_PREALLOC ||
+		    p->fmr_flags & FMR_OF_ATTR_FORK ||
+		    p->fmr_flags & FMR_OF_SHARED)
+			flg = 1;
+		if (sunit &&
+		    (p->fmr_physical  % sunit != 0 ||
+		     ((p->fmr_physical + p->fmr_length) % sunit) != 0 ||
+		     p->fmr_physical % swidth != 0 ||
+		     ((p->fmr_physical + p->fmr_length) % swidth) != 0))
+			flg = 1;
+		if (flg)
+			*dumped_flags = true;
+		snprintf(nbuf, sizeof(nbuf), "%llu", (*nr) + i);
+		nr_w = max(nr_w, strlen(nbuf));
+		if (head->fmh_oflags & FMH_OF_DEV_T)
+			snprintf(dbuf, sizeof(dbuf), "%u:%u",
+				major(p->fmr_device),
+				minor(p->fmr_device));
+		else
+			snprintf(dbuf, sizeof(dbuf), "0x%x", p->fmr_device);
+		dev_w = max(dev_w, strlen(dbuf));
+		snprintf(bbuf, sizeof(bbuf), "[%lld..%lld]:",
+			(long long)BTOBBT(p->fmr_physical),
+			(long long)BTOBBT(p->fmr_physical + p->fmr_length - 1));
+		boff_w = max(boff_w, strlen(bbuf));
+		if (p->fmr_flags & FMR_OF_SPECIAL_OWNER)
+			own_w = max(own_w, strlen(special_owner(p->fmr_owner)));
+		else {
+			snprintf(obuf, sizeof(obuf), "%lld",
+				(long long)p->fmr_owner);
+			own_w = max(own_w, strlen(obuf));
+		}
+		if (p->fmr_flags & FMR_OF_EXTENT_MAP)
+			foff_w = max(foff_w, strlen(_("extent map")));
+		else if (p->fmr_flags & FMR_OF_SPECIAL_OWNER)
+			;
+		else {
+			snprintf(rbuf, sizeof(rbuf), "%lld..%lld",
+				(long long)BTOBBT(p->fmr_offset),
+				(long long)BTOBBT(p->fmr_offset + p->fmr_length - 1));
+			foff_w = max(foff_w, strlen(rbuf));
+		}
+		if (p->fmr_device == xfs_data_dev) {
+			agno = p->fmr_physical / bperag;
+			agoff = p->fmr_physical - (agno * bperag);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fmr_length - 1));
+		} else
+			abuf[0] = 0;
+		aoff_w = max(aoff_w, strlen(abuf));
+		tot_w = max(tot_w,
+			numlen(BTOBBT(p->fmr_length)));
+	}
+	agno_w = max(MINAG_WIDTH, numlen(fsgeo->agcount));
+	if (*nr == 0)
+		printf("%*s: %-*s %-*s %-*s %-*s %*s %-*s %*s%s\n",
+			nr_w, _("EXT"),
+			dev_w, _("DEV"),
+			boff_w, _("BLOCK-RANGE"),
+			own_w, _("OWNER"),
+			foff_w, _("FILE-OFFSET"),
+			agno_w, _("AG"),
+			aoff_w, _("AG-OFFSET"),
+			tot_w, _("TOTAL"),
+			flg ? _(" FLAGS") : "");
+	for (i = 0, p = head->fmh_recs; i < head->fmh_entries; i++, p++) {
+		flg = FLG_NULL;
+		if (p->fmr_flags & FMR_OF_PREALLOC)
+			flg |= FLG_PRE;
+		if (p->fmr_flags & FMR_OF_ATTR_FORK)
+			flg |= FLG_ATTR_FORK;
+		if (p->fmr_flags & FMR_OF_SHARED)
+			flg |= FLG_SHARED;
+		/*
+		 * If striping enabled, determine if extent starts/ends
+		 * on a stripe unit boundary.
+		 */
+		if (sunit) {
+			if (p->fmr_physical  % sunit != 0)
+				flg |= FLG_BSU;
+			if (((p->fmr_physical +
+			      p->fmr_length ) % sunit ) != 0)
+				flg |= FLG_ESU;
+			if (p->fmr_physical % swidth != 0)
+				flg |= FLG_BSW;
+			if (((p->fmr_physical +
+			      p->fmr_length ) % swidth ) != 0)
+				flg |= FLG_ESW;
+		}
+		if (head->fmh_oflags & FMH_OF_DEV_T)
+			snprintf(dbuf, sizeof(dbuf), "%u:%u",
+				major(p->fmr_device),
+				minor(p->fmr_device));
+		else
+			snprintf(dbuf, sizeof(dbuf), "0x%x", p->fmr_device);
+		snprintf(bbuf, sizeof(bbuf), "[%lld..%lld]:",
+			(long long)BTOBBT(p->fmr_physical),
+			(long long)BTOBBT(p->fmr_physical + p->fmr_length - 1));
+		if (p->fmr_flags & FMR_OF_SPECIAL_OWNER) {
+			snprintf(obuf, sizeof(obuf), "%s",
+				special_owner(p->fmr_owner));
+			snprintf(rbuf, sizeof(rbuf), " ");
+		} else {
+			snprintf(obuf, sizeof(obuf), "%lld",
+				(long long)p->fmr_owner);
+			snprintf(rbuf, sizeof(rbuf), "%lld..%lld",
+				(long long)BTOBBT(p->fmr_offset),
+				(long long)BTOBBT(p->fmr_offset + p->fmr_length - 1));
+		}
+		if (p->fmr_device == xfs_data_dev) {
+			agno = p->fmr_physical / bperag;
+			agoff = p->fmr_physical - (agno * bperag);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fmr_length - 1));
+			snprintf(gbuf, sizeof(gbuf),
+				"%lld",
+				(long long)agno);
+		} else {
+			abuf[0] = 0;
+			gbuf[0] = 0;
+		}
+		if (p->fmr_flags & FMR_OF_EXTENT_MAP)
+			printf("%*llu: %-*s %-*s %-*s %-*s %-*s %-*s %*lld\n",
+				nr_w, (*nr) + i,
+				dev_w, dbuf,
+				boff_w, bbuf,
+				own_w, obuf,
+				foff_w, _("extent map"),
+				agno_w, gbuf,
+				aoff_w, abuf,
+				tot_w, (long long)BTOBBT(p->fmr_length));
+		else {
+			printf("%*llu: %-*s %-*s %-*s %-*s", nr_w, (*nr) + i,
+				dev_w, dbuf, boff_w, bbuf, own_w, obuf,
+				foff_w, rbuf);
+			printf(" %-*s %-*s", agno_w, gbuf,
+				aoff_w, abuf);
+			printf(" %*lld", tot_w,
+				(long long)BTOBBT(p->fmr_length));
+			if (flg == FLG_NULL)
+				printf("\n");
+			else
+				printf(" %-*.*o\n", NFLG, NFLG, flg);
+		}
+	}
+
+	(*nr) += head->fmh_entries;
+}
+
+static void
+dump_verbose_key(void)
+{
+	printf(_(" FLAG Values:\n"));
+	printf(_("    %*.*o Shared extent\n"),
+		NFLG+1, NFLG+1, FLG_SHARED);
+	printf(_("    %*.*o Attribute fork\n"),
+		NFLG+1, NFLG+1, FLG_ATTR_FORK);
+	printf(_("    %*.*o Unwritten preallocated extent\n"),
+		NFLG+1, NFLG+1, FLG_PRE);
+	printf(_("    %*.*o Doesn't begin on stripe unit\n"),
+		NFLG+1, NFLG+1, FLG_BSU);
+	printf(_("    %*.*o Doesn't end   on stripe unit\n"),
+		NFLG+1, NFLG+1, FLG_ESU);
+	printf(_("    %*.*o Doesn't begin on stripe width\n"),
+		NFLG+1, NFLG+1, FLG_BSW);
+	printf(_("    %*.*o Doesn't end   on stripe width\n"),
+		NFLG+1, NFLG+1, FLG_ESW);
+}
+
+int
+fsmap_f(
+	int			argc,
+	char			**argv)
+{
+	struct fsmap		*p;
+	struct fsmap_head	*nhead;
+	struct fsmap_head	*head;
+	struct xfs_fsop_geom	fsgeo;
+	long long		start = 0;
+	long long		end = -1;
+	int			nmap_size;
+	int			map_size;
+	int			nflag = 0;
+	int			vflag = 0;
+	int			i = 0;
+	int			c;
+	unsigned long long	nr = 0;
+	size_t			fsblocksize, fssectsize;
+	struct fs_path		*fs;
+	static bool		tab_init;
+	bool			dumped_flags = false;
+
+	init_cvtnum(&fsblocksize, &fssectsize);
+
+	while ((c = getopt(argc, argv, "n:v")) != EOF) {
+		switch (c) {
+		case 'n':	/* number of extents specified */
+			nflag = atoi(optarg);
+			break;
+		case 'v':	/* Verbose output */
+			vflag++;
+			break;
+		default:
+			return command_usage(&fsmap_cmd);
+		}
+	}
+
+	if (argc > optind) {
+		start = cvtnum(fsblocksize, fssectsize, argv[optind]);
+		if (start < 0) {
+			fprintf(stderr,
+				_("Bad rmap start_fsb %s.\n"),
+				argv[optind]);
+			return 0;
+		}
+	}
+
+	if (argc > optind + 1) {
+		end = cvtnum(fsblocksize, fssectsize, argv[optind + 1]);
+		if (end < 0) {
+			fprintf(stderr,
+				_("Bad rmap end_fsb %s.\n"),
+				argv[optind + 1]);
+			return 0;
+		}
+	}
+
+	if (vflag) {
+		c = xfsctl(file->name, file->fd, XFS_IOC_FSGEOMETRY_V1, &fsgeo);
+		if (c < 0) {
+			fprintf(stderr,
+				_("%s: can't get geometry [\"%s\"]: %s\n"),
+				progname, file->name, strerror(errno));
+			exitcode = 1;
+			return 0;
+		}
+	}
+
+	map_size = nflag ? nflag : 131072 / sizeof(struct fsmap);
+	head = malloc(fsmap_sizeof(map_size));
+	if (head == NULL) {
+		fprintf(stderr, _("%s: malloc of %zu bytes failed.\n"),
+			progname, fsmap_sizeof(map_size));
+		exitcode = 1;
+		return 0;
+	}
+
+	memset(head, 0, sizeof(*head));
+	head->fmh_keys[0].fmr_physical = start;
+	head->fmh_keys[1].fmr_device = UINT_MAX;
+	head->fmh_keys[1].fmr_physical = end;
+	head->fmh_keys[1].fmr_owner = ULLONG_MAX;
+	head->fmh_keys[1].fmr_offset = ULLONG_MAX;
+
+	/* Count mappings */
+	if (!nflag) {
+		head->fmh_count = 0;
+		i = xfsctl(file->name, file->fd, XFS_IOC_GETFSMAP, head);
+		if (i < 0) {
+			fprintf(stderr, _("%s: xfsctl(XFS_IOC_GETFSMAP)"
+				" iflags=0x%x [\"%s\"]: %s\n"),
+				progname, head->fmh_iflags, file->name,
+				strerror(errno));
+			free(head);
+			exitcode = 1;
+			return 0;
+		}
+		if (head->fmh_entries > map_size + 2) {
+			map_size = 11ULL * head->fmh_entries / 10;
+			nmap_size = map_size > INT_MAX ? INT_MAX : map_size;
+			nhead = realloc(head, fsmap_sizeof(nmap_size));
+			if (nhead == NULL) {
+				fprintf(stderr,
+					_("%s: cannot realloc %zu bytes\n"),
+					progname, fsmap_sizeof(nmap_size));
+			} else {
+				head = nhead;
+				map_size = nmap_size;
+			}
+		}
+	}
+
+	/*
+	 * If this is an XFS filesystem, remember the data device.
+	 * (We report AG number/block for data device extents on XFS).
+	 */
+	if (!tab_init) {
+		fs_table_initialise(0, NULL, 0, NULL);
+		tab_init = true;
+	}
+	fs = fs_table_lookup(file->name, FS_MOUNT_POINT);
+	xfs_data_dev = fs ? fs->fs_datadev : 0;
+
+	head->fmh_count = map_size;
+	do {
+		/* Get some extents */
+		i = xfsctl(file->name, file->fd, XFS_IOC_GETFSMAP, head);
+		if (i < 0) {
+			fprintf(stderr, _("%s: xfsctl(XFS_IOC_GETFSMAP)"
+				" iflags=0x%x [\"%s\"]: %s\n"),
+				progname, head->fmh_iflags, file->name,
+				strerror(errno));
+			free(head);
+			exitcode = 1;
+			return 0;
+		}
+
+		if (head->fmh_entries == 0)
+			break;
+
+		if (!vflag)
+			dump_map(&nr, head);
+		else
+			dump_map_verbose(&nr, head, &dumped_flags, &fsgeo);
+
+		p = &head->fmh_recs[head->fmh_entries - 1];
+		if (p->fmr_flags & FMR_OF_LAST)
+			break;
+
+		head->fmh_keys[0] = *p;
+	} while (true);
+
+	if (dumped_flags)
+		dump_verbose_key();
+
+	free(head);
+	return 0;
+}
+
+void
+fsmap_init(void)
+{
+	fsmap_cmd.name = "fsmap";
+	fsmap_cmd.cfunc = fsmap_f;
+	fsmap_cmd.argmin = 0;
+	fsmap_cmd.argmax = -1;
+	fsmap_cmd.flags = CMD_NOMAP_OK;
+	fsmap_cmd.args = _("[-v] [-n nx] [start] [end]");
+	fsmap_cmd.oneline = _("print filesystem mapping for a range of blocks");
+	fsmap_cmd.help = fsmap_help;
+
+	add_command(&fsmap_cmd);
+}
diff --git a/io/init.c b/io/init.c
index a9191cf..27c4a16 100644
--- a/io/init.c
+++ b/io/init.c
@@ -63,6 +63,7 @@ init_commands(void)
 	file_init();
 	flink_init();
 	freeze_init();
+	fsmap_init();
 	fsync_init();
 	getrusage_init();
 	help_init();
diff --git a/io/io.h b/io/io.h
index 5d21314..0ee2c41 100644
--- a/io/io.h
+++ b/io/io.h
@@ -97,6 +97,7 @@ extern void		bmap_init(void);
 extern void		file_init(void);
 extern void		flink_init(void);
 extern void		freeze_init(void);
+extern void		fsmap_init(void);
 extern void		fsync_init(void);
 extern void		getrusage_init(void);
 extern void		help_init(void);
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 885df7f..f7edcab 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -273,6 +273,53 @@ ioctl.  Options behave as described in the
 .BR xfs_bmap (8)
 manual page.
 .TP
+.BI "fsmap [ \-v ] [ \-n " nx " ] [ " start " ] [ " end " ]"
+Prints the mapping of disk blocks used by an XFS filesystem.  The map
+lists each extent used by files, allocation group metadata,
+journalling logs, and static filesystem metadata, as well as any
+regions that are unused.  Each line of the listings takes the
+following form:
+.PP
+.RS
+.IR extent ": " major ":" minor " [" startblock .. endblock "]: " owner " " startoffset .. endoffset " " length
+.PP
+Static filesystem metadata, allocation group metadata, btrees,
+journalling logs, and free space are marked by replacing the
+.IR startoffset .. endoffset
+with the appropriate marker.  All blocks, offsets, and lengths are specified
+in units of 512-byte blocks, no matter what the filesystem's block size is.
+The optional \fIstart\fP and \fIend\fP arguments can be used to constrain
+the output to a particular range of disk blocks.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI \-n " num_extents"
+If this option is given,
+.B fsmap
+obtains the extent list of the filesystem in groups of
+.I num_extents
+extents. In the absence of
+.BR \-n ", " fsmap
+queries the system for the number of extents in the filesystem and uses that
+value to compute the group size.
+.TP
+.B \-v
+Shows verbose information. When this flag is specified, additional AG
+specific information is appended to each line in the following form:
+.IP
+.RS 1.2i
+.IR agno " (" startagblock .. endagblock ") " nblocks " " flags
+.RE
+.IP
+A second
+.B \-v
+option will print out the
+.I flags
+legend.
+.RE
+.PD
+.TP
 .BI "extsize [ \-R | \-D ] [ " value " ]"
 Display and/or modify the preferred extent size used when allocating
 space for the currently open file. If the

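For readers following the verbose output path, here is a minimal
standalone sketch of the arithmetic that dump_map_verbose() performs for
each record: converting bytes to 512-byte basic blocks, splitting a data
device address into an AG number and an offset within that AG, and
computing the stripe alignment flags.  The helper names (bytes_to_bb,
ag_position, stripe_flags) and struct ag_pos are hypothetical and not
part of the patch.  Unlike the patch, the sketch also returns early when
swidth is zero, so that it never takes a modulus by zero.

```c
#include <stdint.h>

/* Alignment flag bits, copied from the FLG_* octal constants in the patch. */
#define FLG_BSU	00001000	/* does not begin on a stripe unit */
#define FLG_ESU	00000100	/* does not end on a stripe unit */
#define FLG_BSW	00000010	/* does not begin on a stripe width */
#define FLG_ESW	00000001	/* does not end on a stripe width */

/* Hypothetical helper: bytes to 512-byte basic blocks, rounding down. */
static uint64_t bytes_to_bb(uint64_t bytes)
{
	return bytes >> 9;
}

/* Hypothetical type: position of a byte address inside an AG. */
struct ag_pos {
	uint64_t	agno;	/* allocation group number */
	uint64_t	agoff;	/* byte offset within that AG */
};

/* Hypothetical helper: bperag is the AG size in bytes (agblocks * blocksize). */
static struct ag_pos ag_position(uint64_t physical, uint64_t bperag)
{
	struct ag_pos	pos;

	pos.agno = physical / bperag;
	pos.agoff = physical - pos.agno * bperag;
	return pos;
}

/* Hypothetical helper: all arguments are in bytes. */
static int stripe_flags(uint64_t physical, uint64_t length,
			uint64_t sunit, uint64_t swidth)
{
	int	flg = 0;

	/* No striping configured: nothing to report. */
	if (!sunit || !swidth)
		return 0;

	if (physical % sunit != 0)
		flg |= FLG_BSU;
	if ((physical + length) % sunit != 0)
		flg |= FLG_ESU;
	if (physical % swidth != 0)
		flg |= FLG_BSW;
	if ((physical + length) % swidth != 0)
		flg |= FLG_ESW;
	return flg;
}
```

With a 64k stripe unit and a 256k stripe width, a 64k extent starting at
byte 65536 is aligned to the stripe unit at both ends but to the stripe
width at neither, so only the two stripe width bits are set.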


* [PATCH 06/39] xfs: use GFP_NOFS when allocating btree cursors
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 05/39] xfs_io: support the new getfsmap ioctl Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 07/39] xfs: add scrub tracepoints Darrick J. Wong
                   ` (32 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

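Use KM_NOFS when allocating btree cursors, since the cursor init
functions can be called with the ilock held, and memory reclaim must
not recurse into the filesystem in that context.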

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc_btree.c  |    2 +-
 libxfs/xfs_bmap_btree.c   |    2 +-
 libxfs/xfs_ialloc_btree.c |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)


diff --git a/libxfs/xfs_alloc_btree.c b/libxfs/xfs_alloc_btree.c
index 273ea5b..4f4fecc 100644
--- a/libxfs/xfs_alloc_btree.c
+++ b/libxfs/xfs_alloc_btree.c
@@ -496,7 +496,7 @@ xfs_allocbt_init_cursor(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
+	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
diff --git a/libxfs/xfs_bmap_btree.c b/libxfs/xfs_bmap_btree.c
index 601385d..2997a3a 100644
--- a/libxfs/xfs_bmap_btree.c
+++ b/libxfs/xfs_bmap_btree.c
@@ -793,7 +793,7 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
+	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
diff --git a/libxfs/xfs_ialloc_btree.c b/libxfs/xfs_ialloc_btree.c
index 7bf6040..c2d4a5e 100644
--- a/libxfs/xfs_ialloc_btree.c
+++ b/libxfs/xfs_ialloc_btree.c
@@ -356,7 +356,7 @@ xfs_inobt_init_cursor(
 	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agbp);
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
+	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;



* [PATCH 07/39] xfs: add scrub tracepoints
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 06/39] xfs: use GFP_NOFS when allocating btree cursors Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 08/39] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
                   ` (31 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_types.h |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/libxfs/xfs_types.h b/libxfs/xfs_types.h
index cf044c0..442b223 100644
--- a/libxfs/xfs_types.h
+++ b/libxfs/xfs_types.h
@@ -95,6 +95,11 @@ typedef __int64_t	xfs_sfiloff_t;	/* signed block number in a file */
 #define	XFS_ATTR_FORK	1
 #define	XFS_COW_FORK	2
 
+#define XFS_FORK_DESC \
+	{ XFS_DATA_FORK,	"data" }, \
+	{ XFS_ATTR_FORK,	"attr" }, \
+	{ XFS_COW_FORK,		"CoW" }
+
 /*
  * Min numbers of data/attr fork btree root pointers.
  */

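XFS_FORK_DESC expands to a list of { value, "name" } initializers,
presumably so that tracepoints can print a fork by name instead of by
number.  As a minimal sketch of how such a list can be consumed outside
of the tracing macros, the table below is built from the same
initializers and searched linearly.  struct fork_desc, fork_descs and
fork_name() are hypothetical names, and XFS_DATA_FORK is defined as 0
here because its definition lies just above the quoted hunk.

```c
#include <stddef.h>

#define XFS_DATA_FORK	0	/* defined just above the quoted hunk */
#define XFS_ATTR_FORK	1
#define XFS_COW_FORK	2

#define XFS_FORK_DESC \
	{ XFS_DATA_FORK,	"data" }, \
	{ XFS_ATTR_FORK,	"attr" }, \
	{ XFS_COW_FORK,		"CoW" }

/* Hypothetical type matching the shape of each initializer. */
struct fork_desc {
	int		value;
	const char	*name;
};

static const struct fork_desc fork_descs[] = { XFS_FORK_DESC };

/* Hypothetical helper: map a fork number to its printable name. */
static const char *fork_name(int whichfork)
{
	size_t	i;

	for (i = 0; i < sizeof(fork_descs) / sizeof(fork_descs[0]); i++)
		if (fork_descs[i].value == whichfork)
			return fork_descs[i].name;
	return "unknown";
}
```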


* [PATCH 08/39] xfs: create an ioctl to scrub AG metadata
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 07/39] xfs: add scrub tracepoints Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 09/39] xfs: generic functions to scrub metadata and btrees Darrick J. Wong
                   ` (30 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Create an ioctl that can be used to scrub internal filesystem metadata.
The new ioctl takes the metadata type, an (optional) AG number, an
(optional) inode number and generation, and a flags argument.  This will
be used by the upcoming XFS online scrub tool.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 6857355..27878cf 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -578,6 +578,40 @@ typedef struct xfs_swapext
 #define XFS_FSOP_GOING_FLAGS_LOGFLUSH		0x1	/* flush log but not data */
 #define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH		0x2	/* don't flush log nor data */
 
+/* metadata scrubbing */
+struct xfs_scrub_metadata {
+	__u32 sm_type;		/* What to check? */
+	__u32 sm_flags;		/* Flags; see XFS_SCRUB_FLAG_* below. */
+	union {
+		__u32		__agno;
+		struct {
+			__u64	__ino;
+			__u32	__gen;
+		} i;
+		__u64		__reserved[7];	/* pad to 64 bytes */
+	} p;
+};
+#define sm_agno	p.__agno
+#define sm_ino	p.i.__ino
+#define sm_gen	p.i.__gen
+
+/*
+ * Metadata types and flags for scrub operation.
+ */
+#define XFS_SCRUB_TYPE_TEST	0	/* dummy to test ioctl */
+#define XFS_SCRUB_TYPE_MAX	0
+
+#define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
+#define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
+#define XFS_SCRUB_FLAG_PREEN	0x4	/* o: could be optimized */
+#define XFS_SCRUB_FLAG_XREF_FAIL 0x8	/* o: errors during cross-referencing */
+
+#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_FLAG_REPAIR)
+#define XFS_SCRUB_FLAGS_OUT	(XFS_SCRUB_FLAG_CORRUPT | \
+				 XFS_SCRUB_FLAG_PREEN | \
+				 XFS_SCRUB_FLAG_XREF_FAIL)
+#define XFS_SCRUB_FLAGS_ALL	(XFS_SCRUB_FLAGS_IN | XFS_SCRUB_FLAGS_OUT)
+
 /*
  * ioctl limits
  */
@@ -587,6 +621,7 @@ typedef struct xfs_swapext
 #  define XFS_XATTR_LIST_MAX 65536
 #endif
 
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -618,6 +653,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_ZERO_RANGE	_IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS	_IOR ('X', 58, struct xfs_fs_eofblocks)
 #define XFS_IOC_GETFSMAP	_IOWR('X', 59, struct fsmap_head)
+#define XFS_IOC_SCRUB_METADATA	_IOWR('X', 60, struct xfs_scrub_metadata)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s

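To make the layout and the flag masks above concrete, here is a minimal
userspace sketch that mirrors struct xfs_scrub_metadata with <stdint.h>
types, checks that it is 64 bytes as the padding comment says, and
separates input flags from output flags.  The helpers
scrub_request_init(), scrub_flags_valid_in() and scrub_needs_repair()
are hypothetical; in particular, rejecting output flags on input is an
assumption about how a caller might validate a request, not something
this patch shows the kernel doing.  The sketch deliberately does not
issue the ioctl, since that needs a kernel carrying the scrub patches.

```c
#include <stdint.h>
#include <string.h>

/* Mirror of the structure in the patch, using standard integer types. */
struct xfs_scrub_metadata {
	uint32_t sm_type;		/* What to check? */
	uint32_t sm_flags;		/* XFS_SCRUB_FLAG_* */
	union {
		uint32_t	__agno;
		struct {
			uint64_t	__ino;
			uint32_t	__gen;
		} i;
		uint64_t	__reserved[7];	/* pad to 64 bytes */
	} p;
};
#define sm_agno	p.__agno
#define sm_ino	p.i.__ino
#define sm_gen	p.i.__gen

#define XFS_SCRUB_TYPE_TEST	0

#define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
#define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
#define XFS_SCRUB_FLAG_PREEN	0x4	/* o: could be optimized */
#define XFS_SCRUB_FLAG_XREF_FAIL 0x8	/* o: errors during cross-referencing */

#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_FLAG_REPAIR)

/* 4 + 4 bytes of header plus a 56-byte union. */
_Static_assert(sizeof(struct xfs_scrub_metadata) == 64,
	       "scrub request must be 64 bytes");

/* Hypothetical helper: build a zeroed per-AG request. */
static void scrub_request_init(struct xfs_scrub_metadata *sm,
			       uint32_t type, uint32_t agno)
{
	memset(sm, 0, sizeof(*sm));
	sm->sm_type = type;
	sm->sm_agno = agno;
}

/* Hypothetical check: only "i:" flags may be set by the caller. */
static int scrub_flags_valid_in(uint32_t flags)
{
	return (flags & ~(uint32_t)XFS_SCRUB_FLAGS_IN) == 0;
}

/* Hypothetical check: did the scrubber report corruption? */
static int scrub_needs_repair(const struct xfs_scrub_metadata *sm)
{
	return (sm->sm_flags & XFS_SCRUB_FLAG_CORRUPT) != 0;
}
```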


* [PATCH 09/39] xfs: generic functions to scrub metadata and btrees
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 08/39] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 10/39] xfs: scrub the backup superblocks Darrick J. Wong
                   ` (29 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Create a function that walks a btree, checking the integrity of each
btree block (headers, keys, records) and calling back to the caller
to perform further checks on the records.  Add some helper functions
so that we report detailed scrub errors in a uniform manner in dmesg.
These are helper functions for subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc.c  |    2 +-
 libxfs/xfs_alloc.h  |    2 ++
 libxfs/xfs_btree.c  |   41 +++++++++++++++++++++++++++++++++++------
 libxfs/xfs_btree.h  |   17 +++++++++++++++--
 libxfs/xfs_format.h |    2 +-
 5 files changed, 54 insertions(+), 10 deletions(-)


diff --git a/libxfs/xfs_alloc.c b/libxfs/xfs_alloc.c
index 3bfca12..0cb8db7 100644
--- a/libxfs/xfs_alloc.c
+++ b/libxfs/xfs_alloc.c
@@ -625,7 +625,7 @@ const struct xfs_buf_ops xfs_agfl_buf_ops = {
 /*
  * Read in the allocation group free block array.
  */
-STATIC int				/* error */
+int					/* error */
 xfs_alloc_read_agfl(
 	xfs_mount_t	*mp,		/* mount point structure */
 	xfs_trans_t	*tp,		/* transaction pointer */
diff --git a/libxfs/xfs_alloc.h b/libxfs/xfs_alloc.h
index 0dc34bf..89a23be 100644
--- a/libxfs/xfs_alloc.h
+++ b/libxfs/xfs_alloc.h
@@ -217,6 +217,8 @@ xfs_alloc_get_rec(
 
 int xfs_read_agf(struct xfs_mount *mp, struct xfs_trans *tp,
 			xfs_agnumber_t agno, int flags, struct xfs_buf **bpp);
+int xfs_alloc_read_agfl(struct xfs_mount *mp, struct xfs_trans *tp,
+			xfs_agnumber_t agno, struct xfs_buf **bpp);
 int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
 int xfs_free_extent_fix_freelist(struct xfs_trans *tp, xfs_agnumber_t agno,
 		struct xfs_buf **agbp);
diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index ee2e489..4a291cc 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -548,7 +548,7 @@ xfs_btree_ptr_offset(
 /*
  * Return a pointer to the n-th record in the btree block.
  */
-STATIC union xfs_btree_rec *
+union xfs_btree_rec *
 xfs_btree_rec_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -561,7 +561,7 @@ xfs_btree_rec_addr(
 /*
  * Return a pointer to the n-th key in the btree block.
  */
-STATIC union xfs_btree_key *
+union xfs_btree_key *
 xfs_btree_key_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -574,7 +574,7 @@ xfs_btree_key_addr(
 /*
  * Return a pointer to the n-th high key in the btree block.
  */
-STATIC union xfs_btree_key *
+union xfs_btree_key *
 xfs_btree_high_key_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -587,7 +587,7 @@ xfs_btree_high_key_addr(
 /*
  * Return a pointer to the n-th block pointer in the btree block.
  */
-STATIC union xfs_btree_ptr *
+union xfs_btree_ptr *
 xfs_btree_ptr_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -621,7 +621,7 @@ xfs_btree_get_iroot(
  * Retrieve the block pointer from the cursor at the given level.
  * This may be an inode btree root or from a buffer.
  */
-STATIC struct xfs_btree_block *		/* generic btree block pointer */
+struct xfs_btree_block *		/* generic btree block pointer */
 xfs_btree_get_block(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level,	/* level in btree */
@@ -1732,7 +1732,7 @@ xfs_btree_decrement(
 	return error;
 }
 
-STATIC int
+int
 xfs_btree_lookup_get_block(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level,	/* level in the btree */
@@ -4862,3 +4862,32 @@ xfs_btree_count_blocks(
 	return xfs_btree_visit_blocks(cur, xfs_btree_count_blocks_helper,
 			blocks);
 }
+
+/* If there's an extent, we're done. */
+STATIC int
+xfs_btree_has_record_helper(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	return XFS_BTREE_QUERY_RANGE_ABORT;
+}
+
+/* Is there a record covering a given range of keys? */
+int
+xfs_btree_has_record(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_irec	*low,
+	union xfs_btree_irec	*high,
+	bool			*exists)
+{
+	int			error;
+
+	error = xfs_btree_query_range(cur, low, high,
+			&xfs_btree_has_record_helper, NULL);
+	if (error && error != XFS_BTREE_QUERY_RANGE_ABORT)
+		return error;
+	*exists = error == XFS_BTREE_QUERY_RANGE_ABORT;
+
+	return 0;
+}
diff --git a/libxfs/xfs_btree.h b/libxfs/xfs_btree.h
index 6ef5373..0210662 100644
--- a/libxfs/xfs_btree.h
+++ b/libxfs/xfs_btree.h
@@ -197,7 +197,6 @@ struct xfs_btree_ops {
 
 	const struct xfs_buf_ops	*buf_ops;
 
-#if defined(DEBUG) || defined(XFS_WARN)
 	/* check that k1 is lower than k2 */
 	int	(*keys_inorder)(struct xfs_btree_cur *cur,
 				union xfs_btree_key *k1,
@@ -207,7 +206,6 @@ struct xfs_btree_ops {
 	int	(*recs_inorder)(struct xfs_btree_cur *cur,
 				union xfs_btree_rec *r1,
 				union xfs_btree_rec *r2);
-#endif
 };
 
 /*
@@ -539,4 +537,19 @@ int xfs_btree_visit_blocks(struct xfs_btree_cur *cur,
 
 int xfs_btree_count_blocks(struct xfs_btree_cur *cur, xfs_extlen_t *blocks);
 
+union xfs_btree_rec *xfs_btree_rec_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+union xfs_btree_key *xfs_btree_key_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+union xfs_btree_key *xfs_btree_high_key_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+union xfs_btree_ptr *xfs_btree_ptr_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+int xfs_btree_lookup_get_block(struct xfs_btree_cur *cur, int level,
+		union xfs_btree_ptr *pp, struct xfs_btree_block **blkp);
+struct xfs_btree_block *xfs_btree_get_block(struct xfs_btree_cur *cur,
+		int level, struct xfs_buf **bpp);
+int xfs_btree_has_record(struct xfs_btree_cur *cur, union xfs_btree_irec *low,
+		union xfs_btree_irec *high, bool *exists);
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/libxfs/xfs_format.h b/libxfs/xfs_format.h
index e14f964..d819d00 100644
--- a/libxfs/xfs_format.h
+++ b/libxfs/xfs_format.h
@@ -518,7 +518,7 @@ static inline int xfs_sb_version_hasftype(struct xfs_sb *sbp)
 		 (sbp->sb_features2 & XFS_SB_VERSION2_FTYPE));
 }
 
-static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
+static inline bool xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 {
 	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);

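xfs_btree_has_record() relies on a small idiom: run a range query with a
callback that aborts on the first record it sees, then treat "aborted"
as "a record exists" and any other non-zero return as a real error.  The
following standalone sketch shows the same idiom over a sorted array
instead of a btree.  Everything in it (struct rec, query_range(),
QUERY_RANGE_ABORT, and the assumption that records are sorted by start
with a length of at least one) is a stand-in for illustration and is not
the libxfs API.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUERY_RANGE_ABORT	1	/* stand-in for XFS_BTREE_QUERY_RANGE_ABORT */

/* Stand-in record: covers blocks start .. start + len - 1, len >= 1. */
struct rec {
	unsigned int	start;
	unsigned int	len;
};

typedef int (*query_fn)(const struct rec *rec, void *priv);

/*
 * Call fn for every record overlapping [low, high].  The array must be
 * sorted by start.  A non-zero return from fn stops the walk and is
 * passed back to the caller.
 */
static int query_range(const struct rec *recs, size_t nr,
		       unsigned int low, unsigned int high,
		       query_fn fn, void *priv)
{
	size_t	i;
	int	error;

	for (i = 0; i < nr; i++) {
		if (recs[i].start > high)
			break;		/* past the end of the range */
		if (recs[i].start + recs[i].len - 1 < low)
			continue;	/* entirely before the range */
		error = fn(&recs[i], priv);
		if (error)
			return error;
	}
	return 0;
}

/* If there's a record, we're done. */
static int has_record_helper(const struct rec *rec, void *priv)
{
	(void)rec;
	(void)priv;
	return QUERY_RANGE_ABORT;
}

/* Is there any record overlapping [low, high]? */
static int has_record(const struct rec *recs, size_t nr,
		      unsigned int low, unsigned int high, bool *exists)
{
	int	error;

	error = query_range(recs, nr, low, high, has_record_helper, NULL);
	if (error && error != QUERY_RANGE_ABORT)
		return error;
	*exists = error == QUERY_RANGE_ABORT;
	return 0;
}
```

The benefit of the abort value is that the walk stops at the first hit
rather than visiting every record in the range.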


* [PATCH 10/39] xfs: scrub the backup superblocks
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 09/39] xfs: generic functions to scrub metadata and btrees Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 11/39] xfs: scrub AGF and AGFL Darrick J. Wong
                   ` (28 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Ensure that the geometry presented in the backup superblocks matches
the primary superblock so that repair can recover the filesystem if
that primary gets corrupted.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 27878cf..4f76e80 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -599,7 +599,8 @@ struct xfs_scrub_metadata {
  * Metadata types and flags for scrub operation.
  */
 #define XFS_SCRUB_TYPE_TEST	0	/* dummy to test ioctl */
-#define XFS_SCRUB_TYPE_MAX	0
+#define XFS_SCRUB_TYPE_SB	1	/* superblock */
+#define XFS_SCRUB_TYPE_MAX	1
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */



* [PATCH 11/39] xfs: scrub AGF and AGFL
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 10/39] xfs: scrub the backup superblocks Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 12/39] xfs: scrub the AGI Darrick J. Wong
                   ` (27 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Check the block references in the AGF and AGFL headers to make sure
they make sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 4f76e80..f2ab770 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -600,7 +600,9 @@ struct xfs_scrub_metadata {
  */
 #define XFS_SCRUB_TYPE_TEST	0	/* dummy to test ioctl */
 #define XFS_SCRUB_TYPE_SB	1	/* superblock */
-#define XFS_SCRUB_TYPE_MAX	1
+#define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
+#define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
+#define XFS_SCRUB_TYPE_MAX	3
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */



* [PATCH 12/39] xfs: scrub the AGI
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 11/39] xfs: scrub AGF and AGFL Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:25 ` [PATCH 13/39] xfs: support scrubbing free space btrees Darrick J. Wong
                   ` (26 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Add a forgotten check to the AGI verifier, then wire up the scrub
infrastructure to check the AGI contents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h     |    3 ++-
 libxfs/xfs_ialloc.c |    5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index f2ab770..4869803 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -602,7 +602,8 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_SB	1	/* superblock */
 #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
 #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
-#define XFS_SCRUB_TYPE_MAX	3
+#define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
+#define XFS_SCRUB_TYPE_MAX	4
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
diff --git a/libxfs/xfs_ialloc.c b/libxfs/xfs_ialloc.c
index efb37d3..8f7fe60 100644
--- a/libxfs/xfs_ialloc.c
+++ b/libxfs/xfs_ialloc.c
@@ -2509,6 +2509,11 @@ xfs_agi_verify(
 
 	if (be32_to_cpu(agi->agi_level) > XFS_BTREE_MAXLEVELS)
 		return false;
+
+	if (xfs_sb_version_hasfinobt(&mp->m_sb) &&
+	    be32_to_cpu(agi->agi_free_level) > XFS_BTREE_MAXLEVELS)
+		return false;
+
 	/*
 	 * during growfs operations, the perag is not fully initialised,
 	 * so we can't use it for any useful checking. growfs ensures we can't



* [PATCH 13/39] xfs: support scrubbing free space btrees
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2016-11-05  0:25 ` [PATCH 12/39] xfs: scrub the AGI Darrick J. Wong
@ 2016-11-05  0:25 ` Darrick J. Wong
  2016-11-05  0:26 ` [PATCH 14/39] xfs: support scrubbing inode btrees Darrick J. Wong
                   ` (25 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Plumb in the pieces necessary to check the free space btrees.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc_btree.c |    6 ------
 libxfs/xfs_fs.h          |    4 +++-
 2 files changed, 3 insertions(+), 7 deletions(-)


diff --git a/libxfs/xfs_alloc_btree.c b/libxfs/xfs_alloc_btree.c
index 4f4fecc..41e2eac 100644
--- a/libxfs/xfs_alloc_btree.c
+++ b/libxfs/xfs_alloc_btree.c
@@ -384,7 +384,6 @@ const struct xfs_buf_ops xfs_allocbt_buf_ops = {
 };
 
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_bnobt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -431,7 +430,6 @@ xfs_cntbt_recs_inorder(
 		 be32_to_cpu(r1->alloc.ar_startblock) <
 		 be32_to_cpu(r2->alloc.ar_startblock));
 }
-#endif /* DEBUG */
 
 static const struct xfs_btree_ops xfs_bnobt_ops = {
 	.rec_len		= sizeof(xfs_alloc_rec_t),
@@ -451,10 +449,8 @@ static const struct xfs_btree_ops xfs_bnobt_ops = {
 	.key_diff		= xfs_bnobt_key_diff,
 	.buf_ops		= &xfs_allocbt_buf_ops,
 	.diff_two_keys		= xfs_bnobt_diff_two_keys,
-#if defined(DEBUG) || defined(XFS_WARN)
 	.keys_inorder		= xfs_bnobt_keys_inorder,
 	.recs_inorder		= xfs_bnobt_recs_inorder,
-#endif
 };
 
 static const struct xfs_btree_ops xfs_cntbt_ops = {
@@ -474,10 +470,8 @@ static const struct xfs_btree_ops xfs_cntbt_ops = {
 	.key_diff		= xfs_cntbt_key_diff,
 	.buf_ops		= &xfs_allocbt_buf_ops,
 	.diff_two_keys		= xfs_cntbt_diff_two_keys,
-#if defined(DEBUG) || defined(XFS_WARN)
 	.keys_inorder		= xfs_cntbt_keys_inorder,
 	.recs_inorder		= xfs_cntbt_recs_inorder,
-#endif
 };
 
 /*
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 4869803..0e0c35f 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -603,7 +603,9 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
 #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
 #define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
-#define XFS_SCRUB_TYPE_MAX	4
+#define XFS_SCRUB_TYPE_BNOBT	5	/* freesp by block btree */
+#define XFS_SCRUB_TYPE_CNTBT	6	/* freesp by length btree */
+#define XFS_SCRUB_TYPE_MAX	6
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
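
With the DEBUG guards removed, the keys_inorder/recs_inorder hooks are always compiled in, so a scrubber can call them on every adjacent pair of records. The following is a minimal standalone sketch of that idea, not the libxfs code: struct freesp_rec, bno_recs_inorder(), cnt_recs_inorder() and first_out_of_order() are hypothetical names, the records are host-endian rather than on-disk big-endian, and the two ordering rules are assumed from the fragments visible in the hunks above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian free space record (not the on-disk format). */
struct freesp_rec {
	uint32_t	startblock;
	uint32_t	blockcount;
};

/* By-block order: each extent must end at or before the next one starts. */
static bool
bno_recs_inorder(const struct freesp_rec *r1, const struct freesp_rec *r2)
{
	return (uint64_t)r1->startblock + r1->blockcount <= r2->startblock;
}

/* By-length order: sort by length, break ties by start block. */
static bool
cnt_recs_inorder(const struct freesp_rec *r1, const struct freesp_rec *r2)
{
	return r1->blockcount < r2->blockcount ||
	       (r1->blockcount == r2->blockcount &&
		r1->startblock < r2->startblock);
}

typedef bool (*recs_inorder_fn)(const struct freesp_rec *,
				const struct freesp_rec *);

/* Return the index of the first out-of-order record, or n if all are fine. */
static size_t
first_out_of_order(const struct freesp_rec *recs, size_t n,
		   recs_inorder_fn inorder)
{
	for (size_t i = 1; i < n; i++)
		if (!inorder(&recs[i - 1], &recs[i]))
			return i;
	return n;
}
```

The same walker serves both trees; only the comparison callback changes, which mirrors how the btree ops table lets generic code stay unaware of the record format.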



* [PATCH 14/39] xfs: support scrubbing inode btrees
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Plumb in the pieces necessary to check the inode btrees.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h           |    4 +++-
 libxfs/xfs_ialloc.c       |   41 +++++++++++++++++++++++++----------------
 libxfs/xfs_ialloc.h       |    3 +++
 libxfs/xfs_ialloc_btree.c |   32 ++++++++++++++++++++++++++------
 4 files changed, 57 insertions(+), 23 deletions(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 0e0c35f..97df1b9 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -605,7 +605,9 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
 #define XFS_SCRUB_TYPE_BNOBT	5	/* freesp by block btree */
 #define XFS_SCRUB_TYPE_CNTBT	6	/* freesp by length btree */
-#define XFS_SCRUB_TYPE_MAX	6
+#define XFS_SCRUB_TYPE_INOBT	7	/* inode btree */
+#define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
+#define XFS_SCRUB_TYPE_MAX	8
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
diff --git a/libxfs/xfs_ialloc.c b/libxfs/xfs_ialloc.c
index 8f7fe60..1445315 100644
--- a/libxfs/xfs_ialloc.c
+++ b/libxfs/xfs_ialloc.c
@@ -93,24 +93,14 @@ xfs_inobt_update(
 	return xfs_btree_update(cur, &rec);
 }
 
-/*
- * Get the data from the pointed-to record.
- */
-int					/* error */
-xfs_inobt_get_rec(
-	struct xfs_btree_cur	*cur,	/* btree cursor */
-	xfs_inobt_rec_incore_t	*irec,	/* btree record */
-	int			*stat)	/* output: success/failure */
+void
+xfs_inobt_btrec_to_irec(
+	struct xfs_mount		*mp,
+	union xfs_btree_rec		*rec,
+	struct xfs_inobt_rec_incore	*irec)
 {
-	union xfs_btree_rec	*rec;
-	int			error;
-
-	error = xfs_btree_get_rec(cur, &rec, stat);
-	if (error || *stat == 0)
-		return error;
-
 	irec->ir_startino = be32_to_cpu(rec->inobt.ir_startino);
-	if (xfs_sb_version_hassparseinodes(&cur->bc_mp->m_sb)) {
+	if (xfs_sb_version_hassparseinodes(&mp->m_sb)) {
 		irec->ir_holemask = be16_to_cpu(rec->inobt.ir_u.sp.ir_holemask);
 		irec->ir_count = rec->inobt.ir_u.sp.ir_count;
 		irec->ir_freecount = rec->inobt.ir_u.sp.ir_freecount;
@@ -125,6 +115,25 @@ xfs_inobt_get_rec(
 				be32_to_cpu(rec->inobt.ir_u.f.ir_freecount);
 	}
 	irec->ir_free = be64_to_cpu(rec->inobt.ir_free);
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int					/* error */
+xfs_inobt_get_rec(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_inobt_rec_incore_t	*irec,	/* btree record */
+	int			*stat)	/* output: success/failure */
+{
+	union xfs_btree_rec	*rec;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (error || *stat == 0)
+		return error;
+
+	xfs_inobt_btrec_to_irec(cur->bc_mp, rec, irec);
 
 	return 0;
 }
diff --git a/libxfs/xfs_ialloc.h b/libxfs/xfs_ialloc.h
index 0bb8966..8e5861d 100644
--- a/libxfs/xfs_ialloc.h
+++ b/libxfs/xfs_ialloc.h
@@ -168,5 +168,8 @@ int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
 int xfs_read_agi(struct xfs_mount *mp, struct xfs_trans *tp,
 		xfs_agnumber_t agno, struct xfs_buf **bpp);
 
+union xfs_btree_rec;
+void xfs_inobt_btrec_to_irec(struct xfs_mount *mp, union xfs_btree_rec *rec,
+		struct xfs_inobt_rec_incore *irec);
 
 #endif	/* __XFS_IALLOC_H__ */
diff --git a/libxfs/xfs_ialloc_btree.c b/libxfs/xfs_ialloc_btree.c
index c2d4a5e..4795f08 100644
--- a/libxfs/xfs_ialloc_btree.c
+++ b/libxfs/xfs_ialloc_btree.c
@@ -151,6 +151,18 @@ xfs_inobt_init_key_from_rec(
 }
 
 STATIC void
+xfs_inobt_init_high_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	__u32			x;
+
+	x = be32_to_cpu(rec->inobt.ir_startino);
+	x += XFS_INODES_PER_CHUNK - 1;
+	key->inobt.ir_startino = cpu_to_be32(x);
+}
+
+STATIC void
 xfs_inobt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
@@ -204,6 +216,16 @@ xfs_inobt_key_diff(
 			  cur->bc_rec.i.ir_startino;
 }
 
+STATIC __int64_t
+xfs_inobt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be32_to_cpu(k1->inobt.ir_startino) -
+			  be32_to_cpu(k2->inobt.ir_startino);
+}
+
 static int
 xfs_inobt_verify(
 	struct xfs_buf		*bp)
@@ -278,7 +300,6 @@ const struct xfs_buf_ops xfs_inobt_buf_ops = {
 	.verify_write = xfs_inobt_write_verify,
 };
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_inobt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -298,7 +319,6 @@ xfs_inobt_recs_inorder(
 	return be32_to_cpu(r1->inobt.ir_startino) + XFS_INODES_PER_CHUNK <=
 		be32_to_cpu(r2->inobt.ir_startino);
 }
-#endif	/* DEBUG */
 
 static const struct xfs_btree_ops xfs_inobt_ops = {
 	.rec_len		= sizeof(xfs_inobt_rec_t),
@@ -311,14 +331,14 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
 	.get_minrecs		= xfs_inobt_get_minrecs,
 	.get_maxrecs		= xfs_inobt_get_maxrecs,
 	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_inobt_init_high_key_from_rec,
 	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_inobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,
 	.buf_ops		= &xfs_inobt_buf_ops,
-#if defined(DEBUG) || defined(XFS_WARN)
+	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,
-#endif
 };
 
 static const struct xfs_btree_ops xfs_finobt_ops = {
@@ -332,14 +352,14 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 	.get_minrecs		= xfs_inobt_get_minrecs,
 	.get_maxrecs		= xfs_inobt_get_maxrecs,
 	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_inobt_init_high_key_from_rec,
 	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,
 	.buf_ops		= &xfs_inobt_buf_ops,
-#if defined(DEBUG) || defined(XFS_WARN)
+	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,
-#endif
 };
 
 /*
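
The two new inobt hooks are small enough to model outside the kernel. The sketch below is an illustration only: it works on host-endian integers instead of union xfs_btree_key, the helper names are hypothetical, and INODES_PER_CHUNK stands in for XFS_INODES_PER_CHUNK with an assumed value of 64. inobt_rec_covers() is an invented helper that shows why a high key is useful for range checks.

```c
#include <stdint.h>

/* Stand-in for XFS_INODES_PER_CHUNK; 64 is assumed here. */
#define INODES_PER_CHUNK	64

/* Highest inode number covered by the chunk that starts at startino. */
static uint32_t
inobt_high_key(uint32_t startino)
{
	return startino + INODES_PER_CHUNK - 1;
}

/*
 * Signed distance between two keys.  Widening to 64 bits before the
 * subtraction keeps the sign correct even when the 32-bit values are
 * far apart.
 */
static int64_t
inobt_diff_two_keys(uint32_t k1, uint32_t k2)
{
	return (int64_t)k1 - (int64_t)k2;
}

/* Does the key range [startino, high key] contain agino? */
static int
inobt_rec_covers(uint32_t startino, uint32_t agino)
{
	return inobt_diff_two_keys(agino, startino) >= 0 &&
	       inobt_diff_two_keys(inobt_high_key(startino), agino) >= 0;
}
```

Note that inobt_high_key() wraps if startino is within 63 of UINT32_MAX; the sketch does not guard against that.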



* [PATCH 15/39] xfs: support scrubbing rmap btree
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Plumb in the pieces necessary to check the rmap btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h         |    3 ++-
 libxfs/xfs_rmap.c       |    3 ++-
 libxfs/xfs_rmap.h       |    3 +++
 libxfs/xfs_rmap_btree.c |    4 ----
 4 files changed, 7 insertions(+), 6 deletions(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 97df1b9..ff3c232 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -607,7 +607,8 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_CNTBT	6	/* freesp by length btree */
 #define XFS_SCRUB_TYPE_INOBT	7	/* inode btree */
 #define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
-#define XFS_SCRUB_TYPE_MAX	8
+#define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
+#define XFS_SCRUB_TYPE_MAX	9
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index 7738f50..97fc4dd 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -177,7 +177,8 @@ xfs_rmap_delete(
 	return error;
 }
 
-static int
+/* Convert an internal btree record to an rmap record. */
+int
 xfs_rmap_btrec_to_irec(
 	union xfs_btree_rec	*rec,
 	struct xfs_rmap_irec	*irec)
diff --git a/libxfs/xfs_rmap.h b/libxfs/xfs_rmap.h
index faf2c1a..3fa4559 100644
--- a/libxfs/xfs_rmap.h
+++ b/libxfs/xfs_rmap.h
@@ -214,5 +214,8 @@ int xfs_rmap_find_left_neighbor(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		uint64_t owner, uint64_t offset, unsigned int flags,
 		struct xfs_rmap_irec *irec, int	*stat);
+union xfs_btree_rec;
+int xfs_rmap_btrec_to_irec(union xfs_btree_rec *rec,
+		struct xfs_rmap_irec *irec);
 
 #endif	/* __XFS_RMAP_H__ */
diff --git a/libxfs/xfs_rmap_btree.c b/libxfs/xfs_rmap_btree.c
index d11112a..4ceed59 100644
--- a/libxfs/xfs_rmap_btree.c
+++ b/libxfs/xfs_rmap_btree.c
@@ -375,7 +375,6 @@ const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
 	.verify_write		= xfs_rmapbt_write_verify,
 };
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_rmapbt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -435,7 +434,6 @@ xfs_rmapbt_recs_inorder(
 		return 1;
 	return 0;
 }
-#endif	/* DEBUG */
 
 static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
@@ -454,10 +452,8 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.key_diff		= xfs_rmapbt_key_diff,
 	.buf_ops		= &xfs_rmapbt_buf_ops,
 	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
-#if defined(DEBUG) || defined(XFS_WARN)
 	.keys_inorder		= xfs_rmapbt_keys_inorder,
 	.recs_inorder		= xfs_rmapbt_recs_inorder,
-#endif
 };
 
 /*



* [PATCH 16/39] xfs: support scrubbing refcount btree
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Plumb in the pieces necessary to check the refcount btree.  If rmap is
available, check the reference count by performing an interval query
against the rmapbt.

v2: Handle the case where the rmap records are not all at least the
length of the refcount extent.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h             |    3 ++-
 libxfs/xfs_refcount_btree.c |    4 ----
 2 files changed, 2 insertions(+), 5 deletions(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index ff3c232..91146ac 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -608,7 +608,8 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_INOBT	7	/* inode btree */
 #define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
 #define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
-#define XFS_SCRUB_TYPE_MAX	9
+#define XFS_SCRUB_TYPE_REFCNTBT	10	/* reference count btree */
+#define XFS_SCRUB_TYPE_MAX	10
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
diff --git a/libxfs/xfs_refcount_btree.c b/libxfs/xfs_refcount_btree.c
index 50c4682..9276d4f 100644
--- a/libxfs/xfs_refcount_btree.c
+++ b/libxfs/xfs_refcount_btree.c
@@ -284,7 +284,6 @@ const struct xfs_buf_ops xfs_refcountbt_buf_ops = {
 	.verify_write		= xfs_refcountbt_write_verify,
 };
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_refcountbt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -305,7 +304,6 @@ xfs_refcountbt_recs_inorder(
 		be32_to_cpu(r1->refc.rc_blockcount) <=
 		be32_to_cpu(r2->refc.rc_startblock);
 }
-#endif
 
 static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.rec_len		= sizeof(struct xfs_refcount_rec),
@@ -324,10 +322,8 @@ static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.key_diff		= xfs_refcountbt_key_diff,
 	.buf_ops		= &xfs_refcountbt_buf_ops,
 	.diff_two_keys		= xfs_refcountbt_diff_two_keys,
-#if defined(DEBUG) || defined(XFS_WARN)
 	.keys_inorder		= xfs_refcountbt_keys_inorder,
 	.recs_inorder		= xfs_refcountbt_recs_inorder,
-#endif
 };
 
 /*
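
The kernel side of the rmap cross-check is not part of this patch, so the following is only one possible reading of the commit message: every block of a refcount extent should be covered by exactly "refcount" reverse mappings, and the mappings may be shorter than the extent (the v2 case). struct toy_rmap and refcount_matches_rmaps() are hypothetical names, and the nested loop is a brute-force stand-in for an rmapbt interval query.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian reverse mapping: one owner of a block range. */
struct toy_rmap {
	uint32_t	startblock;
	uint32_t	blockcount;
};

/*
 * Check that each block in [start, start + len) is covered by exactly
 * refcount mappings.  Counting per block copes with mappings that cover
 * only part of the refcount extent.
 */
static bool
refcount_matches_rmaps(uint32_t start, uint32_t len, uint32_t refcount,
		       const struct toy_rmap *rmaps, size_t nr)
{
	uint64_t	end = (uint64_t)start + len;

	for (uint64_t b = start; b < end; b++) {
		uint32_t	owners = 0;

		for (size_t i = 0; i < nr; i++)
			if (rmaps[i].startblock <= b &&
			    b - rmaps[i].startblock < rmaps[i].blockcount)
				owners++;
		if (owners != refcount)
			return false;
	}
	return true;
}
```

In the usage below, two short mappings (4 and 6 blocks) together with one full-length mapping still give every block two owners, which a check that only counts full-length mappings would miss.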



* [PATCH 17/39] xfs: scrub inodes
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Scrub the fields within an inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 91146ac..7a794b4 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -609,7 +609,8 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
 #define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
 #define XFS_SCRUB_TYPE_REFCNTBT	10	/* reference count btree */
-#define XFS_SCRUB_TYPE_MAX	10
+#define XFS_SCRUB_TYPE_INODE	11	/* inode record */
+#define XFS_SCRUB_TYPE_MAX	11
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */



* [PATCH 18/39] xfs: scrub inode block mappings
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Scrub an individual inode's block mappings to make sure they make sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_bmap_btree.c |   26 ++++++++++++++++++++++----
 libxfs/xfs_fs.h         |    5 ++++-
 2 files changed, 26 insertions(+), 5 deletions(-)


diff --git a/libxfs/xfs_bmap_btree.c b/libxfs/xfs_bmap_btree.c
index 2997a3a..78a0440 100644
--- a/libxfs/xfs_bmap_btree.c
+++ b/libxfs/xfs_bmap_btree.c
@@ -620,6 +620,16 @@ xfs_bmbt_init_key_from_rec(
 }
 
 STATIC void
+xfs_bmbt_init_high_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	key->bmbt.br_startoff = cpu_to_be64(
+			xfs_bmbt_disk_get_startoff(&rec->bmbt) +
+			xfs_bmbt_disk_get_blockcount(&rec->bmbt) - 1);
+}
+
+STATIC void
 xfs_bmbt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
@@ -644,6 +654,16 @@ xfs_bmbt_key_diff(
 				      cur->bc_rec.b.br_startoff;
 }
 
+STATIC __int64_t
+xfs_bmbt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be64_to_cpu(k1->bmbt.br_startoff) -
+			  be64_to_cpu(k2->bmbt.br_startoff);
+}
+
 static bool
 xfs_bmbt_verify(
 	struct xfs_buf		*bp)
@@ -734,7 +754,6 @@ const struct xfs_buf_ops xfs_bmbt_buf_ops = {
 };
 
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_bmbt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -755,7 +774,6 @@ xfs_bmbt_recs_inorder(
 		xfs_bmbt_disk_get_blockcount(&r1->bmbt) <=
 		xfs_bmbt_disk_get_startoff(&r2->bmbt);
 }
-#endif	/* DEBUG */
 
 static const struct xfs_btree_ops xfs_bmbt_ops = {
 	.rec_len		= sizeof(xfs_bmbt_rec_t),
@@ -769,14 +787,14 @@ static const struct xfs_btree_ops xfs_bmbt_ops = {
 	.get_minrecs		= xfs_bmbt_get_minrecs,
 	.get_dmaxrecs		= xfs_bmbt_get_dmaxrecs,
 	.init_key_from_rec	= xfs_bmbt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_bmbt_init_high_key_from_rec,
 	.init_rec_from_cur	= xfs_bmbt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_bmbt_init_ptr_from_cur,
 	.key_diff		= xfs_bmbt_key_diff,
+	.diff_two_keys		= xfs_bmbt_diff_two_keys,
 	.buf_ops		= &xfs_bmbt_buf_ops,
-#if defined(DEBUG) || defined(XFS_WARN)
 	.keys_inorder		= xfs_bmbt_keys_inorder,
 	.recs_inorder		= xfs_bmbt_recs_inorder,
-#endif
 };
 
 /*
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 7a794b4..fc8efef 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -610,7 +610,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
 #define XFS_SCRUB_TYPE_REFCNTBT	10	/* reference count btree */
 #define XFS_SCRUB_TYPE_INODE	11	/* inode record */
-#define XFS_SCRUB_TYPE_MAX	11
+#define XFS_SCRUB_TYPE_BMBTD	12	/* data fork block mapping */
+#define XFS_SCRUB_TYPE_BMBTA	13	/* attr fork block mapping */
+#define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
+#define XFS_SCRUB_TYPE_MAX	14
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */



* [PATCH 19/39] xfs: scrub directory/attribute btrees
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Provide a way to check the shape and scrub the hashes and records
in a directory or extended attribute btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_dir2_node.c |   28 ++++++++++++++++++++++++++++
 libxfs/xfs_dir2_priv.h |    2 ++
 2 files changed, 30 insertions(+)


diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c
index b75b432..56f00f7 100644
--- a/libxfs/xfs_dir2_node.c
+++ b/libxfs/xfs_dir2_node.c
@@ -478,6 +478,34 @@ xfs_dir2_free_hdr_check(
  * Stale entries are ok.
  */
 xfs_dahash_t					/* hash value */
+xfs_dir2_leaf1_lasthash(
+	struct xfs_inode *dp,
+	struct xfs_buf	*bp,			/* leaf buffer */
+	int		*count)			/* count of entries in leaf */
+{
+	struct xfs_dir2_leaf	*leaf = bp->b_addr;
+	struct xfs_dir2_leaf_entry *ents;
+	struct xfs_dir3_icleaf_hdr leafhdr;
+
+	dp->d_ops->leaf_hdr_from_disk(&leafhdr, leaf);
+
+	ASSERT(leafhdr.magic == XFS_DIR2_LEAF1_MAGIC ||
+	       leafhdr.magic == XFS_DIR3_LEAF1_MAGIC);
+
+	if (count)
+		*count = leafhdr.count;
+	if (!leafhdr.count)
+		return 0;
+
+	ents = dp->d_ops->leaf_ents_p(leaf);
+	return be32_to_cpu(ents[leafhdr.count - 1].hashval);
+}
+
+/*
+ * Return the last hash value in the leaf.
+ * Stale entries are ok.
+ */
+xfs_dahash_t					/* hash value */
 xfs_dir2_leafn_lasthash(
 	struct xfs_inode *dp,
 	struct xfs_buf	*bp,			/* leaf buffer */
diff --git a/libxfs/xfs_dir2_priv.h b/libxfs/xfs_dir2_priv.h
index d04547f..1abd314 100644
--- a/libxfs/xfs_dir2_priv.h
+++ b/libxfs/xfs_dir2_priv.h
@@ -93,6 +93,8 @@ extern bool xfs_dir3_leaf_check_int(struct xfs_mount *mp, struct xfs_inode *dp,
 /* xfs_dir2_node.c */
 extern int xfs_dir2_leaf_to_node(struct xfs_da_args *args,
 		struct xfs_buf *lbp);
+extern xfs_dahash_t xfs_dir2_leaf1_lasthash(struct xfs_inode *dp,
+		struct xfs_buf *bp, int *count);
 extern xfs_dahash_t xfs_dir2_leafn_lasthash(struct xfs_inode *dp,
 		struct xfs_buf *bp, int *count);
 extern int xfs_dir2_leafn_lookup_int(struct xfs_buf *bp,
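
The logic of xfs_dir2_leaf1_lasthash() can be modeled without the buffer and header decoding. This is a sketch under assumptions: struct toy_leaf_entry and toy_leaf_lasthash() are hypothetical, the entries are host-endian, and they are assumed to be sorted by hash value, so the last entry carries the largest hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian leaf entry. */
struct toy_leaf_entry {
	uint32_t	hashval;
	uint32_t	address;
};

/*
 * Return the hash of the last entry, or 0 for an empty leaf.  If count
 * is non-NULL, also report the number of entries.
 */
static uint32_t
toy_leaf_lasthash(const struct toy_leaf_entry *ents, int nents, int *count)
{
	if (count)
		*count = nents;
	if (nents <= 0)
		return 0;
	return ents[nents - 1].hashval;
}
```

A caller walking the tree can compare this value with the hash stored in the parent node entry that points at the leaf.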



* [PATCH 20/39] xfs: scrub directories
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Scrub all the entries, hash tree, and freespace data in a
directory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_dir2_priv.h |    2 ++
 libxfs/xfs_fs.h        |    3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_dir2_priv.h b/libxfs/xfs_dir2_priv.h
index 1abd314..5e54571 100644
--- a/libxfs/xfs_dir2_priv.h
+++ b/libxfs/xfs_dir2_priv.h
@@ -131,5 +131,7 @@ extern int xfs_dir2_sf_replace(struct xfs_da_args *args);
 /* xfs_dir2_readdir.c */
 extern int xfs_readdir(struct xfs_inode *dp, struct dir_context *ctx,
 		       size_t bufsize);
+extern int xfs_readdir_locked(struct xfs_trans *tp, struct xfs_inode *dp,
+		       struct dir_context *ctx, size_t bufsize);
 
 #endif /* __XFS_DIR2_PRIV_H__ */
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index fc8efef..5dfa0bb 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -613,7 +613,8 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_BMBTD	12	/* data fork block mapping */
 #define XFS_SCRUB_TYPE_BMBTA	13	/* attr fork block mapping */
 #define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
-#define XFS_SCRUB_TYPE_MAX	14
+#define XFS_SCRUB_TYPE_DIR	15	/* directory */
+#define XFS_SCRUB_TYPE_MAX	15
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */



* [PATCH 21/39] xfs: scrub extended attributes
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Scrub the hash tree, keys, and values in an extended attribute structure.
Refactor the attribute code to use the transaction if the caller supplied
one to avoid buffer deadlocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_attr.c        |   26 ++++++++++++++++++--------
 libxfs/xfs_attr_remote.c |    5 +++--
 libxfs/xfs_fs.h          |    3 ++-
 3 files changed, 23 insertions(+), 11 deletions(-)


diff --git a/libxfs/xfs_attr.c b/libxfs/xfs_attr.c
index 60513f9..cd9cda1 100644
--- a/libxfs/xfs_attr.c
+++ b/libxfs/xfs_attr.c
@@ -109,6 +109,23 @@ xfs_inode_hasattr(
  * Overall external interface routines.
  *========================================================================*/
 
+/* Retrieve an extended attribute and its value.  Must have ilock. */
+int
+xfs_attr_get_locked(
+	struct xfs_inode	*ip,
+	struct xfs_da_args	*args)
+{
+	if (!xfs_inode_hasattr(ip))
+		return -ENOATTR;
+	else if (ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL)
+		return xfs_attr_shortform_getvalue(args);
+	else if (xfs_bmap_one_block(ip, XFS_ATTR_FORK))
+		return xfs_attr_leaf_get(args);
+	else
+		return xfs_attr_node_get(args);
+}
+
+/* Retrieve an extended attribute by name, and its value. */
 int
 xfs_attr_get(
 	struct xfs_inode	*ip,
@@ -139,14 +156,7 @@ xfs_attr_get(
 	args.op_flags = XFS_DA_OP_OKNOENT;
 
 	lock_mode = xfs_ilock_attr_map_shared(ip);
-	if (!xfs_inode_hasattr(ip))
-		error = -ENOATTR;
-	else if (ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL)
-		error = xfs_attr_shortform_getvalue(&args);
-	else if (xfs_bmap_one_block(ip, XFS_ATTR_FORK))
-		error = xfs_attr_leaf_get(&args);
-	else
-		error = xfs_attr_node_get(&args);
+	error = xfs_attr_get_locked(ip, &args);
 	xfs_iunlock(ip, lock_mode);
 
 	*valuelenp = args.valuelen;
diff --git a/libxfs/xfs_attr_remote.c b/libxfs/xfs_attr_remote.c
index abe1705..b7040ec 100644
--- a/libxfs/xfs_attr_remote.c
+++ b/libxfs/xfs_attr_remote.c
@@ -381,7 +381,8 @@ xfs_attr_rmtval_get(
 			       (map[i].br_startblock != HOLESTARTBLOCK));
 			dblkno = XFS_FSB_TO_DADDR(mp, map[i].br_startblock);
 			dblkcnt = XFS_FSB_TO_BB(mp, map[i].br_blockcount);
-			error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp,
+			error = xfs_trans_read_buf(mp, args->trans,
+						   mp->m_ddev_targp,
 						   dblkno, dblkcnt, 0, &bp,
 						   &xfs_attr3_rmt_buf_ops);
 			if (error)
@@ -390,7 +391,7 @@ xfs_attr_rmtval_get(
 			error = xfs_attr_rmtval_copyout(mp, bp, args->dp->i_ino,
 							&offset, &valuelen,
 							&dst);
-			xfs_buf_relse(bp);
+			xfs_trans_brelse(args->trans, bp);
 			if (error)
 				return error;
 
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 5dfa0bb..c5d7dd9 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -614,7 +614,8 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_BMBTA	13	/* attr fork block mapping */
 #define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
 #define XFS_SCRUB_TYPE_DIR	15	/* directory */
-#define XFS_SCRUB_TYPE_MAX	15
+#define XFS_SCRUB_TYPE_XATTR	16	/* extended attribute */
+#define XFS_SCRUB_TYPE_MAX	16
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
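
The xfs_attr_get()/xfs_attr_get_locked() split follows a common pattern: the public function takes the lock and delegates to a helper that a caller already holding the lock can use directly. The toy model below sketches that pattern with the same format dispatch; all toy_* names are hypothetical, the lock is a plain flag, -ENODATA is used as a stand-in for -ENOATTR, and the -EINVAL check is an addition for illustration that the patch itself does not make.

```c
#include <errno.h>
#include <stdbool.h>

enum toy_attr_format {
	TOY_ATTR_NONE,		/* no attribute fork */
	TOY_ATTR_SHORTFORM,	/* attributes stored inline in the inode */
	TOY_ATTR_LEAF,		/* attribute fork is a single block */
	TOY_ATTR_NODE,		/* attribute fork is a multi-block tree */
};

/* Which lookup path served the request; positive so errors stay negative. */
enum toy_attr_path {
	TOY_PATH_SHORTFORM = 1,
	TOY_PATH_LEAF,
	TOY_PATH_NODE,
};

struct toy_inode {
	bool			locked;
	enum toy_attr_format	fmt;
};

/* Caller must already hold the lock. */
static int
toy_attr_get_locked(const struct toy_inode *ip)
{
	if (!ip->locked)
		return -EINVAL;	/* illustration only */

	switch (ip->fmt) {
	case TOY_ATTR_NONE:
		return -ENODATA;
	case TOY_ATTR_SHORTFORM:
		return TOY_PATH_SHORTFORM;
	case TOY_ATTR_LEAF:
		return TOY_PATH_LEAF;
	default:
		return TOY_PATH_NODE;
	}
}

/* Public entry point: take the lock, delegate, drop the lock. */
static int
toy_attr_get(struct toy_inode *ip)
{
	int	error;

	ip->locked = true;
	error = toy_attr_get_locked(ip);
	ip->locked = false;
	return error;
}
```

A scrubber that has locked the inode for its whole pass would call the _locked variant, since taking the lock a second time could deadlock.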



* [PATCH 22/39] xfs: scrub symbolic links
From: Darrick J. Wong @ 2016-11-05  0:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Create the infrastructure to scrub symbolic link data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index c5d7dd9..2709d1f 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -615,7 +615,8 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
 #define XFS_SCRUB_TYPE_DIR	15	/* directory */
 #define XFS_SCRUB_TYPE_XATTR	16	/* extended attribute */
-#define XFS_SCRUB_TYPE_MAX	16
+#define XFS_SCRUB_TYPE_SYMLINK	17	/* symbolic link */
+#define XFS_SCRUB_TYPE_MAX	17
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */



* [PATCH 23/39] xfs: scrub realtime bitmap/summary
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Perform simple tests of the realtime bitmap and summary.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_format.h   |    5 +++++
 libxfs/xfs_fs.h       |    4 +++-
 libxfs/xfs_rtbitmap.c |    2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)


diff --git a/libxfs/xfs_format.h b/libxfs/xfs_format.h
index d819d00..46e4794 100644
--- a/libxfs/xfs_format.h
+++ b/libxfs/xfs_format.h
@@ -315,6 +315,11 @@ static inline bool xfs_sb_good_version(struct xfs_sb *sbp)
 	return false;
 }
 
+static inline bool xfs_sb_version_hasrealtime(struct xfs_sb *sbp)
+{
+	return sbp->sb_rblocks > 0;
+}
+
 /*
  * Detect a mismatched features2 field.  Older kernels read/wrote
  * this into the wrong slot, so to be safe we keep them in sync.
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 2709d1f..c2275b1 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -616,7 +616,9 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_DIR	15	/* directory */
 #define XFS_SCRUB_TYPE_XATTR	16	/* extended attribute */
 #define XFS_SCRUB_TYPE_SYMLINK	17	/* symbolic link */
-#define XFS_SCRUB_TYPE_MAX	17
+#define XFS_SCRUB_TYPE_RTBITMAP	18	/* realtime bitmap */
+#define XFS_SCRUB_TYPE_RTSUM	19	/* realtime summary */
+#define XFS_SCRUB_TYPE_MAX	19
 
 #define XFS_SCRUB_FLAG_REPAIR	0x1	/* i: repair this metadata */
 #define XFS_SCRUB_FLAG_CORRUPT	0x2	/* o: needs repair */
diff --git a/libxfs/xfs_rtbitmap.c b/libxfs/xfs_rtbitmap.c
index 36fe323..70ea975 100644
--- a/libxfs/xfs_rtbitmap.c
+++ b/libxfs/xfs_rtbitmap.c
@@ -65,7 +65,7 @@ const struct xfs_buf_ops xfs_rtbuf_ops = {
  * Get a buffer for the bitmap or summary file block specified.
  * The buffer is returned read and locked.
  */
-static int
+int
 xfs_rtbuf_get(
 	xfs_mount_t	*mp,		/* file system mount structure */
 	xfs_trans_t	*tp,		/* transaction pointer */



* [PATCH 24/39] xfs: scrub should cross-reference with the bnobt
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

When we're scrubbing various btrees, cross-reference the records with
the bnobt to ensure that we don't also think the space is free.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc.c |   19 +++++++++++++++++++
 libxfs/xfs_alloc.h |    3 +++
 2 files changed, 22 insertions(+)


diff --git a/libxfs/xfs_alloc.c b/libxfs/xfs_alloc.c
index 0cb8db7..76622d1 100644
--- a/libxfs/xfs_alloc.c
+++ b/libxfs/xfs_alloc.c
@@ -2984,3 +2984,22 @@ xfs_alloc_query_all(
 	query.fn = fn;
 	return xfs_btree_query_all(cur, xfs_alloc_query_range_helper, &query);
 }
+
+/* Is there a record covering a given extent? */
+int
+xfs_alloc_has_record(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			*exists)
+{
+	union xfs_btree_irec	low;
+	union xfs_btree_irec	high;
+
+	memset(&low, 0, sizeof(low));
+	low.a.ar_startblock = bno;
+	memset(&high, 0xFF, sizeof(high));
+	high.a.ar_startblock = bno + len - 1;
+
+	return xfs_btree_has_record(cur, &low, &high, exists);
+}
diff --git a/libxfs/xfs_alloc.h b/libxfs/xfs_alloc.h
index 89a23be..3fd6540 100644
--- a/libxfs/xfs_alloc.h
+++ b/libxfs/xfs_alloc.h
@@ -237,4 +237,7 @@ int xfs_alloc_query_range(struct xfs_btree_cur *cur,
 int xfs_alloc_query_all(struct xfs_btree_cur *cur, xfs_alloc_query_range_fn fn,
 		void *priv);
 
+int xfs_alloc_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		xfs_extlen_t len, bool *exists);
+
 #endif	/* __XFS_ALLOC_H__ */
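
For readers following along, the overlap test behind xfs_alloc_has_record()
can be sketched in plain userspace C.  This is only a minimal model under
stated assumptions: struct free_extent and range_has_free_extent() are
hypothetical names, and a linear scan over an array stands in for the
btree range query that the patch delegates to xfs_btree_has_record().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one free-space record. */
struct free_extent {
	uint32_t	start;	/* first block of the free extent */
	uint32_t	len;	/* number of blocks, must be nonzero */
};

/*
 * Return true if any record in recs[] overlaps the inclusive block
 * range [bno, bno + len - 1].  len must be nonzero.
 */
static bool
range_has_free_extent(
	const struct free_extent	*recs,
	size_t				nr,
	uint32_t			bno,
	uint32_t			len)
{
	/* Widen to 64 bits so that bno + len - 1 cannot wrap. */
	uint64_t	low = bno;
	uint64_t	high = (uint64_t)bno + len - 1;
	size_t		i;

	for (i = 0; i < nr; i++) {
		uint64_t	rec_low = recs[i].start;
		uint64_t	rec_high = rec_low + recs[i].len - 1;

		/* Two inclusive ranges overlap iff each starts before the other ends. */
		if (rec_low <= high && rec_high >= low)
			return true;
	}
	return false;
}
```

A scrubber would treat a true result as a conflict: space that some other
structure claims to own should not also appear as free.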



* [PATCH 25/39] xfs: cross-reference bnobt records with cntbt
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (23 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 24/39] xfs: scrub should cross-reference with the bnobt Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:27 ` [PATCH 26/39] xfs: cross-reference inode btrees during scrub Darrick J. Wong
                   ` (13 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Scrub should make sure that each bnobt record has a corresponding
cntbt record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc.c |    2 +-
 libxfs/xfs_alloc.h |    7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_alloc.c b/libxfs/xfs_alloc.c
index 76622d1..3b8323d 100644
--- a/libxfs/xfs_alloc.c
+++ b/libxfs/xfs_alloc.c
@@ -165,7 +165,7 @@ xfs_alloc_lookup_ge(
  * Lookup the first record less than or equal to [bno, len]
  * in the btree given by cur.
  */
-static int				/* error */
+int					/* error */
 xfs_alloc_lookup_le(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	xfs_agblock_t		bno,	/* starting block of extent */
diff --git a/libxfs/xfs_alloc.h b/libxfs/xfs_alloc.h
index 3fd6540..b79159c 100644
--- a/libxfs/xfs_alloc.h
+++ b/libxfs/xfs_alloc.h
@@ -202,6 +202,13 @@ xfs_free_extent(
 	enum xfs_ag_resv_type	type);	/* block reservation type */
 
 int				/* error */
+xfs_alloc_lookup_le(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_agblock_t		bno,	/* starting block of extent */
+	xfs_extlen_t		len,	/* length of extent */
+	int			*stat);	/* success/failure */
+
+int				/* error */
 xfs_alloc_lookup_ge(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	xfs_agblock_t		bno,	/* starting block of extent */



* [PATCH 26/39] xfs: cross-reference inode btrees during scrub
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (24 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 25/39] xfs: cross-reference bnobt records with cntbt Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:27 ` [PATCH 27/39] xfs: cross-reference reverse-mapping btree Darrick J. Wong
                   ` (12 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Cross-reference the inode btrees with the other metadata when we
scrub the filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_ialloc.c |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_ialloc.h |    4 ++++
 2 files changed, 62 insertions(+)


diff --git a/libxfs/xfs_ialloc.c b/libxfs/xfs_ialloc.c
index 1445315..3d02a16 100644
--- a/libxfs/xfs_ialloc.c
+++ b/libxfs/xfs_ialloc.c
@@ -2660,3 +2660,61 @@ xfs_ialloc_pagi_init(
 		xfs_trans_brelse(tp, bp);
 	return 0;
 }
+
+/* Is there an inode record covering a given range of inode numbers? */
+int
+xfs_ialloc_has_inode_record(
+	struct xfs_btree_cur	*cur,
+	xfs_agino_t		low,
+	xfs_agino_t		high,
+	bool			*exists)
+{
+	struct xfs_inobt_rec_incore	irec;
+	xfs_agino_t		agino;
+	__uint16_t		holemask;
+	int			has;
+	int			i;
+	int			error;
+
+	*exists = false;
+	error = xfs_inobt_lookup(cur, low, XFS_LOOKUP_LE, &has);
+	while (error == 0 && has) {
+		error = xfs_inobt_get_rec(cur, &irec, &has);
+		if (error || irec.ir_startino > high)
+			break;
+
+		agino = irec.ir_startino;
+		holemask = irec.ir_holemask;
+		for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1,
+				i++, agino += XFS_INODES_PER_HOLEMASK_BIT) {
+			if (holemask & 1)
+				continue;
+			if (agino + XFS_INODES_PER_HOLEMASK_BIT > low &&
+					agino <= high) {
+				*exists = true;
+				goto out;
+			}
+		}
+
+		error = xfs_btree_increment(cur, 0, &has);
+	}
+out:
+	return error;
+}
+
+/* Is there an inode record covering a given extent? */
+int
+xfs_ialloc_has_inodes_at_extent(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			*exists)
+{
+	xfs_agino_t		low;
+	xfs_agino_t		high;
+
+	low = XFS_OFFBNO_TO_AGINO(cur->bc_mp, bno, 0);
+	high = XFS_OFFBNO_TO_AGINO(cur->bc_mp, bno + len, 0) - 1;
+
+	return xfs_ialloc_has_inode_record(cur, low, high, exists);
+}
diff --git a/libxfs/xfs_ialloc.h b/libxfs/xfs_ialloc.h
index 8e5861d..f20d958 100644
--- a/libxfs/xfs_ialloc.h
+++ b/libxfs/xfs_ialloc.h
@@ -171,5 +171,9 @@ int xfs_read_agi(struct xfs_mount *mp, struct xfs_trans *tp,
 union xfs_btree_rec;
 void xfs_inobt_btrec_to_irec(struct xfs_mount *mp, union xfs_btree_rec *rec,
 		struct xfs_inobt_rec_incore *irec);
+int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur,
+		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
+int xfs_ialloc_has_inode_record(struct xfs_btree_cur *cur, xfs_agino_t low,
+		xfs_agino_t high, bool *exists);
 
 #endif	/* __XFS_IALLOC_H__ */
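
The holemask walk in xfs_ialloc_has_inode_record() can be tried out in
isolation with the sketch below.  It assumes the usual XFS geometry of 64
inodes per chunk and a 16-bit holemask (so four inodes per holemask bit),
and that a set bit marks a hole.  chunk_has_inodes_in_range() and the two
macros are hypothetical names used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 64 inodes per chunk, 16 holemask bits. */
#define HOLEMASK_BITS			16
#define INODES_PER_HOLEMASK_BIT		4

/*
 * Does one inode chunk contain any allocated (non-hole) inodes inside
 * the inclusive range [low, high]?  A set holemask bit is a hole.
 */
static bool
chunk_has_inodes_in_range(
	uint32_t	startino,
	uint16_t	holemask,
	uint32_t	low,
	uint32_t	high)
{
	uint32_t	agino = startino;
	int		i;

	for (i = 0; i < HOLEMASK_BITS;
	     i++, holemask >>= 1, agino += INODES_PER_HOLEMASK_BIT) {
		/* Skip groups of inodes that are not backed by disk space. */
		if (holemask & 1)
			continue;
		/* The group [agino, agino + 3] intersects [low, high]. */
		if (agino + INODES_PER_HOLEMASK_BIT > low && agino <= high)
			return true;
	}
	return false;
}
```

With a chunk starting at inode 64 and a holemask of 0x0001, the first
four inodes are a hole, so a query for inodes 64 through 67 finds nothing
while a query for 64 through 68 does.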



* [PATCH 27/39] xfs: cross-reference reverse-mapping btree
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (25 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 26/39] xfs: cross-reference inode btrees during scrub Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:27 ` [PATCH 28/39] xfs: cross-reference refcount btree during scrub Darrick J. Wong
                   ` (11 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

When scrubbing various btrees, we should cross-reference the records
with the reverse mapping btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_rmap.c |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_rmap.h |    5 +++++
 2 files changed, 63 insertions(+)


diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index 97fc4dd..0b8eed1 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -2304,3 +2304,61 @@ xfs_rmap_free_extent(
 	return __xfs_rmap_add(mp, dfops, XFS_RMAP_FREE, owner,
 			XFS_DATA_FORK, &bmap);
 }
+
+/* Is there a record covering a given extent? */
+int
+xfs_rmap_has_record(
+	struct xfs_btree_cur	*cur,
+	xfs_fsblock_t		bno,
+	xfs_filblks_t		len,
+	bool			*exists)
+{
+	union xfs_btree_irec	low;
+	union xfs_btree_irec	high;
+
+	memset(&low, 0, sizeof(low));
+	low.r.rm_startblock = bno;
+	memset(&high, 0xFF, sizeof(high));
+	high.r.rm_startblock = bno + len - 1;
+
+	return xfs_btree_has_record(cur, &low, &high, exists);
+}
+
+/* Is there an rmap for this owner info that covers the entire extent? */
+int
+xfs_rmap_record_exists(
+	struct xfs_btree_cur	*cur,
+	xfs_fsblock_t		bno,
+	xfs_filblks_t		len,
+	struct xfs_owner_info	*oinfo,
+	bool			*has_rmap)
+{
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags;
+	int			stat;
+	struct xfs_rmap_irec	irec;
+	int			error;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+
+	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, flags, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*has_rmap = false;
+		return 0;
+	}
+
+	error = xfs_rmap_get_rec(cur, &irec, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*has_rmap = false;
+		return 0;
+	}
+
+	*has_rmap = (irec.rm_startblock <= bno &&
+		     irec.rm_startblock + irec.rm_blockcount >= bno + len);
+	return 0;
+}
diff --git a/libxfs/xfs_rmap.h b/libxfs/xfs_rmap.h
index 3fa4559..ea359ab 100644
--- a/libxfs/xfs_rmap.h
+++ b/libxfs/xfs_rmap.h
@@ -217,5 +217,10 @@ int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 union xfs_btree_rec;
 int xfs_rmap_btrec_to_irec(union xfs_btree_rec *rec,
 		struct xfs_rmap_irec *irec);
+int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_fsblock_t bno,
+		xfs_filblks_t len, bool *exists);
+int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_fsblock_t bno,
+		xfs_filblks_t len, struct xfs_owner_info *oinfo,
+		bool *has_rmap);
 
 #endif	/* __XFS_RMAP_H__ */
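
The two rmap helpers answer different questions: xfs_rmap_has_record()
asks whether anything overlaps the range, whereas xfs_rmap_record_exists()
asks whether the one record found by the lookup covers the whole extent.
The containment test at the end of the latter can be sketched as follows;
struct rmap_rec and rmap_covers_extent() are hypothetical names, and the
btree lookup itself is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of an rmap record used here. */
struct rmap_rec {
	uint64_t	start;	/* first block mapped by the record */
	uint64_t	count;	/* number of blocks mapped */
};

/*
 * Does rec cover every block of [bno, bno + len)?  The record has to
 * start at or before bno and end at or after the end of the extent.
 */
static bool
rmap_covers_extent(
	const struct rmap_rec	*rec,
	uint64_t		bno,
	uint64_t		len)
{
	return rec->start <= bno &&
	       rec->start + rec->count >= bno + len;
}
```

A record that only partially overlaps the extent fails this test even
though an overlap query would report it.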



* [PATCH 28/39] xfs: cross-reference refcount btree during scrub
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (26 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 27/39] xfs: cross-reference reverse-mapping btree Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:27 ` [PATCH 29/39] xfs: scrub should cross-reference the realtime bitmap Darrick J. Wong
                   ` (10 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

During metadata btree scrub, we should cross-reference with the
reference counts.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_refcount.c |   19 +++++++++++++++++++
 libxfs/xfs_refcount.h |    3 +++
 2 files changed, 22 insertions(+)


diff --git a/libxfs/xfs_refcount.c b/libxfs/xfs_refcount.c
index 0508ec3..124d5c0 100644
--- a/libxfs/xfs_refcount.c
+++ b/libxfs/xfs_refcount.c
@@ -1695,3 +1695,22 @@ xfs_refcount_recover_cow_leftovers(
 	xfs_trans_cancel(tp);
 	goto out_free;
 }
+
+/* Is there a record covering a given extent? */
+int
+xfs_refcount_has_record(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			*exists)
+{
+	union xfs_btree_irec	low;
+	union xfs_btree_irec	high;
+
+	memset(&low, 0, sizeof(low));
+	low.rc.rc_startblock = bno;
+	memset(&high, 0xFF, sizeof(high));
+	high.rc.rc_startblock = bno + len - 1;
+
+	return xfs_btree_has_record(cur, &low, &high, exists);
+}
diff --git a/libxfs/xfs_refcount.h b/libxfs/xfs_refcount.h
index 098dc66..78cb142 100644
--- a/libxfs/xfs_refcount.h
+++ b/libxfs/xfs_refcount.h
@@ -67,4 +67,7 @@ extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
 extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
 		xfs_agnumber_t agno);
 
+extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
+		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
+
 #endif	/* __XFS_REFCOUNT_H__ */



* [PATCH 29/39] xfs: scrub should cross-reference the realtime bitmap
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (27 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 28/39] xfs: cross-reference refcount btree during scrub Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:27 ` [PATCH 30/39] xfs: add helper routines for the repair code Darrick J. Wong
                   ` (9 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

While we're scrubbing various btrees, cross-reference the records
with the realtime bitmap.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_rtbitmap.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)


diff --git a/libxfs/xfs_rtbitmap.c b/libxfs/xfs_rtbitmap.c
index 70ea975..80fff5f 100644
--- a/libxfs/xfs_rtbitmap.c
+++ b/libxfs/xfs_rtbitmap.c
@@ -1011,3 +1011,33 @@ xfs_rtfree_extent(
 	}
 	return 0;
 }
+
+/* Is the given extent all free? */
+int
+xfs_rtbitmap_extent_is_free(
+	struct xfs_mount		*mp,
+	struct xfs_trans		*tp,
+	xfs_rtblock_t			start,
+	xfs_rtblock_t			len,
+	bool				*is_free)
+{
+	xfs_rtblock_t			end;
+	xfs_extlen_t			clen;
+	int				matches;
+	int				error;
+
+	*is_free = false;
+	while (len) {
+		clen = len > ~0U ? ~0U : len;
+		error = xfs_rtcheck_range(mp, tp, start, clen, 1, &end,
+				&matches);
+		if (error || !matches || end < start + clen)
+			return error;
+
+		len -= end - start;
+		start = end + 1;
+	}
+
+	*is_free = true;
+	return 0;
+}
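
The idea of checking a long extent in bounded chunks can be modelled in
userspace as shown below.  This is a sketch under assumptions, not the
libxfs code: the bitmap is a plain byte array in which a set bit means
"free", first_used_block() is a hypothetical stand-in for the range
checker, and its return convention (first non-free block, or one past the
chunk) is chosen for this example and is not claimed to match
xfs_rtcheck_range().  max_chunk must be nonzero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit i set means block i is free (an assumption of this sketch). */
static bool
block_is_free(const uint8_t *bitmap, uint64_t bno)
{
	return (bitmap[bno / 8] >> (bno % 8)) & 1;
}

/*
 * Scan len blocks starting at start.  Returns the first block that is
 * not free, or start + len if every block in the chunk is free.
 */
static uint64_t
first_used_block(const uint8_t *bitmap, uint64_t start, uint32_t len)
{
	uint64_t	bno;

	for (bno = start; bno < start + len; bno++)
		if (!block_is_free(bitmap, bno))
			break;
	return bno;
}

/* Is every block in [start, start + len) free?  Walk in bounded chunks. */
static bool
extent_is_free(
	const uint8_t	*bitmap,
	uint64_t	start,
	uint64_t	len,
	uint32_t	max_chunk)
{
	while (len > 0) {
		/* Clamp each pass to what the narrower length type can hold. */
		uint32_t	clen = len > max_chunk ? max_chunk : (uint32_t)len;
		uint64_t	end = first_used_block(bitmap, start, clen);

		/* The scan stopped early, so some block is in use. */
		if (end < start + clen)
			return false;
		len -= clen;
		start = end;
	}
	return true;
}
```

Passing a small max_chunk makes the chunking visible on a tiny bitmap; a
caller mirroring the patch would pass the largest 32-bit value instead.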



* [PATCH 30/39] xfs: add helper routines for the repair code
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (28 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 29/39] xfs: scrub should cross-reference the realtime bitmap Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:27 ` [PATCH 31/39] xfs: repair inode btrees Darrick J. Wong
                   ` (8 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Add helper functions that the repair code will use to allocate and
initialize new metadata blocks for the btrees that we're rebuilding.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_alloc_btree.c  |    9 ++++++++
 libxfs/xfs_alloc_btree.h  |    2 ++
 libxfs/xfs_bmap_btree.c   |    9 ++++++++
 libxfs/xfs_bmap_btree.h   |    3 +++
 libxfs/xfs_btree.c        |    4 ++--
 libxfs/xfs_btree.h        |    2 +-
 libxfs/xfs_ialloc_btree.c |    9 ++++++++
 libxfs/xfs_ialloc_btree.h |    3 +++
 libxfs/xfs_rmap.c         |   51 +++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_rmap.h         |    3 +++
 10 files changed, 92 insertions(+), 3 deletions(-)


diff --git a/libxfs/xfs_alloc_btree.c b/libxfs/xfs_alloc_btree.c
index 41e2eac..2f4491f 100644
--- a/libxfs/xfs_alloc_btree.c
+++ b/libxfs/xfs_alloc_btree.c
@@ -530,3 +530,12 @@ xfs_allocbt_maxrecs(
 		return blocklen / sizeof(xfs_alloc_rec_t);
 	return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
 }
+
+/* Calculate the freespace btree size for some records. */
+xfs_extlen_t
+xfs_allocbt_calc_size(
+	struct xfs_mount	*mp,
+	unsigned long long	len)
+{
+	return xfs_btree_calc_size(mp, mp->m_alloc_mnr, len);
+}
diff --git a/libxfs/xfs_alloc_btree.h b/libxfs/xfs_alloc_btree.h
index 45e189e..2fd5472 100644
--- a/libxfs/xfs_alloc_btree.h
+++ b/libxfs/xfs_alloc_btree.h
@@ -61,5 +61,7 @@ extern struct xfs_btree_cur *xfs_allocbt_init_cursor(struct xfs_mount *,
 		struct xfs_trans *, struct xfs_buf *,
 		xfs_agnumber_t, xfs_btnum_t);
 extern int xfs_allocbt_maxrecs(struct xfs_mount *, int, int);
+extern xfs_extlen_t xfs_allocbt_calc_size(struct xfs_mount *mp,
+		unsigned long long len);
 
 #endif	/* __XFS_ALLOC_BTREE_H__ */
diff --git a/libxfs/xfs_bmap_btree.c b/libxfs/xfs_bmap_btree.c
index 78a0440..c7c11aa 100644
--- a/libxfs/xfs_bmap_btree.c
+++ b/libxfs/xfs_bmap_btree.c
@@ -909,3 +909,12 @@ xfs_bmbt_change_owner(
 	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
 	return error;
 }
+
+/* Calculate the bmap btree size for some records. */
+unsigned long long
+xfs_bmbt_calc_size(
+	struct xfs_mount	*mp,
+	unsigned long long	len)
+{
+	return xfs_btree_calc_size(mp, mp->m_bmap_dmnr, len);
+}
diff --git a/libxfs/xfs_bmap_btree.h b/libxfs/xfs_bmap_btree.h
index 819a8a4..835f0a3 100644
--- a/libxfs/xfs_bmap_btree.h
+++ b/libxfs/xfs_bmap_btree.h
@@ -140,4 +140,7 @@ extern int xfs_bmbt_change_owner(struct xfs_trans *tp, struct xfs_inode *ip,
 extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
 		struct xfs_trans *, struct xfs_inode *, int);
 
+extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
+		unsigned long long len);
+
 #endif	/* __XFS_BMAP_BTREE_H__ */
diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index 4a291cc..0c7d549 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -4820,7 +4820,7 @@ xfs_btree_query_all(
  * Calculate the number of blocks needed to store a given number of records
  * in a short-format (per-AG metadata) btree.
  */
-xfs_extlen_t
+unsigned long long
 xfs_btree_calc_size(
 	struct xfs_mount	*mp,
 	uint			*limits,
@@ -4828,7 +4828,7 @@ xfs_btree_calc_size(
 {
 	int			level;
 	int			maxrecs;
-	xfs_extlen_t		rval;
+	unsigned long long	rval;
 
 	maxrecs = limits[0];
 	for (level = 0, rval = 0; len > 1; level++) {
diff --git a/libxfs/xfs_btree.h b/libxfs/xfs_btree.h
index 0210662..52714f0 100644
--- a/libxfs/xfs_btree.h
+++ b/libxfs/xfs_btree.h
@@ -515,7 +515,7 @@ bool xfs_btree_sblock_v5hdr_verify(struct xfs_buf *bp);
 bool xfs_btree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
 uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
 				 unsigned long len);
-xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
+unsigned long long xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
 		unsigned long long len);
 
 /* return codes */
diff --git a/libxfs/xfs_ialloc_btree.c b/libxfs/xfs_ialloc_btree.c
index 4795f08..978685e 100644
--- a/libxfs/xfs_ialloc_btree.c
+++ b/libxfs/xfs_ialloc_btree.c
@@ -497,3 +497,12 @@ xfs_inobt_rec_check_count(
 	return 0;
 }
 #endif	/* DEBUG */
+
+/* Calculate the inode btree size for some records. */
+xfs_extlen_t
+xfs_iallocbt_calc_size(
+	struct xfs_mount	*mp,
+	unsigned long long	len)
+{
+	return xfs_btree_calc_size(mp, mp->m_inobt_mnr, len);
+}
diff --git a/libxfs/xfs_ialloc_btree.h b/libxfs/xfs_ialloc_btree.h
index bd88453..3046c11 100644
--- a/libxfs/xfs_ialloc_btree.h
+++ b/libxfs/xfs_ialloc_btree.h
@@ -72,4 +72,7 @@ int xfs_inobt_rec_check_count(struct xfs_mount *,
 #define xfs_inobt_rec_check_count(mp, rec)	0
 #endif	/* DEBUG */
 
+extern xfs_extlen_t xfs_iallocbt_calc_size(struct xfs_mount *mp,
+		unsigned long long len);
+
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index 0b8eed1..c7d8aac 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -2362,3 +2362,54 @@ xfs_rmap_record_exists(
 		     irec.rm_startblock + irec.rm_blockcount >= bno + len);
 	return 0;
 }
+
+struct xfs_rmap_has_other_keys {
+	uint64_t			owner;
+	uint64_t			offset;
+	bool				*has_rmap;
+	unsigned int			flags;
+};
+
+/* For each rmap given, figure out if it doesn't match the key we want. */
+STATIC int
+xfs_rmap_has_other_keys_helper(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_rmap_has_other_keys	*rhok = priv;
+
+	if (rhok->owner == rec->rm_owner && rhok->offset == rec->rm_offset &&
+	    ((rhok->flags & rec->rm_flags) & XFS_RMAP_KEY_FLAGS) == rhok->flags)
+		return 0;
+	*rhok->has_rmap = true;
+	return XFS_BTREE_QUERY_RANGE_ABORT;
+}
+
+/*
+ * Given an extent and some owner info, can we find records overlapping
+ * the extent whose owner info does not match the given owner?
+ */
+int
+xfs_rmap_has_other_keys(
+	struct xfs_btree_cur		*cur,
+	xfs_fsblock_t			bno,
+	xfs_filblks_t			len,
+	struct xfs_owner_info		*oinfo,
+	bool				*has_rmap)
+{
+	struct xfs_rmap_irec		low = {0};
+	struct xfs_rmap_irec		high;
+	struct xfs_rmap_has_other_keys	rhok;
+
+	xfs_owner_info_unpack(oinfo, &rhok.owner, &rhok.offset, &rhok.flags);
+	*has_rmap = false;
+	rhok.has_rmap = has_rmap;
+
+	low.rm_startblock = bno;
+	memset(&high, 0xFF, sizeof(high));
+	high.rm_startblock = bno + len - 1;
+
+	return xfs_rmap_query_range(cur, &low, &high,
+			xfs_rmap_has_other_keys_helper, &rhok);
+}
diff --git a/libxfs/xfs_rmap.h b/libxfs/xfs_rmap.h
index ea359ab..606efe3 100644
--- a/libxfs/xfs_rmap.h
+++ b/libxfs/xfs_rmap.h
@@ -222,5 +222,8 @@ int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_fsblock_t bno,
 int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_fsblock_t bno,
 		xfs_filblks_t len, struct xfs_owner_info *oinfo,
 		bool *has_rmap);
+int xfs_rmap_has_other_keys(struct xfs_btree_cur *cur, xfs_fsblock_t bno,
+		xfs_filblks_t len, struct xfs_owner_info *oinfo,
+		bool *has_rmap);
 
 #endif	/* __XFS_RMAP_H__ */
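
To get a feel for what the *_calc_size() helpers compute, here is a
rough, independent sketch of a btree size estimate.  It is not the libxfs
xfs_btree_calc_size() implementation: btree_blocks_needed() and its
parameters are hypothetical, and it simply packs records into leaves and
then pointers into nodes, level by level, until a single root remains.
ptrs_per_node must be at least 2 and recs_per_leaf at least 1.

```c
#include <stdint.h>

/*
 * Estimate the number of blocks needed to hold nrecs records when each
 * leaf holds recs_per_leaf records and each interior node holds
 * ptrs_per_node child pointers.
 */
static uint64_t
btree_blocks_needed(
	uint64_t	nrecs,
	uint32_t	recs_per_leaf,
	uint32_t	ptrs_per_node)
{
	uint64_t	blocks;
	uint64_t	total;

	if (nrecs == 0)
		return 0;

	/* Leaf level: round up so a partial leaf still gets a block. */
	blocks = (nrecs + recs_per_leaf - 1) / recs_per_leaf;
	total = blocks;

	/* Each interior level points at the blocks of the level below. */
	while (blocks > 1) {
		blocks = (blocks + ptrs_per_node - 1) / ptrs_per_node;
		total += blocks;
	}
	return total;
}
```

For example, 10 records with 4 records per leaf and 2 pointers per node
need 3 leaves, then 2 nodes, then 1 root, for 6 blocks in total.  The
64-bit result type reflects the same concern as the patch's switch to
unsigned long long: a file's block map btree is not bounded by the size
of a single allocation group.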



* [PATCH 31/39] xfs: repair inode btrees
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (29 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 30/39] xfs: add helper routines for the repair code Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:27 ` [PATCH 32/39] xfs: rebuild the rmapbt Darrick J. Wong
                   ` (7 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_ialloc.c |    2 +-
 libxfs/xfs_ialloc.h |    3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_ialloc.c b/libxfs/xfs_ialloc.c
index 3d02a16..923214a 100644
--- a/libxfs/xfs_ialloc.c
+++ b/libxfs/xfs_ialloc.c
@@ -141,7 +141,7 @@ xfs_inobt_get_rec(
 /*
  * Insert a single inobt record. Cursor must already point to desired location.
  */
-STATIC int
+int
 xfs_inobt_insert_rec(
 	struct xfs_btree_cur	*cur,
 	__uint16_t		holemask,
diff --git a/libxfs/xfs_ialloc.h b/libxfs/xfs_ialloc.h
index f20d958..afcb250 100644
--- a/libxfs/xfs_ialloc.h
+++ b/libxfs/xfs_ialloc.h
@@ -175,5 +175,8 @@ int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
 int xfs_ialloc_has_inode_record(struct xfs_btree_cur *cur, xfs_agino_t low,
 		xfs_agino_t high, bool *exists);
+int xfs_inobt_insert_rec(struct xfs_btree_cur *cur, __uint16_t holemask,
+		__uint8_t count, __int32_t freecount, xfs_inofree_t free,
+		int *stat);
 
 #endif	/* __XFS_IALLOC_H__ */



* [PATCH 32/39] xfs: rebuild the rmapbt
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (30 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 31/39] xfs: repair inode btrees Darrick J. Wong
@ 2016-11-05  0:27 ` Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 33/39] xfs: repair refcount btrees Darrick J. Wong
                   ` (6 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Rebuild the reverse mapping btree from all primary metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_refcount.c |    2 +-
 libxfs/xfs_refcount.h |    3 +++
 libxfs/xfs_rmap.c     |   28 ++++++++++++++++++++++++++++
 libxfs/xfs_rmap.h     |    1 +
 4 files changed, 33 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_refcount.c b/libxfs/xfs_refcount.c
index 124d5c0..be0eab3 100644
--- a/libxfs/xfs_refcount.c
+++ b/libxfs/xfs_refcount.c
@@ -87,7 +87,7 @@ xfs_refcount_lookup_ge(
 }
 
 /* Convert on-disk record to in-core format. */
-static inline void
+void
 xfs_refcount_btrec_to_irec(
 	union xfs_btree_rec		*rec,
 	struct xfs_refcount_irec	*irec)
diff --git a/libxfs/xfs_refcount.h b/libxfs/xfs_refcount.h
index 78cb142..5973c56 100644
--- a/libxfs/xfs_refcount.h
+++ b/libxfs/xfs_refcount.h
@@ -69,5 +69,8 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
 
 extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
+union xfs_btree_rec;
+extern void xfs_refcount_btrec_to_irec(union xfs_btree_rec *rec,
+		struct xfs_refcount_irec *irec);
 
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/libxfs/xfs_rmap.c b/libxfs/xfs_rmap.c
index c7d8aac..77ddb0b 100644
--- a/libxfs/xfs_rmap.c
+++ b/libxfs/xfs_rmap.c
@@ -1975,6 +1975,34 @@ xfs_rmap_map_shared(
 	return error;
 }
 
+/* Insert a raw rmap into the rmapbt. */
+int
+xfs_rmap_map_raw(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rmap)
+{
+	struct xfs_owner_info	oinfo;
+
+	oinfo.oi_owner = rmap->rm_owner;
+	oinfo.oi_offset = rmap->rm_offset;
+	oinfo.oi_flags = 0;
+	if (rmap->rm_flags & XFS_RMAP_ATTR_FORK)
+		oinfo.oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
+	if (rmap->rm_flags & XFS_RMAP_BMBT_BLOCK)
+		oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
+
+	if (rmap->rm_flags || XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
+		return xfs_rmap_map(cur, rmap->rm_startblock,
+				rmap->rm_blockcount,
+				rmap->rm_flags & XFS_RMAP_UNWRITTEN,
+				&oinfo);
+
+	return xfs_rmap_map_shared(cur, rmap->rm_startblock,
+			rmap->rm_blockcount,
+			rmap->rm_flags & XFS_RMAP_UNWRITTEN,
+			&oinfo);
+}
+
 struct xfs_rmap_query_range_info {
 	xfs_rmap_query_range_fn	fn;
 	void				*priv;
diff --git a/libxfs/xfs_rmap.h b/libxfs/xfs_rmap.h
index 606efe3..eac90d7 100644
--- a/libxfs/xfs_rmap.h
+++ b/libxfs/xfs_rmap.h
@@ -225,5 +225,6 @@ int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_fsblock_t bno,
 int xfs_rmap_has_other_keys(struct xfs_btree_cur *cur, xfs_fsblock_t bno,
 		xfs_filblks_t len, struct xfs_owner_info *oinfo,
 		bool *has_rmap);
+int xfs_rmap_map_raw(struct xfs_btree_cur *cur, struct xfs_rmap_irec *rmap);
 
 #endif	/* __XFS_RMAP_H__ */



* [PATCH 33/39] xfs: repair refcount btrees
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (31 preceding siblings ...)
  2016-11-05  0:27 ` [PATCH 32/39] xfs: rebuild the rmapbt Darrick J. Wong
@ 2016-11-05  0:28 ` Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 34/39] xfs: repair inode block maps Darrick J. Wong
                   ` (5 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Reconstruct the refcount data from the rmap btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_btree.c    |   21 +++++++++++++++++++++
 libxfs/xfs_btree.h    |    1 +
 libxfs/xfs_refcount.c |   19 ++++++++++++++++++-
 libxfs/xfs_refcount.h |    4 ++++
 4 files changed, 44 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_btree.c b/libxfs/xfs_btree.c
index 0c7d549..d1a5347 100644
--- a/libxfs/xfs_btree.c
+++ b/libxfs/xfs_btree.c
@@ -4891,3 +4891,24 @@ xfs_btree_has_record(
 
 	return 0;
 }
+
+/* Are there more records in this btree? */
+bool
+xfs_btree_has_more_records(
+	struct xfs_btree_cur	*cur)
+{
+	struct xfs_btree_block	*block;
+	struct xfs_buf		*bp;
+
+	block = xfs_btree_get_block(cur, 0, &bp);
+
+	/* There are still records in this block. */
+	if (cur->bc_ptrs[0] < xfs_btree_get_numrecs(block))
+		return true;
+
+	/* There are more record blocks. */
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		return block->bb_u.l.bb_rightsib != cpu_to_be64(NULLFSBLOCK);
+	else
+		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
+}
diff --git a/libxfs/xfs_btree.h b/libxfs/xfs_btree.h
index 52714f0..ace0bb0 100644
--- a/libxfs/xfs_btree.h
+++ b/libxfs/xfs_btree.h
@@ -551,5 +551,6 @@ struct xfs_btree_block *xfs_btree_get_block(struct xfs_btree_cur *cur,
 		int level, struct xfs_buf **bpp);
 int xfs_btree_has_record(struct xfs_btree_cur *cur, union xfs_btree_irec *low,
 		union xfs_btree_irec *high, bool *exists);
+bool xfs_btree_has_more_records(struct xfs_btree_cur *);
 
 #endif	/* __XFS_BTREE_H__ */
diff --git a/libxfs/xfs_refcount.c b/libxfs/xfs_refcount.c
index be0eab3..088f850 100644
--- a/libxfs/xfs_refcount.c
+++ b/libxfs/xfs_refcount.c
@@ -86,6 +86,23 @@ xfs_refcount_lookup_ge(
 	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
 }
 
+/*
+ * Look up the first record whose start block equals bno in the btree
+ * given by cur.
+ */
+int
+xfs_refcount_lookup_eq(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	int			*stat)
+{
+	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_private.a.agno, bno,
+			XFS_LOOKUP_EQ);
+	cur->bc_rec.rc.rc_startblock = bno;
+	cur->bc_rec.rc.rc_blockcount = 0;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
+}
+
 /* Convert on-disk record to in-core format. */
 void
 xfs_refcount_btrec_to_irec(
@@ -147,7 +164,7 @@ xfs_refcount_update(
  * by [bno, len, refcount].
  * This either works (return 0) or gets an EFSCORRUPTED error.
  */
-STATIC int
+int
 xfs_refcount_insert(
 	struct xfs_btree_cur		*cur,
 	struct xfs_refcount_irec	*irec,
diff --git a/libxfs/xfs_refcount.h b/libxfs/xfs_refcount.h
index 5973c56..cad61de 100644
--- a/libxfs/xfs_refcount.h
+++ b/libxfs/xfs_refcount.h
@@ -24,6 +24,8 @@ extern int xfs_refcount_lookup_le(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, int *stat);
 extern int xfs_refcount_lookup_ge(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, int *stat);
+extern int xfs_refcount_lookup_eq(struct xfs_btree_cur *cur,
+		xfs_agblock_t bno, int *stat);
 extern int xfs_refcount_get_rec(struct xfs_btree_cur *cur,
 		struct xfs_refcount_irec *irec, int *stat);
 
@@ -72,5 +74,7 @@ extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
 union xfs_btree_rec;
 extern void xfs_refcount_btrec_to_irec(union xfs_btree_rec *rec,
 		struct xfs_refcount_irec *irec);
+extern int xfs_refcount_insert(struct xfs_btree_cur *cur,
+		struct xfs_refcount_irec *irec, int *stat);
 
 #endif	/* __XFS_REFCOUNT_H__ */
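
The test in xfs_btree_has_more_records() boils down to two checks on the
leaf level, which the following sketch models with simplified types.
struct leaf_block, struct leaf_cursor, NO_SIBLING and
cursor_has_more_records() are hypothetical names; the cursor position is
assumed to be 1-based, and a single sentinel replaces the separate short
and long pointer formats handled by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "no right sibling" sentinel for this sketch. */
#define NO_SIBLING	UINT32_MAX

struct leaf_block {
	uint16_t	numrecs;	/* records stored in this leaf */
	uint32_t	rightsib;	/* next leaf, or NO_SIBLING */
};

struct leaf_cursor {
	const struct leaf_block	*block;	/* leaf the cursor points into */
	uint16_t		ptr;	/* 1-based record index in block */
};

/* Are there records after the one the cursor currently points at? */
static bool
cursor_has_more_records(const struct leaf_cursor *cur)
{
	/* There are still records in this block. */
	if (cur->ptr < cur->block->numrecs)
		return true;

	/* Otherwise more records exist only if another leaf follows. */
	return cur->block->rightsib != NO_SIBLING;
}
```

A rebuild loop can use a check like this to decide whether the record it
just read is the last one, without moving the cursor.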



* [PATCH 34/39] xfs: repair inode block maps
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (32 preceding siblings ...)
  2016-11-05  0:28 ` [PATCH 33/39] xfs: repair refcount btrees Darrick J. Wong
@ 2016-11-05  0:28 ` Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 35/39] xfs: query the per-AG reservation counters Darrick J. Wong
                   ` (4 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Use the reverse-mapping btree information to rebuild an inode fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_bmap.c |   20 ++++++++++++--------
 libxfs/xfs_bmap.h |    6 +++++-
 2 files changed, 17 insertions(+), 9 deletions(-)


diff --git a/libxfs/xfs_bmap.c b/libxfs/xfs_bmap.c
index 0e2f450..53d6848 100644
--- a/libxfs/xfs_bmap.c
+++ b/libxfs/xfs_bmap.c
@@ -2216,9 +2216,12 @@ xfs_bmap_add_extent_delay_real(
 	}
 
 	/* add reverse mapping */
-	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
-	if (error)
-		goto done;
+	if (!(bma->flags & XFS_BMAPI_NORMAP)) {
+		error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip,
+				whichfork, new);
+		if (error)
+			goto done;
+	}
 
 	/* convert to a btree if necessary */
 	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
@@ -3159,9 +3162,12 @@ xfs_bmap_add_extent_hole_real(
 	}
 
 	/* add reverse mapping */
-	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
-	if (error)
-		goto done;
+	if (!(bma->flags & XFS_BMAPI_NORMAP)) {
+		error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip,
+				whichfork, new);
+		if (error)
+			goto done;
+	}
 
 	/* convert to a btree if necessary */
 	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
@@ -4585,8 +4591,6 @@ xfs_bmapi_write(
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-	ASSERT(!(flags & XFS_BMAPI_REMAP) || whichfork == XFS_DATA_FORK);
-	ASSERT(!(flags & XFS_BMAPI_PREALLOC) || !(flags & XFS_BMAPI_REMAP));
 	ASSERT(!(flags & XFS_BMAPI_CONVERT) || !(flags & XFS_BMAPI_REMAP));
 	ASSERT(!(flags & XFS_BMAPI_PREALLOC) || whichfork != XFS_COW_FORK);
 	ASSERT(!(flags & XFS_BMAPI_CONVERT) || whichfork != XFS_COW_FORK);
diff --git a/libxfs/xfs_bmap.h b/libxfs/xfs_bmap.h
index 7cae6ec..9d4754f 100644
--- a/libxfs/xfs_bmap.h
+++ b/libxfs/xfs_bmap.h
@@ -110,6 +110,9 @@ struct xfs_extent_free_item
 /* Map something in the CoW fork. */
 #define XFS_BMAPI_COWFORK	0x200
 
+/* Don't update the rmap btree. */
+#define XFS_BMAPI_NORMAP	0x400
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -120,7 +123,8 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
 	{ XFS_BMAPI_ZERO,	"ZERO" }, \
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
-	{ XFS_BMAPI_COWFORK,	"COWFORK" }
+	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
+	{ XFS_BMAPI_NORMAP,	"NORMAP" }
 
 
 static inline int xfs_bmapi_aflag(int w)



* [PATCH 35/39] xfs: query the per-AG reservation counters
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (33 preceding siblings ...)
  2016-11-05  0:28 ` [PATCH 34/39] xfs: repair inode block maps Darrick J. Wong
@ 2016-11-05  0:28 ` Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 36/39] xfs_db: introduce fuzz command Darrick J. Wong
                   ` (3 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Establish an ioctl for userspace to query the original and current
per-AG reservation counts.  This will be used by xfs_scrub to
check that the vfs counters are at least somewhat sane.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 libxfs/xfs_fs.h |   10 ++++++++++
 1 file changed, 10 insertions(+)
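
A rough sketch of how a userspace caller might issue the new ioctl is
below.  The structure layout and the ioctl number are copied from this
patch (with uint64_t standing in for __u64); query_ag_resblks() is a
hypothetical helper, and nothing here says how the kernel side fills in
the counters.  On a kernel without this series the call would be
expected to fail, typically with ENOTTY.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Local copy of the structure added by this patch. */
struct xfs_fsop_ag_resblks {
	uint64_t	resblks;	/* blocks reserved now */
	uint64_t	resblks_orig;	/* blocks reserved at mount time */
	uint64_t	reserved[2];
};

#define XFS_IOC_GET_AG_RESBLKS	_IOR('X', 126, struct xfs_fsop_ag_resblks)

/*
 * Hypothetical helper: ask the filesystem behind fd for its per-AG
 * reservation counters.  Returns 0 on success or a negative errno.
 */
static int
query_ag_resblks(int fd, struct xfs_fsop_ag_resblks *out)
{
	if (ioctl(fd, XFS_IOC_GET_AG_RESBLKS, out) < 0)
		return -errno;
	return 0;
}
```

A caller would open any file or directory on the mounted filesystem,
pass the descriptor to the helper, and compare resblks against
resblks_orig.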


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index c2275b1..deebb57 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -632,6 +632,15 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_FLAGS_ALL	(XFS_SCRUB_FLAGS_IN | XFS_SCRUB_FLAGS_OUT)
 
 /*
+ * AG reserved block counters
+ */
+struct xfs_fsop_ag_resblks {
+	__u64 resblks;		/* blocks reserved now */
+	__u64 resblks_orig;	/* blocks reserved at mount time */
+	__u64 reserved[2];
+};
+
+/*
  * ioctl limits
  */
 #ifdef XATTR_LIST_MAX
@@ -705,6 +714,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_GET_AG_RESBLKS	     _IOR ('X', 126, struct xfs_fsop_ag_resblks)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 /* reflink ioctls; these MUST match the btrfs ioctl definitions */



* [PATCH 36/39] xfs_db: introduce fuzz command
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (34 preceding siblings ...)
  2016-11-05  0:28 ` [PATCH 35/39] xfs: query the per-AG reservation counters Darrick J. Wong
@ 2016-11-05  0:28 ` Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 37/39] xfs_db: print attribute remote value blocks Darrick J. Wong
                   ` (2 subsequent siblings)
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Introduce a new 'fuzz' command to write creative values into
disk structure fields.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
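
In case it helps review, here is a rough standalone illustration of the
verifier swap that -c and -d rely on in fuzz_f().  All of the types and
function names below are hypothetical, cut-down stand-ins for xfs_buf,
xfs_buf_ops and the db/io.c verifiers; only the stash/substitute/restore
pattern is taken from the patch.

```c
/* Hypothetical, cut-down versions of the buffer and its ops table. */
struct buf;
struct buf_ops {
	void	(*verify_read)(struct buf *bp);
	void	(*verify_write)(struct buf *bp);
};
struct buf {
	const struct buf_ops	*b_ops;
	int			checked;	/* strict verifier ran */
	int			crc_recalculated;
};

static void strict_verify(struct buf *bp) { bp->checked = 1; }
static void dummy_verify(struct buf *bp)  { (void)bp; }
static void recalc_crc(struct buf *bp)    { bp->crc_recalculated = 1; }

/* Stand-in for the write path: run the write verifier, if any. */
static void
write_buf(struct buf *bp)
{
	if (bp->b_ops && bp->b_ops->verify_write)
		bp->b_ops->verify_write(bp);
}

enum fuzz_mode { FUZZ_NORMAL, FUZZ_CORRUPT /* -c */, FUZZ_INVALID /* -d */ };

static void
fuzz_write(struct buf *bp, enum fuzz_mode mode)
{
	struct buf_ops		local_ops;
	const struct buf_ops	*stashed_ops = bp->b_ops;

	/* No verifier, or standard path: write with whatever is attached. */
	if (!stashed_ops || mode == FUZZ_NORMAL) {
		write_buf(bp);
		return;
	}

	/* Keep the read verifier, replace only the write verifier. */
	local_ops.verify_read = stashed_ops->verify_read;
	local_ops.verify_write = (mode == FUZZ_CORRUPT) ?
			dummy_verify : recalc_crc;
	bp->b_ops = &local_ops;
	write_buf(bp);

	/* Put the real ops back so later writes are checked again. */
	bp->b_ops = stashed_ops;
}
```

The point of the local copy is that the shared ops table is never
modified; only this one buffer sees the relaxed write verifier, and only
for the duration of the write.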
 db/Makefile       |    3 
 db/bit.c          |   17 +-
 db/bit.h          |    5 -
 db/command.c      |    2 
 db/fuzz.c         |  461 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 db/fuzz.h         |   21 ++
 db/io.c           |    9 +
 db/io.h           |    1 
 db/type.c         |   44 ++++-
 db/type.h         |    1 
 man/man8/xfs_db.8 |   55 ++++++
 11 files changed, 598 insertions(+), 21 deletions(-)
 create mode 100644 db/fuzz.c
 create mode 100644 db/fuzz.h
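
The three bit-flip verbs can be pictured with the following sketch.  It
assumes the MSB-first numbering that getbit_l()/setbit_l() appear to use
(bit 0 is the most significant bit of byte 0); get_bit(), set_bit() and
the flip_*() helpers are hypothetical names, not the functions in
db/bit.c.  The intent, as far as can be told from the man page text, is
that each verb reads and rewrites the same bit position.

```c
#include <limits.h>

/* Hypothetical helpers: bit 0 is the most significant bit of byte 0. */
static int
get_bit(const unsigned char *p, int bit)
{
	return (p[bit / CHAR_BIT] >> (CHAR_BIT - 1 - bit % CHAR_BIT)) & 1;
}

static void
set_bit(unsigned char *p, int bit, int val)
{
	unsigned char	mask = 1u << (CHAR_BIT - 1 - bit % CHAR_BIT);

	if (val)
		p[bit / CHAR_BIT] |= mask;
	else
		p[bit / CHAR_BIT] &= (unsigned char)~mask;
}

/* Flip one bit in place by reading and writing the same position. */
static void
flip_bit(unsigned char *p, int bit)
{
	set_bit(p, bit, !get_bit(p, bit));
}

/* firstbit: the highest bit of a big-endian scalar. */
static void
flip_first(unsigned char *p, int bitoff, int nbits)
{
	(void)nbits;
	flip_bit(p, bitoff);
}

/* middlebit: the bit halfway into the field. */
static void
flip_middle(unsigned char *p, int bitoff, int nbits)
{
	flip_bit(p, bitoff + nbits / 2);
}

/* lastbit: the lowest bit of a big-endian scalar. */
static void
flip_last(unsigned char *p, int bitoff, int nbits)
{
	flip_bit(p, bitoff + nbits - 1);
}
```

Applying the same verb twice to a field restores the original contents,
which is a handy property when checking that a verb only touches the bit
it claims to.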


diff --git a/db/Makefile b/db/Makefile
index cdc0b99..feeacf6 100644
--- a/db/Makefile
+++ b/db/Makefile
@@ -12,7 +12,8 @@ HFILES = addr.h agf.h agfl.h agi.h attr.h attrshort.h bit.h block.h bmap.h \
 	dir2.h dir2sf.h dquot.h echo.h faddr.h field.h \
 	flist.h fprint.h frag.h freesp.h hash.h help.h init.h inode.h input.h \
 	io.h logformat.h malloc.h metadump.h output.h print.h quit.h sb.h \
-	 sig.h strvec.h text.h type.h write.h attrset.h symlink.h fsmap.h
+	 sig.h strvec.h text.h type.h write.h attrset.h symlink.h fsmap.h \
+	fuzz.h
 CFILES = $(HFILES:.h=.c)
 LSRCFILES = xfs_admin.sh xfs_ncheck.sh xfs_metadump.sh
 
diff --git a/db/bit.c b/db/bit.c
index 24872bf..3fcb085 100644
--- a/db/bit.c
+++ b/db/bit.c
@@ -19,13 +19,8 @@
 #include "libxfs.h"
 #include "bit.h"
 
-#undef setbit	/* defined in param.h on Linux */
-
-static int	getbit(char *ptr, int bit);
-static void	setbit(char *ptr, int bit, int val);
-
-static int
-getbit(
+int
+getbit_l(
 	char	*ptr,
 	int	bit)
 {
@@ -39,8 +34,8 @@ getbit(
 	return (*ptr & mask) >> shift;
 }
 
-static void
-setbit(
+void
+setbit_l(
 	char *ptr,
 	int  bit,
 	int  val)
@@ -106,7 +101,7 @@ getbitval(
 
 
 	for (i = 0, rval = 0LL; i < nbits; i++) {
-		if (getbit(p, bit + i)) {
+		if (getbit_l(p, bit + i)) {
 			/* If the last bit is on and we care about sign
 			 * bits and we don't have a full 64 bit
 			 * container, turn all bits on between the
@@ -162,7 +157,7 @@ setbitval(
 
 	if (bitoff % NBBY || nbits % NBBY) {
 		for (bit = 0; bit < nbits; bit++)
-			setbit(out, bit + bitoff, getbit(in, bit));
+			setbit_l(out, bit + bitoff, getbit_l(in, bit));
 	} else
 		memcpy(out + byteize(bitoff), in, byteize(nbits));
 }
diff --git a/db/bit.h b/db/bit.h
index 80ba24c..4506679 100644
--- a/db/bit.h
+++ b/db/bit.h
@@ -21,9 +21,12 @@
 #define	bitszof(x,y)	bitize(szof(x,y))
 #define	byteize(s)	((s) / NBBY)
 #define	bitoffs(s)	((s) % NBBY)
+#define	byteize_up(s)	(((s) + NBBY - 1) / NBBY)
 
 #define	BVUNSIGNED	0
 #define	BVSIGNED	1
 
 extern __int64_t	getbitval(void *obj, int bitoff, int nbits, int flags);
-extern void             setbitval(void *obuf, int bitoff, int nbits, void *ibuf);
+extern void		setbitval(void *obuf, int bitoff, int nbits, void *ibuf);
+extern int		getbit_l(char *ptr, int bit);
+extern void		setbit_l(char *ptr, int bit, int val);
diff --git a/db/command.c b/db/command.c
index 3d7cfd7..0eb4944 100644
--- a/db/command.c
+++ b/db/command.c
@@ -51,6 +51,7 @@
 #include "dquot.h"
 #include "fsmap.h"
 #include "crc.h"
+#include "fuzz.h"
 
 cmdinfo_t	*cmdtab;
 int		ncmds;
@@ -146,4 +147,5 @@ init_commands(void)
 	type_init();
 	write_init();
 	dquot_init();
+	fuzz_init();
 }
diff --git a/db/fuzz.c b/db/fuzz.c
new file mode 100644
index 0000000..061ecd1
--- /dev/null
+++ b/db/fuzz.c
@@ -0,0 +1,461 @@
+/*
+ * Copyright (c) 2000-2002,2005 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "libxfs.h"
+#include <ctype.h>
+#include <time.h>
+#include "bit.h"
+#include "block.h"
+#include "command.h"
+#include "type.h"
+#include "faddr.h"
+#include "fprint.h"
+#include "field.h"
+#include "flist.h"
+#include "io.h"
+#include "init.h"
+#include "output.h"
+#include "print.h"
+#include "write.h"
+#include "malloc.h"
+
+static int	fuzz_f(int argc, char **argv);
+static void     fuzz_help(void);
+
+static const cmdinfo_t	fuzz_cmd =
+	{ "fuzz", NULL, fuzz_f, 0, -1, 0, N_("[-c] [-d] field fuzzcmd..."),
+	  N_("fuzz values on disk"), fuzz_help };
+
+void
+fuzz_init(void)
+{
+	if (!expert_mode)
+		return;
+
+	add_command(&fuzz_cmd);
+	srand48(clock());
+}
+
+static void
+fuzz_help(void)
+{
+	dbprintf(_(
+"\n"
+" The 'fuzz' command fuzzes fields in any on-disk data structure.  For\n"
+" block fuzzing, see the 'blocktrash' or 'write' commands."
+"\n"
+" Examples:\n"
+"  Struct mode: 'fuzz core.uid zeroes'    - set an inode uid field to 0.\n"
+"               'fuzz crc ones'           - set a crc field to all ones.\n"
+"               'fuzz bno[11] firstbit'   - set the high bit of a block array.\n"
+"               'fuzz keys[5].startblock add'    - increase a btree key value.\n"
+"               'fuzz uuid random'        - randomize the superblock uuid.\n"
+"\n"
+" In data mode type 'fuzz' by itself for a list of specific commands.\n\n"
+" Specifying the -c option will allow writes of invalid (corrupt) data with\n"
+" an invalid CRC. Specifying the -d option will allow writes of invalid data,\n"
+" but still recalculate the CRC so we are forced to check and detect the\n"
+" invalid data appropriately.\n\n"
+));
+
+}
+
+static int
+fuzz_f(
+	int		argc,
+	char		**argv)
+{
+	pfunc_t	pf;
+	extern char *progname;
+	int c;
+	bool corrupt = false;	/* Allow write of bad data w/ invalid CRC */
+	bool invalid_data = false; /* Allow write of bad data w/ valid CRC */
+	struct xfs_buf_ops local_ops;
+	const struct xfs_buf_ops *stashed_ops = NULL;
+
+	if (x.isreadonly & LIBXFS_ISREADONLY) {
+		dbprintf(_("%s started in read only mode, fuzzing disabled\n"),
+			progname);
+		return 0;
+	}
+
+	if (cur_typ == NULL) {
+		dbprintf(_("no current type\n"));
+		return 0;
+	}
+
+	pf = cur_typ->pfunc;
+	if (pf == NULL) {
+		dbprintf(_("no handler function for type %s, fuzz unsupported.\n"),
+			 cur_typ->name);
+		return 0;
+	}
+
+	while ((c = getopt(argc, argv, "cd")) != EOF) {
+		switch (c) {
+		case 'c':
+			corrupt = true;
+			break;
+		case 'd':
+			invalid_data = true;
+			break;
+		default:
+			dbprintf(_("bad option for fuzz command\n"));
+			return 0;
+		}
+	}
+
+	if (corrupt && invalid_data) {
+		dbprintf(_("Cannot specify both -c and -d options\n"));
+		return 0;
+	}
+
+	if (invalid_data && iocur_top->typ->crc_off == TYP_F_NO_CRC_OFF &&
+			!iocur_top->ino_buf) {
+		dbprintf(_("Cannot recalculate CRCs on this type of object\n"));
+		return 0;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	/*
+	 * If the buffer has no verifier or we are using standard verifier
+	 * paths, then just fuzz it and return
+	 */
+	if (!iocur_top->bp->b_ops ||
+	    !(corrupt || invalid_data)) {
+		(*pf)(DB_FUZZ, cur_typ->fields, argc, argv);
+		return 0;
+	}
+
+
+	/* Temporarily remove write verifier to write bad data */
+	stashed_ops = iocur_top->bp->b_ops;
+	local_ops.verify_read = stashed_ops->verify_read;
+	iocur_top->bp->b_ops = &local_ops;
+
+	if (corrupt) {
+		local_ops.verify_write = xfs_dummy_verify;
+		dbprintf(_("Allowing fuzz of corrupted data and bad CRC\n"));
+	} else if (iocur_top->ino_buf) {
+		local_ops.verify_write = xfs_verify_recalc_inode_crc;
+		dbprintf(_("Allowing fuzz of corrupted inode with good CRC\n"));
+	} else { /* invalid data */
+		local_ops.verify_write = xfs_verify_recalc_crc;
+		dbprintf(_("Allowing fuzz of corrupted data with good CRC\n"));
+	}
+
+	(*pf)(DB_FUZZ, cur_typ->fields, argc, argv);
+
+	iocur_top->bp->b_ops = stashed_ops;
+
+	return 0;
+}
+
+/* Write zeroes to the field */
+static bool
+fuzz_zeroes(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	char		*out = buf;
+	int		bit;
+
+	if (bitoff % NBBY || nbits % NBBY) {
+		for (bit = 0; bit < nbits; bit++)
+			setbit_l(out, bit + bitoff, 0);
+	} else
+		memset(out + byteize(bitoff), 0, byteize(nbits));
+	return true;
+}
+
+/* Write ones to the field */
+static bool
+fuzz_ones(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	char		*out = buf;
+	int		bit;
+
+	if (bitoff % NBBY || nbits % NBBY) {
+		for (bit = 0; bit < nbits; bit++)
+			setbit_l(out, bit + bitoff, 1);
+	} else
+		memset(out + byteize(bitoff), 0xFF, byteize(nbits));
+	return true;
+}
+
+/* Flip the high bit in the (presumably big-endian) field */
+static bool
+fuzz_firstbit(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	setbit_l((char *)buf, bitoff, !getbit_l((char *)buf, bitoff));
+	return true;
+}
+
+/* Flip the low bit in the (presumably big-endian) field */
+static bool
+fuzz_lastbit(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	setbit_l((char *)buf, bitoff + nbits - 1,
+			!getbit_l((char *)buf, bitoff + nbits - 1));
+	return true;
+}
+
+/* Flip the middle bit in the (presumably big-endian) field */
+static bool
+fuzz_middlebit(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	setbit_l((char *)buf, bitoff + nbits / 2,
+			!getbit_l((char *)buf, bitoff + nbits / 2));
+	return true;
+}
+
+/* Format and shift a number into a buffer for setbitval. */
+static char *
+format_number(
+	uint64_t	val,
+	__be64		*out,
+	int		bit_length)
+{
+	int		offset;
+	char		*rbuf = (char *)out;
+
+	/*
+	 * If the length of the field is not a multiple of a byte, push
+	 * the bits up in the field, so the most significant field bit is
+	 * the most significant bit in the byte:
+	 *
+	 * before:
+	 * val  |----|----|----|----|----|--MM|mmmm|llll|
+	 * after
+	 * val  |----|----|----|----|----|MMmm|mmll|ll00|
+	 */
+	offset = bit_length % NBBY;
+	if (offset)
+		val <<= (NBBY - offset);
+
+	/*
+	 * convert to big endian and copy into the array
+	 * rbuf |----|----|----|----|----|MMmm|mmll|ll00|
+	 */
+	*out = cpu_to_be64(val);
+
+	/*
+	 * Align the array to point to the field in the array.
+	 *  rbuf  = |MMmm|mmll|ll00|
+	 */
+	offset = sizeof(__be64) - 1 - ((bit_length - 1) / sizeof(__be64));
+	return rbuf + offset;
+}
+
+/* Increase the value by some small prime number. */
+static bool
+fuzz_add(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	uint64_t	val;
+	__be64		out;
+	char		*b;
+
+	if (nbits > 64)
+		return false;
+
+	val = getbitval(buf, bitoff, nbits, BVUNSIGNED);
+	val += (nbits > 8 ? 2017 : 137);
+	b = format_number(val, &out, nbits);
+	setbitval(buf, bitoff, nbits, b);
+
+	return true;
+}
+
+/* Decrease the value by some small prime number. */
+static bool
+fuzz_sub(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	uint64_t	val;
+	__be64		out;
+	char		*b;
+
+	if (nbits > 64)
+		return false;
+
+	val = getbitval(buf, bitoff, nbits, BVUNSIGNED);
+	val -= (nbits > 8 ? 2017 : 137);
+	b = format_number(val, &out, nbits);
+	setbitval(buf, bitoff, nbits, b);
+
+	return true;
+}
+
+/* Randomize the field. */
+static bool
+fuzz_random(
+	void		*buf,
+	int		bitoff,
+	int		nbits)
+{
+	int		i, bytes;
+	char		*b, *rbuf;
+
+	bytes = byteize_up(nbits);
+	rbuf = b = malloc(bytes);
+	if (!b) {
+		perror("fuzz_random");
+		return false;
+	}
+
+	for (i = 0; i < bytes; i++)
+		*b++ = (char)lrand48();
+
+	setbitval(buf, bitoff, nbits, rbuf);
+	free(rbuf);
+
+	return true;
+}
+
+struct fuzzcmd {
+	const char	*verb;
+	bool		(*fn)(void *buf, int bitoff, int nbits);
+};
+
+/* Keep these verbs in sync with enum fuzzcmds. */
+static struct fuzzcmd fuzzverbs[] = {
+	{"zeroes",		fuzz_zeroes},
+	{"ones",		fuzz_ones},
+	{"firstbit",		fuzz_firstbit},
+	{"middlebit",		fuzz_middlebit},
+	{"lastbit",		fuzz_lastbit},
+	{"add",			fuzz_add},
+	{"sub",			fuzz_sub},
+	{"random",		fuzz_random},
+	{NULL,			NULL},
+};
+
+/* ARGSUSED */
+void
+fuzz_struct(
+	const field_t	*fields,
+	int		argc,
+	char		**argv)
+{
+	const ftattr_t	*fa;
+	flist_t		*fl;
+	flist_t		*sfl;
+	int		bit_length;
+	struct fuzzcmd	*fc;
+	bool		success;
+	int		parentoffset;
+
+	if (argc != 2) {
+		dbprintf(_("Usage: fuzz fieldname verb\n"));
+		dbprintf("Verbs: %s", fuzzverbs->verb);
+		for (fc = fuzzverbs + 1; fc->verb != NULL; fc++)
+			dbprintf(", %s", fc->verb);
+		dbprintf(".\n");
+		return;
+	}
+
+	fl = flist_scan(argv[0]);
+	if (!fl) {
+		dbprintf(_("unable to parse '%s'.\n"), argv[0]);
+		return;
+	}
+
+	/* Find our fuzz verb */
+	for (fc = fuzzverbs; fc->verb != NULL; fc++)
+		if (!strcmp(fc->verb, argv[1]))
+			break;
+	if (fc->fn == NULL) {
+		dbprintf(_("Unknown fuzz command '%s'.\n"), argv[1]);
+		return;
+	}
+
+	/* if we're a root field type, go down 1 layer to get field list */
+	if (fields->name[0] == '\0') {
+		fa = &ftattrtab[fields->ftyp];
+		ASSERT(fa->ftyp == fields->ftyp);
+		fields = fa->subfld;
+	}
+
+	/* run down the field list and set offsets into the data */
+	if (!flist_parse(fields, fl, iocur_top->data, 0)) {
+		flist_free(fl);
+		dbprintf(_("parsing error\n"));
+		return;
+	}
+
+	sfl = fl;
+	parentoffset = 0;
+	while (sfl->child) {
+		parentoffset = sfl->offset;
+		sfl = sfl->child;
+	}
+
+	/*
+	 * For structures, fsize * fcount tells us the size of the region we are
+	 * modifying, which is usually a single structure member and is pointed
+	 * to by the last child in the list.
+	 *
+	 * However, if the base structure is an array and we have a direct index
+	 * into the array (e.g. fuzz bno[5]) then we are returned a single
+	 * flist object with the offset pointing directly at the location we
+	 * need to modify. The length of the object we are modifying is then
+	 * determined by the size of the individual array entry (fsize) and the
+	 * indexes defined in the object, not the overall size of the array
+	 * (which is what fcount returns).
+	 */
+	bit_length = fsize(sfl->fld, iocur_top->data, parentoffset, 0);
+	if (sfl->fld->flags & FLD_ARRAY)
+		bit_length *= sfl->high - sfl->low + 1;
+	else
+		bit_length *= fcount(sfl->fld, iocur_top->data, parentoffset);
+
+	/* Fuzz the value */
+	success = fc->fn(iocur_top->data, sfl->offset, bit_length);
+	if (!success) {
+		dbprintf(_("unable to fuzz field '%s'\n"), argv[0]);
+		flist_free(fl);
+		return;
+	}
+
+	/* Write the fuzzed value back */
+	write_cur();
+
+	flist_print(fl);
+	print_flist(fl);
+	flist_free(fl);
+}
diff --git a/db/fuzz.h b/db/fuzz.h
new file mode 100644
index 0000000..30ada2f
--- /dev/null
+++ b/db/fuzz.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+extern void	fuzz_init(void);
+extern void	fuzz_struct(const field_t *fields, int argc, char **argv);
diff --git a/db/io.c b/db/io.c
index f398195..1f316d8 100644
--- a/db/io.c
+++ b/db/io.c
@@ -465,6 +465,15 @@ xfs_dummy_verify(
 }
 
 void
+xfs_verify_recalc_inode_crc(
+	struct xfs_buf *bp)
+{
+	ASSERT(iocur_top->ino_buf);
+	libxfs_dinode_calc_crc(mp, iocur_top->data);
+	iocur_top->ino_crc_ok = 1;
+}
+
+void
 xfs_verify_recalc_crc(
 	struct xfs_buf *bp)
 {
diff --git a/db/io.h b/db/io.h
index c69e9ce..12d96c2 100644
--- a/db/io.h
+++ b/db/io.h
@@ -64,6 +64,7 @@ extern void	set_cur(const struct typ *t, __int64_t d, int c, int ring_add,
 extern void     ring_add(void);
 extern void	set_iocur_type(const struct typ *t);
 extern void	xfs_dummy_verify(struct xfs_buf *bp);
+extern void	xfs_verify_recalc_inode_crc(struct xfs_buf *bp);
 extern void	xfs_verify_recalc_crc(struct xfs_buf *bp);
 
 /*
diff --git a/db/type.c b/db/type.c
index 10fa54e..adab10a 100644
--- a/db/type.c
+++ b/db/type.c
@@ -39,6 +39,7 @@
 #include "dir2.h"
 #include "text.h"
 #include "symlink.h"
+#include "fuzz.h"
 
 static const typ_t	*findtyp(char *name);
 static int		type_f(int argc, char **argv);
@@ -254,10 +255,17 @@ handle_struct(
 	int           argc,
 	char          **argv)
 {
-	if (action == DB_WRITE)
+	switch (action) {
+	case DB_FUZZ:
+		fuzz_struct(fields, argc, argv);
+		break;
+	case DB_WRITE:
 		write_struct(fields, argc, argv);
-	else
+		break;
+	case DB_READ:
 		print_struct(fields, argc, argv);
+		break;
+	}
 }
 
 void
@@ -267,10 +275,17 @@ handle_string(
 	int           argc,
 	char          **argv)
 {
-	if (action == DB_WRITE)
+	switch (action) {
+	case DB_WRITE:
 		write_string(fields, argc, argv);
-	else
+		break;
+	case DB_READ:
 		print_string(fields, argc, argv);
+		break;
+	case DB_FUZZ:
+		dbprintf(_("string fuzzing not supported.\n"));
+		break;
+	}
 }
 
 void
@@ -280,10 +295,17 @@ handle_block(
 	int           argc,
 	char          **argv)
 {
-	if (action == DB_WRITE)
+	switch (action) {
+	case DB_WRITE:
 		write_block(fields, argc, argv);
-	else
+		break;
+	case DB_READ:
 		print_block(fields, argc, argv);
+		break;
+	case DB_FUZZ:
+		dbprintf(_("use 'blocktrash' or 'write' to fuzz a block.\n"));
+		break;
+	}
 }
 
 void
@@ -293,6 +315,14 @@ handle_text(
 	int           argc,
 	char          **argv)
 {
-	if (action != DB_WRITE)
+	switch (action) {
+	case DB_FUZZ:
+		/* fall through */
+	case DB_WRITE:
+		dbprintf(_("text writing/fuzzing not supported.\n"));
+		break;
+	case DB_READ:
 		print_text(fields, argc, argv);
+		break;
+	}
 }
diff --git a/db/type.h b/db/type.h
index 87ff107..a50d705 100644
--- a/db/type.h
+++ b/db/type.h
@@ -30,6 +30,7 @@ typedef enum typnm
 	TYP_TEXT, TYP_FINOBT, TYP_NONE
 } typnm_t;
 
+#define DB_FUZZ  2
 #define DB_WRITE 1
 #define DB_READ  0
 
diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
index 460d89d..55e0629 100644
--- a/man/man8/xfs_db.8
+++ b/man/man8/xfs_db.8
@@ -594,6 +594,55 @@ in units of 512-byte blocks, no matter what the filesystem's block size is.
 .BI "The optional " start " and " end " arguments can be used to constrain
 the output to a particular range of disk blocks.
 .TP
+.BI "fuzz [\-c] [\-d] " "field action"
+Write garbage into a specific structure field on disk.
+Expert mode must be enabled to use this command.
+The operation happens immediately; there is no buffering.
+.IP
+The fuzz command can take the following
+.IR action "s"
+against a field:
+.RS 1.0i
+.TP 0.4i
+.B zeroes
+Clears all bits in the field.
+.TP 0.4i
+.B ones
+Sets all bits in the field.
+.TP 0.4i
+.B firstbit
+Flips the first bit in the field.
+For a scalar value, this is the highest bit.
+.TP 0.4i
+.B middlebit
+Flips the middle bit in the field.
+.TP 0.4i
+.B lastbit
+Flips the last bit in the field.
+For a scalar value, this is the lowest bit.
+.TP 0.4i
+.B add
+Adds a small value to a scalar field.
+.TP 0.4i
+.B sub
+Subtracts a small value from a scalar field.
+.TP 0.4i
+.B random
+Randomizes the contents of the field.
+.RE
+.IP
+The following switches affect the write behavior:
+.RS 1.0i
+.TP 0.4i
+.B \-c
+Skip write verifiers and CRC recalculation; allows invalid data to be written
+to disk.
+.TP 0.4i
+.B \-d
+Skip write verifiers but perform CRC recalculation; allows invalid data to be
+written to disk to test detection of invalid data.
+.RE
+.TP
 .BI hash " string
 Prints the hash value of
 .I string
@@ -755,7 +804,7 @@ and
 bits respectively, and their string equivalent reported
 (but no modifications are made).
 .TP
-.BI "write [\-c] [" "field value" "] ..."
+.BI "write [\-c] [\-d] [" "field value" "] ..."
 Write a value to disk.
 Specific fields can be set in structures (struct mode),
 or a block can be set to data values (data mode),
@@ -778,6 +827,10 @@ with no arguments gives more information on the allowed commands.
 .B \-c
 Skip write verifiers and CRC recalculation; allows invalid data to be written
 to disk.
+.TP 0.4i
+.B \-d
+Skip write verifiers but perform CRC recalculation; allows invalid data to be
+written to disk to test detection of invalid data.
 .RE
 .SH TYPES
 This section gives the fields in each structure type and their meanings.
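
The shifting and alignment done by format_number() in db/fuzz.c can be
restated as a small standalone function.  This is only a sketch under the
assumption that the field is 1 to 64 bits wide; format_field_value() is a
hypothetical name, and it builds the big-endian bytes by hand instead of
using cpu_to_be64().

```c
#include <stdint.h>

/*
 * Hypothetical re-statement of format_number(): left-justify a
 * bit_length-bit value within whole bytes, store it big-endian in
 * out[8], and return a pointer to the first byte that belongs to the
 * field.
 */
static const unsigned char *
format_field_value(uint64_t val, unsigned char out[8], int bit_length)
{
	int	rem = bit_length % 8;		/* bits past a byte boundary */
	int	nbytes = (bit_length + 7) / 8;	/* bytes the field occupies */
	int	i;

	/* Push the field's top bit up to the top bit of its first byte. */
	if (rem)
		val <<= 8 - rem;

	/* Store most significant byte first. */
	for (i = 0; i < 8; i++)
		out[i] = (unsigned char)(val >> (56 - 8 * i));

	/* Skip the leading bytes that are not part of the field. */
	return out + 8 - nbytes;
}
```

For a 10-bit field holding 0x2AB, the value is shifted left by six bits
and the returned pointer covers the two bytes 0xAA 0xC0, which is the
layout setbitval() expects to copy bit by bit.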



* [PATCH 37/39] xfs_db: print attribute remote value blocks
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (35 preceding siblings ...)
  2016-11-05  0:28 ` [PATCH 36/39] xfs_db: introduce fuzz command Darrick J. Wong
@ 2016-11-05  0:28 ` Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 38/39] xfs_io: provide an interface to the scrub ioctls Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 39/39] xfs_scrub: create online filesystem scrub program Darrick J. Wong
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Teach xfs_db how to print the contents of xattr remote value blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 db/attr.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 db/attr.h  |    1 +
 db/field.c |    3 +++
 db/field.h |    1 +
 4 files changed, 64 insertions(+)
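
The length calculation in attr3_remote_count() boils down to "trust
rm_bytes unless it would run past the end of the block".  A standalone
sketch of that clamp follows.  struct rmt_hdr is a hypothetical, cut-down
header (the real xfs_attr3_rmt_hdr has more fields), RMT_MAGIC is used
here purely as an illustrative constant, and ntohl() stands in for
be32_to_cpu().  The sketch also assumes the block is at least as large
as the header.

```c
#include <arpa/inet.h>	/* ntohl/htonl: big-endian <-> host conversion */
#include <stdint.h>

#define RMT_MAGIC	0x5841524dU	/* "XARM", illustrative */

/* Hypothetical cut-down remote value header; fields are big-endian. */
struct rmt_hdr {
	uint32_t	rm_magic;
	uint32_t	rm_offset;
	uint32_t	rm_bytes;	/* value bytes stored in this block */
	uint32_t	rm_crc;
};

/*
 * Return how many data bytes to print after the header: zero if the
 * magic is wrong, otherwise rm_bytes clamped to what fits in the block.
 */
static uint32_t
remote_data_count(const struct rmt_hdr *hdr, uint32_t blocksize)
{
	uint32_t	bytes;

	if (ntohl(hdr->rm_magic) != RMT_MAGIC)
		return 0;

	bytes = ntohl(hdr->rm_bytes);
	if ((uint64_t)bytes + sizeof(*hdr) > blocksize)
		return blocksize - sizeof(*hdr);
	return bytes;
}
```

The clamp matters for a debugger in particular, since xfs_db is often
pointed at blocks whose rm_bytes field has been corrupted or fuzzed.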


diff --git a/db/attr.c b/db/attr.c
index e26ac67..0fffbc2 100644
--- a/db/attr.c
+++ b/db/attr.c
@@ -41,6 +41,9 @@ static int	attr_leaf_nvlist_offset(void *obj, int startoff, int idx);
 static int	attr_node_btree_count(void *obj, int startoff);
 static int	attr_node_hdr_count(void *obj, int startoff);
 
+static int	attr_remote_count(void *obj, int startoff);
+static int	attr3_remote_count(void *obj, int startoff);
+
 const field_t	attr_hfld[] = {
 	{ "", FLDT_ATTR, OI(0), C1, 0, TYP_NONE },
 	{ NULL }
@@ -53,6 +56,7 @@ const field_t	attr_flds[] = {
 	  FLD_COUNT, TYP_NONE },
 	{ "hdr", FLDT_ATTR_NODE_HDR, OI(NOFF(hdr)), attr_node_hdr_count,
 	  FLD_COUNT, TYP_NONE },
+	{ "data", FLDT_CHARNS, OI(0), attr_remote_count, FLD_COUNT, TYP_NONE },
 	{ "entries", FLDT_ATTR_LEAF_ENTRY, OI(LOFF(entries)),
 	  attr_leaf_entries_count, FLD_ARRAY|FLD_COUNT, TYP_NONE },
 	{ "btree", FLDT_ATTR_NODE_ENTRY, OI(NOFF(__btree)), attr_node_btree_count,
@@ -197,6 +201,33 @@ attr3_leaf_hdr_count(
 	return be16_to_cpu(leaf->hdr.info.hdr.magic) == XFS_ATTR3_LEAF_MAGIC;
 }
 
+static int
+attr_remote_count(
+	void		*obj,
+	int		startoff)
+{
+	if (attr_leaf_hdr_count(obj, startoff) == 0 &&
+	    attr_node_hdr_count(obj, startoff) == 0)
+		return mp->m_sb.sb_blocksize;
+	return 0;
+}
+
+static int
+attr3_remote_count(
+	void		*obj,
+	int		startoff)
+{
+	struct xfs_attr3_rmt_hdr	*hdr = obj;
+
+	ASSERT(startoff == 0);
+
+	if (hdr->rm_magic != cpu_to_be32(XFS_ATTR3_RMT_MAGIC))
+		return 0;
+	if (be32_to_cpu(hdr->rm_bytes) + sizeof(*hdr) > mp->m_sb.sb_blocksize)
+		return mp->m_sb.sb_blocksize - sizeof(*hdr);
+	return be32_to_cpu(hdr->rm_bytes);
+}
+
 typedef int (*attr_leaf_entry_walk_f)(struct xfs_attr_leafblock *,
 				      struct xfs_attr_leaf_entry *, int);
 static int
@@ -477,6 +508,17 @@ attr3_node_hdr_count(
 	return be16_to_cpu(node->hdr.info.hdr.magic) == XFS_DA3_NODE_MAGIC;
 }
 
+static int
+attr3_remote_hdr_count(
+	void			*obj,
+	int			startoff)
+{
+	struct xfs_attr3_rmt_hdr	*node = obj;
+
+	ASSERT(startoff == 0);
+	return be32_to_cpu(node->rm_magic) == XFS_ATTR3_RMT_MAGIC;
+}
+
 int
 attr_size(
 	void	*obj,
@@ -501,6 +543,8 @@ const field_t	attr3_flds[] = {
 	  FLD_COUNT, TYP_NONE },
 	{ "hdr", FLDT_DA3_NODE_HDR, OI(N3OFF(hdr)), attr3_node_hdr_count,
 	  FLD_COUNT, TYP_NONE },
+	{ "hdr", FLDT_ATTR3_REMOTE_HDR, OI(0), attr3_remote_hdr_count,
+	  FLD_COUNT, TYP_NONE },
 	{ "entries", FLDT_ATTR_LEAF_ENTRY, OI(L3OFF(entries)),
 	  attr3_leaf_entries_count, FLD_ARRAY|FLD_COUNT, TYP_NONE },
 	{ "btree", FLDT_ATTR_NODE_ENTRY, OI(N3OFF(__btree)),
@@ -523,6 +567,21 @@ const field_t	attr3_leaf_hdr_flds[] = {
 	{ NULL }
 };
 
+#define	RM3OFF(f)	bitize(offsetof(struct xfs_attr3_rmt_hdr, rm_ ## f))
+const struct field	attr3_remote_crc_flds[] = {
+	{ "magic", FLDT_UINT32X, OI(RM3OFF(magic)), C1, 0, TYP_NONE },
+	{ "offset", FLDT_UINT32D, OI(RM3OFF(offset)), C1, 0, TYP_NONE },
+	{ "bytes", FLDT_UINT32D, OI(RM3OFF(bytes)), C1, 0, TYP_NONE },
+	{ "crc", FLDT_CRC, OI(RM3OFF(crc)), C1, 0, TYP_NONE },
+	{ "uuid", FLDT_UUID, OI(RM3OFF(uuid)), C1, 0, TYP_NONE },
+	{ "owner", FLDT_INO, OI(RM3OFF(owner)), C1, 0, TYP_NONE },
+	{ "bno", FLDT_DFSBNO, OI(RM3OFF(blkno)), C1, 0, TYP_BMAPBTD },
+	{ "lsn", FLDT_UINT64X, OI(RM3OFF(lsn)), C1, 0, TYP_NONE },
+	{ "data", FLDT_CHARNS, OI(bitize(sizeof(struct xfs_attr3_rmt_hdr))),
+		attr3_remote_count, FLD_COUNT, TYP_NONE },
+	{ NULL }
+};
+
 /*
  * Special read verifier for attribute buffers. Detect the magic number
  * appropriately and set the correct verifier and call it.
diff --git a/db/attr.h b/db/attr.h
index bc3431f..d7bb579 100644
--- a/db/attr.h
+++ b/db/attr.h
@@ -30,6 +30,7 @@ extern const field_t	attr3_flds[];
 extern const field_t	attr3_hfld[];
 extern const field_t	attr3_leaf_hdr_flds[];
 extern const field_t	attr3_node_hdr_flds[];
+extern const field_t	attr3_remote_crc_flds[];
 
 extern int	attr_leaf_name_size(void *obj, int startoff, int idx);
 extern int	attr_size(void *obj, int startoff, int idx);
diff --git a/db/field.c b/db/field.c
index 1968dd5..e8bbbe3 100644
--- a/db/field.c
+++ b/db/field.c
@@ -97,6 +97,9 @@ const ftattr_t	ftattrtab[] = {
 	{ FLDT_ATTR3_NODE_HDR, "attr3_node_hdr", NULL,
 	  (char *)da3_node_hdr_flds, SI(bitsz(struct xfs_da3_node_hdr)),
 	  0, NULL, da3_node_hdr_flds },
+	{ FLDT_ATTR3_REMOTE_HDR, "attr3_remote_hdr", NULL,
+	  (char *)attr3_remote_crc_flds, attr_size, FTARG_SIZE, NULL,
+	  attr3_remote_crc_flds },
 
 	{ FLDT_BMAPBTA, "bmapbta", NULL, (char *)bmapbta_flds, btblock_size,
 	  FTARG_SIZE, NULL, bmapbta_flds },
diff --git a/db/field.h b/db/field.h
index 53616f1..e5a943b 100644
--- a/db/field.h
+++ b/db/field.h
@@ -46,6 +46,7 @@ typedef enum fldt	{
 	FLDT_ATTR3,
 	FLDT_ATTR3_LEAF_HDR,
 	FLDT_ATTR3_NODE_HDR,
+	FLDT_ATTR3_REMOTE_HDR,
 
 	FLDT_BMAPBTA,
 	FLDT_BMAPBTA_CRC,



* [PATCH 38/39] xfs_io: provide an interface to the scrub ioctls
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (36 preceding siblings ...)
  2016-11-05  0:28 ` [PATCH 37/39] xfs_db: print attribute remote value blocks Darrick J. Wong
@ 2016-11-05  0:28 ` Darrick J. Wong
  2016-11-05  0:28 ` [PATCH 39/39] xfs_scrub: create online filesystem scrub program Darrick J. Wong
  38 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Create a new xfs_io command to call the new XFS metadata scrub ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 io/Makefile       |    2 
 io/init.c         |    2 
 io/inject.c       |    4 -
 io/io.h           |    2 
 io/scrub.c        |  325 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_io.8 |    8 +
 6 files changed, 341 insertions(+), 2 deletions(-)
 create mode 100644 io/scrub.c


diff --git a/io/Makefile b/io/Makefile
index d65bafc..1e1bc6a 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -11,7 +11,7 @@ HFILES = init.h io.h
 CFILES = init.c \
 	attr.c bmap.c file.c freeze.c fsync.c getrusage.c imap.c link.c \
 	mmap.c open.c parent.c pread.c prealloc.c pwrite.c seek.c shutdown.c \
-	sync.c truncate.c reflink.c cowextsize.c fsmap.c
+	sync.c truncate.c reflink.c cowextsize.c fsmap.c scrub.c
 
 LLDLIBS = $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
 LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE)
diff --git a/io/init.c b/io/init.c
index 27c4a16..ccb1a95 100644
--- a/io/init.c
+++ b/io/init.c
@@ -89,6 +89,8 @@ init_commands(void)
 	truncate_init();
 	reflink_init();
 	cowextsize_init();
+	scrub_init();
+	repair_init();
 }
 
 static int
diff --git a/io/inject.c b/io/inject.c
index 5d5e4ae..ea0d3b0 100644
--- a/io/inject.c
+++ b/io/inject.c
@@ -86,7 +86,9 @@ error_tag(char *name)
 		{ XFS_ERRTAG_BMAP_FINISH_ONE,		"bmap_finish_one" },
 #define XFS_ERRTAG_AG_RESV_CRITICAL			27
 		{ XFS_ERRTAG_AG_RESV_CRITICAL,		"ag_resv_critical" },
-#define XFS_ERRTAG_MAX                                  28
+#define XFS_ERRTAG_FORCE_REPAIR				28
+		{ XFS_ERRTAG_FORCE_REPAIR,		"force_repair" },
+#define XFS_ERRTAG_MAX                                  29
 		{ XFS_ERRTAG_MAX,			NULL }
 	};
 	int	count;
diff --git a/io/io.h b/io/io.h
index 0ee2c41..3c53e0e 100644
--- a/io/io.h
+++ b/io/io.h
@@ -172,3 +172,5 @@ extern void		readdir_init(void);
 extern void		reflink_init(void);
 
 extern void		cowextsize_init(void);
+extern void		scrub_init(void);
+extern void		repair_init(void);
diff --git a/io/scrub.c b/io/scrub.c
new file mode 100644
index 0000000..65cafff
--- /dev/null
+++ b/io/scrub.c
@@ -0,0 +1,325 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <sys/uio.h>
+#include <xfs/xfs.h>
+#include "command.h"
+#include "input.h"
+#include "init.h"
+#include "io.h"
+
+static struct cmdinfo scrub_cmd;
+static struct cmdinfo repair_cmd;
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+	ST_NONE,	/* disabled */
+	ST_PERAG,	/* per-AG metadata */
+	ST_FS,		/* per-FS metadata */
+	ST_INODE,	/* per-inode metadata */
+};
+
+/* These must correspond with XFS_SCRUB_TYPE_ */
+struct scrub_descr {
+	const char	*name;
+	enum scrub_type	type;
+};
+
+static const struct scrub_descr scrubbers[] = {
+	{"dummy",	ST_NONE},
+	{"sb",		ST_PERAG},
+	{"agf",		ST_PERAG},
+	{"agfl",	ST_PERAG},
+	{"agi",		ST_PERAG},
+	{"bnobt",	ST_PERAG},
+	{"cntbt",	ST_PERAG},
+	{"inobt",	ST_PERAG},
+	{"finobt",	ST_PERAG},
+	{"rmapbt",	ST_PERAG},
+	{"refcountbt",	ST_PERAG},
+	{"inode",	ST_INODE},
+	{"bmapbtd",	ST_INODE},
+	{"bmapbta",	ST_INODE},
+	{"bmapbtc",	ST_INODE},
+	{"directory",	ST_INODE},
+	{"xattr",	ST_INODE},
+	{"symlink",	ST_INODE},
+	{"rtbitmap",	ST_FS},
+	{"rtsummary",	ST_FS},
+	{NULL,		ST_NONE},
+};
+
+static void
+scrub_help(void)
+{
+	const struct scrub_descr	*d;
+
+	printf(_("\n\
+ Scrubs a piece of XFS filesystem metadata.  The first argument is the type\n\
+ of metadata to examine.  Allocation group number(s) can be specified to\n\
+ restrict the scrub operation to a subset of allocation groups.\n\
+ Certain metadata types do not take AG numbers.\n\
+\n\
+ Example:\n\
+ 'scrub inobt 3' - scrub the inode btree in AG 3.\n\
+ 'scrub bmapbtd 128 13525' - scrubs the extent map of inode 128 gen 13525.\n\
+\n\
+ Known metadata scrub types are:"));
+	for (d = scrubbers; d->name; d++)
+		printf(" %s", d->name);
+	printf("\n");
+}
+
+static void
+scrub_ioctl(
+	int				fd,
+	int				type,
+	uint64_t			control,
+	uint32_t			control2)
+{
+	struct xfs_scrub_metadata	meta;
+	const struct scrub_descr	*sc;
+	int				error;
+
+	sc = &scrubbers[type];
+	memset(&meta, 0, sizeof(meta));
+	meta.sm_type = type;
+	switch (sc->type) {
+	case ST_PERAG:
+		meta.sm_agno = control;
+		break;
+	case ST_INODE:
+		meta.sm_ino = control;
+		meta.sm_gen = control2;
+		break;
+	case ST_NONE:
+	case ST_FS:
+		/* no control parameters */
+		break;
+	}
+	meta.sm_flags = 0;
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+	if (error)
+		perror("scrub");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+		printf("Corruption detected.\n");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_PREEN)
+		printf("Optimization possible.\n");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_XREF_FAIL)
+		printf("Cross-referencing failed.\n");
+}
+
+static int
+parse_args(
+	int				argc,
+	char				**argv,
+	struct cmdinfo			*cmdinfo,
+	void				(*fn)(int, int, uint64_t, uint32_t))
+{
+	char				*p;
+	int				type = -1;
+	int				i, c;
+	uint64_t			control = 0;
+	uint32_t			control2 = 0;
+	const struct scrub_descr	*d = NULL;
+
+	while ((c = getopt(argc, argv, "")) != EOF) {
+		switch (c) {
+		default:
+			return command_usage(cmdinfo);
+		}
+	}
+	if (optind > argc - 1)
+		return command_usage(cmdinfo);
+
+	for (i = 0, d = scrubbers; d->name; i++, d++) {
+		if (strcmp(d->name, argv[optind]) == 0) {
+			type = i;
+			break;
+		}
+	}
+	optind++;
+
+	if (type < 0)
+		return command_usage(cmdinfo);
+
+	switch (d->type) {
+	case ST_INODE:
+		if (optind == argc) {
+			control = 0;
+			control2 = 0;
+		} else if (optind == argc - 2) {
+			control = strtoull(argv[optind], &p, 0);
+			if (*p != '\0') {
+				fprintf(stderr, _("Bad inode number %s.\n"),
+					argv[optind]);
+				return 0;
+			}
+			control2 = strtoul(argv[optind + 1], &p, 0);
+			if (*p != '\0') {
+				fprintf(stderr,
+					_("Bad generation number %s.\n"),
+					argv[optind + 1]);
+			}
+		} else {
+			fprintf(stderr,
+				_("Must specify inode number and generation.\n"));
+			return 0;
+		}
+		break;
+	case ST_PERAG:
+	case ST_NONE:
+		if (optind != argc - 1) {
+			fprintf(stderr,
+				_("Must specify AG number.\n"));
+			return 0;
+		}
+		control = strtoul(argv[optind], &p, 0);
+		if (*p != '\0') {
+			fprintf(stderr,
+				_("Bad AG number %s.\n"), argv[optind]);
+			return 0;
+		}
+		break;
+	default:
+		if (optind != argc) {
+			fprintf(stderr,
+				_("No parameters allowed.\n"));
+			return 0;
+		}
+	}
+	fn(file->fd, type, control, control2);
+
+	return 0;
+}
+
+static int
+scrub_f(
+	int				argc,
+	char				**argv)
+{
+	return parse_args(argc, argv, &scrub_cmd, scrub_ioctl);
+}
+
+void
+scrub_init(void)
+{
+	scrub_cmd.name = "scrub";
+	scrub_cmd.altname = "sc";
+	scrub_cmd.cfunc = scrub_f;
+	scrub_cmd.argmin = 1;
+	scrub_cmd.argmax = -1;
+	scrub_cmd.flags = CMD_NOMAP_OK;
+	scrub_cmd.args =
+_("type [agno...]");
+	scrub_cmd.oneline =
+		_("scrubs filesystem metadata");
+	scrub_cmd.help = scrub_help;
+
+	add_command(&scrub_cmd);
+}
+
+static void
+repair_help(void)
+{
+	const struct scrub_descr	*d;
+
+	printf(_("\n\
+ Repairs a piece of XFS filesystem metadata.  The first argument is the type\n\
+ of metadata to repair.  Allocation group number(s) can be specified to\n\
+ restrict the repair operation to a subset of allocation groups.\n\
+ Certain metadata types do not take AG numbers.\n\
+\n\
+ Example:\n\
+ 'repair inobt 3 5 7' - repairs the inode btree in groups 3, 5, and 7.\n\
+\n\
+ Known metadata repair types are:"));
+	for (d = scrubbers; d->name; d++)
+		printf(" %s", d->name);
+	printf("\n");
+}
+
+static void
+repair_ioctl(
+	int				fd,
+	int				type,
+	uint64_t			control,
+	uint32_t			control2)
+{
+	struct xfs_scrub_metadata	meta;
+	const struct scrub_descr	*sc;
+	int				error;
+
+	sc = &scrubbers[type];
+	memset(&meta, 0, sizeof(meta));
+	meta.sm_type = type;
+	switch (sc->type) {
+	case ST_PERAG:
+		meta.sm_agno = control;
+		break;
+	case ST_INODE:
+		meta.sm_ino = control;
+		meta.sm_gen = control2;
+		break;
+	case ST_NONE:
+	case ST_FS:
+		/* no control parameters */
+		break;
+	}
+	meta.sm_flags = XFS_SCRUB_FLAG_REPAIR;
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+	if (error)
+		perror("repair");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+		printf("Corruption remains.\n");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_PREEN)
+		printf("Optimization possible.\n");
+	if (meta.sm_flags & XFS_SCRUB_FLAG_XREF_FAIL)
+		printf("Cross-referencing failed.\n");
+}
+
+static int
+repair_f(
+	int				argc,
+	char				**argv)
+{
+	return parse_args(argc, argv, &repair_cmd, repair_ioctl);
+}
+
+void
+repair_init(void)
+{
+	if (!expert)
+		return;
+	repair_cmd.name = "repair";
+	repair_cmd.altname = "fix";
+	repair_cmd.cfunc = repair_f;
+	repair_cmd.argmin = 1;
+	repair_cmd.argmax = -1;
+	repair_cmd.flags = CMD_NOMAP_OK;
+	repair_cmd.args =
+_("type [agno...]");
+	repair_cmd.oneline =
+		_("repairs filesystem metadata");
+	repair_cmd.help = repair_help;
+
+	add_command(&repair_cmd);
+}
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index f7edcab..f5e89ab 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -937,6 +937,14 @@ verbose output will be printed.
 .IP
 .B [NOTE: Not currently operational on Linux.]
 .PD
+.TP
+.BI "scrub " type " [ " agnumber... " ]"
+Scrub internal XFS filesystem metadata.  The
+.BI type
+parameter specifies which type of metadata to scrub.
+AG numbers can optionally be specified to restrict the scrub operation
+to a particular set of allocation groups.
+By default, all allocation groups are scrubbed.
 
 .SH SEE ALSO
 .BR mkfs.xfs (8),



* [PATCH 39/39] xfs_scrub: create online filesystem scrub program
  2016-11-05  0:24 [PATCH v2 00/39] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (37 preceding siblings ...)
  2016-11-05  0:28 ` [PATCH 38/39] xfs_io: provide an interface to the scrub ioctls Darrick J. Wong
@ 2016-11-05  0:28 ` Darrick J. Wong
  2016-11-05  5:22   ` Eryu Guan
  38 siblings, 1 reply; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-05  0:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs

Create a filesystem scrubbing tool that walks the directory tree and
queries every file's extents, extended attributes, and stat data.  For
generic (non-XFS) filesystems this depends on the kernel to do nearly
all the validation.  Optionally, we can (try to) read all the file
data.

For XFS, we perform sequential scans of each AG's metadata, inodes,
extent maps, and file data.  Being XFS specific, we can work with the
in-kernel scrubbers to perform much stronger metadata checking and
cross-referencing.  We can also take advantage of newer ioctls such as
GETFSMAP to perform faster read verification.

In the future we will be able to take advantage of (still unwritten)
features such as parent directory pointers to fully validate all
metadata.  However, this tool /should/ work for most non-XFS
filesystems such as ext4 and btrfs.

Note also that the scrub tool can shut down the filesystem if errors
are found.  This is not a default option since scrubbing is very
immature at this time.  It can also ask the XFS driver in the kernel
to optimize or repair metadata, though this may not be successful.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Makefile              |    3 
 configure.ac          |   13 
 include/builddefs.in  |   13 
 m4/Makefile           |    1 
 m4/package_attrdev.m4 |   29 +
 m4/package_libcdev.m4 |  140 +++
 man/man8/xfs_scrub.8  |  127 +++
 scrub/Makefile        |   47 +
 scrub/bitmap.c        |  425 ++++++++
 scrub/bitmap.h        |   42 +
 scrub/disk.c          |  278 ++++++
 scrub/disk.h          |   41 +
 scrub/generic.c       | 1151 +++++++++++++++++++++++
 scrub/iocmd.c         |  412 ++++++++
 scrub/iocmd.h         |   50 +
 scrub/non_xfs.c       |  185 ++++
 scrub/read_verify.c   |  314 ++++++
 scrub/read_verify.h   |   59 +
 scrub/scrub.c         | 1009 ++++++++++++++++++++
 scrub/scrub.h         |  197 ++++
 scrub/xfs.c           | 2465 +++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/xfs_ioctl.c     |  767 +++++++++++++++
 scrub/xfs_ioctl.h     |   84 ++
 23 files changed, 7851 insertions(+), 1 deletion(-)
 create mode 100644 m4/package_attrdev.m4
 create mode 100644 man/man8/xfs_scrub.8
 create mode 100644 scrub/Makefile
 create mode 100644 scrub/bitmap.c
 create mode 100644 scrub/bitmap.h
 create mode 100644 scrub/disk.c
 create mode 100644 scrub/disk.h
 create mode 100644 scrub/generic.c
 create mode 100644 scrub/iocmd.c
 create mode 100644 scrub/iocmd.h
 create mode 100644 scrub/non_xfs.c
 create mode 100644 scrub/read_verify.c
 create mode 100644 scrub/read_verify.h
 create mode 100644 scrub/scrub.c
 create mode 100644 scrub/scrub.h
 create mode 100644 scrub/xfs.c
 create mode 100644 scrub/xfs_ioctl.c
 create mode 100644 scrub/xfs_ioctl.h


diff --git a/Makefile b/Makefile
index 84dc62c..eb41be3 100644
--- a/Makefile
+++ b/Makefile
@@ -46,7 +46,7 @@ HDR_SUBDIRS = include libxfs
 DLIB_SUBDIRS = libxlog libxcmd libhandle
 LIB_SUBDIRS = libxfs $(DLIB_SUBDIRS)
 TOOL_SUBDIRS = copy db estimate fsck growfs io logprint mkfs quota \
-		mdrestore repair rtcp m4 man doc debian
+		mdrestore repair rtcp m4 man doc debian scrub
 
 ifneq ("$(PKG_PLATFORM)","darwin")
 TOOL_SUBDIRS += fsr
@@ -87,6 +87,7 @@ quota: libxcmd
 repair: libxlog libxcmd
 copy: libxlog
 mkfs: libxcmd
+scrub: libhandle libxcmd repair
 
 ifeq ($(HAVE_BUILDDEFS), yes)
 include $(BUILDRULES)
diff --git a/configure.ac b/configure.ac
index b88ab7f..6d6cb11 100644
--- a/configure.ac
+++ b/configure.ac
@@ -131,8 +131,21 @@ AC_HAVE_MNTENT
 AC_HAVE_FLS
 AC_HAVE_READDIR
 AC_HAVE_FSETXATTR
+AC_HAVE_FGETXATTR
+AC_HAVE_FLISTXATTR
+AC_HAVE_LLISTXATTR
 AC_HAVE_MREMAP
 AC_NEED_INTERNAL_FSXATTR
+AC_HAVE_MALLINFO
+AC_HAVE_SG_IO
+AC_HAVE_HDIO_GETGEO
+AC_HAVE_ATTRIBUTES_H
+AC_HAVE_ATTRIBUTES_MACROS
+AC_HAVE_ATTRIBUTES_STRUCTS
+AC_HAVE_OPENAT
+AC_HAVE_READLINKAT
+AC_HAVE_SYNCFS
+AC_HAVE_FSTATAT
 
 if test "$enable_blkid" = yes; then
 AC_HAVE_BLKID_TOPO
diff --git a/include/builddefs.in b/include/builddefs.in
index aeb2905..a8ebd68 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -108,8 +108,21 @@ HAVE_READDIR = @have_readdir@
 HAVE_MNTENT = @have_mntent@
 HAVE_FLS = @have_fls@
 HAVE_FSETXATTR = @have_fsetxattr@
+HAVE_FGETXATTR = @have_fgetxattr@
+HAVE_FLISTXATTR = @have_flistxattr@
+HAVE_LLISTXATTR = @have_llistxattr@
 HAVE_MREMAP = @have_mremap@
 NEED_INTERNAL_FSXATTR = @need_internal_fsxattr@
+HAVE_MALLINFO = @have_mallinfo@
+HAVE_SG_IO = @have_sg_io@
+HAVE_HDIO_GETGEO = @have_hdio_getgeo@
+HAVE_ATTRIBUTES_H = @have_attributes_h@
+HAVE_ATTRIBUTES_MACROS = @have_attributes_macros@
+HAVE_ATTRIBUTES_STRUCTS = @have_attributes_structs@
+HAVE_OPENAT = @have_openat@
+HAVE_READLINKAT = @have_readlinkat@
+HAVE_SYNCFS = @have_syncfs@
+HAVE_FSTATAT = @have_fstatat@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/Makefile b/m4/Makefile
index d282f0a..0c73f35 100644
--- a/m4/Makefile
+++ b/m4/Makefile
@@ -14,6 +14,7 @@ CONFIGURE = \
 
 LSRCFILES = \
 	manual_format.m4 \
+	package_attrdev.m4 \
 	package_blkid.m4 \
 	package_globals.m4 \
 	package_libcdev.m4 \
diff --git a/m4/package_attrdev.m4 b/m4/package_attrdev.m4
new file mode 100644
index 0000000..eb0e35b
--- /dev/null
+++ b/m4/package_attrdev.m4
@@ -0,0 +1,29 @@
+AC_DEFUN([AC_HAVE_ATTRIBUTES_H],
+  [ AC_CHECK_HEADERS(attr/attributes.h, [have_attributes_h=yes])
+    AC_SUBST(have_attributes_h)
+    if test "$have_attributes_h" != "yes"; then
+        echo
+        echo 'WARNING: attr/attributes.h does not exist.'
+        echo 'Install the extended attributes (attr) development package.'
+        echo 'Alternatively, run "make install-dev" from the attr source.'
+        echo
+    fi
+  ])
+
+AC_DEFUN([AC_HAVE_ATTRIBUTES_STRUCTS],
+  [ AC_CHECK_TYPES([struct attrlist_cursor, struct attr_multiop, struct attrlist_ent],
+    [have_attributes_structs=yes],,
+    [
+#include <sys/types.h>
+#include <attr/attributes.h>] )
+    AC_SUBST(have_attributes_structs)
+  ])
+
+AC_DEFUN([AC_HAVE_ATTRIBUTES_MACROS],
+  [ AC_TRY_LINK([
+#include <sys/types.h>
+#include <attr/attributes.h>],
+    [ int x = ATTR_SECURE; int y = ATTR_ROOT; int z = ATTR_TRUST; ATTR_ENTRY(0, 0); ],
+    [have_attributes_macros=yes])
+    AC_SUBST(have_attributes_macros)
+  ])
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index e3c59d8..64c3171 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -236,6 +236,45 @@ AC_DEFUN([AC_HAVE_FSETXATTR],
   ])
 
 #
+# Check if we have a fgetxattr call (Mac OS X)
+#
+AC_DEFUN([AC_HAVE_FGETXATTR],
+  [ AC_CHECK_DECL([fgetxattr],
+       have_fgetxattr=yes,
+       [],
+       [#include <sys/types.h>
+        #include <attr/xattr.h>]
+       )
+    AC_SUBST(have_fgetxattr)
+  ])
+
+#
+# Check if we have a flistxattr call (Mac OS X)
+#
+AC_DEFUN([AC_HAVE_FLISTXATTR],
+  [ AC_CHECK_DECL([flistxattr],
+       have_flistxattr=yes,
+       [],
+       [#include <sys/types.h>
+        #include <attr/xattr.h>]
+       )
+    AC_SUBST(have_flistxattr)
+  ])
+
+#
+# Check if we have a llistxattr call (Mac OS X)
+#
+AC_DEFUN([AC_HAVE_LLISTXATTR],
+  [ AC_CHECK_DECL([llistxattr],
+       have_llistxattr=yes,
+       [],
+       [#include <sys/types.h>
+        #include <attr/xattr.h>]
+       )
+    AC_SUBST(have_llistxattr)
+  ])
+
+#
 # Check if there is mntent.h
 #
 AC_DEFUN([AC_HAVE_MNTENT],
@@ -293,3 +332,104 @@ AC_DEFUN([AC_NEED_INTERNAL_FSXATTR],
     )
     AC_SUBST(need_internal_fsxattr)
   ])
+
+#
+# Check if we have a mallinfo libc call
+#
+AC_DEFUN([AC_HAVE_MALLINFO],
+  [ AC_MSG_CHECKING([for mallinfo ])
+    AC_TRY_COMPILE([
+#include <malloc.h>
+    ], [
+         struct mallinfo test;
+
+         test.arena = 0; test.hblkhd = 0; test.uordblks = 0; test.fordblks = 0;
+         test = mallinfo();
+    ], have_mallinfo=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_mallinfo)
+  ])
+
+#
+# Check if we have the SG_IO ioctl
+#
+AC_DEFUN([AC_HAVE_SG_IO],
+  [ AC_MSG_CHECKING([for struct sg_io_hdr ])
+    AC_TRY_COMPILE([#include <scsi/sg.h>],
+    [
+         struct sg_io_hdr hdr;
+         ioctl(0, SG_IO, &hdr);
+    ], have_sg_io=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_sg_io)
+  ])
+
+#
+# Check if we have the HDIO_GETGEO ioctl
+#
+AC_DEFUN([AC_HAVE_HDIO_GETGEO],
+  [ AC_MSG_CHECKING([for struct hd_geometry ])
+    AC_TRY_COMPILE([#include <linux/hdreg.h>],
+    [
+         struct hd_geometry hdr;
+         ioctl(0, HDIO_GETGEO, &hdr);
+    ], have_hdio_getgeo=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_hdio_getgeo)
+  ])
+
+#
+# Check if we have an openat call
+#
+AC_DEFUN([AC_HAVE_OPENAT],
+  [ AC_CHECK_DECL([openat],
+       have_openat=yes,
+       [],
+       [#include <sys/types.h>
+        #include <sys/stat.h>
+        #include <fcntl.h>]
+       )
+    AC_SUBST(have_openat)
+  ])
+
+#
+# Check if we have a readlinkat call
+#
+AC_DEFUN([AC_HAVE_READLINKAT],
+  [ AC_CHECK_DECL([readlinkat],
+       have_readlinkat=yes,
+       [],
+       [#include <unistd.h>
+        #include <fcntl.h>]
+       )
+    AC_SUBST(have_readlinkat)
+  ])
+
+#
+# Check if we have a syncfs call
+#
+AC_DEFUN([AC_HAVE_SYNCFS],
+  [ AC_CHECK_DECL([syncfs],
+       have_syncfs=yes,
+       [],
+       [#define _GNU_SOURCE
+       #include <unistd.h>])
+    AC_SUBST(have_syncfs)
+  ])
+
+#
+# Check if we have a fstatat call
+#
+AC_DEFUN([AC_HAVE_FSTATAT],
+  [ AC_CHECK_DECL([fstatat],
+       have_fstatat=yes,
+       [],
+       [#define _GNU_SOURCE
+       #include <sys/types.h>
+       #include <sys/stat.h>
+       #include <unistd.h>])
+    AC_SUBST(have_fstatat)
+  ])
diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
new file mode 100644
index 0000000..0ad1fb8
--- /dev/null
+++ b/man/man8/xfs_scrub.8
@@ -0,0 +1,127 @@
+.TH xfs_scrub 8
+.SH NAME
+xfs_scrub \- scrub the contents of an XFS filesystem
+.SH SYNOPSIS
+.B xfs_scrub
+[
+.B \-ademntTvVxy
+]
+.I mountpoint
+.br
+.B xfs_scrub \-V
+.SH DESCRIPTION
+.B xfs_scrub
+attempts to read and check all the metadata in a Linux filesystem.
+.PP
+If
+.B xfs_scrub
+does not detect an XFS filesystem, it will use a generic backend to
+scrub the filesystem.
+This involves walking the directory tree, querying the data and
+extended attribute extent maps, performing limited checks of directory
+and inode data, reading all of an inode's extended attributes,
+optionally reading all data in a file, and comparing the number of
+blocks and inodes seen against the reported counters.
+.PP
+If an XFS filesystem is detected, then
+.B xfs_scrub
+will ask the kernel to perform more rigorous scrubbing of the
+internal metadata.
+The in-kernel scrubbers also cross-reference each data structure's
+records against the other filesystem metadata.
+.PP
+This utility does not know how to correct all errors.
+If the tool cannot fix the detected errors, you must unmount the
+filesystem and run the appropriate repair tool.
+If this tool is run without either of the
+.B \-n
+or
+.B \-y
+options, then it will preen and optimize the filesystem when possible,
+though it will not try to fix errors.
+.SH OPTIONS
+.TP
+.BI \-a " errors"
+Abort if more than this many errors are found on the filesystem.
+.TP
+.B \-d
+Enable debugging mode, which augments error reports with the exact file
+and line where the scrub failure occurred.
+This also enables verbose mode.
+.TP
+.B \-e
+Specifies what happens when errors are detected.
+If
+.IR shutdown
+is given, the filesystem will be taken offline if errors are found.
+Not all backends can shut down a filesystem.
+If
+.IR continue
+is given, no action is taken if errors are found.
+This is the default.
+.TP
+.BI \-m " file"
+Search this file for mounted filesystems instead of /etc/mtab.
+.TP
+.B \-n
+Dry run, do not modify anything in the filesystem.  This disables
+all preening and optimization behaviors, and disables calling
+FITRIM on the free space after a successful run.
+.TP
+.BI \-t " fstype"
+Force the use of a particular type of filesystem scrubber.
+The current backends are:
+.IR xfs ", " ext4 ", " ext3 ", " ext2 ", " btrfs ", and " generic .
+Most filesystems will work just fine with the generic backend.
+.TP
+.BI \-T
+Print timing and memory usage information for each phase.
+.TP
+.B \-v
+Enable verbose mode, which prints periodic status updates.
+.TP
+.B \-V
+Prints the version number and exits.
+.TP
+.B \-x
+Scrub file data.  This reads every block of every file on disk.
+If the filesystem reports file extent mappings or physical extent
+mappings and is backed by a block device,
+.B xfs_scrub
+will issue O_DIRECT reads to the block device directly.
+If the block device is a SCSI disk, it will issue READ VERIFY commands
+directly to the disk.
+.TP
+.B \-y
+Try to repair all filesystem errors.  If the errors cannot be fixed
+online, then the filesystem must be taken offline for repair.
+.SH EXIT CODE
+The exit code returned by
+.B xfs_scrub
+is the sum of the following conditions:
+.br
+\	0\	\-\ No errors
+.br
+\	4\	\-\ File system errors left uncorrected
+.br
+\	8\	\-\ Operational error
+.br
+\	16\	\-\ Usage or syntax error
+.br
+.SH CAVEATS
+.B xfs_scrub
+is an immature utility!
+The generic scrub backend walks the directory tree, reads file extents
+and data, and queries every extended attribute it can find.
+The generic scrub does not grab exclusive locks on the objects it is
+examining, nor does it have any way to cross-reference what it sees
+against the internal filesystem metadata.
+.PP
+The XFS backend takes advantage of in-kernel scrubbing to verify a
+given data structure with locks held.
+This can tie up the system for a while.
+.PP
+If errors are found, the filesystem should be taken offline and
+repaired.
+.SH SEE ALSO
+.BR xfs_repair (8).
diff --git a/scrub/Makefile b/scrub/Makefile
new file mode 100644
index 0000000..c6cdaf5
--- /dev/null
+++ b/scrub/Makefile
@@ -0,0 +1,47 @@
+#
+# Copyright (c) 2016 Oracle.  All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+SCRUB_PREREQS=$(HAVE_FIEMAP)$(HAVE_ATTRIBUTES_H)$(HAVE_ATTRIBUTES_MACROS)$(HAVE_ATTRIBUTES_STRUCTS)$(HAVE_FGETXATTR)$(HAVE_FLISTXATTR)$(HAVE_LLISTXATTR)$(HAVE_OPENAT)$(HAVE_READLINKAT)$(HAVE_FSTATAT)
+
+ifeq ($(SCRUB_PREREQS),yesyesyesyesyesyesyesyesyesyes)
+LTCOMMAND = xfs_scrub
+endif
+
+HFILES = scrub.h ../repair/threads.h xfs_ioctl.h read_verify.h iocmd.h
+CFILES = ../repair/avl64.c disk.c bitmap.c generic.c iocmd.c non_xfs.c \
+	 read_verify.c scrub.c ../repair/threads.c xfs.c xfs_ioctl.c
+
+LLDLIBS += $(LIBBLKID) $(LIBXFS) $(LIBXCMD) $(LIBUUID) $(LIBRT) $(LIBPTHREAD) $(LIBHANDLE)
+LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBHANDLE)
+LLDFLAGS = -static-libtool-libs
+
+ifeq ($(HAVE_MALLINFO),yes)
+LCFLAGS += -DHAVE_MALLINFO
+endif
+
+ifeq ($(HAVE_SG_IO),yes)
+LCFLAGS += -DHAVE_SG_IO
+endif
+
+ifeq ($(HAVE_HDIO_GETGEO),yes)
+LCFLAGS += -DHAVE_HDIO_GETGEO
+endif
+
+ifeq ($(HAVE_SYNCFS),yes)
+LCFLAGS += -DHAVE_SYNCFS
+endif
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: default
+	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
+	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+install-dev:
+
+-include .dep
diff --git a/scrub/bitmap.c b/scrub/bitmap.c
new file mode 100644
index 0000000..96ea745
--- /dev/null
+++ b/scrub/bitmap.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include "../repair/avl64.h"
+#include "bitmap.h"
+
+#define avl_for_each_range_safe(pos, n, l, first, last) \
+	for (pos = (first), n = pos->avl_nextino, l = (last)->avl_nextino; pos != (l); \
+			pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each_safe(tree, pos, n) \
+	for (pos = (tree)->avl_firstino, n = pos ? pos->avl_nextino : NULL; \
+			pos != NULL; \
+			pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each(tree, pos) \
+	for (pos = (tree)->avl_firstino; pos != NULL; pos = pos->avl_nextino)
+
+struct bitmap_node {
+	struct avl64node	btn_node;
+	uint64_t		btn_start;
+	uint64_t		btn_length;
+};
+
+static __uint64_t
+extent_start(
+	struct avl64node	*node)
+{
+	struct bitmap_node	*btn;
+
+	btn = container_of(node, struct bitmap_node, btn_node);
+	return btn->btn_start;
+}
+
+static __uint64_t
+extent_end(
+	struct avl64node	*node)
+{
+	struct bitmap_node	*btn;
+
+	btn = container_of(node, struct bitmap_node, btn_node);
+	return btn->btn_start + btn->btn_length;
+}
+
+static struct avl64ops bitmap_ops = {
+	extent_start,
+	extent_end,
+};
+
+/* Initialize an extent tree. */
+bool
+bitmap_init(
+	struct bitmap		*tree)
+{
+	tree->bt_tree = malloc(sizeof(struct avl64tree_desc));
+	if (!tree->bt_tree)
+		return false;
+
+	pthread_mutex_init(&tree->bt_lock, NULL);
+	avl64_init_tree(tree->bt_tree, &bitmap_ops);
+
+	return true;
+}
+
+/* Free an extent tree. */
+void
+bitmap_free(
+	struct bitmap		*tree)
+{
+	struct avl64node	*node;
+	struct avl64node	*n;
+	struct bitmap_node	*ext;
+
+	if (!tree->bt_tree)
+		return;
+
+	avl_for_each_safe(tree->bt_tree, node, n) {
+		ext = container_of(node, struct bitmap_node, btn_node);
+		free(ext);
+	}
+	free(tree->bt_tree);
+	tree->bt_tree = NULL;
+}
+
+/* Create a new extent. */
+static struct bitmap_node *
+bitmap_node_init(
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct bitmap_node	*ext;
+
+	ext = malloc(sizeof(struct bitmap_node));
+	if (!ext)
+		return NULL;
+
+	ext->btn_node.avl_nextino = NULL;
+	ext->btn_start = start;
+	ext->btn_length = len;
+
+	return ext;
+}
+
+/* Add an extent (locked). */
+static bool
+__bitmap_add(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		length)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct bitmap_node	*ext;
+	uint64_t		new_start;
+	uint64_t		new_length;
+	struct avl64node	*node;
+	bool			res = true;
+
+	/* Find any existing nodes adjacent or within that range. */
+	avl64_findranges(tree->bt_tree, start - 1, start + length + 1,
+			&firstn, &lastn);
+
+	/* Nothing, just insert a new extent. */
+	if (firstn == NULL && lastn == NULL) {
+		ext = bitmap_node_init(start, length);
+		if (!ext)
+			return false;
+
+		node = avl64_insert(tree->bt_tree, &ext->btn_node);
+		if (node == NULL) {
+			free(ext);
+			errno = EEXIST;
+			return false;
+		}
+
+		return true;
+	}
+
+	ASSERT(firstn != NULL && lastn != NULL);
+	new_start = start;
+	new_length = length;
+
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		ext = container_of(pos, struct bitmap_node, btn_node);
+
+		/* Bail if the new extent is contained within an old one. */
+		if (ext->btn_start <= start &&
+		    ext->btn_start + ext->btn_length >= start + length)
+			return res;
+
+		/* Check for overlapping and adjacent extents. */
+		if (ext->btn_start + ext->btn_length >= start ||
+		    ext->btn_start <= start + length) {
+			if (ext->btn_start < start) {
+				new_start = ext->btn_start;
+				new_length += ext->btn_length;
+			}
+
+			if (ext->btn_start + ext->btn_length >
+			    new_start + new_length)
+				new_length = ext->btn_start + ext->btn_length -
+						new_start;
+
+			avl64_delete(tree->bt_tree, pos);
+			free(ext);
+		}
+	}
+
+	ext = bitmap_node_init(new_start, new_length);
+	if (!ext)
+		return false;
+
+	node = avl64_insert(tree->bt_tree, &ext->btn_node);
+	if (node == NULL) {
+		free(ext);
+		errno = EEXIST;
+		return false;
+	}
+
+	return res;
+}
+
+/* Add an extent. */
+bool
+bitmap_add(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		length)
+{
+	bool			res;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	res = __bitmap_add(tree, start, length);
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return res;
+}
+
+/* Remove an extent. */
+bool
+bitmap_remove(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct bitmap_node	*ext;
+	uint64_t		new_start;
+	uint64_t		new_length;
+	struct avl64node	*node;
+	int			stat;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	/* Find any existing nodes over that range. */
+	avl64_findranges(tree->bt_tree, start, start + len, &firstn, &lastn);
+
+	/* Nothing, we're done. */
+	if (firstn == NULL && lastn == NULL) {
+		pthread_mutex_unlock(&tree->bt_lock);
+		return true;
+	}
+
+	ASSERT(firstn != NULL && lastn != NULL);
+
+	/* Delete or truncate everything in sight. */
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		ext = container_of(pos, struct bitmap_node, btn_node);
+
+		stat = 0;
+		if (ext->btn_start < start)
+			stat |= 1;
+		if (ext->btn_start + ext->btn_length > start + len)
+			stat |= 2;
+		switch (stat) {
+		case 0:
+			/* Extent totally within range; delete. */
+			avl64_delete(tree->bt_tree, pos);
+			free(ext);
+			break;
+		case 1:
+			/* Extent is left-adjacent; truncate. */
+			ext->btn_length = start - ext->btn_start;
+			break;
+		case 2:
+			/* Extent is right-adjacent; move it. */
+			ext->btn_length = ext->btn_start + ext->btn_length -
+					(start + len);
+			ext->btn_start = start + len;
+			break;
+		case 3:
+			/* Extent overlaps both ends; split it in two. */
+			new_start = start + len;
+			new_length = ext->btn_start + ext->btn_length -
+					new_start;
+			ext->btn_length = start - ext->btn_start;
+
+			ext = bitmap_node_init(new_start, new_length);
+			if (!ext) {
+				pthread_mutex_unlock(&tree->bt_lock);
+				return false;
+			}
+
+			node = avl64_insert(tree->bt_tree, &ext->btn_node);
+			if (node == NULL) {
+				free(ext);
+				errno = EEXIST;
+				pthread_mutex_unlock(&tree->bt_lock);
+				return false;
+			}
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&tree->bt_lock);
+	return true;
+}
+
+/* Iterate an extent tree. */
+bool
+bitmap_iterate(
+	struct bitmap		*tree,
+	bool			(*fn)(uint64_t, uint64_t, void *),
+	void			*arg)
+{
+	struct avl64node	*node;
+	struct bitmap_node	*ext;
+	bool			moveon = true;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	avl_for_each(tree->bt_tree, node) {
+		ext = container_of(node, struct bitmap_node, btn_node);
+		moveon = fn(ext->btn_start, ext->btn_length, arg);
+		if (!moveon)
+			break;
+	}
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return moveon;
+}
+
+/* Do any extents overlap the given one?  (locked) */
+static bool
+__bitmap_has_extent(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+
+	/* Find any existing nodes over that range. */
+	avl64_findranges(tree->bt_tree, start, start + len, &firstn, &lastn);
+
+	return firstn != NULL && lastn != NULL;
+}
+
+/* Do any extents overlap the given one? */
+bool
+bitmap_has_extent(
+	struct bitmap		*tree,
+	uint64_t		start,
+	uint64_t		len)
+{
+	bool			res;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	res = __bitmap_has_extent(tree, start, len);
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return res;
+}
+
+/* Ensure that the extent is set, and return the old value. */
+bool
+bitmap_test_and_set(
+	struct bitmap		*tree,
+	uint64_t		start,
+	bool			*was_set)
+{
+	bool			res = true;
+
+	pthread_mutex_lock(&tree->bt_lock);
+	*was_set = __bitmap_has_extent(tree, start, 1);
+	if (!(*was_set))
+		res = __bitmap_add(tree, start, 1);
+	pthread_mutex_unlock(&tree->bt_lock);
+
+	return res;
+}
+
+/* Is it empty? */
+bool
+bitmap_empty(
+	struct bitmap		*tree)
+{
+	return tree->bt_tree->avl_firstino == NULL;
+}
+
+static bool
+merge_helper(
+	uint64_t		start,
+	uint64_t		length,
+	void			*arg)
+{
+	struct bitmap		*thistree = arg;
+
+	return __bitmap_add(thistree, start, length);
+}
+
+/* Merge another tree with this one. */
+bool
+bitmap_merge(
+	struct bitmap		*thistree,
+	struct bitmap		*tree)
+{
+	bool			res;
+
+	assert(thistree != tree);
+
+	pthread_mutex_lock(&thistree->bt_lock);
+	res = bitmap_iterate(tree, merge_helper, thistree);
+	pthread_mutex_unlock(&thistree->bt_lock);
+
+	return res;
+}
+
+static bool
+bitmap_dump_fn(
+	uint64_t		startblock,
+	uint64_t		blockcount,
+	void			*arg)
+{
+	printf("%"PRIu64":%"PRIu64"\n", startblock, blockcount);
+	return true;
+}
+
+/* Dump extent tree. */
+void
+bitmap_dump(
+	struct bitmap		*tree)
+{
+	printf("BITMAP DUMP %p\n", tree);
+	bitmap_iterate(tree, bitmap_dump_fn, NULL);
+	printf("BITMAP DUMP DONE\n");
+}
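
The merge step in __bitmap_add can be hard to follow inside the AVL walk, so here is a minimal standalone sketch of just the arithmetic. The struct extent type and both helper names are hypothetical and are not part of this patch; the sketch assumes half-open ranges [start, start + length) and ignores overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bitmap_node: a half-open range. */
struct extent {
	uint64_t	start;
	uint64_t	length;
};

/* Do two extents overlap or touch end to end? */
static bool
extent_mergeable(const struct extent *a, const struct extent *b)
{
	return a->start + a->length >= b->start &&
	       b->start + b->length >= a->start;
}

/* Grow *acc so that it covers both itself and *ext. */
static void
extent_absorb(struct extent *acc, const struct extent *ext)
{
	uint64_t	acc_end = acc->start + acc->length;
	uint64_t	ext_end = ext->start + ext->length;

	/* Only overlapping or adjacent extents may be merged. */
	assert(extent_mergeable(acc, ext));

	if (ext->start < acc->start)
		acc->start = ext->start;
	if (ext_end > acc_end)
		acc_end = ext_end;
	acc->length = acc_end - acc->start;
}
```

Working from the end offsets and recomputing the length last avoids double counting the overlapped part when the existing extent starts before the new one.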
diff --git a/scrub/bitmap.h b/scrub/bitmap.h
new file mode 100644
index 0000000..1c0a8a8
--- /dev/null
+++ b/scrub/bitmap.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef BITMAP_H_
+#define BITMAP_H_
+
+struct bitmap {
+	pthread_mutex_t		bt_lock;
+	struct avl64tree_desc	*bt_tree;
+};
+
+bool bitmap_init(struct bitmap *tree);
+void bitmap_free(struct bitmap *tree);
+bool bitmap_add(struct bitmap *tree, uint64_t start, uint64_t length);
+bool bitmap_remove(struct bitmap *tree, uint64_t start,
+		uint64_t len);
+bool bitmap_iterate(struct bitmap *tree,
+		bool (*fn)(uint64_t, uint64_t, void *), void *arg);
+bool bitmap_has_extent(struct bitmap *tree, uint64_t start,
+		uint64_t len);
+bool bitmap_test_and_set(struct bitmap *tree, uint64_t start, bool *was_set);
+bool bitmap_empty(struct bitmap *tree);
+bool bitmap_merge(struct bitmap *thistree, struct bitmap *tree);
+void bitmap_dump(struct bitmap *tree);
+
+#endif /* BITMAP_H_ */
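
For reviewers checking bitmap_remove, the four cases of its switch statement can be sketched on a single extent without the tree. This is only an illustration under the same half-open range assumption; struct extent and extent_punch are hypothetical names, not functions from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bitmap_node: a half-open range. */
struct extent {
	uint64_t	start;
	uint64_t	length;
};

/*
 * Remove [start, start + len) from *ext, which must overlap that range.
 * Returns the number of surviving pieces: 0 if the extent is swallowed,
 * 1 if it is trimmed in place, 2 if it is split (the left piece stays
 * in *ext and the right piece is written to *right).
 */
static int
extent_punch(struct extent *ext, uint64_t start, uint64_t len,
	     struct extent *right)
{
	uint64_t	ext_end = ext->start + ext->length;
	uint64_t	end = start + len;
	bool		keep_left = ext->start < start;
	bool		keep_right = ext_end > end;

	assert(ext->start < end && start < ext_end);

	if (keep_left && keep_right) {
		/* Compute the right piece before truncating the left. */
		right->start = end;
		right->length = ext_end - end;
		ext->length = start - ext->start;
		return 2;
	}
	if (keep_left) {
		ext->length = start - ext->start;
		return 1;
	}
	if (keep_right) {
		ext->start = end;
		ext->length = ext_end - end;
		return 1;
	}
	ext->length = 0;
	return 0;
}
```

The ordering in the split case matters: the right-hand length depends on the original end of the extent, so it has to be derived before the left piece is shortened.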
diff --git a/scrub/disk.c b/scrub/disk.c
new file mode 100644
index 0000000..8343a3c
--- /dev/null
+++ b/scrub/disk.c
@@ -0,0 +1,278 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#ifdef HAVE_SG_IO
+# include <scsi/sg.h>
+#endif
+#ifdef HAVE_HDIO_GETGEO
+# include <linux/hdreg.h>
+#endif
+#include "disk.h"
+#include "scrub.h"
+
+/* Figure out how many disk heads are available. */
+unsigned int
+disk_heads(
+	struct disk		*disk)
+{
+	int			iomin;
+	int			ioopt;
+	unsigned short		rot;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_THREADS"))
+		return 1;
+
+	/* If it's not a block device, throw all the CPUs at it. */
+	if (!S_ISBLK(disk->d_sb.st_mode))
+		return libxfs_nproc();
+
+	/* Non-rotational device?  Throw all the CPUs. */
+	rot = 1;
+	error = ioctl(disk->d_fd, BLKROTATIONAL, &rot);
+	if (error == 0 && rot == 0)
+		return libxfs_nproc();
+
+	/*
+	 * Sometimes we can infer the number of devices from the
+	 * min/optimal IO sizes.
+	 */
+	iomin = ioopt = 0;
+	if (ioctl(disk->d_fd, BLKIOMIN, &iomin) == 0 &&
+	    ioctl(disk->d_fd, BLKIOOPT, &ioopt) == 0 &&
+	    iomin > 0 && ioopt > 0) {
+		return min(libxfs_nproc(), max(1, ioopt / iomin));
+	}
+
+	/* Rotating device?  I guess? */
+	return 2;
+}
+
+/* Execute a SCSI VERIFY(16).  We hope. */
+#ifdef HAVE_SG_IO
+# define SENSE_BUF_LEN		64
+# define VERIFY16_CMDLEN	16
+# define VERIFY16_CMD		0x8F
+
+# ifndef SG_FLAG_Q_AT_TAIL
+#  define SG_FLAG_Q_AT_TAIL	0x10
+# endif
+static int
+disk_scsi_verify(
+	struct disk		*disk,
+	uint64_t		startblock, /* lba */
+	uint64_t		blockcount) /* lba */
+{
+	struct sg_io_hdr	iohdr;
+	unsigned char		cdb[VERIFY16_CMDLEN];
+	unsigned char		sense[SENSE_BUF_LEN];
+	uint64_t		llba;
+	uint64_t		veri_len = blockcount;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"));
+
+	llba = startblock + (disk->d_start >> BBSHIFT);
+
+	/* Borrowed from sg_verify */
+	cdb[0] = VERIFY16_CMD;
+	cdb[1] = 0; /* skip PI, DPO, and byte check. */
+	cdb[2] = (llba >> 56) & 0xff;
+	cdb[3] = (llba >> 48) & 0xff;
+	cdb[4] = (llba >> 40) & 0xff;
+	cdb[5] = (llba >> 32) & 0xff;
+	cdb[6] = (llba >> 24) & 0xff;
+	cdb[7] = (llba >> 16) & 0xff;
+	cdb[8] = (llba >> 8) & 0xff;
+	cdb[9] = llba & 0xff;
+	cdb[10] = (veri_len >> 24) & 0xff;
+	cdb[11] = (veri_len >> 16) & 0xff;
+	cdb[12] = (veri_len >> 8) & 0xff;
+	cdb[13] = veri_len & 0xff;
+	cdb[14] = 0;
+	cdb[15] = 0;
+	memset(sense, 0, SENSE_BUF_LEN);
+
+	/* v3 SG_IO */
+	memset(&iohdr, 0, sizeof(iohdr));
+	iohdr.interface_id = 'S';
+	iohdr.dxfer_direction = SG_DXFER_NONE;
+	iohdr.cmdp = cdb;
+	iohdr.cmd_len = VERIFY16_CMDLEN;
+	iohdr.sbp = sense;
+	iohdr.mx_sb_len = SENSE_BUF_LEN;
+	iohdr.flags |= SG_FLAG_Q_AT_TAIL;
+	iohdr.timeout = 30000; /* 30s */
+
+	error = ioctl(disk->d_fd, SG_IO, &iohdr);
+	if (error)
+		return error;
+
+	dbg_printf("VERIFY(16) fd %d lba %"PRIu64" len %"PRIu64" info %x "
+			"status %d masked %d msg %d host %d driver %d "
+			"duration %d resid %d\n",
+			disk->d_fd, startblock, blockcount, iohdr.info,
+			iohdr.status, iohdr.masked_status, iohdr.msg_status,
+			iohdr.host_status, iohdr.driver_status, iohdr.duration,
+			iohdr.resid);
+
+	if (iohdr.info & SG_INFO_CHECK) {
+		dbg_printf("status: msg %x host %x driver %x\n",
+				iohdr.msg_status, iohdr.host_status,
+				iohdr.driver_status);
+		errno = EIO;
+		return -1;
+	}
+
+	return error;
+}
+#else
+# define disk_scsi_verify(...)		(ENOTTY)
+#endif /* HAVE_SG_IO */
+
+/* Test whether the device accepts SCSI VERIFY(16) via SG_IO. */
+static bool
+disk_can_scsi_verify(
+	struct disk		*disk)
+{
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"))
+		return false;
+
+	error = disk_scsi_verify(disk, 0, 1);
+	return error == 0;
+}
+
+/* Open a disk device and discover its geometry. */
+int
+disk_open(
+	const char		*pathname,
+	struct disk		*disk)
+{
+#ifdef HAVE_HDIO_GETGEO
+	struct hd_geometry	bdgeo;
+#endif
+	bool			suspicious_disk = false;
+	int			lba_sz;
+	int			error;
+
+	disk->d_fd = open(pathname, O_RDONLY | O_DIRECT | O_NOATIME);
+	if (disk->d_fd < 0)
+		return -1;
+
+	/* Try to get LBA size. */
+	error = ioctl(disk->d_fd, BLKSSZGET, &lba_sz);
+	if (error)
+		lba_sz = 512;
+	disk->d_lbalog = libxfs_log2_roundup(lba_sz);
+
+	/* Obtain disk's stat info. */
+	error = fstat(disk->d_fd, &disk->d_sb);
+	if (error) {
+		error = errno;
+		close(disk->d_fd);
+		errno = error;
+		disk->d_fd = -1;
+		return -1;
+	}
+
+	/* Determine bdev size, block size, and offset. */
+	if (S_ISBLK(disk->d_sb.st_mode)) {
+		error = ioctl(disk->d_fd, BLKGETSIZE64, &disk->d_size);
+		if (error)
+			disk->d_size = 0;
+		error = ioctl(disk->d_fd, BLKBSZGET, &disk->d_blksize);
+		if (error)
+			disk->d_blksize = 0;
+#ifdef HAVE_HDIO_GETGEO
+		error = ioctl(disk->d_fd, HDIO_GETGEO, &bdgeo);
+		if (!error) {
+			/*
+			 * dm devices will pass through ioctls, which means
+			 * we can't use SCSI VERIFY unless the start is 0.
+			 * Most dm devices don't set geometry (unlike scsi
+			 * and nvme) so use a zeroed out CHS to screen them
+			 * out.
+			 */
+			if (bdgeo.start != 0 &&
+			    (unsigned long long)bdgeo.heads * bdgeo.sectors *
+					bdgeo.cylinders == 0)
+				suspicious_disk = true;
+			disk->d_start = (uint64_t)bdgeo.start << BBSHIFT;
+		} else
+#endif
+			disk->d_start = 0;
+	} else {
+		disk->d_size = disk->d_sb.st_size;
+		disk->d_blksize = disk->d_sb.st_blksize;
+		disk->d_start = 0;
+	}
+
+	/* Can we issue SCSI VERIFY? */
+	if (!suspicious_disk && disk_can_scsi_verify(disk))
+		disk->d_flags |= DISK_FLAG_SCSI_VERIFY;
+
+	return 0;
+}
+
+/* Close a disk device. */
+int
+disk_close(
+	struct disk		*disk)
+{
+	int			error = 0;
+
+	if (disk->d_fd >= 0)
+		error = close(disk->d_fd);
+	disk->d_fd = -1;
+	return error;
+}
+
+/* Is this device open? */
+bool
+disk_is_open(
+	struct disk		*disk)
+{
+	return disk->d_fd >= 0;
+}
+
+#define BTOLBAT(d, bytes)	((uint64_t)(bytes) >> (d)->d_lbalog)
+#define LBASIZE(d)		(1ULL << (d)->d_lbalog)
+#define BTOLBA(d, bytes)	(((uint64_t)(bytes) + LBASIZE(d) - 1) >> (d)->d_lbalog)
+
+/* Read-verify an extent of a disk device. */
+ssize_t
+disk_read_verify(
+	struct disk		*disk,
+	void			*buf,
+	uint64_t		start,
+	uint64_t		length)
+{
+	/* Convert to logical block size. */
+	if (disk->d_flags & DISK_FLAG_SCSI_VERIFY)
+		return disk_scsi_verify(disk, BTOLBAT(disk, start),
+				BTOLBA(disk, length));
+
+	return pread(disk->d_fd, buf, length, start);
+}
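
The CDB layout used by disk_scsi_verify is easy to get wrong by one byte, so a standalone sketch of the packing may help with review. It assumes the same layout as the patch: opcode in byte 0, a big-endian 64-bit LBA in bytes 2-9, and a big-endian 32-bit verification length in bytes 10-13. The helper name verify16_build_cdb is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VERIFY16_CMDLEN	16
#define VERIFY16_CMD	0x8F

/* Fill in a VERIFY(16) CDB for nblocks logical blocks starting at lba. */
static void
verify16_build_cdb(unsigned char cdb[VERIFY16_CMDLEN], uint64_t lba,
		   uint32_t nblocks)
{
	int	i;

	assert(cdb != NULL);

	/* Zero everything so flags, group number and control stay 0. */
	memset(cdb, 0, VERIFY16_CMDLEN);
	cdb[0] = VERIFY16_CMD;

	/* Bytes 2-9: 64-bit LBA, most significant byte first. */
	for (i = 0; i < 8; i++)
		cdb[2 + i] = (lba >> (56 - 8 * i)) & 0xff;

	/* Bytes 10-13: 32-bit verification length, big-endian. */
	for (i = 0; i < 4; i++)
		cdb[10 + i] = (nblocks >> (24 - 8 * i)) & 0xff;
}
```

Taking the length as a uint32_t makes the 32-bit limit of the field explicit; a caller with a longer range would need to issue several commands.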
diff --git a/scrub/disk.h b/scrub/disk.h
new file mode 100644
index 0000000..915907d
--- /dev/null
+++ b/scrub/disk.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef DISK_H_
+#define DISK_H_
+
+#define DISK_FLAG_SCSI_VERIFY	0x1
+struct disk {
+	struct stat	d_sb;
+	int		d_fd;
+	int		d_lbalog;
+	unsigned int	d_flags;
+	unsigned int	d_blksize;	/* bytes */
+	uint64_t	d_size;		/* bytes */
+	uint64_t	d_start;	/* bytes */
+};
+
+unsigned int disk_heads(struct disk *disk);
+bool disk_is_open(struct disk *disk);
+int disk_open(const char *pathname, struct disk *disk);
+int disk_close(struct disk *disk);
+ssize_t disk_read_verify(struct disk *disk, void *buf, uint64_t start,
+		uint64_t length);
+
+#endif /* DISK_H_ */
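
Since d_start and d_size are in bytes while SCSI VERIFY wants logical blocks, the conversion through d_lbalog deserves a quick illustration. The sketch below mirrors the BTOLBAT (truncate) and BTOLBA (round up) macros in disk.c as plain functions; the function names and the lba_span helper are hypothetical additions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes to logical blocks, rounding down (like BTOLBAT). */
static uint64_t
bytes_to_lba_floor(unsigned int lbalog, uint64_t bytes)
{
	assert(lbalog < 64);
	return bytes >> lbalog;
}

/* Bytes to logical blocks, rounding up (like BTOLBA). */
static uint64_t
bytes_to_lba_ceil(unsigned int lbalog, uint64_t bytes)
{
	uint64_t	lbasize;

	assert(lbalog < 64);
	lbasize = 1ULL << lbalog;
	return (bytes + lbasize - 1) >> lbalog;
}

/*
 * Number of logical blocks touched by the byte range
 * [start, start + length), even if start is not block aligned.
 */
static uint64_t
lba_span(unsigned int lbalog, uint64_t start, uint64_t length)
{
	return bytes_to_lba_ceil(lbalog, start + length) -
	       bytes_to_lba_floor(lbalog, start);
}
```

With 512-byte blocks, a 512-byte range starting at byte 256 touches two blocks, whereas rounding the length alone gives one. Whether callers ever pass unaligned ranges is not visible in this patch, so treat lba_span as a possible hardening rather than a required change.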
diff --git a/scrub/generic.c b/scrub/generic.c
new file mode 100644
index 0000000..bcec07c
--- /dev/null
+++ b/scrub/generic.c
@@ -0,0 +1,1151 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <linux/fiemap.h>
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/xattr.h>
+#include "disk.h"
+#include "scrub.h"
+#include "iocmd.h"
+#include "../repair/threads.h"
+#include "read_verify.h"
+#include "bitmap.h"
+
+/*
+ * Generic Filesystem Scrub Strategy
+ *
+ * For a generic filesystem, we can only scrub the filesystem using the
+ * generic VFS APIs that are accessible to userspace.  This requirement
+ * reduces the effectiveness of the scrub because we can only scrub that
+ * which we can find through the directory tree namespace -- we won't be
+ * able to examine open unlinked files or any directory subtree that is
+ * also a mountpoint.
+ *
+ * The "find geometry" phase collects statfs/statvfs information and
+ * opens file descriptors to the mountpoint.  If the filesystem has a
+ * block device, a file descriptor is opened to that as well.
+ *
+ * The VFS has no mechanism to scrub internal metadata or to iterate
+ * inodes by inode number, so those phases do nothing.
+ *
+ * The "check directory structure" phase walks the directory tree
+ * looking for inodes.  Each directory is processed separately by thread
+ * pool workers.  For each entry in a directory, we scrub the following
+ * pieces of metadata:
+ *
+ *     - The dirent inode number is compared against the fstatat output.
+ *     - The dirent type code is also checked against the fstatat type.
+ *     - If it's a symlink, the target is read but not validated.
+ *     - If the entry is not a file or directory, the extended
+ *       attribute names and values are read via llistxattr and lgetxattr.
+ *     - If the entry points to a file or directory, open the inode.
+ *       If not, we're done with the entry.
+ *     - The inode stat buffer is re-checked.
+ *     - The extent maps for file data and extended attribute data are
+ *       checked.
+ *     - Extended attributes are read.
+ *
+ * The "verify data file integrity" phase re-walks the directory tree
+ * for files.  If the filesystem supports FIEMAP and we have the block
+ * device open, the data extents are read directly from disk.  This step
+ * is optimized by buffering the disk extents in a bitmap and using the
+ * bitmap to issue large IOs; if there are errors, those are recorded
+ * and cross-referenced against the metadata to identify the affected
+ * files with a second walk/FIEMAP run.  If FIEMAP is unavailable, it
+ * falls back to using SEEK_DATA and SEEK_HOLE to direct-read file
+ * contents.  If even that fails, direct-read the entire file.
+ *
+ * In the "check summary counters" phase, we tally up the blocks and
+ * inodes we saw and compare that to the statfs output.  This gives the
+ * user a rough estimate of how thorough the scrub was.
+ */
+
+#ifndef SEEK_DATA
+# define SEEK_DATA	3	/* seek to the next data */
+#endif
+
+#ifndef SEEK_HOLE
+# define SEEK_HOLE	4	/* seek to the next hole */
+#endif
+
+/* Routines to translate bad physical extents into file paths and offsets. */
+
+/* Report if this extent overlaps a bad region. */
+static bool
+report_verify_inode_fiemap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fiemap_extent	*extent,
+	void			*arg)
+{
+	struct bitmap	*tree = arg;
+
+	/* Skip non-real/non-aligned extents. */
+	if (extent->fe_flags & (FIEMAP_EXTENT_UNKNOWN |
+				FIEMAP_EXTENT_DELALLOC |
+				FIEMAP_EXTENT_ENCODED |
+				FIEMAP_EXTENT_NOT_ALIGNED |
+				FIEMAP_EXTENT_UNWRITTEN))
+		return true;
+
+	if (!bitmap_has_extent(tree, extent->fe_physical,
+			extent->fe_length))
+		return true;
+
+	str_error(ctx, descr,
+_("offset %llu failed read verification."), extent->fe_logical);
+
+	return true;
+}
+
+/* Iterate the extent mappings of a file to report errors. */
+static bool
+report_verify_fd(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	void				*arg)
+{
+	/* data fork */
+	fiemap(ctx, descr, fd, false, false, report_verify_inode_fiemap, arg);
+
+	/* attr fork */
+	fiemap(ctx, descr, fd, true, false, report_verify_inode_fiemap, arg);
+
+	return true;
+}
+
+/* Scan the inode associated with a directory entry. */
+static bool
+report_verify_dirent(
+	struct scrub_ctx	*ctx,
+	const char		*path,
+	int			dir_fd,
+	struct dirent		*dirent,
+	struct stat		*sb,
+	void			*arg)
+{
+	bool			moveon;
+	int			fd;
+
+	/* Ignore things we can't open. */
+	if (!S_ISREG(sb->st_mode))
+		return true;
+	/* Ignore . and .. */
+	if (dirent && (!strcmp(".", dirent->d_name) ||
+		       !strcmp("..", dirent->d_name)))
+		return true;
+
+	/* Open the file */
+	fd = dirent_open(dir_fd, dirent);
+	if (fd < 0)
+		return true;
+
+	/* Go find the badness. */
+	moveon = report_verify_fd(ctx, path, fd, arg);
+	if (moveon)
+		goto out;
+
+out:
+	close(fd);
+
+	return moveon;
+}
+
+/* Given bad extent lists for the data device, find bad files. */
+static bool
+report_verify_errors(
+	struct scrub_ctx		*ctx,
+	struct bitmap		*d_bad)
+{
+	/* Scan the directory tree to get file paths. */
+	return scan_fs_tree(ctx, NULL, report_verify_dirent, d_bad);
+}
+
+/* Phase 1 */
+bool
+generic_scan_fs(
+	struct scrub_ctx	*ctx)
+{
+	/* If there's no disk device, forget FIEMAP. */
+	if (!disk_is_open(&ctx->datadev))
+		ctx->quirks &= ~(SCRUB_QUIRK_FIEMAP_WORKS |
+				 SCRUB_QUIRK_FIEMAP_ATTR_WORKS |
+				 SCRUB_QUIRK_FIBMAP_WORKS);
+
+	return true;
+}
+
+bool
+generic_cleanup(
+	struct scrub_ctx	*ctx)
+{
+	/* Nothing to do here. */
+	return true;
+}
+
+/* Phase 2 */
+bool
+generic_scan_metadata(
+	struct scrub_ctx	*ctx)
+{
+	/* Nothing to do here. */
+	return true;
+}
+
+/* Phase 3 */
+bool
+generic_scan_inodes(
+	struct scrub_ctx	*ctx)
+{
+	/* Nothing to do here. */
+	return true;
+}
+
+/* Phase 4 */
+
+/* Check all entries in a directory. */
+bool
+generic_check_dir(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			dir_fd)
+{
+	/* Nothing to do here. */
+	return true;
+}
+
+/* Check an extent for problems. */
+static bool
+check_fiemap_extent(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fiemap_extent	*extent,
+	void			*arg)
+{
+	unsigned long long	eofs;
+
+	if (!disk_is_open(&ctx->datadev))
+		return true;
+	eofs = ctx->datadev.d_size;
+
+	if (extent->fe_length == 0)
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) has zero length."),
+			extent->fe_physical,
+			extent->fe_logical,
+			extent->fe_length);
+	if (extent->fe_physical > eofs)
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) starts past end of filesystem at %llu."),
+			extent->fe_physical,
+			extent->fe_logical,
+			extent->fe_length,
+			eofs);
+	if (extent->fe_physical + extent->fe_length > eofs ||
+	    extent->fe_physical + extent->fe_length < extent->fe_physical)
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) ends past end of filesystem at %llu."),
+			extent->fe_physical,
+			extent->fe_logical,
+			extent->fe_length,
+			eofs);
+	if (extent->fe_logical + extent->fe_length < extent->fe_logical)
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) overflows file offset."),
+			extent->fe_physical,
+			extent->fe_logical,
+			extent->fe_length);
+	return true;
+}
+
+/* Check an inode's extents. */
+bool
+generic_scan_extents(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb,
+	bool			attr_fork)
+{
+	/* FIEMAP only works for files. */
+	if (!S_ISREG(sb->st_mode))
+		return true;
+
+	/* Don't invoke FIEMAP if we don't support it. */
+	if (attr_fork && !scrub_has_fiemap_attr(ctx))
+		return true;
+	if (!attr_fork && !(scrub_has_fiemap(ctx) || scrub_has_fibmap(ctx)))
+		return true;
+
+	return fiemap(ctx, descr, fd, attr_fork, true,
+			check_fiemap_extent, NULL);
+}
+
+/* Check the fields of an inode. */
+bool
+generic_check_inode(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb)
+{
+	if (sb->st_nlink == 0)
+		str_error(ctx, descr,
+_("nlinks should not be 0."));
+
+	return true;
+}
+
+/* Does this file have extended attributes? */
+bool
+file_has_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd)
+{
+	ssize_t			buf_sz;
+
+	buf_sz = flistxattr(fd, NULL, 0);
+	if (buf_sz == 0)
+		return false;
+	else if (buf_sz < 0) {
+		if (errno == EOPNOTSUPP || errno == ENODATA)
+			return false;
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	return true;
+}
+
+/* Try to read all the extended attributes. */
+bool
+generic_scan_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd)
+{
+	char			*buf = NULL;
+	char			*p;
+	ssize_t			buf_sz;
+	ssize_t			sz;
+	ssize_t			val_sz;
+	ssize_t			sz2;
+	bool			moveon = true;
+
+	buf_sz = flistxattr(fd, NULL, 0);
+	if (buf_sz == 0)
+		return true;
+	else if (buf_sz < 0) {
+		if (errno == EOPNOTSUPP || errno == ENODATA)
+			return true;
+		str_errno(ctx, descr);
+		return true;
+	}
+
+	buf = malloc(buf_sz);
+	if (!buf) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	sz = flistxattr(fd, buf, buf_sz);
+	if (sz < 0) {
+		str_errno(ctx, descr);
+		goto out;
+	} else if (sz != buf_sz) {
+		str_error(ctx, descr,
+_("read %zd bytes of xattr names, expected %zd bytes."),
+				sz, buf_sz);
+	}
+
+	/* Read all the attrs and values. */
+	for (p = buf; p < buf + sz; p += strlen(p) + 1) {
+		val_sz = fgetxattr(fd, p, NULL, 0);
+		if (val_sz < 0) {
+			if (errno != EOPNOTSUPP && errno != ENODATA)
+				str_errno(ctx, descr);
+			continue;
+		}
+		sz2 = fgetxattr(fd, p, ctx->readbuf, val_sz);
+		if (sz2 < 0) {
+			str_errno(ctx, descr);
+			continue;
+		} else if (sz2 != val_sz)
+			str_error(ctx, descr,
+_("read %zd bytes from xattr %s value, expected %zd bytes."),
+					sz2, p, val_sz);
+	}
+out:
+	free(buf);
+	return moveon;
+}
+
+/* Try to read all the extended attributes of things that have no fd. */
+bool
+generic_scan_special_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*path)
+{
+	char			*buf = NULL;
+	char			*p;
+	ssize_t			buf_sz;
+	ssize_t			sz;
+	ssize_t			val_sz;
+	ssize_t			sz2;
+	bool			moveon = true;
+
+	buf_sz = llistxattr(path, NULL, 0);
+	if (buf_sz == 0)
+		return true;
+	else if (buf_sz < 0) {
+		if (errno == EOPNOTSUPP || errno == ENODATA)
+			return true;
+		str_errno(ctx, path);
+		return true;
+	}
+
+	buf = malloc(buf_sz);
+	if (!buf) {
+		str_errno(ctx, path);
+		return false;
+	}
+
+	sz = llistxattr(path, buf, buf_sz);
+	if (sz < 0) {
+		str_errno(ctx, path);
+		goto out;
+	} else if (sz != buf_sz) {
+		str_error(ctx, path,
+_("read %zd bytes of xattr names, expected %zd bytes."),
+				sz, buf_sz);
+	}
+
+	/* Read all the attrs and values. */
+	for (p = buf; p < buf + sz; p += strlen(p) + 1) {
+		val_sz = lgetxattr(path, p, NULL, 0);
+		if (val_sz < 0) {
+			str_errno(ctx, path);
+			continue;
+		}
+		sz2 = lgetxattr(path, p, ctx->readbuf, val_sz);
+		if (sz2 < 0) {
+			str_errno(ctx, path);
+			continue;
+		} else if (sz2 != val_sz)
+			str_error(ctx, path,
+_("read %zd bytes from xattr %s value, expected %zd bytes."),
+					sz2, p, val_sz);
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			moveon = false;
+			break;
+		}
+	}
+out:
+	free(buf);
+	return moveon;
+}
+
+/* Directory checking */
+#define CHECK_TYPE(type) \
+	case DT_##type: \
+		if (!S_IS##type(sb->st_mode)) { \
+			str_error(ctx, descr, \
+_("dtype %s does not match mode 0x%x"), \
+				#type, sb->st_mode & S_IFMT); \
+		} \
+		break;
+
+/* Ensure that the directory entry matches the stat info. */
+static bool
+generic_verify_dirent(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct dirent		*dirent,
+	struct stat		*sb)
+{
+	if (!scrub_has_unstable_inums(ctx) && dirent->d_ino != sb->st_ino) {
+		str_error(ctx, descr,
+_("inode numbers (%llu != %llu) do not match!"),
+			(unsigned long long)dirent->d_ino,
+			(unsigned long long)sb->st_ino);
+	}
+
+	switch (dirent->d_type) {
+	case DT_UNKNOWN:
+		break;
+	CHECK_TYPE(BLK)
+	CHECK_TYPE(CHR)
+	CHECK_TYPE(DIR)
+	CHECK_TYPE(FIFO)
+	CHECK_TYPE(LNK)
+	CHECK_TYPE(REG)
+	CHECK_TYPE(SOCK)
+	}
+
+	return true;
+}
+#undef CHECK_TYPE
+
+/* Scan the inode associated with a directory entry. */
+static bool
+check_dirent(
+	struct scrub_ctx	*ctx,
+	const char		*path,
+	int			dir_fd,
+	struct dirent		*dirent,
+	struct stat		*sb,
+	void			*arg)
+{
+	struct stat		fd_sb;
+	char			linkbuf[PATH_MAX + 1];
+	ssize_t			len;
+	bool			moveon;
+	int			fd;
+	int			error;
+
+	/* No dirent for the rootdir; skip it. */
+	if (!dirent)
+		return true;
+
+	/* Check the directory entry itself. */
+	moveon = generic_verify_dirent(ctx, path, dirent, sb);
+	if (!moveon)
+		return moveon;
+
+	/* If symlink, read the target value. */
+	if (S_ISLNK(sb->st_mode)) {
+		len = readlinkat(dir_fd, dirent->d_name, linkbuf,
+				PATH_MAX);
+		if (len < 0)
+			str_errno(ctx, path);
+		else if (len > sb->st_size)
+			str_error(ctx, path,
+_("read %zd bytes from a %lld byte symlink?"),
+				len, (long long)sb->st_size);
+	}
+
+	/* Read the xattrs without a file descriptor. */
+	if (S_ISSOCK(sb->st_mode) || S_ISFIFO(sb->st_mode) ||
+	    S_ISBLK(sb->st_mode) || S_ISCHR(sb->st_mode) ||
+	    S_ISLNK(sb->st_mode)) {
+		moveon = ctx->ops->scan_special_xattrs(ctx, path);
+		if (!moveon)
+			return moveon;
+	}
+
+	/* If not dir or file, move on to the next dirent. */
+	if (!S_ISDIR(sb->st_mode) && !S_ISREG(sb->st_mode))
+		return true;
+
+	/* Open the file */
+	fd = openat(dir_fd, dirent->d_name,
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		if (errno != ENOENT)
+			str_errno(ctx, path);
+		return true;
+	}
+
+	/* Did the fstatat and the open race? */
+	if (fstat(fd, &fd_sb) < 0) {
+		str_errno(ctx, path);
+		goto close;
+	}
+	if (fd_sb.st_ino != sb->st_ino || fd_sb.st_dev != sb->st_dev)
+		str_warn(ctx, path,
+_("inode changed out from under us!"));
+
+	/* Check the inode. */
+	moveon = ctx->ops->check_inode(ctx, path, fd, &fd_sb);
+	if (!moveon)
+		goto close;
+
+	/* Scan the extent maps. */
+	moveon = ctx->ops->scan_extents(ctx, path, fd, &fd_sb, false);
+	if (!moveon)
+		goto close;
+	if (file_has_xattrs(ctx, path, fd)) {
+		moveon = ctx->ops->scan_extents(ctx, path, fd, &fd_sb, true);
+		if (!moveon)
+			goto close;
+	}
+
+	/* Read all the extended attributes. */
+	moveon = ctx->ops->scan_xattrs(ctx, path, fd);
+	if (!moveon)
+		goto close;
+
+close:
+	/* Close file. */
+	error = close(fd);
+	if (error)
+		str_errno(ctx, path);
+
+	return moveon;
+}
+
+/*
+ * Check all the entries in a directory.
+ */
+bool
+generic_check_directory(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			*pfd)
+{
+	struct stat		sb;
+	DIR			*dir;
+	struct dirent		*dirent;
+	bool			moveon = true;
+	int			fd = *pfd;
+	int			error;
+
+	/* Iterate the directory entries. */
+	dir = fdopendir(fd);
+	if (!dir) {
+		str_errno(ctx, descr);
+		return true;
+	}
+	rewinddir(dir);
+
+	/* Iterate every directory entry. */
+	for (dirent = readdir(dir);
+	     dirent != NULL;
+	     dirent = readdir(dir)) {
+		error = fstatat(fd, dirent->d_name, &sb,
+				AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW);
+		if (error) {
+			str_errno(ctx, descr);
+			break;
+		}
+
+		/* Ignore files on other filesystems. */
+		if (sb.st_dev != ctx->mnt_sb.st_dev)
+			continue;
+
+		/* Check the type codes. */
+		moveon = generic_verify_dirent(ctx, descr, dirent, &sb);
+		if (!moveon)
+			break;
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			moveon = false;
+			break;
+		}
+	}
+
+	/* Close dir, go away. */
+	error = closedir(dir);
+	if (error)
+		str_errno(ctx, descr);
+	*pfd = -1;
+	return moveon;
+}
+
+/* Adapter for the check_dir thing. */
+static bool
+check_dir(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			dir_fd,
+	void			*arg)
+{
+	return ctx->ops->check_dir(ctx, descr, dir_fd);
+}
+
+/* Traverse the directory tree. */
+bool
+generic_scan_fs_tree(
+	struct scrub_ctx	*ctx)
+{
+	return scan_fs_tree(ctx, check_dir, check_dirent, NULL);
+}
+
+/* Phase 5 */
+
+struct read_verify_files {
+	struct scrub_ctx	*ctx;
+	struct bitmap		good;		/* bytes */
+	struct bitmap		bad;		/* bytes */
+	struct read_verify_pool	rvp;
+	struct read_verify	rv;
+	bool			use_fiemap;
+};
+
+/* Handle an io error while read verifying an extent. */
+void
+read_verify_fiemap_ioerr(
+	struct read_verify_pool		*rvp,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length,
+	int				error,
+	void				*arg)
+{
+	struct read_verify_files	*rvf = arg;
+
+	bitmap_add(&rvf->bad, start, length);
+}
+
+/* Check an extent for data integrity problems. */
+bool
+read_verify_fiemap_extent(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fiemap_extent		*extent,
+	void				*arg)
+{
+	struct read_verify_files	*rvf = arg;
+
+	/* Skip non-real/non-aligned extents. */
+	if (extent->fe_flags & (FIEMAP_EXTENT_UNKNOWN |
+				FIEMAP_EXTENT_DELALLOC |
+				FIEMAP_EXTENT_ENCODED |
+				FIEMAP_EXTENT_NOT_ALIGNED |
+				FIEMAP_EXTENT_UNWRITTEN))
+		return true;
+
+	return bitmap_add(&rvf->good, extent->fe_physical,
+			extent->fe_length);
+}
+
+/* Scan the inode associated with a directory entry. */
+static bool
+read_verify_dirent(
+	struct scrub_ctx		*ctx,
+	const char			*path,
+	int				dir_fd,
+	struct dirent			*dirent,
+	struct stat			*sb,
+	void				*arg)
+{
+	struct stat			fd_sb;
+	struct read_verify_files	*rvf = arg;
+	bool				moveon = true;
+	int				fd;
+	int				error;
+
+	/* If not file, move on to the next dirent. */
+	if (!S_ISREG(sb->st_mode))
+		return true;
+
+	/* Open the file */
+	fd = openat(dir_fd, dirent->d_name,
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		if (errno != ENOENT)
+			str_errno(ctx, path);
+		return true;
+	}
+
+	/* Did the fstatat and the open race? */
+	if (fstat(fd, &fd_sb) < 0) {
+		str_errno(ctx, path);
+		goto close;
+	}
+	if (fd_sb.st_ino != sb->st_ino || fd_sb.st_dev != sb->st_dev)
+		str_warn(ctx, path,
+_("inode changed out from under us!"));
+
+	/*
+	 * Either record the file extent map data for one big push later,
+	 * or read the file data the regular way.
+	 */
+	if (rvf->use_fiemap)
+		moveon = fiemap(ctx, path, fd, false, false,
+				read_verify_fiemap_extent, rvf);
+	else
+		moveon = ctx->ops->read_file(ctx, path, fd, &fd_sb);
+	if (!moveon)
+		goto close;
+
+close:
+	/* Close file. */
+	error = close(fd);
+	if (error)
+		str_errno(ctx, path);
+
+	return moveon;
+}
+
+static bool
+schedule_read_verify(
+	uint64_t			start,
+	uint64_t			length,
+	void				*arg)
+{
+	struct read_verify_files	*rvf = arg;
+
+	read_verify_schedule(&rvf->rvp, &rvf->rv, &rvf->ctx->datadev,
+			start, length, rvf);
+	return true;
+}
+
+/* Can we FIEMAP every block in a file? */
+static bool
+can_fiemap_all_file_blocks(
+	struct scrub_ctx		*ctx)
+{
+	return disk_is_open(&ctx->datadev) &&
+		scrub_has_fiemap(ctx) && scrub_has_fiemap_attr(ctx);
+}
+
+/* Scan all the data blocks, using FIEMAP to figure out what to verify. */
+bool
+generic_scan_blocks(
+	struct scrub_ctx		*ctx)
+{
+	struct read_verify_files	rvf = {0};
+	bool				moveon;
+
+	if (!scrub_data)
+		return true;
+
+	rvf.ctx = ctx;
+
+	/* If FIEMAP is unavailable, just use regular file pread. */
+	if (!can_fiemap_all_file_blocks(ctx))
+		return scan_fs_tree(ctx, NULL, read_verify_dirent, &rvf);
+
+	rvf.use_fiemap = true;
+	moveon = bitmap_init(&rvf.good);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&rvf.bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_good;
+	}
+
+	/* Collect all the extent maps. */
+	moveon = scan_fs_tree(ctx, NULL, read_verify_dirent, &rvf);
+	if (!moveon)
+		goto out_bad;
+
+	/* Run all the IO in batches. */
+	moveon = read_verify_pool_init(&rvf.rvp, ctx, ctx->readbuf, IO_MAX_SIZE,
+			ctx->mnt_sf.f_frsize, read_verify_fiemap_ioerr,
+			disk_heads(&ctx->datadev));
+	if (!moveon)
+		goto out_bad;
+	moveon = bitmap_iterate(&rvf.good, schedule_read_verify, &rvf);
+	if (!moveon)
+		goto out_pool;
+	read_verify_force(&rvf.rvp, &rvf.rv);
+	read_verify_pool_destroy(&rvf.rvp);
+
+	/* Scan the whole dir tree to see what matches the bad extents. */
+	if (!bitmap_empty(&rvf.bad))
+		moveon = report_verify_errors(ctx, &rvf.bad);
+
+	bitmap_free(&rvf.bad);
+	bitmap_free(&rvf.good);
+	return moveon;
+
+out_pool:
+	read_verify_pool_destroy(&rvf.rvp);
+out_bad:
+	bitmap_free(&rvf.bad);
+out_good:
+	bitmap_free(&rvf.good);
+
+	return moveon;
+}
+
+/* Phase 6 */
+struct summary_counts {
+	pthread_mutex_t		lock;
+	struct bitmap		dext;	/* data block extents */
+	struct bitmap		inob;	/* inode bitmap */
+	unsigned long long	inodes;	/* number of inodes */
+	unsigned long long	bytes;	/* bytes used */
+};
+
+struct inode_fork_summary {
+	struct bitmap		*tree;
+	unsigned long long	bytes;
+};
+
+/* Record data block extents in a bitmap. */
+bool
+generic_record_inode_summary_fiemap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fiemap_extent		*extent,
+	void				*arg)
+{
+	struct inode_fork_summary	*ifs = arg;
+
+	/* Skip non-real/non-aligned extents. */
+	if (extent->fe_flags & (FIEMAP_EXTENT_UNKNOWN |
+				FIEMAP_EXTENT_DELALLOC |
+				FIEMAP_EXTENT_ENCODED |
+				FIEMAP_EXTENT_NOT_ALIGNED))
+		return true;
+
+	bitmap_add(ifs->tree, extent->fe_physical, extent->fe_length);
+	ifs->bytes += extent->fe_length;
+
+	return true;
+}
+
+/* Record the presence of an inode and its block usage. */
+static bool
+generic_record_inode_summary(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				dir_fd,
+	struct dirent			*dirent,
+	struct stat			*sb,
+	void				*arg)
+{
+	struct summary_counts		*summary = arg;
+	struct stat			fd_sb;
+	struct inode_fork_summary	ifs;
+	unsigned long long		bs_bytes;
+	int				fd;
+	bool				has;
+	bool				moveon = true;
+
+	if (dirent && (strcmp(dirent->d_name, ".") == 0 ||
+		       strcmp(dirent->d_name, "..") == 0))
+		return true;
+
+	/* Detect hardlinked files. */
+	moveon = bitmap_test_and_set(&summary->inob, sb->st_ino, &has);
+	if (!moveon)
+		return moveon;
+	if (has)
+		return true;
+
+	bs_bytes = sb->st_blocks << BBSHIFT;
+
+	/* Record the inode.  If it's not a file, record the data usage too. */
+	pthread_mutex_lock(&summary->lock);
+	summary->inodes++;
+
+	/*
+	 * We can use fiemap and dext to figure out the correct block usage
+	 * for files that might share blocks.  If any of those conditions
+	 * are not met (non-file, fs doesn't support reflink, fiemap doesn't
+	 * work) then we just assume that the inode is the sole owner of its
+	 * blocks and use that to calculate the block usage.
+	 */
+	if (!can_fiemap_all_file_blocks(ctx) || !scrub_has_shared_blocks(ctx) ||
+	    !S_ISREG(sb->st_mode)) {
+		summary->bytes += bs_bytes;
+		pthread_mutex_unlock(&summary->lock);
+		return true;
+	}
+	pthread_mutex_unlock(&summary->lock);
+
+	/* Open the file */
+	fd = dirent_open(dir_fd, dirent);
+	if (fd < 0) {
+		if (errno != ENOENT)
+			str_errno(ctx, descr);
+		return true;
+	}
+
+	/* Did the fstatat and the open race? */
+	if (fstat(fd, &fd_sb) < 0) {
+		str_errno(ctx, descr);
+		goto close;
+	}
+
+	if (fd_sb.st_ino != sb->st_ino || fd_sb.st_dev != sb->st_dev)
+		str_warn(ctx, descr,
+_("inode changed out from under us!"));
+
+	ifs.tree = &summary->dext;
+	ifs.bytes = 0;
+	moveon = fiemap(ctx, descr, fd, false, false,
+			generic_record_inode_summary_fiemap, &ifs);
+	if (!moveon)
+		goto out_nofiemap;
+	if (file_has_xattrs(ctx, descr, fd)) {
+		moveon = fiemap(ctx, descr, fd, true, false,
+				generic_record_inode_summary_fiemap, &ifs);
+		if (!moveon)
+			goto out_nofiemap;
+	}
+
+	/*
+	 * bs_bytes tracks the number of bytes assigned to this file
+	 * for data, xattrs, and block mapping metadata.  ifs.bytes tracks
+	 * the data and xattr storage space used, so the diff between the
+	 * two is the space used for block mapping metadata.  Add that to
+	 * the data usage.
+	 */
+out_nofiemap:
+	pthread_mutex_lock(&summary->lock);
+	summary->bytes += bs_bytes - ifs.bytes;
+	pthread_mutex_unlock(&summary->lock);
+
+close:
+	close(fd);
+	return moveon;
+}
+
+/* Sum the bytes in each extent. */
+static bool
+generic_summary_count_helper(
+	uint64_t			start,
+	uint64_t			length,
+	void				*arg)
+{
+	unsigned long long		*count = arg;
+
+	*count += length;
+	return true;
+}
+
+/* Traverse the directory tree, counting inodes & blocks. */
+bool
+generic_check_summary(
+	struct scrub_ctx	*ctx)
+{
+	struct summary_counts	summary = {0};
+	struct stat		sb;
+	struct statvfs		sfs;
+	unsigned long long	fd;
+	unsigned long long	fi;
+	unsigned long long	sd;
+	unsigned long long	si;
+	unsigned long long	absdiff;
+	bool			complain = false;
+	bool			moveon;
+	int			error;
+
+	pthread_mutex_init(&summary.lock, NULL);
+
+	/* Flush everything out to disk before we start counting. */
+	error = syncfs(ctx->mnt_fd);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/* Get the rootdir's summary stats. */
+	error = fstat(ctx->mnt_fd, &sb);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&summary.dext);
+	if (!moveon)
+		return moveon;
+
+	moveon = bitmap_init(&summary.inob);
+	if (!moveon)
+		return moveon;
+
+	/* Scan the rest of the filesystem. */
+	moveon = scan_fs_tree(ctx, NULL, generic_record_inode_summary,
+			&summary);
+	if (!moveon)
+		return moveon;
+
+	/* Summarize extent tree results. */
+	moveon = bitmap_iterate(&summary.dext,
+			generic_summary_count_helper, &summary.bytes);
+	if (!moveon)
+		return moveon;
+
+	bitmap_free(&summary.inob);
+	bitmap_free(&summary.dext);
+
+	/* Compare to statfs results. */
+	error = fstatvfs(ctx->mnt_fd, &sfs);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/* Report on what we found. */
+	fd = (sfs.f_blocks - sfs.f_bfree) * sfs.f_frsize;
+	fi = sfs.f_files - sfs.f_ffree;
+	sd = summary.bytes;
+	si = summary.inodes;
+
+	/*
+	 * Complain if the counts are off by more than 10%, unless
+	 * the inaccuracy is less than 32MB worth of blocks or 100 inodes.
+	 * Ignore zero counters.
+	 */
+	absdiff = 1ULL << 25;
+	if (fd)
+		complain = !within_range(ctx, sd, fd, absdiff, 1, 10,
+				_("data blocks"));
+	if (fi)
+		complain |= !within_range(ctx, si, fi, 100, 1, 10, _("inodes"));
+
+	if (complain || verbose) {
+		double		b, i;
+		char		*bu, *iu;
+
+		b = auto_space_units(fd, &bu);
+		i = auto_units(fi, &iu);
+		printf(_("%.1f%s data used;  %.1f%s inodes used.\n"),
+				b, bu, i, iu);
+		b = auto_space_units(sd, &bu);
+		i = auto_units(si, &iu);
+		printf(_("%.1f%s data found; %.1f%s inodes found.\n"),
+				b, bu, i, iu);
+	}
+
+	return true;
+}
+
+/* Phase 7: Preening filesystem. */
+bool
+generic_preen_fs(
+	struct scrub_ctx		*ctx)
+{
+	fstrim(ctx);
+	return true;
+}
+
+struct scrub_ops generic_scrub_ops = {
+	.name			= "generic",
+	.cleanup		= generic_cleanup,
+	.scan_fs		= generic_scan_fs,
+	.scan_inodes		= generic_scan_inodes,
+	.check_dir		= generic_check_dir,
+	.check_inode		= generic_check_inode,
+	.scan_extents		= generic_scan_extents,
+	.scan_xattrs		= generic_scan_xattrs,
+	.scan_special_xattrs	= generic_scan_special_xattrs,
+	.scan_metadata		= generic_scan_metadata,
+	.check_summary		= generic_check_summary,
+	.read_file		= read_verify_file,
+	.scan_blocks		= generic_scan_blocks,
+	.scan_fs_tree		= generic_scan_fs_tree,
+	.preen_fs		= generic_preen_fs,
+};
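
Reviewer note on the phase 6 summary check above: the comment in
generic_check_summary() describes a two-part tolerance (a relative limit of
10%, waived when the absolute difference is small).  within_range() itself is
not shown in this part of the patch, so the following is only a minimal sketch
of that rule under my own assumptions.  The name counts_within_tolerance and
its parameters are hypothetical and are not the patch's API.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the tolerance rule described in
 * generic_check_summary(); the real within_range() may differ.
 */
static bool
counts_within_tolerance(
	unsigned long long	found,
	unsigned long long	reported,
	unsigned long long	abs_slack,
	unsigned long long	num,
	unsigned long long	den)
{
	unsigned long long	diff;

	diff = found > reported ? found - reported : reported - found;

	/* Differences smaller than the absolute slack are always tolerated. */
	if (diff < abs_slack)
		return true;

	/*
	 * Otherwise require diff <= reported * num / den.  Divide first so
	 * that large byte counts cannot overflow; this rounds the limit down.
	 */
	return diff <= (reported / den) * num;
}
```

With abs_slack = 1ULL << 25, num = 1 and den = 10 this mirrors the "10% unless
under 32MB" wording of the comment for the data counter.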
diff --git a/scrub/iocmd.c b/scrub/iocmd.c
new file mode 100644
index 0000000..d8a769d
--- /dev/null
+++ b/scrub/iocmd.c
@@ -0,0 +1,412 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <linux/fiemap.h>
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/xattr.h>
+#include "../repair/threads.h"
+#include "disk.h"
+#include "scrub.h"
+#include "iocmd.h"
+
+#define NR_EXTENTS	512
+
+/* Scan a filesystem tree. */
+struct scan_fs_tree {
+	unsigned int		nr_dirs;
+	pthread_mutex_t		lock;
+	pthread_cond_t		wakeup;
+	struct stat		root_sb;
+	bool			moveon;
+	bool			(*dir_fn)(struct scrub_ctx *, const char *,
+					  int, void *);
+	bool			(*dirent_fn)(struct scrub_ctx *, const char *,
+					     int, struct dirent *,
+					     struct stat *, void *);
+	void			*arg;
+};
+
+/* Per-work-item scan context. */
+struct scan_fs_tree_dir {
+	char			*path;
+	struct scan_fs_tree	*sft;
+	bool			rootdir;
+};
+
+/* Scan a directory sub tree. */
+static void
+scan_fs_dir(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct scan_fs_tree_dir	*sftd = arg;
+	struct scan_fs_tree	*sft = sftd->sft;
+	DIR			*dir;
+	struct dirent		*dirent;
+	char			newpath[PATH_MAX];
+	struct scan_fs_tree_dir	*new_sftd;
+	struct stat		sb;
+	int			dir_fd;
+	int			error;
+
+	/* Open the directory. */
+	dir_fd = open(sftd->path, O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (dir_fd < 0) {
+		if (errno != ENOENT)
+			str_errno(ctx, sftd->path);
+		goto out;
+	}
+
+	/* Caller-specific directory checks. */
+	if (sft->dir_fn && !sft->dir_fn(ctx, sftd->path, dir_fd, sft->arg)) {
+		sft->moveon = false;
+		goto out;
+	}
+
+	/* Caller-specific directory entry function on the rootdir. */
+	if (sftd->rootdir) {
+		/* Get the stat info for this directory entry. */
+		error = fstat(dir_fd, &sb);
+		if (error) {
+			str_errno(ctx, sftd->path);
+			goto out;
+		}
+		if (!sft->dirent_fn(ctx, sftd->path, dir_fd, NULL, &sb,
+				sft->arg)) {
+			sft->moveon = false;
+			goto out;
+		}
+	}
+
+	/* Iterate the directory entries. */
+	dir = fdopendir(dir_fd);
+	if (!dir) {
+		str_errno(ctx, sftd->path);
+		goto out;
+	}
+	rewinddir(dir);
+	for (dirent = readdir(dir); dirent != NULL; dirent = readdir(dir)) {
+		snprintf(newpath, PATH_MAX, "%s/%s", sftd->path,
+				dirent->d_name);
+
+		/* Get the stat info for this directory entry. */
+		error = fstatat(dir_fd, dirent->d_name, &sb,
+				AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW);
+		if (error) {
+			str_errno(ctx, newpath);
+			continue;
+		}
+
+		/* Ignore files on other filesystems. */
+		if (sb.st_dev != sft->root_sb.st_dev)
+			continue;
+
+		/* Caller-specific directory entry function. */
+		if (!sft->dirent_fn(ctx, newpath, dir_fd, dirent, &sb,
+				sft->arg)) {
+			sft->moveon = false;
+			break;
+		}
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			sft->moveon = false;
+			break;
+		}
+
+		/* If directory, call ourselves recursively. */
+		if (S_ISDIR(sb.st_mode) && strcmp(".", dirent->d_name) &&
+		    strcmp("..", dirent->d_name)) {
+			new_sftd = malloc(sizeof(struct scan_fs_tree_dir));
+			if (!new_sftd) {
+				str_errno(ctx, newpath);
+				sft->moveon = false;
+				break;
+			}
+			new_sftd->path = strdup(newpath);
+			new_sftd->sft = sft;
+			new_sftd->rootdir = false;
+			pthread_mutex_lock(&sft->lock);
+			sft->nr_dirs++;
+			pthread_mutex_unlock(&sft->lock);
+			queue_work(wq, scan_fs_dir, 0, new_sftd);
+		}
+	}
+
+	/* Close dir, go away. */
+	error = closedir(dir);
+	if (error)
+		str_errno(ctx, sftd->path);
+
+out:
+	pthread_mutex_lock(&sft->lock);
+	sft->nr_dirs--;
+	if (sft->nr_dirs == 0)
+		pthread_cond_signal(&sft->wakeup);
+	pthread_mutex_unlock(&sft->lock);
+
+	free(sftd->path);
+	free(sftd);
+}
+
+/* Scan the entire filesystem. */
+bool
+scan_fs_tree(
+	struct scrub_ctx	*ctx,
+	bool			(*dir_fn)(struct scrub_ctx *, const char *,
+					  int, void *),
+	bool			(*dirent_fn)(struct scrub_ctx *, const char *,
+						int, struct dirent *,
+						struct stat *, void *),
+	void			*arg)
+{
+	struct work_queue	wq;
+	struct scan_fs_tree	sft;
+	struct scan_fs_tree_dir	*sftd;
+
+	sft.moveon = true;
+	sft.nr_dirs = 1;
+	sft.root_sb = ctx->mnt_sb;
+	sft.dir_fn = dir_fn;
+	sft.dirent_fn = dirent_fn;
+	sft.arg = arg;
+	pthread_mutex_init(&sft.lock, NULL);
+	pthread_cond_init(&sft.wakeup, NULL);
+
+	sftd = malloc(sizeof(struct scan_fs_tree_dir));
+	if (!sftd) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	sftd->path = strdup(ctx->mntpoint);
+	sftd->sft = &sft;
+	sftd->rootdir = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	queue_work(&wq, scan_fs_dir, 0, sftd);
+
+	pthread_mutex_lock(&sft.lock);
+	while (sft.nr_dirs > 0)
+		pthread_cond_wait(&sft.wakeup, &sft.lock);
+	pthread_mutex_unlock(&sft.lock);
+	destroy_work_queue(&wq);
+
+	return sft.moveon;
+}
+
+/* Check an inode's extents... the hard way. */
+static bool
+fibmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	bool			(*fn)(struct scrub_ctx *, const char *,
+				      struct fiemap_extent *, void *),
+	void			*arg)
+{
+	struct stat		sb;
+	struct fiemap_extent	extent = {0};
+	unsigned int		blk;
+	unsigned int		b;
+	unsigned int		blksz;
+	unsigned long long	physical;
+	off_t			numblocks;
+	bool			moveon = true;
+	int			error;
+
+	assert(scrub_has_fibmap(ctx));
+
+	error = fstat(fd, &sb);
+	if (error) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	blksz = ctx->datadev.d_blksize;
+	numblocks = (sb.st_size + blksz - 1) / blksz;
+	if (numblocks > UINT_MAX)
+		numblocks = UINT_MAX;
+	extent.fe_flags = FIEMAP_EXTENT_MERGED;
+	for (blk = 0; blk < numblocks; blk++) {
+		b = blk;
+		error = ioctl(fd, FIBMAP, &b);
+		if (error) {
+			if (errno == EOPNOTSUPP || errno == EINVAL) {
+				str_warn(ctx, descr,
+_("data block FIEMAP/FIBMAP not supported, will not check extent map."));
+				ctx->quirks &= ~SCRUB_QUIRK_FIBMAP_WORKS;
+				return true;
+			}
+			str_errno(ctx, descr);
+			continue;
+		}
+
+		physical = (unsigned long long)b * blksz;
+		if (extent.fe_length > 0 &&
+		    physical == extent.fe_physical + extent.fe_length) {
+			/* Physically contiguous, just merge. */
+			extent.fe_length += blksz;
+		} else {
+			/* Emit extent if there is one. */
+			if (extent.fe_length > 0) {
+				moveon = fn(ctx, descr, &extent, arg);
+				if (!moveon)
+					break;
+			}
+			if (physical == 0) {
+				/* b == 0 means a hole... */
+				extent.fe_length = 0;
+			} else {
+				/* Start a new extent. */
+				extent.fe_physical = physical;
+				extent.fe_logical = (__u64)blk * blksz;
+				extent.fe_length = blksz;
+			}
+		}
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			moveon = false;
+			break;
+		}
+	}
+
+	/* If there's an extent left over, emit it. */
+	if (moveon && extent.fe_length > 0) {
+		extent.fe_flags |= FIEMAP_EXTENT_LAST;
+		moveon = fn(ctx, descr, &extent, arg);
+	}
+
+	return moveon;
+}
+
+/* Call the FIEMAP ioctl on a file. */
+bool
+fiemap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	bool			attr_fork,
+	bool			use_fibmap,
+	bool			(*fn)(struct scrub_ctx *, const char *,
+				      struct fiemap_extent *, void *),
+	void			*arg)
+{
+	struct fiemap		*fiemap;
+	struct fiemap_extent	*extent;
+	size_t			sz;
+	__u64			next_logical;
+	bool			moveon = true;
+	bool			last = false;
+	unsigned int		i;
+	int			error;
+
+	assert(attr_fork || (scrub_has_fiemap(ctx) || scrub_has_fibmap(ctx)));
+	assert(!attr_fork || scrub_has_fiemap_attr(ctx));
+
+	if (!attr_fork && !scrub_has_fiemap(ctx))
+		return use_fibmap ? fibmap(ctx, descr, fd, fn, arg) : false;
+	else if (attr_fork && !scrub_has_fiemap_attr(ctx))
+		return true;
+
+	sz = sizeof(struct fiemap) + sizeof(struct fiemap_extent) * NR_EXTENTS;
+	fiemap = calloc(1, sz);
+	if (!fiemap) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	fiemap->fm_length = ~0ULL;
+	fiemap->fm_flags = FIEMAP_FLAG_SYNC;
+	if (attr_fork)
+		fiemap->fm_flags |= FIEMAP_FLAG_XATTR;
+	fiemap->fm_extent_count = NR_EXTENTS;
+	fiemap->fm_reserved = 0;
+	next_logical = 0;
+
+	while (!last) {
+		fiemap->fm_start = next_logical;
+		error = ioctl(fd, FS_IOC_FIEMAP, (unsigned long)fiemap);
+		if (error < 0 && (errno == EOPNOTSUPP || errno == EBADR)) {
+			if (attr_fork) {
+				str_warn(ctx, descr,
+_("extended attribute FIEMAP not supported, will not check extent map."));
+				ctx->quirks &= ~SCRUB_QUIRK_FIEMAP_ATTR_WORKS;
+			} else {
+				ctx->quirks &= ~SCRUB_QUIRK_FIEMAP_WORKS;
+			}
+			break;
+		}
+		if (error < 0) {
+			str_errno(ctx, descr);
+			break;
+		}
+
+		/* No more extents to map, exit */
+		if (!fiemap->fm_mapped_extents)
+			break;
+
+		for (i = 0; i < fiemap->fm_mapped_extents; i++) {
+			extent = &fiemap->fm_extents[i];
+
+			moveon = fn(ctx, descr, extent, arg);
+			if (!moveon)
+				goto out;
+
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+
+			next_logical = extent->fe_logical + extent->fe_length;
+			if (extent->fe_flags & FIEMAP_EXTENT_LAST)
+				last = true;
+		}
+	}
+
+out:
+	free(fiemap);
+	return moveon;
+}
+
+#ifndef FITRIM
+struct fstrim_range {
+	__u64 start;
+	__u64 len;
+	__u64 minlen;
+};
+#define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#endif
+
+/* Call FITRIM to trim all the unused space in a filesystem. */
+void
+fstrim(
+	struct scrub_ctx	*ctx)
+{
+	struct fstrim_range	range = {0};
+	int			error;
+
+	range.len = ULLONG_MAX;
+	error = ioctl(ctx->mnt_fd, FITRIM, &range);
+	if (error && errno != EOPNOTSUPP && errno != ENOTTY)
+		perror(_("fstrim"));
+}
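
Reviewer note on fibmap() above: the function turns one-block-at-a-time FIBMAP
answers into extents by merging physically contiguous blocks and treating
block 0 as a hole.  The sketch below isolates that merging step so it can be
exercised without a real file.  struct sketch_extent and coalesce_blocks are
hypothetical names for illustration only, and the input array stands in for
the per-block ioctl results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_extent {
	uint64_t	logical;	/* byte offset in the file */
	uint64_t	physical;	/* byte offset on the device */
	uint64_t	length;		/* bytes */
};

/*
 * Coalesce per-block mappings into extents.  map[i] is the physical block
 * backing logical block i, with 0 meaning a hole (as FIBMAP reports it).
 * Writes at most max extents to out and returns the number written.
 */
static size_t
coalesce_blocks(
	const uint32_t		*map,
	size_t			nr,
	uint32_t		blksz,
	struct sketch_extent	*out,
	size_t			max)
{
	struct sketch_extent	cur = {0};
	size_t			n = 0;
	size_t			i;

	for (i = 0; i < nr; i++) {
		/* Widen before multiplying so large block numbers don't wrap. */
		uint64_t	physical = (uint64_t)map[i] * blksz;

		/* Physically contiguous with the pending extent: just merge. */
		if (cur.length > 0 && physical != 0 &&
		    physical == cur.physical + cur.length) {
			cur.length += blksz;
			continue;
		}

		/* Discontiguous block or hole: emit the pending extent. */
		if (cur.length > 0 && n < max)
			out[n++] = cur;
		cur.length = 0;

		/* Start a new extent unless this block is a hole. */
		if (physical != 0) {
			cur.logical = (uint64_t)i * blksz;
			cur.physical = physical;
			cur.length = blksz;
		}
	}

	/* Emit whatever is left over. */
	if (cur.length > 0 && n < max)
		out[n++] = cur;
	return n;
}
```

The 64-bit widening matters once a file or device offset passes 4GiB, since
both the block number and the block size are 32-bit quantities here.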
diff --git a/scrub/iocmd.h b/scrub/iocmd.h
new file mode 100644
index 0000000..c6cf2c4
--- /dev/null
+++ b/scrub/iocmd.h
@@ -0,0 +1,50 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef IOCMD_H_
+#define IOCMD_H_
+
+struct fiemap_extent;
+
+bool
+scan_fs_tree(
+	struct scrub_ctx	*ctx,
+	bool			(*dir_fn)(struct scrub_ctx *, const char *,
+					  int, void *),
+	bool			(*dirent_fn)(struct scrub_ctx *, const char *,
+						int, struct dirent *,
+						struct stat *, void *),
+	void			*arg);
+
+bool
+fiemap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	bool			attr_fork,
+	bool			use_fibmap,
+	bool			(*fn)(struct scrub_ctx *, const char *,
+				      struct fiemap_extent *, void *),
+	void			*arg);
+
+void
+fstrim(
+	struct scrub_ctx	*ctx);
+
+#endif /* IOCMD_H_ */
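
Reviewer note on scan_fs_tree() declared above: the caller has to block until
every queued directory has been processed, which the patch tracks with a
counter, a mutex and a condition variable.  A common way to write that kind of
wait is to recheck the counter in a loop, so that a wakeup sent before the
waiter goes to sleep, or a spurious wakeup, is harmless.  The sketch below
shows the pattern with hypothetical names (struct dir_counter and its
helpers); it is not code from the patch.

```c
#include <assert.h>
#include <pthread.h>

struct dir_counter {
	pthread_mutex_t	lock;
	pthread_cond_t	wakeup;
	unsigned int	nr_dirs;	/* outstanding work items */
};

static void
dir_counter_init(
	struct dir_counter	*dc,
	unsigned int		nr)
{
	pthread_mutex_init(&dc->lock, NULL);
	pthread_cond_init(&dc->wakeup, NULL);
	dc->nr_dirs = nr;
}

/* Worker side: drop one item and wake the waiter when none remain. */
static void
dir_counter_put(
	struct dir_counter	*dc)
{
	pthread_mutex_lock(&dc->lock);
	if (--dc->nr_dirs == 0)
		pthread_cond_signal(&dc->wakeup);
	pthread_mutex_unlock(&dc->lock);
}

/* Waiter side: sleep only while the predicate says there is work left. */
static void
dir_counter_wait(
	struct dir_counter	*dc)
{
	pthread_mutex_lock(&dc->lock);
	while (dc->nr_dirs > 0)
		pthread_cond_wait(&dc->wakeup, &dc->lock);
	pthread_mutex_unlock(&dc->lock);
}
```

Because the waiter tests nr_dirs under the lock before sleeping, it returns
immediately if the last worker already finished.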
diff --git a/scrub/non_xfs.c b/scrub/non_xfs.c
new file mode 100644
index 0000000..47fef92
--- /dev/null
+++ b/scrub/non_xfs.c
@@ -0,0 +1,185 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <dirent.h>
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include "disk.h"
+#include "scrub.h"
+
+/* Stub scrubbers for non-XFS filesystems. */
+
+/* Read the btrfs geometry. */
+static bool
+btrfs_scan_fs(
+	struct scrub_ctx		*ctx)
+{
+	/*
+	 * btrfs is a volume manager, so we can't get meaningful block numbers
+	 * out of FIEMAP/FIBMAP.  It also checksums data, so raw device access
+	 * for file verify is impossible.  btrfs also supports reflink.
+	 */
+	ctx->quirks |= SCRUB_QUIRK_SHARED_BLOCKS;
+	disk_close(&ctx->datadev);
+	return generic_scan_fs(ctx);
+}
+
+/* Scrub all disk blocks using the btrfs scrub command. */
+static bool
+btrfs_scan_blocks(
+	struct scrub_ctx		*ctx)
+{
+	pid_t				pid;
+	pid_t				rpid;
+	char				*args[] = {"btrfs", "scrub", "start",
+						   "-B", "-f", "-q",
+						   ctx->mntpoint, NULL, NULL};
+	int				status;
+	int				err;
+
+	if (ctx->mode == SCRUB_MODE_DRY_RUN) {
+		args[6] = "-r";
+		args[7] = ctx->mntpoint;
+	}
+
+	pid = fork();
+	if (pid < 0)
+		str_errno(ctx, ctx->mntpoint);
+	else if (pid == 0) {
+		status = execvp(args[0], args);
+		exit(255);
+	} else {
+		rpid = waitpid(pid, &status, 0);
+		while (rpid >= 0 && rpid != pid && !WIFEXITED(status) &&
+				!WIFSIGNALED(status)) {
+			rpid = waitpid(pid, &status, 0);
+		}
+		if (rpid < 0)
+			str_errno(ctx, ctx->mntpoint);
+		else if (WIFSIGNALED(status))
+			str_error(ctx, ctx->mntpoint,
+_("btrfs scrub died, signal %d"),
+					WTERMSIG(status));
+		else if (WIFEXITED(status)) {
+			err = WEXITSTATUS(status);
+			if (err == 0)
+				return true;
+			else if (err == 255)
+				str_error(ctx, ctx->mntpoint,
+_("btrfs scrub failed to run."));
+			else
+				str_error(ctx, ctx->mntpoint,
+_("btrfs scrub signalled corruption, error %d"),
+						err);
+		}
+	}
+
+	return true;
+}
+
+/* btrfs profile */
+struct scrub_ops btrfs_scrub_ops = {
+	.name			= "btrfs",
+	.cleanup		= generic_cleanup,
+	.scan_fs		= btrfs_scan_fs,
+	.scan_inodes		= generic_scan_inodes,
+	.check_dir		= generic_check_dir,
+	.check_inode		= generic_check_inode,
+	.scan_extents		= generic_scan_extents,
+	.scan_xattrs		= generic_scan_xattrs,
+	.scan_special_xattrs	= generic_scan_special_xattrs,
+	.scan_metadata		= generic_scan_metadata,
+	.check_summary		= generic_check_summary,
+	.read_file		= read_verify_file,
+	.scan_blocks		= btrfs_scan_blocks,
+	.scan_fs_tree		= generic_scan_fs_tree,
+	.preen_fs		= generic_preen_fs,
+};
+
+/*
+ * Generic FS scanner for filesystems that support shared blocks.
+ */
+static bool
+scan_fs_shared_blocks(
+	struct scrub_ctx		*ctx)
+{
+	ctx->quirks |= SCRUB_QUIRK_SHARED_BLOCKS;
+	return generic_scan_fs(ctx);
+}
+
+/* shared block filesystem profiles */
+struct scrub_ops shared_block_fs_scrub_ops = {
+	.name			= "shared block generic",
+	.aliases		= "ocfs2\0",
+	.cleanup		= generic_cleanup,
+	.scan_fs		= scan_fs_shared_blocks,
+	.scan_inodes		= generic_scan_inodes,
+	.check_dir		= generic_check_dir,
+	.check_inode		= generic_check_inode,
+	.scan_extents		= generic_scan_extents,
+	.scan_xattrs		= generic_scan_xattrs,
+	.scan_special_xattrs	= generic_scan_special_xattrs,
+	.scan_metadata		= generic_scan_metadata,
+	.check_summary		= generic_check_summary,
+	.read_file		= read_verify_file,
+	.scan_blocks		= generic_scan_blocks,
+	.scan_fs_tree		= generic_scan_fs_tree,
+	.preen_fs		= generic_preen_fs,
+};
+
+/*
+ * Generic FS scan for filesystems that don't present stable inode numbers
+ * between the directory entry and the stat buffer.
+ */
+static bool
+scan_fs_unstable_inum(
+	struct scrub_ctx		*ctx)
+{
+	/*
+	 * HFS+ implements hard links by creating a special hidden file
+	 * that redirects to the real file, so the inode numbers reported
+	 * in the dirent and the fstat buffers don't necessarily match.
+	 *
+	 * iso9660/vfat don't have stable dirent -> inode numbers.
+	 */
+	ctx->quirks |= SCRUB_QUIRK_UNSTABLE_INUM;
+	return generic_scan_fs(ctx);
+}
+
+/* unstable inum filesystem profile */
+struct scrub_ops unstable_inum_fs_scrub_ops = {
+	.name			= "unstable inum generic",
+	.aliases		= "hfsplus\0iso9660\0vfat\0",
+	.cleanup		= generic_cleanup,
+	.scan_fs		= scan_fs_unstable_inum,
+	.scan_inodes		= generic_scan_inodes,
+	.check_dir		= generic_check_dir,
+	.check_inode		= generic_check_inode,
+	.scan_extents		= generic_scan_extents,
+	.scan_xattrs		= generic_scan_xattrs,
+	.scan_special_xattrs	= generic_scan_special_xattrs,
+	.scan_metadata		= generic_scan_metadata,
+	.check_summary		= generic_check_summary,
+	.read_file		= read_verify_file,
+	.scan_blocks		= generic_scan_blocks,
+	.scan_fs_tree		= generic_scan_fs_tree,
+	.preen_fs		= generic_preen_fs,
+};
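
Reviewer note on the .aliases fields above: strings such as
"hfsplus\0iso9660\0vfat\0" pack several filesystem names into one literal,
separated by NUL bytes and ended by an empty string.  The code that consumes
.aliases is not part of this hunk, so the following is only a sketch of how
such a list could be searched.  alias_list_contains is a hypothetical helper,
not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if name appears in a list of NUL-separated strings that is
 * terminated by an empty string, e.g. "hfsplus\0iso9660\0vfat\0".
 */
static bool
alias_list_contains(
	const char	*aliases,
	const char	*name)
{
	const char	*p;

	if (!aliases)
		return false;

	/* Each entry ends at its NUL; the next one starts right after it. */
	for (p = aliases; *p != '\0'; p += strlen(p) + 1) {
		if (strcmp(p, name) == 0)
			return true;
	}
	return false;
}
```

The explicit trailing \0 in the literal, together with the NUL the compiler
appends, is what provides the terminating empty string.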
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
new file mode 100644
index 0000000..8433012
--- /dev/null
+++ b/scrub/read_verify.c
@@ -0,0 +1,314 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "scrub.h"
+#include "../repair/threads.h"
+#include "read_verify.h"
+
+/* How many bytes have we verified? */
+static pthread_mutex_t		verified_lock = PTHREAD_MUTEX_INITIALIZER;
+static unsigned long long	verified_bytes;
+
+/* Tolerate 64k holes in adjacent read verify requests. */
+#define IO_BATCH_LOCALITY	(65536)
+
+/* Create a thread pool to run read verifiers. */
+bool
+read_verify_pool_init(
+	struct read_verify_pool		*rvp,
+	struct scrub_ctx		*ctx,
+	void				*readbuf,
+	size_t				readbufsz,
+	size_t				min_io_sz,
+	read_verify_ioend_fn_t		ioend_fn,
+	unsigned int			nproc)
+{
+	rvp->rvp_readbuf = readbuf;
+	rvp->rvp_readbufsz = readbufsz;
+	rvp->rvp_ctx = ctx;
+	rvp->rvp_min_io_size = min_io_sz;
+	rvp->ioend_fn = ioend_fn;
+	rvp->rvp_nproc = nproc;
+	create_work_queue(&rvp->rvp_wq, (struct xfs_mount *)rvp, nproc);
+	return true;
+}
+
+/* How many bytes has this process verified? */
+unsigned long long
+read_verify_bytes(void)
+{
+	return verified_bytes;
+}
+
+/* Finish up any read verification work and tear it down. */
+void
+read_verify_pool_destroy(
+	struct read_verify_pool		*rvp)
+{
+	destroy_work_queue(&rvp->rvp_wq);
+	memset(&rvp->rvp_wq, 0, sizeof(struct work_queue));
+}
+
+/*
+ * Issue a read-verify IO in big batches.
+ */
+static void
+read_verify(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct read_verify		*rv = arg;
+	struct read_verify_pool		*rvp;
+	unsigned long long		verified = 0;
+	ssize_t				sz;
+	ssize_t				len;
+
+	rvp = (struct read_verify_pool *)wq->mp;
+	while (rv->io_length > 0) {
+		len = min(rv->io_length, rvp->rvp_readbufsz);
+		dbg_printf("diskverify %d %"PRIu64" %zd\n", rv->io_disk->d_fd,
+				rv->io_start, len);
+		sz = disk_read_verify(rv->io_disk, rvp->rvp_readbuf,
+				rv->io_start, len);
+		if (sz < 0) {
+			dbg_printf("IOERR %d %"PRIu64" %zd\n",
+					rv->io_disk->d_fd,
+					rv->io_start, len);
+			rvp->ioend_fn(rvp, rv->io_disk, rv->io_start,
+					rvp->rvp_min_io_size,
+					errno, rv->io_end_arg);
+			len = rvp->rvp_min_io_size;
+		}
+
+		verified += len;
+		rv->io_start += len;
+		rv->io_length -= len;
+	}
+
+	free(rv);
+	pthread_mutex_lock(&verified_lock);
+	verified_bytes += verified;
+	pthread_mutex_unlock(&verified_lock);
+}
+
+/* Queue a read verify request. */
+static void
+read_verify_queue(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv)
+{
+	struct read_verify		*tmp;
+
+	dbg_printf("verify fd %d start %"PRIu64" len %"PRIu64"\n",
+			rv->io_disk->d_fd, rv->io_start, rv->io_length);
+
+	tmp = malloc(sizeof(struct read_verify));
+	if (!tmp) {
+		rvp->ioend_fn(rvp, rv->io_disk, rv->io_start, rv->io_length,
+				errno, rv->io_end_arg);
+		return;
+	}
+	*tmp = *rv;
+
+	queue_work(&rvp->rvp_wq, read_verify, 0, tmp);
+}
+
+/*
+ * Issue an IO request.  We'll batch subsequent requests if they're
+ * within 64k of each other.
+ */
+void
+read_verify_schedule(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length,
+	void				*end_arg)
+{
+	uint64_t			ve_end;
+	uint64_t			io_end;
+
+	assert(rvp->rvp_readbuf);
+	ve_end = start + length;
+	io_end = rv->io_start + rv->io_length;
+
+	/*
+	 * If we have a stashed IO, we haven't changed fds, the error
+	 * reporting is the same, and the two extents are close,
+	 * we can combine them.
+	 */
+	if (rv->io_length > 0 && disk == rv->io_disk &&
+	    end_arg == rv->io_end_arg &&
+	    ((start >= rv->io_start && start <= io_end + IO_BATCH_LOCALITY) ||
+	     (rv->io_start >= start &&
+	      rv->io_start <= ve_end + IO_BATCH_LOCALITY))) {
+		rv->io_start = min(rv->io_start, start);
+		rv->io_length = max(ve_end, io_end) - rv->io_start;
+	} else {
+		/* Otherwise, issue the stashed IO (if there is one) */
+		if (rv->io_length > 0)
+			read_verify_queue(rvp, rv);
+
+		/* Stash the new IO. */
+		rv->io_disk = disk;
+		rv->io_start = start;
+		rv->io_length = length;
+		rv->io_end_arg = end_arg;
+	}
+}
+
+/* Force any stashed IOs into the verifier. */
+void
+read_verify_force(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv)
+{
+	assert(rvp->rvp_readbuf);
+	if (rv->io_length == 0)
+		return;
+
+	read_verify_queue(rvp, rv);
+	rv->io_length = 0;
+}
+
+/* Read all the data in a file. */
+bool
+read_verify_file(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb)
+{
+	off_t			data_end = 0;
+	off_t			data_start;
+	off_t			start;
+	ssize_t			sz;
+	size_t			count;
+	unsigned long long	verified = 0;
+	bool			reports_holes = true;
+	bool			direct_io = false;
+	bool			moveon = true;
+	int			flags;
+	int			error;
+
+	/*
+	 * Try to force the kernel to read file data from disk.  First
+	 * we try to set O_DIRECT.  If that fails, try to purge the page
+	 * cache.
+	 */
+	flags = fcntl(fd, F_GETFL);
+	error = fcntl(fd, F_SETFL, flags | O_DIRECT);
+	if (error)
+		posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_DONTNEED);
+	else
+		direct_io = true;
+
+	/* See if SEEK_DATA/SEEK_HOLE work... */
+	data_start = lseek(fd, data_end, SEEK_DATA);
+	if (data_start < 0) {
+		/* ENXIO for SEEK_DATA means no file data anywhere. */
+		if (errno == ENXIO)
+			return true;
+		reports_holes = false;
+	}
+
+	if (reports_holes) {
+		data_end = lseek(fd, data_start, SEEK_HOLE);
+		if (data_end < 0)
+			reports_holes = false;
+	}
+
+	/* ...or just read everything if they don't. */
+	if (!reports_holes) {
+		data_start = 0;
+		data_end = sb->st_size;
+	}
+
+	if (!direct_io) {
+		posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_SEQUENTIAL);
+		posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_WILLNEED);
+	}
+	/* Read the non-hole areas. */
+	while (data_start < data_end) {
+		start = data_start;
+
+		if (direct_io && (start & (page_size - 1)))
+			start &= ~(page_size - 1);
+		count = min(IO_MAX_SIZE, data_end - start);
+		if (direct_io && (count & (page_size - 1)))
+			count = (count + page_size) & ~(page_size - 1);
+		sz = pread(fd, ctx->readbuf, count, start);
+		if (sz < 0) {
+			str_errno(ctx, descr);
+			break;
+		} else if (sz == 0) {
+			str_error(ctx, descr,
+_("Read zero bytes, expected %zu."),
+					count);
+			break;
+		} else if (sz != count && start + sz != data_end) {
+			str_warn(ctx, descr,
+_("Short read of %zd bytes, expected %zu."),
+					sz, count);
+		}
+		verified += sz;
+		data_start = start + sz;
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			moveon = false;
+			break;
+		}
+
+		if (data_start >= data_end && reports_holes) {
+			data_start = lseek(fd, data_end, SEEK_DATA);
+			if (data_start < 0) {
+				if (errno != ENXIO)
+					str_errno(ctx, descr);
+				break;
+			}
+			data_end = lseek(fd, data_start, SEEK_HOLE);
+			if (data_end < 0) {
+				if (errno != ENXIO)
+					str_errno(ctx, descr);
+				break;
+			}
+		}
+	}
+
+	/* Turn off O_DIRECT. */
+	if (direct_io) {
+		flags = fcntl(fd, F_GETFL);
+		error = fcntl(fd, F_SETFL, flags & ~O_DIRECT);
+		if (error)
+			str_errno(ctx, descr);
+	}
+
+	pthread_mutex_lock(&verified_lock);
+	verified_bytes += verified;
+	pthread_mutex_unlock(&verified_lock);
+
+	return moveon;
+}
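
The batching rule in read_verify_schedule() above can be illustrated in
isolation.  What follows is only a minimal sketch, not the code from this
patch: the names sketch_extent, sketch_try_merge and SKETCH_BATCH_LOCALITY
are hypothetical, the 64k window is assumed from the comment above
read_verify_schedule() (IO_BATCH_LOCALITY itself is defined elsewhere), and
the disk and end_arg equality checks are left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed locality window; the patch comment says 64k. */
#define SKETCH_BATCH_LOCALITY	(65536ULL)

/* Hypothetical stand-in for the start/length part of struct read_verify. */
struct sketch_extent {
	uint64_t	start;	/* bytes */
	uint64_t	length;	/* bytes; 0 means nothing is stashed */
};

/*
 * Try to fold [start, start + length) into the stashed extent.
 * Returns true if the stash was widened, false if the caller has to
 * issue the stash (if any) and then stash the new extent instead.
 */
bool
sketch_try_merge(
	struct sketch_extent	*stash,
	uint64_t		start,
	uint64_t		length)
{
	uint64_t		new_end = start + length;
	uint64_t		old_end = stash->start + stash->length;
	uint64_t		lo;
	uint64_t		hi;
	bool			near_after;
	bool			near_before;

	/* This sketch does not handle offsets that wrap around. */
	assert(new_end >= start);

	if (stash->length == 0)
		return false;

	/* New extent starts inside, or shortly after, the stash. */
	near_after = start >= stash->start &&
		     start <= old_end + SKETCH_BATCH_LOCALITY;
	/* Stash starts inside, or shortly after, the new extent. */
	near_before = stash->start >= start &&
		      stash->start <= new_end + SKETCH_BATCH_LOCALITY;
	if (!near_after && !near_before)
		return false;

	/* Widen the stash to cover both extents and the gap between. */
	lo = stash->start < start ? stash->start : start;
	hi = old_end > new_end ? old_end : new_end;
	stash->start = lo;
	stash->length = hi - lo;
	return true;
}
```

With a 4096-byte stash at offset 0, a second 4096-byte extent at offset
8192 is folded in and the stash grows to 12288 bytes, so the gap between
the two is read as well.  An extent at offset 1MiB is outside the window
and is not merged.
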
diff --git a/scrub/read_verify.h b/scrub/read_verify.h
new file mode 100644
index 0000000..01f712b
--- /dev/null
+++ b/scrub/read_verify.h
@@ -0,0 +1,59 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef READ_VERIFY_H_
+#define READ_VERIFY_H_
+
+struct read_verify_pool;
+
+typedef void (*read_verify_ioend_fn_t)(struct read_verify_pool *rvp,
+		struct disk *disk, uint64_t start, uint64_t length,
+		int error, void *arg);
+typedef void (*read_verify_ioend_arg_free_fn_t)(void *arg);
+
+struct read_verify_pool {
+	struct work_queue	rvp_wq;
+	struct scrub_ctx	*rvp_ctx;
+	void			*rvp_readbuf;
+	read_verify_ioend_fn_t	ioend_fn;
+	read_verify_ioend_arg_free_fn_t	ioend_arg_free_fn;
+	size_t			rvp_readbufsz;		/* bytes */
+	size_t			rvp_min_io_size;	/* bytes */
+	int			rvp_nproc;
+};
+
+bool read_verify_pool_init(struct read_verify_pool *rvp, struct scrub_ctx *ctx,
+		void *readbuf, size_t readbufsz, size_t min_io_sz,
+		read_verify_ioend_fn_t ioend_fn, unsigned int nproc);
+void read_verify_pool_destroy(struct read_verify_pool *rvp);
+
+struct read_verify {
+	void			*io_end_arg;
+	struct disk		*io_disk;
+	uint64_t		io_start;	/* bytes */
+	uint64_t		io_length;	/* bytes */
+};
+
+void read_verify_schedule(struct read_verify_pool *rvp, struct read_verify *rv,
+		struct disk *disk, uint64_t start, uint64_t length,
+		void *end_arg);
+void read_verify_force(struct read_verify_pool *rvp, struct read_verify *rv);
+unsigned long long read_verify_bytes(void);
+
+#endif /* READ_VERIFY_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
new file mode 100644
index 0000000..d9b8687
--- /dev/null
+++ b/scrub/scrub.c
@@ -0,0 +1,1009 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <mntent.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/statvfs.h>
+#include <sys/vfs.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include "disk.h"
+#include "scrub.h"
+#include "../../repair/threads.h"
+#include "read_verify.h"
+
+#define _PATH_PROC_MOUNTS	"/proc/mounts"
+
+bool				verbose;
+int				debug;
+bool				scrub_data;
+bool				dumpcore;
+bool				display_rusage;
+long				page_size;
+enum errors_action		error_action = ERRORS_CONTINUE;
+static unsigned long		max_errors;
+
+static void __attribute__((noreturn))
+usage(void)
+{
+	fprintf(stderr, _("Usage: %s [OPTIONS] mountpoint\n"), progname);
+	fprintf(stderr, _("-a:\tStop after this many errors are found.\n"));
+	fprintf(stderr, _("-d:\tRun program in debug mode.\n"));
+	fprintf(stderr, _("-e:\tWhat to do if errors are found.\n"));
+	fprintf(stderr, _("-m:\tPath to /etc/mtab.\n"));
+	fprintf(stderr, _("-n:\tDry run.  Do not modify anything.\n"));
+	fprintf(stderr, _("-t:\tUse this filesystem backend for scrubbing.\n"));
+	fprintf(stderr, _("-T:\tDisplay timing/usage information.\n"));
+	fprintf(stderr, _("-v:\tVerbose output.\n"));
+	fprintf(stderr, _("-V:\tPrint version.\n"));
+	fprintf(stderr, _("-x:\tScrub file data too.\n"));
+	fprintf(stderr, _("-y:\tRepair all errors.\n"));
+
+	exit(16);
+}
+
+/*
+ * Check if the argument is either the device name or mountpoint of a mounted
+ * filesystem.
+ */
+static bool
+find_mountpoint_check(
+	struct stat		*sb,
+	struct mntent		*t)
+{
+	struct stat		ms;
+
+	if (S_ISDIR(sb->st_mode)) {		/* mount point */
+		if (stat(t->mnt_dir, &ms) < 0)
+			return false;
+		if (sb->st_ino != ms.st_ino)
+			return false;
+		if (sb->st_dev != ms.st_dev)
+			return false;
+		/*
+		 * Since we can handle non-XFS filesystems, we don't
+		 * need to check that the device is accessible.
+		 * (The xfs_fsr version of this function does care.)
+		 */
+	} else {				/* device */
+		if (stat(t->mnt_fsname, &ms) < 0)
+			return false;
+		if (sb->st_rdev != ms.st_rdev)
+			return false;
+		/*
+		 * Make sure the mountpoint given by mtab is accessible
+		 * before using it.
+		 */
+		if (stat(t->mnt_dir, &ms) < 0)
+			return false;
+	}
+
+	return true;
+}
+
+/* Check that our alleged mountpoint is in mtab */
+static bool
+find_mountpoint(
+	char			*mtab,
+	struct scrub_ctx	*ctx)
+{
+	struct mntent_cursor	cursor;
+	struct mntent		*t = NULL;
+	bool			found = false;
+
+	if (platform_mntent_open(&cursor, mtab) != 0) {
+		fprintf(stderr, _("Error: can't get mntent entries.\n"));
+		exit(1);
+	}
+
+	while ((t = platform_mntent_next(&cursor)) != NULL) {
+		/*
+		 * Keep jotting down matching mount details; newer mounts are
+		 * towards the end of the file (hopefully).
+		 */
+		if (find_mountpoint_check(&ctx->mnt_sb, t)) {
+			ctx->mntpoint = strdup(t->mnt_dir);
+			ctx->mnt_type = strdup(t->mnt_type);
+			ctx->blkdev = strdup(t->mnt_fsname);
+			found = true;
+		}
+	}
+	platform_mntent_close(&cursor);
+	return found;
+}
+
+/* Too many errors? Bail out. */
+bool
+xfs_scrub_excessive_errors(
+	struct scrub_ctx	*ctx)
+{
+	bool			ret;
+
+	pthread_mutex_lock(&ctx->lock);
+	ret = max_errors > 0 && ctx->errors_found >= max_errors;
+	pthread_mutex_unlock(&ctx->lock);
+
+	return ret;
+}
+
+/* Get the name of the repair tool. */
+const char *
+repair_tool(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->ops->repair_tool)
+		return ctx->ops->repair_tool;
+
+	return "fsck";
+}
+
+/* Print a string and whatever error is stored in errno. */
+void
+__str_errno(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line)
+{
+	char			buf[DESCR_BUFSZ];
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: %s.", str, strerror_r(errno, buf, DESCR_BUFSZ));
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->errors_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some error text. */
+void
+__str_error(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: ", str);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->errors_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some warning text. */
+void
+__str_warn(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: ", str);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->warnings_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some informational text. */
+void
+__str_info(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	printf("%s: ", str);
+	va_start(args, format);
+	vprintf(format, args);
+	va_end(args);
+	if (debug)
+		printf(" (%s line %d)", file, line);
+	printf("\n");
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Increment the repair count. */
+void
+__record_repair(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, "%s: ", str);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, " (%s line %d)", file, line);
+	fprintf(stderr, "\n");
+	ctx->repairs++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Increment the optimization (preening) count. */
+void
+__record_preen(
+	struct scrub_ctx	*ctx,
+	const char		*str,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	if (debug || verbose) {
+		printf("%s: ", str);
+		va_start(args, format);
+		vprintf(format, args);
+		va_end(args);
+		if (debug)
+			printf(" (%s line %d)", file, line);
+		printf("\n");
+	}
+	ctx->preens++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+static struct scrub_ops *scrub_impl[] = {
+	&xfs_scrub_ops,
+	&btrfs_scrub_ops,
+	&shared_block_fs_scrub_ops,
+	&unstable_inum_fs_scrub_ops,
+	NULL
+};
+
+void __attribute__((noreturn))
+do_error(char const *msg, ...)
+{
+	va_list args;
+
+	fprintf(stderr, _("\nfatal error -- "));
+	va_start(args, msg);
+	vfprintf(stderr, msg, args);
+	va_end(args);
+	if (dumpcore)
+		abort();
+	exit(1);
+}
+
+#define SCRUB_QUIRK_FNS(name, flagname) \
+bool \
+scrub_has_##name( \
+	struct scrub_ctx		*ctx) \
+{ \
+	return ctx->quirks & SCRUB_QUIRK_##flagname; \
+}
+SCRUB_QUIRK_FNS(fiemap,		FIEMAP_WORKS)
+SCRUB_QUIRK_FNS(fiemap_attr,	FIEMAP_ATTR_WORKS)
+SCRUB_QUIRK_FNS(fibmap,		FIBMAP_WORKS)
+SCRUB_QUIRK_FNS(shared_blocks,	SHARED_BLOCKS)
+SCRUB_QUIRK_FNS(unstable_inums,	UNSTABLE_INUM)
+
+/* How many threads to kick off? */
+unsigned int
+scrub_nproc(
+	struct scrub_ctx	*ctx)
+{
+	if (debug_tweak_on("XFS_SCRUB_NO_THREADS"))
+		return 1;
+	return ctx->nr_io_threads;
+}
+
+/* Decide if a value is within +/- (n/d) of a desired value. */
+bool
+within_range(
+	struct scrub_ctx	*ctx,
+	unsigned long long	value,
+	unsigned long long	desired,
+	unsigned long long	diff_threshold,
+	unsigned int		n,
+	unsigned int		d,
+	const char		*descr)
+{
+	assert(n < d);
+
+	/* Don't complain if difference does not exceed an absolute value. */
+	if (value < desired && desired - value < diff_threshold)
+		return true;
+	if (value > desired && value - desired < diff_threshold)
+		return true;
+
+	/* Complain if the difference exceeds a certain percentage. */
+	if (value < desired * (d - n) / d) {
+		str_warn(ctx, ctx->mntpoint,
+_("Found fewer %s than reported"), descr);
+		return false;
+	}
+	if (value > desired * (d + n) / d) {
+		str_warn(ctx, ctx->mntpoint,
+_("Found more %s than reported"), descr);
+		return false;
+	}
+	return true;
+}
+
+static double
+timeval_subtract(
+	struct timeval		*tv1,
+	struct timeval		*tv2)
+{
+	return ((tv1->tv_sec - tv2->tv_sec) +
+		((double) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
+}
+
+/* Produce human readable disk space output. */
+double
+auto_space_units(
+	unsigned long long	bytes,
+	char			**units)
+{
+	if (debug > 1)
+		goto no_prefix;
+	if (bytes > (1ULL << 40)) {
+		*units = "TiB";
+		return (double)bytes / (1ULL << 40);
+	} else if (bytes > (1ULL << 30)) {
+		*units = "GiB";
+		return (double)bytes / (1ULL << 30);
+	} else if (bytes > (1ULL << 20)) {
+		*units = "MiB";
+		return (double)bytes / (1ULL << 20);
+	} else if (bytes > (1ULL << 10)) {
+		*units = "KiB";
+		return (double)bytes / (1ULL << 10);
+	} else {
+no_prefix:
+		*units = "B";
+		return bytes;
+	}
+}
+
+/* Produce human readable discrete number output. */
+double
+auto_units(
+	unsigned long long	number,
+	char			**units)
+{
+	if (debug > 1)
+		goto no_prefix;
+	if (number > 1000000000000ULL) {
+		*units = "T";
+		return number / 1000000000000.0;
+	} else if (number > 1000000000ULL) {
+		*units = "G";
+		return number / 1000000000.0;
+	} else if (number > 1000000ULL) {
+		*units = "M";
+		return number / 1000000.0;
+	} else if (number > 1000ULL) {
+		*units = "K";
+		return number / 1000.0;
+	} else {
+no_prefix:
+		*units = "";
+		return number;
+	}
+}
+
+/*
+ * Given a directory fd and (possibly) a dirent, open the file associated
+ * with the entry.  If the entry is null, just duplicate the dir_fd.
+ */
+int
+dirent_open(
+	int			dir_fd,
+	struct dirent		*dirent)
+{
+	if (!dirent)
+		return dup(dir_fd);
+	return openat(dir_fd, dirent->d_name,
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+}
+
+#ifndef RUSAGE_BOTH
+# define RUSAGE_BOTH		(-2)
+#endif
+
+/* Get resource usage for ourselves and all children. */
+int
+scrub_getrusage(
+	struct rusage		*usage)
+{
+	struct rusage		cusage;
+	int			err;
+
+	err = getrusage(RUSAGE_BOTH, usage);
+	if (!err)
+		return err;
+
+	err = getrusage(RUSAGE_SELF, usage);
+	if (err)
+		return err;
+
+	err = getrusage(RUSAGE_CHILDREN, &cusage);
+	if (err)
+		return err;
+
+	usage->ru_minflt += cusage.ru_minflt;
+	usage->ru_majflt += cusage.ru_majflt;
+	usage->ru_nswap += cusage.ru_nswap;
+	usage->ru_inblock += cusage.ru_inblock;
+	usage->ru_oublock += cusage.ru_oublock;
+	usage->ru_msgsnd += cusage.ru_msgsnd;
+	usage->ru_msgrcv += cusage.ru_msgrcv;
+	usage->ru_nsignals += cusage.ru_nsignals;
+	usage->ru_nvcsw += cusage.ru_nvcsw;
+	usage->ru_nivcsw += cusage.ru_nivcsw;
+	return 0;
+}
+
+struct phase_info {
+	struct rusage		ruse;
+	struct timeval		time;
+	unsigned long long	verified_bytes;
+	void			*brk_start;
+	const char		*tag;
+};
+
+/* Start tracking resource usage for a phase. */
+static bool
+phase_start(
+	struct phase_info	*pi,
+	const char		*tag,
+	const char		*descr)
+{
+	int			error;
+
+	error = scrub_getrusage(&pi->ruse);
+	if (error) {
+		perror(_("getrusage"));
+		return false;
+	}
+	pi->brk_start = sbrk(0);
+
+	error = gettimeofday(&pi->time, NULL);
+	if (error) {
+		perror(_("gettimeofday"));
+		return false;
+	}
+	pi->tag = tag;
+
+	pi->verified_bytes = read_verify_bytes();
+
+	if ((verbose || display_rusage) && descr)
+		printf(_("%s%s\n"), pi->tag, descr);
+	return true;
+}
+
+/* Report usage stats. */
+static bool
+phase_end(
+	struct phase_info	*pi)
+{
+	struct rusage		ruse_now;
+#ifdef HAVE_MALLINFO
+	struct mallinfo		mall_now;
+#endif
+	struct timeval		time_now;
+	double			dt;
+	unsigned long long	verified;
+	long			in, out;
+	long			io;
+	double			i, o, t;
+	double			din, dout, dtot;
+	char			*iu, *ou, *tu, *dinu, *doutu, *dtotu;
+	double			v, dv;
+	char			*vu, *dvu;
+	int			error;
+
+	if (!display_rusage)
+		return true;
+
+	error = gettimeofday(&time_now, NULL);
+	if (error) {
+		perror(_("gettimeofday"));
+		return false;
+	}
+	dt = timeval_subtract(&time_now, &pi->time);
+
+	error = scrub_getrusage(&ruse_now);
+	if (error) {
+		perror(_("getrusage"));
+		return false;
+	}
+
+#define kbytes(x)	(((unsigned long)(x) + 1023) / 1024)
+#ifdef HAVE_MALLINFO
+
+	mall_now = mallinfo();
+	printf(_("%sMemory used: %luk/%luk (%luk/%luk), "), pi->tag,
+		kbytes(mall_now.arena), kbytes(mall_now.hblkhd),
+		kbytes(mall_now.uordblks), kbytes(mall_now.fordblks));
+#else
+	printf(_("%sMemory used: %luk, "), pi->tag,
+		(unsigned long) kbytes(((char *) sbrk(0)) -
+				       ((char *) pi->brk_start)));
+#endif
+#undef kbytes
+
+	printf(_("time: %5.2f/%5.2f/%5.2fs\n"),
+		timeval_subtract(&time_now, &pi->time),
+		timeval_subtract(&ruse_now.ru_utime, &pi->ruse.ru_utime),
+		timeval_subtract(&ruse_now.ru_stime, &pi->ruse.ru_stime));
+
+	/* I/O usage */
+	in =  (ruse_now.ru_inblock - pi->ruse.ru_inblock) << BBSHIFT;
+	out = (ruse_now.ru_oublock - pi->ruse.ru_oublock) << BBSHIFT;
+	io = in + out;
+	if (io) {
+		i = auto_space_units(in, &iu);
+		o = auto_space_units(out, &ou);
+		t = auto_space_units(io, &tu);
+		din = auto_space_units(in / dt, &dinu);
+		dout = auto_space_units(out / dt, &doutu);
+		dtot = auto_space_units(io / dt, &dtotu);
+		printf(
+_("%sI/O: %.1f%s in, %.1f%s out, %.1f%s tot\n"),
+			pi->tag, i, iu, o, ou, t, tu);
+		printf(
+_("%sI/O rate: %.1f%s/s in, %.1f%s/s out, %.1f%s/s tot\n"),
+			pi->tag, din, dinu, dout, doutu, dtot, dtotu);
+	}
+
+	/* How many bytes were read-verified? */
+	verified = read_verify_bytes() - pi->verified_bytes;
+	if (verified) {
+		v = auto_space_units(verified, &vu);
+		dv = auto_space_units(verified / dt, &dvu);
+		printf(_("%sVerify: %.1f%s, rate: %.1f%s/s\n"), pi->tag,
+			v, vu, dv, dvu);
+	}
+
+	return true;
+}
+
+/* Find filesystem geometry and perform any other setup functions. */
+static bool
+find_geo(
+	struct scrub_ctx	*ctx)
+{
+	bool			moveon;
+	int			error;
+
+	/*
+	 * Open the directory with O_NOATIME.  For mountpoints owned
+	 * by root, this should be sufficient to ensure that we have
+	 * CAP_SYS_ADMIN, which we probably need to do anything fancy
+	 * with the (XFS driver) kernel.
+	 */
+	ctx->mnt_fd = open(ctx->mntpoint, O_RDONLY | O_NOATIME | O_DIRECTORY);
+	if (ctx->mnt_fd < 0) {
+		if (errno == EPERM)
+			str_info(ctx, ctx->mntpoint,
+_("Must be root to run scrub."));
+		else
+			str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = disk_open(ctx->blkdev, &ctx->datadev);
+	if (error && errno != ENOENT)
+		str_errno(ctx, ctx->blkdev);
+
+	error = fstat(ctx->mnt_fd, &ctx->mnt_sb);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = fstatvfs(ctx->mnt_fd, &ctx->mnt_sv);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = fstatfs(ctx->mnt_fd, &ctx->mnt_sf);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	if (disk_is_open(&ctx->datadev))
+		ctx->nr_io_threads = disk_heads(&ctx->datadev);
+	else
+		ctx->nr_io_threads = libxfs_nproc();
+	moveon = ctx->ops->scan_fs(ctx);
+	if (verbose)
+		printf(_("%s: using %u threads to scrub.\n"),
+				ctx->mntpoint, ctx->nr_io_threads);
+
+	return moveon;
+}
+
+struct scrub_phase {
+	char		*descr;
+	bool		(*fn)(struct scrub_ctx *);
+};
+
+/* Run the preening phase if there are no errors. */
+static bool
+preen(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->errors_found) {
+		str_info(ctx, ctx->mntpoint,
+_("Errors found, please re-run with -y."));
+		return true;
+	}
+
+	return ctx->ops->preen_fs(ctx);
+}
+
+/* Run all the phases of the scrubber. */
+static bool
+run_scrub_phases(
+	struct scrub_ctx	*ctx)
+{
+	struct scrub_phase	phases[] = {
+		{_("Find filesystem geometry."),   find_geo},
+		{_("Check internal metadata."),	   ctx->ops->scan_metadata},
+		{_("Scan all inodes."),		   ctx->ops->scan_inodes},
+		{_("Check directory structure."),  ctx->ops->scan_fs_tree},
+		{_("Verify data file integrity."), ctx->ops->scan_blocks},
+		{_("Check summary counters."),	   ctx->ops->check_summary},
+#define REPAIR_PHASE	(ARRAY_SIZE(phases) - 2)
+		{NULL, NULL}, /* fill this in if we're preening or fixing. */
+		{NULL, NULL},
+	};
+	struct phase_info	pi;
+	char			buf[DESCR_BUFSZ];
+	struct scrub_phase	*phase;
+	bool			moveon;
+	int			c;
+
+	/* Phase 7 can be turned into preening or fixing the filesystem. */
+	phase = &phases[REPAIR_PHASE];
+	if (ctx->mode == SCRUB_MODE_PREEN) {
+		phase->descr = _("Preen filesystem.");
+		phase->fn = preen;
+	} else if (ctx->mode == SCRUB_MODE_REPAIR) {
+		phase->descr = _("Repair filesystem.");
+		phase->fn = ctx->ops->repair_fs;
+	}
+
+	/* Run all phases of the scrub tool. */
+	for (c = 1, phase = phases; phase->fn; phase++, c++) {
+		if (phase->descr)
+			snprintf(buf, DESCR_BUFSZ, _("Phase %d: "), c);
+		else
+			buf[0] = 0;
+		moveon = phase_start(&pi, buf, phase->descr);
+		if (!moveon)
+			return false;
+		moveon = phase->fn(ctx);
+		if (!moveon)
+			return false;
+		moveon = phase_end(&pi);
+		if (!moveon)
+			return false;
+
+		/* Too many errors? */
+		if (xfs_scrub_excessive_errors(ctx))
+			return false;
+	}
+
+	return true;
+}
+
+/* Find an appropriate scrub backend. */
+static struct scrub_ops *
+find_ops(
+	const char		*mnt_type)
+{
+	struct scrub_ops	**ops;
+	struct scrub_ops	*op;
+	const char		*p;
+
+	for (ops = scrub_impl; *ops; ops++) {
+		op = *ops;
+		if (op->aliases) {
+			for (p = op->aliases; *p != 0; p += strlen(p) + 1) {
+				if (!strcmp(mnt_type, p))
+					return op;
+			}
+		}
+		if (!strcmp(mnt_type, op->name))
+			return op;
+	}
+
+	return &generic_scrub_ops;
+}
+
+int
+main(
+	int			argc,
+	char			**argv)
+{
+	int			c;
+	char			*mtab = NULL;
+	struct scrub_ctx	ctx = {0};
+	struct phase_info	all_pi;
+	bool			ismnt;
+	bool			moveon = true;
+	static bool		injected;
+	int			ret;
+	int			error;
+
+	progname = basename(argv[0]);
+	setlocale(LC_ALL, "");
+	bindtextdomain(PACKAGE, LOCALEDIR);
+	textdomain(PACKAGE);
+
+	pthread_mutex_init(&ctx.lock, NULL);
+	ctx.datadev.d_fd = -1;
+	ctx.mode = SCRUB_MODE_DEFAULT;
+	while ((c = getopt(argc, argv, "a:de:m:nTt:vxVy")) != EOF) {
+		switch (c) {
+		case 'a':
+			max_errors = strtoull(optarg, NULL, 10);
+			if (errno) {
+				perror("max_errors");
+				usage();
+			}
+			break;
+		case 'd':
+			debug++;
+			dumpcore = true;
+			break;
+		case 'e':
+			if (!strcmp("continue", optarg))
+				error_action = ERRORS_CONTINUE;
+			else if (!strcmp("shutdown", optarg))
+				error_action = ERRORS_SHUTDOWN;
+			else
+				usage();
+			break;
+		case 'm':
+			mtab = optarg;
+			break;
+		case 'n':
+			if (ctx.mode != SCRUB_MODE_DEFAULT) {
+				fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+				return 1;
+			}
+			ctx.mode = SCRUB_MODE_DRY_RUN;
+			break;
+		case 't':
+			ctx.ops = find_ops(optarg);
+			break;
+		case 'T':
+			display_rusage = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		case 'x':
+			scrub_data = true;
+			break;
+		case 'V':
+			printf(_("%s version %s\n"), progname, VERSION);
+			exit(0);
+		case 'y':
+			if (ctx.mode != SCRUB_MODE_DEFAULT) {
+				fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+				return 1;
+			}
+			ctx.mode = SCRUB_MODE_REPAIR;
+			break;
+		case '?':
+			/* fall through */
+		default:
+			usage();
+		}
+	}
+
+	if (optind != argc - 1)
+		usage();
+
+	ctx.mntpoint = argv[optind];
+	if (!debug_tweak_on("XFS_SCRUB_NO_FIEMAP"))
+		ctx.quirks |= SCRUB_QUIRK_FIEMAP_WORKS |
+			      SCRUB_QUIRK_FIEMAP_ATTR_WORKS;
+	if (!debug_tweak_on("XFS_SCRUB_NO_FIBMAP"))
+		ctx.quirks |= SCRUB_QUIRK_FIBMAP_WORKS;
+
+	/* Find the mount record for the passed-in argument. */
+
+	if (stat(argv[optind], &ctx.mnt_sb) < 0) {
+		fprintf(stderr,
+			_("%s: could not stat: %s: %s\n"),
+			progname, argv[optind], strerror(errno));
+		return 16;
+	}
+
+	/*
+	 * If the user did not specify an explicit mount table, try to use
+	 * /proc/mounts if it is available, else /etc/mtab.  We prefer
+	 * /proc/mounts because it is kernel controlled, while /etc/mtab
+	 * may contain garbage that userspace tools like pam_mount wrote
+	 * into it.
+	 */
+	if (!mtab) {
+		if (access(_PATH_PROC_MOUNTS, R_OK) == 0)
+			mtab = _PATH_PROC_MOUNTS;
+		else
+			mtab = _PATH_MOUNTED;
+	}
+
+	ismnt = find_mountpoint(mtab, &ctx);
+	if (!ismnt) {
+		fprintf(stderr, _("%s: Not a mount point or block device.\n"),
+			ctx.mntpoint);
+		return 16;
+	}
+
+	/* Find an appropriate scrub backend. */
+	if (!ctx.ops)
+		ctx.ops = find_ops(ctx.mnt_type);
+	if (verbose)
+		printf(_("%s: scrubbing %s filesystem with %s driver.\n"),
+			ctx.mntpoint, ctx.mnt_type, ctx.ops->name);
+
+	/* Initialize overall phase stats. */
+	moveon = phase_start(&all_pi, "", NULL);
+	if (!moveon)
+		goto out;
+
+	/*
+	 * Does our backend support shutting down, if the user
+	 * wants errors=shutdown?
+	 */
+	if (error_action == ERRORS_SHUTDOWN && ctx.ops->shutdown_fs == NULL) {
+		fprintf(stderr,
+_("%s: %s driver does not support error shutdown!\n"),
+			ctx.mntpoint, ctx.ops->name);
+		goto out;
+	}
+
+	/* Does our backend support preen, if the user so requests? */
+	if (ctx.mode == SCRUB_MODE_PREEN && ctx.ops->preen_fs == NULL) {
+		fprintf(stderr,
+_("%s: %s driver does not support preening filesystem!\n"),
+			ctx.mntpoint, ctx.ops->name);
+		goto out;
+	}
+
+	/* Does our backend support repair, if the user so requests? */
+	if (ctx.mode == SCRUB_MODE_REPAIR && ctx.ops->repair_fs == NULL) {
+		fprintf(stderr,
+_("%s: %s driver does not support repairing filesystem!\n"),
+			ctx.mntpoint, ctx.ops->name);
+		goto out;
+	}
+
+	/* Set up a page-aligned buffer for read verification. */
+	page_size = sysconf(_SC_PAGESIZE);
+	if (page_size < 0) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
+	/* Try to allocate a read buffer if we don't have one. */
+	error = posix_memalign((void **)&ctx.readbuf, page_size,
+			IO_MAX_SIZE);
+	if (error || !ctx.readbuf) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
+	/* Flush everything out to disk before we start. */
+	error = syncfs(ctx.mnt_fd);
+	if (error) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+		ctx.mode = SCRUB_MODE_REPAIR;
+		injected = true;
+	}
+
+	/* Scrub a filesystem. */
+	moveon = run_scrub_phases(&ctx);
+	if (!moveon)
+		goto out;
+
+out:
+	if (xfs_scrub_excessive_errors(&ctx))
+		str_info(&ctx, ctx.mntpoint, _("Too many errors; aborting."));
+
+	ret = 0;
+	if (!moveon)
+		ret |= 8;
+
+	/* Clean up scan data. */
+	moveon = ctx.ops->cleanup(&ctx);
+	if (!moveon)
+		ret |= 8;
+
+	if (ctx.errors_found && ctx.warnings_found)
+		fprintf(stderr,
+_("%s: %lu errors and %lu warnings found.  Unmount and run %s.\n"),
+			ctx.mntpoint, ctx.errors_found, ctx.warnings_found,
+			repair_tool(&ctx));
+	else if (ctx.errors_found && ctx.warnings_found == 0)
+		fprintf(stderr,
+_("%s: %lu errors found.  Unmount and run %s.\n"),
+			ctx.mntpoint, ctx.errors_found, repair_tool(&ctx));
+	else if (ctx.errors_found == 0 && ctx.warnings_found)
+		fprintf(stderr,
+_("%s: %lu warnings found.\n"),
+			ctx.mntpoint, ctx.warnings_found);
+	if (ctx.errors_found) {
+		if (error_action == ERRORS_SHUTDOWN)
+			ctx.ops->shutdown_fs(&ctx);
+		ret |= 4;
+	}
+	phase_end(&all_pi);
+	close(ctx.mnt_fd);
+	disk_close(&ctx.datadev);
+
+	free(ctx.blkdev);
+	free(ctx.readbuf);
+	free(ctx.mntpoint);
+	free(ctx.mnt_type);
+	return ret;
+}
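
The tolerance check in within_range() above combines an absolute slack
with a fractional one.  A minimal sketch of just that arithmetic follows,
assuming nothing beyond what the function body shows.  The name
sketch_within_range is hypothetical, and the scrub_ctx argument and the
str_warn() reporting are dropped so that the block stands alone.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if value is close enough to desired: either the absolute
 * difference is below diff_threshold, or value lies between
 * desired * (d - n) / d and desired * (d + n) / d inclusive.
 */
bool
sketch_within_range(
	unsigned long long	value,
	unsigned long long	desired,
	unsigned long long	diff_threshold,
	unsigned int		n,
	unsigned int		d)
{
	unsigned long long	diff;

	assert(n < d);

	/* Small absolute differences are never worth complaining about. */
	diff = value > desired ? value - desired : desired - value;
	if (diff < diff_threshold)
		return true;

	/*
	 * Otherwise compare against the fractional bounds.  Note that
	 * desired * (d + n) could overflow for very large desired values;
	 * this sketch does not guard against that.
	 */
	if (value < desired * (d - n) / d)
		return false;
	if (value > desired * (d + n) / d)
		return false;
	return true;
}
```

For desired = 1000, diff_threshold = 10, n = 1 and d = 10, the accepted
range is 900 to 1100 inclusive.  A value of 950 passes, while 899 and
1101 do not.
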
diff --git a/scrub/scrub.h b/scrub/scrub.h
new file mode 100644
index 0000000..27df9a6
--- /dev/null
+++ b/scrub/scrub.h
@@ -0,0 +1,197 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef SCRUB_H_
+#define SCRUB_H_
+
+#define DESCR_BUFSZ		256
+
+/*
+ * Perform all IO in 32M chunks.  This cannot exceed 65536 sectors
+ * because that's the biggest SCSI VERIFY(16) we dare to send.
+ */
+#define IO_MAX_SIZE		33554432
+#define IO_MAX_SECTORS		(IO_MAX_SIZE >> BBSHIFT)
+
+struct scrub_ctx;
+
+struct scrub_ops {
+	const char	*name;
+	const char	*repair_tool;
+	const char	*aliases; /* null-separated string, end w/ two nulls */
+	bool (*cleanup)(struct scrub_ctx *ctx);
+	bool (*scan_fs)(struct scrub_ctx *ctx);
+	bool (*scan_inodes)(struct scrub_ctx *ctx);
+	bool (*check_dir)(struct scrub_ctx *ctx, const char *descr, int dir_fd);
+	bool (*check_inode)(struct scrub_ctx *ctx, const char *descr, int fd,
+			    struct stat *sb);
+	bool (*scan_extents)(struct scrub_ctx *ctx, const char *descr, int fd,
+			     struct stat *sb, bool attr_fork);
+	bool (*scan_xattrs)(struct scrub_ctx *ctx, const char *descr, int fd);
+	bool (*scan_special_xattrs)(struct scrub_ctx *ctx, const char *path);
+	bool (*scan_metadata)(struct scrub_ctx *ctx);
+	bool (*check_summary)(struct scrub_ctx *ctx);
+	bool (*scan_blocks)(struct scrub_ctx *ctx);
+	bool (*read_file)(struct scrub_ctx *ctx, const char *descr, int fd,
+			  struct stat *sb);
+	bool (*scan_fs_tree)(struct scrub_ctx *ctx);
+	bool (*preen_fs)(struct scrub_ctx *ctx);
+	bool (*repair_fs)(struct scrub_ctx *ctx);
+	void (*shutdown_fs)(struct scrub_ctx *ctx);
+};
+
+enum scrub_mode {
+	SCRUB_MODE_DRY_RUN,
+	SCRUB_MODE_PREEN,
+	SCRUB_MODE_REPAIR,
+};
+#define SCRUB_MODE_DEFAULT			SCRUB_MODE_PREEN
+
+#define SCRUB_QUIRK_FIEMAP_WORKS	(1UL << 0)
+#define SCRUB_QUIRK_FIEMAP_ATTR_WORKS	(1UL << 1)
+#define SCRUB_QUIRK_FIBMAP_WORKS	(1UL << 2)
+#define SCRUB_QUIRK_SHARED_BLOCKS	(1UL << 3)
+/* dirent/stat inode numbers do not match */
+#define SCRUB_QUIRK_UNSTABLE_INUM	(1UL << 4)
+
+bool scrub_has_fiemap(struct scrub_ctx *ctx);
+bool scrub_has_fiemap_attr(struct scrub_ctx *ctx);
+bool scrub_has_fibmap(struct scrub_ctx *ctx);
+bool scrub_has_shared_blocks(struct scrub_ctx *ctx);
+bool scrub_has_unstable_inums(struct scrub_ctx *ctx);
+
+struct scrub_ctx {
+	/* Immutable scrub state. */
+	struct scrub_ops	*ops;
+	char			*mntpoint;
+	char			*blkdev;
+	char			*mnt_type;
+	void			*readbuf;
+	int			mnt_fd;
+	enum scrub_mode		mode;
+	unsigned int		nr_io_threads;
+	struct disk		datadev;
+	struct stat		mnt_sb;
+	struct statvfs		mnt_sv;
+	struct statfs		mnt_sf;
+
+	/* Mutable scrub state; use lock. */
+	pthread_mutex_t		lock;
+	unsigned long		errors_found;
+	unsigned long		warnings_found;
+	unsigned long		repairs;
+	unsigned long		preens;
+	unsigned long		quirks;
+
+	void			*priv;
+};
+
+enum errors_action {
+	ERRORS_CONTINUE,
+	ERRORS_SHUTDOWN,
+};
+
+extern bool			verbose;
+extern int			debug;
+extern bool			scrub_data;
+extern long			page_size;
+extern enum errors_action	error_action;
+
+bool xfs_scrub_excessive_errors(struct scrub_ctx *ctx);
+
+void __str_errno(struct scrub_ctx *, const char *, const char *, int);
+void __str_error(struct scrub_ctx *, const char *, const char *, int,
+		 const char *, ...);
+void __str_warn(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+void __str_info(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+void __record_repair(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+void __record_preen(struct scrub_ctx *, const char *, const char *, int,
+		const char *, ...);
+
+#define str_errno(ctx, str)		__str_errno(ctx, str, __FILE__, __LINE__)
+#define str_error(ctx, str, ...)	__str_error(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_warn(ctx, str, ...)		__str_warn(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_info(ctx, str, ...)		__str_info(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_repair(ctx, str, ...)	__record_repair(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_preen(ctx, str, ...)	__record_preen(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define dbg_printf(fmt, ...)		{if (debug > 1) {printf(fmt, __VA_ARGS__);}}
+
+#ifndef container_of
+# define container_of(ptr, type, member) ({			\
+	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
+		(type *)( (char *)__mptr - offsetof(type,member) );})
+#endif
+
+/* Is this debug tweak enabled? */
+static inline bool
+debug_tweak_on(
+	const char		*name)
+{
+	return debug && getenv(name) != NULL;
+}
+
+extern struct scrub_ops	generic_scrub_ops;
+extern struct scrub_ops	xfs_scrub_ops;
+extern struct scrub_ops	btrfs_scrub_ops;
+extern struct scrub_ops	shared_block_fs_scrub_ops;
+extern struct scrub_ops	unstable_inum_fs_scrub_ops;
+
+/* Generic implementations of the ops functions */
+bool generic_cleanup(struct scrub_ctx *ctx);
+bool generic_scan_fs(struct scrub_ctx *ctx);
+bool generic_scan_inodes(struct scrub_ctx *ctx);
+bool generic_check_dir(struct scrub_ctx *ctx, const char *descr, int dir_fd);
+bool generic_check_inode(struct scrub_ctx *ctx, const char *descr, int fd,
+			 struct stat *sb);
+bool generic_scan_extents(struct scrub_ctx *ctx, const char *descr, int fd,
+			  struct stat *sb, bool attr_fork);
+bool generic_scan_xattrs(struct scrub_ctx *ctx, const char *descr, int fd);
+bool generic_scan_special_xattrs(struct scrub_ctx *ctx, const char *path);
+bool generic_scan_metadata(struct scrub_ctx *ctx);
+bool generic_check_summary(struct scrub_ctx *ctx);
+bool read_verify_file(struct scrub_ctx *ctx, const char *descr, int fd,
+		      struct stat *sb);
+bool generic_scan_blocks(struct scrub_ctx *ctx);
+bool generic_scan_fs_tree(struct scrub_ctx *ctx);
+bool generic_preen_fs(struct scrub_ctx *ctx);
+
+/* Miscellaneous utility functions */
+unsigned int scrub_nproc(struct scrub_ctx *ctx);
+bool generic_check_directory(struct scrub_ctx *ctx, const char *descr,
+		int *pfd);
+bool within_range(struct scrub_ctx *ctx, unsigned long long value,
+		unsigned long long desired, unsigned long long diff_threshold,
+		unsigned int n, unsigned int d, const char *descr);
+double auto_space_units(unsigned long long kilobytes, char **units);
+double auto_units(unsigned long long number, char **units);
+const char *repair_tool(struct scrub_ctx *ctx);
+int dirent_open(int dir_fd, struct dirent *dirent);
+
+#ifndef HAVE_SYNCFS
+static inline int syncfs(int fd)
+{
+	sync();
+	return 0;
+}
+#endif
+
+#endif /* SCRUB_H_ */
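
Note on the aliases field of struct scrub_ops above: the header describes
it only as a NUL-separated string that ends with two NULs.  A minimal
sketch of how such a list could be walked follows.  It is not taken from
the patch: alias_list_matches() is a hypothetical helper name, and the
alias strings in the usage note are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: return true if fs_type
 * equals one of the names packed into a double-NUL-terminated list.
 * A literal such as "xfs\0xfsdev\0" has that shape, because the
 * compiler appends the second terminating NUL itself.
 */
static bool
alias_list_matches(
	const char	*aliases,
	const char	*fs_type)
{
	const char	*p;

	assert(fs_type != NULL);
	if (!aliases)
		return false;

	/* Each entry ends at a NUL; an empty entry marks the end of the list. */
	for (p = aliases; *p != '\0'; p += strlen(p) + 1) {
		if (strcmp(p, fs_type) == 0)
			return true;
	}
	return false;
}
```

With this layout, alias_list_matches("xfs\0xfsdev\0", "xfsdev") would match
the second entry, and an empty list ("") matches nothing.  The walk relies
on the list never containing an empty alias, since an empty entry is what
terminates it.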
diff --git a/scrub/xfs.c b/scrub/xfs.c
new file mode 100644
index 0000000..47c6f11
--- /dev/null
+++ b/scrub/xfs.c
@@ -0,0 +1,2465 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <attr/attributes.h>
+#include "disk.h"
+#include "scrub.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "xfs_ioctl.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "iocmd.h"
+#include "xfs_fs.h"
+
+/*
+ * XFS Scrubbing Strategy
+ *
+ * The XFS scrubber is much more thorough than the generic scrubber
+ * because we can use custom XFS ioctls to probe more deeply into the
+ * internals of the filesystem.  Furthermore, we can take advantage of
+ * scrubbing ioctls to check all the records stored in a metadata btree
+ * and cross-reference those records against the other btrees.
+ *
+ * The "find geometry" phase queries XFS for the filesystem geometry.
+ * The block devices for the data, realtime, and log devices are opened.
+ * Kernel ioctls are queried to see if they are implemented, and a data
+ * file read-verify strategy is selected.
+ *
+ * In the "check internal metadata" phase, we call the SCRUB_METADATA
+ * ioctl to check the filesystem's internal per-AG btrees.  This
+ * includes the AG superblock, AGF, AGFL, and AGI headers, freespace
+ * btrees, the regular and free inode btrees, the reverse mapping
+ * btrees, and the reference counting btrees.  If the realtime device is
+ * enabled, the realtime bitmap and reverse mapping btrees are checked.
+ * Each AG (and the realtime device) has its metadata checked in a
+ * separate thread for better performance.
+ *
+ * The "scan inodes" phase uses BULKSTAT to scan all the inodes in an
+ * AG in disk order.  From the BULKSTAT information, a file handle is
+ * constructed and the following items are checked:
+ *
+ *     - If it's a symlink, the target is read but not validated.
+ *     - Bulkstat data is checked.
+ *     - If the inode is a file or a directory, a file descriptor is
+ *       opened to pin the inode and for further analysis.
+ *     - Extended attribute names and values are read via the file
+ *       handle.  If this fails and we have a file descriptor open, we
+ *       retry with the generic extended attribute APIs.
+ *     - If the inode is not a file or directory, we're done.
+ *     - Extent maps are scanned to ensure that the records make sense.
+ *       We also use the SCRUB_METADATA ioctl for better checking of the
+ *       block mapping records.
+ *     - If the inode is a directory, open the directory and check that
+ *       the dirent type code and inode numbers match the stat output.
+ *
+ * Multiple threads are started to check the inodes of each AG in
+ * parallel.
+ *
+ * If BULKSTAT is available, we can skip the "check directory structure"
+ * phase because directories were checked during the inode scan.
+ * Otherwise, the generic directory structure check is used.
+ *
+ * In the "verify data file integrity" phase, we can employ multiple
+ * strategies to read-verify the data blocks:
+ *
+ *     - If GETFSMAP is available, use it to read the reverse-mappings of
+ *       all AGs and issue direct-reads of the underlying disk blocks.
+ *       We rely on the underlying storage to have checksummed the data
+ *       blocks appropriately.
+ *     - If GETBMAPX is available, we use BULKSTAT (or a directory tree
+ *       walk) to iterate all inodes and issue direct-reads of the
+ *       underlying data.  Similar to the generic read-verify, the data
+ *       extents are buffered through a bitmap, which is used to issue
+ *       larger IOs.  Errors are recorded and cross-referenced through
+ *       a second BULKSTAT/GETBMAPX run.
+ *     - Otherwise, call the generic handler to verify file data.
+ *
+ * Multiple threads are started to check each AG in parallel.  A
+ * separate thread pool is used to handle the direct reads.
+ *
+ * In the "check summary counters" phase, use GETFSMAP to tally up the
+ * blocks and BULKSTAT to tally up the inodes we saw and compare that to
+ * the statfs output.  This gives the user a rough estimate of how
+ * thorough the scrub was.
+ */
+
+/* Routines to scrub an XFS filesystem. */
+
+enum data_scrub_type {
+	DS_NOSCRUB,		/* no data scrub */
+	DS_READ,		/* generic_scan_blocks */
+	DS_BULKSTAT_READ,	/* bulkstat and generic_file_read */
+	DS_BMAPX,		/* bulkstat, getbmapx, and read_verify */
+	DS_FSMAP,		/* getfsmap and read_verify */
+};
+
+struct xfs_scrub_ctx {
+	struct xfs_fsop_geom	geo;
+	struct fs_path		fsinfo;
+	unsigned int		agblklog;
+	unsigned int		blocklog;
+	unsigned int		inodelog;
+	unsigned int		inopblog;
+	struct disk		datadev;
+	struct disk		logdev;
+	struct disk		rtdev;
+	void			*fshandle;
+	size_t			fshandle_len;
+	unsigned long long	capabilities;	/* see below */
+	struct read_verify_pool	rvp;
+	enum data_scrub_type	data_scrubber;
+	struct list_head	repair_list;
+};
+
+#define XFS_SCRUB_CAP_KSCRUB_FS		(1ULL << 0)	/* can scrub fs meta? */
+#define XFS_SCRUB_CAP_GETFSMAP		(1ULL << 1)	/* have getfsmap? */
+#define XFS_SCRUB_CAP_BULKSTAT		(1ULL << 2)	/* have bulkstat? */
+#define XFS_SCRUB_CAP_BMAPX		(1ULL << 3)	/* have bmapx? */
+#define XFS_SCRUB_CAP_KSCRUB_INODE	(1ULL << 4)	/* can scrub inode? */
+#define XFS_SCRUB_CAP_KSCRUB_BMAP	(1ULL << 5)	/* can scrub bmap? */
+#define XFS_SCRUB_CAP_KSCRUB_DIR	(1ULL << 6)	/* can scrub dirs? */
+#define XFS_SCRUB_CAP_KSCRUB_XATTR	(1ULL << 7)	/* can scrub attrs? */
+#define XFS_SCRUB_CAP_PARENT_PTR	(1ULL << 8)	/* can find parent? */
+/* If the fast xattr checks fail, we have to use the slower generic scan. */
+#define XFS_SCRUB_CAP_SKIP_SLOW_XATTR	(1ULL << 9)
+#define XFS_SCRUB_CAP_KSCRUB_SYMLINK	(1ULL << 10)	/* can scrub symlink? */
+
+#define XFS_SCRUB_CAPABILITY_FUNCS(name, flagname) \
+static inline bool \
+xfs_scrub_can_##name(struct xfs_scrub_ctx *xctx) \
+{ \
+	return xctx->capabilities & XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_set_##name(struct xfs_scrub_ctx *xctx) \
+{ \
+	xctx->capabilities |= XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_clear_##name(struct xfs_scrub_ctx *xctx) \
+{ \
+	xctx->capabilities &= ~(XFS_SCRUB_CAP_##flagname); \
+}
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_fs,		KSCRUB_FS)
+XFS_SCRUB_CAPABILITY_FUNCS(getfsmap,		GETFSMAP)
+XFS_SCRUB_CAPABILITY_FUNCS(bulkstat,		BULKSTAT)
+XFS_SCRUB_CAPABILITY_FUNCS(bmapx,		BMAPX)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_inode,	KSCRUB_INODE)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_bmap,		KSCRUB_BMAP)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_dir,		KSCRUB_DIR)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_xattr,	KSCRUB_XATTR)
+XFS_SCRUB_CAPABILITY_FUNCS(getparent,		PARENT_PTR)
+XFS_SCRUB_CAPABILITY_FUNCS(skip_slow_xattr,	SKIP_SLOW_XATTR)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_symlink,	KSCRUB_SYMLINK)
+
+/* Find the disk for a given device identifier. */
+static struct disk *
+xfs_dev_to_disk(
+	struct xfs_scrub_ctx	*xctx,
+	dev_t			dev)
+{
+	if (dev == xctx->fsinfo.fs_datadev)
+		return &xctx->datadev;
+	else if (dev == xctx->fsinfo.fs_logdev)
+		return &xctx->logdev;
+	else if (dev == xctx->fsinfo.fs_rtdev)
+		return &xctx->rtdev;
+	assert(0);
+}
+
+/* Find the device major/minor for a given disk. */
+static dev_t
+xfs_disk_to_dev(
+	struct xfs_scrub_ctx	*xctx,
+	struct disk		*disk)
+{
+	if (disk == &xctx->datadev)
+		return xctx->fsinfo.fs_datadev;
+	else if (disk == &xctx->logdev)
+		return xctx->fsinfo.fs_logdev;
+	else if (disk == &xctx->rtdev)
+		return xctx->fsinfo.fs_rtdev;
+	assert(0);
+}
+
+/* Shortcut to creating a read-verify thread pool. */
+static inline bool
+xfs_read_verify_pool_init(
+	struct scrub_ctx	*ctx,
+	read_verify_ioend_fn_t	ioend_fn)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	return read_verify_pool_init(&xctx->rvp, ctx, ctx->readbuf,
+			IO_MAX_SIZE, xctx->geo.blocksize, ioend_fn,
+			disk_heads(&xctx->datadev));
+}
+
+struct owner_decode {
+	uint64_t		owner;
+	const char		*descr;
+};
+
+static const struct owner_decode special_owners[] = {
+	{FMR_OWN_FREE,		"free space"},
+	{FMR_OWN_UNKNOWN,	"unknown owner"},
+	{FMR_OWN_FS,		"static FS metadata"},
+	{FMR_OWN_LOG,		"journalling log"},
+	{FMR_OWN_AG,		"per-AG metadata"},
+	{FMR_OWN_INOBT,		"inode btree blocks"},
+	{FMR_OWN_INODES,	"inodes"},
+	{FMR_OWN_REFC,		"refcount btree"},
+	{FMR_OWN_COW,		"CoW staging"},
+	{FMR_OWN_DEFECTIVE,	"bad blocks"},
+	{0, NULL},
+};
+
+/* Decode a special owner. */
+static const char *
+xfs_decode_special_owner(
+	uint64_t			owner)
+{
+	const struct owner_decode	*od = special_owners;
+
+	while (od->descr) {
+		if (od->owner == owner)
+			return od->descr;
+		od++;
+	}
+
+	return NULL;
+}
+
+/* BULKSTAT wrapper routines. */
+struct xfs_scan_inodes {
+	xfs_inode_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Scan all the inodes in an AG. */
+static void
+xfs_scan_ag_inodes(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct xfs_scan_inodes	*si = arg;
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	uint64_t		ag_ino;
+	uint64_t		next_ag_ino;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u inodes"),
+				major(xctx->fsinfo.fs_datadev),
+				minor(xctx->fsinfo.fs_datadev),
+				agno);
+
+	ag_ino = (__u64)agno << (xctx->inopblog + xctx->agblklog);
+	next_ag_ino = (__u64)(agno + 1) << (xctx->inopblog + xctx->agblklog);
+
+	fn_arg = ((char *)si->arg) + si->array_arg_size * agno;
+	moveon = xfs_iterate_inodes(ctx, descr, xctx->fshandle, ag_ino,
+			next_ag_ino - 1, si->fn, fn_arg);
+	if (!moveon)
+		si->moveon = false;
+}
+
+/* How many array elements should we create to scan all the inodes? */
+static inline size_t
+xfs_scan_all_inodes_array_size(
+	struct xfs_scrub_ctx	*xctx)
+{
+	return xctx->geo.agcount;
+}
+
+/* Scan all the inodes in a filesystem. */
+static bool
+xfs_scan_all_inodes_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct xfs_scan_inodes	si;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+
+	if (!xfs_scrub_can_bulkstat(xctx))
+		return true;
+
+	si.moveon = true;
+	si.fn = fn;
+	si.arg = arg;
+	si.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	for (agno = 0; agno < xctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_inodes, agno, &si);
+	destroy_work_queue(&wq);
+
+	return si.moveon;
+}
+#define xfs_scan_all_inodes(ctx, fn) \
+	xfs_scan_all_inodes_array_arg((ctx), (fn), NULL, 0)
+#define xfs_scan_all_inodes_arg(ctx, fn, arg) \
+	xfs_scan_all_inodes_array_arg((ctx), (fn), (arg), 0)
+
+/* GETFSMAP wrapper routines. */
+struct xfs_scan_blocks {
+	xfs_fsmap_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Iterate all the reverse mappings of an AG. */
+static void
+xfs_scan_ag_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct xfs_scan_blocks	*sbx = arg;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	struct fsmap		keys[2];
+	off64_t			bperag;
+	bool			moveon;
+
+	bperag = (off64_t)xctx->geo.agblocks *
+		 (off64_t)xctx->geo.blocksize;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u fsmap"),
+				major(xctx->fsinfo.fs_datadev),
+				minor(xctx->fsinfo.fs_datadev),
+				agno);
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = xctx->fsinfo.fs_datadev;
+	keys->fmr_physical = agno * bperag;
+	(keys + 1)->fmr_device = xctx->fsinfo.fs_datadev;
+	(keys + 1)->fmr_physical = ((agno + 1) * bperag) - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * agno;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of a standalone device. */
+static void
+xfs_scan_dev_blocks(
+	struct scrub_ctx	*ctx,
+	int			idx,
+	dev_t			dev,
+	struct xfs_scan_blocks	*sbx)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	void			*fn_arg;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d fsmap"),
+			major(dev), minor(dev));
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = ULLONG_MAX;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * idx;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of the realtime device. */
+static void
+xfs_scan_rt_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	xfs_scan_dev_blocks(ctx, agno, xctx->fsinfo.fs_rtdev, arg);
+}
+
+/* Iterate all the reverse mappings of the log device. */
+static void
+xfs_scan_log_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	xfs_scan_dev_blocks(ctx, agno, xctx->fsinfo.fs_logdev, arg);
+}
+
+/* How many array elements to scan all blocks?  Indices run to agcount + 2. */
+static size_t
+xfs_scan_all_blocks_array_size(
+	struct xfs_scrub_ctx	*xctx)
+{
+	return xctx->geo.agcount + 3;
+}
+
+/* Scan all the blocks in a filesystem. */
+static bool
+xfs_scan_all_blocks_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	struct xfs_scan_blocks	sbx;
+
+	sbx.moveon = true;
+	sbx.fn = fn;
+	sbx.arg = arg;
+	sbx.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	if (xctx->fsinfo.fs_rt)
+		queue_work(&wq, xfs_scan_rt_blocks, xctx->geo.agcount + 1,
+				&sbx);
+	if (xctx->fsinfo.fs_log)
+		queue_work(&wq, xfs_scan_log_blocks, xctx->geo.agcount + 2,
+				&sbx);
+	for (agno = 0; agno < xctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_blocks, agno, &sbx);
+	destroy_work_queue(&wq);
+
+	return sbx.moveon;
+}
+
+/* Routines to translate bad physical extents into file paths and offsets. */
+
+struct xfs_verify_error_info {
+	struct bitmap			*d_bad;		/* bytes */
+	struct bitmap			*r_bad;		/* bytes */
+};
+
+/* Report if this extent overlaps a bad region. */
+static bool
+xfs_report_verify_inode_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct xfs_verify_error_info	*vei = arg;
+	struct bitmap			*tree;
+
+	/*
+	 * Only do data scrubbing if the extent is neither unwritten nor
+	 * delalloc.
+	 */
+	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
+		return true;
+
+	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+		tree = vei->r_bad;
+	else
+		tree = vei->d_bad;
+
+	if (!bitmap_has_extent(tree, bmap->bm_physical, bmap->bm_length))
+		return true;
+
+	str_error(ctx, descr,
+_("offset %llu failed read verification."), bmap->bm_offset);
+	return true;
+}
+
+/* Iterate the extent mappings of a file to report errors. */
+static bool
+xfs_report_verify_fd(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	void				*arg)
+{
+	struct xfs_bmap			key = {0};
+	bool				moveon;
+
+	/* data fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+
+	/* attr fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+	return true;
+}
+
+/* Report read verify errors in unlinked (but still open) files. */
+static bool
+xfs_report_verify_inode(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	char				descr[DESCR_BUFSZ];
+	bool				moveon;
+	int				fd;
+
+	/* Ignore linked files and things we can't open. */
+	if (bstat->bs_nlink != 0)
+		return true;
+	if (!S_ISREG(bstat->bs_mode) && !S_ISDIR(bstat->bs_mode))
+		return true;
+
+	/* Try to open the inode. */
+	fd = open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0)
+		return true;
+
+	/* Go find the badness. */
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu (unlinked)"), bstat->bs_ino);
+	moveon = xfs_report_verify_fd(ctx, descr, fd, arg);
+	if (moveon)
+		goto out;
+
+out:
+	close(fd);
+	return moveon;
+}
+
+/* Scan the inode associated with a directory entry. */
+static bool
+xfs_report_verify_dirent(
+	struct scrub_ctx	*ctx,
+	const char		*path,
+	int			dir_fd,
+	struct dirent		*dirent,
+	struct stat		*sb,
+	void			*arg)
+{
+	bool			moveon;
+	int			fd;
+
+	/* Ignore things we can't open. */
+	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode))
+		return true;
+	/* Ignore . and .. */
+	if (dirent && (!strcmp(".", dirent->d_name) ||
+		       !strcmp("..", dirent->d_name)))
+		return true;
+
+	/* Open the file */
+	fd = dirent_open(dir_fd, dirent);
+	if (fd < 0)
+		return true;
+
+	/* Go find the badness. */
+	moveon = xfs_report_verify_fd(ctx, path, fd, arg);
+	if (moveon)
+		goto out;
+
+out:
+	close(fd);
+
+	return moveon;
+}
+
+/* Given bad extent lists for the data & rtdev, find bad files. */
+static bool
+xfs_report_verify_errors(
+	struct scrub_ctx		*ctx,
+	struct bitmap			*d_bad,
+	struct bitmap			*r_bad)
+{
+	struct xfs_verify_error_info	vei;
+	bool				moveon;
+
+	vei.d_bad = d_bad;
+	vei.r_bad = r_bad;
+
+	/* Scan the directory tree to get file paths. */
+	moveon = scan_fs_tree(ctx, NULL, xfs_report_verify_dirent, &vei);
+	if (!moveon)
+		return false;
+
+	/* Scan for unlinked files. */
+	return xfs_scan_all_inodes_arg(ctx, xfs_report_verify_inode, &vei);
+}
+
+/* Phase 1 */
+
+/* Clean up the XFS-specific state data. */
+static bool
+xfs_cleanup(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	if (!xctx)
+		goto out;
+	if (xctx->fshandle)
+		free_handle(xctx->fshandle, xctx->fshandle_len);
+	disk_close(&xctx->rtdev);
+	disk_close(&xctx->logdev);
+	disk_close(&xctx->datadev);
+	free(ctx->priv);
+	ctx->priv = NULL;
+
+out:
+	return generic_cleanup(ctx);
+}
+
+/* Test what kernel functions we can call for this filesystem. */
+static void
+xfs_test_capability(
+	struct scrub_ctx		*ctx,
+	bool				(*test_fn)(struct scrub_ctx *),
+	void				(*set_fn)(struct xfs_scrub_ctx *),
+	const char			*errmsg)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+
+	if (test_fn(ctx))
+		set_fn(xctx);
+	else
+		str_info(ctx, ctx->mntpoint, "%s", errmsg);
+}
+
+/* Read the XFS geometry. */
+static bool
+xfs_scan_fs(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx;
+	struct fs_path			*fsp;
+	int				error;
+
+	if (!platform_test_xfs_fd(ctx->mnt_fd)) {
+		str_error(ctx, ctx->mntpoint,
+_("Does not appear to be an XFS filesystem!"));
+		return false;
+	}
+
+	/*
+	 * Flush everything out to disk before we start checking.
+	 * This seems to reduce the incidence of stale file handle
+	 * errors when we open things by handle.
+	 */
+	error = syncfs(ctx->mnt_fd);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	xctx = calloc(1, sizeof(struct xfs_scrub_ctx));
+	if (!xctx) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	INIT_LIST_HEAD(&xctx->repair_list);
+	xctx->datadev.d_fd = xctx->logdev.d_fd = xctx->rtdev.d_fd = -1;
+	ctx->priv = xctx;
+
+	/* Retrieve XFS geometry. */
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_FSGEOMETRY,
+			&xctx->geo);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		goto err;
+	}
+
+	xctx->agblklog = libxfs_log2_roundup(xctx->geo.agblocks);
+	xctx->blocklog = libxfs_highbit32(xctx->geo.blocksize);
+	xctx->inodelog = libxfs_highbit32(xctx->geo.inodesize);
+	xctx->inopblog = xctx->blocklog - xctx->inodelog;
+
+	error = path_to_fshandle(ctx->mntpoint, &xctx->fshandle,
+			&xctx->fshandle_len);
+	if (error) {
+		perror(_("getting fshandle"));
+		goto err;
+	}
+
+	/* Do we have bulkstat? */
+	xfs_test_capability(ctx, xfs_can_iterate_inodes, xfs_scrub_set_bulkstat,
+_("Kernel lacks BULKSTAT; scrub will be incomplete."));
+
+	/* Do we have getbmapx? */
+	xfs_test_capability(ctx, xfs_can_iterate_bmap, xfs_scrub_set_bmapx,
+_("Kernel lacks GETBMAPX; scrub will be less efficient."));
+
+	/* Do we have getfsmap? */
+	xfs_test_capability(ctx, xfs_can_iterate_fsmap, xfs_scrub_set_getfsmap,
+_("Kernel lacks GETFSMAP; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted metadata scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_fs_metadata,
+			xfs_scrub_set_kscrub_fs,
+_("Kernel cannot help scrub metadata; scrub will be incomplete."));
+
+	/* Do we have kernel-assisted inode scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_inode,
+			xfs_scrub_set_kscrub_inode,
+_("Kernel cannot help scrub inodes; scrub will be incomplete."));
+
+	/* Do we have kernel-assisted bmap scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_bmap,
+			xfs_scrub_set_kscrub_bmap,
+_("Kernel cannot help scrub extent map; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted dir scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_dir,
+			xfs_scrub_set_kscrub_dir,
+_("Kernel cannot help scrub directories; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted xattr scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_attr,
+			xfs_scrub_set_kscrub_xattr,
+_("Kernel cannot help scrub extended attributes; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted symlink scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_symlink,
+			xfs_scrub_set_kscrub_symlink,
+_("Kernel cannot help scrub symbolic links; scrub will be less efficient."));
+
+	/*
+	 * We don't need to use the slow generic xattr scan unless all
+	 * of the fast scanners fail.
+	 */
+	xfs_scrub_set_skip_slow_xattr(xctx);
+
+	/* Go find the XFS devices if we have a usable fsmap. */
+	fs_table_initialise(0, NULL, 0, NULL);
+	errno = 0;
+	fsp = fs_table_lookup(ctx->mntpoint, FS_MOUNT_POINT);
+	if (!fsp) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find XFS information."));
+		goto err;
+	}
+	memcpy(&xctx->fsinfo, fsp, sizeof(struct fs_path));
+
+	/* Did we find the log and rt devices, if they're present? */
+	if (xctx->geo.logstart == 0 && xctx->fsinfo.fs_log == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find log device path."));
+		goto err;
+	}
+	if (xctx->geo.rtblocks && xctx->fsinfo.fs_rt == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find realtime device path."));
+		goto err;
+	}
+
+	/* Open the raw devices. */
+	error = disk_open(xctx->fsinfo.fs_name, &xctx->datadev);
+	if (error) {
+		str_errno(ctx, xctx->fsinfo.fs_name);
+		xfs_scrub_clear_getfsmap(xctx);
+	}
+	ctx->nr_io_threads = libxfs_nproc();
+
+	if (xctx->fsinfo.fs_log) {
+		error = disk_open(xctx->fsinfo.fs_log, &xctx->logdev);
+		if (error) {
+			str_errno(ctx, xctx->fsinfo.fs_log);
+			xfs_scrub_clear_getfsmap(xctx);
+		}
+	}
+	if (xctx->fsinfo.fs_rt) {
+		error = disk_open(xctx->fsinfo.fs_rt, &xctx->rtdev);
+		if (error) {
+			str_errno(ctx, xctx->fsinfo.fs_rt);
+			xfs_scrub_clear_getfsmap(xctx);
+		}
+	}
+
+	/* Figure out who gets to scrub data extents... */
+	if (scrub_data) {
+		if (xfs_scrub_can_getfsmap(xctx))
+			xctx->data_scrubber = DS_FSMAP;
+		else if (xfs_scrub_can_bmapx(xctx))
+			xctx->data_scrubber = DS_BMAPX;
+		else if (xfs_scrub_can_bulkstat(xctx))
+			xctx->data_scrubber = DS_BULKSTAT_READ;
+		else
+			xctx->data_scrubber = DS_READ;
+	} else
+		xctx->data_scrubber = DS_NOSCRUB;
+
+	return generic_scan_fs(ctx);
+err:
+	xfs_cleanup(ctx);
+	return false;
+}
+
+/* Phase 2 */
+
+/* Scrub each AG's metadata btrees. */
+static void
+xfs_scan_ag_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	bool				*pmoveon = arg;
+	struct list_head		repairs;
+	bool				moveon;
+
+	if (!xfs_scrub_can_kscrub_fs(xctx))
+		return;
+
+	INIT_LIST_HEAD(&repairs);
+	moveon = xfs_scrub_ag_metadata(ctx, agno, &repairs);
+	if (!moveon) {
+		*pmoveon = false;
+		return;
+	}
+
+	pthread_mutex_lock(&ctx->lock);
+	list_splice_tail_init(&repairs, &xctx->repair_list);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Scrub whole-FS metadata btrees. */
+static void
+xfs_scan_fs_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	bool				*pmoveon = arg;
+	struct list_head		repairs;
+	bool				moveon;
+
+	if (!xfs_scrub_can_kscrub_fs(xctx))
+		return;
+
+	INIT_LIST_HEAD(&repairs);
+	moveon = xfs_scrub_fs_metadata(ctx, &repairs);
+	if (!moveon)
+		*pmoveon = false;
+
+	pthread_mutex_lock(&ctx->lock);
+	list_splice_tail_init(&repairs, &xctx->repair_list);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Scan whole-FS and per-AG metadata with the kernel scrubbers. */
+static bool
+xfs_scan_metadata(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	bool			moveon = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	queue_work(&wq, xfs_scan_fs_metadata, 0, &moveon);
+	for (agno = 0; agno < xctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_metadata, agno, &moveon);
+	destroy_work_queue(&wq);
+
+	return moveon;
+}
+
+/* Phase 3 */
+
+/* Scrub an inode extent, report if it's bad. */
+static bool
+xfs_scrub_inode_extent(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	unsigned long long		*nextoff = arg;		/* bytes */
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	unsigned long long		eofs;
+	bool				badmap = false;
+
+	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+		eofs = xctx->geo.rtblocks;
+	else
+		eofs = xctx->geo.datablocks;
+	eofs <<= xctx->blocklog;
+
+	if (bmap->bm_length == 0) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) has zero length."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if (bmap->bm_physical >= eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) starts past end of filesystem at %llu."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length, eofs);
+	}
+
+	if (bmap->bm_offset < *nextoff) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) overlaps another extent."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if (bmap->bm_physical + bmap->bm_length < bmap->bm_physical ||
+	    bmap->bm_physical + bmap->bm_length > eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) ends past end of filesystem at %llu."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length, eofs);
+	}
+
+	if (bmap->bm_offset + bmap->bm_length < bmap->bm_offset) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) overflows file offset."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if ((bmap->bm_flags & BMV_OF_SHARED) &&
+	    (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) has conflicting flags 0x%x."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length,
+				bmap->bm_flags);
+	}
+
+	if ((bmap->bm_flags & BMV_OF_SHARED) &&
+	    !(xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_REFLINK)) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) is shared but filesystem does not support sharing."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if (!badmap)
+		*nextoff = bmap->bm_offset + bmap->bm_length;
+
+	return true;
+}
+
+/* Scrub an inode's data and xattr fork extent records. */
+static bool
+xfs_scan_inode_extents(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd)
+{
+	struct xfs_bmap			key = {0};
+	bool				moveon;
+	unsigned long long		nextoff;	/* bytes */
+
+	/* data fork */
+	nextoff = 0;
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_scrub_inode_extent, &nextoff);
+	if (!moveon)
+		return false;
+
+	/* attr fork */
+	nextoff = 0;
+	return xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_scrub_inode_extent, &nextoff);
+}
+
+enum xfs_xattr_ns {
+	RXT_USER	= 0,
+	RXT_ROOT	= ATTR_ROOT,
+	RXT_TRUST	= ATTR_TRUST,
+	RXT_SECURE	= ATTR_SECURE,
+	RXT_MAX		= 4,
+};
+
+static const enum xfs_xattr_ns known_attr_ns[RXT_MAX] = {
+	RXT_USER,
+	RXT_ROOT,
+	RXT_TRUST,
+	RXT_SECURE,
+};
+
+/*
+ * Read all the extended attributes of a file handle.
+ * This function can return false if the get-attr-by-handle function
+ * does not work correctly; callers must be able to work around that.
+ */
+static bool
+xfs_read_handle_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct xfs_handle	*handle,
+	enum xfs_xattr_ns	ns)
+{
+	struct attrlist_cursor	cur;
+	struct attr_multiop	mop;
+	char			attrbuf[XFS_XATTR_LIST_MAX];
+	char			*firstname = NULL;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct attrlist		*attrlist = (struct attrlist *)attrbuf;
+	struct attrlist_ent	*ent;
+	bool			moveon = true;
+	int			i;
+	int			flags = 0;
+	int			error;
+
+	flags |= ns;
+	memset(&attrbuf, 0, XFS_XATTR_LIST_MAX);
+	memset(&cur, 0, sizeof(cur));
+	mop.am_opcode = ATTR_OP_GET;
+	mop.am_flags = flags;
+	while ((error = attr_list_by_handle(handle, sizeof(*handle),
+			attrbuf, XFS_XATTR_LIST_MAX, flags, &cur)) == 0) {
+		for (i = 0; i < attrlist->al_count; i++) {
+			ent = ATTR_ENTRY(attrlist, i);
+
+			/*
+			 * XFS has a longstanding bug where the attr cursor
+			 * never gets updated, causing an infinite loop.
+			 * Detect this and bail out.
+			 */
+			if (i == 0 && xfs_scrub_can_skip_slow_xattr(xctx)) {
+				if (firstname == NULL) {
+					/* a_valuelen sizes the value, not the name. */
+					firstname = strdup(ent->a_name);
+					/* On ENOMEM, retry with the next batch. */
+				} else if (strcmp(firstname,
+						ent->a_name) == 0) {
+					str_error(ctx, descr,
+_("duplicate extended attribute \"%s\", buggy XFS?"),
+							ent->a_name);
+					moveon = false;
+					goto out;
+				}
+			}
+
+			mop.am_attrname = ent->a_name;
+			mop.am_attrvalue = ctx->readbuf;
+			mop.am_length = IO_MAX_SIZE;
+			error = attr_multi_by_handle(handle, sizeof(*handle),
+					&mop, 1, flags);
+			if (error)
+				goto out;
+		}
+
+		if (!attrlist->al_more)
+			break;
+	}
+
+	/* ATTR_TRUST doesn't currently work on Linux... */
+	if (ns == RXT_TRUST && error && errno == EINVAL)
+		error = 0;
+
+out:
+	if (firstname)
+		free(firstname);
+	if (error)
+		str_errno(ctx, descr);
+	return moveon;
+}
+
+/*
+ * Scrub part of a file.  If the user passes in a valid fd we assume
+ * that's the file to check; otherwise, pass in the inode number and
+ * let the kernel sort it out.
+ */
+static bool
+xfs_scrub_fd(
+	struct scrub_ctx	*ctx,
+	bool			(*fn)(struct scrub_ctx *, uint64_t,
+				      uint32_t, int),
+	struct xfs_bstat	*bs,
+	int			fd)
+{
+	if (fd >= 0)
+		return fn(ctx, 0, 0, fd);
+	return fn(ctx, bs->bs_ino, bs->bs_gen, ctx->mnt_fd);
+}
+
+/* Verify the contents, xattrs, and extent maps of an inode. */
+static bool
+xfs_scrub_inode(
+	struct scrub_ctx	*ctx,
+	struct xfs_handle	*handle,
+	struct xfs_bstat	*bstat,
+	void			*arg)
+{
+	struct stat		fd_sb;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	static char		linkbuf[PATH_MAX];
+	char			descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	int			fd = -1;
+	int			i;
+	int			error;
+
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu/%u"), bstat->bs_ino,
+			bstat->bs_gen);
+
+	/* Check block sizes. */
+	if (!S_ISBLK(bstat->bs_mode) && !S_ISCHR(bstat->bs_mode) &&
+	    bstat->bs_blksize != xctx->geo.blocksize)
+		str_error(ctx, descr,
+_("Block size mismatch %u, expected %u"),
+				bstat->bs_blksize, xctx->geo.blocksize);
+	if (bstat->bs_xflags & FS_XFLAG_EXTSIZE) {
+		if (bstat->bs_extsize > (MAXEXTLEN << xctx->blocklog))
+			str_error(ctx, descr,
+_("Extent size hint %u too large"), bstat->bs_extsize);
+		if (!(bstat->bs_xflags & FS_XFLAG_REALTIME) &&
+		    bstat->bs_extsize > (xctx->geo.agblocks << (xctx->blocklog - 1)))
+			str_error(ctx, descr,
+_("Extent size hint %u too large for AG"), bstat->bs_extsize);
+		if (!(bstat->bs_xflags & FS_XFLAG_REALTIME) &&
+		    bstat->bs_extsize % xctx->geo.blocksize)
+			str_error(ctx, descr,
+_("Extent size hint %u not a multiple of blocksize"), bstat->bs_extsize);
+		if ((bstat->bs_xflags & FS_XFLAG_REALTIME) &&
+		    bstat->bs_extsize % (xctx->geo.rtextsize << xctx->blocklog))
+			str_error(ctx, descr,
+_("Extent size hint %u not a multiple of rt extent size"), bstat->bs_extsize);
+	}
+	if ((bstat->bs_xflags & FS_XFLAG_COWEXTSIZE) &&
+	    !(xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_REFLINK))
+		str_error(ctx, descr,
+_("Has a CoW extent size hint on a non-reflink filesystem?"), 0);
+	if (bstat->bs_xflags & FS_XFLAG_COWEXTSIZE) {
+		if (bstat->bs_cowextsize > (MAXEXTLEN << xctx->blocklog))
+			str_error(ctx, descr,
+_("CoW Extent size hint %u too large"), bstat->bs_cowextsize);
+		if (bstat->bs_cowextsize > (xctx->geo.agblocks << (xctx->blocklog - 1)))
+			str_error(ctx, descr,
+_("CoW Extent size hint %u too large for AG"), bstat->bs_cowextsize);
+		if (bstat->bs_cowextsize % xctx->geo.blocksize)
+			str_error(ctx, descr,
+_("CoW Extent size hint %u not a multiple of blocksize"), bstat->bs_cowextsize);
+	}
+
+	/* Try to open the inode to pin it. */
+	if (S_ISREG(bstat->bs_mode) || S_ISDIR(bstat->bs_mode)) {
+		fd = open_by_fshandle(handle, sizeof(*handle),
+				O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+		if (debug && fd < 0) {
+			char buf[DESCR_BUFSZ];
+
+			str_warn(ctx, descr, "%s", strerror_r(errno,
+					buf, DESCR_BUFSZ));
+		}
+	}
+
+	/* Scrub the inode. */
+	if (xfs_scrub_can_kscrub_inode(xctx)) {
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_inode_fields, bstat, fd);
+		if (!moveon)
+			goto out;
+	}
+
+	/* Scrub all block mappings. */
+	if (xfs_scrub_can_kscrub_bmap(xctx)) {
+		/* Use the kernel scrubbers. */
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_data_fork, bstat, fd);
+		if (!moveon)
+			goto out;
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_attr_fork, bstat, fd);
+		if (!moveon)
+			goto out;
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_cow_fork, bstat, fd);
+		if (!moveon)
+			goto out;
+	} else if (fd >= 0 && xfs_scrub_can_bmapx(xctx)) {
+		/* Scan the extent maps with GETBMAPX. */
+		moveon = xfs_scan_inode_extents(ctx, descr, fd);
+		if (!moveon)
+			goto out;
+	} else if (fd >= 0) {
+		/* Fall back to the FIEMAP scanner. */
+		error = fstat(fd, &fd_sb);
+		if (error) {
+			str_errno(ctx, descr);
+			goto out;
+		}
+
+		moveon = generic_scan_extents(ctx, descr, fd, &fd_sb, false);
+		if (!moveon)
+			goto out;
+		moveon = generic_scan_extents(ctx, descr, fd, &fd_sb, true);
+		if (!moveon)
+			goto out;
+	} else {
+		/*
+		 * If this is a file or dir, we have no way to scan the
+		 * extent maps.  Complain.
+		 */
+		if (S_ISREG(bstat->bs_mode) || S_ISDIR(bstat->bs_mode))
+			str_error(ctx, descr,
+_("Unable to open inode to scrub extent maps."));
+	}
+
+	/* XXX: Some day, check child -> parent dir -> child. */
+
+	if (S_ISLNK(bstat->bs_mode)) {
+		/* Check symlink contents. */
+		if (xfs_scrub_can_kscrub_symlink(xctx))
+			moveon = xfs_scrub_symlink(ctx, bstat->bs_ino,
+					bstat->bs_gen, ctx->mnt_fd);
+		else {
+			error = readlink_by_handle(handle, sizeof(*handle),
+					linkbuf, PATH_MAX);
+			if (error < 0)
+				str_errno(ctx, descr);
+		}
+		if (!moveon)
+			goto out;
+	} else if (S_ISDIR(bstat->bs_mode)) {
+		/* Check the directory entries. */
+		if (xfs_scrub_can_kscrub_dir(xctx))
+			moveon = xfs_scrub_fd(ctx, xfs_scrub_dir, bstat, fd);
+		else if (fd >= 0)
+			moveon = generic_check_directory(ctx, descr, &fd);
+		else {
+			str_error(ctx, descr,
+_("Unable to open directory to scrub."));
+			moveon = true;
+		}
+		if (!moveon)
+			goto out;
+	}
+
+	/*
+	 * Read all the extended attributes.  If any of the read
+	 * functions decline to move on, we can try again with the
+	 * VFS functions if we have a file descriptor.
+	 */
+	if (xfs_scrub_can_kscrub_xattr(xctx))
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_attr, bstat, fd);
+	else {
+		moveon = true;
+		for (i = 0; i < RXT_MAX; i++) {
+			moveon = xfs_read_handle_xattrs(ctx, descr, handle,
+					known_attr_ns[i]);
+			if (!moveon)
+				break;
+		}
+		if (!moveon && fd >= 0) {
+			moveon = generic_scan_xattrs(ctx, descr, fd);
+			if (!moveon)
+				goto out;
+		}
+		if (!moveon)
+			xfs_scrub_clear_skip_slow_xattr(xctx);
+		moveon = true;
+	}
+	if (!moveon)
+		goto out;
+
+out:
+	if (fd >= 0)
+		close(fd);
+	return moveon;
+}
+
+/* Verify all the inodes in a filesystem. */
+static bool
+xfs_scan_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	if (!xfs_scrub_can_bulkstat(xctx))
+		return generic_scan_inodes(ctx);
+
+	return xfs_scan_all_inodes(ctx, xfs_scrub_inode);
+}
+
+/* Phase 4 */
+
+/* Check an inode's extents. */
+static bool
+xfs_scan_extents(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb,
+	bool			attr_fork)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/*
+	 * If we have bulkstat and either bmap or kernel scrubbing,
+	 * we already checked the extents.
+	 */
+	if (xfs_scrub_can_bulkstat(xctx) &&
+	    (xfs_scrub_can_bmapx(xctx) || xfs_scrub_can_kscrub_fs(xctx)))
+		return true;
+
+	return generic_scan_extents(ctx, descr, fd, sb, attr_fork);
+}
+
+/* Try to read all the extended attributes. */
+static bool
+xfs_scan_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/* If we have bulkstat, we already checked the attributes. */
+	if (xfs_scrub_can_bulkstat(xctx) && xfs_scrub_can_skip_slow_xattr(xctx))
+		return true;
+
+	return generic_scan_xattrs(ctx, descr, fd);
+}
+
+/* Try to read all the extended attributes of things that have no fd. */
+static bool
+xfs_scan_special_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*path)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/* If we have bulkstat, we already checked the attributes. */
+	if (xfs_scrub_can_bulkstat(xctx) && xfs_scrub_can_skip_slow_xattr(xctx))
+		return true;
+
+	return generic_scan_special_xattrs(ctx, path);
+}
+
+/* Traverse the directory tree. */
+static bool
+xfs_scan_fs_tree(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/* If bulkstat already scanned every inode and its xattrs, skip the walk. */
+	if (xfs_scrub_can_bulkstat(xctx) && xfs_scrub_can_skip_slow_xattr(xctx))
+		return true;
+
+	return generic_scan_fs_tree(ctx);
+}
+
+/* Phase 5 */
+
+/* Verify disk blocks with GETFSMAP */
+
+struct xfs_verify_extent {
+	/* Maintain state for the lazy read verifier. */
+	struct read_verify	rv;
+
+	/* Store bad extents if we don't have parent pointers. */
+	struct bitmap		*d_bad;		/* bytes */
+	struct bitmap		*r_bad;		/* bytes */
+
+	/* Track the last extent we saw. */
+	uint64_t		laststart;	/* bytes */
+	uint64_t		lastlength;	/* bytes */
+	bool			lastshared;	/* was it shared? */
+};
+
+/* Report an IO error resulting from read-verify based off getfsmap. */
+static bool
+xfs_check_rmap_error_report(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*map,
+	void			*arg)
+{
+	const char		*type;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	char			buf[32];
+	uint64_t		err_physical = *(uint64_t *)arg;
+	uint64_t		err_off;
+
+	if (err_physical > map->fmr_physical)
+		err_off = err_physical - map->fmr_physical;
+	else
+		err_off = 0;
+
+	snprintf(buf, 32, _("disk offset %llu"),
+			BTOBB(map->fmr_physical + err_off));
+
+	if (map->fmr_flags & FMR_OF_SPECIAL_OWNER) {
+		type = xfs_decode_special_owner(map->fmr_owner);
+		str_error(ctx, buf,
+_("%s failed read verification."),
+				type);
+	} else if (xfs_scrub_can_getparent(xctx)) {
+		/* XXX: go find the parent path */
+		str_error(ctx, buf,
+_("XXX: inode %lld offset %llu failed read verification."),
+				map->fmr_owner, map->fmr_offset + err_off);
+	}
+	return true;
+}
+
+/* Handle a read error in the rmap-based read verify. */
+void
+xfs_check_rmap_ioerr(
+	struct read_verify_pool	*rvp,
+	struct disk		*disk,
+	uint64_t		start,
+	uint64_t		length,
+	int			error,
+	void			*arg)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	struct scrub_ctx	*ctx = rvp->rvp_ctx;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct xfs_verify_extent	*ve;
+	struct bitmap		*tree;
+	dev_t			dev;
+	bool			moveon;
+
+	ve = arg;
+	dev = xfs_disk_to_dev(xctx, disk);
+
+	/*
+	 * If we don't have parent pointers, save the bad extent for
+	 * later rescanning.
+	 */
+	if (!xfs_scrub_can_getparent(xctx)) {
+		if (dev == xctx->fsinfo.fs_datadev)
+			tree = ve->d_bad;
+		else if (dev == xctx->fsinfo.fs_rtdev)
+			tree = ve->r_bad;
+		else
+			tree = NULL;
+		if (tree) {
+			moveon = bitmap_add(tree, start, length);
+			if (!moveon)
+				str_errno(ctx, ctx->mntpoint);
+		}
+	}
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d ioerr @ %"PRIu64":%"PRIu64" "),
+			major(dev), minor(dev), start, length);
+
+	/* Go figure out which blocks are bad from the fsmap. */
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	keys->fmr_physical = start;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = start + length - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+	xfs_iterate_fsmap(ctx, descr, keys, xfs_check_rmap_error_report,
+			&start);
+}
+
+/* Read verify a (data block) extent. */
+static bool
+xfs_check_rmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*map,
+	void				*arg)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct xfs_verify_extent	*ve = arg;
+	struct disk			*disk;
+	uint64_t			eofs;
+	uint64_t			min_physical;
+	bool				badflags = false;
+	bool				badmap = false;
+
+	dbg_printf("rmap dev %d:%d phys %llu owner %lld offset %llu "
+			"len %llu flags 0x%x\n", major(map->fmr_device),
+			minor(map->fmr_device), map->fmr_physical,
+			map->fmr_owner, map->fmr_offset,
+			map->fmr_length, map->fmr_flags);
+
+	/* If kernel already checked this... */
+	if (xfs_scrub_can_kscrub_fs(xctx))
+		goto skip_check;
+
+	if (map->fmr_device == xctx->fsinfo.fs_datadev)
+		eofs = xctx->geo.datablocks;
+	else if (map->fmr_device == xctx->fsinfo.fs_rtdev)
+		eofs = xctx->geo.rtblocks;
+	else if (map->fmr_device == xctx->fsinfo.fs_logdev)
+		eofs = xctx->geo.logblocks;
+	else
+		assert(0);
+	eofs <<= xctx->blocklog;
+
+	/* Don't go past EOFS */
+	if (map->fmr_physical >= eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) starts past end of filesystem at %llu."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length, eofs);
+	}
+
+	if (map->fmr_physical + map->fmr_length < map->fmr_physical ||
+	    map->fmr_physical + map->fmr_length > eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) ends past end of filesystem at %llu."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length, eofs);
+	}
+
+	/* Check for illegal overlapping. */
+	if (ve->lastshared && (map->fmr_flags & FMR_OF_SHARED))
+		min_physical = ve->laststart;
+	else
+		min_physical = ve->laststart + ve->lastlength;
+
+	if (map->fmr_physical < min_physical) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) overlaps another rmap."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length);
+	}
+
+	/* can't have shared on non-reflink */
+	if ((map->fmr_flags & FMR_OF_SHARED) &&
+	    !(xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_REFLINK))
+		badflags = true;
+
+	/* unwritten can't have any of the other flags */
+	if ((map->fmr_flags & FMR_OF_PREALLOC) &&
+	     (map->fmr_flags & (FMR_OF_ATTR_FORK | FMR_OF_EXTENT_MAP |
+				 FMR_OF_SHARED | FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* attr fork can't be shared or unwritten or special */
+	if ((map->fmr_flags & FMR_OF_ATTR_FORK) &&
+	     (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_SHARED |
+				 FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* extent maps can only have attrfork */
+	if ((map->fmr_flags & FMR_OF_EXTENT_MAP) &&
+	     (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_SHARED |
+				 FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* shared maps can't have any of the other flags */
+	if ((map->fmr_flags & FMR_OF_SHARED) &&
+	    (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+				FMR_OF_EXTENT_MAP | FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* special owners can't have any of the other flags */
+	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	     (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+				 FMR_OF_EXTENT_MAP | FMR_OF_SHARED)))
+		badflags = true;
+
+	if (badflags) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) has conflicting flags 0x%x."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length, map->fmr_flags);
+	}
+
+	/* If this rmap is suspect, don't bother verifying it. */
+	if (badmap)
+		goto out;
+
+skip_check:
+	/* Remember this extent. */
+	ve->lastshared = (map->fmr_flags & FMR_OF_SHARED);
+	ve->laststart = map->fmr_physical;
+	ve->lastlength = map->fmr_length;
+
+	/* "Unknown" extents should be verified; they could be data. */
+	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+			map->fmr_owner == FMR_OWN_UNKNOWN)
+		map->fmr_flags &= ~FMR_OF_SPECIAL_OWNER;
+
+	/*
+	 * We only care about read-verifying data extents that have been
+	 * written to disk.  This means we can skip "special" owners
+	 * (metadata), xattr blocks, unwritten extents, and extent maps.
+	 * These should all get checked elsewhere in the scrubber.
+	 */
+	if (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+			       FMR_OF_EXTENT_MAP | FMR_OF_SPECIAL_OWNER))
+		goto out;
+
+	/* XXX: Filter out directory data blocks. */
+
+	/* Schedule the read verify command for (eventual) running. */
+	disk = xfs_dev_to_disk(xctx, map->fmr_device);
+
+	read_verify_schedule(&xctx->rvp, &ve->rv, disk, map->fmr_physical,
+			map->fmr_length, ve);
+
+out:
+	/* Is this the last extent?  Fire off the read. */
+	if (map->fmr_flags & FMR_OF_LAST)
+		read_verify_force(&xctx->rvp, &ve->rv);
+
+	return true;
+}
+
+/* Verify all the blocks in a filesystem. */
+static bool
+xfs_scan_rmaps(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct bitmap			d_bad;
+	struct bitmap			r_bad;
+	struct xfs_verify_extent	*ve;
+	struct xfs_verify_extent	*v;
+	int				i;
+	unsigned int			groups;
+	bool				moveon;
+
+	/*
+	 * Initialize our per-thread context.  By convention,
+	 * the log device comes first, then the rt device, and then
+	 * the AGs.
+	 */
+	groups = xfs_scan_all_blocks_array_size(xctx);
+	ve = calloc(groups, sizeof(struct xfs_verify_extent));
+	if (!ve) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&d_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_ve;
+	}
+
+	moveon = bitmap_init(&r_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dbad;
+	}
+
+	for (i = 0, v = ve; i < groups; i++, v++) {
+		v->d_bad = &d_bad;
+		v->r_bad = &r_bad;
+	}
+
+	moveon = xfs_read_verify_pool_init(ctx, xfs_check_rmap_ioerr);
+	if (!moveon)
+		goto out_rbad;
+	moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_check_rmap,
+			ve, sizeof(*ve));
+	if (!moveon)
+		goto out_pool;
+
+	for (i = 0, v = ve; i < groups; i++, v++)
+		read_verify_force(&xctx->rvp, &v->rv);
+	read_verify_pool_destroy(&xctx->rvp);
+
+	/* Scan the whole dir tree to see what matches the bad extents. */
+	if (!bitmap_empty(&d_bad) || !bitmap_empty(&r_bad))
+		moveon = xfs_report_verify_errors(ctx, &d_bad, &r_bad);
+
+	bitmap_free(&r_bad);
+	bitmap_free(&d_bad);
+	free(ve);
+	return moveon;
+
+out_pool:
+	read_verify_pool_destroy(&xctx->rvp);
+out_rbad:
+	bitmap_free(&r_bad);
+out_dbad:
+	bitmap_free(&d_bad);
+out_ve:
+	free(ve);
+	return moveon;
+}
+
+/* Read-verify with BULKSTAT + GETBMAPX */
+struct xfs_verify_inode {
+	struct bitmap			d_good;		/* bytes */
+	struct bitmap			r_good;		/* bytes */
+	struct bitmap			*d_bad;		/* bytes */
+	struct bitmap			*r_bad;		/* bytes */
+};
+
+struct xfs_verify_submit {
+	struct read_verify_pool		*rvp;
+	struct bitmap			*bad;
+	struct disk			*disk;
+	struct read_verify		rv;
+};
+
+/* Record an IO error found while read-verifying inode blocks. */
+void
+xfs_verify_inode_bmap_ioerr(
+	struct read_verify_pool		*rvp,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length,
+	int				error,
+	void				*arg)
+{
+	struct bitmap			*tree = arg;
+
+	bitmap_add(tree, start, length);
+}
+
+/* Scrub an inode extent and read-verify it. */
+bool
+xfs_verify_inode_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct bitmap			*tree = arg;
+
+	/*
+	 * Only do data scrubbing if the extent is neither unwritten nor
+	 * delalloc.
+	 */
+	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
+		return true;
+
+	return bitmap_add(tree, bmap->bm_physical, bmap->bm_length);
+}
+
+/* Read-verify the data blocks of a file via BMAP. */
+static bool
+xfs_verify_inode(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	struct stat			fd_sb;
+	struct xfs_bmap			key = {0};
+	struct xfs_verify_inode		*vi = arg;
+	struct bitmap			*tree;
+	char				descr[DESCR_BUFSZ];
+	bool				moveon = true;
+	int				fd = -1;
+	int				error;
+
+	if (!S_ISREG(bstat->bs_mode))
+		return true;
+
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu/%u"), bstat->bs_ino,
+			bstat->bs_gen);
+
+	/* Try to open the inode to pin it. */
+	fd = open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		char buf[DESCR_BUFSZ];
+
+		str_warn(ctx, descr, "%s", strerror_r(errno,
+				buf, DESCR_BUFSZ));
+		return true;
+	}
+
+	if (vi) {
+		/* Use BMAPX */
+		if (bstat->bs_xflags & FS_XFLAG_REALTIME)
+			tree = &vi->r_good;
+		else
+			tree = &vi->d_good;
+
+		/* data fork */
+		moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+				xfs_verify_inode_bmap, tree);
+	} else {
+		error = fstat(fd, &fd_sb);
+		if (error) {
+			str_errno(ctx, descr);
+			goto out;
+		}
+
+		/* Use generic_file_read */
+		moveon = read_verify_file(ctx, descr, fd, &fd_sb);
+	}
+
+out:
+	if (fd >= 0)
+		close(fd);
+	return moveon;
+}
+
+/* Schedule a read verification from an extent tree record. */
+static bool
+xfs_schedule_read_verify(
+	uint64_t			start,
+	uint64_t			length,
+	void				*arg)
+{
+	struct xfs_verify_submit	*rvs = arg;
+
+	read_verify_schedule(rvs->rvp, &rvs->rv, rvs->disk, start, length,
+			rvs->bad);
+	return true;
+}
+
+/* Verify all the file data in a filesystem. */
+static bool
+xfs_verify_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct bitmap		d_good;
+	struct bitmap		d_bad;
+	struct bitmap		r_good;
+	struct bitmap		r_bad;
+	struct xfs_verify_inode	*vi;
+	struct xfs_verify_inode	*v;
+	struct xfs_verify_submit	vs;
+	int			i;
+	unsigned int		groups;
+	bool			moveon;
+
+	groups = xfs_scan_all_inodes_array_size(xctx);
+	vi = calloc(groups, sizeof(struct xfs_verify_inode));
+	if (!vi) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&d_good);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_vi;
+	}
+
+	moveon = bitmap_init(&d_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dgood;
+	}
+
+	moveon = bitmap_init(&r_good);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dbad;
+	}
+
+	moveon = bitmap_init(&r_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_rgood;
+	}
+
+	for (i = 0, v = vi; i < groups; i++, v++) {
+		v->d_bad = &d_bad;
+		v->r_bad = &r_bad;
+
+		moveon = bitmap_init(&v->d_good);
+		if (!moveon) {
+			str_errno(ctx, ctx->mntpoint);
+			goto out_varray;
+		}
+
+		moveon = bitmap_init(&v->r_good);
+		if (!moveon) {
+			str_errno(ctx, ctx->mntpoint);
+			goto out_varray;
+		}
+	}
+
+	/* Scan all the inodes for extent information. */
+	moveon = xfs_scan_all_inodes_array_arg(ctx, xfs_verify_inode,
+			vi, sizeof(*vi));
+	if (!moveon)
+		goto out_varray;
+
+	/* Merge all the IOs. */
+	for (i = 0, v = vi; i < groups; i++, v++) {
+		bitmap_merge(&d_good, &v->d_good);
+		bitmap_free(&v->d_good);
+		bitmap_merge(&r_good, &v->r_good);
+		bitmap_free(&v->r_good);
+	}
+
+	/* Run all the IO in batches. */
+	memset(&vs, 0, sizeof(struct xfs_verify_submit));
+	vs.rvp = &xctx->rvp;
+	moveon = xfs_read_verify_pool_init(ctx, xfs_verify_inode_bmap_ioerr);
+	if (!moveon)
+		goto out_varray;
+	vs.disk = &xctx->datadev;
+	vs.bad = &d_bad;
+	moveon = bitmap_iterate(&d_good, xfs_schedule_read_verify, &vs);
+	if (!moveon)
+		goto out_pool;
+	vs.disk = &xctx->rtdev;
+	vs.bad = &r_bad;
+	moveon = bitmap_iterate(&r_good, xfs_schedule_read_verify, &vs);
+	if (!moveon)
+		goto out_pool;
+	read_verify_force(&xctx->rvp, &vs.rv);
+	read_verify_pool_destroy(&xctx->rvp);
+
+	/* Re-scan the file bmaps to see if they match the bad. */
+	if (!bitmap_empty(&d_bad) || !bitmap_empty(&r_bad))
+		moveon = xfs_report_verify_errors(ctx, &d_bad, &r_bad);
+
+	goto out_varray;
+
+out_pool:
+	read_verify_pool_destroy(&xctx->rvp);
+out_varray:
+	for (i = 0, v = vi; i < groups; i++, v++) {
+		bitmap_free(&v->d_good);
+		bitmap_free(&v->r_good);
+	}
+	bitmap_free(&r_bad);
+out_rgood:
+	bitmap_free(&r_good);
+out_dbad:
+	bitmap_free(&d_bad);
+out_dgood:
+	bitmap_free(&d_good);
+out_vi:
+	free(vi);
+	return moveon;
+}
+
+/* Verify all the file data in a filesystem with the generic verifier. */
+static bool
+xfs_verify_inodes_generic(
+	struct scrub_ctx	*ctx)
+{
+	return xfs_scan_all_inodes(ctx, xfs_verify_inode);
+}
+
+/* Scan all the blocks in a filesystem. */
+static bool
+xfs_scan_blocks(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+
+	switch (xctx->data_scrubber) {
+	case DS_NOSCRUB:
+		return true;
+	case DS_READ:
+		return generic_scan_blocks(ctx);
+	case DS_BULKSTAT_READ:
+		return xfs_verify_inodes_generic(ctx);
+	case DS_BMAPX:
+		return xfs_verify_inodes(ctx);
+	case DS_FSMAP:
+		return xfs_scan_rmaps(ctx);
+	default:
+		assert(0);
+	}
+}
+
+/* Read an entire file's data. */
+static bool
+xfs_read_file(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	if (xctx->data_scrubber != DS_READ)
+		return true;
+
+	return read_verify_file(ctx, descr, fd, sb);
+}
+
+/* Phase 6 */
+
+struct xfs_summary_counts {
+	unsigned long long	inodes;		/* number of inodes */
+	unsigned long long	dbytes;		/* data dev bytes */
+	unsigned long long	rbytes;		/* rt dev bytes */
+	unsigned long long	next_phys;	/* next phys bytes we see? */
+	unsigned long long	agbytes;	/* freespace bytes */
+	struct bitmap		dext;		/* data block extent bitmap */
+	struct bitmap		rext;		/* rt block extent bitmap */
+};
+
+struct xfs_inode_fork_summary {
+	struct bitmap		*tree;
+	unsigned long long	bytes;
+};
+
+/* Record data block extents in a bitmap. */
+bool
+xfs_record_inode_summary_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct xfs_inode_fork_summary	*ifs = arg;
+
+	/* Only record real extents. */
+	if (bmap->bm_flags & BMV_OF_DELALLOC)
+		return true;
+
+	bitmap_add(ifs->tree, bmap->bm_physical, bmap->bm_length);
+	ifs->bytes += bmap->bm_length;
+
+	return true;
+}
+
+/* Record inode and block usage. */
+static bool
+xfs_record_inode_summary(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct xfs_summary_counts	*counts = arg;
+	struct xfs_inode_fork_summary	ifs = {0};
+	struct xfs_bmap			key = {0};
+	char				descr[DESCR_BUFSZ];
+	int				fd;
+	bool				moveon;
+
+	counts->inodes++;
+	if (xfs_scrub_can_getfsmap(xctx) || bstat->bs_blocks == 0)
+		return true;
+
+	if (!xfs_scrub_can_bmapx(xctx) || !S_ISREG(bstat->bs_mode)) {
+		counts->dbytes += (bstat->bs_blocks << xctx->blocklog);
+		return true;
+	}
+
+	/* Potentially a reflinked file, so collect the bitmap... */
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu/%u"), bstat->bs_ino,
+			bstat->bs_gen);
+
+	/* Try to open the inode to pin it. */
+	fd = open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		char buf[DESCR_BUFSZ];
+
+		str_warn(ctx, descr, "%s", strerror_r(errno,
+				buf, DESCR_BUFSZ));
+		return true;
+	}
+
+	/* data fork */
+	if (bstat->bs_xflags & FS_XFLAG_REALTIME)
+		ifs.tree = &counts->rext;
+	else
+		ifs.tree = &counts->dext;
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_record_inode_summary_bmap, &ifs);
+	if (!moveon)
+		goto out;
+
+	/* attr fork */
+	ifs.tree = &counts->dext;
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_record_inode_summary_bmap, &ifs);
+	if (!moveon)
+		goto out;
+
+	/*
+	 * bs_blocks tracks the number of fs blocks assigned to this file
+	 * for data, xattrs, and block mapping metadata.  ifs.bytes tracks
+	 * the data and xattr storage space used, so the diff between the
+	 * two is the space used for block mapping metadata.  Add that to
+	 * the data usage.
+	 */
+	counts->dbytes += (bstat->bs_blocks << xctx->blocklog) - ifs.bytes;
+
+out:
+	if (fd >= 0)
+		close(fd);
+	return moveon;
+}
+
+/* Record block usage. */
+static bool
+xfs_record_block_summary(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*fsmap,
+	void				*arg)
+{
+	struct xfs_summary_counts	*counts = arg;
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	unsigned long long		len;
+
+	if (fsmap->fmr_device == xctx->fsinfo.fs_logdev)
+		return true;
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == FMR_OWN_FREE)
+		return true;
+
+	len = fsmap->fmr_length;
+
+	/* freesp btrees live in free space, need to adjust counters later. */
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == FMR_OWN_AG) {
+		counts->agbytes += fsmap->fmr_length;
+	}
+	if (fsmap->fmr_device == xctx->fsinfo.fs_rtdev) {
+		/* Count realtime extents. */
+		counts->rbytes += len;
+	} else {
+		/* Count data extents. */
+		if (counts->next_phys >= fsmap->fmr_physical + len)
+			return true;
+		else if (counts->next_phys > fsmap->fmr_physical)
+			len = fsmap->fmr_physical + len - counts->next_phys;
+
+		counts->dbytes += len;
+		counts->next_phys = fsmap->fmr_physical + fsmap->fmr_length;
+	}
+
+	return true;
+}
+
+/* Sum the bytes in each extent. */
+static bool
+xfs_summary_count_helper(
+	uint64_t			start,
+	uint64_t			length,
+	void				*arg)
+{
+	unsigned long long		*count = arg;
+
+	*count += length;
+	return true;
+}
+
+/* Count all inodes and blocks in the filesystem, compare to superblock. */
+static bool
+xfs_check_summary(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct xfs_fsop_counts		fc;
+	struct xfs_fsop_resblks		rb;
+	struct xfs_fsop_ag_resblks	arb;
+	struct statvfs			sfs;
+	struct xfs_summary_counts	*summary;
+	unsigned long long		fd;
+	unsigned long long		fr;
+	unsigned long long		fi;
+	unsigned long long		sd;
+	unsigned long long		sr;
+	unsigned long long		si;
+	unsigned long long		absdiff;
+	xfs_agnumber_t			agno;
+	bool				moveon;
+	bool				complain;
+	unsigned int			groups;
+	int				error;
+
+	if (!xfs_scrub_can_bulkstat(xctx))
+		return generic_check_summary(ctx);
+
+	groups = xfs_scan_all_blocks_array_size(xctx);
+	summary = calloc(groups, sizeof(struct xfs_summary_counts));
+	if (!summary) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/* Flush everything out to disk before we start counting. */
+	error = syncfs(ctx->mnt_fd);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		moveon = false;
+		goto out;
+	}
+
+	if (xfs_scrub_can_getfsmap(xctx)) {
+		/* Use fsmap to count blocks. */
+		moveon = xfs_scan_all_blocks_array_arg(ctx,
+				xfs_record_block_summary,
+				summary, sizeof(*summary));
+		if (!moveon)
+			goto out;
+	} else {
+		/* Reflink w/o rmap; have to collect extents in a bitmap. */
+		for (agno = 0; agno < groups; agno++) {
+			moveon = bitmap_init(&summary[agno].dext);
+			if (!moveon) {
+				str_errno(ctx, ctx->mntpoint);
+				goto out;
+			}
+			moveon = bitmap_init(&summary[agno].rext);
+			if (!moveon) {
+				str_errno(ctx, ctx->mntpoint);
+				goto out;
+			}
+		}
+	}
+
+	/* Scan the whole fs. */
+	moveon = xfs_scan_all_inodes_array_arg(ctx, xfs_record_inode_summary,
+			summary, sizeof(*summary));
+	if (!moveon)
+		goto out;
+
+	if (!xfs_scrub_can_getfsmap(xctx)) {
+		/* Reflink w/o rmap; merge the bitmaps. */
+		for (agno = 1; agno < groups; agno++) {
+			bitmap_merge(&summary[0].dext, &summary[agno].dext);
+			bitmap_free(&summary[agno].dext);
+			bitmap_merge(&summary[0].rext, &summary[agno].rext);
+			bitmap_free(&summary[agno].rext);
+		}
+		moveon = bitmap_iterate(&summary[0].dext,
+				xfs_summary_count_helper, &summary[0].dbytes);
+		if (moveon)
+			moveon = bitmap_iterate(&summary[0].rext,
+					xfs_summary_count_helper,
+					&summary[0].rbytes);
+		bitmap_free(&summary[0].dext);
+		bitmap_free(&summary[0].rext);
+		if (!moveon)
+			goto out;
+	}
+
+	/* Sum the counts. */
+	for (agno = 1; agno < groups; agno++) {
+		summary[0].inodes += summary[agno].inodes;
+		summary[0].dbytes += summary[agno].dbytes;
+		summary[0].rbytes += summary[agno].rbytes;
+		summary[0].agbytes += summary[agno].agbytes;
+	}
+
+	/* Account for an internal log, if present. */
+	if (!xfs_scrub_can_getfsmap(xctx) && xctx->fsinfo.fs_log == NULL)
+		summary[0].dbytes += (unsigned long long)xctx->geo.logblocks <<
+				xctx->blocklog;
+
+	/* Account for hidden rt metadata inodes. */
+	summary[0].inodes += 2;
+	if ((xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT) &&
+			xctx->geo.rtblocks > 0)
+		summary[0].inodes++;
+
+	/* Fetch the filesystem counters. */
+	error = xfsctl(NULL, ctx->mnt_fd, XFS_IOC_FSCOUNTS, &fc);
+	if (error)
+		str_errno(ctx, ctx->mntpoint);
+
+	/* Grab the fstatvfs counters, since it has to report accurately. */
+	error = fstatvfs(ctx->mnt_fd, &sfs);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		moveon = false;
+		goto out;
+	}
+
+	/*
+	 * XFS reserves some blocks to prevent hard ENOSPC, so add those
+	 * blocks back to the free data counts.
+	 */
+	error = xfsctl(NULL, ctx->mnt_fd, XFS_IOC_GET_RESBLKS, &rb);
+	if (error)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += rb.resblks_avail;
+
+	/*
+	 * XFS with rmap or reflink reserves blocks in each AG to
+	 * prevent the AG from running out of space for metadata blocks.
+	 * Add those back to the free data counts.
+	 */
+	memset(&arb, 0, sizeof(arb));
+	error = xfsctl(NULL, ctx->mnt_fd, XFS_IOC_GET_AG_RESBLKS, &arb);
+	if (error && errno != ENOTTY)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += arb.resblks;
+
+	/*
+	 * If we counted blocks with fsmap, then dbytes includes
+	 * blocks for the AGFL and the freespace/rmap btrees.  The
+	 * filesystem treats them as "free", but since we scanned
+	 * them, we'll consider them used.
+	 */
+	sfs.f_bfree -= summary[0].agbytes >> xctx->blocklog;
+
+	/* Report on what we found. */
+	fd = (xctx->geo.datablocks - sfs.f_bfree) << xctx->blocklog;
+	fr = (xctx->geo.rtblocks - fc.freertx) << xctx->blocklog;
+	fi = sfs.f_files - sfs.f_ffree;
+	sd = summary[0].dbytes;
+	sr = summary[0].rbytes;
+	si = summary[0].inodes;
+
+	/*
+	 * Complain if the counts are off by more than 10% unless
+	 * the inaccuracy is less than 32MB worth of blocks or 100 inodes.
+	 */
+	absdiff = 1ULL << 25;
+	complain = !within_range(ctx, sd, fd, absdiff, 1, 10, _("data blocks"));
+	complain |= !within_range(ctx, sr, fr, absdiff, 1, 10, _("realtime blocks"));
+	complain |= !within_range(ctx, si, fi, 100, 1, 10, _("inodes"));
+
+	if (complain || verbose) {
+		double		d, r, i;
+		char		*du, *ru, *iu;
+
+		if (fr || sr) {
+			d = auto_space_units(fd, &du);
+			r = auto_space_units(fr, &ru);
+			i = auto_units(fi, &iu);
+			printf(
+_("%.1f%s data used;  %.1f%s realtime data used;  %.2f%s inodes used.\n"),
+					d, du, r, ru, i, iu);
+			d = auto_space_units(sd, &du);
+			r = auto_space_units(sr, &ru);
+			i = auto_units(si, &iu);
+			printf(
+_("%.1f%s data found; %.1f%s realtime data found; %.2f%s inodes found.\n"),
+					d, du, r, ru, i, iu);
+		} else {
+			d = auto_space_units(fd, &du);
+			i = auto_units(fi, &iu);
+			printf(
+_("%.1f%s data used;  %.1f%s inodes used.\n"),
+					d, du, i, iu);
+			d = auto_space_units(sd, &du);
+			i = auto_units(si, &iu);
+			printf(
+_("%.1f%s data found; %.1f%s inodes found.\n"),
+					d, du, i, iu);
+		}
+	}
+	moveon = true;
+
+out:
+	for (agno = 0; agno < groups; agno++) {
+		bitmap_free(&summary[agno].dext);
+		bitmap_free(&summary[agno].rext);
+	}
+	free(summary);
+	return moveon;
+}
+
+/* Phase 7: Preen filesystem. */
+
+static bool
+xfs_repair_fs(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	bool				moveon;
+
+	/* Repair anything broken. */
+	moveon = xfs_repair_metadata_list(ctx, &xctx->repair_list);
+	if (!moveon)
+		return false;
+
+	fstrim(ctx);
+	return true;
+}
+
+/* Shut down the filesystem. */
+static void
+xfs_shutdown_fs(
+	struct scrub_ctx		*ctx)
+{
+	int				flag;
+
+	flag = XFS_FSOP_GOING_FLAGS_LOGFLUSH;
+	if (xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GOINGDOWN, &flag))
+		str_errno(ctx, ctx->mntpoint);
+}
+
+struct scrub_ops xfs_scrub_ops = {
+	.name			= "xfs",
+	.repair_tool		= "xfs_repair",
+	.cleanup		= xfs_cleanup,
+	.scan_fs		= xfs_scan_fs,
+	.scan_inodes		= xfs_scan_inodes,
+	.check_dir		= generic_check_dir,
+	.check_inode		= generic_check_inode,
+	.scan_extents		= xfs_scan_extents,
+	.scan_xattrs		= xfs_scan_xattrs,
+	.scan_special_xattrs	= xfs_scan_special_xattrs,
+	.scan_metadata		= xfs_scan_metadata,
+	.check_summary		= xfs_check_summary,
+	.scan_blocks		= xfs_scan_blocks,
+	.read_file		= xfs_read_file,
+	.scan_fs_tree		= xfs_scan_fs_tree,
+	.shutdown_fs		= xfs_shutdown_fs,
+	.preen_fs		= xfs_repair_fs,
+	.repair_fs		= xfs_repair_fs,
+};
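
Reviewer's sketch, not part of the patch: the data-device branch of
xfs_record_block_summary() appears intended to count each physical byte
once even when GETFSMAP returns overlapping records (shared extents).
Below is a minimal standalone illustration of that accounting, assuming
records arrive sorted by physical offset.  struct extent_counter and
count_extent() are hypothetical names standing in for the
dbytes/next_phys pair in struct xfs_summary_counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the dbytes/next_phys pair. */
struct extent_counter {
	uint64_t	bytes;		/* unique bytes counted so far */
	uint64_t	next_phys;	/* first byte not yet counted */
};

/* Count only the part of [physical, physical + length) not seen before. */
static void
count_extent(
	struct extent_counter	*ec,
	uint64_t		physical,
	uint64_t		length)
{
	uint64_t		end = physical + length;

	assert(end >= physical);	/* reject wraparound */

	if (ec->next_phys >= end)
		return;			/* fully covered by earlier records */
	if (ec->next_phys > physical)
		length = end - ec->next_phys;	/* skip the covered head */

	ec->bytes += length;
	ec->next_phys = end;
}
```

Feeding it (0, 100) and then (50, 100) should leave 150 bytes counted,
since only the 50-byte tail of the second record is new.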
diff --git a/scrub/xfs_ioctl.c b/scrub/xfs_ioctl.c
new file mode 100644
index 0000000..397755b
--- /dev/null
+++ b/scrub/xfs_ioctl.c
@@ -0,0 +1,767 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "scrub.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+
+#include "xfs_ioctl.h"
+
+#define BSTATBUF_NR		1024
+#define FSMAP_NR		65536
+#define BMAP_NR			2048
+
+/* Iterate a range of inodes. */
+bool
+xfs_iterate_inodes(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	void			*fshandle,
+	uint64_t		first_ino,
+	uint64_t		last_ino,
+	xfs_inode_iter_fn	fn,
+	void			*arg)
+{
+	struct xfs_fsop_bulkreq	bulkreq;
+	struct xfs_bstat	*bstatbuf;
+	struct xfs_bstat	*p;
+	struct xfs_bstat	*endp;
+	struct xfs_handle	handle;
+	__s32			buflenout = 0;
+	bool			moveon = true;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"));
+
+	bstatbuf = calloc(BSTATBUF_NR, sizeof(struct xfs_bstat));
+	if (!bstatbuf) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	memset(&bulkreq, 0, sizeof(bulkreq));
+	bulkreq.lastip = (__u64 *)&first_ino;
+	bulkreq.icount  = BSTATBUF_NR;
+	bulkreq.ubuffer = (void *)bstatbuf;
+	bulkreq.ocount  = &buflenout;
+
+	memcpy(&handle.ha_fsid, fshandle, sizeof(handle.ha_fsid));
+	handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
+			sizeof(handle.ha_fid.fid_len);
+	handle.ha_fid.fid_pad = 0;
+	while ((error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_FSBULKSTAT,
+			&bulkreq)) == 0) {
+		if (buflenout == 0)
+			break;
+		for (p = bstatbuf, endp = bstatbuf + buflenout; p < endp; p++) {
+			if (p->bs_ino > last_ino)
+				goto out;
+
+			handle.ha_fid.fid_gen = p->bs_gen;
+			handle.ha_fid.fid_ino = p->bs_ino;
+			moveon = fn(ctx, &handle, p, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+	}
+
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	free(bstatbuf);
+	return moveon;
+}
+
+/* Does the kernel support bulkstat? */
+bool
+xfs_can_iterate_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_fsop_bulkreq	bulkreq;
+	__u64			lastino;
+	__s32			buflenout = 0;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"))
+		return false;
+
+	lastino = 0;
+	memset(&bulkreq, 0, sizeof(bulkreq));
+	bulkreq.lastip = (__u64 *)&lastino;
+	bulkreq.icount  = 0;
+	bulkreq.ubuffer = NULL;
+	bulkreq.ocount  = &buflenout;
+
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_FSBULKSTAT,
+			&bulkreq);
+	return error == -1 && errno == EINVAL;
+}
+
+/* Iterate all the extent block mappings between the two keys. */
+bool
+xfs_iterate_bmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	int			whichfork,
+	struct xfs_bmap		*key,
+	xfs_bmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsxattr		fsx;
+	struct getbmapx		*map;
+	struct getbmapx		*p;
+	struct xfs_bmap		bmap;
+	char			bmap_descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	xfs_off_t		new_off;
+	int			getxattr_type;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BMAP"));
+
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s attr"), descr);
+		break;
+	case XFS_COW_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s CoW"), descr);
+		break;
+	case XFS_DATA_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s data"), descr);
+		break;
+	default:
+		assert(0);
+	}
+
+	map = calloc(BMAP_NR, sizeof(struct getbmapx));
+	if (!map) {
+		str_errno(ctx, bmap_descr);
+		return false;
+	}
+
+	map->bmv_offset = BTOBB(key->bm_offset);
+	map->bmv_block = BTOBB(key->bm_physical);
+	if (key->bm_length == 0)
+		map->bmv_length = ULLONG_MAX;
+	else
+		map->bmv_length = BTOBB(key->bm_length);
+	map->bmv_count = BMAP_NR;
+	map->bmv_iflags = BMV_IF_NO_DMAPI_READ | BMV_IF_PREALLOC |
+			  BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		getxattr_type = XFS_IOC_FSGETXATTRA;
+		map->bmv_iflags |= BMV_IF_ATTRFORK;
+		break;
+	case XFS_COW_FORK:
+		map->bmv_iflags |= BMV_IF_COWFORK;
+		getxattr_type = XFS_IOC_FSGETXATTR;
+		break;
+	case XFS_DATA_FORK:
+		getxattr_type = XFS_IOC_FSGETXATTR;
+		break;
+	default:
+		assert(0);
+	}
+
+	error = xfsctl("", fd, getxattr_type, &fsx);
+	if (error < 0) {
+		str_errno(ctx, bmap_descr);
+		moveon = false;
+		goto out;
+	}
+
+	while ((error = xfsctl(bmap_descr, fd, XFS_IOC_GETBMAPX, map)) == 0) {
+
+		for (i = 0, p = &map[i + 1]; i < map->bmv_entries; i++, p++) {
+			bmap.bm_offset = BBTOB(p->bmv_offset);
+			bmap.bm_physical = BBTOB(p->bmv_block);
+			bmap.bm_length = BBTOB(p->bmv_length);
+			bmap.bm_flags = p->bmv_oflags;
+			moveon = fn(ctx, bmap_descr, fd, whichfork, &fsx,
+					&bmap, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (map->bmv_entries == 0)
+			break;
+		p = map + map->bmv_entries;
+		if (p->bmv_oflags & BMV_OF_LAST)
+			break;
+
+		new_off = p->bmv_offset + p->bmv_length;
+		map->bmv_length -= new_off - map->bmv_offset;
+		map->bmv_offset = new_off;
+	}
+
+	/* Pre-reflink filesystems don't know about CoW forks. */
+	if (whichfork == XFS_COW_FORK && error && errno == EINVAL)
+		error = 0;
+
+	if (error)
+		str_errno(ctx, bmap_descr);
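+out:
+	/* Hand back the resume point in bytes; *key is smaller than *map. */
+	key->bm_offset = BBTOB(map->bmv_offset);
+	key->bm_physical = BBTOB(map->bmv_block);
+	if (key->bm_length != 0)
+		key->bm_length = BBTOB(map->bmv_length);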
+	free(map);
+	return moveon;
+}
+
+/* Does the kernel support getbmapx? */
+bool
+xfs_can_iterate_bmap(
+	struct scrub_ctx	*ctx)
+{
+	struct getbmapx		bsm[2];
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BMAP"))
+		return false;
+
+	memset(bsm, 0, sizeof(struct getbmapx));
+	bsm->bmv_length = ULLONG_MAX;
+	bsm->bmv_count = 2;
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GETBMAPX, bsm);
+	return error == 0;
+}
+
+/* Iterate all the fs block mappings between the two keys. */
+bool
+xfs_iterate_fsmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*keys,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsmap_head	*head;
+	struct fsmap		*p;
+	bool			moveon = true;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_FSMAP"));
+
+	head = malloc(fsmap_sizeof(FSMAP_NR));
+	if (!head) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	memset(head, 0, sizeof(*head));
+	memcpy(head->fmh_keys, keys, sizeof(struct fsmap) * 2);
+	head->fmh_count = FSMAP_NR;
+
+	while ((error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GETFSMAP,
+				head)) == 0) {
+
+		for (i = 0, p = head->fmh_recs; i < head->fmh_entries; i++, p++) {
+			moveon = fn(ctx, descr, p, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (head->fmh_entries == 0)
+			break;
+		p = &head->fmh_recs[head->fmh_entries - 1];
+		if (p->fmr_flags & FMR_OF_LAST)
+			break;
+
+		head->fmh_keys[0] = *p;
+	}
+
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	free(head);
+	return moveon;
+}
+
+/* Does the kernel support getfsmap? */
+bool
+xfs_can_iterate_fsmap(
+	struct scrub_ctx	*ctx)
+{
+	struct fsmap_head	head;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_FSMAP"))
+		return false;
+
+	memset(&head, 0, sizeof(struct fsmap_head));
+	head.fmh_keys[1].fmr_device = UINT_MAX;
+	head.fmh_keys[1].fmr_physical = ULLONG_MAX;
+	head.fmh_keys[1].fmr_owner = ULLONG_MAX;
+	head.fmh_keys[1].fmr_offset = ULLONG_MAX;
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GETFSMAP, &head);
+	return error == 0 && (head.fmh_oflags & FMH_OF_DEV_T);
+}
+
+/* Online scrub and repair. */
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+	ST_NONE,	/* disabled */
+	ST_PERAG,	/* per-AG metadata */
+	ST_FS,		/* per-FS metadata */
+	ST_INODE,	/* per-inode metadata */
+};
+struct scrub_descr {
+	const char	*name;
+	enum scrub_type	type;
+};
+
+/* These must correspond to XFS_SCRUB_TYPE_ */
+static const struct scrub_descr scrubbers[] = {
+	{"dummy",				ST_NONE},
+	{"superblock",				ST_PERAG},
+	{"AG free header",			ST_PERAG},
+	{"AG free list",			ST_PERAG},
+	{"AG inode header",			ST_PERAG},
+	{"freesp by block btree",		ST_PERAG},
+	{"freesp by length btree",		ST_PERAG},
+	{"inode btree",				ST_PERAG},
+	{"free inode btree",			ST_PERAG},
+	{"reverse mapping btree",		ST_PERAG},
+	{"reference count btree",		ST_PERAG},
+	{"record",				ST_INODE},
+	{"data block map",			ST_INODE},
+	{"attr block map",			ST_INODE},
+	{"CoW block map",			ST_INODE},
+	{"directory entries",			ST_INODE},
+	{"extended attributes",			ST_INODE},
+	{"symbolic link",			ST_INODE},
+	{"realtime bitmap",			ST_FS},
+	{"realtime summary",			ST_FS},
+};
+
+/* Format a scrub description. */
+static void
+format_scrub_descr(
+	char				*buf,
+	size_t				buflen,
+	int				fd,
+	unsigned long long		ctl,
+	const struct scrub_descr	*sc)
+{
+	struct stat			sb;
+
+	switch (sc->type) {
+	case ST_PERAG:
+		snprintf(buf, buflen, _("AG %llu %s"), ctl, _(sc->name));
+		break;
+	case ST_INODE:
+		if (ctl == 0 && fd >= 0 && fstat(fd, &sb) == 0)
+			ctl = sb.st_ino;
+		snprintf(buf, buflen, _("inode %llu %s"), ctl, _(sc->name));
+		break;
+	case ST_FS:
+		snprintf(buf, buflen, _("%s"), _(sc->name));
+		break;
+	case ST_NONE:
+		assert(0);
+		break;
+	}
+}
+
+/* Do we need to repair something? */
+static inline bool
+xfs_scrub_needs_repair(
+	struct xfs_scrub_metadata	*sm)
+{
+	return sm->sm_flags & XFS_SCRUB_FLAG_CORRUPT;
+}
+
+/* Can we optimize something? */
+static inline bool
+xfs_scrub_needs_preen(
+	struct xfs_scrub_metadata	*sm)
+{
+	return sm->sm_flags & XFS_SCRUB_FLAG_PREEN;
+}
+
+enum check_outcome {
+	CHECK_OK,
+	CHECK_REPAIR,
+	CHECK_PREEN,
+};
+
+/* Do a read-only check of some metadata. */
+static bool
+xfs_check_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	unsigned int			type,
+	unsigned long long		ctl,
+	unsigned long			ctl2,
+	enum check_outcome		*outcome)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	const struct scrub_descr	*sc;
+	char				buf[DESCR_BUFSZ];
+	int				error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+
+	sc = &scrubbers[type];
+	*outcome = CHECK_OK;
+	switch (sc->type) {
+	case ST_PERAG:
+		meta.sm_agno = ctl;
+		break;
+	case ST_INODE:
+		meta.sm_ino = ctl;
+		meta.sm_gen = ctl2;
+		break;
+	case ST_NONE:
+	case ST_FS:
+		/* nothing */
+		break;
+	}
+	meta.sm_type = type;
+	meta.sm_flags = 0;
+	format_scrub_descr(buf, DESCR_BUFSZ, fd, ctl, sc);
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+	dbg_printf("check %s fd %d type %s ctl %llu error %d errno %d flags %xh\n",
+			buf, fd, sc->name, ctl, error, errno, meta.sm_flags);
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
+		meta.sm_flags |= XFS_SCRUB_FLAG_PREEN;
+	if (error) {
+		/* Metadata not present, just skip it. */
+		if (errno == ENOENT)
+			return true;
+
+		/* Operational error. */
+		str_errno(ctx, buf);
+		return true;
+	} else if (!xfs_scrub_needs_repair(&meta) &&
+		   !xfs_scrub_needs_preen(&meta)) {
+		/* Clean operation, no corruption or preening detected. */
+		return true;
+	} else if (xfs_scrub_needs_repair(&meta) &&
+		   ctx->mode < SCRUB_MODE_REPAIR) {
+		/* Corrupt, but we're not in repair mode. */
+		str_error(ctx, buf, _("Repairs are required."));
+		return true;
+	} else if (xfs_scrub_needs_preen(&meta) &&
+		   ctx->mode < SCRUB_MODE_PREEN) {
+		/* Preenable, but we're not in preen mode. */
+		str_info(ctx, buf, _("Optimization is possible."));
+		return true;
+	}
+
+	/* Save for later. */
+	if (xfs_scrub_needs_repair(&meta))
+		*outcome = CHECK_REPAIR;
+	else
+		*outcome = CHECK_PREEN;
+	return true;
+}
+
+/* Repair some metadata. */
+static bool
+xfs_repair_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	int				type,
+	unsigned long long		ctl,
+	unsigned long			ctl2,
+	enum check_outcome		fix)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	const struct scrub_descr	*sc;
+	char				buf[DESCR_BUFSZ];
+	int				error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	assert(fix != CHECK_OK);
+
+	sc = &scrubbers[type];
+	switch (sc->type) {
+	case ST_PERAG:
+		meta.sm_agno = ctl;
+		break;
+	case ST_INODE:
+		meta.sm_ino = ctl;
+		meta.sm_gen = ctl2;
+		break;
+	case ST_NONE:
+	case ST_FS:
+		/* nothing */
+		break;
+	}
+	meta.sm_type = type;
+	meta.sm_flags |= XFS_SCRUB_FLAG_REPAIR;
+	format_scrub_descr(buf, DESCR_BUFSZ, fd, ctl, sc);
+
+	if (fix == CHECK_REPAIR)
+		record_repair(ctx, buf, _("Attempting repair."));
+	else
+		record_preen(ctx, buf, _("Attempting optimization."));
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+	if (error) {
+		switch (errno) {
+		case ENOTTY:
+		case EOPNOTSUPP:
+			/*
+			 * If we forced repairs, don't complain if kernel
+			 * doesn't know how to fix.
+			 */
+			if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+				return true;
+			/* fall through */
+		case EINVAL:
+			/* Kernel doesn't know how to repair this. */
+			goto fix_offline;
+		case EROFS:
+			/* Read-only filesystem, can't fix. */
+			if (verbose || debug || fix == CHECK_REPAIR)
+				str_info(ctx, buf,
+_("Read-only filesystem; cannot make changes."));
+			/* fall through */
+		case ENOENT:
+			/* Metadata not present, just skip it. */
+			return true;
+		default:
+			/* Operational error. */
+			str_errno(ctx, buf);
+			return true;
+		}
+	} else if (xfs_scrub_needs_repair(&meta)) {
+fix_offline:
+		/* Corrupt, must fix offline. */
+		str_error(ctx, buf, _("Offline repair required."));
+		return true;
+	} else {
+		/* Clean operation, no corruption detected. */
+		return true;
+	}
+}
+
+struct repair_item {
+	struct list_head	list;
+	unsigned int		type;
+	unsigned long long	ctl;
+	enum check_outcome	fix;
+};
+
+/* Scrub metadata, saving corruption reports for later. */
+static bool
+xfs_scrub_metadata(
+	struct scrub_ctx		*ctx,
+	enum scrub_type			scrub_type,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	const struct scrub_descr	*sc;
+	struct repair_item		*ri;
+	enum check_outcome		fix;
+	int				type;
+	bool				moveon;
+
+	sc = scrubbers;
+	for (type = 0; type <= XFS_SCRUB_TYPE_MAX; type++, sc++) {
+		if (sc->type != scrub_type)
+			continue;
+
+		/* Check the item. */
+		moveon = xfs_check_metadata(ctx, ctx->mnt_fd, type, agno,
+				0, &fix);
+		if (!moveon)
+			return false;
+		if (fix == CHECK_OK)
+			continue;
+
+		/* Schedule this item for later repairs. */
+		ri = malloc(sizeof(struct repair_item));
+		if (!ri) {
+			str_errno(ctx, _("repair list"));
+			return false;
+		}
+		ri->type = type;
+		ri->ctl = agno;
+		ri->fix = fix;
+		list_add_tail(&ri->list, repair_list);
+	}
+
+	return true;
+}
+
+/* Scrub each AG's metadata btrees. */
+bool
+xfs_scrub_ag_metadata(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_PERAG, agno, repair_list);
+}
+
+/* Scrub whole-FS metadata. */
+bool
+xfs_scrub_fs_metadata(
+	struct scrub_ctx		*ctx,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_FS, 0, repair_list);
+}
+
+/* Repair everything on this list. */
+bool
+xfs_repair_metadata_list(
+	struct scrub_ctx		*ctx,
+	struct list_head		*repair_list)
+{
+	struct repair_item		*ri;
+	struct repair_item		*n;
+	bool				moveon;
+
+	list_for_each_entry(ri, repair_list, list) {
+		moveon = xfs_repair_metadata(ctx, ctx->mnt_fd, ri->type,
+				ri->ctl, 0, ri->fix);
+		if (!moveon)
+			break;
+	}
+
+	list_for_each_entry_safe(ri, n, repair_list, list) {
+		list_del(&ri->list);
+		free(ri);
+	}
+
+	return !xfs_scrub_excessive_errors(ctx);
+}
+
+/* Scrub inode metadata. */
+static bool
+__xfs_scrub_file(
+	struct scrub_ctx		*ctx,
+	uint64_t			ino,
+	uint32_t			gen,
+	int				fd,
+	unsigned int			type)
+{
+	const struct scrub_descr	*sc;
+	enum check_outcome		fix;
+	bool				moveon;
+
+	assert(type <= XFS_SCRUB_TYPE_MAX);
+	sc = &scrubbers[type];
+	assert(sc->type == ST_INODE);
+
+	/* Scrub the piece of metadata. */
+	moveon = xfs_check_metadata(ctx, fd, type, ino, gen, &fix);
+	if (!moveon || xfs_scrub_excessive_errors(ctx))
+		return false;
+	else if (fix == CHECK_OK)
+		return true;
+
+	/* Repair the metadata. */
+	moveon = xfs_repair_metadata(ctx, fd, type, ino, gen, fix);
+	if (!moveon)
+		return false;
+	return !xfs_scrub_excessive_errors(ctx);
+}
+
+#define XFS_SCRUB_FILE_PART(name, flagname) \
+bool \
+xfs_scrub_##name( \
+	struct scrub_ctx		*ctx, \
+	uint64_t			ino, \
+	uint32_t			gen, \
+	int				fd) \
+{ \
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_##flagname); \
+}
+XFS_SCRUB_FILE_PART(inode_fields,	INODE)
+XFS_SCRUB_FILE_PART(data_fork,		BMBTD)
+XFS_SCRUB_FILE_PART(attr_fork,		BMBTA)
+XFS_SCRUB_FILE_PART(cow_fork,		BMBTC)
+XFS_SCRUB_FILE_PART(dir,		DIR)
+XFS_SCRUB_FILE_PART(attr,		XATTR)
+XFS_SCRUB_FILE_PART(symlink,		SYMLINK)
+
+/* Test the availability of a kernel scrub command. */
+static bool
+__xfs_scrub_test(
+	struct scrub_ctx		*ctx,
+	unsigned int			type)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	struct xfs_error_injection	inject;
+	static bool			injected;
+	int				error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_KERNEL"))
+		return false;
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+		inject.fd = ctx->mnt_fd;
+#define XFS_ERRTAG_FORCE_REPAIR	28
+		inject.errtag = XFS_ERRTAG_FORCE_REPAIR;
+		error = xfsctl(ctx->mntpoint, ctx->mnt_fd,
+				XFS_IOC_ERROR_INJECTION, &inject);
+		if (error == 0)
+			injected = true;
+	}
+
+	meta.sm_type = type;
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_SCRUB_METADATA,
+			&meta);
+	return error == 0 || errno == ENOENT;
+}
+
+#define XFS_CAN_SCRUB_TEST(name, flagname) \
+bool \
+xfs_can_scrub_##name( \
+	struct scrub_ctx		*ctx) \
+{ \
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_##flagname); \
+}
+XFS_CAN_SCRUB_TEST(fs_metadata,		SB)
+XFS_CAN_SCRUB_TEST(inode,		INODE)
+XFS_CAN_SCRUB_TEST(bmap,		BMBTD)
+XFS_CAN_SCRUB_TEST(dir,			DIR)
+XFS_CAN_SCRUB_TEST(attr,		XATTR)
+XFS_CAN_SCRUB_TEST(symlink,		SYMLINK)
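
Reviewer's sketch, not part of the patch: xfs_iterate_bmap() takes its
key in bytes, talks to GETBMAPX in 512-byte basic blocks, and after each
batch moves the header record's offset/length past the last mapping
returned.  The standalone illustration below shows only that unit
handling and cursor arithmetic.  struct bmap_cursor and the sk_* helpers
are hypothetical names; the real code uses the BTOBB()/BBTOB() macros
and the bmv_offset/bmv_length fields of struct getbmapx.

```c
#include <assert.h>
#include <stdint.h>

#define SK_BBSHIFT	9	/* 512-byte basic blocks */

/* Hypothetical cursor mirroring bmv_offset/bmv_length of the header. */
struct bmap_cursor {
	int64_t		offset;		/* next offset to query, in BBs */
	int64_t		length;		/* remaining length, in BBs */
};

/* Bytes to basic blocks, rounding up. */
static int64_t
sk_btobb(int64_t bytes)
{
	return (bytes + (1LL << SK_BBSHIFT) - 1) >> SK_BBSHIFT;
}

/* Basic blocks to bytes. */
static int64_t
sk_bbtob(int64_t bbs)
{
	return bbs << SK_BBSHIFT;
}

/* Move the cursor to just past the last mapping of a batch. */
static void
bmap_cursor_advance(
	struct bmap_cursor	*cur,
	int64_t			last_offset,
	int64_t			last_length)
{
	int64_t			new_off = last_offset + last_length;

	assert(new_off >= cur->offset);	/* mappings never move backwards */

	cur->length -= new_off - cur->offset;
	cur->offset = new_off;
}
```

Any value handed back to a byte-based caller needs sk_bbtob() applied
first; copying the raw header record would mix the two units.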
diff --git a/scrub/xfs_ioctl.h b/scrub/xfs_ioctl.h
new file mode 100644
index 0000000..c9c2504
--- /dev/null
+++ b/scrub/xfs_ioctl.h
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_IOCTL_H_
+#define XFS_IOCTL_H_
+
+/* inode iteration */
+typedef bool (*xfs_inode_iter_fn)(struct scrub_ctx *ctx,
+		struct xfs_handle *handle, struct xfs_bstat *bs, void *arg);
+bool xfs_iterate_inodes(struct scrub_ctx *ctx, const char *descr,
+		void *fshandle, uint64_t first_ino, uint64_t last_ino,
+		xfs_inode_iter_fn fn, void *arg);
+bool xfs_can_iterate_inodes(struct scrub_ctx *ctx);
+
+/* inode fork block mapping */
+struct xfs_bmap {
+	uint64_t	bm_offset;	/* file offset of segment in bytes */
+	uint64_t	bm_physical;	/* physical starting byte  */
+	uint64_t	bm_length;	/* length of segment, bytes */
+	uint32_t	bm_flags;	/* output flags */
+};
+
+typedef bool (*xfs_bmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		int fd, int whichfork, struct fsxattr *fsx,
+		struct xfs_bmap *bmap, void *arg);
+
+bool xfs_iterate_bmap(struct scrub_ctx *ctx, const char *descr, int fd,
+		int whichfork, struct xfs_bmap *key, xfs_bmap_iter_fn fn,
+		void *arg);
+bool xfs_can_iterate_bmap(struct scrub_ctx *ctx);
+
+/* filesystem reverse mapping */
+typedef bool (*xfs_fsmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *fsr, void *arg);
+bool xfs_iterate_fsmap(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *keys, xfs_fsmap_iter_fn fn, void *arg);
+bool xfs_can_iterate_fsmap(struct scrub_ctx *ctx);
+
+/* Online scrub and repair. */
+
+bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct list_head *repair_list);
+bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx,
+		struct list_head *repair_list);
+bool xfs_repair_metadata_list(struct scrub_ctx *ctx,
+		struct list_head *repair_list);
+
+bool xfs_can_scrub_fs_metadata(struct scrub_ctx *ctx);
+bool xfs_can_scrub_inode(struct scrub_ctx *ctx);
+bool xfs_can_scrub_bmap(struct scrub_ctx *ctx);
+bool xfs_can_scrub_dir(struct scrub_ctx *ctx);
+bool xfs_can_scrub_attr(struct scrub_ctx *ctx);
+bool xfs_can_scrub_symlink(struct scrub_ctx *ctx);
+
+bool xfs_scrub_inode_fields(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_data_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		 int fd);
+bool xfs_scrub_attr_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		 int fd);
+bool xfs_scrub_cow_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		 int fd);
+bool xfs_scrub_dir(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen, int fd);
+bool xfs_scrub_attr(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen, int fd);
+bool xfs_scrub_symlink(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		 int fd);
+
+#endif /* XFS_IOCTL_H_ */
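
Reviewer's sketch, not part of the patch: the if/else chain at the end
of xfs_check_metadata() decides whether a finding is merely reported or
queued for later action.  The standalone illustration below reproduces
that decision as a pure function.  The sk_* names are hypothetical, and
the mode ordering (dry run < preen < repair) is an assumption inferred
from the "ctx->mode < SCRUB_MODE_..." comparisons.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering: each mode permits everything the previous one does. */
enum sk_mode { SK_MODE_DRY_RUN, SK_MODE_PREEN, SK_MODE_REPAIR };

/* Mirrors enum check_outcome. */
enum sk_outcome { SK_OK, SK_REPAIR, SK_PREEN };

/* Decide what to queue for one scrub result. */
static enum sk_outcome
sk_classify(
	bool		corrupt,
	bool		preen,
	enum sk_mode	mode)
{
	assert(mode <= SK_MODE_REPAIR);

	if (!corrupt && !preen)
		return SK_OK;		/* clean */
	if (corrupt && mode < SK_MODE_REPAIR)
		return SK_OK;		/* reported only, nothing queued */
	if (preen && mode < SK_MODE_PREEN)
		return SK_OK;		/* reported only, nothing queued */

	return corrupt ? SK_REPAIR : SK_PREEN;
}
```

Under these assumptions, an object flagged both corrupt and preenable is
only reported in preen mode, because the corruption check comes first.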



* Re: [PATCH 39/39] xfs_scrub: create online filesystem scrub program
  2016-11-05  0:28 ` [PATCH 39/39] xfs_scrub: create online filesystem scrub program Darrick J. Wong
@ 2016-11-05  5:22   ` Eryu Guan
  2016-11-08  8:37     ` Darrick J. Wong
  0 siblings, 1 reply; 42+ messages in thread
From: Eryu Guan @ 2016-11-05  5:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Fri, Nov 04, 2016 at 05:28:44PM -0700, Darrick J. Wong wrote:
> Create a filesystem scrubbing tool that walks the directory tree,
> queries every file's extents, extended attributes, and stat data.  For
> generic (non-XFS) filesystems this depends on the kernel to do nearly
> all the validation.  Optionally, we can (try to) read all the file
> data.
> 
> For XFS, we perform sequential scans of each AG's metadata, inodes,
> extent maps, and file data.  Being XFS specific, we can work with
> the in-kernel scrubbers to perform much stronger
> metadata checking and cross-referencing.  We can also take advantage
> of newer ioctls such as GETFSMAP to perform faster read verification.
> 
> In the future we will be able to take advantage of (still unwritten)
> features such as parent directory pointers to fully validate all
> metadata.  However, this tool /should/ work for most non-XFS
> filesystems such as ext4 and btrfs.
> 
> Note also that the scrub tool can shut down the filesystem if errors
> are found.  This is not a default option since scrubbing is very
> immature at this time.  It can also ask the XFS driver in the kernel
> to optimize or repair metadata, though this may not be successful.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
[snip]
> --- /dev/null
> +++ b/scrub/scrub.c
> @@ -0,0 +1,1009 @@
> +/*
> + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#include "libxfs.h"
> +#include <stdio.h>
> +#include <mntent.h>
> +#include <unistd.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <sys/time.h>
> +#include <sys/resource.h>
> +#include <sys/statvfs.h>
> +#include <sys/vfs.h>
> +#include <fcntl.h>
> +#include <dirent.h>
> +#include "disk.h"
> +#include "scrub.h"
> +#include "../../repair/threads.h"

I have trouble compiling the djwong-devel branch: the build fails to find
"../../repair/threads.h".  It seems this should be "../repair/threads.h".

Thanks,
Eryu


* Re: [PATCH 39/39] xfs_scrub: create online filesystem scrub program
  2016-11-05  5:22   ` Eryu Guan
@ 2016-11-08  8:37     ` Darrick J. Wong
  0 siblings, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2016-11-08  8:37 UTC (permalink / raw)
  To: Eryu Guan; +Cc: david, linux-xfs

On Sat, Nov 05, 2016 at 01:22:56PM +0800, Eryu Guan wrote:
> On Fri, Nov 04, 2016 at 05:28:44PM -0700, Darrick J. Wong wrote:
> > Create a filesystem scrubbing tool that walks the directory tree,
> > queries every file's extents, extended attributes, and stat data.  For
> > generic (non-XFS) filesystems this depends on the kernel to do nearly
> > all the validation.  Optionally, we can (try to) read all the file
> > data.
> > 
> > For XFS, we perform sequential scans of each AG's metadata, inodes,
> > extent maps, and file data.  Being XFS specific, we can work with
> > the in-kernel scrubbers to perform much stronger
> > metadata checking and cross-referencing.  We can also take advantage
> > of newer ioctls such as GETFSMAP to perform faster read verification.
> > 
> > In the future we will be able to take advantage of (still unwritten)
> > features such as parent directory pointers to fully validate all
> > metadata.  However, this tool /should/ work for most non-XFS
> > filesystems such as ext4 and btrfs.
> > 
> > Note also that the scrub tool can shut down the filesystem if errors
> > are found.  This is not a default option since scrubbing is very
> > immature at this time.  It can also ask the XFS driver in the kernel
> > to optimize or repair metadata, though this may not be successful.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> [snip]
> > --- /dev/null
> > +++ b/scrub/scrub.c
> > @@ -0,0 +1,1009 @@
> > +/*
> > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#include "libxfs.h"
> > +#include <stdio.h>
> > +#include <mntent.h>
> > +#include <unistd.h>
> > +#include <sys/types.h>
> > +#include <sys/stat.h>
> > +#include <sys/time.h>
> > +#include <sys/resource.h>
> > +#include <sys/statvfs.h>
> > +#include <sys/vfs.h>
> > +#include <fcntl.h>
> > +#include <dirent.h>
> > +#include "disk.h"
> > +#include "scrub.h"
> > +#include "../../repair/threads.h"
> 
> I have trouble compiling the djwong-devel branch: the build fails to find
> "../../repair/threads.h".  It seems this should be "../repair/threads.h".

Doh.  Yeah.  I'll fix that in the morning.  Sorry about that.

In a nutshell: I build the xfsprogs git tree for each arch in its own
build-$arch/ subdirectory, which contains symlinks to everything in the
parent, which is why I never tripped on this. :/
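
To illustrate why an extra directory level can hide the bad path, here is
a minimal sketch.  It assumes that build-$arch/scrub/ is a real directory
holding symlinks to the source files, and that a quoted include is resolved
lexically against the directory of the including file.  resolve_include()
and the build-x86_64 name are hypothetical; neither is part of xfsprogs.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of xfsprogs): lexically resolve a quoted
 * include path against the directory of the including file.  Only relative
 * directories are handled.  ".." components that climb above the starting
 * point are kept, so the result shows when a path escapes the tree.
 */
static void resolve_include(const char *dir, const char *inc,
		char *out, size_t outlen)
{
	char buf[256];
	char *parts[64];
	char *tok;
	int n = 0;

	if (outlen == 0)
		return;

	/* Join "dir/inc" and split it into path components. */
	snprintf(buf, sizeof(buf), "%s/%s", dir, inc);
	for (tok = strtok(buf, "/"); tok; tok = strtok(NULL, "/")) {
		if (strcmp(tok, ".") == 0)
			continue;
		if (strcmp(tok, "..") == 0 && n > 0 &&
		    strcmp(parts[n - 1], "..") != 0)
			n--;		/* ".." cancels the previous component */
		else if (n < 64)
			parts[n++] = tok;
	}

	/* Reassemble the surviving components. */
	out[0] = '\0';
	for (int i = 0; i < n; i++) {
		if (i)
			strncat(out, "/", outlen - strlen(out) - 1);
		strncat(out, parts[i], outlen - strlen(out) - 1);
	}
}
```

Under those assumptions, resolving "../../repair/threads.h" from
build-x86_64/scrub yields repair/threads.h, which exists in the top-level
tree, while resolving it from scrub yields ../repair/threads.h, one level
above the tree.  The corrected "../repair/threads.h" yields
repair/threads.h from scrub and build-x86_64/repair/threads.h from
build-x86_64/scrub, so it would work in both layouts.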

--D

> 
> Thanks,
> Eryu


