* [PATCH v6 0/9] xfsprogs: online scrub/repair support
From: Darrick J. Wong @ 2017-03-10 23:24 UTC
To: sandeen, darrick.wong; +Cc: linux-xfs
Hi all,

This is the sixth revision of a patchset that adds to XFS userland tools
support for online metadata scrubbing and repair. There aren't any
on-disk format changes, and the main overview is in the cover letter for
the kernel patches. Since v5 I have removed from xfs_scrub the ability
to scrub non-XFS filesystems (i.e. XFS is now required) and have added
the ability to run xfs_scrub periodically and in a contained environment
if systemd is active. None of the systemd functionality is active by
default.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
The xfsprogs patches should apply against 4.10.

This is an extraordinary way to eat your data. Enjoy!

Comments and questions are, as always, welcome.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
[3] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel
* [PATCH 1/9] xfs_repair: rebuild bmbt from rmapbt data
From: Darrick J. Wong @ 2017-03-10 23:24 UTC
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Use rmap records to rebuild corrupt inode forks instead of zapping
the whole inode.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
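For readers skimming the diff, the core of the rebuild can be pictured
roughly as follows. This is only a sketch under simplifying assumptions,
not the code in repair/rebuild.c: struct demo_rmap and the demo_*
functions are hypothetical stand-ins for struct xfs_rmap_irec, the slab
helpers, and the libxfs calls, and unwritten-extent handling and
transactions are left out. It shows the three steps the patch relies on:
keep only the reverse mappings owned by the inode and fork being
rebuilt, count (but do not remap) old bmbt blocks, and sort what remains
by file offset before re-adding it in pieces no longer than the maximum
extent length.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct xfs_rmap_irec. */
struct demo_rmap {
	uint64_t	owner;		/* inode number owning the blocks */
	uint64_t	offset;		/* file offset, in blocks */
	uint64_t	startblock;	/* physical start block */
	uint64_t	blockcount;	/* length, in blocks */
	bool		attr_fork;	/* mapping belongs to the attr fork */
	bool		bmbt_block;	/* block holds bmbt metadata */
};

/* Order mappings by file offset, like the slab cursor comparison. */
static int
demo_rmap_cmp(const void *a, const void *b)
{
	const struct demo_rmap	*ap = a;
	const struct demo_rmap	*bp = b;

	if (ap->offset > bp->offset)
		return 1;
	if (ap->offset < bp->offset)
		return -1;
	return 0;
}

/*
 * Copy the records for (ino, fork) into out[], sorted by file offset.
 * Old bmbt blocks are only counted, never remapped.  out[] must have
 * room for n records.  Returns the number of records kept.
 */
size_t
demo_collect(const struct demo_rmap *recs, size_t n, uint64_t ino,
	     bool attr_fork, struct demo_rmap *out, uint64_t *bmbt_blocks)
{
	size_t	i, kept = 0;

	*bmbt_blocks = 0;
	for (i = 0; i < n; i++) {
		if (recs[i].owner != ino || recs[i].attr_fork != attr_fork)
			continue;
		if (recs[i].bmbt_block) {
			*bmbt_blocks += recs[i].blockcount;
			continue;
		}
		out[kept++] = recs[i];
	}
	qsort(out, kept, sizeof(*out), demo_rmap_cmp);
	return kept;
}

/* Number of extents needed to map blockcount blocks, maxlen at a time. */
uint64_t
demo_nr_chunks(uint64_t blockcount, uint64_t maxlen)
{
	return (blockcount + maxlen - 1) / maxlen;
}
```

Sorting by file offset first means the fork is repopulated in logical
order, and the chunk count gives a feel for how many remap calls a long
reverse mapping turns into.
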
libxfs/libxfs_api_defs.h | 2
repair/Makefile | 5 -
repair/dino_chunks.c | 7 +
repair/dinode.c | 41 +++++++
repair/rebuild.c | 277 ++++++++++++++++++++++++++++++++++++++++++++++
repair/rebuild.h | 26 ++++
repair/rmap.c | 2
repair/rmap.h | 1
8 files changed, 357 insertions(+), 4 deletions(-)
create mode 100644 repair/rebuild.c
create mode 100644 repair/rebuild.h
diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index d299b7a..f01fff0 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -146,5 +146,7 @@
#define xfs_rmap_lookup_le_range libxfs_rmap_lookup_le_range
#define xfs_refc_block libxfs_refc_block
#define xfs_rmap_compare libxfs_rmap_compare
+#define xfs_bmbt_calc_size libxfs_bmbt_calc_size
+#define xfs_rmap_query_all libxfs_rmap_query_all
#endif /* __LIBXFS_API_DEFS_H__ */
diff --git a/repair/Makefile b/repair/Makefile
index b7e8fd5..9edaf18 100644
--- a/repair/Makefile
+++ b/repair/Makefile
@@ -11,14 +11,15 @@ LTCOMMAND = xfs_repair
HFILES = agheader.h attr_repair.h avl.h avl64.h bmap.h btree.h \
da_util.h dinode.h dir2.h err_protos.h globals.h incore.h protos.h \
- rt.h progress.h scan.h versions.h prefetch.h rmap.h slab.h threads.h
+ rt.h progress.h scan.h versions.h prefetch.h rmap.h slab.h threads.h \
+ rebuild.h
CFILES = agheader.c attr_repair.c avl.c avl64.c bmap.c btree.c \
da_util.c dino_chunks.c dinode.c dir2.c globals.c incore.c \
incore_bmc.c init.c incore_ext.c incore_ino.c phase1.c \
phase2.c phase3.c phase4.c phase5.c phase6.c phase7.c \
progress.c prefetch.c rmap.c rt.c sb.c scan.c slab.c threads.c \
- versions.c xfs_repair.c
+ versions.c rebuild.c xfs_repair.c
LLDLIBS = $(LIBXFS) $(LIBXLOG) $(LIBXCMD) $(LIBUUID) \
$(LIBRT) $(LIBPTHREAD) $(LIBBLKID)
diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index a3909ac..c479f2c 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -697,6 +697,13 @@ process_inode_chunk(
irec_offset += mp->m_sb.sb_inopblock * blks_per_cluster;
agbno += blks_per_cluster;
}
+ /*
+ * Allow the buffer to be re-locked by this thread in case
+ * we want to rebuild an inode fork.
+ */
+ for (bp_index = 0; bp_index < cluster_count; bp_index++)
+ if (bplist[bp_index])
+ bplist[bp_index]->b_flags |= LIBXFS_B_RECURSIVE_LOCK;
agbno = XFS_AGINO_TO_AGBNO(mp, first_irec->ino_startnum);
/*
diff --git a/repair/dinode.c b/repair/dinode.c
index d664f87..6f71c2f 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -32,6 +32,7 @@
#include "threads.h"
#include "slab.h"
#include "rmap.h"
+#include "rebuild.h"
/*
* gettext lookups for translations of strings use mutexes internally to
@@ -1915,7 +1916,9 @@ process_inode_data_fork(
xfs_ino_t lino = XFS_AGINO_TO_INO(mp, agno, ino);
int err = 0;
int nex;
+ bool try_rebuild = !rmapbt_suspect;
+retry:
/*
* extent count on disk is only valid for positive values. The kernel
* uses negative values in memory. hence if we see negative numbers
@@ -1961,8 +1964,25 @@ process_inode_data_fork(
if (err) {
do_warn(_("bad data fork in inode %" PRIu64 "\n"), lino);
if (!no_modify) {
+ if (try_rebuild) {
+ do_warn(
+_("rebuilding inode %"PRIu64" data fork\n"),
+ lino);
+ try_rebuild = false;
+ err = rebuild_bmap(mp, lino, XFS_DATA_FORK,
+ be32_to_cpu(dino->di_nextents));
+ if (!err)
+ goto retry;
+ do_warn(
+_("inode %"PRIu64" data fork rebuild failed, error %d\n"),
+ lino, err);
+ }
*dirty += clear_dinode(mp, dino, lino);
ASSERT(*dirty > 0);
+ } else if (try_rebuild) {
+ do_warn(
+_("would have tried to rebuild inode %"PRIu64" data fork or cleared the inode\n"),
+ lino);
}
return 1;
}
@@ -2026,7 +2046,9 @@ process_inode_attr_fork(
blkmap_t *ablkmap = NULL;
int repair = 0;
int err;
+ bool try_rebuild = !rmapbt_suspect;
+retry:
if (!XFS_DFORK_Q(dino)) {
*anextents = 0;
if (dino->di_aformat != XFS_DINODE_FMT_EXTENTS) {
@@ -2085,6 +2107,19 @@ process_inode_attr_fork(
do_warn(_("bad attribute fork in inode %" PRIu64), lino);
if (!no_modify) {
+ if (try_rebuild) {
+ try_rebuild = false;
+ do_warn(
+_("rebuilding inode %"PRIu64" attr fork\n"),
+ lino);
+ err = rebuild_bmap(mp, lino, XFS_ATTR_FORK,
+ be16_to_cpu(dino->di_anextents));
+ if (!err)
+ goto retry;
+ do_warn(
+_("inode %"PRIu64" attr fork rebuild failed, error %d\n"),
+ lino, err);
+ }
if (delete_attr_ok) {
do_warn(_(", clearing attr fork\n"));
*dirty += clear_dinode_attr(mp, dino, lino);
@@ -2094,7 +2129,11 @@ process_inode_attr_fork(
*dirty += clear_dinode(mp, dino, lino);
}
ASSERT(*dirty > 0);
- } else {
+ } else if (try_rebuild) {
+ do_warn(
+_("would have tried to rebuild inode %"PRIu64" attr fork or cleared it\n"),
+ lino);
+ } else {
do_warn(_(", would clear attr fork\n"));
}
diff --git a/repair/rebuild.c b/repair/rebuild.c
new file mode 100644
index 0000000..bd5d6a8
--- /dev/null
+++ b/repair/rebuild.c
@@ -0,0 +1,277 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include <libxfs.h>
+#include "btree.h"
+#include "err_protos.h"
+#include "libxlog.h"
+#include "incore.h"
+#include "globals.h"
+#include "dinode.h"
+#include "slab.h"
+#include "rmap.h"
+
+/* Borrowed routines from xfs_scrub.c */
+
+struct xfs_repair_bmap_extent {
+ struct xfs_rmap_irec rmap;
+ xfs_agnumber_t agno;
+};
+
+struct xfs_repair_bmap {
+ struct xfs_slab *extslab;
+ xfs_ino_t ino;
+ xfs_rfsblock_t bmbt_blocks;
+ int whichfork;
+};
+
+/* Record extents that belong to this inode's fork. */
+STATIC int
+xfs_repair_bmap_extent_fn(
+ struct xfs_btree_cur *cur,
+ struct xfs_rmap_irec *rec,
+ void *priv)
+{
+ struct xfs_repair_bmap *rb = priv;
+ struct xfs_repair_bmap_extent rbe;
+
+ /* Skip extents which are not owned by this inode and fork. */
+ if (rec->rm_owner != rb->ino)
+ return 0;
+ else if (rb->whichfork == XFS_DATA_FORK &&
+ (rec->rm_flags & XFS_RMAP_ATTR_FORK))
+ return 0;
+ else if (rb->whichfork == XFS_ATTR_FORK &&
+ !(rec->rm_flags & XFS_RMAP_ATTR_FORK))
+ return 0;
+ else if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
+ rb->bmbt_blocks += rec->rm_blockcount;
+ return 0;
+ }
+
+ rbe.rmap = *rec;
+ rbe.agno = cur->bc_private.a.agno;
+ return slab_add(rb->extslab, &rbe);
+}
+
+/* Compare two bmap extents. */
+static int
+xfs_repair_bmap_extent_cmp(
+ const void *a,
+ const void *b)
+{
+ const struct xfs_repair_bmap_extent *ap = a;
+ const struct xfs_repair_bmap_extent *bp = b;
+
+ if (ap->rmap.rm_offset > bp->rmap.rm_offset)
+ return 1;
+ else if (ap->rmap.rm_offset < bp->rmap.rm_offset)
+ return -1;
+ return 0;
+}
+
+/* Repair an inode fork. */
+STATIC int
+xfs_repair_bmap(
+ struct xfs_inode *ip,
+ struct xfs_trans **tpp,
+ int whichfork)
+{
+ struct xfs_repair_bmap rb = {0};
+ struct xfs_bmbt_irec bmap;
+ struct xfs_defer_ops dfops;
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_buf *agf_bp = NULL;
+ struct xfs_repair_bmap_extent *rbe;
+ struct xfs_btree_cur *cur;
+ struct xfs_slab_cursor *scur = NULL;
+ xfs_fsblock_t firstfsb;
+ xfs_agnumber_t agno;
+ xfs_extlen_t extlen;
+ int baseflags;
+ int flags;
+ int nimaps;
+ int error = 0;
+
+ ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
+
+ /* Don't know how to repair the other fork formats. */
+ if (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+ XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE)
+ return ENOTTY;
+
+ /* Only files, symlinks, and directories get to have data forks. */
+ if (whichfork == XFS_DATA_FORK && !S_ISREG(VFS_I(ip)->i_mode) &&
+ !S_ISDIR(VFS_I(ip)->i_mode) && !S_ISLNK(VFS_I(ip)->i_mode))
+ return EINVAL;
+
+ /* If we somehow have delalloc extents, forget it. */
+ if (whichfork == XFS_DATA_FORK && ip->i_delayed_blks)
+ return EBUSY;
+
+ /* We require the rmapbt to rebuild anything. */
+ if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+ return EOPNOTSUPP;
+
+ /* Don't know how to rebuild realtime data forks. */
+ if (XFS_IS_REALTIME_INODE(ip) && whichfork == XFS_DATA_FORK)
+ return EOPNOTSUPP;
+
+ /* Collect all reverse mappings for this fork's extents. */
+ init_slab(&rb.extslab, sizeof(*rbe));
+ rb.ino = ip->i_ino;
+ rb.whichfork = whichfork;
+ for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+ error = -libxfs_alloc_read_agf(mp, *tpp, agno, 0, &agf_bp);
+ if (error)
+ goto out;
+ cur = libxfs_rmapbt_init_cursor(mp, *tpp, agf_bp, agno);
+ error = -libxfs_rmap_query_all(cur, xfs_repair_bmap_extent_fn, &rb);
+ libxfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR :
+ XFS_BTREE_NOERROR);
+ if (error)
+ goto out;
+ }
+
+ /* Blow out the in-core fork and zero the on-disk fork. */
+ libxfs_trans_ijoin(*tpp, ip, 0);
+ if (XFS_IFORK_PTR(ip, whichfork) != NULL)
+ libxfs_idestroy_fork(ip, whichfork);
+ XFS_IFORK_FMT_SET(ip, whichfork, XFS_DINODE_FMT_EXTENTS);
+ XFS_IFORK_NEXT_SET(ip, whichfork, 0);
+
+ /* Reinitialize the on-disk fork. */
+ if (whichfork == XFS_DATA_FORK) {
+ memset(&ip->i_df, 0, sizeof(struct xfs_ifork));
+ ip->i_df.if_flags |= XFS_IFEXTENTS;
+ } else if (whichfork == XFS_ATTR_FORK) {
+ if (slab_count(rb.extslab) == 0)
+ ip->i_afp = NULL;
+ else {
+ ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_NOFS);
+ ip->i_afp->if_flags |= XFS_IFEXTENTS;
+ }
+ }
+ libxfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
+ error = -libxfs_trans_roll(tpp, ip);
+ if (error)
+ goto out;
+
+ baseflags = XFS_BMAPI_REMAP | XFS_BMAPI_NORMAP;
+ if (whichfork == XFS_ATTR_FORK)
+ baseflags |= XFS_BMAPI_ATTRFORK;
+
+ /* "Remap" the extents into the fork. */
+ init_slab_cursor(rb.extslab, xfs_repair_bmap_extent_cmp, &scur);
+ rbe = pop_slab_cursor(scur);
+ while (rbe != NULL) {
+ /* Form the "new" mapping... */
+ bmap.br_startblock = XFS_AGB_TO_FSB(mp, rbe->agno,
+ rbe->rmap.rm_startblock);
+ bmap.br_startoff = rbe->rmap.rm_offset;
+ flags = 0;
+ if (rbe->rmap.rm_flags & XFS_RMAP_UNWRITTEN)
+ flags = XFS_BMAPI_PREALLOC;
+ while (rbe->rmap.rm_blockcount > 0) {
+ libxfs_defer_init(&dfops, &firstfsb);
+ extlen = min(rbe->rmap.rm_blockcount, MAXEXTLEN);
+ bmap.br_blockcount = extlen;
+
+ /* Drop the block counter... */
+ ip->i_d.di_nblocks -= extlen;
+
+ /* Re-add the extent to the fork. */
+ nimaps = 1;
+ firstfsb = bmap.br_startblock;
+ error = -libxfs_bmapi_write(*tpp, ip,
+ bmap.br_startoff,
+ extlen, baseflags | flags, &firstfsb,
+ extlen, &bmap, &nimaps,
+ &dfops);
+ if (error)
+ goto out;
+
+ bmap.br_startblock += extlen;
+ bmap.br_startoff += extlen;
+ rbe->rmap.rm_blockcount -= extlen;
+ error = -libxfs_defer_finish(tpp, &dfops, ip);
+ if (error)
+ goto out;
+ /* Make sure we roll the transaction. */
+ error = -libxfs_trans_roll(tpp, ip);
+ if (error)
+ goto out;
+ }
+ rbe = pop_slab_cursor(scur);
+ }
+ free_slab_cursor(&scur);
+ free_slab(&rb.extslab);
+
+ /* Decrease nblocks to reflect the freed bmbt blocks. */
+ if (rb.bmbt_blocks) {
+ ip->i_d.di_nblocks -= rb.bmbt_blocks;
+ libxfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
+ error = -libxfs_trans_roll(tpp, ip);
+ if (error)
+ goto out;
+ }
+
+ return error;
+out:
+ if (scur)
+ free_slab_cursor(&scur);
+ if (rb.extslab)
+ free_slab(&rb.extslab);
+ return error;
+}
+
+/* Rebuild some inode's bmap. */
+int
+rebuild_bmap(
+ struct xfs_mount *mp,
+ xfs_ino_t ino,
+ int whichfork,
+ unsigned long nr_extents)
+{
+ struct xfs_inode *ip;
+ struct xfs_trans *tp;
+ unsigned long long resblks;
+ int error;
+
+ resblks = libxfs_bmbt_calc_size(mp, nr_extents);
+ error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+ resblks, 0, 0, &tp);
+ if (error)
+ return error;
+ error = -libxfs_iget(mp, NULL, ino, 0, &ip);
+ if (error)
+ goto out_trans;
+ error = xfs_repair_bmap(ip, &tp, whichfork);
+ if (error)
+ goto out_irele;
+
+ error = -libxfs_trans_commit(tp);
+ IRELE(ip);
+ return error;
+out_irele:
+ IRELE(ip);
+out_trans:
+ libxfs_trans_cancel(tp);
+ return error;
+}
diff --git a/repair/rebuild.h b/repair/rebuild.h
new file mode 100644
index 0000000..51a44ea
--- /dev/null
+++ b/repair/rebuild.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef REBUILD_H_
+#define REBUILD_H_
+
+int rebuild_bmap(struct xfs_mount *mp, xfs_ino_t ino, int whichfork,
+ unsigned long nr_extents);
+
+#endif /* REBUILD_H_ */
diff --git a/repair/rmap.c b/repair/rmap.c
index ab6e583..af37829 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -46,7 +46,7 @@ struct xfs_ag_rmap {
};
static struct xfs_ag_rmap *ag_rmaps;
-static bool rmapbt_suspect;
+bool rmapbt_suspect;
static bool refcbt_suspect;
static inline int rmap_compare(const void *a, const void *b)
diff --git a/repair/rmap.h b/repair/rmap.h
index 752ece8..c970942 100644
--- a/repair/rmap.h
+++ b/repair/rmap.h
@@ -21,6 +21,7 @@
#define RMAP_H_
extern bool collect_rmaps;
+extern bool rmapbt_suspect;
extern bool rmap_needs_work(struct xfs_mount *);
* [PATCH 2/9] xfs_db: introduce fuzz command
From: Darrick J. Wong @ 2017-03-10 23:24 UTC
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Introduce a new 'fuzz' command to write creative values into
disk structure fields.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
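To make the bit-flipping verbs concrete, here is a small standalone
sketch. It is not the code in db/fuzz.c: the demo_* names are
hypothetical, and it assumes bits are numbered from the most
significant bit of the first byte, which is how the db/bit.c helpers
appear to count. Each verb flips exactly one bit of a field described
by a bit offset and a bit length, reading the same bit it then writes.

```c
#include <stdint.h>

/* Hypothetical helpers; bit 0 is assumed to be the MSB of byte 0. */
int
demo_getbit(const uint8_t *buf, int bit)
{
	return (buf[bit / 8] >> (7 - bit % 8)) & 1;
}

void
demo_setbit(uint8_t *buf, int bit, int val)
{
	uint8_t	mask = (uint8_t)(1u << (7 - bit % 8));

	if (val)
		buf[bit / 8] |= mask;
	else
		buf[bit / 8] &= (uint8_t)~mask;
}

/* Flip one bit: read the same bit that is about to be written. */
void
demo_flipbit(uint8_t *buf, int bit)
{
	demo_setbit(buf, bit, !demo_getbit(buf, bit));
}

/* High bit of a big-endian field starting at bitoff. */
void
demo_firstbit(uint8_t *buf, int bitoff, int nbits)
{
	(void)nbits;
	demo_flipbit(buf, bitoff);
}

/* Middle bit of the field. */
void
demo_middlebit(uint8_t *buf, int bitoff, int nbits)
{
	demo_flipbit(buf, bitoff + nbits / 2);
}

/* Low bit of the field. */
void
demo_lastbit(uint8_t *buf, int bitoff, int nbits)
{
	demo_flipbit(buf, bitoff + nbits - 1);
}
```

Applying the same verb twice restores the original value, which is a
handy sanity check when comparing before and after dumps of a field.
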
db/Makefile | 3
db/bit.c | 17 +-
db/bit.h | 5 -
db/command.c | 2
db/fuzz.c | 461 +++++++++++++++++++++++++++++++++++++++++++++++++++++
db/fuzz.h | 21 ++
db/io.c | 9 +
db/io.h | 1
db/type.c | 44 ++++-
db/type.h | 1
man/man8/xfs_db.8 | 55 ++++++
11 files changed, 598 insertions(+), 21 deletions(-)
create mode 100644 db/fuzz.c
create mode 100644 db/fuzz.h
diff --git a/db/Makefile b/db/Makefile
index cdc0b99..feeacf6 100644
--- a/db/Makefile
+++ b/db/Makefile
@@ -12,7 +12,8 @@ HFILES = addr.h agf.h agfl.h agi.h attr.h attrshort.h bit.h block.h bmap.h \
dir2.h dir2sf.h dquot.h echo.h faddr.h field.h \
flist.h fprint.h frag.h freesp.h hash.h help.h init.h inode.h input.h \
io.h logformat.h malloc.h metadump.h output.h print.h quit.h sb.h \
- sig.h strvec.h text.h type.h write.h attrset.h symlink.h fsmap.h
+ sig.h strvec.h text.h type.h write.h attrset.h symlink.h fsmap.h \
+ fuzz.h
CFILES = $(HFILES:.h=.c)
LSRCFILES = xfs_admin.sh xfs_ncheck.sh xfs_metadump.sh
diff --git a/db/bit.c b/db/bit.c
index 24872bf..3fcb085 100644
--- a/db/bit.c
+++ b/db/bit.c
@@ -19,13 +19,8 @@
#include "libxfs.h"
#include "bit.h"
-#undef setbit /* defined in param.h on Linux */
-
-static int getbit(char *ptr, int bit);
-static void setbit(char *ptr, int bit, int val);
-
-static int
-getbit(
+int
+getbit_l(
char *ptr,
int bit)
{
@@ -39,8 +34,8 @@ getbit(
return (*ptr & mask) >> shift;
}
-static void
-setbit(
+void
+setbit_l(
char *ptr,
int bit,
int val)
@@ -106,7 +101,7 @@ getbitval(
for (i = 0, rval = 0LL; i < nbits; i++) {
- if (getbit(p, bit + i)) {
+ if (getbit_l(p, bit + i)) {
/* If the last bit is on and we care about sign
* bits and we don't have a full 64 bit
* container, turn all bits on between the
@@ -162,7 +157,7 @@ setbitval(
if (bitoff % NBBY || nbits % NBBY) {
for (bit = 0; bit < nbits; bit++)
- setbit(out, bit + bitoff, getbit(in, bit));
+ setbit_l(out, bit + bitoff, getbit_l(in, bit));
} else
memcpy(out + byteize(bitoff), in, byteize(nbits));
}
diff --git a/db/bit.h b/db/bit.h
index 80ba24c..4506679 100644
--- a/db/bit.h
+++ b/db/bit.h
@@ -21,9 +21,12 @@
#define bitszof(x,y) bitize(szof(x,y))
#define byteize(s) ((s) / NBBY)
#define bitoffs(s) ((s) % NBBY)
+#define byteize_up(s) (((s) + NBBY - 1) / NBBY)
#define BVUNSIGNED 0
#define BVSIGNED 1
extern __int64_t getbitval(void *obj, int bitoff, int nbits, int flags);
-extern void setbitval(void *obuf, int bitoff, int nbits, void *ibuf);
+extern void setbitval(void *obuf, int bitoff, int nbits, void *ibuf);
+extern int getbit_l(char *ptr, int bit);
+extern void setbit_l(char *ptr, int bit, int val);
diff --git a/db/command.c b/db/command.c
index 3d7cfd7..0eb4944 100644
--- a/db/command.c
+++ b/db/command.c
@@ -51,6 +51,7 @@
#include "dquot.h"
#include "fsmap.h"
#include "crc.h"
+#include "fuzz.h"
cmdinfo_t *cmdtab;
int ncmds;
@@ -146,4 +147,5 @@ init_commands(void)
type_init();
write_init();
dquot_init();
+ fuzz_init();
}
diff --git a/db/fuzz.c b/db/fuzz.c
new file mode 100644
index 0000000..061ecd1
--- /dev/null
+++ b/db/fuzz.c
@@ -0,0 +1,461 @@
+/*
+ * Copyright (c) 2000-2002,2005 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libxfs.h"
+#include <ctype.h>
+#include <time.h>
+#include "bit.h"
+#include "block.h"
+#include "command.h"
+#include "type.h"
+#include "faddr.h"
+#include "fprint.h"
+#include "field.h"
+#include "flist.h"
+#include "io.h"
+#include "init.h"
+#include "output.h"
+#include "print.h"
+#include "write.h"
+#include "malloc.h"
+
+static int fuzz_f(int argc, char **argv);
+static void fuzz_help(void);
+
+static const cmdinfo_t fuzz_cmd =
+ { "fuzz", NULL, fuzz_f, 0, -1, 0, N_("[-c] [-d] field fuzzcmd..."),
+ N_("fuzz values on disk"), fuzz_help };
+
+void
+fuzz_init(void)
+{
+ if (!expert_mode)
+ return;
+
+ add_command(&fuzz_cmd);
+ srand48(clock());
+}
+
+static void
+fuzz_help(void)
+{
+ dbprintf(_(
+"\n"
+" The 'fuzz' command fuzzes fields in any on-disk data structure. For\n"
+" block fuzzing, see the 'blocktrash' or 'write' commands."
+"\n"
+" Examples:\n"
+" Struct mode: 'fuzz core.uid zeroes' - set an inode uid field to 0.\n"
+" 'fuzz crc ones' - set a crc field to all ones.\n"
+" 'fuzz bno[11] firstbit' - set the high bit of a block array.\n"
+" 'fuzz keys[5].startblock add' - increase a btree key value.\n"
+" 'fuzz uuid random' - randomize the superblock uuid.\n"
+"\n"
+" In data mode type 'fuzz' by itself for a list of specific commands.\n\n"
+" Specifying the -c option will allow writes of invalid (corrupt) data with\n"
+" an invalid CRC. Specifying the -d option will allow writes of invalid data,\n"
+" but still recalculate the CRC so we are forced to check and detect the\n"
+" invalid data appropriately.\n\n"
+));
+
+}
+
+static int
+fuzz_f(
+ int argc,
+ char **argv)
+{
+ pfunc_t pf;
+ extern char *progname;
+ int c;
+ bool corrupt = false; /* Allow write of bad data w/ invalid CRC */
+ bool invalid_data = false; /* Allow write of bad data w/ valid CRC */
+ struct xfs_buf_ops local_ops;
+ const struct xfs_buf_ops *stashed_ops = NULL;
+
+ if (x.isreadonly & LIBXFS_ISREADONLY) {
+ dbprintf(_("%s started in read only mode, fuzzing disabled\n"),
+ progname);
+ return 0;
+ }
+
+ if (cur_typ == NULL) {
+ dbprintf(_("no current type\n"));
+ return 0;
+ }
+
+ pf = cur_typ->pfunc;
+ if (pf == NULL) {
+ dbprintf(_("no handler function for type %s, fuzz unsupported.\n"),
+ cur_typ->name);
+ return 0;
+ }
+
+ while ((c = getopt(argc, argv, "cd")) != EOF) {
+ switch (c) {
+ case 'c':
+ corrupt = true;
+ break;
+ case 'd':
+ invalid_data = true;
+ break;
+ default:
+ dbprintf(_("bad option for fuzz command\n"));
+ return 0;
+ }
+ }
+
+ if (corrupt && invalid_data) {
+ dbprintf(_("Cannot specify both -c and -d options\n"));
+ return 0;
+ }
+
+ if (invalid_data && iocur_top->typ->crc_off == TYP_F_NO_CRC_OFF &&
+ !iocur_top->ino_buf) {
+ dbprintf(_("Cannot recalculate CRCs on this type of object\n"));
+ return 0;
+ }
+
+ argc -= optind;
+ argv += optind;
+
+ /*
+ * If the buffer has no verifier or we are using standard verifier
+ * paths, then just fuzz it and return
+ */
+ if (!iocur_top->bp->b_ops ||
+ !(corrupt || invalid_data)) {
+ (*pf)(DB_FUZZ, cur_typ->fields, argc, argv);
+ return 0;
+ }
+
+
+ /* Temporarily remove write verifier to write bad data */
+ stashed_ops = iocur_top->bp->b_ops;
+ local_ops.verify_read = stashed_ops->verify_read;
+ iocur_top->bp->b_ops = &local_ops;
+
+ if (corrupt) {
+ local_ops.verify_write = xfs_dummy_verify;
+ dbprintf(_("Allowing fuzz of corrupted data and bad CRC\n"));
+ } else if (iocur_top->ino_buf) {
+ local_ops.verify_write = xfs_verify_recalc_inode_crc;
+ dbprintf(_("Allowing fuzz of corrupted inode with good CRC\n"));
+ } else { /* invalid data */
+ local_ops.verify_write = xfs_verify_recalc_crc;
+ dbprintf(_("Allowing fuzz of corrupted data with good CRC\n"));
+ }
+
+ (*pf)(DB_FUZZ, cur_typ->fields, argc, argv);
+
+ iocur_top->bp->b_ops = stashed_ops;
+
+ return 0;
+}
+
+/* Write zeroes to the field */
+static bool
+fuzz_zeroes(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ char *out = buf;
+ int bit;
+
+ if (bitoff % NBBY || nbits % NBBY) {
+ for (bit = 0; bit < nbits; bit++)
+ setbit_l(out, bit + bitoff, 0);
+ } else
+ memset(out + byteize(bitoff), 0, byteize(nbits));
+ return true;
+}
+
+/* Write ones to the field */
+static bool
+fuzz_ones(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ char *out = buf;
+ int bit;
+
+ if (bitoff % NBBY || nbits % NBBY) {
+ for (bit = 0; bit < nbits; bit++)
+ setbit_l(out, bit + bitoff, 1);
+ } else
+ memset(out + byteize(bitoff), 0xFF, byteize(nbits));
+ return true;
+}
+
+/* Flip the high bit in the (presumably big-endian) field */
+static bool
+fuzz_firstbit(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ setbit_l((char *)buf, bitoff, !getbit_l((char *)buf, bitoff));
+ return true;
+}
+
+/* Flip the low bit in the (presumably big-endian) field */
+static bool
+fuzz_lastbit(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ setbit_l((char *)buf, bitoff + nbits - 1,
+ !getbit_l((char *)buf, bitoff + nbits - 1));
+ return true;
+}
+
+/* Flip the middle bit in the (presumably big-endian) field */
+static bool
+fuzz_middlebit(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ setbit_l((char *)buf, bitoff + nbits / 2,
+ !getbit_l((char *)buf, bitoff + nbits / 2));
+ return true;
+}
+
+/* Format and shift a number into a buffer for setbitval. */
+static char *
+format_number(
+ uint64_t val,
+ __be64 *out,
+ int bit_length)
+{
+ int offset;
+ char *rbuf = (char *)out;
+
+ /*
+ * If the length of the field is not a multiple of a byte, push
+ * the bits up in the field, so the most significant field bit is
+ * the most significant bit in the byte:
+ *
+ * before:
+ * val |----|----|----|----|----|--MM|mmmm|llll|
+ * after
+ * val |----|----|----|----|----|MMmm|mmll|ll00|
+ */
+ offset = bit_length % NBBY;
+ if (offset)
+ val <<= (NBBY - offset);
+
+ /*
+ * convert to big endian and copy into the array
+ * rbuf |----|----|----|----|----|MMmm|mmll|ll00|
+ */
+ *out = cpu_to_be64(val);
+
+ /*
+ * Align the array to point to the field in the array.
+ * rbuf = |MMmm|mmll|ll00|
+ */
+ offset = sizeof(__be64) - 1 - ((bit_length - 1) / NBBY);
+ return rbuf + offset;
+}
+
+/* Increase the value by some small prime number. */
+static bool
+fuzz_add(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ uint64_t val;
+ __be64 out;
+ char *b;
+
+ if (nbits > 64)
+ return false;
+
+ val = getbitval(buf, bitoff, nbits, BVUNSIGNED);
+ val += (nbits > 8 ? 2017 : 137);
+ b = format_number(val, &out, nbits);
+ setbitval(buf, bitoff, nbits, b);
+
+ return true;
+}
+
+/* Decrease the value by some small prime number. */
+static bool
+fuzz_sub(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ uint64_t val;
+ __be64 out;
+ char *b;
+
+ if (nbits > 64)
+ return false;
+
+ val = getbitval(buf, bitoff, nbits, BVUNSIGNED);
+ val -= (nbits > 8 ? 2017 : 137);
+ b = format_number(val, &out, nbits);
+ setbitval(buf, bitoff, nbits, b);
+
+ return true;
+}
+
+/* Randomize the field. */
+static bool
+fuzz_random(
+ void *buf,
+ int bitoff,
+ int nbits)
+{
+ int i, bytes;
+ char *b, *rbuf;
+
+ bytes = byteize_up(nbits);
+ rbuf = b = malloc(bytes);
+ if (!b) {
+ perror("fuzz_random");
+ return false;
+ }
+
+ for (i = 0; i < bytes; i++)
+ *b++ = (char)lrand48();
+
+ setbitval(buf, bitoff, nbits, rbuf);
+ free(rbuf);
+
+ return true;
+}
+
+struct fuzzcmd {
+ const char *verb;
+ bool (*fn)(void *buf, int bitoff, int nbits);
+};
+
+/* Keep these verbs in sync with enum fuzzcmds. */
+static struct fuzzcmd fuzzverbs[] = {
+ {"zeroes", fuzz_zeroes},
+ {"ones", fuzz_ones},
+ {"firstbit", fuzz_firstbit},
+ {"middlebit", fuzz_middlebit},
+ {"lastbit", fuzz_lastbit},
+ {"add", fuzz_add},
+ {"sub", fuzz_sub},
+ {"random", fuzz_random},
+ {NULL, NULL},
+};
+
+/* ARGSUSED */
+void
+fuzz_struct(
+ const field_t *fields,
+ int argc,
+ char **argv)
+{
+ const ftattr_t *fa;
+ flist_t *fl;
+ flist_t *sfl;
+ int bit_length;
+ struct fuzzcmd *fc;
+ bool success;
+ int parentoffset;
+
+ if (argc != 2) {
+ dbprintf(_("Usage: fuzz fieldname verb\n"));
+ dbprintf("Verbs: %s", fuzzverbs->verb);
+ for (fc = fuzzverbs + 1; fc->verb != NULL; fc++)
+ dbprintf(", %s", fc->verb);
+ dbprintf(".\n");
+ return;
+ }
+
+ fl = flist_scan(argv[0]);
+ if (!fl) {
+ dbprintf(_("unable to parse '%s'.\n"), argv[0]);
+ return;
+ }
+
+ /* Find our fuzz verb */
+ for (fc = fuzzverbs; fc->verb != NULL; fc++)
+ if (!strcmp(fc->verb, argv[1]))
+ break;
+ if (fc->fn == NULL) {
+ dbprintf(_("Unknown fuzz command '%s'.\n"), argv[1]);
+ return;
+ }
+
+ /* if we're a root field type, go down 1 layer to get field list */
+ if (fields->name[0] == '\0') {
+ fa = &ftattrtab[fields->ftyp];
+ ASSERT(fa->ftyp == fields->ftyp);
+ fields = fa->subfld;
+ }
+
+ /* run down the field list and set offsets into the data */
+ if (!flist_parse(fields, fl, iocur_top->data, 0)) {
+ flist_free(fl);
+ dbprintf(_("parsing error\n"));
+ return;
+ }
+
+ sfl = fl;
+ parentoffset = 0;
+ while (sfl->child) {
+ parentoffset = sfl->offset;
+ sfl = sfl->child;
+ }
+
+ /*
+ * For structures, fsize * fcount tells us the size of the region we are
+ * modifying, which is usually a single structure member and is pointed
+ * to by the last child in the list.
+ *
+ * However, if the base structure is an array and we have a direct index
+ * into the array (e.g. fuzz bno[5]) then we are returned a single
+ * flist object with the offset pointing directly at the location we
+ * need to modify. The length of the object we are modifying is then
+ * determined by the size of the individual array entry (fsize) and the
+ * indexes defined in the object, not the overall size of the array
+ * (which is what fcount returns).
+ */
+ bit_length = fsize(sfl->fld, iocur_top->data, parentoffset, 0);
+ if (sfl->fld->flags & FLD_ARRAY)
+ bit_length *= sfl->high - sfl->low + 1;
+ else
+ bit_length *= fcount(sfl->fld, iocur_top->data, parentoffset);
+
+ /* Fuzz the value */
+ success = fc->fn(iocur_top->data, sfl->offset, bit_length);
+ if (!success) {
+ dbprintf(_("unable to fuzz field '%s'\n"), argv[0]);
+ flist_free(fl);
+ return;
+ }
+
+ /* Write the fuzzed value back */
+ write_cur();
+
+ flist_print(fl);
+ print_flist(fl);
+ flist_free(fl);
+}
diff --git a/db/fuzz.h b/db/fuzz.h
new file mode 100644
index 0000000..c203eb5
--- /dev/null
+++ b/db/fuzz.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+extern void fuzz_init(void);
+extern void fuzz_struct(const field_t *fields, int argc, char **argv);
diff --git a/db/io.c b/db/io.c
index f398195..1f316d8 100644
--- a/db/io.c
+++ b/db/io.c
@@ -465,6 +465,15 @@ xfs_dummy_verify(
}
void
+xfs_verify_recalc_inode_crc(
+ struct xfs_buf *bp)
+{
+ ASSERT(iocur_top->ino_buf);
+ libxfs_dinode_calc_crc(mp, iocur_top->data);
+ iocur_top->ino_crc_ok = 1;
+}
+
+void
xfs_verify_recalc_crc(
struct xfs_buf *bp)
{
diff --git a/db/io.h b/db/io.h
index c69e9ce..12d96c2 100644
--- a/db/io.h
+++ b/db/io.h
@@ -64,6 +64,7 @@ extern void set_cur(const struct typ *t, __int64_t d, int c, int ring_add,
extern void ring_add(void);
extern void set_iocur_type(const struct typ *t);
extern void xfs_dummy_verify(struct xfs_buf *bp);
+extern void xfs_verify_recalc_inode_crc(struct xfs_buf *bp);
extern void xfs_verify_recalc_crc(struct xfs_buf *bp);
/*
diff --git a/db/type.c b/db/type.c
index 10fa54e..adab10a 100644
--- a/db/type.c
+++ b/db/type.c
@@ -39,6 +39,7 @@
#include "dir2.h"
#include "text.h"
#include "symlink.h"
+#include "fuzz.h"
static const typ_t *findtyp(char *name);
static int type_f(int argc, char **argv);
@@ -254,10 +255,17 @@ handle_struct(
int argc,
char **argv)
{
- if (action == DB_WRITE)
+ switch (action) {
+ case DB_FUZZ:
+ fuzz_struct(fields, argc, argv);
+ break;
+ case DB_WRITE:
write_struct(fields, argc, argv);
- else
+ break;
+ case DB_READ:
print_struct(fields, argc, argv);
+ break;
+ }
}
void
@@ -267,10 +275,17 @@ handle_string(
int argc,
char **argv)
{
- if (action == DB_WRITE)
+ switch (action) {
+ case DB_WRITE:
write_string(fields, argc, argv);
- else
+ break;
+ case DB_READ:
print_string(fields, argc, argv);
+ break;
+ case DB_FUZZ:
+ dbprintf(_("string fuzzing not supported.\n"));
+ break;
+ }
}
void
@@ -280,10 +295,17 @@ handle_block(
int argc,
char **argv)
{
- if (action == DB_WRITE)
+ switch (action) {
+ case DB_WRITE:
write_block(fields, argc, argv);
- else
+ break;
+ case DB_READ:
print_block(fields, argc, argv);
+ break;
+ case DB_FUZZ:
+ dbprintf(_("use 'blocktrash' or 'write' to fuzz a block.\n"));
+ break;
+ }
}
void
@@ -293,6 +315,14 @@ handle_text(
int argc,
char **argv)
{
- if (action != DB_WRITE)
+ switch (action) {
+ case DB_FUZZ:
+ /* fall through */
+ case DB_WRITE:
+ dbprintf(_("text writing/fuzzing not supported.\n"));
+ break;
+ case DB_READ:
print_text(fields, argc, argv);
+ break;
+ }
}
diff --git a/db/type.h b/db/type.h
index 87ff107..a50d705 100644
--- a/db/type.h
+++ b/db/type.h
@@ -30,6 +30,7 @@ typedef enum typnm
TYP_TEXT, TYP_FINOBT, TYP_NONE
} typnm_t;
+#define DB_FUZZ 2
#define DB_WRITE 1
#define DB_READ 0
diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
index 460d89d..55e0629 100644
--- a/man/man8/xfs_db.8
+++ b/man/man8/xfs_db.8
@@ -594,6 +594,55 @@ in units of 512-byte blocks, no matter what the filesystem's block size is.
.BI "The optional " start " and " end " arguments can be used to constrain
the output to a particular range of disk blocks.
.TP
+.BI "fuzz [\-c] [\-d] " "field action"
+Write garbage into a specific structure field on disk.
+Expert mode must be enabled to use this command.
+The operation happens immediately; there is no buffering.
+.IP
+The fuzz command can take the following
+.IR action "s"
+against a field:
+.RS 1.0i
+.TP 0.4i
+.B zeroes
+Clears all bits in the field.
+.TP 0.4i
+.B ones
+Sets all bits in the field.
+.TP 0.4i
+.B firstbit
+Flips the first bit in the field.
+For a scalar value, this is the highest bit.
+.TP 0.4i
+.B middlebit
+Flips the middle bit in the field.
+.TP 0.4i
+.B lastbit
+Flips the last bit in the field.
+For a scalar value, this is the lowest bit.
+.TP 0.4i
+.B add
+Adds a small value to a scalar field.
+.TP 0.4i
+.B sub
+Subtracts a small value from a scalar field.
+.TP 0.4i
+.B random
+Randomizes the contents of the field.
+.RE
+.IP
+The following switches affect the write behavior:
+.RS 1.0i
+.TP 0.4i
+.B \-c
+Skip write verifiers and CRC recalculation; allows invalid data to be written
+to disk.
+.TP 0.4i
+.B \-d
+Skip write verifiers but perform CRC recalculation; allows invalid data to be
+written to disk to test detection of invalid data.
+.RE
+.TP
.BI hash " string
Prints the hash value of
.I string
@@ -755,7 +804,7 @@ and
bits respectively, and their string equivalent reported
(but no modifications are made).
.TP
-.BI "write [\-c] [" "field value" "] ..."
+.BI "write [\-c] [\-d] [" "field value" "] ..."
Write a value to disk.
Specific fields can be set in structures (struct mode),
or a block can be set to data values (data mode),
@@ -778,6 +827,10 @@ with no arguments gives more information on the allowed commands.
.B \-c
Skip write verifiers and CRC recalculation; allows invalid data to be written
to disk.
+.TP 0.4i
+.B \-d
+Skip write verifiers but perform CRC recalculation; allows invalid data to be
+written to disk to test detection of invalid data.
.RE
.SH TYPES
This section gives the fields in each structure type and their meanings.
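
As a rough illustration of the firstbit, middlebit, and lastbit actions documented in the xfs_db.8 hunk above, the sketch below flips a single bit inside a field described by a bit offset and a bit length. The helper names (flip_bit, sketch_firstbit, and so on) are invented for this note and are not the functions in db/fuzz.c. The sketch also assumes that bit 0 is the most significant bit of byte 0, which matches the man page's statement that the first bit of a scalar is its highest bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of xfs_db: flip one bit in buf.
 * Bit 0 is assumed to be the most significant bit of buf[0].
 */
static void
flip_bit(
	uint8_t		*buf,
	size_t		bitoff)
{
	buf[bitoff / 8] ^= (uint8_t)(0x80u >> (bitoff % 8));
}

/* Flip the first (highest-order) bit of the field. */
static int
sketch_firstbit(
	uint8_t		*buf,
	size_t		bit_offset,
	size_t		bit_length)
{
	assert(buf != NULL);
	if (bit_length == 0)
		return 0;
	flip_bit(buf, bit_offset);
	return 1;
}

/* Flip the bit halfway through the field. */
static int
sketch_middlebit(
	uint8_t		*buf,
	size_t		bit_offset,
	size_t		bit_length)
{
	assert(buf != NULL);
	if (bit_length == 0)
		return 0;
	flip_bit(buf, bit_offset + bit_length / 2);
	return 1;
}

/* Flip the last (lowest-order) bit of the field. */
static int
sketch_lastbit(
	uint8_t		*buf,
	size_t		bit_offset,
	size_t		bit_length)
{
	assert(buf != NULL);
	if (bit_length == 0)
		return 0;
	flip_bit(buf, bit_offset + bit_length - 1);
	return 1;
}
```

Each helper returns nonzero on success, in the same spirit as the fc->fn() callback whose result fuzz_struct() checks before calling write_cur().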
* [PATCH 3/9] xfs_db: print attribute remote value blocks
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Teach xfs_db how to print the contents of xattr remote value blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
db/attr.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
db/attr.h | 1 +
db/field.c | 3 +++
db/field.h | 1 +
4 files changed, 64 insertions(+)
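
For readers skimming the diff below, here is a simplified sketch of the clamping rule that attr3_remote_count() applies when deciding how many value bytes to print from a remote block. It is only an approximation: the struct is a stand-in with fields already in host byte order, the magic constant is a placeholder, and the header size and block size are passed in as parameters rather than read from sizeof() and the superblock. The 64-bit widening in the comparison is an addition of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder magic for this sketch only. */
#define RMT_MAGIC_SKETCH	0x5841524dU

/* Hypothetical stand-in for the remote value header (host endian). */
struct rmt_hdr_sketch {
	uint32_t	magic;
	uint32_t	bytes;	/* value bytes this block claims to hold */
};

/*
 * Return how many value bytes to display: zero if the block does not
 * look like a remote value block, otherwise the claimed byte count
 * clamped to what fits in the block after the header.
 */
static uint32_t
remote_data_count(
	const struct rmt_hdr_sketch	*hdr,
	uint32_t			hdr_size,
	uint32_t			blocksize)
{
	assert(hdr_size <= blocksize);

	if (hdr->magic != RMT_MAGIC_SKETCH)
		return 0;
	/* Widen to 64 bits so a corrupt, huge bytes value cannot wrap. */
	if ((uint64_t)hdr->bytes + hdr_size > blocksize)
		return blocksize - hdr_size;
	return hdr->bytes;
}
```

The clamp matters because xfs_db is routinely pointed at corrupt or fuzzed blocks, where the byte count in the header cannot be trusted.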
diff --git a/db/attr.c b/db/attr.c
index e26ac67..0fffbc2 100644
--- a/db/attr.c
+++ b/db/attr.c
@@ -41,6 +41,9 @@ static int attr_leaf_nvlist_offset(void *obj, int startoff, int idx);
static int attr_node_btree_count(void *obj, int startoff);
static int attr_node_hdr_count(void *obj, int startoff);
+static int attr_remote_count(void *obj, int startoff);
+static int attr3_remote_count(void *obj, int startoff);
+
const field_t attr_hfld[] = {
{ "", FLDT_ATTR, OI(0), C1, 0, TYP_NONE },
{ NULL }
@@ -53,6 +56,7 @@ const field_t attr_flds[] = {
FLD_COUNT, TYP_NONE },
{ "hdr", FLDT_ATTR_NODE_HDR, OI(NOFF(hdr)), attr_node_hdr_count,
FLD_COUNT, TYP_NONE },
+ { "data", FLDT_CHARNS, OI(0), attr_remote_count, FLD_COUNT, TYP_NONE },
{ "entries", FLDT_ATTR_LEAF_ENTRY, OI(LOFF(entries)),
attr_leaf_entries_count, FLD_ARRAY|FLD_COUNT, TYP_NONE },
{ "btree", FLDT_ATTR_NODE_ENTRY, OI(NOFF(__btree)), attr_node_btree_count,
@@ -197,6 +201,33 @@ attr3_leaf_hdr_count(
return be16_to_cpu(leaf->hdr.info.hdr.magic) == XFS_ATTR3_LEAF_MAGIC;
}
+static int
+attr_remote_count(
+ void *obj,
+ int startoff)
+{
+ if (attr_leaf_hdr_count(obj, startoff) == 0 &&
+ attr_node_hdr_count(obj, startoff) == 0)
+ return mp->m_sb.sb_blocksize;
+ return 0;
+}
+
+static int
+attr3_remote_count(
+ void *obj,
+ int startoff)
+{
+ struct xfs_attr3_rmt_hdr *hdr = obj;
+
+ ASSERT(startoff == 0);
+
+ if (hdr->rm_magic != cpu_to_be32(XFS_ATTR3_RMT_MAGIC))
+ return 0;
+ if (be32_to_cpu(hdr->rm_bytes) + sizeof(*hdr) > mp->m_sb.sb_blocksize)
+ return mp->m_sb.sb_blocksize - sizeof(*hdr);
+ return be32_to_cpu(hdr->rm_bytes);
+}
+
typedef int (*attr_leaf_entry_walk_f)(struct xfs_attr_leafblock *,
struct xfs_attr_leaf_entry *, int);
static int
@@ -477,6 +508,17 @@ attr3_node_hdr_count(
return be16_to_cpu(node->hdr.info.hdr.magic) == XFS_DA3_NODE_MAGIC;
}
+static int
+attr3_remote_hdr_count(
+ void *obj,
+ int startoff)
+{
+ struct xfs_attr3_rmt_hdr *node = obj;
+
+ ASSERT(startoff == 0);
+ return be32_to_cpu(node->rm_magic) == XFS_ATTR3_RMT_MAGIC;
+}
+
int
attr_size(
void *obj,
@@ -501,6 +543,8 @@ const field_t attr3_flds[] = {
FLD_COUNT, TYP_NONE },
{ "hdr", FLDT_DA3_NODE_HDR, OI(N3OFF(hdr)), attr3_node_hdr_count,
FLD_COUNT, TYP_NONE },
+ { "hdr", FLDT_ATTR3_REMOTE_HDR, OI(0), attr3_remote_hdr_count,
+ FLD_COUNT, TYP_NONE },
{ "entries", FLDT_ATTR_LEAF_ENTRY, OI(L3OFF(entries)),
attr3_leaf_entries_count, FLD_ARRAY|FLD_COUNT, TYP_NONE },
{ "btree", FLDT_ATTR_NODE_ENTRY, OI(N3OFF(__btree)),
@@ -523,6 +567,21 @@ const field_t attr3_leaf_hdr_flds[] = {
{ NULL }
};
+#define RM3OFF(f) bitize(offsetof(struct xfs_attr3_rmt_hdr, rm_ ## f))
+const struct field attr3_remote_crc_flds[] = {
+ { "magic", FLDT_UINT32X, OI(RM3OFF(magic)), C1, 0, TYP_NONE },
+ { "offset", FLDT_UINT32D, OI(RM3OFF(offset)), C1, 0, TYP_NONE },
+ { "bytes", FLDT_UINT32D, OI(RM3OFF(bytes)), C1, 0, TYP_NONE },
+ { "crc", FLDT_CRC, OI(RM3OFF(crc)), C1, 0, TYP_NONE },
+ { "uuid", FLDT_UUID, OI(RM3OFF(uuid)), C1, 0, TYP_NONE },
+ { "owner", FLDT_INO, OI(RM3OFF(owner)), C1, 0, TYP_NONE },
+ { "bno", FLDT_DFSBNO, OI(RM3OFF(blkno)), C1, 0, TYP_BMAPBTD },
+ { "lsn", FLDT_UINT64X, OI(RM3OFF(lsn)), C1, 0, TYP_NONE },
+ { "data", FLDT_CHARNS, OI(bitize(sizeof(struct xfs_attr3_rmt_hdr))),
+ attr3_remote_count, FLD_COUNT, TYP_NONE },
+ { NULL }
+};
+
/*
* Special read verifier for attribute buffers. Detect the magic number
* appropriately and set the correct verifier and call it.
diff --git a/db/attr.h b/db/attr.h
index bc3431f..d7bb579 100644
--- a/db/attr.h
+++ b/db/attr.h
@@ -30,6 +30,7 @@ extern const field_t attr3_flds[];
extern const field_t attr3_hfld[];
extern const field_t attr3_leaf_hdr_flds[];
extern const field_t attr3_node_hdr_flds[];
+extern const field_t attr3_remote_crc_flds[];
extern int attr_leaf_name_size(void *obj, int startoff, int idx);
extern int attr_size(void *obj, int startoff, int idx);
diff --git a/db/field.c b/db/field.c
index 1968dd5..e8bbbe3 100644
--- a/db/field.c
+++ b/db/field.c
@@ -97,6 +97,9 @@ const ftattr_t ftattrtab[] = {
{ FLDT_ATTR3_NODE_HDR, "attr3_node_hdr", NULL,
(char *)da3_node_hdr_flds, SI(bitsz(struct xfs_da3_node_hdr)),
0, NULL, da3_node_hdr_flds },
+ { FLDT_ATTR3_REMOTE_HDR, "attr3_remote_hdr", NULL,
+ (char *)attr3_remote_crc_flds, attr_size, FTARG_SIZE, NULL,
+ attr3_remote_crc_flds },
{ FLDT_BMAPBTA, "bmapbta", NULL, (char *)bmapbta_flds, btblock_size,
FTARG_SIZE, NULL, bmapbta_flds },
diff --git a/db/field.h b/db/field.h
index 53616f1..e5a943b 100644
--- a/db/field.h
+++ b/db/field.h
@@ -46,6 +46,7 @@ typedef enum fldt {
FLDT_ATTR3,
FLDT_ATTR3_LEAF_HDR,
FLDT_ATTR3_NODE_HDR,
+ FLDT_ATTR3_REMOTE_HDR,
FLDT_BMAPBTA,
FLDT_BMAPBTA_CRC,
* [PATCH 4/9] xfs_db: write / fuzz bad values into dir/attr blocks with good CRCs
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Extend typ_t to (optionally) store a pointer to a function to calculate
the CRC of the block, provide functions to do this for the dir3 and
attr3 types, and then wire up the fuzz and write commands so that we can
effectively fuzz directory and extended attribute block fields.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
db/attr.c | 32 ++++++++++++++++++++++++++++++++
db/attr.h | 1 +
db/dir2.c | 37 +++++++++++++++++++++++++++++++++++++
db/dir2.h | 1 +
db/fuzz.c | 3 +++
db/type.c | 8 ++++----
db/type.h | 2 ++
db/write.c | 3 +++
8 files changed, 83 insertions(+), 4 deletions(-)
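
The selection logic this patch adds to write_f() can be sketched in isolation as follows. This is a reduced model, not the xfs_db code: buf_sketch, typ_sketch, and the three verifier functions are invented names, and each verifier only records which path was taken instead of touching a real buffer. Only the sentinel idea (a reserved crc_off value meaning "call set_crc instead of using a fixed offset") is taken from the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer: just records which verifier ran. */
struct buf_sketch {
	int		which;
};

typedef void (*verify_fn)(struct buf_sketch *);

/* Sentinel values in the style of TYP_F_NO_CRC_OFF / TYP_F_CRC_FUNC. */
#define NO_CRC_OFF_SKETCH	(-1UL)
#define CRC_FUNC_SKETCH		(-2UL)

struct typ_sketch {
	const char	*name;
	unsigned long	crc_off;	/* fixed CRC offset or a sentinel */
	verify_fn	set_crc;	/* used only with CRC_FUNC_SKETCH */
};

static void dummy_verify(struct buf_sketch *bp) { bp->which = 0; }
static void recalc_crc(struct buf_sketch *bp) { bp->which = 1; }
static void dir3_set_crc_sketch(struct buf_sketch *bp) { bp->which = 2; }

/*
 * Choose the write verifier: skip everything when the caller asked for
 * a bad CRC, use the per-type function when the type supplies one, and
 * otherwise fall back to the fixed-offset recalculation.
 */
static verify_fn
pick_write_verifier(
	const struct typ_sketch	*typ,
	int			corrupt)
{
	if (corrupt)
		return dummy_verify;
	if (typ->crc_off == CRC_FUNC_SKETCH) {
		assert(typ->set_crc != NULL);
		return typ->set_crc;
	}
	return recalc_crc;
}
```

A per-type function is needed for dir3 and attr3 because one type covers several block formats whose CRC fields sit at different offsets, so a single crc_off value cannot describe them.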
diff --git a/db/attr.c b/db/attr.c
index 0fffbc2..5a97925 100644
--- a/db/attr.c
+++ b/db/attr.c
@@ -582,6 +582,38 @@ const struct field attr3_remote_crc_flds[] = {
{ NULL }
};
+/* Set the CRC. */
+void
+xfs_attr3_set_crc(
+ struct xfs_buf *bp)
+{
+ __be32 magic32;
+ __be16 magic16;
+
+ magic32 = *(__be32 *)bp->b_addr;
+ magic16 = ((struct xfs_da_blkinfo *)bp->b_addr)->magic;
+
+ switch (magic16) {
+ case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC):
+ xfs_buf_update_cksum(bp, XFS_ATTR3_LEAF_CRC_OFF);
+ return;
+ case cpu_to_be16(XFS_DA3_NODE_MAGIC):
+ xfs_buf_update_cksum(bp, XFS_DA3_NODE_CRC_OFF);
+ return;
+ default:
+ break;
+ }
+
+ switch (magic32) {
+ case cpu_to_be32(XFS_ATTR3_RMT_MAGIC):
+ xfs_buf_update_cksum(bp, XFS_ATTR3_RMT_CRC_OFF);
+ return;
+ default:
+ dbprintf(_("Unknown attribute buffer type!\n"));
+ break;
+ }
+}
+
/*
* Special read verifier for attribute buffers. Detect the magic number
* appropriately and set the correct verifier and call it.
diff --git a/db/attr.h b/db/attr.h
index d7bb579..9ea7429 100644
--- a/db/attr.h
+++ b/db/attr.h
@@ -34,5 +34,6 @@ extern const field_t attr3_remote_crc_flds[];
extern int attr_leaf_name_size(void *obj, int startoff, int idx);
extern int attr_size(void *obj, int startoff, int idx);
+extern void xfs_attr3_set_crc(struct xfs_buf *bp);
extern const struct xfs_buf_ops xfs_attr3_db_buf_ops;
diff --git a/db/dir2.c b/db/dir2.c
index 533f705..3e21a7b 100644
--- a/db/dir2.c
+++ b/db/dir2.c
@@ -981,6 +981,43 @@ const field_t da3_node_hdr_flds[] = {
{ NULL }
};
+/* Set the CRC. */
+void
+xfs_dir3_set_crc(
+ struct xfs_buf *bp)
+{
+ __be32 magic32;
+ __be16 magic16;
+
+ magic32 = *(__be32 *)bp->b_addr;
+ magic16 = ((struct xfs_da_blkinfo *)bp->b_addr)->magic;
+
+ switch (magic32) {
+ case cpu_to_be32(XFS_DIR3_BLOCK_MAGIC):
+ case cpu_to_be32(XFS_DIR3_DATA_MAGIC):
+ xfs_buf_update_cksum(bp, XFS_DIR3_DATA_CRC_OFF);
+ return;
+ case cpu_to_be32(XFS_DIR3_FREE_MAGIC):
+ xfs_buf_update_cksum(bp, XFS_DIR3_FREE_CRC_OFF);
+ return;
+ default:
+ break;
+ }
+
+ switch (magic16) {
+ case cpu_to_be16(XFS_DIR3_LEAF1_MAGIC):
+ case cpu_to_be16(XFS_DIR3_LEAFN_MAGIC):
+ xfs_buf_update_cksum(bp, XFS_DIR3_LEAF_CRC_OFF);
+ return;
+ case cpu_to_be16(XFS_DA3_NODE_MAGIC):
+ xfs_buf_update_cksum(bp, XFS_DA3_NODE_CRC_OFF);
+ return;
+ default:
+ dbprintf(_("Unknown directory buffer type! %x %x\n"), magic32, magic16);
+ break;
+ }
+}
+
/*
* Special read verifier for directory buffers. Detect the magic number
* appropriately and set the correct verifier and call it.
diff --git a/db/dir2.h b/db/dir2.h
index 0c2a62e..1b87cd2 100644
--- a/db/dir2.h
+++ b/db/dir2.h
@@ -60,5 +60,6 @@ static inline uint8_t *xfs_dir2_sf_inumberp(xfs_dir2_sf_entry_t *sfep)
extern int dir2_data_union_size(void *obj, int startoff, int idx);
extern int dir2_size(void *obj, int startoff, int idx);
+extern void xfs_dir3_set_crc(struct xfs_buf *bp);
extern const struct xfs_buf_ops xfs_dir3_db_buf_ops;
diff --git a/db/fuzz.c b/db/fuzz.c
index 061ecd1..f294331 100644
--- a/db/fuzz.c
+++ b/db/fuzz.c
@@ -156,6 +156,9 @@ fuzz_f(
} else if (iocur_top->ino_buf) {
local_ops.verify_write = xfs_verify_recalc_inode_crc;
dbprintf(_("Allowing fuzz of corrupted inode with good CRC\n"));
+ } else if (iocur_top->typ->crc_off == TYP_F_CRC_FUNC) {
+ local_ops.verify_write = iocur_top->typ->set_crc;
+ dbprintf(_("Allowing fuzz of corrupted data with good CRC\n"));
} else { /* invalid data */
local_ops.verify_write = xfs_verify_recalc_crc;
dbprintf(_("Allowing fuzz of corrupted data with good CRC\n"));
diff --git a/db/type.c b/db/type.c
index adab10a..740adc0 100644
--- a/db/type.c
+++ b/db/type.c
@@ -88,7 +88,7 @@ static const typ_t __typtab_crc[] = {
{ TYP_AGI, "agi", handle_struct, agi_hfld, &xfs_agi_buf_ops,
XFS_AGI_CRC_OFF },
{ TYP_ATTR, "attr3", handle_struct, attr3_hfld,
- &xfs_attr3_db_buf_ops, TYP_F_NO_CRC_OFF },
+ &xfs_attr3_db_buf_ops, TYP_F_CRC_FUNC, xfs_attr3_set_crc },
{ TYP_BMAPBTA, "bmapbta", handle_struct, bmapbta_crc_hfld,
&xfs_bmbt_buf_ops, XFS_BTREE_LBLOCK_CRC_OFF },
{ TYP_BMAPBTD, "bmapbtd", handle_struct, bmapbtd_crc_hfld,
@@ -103,7 +103,7 @@ static const typ_t __typtab_crc[] = {
&xfs_refcountbt_buf_ops, XFS_BTREE_SBLOCK_CRC_OFF },
{ TYP_DATA, "data", handle_block, NULL, NULL, TYP_F_NO_CRC_OFF },
{ TYP_DIR2, "dir3", handle_struct, dir3_hfld,
- &xfs_dir3_db_buf_ops, TYP_F_NO_CRC_OFF },
+ &xfs_dir3_db_buf_ops, TYP_F_CRC_FUNC, xfs_dir3_set_crc },
{ TYP_DQBLK, "dqblk", handle_struct, dqblk_hfld,
&xfs_dquot_buf_ops, TYP_F_NO_CRC_OFF },
{ TYP_INOBT, "inobt", handle_struct, inobt_crc_hfld,
@@ -132,7 +132,7 @@ static const typ_t __typtab_spcrc[] = {
{ TYP_AGI, "agi", handle_struct, agi_hfld, &xfs_agi_buf_ops ,
XFS_AGI_CRC_OFF },
{ TYP_ATTR, "attr3", handle_struct, attr3_hfld,
- &xfs_attr3_db_buf_ops, TYP_F_NO_CRC_OFF },
+ &xfs_attr3_db_buf_ops, TYP_F_CRC_FUNC, xfs_attr3_set_crc },
{ TYP_BMAPBTA, "bmapbta", handle_struct, bmapbta_crc_hfld,
&xfs_bmbt_buf_ops, XFS_BTREE_LBLOCK_CRC_OFF },
{ TYP_BMAPBTD, "bmapbtd", handle_struct, bmapbtd_crc_hfld,
@@ -147,7 +147,7 @@ static const typ_t __typtab_spcrc[] = {
&xfs_refcountbt_buf_ops, XFS_BTREE_SBLOCK_CRC_OFF },
{ TYP_DATA, "data", handle_block, NULL, NULL, TYP_F_NO_CRC_OFF },
{ TYP_DIR2, "dir3", handle_struct, dir3_hfld,
- &xfs_dir3_db_buf_ops, TYP_F_NO_CRC_OFF },
+ &xfs_dir3_db_buf_ops, TYP_F_CRC_FUNC, xfs_dir3_set_crc },
{ TYP_DQBLK, "dqblk", handle_struct, dqblk_hfld,
&xfs_dquot_buf_ops, TYP_F_NO_CRC_OFF },
{ TYP_INOBT, "inobt", handle_struct, inobt_spcrc_hfld,
diff --git a/db/type.h b/db/type.h
index a50d705..3971975 100644
--- a/db/type.h
+++ b/db/type.h
@@ -46,6 +46,8 @@ typedef struct typ
const struct xfs_buf_ops *bops;
unsigned long crc_off;
#define TYP_F_NO_CRC_OFF (-1UL)
+#define TYP_F_CRC_FUNC (-2UL)
+ void (*set_crc)(struct xfs_buf *);
} typ_t;
extern const typ_t *typtab, *cur_typ;
diff --git a/db/write.c b/db/write.c
index 5c83874..ea87b40 100644
--- a/db/write.c
+++ b/db/write.c
@@ -164,6 +164,9 @@ write_f(
if (corrupt) {
local_ops.verify_write = xfs_dummy_verify;
dbprintf(_("Allowing write of corrupted data and bad CRC\n"));
+ } else if (iocur_top->typ->crc_off == TYP_F_CRC_FUNC) {
+ local_ops.verify_write = iocur_top->typ->set_crc;
+ dbprintf(_("Allowing write of corrupted data with good CRC\n"));
} else { /* invalid data */
local_ops.verify_write = xfs_verify_recalc_crc;
dbprintf(_("Allowing write of corrupted data with good CRC\n"));
* [PATCH 5/9] xfs_io: provide an interface to the scrub ioctls
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Create a new xfs_io command to call the new XFS metadata scrub ioctl.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
io/Makefile | 2
io/init.c | 2
io/inject.c | 4 -
io/io.h | 3
io/scrub.c | 330 +++++++++++++++++++++++++++++++++++++++++++++++++++++
man/man8/xfs_io.8 | 20 +++
6 files changed, 359 insertions(+), 2 deletions(-)
create mode 100644 io/scrub.c
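
The argument handling in io/scrub.c comes down to two steps: map the type name to an index in a name table, then parse the numeric arguments strictly. The sketch below shows both steps on their own. The table is truncated to three entries, the function names are invented for this note, and the number parser is slightly stricter than the patch (it also rejects empty strings and out-of-range values), so treat it as one possible shape rather than the code under review.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Truncated, illustrative name table; the index doubles as the type. */
static const char *const scrub_names_sketch[] = {
	"dummy", "sb", "agf", NULL
};

/* Return the table index for name, or -1 if the name is unknown. */
static int
lookup_scrub_type(
	const char	*name)
{
	int		i;

	assert(name != NULL);
	for (i = 0; scrub_names_sketch[i]; i++)
		if (strcmp(scrub_names_sketch[i], name) == 0)
			return i;
	return -1;
}

/*
 * Parse a number in any base strtoull() accepts with base 0.
 * Returns 0 on success, -1 on an empty string, trailing junk,
 * or a value that does not fit.
 */
static int
parse_u64(
	const char	*s,
	uint64_t	*out)
{
	char			*end;
	unsigned long long	v;

	assert(s != NULL && out != NULL);
	errno = 0;
	v = strtoull(s, &end, 0);
	if (end == s || *end != '\0' || errno == ERANGE)
		return -1;
	*out = v;
	return 0;
}
```

Because the table index is passed to the kernel as the scrub type, the order of entries has to track the XFS_SCRUB_TYPE_ values, as the comment above struct scrub_descr in the diff points out.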
diff --git a/io/Makefile b/io/Makefile
index b5fe83d..e146cc0 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -11,7 +11,7 @@ HFILES = init.h io.h
CFILES = init.c \
attr.c bmap.c cowextsize.c encrypt.c file.c freeze.c fsync.c \
getrusage.c imap.c link.c mmap.c open.c parent.c pread.c prealloc.c \
- pwrite.c seek.c shutdown.c sync.c truncate.c utimes.c
+ pwrite.c scrub.c seek.c shutdown.c sync.c truncate.c utimes.c
LLDLIBS = $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE)
diff --git a/io/init.c b/io/init.c
index 1532149..bef6f62 100644
--- a/io/init.c
+++ b/io/init.c
@@ -83,7 +83,9 @@ init_commands(void)
quit_init();
readdir_init();
reflink_init();
+ repair_init();
resblks_init();
+ scrub_init();
seek_init();
sendfile_init();
shutdown_init();
diff --git a/io/inject.c b/io/inject.c
index 25c7021..91b5fd7 100644
--- a/io/inject.c
+++ b/io/inject.c
@@ -86,7 +86,9 @@ error_tag(char *name)
{ XFS_ERRTAG_BMAP_FINISH_ONE, "bmap_finish_one" },
#define XFS_ERRTAG_AG_RESV_CRITICAL 27
{ XFS_ERRTAG_AG_RESV_CRITICAL, "ag_resv_critical" },
-#define XFS_ERRTAG_MAX 28
+#define XFS_ERRTAG_FORCE_REPAIR 28
+ { XFS_ERRTAG_FORCE_REPAIR, "force_repair" },
+#define XFS_ERRTAG_MAX 29
{ XFS_ERRTAG_MAX, NULL }
};
int count;
diff --git a/io/io.h b/io/io.h
index 43c82d8..ee8a927 100644
--- a/io/io.h
+++ b/io/io.h
@@ -187,3 +187,6 @@ extern void fsmap_init(void);
#else
# define fsmap_init() do { } while (0)
#endif
+
+extern void scrub_init(void);
+extern void repair_init(void);
diff --git a/io/scrub.c b/io/scrub.c
new file mode 100644
index 0000000..caa965e
--- /dev/null
+++ b/io/scrub.c
@@ -0,0 +1,330 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <sys/uio.h>
+#include <xfs/xfs.h>
+#include "command.h"
+#include "input.h"
+#include "init.h"
+#include "path.h"
+#include "io.h"
+
+static struct cmdinfo scrub_cmd;
+static struct cmdinfo repair_cmd;
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+ ST_NONE, /* disabled */
+ ST_PERAG, /* per-AG metadata */
+ ST_FS, /* per-FS metadata */
+ ST_INODE, /* per-inode metadata */
+};
+
+/* These must correspond with XFS_SCRUB_TYPE_ */
+struct scrub_descr {
+ const char *name;
+ enum scrub_type type;
+};
+
+static const struct scrub_descr scrubbers[] = {
+ {"dummy", ST_NONE},
+ {"sb", ST_PERAG},
+ {"agf", ST_PERAG},
+ {"agfl", ST_PERAG},
+ {"agi", ST_PERAG},
+ {"bnobt", ST_PERAG},
+ {"cntbt", ST_PERAG},
+ {"inobt", ST_PERAG},
+ {"finobt", ST_PERAG},
+ {"rmapbt", ST_PERAG},
+ {"refcountbt", ST_PERAG},
+ {"inode", ST_INODE},
+ {"bmapbtd", ST_INODE},
+ {"bmapbta", ST_INODE},
+ {"bmapbtc", ST_INODE},
+ {"directory", ST_INODE},
+ {"xattr", ST_INODE},
+ {"symlink", ST_INODE},
+ {"rtbitmap", ST_FS},
+ {"rtsummary", ST_FS},
+ {NULL, ST_NONE},
+};
+
+static void
+scrub_help(void)
+{
+ const struct scrub_descr *d;
+
+ printf(_("\n\
+ Scrubs a piece of XFS filesystem metadata. The first argument is the type\n\
+ of metadata to examine. Allocation group number(s) can be specified to\n\
+ restrict the scrub operation to a subset of allocation groups.\n\
+ Certain metadata types do not take AG numbers.\n\
+\n\
+ Example:\n\
+ 'scrub inobt 3' - scrubs the inode btree in AG 3.\n\
+ 'scrub bmapbtd 128 13525' - scrubs the extent map of inode 128 gen 13525.\n\
+\n\
+ Known metadata scrub types are:"));
+ for (d = scrubbers; d->name; d++)
+ printf(" %s", d->name);
+ printf("\n");
+}
+
+static void
+scrub_ioctl(
+ int fd,
+ int type,
+ uint64_t control,
+ uint32_t control2)
+{
+ struct xfs_scrub_metadata meta;
+ const struct scrub_descr *sc;
+ int error;
+
+ sc = &scrubbers[type];
+ memset(&meta, 0, sizeof(meta));
+ meta.sm_type = type;
+ switch (sc->type) {
+ case ST_PERAG:
+ meta.sm_agno = control;
+ break;
+ case ST_INODE:
+ meta.sm_ino = control;
+ meta.sm_gen = control2;
+ break;
+ case ST_NONE:
+ case ST_FS:
+ /* no control parameters */
+ break;
+ }
+ meta.sm_flags = 0;
+
+ error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+ if (error)
+ perror("scrub");
+ if (meta.sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+ printf(_("Corruption detected.\n"));
+ if (meta.sm_flags & XFS_SCRUB_FLAG_PREEN)
+ printf(_("Optimization possible.\n"));
+ if (meta.sm_flags & XFS_SCRUB_FLAG_XFAIL)
+ printf(_("Cross-referencing failed.\n"));
+ if (meta.sm_flags & XFS_SCRUB_FLAG_XCORRUPT)
+ printf(_("Corruption detected during cross-referencing.\n"));
+}
+
+static int
+parse_args(
+ int argc,
+ char **argv,
+ struct cmdinfo *cmdinfo,
+ void (*fn)(int, int, uint64_t, uint32_t))
+{
+ char *p;
+ int type = -1;
+ int i, c;
+ uint64_t control = 0;
+ uint32_t control2 = 0;
+ const struct scrub_descr *d = NULL;
+
+ while ((c = getopt(argc, argv, "")) != EOF) {
+ switch (c) {
+ default:
+ return command_usage(cmdinfo);
+ }
+ }
+ if (optind > argc - 1)
+ return command_usage(cmdinfo);
+
+ for (i = 0, d = scrubbers; d->name; i++, d++) {
+ if (strcmp(d->name, argv[optind]) == 0) {
+ type = i;
+ break;
+ }
+ }
+ optind++;
+
+ if (type < 0)
+ return command_usage(cmdinfo);
+
+ switch (d->type) {
+ case ST_INODE:
+ if (optind == argc) {
+ control = 0;
+ control2 = 0;
+ } else if (optind == argc - 2) {
+ control = strtoull(argv[optind], &p, 0);
+ if (*p != '\0') {
+ fprintf(stderr,
+ _("Bad inode number %s.\n"), argv[optind]);
+ return 0;
+ }
+ control2 = strtoul(argv[optind + 1], &p, 0);
+ if (*p != '\0') {
+ fprintf(stderr,
+ _("Bad generation number %s.\n"), argv[optind + 1]);
+ return 0;
+ }
+ } else {
+ fprintf(stderr,
+ _("Must specify inode number and generation.\n"));
+ return 0;
+ }
+ break;
+ case ST_PERAG:
+ case ST_NONE:
+ if (optind != argc - 1) {
+ fprintf(stderr,
+ _("Must specify AG number.\n"));
+ return 0;
+ }
+ control = strtoul(argv[optind], &p, 0);
+ if (*p != '\0') {
+ fprintf(stderr,
+ _("Bad AG number %s.\n"), argv[optind]);
+ return 0;
+ }
+ break;
+ default:
+ if (optind != argc) {
+ fprintf(stderr,
+ _("No parameters allowed.\n"));
+ return 0;
+ }
+ }
+ fn(file->fd, type, control, control2);
+
+ return 0;
+}
+
+static int
+scrub_f(
+ int argc,
+ char **argv)
+{
+ return parse_args(argc, argv, &scrub_cmd, scrub_ioctl);
+}
+
+void
+scrub_init(void)
+{
+ scrub_cmd.name = "scrub";
+ scrub_cmd.altname = "sc";
+ scrub_cmd.cfunc = scrub_f;
+ scrub_cmd.argmin = 1;
+ scrub_cmd.argmax = -1;
+ scrub_cmd.flags = CMD_NOMAP_OK;
+ scrub_cmd.args =
+_("type [agno...]");
+ scrub_cmd.oneline =
+ _("scrubs filesystem metadata");
+ scrub_cmd.help = scrub_help;
+
+ add_command(&scrub_cmd);
+}
+
+static void
+repair_help(void)
+{
+ const struct scrub_descr *d;
+
+ printf(_("\n\
+ Repairs a piece of XFS filesystem metadata. The first argument is the type\n\
+ of metadata to repair. Allocation group number(s) can be specified to\n\
+ restrict the repair operation to a subset of allocation groups.\n\
+ Certain metadata types do not take AG numbers.\n\
+\n\
+ Example:\n\
+ 'repair inobt 3' - repairs the inode btree in AG 3.\n\
+\n\
+ Known metadata repair types are:"));
+ for (d = scrubbers; d->name; d++)
+ printf(" %s", d->name);
+ printf("\n");
+}
+
+static void
+repair_ioctl(
+ int fd,
+ int type,
+ uint64_t control,
+ uint32_t control2)
+{
+ struct xfs_scrub_metadata meta;
+ const struct scrub_descr *sc;
+ int error;
+
+ sc = &scrubbers[type];
+ memset(&meta, 0, sizeof(meta));
+ meta.sm_type = type;
+ switch (sc->type) {
+ case ST_PERAG:
+ meta.sm_agno = control;
+ break;
+ case ST_INODE:
+ meta.sm_ino = control;
+ meta.sm_gen = control2;
+ break;
+ case ST_NONE:
+ case ST_FS:
+ /* no control parameters */
+ break;
+ }
+ meta.sm_flags = XFS_SCRUB_FLAG_REPAIR;
+
+ error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+ if (error)
+ perror("repair");
+ if (meta.sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+ printf(_("Corruption remains.\n"));
+ if (meta.sm_flags & XFS_SCRUB_FLAG_PREEN)
+ printf(_("Optimization possible.\n"));
+ if (meta.sm_flags & XFS_SCRUB_FLAG_XFAIL)
+ printf(_("Cross-referencing failed.\n"));
+ if (meta.sm_flags & XFS_SCRUB_FLAG_XCORRUPT)
+ printf(_("Corruption still detected during cross-referencing.\n"));
+}
+
+static int
+repair_f(
+ int argc,
+ char **argv)
+{
+ return parse_args(argc, argv, &repair_cmd, repair_ioctl);
+}
+
+void
+repair_init(void)
+{
+ if (!expert)
+ return;
+ repair_cmd.name = "repair";
+ repair_cmd.altname = "fix";
+ repair_cmd.cfunc = repair_f;
+ repair_cmd.argmin = 1;
+ repair_cmd.argmax = -1;
+ repair_cmd.flags = CMD_NOMAP_OK;
+ repair_cmd.args =
+_("type [agno...]");
+ repair_cmd.oneline =
+ _("repairs filesystem metadata");
+ repair_cmd.help = repair_help;
+
+ add_command(&repair_cmd);
+}
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 7ca3fdc..049f32c 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -998,6 +998,26 @@ version of policy structure (numeric)
.BR get_encpolicy
On filesystems that support encryption, display the encryption policy of the
current file.
+.TP
+.BI "scrub " type " [ " agnumber... " | " "ino" " " "gen" " ]"
+Scrub internal XFS filesystem metadata. The
+.BI type
+parameter specifies which type of metadata to scrub.
+For AG metadata, AG numbers can optionally be specified to restrict the
+scrub operation to a particular set of allocation groups.
+By default, all allocation groups are scrubbed.
+For file metadata, the scrub is applied to the open file unless the
+inode number and generation number are specified.
+.TP
+.BI "repair " type " [ " agnumber... " | " "ino" " " "gen" " ]"
+Repair internal XFS filesystem metadata. The
+.BI type
+parameter specifies which type of metadata to repair.
+For AG metadata, AG numbers can optionally be specified to restrict the
+repair operation to a particular set of allocation groups.
+By default, all allocation groups are repaired.
+For file metadata, the repair is applied to the open file unless the
+inode number and generation number are specified.
.SH SEE ALSO
.BR mkfs.xfs (8),
* [PATCH 6/9] xfs_scrub: create online filesystem scrub program
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Create a filesystem scrubbing tool that walks the directory tree and
queries every file's extents, extended attributes, and stat data. For
generic (non-XFS) filesystems this depends on the kernel to do nearly
all the validation. Optionally, we can (try to) read all the file
data.
This patch provides some helper components that will be used by the
various backends to walk the metadata, perform media scans, etc. Actual
filesystem drivers will be in subsequent patches.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
Makefile | 3
configure.ac | 6
include/builddefs.in | 6
m4/package_libcdev.m4 | 88 +++++
man/man8/xfs_scrub.8 | 109 ++++++
scrub/Makefile | 51 +++
scrub/bitmap.c | 425 ++++++++++++++++++++++
scrub/bitmap.h | 42 ++
scrub/disk.c | 288 +++++++++++++++
scrub/disk.h | 41 ++
scrub/iocmd.c | 239 ++++++++++++
scrub/iocmd.h | 50 +++
scrub/read_verify.c | 316 ++++++++++++++++
scrub/read_verify.h | 59 +++
scrub/scrub.c | 950 +++++++++++++++++++++++++++++++++++++++++++++++++
scrub/scrub.h | 127 +++++++
16 files changed, 2799 insertions(+), 1 deletion(-)
create mode 100644 man/man8/xfs_scrub.8
create mode 100644 scrub/Makefile
create mode 100644 scrub/bitmap.c
create mode 100644 scrub/bitmap.h
create mode 100644 scrub/disk.c
create mode 100644 scrub/disk.h
create mode 100644 scrub/iocmd.c
create mode 100644 scrub/iocmd.h
create mode 100644 scrub/read_verify.c
create mode 100644 scrub/read_verify.h
create mode 100644 scrub/scrub.c
create mode 100644 scrub/scrub.h
diff --git a/Makefile b/Makefile
index 3a4872a..e6d79af 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ HDR_SUBDIRS = include libxfs
DLIB_SUBDIRS = libxlog libxcmd libhandle
LIB_SUBDIRS = libxfs $(DLIB_SUBDIRS)
TOOL_SUBDIRS = copy db estimate fsck growfs io logprint mkfs quota \
- mdrestore repair rtcp m4 man doc debian spaceman
+ mdrestore repair rtcp m4 man doc debian spaceman scrub
ifneq ("$(PKG_PLATFORM)","darwin")
TOOL_SUBDIRS += fsr
@@ -89,6 +89,7 @@ repair: libxlog libxcmd
copy: libxlog
mkfs: libxcmd
spaceman: libxcmd
+scrub: libhandle libxcmd repair
ifeq ($(HAVE_BUILDDEFS), yes)
include $(BUILDRULES)
diff --git a/configure.ac b/configure.ac
index 103aeba..ccd7460 100644
--- a/configure.ac
+++ b/configure.ac
@@ -146,6 +146,12 @@ AC_HAVE_SYS_FICLONE
AC_HAVE_SYS_FICLONERANGE
AC_HAVE_SYS_FIDEDUPERANGE
AC_HAVE_SYS_GETFSMAP
+AC_HAVE_MALLINFO
+AC_HAVE_SG_IO
+AC_HAVE_HDIO_GETGEO
+AC_HAVE_OPENAT
+AC_HAVE_SYNCFS
+AC_HAVE_FSTATAT
if test "$enable_blkid" = yes; then
AC_HAVE_BLKID_TOPO
diff --git a/include/builddefs.in b/include/builddefs.in
index 603d900..9d478d3 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -116,6 +116,12 @@ HAVE_SYS_FICLONE = @have_sys_ficlone@
HAVE_SYS_FICLONERANGE = @have_sys_ficlonerange@
HAVE_SYS_FIDEDUPERANGE = @have_sys_fideduperange@
HAVE_SYS_GETFSMAP = @have_sys_getfsmap@
+HAVE_MALLINFO = @have_mallinfo@
+HAVE_SG_IO = @have_sg_io@
+HAVE_HDIO_GETGEO = @have_hdio_getgeo@
+HAVE_OPENAT = @have_openat@
+HAVE_SYNCFS = @have_syncfs@
+HAVE_FSTATAT = @have_fstatat@
GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
# -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 971e390..da15041 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -352,3 +352,91 @@ AC_DEFUN([AC_HAVE_SYS_GETFSMAP],
AC_MSG_RESULT(no))
AC_SUBST(have_sys_getfsmap)
])
+
+#
+# Check if we have a mallinfo libc call
+#
+AC_DEFUN([AC_HAVE_MALLINFO],
+ [ AC_MSG_CHECKING([for mallinfo ])
+ AC_TRY_COMPILE([
+#include <malloc.h>
+ ], [
+ struct mallinfo test;
+
+ test.arena = 0; test.hblkhd = 0; test.uordblks = 0; test.fordblks = 0;
+ test = mallinfo();
+ ], have_mallinfo=yes
+ AC_MSG_RESULT(yes),
+ AC_MSG_RESULT(no))
+ AC_SUBST(have_mallinfo)
+ ])
+
+#
+# Check if we have the SG_IO ioctl
+#
+AC_DEFUN([AC_HAVE_SG_IO],
+ [ AC_MSG_CHECKING([for struct sg_io_hdr ])
+ AC_TRY_COMPILE([#include <scsi/sg.h>],
+ [
+ struct sg_io_hdr hdr;
+ ioctl(0, SG_IO, &hdr);
+ ], have_sg_io=yes
+ AC_MSG_RESULT(yes),
+ AC_MSG_RESULT(no))
+ AC_SUBST(have_sg_io)
+ ])
+
+#
+# Check if we have the HDIO_GETGEO ioctl
+#
+AC_DEFUN([AC_HAVE_HDIO_GETGEO],
+ [ AC_MSG_CHECKING([for struct hd_geometry ])
+ AC_TRY_COMPILE([#include <linux/hdreg.h>],
+ [
+ struct hd_geometry hdr;
+ ioctl(0, HDIO_GETGEO, &hdr);
+ ], have_hdio_getgeo=yes
+ AC_MSG_RESULT(yes),
+ AC_MSG_RESULT(no))
+ AC_SUBST(have_hdio_getgeo)
+ ])
+
+#
+# Check if we have an openat call
+#
+AC_DEFUN([AC_HAVE_OPENAT],
+ [ AC_CHECK_DECL([openat],
+ have_openat=yes,
+ [],
+ [#include <sys/types.h>
+ #include <sys/stat.h>
+ #include <fcntl.h>]
+ )
+ AC_SUBST(have_openat)
+ ])
+
+#
+# Check if we have a syncfs call
+#
+AC_DEFUN([AC_HAVE_SYNCFS],
+ [ AC_CHECK_DECL([syncfs],
+ have_syncfs=yes,
+ [],
+ [#define _GNU_SOURCE
+ #include <unistd.h>])
+ AC_SUBST(have_syncfs)
+ ])
+
+#
+# Check if we have an fstatat call
+#
+AC_DEFUN([AC_HAVE_FSTATAT],
+ [ AC_CHECK_DECL([fstatat],
+ have_fstatat=yes,
+ [],
+ [#define _GNU_SOURCE
+ #include <sys/types.h>
+ #include <sys/stat.h>
+ #include <unistd.h>])
+ AC_SUBST(have_fstatat)
+ ])
diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
new file mode 100644
index 0000000..60cdf61
--- /dev/null
+++ b/man/man8/xfs_scrub.8
@@ -0,0 +1,109 @@
+.TH xfs_scrub 8
+.SH NAME
+xfs_scrub \- scrub the contents of an XFS filesystem
+.SH SYNOPSIS
+.B xfs_scrub
+[
+.B \-ademnTvVxy
+]
+.I mountpoint
+.br
+.B xfs_scrub \-V
+.SH DESCRIPTION
+.B xfs_scrub
+attempts to check and repair all metadata in a mounted XFS filesystem.
+.PP
+If an XFS filesystem is detected, then
+.B xfs_scrub
+will ask the kernel to perform more rigorous scrubbing of the
+internal metadata.
+The in-kernel scrubbers also cross-reference each data structure's
+records against the other filesystem metadata.
+.PP
+This utility does not know how to correct all errors.
+If the tool cannot fix the detected errors, you must unmount the
+filesystem and run the appropriate repair tool.
+If this tool is run without either of the
+.B \-n
+or
+.B \-y
+options, then it will preen and optimize the filesystem when possible,
+though it will not try to fix errors.
+.SH OPTIONS
+.TP
+.BI \-a " errors"
+Abort if more than this many errors are found on the filesystem.
+.TP
+.B \-d
+Enable debugging mode, which augments error reports with the exact file
+and line where the scrub failure occurred.
+This also enables verbose mode.
+.TP
+.B \-e
+Specifies what happens when errors are detected.
+If
+.IR shutdown
+is given, the filesystem will be taken offline if errors are found.
+Not all backends can shut down a filesystem.
+If
+.IR continue
+is given, no action is taken if errors are found.
+This is the default.
+.TP
+.BI \-m " file"
+Search this file for mounted filesystems instead of /etc/mtab.
+.TP
+.B \-n
+Dry run, do not modify anything in the filesystem. This disables
+all preening and optimization behaviors, and disables calling
+FITRIM on the free space after a successful run.
+.TP
+.B \-T
+Print timing and memory usage information for each phase.
+.TP
+.B \-v
+Enable verbose mode, which prints periodic status updates.
+.TP
+.B \-V
+Prints the version number and exits.
+.TP
+.B \-x
+Scrub file data. This reads every block of every file on disk.
+If the filesystem reports file extent mappings or physical extent
+mappings and is backed by a block device,
+.B xfs_scrub
+will issue O_DIRECT reads to the block device directly.
+If the block device is a SCSI disk, it will issue READ VERIFY commands
+directly to the disk.
+.TP
+.B \-y
+Try to repair all filesystem errors. If the errors cannot be fixed
+online, then the filesystem must be taken offline for repair.
+.SH EXIT CODE
+The exit code returned by
+.B xfs_scrub
+is the sum of the following conditions:
+.br
+\ 0\ \-\ No errors
+.br
+\ 1\ \-\ File system errors left uncorrected
+.br
+\ 2\ \-\ File system optimizations possible
+.br
+\ 4\ \-\ Operational error
+.br
+\ 8\ \-\ Usage or syntax error
+.br
+.SH CAVEATS
+.B xfs_scrub
+is an immature utility!
+This program takes advantage of in-kernel scrubbing to verify a
+given data structure with locks held.
+This can tie up the system for a while.
+The kernel must support the BULKSTAT, FSGEOMETRY, FSCOUNTS, GET_RESBLKS,
+GET_AG_RESBLKS, GETBMAPX, GETFSMAP, INUMBERS, and SCRUB_METADATA ioctls.
+.PP
+If errors are found and cannot be repaired, the filesystem should be
+taken offline and repaired.
+.SH SEE ALSO
+.BR xfs_repair (8).
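
Since the man page above defines the exit code as a sum of conditions, a
caller has to test individual bits rather than compare against a single
value. A small sketch of how a wrapper might decode it follows. The bit
values come from the EXIT CODE section; the macro and function names are
hypothetical and do not exist in xfsprogs.

```c
#include <stdbool.h>

/* Bit values as listed in the EXIT CODE section of xfs_scrub(8). */
#define SCRUB_RET_CORRUPT	1	/* errors left uncorrected */
#define SCRUB_RET_UNOPTIMIZED	2	/* optimizations possible */
#define SCRUB_RET_OPERROR	4	/* operational error */
#define SCRUB_RET_SYNTAX	8	/* usage or syntax error */

/* Hypothetical helper: did the scrub leave filesystem errors behind? */
bool
scrub_needs_repair(int exit_code)
{
	return (exit_code & SCRUB_RET_CORRUPT) != 0;
}

/* Hypothetical helper: did the tool itself run without trouble? */
bool
scrub_run_ok(int exit_code)
{
	return (exit_code & (SCRUB_RET_OPERROR | SCRUB_RET_SYNTAX)) == 0;
}
```

An exit code of 3, for example, means both "errors left uncorrected" and
"optimizations possible", so scrub_needs_repair(3) is true while
scrub_run_ok(3) is also true.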
diff --git a/scrub/Makefile b/scrub/Makefile
new file mode 100644
index 0000000..b1ff86a
--- /dev/null
+++ b/scrub/Makefile
@@ -0,0 +1,51 @@
+#
+# Copyright (c) 2017 Oracle. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+SCRUB_PREREQS=$(HAVE_OPENAT)$(HAVE_FSTATAT)
+
+ifeq ($(SCRUB_PREREQS),yesyes)
+LTCOMMAND = xfs_scrub
+INSTALL_SCRUB = install-scrub
+endif # scrub_prereqs
+
+HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h bitmap.h disk.h
+CFILES = ../repair/avl64.c disk.c bitmap.c iocmd.c \
+ read_verify.c scrub.c ../repair/threads.c
+
+LLDLIBS += $(LIBBLKID) $(LIBXFS) $(LIBXCMD) $(LIBUUID) $(LIBRT) $(LIBPTHREAD) $(LIBHANDLE)
+LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBHANDLE)
+LLDFLAGS = -static-libtool-libs
+
+ifeq ($(HAVE_MALLINFO),yes)
+LCFLAGS += -DHAVE_MALLINFO
+endif
+
+ifeq ($(HAVE_SG_IO),yes)
+LCFLAGS += -DHAVE_SG_IO
+endif
+
+ifeq ($(HAVE_HDIO_GETGEO),yes)
+LCFLAGS += -DHAVE_HDIO_GETGEO
+endif
+
+ifeq ($(HAVE_SYNCFS),yes)
+LCFLAGS += -DHAVE_SYNCFS
+endif
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: default $(INSTALL_SCRUB)
+
+install-scrub:
+ $(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
+ $(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+
+install-dev:
+
+-include .dep
diff --git a/scrub/bitmap.c b/scrub/bitmap.c
new file mode 100644
index 0000000..0146c49
--- /dev/null
+++ b/scrub/bitmap.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "libxfs.h"
+#include "../repair/avl64.h"
+#include "bitmap.h"
+
+#define avl_for_each_range_safe(pos, n, l, first, last) \
+ for (pos = (first), n = pos->avl_nextino, l = (last)->avl_nextino; pos != (l); \
+ pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each_safe(tree, pos, n) \
+ for (pos = (tree)->avl_firstino, n = pos ? pos->avl_nextino : NULL; \
+ pos != NULL; \
+ pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each(tree, pos) \
+ for (pos = (tree)->avl_firstino; pos != NULL; pos = pos->avl_nextino)
+
+struct bitmap_node {
+ struct avl64node btn_node;
+ uint64_t btn_start;
+ uint64_t btn_length;
+};
+
+static __uint64_t
+extent_start(
+ struct avl64node *node)
+{
+ struct bitmap_node *btn;
+
+ btn = container_of(node, struct bitmap_node, btn_node);
+ return btn->btn_start;
+}
+
+static __uint64_t
+extent_end(
+ struct avl64node *node)
+{
+ struct bitmap_node *btn;
+
+ btn = container_of(node, struct bitmap_node, btn_node);
+ return btn->btn_start + btn->btn_length;
+}
+
+static struct avl64ops bitmap_ops = {
+ extent_start,
+ extent_end,
+};
+
+/* Initialize an extent tree. */
+bool
+bitmap_init(
+ struct bitmap *tree)
+{
+ tree->bt_tree = malloc(sizeof(struct avl64tree_desc));
+ if (!tree->bt_tree)
+ return false;
+
+ pthread_mutex_init(&tree->bt_lock, NULL);
+ avl64_init_tree(tree->bt_tree, &bitmap_ops);
+
+ return true;
+}
+
+/* Free an extent tree. */
+void
+bitmap_free(
+ struct bitmap *tree)
+{
+ struct avl64node *node;
+ struct avl64node *n;
+ struct bitmap_node *ext;
+
+ if (!tree->bt_tree)
+ return;
+
+ avl_for_each_safe(tree->bt_tree, node, n) {
+ ext = container_of(node, struct bitmap_node, btn_node);
+ free(ext);
+ }
+ free(tree->bt_tree);
+ tree->bt_tree = NULL;
+}
+
+/* Create a new extent. */
+static struct bitmap_node *
+bitmap_node_init(
+ uint64_t start,
+ uint64_t len)
+{
+ struct bitmap_node *ext;
+
+ ext = malloc(sizeof(struct bitmap_node));
+ if (!ext)
+ return NULL;
+
+ ext->btn_node.avl_nextino = NULL;
+ ext->btn_start = start;
+ ext->btn_length = len;
+
+ return ext;
+}
+
+/* Add an extent (locked). */
+static bool
+__bitmap_add(
+ struct bitmap *tree,
+ uint64_t start,
+ uint64_t length)
+{
+ struct avl64node *firstn;
+ struct avl64node *lastn;
+ struct avl64node *pos;
+ struct avl64node *n;
+ struct avl64node *l;
+ struct bitmap_node *ext;
+ uint64_t new_start;
+ uint64_t new_length;
+ struct avl64node *node;
+ bool res = true;
+
+ /* Find any existing nodes adjacent or within that range. */
+ avl64_findranges(tree->bt_tree, start - 1, start + length + 1,
+ &firstn, &lastn);
+
+ /* Nothing, just insert a new extent. */
+ if (firstn == NULL && lastn == NULL) {
+ ext = bitmap_node_init(start, length);
+ if (!ext)
+ return false;
+
+ node = avl64_insert(tree->bt_tree, &ext->btn_node);
+ if (node == NULL) {
+ free(ext);
+ errno = EEXIST;
+ return false;
+ }
+
+ return true;
+ }
+
+ ASSERT(firstn != NULL && lastn != NULL);
+ new_start = start;
+ new_length = length;
+
+ avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+ ext = container_of(pos, struct bitmap_node, btn_node);
+
+ /* Bail if the new extent is contained within an old one. */
+ if (ext->btn_start <= start &&
+ ext->btn_start + ext->btn_length >= start + length)
+ return res;
+
+ /* Check for overlapping and adjacent extents. */
+ if (ext->btn_start + ext->btn_length >= start &&
+ ext->btn_start <= start + length) {
+ if (ext->btn_start < start) {
+ /* Grow leftwards by the part that sticks out. */
+ new_length += new_start - ext->btn_start;
+ new_start = ext->btn_start;
+ }
+
+ if (ext->btn_start + ext->btn_length >
+ new_start + new_length)
+ new_length = ext->btn_start + ext->btn_length -
+ new_start;
+
+ avl64_delete(tree->bt_tree, pos);
+ free(ext);
+ }
+ }
+
+ ext = bitmap_node_init(new_start, new_length);
+ if (!ext)
+ return false;
+
+ node = avl64_insert(tree->bt_tree, &ext->btn_node);
+ if (node == NULL) {
+ free(ext);
+ errno = EEXIST;
+ return false;
+ }
+
+ return res;
+}
+
+/* Add an extent. */
+bool
+bitmap_add(
+ struct bitmap *tree,
+ uint64_t start,
+ uint64_t length)
+{
+ bool res;
+
+ pthread_mutex_lock(&tree->bt_lock);
+ res = __bitmap_add(tree, start, length);
+ pthread_mutex_unlock(&tree->bt_lock);
+
+ return res;
+}
+
+/* Remove an extent. */
+bool
+bitmap_remove(
+ struct bitmap *tree,
+ uint64_t start,
+ uint64_t len)
+{
+ struct avl64node *firstn;
+ struct avl64node *lastn;
+ struct avl64node *pos;
+ struct avl64node *n;
+ struct avl64node *l;
+ struct bitmap_node *ext;
+ uint64_t new_start;
+ uint64_t new_length;
+ struct avl64node *node;
+ int stat;
+
+ pthread_mutex_lock(&tree->bt_lock);
+ /* Find any existing nodes over that range. */
+ avl64_findranges(tree->bt_tree, start, start + len, &firstn, &lastn);
+
+ /* Nothing, we're done. */
+ if (firstn == NULL && lastn == NULL) {
+ pthread_mutex_unlock(&tree->bt_lock);
+ return true;
+ }
+
+ ASSERT(firstn != NULL && lastn != NULL);
+
+ /* Delete or truncate everything in sight. */
+ avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+ ext = container_of(pos, struct bitmap_node, btn_node);
+
+ stat = 0;
+ if (ext->btn_start < start)
+ stat |= 1;
+ if (ext->btn_start + ext->btn_length > start + len)
+ stat |= 2;
+ switch (stat) {
+ case 0:
+ /* Extent totally within range; delete. */
+ avl64_delete(tree->bt_tree, pos);
+ free(ext);
+ break;
+ case 1:
+ /* Extent is left-adjacent; truncate. */
+ ext->btn_length = start - ext->btn_start;
+ break;
+ case 2:
+ /* Extent is right-adjacent; move it. */
+ ext->btn_length = ext->btn_start + ext->btn_length -
+ (start + len);
+ ext->btn_start = start + len;
+ break;
+ case 3:
+ /* Extent overlaps both ends; split it in two. */
+ new_start = start + len;
+ new_length = ext->btn_start + ext->btn_length -
+ new_start;
+ /* Truncate the left half only after sizing the right half. */
+ ext->btn_length = start - ext->btn_start;
+
+ ext = bitmap_node_init(new_start, new_length);
+ if (!ext) {
+ pthread_mutex_unlock(&tree->bt_lock);
+ return false;
+ }
+
+ node = avl64_insert(tree->bt_tree, &ext->btn_node);
+ if (node == NULL) {
+ free(ext);
+ pthread_mutex_unlock(&tree->bt_lock);
+ errno = EEXIST;
+ return false;
+ }
+ break;
+ }
+ }
+
+ pthread_mutex_unlock(&tree->bt_lock);
+ return true;
+}
+
+/* Iterate an extent tree. */
+bool
+bitmap_iterate(
+ struct bitmap *tree,
+ bool (*fn)(uint64_t, uint64_t, void *),
+ void *arg)
+{
+ struct avl64node *node;
+ struct bitmap_node *ext;
+ bool moveon = true;
+
+ pthread_mutex_lock(&tree->bt_lock);
+ avl_for_each(tree->bt_tree, node) {
+ ext = container_of(node, struct bitmap_node, btn_node);
+ moveon = fn(ext->btn_start, ext->btn_length, arg);
+ if (!moveon)
+ break;
+ }
+ pthread_mutex_unlock(&tree->bt_lock);
+
+ return moveon;
+}
+
+/* Do any extents overlap the given one? (locked) */
+static bool
+__bitmap_has_extent(
+ struct bitmap *tree,
+ uint64_t start,
+ uint64_t len)
+{
+ struct avl64node *firstn;
+ struct avl64node *lastn;
+
+ /* Find any existing nodes over that range. */
+ avl64_findranges(tree->bt_tree, start, start + len, &firstn, &lastn);
+
+ return firstn != NULL && lastn != NULL;
+}
+
+/* Do any extents overlap the given one? */
+bool
+bitmap_has_extent(
+ struct bitmap *tree,
+ uint64_t start,
+ uint64_t len)
+{
+ bool res;
+
+ pthread_mutex_lock(&tree->bt_lock);
+ res = __bitmap_has_extent(tree, start, len);
+ pthread_mutex_unlock(&tree->bt_lock);
+
+ return res;
+}
+
+/* Ensure that the extent is set, and return the old value. */
+bool
+bitmap_test_and_set(
+ struct bitmap *tree,
+ uint64_t start,
+ bool *was_set)
+{
+ bool res = true;
+
+ pthread_mutex_lock(&tree->bt_lock);
+ *was_set = __bitmap_has_extent(tree, start, 1);
+ if (!(*was_set))
+ res = __bitmap_add(tree, start, 1);
+ pthread_mutex_unlock(&tree->bt_lock);
+
+ return res;
+}
+
+/* Is it empty? */
+bool
+bitmap_empty(
+ struct bitmap *tree)
+{
+ return tree->bt_tree->avl_firstino == NULL;
+}
+
+static bool
+merge_helper(
+ uint64_t start,
+ uint64_t length,
+ void *arg)
+{
+ struct bitmap *thistree = arg;
+
+ return __bitmap_add(thistree, start, length);
+}
+
+/* Merge another tree with this one. */
+bool
+bitmap_merge(
+ struct bitmap *thistree,
+ struct bitmap *tree)
+{
+ bool res;
+
+ assert(thistree != tree);
+
+ pthread_mutex_lock(&thistree->bt_lock);
+ res = bitmap_iterate(tree, merge_helper, thistree);
+ pthread_mutex_unlock(&thistree->bt_lock);
+
+ return res;
+}
+
+static bool
+bitmap_dump_fn(
+ uint64_t startblock,
+ uint64_t blockcount,
+ void *arg)
+{
+ printf("%"PRIu64":%"PRIu64"\n", startblock, blockcount);
+ return true;
+}
+
+/* Dump extent tree. */
+void
+bitmap_dump(
+ struct bitmap *tree)
+{
+ printf("BITMAP DUMP %p\n", tree);
+ bitmap_iterate(tree, bitmap_dump_fn, NULL);
+ printf("BITMAP DUMP DONE\n");
+}
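
The merge rule that __bitmap_add() relies on can be looked at in
isolation. The following is a standalone sketch, not code from this patch:
given an accumulated extent and one existing extent, fold the old one in
if the two overlap or touch, and leave the accumulator alone otherwise.
struct extent and extent_absorb() are hypothetical names, and the sketch
assumes start + length does not overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open range [start, start + length). */
struct extent {
	uint64_t	start;
	uint64_t	length;
};

/*
 * Fold @old into @acc if they overlap or are adjacent.
 * Returns true if @acc was grown (or already covered @old).
 */
bool
extent_absorb(struct extent *acc, const struct extent *old)
{
	uint64_t	acc_end = acc->start + acc->length;
	uint64_t	old_end = old->start + old->length;

	/* A gap on either side means there is nothing to merge. */
	if (old_end < acc->start || old->start > acc_end)
		return false;

	/* Take the smaller start and the larger end. */
	if (old->start < acc->start)
		acc->start = old->start;
	if (old_end > acc_end)
		acc_end = old_end;
	acc->length = acc_end - acc->start;
	return true;
}
```

Working with end offsets rather than adding lengths together keeps the
result right when the two extents overlap instead of merely touching:
merging [0, 10) into [5, 15) yields [0, 15), not [0, 20).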
diff --git a/scrub/bitmap.h b/scrub/bitmap.h
new file mode 100644
index 0000000..e3702aa
--- /dev/null
+++ b/scrub/bitmap.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef BITMAP_H_
+#define BITMAP_H_
+
+struct bitmap {
+ pthread_mutex_t bt_lock;
+ struct avl64tree_desc *bt_tree;
+};
+
+bool bitmap_init(struct bitmap *tree);
+void bitmap_free(struct bitmap *tree);
+bool bitmap_add(struct bitmap *tree, uint64_t start, uint64_t length);
+bool bitmap_remove(struct bitmap *tree, uint64_t start,
+ uint64_t len);
+bool bitmap_iterate(struct bitmap *tree,
+ bool (*fn)(uint64_t, uint64_t, void *), void *arg);
+bool bitmap_has_extent(struct bitmap *tree, uint64_t start,
+ uint64_t len);
+bool bitmap_test_and_set(struct bitmap *tree, uint64_t start, bool *was_set);
+bool bitmap_empty(struct bitmap *tree);
+bool bitmap_merge(struct bitmap *thistree, struct bitmap *tree);
+void bitmap_dump(struct bitmap *tree);
+
+#endif /* BITMAP_H_ */
diff --git a/scrub/disk.c b/scrub/disk.c
new file mode 100644
index 0000000..c1136aa
--- /dev/null
+++ b/scrub/disk.c
@@ -0,0 +1,288 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#ifdef HAVE_SG_IO
+# include <scsi/sg.h>
+#endif
+#ifdef HAVE_HDIO_GETGEO
+# include <linux/hdreg.h>
+#endif
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+
+/* Figure out how many disk heads are available. */
+static unsigned int
+__disk_heads(
+ struct disk *disk)
+{
+ int iomin;
+ int ioopt;
+ unsigned short rot;
+ int error;
+
+ /* If it's not a block device, throw all the CPUs at it. */
+ if (!S_ISBLK(disk->d_sb.st_mode))
+ return libxfs_nproc();
+
+ /* Non-rotational device? Throw all the CPUs. */
+ rot = 1;
+ error = ioctl(disk->d_fd, BLKROTATIONAL, &rot);
+ if (error == 0 && rot == 0)
+ return libxfs_nproc();
+
+ /*
+ * Sometimes we can infer the number of devices from the
+ * min/optimal IO sizes.
+ */
+ iomin = ioopt = 0;
+ if (ioctl(disk->d_fd, BLKIOMIN, &iomin) == 0 &&
+ ioctl(disk->d_fd, BLKIOOPT, &ioopt) == 0 &&
+ iomin > 0 && ioopt > 0) {
+ return min(libxfs_nproc(), max(1, ioopt / iomin));
+ }
+
+ /* Rotating device? I guess? */
+ return 2;
+}
+
+/* Figure out how many disk heads are available. */
+unsigned int
+disk_heads(
+ struct disk *disk)
+{
+ if (nr_threads < 0)
+ return __disk_heads(disk);
+ return min(__disk_heads(disk), nr_threads);
+}
+
+/* Execute a SCSI VERIFY(16). We hope. */
+#ifdef HAVE_SG_IO
+# define SENSE_BUF_LEN 64
+# define VERIFY16_CMDLEN 16
+# define VERIFY16_CMD 0x8F
+
+# ifndef SG_FLAG_Q_AT_TAIL
+# define SG_FLAG_Q_AT_TAIL 0x10
+# endif
+static int
+disk_scsi_verify(
+ struct disk *disk,
+ uint64_t startblock, /* lba */
+ uint64_t blockcount) /* lba */
+{
+ struct sg_io_hdr iohdr;
+ unsigned char cdb[VERIFY16_CMDLEN];
+ unsigned char sense[SENSE_BUF_LEN];
+ uint64_t llba;
+ uint64_t veri_len = blockcount;
+ int error;
+
+ assert(!debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"));
+
+ llba = startblock + (disk->d_start >> BBSHIFT);
+
+ /* Borrowed from sg_verify */
+ cdb[0] = VERIFY16_CMD;
+ cdb[1] = 0; /* skip PI, DPO, and byte check. */
+ cdb[2] = (llba >> 56) & 0xff;
+ cdb[3] = (llba >> 48) & 0xff;
+ cdb[4] = (llba >> 40) & 0xff;
+ cdb[5] = (llba >> 32) & 0xff;
+ cdb[6] = (llba >> 24) & 0xff;
+ cdb[7] = (llba >> 16) & 0xff;
+ cdb[8] = (llba >> 8) & 0xff;
+ cdb[9] = llba & 0xff;
+ cdb[10] = (veri_len >> 24) & 0xff;
+ cdb[11] = (veri_len >> 16) & 0xff;
+ cdb[12] = (veri_len >> 8) & 0xff;
+ cdb[13] = veri_len & 0xff;
+ cdb[14] = 0;
+ cdb[15] = 0;
+ memset(sense, 0, SENSE_BUF_LEN);
+
+ /* v3 SG_IO */
+ memset(&iohdr, 0, sizeof(iohdr));
+ iohdr.interface_id = 'S';
+ iohdr.dxfer_direction = SG_DXFER_NONE;
+ iohdr.cmdp = cdb;
+ iohdr.cmd_len = VERIFY16_CMDLEN;
+ iohdr.sbp = sense;
+ iohdr.mx_sb_len = SENSE_BUF_LEN;
+ iohdr.flags |= SG_FLAG_Q_AT_TAIL;
+ iohdr.timeout = 30000; /* 30s */
+
+ error = ioctl(disk->d_fd, SG_IO, &iohdr);
+ if (error)
+ return error;
+
+ dbg_printf("VERIFY(16) fd %d lba %"PRIu64" len %"PRIu64" info %x "
+ "status %d masked %d msg %d host %d driver %d "
+ "duration %d resid %d\n",
+ disk->d_fd, startblock, blockcount, iohdr.info,
+ iohdr.status, iohdr.masked_status, iohdr.msg_status,
+ iohdr.host_status, iohdr.driver_status, iohdr.duration,
+ iohdr.resid);
+
+ if (iohdr.info & SG_INFO_CHECK) {
+ dbg_printf("status: msg %x host %x driver %x\n",
+ iohdr.msg_status, iohdr.host_status,
+ iohdr.driver_status);
+ errno = EIO;
+ return -1;
+ }
+
+ return error;
+}
+#else
+# define disk_scsi_verify(...) (ENOTTY)
+#endif /* HAVE_SG_IO */
+
+/* Test the availability of the SCSI VERIFY command. */
+static bool
+disk_can_scsi_verify(
+ struct disk *disk)
+{
+ int error;
+
+ if (debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"))
+ return false;
+
+ error = disk_scsi_verify(disk, 0, 1);
+ return error == 0;
+}
+
+/* Open a disk device and discover its geometry. */
+int
+disk_open(
+ const char *pathname,
+ struct disk *disk)
+{
+#ifdef HAVE_HDIO_GETGEO
+ struct hd_geometry bdgeo;
+#endif
+ bool suspicious_disk = false;
+ int lba_sz;
+ int error;
+
+ disk->d_fd = open(pathname, O_RDONLY | O_DIRECT | O_NOATIME);
+ if (disk->d_fd < 0)
+ return -1;
+
+ /* Try to get LBA size. */
+ error = ioctl(disk->d_fd, BLKSSZGET, &lba_sz);
+ if (error)
+ lba_sz = 512;
+ disk->d_lbalog = libxfs_log2_roundup(lba_sz);
+
+ /* Obtain disk's stat info. */
+ error = fstat(disk->d_fd, &disk->d_sb);
+ if (error) {
+ error = errno;
+ close(disk->d_fd);
+ errno = error;
+ disk->d_fd = -1;
+ return -1;
+ }
+
+ /* Determine bdev size, block size, and offset. */
+ if (S_ISBLK(disk->d_sb.st_mode)) {
+ error = ioctl(disk->d_fd, BLKGETSIZE64, &disk->d_size);
+ if (error)
+ disk->d_size = 0;
+ error = ioctl(disk->d_fd, BLKBSZGET, &disk->d_blksize);
+ if (error)
+ disk->d_blksize = 0;
+#ifdef HAVE_HDIO_GETGEO
+ error = ioctl(disk->d_fd, HDIO_GETGEO, &bdgeo);
+ if (!error) {
+ /*
+ * dm devices will pass through ioctls, which means
+ * we can't use SCSI VERIFY unless the start is 0.
+ * Most dm devices don't set geometry (unlike scsi
+ * and nvme) so use a zeroed out CHS to screen them
+ * out.
+ */
+ if (bdgeo.start != 0 &&
+ (unsigned long long)bdgeo.heads * bdgeo.sectors *
+ bdgeo.cylinders == 0)
+ suspicious_disk = true;
+ disk->d_start = (uint64_t)bdgeo.start << BBSHIFT;
+ } else
+#endif
+ disk->d_start = 0;
+ } else {
+ disk->d_size = disk->d_sb.st_size;
+ disk->d_blksize = disk->d_sb.st_blksize;
+ disk->d_start = 0;
+ }
+
+ /* Can we issue SCSI VERIFY? */
+ if (!suspicious_disk && disk_can_scsi_verify(disk))
+ disk->d_flags |= DISK_FLAG_SCSI_VERIFY;
+
+ return 0;
+}
+
+/* Close a disk device. */
+int
+disk_close(
+ struct disk *disk)
+{
+ int error = 0;
+
+ if (disk->d_fd >= 0)
+ error = close(disk->d_fd);
+ disk->d_fd = -1;
+ return error;
+}
+
+/* Is this device open? */
+bool
+disk_is_open(
+ struct disk *disk)
+{
+ return disk->d_fd >= 0;
+}
+
+#define BTOLBAT(d, bytes) ((uint64_t)(bytes) >> (d)->d_lbalog)
+#define LBASIZE(d) (1ULL << (d)->d_lbalog)
+#define BTOLBA(d, bytes) (((uint64_t)(bytes) + LBASIZE(d) - 1) >> (d)->d_lbalog)
+
+/* Read-verify an extent of a disk device. */
+ssize_t
+disk_read_verify(
+ struct disk *disk,
+ void *buf,
+ uint64_t start,
+ uint64_t length)
+{
+ /* Convert to logical block size. */
+ if (disk->d_flags & DISK_FLAG_SCSI_VERIFY)
+ return disk_scsi_verify(disk, BTOLBAT(disk, start),
+ BTOLBA(disk, length));
+
+ return pread(disk->d_fd, buf, length, start);
+}
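
The byte-by-byte CDB setup in disk_scsi_verify() is a big-endian packing
of a 64-bit LBA and a 32-bit block count. Here is a standalone sketch of
the same layout written as loops, which may be easier to check against
the field offsets. verify16_fill_cdb() is a hypothetical helper name; the
opcode and command length are the values used in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VERIFY16_CMDLEN	16
#define VERIFY16_CMD	0x8F

/* Fill @cdb with a VERIFY(16) for @nblocks blocks starting at @lba. */
void
verify16_fill_cdb(unsigned char *cdb, uint64_t lba, uint32_t nblocks)
{
	int	i;

	assert(cdb != NULL);

	/* Byte 1 (flags) and bytes 14-15 (group, control) stay zero. */
	memset(cdb, 0, VERIFY16_CMDLEN);
	cdb[0] = VERIFY16_CMD;

	/* Bytes 2-9: 64-bit LBA, most significant byte first. */
	for (i = 0; i < 8; i++)
		cdb[2 + i] = (lba >> (56 - 8 * i)) & 0xff;

	/* Bytes 10-13: 32-bit verification length, also big-endian. */
	for (i = 0; i < 4; i++)
		cdb[10 + i] = (nblocks >> (24 - 8 * i)) & 0xff;
}
```

The length field is only 32 bits wide, so a caller holding a 64-bit block
count would need to split large requests; the sketch takes a uint32_t to
make that limit visible.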
diff --git a/scrub/disk.h b/scrub/disk.h
new file mode 100644
index 0000000..8930075
--- /dev/null
+++ b/scrub/disk.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef DISK_H_
+#define DISK_H_
+
+#define DISK_FLAG_SCSI_VERIFY 0x1
+struct disk {
+ struct stat d_sb;
+ int d_fd;
+ int d_lbalog;
+ unsigned int d_flags;
+ unsigned int d_blksize; /* bytes */
+ uint64_t d_size; /* bytes */
+ uint64_t d_start; /* bytes */
+};
+
+unsigned int disk_heads(struct disk *disk);
+bool disk_is_open(struct disk *disk);
+int disk_open(const char *pathname, struct disk *disk);
+int disk_close(struct disk *disk);
+ssize_t disk_read_verify(struct disk *disk, void *buf, uint64_t start,
+ uint64_t length);
+
+#endif /* DISK_H_ */
diff --git a/scrub/iocmd.c b/scrub/iocmd.c
new file mode 100644
index 0000000..504396c
--- /dev/null
+++ b/scrub/iocmd.c
@@ -0,0 +1,239 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/xattr.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "iocmd.h"
+
+#define NR_EXTENTS 512
+
+/* Scan a filesystem tree. */
+struct scan_fs_tree {
+ unsigned int nr_dirs;
+ pthread_mutex_t lock;
+ pthread_cond_t wakeup;
+ struct stat root_sb;
+ bool moveon;
+ bool (*dir_fn)(struct scrub_ctx *, const char *,
+ int, void *);
+ bool (*dirent_fn)(struct scrub_ctx *, const char *,
+ int, struct dirent *,
+ struct stat *, void *);
+ void *arg;
+};
+
+/* Per-work-item scan context. */
+struct scan_fs_tree_dir {
+ char *path;
+ struct scan_fs_tree *sft;
+ bool rootdir;
+};
+
+/* Scan a directory sub tree. */
+static void
+scan_fs_dir(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->mp;
+ struct scan_fs_tree_dir *sftd = arg;
+ struct scan_fs_tree *sft = sftd->sft;
+ DIR *dir;
+ struct dirent *dirent;
+ char newpath[PATH_MAX];
+ struct scan_fs_tree_dir *new_sftd;
+ struct stat sb;
+ int dir_fd;
+ int error;
+
+ /* Open the directory. */
+ dir_fd = open(sftd->path, O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+ if (dir_fd < 0) {
+ if (errno != ENOENT)
+ str_errno(ctx, sftd->path);
+ goto out;
+ }
+
+ /* Caller-specific directory checks. */
+ if (sft->dir_fn && !sft->dir_fn(ctx, sftd->path, dir_fd, sft->arg)) {
+ sft->moveon = false;
+ close(dir_fd);
+ goto out;
+ }
+
+ /* Caller-specific directory entry function on the rootdir. */
+ if (sftd->rootdir) {
+ /* Get the stat info for this directory entry. */
+ error = fstat(dir_fd, &sb);
+ if (error) {
+ str_errno(ctx, sftd->path);
+ close(dir_fd);
+ goto out;
+ }
+ if (!sft->dirent_fn(ctx, sftd->path, dir_fd, NULL, &sb,
+ sft->arg)) {
+ sft->moveon = false;
+ close(dir_fd);
+ goto out;
+ }
+ }
+
+ /* Iterate the directory entries; closedir() releases dir_fd. */
+ dir = fdopendir(dir_fd);
+ if (!dir) {
+ str_errno(ctx, sftd->path);
+ close(dir_fd);
+ goto out;
+ }
+ rewinddir(dir);
+ for (dirent = readdir(dir); dirent != NULL; dirent = readdir(dir)) {
+ snprintf(newpath, PATH_MAX, "%s/%s", sftd->path,
+ dirent->d_name);
+
+ /* Get the stat info for this directory entry. */
+ error = fstatat(dir_fd, dirent->d_name, &sb,
+ AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW);
+ if (error) {
+ str_errno(ctx, newpath);
+ continue;
+ }
+
+ /* Ignore files on other filesystems. */
+ if (sb.st_dev != sft->root_sb.st_dev)
+ continue;
+
+ /* Caller-specific directory entry function. */
+ if (!sft->dirent_fn(ctx, newpath, dir_fd, dirent, &sb,
+ sft->arg)) {
+ sft->moveon = false;
+ break;
+ }
+
+ if (xfs_scrub_excessive_errors(ctx)) {
+ sft->moveon = false;
+ break;
+ }
+
+ /* If directory, call ourselves recursively. */
+ if (S_ISDIR(sb.st_mode) && strcmp(".", dirent->d_name) &&
+ strcmp("..", dirent->d_name)) {
+ new_sftd = malloc(sizeof(struct scan_fs_tree_dir));
+ if (!new_sftd) {
+ str_errno(ctx, newpath);
+ sft->moveon = false;
+ break;
+ }
+ new_sftd->path = strdup(newpath);
+ new_sftd->sft = sft;
+ new_sftd->rootdir = false;
+ pthread_mutex_lock(&sft->lock);
+ sft->nr_dirs++;
+ pthread_mutex_unlock(&sft->lock);
+ queue_work(wq, scan_fs_dir, 0, new_sftd);
+ }
+ }
+
+ /* Close dir, go away. */
+ error = closedir(dir);
+ if (error)
+ str_errno(ctx, sftd->path);
+
+out:
+ pthread_mutex_lock(&sft->lock);
+ sft->nr_dirs--;
+ if (sft->nr_dirs == 0)
+ pthread_cond_signal(&sft->wakeup);
+ pthread_mutex_unlock(&sft->lock);
+
+ free(sftd->path);
+ free(sftd);
+}
+
+/* Scan the entire filesystem. */
+bool
+scan_fs_tree(
+ struct scrub_ctx *ctx,
+ bool (*dir_fn)(struct scrub_ctx *, const char *,
+ int, void *),
+ bool (*dirent_fn)(struct scrub_ctx *, const char *,
+ int, struct dirent *,
+ struct stat *, void *),
+ void *arg)
+{
+ struct work_queue wq;
+ struct scan_fs_tree sft;
+ struct scan_fs_tree_dir *sftd;
+
+ sft.moveon = true;
+ sft.nr_dirs = 1;
+ sft.root_sb = ctx->mnt_sb;
+ sft.dir_fn = dir_fn;
+ sft.dirent_fn = dirent_fn;
+ sft.arg = arg;
+ pthread_mutex_init(&sft.lock, NULL);
+ pthread_cond_init(&sft.wakeup, NULL);
+
+ sftd = malloc(sizeof(struct scan_fs_tree_dir));
+ if (!sftd) {
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+ sftd->path = strdup(ctx->mntpoint);
+ sftd->sft = &sft;
+ sftd->rootdir = true;
+
+ create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+ queue_work(&wq, scan_fs_dir, 0, sftd);
+
+ pthread_mutex_lock(&sft.lock);
+ while (sft.nr_dirs > 0)
+ pthread_cond_wait(&sft.wakeup, &sft.lock);
+ pthread_mutex_unlock(&sft.lock);
+ destroy_work_queue(&wq);
+
+ return sft.moveon;
+}
+
+#ifndef FITRIM
+struct fstrim_range {
+ __u64 start;
+ __u64 len;
+ __u64 minlen;
+};
+#define FITRIM _IOWR('X', 121, struct fstrim_range) /* Trim */
+#endif
+
+/* Call FITRIM to trim all the unused space in a filesystem. */
+void
+fstrim(
+ struct scrub_ctx *ctx)
+{
+ struct fstrim_range range = {0};
+ int error;
+
+ range.len = ULLONG_MAX;
+ error = ioctl(ctx->mnt_fd, FITRIM, &range);
+ if (error && errno != EOPNOTSUPP && errno != ENOTTY)
+ perror(_("fstrim"));
+}
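
A note on the wait at the end of scan_fs_tree() above: a worker may drop
nr_dirs to zero and signal before the main thread ever reaches
pthread_cond_wait(), and POSIX also permits spurious wakeups, so the usual
pattern is to recheck the counter in a loop while holding the lock. The
following is a minimal standalone sketch of that pattern, not code from this
patch. struct pending and the pending_*() helpers are hypothetical names made
up for illustration only.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper (not in the patch): counts outstanding work items. */
struct pending {
	pthread_mutex_t	lock;
	pthread_cond_t	wakeup;
	unsigned int	nr;
};

static void pending_init(struct pending *p, unsigned int nr)
{
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->wakeup, NULL);
	p->nr = nr;
}

/* Called before queueing one more item, like sft->nr_dirs++ above. */
static void pending_add(struct pending *p)
{
	pthread_mutex_lock(&p->lock);
	p->nr++;
	pthread_mutex_unlock(&p->lock);
}

/* Called when an item finishes; wakes the waiter once the count drains. */
static void pending_done(struct pending *p)
{
	pthread_mutex_lock(&p->lock);
	assert(p->nr > 0);
	if (--p->nr == 0)
		pthread_cond_signal(&p->wakeup);
	pthread_mutex_unlock(&p->lock);
}

/*
 * Block until the count is zero.  Because the predicate is tested before
 * sleeping, a signal that was sent earlier is not lost, and a spurious
 * wakeup simply goes back to sleep.  Returns the count seen on exit.
 */
static unsigned int pending_wait(struct pending *p)
{
	unsigned int left;

	pthread_mutex_lock(&p->lock);
	while (p->nr > 0)
		pthread_cond_wait(&p->wakeup, &p->lock);
	left = p->nr;
	pthread_mutex_unlock(&p->lock);
	return left;
}
```

If every pending_done() call has already run by the time pending_wait() is
entered, the loop body is never executed and the call returns immediately,
which is the case an unconditional pthread_cond_wait() would sleep through.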
diff --git a/scrub/iocmd.h b/scrub/iocmd.h
new file mode 100644
index 0000000..047e5fc
--- /dev/null
+++ b/scrub/iocmd.h
@@ -0,0 +1,50 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef IOCMD_H_
+#define IOCMD_H_
+
+struct fiemap_extent;
+
+bool
+scan_fs_tree(
+ struct scrub_ctx *ctx,
+ bool (*dir_fn)(struct scrub_ctx *, const char *,
+ int, void *),
+ bool (*dirent_fn)(struct scrub_ctx *, const char *,
+ int, struct dirent *,
+ struct stat *, void *),
+ void *arg);
+
+bool
+fiemap(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ int fd,
+ bool attr_fork,
+ bool fibmap,
+ bool (*fn)(struct scrub_ctx *, const char *,
+ struct fiemap_extent *, void *),
+ void *arg);
+
+void
+fstrim(
+ struct scrub_ctx *ctx);
+
+#endif /* IOCMD_H_ */
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
new file mode 100644
index 0000000..48de6e1
--- /dev/null
+++ b/scrub/read_verify.c
@@ -0,0 +1,316 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+
+/* How many bytes have we verified? */
+static pthread_mutex_t verified_lock = PTHREAD_MUTEX_INITIALIZER;
+static unsigned long long verified_bytes;
+
+/* Tolerate 64k holes in adjacent read verify requests. */
+#define IO_BATCH_LOCALITY (65536)
+
+/* Create a thread pool to run read verifiers. */
+bool
+read_verify_pool_init(
+ struct read_verify_pool *rvp,
+ struct scrub_ctx *ctx,
+ void *readbuf,
+ size_t readbufsz,
+ size_t min_io_sz,
+ read_verify_ioend_fn_t ioend_fn,
+ unsigned int nproc)
+{
+ rvp->rvp_readbuf = readbuf;
+ rvp->rvp_readbufsz = readbufsz;
+ rvp->rvp_ctx = ctx;
+ rvp->rvp_min_io_size = min_io_sz;
+ rvp->ioend_fn = ioend_fn;
+ rvp->rvp_nproc = nproc;
+ create_work_queue(&rvp->rvp_wq, (struct xfs_mount *)rvp, nproc);
+ return true;
+}
+
+/* How many bytes has this process verified? */
+unsigned long long
+read_verify_bytes(void)
+{
+ return verified_bytes;
+}
+
+/* Finish up any read verification work and tear it down. */
+void
+read_verify_pool_destroy(
+ struct read_verify_pool *rvp)
+{
+ destroy_work_queue(&rvp->rvp_wq);
+ memset(&rvp->rvp_wq, 0, sizeof(struct work_queue));
+}
+
+/*
+ * Issue a read-verify IO in big batches.
+ */
+static void
+read_verify(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct read_verify *rv = arg;
+ struct read_verify_pool *rvp;
+ unsigned long long verified = 0;
+ ssize_t sz;
+ ssize_t len;
+
+ rvp = (struct read_verify_pool *)wq->mp;
+ while (rv->io_length > 0) {
+ len = min(rv->io_length, rvp->rvp_readbufsz);
+ dbg_printf("diskverify %d %"PRIu64" %zd\n", rv->io_disk->d_fd,
+ rv->io_start, len);
+ sz = disk_read_verify(rv->io_disk, rvp->rvp_readbuf,
+ rv->io_start, len);
+ if (sz < 0) {
+ dbg_printf("IOERR %d %"PRIu64" %zd\n",
+ rv->io_disk->d_fd,
+ rv->io_start, len);
+ rvp->ioend_fn(rvp, rv->io_disk, rv->io_start,
+ rvp->rvp_min_io_size,
+ errno, rv->io_end_arg);
+ len = min(rv->io_length, rvp->rvp_min_io_size);
+ }
+
+ verified += len;
+ rv->io_start += len;
+ rv->io_length -= len;
+ }
+
+ free(rv);
+ pthread_mutex_lock(&verified_lock);
+ verified_bytes += verified;
+ pthread_mutex_unlock(&verified_lock);
+}
+
+/* Queue a read verify request. */
+static void
+read_verify_queue(
+ struct read_verify_pool *rvp,
+ struct read_verify *rv)
+{
+ struct read_verify *tmp;
+
+ dbg_printf("verify fd %d start %"PRIu64" len %"PRIu64"\n",
+ rv->io_disk->d_fd, rv->io_start, rv->io_length);
+
+ tmp = malloc(sizeof(struct read_verify));
+ if (!tmp) {
+ rvp->ioend_fn(rvp, rv->io_disk, rv->io_start, rv->io_length,
+ errno, rv->io_end_arg);
+ return;
+ }
+ *tmp = *rv;
+
+ queue_work(&rvp->rvp_wq, read_verify, 0, tmp);
+}
+
+/*
+ * Issue an IO request. We'll batch subsequent requests if they're
+ * within 64k of each other.
+ */
+void
+read_verify_schedule(
+ struct read_verify_pool *rvp,
+ struct read_verify *rv,
+ struct disk *disk,
+ uint64_t start,
+ uint64_t length,
+ void *end_arg)
+{
+ uint64_t ve_end;
+ uint64_t io_end;
+
+ assert(rvp->rvp_readbuf);
+ ve_end = start + length;
+ io_end = rv->io_start + rv->io_length;
+
+ /*
+ * If we have a stashed IO, we haven't changed fds, the error
+ * reporting is the same, and the two extents are close,
+ * we can combine them.
+ */
+ if (rv->io_length > 0 && disk == rv->io_disk &&
+ end_arg == rv->io_end_arg &&
+ ((start >= rv->io_start && start <= io_end + IO_BATCH_LOCALITY) ||
+ (rv->io_start >= start &&
+ rv->io_start <= ve_end + IO_BATCH_LOCALITY))) {
+ rv->io_start = min(rv->io_start, start);
+ rv->io_length = max(ve_end, io_end) - rv->io_start;
+ } else {
+ /* Otherwise, issue the stashed IO (if there is one) */
+ if (rv->io_length > 0)
+ read_verify_queue(rvp, rv);
+
+ /* Stash the new IO. */
+ rv->io_disk = disk;
+ rv->io_start = start;
+ rv->io_length = length;
+ rv->io_end_arg = end_arg;
+ }
+}
+
+/* Force any stashed IOs into the verifier. */
+void
+read_verify_force(
+ struct read_verify_pool *rvp,
+ struct read_verify *rv)
+{
+ assert(rvp->rvp_readbuf);
+ if (rv->io_length == 0)
+ return;
+
+ read_verify_queue(rvp, rv);
+ rv->io_length = 0;
+}
+
+/* Read all the data in a file. */
+bool
+read_verify_file(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ int fd,
+ struct stat *sb)
+{
+ off_t data_end = 0;
+ off_t data_start;
+ off_t start;
+ ssize_t sz;
+ size_t count;
+ unsigned long long verified = 0;
+ bool reports_holes = true;
+ bool direct_io = false;
+ bool moveon = true;
+ int flags;
+ int error;
+
+ /*
+ * Try to force the kernel to read file data from disk. First
+ * we try to set O_DIRECT. If that fails, try to purge the page
+ * cache.
+ */
+ flags = fcntl(fd, F_GETFL);
+ error = fcntl(fd, F_SETFL, flags | O_DIRECT);
+ if (error)
+ posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_DONTNEED);
+ else
+ direct_io = true;
+
+ /* See if SEEK_DATA/SEEK_HOLE work... */
+ data_start = lseek(fd, data_end, SEEK_DATA);
+ if (data_start < 0) {
+ /* ENXIO for SEEK_DATA means no file data anywhere. */
+ if (errno == ENXIO)
+ return true;
+ reports_holes = false;
+ }
+
+ if (reports_holes) {
+ data_end = lseek(fd, data_start, SEEK_HOLE);
+ if (data_end < 0)
+ reports_holes = false;
+ }
+
+ /* ...or just read everything if they don't. */
+ if (!reports_holes) {
+ data_start = 0;
+ data_end = sb->st_size;
+ }
+
+ if (!direct_io) {
+ posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_SEQUENTIAL);
+ posix_fadvise(fd, 0, sb->st_size, POSIX_FADV_WILLNEED);
+ }
+ /* Read the non-hole areas. */
+ while (data_start < data_end) {
+ start = data_start;
+
+ if (direct_io && (start & (page_size - 1)))
+ start &= ~(page_size - 1);
+ count = min(IO_MAX_SIZE, data_end - start);
+ if (direct_io && (count & (page_size - 1)))
+ count = (count + page_size) & ~(page_size - 1);
+ sz = pread(fd, ctx->readbuf, count, start);
+ if (sz < 0) {
+ str_errno(ctx, descr);
+ break;
+ } else if (sz == 0) {
+ str_error(ctx, descr,
+_("Read zero bytes, expected %zu."),
+ count);
+ break;
+ } else if (sz != count && start + sz != data_end) {
+ str_warn(ctx, descr,
+_("Short read of %zd bytes, expected %zu."),
+ sz, count);
+ }
+ verified += sz;
+ data_start = start + sz;
+
+ if (xfs_scrub_excessive_errors(ctx)) {
+ moveon = false;
+ break;
+ }
+
+ if (data_start >= data_end && reports_holes) {
+ data_start = lseek(fd, data_end, SEEK_DATA);
+ if (data_start < 0) {
+ if (errno != ENXIO)
+ str_errno(ctx, descr);
+ break;
+ }
+ data_end = lseek(fd, data_start, SEEK_HOLE);
+ if (data_end < 0) {
+ if (errno != ENXIO)
+ str_errno(ctx, descr);
+ break;
+ }
+ }
+ }
+
+ /* Turn off O_DIRECT. */
+ if (direct_io) {
+ flags = fcntl(fd, F_GETFL);
+ error = fcntl(fd, F_SETFL, flags & ~O_DIRECT);
+ if (error)
+ str_errno(ctx, descr);
+ }
+
+ pthread_mutex_lock(&verified_lock);
+ verified_bytes += verified;
+ pthread_mutex_unlock(&verified_lock);
+
+ return moveon;
+}
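
For reviewers, the batching rule in read_verify_schedule() above can be
looked at in isolation. The following is a rough standalone sketch of just
the merge test, assuming the same 64k tolerance as IO_BATCH_LOCALITY. It
leaves out the disk and end_arg comparisons that the real function also
makes, and struct extent and extent_try_merge() are hypothetical names that
do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same 64k tolerance as IO_BATCH_LOCALITY in the patch. */
#define BATCH_LOCALITY	65536ULL

/* Hypothetical stand-in for the io_start/io_length pair in read_verify. */
struct extent {
	uint64_t	start;	/* bytes */
	uint64_t	length;	/* bytes; zero means nothing is stashed */
};

/*
 * Try to fold [start, start + length) into the stashed extent.  Returns
 * true and widens *stash if the two ranges overlap or sit within
 * BATCH_LOCALITY bytes of each other; otherwise leaves *stash untouched.
 */
static bool
extent_try_merge(
	struct extent	*stash,
	uint64_t	start,
	uint64_t	length)
{
	uint64_t	new_end = start + length;
	uint64_t	old_end = stash->start + stash->length;
	uint64_t	lo;
	uint64_t	hi;

	assert(length > 0);
	if (stash->length == 0)
		return false;

	if (!((start >= stash->start &&
	       start <= old_end + BATCH_LOCALITY) ||
	      (stash->start >= start &&
	       stash->start <= new_end + BATCH_LOCALITY)))
		return false;

	lo = stash->start < start ? stash->start : start;
	hi = old_end > new_end ? old_end : new_end;
	stash->start = lo;
	stash->length = hi - lo;
	return true;
}
```

Note that a merge also covers the gap between the two ranges, so the
verifier reads those bytes as well. That is the trade made for issuing
fewer, larger reads.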
diff --git a/scrub/read_verify.h b/scrub/read_verify.h
new file mode 100644
index 0000000..a10ba8c
--- /dev/null
+++ b/scrub/read_verify.h
@@ -0,0 +1,59 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef READ_VERIFY_H_
+#define READ_VERIFY_H_
+
+struct read_verify_pool;
+
+typedef void (*read_verify_ioend_fn_t)(struct read_verify_pool *rvp,
+ struct disk *disk, uint64_t start, uint64_t length,
+ int error, void *arg);
+typedef void (*read_verify_ioend_arg_free_fn_t)(void *arg);
+
+struct read_verify_pool {
+ struct work_queue rvp_wq;
+ struct scrub_ctx *rvp_ctx;
+ void *rvp_readbuf;
+ read_verify_ioend_fn_t ioend_fn;
+ read_verify_ioend_arg_free_fn_t ioend_arg_free_fn;
+ size_t rvp_readbufsz; /* bytes */
+ size_t rvp_min_io_size; /* bytes */
+ int rvp_nproc;
+};
+
+bool read_verify_pool_init(struct read_verify_pool *rvp, struct scrub_ctx *ctx,
+ void *readbuf, size_t readbufsz, size_t min_io_sz,
+ read_verify_ioend_fn_t ioend_fn, unsigned int nproc);
+void read_verify_pool_destroy(struct read_verify_pool *rvp);
+
+struct read_verify {
+ void *io_end_arg;
+ struct disk *io_disk;
+ uint64_t io_start; /* bytes */
+ uint64_t io_length; /* bytes */
+};
+
+void read_verify_schedule(struct read_verify_pool *rvp, struct read_verify *rv,
+ struct disk *disk, uint64_t start, uint64_t length,
+ void *end_arg);
+void read_verify_force(struct read_verify_pool *rvp, struct read_verify *rv);
+unsigned long long read_verify_bytes(void);
+
+#endif /* READ_VERIFY_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
new file mode 100644
index 0000000..013559a
--- /dev/null
+++ b/scrub/scrub.c
@@ -0,0 +1,950 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <mntent.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/statvfs.h>
+#include <sys/vfs.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+
+#define _PATH_PROC_MOUNTS "/proc/mounts"
+
+bool verbose;
+int debug;
+bool scrub_data;
+bool dumpcore;
+bool display_rusage;
+long page_size;
+int nr_threads = -1;
+enum errors_action error_action = ERRORS_CONTINUE;
+static unsigned long max_errors;
+
+static void __attribute__((noreturn))
+usage(void)
+{
+ fprintf(stderr, _("Usage: %s [OPTIONS] mountpoint\n"), progname);
+ fprintf(stderr, _("-a:\tStop after this many errors are found.\n"));
+ fprintf(stderr, _("-d:\tRun program in debug mode.\n"));
+ fprintf(stderr, _("-e:\tWhat to do if errors are found.\n"));
+ fprintf(stderr, _("-j:\tStart no more than this many threads.\n"));
+ fprintf(stderr, _("-m:\tPath to /etc/mtab.\n"));
+ fprintf(stderr, _("-n:\tDry run. Do not modify anything.\n"));
+ fprintf(stderr, _("-T:\tDisplay timing/usage information.\n"));
+ fprintf(stderr, _("-v:\tVerbose output.\n"));
+ fprintf(stderr, _("-V:\tPrint version.\n"));
+ fprintf(stderr, _("-x:\tScrub file data too.\n"));
+ fprintf(stderr, _("-y:\tRepair all errors.\n"));
+
+ exit(16);
+}
+
+/*
+ * Check if the argument is either the device name or mountpoint of a mounted
+ * filesystem.
+ */
+#define MNTTYPE_XFS "xfs"
+static bool
+find_mountpoint_check(
+ struct stat *sb,
+ struct mntent *t)
+{
+ struct stat ms;
+
+ if (S_ISDIR(sb->st_mode)) { /* mount point */
+ if (stat(t->mnt_dir, &ms) < 0)
+ return false;
+ if (sb->st_ino != ms.st_ino)
+ return false;
+ if (sb->st_dev != ms.st_dev)
+ return false;
+ if (strcmp(t->mnt_type, MNTTYPE_XFS) != 0)
+ return false;
+ } else { /* device */
+ if (stat(t->mnt_fsname, &ms) < 0)
+ return false;
+ if (sb->st_rdev != ms.st_rdev)
+ return false;
+ if (strcmp(t->mnt_type, MNTTYPE_XFS) != 0)
+ return false;
+ /*
+ * Make sure the mountpoint given by mtab is accessible
+ * before using it.
+ */
+ if (stat(t->mnt_dir, &ms) < 0)
+ return false;
+ }
+
+ return true;
+}
+
+/* Check that our alleged mountpoint is in mtab */
+static bool
+find_mountpoint(
+ char *mtab,
+ struct scrub_ctx *ctx)
+{
+ struct mntent_cursor cursor;
+ struct mntent *t = NULL;
+ bool found = false;
+
+ if (platform_mntent_open(&cursor, mtab) != 0) {
+ fprintf(stderr, _("Error: can't get mntent entries.\n"));
+ exit(1);
+ }
+
+ while ((t = platform_mntent_next(&cursor)) != NULL) {
+ /*
+ * Keep jotting down matching mount details; newer mounts are
+ * towards the end of the file (hopefully).
+ */
+ if (find_mountpoint_check(&ctx->mnt_sb, t)) {
+ ctx->mntpoint = strdup(t->mnt_dir);
+ ctx->mnt_type = strdup(t->mnt_type);
+ ctx->blkdev = strdup(t->mnt_fsname);
+ found = true;
+ }
+ }
+ platform_mntent_close(&cursor);
+ return found;
+}
+
+/* Too many errors? Bail out. */
+bool
+xfs_scrub_excessive_errors(
+ struct scrub_ctx *ctx)
+{
+ bool ret;
+
+ pthread_mutex_lock(&ctx->lock);
+ ret = max_errors > 0 && ctx->errors_found >= max_errors;
+ pthread_mutex_unlock(&ctx->lock);
+
+ return ret;
+}
+
+/* Print a string and whatever error is stored in errno. */
+void
+__str_errno(
+ struct scrub_ctx *ctx,
+ const char *str,
+ const char *file,
+ int line)
+{
+ char buf[DESCR_BUFSZ];
+
+ pthread_mutex_lock(&ctx->lock);
+ fprintf(stderr, "%s: %s.", str, strerror_r(errno, buf, DESCR_BUFSZ));
+ if (debug)
+ fprintf(stderr, " (%s line %d)", file, line);
+ fprintf(stderr, "\n");
+ ctx->errors_found++;
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some error text. */
+void
+__str_error(
+ struct scrub_ctx *ctx,
+ const char *str,
+ const char *file,
+ int line,
+ const char *format,
+ ...)
+{
+ va_list args;
+
+ pthread_mutex_lock(&ctx->lock);
+ fprintf(stderr, "%s: ", str);
+ va_start(args, format);
+ vfprintf(stderr, format, args);
+ va_end(args);
+ if (debug)
+ fprintf(stderr, " (%s line %d)", file, line);
+ fprintf(stderr, "\n");
+ ctx->errors_found++;
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some warning text. */
+void
+__str_warn(
+ struct scrub_ctx *ctx,
+ const char *str,
+ const char *file,
+ int line,
+ const char *format,
+ ...)
+{
+ va_list args;
+
+ pthread_mutex_lock(&ctx->lock);
+ fprintf(stderr, "%s: ", str);
+ va_start(args, format);
+ vfprintf(stderr, format, args);
+ va_end(args);
+ if (debug)
+ fprintf(stderr, " (%s line %d)", file, line);
+ fprintf(stderr, "\n");
+ ctx->warnings_found++;
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a string and some informational text. */
+void
+__str_info(
+ struct scrub_ctx *ctx,
+ const char *str,
+ const char *file,
+ int line,
+ const char *format,
+ ...)
+{
+ va_list args;
+
+ pthread_mutex_lock(&ctx->lock);
+ fprintf(stdout, "%s: ", str);
+ va_start(args, format);
+ vfprintf(stdout, format, args);
+ va_end(args);
+ if (debug)
+ fprintf(stdout, " (%s line %d)", file, line);
+ fprintf(stdout, "\n");
+ fflush(stdout);
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Increment the repair count. */
+void
+__record_repair(
+ struct scrub_ctx *ctx,
+ const char *str,
+ const char *file,
+ int line,
+ const char *format,
+ ...)
+{
+ va_list args;
+
+ pthread_mutex_lock(&ctx->lock);
+ fprintf(stderr, "%s: ", str);
+ va_start(args, format);
+ vfprintf(stderr, format, args);
+ va_end(args);
+ if (debug)
+ fprintf(stderr, " (%s line %d)", file, line);
+ fprintf(stderr, "\n");
+ ctx->repairs++;
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Increment the optimization (preening) count. */
+void
+__record_preen(
+ struct scrub_ctx *ctx,
+ const char *str,
+ const char *file,
+ int line,
+ const char *format,
+ ...)
+{
+ va_list args;
+
+ pthread_mutex_lock(&ctx->lock);
+ if (debug || verbose) {
+ fprintf(stdout, "%s: ", str);
+ va_start(args, format);
+ vfprintf(stdout, format, args);
+ va_end(args);
+ if (debug)
+ fprintf(stdout, " (%s line %d)", file, line);
+ fprintf(stdout, "\n");
+ fflush(stdout);
+ }
+ ctx->preens++;
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+void __attribute__((noreturn))
+do_error(char const *msg, ...)
+{
+ va_list args;
+
+ fprintf(stderr, _("\nfatal error -- "));
+
+ va_start(args, msg);
+ vfprintf(stderr, msg, args);
+ if (dumpcore)
+ abort();
+ exit(1);
+}
+
+/* How many threads to kick off? */
+unsigned int
+scrub_nproc(
+ struct scrub_ctx *ctx)
+{
+ if (nr_threads < 0)
+ return ctx->nr_io_threads;
+ return min(ctx->nr_io_threads, nr_threads);
+}
+
+/* Decide if a value is within +/- (n/d) of a desired value. */
+bool
+within_range(
+ struct scrub_ctx *ctx,
+ unsigned long long value,
+ unsigned long long desired,
+ unsigned long long diff_threshold,
+ unsigned int n,
+ unsigned int d,
+ const char *descr)
+{
+ assert(n < d);
+
+ /* Don't complain if difference does not exceed an absolute value. */
+ if (value < desired && desired - value < diff_threshold)
+ return true;
+ if (value > desired && value - desired < diff_threshold)
+ return true;
+
+ /* Complain if the difference exceeds a certain percentage. */
+ if (value < desired * (d - n) / d) {
+ str_warn(ctx, ctx->mntpoint,
+_("Found fewer %s than reported"), descr);
+ return false;
+ }
+ if (value > desired * (d + n) / d) {
+ str_warn(ctx, ctx->mntpoint,
+_("Found more %s than reported"), descr);
+ return false;
+ }
+ return true;
+}
+
+static double
+timeval_subtract(
+ struct timeval *tv1,
+ struct timeval *tv2)
+{
+ return ((tv1->tv_sec - tv2->tv_sec) +
+ ((double) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
+}
+
+/* Produce human readable disk space output. */
+double
+auto_space_units(
+ unsigned long long bytes,
+ char **units)
+{
+ if (debug > 1)
+ goto no_prefix;
+ if (bytes > (1ULL << 40)) {
+ *units = "TiB";
+ return (double)bytes / (1ULL << 40);
+ } else if (bytes > (1ULL << 30)) {
+ *units = "GiB";
+ return (double)bytes / (1ULL << 30);
+ } else if (bytes > (1ULL << 20)) {
+ *units = "MiB";
+ return (double)bytes / (1ULL << 20);
+ } else if (bytes > (1ULL << 10)) {
+ *units = "KiB";
+ return (double)bytes / (1ULL << 10);
+ } else {
+no_prefix:
+ *units = "B";
+ return bytes;
+ }
+}
+
+/* Produce human readable discrete number output. */
+double
+auto_units(
+ unsigned long long number,
+ char **units)
+{
+ if (debug > 1)
+ goto no_prefix;
+ if (number > 1000000000000ULL) {
+ *units = "T";
+ return number / 1000000000000.0;
+ } else if (number > 1000000000ULL) {
+ *units = "G";
+ return number / 1000000000.0;
+ } else if (number > 1000000ULL) {
+ *units = "M";
+ return number / 1000000.0;
+ } else if (number > 1000ULL) {
+ *units = "K";
+ return number / 1000.0;
+ } else {
+no_prefix:
+ *units = "";
+ return number;
+ }
+}
+
+/*
+ * Given a directory fd and (possibly) a dirent, open the file associated
+ * with the entry. If the entry is null, just duplicate the dir_fd.
+ */
+int
+dirent_open(
+ int dir_fd,
+ struct dirent *dirent)
+{
+ if (!dirent)
+ return dup(dir_fd);
+ return openat(dir_fd, dirent->d_name,
+ O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+}
+
+#ifndef RUSAGE_BOTH
+# define RUSAGE_BOTH (-2)
+#endif
+
+/* Get resource usage for ourselves and all children. */
+int
+scrub_getrusage(
+ struct rusage *usage)
+{
+ struct rusage cusage;
+ int err;
+
+ err = getrusage(RUSAGE_BOTH, usage);
+ if (!err)
+ return err;
+
+ err = getrusage(RUSAGE_SELF, usage);
+ if (err)
+ return err;
+
+ err = getrusage(RUSAGE_CHILDREN, &cusage);
+ if (err)
+ return err;
+
+ usage->ru_minflt += cusage.ru_minflt;
+ usage->ru_majflt += cusage.ru_majflt;
+ usage->ru_nswap += cusage.ru_nswap;
+ usage->ru_inblock += cusage.ru_inblock;
+ usage->ru_oublock += cusage.ru_oublock;
+ usage->ru_msgsnd += cusage.ru_msgsnd;
+ usage->ru_msgrcv += cusage.ru_msgrcv;
+ usage->ru_nsignals += cusage.ru_nsignals;
+ usage->ru_nvcsw += cusage.ru_nvcsw;
+ usage->ru_nivcsw += cusage.ru_nivcsw;
+ return 0;
+}
+
+struct phase_info {
+ struct rusage ruse;
+ struct timeval time;
+ unsigned long long verified_bytes;
+ void *brk_start;
+ const char *tag;
+};
+
+/* Start tracking resource usage for a phase. */
+static bool
+phase_start(
+ struct phase_info *pi,
+ const char *tag,
+ const char *descr)
+{
+ int error;
+
+ error = scrub_getrusage(&pi->ruse);
+ if (error) {
+ perror(_("getrusage"));
+ return false;
+ }
+ pi->brk_start = sbrk(0);
+
+ error = gettimeofday(&pi->time, NULL);
+ if (error) {
+ perror(_("gettimeofday"));
+ return false;
+ }
+ pi->tag = tag;
+
+ pi->verified_bytes = read_verify_bytes();
+
+ if ((verbose || display_rusage) && descr) {
+ fprintf(stdout, _("%s%s\n"), pi->tag, descr);
+ fflush(stdout);
+ }
+ return true;
+}
+
+/* Report usage stats. */
+static bool
+phase_end(
+ struct phase_info *pi)
+{
+ struct rusage ruse_now;
+#ifdef HAVE_MALLINFO
+ struct mallinfo mall_now;
+#endif
+ struct timeval time_now;
+ double dt;
+ unsigned long long verified;
+ long in, out;
+ long io;
+ double i, o, t;
+ double din, dout, dtot;
+ char *iu, *ou, *tu, *dinu, *doutu, *dtotu;
+ double v, dv;
+ char *vu, *dvu;
+ int error;
+
+ if (!display_rusage)
+ return true;
+
+ error = gettimeofday(&time_now, NULL);
+ if (error) {
+ perror(_("gettimeofday"));
+ return false;
+ }
+ dt = timeval_subtract(&time_now, &pi->time);
+
+ error = scrub_getrusage(&ruse_now);
+ if (error) {
+ perror(_("getrusage"));
+ return false;
+ }
+
+#define kbytes(x) (((unsigned long)(x) + 1023) / 1024)
+#ifdef HAVE_MALLINFO
+
+ mall_now = mallinfo();
+ fprintf(stdout, _("%sMemory used: %luk/%luk (%luk/%luk), "), pi->tag,
+ kbytes(mall_now.arena), kbytes(mall_now.hblkhd),
+ kbytes(mall_now.uordblks), kbytes(mall_now.fordblks));
+#else
+ fprintf(stdout, _("%sMemory used: %luk, "), pi->tag,
+ (unsigned long) kbytes(((char *) sbrk(0)) -
+ ((char *) pi->brk_start)));
+#endif
+#undef kbytes
+
+ fprintf(stdout, _("time: %5.2f/%5.2f/%5.2fs\n"),
+ timeval_subtract(&time_now, &pi->time),
+ timeval_subtract(&ruse_now.ru_utime, &pi->ruse.ru_utime),
+ timeval_subtract(&ruse_now.ru_stime, &pi->ruse.ru_stime));
+
+ /* I/O usage */
+ in = (ruse_now.ru_inblock - pi->ruse.ru_inblock) << BBSHIFT;
+ out = (ruse_now.ru_oublock - pi->ruse.ru_oublock) << BBSHIFT;
+ io = in + out;
+ if (io) {
+ i = auto_space_units(in, &iu);
+ o = auto_space_units(out, &ou);
+ t = auto_space_units(io, &tu);
+ din = auto_space_units(in / dt, &dinu);
+ dout = auto_space_units(out / dt, &doutu);
+ dtot = auto_space_units(io / dt, &dtotu);
+ fprintf(stdout,
+_("%sI/O: %.1f%s in, %.1f%s out, %.1f%s tot\n"),
+ pi->tag, i, iu, o, ou, t, tu);
+ fprintf(stdout,
+_("%sI/O rate: %.1f%s/s in, %.1f%s/s out, %.1f%s/s tot\n"),
+ pi->tag, din, dinu, dout, doutu, dtot, dtotu);
+ }
+
+ /* How many bytes were read-verified? */
+ verified = read_verify_bytes() - pi->verified_bytes;
+ if (verified) {
+ v = auto_space_units(verified, &vu);
+ dv = auto_space_units(verified / dt, &dvu);
+ fprintf(stdout, _("%sVerify: %.1f%s, rate: %.1f%s/s\n"),
+ pi->tag, v, vu, dv, dvu);
+ }
+ fflush(stdout);
+
+ return true;
+}
+
+/* Find filesystem geometry and perform any other setup functions. */
+static bool
+find_geo(
+ struct scrub_ctx *ctx)
+{
+ bool moveon = true;
+ int error;
+
+ /*
+ * Open the directory with O_NOATIME. For mountpoints owned
+ * by root, this should be sufficient to ensure that we have
+ * CAP_SYS_ADMIN, which we probably need to do anything fancy
+ * with the kernel's XFS driver.
+ */
+ ctx->mnt_fd = open(ctx->mntpoint, O_RDONLY | O_NOATIME | O_DIRECTORY);
+ if (ctx->mnt_fd < 0) {
+ if (errno == EPERM)
+ str_info(ctx, ctx->mntpoint,
+_("Must be root to run scrub."));
+ else
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+ error = disk_open(ctx->blkdev, &ctx->datadev);
+ if (error && errno != ENOENT)
+ str_errno(ctx, ctx->blkdev);
+
+ error = fstat(ctx->mnt_fd, &ctx->mnt_sb);
+ if (error) {
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+ error = fstatvfs(ctx->mnt_fd, &ctx->mnt_sv);
+ if (error) {
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+ error = fstatfs(ctx->mnt_fd, &ctx->mnt_sf);
+ if (error) {
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+ if (disk_is_open(&ctx->datadev))
+ ctx->nr_io_threads = disk_heads(&ctx->datadev);
+ else
+ ctx->nr_io_threads = libxfs_nproc();
+ if (verbose) {
+ fprintf(stdout, _("%s: using %d threads to scrub.\n"),
+ ctx->mntpoint, scrub_nproc(ctx));
+ fflush(stdout);
+ }
+
+out:
+ return moveon;
+}
+
+struct scrub_phase {
+ char *descr;
+ bool (*fn)(struct scrub_ctx *);
+};
+
+/* Run the preening phase if there are no errors. */
+static bool
+preen(
+ struct scrub_ctx *ctx)
+{
+ if (ctx->errors_found) {
+ str_info(ctx, ctx->mntpoint,
+_("Errors found, please re-run with -y."));
+ return true;
+ }
+
+ return false;
+}
+
+/* Run all the phases of the scrubber. */
+#define REPAIR_DUMMY_FN ((void *)1)
+#define DATASCAN_DUMMY_FN ((void *)2)
+static bool
+run_scrub_phases(
+ struct scrub_ctx *ctx)
+{
+ struct scrub_phase phases[] = {
+ {_("Find filesystem geometry."), find_geo},
+ {_("Check internal metadata."), NULL},
+ {_("Scan all inodes."), NULL},
+ {NULL, REPAIR_DUMMY_FN},
+ {_("Verify data file integrity."), DATASCAN_DUMMY_FN},
+ {_("Check summary counters."), NULL},
+ {NULL, NULL},
+ };
+ struct phase_info pi;
+ char buf[DESCR_BUFSZ];
+ struct scrub_phase *phase;
+ bool moveon;
+ int c;
+
+ /* Run all phases of the scrub tool. */
+ for (c = 1, phase = phases; phase->fn; phase++, c++) {
+ /* Inject the repair/preen function. */
+ if (phase->fn == REPAIR_DUMMY_FN) {
+ if (ctx->mode == SCRUB_MODE_PREEN) {
+ phase->descr = _("Preen filesystem.");
+ phase->fn = preen;
+ } else if (ctx->mode == SCRUB_MODE_REPAIR) {
+ phase->descr = _("Repair filesystem.");
+ }
+ } else if (phase->fn == DATASCAN_DUMMY_FN && scrub_data)
+ ;
+
+ if (phase->fn == REPAIR_DUMMY_FN ||
+ phase->fn == DATASCAN_DUMMY_FN) {
+ c--;
+ continue;
+ } else if (phase->descr)
+ snprintf(buf, DESCR_BUFSZ, _("Phase %d: "), c);
+ else
+ buf[0] = 0;
+ moveon = phase_start(&pi, buf, phase->descr);
+ if (!moveon)
+ return false;
+ moveon = phase->fn(ctx);
+ if (!moveon)
+ return false;
+ moveon = phase_end(&pi);
+ if (!moveon)
+ return false;
+
+ /* Too many errors? */
+ if (xfs_scrub_excessive_errors(ctx))
+ return false;
+ }
+
+ return true;
+}
+
+int
+main(
+ int argc,
+ char **argv)
+{
+ int c;
+ char *mtab = NULL;
+ struct scrub_ctx ctx = {0};
+ struct phase_info all_pi;
+ long arg;
+ bool ismnt;
+ bool moveon = true;
+ static bool injected;
+ int ret;
+ int error;
+
+ progname = basename(argv[0]);
+ setlocale(LC_ALL, "");
+ bindtextdomain(PACKAGE, LOCALEDIR);
+ textdomain(PACKAGE);
+
+ pthread_mutex_init(&ctx.lock, NULL);
+ ctx.datadev.d_fd = -1;
+ ctx.mode = SCRUB_MODE_DEFAULT;
+ while ((c = getopt(argc, argv, "a:de:j:m:nTvxVy")) != EOF) {
+ switch (c) {
+ case 'a':
+ max_errors = strtoull(optarg, NULL, 10);
+ if (errno) {
+ perror("max_errors");
+ usage();
+ }
+ break;
+ case 'd':
+ debug++;
+ dumpcore = true;
+ break;
+ case 'e':
+ if (!strcmp("continue", optarg))
+ error_action = ERRORS_CONTINUE;
+ else if (!strcmp("shutdown", optarg))
+ error_action = ERRORS_SHUTDOWN;
+ else
+ usage();
+ break;
+ case 'j':
+ errno = 0;
+ arg = strtol(optarg, NULL, 10);
+ if (errno || arg < 0 || arg > INT_MAX) {
+ perror("nr_threads");
+ usage();
+ }
+ nr_threads = arg;
+ break;
+ case 'm':
+ mtab = optarg;
+ break;
+ case 'n':
+ if (ctx.mode != SCRUB_MODE_DEFAULT) {
+ fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+ return 1;
+ }
+ ctx.mode = SCRUB_MODE_DRY_RUN;
+ break;
+ case 'T':
+ display_rusage = true;
+ break;
+ case 'v':
+ verbose = true;
+ break;
+ case 'x':
+ scrub_data = true;
+ break;
+ case 'V':
+ fprintf(stdout, _("%s version %s\n"), progname,
+ VERSION);
+ fflush(stdout);
+ exit(0);
+ case 'y':
+ if (ctx.mode != SCRUB_MODE_DEFAULT) {
+ fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+ return 1;
+ }
+ ctx.mode = SCRUB_MODE_REPAIR;
+ break;
+ case '?':
+ /* fall through */
+ default:
+ usage();
+ }
+ }
+
+ if (optind != argc - 1)
+ usage();
+
+ ctx.mntpoint = argv[optind];
+
+ /* Find the mount record for the passed-in argument. */
+
+ if (stat(argv[optind], &ctx.mnt_sb) < 0) {
+ fprintf(stderr,
+ _("%s: could not stat: %s: %s\n"),
+ progname, argv[optind], strerror(errno));
+ ret = 8;
+ goto end;
+ }
+
+ /*
+ * If the user did not specify an explicit mount table, try to use
+ * /proc/mounts if it is available, else /etc/mtab. We prefer
+ * /proc/mounts because it is kernel controlled, while /etc/mtab
+ * may contain garbage that userspace tools like pam_mount wrote
+ * into it.
+ */
+ if (!mtab) {
+ if (access(_PATH_PROC_MOUNTS, R_OK) == 0)
+ mtab = _PATH_PROC_MOUNTS;
+ else
+ mtab = _PATH_MOUNTED;
+ }
+
+ ismnt = find_mountpoint(mtab, &ctx);
+ if (!ismnt) {
+ fprintf(stderr, _("%s: Not a mount point or block device.\n"),
+ ctx.mntpoint);
+ ret = 8;
+ goto end;
+ }
+
+ /* Initialize overall phase stats. */
+ moveon = phase_start(&all_pi, "", NULL);
+ if (!moveon)
+ goto out;
+
+ /* Set up a page-aligned buffer for read verification. */
+ page_size = sysconf(_SC_PAGESIZE);
+ if (page_size < 0) {
+ str_errno(&ctx, ctx.mntpoint);
+ goto out;
+ }
+
+ /* Try to allocate a read buffer if we don't have one. */
+ error = posix_memalign((void **)&ctx.readbuf, page_size,
+ IO_MAX_SIZE);
+ if (error || !ctx.readbuf) {
+ str_errno(&ctx, ctx.mntpoint);
+ goto out;
+ }
+
+ /* Flush everything out to disk before we start. */
+ error = syncfs(ctx.mnt_fd);
+ if (error) {
+ str_errno(&ctx, ctx.mntpoint);
+ goto out;
+ }
+
+ if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+ ctx.mode = SCRUB_MODE_REPAIR;
+ injected = true;
+ }
+
+ /* Scrub a filesystem. */
+ moveon = run_scrub_phases(&ctx);
+ if (!moveon)
+ goto out;
+
+out:
+ if (xfs_scrub_excessive_errors(&ctx))
+ str_info(&ctx, ctx.mntpoint, _("Too many errors; aborting."));
+
+ if (debug_tweak_on("XFS_SCRUB_FORCE_ERROR"))
+ str_error(&ctx, ctx.mntpoint, _("Injecting error."));
+
+ ret = 0;
+ if (!moveon)
+ ret |= 4;
+
+ if (ctx.repairs && ctx.preens)
+ fprintf(stdout,
+_("%s: %lu repairs and %lu optimizations made.\n"),
+ ctx.mntpoint, ctx.repairs, ctx.preens);
+ else if (ctx.repairs && ctx.preens == 0)
+ fprintf(stdout,
+_("%s: %lu repairs made.\n"),
+ ctx.mntpoint, ctx.repairs);
+ else if (ctx.repairs == 0 && ctx.preens)
+ fprintf(stdout,
+_("%s: %lu optimizations made.\n"),
+ ctx.mntpoint, ctx.preens);
+
+ if (ctx.errors_found && ctx.warnings_found)
+ fprintf(stderr,
+_("%s: %lu errors and %lu warnings found. Unmount and run xfs_repair.\n"),
+ ctx.mntpoint, ctx.errors_found, ctx.warnings_found);
+ else if (ctx.errors_found && ctx.warnings_found == 0)
+ fprintf(stderr,
+_("%s: %lu errors found. Unmount and run xfs_repair.\n"),
+ ctx.mntpoint, ctx.errors_found);
+ else if (ctx.errors_found == 0 && ctx.warnings_found)
+ fprintf(stderr,
+_("%s: %lu warnings found.\n"),
+ ctx.mntpoint, ctx.warnings_found);
+ if (ctx.errors_found) {
+ ret |= 1;
+ }
+ if (ctx.warnings_found) {
+ ret |= 2;
+ }
+ phase_end(&all_pi);
+ close(ctx.mnt_fd);
+ disk_close(&ctx.datadev);
+
+ free(ctx.blkdev);
+ free(ctx.readbuf);
+ free(ctx.mntpoint);
+ free(ctx.mnt_type);
+end:
+ return ret;
+}
diff --git a/scrub/scrub.h b/scrub/scrub.h
new file mode 100644
index 0000000..114310d
--- /dev/null
+++ b/scrub/scrub.h
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef SCRUB_H_
+#define SCRUB_H_
+
+#define DESCR_BUFSZ 256
+
+/*
+ * Perform all IO in 32M chunks. This cannot exceed 65536 sectors
+ * because that's the biggest SCSI VERIFY(16) we dare to send.
+ */
+#define IO_MAX_SIZE 33554432
+#define IO_MAX_SECTORS (IO_MAX_SIZE >> BBSHIFT)
+
+enum scrub_mode {
+ SCRUB_MODE_DRY_RUN,
+ SCRUB_MODE_PREEN,
+ SCRUB_MODE_REPAIR,
+};
+#define SCRUB_MODE_DEFAULT SCRUB_MODE_PREEN
+
+struct scrub_ctx {
+ /* Immutable scrub state. */
+ char *mntpoint;
+ char *blkdev;
+ char *mnt_type;
+ void *readbuf;
+ int mnt_fd;
+ enum scrub_mode mode;
+ unsigned int nr_io_threads;
+ struct disk datadev;
+ struct stat mnt_sb;
+ struct statvfs mnt_sv;
+ struct statfs mnt_sf;
+
+ /* Mutable scrub state; use lock. */
+ pthread_mutex_t lock;
+ unsigned long errors_found;
+ unsigned long warnings_found;
+ unsigned long repairs;
+ unsigned long preens;
+};
+
+enum errors_action {
+ ERRORS_CONTINUE,
+ ERRORS_SHUTDOWN,
+};
+
+extern bool verbose;
+extern int debug;
+extern bool scrub_data;
+extern long page_size;
+extern enum errors_action error_action;
+extern int nr_threads;
+
+bool xfs_scrub_excessive_errors(struct scrub_ctx *ctx);
+
+void __str_errno(struct scrub_ctx *, const char *, const char *, int);
+void __str_error(struct scrub_ctx *, const char *, const char *, int,
+ const char *, ...);
+void __str_warn(struct scrub_ctx *, const char *, const char *, int,
+ const char *, ...);
+void __str_info(struct scrub_ctx *, const char *, const char *, int,
+ const char *, ...);
+void __record_repair(struct scrub_ctx *, const char *, const char *, int,
+ const char *, ...);
+void __record_preen(struct scrub_ctx *, const char *, const char *, int,
+ const char *, ...);
+
+#define str_errno(ctx, str) __str_errno(ctx, str, __FILE__, __LINE__)
+#define str_error(ctx, str, ...) __str_error(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_warn(ctx, str, ...) __str_warn(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_info(ctx, str, ...) __str_info(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_repair(ctx, str, ...) __record_repair(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_preen(ctx, str, ...) __record_preen(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define dbg_printf(fmt, ...) {if (debug > 1) {printf(fmt, __VA_ARGS__);}}
+
+#ifndef container_of
+# define container_of(ptr, type, member) ({ \
+ const typeof( ((type *)0)->member ) *__mptr = (ptr); \
+ (type *)( (char *)__mptr - offsetof(type,member) );})
+#endif
+
+/* Is this debug tweak enabled? */
+static inline bool
+debug_tweak_on(
+ const char *name)
+{
+ return debug && getenv(name) != NULL;
+}
+
+/* Miscellaneous utility functions */
+unsigned int scrub_nproc(struct scrub_ctx *ctx);
+bool within_range(struct scrub_ctx *ctx, unsigned long long value,
+ unsigned long long desired, unsigned long long diff_threshold,
+ unsigned int n, unsigned int d, const char *descr);
+double auto_space_units(unsigned long long kilobytes, char **units);
+double auto_units(unsigned long long number, char **units);
+const char *repair_tool(struct scrub_ctx *ctx);
+int dirent_open(int dir_fd, struct dirent *dirent);
+
+#ifndef HAVE_SYNCFS
+static inline int syncfs(int fd)
+{
+ sync();
+ return 0;
+}
+#endif
+
+#endif /* SCRUB_H_ */
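Reader's note, not part of the patch: run_scrub_phases() above keeps sentinel function pointers (REPAIR_DUMMY_FN, DATASCAN_DUMMY_FN) in its phase table, swaps them for real functions depending on the mode, and skips any slot that is still a sentinel without consuming a phase number. The following is only a minimal standalone sketch of that idea. The names run_phases, REPAIR_SLOT, enum mode, and the stub phase functions are hypothetical stand-ins, not the tool's real symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum mode { MODE_DRY_RUN, MODE_PREEN, MODE_REPAIR };

typedef bool (*phase_fn)(void);

/* Hypothetical stub phases; the real ones take a scrub context. */
static bool find_geo(void) { return true; }
static bool preen(void) { return true; }

/* Sentinel that marks a slot to be filled in at run time; never called. */
static bool repair_placeholder(void) { return false; }
#define REPAIR_SLOT repair_placeholder

struct phase {
	const char *descr;
	phase_fn fn;
};

/* Returns the number of phases that ran, or -1 if a phase failed. */
static int run_phases(enum mode mode)
{
	struct phase phases[] = {
		{"Find filesystem geometry.", find_geo},
		{NULL, REPAIR_SLOT},
		{NULL, NULL},	/* terminator */
	};
	struct phase *p;
	int c, ran = 0;

	for (c = 1, p = phases; p->fn; p++, c++) {
		/* Inject a real function into the placeholder slot. */
		if (p->fn == REPAIR_SLOT && mode == MODE_PREEN) {
			p->descr = "Preen filesystem.";
			p->fn = preen;
		}

		/* Still a placeholder: skip it and reuse the phase number. */
		if (p->fn == REPAIR_SLOT) {
			c--;
			continue;
		}

		assert(p->descr != NULL);
		printf("Phase %d: %s\n", c, p->descr);
		if (!p->fn())
			return -1;
		ran++;
	}
	return ran;
}
```

In this sketch only preen mode fills the slot, which mirrors the state of the table in this patch, where the repair function has not been wired up yet. Calling run_phases(MODE_PREEN) runs two numbered phases; the other two modes run one.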
* [PATCH 7/9] xfs_scrub: add XFS-specific scrubbing functionality
2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
` (5 preceding siblings ...)
2017-03-10 23:25 ` [PATCH 6/9] xfs_scrub: create online filesystem scrub program Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
2017-03-10 23:25 ` [PATCH 8/9] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
2017-03-10 23:25 ` [PATCH 9/9] xfs_scrub: integrate services with systemd Darrick J. Wong
8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
For XFS, we perform sequential scans of each AG's metadata, inodes,
extent maps, and file data. Being XFS specific, we can work with the
in-kernel scrubbers to perform much stronger metadata checking and
cross-referencing. We can also take advantage of newer ioctls such as
GETFSMAP to perform faster read verification.
In the future we will be able to take advantage of (still unwritten)
features such as parent directory pointers to fully validate all
metadata. The scrub tool can shut down the filesystem if errors are
found. This is not a default option since scrubbing is very immature at
this time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
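Reader's note, below the cut line so that it stays out of the commit message: the per-AG inode scan in xfs_scan_ag_inodes() derives each AG's inode number range by shifting the AG number left by (inopblog + agblklog), with both values computed from the filesystem geometry in xfs_scan_fs(). Here is a rough standalone sketch of that arithmetic. The helper names log2_roundup and ag_inode_range, and the struct ag_ino_range, are hypothetical. The sketch also assumes block and inode sizes are powers of two, so a rounded-up log2 gives the same answer as the highbit computation used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest n such that (1 << n) >= v; v must be nonzero. */
static unsigned int log2_roundup(uint32_t v)
{
	unsigned int n = 0;

	assert(v != 0);
	while (((uint64_t)1 << n) < v)
		n++;
	return n;
}

/* Inclusive range of inode numbers that can live in one AG. */
struct ag_ino_range {
	uint64_t first;
	uint64_t last;
};

static struct ag_ino_range
ag_inode_range(uint32_t agno, uint32_t agblocks, uint32_t blocksize,
	       uint32_t inodesize)
{
	struct ag_ino_range r;
	unsigned int agblklog, inopblog, shift;

	assert(blocksize >= inodesize);

	/* Bits needed to address a block within the AG. */
	agblklog = log2_roundup(agblocks);
	/* Bits needed to address an inode within a block. */
	inopblog = log2_roundup(blocksize) - log2_roundup(inodesize);
	shift = inopblog + agblklog;

	r.first = (uint64_t)agno << shift;
	r.last = (((uint64_t)agno + 1) << shift) - 1;
	return r;
}
```

With made-up geometry of 1000 blocks per AG, 4096-byte blocks, and 512-byte inodes, the shift is 10 + 3 = 13 bits, so AG 2 covers inode numbers 16384 through 24575.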
scrub/Makefile | 4
scrub/scrub.c | 21 +
scrub/scrub.h | 26 +
scrub/xfs.c | 1517 +++++++++++++++++++++++++++++++++++++++++++++++++++++
scrub/xfs_ioctl.c | 968 ++++++++++++++++++++++++++++++++++
scrub/xfs_ioctl.h | 103 ++++
6 files changed, 2632 insertions(+), 7 deletions(-)
create mode 100644 scrub/xfs.c
create mode 100644 scrub/xfs_ioctl.c
create mode 100644 scrub/xfs_ioctl.h
diff --git a/scrub/Makefile b/scrub/Makefile
index b1ff86a..bae2fa1 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -12,9 +12,9 @@ LTCOMMAND = xfs_scrub
INSTALL_SCRUB = install-scrub
endif # scrub_prereqs
-HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h
+HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h xfs_ioctl.h
CFILES = ../repair/avl64.c disk.c bitmap.c iocmd.c \
- read_verify.c scrub.c ../repair/threads.c
+ read_verify.c scrub.c ../repair/threads.c xfs.c xfs_ioctl.c
LLDLIBS += $(LIBBLKID) $(LIBXFS) $(LIBXCMD) $(LIBUUID) $(LIBRT) $(LIBPTHREAD) $(LIBHANDLE)
LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBHANDLE)
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 013559a..a363ac1 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -638,6 +638,9 @@ _("Must be root to run scrub."));
ctx->nr_io_threads = disk_heads(&ctx->datadev);
else
ctx->nr_io_threads = libxfs_nproc();
+ moveon = xfs_scan_fs(ctx);
+ if (!moveon)
+ goto out;
if (verbose) {
fprintf(stdout, _("%s: using %d threads to scrub.\n"),
ctx->mntpoint, scrub_nproc(ctx));
@@ -664,7 +667,7 @@ _("Errors found, please re-run with -y."));
return true;
}
- return false;
+ return xfs_repair_fs(ctx);
}
/* Run all the phases of the scrubber. */
@@ -676,11 +679,11 @@ run_scrub_phases(
{
struct scrub_phase phases[] = {
{_("Find filesystem geometry."), find_geo},
- {_("Check internal metadata."), NULL},
- {_("Scan all inodes."), NULL},
+ {_("Check internal metadata."), xfs_scan_metadata},
+ {_("Scan all inodes."), xfs_scan_inodes},
{NULL, REPAIR_DUMMY_FN},
{_("Verify data file integrity."), DATASCAN_DUMMY_FN},
- {_("Check summary counters."), NULL},
+ {_("Check summary counters."), xfs_check_summary},
{NULL, NULL},
};
struct phase_info pi;
@@ -698,9 +701,10 @@ run_scrub_phases(
phase->fn = preen;
} else if (ctx->mode == SCRUB_MODE_REPAIR) {
phase->descr = _("Repair filesystem.");
+ phase->fn = xfs_repair_fs;
}
} else if (phase->fn == DATASCAN_DUMMY_FN && scrub_data)
- ;
+ phase->fn = xfs_scan_blocks;
if (phase->fn == REPAIR_DUMMY_FN ||
phase->fn == DATASCAN_DUMMY_FN) {
@@ -906,6 +910,11 @@ _("Only one of the options -n or -y may be specified.\n"));
if (!moveon)
ret |= 4;
+ /* Clean up scan data. */
+ moveon = xfs_cleanup(&ctx);
+ if (!moveon)
+ ret |= 8;
+
if (ctx.repairs && ctx.preens)
fprintf(stdout,
_("%s: %lu repairs and %lu optimizations made.\n"),
@@ -932,6 +941,8 @@ _("%s: %lu errors found. Unmount and run xfs_repair.\n"),
_("%s: %lu warnings found.\n"),
ctx.mntpoint, ctx.warnings_found);
if (ctx.errors_found) {
+ if (error_action == ERRORS_SHUTDOWN)
+ xfs_shutdown_fs(&ctx);
ret |= 1;
}
if (ctx.warnings_found) {
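Reader's note, not part of the patch: with this change main() builds its exit status as a bitmask. Bit 1 is set when errors were found, 2 when warnings were found, 4 when a phase asked to stop (moveon was false), and 8 when xfs_cleanup() fails; 8 is also the value returned earlier when the stat or mount point lookup fails. A small sketch of how a caller might decode that value follows. The function name describe_scrub_status and the description strings are hypothetical wording of mine, not output of the tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bit values as used in main(); the descriptions are illustrative only. */
static const struct {
	int bit;
	const char *descr;
} scrub_ret_bits[] = {
	{1, "errors found"},
	{2, "warnings found"},
	{4, "scrub did not finish"},
	{8, "setup or cleanup failed"},
};

/* Fill buf with a comma-separated description of status; returns buf. */
static const char *
describe_scrub_status(int status, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	if (status == 0) {
		snprintf(buf, len, "clean");
		return buf;
	}

	for (i = 0; i < sizeof(scrub_ret_bits) / sizeof(scrub_ret_bits[0]); i++) {
		int n;

		if (!(status & scrub_ret_bits[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? ", " : "", scrub_ret_bits[i].descr);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated; stop appending */
		used += (size_t)n;
	}
	return buf;
}
```

Because the bits are OR-ed together, a status of 3 means both errors and warnings were reported, which is why the decoder walks every bit instead of switching on the whole value.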
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 114310d..c3ced73 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -56,6 +56,22 @@ struct scrub_ctx {
unsigned long warnings_found;
unsigned long repairs;
unsigned long preens;
+
+ /* FS specific stuff */
+ struct xfs_fsop_geom geo;
+ struct fs_path fsinfo;
+ unsigned int agblklog;
+ unsigned int blocklog;
+ unsigned int inodelog;
+ unsigned int inopblog;
+ struct disk logdev;
+ struct disk rtdev;
+ void *fshandle;
+ size_t fshandle_len;
+ unsigned long long capabilities; /* see below */
+ struct read_verify_pool rvp;
+ struct list_head repair_list;
+ bool preen_triggers[XFS_SCRUB_TYPE_MAX + 1];
};
enum errors_action {
@@ -124,4 +140,14 @@ static inline int syncfs(int fd)
}
#endif
+/* FS-specific functions */
+bool xfs_cleanup(struct scrub_ctx *ctx);
+bool xfs_scan_fs(struct scrub_ctx *ctx);
+bool xfs_scan_inodes(struct scrub_ctx *ctx);
+bool xfs_scan_metadata(struct scrub_ctx *ctx);
+bool xfs_check_summary(struct scrub_ctx *ctx);
+bool xfs_scan_blocks(struct scrub_ctx *ctx);
+bool xfs_repair_fs(struct scrub_ctx *ctx);
+void xfs_shutdown_fs(struct scrub_ctx *ctx);
+
#endif /* SCRUB_H_ */
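Reader's note, not part of the patch: the new capabilities field in struct scrub_ctx is a bitmask, and xfs.c below generates a can/set/clear accessor trio per flag with the XFS_SCRUB_CAPABILITY_FUNCS macro. The following standalone sketch shows the same token-pasting pattern on a cut-down context. The names caps_ctx, CAP_PARENT_PTR, and CAPABILITY_FUNCS are hypothetical simplifications, not the names the patch uses.

```c
#include <stdbool.h>

/* Cut-down stand-in for the scrub context; only the bitmask is kept. */
struct caps_ctx {
	unsigned long long capabilities;
};

#define CAP_PARENT_PTR	(1ULL << 0)	/* can find parent? */

/* Generate can_<name>(), set_<name>() and clear_<name>() for one flag. */
#define CAPABILITY_FUNCS(name, flagname) \
static inline bool \
can_##name(const struct caps_ctx *ctx) \
{ \
	return (ctx->capabilities & CAP_##flagname) != 0; \
} \
static inline void \
set_##name(struct caps_ctx *ctx) \
{ \
	ctx->capabilities |= CAP_##flagname; \
} \
static inline void \
clear_##name(struct caps_ctx *ctx) \
{ \
	ctx->capabilities &= ~(CAP_##flagname); \
}

CAPABILITY_FUNCS(getparent, PARENT_PTR)
```

Adding another capability under this scheme would take one new flag definition and one more macro invocation, which is presumably why the patch structures it this way even though only one flag exists so far.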
diff --git a/scrub/xfs.c b/scrub/xfs.c
new file mode 100644
index 0000000..7d6a249
--- /dev/null
+++ b/scrub/xfs.c
@@ -0,0 +1,1517 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <attr/attributes.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "iocmd.h"
+#include "scrub.h"
+#include "xfs_ioctl.h"
+#include "xfs_fs.h"
+
+/*
+ * XFS Scrubbing Strategy
+ *
+ * The XFS scrubber uses custom XFS ioctls to probe more deeply into the
+ * internals of the filesystem. It takes advantage of scrubbing ioctls
+ * to check all the records stored in a metadata btree and to
+ * cross-reference those records against the other metadata btrees.
+ *
+ * The "find geometry" phase queries XFS for the filesystem geometry.
+ * The block devices for the data, realtime, and log devices are opened.
+ * Kernel ioctls are queried to see if they are implemented, and a data
+ * file read-verify strategy is selected.
+ *
+ * In the "check internal metadata" phase, we call the SCRUB_METADATA
+ * ioctl to check the filesystem's internal per-AG btrees. This
+ * includes the AG superblock, AGF, AGFL, and AGI headers, freespace
+ * btrees, the regular and free inode btrees, the reverse mapping
+ * btrees, and the reference counting btrees. If the realtime device is
+ * enabled, the realtime bitmap and reverse mapping btrees are checked.
+ * Each AG (and the realtime device) has its metadata checked in a
+ * separate thread for better performance.
+ *
+ * The "scan inodes" phase uses BULKSTAT to scan all the inodes in an
+ * AG in disk order. From the BULKSTAT information, a file handle is
+ * constructed and the following items are checked:
+ *
+ * - If it's a symlink, the target is read but not validated.
+ * - Bulkstat data is checked.
+ * - If the inode is a file or a directory, a file descriptor is
+ * opened to pin the inode and for further analysis.
+ * - Extended attribute names and values are read via the file
+ * handle. If this fails and we have a file descriptor open, we
+ * retry with the generic extended attribute APIs.
+ * - If the inode is not a file or directory, we're done.
+ * - Extent maps are scanned to ensure that the records make sense.
+ * We also use the SCRUB_METADATA ioctl for better checking of the
+ * block mapping records.
+ * - If the inode is a directory, open the directory and check that
+ * the dirent type code and inode numbers match the stat output.
+ *
+ * Multiple threads are started to check the inodes of each AG in
+ * parallel.
+ *
+ * In the "verify data file integrity" phase, we employ GETFSMAP to read
+ * the reverse-mappings of all AGs and issue direct-reads of the
+ * underlying disk blocks. We rely on the underlying storage to have
+ * checksummed the data blocks appropriately.
+ *
+ * Multiple threads are started to check each AG in parallel. A
+ * separate thread pool is used to handle the direct reads.
+ *
+ * In the "check summary counters" phase, use GETFSMAP to tally up the
+ * blocks and BULKSTAT to tally up the inodes we saw and compare that to
+ * the statfs output. This gives the user a rough estimate of how
+ * thorough the scrub was.
+ */
+
+/* Routines to scrub an XFS filesystem. */
+
+#define XFS_SCRUB_CAP_PARENT_PTR (1ULL << 0) /* can find parent? */
+
+#define XFS_SCRUB_CAPABILITY_FUNCS(name, flagname) \
+static inline bool \
+xfs_scrub_can_##name(struct scrub_ctx *ctx) \
+{ \
+ return ctx->capabilities & XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_set_##name(struct scrub_ctx *ctx) \
+{ \
+ ctx->capabilities |= XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_clear_##name(struct scrub_ctx *ctx) \
+{ \
+ ctx->capabilities &= ~(XFS_SCRUB_CAP_##flagname); \
+}
+XFS_SCRUB_CAPABILITY_FUNCS(getparent, PARENT_PTR)
+
+/* Find the disk structure for a given device identifier. */
+static struct disk *
+xfs_dev_to_disk(
+ struct scrub_ctx *ctx,
+ dev_t dev)
+{
+ if (dev == ctx->fsinfo.fs_datadev)
+ return &ctx->datadev;
+ else if (dev == ctx->fsinfo.fs_logdev)
+ return &ctx->logdev;
+ else if (dev == ctx->fsinfo.fs_rtdev)
+ return &ctx->rtdev;
+ abort();
+}
+
+/* Find the device identifier for a given disk structure. */
+static dev_t
+xfs_disk_to_dev(
+ struct scrub_ctx *ctx,
+ struct disk *disk)
+{
+ if (disk == &ctx->datadev)
+ return ctx->fsinfo.fs_datadev;
+ else if (disk == &ctx->logdev)
+ return ctx->fsinfo.fs_logdev;
+ else if (disk == &ctx->rtdev)
+ return ctx->fsinfo.fs_rtdev;
+ abort();
+}
+
+/* Shortcut to creating a read-verify thread pool. */
+static inline bool
+xfs_read_verify_pool_init(
+ struct scrub_ctx *ctx,
+ read_verify_ioend_fn_t ioend_fn)
+{
+ return read_verify_pool_init(&ctx->rvp, ctx, ctx->readbuf,
+ IO_MAX_SIZE, ctx->geo.blocksize, ioend_fn,
+ disk_heads(&ctx->datadev));
+}
+
+struct owner_decode {
+ uint64_t owner;
+ const char *descr;
+};
+
+static const struct owner_decode special_owners[] = {
+ {XFS_FMR_OWN_FREE, "free space"},
+ {XFS_FMR_OWN_UNKNOWN, "unknown owner"},
+ {XFS_FMR_OWN_FS, "static FS metadata"},
+ {XFS_FMR_OWN_LOG, "journalling log"},
+ {XFS_FMR_OWN_AG, "per-AG metadata"},
+ {XFS_FMR_OWN_INOBT, "inode btree blocks"},
+ {XFS_FMR_OWN_INODES, "inodes"},
+ {XFS_FMR_OWN_REFC, "refcount btree"},
+ {XFS_FMR_OWN_COW, "CoW staging"},
+ {XFS_FMR_OWN_DEFECTIVE, "bad blocks"},
+ {0, NULL},
+};
+
+/* Decode a special owner. */
+static const char *
+xfs_decode_special_owner(
+ uint64_t owner)
+{
+ const struct owner_decode *od = special_owners;
+
+ while (od->descr) {
+ if (od->owner == owner)
+ return od->descr;
+ od++;
+ }
+
+ return NULL;
+}
+
+/* BULKSTAT wrapper routines. */
+struct xfs_scan_inodes {
+ xfs_inode_iter_fn fn;
+ void *arg;
+ size_t array_arg_size;
+ bool moveon;
+};
+
+/* Scan all the inodes in an AG. */
+static void
+xfs_scan_ag_inodes(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct xfs_scan_inodes *si = arg;
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->mp;
+ void *fn_arg;
+ char descr[DESCR_BUFSZ];
+ uint64_t ag_ino;
+ uint64_t next_ag_ino;
+ bool moveon;
+
+ snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u inodes"),
+ major(ctx->fsinfo.fs_datadev),
+ minor(ctx->fsinfo.fs_datadev),
+ agno);
+
+ ag_ino = (__u64)agno << (ctx->inopblog + ctx->agblklog);
+ next_ag_ino = (__u64)(agno + 1) << (ctx->inopblog + ctx->agblklog);
+
+ fn_arg = ((char *)si->arg) + si->array_arg_size * agno;
+ moveon = xfs_iterate_inodes(ctx, descr, ctx->fshandle, ag_ino,
+ next_ag_ino - 1, si->fn, fn_arg);
+ if (!moveon)
+ si->moveon = false;
+}
+
+/* How many array elements should we create to scan all the inodes? */
+static inline size_t
+xfs_scan_all_inodes_array_size(
+ struct scrub_ctx *ctx)
+{
+ return ctx->geo.agcount;
+}
+
+/* Scan all the inodes in a filesystem. */
+static bool
+xfs_scan_all_inodes_array_arg(
+ struct scrub_ctx *ctx,
+ xfs_inode_iter_fn fn,
+ void *arg,
+ size_t array_arg_size)
+{
+ struct xfs_scan_inodes si;
+ xfs_agnumber_t agno;
+ struct work_queue wq;
+
+ si.moveon = true;
+ si.fn = fn;
+ si.arg = arg;
+ si.array_arg_size = array_arg_size;
+
+ create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+ for (agno = 0; agno < ctx->geo.agcount; agno++)
+ queue_work(&wq, xfs_scan_ag_inodes, agno, &si);
+ destroy_work_queue(&wq);
+
+ return si.moveon;
+}
+#define xfs_scan_all_inodes(ctx, fn) \
+ xfs_scan_all_inodes_array_arg((ctx), (fn), NULL, 0)
+#define xfs_scan_all_inodes_arg(ctx, fn, arg) \
+ xfs_scan_all_inodes_array_arg((ctx), (fn), (arg), 0)
+
+/* GETFSMAP wrapper routines. */
+struct xfs_scan_blocks {
+ xfs_fsmap_iter_fn fn;
+ void *arg;
+ size_t array_arg_size;
+ bool moveon;
+};
+
+/* Iterate all the reverse mappings of an AG. */
+static void
+xfs_scan_ag_blocks(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->mp;
+ struct xfs_scan_blocks *sbx = arg;
+ void *fn_arg;
+ char descr[DESCR_BUFSZ];
+ struct fsmap keys[2];
+ off64_t bperag;
+ bool moveon;
+
+ bperag = (off64_t)ctx->geo.agblocks *
+ (off64_t)ctx->geo.blocksize;
+
+ snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u fsmap"),
+ major(ctx->fsinfo.fs_datadev),
+ minor(ctx->fsinfo.fs_datadev),
+ agno);
+
+ memset(keys, 0, sizeof(struct fsmap) * 2);
+ keys->fmr_device = ctx->fsinfo.fs_datadev;
+ keys->fmr_physical = agno * bperag;
+ (keys + 1)->fmr_device = ctx->fsinfo.fs_datadev;
+ (keys + 1)->fmr_physical = ((agno + 1) * bperag) - 1;
+ (keys + 1)->fmr_owner = ULLONG_MAX;
+ (keys + 1)->fmr_offset = ULLONG_MAX;
+ (keys + 1)->fmr_flags = UINT_MAX;
+
+ fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * agno;
+ moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+ if (!moveon)
+ sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of a standalone device. */
+static void
+xfs_scan_dev_blocks(
+ struct scrub_ctx *ctx,
+ int idx,
+ dev_t dev,
+ struct xfs_scan_blocks *sbx)
+{
+ struct fsmap keys[2];
+ char descr[DESCR_BUFSZ];
+ void *fn_arg;
+ bool moveon;
+
+ snprintf(descr, DESCR_BUFSZ, _("dev %d:%d fsmap"),
+ major(dev), minor(dev));
+
+ memset(keys, 0, sizeof(struct fsmap) * 2);
+ keys->fmr_device = dev;
+ (keys + 1)->fmr_device = dev;
+ (keys + 1)->fmr_physical = ULLONG_MAX;
+ (keys + 1)->fmr_owner = ULLONG_MAX;
+ (keys + 1)->fmr_offset = ULLONG_MAX;
+ (keys + 1)->fmr_flags = UINT_MAX;
+
+ fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * idx;
+ moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+ if (!moveon)
+ sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of the realtime device. */
+static void
+xfs_scan_rt_blocks(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->mp;
+
+ xfs_scan_dev_blocks(ctx, agno, ctx->fsinfo.fs_rtdev, arg);
+}
+
+/* Iterate all the reverse mappings of the log device. */
+static void
+xfs_scan_log_blocks(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->mp;
+
+ xfs_scan_dev_blocks(ctx, agno, ctx->fsinfo.fs_logdev, arg);
+}
+
+/* How many array elements should we create to scan all the blocks? */
+static size_t
+xfs_scan_all_blocks_array_size(
+ struct scrub_ctx *ctx)
+{
+ return ctx->geo.agcount + 2;
+}
+
+/* Scan all the blocks in a filesystem. */
+static bool
+xfs_scan_all_blocks_array_arg(
+ struct scrub_ctx *ctx,
+ xfs_fsmap_iter_fn fn,
+ void *arg,
+ size_t array_arg_size)
+{
+ xfs_agnumber_t agno;
+ struct work_queue wq;
+ struct xfs_scan_blocks sbx;
+
+ sbx.moveon = true;
+ sbx.fn = fn;
+ sbx.arg = arg;
+ sbx.array_arg_size = array_arg_size;
+
+ create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+ if (ctx->fsinfo.fs_rt)
+ queue_work(&wq, xfs_scan_rt_blocks, ctx->geo.agcount,
+ &sbx);
+ if (ctx->fsinfo.fs_log)
+ queue_work(&wq, xfs_scan_log_blocks, ctx->geo.agcount + 1,
+ &sbx);
+ for (agno = 0; agno < ctx->geo.agcount; agno++)
+ queue_work(&wq, xfs_scan_ag_blocks, agno, &sbx);
+ destroy_work_queue(&wq);
+
+ return sbx.moveon;
+}
+
+/* Routines to translate bad physical extents into file paths and offsets. */
+
+struct xfs_verify_error_info {
+ struct bitmap *d_bad; /* bytes */
+ struct bitmap *r_bad; /* bytes */
+};
+
+/* Report if this extent overlaps a bad region. */
+static bool
+xfs_report_verify_inode_bmap(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ int fd,
+ int whichfork,
+ struct fsxattr *fsx,
+ struct xfs_bmap *bmap,
+ void *arg)
+{
+ struct xfs_verify_error_info *vei = arg;
+ struct bitmap *tree;
+
+ /* Only report errors for real extents. */
+ if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
+ return true;
+
+ if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+ tree = vei->r_bad;
+ else
+ tree = vei->d_bad;
+
+ if (!bitmap_has_extent(tree, bmap->bm_physical, bmap->bm_length))
+ return true;
+
+ str_error(ctx, descr,
+_("offset %llu failed read verification."), bmap->bm_offset);
+ return true;
+}
+
+/* Iterate the extent mappings of a file to report errors. */
+static bool
+xfs_report_verify_fd(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ int fd,
+ void *arg)
+{
+ struct xfs_bmap key = {0};
+ bool moveon;
+
+ /* data fork */
+ moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+ xfs_report_verify_inode_bmap, arg);
+ if (!moveon)
+ return false;
+
+ /* attr fork */
+ moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+ xfs_report_verify_inode_bmap, arg);
+ if (!moveon)
+ return false;
+ return true;
+}
+
+/* Report read verify errors in unlinked (but still open) files. */
+static int
+xfs_report_verify_inode(
+ struct scrub_ctx *ctx,
+ struct xfs_handle *handle,
+ struct xfs_bstat *bstat,
+ void *arg)
+{
+ char descr[DESCR_BUFSZ];
+ char buf[DESCR_BUFSZ];
+ bool moveon;
+ int fd;
+ int error;
+
+ snprintf(descr, DESCR_BUFSZ, _("inode %llu (unlinked)"), bstat->bs_ino);
+
+ /* Ignore linked files and things we can't open. */
+ if (bstat->bs_nlink != 0)
+ return 0;
+ if (!S_ISREG(bstat->bs_mode) && !S_ISDIR(bstat->bs_mode))
+ return 0;
+
+ /* Try to open the inode. */
+ fd = open_by_fshandle(handle, sizeof(*handle),
+ O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+ if (fd < 0) {
+ error = errno;
+ if (error == ESTALE)
+ return error;
+
+ str_warn(ctx, descr, "%s", strerror_r(error, buf, DESCR_BUFSZ));
+ return error;
+ }
+
+ /* Go find the badness. */
+ moveon = xfs_report_verify_fd(ctx, descr, fd, arg);
+ close(fd);
+
+ return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Scan the inode associated with a directory entry. */
+static bool
+xfs_report_verify_dirent(
+ struct scrub_ctx *ctx,
+ const char *path,
+ int dir_fd,
+ struct dirent *dirent,
+ struct stat *sb,
+ void *arg)
+{
+ bool moveon;
+ int fd;
+
+ /* Ignore things we can't open. */
+ if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode))
+ return true;
+ /* Ignore . and .. */
+ if (dirent && (!strcmp(".", dirent->d_name) ||
+ !strcmp("..", dirent->d_name)))
+ return true;
+
+ /* Open the file */
+ fd = dirent_open(dir_fd, dirent);
+ if (fd < 0)
+ return true;
+
+ /* Go find the badness. */
+ moveon = xfs_report_verify_fd(ctx, path, fd, arg);
+ if (moveon)
+ goto out;
+
+out:
+ close(fd);
+
+ return moveon;
+}
+
+/* Given bad extent lists for the data & rtdev, find bad files. */
+static bool
+xfs_report_verify_errors(
+ struct scrub_ctx *ctx,
+ struct bitmap *d_bad,
+ struct bitmap *r_bad)
+{
+ struct xfs_verify_error_info vei;
+ bool moveon;
+
+ vei.d_bad = d_bad;
+ vei.r_bad = r_bad;
+
+ /* Scan the directory tree to get file paths. */
+ moveon = scan_fs_tree(ctx, NULL, xfs_report_verify_dirent, &vei);
+ if (!moveon)
+ return false;
+
+ /* Scan for unlinked files. */
+ return xfs_scan_all_inodes_arg(ctx, xfs_report_verify_inode, &vei);
+}
+
+/* Phase 1: Find filesystem geometry */
+
+/* Clean up the XFS-specific state data. */
+bool
+xfs_cleanup(
+ struct scrub_ctx *ctx)
+{
+ if (ctx->fshandle)
+ free_handle(ctx->fshandle, ctx->fshandle_len);
+ disk_close(&ctx->rtdev);
+ disk_close(&ctx->logdev);
+ disk_close(&ctx->datadev);
+
+ return true;
+}
+
+/* Read the XFS geometry. */
+bool
+xfs_scan_fs(
+ struct scrub_ctx *ctx)
+{
+ struct fs_path *fsp;
+ int error;
+
+ if (!platform_test_xfs_fd(ctx->mnt_fd)) {
+ str_error(ctx, ctx->mntpoint,
+_("Does not appear to be an XFS filesystem!"));
+ return false;
+ }
+
+ /*
+ * Flush everything out to disk before we start checking.
+ * This seems to reduce the incidence of stale file handle
+ * errors when we open things by handle.
+ */
+ error = syncfs(ctx->mnt_fd);
+ if (error) {
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+
+ INIT_LIST_HEAD(&ctx->repair_list);
+ ctx->datadev.d_fd = ctx->logdev.d_fd = ctx->rtdev.d_fd = -1;
+
+ /* Retrieve XFS geometry. */
+ error = ioctl(ctx->mnt_fd, XFS_IOC_FSGEOMETRY, &ctx->geo);
+ if (error) {
+ str_errno(ctx, ctx->mntpoint);
+ goto err;
+ }
+
+ ctx->agblklog = libxfs_log2_roundup(ctx->geo.agblocks);
+ ctx->blocklog = libxfs_highbit32(ctx->geo.blocksize);
+ ctx->inodelog = libxfs_highbit32(ctx->geo.inodesize);
+ ctx->inopblog = ctx->blocklog - ctx->inodelog;
+
+ error = path_to_fshandle(ctx->mntpoint, &ctx->fshandle,
+ &ctx->fshandle_len);
+ if (error) {
+ perror(_("getting fshandle"));
+ goto err;
+ }
+
+ /* Do we have bulkstat? */
+ if (!xfs_can_iterate_inodes(ctx)) {
+ str_info(ctx, ctx->mntpoint, _("BULKSTAT is required."));
+ goto err;
+ }
+
+ /* Do we have getbmapx? */
+ if (!xfs_can_iterate_bmap(ctx)) {
+ str_info(ctx, ctx->mntpoint, _("GETBMAPX is required."));
+ goto err;
+ }
+
+ /* Do we have getfsmap? */
+ if (!xfs_can_iterate_fsmap(ctx)) {
+ str_info(ctx, ctx->mntpoint, _("GETFSMAP is required."));
+ goto err;
+ }
+
+ /* Do we have kernel-assisted metadata scrubbing? */
+ if (!xfs_can_scrub_fs_metadata(ctx) || !xfs_can_scrub_inode(ctx) ||
+ !xfs_can_scrub_bmap(ctx) || !xfs_can_scrub_dir(ctx) ||
+ !xfs_can_scrub_attr(ctx) || !xfs_can_scrub_symlink(ctx)) {
+ str_info(ctx, ctx->mntpoint,
+_("kernel metadata scrub is required."));
+ goto err;
+ }
+
+ /* Go find the XFS devices if we have a usable fsmap. */
+ fs_table_initialise(0, NULL, 0, NULL);
+ errno = 0;
+ fsp = fs_table_lookup(ctx->mntpoint, FS_MOUNT_POINT);
+ if (!fsp) {
+ str_error(ctx, ctx->mntpoint,
+_("Unable to find XFS information."));
+ goto err;
+ }
+ memcpy(&ctx->fsinfo, fsp, sizeof(struct fs_path));
+
+ /* Did we find the log and rt devices, if they're present? */
+ if (ctx->geo.logstart == 0 && ctx->fsinfo.fs_log == NULL) {
+ str_error(ctx, ctx->mntpoint,
+_("Unable to find log device path."));
+ goto err;
+ }
+ if (ctx->geo.rtblocks && ctx->fsinfo.fs_rt == NULL) {
+ str_error(ctx, ctx->mntpoint,
+_("Unable to find realtime device path."));
+ goto err;
+ }
+
+ /* Open the raw devices. */
+ error = disk_open(ctx->fsinfo.fs_name, &ctx->datadev);
+ if (error) {
+ str_errno(ctx, ctx->fsinfo.fs_name);
+ goto err;
+ }
+ ctx->nr_io_threads = libxfs_nproc();
+
+ if (ctx->fsinfo.fs_log) {
+ error = disk_open(ctx->fsinfo.fs_log, &ctx->logdev);
+ if (error) {
+ str_errno(ctx, ctx->fsinfo.fs_log);
+ goto err;
+ }
+ }
+ if (ctx->fsinfo.fs_rt) {
+ error = disk_open(ctx->fsinfo.fs_rt, &ctx->rtdev);
+ if (error) {
+ str_errno(ctx, ctx->fsinfo.fs_rt);
+ goto err;
+ }
+ }
+
+ return true;
+err:
+ return false;
+}
+
+/* Phase 2: Check internal metadata. */
+
+/* Defer all the repairs until phase 4. */
+static void
+xfs_defer_repairs(
+ struct scrub_ctx *ctx,
+ struct list_head *repairs)
+{
+ if (list_empty(repairs))
+ return;
+
+ pthread_mutex_lock(&ctx->lock);
+ list_splice_tail_init(repairs, &ctx->repair_list);
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Repair some AG metadata; broken things are remembered for later. */
+static bool
+xfs_quick_repair(
+ struct scrub_ctx *ctx,
+ struct list_head *repairs)
+{
+ bool moveon;
+
+ moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd, repairs,
+ XRML_REPAIR_ONLY);
+ if (!moveon)
+ return moveon;
+
+ xfs_defer_repairs(ctx, repairs);
+ return true;
+}
+
+/* Scrub each AG's metadata btrees. */
+static void
+xfs_scan_ag_metadata(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->mp;
+ bool *pmoveon = arg;
+ struct repair_item *n;
+ struct repair_item *ri;
+ struct list_head repairs;
+ struct list_head repair_now;
+ unsigned int broken_primaries;
+ unsigned int broken_secondaries;
+ bool moveon;
+ char descr[DESCR_BUFSZ];
+
+ INIT_LIST_HEAD(&repairs);
+ INIT_LIST_HEAD(&repair_now);
+ snprintf(descr, DESCR_BUFSZ, _("AG %u"), agno);
+
+ /*
+ * First we scrub and fix the AG headers, because we need
+ * them to work well enough to check the AG btrees.
+ */
+ moveon = xfs_scrub_ag_headers(ctx, agno, &repairs);
+ if (!moveon)
+ goto err;
+
+ /* Repair header damage. */
+ moveon = xfs_quick_repair(ctx, &repairs);
+ if (!moveon)
+ goto err;
+
+ /* Now scrub the AG btrees. */
+ moveon = xfs_scrub_ag_metadata(ctx, agno, &repairs);
+ if (!moveon)
+ goto err;
+
+ /*
+ * Figure out if we need to perform early fixing. The only
+ * reason we need to do this is if the inobt is broken, which
+ * prevents phase 3 (inode scan) from running. We can rebuild
+ * the inobt from rmapbt data, but if the rmapbt is broken even
+ * at this early phase then we are sunk.
+ */
+ broken_secondaries = 0;
+ broken_primaries = 0;
+ list_for_each_entry_safe(ri, n, &repairs, list) {
+ switch (ri->op.sm_type) {
+ case XFS_SCRUB_TYPE_RMAPBT:
+ broken_secondaries++;
+ break;
+ case XFS_SCRUB_TYPE_FINOBT:
+ case XFS_SCRUB_TYPE_INOBT:
+ list_del(&ri->list);
+ list_add_tail(&ri->list, &repair_now);
+ /* fall through */
+ case XFS_SCRUB_TYPE_BNOBT:
+ case XFS_SCRUB_TYPE_CNTBT:
+ case XFS_SCRUB_TYPE_REFCNTBT:
+ broken_primaries++;
+ break;
+ default:
+ ASSERT(false);
+ break;
+ }
+ }
+ if (broken_secondaries && !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
+ if (broken_primaries)
+ str_warn(ctx, descr,
+_("Corrupt primary and secondary block mapping metadata."));
+ else
+ str_warn(ctx, descr,
+_("Corrupt secondary block mapping metadata."));
+ str_warn(ctx, descr,
+_("Filesystem might not be repairable."));
+ }
+
+ /* Repair (inode) btree damage. */
+ moveon = xfs_quick_repair(ctx, &repair_now);
+ if (!moveon)
+ goto err;
+
+ /* Everything else gets fixed during phase 4. */
+ xfs_defer_repairs(ctx, &repairs);
+
+ return;
+err:
+ *pmoveon = false;
+ return;
+}
+
+/* Scrub whole-FS metadata btrees. */
+static void
+xfs_scan_fs_metadata(
+ struct work_queue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->mp;
+ bool *pmoveon = arg;
+ struct list_head repairs;
+ bool moveon;
+
+ INIT_LIST_HEAD(&repairs);
+ moveon = xfs_scrub_fs_metadata(ctx, &repairs);
+ if (!moveon)
+ *pmoveon = false;
+
+ pthread_mutex_lock(&ctx->lock);
+ list_splice_tail_init(&repairs, &ctx->repair_list);
+ pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Scan AG and whole-FS metadata with the kernel scrubbers. */
+bool
+xfs_scan_metadata(
+ struct scrub_ctx *ctx)
+{
+ xfs_agnumber_t agno;
+ struct work_queue wq;
+ bool moveon = true;
+
+ create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+ queue_work(&wq, xfs_scan_fs_metadata, 0, &moveon);
+ for (agno = 0; agno < ctx->geo.agcount; agno++)
+ queue_work(&wq, xfs_scan_ag_metadata, agno, &moveon);
+ destroy_work_queue(&wq);
+
+ return moveon;
+}
+
+/* Phase 3: Scan all inodes. */
+
+/*
+ * Scrub part of a file. If the user passes in a valid fd we assume
+ * that's the file to check; otherwise, pass in the inode number and
+ * let the kernel sort it out.
+ */
+static bool
+xfs_scrub_fd(
+ struct scrub_ctx *ctx,
+ bool (*fn)(struct scrub_ctx *, uint64_t,
+ uint32_t, int, struct list_head *),
+ struct xfs_bstat *bs,
+ int fd,
+ struct list_head *repairs)
+{
+ if (fd < 0)
+ fd = ctx->mnt_fd;
+ return fn(ctx, bs->bs_ino, bs->bs_gen, fd, repairs);
+}
+
+/* Verify the contents, xattrs, and extent maps of an inode. */
+static int
+xfs_scrub_inode(
+ struct scrub_ctx *ctx,
+ struct xfs_handle *handle,
+ struct xfs_bstat *bstat,
+ void *arg)
+{
+ struct list_head repairs;
+ char descr[DESCR_BUFSZ];
+ bool moveon = true;
+ int fd = -1;
+ int error = 0;
+
+ INIT_LIST_HEAD(&repairs);
+ snprintf(descr, DESCR_BUFSZ, _("inode %llu"), bstat->bs_ino);
+
+ /* Try to open the inode to pin it. */
+ if (S_ISREG(bstat->bs_mode) || S_ISDIR(bstat->bs_mode)) {
+ fd = open_by_fshandle(handle, sizeof(*handle),
+ O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+ if (fd < 0) {
+ error = errno;
+ if (error != ESTALE)
+ str_errno(ctx, descr);
+ goto out;
+ }
+ }
+
+ /* Scrub the inode. */
+ moveon = xfs_scrub_fd(ctx, xfs_scrub_inode_fields, bstat, fd,
+ &repairs);
+ if (!moveon)
+ goto out;
+
+ moveon = xfs_quick_repair(ctx, &repairs);
+ if (!moveon)
+ goto out;
+
+ /* Scrub all block mappings. */
+ moveon = xfs_scrub_fd(ctx, xfs_scrub_data_fork, bstat, fd,
+ &repairs);
+ if (!moveon)
+ goto out;
+ moveon = xfs_scrub_fd(ctx, xfs_scrub_attr_fork, bstat, fd,
+ &repairs);
+ if (!moveon)
+ goto out;
+ moveon = xfs_scrub_fd(ctx, xfs_scrub_cow_fork, bstat, fd,
+ &repairs);
+ if (!moveon)
+ goto out;
+
+ moveon = xfs_quick_repair(ctx, &repairs);
+ if (!moveon)
+ goto out;
+
+ /* XXX: Some day, check child -> parent dir -> child. */
+
+ if (S_ISLNK(bstat->bs_mode)) {
+ /* Check symlink contents. */
+ moveon = xfs_scrub_symlink(ctx, bstat->bs_ino,
+ bstat->bs_gen, ctx->mnt_fd, &repairs);
+ } else if (S_ISDIR(bstat->bs_mode)) {
+ /* Check the directory entries. */
+ moveon = xfs_scrub_fd(ctx, xfs_scrub_dir, bstat, fd, &repairs);
+ }
+ if (!moveon)
+ goto out;
+
+ /*
+ * Read all the extended attributes. If any of the read
+ * functions decline to move on, we can try again with the
+ * VFS functions if we have a file descriptor.
+ */
+ moveon = xfs_scrub_fd(ctx, xfs_scrub_attr, bstat, fd, &repairs);
+ if (!moveon)
+ goto out;
+
+ moveon = xfs_quick_repair(ctx, &repairs);
+
+out:
+ xfs_defer_repairs(ctx, &repairs);
+ if (fd >= 0)
+ close(fd);
+ if (error)
+ return error;
+ return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Verify all the inodes in a filesystem. */
+bool
+xfs_scan_inodes(
+ struct scrub_ctx *ctx)
+{
+ if (!xfs_scan_all_inodes(ctx, xfs_scrub_inode))
+ return false;
+ xfs_scrub_report_preen_triggers(ctx);
+ return true;
+}
+
+/* Phase 4: Repair filesystem. */
+
+static int
+list_length(
+ struct list_head *head)
+{
+ struct list_head *pos;
+ int nr = 0;
+
+ list_for_each(pos, head) {
+ nr++;
+ }
+
+ return nr;
+}
+
+/* Fix the per-AG and per-FS metadata. */
+bool
+xfs_repair_fs(
+ struct scrub_ctx *ctx)
+{
+ int len;
+ int old_len;
+ bool moveon;
+
+ /* Repair anything broken until we fail to make progress. */
+ len = list_length(&ctx->repair_list);
+ do {
+ old_len = len;
+ moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd,
+ &ctx->repair_list, 0);
+ if (!moveon)
+ return false;
+ len = list_length(&ctx->repair_list);
+ } while (old_len > len);
+
+ /* Try once more, but this time complain if we can't fix things. */
+ moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd,
+ &ctx->repair_list, XRML_NOFIX_COMPLAIN);
+ if (!moveon)
+ return false;
+
+ fstrim(ctx);
+ return true;
+}
+
+/* Phase 5: Verify data file integrity. */
+
+/* Verify disk blocks with GETFSMAP */
+
+struct xfs_verify_extent {
+ /* Maintain state for the lazy read verifier. */
+ struct read_verify rv;
+
+ /* Store bad extents if we don't have parent pointers. */
+ struct bitmap *d_bad; /* bytes */
+ struct bitmap *r_bad; /* bytes */
+
+ /* Track the last extent we saw. */
+ uint64_t laststart; /* bytes */
+ uint64_t lastlength; /* bytes */
+ bool lastshared; /* was the last extent shared? */
+};
+
+/* Report an IO error resulting from read-verify based off getfsmap. */
+static bool
+xfs_check_rmap_error_report(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ struct fsmap *map,
+ void *arg)
+{
+ const char *type;
+ char buf[32];
+ uint64_t err_physical = *(uint64_t *)arg;
+ uint64_t err_off;
+
+ if (err_physical > map->fmr_physical)
+ err_off = err_physical - map->fmr_physical;
+ else
+ err_off = 0;
+
+ snprintf(buf, 32, _("disk offset %llu"),
+ BTOBB(map->fmr_physical + err_off));
+
+ if (map->fmr_flags & FMR_OF_SPECIAL_OWNER) {
+ type = xfs_decode_special_owner(map->fmr_owner);
+ str_error(ctx, buf,
+_("%s failed read verification."),
+ type);
+ } else if (xfs_scrub_can_getparent(ctx)) {
+ /* XXX: go find the parent path */
+ str_error(ctx, buf,
+_("XXX: inode %lld offset %llu failed read verification."),
+ map->fmr_owner, map->fmr_offset + err_off);
+ }
+ return true;
+}
+
+/* Handle a read error in the rmap-based read verify. */
+void
+xfs_check_rmap_ioerr(
+ struct read_verify_pool *rvp,
+ struct disk *disk,
+ uint64_t start,
+ uint64_t length,
+ int error,
+ void *arg)
+{
+ struct fsmap keys[2];
+ char descr[DESCR_BUFSZ];
+ struct scrub_ctx *ctx = rvp->rvp_ctx;
+ struct xfs_verify_extent *ve;
+ struct bitmap *tree;
+ dev_t dev;
+ bool moveon;
+
+ ve = arg;
+ dev = xfs_disk_to_dev(ctx, disk);
+
+ /*
+ * If we don't have parent pointers, save the bad extent for
+ * later rescanning.
+ */
+ if (!xfs_scrub_can_getparent(ctx)) {
+ if (dev == ctx->fsinfo.fs_datadev)
+ tree = ve->d_bad;
+ else if (dev == ctx->fsinfo.fs_rtdev)
+ tree = ve->r_bad;
+ else
+ tree = NULL;
+ if (tree) {
+ moveon = bitmap_add(tree, start, length);
+ if (!moveon)
+ str_errno(ctx, ctx->mntpoint);
+ }
+ }
+
+ snprintf(descr, DESCR_BUFSZ, _("dev %d:%d ioerr @ %"PRIu64":%"PRIu64" "),
+ major(dev), minor(dev), start, length);
+
+ /* Go figure out which blocks are bad from the fsmap. */
+ memset(keys, 0, sizeof(struct fsmap) * 2);
+ keys->fmr_device = dev;
+ keys->fmr_physical = start;
+ (keys + 1)->fmr_device = dev;
+ (keys + 1)->fmr_physical = start + length - 1;
+ (keys + 1)->fmr_owner = ULLONG_MAX;
+ (keys + 1)->fmr_offset = ULLONG_MAX;
+ (keys + 1)->fmr_flags = UINT_MAX;
+ xfs_iterate_fsmap(ctx, descr, keys, xfs_check_rmap_error_report,
+ &start);
+}
+
+/* Read verify a (data block) extent. */
+static bool
+xfs_check_rmap(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ struct fsmap *map,
+ void *arg)
+{
+ struct xfs_verify_extent *ve = arg;
+ struct disk *disk;
+
+ dbg_printf("rmap dev %d:%d phys %llu owner %lld offset %llu "
+ "len %llu flags 0x%x\n", major(map->fmr_device),
+ minor(map->fmr_device), map->fmr_physical,
+ map->fmr_owner, map->fmr_offset,
+ map->fmr_length, map->fmr_flags);
+
+ /* Remember this extent. */
+ ve->lastshared = (map->fmr_flags & FMR_OF_SHARED);
+ ve->laststart = map->fmr_physical;
+ ve->lastlength = map->fmr_length;
+
+ /* "Unknown" extents should be verified; they could be data. */
+ if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+ map->fmr_owner == XFS_FMR_OWN_UNKNOWN)
+ map->fmr_flags &= ~FMR_OF_SPECIAL_OWNER;
+
+ /*
+ * We only care about read-verifying data extents that have been
+ * written to disk. This means we can skip "special" owners
+ * (metadata), xattr blocks, unwritten extents, and extent maps.
+ * These should all get checked elsewhere in the scrubber.
+ */
+ if (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+ FMR_OF_EXTENT_MAP | FMR_OF_SPECIAL_OWNER))
+ goto out;
+
+ /* XXX: Filter out directory data blocks. */
+
+ /* Schedule the read verify command for (eventual) running. */
+ disk = xfs_dev_to_disk(ctx, map->fmr_device);
+
+ read_verify_schedule(&ctx->rvp, &ve->rv, disk, map->fmr_physical,
+ map->fmr_length, ve);
+
+out:
+ /* Is this the last extent? Fire off the read. */
+ if (map->fmr_flags & FMR_OF_LAST)
+ read_verify_force(&ctx->rvp, &ve->rv);
+
+ return true;
+}
+
+/* Verify all the blocks in a filesystem. */
+bool
+xfs_scan_blocks(
+ struct scrub_ctx *ctx)
+{
+ struct bitmap d_bad;
+ struct bitmap r_bad;
+ struct xfs_verify_extent *ve;
+ struct xfs_verify_extent *v;
+ int i;
+ unsigned int groups;
+ bool moveon;
+
+ /*
+ * Initialize our per-thread context. By convention,
+ * the log device comes first, then the rt device, and then
+ * the AGs.
+ */
+ groups = xfs_scan_all_blocks_array_size(ctx);
+ ve = calloc(groups, sizeof(struct xfs_verify_extent));
+ if (!ve) {
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+
+ moveon = bitmap_init(&d_bad);
+ if (!moveon) {
+ str_errno(ctx, ctx->mntpoint);
+ goto out_ve;
+ }
+
+ moveon = bitmap_init(&r_bad);
+ if (!moveon) {
+ str_errno(ctx, ctx->mntpoint);
+ goto out_dbad;
+ }
+
+ for (i = 0, v = ve; i < groups; i++, v++) {
+ v->d_bad = &d_bad;
+ v->r_bad = &r_bad;
+ }
+
+ moveon = xfs_read_verify_pool_init(ctx, xfs_check_rmap_ioerr);
+ if (!moveon)
+ goto out_rbad;
+ moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_check_rmap,
+ ve, sizeof(*ve));
+ if (!moveon)
+ goto out_pool;
+
+ for (i = 0, v = ve; i < groups; i++, v++)
+ read_verify_force(&ctx->rvp, &v->rv);
+ read_verify_pool_destroy(&ctx->rvp);
+
+ /* Scan the whole dir tree to see what matches the bad extents. */
+ if (!bitmap_empty(&d_bad) || !bitmap_empty(&r_bad))
+ moveon = xfs_report_verify_errors(ctx, &d_bad, &r_bad);
+
+ bitmap_free(&r_bad);
+ bitmap_free(&d_bad);
+ free(ve);
+ return moveon;
+
+out_pool:
+ read_verify_pool_destroy(&ctx->rvp);
+out_rbad:
+ bitmap_free(&r_bad);
+out_dbad:
+ bitmap_free(&d_bad);
+out_ve:
+ free(ve);
+ return moveon;
+}
+
+/* Phase 6: Check summary counters. */
+
+struct xfs_summary_counts {
+ unsigned long long inodes; /* number of inodes */
+ unsigned long long dbytes; /* data dev bytes */
+ unsigned long long rbytes; /* rt dev bytes */
+ unsigned long long next_phys; /* next phys bytes we see? */
+ unsigned long long agbytes; /* freespace bytes */
+ struct bitmap dext; /* data block extent bitmap */
+ struct bitmap rext; /* rt block extent bitmap */
+};
+
+struct xfs_inode_fork_summary {
+ struct bitmap *tree;
+ unsigned long long bytes;
+};
+
+/* Record inode and block usage. */
+static int
+xfs_record_inode_summary(
+ struct scrub_ctx *ctx,
+ struct xfs_handle *handle,
+ struct xfs_bstat *bstat,
+ void *arg)
+{
+ struct xfs_summary_counts *counts = arg;
+
+ counts->inodes++;
+ return 0;
+}
+
+/* Record block usage. */
+static bool
+xfs_record_block_summary(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ struct fsmap *fsmap,
+ void *arg)
+{
+ struct xfs_summary_counts *counts = arg;
+ unsigned long long len;
+
+ if (fsmap->fmr_device == ctx->fsinfo.fs_logdev)
+ return true;
+ if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+ fsmap->fmr_owner == XFS_FMR_OWN_FREE)
+ return true;
+
+ len = fsmap->fmr_length;
+
+ /* freesp btrees live in free space, need to adjust counters later. */
+ if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+ fsmap->fmr_owner == XFS_FMR_OWN_AG) {
+ counts->agbytes += fsmap->fmr_length;
+ }
+ if (fsmap->fmr_device == ctx->fsinfo.fs_rtdev) {
+ /* Count realtime extents. */
+ counts->rbytes += len;
+ } else {
+ /* Count datadev extents. */
+ if (counts->next_phys >= fsmap->fmr_physical + len)
+ return true;
+ else if (counts->next_phys > fsmap->fmr_physical)
+ len = counts->next_phys - fsmap->fmr_physical;
+ counts->dbytes += len;
+ counts->next_phys = fsmap->fmr_physical + fsmap->fmr_length;
+ }
+
+ return true;
+}
+
+/* Count all inodes and blocks in the filesystem, compare to superblock. */
+bool
+xfs_check_summary(
+ struct scrub_ctx *ctx)
+{
+ struct xfs_fsop_counts fc;
+ struct xfs_fsop_resblks rb;
+ struct xfs_fsop_ag_resblks arb;
+ struct statvfs sfs;
+ struct xfs_summary_counts *summary;
+ unsigned long long fd;
+ unsigned long long fr;
+ unsigned long long fi;
+ unsigned long long sd;
+ unsigned long long sr;
+ unsigned long long si;
+ unsigned long long absdiff;
+ xfs_agnumber_t agno;
+ bool moveon;
+ bool complain;
+ unsigned int groups;
+ int error;
+
+ groups = xfs_scan_all_blocks_array_size(ctx);
+ summary = calloc(groups, sizeof(struct xfs_summary_counts));
+ if (!summary) {
+ str_errno(ctx, ctx->mntpoint);
+ return false;
+ }
+
+ /* Flush everything out to disk before we start counting. */
+ moveon = !syncfs(ctx->mnt_fd);
+ if (!moveon) {
+ str_errno(ctx, ctx->mntpoint);
+ goto out;
+ }
+
+ /* Use fsmap to count blocks. */
+ moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_record_block_summary,
+ summary, sizeof(*summary));
+ if (!moveon)
+ goto out;
+
+ /* Scan the whole fs. */
+ moveon = xfs_scan_all_inodes_array_arg(ctx, xfs_record_inode_summary,
+ summary, sizeof(*summary));
+ if (!moveon)
+ goto out;
+
+ /* Sum the counts. */
+ for (agno = 1; agno < groups; agno++) {
+ summary[0].inodes += summary[agno].inodes;
+ summary[0].dbytes += summary[agno].dbytes;
+ summary[0].rbytes += summary[agno].rbytes;
+ summary[0].agbytes += summary[agno].agbytes;
+ }
+
+ /* Fetch the filesystem counters. */
+ error = ioctl(ctx->mnt_fd, XFS_IOC_FSCOUNTS, &fc);
+ if (error)
+ str_errno(ctx, ctx->mntpoint);
+
+ /* Grab the fstatvfs counters, since it has to report accurately. */
+ moveon = !fstatvfs(ctx->mnt_fd, &sfs);
+ if (!moveon) {
+ str_errno(ctx, ctx->mntpoint);
+ goto out;
+ }
+
+ /*
+ * XFS reserves some blocks to prevent hard ENOSPC, so add those
+ * blocks back to the free data counts.
+ */
+ error = ioctl(ctx->mnt_fd, XFS_IOC_GET_RESBLKS, &rb);
+ if (error)
+ str_errno(ctx, ctx->mntpoint);
+ sfs.f_bfree += rb.resblks_avail;
+
+ /*
+ * XFS with rmap or reflink reserves blocks in each AG to
+ * prevent the AG from running out of space for metadata blocks.
+ * Add those back to the free data counts.
+ */
+ memset(&arb, 0, sizeof(arb));
+ error = ioctl(ctx->mnt_fd, XFS_IOC_GET_AG_RESBLKS, &arb);
+ if (error && errno != ENOTTY)
+ str_errno(ctx, ctx->mntpoint);
+ sfs.f_bfree += arb.resblks;
+
+ /*
+ * If we counted blocks with fsmap, then dblocks includes
+ * blocks for the AGFL and the freespace/rmap btrees. The
+ * filesystem treats them as "free", but since we scanned
+ * them, we'll consider them used.
+ */
+ sfs.f_bfree -= summary[0].agbytes >> ctx->blocklog;
+
+ /* Report on what we found. */
+ fd = (ctx->geo.datablocks - sfs.f_bfree) << ctx->blocklog;
+ fr = (ctx->geo.rtblocks - fc.freertx) << ctx->blocklog;
+ fi = sfs.f_files - sfs.f_ffree;
+ sd = summary[0].dbytes;
+ sr = summary[0].rbytes;
+ si = summary[0].inodes;
+
+ /*
+ * Complain if the counts are off by more than 10% unless
+ * the inaccuracy is less than 32MB worth of blocks or 100 inodes.
+ */
+ absdiff = 1ULL << 25;
+ complain = !within_range(ctx, sd, fd, absdiff, 1, 10, _("data blocks"));
+ complain |= !within_range(ctx, sr, fr, absdiff, 1, 10, _("realtime blocks"));
+ complain |= !within_range(ctx, si, fi, 100, 1, 10, _("inodes"));
+
+ if (complain || verbose) {
+ double d, r, i;
+ char *du, *ru, *iu;
+
+ if (fr || sr) {
+ d = auto_space_units(fd, &du);
+ r = auto_space_units(fr, &ru);
+ i = auto_units(fi, &iu);
+ fprintf(stdout,
+_("%.1f%s data used; %.1f%s realtime data used; %.2f%s inodes used.\n"),
+ d, du, r, ru, i, iu);
+ d = auto_space_units(sd, &du);
+ r = auto_space_units(sr, &ru);
+ i = auto_units(si, &iu);
+ fprintf(stdout,
+_("%.1f%s data found; %.1f%s realtime data found; %.2f%s inodes found.\n"),
+ d, du, r, ru, i, iu);
+ } else {
+ d = auto_space_units(fd, &du);
+ i = auto_units(fi, &iu);
+ fprintf(stdout,
+_("%.1f%s data used; %.1f%s inodes used.\n"),
+ d, du, i, iu);
+ d = auto_space_units(sd, &du);
+ i = auto_units(si, &iu);
+ fprintf(stdout,
+_("%.1f%s data found; %.1f%s inodes found.\n"),
+ d, du, i, iu);
+ }
+ fflush(stdout);
+ }
+ moveon = true;
+
+out:
+ for (agno = 0; agno < groups; agno++) {
+ bitmap_free(&summary[agno].dext);
+ bitmap_free(&summary[agno].rext);
+ }
+ free(summary);
+ return moveon;
+}
+
+/* Shut down the filesystem. */
+void
+xfs_shutdown_fs(
+ struct scrub_ctx *ctx)
+{
+ int flag;
+
+ flag = XFS_FSOP_GOING_FLAGS_LOGFLUSH;
+ str_info(ctx, ctx->mntpoint, _("Shutting down filesystem!"));
+ if (ioctl(ctx->mnt_fd, XFS_IOC_GOINGDOWN, &flag))
+ str_errno(ctx, ctx->mntpoint);
+}
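
The phase 4 loop in xfs_repair_fs() keeps calling the repair routine on the
deferred list for as long as each pass shortens that list. A minimal
standalone sketch of that "retry until no progress" pattern is shown here.
Everything in it is hypothetical and not part of xfs_scrub: struct item,
try_repair(), nr_broken(), repair_until_stuck(), and the rule that an item can
only be fixed once its dependency is fixed are all invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical repair item: fixable only after its dependency is fixed. */
struct item {
	int	dep;	/* index of the prerequisite item, or -1 for none */
	bool	fixed;
};

/* Stand-in for one repair call; fails while the dependency is broken. */
static bool
try_repair(struct item *items, size_t i)
{
	int	dep = items[i].dep;

	if (dep >= 0 && !items[dep].fixed)
		return false;
	items[i].fixed = true;
	return true;
}

/* Count the items still awaiting repair (the role list_length() plays). */
static size_t
nr_broken(const struct item *items, size_t nr)
{
	size_t	i, broken = 0;

	for (i = 0; i < nr; i++)
		if (!items[i].fixed)
			broken++;
	return broken;
}

/* Make repair passes for as long as each pass shrinks the broken set. */
static size_t
repair_until_stuck(struct item *items, size_t nr)
{
	size_t	len, old_len, i;

	assert(items != NULL || nr == 0);

	len = nr_broken(items, nr);
	do {
		old_len = len;
		for (i = 0; i < nr; i++)
			if (!items[i].fixed)
				try_repair(items, i);
		len = nr_broken(items, nr);
	} while (old_len > len);

	return len;	/* items that no further pass could fix */
}
```

The loop ends after the first pass that fixes nothing, so an item that can
never be repaired costs one extra pass and cannot cause an endless retry.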
diff --git a/scrub/xfs_ioctl.c b/scrub/xfs_ioctl.c
new file mode 100644
index 0000000..f71e1f7
--- /dev/null
+++ b/scrub/xfs_ioctl.c
@@ -0,0 +1,968 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "xfs_ioctl.h"
+
+#define FSMAP_NR 65536
+#define BMAP_NR 2048
+
+/* Call the handler function. */
+static int
+xfs_iterate_inode_func(
+ struct scrub_ctx *ctx,
+ xfs_inode_iter_fn fn,
+ struct xfs_bstat *bs,
+ struct xfs_handle *handle,
+ void *arg)
+{
+ int error;
+
+ handle->ha_fid.fid_ino = bs->bs_ino;
+ handle->ha_fid.fid_gen = bs->bs_gen;
+ error = fn(ctx, handle, bs, arg);
+ if (error)
+ return error;
+ if (xfs_scrub_excessive_errors(ctx))
+ return XFS_ITERATE_INODES_ABORT;
+ return 0;
+}
+
+/* Iterate a range of inodes. */
+bool
+xfs_iterate_inodes(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ void *fshandle,
+ uint64_t first_ino,
+ uint64_t last_ino,
+ xfs_inode_iter_fn fn,
+ void *arg)
+{
+ struct xfs_fsop_bulkreq igrpreq = {0};
+ struct xfs_fsop_bulkreq bulkreq = {0};
+ struct xfs_fsop_bulkreq onereq = {0};
+ struct xfs_handle handle;
+ struct xfs_inogrp inogrp;
+ struct xfs_bstat bstat[XFS_INODES_PER_CHUNK] = {0};
+ char idescr[DESCR_BUFSZ];
+ char buf[DESCR_BUFSZ];
+ struct xfs_bstat *bs;
+ __u64 last_stale = first_ino - 1;
+ __u64 igrp_ino;
+ __u64 oneino;
+ __u64 ino;
+ __s32 bulklen = 0;
+ __s32 onelen = 0;
+ __s32 igrplen = 0;
+ bool moveon = true;
+ int i;
+ int error;
+ int stale_count = 0;
+
+ assert(!debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"));
+
+ onereq.lastip = &oneino;
+ onereq.icount = 1;
+ onereq.ocount = &onelen;
+
+ bulkreq.lastip = &ino;
+ bulkreq.icount = XFS_INODES_PER_CHUNK;
+ bulkreq.ubuffer = &bstat;
+ bulkreq.ocount = &bulklen;
+
+ igrpreq.lastip = &igrp_ino;
+ igrpreq.icount = 1;
+ igrpreq.ubuffer = &inogrp;
+ igrpreq.ocount = &igrplen;
+
+ memcpy(&handle.ha_fsid, fshandle, sizeof(handle.ha_fsid));
+ handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
+ sizeof(handle.ha_fid.fid_len);
+ handle.ha_fid.fid_pad = 0;
+
+ /* Find the inode chunk & alloc mask */
+ igrp_ino = first_ino;
+ error = ioctl(ctx->mnt_fd, XFS_IOC_FSINUMBERS, &igrpreq);
+ while (!error && igrplen) {
+ /* Load the inodes. */
+ ino = inogrp.xi_startino - 1;
+ bulkreq.icount = inogrp.xi_alloccount;
+ error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT, &bulkreq);
+ if (error)
+ str_warn(ctx, descr, "%s", strerror_r(errno,
+ buf, DESCR_BUFSZ));
+
+ /* Did we get exactly the inodes we expected? */
+ for (i = 0, bs = bstat; i < XFS_INODES_PER_CHUNK; i++) {
+ if (!(inogrp.xi_allocmask & (1ULL << i)))
+ continue;
+ if (bs->bs_ino == inogrp.xi_startino + i) {
+ bs++;
+ continue;
+ }
+
+ /* Load the one inode. */
+ oneino = inogrp.xi_startino + i;
+ onereq.ubuffer = bs;
+ error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT_SINGLE,
+ &onereq);
+ if (error || bs->bs_ino != inogrp.xi_startino + i) {
+ memset(bs, 0, sizeof(struct xfs_bstat));
+ bs->bs_ino = inogrp.xi_startino + i;
+ bs->bs_blksize = ctx->mnt_sv.f_frsize;
+ }
+ bs++;
+ }
+
+ /* Iterate all the inodes. */
+ for (i = 0, bs = bstat; i < inogrp.xi_alloccount; i++, bs++) {
+ if (bs->bs_ino > last_ino)
+ goto out;
+
+ error = xfs_iterate_inode_func(ctx, fn, bs, &handle,
+ arg);
+ switch (error) {
+ case 0:
+ break;
+ case ESTALE:
+ if (last_stale == inogrp.xi_startino)
+ stale_count++;
+ else {
+ last_stale = inogrp.xi_startino;
+ stale_count = 0;
+ }
+ if (stale_count < 30) {
+ igrp_ino = inogrp.xi_startino;
+ goto igrp_retry;
+ }
+ snprintf(idescr, DESCR_BUFSZ, "inode %llu",
+ bs->bs_ino);
+ str_warn(ctx, idescr, "%s", strerror_r(error,
+ buf, DESCR_BUFSZ));
+ break;
+ case XFS_ITERATE_INODES_ABORT:
+ error = 0;
+ /* fall thru */
+ default:
+ moveon = false;
+ errno = error;
+ goto err;
+ }
+ }
+
+igrp_retry:
+ error = ioctl(ctx->mnt_fd, XFS_IOC_FSINUMBERS, &igrpreq);
+ }
+
+err:
+ if (error) {
+ str_errno(ctx, descr);
+ moveon = false;
+ }
+out:
+ return moveon;
+}
+
+/* Does the kernel support bulkstat? */
+bool
+xfs_can_iterate_inodes(
+ struct scrub_ctx *ctx)
+{
+ struct xfs_fsop_bulkreq bulkreq;
+ __u64 lastino;
+ __s32 bulklen = 0;
+ int error;
+
+ if (debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"))
+ return false;
+
+ lastino = 0;
+ memset(&bulkreq, 0, sizeof(bulkreq));
+ bulkreq.lastip = (__u64 *)&lastino;
+ bulkreq.icount = 0;
+ bulkreq.ubuffer = NULL;
+ bulkreq.ocount = &bulklen;
+
+ error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT, &bulkreq);
+ return error == -1 && errno == EINVAL;
+}
+
+/* Iterate all the extent block mappings between the two keys. */
+bool
+xfs_iterate_bmap(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ int fd,
+ int whichfork,
+ struct xfs_bmap *key,
+ xfs_bmap_iter_fn fn,
+ void *arg)
+{
+ struct fsxattr fsx;
+ struct getbmapx *map;
+ struct getbmapx *p;
+ struct xfs_bmap bmap;
+ char bmap_descr[DESCR_BUFSZ];
+ bool moveon = true;
+ xfs_off_t new_off;
+ int getxattr_type;
+ int i;
+ int error;
+
+ assert(!debug_tweak_on("XFS_SCRUB_NO_BMAP"));
+
+ switch (whichfork) {
+ case XFS_ATTR_FORK:
+ snprintf(bmap_descr, DESCR_BUFSZ, _("%s attr"), descr);
+ break;
+ case XFS_COW_FORK:
+ snprintf(bmap_descr, DESCR_BUFSZ, _("%s CoW"), descr);
+ break;
+ case XFS_DATA_FORK:
+ snprintf(bmap_descr, DESCR_BUFSZ, _("%s data"), descr);
+ break;
+ default:
+ assert(0);
+ }
+
+ map = calloc(BMAP_NR, sizeof(struct getbmapx));
+ if (!map) {
+ str_errno(ctx, bmap_descr);
+ return false;
+ }
+
+ map->bmv_offset = BTOBB(key->bm_offset);
+ map->bmv_block = BTOBB(key->bm_physical);
+ if (key->bm_length == 0)
+ map->bmv_length = ULLONG_MAX;
+ else
+ map->bmv_length = BTOBB(key->bm_length);
+ map->bmv_count = BMAP_NR;
+ map->bmv_iflags = BMV_IF_NO_DMAPI_READ | BMV_IF_PREALLOC |
+ BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
+ switch (whichfork) {
+ case XFS_ATTR_FORK:
+ getxattr_type = XFS_IOC_FSGETXATTRA;
+ map->bmv_iflags |= BMV_IF_ATTRFORK;
+ break;
+ case XFS_COW_FORK:
+ map->bmv_iflags |= BMV_IF_COWFORK;
+ getxattr_type = FS_IOC_FSGETXATTR;
+ break;
+ case XFS_DATA_FORK:
+ getxattr_type = FS_IOC_FSGETXATTR;
+ break;
+ default:
+ abort();
+ }
+
+ error = ioctl(fd, getxattr_type, &fsx);
+ if (error < 0) {
+ str_errno(ctx, bmap_descr);
+ moveon = false;
+ goto out;
+ }
+
+ while ((error = ioctl(fd, XFS_IOC_GETBMAPX, map)) == 0) {
+ for (i = 0, p = &map[i + 1]; i < map->bmv_entries; i++, p++) {
+ bmap.bm_offset = BBTOB(p->bmv_offset);
+ bmap.bm_physical = BBTOB(p->bmv_block);
+ bmap.bm_length = BBTOB(p->bmv_length);
+ bmap.bm_flags = p->bmv_oflags;
+ moveon = fn(ctx, bmap_descr, fd, whichfork, &fsx,
+ &bmap, arg);
+ if (!moveon)
+ goto out;
+ if (xfs_scrub_excessive_errors(ctx)) {
+ moveon = false;
+ goto out;
+ }
+ }
+
+ if (map->bmv_entries == 0)
+ break;
+ p = map + map->bmv_entries;
+ if (p->bmv_oflags & BMV_OF_LAST)
+ break;
+
+ new_off = p->bmv_offset + p->bmv_length;
+ map->bmv_length -= new_off - map->bmv_offset;
+ map->bmv_offset = new_off;
+ }
+
+ /* Pre-reflink filesystems don't know about CoW forks. */
+ if (whichfork == XFS_COW_FORK && error && errno == EINVAL)
+ error = 0;
+
+ if (error)
+ str_errno(ctx, bmap_descr);
+out:
+ memcpy(key, map, sizeof(struct getbmapx));
+ free(map);
+ return moveon;
+}
+
+/* Does the kernel support getbmapx? */
+bool
+xfs_can_iterate_bmap(
+ struct scrub_ctx *ctx)
+{
+ struct getbmapx bsm[2];
+ int error;
+
+ if (debug_tweak_on("XFS_SCRUB_NO_BMAP"))
+ return false;
+
+ memset(bsm, 0, sizeof(struct getbmapx));
+ bsm->bmv_length = ULLONG_MAX;
+ bsm->bmv_count = 2;
+ error = ioctl(ctx->mnt_fd, XFS_IOC_GETBMAPX, bsm);
+ return error == 0;
+}
+
+/* Iterate all the fs block mappings between the two keys. */
+bool
+xfs_iterate_fsmap(
+ struct scrub_ctx *ctx,
+ const char *descr,
+ struct fsmap *keys,
+ xfs_fsmap_iter_fn fn,
+ void *arg)
+{
+ struct fsmap_head *head;
+ struct fsmap *p;
+ bool moveon = true;
+ int i;
+ int error;
+
+ assert(!debug_tweak_on("XFS_SCRUB_NO_FSMAP"));
+
+ head = malloc(fsmap_sizeof(FSMAP_NR));
+ if (!head) {
+ str_errno(ctx, descr);
+ return false;
+ }
+
+ memset(head, 0, sizeof(*head));
+ memcpy(head->fmh_keys, keys, sizeof(struct fsmap) * 2);
+ head->fmh_count = FSMAP_NR;
+
+ while ((error = ioctl(ctx->mnt_fd, FS_IOC_GETFSMAP, head)) == 0) {
+ for (i = 0, p = head->fmh_recs;
+ i < head->fmh_entries;
+ i++, p++) {
+ moveon = fn(ctx, descr, p, arg);
+ if (!moveon)
+ goto out;
+ if (xfs_scrub_excessive_errors(ctx)) {
+ moveon = false;
+ goto out;
+ }
+ }
+
+ if (head->fmh_entries == 0)
+ break;
+ p = &head->fmh_recs[head->fmh_entries - 1];
+ if (p->fmr_flags & FMR_OF_LAST)
+ break;
+ fsmap_advance(head);
+ }
+
+ if (error) {
+ str_errno(ctx, descr);
+ moveon = false;
+ }
+out:
+ free(head);
+ return moveon;
+}
+
+/* Does the kernel support getfsmap? */
+bool
+xfs_can_iterate_fsmap(
+ struct scrub_ctx *ctx)
+{
+ struct fsmap_head head;
+ int error;
+
+ if (debug_tweak_on("XFS_SCRUB_NO_FSMAP"))
+ return false;
+
+ memset(&head, 0, sizeof(struct fsmap_head));
+ head.fmh_keys[1].fmr_device = UINT_MAX;
+ head.fmh_keys[1].fmr_physical = ULLONG_MAX;
+ head.fmh_keys[1].fmr_owner = ULLONG_MAX;
+ head.fmh_keys[1].fmr_offset = ULLONG_MAX;
+ error = ioctl(ctx->mnt_fd, FS_IOC_GETFSMAP, &head);
+ return error == 0 && (head.fmh_oflags & FMH_OF_DEV_T);
+}
+
+/* Online scrub and repair. */
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+ ST_NONE, /* disabled */
+ ST_AGHEADER, /* per-AG header */
+ ST_PERAG, /* per-AG metadata */
+ ST_FS, /* per-FS metadata */
+ ST_INODE, /* per-inode metadata */
+};
+struct scrub_descr {
+ const char *name;
+ enum scrub_type type;
+};
+
+/* These must correspond to XFS_SCRUB_TYPE_ */
+static const struct scrub_descr scrubbers[] = {
+ {"dummy", ST_NONE},
+ {"superblock", ST_AGHEADER},
+ {"free space header", ST_AGHEADER},
+ {"free list", ST_AGHEADER},
+ {"inode header", ST_AGHEADER},
+ {"freesp by block btree", ST_PERAG},
+ {"freesp by length btree", ST_PERAG},
+ {"inode btree", ST_PERAG},
+ {"free inode btree", ST_PERAG},
+ {"reverse mapping btree", ST_PERAG},
+ {"reference count btree", ST_PERAG},
+ {"inode record", ST_INODE},
+ {"data block map", ST_INODE},
+ {"attr block map", ST_INODE},
+ {"CoW block map", ST_INODE},
+ {"directory entries", ST_INODE},
+ {"extended attributes", ST_INODE},
+ {"symbolic link", ST_INODE},
+ {"realtime bitmap", ST_FS},
+ {"realtime summary", ST_FS},
+};
+
+/* Format a scrub description. */
+static void
+format_scrub_descr(
+ char *buf,
+ size_t buflen,
+ struct xfs_scrub_metadata *meta,
+ const struct scrub_descr *sc)
+{
+ switch (sc->type) {
+ case ST_AGHEADER:
+ case ST_PERAG:
+ snprintf(buf, buflen, _("AG %u %s"), meta->sm_agno,
+ _(sc->name));
+ break;
+ case ST_INODE:
+ snprintf(buf, buflen, _("Inode %llu %s"), meta->sm_ino,
+ _(sc->name));
+ break;
+ case ST_FS:
+ snprintf(buf, buflen, _("%s"), _(sc->name));
+ break;
+ case ST_NONE:
+ assert(0);
+ break;
+ }
+}
+
+static inline bool
+IS_CORRUPT(
+ __u32 flags)
+{
+ return flags & (XFS_SCRUB_FLAG_CORRUPT | XFS_SCRUB_FLAG_XCORRUPT);
+}
+
+/* Do we need to repair something? */
+static inline bool
+xfs_scrub_needs_repair(
+ struct xfs_scrub_metadata *sm)
+{
+ return IS_CORRUPT(sm->sm_flags);
+}
+
+/* Can we optimize something? */
+static inline bool
+xfs_scrub_needs_preen(
+ struct xfs_scrub_metadata *sm)
+{
+ return sm->sm_flags & XFS_SCRUB_FLAG_PREEN;
+}
+
+/* Do a read-only check of some metadata. */
+static enum check_outcome
+xfs_check_metadata(
+ struct scrub_ctx *ctx,
+ int fd,
+ struct xfs_scrub_metadata *meta,
+ bool is_inode)
+{
+ char buf[DESCR_BUFSZ];
+ int error;
+
+ assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+ assert(meta->sm_type <= XFS_SCRUB_TYPE_MAX);
+ format_scrub_descr(buf, DESCR_BUFSZ, meta, &scrubbers[meta->sm_type]);
+
+ dbg_printf("check %s flags %xh\n", buf, meta->sm_flags);
+
+ error = ioctl(fd, XFS_IOC_SCRUB_METADATA, meta);
+ if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
+ meta->sm_flags |= XFS_SCRUB_FLAG_PREEN;
+ if (error) {
+ /* Metadata not present, just skip it. */
+ if (errno == ENOENT)
+ return CHECK_DONE;
+ else if (errno == ESHUTDOWN) {
+ /* FS already crashed, give up. */
+ str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+ return CHECK_ABORT;
+ }
+
+ /* Operational error. */
+ str_errno(ctx, buf);
+ return CHECK_DONE;
+ } else if (!xfs_scrub_needs_repair(meta) &&
+ !xfs_scrub_needs_preen(meta)) {
+ /* Clean operation, no corruption or preening detected. */
+ return CHECK_DONE;
+ } else if (xfs_scrub_needs_repair(meta) &&
+ ctx->mode < SCRUB_MODE_REPAIR) {
+ /* Corrupt, but we're not in repair mode. */
+ str_error(ctx, buf, _("Repairs are required."));
+ return CHECK_DONE;
+ } else if (xfs_scrub_needs_preen(meta) &&
+ ctx->mode < SCRUB_MODE_PREEN) {
+ /* Preenable, but we're not in preen mode. */
+ if (!is_inode) {
+ /* AG or FS metadata, always warn. */
+ str_info(ctx, buf, _("Optimization is possible."));
+ } else if (!ctx->preen_triggers[meta->sm_type]) {
+ /* File metadata, only warn once per type. */
+ pthread_mutex_lock(&ctx->lock);
+ if (!ctx->preen_triggers[meta->sm_type])
+ ctx->preen_triggers[meta->sm_type] = true;
+ pthread_mutex_unlock(&ctx->lock);
+ }
+ return CHECK_DONE;
+ }
+
+ return CHECK_REPAIR;
+}
+
+/* Bulk-notify user about things that could be optimized. */
+void
+xfs_scrub_report_preen_triggers(
+ struct scrub_ctx *ctx)
+{
+ int i;
+
+ for (i = 0; i <= XFS_SCRUB_TYPE_MAX; i++) {
+ pthread_mutex_lock(&ctx->lock);
+ if (ctx->preen_triggers[i]) {
+ ctx->preen_triggers[i] = false;
+ pthread_mutex_unlock(&ctx->lock);
+ str_info(ctx, ctx->mntpoint,
+_("Optimizations of %s are possible."), scrubbers[i].name);
+ } else {
+ pthread_mutex_unlock(&ctx->lock);
+ }
+ }
+}
+
+/* Scrub metadata, saving corruption reports for later. */
+static bool
+xfs_scrub_metadata(
+ struct scrub_ctx *ctx,
+ enum scrub_type scrub_type,
+ xfs_agnumber_t agno,
+ struct list_head *repair_list)
+{
+ struct xfs_scrub_metadata meta = {0};
+ const struct scrub_descr *sc;
+ struct repair_item *ri;
+ enum check_outcome fix;
+ int type;
+
+ sc = scrubbers;
+ for (type = 0; type <= XFS_SCRUB_TYPE_MAX; type++, sc++) {
+ if (sc->type != scrub_type)
+ continue;
+
+ meta.sm_type = type;
+ meta.sm_flags = 0;
+ meta.sm_agno = agno;
+
+ /* Check the item. */
+ fix = xfs_check_metadata(ctx, ctx->mnt_fd, &meta, false);
+ if (fix == CHECK_ABORT)
+ return false;
+ if (fix == CHECK_DONE)
+ continue;
+
+ /* Schedule this item for later repairs. */
+ ri = malloc(sizeof(struct repair_item));
+ if (!ri) {
+ str_errno(ctx, _("repair list"));
+ return false;
+ }
+ ri->op = meta;
+ list_add_tail(&ri->list, repair_list);
+ }
+
+ return true;
+}
+
+/* Scrub each AG's header blocks. */
+bool
+xfs_scrub_ag_headers(
+ struct scrub_ctx *ctx,
+ xfs_agnumber_t agno,
+ struct list_head *repair_list)
+{
+ return xfs_scrub_metadata(ctx, ST_AGHEADER, agno, repair_list);
+}
+
+/* Scrub each AG's metadata btrees. */
+bool
+xfs_scrub_ag_metadata(
+ struct scrub_ctx *ctx,
+ xfs_agnumber_t agno,
+ struct list_head *repair_list)
+{
+ return xfs_scrub_metadata(ctx, ST_PERAG, agno, repair_list);
+}
+
+/* Scrub whole-FS metadata (e.g. the realtime bitmap and summary). */
+bool
+xfs_scrub_fs_metadata(
+ struct scrub_ctx *ctx,
+ struct list_head *repair_list)
+{
+ return xfs_scrub_metadata(ctx, ST_FS, 0, repair_list);
+}
+
+/* Scrub inode metadata. */
+static bool
+__xfs_scrub_file(
+ struct scrub_ctx *ctx,
+ uint64_t ino,
+ uint32_t gen,
+ int fd,
+ unsigned int type,
+ struct list_head *repair_list)
+{
+ struct xfs_scrub_metadata meta = {0};
+ struct repair_item *ri;
+ enum check_outcome fix;
+
+ assert(type <= XFS_SCRUB_TYPE_MAX);
+ assert(scrubbers[type].type == ST_INODE);
+
+ meta.sm_type = type;
+ meta.sm_ino = ino;
+ meta.sm_gen = gen;
+
+ /* Scrub the piece of metadata. */
+ fix = xfs_check_metadata(ctx, fd, &meta, true);
+ if (fix == CHECK_ABORT)
+ return false;
+ if (fix == CHECK_DONE)
+ return true;
+
+ /* Schedule this item for later repairs. */
+ ri = malloc(sizeof(struct repair_item));
+ if (!ri) {
+ str_errno(ctx, _("repair list"));
+ return false;
+ }
+ ri->op = meta;
+ list_add_tail(&ri->list, repair_list);
+ return true;
+}
+
+#define XFS_SCRUB_FILE_PART(name, flagname) \
+bool \
+xfs_scrub_##name( \
+ struct scrub_ctx *ctx, \
+ uint64_t ino, \
+ uint32_t gen, \
+ int fd, \
+ struct list_head *repair_list) \
+{ \
+ return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_##flagname, \
+ repair_list); \
+}
+XFS_SCRUB_FILE_PART(inode_fields, INODE)
+XFS_SCRUB_FILE_PART(data_fork, BMBTD)
+XFS_SCRUB_FILE_PART(attr_fork, BMBTA)
+XFS_SCRUB_FILE_PART(cow_fork, BMBTC)
+XFS_SCRUB_FILE_PART(dir, DIR)
+XFS_SCRUB_FILE_PART(attr, XATTR)
+XFS_SCRUB_FILE_PART(symlink, SYMLINK)
+
+/*
+ * Prioritize repair items in order of how long we can wait.
+ * 0 = do it now, 10000 = do it later.
+ *
+ * To minimize the amount of repair work, we want to prioritize metadata
+ * objects by perceived corruptness. If CORRUPT is set, the fields are
+ * just plain bad; try fixing that first. Otherwise if XCORRUPT is set,
+ * the fields could be bad, but the xref data could also be bad; we'll
+ * try fixing that next. Finally, if XFAIL is set, some other metadata
+ * structure failed validation during xref, so we'll recheck this
+ * metadata last since it was probably fine.
+ *
+ * For metadata that lie in the critical path of checking other metadata
+ * (superblock, AG{F,I,FL}, inobt) we scrub and fix those things before
+ * we even get to handling their dependencies, so things should progress
+ * in order.
+ */
+static int
+PRIO(
+ struct xfs_scrub_metadata *op,
+ int order)
+{
+ if (op->sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+ return order;
+ else if (op->sm_flags & XFS_SCRUB_FLAG_XCORRUPT)
+ return 100 + order;
+ else if (op->sm_flags & XFS_SCRUB_FLAG_XFAIL)
+ return 200 + order;
+ else if (op->sm_flags & XFS_SCRUB_FLAG_PREEN)
+ return 300 + order;
+ abort();
+}
+
+static int
+xfs_repair_item_priority(
+ struct repair_item *ri)
+{
+ switch (ri->op.sm_type) {
+ case XFS_SCRUB_TYPE_SB:
+ return PRIO(&ri->op, 0);
+ case XFS_SCRUB_TYPE_AGF:
+ return PRIO(&ri->op, 1);
+ case XFS_SCRUB_TYPE_AGFL:
+ return PRIO(&ri->op, 2);
+ case XFS_SCRUB_TYPE_AGI:
+ return PRIO(&ri->op, 3);
+ case XFS_SCRUB_TYPE_BNOBT:
+ case XFS_SCRUB_TYPE_CNTBT:
+ case XFS_SCRUB_TYPE_INOBT:
+ case XFS_SCRUB_TYPE_FINOBT:
+ case XFS_SCRUB_TYPE_REFCNTBT:
+ return PRIO(&ri->op, 4);
+ case XFS_SCRUB_TYPE_RMAPBT:
+ return PRIO(&ri->op, 5);
+ case XFS_SCRUB_TYPE_INODE:
+ return PRIO(&ri->op, 6);
+ case XFS_SCRUB_TYPE_BMBTD:
+ case XFS_SCRUB_TYPE_BMBTA:
+ case XFS_SCRUB_TYPE_BMBTC:
+ return PRIO(&ri->op, 7);
+ case XFS_SCRUB_TYPE_DIR:
+ case XFS_SCRUB_TYPE_XATTR:
+ case XFS_SCRUB_TYPE_SYMLINK:
+ return PRIO(&ri->op, 8);
+ case XFS_SCRUB_TYPE_RTBITMAP:
+ case XFS_SCRUB_TYPE_RTSUM:
+ return PRIO(&ri->op, 9);
+ }
+ abort();
+}
+
+/* Sort by repair priority; within a severity class, headers precede btrees. */
+static int
+xfs_repair_item_compare(
+ void *priv,
+ struct list_head *a,
+ struct list_head *b)
+{
+ struct repair_item *ra;
+ struct repair_item *rb;
+
+ ra = container_of(a, struct repair_item, list);
+ rb = container_of(b, struct repair_item, list);
+
+ return xfs_repair_item_priority(ra) - xfs_repair_item_priority(rb);
+}
+
+/* Repair some metadata. */
+static enum check_outcome
+xfs_repair_metadata(
+ struct scrub_ctx *ctx,
+ int fd,
+ struct xfs_scrub_metadata *meta,
+ bool complain_if_still_broken)
+{
+ char buf[DESCR_BUFSZ];
+ __u32 oldf = meta->sm_flags;
+ int error;
+
+ assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+ meta->sm_flags |= XFS_SCRUB_FLAG_REPAIR;
+ assert(meta->sm_type <= XFS_SCRUB_TYPE_MAX);
+ format_scrub_descr(buf, DESCR_BUFSZ, meta, &scrubbers[meta->sm_type]);
+
+ if (xfs_scrub_needs_repair(meta))
+ str_info(ctx, buf, _("Attempting repair."));
+ else if (debug || verbose)
+ str_info(ctx, buf, _("Attempting optimization."));
+
+ error = ioctl(fd, XFS_IOC_SCRUB_METADATA, meta);
+ if (error) {
+ switch (errno) {
+ case ESHUTDOWN:
+ /* Filesystem is already shut down, abort. */
+ str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+ return CHECK_ABORT;
+ case ENOTTY:
+ case EOPNOTSUPP:
+ /*
+ * If we forced repairs, don't complain if kernel
+ * doesn't know how to fix.
+ */
+ if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+ return CHECK_DONE;
+ /* fall through */
+ case EINVAL:
+ /* Kernel doesn't know how to repair this? */
+ if (complain_if_still_broken)
+ str_error(ctx, buf,
+_("Don't know how to fix; offline repair required."));
+ return CHECK_REPAIR;
+ case EROFS:
+ /* Read-only filesystem, can't fix. */
+ if (verbose || debug || IS_CORRUPT(oldf))
+ str_info(ctx, buf,
+_("Read-only filesystem; cannot make changes."));
+ return CHECK_DONE;
+ case ENOENT:
+ /* Metadata not present, just skip it. */
+ return CHECK_DONE;
+ case ENOMEM:
+ case ENOSPC:
+ /* Don't care if preen fails due to low resources. */
+ if (oldf & XFS_SCRUB_FLAG_PREEN)
+ return CHECK_DONE;
+ /* fall through */
+ default:
+ /* Operational error. */
+ str_errno(ctx, buf);
+ return CHECK_DONE;
+ }
+ } else if (xfs_scrub_needs_repair(meta)) {
+ /* Still broken, try again or fix offline. */
+ if (complain_if_still_broken)
+ str_error(ctx, buf,
+_("Repair unsuccessful; offline repair required."));
+ return CHECK_REPAIR;
+ } else {
+ /* Clean operation, no corruption detected. */
+ if (IS_CORRUPT(oldf))
+ record_repair(ctx, buf, _("Repairs successful."));
+ else
+ record_preen(ctx, buf, _("Optimization successful."));
+ return CHECK_DONE;
+ }
+}
+
+/* Repair everything on this list. */
+bool
+xfs_repair_metadata_list(
+ struct scrub_ctx *ctx,
+ int fd,
+ struct list_head *repair_list,
+ unsigned int flags)
+{
+ struct repair_item *ri;
+ struct repair_item *n;
+ enum check_outcome fix;
+
+ list_sort(NULL, repair_list, xfs_repair_item_compare);
+
+ list_for_each_entry_safe(ri, n, repair_list, list) {
+ if (!IS_CORRUPT(ri->op.sm_flags) &&
+ (flags & XRML_REPAIR_ONLY))
+ continue;
+ fix = xfs_repair_metadata(ctx, fd, &ri->op,
+ flags & XRML_NOFIX_COMPLAIN);
+ if (fix == CHECK_ABORT)
+ return false;
+ else if (fix == CHECK_REPAIR)
+ continue;
+
+ list_del(&ri->list);
+ free(ri);
+ }
+
+ return !xfs_scrub_excessive_errors(ctx);
+}
+
+/* Test the availability of a kernel scrub command. */
+static bool
+__xfs_scrub_test(
+ struct scrub_ctx *ctx,
+ unsigned int type)
+{
+ struct xfs_scrub_metadata meta = {0};
+ struct xfs_error_injection inject;
+ static bool injected;
+ int error;
+
+ if (debug_tweak_on("XFS_SCRUB_NO_KERNEL"))
+ return false;
+ if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+ inject.fd = ctx->mnt_fd;
+#define XFS_ERRTAG_FORCE_REPAIR 28
+ inject.errtag = XFS_ERRTAG_FORCE_REPAIR;
+ error = ioctl(ctx->mnt_fd,
+ XFS_IOC_ERROR_INJECTION, &inject);
+ if (error == 0)
+ injected = true;
+ }
+
+ meta.sm_type = type;
+ error = ioctl(ctx->mnt_fd, XFS_IOC_SCRUB_METADATA, &meta);
+ return error == 0 || (errno != EOPNOTSUPP && errno != ENOTTY);
+}
+
+#define XFS_CAN_SCRUB_TEST(name, flagname) \
+bool \
+xfs_can_scrub_##name( \
+ struct scrub_ctx *ctx) \
+{ \
+ return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_##flagname); \
+}
+XFS_CAN_SCRUB_TEST(fs_metadata, SB)
+XFS_CAN_SCRUB_TEST(inode, INODE)
+XFS_CAN_SCRUB_TEST(bmap, BMBTD)
+XFS_CAN_SCRUB_TEST(dir, DIR)
+XFS_CAN_SCRUB_TEST(attr, XATTR)
+XFS_CAN_SCRUB_TEST(symlink, SYMLINK)
diff --git a/scrub/xfs_ioctl.h b/scrub/xfs_ioctl.h
new file mode 100644
index 0000000..78eec51
--- /dev/null
+++ b/scrub/xfs_ioctl.h
@@ -0,0 +1,103 @@
+/*
+ * Copyright (C) 2017 Oracle. All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#ifndef XFS_IOCTL_H_
+#define XFS_IOCTL_H_
+
+/* inode iteration */
+#define XFS_ITERATE_INODES_ABORT (-1)
+typedef int (*xfs_inode_iter_fn)(struct scrub_ctx *ctx,
+ struct xfs_handle *handle, struct xfs_bstat *bs, void *arg);
+bool xfs_iterate_inodes(struct scrub_ctx *ctx, const char *descr,
+ void *fshandle, uint64_t first_ino, uint64_t last_ino,
+ xfs_inode_iter_fn fn, void *arg);
+bool xfs_can_iterate_inodes(struct scrub_ctx *ctx);
+
+/* inode fork block mapping */
+struct xfs_bmap {
+ uint64_t bm_offset; /* file offset of segment in bytes */
+ uint64_t bm_physical; /* physical starting byte */
+ uint64_t bm_length; /* length of segment, bytes */
+ uint32_t bm_flags; /* output flags */
+};
+
+typedef bool (*xfs_bmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+ int fd, int whichfork, struct fsxattr *fsx,
+ struct xfs_bmap *bmap, void *arg);
+
+bool xfs_iterate_bmap(struct scrub_ctx *ctx, const char *descr, int fd,
+ int whichfork, struct xfs_bmap *key, xfs_bmap_iter_fn fn,
+ void *arg);
+bool xfs_can_iterate_bmap(struct scrub_ctx *ctx);
+
+/* filesystem reverse mapping */
+typedef bool (*xfs_fsmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+ struct fsmap *fsr, void *arg);
+bool xfs_iterate_fsmap(struct scrub_ctx *ctx, const char *descr,
+ struct fsmap *keys, xfs_fsmap_iter_fn fn, void *arg);
+bool xfs_can_iterate_fsmap(struct scrub_ctx *ctx);
+
+/* Online scrub and repair. */
+enum check_outcome {
+ CHECK_DONE,
+ CHECK_REPAIR,
+ CHECK_ABORT,
+};
+
+struct repair_item {
+ struct list_head list;
+ struct xfs_scrub_metadata op;
+};
+
+void xfs_scrub_report_preen_triggers(struct scrub_ctx *ctx);
+bool xfs_scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+ struct list_head *repair_list);
+bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+ struct list_head *repair_list);
+bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx,
+ struct list_head *repair_list);
+
+#define XRML_REPAIR_ONLY 1 /* no optimizations */
+#define XRML_NOFIX_COMPLAIN 2 /* complain if still corrupt */
+bool xfs_repair_metadata_list(struct scrub_ctx *ctx, int fd,
+ struct list_head *repair_list, unsigned int flags);
+
+bool xfs_can_scrub_fs_metadata(struct scrub_ctx *ctx);
+bool xfs_can_scrub_inode(struct scrub_ctx *ctx);
+bool xfs_can_scrub_bmap(struct scrub_ctx *ctx);
+bool xfs_can_scrub_dir(struct scrub_ctx *ctx);
+bool xfs_can_scrub_attr(struct scrub_ctx *ctx);
+bool xfs_can_scrub_symlink(struct scrub_ctx *ctx);
+
+bool xfs_scrub_inode_fields(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+ int fd, struct list_head *repair_list);
+bool xfs_scrub_data_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+ int fd, struct list_head *repair_list);
+bool xfs_scrub_attr_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+ int fd, struct list_head *repair_list);
+bool xfs_scrub_cow_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+ int fd, struct list_head *repair_list);
+bool xfs_scrub_dir(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+ int fd, struct list_head *repair_list);
+bool xfs_scrub_attr(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+ int fd, struct list_head *repair_list);
+bool xfs_scrub_symlink(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+ int fd, struct list_head *repair_list);
+
+#endif /* XFS_IOCTL_H_ */
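
Not part of the patch: for anyone who wants to see the repair ordering from PRIO() and xfs_repair_item_priority() at a glance, here is a rough sketch in Python. The flag values and the short type names are placeholders invented for illustration; only the 0/100/200/300 bands and the per-type order are taken from the C code in this patch.

```python
# Placeholder flag bits; the real XFS_SCRUB_FLAG_* values come from the kernel headers.
CORRUPT, XCORRUPT, XFAIL, PREEN = 1, 2, 4, 8

# Per-type order copied from xfs_repair_item_priority(); the names are shorthand.
ORDER = {
    "sb": 0, "agf": 1, "agfl": 2, "agi": 3,
    "bnobt": 4, "cntbt": 4, "inobt": 4, "finobt": 4, "refcntbt": 4,
    "rmapbt": 5, "inode": 6,
    "bmbtd": 7, "bmbta": 7, "bmbtc": 7,
    "dir": 8, "xattr": 8, "symlink": 8,
    "rtbitmap": 9, "rtsum": 9,
}


def prio(flags, order):
    """Mirror PRIO(): the severity band dominates, the type order breaks ties."""
    if flags & CORRUPT:
        return order
    if flags & XCORRUPT:
        return 100 + order
    if flags & XFAIL:
        return 200 + order
    if flags & PREEN:
        return 300 + order
    raise ValueError("item has nothing to repair")


def sort_repairs(items):
    """items is a list of (type_name, flags); return it in repair order."""
    return sorted(items, key=lambda it: prio(it[1], ORDER[it[0]]))


items = [("rmapbt", CORRUPT), ("sb", PREEN), ("agf", XCORRUPT),
         ("inode", CORRUPT), ("agi", CORRUPT)]
ordered = [name for name, _ in sort_repairs(items)]
print(ordered)  # ['agi', 'rmapbt', 'inode', 'agf', 'sb']
```

Note that a preen-only superblock sorts after a corrupt inode record: the severity band outweighs the type order, and the type order only matters inside one band.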
* [PATCH 8/9] xfs_scrub: create a script to scrub all xfs filesystems
2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
` (6 preceding siblings ...)
2017-03-10 23:25 ` [PATCH 7/9] xfs_scrub: add XFS-specific scrubbing functionality Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
2017-03-10 23:25 ` [PATCH 9/9] xfs_scrub: integrate services with systemd Darrick J. Wong
8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Create an xfs_scrub_all command to find all XFS filesystems
and run an online scrub against them all.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
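
Not part of the patch: a minimal sketch of the scheduling rule the script is meant to follow, namely that a filesystem may start scrubbing only if none of its underlying devices is already being scrubbed. The helper name pick_runnable and the mount/device names are made up for illustration, and the real script does this inline with threads and a condition variable.

```python
def pick_runnable(pending, busy):
    """Claim every pending mountpoint whose devices are all idle.

    pending: dict mapping mountpoint -> set of device names (entries that
             get scheduled are removed from it)
    busy:    set of device names currently in use (updated in place)
    Returns the list of mountpoints that may start now.
    """
    runnable = []
    for mnt in sorted(pending):
        devs = pending[mnt]
        if devs.isdisjoint(busy):
            busy |= devs
            runnable.append(mnt)
    for mnt in runnable:
        del pending[mnt]
    return runnable


# Hypothetical layout: / and /home share sda, /srv sits alone on sdc.
pending = {
    "/": {"sda"},
    "/home": {"sda", "sdb"},
    "/srv": {"sdc"},
}
busy = set()

first = pick_runnable(pending, busy)
print(first)            # ['/', '/srv']

# Pretend the scrub of / has finished and released its device.
busy -= {"sda"}
second = pick_runnable(pending, busy)
print(second)           # ['/home']
```

/home has to wait for / because both touch sda, while /srv can run alongside either of them.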
debian/control | 3 +
debian/rules | 1
man/man8/xfs_scrub_all.8 | 32 ++++++++++
scrub/Makefile | 15 ++++-
scrub/xfs_scrub_all.in | 148 ++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 195 insertions(+), 4 deletions(-)
create mode 100644 man/man8/xfs_scrub_all.8
create mode 100644 scrub/xfs_scrub_all.in
diff --git a/debian/control b/debian/control
index ad81662..d7c0316 100644
--- a/debian/control
+++ b/debian/control
@@ -3,12 +3,13 @@ Section: admin
Priority: optional
Maintainer: XFS Development Team <linux-xfs@vger.kernel.org>
Uploaders: Nathan Scott <nathans@debian.org>, Anibal Monsalve Salazar <anibal@debian.org>
-Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev
+Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev, dh-python
Standards-Version: 3.9.1
Homepage: http://xfs.org/
Package: xfsprogs
Depends: ${shlibs:Depends}, ${misc:Depends}
+Recommends: ${python3:Depends}, util-linux
Provides: fsck-backend
Suggests: xfsdump, acl, attr, quota
Breaks: xfsdump (<< 3.0.0)
diff --git a/debian/rules b/debian/rules
index c673380..a870944 100755
--- a/debian/rules
+++ b/debian/rules
@@ -76,6 +76,7 @@ binary-arch: checkroot built
$(pkgdi) $(MAKE) -C debian install-d-i
$(pkgme) $(MAKE) dist
rmdir debian/xfslibs-dev/usr/share/doc/xfsprogs
+ dh_python3
dh_installdocs
dh_installchangelogs
dh_strip
diff --git a/man/man8/xfs_scrub_all.8 b/man/man8/xfs_scrub_all.8
new file mode 100644
index 0000000..5e1420b
--- /dev/null
+++ b/man/man8/xfs_scrub_all.8
@@ -0,0 +1,32 @@
+.TH xfs_scrub_all 8
+.SH NAME
+xfs_scrub_all \- scrub all mounted XFS filesystems
+.SH SYNOPSIS
+.B xfs_scrub_all
+.SH DESCRIPTION
+.B xfs_scrub_all
+attempts to read and check all the metadata on all mounted XFS filesystems.
+The online scrub is performed via the
+.B xfs_scrub
+tool, either by running it directly or by using systemd to start it
+in a restricted fashion.
+Mounted filesystems are mapped to physical storage devices so that scrub
+operations can be run in parallel so long as no two scrubbers access
+the same device simultaneously.
+.SH EXIT CODE
+The exit code returned by
+.B xfs_scrub_all
+is the sum of the following conditions:
+.br
+\ 0\ \-\ No errors
+.br
+\ 4\ \-\ File system errors left uncorrected
+.br
+\ 8\ \-\ Operational error
+.br
+\ 16\ \-\ Usage or syntax error
+.PP
+These are the same exit codes returned by xfs_scrub.
+.br
+.SH SEE ALSO
+.BR xfs_scrub (8).
diff --git a/scrub/Makefile b/scrub/Makefile
index bae2fa1..78e119f 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -10,6 +10,8 @@ SCRUB_PREREQS=$(HAVE_OPENAT)$(HAVE_FSTATAT)
ifeq ($(SCRUB_PREREQS),yesyes)
LTCOMMAND = xfs_scrub
INSTALL_SCRUB = install-scrub
+XFS_SCRUB_ALL_PROG = xfs_scrub_all
+XFS_SCRUB_ARGS = -Tvn
endif # scrub_prereqs
HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h xfs_ioctl.h
@@ -36,15 +38,22 @@ ifeq ($(HAVE_SYNCFS),yes)
LCFLAGS += -DHAVE_SYNCFS
endif
-default: depend $(LTCOMMAND)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG)
+
+xfs_scrub_all: xfs_scrub_all.in
+ @echo " [SED] $@"
+ $(Q)$(SED) -e "s|@sbindir@|$(PKG_ROOT_SBIN_DIR)|g" \
+ -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" < $< > $@
+ $(Q)chmod a+x $@
include $(BUILDRULES)
-install: default $(INSTALL_SCRUB)
+install: $(INSTALL_SCRUB)
-install-scrub:
+install-scrub: default
$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+ $(INSTALL) -m 755 $(XFS_SCRUB_ALL_PROG) $(PKG_ROOT_SBIN_DIR)
install-dev:
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
new file mode 100644
index 0000000..2215720
--- /dev/null
+++ b/scrub/xfs_scrub_all.in
@@ -0,0 +1,148 @@
+#!/usr/bin/env python3
+
+# Run online scrubbers in parallel, but avoid thrashing.
+#
+# Copyright (C) 2017 Oracle. All rights reserved.
+#
+# Author: Darrick J. Wong <darrick.wong@oracle.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+
+import subprocess
+import json
+import threading
+import time
+import sys
+
+retcode = 0
+terminate = False
+
+def find_mounts():
+	'''Map mountpoints to physical disks.'''
+
+	fs = {}
+	cmd=['lsblk', '-o', 'KNAME,TYPE,FSTYPE,MOUNTPOINT', '-J']
+	result = subprocess.Popen(cmd, stdout=subprocess.PIPE)
+	# Drain the pipe before reaping lsblk so a full pipe cannot stall it.
+	stdout_data, _ = result.communicate()
+	if result.returncode != 0:
+		return fs
+	output = stdout_data.decode('utf-8')
+	bdevdata = json.loads(output)
+	for bdev in bdevdata['blockdevices']:
+		if bdev['type'] == 'disk':
+			lastdisk = bdev['kname']
+		elif bdev['fstype'] == 'xfs':
+			mnt = bdev['mountpoint']
+			if mnt is not None:
+				if mnt in fs:
+					fs[mnt].add(lastdisk)
+				else:
+					fs[mnt] = set([lastdisk])
+	return fs
+
+def run_killable(cmd, stdout, killfuncs, kill_fn):
+	'''Run a killable program. Returns program retcode or -1 if we can't start it.'''
+	try:
+		proc = subprocess.Popen(cmd, stdout = stdout)
+		real_kill_fn = lambda: kill_fn(proc)
+		killfuncs.add(real_kill_fn)
+		proc.wait()
+		try:
+			killfuncs.remove(real_kill_fn)
+		except:
+			pass
+		return proc.returncode
+	except:
+		return -1
+
+def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
+	'''Run a scrub process.'''
+	global retcode, terminate
+
+	print("Scrubbing %s..." % mnt)
+
+	try:
+		if terminate:
+			return
+
+		# Invoke xfs_scrub manually
+		cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt]
+		ret = run_killable(cmd, None, killfuncs, \
+				lambda proc: proc.terminate())
+		if ret >= 0:
+			print("Scrubbing %s done, (err=%d)" % (mnt, ret))
+			retcode |= ret
+			return
+
+		if terminate:
+			return
+
+		print("Unable to start scrub tool.")
+	finally:
+		running_devs -= mntdevs
+		cond.acquire()
+		cond.notify()
+		cond.release()
+
+def main():
+	'''Find mounts, schedule scrub runs.'''
+	def thr(mnt, devs):
+		a = (mnt, cond, running_devs, devs, killfuncs)
+		thr = threading.Thread(target = run_scrub, args = a)
+		thr.start()
+	global retcode, terminate
+
+	fs = find_mounts()
+
+	# Schedule scrub jobs...
+	running_devs = set()
+	killfuncs = set()
+	cond = threading.Condition()
+	while len(fs) > 0:
+		if len(running_devs) == 0:
+			mnt, devs = fs.popitem()
+			running_devs.update(devs)
+			thr(mnt, devs)
+		poppers = set()
+		for mnt in fs:
+			devs = fs[mnt]
+			can_run = True
+			for dev in devs:
+				if dev in running_devs:
+					can_run = False
+					break
+			if can_run:
+				running_devs.update(devs)
+				poppers.add(mnt)
+				thr(mnt, devs)
+		for p in poppers:
+			fs.pop(p)
+		cond.acquire()
+		try:
+			cond.wait()
+		except KeyboardInterrupt:
+			terminate = True
+			print("Terminating...")
+			while len(killfuncs) > 0:
+				fn = killfuncs.pop()
+				fn()
+			fs = []
+		cond.release()
+
+	sys.exit(retcode)
+
+if __name__ == '__main__':
+	main()
* [PATCH 9/9] xfs_scrub: integrate services with systemd
2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
` (7 preceding siblings ...)
2017-03-10 23:25 ` [PATCH 8/9] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
@ 2017-03-10 23:25 ` Darrick J. Wong
8 siblings, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2017-03-10 23:25 UTC (permalink / raw)
To: sandeen, darrick.wong; +Cc: linux-xfs
From: Darrick J. Wong <darrick.wong@oracle.com>
Create a systemd service unit so that we can run the online scrubber
under systemd with (somewhat) appropriate containment.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
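
Not part of the patch: a small sketch of the return code convention used between xfs_scrub in service mode and xfs_scrub_all. The function names are hypothetical; the offset of 16 and the 16..32 window are copied from the scrub.c and xfs_scrub_all.in hunks in this patch. Whether systemctl actually hands the unit's exit status back to the caller is an assumption the script makes and this sketch does not check.

```python
SERVICE_OFFSET = 16  # matches "ret += 16" in the scrub.c hunk


def encode_service_ret(ret, is_service):
    """What xfs_scrub does at exit: shift nonzero codes when run as a service."""
    if is_service and ret:
        return ret + SERVICE_OFFSET
    return ret


def decode_service_ret(ret):
    """What xfs_scrub_all does: undo the shift, or return None if the code
    does not look like it came from xfs_scrub (so the caller can fall back
    to running xfs_scrub directly)."""
    if ret == 0:
        return 0
    if 16 <= ret <= 32:
        return ret - SERVICE_OFFSET
    return None


# The exit codes are bit flags (4, 8, 16), so results are OR-ed together.
combined = 0
for raw in (encode_service_ret(4, True), 0, encode_service_ret(8, True)):
    decoded = decode_service_ret(raw)
    if decoded is not None:
        combined |= decoded
print(combined)  # 12
```

A raw status such as 1 decodes to None here, which corresponds to the script giving up on systemd and invoking xfs_scrub manually.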
configure.ac | 15 +++++++++++++
include/builddefs.in | 3 +++
scrub/Makefile | 21 +++++++++++++++++-
scrub/scrub.c | 20 +++++++++++++++++
scrub/xfs_scrub@.service.in | 18 ++++++++++++++++
scrub/xfs_scrub_all.in | 44 ++++++++++++++++++++++++++++++++++++++
scrub/xfs_scrub_all.service.in | 8 +++++++
scrub/xfs_scrub_all.timer | 11 ++++++++++
scrub/xfs_scrub_fail | 26 ++++++++++++++++++++++
scrub/xfs_scrub_fail@.service.in | 10 +++++++++
10 files changed, 175 insertions(+), 1 deletion(-)
create mode 100644 scrub/xfs_scrub@.service.in
create mode 100644 scrub/xfs_scrub_all.service.in
create mode 100644 scrub/xfs_scrub_all.timer
create mode 100755 scrub/xfs_scrub_fail
create mode 100644 scrub/xfs_scrub_fail@.service.in
diff --git a/configure.ac b/configure.ac
index ccd7460..e89aea0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -103,6 +103,21 @@ esac
AC_SUBST([root_sbindir])
AC_SUBST([root_libdir])
+# Where do systemd services go?
+pkg_systemdsystemunitdir="$(pkg-config --variable=systemdsystemunitdir systemd 2>/dev/null)"
+case "${pkg_systemdsystemunitdir}" in
+"")
+ systemdsystemunitdir=""
+ have_systemd=no
+ ;;
+*)
+ systemdsystemunitdir="${pkg_systemdsystemunitdir}"
+ have_systemd=yes
+ ;;
+esac
+AC_SUBST([have_systemd])
+AC_SUBST([systemdsystemunitdir])
+
# Find localized files. Don't descend into any "dot directories"
# (like .git or .pc from quilt). Strangely, the "-print" argument
# to "find" is required, to avoid including such directories in the
diff --git a/include/builddefs.in b/include/builddefs.in
index 9d478d3..d99c402 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -123,6 +123,9 @@ HAVE_OPENAT = @have_openat@
HAVE_SYNCFS = @have_syncfs@
HAVE_FSTATAT = @have_fstatat@
+HAVE_SYSTEMD = @have_systemd@
+SYSTEMDSYSTEMUNITDIR = @systemdsystemunitdir@
+
GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
# -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/scrub/Makefile b/scrub/Makefile
index 78e119f..d34e3d4 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -12,6 +12,12 @@ LTCOMMAND = xfs_scrub
INSTALL_SCRUB = install-scrub
XFS_SCRUB_ALL_PROG = xfs_scrub_all
XFS_SCRUB_ARGS = -Tvn
+
+ifeq ($(HAVE_SYSTEMD),yes)
+INSTALL_SCRUB += install-systemd
+SYSTEMDSERVICES = xfs_scrub@.service xfs_scrub_all.service xfs_scrub_all.timer xfs_scrub_fail@.service
+endif
+
endif # scrub_prereqs
HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h xfs_ioctl.h
@@ -38,7 +44,7 @@ ifeq ($(HAVE_SYNCFS),yes)
LCFLAGS += -DHAVE_SYNCFS
endif
-default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(SYSTEMDSERVICES)
xfs_scrub_all: xfs_scrub_all.in
@echo " [SED] $@"
@@ -50,6 +56,19 @@ include $(BUILDRULES)
install: $(INSTALL_SCRUB)
+%.service: %.service.in
+ @echo " [SED] $@"
+ $(Q)$(SED) -e "s|@sbindir@|$(PKG_ROOT_SBIN_DIR)|g" \
+ -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" \
+ -e "s|@pkg_lib_dir@|$(PKG_LIB_DIR)|g" \
+ -e "s|@pkg_name@|$(PKG_NAME)|g" < $< > $@
+
+install-systemd: default
+ $(INSTALL) -m 755 -d $(SYSTEMDSYSTEMUNITDIR)
+ $(INSTALL) -m 644 $(SYSTEMDSERVICES) $(SYSTEMDSYSTEMUNITDIR)
+ $(INSTALL) -m 755 -d $(PKG_LIB_DIR)/$(PKG_NAME)
+ $(INSTALL) -m 755 xfs_scrub_fail $(PKG_LIB_DIR)/$(PKG_NAME)
+
install-scrub: default
$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
diff --git a/scrub/scrub.c b/scrub/scrub.c
index a363ac1..0b6a11d 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -44,6 +44,7 @@ bool dumpcore;
bool display_rusage;
long page_size;
int nr_threads = -1;
+bool is_service;
enum errors_action error_action = ERRORS_CONTINUE;
static unsigned long max_errors;
@@ -830,6 +831,9 @@ _("Only one of the options -n or -y may be specified.\n"));
ctx.mntpoint = argv[optind];
+ if (getenv("SERVICE_MODE"))
+ is_service = true;
+
/* Find the mount record for the passed-in argument. */
if (stat(argv[optind], &ctx.mnt_sb) < 0) {
@@ -957,5 +961,21 @@ _("%s: %lu warnings found.\n"),
free(ctx.mntpoint);
free(ctx.mnt_type);
end:
+ /*
+ * If we're running as a service, bump return code up by 16 to
+ * avoid conflicting with service return codes.
+ */
+ if (is_service) {
+ /*
+ * journald queries /proc as part of taking in log
+ * messages; it uses this information to associate the
+ * message with systemd units, etc. This races with
+ * process exit, so delay that a couple of seconds so
+ * that we capture the summary outputs in the job log.
+ */
+ sleep(2);
+ if (ret)
+ ret += 16;
+ }
return ret;
}
diff --git a/scrub/xfs_scrub@.service.in b/scrub/xfs_scrub@.service.in
new file mode 100644
index 0000000..6b6992d
--- /dev/null
+++ b/scrub/xfs_scrub@.service.in
@@ -0,0 +1,18 @@
+[Unit]
+Description=Online XFS Metadata Check for %I
+OnFailure=xfs_scrub_fail@%i.service
+
+[Service]
+Type=oneshot
+WorkingDirectory=%I
+PrivateNetwork=true
+ProtectSystem=full
+ProtectHome=read-only
+PrivateTmp=yes
+AmbientCapabilities=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
+NoNewPrivileges=yes
+User=nobody
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+Environment=SERVICE_MODE=1
+ExecStart=@sbindir@/xfs_scrub @scrub_args@ %I
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 2215720..81e0cc2 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -25,6 +25,7 @@ import json
import threading
import time
import sys
+import os
retcode = 0
terminate = False
@@ -53,6 +54,13 @@ def find_mounts():
fs[mnt] = set([lastdisk])
return fs
+def kill_systemd(unit, proc):
+ '''Kill systemd unit.'''
+ proc.terminate()
+ cmd=['systemctl', 'stop', unit]
+ x = subprocess.Popen(cmd)
+ x.wait()
+
def run_killable(cmd, stdout, killfuncs, kill_fn):
'''Run a killable program. Returns program retcode or -1 if we can't start it.'''
try:
@@ -78,6 +86,20 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
if terminate:
return
+ # Try it the systemd way
+ cmd=['systemctl', 'start', 'xfs_scrub@%s' % mnt]
+ ret = run_killable(cmd, subprocess.DEVNULL, killfuncs, \
+ lambda proc: kill_systemd('xfs_scrub@%s' % mnt, proc))
+ if ret == 0 or (ret >= 16 and ret <= 32):
+ if ret != 0:
+ ret -= 16
+ print("Scrubbing %s done, (err=%d)" % (mnt, ret))
+ retcode |= ret
+ return
+
+ if terminate:
+ return
+
# Invoke xfs_scrub manually
cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt]
ret = run_killable(cmd, None, killfuncs, \
@@ -107,6 +129,17 @@ def main():
fs = find_mounts()
+ # Tail the journal if we ourselves aren't a service...
+ journalthread = None
+ if 'SERVICE_MODE' not in os.environ:
+ try:
+ cmd=['journalctl', '--no-pager', '-q', '-S', 'now', \
+ '-f', '-u', 'xfs_scrub@*', '-o', \
+ 'cat']
+ journalthread = subprocess.Popen(cmd)
+ except:
+ pass
+
# Schedule scrub jobs...
running_devs = set()
killfuncs = set()
@@ -142,6 +175,17 @@ def main():
fs = []
cond.release()
+ if journalthread is not None:
+ journalthread.terminate()
+
+ # journald queries /proc as part of taking in log
+ # messages; it uses this information to associate the
+ # message with systemd units, etc. This races with
+ # process exit, so delay that a couple of seconds so
+ # that we capture the summary outputs in the job log.
+ if 'SERVICE_MODE' in os.environ:
+ time.sleep(2)
+
sys.exit(retcode)
if __name__ == '__main__':
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
new file mode 100644
index 0000000..15b0af9
--- /dev/null
+++ b/scrub/xfs_scrub_all.service.in
@@ -0,0 +1,8 @@
+[Unit]
+Description=Online XFS Metadata Check for All Filesystems
+ConditionACPower=true
+
+[Service]
+Type=oneshot
+Environment=SERVICE_MODE=1
+ExecStart=@sbindir@/xfs_scrub_all
diff --git a/scrub/xfs_scrub_all.timer b/scrub/xfs_scrub_all.timer
new file mode 100644
index 0000000..efc13a6
--- /dev/null
+++ b/scrub/xfs_scrub_all.timer
@@ -0,0 +1,11 @@
+[Unit]
+Description=Periodic XFS Online Metadata Check for All Filesystems
+
+[Timer]
+# Run on Sunday at 2am
+OnCalendar=Sun *-*-* 02:00:00
+RandomizedDelaySec=60
+Persistent=true
+
+[Install]
+WantedBy=timers.target
diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
new file mode 100755
index 0000000..36dd50e
--- /dev/null
+++ b/scrub/xfs_scrub_fail
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+# Email logs of failed xfs_scrub unit runs
+
+mailer=/usr/sbin/sendmail
+recipient="$1"
+test -z "${recipient}" && exit 0
+mntpoint="$2"
+test -z "${mntpoint}" && exit 0
+hostname="$(hostname -f 2>/dev/null)"
+test -z "${hostname}" && hostname="${HOSTNAME}"
+if [ ! -x "${mailer}" ]; then
+ echo "${mailer}: Mailer program not found."
+ exit 1
+fi
+
+(cat << ENDL
+To: ${recipient}
+From: <xfs_scrub@${hostname}>
+Subject: xfs_scrub failure on ${mntpoint}
+
+So sorry, the automatic xfs_scrub of ${mntpoint} on ${hostname} failed.
+
+A log of what happened follows:
+ENDL
+systemctl status --full --lines 4294967295 "xfs_scrub@${mntpoint}") | "${mailer}" -t -i
diff --git a/scrub/xfs_scrub_fail@.service.in b/scrub/xfs_scrub_fail@.service.in
new file mode 100644
index 0000000..785f881
--- /dev/null
+++ b/scrub/xfs_scrub_fail@.service.in
@@ -0,0 +1,10 @@
+[Unit]
+Description=Online XFS Metadata Check Failure Reporting for %I
+
+[Service]
+Type=oneshot
+Environment=EMAIL_ADDR=root
+ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" %I
+User=mail
+Group=mail
+SupplementaryGroups=systemd-journal
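
Editor's note on the exit-code convention in this patch: in service mode, scrub.c adds 16 to any nonzero return code, and run_scrub() in xfs_scrub_all treats 0 or a value from 16 to 32 as a scrub result and subtracts the offset again. The arithmetic can be sketched as below. This is only an illustration under stated assumptions: the names SERVICE_EXIT_OFFSET, encode_service_exit and decode_service_exit are hypothetical and do not appear in the patch, and the sketch says nothing about what exit status systemctl itself reports.

```python
# Hypothetical helpers illustrating the +16 exit-code offset; not part of the patch.
SERVICE_EXIT_OFFSET = 16  # offset added by scrub.c when is_service is set


def encode_service_exit(ret, is_service):
    """Mirror the end of main() in scrub.c: shift nonzero codes up by 16."""
    if is_service and ret:
        return ret + SERVICE_EXIT_OFFSET
    return ret


def decode_service_exit(ret):
    """Mirror the check in run_scrub(): return the scrub code, or None
    if the value is outside the range the script recognizes."""
    if ret == 0:
        return 0
    if 16 <= ret <= 32:
        return ret - SERVICE_EXIT_OFFSET
    return None


if __name__ == "__main__":
    for code in (0, 1, 4):
        wire = encode_service_exit(code, is_service=True)
        print(code, "->", wire, "->", decode_service_exit(wire))
```

A value for which decode_service_exit() returns None corresponds to the case where run_scrub() falls through and invokes xfs_scrub directly.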
end of thread, other threads:[~2017-03-10 23:25 UTC | newest]
Thread overview: 10+ messages
2017-03-10 23:24 [PATCH v6 0/9] xfsprogs: online scrub/repair support Darrick J. Wong
2017-03-10 23:24 ` [PATCH 1/9] xfs_repair: rebuild bmbt from rmapbt data Darrick J. Wong
2017-03-10 23:24 ` [PATCH 2/9] xfs_db: introduce fuzz command Darrick J. Wong
2017-03-10 23:25 ` [PATCH 3/9] xfs_db: print attribute remote value blocks Darrick J. Wong
2017-03-10 23:25 ` [PATCH 4/9] xfs_db: write / fuzz bad values into dir/attr blocks with good CRCs Darrick J. Wong
2017-03-10 23:25 ` [PATCH 5/9] xfs_io: provide an interface to the scrub ioctls Darrick J. Wong
2017-03-10 23:25 ` [PATCH 6/9] xfs_scrub: create online filesystem scrub program Darrick J. Wong
2017-03-10 23:25 ` [PATCH 7/9] xfs_scrub: add XFS-specific scrubbing functionality Darrick J. Wong
2017-03-10 23:25 ` [PATCH 8/9] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
2017-03-10 23:25 ` [PATCH 9/9] xfs_scrub: integrate services with systemd Darrick J. Wong