* [PATCH v9 00/22] xfsprogs: online scrub/repair support
@ 2017-08-04  0:07 Darrick J. Wong
  2017-08-04  0:07 ` [PATCH 01/22] xfs_scrub: create online filesystem scrub program Darrick J. Wong
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:07 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

Hi all,

This is the ninth revision of a patchset that adds to XFS userland tools
support for online metadata scrubbing and repair.  There aren't any
on-disk format changes, and the main overview is in the cover letter for
the kernel patches.  Since v8, I have split up the userland scrub
patches so that we lead off with a bunch of documentation for the future
program and then start building the tool one piece at a time.  Most of the
scrub activity simply calls into the kernel, but other parts (data block
verification, name collision checking) are mostly userspace code.

Per my new policy of reducing mailing list bombardment, I am not posting
any of the libxfs patches, only patches for actual userspace programs.

If you're going to start using this mess, you probably ought to just
pull from my git trees for kernel[1], xfsprogs[2], and xfstests[3].
The xfsprogs patches should apply against for-next.

This is an extraordinary way to eat your data.  Enjoy! 
Comments and questions are, as always, welcome.

--D

[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel


* [PATCH 01/22] xfs_scrub: create online filesystem scrub program
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
@ 2017-08-04  0:07 ` Darrick J. Wong
  2017-08-04  0:07 ` [PATCH 02/22] xfs_scrub: common error handling Darrick J. Wong
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:07 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create the foundations of a filesystem scrubbing tool that asks the
kernel to inspect all metadata in the filesystem and (ultimately) to
repair anything that's broken.  Also create the man page for the
utility.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Makefile             |    3 +
 man/man8/xfs_scrub.8 |  117 ++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/Makefile       |   42 +++++++++++++++++
 scrub/common.c       |   36 +++++++++++++++
 scrub/common.h       |   23 +++++++++
 scrub/scrub.c        |  123 ++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/scrub.h        |   23 +++++++++
 7 files changed, 366 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/xfs_scrub.8
 create mode 100644 scrub/Makefile
 create mode 100644 scrub/common.c
 create mode 100644 scrub/common.h
 create mode 100644 scrub/scrub.c
 create mode 100644 scrub/scrub.h
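
A note on the EXIT CODE section of the man page added below: the
documented values (1, 2, 4, 8) are powers of two, so their sum can be
treated as a bitmask and each condition tested on its own.  The sketch
below is only an illustration of that idea and is not part of this
patch; the DEMO_* names and demo_exit_code() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as documented in the EXIT CODE section of xfs_scrub.8. */
enum demo_exit_bits {
	DEMO_EXIT_ERRORS	= 1,	/* file system errors left uncorrected */
	DEMO_EXIT_OPTIMIZE	= 2,	/* file system optimizations possible */
	DEMO_EXIT_OPERATIONAL	= 4,	/* operational error */
	DEMO_EXIT_USAGE		= 8,	/* usage or syntax error */
};

/* Hypothetical helper: fold the individual conditions into one exit code. */
static int
demo_exit_code(
	bool			errors_left,
	bool			optimizations_possible,
	bool			operational_error,
	bool			usage_error)
{
	int			ret = 0;

	if (errors_left)
		ret |= DEMO_EXIT_ERRORS;
	if (optimizations_possible)
		ret |= DEMO_EXIT_OPTIMIZE;
	if (operational_error)
		ret |= DEMO_EXIT_OPERATIONAL;
	if (usage_error)
		ret |= DEMO_EXIT_USAGE;
	return ret;
}
```

A caller could then check a single condition with an expression such as
(ret & DEMO_EXIT_OPERATIONAL); a code of 5, for example, would mean
uncorrected errors plus an operational error.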


diff --git a/Makefile b/Makefile
index 72d0044..ef54bda 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ HDR_SUBDIRS = include libxfs
 DLIB_SUBDIRS = libxlog libxcmd libhandle
 LIB_SUBDIRS = libxfs $(DLIB_SUBDIRS)
 TOOL_SUBDIRS = copy db estimate fsck growfs io logprint mkfs quota \
-		mdrestore repair rtcp m4 man doc debian spaceman
+		mdrestore repair rtcp m4 man doc debian spaceman scrub
 
 ifneq ("$(PKG_PLATFORM)","darwin")
 TOOL_SUBDIRS += fsr
@@ -89,6 +89,7 @@ repair: libxlog libxcmd
 copy: libxlog
 mkfs: libxcmd
 spaceman: libxcmd
+scrub: libhandle libxcmd repair
 
 ifeq ($(HAVE_BUILDDEFS), yes)
 include $(BUILDRULES)
diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
new file mode 100644
index 0000000..a432aed
--- /dev/null
+++ b/man/man8/xfs_scrub.8
@@ -0,0 +1,117 @@
+.TH xfs_scrub 8
+.SH NAME
+xfs_scrub \- scrub the contents of an XFS filesystem
+.SH SYNOPSIS
+.B xfs_scrub
+[
+.B \-abemnTvVxy
+]
+.I mount-point
+.br
+.B xfs_scrub \-V
+.SH DESCRIPTION
+.B xfs_scrub
+attempts to check and repair all metadata in a mounted XFS filesystem.
+.PP
+.B xfs_scrub
+asks the kernel to scrub all metadata objects in the filesystem.
+Metadata records are scanned for obviously bad values and then
+cross-referenced against other metadata.
+The goal is to establish a reasonable confidence about the consistency
+of the overall filesystem by examining the consistency of individual
+metadata records against the other metadata in the filesystem across the
+entire filesystem.
+Damaged metadata can be rebuilt from other metadata if there is
+sufficient redundancy (and no other corruption) in the metadata.
+.PP
+This utility does not know how to correct all errors.
+If the tool cannot fix the detected errors, you must unmount the
+filesystem and run
+.B xfs_repair
+to fix the problems.
+If this tool is not run with either of the
+.B \-n
+or
+.B \-y
+options, then it will optimize the filesystem when possible,
+but it will not try to fix errors.
+.SH OPTIONS
+.TP
+.BI \-a " errors"
+Abort if more than this many errors are found on the filesystem.
+.TP
+.B \-b
+Run in background mode.
+If the option is specified once, only run a single scrubbing thread at a
+time.
+If given more than once, an artificial delay of 100us is added to each
+scrub call to reduce CPU overhead even further.
+.TP
+.B \-e
+Specifies what happens when errors are detected.
+If
+.IR shutdown
+is given, the filesystem will be taken offline if errors are found.
+Not all backends can shut down a filesystem.
+If
+.IR continue
+is given, no action is taken if errors are found.
+This is the default.
+.TP
+.BI \-m " file"
+Search this file for mounted filesystems instead of /etc/mtab.
+.TP
+.B \-n
+Dry run, do not modify anything in the filesystem.
+This disables all preening and optimization behaviors, and disables
+calling FITRIM on the free space after a successful run.
+.TP
+.BI \-T
+Print timing and memory usage information for each phase.
+.TP
+.B \-v
+Enable verbose mode, which prints periodic status updates.
+.TP
+.B \-V
+Prints the version number and exits.
+.TP
+.B \-x
+Scrub all file data too.
+The block list will be sorted in disk order for better performance.
+.B xfs_scrub
+will issue O_DIRECT reads to the block device directly.
+If the block device is a SCSI disk, it will issue READ VERIFY commands
+directly to the disk.
+.TP
+.B \-y
+Try to repair all filesystem errors.
+If the errors cannot be fixed online, then the filesystem must be taken
+offline for repair.
+.SH EXIT CODE
+The exit code returned by
+.B xfs_scrub
+is the sum of the following conditions:
+.br
+\	0\	\-\ No errors
+.br
+\	1\	\-\ File system errors left uncorrected
+.br
+\	2\	\-\ File system optimizations possible
+.br
+\	4\	\-\ Operational error
+.br
+\	8\	\-\ Usage or syntax error
+.br
+.SH CAVEATS
+.B xfs_scrub
+is an immature utility!
+This program takes advantage of in-kernel scrubbing to verify a given
+data structure with locks held.
+This can tie up the system for a while.
+The kernel must support the BULKSTAT, FSGEOMETRY, FSCOUNTS, GET_RESBLKS,
+GET_AG_RESBLKS, GETBMAPX, GETFSMAP, INUMBERS, and SCRUB_METADATA ioctls.
+.PP
+If errors are found and cannot be repaired, the filesystem must be taken
+offline and repaired.
+.SH SEE ALSO
+.BR xfs_repair (8).
diff --git a/scrub/Makefile b/scrub/Makefile
new file mode 100644
index 0000000..90a1c47
--- /dev/null
+++ b/scrub/Makefile
@@ -0,0 +1,42 @@
+#
+# Copyright (c) 2017 Oracle.  All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+# On linux we get fsmap from the system or define it ourselves
+# so include this based on platform type.  If this reverts to only
+# the autoconf check w/o local definition, change to testing HAVE_GETFSMAP
+SCRUB_PREREQS=$(PKG_PLATFORM)
+
+ifeq ($(SCRUB_PREREQS),linux)
+LTCOMMAND = xfs_scrub
+INSTALL_SCRUB = install-scrub
+endif	# scrub_prereqs
+
+HFILES = \
+common.h \
+scrub.h
+
+CFILES = \
+common.c \
+scrub.c
+
+LLDLIBS += $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
+LTDEPENDENCIES += $(LIBXCMD) $(LIBHANDLE)
+LLDFLAGS = -static
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: default $(INSTALL_SCRUB)
+
+install-scrub:
+	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
+	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+
+install-dev:
+
+-include .dep
diff --git a/scrub/common.c b/scrub/common.c
new file mode 100644
index 0000000..7f2b4d2
--- /dev/null
+++ b/scrub/common.c
@@ -0,0 +1,36 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <mntent.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/statvfs.h>
+#include <sys/vfs.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+#include "input.h"
diff --git a/scrub/common.h b/scrub/common.h
new file mode 100644
index 0000000..f29e4d3
--- /dev/null
+++ b/scrub/common.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_COMMON_H_
+#define XFS_SCRUB_COMMON_H_
+
+#endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
new file mode 100644
index 0000000..4fe1590
--- /dev/null
+++ b/scrub/scrub.c
@@ -0,0 +1,123 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <mntent.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/statvfs.h>
+#include <sys/vfs.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+
+/*
+ * XFS Online Metadata Scrub (and Repair)
+ *
+ * The XFS scrubber uses custom XFS ioctls to probe more deeply into the
+ * internals of the filesystem.  It takes advantage of scrubbing ioctls
+ * to check all the records stored in a metadata object and to
+ * cross-reference those records against the other filesystem metadata.
+ *
+ * After the program gathers command line arguments to figure out
+ * exactly what the user wants the program to do, scrub
+ * execution is split up into several separate phases:
+ *
+ * The "find geometry" phase queries XFS for the filesystem geometry.
+ * The block devices for the data, realtime, and log devices are opened.
+ * Kernel ioctls are test-queried to see if they actually work (the scrub
+ * ioctl in particular), and any other filesystem-specific information
+ * is gathered.
+ *
+ * In the "check internal metadata" phase, we call the metadata scrub
+ * ioctl to check the filesystem's internal per-AG btrees.  This
+ * includes the AG superblock, AGF, AGFL, and AGI headers, freespace
+ * btrees, the regular and free inode btrees, the reverse mapping
+ * btrees, and the reference counting btrees.  If the realtime device is
+ * enabled, the realtime bitmap and reverse mapping btrees are checked.
+ * Quotas, if enabled, are also checked in this phase.
+ *
+ * Each AG (and the realtime device) has its metadata checked in a
+ * separate thread for better performance.  Errors in the internal
+ * metadata can be fixed here prior to the inode scan; refer to the
+ * section about the "repair filesystem" phase for more information.
+ *
+ * The "scan all inodes" phase uses BULKSTAT to scan all the inodes in
+ * an AG in disk order.  The BULKSTAT information provides enough
+ * information to construct a file handle that is used to check the
+ * following parts of every file:
+ *
+ *  - The inode record
+ *  - All three block forks (data, attr, CoW)
+ *  - If it's a symlink, the symlink target.
+ *  - If it's a directory, the directory entries.
+ *  - All extended attributes
+ *  - The parent pointer
+ *
+ * Multiple threads are started to check the inodes of each AG in
+ * parallel.  Errors in file metadata can be fixed here; see the section
+ * about the "repair filesystem" phase for more information.
+ *
+ * Next comes the (configurable) "repair filesystem" phase.  The user
+ * can instruct this program to fix all problems encountered; to fix
+ * only optimality problems and leave the corruptions; or not to touch
+ * the filesystem at all.  Any metadata repairs that did not succeed in
+ * the previous two phases are retried here; if there are uncorrectable
+ * errors, xfs_scrub stops here.
+ *
+ * The next phase is the "check directory tree" phase.  In this phase,
+ * every directory is opened (via file handle) to confirm that each
+ * directory is connected to the root.  Directory entries are checked
+ * for ambiguous Unicode normalization mappings, which is to say that we
+ * look for pairs of entries whose utf-8 strings normalize to the same
+ * code point sequence and map to different inodes, because that could
+ * be used to trick a user into opening the wrong file.  The names of
+ * extended attributes are checked for Unicode normalization collisions.
+ *
+ * In the "verify data file integrity" phase, we employ GETFSMAP to read
+ * the reverse-mappings of all AGs and issue direct-reads of the
+ * underlying disk blocks.  We rely on the underlying storage to have
+ * checksummed the data blocks appropriately.  Multiple threads are
+ * started to check each AG in parallel; a separate thread pool is used
+ * to handle the direct reads.
+ *
+ * In the "check summary counters" phase, we use GETFSMAP to tally up the
+ * blocks and BULKSTAT to tally up the inodes we saw and compare that to
+ * the statfs output.  This gives the user a rough estimate of how
+ * thorough the scrub was.
+ */
+
+/* Program name; needed for libxcmd error reports. */
+char				*progname = "xfs_scrub";
+
+int
+main(
+	int			argc,
+	char			**argv)
+{
+	fprintf(stderr, "XXX: This program is not complete!\n");
+	return 4;
+}
diff --git a/scrub/scrub.h b/scrub/scrub.h
new file mode 100644
index 0000000..b07029b
--- /dev/null
+++ b/scrub/scrub.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_SCRUB_H_
+#define XFS_SCRUB_SCRUB_H_
+
+#endif /* XFS_SCRUB_SCRUB_H_ */



* [PATCH 02/22] xfs_scrub: common error handling
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
  2017-08-04  0:07 ` [PATCH 01/22] xfs_scrub: create online filesystem scrub program Darrick J. Wong
@ 2017-08-04  0:07 ` Darrick J. Wong
  2017-08-04  0:07 ` [PATCH 03/22] xfs_scrub: set up command line argument parsing Darrick J. Wong
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:07 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Standardize how we record and report errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/common.c |  134 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/common.h |   28 ++++++++++++
 scrub/scrub.c  |    6 +++
 scrub/scrub.h  |   12 +++++
 4 files changed, 180 insertions(+)
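
For readers following along, the interaction between the error counter
and the -a threshold can be sketched in isolation.  This is a minimal
illustration and not code from this patch: struct demo_ctx is a
hypothetical, cut-down stand-in for struct scrub_ctx, and
demo_record_error() models only the counter update that __str_error()
performs after printing its message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for struct scrub_ctx. */
struct demo_ctx {
	pthread_mutex_t		lock;
	unsigned long long	max_errors;	/* 0 disables the limit */
	unsigned long long	errors_found;
};

/* Count one corruption error while holding the context lock. */
static void
demo_record_error(
	struct demo_ctx		*ctx)
{
	pthread_mutex_lock(&ctx->lock);
	ctx->errors_found++;
	pthread_mutex_unlock(&ctx->lock);
}

/* Same comparison as xfs_scrub_excessive_errors() in the diff below. */
static bool
demo_excessive_errors(
	struct demo_ctx		*ctx)
{
	bool			ret;

	pthread_mutex_lock(&ctx->lock);
	ret = ctx->max_errors > 0 && ctx->errors_found >= ctx->max_errors;
	pthread_mutex_unlock(&ctx->lock);

	return ret;
}
```

Note that only errors_found feeds the threshold; runtime_errors and
warnings_found are counted separately and never trip it.  Because both
the increment and the comparison take the lock, worker threads can
report errors concurrently while another thread polls the threshold.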


diff --git a/scrub/common.c b/scrub/common.c
index 7f2b4d2..4e47332 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -34,3 +34,137 @@
 #include "scrub.h"
 #include "common.h"
 #include "input.h"
+
+/*
+ * Reporting Status to the Console
+ *
+ * We aim for a roughly standard reporting format -- the severity of the
+ * status being reported, a textual description of the object being
+ * reported, and whatever the status happens to be.
+ *
+ * Errors are the most severe and reflect filesystem corruption.
+ * Warnings indicate that something is amiss and needs the attention of
+ * the administrator, but does not constitute a corruption.  Information
+ * is merely advisory.
+ */
+
+/* Too many errors? Bail out. */
+bool
+xfs_scrub_excessive_errors(
+	struct scrub_ctx	*ctx)
+{
+	bool			ret;
+
+	pthread_mutex_lock(&ctx->lock);
+	ret = ctx->max_errors > 0 && ctx->errors_found >= ctx->max_errors;
+	pthread_mutex_unlock(&ctx->lock);
+
+	return ret;
+}
+
+/* Print an error string and whatever error is stored in errno. */
+void
+__str_errno(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*file,
+	int			line)
+{
+	char			buf[DESCR_BUFSZ];
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, _("Error: %s: %s."), descr,
+			strerror_r(errno, buf, DESCR_BUFSZ));
+	if (debug)
+		fprintf(stderr, _(" (%s line %d)"), file, line);
+	fprintf(stderr, "\n");
+	ctx->runtime_errors++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print an error string and some error text. */
+void
+__str_error(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, _("Error: %s: "), descr);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, _(" (%s line %d)"), file, line);
+	fprintf(stderr, "\n");
+	ctx->errors_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print a warning string and some warning text. */
+void
+__str_warn(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, _("Warning: %s: "), descr);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, _(" (%s line %d)"), file, line);
+	fprintf(stderr, "\n");
+	ctx->warnings_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Print an informational string and some informational text. */
+void
+__str_info(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stdout, _("Info: %s: "), descr);
+	va_start(args, format);
+	vfprintf(stdout, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stdout, _(" (%s line %d)"), file, line);
+	fprintf(stdout, "\n");
+	fflush(stdout);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Catch fatal errors from pieces we import from xfs_repair. */
+void __attribute__((noreturn))
+do_error(char const *msg, ...)
+{
+	va_list args;
+
+	fprintf(stderr, _("\nfatal error -- "));
+
+	va_start(args, msg);
+	vfprintf(stderr, msg, args);
+	if (dumpcore)
+		abort();
+	exit(1);
+}
diff --git a/scrub/common.h b/scrub/common.h
index f29e4d3..b7c5f47 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -20,4 +20,32 @@
 #ifndef XFS_SCRUB_COMMON_H_
 #define XFS_SCRUB_COMMON_H_
 
+/*
+ * When reporting a defective metadata object to the console, this
+ * is the size of the buffer to use to store the description of that
+ * item.
+ */
+#define DESCR_BUFSZ	256
+
+bool xfs_scrub_excessive_errors(struct scrub_ctx *ctx);
+
+void __str_errno(struct scrub_ctx *ctx, const char *descr, const char *file,
+		 int line);
+void __str_error(struct scrub_ctx *ctx, const char *descr, const char *file,
+		 int line, const char *format, ...);
+void __str_warn(struct scrub_ctx *ctx, const char *descr, const char *file,
+		int line, const char *format, ...);
+void __str_info(struct scrub_ctx *ctx, const char *descr, const char *file,
+		int line, const char *format, ...);
+void __record_repair(struct scrub_ctx *ctx, const char *descr, const char *file,
+		int line, const char *format, ...);
+void __record_preen(struct scrub_ctx *ctx, const char *descr, const char *file,
+		int line, const char *format, ...);
+
+#define str_errno(ctx, str)		__str_errno(ctx, str, __FILE__, __LINE__)
+#define str_error(ctx, str, ...)	__str_error(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_warn(ctx, str, ...)		__str_warn(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_info(ctx, str, ...)		__str_info(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define dbg_printf(fmt, ...)		{if (debug > 1) {printf(fmt, __VA_ARGS__);}}
+
 #endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 4fe1590..bac1e8c 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -113,6 +113,12 @@
 /* Program name; needed for libxcmd error reports. */
 char				*progname = "xfs_scrub";
 
+/* Debug level; higher values mean more verbosity. */
+unsigned int			debug;
+
+/* Should we dump core if errors happen? */
+bool				dumpcore;
+
 int
 main(
 	int			argc,
diff --git a/scrub/scrub.h b/scrub/scrub.h
index b07029b..49de30b 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -20,4 +20,16 @@
 #ifndef XFS_SCRUB_SCRUB_H_
 #define XFS_SCRUB_SCRUB_H_
 
+extern unsigned int		debug;
+extern bool			dumpcore;
+
+struct scrub_ctx {
+	/* Mutable scrub state; use lock. */
+	pthread_mutex_t		lock;
+	unsigned long long	max_errors;
+	unsigned long long	runtime_errors;
+	unsigned long long	errors_found;
+	unsigned long long	warnings_found;
+};
+
 #endif /* XFS_SCRUB_SCRUB_H_ */



* [PATCH 03/22] xfs_scrub: set up command line argument parsing
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
  2017-08-04  0:07 ` [PATCH 01/22] xfs_scrub: create online filesystem scrub program Darrick J. Wong
  2017-08-04  0:07 ` [PATCH 02/22] xfs_scrub: common error handling Darrick J. Wong
@ 2017-08-04  0:07 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 04/22] xfs_scrub: dispatch the various phases of the scrub program Darrick J. Wong
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:07 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Parse command line options in order to set up the context in which we
will scrub the filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/common.h |    8 ++
 scrub/scrub.c  |  206 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/scrub.h  |   31 ++++++++
 3 files changed, 245 insertions(+)


diff --git a/scrub/common.h b/scrub/common.h
index b7c5f47..b601680 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -48,4 +48,12 @@ void __record_preen(struct scrub_ctx *ctx, const char *descr, const char *file,
 #define str_info(ctx, str, ...)		__str_info(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
 #define dbg_printf(fmt, ...)		{if (debug > 1) {printf(fmt, __VA_ARGS__);}}
 
+/* Is this debug tweak enabled? */
+static inline bool
+debug_tweak_on(
+	const char		*name)
+{
+	return debug && getenv(name) != NULL;
+}
+
 #endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
index bac1e8c..2b620bf 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -33,6 +33,9 @@
 #include "path.h"
 #include "scrub.h"
 #include "common.h"
+#include "input.h"
+
+#define _PATH_PROC_MOUNTS	"/proc/mounts"
 
 /*
  * XFS Online Metadata Scrub (and Repair)
@@ -119,11 +122,214 @@ unsigned int			debug;
 /* Should we dump core if errors happen? */
 bool				dumpcore;
 
+/* Display resource usage at the end of each phase? */
+bool				display_rusage;
+
+/* Background mode; higher values insert more pauses between scrub calls. */
+unsigned int			bg_mode;
+
+/* Maximum number of processors available to us. */
+int				nproc;
+
+/* Number of threads we're allowed to use. */
+unsigned int			nr_threads;
+
+/* Verbosity; higher values print more information. */
+bool				verbose;
+
+/* Should we scrub the data blocks? */
+bool				scrub_data;
+
+/* Size of a memory page. */
+long				page_size;
+
+static void __attribute__((noreturn))
+usage(void)
+{
+	fprintf(stderr, _("Usage: %s [OPTIONS] mountpoint\n"), progname);
+	fprintf(stderr, _("-a:\tStop after this many errors are found.\n"));
+	fprintf(stderr, _("-b:\tBackground mode.\n"));
+	fprintf(stderr, _("-e:\tWhat to do if errors are found.\n"));
+	fprintf(stderr, _("-m:\tPath to /etc/mtab.\n"));
+	fprintf(stderr, _("-n:\tDry run.  Do not modify anything.\n"));
+	fprintf(stderr, _("-T:\tDisplay timing/usage information.\n"));
+	fprintf(stderr, _("-v:\tVerbose output.\n"));
+	fprintf(stderr, _("-V:\tPrint version.\n"));
+	fprintf(stderr, _("-x:\tScrub file data too.\n"));
+	fprintf(stderr, _("-y:\tRepair all errors.\n"));
+
+	exit(16);
+}
+
 int
 main(
 	int			argc,
 	char			**argv)
 {
+	int			c;
+	char			*mtab = NULL;
+	struct scrub_ctx	ctx = {0};
+	unsigned long long	total_errors;
+	bool			moveon = true;
+	static bool		injected;
+	int			ret;
+
 	fprintf(stderr, "XXX: This program is not complete!\n");
 	return 4;
+
+	progname = basename(argv[0]);
+	setlocale(LC_ALL, "");
+	bindtextdomain(PACKAGE, LOCALEDIR);
+	textdomain(PACKAGE);
+
+	pthread_mutex_init(&ctx.lock, NULL);
+	ctx.mode = SCRUB_MODE_DEFAULT;
+	ctx.error_action = ERRORS_CONTINUE;
+	while ((c = getopt(argc, argv, "a:bde:m:nTvxVy")) != EOF) {
+		switch (c) {
+		case 'a':
+			ctx.max_errors = cvt_u64(optarg, 10);
+			if (errno) {
+				perror(optarg);
+				usage();
+			}
+			break;
+		case 'b':
+			nr_threads = 1;
+			bg_mode++;
+			break;
+		case 'd':
+			debug++;
+			dumpcore = true;
+			break;
+		case 'e':
+			if (!strcmp("continue", optarg))
+				ctx.error_action = ERRORS_CONTINUE;
+			else if (!strcmp("shutdown", optarg))
+				ctx.error_action = ERRORS_SHUTDOWN;
+			else
+				usage();
+			break;
+		case 'm':
+			mtab = optarg;
+			break;
+		case 'n':
+			if (ctx.mode != SCRUB_MODE_DEFAULT) {
+				fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+				return 1;
+			}
+			ctx.mode = SCRUB_MODE_DRY_RUN;
+			break;
+		case 'T':
+			display_rusage = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		case 'V':
+			fprintf(stdout, _("%s version %s\n"), progname,
+					VERSION);
+			fflush(stdout);
+			exit(0);
+		case 'x':
+			scrub_data = true;
+			break;
+		case 'y':
+			if (ctx.mode != SCRUB_MODE_DEFAULT) {
+				fprintf(stderr,
+_("Only one of the options -n or -y may be specified.\n"));
+				return 1;
+			}
+			ctx.mode = SCRUB_MODE_REPAIR;
+			break;
+		case '?':
+			/* fall through */
+		default:
+			usage();
+		}
+	}
+
+	/* Override thread count if debugging. */
+	if (debug_tweak_on("XFS_SCRUB_THREADS")) {
+		unsigned int	x;
+
+		x = cvt_u32(getenv("XFS_SCRUB_THREADS"), 10);
+		if (errno) {
+			perror("nr_threads");
+			usage();
+		}
+		nr_threads = x;
+	}
+
+	if (optind != argc - 1)
+		usage();
+
+	ctx.mntpoint = strdup(argv[optind]);
+
+	/*
+	 * If the user did not specify an explicit mount table, try to use
+	 * /proc/mounts if it is available, else /etc/mtab.  We prefer
+	 * /proc/mounts because it is kernel controlled, while /etc/mtab
+	 * may contain garbage that userspace tools like pam_mounts wrote
+	 * into it.
+	 */
+	if (!mtab) {
+		if (access(_PATH_PROC_MOUNTS, R_OK) == 0)
+			mtab = _PATH_PROC_MOUNTS;
+		else
+			mtab = _PATH_MOUNTED;
+	}
+
+	/* How many CPUs? */
+	nproc = sysconf(_SC_NPROCESSORS_ONLN);
+	if (nproc < 0)
+		nproc = 1;
+
+	/* Set up a page-aligned buffer for read verification. */
+	page_size = sysconf(_SC_PAGESIZE);
+	if (page_size < 0) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+		ctx.mode = SCRUB_MODE_REPAIR;
+		injected = true;
+	}
+
+out:
+	if (xfs_scrub_excessive_errors(&ctx))
+		str_info(&ctx, ctx.mntpoint, _("Too many errors; aborting."));
+
+	if (debug_tweak_on("XFS_SCRUB_FORCE_ERROR"))
+		str_error(&ctx, ctx.mntpoint, _("Injecting error."));
+
+	ret = 0;
+	if (!moveon)
+		ret |= 4;
+
+	total_errors = ctx.errors_found + ctx.runtime_errors;
+	if (total_errors && ctx.warnings_found)
+		fprintf(stderr,
+_("%s: %llu errors and %llu warnings found.  Unmount and run xfs_repair.\n"),
+			ctx.mntpoint, total_errors, ctx.warnings_found);
+	else if (total_errors && ctx.warnings_found == 0)
+		fprintf(stderr,
+_("%s: %llu errors found.  Unmount and run xfs_repair.\n"),
+			ctx.mntpoint, total_errors);
+	else if (total_errors == 0 && ctx.warnings_found)
+		fprintf(stderr,
+_("%s: %llu warnings found.\n"),
+			ctx.mntpoint, ctx.warnings_found);
+	if (ctx.errors_found)
+		ret |= 1;
+	if (ctx.warnings_found)
+		ret |= 2;
+	if (ctx.runtime_errors)
+		ret |= 4;
+
+	free(ctx.blkdev);
+	free(ctx.mntpoint);
+	return ret;
 }
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 49de30b..669c9dc 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -20,10 +20,41 @@
 #ifndef XFS_SCRUB_SCRUB_H_
 #define XFS_SCRUB_SCRUB_H_
 
+extern unsigned int		nr_threads;
+extern unsigned int		bg_mode;
 extern unsigned int		debug;
+extern int			nproc;
+extern bool			display_rusage;
 extern bool			dumpcore;
+extern bool			verbose;
+extern bool			scrub_data;
+extern long			page_size;
+
+enum scrub_mode {
+	SCRUB_MODE_DRY_RUN,
+	SCRUB_MODE_PREEN,
+	SCRUB_MODE_REPAIR,
+};
+#define SCRUB_MODE_DEFAULT			SCRUB_MODE_PREEN
+
+enum error_action {
+	ERRORS_CONTINUE,
+	ERRORS_SHUTDOWN,
+};
 
 struct scrub_ctx {
+	/* Immutable scrub state. */
+
+	/* Strings we need for presentation */
+	char			*mntpoint;
+	char			*blkdev;
+
+	/* What does the user want us to do? */
+	enum scrub_mode		mode;
+
+	/* How does the user want us to react to errors? */
+	enum error_action	error_action;
+
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
 	unsigned long long	max_errors;



* [PATCH 04/22] xfs_scrub: dispatch the various phases of the scrub program
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2017-08-04  0:07 ` [PATCH 03/22] xfs_scrub: set up command line argument parsing Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 05/22] xfs_scrub: bind to a mount point and a block device Darrick J. Wong
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create the dispatching routines that we'll use to call out to each
separate phase of the program.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
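Note for reviewers: the dispatch idea can be sketched on its own as
below.  This is only an illustration, not the code in this patch.
demo_ctx, demo_phase, demo_ok, demo_fail and run_phases are
hypothetical names, and the sketch ends its table with a NULL descr
sentinel so that entries without a function can be skipped rather
than ending the walk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical context; the real scrub_ctx carries far more state. */
struct demo_ctx {
	unsigned int	phases_run;
};

/* One table entry per phase (illustrative layout). */
struct demo_phase {
	const char	*descr;
	bool		(*fn)(struct demo_ctx *);
};

/* A phase that succeeds. */
bool demo_ok(struct demo_ctx *ctx)
{
	ctx->phases_run++;
	return true;
}

/* A phase that reports failure. */
bool demo_fail(struct demo_ctx *ctx)
{
	ctx->phases_run++;
	return false;
}

/*
 * Walk the table in order.  Entries without a function are skipped;
 * the first phase that returns false stops the run.  Returns the
 * 1-based number of the failing phase, or 0 if every phase passed.
 */
unsigned int
run_phases(
	const struct demo_phase	*phases,
	struct demo_ctx		*ctx)
{
	unsigned int		phase;

	for (phase = 1; phases->descr != NULL; phases++, phase++) {
		if (!phases->fn)
			continue;
		printf("Phase %u: %s\n", phase, phases->descr);
		if (!phases->fn(ctx))
			return phase;
	}
	return 0;
}
```

A caller would build an array of demo_phase entries ending in
{ NULL, NULL } and treat a nonzero return as "stop and report".
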
 configure.ac          |    1 
 include/builddefs.in  |    1 
 m4/package_libcdev.m4 |   18 +++
 scrub/Makefile        |    4 +
 scrub/common.c        |   63 +++++++++++
 scrub/common.h        |    4 +
 scrub/scrub.c         |  271 +++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 362 insertions(+)
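
The byte-count scaling used by the per-phase I/O report can be
sketched as a small loop.  This is a rough equivalent for
illustration, assuming the same strict "greater than" comparison as
the patch; demo_space_units is a hypothetical name and the debug
override is left out.

```c
#include <stddef.h>

/*
 * Scale a byte count to the largest binary prefix it exceeds.
 * The comparison is strict, so exactly 1024 bytes is still
 * reported in plain bytes.
 */
double
demo_space_units(
	unsigned long long	bytes,
	const char		**units)
{
	static const char	*names[] = { "B", "KiB", "MiB", "GiB", "TiB" };
	size_t			i;

	/* Try TiB first, then work down to KiB. */
	for (i = 4; i > 0; i--) {
		if (bytes > (1ULL << (10 * i))) {
			*units = names[i];
			return (double)bytes / (double)(1ULL << (10 * i));
		}
	}
	*units = names[0];
	return (double)bytes;
}
```

The returned value and the unit string are meant to be printed
together, for example with a "%.1f%s" format.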


diff --git a/configure.ac b/configure.ac
index d5f072a..e7ea99e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -143,6 +143,7 @@ AC_HAVE_FSETXATTR
 AC_HAVE_MREMAP
 AC_NEED_INTERNAL_FSXATTR
 AC_HAVE_GETFSMAP
+AC_HAVE_MALLINFO
 
 if test "$enable_blkid" = yes; then
 AC_HAVE_BLKID_TOPO
diff --git a/include/builddefs.in b/include/builddefs.in
index ec630bd..c178556 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -113,6 +113,7 @@ HAVE_FSETXATTR = @have_fsetxattr@
 HAVE_MREMAP = @have_mremap@
 NEED_INTERNAL_FSXATTR = @need_internal_fsxattr@
 HAVE_GETFSMAP = @have_getfsmap@
+HAVE_MALLINFO = @have_mallinfo@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index fa5b639..3ccdea5 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -297,3 +297,21 @@ AC_DEFUN([AC_HAVE_GETFSMAP],
        AC_MSG_RESULT(no))
     AC_SUBST(have_getfsmap)
   ])
+
+#
+# Check if we have a mallinfo libc call
+#
+AC_DEFUN([AC_HAVE_MALLINFO],
+  [ AC_MSG_CHECKING([for mallinfo ])
+    AC_TRY_COMPILE([
+#include <malloc.h>
+    ], [
+         struct mallinfo test;
+
+         test.arena = 0; test.hblkhd = 0; test.uordblks = 0; test.fordblks = 0;
+         test = mallinfo();
+    ], have_mallinfo=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_mallinfo)
+  ])
diff --git a/scrub/Makefile b/scrub/Makefile
index 90a1c47..6134fe9 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -27,6 +27,10 @@ LLDLIBS += $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
 LTDEPENDENCIES += $(LIBXCMD) $(LIBHANDLE)
 LLDFLAGS = -static
 
+ifeq ($(HAVE_MALLINFO),yes)
+LCFLAGS += -DHAVE_MALLINFO
+endif
+
 default: depend $(LTCOMMAND)
 
 include $(BUILDRULES)
diff --git a/scrub/common.c b/scrub/common.c
index 4e47332..86e92ed 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -168,3 +168,66 @@ do_error(char const *msg, ...)
 		abort();
 	exit(1);
 }
+
+double
+timeval_subtract(
+	struct timeval		*tv1,
+	struct timeval		*tv2)
+{
+	return ((tv1->tv_sec - tv2->tv_sec) +
+		((double) (tv1->tv_usec - tv2->tv_usec)) / 1000000);
+}
+
+/* Produce human readable disk space output. */
+double
+auto_space_units(
+	unsigned long long	bytes,
+	char			**units)
+{
+	if (debug > 1)
+		goto no_prefix;
+	if (bytes > (1ULL << 40)) {
+		*units = "TiB";
+		return (double)bytes / (1ULL << 40);
+	} else if (bytes > (1ULL << 30)) {
+		*units = "GiB";
+		return (double)bytes / (1ULL << 30);
+	} else if (bytes > (1ULL << 20)) {
+		*units = "MiB";
+		return (double)bytes / (1ULL << 20);
+	} else if (bytes > (1ULL << 10)) {
+		*units = "KiB";
+		return (double)bytes / (1ULL << 10);
+	}
+
+no_prefix:
+	*units = "B";
+	return bytes;
+}
+
+/* Produce human readable discrete number output. */
+double
+auto_units(
+	unsigned long long	number,
+	char			**units)
+{
+	if (debug > 1)
+		goto no_prefix;
+	if (number > 1000000000000ULL) {
+		*units = "T";
+		return number / 1000000000000.0;
+	} else if (number > 1000000000ULL) {
+		*units = "G";
+		return number / 1000000000.0;
+	} else if (number > 1000000ULL) {
+		*units = "M";
+		return number / 1000000.0;
+	} else if (number > 1000ULL) {
+		*units = "K";
+		return number / 1000.0;
+	}
+
+no_prefix:
+	*units = "";
+	return number;
+}
diff --git a/scrub/common.h b/scrub/common.h
index b601680..70a3b9d 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -56,4 +56,8 @@ debug_tweak_on(
 	return debug && getenv(name) != NULL;
 }
 
+double timeval_subtract(struct timeval *tv1, struct timeval *tv2);
+double auto_space_units(unsigned long long kilobytes, char **units);
+double auto_units(unsigned long long number, char **units);
+
 #endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 2b620bf..cb3d5f4 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -161,6 +161,265 @@ usage(void)
 	exit(16);
 }
 
+#ifndef RUSAGE_BOTH
+# define RUSAGE_BOTH		(-2)
+#endif
+
+/* Get resource usage for ourselves and all children. */
+static int
+scrub_getrusage(
+	struct rusage		*usage)
+{
+	struct rusage		cusage;
+	int			err;
+
+	err = getrusage(RUSAGE_BOTH, usage);
+	if (!err)
+		return err;
+
+	err = getrusage(RUSAGE_SELF, usage);
+	if (err)
+		return err;
+
+	err = getrusage(RUSAGE_CHILDREN, &cusage);
+	if (err)
+		return err;
+
+	usage->ru_minflt += cusage.ru_minflt;
+	usage->ru_majflt += cusage.ru_majflt;
+	usage->ru_nswap += cusage.ru_nswap;
+	usage->ru_inblock += cusage.ru_inblock;
+	usage->ru_oublock += cusage.ru_oublock;
+	usage->ru_msgsnd += cusage.ru_msgsnd;
+	usage->ru_msgrcv += cusage.ru_msgrcv;
+	usage->ru_nsignals += cusage.ru_nsignals;
+	usage->ru_nvcsw += cusage.ru_nvcsw;
+	usage->ru_nivcsw += cusage.ru_nivcsw;
+	return 0;
+}
+
+/*
+ * Scrub Phase Dispatch
+ *
+ * The operations of the scrub program are split up into several
+ * different phases.  Each phase builds upon the metadata checked in the
+ * previous phase, which is to say that we may skip phase (X + 1) if our
+ * scans in phase (X) reveal corruption.  A phase may be skipped
+ * entirely.
+ */
+
+/* Resource usage for each phase. */
+struct phase_rusage {
+	struct rusage		ruse;
+	struct timeval		time;
+	unsigned long long	verified_bytes;
+	void			*brk_start;
+	const char		*descr;
+};
+
+/* Operations for each phase. */
+#define DATASCAN_DUMMY_FN	((void *)1)
+#define REPAIR_DUMMY_FN		((void *)2)
+struct phase_ops {
+	char		*descr;
+	bool		(*fn)(struct scrub_ctx *);
+	bool		must_run;
+};
+
+/* Start tracking resource usage for a phase. */
+static bool
+phase_start(
+	struct phase_rusage	*pi,
+	unsigned int		phase,
+	const char		*descr)
+{
+	int			error;
+
+	error = scrub_getrusage(&pi->ruse);
+	if (error) {
+		perror(_("getrusage"));
+		return false;
+	}
+	pi->brk_start = sbrk(0);
+
+	error = gettimeofday(&pi->time, NULL);
+	if (error) {
+		perror(_("gettimeofday"));
+		return false;
+	}
+
+	pi->descr = descr;
+	if ((verbose || display_rusage) && descr) {
+		fprintf(stdout, _("Phase %u: %s\n"), phase, descr);
+		fflush(stdout);
+	}
+	return true;
+}
+
+/* Report usage stats. */
+static bool
+phase_end(
+	struct phase_rusage	*pi,
+	unsigned int		phase)
+{
+	struct rusage		ruse_now;
+#ifdef HAVE_MALLINFO
+	struct mallinfo		mall_now;
+#endif
+	struct timeval		time_now;
+	char			phasebuf[DESCR_BUFSZ];
+	double			dt;
+	long			in, out;
+	long			io;
+	double			i, o, t;
+	double			din, dout, dtot;
+	char			*iu, *ou, *tu, *dinu, *doutu, *dtotu;
+	int			error;
+
+	if (!display_rusage)
+		return true;
+
+	error = gettimeofday(&time_now, NULL);
+	if (error) {
+		perror(_("gettimeofday"));
+		return false;
+	}
+	dt = timeval_subtract(&time_now, &pi->time);
+
+	error = scrub_getrusage(&ruse_now);
+	if (error) {
+		perror(_("getrusage"));
+		return false;
+	}
+
+	if (phase)
+		snprintf(phasebuf, DESCR_BUFSZ, _("Phase %u: "), phase);
+	else
+		phasebuf[0] = 0;
+
+#define kbytes(x)	(((unsigned long)(x) + 1023) / 1024)
+#ifdef HAVE_MALLINFO
+
+	mall_now = mallinfo();
+	fprintf(stdout, _("%sMemory used: %luk/%luk (%luk/%luk), "),
+		phasebuf,
+		kbytes(mall_now.arena), kbytes(mall_now.hblkhd),
+		kbytes(mall_now.uordblks), kbytes(mall_now.fordblks));
+#else
+	fprintf(stdout, _("%sMemory used: %luk, "),
+		phasebuf,
+		(unsigned long) kbytes(((char *) sbrk(0)) -
+				       ((char *) pi->brk_start)));
+#endif
+#undef kbytes
+
+	fprintf(stdout, _("time: %5.2f/%5.2f/%5.2fs\n"),
+		timeval_subtract(&time_now, &pi->time),
+		timeval_subtract(&ruse_now.ru_utime, &pi->ruse.ru_utime),
+		timeval_subtract(&ruse_now.ru_stime, &pi->ruse.ru_stime));
+
+	/* I/O usage */
+	in =  (ruse_now.ru_inblock - pi->ruse.ru_inblock) << BBSHIFT;
+	out = (ruse_now.ru_oublock - pi->ruse.ru_oublock) << BBSHIFT;
+	io = in + out;
+	if (io) {
+		i = auto_space_units(in, &iu);
+		o = auto_space_units(out, &ou);
+		t = auto_space_units(io, &tu);
+		din = auto_space_units(in / dt, &dinu);
+		dout = auto_space_units(out / dt, &doutu);
+		dtot = auto_space_units(io / dt, &dtotu);
+		fprintf(stdout,
+_("%sI/O: %.1f%s in, %.1f%s out, %.1f%s tot\n"),
+			phasebuf, i, iu, o, ou, t, tu);
+		fprintf(stdout,
+_("%sI/O rate: %.1f%s/s in, %.1f%s/s out, %.1f%s/s tot\n"),
+			phasebuf, din, dinu, dout, doutu, dtot, dtotu);
+	}
+	fflush(stdout);
+
+	return true;
+}
+
+/* Run all the phases of the scrubber. */
+static bool
+run_scrub_phases(
+	struct scrub_ctx	*ctx)
+{
+	struct phase_ops phases[] =
+	{
+		{
+			.descr = _("Find filesystem geometry."),
+			.must_run = true,
+		},
+		{
+			.descr = _("Check internal metadata."),
+		},
+		{
+			.descr = _("Scan all inodes."),
+		},
+		{
+			.descr = _("Defer filesystem repairs."),
+			.fn = REPAIR_DUMMY_FN,
+		},
+		{
+			.descr = _("Check directory tree."),
+		},
+		{
+			.descr = _("Verify data file integrity."),
+			.fn = DATASCAN_DUMMY_FN,
+		},
+		{
+			.descr = _("Check summary counters."),
+		},
+		{
+			NULL
+		},
+	};
+	struct phase_rusage	pi;
+	struct phase_ops	*sp;
+	bool			moveon = true;
+	unsigned int		debug_phase = 0;
+	unsigned int		phase;
+
+	if (debug && debug_tweak_on("XFS_SCRUB_PHASE"))
+		debug_phase = atoi(getenv("XFS_SCRUB_PHASE"));
+
+	/* Run all phases of the scrub tool. */
+	for (phase = 1, sp = phases; sp->fn; sp++, phase++) {
+		/* Skip certain phases unless they're turned on. */
+		if (sp->fn == REPAIR_DUMMY_FN ||
+		    sp->fn == DATASCAN_DUMMY_FN)
+			continue;
+
+		/* Allow debug users to force a particular phase. */
+		if (debug_phase && phase != debug_phase && !sp->must_run)
+			continue;
+
+		/* Run this phase. */
+		moveon = phase_start(&pi, phase, sp->descr);
+		if (!moveon)
+			break;
+		moveon = sp->fn(ctx);
+		if (!moveon) {
+			str_info(ctx, ctx->mntpoint,
+_("Scrub aborted after phase %u."),
+					phase);
+			break;
+		}
+		moveon = phase_end(&pi, phase);
+		if (!moveon)
+			break;
+
+		/* Too many errors? */
+		moveon = !xfs_scrub_excessive_errors(ctx);
+		if (!moveon)
+			break;
+	}
+
+	return moveon;
+}
+
 int
 main(
 	int			argc,
@@ -169,6 +428,7 @@ main(
 	int			c;
 	char			*mtab = NULL;
 	struct scrub_ctx	ctx = {0};
+	struct phase_rusage	all_pi;
 	unsigned long long	total_errors;
 	bool			moveon = true;
 	static bool		injected;
@@ -281,6 +541,11 @@ _("Only one of the options -n or -y may be specified.\n"));
 			mtab = _PATH_MOUNTED;
 	}
 
+	/* Initialize overall phase stats. */
+	moveon = phase_start(&all_pi, 0, NULL);
+	if (!moveon)
+		goto out;
+
 	/* How many CPUs? */
 	nproc = sysconf(_SC_NPROCESSORS_ONLN);
 	if (nproc < 0)
@@ -298,6 +563,11 @@ _("Only one of the options -n or -y may be specified.\n"));
 		injected = true;
 	}
 
+	/* Scrub a filesystem. */
+	moveon = run_scrub_phases(&ctx);
+	if (!moveon)
+		goto out;
+
 out:
 	if (xfs_scrub_excessive_errors(&ctx))
 		str_info(&ctx, ctx.mntpoint, _("Too many errors; aborting."));
@@ -328,6 +598,7 @@ _("%s: %llu warnings found.\n"),
 		ret |= 2;
 	if (ctx.runtime_errors)
 		ret |= 4;
+	phase_end(&all_pi, 0);
 
 	free(ctx.blkdev);
 	free(ctx.mntpoint);



* [PATCH 05/22] xfs_scrub: bind to a mount point and a block device
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 04/22] xfs_scrub: dispatch the various phases of the scrub program Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 06/22] xfs_scrub: find XFS filesystem geometry Darrick J. Wong
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create an abstraction to handle all of our low level disk operations,
then use it to bind to a fs mount point and block device.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
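Note for reviewers: the thread-count heuristic can be read as a pure
function of what the block layer reports.  The sketch below is an
illustration under that assumption.  demo_io_threads and its
parameters are hypothetical names; the inputs stand in for the values
that BLKROTATIONAL, BLKIOMIN and BLKIOOPT would return.

```c
/*
 * Guess how many I/O threads a device can usefully absorb.
 * A value of zero for iomin or ioopt means "unknown".
 */
unsigned int
demo_io_threads(
	unsigned int	nproc,
	int		rotational,
	unsigned int	iomin,
	unsigned int	ioopt)
{
	unsigned int	stripes;

	/* Non-rotational media: use every CPU. */
	if (!rotational)
		return nproc;

	/* ioopt / iomin approximates the number of member devices. */
	if (iomin > 0 && ioopt > 0) {
		stripes = ioopt / iomin;
		if (stripes < 1)
			stripes = 1;
		return stripes < nproc ? stripes : nproc;
	}

	/* Plain rotating disk with no topology hints. */
	return 2;
}
```

For example, a rotating array reporting iomin 4096 and ioopt 16384 on
an 8-CPU machine would get 4 threads under this rule.
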
 scrub/Makefile |    2 +
 scrub/common.c |   27 +++++++++
 scrub/common.h |    1 
 scrub/disk.c   |  160 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/disk.h   |   40 ++++++++++++++
 scrub/scrub.c  |   67 +++++++++++++++++++++++
 scrub/scrub.h  |   14 +++++
 7 files changed, 311 insertions(+)
 create mode 100644 scrub/disk.c
 create mode 100644 scrub/disk.h
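
The LBA size is stored as a log2 value.  A standalone sketch of the
ceil(log2()) helper follows, for illustration only.  demo_log2_roundup
is a hypothetical name, and the sketch uses CHAR_BIT and an unsigned
shift where the patch uses NBBY and a plain int shift.

```c
#include <limits.h>

/* Return ceil(log2(i)); both i == 0 and i == 1 yield 0. */
unsigned int
demo_log2_roundup(unsigned int i)
{
	unsigned int	rval;

	/* Find the smallest power of two that is >= i. */
	for (rval = 0; rval < CHAR_BIT * sizeof(i); rval++) {
		if ((1U << rval) >= i)
			break;
	}
	return rval;
}
```

A 512-byte sector size gives 9 and a 4096-byte sector size gives 12.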


diff --git a/scrub/Makefile b/scrub/Makefile
index 6134fe9..fa88e01 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -17,10 +17,12 @@ endif	# scrub_prereqs
 
 HFILES = \
 common.h \
+disk.h \
 scrub.h
 
 CFILES = \
 common.c \
+disk.c \
 scrub.c
 
 LLDLIBS += $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
diff --git a/scrub/common.c b/scrub/common.c
index 86e92ed..f650438 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -31,6 +31,7 @@
 #include <dirent.h>
 #include "../repair/threads.h"
 #include "path.h"
+#include "disk.h"
 #include "scrub.h"
 #include "common.h"
 #include "input.h"
@@ -231,3 +232,29 @@ auto_units(
 	*units = "";
 	return number;
 }
+
+/* How many threads to kick off? */
+unsigned int
+scrub_nproc(
+	struct scrub_ctx	*ctx)
+{
+	if (nr_threads)
+		return nr_threads;
+	return ctx->nr_io_threads;
+}
+
+/*
+ * Return ceil(log2(i)).
+ * Avoid linking in libxfs by providing the few symbols we actually need.
+ */
+unsigned int
+libxfs_log2_roundup(unsigned int i)
+{
+	unsigned int	rval;
+
+	for (rval = 0; rval < NBBY * sizeof(i); rval++) {
+		if ((1U << rval) >= i)
+			break;
+	}
+	return rval;
+}
diff --git a/scrub/common.h b/scrub/common.h
index 70a3b9d..0bc6872 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -59,5 +59,6 @@ debug_tweak_on(
 double timeval_subtract(struct timeval *tv1, struct timeval *tv2);
 double auto_space_units(unsigned long long kilobytes, char **units);
 double auto_units(unsigned long long number, char **units);
+unsigned int scrub_nproc(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/disk.c b/scrub/disk.c
new file mode 100644
index 0000000..613e7fd
--- /dev/null
+++ b/scrub/disk.c
@@ -0,0 +1,160 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "scrub.h"
+#include "common.h"
+
+/*
+ * Disk Abstraction
+ *
+ * These routines help us to discover the geometry of a block device,
+ * estimate the number of concurrent IOs that we can send to it, and
+ * abstract the process of performing read verification of disk blocks.
+ */
+
+/* Figure out how many disk heads are available. */
+static unsigned int
+__disk_heads(
+	struct disk		*disk)
+{
+	int			iomin;
+	int			ioopt;
+	unsigned short		rot;
+	int			error;
+
+	/* If it's not a block device, throw all the CPUs at it. */
+	if (!S_ISBLK(disk->d_sb.st_mode))
+		return nproc;
+
+	/* Non-rotational device?  Throw all the CPUs. */
+	rot = 1;
+	error = ioctl(disk->d_fd, BLKROTATIONAL, &rot);
+	if (error == 0 && rot == 0)
+		return nproc;
+
+	/*
+	 * Sometimes we can infer the number of devices from the
+	 * min/optimal IO sizes.
+	 */
+	iomin = ioopt = 0;
+	if (ioctl(disk->d_fd, BLKIOMIN, &iomin) == 0 &&
+	    ioctl(disk->d_fd, BLKIOOPT, &ioopt) == 0 &&
+	    iomin > 0 && ioopt > 0) {
+		return min(nproc, max(1, ioopt / iomin));
+	}
+
+	/* Rotating device?  I guess? */
+	return 2;
+}
+
+/* Figure out how many disk heads are available. */
+unsigned int
+disk_heads(
+	struct disk		*disk)
+{
+	if (nr_threads)
+		return nr_threads;
+	return __disk_heads(disk);
+}
+
+/* Open a disk device and discover its geometry. */
+int
+disk_open(
+	const char		*pathname,
+	struct disk		*disk)
+{
+	int			lba_sz;
+	int			error;
+
+	disk->d_fd = open(pathname, O_RDONLY | O_DIRECT | O_NOATIME);
+	if (disk->d_fd < 0)
+		return -1;
+
+	/* Try to get LBA size. */
+	error = ioctl(disk->d_fd, BLKSSZGET, &lba_sz);
+	if (error)
+		lba_sz = 512;
+	disk->d_lbalog = libxfs_log2_roundup(lba_sz);
+
+	/* Obtain disk's stat info. */
+	error = fstat(disk->d_fd, &disk->d_sb);
+	if (error) {
+		error = errno;
+		close(disk->d_fd);
+		errno = error;
+		disk->d_fd = -1;
+		return -1;
+	}
+
+	/* Determine bdev size, block size, and offset. */
+	if (S_ISBLK(disk->d_sb.st_mode)) {
+		error = ioctl(disk->d_fd, BLKGETSIZE64, &disk->d_size);
+		if (error)
+			disk->d_size = 0;
+		error = ioctl(disk->d_fd, BLKBSZGET, &disk->d_blksize);
+		if (error)
+			disk->d_blksize = 0;
+		disk->d_start = 0;
+	} else {
+		disk->d_size = disk->d_sb.st_size;
+		disk->d_blksize = disk->d_sb.st_blksize;
+		disk->d_start = 0;
+	}
+
+	return 0;
+}
+
+/* Close a disk device. */
+int
+disk_close(
+	struct disk		*disk)
+{
+	int			error = 0;
+
+	if (disk->d_fd >= 0)
+		error = close(disk->d_fd);
+	disk->d_fd = -1;
+	return error;
+}
+
+/* Is this device open? */
+bool
+disk_is_open(
+	struct disk		*disk)
+{
+	return disk->d_fd >= 0;
+}
+
+/* Read-verify an extent of a disk device. */
+ssize_t
+disk_read_verify(
+	struct disk		*disk,
+	void			*buf,
+	uint64_t		start,
+	uint64_t		length)
+{
+	return pread(disk->d_fd, buf, length, start);
+}
diff --git a/scrub/disk.h b/scrub/disk.h
new file mode 100644
index 0000000..797fd71
--- /dev/null
+++ b/scrub/disk.h
@@ -0,0 +1,40 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_DISK_H_
+#define XFS_SCRUB_DISK_H_
+
+struct disk {
+	struct stat	d_sb;
+	int		d_fd;
+	int		d_lbalog;
+	unsigned int	d_flags;
+	unsigned int	d_blksize;	/* bytes */
+	uint64_t	d_size;		/* bytes */
+	uint64_t	d_start;	/* bytes */
+};
+
+unsigned int disk_heads(struct disk *disk);
+bool disk_is_open(struct disk *disk);
+int disk_open(const char *pathname, struct disk *disk);
+int disk_close(struct disk *disk);
+ssize_t disk_read_verify(struct disk *disk, void *buf, uint64_t start,
+		uint64_t length);
+
+#endif /* XFS_SCRUB_DISK_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
index cb3d5f4..f492301 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -31,6 +31,7 @@
 #include <dirent.h>
 #include "../repair/threads.h"
 #include "path.h"
+#include "disk.h"
 #include "scrub.h"
 #include "common.h"
 #include "input.h"
@@ -341,6 +342,58 @@ _("%sI/O rate: %.1f%s/s in, %.1f%s/s out, %.1f%s/s tot\n"),
 	return true;
 }
 
+/* Find filesystem geometry and perform any other setup functions. */
+static bool
+find_geo(
+	struct scrub_ctx	*ctx)
+{
+	bool			moveon = true;
+	int			error;
+
+	/*
+	 * Open the directory with O_NOATIME.  For mountpoints owned
+	 * by root, this should be sufficient to ensure that we have
+	 * CAP_SYS_ADMIN, which we probably need to do anything fancy
+	 * with the (XFS driver) kernel.
+	 */
+	ctx->mnt_fd = open(ctx->mntpoint, O_RDONLY | O_NOATIME | O_DIRECTORY);
+	if (ctx->mnt_fd < 0) {
+		if (errno == EPERM)
+			str_info(ctx, ctx->mntpoint,
+_("Must be root to run scrub."));
+		else
+			str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	error = disk_open(ctx->blkdev, &ctx->datadev);
+	if (error && errno != ENOENT)
+		str_errno(ctx, ctx->blkdev);
+
+	error = fstat(ctx->mnt_fd, &ctx->mnt_sb);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = fstatvfs(ctx->mnt_fd, &ctx->mnt_sv);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	error = fstatfs(ctx->mnt_fd, &ctx->mnt_sf);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	if (verbose) {
+		fprintf(stdout, _("%s: using %u threads to scrub.\n"),
+				ctx->mntpoint, scrub_nproc(ctx));
+		fflush(stdout);
+	}
+
+	return moveon;
+}
+
 /* Run all the phases of the scrubber. */
 static bool
 run_scrub_phases(
@@ -350,6 +403,7 @@ run_scrub_phases(
 	{
 		{
 			.descr = _("Find filesystem geometry."),
+			.fn = find_geo,
 			.must_run = true,
 		},
 		{
@@ -443,6 +497,7 @@ main(
 	textdomain(PACKAGE);
 
 	pthread_mutex_init(&ctx.lock, NULL);
+	ctx.datadev.d_fd = -1;
 	ctx.mode = SCRUB_MODE_DEFAULT;
 	ctx.error_action = ERRORS_CONTINUE;
 	while ((c = getopt(argc, argv, "a:bde:m:nTvxVy")) != EOF) {
@@ -527,6 +582,15 @@ _("Only one of the options -n or -y may be specified.\n"));
 
 	ctx.mntpoint = argv[optind];
 
+	/* Find the mount record for the passed-in argument. */
+	if (stat(argv[optind], &ctx.mnt_sb) < 0) {
+		fprintf(stderr,
+			_("%s: could not stat: %s: %s\n"),
+			progname, argv[optind], strerror(errno));
+		ret = 8;
+		goto end;
+	}
+
 	/*
 	 * If the user did not specify an explicit mount table, try to use
 	 * /proc/mounts if it is available, else /etc/mtab.  We prefer
@@ -599,8 +663,11 @@ _("%s: %llu warnings found.\n"),
 	if (ctx.runtime_errors)
 		ret |= 4;
 	phase_end(&all_pi, 0);
+	close(ctx.mnt_fd);
+	disk_close(&ctx.datadev);
 
 	free(ctx.blkdev);
 	free(ctx.mntpoint);
+end:
 	return ret;
 }
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 669c9dc..3a776e1 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -49,12 +49,26 @@ struct scrub_ctx {
 	char			*mntpoint;
 	char			*blkdev;
 
+	/* Mountpoint info */
+	struct stat		mnt_sb;
+	struct statvfs		mnt_sv;
+	struct statfs		mnt_sf;
+
+	/* Open block devices */
+	struct disk		datadev;
+
 	/* What does the user want us to do? */
 	enum scrub_mode		mode;
 
 	/* How does the user want us to react to errors? */
 	enum error_action	error_action;
 
+	/* fd to filesystem mount point */
+	int			mnt_fd;
+
+	/* Number of threads for metadata scrubbing */
+	unsigned int		nr_io_threads;
+
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
 	unsigned long long	max_errors;



* [PATCH 06/22] xfs_scrub: find XFS filesystem geometry
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 05/22] xfs_scrub: bind to a mount point and a block device Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 07/22] xfs_scrub: scan filesystem and AG metadata Darrick J. Wong
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Discover the geometry of the XFS filesystem that we've been told to
scan, and set up some common functions that will be used by the
scrub phases.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile |   13 +++-
 scrub/common.c |   71 ++++++++++++++++++++
 scrub/common.h |   10 +++
 scrub/ioctl.c  |  195 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/ioctl.h  |   94 +++++++++++++++++++++++++++
 scrub/phase1.c |  169 +++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/scrub.c  |   26 +++++++
 scrub/scrub.h  |   13 ++++
 scrub/xfs.c    |   44 +++++++++++++
 scrub/xfs.h    |   29 ++++++++
 10 files changed, 660 insertions(+), 4 deletions(-)
 create mode 100644 scrub/ioctl.c
 create mode 100644 scrub/ioctl.h
 create mode 100644 scrub/phase1.c
 create mode 100644 scrub/xfs.c
 create mode 100644 scrub/xfs.h
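
The kernel feature probes in ioctl.c come down to classifying the
errno left behind by a test ioctl.  The sketch below shows that
classification in isolation, as an illustration only.  demo_probe_ok
is a hypothetical name, and the messages printed by the patch are
omitted.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether a probe ioctl result means "keep going".
 * ret and err stand for the ioctl return value and the errno
 * observed right after it.
 */
bool
demo_probe_ok(int ret, int err)
{
	if (ret == 0)
		return true;
	switch (err) {
	case EROFS:		/* read-only mount */
	case ENOTRECOVERABLE:	/* norecovery mount */
	case EOPNOTSUPP:	/* kernel lacks the feature */
	case ENOTTY:		/* ioctl not recognised */
		return false;
	default:		/* e.g. ENOENT: absent on this fs */
		return true;
	}
}
```

Only the errors that make scrubbing impossible stop the run; anything
else is treated as survivable.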


diff --git a/scrub/Makefile b/scrub/Makefile
index fa88e01..a797bfb 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -18,12 +18,17 @@ endif	# scrub_prereqs
 HFILES = \
 common.h \
 disk.h \
-scrub.h
+ioctl.h \
+scrub.h \
+xfs.h
 
 CFILES = \
 common.c \
 disk.c \
-scrub.c
+ioctl.c \
+phase1.c \
+scrub.c \
+xfs.c
 
 LLDLIBS += $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
 LTDEPENDENCIES += $(LIBXCMD) $(LIBHANDLE)
@@ -33,6 +38,10 @@ ifeq ($(HAVE_MALLINFO),yes)
 LCFLAGS += -DHAVE_MALLINFO
 endif
 
+ifeq ($(HAVE_SYNCFS),yes)
+LCFLAGS += -DHAVE_SYNCFS
+endif
+
 default: depend $(LTCOMMAND)
 
 include $(BUILDRULES)
diff --git a/scrub/common.c b/scrub/common.c
index f650438..874f8ab 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -35,6 +35,8 @@
 #include "scrub.h"
 #include "common.h"
 #include "input.h"
+#include "ioctl.h"
+#include "xfs.h"
 
 /*
  * Reporting Status to the Console
@@ -258,3 +260,72 @@ libxfs_log2_roundup(unsigned int i)
 	}
 	return rval;
 }
+
+/*
+ * Check if the argument is either the device name or mountpoint of a mounted
+ * filesystem.
+ */
+#define MNTTYPE_XFS	"xfs"
+static bool
+find_mountpoint_check(
+	struct stat		*sb,
+	struct mntent		*t)
+{
+	struct stat		ms;
+
+	if (S_ISDIR(sb->st_mode)) {		/* mount point */
+		if (stat(t->mnt_dir, &ms) < 0)
+			return false;
+		if (sb->st_ino != ms.st_ino)
+			return false;
+		if (sb->st_dev != ms.st_dev)
+			return false;
+		if (strcmp(t->mnt_type, MNTTYPE_XFS) != 0)
+			return false;
+	} else {				/* device */
+		if (stat(t->mnt_fsname, &ms) < 0)
+			return false;
+		if (sb->st_rdev != ms.st_rdev)
+			return false;
+		if (strcmp(t->mnt_type, MNTTYPE_XFS) != 0)
+			return false;
+		/*
+		 * Make sure the mountpoint given by mtab is accessible
+		 * before using it.
+		 */
+		if (stat(t->mnt_dir, &ms) < 0)
+			return false;
+	}
+
+	return true;
+}
+
+/* Check that our alleged mountpoint is in mtab */
+bool
+find_mountpoint(
+	char			*mtab,
+	struct scrub_ctx	*ctx)
+{
+	struct mntent_cursor	cursor;
+	struct mntent		*t = NULL;
+	bool			found = false;
+
+	if (platform_mntent_open(&cursor, mtab) != 0) {
+		fprintf(stderr, "Error: can't get mntent entries.\n");
+		exit(1);
+	}
+
+	while ((t = platform_mntent_next(&cursor)) != NULL) {
+		/*
+		 * Keep jotting down matching mount details; newer mounts are
+		 * towards the end of the file (hopefully).
+		 */
+		if (find_mountpoint_check(&ctx->mnt_sb, t)) {
+			ctx->mntpoint = strdup(t->mnt_dir);
+			ctx->blkdev = strdup(t->mnt_fsname);
+			found = true;
+		}
+	}
+	platform_mntent_close(&cursor);
+	return found;
+}
diff --git a/scrub/common.h b/scrub/common.h
index 0bc6872..a8b1ff8 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -61,4 +61,14 @@ double auto_space_units(unsigned long long kilobytes, char **units);
 double auto_units(unsigned long long number, char **units);
 unsigned int scrub_nproc(struct scrub_ctx *ctx);
 
+#ifndef HAVE_SYNCFS
+static inline int syncfs(int fd)
+{
+	sync();
+	return 0;
+}
+#endif
+
+bool find_mountpoint(char *mtab, struct scrub_ctx *ctx);
+
 #endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/ioctl.c b/scrub/ioctl.c
new file mode 100644
index 0000000..6578672
--- /dev/null
+++ b/scrub/ioctl.c
@@ -0,0 +1,195 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+
+/* Does the kernel support bulkstat? */
+bool
+xfs_can_iterate_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_fsop_bulkreq	bulkreq;
+	__u64			lastino;
+	__s32			bulklen = 0;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"))
+		return false;
+
+	lastino = 0;
+	memset(&bulkreq, 0, sizeof(bulkreq));
+	bulkreq.lastip = (__u64 *)&lastino;
+	bulkreq.icount  = 0;
+	bulkreq.ubuffer = NULL;
+	bulkreq.ocount  = &bulklen;
+
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT, &bulkreq);
+	return error == -1 && errno == EINVAL;
+}
+
+/* Does the kernel support getbmapx? */
+bool
+xfs_can_iterate_bmap(
+	struct scrub_ctx	*ctx)
+{
+	struct getbmapx		bsm[2];
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BMAP"))
+		return false;
+
+	memset(bsm, 0, sizeof(struct getbmapx));
+	bsm->bmv_length = ULLONG_MAX;
+	bsm->bmv_count = 2;
+	error = ioctl(ctx->mnt_fd, XFS_IOC_GETBMAPX, bsm);
+	return error == 0;
+}
+
+/* Does the kernel support getfsmap? */
+bool
+xfs_can_iterate_fsmap(
+	struct scrub_ctx	*ctx)
+{
+	struct fsmap_head	head;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_FSMAP"))
+		return false;
+
+	memset(&head, 0, sizeof(struct fsmap_head));
+	head.fmh_keys[1].fmr_device = UINT_MAX;
+	head.fmh_keys[1].fmr_physical = ULLONG_MAX;
+	head.fmh_keys[1].fmr_owner = ULLONG_MAX;
+	head.fmh_keys[1].fmr_offset = ULLONG_MAX;
+	error = ioctl(ctx->mnt_fd, FS_IOC_GETFSMAP, &head);
+	return error == 0 && (head.fmh_oflags & FMH_OF_DEV_T);
+}
+
+/* Test the availability of a kernel scrub command. */
+#define XFS_ERRTAG_FORCE_SCRUB_REPAIR	30
+static bool
+__xfs_scrub_test(
+	struct scrub_ctx		*ctx,
+	unsigned int			type,
+	bool				repair)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	struct xfs_error_injection	inject;
+	static bool			injected;
+	int				error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_KERNEL"))
+		return false;
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+		inject.fd = ctx->mnt_fd;
+		inject.errtag = XFS_ERRTAG_FORCE_SCRUB_REPAIR;
+		error = ioctl(ctx->mnt_fd,
+				XFS_IOC_ERROR_INJECTION, &inject);
+		if (error == 0)
+			injected = true;
+	}
+
+	meta.sm_type = type;
+	if (repair)
+		meta.sm_flags |= XFS_SCRUB_IFLAG_REPAIR;
+	error = ioctl(ctx->mnt_fd, XFS_IOC_SCRUB_METADATA, &meta);
+	if (!error)
+		return true;
+	switch (errno) {
+	case EROFS:
+		str_info(ctx, ctx->mntpoint,
+_("Filesystem is mounted read-only; cannot proceed."));
+		return false;
+	case ENOTRECOVERABLE:
+		str_info(ctx, ctx->mntpoint,
+_("Filesystem is mounted norecovery; cannot proceed."));
+		return false;
+	case EOPNOTSUPP:
+	case ENOTTY:
+		str_info(ctx, ctx->mntpoint,
+_("Kernel metadata scrub is required."));
+		return false;
+	case ENOENT:
+		/* Scrubber says not present on this fs; that's fine. */
+		return true;
+	default:
+		str_info(ctx, ctx->mntpoint, "%s", strerror(errno));
+		return true;
+	}
+	return error == 0 || (error && errno != EOPNOTSUPP && errno != ENOTTY);
+}
+
+bool
+xfs_can_scrub_fs_metadata(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_TEST, false);
+}
+
+bool
+xfs_can_scrub_inode(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_INODE, false);
+}
+
+bool
+xfs_can_scrub_bmap(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_BMBTD, false);
+}
+
+bool
+xfs_can_scrub_dir(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_DIR, false);
+}
+
+bool
+xfs_can_scrub_attr(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_XATTR, false);
+}
+
+bool
+xfs_can_scrub_symlink(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_SYMLINK, false);
+}
+
+bool
+xfs_can_scrub_parent(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_PARENT, false);
+}
diff --git a/scrub/ioctl.h b/scrub/ioctl.h
new file mode 100644
index 0000000..c255bbb
--- /dev/null
+++ b/scrub/ioctl.h
@@ -0,0 +1,94 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_IOCTL_H_
+#define XFS_SCRUB_IOCTL_H_
+
+/* inode iteration */
+#define XFS_ITERATE_INODES_ABORT	(-1)
+typedef int (*xfs_inode_iter_fn)(struct scrub_ctx *ctx,
+		struct xfs_handle *handle, struct xfs_bstat *bs, void *arg);
+bool xfs_iterate_inodes(struct scrub_ctx *ctx, const char *descr,
+		void *fshandle, uint64_t first_ino, uint64_t last_ino,
+		xfs_inode_iter_fn fn, void *arg);
+bool xfs_can_iterate_inodes(struct scrub_ctx *ctx);
+
+/* inode fork block mapping */
+struct xfs_bmap {
+	uint64_t	bm_offset;	/* file offset of segment in bytes */
+	uint64_t	bm_physical;	/* physical starting byte  */
+	uint64_t	bm_length;	/* length of segment, bytes */
+	uint32_t	bm_flags;	/* output flags */
+};
+
+typedef bool (*xfs_bmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		int fd, int whichfork, struct fsxattr *fsx,
+		struct xfs_bmap *bmap, void *arg);
+
+bool xfs_iterate_bmap(struct scrub_ctx *ctx, const char *descr, int fd,
+		int whichfork, struct xfs_bmap *key, xfs_bmap_iter_fn fn,
+		void *arg);
+bool xfs_can_iterate_bmap(struct scrub_ctx *ctx);
+
+/* filesystem reverse mapping */
+typedef bool (*xfs_fsmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *fsr, void *arg);
+bool xfs_iterate_fsmap(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *keys, xfs_fsmap_iter_fn fn, void *arg);
+bool xfs_can_iterate_fsmap(struct scrub_ctx *ctx);
+
+/* Online scrub and repair. */
+enum check_outcome {
+	CHECK_DONE,	/* no further processing needed */
+	CHECK_REPAIR,	/* schedule this for repairs */
+	CHECK_ABORT,	/* end program */
+	CHECK_RETRY,	/* repair failed, try again later */
+};
+
+void xfs_scrub_report_preen_triggers(struct scrub_ctx *ctx);
+bool xfs_scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno);
+bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno);
+bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx);
+
+bool xfs_can_scrub_fs_metadata(struct scrub_ctx *ctx);
+bool xfs_can_scrub_inode(struct scrub_ctx *ctx);
+bool xfs_can_scrub_bmap(struct scrub_ctx *ctx);
+bool xfs_can_scrub_dir(struct scrub_ctx *ctx);
+bool xfs_can_scrub_attr(struct scrub_ctx *ctx);
+bool xfs_can_scrub_symlink(struct scrub_ctx *ctx);
+bool xfs_can_scrub_parent(struct scrub_ctx *ctx);
+
+bool xfs_scrub_inode_fields(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_data_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_attr_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_cow_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_dir(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_attr(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_symlink(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+bool xfs_scrub_parent(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd);
+
+#endif /* XFS_SCRUB_IOCTL_H_ */
diff --git a/scrub/phase1.c b/scrub/phase1.c
new file mode 100644
index 0000000..6c3aab4
--- /dev/null
+++ b/scrub/phase1.c
@@ -0,0 +1,169 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+
+/* Phase 1: Find filesystem geometry (and clean up after) */
+
+/* Clean up the XFS-specific state data. */
+bool
+xfs_cleanup(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->fshandle)
+		free_handle(ctx->fshandle, ctx->fshandle_len);
+	disk_close(&ctx->rtdev);
+	disk_close(&ctx->logdev);
+	disk_close(&ctx->datadev);
+
+	return true;
+}
+
+/* Read the XFS geometry. */
+bool
+xfs_scan_fs(
+	struct scrub_ctx		*ctx)
+{
+	struct fs_path			*fsp;
+	int				error;
+
+	if (!platform_test_xfs_fd(ctx->mnt_fd)) {
+		str_error(ctx, ctx->mntpoint,
+_("Does not appear to be an XFS filesystem!"));
+		return false;
+	}
+
+	/*
+	 * Flush everything out to disk before we start checking.
+	 * This seems to reduce the incidence of stale file handle
+	 * errors when we open things by handle.
+	 */
+	error = syncfs(ctx->mnt_fd);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	ctx->datadev.d_fd = ctx->logdev.d_fd = ctx->rtdev.d_fd = -1;
+
+	/* Retrieve XFS geometry. */
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSGEOMETRY, &ctx->geo);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		goto err;
+	}
+
+	ctx->agblklog = libxfs_log2_roundup(ctx->geo.agblocks);
+	ctx->blocklog = libxfs_highbit32(ctx->geo.blocksize);
+	ctx->inodelog = libxfs_highbit32(ctx->geo.inodesize);
+	ctx->inopblog = ctx->blocklog - ctx->inodelog;
+
+	error = path_to_fshandle(ctx->mntpoint, &ctx->fshandle,
+			&ctx->fshandle_len);
+	if (error) {
+		perror(_("getting fshandle"));
+		goto err;
+	}
+
+	/* Do we have bulkstat? */
+	if (!xfs_can_iterate_inodes(ctx)) {
+		str_info(ctx, ctx->mntpoint, _("BULKSTAT is required."));
+		goto err;
+	}
+
+	/* Do we have getbmapx? */
+	if (!xfs_can_iterate_bmap(ctx)) {
+		str_info(ctx, ctx->mntpoint, _("GETBMAPX is required."));
+		goto err;
+	}
+
+	/* Do we have getfsmap? */
+	if (!xfs_can_iterate_fsmap(ctx)) {
+		str_info(ctx, ctx->mntpoint, _("GETFSMAP is required."));
+		goto err;
+	}
+
+	/* Do we have kernel-assisted metadata scrubbing? */
+	if (!xfs_can_scrub_fs_metadata(ctx) || !xfs_can_scrub_inode(ctx) ||
+	    !xfs_can_scrub_bmap(ctx) || !xfs_can_scrub_dir(ctx) ||
+	    !xfs_can_scrub_attr(ctx) || !xfs_can_scrub_symlink(ctx) ||
+	    !xfs_can_scrub_parent(ctx))
+		goto err;
+
+	/* Go find the XFS devices if we have a usable fsmap. */
+	fs_table_initialise(0, NULL, 0, NULL);
+	errno = 0;
+	fsp = fs_table_lookup(ctx->mntpoint, FS_MOUNT_POINT);
+	if (!fsp) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find XFS information."));
+		goto err;
+	}
+	memcpy(&ctx->fsinfo, fsp, sizeof(struct fs_path));
+
+	/* Did we find the log and rt devices, if they're present? */
+	if (ctx->geo.logstart == 0 && ctx->fsinfo.fs_log == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find log device path."));
+		goto err;
+	}
+	if (ctx->geo.rtblocks && ctx->fsinfo.fs_rt == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find realtime device path."));
+		goto err;
+	}
+
+	/* Open the raw devices. */
+	error = disk_open(ctx->fsinfo.fs_name, &ctx->datadev);
+	if (error) {
+		str_errno(ctx, ctx->fsinfo.fs_name);
+		goto err;
+	}
+	ctx->nr_io_threads = nproc;
+
+	if (ctx->fsinfo.fs_log) {
+		error = disk_open(ctx->fsinfo.fs_log, &ctx->logdev);
+		if (error) {
+			str_errno(ctx, ctx->fsinfo.fs_log);
+			goto err;
+		}
+	}
+	if (ctx->fsinfo.fs_rt) {
+		error = disk_open(ctx->fsinfo.fs_rt, &ctx->rtdev);
+		if (error) {
+			str_errno(ctx, ctx->fsinfo.fs_rt);
+			goto err;
+		}
+	}
+
+	return true;
+err:
+	return false;
+}
diff --git a/scrub/scrub.c b/scrub/scrub.c
index f492301..4b9b4cc 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -35,6 +35,8 @@
 #include "scrub.h"
 #include "common.h"
 #include "input.h"
+#include "ioctl.h"
+#include "xfs.h"
 
 #define _PATH_PROC_MOUNTS	"/proc/mounts"
 
@@ -385,12 +387,15 @@ _("Must be root to run scrub."));
 		str_errno(ctx, ctx->mntpoint);
 		return false;
 	}
+	moveon = xfs_scan_fs(ctx);
+	if (!moveon)
+		goto out;
 	if (verbose) {
 		fprintf(stdout, _("%s: using %d threads to scrub.\n"),
 				ctx->mntpoint, scrub_nproc(ctx));
 		fflush(stdout);
 	}
-
+out:
 	return moveon;
 }
 
@@ -485,6 +490,7 @@ main(
 	struct phase_rusage	all_pi;
 	unsigned long long	total_errors;
 	bool			moveon = true;
+	bool			ismnt;
 	static bool		injected;
 	int			ret;
 
@@ -610,6 +616,14 @@ _("Only one of the options -n or -y may be specified.\n"));
 	if (!moveon)
 		goto out;
 
+	ismnt = find_mountpoint(mtab, &ctx);
+	if (!ismnt) {
+		fprintf(stderr, _("%s: Not a mount point or block device.\n"),
+			ctx.mntpoint);
+		ret = 8;
+		goto end;
+	}
+
 	/* How many CPUs? */
 	nproc = sysconf(_SC_NPROCESSORS_ONLN);
 	if (nproc < 0)
@@ -643,6 +657,11 @@ _("Only one of the options -n or -y may be specified.\n"));
 	if (!moveon)
 		ret |= 4;
 
+	/* Clean up scan data. */
+	moveon = xfs_cleanup(&ctx);
+	if (!moveon)
+		ret |= 8;
+
 	total_errors = ctx.errors_found + ctx.runtime_errors;
 	if (total_errors && ctx.warnings_found)
 		fprintf(stderr,
@@ -656,8 +675,11 @@ _("%s: %llu errors found.  Unmount and run xfs_repair.\n"),
 		fprintf(stderr,
 _("%s: %llu warnings found.\n"),
 			ctx.mntpoint, ctx.warnings_found);
-	if (ctx.errors_found)
+	if (ctx.errors_found) {
+		if (ctx.error_action == ERRORS_SHUTDOWN)
+			xfs_shutdown_fs(&ctx);
 		ret |= 1;
+	}
 	if (ctx.warnings_found)
 		ret |= 2;
 	if (ctx.runtime_errors)
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 3a776e1..87f59d6 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -56,6 +56,8 @@ struct scrub_ctx {
 
 	/* Open block devices */
 	struct disk		datadev;
+	struct disk		logdev;
+	struct disk		rtdev;
 
 	/* What does the user want us to do? */
 	enum scrub_mode		mode;
@@ -69,12 +71,23 @@ struct scrub_ctx {
 	/* Number of threads for metadata scrubbing */
 	unsigned int		nr_io_threads;
 
+	/* XFS specific geometry */
+	struct xfs_fsop_geom	geo;
+	struct fs_path		fsinfo;
+	unsigned int		agblklog;
+	unsigned int		blocklog;
+	unsigned int		inodelog;
+	unsigned int		inopblog;
+	void			*fshandle;
+	size_t			fshandle_len;
+
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
 	unsigned long long	max_errors;
 	unsigned long long	runtime_errors;
 	unsigned long long	errors_found;
 	unsigned long long	warnings_found;
+	bool			preen_triggers[XFS_SCRUB_TYPE_NR];
 };
 
 #endif /* XFS_SCRUB_SCRUB_H_ */
diff --git a/scrub/xfs.c b/scrub/xfs.c
new file mode 100644
index 0000000..e9ad15c
--- /dev/null
+++ b/scrub/xfs.c
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+
+/* Shut down the filesystem. */
+void
+xfs_shutdown_fs(
+	struct scrub_ctx		*ctx)
+{
+	int				flag;
+
+	flag = XFS_FSOP_GOING_FLAGS_LOGFLUSH;
+	str_info(ctx, ctx->mntpoint, _("Shutting down filesystem!"));
+	if (ioctl(ctx->mnt_fd, XFS_IOC_GOINGDOWN, &flag))
+		str_errno(ctx, ctx->mntpoint);
+}
diff --git a/scrub/xfs.h b/scrub/xfs.h
new file mode 100644
index 0000000..24709f3
--- /dev/null
+++ b/scrub/xfs.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_XFS_H_
+#define XFS_SCRUB_XFS_H_
+
+void xfs_shutdown_fs(struct scrub_ctx *ctx);
+
+/* Phase-specific functions. */
+bool xfs_cleanup(struct scrub_ctx *ctx);
+bool xfs_scan_fs(struct scrub_ctx *ctx);
+
+#endif /* XFS_SCRUB_XFS_H_ */

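A note on the geometry arithmetic in xfs_scan_fs() above: blocklog and
inodelog are the base-2 logs of the block and inode sizes, so inopblog
is the log of the number of inodes per block.  Below is a minimal
standalone sketch of that arithmetic.  sketch_highbit32() is a
hypothetical stand-in for libxfs_highbit32() (assumed here to return the
index of the highest set bit), and struct sketch_geo_logs is invented
for illustration; neither is part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for libxfs_highbit32(): index of highest set bit. */
static int
sketch_highbit32(uint32_t v)
{
	int	bit = -1;

	while (v) {
		v >>= 1;
		bit++;
	}
	return bit;
}

/* Hypothetical container for the derived values; not the real scrub_ctx. */
struct sketch_geo_logs {
	unsigned int	blocklog;	/* log2(block size) */
	unsigned int	inodelog;	/* log2(inode size) */
	unsigned int	inopblog;	/* log2(inodes per block) */
};

static struct sketch_geo_logs
sketch_compute_logs(uint32_t blocksize, uint32_t inodesize)
{
	struct sketch_geo_logs	g;

	/* Assumption: both sizes are nonzero powers of two, inode <= block. */
	assert(blocksize && !(blocksize & (blocksize - 1)));
	assert(inodesize && !(inodesize & (inodesize - 1)));
	assert(inodesize <= blocksize);

	g.blocklog = sketch_highbit32(blocksize);
	g.inodelog = sketch_highbit32(inodesize);
	g.inopblog = g.blocklog - g.inodelog;
	return g;
}
```

With 4096-byte blocks and 512-byte inodes this gives blocklog 12,
inodelog 9 and inopblog 3, i.e. 1 << 3 = 8 inodes per block.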

* [PATCH 07/22] xfs_scrub: scan filesystem and AG metadata
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 06/22] xfs_scrub: find XFS filesystem geometry Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 08/22] xfs_scrub: scan inodes Darrick J. Wong
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

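Add phase 2, which asks the kernel to scrub each AG's headers and btrees
as well as the whole-filesystem metadata (realtime bitmap and summary,
quotas).  Also add background_sleep() to throttle scrub calls when -b is
given more than once.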

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
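For reviewers: xfs_check_metadata() in this patch turns the kernel's
output flags plus the user's chosen mode into a check_outcome.  The
following is a rough standalone sketch of that decision, not the patch
code itself.  The SK_* names and flag bit values are hypothetical (the
real XFS_SCRUB_OFLAG_* values come from the kernel headers), and the
mode ordering dry-run < preen < repair is an assumption taken from the
"mode < SCRUB_MODE_*" comparisons in the diff.

```c
#include <stdint.h>

/* Hypothetical flag bits; NOT the real XFS_SCRUB_OFLAG_* values. */
#define SK_OFLAG_CORRUPT	(1u << 0)
#define SK_OFLAG_PREEN		(1u << 1)
#define SK_OFLAG_XCORRUPT	(1u << 2)

/* Assumed ordering, so that "mode < X" works as in the patch. */
enum sk_mode { SK_MODE_DRY_RUN, SK_MODE_PREEN, SK_MODE_REPAIR };

/* Reduced outcome set; abort and retry are left out of this sketch. */
enum sk_outcome { SK_DONE, SK_REPAIR };

static enum sk_outcome
sk_classify(uint32_t flags, enum sk_mode mode)
{
	/* Corruption or a cross-reference disagreement needs a real repair. */
	if (flags & (SK_OFLAG_CORRUPT | SK_OFLAG_XCORRUPT))
		return mode < SK_MODE_REPAIR ? SK_DONE : SK_REPAIR;

	/* Valid but suboptimal metadata only needs preening. */
	if (flags & SK_OFLAG_PREEN)
		return mode < SK_MODE_PREEN ? SK_DONE : SK_REPAIR;

	/* Nothing flagged. */
	return SK_DONE;
}
```

In the sketch, corrupt metadata in preen mode still yields SK_DONE: the
problem is reported but no repair is scheduled, which matches how I read
the "Repairs are required." branch.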
 scrub/Makefile |    3 
 scrub/common.c |   18 ++
 scrub/common.h |    1 
 scrub/ioctl.c  |  462 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/phase2.c |   99 ++++++++++++
 scrub/scrub.c  |    1 
 scrub/xfs.h    |    1 
 7 files changed, 584 insertions(+), 1 deletion(-)
 create mode 100644 scrub/phase2.c

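The -b throttle in background_sleep() is documented as 100ms per -b
beyond the first.  As a cross-check on the unit conversion, here is a
small sketch that computes the same delay directly in nanoseconds.
sk_background_delay() and the SK_NSEC_* constants are hypothetical names
for illustration, and the sketch returns the timespec instead of calling
nanosleep() so that it can be inspected.

```c
#include <time.h>

#define SK_NSEC_PER_MSEC	1000000ULL
#define SK_NSEC_PER_SEC		1000000000ULL

/* Hypothetical helper: 100ms of delay for every -b past the first one. */
static struct timespec
sk_background_delay(int bg_mode)
{
	struct timespec		tv = { 0 };
	unsigned long long	ns;

	/* Zero or one -b means no throttling at all. */
	if (bg_mode < 2)
		return tv;

	ns = 100ULL * SK_NSEC_PER_MSEC * (unsigned long long)(bg_mode - 1);
	tv.tv_sec = ns / SK_NSEC_PER_SEC;
	tv.tv_nsec = ns % SK_NSEC_PER_SEC;
	return tv;
}
```

For bg_mode == 2 this yields 0s + 100000000ns, and for bg_mode == 12 it
yields 1s + 100000000ns (eleven steps of 100ms).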

diff --git a/scrub/Makefile b/scrub/Makefile
index a797bfb..5ac4962 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -16,6 +16,7 @@ INSTALL_SCRUB = install-scrub
 endif	# scrub_prereqs
 
 HFILES = \
+../repair/threads.h \
 common.h \
 disk.h \
 ioctl.h \
@@ -23,10 +24,12 @@ scrub.h \
 xfs.h
 
 CFILES = \
+../repair/threads.c \
 common.c \
 disk.c \
 ioctl.c \
 phase1.c \
+phase2.c \
 scrub.c \
 xfs.c
 
diff --git a/scrub/common.c b/scrub/common.c
index 874f8ab..167d373 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -329,3 +329,21 @@ find_mountpoint(
 	platform_mntent_close(&cursor);
 	return found;
 }
+
+/*
+ * Sleep for 100ms * however many -b we got past the initial one.
+ */
+void
+background_sleep(void)
+{
+	unsigned long long	time;
+	struct timespec		tv;
+
+	if (bg_mode < 2)
+		return;
+
+	time = 100000 * (bg_mode - 1);	/* microseconds */
+	tv.tv_sec = time / 1000000;
+	tv.tv_nsec = (time % 1000000) * 1000;	/* usec -> nsec */
+	nanosleep(&tv, NULL);
+}
diff --git a/scrub/common.h b/scrub/common.h
index a8b1ff8..7bbd061 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -70,5 +70,6 @@ static inline int syncfs(int fd)
 #endif
 
 bool find_mountpoint(char *mtab, struct scrub_ctx *ctx);
+void background_sleep(void);
 
 #endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/ioctl.c b/scrub/ioctl.c
index 6578672..2fb039c 100644
--- a/scrub/ioctl.c
+++ b/scrub/ioctl.c
@@ -91,6 +91,464 @@ xfs_can_iterate_fsmap(
 	return error == 0 && (head.fmh_oflags & FMH_OF_DEV_T);
 }
 
+/* Online scrub. */
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+	ST_NONE,	/* disabled */
+	ST_AGHEADER,	/* per-AG header */
+	ST_PERAG,	/* per-AG metadata */
+	ST_FS,		/* per-FS metadata */
+	ST_INODE,	/* per-inode metadata */
+};
+struct scrub_descr {
+	const char	*name;
+	enum scrub_type	type;
+};
+
+/* These must correspond to XFS_SCRUB_TYPE_ */
+static const struct scrub_descr scrubbers[] = {
+	[XFS_SCRUB_TYPE_TEST] =
+		{"metadata",				ST_NONE},
+	[XFS_SCRUB_TYPE_SB] =
+		{"superblock",				ST_AGHEADER},
+	[XFS_SCRUB_TYPE_AGF] =
+		{"free space header",			ST_AGHEADER},
+	[XFS_SCRUB_TYPE_AGFL] =
+		{"free list",				ST_AGHEADER},
+	[XFS_SCRUB_TYPE_AGI] =
+		{"inode header",			ST_AGHEADER},
+	[XFS_SCRUB_TYPE_BNOBT] =
+		{"freesp by block btree",		ST_PERAG},
+	[XFS_SCRUB_TYPE_CNTBT] =
+		{"freesp by length btree",		ST_PERAG},
+	[XFS_SCRUB_TYPE_INOBT] =
+		{"inode btree",				ST_PERAG},
+	[XFS_SCRUB_TYPE_FINOBT] =
+		{"free inode btree",			ST_PERAG},
+	[XFS_SCRUB_TYPE_RMAPBT] =
+		{"reverse mapping btree",		ST_PERAG},
+	[XFS_SCRUB_TYPE_REFCNTBT] =
+		{"reference count btree",		ST_PERAG},
+	[XFS_SCRUB_TYPE_INODE] =
+		{"inode record",			ST_INODE},
+	[XFS_SCRUB_TYPE_BMBTD] =
+		{"data block map",			ST_INODE},
+	[XFS_SCRUB_TYPE_BMBTA] =
+		{"attr block map",			ST_INODE},
+	[XFS_SCRUB_TYPE_BMBTC] =
+		{"CoW block map",			ST_INODE},
+	[XFS_SCRUB_TYPE_DIR] =
+		{"directory entries",			ST_INODE},
+	[XFS_SCRUB_TYPE_XATTR] =
+		{"extended attributes",			ST_INODE},
+	[XFS_SCRUB_TYPE_SYMLINK] =
+		{"symbolic link",			ST_INODE},
+	[XFS_SCRUB_TYPE_PARENT] =
+		{"parent pointer",			ST_INODE},
+	[XFS_SCRUB_TYPE_RTBITMAP] =
+		{"realtime bitmap",			ST_FS},
+	[XFS_SCRUB_TYPE_RTSUM] =
+		{"realtime summary",			ST_FS},
+	[XFS_SCRUB_TYPE_UQUOTA] =
+		{"user quotas",				ST_FS},
+	[XFS_SCRUB_TYPE_GQUOTA] =
+		{"group quotas",			ST_FS},
+	[XFS_SCRUB_TYPE_PQUOTA] =
+		{"project quotas",			ST_FS},
+};
+
+/* Format a scrub description. */
+static void
+format_scrub_descr(
+	char				*buf,
+	size_t				buflen,
+	struct xfs_scrub_metadata	*meta,
+	const struct scrub_descr	*sc)
+{
+	switch (sc->type) {
+	case ST_AGHEADER:
+	case ST_PERAG:
+		snprintf(buf, buflen, _("AG %u %s"), meta->sm_agno,
+				_(sc->name));
+		break;
+	case ST_INODE:
+		snprintf(buf, buflen, _("Inode %llu %s"), meta->sm_ino,
+				_(sc->name));
+		break;
+	case ST_FS:
+		snprintf(buf, buflen, _("%s"), _(sc->name));
+		break;
+	case ST_NONE:
+		assert(0);
+		break;
+	}
+}
+
+/* Predicates for scrub flag state. */
+
+static inline bool is_corrupt(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT;
+}
+
+static inline bool is_unoptimized(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_PREEN;
+}
+
+static inline bool xref_failed(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_XFAIL;
+}
+
+static inline bool xref_disagrees(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT;
+}
+
+static inline bool is_incomplete(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE;
+}
+
+static inline bool is_suspicious(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & XFS_SCRUB_OFLAG_WARNING;
+}
+
+/* Should we fix it? */
+static inline bool needs_repair(struct xfs_scrub_metadata *sm)
+{
+	return is_corrupt(sm) || xref_disagrees(sm);
+}
+
+/* Warn about strange circumstances after scrub. */
+static inline void
+xfs_scrub_warn_incomplete_scrub(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct xfs_scrub_metadata	*meta)
+{
+	if (is_incomplete(meta))
+		str_info(ctx, descr, _("Check incomplete."));
+
+	if (is_suspicious(meta)) {
+		if (debug)
+			str_info(ctx, descr, _("Possibly suspect metadata."));
+		else
+			str_warn(ctx, descr, _("Possibly suspect metadata."));
+	}
+
+	if (xref_failed(meta))
+		str_info(ctx, descr, _("Cross-referencing failed."));
+}
+
+/* Do a read-only check of some metadata. */
+static enum check_outcome
+xfs_check_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct xfs_scrub_metadata	*meta,
+	bool				is_inode)
+{
+	char				buf[DESCR_BUFSZ];
+	unsigned int			tries = 0;
+	int				code;
+	int				error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	assert(meta->sm_type < XFS_SCRUB_TYPE_NR);
+	format_scrub_descr(buf, DESCR_BUFSZ, meta, &scrubbers[meta->sm_type]);
+
+	dbg_printf("check %s flags %xh\n", buf, meta->sm_flags);
+retry:
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, meta);
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
+		meta->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	if (error) {
+		code = errno;
+		switch (code) {
+		case ENOENT:
+			/* Metadata not present, just skip it. */
+			return CHECK_DONE;
+		case ESHUTDOWN:
+			/* FS already crashed, give up. */
+			str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+			return CHECK_ABORT;
+		case ENOMEM:
+			/* Ran out of memory, just give up. */
+			str_errno(ctx, buf);
+			return CHECK_ABORT;
+		case EDEADLOCK:
+		case EBUSY:
+		case EFSBADCRC:
+		case EFSCORRUPTED:
+			/*
+			 * The first two should never escape the kernel,
+			 * and the other two should be reported via sm_flags.
+			 */
+			str_error(ctx, buf,
+_("Kernel bug!  errno=%d"), code);
+			/* fall through */
+		default:
+			/* Operational error. */
+			str_errno(ctx, buf);
+			return CHECK_DONE;
+		}
+	}
+
+	/*
+	 * If the kernel says the test was incomplete or that there was
+	 * a cross-referencing discrepancy but no obvious corruption,
+	 * we'll try the scan again, just in case the fs was busy.
+	 * Only retry so many times.
+	 */
+	if (tries < 10 && (is_incomplete(meta) ||
+			   (xref_disagrees(meta) && !is_corrupt(meta)))) {
+		tries++;
+		goto retry;
+	}
+
+	/* Complain about incomplete or suspicious metadata. */
+	xfs_scrub_warn_incomplete_scrub(ctx, buf, meta);
+
+	/*
+	 * If we need repairs or there were discrepancies, schedule a
+	 * repair if desired, otherwise complain.
+	 */
+	if (is_corrupt(meta) || xref_disagrees(meta)) {
+		if (ctx->mode < SCRUB_MODE_REPAIR) {
+			str_error(ctx, buf,
+_("Repairs are required."));
+			return CHECK_DONE;
+		}
+
+		return CHECK_REPAIR;
+	}
+
+	/*
+	 * If we could optimize, schedule a repair if desired,
+	 * otherwise complain.
+	 */
+	if (is_unoptimized(meta)) {
+		if (ctx->mode < SCRUB_MODE_PREEN) {
+			if (!is_inode) {
+				/* AG or FS metadata, always warn. */
+				str_info(ctx, buf,
+_("Optimization is possible."));
+			} else if (!ctx->preen_triggers[meta->sm_type]) {
+				/* File metadata, only warn once per type. */
+				pthread_mutex_lock(&ctx->lock);
+				if (!ctx->preen_triggers[meta->sm_type])
+					ctx->preen_triggers[meta->sm_type] = true;
+				pthread_mutex_unlock(&ctx->lock);
+			}
+			return CHECK_DONE;
+		}
+
+		return CHECK_REPAIR;
+	}
+
+	/* Everything is ok. */
+	return CHECK_DONE;
+}
+
+/* Bulk-notify user about things that could be optimized. */
+void
+xfs_scrub_report_preen_triggers(
+	struct scrub_ctx		*ctx)
+{
+	int				i;
+
+	for (i = 0; i < XFS_SCRUB_TYPE_NR; i++) {
+		pthread_mutex_lock(&ctx->lock);
+		if (ctx->preen_triggers[i]) {
+			ctx->preen_triggers[i] = false;
+			pthread_mutex_unlock(&ctx->lock);
+			str_info(ctx, ctx->mntpoint,
+_("Optimizations of %s are possible."), _(scrubbers[i].name));
+		} else {
+			pthread_mutex_unlock(&ctx->lock);
+		}
+	}
+}
+
+/* Scrub metadata, saving corruption reports for later. */
+static bool
+xfs_scrub_metadata(
+	struct scrub_ctx		*ctx,
+	enum scrub_type			scrub_type,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	const struct scrub_descr	*sc;
+	enum check_outcome		fix;
+	int				type;
+
+	sc = scrubbers;
+	for (type = 0; type < XFS_SCRUB_TYPE_NR; type++, sc++) {
+		if (sc->type != scrub_type)
+			continue;
+
+		meta.sm_type = type;
+		meta.sm_flags = 0;
+		meta.sm_agno = agno;
+		background_sleep();
+
+		/* Check the item. */
+		fix = xfs_check_metadata(ctx, ctx->mnt_fd, &meta, false);
+		switch (fix) {
+		case CHECK_ABORT:
+			return false;
+		case CHECK_REPAIR:
+		case CHECK_DONE:
+			continue;
+		case CHECK_RETRY:
+			abort();
+			break;
+		}
+	}
+
+	return true;
+}
+
+/* Scrub each AG's header blocks. */
+bool
+xfs_scrub_ag_headers(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno)
+{
+	return xfs_scrub_metadata(ctx, ST_AGHEADER, agno);
+}
+
+/* Scrub each AG's metadata btrees. */
+bool
+xfs_scrub_ag_metadata(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno)
+{
+	return xfs_scrub_metadata(ctx, ST_PERAG, agno);
+}
+
+/* Scrub whole-FS metadata (realtime and quota files). */
+bool
+xfs_scrub_fs_metadata(
+	struct scrub_ctx		*ctx)
+{
+	return xfs_scrub_metadata(ctx, ST_FS, 0);
+}
+
+/* Scrub inode metadata. */
+static bool
+__xfs_scrub_file(
+	struct scrub_ctx		*ctx,
+	uint64_t			ino,
+	uint32_t			gen,
+	int				fd,
+	unsigned int			type)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	enum check_outcome		fix;
+
+	assert(type < XFS_SCRUB_TYPE_NR);
+	assert(scrubbers[type].type == ST_INODE);
+
+	meta.sm_type = type;
+	meta.sm_ino = ino;
+	meta.sm_gen = gen;
+
+	/* Scrub the piece of metadata. */
+	fix = xfs_check_metadata(ctx, fd, &meta, true);
+	if (fix == CHECK_ABORT)
+		return false;
+	if (fix == CHECK_DONE)
+		return true;
+
+	return true;
+}
+
+bool
+xfs_scrub_inode_fields(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_INODE);
+}
+
+bool
+xfs_scrub_data_fork(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTD);
+}
+
+bool
+xfs_scrub_attr_fork(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTA);
+}
+
+bool
+xfs_scrub_cow_fork(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTC);
+}
+
+bool
+xfs_scrub_dir(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_DIR);
+}
+
+bool
+xfs_scrub_attr(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_XATTR);
+}
+
+bool
+xfs_scrub_symlink(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_SYMLINK);
+}
+
+bool
+xfs_scrub_parent(
+	struct scrub_ctx	*ctx,
+	uint64_t		ino,
+	uint32_t		gen,
+	int			fd)
+{
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_PARENT);
+}
+
 /* Test the availability of a kernel scrub command. */
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR	30
 static bool
@@ -133,7 +591,9 @@ _("Filesystem is mounted norecovery; cannot proceed."));
 	case EOPNOTSUPP:
 	case ENOTTY:
 		str_info(ctx, ctx->mntpoint,
-_("Kernel metadata scrub is required."));
+_("Kernel %s %s facility is required."),
+				_(scrubbers[type].name),
+				repair ? _("repair") : _("scrub"));
 		return false;
 	case ENOENT:
 		/* Scrubber says not present on this fs; that's fine. */
diff --git a/scrub/phase2.c b/scrub/phase2.c
new file mode 100644
index 0000000..b8b44ac
--- /dev/null
+++ b/scrub/phase2.c
@@ -0,0 +1,99 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+
+/* Phase 2: Check internal metadata. */
+
+/* Scrub each AG's metadata btrees. */
+static void
+xfs_scan_ag_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	bool				*pmoveon = arg;
+	bool				moveon;
+	char				descr[DESCR_BUFSZ];
+
+	snprintf(descr, DESCR_BUFSZ, _("AG %u"), agno);
+
+	/*
+	 * First we scrub and fix the AG headers, because we need
+	 * them to work well enough to check the AG btrees.
+	 */
+	moveon = xfs_scrub_ag_headers(ctx, agno);
+	if (!moveon)
+		goto err;
+
+	/* Now scrub the AG btrees. */
+	moveon = xfs_scrub_ag_metadata(ctx, agno);
+	if (!moveon)
+		goto err;
+
+	return;
+err:
+	*pmoveon = false;
+}
+
+/* Scrub whole-FS metadata btrees. */
+static void
+xfs_scan_fs_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	bool				*pmoveon = arg;
+	bool				moveon;
+
+	moveon = xfs_scrub_fs_metadata(ctx);
+	if (!moveon)
+		*pmoveon = false;
+}
+
+/* Scan all filesystem metadata. */
+bool
+xfs_scan_metadata(
+	struct scrub_ctx	*ctx)
+{
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	bool			moveon = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	queue_work(&wq, xfs_scan_fs_metadata, 0, &moveon);
+	for (agno = 0; agno < ctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_metadata, agno, &moveon);
+	destroy_work_queue(&wq);
+
+	return moveon;
+}
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 4b9b4cc..c068835 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -413,6 +413,7 @@ run_scrub_phases(
 		},
 		{
 			.descr = _("Check internal metadata."),
+			.fn = xfs_scan_metadata,
 		},
 		{
 			.descr = _("Scan all inodes."),
diff --git a/scrub/xfs.h b/scrub/xfs.h
index 24709f3..d3c5782 100644
--- a/scrub/xfs.h
+++ b/scrub/xfs.h
@@ -25,5 +25,6 @@ void xfs_shutdown_fs(struct scrub_ctx *ctx);
 /* Phase-specific functions. */
 bool xfs_cleanup(struct scrub_ctx *ctx);
 bool xfs_scan_fs(struct scrub_ctx *ctx);
+bool xfs_scan_metadata(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_XFS_H_ */



* [PATCH 08/22] xfs_scrub: scan inodes
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 07/22] xfs_scrub: scan filesystem and AG metadata Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 09/22] xfs_scrub: check directory connectivity Darrick J. Wong
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Scan all the inodes in the system for problems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile |    1 
 scrub/ioctl.c  |  366 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/ioctl.h  |    1 
 scrub/phase3.c |  140 +++++++++++++++++++++
 scrub/scrub.c  |    1 
 scrub/xfs.c    |   88 +++++++++++++
 scrub/xfs.h    |    4 +
 7 files changed, 601 insertions(+)
 create mode 100644 scrub/phase3.c
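
Illustration only, not part of the patch: the chunk reconciliation that
xfs_iterate_inodes() describes can be modelled without any ioctls.  The
sketch below assumes a hypothetical reconcile_chunk() helper, with plain
arrays standing in for the INUMBERS allocation mask and for the inode
numbers that BULKSTAT handed back.  It only shows how an inode that the
mask says is allocated, but that the bulk query skipped, still ends up
in the list of inodes to visit; the real code fills in a struct
xfs_bstat instead of a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INODES_PER_CHUNK	64	/* stand-in for XFS_INODES_PER_CHUNK */

/*
 * Hypothetical helper: walk the allocation mask of one inode chunk and
 * compare it against the inode numbers that a bulk query returned, in
 * ascending order.  Each allocated inode is written to out[]; if the
 * bulk query skipped it, was_missing[] is set so that the caller knows
 * it must build placeholder stat data.  Returns the number of entries.
 */
static size_t
reconcile_chunk(
	uint64_t	startino,
	uint64_t	allocmask,
	const uint64_t	*got,
	size_t		nr_got,
	uint64_t	*out,
	int		*was_missing)
{
	size_t		g = 0;
	size_t		o = 0;
	int		i;

	assert(nr_got <= INODES_PER_CHUNK);

	for (i = 0; i < INODES_PER_CHUNK; i++) {
		uint64_t	ino = startino + i;

		/* Not allocated according to the mask; nothing to visit. */
		if (!(allocmask & (1ULL << i)))
			continue;

		if (g < nr_got && got[g] == ino) {
			/* The bulk query returned exactly what we expected. */
			was_missing[o] = 0;
			g++;
		} else {
			/* Allocated but skipped: probably failed a verifier. */
			was_missing[o] = 1;
		}
		out[o++] = ino;
	}
	return o;
}
```

With a mask of 0x7 starting at inode 128 and a bulk result of {128, 130},
the helper returns three entries and flags only inode 129 as missing.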


diff --git a/scrub/Makefile b/scrub/Makefile
index 5ac4962..e583cb9 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -30,6 +30,7 @@ disk.c \
 ioctl.c \
 phase1.c \
 phase2.c \
+phase3.c \
 scrub.c \
 xfs.c
 
diff --git a/scrub/ioctl.c b/scrub/ioctl.c
index 2fb039c..a3b7c04 100644
--- a/scrub/ioctl.c
+++ b/scrub/ioctl.c
@@ -29,6 +29,186 @@
 #include "common.h"
 #include "ioctl.h"
 
+#define FSMAP_NR		65536
+#define BMAP_NR			2048
+
+/* Call the handler function. */
+static int
+xfs_iterate_inode_func(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	struct xfs_bstat	*bs,
+	struct xfs_handle	*handle,
+	void			*arg)
+{
+	int			error;
+
+	handle->ha_fid.fid_ino = bs->bs_ino;
+	handle->ha_fid.fid_gen = bs->bs_gen;
+	error = fn(ctx, handle, bs, arg);
+	if (error)
+		return error;
+	if (xfs_scrub_excessive_errors(ctx))
+		return XFS_ITERATE_INODES_ABORT;
+	return 0;
+}
+
+/*
+ * Iterate a range of inodes.
+ *
+ * This is a little more involved than repeatedly asking BULKSTAT for a
+ * buffer's worth of stat data for some number of inodes.  We want to
+ * scan as many of the inodes that the inobt thinks there are, including
+ * the ones that are broken, but if we ask for n inodes starting at x,
+ * it'll skip the bad ones and fill from beyond the range (x + n).
+ *
+ * Therefore, we ask INUMBERS to return one inobt chunk's worth of inode
+ * bitmap information.  Then we try to BULKSTAT only the inodes that
+ * were present in that chunk, and compare what we got against what
+ * INUMBERS said was there.  If there's a mismatch, we know that we have
+ * an inode that fails the verifiers, so we inject placeholder bulkstat
+ * information to force the scrub code to deal with the broken inodes.
+ *
+ * If the iteration function returns ESTALE, that means that the inode
+ * has been deleted and possibly recreated since the BULKSTAT call.  We
+ * will refresh the stat information and try again up to 30 times before
+ * reporting the staleness as an error.
+ */
+bool
+xfs_iterate_inodes(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	void			*fshandle,
+	uint64_t		first_ino,
+	uint64_t		last_ino,
+	xfs_inode_iter_fn	fn,
+	void			*arg)
+{
+	struct xfs_fsop_bulkreq	igrpreq = {0};
+	struct xfs_fsop_bulkreq	bulkreq = {0};
+	struct xfs_fsop_bulkreq	onereq = {0};
+	struct xfs_handle	handle;
+	struct xfs_inogrp	inogrp;
+	struct xfs_bstat	bstat[XFS_INODES_PER_CHUNK] = {0};
+	char			idescr[DESCR_BUFSZ];
+	char			buf[DESCR_BUFSZ];
+	struct xfs_bstat	*bs;
+	__u64			last_stale = first_ino - 1;
+	__u64			igrp_ino;
+	__u64			oneino;
+	__u64			ino;
+	__s32			bulklen = 0;
+	__s32			onelen = 0;
+	__s32			igrplen = 0;
+	bool			moveon = true;
+	int			i;
+	int			error;
+	int			stale_count = 0;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"));
+
+	onereq.lastip  = &oneino;
+	onereq.icount  = 1;
+	onereq.ocount  = &onelen;
+
+	bulkreq.lastip  = &ino;
+	bulkreq.icount  = XFS_INODES_PER_CHUNK;
+	bulkreq.ubuffer = &bstat;
+	bulkreq.ocount  = &bulklen;
+
+	igrpreq.lastip  = &igrp_ino;
+	igrpreq.icount  = 1;
+	igrpreq.ubuffer = &inogrp;
+	igrpreq.ocount  = &igrplen;
+
+	memcpy(&handle.ha_fsid, fshandle, sizeof(handle.ha_fsid));
+	handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
+			sizeof(handle.ha_fid.fid_len);
+	handle.ha_fid.fid_pad = 0;
+
+	/* Find the inode chunk & alloc mask */
+	igrp_ino = first_ino;
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSINUMBERS, &igrpreq);
+	while (!error && igrplen) {
+		/* Load the inodes. */
+		ino = inogrp.xi_startino - 1;
+		bulkreq.icount = inogrp.xi_alloccount;
+		error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT, &bulkreq);
+		if (error)
+			str_warn(ctx, descr, "%s", strerror_r(errno,
+						buf, DESCR_BUFSZ));
+
+		/* Did we get exactly the inodes we expected? */
+		for (i = 0, bs = bstat; i < XFS_INODES_PER_CHUNK; i++) {
+			if (!(inogrp.xi_allocmask & (1ULL << i)))
+				continue;
+			if (bs->bs_ino == inogrp.xi_startino + i) {
+				bs++;
+				continue;
+			}
+
+			/* Load the one inode. */
+			oneino = inogrp.xi_startino + i;
+			onereq.ubuffer = bs;
+			error = ioctl(ctx->mnt_fd, XFS_IOC_FSBULKSTAT_SINGLE,
+					&onereq);
+			if (error || bs->bs_ino != inogrp.xi_startino + i) {
+				memset(bs, 0, sizeof(struct xfs_bstat));
+				bs->bs_ino = inogrp.xi_startino + i;
+				bs->bs_blksize = ctx->mnt_sv.f_frsize;
+			}
+			bs++;
+		}
+
+		/* Iterate all the inodes. */
+		for (i = 0, bs = bstat; i < inogrp.xi_alloccount; i++, bs++) {
+			if (bs->bs_ino > last_ino)
+				goto out;
+
+			error = xfs_iterate_inode_func(ctx, fn, bs, &handle,
+					arg);
+			switch (error) {
+			case 0:
+				break;
+			case ESTALE:
+				if (last_stale == inogrp.xi_startino)
+					stale_count++;
+				else {
+					last_stale = inogrp.xi_startino;
+					stale_count = 0;
+				}
+				if (stale_count < 30) {
+					igrp_ino = inogrp.xi_startino;
+					goto igrp_retry;
+				}
+				snprintf(idescr, DESCR_BUFSZ, "inode %llu",
+						bs->bs_ino);
+				str_warn(ctx, idescr, "%s", strerror_r(error,
+						buf, DESCR_BUFSZ));
+				break;
+			case XFS_ITERATE_INODES_ABORT:
+				error = 0;
+				/* fall thru */
+			default:
+				moveon = false;
+				errno = error;
+				goto err;
+			}
+		}
+
+igrp_retry:
+		error = ioctl(ctx->mnt_fd, XFS_IOC_FSINUMBERS, &igrpreq);
+	}
+
+err:
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	return moveon;
+}
+
 /* Does the kernel support bulkstat? */
 bool
 xfs_can_iterate_inodes(
@@ -53,6 +233,135 @@ xfs_can_iterate_inodes(
 	return error == -1 && errno == EINVAL;
 }
 
+/*
+ * Open a file by handle.  Returns the fd, or -1 with errno set on failure.
+ */
+int
+xfs_open_handle(
+	struct xfs_handle	*handle)
+{
+	return open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+}
+
+/* Iterate all the extent block mappings between the key and fork end. */
+bool
+xfs_iterate_bmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	int			whichfork,
+	struct xfs_bmap		*key,
+	xfs_bmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsxattr		fsx;
+	struct getbmapx		*map;
+	struct getbmapx		*p;
+	struct xfs_bmap		bmap;
+	char			bmap_descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	xfs_off_t		new_off;
+	int			getxattr_type;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BMAP"));
+
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s attr"), descr);
+		break;
+	case XFS_COW_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s CoW"), descr);
+		break;
+	case XFS_DATA_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s data"), descr);
+		break;
+	default:
+		assert(0);
+	}
+
+	map = calloc(BMAP_NR, sizeof(struct getbmapx));
+	if (!map) {
+		str_errno(ctx, bmap_descr);
+		return false;
+	}
+
+	map->bmv_offset = BTOBB(key->bm_offset);
+	map->bmv_block = BTOBB(key->bm_physical);
+	if (key->bm_length == 0)
+		map->bmv_length = ULLONG_MAX;
+	else
+		map->bmv_length = BTOBB(key->bm_length);
+	map->bmv_count = BMAP_NR;
+	map->bmv_iflags = BMV_IF_NO_DMAPI_READ | BMV_IF_PREALLOC |
+			  BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		getxattr_type = XFS_IOC_FSGETXATTRA;
+		map->bmv_iflags |= BMV_IF_ATTRFORK;
+		break;
+	case XFS_COW_FORK:
+		map->bmv_iflags |= BMV_IF_COWFORK;
+		getxattr_type = FS_IOC_FSGETXATTR;
+		break;
+	case XFS_DATA_FORK:
+		getxattr_type = FS_IOC_FSGETXATTR;
+		break;
+	default:
+		abort();
+	}
+
+	error = ioctl(fd, getxattr_type, &fsx);
+	if (error < 0) {
+		str_errno(ctx, bmap_descr);
+		moveon = false;
+		goto out;
+	}
+
+	while ((error = ioctl(fd, XFS_IOC_GETBMAPX, map)) == 0) {
+		for (i = 0, p = &map[i + 1]; i < map->bmv_entries; i++, p++) {
+			bmap.bm_offset = BBTOB(p->bmv_offset);
+			bmap.bm_physical = BBTOB(p->bmv_block);
+			bmap.bm_length = BBTOB(p->bmv_length);
+			bmap.bm_flags = p->bmv_oflags;
+			moveon = fn(ctx, bmap_descr, fd, whichfork, &fsx,
+					&bmap, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (map->bmv_entries == 0)
+			break;
+		p = map + map->bmv_entries;
+		if (p->bmv_oflags & BMV_OF_LAST)
+			break;
+
+		new_off = p->bmv_offset + p->bmv_length;
+		map->bmv_length -= new_off - map->bmv_offset;
+		map->bmv_offset = new_off;
+	}
+
+	/*
+	 * Pre-reflink filesystems don't know about CoW forks, so don't
+	 * be too surprised if it fails.
+	 */
+	if (whichfork == XFS_COW_FORK && error && errno == EINVAL)
+		error = 0;
+
+	if (error)
+		str_errno(ctx, bmap_descr);
+out:
+	memcpy(key, map, sizeof(struct getbmapx));
+	free(map);
+	return moveon;
+}
+
 /* Does the kernel support getbmapx? */
 bool
 xfs_can_iterate_bmap(
@@ -71,6 +380,63 @@ xfs_can_iterate_bmap(
 	return error == 0;
 }
 
+/* Iterate all the fs block mappings between the two keys. */
+bool
+xfs_iterate_fsmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*keys,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsmap_head	*head;
+	struct fsmap		*p;
+	bool			moveon = true;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_FSMAP"));
+
+	head = malloc(fsmap_sizeof(FSMAP_NR));
+	if (!head) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	memset(head, 0, sizeof(*head));
+	memcpy(head->fmh_keys, keys, sizeof(struct fsmap) * 2);
+	head->fmh_count = FSMAP_NR;
+
+	while ((error = ioctl(ctx->mnt_fd, FS_IOC_GETFSMAP, head)) == 0) {
+		for (i = 0, p = head->fmh_recs;
+		     i < head->fmh_entries;
+		     i++, p++) {
+			moveon = fn(ctx, descr, p, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (head->fmh_entries == 0)
+			break;
+		p = &head->fmh_recs[head->fmh_entries - 1];
+		if (p->fmr_flags & FMR_OF_LAST)
+			break;
+		fsmap_advance(head);
+	}
+
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	free(head);
+	return moveon;
+}
+
 /* Does the kernel support getfsmap? */
 bool
 xfs_can_iterate_fsmap(
diff --git a/scrub/ioctl.h b/scrub/ioctl.h
index c255bbb..ee2ac26 100644
--- a/scrub/ioctl.h
+++ b/scrub/ioctl.h
@@ -45,6 +45,7 @@ bool xfs_iterate_bmap(struct scrub_ctx *ctx, const char *descr, int fd,
 		int whichfork, struct xfs_bmap *key, xfs_bmap_iter_fn fn,
 		void *arg);
 bool xfs_can_iterate_bmap(struct scrub_ctx *ctx);
+int xfs_open_handle(struct xfs_handle *handle);
 
 /* filesystem reverse mapping */
 typedef bool (*xfs_fsmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
diff --git a/scrub/phase3.c b/scrub/phase3.c
new file mode 100644
index 0000000..cdd8a7c
--- /dev/null
+++ b/scrub/phase3.c
@@ -0,0 +1,140 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+#include "xfs.h"
+
+/* Phase 3: Scan all inodes. */
+
+/*
+ * Scrub part of a file.  If the user passes in a valid fd we assume
+ * that's the file to check; otherwise, pass in the inode number and
+ * generation from bstat and let the kernel sort it out.
+ */
+static bool
+xfs_scrub_fd(
+	struct scrub_ctx	*ctx,
+	bool			(*fn)(struct scrub_ctx *, uint64_t,
+				      uint32_t, int),
+	struct xfs_bstat	*bs,
+	int			fd)
+{
+	if (fd < 0)
+		fd = ctx->mnt_fd;
+	return fn(ctx, bs->bs_ino, bs->bs_gen, fd);
+}
+
+/* Verify the contents, xattrs, and extent maps of an inode. */
+static int
+xfs_scrub_inode(
+	struct scrub_ctx	*ctx,
+	struct xfs_handle	*handle,
+	struct xfs_bstat	*bstat,
+	void			*arg)
+{
+	char			descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	xfs_agnumber_t		agno;
+	xfs_agino_t		agino;
+	int			fd = -1;
+	int			error = 0;
+
+	agno = bstat->bs_ino / (1ULL << (ctx->inopblog + ctx->agblklog));
+	agino = bstat->bs_ino % (1ULL << (ctx->inopblog + ctx->agblklog));
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu (%u/%u)"), bstat->bs_ino,
+			agno, agino);
+	background_sleep();
+
+	/* Try to open the inode to pin it. */
+	if (S_ISREG(bstat->bs_mode)) {
+		fd = xfs_open_handle(handle);
+		if (fd < 0) {
+			error = errno;
+			if (error != ESTALE)
+				str_errno(ctx, descr);
+			goto out;
+		}
+	}
+
+	/* Scrub the inode. */
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_inode_fields, bstat, fd);
+	if (!moveon)
+		goto out;
+
+	/* Scrub all block mappings. */
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_data_fork, bstat, fd);
+	if (!moveon)
+		goto out;
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr_fork, bstat, fd);
+	if (!moveon)
+		goto out;
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_cow_fork, bstat, fd);
+	if (!moveon)
+		goto out;
+
+	if (S_ISLNK(bstat->bs_mode)) {
+		/* Check symlink contents. */
+		moveon = xfs_scrub_symlink(ctx, bstat->bs_ino,
+				bstat->bs_gen, ctx->mnt_fd);
+	} else if (S_ISDIR(bstat->bs_mode)) {
+		/* Check the directory entries. */
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_dir, bstat, fd);
+	}
+	if (!moveon)
+		goto out;
+
+	/* Check all the extended attributes. */
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr, bstat, fd);
+	if (!moveon)
+		goto out;
+
+	/* Check parent pointers. */
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_parent, bstat, fd);
+	if (!moveon)
+		goto out;
+
+out:
+	if (fd >= 0)
+		close(fd);
+	if (error)
+		return error;
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Verify all the inodes in a filesystem. */
+bool
+xfs_scan_inodes(
+	struct scrub_ctx	*ctx)
+{
+	if (!xfs_scan_all_inodes(ctx, xfs_scrub_inode))
+		return false;
+	xfs_scrub_report_preen_triggers(ctx);
+	return true;
+}
diff --git a/scrub/scrub.c b/scrub/scrub.c
index c068835..4638281 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -417,6 +417,7 @@ run_scrub_phases(
 		},
 		{
 			.descr = _("Scan all inodes."),
+			.fn = xfs_scan_inodes,
 		},
 		{
 			.descr = _("Defer filesystem repairs."),
diff --git a/scrub/xfs.c b/scrub/xfs.c
index e9ad15c..882bd28 100644
--- a/scrub/xfs.c
+++ b/scrub/xfs.c
@@ -42,3 +42,91 @@ xfs_shutdown_fs(
 	if (ioctl(ctx->mnt_fd, XFS_IOC_GOINGDOWN, &flag))
 		str_errno(ctx, ctx->mntpoint);
 }
+
+/* BULKSTAT wrapper routines. */
+struct xfs_scan_inodes {
+	xfs_inode_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Scan all the inodes in an AG. */
+static void
+xfs_scan_ag_inodes(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct xfs_scan_inodes	*si = arg;
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	uint64_t		ag_ino;
+	uint64_t		next_ag_ino;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u inodes"),
+				major(ctx->fsinfo.fs_datadev),
+				minor(ctx->fsinfo.fs_datadev),
+				agno);
+
+	ag_ino = (__u64)agno << (ctx->inopblog + ctx->agblklog);
+	next_ag_ino = (__u64)(agno + 1) << (ctx->inopblog + ctx->agblklog);
+
+	fn_arg = ((char *)si->arg) + si->array_arg_size * agno;
+	moveon = xfs_iterate_inodes(ctx, descr, ctx->fshandle, ag_ino,
+			next_ag_ino - 1, si->fn, fn_arg);
+	if (!moveon)
+		si->moveon = false;
+}
+
+/* How many array elements should we create to scan all the inodes? */
+static inline size_t
+xfs_scan_all_inodes_array_size(
+	struct scrub_ctx	*ctx)
+{
+	return ctx->geo.agcount;
+}
+
+/* Scan all the inodes in a filesystem. */
+static bool
+xfs_scan_all_inodes_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	struct xfs_scan_inodes	si;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+
+	si.moveon = true;
+	si.fn = fn;
+	si.arg = arg;
+	si.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	for (agno = 0; agno < ctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_inodes, agno, &si);
+	destroy_work_queue(&wq);
+
+	return si.moveon;
+}
+
+bool
+xfs_scan_all_inodes(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn)
+{
+	return xfs_scan_all_inodes_array_arg(ctx, fn, NULL, 0);
+}
+
+bool
+xfs_scan_all_inodes_arg(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	void			*arg)
+{
+	return xfs_scan_all_inodes_array_arg(ctx, fn, arg, 0);
+}
diff --git a/scrub/xfs.h b/scrub/xfs.h
index d3c5782..8c442be 100644
--- a/scrub/xfs.h
+++ b/scrub/xfs.h
@@ -21,10 +21,14 @@
 #define XFS_SCRUB_XFS_H_
 
 void xfs_shutdown_fs(struct scrub_ctx *ctx);
+bool xfs_scan_all_inodes(struct scrub_ctx *ctx, xfs_inode_iter_fn fn);
+bool xfs_scan_all_inodes_arg(struct scrub_ctx *ctx, xfs_inode_iter_fn fn,
+		void *arg);
 
 /* Phase-specific functions. */
 bool xfs_cleanup(struct scrub_ctx *ctx);
 bool xfs_scan_fs(struct scrub_ctx *ctx);
 bool xfs_scan_metadata(struct scrub_ctx *ctx);
+bool xfs_scan_inodes(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_XFS_H_ */
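
Illustration only, not part of the patch: both xfs_scrub_inode() and
xfs_scan_ag_inodes() treat an inode number as an AG number in the high
bits and an AG-relative inode number in the low (inopblog + agblklog)
bits.  The sketch below restates that arithmetic with shifts and masks.
struct geom and the three helper names are hypothetical; they stand in
for the two fields the patch reads from struct scrub_ctx.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the geometry fields in struct scrub_ctx. */
struct geom {
	unsigned int	inopblog;	/* log2 of inodes per block */
	unsigned int	agblklog;	/* log2 of blocks per AG, rounded up */
};

/* Number of low bits that hold the AG-relative inode number. */
static unsigned int
agino_bits(
	const struct geom	*g)
{
	assert(g->inopblog + g->agblklog < 64);
	return g->inopblog + g->agblklog;
}

/* Which AG does this inode live in? */
static uint32_t
ino_to_agno(
	const struct geom	*g,
	uint64_t		ino)
{
	return (uint32_t)(ino >> agino_bits(g));
}

/* Inode number relative to the start of its AG. */
static uint32_t
ino_to_agino(
	const struct geom	*g,
	uint64_t		ino)
{
	return (uint32_t)(ino & ((1ULL << agino_bits(g)) - 1));
}

/* First inode number that can belong to the given AG. */
static uint64_t
ag_first_ino(
	const struct geom	*g,
	uint32_t		agno)
{
	return (uint64_t)agno << agino_bits(g);
}
```

The range handed to xfs_iterate_inodes() for one AG is then
[ag_first_ino(agno), ag_first_ino(agno + 1) - 1], which matches the
ag_ino and next_ag_ino - 1 bounds computed in xfs_scan_ag_inodes().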



* [PATCH 09/22] xfs_scrub: check directory connectivity
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 08/22] xfs_scrub: scan inodes Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 10/22] xfs_scrub: thread-safe stats counter Darrick J. Wong
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Opening directories by file handle will cause the kernel to perform
parent lookups all the way to the root directory.  Take advantage of
this to ensure that directories actually connect to the root.  Some
day we'll have parent pointers and can make this more comprehensive.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile |    1 +
 scrub/phase5.c |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/scrub.c  |    1 +
 scrub/xfs.h    |    1 +
 4 files changed, 97 insertions(+)
 create mode 100644 scrub/phase5.c
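
Illustration only, not part of the patch: phase 5 delegates the parent
walk to the kernel by opening each directory by handle, so there is no
userspace walk to show.  The toy model below merely spells out what
"connects to the root" means.  The parent[] array, the index-based
directory ids, and dir_reaches_root() are all hypothetical and have no
counterpart in xfs_scrub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: parent[i] holds the index of directory i's parent, and the
 * root directory is its own parent.  Follow parent links from dir and
 * report whether the root is reached.  The walk is bounded by nr steps
 * so that a cycle of detached directories cannot loop forever.
 */
static bool
dir_reaches_root(
	const size_t	*parent,
	size_t		nr,
	size_t		root,
	size_t		dir)
{
	size_t		steps;

	assert(root < nr && dir < nr);

	for (steps = 0; steps <= nr; steps++) {
		if (dir == root)
			return true;
		dir = parent[dir];
		if (dir >= nr)
			return false;	/* dangling parent reference */
	}
	return false;			/* cycle that never reaches root */
}
```

In this model a directory whose ancestors form a loop, or whose parent
reference points nowhere, is reported as disconnected.  In the patch the
equivalent signal is xfs_open_handle() failing with an errno other than
ESTALE.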


diff --git a/scrub/Makefile b/scrub/Makefile
index e583cb9..13a0b55 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -31,6 +31,7 @@ ioctl.c \
 phase1.c \
 phase2.c \
 phase3.c \
+phase5.c \
 scrub.c \
 xfs.c
 
diff --git a/scrub/phase5.c b/scrub/phase5.c
new file mode 100644
index 0000000..7ea8b58
--- /dev/null
+++ b/scrub/phase5.c
@@ -0,0 +1,94 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+#include "xfs.h"
+
+/* Phase 5: Check directory connectivity. */
+
+/*
+ * Verify the connectivity of the directory tree.
+ * We know that the kernel's open-by-handle function will try to reconnect
+ * parents of an opened directory, so we'll accept that as sufficient.
+ */
+static int
+xfs_scrub_connections(
+	struct scrub_ctx	*ctx,
+	struct xfs_handle	*handle,
+	struct xfs_bstat	*bstat,
+	void			*arg)
+{
+	char			descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	xfs_agnumber_t		agno;
+	xfs_agino_t		agino;
+	int			fd = -1;
+	int			error = 0;
+
+	agno = bstat->bs_ino / (1ULL << (ctx->inopblog + ctx->agblklog));
+	agino = bstat->bs_ino % (1ULL << (ctx->inopblog + ctx->agblklog));
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu (%u/%u)"), bstat->bs_ino,
+			agno, agino);
+	background_sleep();
+
+	/* Open the dir, let the kernel try to reconnect it to the root. */
+	if (S_ISDIR(bstat->bs_mode)) {
+		fd = xfs_open_handle(handle);
+		if (fd < 0) {
+			error = errno;
+			if (error != ESTALE)
+				str_errno(ctx, descr);
+			goto out;
+		}
+	}
+
+out:
+	if (fd >= 0)
+		close(fd);
+	if (error)
+		return error;
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Check directory connectivity. */
+bool
+xfs_scan_connections(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->errors_found) {
+		str_info(ctx, ctx->mntpoint,
+_("Filesystem has errors, skipping connectivity checks."));
+		return true;
+	}
+	if (!xfs_scan_all_inodes(ctx, xfs_scrub_connections))
+		return false;
+	xfs_scrub_report_preen_triggers(ctx);
+	return true;
+}
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 4638281..c2385da 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -425,6 +425,7 @@ run_scrub_phases(
 		},
 		{
 			.descr = _("Check directory tree."),
+			.fn = xfs_scan_connections,
 		},
 		{
 			.descr = _("Verify data file integrity."),
diff --git a/scrub/xfs.h b/scrub/xfs.h
index 8c442be..2e434c2 100644
--- a/scrub/xfs.h
+++ b/scrub/xfs.h
@@ -30,5 +30,6 @@ bool xfs_cleanup(struct scrub_ctx *ctx);
 bool xfs_scan_fs(struct scrub_ctx *ctx);
 bool xfs_scan_metadata(struct scrub_ctx *ctx);
 bool xfs_scan_inodes(struct scrub_ctx *ctx);
+bool xfs_scan_connections(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_XFS_H_ */



* [PATCH 10/22] xfs_scrub: thread-safe stats counter
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 09/22] xfs_scrub: check directory connectivity Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 11/22] xfs_scrub: create a bitmap data structure Darrick J. Wong
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a threaded stats counter that we'll use to track scan progress.
This includes things like how many of the disk blocks we've scanned,
or later how much progress we've made in each phase.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile  |    2 +
 scrub/counter.c |  122 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/counter.h |   29 +++++++++++++
 3 files changed, 153 insertions(+)
 create mode 100644 scrub/counter.c
 create mode 100644 scrub/counter.h
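
Illustration only, not part of the patch: the idea behind ptcounter is
that every worker increments a slot of its own and only the reader sums
across slots.  The sketch below shows that layout without any threading.
An explicit worker index stands in for the pthread_getspecific() lookup
that ptcounter_add() uses to find a thread's slot, and the slotcounter
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One private slot per worker; the flexible array holds the slots. */
struct slotcounter {
	size_t		nr_slots;
	uint64_t	slots[];
};

/* Allocate a zeroed counter with room for nr workers, or return NULL. */
static struct slotcounter *
slotcounter_init(
	size_t			nr)
{
	struct slotcounter	*sc;

	sc = calloc(1, sizeof(*sc) + nr * sizeof(uint64_t));
	if (sc)
		sc->nr_slots = nr;
	return sc;
}

/* Add to one worker's slot.  No other worker touches this slot. */
static void
slotcounter_add(
	struct slotcounter	*sc,
	size_t			worker,
	int64_t			nr)
{
	assert(worker < sc->nr_slots);
	sc->slots[worker] += nr;
}

/* Sum every slot.  This is the slow path and is meant to be rare. */
static uint64_t
slotcounter_value(
	const struct slotcounter	*sc)
{
	uint64_t			sum = 0;
	size_t				i;

	for (i = 0; i < sc->nr_slots; i++)
		sum += sc->slots[i];
	return sum;
}
```

Because each slot has a single writer, the add path needs no lock.  In
the threaded version the sum can miss updates that land while it is
being computed, which is why ptcounter_value() is documented as
approximate.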


diff --git a/scrub/Makefile b/scrub/Makefile
index 13a0b55..deb352b 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -18,6 +18,7 @@ endif	# scrub_prereqs
 HFILES = \
 ../repair/threads.h \
 common.h \
+counter.h \
 disk.h \
 ioctl.h \
 scrub.h \
@@ -26,6 +27,7 @@ xfs.h
 CFILES = \
 ../repair/threads.c \
 common.c \
+counter.c \
 disk.c \
 ioctl.c \
 phase1.c \
diff --git a/scrub/counter.c b/scrub/counter.c
new file mode 100644
index 0000000..1aae0a3
--- /dev/null
+++ b/scrub/counter.c
@@ -0,0 +1,122 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <dirent.h>
+#include <sys/statvfs.h>
+#include <math.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "scrub.h"
+#include "counter.h"
+
+/*
+ * Per-Thread Counters
+ *
+ * This is a global counter object that uses per-thread counters to
+ * count things without having to contend for a single shared lock.
+ * Provided we know the number of threads that will be accessing the
+ * counter, each thread gets its own thread-specific counter variable.
+ * Changing the value is fast, though retrieving the value is expensive
+ * and approximate.
+ */
+struct ptcounter {
+	pthread_key_t	key;
+	pthread_mutex_t	lock;
+	size_t		nr_used;
+	size_t		nr_counters;
+	uint64_t	counters[0];
+};
+#define PTCOUNTER_SIZE(nr) (sizeof(struct ptcounter) + sizeof(uint64_t) * (nr))
+
+/* Initialize per-thread counter. */
+struct ptcounter *
+ptcounter_init(
+	size_t			nr)
+{
+	struct ptcounter	*p;
+	int			ret;
+
+	p = malloc(PTCOUNTER_SIZE(nr));
+	if (!p)
+		return NULL;
+	p->nr_counters = nr;
+	p->nr_used = 0;
+	memset(p->counters, 0, sizeof(uint64_t) * nr);
+	ret = pthread_mutex_init(&p->lock, NULL);
+	if (ret)
+		goto out;
+	ret = pthread_key_create(&p->key, NULL);
+	if (ret)
+		goto out_mutex;
+	return p;
+
+out_mutex:
+	pthread_mutex_destroy(&p->lock);
+out:
+	free(p);
+	return NULL;
+}
+
+/* Free per-thread counter. */
+void
+ptcounter_free(
+	struct ptcounter	*ptc)
+{
+	pthread_key_delete(ptc->key);
+	pthread_mutex_destroy(&ptc->lock);
+	free(ptc);
+}
+
+/* Add a quantity to the counter. */
+void
+ptcounter_add(
+	struct ptcounter	*ptc,
+	int64_t			nr)
+{
+	uint64_t		*p;
+
+	p = pthread_getspecific(ptc->key);
+	if (!p) {
+		pthread_mutex_lock(&ptc->lock);
+		assert(ptc->nr_used < ptc->nr_counters);
+		p = &ptc->counters[ptc->nr_used++];
+		pthread_setspecific(ptc->key, p);
+		pthread_mutex_unlock(&ptc->lock);
+	}
+	*p += nr;
+}
+
+/* Return the approximate value of this counter. */
+uint64_t
+ptcounter_value(
+	struct ptcounter	*ptc)
+{
+	size_t			i;
+	uint64_t		sum = 0;
+
+	pthread_mutex_lock(&ptc->lock);
+	for (i = 0; i < ptc->nr_used; i++)
+		sum += ptc->counters[i];
+	pthread_mutex_unlock(&ptc->lock);
+
+	return sum;
+}
diff --git a/scrub/counter.h b/scrub/counter.h
new file mode 100644
index 0000000..f6225b2
--- /dev/null
+++ b/scrub/counter.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_COUNTER_H_
+#define XFS_SCRUB_COUNTER_H_
+
+struct ptcounter;
+struct ptcounter *ptcounter_init(size_t nr);
+void ptcounter_free(struct ptcounter *ptc);
+void ptcounter_add(struct ptcounter *ptc, int64_t nr);
+uint64_t ptcounter_value(struct ptcounter *ptc);
+
+#endif /* XFS_SCRUB_COUNTER_H_ */



* [PATCH 11/22] xfs_scrub: create a bitmap data structure
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 10/22] xfs_scrub: thread-safe stats counter Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 12/22] xfs_scrub: create infrastructure to read verify data blocks Darrick J. Wong
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create an efficient tree-based bitmap data structure.  We will use this
during the data block scan to record the LBAs of IO errors so that we
can report broken files to userspace.
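
For readers who want the idea without the AVL plumbing, the following is a
minimal sketch of range-merging on set, not the code in this patch.  It
assumes a sorted array of extents instead of the avl64 tree, and the names
(struct ext, ext_bitmap_set, ext_bitmap_test) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration; the patch uses an AVL tree. */
struct ext {
	uint64_t	start;
	uint64_t	len;
};

struct ext_bitmap {
	struct ext	*exts;	/* sorted, non-overlapping, non-adjacent */
	size_t		nr;	/* extents in use */
	size_t		cap;	/* extents allocated */
};

/* Set [start, start + len), merging overlapping or adjacent extents. */
static bool
ext_bitmap_set(struct ext_bitmap *bm, uint64_t start, uint64_t len)
{
	uint64_t	end = start + len;
	size_t		i = 0, j;

	assert(len > 0 && end > start);

	/* Skip extents that end strictly before the new range starts. */
	while (i < bm->nr && bm->exts[i].start + bm->exts[i].len < start)
		i++;

	/* Absorb every extent that overlaps or touches the new range. */
	for (j = i; j < bm->nr && bm->exts[j].start <= end; j++) {
		if (bm->exts[j].start < start)
			start = bm->exts[j].start;
		if (bm->exts[j].start + bm->exts[j].len > end)
			end = bm->exts[j].start + bm->exts[j].len;
	}

	/* A pure insertion may need one more slot. */
	if (j == i && bm->nr == bm->cap) {
		size_t		ncap = bm->cap ? bm->cap * 2 : 8;
		struct ext	*n = realloc(bm->exts, ncap * sizeof(*n));

		if (!n)
			return false;
		bm->exts = n;
		bm->cap = ncap;
	}

	/* Replace exts[i..j) with the single merged extent. */
	memmove(&bm->exts[i + 1], &bm->exts[j],
			(bm->nr - j) * sizeof(struct ext));
	bm->exts[i].start = start;
	bm->exts[i].len = end - start;
	bm->nr = bm->nr - (j - i) + 1;
	return true;
}

/* Is any part of [start, start + len) set? */
static bool
ext_bitmap_test(const struct ext_bitmap *bm, uint64_t start, uint64_t len)
{
	size_t		i;

	for (i = 0; i < bm->nr; i++)
		if (bm->exts[i].start < start + len &&
		    bm->exts[i].start + bm->exts[i].len > start)
			return true;
	return false;
}
```

Setting 10+5 and 30+5 and then 15+15 should leave one extent, 10+25, because
the third range touches both neighbours.  The tree in the patch serves the
same purpose but avoids the linear scan and the memmove.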

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile |    4 +
 scrub/bitmap.c |  415 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/bitmap.h |   41 ++++++
 3 files changed, 460 insertions(+)
 create mode 100644 scrub/bitmap.c
 create mode 100644 scrub/bitmap.h


diff --git a/scrub/Makefile b/scrub/Makefile
index deb352b..b1cd393 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -16,7 +16,9 @@ INSTALL_SCRUB = install-scrub
 endif	# scrub_prereqs
 
 HFILES = \
+../repair/avl64.h \
 ../repair/threads.h \
+bitmap.h \
 common.h \
 counter.h \
 disk.h \
@@ -25,7 +27,9 @@ scrub.h \
 xfs.h
 
 CFILES = \
+../repair/avl64.c \
 ../repair/threads.c \
+bitmap.c \
 common.c \
 counter.c \
 disk.c \
diff --git a/scrub/bitmap.c b/scrub/bitmap.c
new file mode 100644
index 0000000..6d30644
--- /dev/null
+++ b/scrub/bitmap.c
@@ -0,0 +1,415 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include "../repair/avl64.h"
+#include "bitmap.h"
+
+/*
+ * Space Efficient Bitmap
+ *
+ * Implements a space-efficient bitmap.  We use an AVL tree to manage
+ * extent records that tell us which ranges are set; the bitmap key is
+ * an arbitrary uint64_t.  The usual bitmap operations (set, clear,
+ * test, test and set) are supported, plus we can iterate set ranges.
+ */
+
+#define avl_for_each_range_safe(pos, n, l, first, last) \
+	for (pos = (first), n = pos->avl_nextino, l = (last)->avl_nextino; pos != (l); \
+			pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each_safe(tree, pos, n) \
+	for (pos = (tree)->avl_firstino, n = pos ? pos->avl_nextino : NULL; \
+			pos != NULL; \
+			pos = n, n = pos ? pos->avl_nextino : NULL)
+
+#define avl_for_each(tree, pos) \
+	for (pos = (tree)->avl_firstino; pos != NULL; pos = pos->avl_nextino)
+
+struct bitmap_node {
+	struct avl64node	btn_node;
+	uint64_t		btn_start;
+	uint64_t		btn_length;
+};
+
+static uint64_t
+extent_start(
+	struct avl64node	*node)
+{
+	struct bitmap_node	*btn;
+
+	btn = container_of(node, struct bitmap_node, btn_node);
+	return btn->btn_start;
+}
+
+static uint64_t
+extent_end(
+	struct avl64node	*node)
+{
+	struct bitmap_node	*btn;
+
+	btn = container_of(node, struct bitmap_node, btn_node);
+	return btn->btn_start + btn->btn_length;
+}
+
+static struct avl64ops bitmap_ops = {
+	extent_start,
+	extent_end,
+};
+
+/* Initialize a bitmap. */
+bool
+bitmap_init(
+	struct bitmap		**bmapp)
+{
+	struct bitmap		*bmap;
+
+	bmap = calloc(1, sizeof(struct bitmap));
+	if (!bmap)
+		return false;
+	bmap->bt_tree = malloc(sizeof(struct avl64tree_desc));
+	if (!bmap->bt_tree) {
+		free(bmap);
+		return false;
+	}
+
+	pthread_mutex_init(&bmap->bt_lock, NULL);
+	avl64_init_tree(bmap->bt_tree, &bitmap_ops);
+	*bmapp = bmap;
+
+	return true;
+}
+
+/* Free a bitmap. */
+void
+bitmap_free(
+	struct bitmap		**bmapp)
+{
+	struct bitmap		*bmap;
+	struct avl64node	*node;
+	struct avl64node	*n;
+	struct bitmap_node	*ext;
+
+	bmap = *bmapp;
+	avl_for_each_safe(bmap->bt_tree, node, n) {
+		ext = container_of(node, struct bitmap_node, btn_node);
+		free(ext);
+	}
+	free(bmap->bt_tree);
+	*bmapp = NULL;
+}
+
+/* Create a new bitmap extent node. */
+static struct bitmap_node *
+bitmap_node_init(
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct bitmap_node	*ext;
+
+	ext = malloc(sizeof(struct bitmap_node));
+	if (!ext)
+		return NULL;
+
+	ext->btn_node.avl_nextino = NULL;
+	ext->btn_start = start;
+	ext->btn_length = len;
+
+	return ext;
+}
+
+/* Set a region of bits (locked). */
+static bool
+__bitmap_set(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	uint64_t		length)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct bitmap_node	*ext;
+	uint64_t		new_start;
+	uint64_t		new_length;
+	struct avl64node	*node;
+	bool			res = true;
+
+	/* Find any existing nodes adjacent or within that range. */
+	avl64_findranges(bmap->bt_tree, start - 1, start + length + 1,
+			&firstn, &lastn);
+
+	/* Nothing, just insert a new extent. */
+	if (firstn == NULL && lastn == NULL) {
+		ext = bitmap_node_init(start, length);
+		if (!ext)
+			return false;
+
+		node = avl64_insert(bmap->bt_tree, &ext->btn_node);
+		if (node == NULL) {
+			free(ext);
+			errno = EEXIST;
+			return false;
+		}
+
+		return true;
+	}
+
+	ASSERT(firstn != NULL && lastn != NULL);
+	new_start = start;
+	new_length = length;
+
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		ext = container_of(pos, struct bitmap_node, btn_node);
+
+		/* Bail if the new extent is contained within an old one. */
+		if (ext->btn_start <= start &&
+		    ext->btn_start + ext->btn_length >= start + length)
+			return res;
+
+		/* Check for overlapping and adjacent extents. */
+		if (ext->btn_start + ext->btn_length >= start &&
+		    ext->btn_start <= start + length) {
+			if (ext->btn_start < new_start) {
+				new_length += new_start - ext->btn_start;
+				new_start = ext->btn_start;
+			}
+
+			if (ext->btn_start + ext->btn_length >
+			    new_start + new_length)
+				new_length = ext->btn_start + ext->btn_length -
+						new_start;
+
+			avl64_delete(bmap->bt_tree, pos);
+			free(ext);
+		}
+	}
+
+	ext = bitmap_node_init(new_start, new_length);
+	if (!ext)
+		return false;
+
+	node = avl64_insert(bmap->bt_tree, &ext->btn_node);
+	if (node == NULL) {
+		free(ext);
+		errno = EEXIST;
+		return false;
+	}
+
+	return res;
+}
+
+/* Set a region of bits. */
+bool
+bitmap_set(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	uint64_t		length)
+{
+	bool			res;
+
+	pthread_mutex_lock(&bmap->bt_lock);
+	res = __bitmap_set(bmap, start, length);
+	pthread_mutex_unlock(&bmap->bt_lock);
+
+	return res;
+}
+
+/* Clear a region of bits. */
+bool
+bitmap_clear(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct bitmap_node	*ext;
+	uint64_t		new_start;
+	uint64_t		new_length;
+	struct avl64node	*node;
+	int			stat;
+
+	pthread_mutex_lock(&bmap->bt_lock);
+	/* Find any existing nodes over that range. */
+	avl64_findranges(bmap->bt_tree, start, start + len, &firstn, &lastn);
+
+	/* Nothing, we're done. */
+	if (firstn == NULL && lastn == NULL) {
+		pthread_mutex_unlock(&bmap->bt_lock);
+		return true;
+	}
+
+	ASSERT(firstn != NULL && lastn != NULL);
+
+	/* Delete or truncate everything in sight. */
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		ext = container_of(pos, struct bitmap_node, btn_node);
+
+		stat = 0;
+		if (ext->btn_start < start)
+			stat |= 1;
+		if (ext->btn_start + ext->btn_length > start + len)
+			stat |= 2;
+		switch (stat) {
+		case 0:
+			/* Extent totally within range; delete. */
+			avl64_delete(bmap->bt_tree, pos);
+			free(ext);
+			break;
+		case 1:
+			/* Extent is left-adjacent; truncate. */
+			ext->btn_length = start - ext->btn_start;
+			break;
+		case 2:
+			/* Extent is right-adjacent; move it. */
+			ext->btn_length = ext->btn_start + ext->btn_length -
+					(start + len);
+			ext->btn_start = start + len;
+			break;
+		case 3:
+			/* Extent overlaps both ends. */
+			new_start = start + len;
+			new_length = ext->btn_start + ext->btn_length -
+					new_start;
+			ext->btn_length = start - ext->btn_start;
+
+			ext = bitmap_node_init(new_start, new_length);
+			if (!ext)
+				return false;
+
+			node = avl64_insert(bmap->bt_tree, &ext->btn_node);
+			if (node == NULL) {
+				errno = EEXIST;
+				return false;
+			}
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&bmap->bt_lock);
+	return true;
+}
+
+/* Iterate the set regions of this bitmap. */
+bool
+bitmap_iterate(
+	struct bitmap		*bmap,
+	bool			(*fn)(uint64_t, uint64_t, void *),
+	void			*arg)
+{
+	struct avl64node	*node;
+	struct bitmap_node	*ext;
+	bool			moveon = true;
+
+	pthread_mutex_lock(&bmap->bt_lock);
+	avl_for_each(bmap->bt_tree, node) {
+		ext = container_of(node, struct bitmap_node, btn_node);
+		moveon = fn(ext->btn_start, ext->btn_length, arg);
+		if (!moveon)
+			break;
+	}
+	pthread_mutex_unlock(&bmap->bt_lock);
+
+	return moveon;
+}
+
+/* Do any bitmap extents overlap the given one?  (locked) */
+static bool
+__bitmap_test(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+
+	/* Find any existing nodes over that range. */
+	avl64_findranges(bmap->bt_tree, start, start + len, &firstn, &lastn);
+
+	return firstn != NULL && lastn != NULL;
+}
+
+/* Is any part of this range set? */
+bool
+bitmap_test(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	uint64_t		len)
+{
+	bool			res;
+
+	pthread_mutex_lock(&bmap->bt_lock);
+	res = __bitmap_test(bmap, start, len);
+	pthread_mutex_unlock(&bmap->bt_lock);
+
+	return res;
+}
+
+/* Ensure that the extent is set, and return the old value. */
+bool
+bitmap_test_and_set(
+	struct bitmap		*bmap,
+	uint64_t		start,
+	bool			*was_set)
+{
+	bool			res = true;
+
+	pthread_mutex_lock(&bmap->bt_lock);
+	*was_set = __bitmap_test(bmap, start, 1);
+	if (!(*was_set))
+		res = __bitmap_set(bmap, start, 1);
+	pthread_mutex_unlock(&bmap->bt_lock);
+
+	return res;
+}
+
+/* Are none of the bits set? */
+bool
+bitmap_empty(
+	struct bitmap		*bmap)
+{
+	return bmap->bt_tree->avl_firstino == NULL;
+}
+
+#ifdef DEBUG
+static bool
+bitmap_dump_fn(
+	uint64_t		startblock,
+	uint64_t		blockcount,
+	void			*arg)
+{
+	printf("%"PRIu64":%"PRIu64"\n", startblock, blockcount);
+	return true;
+}
+
+/* Dump bitmap. */
+void
+bitmap_dump(
+	struct bitmap		*bmap)
+{
+	printf("BITMAP DUMP %p\n", bmap);
+	bitmap_iterate(bmap, bitmap_dump_fn, NULL);
+	printf("BITMAP DUMP DONE\n");
+}
+#endif
diff --git a/scrub/bitmap.h b/scrub/bitmap.h
new file mode 100644
index 0000000..db89659
--- /dev/null
+++ b/scrub/bitmap.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_BITMAP_H_
+#define XFS_SCRUB_BITMAP_H_
+
+struct bitmap {
+	pthread_mutex_t		bt_lock;
+	struct avl64tree_desc	*bt_tree;
+};
+
+bool bitmap_init(struct bitmap **bmap);
+void bitmap_free(struct bitmap **bmap);
+bool bitmap_set(struct bitmap *bmap, uint64_t start, uint64_t length);
+bool bitmap_clear(struct bitmap *bmap, uint64_t start,
+		uint64_t len);
+bool bitmap_iterate(struct bitmap *bmap,
+		bool (*fn)(uint64_t, uint64_t, void *), void *arg);
+bool bitmap_test(struct bitmap *bmap, uint64_t start,
+		uint64_t len);
+bool bitmap_test_and_set(struct bitmap *bmap, uint64_t start, bool *was_set);
+bool bitmap_empty(struct bitmap *bmap);
+void bitmap_dump(struct bitmap *bmap);
+
+#endif /* XFS_SCRUB_BITMAP_H_ */



* [PATCH 12/22] xfs_scrub: create infrastructure to read verify data blocks
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 11/22] xfs_scrub: create a bitmap data structure Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:08 ` [PATCH 13/22] xfs_scrub: scrub file " Darrick J. Wong
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Manage the scheduling, issuance, and reporting of data block
verification reads.  This enables us to combine adjacent (or nearly
adjacent) read requests, and to take advantage of high-IOPS devices by
issuing IO from multiple threads.
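
The batching rule can be sketched in isolation.  This is an illustration
under assumptions rather than the patch's code: struct pending_io and
pending_io_merge are hypothetical names, and the disk and error-argument
checks that the patch also performs are left out.  Only the 64k tolerance
is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tolerate 64k holes between requests, as the patch does. */
#define BATCH_LOCALITY	65536ULL

/* Hypothetical stash for one not-yet-issued request (bytes). */
struct pending_io {
	uint64_t	start;
	uint64_t	length;
};

/*
 * Try to fold [start, start + length) into the stashed request.
 * Returns true if the two were combined, false if the caller should
 * issue the stashed request and stash the new one instead.
 */
static bool
pending_io_merge(struct pending_io *p, uint64_t start, uint64_t length)
{
	uint64_t	req_end = start + length;
	uint64_t	p_end = p->start + p->length;

	assert(req_end >= start);

	/* Nothing stashed, nothing to merge with. */
	if (p->length == 0)
		return false;

	if ((start >= p->start && start <= p_end + BATCH_LOCALITY) ||
	    (p->start >= start && p->start <= req_end + BATCH_LOCALITY)) {
		uint64_t	new_start = p->start < start ? p->start : start;
		uint64_t	new_end = p_end > req_end ? p_end : req_end;

		p->start = new_start;
		p->length = new_end - new_start;
		return true;
	}
	return false;
}
```

With a 4096-byte request stashed at offset 0, a second request at offset
8192 is merged and the hole between them is read as well.  A request that
starts more than 64k past the end of the stash is not merged.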

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile      |    2 
 scrub/phase1.c      |    1 
 scrub/phase2.c      |    1 
 scrub/phase3.c      |    1 
 scrub/phase5.c      |    1 
 scrub/read_verify.c |  224 +++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/read_verify.h |   58 +++++++++++++
 scrub/scrub.c       |   25 ++++++
 scrub/scrub.h       |   13 +++
 9 files changed, 326 insertions(+)
 create mode 100644 scrub/read_verify.c
 create mode 100644 scrub/read_verify.h


diff --git a/scrub/Makefile b/scrub/Makefile
index b1cd393..5df3e95 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -23,6 +23,7 @@ common.h \
 counter.h \
 disk.h \
 ioctl.h \
+read_verify.h \
 scrub.h \
 xfs.h
 
@@ -38,6 +39,7 @@ phase1.c \
 phase2.c \
 phase3.c \
 phase5.c \
+read_verify.c \
 scrub.c \
 xfs.c
 
diff --git a/scrub/phase1.c b/scrub/phase1.c
index 6c3aab4..66f4aa3 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -25,6 +25,7 @@
 #include "../repair/threads.h"
 #include "handle.h"
 #include "path.h"
+#include "bitmap.h"
 #include "scrub.h"
 #include "common.h"
 #include "ioctl.h"
diff --git a/scrub/phase2.c b/scrub/phase2.c
index b8b44ac..88136a3 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -25,6 +25,7 @@
 #include "../repair/threads.h"
 #include "handle.h"
 #include "path.h"
+#include "bitmap.h"
 #include "scrub.h"
 #include "common.h"
 #include "ioctl.h"
diff --git a/scrub/phase3.c b/scrub/phase3.c
index cdd8a7c..b920995 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -25,6 +25,7 @@
 #include "../repair/threads.h"
 #include "handle.h"
 #include "path.h"
+#include "bitmap.h"
 #include "scrub.h"
 #include "common.h"
 #include "ioctl.h"
diff --git a/scrub/phase5.c b/scrub/phase5.c
index 7ea8b58..e5a5835 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -25,6 +25,7 @@
 #include "../repair/threads.h"
 #include "handle.h"
 #include "path.h"
+#include "bitmap.h"
 #include "scrub.h"
 #include "common.h"
 #include "ioctl.h"
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
new file mode 100644
index 0000000..18ba73a
--- /dev/null
+++ b/scrub/read_verify.c
@@ -0,0 +1,224 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "common.h"
+#include "counter.h"
+
+/*
+ * Read Verify Pool
+ *
+ * Manages the data block read verification phase.  The caller schedules
+ * verification requests, which are then queued to run on a thread
+ * pool worker.  Adjacent (or nearly adjacent) requests can be combined
+ * to reduce overhead when free space fragmentation is high.  The thread
+ * pool takes care of issuing multiple IOs to the device, if possible.
+ */
+
+/* How many bytes have we verified? */
+static struct ptcounter		*verified_bytes;
+
+/* Tolerate 64k holes in adjacent read verify requests. */
+#define IO_BATCH_LOCALITY	(65536)
+
+/* Create a thread pool to run read verifiers. */
+bool
+read_verify_pool_init(
+	struct read_verify_pool		**rvpp,
+	struct scrub_ctx		*ctx,
+	void				*readbuf,
+	size_t				readbufsz,
+	size_t				miniosz,
+	read_verify_ioerr_fn_t		ioerr_fn,
+	unsigned int			nproc)
+{
+	struct read_verify_pool		*rvp;
+
+	rvp = calloc(1, sizeof(struct read_verify_pool));
+	if (!rvp)
+		return false;
+	verified_bytes = ptcounter_init(nproc);
+	if (!verified_bytes) {
+		free(rvp);
+		return false;
+	}
+	rvp->rvp_readbuf = readbuf;
+	rvp->rvp_readbufsz = readbufsz;
+	rvp->rvp_miniosz = miniosz;
+	rvp->rvp_ctx = ctx;
+	rvp->rvp_ioerr_fn = ioerr_fn;
+	rvp->rvp_nproc = nproc;
+	create_work_queue(&rvp->rvp_wq, (struct xfs_mount *)rvp, nproc);
+	*rvpp = rvp;
+	return true;
+}
+
+/* Finish up any read verification work and tear it down. */
+void
+read_verify_pool_destroy(
+	struct read_verify_pool		**rvpp)
+{
+	struct read_verify_pool		*rvp = *rvpp;
+
+	destroy_work_queue(&rvp->rvp_wq);
+	ptcounter_free(verified_bytes);
+	verified_bytes = NULL;
+	*rvpp = NULL;
+}
+
+/*
+ * Issue a read-verify IO in big batches.
+ */
+static void
+read_verify(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct read_verify		*rv = arg;
+	struct read_verify_pool		*rvp;
+	unsigned long long		verified = 0;
+	ssize_t				sz;
+	ssize_t				len;
+
+	rvp = (struct read_verify_pool *)wq->mp;
+	while (rv->io_length > 0) {
+		len = min(rv->io_length, rvp->rvp_readbufsz);
+		dbg_printf("diskverify %d %"PRIu64" %zd\n", rv->io_disk->d_fd,
+				rv->io_start, len);
+		sz = disk_read_verify(rv->io_disk, rvp->rvp_readbuf,
+				rv->io_start, len);
+		if (sz < 0) {
+			dbg_printf("IOERR %d %"PRIu64" %zd\n",
+					rv->io_disk->d_fd,
+					rv->io_start, len);
+			/* IO error, so try the next logical block. */
+			len = rvp->rvp_miniosz;
+			rvp->rvp_ioerr_fn(rvp, rv->io_disk, rv->io_start, len,
+					errno, rv->io_end_arg);
+		}
+
+		verified += len;
+		rv->io_start += len;
+		rv->io_length -= len;
+	}
+
+	free(rv);
+	ptcounter_add(verified_bytes, verified);
+}
+
+/* Queue a read verify request. */
+static void
+read_verify_queue(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv)
+{
+	struct read_verify		*tmp;
+
+	dbg_printf("verify fd %d start %"PRIu64" len %"PRIu64"\n",
+			rv->io_disk->d_fd, rv->io_start, rv->io_length);
+
+	tmp = malloc(sizeof(struct read_verify));
+	if (!tmp) {
+		rvp->rvp_ioerr_fn(rvp, rv->io_disk, rv->io_start, rv->io_length,
+				errno, rv->io_end_arg);
+		return;
+	}
+	*tmp = *rv;
+
+	queue_work(&rvp->rvp_wq, read_verify, 0, tmp);
+}
+
+/*
+ * Issue an IO request.  We'll batch subsequent requests if they're
+ * within 64k of each other.
+ */
+void
+read_verify_schedule(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length,
+	void				*end_arg)
+{
+	uint64_t			req_end;
+	uint64_t			rv_end;
+
+	assert(rvp->rvp_readbuf);
+	req_end = start + length;
+	rv_end = rv->io_start + rv->io_length;
+
+	/*
+	 * If we have a stashed IO, we haven't changed fds, the error
+	 * reporting is the same, and the two extents are close,
+	 * we can combine them.
+	 */
+	if (rv->io_length > 0 && disk == rv->io_disk &&
+	    end_arg == rv->io_end_arg &&
+	    ((start >= rv->io_start && start <= rv_end + IO_BATCH_LOCALITY) ||
+	     (rv->io_start >= start &&
+	      rv->io_start <= req_end + IO_BATCH_LOCALITY))) {
+		rv->io_start = min(rv->io_start, start);
+		rv->io_length = max(req_end, rv_end) - rv->io_start;
+	} else {
+		/* Otherwise, issue the stashed IO (if there is one) */
+		if (rv->io_length > 0)
+			read_verify_queue(rvp, rv);
+
+		/* Stash the new IO. */
+		rv->io_disk = disk;
+		rv->io_start = start;
+		rv->io_length = length;
+		rv->io_end_arg = end_arg;
+	}
+}
+
+/* Force any stashed IOs into the verifier. */
+void
+read_verify_force(
+	struct read_verify_pool		*rvp,
+	struct read_verify		*rv)
+{
+	assert(rvp->rvp_readbuf);
+	if (rv->io_length == 0)
+		return;
+
+	read_verify_queue(rvp, rv);
+	rv->io_length = 0;
+}
+
+/* How many bytes has this process verified? */
+unsigned long long
+read_verify_bytes(void)
+{
+	if (!verified_bytes)
+		return 0;
+	return ptcounter_value(verified_bytes);
+}
+
diff --git a/scrub/read_verify.h b/scrub/read_verify.h
new file mode 100644
index 0000000..59cddd7
--- /dev/null
+++ b/scrub/read_verify.h
@@ -0,0 +1,58 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_READ_VERIFY_H_
+#define XFS_SCRUB_READ_VERIFY_H_
+
+struct read_verify_pool;
+
+/* Function called when an IO error happens. */
+typedef void (*read_verify_ioerr_fn_t)(struct read_verify_pool *rvp,
+		struct disk *disk, uint64_t start, uint64_t length,
+		int error, void *arg);
+
+struct read_verify_pool {
+	struct work_queue	rvp_wq;		/* thread pool */
+	struct scrub_ctx	*rvp_ctx;	/* scrub context */
+	void			*rvp_readbuf;	/* read buffer */
+	read_verify_ioerr_fn_t	rvp_ioerr_fn;	/* io error callback */
+	size_t			rvp_miniosz;	/* minimum io size, bytes */
+	size_t			rvp_readbufsz;	/* read buffer size, bytes */
+	int			rvp_nproc;	/* number of threads */
+};
+
+bool read_verify_pool_init(struct read_verify_pool **rvpp, struct scrub_ctx *ctx,
+		void *readbuf, size_t readbufsz, size_t miniosz,
+		read_verify_ioerr_fn_t ioerr_fn, unsigned int nproc);
+void read_verify_pool_destroy(struct read_verify_pool **rvpp);
+
+struct read_verify {
+	void			*io_end_arg;
+	struct disk		*io_disk;
+	uint64_t		io_start;	/* bytes */
+	uint64_t		io_length;	/* bytes */
+};
+
+void read_verify_schedule(struct read_verify_pool *rvp, struct read_verify *rv,
+		struct disk *disk, uint64_t start, uint64_t length,
+		void *end_arg);
+void read_verify_force(struct read_verify_pool *rvp, struct read_verify *rv);
+unsigned long long read_verify_bytes(void);
+
+#endif /* XFS_SCRUB_READ_VERIFY_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
index c2385da..d4527e4 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -32,6 +32,7 @@
 #include "../repair/threads.h"
 #include "path.h"
 #include "disk.h"
+#include "read_verify.h"
 #include "scrub.h"
 #include "common.h"
 #include "input.h"
@@ -251,6 +252,8 @@ phase_start(
 		return false;
 	}
 
+	pi->verified_bytes = read_verify_bytes();
+
 	pi->descr = descr;
 	if ((verbose || display_rusage) && descr) {
 		fprintf(stdout, _("Phase %u: %s\n"), phase, descr);
@@ -272,11 +275,14 @@ phase_end(
 	struct timeval		time_now;
 	char			phasebuf[DESCR_BUFSZ];
 	double			dt;
+	unsigned long long	verified;
 	long			in, out;
 	long			io;
 	double			i, o, t;
 	double			din, dout, dtot;
 	char			*iu, *ou, *tu, *dinu, *doutu, *dtotu;
+	double			v, dv;
+	char			*vu, *dvu;
 	int			error;
 
 	if (!display_rusage)
@@ -339,6 +345,15 @@ _("%sI/O: %.1f%s in, %.1f%s out, %.1f%s tot\n"),
 _("%sI/O rate: %.1f%s/s in, %.1f%s/s out, %.1f%s/s tot\n"),
 			phasebuf, din, dinu, dout, doutu, dtot, dtotu);
 	}
+
+	/* How many bytes were read-verified? */
+	verified = read_verify_bytes() - pi->verified_bytes;
+	if (verified) {
+		v = auto_space_units(verified, &vu);
+		dv = auto_space_units(verified / dt, &dvu);
+		fprintf(stdout, _("Phase %u: Verify: %.1f%s, rate: %.1f%s/s\n"),
+			phase, v, vu, dv, dvu);
+	}
 	fflush(stdout);
 
 	return true;
@@ -496,6 +511,7 @@ main(
 	bool			ismnt;
 	static bool		injected;
 	int			ret;
+	int			error;
 
 	fprintf(stderr, "XXX: This program is not complete!\n");
 	return 4;
@@ -639,6 +655,14 @@ _("Only one of the options -n or -y may be specified.\n"));
 		goto out;
 	}
 
+	/* Try to allocate a read buffer if we don't have one. */
+	error = posix_memalign((void **)&ctx.readbuf, page_size,
+			IO_MAX_SIZE);
+	if (error || !ctx.readbuf) {
+		str_errno(&ctx, ctx.mntpoint);
+		goto out;
+	}
+
 	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
 		ctx.mode = SCRUB_MODE_REPAIR;
 		injected = true;
@@ -692,6 +716,7 @@ _("%s: %llu warnings found.\n"),
 	disk_close(&ctx.datadev);
 
 	free(ctx.blkdev);
+	free(ctx.readbuf);
 	free(ctx.mntpoint);
 end:
 	return ret;
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 87f59d6..0b82d9f 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -42,6 +42,15 @@ enum error_action {
 	ERRORS_SHUTDOWN,
 };
 
+/*
+ * Perform all IO in 32M chunks.  This cannot exceed 65536 sectors
+ * because that's the biggest SCSI VERIFY(16) we dare to send.
+ */
+#define IO_MAX_SIZE		33554432
+#define IO_MAX_SECTORS		(IO_MAX_SIZE >> BBSHIFT)
+
+struct read_verify_pool;
+
 struct scrub_ctx {
 	/* Immutable scrub state. */
 
@@ -81,8 +90,12 @@ struct scrub_ctx {
 	void			*fshandle;
 	size_t			fshandle_len;
 
+	/* Data block read verification buffer */
+	void			*readbuf;
+
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
+	struct read_verify_pool	*rvp;
 	unsigned long long	max_errors;
 	unsigned long long	runtime_errors;
 	unsigned long long	errors_found;



* [PATCH 13/22] xfs_scrub: scrub file data blocks
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 12/22] xfs_scrub: create infrastructure to read verify data blocks Darrick J. Wong
@ 2017-08-04  0:08 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 14/22] xfs_scrub: optionally use SCSI READ VERIFY commands to scrub data blocks on disk Darrick J. Wong
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:08 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Read all data blocks from the disk, hoping to catch IO errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 configure.ac          |    2 
 include/builddefs.in  |    2 
 m4/package_libcdev.m4 |   28 +++
 scrub/Makefile        |    7 -
 scrub/phase6.c        |  539 +++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/scrub.c         |    4 
 scrub/vfs.c           |  199 ++++++++++++++++++
 scrub/vfs.h           |   33 +++
 scrub/xfs.c           |  140 +++++++++++++
 scrub/xfs.h           |    4 
 10 files changed, 956 insertions(+), 2 deletions(-)
 create mode 100644 scrub/phase6.c
 create mode 100644 scrub/vfs.c
 create mode 100644 scrub/vfs.h


diff --git a/configure.ac b/configure.ac
index e7ea99e..b2b4566 100644
--- a/configure.ac
+++ b/configure.ac
@@ -144,6 +144,8 @@ AC_HAVE_MREMAP
 AC_NEED_INTERNAL_FSXATTR
 AC_HAVE_GETFSMAP
 AC_HAVE_MALLINFO
+AC_HAVE_OPENAT
+AC_HAVE_FSTATAT
 
 if test "$enable_blkid" = yes; then
 AC_HAVE_BLKID_TOPO
diff --git a/include/builddefs.in b/include/builddefs.in
index c178556..65d3d20 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -114,6 +114,8 @@ HAVE_MREMAP = @have_mremap@
 NEED_INTERNAL_FSXATTR = @need_internal_fsxattr@
 HAVE_GETFSMAP = @have_getfsmap@
 HAVE_MALLINFO = @have_mallinfo@
+HAVE_OPENAT = @have_openat@
+HAVE_FSTATAT = @have_fstatat@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 3ccdea5..540633a 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -315,3 +315,31 @@ AC_DEFUN([AC_HAVE_MALLINFO],
        AC_MSG_RESULT(no))
     AC_SUBST(have_mallinfo)
   ])
+
+#
+# Check if we have an openat call
+#
+AC_DEFUN([AC_HAVE_OPENAT],
+  [ AC_CHECK_DECL([openat],
+       have_openat=yes,
+       [],
+       [#include <sys/types.h>
+        #include <sys/stat.h>
+        #include <fcntl.h>]
+       )
+    AC_SUBST(have_openat)
+  ])
+
+#
+# Check if we have a fstatat call
+#
+AC_DEFUN([AC_HAVE_FSTATAT],
+  [ AC_CHECK_DECL([fstatat],
+       have_fstatat=yes,
+       [],
+       [#define _GNU_SOURCE
+       #include <sys/types.h>
+       #include <sys/stat.h>
+       #include <unistd.h>])
+    AC_SUBST(have_fstatat)
+  ])
diff --git a/scrub/Makefile b/scrub/Makefile
index 5df3e95..18a65b9 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -8,9 +8,9 @@ include $(TOPDIR)/include/builddefs
 # On linux we get fsmap from the system or define it ourselves
 # so include this based on platform type.  If this reverts to only
 # the autoconf check w/o local definition, change to testing HAVE_GETFSMAP
-SCRUB_PREREQS=$(PKG_PLATFORM)
+SCRUB_PREREQS=$(PKG_PLATFORM)$(HAVE_OPENAT)$(HAVE_FSTATAT)
 
-ifeq ($(SCRUB_PREREQS),linux)
+ifeq ($(SCRUB_PREREQS),linuxyesyes)
 LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 endif	# scrub_prereqs
@@ -25,6 +25,7 @@ disk.h \
 ioctl.h \
 read_verify.h \
 scrub.h \
+vfs.h \
 xfs.h
 
 CFILES = \
@@ -39,8 +40,10 @@ phase1.c \
 phase2.c \
 phase3.c \
 phase5.c \
+phase6.c \
 read_verify.c \
 scrub.c \
+vfs.c \
 xfs.c
 
 LLDLIBS += $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
diff --git a/scrub/phase6.c b/scrub/phase6.c
new file mode 100644
index 0000000..d60a044
--- /dev/null
+++ b/scrub/phase6.c
@@ -0,0 +1,539 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "vfs.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+#include "xfs.h"
+
+/*
+ * Phase 6: Verify data file integrity.
+ *
+ * Identify potential data block extents with GETFSMAP, then feed those
+ * extents to the read-verify pool to get the verify commands batched,
+ * issued, and (if there are problems) reported back to us.  If there
+ * are errors, we'll record the bad regions and (if available) use rmap
+ * to tell us if metadata are now corrupt.  Otherwise, we'll scan the
+ * whole directory tree looking for files that overlap the bad regions
+ * and report the paths of the now corrupt files.
+ */
+
+/* Find the disk for a given device identifier. */
+static struct disk *
+xfs_dev_to_disk(
+	struct scrub_ctx	*ctx,
+	dev_t			dev)
+{
+	if (dev == ctx->fsinfo.fs_datadev)
+		return &ctx->datadev;
+	else if (dev == ctx->fsinfo.fs_logdev)
+		return &ctx->logdev;
+	else if (dev == ctx->fsinfo.fs_rtdev)
+		return &ctx->rtdev;
+	abort();
+}
+
+/* Find the device major/minor for a given disk. */
+static dev_t
+xfs_disk_to_dev(
+	struct scrub_ctx	*ctx,
+	struct disk		*disk)
+{
+	if (disk == &ctx->datadev)
+		return ctx->fsinfo.fs_datadev;
+	else if (disk == &ctx->logdev)
+		return ctx->fsinfo.fs_logdev;
+	else if (disk == &ctx->rtdev)
+		return ctx->fsinfo.fs_rtdev;
+	abort();
+}
+
+struct owner_decode {
+	uint64_t		owner;
+	const char		*descr;
+};
+
+static const struct owner_decode special_owners[] = {
+	{XFS_FMR_OWN_FREE,	"free space"},
+	{XFS_FMR_OWN_UNKNOWN,	"unknown owner"},
+	{XFS_FMR_OWN_FS,	"static FS metadata"},
+	{XFS_FMR_OWN_LOG,	"journalling log"},
+	{XFS_FMR_OWN_AG,	"per-AG metadata"},
+	{XFS_FMR_OWN_INOBT,	"inode btree blocks"},
+	{XFS_FMR_OWN_INODES,	"inodes"},
+	{XFS_FMR_OWN_REFC,	"refcount btree"},
+	{XFS_FMR_OWN_COW,	"CoW staging"},
+	{XFS_FMR_OWN_DEFECTIVE,	"bad blocks"},
+	{0, NULL},
+};
+
+/* Decode a special owner. */
+static const char *
+xfs_decode_special_owner(
+	uint64_t			owner)
+{
+	const struct owner_decode	*od = special_owners;
+
+	while (od->descr) {
+		if (od->owner == owner)
+			return od->descr;
+		od++;
+	}
+
+	return NULL;
+}
+
+/* Routines to translate bad physical extents into file paths and offsets. */
+
+struct xfs_verify_error_info {
+	struct bitmap			*d_bad;		/* bytes */
+	struct bitmap			*r_bad;		/* bytes */
+};
+
+/* Report if this extent overlaps a bad region. */
+static bool
+xfs_report_verify_inode_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct xfs_verify_error_info	*vei = arg;
+	struct bitmap			*bmp;
+
+	/* Only report errors for real extents. */
+	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
+		return true;
+
+	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+		bmp = vei->r_bad;
+	else
+		bmp = vei->d_bad;
+
+	if (!bitmap_test(bmp, bmap->bm_physical, bmap->bm_length))
+		return true;
+
+	str_error(ctx, descr,
+_("offset %llu failed read verification."), bmap->bm_offset);
+	return true;
+}
+
+/* Iterate the extent mappings of a file to report errors. */
+static bool
+xfs_report_verify_fd(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	void				*arg)
+{
+	struct xfs_bmap			key = {0};
+	bool				moveon;
+
+	/* data fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+
+	/* attr fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+	return true;
+}
+
+/* Report read verify errors in unlinked (but still open) files. */
+static int
+xfs_report_verify_inode(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	char				descr[DESCR_BUFSZ];
+	char				buf[DESCR_BUFSZ];
+	bool				moveon;
+	int				fd;
+	int				error;
+
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu (unlinked)"), bstat->bs_ino);
+
+	/* Ignore linked files and things we can't open. */
+	if (bstat->bs_nlink != 0)
+		return 0;
+	if (!S_ISREG(bstat->bs_mode) && !S_ISDIR(bstat->bs_mode))
+		return 0;
+
+	/* Try to open the inode. */
+	fd = xfs_open_handle(handle);
+	if (fd < 0) {
+		error = errno;
+		if (error == ESTALE)
+			return error;
+
+		str_warn(ctx, descr, "%s", strerror_r(error, buf, DESCR_BUFSZ));
+		return error;
+	}
+
+	/* Go find the badness. */
+	moveon = xfs_report_verify_fd(ctx, descr, fd, arg);
+	close(fd);
+
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Scan a directory for matches in the read verify error list. */
+static bool
+xfs_report_verify_dir(
+	struct scrub_ctx	*ctx,
+	const char		*path,
+	int			dir_fd,
+	void			*arg)
+{
+	return xfs_report_verify_fd(ctx, path, dir_fd, arg);
+}
+
+/*
+ * Scan the inode associated with a directory entry for matches with
+ * the read verify error list.
+ */
+static bool
+xfs_report_verify_dirent(
+	struct scrub_ctx	*ctx,
+	const char		*path,
+	int			dir_fd,
+	struct dirent		*dirent,
+	struct stat		*sb,
+	void			*arg)
+{
+	bool			moveon;
+	int			fd;
+
+	/* Ignore things we can't open. */
+	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode))
+		return true;
+	/* Ignore . and .. */
+	if (dirent && (!strcmp(".", dirent->d_name) ||
+		       !strcmp("..", dirent->d_name)))
+		return true;
+
+	/*
+	 * Open the file named by the dirent under dir_fd for badblocks
+	 * scanning.  The directory itself is scanned separately by
+	 * xfs_report_verify_dir.
+	 */
+	fd = openat(dir_fd, dirent->d_name,
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0)
+		return true;
+
+	/* Go find the badness. */
+	moveon = xfs_report_verify_fd(ctx, path, fd, arg);
+	if (moveon)
+		goto out;
+
+out:
+	close(fd);
+
+	return moveon;
+}
+
+/* Given bad extent lists for the data & rtdev, find bad files. */
+static bool
+xfs_report_verify_errors(
+	struct scrub_ctx		*ctx,
+	struct bitmap			*d_bad,
+	struct bitmap			*r_bad)
+{
+	struct xfs_verify_error_info	vei;
+	bool				moveon;
+
+	vei.d_bad = d_bad;
+	vei.r_bad = r_bad;
+
+	/* Scan the directory tree to get file paths. */
+	moveon = scan_fs_tree(ctx, xfs_report_verify_dir,
+			xfs_report_verify_dirent, &vei);
+	if (!moveon)
+		return false;
+
+	/* Scan for unlinked files. */
+	return xfs_scan_all_inodes_arg(ctx, xfs_report_verify_inode, &vei);
+}
+
+/* Verify disk blocks with GETFSMAP */
+
+struct xfs_verify_extent {
+	/* Maintain state for the lazy read verifier. */
+	struct read_verify	rv;
+
+	/* Store bad extents if we don't have parent pointers. */
+	struct bitmap		*d_bad;		/* bytes */
+	struct bitmap		*r_bad;		/* bytes */
+
+	/* Track the last extent we saw. */
+	uint64_t		laststart;	/* bytes */
+	uint64_t		lastlength;	/* bytes */
+	bool			lastshared;
+};
+
+/* Report an IO error resulting from read-verify based off getfsmap. */
+static bool
+xfs_check_rmap_error_report(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*map,
+	void			*arg)
+{
+	const char		*type;
+	char			buf[32];
+	uint64_t		err_physical = *(uint64_t *)arg;
+	uint64_t		err_off;
+
+	if (err_physical > map->fmr_physical)
+		err_off = err_physical - map->fmr_physical;
+	else
+		err_off = 0;
+
+	snprintf(buf, 32, _("disk offset %llu"),
+			BTOBB(map->fmr_physical + err_off));
+
+	if (map->fmr_flags & FMR_OF_SPECIAL_OWNER) {
+		type = xfs_decode_special_owner(map->fmr_owner);
+		str_error(ctx, buf,
+_("%s failed read verification."),
+				type);
+	}
+
+	/*
+	 * XXX: If we had a getparent() call we could report IO errors
+	 * efficiently.  Until then, we'll have to scan the dir tree
+	 * to find the bad file's pathname.
+	 */
+
+	return true;
+}
+
+/*
+ * Remember a read error for later, and see if rmap will tell us about the
+ * owner ahead of time.
+ */
+void
+xfs_check_rmap_ioerr(
+	struct read_verify_pool	*rvp,
+	struct disk		*disk,
+	uint64_t		start,
+	uint64_t		length,
+	int			error,
+	void			*arg)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	struct scrub_ctx	*ctx = rvp->rvp_ctx;
+	struct xfs_verify_extent	*ve;
+	struct bitmap		*tree;
+	dev_t			dev;
+	bool			moveon;
+
+	ve = arg;
+	dev = xfs_disk_to_dev(ctx, disk);
+
+	/*
+	 * If we don't have parent pointers, save the bad extent for
+	 * later rescanning.
+	 */
+	if (dev == ctx->fsinfo.fs_datadev)
+		tree = ve->d_bad;
+	else if (dev == ctx->fsinfo.fs_rtdev)
+		tree = ve->r_bad;
+	else
+		tree = NULL;
+	if (tree) {
+		moveon = bitmap_set(tree, start, length);
+		if (!moveon)
+			str_errno(ctx, ctx->mntpoint);
+	}
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d ioerr @ %"PRIu64":%"PRIu64" "),
+			major(dev), minor(dev), start, length);
+
+	/* Go figure out which blocks are bad from the fsmap. */
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	keys->fmr_physical = start;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = start + length - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+	xfs_iterate_fsmap(ctx, descr, keys, xfs_check_rmap_error_report,
+			&start);
+}
+
+/* Schedule a read-verify of a (data block) extent. */
+static bool
+xfs_check_rmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*map,
+	void				*arg)
+{
+	struct xfs_verify_extent	*ve = arg;
+	struct disk			*disk;
+
+	dbg_printf("rmap dev %d:%d phys %llu owner %lld offset %llu "
+			"len %llu flags 0x%x\n", major(map->fmr_device),
+			minor(map->fmr_device), map->fmr_physical,
+			map->fmr_owner, map->fmr_offset,
+			map->fmr_length, map->fmr_flags);
+
+	/* Remember this extent. */
+	ve->lastshared = (map->fmr_flags & FMR_OF_SHARED);
+	ve->laststart = map->fmr_physical;
+	ve->lastlength = map->fmr_length;
+
+	/* "Unknown" extents should be verified; they could be data. */
+	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+			map->fmr_owner == XFS_FMR_OWN_UNKNOWN)
+		map->fmr_flags &= ~FMR_OF_SPECIAL_OWNER;
+
+	/*
+	 * We only care about read-verifying data extents that have been
+	 * written to disk.  This means we can skip "special" owners
+	 * (metadata), xattr blocks, unwritten extents, and extent maps.
+	 * These should all get checked elsewhere in the scrubber.
+	 */
+	if (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+			      FMR_OF_EXTENT_MAP | FMR_OF_SPECIAL_OWNER))
+		goto out;
+
+	/* XXX: Filter out directory data blocks. */
+
+	/* Schedule the read verify command for (eventual) running. */
+	disk = xfs_dev_to_disk(ctx, map->fmr_device);
+
+	read_verify_schedule(ctx->rvp, &ve->rv, disk, map->fmr_physical,
+			map->fmr_length, ve);
+
+out:
+	/* Is this the last extent?  Fire off the read. */
+	if (map->fmr_flags & FMR_OF_LAST)
+		read_verify_force(ctx->rvp, &ve->rv);
+
+	return true;
+}
+
+/*
+ * Read verify all the file data blocks in a filesystem.  Since XFS doesn't
+ * do data checksums, we trust that the underlying storage will pass back
+ * an IO error if it can't retrieve whatever we previously stored there.
+ * If we hit an IO error, we'll record the bad blocks in a bitmap and then
+ * scan the extent maps of the entire fs tree (and the unlinked
+ * inodes) to figure out which files are now broken.
+ */
+bool
+xfs_scan_blocks(
+	struct scrub_ctx		*ctx)
+{
+	struct bitmap			*d_bad;
+	struct bitmap			*r_bad;
+	struct xfs_verify_extent	*ve;
+	struct xfs_verify_extent	*v;
+	int				i;
+	unsigned int			groups;
+	bool				moveon;
+
+	/*
+	 * Initialize our per-thread context.  By convention,
+	 * the AGs come first, then the rt device, and then
+	 * the log device.
+	 */
+	groups = xfs_scan_all_blocks_array_size(ctx);
+	ve = calloc(groups, sizeof(struct xfs_verify_extent));
+	if (!ve) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&d_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_ve;
+	}
+
+	moveon = bitmap_init(&r_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dbad;
+	}
+
+	for (i = 0, v = ve; i < groups; i++, v++) {
+		v->d_bad = d_bad;
+		v->r_bad = r_bad;
+	}
+
+	moveon = read_verify_pool_init(&ctx->rvp, ctx, ctx->readbuf,
+			IO_MAX_SIZE, ctx->geo.blocksize, xfs_check_rmap_ioerr,
+			disk_heads(&ctx->datadev));
+	if (!moveon)
+		goto out_rbad;
+	moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_check_rmap,
+			ve, sizeof(*ve));
+	if (!moveon)
+		goto out_pool;
+
+	for (i = 0, v = ve; i < groups; i++, v++)
+		read_verify_force(ctx->rvp, &v->rv);
+	read_verify_pool_destroy(&ctx->rvp);
+
+	/* Scan the whole dir tree to see what matches the bad extents. */
+	if (!bitmap_empty(d_bad) || !bitmap_empty(r_bad))
+		moveon = xfs_report_verify_errors(ctx, d_bad, r_bad);
+
+	bitmap_free(&r_bad);
+	bitmap_free(&d_bad);
+	free(ve);
+	return moveon;
+
+out_pool:
+	read_verify_pool_destroy(&ctx->rvp);
+out_rbad:
+	bitmap_free(&r_bad);
+out_dbad:
+	bitmap_free(&d_bad);
+out_ve:
+	free(ve);
+	return moveon;
+}
diff --git a/scrub/scrub.c b/scrub/scrub.c
index d4527e4..97bd795 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -464,6 +464,10 @@ run_scrub_phases(
 
 	/* Run all phases of the scrub tool. */
 	for (phase = 1, sp = phases; sp->fn; sp++, phase++) {
+		/* Turn on certain phases if user said to. */
+		if (sp->fn == DATASCAN_DUMMY_FN && scrub_data)
+			sp->fn = xfs_scan_blocks;
+
 		/* Skip certain phases unless they're turned on. */
 		if (sp->fn == REPAIR_DUMMY_FN ||
 		    sp->fn == DATASCAN_DUMMY_FN)
diff --git a/scrub/vfs.c b/scrub/vfs.c
new file mode 100644
index 0000000..1cff2ab
--- /dev/null
+++ b/scrub/vfs.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/xattr.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "common.h"
+#include "vfs.h"
+
+/*
+ * Helper functions to assist in traversing a directory tree using regular
+ * VFS calls.
+ */
+
+/* Scan a filesystem tree. */
+struct scan_fs_tree {
+	unsigned int		nr_dirs;
+	pthread_mutex_t		lock;
+	pthread_cond_t		wakeup;
+	struct stat		root_sb;
+	bool			moveon;
+	scan_fs_tree_dir_fn	dir_fn;
+	scan_fs_tree_dirent_fn	dirent_fn;
+	void			*arg;
+};
+
+/* Per-work-item scan context. */
+struct scan_fs_tree_dir {
+	char			*path;
+	struct scan_fs_tree	*sft;
+	bool			rootdir;
+};
+
+/* Scan a directory sub tree. */
+static void
+scan_fs_dir(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct scan_fs_tree_dir	*sftd = arg;
+	struct scan_fs_tree	*sft = sftd->sft;
+	DIR			*dir;
+	struct dirent		*dirent;
+	char			newpath[PATH_MAX];
+	struct scan_fs_tree_dir	*new_sftd;
+	struct stat		sb;
+	int			dir_fd;
+	int			error;
+
+	/* Open the directory. */
+	dir_fd = open(sftd->path, O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (dir_fd < 0) {
+		if (errno != ENOENT)
+			str_errno(ctx, sftd->path);
+		goto out;
+	}
+
+	/* Caller-specific directory checks. */
+	if (!sft->dir_fn(ctx, sftd->path, dir_fd, sft->arg)) {
+		sft->moveon = false;
+		goto out;
+	}
+
+	/* Iterate the directory entries. */
+	dir = fdopendir(dir_fd);
+	if (!dir) {
+		str_errno(ctx, sftd->path);
+		goto out;
+	}
+	rewinddir(dir);
+	for (dirent = readdir(dir); dirent != NULL; dirent = readdir(dir)) {
+		snprintf(newpath, PATH_MAX, "%s/%s", sftd->path,
+				dirent->d_name);
+
+		/* Get the stat info for this directory entry. */
+		error = fstatat(dir_fd, dirent->d_name, &sb,
+				AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW);
+		if (error) {
+			str_errno(ctx, newpath);
+			continue;
+		}
+
+		/* Ignore files on other filesystems. */
+		if (sb.st_dev != sft->root_sb.st_dev)
+			continue;
+
+		/* Caller-specific directory entry function. */
+		if (!sft->dirent_fn(ctx, newpath, dir_fd, dirent, &sb,
+				sft->arg)) {
+			sft->moveon = false;
+			break;
+		}
+
+		if (xfs_scrub_excessive_errors(ctx)) {
+			sft->moveon = false;
+			break;
+		}
+
+		/* If directory, call ourselves recursively. */
+		if (S_ISDIR(sb.st_mode) && strcmp(".", dirent->d_name) &&
+		    strcmp("..", dirent->d_name)) {
+			new_sftd = malloc(sizeof(struct scan_fs_tree_dir));
+			if (!new_sftd) {
+				str_errno(ctx, newpath);
+				sft->moveon = false;
+				break;
+			}
+			new_sftd->path = strdup(newpath);
+			new_sftd->sft = sft;
+			new_sftd->rootdir = false;
+			pthread_mutex_lock(&sft->lock);
+			sft->nr_dirs++;
+			pthread_mutex_unlock(&sft->lock);
+			queue_work(wq, scan_fs_dir, 0, new_sftd);
+		}
+	}
+
+	/* Close dir, go away. */
+	error = closedir(dir);
+	if (error)
+		str_errno(ctx, sftd->path);
+
+out:
+	pthread_mutex_lock(&sft->lock);
+	sft->nr_dirs--;
+	if (sft->nr_dirs == 0)
+		pthread_cond_signal(&sft->wakeup);
+	pthread_mutex_unlock(&sft->lock);
+
+	free(sftd->path);
+	free(sftd);
+}
+
+/* Scan the entire filesystem. */
+bool
+scan_fs_tree(
+	struct scrub_ctx	*ctx,
+	scan_fs_tree_dir_fn	dir_fn,
+	scan_fs_tree_dirent_fn	dirent_fn,
+	void			*arg)
+{
+	struct work_queue	wq;
+	struct scan_fs_tree	sft;
+	struct scan_fs_tree_dir	*sftd;
+
+	sft.moveon = true;
+	sft.nr_dirs = 1;
+	sft.root_sb = ctx->mnt_sb;
+	sft.dir_fn = dir_fn;
+	sft.dirent_fn = dirent_fn;
+	sft.arg = arg;
+	pthread_mutex_init(&sft.lock, NULL);
+	pthread_cond_init(&sft.wakeup, NULL);
+
+	sftd = malloc(sizeof(struct scan_fs_tree_dir));
+	if (!sftd) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	sftd->path = strdup(ctx->mntpoint);
+	sftd->sft = &sft;
+	sftd->rootdir = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	queue_work(&wq, scan_fs_dir, 0, sftd);
+
+	pthread_mutex_lock(&sft.lock);
+	while (sft.nr_dirs > 0)
+		pthread_cond_wait(&sft.wakeup, &sft.lock);
+	pthread_mutex_unlock(&sft.lock);
+	destroy_work_queue(&wq);
+
+	return sft.moveon;
+}
diff --git a/scrub/vfs.h b/scrub/vfs.h
new file mode 100644
index 0000000..3a3b2dc
--- /dev/null
+++ b/scrub/vfs.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_VFS_H_
+#define XFS_SCRUB_VFS_H_
+
+struct scrub_ctx;
+
+typedef bool (*scan_fs_tree_dir_fn)(struct scrub_ctx *, const char *,
+		int, void *);
+typedef bool (*scan_fs_tree_dirent_fn)(struct scrub_ctx *, const char *,
+		int, struct dirent *, struct stat *, void *);
+
+bool scan_fs_tree(struct scrub_ctx *ctx, scan_fs_tree_dir_fn dir_fn,
+		scan_fs_tree_dirent_fn dirent_fn, void *arg);
+
+#endif /* XFS_SCRUB_VFS_H_ */
diff --git a/scrub/xfs.c b/scrub/xfs.c
index 882bd28..36a5ba1 100644
--- a/scrub/xfs.c
+++ b/scrub/xfs.c
@@ -25,6 +25,7 @@
 #include "../repair/threads.h"
 #include "handle.h"
 #include "path.h"
+#include "vfs.h"
 #include "scrub.h"
 #include "common.h"
 #include "ioctl.h"
@@ -130,3 +131,142 @@ xfs_scan_all_inodes_arg(
 {
 	return xfs_scan_all_inodes_array_arg(ctx, fn, arg, 0);
 }
+
+/* GETFSMAP wrapper routines. */
+struct xfs_scan_blocks {
+	xfs_fsmap_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Iterate all the reverse mappings of an AG. */
+static void
+xfs_scan_ag_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scan_blocks	*sbx = arg;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	struct fsmap		keys[2];
+	off64_t			bperag;
+	bool			moveon;
+
+	bperag = (off64_t)ctx->geo.agblocks *
+		 (off64_t)ctx->geo.blocksize;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u fsmap"),
+				major(ctx->fsinfo.fs_datadev),
+				minor(ctx->fsinfo.fs_datadev),
+				agno);
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = ctx->fsinfo.fs_datadev;
+	keys->fmr_physical = agno * bperag;
+	(keys + 1)->fmr_device = ctx->fsinfo.fs_datadev;
+	(keys + 1)->fmr_physical = ((agno + 1) * bperag) - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * agno;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of a standalone device. */
+static void
+xfs_scan_dev_blocks(
+	struct scrub_ctx	*ctx,
+	int			idx,
+	dev_t			dev,
+	struct xfs_scan_blocks	*sbx)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	void			*fn_arg;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d fsmap"),
+			major(dev), minor(dev));
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = ULLONG_MAX;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * idx;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of the realtime device. */
+static void
+xfs_scan_rt_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+
+	xfs_scan_dev_blocks(ctx, agno, ctx->fsinfo.fs_rtdev, arg);
+}
+
+/* Iterate all the reverse mappings of the log device. */
+static void
+xfs_scan_log_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+
+	xfs_scan_dev_blocks(ctx, agno, ctx->fsinfo.fs_logdev, arg);
+}
+
+/* How many array elements should we create to scan all the blocks? */
+size_t
+xfs_scan_all_blocks_array_size(
+	struct scrub_ctx	*ctx)
+{
+	return ctx->geo.agcount + 2;
+}
+
+/* Scan all the blocks in a filesystem. */
+bool
+xfs_scan_all_blocks_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	struct xfs_scan_blocks	sbx;
+
+	sbx.moveon = true;
+	sbx.fn = fn;
+	sbx.arg = arg;
+	sbx.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	if (ctx->fsinfo.fs_rt)
+		queue_work(&wq, xfs_scan_rt_blocks, ctx->geo.agcount,
+				&sbx);
+	if (ctx->fsinfo.fs_log)
+		queue_work(&wq, xfs_scan_log_blocks, ctx->geo.agcount + 1,
+				&sbx);
+	for (agno = 0; agno < ctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_blocks, agno, &sbx);
+	destroy_work_queue(&wq);
+
+	return sbx.moveon;
+}
diff --git a/scrub/xfs.h b/scrub/xfs.h
index 2e434c2..7d087db 100644
--- a/scrub/xfs.h
+++ b/scrub/xfs.h
@@ -24,6 +24,9 @@ void xfs_shutdown_fs(struct scrub_ctx *ctx);
 bool xfs_scan_all_inodes(struct scrub_ctx *ctx, xfs_inode_iter_fn fn);
 bool xfs_scan_all_inodes_arg(struct scrub_ctx *ctx, xfs_inode_iter_fn fn,
 		void *arg);
+size_t xfs_scan_all_blocks_array_size(struct scrub_ctx *ctx);
+bool xfs_scan_all_blocks_array_arg(struct scrub_ctx *ctx, xfs_fsmap_iter_fn fn,
+		void *arg, size_t array_arg_size);
 
 /* Phase-specific functions. */
 bool xfs_cleanup(struct scrub_ctx *ctx);
@@ -31,5 +34,6 @@ bool xfs_scan_fs(struct scrub_ctx *ctx);
 bool xfs_scan_metadata(struct scrub_ctx *ctx);
 bool xfs_scan_inodes(struct scrub_ctx *ctx);
 bool xfs_scan_connections(struct scrub_ctx *ctx);
+bool xfs_scan_blocks(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_XFS_H_ */



* [PATCH 14/22] xfs_scrub: optionally use SCSI READ VERIFY commands to scrub data blocks on disk
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2017-08-04  0:08 ` [PATCH 13/22] xfs_scrub: scrub file " Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 15/22] xfs_scrub: check summary counters Darrick J. Wong
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

If we sense that we're talking to a raw SCSI disk, use the SCSI READ
VERIFY command to ask the disk to verify its media internally.  This can
sharply reduce the runtime of the data block verification phase on
devices whose internal bandwidth exceeds their link bandwidth.
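
As a rough illustration of what gets sent to the device, here is a
standalone sketch of the byte-to-LBA conversion and the VERIFY(16) CDB
layout (opcode, 8-byte big-endian LBA, 4-byte big-endian block count).
The helper names bytes_to_lba_trunc, bytes_to_lba_roundup and
build_verify16_cdb are made up for this example and are not part of the
patch; the patch itself uses the BTOLBAT/BTOLBA macros and fills the CDB
inline in disk_scsi_verify.

```c
#include <stdint.h>
#include <string.h>

#define VERIFY16_CMDLEN	16
#define VERIFY16_CMD	0x8F

/* Hypothetical helper: round a byte offset down to an LBA. */
static uint64_t
bytes_to_lba_trunc(uint64_t bytes, unsigned int lbalog)
{
	return bytes >> lbalog;
}

/* Hypothetical helper: round a byte count up to whole LBAs. */
static uint64_t
bytes_to_lba_roundup(uint64_t bytes, unsigned int lbalog)
{
	return (bytes + (1ULL << lbalog) - 1) >> lbalog;
}

/*
 * Hypothetical helper: fill a VERIFY(16) CDB.  Byte 1 stays zero
 * (no protection info, DPO, or byte check), bytes 2-9 carry the
 * LBA and bytes 10-13 the block count, both big-endian.
 */
static void
build_verify16_cdb(unsigned char cdb[VERIFY16_CMDLEN], uint64_t lba,
		uint32_t nr_blocks)
{
	int i;

	memset(cdb, 0, VERIFY16_CMDLEN);
	cdb[0] = VERIFY16_CMD;
	for (i = 0; i < 8; i++)
		cdb[2 + i] = (lba >> (56 - 8 * i)) & 0xff;
	for (i = 0; i < 4; i++)
		cdb[10 + i] = (nr_blocks >> (24 - 8 * i)) & 0xff;
}
```

With 512-byte logical blocks (lbalog of 9), a 64KiB extent at byte
offset 1MiB would become a verify of 128 blocks starting at LBA 2048,
before the partition start offset is added.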

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 configure.ac          |    2 +
 include/builddefs.in  |    2 +
 m4/package_libcdev.m4 |   30 ++++++++++
 scrub/Makefile        |    8 +++
 scrub/disk.c          |  145 +++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/disk.h          |    1 
 6 files changed, 187 insertions(+), 1 deletion(-)


diff --git a/configure.ac b/configure.ac
index b2b4566..8b65cf5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -146,6 +146,8 @@ AC_HAVE_GETFSMAP
 AC_HAVE_MALLINFO
 AC_HAVE_OPENAT
 AC_HAVE_FSTATAT
+AC_HAVE_SG_IO
+AC_HAVE_HDIO_GETGEO
 
 if test "$enable_blkid" = yes; then
 AC_HAVE_BLKID_TOPO
diff --git a/include/builddefs.in b/include/builddefs.in
index 65d3d20..c90c76f 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -116,6 +116,8 @@ HAVE_GETFSMAP = @have_getfsmap@
 HAVE_MALLINFO = @have_mallinfo@
 HAVE_OPENAT = @have_openat@
 HAVE_FSTATAT = @have_fstatat@
+HAVE_SG_IO = @have_sg_io@
+HAVE_HDIO_GETGEO = @have_hdio_getgeo@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 540633a..8608d10 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -343,3 +343,33 @@ AC_DEFUN([AC_HAVE_FSTATAT],
        #include <unistd.h>])
     AC_SUBST(have_fstatat)
   ])
+
+#
+# Check if we have the SG_IO ioctl
+#
+AC_DEFUN([AC_HAVE_SG_IO],
+  [ AC_MSG_CHECKING([for struct sg_io_hdr ])
+    AC_TRY_COMPILE([#include <scsi/sg.h>],
+    [
+         struct sg_io_hdr hdr;
+         ioctl(0, SG_IO, &hdr);
+    ], have_sg_io=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_sg_io)
+  ])
+
+#
+# Check if we have the HDIO_GETGEO ioctl
+#
+AC_DEFUN([AC_HAVE_HDIO_GETGEO],
+  [ AC_MSG_CHECKING([for struct hd_geometry ])
+    AC_TRY_COMPILE([#include <linux/hdreg.h>],
+    [
+         struct hd_geometry hdr;
+         ioctl(0, HDIO_GETGEO, &hdr);
+    ], have_hdio_getgeo=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_hdio_getgeo)
+  ])
diff --git a/scrub/Makefile b/scrub/Makefile
index 18a65b9..e8864cc 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -58,6 +58,14 @@ ifeq ($(HAVE_SYNCFS),yes)
 LCFLAGS += -DHAVE_SYNCFS
 endif
 
+ifeq ($(HAVE_SG_IO),yes)
+LCFLAGS += -DHAVE_SG_IO
+endif
+
+ifeq ($(HAVE_HDIO_GETGEO),yes)
+LCFLAGS += -DHAVE_HDIO_GETGEO
+endif
+
 default: depend $(LTCOMMAND)
 
 include $(BUILDRULES)
diff --git a/scrub/disk.c b/scrub/disk.c
index 613e7fd..63aa844 100644
--- a/scrub/disk.c
+++ b/scrub/disk.c
@@ -21,6 +21,12 @@
 #include <sys/statvfs.h>
 #include <sys/types.h>
 #include <dirent.h>
+#ifdef HAVE_SG_IO
+# include <scsi/sg.h>
+#endif
+#ifdef HAVE_HDIO_GETGEO
+# include <linux/hdreg.h>
+#endif
 #include "../repair/threads.h"
 #include "path.h"
 #include "disk.h"
@@ -80,12 +86,119 @@ disk_heads(
 	return __disk_heads(disk);
 }
 
+/*
+ * Execute a SCSI VERIFY(16) to verify disk contents.
+ * For devices that support this command, this can sharply reduce the
+ * runtime of the data block verification phase if the storage device's
+ * internal bandwidth exceeds its link bandwidth.  However, it only
+ * works if we're talking to a raw SCSI device, and only if we trust the
+ * firmware.
+ */
+#ifdef HAVE_SG_IO
+# define SENSE_BUF_LEN		64
+# define VERIFY16_CMDLEN	16
+# define VERIFY16_CMD		0x8F
+
+# ifndef SG_FLAG_Q_AT_TAIL
+#  define SG_FLAG_Q_AT_TAIL	0x10
+# endif
+static int
+disk_scsi_verify(
+	struct disk		*disk,
+	uint64_t		startblock, /* lba */
+	uint64_t		blockcount) /* lba */
+{
+	struct sg_io_hdr	iohdr;
+	unsigned char		cdb[VERIFY16_CMDLEN];
+	unsigned char		sense[SENSE_BUF_LEN];
+	uint64_t		llba;
+	uint64_t		veri_len = blockcount;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"));
+
+	llba = startblock + (disk->d_start >> BBSHIFT);
+
+	/* Borrowed from sg_verify */
+	cdb[0] = VERIFY16_CMD;
+	cdb[1] = 0; /* skip PI, DPO, and byte check. */
+	cdb[2] = (llba >> 56) & 0xff;
+	cdb[3] = (llba >> 48) & 0xff;
+	cdb[4] = (llba >> 40) & 0xff;
+	cdb[5] = (llba >> 32) & 0xff;
+	cdb[6] = (llba >> 24) & 0xff;
+	cdb[7] = (llba >> 16) & 0xff;
+	cdb[8] = (llba >> 8) & 0xff;
+	cdb[9] = llba & 0xff;
+	cdb[10] = (veri_len >> 24) & 0xff;
+	cdb[11] = (veri_len >> 16) & 0xff;
+	cdb[12] = (veri_len >> 8) & 0xff;
+	cdb[13] = veri_len & 0xff;
+	cdb[14] = 0;
+	cdb[15] = 0;
+	memset(sense, 0, SENSE_BUF_LEN);
+
+	/* v3 SG_IO */
+	memset(&iohdr, 0, sizeof(iohdr));
+	iohdr.interface_id = 'S';
+	iohdr.dxfer_direction = SG_DXFER_NONE;
+	iohdr.cmdp = cdb;
+	iohdr.cmd_len = VERIFY16_CMDLEN;
+	iohdr.sbp = sense;
+	iohdr.mx_sb_len = SENSE_BUF_LEN;
+	iohdr.flags |= SG_FLAG_Q_AT_TAIL;
+	iohdr.timeout = 30000; /* 30s */
+
+	error = ioctl(disk->d_fd, SG_IO, &iohdr);
+	if (error)
+		return error;
+
+	dbg_printf("VERIFY(16) fd %d lba %"PRIu64" len %"PRIu64" info %x "
+			"status %d masked %d msg %d host %d driver %d "
+			"duration %d resid %d\n",
+			disk->d_fd, startblock, blockcount, iohdr.info,
+			iohdr.status, iohdr.masked_status, iohdr.msg_status,
+			iohdr.host_status, iohdr.driver_status, iohdr.duration,
+			iohdr.resid);
+
+	if (iohdr.info & SG_INFO_CHECK) {
+		dbg_printf("status: msg %x host %x driver %x\n",
+				iohdr.msg_status, iohdr.host_status,
+				iohdr.driver_status);
+		errno = EIO;
+		return -1;
+	}
+
+	return error;
+}
+#else
+# define disk_scsi_verify(...)		(ENOTTY)
+#endif /* HAVE_SG_IO */
+
+/* Test the availability of the SCSI VERIFY command. */
+static bool
+disk_can_scsi_verify(
+	struct disk		*disk)
+{
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"))
+		return false;
+
+	error = disk_scsi_verify(disk, 0, 1);
+	return error == 0;
+}
+
 /* Open a disk device and discover its geometry. */
 int
 disk_open(
 	const char		*pathname,
 	struct disk		*disk)
 {
+#ifdef HAVE_HDIO_GETGEO
+	struct hd_geometry	bdgeo;
+#endif
+	bool			suspicious_disk = false;
 	int			lba_sz;
 	int			error;
 
@@ -117,13 +230,34 @@ disk_open(
 		error = ioctl(disk->d_fd, BLKBSZGET, &disk->d_blksize);
 		if (error)
 			disk->d_blksize = 0;
-		disk->d_start = 0;
+#ifdef HAVE_HDIO_GETGEO
+		error = ioctl(disk->d_fd, HDIO_GETGEO, &bdgeo);
+		if (!error) {
+			/*
+			 * dm devices will pass through ioctls, which means
+			 * we can't use SCSI VERIFY unless the start is 0.
+			 * Most dm devices don't set geometry (unlike scsi
+			 * and nvme) so use a zeroed out CHS to screen them
+			 * out.
+			 */
+			if (bdgeo.start != 0 &&
+			    (unsigned long long)bdgeo.heads * bdgeo.sectors *
+					bdgeo.cylinders == 0)
+				suspicious_disk = true;
+			disk->d_start = bdgeo.start << BBSHIFT;
+		} else
+#endif
+			disk->d_start = 0;
 	} else {
 		disk->d_size = disk->d_sb.st_size;
 		disk->d_blksize = disk->d_sb.st_blksize;
 		disk->d_start = 0;
 	}
 
+	/* Can we issue SCSI VERIFY? */
+	if (!suspicious_disk && disk_can_scsi_verify(disk))
+		disk->d_flags |= DISK_FLAG_SCSI_VERIFY;
+
 	return 0;
 }
 
@@ -148,6 +282,10 @@ disk_is_open(
 	return disk->d_fd >= 0;
 }
 
+#define BTOLBAT(d, bytes)	((uint64_t)(bytes) >> (d)->d_lbalog)
+#define LBASIZE(d)		(1ULL << (d)->d_lbalog)
+#define BTOLBA(d, bytes)	(((uint64_t)(bytes) + LBASIZE(d) - 1) >> (d)->d_lbalog)
+
 /* Read-verify an extent of a disk device. */
 ssize_t
 disk_read_verify(
@@ -156,5 +294,10 @@ disk_read_verify(
 	uint64_t		start,
 	uint64_t		length)
 {
+	/* Convert to logical block size. */
+	if (disk->d_flags & DISK_FLAG_SCSI_VERIFY)
+		return disk_scsi_verify(disk, BTOLBAT(disk, start),
+				BTOLBA(disk, length));
+
 	return pread(disk->d_fd, buf, length, start);
 }
diff --git a/scrub/disk.h b/scrub/disk.h
index 797fd71..34eb0a0 100644
--- a/scrub/disk.h
+++ b/scrub/disk.h
@@ -20,6 +20,7 @@
 #ifndef XFS_SCRUB_DISK_H_
 #define XFS_SCRUB_DISK_H_
 
+#define DISK_FLAG_SCSI_VERIFY	0x1
 struct disk {
 	struct stat	d_sb;
 	int		d_fd;
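
A note on the byte-to-LBA conversion in disk_read_verify(): the start offset
is truncated with BTOLBAT() and the length is rounded up on its own with
BTOLBA().  If a caller ever passes a start that is not aligned to the logical
block size (this section does not show whether that can happen), the last
partial block of the request is not covered.  The following is only a sketch
of an alternative that rounds the end of the range instead; struct lba_range
and bytes_to_lba_range() are hypothetical names that do not appear in the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not part of the patch. */
struct lba_range {
	uint64_t	first;	/* first logical block touched */
	uint64_t	count;	/* number of logical blocks touched */
};

/* Cover the byte range [start, start + length) with whole logical blocks. */
static struct lba_range
bytes_to_lba_range(
	unsigned int		lbalog,
	uint64_t		start,
	uint64_t		length)
{
	struct lba_range	r;
	uint64_t		lbasize = 1ULL << lbalog;
	uint64_t		end = start + length;	/* exclusive */

	/* Round the start down and the end up to a block boundary. */
	r.first = start >> lbalog;
	r.count = ((end + lbasize - 1) >> lbalog) - r.first;
	return r;
}
```

With 512-byte logical blocks (lbalog = 9), the byte range [256, 768) maps to
first = 0 and count = 2, whereas BTOLBAT(start) paired with BTOLBA(length)
yields a count of 1.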



* [PATCH 15/22] xfs_scrub: check summary counters
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (13 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 14/22] xfs_scrub: optionally use SCSI READ VERIFY commands to scrub data blocks on disk Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 16/22] xfs_scrub: wire up repair ioctl Darrick J. Wong
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Make sure the filesystem summary counters are within 10% (or within a
small absolute slack) of what we find by scanning the filesystem with
GETFSMAP and BULKSTAT.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
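For readers who want to see how the tolerance check behaves, here is a
standalone restatement of the within_range() logic from this patch.  It is a
sketch only: the name within_range_sketch() is made up, and the ctx and descr
arguments are dropped because the helper does not use them.

```c
#include <assert.h>
#include <stdbool.h>

/* Is value within +/- (n/d) of desired, or within abs_threshold of it? */
static bool
within_range_sketch(
	unsigned long long	value,
	unsigned long long	desired,
	unsigned long long	abs_threshold,
	unsigned int		n,
	unsigned int		d)
{
	assert(n < d);

	/* Small absolute differences are always tolerated. */
	if (value < desired && desired - value < abs_threshold)
		return true;
	if (value > desired && value - desired < abs_threshold)
		return true;

	/* Otherwise the value must lie within desired * (1 +/- n/d). */
	if (value < desired * (d - n) / d)
		return false;
	if (value > desired * (d + n) / d)
		return false;

	return true;
}
```

With n = 1 and d = 10 as used in phase 7, a desired count of 1000 inodes and
an absolute slack of 100 accepts an observed 905 (difference below the slack)
and an observed 900 (exactly on the 10% boundary), but rejects 850.
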
 scrub/Makefile |    1 
 scrub/common.c |   28 +++++++
 scrub/common.h |    3 +
 scrub/phase7.c |  236 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/scrub.c  |    4 -
 scrub/xfs.c    |   63 +++++++++++++++
 scrub/xfs.h    |    7 ++
 7 files changed, 338 insertions(+), 4 deletions(-)
 create mode 100644 scrub/phase7.c
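
The data device accounting in xfs_record_block_summary() has to cope with
fsmap records that overlap, presumably because shared extents are reported
once per owner.  As a sketch of the idea, assuming records arrive sorted by
physical offset, one can count only the part of each record that lies past
the highest byte seen so far.  struct extent_counter and count_extent() are
hypothetical names used for illustration only.

```c
#include <assert.h>

/* Hypothetical running total, not part of the patch. */
struct extent_counter {
	unsigned long long	bytes;		/* non-overlapping bytes seen */
	unsigned long long	next_phys;	/* end of highest extent seen */
};

/* Account one extent; extents must arrive sorted by physical start. */
static void
count_extent(
	struct extent_counter	*c,
	unsigned long long	physical,
	unsigned long long	length)
{
	unsigned long long	end = physical + length;

	/* Entirely inside space that was already counted. */
	if (c->next_phys >= end)
		return;

	/* Partial overlap: count only the bytes past next_phys. */
	if (c->next_phys > physical)
		length = end - c->next_phys;

	c->bytes += length;
	c->next_phys = end;
}
```

Feeding it [0, 10), then [5, 20), then [8, 12) counts 20 bytes in total: the
second record contributes only its last 10 bytes and the third contributes
nothing.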


diff --git a/scrub/Makefile b/scrub/Makefile
index e8864cc..461df83 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -41,6 +41,7 @@ phase2.c \
 phase3.c \
 phase5.c \
 phase6.c \
+phase7.c \
 read_verify.c \
 scrub.c \
 vfs.c \
diff --git a/scrub/common.c b/scrub/common.c
index 167d373..4ec07a0 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -347,3 +347,31 @@ background_sleep(void)
 	tv.tv_nsec = time % 1000000;
 	nanosleep(&tv, NULL);
 }
+
+/* Decide if a value is within +/- (n/d) of a desired value. */
+bool
+within_range(
+	struct scrub_ctx	*ctx,
+	unsigned long long	value,
+	unsigned long long	desired,
+	unsigned long long	abs_threshold,
+	unsigned int		n,
+	unsigned int		d,
+	const char		*descr)
+{
+	assert(n < d);
+
+	/* Don't complain if difference does not exceed an absolute value. */
+	if (value < desired && desired - value < abs_threshold)
+		return true;
+	if (value > desired && value - desired < abs_threshold)
+		return true;
+
+	/* Complain if the difference exceeds a certain percentage. */
+	if (value < desired * (d - n) / d)
+		return false;
+	if (value > desired * (d + n) / d)
+		return false;
+
+	return true;
+}
diff --git a/scrub/common.h b/scrub/common.h
index 7bbd061..7c35f3f 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -71,5 +71,8 @@ static inline int syncfs(int fd)
 
 bool find_mountpoint(char *mtab, struct scrub_ctx *ctx);
 void background_sleep(void);
+bool within_range(struct scrub_ctx *ctx, unsigned long long value,
+		unsigned long long desired, unsigned long long abs_threshold,
+		unsigned int n, unsigned int d, const char *descr);
 
 #endif /* XFS_SCRUB_COMMON_H_ */
diff --git a/scrub/phase7.c b/scrub/phase7.c
new file mode 100644
index 0000000..bdb4a79
--- /dev/null
+++ b/scrub/phase7.c
@@ -0,0 +1,236 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "vfs.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+#include "xfs.h"
+
+/* Phase 7: Check summary counters. */
+
+struct xfs_summary_counts {
+	unsigned long long	inodes;		/* number of inodes */
+	unsigned long long	dbytes;		/* data dev bytes */
+	unsigned long long	rbytes;		/* rt dev bytes */
+	unsigned long long	next_phys;	/* next phys bytes we see? */
+	unsigned long long	agbytes;	/* freespace bytes */
+};
+
+struct xfs_inode_fork_summary {
+	struct bitmap		*tree;
+	unsigned long long	bytes;
+};
+
+/* Record inode usage. */
+static int
+xfs_record_inode_summary(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	struct xfs_summary_counts	*counts = arg;
+
+	counts->inodes++;
+	return 0;
+}
+
+/* Record block usage. */
+static bool
+xfs_record_block_summary(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*fsmap,
+	void				*arg)
+{
+	struct xfs_summary_counts	*counts = arg;
+	unsigned long long		len;
+
+	if (fsmap->fmr_device == ctx->fsinfo.fs_logdev)
+		return true;
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == XFS_FMR_OWN_FREE)
+		return true;
+
+	len = fsmap->fmr_length;
+
+	/* freesp btrees live in free space, need to adjust counters later. */
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == XFS_FMR_OWN_AG) {
+		counts->agbytes += fsmap->fmr_length;
+	}
+	if (fsmap->fmr_device == ctx->fsinfo.fs_rtdev) {
+		/* Count realtime extents. */
+		counts->rbytes += len;
+	} else {
+		/* Count datadev extents. */
+		if (counts->next_phys >= fsmap->fmr_physical + len)
+			return true;
+		else if (counts->next_phys > fsmap->fmr_physical)
+			len = fsmap->fmr_physical + len - counts->next_phys;
+		counts->dbytes += len;
+		counts->next_phys = fsmap->fmr_physical + fsmap->fmr_length;
+	}
+
+	return true;
+}
+
+/*
+ * Count all inodes and blocks in the filesystem as told by GETFSMAP and
+ * BULKSTAT, and compare that to summary counters.  Since this is a live
+ * filesystem we'll be content if the summary counts are within 10% of
+ * what we observed.
+ */
+bool
+xfs_scan_summary(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_summary_counts	*summary;
+	unsigned long long		fd;
+	unsigned long long		fr;
+	unsigned long long		fi;
+	unsigned long long		sd;
+	unsigned long long		sr;
+	unsigned long long		si;
+	unsigned long long		absdiff;
+	unsigned long long		d_blocks;
+	unsigned long long		d_bfree;
+	unsigned long long		r_blocks;
+	unsigned long long		r_bfree;
+	unsigned long long		f_files;
+	unsigned long long		f_free;
+	xfs_agnumber_t			agno;
+	bool				moveon = false;
+	bool				complain;
+	unsigned int			groups;
+	int				error;
+
+	groups = xfs_scan_all_blocks_array_size(ctx);
+	summary = calloc(groups, sizeof(struct xfs_summary_counts));
+	if (!summary) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/* Flush everything out to disk before we start counting. */
+	error = syncfs(ctx->mnt_fd);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out;
+	}
+
+	/* Use fsmap to count blocks. */
+	moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_record_block_summary,
+			summary, sizeof(*summary));
+	if (!moveon)
+		goto out;
+
+	/* Scan the whole fs. */
+	moveon = xfs_scan_all_inodes_array_arg(ctx, xfs_record_inode_summary,
+			summary, sizeof(*summary));
+	if (!moveon)
+		goto out;
+
+	/* Sum the counts. */
+	for (agno = 1; agno < groups; agno++) {
+		summary[0].inodes += summary[agno].inodes;
+		summary[0].dbytes += summary[agno].dbytes;
+		summary[0].rbytes += summary[agno].rbytes;
+		summary[0].agbytes += summary[agno].agbytes;
+	}
+
+	moveon = xfs_scan_estimate_blocks(ctx, &d_blocks, &d_bfree, &r_blocks,
+			&r_bfree, &f_files, &f_free);
+	if (!moveon)
+		goto out;
+
+	/*
+	 * If we counted blocks with fsmap, then dblocks includes
+	 * blocks for the AGFL and the freespace/rmap btrees.  The
+	 * filesystem treats them as "free", but since we scanned
+	 * them, we'll consider them used.
+	 */
+	d_bfree -= summary[0].agbytes >> ctx->blocklog;
+
+	/* Report on what we found. */
+	fd = (d_blocks - d_bfree) << ctx->blocklog;
+	fr = (r_blocks - r_bfree) << ctx->blocklog;
+	fi = f_files - f_free;
+	sd = summary[0].dbytes;
+	sr = summary[0].rbytes;
+	si = summary[0].inodes;
+
+	/*
+	 * Complain if the counts are off by more than 10% unless
+	 * the inaccuracy is less than 32MB worth of blocks or 100 inodes.
+	 */
+	absdiff = 1ULL << 25;
+	complain = !within_range(ctx, sd, fd, absdiff, 1, 10, _("data blocks"));
+	complain |= !within_range(ctx, sr, fr, absdiff, 1, 10, _("realtime blocks"));
+	complain |= !within_range(ctx, si, fi, 100, 1, 10, _("inodes"));
+
+	if (complain || verbose) {
+		double		d, r, i;
+		char		*du, *ru, *iu;
+
+		if (fr || sr) {
+			d = auto_space_units(fd, &du);
+			r = auto_space_units(fr, &ru);
+			i = auto_units(fi, &iu);
+			fprintf(stdout,
+_("%.1f%s data used;  %.1f%s realtime data used;  %.2f%s inodes used.\n"),
+					d, du, r, ru, i, iu);
+			d = auto_space_units(sd, &du);
+			r = auto_space_units(sr, &ru);
+			i = auto_units(si, &iu);
+			fprintf(stdout,
+_("%.1f%s data found; %.1f%s realtime data found; %.2f%s inodes found.\n"),
+					d, du, r, ru, i, iu);
+		} else {
+			d = auto_space_units(fd, &du);
+			i = auto_units(fi, &iu);
+			fprintf(stdout,
+_("%.1f%s data used;  %.1f%s inodes used.\n"),
+					d, du, i, iu);
+			d = auto_space_units(sd, &du);
+			i = auto_units(si, &iu);
+			fprintf(stdout,
+_("%.1f%s data found; %.1f%s inodes found.\n"),
+					d, du, i, iu);
+		}
+		fflush(stdout);
+	}
+	moveon = true;
+
+out:
+	free(summary);
+	return moveon;
+}
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 97bd795..647e050 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -448,6 +448,7 @@ run_scrub_phases(
 		},
 		{
 			.descr = _("Check summary counters."),
+			.fn = xfs_scan_summary,
 		},
 		{
 			NULL
@@ -517,9 +518,6 @@ main(
 	int			ret;
 	int			error;
 
-	fprintf(stderr, "XXX: This program is not complete!\n");
-	return 4;
-
 	progname = basename(argv[0]);
 	setlocale(LC_ALL, "");
 	bindtextdomain(PACKAGE, LOCALEDIR);
diff --git a/scrub/xfs.c b/scrub/xfs.c
index 36a5ba1..4db0267 100644
--- a/scrub/xfs.c
+++ b/scrub/xfs.c
@@ -91,7 +91,7 @@ xfs_scan_all_inodes_array_size(
 }
 
 /* Scan all the inodes in a filesystem. */
-static bool
+bool
 xfs_scan_all_inodes_array_arg(
 	struct scrub_ctx	*ctx,
 	xfs_inode_iter_fn	fn,
@@ -270,3 +270,64 @@ xfs_scan_all_blocks_array_arg(
 
 	return sbx.moveon;
 }
+
+/* Estimate the number of blocks and inodes in the filesystem. */
+bool
+xfs_scan_estimate_blocks(
+	struct scrub_ctx		*ctx,
+	unsigned long long		*d_blocks,
+	unsigned long long		*d_bfree,
+	unsigned long long		*r_blocks,
+	unsigned long long		*r_bfree,
+	unsigned long long		*f_files,
+	unsigned long long		*f_free)
+{
+	struct xfs_fsop_counts		fc;
+	struct xfs_fsop_resblks		rb = { 0 };
+	struct xfs_fsop_ag_resblks	arb;
+	struct statvfs			sfs;
+	int				error;
+
+	/* Grab the fstatvfs counters, since it has to report accurately. */
+	error = fstatvfs(ctx->mnt_fd, &sfs);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/* Fetch the filesystem counters. */
+	error = ioctl(ctx->mnt_fd, XFS_IOC_FSCOUNTS, &fc);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/*
+	 * XFS reserves some blocks to prevent hard ENOSPC, so add those
+	 * blocks back to the free data counts.
+	 */
+	error = ioctl(ctx->mnt_fd, XFS_IOC_GET_RESBLKS, &rb);
+	if (error)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += rb.resblks_avail;
+
+	/*
+	 * XFS with rmap or reflink reserves blocks in each AG to
+	 * prevent the AG from running out of space for metadata blocks.
+	 * Add those back to the free data counts.
+	 */
+	memset(&arb, 0, sizeof(arb));
+	error = ioctl(ctx->mnt_fd, XFS_IOC_GET_AG_RESBLKS, &arb);
+	if (error && errno != ENOTTY)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += arb.ar_current_resv;
+
+	*d_blocks = ctx->geo.datablocks;
+	*d_bfree = sfs.f_bfree;
+	*r_blocks = ctx->geo.rtblocks;
+	*r_bfree = fc.freertx;
+	*f_files = sfs.f_files;
+	*f_free = sfs.f_ffree;
+
+	return true;
+}
diff --git a/scrub/xfs.h b/scrub/xfs.h
index 7d087db..996f791 100644
--- a/scrub/xfs.h
+++ b/scrub/xfs.h
@@ -24,9 +24,15 @@ void xfs_shutdown_fs(struct scrub_ctx *ctx);
 bool xfs_scan_all_inodes(struct scrub_ctx *ctx, xfs_inode_iter_fn fn);
 bool xfs_scan_all_inodes_arg(struct scrub_ctx *ctx, xfs_inode_iter_fn fn,
 		void *arg);
+bool xfs_scan_all_inodes_array_arg(struct scrub_ctx *ctx, xfs_inode_iter_fn fn,
+		void *arg, size_t array_arg_size);
 size_t xfs_scan_all_blocks_array_size(struct scrub_ctx *ctx);
 bool xfs_scan_all_blocks_array_arg(struct scrub_ctx *ctx, xfs_fsmap_iter_fn fn,
 		void *arg, size_t array_arg_size);
+bool xfs_scan_estimate_blocks(struct scrub_ctx *ctx,
+		unsigned long long *d_blocks, unsigned long long *d_bfree,
+		unsigned long long *r_blocks, unsigned long long *r_bfree,
+		unsigned long long *f_files, unsigned long long *f_free);
 
 /* Phase-specific functions. */
 bool xfs_cleanup(struct scrub_ctx *ctx);
@@ -35,5 +41,6 @@ bool xfs_scan_metadata(struct scrub_ctx *ctx);
 bool xfs_scan_inodes(struct scrub_ctx *ctx);
 bool xfs_scan_connections(struct scrub_ctx *ctx);
 bool xfs_scan_blocks(struct scrub_ctx *ctx);
+bool xfs_scan_summary(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_XFS_H_ */



* [PATCH 16/22] xfs_scrub: wire up repair ioctl
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (14 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 15/22] xfs_scrub: check summary counters Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 17/22] xfs_scrub: schedule and manage repairs to the filesystem Darrick J. Wong
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create the mechanism we need to actually call the kernel's online repair
functionality.  The interface will consume a repair description; the
descriptor management will follow in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
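As a reading aid, the errno handling in xfs_repair_metadata() boils down to
three outcomes: retry later, abort the run, or consider the item finished
(with or without a complaint).  The sketch below restates only that coarse
mapping.  enum outcome_sketch and classify_repair_errno() are hypothetical
names, the message printing is omitted, and the code assumes a Linux
<errno.h> that defines EDEADLOCK and ESHUTDOWN.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the patch's enum check_outcome. */
enum outcome_sketch {
	OUTCOME_DONE,	/* nothing more to do for this item */
	OUTCOME_RETRY,	/* filesystem busy, try again later */
	OUTCOME_ABORT,	/* filesystem shut down, stop everything */
};

/* Map an errno from the repair ioctl to a coarse outcome. */
static enum outcome_sketch
classify_repair_errno(
	int			err)
{
	switch (err) {
	case EDEADLOCK:
	case EBUSY:
		return OUTCOME_RETRY;
	case ESHUTDOWN:
		return OUTCOME_ABORT;
	default:
		/* ENOTTY, EINVAL, EROFS, ENOENT, ENOMEM, ENOSPC, ... */
		return OUTCOME_DONE;
	}
}
```

Only EBUSY and EDEADLOCK lead to a retry; every other failure ends handling
of the item, and the real function decides per errno whether to log anything.
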
 scrub/common.c |   51 +++++++++++++++++++++++
 scrub/common.h |    2 +
 scrub/ioctl.c  |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/ioctl.h  |   20 +++++++++
 scrub/phase1.c |   15 +++++++
 scrub/scrub.h  |    2 +
 6 files changed, 211 insertions(+), 1 deletion(-)


diff --git a/scrub/common.c b/scrub/common.c
index 4ec07a0..7ca0e78 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -157,6 +157,57 @@ __str_info(
 	pthread_mutex_unlock(&ctx->lock);
 }
 
+/* Increment the repair count. */
+void
+__record_repair(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, _("Repaired: %s: "), descr);
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+	if (debug)
+		fprintf(stderr, _(" (%s line %d)"), file, line);
+	fprintf(stderr, "\n");
+	ctx->repairs++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Increment the optimization (preening) count. */
+void
+__record_preen(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*file,
+	int			line,
+	const char		*format,
+	...)
+{
+	va_list			args;
+
+	pthread_mutex_lock(&ctx->lock);
+	if (debug || verbose) {
+		fprintf(stdout, _("Optimized: %s: "), descr);
+		va_start(args, format);
+		vfprintf(stdout, format, args);
+		va_end(args);
+		if (debug)
+			fprintf(stdout, _(" (%s line %d)"), file, line);
+		fprintf(stdout, "\n");
+		fflush(stdout);
+	}
+	ctx->preens++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
 /* Catch fatal errors from pieces we import from xfs_repair. */
 void __attribute__((noreturn))
 do_error(char const *msg, ...)
diff --git a/scrub/common.h b/scrub/common.h
index 7c35f3f..2e4ff05 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -46,6 +46,8 @@ void __record_preen(struct scrub_ctx *ctx, const char *descr, const char *file,
 #define str_error(ctx, str, ...)	__str_error(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
 #define str_warn(ctx, str, ...)		__str_warn(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
 #define str_info(ctx, str, ...)		__str_info(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_repair(ctx, str, ...)	__record_repair(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define record_preen(ctx, str, ...)	__record_preen(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
 #define dbg_printf(fmt, ...)		{if (debug > 1) {printf(fmt, __VA_ARGS__);}}
 
 /* Is this debug tweak enabled? */
diff --git a/scrub/ioctl.c b/scrub/ioctl.c
index a3b7c04..3ed9758 100644
--- a/scrub/ioctl.c
+++ b/scrub/ioctl.c
@@ -457,7 +457,7 @@ xfs_can_iterate_fsmap(
 	return error == 0 && (head.fmh_oflags & FMH_OF_DEV_T);
 }
 
-/* Online scrub. */
+/* Online scrub and repair. */
 
 /* Type info and names for the scrub types. */
 enum scrub_type {
@@ -915,6 +915,119 @@ xfs_scrub_parent(
 	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_PARENT);
 }
 
+/* Repair some metadata. */
+enum check_outcome
+xfs_repair_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct repair_item		*ri,
+	unsigned int			repair_flags)
+{
+	char				buf[DESCR_BUFSZ];
+	struct xfs_scrub_metadata	meta = { 0 };
+	struct xfs_scrub_metadata	oldm;
+	int				error;
+
+	assert(ri->type < XFS_SCRUB_TYPE_NR);
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	meta.sm_type = ri->type;
+	meta.sm_flags = ri->flags | XFS_SCRUB_IFLAG_REPAIR;
+	switch (scrubbers[ri->type].type) {
+	case ST_AGHEADER:
+	case ST_PERAG:
+		meta.sm_agno = ri->agno;
+		break;
+	case ST_INODE:
+		meta.sm_ino = ri->ino;
+		meta.sm_gen = ri->gen;
+		break;
+	default:
+		break;
+	}
+
+	/*
+	 * If this is a preen operation but we're only repairing
+	 * critical items, defer the preening until later.
+	 */
+	if (!needs_repair(&meta) && (repair_flags & XRM_REPAIR_ONLY))
+		return CHECK_RETRY;
+
+	memcpy(&oldm, &meta, sizeof(oldm));
+	format_scrub_descr(buf, DESCR_BUFSZ, &meta, &scrubbers[meta.sm_type]);
+
+	if (needs_repair(&meta))
+		str_info(ctx, buf, _("Attempting repair."));
+	else if (debug || verbose)
+		str_info(ctx, buf, _("Attempting optimization."));
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &meta);
+	if (error) {
+		switch (errno) {
+		case EDEADLOCK:
+		case EBUSY:
+			/* Filesystem is busy, try again later. */
+			if (debug || verbose)
+				str_info(ctx, buf,
+_("Filesystem is busy, deferring repair."));
+			return CHECK_RETRY;
+		case ESHUTDOWN:
+			/* Filesystem is already shut down, abort. */
+			str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+			return CHECK_ABORT;
+		case ENOTTY:
+		case EOPNOTSUPP:
+			/*
+			 * If we forced repairs, don't complain if kernel
+			 * doesn't know how to fix.
+			 */
+			if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+				return CHECK_DONE;
+			/* fall through */
+		case EINVAL:
+			/* Kernel doesn't know how to repair this? */
+			if (repair_flags & XRM_NOFIX_COMPLAIN)
+				str_error(ctx, buf,
+_("Don't know how to fix; offline repair required."));
+			return CHECK_DONE;
+		case EROFS:
+			/* Read-only filesystem, can't fix. */
+			if (verbose || debug || needs_repair(&oldm))
+				str_info(ctx, buf,
+_("Read-only filesystem; cannot make changes."));
+			return CHECK_DONE;
+		case ENOENT:
+			/* Metadata not present, just skip it. */
+			return CHECK_DONE;
+		case ENOMEM:
+		case ENOSPC:
+			/* Don't care if preen fails due to low resources. */
+			if (is_unoptimized(&oldm) && !needs_repair(&oldm))
+				return CHECK_DONE;
+			/* fall through */
+		default:
+			/* Operational error. */
+			str_errno(ctx, buf);
+			return CHECK_DONE;
+		}
+	}
+	if (repair_flags & XRM_NOFIX_COMPLAIN)
+		xfs_scrub_warn_incomplete_scrub(ctx, buf, &meta);
+	if (needs_repair(&meta)) {
+		/* Still broken, try again or fix offline. */
+		if (repair_flags & XRM_NOFIX_COMPLAIN)
+			str_error(ctx, buf,
+_("Repair unsuccessful; offline repair required."));
+	} else {
+		/* Clean operation, no corruption detected. */
+		if (needs_repair(&oldm))
+			record_repair(ctx, buf, _("Repairs successful."));
+		else
+			record_preen(ctx, buf, _("Optimization successful."));
+	}
+	return CHECK_DONE;
+}
+
 /* Test the availability of a kernel scrub command. */
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR	30
 static bool
@@ -1019,3 +1132,10 @@ xfs_can_scrub_parent(
 {
 	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_PARENT, false);
 }
+
+bool
+xfs_can_repair(
+	struct scrub_ctx	*ctx)
+{
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_TEST, true);
+}
diff --git a/scrub/ioctl.h b/scrub/ioctl.h
index ee2ac26..4042f79 100644
--- a/scrub/ioctl.h
+++ b/scrub/ioctl.h
@@ -62,11 +62,30 @@ enum check_outcome {
 	CHECK_RETRY,	/* repair failed, try again later */
 };
 
+/* Repair parameters: the scrub inputs needed to reissue the request. */
+struct repair_item {
+	struct list_head	list;
+	__u64			ino;
+	__u32			type;
+	__u32			flags;
+	__u32			gen;
+	__u32			agno;
+};
+
 void xfs_scrub_report_preen_triggers(struct scrub_ctx *ctx);
 bool xfs_scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno);
 bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno);
 bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx);
 
+/* Only perform repairs; leave optimization-only actions for later. */
+#define XRM_REPAIR_ONLY		(1 << 0)
+
+/* Complain if still broken even after fix. */
+#define XRM_NOFIX_COMPLAIN	(1 << 1)
+
+enum check_outcome xfs_repair_metadata(struct scrub_ctx *ctx, int fd,
+		struct repair_item *ri, unsigned int repair_flags);
+
 bool xfs_can_scrub_fs_metadata(struct scrub_ctx *ctx);
 bool xfs_can_scrub_inode(struct scrub_ctx *ctx);
 bool xfs_can_scrub_bmap(struct scrub_ctx *ctx);
@@ -74,6 +93,7 @@ bool xfs_can_scrub_dir(struct scrub_ctx *ctx);
 bool xfs_can_scrub_attr(struct scrub_ctx *ctx);
 bool xfs_can_scrub_symlink(struct scrub_ctx *ctx);
 bool xfs_can_scrub_parent(struct scrub_ctx *ctx);
+bool xfs_can_repair(struct scrub_ctx *ctx);
 
 bool xfs_scrub_inode_fields(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
 		int fd);
diff --git a/scrub/phase1.c b/scrub/phase1.c
index 66f4aa3..6c0a31d 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -118,6 +118,21 @@ _("Does not appear to be an XFS filesystem!"));
 	    !xfs_can_scrub_parent(ctx))
 		goto err;
 
+	/* Do we have kernel-assisted metadata repair? */
+	if (ctx->mode != SCRUB_MODE_DRY_RUN && !xfs_can_repair(ctx)) {
+		if (ctx->mode == SCRUB_MODE_PREEN) {
+			/* w/o repair, demote preen to dry run. */
+			if (debug || verbose)
+				str_info(ctx, ctx->mntpoint,
+_("Metadata repairing not supported; demoting to scan mode.")
+						);
+			ctx->mode = SCRUB_MODE_DRY_RUN;
+		} else {
+			/* Repair mode w/o repair; abort. */
+			goto err;
+		}
+	}
+
 	/* Go find the XFS devices if we have a usable fsmap. */
 	fs_table_initialise(0, NULL, 0, NULL);
 	errno = 0;
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 0b82d9f..7e835ec 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -100,6 +100,8 @@ struct scrub_ctx {
 	unsigned long long	runtime_errors;
 	unsigned long long	errors_found;
 	unsigned long long	warnings_found;
+	unsigned long long	repairs;
+	unsigned long long	preens;
 	bool			preen_triggers[XFS_SCRUB_TYPE_NR];
 };
 



* [PATCH 17/22] xfs_scrub: schedule and manage repairs to the filesystem
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (15 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 16/22] xfs_scrub: wire up repair ioctl Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 18/22] xfs_scrub: fstrim the free areas if there are no errors on " Darrick J. Wong
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Teach xfs_scrub to remember scrub requests that failed (or indicated
that optimization is a possibility) as repair requests that can be
deferred until later.  Add a new repair phase that deals with the
repair requests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
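The deferral idea can be illustrated with a much smaller model than the real
repair list.  The sketch below is not the xfs_repair_list API (repair.c is
not quoted in this part of the thread); struct pending_repair, struct
pending_list, pending_add(), pending_run() and fix_even_types() are all
hypothetical.  It queues items during the scan and later walks the queue,
dropping whatever the callback reports as finished.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for repair_item / xfs_repair_list. */
struct pending_repair {
	struct pending_repair	*next;
	unsigned int		type;
};

struct pending_list {
	struct pending_repair	*head;
	size_t			nr;
};

/* Remember a failed check so that a later phase can retry it. */
static int
pending_add(
	struct pending_list	*pl,
	unsigned int		type)
{
	struct pending_repair	*pr = calloc(1, sizeof(*pr));

	if (!pr)
		return -1;
	pr->type = type;
	pr->next = pl->head;
	pl->head = pr;
	pl->nr++;
	return 0;
}

/* Try every queued item once; keep the ones the callback could not fix. */
static size_t
pending_run(
	struct pending_list	*pl,
	int			(*fix)(unsigned int type))
{
	struct pending_repair	**pp = &pl->head;
	size_t			fixed = 0;

	while (*pp) {
		struct pending_repair	*pr = *pp;

		if (fix(pr->type)) {
			/* Done: unlink and free the item. */
			*pp = pr->next;
			free(pr);
			pl->nr--;
			fixed++;
		} else {
			/* Still pending: leave it queued for the next pass. */
			pp = &pr->next;
		}
	}
	return fixed;
}

/* Demo callback: pretend only even-numbered types can be fixed. */
static int
fix_even_types(
	unsigned int		type)
{
	return type % 2 == 0;
}
```

Queueing types 1, 2 and 4 and running fix_even_types() over the list fixes two
items and leaves type 1 queued for a later pass.
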
 scrub/Makefile |    4 +
 scrub/ioctl.c  |  101 +++++++++++++++------
 scrub/ioctl.h  |   27 +++--
 scrub/phase1.c |    8 ++
 scrub/phase2.c |   53 ++++++++++-
 scrub/phase3.c |   44 +++++++--
 scrub/phase4.c |   91 +++++++++++++++++++
 scrub/repair.c |  275 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/repair.h |   52 +++++++++++
 scrub/scrub.c  |   47 +++++++++-
 scrub/scrub.h  |    1 
 scrub/xfs.c    |   30 ++++++
 scrub/xfs.h    |    5 +
 13 files changed, 687 insertions(+), 51 deletions(-)
 create mode 100644 scrub/phase4.c
 create mode 100644 scrub/repair.c
 create mode 100644 scrub/repair.h


diff --git a/scrub/Makefile b/scrub/Makefile
index 461df83..a17dfa3 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -24,11 +24,13 @@ counter.h \
 disk.h \
 ioctl.h \
 read_verify.h \
+repair.h \
 scrub.h \
 vfs.h \
 xfs.h
 
 CFILES = \
+../libxfs/list_sort.c \
 ../repair/avl64.c \
 ../repair/threads.c \
 bitmap.c \
@@ -39,10 +41,12 @@ ioctl.c \
 phase1.c \
 phase2.c \
 phase3.c \
+phase4.c \
 phase5.c \
 phase6.c \
 phase7.c \
 read_verify.c \
+repair.c \
 scrub.c \
 vfs.c \
 xfs.c
diff --git a/scrub/ioctl.c b/scrub/ioctl.c
index 3ed9758..31d2fd8 100644
--- a/scrub/ioctl.c
+++ b/scrub/ioctl.c
@@ -28,6 +28,7 @@
 #include "scrub.h"
 #include "common.h"
 #include "ioctl.h"
+#include "repair.h"
 
 #define FSMAP_NR		65536
 #define BMAP_NR			2048
@@ -741,12 +742,47 @@ _("Optimizations of %s are possible."), scrubbers[i].name);
 	}
 }
 
+/* Save a scrub context for later repairs. */
+bool
+xfs_scrub_save_repair(
+	struct scrub_ctx		*ctx,
+	struct xfs_repair_list		*rl,
+	struct xfs_scrub_metadata	*meta)
+{
+	struct repair_item		*ri;
+
+	/* Schedule this item for later repairs. */
+	ri = calloc(1, sizeof(struct repair_item));
+	if (!ri) {
+		str_errno(ctx, _("repair list"));
+		return false;
+	}
+	ri->type = meta->sm_type;
+	ri->flags = meta->sm_flags;
+	switch (scrubbers[meta->sm_type].type) {
+	case ST_AGHEADER:
+	case ST_PERAG:
+		ri->agno = meta->sm_agno;
+		break;
+	case ST_INODE:
+		ri->ino = meta->sm_ino;
+		ri->gen = meta->sm_gen;
+		break;
+	default:
+		break;
+	}
+
+	xfs_repair_list_add(rl, ri);
+	return true;
+}
+
 /* Scrub metadata, saving corruption reports for later. */
 static bool
 xfs_scrub_metadata(
 	struct scrub_ctx		*ctx,
 	enum scrub_type			scrub_type,
-	xfs_agnumber_t			agno)
+	xfs_agnumber_t			agno,
+	struct xfs_repair_list		*rl)
 {
 	struct xfs_scrub_metadata	meta = {0};
 	const struct scrub_descr	*sc;
@@ -769,6 +805,9 @@ xfs_scrub_metadata(
 		case CHECK_ABORT:
 			return false;
 		case CHECK_REPAIR:
+			if (!xfs_scrub_save_repair(ctx, rl, &meta))
+				return false;
+			/* fall through */
 		case CHECK_DONE:
 			continue;
 		case CHECK_RETRY:
@@ -784,26 +823,29 @@ xfs_scrub_metadata(
 bool
 xfs_scrub_ag_headers(
 	struct scrub_ctx		*ctx,
-	xfs_agnumber_t			agno)
+	xfs_agnumber_t			agno,
+	struct xfs_repair_list		*rl)
 {
-	return xfs_scrub_metadata(ctx, ST_AGHEADER, agno);
+	return xfs_scrub_metadata(ctx, ST_AGHEADER, agno, rl);
 }
 
 /* Scrub each AG's metadata btrees. */
 bool
 xfs_scrub_ag_metadata(
 	struct scrub_ctx		*ctx,
-	xfs_agnumber_t			agno)
+	xfs_agnumber_t			agno,
+	struct xfs_repair_list		*rl)
 {
-	return xfs_scrub_metadata(ctx, ST_PERAG, agno);
+	return xfs_scrub_metadata(ctx, ST_PERAG, agno, rl);
 }
 
 /* Scrub whole-FS metadata btrees. */
 bool
 xfs_scrub_fs_metadata(
-	struct scrub_ctx		*ctx)
+	struct scrub_ctx		*ctx,
+	struct xfs_repair_list		*rl)
 {
-	return xfs_scrub_metadata(ctx, ST_FS, 0);
+	return xfs_scrub_metadata(ctx, ST_FS, 0, rl);
 }
 
 /* Scrub inode metadata. */
@@ -813,7 +855,8 @@ __xfs_scrub_file(
 	uint64_t			ino,
 	uint32_t			gen,
 	int				fd,
-	unsigned int			type)
+	unsigned int			type,
+	struct xfs_repair_list		*rl)
 {
 	struct xfs_scrub_metadata	meta = {0};
 	enum check_outcome		fix;
@@ -832,7 +875,7 @@ __xfs_scrub_file(
 	if (fix == CHECK_DONE)
 		return true;
 
-	return true;
+	return xfs_scrub_save_repair(ctx, rl, &meta);
 }
 
 bool
@@ -840,9 +883,10 @@ xfs_scrub_inode_fields(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_INODE);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_INODE, rl);
 }
 
 bool
@@ -850,9 +894,10 @@ xfs_scrub_data_fork(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTD);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTD, rl);
 }
 
 bool
@@ -860,9 +905,10 @@ xfs_scrub_attr_fork(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTA);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTA, rl);
 }
 
 bool
@@ -870,9 +916,10 @@ xfs_scrub_cow_fork(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTC);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_BMBTC, rl);
 }
 
 bool
@@ -880,9 +927,10 @@ xfs_scrub_dir(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_DIR);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_DIR, rl);
 }
 
 bool
@@ -890,9 +938,10 @@ xfs_scrub_attr(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_XATTR);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_XATTR, rl);
 }
 
 bool
@@ -900,9 +949,10 @@ xfs_scrub_symlink(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_SYMLINK);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_SYMLINK, rl);
 }
 
 bool
@@ -910,9 +960,10 @@ xfs_scrub_parent(
 	struct scrub_ctx	*ctx,
 	uint64_t		ino,
 	uint32_t		gen,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
-	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_PARENT);
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_PARENT, rl);
 }
 
 /* Repair some metadata. */
diff --git a/scrub/ioctl.h b/scrub/ioctl.h
index 4042f79..ab4044a 100644
--- a/scrub/ioctl.h
+++ b/scrub/ioctl.h
@@ -73,9 +73,14 @@ struct repair_item {
 };
 
 void xfs_scrub_report_preen_triggers(struct scrub_ctx *ctx);
-bool xfs_scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno);
-bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno);
-bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx);
+bool xfs_scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct xfs_repair_list *repair_list);
+bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct xfs_repair_list *repair_list);
+bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx,
+		struct xfs_repair_list *repair_list);
+enum check_outcome xfs_repair_metadata(struct scrub_ctx *ctx, int fd,
+		struct repair_item *ri, unsigned int flags);
 
 /* Only perform repairs; leave optimization-only actions for later. */
 #define XRM_REPAIR_ONLY		(1 << 0)
@@ -96,20 +101,20 @@ bool xfs_can_scrub_parent(struct scrub_ctx *ctx);
 bool xfs_can_repair(struct scrub_ctx *ctx);
 
 bool xfs_scrub_inode_fields(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 bool xfs_scrub_data_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 bool xfs_scrub_attr_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 bool xfs_scrub_cow_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 bool xfs_scrub_dir(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 bool xfs_scrub_attr(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 bool xfs_scrub_symlink(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 bool xfs_scrub_parent(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
-		int fd);
+		int fd, struct xfs_repair_list *repair_list);
 
 #endif /* XFS_SCRUB_IOCTL_H_ */
diff --git a/scrub/phase1.c b/scrub/phase1.c
index 6c0a31d..d5b985d 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -30,6 +30,7 @@
 #include "common.h"
 #include "ioctl.h"
 #include "xfs_fs.h"
+#include "repair.h"
 
 /* Phase 1: Find filesystem geometry (and clean up after) */
 
@@ -38,6 +39,7 @@ bool
 xfs_cleanup(
 	struct scrub_ctx	*ctx)
 {
+	xfs_repair_lists_free(&ctx->repair_lists);
 	if (ctx->fshandle)
 		free_handle(ctx->fshandle, ctx->fshandle_len);
 	disk_close(&ctx->rtdev);
@@ -81,6 +83,11 @@ _("Does not appear to be an XFS filesystem!"));
 		goto err;
 	}
 
+	if (!xfs_repair_lists_alloc(ctx->geo.agcount, &ctx->repair_lists)) {
+		str_error(ctx, ctx->mntpoint, _("Not enough memory."));
+		goto err;
+	}
+
 	ctx->agblklog = libxfs_log2_roundup(ctx->geo.agblocks);
 	ctx->blocklog = libxfs_highbit32(ctx->geo.blocksize);
 	ctx->inodelog = libxfs_highbit32(ctx->geo.inodesize);
@@ -181,5 +188,6 @@ _("Unable to find realtime device path."));
 
 	return true;
 err:
+	/* Everything will get cleaned up by xfs_cleanup */
 	return false;
 }
diff --git a/scrub/phase2.c b/scrub/phase2.c
index 88136a3..8bf3ad6 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -30,6 +30,8 @@
 #include "common.h"
 #include "ioctl.h"
 #include "xfs_fs.h"
+#include "xfs.h"
+#include "repair.h"
 
 /* Phase 2: Check internal metadata. */
 
@@ -42,24 +44,65 @@ xfs_scan_ag_metadata(
 {
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
 	bool				*pmoveon = arg;
+	struct xfs_repair_list		repairs;
+	struct xfs_repair_list		repair_now;
+	unsigned long long		broken_primaries;
+	unsigned long long		broken_secondaries;
 	bool				moveon;
 	char				descr[DESCR_BUFSZ];
 
+	xfs_repair_list_init(&repairs);
+	xfs_repair_list_init(&repair_now);
 	snprintf(descr, DESCR_BUFSZ, _("AG %u"), agno);
 
 	/*
 	 * First we scrub and fix the AG headers, because we need
 	 * them to work well enough to check the AG btrees.
 	 */
-	moveon = xfs_scrub_ag_headers(ctx, agno);
+	moveon = xfs_scrub_ag_headers(ctx, agno, &repairs);
+	if (!moveon)
+		goto err;
+
+	/* Repair header damage. */
+	moveon = xfs_quick_repair(ctx, agno, &repairs);
 	if (!moveon)
 		goto err;
 
 	/* Now scrub the AG btrees. */
-	moveon = xfs_scrub_ag_metadata(ctx, agno);
+	moveon = xfs_scrub_ag_metadata(ctx, agno, &repairs);
 	if (!moveon)
 		goto err;
 
+	/*
+	 * Figure out if we need to perform early fixing.  The only
+	 * reason we need to do this is if the inobt is broken, which
+	 * prevents phase 3 (inode scan) from running.  We can rebuild
+	 * the inobt from rmapbt data, but if the rmapbt is broken even
+	 * at this early phase then we are sunk.
+	 */
+	broken_secondaries = 0;
+	broken_primaries = 0;
+	xfs_repair_find_mustfix(&repairs, &repair_now,
+			&broken_primaries, &broken_secondaries);
+	if (broken_secondaries && !debug_tweak_on("XFS_SCRUB_FORCE_REPAIR")) {
+		if (broken_primaries)
+			str_warn(ctx, descr,
+_("Corrupt primary and secondary block mapping metadata."));
+		else
+			str_warn(ctx, descr,
+_("Corrupt secondary block mapping metadata."));
+		str_warn(ctx, descr,
+_("Filesystem might not be repairable."));
+	}
+
+	/* Repair (inode) btree damage. */
+	moveon = xfs_quick_repair(ctx, agno, &repair_now);
+	if (!moveon)
+		goto err;
+
+	/* Everything else gets fixed during phase 4. */
+	xfs_defer_repairs(ctx, agno, &repairs);
+
 	return;
 err:
 	*pmoveon = false;
@@ -74,11 +117,15 @@ xfs_scan_fs_metadata(
 {
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
 	bool				*pmoveon = arg;
+	struct xfs_repair_list		repairs;
 	bool				moveon;
 
-	moveon = xfs_scrub_fs_metadata(ctx);
+	xfs_repair_list_init(&repairs);
+	moveon = xfs_scrub_fs_metadata(ctx, &repairs);
 	if (!moveon)
 		*pmoveon = false;
+
+	xfs_defer_repairs(ctx, agno, &repairs);
 }
 
 /* Scan all filesystem metadata. */
diff --git a/scrub/phase3.c b/scrub/phase3.c
index b920995..3cfbea4 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -31,6 +31,7 @@
 #include "ioctl.h"
 #include "xfs_fs.h"
 #include "xfs.h"
+#include "repair.h"
 
 /* Phase 3: Scan all inodes. */
 
@@ -43,13 +44,14 @@ static bool
 xfs_scrub_fd(
 	struct scrub_ctx	*ctx,
 	bool			(*fn)(struct scrub_ctx *, uint64_t,
-				      uint32_t, int),
+				      uint32_t, int, struct xfs_repair_list *),
 	struct xfs_bstat	*bs,
-	int			fd)
+	int			fd,
+	struct xfs_repair_list	*rl)
 {
 	if (fd < 0)
 		fd = ctx->mnt_fd;
-	return fn(ctx, bs->bs_ino, bs->bs_gen, ctx->mnt_fd);
+	return fn(ctx, bs->bs_ino, bs->bs_gen, ctx->mnt_fd, rl);
 }
 
 /* Verify the contents, xattrs, and extent maps of an inode. */
@@ -60,6 +62,7 @@ xfs_scrub_inode(
 	struct xfs_bstat	*bstat,
 	void			*arg)
 {
+	struct xfs_repair_list	repairs;
 	char			descr[DESCR_BUFSZ];
 	bool			moveon = true;
 	xfs_agnumber_t		agno;
@@ -67,6 +70,7 @@ xfs_scrub_inode(
 	int			fd = -1;
 	int			error = 0;
 
+	xfs_repair_list_init(&repairs);
 	agno = bstat->bs_ino / (1ULL << (ctx->inopblog + ctx->agblklog));
 	agino = bstat->bs_ino % (1ULL << (ctx->inopblog + ctx->agblklog));
 	snprintf(descr, DESCR_BUFSZ, _("inode %llu (%u/%u)"), bstat->bs_ino,
@@ -85,43 +89,61 @@ xfs_scrub_inode(
 	}
 
 	/* Scrub the inode. */
-	moveon = xfs_scrub_fd(ctx, xfs_scrub_inode_fields, bstat, fd);
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_inode_fields, bstat, fd,
+			&repairs);
+	if (!moveon)
+		goto out;
+
+	moveon = xfs_quick_repair(ctx, agno, &repairs);
 	if (!moveon)
 		goto out;
 
 	/* Scrub all block mappings. */
-	moveon = xfs_scrub_fd(ctx, xfs_scrub_data_fork, bstat, fd);
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_data_fork, bstat, fd,
+			&repairs);
 	if (!moveon)
 		goto out;
-	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr_fork, bstat, fd);
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr_fork, bstat, fd,
+			&repairs);
 	if (!moveon)
 		goto out;
-	moveon = xfs_scrub_fd(ctx, xfs_scrub_cow_fork, bstat, fd);
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_cow_fork, bstat, fd,
+			&repairs);
+	if (!moveon)
+		goto out;
+
+	moveon = xfs_quick_repair(ctx, agno, &repairs);
 	if (!moveon)
 		goto out;
 
 	if (S_ISLNK(bstat->bs_mode)) {
 		/* Check symlink contents. */
 		moveon = xfs_scrub_symlink(ctx, bstat->bs_ino,
-				bstat->bs_gen, ctx->mnt_fd);
+				bstat->bs_gen, ctx->mnt_fd, &repairs);
 	} else if (S_ISDIR(bstat->bs_mode)) {
 		/* Check the directory entries. */
-		moveon = xfs_scrub_fd(ctx, xfs_scrub_dir, bstat, fd);
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_dir, bstat, fd, &repairs);
 	}
 	if (!moveon)
 		goto out;
 
 	/* Check all the extended attributes. */
-	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr, bstat, fd);
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_attr, bstat, fd, &repairs);
 	if (!moveon)
 		goto out;
 
 	/* Check parent pointers. */
-	moveon = xfs_scrub_fd(ctx, xfs_scrub_parent, bstat, fd);
+	moveon = xfs_scrub_fd(ctx, xfs_scrub_parent, bstat, fd, &repairs);
+	if (!moveon)
+		goto out;
+
+	/* Try to repair the file while it's open. */
+	moveon = xfs_quick_repair(ctx, agno, &repairs);
 	if (!moveon)
 		goto out;
 
 out:
+	xfs_defer_repairs(ctx, agno, &repairs);
 	if (fd >= 0)
 		close(fd);
 	if (error)
diff --git a/scrub/phase4.c b/scrub/phase4.c
new file mode 100644
index 0000000..7c2e0fb
--- /dev/null
+++ b/scrub/phase4.c
@@ -0,0 +1,91 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "vfs.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "xfs_fs.h"
+#include "repair.h"
+
+/* Phase 4: Repair filesystem. */
+
+/* Fix all the problems in our per-AG list. */
+static void
+xfs_repair_ag(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*priv)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	bool				*moveon = priv;
+	struct xfs_repair_list		*repairs;
+	size_t				unfixed;
+	size_t				new_unfixed;
+	unsigned int			flags = 0;
+
+	repairs = &ctx->repair_lists[agno];
+	unfixed = xfs_repair_list_length(repairs);
+
+	/* Repair anything broken until we fail to make progress. */
+	do {
+		*moveon = xfs_repair_list_now(ctx, ctx->mnt_fd, repairs, flags);
+		if (!*moveon)
+			return;
+		new_unfixed = xfs_repair_list_length(repairs);
+		if (new_unfixed == unfixed)
+			break;
+		unfixed = new_unfixed;
+	} while (unfixed > 0);
+
+	/* Try once more, but this time complain if we can't fix things. */
+	flags |= XRML_NOFIX_COMPLAIN;
+	*moveon = xfs_repair_list_now(ctx, ctx->mnt_fd, repairs, flags);
+}
+
+/* Fix everything that needs fixing. */
+bool
+xfs_repair_fs(
+	struct scrub_ctx		*ctx)
+{
+	struct work_queue		wq;
+	xfs_agnumber_t			agno;
+	bool				moveon = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	for (agno = 0; agno < ctx->geo.agcount; agno++) {
+		if (xfs_repair_list_length(&ctx->repair_lists[agno]) > 0)
+			queue_work(&wq, xfs_repair_ag, agno, &moveon);
+		if (!moveon)
+			break;
+	}
+	destroy_work_queue(&wq);
+
+	return moveon;
+}
diff --git a/scrub/repair.c b/scrub/repair.c
new file mode 100644
index 0000000..53706c3
--- /dev/null
+++ b/scrub/repair.c
@@ -0,0 +1,275 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <mntent.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/statvfs.h>
+#include <sys/vfs.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include "../repair/threads.h"
+#include "disk.h"
+#include "path.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "common.h"
+#include "ioctl.h"
+#include "repair.h"
+
+/*
+ * Prioritize repair items in order of how long we can wait.
+ * 0 = do it now, 10000 = do it later.
+ *
+ * To minimize the amount of repair work, we want to prioritize metadata
+ * objects by perceived corruptness.  If CORRUPT is set, the fields are
+ * just plain bad; try fixing that first.  Otherwise if XCORRUPT is set,
+ * the fields could be bad, but the xref data could also be bad; we'll
+ * try fixing that next.  Finally, if XFAIL is set, some other metadata
+ * structure failed validation during xref, so we'll recheck this
+ * metadata last since it was probably fine.
+ *
+ * For metadata that lie in the critical path of checking other metadata
+ * (superblock, AG{F,I,FL}, inobt) we scrub and fix those things before
+ * we even get to handling their dependencies, so things should progress
+ * in order.
+ */
+
+/* Sort repair items in severity order. */
+static int
+PRIO(
+	struct repair_item	*ri,
+	int			order)
+{
+	if (ri->flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return order;
+	else if (ri->flags & XFS_SCRUB_OFLAG_XCORRUPT)
+		return 100 + order;
+	else if (ri->flags & XFS_SCRUB_OFLAG_XFAIL)
+		return 200 + order;
+	else if (ri->flags & XFS_SCRUB_OFLAG_PREEN)
+		return 300 + order;
+	abort();
+}
+
+/* Sort the repair items in dependency order. */
+static int
+xfs_repair_item_priority(
+	struct repair_item	*ri)
+{
+	switch (ri->type) {
+	case XFS_SCRUB_TYPE_SB:
+	case XFS_SCRUB_TYPE_AGF:
+	case XFS_SCRUB_TYPE_AGFL:
+	case XFS_SCRUB_TYPE_AGI:
+	case XFS_SCRUB_TYPE_BNOBT:
+	case XFS_SCRUB_TYPE_CNTBT:
+	case XFS_SCRUB_TYPE_INOBT:
+	case XFS_SCRUB_TYPE_FINOBT:
+	case XFS_SCRUB_TYPE_REFCNTBT:
+	case XFS_SCRUB_TYPE_RMAPBT:
+	case XFS_SCRUB_TYPE_INODE:
+	case XFS_SCRUB_TYPE_BMBTD:
+	case XFS_SCRUB_TYPE_BMBTA:
+	case XFS_SCRUB_TYPE_BMBTC:
+		return PRIO(ri, ri->type - 1);
+	case XFS_SCRUB_TYPE_DIR:
+	case XFS_SCRUB_TYPE_XATTR:
+	case XFS_SCRUB_TYPE_SYMLINK:
+	case XFS_SCRUB_TYPE_PARENT:
+		return PRIO(ri, XFS_SCRUB_TYPE_DIR);
+	case XFS_SCRUB_TYPE_RTBITMAP:
+	case XFS_SCRUB_TYPE_RTSUM:
+		return PRIO(ri, XFS_SCRUB_TYPE_RTBITMAP);
+	case XFS_SCRUB_TYPE_UQUOTA:
+	case XFS_SCRUB_TYPE_GQUOTA:
+	case XFS_SCRUB_TYPE_PQUOTA:
+		return PRIO(ri, XFS_SCRUB_TYPE_UQUOTA);
+	}
+	abort();
+}
+
+/* Sort by priority so that headers get repaired before btrees. */
+static int
+xfs_repair_item_compare(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct repair_item		*ra;
+	struct repair_item		*rb;
+
+	ra = container_of(a, struct repair_item, list);
+	rb = container_of(b, struct repair_item, list);
+
+	return xfs_repair_item_priority(ra) - xfs_repair_item_priority(rb);
+}
+
+/*
+ * Figure out which AG metadata must be fixed before we can move on
+ * to the inode scan.
+ */
+void
+xfs_repair_find_mustfix(
+	struct xfs_repair_list		*repairs,
+	struct xfs_repair_list		*repair_now,
+	unsigned long long		*broken_primaries,
+	unsigned long long		*broken_secondaries)
+{
+	struct repair_item		*n;
+	struct repair_item		*ri;
+
+	list_for_each_entry_safe(ri, n, &repairs->list, list) {
+		switch (ri->type) {
+		case XFS_SCRUB_TYPE_RMAPBT:
+			(*broken_secondaries)++;
+			break;
+		case XFS_SCRUB_TYPE_FINOBT:
+		case XFS_SCRUB_TYPE_INOBT:
+			repairs->nr--;
+			list_del(&ri->list);
+			list_add_tail(&ri->list, &repair_now->list);
+			repair_now->nr++;
+			/* fall through */
+		case XFS_SCRUB_TYPE_BNOBT:
+		case XFS_SCRUB_TYPE_CNTBT:
+		case XFS_SCRUB_TYPE_REFCNTBT:
+			(*broken_primaries)++;
+			break;
+		default:
+			abort();
+			break;
+		}
+	}
+}
+
+/* Allocate a certain number of repair lists for the scrub context. */
+bool
+xfs_repair_lists_alloc(
+	size_t				nr,
+	struct xfs_repair_list		**listsp)
+{
+	struct xfs_repair_list		*lists;
+	xfs_agnumber_t			agno;
+
+	lists = calloc(nr, sizeof(struct xfs_repair_list));
+	if (!lists)
+		return false;
+
+	for (agno = 0; agno < nr; agno++)
+		xfs_repair_list_init(&lists[agno]);
+	*listsp = lists;
+
+	return true;
+}
+
+/* Free the repair lists. */
+void
+xfs_repair_lists_free(
+	struct xfs_repair_list		**listsp)
+{
+	free(*listsp);
+	*listsp = NULL;
+}
+
+/* Initialize repair list */
+void
+xfs_repair_list_init(
+	struct xfs_repair_list		*rl)
+{
+	INIT_LIST_HEAD(&rl->list);
+	rl->nr = 0;
+	rl->sorted = false;
+}
+
+/* Number of repairs in this list. */
+size_t
+xfs_repair_list_length(
+	struct xfs_repair_list		*rl)
+{
+	return rl->nr;
+}
+
+/* Add to the list of repairs. */
+void
+xfs_repair_list_add(
+	struct xfs_repair_list		*rl,
+	struct repair_item		*ri)
+{
+	list_add_tail(&ri->list, &rl->list);
+	rl->nr++;
+	rl->sorted = false;
+}
+
+/* Splice two repair lists. */
+void
+xfs_repair_list_splice(
+	struct xfs_repair_list		*dest,
+	struct xfs_repair_list		*src)
+{
+	if (src->nr == 0)
+		return;
+
+	list_splice_tail_init(&src->list, &dest->list);
+	dest->nr += src->nr;
+	src->nr = 0;
+	dest->sorted = false;
+}
+
+/* Repair everything on this list. */
+bool
+xfs_repair_list_now(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct xfs_repair_list		*rl,
+	unsigned int			repair_flags)
+{
+	struct repair_item		*ri;
+	struct repair_item		*n;
+	enum check_outcome		fix;
+
+	if (!rl->sorted) {
+		list_sort(NULL, &rl->list, xfs_repair_item_compare);
+		rl->sorted = true;
+	}
+
+	list_for_each_entry_safe(ri, n, &rl->list, list) {
+		fix = xfs_repair_metadata(ctx, fd, ri, repair_flags);
+		switch (fix) {
+		case CHECK_DONE:
+			rl->nr--;
+			list_del(&ri->list);
+			free(ri);
+			continue;
+		case CHECK_ABORT:
+			return false;
+		case CHECK_RETRY:
+			continue;
+		case CHECK_REPAIR:
+			abort();
+		}
+	}
+
+	return !xfs_scrub_excessive_errors(ctx);
+}
diff --git a/scrub/repair.h b/scrub/repair.h
new file mode 100644
index 0000000..3142838
--- /dev/null
+++ b/scrub/repair.h
@@ -0,0 +1,52 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_REPAIR_H_
+#define XFS_SCRUB_REPAIR_H_
+
+struct xfs_repair_list {
+	struct list_head	list;
+	size_t			nr;
+	bool			sorted;
+};
+
+bool xfs_repair_lists_alloc(size_t nr, struct xfs_repair_list **listsp);
+void xfs_repair_lists_free(struct xfs_repair_list **listsp);
+
+void xfs_repair_list_init(struct xfs_repair_list *rl);
+size_t xfs_repair_list_length(struct xfs_repair_list *rl);
+void xfs_repair_list_add(struct xfs_repair_list *dest,
+			 struct repair_item *item);
+void xfs_repair_list_splice(struct xfs_repair_list *dest,
+			    struct xfs_repair_list *src);
+
+void xfs_repair_find_mustfix(struct xfs_repair_list *repairs,
+			     struct xfs_repair_list *repair_now,
+			     unsigned long long *broken_primaries,
+			     unsigned long long *broken_secondaries);
+
+/* Passed through to xfs_repair_metadata() */
+#define XRML_REPAIR_ONLY	(XRM_REPAIR_ONLY)
+#define XRML_NOFIX_COMPLAIN	(XRM_NOFIX_COMPLAIN)
+
+bool xfs_repair_list_now(struct scrub_ctx *ctx, int fd,
+			 struct xfs_repair_list *repair_list,
+			 unsigned int repair_flags);
+
+#endif /* XFS_SCRUB_REPAIR_H_ */
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 647e050..b654be5 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -95,6 +95,15 @@
  * the previous two phases are retried here; if there are uncorrectable
  * errors, xfs_scrub stops here.
  *
+ * To perform the actual repairs, we iterate all the items on the per-AG
+ * repair list and ask the kernel to repair them.  Items which are
+ * successfully repaired are removed from the list.  If an item is not
+ * repaired successfully (or the kernel asks us to try again), we retry
+ * the repairs until there is nothing left to fix or we fail to make
+ * forward progress.  In that event, the unrepaired items are recorded
+ * as errors.  If there are no errors at this point, we call FSTRIM on
+ * the filesystem.
+ *
  * The next phase is the "check directory tree" phase.  In this phase,
  * every directory is opened (via file handle) to confirm that each
  * directory is connected to the root.  Directory entries are checked
@@ -414,6 +423,20 @@ _("Must be root to run scrub."));
 	return moveon;
 }
 
+/* Run the preening phase if there are no errors. */
+static bool
+preen(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->errors_found) {
+		str_info(ctx, ctx->mntpoint,
+_("Errors found, please re-run with -y."));
+		return true;
+	}
+
+	return xfs_repair_fs(ctx);
+}
+
 /* Run all the phases of the scrubber. */
 static bool
 run_scrub_phases(
@@ -466,8 +489,17 @@ run_scrub_phases(
 	/* Run all phases of the scrub tool. */
 	for (phase = 1, sp = phases; sp->fn; sp++, phase++) {
 		/* Turn on certain phases if user said to. */
-		if (sp->fn == DATASCAN_DUMMY_FN && scrub_data)
+		if (sp->fn == DATASCAN_DUMMY_FN && scrub_data) {
 			sp->fn = xfs_scan_blocks;
+		} else if (sp->fn == REPAIR_DUMMY_FN) {
+			if (ctx->mode == SCRUB_MODE_PREEN) {
+				sp->descr = _("Preen filesystem.");
+				sp->fn = preen;
+			} else if (ctx->mode == SCRUB_MODE_REPAIR) {
+				sp->descr = _("Repair filesystem.");
+				sp->fn = xfs_repair_fs;
+			}
+		}
 
 		/* Skip certain phases unless they're turned on. */
 		if (sp->fn == REPAIR_DUMMY_FN ||
@@ -691,6 +723,19 @@ _("Only one of the options -n or -y may be specified.\n"));
 	if (!moveon)
 		ret |= 8;
 
+	if (ctx.repairs && ctx.preens)
+		fprintf(stdout,
+_("%s: %llu repairs and %llu optimizations made.\n"),
+			ctx.mntpoint, ctx.repairs, ctx.preens);
+	else if (ctx.repairs && ctx.preens == 0)
+		fprintf(stdout,
+_("%s: %llu repairs made.\n"),
+			ctx.mntpoint, ctx.repairs);
+	else if (ctx.repairs == 0 && ctx.preens)
+		fprintf(stdout,
+_("%s: %llu optimizations made.\n"),
+			ctx.mntpoint, ctx.preens);
+
 	total_errors = ctx.errors_found + ctx.runtime_errors;
 	if (total_errors && ctx.warnings_found)
 		fprintf(stderr,
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 7e835ec..0561044 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -96,6 +96,7 @@ struct scrub_ctx {
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
 	struct read_verify_pool	*rvp;
+	struct xfs_repair_list	*repair_lists;
 	unsigned long long	max_errors;
 	unsigned long long	runtime_errors;
 	unsigned long long	errors_found;
diff --git a/scrub/xfs.c b/scrub/xfs.c
index 4db0267..10f7e54 100644
--- a/scrub/xfs.c
+++ b/scrub/xfs.c
@@ -30,6 +30,7 @@
 #include "common.h"
 #include "ioctl.h"
 #include "xfs_fs.h"
+#include "repair.h"
 
 /* Shut down the filesystem. */
 void
@@ -331,3 +332,32 @@ xfs_scan_estimate_blocks(
 
 	return true;
 }
+
+/* Defer all the repairs until phase 4. */
+void
+xfs_defer_repairs(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno,
+	struct xfs_repair_list		*rl)
+{
+	ASSERT(agno < ctx->geo.agcount);
+
+	xfs_repair_list_splice(&ctx->repair_lists[agno], rl);
+}
+
+/* Quickly try to repair AG metadata; broken things are remembered for later. */
+bool
+xfs_quick_repair(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno,
+	struct xfs_repair_list		*rl)
+{
+	bool				moveon;
+
+	moveon = xfs_repair_list_now(ctx, ctx->mnt_fd, rl, XRML_REPAIR_ONLY);
+	if (!moveon)
+		return moveon;
+
+	xfs_defer_repairs(ctx, agno, rl);
+	return true;
+}
diff --git a/scrub/xfs.h b/scrub/xfs.h
index 996f791..8a7d32e 100644
--- a/scrub/xfs.h
+++ b/scrub/xfs.h
@@ -33,6 +33,10 @@ bool xfs_scan_estimate_blocks(struct scrub_ctx *ctx,
 		unsigned long long *d_blocks, unsigned long long *d_bfree,
 		unsigned long long *r_blocks, unsigned long long *r_bfree,
 		unsigned long long *f_files, unsigned long long *f_free);
+void xfs_defer_repairs(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct xfs_repair_list *rl);
+bool xfs_quick_repair(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct xfs_repair_list *rl);
 
 /* Phase-specific functions. */
 bool xfs_cleanup(struct scrub_ctx *ctx);
@@ -42,5 +46,6 @@ bool xfs_scan_inodes(struct scrub_ctx *ctx);
 bool xfs_scan_connections(struct scrub_ctx *ctx);
 bool xfs_scan_blocks(struct scrub_ctx *ctx);
 bool xfs_scan_summary(struct scrub_ctx *ctx);
+bool xfs_repair_fs(struct scrub_ctx *ctx);
 
 #endif /* XFS_SCRUB_XFS_H_ */
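
A sketch for readers, not part of the patch: xfs_repair_ag() in phase4.c
keeps walking the per-AG repair list until a pass fails to shrink it.
The toy below models that control flow only.  All names in it
(toy_item, toy_repair_pass, toy_repair_until_stuck) are hypothetical,
and the "depends_on" rule is an assumed stand-in for whatever makes the
kernel ask for a retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one queued repair item. */
struct toy_item {
	int	depends_on;	/* index that must be fixed first, or -1 */
	bool	fixable;	/* false: no amount of retrying helps */
	bool	fixed;
};

/* One pass over the list; returns how many items are still unfixed. */
static size_t
toy_repair_pass(struct toy_item *items, size_t nr)
{
	size_t	unfixed = 0;
	size_t	i;

	for (i = 0; i < nr; i++) {
		struct toy_item	*it = &items[i];

		if (it->fixed)
			continue;
		assert(it->depends_on < (int)nr);
		if (it->fixable &&
		    (it->depends_on < 0 || items[it->depends_on].fixed))
			it->fixed = true;
		else
			unfixed++;
	}
	return unfixed;
}

/* Retry until nothing is left or a pass makes no forward progress. */
size_t
toy_repair_until_stuck(struct toy_item *items, size_t nr)
{
	size_t	unfixed = 0;
	size_t	new_unfixed;
	size_t	i;

	for (i = 0; i < nr; i++)
		if (!items[i].fixed)
			unfixed++;

	while (unfixed > 0) {
		new_unfixed = toy_repair_pass(items, nr);
		if (new_unfixed == unfixed)
			break;		/* stuck; caller reports the rest */
		unfixed = new_unfixed;
	}
	return unfixed;
}
```

With a chain of three dependent items plus one unfixable item, the loop
needs several passes to clear the chain and then stops with one item
left, which is the point where the real code makes its final pass with
XRML_NOFIX_COMPLAIN set.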



* [PATCH 18/22] xfs_scrub: fstrim the free areas if there are no errors on the filesystem
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (16 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 17/22] xfs_scrub: schedule and manage repairs to the filesystem Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 19/22] xfs_scrub: warn about normalized Unicode name collisions Darrick J. Wong
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

If the filesystem scan comes out clean, or all the problems it found
were fixed, call FITRIM to discard the free space (useful if the storage
is an SSD, a thin-provisioned volume, or similar).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
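
A sketch for readers, not part of the patch: the FITRIM call described
above can be tried in isolation roughly as follows.  The helper name
trim_all_free_space() and its return convention are assumptions made for
this example; the ioctl, struct fstrim_range and the two tolerated errno
values are the ones used in the diff below.  Linux only.

```c
#include <errno.h>
#include <limits.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FITRIM, struct fstrim_range */

/*
 * Ask the filesystem behind @fd to discard all of its free space.
 * Returns 0 on success, 1 if the filesystem or device cannot trim,
 * or a negative errno value for any other failure.
 * (Hypothetical helper; the patch itself only calls perror().)
 */
int
trim_all_free_space(int fd)
{
	struct fstrim_range	range = {0};

	range.start = 0;
	range.len = ULLONG_MAX;		/* cover the whole filesystem */
	range.minlen = 0;		/* no minimum extent size */

	if (ioctl(fd, FITRIM, &range) == 0)
		return 0;
	if (errno == EOPNOTSUPP || errno == ENOTTY)
		return 1;
	return -errno;
}
```

A caller would pass a file descriptor opened on the mount point, much as
the patch passes ctx->mnt_fd, and would normally need root privileges.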
 scrub/phase4.c |    5 +++++
 scrub/vfs.c    |   23 +++++++++++++++++++++++
 scrub/vfs.h    |    2 ++
 3 files changed, 30 insertions(+)


diff --git a/scrub/phase4.c b/scrub/phase4.c
index 7c2e0fb..606f3fd 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -87,5 +87,10 @@ xfs_repair_fs(
 	}
 	destroy_work_queue(&wq);
 
+	pthread_mutex_lock(&ctx->lock);
+	if (moveon && ctx->errors_found == 0)
+		fstrim(ctx);
+	pthread_mutex_unlock(&ctx->lock);
+
 	return moveon;
 }
diff --git a/scrub/vfs.c b/scrub/vfs.c
index 1cff2ab..c6cbf5d 100644
--- a/scrub/vfs.c
+++ b/scrub/vfs.c
@@ -197,3 +197,26 @@ scan_fs_tree(
 
 	return sft.moveon;
 }
+
+#ifndef FITRIM
+struct fstrim_range {
+	__u64 start;
+	__u64 len;
+	__u64 minlen;
+};
+#define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#endif
+
+/* Call FITRIM to trim all the unused space in a filesystem. */
+void
+fstrim(
+	struct scrub_ctx	*ctx)
+{
+	struct fstrim_range	range = {0};
+	int			error;
+
+	range.len = ULLONG_MAX;
+	error = ioctl(ctx->mnt_fd, FITRIM, &range);
+	if (error && errno != EOPNOTSUPP && errno != ENOTTY)
+		perror(_("fstrim"));
+}
diff --git a/scrub/vfs.h b/scrub/vfs.h
index 3a3b2dc..fa6d9a3 100644
--- a/scrub/vfs.h
+++ b/scrub/vfs.h
@@ -30,4 +30,6 @@ typedef bool (*scan_fs_tree_dirent_fn)(struct scrub_ctx *, const char *,
 bool scan_fs_tree(struct scrub_ctx *ctx, scan_fs_tree_dir_fn dir_fn,
 		scan_fs_tree_dirent_fn dirent_fn, void *arg);
 
+void fstrim(struct scrub_ctx *ctx);
+
 #endif /* XFS_SCRUB_VFS_H_ */



* [PATCH 19/22] xfs_scrub: warn about normalized Unicode name collisions
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (17 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 18/22] xfs_scrub: fstrim the free areas if there are no errors on " Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 20/22] xfs_scrub: progress indicator Darrick J. Wong
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Iterate all directory and xattr names to look for name collisions
amongst Unicode normalized names.  This is generally a sign of buggy
programs or malicious duplicate files.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
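
A sketch for readers, not part of the patch: the check described above
boils down to "two names differ as byte strings but are equal once
normalized".  The toy below shows that comparison with a deliberately
tiny normalizer that knows a single composition rule ("e" followed by
U+0301 becomes U+00E9).  That normalizer is a swapped-in stand-in, and
both function names are hypothetical; a real tool would hand the work to
a full normalizer such as the u8_normalize() that the configure checks
in this patch look for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy normalizer: rewrite "e" + U+0301 (bytes 0xCC 0x81) as the
 * precomposed U+00E9 (bytes 0xC3 0xA9) and copy everything else
 * unchanged.  Output is truncated if @out is too small.  Returns the
 * output length, not counting the NUL terminator.
 */
size_t
toy_normalize(const char *in, char *out, size_t outsz)
{
	size_t	o = 0;

	assert(outsz > 0);
	while (*in && o + 2 < outsz) {
		if (in[0] == 'e' &&
		    (unsigned char)in[1] == 0xCC &&
		    (unsigned char)in[2] == 0x81) {
			out[o++] = (char)0xC3;
			out[o++] = (char)0xA9;
			in += 3;
		} else {
			out[o++] = *in++;
		}
	}
	out[o] = '\0';
	return o;
}

/* True if the names differ byte-for-byte yet normalize identically. */
bool
toy_names_collide(const char *a, const char *b)
{
	char	na[256];
	char	nb[256];

	if (strcmp(a, b) == 0)
		return false;	/* the same name is not a collision */
	toy_normalize(a, na, sizeof(na));
	toy_normalize(b, nb, sizeof(nb));
	return strcmp(na, nb) == 0;
}
```

Two directory entries spelled "caf" + U+00E9 and "cafe" + U+0301 look
the same on screen but are distinct on disk; this is the kind of pair
the comparison flags.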
 configure.ac            |    4 
 debian/control          |    2 
 include/builddefs.in    |    3 
 m4/Makefile             |    1 
 m4/package_libcdev.m4   |   13 +
 m4/package_unistring.m4 |   19 ++
 scrub/Makefile          |   18 +-
 scrub/common.c          |   20 ++
 scrub/common.h          |    3 
 scrub/phase5.c          |   15 +
 scrub/unicrash.c        |  484 +++++++++++++++++++++++++++++++++++++++++++++++
 scrub/unicrash.h        |   38 ++++
 12 files changed, 616 insertions(+), 4 deletions(-)
 create mode 100644 m4/package_unistring.m4
 create mode 100644 scrub/unicrash.c
 create mode 100644 scrub/unicrash.h


diff --git a/configure.ac b/configure.ac
index 8b65cf5..91fba71 100644
--- a/configure.ac
+++ b/configure.ac
@@ -124,6 +124,9 @@ AC_PACKAGE_NEED_UUIDCOMPARE
 AC_PACKAGE_NEED_PTHREAD_H
 AC_PACKAGE_NEED_PTHREADMUTEXINIT
 
+AC_PACKAGE_WANT_UNINORM_H
+AC_PACKAGE_WANT_U8_NORMALIZE
+
 AC_HAVE_FADVISE
 AC_HAVE_MADVISE
 AC_HAVE_MINCORE
@@ -148,6 +151,7 @@ AC_HAVE_OPENAT
 AC_HAVE_FSTATAT
 AC_HAVE_SG_IO
 AC_HAVE_HDIO_GETGEO
+AC_HAVE_ATTR_ROOT
 
 if test "$enable_blkid" = yes; then
 AC_HAVE_BLKID_TOPO
diff --git a/debian/control b/debian/control
index ad81662..c255ae6 100644
--- a/debian/control
+++ b/debian/control
@@ -3,7 +3,7 @@ Section: admin
 Priority: optional
 Maintainer: XFS Development Team <linux-xfs@vger.kernel.org>
 Uploaders: Nathan Scott <nathans@debian.org>, Anibal Monsalve Salazar <anibal@debian.org>
-Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev
+Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev, libunistring-dev
 Standards-Version: 3.9.1
 Homepage: http://xfs.org/
 
diff --git a/include/builddefs.in b/include/builddefs.in
index c90c76f..2cde2dc 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -35,6 +35,8 @@ LIBTERMCAP = @libtermcap@
 LIBEDITLINE = @libeditline@
 LIBREADLINE = @libreadline@
 LIBBLKID = @libblkid@
+LIBUNISTRING = @libunistring@
+HAVE_LIBUNISTRING = @have_libunistring@
 LIBXFS = $(TOPDIR)/libxfs/libxfs.la
 LIBXCMD = $(TOPDIR)/libxcmd/libxcmd.la
 LIBXLOG = $(TOPDIR)/libxlog/libxlog.la
@@ -118,6 +120,7 @@ HAVE_OPENAT = @have_openat@
 HAVE_FSTATAT = @have_fstatat@
 HAVE_SG_IO = @have_sg_io@
 HAVE_HDIO_GETGEO = @have_hdio_getgeo@
+HAVE_ATTR_ROOT = @have_attr_root@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/Makefile b/m4/Makefile
index d282f0a..8348325 100644
--- a/m4/Makefile
+++ b/m4/Makefile
@@ -19,6 +19,7 @@ LSRCFILES = \
 	package_libcdev.m4 \
 	package_pthread.m4 \
 	package_types.m4 \
+	package_unistring.m4 \
 	package_utilies.m4 \
 	package_uuiddev.m4 \
 	multilib.m4 \
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 8608d10..5d821f0 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -373,3 +373,16 @@ AC_DEFUN([AC_HAVE_HDIO_GETGEO],
        AC_MSG_RESULT(no))
     AC_SUBST(have_hdio_getgeo)
   ])
+
+#
+# Check if we have an ATTR_ROOT flag
+#
+AC_DEFUN([AC_HAVE_ATTR_ROOT],
+  [ AC_CHECK_DECL([ATTR_ROOT],
+       have_attr_root=yes,
+       [],
+       [#include <sys/types.h>
+        #include <attr/attributes.h>]
+       )
+    AC_SUBST(have_attr_root)
+  ])
diff --git a/m4/package_unistring.m4 b/m4/package_unistring.m4
new file mode 100644
index 0000000..9aaadfc
--- /dev/null
+++ b/m4/package_unistring.m4
@@ -0,0 +1,19 @@
+AC_DEFUN([AC_PACKAGE_WANT_UNINORM_H],
+  [ AC_CHECK_HEADERS(uninorm.h)
+    if test $ac_cv_header_uninorm_h = no; then
+	AC_CHECK_HEADERS(uninorm.h,, [
+	echo
+	echo 'WARNING: could not find a valid uninorm header.'])
+    fi
+  ])
+
+AC_DEFUN([AC_PACKAGE_WANT_U8_NORMALIZE],
+  [ AC_CHECK_LIB(unistring, u8_normalize,[
+	libunistring=-lunistring
+	have_libunistring=yes
+    ],[
+	echo
+	echo 'WARNING: xfs_scrub will not be built with Unicode libraries.'])
+    AC_SUBST(libunistring)
+    AC_SUBST(have_libunistring)
+  ])
diff --git a/scrub/Makefile b/scrub/Makefile
index a17dfa3..0fbcf6b 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -27,7 +27,8 @@ read_verify.h \
 repair.h \
 scrub.h \
 vfs.h \
-xfs.h
+xfs.h \
+unicrash.h
 
 CFILES = \
 ../libxfs/list_sort.c \
@@ -51,8 +52,8 @@ scrub.c \
 vfs.c \
 xfs.c
 
-LLDLIBS += $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD)
-LTDEPENDENCIES += $(LIBXCMD) $(LIBHANDLE)
+LLDLIBS += $(LIBXCMD) $(LIBHANDLE) $(LIBPTHREAD) $(LIBUNISTRING)
+LTDEPENDENCIES += $(LIBXCMD) $(LIBHANDLE) $(LIBUNISTRING)
 LLDFLAGS = -static
 
 ifeq ($(HAVE_MALLINFO),yes)
@@ -71,8 +72,19 @@ ifeq ($(HAVE_HDIO_GETGEO),yes)
 LCFLAGS += -DHAVE_HDIO_GETGEO
 endif
 
+ifeq ($(HAVE_LIBUNISTRING),yes)
+CFILES += unicrash.c
+LCFLAGS += -DHAVE_LIBUNISTRING
+endif
+
+ifeq ($(HAVE_ATTR_ROOT),yes)
+LCFLAGS += -DHAVE_ATTR_ROOT
+endif
+
 default: depend $(LTCOMMAND)
 
+unicrash.o xfs.o: $(TOPDIR)/include/builddefs
+
 include $(BUILDRULES)
 
 install: default $(INSTALL_SCRUB)
diff --git a/scrub/common.c b/scrub/common.c
index 7ca0e78..02a8453 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -85,6 +85,26 @@ __str_errno(
 	pthread_mutex_unlock(&ctx->lock);
 }
 
+/* Print a warning string and whatever error is stored in errno. */
+void
+__str_errno_warn(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*file,
+	int			line)
+{
+	char			buf[DESCR_BUFSZ];
+
+	pthread_mutex_lock(&ctx->lock);
+	fprintf(stderr, _("Warning: %s: %s."), descr,
+			strerror_r(errno, buf, DESCR_BUFSZ));
+	if (debug)
+		fprintf(stderr, _(" (%s line %d)"), file, line);
+	fprintf(stderr, "\n");
+	ctx->warnings_found++;
+	pthread_mutex_unlock(&ctx->lock);
+}
+
 /* Print an error string and some error text. */
 void
 __str_error(
diff --git a/scrub/common.h b/scrub/common.h
index 2e4ff05..6251c6d 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -41,6 +41,8 @@ void __record_repair(struct scrub_ctx *ctx, const char *descr, const char *file,
 		int line, const char *format, ...);
 void __record_preen(struct scrub_ctx *ctx, const char *descr, const char *file,
 		int line, const char *format, ...);
+void __str_errno_warn(struct scrub_ctx *, const char *descr, const char *file,
+		      int line);
 
 #define str_errno(ctx, str)		__str_errno(ctx, str, __FILE__, __LINE__)
 #define str_error(ctx, str, ...)	__str_error(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
@@ -48,6 +50,7 @@ void __record_preen(struct scrub_ctx *ctx, const char *descr, const char *file,
 #define str_info(ctx, str, ...)		__str_info(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
 #define record_repair(ctx, str, ...)	__record_repair(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
 #define record_preen(ctx, str, ...)	__record_preen(ctx, str, __FILE__, __LINE__, __VA_ARGS__)
+#define str_errno_warn(ctx, str)	__str_errno_warn(ctx, str, __FILE__, __LINE__)
 #define dbg_printf(fmt, ...)		{if (debug > 1) {printf(fmt, __VA_ARGS__);}}
 
 /* Is this debug tweak enabled? */
diff --git a/scrub/phase5.c b/scrub/phase5.c
index e5a5835..9010a08 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -31,6 +31,7 @@
 #include "ioctl.h"
 #include "xfs_fs.h"
 #include "xfs.h"
+#include "unicrash.h"
 
 /* Phase 5: Check directory connectivity. */
 
@@ -38,6 +39,8 @@
  * Verify the connectivity of the directory tree.
  * We know that the kernel's open-by-handle function will try to reconnect
  * parents of an opened directory, so we'll accept that as sufficient.
+ *
+ * Check for potential Unicode collisions in names.
  */
 static int
 xfs_scrub_connections(
@@ -59,6 +62,11 @@ xfs_scrub_connections(
 			agno, agino);
 	background_sleep();
 
+	/* Warn about Unicode normalization problems in xattrs. */
+	moveon = unicrash_scan_fh_xattrs(ctx, descr, handle, bstat);
+	if (!moveon)
+		goto out;
+
 	/* Open the dir, let the kernel try to reconnect it to the root. */
 	if (S_ISDIR(bstat->bs_mode)) {
 		fd = xfs_open_handle(handle);
@@ -70,6 +78,13 @@ xfs_scrub_connections(
 		}
 	}
 
+	/* Warn about Unicode normalization problems in dirents. */
+	if (fd >= 0 && S_ISDIR(bstat->bs_mode)) {
+		moveon = unicrash_scan_dir(ctx, descr, fd, bstat->bs_size);
+		if (!moveon)
+			goto out;
+	}
+
 out:
 	if (fd >= 0)
 		close(fd);
diff --git a/scrub/unicrash.c b/scrub/unicrash.c
new file mode 100644
index 0000000..dfa8a0c
--- /dev/null
+++ b/scrub/unicrash.c
@@ -0,0 +1,484 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <dirent.h>
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#ifdef HAVE_ATTR_ROOT
+# include <attr/attributes.h>
+#endif
+#include <unistr.h>
+#include <uninorm.h>
+#include "disk.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "vfs.h"
+#include "scrub.h"
+#include "common.h"
+#include "unicrash.h"
+
+/*
+ * Detect collisions of Unicode-normalized names.
+ *
+ * Record all the name->ino mappings in a directory/xattr, with a twist!
+ * The twist is that we perform unicode normalization on every name we
+ * see, so that we can warn about a directory containing more than one
+ * directory entry that normalizes to the same Unicode string.  These
+ * entries are at best a sign of Unicode mishandling, or some sort of
+ * weird name substitution attack if the entries do not point to the
+ * same inode.  Warn if we see multiple dirents that do not all point to
+ * the same inode.
+ *
+ * For extended attributes we perform the same collision checks on the
+ * attribute names, though any collision is enough to trigger a warning.
+ *
+ * We flag these collisions as warnings and not errors because XFS
+ * treats names as a sequence of arbitrary nonzero bytes.  While a
+ * Unicode collision is not technically a filesystem corruption, we
+ * ought to say something if there's a possibility for misleading a
+ * user.
+ *
+ * To normalize, we use Unicode NFKC.  We use the composing
+ * normalization mode (e.g. "E WITH ACUTE" instead of "E" then "ACUTE")
+ * because that's what W3C (and in general Linux) uses.  This enables us
+ * to detect multiple object names that normalize to the same name and
+ * could be confusing to users.  Furthermore, we use the compatibility
+ * mode to detect names with compatible but different code points to
+ * strengthen those checks.
+ */
+
+struct name_entry {
+	struct name_entry	*next;
+	xfs_ino_t		ino;
+	size_t			namelen;
+	uint8_t			name[0];
+};
+#define NAME_ENTRY_SZ(nl)	(sizeof(struct name_entry) + 1 + \
+				 (nl * sizeof(uint8_t)))
+
+struct unicrash {
+	bool			compare_ino;
+	size_t			nr_buckets;
+	struct name_entry	*buckets[0];
+};
+#define UNICRASH_SZ(nr)		(sizeof(struct unicrash) + \
+				 (nr * sizeof(struct name_entry *)))
+
+/* Initialize the crash detector. */
+static struct unicrash *
+unicrash_init(
+	bool			compare_ino,
+	size_t			nr_buckets)
+{
+	struct unicrash		*p;
+
+	assert(nr_buckets > 0);
+	p = calloc(1, UNICRASH_SZ(nr_buckets));
+	if (!p)
+		return NULL;
+	p->nr_buckets = nr_buckets;
+	p->compare_ino = compare_ino;
+	return p;
+}
+
+/* Free the crash detector. */
+static void
+unicrash_free(
+	struct unicrash		*uc)
+{
+	struct name_entry	*ne;
+	struct name_entry	*x;
+	size_t			i;
+
+	for (i = 0; i < uc->nr_buckets; i++) {
+		for (ne = uc->buckets[i]; ne != NULL; ne = x) {
+			x = ne->next;
+			free(ne);
+		}
+	}
+	free(uc);
+}
+
+/* Steal the dirhash function from libxfs, avoid linking with libxfs. */
+
+#define rol32(x, y)		(((x) << (y)) | ((x) >> (32 - (y))))
+
+/*
+ * Implement a simple hash on a character string.
+ * Rotate the hash value by 7 bits, then XOR each character in.
+ * This is implemented with some source-level loop unrolling.
+ */
+xfs_dahash_t
+unicrash_hashname(
+	const uint8_t		*name,
+	size_t			namelen)
+{
+	xfs_dahash_t		hash;
+
+	/*
+	 * Do four characters at a time as long as we can.
+	 */
+	for (hash = 0; namelen >= 4; namelen -= 4, name += 4)
+		hash = (name[0] << 21) ^ (name[1] << 14) ^ (name[2] << 7) ^
+		       (name[3] << 0) ^ rol32(hash, 7 * 4);
+
+	/*
+	 * Now do the rest of the characters.
+	 */
+	switch (namelen) {
+	case 3:
+		return (name[0] << 14) ^ (name[1] << 7) ^ (name[2] << 0) ^
+		       rol32(hash, 7 * 3);
+	case 2:
+		return (name[0] << 7) ^ (name[1] << 0) ^ rol32(hash, 7 * 2);
+	case 1:
+		return (name[0] << 0) ^ rol32(hash, 7 * 1);
+	default: /* case 0: */
+		return hash;
+	}
+}
+
+/*
+ * Normalize a name according to Unicode NFKC normalization rules.
+ * Returns true if the name was already normalized.
+ */
+static bool
+unicrash_normalize(
+	const char		*in,
+	uint8_t			*out,
+	size_t			outlen)
+{
+	size_t			inlen = strlen(in);
+
+	assert(inlen <= outlen);
+	if (!u8_normalize(UNINORM_NFKC, (const uint8_t *)in, inlen,
+			out, &outlen)) {
+		/* Didn't normalize, just return the same buffer. */
+		memcpy(out, in, inlen + 1);
+		return true;
+	}
+	out[outlen] = 0;
+	return outlen == inlen ? memcmp(in, out, inlen) == 0 : false;
+}
+
+/*
+ * Return the input string with non-printing bytes escaped.
+ * Caller must free the buffer.
+ */
+static char *
+unicrash_escape(
+	const char		*in)
+{
+	char			*str;
+	const char		*p;
+	char			*q;
+	int			x;
+
+	str = malloc(strlen(in) * 4 + 1);
+	for (p = in, q = str; *p != '\0'; p++) {
+		if (isprint(*p)) {
+			*q = *p;
+			q++;
+		} else {
+			x = sprintf(q, "\\x%02x", *p);
+			q += x;
+		}
+	}
+	*q = '\0';
+	return str;
+}
+
+/* Complain about Unicode problems. */
+#define TOO_MANY_NORM_WARNINGS	10000
+static void
+unicrash_complain(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	const char		*what,
+	bool			normal,
+	bool			unique,
+	const char		*name,
+	uint8_t			*uniname)
+{
+	static pthread_mutex_t	normlock = PTHREAD_MUTEX_INITIALIZER;
+	static unsigned int	normwarnings;
+	char			*bad1 = NULL;
+	char			*bad2 = NULL;
+
+	if (normal && unique)
+		return;
+
+	bad1 = unicrash_escape(name);
+	bad2 = unicrash_escape((char *)uniname);
+
+	if (!normal && (debug || verbose)) {
+		pthread_mutex_lock(&normlock);
+		if (normwarnings == TOO_MANY_NORM_WARNINGS) {
+			str_info(ctx, ctx->mntpoint,
+_("Filesystem has more than %u normalization warnings, shutting up."),
+					normwarnings);
+			normwarnings++;
+		} else if (normwarnings < TOO_MANY_NORM_WARNINGS) {
+			str_info(ctx, descr,
+_("Unicode name \"%s\" in %s should be normalized as \"%s\"."),
+					bad1, what, bad2);
+			normwarnings++;
+		}
+		pthread_mutex_unlock(&normlock);
+	}
+	if (!unique)
+		str_warn(ctx, descr,
+_("Duplicate normalized Unicode name \"%s\" found in %s."),
+				bad1, what);
+
+	free(bad1);
+	free(bad2);
+}
+#undef TOO_MANY_NORM_WARNINGS
+
+/*
+ * Try to add a name -> ino entry to the collision detector.  The name
+ * must be normalized according to Unicode NFKC normalization rules to
+ * detect byte-unique names that map to the same sequence of Unicode
+ * code points.
+ *
+ * This function returns true either if there was no previous mapping or
+ * there was a mapping that matched exactly.  It returns false if
+ * there is already a record with that name pointing to a different
+ * inode.
+ */
+static bool
+unicrash_add(
+	struct unicrash		*uc,
+	uint8_t			*name,
+	xfs_ino_t		ino)
+{
+	struct name_entry	*ne;
+	struct name_entry	*x;
+	struct name_entry	**nep;
+	size_t			namelen = u8_strlen(name);
+	size_t			bucket;
+	xfs_dahash_t		hash;
+
+	/* Do we already know about that name? */
+	hash = unicrash_hashname(name, namelen);
+	bucket = hash % uc->nr_buckets;
+	for (nep = &uc->buckets[bucket], ne = *nep; ne != NULL; ne = x) {
+		if (u8_strcmp(name, ne->name) == 0)
+			return uc->compare_ino ? ne->ino == ino : false;
+		nep = &ne->next;
+		x = ne->next;
+	}
+
+	/* Remember that name. */
+	x = malloc(NAME_ENTRY_SZ(namelen));
+	x->next = NULL;
+	x->ino = ino;
+	x->namelen = namelen;
+	memcpy(x->name, name, namelen + 1);
+	*nep = x;
+	return true;
+}
+
+/*
+ * Iterate a directory looking for collisions between the Unicode
+ * normalized names.
+ */
+bool
+unicrash_scan_dir(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	off_t			dirsize)
+{
+	uint8_t			buf[NAME_MAX * 2];
+	DIR			*dir;
+	struct dirent		*dentry;
+	struct unicrash		*uc;
+	size_t			buckets;
+	bool			normal;
+	bool			unique;
+	int			new_fd;
+
+	new_fd = dup(fd);
+	if (new_fd < 0) {
+		str_errno_warn(ctx, descr);
+		goto out;
+	}
+
+	dir = fdopendir(new_fd);
+	if (!dir) {
+		str_errno_warn(ctx, descr);
+		goto out_close;
+	}
+	new_fd = -1;
+
+	/*
+	 * Assume 64 bytes per dentry, clamp buckets between 16 and 64k.
+	 * Same general idea as dir_hash_init in xfs_repair.
+	 */
+	buckets = dirsize / 64;
+	if (buckets > 65536)
+		buckets = 65536;
+	else if (buckets < 16)
+		buckets = 16;
+
+	uc = unicrash_init(true, buckets);
+	if (!uc) {
+		str_errno_warn(ctx, descr);
+		goto out_closedir;
+	}
+
+	while ((dentry = readdir(dir))) {
+		normal = unicrash_normalize(dentry->d_name, buf, NAME_MAX * 2);
+		unique = unicrash_add(uc, buf, dentry->d_ino);
+
+		unicrash_complain(ctx, descr, _("directory"),
+				normal, unique, dentry->d_name, buf);
+	}
+
+	unicrash_free(uc);
+out_closedir:
+	closedir(dir);
+out_close:
+	if (new_fd >= 0)
+		close(new_fd);
+out:
+	return true;
+}
+
+/* Routines to scan all of an inode's xattrs for Unicode problems. */
+
+#ifdef HAVE_ATTR_ROOT
+enum xfs_xattr_ns {
+	RXT_USER	= 0,
+	RXT_ROOT,
+	RXT_TRUST,
+	RXT_SECURE,
+	RXT_MAX,
+};
+
+static const int ns_to_flag[] = {
+	[RXT_USER]	= 0,
+	[RXT_ROOT]	= ATTR_ROOT,
+	[RXT_TRUST]	= ATTR_TRUST,
+	[RXT_SECURE]	= ATTR_SECURE,
+};
+
+static const char * const ns_to_str[] = {
+	[RXT_USER]	= "user",
+	[RXT_ROOT]	= "system",
+	[RXT_TRUST]	= "trust",
+	[RXT_SECURE]	= "secure",
+};
+
+/*
+ * Check all the xattr names in a particular namespace of a file handle
+ * for Unicode normalization problems or collisions.
+ */
+static bool
+unicrash_scan_fh_ns_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct xfs_handle	*handle,
+	struct xfs_bstat	*bstat,
+	enum xfs_xattr_ns	ns)
+{
+	struct attrlist_cursor	cur;
+	char			attrbuf[XFS_XATTR_LIST_MAX];
+	char			buf[NAME_MAX];
+	uint8_t			nbuf[NAME_MAX * 2];
+	struct attrlist		*attrlist = (struct attrlist *)attrbuf;
+	struct attrlist_ent	*ent;
+	struct unicrash		*uc;
+	bool			normal;
+	bool			unique;
+	int			i;
+	int			error;
+
+	/* Assume 16 buckets per attr extent will do. */
+	uc = unicrash_init(false, 16 * (1 + bstat->bs_aextents));
+	if (!uc) {
+		str_errno_warn(ctx, descr);
+		goto out;
+	}
+
+	memset(attrbuf, 0, XFS_XATTR_LIST_MAX);
+	memset(&cur, 0, sizeof(cur));
+	while ((error = attr_list_by_handle(handle, sizeof(*handle),
+			attrbuf, XFS_XATTR_LIST_MAX, ns_to_flag[ns],
+			&cur)) == 0) {
+
+		/* Examine the xattrs. */
+		for (i = 0; i < attrlist->al_count; i++) {
+			ent = ATTR_ENTRY(attrlist, i);
+			normal = unicrash_normalize(ent->a_name, nbuf,
+					NAME_MAX * 2);
+			unique = unicrash_add(uc, nbuf, 0);
+
+			snprintf(buf, NAME_MAX, "%s.%s", ns_to_str[ns],
+					ent->a_name);
+			unicrash_complain(ctx, descr, _("extended attribute"),
+					normal, unique, buf, nbuf);
+		}
+
+		if (!attrlist->al_more)
+			break;
+	}
+	/*
+	 * ATTR_TRUST doesn't currently work on Linux, so ignore EINVAL for
+	 * that namespace.  Also ignore the file if the handle is now stale.
+	 */
+	if (error && ((ns == RXT_TRUST && errno == EINVAL) ||
+		      (errno == ESTALE)))
+		error = 0;
+	if (error)
+		str_errno_warn(ctx, descr);
+	unicrash_free(uc);
+out:
+	return true;
+}
+
+/*
+ * Check all the xattr names under a file handle for Unicode normalization
+ * problems or collisions.
+ */
+bool
+unicrash_scan_fh_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct xfs_handle	*handle,
+	struct xfs_bstat	*bstat)
+{
+	enum xfs_xattr_ns	i;
+	bool			moveon = true;
+
+	for (i = 0; i < RXT_MAX; i++) {
+		moveon = unicrash_scan_fh_ns_xattrs(ctx, descr, handle,
+				bstat, i);
+		if (!moveon)
+			break;
+	}
+	return moveon;
+}
+#endif /* HAVE_ATTR_ROOT */
diff --git a/scrub/unicrash.h b/scrub/unicrash.h
new file mode 100644
index 0000000..9caae39
--- /dev/null
+++ b/scrub/unicrash.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_UNICRASH_H_
+#define XFS_SCRUB_UNICRASH_H_
+
+/* Unicode name collision detection. */
+#ifdef HAVE_LIBUNISTRING
+bool unicrash_scan_dir(struct scrub_ctx *ctx, const char *descr, int fd,
+		off_t size);
+#else
+# define unicrash_scan_dir(c, d, f, s)		(true)
+#endif
+
+#if defined(HAVE_LIBUNISTRING) && defined(HAVE_ATTR_ROOT)
+bool unicrash_scan_fh_xattrs(struct scrub_ctx *ctx, const char *descr,
+		struct xfs_handle *handle, struct xfs_bstat *bstat);
+#else
+# define unicrash_scan_fh_xattrs(c, d, h, b)	(true)
+#endif
+
+#endif /* XFS_SCRUB_UNICRASH_H_ */



* [PATCH 20/22] xfs_scrub: progress indicator
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (18 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 19/22] xfs_scrub: warn about normalized Unicode name collisions Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 21/22] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 22/22] xfs_scrub: integrate services with systemd Darrick J. Wong
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

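Implement a progress indicator.  The new -C option writes progress
information to a file descriptor: a progress bar if the descriptor is a
tty, or numeric status output compatible with fsck -C otherwise.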

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
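Reviewer note: each phase in this patch supplies an item estimate plus an
rshift value; phase 6 counts bytes and sets rshift to 20, while the other
phases count whole items and use 0.  The sketch below shows one plausible
way such an estimate could be turned into a percentage.  It is an
assumption for illustration only: how the real reporting code applies
rshift is not shown in this part of the patch, and both demo_* helpers
are hypothetical names.

```c
#include <stdint.h>

/* Bytes in use on the data and realtime devices (phase 6 style estimate). */
static uint64_t demo_verify_items(uint64_t d_blocks, uint64_t d_bfree,
				  uint64_t r_blocks, uint64_t r_bfree,
				  unsigned int blocklog)
{
	return ((d_blocks - d_bfree) + (r_blocks - r_bfree)) << blocklog;
}

/*
 * Percent complete.  Both counters are scaled down by rshift first, so
 * a byte count with rshift == 20 is compared in MiB units.
 */
static unsigned int demo_percent(uint64_t done, uint64_t max, int rshift)
{
	uint64_t	d = done >> rshift;
	uint64_t	m = max >> rshift;

	if (m == 0)
		return 100;	/* nothing to do counts as finished */
	if (d > m)
		d = m;		/* the estimate may undershoot; clamp */
	return (unsigned int)(d * 100 / m);
}
```

Clamping matters because the estimates are taken before the phase runs
(for example from statvfs counts), so the amount of work actually done
can exceed the estimate on a live filesystem.
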
 man/man8/xfs_scrub.8 |   11 ++-
 scrub/Makefile       |    2 
 scrub/common.c       |   27 +++++-
 scrub/ioctl.c        |   28 +++++++
 scrub/ioctl.h        |    2 
 scrub/phase2.c       |   15 +++
 scrub/phase3.c       |   16 ++++
 scrub/phase4.c       |   26 ++++++
 scrub/phase5.c       |    2 
 scrub/phase6.c       |   28 +++++++
 scrub/progress.c     |  215 ++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/progress.h     |   33 ++++++++
 scrub/read_verify.c  |    2 
 scrub/repair.c       |    3 +
 scrub/repair.h       |    3 +
 scrub/scrub.c        |   57 +++++++++++++
 scrub/scrub.h        |    2 
 scrub/xfs.h          |   11 +++
 18 files changed, 471 insertions(+), 12 deletions(-)
 create mode 100644 scrub/progress.c
 create mode 100644 scrub/progress.h


diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
index a432aed..4505c3e 100644
--- a/man/man8/xfs_scrub.8
+++ b/man/man8/xfs_scrub.8
@@ -4,7 +4,7 @@ xfs_scrub \- scrub the contents of an XFS filesystem
 .SH SYNOPSIS
 .B xfs_scrub
 [
-.B \-abemnTvVxy
+.B \-abCemnTvVxy
 ]
 .I mount-point
 .br
@@ -47,6 +47,15 @@ time.
 If given more than once, an artificial delay of 100us is added to each
 scrub call to reduce CPU overhead even further.
 .TP
+.BI \-C " fd"
+This option causes xfs_scrub to write progress information to the
+specified file descriptor so that the progress of the filesystem check
+can be monitored.
+If the file descriptor is a tty, a fancy progress bar is rendered.
+Otherwise, a simple numeric status dump compatible with the
+.B fsck -C
+format is output.
+.TP
 .B \-e
 Specifies what happens when errors are detected.
 If
diff --git a/scrub/Makefile b/scrub/Makefile
index 0fbcf6b..5b3e522 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -23,6 +23,7 @@ common.h \
 counter.h \
 disk.h \
 ioctl.h \
+progress.h \
 read_verify.h \
 repair.h \
 scrub.h \
@@ -46,6 +47,7 @@ phase4.c \
 phase5.c \
 phase6.c \
 phase7.c \
+progress.c \
 read_verify.c \
 repair.c \
 scrub.c \
diff --git a/scrub/common.c b/scrub/common.c
index 02a8453..fca7243 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -37,6 +37,7 @@
 #include "input.h"
 #include "ioctl.h"
 #include "xfs.h"
+#include "progress.h"
 
 /*
  * Reporting Status to the Console
@@ -65,6 +66,18 @@ xfs_scrub_excessive_errors(
 	return ret;
 }
 
+/* If stderr is a tty, clear to end of line to clean up progress bar. */
+static inline const char *stderr_start(void)
+{
+	return stderr_isatty ? CLEAR_EOL : "";
+}
+
+/* If stdout is a tty, clear to end of line to clean up progress bar. */
+static inline const char *stdout_start(void)
+{
+	return stdout_isatty ? CLEAR_EOL : "";
+}
+
 /* Print an error string and whatever error is stored in errno. */
 void
 __str_errno(
@@ -76,7 +89,7 @@ __str_errno(
 	char			buf[DESCR_BUFSZ];
 
 	pthread_mutex_lock(&ctx->lock);
-	fprintf(stderr, _("Error: %s: %s."), descr,
+	fprintf(stderr, _("%sError: %s: %s."), stderr_start(), descr,
 			strerror_r(errno, buf, DESCR_BUFSZ));
 	if (debug)
 		fprintf(stderr, _(" (%s line %d)"), file, line);
@@ -96,7 +109,7 @@ __str_errno_warn(
 	char			buf[DESCR_BUFSZ];
 
 	pthread_mutex_lock(&ctx->lock);
-	fprintf(stderr, _("Warning: %s: %s."), descr,
+	fprintf(stderr, _("%sWarning: %s: %s."), stderr_start(), descr,
 			strerror_r(errno, buf, DESCR_BUFSZ));
 	if (debug)
 		fprintf(stderr, _(" (%s line %d)"), file, line);
@@ -118,7 +131,7 @@ __str_error(
 	va_list			args;
 
 	pthread_mutex_lock(&ctx->lock);
-	fprintf(stderr, _("Error: %s: "), descr);
+	fprintf(stderr, _("%sError: %s: "), stderr_start(), descr);
 	va_start(args, format);
 	vfprintf(stderr, format, args);
 	va_end(args);
@@ -142,7 +155,7 @@ __str_warn(
 	va_list			args;
 
 	pthread_mutex_lock(&ctx->lock);
-	fprintf(stderr, _("Warning: %s: "), descr);
+	fprintf(stderr, _("%sWarning: %s: "), stderr_start(), descr);
 	va_start(args, format);
 	vfprintf(stderr, format, args);
 	va_end(args);
@@ -166,7 +179,7 @@ __str_info(
 	va_list			args;
 
 	pthread_mutex_lock(&ctx->lock);
-	fprintf(stdout, _("Info: %s: "), descr);
+	fprintf(stdout, _("%sInfo: %s: "), stdout_start(), descr);
 	va_start(args, format);
 	vfprintf(stdout, format, args);
 	va_end(args);
@@ -190,7 +203,7 @@ __record_repair(
 	va_list			args;
 
 	pthread_mutex_lock(&ctx->lock);
-	fprintf(stderr, _("Repaired: %s: "), descr);
+	fprintf(stderr, _("%sRepaired: %s: "), stderr_start(), descr);
 	va_start(args, format);
 	vfprintf(stderr, format, args);
 	va_end(args);
@@ -215,7 +228,7 @@ __record_preen(
 
 	pthread_mutex_lock(&ctx->lock);
 	if (debug || verbose) {
-		fprintf(stdout, _("Optimized: %s: "), descr);
+		fprintf(stdout, _("%sOptimized: %s: "), stdout_start(), descr);
 		va_start(args, format);
 		vfprintf(stdout, format, args);
 		va_end(args);
diff --git a/scrub/ioctl.c b/scrub/ioctl.c
index 31d2fd8..914ec79 100644
--- a/scrub/ioctl.c
+++ b/scrub/ioctl.c
@@ -29,6 +29,7 @@
 #include "common.h"
 #include "ioctl.h"
 #include "repair.h"
+#include "progress.h"
 
 #define FSMAP_NR		65536
 #define BMAP_NR			2048
@@ -801,6 +802,7 @@ xfs_scrub_metadata(
 
 		/* Check the item. */
 		fix = xfs_check_metadata(ctx, ctx->mnt_fd, &meta, false);
+		progress_add(1);
 		switch (fix) {
 		case CHECK_ABORT:
 			return false;
@@ -848,6 +850,32 @@ xfs_scrub_fs_metadata(
 	return xfs_scrub_metadata(ctx, ST_FS, 0, rl);
 }
 
+/* How many items do we have to check? */
+unsigned int
+xfs_scrub_estimate_ag_work(
+	struct scrub_ctx		*ctx)
+{
+	const struct scrub_descr	*sc;
+	int				type;
+	unsigned int			estimate = 0;
+
+	sc = scrubbers;
+	for (type = 0; type < XFS_SCRUB_TYPE_NR; type++, sc++) {
+		switch (sc->type) {
+		case ST_AGHEADER:
+		case ST_PERAG:
+			estimate += ctx->geo.agcount;
+			break;
+		case ST_FS:
+			estimate++;
+			break;
+		default:
+			break;
+		}
+	}
+	return estimate;
+}
+
 /* Scrub inode metadata. */
 static bool
 __xfs_scrub_file(
diff --git a/scrub/ioctl.h b/scrub/ioctl.h
index ab4044a..283445e 100644
--- a/scrub/ioctl.h
+++ b/scrub/ioctl.h
@@ -117,4 +117,6 @@ bool xfs_scrub_symlink(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
 bool xfs_scrub_parent(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
 		int fd, struct xfs_repair_list *repair_list);
 
+unsigned int xfs_scrub_estimate_ag_work(struct scrub_ctx *ctx);
+
 #endif /* XFS_SCRUB_IOCTL_H_ */
diff --git a/scrub/phase2.c b/scrub/phase2.c
index 8bf3ad6..91c0d8f 100644
--- a/scrub/phase2.c
+++ b/scrub/phase2.c
@@ -32,6 +32,7 @@
 #include "xfs_fs.h"
 #include "xfs.h"
 #include "repair.h"
+#include "progress.h"
 
 /* Phase 2: Check internal metadata. */
 
@@ -145,3 +146,17 @@ xfs_scan_metadata(
 
 	return moveon;
 }
+
+/* Estimate how much work we're going to do. */
+bool
+xfs_estimate_metadata_work(
+	struct scrub_ctx	*ctx,
+	uint64_t		*items,
+	unsigned int		*nr_threads,
+	int			*rshift)
+{
+	*items = xfs_scrub_estimate_ag_work(ctx);
+	*nr_threads = scrub_nproc(ctx);
+	*rshift = 0;
+	return true;
+}
diff --git a/scrub/phase3.c b/scrub/phase3.c
index 3cfbea4..bcc848f 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -32,6 +32,7 @@
 #include "xfs_fs.h"
 #include "xfs.h"
 #include "repair.h"
+#include "progress.h"
 
 /* Phase 3: Scan all inodes. */
 
@@ -143,6 +144,7 @@ xfs_scrub_inode(
 		goto out;
 
 out:
+	progress_add(1);
 	xfs_defer_repairs(ctx, agno, &repairs);
 	if (fd >= 0)
 		close(fd);
@@ -161,3 +163,17 @@ xfs_scan_inodes(
 	xfs_scrub_report_preen_triggers(ctx);
 	return true;
 }
+
+/* Estimate how much work we're going to do. */
+bool
+xfs_estimate_inodes_work(
+	struct scrub_ctx	*ctx,
+	uint64_t		*items,
+	unsigned int		*nr_threads,
+	int			*rshift)
+{
+	*items = ctx->mnt_sv.f_files - ctx->mnt_sv.f_ffree;
+	*nr_threads = scrub_nproc(ctx);
+	*rshift = 0;
+	return true;
+}
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 606f3fd..e6c39d6 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -33,6 +33,7 @@
 #include "ioctl.h"
 #include "xfs_fs.h"
 #include "repair.h"
+#include "progress.h"
 
 /* Phase 4: Repair filesystem. */
 
@@ -52,6 +53,7 @@ xfs_repair_ag(
 
 	repairs = &ctx->repair_lists[agno];
 	unfixed = xfs_repair_list_length(repairs);
+	flags |= XRML_REPORT_PROGRESS;
 
 	/* Repair anything broken until we fail to make progress. */
 	do {
@@ -88,9 +90,31 @@ xfs_repair_fs(
 	destroy_work_queue(&wq);
 
 	pthread_mutex_lock(&ctx->lock);
-	if (moveon && ctx->errors_found == 0)
+	if (moveon && ctx->errors_found == 0) {
 		fstrim(ctx);
+		progress_add(1);
+	}
 	pthread_mutex_unlock(&ctx->lock);
 
 	return moveon;
 }
+
+/* Estimate how much work we're going to do. */
+bool
+xfs_estimate_repair_work(
+	struct scrub_ctx	*ctx,
+	uint64_t		*items,
+	unsigned int		*nr_threads,
+	int			*rshift)
+{
+	xfs_agnumber_t		agno;
+	size_t			need_fixing = 0;
+
+	for (agno = 0; agno < ctx->geo.agcount; agno++)
+		need_fixing += xfs_repair_list_length(&ctx->repair_lists[agno]);
+	need_fixing++;
+	*items = need_fixing;
+	*nr_threads = scrub_nproc(ctx) + 1;
+	*rshift = 0;
+	return true;
+}
diff --git a/scrub/phase5.c b/scrub/phase5.c
index 9010a08..8b8e314 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -32,6 +32,7 @@
 #include "xfs_fs.h"
 #include "xfs.h"
 #include "unicrash.h"
+#include "progress.h"
 
 /* Phase 5: Check directory connectivity. */
 
@@ -86,6 +87,7 @@ xfs_scrub_connections(
 	}
 
 out:
+	progress_add(1);
 	if (fd >= 0)
 		close(fd);
 	if (error)
diff --git a/scrub/phase6.c b/scrub/phase6.c
index d60a044..f8713a9 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -33,6 +33,7 @@
 #include "ioctl.h"
 #include "xfs_fs.h"
 #include "xfs.h"
+#include "progress.h"
 
 /*
  * Phase 6: Verify data file integrity.
@@ -537,3 +538,30 @@ xfs_scan_blocks(
 	free(ve);
 	return moveon;
 }
+
+/* Estimate how much work we're going to do. */
+bool
+xfs_estimate_verify_work(
+	struct scrub_ctx	*ctx,
+	uint64_t		*items,
+	unsigned int		*nr_threads,
+	int			*rshift)
+{
+	unsigned long long	d_blocks;
+	unsigned long long	d_bfree;
+	unsigned long long	r_blocks;
+	unsigned long long	r_bfree;
+	unsigned long long	f_files;
+	unsigned long long	f_free;
+	bool			moveon;
+
+	moveon = xfs_scan_estimate_blocks(ctx, &d_blocks, &d_bfree,
+				&r_blocks, &r_bfree, &f_files, &f_free);
+	if (!moveon)
+		return moveon;
+
+	*items = ((d_blocks - d_bfree) + (r_blocks - r_bfree)) << ctx->blocklog;
+	*nr_threads = disk_heads(&ctx->datadev);
+	*rshift = 20;
+	return moveon;
+}
diff --git a/scrub/progress.c b/scrub/progress.c
new file mode 100644
index 0000000..80947ef
--- /dev/null
+++ b/scrub/progress.c
@@ -0,0 +1,215 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <stdio.h>
+#include <dirent.h>
+#include <sys/statvfs.h>
+#include "../repair/threads.h"
+#include "path.h"
+#include "disk.h"
+#include "read_verify.h"
+#include "scrub.h"
+#include "common.h"
+#include "counter.h"
+#include "progress.h"
+
+/*
+ * Progress Tracking
+ *
+ * For scrub phases that expect to take a long time, this facility uses
+ * the threaded counter and some phase/state information to report the
+ * progress of a particular phase to stdout.  Each phase that wants
+ * progress information needs to set up the tracker with an estimate of
+ * the work to be done and periodic updates when work items finish.  In
+ * return, the progress tracker will print a pretty progress bar and
+ * twiddle to a tty, or a raw numeric output compatible with fsck -C.
+ */
+struct progress_tracker {
+	FILE			*fp;
+	const char		*tag;
+	struct ptcounter	*ptc;
+	uint64_t		max;
+	unsigned int		phase;
+	int			rshift;
+	int			twiddle;
+	bool			isatty;
+};
+
+static struct progress_tracker pt;
+
+/* Add some progress. */
+void
+progress_add(
+	uint64_t		x)
+{
+	if (pt.fp)
+		ptcounter_add(pt.ptc, x);
+}
+
+static const char twiddles[] = "|/-\\";
+
+static void
+progress_report(
+	uint64_t		sum)
+{
+	char			buf[80];
+	int			tag_len;
+	int			num_len;
+	int			pbar_len;
+	int			plen;
+
+	if (!pt.fp)
+		return;
+
+	if (sum > pt.max)
+		sum = pt.max;
+
+	/* Emulate fsck machine-readable output (phase, cur, max, label) */
+	if (!pt.isatty) {
+		snprintf(buf, sizeof(buf), _("%u %"PRIu64" %"PRIu64" %s"),
+				pt.phase, sum, pt.max, pt.tag);
+		fprintf(pt.fp, "%s\n", buf);
+		fflush(pt.fp);
+		return;
+	}
+
+	/* Interactive twiddle progress bar. */
+	tag_len = snprintf(buf, sizeof(buf), _("Phase %u: |"), pt.phase);
+	num_len = snprintf(buf, sizeof(buf),
+			"%c %"PRIu64"/%"PRIu64" (%.1f%%)",
+			twiddles[pt.twiddle],
+			sum >> pt.rshift,
+			pt.max >> pt.rshift,
+			100.0 * sum / pt.max) + 1;
+	pbar_len = sizeof(buf) - (num_len + tag_len);
+	snprintf(buf, sizeof(buf), _("Phase %u: |"), pt.phase);
+	snprintf(buf + sizeof(buf) - num_len, num_len,
+			"%c %"PRIu64"/%"PRIu64" (%.1f%%)",
+			twiddles[pt.twiddle],
+			sum >> pt.rshift,
+			pt.max >> pt.rshift,
+			100.0 * sum / pt.max);
+	plen = (int)((double)pbar_len * sum / pt.max);
+	memset(buf + tag_len, '=', plen);
+	memset(buf + tag_len + plen, ' ', pbar_len - plen);
+	pt.twiddle = (pt.twiddle + 1) % 4;
+	fprintf(pt.fp, "%c%s\r%c", START_IGNORE, buf, END_IGNORE);
+	fflush(pt.fp);
+}
+
+static void
+progress_report_handler(
+	int			sig)
+{
+	if (!pt.fp)
+		return;
+
+	progress_report(ptcounter_value(pt.ptc));
+}
+
+/* End a phase of progress reporting. */
+void
+progress_end_phase(void)
+{
+	struct sigaction	sa = {
+		.sa_handler = SIG_DFL,
+	};
+	struct itimerval	itimer = {
+		.it_interval = {
+			.tv_sec = 0,
+		},
+		.it_value = {
+			.tv_sec = 0,
+		},
+	};
+
+	if (!pt.fp)
+		return;
+
+	setitimer(ITIMER_REAL, &itimer, NULL);
+	sigaction(SIGALRM, &sa, NULL);
+	progress_report(pt.max);
+	ptcounter_free(pt.ptc);
+	pt.max = 0;
+	pt.ptc = NULL;
+	if (pt.fp) {
+		fprintf(pt.fp, CLEAR_EOL);
+		fflush(pt.fp);
+	}
+	pt.fp = NULL;
+}
+
+/* Set ourselves up to report progress. */
+bool
+progress_init_phase(
+	struct scrub_ctx	*ctx,
+	FILE			*fp,
+	unsigned int		phase,
+	uint64_t		max,
+	int			rshift,
+	unsigned int		nr_threads)
+{
+	struct sigaction	sa = {
+		.sa_handler = progress_report_handler,
+	};
+	struct itimerval	itimer = {
+		.it_interval = {
+			.tv_sec = 0,
+			.tv_usec = 500000,
+		},
+		.it_value = {
+			.tv_sec = 0,
+			.tv_usec = 500000,
+		},
+	};
+	int			ret;
+
+	assert(pt.fp == NULL);
+	if (fp == NULL || max == 0) {
+		pt.fp = NULL;
+		return true;
+	}
+	pt.fp = fp;
+	pt.isatty = isatty(fileno(fp));
+	pt.tag = ctx->mntpoint;
+	pt.max = max;
+	pt.phase = phase;
+	pt.rshift = rshift;
+	pt.twiddle = 0;
+	pt.ptc = ptcounter_init(nr_threads);
+	if (!pt.ptc)
+		goto out_max;
+	ret = sigaction(SIGALRM, &sa, NULL);
+	if (ret)
+		goto out_ptcounter;
+	ret = setitimer(ITIMER_REAL, &itimer, NULL);
+	if (ret)
+		goto out_sigaction;
+	return true;
+out_sigaction:
+	sa.sa_handler = SIG_DFL;
+	sigaction(SIGALRM, &sa, NULL);
+out_ptcounter:
+	ptcounter_free(pt.ptc);
+	pt.ptc = NULL;
+out_max:
+	pt.max = 0;
+	return false;
+}
diff --git a/scrub/progress.h b/scrub/progress.h
new file mode 100644
index 0000000..1fbbf77
--- /dev/null
+++ b/scrub/progress.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_SCRUB_PROGRESS_H_
+#define XFS_SCRUB_PROGRESS_H_
+
+#define CLEAR_EOL	"\033[K"
+#define START_IGNORE	'\001'
+#define END_IGNORE	'\002'
+
+bool progress_init_phase(struct scrub_ctx *ctx, FILE *progress_fp,
+			 unsigned int phase, uint64_t max, int rshift,
+			 unsigned int nr_threads);
+void progress_end_phase(void);
+void progress_add(uint64_t x);
+
+#endif /* XFS_SCRUB_PROGRESS_H_ */
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 18ba73a..b4fa081 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -29,6 +29,7 @@
 #include "scrub.h"
 #include "common.h"
 #include "counter.h"
+#include "progress.h"
 
 /*
  * Read Verify Pool
@@ -123,6 +124,7 @@ read_verify(
 					errno, rv->io_end_arg);
 		}
 
+		progress_add(len);
 		verified += len;
 		rv->io_start += len;
 		rv->io_length -= len;
diff --git a/scrub/repair.c b/scrub/repair.c
index 53706c3..2dbd413 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -37,6 +37,7 @@
 #include "common.h"
 #include "ioctl.h"
 #include "repair.h"
+#include "progress.h"
 
 /*
  * Prioritize repair items in order of how long we can wait.
@@ -261,6 +262,8 @@ xfs_repair_list_now(
 			rl->nr--;
 			list_del(&ri->list);
 			free(ri);
+			if (repair_flags & XRML_REPORT_PROGRESS)
+				progress_add(1);
 			continue;
 		case CHECK_ABORT:
 			return false;
diff --git a/scrub/repair.h b/scrub/repair.h
index 3142838..4eaf5f5 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -45,6 +45,9 @@ void xfs_repair_find_mustfix(struct xfs_repair_list *repairs,
 #define XRML_REPAIR_ONLY	(XRM_REPAIR_ONLY)
 #define XRML_NOFIX_COMPLAIN	(XRM_NOFIX_COMPLAIN)
 
+/* Report progress */
+#define XRML_REPORT_PROGRESS	(1 << 31)
+
 bool xfs_repair_list_now(struct scrub_ctx *ctx, int fd,
 			 struct xfs_repair_list *repair_list,
 			 unsigned int repair_flags);
diff --git a/scrub/scrub.c b/scrub/scrub.c
index b654be5..f319f5b 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -38,6 +38,7 @@
 #include "input.h"
 #include "ioctl.h"
 #include "xfs.h"
+#include "progress.h"
 
 #define _PATH_PROC_MOUNTS	"/proc/mounts"
 
@@ -156,12 +157,17 @@ bool				scrub_data;
 /* Size of a memory page. */
 long				page_size;
 
+/* If stdout/stderr are ttys, we can use richer terminal control. */
+bool				stderr_isatty;
+bool				stdout_isatty;
+
 static void __attribute__((noreturn))
 usage(void)
 {
 	fprintf(stderr, _("Usage: %s [OPTIONS] mountpoint\n"), progname);
 	fprintf(stderr, _("-a:\tStop after this many errors are found.\n"));
 	fprintf(stderr, _("-b:\tBackground mode.\n"));
+	fprintf(stderr, _("-C:\tPrint progress information to this fd.\n"));
 	fprintf(stderr, _("-e:\tWhat to do if errors are found.\n"));
 	fprintf(stderr, _("-m:\tPath to /etc/mtab.\n"));
 	fprintf(stderr, _("-n:\tDry run.  Do not modify anything.\n"));
@@ -236,6 +242,8 @@ struct phase_rusage {
 struct phase_ops {
 	char		*descr;
 	bool		(*fn)(struct scrub_ctx *);
+	bool		(*estimate_work)(struct scrub_ctx *, uint64_t *,
+					 unsigned int *, int *);
 	bool		must_run;
 };
 
@@ -440,7 +448,8 @@ _("Errors found, please re-run with -y."));
 /* Run all the phases of the scrubber. */
 static bool
 run_scrub_phases(
-	struct scrub_ctx	*ctx)
+	struct scrub_ctx	*ctx,
+	FILE			*progress_fp)
 {
 	struct phase_ops phases[] =
 	{
@@ -452,22 +461,27 @@ run_scrub_phases(
 		{
 			.descr = _("Check internal metadata."),
 			.fn = xfs_scan_metadata,
+			.estimate_work = xfs_estimate_metadata_work,
 		},
 		{
 			.descr = _("Scan all inodes."),
 			.fn = xfs_scan_inodes,
+			.estimate_work = xfs_estimate_inodes_work,
 		},
 		{
 			.descr = _("Defer filesystem repairs."),
 			.fn = REPAIR_DUMMY_FN,
+			.estimate_work = xfs_estimate_repair_work,
 		},
 		{
 			.descr = _("Check directory tree."),
 			.fn = xfs_scan_connections,
+			.estimate_work = xfs_estimate_inodes_work,
 		},
 		{
 			.descr = _("Verify data file integrity."),
 			.fn = DATASCAN_DUMMY_FN,
+			.estimate_work = xfs_estimate_verify_work,
 		},
 		{
 			.descr = _("Check summary counters."),
@@ -479,9 +493,12 @@ run_scrub_phases(
 	};
 	struct phase_rusage	pi;
 	struct phase_ops	*sp;
+	uint64_t		max_work;
 	bool			moveon = true;
 	unsigned int		debug_phase = 0;
 	unsigned int		phase;
+	unsigned int		nr_threads;
+	int			rshift;
 
 	if (debug && debug_tweak_on("XFS_SCRUB_PHASE"))
 		debug_phase = atoi(getenv("XFS_SCRUB_PHASE"));
@@ -514,6 +531,18 @@ run_scrub_phases(
 		moveon = phase_start(&pi, phase, sp->descr);
 		if (!moveon)
 			break;
+		if (sp->estimate_work) {
+			moveon = sp->estimate_work(ctx, &max_work, &nr_threads,
+					&rshift);
+			if (!moveon)
+				break;
+			moveon = progress_init_phase(ctx, progress_fp, phase,
+					max_work, rshift, nr_threads);
+		} else {
+			moveon = progress_init_phase(ctx, NULL, phase, 0, 0, 0);
+		}
+		if (!moveon)
+			break;
 		moveon = sp->fn(ctx);
 		if (!moveon) {
 			str_info(ctx, ctx->mntpoint,
@@ -521,6 +550,7 @@ _("Scrub aborted after phase %d."),
 					phase);
 			break;
 		}
+		progress_end_phase();
 		moveon = phase_end(&pi, phase);
 		if (!moveon)
 			break;
@@ -541,6 +571,7 @@ main(
 {
 	int			c;
 	char			*mtab = NULL;
+	FILE			*progress_fp = NULL;
 	struct scrub_ctx	ctx = {0};
 	struct phase_rusage	all_pi;
 	unsigned long long	total_errors;
@@ -559,7 +590,7 @@ main(
 	ctx.datadev.d_fd = -1;
 	ctx.mode = SCRUB_MODE_DEFAULT;
 	ctx.error_action = ERRORS_CONTINUE;
-	while ((c = getopt(argc, argv, "a:bde:m:nTvxVy")) != EOF) {
+	while ((c = getopt(argc, argv, "a:bC:de:m:nTvxVy")) != EOF) {
 		switch (c) {
 		case 'a':
 			ctx.max_errors = cvt_u64(optarg, 10);
@@ -572,6 +603,19 @@ main(
 			nr_threads = 1;
 			bg_mode++;
 			break;
+		case 'C':
+			errno = 0;
+			ret = cvt_u32(optarg, 10);
+			if (errno) {
+				perror(optarg);
+				usage();
+			}
+			progress_fp = fdopen(ret, "w");
+			if (!progress_fp) {
+				perror(optarg);
+				usage();
+			}
+			break;
 		case 'd':
 			debug++;
 			dumpcore = true;
@@ -624,6 +668,13 @@ _("Only one of the options -n or -y may be specified.\n"));
 		}
 	}
 
+	stdout_isatty = isatty(STDOUT_FILENO);
+	stderr_isatty = isatty(STDERR_FILENO);
+
+	/* If interactive, start the progress bar. */
+	if (stdout_isatty && !progress_fp)
+		progress_fp = fdopen(1, "w+");
+
 	/* Override thread count if debugger */
 	if (debug_tweak_on("XFS_SCRUB_THREADS")) {
 		unsigned int	x;
@@ -703,7 +754,7 @@ _("Only one of the options -n or -y may be specified.\n"));
 	}
 
 	/* Scrub a filesystem. */
-	moveon = run_scrub_phases(&ctx);
+	moveon = run_scrub_phases(&ctx, progress_fp);
 	if (!moveon)
 		goto out;
 
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 0561044..4050b73 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -29,6 +29,8 @@ extern bool			dumpcore;
 extern bool			verbose;
 extern bool			scrub_data;
 extern long			page_size;
+extern bool			stderr_isatty;
+extern bool			stdout_isatty;
 
 enum scrub_mode {
 	SCRUB_MODE_DRY_RUN,
diff --git a/scrub/xfs.h b/scrub/xfs.h
index 8a7d32e..448a536 100644
--- a/scrub/xfs.h
+++ b/scrub/xfs.h
@@ -48,4 +48,15 @@ bool xfs_scan_blocks(struct scrub_ctx *ctx);
 bool xfs_scan_summary(struct scrub_ctx *ctx);
 bool xfs_repair_fs(struct scrub_ctx *ctx);
 
+uint64_t xfs_estimate_inodes(struct scrub_ctx *ctx);
+
+bool xfs_estimate_metadata_work(struct scrub_ctx *ctx, uint64_t *items,
+				unsigned int *nr_threads, int *rshift);
+bool xfs_estimate_inodes_work(struct scrub_ctx *ctx, uint64_t *items,
+			      unsigned int *nr_threads, int *rshift);
+bool xfs_estimate_repair_work(struct scrub_ctx *ctx, uint64_t *items,
+			      unsigned int *nr_threads, int *rshift);
+bool xfs_estimate_verify_work(struct scrub_ctx *ctx, uint64_t *items,
+			      unsigned int *nr_threads, int *rshift);
+
 #endif /* XFS_SCRUB_XFS_H_ */
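
The interactive progress line assembled in progress_report() is easy to misread from the buffer arithmetic alone. The Python sketch below mirrors the layout as read from the patch: a "Phase N: |" tag, a run of '=' scaled to the completed fraction, then a spinner and the shifted counts. The name render_progress and the width parameter are illustrative assumptions and do not appear in the patch. The sketch also assumes total > 0, which matches progress_init_phase() disabling reporting when max is 0.

```python
TWIDDLES = "|/-\\"


def render_progress(phase, done, total, rshift=0, twiddle=0, width=79):
    """Return one progress line: tag, '=' bar, then spinner and counts.

    'width' is the number of visible characters (hypothetical parameter;
    the C code uses an 80-byte buffer including the terminating NUL).
    """
    # Clamp, as progress_report() does when the counter overshoots.
    done = min(done, total)
    tag = "Phase %d: |" % phase
    suffix = "%s %d/%d (%.1f%%)" % (
        TWIDDLES[twiddle % len(TWIDDLES)],
        done >> rshift,
        total >> rshift,
        100.0 * done / total,
    )
    # Whatever is left between the tag and the suffix is the bar.
    bar_len = max(width - len(tag) - len(suffix), 0)
    filled = int(bar_len * done / total)
    return tag + "=" * filled + " " * (bar_len - filled) + suffix
```

With rshift=20, as xfs_estimate_verify_work() sets it, byte counts are displayed in MiB while the bar itself is still scaled from the unshifted values.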



* [PATCH 21/22] xfs_scrub: create a script to scrub all xfs filesystems
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (19 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 20/22] xfs_scrub: progress indicator Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  2017-08-04  0:09 ` [PATCH 22/22] xfs_scrub: integrate services with systemd Darrick J. Wong
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create an xfs_scrub_all command to find all XFS filesystems
and run an online scrub against them all.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 debian/control           |    3 +
 debian/rules             |    1 
 man/man8/xfs_scrub_all.8 |   32 ++++++++++
 scrub/Makefile           |   15 ++++
 scrub/xfs_scrub_all.in   |  154 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 201 insertions(+), 4 deletions(-)
 create mode 100644 man/man8/xfs_scrub_all.8
 create mode 100644 scrub/xfs_scrub_all.in
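
The scheduling rule in this patch (run scrubbers in parallel as long as no two of them touch the same physical disk) can be sketched as a small pure function. This is a minimal illustration, not the code in xfs_scrub_all.in: the name pick_runnable is hypothetical, and iterating in sorted order is an assumption made only to keep the result deterministic.

```python
def pick_runnable(pending, busy):
    """Choose mountpoints whose disks are all idle.

    pending: dict mapping mountpoint -> set of disk names
    busy:    set of disk names that already have a scrubber running
    Returns the chosen mountpoints; neither argument is modified.
    """
    claimed = set(busy)
    chosen = []
    for mnt in sorted(pending):
        devs = pending[mnt]
        # A filesystem may start only if none of its disks is in use,
        # including disks claimed earlier in this same pass.
        if claimed.isdisjoint(devs):
            claimed |= devs
            chosen.append(mnt)
    return chosen
```

For example, with "/" and "/home" both on sda and "/srv" on sdb, one pass would start "/" and "/srv" and leave "/home" waiting until sda is free again.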


diff --git a/debian/control b/debian/control
index c255ae6..30602cb 100644
--- a/debian/control
+++ b/debian/control
@@ -3,12 +3,13 @@ Section: admin
 Priority: optional
 Maintainer: XFS Development Team <linux-xfs@vger.kernel.org>
 Uploaders: Nathan Scott <nathans@debian.org>, Anibal Monsalve Salazar <anibal@debian.org>
-Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev, libunistring-dev
+Build-Depends: uuid-dev, dh-autoreconf, debhelper (>= 5), gettext, libtool, libreadline-gplv2-dev | libreadline5-dev, libblkid-dev (>= 2.17), linux-libc-dev, libunistring-dev, dh-python
 Standards-Version: 3.9.1
 Homepage: http://xfs.org/
 
 Package: xfsprogs
 Depends: ${shlibs:Depends}, ${misc:Depends}
+Recommends: ${python3:Depends}, util-linux
 Provides: fsck-backend
 Suggests: xfsdump, acl, attr, quota
 Breaks: xfsdump (<< 3.0.0)
diff --git a/debian/rules b/debian/rules
index c673380..a870944 100755
--- a/debian/rules
+++ b/debian/rules
@@ -76,6 +76,7 @@ binary-arch: checkroot built
 	$(pkgdi)  $(MAKE) -C debian install-d-i
 	$(pkgme)  $(MAKE) dist
 	rmdir debian/xfslibs-dev/usr/share/doc/xfsprogs
+	dh_python3
 	dh_installdocs
 	dh_installchangelogs
 	dh_strip
diff --git a/man/man8/xfs_scrub_all.8 b/man/man8/xfs_scrub_all.8
new file mode 100644
index 0000000..5e1420b
--- /dev/null
+++ b/man/man8/xfs_scrub_all.8
@@ -0,0 +1,32 @@
+.TH xfs_scrub_all 8
+.SH NAME
+xfs_scrub_all \- scrub all mounted XFS filesystems
+.SH SYNOPSIS
+.B xfs_scrub_all
+.SH DESCRIPTION
+.B xfs_scrub_all
+attempts to read and check all the metadata on all mounted XFS filesystems.
+The online scrub is performed via the
+.B xfs_scrub
+tool, either by running it directly or by using systemd to start it
+in a restricted fashion.
+Mounted filesystems are mapped to physical storage devices so that scrub
+operations can be run in parallel so long as no two scrubbers access
+the same device simultaneously.
+.SH EXIT CODE
+The exit code returned by
+.B xfs_scrub_all
+is the sum of the following conditions:
+.br
+\	0\	\-\ No errors
+.br
+\	4\	\-\ File system errors left uncorrected
+.br
+\	8\	\-\ Operational error
+.br
+\	16\	\-\ Usage or syntax error
+.PP
+These are the same error codes returned by xfs_scrub.
+.br
+.SH SEE ALSO
+.BR xfs_scrub (8).
diff --git a/scrub/Makefile b/scrub/Makefile
index 5b3e522..9db4a5d 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -13,6 +13,8 @@ SCRUB_PREREQS=$(PKG_PLATFORM)$(HAVE_OPENAT)$(HAVE_FSTATAT)
 ifeq ($(SCRUB_PREREQS),linuxyesyes)
 LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
+XFS_SCRUB_ALL_PROG = xfs_scrub_all
+XFS_SCRUB_ARGS = -n
 endif	# scrub_prereqs
 
 HFILES = \
@@ -83,17 +85,24 @@ ifeq ($(HAVE_ATTR_ROOT),yes)
 LCFLAGS += -DHAVE_ATTR_ROOT
 endif
 
-default: depend $(LTCOMMAND)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG)
+
+xfs_scrub_all: xfs_scrub_all.in
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e "s|@sbindir@|$(PKG_ROOT_SBIN_DIR)|g" \
+		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" < $< > $@
+	$(Q)chmod a+x $@
 
 unicrash.o xfs.o: $(TOPDIR)/include/builddefs
 
 include $(BUILDRULES)
 
-install: default $(INSTALL_SCRUB)
+install: $(INSTALL_SCRUB)
 
-install-scrub:
+install-scrub: default
 	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+	$(INSTALL) -m 755 $(XFS_SCRUB_ALL_PROG) $(PKG_ROOT_SBIN_DIR)
 
 install-dev:
 
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
new file mode 100644
index 0000000..1abcef6
--- /dev/null
+++ b/scrub/xfs_scrub_all.in
@@ -0,0 +1,154 @@
+#!/usr/bin/env python3
+
+# Run online scrubbers in parallel, but avoid thrashing.
+#
+# Copyright (C) 2017 Oracle.  All rights reserved.
+#
+# Author: Darrick J. Wong <darrick.wong@oracle.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+
+import subprocess
+import json
+import threading
+import time
+import sys
+
+retcode = 0
+terminate = False
+
+def find_mounts():
+	'''Map mountpoints to physical disks.'''
+
+	fs = {}
+	cmd=['lsblk', '-o', 'KNAME,TYPE,FSTYPE,MOUNTPOINT', '-J']
+	result = subprocess.Popen(cmd, stdout=subprocess.PIPE)
+	result.wait()
+	if result.returncode != 0:
+		return fs
+	sarray = [x.decode('utf-8') for x in result.stdout.readlines()]
+	output = ' '.join(sarray)
+	bdevdata = json.loads(output)
+	# The lsblk output had better be in disks-then-partitions order
+	for bdev in bdevdata['blockdevices']:
+		if bdev['type'] == 'disk':
+			lastdisk = bdev['kname']
+		if bdev['fstype'] == 'xfs':
+			mnt = bdev['mountpoint']
+			if mnt is None:
+				continue
+			if mnt in fs:
+				fs[mnt].add(lastdisk)
+			else:
+				fs[mnt] = set([lastdisk])
+	return fs
+
+def run_killable(cmd, stdout, killfuncs, kill_fn):
+	'''Run a killable program.  Returns program retcode or -1 if we can't start it.'''
+	try:
+		proc = subprocess.Popen(cmd, stdout = stdout)
+		real_kill_fn = lambda: kill_fn(proc)
+		killfuncs.add(real_kill_fn)
+		proc.wait()
+		try:
+			killfuncs.remove(real_kill_fn)
+		except:
+			pass
+		return proc.returncode
+	except:
+		return -1
+
+def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
+	'''Run a scrub process.'''
+	global retcode, terminate
+
+	print("Scrubbing %s..." % mnt)
+	sys.stdout.flush()
+
+	try:
+		if terminate:
+			return
+
+		# Invoke xfs_scrub manually
+		cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt]
+		ret = run_killable(cmd, None, killfuncs, \
+				lambda proc: proc.terminate())
+		if ret >= 0:
+			print("Scrubbing %s done, (err=%d)" % (mnt, ret))
+			sys.stdout.flush()
+			retcode |= ret
+			return
+
+		if terminate:
+			return
+
+		print("Unable to start scrub tool.")
+		sys.stdout.flush()
+	finally:
+		running_devs -= mntdevs
+		cond.acquire()
+		cond.notify()
+		cond.release()
+
+def main():
+	'''Find mounts, schedule scrub runs.'''
+	def thr(mnt, devs):
+		a = (mnt, cond, running_devs, devs, killfuncs)
+		thr = threading.Thread(target = run_scrub, args = a)
+		thr.start()
+	global retcode, terminate
+
+	fs = find_mounts()
+
+	# Schedule scrub jobs...
+	running_devs = set()
+	killfuncs = set()
+	cond = threading.Condition()
+	while len(fs) > 0:
+		if len(running_devs) == 0:
+			mnt, devs = fs.popitem()
+			running_devs.update(devs)
+			thr(mnt, devs)
+		poppers = set()
+		for mnt in fs:
+			devs = fs[mnt]
+			can_run = True
+			for dev in devs:
+				if dev in running_devs:
+					can_run = False
+					break
+			if can_run:
+				running_devs.update(devs)
+				poppers.add(mnt)
+				thr(mnt, devs)
+		for p in poppers:
+			fs.pop(p)
+		cond.acquire()
+		try:
+			cond.wait()
+		except KeyboardInterrupt:
+			terminate = True
+			print("Terminating...")
+			sys.stdout.flush()
+			while len(killfuncs) > 0:
+				fn = killfuncs.pop()
+				fn()
+			fs = []
+		cond.release()
+
+	sys.exit(retcode)
+
+if __name__ == '__main__':
+	main()



* [PATCH 22/22] xfs_scrub: integrate services with systemd
  2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
                   ` (20 preceding siblings ...)
  2017-08-04  0:09 ` [PATCH 21/22] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
@ 2017-08-04  0:09 ` Darrick J. Wong
  21 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2017-08-04  0:09 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a systemd service unit so that we can run the online scrubber
under systemd with (somewhat) appropriate containment.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 configure.ac                     |   15 +++++++++++
 include/builddefs.in             |    3 ++
 scrub/Makefile                   |   22 ++++++++++++++--
 scrub/scrub.c                    |   25 ++++++++++++++++++
 scrub/xfs_scrub@.service.in      |   18 +++++++++++++
 scrub/xfs_scrub_all.in           |   53 ++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrub_all.service.in   |    8 ++++++
 scrub/xfs_scrub_all.timer        |   11 ++++++++
 scrub/xfs_scrub_fail             |   26 +++++++++++++++++++
 scrub/xfs_scrub_fail@.service.in |   10 +++++++
 10 files changed, 189 insertions(+), 2 deletions(-)
 create mode 100644 scrub/xfs_scrub@.service.in
 create mode 100644 scrub/xfs_scrub_all.service.in
 create mode 100644 scrub/xfs_scrub_all.timer
 create mode 100755 scrub/xfs_scrub_fail
 create mode 100644 scrub/xfs_scrub_fail@.service.in
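
Both scrub.c and xfs_scrub_all.in in this patch shift a nonzero exit status when SERVICE_MODE is present in the environment. The sketch below restates that mapping under the assumption that the offset is 150, as in the code of this patch; the function name service_exit_code is hypothetical and the environment is passed in explicitly so the behaviour is easy to check.

```python
SERVICE_EXIT_OFFSET = 150  # value used by "ret += 150" in the patch


def service_exit_code(ret, environ):
    """Map a scrub return code to the process exit status.

    Outside service mode the code is returned unchanged.  In service
    mode a nonzero code is shifted by SERVICE_EXIT_OFFSET; zero still
    means success.
    """
    if "SERVICE_MODE" in environ and ret:
        return ret + SERVICE_EXIT_OFFSET
    return ret
```

A caller would typically pass os.environ, for example sys.exit(service_exit_code(retcode, os.environ)).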


diff --git a/configure.ac b/configure.ac
index 91fba71..2e92ee6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -103,6 +103,21 @@ esac
 AC_SUBST([root_sbindir])
 AC_SUBST([root_libdir])
 
+# Where do systemd services go?
+pkg_systemdsystemunitdir="$(pkg-config --variable=systemdsystemunitdir systemd 2>/dev/null)"
+case "${pkg_systemdsystemunitdir}" in
+"")
+	systemdsystemunitdir=""
+	have_systemd=no
+	;;
+*)
+	systemdsystemunitdir="${pkg_systemdsystemunitdir}"
+	have_systemd=yes
+	;;
+esac
+AC_SUBST([have_systemd])
+AC_SUBST([systemdsystemunitdir])
+
 # Find localized files.  Don't descend into any "dot directories"
 # (like .git or .pc from quilt).  Strangely, the "-print" argument
 # to "find" is required, to avoid including such directories in the
diff --git a/include/builddefs.in b/include/builddefs.in
index 2cde2dc..cf50a26 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -122,6 +122,9 @@ HAVE_SG_IO = @have_sg_io@
 HAVE_HDIO_GETGEO = @have_hdio_getgeo@
 HAVE_ATTR_ROOT = @have_attr_root@
 
+HAVE_SYSTEMD = @have_systemd@
+SYSTEMDSYSTEMUNITDIR = @systemdsystemunitdir@
+
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
 
diff --git a/scrub/Makefile b/scrub/Makefile
index 9db4a5d..d10f171 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -14,7 +14,12 @@ ifeq ($(SCRUB_PREREQS),linuxyesyes)
 LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 XFS_SCRUB_ALL_PROG = xfs_scrub_all
-XFS_SCRUB_ARGS = -n
+XFS_SCRUB_ARGS = -b -n
+ifeq ($(HAVE_SYSTEMD),yes)
+INSTALL_SCRUB += install-systemd
+SYSTEMDSERVICES = xfs_scrub@.service xfs_scrub_all.service xfs_scrub_all.timer xfs_scrub_fail@.service
+endif
+
 endif	# scrub_prereqs
 
 HFILES = \
@@ -85,7 +90,7 @@ ifeq ($(HAVE_ATTR_ROOT),yes)
 LCFLAGS += -DHAVE_ATTR_ROOT
 endif
 
-default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(SYSTEMDSERVICES)
 
 xfs_scrub_all: xfs_scrub_all.in
 	@echo "    [SED]    $@"
@@ -99,6 +104,19 @@ include $(BUILDRULES)
 
 install: $(INSTALL_SCRUB)
 
+%.service: %.service.in
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e "s|@sbindir@|$(PKG_ROOT_SBIN_DIR)|g" \
+		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" \
+		   -e "s|@pkg_lib_dir@|$(PKG_LIB_DIR)|g" \
+		   -e "s|@pkg_name@|$(PKG_NAME)|g" < $< > $@
+
+install-systemd: default
+	$(INSTALL) -m 755 -d $(SYSTEMDSYSTEMUNITDIR)
+	$(INSTALL) -m 644 $(SYSTEMDSERVICES) $(SYSTEMDSYSTEMUNITDIR)
+	$(INSTALL) -m 755 -d $(PKG_LIB_DIR)/$(PKG_NAME)
+	$(INSTALL) -m 755 xfs_scrub_fail $(PKG_LIB_DIR)/$(PKG_NAME)
+
 install-scrub: default
 	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
diff --git a/scrub/scrub.c b/scrub/scrub.c
index f319f5b..915fc4a 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -161,6 +161,12 @@ long				page_size;
 bool				stderr_isatty;
 bool				stdout_isatty;
 
+/*
+ * If we are running as a service, we need to be careful about what
+ * error codes we return to the calling process.
+ */
+bool				is_service;
+
 static void __attribute__((noreturn))
 usage(void)
 {
@@ -692,6 +698,9 @@ _("Only one of the options -n or -y may be specified.\n"));
 
 	ctx.mntpoint = argv[optind];
 
+	if (getenv("SERVICE_MODE"))
+		is_service = true;
+
 	/* Find the mount record for the passed-in argument. */
 	if (stat(argv[optind], &ctx.mnt_sb) < 0) {
 		fprintf(stderr,
@@ -817,5 +826,21 @@ _("%s: %llu warnings found.\n"),
 	free(ctx.readbuf);
 	free(ctx.mntpoint);
 end:
+	/*
+	 * If we're running as a service, bump return code up by 150 to
+	 * avoid conflicting with service return codes.
+	 */
+	if (is_service) {
+		/*
+		 * journald queries /proc as part of taking in log
+		 * messages; it uses this information to associate the
+		 * message with systemd units, etc.  This races with
+		 * process exit, so delay that a couple of seconds so
+		 * that we capture the summary outputs in the job log.
+		 */
+		sleep(2);
+		if (ret)
+			ret += 150;
+	}
 	return ret;
 }
diff --git a/scrub/xfs_scrub@.service.in b/scrub/xfs_scrub@.service.in
new file mode 100644
index 0000000..6b6992d
--- /dev/null
+++ b/scrub/xfs_scrub@.service.in
@@ -0,0 +1,18 @@
+[Unit]
+Description=Online XFS Metadata Check for %I
+OnFailure=xfs_scrub_fail@%i.service
+
+[Service]
+Type=oneshot
+WorkingDirectory=%I
+PrivateNetwork=true
+ProtectSystem=full
+ProtectHome=read-only
+PrivateTmp=yes
+AmbientCapabilities=CAP_SYS_ADMIN CAP_FOWNER CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_SYS_RAWIO
+NoNewPrivileges=yes
+User=nobody
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+Environment=SERVICE_MODE=1
+ExecStart=@sbindir@/xfs_scrub @scrub_args@ %I
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 1abcef6..c522181 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -25,10 +25,19 @@ import json
 import threading
 import time
 import sys
+import os
 
 retcode = 0
 terminate = False
 
+def DEVNULL():
+	'''Return /dev/null in subprocess writable format.'''
+	try:
+		from subprocess import DEVNULL
+		return DEVNULL
+	except ImportError:
+		return open(os.devnull, 'wb')
+
 def find_mounts():
 	'''Map mountpoints to physical disks.'''
 
@@ -55,6 +64,13 @@ def find_mounts():
 				fs[mnt] = set([lastdisk])
 	return fs
 
+def kill_systemd(unit, proc):
+	'''Kill systemd unit.'''
+	proc.terminate()
+	cmd=['systemctl', 'stop', unit]
+	x = subprocess.Popen(cmd)
+	x.wait()
+
 def run_killable(cmd, stdout, killfuncs, kill_fn):
 	'''Run a killable program.  Returns program retcode or -1 if we can't start it.'''
 	try:
@@ -81,6 +97,19 @@ def run_scrub(mnt, cond, running_devs, mntdevs, killfuncs):
 		if terminate:
 			return
 
+		# Try it the systemd way
+		cmd=['systemctl', 'start', 'xfs_scrub@%s' % mnt]
+		ret = run_killable(cmd, DEVNULL(), killfuncs, \
+				lambda proc: kill_systemd('xfs_scrub@%s' % mnt, proc))
+		if ret == 0 or ret == 1:
+			print("Scrubbing %s done, (err=%d)" % (mnt, ret))
+			sys.stdout.flush()
+			retcode |= ret
+			return
+
+		if terminate:
+			return
+
 		# Invoke xfs_scrub manually
 		cmd=['@sbindir@/xfs_scrub', '@scrub_args@', mnt]
 		ret = run_killable(cmd, None, killfuncs, \
@@ -112,6 +141,17 @@ def main():
 
 	fs = find_mounts()
 
+	# Tail the journal if we ourselves aren't a service...
+	journalthread = None
+	if 'SERVICE_MODE' not in os.environ:
+		try:
+			cmd=['journalctl', '--no-pager', '-q', '-S', 'now', \
+					'-f', '-u', 'xfs_scrub@*', '-o', \
+					'cat']
+			journalthread = subprocess.Popen(cmd)
+		except:
+			pass
+
 	# Schedule scrub jobs...
 	running_devs = set()
 	killfuncs = set()
@@ -148,6 +188,19 @@ def main():
 			fs = []
 		cond.release()
 
+	if journalthread is not None:
+		journalthread.terminate()
+
+	# journald queries /proc as part of taking in log
+	# messages; it uses this information to associate the
+	# message with systemd units, etc.  This races with
+	# process exit, so delay that a couple of seconds so
+	# that we capture the summary outputs in the job log.
+	if 'SERVICE_MODE' in os.environ:
+		time.sleep(2)
+		if retcode:
+			retcode += 150
+
 	sys.exit(retcode)
 
 if __name__ == '__main__':
diff --git a/scrub/xfs_scrub_all.service.in b/scrub/xfs_scrub_all.service.in
new file mode 100644
index 0000000..683804e
--- /dev/null
+++ b/scrub/xfs_scrub_all.service.in
@@ -0,0 +1,8 @@
+[Unit]
+Description=Online XFS Metadata Check for All Filesystems
+ConditionACPower=true
+
+[Service]
+Type=oneshot
+Environment=SERVICE_MODE=1
+ExecStart=@sbindir@/xfs_scrub_all
diff --git a/scrub/xfs_scrub_all.timer b/scrub/xfs_scrub_all.timer
new file mode 100644
index 0000000..2e4a33b
--- /dev/null
+++ b/scrub/xfs_scrub_all.timer
@@ -0,0 +1,11 @@
+[Unit]
+Description=Periodic XFS Online Metadata Check for All Filesystems
+
+[Timer]
+# Run on Sunday at 3:10am, to avoid running afoul of DST changes
+OnCalendar=Sun *-*-* 03:10:00
+RandomizedDelaySec=60
+Persistent=true
+
+[Install]
+WantedBy=timers.target
diff --git a/scrub/xfs_scrub_fail b/scrub/xfs_scrub_fail
new file mode 100755
index 0000000..36dd50e
--- /dev/null
+++ b/scrub/xfs_scrub_fail
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+# Email logs of failed xfs_scrub unit runs
+
+mailer=/usr/sbin/sendmail
+recipient="$1"
+test -z "${recipient}" && exit 0
+mntpoint="$2"
+test -z "${mntpoint}" && exit 0
+hostname="$(hostname -f 2>/dev/null)"
+test -z "${hostname}" && hostname="${HOSTNAME}"
+if [ ! -x "${mailer}" ]; then
+	echo "${mailer}: Mailer program not found."
+	exit 1
+fi
+
+(cat << ENDL
+To: $1
+From: <xfs_scrub@${hostname}>
+Subject: xfs_scrub failure on ${mntpoint}
+
+So sorry, the automatic xfs_scrub of ${mntpoint} on ${hostname} failed.
+
+A log of what happened follows:
+ENDL
+systemctl status --full --lines 4294967295 "xfs_scrub@${mntpoint}") | "${mailer}" -t -i
diff --git a/scrub/xfs_scrub_fail@.service.in b/scrub/xfs_scrub_fail@.service.in
new file mode 100644
index 0000000..785f881
--- /dev/null
+++ b/scrub/xfs_scrub_fail@.service.in
@@ -0,0 +1,10 @@
+[Unit]
+Description=Online XFS Metadata Check Failure Reporting for %I
+
+[Service]
+Type=oneshot
+Environment=EMAIL_ADDR=root
+ExecStart=@pkg_lib_dir@/@pkg_name@/xfs_scrub_fail "${EMAIL_ADDR}" %I
+User=mail
+Group=mail
+SupplementaryGroups=systemd-journal
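
Not part of the patch: a minimal Python sketch of the exit-status handling
that the xfs_scrub_all hunk above adds for service mode.  The helper name
service_exit_status is hypothetical; only the SERVICE_MODE environment
variable and the +150 offset come from the patch.  The two-second sleep is
left out so that the sketch stays a pure function.

```python
def service_exit_status(retcode, environ):
    """Return the status that would be handed to sys.exit().

    Hypothetical helper: the patch performs this inline at the end of main().
    """
    # Outside of service mode the accumulated status is passed through as-is.
    if 'SERVICE_MODE' not in environ:
        return retcode
    # In service mode a nonzero status is shifted up by 150; zero stays zero.
    if retcode:
        return retcode + 150
    return retcode
```

For example, an accumulated status of 1 is reported as 151 when SERVICE_MODE
is set, and as 1 otherwise.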



Thread overview: 23+ messages
2017-08-04  0:07 [PATCH v9 00/22] xfsprogs: online scrub/repair support Darrick J. Wong
2017-08-04  0:07 ` [PATCH 01/22] xfs_scrub: create online filesystem scrub program Darrick J. Wong
2017-08-04  0:07 ` [PATCH 02/22] xfs_scrub: common error handling Darrick J. Wong
2017-08-04  0:07 ` [PATCH 03/22] xfs_scrub: set up command line argument parsing Darrick J. Wong
2017-08-04  0:08 ` [PATCH 04/22] xfs_scrub: dispatch the various phases of the scrub program Darrick J. Wong
2017-08-04  0:08 ` [PATCH 05/22] xfs_scrub: bind to a mount point and a block device Darrick J. Wong
2017-08-04  0:08 ` [PATCH 06/22] xfs_scrub: find XFS filesystem geometry Darrick J. Wong
2017-08-04  0:08 ` [PATCH 07/22] xfs_scrub: scan filesystem and AG metadata Darrick J. Wong
2017-08-04  0:08 ` [PATCH 08/22] xfs_scrub: scan inodes Darrick J. Wong
2017-08-04  0:08 ` [PATCH 09/22] xfs_scrub: check directory connectivity Darrick J. Wong
2017-08-04  0:08 ` [PATCH 10/22] xfs_scrub: thread-safe stats counter Darrick J. Wong
2017-08-04  0:08 ` [PATCH 11/22] xfs_scrub: create a bitmap data structure Darrick J. Wong
2017-08-04  0:08 ` [PATCH 12/22] xfs_scrub: create infrastructure to read verify data blocks Darrick J. Wong
2017-08-04  0:08 ` [PATCH 13/22] xfs_scrub: scrub file " Darrick J. Wong
2017-08-04  0:09 ` [PATCH 14/22] xfs_scrub: optionally use SCSI READ VERIFY commands to scrub data blocks on disk Darrick J. Wong
2017-08-04  0:09 ` [PATCH 15/22] xfs_scrub: check summary counters Darrick J. Wong
2017-08-04  0:09 ` [PATCH 16/22] xfs_scrub: wire up repair ioctl Darrick J. Wong
2017-08-04  0:09 ` [PATCH 17/22] xfs_scrub: schedule and manage repairs to the filesystem Darrick J. Wong
2017-08-04  0:09 ` [PATCH 18/22] xfs_scrub: fstrim the free areas if there are no errors on " Darrick J. Wong
2017-08-04  0:09 ` [PATCH 19/22] xfs_scrub: warn about normalized Unicode name collisions Darrick J. Wong
2017-08-04  0:09 ` [PATCH 20/22] xfs_scrub: progress indicator Darrick J. Wong
2017-08-04  0:09 ` [PATCH 21/22] xfs_scrub: create a script to scrub all xfs filesystems Darrick J. Wong
2017-08-04  0:09 ` [PATCH 22/22] xfs_scrub: integrate services with systemd Darrick J. Wong
