* [PATCH v11 00/25] xfs: online scrub support
@ 2017-10-03 20:40 Darrick J. Wong
  2017-10-03 20:40 ` [PATCH 01/25] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
                   ` (24 more replies)
  0 siblings, 25 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:40 UTC
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This is the eleventh revision of a patchset that adds kernel support to
XFS for online metadata scrubbing and repair.  There aren't any
on-disk format changes.  Changes since v10 include voluntarily aborting
scrub if the kernel hits OOM, avoiding deadlocks with inodes that are
being reclaimed, and rebasing to 4.14-rc3.  I have been performing daily
online scrubs of my XFS filesystems for several months now, with
surprisingly few problems.

Online scrub/repair support consists of four major pieces -- first, an
ioctl that maps physical extents to their owners (GETFSMAP; already in
4.12); second, various in-kernel metadata scrubbing ioctls to examine
metadata records and cross-reference them with other filesystem
metadata; third, an in-kernel mechanism for rebuilding damaged metadata
objects and btrees; and fourth, a userspace component to coordinate
scrubbing and repair operations.

This new utility, xfs_scrub, is separate from the existing offline
xfs_repair tool.  The program uses various XFS ioctls to iterate all XFS
metadata and asks the kernel to check the metadata and repair it if
necessary.
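
As a rough illustration of the userspace side, here is a minimal sketch
of how a tool might ask the kernel whether the scrub ioctl is available.
It is not taken from xfs_scrub.  The structure layout and the ioctl
number ('X', 60) mirror what patch 01 adds to xfs_fs.h, and the probe
type (0) is the one patch 03 introduces; every sketch_/SKETCH_ name is
hypothetical and exists only so the example builds without the patched
headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Local mirror of struct xfs_scrub_metadata from patch 01, so that the
 * sketch builds without the patched xfs_fs.h.  The sketch_ names are
 * hypothetical.
 */
struct sketch_scrub_metadata {
	uint32_t	sm_type;	/* what to check */
	uint32_t	sm_flags;	/* in/out flags */
	uint64_t	sm_ino;		/* inode number */
	uint32_t	sm_gen;		/* inode generation */
	uint32_t	sm_agno;	/* AG number */
	uint64_t	sm_reserved[5];	/* must be zero */
};

static_assert(sizeof(struct sketch_scrub_metadata) == 64,
	      "scrub request must be 64 bytes");

#define SKETCH_SCRUB_TYPE_PROBE		0
#define SKETCH_IOC_SCRUB_METADATA	\
	_IOWR('X', 60, struct sketch_scrub_metadata)

/*
 * Ask the filesystem behind @fd whether it understands the scrub ioctl.
 * Returns 0 on success or a negative errno.
 */
static int
sketch_scrub_probe(int fd)
{
	struct sketch_scrub_metadata	sm;

	/* Reserved fields must be zero or the kernel rejects the call. */
	memset(&sm, 0, sizeof(sm));
	sm.sm_type = SKETCH_SCRUB_TYPE_PROBE;

	if (ioctl(fd, SKETCH_IOC_SCRUB_METADATA, &sm) < 0)
		return -errno;
	return 0;
}
```

Going by the patches, a caller should expect -EPERM without
CAP_SYS_ADMIN and -ENOTTY from a kernel built without
CONFIG_XFS_ONLINE_SCRUB.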

While I understand that reviewer bandwidth is limited, I would like to
get this series prepped for 4.15, if possible.  I have isolated the
scrub code such that it can be compiled out entirely, in the hopes that
we can stabilize the code while not exposing regular users to riskier
code.

If you're going to start using this mess, you probably ought to just
pull from my git trees.  The kernel patches[1] should apply against
4.14-rc3.  xfsprogs[2] and xfstests[3] can be found in their usual
places.  The git trees contain all four series' worth of changes.

This is an extraordinary way to eat your data.  Enjoy!
Comments and questions are, as always, welcome.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
[3] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel


* [PATCH 01/25] xfs: create an ioctl to scrub AG metadata
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
@ 2017-10-03 20:40 ` Darrick J. Wong
  2017-10-03 20:41 ` [PATCH 02/25] xfs: dispatch metadata scrub subcommands Darrick J. Wong
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:40 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create an ioctl that can be used to scrub internal filesystem metadata.
The new ioctl takes the metadata type, an (optional) AG number, an
(optional) inode number and generation, and a flags argument.  This will
be used by the upcoming XFS online scrub tool.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
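Illustration only, not part of the patch: the flags argument doubles as
an input (bit 0, repair) and an output (bits 1 through 6).  The sketch
below shows one way a userspace caller might turn the returned output
bits into readable names.  The bit positions are copied from the
xfs_fs.h hunk in this patch; the SKETCH_/sketch_ identifiers and the
short names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit positions copied from the XFS_SCRUB_*FLAG_* definitions. */
#define SKETCH_IFLAG_REPAIR	(1u << 0)
#define SKETCH_OFLAG_CORRUPT	(1u << 1)
#define SKETCH_OFLAG_PREEN	(1u << 2)
#define SKETCH_OFLAG_XFAIL	(1u << 3)
#define SKETCH_OFLAG_XCORRUPT	(1u << 4)
#define SKETCH_OFLAG_INCOMPLETE	(1u << 5)
#define SKETCH_OFLAG_WARNING	(1u << 6)

#define SKETCH_FLAGS_IN		(SKETCH_IFLAG_REPAIR)
#define SKETCH_FLAGS_OUT	(SKETCH_OFLAG_CORRUPT | SKETCH_OFLAG_PREEN | \
				 SKETCH_OFLAG_XFAIL | SKETCH_OFLAG_XCORRUPT | \
				 SKETCH_OFLAG_INCOMPLETE | SKETCH_OFLAG_WARNING)

static_assert((SKETCH_FLAGS_IN & SKETCH_FLAGS_OUT) == 0,
	      "input and output flags must not overlap");

struct sketch_flag_name {
	uint32_t	bit;
	const char	*name;
};

static const struct sketch_flag_name sketch_oflag_names[] = {
	{ SKETCH_OFLAG_CORRUPT,		"corrupt" },
	{ SKETCH_OFLAG_PREEN,		"preen" },
	{ SKETCH_OFLAG_XFAIL,		"xfail" },
	{ SKETCH_OFLAG_XCORRUPT,	"xcorrupt" },
	{ SKETCH_OFLAG_INCOMPLETE,	"incomplete" },
	{ SKETCH_OFLAG_WARNING,		"warning" },
};

/*
 * Write the comma-separated names of the output flags set in @flags
 * into @buf.  Returns how many flags were named; names that do not fit
 * are dropped whole and @buf stays NUL-terminated.
 */
static int
sketch_describe_oflags(uint32_t flags, char *buf, size_t len)
{
	size_t	i, used = 0;
	int	count = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < sizeof(sketch_oflag_names) /
			sizeof(sketch_oflag_names[0]); i++) {
		int	n;

		if (!(flags & sketch_oflag_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     count ? "," : "", sketch_oflag_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

Input-only bits such as the repair flag are ignored by the helper, so it
can be handed sm_flags unmodified.
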
 fs/xfs/Kconfig           |   17 ++++++++++++++
 fs/xfs/Makefile          |   11 +++++++++
 fs/xfs/libxfs/xfs_fs.h   |   53 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c     |   54 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.h     |   25 +++++++++++++++++++++
 fs/xfs/scrub/trace.c     |   41 +++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/trace.h     |   33 ++++++++++++++++++++++++++++
 fs/xfs/scrub/xfs_scrub.h |   29 +++++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c       |   28 ++++++++++++++++++++++++
 fs/xfs/xfs_ioctl32.c     |    1 +
 10 files changed, 292 insertions(+)
 create mode 100644 fs/xfs/scrub/scrub.c
 create mode 100644 fs/xfs/scrub/scrub.h
 create mode 100644 fs/xfs/scrub/trace.c
 create mode 100644 fs/xfs/scrub/trace.h
 create mode 100644 fs/xfs/scrub/xfs_scrub.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa..f42fcf1 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -71,6 +71,23 @@ config XFS_RT
 
 	  If unsure, say N.
 
+config XFS_ONLINE_SCRUB
+	bool "XFS online metadata check support"
+	default n
+	depends on XFS_FS
+	help
+	  If you say Y here you will be able to check metadata on a
+	  mounted XFS filesystem.  This feature is intended to reduce
+	  filesystem downtime by supplementing xfs_repair.  The key
+	  advantage here is to look for problems proactively so that
+	  they can be dealt with in a controlled manner.
+
+	  This feature is considered EXPERIMENTAL.  Use with caution!
+
+	  See the xfs_scrub man page in section 8 for additional information.
+
+	  If unsure, say N.
+
 config XFS_WARN
 	bool "XFS Verbose Warnings"
 	depends on XFS_FS && !XFS_DEBUG
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index dbc33e0..f4312bc 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -138,3 +138,14 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+
+# online scrub/repair
+ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
+
+# Tracepoints like to blow up, so build that before everything else
+
+xfs-y				+= $(addprefix scrub/, \
+				   trace.o \
+				   scrub.o \
+				   )
+endif
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 8c61f21..3b4a36e 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -468,6 +468,58 @@ typedef struct xfs_swapext
 #define XFS_FSOP_GOING_FLAGS_LOGFLUSH		0x1	/* flush log but not data */
 #define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH		0x2	/* don't flush log nor data */
 
+/* metadata scrubbing */
+struct xfs_scrub_metadata {
+	__u32 sm_type;		/* What to check? */
+	__u32 sm_flags;		/* flags; see below. */
+	__u64 sm_ino;		/* inode number. */
+	__u32 sm_gen;		/* inode generation. */
+	__u32 sm_agno;		/* ag number. */
+	__u64 sm_reserved[5];	/* pad to 64 bytes */
+};
+
+/*
+ * Metadata types and flags for scrub operation.
+ */
+
+/* Scrub subcommands. */
+
+/* Number of scrub subcommands. */
+#define XFS_SCRUB_TYPE_NR	0
+
+/* i: Repair this metadata. */
+#define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
+
+/* o: Metadata object needs repair. */
+#define XFS_SCRUB_OFLAG_CORRUPT		(1 << 1)
+
+/*
+ * o: Metadata object could be optimized.  It's not corrupt, but
+ *    we could improve on it somehow.
+ */
+#define XFS_SCRUB_OFLAG_PREEN		(1 << 2)
+
+/* o: Cross-referencing failed. */
+#define XFS_SCRUB_OFLAG_XFAIL		(1 << 3)
+
+/* o: Metadata object disagrees with cross-referenced metadata. */
+#define XFS_SCRUB_OFLAG_XCORRUPT	(1 << 4)
+
+/* o: Scan was not complete. */
+#define XFS_SCRUB_OFLAG_INCOMPLETE	(1 << 5)
+
+/* o: Metadata object looked funny but isn't corrupt. */
+#define XFS_SCRUB_OFLAG_WARNING		(1 << 6)
+
+#define XFS_SCRUB_FLAGS_IN	(XFS_SCRUB_IFLAG_REPAIR)
+#define XFS_SCRUB_FLAGS_OUT	(XFS_SCRUB_OFLAG_CORRUPT | \
+				 XFS_SCRUB_OFLAG_PREEN | \
+				 XFS_SCRUB_OFLAG_XFAIL | \
+				 XFS_SCRUB_OFLAG_XCORRUPT | \
+				 XFS_SCRUB_OFLAG_INCOMPLETE | \
+				 XFS_SCRUB_OFLAG_WARNING)
+#define XFS_SCRUB_FLAGS_ALL	(XFS_SCRUB_FLAGS_IN | XFS_SCRUB_FLAGS_OUT)
+
 /*
  * ioctl limits
  */
@@ -511,6 +563,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_ZERO_RANGE	_IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS	_IOR ('X', 58, struct xfs_fs_eofblocks)
 /*	XFS_IOC_GETFSMAP ------ hoisted 59         */
+#define XFS_IOC_SCRUB_METADATA	_IOWR('X', 60, struct xfs_scrub_metadata)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
new file mode 100644
index 0000000..5db2a6f
--- /dev/null
+++ b/fs/xfs/scrub/scrub.c
@@ -0,0 +1,54 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/trace.h"
+
+/* Dispatch metadata scrubbing. */
+int
+xfs_scrub_metadata(
+	struct xfs_inode		*ip,
+	struct xfs_scrub_metadata	*sm)
+{
+	return -EOPNOTSUPP;
+}
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
new file mode 100644
index 0000000..eb1cd9d
--- /dev/null
+++ b/fs/xfs/scrub/scrub.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_SCRUB_SCRUB_H__
+#define __XFS_SCRUB_SCRUB_H__
+
+/* Metadata scrubbers */
+
+#endif	/* __XFS_SCRUB_SCRUB_H__ */
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
new file mode 100644
index 0000000..c59fd41
--- /dev/null
+++ b/fs/xfs/scrub/trace.c
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_da_format.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_btree.h"
+#include "xfs_trans.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+
+/*
+ * We include this last to have the helpers above available for the trace
+ * event implementations.
+ */
+#define CREATE_TRACE_POINTS
+#include "scrub/trace.h"
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
new file mode 100644
index 0000000..a95a7c8
--- /dev/null
+++ b/fs/xfs/scrub/trace.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM xfs_scrub
+
+#if !defined(_TRACE_XFS_SCRUB_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_XFS_SCRUB_TRACE_H
+
+#include <linux/tracepoint.h>
+
+#endif /* _TRACE_XFS_SCRUB_TRACE_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE scrub/trace
+#include <trace/define_trace.h>
diff --git a/fs/xfs/scrub/xfs_scrub.h b/fs/xfs/scrub/xfs_scrub.h
new file mode 100644
index 0000000..e00e0ea
--- /dev/null
+++ b/fs/xfs/scrub/xfs_scrub.h
@@ -0,0 +1,29 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_SCRUB_H__
+#define __XFS_SCRUB_H__
+
+#ifndef CONFIG_XFS_ONLINE_SCRUB
+# define xfs_scrub_metadata(ip, sm)	(-ENOTTY)
+#else
+int xfs_scrub_metadata(struct xfs_inode *ip, struct xfs_scrub_metadata *sm);
+#endif /* CONFIG_XFS_ONLINE_SCRUB */
+
+#endif	/* __XFS_SCRUB_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389..6ff012f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -44,6 +44,7 @@
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
 #include "xfs_fsmap.h"
+#include "scrub/xfs_scrub.h"
 
 #include <linux/capability.h>
 #include <linux/cred.h>
@@ -1703,6 +1704,30 @@ xfs_ioc_getfsmap(
 	return 0;
 }
 
+STATIC int
+xfs_ioc_scrub_metadata(
+	struct xfs_inode		*ip,
+	void				__user *arg)
+{
+	struct xfs_scrub_metadata	scrub;
+	int				error;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&scrub, arg, sizeof(scrub)))
+		return -EFAULT;
+
+	error = xfs_scrub_metadata(ip, &scrub);
+	if (error)
+		return error;
+
+	if (copy_to_user(arg, &scrub, sizeof(scrub)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int
 xfs_ioc_swapext(
 	xfs_swapext_t	*sxp)
@@ -1886,6 +1911,9 @@ xfs_file_ioctl(
 	case FS_IOC_GETFSMAP:
 		return xfs_ioc_getfsmap(ip, arg);
 
+	case XFS_IOC_SCRUB_METADATA:
+		return xfs_ioc_scrub_metadata(ip, arg);
+
 	case XFS_IOC_FD_TO_HANDLE:
 	case XFS_IOC_PATH_TO_HANDLE:
 	case XFS_IOC_PATH_TO_FSHANDLE: {
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index fa0bc4d..35c79e2 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -556,6 +556,7 @@ xfs_file_compat_ioctl(
 	case XFS_IOC_ERROR_INJECTION:
 	case XFS_IOC_ERROR_CLEARALL:
 	case FS_IOC_GETFSMAP:
+	case XFS_IOC_SCRUB_METADATA:
 		return xfs_file_ioctl(filp, cmd, p);
 #ifndef BROKEN_X86_ALIGNMENT
 	/* These are handled fine if no alignment issues */



* [PATCH 02/25] xfs: dispatch metadata scrub subcommands
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
  2017-10-03 20:40 ` [PATCH 01/25] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-03 20:41 ` [PATCH 03/25] xfs: probe the scrub ioctl Darrick J. Wong
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create structures needed to hold scrubbing context and dispatch incoming
commands to the individual scrubbers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
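Illustration only, not part of the patch: the dispatcher in this patch
runs setup, then scrub, and retries exactly once with try_harder set if
the scrubber returns -EDEADLOCK.  The following is a small userspace
model of that control flow under simplifying assumptions: there are no
transactions, and the context keeps counters across the retry instead
of being zeroed.  All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EDEADLOCK
#define EDEADLOCK	EDEADLK	/* same value on most Linux arches */
#endif

struct sketch_ctx {
	bool	try_harder;
	int	setups;		/* how many times setup ran */
	int	teardowns;	/* how many times teardown ran */
};

struct sketch_ops {
	int	(*setup)(struct sketch_ctx *sc);
	int	(*scrub)(struct sketch_ctx *sc);
};

static int sketch_setup(struct sketch_ctx *sc)
{
	sc->setups++;
	return 0;
}

/* Hypothetical scrubber that only succeeds after worst-case setup. */
static int sketch_scrub_needs_retry(struct sketch_ctx *sc)
{
	return sc->try_harder ? 0 : -EDEADLOCK;
}

/* Hypothetical scrubber that can never make progress. */
static int sketch_scrub_stuck(struct sketch_ctx *sc)
{
	(void)sc;
	return -EDEADLOCK;
}

static void sketch_teardown(struct sketch_ctx *sc)
{
	sc->teardowns++;
}

static const struct sketch_ops sketch_retry_ops = {
	.setup = sketch_setup, .scrub = sketch_scrub_needs_retry,
};
static const struct sketch_ops sketch_stuck_ops = {
	.setup = sketch_setup, .scrub = sketch_scrub_stuck,
};

/* Model of the setup -> scrub -> (retry once) -> teardown sequence. */
static int
sketch_dispatch(const struct sketch_ops *ops, struct sketch_ctx *sc)
{
	bool	try_harder = false;
	int	error;

	assert(ops->setup && ops->scrub);
retry_op:
	sc->try_harder = try_harder;
	error = ops->setup(sc);
	if (error)
		goto out_teardown;

	error = ops->scrub(sc);
	if (!try_harder && error == -EDEADLOCK) {
		/* Drop everything, then prepare for the worst case. */
		sketch_teardown(sc);
		try_harder = true;
		goto retry_op;
	}
out_teardown:
	sketch_teardown(sc);
	return error;
}
```

A second -EDEADLOCK is passed straight back to the caller, so the loop
is bounded at two passes.
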
 fs/xfs/scrub/scrub.c |  178 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.h |   24 +++++++
 fs/xfs/scrub/trace.h |   43 ++++++++++++
 3 files changed, 244 insertions(+), 1 deletion(-)
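
Illustration only, not part of the patch: the order of the input checks
in xfs_scrub_metadata() decides which errno a caller sees.  Below is a
userspace model of that ordering, assuming a reduced request structure
and leaving out the shutdown, norecovery, ops->scrub and ops->has
checks.  The number of known types is passed in as a parameter; all
sketch_/SKETCH_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IFLAG_REPAIR	(1u << 0)
#define SKETCH_FLAGS_IN		(SKETCH_IFLAG_REPAIR)
#define SKETCH_FLAGS_OUT	0x7eu	/* bits 1-6, the OFLAG bits */

static_assert((SKETCH_FLAGS_IN & SKETCH_FLAGS_OUT) == 0,
	      "input and output flags must not overlap");

/* Reduced request: only the fields the input checks look at. */
struct sketch_request {
	uint32_t	sm_type;
	uint32_t	sm_flags;
	uint64_t	sm_reserved[5];
};

/*
 * Apply the checks in the same order as the dispatcher.  Returns 0 if
 * the request would proceed to setup, or a negative errno.
 */
static int
sketch_check_inputs(struct sketch_request *sm, uint32_t nr_types)
{
	size_t	i;

	/* Output bits set by the caller are cleared, not rejected. */
	sm->sm_flags &= ~SKETCH_FLAGS_OUT;
	if (sm->sm_flags & ~SKETCH_FLAGS_IN)
		return -EINVAL;
	for (i = 0; i < sizeof(sm->sm_reserved) /
			sizeof(sm->sm_reserved[0]); i++)
		if (sm->sm_reserved[i])
			return -EINVAL;

	/* Unknown metadata type. */
	if (sm->sm_type >= nr_types)
		return -ENOENT;

	/* Repair is not implemented at this point in the series. */
	if (sm->sm_flags & SKETCH_IFLAG_REPAIR)
		return -EOPNOTSUPP;
	return 0;
}
```

Note that a repair request for an unknown type reports -ENOENT rather
than -EOPNOTSUPP, because the type check runs first.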


diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 5db2a6f..1f0ab15 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -44,11 +44,187 @@
 #include "scrub/scrub.h"
 #include "scrub/trace.h"
 
+/*
+ * Online Scrub and Repair
+ *
+ * Traditionally, XFS (the kernel driver) did not know how to check or
+ * repair on-disk data structures.  That task was left to the xfs_check
+ * and xfs_repair tools, both of which require taking the filesystem
+ * offline for a thorough but time consuming examination.  Online
+ * scrub & repair, on the other hand, enables us to check the metadata
+ * for obvious errors while carefully stepping around the filesystem's
+ * ongoing operations, locking rules, etc.
+ *
+ * Given that most XFS metadata consist of records stored in a btree,
+ * most of the checking functions iterate the btree blocks themselves
+ * looking for irregularities.  When a record block is encountered, each
+ * record can be checked for obviously bad values.  Record values can
+ * also be cross-referenced against other btrees to look for potential
+ * misunderstandings between pieces of metadata.
+ *
+ * It is expected that the checkers responsible for per-AG metadata
+ * structures will lock the AG headers (AGI, AGF, AGFL), iterate the
+ * metadata structure, and perform any relevant cross-referencing before
+ * unlocking the AG and returning the results to userspace.  These
+ * scrubbers must not keep an AG locked for too long to avoid tying up
+ * the block and inode allocators.
+ *
+ * Block maps and b-trees rooted in an inode present a special challenge
+ * because they can involve extents from any AG.  The general scrubber
+ * structure of lock -> check -> xref -> unlock still holds, but AG
+ * locking order rules /must/ be obeyed to avoid deadlocks.  The
+ * ordering rule, of course, is that we must lock in increasing AG
+ * order.  Helper functions are provided to track which AG headers we've
+ * already locked.  If we detect an imminent locking order violation, we
+ * can signal a potential deadlock, in which case the scrubber can jump
+ * out to the top level, lock all the AGs in order, and retry the scrub.
+ *
+ * For file data (directories, extended attributes, symlinks) scrub, we
+ * can simply lock the inode and walk the data.  For btree data
+ * (directories and attributes) we follow the same btree-scrubbing
+ * strategy outlined previously to check the records.
+ *
+ * We use a bit of trickery with transactions to avoid buffer deadlocks
+ * if there is a cycle in the metadata.  The basic problem is that
+ * travelling down a btree involves locking the current buffer at each
+ * tree level.  If a pointer should somehow point back to a buffer that
+ * we've already examined, we will deadlock due to the second buffer
+ * locking attempt.  Note however that grabbing a buffer in transaction
+ * context links the locked buffer to the transaction.  If we try to
+ * re-grab the buffer in the context of the same transaction, we avoid
+ * the second lock attempt and continue.  Between the verifier and the
+ * scrubber, something will notice that something is amiss and report
+ * the corruption.  Therefore, each scrubber will allocate an empty
+ * transaction, attach buffers to it, and cancel the transaction at the
+ * end of the scrub run.  Cancelling a non-dirty transaction simply
+ * unlocks the buffers.
+ *
+ * Scrub communicates its findings to userspace in two ways.  The first
+ * is the error code (errno), which reports operational errors in
+ * performing the scrub.  The second is the set of XFS_SCRUB_OFLAG_*
+ * output flags returned in sm_flags (see xfs_fs.h).  If the data
+ * structure itself is corrupt, the CORRUPT flag will be set.  If
+ * the metadata is correct but otherwise suboptimal, the PREEN flag
+ * will be set.
+ */
+
+/* Scrub setup and teardown */
+
+/* Free all the resources and finish the transactions. */
+STATIC int
+xfs_scrub_teardown(
+	struct xfs_scrub_context	*sc,
+	int				error)
+{
+	if (sc->tp) {
+		xfs_trans_cancel(sc->tp);
+		sc->tp = NULL;
+	}
+	return error;
+}
+
+/* Scrubbing dispatch. */
+
+static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
+};
+
+/* This isn't a stable feature, warn once per day. */
+static inline void
+xfs_scrub_experimental_warning(
+	struct xfs_mount	*mp)
+{
+	static struct ratelimit_state scrub_warning = RATELIMIT_STATE_INIT(
+			"xfs_scrub_warning", 86400 * HZ, 1);
+	ratelimit_set_flags(&scrub_warning, RATELIMIT_MSG_ON_RELEASE);
+
+	if (__ratelimit(&scrub_warning))
+		xfs_alert(mp,
+"EXPERIMENTAL online scrub feature in use. Use at your own risk!");
+}
+
 /* Dispatch metadata scrubbing. */
 int
 xfs_scrub_metadata(
 	struct xfs_inode		*ip,
 	struct xfs_scrub_metadata	*sm)
 {
-	return -EOPNOTSUPP;
+	struct xfs_scrub_context	sc;
+	struct xfs_mount		*mp = ip->i_mount;
+	const struct xfs_scrub_meta_ops	*ops;
+	bool				try_harder = false;
+	int				error = 0;
+
+	trace_xfs_scrub_start(ip, sm, error);
+
+	/* Forbidden if we are shut down or mounted norecovery. */
+	error = -ESHUTDOWN;
+	if (XFS_FORCED_SHUTDOWN(mp))
+		goto out;
+	error = -ENOTRECOVERABLE;
+	if (mp->m_flags & XFS_MOUNT_NORECOVERY)
+		goto out;
+
+	/* Check our inputs. */
+	error = -EINVAL;
+	sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	if (sm->sm_flags & ~XFS_SCRUB_FLAGS_IN)
+		goto out;
+	if (memchr_inv(sm->sm_reserved, 0, sizeof(sm->sm_reserved)))
+		goto out;
+
+	/* Do we know about this type of metadata? */
+	error = -ENOENT;
+	if (sm->sm_type >= XFS_SCRUB_TYPE_NR)
+		goto out;
+	ops = &meta_scrub_ops[sm->sm_type];
+	if (ops->scrub == NULL)
+		goto out;
+
+	/* Does this fs even support this type of metadata? */
+	if (ops->has && !ops->has(&mp->m_sb))
+		goto out;
+
+	/* We don't know how to repair anything yet. */
+	error = -EOPNOTSUPP;
+	if (sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)
+		goto out;
+
+	xfs_scrub_experimental_warning(mp);
+
+retry_op:
+	/* Set up for the operation. */
+	memset(&sc, 0, sizeof(sc));
+	sc.mp = ip->i_mount;
+	sc.sm = sm;
+	sc.ops = ops;
+	sc.try_harder = try_harder;
+	error = sc.ops->setup(&sc, ip);
+	if (error)
+		goto out_teardown;
+
+	/* Scrub for errors. */
+	error = sc.ops->scrub(&sc);
+	if (!try_harder && error == -EDEADLOCK) {
+		/*
+		 * Scrubbers return -EDEADLOCK to mean 'try harder'.
+		 * Tear down everything we hold, then set up again with
+		 * preparation for worst-case scenarios.
+		 */
+		error = xfs_scrub_teardown(&sc, 0);
+		if (error)
+			goto out;
+		try_harder = true;
+		goto retry_op;
+	} else if (error)
+		goto out_teardown;
+
+	if (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+			       XFS_SCRUB_OFLAG_XCORRUPT))
+		xfs_alert_ratelimited(mp, "Corruption detected during scrub.");
+
+out_teardown:
+	error = xfs_scrub_teardown(&sc, error);
+out:
+	trace_xfs_scrub_done(ip, sm, error);
+	return error;
 }
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index eb1cd9d..ef7b50e 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -20,6 +20,30 @@
 #ifndef __XFS_SCRUB_SCRUB_H__
 #define __XFS_SCRUB_SCRUB_H__
 
+struct xfs_scrub_context;
+
+struct xfs_scrub_meta_ops {
+	/* Acquire whatever resources are needed for the operation. */
+	int		(*setup)(struct xfs_scrub_context *,
+				 struct xfs_inode *);
+
+	/* Examine metadata for errors. */
+	int		(*scrub)(struct xfs_scrub_context *);
+
+	/* Decide if we even have this piece of metadata. */
+	bool		(*has)(struct xfs_sb *);
+};
+
+struct xfs_scrub_context {
+	/* General scrub state. */
+	struct xfs_mount		*mp;
+	struct xfs_scrub_metadata	*sm;
+	const struct xfs_scrub_meta_ops	*ops;
+	struct xfs_trans		*tp;
+	struct xfs_inode		*ip;
+	bool				try_harder;
+};
+
 /* Metadata scrubbers */
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index a95a7c8..688517e 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -25,6 +25,49 @@
 
 #include <linux/tracepoint.h>
 
+DECLARE_EVENT_CLASS(xfs_scrub_class,
+	TP_PROTO(struct xfs_inode *ip, struct xfs_scrub_metadata *sm,
+		 int error),
+	TP_ARGS(ip, sm, error),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(unsigned int, type)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_ino_t, inum)
+		__field(unsigned int, gen)
+		__field(unsigned int, flags)
+		__field(int, error)
+	),
+	TP_fast_assign(
+		__entry->dev = ip->i_mount->m_super->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->type = sm->sm_type;
+		__entry->agno = sm->sm_agno;
+		__entry->inum = sm->sm_ino;
+		__entry->gen = sm->sm_gen;
+		__entry->flags = sm->sm_flags;
+		__entry->error = error;
+	),
+	TP_printk("dev %d:%d ino %llu type %u agno %u inum %llu gen %u flags 0x%x error %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->type,
+		  __entry->agno,
+		  __entry->inum,
+		  __entry->gen,
+		  __entry->flags,
+		  __entry->error)
+)
+#define DEFINE_SCRUB_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_class, name, \
+	TP_PROTO(struct xfs_inode *ip, struct xfs_scrub_metadata *sm, \
+		 int error), \
+	TP_ARGS(ip, sm, error))
+
+DEFINE_SCRUB_EVENT(xfs_scrub_start);
+DEFINE_SCRUB_EVENT(xfs_scrub_done);
+
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 03/25] xfs: probe the scrub ioctl
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
  2017-10-03 20:40 ` [PATCH 01/25] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
  2017-10-03 20:41 ` [PATCH 02/25] xfs: dispatch metadata scrub subcommands Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-03 23:32   ` Dave Chinner
  2017-10-03 20:41 ` [PATCH 04/25] xfs: create helpers to record and deal with scrub problems Darrick J. Wong
                   ` (21 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a probe scrubber with id 0.  This will be used by xfs_scrub to
probe the kernel's abilities to scrub (and repair) the metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
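Illustration only, not part of the patch: the probe scrubber in this
revision copies any output-flag bits found in sm_gen into sm_flags, so
a caller can confirm that the ioctl round-trips data.  The sketch below
is a userspace model of that echo, assuming a reduced argument
structure; it does not call the kernel, and all sketch_/SKETCH_ names
are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_IFLAG_REPAIR	(1u << 0)
#define SKETCH_OFLAG_CORRUPT	(1u << 1)
#define SKETCH_OFLAG_WARNING	(1u << 6)
#define SKETCH_FLAGS_OUT	0x7eu	/* bits 1-6, the OFLAG bits */

static_assert((SKETCH_FLAGS_OUT & SKETCH_IFLAG_REPAIR) == 0,
	      "the repair bit is not an output flag");

/* Reduced argument block: only the fields the probe looks at. */
struct sketch_probe_args {
	uint32_t	sm_flags;
	uint64_t	sm_ino;
	uint32_t	sm_gen;
	uint32_t	sm_agno;
};

/* Model of the probe: validate, then echo sm_gen into sm_flags. */
static int
sketch_probe(struct sketch_probe_args *sm)
{
	if (sm->sm_ino || sm->sm_agno)
		return -EINVAL;
	if (sm->sm_gen & ~SKETCH_FLAGS_OUT)
		return -ENOENT;

	sm->sm_flags |= sm->sm_gen & SKETCH_FLAGS_OUT;
	return 0;
}

/* Caller-side check: did exactly the requested bits come back? */
static int
sketch_probe_echo_ok(uint32_t want)
{
	struct sketch_probe_args	sm = { .sm_gen = want };

	return sketch_probe(&sm) == 0 && sm.sm_flags == want;
}
```

A caller would pick a known bit pattern for sm_gen and compare it with
the sm_flags value that comes back.
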
 fs/xfs/Makefile        |    1 +
 fs/xfs/libxfs/xfs_fs.h |    3 ++
 fs/xfs/scrub/common.c  |   59 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h  |   39 ++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c   |   24 ++++++++++++++++++++
 fs/xfs/scrub/scrub.h   |    1 +
 fs/xfs/scrub/trace.c   |    1 +
 7 files changed, 127 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/common.c
 create mode 100644 fs/xfs/scrub/common.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index f4312bc..ca14595 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -146,6 +146,7 @@ ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 
 xfs-y				+= $(addprefix scrub/, \
 				   trace.o \
+				   common.o \
 				   scrub.o \
 				   )
 endif
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 3b4a36e..765f91e 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -483,9 +483,10 @@ struct xfs_scrub_metadata {
  */
 
 /* Scrub subcommands. */
+#define XFS_SCRUB_TYPE_PROBE	0	/* presence test ioctl */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	0
+#define XFS_SCRUB_TYPE_NR	1
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
new file mode 100644
index 0000000..d2c8f94
--- /dev/null
+++ b/fs/xfs/scrub/common.c
@@ -0,0 +1,59 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/* Common code for the metadata scrubbers. */
+
+/* Per-scrubber setup functions */
+
+/* Set us up with a transaction and an empty context. */
+int
+xfs_scrub_setup_fs(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return xfs_scrub_trans_alloc(sc->sm, sc->mp, &sc->tp);
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
new file mode 100644
index 0000000..f3d5865
--- /dev/null
+++ b/fs/xfs/scrub/common.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_SCRUB_COMMON_H__
+#define __XFS_SCRUB_COMMON_H__
+
+/*
+ * Grab an empty transaction so that we can re-grab locked buffers if
+ * one of our btrees turns out to be cyclic.
+ */
+static inline int
+xfs_scrub_trans_alloc(
+	struct xfs_scrub_metadata	*sm,
+	struct xfs_mount		*mp,
+	struct xfs_trans		**tpp)
+{
+	return xfs_trans_alloc_empty(mp, tpp);
+}
+
+/* Setup functions */
+int xfs_scrub_setup_fs(struct xfs_scrub_context *sc, struct xfs_inode *ip);
+
+#endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 1f0ab15..d51354d 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -42,6 +42,7 @@
 #include "xfs_rmap_btree.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
+#include "scrub/common.h"
 #include "scrub/trace.h"
 
 /*
@@ -108,6 +109,25 @@
  * will be set.
  */
 
+/*
+ * Scrub probe -- userspace uses this to probe if we're willing to
+ * scrub or repair a given mountpoint.
+ */
+int
+xfs_scrub_probe(
+	struct xfs_scrub_context	*sc)
+{
+	if (sc->sm->sm_ino || sc->sm->sm_agno)
+		return -EINVAL;
+	if (sc->sm->sm_gen & ~XFS_SCRUB_FLAGS_OUT)
+		return -ENOENT;
+
+	/* Echo parameters back to userspace to prove that we exist. */
+	sc->sm->sm_flags |= (sc->sm->sm_gen & XFS_SCRUB_FLAGS_OUT);
+
+	return 0;
+}
+
 /* Scrub setup and teardown */
 
 /* Free all the resources and finish the transactions. */
@@ -126,6 +146,10 @@ xfs_scrub_teardown(
 /* Scrubbing dispatch. */
 
 static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
+	{ /* ioctl presence test */
+		.setup	= xfs_scrub_setup_fs,
+		.scrub	= xfs_scrub_probe,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index ef7b50e..b7b9422 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -45,5 +45,6 @@ struct xfs_scrub_context {
 };
 
 /* Metadata scrubbers */
+int xfs_scrub_probe(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index c59fd41..88b5ccb 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -32,6 +32,7 @@
 #include "xfs_trans.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
+#include "scrub/common.h"
 
 /*
  * We include this last to have the helpers above available for the trace
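
Reader's note on the probe logic in xfs_scrub_probe() above: the
following is a minimal userspace sketch of its semantics, not the kernel
code.  The struct, the mask value MODEL_FLAGS_OUT, and the name
model_probe() are stand-ins invented for illustration; only the control
flow follows the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_scrub_metadata. */
struct model_scrub_metadata {
	uint64_t	sm_ino;
	uint32_t	sm_gen;
	uint32_t	sm_agno;
	uint32_t	sm_flags;
};

/* Assumed value; the real XFS_SCRUB_FLAGS_OUT comes from xfs_fs.h. */
#define MODEL_FLAGS_OUT	0x7eU

int
model_probe(struct model_scrub_metadata *sm)
{
	assert(sm != NULL);

	/* A probe names no inode and no AG. */
	if (sm->sm_ino || sm->sm_agno)
		return -EINVAL;
	/* Only output-flag bits may be passed in sm_gen. */
	if (sm->sm_gen & ~MODEL_FLAGS_OUT)
		return -ENOENT;

	/* Echo the requested bits back so the caller knows we exist. */
	sm->sm_flags |= (sm->sm_gen & MODEL_FLAGS_OUT);
	return 0;
}
```

A caller would put the bits it wants echoed into sm_gen, leave sm_ino
and sm_agno zero, and then look for those bits in sm_flags on success.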



* [PATCH 04/25] xfs: create helpers to record and deal with scrub problems
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 03/25] xfs: probe the scrub ioctl Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-03 23:44   ` Dave Chinner
  2017-10-03 20:41 ` [PATCH 05/25] xfs: create helpers to scrub a metadata btree Darrick J. Wong
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create helper functions to record CRC and corruption problems, and to
deal with any other runtime errors that arise.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
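
Reader's note: the error-handling convention that this patch documents
in common.c can be modelled in userspace roughly as follows.  This is a
sketch under stated assumptions, not the kernel implementation: the
MODEL_* constants, the context struct, and both function names are
hypothetical, and the tracepoints are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed stand-in values; the kernel defines the real ones elsewhere. */
#define MODEL_EFSCORRUPTED	117
#define MODEL_EFSBADCRC		74
#define MODEL_OFLAG_CORRUPT	(1U << 1)

/* Hypothetical stand-in for struct xfs_scrub_context. */
struct model_scrub_context {
	unsigned int	sm_flags;
};

/* Returns true if the caller may move on to its next check. */
bool
model_op_ok(struct model_scrub_context *sc, int *error)
{
	assert(error != NULL);	/* callers must pass a real error slot */

	switch (*error) {
	case 0:
		return true;
	case -MODEL_EFSBADCRC:
	case -MODEL_EFSCORRUPTED:
		/* Verifier error: record it in the flags, clear the code. */
		sc->sm_flags |= MODEL_OFLAG_CORRUPT;
		*error = 0;
		break;
	default:
		/* Runtime error such as -ENOMEM: leave *error alone. */
		break;
	}
	return false;
}

/* Typical caller: stop checking on failure, return whatever is left. */
int
model_check_step(struct model_scrub_context *sc, int op_result)
{
	int	error = op_result;

	if (!model_op_ok(sc, &error))
		goto out;
	/* Further checks would run here. */
out:
	return error;
}
```

The point of the convention is that corruption ends up in sm_flags
while only runtime failures travel up the stack as return codes.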
 fs/xfs/scrub/common.c |  251 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h |   23 ++++
 fs/xfs/scrub/trace.h  |  193 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 467 insertions(+)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index d2c8f94..a84ba19 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -47,6 +47,257 @@
 
 /* Common code for the metadata scrubbers. */
 
+/*
+ * Handling operational errors.
+ *
+ * The *_op_ok() family of functions is used to process error return
+ * codes from functions called as part of a scrub operation.
+ *
+ * If there's no error, we return true to tell the caller that it's ok
+ * to move on to the next check in its list.
+ *
+ * For non-verifier errors (e.g. ENOMEM) we return false to tell the
+ * caller that something bad happened, and we preserve *error so that
+ * the caller can return the *error up the stack to userspace.
+ *
+ * Verifier errors (EFSBADCRC/EFSCORRUPTED) are recorded by setting
+ * OFLAG_CORRUPT in sm_flags and the *error is cleared.  In other words,
+ * we track verifier errors (and failed scrub checks) via OFLAG_CORRUPT,
+ * not via return codes.  We return false to tell the caller that
+ * something bad happened.  Since the error has been cleared, the caller
+ * will (presumably) return that zero and scrubbing will move on to
+ * whatever's next.
+ *
+ * ftrace can be used to record the precise metadata location and the
+ * approximate code location of the failed operation.
+ */
+
+/* Check for operational errors. */
+bool
+xfs_scrub_op_ok(
+	struct xfs_scrub_context	*sc,
+	xfs_agnumber_t			agno,
+	xfs_agblock_t			bno,
+	int				*error)
+{
+	switch (*error) {
+	case 0:
+		return true;
+	case -EDEADLOCK:
+		/* Used to restart an op with deadlock avoidance. */
+		trace_xfs_scrub_deadlock_retry(sc->ip, sc->sm, *error);
+		break;
+	case -EFSBADCRC:
+	case -EFSCORRUPTED:
+		/* Note the badness but don't abort. */
+		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+		*error = 0;
+		/* fall through */
+	default:
+		trace_xfs_scrub_op_error(sc, agno, bno, *error,
+				__return_address);
+		break;
+	}
+	return false;
+}
+
+/* Check for operational errors for a file offset. */
+bool
+xfs_scrub_fblock_op_ok(
+	struct xfs_scrub_context	*sc,
+	int				whichfork,
+	xfs_fileoff_t			offset,
+	int				*error)
+{
+	switch (*error) {
+	case 0:
+		return true;
+	case -EDEADLOCK:
+		/* Used to restart an op with deadlock avoidance. */
+		trace_xfs_scrub_deadlock_retry(sc->ip, sc->sm, *error);
+		break;
+	case -EFSBADCRC:
+	case -EFSCORRUPTED:
+		/* Note the badness but don't abort. */
+		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+		*error = 0;
+		/* fall through */
+	default:
+		trace_xfs_scrub_file_op_error(sc, whichfork, offset, *error,
+				__return_address);
+		break;
+	}
+	return false;
+}
+
+/*
+ * Handling scrub corruption/optimization/warning checks.
+ *
+ * The *_set_{corrupt,preen,warning}() family of functions is used to
+ * record the presence of metadata that is incorrect (corrupt), could be
+ * optimized somehow (preen), or should be flagged for administrative
+ * review but is not incorrect (warning).
+ *
+ * ftrace can be used to record the precise metadata location and
+ * approximate code location of the failed check.
+ */
+
+/* Record a block which could be optimized. */
+void
+xfs_scrub_block_set_preen(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*bp)
+{
+	struct xfs_mount		*mp = sc->mp;
+	xfs_fsblock_t			fsbno;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			bno;
+
+	fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
+	agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	bno = XFS_FSB_TO_AGBNO(mp, fsbno);
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	trace_xfs_scrub_block_preen(sc, agno, bno, __return_address);
+}
+
+/*
+ * Record an inode which could be optimized.  The trace data will
+ * include the block given by bp if bp is given; otherwise it will use
+ * the block location of the inode record itself.
+ */
+void
+xfs_scrub_ino_set_preen(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*bp)
+{
+	struct xfs_inode		*ip = sc->ip;
+	struct xfs_mount		*mp = sc->mp;
+	xfs_fsblock_t			fsbno;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			bno;
+
+	if (bp) {
+		fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
+		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+		bno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	} else {
+		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+		bno = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	}
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	trace_xfs_scrub_ino_preen(sc, ip->i_ino, agno, bno, __return_address);
+}
+
+/* Record a corrupt block. */
+void
+xfs_scrub_block_set_corrupt(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*bp)
+{
+	struct xfs_mount		*mp = sc->mp;
+	xfs_fsblock_t			fsbno;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			bno;
+
+	fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
+	agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	bno = XFS_FSB_TO_AGBNO(mp, fsbno);
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	trace_xfs_scrub_block_error(sc, agno, bno, __return_address);
+}
+
+/*
+ * Record a corrupt inode.  The trace data will include the block given
+ * by bp if bp is given; otherwise it will use the block location of the
+ * inode record itself.
+ */
+void
+xfs_scrub_ino_set_corrupt(
+	struct xfs_scrub_context	*sc,
+	xfs_ino_t			ino,
+	struct xfs_buf			*bp)
+{
+	struct xfs_inode		*ip = sc->ip;
+	struct xfs_mount		*mp = sc->mp;
+	xfs_fsblock_t			fsbno;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			bno;
+
+	if (bp) {
+		fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
+		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+		bno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	} else {
+		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+		bno = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	}
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	trace_xfs_scrub_ino_error(sc, ino, agno, bno, __return_address);
+}
+
+/* Record corruption in a block indexed by a file fork. */
+void
+xfs_scrub_fblock_set_corrupt(
+	struct xfs_scrub_context	*sc,
+	int				whichfork,
+	xfs_fileoff_t			offset)
+{
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	trace_xfs_scrub_fblock_error(sc, whichfork, offset, __return_address);
+}
+
+/*
+ * Warn about an inode that needs administrative review but is not
+ * incorrect.
+ */
+void
+xfs_scrub_ino_set_warning(
+	struct xfs_scrub_context	*sc,
+	struct xfs_buf			*bp)
+{
+	struct xfs_inode		*ip = sc->ip;
+	struct xfs_mount		*mp = sc->mp;
+	xfs_fsblock_t			fsbno;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			bno;
+
+	if (bp) {
+		fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
+		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+		bno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	} else {
+		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+		bno = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	}
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_WARNING;
+	trace_xfs_scrub_ino_warning(sc, ip->i_ino, agno, bno, __return_address);
+}
+
+/* Warn about a block indexed by a file fork that needs review. */
+void
+xfs_scrub_fblock_set_warning(
+	struct xfs_scrub_context	*sc,
+	int				whichfork,
+	xfs_fileoff_t			offset)
+{
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_WARNING;
+	trace_xfs_scrub_fblock_warning(sc, whichfork, offset, __return_address);
+}
+
+/* Signal an incomplete scrub. */
+void
+xfs_scrub_set_incomplete(
+	struct xfs_scrub_context	*sc)
+{
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_INCOMPLETE;
+	trace_xfs_scrub_incomplete(sc, __return_address);
+}
+
 /* Per-scrubber setup functions */
 
 /* Set us up with a transaction and an empty context. */
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index f3d5865..e396aa6 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -33,6 +33,29 @@ xfs_scrub_trans_alloc(
 	return xfs_trans_alloc_empty(mp, tpp);
 }
 
+bool xfs_scrub_op_ok(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
+		xfs_agblock_t bno, int *error);
+bool xfs_scrub_fblock_op_ok(struct xfs_scrub_context *sc, int whichfork,
+		xfs_fileoff_t offset, int *error);
+
+void xfs_scrub_block_set_preen(struct xfs_scrub_context *sc,
+		struct xfs_buf *bp);
+void xfs_scrub_ino_set_preen(struct xfs_scrub_context *sc, struct xfs_buf *bp);
+
+void xfs_scrub_block_set_corrupt(struct xfs_scrub_context *sc,
+		struct xfs_buf *bp);
+void xfs_scrub_ino_set_corrupt(struct xfs_scrub_context *sc, xfs_ino_t ino,
+		struct xfs_buf *bp);
+void xfs_scrub_fblock_set_corrupt(struct xfs_scrub_context *sc, int whichfork,
+		xfs_fileoff_t offset);
+
+void xfs_scrub_ino_set_warning(struct xfs_scrub_context *sc,
+		struct xfs_buf *bp);
+void xfs_scrub_fblock_set_warning(struct xfs_scrub_context *sc, int whichfork,
+		xfs_fileoff_t offset);
+
+void xfs_scrub_set_incomplete(struct xfs_scrub_context *sc);
+
 /* Setup functions */
 int xfs_scrub_setup_fs(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 688517e..8d67a85 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -67,6 +67,199 @@ DEFINE_EVENT(xfs_scrub_class, name, \
 
 DEFINE_SCRUB_EVENT(xfs_scrub_start);
 DEFINE_SCRUB_EVENT(xfs_scrub_done);
+DEFINE_SCRUB_EVENT(xfs_scrub_deadlock_retry);
+
+TRACE_EVENT(xfs_scrub_op_error,
+	TP_PROTO(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
+		 xfs_agblock_t bno, int error, void *ret_ip),
+	TP_ARGS(sc, agno, bno, error, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(int, error)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->type = sc->sm->sm_type;
+		__entry->agno = agno;
+		__entry->bno = bno;
+		__entry->error = error;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->error,
+		  __entry->ret_ip)
+);
+
+TRACE_EVENT(xfs_scrub_file_op_error,
+	TP_PROTO(struct xfs_scrub_context *sc, int whichfork,
+		 xfs_fileoff_t offset, int error, void *ret_ip),
+	TP_ARGS(sc, whichfork, offset, error, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, whichfork)
+		__field(unsigned int, type)
+		__field(xfs_fileoff_t, offset)
+		__field(int, error)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->ip->i_mount->m_super->s_dev;
+		__entry->ino = sc->ip->i_ino;
+		__entry->whichfork = whichfork;
+		__entry->type = sc->sm->sm_type;
+		__entry->offset = offset;
+		__entry->error = error;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->whichfork,
+		  __entry->type,
+		  __entry->offset,
+		  __entry->error,
+		  __entry->ret_ip)
+);
+
+DECLARE_EVENT_CLASS(xfs_scrub_block_error_class,
+	TP_PROTO(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
+		 xfs_agblock_t bno, void *ret_ip),
+	TP_ARGS(sc, agno, bno, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->type = sc->sm->sm_type;
+		__entry->agno = agno;
+		__entry->bno = bno;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->ret_ip)
+)
+
+#define DEFINE_SCRUB_BLOCK_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_block_error_class, name, \
+	TP_PROTO(struct xfs_scrub_context *sc, xfs_agnumber_t agno, \
+		 xfs_agblock_t bno, void *ret_ip), \
+	TP_ARGS(sc, agno, bno, ret_ip))
+
+DEFINE_SCRUB_BLOCK_ERROR_EVENT(xfs_scrub_block_error);
+DEFINE_SCRUB_BLOCK_ERROR_EVENT(xfs_scrub_block_preen);
+
+DECLARE_EVENT_CLASS(xfs_scrub_ino_error_class,
+	TP_PROTO(struct xfs_scrub_context *sc, xfs_ino_t ino,
+		 xfs_agnumber_t agno, xfs_agblock_t bno, void *ret_ip),
+	TP_ARGS(sc, ino, agno, bno, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(unsigned int, type)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = ino;
+		__entry->type = sc->sm->sm_type;
+		__entry->agno = agno;
+		__entry->bno = bno;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->type,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->ret_ip)
+)
+
+#define DEFINE_SCRUB_INO_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_ino_error_class, name, \
+	TP_PROTO(struct xfs_scrub_context *sc, xfs_ino_t ino, \
+		 xfs_agnumber_t agno, xfs_agblock_t bno, void *ret_ip), \
+	TP_ARGS(sc, ino, agno, bno, ret_ip))
+
+DEFINE_SCRUB_INO_ERROR_EVENT(xfs_scrub_ino_error);
+DEFINE_SCRUB_INO_ERROR_EVENT(xfs_scrub_ino_preen);
+DEFINE_SCRUB_INO_ERROR_EVENT(xfs_scrub_ino_warning);
+
+DECLARE_EVENT_CLASS(xfs_scrub_fblock_error_class,
+	TP_PROTO(struct xfs_scrub_context *sc, int whichfork,
+		 xfs_fileoff_t offset, void *ret_ip),
+	TP_ARGS(sc, whichfork, offset, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, whichfork)
+		__field(unsigned int, type)
+		__field(xfs_fileoff_t, offset)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->ip->i_mount->m_super->s_dev;
+		__entry->ino = sc->ip->i_ino;
+		__entry->whichfork = whichfork;
+		__entry->type = sc->sm->sm_type;
+		__entry->offset = offset;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->whichfork,
+		  __entry->type,
+		  __entry->offset,
+		  __entry->ret_ip)
+);
+
+#define DEFINE_SCRUB_FBLOCK_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_fblock_error_class, name, \
+	TP_PROTO(struct xfs_scrub_context *sc, int whichfork, \
+		 xfs_fileoff_t offset, void *ret_ip), \
+	TP_ARGS(sc, whichfork, offset, ret_ip))
+
+DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xfs_scrub_fblock_error);
+DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xfs_scrub_fblock_warning);
+
+TRACE_EVENT(xfs_scrub_incomplete,
+	TP_PROTO(struct xfs_scrub_context *sc, void *ret_ip),
+	TP_ARGS(sc, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->type = sc->sm->sm_type;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d type %u ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->ret_ip)
+);
 
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
 



* [PATCH 05/25] xfs: create helpers to scrub a metadata btree
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 04/25] xfs: create helpers to record and deal with scrub problems Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-03 23:49   ` Dave Chinner
  2017-10-03 20:41 ` [PATCH 06/25] xfs: scrub the shape of " Darrick J. Wong
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create helper functions and tracepoints to deal with errors while
scrubbing a metadata btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
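
Reader's note: in this patch xfs_scrub_btree() is only a stub that
returns -EOPNOTSUPP.  The sketch below illustrates how a per-record
callback and the lastrec/firstrec state in struct xfs_scrub_btree could
cooperate once a walker exists.  It is an assumption-laden userspace
model: records are plain integers in a flat array, and every model_*
name and the flag value are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_OFLAG_CORRUPT	(1U << 1)	/* assumed flag value */

struct model_btree_scrub;
typedef int (*model_rec_fn)(struct model_btree_scrub *bs,
			    const uint32_t *rec);

struct model_btree_scrub {
	/* caller-provided scrub state */
	unsigned int	sm_flags;	/* stands in for sc->sm->sm_flags */
	model_rec_fn	scrub_rec;
	void		*private;

	/* internal scrub state */
	uint32_t	lastrec;
	bool		firstrec;
};

/* Walk a flat array as if it held the leaf records of a btree. */
int
model_scrub_records(struct model_btree_scrub *bs, const uint32_t *recs,
		    size_t nr)
{
	size_t	i;
	int	error;

	assert(bs->scrub_rec != NULL);

	bs->firstrec = true;
	for (i = 0; i < nr; i++) {
		/* Records must be strictly increasing. */
		if (!bs->firstrec && recs[i] <= bs->lastrec)
			bs->sm_flags |= MODEL_OFLAG_CORRUPT;
		bs->firstrec = false;
		bs->lastrec = recs[i];

		/* Let the caller verify the individual record. */
		error = bs->scrub_rec(bs, &recs[i]);
		if (error)
			return error;
	}
	return 0;
}

/* Example callback: count the records it is shown. */
int
model_count_rec(struct model_btree_scrub *bs, const uint32_t *rec)
{
	(void)rec;
	(*(size_t *)bs->private)++;
	return 0;
}
```

As in the patch, an ordering problem is recorded in the flags and the
walk continues; only a nonzero return from the callback stops it.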
 fs/xfs/Makefile      |    1 
 fs/xfs/scrub/btree.c |  112 +++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/btree.h |   57 ++++++++++++++++++
 fs/xfs/scrub/trace.c |   15 +++++
 fs/xfs/scrub/trace.h |  162 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 347 insertions(+)
 create mode 100644 fs/xfs/scrub/btree.c
 create mode 100644 fs/xfs/scrub/btree.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ca14595..5888b9f 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -146,6 +146,7 @@ ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 
 xfs-y				+= $(addprefix scrub/, \
 				   trace.o \
+				   btree.o \
 				   common.o \
 				   scrub.o \
 				   )
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
new file mode 100644
index 0000000..9645f6d
--- /dev/null
+++ b/fs/xfs/scrub/btree.c
@@ -0,0 +1,112 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+
+/* btree scrubbing */
+
+/*
+ * Check for btree operation errors.  See the section about handling
+ * operational errors in common.c.
+ */
+bool
+xfs_scrub_btree_op_ok(
+	struct xfs_scrub_context	*sc,
+	struct xfs_btree_cur		*cur,
+	int				level,
+	int				*error)
+{
+	if (*error == 0)
+		return true;
+
+	switch (*error) {
+	case -EDEADLOCK:
+		/* Used to restart an op with deadlock avoidance. */
+		trace_xfs_scrub_deadlock_retry(sc->ip, sc->sm, *error);
+		break;
+	case -EFSBADCRC:
+	case -EFSCORRUPTED:
+		/* Note the badness but don't abort. */
+		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+		*error = 0;
+		/* fall through */
+	default:
+		if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
+			trace_xfs_scrub_ifork_btree_op_error(sc, cur, level,
+					*error, __return_address);
+		else
+			trace_xfs_scrub_btree_op_error(sc, cur, level,
+					*error, __return_address);
+		break;
+	}
+	return false;
+}
+
+/* Record btree block corruption. */
+void
+xfs_scrub_btree_set_corrupt(
+	struct xfs_scrub_context	*sc,
+	struct xfs_btree_cur		*cur,
+	int				level)
+{
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+
+	if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
+		trace_xfs_scrub_ifork_btree_error(sc, cur, level,
+				__return_address);
+	else
+		trace_xfs_scrub_btree_error(sc, cur, level,
+				__return_address);
+}
+
+/*
+ * Visit all nodes and leaves of a btree.  Check that all pointers and
+ * records are in order, that the keys reflect the records, and use a callback
+ * so that the caller can verify individual records.  The callback is the same
+ * as the one for xfs_btree_query_range, so this function also returns
+ * XFS_BTREE_QUERY_RANGE_ABORT, zero, or a negative error code.
+ */
+int
+xfs_scrub_btree(
+	struct xfs_scrub_context	*sc,
+	struct xfs_btree_cur		*cur,
+	xfs_scrub_btree_rec_fn		scrub_fn,
+	struct xfs_owner_info		*oinfo,
+	void				*private)
+{
+	int				error = -EOPNOTSUPP;
+	return xfs_scrub_btree_op_ok(sc, cur, 0, &error) ? 0 : error;
+}
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
new file mode 100644
index 0000000..38ca0e8
--- /dev/null
+++ b/fs/xfs/scrub/btree.h
@@ -0,0 +1,57 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_SCRUB_BTREE_H__
+#define __XFS_SCRUB_BTREE_H__
+
+/* btree scrub */
+
+/* Check for btree operation errors. */
+bool xfs_scrub_btree_op_ok(struct xfs_scrub_context *sc,
+		struct xfs_btree_cur *cur, int level, int *error);
+
+/* Record btree block corruption. */
+void xfs_scrub_btree_set_corrupt(struct xfs_scrub_context *sc,
+		struct xfs_btree_cur *cur, int level);
+
+struct xfs_scrub_btree;
+typedef int (*xfs_scrub_btree_rec_fn)(
+	struct xfs_scrub_btree	*bs,
+	union xfs_btree_rec	*rec);
+
+struct xfs_scrub_btree {
+	/* caller-provided scrub state */
+	struct xfs_scrub_context	*sc;
+	struct xfs_btree_cur		*cur;
+	xfs_scrub_btree_rec_fn		scrub_rec;
+	struct xfs_owner_info		*oinfo;
+	void				*private;
+
+	/* internal scrub state */
+	union xfs_btree_rec		lastrec;
+	bool				firstrec;
+	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
+	bool				firstkey[XFS_BTREE_MAXLEVELS];
+	struct list_head		to_check;
+};
+int xfs_scrub_btree(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur,
+		    xfs_scrub_btree_rec_fn scrub_fn,
+		    struct xfs_owner_info *oinfo, void *private);
+
+#endif /* __XFS_SCRUB_BTREE_H__ */
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 88b5ccb..a1abf04 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -30,10 +30,25 @@
 #include "xfs_inode.h"
 #include "xfs_btree.h"
 #include "xfs_trans.h"
+#include "xfs_bit.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 
+/* Figure out which block the btree cursor was pointing to. */
+static inline xfs_fsblock_t
+xfs_scrub_btree_cur_fsbno(
+	struct xfs_btree_cur		*cur,
+	int				level)
+{
+	if (level < cur->bc_nlevels && cur->bc_bufs[level])
+		return XFS_DADDR_TO_FSB(cur->bc_mp, cur->bc_bufs[level]->b_bn);
+	else if (level == cur->bc_nlevels - 1 &&
+		 cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_private.b.ip->i_ino);
+	return NULLFSBLOCK;
+}
+
 /*
  * We include this last to have the helpers above available for the trace
  * event implementations.
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 8d67a85..78f96b0 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -261,6 +261,168 @@ TRACE_EVENT(xfs_scrub_incomplete,
 		  __entry->ret_ip)
 );
 
+TRACE_EVENT(xfs_scrub_btree_op_error,
+	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur,
+		 int level, int error, void *ret_ip),
+	TP_ARGS(sc, cur, level, error, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(xfs_btnum_t, btnum)
+		__field(int, level)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(int, ptr)
+		__field(int, error)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		xfs_fsblock_t fsbno = xfs_scrub_btree_cur_fsbno(cur, level);
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->type = sc->sm->sm_type;
+		__entry->btnum = cur->bc_btnum;
+		__entry->level = level;
+		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
+		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+		__entry->ptr = cur->bc_ptrs[level];
+		__entry->error = error;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->btnum,
+		  __entry->level,
+		  __entry->ptr,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->error,
+		  __entry->ret_ip)
+);
+
+TRACE_EVENT(xfs_scrub_ifork_btree_op_error,
+	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur,
+		 int level, int error, void *ret_ip),
+	TP_ARGS(sc, cur, level, error, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, whichfork)
+		__field(unsigned int, type)
+		__field(xfs_btnum_t, btnum)
+		__field(int, level)
+		__field(int, ptr)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(int, error)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		xfs_fsblock_t fsbno = xfs_scrub_btree_cur_fsbno(cur, level);
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = sc->ip->i_ino;
+		__entry->whichfork = cur->bc_private.b.whichfork;
+		__entry->type = sc->sm->sm_type;
+		__entry->btnum = cur->bc_btnum;
+		__entry->level = level;
+		__entry->ptr = cur->bc_ptrs[level];
+		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
+		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+		__entry->error = error;
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->whichfork,
+		  __entry->type,
+		  __entry->btnum,
+		  __entry->level,
+		  __entry->ptr,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->error,
+		  __entry->ret_ip)
+);
+
+TRACE_EVENT(xfs_scrub_btree_error,
+	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur,
+		 int level, void *ret_ip),
+	TP_ARGS(sc, cur, level, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(xfs_btnum_t, btnum)
+		__field(int, level)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(int, ptr)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		xfs_fsblock_t fsbno = xfs_scrub_btree_cur_fsbno(cur, level);
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->type = sc->sm->sm_type;
+		__entry->btnum = cur->bc_btnum;
+		__entry->level = level;
+		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
+		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->btnum,
+		  __entry->level,
+		  __entry->ptr,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->ret_ip)
+);
+
+TRACE_EVENT(xfs_scrub_ifork_btree_error,
+	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur,
+		 int level, void *ret_ip),
+	TP_ARGS(sc, cur, level, ret_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, whichfork)
+		__field(unsigned int, type)
+		__field(xfs_btnum_t, btnum)
+		__field(int, level)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(int, ptr)
+		__field(void *, ret_ip)
+	),
+	TP_fast_assign(
+		xfs_fsblock_t fsbno = xfs_scrub_btree_cur_fsbno(cur, level);
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->ino = sc->ip->i_ino;
+		__entry->whichfork = cur->bc_private.b.whichfork;
+		__entry->type = sc->sm->sm_type;
+		__entry->btnum = cur->bc_btnum;
+		__entry->level = level;
+		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
+		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ret_ip = ret_ip;
+	),
+	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pF",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->whichfork,
+		  __entry->type,
+		  __entry->btnum,
+		  __entry->level,
+		  __entry->ptr,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->ret_ip)
+);
+
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 06/25] xfs: scrub the shape of a metadata btree
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 05/25] xfs: create helpers to scrub a metadata btree Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-04  0:15   ` Dave Chinner
  2017-10-03 20:41 ` [PATCH 07/25] xfs: scrub btree keys and records Darrick J. Wong
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a function that checks the shape of a btree: each block must pass
basic inspection and all the pointers must look ok.  In the next patch
we'll add the ability to check the actual keys and records stored within
the btree.  Also add some helper functions, which subsequent patches
will use, so that we report detailed scrub errors in a uniform manner in
dmesg.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c |   16 +++
 fs/xfs/libxfs/xfs_btree.h |    7 +
 fs/xfs/scrub/btree.c      |  237 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h     |   18 +++
 4 files changed, 274 insertions(+), 4 deletions(-)
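
For anyone who wants to see the traversal in isolation, here is a
minimal userspace sketch of the non-recursive walk that xfs_scrub_btree
performs.  It is an illustration under simplifying assumptions, not the
kernel code: struct demo_block, demo_walk and the DEMO_* limits are
hypothetical names, blocks live in memory rather than in buffers, and a
bad child pointer aborts the walk instead of being flagged and skipped.

```c
#include <stddef.h>

#define DEMO_MAXLEVELS	4	/* hypothetical cap on tree height */
#define DEMO_MAXRECS	4	/* hypothetical cap on entries per block */

/* Hypothetical in-memory block: leaves hold records, nodes hold children. */
struct demo_block {
	int			level;		/* 0 == leaf */
	int			numrecs;
	int			recs[DEMO_MAXRECS];
	struct demo_block	*child[DEMO_MAXRECS];
};

/*
 * Visit every record in order without recursion.  ptrs[level] is a
 * 1-based index into the block currently loaded at that level.
 * Returns the number of records copied to out[], or -1 if the tree
 * shape is bad or out[] is too small.
 */
static int
demo_walk(
	struct demo_block	*root,
	int			nlevels,
	int			*out,
	int			max_out)
{
	struct demo_block	*blocks[DEMO_MAXLEVELS];
	int			ptrs[DEMO_MAXLEVELS];
	int			level;
	int			n = 0;

	/* Don't try to walk a tree with a height we can't handle. */
	if (!root || nlevels < 1 || nlevels > DEMO_MAXLEVELS)
		return -1;

	level = nlevels - 1;
	blocks[level] = root;
	ptrs[level] = 1;

	while (level < nlevels) {
		struct demo_block	*block = blocks[level];
		struct demo_block	*child;

		if (block->numrecs < 0 || block->numrecs > DEMO_MAXRECS)
			return -1;

		/* End of this block, pop back towards the root. */
		if (ptrs[level] > block->numrecs) {
			if (level < nlevels - 1)
				ptrs[level + 1]++;
			level++;
			continue;
		}

		/* Leaf: emit the record and advance. */
		if (level == 0) {
			if (n >= max_out)
				return -1;
			out[n++] = block->recs[ptrs[0] - 1];
			ptrs[0]++;
			continue;
		}

		/* Node: drill another level deeper. */
		child = block->child[ptrs[level] - 1];
		if (!child || child->level != level - 1)
			return -1;
		level--;
		blocks[level] = child;
		ptrs[level] = 1;
	}

	return n;
}
```

Here ptrs[level] plays the role of cur->bc_ptrs[level]: once it passes
numrecs, the walk goes back up a level and bumps the parent's index, and
the loop ends when the root itself is exhausted.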


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 5bfb882..c4d8b47 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1027,7 +1027,7 @@ xfs_btree_setbuf(
 	}
 }
 
-STATIC int
+bool
 xfs_btree_ptr_is_null(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_ptr	*ptr)
@@ -1052,7 +1052,7 @@ xfs_btree_set_ptr_null(
 /*
  * Get/set/init sibling pointers
  */
-STATIC void
+void
 xfs_btree_get_sibling(
 	struct xfs_btree_cur	*cur,
 	struct xfs_btree_block	*block,
@@ -4914,3 +4914,15 @@ xfs_btree_count_blocks(
 	return xfs_btree_visit_blocks(cur, xfs_btree_count_blocks_helper,
 			blocks);
 }
+
+/* Compare two btree pointers. */
+int64_t
+xfs_btree_diff_two_ptrs(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_ptr	*a,
+	const union xfs_btree_ptr	*b)
+{
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+		return (int64_t)be64_to_cpu(a->l) - be64_to_cpu(b->l);
+	return (int64_t)be32_to_cpu(a->s) - be32_to_cpu(b->s);
+}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index f2a88c3..0daf524 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -517,5 +517,12 @@ int xfs_btree_lookup_get_block(struct xfs_btree_cur *cur, int level,
 		union xfs_btree_ptr *pp, struct xfs_btree_block **blkp);
 struct xfs_btree_block *xfs_btree_get_block(struct xfs_btree_cur *cur,
 		int level, struct xfs_buf **bpp);
+bool xfs_btree_ptr_is_null(struct xfs_btree_cur *cur, union xfs_btree_ptr *ptr);
+int64_t xfs_btree_diff_two_ptrs(struct xfs_btree_cur *cur,
+				const union xfs_btree_ptr *a,
+				const union xfs_btree_ptr *b);
+void xfs_btree_get_sibling(struct xfs_btree_cur *cur,
+			   struct xfs_btree_block *block,
+			   union xfs_btree_ptr *ptr, int lr);
 
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 9645f6d..899c9b1 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -92,6 +92,152 @@ xfs_scrub_btree_set_corrupt(
 				__return_address);
 }
 
+/* Check a btree pointer.  Returns true if it's ok to use this pointer. */
+static bool
+xfs_scrub_btree_ptr_ok(
+	struct xfs_scrub_btree		*bs,
+	int				level,
+	union xfs_btree_ptr		*ptr)
+{
+	struct xfs_btree_cur		*cur = bs->cur;
+	xfs_daddr_t			daddr;
+	xfs_daddr_t			eofs;
+
+	if (xfs_btree_ptr_is_null(cur, ptr)) {
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
+		return false;
+	}
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		daddr = XFS_FSB_TO_DADDR(cur->bc_mp, be64_to_cpu(ptr->l));
+	} else {
+		ASSERT(cur->bc_private.a.agno != NULLAGNUMBER);
+		daddr = XFS_AGB_TO_DADDR(cur->bc_mp, cur->bc_private.a.agno,
+				be32_to_cpu(ptr->s));
+	}
+	eofs = XFS_FSB_TO_BB(cur->bc_mp, cur->bc_mp->m_sb.sb_dblocks);
+	if (daddr == 0 || daddr >= eofs) {
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
+		return false;
+	}
+
+	return true;
+}
+
+/* Check that a btree block's sibling matches what we expect. */
+STATIC int
+xfs_scrub_btree_block_check_sibling(
+	struct xfs_scrub_btree		*bs,
+	int				level,
+	int				direction,
+	union xfs_btree_ptr		*sibling)
+{
+	struct xfs_btree_cur		*cur = bs->cur;
+	struct xfs_btree_block		*pblock;
+	struct xfs_buf			*pbp;
+	struct xfs_btree_cur		*ncur = NULL;
+	union xfs_btree_ptr		*pp;
+	int				success;
+	int				error;
+
+	if (xfs_btree_ptr_is_null(cur, sibling))
+		return 0;
+
+	error = xfs_btree_dup_cursor(cur, &ncur);
+	if (!xfs_scrub_btree_op_ok(bs->sc, cur, level + 1, &error) || !ncur)
+		return error;
+
+	if (direction > 0)
+		error = xfs_btree_increment(ncur, level + 1, &success);
+	else
+		error = xfs_btree_decrement(ncur, level + 1, &success);
+	if (!xfs_scrub_btree_op_ok(bs->sc, cur, level + 1, &error))
+		goto out;
+	if (!success) {
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, level + 1);
+		goto out;
+	}
+
+	pblock = xfs_btree_get_block(ncur, level + 1, &pbp);
+	pp = xfs_btree_ptr_addr(ncur, ncur->bc_ptrs[level + 1], pblock);
+	if (!xfs_scrub_btree_ptr_ok(bs, level + 1, pp))
+		goto out;
+
+	if (xfs_btree_diff_two_ptrs(cur, pp, sibling))
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
+out:
+	xfs_btree_del_cursor(ncur, XFS_BTREE_ERROR);
+	return error;
+}
+
+/* Check the siblings of a btree block. */
+STATIC int
+xfs_scrub_btree_block_check_siblings(
+	struct xfs_scrub_btree		*bs,
+	struct xfs_btree_block		*block)
+{
+	struct xfs_btree_cur		*cur = bs->cur;
+	union xfs_btree_ptr		leftsib;
+	union xfs_btree_ptr		rightsib;
+	int				level;
+	int				error = 0;
+
+	xfs_btree_get_sibling(cur, block, &leftsib, XFS_BB_LEFTSIB);
+	xfs_btree_get_sibling(cur, block, &rightsib, XFS_BB_RIGHTSIB);
+	level = xfs_btree_get_level(block);
+
+	/* Root block should never have siblings. */
+	if (level == cur->bc_nlevels - 1) {
+		if (!xfs_btree_ptr_is_null(cur, &leftsib) ||
+		    !xfs_btree_ptr_is_null(cur, &rightsib))
+			xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
+		goto out;
+	}
+
+	/*
+	 * Do the left & right sibling pointers match the adjacent
+	 * parent level pointers?
+	 * (These functions absorb error codes for us.)
+	 */
+	error = xfs_scrub_btree_block_check_sibling(bs, level, -1, &leftsib);
+	if (error)
+		return error;
+	error = xfs_scrub_btree_block_check_sibling(bs, level, 1, &rightsib);
+	if (error)
+		return error;
+out:
+	return error;
+}
+
+/*
+ * Grab and scrub a btree block given a btree pointer.  Returns block
+ * and buffer pointers (if applicable) if they're ok to use.
+ */
+STATIC int
+xfs_scrub_btree_get_block(
+	struct xfs_scrub_btree		*bs,
+	int				level,
+	union xfs_btree_ptr		*pp,
+	struct xfs_btree_block		**pblock,
+	struct xfs_buf			**pbp)
+{
+	int				error;
+
+	error = xfs_btree_lookup_get_block(bs->cur, level, pp, pblock);
+	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error) || !*pblock)
+		return error;
+
+	xfs_btree_get_block(bs->cur, level, pbp);
+	error = xfs_btree_check_block(bs->cur, *pblock, level, *pbp);
+	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error))
+		return error;
+
+	/*
+	 * Check the block's siblings; this function absorbs error codes
+	 * for us.
+	 */
+	return xfs_scrub_btree_block_check_siblings(bs, *pblock);
+}
+
 /*
  * Visit all nodes and leaves of a btree.  Check that all pointers and
  * records are in order, that the keys reflect the records, and use a callback
@@ -107,6 +253,93 @@ xfs_scrub_btree(
 	struct xfs_owner_info		*oinfo,
 	void				*private)
 {
-	xfs_scrub_btree_op_ok(sc, cur, 0, false);
-	return -EOPNOTSUPP;
+	struct xfs_scrub_btree		bs = {0};
+	union xfs_btree_ptr		ptr;
+	union xfs_btree_ptr		*pp;
+	struct xfs_btree_block		*block;
+	int				level;
+	struct xfs_buf			*bp;
+	int				i;
+	int				error = 0;
+
+	/* Initialize scrub state */
+	bs.cur = cur;
+	bs.scrub_rec = scrub_fn;
+	bs.oinfo = oinfo;
+	bs.firstrec = true;
+	bs.private = private;
+	bs.sc = sc;
+	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
+		bs.firstkey[i] = true;
+	INIT_LIST_HEAD(&bs.to_check);
+
+	/* Don't try to check a tree with a height we can't handle. */
+	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
+		xfs_scrub_btree_set_corrupt(sc, cur, 0);
+		goto out;
+	}
+
+	/* Make sure the root isn't in the superblock. */
+	if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)) {
+		cur->bc_ops->init_ptr_from_cur(cur, &ptr);
+		if (!xfs_scrub_btree_ptr_ok(&bs, cur->bc_nlevels - 1, &ptr))
+			goto out;
+	}
+
+	/*
+	 * Load the root of the btree.  The helper function absorbs
+	 * error codes for us.
+	 */
+	level = cur->bc_nlevels - 1;
+	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
+	error = xfs_scrub_btree_get_block(&bs, level, &ptr, &block, &bp);
+	if (error)
+		goto out;
+
+	cur->bc_ptrs[level] = 1;
+
+	while (level < cur->bc_nlevels) {
+		block = xfs_btree_get_block(cur, level, &bp);
+
+		if (level == 0) {
+			/* End of leaf, pop back towards the root. */
+			if (cur->bc_ptrs[level] >
+			    be16_to_cpu(block->bb_numrecs)) {
+				if (level < cur->bc_nlevels - 1)
+					cur->bc_ptrs[level + 1]++;
+				level++;
+				continue;
+			}
+
+			if (xfs_scrub_should_terminate(sc, &error))
+				break;
+
+			cur->bc_ptrs[level]++;
+			continue;
+		}
+
+		/* End of node, pop back towards the root. */
+		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+			if (level < cur->bc_nlevels - 1)
+				cur->bc_ptrs[level + 1]++;
+			level++;
+			continue;
+		}
+
+		/* Drill another level deeper. */
+		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		if (!xfs_scrub_btree_ptr_ok(&bs, level, pp)) {
+			cur->bc_ptrs[level]++;
+			continue;
+		}
+		level--;
+		error = xfs_scrub_btree_get_block(&bs, level, pp, &block, &bp);
+		if (!xfs_scrub_btree_op_ok(sc, cur, level, &error))
+			goto out;
+
+		cur->bc_ptrs[level] = 1;
+	}
+
+out:
+	return error;
 }
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index e396aa6..309e882 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -21,6 +21,24 @@
 #define __XFS_SCRUB_COMMON_H__
 
 /*
+ * We /could/ terminate a scrub/repair operation early.  If we're not
+ * in a good place to continue (fatal signal, etc.) then bail out.
+ * Note that we're careful not to make any judgements about *error.
+ */
+static inline bool
+xfs_scrub_should_terminate(
+	struct xfs_scrub_context	*sc,
+	int				*error)
+{
+	if (fatal_signal_pending(current)) {
+		if (*error == 0)
+			*error = -EAGAIN;
+		return true;
+	}
+	return false;
+}
+
+/*
  * Grab an empty transaction so that we can re-grab locked buffers if
  * one of our btrees turns out to be cyclic.
  */



* [PATCH 07/25] xfs: scrub btree keys and records
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 06/25] xfs: scrub the shape of " Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-04 20:52   ` Darrick J. Wong
  2017-10-03 20:41 ` [PATCH 08/25] xfs: create helpers to scan an allocation group Darrick J. Wong
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Teach the btree scrubber to check that the keys and records are in the
right order, and have it call out to our record iterator to check the
contents of each record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/btree.c |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/trace.h |   44 ++++++++++++++++++++
 2 files changed, 152 insertions(+)
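
To make the two checks concrete, here is a minimal userspace sketch of
the idea behind xfs_scrub_btree_rec: remember the previous record, and
compare each new one against both it and the parent's key range.  The
names (struct demo_order, demo_check_rec, demo_check_leaf) are
hypothetical, records are plain ints, and "in order" is assumed to mean
strictly increasing; the real code delegates that decision to each
btree's recs_inorder and diff_two_keys methods.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ordering state carried across leaf blocks. */
struct demo_order {
	bool	first;		/* true until the first record is seen */
	int	last;		/* previous record, valid if !first */
	int	ncorrupt;	/* number of problems flagged so far */
};

/* Check one record against its predecessor and the parent's keys. */
static void
demo_check_rec(
	struct demo_order	*st,
	int			rec,
	int			parent_low,
	int			parent_high)
{
	/* If this isn't the first record, are they in order? */
	if (!st->first && rec <= st->last)
		st->ncorrupt++;
	st->first = false;
	st->last = rec;

	/* Is this at least as large as the parent low key? */
	if (rec < parent_low)
		st->ncorrupt++;

	/* Is this no larger than the parent high key? */
	if (rec > parent_high)
		st->ncorrupt++;
}

/* Check every record of one leaf; returns the running problem count. */
static int
demo_check_leaf(
	struct demo_order	*st,
	const int		*recs,
	size_t			nr,
	int			parent_low,
	int			parent_high)
{
	size_t			i;

	for (i = 0; i < nr; i++)
		demo_check_rec(st, recs[i], parent_low, parent_high);
	return st->ncorrupt;
}
```

Because the state lives outside demo_check_leaf, an ordering violation
that straddles two leaf blocks is still caught, which is the reason the
patch keeps lastrec and firstrec in struct xfs_scrub_btree rather than
in a local variable.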


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 899c9b1..ad3518c 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -92,6 +92,101 @@ xfs_scrub_btree_set_corrupt(
 				__return_address);
 }
 
+/*
+ * Make sure this record is in order and doesn't stray outside of the parent
+ * keys.
+ */
+STATIC void
+xfs_scrub_btree_rec(
+	struct xfs_scrub_btree	*bs)
+{
+	struct xfs_btree_cur	*cur = bs->cur;
+	union xfs_btree_rec	*rec;
+	union xfs_btree_key	key;
+	union xfs_btree_key	hkey;
+	union xfs_btree_key	*keyp;
+	struct xfs_btree_block	*block;
+	struct xfs_btree_block	*keyblock;
+	struct xfs_buf		*bp;
+
+	block = xfs_btree_get_block(cur, 0, &bp);
+	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+
+	trace_xfs_scrub_btree_rec(bs->sc, cur, 0);
+
+	/* If this isn't the first record, are they in order? */
+	if (!bs->firstrec && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, 0);
+	bs->firstrec = false;
+	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
+
+	if (cur->bc_nlevels == 1)
+		return;
+
+	/* Is this at least as large as the parent low key? */
+	cur->bc_ops->init_key_from_rec(&key, rec);
+	keyblock = xfs_btree_get_block(cur, 1, &bp);
+	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0)
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, 1);
+
+	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
+		return;
+
+	/* Is this no larger than the parent high key? */
+	cur->bc_ops->init_high_key_from_rec(&hkey, rec);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0)
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, 1);
+}
+
+/*
+ * Make sure this key is in order and doesn't stray outside of the parent
+ * keys.
+ */
+STATIC void
+xfs_scrub_btree_key(
+	struct xfs_scrub_btree	*bs,
+	int			level)
+{
+	struct xfs_btree_cur	*cur = bs->cur;
+	union xfs_btree_key	*key;
+	union xfs_btree_key	*keyp;
+	struct xfs_btree_block	*block;
+	struct xfs_btree_block	*keyblock;
+	struct xfs_buf		*bp;
+
+	block = xfs_btree_get_block(cur, level, &bp);
+	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
+
+	trace_xfs_scrub_btree_key(bs->sc, cur, level);
+
+	/* If this isn't the first key, are they in order? */
+	if (!bs->firstkey[level] &&
+	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
+	bs->firstkey[level] = false;
+	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
+
+	if (level + 1 >= cur->bc_nlevels)
+		return;
+
+	/* Is this at least as large as the parent low key? */
+	keyblock = xfs_btree_get_block(cur, level + 1, &bp);
+	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0)
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
+
+	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
+		return;
+
+	/* Is this no larger than the parent high key? */
+	key = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0)
+		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
+}
+
 /* Check a btree pointer.  Returns true if it's ok to use this pointer. */
 static bool
 xfs_scrub_btree_ptr_ok(
@@ -256,6 +351,7 @@ xfs_scrub_btree(
 	struct xfs_scrub_btree		bs = {0};
 	union xfs_btree_ptr		ptr;
 	union xfs_btree_ptr		*pp;
+	union xfs_btree_rec		*recp;
 	struct xfs_btree_block		*block;
 	int				level;
 	struct xfs_buf			*bp;
@@ -311,6 +407,15 @@ xfs_scrub_btree(
 				continue;
 			}
 
+			/* Records in order for scrub? */
+			xfs_scrub_btree_rec(&bs);
+
+			/* Call out to the record checker. */
+			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			error = bs.scrub_rec(&bs, recp);
+			if (error < 0 ||
+			    error == XFS_BTREE_QUERY_RANGE_ABORT)
+				break;
 			if (xfs_scrub_should_terminate(sc, &error))
 				break;
 
@@ -326,6 +431,9 @@ xfs_scrub_btree(
 			continue;
 		}
 
+		/* Keys in order for scrub? */
+		xfs_scrub_btree_key(&bs, level);
+
 		/* Drill another level deeper. */
 		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
 		if (!xfs_scrub_btree_ptr_ok(&bs, level, pp)) {
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 78f96b0..a78c8d1 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -423,6 +423,50 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
 		  __entry->ret_ip)
 );
 
+DECLARE_EVENT_CLASS(xfs_scrub_sbtree_class,
+	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur,
+		 int level),
+	TP_ARGS(sc, cur, level),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, type)
+		__field(xfs_btnum_t, btnum)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(int, level)
+		__field(int, nlevels)
+		__field(int, ptr)
+	),
+	TP_fast_assign(
+		xfs_fsblock_t fsbno = xfs_scrub_btree_cur_fsbno(cur, level);
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->type = sc->sm->sm_type;
+		__entry->btnum = cur->bc_btnum;
+		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
+		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+		__entry->level = level;
+		__entry->nlevels = cur->bc_nlevels;
+		__entry->ptr = cur->bc_ptrs[level];
+	),
+	TP_printk("dev %d:%d type %u btnum %d agno %u agbno %u level %d nlevels %d ptr %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->btnum,
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->level,
+		  __entry->nlevels,
+		  __entry->ptr)
+)
+#define DEFINE_SCRUB_SBTREE_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_sbtree_class, name, \
+	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur, \
+		 int level), \
+	TP_ARGS(sc, cur, level))
+
+DEFINE_SCRUB_SBTREE_EVENT(xfs_scrub_btree_rec);
+DEFINE_SCRUB_SBTREE_EVENT(xfs_scrub_btree_key);
+
 #endif /* _TRACE_XFS_SCRUB_TRACE_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 08/25] xfs: create helpers to scan an allocation group
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 07/25] xfs: scrub btree keys and records Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-04  0:46   ` Dave Chinner
  2017-10-03 20:41 ` [PATCH 09/25] xfs: scrub the backup superblocks Darrick J. Wong
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Add some helpers to enable us to lock an AG's headers, create btree
cursors for all btrees in that allocation group, and clean up
afterwards.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/common.c |  173 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h |   10 +++
 fs/xfs/scrub/scrub.c  |    4 +
 fs/xfs/scrub/scrub.h  |   21 ++++++
 4 files changed, 208 insertions(+)
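
The init/free pairing here follows a simple contract: init may stop
partway through, and free must cope with whatever was acquired.  A
minimal userspace sketch of that contract follows.  Everything in it is
hypothetical (struct demo_ag, demo_read_header, the fail_at knob), and
malloc stands in for reading and locking a header buffer.

```c
#include <errno.h>
#include <stdlib.h>

#define DEMO_NULLAGNUMBER	(-1)	/* hypothetical "no AG" marker */

/* Hypothetical per-AG context: one pointer per header. */
struct demo_ag {
	int	agno;
	void	*agi;
	void	*agf;
	void	*agfl;
};

/* Pretend to read one header; step == fail_at simulates a failure. */
static void *
demo_read_header(
	int	step,
	int	fail_at)
{
	return step == fail_at ? NULL : malloc(16);
}

/* Grab the headers in a fixed order: AGI, then AGF, then AGFL. */
static int
demo_ag_init(
	struct demo_ag	*sa,
	int		agno,
	int		fail_at)
{
	sa->agno = agno;

	sa->agi = demo_read_header(1, fail_at);
	if (!sa->agi)
		return -ENOMEM;

	sa->agf = demo_read_header(2, fail_at);
	if (!sa->agf)
		return -ENOMEM;

	sa->agfl = demo_read_header(3, fail_at);
	if (!sa->agfl)
		return -ENOMEM;

	return 0;
}

/* Release in reverse order; safe on a partially set up context. */
static void
demo_ag_free(
	struct demo_ag	*sa)
{
	free(sa->agfl);
	sa->agfl = NULL;
	free(sa->agf);
	sa->agf = NULL;
	free(sa->agi);
	sa->agi = NULL;
	sa->agno = DEMO_NULLAGNUMBER;
}
```

Note that demo_ag_init does not unwind on failure.  As with
xfs_scrub_teardown calling xfs_scrub_ag_free unconditionally, the caller
is expected to invoke the free function no matter how far init got, so
the free function has to tolerate NULL members and repeated calls.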


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index a84ba19..b056c9d 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -44,6 +44,7 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
+#include "scrub/btree.h"
 
 /* Common code for the metadata scrubbers. */
 
@@ -298,6 +299,178 @@ xfs_scrub_set_incomplete(
 	trace_xfs_scrub_incomplete(sc, __return_address);
 }
 
+/*
+ * AG scrubbing
+ *
+ * These helpers facilitate locking an allocation group's header
+ * buffers, setting up cursors for all btrees that are present, and
+ * cleaning everything up once we're through.
+ */
+
+/* Grab all the headers for an AG. */
+int
+xfs_scrub_ag_read_headers(
+	struct xfs_scrub_context	*sc,
+	xfs_agnumber_t			agno,
+	struct xfs_buf			**agi,
+	struct xfs_buf			**agf,
+	struct xfs_buf			**agfl)
+{
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	error = xfs_ialloc_read_agi(mp, sc->tp, agno, agi);
+	if (error)
+		goto out;
+
+	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, agf);
+	if (error)
+		goto out;
+	if (!*agf) {
+		error = -ENOMEM;
+		goto out;
+	}
+
+	error = xfs_alloc_read_agfl(mp, sc->tp, agno, agfl);
+	if (error)
+		goto out;
+
+out:
+	return error;
+}
+
+/* Release all the AG btree cursors. */
+void
+xfs_scrub_ag_btcur_free(
+	struct xfs_scrub_ag		*sa)
+{
+	if (sa->refc_cur)
+		xfs_btree_del_cursor(sa->refc_cur, XFS_BTREE_ERROR);
+	if (sa->rmap_cur)
+		xfs_btree_del_cursor(sa->rmap_cur, XFS_BTREE_ERROR);
+	if (sa->fino_cur)
+		xfs_btree_del_cursor(sa->fino_cur, XFS_BTREE_ERROR);
+	if (sa->ino_cur)
+		xfs_btree_del_cursor(sa->ino_cur, XFS_BTREE_ERROR);
+	if (sa->cnt_cur)
+		xfs_btree_del_cursor(sa->cnt_cur, XFS_BTREE_ERROR);
+	if (sa->bno_cur)
+		xfs_btree_del_cursor(sa->bno_cur, XFS_BTREE_ERROR);
+
+	sa->refc_cur = NULL;
+	sa->rmap_cur = NULL;
+	sa->fino_cur = NULL;
+	sa->ino_cur = NULL;
+	sa->bno_cur = NULL;
+	sa->cnt_cur = NULL;
+}
+
+/* Initialize all the btree cursors for an AG. */
+int
+xfs_scrub_ag_btcur_init(
+	struct xfs_scrub_context	*sc,
+	struct xfs_scrub_ag		*sa)
+{
+	struct xfs_mount		*mp = sc->mp;
+	xfs_agnumber_t			agno = sa->agno;
+
+	if (sa->agf_bp) {
+		/* Set up a bnobt cursor for cross-referencing. */
+		sa->bno_cur = xfs_allocbt_init_cursor(mp, sc->tp, sa->agf_bp,
+				agno, XFS_BTNUM_BNO);
+		if (!sa->bno_cur)
+			goto err;
+
+		/* Set up a cntbt cursor for cross-referencing. */
+		sa->cnt_cur = xfs_allocbt_init_cursor(mp, sc->tp, sa->agf_bp,
+				agno, XFS_BTNUM_CNT);
+		if (!sa->cnt_cur)
+			goto err;
+	}
+
+	/* Set up an inobt cursor for cross-referencing. */
+	if (sa->agi_bp) {
+		sa->ino_cur = xfs_inobt_init_cursor(mp, sc->tp, sa->agi_bp,
+					agno, XFS_BTNUM_INO);
+		if (!sa->ino_cur)
+			goto err;
+	}
+
+	/* Set up a finobt cursor for cross-referencing. */
+	if (sa->agi_bp && xfs_sb_version_hasfinobt(&mp->m_sb)) {
+		sa->fino_cur = xfs_inobt_init_cursor(mp, sc->tp, sa->agi_bp,
+				agno, XFS_BTNUM_FINO);
+		if (!sa->fino_cur)
+			goto err;
+	}
+
+	/* Set up an rmapbt cursor for cross-referencing. */
+	if (sa->agf_bp && xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		sa->rmap_cur = xfs_rmapbt_init_cursor(mp, sc->tp, sa->agf_bp,
+				agno);
+		if (!sa->rmap_cur)
+			goto err;
+	}
+
+	/* Set up a refcountbt cursor for cross-referencing. */
+	if (sa->agf_bp && xfs_sb_version_hasreflink(&mp->m_sb)) {
+		sa->refc_cur = xfs_refcountbt_init_cursor(mp, sc->tp,
+				sa->agf_bp, agno, NULL);
+		if (!sa->refc_cur)
+			goto err;
+	}
+
+	return 0;
+err:
+	return -ENOMEM;
+}
+
+/* Release the AG header context and btree cursors. */
+void
+xfs_scrub_ag_free(
+	struct xfs_scrub_context	*sc,
+	struct xfs_scrub_ag		*sa)
+{
+	xfs_scrub_ag_btcur_free(sa);
+	if (sa->agfl_bp) {
+		xfs_trans_brelse(sc->tp, sa->agfl_bp);
+		sa->agfl_bp = NULL;
+	}
+	if (sa->agf_bp) {
+		xfs_trans_brelse(sc->tp, sa->agf_bp);
+		sa->agf_bp = NULL;
+	}
+	if (sa->agi_bp) {
+		xfs_trans_brelse(sc->tp, sa->agi_bp);
+		sa->agi_bp = NULL;
+	}
+	sa->agno = NULLAGNUMBER;
+}
+
+/*
+ * For scrub, grab the AGI and the AGF headers, in that order.  Locking
+ * order requires us to get the AGI before the AGF.  We use the
+ * transaction to avoid deadlocking on crosslinked metadata buffers;
+ * either the caller passes one in (bmap scrub) or we have to create a
+ * transaction ourselves.
+ */
+int
+xfs_scrub_ag_init(
+	struct xfs_scrub_context	*sc,
+	xfs_agnumber_t			agno,
+	struct xfs_scrub_ag		*sa)
+{
+	int				error;
+
+	sa->agno = agno;
+	error = xfs_scrub_ag_read_headers(sc, agno, &sa->agi_bp,
+			&sa->agf_bp, &sa->agfl_bp);
+	if (error)
+		return error;
+
+	return xfs_scrub_ag_btcur_init(sc, sa);
+}
+
 /* Per-scrubber setup functions */
 
 /* Set us up with a transaction and an empty context. */
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 309e882..979ad89 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -77,4 +77,14 @@ void xfs_scrub_set_incomplete(struct xfs_scrub_context *sc);
 /* Setup functions */
 int xfs_scrub_setup_fs(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
+void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
+int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
+		      struct xfs_scrub_ag *sa);
+int xfs_scrub_ag_read_headers(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
+			      struct xfs_buf **agi, struct xfs_buf **agf,
+			      struct xfs_buf **agfl);
+void xfs_scrub_ag_btcur_free(struct xfs_scrub_ag *sa);
+int xfs_scrub_ag_btcur_init(struct xfs_scrub_context *sc,
+			    struct xfs_scrub_ag *sa);
+
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index d51354d..ae577d7 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -44,6 +44,8 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
+#include "scrub/scrub.h"
+#include "scrub/btree.h"
 
 /*
  * Online Scrub and Repair
@@ -136,6 +138,7 @@ xfs_scrub_teardown(
 	struct xfs_scrub_context	*sc,
 	int				error)
 {
+	xfs_scrub_ag_free(sc, &sc->sa);
 	if (sc->tp) {
 		xfs_trans_cancel(sc->tp);
 		sc->tp = NULL;
@@ -222,6 +225,7 @@ xfs_scrub_metadata(
 	sc.sm = sm;
 	sc.ops = ops;
 	sc.try_harder = try_harder;
+	sc.sa.agno = NULLAGNUMBER;
 	error = sc.ops->setup(&sc, ip);
 	if (error)
 		goto out_teardown;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index b7b9422..1385295 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -34,6 +34,24 @@ struct xfs_scrub_meta_ops {
 	bool		(*has)(struct xfs_sb *);
 };
 
+/* Buffer pointers and btree cursors for an entire AG. */
+struct xfs_scrub_ag {
+	xfs_agnumber_t			agno;
+
+	/* AG btree roots */
+	struct xfs_buf			*agf_bp;
+	struct xfs_buf			*agfl_bp;
+	struct xfs_buf			*agi_bp;
+
+	/* AG btrees */
+	struct xfs_btree_cur		*bno_cur;
+	struct xfs_btree_cur		*cnt_cur;
+	struct xfs_btree_cur		*ino_cur;
+	struct xfs_btree_cur		*fino_cur;
+	struct xfs_btree_cur		*rmap_cur;
+	struct xfs_btree_cur		*refc_cur;
+};
+
 struct xfs_scrub_context {
 	/* General scrub state. */
 	struct xfs_mount		*mp;
@@ -42,6 +60,9 @@ struct xfs_scrub_context {
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip;
 	bool				try_harder;
+
+	/* State tracking for single-AG operations. */
+	struct xfs_scrub_ag		sa;
 };
 
 /* Metadata scrubbers */



* [PATCH 09/25] xfs: scrub the backup superblocks
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 08/25] xfs: create helpers to scan an allocation group Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-04  0:57   ` Dave Chinner
  2017-10-03 20:41 ` [PATCH 10/25] xfs: scrub AGF and AGFL Darrick J. Wong
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Ensure that the geometry presented in the backup superblocks matches
the primary superblock so that repair can recover the filesystem if
that primary gets corrupted.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile         |    1 
 fs/xfs/libxfs/xfs_fs.h  |    3 
 fs/xfs/scrub/agheader.c |  317 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h   |    2 
 fs/xfs/scrub/scrub.c    |    4 +
 fs/xfs/scrub/scrub.h    |    1 
 6 files changed, 327 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/agheader.c
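
The sb_versionnum handling splits one field into two classes of bits:
those fixed at mkfs time, where a mismatch means corruption, and those
that can change later, where a mismatch only deserves preening.  A
minimal userspace sketch of that masked comparison follows.  The names
(demo_check_versionnum, DEMO_CORRUPT, DEMO_PREEN) and the mask values
used with it are hypothetical, and the values are compared in CPU
byte order rather than as __be16.

```c
#include <stdint.h>

#define DEMO_CORRUPT	(1U << 0)	/* hypothetical "corrupt" flag */
#define DEMO_PREEN	(1U << 1)	/* hypothetical "preen" flag */

/*
 * Compare a backup version field against the primary, one class of
 * bits at a time.  Bits covered by neither mask are ignored.
 */
static unsigned int
demo_check_versionnum(
	uint16_t	primary,
	uint16_t	backup,
	uint16_t	mkfs_mask,
	uint16_t	runtime_mask)
{
	unsigned int	flags = 0;

	/* Bits set at mkfs time must match exactly. */
	if ((primary & mkfs_mask) != (backup & mkfs_mask))
		flags |= DEMO_CORRUPT;

	/* Bits that can be set later are only worth preening. */
	if ((primary & runtime_mask) != (backup & runtime_mask))
		flags |= DEMO_PREEN;

	return flags;
}
```

The patch does the same thing on the on-disk representation by
converting each mask with cpu_to_be16 once; that works because a
bitwise AND gives the same answer whether it is applied before or
after the byte swap.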


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5888b9f..e92d04d 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -146,6 +146,7 @@ ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 
 xfs-y				+= $(addprefix scrub/, \
 				   trace.o \
+				   agheader.o \
 				   btree.o \
 				   common.o \
 				   scrub.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 765f91e..8543cbb 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -484,9 +484,10 @@ struct xfs_scrub_metadata {
 
 /* Scrub subcommands. */
 #define XFS_SCRUB_TYPE_PROBE	0	/* presence test ioctl */
+#define XFS_SCRUB_TYPE_SB	1	/* superblock */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	1
+#define XFS_SCRUB_TYPE_NR	2
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
new file mode 100644
index 0000000..487c4f4
--- /dev/null
+++ b/fs/xfs/scrub/agheader.c
@@ -0,0 +1,317 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/* Set us up to check an AG header. */
+int
+xfs_scrub_setup_ag_header(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	struct xfs_mount		*mp = sc->mp;
+
+	if (sc->sm->sm_agno >= mp->m_sb.sb_agcount ||
+	    sc->sm->sm_ino || sc->sm->sm_gen)
+		return -EINVAL;
+	return xfs_scrub_setup_fs(sc, ip);
+}
+
+/* Superblock */
+
+/* Scrub the filesystem superblock. */
+int
+xfs_scrub_superblock(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*bp;
+	struct xfs_dsb			*sb;
+	xfs_agnumber_t			agno;
+	uint32_t			v2_ok;
+	__be32				features_mask;
+	int				error;
+	__be16				vernum_mask;
+
+	agno = sc->sm->sm_agno;
+	if (agno == 0)
+		return 0;
+
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+		  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
+		  XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
+	if (!xfs_scrub_op_ok(sc, agno, XFS_SB_BLOCK(mp), &error))
+		return error;
+
+	sb = XFS_BUF_TO_SBP(bp);
+
+	/*
+	 * Verify the geometries match.  Fields that are permanently
+	 * set by mkfs are checked; fields that can be updated later
+	 * (and are not propagated to backup superblocks) are preen
+	 * checked.
+	 */
+	if (sb->sb_blocksize != cpu_to_be32(mp->m_sb.sb_blocksize))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_dblocks != cpu_to_be64(mp->m_sb.sb_dblocks))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_rblocks != cpu_to_be64(mp->m_sb.sb_rblocks))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_rextents != cpu_to_be64(mp->m_sb.sb_rextents))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (!uuid_equal(&sb->sb_uuid, &mp->m_sb.sb_uuid))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_logstart != cpu_to_be64(mp->m_sb.sb_logstart))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_rootino != cpu_to_be64(mp->m_sb.sb_rootino))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_rextsize != cpu_to_be32(mp->m_sb.sb_rextsize))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_agblocks != cpu_to_be32(mp->m_sb.sb_agblocks))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_agcount != cpu_to_be32(mp->m_sb.sb_agcount))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_rbmblocks != cpu_to_be32(mp->m_sb.sb_rbmblocks))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_logblocks != cpu_to_be32(mp->m_sb.sb_logblocks))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	/* Check sb_versionnum bits that are set at mkfs time. */
+	vernum_mask = cpu_to_be16(~XFS_SB_VERSION_OKBITS |
+				  XFS_SB_VERSION_NUMBITS |
+				  XFS_SB_VERSION_ALIGNBIT |
+				  XFS_SB_VERSION_DALIGNBIT |
+				  XFS_SB_VERSION_SHAREDBIT |
+				  XFS_SB_VERSION_LOGV2BIT |
+				  XFS_SB_VERSION_SECTORBIT |
+				  XFS_SB_VERSION_EXTFLGBIT |
+				  XFS_SB_VERSION_DIRV2BIT);
+	if ((sb->sb_versionnum & vernum_mask) !=
+	    (cpu_to_be16(mp->m_sb.sb_versionnum) & vernum_mask))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	/* Check sb_versionnum bits that can be set after mkfs time. */
+	vernum_mask = cpu_to_be16(XFS_SB_VERSION_ATTRBIT |
+				  XFS_SB_VERSION_NLINKBIT |
+				  XFS_SB_VERSION_QUOTABIT);
+	if ((sb->sb_versionnum & vernum_mask) !=
+	    (cpu_to_be16(mp->m_sb.sb_versionnum) & vernum_mask))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_sectsize != cpu_to_be16(mp->m_sb.sb_sectsize))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_inodesize != cpu_to_be16(mp->m_sb.sb_inodesize))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_inopblock != cpu_to_be16(mp->m_sb.sb_inopblock))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (memcmp(sb->sb_fname, mp->m_sb.sb_fname, sizeof(sb->sb_fname)))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_blocklog != mp->m_sb.sb_blocklog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_sectlog != mp->m_sb.sb_sectlog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_inodelog != mp->m_sb.sb_inodelog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_inopblog != mp->m_sb.sb_inopblog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_agblklog != mp->m_sb.sb_agblklog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_rextslog != mp->m_sb.sb_rextslog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_imax_pct != mp->m_sb.sb_imax_pct)
+		xfs_scrub_block_set_preen(sc, bp);
+
+	/*
+	 * Skip the summary counters since we track them in memory anyway.
+	 * sb_icount, sb_ifree, sb_fdblocks, sb_frextents
+	 */
+
+	if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	/*
+	 * Skip the quota flags since repair will force quotacheck.
+	 * sb_qflags
+	 */
+
+	if (sb->sb_flags != mp->m_sb.sb_flags)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_shared_vn != mp->m_sb.sb_shared_vn)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_inoalignmt != cpu_to_be32(mp->m_sb.sb_inoalignmt))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_unit != cpu_to_be32(mp->m_sb.sb_unit))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_width != cpu_to_be32(mp->m_sb.sb_width))
+		xfs_scrub_block_set_preen(sc, bp);
+
+	if (sb->sb_dirblklog != mp->m_sb.sb_dirblklog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_logsectlog != mp->m_sb.sb_logsectlog)
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_logsectsize != cpu_to_be16(mp->m_sb.sb_logsectsize))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (sb->sb_logsunit != cpu_to_be32(mp->m_sb.sb_logsunit))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	/* Do we see any invalid bits in sb_features2? */
+	if (!xfs_sb_version_hasmorebits(&mp->m_sb)) {
+		if (sb->sb_features2 != 0)
+			xfs_scrub_block_set_corrupt(sc, bp);
+	} else {
+		v2_ok = XFS_SB_VERSION2_OKBITS;
+		if (XFS_SB_VERSION_NUM(&mp->m_sb) >= XFS_SB_VERSION_5)
+			v2_ok |= XFS_SB_VERSION2_CRCBIT;
+
+		if (!!(sb->sb_features2 & cpu_to_be32(~v2_ok)))
+			xfs_scrub_block_set_corrupt(sc, bp);
+
+		if (sb->sb_features2 != sb->sb_bad_features2)
+			xfs_scrub_block_set_preen(sc, bp);
+	}
+
+	/* Check sb_features2 flags that are set at mkfs time. */
+	features_mask = cpu_to_be32(XFS_SB_VERSION2_LAZYSBCOUNTBIT |
+				    XFS_SB_VERSION2_PROJID32BIT |
+				    XFS_SB_VERSION2_CRCBIT |
+				    XFS_SB_VERSION2_FTYPE);
+	if ((sb->sb_features2 & features_mask) !=
+	    (cpu_to_be32(mp->m_sb.sb_features2) & features_mask))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	/* Check sb_features2 flags that can be set after mkfs time. */
+	features_mask = cpu_to_be32(XFS_SB_VERSION2_ATTR2BIT);
+	if ((sb->sb_features2 & features_mask) !=
+	    (cpu_to_be32(mp->m_sb.sb_features2) & features_mask))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	if (!xfs_sb_version_hascrc(&mp->m_sb)) {
+		/* all v5 fields must be zero */
+		if (memchr_inv(&sb->sb_features_compat, 0,
+				sizeof(struct xfs_dsb) -
+				offsetof(struct xfs_dsb, sb_features_compat)))
+			xfs_scrub_block_set_corrupt(sc, bp);
+	} else {
+		/* Check compat flags; all are set at mkfs time. */
+		features_mask = cpu_to_be32(XFS_SB_FEAT_COMPAT_UNKNOWN);
+		if ((sb->sb_features_compat & features_mask) !=
+		    (cpu_to_be32(mp->m_sb.sb_features_compat) & features_mask))
+			xfs_scrub_block_set_corrupt(sc, bp);
+
+		/* Check ro compat flags; all are set at mkfs time. */
+		features_mask = cpu_to_be32(XFS_SB_FEAT_RO_COMPAT_UNKNOWN |
+					    XFS_SB_FEAT_RO_COMPAT_FINOBT |
+					    XFS_SB_FEAT_RO_COMPAT_RMAPBT |
+					    XFS_SB_FEAT_RO_COMPAT_REFLINK);
+		if ((sb->sb_features_ro_compat & features_mask) !=
+		    (cpu_to_be32(mp->m_sb.sb_features_ro_compat) & features_mask))
+			xfs_scrub_block_set_corrupt(sc, bp);
+
+		/* Check incompat flags; all are set at mkfs time. */
+		features_mask = cpu_to_be32(XFS_SB_FEAT_INCOMPAT_UNKNOWN |
+					    XFS_SB_FEAT_INCOMPAT_FTYPE |
+					    XFS_SB_FEAT_INCOMPAT_SPINODES |
+					    XFS_SB_FEAT_INCOMPAT_META_UUID);
+		if ((sb->sb_features_incompat & features_mask) !=
+		    (cpu_to_be32(mp->m_sb.sb_features_incompat) & features_mask))
+			xfs_scrub_block_set_corrupt(sc, bp);
+
+		/* Check log incompat flags; all are set at mkfs time. */
+		features_mask = cpu_to_be32(XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN);
+		if ((sb->sb_features_log_incompat & features_mask) !=
+		    (cpu_to_be32(mp->m_sb.sb_features_log_incompat) & features_mask))
+			xfs_scrub_block_set_corrupt(sc, bp);
+
+		/* Don't care about sb_crc */
+
+		if (sb->sb_spino_align != cpu_to_be32(mp->m_sb.sb_spino_align))
+			xfs_scrub_block_set_corrupt(sc, bp);
+
+		if (sb->sb_pquotino != cpu_to_be64(mp->m_sb.sb_pquotino))
+			xfs_scrub_block_set_preen(sc, bp);
+
+		/* Don't care about sb_lsn */
+	}
+
+	if (xfs_sb_version_hasmetauuid(&mp->m_sb)) {
+		/* The metadata UUID must be the same for all supers */
+		if (!uuid_equal(&sb->sb_meta_uuid, &mp->m_sb.sb_meta_uuid))
+			xfs_scrub_block_set_corrupt(sc, bp);
+	}
+
+	/* Everything else must be zero. */
+	if (memchr_inv(sb + 1, 0,
+			BBTOB(bp->b_length) - sizeof(struct xfs_dsb)))
+		xfs_scrub_block_set_corrupt(sc, bp);
+
+	return error;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 979ad89..390f772 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -76,6 +76,8 @@ void xfs_scrub_set_incomplete(struct xfs_scrub_context *sc);
 
 /* Setup functions */
 int xfs_scrub_setup_fs(struct xfs_scrub_context *sc, struct xfs_inode *ip);
+int xfs_scrub_setup_ag_header(struct xfs_scrub_context *sc,
+			      struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index ae577d7..7f6f997 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -153,6 +153,10 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_fs,
 		.scrub	= xfs_scrub_probe,
 	},
+	{ /* superblock */
+		.setup	= xfs_scrub_setup_ag_header,
+		.scrub	= xfs_scrub_superblock,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 1385295..13e3f9b 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -67,5 +67,6 @@ struct xfs_scrub_context {
 
 /* Metadata scrubbers */
 int xfs_scrub_tester(struct xfs_scrub_context *sc);
+int xfs_scrub_superblock(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
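
The masked comparisons in xfs_scrub_superblock() can be sketched in
userspace as below.  This is only an illustration, not the kernel code:
the DEMO_* names, the flag values and demo_cpu_to_be16() are made up for
the example, and unlike the scrubber (which records both flags and keeps
going) it returns only the most severe verdict.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical feature bits; NOT the real XFS_SB_VERSION_* values. */
#define DEMO_FEAT_MKFS_A	0x0010	/* fixed at mkfs time */
#define DEMO_FEAT_MKFS_B	0x0020	/* fixed at mkfs time */
#define DEMO_FEAT_RUNTIME	0x0100	/* may be switched on later */

enum demo_verdict { DEMO_OK, DEMO_PREEN, DEMO_CORRUPT };

/* Portable stand-in for the kernel's cpu_to_be16(). */
static uint16_t
demo_cpu_to_be16(uint16_t v)
{
	uint8_t		b[2] = { (uint8_t)(v >> 8), (uint8_t)(v & 0xff) };
	uint16_t	out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/*
 * Compare an on-disk (big-endian) flags word against the in-core copy.
 * Bits fixed at mkfs time must match exactly; bits that can change
 * later only merit a preen.
 */
static enum demo_verdict
demo_check_features(uint16_t disk_be, uint16_t incore)
{
	uint16_t	mkfs_mask;
	uint16_t	late_mask;
	uint16_t	incore_be = demo_cpu_to_be16(incore);

	mkfs_mask = demo_cpu_to_be16(DEMO_FEAT_MKFS_A | DEMO_FEAT_MKFS_B);
	late_mask = demo_cpu_to_be16(DEMO_FEAT_RUNTIME);

	if ((disk_be & mkfs_mask) != (incore_be & mkfs_mask))
		return DEMO_CORRUPT;
	if ((disk_be & late_mask) != (incore_be & late_mask))
		return DEMO_PREEN;
	return DEMO_OK;
}
```

Converting the mask rather than the on-disk field keeps the comparison
in disk byte order, which is the same trick the patch uses.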


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 10/25] xfs: scrub AGF and AGFL
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 09/25] xfs: scrub the backup superblocks Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-04  1:31   ` Dave Chinner
  2017-10-03 20:41 ` [PATCH 11/25] xfs: scrub the AGI Darrick J. Wong
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Check the block references in the AGF and AGFL headers to make sure
they make sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h  |    4 +
 fs/xfs/scrub/agheader.c |  220 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.c   |   60 +++++++++++++
 fs/xfs/scrub/common.h   |    6 +
 fs/xfs/scrub/scrub.c    |    8 ++
 fs/xfs/scrub/scrub.h    |    2 
 6 files changed, 299 insertions(+), 1 deletion(-)
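
For anyone unfamiliar with the AGFL layout: the free list is a circular
array indexed by agf_flfirst and agf_fllast.  A minimal userspace sketch
of the walk and the active-slot count follows.  It assumes a made-up
DEMO_AGFL_SIZE and uses a modulo index instead of the separate loops in
xfs_scrub_walk_agfl(); all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_AGFL_SIZE	8	/* hypothetical; real size depends on sector size */

typedef int (*demo_agfl_fn)(uint32_t bno, void *priv);

/* Count the slots from first to last inclusive, allowing wraparound. */
static unsigned int
demo_agfl_active(unsigned int first, unsigned int last)
{
	if (last >= first)
		return last - first + 1;
	return DEMO_AGFL_SIZE - first + last + 1;
}

/* Call fn on every active AGFL slot; stop at the first error. */
static int
demo_walk_agfl(const uint32_t *agfl, unsigned int first, unsigned int last,
	       unsigned int count, demo_agfl_fn fn, void *priv)
{
	unsigned int	i = first;
	unsigned int	n;
	int		error;

	assert(first < DEMO_AGFL_SIZE && last < DEMO_AGFL_SIZE);

	/* An empty list has nothing to visit, whatever first/last say. */
	if (count == 0)
		return 0;

	for (n = demo_agfl_active(first, last); n > 0; n--) {
		error = fn(agfl[i], priv);
		if (error)
			return error;
		i = (i + 1) % DEMO_AGFL_SIZE;
	}
	return 0;
}

/* Sample callback: add each visited block number to a running sum. */
static int
demo_sum_block(uint32_t bno, void *priv)
{
	*(uint64_t *)priv += bno;
	return 0;
}
```

Note that first == last with a nonzero count describes a one-entry list,
which is how the walk treats it.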


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 8543cbb..aeb2a66 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -485,9 +485,11 @@ struct xfs_scrub_metadata {
 /* Scrub subcommands. */
 #define XFS_SCRUB_TYPE_PROBE	0	/* presence test ioctl */
 #define XFS_SCRUB_TYPE_SB	1	/* superblock */
+#define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
+#define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	2
+#define XFS_SCRUB_TYPE_NR	4
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 487c4f4..7fe6630 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -49,6 +49,72 @@ xfs_scrub_setup_ag_header(
 	return xfs_scrub_setup_fs(sc, ip);
 }
 
+/* Find the size of the AG, in blocks. */
+static inline xfs_agblock_t
+xfs_scrub_ag_blocks(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	if (agno < mp->m_sb.sb_agcount - 1)
+		return mp->m_sb.sb_agblocks;
+	return mp->m_sb.sb_dblocks - (agno * mp->m_sb.sb_agblocks);
+}
+
+/* Walk all the blocks in the AGFL. */
+int
+xfs_scrub_walk_agfl(
+	struct xfs_scrub_context	*sc,
+	int				(*fn)(struct xfs_scrub_context *,
+					      xfs_agblock_t bno, void *),
+	void				*priv)
+{
+	struct xfs_agf			*agf;
+	__be32				*agfl_bno;
+	struct xfs_mount		*mp = sc->mp;
+	unsigned int			flfirst;
+	unsigned int			fllast;
+	int				i;
+	int				error;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, sc->sa.agfl_bp);
+	flfirst = be32_to_cpu(agf->agf_flfirst);
+	fllast = be32_to_cpu(agf->agf_fllast);
+
+	/* Skip an empty AGFL. */
+	if (agf->agf_flcount == cpu_to_be32(0))
+		return 0;
+
+	/* first to last is a consecutive list. */
+	if (fllast >= flfirst) {
+		for (i = flfirst; i <= fllast; i++) {
+			error = fn(sc, be32_to_cpu(agfl_bno[i]), priv);
+			if (error)
+				return error;
+		}
+
+		return 0;
+	}
+
+	/* first to the end */
+	for (i = flfirst; i < XFS_AGFL_SIZE(mp); i++) {
+		error = fn(sc, be32_to_cpu(agfl_bno[i]), priv);
+		if (error)
+			return error;
+	}
+
+	/* the start to last. */
+	for (i = 0; i <= fllast; i++) {
+		error = fn(sc, be32_to_cpu(agfl_bno[i]), priv);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
 /* Superblock */
 
 /* Scrub the filesystem superblock. */
@@ -315,3 +381,157 @@ xfs_scrub_superblock(
 
 	return error;
 }
+
+/* AGF */
+
+/* Scrub the AGF. */
+int
+xfs_scrub_agf(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_agf			*agf;
+	xfs_daddr_t			daddr;
+	xfs_daddr_t			eofs;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			agbno;
+	xfs_agblock_t			eoag;
+	xfs_agblock_t			agfl_first;
+	xfs_agblock_t			agfl_last;
+	xfs_agblock_t			agfl_count;
+	xfs_agblock_t			fl_count;
+	int				level;
+	int				error = 0;
+
+	agno = sc->sm->sm_agno;
+	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGF);
+	if (!xfs_scrub_op_ok(sc, agno, XFS_AGF_BLOCK(sc->mp), &error))
+		goto out;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
+
+	/* Check the AG length */
+	eoag = be32_to_cpu(agf->agf_length);
+	if (eoag != xfs_scrub_ag_blocks(mp, agno))
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+	/* Check the AGF btree roots and levels */
+	agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_BNO]);
+	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
+	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
+	    agbno >= eoag || daddr >= eofs)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+	agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_CNT]);
+	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
+	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
+	    agbno >= eoag || daddr >= eofs)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
+	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
+	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_RMAP]);
+		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
+		if (agbno <= XFS_AGI_BLOCK(mp) ||
+		    agbno >= mp->m_sb.sb_agblocks ||
+		    agbno >= eoag ||
+		    daddr >= eofs)
+			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+		level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
+		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+	}
+
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		agbno = be32_to_cpu(agf->agf_refcount_root);
+		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
+		if (agbno <= XFS_AGI_BLOCK(mp) ||
+		    agbno >= mp->m_sb.sb_agblocks ||
+		    agbno >= eoag ||
+		    daddr >= eofs)
+			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+		level = be32_to_cpu(agf->agf_refcount_level);
+		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+	}
+
+	/* Check the AGFL counters */
+	agfl_first = be32_to_cpu(agf->agf_flfirst);
+	agfl_last = be32_to_cpu(agf->agf_fllast);
+	agfl_count = be32_to_cpu(agf->agf_flcount);
+	if (agfl_last >= agfl_first)
+		fl_count = agfl_last - agfl_first + 1;
+	else
+		fl_count = XFS_AGFL_SIZE(mp) - agfl_first + agfl_last + 1;
+	if (agfl_count != 0 && fl_count != agfl_count)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
+
+out:
+	return error;
+}
+
+/* AGFL */
+
+struct xfs_scrub_agfl {
+	xfs_agblock_t			eoag;
+	xfs_daddr_t			eofs;
+};
+
+/* Scrub an AGFL block. */
+STATIC int
+xfs_scrub_agfl_block(
+	struct xfs_scrub_context	*sc,
+	xfs_agblock_t			agbno,
+	void				*priv)
+{
+	struct xfs_mount		*mp = sc->mp;
+	xfs_agnumber_t			agno = sc->sa.agno;
+	struct xfs_scrub_agfl		*sagfl = priv;
+	int				error = 0;
+
+	if (agbno <= XFS_AGI_BLOCK(mp) ||
+	    agbno >= mp->m_sb.sb_agblocks ||
+	    agbno >= sagfl->eoag ||
+	    XFS_AGB_TO_DADDR(mp, agno, agbno) >= sagfl->eofs)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agfl_bp);
+
+	return error;
+}
+
+/* Scrub the AGFL. */
+int
+xfs_scrub_agfl(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_scrub_agfl		sagfl;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_agf			*agf;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	agno = sc->sm->sm_agno;
+	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGFL);
+	if (!xfs_scrub_op_ok(sc, agno, XFS_AGFL_BLOCK(sc->mp), &error))
+		goto out;
+	if (!sc->sa.agf_bp)
+		return -EFSCORRUPTED;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	sagfl.eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
+	sagfl.eoag = be32_to_cpu(agf->agf_length);
+
+	/* Check the blocks in the AGFL. */
+	return xfs_scrub_walk_agfl(sc, xfs_scrub_agfl_block, &sagfl);
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index b056c9d..ee8e7be 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -471,6 +471,66 @@ xfs_scrub_ag_init(
 	return xfs_scrub_ag_btcur_init(sc, sa);
 }
 
+/*
+ * Load and verify an AG header for further AG header examination.
+ * If this header is not the target of the examination, don't return
+ * the buffer if a runtime or verifier error occurs.
+ */
+STATIC int
+xfs_scrub_load_ag_header(
+	struct xfs_scrub_context	*sc,
+	xfs_daddr_t			daddr,
+	struct xfs_buf			**bpp,
+	const struct xfs_buf_ops	*ops,
+	bool				is_target)
+{
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	*bpp = NULL;
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, sc->sa.agno, daddr),
+			XFS_FSS_TO_BB(mp, 1), 0, bpp, ops);
+	return is_target ? error : 0;
+}
+
+/*
+ * Load as many of the AG headers and btree cursors as we can for an
+ * examination and cross-reference of an AG header.
+ */
+int
+xfs_scrub_load_ag_headers(
+	struct xfs_scrub_context	*sc,
+	xfs_agnumber_t			agno,
+	unsigned int			type)
+{
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	ASSERT(type == XFS_SCRUB_TYPE_AGF || type == XFS_SCRUB_TYPE_AGFL);
+	memset(&sc->sa, 0, sizeof(sc->sa));
+	sc->sa.agno = agno;
+
+	error = xfs_scrub_load_ag_header(sc, XFS_AGI_DADDR(mp),
+			&sc->sa.agi_bp, &xfs_agi_buf_ops, false);
+	if (error)
+		return error;
+
+	error = xfs_scrub_load_ag_header(sc, XFS_AGF_DADDR(mp),
+			&sc->sa.agf_bp, &xfs_agf_buf_ops,
+			type == XFS_SCRUB_TYPE_AGF);
+	if (error)
+		return error;
+
+	error = xfs_scrub_load_ag_header(sc, XFS_AGFL_DADDR(mp),
+			&sc->sa.agfl_bp, &xfs_agfl_buf_ops,
+			type == XFS_SCRUB_TYPE_AGFL);
+	if (error)
+		return error;
+
+	return 0;
+}
+
 /* Per-scrubber setup functions */
 
 /* Set us up with a transaction and an empty context. */
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 390f772..4d8bb72 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -88,5 +88,11 @@ int xfs_scrub_ag_read_headers(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
 void xfs_scrub_ag_btcur_free(struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_btcur_init(struct xfs_scrub_context *sc,
 			    struct xfs_scrub_ag *sa);
+int xfs_scrub_load_ag_headers(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
+			      unsigned int type);
+int xfs_scrub_walk_agfl(struct xfs_scrub_context *sc,
+			int (*fn)(struct xfs_scrub_context *, xfs_agblock_t bno,
+				  void *),
+			void *priv);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 7f6f997..6f3c4f0 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -157,6 +157,14 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_ag_header,
 		.scrub	= xfs_scrub_superblock,
 	},
+	{ /* agf */
+		.setup	= xfs_scrub_setup_ag_header,
+		.scrub	= xfs_scrub_agf,
+	},
+	{ /* agfl */
+		.setup	= xfs_scrub_setup_ag_header,
+		.scrub	= xfs_scrub_agfl,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 13e3f9b..50f8641 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -68,5 +68,7 @@ struct xfs_scrub_context {
 /* Metadata scrubbers */
 int xfs_scrub_tester(struct xfs_scrub_context *sc);
 int xfs_scrub_superblock(struct xfs_scrub_context *sc);
+int xfs_scrub_agf(struct xfs_scrub_context *sc);
+int xfs_scrub_agfl(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 11/25] xfs: scrub the AGI
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 10/25] xfs: scrub AGF and AGFL Darrick J. Wong
@ 2017-10-03 20:41 ` Darrick J. Wong
  2017-10-04  1:43   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 12/25] xfs: scrub free space btrees Darrick J. Wong
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:41 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Wire up the scrub infrastructure to check the AGI contents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h  |    3 +-
 fs/xfs/scrub/agheader.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.c   |    6 ++-
 fs/xfs/scrub/scrub.c    |    4 ++
 fs/xfs/scrub/scrub.h    |    1 +
 5 files changed, 99 insertions(+), 3 deletions(-)
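
The inode pointer and counter checks reduce to simple range tests once
the first and last valid AG inode numbers are known.  A rough userspace
sketch, assuming XFS_OFFBNO_TO_AGINO packs the block number above
log2(inodes per block) offset bits; the demo_* names are hypothetical
and the caller supplies the bounds:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NULLAGINO	((uint32_t)-1)	/* stand-in for NULLAGINO */

/* Pack an AG block number and an inode offset into an AG inode number. */
static uint32_t
demo_offbno_to_agino(unsigned int inopblog, uint32_t agbno, uint32_t offset)
{
	return (agbno << inopblog) | offset;
}

/* A pointer is fine if it is the null marker or lies within [first, last]. */
static bool
demo_agino_ptr_ok(uint32_t agino, uint32_t first, uint32_t last)
{
	if (agino == DEMO_NULLAGINO)
		return true;
	return agino >= first && agino <= last;
}

/* The inode count must fit in the AG, and free inodes cannot exceed it. */
static bool
demo_agi_counts_ok(uint32_t count, uint32_t freecount,
		   uint32_t first, uint32_t last)
{
	return count <= last - first + 1 && freecount <= count;
}
```

The same demo_agino_ptr_ok() test would apply to agi_newino, agi_dirino
and each unlinked bucket.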


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index aeb2a66..1e326dd 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -487,9 +487,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_SB	1	/* superblock */
 #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
 #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
+#define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	4
+#define XFS_SCRUB_TYPE_NR	5
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 7fe6630..3d269c2 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -535,3 +535,91 @@ xfs_scrub_agfl(
 out:
 	return error;
 }
+
+/* AGI */
+
+/* Scrub the AGI. */
+int
+xfs_scrub_agi(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_agi			*agi;
+	xfs_daddr_t			daddr;
+	xfs_daddr_t			eofs;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			agbno;
+	xfs_agblock_t			eoag;
+	xfs_agino_t			agino;
+	xfs_agino_t			first_agino;
+	xfs_agino_t			last_agino;
+	int				i;
+	int				level;
+	int				error = 0;
+
+	agno = sc->sm->sm_agno;
+	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGI);
+	if (!xfs_scrub_op_ok(sc, agno, XFS_AGI_BLOCK(sc->mp), &error))
+		goto out;
+
+	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
+
+	/* Check the AG length */
+	eoag = be32_to_cpu(agi->agi_length);
+	if (eoag != xfs_scrub_ag_blocks(mp, agno))
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+
+	/* Check btree roots and levels */
+	agbno = be32_to_cpu(agi->agi_root);
+	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
+	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
+	    agbno >= eoag || daddr >= eofs)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+
+	level = be32_to_cpu(agi->agi_level);
+	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+
+	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+		agbno = be32_to_cpu(agi->agi_free_root);
+		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
+		if (agbno <= XFS_AGI_BLOCK(mp) ||
+		    agbno >= mp->m_sb.sb_agblocks ||
+		    agbno >= eoag ||
+		    daddr >= eofs)
+			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+
+		level = be32_to_cpu(agi->agi_free_level);
+		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+	}
+
+	/* Check inode counters */
+	first_agino = XFS_OFFBNO_TO_AGINO(mp, XFS_AGI_BLOCK(mp) + 1, 0);
+	last_agino = XFS_OFFBNO_TO_AGINO(mp, eoag + 1, 0) - 1;
+	agino = be32_to_cpu(agi->agi_count);
+	if (agino > last_agino - first_agino + 1 ||
+	    agino < be32_to_cpu(agi->agi_freecount))
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+
+	/* Check inode pointers */
+	agino = be32_to_cpu(agi->agi_newino);
+	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+	agino = be32_to_cpu(agi->agi_dirino);
+	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))
+		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+
+	/* Check unlinked inode buckets */
+	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
+		agino = be32_to_cpu(agi->agi_unlinked[i]);
+		if (agino == NULLAGINO)
+			continue;
+		if (agino < first_agino || agino > last_agino)
+			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
+	}
+
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index ee8e7be..4f42401 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -507,12 +507,14 @@ xfs_scrub_load_ag_headers(
 	struct xfs_mount		*mp = sc->mp;
 	int				error;
 
-	ASSERT(type == XFS_SCRUB_TYPE_AGF || type == XFS_SCRUB_TYPE_AGFL);
+	ASSERT(type == XFS_SCRUB_TYPE_AGF || type == XFS_SCRUB_TYPE_AGFL ||
+	       type == XFS_SCRUB_TYPE_AGI);
 	memset(&sc->sa, 0, sizeof(sc->sa));
 	sc->sa.agno = agno;
 
 	error = xfs_scrub_load_ag_header(sc, XFS_AGI_DADDR(mp),
-			&sc->sa.agi_bp, &xfs_agi_buf_ops, false);
+			&sc->sa.agi_bp, &xfs_agi_buf_ops,
+			type == XFS_SCRUB_TYPE_AGI);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6f3c4f0..a62f53b 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -165,6 +165,10 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_ag_header,
 		.scrub	= xfs_scrub_agfl,
 	},
+	{ /* agi */
+		.setup	= xfs_scrub_setup_ag_header,
+		.scrub	= xfs_scrub_agi,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 50f8641..09952c2 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -70,5 +70,6 @@ int xfs_scrub_tester(struct xfs_scrub_context *sc);
 int xfs_scrub_superblock(struct xfs_scrub_context *sc);
 int xfs_scrub_agf(struct xfs_scrub_context *sc);
 int xfs_scrub_agfl(struct xfs_scrub_context *sc);
+int xfs_scrub_agi(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 12/25] xfs: scrub free space btrees
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2017-10-03 20:41 ` [PATCH 11/25] xfs: scrub the AGI Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-05  0:59   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 13/25] xfs: scrub inode btrees Darrick J. Wong
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Check the extent records in the free space btrees to ensure that the
values look sane.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    4 +-
 fs/xfs/scrub/alloc.c   |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.c  |   16 +++++++
 fs/xfs/scrub/common.h  |    6 +++
 fs/xfs/scrub/scrub.c   |    8 ++++
 fs/xfs/scrub/scrub.h   |    2 +
 7 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/alloc.c
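
The record check widens bno + len to 64 bits before comparing because a
corrupt record could otherwise wrap a 32-bit sum back into range.  A
minimal userspace sketch of that idea, with a hypothetical
demo_extent_ok() and a single ag_blocks limit standing in for both
sb_agblocks and agf_length:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the free extent [bno, bno + len) is non-empty and lies
 * entirely inside an AG of ag_blocks blocks.
 */
static bool
demo_extent_ok(uint32_t bno, uint32_t len, uint32_t ag_blocks)
{
	/* Widen before adding so that the sum cannot wrap around. */
	uint64_t	end = (uint64_t)bno + len;

	if (len == 0)
		return false;
	if (bno >= ag_blocks)
		return false;
	return end <= ag_blocks;
}
```

With 32-bit arithmetic, a start block of 0xFFFFFFF0 and a length of 0x20
would sum to 0x10 and could slip past a check that looked only at the
end of the extent.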


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e92d04d..84ac733 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -147,6 +147,7 @@ ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 xfs-y				+= $(addprefix scrub/, \
 				   trace.o \
 				   agheader.o \
+				   alloc.o \
 				   btree.o \
 				   common.o \
 				   scrub.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 1e326dd..1e23d13 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -488,9 +488,11 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
 #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
 #define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
+#define XFS_SCRUB_TYPE_BNOBT	5	/* freesp by block btree */
+#define XFS_SCRUB_TYPE_CNTBT	6	/* freesp by length btree */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	5
+#define XFS_SCRUB_TYPE_NR	7
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
new file mode 100644
index 0000000..c1e97cd
--- /dev/null
+++ b/fs/xfs/scrub/alloc.c
@@ -0,0 +1,108 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_rmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+
+/*
+ * Set us up to scrub free space btrees.
+ * Push everything out of the log so that the busy extent list is empty.
+ */
+int
+xfs_scrub_setup_ag_allocbt(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return xfs_scrub_setup_ag_btree(sc, ip, sc->try_harder);
+}
+
+/* Free space btree scrubber. */
+
+/* Scrub a bnobt/cntbt record. */
+STATIC int
+xfs_scrub_allocbt_helper(
+	struct xfs_scrub_btree		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_agf			*agf;
+	unsigned long long		rec_end;
+	xfs_agblock_t			bno;
+	xfs_extlen_t			len;
+	int				error = 0;
+
+	bno = be32_to_cpu(rec->alloc.ar_startblock);
+	len = be32_to_cpu(rec->alloc.ar_blockcount);
+	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
+	rec_end = (unsigned long long)bno + len;
+
+	if (bno >= mp->m_sb.sb_agblocks ||
+	    bno >= be32_to_cpu(agf->agf_length) ||
+	    len == 0 ||
+	    rec_end > mp->m_sb.sb_agblocks ||
+	    rec_end > be32_to_cpu(agf->agf_length))
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	return error;
+}
+
+/* Scrub the freespace btrees for some AG. */
+STATIC int
+xfs_scrub_allocbt(
+	struct xfs_scrub_context	*sc,
+	xfs_btnum_t			which)
+{
+	struct xfs_owner_info		oinfo;
+	struct xfs_btree_cur		*cur;
+
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	cur = which == XFS_BTNUM_BNO ? sc->sa.bno_cur : sc->sa.cnt_cur;
+	return xfs_scrub_btree(sc, cur, xfs_scrub_allocbt_helper,
+			&oinfo, NULL);
+}
+
+int
+xfs_scrub_bnobt(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_scrub_allocbt(sc, XFS_BTNUM_BNO);
+}
+
+int
+xfs_scrub_cntbt(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_scrub_allocbt(sc, XFS_BTNUM_CNT);
+}
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 4f42401..4f8d103 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -543,3 +543,19 @@ xfs_scrub_setup_fs(
 {
 	return xfs_scrub_trans_alloc(sc->sm, sc->mp, &sc->tp);
 }
+
+/* Set us up with AG headers and btree cursors. */
+int
+xfs_scrub_setup_ag_btree(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip,
+	bool				force_log)
+{
+	int				error;
+
+	error = xfs_scrub_setup_ag_header(sc, ip);
+	if (error)
+		return error;
+
+	return xfs_scrub_ag_init(sc, sc->sm->sm_agno, &sc->sa);
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 4d8bb72..9a37e05 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -78,6 +78,9 @@ void xfs_scrub_set_incomplete(struct xfs_scrub_context *sc);
 int xfs_scrub_setup_fs(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 int xfs_scrub_setup_ag_header(struct xfs_scrub_context *sc,
 			      struct xfs_inode *ip);
+int xfs_scrub_setup_ag_allocbt(struct xfs_scrub_context *sc,
+			       struct xfs_inode *ip);
+
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
@@ -95,4 +98,7 @@ int xfs_scrub_walk_agfl(struct xfs_scrub_context *sc,
 				  void *),
 			void *priv);
 
+int xfs_scrub_setup_ag_btree(struct xfs_scrub_context *sc,
+			     struct xfs_inode *ip, bool force_log);
+
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index a62f53b..f543ce9 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -169,6 +169,14 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_ag_header,
 		.scrub	= xfs_scrub_agi,
 	},
+	{ /* bnobt */
+		.setup	= xfs_scrub_setup_ag_allocbt,
+		.scrub	= xfs_scrub_bnobt,
+	},
+	{ /* cntbt */
+		.setup	= xfs_scrub_setup_ag_allocbt,
+		.scrub	= xfs_scrub_cntbt,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 09952c2..a4af99c 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -71,5 +71,7 @@ int xfs_scrub_superblock(struct xfs_scrub_context *sc);
 int xfs_scrub_agf(struct xfs_scrub_context *sc);
 int xfs_scrub_agfl(struct xfs_scrub_context *sc);
 int xfs_scrub_agi(struct xfs_scrub_context *sc);
+int xfs_scrub_bnobt(struct xfs_scrub_context *sc);
+int xfs_scrub_cntbt(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
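
An aside on the record check in xfs_scrub_allocbt_helper above: the
start block and the end of the extent are each tested against both the
superblock's AG size and the AGF's recorded length, and the end is
computed in a wider type so that bno + len cannot wrap.  A minimal
userspace sketch of that idea follows.  The name extent_in_ag() and the
parameters agblocks and aglength are hypothetical stand-ins for
sb_agblocks and agf_length; this is not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the extent [bno, bno + len) is non-empty and lies
 * entirely below both AG size limits.
 */
static bool
extent_in_ag(
	uint32_t	bno,
	uint32_t	len,
	uint32_t	agblocks,	/* stand-in for sb_agblocks */
	uint32_t	aglength)	/* stand-in for agf_length */
{
	/* Widen before adding so the sum cannot wrap around 32 bits. */
	uint64_t	rec_end = (uint64_t)bno + len;

	if (len == 0)
		return false;
	if (bno >= agblocks || bno >= aglength)
		return false;
	if (rec_end > agblocks || rec_end > aglength)
		return false;
	return true;
}
```

With a 32-bit rec_end, a record such as bno = 4294967290, len = 10
would wrap around to 4 and slip past both upper-bound checks.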



* [PATCH 13/25] xfs: scrub inode btrees
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 12/25] xfs: scrub free space btrees Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-05  2:08   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 14/25] xfs: scrub rmap btrees Darrick J. Wong
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Check the records of the inode btrees to make sure that the values
make sense given the inode records themselves.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/libxfs/xfs_format.h |    2 
 fs/xfs/libxfs/xfs_fs.h     |    4 -
 fs/xfs/scrub/common.h      |    2 
 fs/xfs/scrub/ialloc.c      |  341 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c       |    9 +
 fs/xfs/scrub/scrub.h       |    2 
 7 files changed, 359 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/ialloc.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 84ac733..82326b7 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -150,6 +150,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc.o \
 				   btree.o \
 				   common.o \
+				   ialloc.o \
 				   scrub.o \
 				   )
 endif
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 23229f0..154c3dd 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -518,7 +518,7 @@ static inline int xfs_sb_version_hasftype(struct xfs_sb *sbp)
 		 (sbp->sb_features2 & XFS_SB_VERSION2_FTYPE));
 }
 
-static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
+static inline bool xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 {
 	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 1e23d13..74df6ec 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -490,9 +490,11 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
 #define XFS_SCRUB_TYPE_BNOBT	5	/* freesp by block btree */
 #define XFS_SCRUB_TYPE_CNTBT	6	/* freesp by length btree */
+#define XFS_SCRUB_TYPE_INOBT	7	/* inode btree */
+#define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	7
+#define XFS_SCRUB_TYPE_NR	9
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 9a37e05..60b159a 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -80,6 +80,8 @@ int xfs_scrub_setup_ag_header(struct xfs_scrub_context *sc,
 			      struct xfs_inode *ip);
 int xfs_scrub_setup_ag_allocbt(struct xfs_scrub_context *sc,
 			       struct xfs_inode *ip);
+int xfs_scrub_setup_ag_iallocbt(struct xfs_scrub_context *sc,
+				struct xfs_inode *ip);
 
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
new file mode 100644
index 0000000..db8404c
--- /dev/null
+++ b/fs/xfs/scrub/ialloc.c
@@ -0,0 +1,341 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_log.h"
+#include "xfs_trans_priv.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+
+/*
+ * Set us up to scrub inode btrees.
+ * If we detect a discrepancy between the inobt and the inode,
+ * try again after forcing logged inode cores out to disk.
+ */
+int
+xfs_scrub_setup_ag_iallocbt(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return xfs_scrub_setup_ag_btree(sc, ip, sc->try_harder);
+}
+
+/* Inode btree scrubber. */
+
+/* Is this chunk worth checking? */
+STATIC bool
+xfs_scrub_iallocbt_chunk(
+	struct xfs_scrub_btree		*bs,
+	struct xfs_inobt_rec_incore	*irec,
+	xfs_agino_t			agino,
+	xfs_extlen_t			len)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_agf			*agf;
+	unsigned long long		rec_end;
+	xfs_agblock_t			eoag;
+	xfs_agblock_t			bno;
+
+	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
+	eoag = be32_to_cpu(agf->agf_length);
+	bno = XFS_AGINO_TO_AGBNO(mp, agino);
+	rec_end = (unsigned long long)bno + len;
+
+	if (bno >= mp->m_sb.sb_agblocks || bno >= eoag ||
+	    rec_end > mp->m_sb.sb_agblocks || rec_end > eoag) {
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+		return false;
+	}
+
+	return true;
+}
+
+/* Count the number of free inodes. */
+static unsigned int
+xfs_scrub_iallocbt_freecount(
+	xfs_inofree_t			freemask)
+{
+	int				bits = XFS_INODES_PER_CHUNK;
+	unsigned int			ret = 0;
+
+	while (bits--) {
+		if (freemask & 1)
+			ret++;
+		freemask >>= 1;
+	}
+
+	return ret;
+}
+
+/* Check one inode's allocation state against its ir_free bit. */
+STATIC int
+xfs_scrub_iallocbt_check_cluster_freemask(
+	struct xfs_scrub_btree		*bs,
+	xfs_ino_t			fsino,
+	xfs_agino_t			chunkino,
+	xfs_agino_t			clusterino,
+	struct xfs_inobt_rec_incore	*irec,
+	struct xfs_buf			*bp)
+{
+	struct xfs_dinode		*dip;
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	bool				freemask_ok;
+	bool				inuse;
+	int				error;
+
+	dip = xfs_buf_offset(bp, clusterino * mp->m_sb.sb_inodesize);
+	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC ||
+	    (dip->di_version >= 3 &&
+	     be64_to_cpu(dip->di_ino) != fsino + clusterino)) {
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+		goto out;
+	}
+
+	freemask_ok = !!(irec->ir_free & XFS_INOBT_MASK(chunkino + clusterino));
+	error = xfs_icache_inode_is_allocated(mp, bs->cur->bc_tp,
+			fsino + clusterino, &inuse);
+	if (error == -ENODATA) {
+		/* Not cached, just read the disk buffer */
+		freemask_ok ^= !!(dip->di_mode);
+		if (!bs->sc->try_harder && !freemask_ok)
+			return -EDEADLOCK;
+	} else if (error < 0) {
+		/* Inode is only half assembled, don't bother. */
+		freemask_ok = true;
+	} else {
+		/* Inode is all there. */
+		freemask_ok ^= inuse;
+	}
+	if (!freemask_ok)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+out:
+	return 0;
+}
+
+/* Make sure the free mask is consistent with what the inodes think. */
+STATIC int
+xfs_scrub_iallocbt_check_freemask(
+	struct xfs_scrub_btree		*bs,
+	struct xfs_inobt_rec_incore	*irec)
+{
+	struct xfs_owner_info		oinfo;
+	struct xfs_imap			imap;
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_dinode		*dip;
+	struct xfs_buf			*bp;
+	xfs_ino_t			fsino;
+	xfs_agino_t			nr_inodes;
+	xfs_agino_t			agino;
+	xfs_agino_t			chunkino;
+	xfs_agino_t			clusterino;
+	xfs_agblock_t			agbno;
+	int				blks_per_cluster;
+	uint16_t			holemask;
+	uint16_t			ir_holemask;
+	int				error = 0;
+
+	/* Make sure the freemask matches the inode records. */
+	blks_per_cluster = xfs_icluster_size_fsb(mp);
+	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INODES);
+
+	for (agino = irec->ir_startino;
+	     agino < irec->ir_startino + XFS_INODES_PER_CHUNK;
+	     agino += blks_per_cluster * mp->m_sb.sb_inopblock) {
+		fsino = XFS_AGINO_TO_INO(mp, bs->cur->bc_private.a.agno, agino);
+		chunkino = agino - irec->ir_startino;
+		agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+
+		/* Compute the holemask bits that cover this cluster. */
+		for (clusterino = 0, holemask = 0; clusterino < nr_inodes;
+		     clusterino += XFS_INODES_PER_HOLEMASK_BIT)
+			holemask |= XFS_INOBT_MASK((chunkino + clusterino) /
+					XFS_INODES_PER_HOLEMASK_BIT);
+
+		/* The whole cluster must be a hole or not a hole. */
+		ir_holemask = (irec->ir_holemask & holemask);
+		if (ir_holemask != holemask && ir_holemask != 0)
+			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+		/* If any part of this is a hole, skip it. */
+		if (ir_holemask)
+			continue;
+
+		/* Grab the inode cluster buffer. */
+		imap.im_blkno = XFS_AGB_TO_DADDR(mp, bs->cur->bc_private.a.agno,
+				agbno);
+		imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
+		imap.im_boffset = 0;
+
+		error = xfs_imap_to_bp(mp, bs->cur->bc_tp, &imap,
+				&dip, &bp, 0, 0);
+		if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, 0, &error))
+			continue;
+
+		/* Which inodes are free? */
+		for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
+			error = xfs_scrub_iallocbt_check_cluster_freemask(bs,
+					fsino, chunkino, clusterino, irec, bp);
+			if (error) {
+				xfs_trans_brelse(bs->cur->bc_tp, bp);
+				return error;
+			}
+		}
+
+		xfs_trans_brelse(bs->cur->bc_tp, bp);
+	}
+
+	return error;
+}
+
+/* Scrub an inobt/finobt record. */
+STATIC int
+xfs_scrub_iallocbt_helper(
+	struct xfs_scrub_btree		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_agi			*agi;
+	struct xfs_inobt_rec_incore	irec;
+	uint64_t			holes;
+	xfs_agino_t			agino;
+	xfs_agblock_t			agbno;
+	xfs_extlen_t			len;
+	int				holecount;
+	int				i;
+	int				error = 0;
+	unsigned int			real_freecount;
+	uint16_t			holemask;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+
+	if (irec.ir_count > XFS_INODES_PER_CHUNK ||
+	    irec.ir_freecount > XFS_INODES_PER_CHUNK)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	real_freecount = irec.ir_freecount +
+			(XFS_INODES_PER_CHUNK - irec.ir_count);
+	if (real_freecount != xfs_scrub_iallocbt_freecount(irec.ir_free))
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	agi = XFS_BUF_TO_AGI(bs->sc->sa.agi_bp);
+	agino = irec.ir_startino;
+	agbno = XFS_AGINO_TO_AGBNO(mp, irec.ir_startino);
+	if (agbno >= be32_to_cpu(agi->agi_length)) {
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+		goto out;
+	}
+
+	if ((agbno & (xfs_ialloc_cluster_alignment(mp) - 1)) ||
+	    (agbno & (xfs_icluster_size_fsb(mp) - 1)))
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	/* Handle non-sparse inodes */
+	if (!xfs_inobt_issparse(irec.ir_holemask)) {
+		len = XFS_B_TO_FSB(mp,
+				XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize);
+		if (irec.ir_count != XFS_INODES_PER_CHUNK)
+			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+		if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
+			goto out;
+		goto check_freemask;
+	}
+
+	/* Check each chunk of a sparse inode cluster. */
+	holemask = irec.ir_holemask;
+	holecount = 0;
+	len = XFS_B_TO_FSB(mp,
+			XFS_INODES_PER_HOLEMASK_BIT * mp->m_sb.sb_inodesize);
+	holes = ~xfs_inobt_irec_to_allocmask(&irec);
+	if ((holes & irec.ir_free) != holes ||
+	    irec.ir_freecount > irec.ir_count)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1,
+			i++, agino += XFS_INODES_PER_HOLEMASK_BIT) {
+		if (holemask & 1) {
+			holecount += XFS_INODES_PER_HOLEMASK_BIT;
+			continue;
+		}
+
+		if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
+			break;
+	}
+
+	if (holecount > XFS_INODES_PER_CHUNK ||
+	    holecount + irec.ir_count != XFS_INODES_PER_CHUNK)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+check_freemask:
+	error = xfs_scrub_iallocbt_check_freemask(bs, &irec);
+	if (error)
+		goto out;
+
+out:
+	return error;
+}
+
+/* Scrub the inode btrees for some AG. */
+STATIC int
+xfs_scrub_iallocbt(
+	struct xfs_scrub_context	*sc,
+	xfs_btnum_t			which)
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_owner_info		oinfo;
+
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
+	cur = which == XFS_BTNUM_INO ? sc->sa.ino_cur : sc->sa.fino_cur;
+	return xfs_scrub_btree(sc, cur, xfs_scrub_iallocbt_helper,
+			&oinfo, NULL);
+}
+
+int
+xfs_scrub_inobt(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_scrub_iallocbt(sc, XFS_BTNUM_INO);
+}
+
+int
+xfs_scrub_finobt(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_scrub_iallocbt(sc, XFS_BTNUM_FINO);
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index f543ce9..f209348 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -177,6 +177,15 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_ag_allocbt,
 		.scrub	= xfs_scrub_cntbt,
 	},
+	{ /* inobt */
+		.setup	= xfs_scrub_setup_ag_iallocbt,
+		.scrub	= xfs_scrub_inobt,
+	},
+	{ /* finobt */
+		.setup	= xfs_scrub_setup_ag_iallocbt,
+		.scrub	= xfs_scrub_finobt,
+		.has	= xfs_sb_version_hasfinobt,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index a4af99c..5d97453 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -73,5 +73,7 @@ int xfs_scrub_agfl(struct xfs_scrub_context *sc);
 int xfs_scrub_agi(struct xfs_scrub_context *sc);
 int xfs_scrub_bnobt(struct xfs_scrub_context *sc);
 int xfs_scrub_cntbt(struct xfs_scrub_context *sc);
+int xfs_scrub_inobt(struct xfs_scrub_context *sc);
+int xfs_scrub_finobt(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
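
An aside on the counter checks in xfs_scrub_iallocbt_helper above: the
record's ir_freecount only counts free inodes that physically exist,
while ir_free also has a bit set for every inode that falls in a sparse
hole.  That is why the helper adds (XFS_INODES_PER_CHUNK - ir_count)
before comparing against the number of set bits, and why the hole count
plus ir_count must equal a full chunk.  A rough userspace sketch of
both comparisons, assuming 64 inodes per chunk and a 16-bit holemask;
the function names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define INODES_PER_CHUNK	64	/* mirrors XFS_INODES_PER_CHUNK */
#define HOLEMASK_BITS		16	/* mirrors XFS_INOBT_HOLEMASK_BITS */
#define INODES_PER_HOLEMASK_BIT	(INODES_PER_CHUNK / HOLEMASK_BITS)

/* Count the set bits in the free mask, one bit per inode. */
static unsigned int
freemask_weight(
	uint64_t	freemask)
{
	unsigned int	ret = 0;

	while (freemask) {
		ret += freemask & 1;
		freemask >>= 1;
	}
	return ret;
}

/* Do ir_count and ir_freecount agree with the bits set in ir_free? */
static bool
freecount_consistent(
	uint64_t	ir_free,
	unsigned int	ir_count,
	unsigned int	ir_freecount)
{
	if (ir_count > INODES_PER_CHUNK || ir_freecount > INODES_PER_CHUNK)
		return false;

	/* Inodes inside sparse holes are marked free in ir_free too. */
	return ir_freecount + (INODES_PER_CHUNK - ir_count) ==
			freemask_weight(ir_free);
}

/* Do the holes plus the allocated inodes add up to a whole chunk? */
static bool
holemask_consistent(
	uint16_t	holemask,
	unsigned int	ir_count)
{
	unsigned int	holecount = 0;
	int		i;

	for (i = 0; i < HOLEMASK_BITS; i++, holemask >>= 1) {
		if (holemask & 1)
			holecount += INODES_PER_HOLEMASK_BIT;
	}
	return holecount + ir_count == INODES_PER_CHUNK;
}
```

For example, a sparse record whose upper half is a hole (holemask
0xFF00, ir_count 32) with two free inodes must have 34 bits set in
ir_free: 32 for the hole and 2 for the inodes that really are free.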



* [PATCH 14/25] xfs: scrub rmap btrees
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 13/25] xfs: scrub inode btrees Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-05  2:56   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 15/25] xfs: scrub refcount btrees Darrick J. Wong
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Check the reverse mapping records to make sure that the contents
make sense.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    3 +
 fs/xfs/scrub/common.h  |    2 +
 fs/xfs/scrub/rmap.c    |  137 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c   |    5 ++
 fs/xfs/scrub/scrub.h   |    1 
 6 files changed, 148 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/rmap.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 82326b7..5a64f8d 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -151,6 +151,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   btree.o \
 				   common.o \
 				   ialloc.o \
+				   rmap.o \
 				   scrub.o \
 				   )
 endif
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 74df6ec..fb1d997 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -492,9 +492,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_CNTBT	6	/* freesp by length btree */
 #define XFS_SCRUB_TYPE_INOBT	7	/* inode btree */
 #define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
+#define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	9
+#define XFS_SCRUB_TYPE_NR	10
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 60b159a..dd12b37 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -82,6 +82,8 @@ int xfs_scrub_setup_ag_allocbt(struct xfs_scrub_context *sc,
 			       struct xfs_inode *ip);
 int xfs_scrub_setup_ag_iallocbt(struct xfs_scrub_context *sc,
 				struct xfs_inode *ip);
+int xfs_scrub_setup_ag_rmapbt(struct xfs_scrub_context *sc,
+			      struct xfs_inode *ip);
 
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
new file mode 100644
index 0000000..d4ec751
--- /dev/null
+++ b/fs/xfs/scrub/rmap.c
@@ -0,0 +1,137 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_rmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+
+/*
+ * Set us up to scrub reverse mapping btrees.
+ */
+int
+xfs_scrub_setup_ag_rmapbt(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return xfs_scrub_setup_ag_btree(sc, ip, false);
+}
+
+/* Reverse-mapping scrubber. */
+
+/* Scrub an rmapbt record. */
+STATIC int
+xfs_scrub_rmapbt_helper(
+	struct xfs_scrub_btree		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_agf			*agf;
+	struct xfs_rmap_irec		irec;
+	unsigned long long		rec_end;
+	xfs_agblock_t			eoag;
+	bool				non_inode;
+	bool				is_unwritten;
+	bool				is_bmbt;
+	bool				is_attr;
+	int				error;
+
+	error = xfs_rmap_btrec_to_irec(rec, &irec);
+	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, 0, &error))
+		goto out;
+
+	/* Check extent. */
+	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
+	eoag = be32_to_cpu(agf->agf_length);
+	rec_end = (unsigned long long)irec.rm_startblock + irec.rm_blockcount;
+
+	if (irec.rm_startblock >= mp->m_sb.sb_agblocks ||
+	    irec.rm_startblock >= eoag ||
+	    irec.rm_blockcount == 0 ||
+	    rec_end > mp->m_sb.sb_agblocks ||
+	    rec_end > eoag)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	/* Check flags. */
+	non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner);
+	is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK;
+	is_attr = irec.rm_flags & XFS_RMAP_ATTR_FORK;
+	is_unwritten = irec.rm_flags & XFS_RMAP_UNWRITTEN;
+
+	if (is_bmbt && irec.rm_offset != 0)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	if (non_inode && irec.rm_offset != 0)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	if (is_unwritten && (is_bmbt || non_inode || is_attr))
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	if (non_inode && (is_bmbt || is_unwritten || is_attr))
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	if (!non_inode) {
+		xfs_agnumber_t	agno = XFS_INO_TO_AGNO(mp, irec.rm_owner);
+		xfs_agino_t	agino = XFS_INO_TO_AGINO(mp, irec.rm_owner);
+		xfs_agblock_t	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+
+		/* Owner inode within an AG? */
+		if (agno >= mp->m_sb.sb_agcount ||
+		    agbno >= mp->m_sb.sb_agblocks)
+			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+		/* Owner inode within the FS? */
+		if (XFS_AGB_TO_DADDR(mp, agno, agbno) >=
+		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks))
+			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+	} else {
+		/* Non-inode owner within the magic values? */
+		if (irec.rm_owner <= XFS_RMAP_OWN_MIN ||
+		    irec.rm_owner > XFS_RMAP_OWN_FS)
+			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+	}
+out:
+	return error;
+}
+
+/* Scrub the rmap btree for some AG. */
+int
+xfs_scrub_rmapbt(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_owner_info		oinfo;
+
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	return xfs_scrub_btree(sc, sc->sa.rmap_cur, xfs_scrub_rmapbt_helper,
+			&oinfo, NULL);
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index f209348..35f7dce 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -186,6 +186,11 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.scrub	= xfs_scrub_finobt,
 		.has	= xfs_sb_version_hasfinobt,
 	},
+	{ /* rmapbt */
+		.setup	= xfs_scrub_setup_ag_rmapbt,
+		.scrub	= xfs_scrub_rmapbt,
+		.has	= xfs_sb_version_hasrmapbt,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 5d97453..0d1e78b 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -75,5 +75,6 @@ int xfs_scrub_bnobt(struct xfs_scrub_context *sc);
 int xfs_scrub_cntbt(struct xfs_scrub_context *sc);
 int xfs_scrub_inobt(struct xfs_scrub_context *sc);
 int xfs_scrub_finobt(struct xfs_scrub_context *sc);
+int xfs_scrub_rmapbt(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
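
An aside on the flag checks in xfs_scrub_rmapbt_helper above: they
amount to a small table of legal (owner kind, offset, flags)
combinations.  The sketch below restates those rules as a standalone
predicate, going only by what the patch tests.  The flag names and
values here are illustrative and are not taken from the kernel
headers; rmap_flags_ok() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits for this sketch only; names and values are illustrative. */
#define RMAP_ATTR_FORK		(1u << 0)
#define RMAP_BMBT_BLOCK		(1u << 1)
#define RMAP_UNWRITTEN		(1u << 2)

/*
 * Return true if offset and flags are a legal combination for this
 * kind of owner.  non_inode is true for the special owner codes that
 * do not name an inode.
 */
static bool
rmap_flags_ok(
	bool		non_inode,
	uint64_t	offset,
	unsigned int	flags)
{
	bool		is_attr = flags & RMAP_ATTR_FORK;
	bool		is_bmbt = flags & RMAP_BMBT_BLOCK;
	bool		is_unwritten = flags & RMAP_UNWRITTEN;

	/* bmbt blocks and non-inode owners must have a zero offset. */
	if ((is_bmbt || non_inode) && offset != 0)
		return false;

	/* Unwritten cannot be combined with bmbt, attr or non-inode. */
	if (is_unwritten && (is_bmbt || non_inode || is_attr))
		return false;

	/* Non-inode owners carry none of these flags. */
	if (non_inode && (is_bmbt || is_unwritten || is_attr))
		return false;

	return true;
}
```

Under these rules an attr fork bmbt block at offset 0 is accepted,
while an unwritten attr fork extent or a non-inode owner with a
nonzero offset is flagged.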



* [PATCH 15/25] xfs: scrub refcount btrees
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (13 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 14/25] xfs: scrub rmap btrees Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-05  2:59   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 16/25] xfs: scrub inodes Darrick J. Wong
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Plumb in the pieces necessary to check the refcount btree.  If rmap is
available, check the reference count by performing an interval query
against the rmapbt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile         |    1 
 fs/xfs/libxfs/xfs_fs.h  |    3 +
 fs/xfs/scrub/common.h   |    2 +
 fs/xfs/scrub/refcount.c |  102 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c    |    5 ++
 fs/xfs/scrub/scrub.h    |    1 
 6 files changed, 113 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/refcount.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5a64f8d..a7c5752 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -151,6 +151,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   btree.o \
 				   common.o \
 				   ialloc.o \
+				   refcount.o \
 				   rmap.o \
 				   scrub.o \
 				   )
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index fb1d997..b3f992c 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -493,9 +493,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_INOBT	7	/* inode btree */
 #define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
 #define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
+#define XFS_SCRUB_TYPE_REFCNTBT	10	/* reference count btree */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	10
+#define XFS_SCRUB_TYPE_NR	11
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index dd12b37..6325f03 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -84,6 +84,8 @@ int xfs_scrub_setup_ag_iallocbt(struct xfs_scrub_context *sc,
 				struct xfs_inode *ip);
 int xfs_scrub_setup_ag_rmapbt(struct xfs_scrub_context *sc,
 			      struct xfs_inode *ip);
+int xfs_scrub_setup_ag_refcountbt(struct xfs_scrub_context *sc,
+				  struct xfs_inode *ip);
 
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
new file mode 100644
index 0000000..c91f894
--- /dev/null
+++ b/fs/xfs/scrub/refcount.c
@@ -0,0 +1,102 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_rmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+
+/*
+ * Set us up to scrub reference count btrees.
+ */
+int
+xfs_scrub_setup_ag_refcountbt(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return xfs_scrub_setup_ag_btree(sc, ip, false);
+}
+
+/* Reference count btree scrubber. */
+
+/* Scrub a refcountbt record. */
+STATIC int
+xfs_scrub_refcountbt_helper(
+	struct xfs_scrub_btree		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_agf			*agf;
+	struct xfs_refcount_irec	irec;
+	unsigned long long		rec_end;
+	xfs_agblock_t			eoag;
+	bool				has_cowflag;
+	int				error = 0;
+
+	irec.rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
+	irec.rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
+	irec.rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
+	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
+	eoag = be32_to_cpu(agf->agf_length);
+
+	/* Only CoW records can have refcount == 1. */
+	has_cowflag = !!(irec.rc_startblock & XFS_REFC_COW_START);
+	if ((irec.rc_refcount == 1 && !has_cowflag) ||
+	    (irec.rc_refcount != 1 && has_cowflag))
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	/* Check the extent. */
+	irec.rc_startblock &= ~XFS_REFC_COW_START;
+	rec_end = (unsigned long long)irec.rc_startblock + irec.rc_blockcount;
+	if (irec.rc_startblock >= mp->m_sb.sb_agblocks ||
+	    irec.rc_startblock >= eoag ||
+	    irec.rc_blockcount == 0 ||
+	    rec_end > mp->m_sb.sb_agblocks ||
+	    rec_end > eoag ||
+	    irec.rc_refcount < 1)
+		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	return error;
+}
+
+/* Scrub the refcount btree for some AG. */
+int
+xfs_scrub_refcountbt(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_owner_info		oinfo;
+
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
+	return xfs_scrub_btree(sc, sc->sa.refc_cur, xfs_scrub_refcountbt_helper,
+			&oinfo, NULL);
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 35f7dce..7b44364 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -191,6 +191,11 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.scrub	= xfs_scrub_rmapbt,
 		.has	= xfs_sb_version_hasrmapbt,
 	},
+	{ /* refcountbt */
+		.setup	= xfs_scrub_setup_ag_refcountbt,
+		.scrub	= xfs_scrub_refcountbt,
+		.has	= xfs_sb_version_hasreflink,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 0d1e78b..1c80bf5 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -76,5 +76,6 @@ int xfs_scrub_cntbt(struct xfs_scrub_context *sc);
 int xfs_scrub_inobt(struct xfs_scrub_context *sc);
 int xfs_scrub_finobt(struct xfs_scrub_context *sc);
 int xfs_scrub_rmapbt(struct xfs_scrub_context *sc);
+int xfs_scrub_refcountbt(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
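
An aside on the record check in xfs_scrub_refcountbt_helper above: the
high bit of rc_startblock marks a CoW staging extent, so it has to be
tested first and then masked off before the block number is compared
against the AG bounds.  A minimal userspace sketch, assuming the flag
is bit 31 as XFS_REFC_COW_START is; refcount_rec_ok() and the single
agblocks limit are simplifications for illustration and not the
patch's code:

```c
#include <stdbool.h>
#include <stdint.h>

#define REFC_COW_START	(1u << 31)	/* mirrors XFS_REFC_COW_START */

/* Return true if a refcount record looks sane for an AG of agblocks. */
static bool
refcount_rec_ok(
	uint32_t	startblock,
	uint32_t	blockcount,
	uint32_t	refcount,
	uint32_t	agblocks)
{
	bool		has_cowflag = startblock & REFC_COW_START;
	uint64_t	rec_end;

	/* The CoW flag must be set if and only if refcount == 1. */
	if ((refcount == 1) != has_cowflag)
		return false;

	/* Strip the flag, then bounds-check in 64 bits to avoid wrap. */
	startblock &= ~REFC_COW_START;
	rec_end = (uint64_t)startblock + blockcount;

	return blockcount != 0 && refcount >= 1 &&
	       startblock < agblocks && rec_end <= agblocks;
}
```

Skipping the masking step would make every CoW record look as if it
started beyond the end of the AG.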



* [PATCH 16/25] xfs: scrub inodes
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (14 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 15/25] xfs: scrub refcount btrees Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-05  4:04   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 17/25] xfs: scrub inode block mappings Darrick J. Wong
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Scrub the fields within an inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    3 
 fs/xfs/scrub/common.c  |   65 +++++++++
 fs/xfs/scrub/common.h  |    5 +
 fs/xfs/scrub/inode.c   |  353 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c   |   17 ++
 fs/xfs/scrub/scrub.h   |    2 
 7 files changed, 443 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/scrub/inode.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a7c5752..28e14b7 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -151,6 +151,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   btree.o \
 				   common.o \
 				   ialloc.o \
+				   inode.o \
 				   refcount.o \
 				   rmap.o \
 				   scrub.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b3f992c..f8463e0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -494,9 +494,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_FINOBT	8	/* free inode btree */
 #define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
 #define XFS_SCRUB_TYPE_REFCNTBT	10	/* reference count btree */
+#define XFS_SCRUB_TYPE_INODE	11	/* inode record */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	11
+#define XFS_SCRUB_TYPE_NR	12
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 4f8d103..415dec6 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -30,6 +30,8 @@
 #include "xfs_trans.h"
 #include "xfs_sb.h"
 #include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_itable.h"
 #include "xfs_alloc.h"
 #include "xfs_alloc_btree.h"
 #include "xfs_bmap.h"
@@ -40,6 +42,8 @@
 #include "xfs_refcount_btree.h"
 #include "xfs_rmap.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_log.h"
+#include "xfs_trans_priv.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -559,3 +563,64 @@ xfs_scrub_setup_ag_btree(
 
 	return xfs_scrub_ag_init(sc, sc->sm->sm_agno, &sc->sa);
 }
+
+/*
+ * Given an inode and the scrub control structure, grab either the
+ * inode referenced in the control structure or the inode passed in.
+ * The inode is not locked.
+ */
+int
+xfs_scrub_get_inode(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip_in)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_inode		*ips = NULL;
+	int				error;
+
+	if (sc->sm->sm_agno || (sc->sm->sm_gen && !sc->sm->sm_ino))
+		return -EINVAL;
+
+	/* We want to scan the inode we already had opened. */
+	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) {
+		sc->ip = ip_in;
+		return 0;
+	}
+
+	/* Look up the inode, see if the generation number matches. */
+	if (xfs_internal_inum(mp, sc->sm->sm_ino))
+		return -ENOENT;
+	error = xfs_iget(mp, NULL, sc->sm->sm_ino, XFS_IGET_UNTRUSTED,
+			0, &ips);
+	if (error == -ENOENT || error == -EINVAL) {
+		/* inode doesn't exist... */
+		return -ENOENT;
+	} else if (error) {
+		trace_xfs_scrub_op_error(sc,
+				XFS_INO_TO_AGNO(mp, sc->sm->sm_ino),
+				XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
+				error, __return_address);
+		return error;
+	}
+	if (VFS_I(ips)->i_generation != sc->sm->sm_gen) {
+		iput(VFS_I(ips));
+		return -ENOENT;
+	}
+
+	sc->ip = ips;
+	return 0;
+}
+
+/* Push everything out of the log onto disk. */
+int
+xfs_scrub_checkpoint_log(
+	struct xfs_mount	*mp)
+{
+	int			error;
+
+	error = _xfs_log_force(mp, XFS_LOG_SYNC, NULL);
+	if (error)
+		return error;
+	xfs_ail_push_all_sync(mp->m_ail);
+	return 0;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 6325f03..c8143e8 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -74,6 +74,8 @@ void xfs_scrub_fblock_set_warning(struct xfs_scrub_context *sc, int whichfork,
 
 void xfs_scrub_set_incomplete(struct xfs_scrub_context *sc);
 
+int xfs_scrub_checkpoint_log(struct xfs_mount *mp);
+
 /* Setup functions */
 int xfs_scrub_setup_fs(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 int xfs_scrub_setup_ag_header(struct xfs_scrub_context *sc,
@@ -86,6 +88,8 @@ int xfs_scrub_setup_ag_rmapbt(struct xfs_scrub_context *sc,
 			      struct xfs_inode *ip);
 int xfs_scrub_setup_ag_refcountbt(struct xfs_scrub_context *sc,
 				  struct xfs_inode *ip);
+int xfs_scrub_setup_inode(struct xfs_scrub_context *sc,
+			  struct xfs_inode *ip);
 
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
@@ -106,5 +110,6 @@ int xfs_scrub_walk_agfl(struct xfs_scrub_context *sc,
 
 int xfs_scrub_setup_ag_btree(struct xfs_scrub_context *sc,
 			     struct xfs_inode *ip, bool force_log);
+int xfs_scrub_get_inode(struct xfs_scrub_context *sc, struct xfs_inode *ip_in);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
new file mode 100644
index 0000000..88dd3fc
--- /dev/null
+++ b/fs/xfs/scrub/inode.c
@@ -0,0 +1,353 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_inode_buf.h"
+#include "xfs_inode_fork.h"
+#include "xfs_ialloc.h"
+#include "xfs_reflink.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/* Set us up with an inode. */
+int
+xfs_scrub_setup_inode(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	/*
+	 * Try to get the inode.  If the verifiers fail, we try again
+	 * in raw mode.
+	 */
+	error = xfs_scrub_get_inode(sc, ip);
+	switch (error) {
+	case 0:
+		break;
+	case -EFSCORRUPTED:
+	case -EFSBADCRC:
+		return xfs_scrub_checkpoint_log(mp);
+	default:
+		return error;
+	}
+
+	/* Got the inode, lock it and we're ready to go. */
+	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_ilock(sc->ip, sc->ilock_flags);
+	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
+	if (error)
+		goto out_unlock;
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+
+	return error;
+out_unlock:
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	if (sc->ip != ip)
+		iput(VFS_I(sc->ip));
+	sc->ip = NULL;
+	return error;
+}
+
+/* Inode core */
+
+/* Scrub an inode. */
+int
+xfs_scrub_inode(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_imap			imap;
+	struct xfs_dinode		di;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*bp = NULL;
+	struct xfs_dinode		*dip;
+	xfs_ino_t			ino;
+	size_t				fork_recs;
+	unsigned long long		isize;
+	uint64_t			flags2;
+	uint32_t			nextents;
+	uint32_t			extsize;
+	uint32_t			cowextsize;
+	uint16_t			flags;
+	uint16_t			mode;
+	bool				has_shared;
+	int				error = 0;
+
+	/* Did we get the in-core inode, or are we doing this manually? */
+	if (sc->ip) {
+		ino = sc->ip->i_ino;
+		xfs_inode_to_disk(sc->ip, &di, 0);
+		dip = &di;
+	} else {
+		/* Map & read inode. */
+		ino = sc->sm->sm_ino;
+		error = xfs_imap(mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
+		if (error == -EINVAL) {
+			/*
+			 * Inode could have gotten deleted out from under us;
+			 * just forget about it.
+			 */
+			error = -ENOENT;
+			goto out;
+		}
+		if (!xfs_scrub_op_ok(sc, XFS_INO_TO_AGNO(mp, ino),
+				XFS_INO_TO_AGBNO(mp, ino), &error))
+			goto out;
+
+		error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+				imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp,
+				NULL);
+		if (!xfs_scrub_op_ok(sc, XFS_INO_TO_AGNO(mp, ino),
+				XFS_INO_TO_AGBNO(mp, ino), &error))
+			goto out;
+
+		/* Is this really an inode? */
+		bp->b_ops = &xfs_inode_buf_ops;
+		dip = xfs_buf_offset(bp, imap.im_boffset);
+		if (!xfs_dinode_verify(mp, ino, dip) ||
+		    !xfs_dinode_good_version(mp, dip->di_version)) {
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+			goto out;
+		}
+
+		/* ...and is it the one we asked for? */
+		if (be32_to_cpu(dip->di_gen) != sc->sm->sm_gen) {
+			error = -ENOENT;
+			goto out;
+		}
+	}
+
+	flags = be16_to_cpu(dip->di_flags);
+	if (dip->di_version >= 3)
+		flags2 = be64_to_cpu(dip->di_flags2);
+	else
+		flags2 = 0;
+
+	/* di_mode */
+	mode = be16_to_cpu(dip->di_mode);
+	if (mode & ~(S_IALLUGO | S_IFMT))
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+	/* v1/v2 fields */
+	switch (dip->di_version) {
+	case 1:
+		if (dip->di_nlink != 0)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+		if (dip->di_mode == 0 && sc->ip)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+		if (dip->di_projid_lo != 0 || dip->di_projid_hi != 0)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case 2:
+	case 3:
+		if (dip->di_onlink != 0)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+		if (dip->di_mode == 0 && sc->ip)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+		if (dip->di_projid_hi != 0 &&
+		    !xfs_sb_version_hasprojid32bit(&mp->m_sb))
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+
+	/*
+	 * di_uid/di_gid -- -1 isn't invalid, but there's no way that
+	 * userspace could have created that.
+	 */
+	if (dip->di_uid == cpu_to_be32(-1U) ||
+	    dip->di_gid == cpu_to_be32(-1U))
+		xfs_scrub_ino_set_warning(sc, bp);
+
+	/* di_format */
+	switch (dip->di_format) {
+	case XFS_DINODE_FMT_DEV:
+		if (!S_ISCHR(mode) && !S_ISBLK(mode) &&
+		    !S_ISFIFO(mode) && !S_ISSOCK(mode))
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		if (!S_ISDIR(mode) && !S_ISLNK(mode))
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (!S_ISREG(mode) && !S_ISDIR(mode) && !S_ISLNK(mode))
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (!S_ISREG(mode) && !S_ISDIR(mode))
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_UUID:
+	default:
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	}
+
+	/* di_size */
+	isize = be64_to_cpu(dip->di_size);
+	if (isize & (1ULL << 63))
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	if (!S_ISDIR(mode) && !S_ISREG(mode) && !S_ISLNK(mode) && isize != 0)
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+	/* di_nblocks */
+	if (flags2 & XFS_DIFLAG2_REFLINK) {
+		; /* nblocks can exceed dblocks */
+	} else if (flags & XFS_DIFLAG_REALTIME) {
+		if (be64_to_cpu(dip->di_nblocks) >=
+		    mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	} else {
+		if (be64_to_cpu(dip->di_nblocks) >= mp->m_sb.sb_dblocks)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	}
+
+	/* di_extsize */
+	extsize = be32_to_cpu(dip->di_extsize);
+	if (flags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT)) {
+		if (extsize == 0 || extsize > MAXEXTLEN)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+		if (!(flags & XFS_DIFLAG_REALTIME) &&
+		    extsize > mp->m_sb.sb_agblocks / 2)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	} else {
+		if (extsize != 0)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	}
+
+	/* di_flags */
+	if ((flags & XFS_DIFLAG_IMMUTABLE) && (flags & XFS_DIFLAG_APPEND))
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	if ((flags & XFS_DIFLAG_FILESTREAM) && (flags & XFS_DIFLAG_REALTIME))
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+	/* di_nextents */
+	nextents = be32_to_cpu(dip->di_nextents);
+	fork_recs = XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
+	switch (dip->di_format) {
+	case XFS_DINODE_FMT_EXTENTS:
+		if (nextents > fork_recs)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (nextents <= fork_recs)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+	case XFS_DINODE_FMT_DEV:
+	case XFS_DINODE_FMT_UUID:
+	default:
+		if (nextents != 0)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	}
+
+	/* di_anextents */
+	nextents = be16_to_cpu(dip->di_anextents);
+	fork_recs = XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
+	switch (dip->di_aformat) {
+	case XFS_DINODE_FMT_EXTENTS:
+		if (nextents > fork_recs)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (nextents <= fork_recs)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+	case XFS_DINODE_FMT_DEV:
+	case XFS_DINODE_FMT_UUID:
+	default:
+		if (nextents != 0)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		break;
+	}
+
+	/* di_forkoff */
+	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+	/* di_aformat */
+	if (dip->di_aformat != XFS_DINODE_FMT_LOCAL &&
+	    dip->di_aformat != XFS_DINODE_FMT_EXTENTS &&
+	    dip->di_aformat != XFS_DINODE_FMT_BTREE)
+		xfs_scrub_ino_set_corrupt(sc, ino, bp);
+
+	/* di_cowextsize */
+	if (flags2 & XFS_DIFLAG2_COWEXTSIZE) {
+		cowextsize = be32_to_cpu(dip->di_cowextsize);
+
+		if (!xfs_sb_version_hasreflink(&mp->m_sb))
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		if (cowextsize == 0 || cowextsize > MAXEXTLEN)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+		if (cowextsize > mp->m_sb.sb_agblocks / 2)
+			xfs_scrub_ino_set_corrupt(sc, ino, bp);
+	}
+
+	/* Now let's do the things that require a live inode. */
+	if (!sc->ip)
+		goto out;
+
+	/*
+	 * Does this inode have the reflink flag set but no shared extents?
+	 * Set the preening flag if this is the case.
+	 */
+	if (xfs_is_reflink_inode(sc->ip)) {
+		error = xfs_reflink_inode_has_shared_extents(sc->tp, sc->ip,
+				&has_shared);
+		if (!xfs_scrub_op_ok(sc, XFS_INO_TO_AGNO(mp, ino),
+				XFS_INO_TO_AGBNO(mp, ino), &error))
+			goto out;
+		if (!has_shared)
+			xfs_scrub_ino_set_preen(sc, bp);
+	}
+
+out:
+	if (bp)
+		xfs_trans_brelse(sc->tp, bp);
+	return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 7b44364..f014ef0 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -30,6 +30,8 @@
 #include "xfs_trans.h"
 #include "xfs_sb.h"
 #include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_itable.h"
 #include "xfs_alloc.h"
 #include "xfs_alloc_btree.h"
 #include "xfs_bmap.h"
@@ -136,6 +138,7 @@ xfs_scrub_probe(
 STATIC int
 xfs_scrub_teardown(
 	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip_in,
 	int				error)
 {
 	xfs_scrub_ag_free(sc, &sc->sa);
@@ -143,6 +146,12 @@ xfs_scrub_teardown(
 		xfs_trans_cancel(sc->tp);
 		sc->tp = NULL;
 	}
+	if (sc->ip) {
+		xfs_iunlock(sc->ip, sc->ilock_flags);
+		if (sc->ip != ip_in)
+			iput(VFS_I(sc->ip));
+		sc->ip = NULL;
+	}
 	return error;
 }
 
@@ -196,6 +205,10 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.scrub	= xfs_scrub_refcountbt,
 		.has	= xfs_sb_version_hasreflink,
 	},
+	{ /* inode record */
+		.setup	= xfs_scrub_setup_inode,
+		.scrub	= xfs_scrub_inode,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
@@ -281,7 +294,7 @@ xfs_scrub_metadata(
 		 * Tear down everything we hold, then set up again with
 		 * preparation for worst-case scenarios.
 		 */
-		error = xfs_scrub_teardown(&sc, 0);
+		error = xfs_scrub_teardown(&sc, ip, 0);
 		if (error)
 			goto out;
 		try_harder = true;
@@ -294,7 +307,7 @@ xfs_scrub_metadata(
 		xfs_alert_ratelimited(mp, "Corruption detected during scrub.");
 
 out_teardown:
-	error = xfs_scrub_teardown(&sc, error);
+	error = xfs_scrub_teardown(&sc, ip, error);
 out:
 	trace_xfs_scrub_done(ip, sm, error);
 	return error;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 1c80bf5..ec635d4 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -59,6 +59,7 @@ struct xfs_scrub_context {
 	const struct xfs_scrub_meta_ops	*ops;
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip;
+	uint				ilock_flags;
 	bool				try_harder;
 
 	/* State tracking for single-AG operations. */
@@ -77,5 +78,6 @@ int xfs_scrub_inobt(struct xfs_scrub_context *sc);
 int xfs_scrub_finobt(struct xfs_scrub_context *sc);
 int xfs_scrub_rmapbt(struct xfs_scrub_context *sc);
 int xfs_scrub_refcountbt(struct xfs_scrub_context *sc);
+int xfs_scrub_inode(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */



* [PATCH 17/25] xfs: scrub inode block mappings
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (15 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 16/25] xfs: scrub inodes Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-06  2:51   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 18/25] xfs: scrub directory/attribute btrees Darrick J. Wong
                   ` (7 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

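Scrub the block mappings of an individual inode.  Each extent in the
data, attr, and CoW forks must be in offset order, have a valid length,
and map to blocks inside the filesystem.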

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
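Note for readers: the per-extent checks in xfs_scrub_bmap_extent() can
be modelled in a few lines of userspace C.  The sketch below is a
simplification under stated assumptions: it works in filesystem blocks
instead of converting to disk addresses, it skips the AG number and AG
block checks, and struct extent, struct walk, extent_ok() and
MAX_EXT_LEN are hypothetical names (MAX_EXT_LEN assumes a 21-bit extent
length field).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed maximum extent length: a 21-bit block count. */
#define MAX_EXT_LEN	((uint64_t)((1u << 21) - 1))

/* Hypothetical in-memory extent: file offset, start block, length. */
struct extent {
	uint64_t	startoff;
	uint64_t	startblock;
	uint64_t	blockcount;
};

/* Walk state: end of the previous extent and end of the filesystem. */
struct walk {
	uint64_t	lastoff;
	uint64_t	eofs;
};

/* Return true if the extent passes the ordering and range checks. */
static bool extent_ok(struct walk *w, const struct extent *e)
{
	bool	ok = true;

	/* Extents must not start before the previous one ended. */
	if (e->startoff < w->lastoff)
		ok = false;

	/* Length must be nonzero and fit in the on-disk field. */
	if (e->blockcount == 0 || e->blockcount > MAX_EXT_LEN)
		ok = false;

	/* The whole extent must lie inside the filesystem. */
	if (e->startblock >= w->eofs ||
	    e->startblock + e->blockcount > w->eofs)
		ok = false;

	/* Always advance, as the patch does, so later checks continue. */
	w->lastoff = e->startoff + e->blockcount;
	return ok;
}
```

As in the patch, a failed check does not stop the walk; the caller
records the corruption and moves on to the next extent.
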
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    5 +
 fs/xfs/scrub/bmap.c    |  371 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h  |    5 +
 fs/xfs/scrub/scrub.c   |   12 ++
 fs/xfs/scrub/scrub.h   |    3 
 6 files changed, 395 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/bmap.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 28e14b7..5a77489 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -148,6 +148,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   trace.o \
 				   agheader.o \
 				   alloc.o \
+				   bmap.o \
 				   btree.o \
 				   common.o \
 				   ialloc.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index f8463e0..02ae58b 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -495,9 +495,12 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_RMAPBT	9	/* reverse mapping btree */
 #define XFS_SCRUB_TYPE_REFCNTBT	10	/* reference count btree */
 #define XFS_SCRUB_TYPE_INODE	11	/* inode record */
+#define XFS_SCRUB_TYPE_BMBTD	12	/* data fork block mapping */
+#define XFS_SCRUB_TYPE_BMBTA	13	/* attr fork block mapping */
+#define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	12
+#define XFS_SCRUB_TYPE_NR	15
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
new file mode 100644
index 0000000..57f1bbe
--- /dev/null
+++ b/fs/xfs/scrub/bmap.c
@@ -0,0 +1,371 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_rmap.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+
+/* Set us up with an inode's bmap. */
+STATIC int
+__xfs_scrub_setup_inode_bmap(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip,
+	bool				flush_data)
+{
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	error = xfs_scrub_get_inode(sc, ip);
+	if (error)
+		return error;
+
+	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_ilock(sc->ip, sc->ilock_flags);
+
+	/*
+	 * We don't want any ephemeral data fork updates sitting around
+	 * while we inspect block mappings, so wait for directio to finish
+	 * and flush dirty data if we have delalloc reservations.
+	 */
+	if (S_ISREG(VFS_I(sc->ip)->i_mode) && flush_data) {
+		inode_dio_wait(VFS_I(sc->ip));
+		error = filemap_write_and_wait(VFS_I(sc->ip)->i_mapping);
+		if (error)
+			goto out_unlock;
+		error = invalidate_inode_pages2(VFS_I(sc->ip)->i_mapping);
+		if (error)
+			goto out_unlock;
+	}
+
+	/* Allocate the transaction, then take the ILOCK and we're ready. */
+	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
+	if (error)
+		goto out_unlock;
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+
+	return 0;
+out_unlock:
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	if (sc->ip != ip)
+		iput(VFS_I(sc->ip));
+	sc->ip = NULL;
+	return error;
+}
+
+/* Set us up to scrub the data fork. */
+int
+xfs_scrub_setup_inode_bmap_data(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return __xfs_scrub_setup_inode_bmap(sc, ip, true);
+}
+
+/* Set us up to scrub the attr or CoW fork. */
+int
+xfs_scrub_setup_inode_bmap(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return __xfs_scrub_setup_inode_bmap(sc, ip, false);
+}
+
+/*
+ * Inode fork block mapping (BMBT) scrubber.
+ * More complex than the others because we have to scrub
+ * all the extents regardless of whether or not the fork
+ * is in btree format.
+ */
+
+struct xfs_scrub_bmap_info {
+	struct xfs_scrub_context	*sc;
+	xfs_daddr_t			eofs;
+	xfs_fileoff_t			lastoff;
+	bool				is_rt;
+	bool				is_shared;
+	int				whichfork;
+};
+
+/* Scrub a single extent record. */
+STATIC int
+xfs_scrub_bmap_extent(
+	struct xfs_inode		*ip,
+	struct xfs_btree_cur		*cur,
+	struct xfs_scrub_bmap_info	*info,
+	struct xfs_bmbt_irec		*irec)
+{
+	struct xfs_scrub_ag		sa = { 0 };
+	struct xfs_mount		*mp = info->sc->mp;
+	struct xfs_buf			*bp = NULL;
+	xfs_daddr_t			daddr;
+	xfs_daddr_t			dlen;
+	xfs_fsblock_t			bno;
+	xfs_agnumber_t			agno;
+	int				error = 0;
+
+	if (cur)
+		xfs_btree_get_block(cur, 0, &bp);
+
+	if (irec->br_startoff < info->lastoff ||
+	    irec->br_startblock == HOLESTARTBLOCK ||
+	    isnullstartblock(irec->br_startblock))
+		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
+				irec->br_startoff);
+
+	/* Actual mapping, so check the block ranges. */
+	if (info->is_rt) {
+		daddr = XFS_FSB_TO_BB(mp, irec->br_startblock);
+		agno = NULLAGNUMBER;
+		bno = irec->br_startblock;
+	} else {
+		daddr = XFS_FSB_TO_DADDR(mp, irec->br_startblock);
+		agno = XFS_FSB_TO_AGNO(mp, irec->br_startblock);
+		if (agno >= mp->m_sb.sb_agcount) {
+			xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
+				irec->br_startoff);
+			goto out;
+		}
+		bno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
+		if (bno >= mp->m_sb.sb_agblocks)
+			xfs_scrub_fblock_set_corrupt(info->sc,
+						     info->whichfork,
+						     irec->br_startoff);
+	}
+	dlen = XFS_FSB_TO_BB(mp, irec->br_blockcount);
+	if (irec->br_blockcount == 0 ||
+	    irec->br_blockcount > MAXEXTLEN ||
+	    daddr >= info->eofs ||
+	    daddr + dlen > info->eofs)
+		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
+				irec->br_startoff);
+
+	if (irec->br_state == XFS_EXT_UNWRITTEN &&
+	    !xfs_sb_version_hasextflgbit(&mp->m_sb))
+		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
+				irec->br_startoff);
+
+	/* Set ourselves up for cross-referencing later. */
+	if (!info->is_rt) {
+		error = xfs_scrub_ag_init(info->sc, agno, &sa);
+		if (!xfs_scrub_fblock_op_ok(info->sc, info->whichfork,
+				irec->br_startoff, &error))
+			goto out;
+	}
+
+	xfs_scrub_ag_free(info->sc, &sa);
+out:
+	info->lastoff = irec->br_startoff + irec->br_blockcount;
+	return error;
+}
+
+/* Scrub a bmbt record. */
+STATIC int
+xfs_scrub_bmapbt_helper(
+	struct xfs_scrub_btree		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_bmbt_rec_host	ihost;
+	struct xfs_bmbt_irec		irec;
+	struct xfs_scrub_bmap_info	*info = bs->private;
+	struct xfs_inode		*ip = bs->cur->bc_private.b.ip;
+	struct xfs_buf			*bp = NULL;
+	struct xfs_btree_block		*block;
+	uint64_t			owner;
+	int				i;
+
+	/*
+	 * Check the owners of the btree blocks up to the level below
+	 * the root since the verifiers don't do that.
+	 */
+	if (xfs_sb_version_hascrc(&bs->cur->bc_mp->m_sb) &&
+	    bs->cur->bc_ptrs[0] == 1) {
+		for (i = 0; i < bs->cur->bc_nlevels - 1; i++) {
+			block = xfs_btree_get_block(bs->cur, i, &bp);
+			owner = be64_to_cpu(block->bb_u.l.bb_owner);
+			if (owner != ip->i_ino)
+				xfs_scrub_fblock_set_corrupt(bs->sc,
+						info->whichfork, 0);
+		}
+	}
+
+	/* Set up the in-core record and scrub it. */
+	ihost.l0 = be64_to_cpu(rec->bmbt.l0);
+	ihost.l1 = be64_to_cpu(rec->bmbt.l1);
+	xfs_bmbt_get_all(&ihost, &irec);
+	return xfs_scrub_bmap_extent(ip, bs->cur, info, &irec);
+}
+
+/* Scrub an inode fork's block mappings. */
+STATIC int
+xfs_scrub_bmap(
+	struct xfs_scrub_context	*sc,
+	int				whichfork)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_scrub_bmap_info	info = {0};
+	struct xfs_owner_info		oinfo;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_inode		*ip = sc->ip;
+	struct xfs_ifork		*ifp;
+	struct xfs_btree_cur		*cur;
+	xfs_fileoff_t			endoff;
+	xfs_extnum_t			idx;
+	bool				found;
+	int				error = 0;
+
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+
+	info.is_rt = whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip);
+	info.eofs = XFS_FSB_TO_BB(mp, info.is_rt ? mp->m_sb.sb_rblocks :
+					      mp->m_sb.sb_dblocks);
+	info.whichfork = whichfork;
+	info.is_shared = whichfork == XFS_DATA_FORK && xfs_is_reflink_inode(ip);
+	info.sc = sc;
+
+	switch (whichfork) {
+	case XFS_COW_FORK:
+		/* Non-existent CoW forks are ignorable. */
+		if (!ifp)
+			goto out_unlock;
+		/* No CoW forks on non-reflink inodes/filesystems. */
+		if (!xfs_is_reflink_inode(ip)) {
+			xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino, NULL);
+			goto out_unlock;
+		}
+		break;
+	case XFS_ATTR_FORK:
+		if (!ifp)
+			goto out_unlock;
+		if (!xfs_sb_version_hasattr(&mp->m_sb) &&
+		    !xfs_sb_version_hasattr2(&mp->m_sb))
+			xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino, NULL);
+		break;
+	}
+
+	/* Check the fork values */
+	switch (XFS_IFORK_FORMAT(ip, whichfork)) {
+	case XFS_DINODE_FMT_UUID:
+	case XFS_DINODE_FMT_DEV:
+	case XFS_DINODE_FMT_LOCAL:
+		/* No mappings to check. */
+		goto out_unlock;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+			xfs_scrub_fblock_set_corrupt(sc, whichfork, 0);
+			goto out_unlock;
+		}
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		ASSERT(whichfork != XFS_COW_FORK);
+
+		/* Scan the btree records. */
+		cur = xfs_bmbt_init_cursor(mp, sc->tp, ip, whichfork);
+		xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
+		error = xfs_scrub_btree(sc, cur, xfs_scrub_bmapbt_helper,
+				&oinfo, &info);
+		xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR :
+						  XFS_BTREE_NOERROR);
+		if (error == -EDEADLOCK)
+			return error;
+		else if (error)
+			goto out_unlock;
+		break;
+	default:
+		xfs_scrub_fblock_set_corrupt(sc, whichfork, 0);
+		goto out_unlock;
+	}
+
+	/* Extent data is in memory, so scrub that. */
+
+	/* Find the offset of the last extent in the mapping. */
+	error = xfs_bmap_last_offset(ip, &endoff, whichfork);
+	if (!xfs_scrub_fblock_op_ok(sc, whichfork, 0, &error))
+		goto out_unlock;
+
+	/* Scrub extent records. */
+	info.lastoff = 0;
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+	for (found = xfs_iext_lookup_extent(ip, ifp, 0, &idx, &irec);
+	     found != 0;
+	     found = xfs_iext_get_extent(ifp, ++idx, &irec)) {
+		if (xfs_scrub_should_terminate(sc, &error))
+			break;
+		if (isnullstartblock(irec.br_startblock))
+			continue;
+		if (irec.br_startoff >= endoff) {
+			xfs_scrub_fblock_set_corrupt(sc, whichfork,
+					irec.br_startoff);
+			goto out_unlock;
+		}
+		error = xfs_scrub_bmap_extent(ip, NULL, &info, &irec);
+		if (error == -EDEADLOCK)
+			return error;
+	}
+
+out_unlock:
+	return error;
+}
+
+/* Scrub an inode's data fork. */
+int
+xfs_scrub_bmap_data(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_scrub_bmap(sc, XFS_DATA_FORK);
+}
+
+/* Scrub an inode's attr fork. */
+int
+xfs_scrub_bmap_attr(
+	struct xfs_scrub_context	*sc)
+{
+	return xfs_scrub_bmap(sc, XFS_ATTR_FORK);
+}
+
+/* Scrub an inode's CoW fork. */
+int
+xfs_scrub_bmap_cow(
+	struct xfs_scrub_context	*sc)
+{
+	if (!xfs_is_reflink_inode(sc->ip))
+		return -ENOENT;
+
+	return xfs_scrub_bmap(sc, XFS_COW_FORK);
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index c8143e8..3e87e3a 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -90,7 +90,10 @@ int xfs_scrub_setup_ag_refcountbt(struct xfs_scrub_context *sc,
 				  struct xfs_inode *ip);
 int xfs_scrub_setup_inode(struct xfs_scrub_context *sc,
 			  struct xfs_inode *ip);
-
+int xfs_scrub_setup_inode_bmap(struct xfs_scrub_context *sc,
+			       struct xfs_inode *ip);
+int xfs_scrub_setup_inode_bmap_data(struct xfs_scrub_context *sc,
+				    struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index f014ef0..b10d627 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -209,6 +209,18 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_inode,
 		.scrub	= xfs_scrub_inode,
 	},
+	{ /* inode data fork */
+		.setup	= xfs_scrub_setup_inode_bmap_data,
+		.scrub	= xfs_scrub_bmap_data,
+	},
+	{ /* inode attr fork */
+		.setup	= xfs_scrub_setup_inode_bmap,
+		.scrub	= xfs_scrub_bmap_attr,
+	},
+	{ /* inode CoW fork */
+		.setup	= xfs_scrub_setup_inode_bmap,
+		.scrub	= xfs_scrub_bmap_cow,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index ec635d4..8920ccf 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -79,5 +79,8 @@ int xfs_scrub_finobt(struct xfs_scrub_context *sc);
 int xfs_scrub_rmapbt(struct xfs_scrub_context *sc);
 int xfs_scrub_refcountbt(struct xfs_scrub_context *sc);
 int xfs_scrub_inode(struct xfs_scrub_context *sc);
+int xfs_scrub_bmap_data(struct xfs_scrub_context *sc);
+int xfs_scrub_bmap_attr(struct xfs_scrub_context *sc);
+int xfs_scrub_bmap_cow(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */



* [PATCH 18/25] xfs: scrub directory/attribute btrees
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (16 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 17/25] xfs: scrub inode block mappings Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-06  5:07   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 19/25] xfs: scrub directory metadata Darrick J. Wong
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Fengguang Wu

From: Darrick J. Wong <darrick.wong@oracle.com>

Provide a way to check the shape and scrub the hashes and records
in a directory or extended attribute btree.  These are helper functions
for the directory & attribute scrubbers in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[fengguang: remove unneeded variable to store return value]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/scrub/dabtree.c |  556 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/dabtree.h |   51 ++++
 3 files changed, 608 insertions(+)
 create mode 100644 fs/xfs/scrub/dabtree.c
 create mode 100644 fs/xfs/scrub/dabtree.h
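
Note for readers: the two hash invariants checked by
xfs_scrub_da_btree_hash() can be sketched in standalone userspace C.
This is only an illustration under stated assumptions, not the kernel
code: struct hash_walk, MAX_DEPTH and check_hash() are hypothetical
names, hashes are taken as CPU-endian values (the patch converts from
__be32 first), and the sketch returns a verdict instead of setting
XFS_SCRUB_OFLAG_CORRUPT.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_DEPTH 5	/* stand-in for XFS_DA_NODE_MAXDEPTH (assumed value) */

/* Hypothetical per-walk state: the last hash seen at each tree level. */
struct hash_walk {
	uint32_t	last_hash[MAX_DEPTH];
};

/*
 * Return true if @hash is acceptable at @level:
 *  - it must not be smaller than the previous hash seen at that level;
 *  - below the root, it must not exceed @parent_hash, the hash stored
 *    in the parent entry that points at this block.
 * The last-seen hash is updated either way, as the patch does.
 */
static bool
check_hash(struct hash_walk *w, int level, uint32_t hash, uint32_t parent_hash)
{
	bool ok = true;

	if (hash < w->last_hash[level])
		ok = false;
	w->last_hash[level] = hash;

	if (level > 0 && parent_hash < hash)
		ok = false;
	return ok;
}
```

Equal hashes are accepted on purpose: hash collisions are legal in a
da btree, so only a strictly decreasing hash counts as out of order.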


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5a77489..b48437f 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -151,6 +151,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   bmap.o \
 				   btree.o \
 				   common.o \
+				   dabtree.o \
 				   ialloc.o \
 				   inode.o \
 				   refcount.o \
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
new file mode 100644
index 0000000..392a76c
--- /dev/null
+++ b/fs/xfs/scrub/dabtree.c
@@ -0,0 +1,556 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_attr_leaf.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/dabtree.h"
+
+/* Directory/Attribute Btree */
+
+/*
+ * Check for da btree operation errors.  See the section about handling
+ * operational errors in common.c.
+ */
+bool
+xfs_scrub_da_op_ok(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	int				*error)
+{
+	struct xfs_scrub_context	*sc = ds->sc;
+
+	if (*error == 0)
+		return true;
+
+	switch (*error) {
+	case -EDEADLOCK:
+		/* Used to restart an op with deadlock avoidance. */
+		trace_xfs_scrub_deadlock_retry(sc->ip, sc->sm, *error);
+		break;
+	case -EFSBADCRC:
+	case -EFSCORRUPTED:
+		/* Note the badness but don't abort. */
+		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+		*error = 0;
+		/* fall through */
+	default:
+		trace_xfs_scrub_file_op_error(sc, ds->dargs.whichfork,
+				xfs_dir2_da_to_db(ds->dargs.geo,
+					ds->state->path.blk[level].blkno),
+				*error, __return_address);
+		break;
+	}
+	return false;
+}
+
+/*
+ * Check for da btree corruption.  See the section about handling
+ * operational errors in common.c.
+ */
+void
+xfs_scrub_da_set_corrupt(
+	struct xfs_scrub_da_btree	*ds,
+	int				level)
+{
+	struct xfs_scrub_context	*sc = ds->sc;
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+
+	trace_xfs_scrub_fblock_error(sc, ds->dargs.whichfork,
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
+}
+
+/* Find an entry at a certain level in a da btree. */
+STATIC void *
+xfs_scrub_da_btree_entry(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	int				rec)
+{
+	char				*ents;
+	void				*(*fn)(void *);
+	size_t				sz;
+	struct xfs_da_state_blk		*blk;
+
+	/* Dispatch the entry finding function. */
+	blk = &ds->state->path.blk[level];
+	switch (blk->magic) {
+	case XFS_ATTR_LEAF_MAGIC:
+	case XFS_ATTR3_LEAF_MAGIC:
+		fn = (xfs_da_leaf_ents_fn)xfs_attr3_leaf_entryp;
+		sz = sizeof(struct xfs_attr_leaf_entry);
+		break;
+	case XFS_DIR2_LEAFN_MAGIC:
+	case XFS_DIR3_LEAFN_MAGIC:
+		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->leaf_ents_p;
+		sz = sizeof(struct xfs_dir2_leaf_entry);
+		break;
+	case XFS_DIR2_LEAF1_MAGIC:
+	case XFS_DIR3_LEAF1_MAGIC:
+		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->leaf_ents_p;
+		sz = sizeof(struct xfs_dir2_leaf_entry);
+		break;
+	case XFS_DA_NODE_MAGIC:
+	case XFS_DA3_NODE_MAGIC:
+		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->node_tree_p;
+		sz = sizeof(struct xfs_da_node_entry);
+		break;
+	default:
+		return NULL;
+	}
+
+	ents = fn(blk->bp->b_addr);
+	return ents + (sz * rec);
+}
+
+/* Scrub a da btree hash (key). */
+int
+xfs_scrub_da_btree_hash(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	__be32				*hashp)
+{
+	struct xfs_da_state_blk		*blks;
+	struct xfs_da_node_entry	*btree;
+	xfs_dahash_t			hash;
+	xfs_dahash_t			parent_hash;
+
+	/* Is this hash in order? */
+	hash = be32_to_cpu(*hashp);
+	if (hash < ds->hashes[level])
+		xfs_scrub_da_set_corrupt(ds, level);
+	ds->hashes[level] = hash;
+
+	if (level == 0)
+		return 0;
+
+	/* Is this hash no larger than the parent hash? */
+	blks = ds->state->path.blk;
+	btree = xfs_scrub_da_btree_entry(ds, level - 1, blks[level - 1].index);
+	parent_hash = be32_to_cpu(btree->hashval);
+	if (parent_hash < hash)
+		xfs_scrub_da_set_corrupt(ds, level);
+
+	return 0;
+}
+
+/*
+ * Check a da btree pointer.  Returns true if it's ok to use this
+ * pointer.
+ */
+STATIC bool
+xfs_scrub_da_btree_ptr_ok(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	xfs_dablk_t			blkno)
+{
+	if (blkno < ds->lowest || (ds->highest != 0 && blkno >= ds->highest)) {
+		xfs_scrub_da_set_corrupt(ds, level);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * The da btree scrubber can handle leaf1 blocks as a degenerate
+ * form of da btree.  Since the regular da code doesn't handle
+ * leaf1, we must multiplex the verifiers.
+ */
+static void
+xfs_scrub_da_btree_read_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_da_blkinfo	*info = bp->b_addr;
+
+	switch (be16_to_cpu(info->magic)) {
+	case XFS_DIR2_LEAF1_MAGIC:
+	case XFS_DIR3_LEAF1_MAGIC:
+		bp->b_ops = &xfs_dir3_leaf1_buf_ops;
+		bp->b_ops->verify_read(bp);
+		return;
+	default:
+		bp->b_ops = &xfs_da3_node_buf_ops;
+		bp->b_ops->verify_read(bp);
+		return;
+	}
+}
+static void
+xfs_scrub_da_btree_write_verify(
+	struct xfs_buf	*bp)
+{
+	struct xfs_da_blkinfo	*info = bp->b_addr;
+
+	switch (be16_to_cpu(info->magic)) {
+	case XFS_DIR2_LEAF1_MAGIC:
+	case XFS_DIR3_LEAF1_MAGIC:
+		bp->b_ops = &xfs_dir3_leaf1_buf_ops;
+		bp->b_ops->verify_write(bp);
+		return;
+	default:
+		bp->b_ops = &xfs_da3_node_buf_ops;
+		bp->b_ops->verify_write(bp);
+		return;
+	}
+}
+
+static const struct xfs_buf_ops xfs_scrub_da_btree_buf_ops = {
+	.name = "xfs_scrub_da_btree",
+	.verify_read = xfs_scrub_da_btree_read_verify,
+	.verify_write = xfs_scrub_da_btree_write_verify,
+};
+
+/* Check a block's sibling. */
+STATIC int
+xfs_scrub_da_btree_block_check_sibling(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	int				direction,
+	xfs_dablk_t			sibling)
+{
+	int				retval;
+	int				error;
+
+	if (!sibling)
+		return 0;
+
+	/* Move the alternate cursor one block in the direction given. */
+	memcpy(&ds->state->altpath, &ds->state->path,
+			sizeof(ds->state->altpath));
+	error = xfs_da3_path_shift(ds->state, &ds->state->altpath,
+			direction, false, &retval);
+	if (!xfs_scrub_da_op_ok(ds, level, &error))
+		return error;
+	if (retval) {
+		xfs_scrub_da_set_corrupt(ds, level);
+		return error;
+	}
+
+	if (ds->state->altpath.blk[level].blkno != sibling)
+		xfs_scrub_da_set_corrupt(ds, level);
+	xfs_trans_brelse(ds->dargs.trans, ds->state->altpath.blk[level].bp);
+	return error;
+}
+
+/* Check a block's sibling pointers. */
+STATIC int
+xfs_scrub_da_btree_block_check_siblings(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	struct xfs_da_blkinfo		*hdr)
+{
+	xfs_dablk_t			forw;
+	xfs_dablk_t			back;
+	int				error = 0;
+
+	forw = be32_to_cpu(hdr->forw);
+	back = be32_to_cpu(hdr->back);
+
+	/* Top level blocks should not have sibling pointers. */
+	if (level == 0) {
+		if (forw != 0 || back != 0)
+			xfs_scrub_da_set_corrupt(ds, level);
+		return error;
+	}
+
+	/*
+	 * Check back (left) and forw (right) pointers.  These functions
+	 * absorb error codes for us.
+	 */
+	error = xfs_scrub_da_btree_block_check_sibling(ds, level, 0, back);
+	if (error)
+		goto out;
+	error = xfs_scrub_da_btree_block_check_sibling(ds, level, 1, forw);
+
+out:
+	memset(&ds->state->altpath, 0, sizeof(ds->state->altpath));
+	return error;
+}
+
+/* Load a dir/attribute block from a btree. */
+STATIC int
+xfs_scrub_da_btree_block(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	xfs_dablk_t			blkno)
+{
+	struct xfs_da_state_blk		*blk;
+	struct xfs_da_intnode		*node;
+	struct xfs_da_node_entry	*btree;
+	struct xfs_da3_blkinfo		*hdr3;
+	struct xfs_da_args		*dargs = &ds->dargs;
+	struct xfs_inode		*ip = ds->dargs.dp;
+	xfs_ino_t			owner;
+	int				*pmaxrecs;
+	struct xfs_da3_icnode_hdr	nodehdr;
+	int				error = 0;
+
+	blk = &ds->state->path.blk[level];
+	ds->state->path.active = level + 1;
+
+	/* Release old block. */
+	if (blk->bp) {
+		xfs_trans_brelse(dargs->trans, blk->bp);
+		blk->bp = NULL;
+	}
+
+	/* Check the pointer. */
+	blk->blkno = blkno;
+	if (!xfs_scrub_da_btree_ptr_ok(ds, level, blkno))
+		goto out_nobuf;
+
+	/* Read the buffer. */
+	error = xfs_da_read_buf(dargs->trans, dargs->dp, blk->blkno, -2,
+			&blk->bp, dargs->whichfork,
+			&xfs_scrub_da_btree_buf_ops);
+	if (!xfs_scrub_da_op_ok(ds, level, &error))
+		goto out_nobuf;
+
+	/* It's ok for a directory not to have a da btree in it. */
+	if (ds->dargs.whichfork == XFS_DATA_FORK && level == 0 &&
+			blk->bp == NULL)
+		goto out_nobuf;
+
+	/* It's /not/ ok for attr trees not to have a da btree. */
+	if (blk->bp == NULL) {
+		xfs_scrub_da_set_corrupt(ds, level);
+		goto out_nobuf;
+	}
+
+	hdr3 = blk->bp->b_addr;
+	blk->magic = be16_to_cpu(hdr3->hdr.magic);
+	pmaxrecs = &ds->maxrecs[level];
+
+	/* Check the owner. */
+	if (xfs_sb_version_hascrc(&ip->i_mount->m_sb)) {
+		owner = be64_to_cpu(hdr3->owner);
+		if (owner != ip->i_ino)
+			xfs_scrub_da_set_corrupt(ds, level);
+	}
+
+	/* Check the siblings. */
+	error = xfs_scrub_da_btree_block_check_siblings(ds, level, &hdr3->hdr);
+	if (error)
+		goto out;
+
+	/* Interpret the buffer. */
+	switch (blk->magic) {
+	case XFS_ATTR_LEAF_MAGIC:
+	case XFS_ATTR3_LEAF_MAGIC:
+		xfs_trans_buf_set_type(dargs->trans, blk->bp,
+				XFS_BLFT_ATTR_LEAF_BUF);
+		blk->magic = XFS_ATTR_LEAF_MAGIC;
+		blk->hashval = xfs_attr_leaf_lasthash(blk->bp, pmaxrecs);
+		if (ds->tree_level != 0)
+			xfs_scrub_da_set_corrupt(ds, level);
+		break;
+	case XFS_DIR2_LEAFN_MAGIC:
+	case XFS_DIR3_LEAFN_MAGIC:
+		xfs_trans_buf_set_type(dargs->trans, blk->bp,
+				XFS_BLFT_DIR_LEAFN_BUF);
+		blk->magic = XFS_DIR2_LEAFN_MAGIC;
+		blk->hashval = xfs_dir2_leaf_lasthash(ip, blk->bp, pmaxrecs);
+		if (ds->tree_level != 0)
+			xfs_scrub_da_set_corrupt(ds, level);
+		break;
+	case XFS_DIR2_LEAF1_MAGIC:
+	case XFS_DIR3_LEAF1_MAGIC:
+		xfs_trans_buf_set_type(dargs->trans, blk->bp,
+				XFS_BLFT_DIR_LEAF1_BUF);
+		blk->magic = XFS_DIR2_LEAF1_MAGIC;
+		blk->hashval = xfs_dir2_leaf_lasthash(ip, blk->bp, pmaxrecs);
+		if (ds->tree_level != 0)
+			xfs_scrub_da_set_corrupt(ds, level);
+		break;
+	case XFS_DA_NODE_MAGIC:
+	case XFS_DA3_NODE_MAGIC:
+		xfs_trans_buf_set_type(dargs->trans, blk->bp,
+				XFS_BLFT_DA_NODE_BUF);
+		blk->magic = XFS_DA_NODE_MAGIC;
+		node = blk->bp->b_addr;
+		ip->d_ops->node_hdr_from_disk(&nodehdr, node);
+		btree = ip->d_ops->node_tree_p(node);
+		*pmaxrecs = nodehdr.count;
+		blk->hashval = be32_to_cpu(btree[*pmaxrecs - 1].hashval);
+		if (level == 0) {
+			if (nodehdr.level >= XFS_DA_NODE_MAXDEPTH) {
+				xfs_scrub_da_set_corrupt(ds, level);
+				goto out_freebp;
+			}
+			ds->tree_level = nodehdr.level;
+		} else {
+			if (ds->tree_level != nodehdr.level) {
+				xfs_scrub_da_set_corrupt(ds, level);
+				goto out_freebp;
+			}
+		}
+		break;
+	default:
+		xfs_scrub_da_set_corrupt(ds, level);
+		goto out_freebp;
+	}
+
+out:
+	return error;
+out_freebp:
+	xfs_trans_brelse(dargs->trans, blk->bp);
+	blk->bp = NULL;
+out_nobuf:
+	blk->blkno = 0;
+	return error;
+}
+
+/* Visit all nodes and leaves of a da btree. */
+int
+xfs_scrub_da_btree(
+	struct xfs_scrub_context	*sc,
+	int				whichfork,
+	xfs_scrub_da_btree_rec_fn	scrub_fn)
+{
+	struct xfs_scrub_da_btree	ds;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_da_state_blk		*blks;
+	struct xfs_da_node_entry	*key;
+	void				*rec;
+	xfs_dablk_t			blkno;
+	bool				is_attr;
+	int				level;
+	int				error;
+
+	memset(&ds, 0, sizeof(ds));
+	/* Skip short format data structures; no btree to scan. */
+	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
+		return 0;
+
+	/* Set up initial da state. */
+	is_attr = whichfork == XFS_ATTR_FORK;
+	ds.dargs.geo = is_attr ? mp->m_attr_geo : mp->m_dir_geo;
+	ds.dargs.dp = sc->ip;
+	ds.dargs.whichfork = whichfork;
+	ds.dargs.trans = sc->tp;
+	ds.dargs.op_flags = XFS_DA_OP_OKNOENT;
+	ds.state = xfs_da_state_alloc();
+	ds.state->args = &ds.dargs;
+	ds.state->mp = mp;
+	ds.sc = sc;
+	blkno = ds.lowest = is_attr ? 0 : ds.dargs.geo->leafblk;
+	ds.highest = is_attr ? 0 : ds.dargs.geo->freeblk;
+	level = 0;
+
+	/* Find the root of the da tree, if present. */
+	blks = ds.state->path.blk;
+	error = xfs_scrub_da_btree_block(&ds, level, blkno);
+	if (error)
+		goto out_state;
+	if (blks[level].bp == NULL)
+		goto out_state;
+
+	blks[level].index = 0;
+	while (level >= 0 && level < XFS_DA_NODE_MAXDEPTH) {
+		/* Handle leaf block. */
+		if (blks[level].magic != XFS_DA_NODE_MAGIC) {
+			/* End of leaf, pop back towards the root. */
+			if (blks[level].index >= ds.maxrecs[level]) {
+				if (level > 0)
+					blks[level - 1].index++;
+				ds.tree_level++;
+				level--;
+				continue;
+			}
+
+			/* Dispatch record scrubbing. */
+			rec = xfs_scrub_da_btree_entry(&ds, level,
+					blks[level].index);
+			error = scrub_fn(&ds, level, rec);
+			if (error < 0 ||
+			    error == XFS_BTREE_QUERY_RANGE_ABORT)
+				break;
+			if (xfs_scrub_should_terminate(sc, &error))
+				break;
+
+			blks[level].index++;
+			continue;
+		}
+
+
+		/* End of node, pop back towards the root. */
+		if (blks[level].index >= ds.maxrecs[level]) {
+			if (level > 0)
+				blks[level - 1].index++;
+			ds.tree_level++;
+			level--;
+			continue;
+		}
+
+		/* Hashes in order for scrub? */
+		key = xfs_scrub_da_btree_entry(&ds, level, blks[level].index);
+		error = xfs_scrub_da_btree_hash(&ds, level, &key->hashval);
+		if (error)
+			goto out;
+
+		/* Drill another level deeper. */
+		blkno = be32_to_cpu(key->before);
+		level++;
+		ds.tree_level--;
+		error = xfs_scrub_da_btree_block(&ds, level, blkno);
+		if (error)
+			goto out;
+		if (blks[level].bp == NULL)
+			goto out;
+
+		blks[level].index = 0;
+	}
+
+out:
+	/* Release all the buffers we're tracking. */
+	for (level = 0; level < XFS_DA_NODE_MAXDEPTH; level++) {
+		if (blks[level].bp == NULL)
+			continue;
+		xfs_trans_brelse(sc->tp, blks[level].bp);
+		blks[level].bp = NULL;
+	}
+
+out_state:
+	xfs_da_state_free(ds.state);
+	return error;
+}
diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h
new file mode 100644
index 0000000..04e400d
--- /dev/null
+++ b/fs/xfs/scrub/dabtree.h
@@ -0,0 +1,51 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_SCRUB_DABTREE_H__
+#define __XFS_SCRUB_DABTREE_H__
+
+/* dir/attr btree */
+
+struct xfs_scrub_da_btree {
+	struct xfs_da_args		dargs;
+	xfs_dahash_t			hashes[XFS_DA_NODE_MAXDEPTH];
+	int				maxrecs[XFS_DA_NODE_MAXDEPTH];
+	struct xfs_da_state		*state;
+	struct xfs_scrub_context	*sc;
+	xfs_dablk_t			lowest;
+	xfs_dablk_t			highest;
+	int				tree_level;
+};
+
+typedef void *(*xfs_da_leaf_ents_fn)(void *);
+typedef int (*xfs_scrub_da_btree_rec_fn)(struct xfs_scrub_da_btree *ds,
+		int level, void *rec);
+
+/* Check for da btree operation errors. */
+bool xfs_scrub_da_op_ok(struct xfs_scrub_da_btree *ds, int level, int *error);
+
+/* Check for da btree corruption. */
+void xfs_scrub_da_set_corrupt(struct xfs_scrub_da_btree *ds, int level);
+
+int xfs_scrub_da_btree_hash(struct xfs_scrub_da_btree *ds, int level,
+			    __be32 *hashp);
+int xfs_scrub_da_btree(struct xfs_scrub_context *sc, int whichfork,
+		       xfs_scrub_da_btree_rec_fn scrub_fn);
+
+#endif /* __XFS_SCRUB_DABTREE_H__ */
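
Note for readers: xfs_scrub_da_btree() above walks the tree iteratively,
keeping one cursor index per level instead of recursing. The shape of
that loop can be sketched in standalone userspace C. Everything here is
a hypothetical stand-in (struct blk, MAX_DEPTH, walk()); the real code
reads buffers from disk, checks hashes and siblings, and calls a record
scrubber where this sketch only counts records.

```c
#include <stddef.h>

#define MAX_DEPTH 5	/* stand-in for XFS_DA_NODE_MAXDEPTH (assumed value) */

/* Hypothetical in-memory block: a leaf holds records, a node holds children. */
struct blk {
	int		is_leaf;
	int		nr;		/* number of records or children */
	struct blk	**child;	/* valid only when !is_leaf */
};

/*
 * Visit every leaf record below @root without recursion and return the
 * number of records seen, or -1 if the tree is deeper than MAX_DEPTH.
 */
static int
walk(struct blk *root)
{
	struct blk	*path[MAX_DEPTH];
	int		index[MAX_DEPTH];
	int		level = 0;
	int		seen = 0;

	path[0] = root;
	index[0] = 0;
	while (level >= 0) {
		struct blk *b = path[level];

		/* End of block: pop towards the root, advance the parent. */
		if (index[level] >= b->nr) {
			level--;
			if (level >= 0)
				index[level]++;
			continue;
		}

		if (b->is_leaf) {
			/* A record scrubber would be called here. */
			seen++;
			index[level]++;
			continue;
		}

		/* Drill one level deeper through the current child. */
		if (level + 1 >= MAX_DEPTH)
			return -1;
		path[level + 1] = b->child[index[level]];
		index[level + 1] = 0;
		level++;
	}
	return seen;
}
```

The per-level index array plays the role of blks[level].index in the
patch; bounding the depth keeps a corrupt tree from running the cursor
off the end of the path array.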



* [PATCH 19/25] xfs: scrub directory metadata
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (17 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 18/25] xfs: scrub directory/attribute btrees Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-06  7:07   ` Dave Chinner
  2017-10-03 20:42 ` [PATCH 20/25] xfs: scrub directory freespace Darrick J. Wong
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Scrub the hash tree and all the entries in a directory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    3 
 fs/xfs/scrub/common.c  |   32 +++++
 fs/xfs/scrub/common.h  |    4 +
 fs/xfs/scrub/dir.c     |  300 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c   |    4 +
 fs/xfs/scrub/scrub.h   |    1 
 7 files changed, 344 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/dir.c
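
Note for readers: xfs_scrub_dir_check_ftype() relies on the file type
bits of an inode mode shifting down to the matching DT_* value. A
standalone sketch of that conversion follows. The SK_* constants and
mode_to_dtype() are hypothetical names; the numeric values are the
Linux S_IF* and DT_* values, copied locally so the example builds
without feature-test macros.

```c
/* File type bits of a mode, same values as S_IFMT and friends on Linux. */
#define SK_IFMT		0170000
#define SK_IFIFO	0010000
#define SK_IFCHR	0020000
#define SK_IFDIR	0040000
#define SK_IFBLK	0060000
#define SK_IFREG	0100000
#define SK_IFLNK	0120000
#define SK_IFSOCK	0140000

/* Directory entry types, same values as DT_* in <dirent.h> on Linux. */
enum sk_dtype {
	SK_DT_UNKNOWN	= 0,
	SK_DT_FIFO	= 1,
	SK_DT_CHR	= 2,
	SK_DT_DIR	= 4,
	SK_DT_BLK	= 6,
	SK_DT_REG	= 8,
	SK_DT_LNK	= 10,
	SK_DT_SOCK	= 12,
};

/* Convert an inode mode to the type a directory entry should carry. */
static int
mode_to_dtype(unsigned int mode)
{
	/* The type field is the top four bits of the 16-bit mode. */
	return (mode & SK_IFMT) >> 12;
}
```

A dirent whose recorded type differs from mode_to_dtype() of the target
inode's mode is what the scrubber flags as corrupt; permission bits in
the mode do not affect the result.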


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b48437f..69aa88e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -152,6 +152,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   btree.o \
 				   common.o \
 				   dabtree.o \
+				   dir.o \
 				   ialloc.o \
 				   inode.o \
 				   refcount.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 02ae58b..b16d004 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -498,9 +498,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_BMBTD	12	/* data fork block mapping */
 #define XFS_SCRUB_TYPE_BMBTA	13	/* attr fork block mapping */
 #define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
+#define XFS_SCRUB_TYPE_DIR	15	/* directory */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	15
+#define XFS_SCRUB_TYPE_NR	16
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 415dec6..ac66744 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -624,3 +624,35 @@ xfs_scrub_checkpoint_log(
 	xfs_ail_push_all_sync(mp->m_ail);
 	return 0;
 }
+
+/* Set us up to scrub a file's contents. */
+int
+xfs_scrub_setup_inode_contents(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip,
+	unsigned int			resblks)
+{
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	error = xfs_scrub_get_inode(sc, ip);
+	if (error)
+		return error;
+
+	/* Got the inode, lock it and we're ready to go. */
+	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_ilock(sc->ip, sc->ilock_flags);
+	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
+	if (error)
+		goto out_unlock;
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+
+	return 0;
+out_unlock:
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	if (sc->ip != ip)
+		iput(VFS_I(sc->ip));
+	sc->ip = NULL;
+	return error;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 3e87e3a..f530301 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -94,6 +94,8 @@ int xfs_scrub_setup_inode_bmap(struct xfs_scrub_context *sc,
 			       struct xfs_inode *ip);
 int xfs_scrub_setup_inode_bmap_data(struct xfs_scrub_context *sc,
 				    struct xfs_inode *ip);
+int xfs_scrub_setup_directory(struct xfs_scrub_context *sc,
+			      struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
@@ -114,5 +116,7 @@ int xfs_scrub_walk_agfl(struct xfs_scrub_context *sc,
 int xfs_scrub_setup_ag_btree(struct xfs_scrub_context *sc,
 			     struct xfs_inode *ip, bool force_log);
 int xfs_scrub_get_inode(struct xfs_scrub_context *sc, struct xfs_inode *ip_in);
+int xfs_scrub_setup_inode_contents(struct xfs_scrub_context *sc,
+				   struct xfs_inode *ip, unsigned int resblks);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
new file mode 100644
index 0000000..e58252b
--- /dev/null
+++ b/fs/xfs/scrub/dir.c
@@ -0,0 +1,300 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_itable.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/dabtree.h"
+
+/* Set us up to scrub directories. */
+int
+xfs_scrub_setup_directory(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return xfs_scrub_setup_inode_contents(sc, ip, 0);
+}
+
+/* Directories */
+
+/* Scrub a directory entry. */
+
+struct xfs_scrub_dir_ctx {
+	struct dir_context		dc;
+	struct xfs_scrub_context	*sc;
+};
+
+/* Check that an inode's mode matches a given DT_ type. */
+STATIC int
+xfs_scrub_dir_check_ftype(
+	struct xfs_scrub_dir_ctx	*sdc,
+	xfs_fileoff_t			offset,
+	xfs_ino_t			inum,
+	int				dtype)
+{
+	struct xfs_mount		*mp = sdc->sc->mp;
+	struct xfs_inode		*ip;
+	int				ino_dtype;
+	int				error = 0;
+
+	if (!xfs_sb_version_hasftype(&mp->m_sb)) {
+		if (dtype != DT_UNKNOWN && dtype != DT_DIR)
+			xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
+					offset);
+		goto out;
+	}
+
+	error = xfs_iget(mp, sdc->sc->tp, inum, 0, 0, &ip);
+	if (!xfs_scrub_fblock_op_ok(sdc->sc, XFS_DATA_FORK, offset, &error))
+		goto out;
+
+	/* Convert mode to the DT_* values that dir_emit uses. */
+	ino_dtype = (VFS_I(ip)->i_mode & S_IFMT) >> 12;
+	if (ino_dtype != dtype)
+		xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
+	iput(VFS_I(ip));
+out:
+	return error;
+}
+
+/* Scrub a single directory entry. */
+STATIC int
+xfs_scrub_dir_actor(
+	struct dir_context		*dc,
+	const char			*name,
+	int				namelen,
+	loff_t				pos,
+	u64				ino,
+	unsigned			type)
+{
+	struct xfs_mount		*mp;
+	struct xfs_inode		*ip;
+	struct xfs_scrub_dir_ctx	*sdc;
+	struct xfs_name			xname;
+	xfs_ino_t			lookup_ino;
+	xfs_dablk_t			offset;
+	int				error = 0;
+
+	sdc = container_of(dc, struct xfs_scrub_dir_ctx, dc);
+	ip = sdc->sc->ip;
+	mp = ip->i_mount;
+	offset = xfs_dir2_db_to_da(mp->m_dir_geo,
+			xfs_dir2_dataptr_to_db(mp->m_dir_geo, pos));
+
+	/* Does this inode number make sense? */
+	if (xfs_dir_ino_validate(mp, ino) != 0 || xfs_internal_inum(mp, ino)) {
+		xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
+		goto out;
+	}
+
+	/* Verify that we can look up this name by hash. */
+	xname.name = name;
+	xname.len = namelen;
+	xname.type = XFS_DIR3_FT_UNKNOWN;
+
+	error = xfs_dir_lookup(sdc->sc->tp, ip, &xname, &lookup_ino, NULL);
+	if (!xfs_scrub_fblock_op_ok(sdc->sc, XFS_DATA_FORK, offset, &error))
+		goto fail_xref;
+	if (lookup_ino != ino) {
+		xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
+		goto out;
+	}
+
+	if (!strncmp(".", name, namelen)) {
+		/* If this is "." then check that the inum matches the dir. */
+		if (xfs_sb_version_hasftype(&mp->m_sb) && type != DT_DIR)
+			xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
+					offset);
+		if (ino != ip->i_ino)
+			xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
+					offset);
+	} else if (!strncmp("..", name, namelen)) {
+		/*
+		 * If this is ".." in the root inode, check that the inum
+		 * matches this dir.
+		 */
+		if (xfs_sb_version_hasftype(&mp->m_sb) && type != DT_DIR)
+			xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
+					offset);
+		if (ip->i_ino == mp->m_sb.sb_rootino && ino != ip->i_ino)
+			xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
+					offset);
+	}
+
+	/* Verify the file type.  This function absorbs error codes. */
+	error = xfs_scrub_dir_check_ftype(sdc, offset, lookup_ino, type);
+	if (error)
+		goto out;
+out:
+	return error;
+fail_xref:
+	return error;
+}
+
+/* Scrub a directory btree record. */
+STATIC int
+xfs_scrub_dir_rec(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	void				*rec)
+{
+	struct xfs_mount		*mp = ds->state->mp;
+	struct xfs_dir2_leaf_entry	*ent = rec;
+	struct xfs_inode		*dp = ds->dargs.dp;
+	struct xfs_dir2_data_entry	*dent;
+	struct xfs_buf			*bp;
+	xfs_ino_t			ino;
+	xfs_dablk_t			rec_bno;
+	xfs_dir2_db_t			db;
+	xfs_dir2_data_aoff_t		off;
+	xfs_dir2_dataptr_t		ptr;
+	xfs_dahash_t			calc_hash;
+	xfs_dahash_t			hash;
+	unsigned int			tag;
+	int				error;
+
+	/* Check the hash of the entry. */
+	error = xfs_scrub_da_btree_hash(ds, level, &ent->hashval);
+	if (error)
+		goto out;
+
+	/* Valid hash pointer? */
+	ptr = be32_to_cpu(ent->address);
+	if (ptr == 0)
+		return 0;
+
+	/* Find the directory entry's location. */
+	db = xfs_dir2_dataptr_to_db(mp->m_dir_geo, ptr);
+	off = xfs_dir2_dataptr_to_off(mp->m_dir_geo, ptr);
+	rec_bno = xfs_dir2_db_to_da(mp->m_dir_geo, db);
+
+	if (rec_bno >= mp->m_dir_geo->leafblk) {
+		xfs_scrub_da_set_corrupt(ds, level);
+		goto out;
+	}
+	error = xfs_dir3_data_read(ds->dargs.trans, dp, rec_bno, -2, &bp);
+	if (!xfs_scrub_fblock_op_ok(ds->sc, XFS_DATA_FORK, rec_bno, &error))
+		goto out;
+	if (!bp) {
+		xfs_scrub_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno);
+		goto out;
+	}
+
+	/* Retrieve the entry and check it. */
+	dent = (struct xfs_dir2_data_entry *)(((char *)bp->b_addr) + off);
+	ino = be64_to_cpu(dent->inumber);
+	hash = be32_to_cpu(ent->hashval);
+	tag = be16_to_cpup(dp->d_ops->data_entry_tag_p(dent));
+	if (xfs_dir_ino_validate(mp, ino) != 0 ||
+	    xfs_internal_inum(mp, ino) ||
+	    tag != off)
+		xfs_scrub_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno);
+	if (dent->namelen == 0) {
+		xfs_scrub_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno);
+		goto out_relse;
+	}
+	calc_hash = xfs_da_hashname(dent->name, dent->namelen);
+	if (calc_hash != hash)
+		xfs_scrub_fblock_set_corrupt(ds->sc, XFS_DATA_FORK, rec_bno);
+
+out_relse:
+	xfs_trans_brelse(ds->dargs.trans, bp);
+out:
+	return error;
+}
+
+/* Scrub a whole directory. */
+int
+xfs_scrub_directory(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_scrub_dir_ctx	sdc = {
+		.dc.actor = xfs_scrub_dir_actor,
+		.dc.pos = 0,
+	};
+	size_t				bufsize;
+	loff_t				oldpos;
+	int				error = 0;
+
+	if (!S_ISDIR(VFS_I(sc->ip)->i_mode))
+		return -ENOENT;
+
+	/* Plausible size? */
+	if (sc->ip->i_d.di_size < xfs_dir2_sf_hdr_size(0)) {
+		xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino, NULL);
+		goto out;
+	}
+
+	/* Check directory tree structure */
+	error = xfs_scrub_da_btree(sc, XFS_DATA_FORK, xfs_scrub_dir_rec);
+	if (error)
+		return error;
+
+	/*
+	 * Check that every dirent we see can also be looked up by hash.
+	 * Userspace usually asks for a 32k buffer, so we will too.
+	 */
+	bufsize = (size_t)min_t(loff_t, 32768, sc->ip->i_d.di_size);
+	sdc.sc = sc;
+
+	/*
+	 * Look up every name in this directory by hash.
+	 *
+	 * The VFS grabs a read or write lock via i_rwsem before it reads
+	 * or writes to a directory.  If we've gotten this far we've
+	 * already obtained IOLOCK_EXCL, which (since 4.10) is the same as
+	 * getting a write lock on i_rwsem.  Therefore, it is safe for us
+	 * to drop the ILOCK here in order to reuse the _readdir and
+	 * _dir_lookup routines, which do their own ILOCK locking.
+	 */
+	oldpos = 0;
+	sc->ilock_flags &= ~XFS_ILOCK_EXCL;
+	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL);
+	while (true) {
+		error = xfs_readdir(sc->tp, sc->ip, &sdc.dc, bufsize);
+		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+			goto out;
+		if (oldpos == sdc.dc.pos)
+			break;
+		oldpos = sdc.dc.pos;
+	}
+
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index b10d627..f3a84d9 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -221,6 +221,10 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_inode_bmap,
 		.scrub	= xfs_scrub_bmap_cow,
 	},
+	{ /* directory */
+		.setup	= xfs_scrub_setup_directory,
+		.scrub	= xfs_scrub_directory,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 8920ccf..844506e 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -82,5 +82,6 @@ int xfs_scrub_inode(struct xfs_scrub_context *sc);
 int xfs_scrub_bmap_data(struct xfs_scrub_context *sc);
 int xfs_scrub_bmap_attr(struct xfs_scrub_context *sc);
 int xfs_scrub_bmap_cow(struct xfs_scrub_context *sc);
+int xfs_scrub_directory(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */



* [PATCH 20/25] xfs: scrub directory freespace
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (18 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 19/25] xfs: scrub directory metadata Darrick J. Wong
@ 2017-10-03 20:42 ` Darrick J. Wong
  2017-10-09  1:44   ` Dave Chinner
  2017-10-03 20:43 ` [PATCH 21/25] xfs: scrub extended attributes Darrick J. Wong
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:42 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Check the free space information in a directory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/dir.c |  347 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 347 insertions(+)
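
Note for readers: the rule behind the free entry check ("either in the
bestfree or smaller than all of them") can be sketched in standalone
userspace C. This is an illustration of the stated rule, not the kernel
code: struct bestfree, FD_COUNT and free_entry_ok() are hypothetical
names, and lengths are assumed to be already converted from big-endian.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define FD_COUNT 3	/* stand-in for XFS_DIR2_DATA_FD_COUNT */

/* Hypothetical CPU-endian copy of one bestfree slot. */
struct bestfree {
	uint16_t	offset;		/* 0 means the slot is unused */
	uint16_t	length;
};

/*
 * A free region of @len bytes is consistent with the bestfree table if
 * a used slot records that length, or if it is no longer than the
 * smallest length in the table (so it had no claim to a slot).
 */
static bool
free_entry_ok(const struct bestfree bf[FD_COUNT], uint16_t len)
{
	unsigned int	smallest = UINT_MAX;
	int		i;

	for (i = 0; i < FD_COUNT; i++) {
		if (bf[i].offset && bf[i].length == len)
			return true;
		if (bf[i].length < smallest)
			smallest = bf[i].length;
	}
	return len <= smallest;
}
```

An unused slot counts as length zero here, so any unlisted free region
in a block whose table still has an empty slot is reported as
inconsistent.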


diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index e58252b..6ea06c3 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -239,6 +239,348 @@ xfs_scrub_dir_rec(
 	return error;
 }
 
+/* Is this free entry either in the bestfree or smaller than all of them? */
+static inline void
+xfs_scrub_directory_check_free_entry(
+	struct xfs_scrub_context	*sc,
+	xfs_dablk_t			lblk,
+	struct xfs_dir2_data_free	*bf,
+	struct xfs_dir2_data_unused	*dup)
+{
+	struct xfs_dir2_data_free	*dfp;
+	unsigned int			smallest;
+
+	smallest = -1U;
+	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
+		if (dfp->offset &&
+		    be16_to_cpu(dfp->length) == be16_to_cpu(dup->length))
+			return;
+		if (smallest > be16_to_cpu(dfp->length))
+			smallest = be16_to_cpu(dfp->length);
+	}
+
+	if (be16_to_cpu(dup->length) > smallest)
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+}
+
+/* Check free space info in a directory data block. */
+STATIC int
+xfs_scrub_directory_data_bestfree(
+	struct xfs_scrub_context	*sc,
+	xfs_dablk_t			lblk,
+	bool				is_block)
+{
+	struct xfs_dir2_data_unused	*dup;
+	struct xfs_dir2_data_free	*dfp;
+	struct xfs_buf			*bp;
+	struct xfs_dir2_data_free	*bf;
+	struct xfs_mount		*mp = sc->mp;
+	char				*ptr;
+	char				*endptr;
+	u16				tag;
+	int				newlen;
+	int				offset;
+	int				error;
+
+	if (is_block) {
+		/* dir block format */
+		if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET))
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
+	} else {
+		/* dir data format */
+		error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, -1, &bp);
+	}
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
+		goto out;
+
+	/* Do the bestfrees correspond to actual free space? */
+	bf = sc->ip->d_ops->data_bestfree_p(bp->b_addr);
+	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
+		offset = be16_to_cpu(dfp->offset);
+		if (offset == 0)
+			continue;
+		if (offset >= BBTOB(bp->b_length)) {
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+			continue;
+		}
+		dup = (struct xfs_dir2_data_unused *)(bp->b_addr + offset);
+		tag = be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup));
+
+		if (dup->freetag != cpu_to_be16(XFS_DIR2_DATA_FREE_TAG) ||
+		    be16_to_cpu(dup->length) != be16_to_cpu(dfp->length) ||
+		    tag != ((char *)dup - (char *)bp->b_addr))
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+	}
+
+	/* Make sure the bestfrees are actually the best free spaces. */
+	ptr = (char *)sc->ip->d_ops->data_entry_p(bp->b_addr);
+	if (is_block) {
+		struct xfs_dir2_block_tail	*btp;
+
+		btp = xfs_dir2_block_tail_p(sc->mp->m_dir_geo, bp->b_addr);
+		endptr = (char *)xfs_dir2_block_leaf_p(btp);
+	} else
+		endptr = (char *)bp->b_addr + BBTOB(bp->b_length);
+	while (ptr < endptr) {
+		dup = (struct xfs_dir2_data_unused *)ptr;
+		/* Skip real entries */
+		if (dup->freetag != cpu_to_be16(XFS_DIR2_DATA_FREE_TAG)) {
+			struct xfs_dir2_data_entry	*dep;
+
+			dep = (struct xfs_dir2_data_entry *)ptr;
+			newlen = sc->ip->d_ops->data_entsize(dep->namelen);
+			if (newlen <= 0) {
+				xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
+						lblk);
+				goto out_buf;
+			}
+			ptr += newlen;
+			if (endptr < ptr)
+				xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
+					      lblk);
+			continue;
+		}
+
+		/* Spot check this free entry */
+		tag = be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup));
+		if (tag != ((char *)dup - (char *)bp->b_addr))
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+
+		/*
+		 * Either this entry is a bestfree or it's smaller than
+		 * any of the bestfrees.
+		 */
+		xfs_scrub_directory_check_free_entry(sc, lblk, bf, dup);
+
+		/* Move on. */
+		newlen = be16_to_cpu(dup->length);
+		if (newlen <= 0) {
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+			goto out_buf;
+		}
+		ptr += newlen;
+		if (endptr < ptr)
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+	}
+out_buf:
+	xfs_trans_brelse(sc->tp, bp);
+out:
+	return error;
+}
+
+/* Is this the longest free entry in the block? */
+static inline void
+xfs_scrub_directory_check_freesp(
+	struct xfs_scrub_context	*sc,
+	xfs_dablk_t			lblk,
+	struct xfs_buf			*dbp,
+	unsigned int			len)
+{
+	struct xfs_dir2_data_free	*bf;
+	struct xfs_dir2_data_free	*dfp;
+	unsigned int			longest = 0;
+	int				offset;
+
+	bf = sc->ip->d_ops->data_bestfree_p(dbp->b_addr);
+	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
+		offset = be16_to_cpu(dfp->offset);
+		if (!offset)
+			continue;
+		if (longest < be16_to_cpu(dfp->length))
+			longest = be16_to_cpu(dfp->length);
+	}
+
+	if (longest != len)
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+}
+
+/* Check free space info in a directory leaf1 block. */
+STATIC int
+xfs_scrub_directory_leaf1_bestfree(
+	struct xfs_scrub_context	*sc,
+	struct xfs_da_args		*args,
+	xfs_dablk_t			lblk)
+{
+	struct xfs_dir2_leaf_tail	*ltp;
+	struct xfs_buf			*dbp;
+	struct xfs_buf			*bp;
+	struct xfs_mount		*mp = sc->mp;
+	__be16				*bestp;
+	__u16				best;
+	int				i;
+	int				error;
+
+	/* Read the leaf1 block, which holds the bests array. */
+	error = xfs_dir3_leaf_read(sc->tp, sc->ip, lblk, -1, &bp);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
+		goto out;
+
+	/* Check all the entries. */
+	ltp = xfs_dir2_leaf_tail_p(mp->m_dir_geo, bp->b_addr);
+	bestp = xfs_dir2_leaf_bests_p(ltp);
+	for (i = 0; i < be32_to_cpu(ltp->bestcount); i++, bestp++) {
+		best = be16_to_cpu(*bestp);
+		if (best == NULLDATAOFF)
+			continue;
+		error = xfs_dir3_data_read(sc->tp, sc->ip,
+				i * args->geo->fsbcount, -1, &dbp);
+		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
+			continue;
+		xfs_scrub_directory_check_freesp(sc, lblk, dbp, best);
+		xfs_trans_brelse(sc->tp, dbp);
+	}
+out:
+	return error;
+}
+
+/* Check free space info in a directory freespace block. */
+STATIC int
+xfs_scrub_directory_free_bestfree(
+	struct xfs_scrub_context	*sc,
+	struct xfs_da_args		*args,
+	xfs_dablk_t			lblk)
+{
+	struct xfs_dir3_icfree_hdr	freehdr;
+	struct xfs_buf			*dbp;
+	struct xfs_buf			*bp;
+	__be16				*bestp;
+	__u16				best;
+	int				i;
+	int				error;
+
+	/* Read the free space block */
+	error = xfs_dir2_free_read(sc->tp, sc->ip, lblk, &bp);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
+		goto out;
+
+	/* Check all the entries. */
+	sc->ip->d_ops->free_hdr_from_disk(&freehdr, bp->b_addr);
+	bestp = sc->ip->d_ops->free_bests_p(bp->b_addr);
+	for (i = 0; i < freehdr.nvalid; i++, bestp++) {
+		best = be16_to_cpu(*bestp);
+		if (best == NULLDATAOFF)
+			continue;
+		error = xfs_dir3_data_read(sc->tp, sc->ip,
+				(freehdr.firstdb + i) * args->geo->fsbcount,
+				-1, &dbp);
+		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
+			continue;
+		xfs_scrub_directory_check_freesp(sc, lblk, dbp, best);
+		xfs_trans_brelse(sc->tp, dbp);
+	}
+out:
+	return error;
+}
+
+/* Check free space information in directories. */
+STATIC int
+xfs_scrub_directory_blocks(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_bmbt_irec		got;
+	struct xfs_da_args		args;
+	struct xfs_ifork		*ifp;
+	struct xfs_mount		*mp = sc->mp;
+	xfs_fileoff_t			leaf_lblk;
+	xfs_fileoff_t			free_lblk;
+	xfs_fileoff_t			lblk;
+	xfs_extnum_t			idx;
+	bool				found;
+	int				is_block = 0;
+	int				error;
+
+	/* Ignore local format directories. */
+	if (sc->ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
+	    sc->ip->i_d.di_format != XFS_DINODE_FMT_BTREE)
+		return 0;
+
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	lblk = XFS_B_TO_FSB(mp, XFS_DIR2_DATA_OFFSET);
+	leaf_lblk = XFS_B_TO_FSB(mp, XFS_DIR2_LEAF_OFFSET);
+	free_lblk = XFS_B_TO_FSB(mp, XFS_DIR2_FREE_OFFSET);
+
+	/* Is this a block dir? */
+	args.dp = sc->ip;
+	args.geo = mp->m_dir_geo;
+	args.trans = sc->tp;
+	error = xfs_dir2_isblock(&args, &is_block);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
+		goto out;
+
+	/* Iterate all the data extents in the directory... */
+	found = xfs_iext_lookup_extent(sc->ip, ifp, lblk, &idx, &got);
+	while (found) {
+		/* No more data blocks... */
+		if (got.br_startoff >= leaf_lblk)
+			break;
+
+		/* Check each data block's bestfree data */
+		for (lblk = roundup((xfs_dablk_t)got.br_startoff,
+				args.geo->fsbcount);
+		     lblk < got.br_startoff + got.br_blockcount;
+		     lblk += args.geo->fsbcount) {
+			error = xfs_scrub_directory_data_bestfree(sc, lblk,
+					is_block);
+			if (error)
+				goto out;
+		}
+
+		found = xfs_iext_get_extent(ifp, ++idx, &got);
+	}
+
+	/* Look for a leaf1 block, which has free info. */
+	if (xfs_iext_lookup_extent(sc->ip, ifp, leaf_lblk, &idx, &got) &&
+	    got.br_startoff == leaf_lblk &&
+	    got.br_blockcount == args.geo->fsbcount &&
+	    !xfs_iext_get_extent(ifp, ++idx, &got)) {
+		if (is_block) {
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+			goto not_leaf1;
+		}
+		error = xfs_scrub_directory_leaf1_bestfree(sc, &args,
+				leaf_lblk);
+		if (error)
+			goto out;
+	}
+not_leaf1:
+
+	/* Scan for free blocks */
+	lblk = free_lblk;
+	found = xfs_iext_lookup_extent(sc->ip, ifp, lblk, &idx, &got);
+	while (found) {
+		/*
+		 * Dirs can't have blocks mapped above 2^32.
+		 * Single-block dirs shouldn't even be here.
+		 */
+		lblk = got.br_startoff;
+		if (lblk & ~0xFFFFFFFFULL) {
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+			goto out;
+		}
+		if (is_block) {
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
+			goto nextfree;
+		}
+
+		/* Check each dir free block's bestfree data */
+		for (lblk = roundup((xfs_dablk_t)got.br_startoff,
+				args.geo->fsbcount);
+		     lblk < got.br_startoff + got.br_blockcount;
+		     lblk += args.geo->fsbcount) {
+			error = xfs_scrub_directory_free_bestfree(sc, &args,
+					lblk);
+			if (error)
+				goto out;
+		}
+
+nextfree:
+		found = xfs_iext_get_extent(ifp, ++idx, &got);
+	}
+out:
+	return error;
+}
+
 /* Scrub a whole directory. */
 int
 xfs_scrub_directory(
@@ -266,6 +608,11 @@ xfs_scrub_directory(
 	if (error)
 		return error;
 
+	/* Check the freespace. */
+	error = xfs_scrub_directory_blocks(sc);
+	if (error)
+		return error;
+
 	/*
 	 * Check that every dirent we see can also be looked up by hash.
 	 * Userspace usually asks for a 32k buffer, so we will too.


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 21/25] xfs: scrub extended attributes
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (19 preceding siblings ...)
  2017-10-03 20:42 ` [PATCH 20/25] xfs: scrub directory freespace Darrick J. Wong
@ 2017-10-03 20:43 ` Darrick J. Wong
  2017-10-09  2:13   ` Dave Chinner
  2017-10-03 20:43 ` [PATCH 22/25] xfs: scrub symbolic links Darrick J. Wong
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:43 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Scrub the hash tree, keys, and values in an extended attribute structure.
Refactor the attribute code to use the transaction if the caller supplied
one to avoid buffer deadlocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    3 -
 fs/xfs/scrub/attr.c    |  218 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h  |    2 
 fs/xfs/scrub/scrub.c   |    8 ++
 fs/xfs/scrub/scrub.h   |    2 
 6 files changed, 233 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/attr.c
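
As a rough illustration of the per-entry sanity checks that
xfs_scrub_xattr_rec() performs before it recomputes the name hash,
here is a userspace sketch.  It is not kernel code: struct leaf_entry
is a hypothetical host-endian stand-in for struct xfs_attr_leaf_entry,
the ATTR_* bit values are assumptions chosen for the example, and
leaf_entry_plausible() is an invented name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; assumed values, not taken from the patch. */
#define ATTR_LOCAL	(1u << 0)
#define ATTR_ROOT	(1u << 1)
#define ATTR_SECURE	(1u << 2)
#define ATTR_INCOMPLETE	(1u << 7)

/* Simplified, host-endian stand-in for struct xfs_attr_leaf_entry. */
struct leaf_entry {
	uint32_t	hashval;
	uint16_t	nameidx;	/* byte offset of the name in the block */
	uint8_t		flags;
};

/*
 * Return true if the entry's name offset lies between the end of the
 * leaf header and the end of the block, and no unknown flag is set.
 */
static bool
leaf_entry_plausible(const struct leaf_entry *ent, unsigned int hdrsize,
		     unsigned int blksize)
{
	const unsigned int known = ATTR_LOCAL | ATTR_ROOT | ATTR_SECURE |
				   ATTR_INCOMPLETE;

	assert(hdrsize < blksize);
	if (ent->nameidx < hdrsize || ent->nameidx >= blksize)
		return false;
	if ((unsigned int)ent->flags & ~known)
		return false;
	return true;
}
```

The patch itself keeps going after a bad flag so that it can still
compare hashes; the sketch folds both checks into one verdict for
brevity.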


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 69aa88e..4d46399 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -148,6 +148,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   trace.o \
 				   agheader.o \
 				   alloc.o \
+				   attr.o \
 				   bmap.o \
 				   btree.o \
 				   common.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b16d004..0834ce6 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -499,9 +499,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_BMBTA	13	/* attr fork block mapping */
 #define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
 #define XFS_SCRUB_TYPE_DIR	15	/* directory */
+#define XFS_SCRUB_TYPE_XATTR	16	/* extended attribute */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	16
+#define XFS_SCRUB_TYPE_NR	17
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
new file mode 100644
index 0000000..319c42b
--- /dev/null
+++ b/fs/xfs/scrub/attr.c
@@ -0,0 +1,218 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_attr.h"
+#include "xfs_attr_leaf.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/dabtree.h"
+#include "scrub/trace.h"
+
+#include <linux/posix_acl_xattr.h>
+#include <linux/xattr.h>
+
+/* Set us up to scrub an inode's extended attributes. */
+int
+xfs_scrub_setup_xattr(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	/* Allocate the buffer without the inode lock held. */
+	sc->buf = kmem_zalloc_large(XATTR_SIZE_MAX, KM_SLEEP);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	return xfs_scrub_setup_inode_contents(sc, ip, 0);
+}
+
+/* Extended Attributes */
+
+struct xfs_scrub_xattr {
+	struct xfs_attr_list_context	context;
+	struct xfs_scrub_context	*sc;
+};
+
+/* Check that an extended attribute key can be looked up by hash. */
+static void
+xfs_scrub_xattr_listent(
+	struct xfs_attr_list_context	*context,
+	int				flags,
+	unsigned char			*name,
+	int				namelen,
+	int				valuelen)
+{
+	struct xfs_scrub_xattr		*sx;
+	struct xfs_da_args		args = {0};
+	int				error = 0;
+
+	sx = container_of(context, struct xfs_scrub_xattr, context);
+
+	args.flags = ATTR_KERNOTIME;
+	if (flags & XFS_ATTR_ROOT)
+		args.flags |= ATTR_ROOT;
+	else if (flags & XFS_ATTR_SECURE)
+		args.flags |= ATTR_SECURE;
+	args.geo = context->dp->i_mount->m_attr_geo;
+	args.whichfork = XFS_ATTR_FORK;
+	args.dp = context->dp;
+	args.name = name;
+	args.namelen = namelen;
+	args.hashval = xfs_da_hashname(args.name, args.namelen);
+	args.trans = context->tp;
+	args.value = sx->sc->buf;
+	args.valuelen = XATTR_SIZE_MAX;
+
+	error = xfs_attr_get_ilocked(context->dp, &args);
+	if (error == -EEXIST)
+		error = 0;
+	if (!xfs_scrub_fblock_op_ok(sx->sc, XFS_ATTR_FORK, args.blkno, &error))
+		goto fail_xref;
+	if (args.valuelen != valuelen)
+		xfs_scrub_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK,
+					     args.blkno);
+
+fail_xref:
+	return;
+}
+
+/* Scrub an attribute btree record. */
+STATIC int
+xfs_scrub_xattr_rec(
+	struct xfs_scrub_da_btree	*ds,
+	int				level,
+	void				*rec)
+{
+	struct xfs_mount		*mp = ds->state->mp;
+	struct xfs_attr_leaf_entry	*ent = rec;
+	struct xfs_da_state_blk		*blk;
+	struct xfs_attr_leaf_name_local	*lentry;
+	struct xfs_attr_leaf_name_remote	*rentry;
+	struct xfs_buf			*bp;
+	xfs_dahash_t			calc_hash;
+	xfs_dahash_t			hash;
+	int				nameidx;
+	int				hdrsize;
+	unsigned int			badflags;
+	int				error;
+
+	blk = &ds->state->path.blk[level];
+
+	/* Check the hash of the entry. */
+	error = xfs_scrub_da_btree_hash(ds, level, &ent->hashval);
+	if (error)
+		goto out;
+
+	/* Find the attr entry's location. */
+	bp = blk->bp;
+	hdrsize = xfs_attr3_leaf_hdr_size(bp->b_addr);
+	nameidx = be16_to_cpu(ent->nameidx);
+	if (nameidx < hdrsize || nameidx >= mp->m_attr_geo->blksize) {
+		xfs_scrub_da_set_corrupt(ds, level);
+		goto out;
+	}
+
+	/* Retrieve the entry and check it. */
+	hash = be32_to_cpu(ent->hashval);
+	badflags = ~(XFS_ATTR_LOCAL | XFS_ATTR_ROOT | XFS_ATTR_SECURE |
+			XFS_ATTR_INCOMPLETE);
+	if ((ent->flags & badflags) != 0)
+		xfs_scrub_da_set_corrupt(ds, level);
+	if (ent->flags & XFS_ATTR_LOCAL) {
+		lentry = (struct xfs_attr_leaf_name_local *)
+				(((char *)bp->b_addr) + nameidx);
+		if (lentry->namelen <= 0) {
+			xfs_scrub_da_set_corrupt(ds, level);
+			goto out;
+		}
+		calc_hash = xfs_da_hashname(lentry->nameval, lentry->namelen);
+	} else {
+		rentry = (struct xfs_attr_leaf_name_remote *)
+				(((char *)bp->b_addr) + nameidx);
+		if (rentry->namelen <= 0) {
+			xfs_scrub_da_set_corrupt(ds, level);
+			goto out;
+		}
+		calc_hash = xfs_da_hashname(rentry->name, rentry->namelen);
+	}
+	if (calc_hash != hash)
+		xfs_scrub_da_set_corrupt(ds, level);
+
+out:
+	return error;
+}
+
+/* Scrub the extended attribute metadata. */
+int
+xfs_scrub_xattr(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_scrub_xattr		sx = { 0 };
+	struct attrlist_cursor_kern	cursor = { 0 };
+	int				error = 0;
+
+	if (!xfs_inode_hasattr(sc->ip))
+		return -ENOENT;
+
+	memset(&sx, 0, sizeof(sx));
+	/* Check attribute tree structure */
+	error = xfs_scrub_da_btree(sc, XFS_ATTR_FORK, xfs_scrub_xattr_rec);
+	if (error)
+		goto out;
+
+	/* Check that every attr key can also be looked up by hash. */
+	sx.context.dp = sc->ip;
+	sx.context.cursor = &cursor;
+	sx.context.resynch = 1;
+	sx.context.put_listent = xfs_scrub_xattr_listent;
+	sx.context.tp = sc->tp;
+	sx.sc = sc;
+
+	/*
+	 * Look up every xattr in this file by name.
+	 *
+	 * The VFS only locks i_rwsem when modifying attrs, so keep all
+	 * three locks held because that's the only way to ensure we're
+	 * the only thread poking into the da btree.  We traverse the da
+	 * btree while holding a leaf buffer locked for the xattr name
+	 * iteration, which doesn't really follow the usual buffer
+	 * locking order.
+	 */
+	error = xfs_attr_list_int_ilocked(&sx.context);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_ATTR_FORK, 0, &error))
+		goto out;
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index f530301..b2d3bc7 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -96,6 +96,8 @@ int xfs_scrub_setup_inode_bmap_data(struct xfs_scrub_context *sc,
 				    struct xfs_inode *ip);
 int xfs_scrub_setup_directory(struct xfs_scrub_context *sc,
 			      struct xfs_inode *ip);
+int xfs_scrub_setup_xattr(struct xfs_scrub_context *sc,
+			  struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index f3a84d9..d1b80dc 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -152,6 +152,10 @@ xfs_scrub_teardown(
 			iput(VFS_I(sc->ip));
 		sc->ip = NULL;
 	}
+	if (sc->buf) {
+		kmem_free(sc->buf);
+		sc->buf = NULL;
+	}
 	return error;
 }
 
@@ -225,6 +229,10 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_directory,
 		.scrub	= xfs_scrub_directory,
 	},
+	{ /* extended attributes */
+		.setup	= xfs_scrub_setup_xattr,
+		.scrub	= xfs_scrub_xattr,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 844506e..d31ff58 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -59,6 +59,7 @@ struct xfs_scrub_context {
 	const struct xfs_scrub_meta_ops	*ops;
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip;
+	void				*buf;
 	uint				ilock_flags;
 	bool				try_harder;
 
@@ -83,5 +84,6 @@ int xfs_scrub_bmap_data(struct xfs_scrub_context *sc);
 int xfs_scrub_bmap_attr(struct xfs_scrub_context *sc);
 int xfs_scrub_bmap_cow(struct xfs_scrub_context *sc);
 int xfs_scrub_directory(struct xfs_scrub_context *sc);
+int xfs_scrub_xattr(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 22/25] xfs: scrub symbolic links
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (20 preceding siblings ...)
  2017-10-03 20:43 ` [PATCH 21/25] xfs: scrub extended attributes Darrick J. Wong
@ 2017-10-03 20:43 ` Darrick J. Wong
  2017-10-09  2:17   ` Dave Chinner
  2017-10-03 20:43 ` [PATCH 23/25] xfs: scrub parent pointers Darrick J. Wong
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:43 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create the infrastructure to scrub symbolic link data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 +
 fs/xfs/libxfs/xfs_fs.h |    3 +-
 fs/xfs/scrub/common.h  |    2 +
 fs/xfs/scrub/scrub.c   |    4 ++
 fs/xfs/scrub/scrub.h   |    1 +
 fs/xfs/scrub/symlink.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/symlink.c
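
The inline-symlink branch of xfs_scrub_symlink() boils down to three
length comparisons.  A userspace sketch of that logic follows.  It is
not kernel code: SYMLINK_MAXLEN is an assumed stand-in for
XFS_SYMLINK_MAXLEN, fork_size plays the role of XFS_IFORK_DSIZE(ip),
and bounded_len() and inline_symlink_ok() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SYMLINK_MAXLEN	1024	/* assumed stand-in for XFS_SYMLINK_MAXLEN */

/* Length of @s up to the first NUL, looking at no more than @max bytes. */
static size_t
bounded_len(const char *s, size_t max)
{
	const char	*nul = memchr(s, '\0', max);

	return nul ? (size_t)(nul - s) : max;
}

/*
 * Return true if an inline symlink target stored in a data fork of
 * @fork_size bytes is consistent with the recorded size @len.
 */
static bool
inline_symlink_ok(const char *data, size_t fork_size, long long len)
{
	assert(data != NULL);
	if (len <= 0 || len > SYMLINK_MAXLEN)
		return false;
	if ((size_t)len > fork_size)
		return false;
	/* The stored string must be at least as long as the size says. */
	return (size_t)len <= bounded_len(data, fork_size);
}
```

As in the patch, a stored string that is longer than the recorded
size is not flagged; only a string that ends too early is.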


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 4d46399..28637a6 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -159,5 +159,6 @@ xfs-y				+= $(addprefix scrub/, \
 				   refcount.o \
 				   rmap.o \
 				   scrub.o \
+				   symlink.o \
 				   )
 endif
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 0834ce6..bb8bcd0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -500,9 +500,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_BMBTC	14	/* CoW fork block mapping */
 #define XFS_SCRUB_TYPE_DIR	15	/* directory */
 #define XFS_SCRUB_TYPE_XATTR	16	/* extended attribute */
+#define XFS_SCRUB_TYPE_SYMLINK	17	/* symbolic link */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	17
+#define XFS_SCRUB_TYPE_NR	18
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index b2d3bc7..30a1756 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -98,6 +98,8 @@ int xfs_scrub_setup_directory(struct xfs_scrub_context *sc,
 			      struct xfs_inode *ip);
 int xfs_scrub_setup_xattr(struct xfs_scrub_context *sc,
 			  struct xfs_inode *ip);
+int xfs_scrub_setup_symlink(struct xfs_scrub_context *sc,
+			    struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index d1b80dc..5839a36 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -233,6 +233,10 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_xattr,
 		.scrub	= xfs_scrub_xattr,
 	},
+	{ /* symbolic link */
+		.setup	= xfs_scrub_setup_symlink,
+		.scrub	= xfs_scrub_symlink,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index d31ff58..dc4ed8d 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -85,5 +85,6 @@ int xfs_scrub_bmap_attr(struct xfs_scrub_context *sc);
 int xfs_scrub_bmap_cow(struct xfs_scrub_context *sc);
 int xfs_scrub_directory(struct xfs_scrub_context *sc);
 int xfs_scrub_xattr(struct xfs_scrub_context *sc);
+int xfs_scrub_symlink(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c
new file mode 100644
index 0000000..0ae3201
--- /dev/null
+++ b/fs/xfs/scrub/symlink.c
@@ -0,0 +1,92 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_symlink.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/* Set us up to scrub a symbolic link. */
+int
+xfs_scrub_setup_symlink(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	/* Allocate the buffer without the inode lock held. */
+	sc->buf = kmem_zalloc_large(XFS_SYMLINK_MAXLEN + 1, KM_SLEEP);
+	if (!sc->buf)
+		return -ENOMEM;
+
+	return xfs_scrub_setup_inode_contents(sc, ip, 0);
+}
+
+/* Symbolic links. */
+
+int
+xfs_scrub_symlink(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_inode		*ip = sc->ip;
+	struct xfs_ifork		*ifp;
+	loff_t				len;
+	int				error = 0;
+
+	if (!S_ISLNK(VFS_I(ip)->i_mode))
+		return -ENOENT;
+	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	len = ip->i_d.di_size;
+
+	/* Plausible size? */
+	if (len > XFS_SYMLINK_MAXLEN || len <= 0) {
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+		goto out;
+	}
+
+	/* Inline symlink? */
+	if (ifp->if_flags & XFS_IFINLINE) {
+		if (len > XFS_IFORK_DSIZE(ip) ||
+		    len > strnlen(ifp->if_u1.if_data, XFS_IFORK_DSIZE(ip)))
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+		goto out;
+	}
+
+	/* Remote symlink; must read the contents. */
+	error = xfs_readlink_bmap_ilocked(sc->ip, sc->buf);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+		goto out;
+	if (strnlen(sc->buf, XFS_SYMLINK_MAXLEN) < len)
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+out:
+	return error;
+}


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 23/25] xfs: scrub parent pointers
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (21 preceding siblings ...)
  2017-10-03 20:43 ` [PATCH 22/25] xfs: scrub symbolic links Darrick J. Wong
@ 2017-10-03 20:43 ` Darrick J. Wong
  2017-10-03 20:43 ` [PATCH 24/25] xfs: scrub realtime bitmap/summary Darrick J. Wong
  2017-10-03 20:43 ` [PATCH 25/25] xfs: scrub quota information Darrick J. Wong
  24 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:43 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Scrub parent pointers, sort of.  For directories, we can ride the
'..' entry up to the parent to confirm that there's at most one
dentry that points back to this directory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    3 -
 fs/xfs/scrub/common.h  |    2 
 fs/xfs/scrub/parent.c  |  265 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c   |    4 +
 fs/xfs/scrub/scrub.h   |    1 
 6 files changed, 275 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/parent.c
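
The counting loop in xfs_scrub_parent_count_parent_dentries() keeps
calling readdir until the position stops advancing, and the actor
bumps a counter for every dentry whose inode number matches.  Here is
a userspace sketch of that pattern.  It is not kernel code: the
directory is a plain array, read_batch() is a hypothetical stand-in
for xfs_readdir() plus the actor callback, and all names are invented
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One directory entry in the toy directory. */
struct dirent_rec {
	const char	*name;
	uint64_t	ino;
};

/* Stand-in for struct xfs_scrub_parent_ctx. */
struct count_ctx {
	size_t		pos;	/* next entry to visit */
	uint64_t	ino;	/* inode number we are looking for */
	unsigned int	nr;	/* matching dentries seen so far */
};

/* Visit up to @batch entries starting at ctx->pos, counting matches. */
static void
read_batch(const struct dirent_rec *dir, size_t nents, size_t batch,
	   struct count_ctx *ctx)
{
	size_t		end = ctx->pos + batch;

	if (end > nents)
		end = nents;
	for (; ctx->pos < end; ctx->pos++)
		if (dir[ctx->pos].ino == ctx->ino)
			ctx->nr++;
}

/* Count the entries in @dir that point at @ino. */
static unsigned int
count_dentries(const struct dirent_rec *dir, size_t nents, size_t batch,
	       uint64_t ino)
{
	struct count_ctx	ctx = { .pos = 0, .ino = ino, .nr = 0 };
	size_t			oldpos = 0;

	assert(batch > 0);
	for (;;) {
		read_batch(dir, nents, batch, &ctx);
		if (oldpos == ctx.pos)
			break;	/* no progress, so we hit the end */
		oldpos = ctx.pos;
	}
	return ctx.nr;
}
```

The caller then compares the count against what it expects for the
inode being scrubbed.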


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 28637a6..2193a54 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -156,6 +156,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   dir.o \
 				   ialloc.o \
 				   inode.o \
+				   parent.o \
 				   refcount.o \
 				   rmap.o \
 				   scrub.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index bb8bcd0..7444094 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -501,9 +501,10 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_DIR	15	/* directory */
 #define XFS_SCRUB_TYPE_XATTR	16	/* extended attribute */
 #define XFS_SCRUB_TYPE_SYMLINK	17	/* symbolic link */
+#define XFS_SCRUB_TYPE_PARENT	18	/* parent pointers */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	18
+#define XFS_SCRUB_TYPE_NR	19
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 30a1756..bde5add 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -100,6 +100,8 @@ int xfs_scrub_setup_xattr(struct xfs_scrub_context *sc,
 			  struct xfs_inode *ip);
 int xfs_scrub_setup_symlink(struct xfs_scrub_context *sc,
 			    struct xfs_inode *ip);
+int xfs_scrub_setup_parent(struct xfs_scrub_context *sc,
+			   struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
new file mode 100644
index 0000000..235b2e6
--- /dev/null
+++ b/fs/xfs/scrub/parent.c
@@ -0,0 +1,265 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/* Set us up to scrub parents. */
+int
+xfs_scrub_setup_parent(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	return xfs_scrub_setup_inode_contents(sc, ip, 0);
+}
+
+/* Parent pointers */
+
+/* Look for an entry in a parent pointing to this inode. */
+
+struct xfs_scrub_parent_ctx {
+	struct dir_context		dc;
+	xfs_ino_t			ino;
+	xfs_nlink_t			nr;
+};
+
+/* Look for a single entry in a directory pointing to an inode. */
+STATIC int
+xfs_scrub_parent_actor(
+	struct dir_context		*dc,
+	const char			*name,
+	int				namelen,
+	loff_t				pos,
+	u64				ino,
+	unsigned			type)
+{
+	struct xfs_scrub_parent_ctx	*spc;
+
+	spc = container_of(dc, struct xfs_scrub_parent_ctx, dc);
+	if (spc->ino == ino)
+		spc->nr++;
+	return 0;
+}
+
+/* Count the number of dentries in the parent dir that point to this inode. */
+STATIC int
+xfs_scrub_parent_count_parent_dentries(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*parent,
+	xfs_nlink_t			*nr)
+{
+	struct xfs_scrub_parent_ctx	spc = {
+		.dc.actor = xfs_scrub_parent_actor,
+		.dc.pos = 0,
+		.ino = sc->ip->i_ino,
+		.nr = 0,
+	};
+	struct xfs_ifork		*ifp;
+	size_t				bufsize;
+	loff_t				oldpos;
+	uint				lock_mode;
+	int				error;
+
+	/*
+	 * Load the parent directory's extent map.  A regular directory
+	 * open would start readahead (and thus load the extent map)
+	 * before we even got to a readdir call, but this isn't
+	 * guaranteed here.
+	 */
+	lock_mode = xfs_ilock_data_map_shared(parent);
+	ifp = XFS_IFORK_PTR(parent, XFS_DATA_FORK);
+	if (XFS_IFORK_FORMAT(parent, XFS_DATA_FORK) == XFS_DINODE_FMT_BTREE &&
+	    !(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(sc->tp, parent, XFS_DATA_FORK);
+		if (error) {
+			xfs_iunlock(parent, lock_mode);
+			return error;
+		}
+	}
+	xfs_iunlock(parent, lock_mode);
+
+	/*
+	 * Iterate the parent dir to confirm that there is
+	 * exactly one entry pointing back to the inode being
+	 * scanned.
+	 */
+	bufsize = (size_t)min_t(loff_t, 32768, parent->i_d.di_size);
+	oldpos = 0;
+	while (true) {
+		error = xfs_readdir(sc->tp, parent, &spc.dc, bufsize);
+		if (error)
+			goto out;
+		if (oldpos == spc.dc.pos)
+			break;
+		oldpos = spc.dc.pos;
+	}
+	*nr = spc.nr;
+out:
+	return error;
+}
+
+/* Scrub a parent pointer. */
+int
+xfs_scrub_parent(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_inode		*dp = NULL;
+	xfs_ino_t			dnum;
+	xfs_nlink_t			expected_nr;
+	xfs_nlink_t			nr;
+	int				tries = 0;
+	int				error;
+
+	/*
+	 * If we're a directory, check that the '..' link points up to
+	 * a directory that has one entry pointing to us.
+	 */
+	if (!S_ISDIR(VFS_I(sc->ip)->i_mode))
+		return -ENOENT;
+
+	/*
+	 * If we're an unlinked directory, the parent /won't/ have a link
+	 * to us.  Otherwise, it should have one link.
+	 */
+	expected_nr = VFS_I(sc->ip)->i_nlink == 0 ? 0 : 1;
+
+	/*
+	 * The VFS grabs a read or write lock via i_rwsem before it reads
+	 * or writes to a directory.  If we've gotten this far we've
+	 * already obtained IOLOCK_EXCL, which (since 4.10) is the same as
+	 * getting a write lock on i_rwsem.  Therefore, it is safe for us
+	 * to drop the ILOCK and MMAPLOCK here in order to do directory lookups.
+	 */
+	sc->ilock_flags &= ~(XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL);
+	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL);
+
+	/* Look up '..' */
+	error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+		goto out;
+
+	/* Is this the root dir?  Then '..' must point to itself. */
+	if (sc->ip == mp->m_rootip) {
+		if (sc->ip->i_ino != mp->m_sb.sb_rootino ||
+		    sc->ip->i_ino != dnum)
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+		return 0;
+	}
+
+try_again:
+	/* Otherwise, '..' must not point to ourselves. */
+	if (sc->ip->i_ino == dnum) {
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+		goto out;
+	}
+
+	error = xfs_iget(mp, sc->tp, dnum, 0, 0, &dp);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+		goto out;
+	if (dp == sc->ip) {
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+		goto out_rele;
+	}
+
+	/*
+	 * We prefer to keep the inode locked while we lock and search
+	 * its alleged parent for a forward reference.  However, this
+	 * child -> parent scheme can deadlock with the parent -> child
+	 * scheme that is normally used.  Therefore, if we can lock the
+	 * parent, just validate the references and get out.
+	 */
+	if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) {
+		error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nr);
+		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+			goto out_unlock;
+		if (nr != expected_nr)
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+		goto out_unlock;
+	}
+
+	/*
+	 * The game changes if we get here.  We failed to lock the parent,
+	 * so we're going to try to verify both pointers while only holding
+	 * one lock so as to avoid deadlocking with something that's actually
+	 * trying to traverse down the directory tree.
+	 */
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	sc->ilock_flags = 0;
+	xfs_ilock(dp, XFS_IOLOCK_SHARED);
+
+	/* Go looking for our dentry. */
+	error = xfs_scrub_parent_count_parent_dentries(sc, dp, &nr);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+		goto out_unlock;
+
+	/* Drop the parent lock, relock this inode. */
+	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
+	sc->ilock_flags = XFS_IOLOCK_EXCL;
+	xfs_ilock(sc->ip, sc->ilock_flags);
+
+	/* Look up '..' to see if the inode changed. */
+	error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+		goto out_rele;
+
+	/* Drat, parent changed.  Try again! */
+	if (dnum != dp->i_ino) {
+		iput(VFS_I(dp));
+		tries++;
+		if (tries < 20)
+			goto try_again;
+		xfs_scrub_set_incomplete(sc);
+		goto out;
+	}
+	iput(VFS_I(dp));
+
+	/*
+	 * '..' didn't change, so check that the parent had the
+	 * expected number of entries for us.
+	 */
+	if (nr != expected_nr)
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+	goto out;
+
+out_unlock:
+	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
+out_rele:
+	iput(VFS_I(dp));
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 5839a36..5d57a5b 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -237,6 +237,10 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_symlink,
 		.scrub	= xfs_scrub_symlink,
 	},
+	{ /* parent pointers */
+		.setup	= xfs_scrub_setup_parent,
+		.scrub	= xfs_scrub_parent,
+	},
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index dc4ed8d..a264810 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -86,5 +86,6 @@ int xfs_scrub_bmap_cow(struct xfs_scrub_context *sc);
 int xfs_scrub_directory(struct xfs_scrub_context *sc);
 int xfs_scrub_xattr(struct xfs_scrub_context *sc);
 int xfs_scrub_symlink(struct xfs_scrub_context *sc);
+int xfs_scrub_parent(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */



* [PATCH 24/25] xfs: scrub realtime bitmap/summary
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (22 preceding siblings ...)
  2017-10-03 20:43 ` [PATCH 23/25] xfs: scrub parent pointers Darrick J. Wong
@ 2017-10-03 20:43 ` Darrick J. Wong
  2017-10-09  2:28   ` Dave Chinner
  2017-10-03 20:43 ` [PATCH 25/25] xfs: scrub quota information Darrick J. Wong
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:43 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Perform simple tests of the realtime bitmap and summary.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    2 +
 fs/xfs/libxfs/xfs_format.h |    5 ++
 fs/xfs/libxfs/xfs_fs.h     |    4 +-
 fs/xfs/scrub/common.h      |    1 
 fs/xfs/scrub/rtbitmap.c    |   98 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c       |   15 +++++++
 fs/xfs/scrub/scrub.h       |    2 +
 7 files changed, 126 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/rtbitmap.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 2193a54..9ce581e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -162,4 +162,6 @@ xfs-y				+= $(addprefix scrub/, \
 				   scrub.o \
 				   symlink.o \
 				   )
+
+xfs-$(CONFIG_XFS_RT)		+= scrub/rtbitmap.o
 endif
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 154c3dd..d4d9bef 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -315,6 +315,11 @@ static inline bool xfs_sb_good_version(struct xfs_sb *sbp)
 	return false;
 }
 
+static inline bool xfs_sb_version_hasrealtime(struct xfs_sb *sbp)
+{
+	return sbp->sb_rblocks > 0;
+}
+
 /*
  * Detect a mismatched features2 field.  Older kernels read/wrote
  * this into the wrong slot, so to be safe we keep them in sync.
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 7444094..f8bac92 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -502,9 +502,11 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_XATTR	16	/* extended attribute */
 #define XFS_SCRUB_TYPE_SYMLINK	17	/* symbolic link */
 #define XFS_SCRUB_TYPE_PARENT	18	/* parent pointers */
+#define XFS_SCRUB_TYPE_RTBITMAP	19	/* realtime bitmap */
+#define XFS_SCRUB_TYPE_RTSUM	20	/* realtime summary */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	19
+#define XFS_SCRUB_TYPE_NR	21
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index bde5add..f5f8d70 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -102,6 +102,7 @@ int xfs_scrub_setup_symlink(struct xfs_scrub_context *sc,
 			    struct xfs_inode *ip);
 int xfs_scrub_setup_parent(struct xfs_scrub_context *sc,
 			   struct xfs_inode *ip);
+int xfs_scrub_setup_rt(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
new file mode 100644
index 0000000..e15f8f3
--- /dev/null
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_rtalloc.h"
+#include "xfs_inode.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/* Set us up with the realtime metadata locked. */
+int
+xfs_scrub_setup_rt(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	struct xfs_mount		*mp = sc->mp;
+	int				lockmode;
+	int				error = 0;
+
+	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
+		return -EINVAL;
+
+	error = xfs_scrub_setup_fs(sc, ip);
+	if (error)
+		return error;
+
+	lockmode = XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP;
+	xfs_ilock(mp->m_rbmip, lockmode);
+	xfs_trans_ijoin(sc->tp, mp->m_rbmip, lockmode);
+
+	return 0;
+}
+
+/* Realtime bitmap. */
+
+/* Scrub a free extent record from the realtime bitmap. */
+STATIC int
+xfs_scrub_rtbitmap_helper(
+	struct xfs_trans		*tp,
+	struct xfs_rtalloc_rec		*rec,
+	void				*priv)
+{
+	return 0;
+}
+
+/* Scrub the realtime bitmap. */
+int
+xfs_scrub_rtbitmap(
+	struct xfs_scrub_context	*sc)
+{
+	int				error;
+
+	error = xfs_rtalloc_query_all(sc->tp, xfs_scrub_rtbitmap_helper, NULL);
+	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
+		goto out;
+
+out:
+	return error;
+}
+
+/* Scrub the realtime summary. */
+int
+xfs_scrub_rtsummary(
+	struct xfs_scrub_context	*sc)
+{
+	/* XXX: implement this some day */
+	return -ENOENT;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 5d57a5b..348e3c3 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -241,6 +241,21 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 		.setup	= xfs_scrub_setup_parent,
 		.scrub	= xfs_scrub_parent,
 	},
+#ifdef CONFIG_XFS_RT
+	{ /* realtime bitmap */
+		.setup	= xfs_scrub_setup_rt,
+		.scrub	= xfs_scrub_rtbitmap,
+		.has	= xfs_sb_version_hasrealtime,
+	},
+	{ /* realtime summary */
+		.setup	= xfs_scrub_setup_rt,
+		.scrub	= xfs_scrub_rtsummary,
+		.has	= xfs_sb_version_hasrealtime,
+	},
+#else
+	{ NULL },
+	{ NULL },
+#endif
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index a264810..329c169 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -87,5 +87,7 @@ int xfs_scrub_directory(struct xfs_scrub_context *sc);
 int xfs_scrub_xattr(struct xfs_scrub_context *sc);
 int xfs_scrub_symlink(struct xfs_scrub_context *sc);
 int xfs_scrub_parent(struct xfs_scrub_context *sc);
+int xfs_scrub_rtbitmap(struct xfs_scrub_context *sc);
+int xfs_scrub_rtsummary(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */



* [PATCH 25/25] xfs: scrub quota information
  2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
                   ` (23 preceding siblings ...)
  2017-10-03 20:43 ` [PATCH 24/25] xfs: scrub realtime bitmap/summary Darrick J. Wong
@ 2017-10-03 20:43 ` Darrick J. Wong
  2017-10-09  2:51   ` Dave Chinner
  24 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-03 20:43 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Perform some quick sanity testing of the disk quota information.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    5 +
 fs/xfs/scrub/common.h  |    1 
 fs/xfs/scrub/quota.c   |  259 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/scrub.c   |   18 +++
 fs/xfs/scrub/scrub.h   |    1 
 6 files changed, 284 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/quota.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 9ce581e..3152469 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -164,4 +164,5 @@ xfs-y				+= $(addprefix scrub/, \
 				   )
 
 xfs-$(CONFIG_XFS_RT)		+= scrub/rtbitmap.o
+xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 endif
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index f8bac92..b9092410 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -504,9 +504,12 @@ struct xfs_scrub_metadata {
 #define XFS_SCRUB_TYPE_PARENT	18	/* parent pointers */
 #define XFS_SCRUB_TYPE_RTBITMAP	19	/* realtime bitmap */
 #define XFS_SCRUB_TYPE_RTSUM	20	/* realtime summary */
+#define XFS_SCRUB_TYPE_UQUOTA	21	/* user quotas */
+#define XFS_SCRUB_TYPE_GQUOTA	22	/* group quotas */
+#define XFS_SCRUB_TYPE_PQUOTA	23	/* project quotas */
 
 /* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR	21
+#define XFS_SCRUB_TYPE_NR	24
 
 /* i: Repair this metadata. */
 #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index f5f8d70..26b08df 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -103,6 +103,7 @@ int xfs_scrub_setup_symlink(struct xfs_scrub_context *sc,
 int xfs_scrub_setup_parent(struct xfs_scrub_context *sc,
 			   struct xfs_inode *ip);
 int xfs_scrub_setup_rt(struct xfs_scrub_context *sc, struct xfs_inode *ip);
+int xfs_scrub_setup_quota(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
 void xfs_scrub_ag_free(struct xfs_scrub_context *sc, struct xfs_scrub_ag *sa);
 int xfs_scrub_ag_init(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
new file mode 100644
index 0000000..fb50258
--- /dev/null
+++ b/fs/xfs/scrub/quota.c
@@ -0,0 +1,259 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_dquot.h"
+#include "xfs_dquot_item.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+
+/* Convert a scrub type code to a DQ flag, or return 0 if error. */
+static inline uint
+xfs_scrub_quota_to_dqtype(
+	struct xfs_scrub_context	*sc)
+{
+	switch (sc->sm->sm_type) {
+	case XFS_SCRUB_TYPE_UQUOTA:
+		return XFS_DQ_USER;
+	case XFS_SCRUB_TYPE_GQUOTA:
+		return XFS_DQ_GROUP;
+	case XFS_SCRUB_TYPE_PQUOTA:
+		return XFS_DQ_PROJ;
+	default:
+		return 0;
+	}
+}
+
+/* Set us up to scrub a quota. */
+int
+xfs_scrub_setup_quota(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	uint				dqtype;
+
+	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
+		return -EINVAL;
+
+	dqtype = xfs_scrub_quota_to_dqtype(sc);
+	if (dqtype == 0)
+		return -EINVAL;
+	return 0;
+}
+
+/* Quotas. */
+
+/* Scrub the fields in an individual quota item. */
+STATIC void
+xfs_scrub_quota_item(
+	struct xfs_scrub_context	*sc,
+	uint				dqtype,
+	struct xfs_dquot		*dq,
+	xfs_dqid_t			id)
+{
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_disk_dquot		*d = &dq->q_core;
+	struct xfs_quotainfo		*qi = mp->m_quotainfo;
+	xfs_fileoff_t			offset;
+	unsigned long long		bsoft;
+	unsigned long long		isoft;
+	unsigned long long		rsoft;
+	unsigned long long		bhard;
+	unsigned long long		ihard;
+	unsigned long long		rhard;
+	unsigned long long		bcount;
+	unsigned long long		icount;
+	unsigned long long		rcount;
+	xfs_ino_t			inodes;
+
+	/* Did we get the dquot we wanted? */
+	offset = id / qi->qi_dqperchunk;
+	if (id > be32_to_cpu(d->d_id) ||
+	    dqtype != (d->d_flags & XFS_DQ_ALLTYPES))
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+
+	/* Check the limits. */
+	bhard = be64_to_cpu(d->d_blk_hardlimit);
+	ihard = be64_to_cpu(d->d_ino_hardlimit);
+	rhard = be64_to_cpu(d->d_rtb_hardlimit);
+
+	bsoft = be64_to_cpu(d->d_blk_softlimit);
+	isoft = be64_to_cpu(d->d_ino_softlimit);
+	rsoft = be64_to_cpu(d->d_rtb_softlimit);
+
+	inodes = XFS_AGINO_TO_INO(mp, mp->m_sb.sb_agcount, 0);
+
+	/*
+	 * Warn if the limits are larger than the fs.  Administrators
+	 * can do this, though in production this seems suspect.
+	 */
+	if (bhard > mp->m_sb.sb_dblocks || bsoft > mp->m_sb.sb_dblocks)
+		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
+	if (ihard > inodes || isoft > inodes)
+		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
+	if (rhard > mp->m_sb.sb_rblocks || rsoft > mp->m_sb.sb_rblocks)
+		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
+
+	/* Soft limits must not exceed the hard limits. */
+	if (bsoft > bhard || isoft > ihard || rsoft > rhard)
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+
+	/* Check the resource counts. */
+	bcount = be64_to_cpu(d->d_bcount);
+	icount = be64_to_cpu(d->d_icount);
+	rcount = be64_to_cpu(d->d_rtbcount);
+	inodes = percpu_counter_sum(&mp->m_icount);
+
+	/*
+	 * Check that usage doesn't exceed physical limits.  However, on
+	 * a reflink filesystem we're allowed to exceed physical space
+	 * if there are no quota limits.
+	 */
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		if (mp->m_sb.sb_dblocks < bcount)
+			xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK,
+					offset);
+	} else {
+		if (mp->m_sb.sb_dblocks < bcount)
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
+					offset);
+	}
+	if (icount > inodes || rcount > mp->m_sb.sb_rblocks)
+		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+
+	/*
+	 * We can violate the hard limits if the admin suddenly sets a
+	 * lower limit than the actual usage.  However, we flag it for
+	 * admin review.
+	 */
+	if (id != 0 && bhard != 0 && bcount > bhard)
+		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
+	if (id != 0 && ihard != 0 && icount > ihard)
+		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
+	if (id != 0 && rhard != 0 && rcount > rhard)
+		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
+}
+
+/* Scrub all of a quota type's items. */
+int
+xfs_scrub_quota(
+	struct xfs_scrub_context	*sc)
+{
+	struct xfs_bmbt_irec		irec = { 0 };
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_inode		*ip;
+	struct xfs_quotainfo		*qi = mp->m_quotainfo;
+	struct xfs_dquot		*dq;
+	xfs_fileoff_t			max_dqid_off;
+	xfs_fileoff_t			off = 0;
+	xfs_dqid_t			id = 0;
+	uint				dqtype;
+	int				nimaps;
+	int				error;
+
+	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
+		return -ENOENT;
+
+	mutex_lock(&qi->qi_quotaofflock);
+	dqtype = xfs_scrub_quota_to_dqtype(sc);
+	if (!xfs_this_quota_on(sc->mp, dqtype)) {
+		error = -ENOENT;
+		goto out;
+	}
+
+	/* Attach to the quota inode and set sc->ip so that reporting works. */
+	ip = xfs_quota_inode(sc->mp, dqtype);
+	sc->ip = ip;
+
+	/* Look for problem extents. */
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
+	while (1) {
+		if (xfs_scrub_should_terminate(sc, &error))
+			break;
+
+		off = irec.br_startoff + irec.br_blockcount;
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, off, -1, &irec, &nimaps,
+				XFS_BMAPI_ENTIRE);
+		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, off, &error))
+			goto out_unlock;
+		if (!nimaps)
+			break;
+		if (irec.br_startblock == HOLESTARTBLOCK)
+			continue;
+
+		/*
+		 * Delalloc extents or blocks mapped above the highest
+		 * quota id shouldn't happen.
+		 */
+		if (isnullstartblock(irec.br_startblock) ||
+		    irec.br_startoff > max_dqid_off ||
+		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
+			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
+	}
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	/* Check all the quota items. */
+	while (id < ((xfs_dqid_t)-1ULL)) {
+		if (xfs_scrub_should_terminate(sc, &error))
+			break;
+
+		error = xfs_qm_dqget(mp, NULL, id, dqtype, XFS_QMOPT_DQNEXT,
+				&dq);
+		if (error == -ENOENT)
+			break;
+		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK,
+				id / qi->qi_dqperchunk, &error))
+			goto out;
+
+		xfs_scrub_quota_item(sc, dqtype, dq, id);
+
+		id = be32_to_cpu(dq->q_core.d_id) + 1;
+		xfs_qm_dqput(dq);
+		if (!id)
+			break;
+	}
+	goto out;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	sc->ip = NULL;
+	mutex_unlock(&qi->qi_quotaofflock);
+	return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 348e3c3..0849b3f 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -256,6 +256,24 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
 	{ NULL },
 	{ NULL },
 #endif
+#ifdef CONFIG_XFS_QUOTA
+	{ /* user quota */
+		.setup = xfs_scrub_setup_quota,
+		.scrub = xfs_scrub_quota,
+	},
+	{ /* group quota */
+		.setup = xfs_scrub_setup_quota,
+		.scrub = xfs_scrub_quota,
+	},
+	{ /* project quota */
+		.setup = xfs_scrub_setup_quota,
+		.scrub = xfs_scrub_quota,
+	},
+#else
+	{ NULL },
+	{ NULL },
+	{ NULL },
+#endif
 };
 
 /* This isn't a stable feature, warn once per day. */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 329c169..e5bc1aa 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -89,5 +89,6 @@ int xfs_scrub_symlink(struct xfs_scrub_context *sc);
 int xfs_scrub_parent(struct xfs_scrub_context *sc);
 int xfs_scrub_rtbitmap(struct xfs_scrub_context *sc);
 int xfs_scrub_rtsummary(struct xfs_scrub_context *sc);
+int xfs_scrub_quota(struct xfs_scrub_context *sc);
 
 #endif	/* __XFS_SCRUB_SCRUB_H__ */



* Re: [PATCH 03/25] xfs: probe the scrub ioctl
  2017-10-03 20:41 ` [PATCH 03/25] xfs: probe the scrub ioctl Darrick J. Wong
@ 2017-10-03 23:32   ` Dave Chinner
  2017-10-04  0:02     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-03 23:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:08PM -0700, Darrick J. Wong wrote:
> +/*
> + * Scrub probe -- userspace uses this to probe if we're willing to
> + * scrub or repair a given mountpoint.
> + */
> +int
> +xfs_scrub_probe(
> +	struct xfs_scrub_context	*sc)
> +{
> +	if (sc->sm->sm_ino || sc->sm->sm_agno)
> +		return -EINVAL;
> +	if (sc->sm->sm_gen & ~XFS_SCRUB_FLAGS_OUT)

sm_flags?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/25] xfs: create helpers to record and deal with scrub problems
  2017-10-03 20:41 ` [PATCH 04/25] xfs: create helpers to record and deal with scrub problems Darrick J. Wong
@ 2017-10-03 23:44   ` Dave Chinner
  2017-10-04  0:56     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-03 23:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:14PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create helper functions to record crc and corruption problems, and
> deal with any other runtime errors that arise.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
....
> +/*
> + * Handling scrub corruption/optimization/warning checks.
> + *
> + * The *_set_{corrupt,preen,warning}() family of functions are used to
> + * record the presence of metadata that is incorrect (corrupt), could be
> + * optimized somehow (preen), or should be flagged for administrative
> + * review but is not incorrect (warn).
> + *
> + * ftrace can be used to record the precise metadata location and
> + * approximate code location of the failed check.
> + */
> +
> +/* Record a block which could be optimized. */
> +void
> +xfs_scrub_block_set_preen(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*bp)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_fsblock_t			fsbno;
> +	xfs_agnumber_t			agno;
> +	xfs_agblock_t			bno;
> +
> +	fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> +	agno = XFS_FSB_TO_AGNO(mp, fsbno);
> +	bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> +
> +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
> +	trace_xfs_scrub_block_preen(sc, agno, bno, __return_address);

Are agno/agbno used here only for the tracepoint? If
so, the tracepoint call could simply be:

	trace_xfs_scrub_block_preen(sc, bp->b_bn, __return_address);

and the tracepoint internal implementation can split that into
agno/agbno.


> +}
> +
> +/*
> + * Record an inode which could be optimized.  The trace data will
> + * include the block given by bp if bp is given; otherwise it will use
> + * the block location of the inode record itself.
> + */
> +void
> +xfs_scrub_ino_set_preen(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*bp)
> +{
> +	struct xfs_inode		*ip = sc->ip;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_fsblock_t			fsbno;
> +	xfs_agnumber_t			agno;
> +	xfs_agblock_t			bno;
> +
> +	if (bp) {
> +		fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> +		agno = XFS_FSB_TO_AGNO(mp, fsbno);
> +		bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> +	} else {
> +		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +		bno = XFS_INO_TO_AGINO(mp, ip->i_ino);

That's not a block number. Going to be mighty confusing because the
trace output isn't going to tell you whether bno is an agbno or an
agino....
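
To illustrate the distinction, here is a minimal userspace sketch of how
an inode number splits into an AG number, an AG-relative inode number
(agino), and the AG-relative block (agbno) that holds the inode.  The
struct geo type and the ino_to_*() helpers are hypothetical stand-ins
for the superblock fields and the kernel macros, and the geometry values
are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the superblock geometry fields. */
struct geo {
	unsigned int	agblklog;	/* log2(blocks per AG), rounded up */
	unsigned int	inopblog;	/* log2(inodes per block) */
};

/* AG number: the bits above the AG-relative inode number. */
static uint32_t ino_to_agno(const struct geo *g, uint64_t ino)
{
	return (uint32_t)(ino >> (g->agblklog + g->inopblog));
}

/* AG-relative inode number: an inode index, not a block number. */
static uint32_t ino_to_agino(const struct geo *g, uint64_t ino)
{
	uint64_t	mask = (1ULL << (g->agblklog + g->inopblog)) - 1;

	return (uint32_t)(ino & mask);
}

/* AG-relative block holding the inode: drop the inode-in-block bits. */
static uint32_t ino_to_agbno(const struct geo *g, uint64_t ino)
{
	return ino_to_agino(g, ino) >> g->inopblog;
}
```

With eight inodes per block, agino and agbno differ by a factor of
eight, so a trace field named "bno" that sometimes holds an agino is
easy to misread.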

> +	}
> +
> +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
> +	trace_xfs_scrub_ino_preen(sc, ip->i_ino, agno, bno, __return_address);
> +}
> +
> +/* Record a corrupt block. */
> +void
> +xfs_scrub_block_set_corrupt(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_buf			*bp)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_fsblock_t			fsbno;
> +	xfs_agnumber_t			agno;
> +	xfs_agblock_t			bno;
> +
> +	fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> +	agno = XFS_FSB_TO_AGNO(mp, fsbno);
> +	bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> +
> +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> +	trace_xfs_scrub_block_error(sc, agno, bno, __return_address);
> +}
> +
> +/*
> + * Record a corrupt inode.  The trace data will include the block given
> + * by bp if bp is given; otherwise it will use the block location of the
> + * inode record itself.
> + */
> +void
> +xfs_scrub_ino_set_corrupt(
> +	struct xfs_scrub_context	*sc,
> +	xfs_ino_t			ino,
> +	struct xfs_buf			*bp)
> +{
> +	struct xfs_inode		*ip = sc->ip;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_fsblock_t			fsbno;
> +	xfs_agnumber_t			agno;
> +	xfs_agblock_t			bno;
> +
> +	if (bp) {
> +		fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> +		agno = XFS_FSB_TO_AGNO(mp, fsbno);
> +		bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> +	} else {
> +		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +		bno = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +	}
> +
> +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> +	trace_xfs_scrub_ino_error(sc, ino, agno, bno, __return_address);

Pattern seems to be repeated...

>  DEFINE_SCRUB_EVENT(xfs_scrub_start);
>  DEFINE_SCRUB_EVENT(xfs_scrub_done);
> +DEFINE_SCRUB_EVENT(xfs_scrub_deadlock_retry);
> +
> +TRACE_EVENT(xfs_scrub_op_error,
> +	TP_PROTO(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
> +		 xfs_agblock_t bno, int error, void *ret_ip),
> +	TP_ARGS(sc, agno, bno, error, ret_ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned int, type)
> +		__field(xfs_agnumber_t, agno)
> +		__field(xfs_agblock_t, bno)
> +		__field(int, error)
> +		__field(void *, ret_ip)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = sc->mp->m_super->s_dev;
> +		__entry->type = sc->sm->sm_type;
> +		__entry->agno = agno;
> +		__entry->bno = bno;
> +		__entry->error = error;
> +		__entry->ret_ip = ret_ip;

i.e. we can put the decoding in here, which makes it all conditional
on the tracepoint being enabled....
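
A rough userspace sketch of that idea, assuming a plain boolean in place
of the tracepoint's enable check: the caller hands over the raw disk
address and the split into agno/agbno only happens when tracing is on.
All names here (struct geo, trace_enabled, decode_daddr,
trace_block_preen) are hypothetical, and the geometry is simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry; the real values come from the superblock. */
struct geo {
	unsigned int	blkbb_log;	/* log2(512-byte sectors per fs block) */
	uint32_t	agblocks;	/* fs blocks per AG */
};

struct trace_rec {
	uint32_t	agno;
	uint32_t	agbno;
};

static bool		trace_enabled;	/* stands in for the tracepoint switch */
static unsigned int	decode_calls;	/* how often we paid for decoding */

/* Split a disk address into (agno, agbno). */
static void decode_daddr(const struct geo *g, uint64_t daddr,
			 struct trace_rec *r)
{
	uint64_t	fsb = daddr >> g->blkbb_log;

	decode_calls++;
	r->agno = (uint32_t)(fsb / g->agblocks);
	r->agbno = (uint32_t)(fsb % g->agblocks);
}

/* The caller passes the raw daddr; decoding is skipped when tracing is off. */
static bool trace_block_preen(const struct geo *g, uint64_t daddr,
			      struct trace_rec *out)
{
	if (!trace_enabled)
		return false;
	decode_daddr(g, daddr, out);
	return true;
}
```

The callers then shrink to setting the flag and passing bp->b_bn, and
the conversion cost disappears from the common untraced path.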

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/25] xfs: create helpers to scrub a metadata btree
  2017-10-03 20:41 ` [PATCH 05/25] xfs: create helpers to scrub a metadata btree Darrick J. Wong
@ 2017-10-03 23:49   ` Dave Chinner
  2017-10-04  0:13     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-03 23:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:20PM -0700, Darrick J. Wong wrote:
> +/* btree scrubbing */
> +
> +/*
> + * Check for btree operation errors.  See the section about handling
> + * operational errors in common.c.
> + */
> +bool
> +xfs_scrub_btree_op_ok(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_btree_cur		*cur,
> +	int				level,
> +	int				*error)
> +{
......
> +
> +/*
> + * Visit all nodes and leaves of a btree.  Check that all pointers and
> + * records are in order, that the keys reflect the records, and use a callback
> + * so that the caller can verify individual records.  The callback is the same
> + * as the one for xfs_btree_query_range, so therefore this function also
> + * returns XFS_BTREE_QUERY_RANGE_ABORT, zero, or a negative error code.
> + */
> +int
> +xfs_scrub_btree(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_btree_cur		*cur,
> +	xfs_scrub_btree_rec_fn		scrub_fn,
> +	struct xfs_owner_info		*oinfo,
> +	void				*private)
> +{
> +	xfs_scrub_btree_op_ok(sc, cur, 0, false);

Looks like this patch won't compile - it's passing a bool to an int
*. I know it's just a stub and probably gets fixed later, but
better not to leave a bisect landmine...

-Dave
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/25] xfs: probe the scrub ioctl
  2017-10-03 23:32   ` Dave Chinner
@ 2017-10-04  0:02     ` Darrick J. Wong
  2017-10-04  1:56       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  0:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 10:32:47AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:08PM -0700, Darrick J. Wong wrote:
> > +/*
> > + * Scrub probe -- userspace uses this to probe if we're willing to
> > + * scrub or repair a given mountpoint.
> > + */
> > +int
> > +xfs_scrub_probe(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	if (sc->sm->sm_ino || sc->sm->sm_agno)
> > +		return -EINVAL;
> > +	if (sc->sm->sm_gen & ~XFS_SCRUB_FLAGS_OUT)
> 
> sm_flags?

sm_gen is correct because in xfs_scrub_probe we reflect back to
userspace any valid scrub outflags that were passed in via sm_gen.
Therefore we have to check sm_gen for unknown output flags so that we
can error out on any invalid inputs from userland.

xfs_scrub_metadata() unconditionally clears all the out flags from
sm_flags on its way in, prior to any of the scrub handlers being called.

(It's possible I'm misunderstanding your question.)
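
To make the intent concrete, here is a minimal userspace sketch of the
probe logic.  The demo_* names and flag values are invented for
illustration, and the "reflect back" step is my reading of the
behaviour described above, not a copy of the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output flags; the real values live in xfs_fs.h. */
#define DEMO_OFLAG_CORRUPT	(1u << 1)
#define DEMO_OFLAG_PREEN	(1u << 2)
#define DEMO_FLAGS_OUT		(DEMO_OFLAG_CORRUPT | DEMO_OFLAG_PREEN)

/* Cut-down stand-in for the scrub request structure. */
struct demo_scrub_req {
	uint64_t	ino;
	uint32_t	gen;
	uint32_t	agno;
	uint32_t	flags;
};

static int demo_probe(struct demo_scrub_req *sm)
{
	assert(sm != NULL);

	/* A probe is not aimed at any inode or AG. */
	if (sm->ino || sm->agno)
		return -EINVAL;

	/* gen carries the out flags to echo; reject bits we don't know. */
	if (sm->gen & ~DEMO_FLAGS_OUT)
		return -EINVAL;

	/* Reflect the requested (and valid) out flags back to the caller. */
	sm->flags |= sm->gen & DEMO_FLAGS_OUT;
	return 0;
}
```

Since the out flags in flags have already been cleared by the time the
handler runs, gen is the only field that can still hold garbage here.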

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/25] xfs: create helpers to scrub a metadata btree
  2017-10-03 23:49   ` Dave Chinner
@ 2017-10-04  0:13     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  0:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 10:49:13AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:20PM -0700, Darrick J. Wong wrote:
> > +/* btree scrubbing */
> > +
> > +/*
> > + * Check for btree operation errors.  See the section about handling
> > + * operational errors in common.c.
> > + */
> > +bool
> > +xfs_scrub_btree_op_ok(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_btree_cur		*cur,
> > +	int				level,
> > +	int				*error)
> > +{
> ......
> > +
> > +/*
> > + * Visit all nodes and leaves of a btree.  Check that all pointers and
> > + * records are in order, that the keys reflect the records, and use a callback
> > + * so that the caller can verify individual records.  The callback is the same
> > + * as the one for xfs_btree_query_range, so therefore this function also
> > + * returns XFS_BTREE_QUERY_RANGE_ABORT, zero, or a negative error code.
> > + */
> > +int
> > +xfs_scrub_btree(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_btree_cur		*cur,
> > +	xfs_scrub_btree_rec_fn		scrub_fn,
> > +	struct xfs_owner_info		*oinfo,
> > +	void				*private)
> > +{
> > +	xfs_scrub_btree_op_ok(sc, cur, 0, false);
> 
> Looks like this patch won't compile - it's passing a bool to an int
> *. I know it's just a stub and probably gets fixed later, but
> better not to leave a bisect landmine...

Frighteningly it /does/ compile because the 'false' apparently gets
converted to NULL and therefore there's no type error.  Fortunately
there are no callers so it'll never blow up... but hey, I'll fix it
anyway.  Good catch!
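
For the curious, a small userspace sketch of why there is no type
error.  It assumes that 'false' is an enum constant equal to zero, as
I believe the kernel defines it; the demo_* names are made up.  An
integer constant expression with the value 0 is a valid null pointer
constant in C, so the call is accepted (some compilers may still warn):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to mirror the kernel's definition: an enum, not a macro. */
enum { demo_false = 0, demo_true = 1 };

static int demo_seen_null;

/* Same shape as the helper: the last argument is an int pointer. */
static void demo_op_ok(int *error)
{
	demo_seen_null = (error == NULL);
}

static void demo_stub(void)
{
	/*
	 * demo_false is an integer constant expression equal to 0, so
	 * it converts silently to a null int pointer.
	 */
	demo_op_ok(demo_false);
}
```

Passing demo_true instead would not be a null pointer constant and
should at least draw an int-to-pointer diagnostic.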

--D

> 
> -Dave
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/25] xfs: scrub the shape of a metadata btree
  2017-10-03 20:41 ` [PATCH 06/25] xfs: scrub the shape of " Darrick J. Wong
@ 2017-10-04  0:15   ` Dave Chinner
  2017-10-04  3:51     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  0:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:27PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a function that can check the shape of a btree -- each block
> passes basic inspection and all the pointers look ok.  In the next patch
> we'll add the ability to check the actual keys and records stored within
> the btree.  Add some helper functions so that we report detailed scrub
> errors in a uniform manner in dmesg.  These are helper functions for
> subsequent patches.
.....
>  
> +/* Check a btree pointer.  Returns true if it's ok to use this pointer. */
> +static bool
> +xfs_scrub_btree_ptr_ok(
> +	struct xfs_scrub_btree		*bs,
> +	int				level,
> +	union xfs_btree_ptr		*ptr)
> +{
> +	struct xfs_btree_cur		*cur = bs->cur;
> +	xfs_daddr_t			daddr;
> +	xfs_daddr_t			eofs;
> +
> +	if (xfs_btree_ptr_is_null(cur, ptr)) {
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> +		return false;
> +	}
> +	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
> +		daddr = XFS_FSB_TO_DADDR(cur->bc_mp, be64_to_cpu(ptr->l));
> +	} else {
> +		ASSERT(cur->bc_private.a.agno != NULLAGNUMBER);
> +		daddr = XFS_AGB_TO_DADDR(cur->bc_mp, cur->bc_private.a.agno,
> +				be32_to_cpu(ptr->s));
> +	}
> +	eofs = XFS_FSB_TO_BB(cur->bc_mp, cur->bc_mp->m_sb.sb_dblocks);
> +	if (daddr == 0 || daddr >= eofs) {
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> +		return false;
> +	}
> +
> +	return true;
> +}

There seems to be quite a bit of overlap here with
xfs_btree_check_ptr(). Indeed, for the short pointers the above code
fails to check it is within the bounds of the AG size. I'd suggest
both of these should use the same validity checking functions....

....
> +/*
> + * Grab and scrub a btree block given a btree pointer.  Returns block
> + * and buffer pointers (if applicable) if they're ok to use.
> + */
> +STATIC int
> +xfs_scrub_btree_get_block(
> +	struct xfs_scrub_btree		*bs,
> +	int				level,
> +	union xfs_btree_ptr		*pp,
> +	struct xfs_btree_block		**pblock,
> +	struct xfs_buf			**pbp)
> +{
> +	int				error;
> +
> +	error = xfs_btree_lookup_get_block(bs->cur, level, pp, pblock);
> +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error) || !pblock)
> +		return error;
> +
> +	xfs_btree_get_block(bs->cur, level, pbp);
> +	error = xfs_btree_check_block(bs->cur, *pblock, level, *pbp);
> +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error))
> +		return error;

xfs_btree_check_block() will throw error reports to dmesg for each
corrupt block that is found. Do we want scrub to do this, or should
it just report the corrupt block to userspace?

> +
> +	/*
> +	 * Check the block's siblings; this function absorbs error codes
> +	 * for us.
> +	 */
> +	return xfs_scrub_btree_block_check_siblings(bs, *pblock);
> +}
> +
>  /*
>   * Visit all nodes and leaves of a btree.  Check that all pointers and
>   * records are in order, that the keys reflect the records, and use a callback
> @@ -107,6 +253,93 @@ xfs_scrub_btree(
>  	struct xfs_owner_info		*oinfo,
>  	void				*private)
>  {
> -	xfs_scrub_btree_op_ok(sc, cur, 0, false);
> -	return -EOPNOTSUPP;
> +	struct xfs_scrub_btree		bs = {0};
> +	union xfs_btree_ptr		ptr;
> +	union xfs_btree_ptr		*pp;
> +	struct xfs_btree_block		*block;
> +	int				level;
> +	struct xfs_buf			*bp;
> +	int				i;
> +	int				error = 0;
> +
> +	/* Initialize scrub state */
> +	bs.cur = cur;
> +	bs.scrub_rec = scrub_fn;
> +	bs.oinfo = oinfo;
> +	bs.firstrec = true;
> +	bs.private = private;
> +	bs.sc = sc;
> +	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
> +		bs.firstkey[i] = true;
> +	INIT_LIST_HEAD(&bs.to_check);
> +
> +	/* Don't try to check a tree with a height we can't handle. */
> +	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
> +		xfs_scrub_btree_set_corrupt(sc, cur, 0);
> +		goto out;
> +	}
> +
> +	/* Make sure the root isn't in the superblock. */
> +	if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)) {
> +		cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> +		if (!xfs_scrub_btree_ptr_ok(&bs, cur->bc_nlevels - 1, &ptr))
> +			goto out;

Set corrupt if the init ptr is bad? And why do this check before
the code below that has another init_ptr_from_cur() call?

> +	}
> +
> +	/*
> +	 * Load the root of the btree.  The helper function absorbs
> +	 * error codes for us.
> +	 */
> +	level = cur->bc_nlevels - 1;
> +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);

i.e.

	level = cur->bc_nlevels - 1;
	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
	if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) &&
	    !xfs_scrub_btree_ptr_ok(&bs, level, &ptr)) {
		xfs_scrub_btree_set_corrupt(sc, cur, 0);
		goto out;
	}

Which makes me ask the question - why aren't we validating the
initial pointer when the root is in an inode?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/25] xfs: create helpers to scan an allocation group
  2017-10-03 20:41 ` [PATCH 08/25] xfs: create helpers to scan an allocation group Darrick J. Wong
@ 2017-10-04  0:46   ` Dave Chinner
  2017-10-04  3:58     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  0:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:40PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Add some helpers to enable us to lock an AG's headers, create btree
> cursors for all btrees in that allocation group, and clean up
> afterwards.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/common.c |  173 +++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/common.h |   10 +++
>  fs/xfs/scrub/scrub.c  |    4 +
>  fs/xfs/scrub/scrub.h  |   21 ++++++
>  4 files changed, 208 insertions(+)
> 
> 
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index a84ba19..b056c9d 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -44,6 +44,7 @@
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
>  #include "scrub/trace.h"
> +#include "scrub/btree.h"
>  
>  /* Common code for the metadata scrubbers. */
>  
> @@ -298,6 +299,178 @@ xfs_scrub_set_incomplete(
>  	trace_xfs_scrub_incomplete(sc, __return_address);
>  }
>  
> +/*
> + * AG scrubbing
> + *
> + * These helpers facilitate locking an allocation group's header
> + * buffers, setting up cursors for all btrees that are present, and
> + * cleaning everything up once we're through.
> + */
> +
> +/* Grab all the headers for an AG. */
> +int
> +xfs_scrub_ag_read_headers(
> +	struct xfs_scrub_context	*sc,
> +	xfs_agnumber_t			agno,
> +	struct xfs_buf			**agi,
> +	struct xfs_buf			**agf,
> +	struct xfs_buf			**agfl)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, agi);
> +	if (error)
> +		goto out;
> +
> +	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, agf);
> +	if (error)
> +		goto out;
> +	if (!*agf) {
> +		error = -ENOMEM;
> +		goto out;
> +	}
> +
> +	error = xfs_alloc_read_agfl(mp, sc->tp, agno, agfl);
> +	if (error)
> +		goto out;
> +
> +out:
> +	return error;
> +}

It's not immediately obvious what releases the buffers on error.
Maybe add a comment to say cleanup/release on error is unconditional
through xfs_scrub_ag_free()?

Hmmm - now there's a question - is the reference we get here freed
through cancelling the fake transaction, or via the manual
xfs_trans_brelse() call in the free function? Which one happens
first? Add that to the comment?

And given this locks out the AG from allocation for an arbitrary
length of time, I'm wondering if we should add a flag into the pag
somewhere to say "being scrubbed" so the extent and inode allocation
code can skip over this AG and not block trying to lock it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/25] xfs: create helpers to record and deal with scrub problems
  2017-10-03 23:44   ` Dave Chinner
@ 2017-10-04  0:56     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  0:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 10:44:49AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:14PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create helper functions to record crc and corruption problems, and
> > deal with any other runtime errors that arise.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ....
> > +/*
> > + * Handling scrub corruption/optimization/warning checks.
> > + *
> > + * The *_set_{corrupt,preen,warning}() family of functions are used to
> > + * record the presence of metadata that is incorrect (corrupt), could be
> > + * optimized somehow (preen), or should be flagged for administrative
> > + * review but is not incorrect (warn).
> > + *
> > + * ftrace can be used to record the precise metadata location and
> > + * approximate code location of the failed check.
> > + */
> > +
> > +/* Record a block which could be optimized. */
> > +void
> > +xfs_scrub_block_set_preen(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_buf			*bp)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_fsblock_t			fsbno;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			bno;
> > +
> > +	fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> > +	agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > +	bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> > +
> > +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
> > +	trace_xfs_scrub_block_preen(sc, agno, bno, __return_address);
> 
> Is agno/agbno used here only for the trace point? if
> so, the tracepoint call could simply be:
> 
> 	trace_xfs_scrub_block_preen(sc, bp->b_bn, __return_address);
> 
> and the tracepoint internal implementation can split that into
> agno/agbno.

Yes, I'll make that change.

> 
> 
> > +}
> > +
> > +/*
> > + * Record an inode which could be optimized.  The trace data will
> > + * include the block given by bp if bp is given; otherwise it will use
> > + * the block location of the inode record itself.
> > + */
> > +void
> > +xfs_scrub_ino_set_preen(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_buf			*bp)
> > +{
> > +	struct xfs_inode		*ip = sc->ip;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_fsblock_t			fsbno;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			bno;
> > +
> > +	if (bp) {
> > +		fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> > +		agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > +		bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> > +	} else {
> > +		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> > +		bno = XFS_INO_TO_AGINO(mp, ip->i_ino);
> 
> That's not a block number. Going to be mighty confusing because the
> trace output isn't going to tell you whether bno is an agbno or an
> agino....

Oops.  That should've been XFS_AGINO_TO_AGBNO(mp, XFS_INO_TO_AGINO(mp, ino));

Will fix, thanks for catching that.
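
To show the difference between the two values, here is a userspace
sketch of the bit layout as I understand it: the low inopblog bits of
an inode number are the slot within the block, the next agblklog bits
are the AG block, and the rest is the AG number.  The struct and the
demo_* helpers are hypothetical stand-ins, not the real macros:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cut-down geometry. */
struct demo_inogeo {
	unsigned int	inopblog;	/* log2(inodes per fs block) */
	unsigned int	agblklog;	/* log2(AG size in blocks), rounded up */
};

/* AG-relative inode number: AG block plus slot within that block. */
static uint32_t demo_ino_to_agino(const struct demo_inogeo *g, uint64_t ino)
{
	unsigned int	bits = g->inopblog + g->agblklog;

	assert(bits < 32);
	return (uint32_t)(ino & ((UINT64_C(1) << bits) - 1));
}

static uint32_t demo_ino_to_agno(const struct demo_inogeo *g, uint64_t ino)
{
	return (uint32_t)(ino >> (g->inopblog + g->agblklog));
}

/* Drop the slot bits to get the AG block that holds the inode. */
static uint32_t demo_agino_to_agbno(const struct demo_inogeo *g,
				    uint32_t agino)
{
	return agino >> g->inopblog;
}
```

With 16 inodes per block, the agino and the agbno differ by a factor
of about 16, which is why the old trace output would have misled.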

> > +	}
> > +
> > +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
> > +	trace_xfs_scrub_ino_preen(sc, ip->i_ino, agno, bno, __return_address);
> > +}
> > +
> > +/* Record a corrupt block. */
> > +void
> > +xfs_scrub_block_set_corrupt(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_buf			*bp)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_fsblock_t			fsbno;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			bno;
> > +
> > +	fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> > +	agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > +	bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> > +
> > +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> > +	trace_xfs_scrub_block_error(sc, agno, bno, __return_address);
> > +}
> > +
> > +/*
> > + * Record a corrupt inode.  The trace data will include the block given
> > + * by bp if bp is given; otherwise it will use the block location of the
> > + * inode record itself.
> > + */
> > +void
> > +xfs_scrub_ino_set_corrupt(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_ino_t			ino,
> > +	struct xfs_buf			*bp)
> > +{
> > +	struct xfs_inode		*ip = sc->ip;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_fsblock_t			fsbno;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			bno;
> > +
> > +	if (bp) {
> > +		fsbno = XFS_DADDR_TO_FSB(mp, bp->b_bn);
> > +		agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > +		bno = XFS_FSB_TO_AGBNO(mp, fsbno);
> > +	} else {
> > +		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> > +		bno = XFS_INO_TO_AGINO(mp, ip->i_ino);
> > +	}
> > +
> > +	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> > +	trace_xfs_scrub_ino_error(sc, ino, agno, bno, __return_address);
> 
> Pattern seems to be repeated...
> 
> >  DEFINE_SCRUB_EVENT(xfs_scrub_start);
> >  DEFINE_SCRUB_EVENT(xfs_scrub_done);
> > +DEFINE_SCRUB_EVENT(xfs_scrub_deadlock_retry);
> > +
> > +TRACE_EVENT(xfs_scrub_op_error,
> > +	TP_PROTO(struct xfs_scrub_context *sc, xfs_agnumber_t agno,
> > +		 xfs_agblock_t bno, int error, void *ret_ip),
> > +	TP_ARGS(sc, agno, bno, error, ret_ip),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(unsigned int, type)
> > +		__field(xfs_agnumber_t, agno)
> > +		__field(xfs_agblock_t, bno)
> > +		__field(int, error)
> > +		__field(void *, ret_ip)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = sc->mp->m_super->s_dev;
> > +		__entry->type = sc->sm->sm_type;
> > +		__entry->agno = agno;
> > +		__entry->bno = bno;
> > +		__entry->error = error;
> > +		__entry->ret_ip = ret_ip;
> 
> i.e. we can put the decoding in here, which makes it all conditional
> on the tracepoint being enabled....

<nod> Done.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/25] xfs: scrub the backup superblocks
  2017-10-03 20:41 ` [PATCH 09/25] xfs: scrub the backup superblocks Darrick J. Wong
@ 2017-10-04  0:57   ` Dave Chinner
  2017-10-04  4:06     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  0:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Ensure that the geometry presented in the backup superblocks matches
> the primary superblock so that repair can recover the filesystem if
> that primary gets corrupted.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/Makefile         |    1 
>  fs/xfs/libxfs/xfs_fs.h  |    3 
>  fs/xfs/scrub/agheader.c |  317 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/common.h   |    2 
>  fs/xfs/scrub/scrub.c    |    4 +
>  fs/xfs/scrub/scrub.h    |    1 
>  6 files changed, 327 insertions(+), 1 deletion(-)
>  create mode 100644 fs/xfs/scrub/agheader.c
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 5888b9f..e92d04d 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -146,6 +146,7 @@ ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
>  
>  xfs-y				+= $(addprefix scrub/, \
>  				   trace.o \
> +				   agheader.o \
>  				   btree.o \
>  				   common.o \
>  				   scrub.o \
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index 765f91e..8543cbb 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -484,9 +484,10 @@ struct xfs_scrub_metadata {
>  
>  /* Scrub subcommands. */
>  #define XFS_SCRUB_TYPE_PROBE	0	/* presence test ioctl */
> +#define XFS_SCRUB_TYPE_SB	1	/* superblock */
>  
>  /* Number of scrub subcommands. */
> -#define XFS_SCRUB_TYPE_NR	1
> +#define XFS_SCRUB_TYPE_NR	2
>  
>  /* i: Repair this metadata. */
>  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> new file mode 100644
> index 0000000..487c4f4
> --- /dev/null
> +++ b/fs/xfs/scrub/agheader.c
> @@ -0,0 +1,317 @@
> +/*
> + * Copyright (C) 2017 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_btree.h"
> +#include "xfs_bit.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans.h"
> +#include "xfs_sb.h"
> +#include "xfs_inode.h"
> +#include "scrub/xfs_scrub.h"
> +#include "scrub/scrub.h"
> +#include "scrub/common.h"
> +#include "scrub/trace.h"
> +
> +/* Set us up to check an AG header. */
> +int
> +xfs_scrub_setup_ag_header(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{

It's not immediately clear which AG header is being set up here.

> +	struct xfs_mount		*mp = sc->mp;
> +
> +	if (sc->sm->sm_agno >= mp->m_sb.sb_agcount ||
> +	    sc->sm->sm_ino || sc->sm->sm_gen)
> +		return -EINVAL;
> +	return xfs_scrub_setup_fs(sc, ip);
> +}
> +
> +/* Superblock */
> +
> +/* Scrub the filesystem superblock. */
> +int
> +xfs_scrub_superblock(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*bp;
> +	struct xfs_dsb			*sb;
> +	xfs_agnumber_t			agno;
> +	uint32_t			v2_ok;
> +	__be32				features_mask;
> +	int				error;
> +	__be16				vernum_mask;
> +
> +	agno = sc->sm->sm_agno;
> +	if (agno == 0)
> +		return 0;
> +
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +		  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
> +		  XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
> +	if (!xfs_scrub_op_ok(sc, agno, XFS_SB_BLOCK(mp), &error))
> +		return error;

Might be worth a comment to say the verifier is doing validity/range
checks of the on-disk fields so they aren't duplicated here. It took
me a little while to work out why range checks weren't being done
here...

> +
> +	sb = XFS_BUF_TO_SBP(bp);
> +
> +	/*
> +	 * Verify the geometries match.  Fields that are permanently
> +	 * set by mkfs are checked; fields that can be updated later
> +	 * (and are not propagated to backup superblocks) are preen
> +	 * checked.
> +	 */
> +	if (sb->sb_blocksize != cpu_to_be32(mp->m_sb.sb_blocksize))
> +		xfs_scrub_block_set_corrupt(sc, bp);
> +
> +	if (sb->sb_dblocks != cpu_to_be64(mp->m_sb.sb_dblocks))
> +		xfs_scrub_block_set_corrupt(sc, bp);

Just wondering - once we've set the corrupt flag, do we need to
bother checking any of the other fields? It makes no difference to
what is reported to userspace or the action it is going to take,
so couldn't we just do something like:

	if (sb->sb_dblocks != cpu_to_be64(mp->m_sb.sb_dblocks))
		goto out_corrupt;

.....
out_corrupt:
	xfs_scrub_block_set_corrupt(sc, bp);
	return 0;

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/25] xfs: scrub AGF and AGFL
  2017-10-03 20:41 ` [PATCH 10/25] xfs: scrub AGF and AGFL Darrick J. Wong
@ 2017-10-04  1:31   ` Dave Chinner
  2017-10-04  4:21     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  1:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:52PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Check the block references in the AGF and AGFL headers to make sure
> they make sense.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_fs.h  |    4 +
>  fs/xfs/scrub/agheader.c |  220 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/common.c   |   60 +++++++++++++
>  fs/xfs/scrub/common.h   |    6 +
>  fs/xfs/scrub/scrub.c    |    8 ++
>  fs/xfs/scrub/scrub.h    |    2 
>  6 files changed, 299 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index 8543cbb..aeb2a66 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -485,9 +485,11 @@ struct xfs_scrub_metadata {
>  /* Scrub subcommands. */
>  #define XFS_SCRUB_TYPE_PROBE	0	/* presence test ioctl */
>  #define XFS_SCRUB_TYPE_SB	1	/* superblock */
> +#define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
> +#define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
>  
>  /* Number of scrub subcommands. */
> -#define XFS_SCRUB_TYPE_NR	2
> +#define XFS_SCRUB_TYPE_NR	4
>  
>  /* i: Repair this metadata. */
>  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> index 487c4f4..7fe6630 100644
> --- a/fs/xfs/scrub/agheader.c
> +++ b/fs/xfs/scrub/agheader.c
> @@ -49,6 +49,72 @@ xfs_scrub_setup_ag_header(
>  	return xfs_scrub_setup_fs(sc, ip);
>  }
>  
> +/* Find the size of the AG, in blocks. */
> +static inline xfs_agblock_t
> +xfs_scrub_ag_blocks(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
> +{
> +	ASSERT(agno < mp->m_sb.sb_agcount);
> +
> +	if (agno < mp->m_sb.sb_agcount - 1)
> +		return mp->m_sb.sb_agblocks;
> +	return mp->m_sb.sb_dblocks - (agno * mp->m_sb.sb_agblocks);
> +}

Can you make this a generic libxfs function, say xfs_get_ag_blocks()?
This same calculation is repeated in quite a few places, especially
in userspace...
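
Something along these lines, written as a standalone sketch so it can
be compiled in userspace.  struct demo_sb and demo_ag_blocks() are
placeholder names, not a proposal for the final libxfs signature:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the superblock geometry. */
struct demo_sb {
	uint32_t	agcount;	/* number of AGs */
	uint32_t	agblocks;	/* blocks in a full-sized AG */
	uint64_t	dblocks;	/* blocks in the data device */
};

/* Size of an AG in blocks; only the last AG may be short. */
static uint32_t demo_ag_blocks(const struct demo_sb *sb, uint32_t agno)
{
	assert(agno < sb->agcount);

	if (agno < sb->agcount - 1)
		return sb->agblocks;
	return (uint32_t)(sb->dblocks - (uint64_t)agno * sb->agblocks);
}
```

For example, with 4 AGs of 1000 blocks and 3500 data blocks, AGs 0-2
report 1000 blocks and AG 3 reports 500.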

> +
> +/* Walk all the blocks in the AGFL. */
> +int
> +xfs_scrub_walk_agfl(
> +	struct xfs_scrub_context	*sc,
> +	int				(*fn)(struct xfs_scrub_context *,
> +					      xfs_agblock_t bno, void *),
> +	void				*priv)
> +{
> +	struct xfs_agf			*agf;
> +	__be32				*agfl_bno;
> +	struct xfs_mount		*mp = sc->mp;
> +	unsigned int			flfirst;
> +	unsigned int			fllast;
> +	int				i;
> +	int				error;
> +
> +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, sc->sa.agfl_bp);
> +	flfirst = be32_to_cpu(agf->agf_flfirst);
> +	fllast = be32_to_cpu(agf->agf_fllast);
> +
> +	/* Skip an empty AGFL. */
> +	if (agf->agf_flcount == cpu_to_be32(0))
> +		return 0;

Check flfirst -> fllast == flcount.

....

> +/* Scrub the AGF. */
> +int
> +xfs_scrub_agf(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_agf			*agf;
> +	xfs_daddr_t			daddr;
> +	xfs_daddr_t			eofs;
> +	xfs_agnumber_t			agno;
> +	xfs_agblock_t			agbno;
> +	xfs_agblock_t			eoag;
> +	xfs_agblock_t			agfl_first;
> +	xfs_agblock_t			agfl_last;
> +	xfs_agblock_t			agfl_count;
> +	xfs_agblock_t			fl_count;
> +	int				level;
> +	int				error = 0;
> +
> +	agno = sc->sm->sm_agno;
> +	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGF);
> +	if (!xfs_scrub_op_ok(sc, agno, XFS_AGF_BLOCK(sc->mp), &error))
> +		goto out;
> +
> +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> +	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> +
> +	/* Check the AG length */
> +	eoag = be32_to_cpu(agf->agf_length);
> +	if (eoag != xfs_scrub_ag_blocks(mp, agno))
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +
> +	/* Check the AGF btree roots and levels */
> +	agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_BNO]);
> +	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> +	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||

I'm assuming that you are checking that the block isn't part of the
static metadata range with this XFS_AGI_BLOCK() check? Shouldn't it
actually be agbno <= XFS_AGFL_BLOCK(mp) i.e. the AGFL block address?

I think we need a generic "verify agbno" function. These checks seem
to be open coded throughout the code instead of calling a single
function that does all the checks. The short btree pointers can use
it as well...
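
As a rough idea of the shape, here is a userspace sketch.  The names
are hypothetical, and last_hdr_block stands in for whatever block
holds the last static AG header (the AGFL block, per the above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the geometry needed for the check. */
struct demo_agsb {
	uint32_t	agcount;	/* number of AGs */
	uint32_t	agblocks;	/* blocks in a full-sized AG */
	uint64_t	dblocks;	/* blocks in the data device */
	uint32_t	last_hdr_block;	/* last block of static AG headers */
};

/* Is agbno a plausible non-header block inside AG agno? */
static bool demo_verify_agbno(const struct demo_agsb *sb, uint32_t agno,
			      uint32_t agbno)
{
	uint64_t	ag_start;
	uint64_t	eoag;

	assert(sb->agblocks != 0);

	if (agno >= sb->agcount)
		return false;

	/* Clamp the AG length so the last AG cannot run past EOFS. */
	ag_start = (uint64_t)agno * sb->agblocks;
	if (ag_start >= sb->dblocks)
		return false;
	eoag = sb->dblocks - ag_start;
	if (eoag > sb->agblocks)
		eoag = sb->agblocks;

	/* Must sit after the static headers and before the end of the AG. */
	if (agbno <= sb->last_hdr_block)
		return false;
	return agbno < eoag;
}
```

Each open-coded test against sb_agblocks, eoag and eofs would then
collapse into one call to a helper like this.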

> +	    agbno >= eoag || daddr >= eofs)
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +
> +	agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_CNT]);
> +	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> +	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
> +	    agbno >= eoag || daddr >= eofs)

There's another.

> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +
> +	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
> +	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +
> +	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
> +	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> +		agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_RMAP]);
> +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> +		if (agbno <= XFS_AGI_BLOCK(mp) ||
> +		    agbno >= mp->m_sb.sb_agblocks ||
> +		    agbno >= eoag ||
> +		    daddr >= eofs)

And another.

> +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +
> +		level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
> +		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +	}
> +
> +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> +		agbno = be32_to_cpu(agf->agf_refcount_root);
> +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> +		if (agbno <= XFS_AGI_BLOCK(mp) ||
> +		    agbno >= mp->m_sb.sb_agblocks ||
> +		    agbno >= eoag ||
> +		    daddr >= eofs)

And another.

> +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +
> +		level = be32_to_cpu(agf->agf_refcount_level);
> +		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> +	}
> +
> +	/* Check the AGFL counters */
> +	agfl_first = be32_to_cpu(agf->agf_flfirst);
> +	agfl_last = be32_to_cpu(agf->agf_fllast);
> +	agfl_count = be32_to_cpu(agf->agf_flcount);
> +	if (agfl_last > agfl_first)
> +		fl_count = agfl_last - agfl_first + 1;
> +	else
> +		fl_count = XFS_AGFL_SIZE(mp) - agfl_first + agfl_last + 1;
> +	if (agfl_count != 0 && fl_count != agfl_count)
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);

Oh, the agfl counts are checked here. Maybe put a comment in
xfs_scrub_walk_agfl() to mention this?
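
For reference, the circular-list arithmetic as a standalone sketch.
The demo_* names are made up, and treating first == last as a
one-entry list is my assumption about the intended semantics; if I
read the quoted hunk right, its "last > first" test sends that case
down the wrap-around branch instead:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of active slots in a circular list of 'size' slots. */
static uint32_t demo_agfl_active(uint32_t first, uint32_t last,
				 uint32_t size)
{
	assert(first < size && last < size);

	if (first <= last)
		return last - first + 1;
	/* The active range wraps past the end of the array. */
	return size - first + last + 1;
}

/* Does the recorded count agree with the first/last indices? */
static bool demo_agfl_count_ok(uint32_t first, uint32_t last,
			       uint32_t count, uint32_t size)
{
	/* An empty list says nothing about the indices. */
	if (count == 0)
		return true;
	if (first >= size || last >= size)
		return false;
	return demo_agfl_active(first, last, size) == count;
}
```

The walker can then rely on the AGF scrubber having done this check,
which is the thing worth stating in the comment.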

.....

> +struct xfs_scrub_agfl {
> +	xfs_agblock_t			eoag;
> +	xfs_daddr_t			eofs;
> +};
> +
> +/* Scrub an AGFL block. */
> +STATIC int
> +xfs_scrub_agfl_block(
> +	struct xfs_scrub_context	*sc,
> +	xfs_agblock_t			agbno,
> +	void				*priv)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_agnumber_t			agno = sc->sa.agno;
> +	struct xfs_scrub_agfl		*sagfl = priv;
> +	int				error = 0;
> +
> +	if (agbno <= XFS_AGI_BLOCK(mp) ||
> +	    agbno >= mp->m_sb.sb_agblocks ||
> +	    agbno >= sagfl->eoag ||
> +	    XFS_AGB_TO_DADDR(mp, agno, agbno) >= sagfl->eofs)
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agfl_bp);
> +
> +	return error;
> +}

Oh, look, there's another xfs_agbno_verify() function call :P
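
A rough userspace sketch of what such a shared helper might look like,
purely as an illustration.  struct demo_ag_bounds, its fields, and
demo_agbno_in_bounds() are hypothetical names, and the block-to-sector
conversion is a simplified linear stand-in for XFS_AGB_TO_DADDR(), not
the real macro.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the geometry that the repeated checks use. */
struct demo_ag_bounds {
	uint32_t	hdr_block;	/* header block compared against (XFS_AGI_BLOCK in the patch) */
	uint32_t	sb_agblocks;	/* nominal AG size, in blocks */
	uint32_t	eoag;		/* length of this AG, in blocks */
	uint64_t	eofs;		/* end of the filesystem, in sectors */
	uint64_t	ag_start;	/* first sector of this AG (simplified) */
	uint32_t	sectors_per_block;
};

/* Return true if agbno passes all four bounds checks from the patch. */
static bool
demo_agbno_in_bounds(const struct demo_ag_bounds *b, uint32_t agbno)
{
	/* Simplified linear mapping; an assumption made for this sketch. */
	uint64_t daddr = b->ag_start +
			 (uint64_t)agbno * b->sectors_per_block;

	if (agbno <= b->hdr_block)
		return false;
	if (agbno >= b->sb_agblocks || agbno >= b->eoag)
		return false;
	return daddr < b->eofs;
}
```

Each open-coded four-way test would then collapse to one call, with the
caller deciding which buffer to mark corrupt.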

.....
> diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> index b056c9d..ee8e7be 100644
> --- a/fs/xfs/scrub/common.c
> +++ b/fs/xfs/scrub/common.c
> @@ -471,6 +471,66 @@ xfs_scrub_ag_init(
>  	return xfs_scrub_ag_btcur_init(sc, sa);
>  }
>  
> +/*
> + * Load and verify an AG header for further AG header examination.
> + * If this header is not the target of the examination, don't return
> + * the buffer if a runtime or verifier error occurs.
> + */
> +STATIC int
> +xfs_scrub_load_ag_header(
> +	struct xfs_scrub_context	*sc,
> +	xfs_daddr_t			daddr,
> +	struct xfs_buf			**bpp,
> +	const struct xfs_buf_ops	*ops,
> +	bool				is_target)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	*bpp = NULL;
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +			XFS_AG_DADDR(mp, sc->sa.agno, daddr),
> +			XFS_FSS_TO_BB(mp, 1), 0, bpp, ops);
> +	return is_target ? error : 0;
> +}
> +
> +/*
> + * Load as many of the AG headers and btree cursors as we can for an
> + * examination and cross-reference of an AG header.
> + */
> +int
> +xfs_scrub_load_ag_headers(
> +	struct xfs_scrub_context	*sc,
> +	xfs_agnumber_t			agno,
> +	unsigned int			type)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	ASSERT(type == XFS_SCRUB_TYPE_AGF || type == XFS_SCRUB_TYPE_AGFL);
> +	memset(&sc->sa, 0, sizeof(sc->sa));
> +	sc->sa.agno = agno;
> +
> +	error = xfs_scrub_load_ag_header(sc, XFS_AGI_DADDR(mp),
> +			&sc->sa.agi_bp, &xfs_agi_buf_ops, false);
> +	if (error)
> +		return error;
> +
> +	error = xfs_scrub_load_ag_header(sc, XFS_AGF_DADDR(mp),
> +			&sc->sa.agf_bp, &xfs_agf_buf_ops,
> +			type == XFS_SCRUB_TYPE_AGF);
> +	if (error)
> +		return error;
> +
> +	error = xfs_scrub_load_ag_header(sc, XFS_AGFL_DADDR(mp),
> +			&sc->sa.agfl_bp, &xfs_agfl_buf_ops,
> +			type == XFS_SCRUB_TYPE_AGFL);
> +	if (error)
> +		return error;
> +
> +	return 0;
> +}

This should probably be combined with xfs_scrub_ag_read_headers().
They essentially do the same thing, the only difference is the
"target" error reporting.

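As a sketch of the "target" distinction only: a single loader could
take the scrub target as a parameter and swallow read errors for every
other header.  Everything below (the enum, demo_load_headers(), and the
simulated per-header read results) is hypothetical and runs in
userspace; it is not the kernel's buffer API.

```c
#include <stdbool.h>

enum demo_hdr { DEMO_AGI, DEMO_AGF, DEMO_AGFL, DEMO_HDR_NR };

/*
 * read_error[] simulates the result of reading each header (0 or a
 * negative errno).  loaded[] records which headers are usable.  Only
 * a failure on the target header is returned to the caller; failures
 * on the other headers just leave them unloaded.
 */
static int
demo_load_headers(const int read_error[DEMO_HDR_NR],
		  enum demo_hdr target,
		  bool loaded[DEMO_HDR_NR])
{
	int t;

	for (t = 0; t < DEMO_HDR_NR; t++)
		loaded[t] = false;

	for (t = 0; t < DEMO_HDR_NR; t++) {
		if (read_error[t] == 0) {
			loaded[t] = true;
			continue;
		}
		if (t == (int)target)
			return read_error[t];
	}
	return 0;
}
```
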
Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 11/25] xfs: scrub the AGI
  2017-10-03 20:41 ` [PATCH 11/25] xfs: scrub the AGI Darrick J. Wong
@ 2017-10-04  1:43   ` Dave Chinner
  2017-10-04  4:25     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  1:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:41:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Add a forgotten check to the AGI verifier, then wire up the scrub
> infrastructure to check the AGI contents.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_fs.h  |    3 +-
>  fs/xfs/scrub/agheader.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/common.c   |    6 ++-
>  fs/xfs/scrub/scrub.c    |    4 ++
>  fs/xfs/scrub/scrub.h    |    1 +
>  5 files changed, 99 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index aeb2a66..1e326dd 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -487,9 +487,10 @@ struct xfs_scrub_metadata {
>  #define XFS_SCRUB_TYPE_SB	1	/* superblock */
>  #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
>  #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
> +#define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
>  
>  /* Number of scrub subcommands. */
> -#define XFS_SCRUB_TYPE_NR	4
> +#define XFS_SCRUB_TYPE_NR	5
>  
>  /* i: Repair this metadata. */
>  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> index 7fe6630..3d269c2 100644
> --- a/fs/xfs/scrub/agheader.c
> +++ b/fs/xfs/scrub/agheader.c
> @@ -535,3 +535,91 @@ xfs_scrub_agfl(
>  out:
>  	return error;
>  }
> +
> +/* AGI */
> +
> +/* Scrub the AGI. */
> +int
> +xfs_scrub_agi(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_agi			*agi;
> +	xfs_daddr_t			daddr;
> +	xfs_daddr_t			eofs;
> +	xfs_agnumber_t			agno;
> +	xfs_agblock_t			agbno;
> +	xfs_agblock_t			eoag;
> +	xfs_agino_t			agino;
> +	xfs_agino_t			first_agino;
> +	xfs_agino_t			last_agino;
> +	int				i;
> +	int				level;
> +	int				error = 0;
> +
> +	agno = sc->sm->sm_agno;
> +	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGI);
> +	if (!xfs_scrub_op_ok(sc, agno, XFS_AGI_BLOCK(sc->mp), &error))
> +		goto out;
> +
> +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> +	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> +
> +	/* Check the AG length */
> +	eoag = be32_to_cpu(agi->agi_length);
> +	if (eoag != xfs_scrub_ag_blocks(mp, agno))
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);

Should we be cross checking that the AGI and AGF both have
the same length here?

> +
> +	/* Check btree roots and levels */
> +	agbno = be32_to_cpu(agi->agi_root);
> +	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> +	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
> +	    agbno >= eoag || daddr >= eofs)
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);

xfs_verify_agbno(), again :P

> +
> +	level = be32_to_cpu(agi->agi_level);
> +	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> +
> +	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
> +		agbno = be32_to_cpu(agi->agi_free_root);
> +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> +		if (agbno <= XFS_AGI_BLOCK(mp) ||
> +		    agbno >= mp->m_sb.sb_agblocks ||
> +		    agbno >= eoag ||
> +		    daddr >= eofs)

Broken records are us....

> +			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> +
> +		level = be32_to_cpu(agi->agi_free_level);
> +		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> +	}
> +
> +	/* Check inode counters */
> +	first_agino = XFS_OFFBNO_TO_AGINO(mp, XFS_AGI_BLOCK(mp) + 1, 0);

Don't think this is right. AGFL, not AGI....

> +	last_agino = XFS_OFFBNO_TO_AGINO(mp, eoag + 1, 0) - 1;

Not sure this is right, either, because inode chunks won't be
allocated over the end of the AG. Hence if the eoag is not chunk
aligned, there will be up to (chunk size - 1) blocks that inodes
won't be allocated in...

> +	agino = be32_to_cpu(agi->agi_count);
> +	if (agino > last_agino - first_agino + 1 ||
> +	    agino < be32_to_cpu(agi->agi_freecount))
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);

Please don't use agino as a count of inodes - this confused me
very much because I was wondering how these checks were in any way
related to valid AG inode numbers....

> +
> +	/* Check inode pointers */
> +	agino = be32_to_cpu(agi->agi_newino);
> +	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> +	agino = be32_to_cpu(agi->agi_dirino);
> +	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))
> +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);

Perhaps we also need a xfs_verify_agino() helper here.

> +	/* Check unlinked inode buckets */
> +	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
> +		agino = be32_to_cpu(agi->agi_unlinked[i]);
> +		if (agino == NULLAGINO)
> +			continue;
> +		if (agino < first_agino || agino > last_agino)
> +			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);

This is effectively the same check as above:

	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))

so all these checks could use the same helper to make it easier
to read.
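
Something along these lines, as a userspace sketch rather than a
proposed implementation.  demo_agino_ptr_ok() and DEMO_NULLAGINO are
hypothetical names; the sentinel is assumed to be the all-ones 32-bit
value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for NULLAGINO; assumed to be the all-ones 32-bit value. */
#define DEMO_NULLAGINO	((uint32_t)-1)

/*
 * An AG inode pointer is acceptable if it is the null sentinel or if
 * it lies within [first_agino, last_agino].
 */
static bool
demo_agino_ptr_ok(uint32_t agino, uint32_t first_agino,
		  uint32_t last_agino)
{
	if (agino == DEMO_NULLAGINO)
		return true;
	return agino >= first_agino && agino <= last_agino;
}
```

The agi_newino, agi_dirino and unlinked bucket checks would each become
a single negated call.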

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 03/25] xfs: probe the scrub ioctl
  2017-10-04  0:02     ` Darrick J. Wong
@ 2017-10-04  1:56       ` Dave Chinner
  2017-10-04  3:14         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  1:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 05:02:09PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 10:32:47AM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:41:08PM -0700, Darrick J. Wong wrote:
> > > +/*
> > > + * Scrub probe -- userspace uses this to probe if we're willing to
> > > + * scrub or repair a given mountpoint.
> > > + */
> > > +int
> > > +xfs_scrub_probe(
> > > +	struct xfs_scrub_context	*sc)
> > > +{
> > > +	if (sc->sm->sm_ino || sc->sm->sm_agno)
> > > +		return -EINVAL;
> > > +	if (sc->sm->sm_gen & ~XFS_SCRUB_FLAGS_OUT)
> > 
> > sm_flags?
> 
> sm_gen is correct because in xfs_scrub_probe we reflect back to
> userspace any valid scrub outflags that were passed in via sm_gen.
> Therefore we have to check sm_gen for unknown output flags so that we
> can error out on any invalid inputs from userland.

Hmmm - seeing sm_ino and sm_gen in the same structure immediately
makes me think {inode number, generation} tuple for uniquely
identifying inodes from userspace. What on earth would scrub output
flags be doing in a generation number? Indeed, from the first patch:

+struct xfs_scrub_metadata {
+       __u32 sm_type;          /* What to check? */
+       __u32 sm_flags;         /* flags; see below. */
+       __u64 sm_ino;           /* inode number. */
+       __u32 sm_gen;           /* inode generation. */
+       __u32 sm_agno;          /* ag number. */
+       __u64 sm_reserved[5];   /* pad to 64 bytes */
....

It is an inode generation number?

So I'm somewhat confused by flags appearing in the generation
number...

> xfs_scrub_metadata() unconditionally clears all the out flags from
> sm_flags on its way in, prior to any of the scrub handlers being called.
> 
> (It's possible I'm misunderstanding your question.)

What has tripped me up is that the probe function sm_gen abuse isn't
documented in this patch, so I had no idea why oflags were in
sm_gen.  Maybe you explained it previously, but I don't remember
that detail and there's nothing in the patch to remind me....
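
For the record, the behaviour being discussed can be modelled in
userspace roughly as below.  This is a sketch under assumptions: the
struct, demo_probe(), and the mask value DEMO_FLAGS_OUT are all
hypothetical, and the mask is not the real value of
XFS_SCRUB_FLAGS_OUT.  The reflect-back step follows the description
given earlier in this thread.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical mask of "output" flag bits; not the kernel's value. */
#define DEMO_FLAGS_OUT	0x3eu

/* Hypothetical cut-down request with the same field roles. */
struct demo_probe_req {
	uint32_t	flags;
	uint64_t	ino;
	uint32_t	gen;	/* reused to carry output flags when probing */
	uint32_t	agno;
};

static int
demo_probe(struct demo_probe_req *req)
{
	/* A probe targets no inode and no AG. */
	if (req->ino || req->agno)
		return -EINVAL;

	/* Reject any bit that is not a known output flag. */
	if (req->gen & ~DEMO_FLAGS_OUT)
		return -EINVAL;

	/* Reflect the requested output flags back to the caller. */
	req->flags |= req->gen;
	return 0;
}
```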

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 03/25] xfs: probe the scrub ioctl
  2017-10-04  1:56       ` Dave Chinner
@ 2017-10-04  3:14         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  3:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 12:56:06PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 05:02:09PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 04, 2017 at 10:32:47AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:41:08PM -0700, Darrick J. Wong wrote:
> > > > +/*
> > > > + * Scrub probe -- userspace uses this to probe if we're willing to
> > > > + * scrub or repair a given mountpoint.
> > > > + */
> > > > +int
> > > > +xfs_scrub_probe(
> > > > +	struct xfs_scrub_context	*sc)
> > > > +{
> > > > +	if (sc->sm->sm_ino || sc->sm->sm_agno)
> > > > +		return -EINVAL;
> > > > +	if (sc->sm->sm_gen & ~XFS_SCRUB_FLAGS_OUT)
> > > 
> > > sm_flags?
> > 
> > sm_gen is correct because in xfs_scrub_probe we reflect back to
> > userspace any valid scrub outflags that were passed in via sm_gen.
> > Therefore we have to check sm_gen for unknown output flags so that we
> > can error out on any invalid inputs from userland.
> 
> Hmmm - seeing sm_ino and sm_gen in the same structure immediately
> makes me think {inode number, generation} tuple for uniquely
> identifying inodes from userspace. What on earth would scrub output
> flags be doing in a generation number? Indeed, from the first patch:

For actual inode-related scrubbing, yes, sm_ino/sm_gen represent a tuple
for uniquely identifying inodes.

> +struct xfs_scrub_metadata {
> +       __u32 sm_type;          /* What to check? */
> +       __u32 sm_flags;         /* flags; see below. */
> +       __u64 sm_ino;           /* inode number. */
> +       __u32 sm_gen;           /* inode generation. */
> +       __u32 sm_agno;          /* ag number. */
> +       __u64 sm_reserved[5];   /* pad to 64 bytes */
> ....
> 
> It is an inode generation number?
> 
> So I'm somewhat confused by flags appearing in the generation
> number...
> 
> > xfs_scrub_metadata() unconditionally clears all the out flags from
> > sm_flags on its way in, prior to any of the scrub handlers being called.
> > 
> > (It's possible I'm misunderstanding your question.)
> 
> What has tripped me up is that the probe function sm_gen abuse isn't
> documented in this patch, so I had no idea why oflags were in
> sm_gen.  Maybe you explained it previously, but I don't remember
> that detail and there's nothing in the patch to remind me....

Yeah... we /are/ abusing the sm_gen field for the dorky probe function.
xfs_io doesn't touch it at all and xfs_scrub doesn't do much with it.
Easier just to drop it.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH 06/25] xfs: scrub the shape of a metadata btree
  2017-10-04  0:15   ` Dave Chinner
@ 2017-10-04  3:51     ` Darrick J. Wong
  2017-10-04  5:48       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  3:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 11:15:35AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:27PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a function that can check the shape of a btree -- each block
> > passes basic inspection and all the pointers look ok.  In the next patch
> > we'll add the ability to check the actual keys and records stored within
> > the btree.  Add some helper functions so that we report detailed scrub
> > errors in a uniform manner in dmesg.  These are helper functions for
> > subsequent patches.
> .....
> >  
> > +/* Check a btree pointer.  Returns true if it's ok to use this pointer. */
> > +static bool
> > +xfs_scrub_btree_ptr_ok(
> > +	struct xfs_scrub_btree		*bs,
> > +	int				level,
> > +	union xfs_btree_ptr		*ptr)
> > +{
> > +	struct xfs_btree_cur		*cur = bs->cur;
> > +	xfs_daddr_t			daddr;
> > +	xfs_daddr_t			eofs;
> > +
> > +	if (xfs_btree_ptr_is_null(cur, ptr)) {
> > +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > +		return false;
> > +	}
> > +	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
> > +		daddr = XFS_FSB_TO_DADDR(cur->bc_mp, be64_to_cpu(ptr->l));
> > +	} else {
> > +		ASSERT(cur->bc_private.a.agno != NULLAGNUMBER);
> > +		daddr = XFS_AGB_TO_DADDR(cur->bc_mp, cur->bc_private.a.agno,
> > +				be32_to_cpu(ptr->s));
> > +	}
> > +	eofs = XFS_FSB_TO_BB(cur->bc_mp, cur->bc_mp->m_sb.sb_dblocks);
> > +	if (daddr == 0 || daddr >= eofs) {
> > +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> 
> There seems to be quite a bit of overlap here with
> xfs_btree_check_ptr(). Indeed, for the short pointers the above code
> fails to check it is within the bounds of the AG size. I'd suggest
> both of these should use the same validity checking functions....

Hmm... you're right that the short pointer needs to be checked against
the AG size.  That said, the regular xfs_btree_check_ptr function will
log a XFS_ERROR_REPORT to dmesg, which we don't want, since we're going
to report the scrub failure to userspace anyway.

I think I prefer to fix this existing function since it's silent and
we can maintain the current behavior where a failure in regular
operation gets logged to dmesg.
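
A minimal userspace sketch of the missing short-pointer bound, to show
the shape of the check rather than the eventual fix.  struct demo_geom,
demo_ag_length() and demo_short_ptr_ok() are hypothetical names, and
the rule that only the last AG may be shorter than the nominal size is
an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filesystem geometry, all sizes in blocks. */
struct demo_geom {
	uint64_t	dblocks;	/* total data blocks */
	uint32_t	agblocks;	/* nominal blocks per AG */
	uint32_t	agcount;	/* number of AGs */
};

/* Length of one AG; assumes only the last AG can be short. */
static uint32_t
demo_ag_length(const struct demo_geom *g, uint32_t agno)
{
	if (agno == g->agcount - 1)
		return (uint32_t)(g->dblocks -
				  (uint64_t)agno * g->agblocks);
	return g->agblocks;
}

/*
 * A short (AG-relative) btree pointer is usable only if it is not
 * block 0 (which holds a superblock copy) and lies inside its AG.
 */
static bool
demo_short_ptr_ok(const struct demo_geom *g, uint32_t agno,
		  uint32_t agbno)
{
	if (agno >= g->agcount)
		return false;
	if (agbno == 0)
		return false;
	return agbno < demo_ag_length(g, agno);
}
```

Unlike a check against eofs alone, this catches a pointer that runs
past the end of a middle AG into the next one.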

> ....
> > +/*
> > + * Grab and scrub a btree block given a btree pointer.  Returns block
> > + * and buffer pointers (if applicable) if they're ok to use.
> > + */
> > +STATIC int
> > +xfs_scrub_btree_get_block(
> > +	struct xfs_scrub_btree		*bs,
> > +	int				level,
> > +	union xfs_btree_ptr		*pp,
> > +	struct xfs_btree_block		**pblock,
> > +	struct xfs_buf			**pbp)
> > +{
> > +	int				error;
> > +
> > +	error = xfs_btree_lookup_get_block(bs->cur, level, pp, pblock);
> > +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error) || !pblock)
> > +		return error;
> > +
> > +	xfs_btree_get_block(bs->cur, level, pbp);
> > +	error = xfs_btree_check_block(bs->cur, *pblock, level, *pbp);
> > +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error))
> > +		return error;
> 
> xfs_btree_check_block() will throw error reports to dmesg for each
> corrupt block that is found. Do we want scrub to do this, or should
> it just report the corrupt block to userspace?

Having looked at xfs_btree_check_block again, I prefer not to spew to
dmesg at all for scrub operations in favor of simply reporting the
corruption back to userland.  I think I'll copy it to scrub so that we
can have better tracepointing and eliminate the XFS_TEST_ERROR that will
get in the way.

> 
> > +
> > +	/*
> > +	 * Check the block's siblings; this function absorbs error codes
> > +	 * for us.
> > +	 */
> > +	return xfs_scrub_btree_block_check_siblings(bs, *pblock);
> > +}
> > +
> >  /*
> >   * Visit all nodes and leaves of a btree.  Check that all pointers and
> >   * records are in order, that the keys reflect the records, and use a callback
> > @@ -107,6 +253,93 @@ xfs_scrub_btree(
> >  	struct xfs_owner_info		*oinfo,
> >  	void				*private)
> >  {
> > -	xfs_scrub_btree_op_ok(sc, cur, 0, false);
> > -	return -EOPNOTSUPP;
> > +	struct xfs_scrub_btree		bs = {0};
> > +	union xfs_btree_ptr		ptr;
> > +	union xfs_btree_ptr		*pp;
> > +	struct xfs_btree_block		*block;
> > +	int				level;
> > +	struct xfs_buf			*bp;
> > +	int				i;
> > +	int				error = 0;
> > +
> > +	/* Initialize scrub state */
> > +	bs.cur = cur;
> > +	bs.scrub_rec = scrub_fn;
> > +	bs.oinfo = oinfo;
> > +	bs.firstrec = true;
> > +	bs.private = private;
> > +	bs.sc = sc;
> > +	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
> > +		bs.firstkey[i] = true;
> > +	INIT_LIST_HEAD(&bs.to_check);
> > +
> > +	/* Don't try to check a tree with a height we can't handle. */
> > +	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
> > +		xfs_scrub_btree_set_corrupt(sc, cur, 0);
> > +		goto out;
> > +	}
> > +
> > +	/* Make sure the root isn't in the superblock. */
> > +	if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)) {
> > +		cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > +		if (!xfs_scrub_btree_ptr_ok(&bs, cur->bc_nlevels - 1, &ptr))
> > +			goto out;
> 
> Set corrupt if the init ptr is bad?

ptr_ok already sets corrupt for us.  Will update docs to point that out.

> And why do this check before the code below that has another
> init_ptr_from_cur() call?
> 
> > +	}
> > +
> > +	/*
> > +	 * Load the root of the btree.  The helper function absorbs
> > +	 * error codes for us.
> > +	 */
> > +	level = cur->bc_nlevels - 1;
> > +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> 
> i.e.
> 
> 	level = cur->bc_nlevels - 1;
> 	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> 	if (!(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) &&
> 	    !xfs_scrub_btree_ptr_ok(&bs, level, &ptr)) {
> 		xfs_scrub_btree_set_corrupt(sc, cur, 0);
> 		goto out;
> 	}

Fair enough, that'll make it straighter.

> Which makes me ask the question - why aren't we validating the
> initial pointer when the root is in an inode?

What /is/ the correct initial pointer value for when the root is an
inode?  xfs_bmbt_init_ptr_from_cur returns a pointer to fsb 0, which to
me seems wrong.  Maybe it should return NULLFSBLOCK since the root of the
btree isn't a block anyway?  But perhaps it returns zero to avoid
tripping up xfs_btree_check_lptr....

What if I rewrite the start of xfs_scrub_btree_ptr_ok to be:

	if ((cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) &&
	    level == cur->bc_nlevels - 1) {
		if (ptr->l != 0) {
			xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
			return false;
		}
		return true;
	}

	if (xfs_btree_ptr_is_null(cur, ptr)) {
		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
		return false;
	}

and then your suggested callsite in xfs_scrub_btree becomes:

	level = cur->bc_nlevels - 1;
	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
	if (!xfs_scrub_btree_ptr_ok(&bs, level, &ptr))
		goto out;

--D

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 08/25] xfs: create helpers to scan an allocation group
  2017-10-04  0:46   ` Dave Chinner
@ 2017-10-04  3:58     ` Darrick J. Wong
  2017-10-04  5:59       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  3:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 11:46:03AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:40PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Add some helpers to enable us to lock an AG's headers, create btree
> > cursors for all btrees in that allocation group, and clean up
> > afterwards.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/common.c |  173 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/common.h |   10 +++
> >  fs/xfs/scrub/scrub.c  |    4 +
> >  fs/xfs/scrub/scrub.h  |   21 ++++++
> >  4 files changed, 208 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > index a84ba19..b056c9d 100644
> > --- a/fs/xfs/scrub/common.c
> > +++ b/fs/xfs/scrub/common.c
> > @@ -44,6 +44,7 @@
> >  #include "scrub/scrub.h"
> >  #include "scrub/common.h"
> >  #include "scrub/trace.h"
> > +#include "scrub/btree.h"
> >  
> >  /* Common code for the metadata scrubbers. */
> >  
> > @@ -298,6 +299,178 @@ xfs_scrub_set_incomplete(
> >  	trace_xfs_scrub_incomplete(sc, __return_address);
> >  }
> >  
> > +/*
> > + * AG scrubbing
> > + *
> > + * These helpers facilitate locking an allocation group's header
> > + * buffers, setting up cursors for all btrees that are present, and
> > + * cleaning everything up once we're through.
> > + */
> > +
> > +/* Grab all the headers for an AG. */
> > +int
> > +xfs_scrub_ag_read_headers(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_agnumber_t			agno,
> > +	struct xfs_buf			**agi,
> > +	struct xfs_buf			**agf,
> > +	struct xfs_buf			**agfl)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error;
> > +
> > +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, agi);
> > +	if (error)
> > +		goto out;
> > +
> > +	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, agf);
> > +	if (error)
> > +		goto out;
> > +	if (!*agf) {
> > +		error = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	error = xfs_alloc_read_agfl(mp, sc->tp, agno, agfl);
> > +	if (error)
> > +		goto out;
> > +
> > +out:
> > +	return error;
> > +}
> 
> It's not immediately obvious what releases the buffers on error.
> Maybe add a comment to say cleanup/release on error is unconditional
> through xfs_scrub_ag_free()?
> 
> Hmmm - now there's a question - is the reference we get here freed
> through cancelling the fake transaction, or via the manual
> xfs_trans_brelse() call in the free function? which one happens
> first? add that to the comment?

The AG headers /should/ always be released by the xfs_trans_brelse calls
in the ag_free function, with a failsafe that the trans_cancel will dump
anything else that we came across during our check, just in case all
heck broke loose while we were checking.

> And given this locks out the AG from allocation for an arbitrary
> length of time, I'm wondering if we should add a flag into the pag
> somewhere to say "being scrubbed" so the extent and inode allocation
> code can skip over this AG and not block trying to lock it...

That might be a good idea for an end-of-series enhancement.

Though it could use a little more engineering thought -- what about
a more general ability to mark an AG offline?  ISTR we discussed growing
the ability to shut down an AG (rather than the whole FS) if scrub finds
problems, and/or being able to control that from spaceman.  The patch
was "spaceman: AG state control".

xfs_scrub has an -e option that allows the admin to specify what happens
on an error.  Right now it'll just shut down the filesystem, but
presumably it could react to a per-ag metadata problem by shutting down
the AG.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 09/25] xfs: scrub the backup superblocks
  2017-10-04  0:57   ` Dave Chinner
@ 2017-10-04  4:06     ` Darrick J. Wong
  2017-10-04  6:13       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  4:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 11:57:09AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:46PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Ensure that the geometry presented in the backup superblocks matches
> > the primary superblock so that repair can recover the filesystem if
> > that primary gets corrupted.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/Makefile         |    1 
> >  fs/xfs/libxfs/xfs_fs.h  |    3 
> >  fs/xfs/scrub/agheader.c |  317 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/common.h   |    2 
> >  fs/xfs/scrub/scrub.c    |    4 +
> >  fs/xfs/scrub/scrub.h    |    1 
> >  6 files changed, 327 insertions(+), 1 deletion(-)
> >  create mode 100644 fs/xfs/scrub/agheader.c
> > 
> > 
> > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > index 5888b9f..e92d04d 100644
> > --- a/fs/xfs/Makefile
> > +++ b/fs/xfs/Makefile
> > @@ -146,6 +146,7 @@ ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
> >  
> >  xfs-y				+= $(addprefix scrub/, \
> >  				   trace.o \
> > +				   agheader.o \
> >  				   btree.o \
> >  				   common.o \
> >  				   scrub.o \
> > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > index 765f91e..8543cbb 100644
> > --- a/fs/xfs/libxfs/xfs_fs.h
> > +++ b/fs/xfs/libxfs/xfs_fs.h
> > @@ -484,9 +484,10 @@ struct xfs_scrub_metadata {
> >  
> >  /* Scrub subcommands. */
> >  #define XFS_SCRUB_TYPE_PROBE	0	/* presence test ioctl */
> > +#define XFS_SCRUB_TYPE_SB	1	/* superblock */
> >  
> >  /* Number of scrub subcommands. */
> > -#define XFS_SCRUB_TYPE_NR	1
> > +#define XFS_SCRUB_TYPE_NR	2
> >  
> >  /* i: Repair this metadata. */
> >  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > new file mode 100644
> > index 0000000..487c4f4
> > --- /dev/null
> > +++ b/fs/xfs/scrub/agheader.c
> > @@ -0,0 +1,317 @@
> > +/*
> > + * Copyright (C) 2017 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_btree.h"
> > +#include "xfs_bit.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_sb.h"
> > +#include "xfs_inode.h"
> > +#include "scrub/xfs_scrub.h"
> > +#include "scrub/scrub.h"
> > +#include "scrub/common.h"
> > +#include "scrub/trace.h"
> > +
> > +/* Set us up to check an AG header. */
> > +int
> > +xfs_scrub_setup_ag_header(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> 
> Not immediately clear what "AG header" is being set up here?

AGF/AGFL/AGI.  All three of them.  Maybe I ought to split them into
three separate files...?

> 
> > +	struct xfs_mount		*mp = sc->mp;
> > +
> > +	if (sc->sm->sm_agno >= mp->m_sb.sb_agcount ||
> > +	    sc->sm->sm_ino || sc->sm->sm_gen)
> > +		return -EINVAL;
> > +	return xfs_scrub_setup_fs(sc, ip);
> > +}
> > +
> > +/* Superblock */
> > +
> > +/* Scrub the filesystem superblock. */
> > +int
> > +xfs_scrub_superblock(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_dsb			*sb;
> > +	xfs_agnumber_t			agno;
> > +	uint32_t			v2_ok;
> > +	__be32				features_mask;
> > +	int				error;
> > +	__be16				vernum_mask;
> > +
> > +	agno = sc->sm->sm_agno;
> > +	if (agno == 0)
> > +		return 0;
> > +
> > +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> > +		  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
> > +		  XFS_FSS_TO_BB(mp, 1), 0, &bp, &xfs_sb_buf_ops);
> > +	if (!xfs_scrub_op_ok(sc, agno, XFS_SB_BLOCK(mp), &error))
> > +		return error;
> 
> Might be worth a comment to say the verifier is doing validity/range
> checks of the on-disk fields so they aren't duplicated here. I took
> a little while to work out why range checks weren't being done
> here...

Ok.

> > +
> > +	sb = XFS_BUF_TO_SBP(bp);
> > +
> > +	/*
> > +	 * Verify the geometries match.  Fields that are permanently
> > +	 * set by mkfs are checked; fields that can be updated later
> > +	 * (and are not propagated to backup superblocks) are preen
> > +	 * checked.
> > +	 */
> > +	if (sb->sb_blocksize != cpu_to_be32(mp->m_sb.sb_blocksize))
> > +		xfs_scrub_block_set_corrupt(sc, bp);
> > +
> > +	if (sb->sb_dblocks != cpu_to_be64(mp->m_sb.sb_dblocks))
> > +		xfs_scrub_block_set_corrupt(sc, bp);
> 
> Just wondering - once we've set the corrupt flag, do we need to
> bother checking any of the other fields? It makes no difference to
> what is reported to userspace or the action it is going to take,
> so couldn't we just do something like:

This is something I've also struggled with for quite a while.  The most
pragmatic reaction is to set the corrupt flag and jump out immediately
on any failure since we really only care about whether or not we have to
react to bad metadata either by fixing it or shutting down.

On the other hand, continuing with the checks gives us the ability to
report /everything/ that's broken in the data structure, which could be
useful for online forensics (cough) to correlate scrub's report against
anything else that has popped up in dmesg.

A downside of having everything jump to a single call to
xfs_scrub_block_set_corrupt at the end of the function is that the
return address that we record in the tracepoint will be the end of the
function instead of right after the failing check.

(Turning on the ludicrous speed optimizer might do that anyway...)
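
To make the tradeoff concrete, here is a small userspace sketch of both
styles.  It is illustrative only: struct demo_sb, the DEMO_BAD_* bits
and both functions are hypothetical, and only four geometry fields are
modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of superblock geometry fields. */
struct demo_sb {
	uint32_t	blocksize;
	uint64_t	dblocks;
	uint32_t	agblocks;
	uint32_t	agcount;
};

#define DEMO_BAD_BLOCKSIZE	(1u << 0)
#define DEMO_BAD_DBLOCKS	(1u << 1)
#define DEMO_BAD_AGBLOCKS	(1u << 2)
#define DEMO_BAD_AGCOUNT	(1u << 3)

/* Report-everything style: keep going and collect every mismatch. */
static unsigned int
demo_sb_check_all(const struct demo_sb *pri, const struct demo_sb *bak)
{
	unsigned int bad = 0;

	if (bak->blocksize != pri->blocksize)
		bad |= DEMO_BAD_BLOCKSIZE;
	if (bak->dblocks != pri->dblocks)
		bad |= DEMO_BAD_DBLOCKS;
	if (bak->agblocks != pri->agblocks)
		bad |= DEMO_BAD_AGBLOCKS;
	if (bak->agcount != pri->agcount)
		bad |= DEMO_BAD_AGCOUNT;
	return bad;
}

/* Early-exit style: stop at the first mismatch; true means "matches". */
static bool
demo_sb_check_first(const struct demo_sb *pri, const struct demo_sb *bak)
{
	return bak->blocksize == pri->blocksize &&
	       bak->dblocks == pri->dblocks &&
	       bak->agblocks == pri->agblocks &&
	       bak->agcount == pri->agcount;
}
```

Both styles agree on whether the backup is bad; only the first can say
which fields disagreed.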

--D

> 	if (sb->sb_dblocks != cpu_to_be64(mp->m_sb.sb_dblocks))
> 		goto out_corrupt;
> 
> .....
> out_corrupt:
> 	xfs_scrub_block_set_corrupt(sc, bp);
> 	return 0;
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/25] xfs: scrub AGF and AGFL
  2017-10-04  1:31   ` Dave Chinner
@ 2017-10-04  4:21     ` Darrick J. Wong
  2017-10-04  6:28       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  4:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 12:31:48PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:52PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Check the block references in the AGF and AGFL headers to make sure
> > they make sense.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_fs.h  |    4 +
> >  fs/xfs/scrub/agheader.c |  220 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/common.c   |   60 +++++++++++++
> >  fs/xfs/scrub/common.h   |    6 +
> >  fs/xfs/scrub/scrub.c    |    8 ++
> >  fs/xfs/scrub/scrub.h    |    2 
> >  6 files changed, 299 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > index 8543cbb..aeb2a66 100644
> > --- a/fs/xfs/libxfs/xfs_fs.h
> > +++ b/fs/xfs/libxfs/xfs_fs.h
> > @@ -485,9 +485,11 @@ struct xfs_scrub_metadata {
> >  /* Scrub subcommands. */
> >  #define XFS_SCRUB_TYPE_PROBE	0	/* presence test ioctl */
> >  #define XFS_SCRUB_TYPE_SB	1	/* superblock */
> > +#define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
> > +#define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
> >  
> >  /* Number of scrub subcommands. */
> > -#define XFS_SCRUB_TYPE_NR	2
> > +#define XFS_SCRUB_TYPE_NR	4
> >  
> >  /* i: Repair this metadata. */
> >  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > index 487c4f4..7fe6630 100644
> > --- a/fs/xfs/scrub/agheader.c
> > +++ b/fs/xfs/scrub/agheader.c
> > @@ -49,6 +49,72 @@ xfs_scrub_setup_ag_header(
> >  	return xfs_scrub_setup_fs(sc, ip);
> >  }
> >  
> > +/* Find the size of the AG, in blocks. */
> > +static inline xfs_agblock_t
> > +xfs_scrub_ag_blocks(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno)
> > +{
> > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > +
> > +	if (agno < mp->m_sb.sb_agcount - 1)
> > +		return mp->m_sb.sb_agblocks;
> > +	return mp->m_sb.sb_dblocks - (agno * mp->m_sb.sb_agblocks);
> > +}
> 
> Can you make this a generic libxfs function, say xfs_get_ag_blocks()?
> This same calculation is repeated in quite a few places, especially
> in userspace...

Ok.
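
A minimal userspace sketch of what such a shared helper might compute.
The struct, field and function names below are hypothetical stand-ins,
not the libxfs ones.  The sketch also widens the product to 64 bits
before subtracting, which is an assumption about what a generic helper
would want rather than something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the superblock geometry. */
struct fs_geom {
	uint64_t	dblocks;	/* total blocks in the filesystem */
	uint32_t	agblocks;	/* blocks in a full-sized AG */
	uint32_t	agcount;	/* number of AGs */
};

/* Return the size of AG @agno in blocks; only the last AG may be short. */
static inline uint32_t
ag_block_count(const struct fs_geom *g, uint32_t agno)
{
	assert(agno < g->agcount);

	if (agno < g->agcount - 1)
		return g->agblocks;
	/* Last AG: whatever is left after all the full-sized AGs. */
	return (uint32_t)(g->dblocks - (uint64_t)agno * g->agblocks);
}
```

For example, with dblocks = 1000, agblocks = 300 and agcount = 4, AGs
0-2 hold 300 blocks each and AG 3 holds the remaining 100.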

> > +
> > +/* Walk all the blocks in the AGFL. */
> > +int
> > +xfs_scrub_walk_agfl(
> > +	struct xfs_scrub_context	*sc,
> > +	int				(*fn)(struct xfs_scrub_context *,
> > +					      xfs_agblock_t bno, void *),
> > +	void				*priv)
> > +{
> > +	struct xfs_agf			*agf;
> > +	__be32				*agfl_bno;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	unsigned int			flfirst;
> > +	unsigned int			fllast;
> > +	int				i;
> > +	int				error;
> > +
> > +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, sc->sa.agfl_bp);
> > +	flfirst = be32_to_cpu(agf->agf_flfirst);
> > +	fllast = be32_to_cpu(agf->agf_fllast);
> > +
> > +	/* Skip an empty AGFL. */
> > +	if (agf->agf_flcount == cpu_to_be32(0))
> > +		return 0;
> 
> Check flfirst -> fllast == flcount.

<nod>

> ....
> 
> > +/* Scrub the AGF. */
> > +int
> > +xfs_scrub_agf(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_agf			*agf;
> > +	xfs_daddr_t			daddr;
> > +	xfs_daddr_t			eofs;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			agbno;
> > +	xfs_agblock_t			eoag;
> > +	xfs_agblock_t			agfl_first;
> > +	xfs_agblock_t			agfl_last;
> > +	xfs_agblock_t			agfl_count;
> > +	xfs_agblock_t			fl_count;
> > +	int				level;
> > +	int				error = 0;
> > +
> > +	agno = sc->sm->sm_agno;
> > +	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGF);
> > +	if (!xfs_scrub_op_ok(sc, agno, XFS_AGF_BLOCK(sc->mp), &error))
> > +		goto out;
> > +
> > +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > +	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> > +
> > +	/* Check the AG length */
> > +	eoag = be32_to_cpu(agf->agf_length);
> > +	if (eoag != xfs_scrub_ag_blocks(mp, agno))
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +
> > +	/* Check the AGF btree roots and levels */
> > +	agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_BNO]);
> > +	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> > +	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
> 
> I'm assuming that you are checking that the block isn't part of the
> static metadata range with this XFS_AGI_BLOCK() check? Shouldn't it
> actually be agbno <= XFS_AGFL_BLOCK(mp) i.e. the AGFL block address?

D'oh! Yes.

> I think we need a generic "verify agbno" function. These checks seem
> to be open coded throughout the code instead of calling a single
> function that does all the checks. The short btree pointers can use
> it as well...
> 
> > +	    agbno >= eoag || daddr >= eofs)
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +
> > +	agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_CNT]);
> > +	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> > +	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
> > +	    agbno >= eoag || daddr >= eofs)
> 
> There's another.
> 
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +
> > +	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
> > +	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +
> > +	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
> > +	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +
> > +	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> > +		agbno = be32_to_cpu(agf->agf_roots[XFS_BTNUM_RMAP]);
> > +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> > +		if (agbno <= XFS_AGI_BLOCK(mp) ||
> > +		    agbno >= mp->m_sb.sb_agblocks ||
> > +		    agbno >= eoag ||
> > +		    daddr >= eofs)
> 
> And another.
> 
> > +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +
> > +		level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
> > +		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> > +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +	}
> > +
> > +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> > +		agbno = be32_to_cpu(agf->agf_refcount_root);
> > +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> > +		if (agbno <= XFS_AGI_BLOCK(mp) ||
> > +		    agbno >= mp->m_sb.sb_agblocks ||
> > +		    agbno >= eoag ||
> > +		    daddr >= eofs)
> 
> And another.

Yes I see your point, I'll add some helpers to check that something
hasn't gone off the end of the AG or the FS.
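
A rough sketch of the kind of helper meant here, in userspace C.  The
names are hypothetical and the geometry is reduced to two numbers: the
first block past the static AG headers and the length of this AG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG bounds used by the check below. */
struct ag_bounds {
	uint32_t	first_usable;	/* first block past the AG headers */
	uint32_t	ag_length;	/* number of blocks in this AG */
};

/*
 * Return true if @agbno could point at a dynamically allocated block:
 * it must lie beyond the static headers and before the end of the AG.
 */
static inline bool
agbno_in_bounds(const struct ag_bounds *ag, uint32_t agbno)
{
	assert(ag != NULL && ag->first_usable <= ag->ag_length);

	return agbno >= ag->first_usable && agbno < ag->ag_length;
}
```

Each open-coded four-way comparison in the patch would then collapse
to a single call per btree root.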

> > +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +
> > +		level = be32_to_cpu(agf->agf_refcount_level);
> > +		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> > +			xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> > +	}
> > +
> > +	/* Check the AGFL counters */
> > +	agfl_first = be32_to_cpu(agf->agf_flfirst);
> > +	agfl_last = be32_to_cpu(agf->agf_fllast);
> > +	agfl_count = be32_to_cpu(agf->agf_flcount);
> > +	if (agfl_last > agfl_first)
> > +		fl_count = agfl_last - agfl_first + 1;
> > +	else
> > +		fl_count = XFS_AGFL_SIZE(mp) - agfl_first + agfl_last + 1;
> > +	if (agfl_count != 0 && fl_count != agfl_count)
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agf_bp);
> 
> Oh, the agfl counts are checked here. Maybe put a comment in
> xfs_scrub_walk_agfl() to mention this?

Ok.
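
For reference, the wraparound arithmetic behind that counter check can
be sketched as below.  The function name is hypothetical and @size
stands in for XFS_AGFL_SIZE(mp).  Note one assumption: this sketch
treats first == last as a list holding one entry, so it compares with
>= where the quoted patch uses >.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of active slots in a circular free list whose first and last
 * indices are both inclusive.  The caller handles the empty list
 * separately, as the patch does with its agfl_count != 0 test.
 */
static inline uint32_t
agfl_active_count(uint32_t first, uint32_t last, uint32_t size)
{
	assert(first < size && last < size);

	if (last >= first)
		return last - first + 1;
	/* The list wraps around the end of the array. */
	return size - first + last + 1;
}
```

With size = 10, first = 8 and last = 1 the active slots are 8, 9, 0
and 1, so the computed count is 4.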

> 
> .....
> 
> > +struct xfs_scrub_agfl {
> > +	xfs_agblock_t			eoag;
> > +	xfs_daddr_t			eofs;
> > +};
> > +
> > +/* Scrub an AGFL block. */
> > +STATIC int
> > +xfs_scrub_agfl_block(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_agblock_t			agbno,
> > +	void				*priv)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_agnumber_t			agno = sc->sa.agno;
> > +	struct xfs_scrub_agfl		*sagfl = priv;
> > +	int				error = 0;
> > +
> > +	if (agbno <= XFS_AGI_BLOCK(mp) ||
> > +	    agbno >= mp->m_sb.sb_agblocks ||
> > +	    agbno >= sagfl->eoag ||
> > +	    XFS_AGB_TO_DADDR(mp, agno, agbno) >= sagfl->eofs)
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agfl_bp);
> > +
> > +	return error;
> > +}
> 
> Oh, look, there's another xfs_agbno_verify() function call :P
> 
> .....
> > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > index b056c9d..ee8e7be 100644
> > --- a/fs/xfs/scrub/common.c
> > +++ b/fs/xfs/scrub/common.c
> > @@ -471,6 +471,66 @@ xfs_scrub_ag_init(
> >  	return xfs_scrub_ag_btcur_init(sc, sa);
> >  }
> >  
> > +/*
> > + * Load and verify an AG header for further AG header examination.
> > + * If this header is not the target of the examination, don't return
> > + * the buffer if a runtime or verifier error occurs.
> > + */
> > +STATIC int
> > +xfs_scrub_load_ag_header(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_daddr_t			daddr,
> > +	struct xfs_buf			**bpp,
> > +	const struct xfs_buf_ops	*ops,
> > +	bool				is_target)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error;
> > +
> > +	*bpp = NULL;
> > +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> > +			XFS_AG_DADDR(mp, sc->sa.agno, daddr),
> > +			XFS_FSS_TO_BB(mp, 1), 0, bpp, ops);
> > +	return is_target ? error : 0;
> > +}
> > +
> > +/*
> > + * Load as many of the AG headers and btree cursors as we can for an
> > + * examination and cross-reference of an AG header.
> > + */
> > +int
> > +xfs_scrub_load_ag_headers(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_agnumber_t			agno,
> > +	unsigned int			type)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error;
> > +
> > +	ASSERT(type == XFS_SCRUB_TYPE_AGF || type == XFS_SCRUB_TYPE_AGFL);
> > +	memset(&sc->sa, 0, sizeof(sc->sa));
> > +	sc->sa.agno = agno;
> > +
> > +	error = xfs_scrub_load_ag_header(sc, XFS_AGI_DADDR(mp),
> > +			&sc->sa.agi_bp, &xfs_agi_buf_ops, false);
> > +	if (error)
> > +		return error;
> > +
> > +	error = xfs_scrub_load_ag_header(sc, XFS_AGF_DADDR(mp),
> > +			&sc->sa.agf_bp, &xfs_agf_buf_ops,
> > +			type == XFS_SCRUB_TYPE_AGF);
> > +	if (error)
> > +		return error;
> > +
> > +	error = xfs_scrub_load_ag_header(sc, XFS_AGFL_DADDR(mp),
> > +			&sc->sa.agfl_bp, &xfs_agfl_buf_ops,
> > +			type == XFS_SCRUB_TYPE_AGFL);
> > +	if (error)
> > +		return error;
> > +
> > +	return 0;
> > +}
> 
> This should probably be combined with xfs_scrub_ag_read_headers().
> They essentially do the same thing, the only difference is the
> "target" error reporting.

It's quite different -- this function ignores verifier errors for
the two headers that don't match 'type'.  In other words, if we're
checking the AGF (for example) we'll try to grab the AGI and the AGFL.
Verifier errors on the AGI/AGFL don't matter, but we /do/ want to hear
the results if the AGF verifier fails.

xfs_scrub_ag_read_headers on the other hand will fail if /any/ of the
three verifiers fail.  We want this behavior for the btree scrubbers so
that we can bail out with an operational error if the headers are bad,
but we don't want this behavior for the header scrubbers because an AGI
verifier error can cause the AGF verifier to report corruption.

Later on, repair will want the perag stuff loaded (which
xfs_scrub_load_ag_headers doesn't do), fwiw.

The two functions /could/ be combined, though the 'type' test becomes
trickier.  Maybe it'd be better just to enhance the comments for the two
header loader functions to spell out how they differ in usage.
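
The difference in error policy can be sketched in userspace C as
follows.  Everything here is hypothetical: the read results are passed
in as an array of 0 or negative errno values instead of coming from
real buffer reads, and neither function is the kernel one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum hdr_type { HDR_AGI, HDR_AGF, HDR_AGFL, HDR_NR };

/*
 * Header-scrubber policy: only a failure on the header being scrubbed
 * is returned; failures on the other two headers are ignored.
 */
static inline int
load_headers_for_target(const int read_err[HDR_NR], enum hdr_type target)
{
	assert(read_err != NULL && target < HDR_NR);

	return read_err[target];
}

/* Btree-scrubber policy: any header failure is an operational error. */
static inline int
load_headers_strict(const int read_err[HDR_NR])
{
	int	i;

	assert(read_err != NULL);
	for (i = 0; i < HDR_NR; i++) {
		if (read_err[i])
			return read_err[i];
	}
	return 0;
}
```

With a failing AGI and healthy AGF/AGFL, the first policy lets an AGF
scrub proceed while the second bails out.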

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/25] xfs: scrub the AGI
  2017-10-04  1:43   ` Dave Chinner
@ 2017-10-04  4:25     ` Darrick J. Wong
  2017-10-04  6:43       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04  4:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 12:43:47PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:41:59PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Add a forgotten check to the AGI verifier, then wire up the scrub
> > infrastructure to check the AGI contents.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_fs.h  |    3 +-
> >  fs/xfs/scrub/agheader.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/common.c   |    6 ++-
> >  fs/xfs/scrub/scrub.c    |    4 ++
> >  fs/xfs/scrub/scrub.h    |    1 +
> >  5 files changed, 99 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > index aeb2a66..1e326dd 100644
> > --- a/fs/xfs/libxfs/xfs_fs.h
> > +++ b/fs/xfs/libxfs/xfs_fs.h
> > @@ -487,9 +487,10 @@ struct xfs_scrub_metadata {
> >  #define XFS_SCRUB_TYPE_SB	1	/* superblock */
> >  #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
> >  #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
> > +#define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
> >  
> >  /* Number of scrub subcommands. */
> > -#define XFS_SCRUB_TYPE_NR	4
> > +#define XFS_SCRUB_TYPE_NR	5
> >  
> >  /* i: Repair this metadata. */
> >  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > index 7fe6630..3d269c2 100644
> > --- a/fs/xfs/scrub/agheader.c
> > +++ b/fs/xfs/scrub/agheader.c
> > @@ -535,3 +535,91 @@ xfs_scrub_agfl(
> >  out:
> >  	return error;
> >  }
> > +
> > +/* AGI */
> > +
> > +/* Scrub the AGI. */
> > +int
> > +xfs_scrub_agi(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_agi			*agi;
> > +	xfs_daddr_t			daddr;
> > +	xfs_daddr_t			eofs;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			agbno;
> > +	xfs_agblock_t			eoag;
> > +	xfs_agino_t			agino;
> > +	xfs_agino_t			first_agino;
> > +	xfs_agino_t			last_agino;
> > +	int				i;
> > +	int				level;
> > +	int				error = 0;
> > +
> > +	agno = sc->sm->sm_agno;
> > +	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGI);
> > +	if (!xfs_scrub_op_ok(sc, agno, XFS_AGI_BLOCK(sc->mp), &error))
> > +		goto out;
> > +
> > +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> > +	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> > +
> > +	/* Check the AG length */
> > +	eoag = be32_to_cpu(agi->agi_length);
> > +	if (eoag != xfs_scrub_ag_blocks(mp, agno))
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> 
> Should we be cross checking that the AGI and AGF both have
> the same length here?

Isn't that what this does?  Albeit indirectly?

xfs_scrub_ag_blocks returns sb_agblocks for every AG except the last one.
For the last AG it returns (sb_dblocks - (all blocks in the other AGs))
which should be the same as agf->agf_length, right?

> > +
> > +	/* Check btree roots and levels */
> > +	agbno = be32_to_cpu(agi->agi_root);
> > +	daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> > +	if (agbno <= XFS_AGI_BLOCK(mp) || agbno >= mp->m_sb.sb_agblocks ||
> > +	    agbno >= eoag || daddr >= eofs)
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> 
> xfs_verify_agbno(), again :P
> 
> > +
> > +	level = be32_to_cpu(agi->agi_level);
> > +	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> > +
> > +	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
> > +		agbno = be32_to_cpu(agi->agi_free_root);
> > +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno);
> > +		if (agbno <= XFS_AGI_BLOCK(mp) ||
> > +		    agbno >= mp->m_sb.sb_agblocks ||
> > +		    agbno >= eoag ||
> > +		    daddr >= eofs)
> 
> Broken records are us....
> 
> > +			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> > +
> > +		level = be32_to_cpu(agi->agi_free_level);
> > +		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> > +			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> > +	}
> > +
> > +	/* Check inode counters */
> > +	first_agino = XFS_OFFBNO_TO_AGINO(mp, XFS_AGI_BLOCK(mp) + 1, 0);
> 
> Don't think this is right. AGFL, not AGI....

Yes.

> > +	last_agino = XFS_OFFBNO_TO_AGINO(mp, eoag + 1, 0) - 1;
> 
> Not sure this is right, either, because inode chunks won't be
> allocated over the end of the AG. Hence if the eoag is not chunk
> aligned, there will be up to (chunk size - 1) blocks inodes won't be
> allocated in...

Yes.

> > +	agino = be32_to_cpu(agi->agi_count);
> > +	if (agino > last_agino - first_agino + 1 ||
> > +	    agino < be32_to_cpu(agi->agi_freecount))
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> 
> Please don't use agino as a count of inodes - this confused me
> very much because I was wondering how these checks were in any way
> related to valid AG inode numbers....

<nod> Sorry about that, will fix.

> > +
> > +	/* Check inode pointers */
> > +	agino = be32_to_cpu(agi->agi_newino);
> > +	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> > +	agino = be32_to_cpu(agi->agi_dirino);
> > +	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))
> > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> 
> Perhaps we also need a xfs_verify_agino() helper here.
> 
> > +	/* Check unlinked inode buckets */
> > +	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
> > +		agino = be32_to_cpu(agi->agi_unlinked[i]);
> > +		if (agino == NULLAGINO)
> > +			continue;
> > +		if (agino < first_agino || agino > last_agino)
> > +			xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> 
> This is effectively the same check as above:
> 
> 	if (agino != NULLAGINO && (agino < first_agino || agino > last_agino))
> 
> so all these checks could use the same helper to make it easier
> to read.

<nod> Will do.  Thank you for the review so far!
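
A sketch of how one shared helper could cover all three call sites
(agi_newino, agi_dirino and the unlinked buckets).  The names are
hypothetical, NULL_AGINO is a stand-in for NULLAGINO, and the
first/last bounds are simply passed in, since how to compute them
correctly is the open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the "no inode" sentinel value. */
#define NULL_AGINO	((uint32_t)-1)

/* Is @agino inside the inclusive range of valid AG inode numbers? */
static inline bool
agino_in_range(uint32_t agino, uint32_t first, uint32_t last)
{
	assert(first <= last);

	return agino >= first && agino <= last;
}

/* An inode pointer field is fine if it is null or points in range. */
static inline bool
agino_ptr_ok(uint32_t agino, uint32_t first, uint32_t last)
{
	return agino == NULL_AGINO || agino_in_range(agino, first, last);
}
```

Each "agino != NULLAGINO && (agino < first_agino || agino >
last_agino)" test then becomes a negated call to the second function.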

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/25] xfs: scrub the shape of a metadata btree
  2017-10-04  3:51     ` Darrick J. Wong
@ 2017-10-04  5:48       ` Dave Chinner
  2017-10-04 17:48         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  5:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 08:51:17PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 11:15:35AM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:41:27PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Create a function that can check the shape of a btree -- each block
> > > passes basic inspection and all the pointers look ok.  In the next patch
> > > we'll add the ability to check the actual keys and records stored within
> > > the btree.  Add some helper functions so that we report detailed scrub
> > > errors in a uniform manner in dmesg.  These are helper functions for
> > > subsequent patches.
> > .....
> > >  
> > > +/* Check a btree pointer.  Returns true if it's ok to use this pointer. */
> > > +static bool
> > > +xfs_scrub_btree_ptr_ok(
> > > +	struct xfs_scrub_btree		*bs,
> > > +	int				level,
> > > +	union xfs_btree_ptr		*ptr)
> > > +{
> > > +	struct xfs_btree_cur		*cur = bs->cur;
> > > +	xfs_daddr_t			daddr;
> > > +	xfs_daddr_t			eofs;
> > > +
> > > +	if (xfs_btree_ptr_is_null(cur, ptr)) {
> > > +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > > +		return false;
> > > +	}
> > > +	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
> > > +		daddr = XFS_FSB_TO_DADDR(cur->bc_mp, be64_to_cpu(ptr->l));
> > > +	} else {
> > > +		ASSERT(cur->bc_private.a.agno != NULLAGNUMBER);
> > > +		daddr = XFS_AGB_TO_DADDR(cur->bc_mp, cur->bc_private.a.agno,
> > > +				be32_to_cpu(ptr->s));
> > > +	}
> > > +	eofs = XFS_FSB_TO_BB(cur->bc_mp, cur->bc_mp->m_sb.sb_dblocks);
> > > +	if (daddr == 0 || daddr >= eofs) {
> > > +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > > +		return false;
> > > +	}
> > > +
> > > +	return true;
> > > +}
> > 
> > There seems to be quite a bit of overlap here with
> > xfs_btree_check_ptr(). Indeed, for the short pointers the above code
> > fails to check it is within the bounds of the AG size. I'd suggest
> > both of these should use the same validity checking functions....
> 
> Hmm... you're right that the short pointer needs to be checked against
> the AG size.  That said, the regular xfs_btree_check_ptr function will
> log a XFS_ERROR_REPORT to dmesg, which we don't want, since we're going
> to report the scrub failure to userspace anyway.
> 
> I think I prefer to fix this existing function since it's silent and
> we can maintain the current behavior where a failure in regular
> operation gets logged to dmesg.

I'd prefer a core function that doesn't ERROR_REPORT, and a version
with the error report wrapped around the outside to replace the
existing users....

> > ....
> > > +/*
> > > + * Grab and scrub a btree block given a btree pointer.  Returns block
> > > + * and buffer pointers (if applicable) if they're ok to use.
> > > + */
> > > +STATIC int
> > > +xfs_scrub_btree_get_block(
> > > +	struct xfs_scrub_btree		*bs,
> > > +	int				level,
> > > +	union xfs_btree_ptr		*pp,
> > > +	struct xfs_btree_block		**pblock,
> > > +	struct xfs_buf			**pbp)
> > > +{
> > > +	int				error;
> > > +
> > > +	error = xfs_btree_lookup_get_block(bs->cur, level, pp, pblock);
> > > +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error) || !pblock)
> > > +		return error;
> > > +
> > > +	xfs_btree_get_block(bs->cur, level, pbp);
> > > +	error = xfs_btree_check_block(bs->cur, *pblock, level, *pbp);
> > > +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error))
> > > +		return error;
> > 
> > xfs_btree_check_block() will throw error reports to dmesg for each
> > corrupt block that is found. Do we want scrub to do this, or should
> > it just report the corrupt block to userspace?
> 
> Having looked at xfs_btree_check_block again, I prefer not to spew to
> dmesg at all for scrub operations in favor of simply reporting the
> corruption back to userland.  I think I'll copy it to scrub so that we
> can have better tracepointing and eliminate the XFS_TEST_ERROR that will
> get in the way.

As above, I'd much prefer we don't copy-n-paste extremely similar
checks just to avoid an ERROR_REPORT. Factor out the error report,
call the common code here, make xfs_btree_check_block() wrap the
common code with an error report...
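
The shape of that refactoring, as a minimal userspace sketch: a silent
core predicate plus a thin wrapper that adds the logging.  The names,
the simplified bounds check and the -EINVAL return value are all
hypothetical, and fprintf() merely stands in for an error report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Silent core check: usable by a caller that reports failures itself. */
static inline bool
btree_ptr_valid(uint64_t bno, uint64_t nblocks)
{
	assert(nblocks > 0);

	return bno != 0 && bno < nblocks;
}

/* Wrapper for existing callers that want the failure logged. */
static inline int
btree_check_ptr(uint64_t bno, uint64_t nblocks)
{
	if (btree_ptr_valid(bno, nblocks))
		return 0;

	fprintf(stderr, "bad btree pointer %llu (limit %llu)\n",
		(unsigned long long)bno, (unsigned long long)nblocks);
	return -EINVAL;
}
```

A scrub-style caller would use the predicate directly and flag the
corruption itself; everything else keeps calling the wrapper.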

> > Which makes me ask the question - why aren't we validating the
> > initial pointer when the root is in an inode?
> 
> What /is/ the correct initial pointer value for when the root is an
> inode?

Somewhere between FSB 1 and sb_dblocks....?

> xfs_bmbt_init_ptr_from_cur returns a pointer to fsb 0, which to me
> seems wrong.  Maybe it should return NULLFSBLOCK since the root of the
> btree isn't a block anyway?  But perhaps it returns zero to avoid
> tripping up xfs_btree_check_lptr....
> 
> What if I rewrite the start of xfs_scrub_btree_ptr_ok to be:
> 
> 	if ((cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) &&
> 	    level == cur->bc_nlevels - 1) {
> 		if (ptr->l != 0) {
> 			xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> 			return false;
> 		}
> 		return true;
> 	}
> 
> 	if (xfs_btree_ptr_is_null(cur, ptr)) {
> 		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> 		return false;
> 	}
> 
> and then your suggested callsite in xfs_scrub_btree becomes:
> 
> 	level = cur->bc_nlevels - 1;
> 	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> 	if (!xfs_scrub_btree_ptr_ok(&bs, level, &ptr))
> 		goto out;
> 

Makes more sense.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/25] xfs: create helpers to scan an allocation group
  2017-10-04  3:58     ` Darrick J. Wong
@ 2017-10-04  5:59       ` Dave Chinner
  2017-10-04 17:51         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  5:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 08:58:53PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 11:46:03AM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:41:40PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Add some helpers to enable us to lock an AG's headers, create btree
> > > cursors for all btrees in that allocation group, and clean up
> > > afterwards.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/scrub/common.c |  173 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/scrub/common.h |   10 +++
> > >  fs/xfs/scrub/scrub.c  |    4 +
> > >  fs/xfs/scrub/scrub.h  |   21 ++++++
> > >  4 files changed, 208 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > > index a84ba19..b056c9d 100644
> > > --- a/fs/xfs/scrub/common.c
> > > +++ b/fs/xfs/scrub/common.c
> > > @@ -44,6 +44,7 @@
> > >  #include "scrub/scrub.h"
> > >  #include "scrub/common.h"
> > >  #include "scrub/trace.h"
> > > +#include "scrub/btree.h"
> > >  
> > >  /* Common code for the metadata scrubbers. */
> > >  
> > > @@ -298,6 +299,178 @@ xfs_scrub_set_incomplete(
> > >  	trace_xfs_scrub_incomplete(sc, __return_address);
> > >  }
> > >  
> > > +/*
> > > + * AG scrubbing
> > > + *
> > > + * These helpers facilitate locking an allocation group's header
> > > + * buffers, setting up cursors for all btrees that are present, and
> > > + * cleaning everything up once we're through.
> > > + */
> > > +
> > > +/* Grab all the headers for an AG. */
> > > +int
> > > +xfs_scrub_ag_read_headers(
> > > +	struct xfs_scrub_context	*sc,
> > > +	xfs_agnumber_t			agno,
> > > +	struct xfs_buf			**agi,
> > > +	struct xfs_buf			**agf,
> > > +	struct xfs_buf			**agfl)
> > > +{
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	int				error;
> > > +
> > > +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, agi);
> > > +	if (error)
> > > +		goto out;
> > > +
> > > +	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, agf);
> > > +	if (error)
> > > +		goto out;
> > > +	if (!*agf) {
> > > +		error = -ENOMEM;
> > > +		goto out;
> > > +	}
> > > +
> > > +	error = xfs_alloc_read_agfl(mp, sc->tp, agno, agfl);
> > > +	if (error)
> > > +		goto out;
> > > +
> > > +out:
> > > +	return error;
> > > +}
> > 
> > It's not immediately obvious what releases the buffers on error.
> > Maybe add a comment to say cleanup/release on error is unconditional
> > through xfs_scrub_ag_free()?
> > 
> > Hmmm - now there's a question - is the reference we get here freed
> > through cancelling the fake transaction, or via the manual
> > xfs_trans_brelse() call in the free function? which one happens
> > first? add that to the comment?
> 
> The AG headers /should/ always be released by the xfs_trans_brelse calls
> in the ag_free function, with a failsafe that the trans_cancel will dump
> anything else that we came across during our check, just in case all
> heck broke loose while we were checking.

Ok, comments. :P

> 
> > And given this locks out the AG from allocation for an arbitrary
> > length of time, I'm wondering if we should add a flag into the pag
> > somewhere to say "being scrubbed" so the extent and inode allocation
> > code can skip over this AG and not block trying to lock it...
> 
> That might be a good idea for a end-of-series enhancement.

*nod*

> 
> Though it could use a little more engineering thought -- what about
> a more general ability to mark an AG offline?  ISTR we discussed growing
> the ability to shut down an AG (rather than the whole FS) if scrub finds
> problems, and/or being able to control that from spaceman.  The patch
> was "spaceman: AG state control".

Well, only a small part of making an AG offline is preventing
allocation from blocking in it.  What I suggested above is
completely internal functionality that users would never even know
about, so if we later want to add offline AG controls we can rework
the implementation scrub uses to fit into that model....
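
As a rough illustration of the internal-only idea, here is a userspace
sketch of an AG selection loop that skips flagged AGs instead of
waiting on them.  The struct, the flag and the function are
hypothetical; real allocation is far more involved, and a real
implementation would also need a fallback for when every candidate AG
is flagged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG state consulted by the allocator. */
struct ag_state {
	bool		being_scrubbed;	/* set while scrub holds the headers */
	uint32_t	free_blocks;
};

/*
 * Starting at @start, return the first AG that is not being scrubbed
 * and has at least @want free blocks, or -1 if none qualifies.
 */
static inline int
pick_ag(const struct ag_state *ags, uint32_t agcount, uint32_t start,
	uint32_t want)
{
	uint32_t	i;

	assert(ags != NULL && agcount > 0);
	for (i = 0; i < agcount; i++) {
		uint32_t	agno = (start + i) % agcount;

		if (ags[agno].being_scrubbed)
			continue;	/* skip rather than block on its lock */
		if (ags[agno].free_blocks >= want)
			return (int)agno;
	}
	return -1;
}
```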

> xfs_scrub has an -e option that allows the admin to specify what happens
> on an error.  Right now it'll just shut down the filesystem, but
> presumably it could react to a per-ag metadata problem by shutting down
> the AG.

Not that simple, I'm afraid. Think about modifying a directory that
has blocks that span multiple AGs. If we mark an AG as offline, then
what do we do with an attempt to modify that directory block? Even
if we can read it, do we allow the modification to proceed? How do
we even know ahead of time that a directory has blocks in an offline
AG?  And if the AG is shut down, then the attempt to read the
directory block will get EIO, which will cause a dirty transaction
cancellation, which will cause a filesystem wide shutdown...

Let's not complicate a simple optimisation specific to scrub by
trying to make it work with blue-sky functionality that requires us
to solve a bunch of "OMFG HARD!" problems we haven't even thought
about yet, let alone have answers for....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/25] xfs: scrub the backup superblocks
  2017-10-04  4:06     ` Darrick J. Wong
@ 2017-10-04  6:13       ` Dave Chinner
  2017-10-04 17:56         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  6:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 09:06:46PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 11:57:09AM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:41:46PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Ensure that the geometry presented in the backup superblocks matches
> > > the primary superblock so that repair can recover the filesystem if
> > > that primary gets corrupted.
.....

> > > +
> > > +/* Set us up to check an AG header. */
> > > +int
> > > +xfs_scrub_setup_ag_header(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_inode		*ip)
> > > +{
> > 
> > Not immediately clear what "AG header" is being set up here?
> 
> AGF/AGFL/AGI.  All three of them.  Maybe I ought to split them into
> three separate files...?

No, just clarify the comment.

/*
 * Set up scrub to check all the static metadata in each AG. These
 * are the SB, AGF, AGI and AGFL header structures.
 */

> > > +	sb = XFS_BUF_TO_SBP(bp);
> > > +
> > > +	/*
> > > +	 * Verify the geometries match.  Fields that are permanently
> > > +	 * set by mkfs are checked; fields that can be updated later
> > > +	 * (and are not propagated to backup superblocks) are preen
> > > +	 * checked.
> > > +	 */
> > > +	if (sb->sb_blocksize != cpu_to_be32(mp->m_sb.sb_blocksize))
> > > +		xfs_scrub_block_set_corrupt(sc, bp);
> > > +
> > > +	if (sb->sb_dblocks != cpu_to_be64(mp->m_sb.sb_dblocks))
> > > +		xfs_scrub_block_set_corrupt(sc, bp);
> > 
> > Just wondering - once we've set the corrupt flag, do we need to
> > bother checking any of the other fields? It makes no difference to
> > what is reported to userspace or the action it is going to take,
> > so couldn't we just do something like:
> 
> This is something I've also struggled with for quite a while.  The most
> pragmatic reaction is to set the corrupt flag and jump out immediately
> on any failure since we really only care about whether or not we have to
> react to bad metadata either by fixing it or shutting down.

*nod*

> On the other hand, continuing with the checks gives us the ability to
> report /everything/ that's broken in the data structure, which could be
> useful for online forensics (cough) to correlate scrub's report against
> anything else that has popped up in dmesg.

Report where, exactly? The only detailed report we get out of this
is tracepoint information, isn't it? And we'll have to convert the
return address in the tracepoint to a line number to work out what
actually was reported as corrupt. I really can't see myself spending
the time to do that for every corruption in a single structure. Once
I know the structure is corrupt, I don't care about other
corruptions; I just want to move on to repair.

IMO, scrub is for detecting errors so they can be repaired or
analysed, not for doing fault analysis. For actual forensics work
we'll still be using xfs_db - analysis processes that require manual
decoding of tracepoints, structures and/or error reports are just not
going to be efficient or usable by the average developer....
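
A minimal sketch of the bail-out-early shape argued for above, written as
standalone C rather than against the real XFS structures.  The names
struct sb_geom and sb_geometry_matches() are hypothetical, and the fields
are plain host-endian integers, whereas the patch compares big-endian
on-disk fields:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the geometry fields of a superblock. */
struct sb_geom {
	uint32_t	blocksize;
	uint64_t	dblocks;
	uint32_t	agblocks;
	uint32_t	agcount;
};

/*
 * Compare a backup superblock against the primary.  Return false at the
 * first mismatch: the caller only needs to know whether to set the
 * corrupt flag, not how many fields disagree.
 */
static bool
sb_geometry_matches(
	const struct sb_geom	*primary,
	const struct sb_geom	*backup)
{
	assert(primary && backup);

	if (backup->blocksize != primary->blocksize)
		return false;
	if (backup->dblocks != primary->dblocks)
		return false;
	if (backup->agblocks != primary->agblocks)
		return false;
	if (backup->agcount != primary->agcount)
		return false;
	return true;
}
```

The caller would set the corrupt flag once when this returns false, so
userspace sees the same result as with the check-everything version.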

> A downside of having everything jump to a single call to
> xfs_scrub_block_set_corrupt at the end of the function is that the
> return address that we record in the tracepoint will be the end of the
> function instead of right after the failing check.

That's the same optimisation issue we solved for the verifiers
tracing, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/25] xfs: scrub AGF and AGFL
  2017-10-04  4:21     ` Darrick J. Wong
@ 2017-10-04  6:28       ` Dave Chinner
  2017-10-04 17:57         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  6:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 09:21:40PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 12:31:48PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:41:52PM -0700, Darrick J. Wong wrote:
> > > +/*
> > > + * Load as many of the AG headers and btree cursors as we can for an
> > > + * examination and cross-reference of an AG header.
> > > + */
> > > +int
> > > +xfs_scrub_load_ag_headers(
> > > +	struct xfs_scrub_context	*sc,
> > > +	xfs_agnumber_t			agno,
> > > +	unsigned int			type)
> > > +{
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	int				error;
> > > +
> > > +	ASSERT(type == XFS_SCRUB_TYPE_AGF || type == XFS_SCRUB_TYPE_AGFL);
> > > +	memset(&sc->sa, 0, sizeof(sc->sa));
> > > +	sc->sa.agno = agno;
> > > +
> > > +	error = xfs_scrub_load_ag_header(sc, XFS_AGI_DADDR(mp),
> > > +			&sc->sa.agi_bp, &xfs_agi_buf_ops, false);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	error = xfs_scrub_load_ag_header(sc, XFS_AGF_DADDR(mp),
> > > +			&sc->sa.agf_bp, &xfs_agf_buf_ops,
> > > +			type == XFS_SCRUB_TYPE_AGF);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	error = xfs_scrub_load_ag_header(sc, XFS_AGFL_DADDR(mp),
> > > +			&sc->sa.agfl_bp, &xfs_agfl_buf_ops,
> > > +			type == XFS_SCRUB_TYPE_AGFL);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	return 0;
> > > +}
> > 
> > This should probably be combined with xfs_scrub_ag_read_headers().
> > They essentially do the same thing, the only difference is the
> > "target" error reporting.
> 
> It's quite different -- this function ignores verifier errors for
> the two headers that don't match 'type'.  In other words, if we're
> checking the AGF (for example) we'll try to grab the AGI and the AGFL.
> Verifier errors on the AGI/AGFL don't matter, but we /do/ want to hear
> the results if the AGF verifier fails.

What they do is quite different. The implementation is /almost/
identical. type is just an error masking variable and ....

> xfs_scrub_ag_read_headers on the other hand will fail if /any/ of the
> three verifiers fail. 

.... if no type is set, then we don't mask any errors at all and
we bail if any of the three verifiers fail. i.e.:

	error = xfs_ialloc_read_agi(mp, sc->tp, agno, agi);
	if (error && (!type || type == XFS_SCRUB_TYPE_AGI))
		return error;

	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, agf);
	if (error && (!type || type == XFS_SCRUB_TYPE_AGF))
		return error;

	error = xfs_alloc_read_agfl(mp, sc->tp, agno, agfl);
	if (error && (!type || type == XFS_SCRUB_TYPE_AGFL))
		return error;

It's also much simpler to understand because we are using the proper
functions for reading these headers....
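
To make the masking behaviour concrete, here is a standalone sketch that
can be compiled outside the kernel.  The header reads are replaced by
precomputed results, and struct hdr_results, read_ag_headers() and the
HDR_* constants are hypothetical stand-ins; treating 0 as "no target
type" is an assumption of the sketch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical type codes; 0 means "no particular target". */
enum hdr_type { HDR_ANY = 0, HDR_AGF = 2, HDR_AGFL = 3, HDR_AGI = 4 };

/* Stand-in for the result of reading each header: 0 or a negative errno. */
struct hdr_results {
	int	agi_error;
	int	agf_error;
	int	agfl_error;
};

/*
 * Read all three headers.  With no target type any failure is fatal;
 * with a target type only a failure on that header is reported.
 */
static int
read_ag_headers(
	const struct hdr_results	*res,
	unsigned int			type)
{
	int				error;

	assert(res != NULL);

	error = res->agi_error;
	if (error && (!type || type == HDR_AGI))
		return error;

	error = res->agf_error;
	if (error && (!type || type == HDR_AGF))
		return error;

	error = res->agfl_error;
	if (error && (!type || type == HDR_AGFL))
		return error;

	return 0;
}
```

With a bad AGI, a caller scrubbing the AGF gets 0 back, while a caller
scrubbing the AGI or passing no type gets the error.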

> The two functions /could/ be combined, though the 'type' test becomes
> trickier.  Maybe it'd be better just to enhance the comments for the two
> header loader functions to spell out how they differ in usage.

Again, I'd much prefer similar functionality is combined into
common helpers if it's simple enough to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/25] xfs: scrub the AGI
  2017-10-04  4:25     ` Darrick J. Wong
@ 2017-10-04  6:43       ` Dave Chinner
  2017-10-04 18:02         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04  6:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 09:25:01PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 12:43:47PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:41:59PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Add a forgotten check to the AGI verifier, then wire up the scrub
> > > infrastructure to check the AGI contents.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_fs.h  |    3 +-
> > >  fs/xfs/scrub/agheader.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/scrub/common.c   |    6 ++-
> > >  fs/xfs/scrub/scrub.c    |    4 ++
> > >  fs/xfs/scrub/scrub.h    |    1 +
> > >  5 files changed, 99 insertions(+), 3 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > > index aeb2a66..1e326dd 100644
> > > --- a/fs/xfs/libxfs/xfs_fs.h
> > > +++ b/fs/xfs/libxfs/xfs_fs.h
> > > @@ -487,9 +487,10 @@ struct xfs_scrub_metadata {
> > >  #define XFS_SCRUB_TYPE_SB	1	/* superblock */
> > >  #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
> > >  #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
> > > +#define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
> > >  
> > >  /* Number of scrub subcommands. */
> > > -#define XFS_SCRUB_TYPE_NR	4
> > > +#define XFS_SCRUB_TYPE_NR	5
> > >  
> > >  /* i: Repair this metadata. */
> > >  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> > > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > > index 7fe6630..3d269c2 100644
> > > --- a/fs/xfs/scrub/agheader.c
> > > +++ b/fs/xfs/scrub/agheader.c
> > > @@ -535,3 +535,91 @@ xfs_scrub_agfl(
> > >  out:
> > >  	return error;
> > >  }
> > > +
> > > +/* AGI */
> > > +
> > > +/* Scrub the AGI. */
> > > +int
> > > +xfs_scrub_agi(
> > > +	struct xfs_scrub_context	*sc)
> > > +{
> > > +	struct xfs_mount		*mp = sc->mp;
> > > +	struct xfs_agi			*agi;
> > > +	xfs_daddr_t			daddr;
> > > +	xfs_daddr_t			eofs;
> > > +	xfs_agnumber_t			agno;
> > > +	xfs_agblock_t			agbno;
> > > +	xfs_agblock_t			eoag;
> > > +	xfs_agino_t			agino;
> > > +	xfs_agino_t			first_agino;
> > > +	xfs_agino_t			last_agino;
> > > +	int				i;
> > > +	int				level;
> > > +	int				error = 0;
> > > +
> > > +	agno = sc->sm->sm_agno;
> > > +	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGI);
> > > +	if (!xfs_scrub_op_ok(sc, agno, XFS_AGI_BLOCK(sc->mp), &error))
> > > +		goto out;
> > > +
> > > +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> > > +	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> > > +
> > > +	/* Check the AG length */
> > > +	eoag = be32_to_cpu(agi->agi_length);
> > > +	if (eoag != xfs_scrub_ag_blocks(mp, agno))
> > > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> > 
> > Should we be cross checking that the AGI and AGF both have
> > the same length here?
> 
> Isn't that what this does?  Albeit indirectly?

I was kinda thinking of explicit checks, but you are right, it's
indirectly verified....

> xfs_scrub_ag_blocks returns sb_agblocks for every AG except the last one.
> For the last AG it returns (sb_dblocks - (all blocks in the other AGs))
> which should be the same as agf->agf_length, right?

... which assumes we've validated sb_agblocks and sb_dblocks in some
way, which we haven't really done in the superblock scrubber.
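
For reference, the expected-length calculation under discussion can be
sketched as below.  This is a standalone approximation, not the code
from the patch: struct geom and ag_blocks() are hypothetical names, and
it trusts the three geometry values it is given, which is exactly the
assumption being questioned here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the primary superblock geometry. */
struct geom {
	uint64_t	dblocks;	/* total filesystem blocks */
	uint32_t	agblocks;	/* blocks per (full) AG */
	uint32_t	agcount;	/* number of AGs */
};

/*
 * Expected length of an AG: every AG but the last is agblocks long;
 * the last AG holds whatever is left over.
 */
static uint32_t
ag_blocks(
	const struct geom	*g,
	uint32_t		agno)
{
	assert(agno < g->agcount);

	if (agno < g->agcount - 1)
		return g->agblocks;
	return (uint32_t)(g->dblocks - (uint64_t)agno * g->agblocks);
}
```

For dblocks = 1000, agblocks = 300 and agcount = 4, AGs 0-2 are 300
blocks long and AG 3 is 1000 - 3 * 300 = 100 blocks long.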

It seems to me that we're using the superblock 0 values as the
golden master because it's a mounted filesystem, and then comparing
everything else against it. Maybe we should at least check a couple
of secondary superblocks to see that they match the primary
superblock - that way we'll have some confidence that at least
things like agcount, agblocks, dblocks, etc are valid before we go
any further...

But maybe all we need is a comment in the overall scrub description -
that we're pretty much assuming that sb 0 is intact because we write
what is in memory back to it and so we can simply validate
everything else against the primary superblock contents...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/25] xfs: scrub the shape of a metadata btree
  2017-10-04  5:48       ` Dave Chinner
@ 2017-10-04 17:48         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 04:48:13PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 08:51:17PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 04, 2017 at 11:15:35AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:41:27PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Create a function that can check the shape of a btree -- each block
> > > > passes basic inspection and all the pointers look ok.  In the next patch
> > > > we'll add the ability to check the actual keys and records stored within
> > > > the btree.  Add some helper functions so that we report detailed scrub
> > > > errors in a uniform manner in dmesg.  These are helper functions for
> > > > subsequent patches.
> > > .....
> > > >  
> > > > +/* Check a btree pointer.  Returns true if it's ok to use this pointer. */
> > > > +static bool
> > > > +xfs_scrub_btree_ptr_ok(
> > > > +	struct xfs_scrub_btree		*bs,
> > > > +	int				level,
> > > > +	union xfs_btree_ptr		*ptr)
> > > > +{
> > > > +	struct xfs_btree_cur		*cur = bs->cur;
> > > > +	xfs_daddr_t			daddr;
> > > > +	xfs_daddr_t			eofs;
> > > > +
> > > > +	if (xfs_btree_ptr_is_null(cur, ptr)) {
> > > > +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > > > +		return false;
> > > > +	}
> > > > +	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
> > > > +		daddr = XFS_FSB_TO_DADDR(cur->bc_mp, be64_to_cpu(ptr->l));
> > > > +	} else {
> > > > +		ASSERT(cur->bc_private.a.agno != NULLAGNUMBER);
> > > > +		daddr = XFS_AGB_TO_DADDR(cur->bc_mp, cur->bc_private.a.agno,
> > > > +				be32_to_cpu(ptr->s));
> > > > +	}
> > > > +	eofs = XFS_FSB_TO_BB(cur->bc_mp, cur->bc_mp->m_sb.sb_dblocks);
> > > > +	if (daddr == 0 || daddr >= eofs) {
> > > > +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > > > +		return false;
> > > > +	}
> > > > +
> > > > +	return true;
> > > > +}
> > > 
> > > There seems to be quite a bit of overlap here with
> > > xfs_btree_check_ptr(). Indeed, for the short pointers the above code
> > > fails to check it is within the bounds of the AG size. I'd suggest
> > > both of these should use the same validity checking functions....
> > 
> > Hmm... you're right that the short pointer needs to be checked against
> > the AG size.  That said, the regular xfs_btree_check_ptr function will
> > log a XFS_ERROR_REPORT to dmesg, which we don't want, since we're going
> > to report the scrub failure to userspace anyway.
> > 
> > I think I prefer to fix this existing function since it's silent and
> > we can maintain the current behavior where a failure in regular
> > operation gets logged to dmesg.
> 
> I'd prefer a core function that doesn't ERROR_REPORT, and a version
> with the error report wrapped around the outside to replace the
> existing users....
> 
> > > ....
> > > > +/*
> > > > + * Grab and scrub a btree block given a btree pointer.  Returns block
> > > > + * and buffer pointers (if applicable) if they're ok to use.
> > > > + */
> > > > +STATIC int
> > > > +xfs_scrub_btree_get_block(
> > > > +	struct xfs_scrub_btree		*bs,
> > > > +	int				level,
> > > > +	union xfs_btree_ptr		*pp,
> > > > +	struct xfs_btree_block		**pblock,
> > > > +	struct xfs_buf			**pbp)
> > > > +{
> > > > +	int				error;
> > > > +
> > > > +	error = xfs_btree_lookup_get_block(bs->cur, level, pp, pblock);
> > > > +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error) || !pblock)
> > > > +		return error;
> > > > +
> > > > +	xfs_btree_get_block(bs->cur, level, pbp);
> > > > +	error = xfs_btree_check_block(bs->cur, *pblock, level, *pbp);
> > > > +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, level, &error))
> > > > +		return error;
> > > 
> > > xfs_btree_check_block() will throw error reports to dmesg for each
> > > corrupt block that is found. Do we want scrub to do this, or should
> > > it just report the corrupt block to userspace?
> > 
> > Having looked at xfs_btree_check_block again, I prefer not to spew to
> > dmesg at all for scrub operations in favor of simply reporting the
> > corruption back to userland.  I think I'll copy it to scrub so that we
> > can have better tracepointing and eliminate the XFS_TEST_ERROR that will
> > get in the way.
> 
> As above, I'd much prefer we don't copy-n-paste extremely similar
> checks just to avoid a ERROR_REPORT. Factor out the error report,
> call the common code here, make xfs_btree_check_block() wrap the
> common code with an error report...

Sure.
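
Roughly the shape I understand this to mean, as a standalone sketch
rather than the real btree code: a silent core check that scrub can
call, and a wrapper that adds the log message for regular operation.
agbno_ptr_valid() and agbno_ptr_check() are hypothetical names, and the
fprintf() stands in for the kernel's error report:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Silent core check for a short (AG-relative) btree pointer: it must
 * not point at block 0 and must lie within the AG.
 */
static bool
agbno_ptr_valid(
	uint32_t	agbno,
	uint32_t	agblocks)
{
	assert(agblocks > 0);
	return agbno != 0 && agbno < agblocks;
}

/* Wrapper for regular operation: same check, plus a log message. */
static bool
agbno_ptr_check(
	uint32_t	agbno,
	uint32_t	agblocks)
{
	if (agbno_ptr_valid(agbno, agblocks))
		return true;
	fprintf(stderr, "corrupt btree pointer: agbno %u\n",
			(unsigned int)agbno);
	return false;
}
```

Scrub would call the silent variant and report to userspace itself;
existing callers keep their dmesg output through the wrapper.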

> > > Which makes me ask the question - why aren't we validating the
> > > initial pointer when the root is in an inode?
> > 
> > What /is/ the correct initial pointer value for when the root is an
> > inode?
> 
> Somewhere between FSB 1 and sb_dblocks....?
> 
> > xfs_bmbt_init_ptr_from_cur returns a pointer to fsb 0, which
> > seems wrong.  Maybe it should return NULLFSBLOCK since the root of the
> > btree isn't a block anyway?  But perhaps it returns zero to avoid
> > tripping up xfs_btree_check_lptr....
> > 
> > What if I rewrite the start of xfs_scrub_btree_ptr_ok to be:
> > 
> > 	if ((cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) &&
> > 	    level == cur->bc_nlevels - 1) {
> > 		if (ptr->l != 0) {
> > 			xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > 			return false;
> > 		}
> > 		return true;
> > 	}
> > 
> > 	if (xfs_btree_ptr_is_null(cur, ptr)) {
> > 		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> > 		return false;
> > 	}
> > 
> > and then your suggested callsite in xfs_scrub_btree becomes:
> > 
> > 	level = cur->bc_nlevels - 1;
> > 	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > 	if (!xfs_scrub_btree_ptr_ok(&bs, level, &ptr))
> > 		goto out;
> > 
> 
> Makes more sense.

OK.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/25] xfs: create helpers to scan an allocation group
  2017-10-04  5:59       ` Dave Chinner
@ 2017-10-04 17:51         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04 17:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 04:59:56PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 08:58:53PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 04, 2017 at 11:46:03AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:41:40PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Add some helpers to enable us to lock an AG's headers, create btree
> > > > cursors for all btrees in that allocation group, and clean up
> > > > afterwards.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/scrub/common.c |  173 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/scrub/common.h |   10 +++
> > > >  fs/xfs/scrub/scrub.c  |    4 +
> > > >  fs/xfs/scrub/scrub.h  |   21 ++++++
> > > >  4 files changed, 208 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
> > > > index a84ba19..b056c9d 100644
> > > > --- a/fs/xfs/scrub/common.c
> > > > +++ b/fs/xfs/scrub/common.c
> > > > @@ -44,6 +44,7 @@
> > > >  #include "scrub/scrub.h"
> > > >  #include "scrub/common.h"
> > > >  #include "scrub/trace.h"
> > > > +#include "scrub/btree.h"
> > > >  
> > > >  /* Common code for the metadata scrubbers. */
> > > >  
> > > > @@ -298,6 +299,178 @@ xfs_scrub_set_incomplete(
> > > >  	trace_xfs_scrub_incomplete(sc, __return_address);
> > > >  }
> > > >  
> > > > +/*
> > > > + * AG scrubbing
> > > > + *
> > > > + * These helpers facilitate locking an allocation group's header
> > > > + * buffers, setting up cursors for all btrees that are present, and
> > > > + * cleaning everything up once we're through.
> > > > + */
> > > > +
> > > > +/* Grab all the headers for an AG. */
> > > > +int
> > > > +xfs_scrub_ag_read_headers(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	xfs_agnumber_t			agno,
> > > > +	struct xfs_buf			**agi,
> > > > +	struct xfs_buf			**agf,
> > > > +	struct xfs_buf			**agfl)
> > > > +{
> > > > +	struct xfs_mount		*mp = sc->mp;
> > > > +	int				error;
> > > > +
> > > > +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, agi);
> > > > +	if (error)
> > > > +		goto out;
> > > > +
> > > > +	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, agf);
> > > > +	if (error)
> > > > +		goto out;
> > > > +	if (!*agf) {
> > > > +		error = -ENOMEM;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	error = xfs_alloc_read_agfl(mp, sc->tp, agno, agfl);
> > > > +	if (error)
> > > > +		goto out;
> > > > +
> > > > +out:
> > > > +	return error;
> > > > +}
> > > 
> > > It's not immediately obvious what releases the buffers on error.
> > > Maybe add a comment to say cleanup/release on error is unconditional
> > > through xfs_scrub_ag_free()?
> > > 
> > > Hmmm - now there's a question - is the reference we get here freed
> > > through cancelling the fake transaction, or via the manual
> > > xfs_trans_brelse() call in the free function? which one happens
> > > first? add that to the comment?
> > 
> > The AG headers /should/ always be released by the xfs_trans_brelse calls
> > in the ag_free function, with a failsafe that the trans_cancel will dump
> > anything else that we came across during our check, just in case all
> > heck broke loose while we were checking.
> 
> Ok, comments. :P

Done.

> > > And given this locks out the AG from allocation for an arbitrary
> > > length of time, I'm wondering if we should add a flag into the pag
> > > somewhere to say "being scrubbed" so the extent and inode allocation
> > > code can skip over this AG and not block trying to lock it...
> > 
> > That might be a good idea for a end-of-series enhancement.
> 
> *nod*
> 
> > 
> > Though it could use a little more engineering thought -- what about
> > a more general ability to mark an AG offline?  ISTR we discussed growing
> > the ability to shut down an AG (rather than the whole FS) if scrub finds
> > problems, and/or being able to control that from spaceman.  The patch
> > was "spaceman: AG state control".
> 
> Well, only a small part of making an AG offline is preventing
> allocation from blocking in it.  What I suggested above is
> completely internal functionality that users would never even know
> about, so if we later want to add offline AG controls we can rework
> the implementation scrub uses to fit into that model....

Yes, I think that sounds fairly straightforward.

> > xfs_scrub has an -e option that allows the admin to specify what happens
> > on an error.  Right now it'll just shut down the filesystem, but
> > presumably it could react to a per-ag metadata problem by shutting down
> > the AG.
> 
> Not that simple, I'm afraid. Think about modifying a directory that
> has blocks that span multiple AGs. If we mark an AG as offline, then
> what do we do with an attempt to modify that directory block? Even
> if we can read it, do we allow the modification to proceed? How do
> we even know ahead of time that a directory has blocks in an offline
> AG?  And if the AG is shut down, then the attempt to read the
> directory block will get EIO, which will cause a dirty transaction
> cancellation, which will cause a filesystem wide shutdown...
> 
> Let's not complicate a simple optimisation specific to scrub by
> trying to make it work with blue-sky functionality that requires us
> to solve a bunch of "OMFG HARD!" problems we haven't even thought
> about yet, let alone have answers for....

Heh, ok, no more OMFG HARD than there already is. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/25] xfs: scrub the backup superblocks
  2017-10-04  6:13       ` Dave Chinner
@ 2017-10-04 17:56         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04 17:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 05:13:00PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 09:06:46PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 04, 2017 at 11:57:09AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:41:46PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Ensure that the geometry presented in the backup superblocks matches
> > > > the primary superblock so that repair can recover the filesystem if
> > > > that primary gets corrupted.
> .....
> 
> > > > +
> > > > +/* Set us up to check an AG header. */
> > > > +int
> > > > +xfs_scrub_setup_ag_header(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	struct xfs_inode		*ip)
> > > > +{
> > > 
> > > Not immediately clear what "AG header" is being set up here?
> > 
> > AGF/AGFL/AGI.  All three of them.  Maybe I ought to split them into
> > three separate files...?
> 
> No, just clarify the comment.
> 
> /*
>  * Set up scrub to check all the static metadata in each AG. These
>  * are the SB, AGF, AGI and AGFL header structures.
>  */
> 
> > > > +	sb = XFS_BUF_TO_SBP(bp);
> > > > +
> > > > +	/*
> > > > +	 * Verify the geometries match.  Fields that are permanently
> > > > +	 * set by mkfs are checked; fields that can be updated later
> > > > +	 * (and are not propagated to backup superblocks) are preen
> > > > +	 * checked.
> > > > +	 */
> > > > +	if (sb->sb_blocksize != cpu_to_be32(mp->m_sb.sb_blocksize))
> > > > +		xfs_scrub_block_set_corrupt(sc, bp);
> > > > +
> > > > +	if (sb->sb_dblocks != cpu_to_be64(mp->m_sb.sb_dblocks))
> > > > +		xfs_scrub_block_set_corrupt(sc, bp);
> > > 
> > > Just wondering - once we've set the corrupt flag, do we need to
> > > bother checking any of the other fields? It makes no difference to
> > > what is reported to userspace or the action it is going to take,
> > > so couldn't we just do something like:
> > 
> > This is something I've also struggled with for quite a while.  The most
> > pragmatic reaction is to set the corrupt flag and jump out immediately
> > on any failure since we really only care about whether or not we have to
> > react to bad metadata either by fixing it or shutting down.
> 
> *nod*
> 
> > On the other hand, continuing with the checks gives us the ability to
> > report /everything/ that's broken in the data structure, which could be
> > useful for online forensics (cough) to correlate scrub's report against
> > anything else that has popped up in dmesg.
> 
> Report where, exactly? The only detailed report we get out of this
> is tracepoint information, isn't it? And we'll have to convert the
> return address in the tracepoint to a line number to work out what
> actually was reported as corrupt. I really can't see myself spending
> the time to do that for every corruption in a single structure. Once
> I know the structure is corrupt, I don't care about other
> corruptions; I just want to move on to repair.
> 
> IMO, scrub is for detecting errors so they can be repaired or
> analysed, not for doing fault analysis. For actual forensics work
> we'll still be using xfs_db - analysis processes that require manual
> decoding of tracepoints, structures and/or error reports are just not
> going to be efficient or usable by the average developer....
> 
> > A downside of having everything jump to a single call to
> > xfs_scrub_block_set_corrupt at the end of the function is that the
> > return address that we record in the tracepoint will be the end of the
> > function instead of right after the failing check.
> 
> That's the same optimisation issue we solved for the verifiers
> tracing, right?

Not quite.  For the verifiers we adopted:

#define __this_address   ({ __label__ __here; __here: asm volatile(""); &&__here; })

(The asm volatile("") piece will (so far as I can tell) prevent the
optimizer from moving the label around within the verifier functions.)

Whereas for scrub we just use __return_address, which is a gcc-ism which
doesn't disable reorganization optimizations.

Granted I guess I could rework all those little helpers to take (void *)
and then stuff in __this_address...
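
Something like the following, sketched as standalone GNU C (it relies
on statement expressions and label addresses, so gcc or clang only).
struct scrub_state, set_corrupt() and check_fields() are hypothetical
names, and I spell the empty asm as __asm__ __volatile__ so the sketch
also builds with a strict -std= setting:

```c
#include <assert.h>
#include <stddef.h>

/* Address of the current location, as a void pointer (GNU C only). */
#define __this_address \
	({ __label__ __here; __here: __asm__ __volatile__(""); &&__here; })

/* Hypothetical scrub state: a corrupt flag and where it was first set. */
struct scrub_state {
	int	corrupt;
	void	*where;
};

/* Helper that takes the caller-supplied address instead of using
 * the return address. */
static void
set_corrupt(
	struct scrub_state	*st,
	void			*addr)
{
	assert(st != NULL);

	if (!st->corrupt)
		st->where = addr;	/* remember the first failing check */
	st->corrupt = 1;
}

/* Each failing check passes the address of its own call site. */
static void
check_fields(
	struct scrub_state	*st,
	int			a,
	int			b)
{
	if (a != 0)
		set_corrupt(st, __this_address);
	if (b != 0)
		set_corrupt(st, __this_address);
}
```

The recorded address then identifies the failing check itself rather
than wherever the compiler placed the call's return.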

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/25] xfs: scrub AGF and AGFL
  2017-10-04  6:28       ` Dave Chinner
@ 2017-10-04 17:57         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04 17:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 05:28:40PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 09:21:40PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 04, 2017 at 12:31:48PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:41:52PM -0700, Darrick J. Wong wrote:
> > > > +/*
> > > > + * Load as many of the AG headers and btree cursors as we can for an
> > > > + * examination and cross-reference of an AG header.
> > > > + */
> > > > +int
> > > > +xfs_scrub_load_ag_headers(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	xfs_agnumber_t			agno,
> > > > +	unsigned int			type)
> > > > +{
> > > > +	struct xfs_mount		*mp = sc->mp;
> > > > +	int				error;
> > > > +
> > > > +	ASSERT(type == XFS_SCRUB_TYPE_AGF || type == XFS_SCRUB_TYPE_AGFL);
> > > > +	memset(&sc->sa, 0, sizeof(sc->sa));
> > > > +	sc->sa.agno = agno;
> > > > +
> > > > +	error = xfs_scrub_load_ag_header(sc, XFS_AGI_DADDR(mp),
> > > > +			&sc->sa.agi_bp, &xfs_agi_buf_ops, false);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	error = xfs_scrub_load_ag_header(sc, XFS_AGF_DADDR(mp),
> > > > +			&sc->sa.agf_bp, &xfs_agf_buf_ops,
> > > > +			type == XFS_SCRUB_TYPE_AGF);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	error = xfs_scrub_load_ag_header(sc, XFS_AGFL_DADDR(mp),
> > > > +			&sc->sa.agfl_bp, &xfs_agfl_buf_ops,
> > > > +			type == XFS_SCRUB_TYPE_AGFL);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	return 0;
> > > > +}
> > > 
> > > This should probably be combined with xfs_scrub_ag_read_headers().
> > > They essentially do the same thing, the only difference is the
> > > "target" error reporting.
> > 
> > It's quite different -- this function ignores verifier errors for
> > the two headers that don't match 'type'.  In other words, if we're
> > checking the AGF (for example) we'll try to grab the AGI and the AGFL.
> > Verifier errors on the AGI/AGFL don't matter, but we /do/ want to hear
> > the results if the AGF verifier fails.
> 
> What they do is quite different. The implementation is /almost/
> identical. type is just an error masking variable and ....
> 
> > xfs_scrub_ag_read_headers on the other hand will fail if /any/ of the
> > three verifiers fail. 
> 
> .... if no type is set, then we don't mask any errors at all and
> we bail if any of the three verifiers fail. i.e.:
> 
> 	error = xfs_ialloc_read_agi(mp, sc->tp, agno, agi);
> 	if (error && (!type || type == XFS_SCRUB_TYPE_AGI))
> 		return error;
> 
> 	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, agf);
> 	if (error && (!type || type == XFS_SCRUB_TYPE_AGF))
> 		return error;
> 
> 	error = xfs_alloc_read_agfl(mp, sc->tp, agno, agfl);
> 	if (error && (!type || type == XFS_SCRUB_TYPE_AGFL))
> 		return error;
> 
> It's also much simpler to understand because we are using the proper
> functions for reading these headers....

Ok, sounds good to me.

> > The two functions /could/ be combined, though the 'type' test becomes
> > trickier.  Maybe it'd be better just to enhance the comments for the two
> > header loader functions to spell out how they differ in usage.
> 
> Again, I'd much prefer similar functionality is combined into
> common helpers if it's simple enough to do...

<nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/25] xfs: scrub the AGI
  2017-10-04  6:43       ` Dave Chinner
@ 2017-10-04 18:02         ` Darrick J. Wong
  2017-10-04 22:16           ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04 18:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 05:43:33PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 09:25:01PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 04, 2017 at 12:43:47PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:41:59PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Add a forgotten check to the AGI verifier, then wire up the scrub
> > > > infrastructure to check the AGI contents.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_fs.h  |    3 +-
> > > >  fs/xfs/scrub/agheader.c |   88 +++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/scrub/common.c   |    6 ++-
> > > >  fs/xfs/scrub/scrub.c    |    4 ++
> > > >  fs/xfs/scrub/scrub.h    |    1 +
> > > >  5 files changed, 99 insertions(+), 3 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > > > index aeb2a66..1e326dd 100644
> > > > --- a/fs/xfs/libxfs/xfs_fs.h
> > > > +++ b/fs/xfs/libxfs/xfs_fs.h
> > > > @@ -487,9 +487,10 @@ struct xfs_scrub_metadata {
> > > >  #define XFS_SCRUB_TYPE_SB	1	/* superblock */
> > > >  #define XFS_SCRUB_TYPE_AGF	2	/* AG free header */
> > > >  #define XFS_SCRUB_TYPE_AGFL	3	/* AG free list */
> > > > +#define XFS_SCRUB_TYPE_AGI	4	/* AG inode header */
> > > >  
> > > >  /* Number of scrub subcommands. */
> > > > -#define XFS_SCRUB_TYPE_NR	4
> > > > +#define XFS_SCRUB_TYPE_NR	5
> > > >  
> > > >  /* i: Repair this metadata. */
> > > >  #define XFS_SCRUB_IFLAG_REPAIR		(1 << 0)
> > > > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > > > index 7fe6630..3d269c2 100644
> > > > --- a/fs/xfs/scrub/agheader.c
> > > > +++ b/fs/xfs/scrub/agheader.c
> > > > @@ -535,3 +535,91 @@ xfs_scrub_agfl(
> > > >  out:
> > > >  	return error;
> > > >  }
> > > > +
> > > > +/* AGI */
> > > > +
> > > > +/* Scrub the AGI. */
> > > > +int
> > > > +xfs_scrub_agi(
> > > > +	struct xfs_scrub_context	*sc)
> > > > +{
> > > > +	struct xfs_mount		*mp = sc->mp;
> > > > +	struct xfs_agi			*agi;
> > > > +	xfs_daddr_t			daddr;
> > > > +	xfs_daddr_t			eofs;
> > > > +	xfs_agnumber_t			agno;
> > > > +	xfs_agblock_t			agbno;
> > > > +	xfs_agblock_t			eoag;
> > > > +	xfs_agino_t			agino;
> > > > +	xfs_agino_t			first_agino;
> > > > +	xfs_agino_t			last_agino;
> > > > +	int				i;
> > > > +	int				level;
> > > > +	int				error = 0;
> > > > +
> > > > +	agno = sc->sm->sm_agno;
> > > > +	error = xfs_scrub_load_ag_headers(sc, agno, XFS_SCRUB_TYPE_AGI);
> > > > +	if (!xfs_scrub_op_ok(sc, agno, XFS_AGI_BLOCK(sc->mp), &error))
> > > > +		goto out;
> > > > +
> > > > +	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
> > > > +	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> > > > +
> > > > +	/* Check the AG length */
> > > > +	eoag = be32_to_cpu(agi->agi_length);
> > > > +	if (eoag != xfs_scrub_ag_blocks(mp, agno))
> > > > +		xfs_scrub_block_set_corrupt(sc, sc->sa.agi_bp);
> > > 
> > > Should we be cross checking that the AGI and AGF both have
> > > the same length here?
> > 
> > Isn't that what this does?  Albeit indirectly?
> 
> I was kinda thinking of explicit checks, but you are right, it's
> indirectly verified....
> 
> > xfs_scrub_ag_blocks returns sb_agblocks for every AG except the last one.
> > For the last AG it returns (sb_dblocks - (all blocks in the other AGs))
> > which should be the same as agf->agf_length, right?
> 
> ... which assumes we've validated sb_agblocks and sb_dblocks in some
> way, which we haven't really done in the superblock scrubber.

Yes.
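
For reference, the AG length calculation described above can be sketched
in userspace roughly as follows.  This is only an illustration: struct
sb_geom and ag_blocks() are hypothetical stand-ins, not the kernel's
structures or the helper in the patch, and like the scrubber the sketch
simply trusts the geometry it is given.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the superblock geometry fields named above. */
struct sb_geom {
	uint64_t	sb_dblocks;	/* total data blocks in the filesystem */
	uint32_t	sb_agblocks;	/* blocks in a full-sized AG */
	uint32_t	sb_agcount;	/* number of AGs */
};

/*
 * Expected length of AG @agno: sb_agblocks for every AG but the last,
 * and whatever remains of sb_dblocks for the last one.
 */
static uint32_t
ag_blocks(const struct sb_geom *sb, uint32_t agno)
{
	assert(agno < sb->sb_agcount);

	if (agno < sb->sb_agcount - 1)
		return sb->sb_agblocks;
	return (uint32_t)(sb->sb_dblocks -
			  (uint64_t)agno * sb->sb_agblocks);
}
```

If sb_agblocks or sb_dblocks is wrong, the result is wrong too, which is
the dependency on sb 0 being pointed out here.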

> It seems to me that we're using the superblock 0 values as the
> golden master because it's a mounted filesystem, and then comparing
> everything else against it. Maybe we should at least check a couple
> of secondary superblocks to see that they match the primary
> superblock - that way we'll have some confidence that at least
> things like agcount, agblocks, dblocks, etc are valid before we go
> any further...

xfs_scrub_superblock does check the secondary superblock geometry
against whatever's in mp->m_sb, which came from sb 0.

> But maybe all we need is a comment in the overall scrub description -
> that we're pretty much assuming that sb 0 is intact because we write
> what is in memory back to it and so we can simply validate
> everything else against the primary superblock contents...

Correct.  Since scrub is run against a mounted live filesystem we assume
that the mount code fully validated sb 0 and therefore we can rely on it
not being wrong.

If OTOH sb 0 *is* wrong then the admin is better off running xfs_repair
because there's too much whirring machinery to go changing fundamental
geometry.

Ok more comments are coming.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 07/25] xfs: scrub btree keys and records
  2017-10-03 20:41 ` [PATCH 07/25] xfs: scrub btree keys and records Darrick J. Wong
@ 2017-10-04 20:52   ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04 20:52 UTC (permalink / raw)
  To: linux-xfs; +Cc: david

On Tue, Oct 03, 2017 at 01:41:33PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Add to the btree scrubber the ability to check that the keys and
> records are in the right order and actually call out to our record
> iterator to do actual checking of the records.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/btree.c |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/trace.h |   44 ++++++++++++++++++++
>  2 files changed, 152 insertions(+)
> 
> 
> diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> index 899c9b1..ad3518c 100644
> --- a/fs/xfs/scrub/btree.c
> +++ b/fs/xfs/scrub/btree.c
> @@ -92,6 +92,101 @@ xfs_scrub_btree_set_corrupt(
>  				__return_address);
>  }
>  
> +/*
> + * Make sure this record is in order and doesn't stray outside of the parent
> + * keys.
> + */
> +STATIC void
> +xfs_scrub_btree_rec(
> +	struct xfs_scrub_btree	*bs)
> +{
> +	struct xfs_btree_cur	*cur = bs->cur;
> +	union xfs_btree_rec	*rec;
> +	union xfs_btree_key	key;
> +	union xfs_btree_key	hkey;
> +	union xfs_btree_key	*keyp;
> +	struct xfs_btree_block	*block;
> +	struct xfs_btree_block	*keyblock;
> +	struct xfs_buf		*bp;
> +
> +	block = xfs_btree_get_block(cur, 0, &bp);
> +	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> +
> +	trace_xfs_scrub_btree_rec(bs->sc, cur, 0);
> +
> +	/* If this isn't the first record, are they in order? */
> +	if (!bs->firstrec && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, 0);
> +	bs->firstrec = false;
> +	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
> +
> +	if (cur->bc_nlevels == 1)
> +		return;
> +
> +	/* Is this at least as large as the parent low key? */
> +	cur->bc_ops->init_key_from_rec(&key, rec);
> +	keyblock = xfs_btree_get_block(cur, 1, &bp);
> +	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[1], keyblock);
> +	if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0)
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, 1);
> +
> +	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
> +		return;
> +
> +	/* Is this no larger than the parent high key? */
> +	cur->bc_ops->init_high_key_from_rec(&hkey, rec);
> +	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[1], keyblock);
> +	if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0)
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, 1);
> +}
> +
> +/*
> + * Make sure this key is in order and doesn't stray outside of the parent
> + * keys.
> + */
> +STATIC void
> +xfs_scrub_btree_key(
> +	struct xfs_scrub_btree	*bs,
> +	int			level)
> +{
> +	struct xfs_btree_cur	*cur = bs->cur;
> +	union xfs_btree_key	*key;
> +	union xfs_btree_key	*keyp;
> +	struct xfs_btree_block	*block;
> +	struct xfs_btree_block	*keyblock;
> +	struct xfs_buf		*bp;
> +
> +	block = xfs_btree_get_block(cur, level, &bp);
> +	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
> +
> +	trace_xfs_scrub_btree_key(bs->sc, cur, level);
> +
> +	/* If this isn't the first key, are they in order? */
> +	if (!bs->firstkey[level] &&
> +	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> +	bs->firstkey[level] = false;
> +	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
> +
> +	if (level + 1 >= cur->bc_nlevels)
> +		return;
> +
> +	/* Is this at least as large as the parent low key? */
> +	keyblock = xfs_btree_get_block(cur, level + 1, &bp);
> +	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
> +	if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0)
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> +
> +	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
> +		return;
> +
> +	/* Is this no larger than the parent high key? */
> +	key = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
> +	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
> +	if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0)
> +		xfs_scrub_btree_set_corrupt(bs->sc, cur, level);
> +}
> +
>  /* Check a btree pointer.  Returns true if it's ok to use this pointer. */
>  static bool
>  xfs_scrub_btree_ptr_ok(
> @@ -256,6 +351,7 @@ xfs_scrub_btree(
>  	struct xfs_scrub_btree		bs = {0};
>  	union xfs_btree_ptr		ptr;
>  	union xfs_btree_ptr		*pp;
> +	union xfs_btree_rec		*recp;
>  	struct xfs_btree_block		*block;
>  	int				level;
>  	struct xfs_buf			*bp;
> @@ -311,6 +407,15 @@ xfs_scrub_btree(
>  				continue;
>  			}
>  
> +			/* Records in order for scrub? */
> +			xfs_scrub_btree_rec(&bs);
> +
> +			/* Call out to the record checker. */
> +			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> +			error = bs.scrub_rec(&bs, recp);
> +			if (error < 0 ||
> +			    error == XFS_BTREE_QUERY_RANGE_ABORT)
> +				break;
>  			if (xfs_scrub_should_terminate(sc, &error))
>  				break;

Referencing the discussion of a later patch, we could check for
OFLAG_CORRUPT here as a quick way out if we find corruption in the
metadata object.
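
A minimal userspace sketch of that early exit, under stated assumptions:
SKETCH_OFLAG_CORRUPT stands in for the OFLAG_CORRUPT output flag (the bit
value here is assumed), and struct sketch_scrub, walk_records() and
flag_negative() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the OFLAG_CORRUPT output flag; the bit value is assumed. */
#define SKETCH_OFLAG_CORRUPT	(1u << 1)

/* Hypothetical scrub state: output flags accumulate in sm_flags. */
struct sketch_scrub {
	uint32_t	sm_flags;
};

typedef int (*sketch_rec_fn)(struct sketch_scrub *sc, int rec);

/*
 * Call @fn on each record.  Stop on a negative error, or as soon as a
 * checker has flagged corruption.  *visited counts the calls made.
 */
static int
walk_records(struct sketch_scrub *sc, const int *recs, int nr,
	     sketch_rec_fn fn, int *visited)
{
	int	error = 0;
	int	i;

	*visited = 0;
	for (i = 0; i < nr; i++) {
		error = fn(sc, recs[i]);
		(*visited)++;
		if (error < 0)
			break;
		/* Quick way out: the object is already known to be bad. */
		if (sc->sm_flags & SKETCH_OFLAG_CORRUPT)
			break;
	}
	return error;
}

/* Example checker: treat negative records as corrupt. */
static int
flag_negative(struct sketch_scrub *sc, int rec)
{
	if (rec < 0)
		sc->sm_flags |= SKETCH_OFLAG_CORRUPT;
	return 0;
}
```

The walk still returns 0 in that case; the corruption is reported through
the flag, not through the error code.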

--D

>  
> @@ -326,6 +431,9 @@ xfs_scrub_btree(
>  			continue;
>  		}
>  
> +		/* Keys in order for scrub? */
> +		xfs_scrub_btree_key(&bs, level);
> +
>  		/* Drill another level deeper. */
>  		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
>  		if (!xfs_scrub_btree_ptr_ok(&bs, level, pp)) {
> diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> index 78f96b0..a78c8d1 100644
> --- a/fs/xfs/scrub/trace.h
> +++ b/fs/xfs/scrub/trace.h
> @@ -423,6 +423,50 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
>  		  __entry->ret_ip)
>  );
>  
> +DECLARE_EVENT_CLASS(xfs_scrub_sbtree_class,
> +	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur,
> +		 int level),
> +	TP_ARGS(sc, cur, level),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(int, type)
> +		__field(xfs_btnum_t, btnum)
> +		__field(xfs_agnumber_t, agno)
> +		__field(xfs_agblock_t, bno)
> +		__field(int, level)
> +		__field(int, nlevels)
> +		__field(int, ptr)
> +	),
> +	TP_fast_assign(
> +		xfs_fsblock_t fsbno = xfs_scrub_btree_cur_fsbno(cur, level);
> +		__entry->dev = sc->mp->m_super->s_dev;
> +		__entry->type = sc->sm->sm_type;
> +		__entry->btnum = cur->bc_btnum;
> +		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
> +		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
> +		__entry->level = level;
> +		__entry->nlevels = cur->bc_nlevels;
> +		__entry->ptr = cur->bc_ptrs[level];
> +	),
> +	TP_printk("dev %d:%d type %u btnum %d agno %u agbno %u level %d nlevels %d ptr %d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->type,
> +		  __entry->btnum,
> +		  __entry->agno,
> +		  __entry->bno,
> +		  __entry->level,
> +		  __entry->nlevels,
> +		  __entry->ptr)
> +)
> +#define DEFINE_SCRUB_SBTREE_EVENT(name) \
> +DEFINE_EVENT(xfs_scrub_sbtree_class, name, \
> +	TP_PROTO(struct xfs_scrub_context *sc, struct xfs_btree_cur *cur, \
> +		 int level), \
> +	TP_ARGS(sc, cur, level))
> +
> +DEFINE_SCRUB_SBTREE_EVENT(xfs_scrub_btree_rec);
> +DEFINE_SCRUB_SBTREE_EVENT(xfs_scrub_btree_key);
> +
>  #endif /* _TRACE_XFS_SCRUB_TRACE_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 


* Re: [PATCH 11/25] xfs: scrub the AGI
  2017-10-04 18:02         ` Darrick J. Wong
@ 2017-10-04 22:16           ` Dave Chinner
  2017-10-04 23:12             ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-04 22:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 11:02:04AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 04, 2017 at 05:43:33PM +1100, Dave Chinner wrote:
> > It seems to me that we're using the superblock 0 values as the
> > golden master because it's a mounted filesystem, and then comparing
> > everything else against it. Maybe we should at least check a couple
> > of secondary superblocks to see that they match the primary
> > superblock - that way we'll have some confidence that at least
> > things like agcount, agblocks, dblocks, etc are valid before we go
> > any further...
> 
> xfs_scrub_superblock does check the secondary superblock geometry
> against whatever's in mp->m_sb, which came from sb 0.

/me smacks forehead

The patch is even named "scrub the backup superblocks".

Perhaps it didn't sink in because they are normally called
"secondary superblocks". My bad....

> > But maybe all we need is a comment in the overall scrub description -
> > that we're pretty much assuming that sb 0 is intact because we write
> > what is in memory back to it and so we can simply validate
> > everything else against the primary superblock contents...
> 
> Correct.  Since scrub is run against a mounted live filesystem we assume
> that the mount code fully validated sb 0 and therefore we can rely on it
> not being wrong.
> 
> If OTOH sb 0 *is* wrong then the admin is better off running xfs_repair
> because there's too much whirring machinery to go changing fundamental
> geometry.

Yup, that makes sense.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 11/25] xfs: scrub the AGI
  2017-10-04 22:16           ` Dave Chinner
@ 2017-10-04 23:12             ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-04 23:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 09:16:34AM +1100, Dave Chinner wrote:
> On Wed, Oct 04, 2017 at 11:02:04AM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 04, 2017 at 05:43:33PM +1100, Dave Chinner wrote:
> > > It seems to me that we're using the superblock 0 values as the
> > > golden master because it's a mounted filesystem, and then comparing
> > > everything else against it. Maybe we should at least check a couple
> > > of secondary superblocks to see that they match the primary
> > > superblock - that way we'll have some confidence that at least
> > > things like agcount, agblocks, dblocks, etc are valid before we go
> > > any further...
> > 
> > xfs_scrub_superblock does check the secondary superblock geometry
> > against whatever's in mp->m_sb, which came from sb 0.
> 
> /me smacks forehead
> 
> The patch is even named "scrub the backup superblocks".
> 
> Perhaps it didn't sink in because they are normally called
> "secondary superblocks". My bad....

Sometimes I think I still have ext4 on the brain. :(

--D

> > > But maybe all we need is a comment in the overall scrub description -
> > > that we're pretty much assuming that sb 0 is intact because we write
> > > what is in memory back to it and so we can simply validate
> > > everything else against the primary superblock contents...
> > 
> > Correct.  Since scrub is run against a mounted live filesystem we assume
> > that the mount code fully validated sb 0 and therefore we can rely on it
> > not being wrong.
> > 
> > If OTOH sb 0 *is* wrong then the admin is better off running xfs_repair
> > because there's too much whirring machinery to go changing fundamental
> > geometry.
> 
> Yup, that makes sense.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 12/25] xfs: scrub free space btrees
  2017-10-03 20:42 ` [PATCH 12/25] xfs: scrub free space btrees Darrick J. Wong
@ 2017-10-05  0:59   ` Dave Chinner
  2017-10-05  1:13     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-05  0:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:05PM -0700, Darrick J. Wong wrote:
> +
> +/*
> + * Set us up to scrub free space btrees.
> + * Push everything out of the log so that the busy extent list is empty.
> + */
> +int
> +xfs_scrub_setup_ag_allocbt(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	return xfs_scrub_setup_ag_btree(sc, ip, sc->try_harder);

These setup calls are getting deeply nested and intertwined. And
I really don't know why we pass sc->try_harder as a separate
parameter when we are already passing sc, especially as
xfs_scrub_setup_ag_btree() doesn't use it at all...

> +/* Scrub a bnobt/cntbt record. */
> +STATIC int
> +xfs_scrub_allocbt_helper(
> +	struct xfs_scrub_btree		*bs,
> +	union xfs_btree_rec		*rec)
> +{
> +	struct xfs_mount		*mp = bs->cur->bc_mp;
> +	struct xfs_agf			*agf;
> +	unsigned long long		rec_end;
> +	xfs_agblock_t			bno;
> +	xfs_extlen_t			len;
> +	int				error = 0;
> +
> +	bno = be32_to_cpu(rec->alloc.ar_startblock);
> +	len = be32_to_cpu(rec->alloc.ar_blockcount);
> +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> +	rec_end = (unsigned long long)bno + len;
> +
> +	if (bno >= mp->m_sb.sb_agblocks ||

Needs to take into account short last AG, so....

> +	    bno >= be32_to_cpu(agf->agf_length) ||
> +	    len == 0 ||
> +	    rec_end > mp->m_sb.sb_agblocks ||
> +	    rec_end > be32_to_cpu(agf->agf_length))
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);

... it should probably just validate the agbno is good. i.e.

	if (!xfs_agbno_verify(bno) || !len ||
	    !xfs_agbno_verify(rec_end)) {
.....
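
A rough userspace sketch of what such a verifier might look like, assuming
it is told which AG it is checking.  struct sketch_geom, sketch_ag_len(),
sketch_agbno_ok() and sketch_extent_ok() are hypothetical names, not an
existing libxfs interface.  One detail worth noting: rec_end is one past
the last block of the extent, so an extent that ends exactly at the end of
the AG would fail a plain "is this agbno valid" test on rec_end.  The
sketch checks the last block instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry; mirrors sb_agblocks/sb_agcount/sb_dblocks. */
struct sketch_geom {
	uint32_t	agblocks;
	uint32_t	agcount;
	uint64_t	dblocks;
};

/* Length of AG @agno, allowing for a short last AG. */
static uint32_t
sketch_ag_len(const struct sketch_geom *g, uint32_t agno)
{
	assert(agno < g->agcount);

	if (agno < g->agcount - 1)
		return g->agblocks;
	return (uint32_t)(g->dblocks - (uint64_t)agno * g->agblocks);
}

/* Does @agbno point at a block inside AG @agno? */
static bool
sketch_agbno_ok(const struct sketch_geom *g, uint32_t agno, uint32_t agbno)
{
	return agno < g->agcount && agbno < sketch_ag_len(g, agno);
}

/* Is [bno, bno + len) a non-empty extent that fits inside AG @agno? */
static bool
sketch_extent_ok(const struct sketch_geom *g, uint32_t agno,
		 uint32_t bno, uint32_t len)
{
	uint64_t	end = (uint64_t)bno + len;	/* one past the end */

	if (len == 0 || !sketch_agbno_ok(g, agno, bno))
		return false;
	/* Same as requiring the last block, end - 1, to be valid. */
	return end <= sketch_ag_len(g, agno);
}
```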


> +xfs_scrub_allocbt(
> +	struct xfs_scrub_context	*sc,
> +	xfs_btnum_t			which)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct xfs_btree_cur		*cur;
> +
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> +	cur = which == XFS_BTNUM_BNO ? sc->sa.bno_cur : sc->sa.cnt_cur;
> +	return xfs_scrub_btree(sc, cur, xfs_scrub_allocbt_helper,
> +			&oinfo, NULL);
> +}

I'm assuming the owner info is for later functionality to cross
check btree blocks against the rmap btree?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 12/25] xfs: scrub free space btrees
  2017-10-05  0:59   ` Dave Chinner
@ 2017-10-05  1:13     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-05  1:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 11:59:43AM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:05PM -0700, Darrick J. Wong wrote:
> > +
> > +/*
> > + * Set us up to scrub free space btrees.
> > + * Push everything out of the log so that the busy extent list is empty.
> > + */
> > +int
> > +xfs_scrub_setup_ag_allocbt(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	return xfs_scrub_setup_ag_btree(sc, ip, sc->try_harder);
> 
> These setup calls are getting deeply nested and intertwined. And
> I really don't know why we pass sc->try_harder as a separate
> parameter when we are already passing sc, especially as
> xfs_scrub_setup_ag_btree() doesn't use it at all...
> 
> > +/* Scrub a bnobt/cntbt record. */
> > +STATIC int
> > +xfs_scrub_allocbt_helper(
> > +	struct xfs_scrub_btree		*bs,
> > +	union xfs_btree_rec		*rec)
> > +{
> > +	struct xfs_mount		*mp = bs->cur->bc_mp;
> > +	struct xfs_agf			*agf;
> > +	unsigned long long		rec_end;
> > +	xfs_agblock_t			bno;
> > +	xfs_extlen_t			len;
> > +	int				error = 0;
> > +
> > +	bno = be32_to_cpu(rec->alloc.ar_startblock);
> > +	len = be32_to_cpu(rec->alloc.ar_blockcount);
> > +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> > +	rec_end = (unsigned long long)bno + len;
> > +
> > +	if (bno >= mp->m_sb.sb_agblocks ||
> 
> Needs to take into account short last AG, so....
> 
> > +	    bno >= be32_to_cpu(agf->agf_length) ||
> > +	    len == 0 ||
> > +	    rec_end > mp->m_sb.sb_agblocks ||
> > +	    rec_end > be32_to_cpu(agf->agf_length))
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> 
> ... it should probably just validate the agbno is good. i.e.
> 
> 	if (!xfs_agbno_verify(bno) || !len ||
> 	    !xfs_agbno_verify(rec_end)) {

Will do.

> .....
> 
> 
> > +xfs_scrub_allocbt(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_btnum_t			which)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct xfs_btree_cur		*cur;
> > +
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > +	cur = which == XFS_BTNUM_BNO ? sc->sa.bno_cur : sc->sa.cnt_cur;
> > +	return xfs_scrub_btree(sc, cur, xfs_scrub_allocbt_helper,
> > +			&oinfo, NULL);
> > +}
> 
> I'm assuming the owner info is for later functionality to cross
> check btree blocks against the rmap btree?

Yes.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 13/25] xfs: scrub inode btrees
  2017-10-03 20:42 ` [PATCH 13/25] xfs: scrub inode btrees Darrick J. Wong
@ 2017-10-05  2:08   ` Dave Chinner
  2017-10-05  5:47     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-05  2:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:11PM -0700, Darrick J. Wong wrote:
> +/*
> + * Set us up to scrub inode btrees.
> + * If we detect a discrepancy between the inobt and the inode,
> + * try again after forcing logged inode cores out to disk.
> + */
> +int
> +xfs_scrub_setup_ag_iallocbt(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	return xfs_scrub_setup_ag_btree(sc, ip, sc->try_harder);
> +}
> +
> +/* Inode btree scrubber. */
> +
> +/* Is this chunk worth checking? */
> +STATIC bool
> +xfs_scrub_iallocbt_chunk(
> +	struct xfs_scrub_btree		*bs,
> +	struct xfs_inobt_rec_incore	*irec,
> +	xfs_agino_t			agino,
> +	xfs_extlen_t			len)
> +{
> +	struct xfs_mount		*mp = bs->cur->bc_mp;
> +	struct xfs_agf			*agf;
> +	unsigned long long		rec_end;
> +	xfs_agblock_t			eoag;
> +	xfs_agblock_t			bno;
> +
> +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> +	eoag = be32_to_cpu(agf->agf_length);

Probably should use the AGI for this.

> +	bno = XFS_AGINO_TO_AGBNO(mp, agino);
> +	rec_end = (unsigned long long)bno + len;
> +
> +	if (bno >= mp->m_sb.sb_agblocks || bno >= eoag ||
> +	    rec_end > mp->m_sb.sb_agblocks || rec_end > eoag) {

Same comment as last patch.

> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +		return false;
> +	}
> +
> +	return true;
> +}

I note there is no check on the length passed in for the inode
chunk - should that be verified?

> +
> +/* Count the number of free inodes. */
> +static unsigned int
> +xfs_scrub_iallocbt_freecount(
> +	xfs_inofree_t			freemask)
> +{
> +	int				bits = XFS_INODES_PER_CHUNK;
> +	unsigned int			ret = 0;
> +
> +	while (bits--) {
> +		if (freemask & 1)
> +			ret++;
> +		freemask >>= 1;
> +	}


Seems a little cumbersome. Perhaps a loop using xfs_next_bit()
might be a bit faster, something like:

	nextbit = xfs_next_bit(&freemask, 1, 0); 
	while (nextbit != -1) {
		ret++;
		nextbit = xfs_next_bit(&freemask, 1, nextbit + 1);
	}
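
Whichever form is used, the result has to equal the number of set bits in
the 64-bit free mask.  A userspace sketch comparing the bit-by-bit loop
from the patch with a variant that only visits set bits (function names
and SKETCH_INODES_PER_CHUNK are illustrative; in the kernel a
population-count helper such as hweight64() would presumably do the same
job):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match XFS_INODES_PER_CHUNK (one bit per inode). */
#define SKETCH_INODES_PER_CHUNK	64

/* The loop from the patch: test each of the 64 bits in turn. */
static unsigned int
freecount_loop(uint64_t freemask)
{
	int		bits = SKETCH_INODES_PER_CHUNK;
	unsigned int	ret = 0;

	while (bits--) {
		if (freemask & 1)
			ret++;
		freemask >>= 1;
	}
	return ret;
}

/* Visit only the set bits: each pass clears the lowest set bit. */
static unsigned int
freecount_setbits(uint64_t freemask)
{
	unsigned int	ret = 0;

	while (freemask) {
		freemask &= freemask - 1;
		ret++;
	}
	return ret;
}
```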


> +/* Check a particular inode with ir_free. */
> +STATIC int
> +xfs_scrub_iallocbt_check_cluster_freemask(
> +	struct xfs_scrub_btree		*bs,
> +	xfs_ino_t			fsino,
> +	xfs_agino_t			chunkino,
> +	xfs_agino_t			clusterino,
> +	struct xfs_inobt_rec_incore	*irec,
> +	struct xfs_buf			*bp)
> +{
> +	struct xfs_dinode		*dip;
> +	struct xfs_mount		*mp = bs->cur->bc_mp;
> +	bool				freemask_ok;
> +	bool				inuse;
> +	int				error;
> +
> +	dip = xfs_buf_offset(bp, clusterino * mp->m_sb.sb_inodesize);
> +	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC ||
> +	    (dip->di_version >= 3 &&
> +	     be64_to_cpu(dip->di_ino) != fsino + clusterino)) {
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +		goto out;
> +	}
> +
> +	freemask_ok = !!(irec->ir_free & XFS_INOBT_MASK(chunkino + clusterino));

No need for !!(...) for a bool type - the compiler will squash it
down to 0/1 automatically.

> +	error = xfs_icache_inode_is_allocated(mp, bs->cur->bc_tp,
> +			fsino + clusterino, &inuse);
> +	if (error == -ENODATA) {
> +		/* Not cached, just read the disk buffer */

I think that is wrong. xfs_icache_inode_is_allocated() returns
-ENOENT if the inode is not in cache....

> +		freemask_ok ^= !!(dip->di_mode);
> +		if (!bs->sc->try_harder && !freemask_ok)
> +			return -EDEADLOCK;
> +	} else if (error < 0) {
> +		/* Inode is only half assembled, don't bother. */
> +		freemask_ok = true;

Or we had an IO error looking it up. i.e. -EAGAIN is the "half
assembled" state (i.e. in the XFS_INEW state) or the half
*disassembled* state (i.e. XFS_IRECLAIMABLE), anything
else is an error...
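
The three-way split described here could be sketched as below.  This is an
illustration only: the enum and classify_lookup() are hypothetical, and
the mapping of return values follows the description in this reply, not a
reading of the lookup code itself.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes for the caller of the in-core inode lookup. */
enum sketch_lookup {
	SKETCH_USE_INCORE,	/* lookup worked, trust the in-core state */
	SKETCH_READ_DISK,	/* not cached, check the on-disk inode */
	SKETCH_SKIP,		/* half assembled or half torn down */
	SKETCH_FAIL,		/* a real error, pass it to the caller */
};

/* Map a lookup return value to what the scrubber should do next. */
static enum sketch_lookup
classify_lookup(int error)
{
	switch (error) {
	case 0:
		return SKETCH_USE_INCORE;
	case -ENOENT:
		return SKETCH_READ_DISK;
	case -EAGAIN:
		return SKETCH_SKIP;
	default:
		return SKETCH_FAIL;
	}
}
```

With this split an I/O error is no longer folded into the "half
assembled" case and silently treated as a good freemask.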

> +	} else {
> +		/* Inode is all there. */
> +		freemask_ok ^= inuse;

So inuse is returned from a mode check after iget succeeds. The mode
isn't zeroed until /after/ XFS_IRECLAIMABLE is set, but it's also
set before XFS_INEW is cleared.  IOWs, how can
xfs_icache_inode_is_allocated() report anything
other than inuse == true here? If that's the case, what's the point
of the mode check inside xfs_icache_inode_is_allocated()?

> +	}
> +	if (!freemask_ok)
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +out:
> +	return 0;
> +}
> +
> +/* Make sure the free mask is consistent with what the inodes think. */
> +STATIC int
> +xfs_scrub_iallocbt_check_freemask(
> +	struct xfs_scrub_btree		*bs,
> +	struct xfs_inobt_rec_incore	*irec)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct xfs_imap			imap;
> +	struct xfs_mount		*mp = bs->cur->bc_mp;
> +	struct xfs_dinode		*dip;
> +	struct xfs_buf			*bp;
> +	xfs_ino_t			fsino;
> +	xfs_agino_t			nr_inodes;
> +	xfs_agino_t			agino;
> +	xfs_agino_t			chunkino;
> +	xfs_agino_t			clusterino;
> +	xfs_agblock_t			agbno;
> +	int				blks_per_cluster;
> +	uint16_t			holemask;
> +	uint16_t			ir_holemask;
> +	int				error = 0;
> +
> +	/* Make sure the freemask matches the inode records. */
> +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);

Does this setup and loop work for the case where we have 64k
filesystem blocks and so two or more inode chunks per filesystem
block (i.e. ppc64)? 

> +/* Scrub an inobt/finobt record. */
> +STATIC int
> +xfs_scrub_iallocbt_helper(
> +	struct xfs_scrub_btree		*bs,
> +	union xfs_btree_rec		*rec)
> +{
> +	struct xfs_mount		*mp = bs->cur->bc_mp;
> +	struct xfs_agi			*agi;
> +	struct xfs_inobt_rec_incore	irec;
> +	uint64_t			holes;
> +	xfs_agino_t			agino;
> +	xfs_agblock_t			agbno;
> +	xfs_extlen_t			len;
> +	int				holecount;
> +	int				i;
> +	int				error = 0;
> +	unsigned int			real_freecount;
> +	uint16_t			holemask;
> +
> +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> +
> +	if (irec.ir_count > XFS_INODES_PER_CHUNK ||
> +	    irec.ir_freecount > XFS_INODES_PER_CHUNK)
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	real_freecount = irec.ir_freecount +
> +			(XFS_INODES_PER_CHUNK - irec.ir_count);
> +	if (real_freecount != xfs_scrub_iallocbt_freecount(irec.ir_free))
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	agi = XFS_BUF_TO_AGI(bs->sc->sa.agi_bp);
> +	agino = irec.ir_startino;
> +	agbno = XFS_AGINO_TO_AGBNO(mp, irec.ir_startino);
> +	if (agbno >= be32_to_cpu(agi->agi_length)) {

Validate as per every other agbno?

> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +		goto out;
> +	}
> +
> +	if ((agbno & (xfs_ialloc_cluster_alignment(mp) - 1)) ||
> +	    (agbno & (xfs_icluster_size_fsb(mp) - 1)))
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);

What's the magic masking checks being done here? (comment?)
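
One possible reading, which is an assumption and not something the patch
states, is that each mask tests whether agbno is a multiple of the given
value.  That idiom is only correct when the value is a power of two, as
this userspace sketch with hypothetical helper names shows:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if exactly one bit is set in @v. */
static bool
sketch_is_pow2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * For a power-of-two @align, (align - 1) is a mask of the low bits, and
 * @agbno is a multiple of @align exactly when all of those bits are zero.
 */
static bool
sketch_is_aligned(uint32_t agbno, uint32_t align)
{
	assert(sketch_is_pow2(align));
	return (agbno & (align - 1)) == 0;
}
```

If either value could ever be something other than a power of two, a
modulo test would be needed instead, which is one more reason for a
comment.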

> +	/* Handle non-sparse inodes */
> +	if (!xfs_inobt_issparse(irec.ir_holemask)) {
> +		len = XFS_B_TO_FSB(mp,
> +				XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize);
> +		if (irec.ir_count != XFS_INODES_PER_CHUNK)
> +			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +		if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
> +			goto out;
> +		goto check_freemask;
> +	}
> +
> +	/* Check each chunk of a sparse inode cluster. */
> +	holemask = irec.ir_holemask;
> +	holecount = 0;
> +	len = XFS_B_TO_FSB(mp,
> +			XFS_INODES_PER_HOLEMASK_BIT * mp->m_sb.sb_inodesize);
> +	holes = ~xfs_inobt_irec_to_allocmask(&irec);
> +	if ((holes & irec.ir_free) != holes ||
> +	    irec.ir_freecount > irec.ir_count)
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1,
> +			i++, agino += XFS_INODES_PER_HOLEMASK_BIT) {

Urk. That's a bit hard to read.

> +		if (holemask & 1) {
> +			holecount += XFS_INODES_PER_HOLEMASK_BIT;
> +			continue;
> +		}
> +
> +		if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
> +			break;
> +	}

How about

	for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; i++) {
		if (holemask & 1) {
			holecount += XFS_INODES_PER_HOLEMASK_BIT;
		} else if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
			break;

		holemask >>= 1;
		agino += XFS_INODES_PER_HOLEMASK_BIT;
	}
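
To show that this form walks the mask the same way, here is a standalone
userspace sketch.  The two constants are assumed to match
XFS_INOBT_HOLEMASK_BITS and XFS_INODES_PER_HOLEMASK_BIT, walk_holemask()
is a hypothetical name, and the chunk check is replaced by recording which
starting inode numbers would have been checked.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 16 holemask bits, each covering 4 of the 64 inodes. */
#define SKETCH_HOLEMASK_BITS		16
#define SKETCH_INODES_PER_HOLEMASK_BIT	4

/*
 * Walk @holemask from the lowest bit upwards.  A set bit is a hole and
 * adds to the returned hole count; a clear bit is an allocated region
 * whose first inode number is stored in @checked.  @checked must have
 * room for SKETCH_HOLEMASK_BITS entries.
 */
static int
walk_holemask(uint16_t holemask, uint32_t agino,
	      uint32_t *checked, int *nr_checked)
{
	int	holecount = 0;
	int	i;

	assert(checked && nr_checked);

	*nr_checked = 0;
	for (i = 0; i < SKETCH_HOLEMASK_BITS; i++) {
		if (holemask & 1)
			holecount += SKETCH_INODES_PER_HOLEMASK_BIT;
		else
			checked[(*nr_checked)++] = agino;

		holemask >>= 1;
		agino += SKETCH_INODES_PER_HOLEMASK_BIT;
	}
	return holecount;
}
```

Keeping the shift and the agino increment in the loop body means a hole
can no longer skip them by accident, which the "continue" in the original
form relied on the for-clause to avoid.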

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/25] xfs: scrub rmap btrees
  2017-10-03 20:42 ` [PATCH 14/25] xfs: scrub rmap btrees Darrick J. Wong
@ 2017-10-05  2:56   ` Dave Chinner
  2017-10-05  5:02     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-05  2:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:17PM -0700, Darrick J. Wong wrote:
> +/* Scrub an rmapbt record. */
> +STATIC int
> +xfs_scrub_rmapbt_helper(
> +	struct xfs_scrub_btree		*bs,
> +	union xfs_btree_rec		*rec)
> +{
> +	struct xfs_mount		*mp = bs->cur->bc_mp;
> +	struct xfs_agf			*agf;
> +	struct xfs_rmap_irec		irec;
> +	unsigned long long		rec_end;
> +	xfs_agblock_t			eoag;
> +	bool				non_inode;
> +	bool				is_unwritten;
> +	bool				is_bmbt;
> +	bool				is_attr;
> +	int				error;
> +
> +	error = xfs_rmap_btrec_to_irec(rec, &irec);
> +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, 0, &error))
> +		goto out;

This got me again. Again I was thinking that this code threw
away the error from xfs_rmap_btrec_to_irec(). Could we consider
renaming "op_ok" to "process_error" or something like that so
it's clearer that it's doing some kind of checking on the error
we just got back?

> +
> +	/* Check extent. */
> +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> +	eoag = be32_to_cpu(agf->agf_length);
> +	rec_end = (unsigned long long)irec.rm_startblock + irec.rm_blockcount;
> +
> +	if (irec.rm_startblock >= mp->m_sb.sb_agblocks ||
> +	    irec.rm_startblock >= eoag ||
> +	    irec.rm_blockcount == 0 ||
> +	    rec_end > mp->m_sb.sb_agblocks ||
> +	    rec_end > eoag)
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);

Broken Record (tm).

> +	/* Check flags. */
> +	non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner);
> +	is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK;
> +	is_attr = irec.rm_flags & XFS_RMAP_ATTR_FORK;
> +	is_unwritten = irec.rm_flags & XFS_RMAP_UNWRITTEN;
> +
> +	if (is_bmbt && irec.rm_offset != 0)
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	if (non_inode && irec.rm_offset != 0)
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	if (is_unwritten && (is_bmbt || non_inode || is_attr))
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	if (non_inode && (is_bmbt || is_unwritten || is_attr))
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	if (!non_inode) {
> +		xfs_agnumber_t	agno = XFS_INO_TO_AGNO(mp, irec.rm_owner);
> +		xfs_agino_t	agino = XFS_INO_TO_AGINO(mp, irec.rm_owner);
> +		xfs_agblock_t	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
> +
> +		/* Owner inode within an AG? */
> +		if (agno >= mp->m_sb.sb_agcount ||
> +		    agbno >= mp->m_sb.sb_agblocks)
> +			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +		/* Owner inode within the FS? */
> +		if (XFS_AGB_TO_DADDR(mp, agno, agbno) >=
> +		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks))
> +			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);

These two checks probably also should be libxfs functionality.
We've got similar checks strewn all over the place (e.g. valid_bno()
in xfs_db, verify_aginum/verify_inum/verify_agbno/verify_dfsbno
in repair, etc.).

It would be good to get all these sorts of basic type checks
centralised and used consistently by all the code....
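
A minimal sketch of what centralised helpers could look like. Everything
here is hypothetical: struct sketch_geom stands in for the superblock
geometry, and the sketch_* function names are invented for this example
rather than taken from libxfs, xfs_db or repair:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal geometry; real code would use struct xfs_mount. */
struct sketch_geom {
	uint32_t	agcount;	/* number of AGs, assumed > 0 */
	uint32_t	agblocks;	/* blocks in a full-sized AG */
	uint64_t	dblocks;	/* blocks in the data device */
};

/* Length of an AG in blocks; the last AG may be shorter than the rest. */
static uint32_t
sketch_ag_length(const struct sketch_geom *g, uint32_t agno)
{
	if (agno == g->agcount - 1)
		return (uint32_t)(g->dblocks - (uint64_t)agno * g->agblocks);
	return g->agblocks;
}

/* Does this AG block number point inside the given AG? */
static bool
sketch_verify_agbno(const struct sketch_geom *g, uint32_t agno,
		    uint32_t agbno)
{
	if (agno >= g->agcount)
		return false;
	return agbno < sketch_ag_length(g, agno);
}

/* Is [agbno, agbno + len) a non-empty extent that fits inside the AG? */
static bool
sketch_verify_agbext(const struct sketch_geom *g, uint32_t agno,
		     uint32_t agbno, uint32_t len)
{
	/* Do the addition in 64 bits so a huge len cannot wrap around. */
	uint64_t	end = (uint64_t)agbno + len;

	if (len == 0 || !sketch_verify_agbno(g, agno, agbno))
		return false;
	return end <= sketch_ag_length(g, agno);
}
```

The extent checks in the btree scrubbers would then collapse into one
call per record, and the same helpers could back the inode owner checks
once the inode number has been split into AG number and AG block.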

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 15/25] xfs: scrub refcount btrees
  2017-10-03 20:42 ` [PATCH 15/25] xfs: scrub refcount btrees Darrick J. Wong
@ 2017-10-05  2:59   ` Dave Chinner
  2017-10-05  5:02     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-05  2:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> +/* Scrub a refcountbt record. */
> +STATIC int
> +xfs_scrub_refcountbt_helper(
> +	struct xfs_scrub_btree		*bs,
> +	union xfs_btree_rec		*rec)
> +{
> +	struct xfs_mount		*mp = bs->cur->bc_mp;
> +	struct xfs_agf			*agf;
> +	struct xfs_refcount_irec	irec;
> +	unsigned long long		rec_end;
> +	xfs_agblock_t			eoag;
> +	bool				has_cowflag;
> +	int				error = 0;
> +
> +	irec.rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
> +	irec.rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
> +	irec.rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
> +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> +	eoag = be32_to_cpu(agf->agf_length);
> +
> +	/* Only CoW records can have refcount == 1. */
> +	has_cowflag = !!(irec.rc_startblock & XFS_REFC_COW_START);
> +	if ((irec.rc_refcount == 1 && !has_cowflag) ||
> +	    (irec.rc_refcount != 1 && has_cowflag))
> +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> +
> +	/* Check the extent. */
> +	irec.rc_startblock &= ~XFS_REFC_COW_START;
> +	rec_end = (unsigned long long)irec.rc_startblock + irec.rc_blockcount;
> +	if (irec.rc_startblock >= mp->m_sb.sb_agblocks ||
> +	    irec.rc_startblock >= eoag ||
> +	    irec.rc_blockcount == 0 ||
> +	    rec_end > mp->m_sb.sb_agblocks ||
> +	    rec_end > eoag ||

For the (broken) record...

> +	    irec.rc_refcount < 1)

Unsigned variable, so it seems to me that irec.rc_refcount == 0
would be a more correct check.
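
For illustration, the two refcount rules could be written as one small
predicate. This is a sketch only: SKETCH_COW_START stands in for
XFS_REFC_COW_START (assumed here to be the top bit of the start block)
and sketch_refc_rec_ok() is a made-up name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for XFS_REFC_COW_START; assumed to be the top bit. */
#define SKETCH_COW_START	(1U << 31)

/*
 * A refcount record is plausible if the count is non-zero and the CoW
 * flag is set exactly when the count is one.
 */
static bool
sketch_refc_rec_ok(uint32_t startblock, uint32_t refcount)
{
	bool	has_cowflag = (startblock & SKETCH_COW_START) != 0;

	/* The count is unsigned, so "== 0" says exactly what "< 1" meant. */
	if (refcount == 0)
		return false;
	return has_cowflag == (refcount == 1);
}
```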

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 16/25] xfs: scrub inodes
  2017-10-03 20:42 ` [PATCH 16/25] xfs: scrub inodes Darrick J. Wong
@ 2017-10-05  4:04   ` Dave Chinner
  2017-10-05  5:22     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-05  4:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:30PM -0700, Darrick J. Wong wrote:
> @@ -559,3 +563,64 @@ xfs_scrub_setup_ag_btree(
>  
>  	return xfs_scrub_ag_init(sc, sc->sm->sm_agno, &sc->sa);
>  }
> +
> +/*
> + * Given an inode and the scrub control structure, grab either the
> + * inode referenced in the control structure or the inode passed in.
> + * The inode is not locked.
> + */
> +int
> +xfs_scrub_get_inode(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip_in)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_inode		*ips = NULL;

*ip?

> +	int				error;
> +
> +	if (sc->sm->sm_agno || (sc->sm->sm_gen && !sc->sm->sm_ino))
> +		return -EINVAL;

What's this checking?

> +
> +	/* We want to scan the inode we already had opened. */
> +	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) {
> +		sc->ip = ip_in;
> +		return 0;
> +	}
> +
> +	/* Look up the inode, see if the generation number matches. */
> +	if (xfs_internal_inum(mp, sc->sm->sm_ino))
> +		return -ENOENT;

maybe xfs_internal_inum should be moved to the same place as all the
inode/agbno/bno verification functions....

> +	error = xfs_iget(mp, NULL, sc->sm->sm_ino, XFS_IGET_UNTRUSTED,
> +			0, &ips);

I think we also want XFS_IGET_DONTCACHE here, so we don't trash the
inode cache with inodes that we use once for scrub and never touch
again.

> +	if (error == -ENOENT || error == -EINVAL) {
> +		/* inode doesn't exist... */
> +		return -ENOENT;
> +	} else if (error) {
> +		trace_xfs_scrub_op_error(sc,
> +				XFS_INO_TO_AGNO(mp, sc->sm->sm_ino),
> +				XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
> +				error, __return_address);
> +		return error;
> +	}
> +	if (VFS_I(ips)->i_generation != sc->sm->sm_gen) {
> +		iput(VFS_I(ips));
> +		return -ENOENT;
> +	}
> +
> +	sc->ip = ips;
> +	return 0;
> +}
> +
> +/* Push everything out of the log onto disk. */
> +int
> +xfs_scrub_checkpoint_log(
> +	struct xfs_mount	*mp)
> +{
> +	int			error;
> +
> +	error = _xfs_log_force(mp, XFS_LOG_SYNC, NULL);
> +	if (error)
> +		return error;
> +	xfs_ail_push_all_sync(mp->m_ail);
> +	return 0;
> +}

Oooo, that's a nasty thing to do on busy systems with large dirty
logs. I hope this is a "last resort" kind of thing....

> +/* Set us up with an inode. */

What state are we trying to get the inode into here? We grab all the
various locks, but we can still have data changing via mmap pages
that are already faulted in and delalloc extents in the incore
extent list that aren't reflected on disk...

A comment explaining what we expect here would be nice.

> +int
> +xfs_scrub_setup_inode(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	/*
> +	 * Try to get the inode.  If the verifiers fail, we try again
> +	 * in raw mode.
> +	 */
> +	error = xfs_scrub_get_inode(sc, ip);
> +	switch (error) {
> +	case 0:
> +		break;
> +	case -EFSCORRUPTED:
> +	case -EFSBADCRC:
> +		return xfs_scrub_checkpoint_log(mp);
> +	default:
> +		return error;
> +	}
> +
> +	/* Got the inode, lock it and we're ready to go. */
> +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> +	xfs_ilock(sc->ip, sc->ilock_flags);
> +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> +	if (error)
> +		goto out_unlock;
> +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);

Should the inode be joined to the transaction so that cancelling the
transaction unlocks the inode? Then the need for the ilock_flags
variable goes away....

> +
> +	return error;
> +out_unlock:
> +	xfs_iunlock(sc->ip, sc->ilock_flags);
> +	if (sc->ip != ip)
> +		iput(VFS_I(sc->ip));
> +	sc->ip = NULL;
> +	return error;
> +}
> +
> +/* Inode core */
> +
> +/* Scrub an inode. */
> +int
> +xfs_scrub_inode(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_imap			imap;
> +	struct xfs_dinode		di;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*bp = NULL;
> +	struct xfs_dinode		*dip;
> +	xfs_ino_t			ino;
> +	size_t				fork_recs;
> +	unsigned long long		isize;
> +	uint64_t			flags2;
> +	uint32_t			nextents;
> +	uint32_t			extsize;
> +	uint32_t			cowextsize;
> +	uint16_t			flags;
> +	uint16_t			mode;
> +	bool				has_shared;
> +	int				error = 0;
> +
> +	/* Did we get the in-core inode, or are we doing this manually? */
> +	if (sc->ip) {
> +		ino = sc->ip->i_ino;
> +		xfs_inode_to_disk(sc->ip, &di, 0);
> +		dip = &di;
> +	} else {
> +		/* Map & read inode. */
> +		ino = sc->sm->sm_ino;
> +		error = xfs_imap(mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
> +		if (error == -EINVAL) {
> +			/*
> +			 * Inode could have gotten deleted out from under us;
> +			 * just forget about it.
> +			 */
> +			error = -ENOENT;
> +			goto out;
> +		}
> +		if (!xfs_scrub_op_ok(sc, XFS_INO_TO_AGNO(mp, ino),
> +				XFS_INO_TO_AGBNO(mp, ino), &error))
> +			goto out;
> +
> +		error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +				imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp,
> +				NULL);
> +		if (!xfs_scrub_op_ok(sc, XFS_INO_TO_AGNO(mp, ino),
> +				XFS_INO_TO_AGBNO(mp, ino), &error))
> +			goto out;
> +
> +		/* Is this really an inode? */
> +		bp->b_ops = &xfs_inode_buf_ops;
> +		dip = xfs_buf_offset(bp, imap.im_boffset);
> +		if (!xfs_dinode_verify(mp, ino, dip) ||
> +		    !xfs_dinode_good_version(mp, dip->di_version)) {
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +			goto out;
> +		}
> +
> +		/* ...and is it the one we asked for? */
> +		if (be32_to_cpu(dip->di_gen) != sc->sm->sm_gen) {
> +			error = -ENOENT;
> +			goto out;
> +		}
> +	}

Can we factor the manual mapping into a separate function? Just
makes it a bit cleaner and gets rid of a bunch of local variables
from this function that are just used to map and read the inode.

And reading on and getting ahead of the code, could we split it
further into

	<setup dip>

	xfs_scrub_dinode(sc, ino, dip, bp);

	<do live incore inode stuff>

> +
> +	flags = be16_to_cpu(dip->di_flags);
> +	if (dip->di_version >= 3)
> +		flags2 = be64_to_cpu(dip->di_flags2);
> +	else
> +		flags2 = 0;
> +
> +	/* di_mode */
> +	mode = be16_to_cpu(dip->di_mode);
> +	if (mode & ~(S_IALLUGO | S_IFMT))
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +
> +	/* v1/v2 fields */
> +	switch (dip->di_version) {
> +	case 1:
> +		if (dip->di_nlink != 0)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +
> +		if (dip->di_mode == 0 && sc->ip)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +
> +		if (dip->di_projid_lo != 0 || dip->di_projid_hi != 0)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;

We don't really support v1 inode format anymore - we convert it to
version 2 automatically in xfs_inode_from_disk() so the in-memory
inode is always v2 or v3, never v1. And when we write it back out,
we write it as a v2 inode, never as a v1 inode.

Hence I'm not sure whether we should be worrying about scrubbing
such inodes - they are going to be in an ever shrinking minority
of filesystems. At minimum, they should always return "preen".

> +	case 2:
> +	case 3:
> +		if (dip->di_onlink != 0)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +
> +		if (dip->di_mode == 0 && sc->ip)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +
> +		if (dip->di_projid_hi != 0 &&
> +		    !xfs_sb_version_hasprojid32bit(&mp->m_sb))
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	default:
> +		ASSERT(0);

If we don't understand the version, it's corrupt.

> +		break;
> +	}
> +
> +	/*
> +	 * di_uid/di_gid -- -1 isn't invalid, but there's no way that
> +	 * userspace could have created that.
> +	 */
> +	if (dip->di_uid == cpu_to_be32(-1U) ||
> +	    dip->di_gid == cpu_to_be32(-1U))
> +		xfs_scrub_ino_set_warning(sc, bp);
> +
> +	/* di_format */
> +	switch (dip->di_format) {
> +	case XFS_DINODE_FMT_DEV:
> +		if (!S_ISCHR(mode) && !S_ISBLK(mode) &&
> +		    !S_ISFIFO(mode) && !S_ISSOCK(mode))
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_LOCAL:
> +		if (!S_ISDIR(mode) && !S_ISLNK(mode))
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_EXTENTS:
> +		if (!S_ISREG(mode) && !S_ISDIR(mode) && !S_ISLNK(mode))
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_BTREE:
> +		if (!S_ISREG(mode) && !S_ISDIR(mode))
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_UUID:
> +	default:
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	}
> +
> +	/* di_size */
> +	isize = be64_to_cpu(dip->di_size);
> +	if (isize & (1ULL << 63))
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);

Should we be checking against the on disk format size, or the
mounted filesystem maximum size (i.e. mp->m_super->s_maxbytes)?
32 or 64 bit systems are going to have different maximum valid file
sizes..

Directories have a maximum bound size, too - the data space, leaf
space and freespace space, each of which are 32GB in size, IIRC.

And symlinks have a different maximum size, too.

> +	if (!S_ISDIR(mode) && !S_ISREG(mode) && !S_ISLNK(mode) && isize != 0)
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);

> +
> +	/* di_nblocks */
> +	if (flags2 & XFS_DIFLAG2_REFLINK) {
> +		; /* nblocks can exceed dblocks */
> +	} else if (flags & XFS_DIFLAG_REALTIME) {
> +		if (be64_to_cpu(dip->di_nblocks) >=
> +		    mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);

That doesn't seem right. The file can be on either the data or the
rt device, so the maximum file blocks is the size of one device or
the other, not both combined.

> +	} else {
> +		if (be64_to_cpu(dip->di_nblocks) >= mp->m_sb.sb_dblocks)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +	}
> +
> +	/* di_extsize */
> +	extsize = be32_to_cpu(dip->di_extsize);
> +	if (flags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT)) {
> +		if (extsize <= 0 || extsize > MAXEXTLEN)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +
> +		if (!(flags & XFS_DIFLAG_REALTIME) &&
> +		    extsize > mp->m_sb.sb_agblocks / 2)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +	} else {
> +		if (extsize != 0)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +	}

There's more to validating the extent size hints than this. From
xfs_ioctl_setattr_check_extsize():

/*
 * extent size hint validation is somewhat cumbersome. Rules are:
 *
 * 1. extent size hint is only valid for directories and regular files
 * 2. FS_XFLAG_EXTSIZE is only valid for regular files
 * 3. FS_XFLAG_EXTSZINHERIT is only valid for directories.
 * 4. can only be changed on regular files if no extents are allocated
 * 5. can be changed on directories at any time
 * 6. extsize hint of 0 turns off hints, clears inode flags.
 * 7. Extent size must be a multiple of the appropriate block size.
 * 8. for non-realtime files, the extent size hint must be limited
 *    to half the AG size to avoid alignment extending the extent beyond the
 *    limits of the AG.
 */

Maybe there's some commonality between these two functions...
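
As a rough sketch of the rules above that can be checked against a
static inode (rules 1, 2, 3, 6 and 8), something like the following
could be shared. All names are hypothetical, the hint is assumed to be
stored in filesystem blocks (so the rule 7 multiple-of-block-size check
is left out), and max_extlen is passed in rather than assumed:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the inode fields the hint rules depend on. */
struct sketch_inode {
	bool		is_reg;			/* regular file */
	bool		is_dir;			/* directory */
	bool		is_realtime;		/* realtime flag set */
	bool		flag_extsize;		/* EXTSIZE flag set */
	bool		flag_extszinherit;	/* EXTSZINHERIT flag set */
	uint32_t	extsize;		/* hint, in fs blocks */
};

static bool
sketch_extsize_ok(const struct sketch_inode *ip, uint32_t agblocks,
		  uint32_t max_extlen)
{
	bool	hinted = ip->flag_extsize || ip->flag_extszinherit;

	/* Rule 6: no flags means no hint, and a zero hint means no flags. */
	if (!hinted)
		return ip->extsize == 0;
	if (ip->extsize == 0 || ip->extsize > max_extlen)
		return false;

	/* Rules 1-3: EXTSIZE only on files, EXTSZINHERIT only on dirs. */
	if (ip->flag_extsize && !ip->is_reg)
		return false;
	if (ip->flag_extszinherit && !ip->is_dir)
		return false;

	/* Rule 8: non-realtime hints are limited to half the AG size. */
	if (!ip->is_realtime && ip->extsize > agblocks / 2)
		return false;
	return true;
}
```

Rules 4 and 5 are about when the hint may be changed, so they only
matter to the ioctl path and not to a scrubber looking at what is
already on disk.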

> +
> +	/* di_flags */
> +	if ((flags & XFS_DIFLAG_IMMUTABLE) && (flags & XFS_DIFLAG_APPEND))
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +	if ((flags & XFS_DIFLAG_FILESTREAM) && (flags & XFS_DIFLAG_REALTIME))
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);

What about project id flags?

Also, there are flags that are allowed only on regular files, and
there are flags that are only allowed on directories. Those should
probably also be checked for preening.

> +
> +	/* di_nextents */
> +	nextents = be32_to_cpu(dip->di_nextents);
> +	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> +	switch (dip->di_format) {
> +	case XFS_DINODE_FMT_EXTENTS:
> +		if (nextents > fork_recs)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_BTREE:
> +		if (nextents <= fork_recs)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_LOCAL:
> +	case XFS_DINODE_FMT_DEV:
> +	case XFS_DINODE_FMT_UUID:
> +	default:
> +		if (nextents != 0)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	}
> +
> +	/* di_anextents */
> +	nextents = be16_to_cpu(dip->di_anextents);
> +	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> +	switch (dip->di_aformat) {
> +	case XFS_DINODE_FMT_EXTENTS:
> +		if (nextents > fork_recs)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_BTREE:
> +		if (nextents <= fork_recs)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	case XFS_DINODE_FMT_LOCAL:
> +	case XFS_DINODE_FMT_DEV:
> +	case XFS_DINODE_FMT_UUID:
> +	default:
> +		if (nextents != 0)
> +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +		break;
> +	}

Don't we need a check here first to see whether an attribute fork
exists or not?

> +
> +	/* di_forkoff */
> +	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> +
> +	/* di_aformat */
> +	if (dip->di_aformat != XFS_DINODE_FMT_LOCAL &&
> +	    dip->di_aformat != XFS_DINODE_FMT_EXTENTS &&
> +	    dip->di_aformat != XFS_DINODE_FMT_BTREE)
> +		xfs_scrub_ino_set_corrupt(sc, ino, bp);

Shouldn't this come before we use dip->di_aformat in a switch
statement? Hmmm - aren't we missing the same checks for the data
fork?
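
To make the ordering concrete, here is a small sketch that first asks
whether an attr fork exists at all, then rejects unknown formats before
trusting the extent count. The enum and function names are invented for
this example, and the rules are just the ones from the quoted hunk:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the on-disk fork format values. */
enum sketch_fmt {
	SKETCH_FMT_DEV,
	SKETCH_FMT_LOCAL,
	SKETCH_FMT_EXTENTS,
	SKETCH_FMT_BTREE,
	SKETCH_FMT_UUID,
};

/*
 * forkoff == 0 means there is no attr fork, so there can be no attr
 * extents.  Otherwise only local, extents and btree formats are valid,
 * and the extent count has to agree with the format.
 */
static bool
sketch_attr_fork_ok(uint8_t forkoff, uint8_t aformat, uint16_t anextents,
		    size_t fork_recs)
{
	if (forkoff == 0)
		return anextents == 0;

	switch (aformat) {
	case SKETCH_FMT_LOCAL:
		return anextents == 0;
	case SKETCH_FMT_EXTENTS:
		return anextents <= fork_recs;
	case SKETCH_FMT_BTREE:
		return anextents > fork_recs;
	default:
		/* Anything else is not a valid attr fork format. */
		return false;
	}
}
```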

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 14/25] xfs: scrub rmap btrees
  2017-10-05  2:56   ` Dave Chinner
@ 2017-10-05  5:02     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-05  5:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 01:56:20PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:17PM -0700, Darrick J. Wong wrote:
> > +/* Scrub an rmapbt record. */
> > +STATIC int
> > +xfs_scrub_rmapbt_helper(
> > +	struct xfs_scrub_btree		*bs,
> > +	union xfs_btree_rec		*rec)
> > +{
> > +	struct xfs_mount		*mp = bs->cur->bc_mp;
> > +	struct xfs_agf			*agf;
> > +	struct xfs_rmap_irec		irec;
> > +	unsigned long long		rec_end;
> > +	xfs_agblock_t			eoag;
> > +	bool				non_inode;
> > +	bool				is_unwritten;
> > +	bool				is_bmbt;
> > +	bool				is_attr;
> > +	int				error;
> > +
> > +	error = xfs_rmap_btrec_to_irec(rec, &irec);
> > +	if (!xfs_scrub_btree_op_ok(bs->sc, bs->cur, 0, &error))
> > +		goto out;
> 
> This got me again. Again I was thinking that this code threw
> away the error from xfs_rmap_btrec_to_irec(). Could we consider
> renaming "op_ok" to "process_error" or something like that so
> it's clearer that it's doing some kind of checking on the error
> we just got back?

Ok.  That /is/ a better suggestion.

> > +
> > +	/* Check extent. */
> > +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> > +	eoag = be32_to_cpu(agf->agf_length);
> > +	rec_end = (unsigned long long)irec.rm_startblock + irec.rm_blockcount;
> > +
> > +	if (irec.rm_startblock >= mp->m_sb.sb_agblocks ||
> > +	    irec.rm_startblock >= eoag ||
> > +	    irec.rm_blockcount == 0 ||
> > +	    rec_end > mp->m_sb.sb_agblocks ||
> > +	    rec_end > eoag)
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> 
> Broken Record (tm).

Woo.

> > +	/* Check flags. */
> > +	non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner);
> > +	is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK;
> > +	is_attr = irec.rm_flags & XFS_RMAP_ATTR_FORK;
> > +	is_unwritten = irec.rm_flags & XFS_RMAP_UNWRITTEN;
> > +
> > +	if (is_bmbt && irec.rm_offset != 0)
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	if (non_inode && irec.rm_offset != 0)
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	if (is_unwritten && (is_bmbt || non_inode || is_attr))
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	if (non_inode && (is_bmbt || is_unwritten || is_attr))
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	if (!non_inode) {
> > +		xfs_agnumber_t	agno = XFS_INO_TO_AGNO(mp, irec.rm_owner);
> > +		xfs_agino_t	agino = XFS_INO_TO_AGINO(mp, irec.rm_owner);
> > +		xfs_agblock_t	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
> > +
> > +		/* Owner inode within an AG? */
> > +		if (agno >= mp->m_sb.sb_agcount ||
> > +		    agbno >= mp->m_sb.sb_agblocks)
> > +			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +		/* Owner inode within the FS? */
> > +		if (XFS_AGB_TO_DADDR(mp, agno, agbno) >=
> > +		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks))
> > +			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> 
> These two checks probably also should be libxfs functionality.
> We've got similar checks strewn all over the place (e.g. valid_bno()
> in xfs_db, verify_aginum/verify_inum/verify_agbno/verify_dfsbno
> in repair, etc.).
> 
> It would be good to get all these sorts of basic type checks
> centralised and used consistently by all the code....

I just added that (currently code-factoring my way through the agi
scrubbers) so yes this should be easy to do for v12.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 15/25] xfs: scrub refcount btrees
  2017-10-05  2:59   ` Dave Chinner
@ 2017-10-05  5:02     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-05  5:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 01:59:37PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:23PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > +/* Scrub a refcountbt record. */
> > +STATIC int
> > +xfs_scrub_refcountbt_helper(
> > +	struct xfs_scrub_btree		*bs,
> > +	union xfs_btree_rec		*rec)
> > +{
> > +	struct xfs_mount		*mp = bs->cur->bc_mp;
> > +	struct xfs_agf			*agf;
> > +	struct xfs_refcount_irec	irec;
> > +	unsigned long long		rec_end;
> > +	xfs_agblock_t			eoag;
> > +	bool				has_cowflag;
> > +	int				error = 0;
> > +
> > +	irec.rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
> > +	irec.rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
> > +	irec.rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
> > +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> > +	eoag = be32_to_cpu(agf->agf_length);
> > +
> > +	/* Only CoW records can have refcount == 1. */
> > +	has_cowflag = !!(irec.rc_startblock & XFS_REFC_COW_START);
> > +	if ((irec.rc_refcount == 1 && !has_cowflag) ||
> > +	    (irec.rc_refcount != 1 && has_cowflag))
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	/* Check the extent. */
> > +	irec.rc_startblock &= ~XFS_REFC_COW_START;
> > +	rec_end = (unsigned long long)irec.rc_startblock + irec.rc_blockcount;
> > +	if (irec.rc_startblock >= mp->m_sb.sb_agblocks ||
> > +	    irec.rc_startblock >= eoag ||
> > +	    irec.rc_blockcount == 0 ||
> > +	    rec_end > mp->m_sb.sb_agblocks ||
> > +	    rec_end > eoag ||
> 
> For the (broken) record...

Yup.

> > +	    irec.rc_refcount < 1)
> 
> Unsigned variable, so it seems to me that irec.rc_refcount == 0
> would be a more correct check.

Ok.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 16/25] xfs: scrub inodes
  2017-10-05  4:04   ` Dave Chinner
@ 2017-10-05  5:22     ` Darrick J. Wong
  2017-10-05  7:13       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-05  5:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 03:04:52PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:30PM -0700, Darrick J. Wong wrote:
> > @@ -559,3 +563,64 @@ xfs_scrub_setup_ag_btree(
> >  
> >  	return xfs_scrub_ag_init(sc, sc->sm->sm_agno, &sc->sa);
> >  }
> > +
> > +/*
> > + * Given an inode and the scrub control structure, grab either the
> > + * inode referenced in the control structure or the inode passed in.
> > + * The inode is not locked.
> > + */
> > +int
> > +xfs_scrub_get_inode(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip_in)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_inode		*ips = NULL;
> 
> *ip?
> 
> > +	int				error;
> > +
> > +	if (sc->sm->sm_agno || (sc->sm->sm_gen && !sc->sm->sm_ino))
> > +		return -EINVAL;
> 
> What's this checking?

/*
 * Jump out if userspace fed us an AG number or an inode generation
 * without an inode number.  Both indicate that userspace hasn't got a
 * clue.
 */

> > +
> > +	/* We want to scan the inode we already had opened. */
> > +	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) {
> > +		sc->ip = ip_in;
> > +		return 0;
> > +	}
> > +
> > +	/* Look up the inode, see if the generation number matches. */
> > +	if (xfs_internal_inum(mp, sc->sm->sm_ino))
> > +		return -ENOENT;
> 
> maybe xfs_internal_inum should be moved to the same place as all the
> inode/agbno/bno verification functions....

Yes.

> > +	error = xfs_iget(mp, NULL, sc->sm->sm_ino, XFS_IGET_UNTRUSTED,
> > +			0, &ips);
> 
> I think we also want XFS_IGET_DONTCACHE here, so we don't trash the
> inode cache with inodes that we use once for scrub and never touch
> again.

I thought about adding this, but if we let the inodes fall out of the
cache now then we'll just have to load them back in for the bmap checks,
right?

> > +	if (error == -ENOENT || error == -EINVAL) {
> > +		/* inode doesn't exist... */
> > +		return -ENOENT;
> > +	} else if (error) {
> > +		trace_xfs_scrub_op_error(sc,
> > +				XFS_INO_TO_AGNO(mp, sc->sm->sm_ino),
> > +				XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
> > +				error, __return_address);
> > +		return error;
> > +	}
> > +	if (VFS_I(ips)->i_generation != sc->sm->sm_gen) {
> > +		iput(VFS_I(ips));
> > +		return -ENOENT;
> > +	}
> > +
> > +	sc->ip = ips;
> > +	return 0;
> > +}
> > +
> > +/* Push everything out of the log onto disk. */
> > +int
> > +xfs_scrub_checkpoint_log(
> > +	struct xfs_mount	*mp)
> > +{
> > +	int			error;
> > +
> > +	error = _xfs_log_force(mp, XFS_LOG_SYNC, NULL);
> > +	if (error)
> > +		return error;
> > +	xfs_ail_push_all_sync(mp->m_ail);
> > +	return 0;
> > +}
> 
> Oooo, that's a nasty thing to do on busy systems with large dirty
> logs. I hope this is a "last resort" kind of thing....

It is; we only do this if the inobt says there's an inode there and the
inode verifiers fail.

> > +/* Set us up with an inode. */
> 
> What state are we trying to get the inode into here? We grab all the
> various locks, but we can still have data changing via mmap pages
> that are already faulted in and delalloc extents in the incore
> extent list that aren't reflected on disk...
> 
> A comment explaining what we expect here would be nice.

/*
 * Grab total control of the inode metadata.  It doesn't matter here if
 * the file data is still changing, we just want exclusive access to the
 * metadata.
 */

> > +int
> > +xfs_scrub_setup_inode(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error;
> > +
> > +	/*
> > +	 * Try to get the inode.  If the verifiers fail, we try again
> > +	 * in raw mode.
> > +	 */
> > +	error = xfs_scrub_get_inode(sc, ip);
> > +	switch (error) {
> > +	case 0:
> > +		break;
> > +	case -EFSCORRUPTED:
> > +	case -EFSBADCRC:
> > +		return xfs_scrub_checkpoint_log(mp);
> > +	default:
> > +		return error;
> > +	}
> > +
> > +	/* Got the inode, lock it and we're ready to go. */
> > +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > +	xfs_ilock(sc->ip, sc->ilock_flags);
> > +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> > +	if (error)
> > +		goto out_unlock;
> > +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> > +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> 
> Should the inode be joined to the transaction so that cancelling the
> transaction unlocks the inode? Then the need for the ilock_flags
> variable goes away....

This is the confluence of two semi-icky things: first, some of the
scrubbers (particularly the dir and parent pointer scrubbers) will need
to drop the ILOCK for short periods of time; later on, repair will want
to keep the inode locked across all the repair transactions, so it makes
more sense to control the lock and unlock directly.

> > +
> > +	return error;
> > +out_unlock:
> > +	xfs_iunlock(sc->ip, sc->ilock_flags);
> > +	if (sc->ip != ip)
> > +		iput(VFS_I(sc->ip));
> > +	sc->ip = NULL;
> > +	return error;
> > +}
> > +
> > +/* Inode core */
> > +
> > +/* Scrub an inode. */
> > +int
> > +xfs_scrub_inode(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_imap			imap;
> > +	struct xfs_dinode		di;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_buf			*bp = NULL;
> > +	struct xfs_dinode		*dip;
> > +	xfs_ino_t			ino;
> > +	size_t				fork_recs;
> > +	unsigned long long		isize;
> > +	uint64_t			flags2;
> > +	uint32_t			nextents;
> > +	uint32_t			extsize;
> > +	uint32_t			cowextsize;
> > +	uint16_t			flags;
> > +	uint16_t			mode;
> > +	bool				has_shared;
> > +	int				error = 0;
> > +
> > +	/* Did we get the in-core inode, or are we doing this manually? */
> > +	if (sc->ip) {
> > +		ino = sc->ip->i_ino;
> > +		xfs_inode_to_disk(sc->ip, &di, 0);
> > +		dip = &di;
> > +	} else {
> > +		/* Map & read inode. */
> > +		ino = sc->sm->sm_ino;
> > +		error = xfs_imap(mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
> > +		if (error == -EINVAL) {
> > +			/*
> > +			 * Inode could have gotten deleted out from under us;
> > +			 * just forget about it.
> > +			 */
> > +			error = -ENOENT;
> > +			goto out;
> > +		}
> > +		if (!xfs_scrub_op_ok(sc, XFS_INO_TO_AGNO(mp, ino),
> > +				XFS_INO_TO_AGBNO(mp, ino), &error))
> > +			goto out;
> > +
> > +		error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> > +				imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp,
> > +				NULL);
> > +		if (!xfs_scrub_op_ok(sc, XFS_INO_TO_AGNO(mp, ino),
> > +				XFS_INO_TO_AGBNO(mp, ino), &error))
> > +			goto out;
> > +
> > +		/* Is this really an inode? */
> > +		bp->b_ops = &xfs_inode_buf_ops;
> > +		dip = xfs_buf_offset(bp, imap.im_boffset);
> > +		if (!xfs_dinode_verify(mp, ino, dip) ||
> > +		    !xfs_dinode_good_version(mp, dip->di_version)) {
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +			goto out;
> > +		}
> > +
> > +		/* ...and is it the one we asked for? */
> > +		if (be32_to_cpu(dip->di_gen) != sc->sm->sm_gen) {
> > +			error = -ENOENT;
> > +			goto out;
> > +		}
> > +	}
> 
> Can we factor the manual mapping into a separate function? Just
> makes it a bit cleaner and gets rid of a bunch of local variables
> from this function that are just used to map and read the inode.
> 
> And reading on and getting ahead of the code, could we split it
> further into
> 
> 	<setup dip>
> 
> 	xfs_scrub_dinode(sc, ino, dip, bp);
> 
> 	<do live incore inode stuff>

Yes, good plan.

> > +
> > +	flags = be16_to_cpu(dip->di_flags);
> > +	if (dip->di_version >= 3)
> > +		flags2 = be64_to_cpu(dip->di_flags2);
> > +	else
> > +		flags2 = 0;
> > +
> > +	/* di_mode */
> > +	mode = be16_to_cpu(dip->di_mode);
> > +	if (mode & ~(S_IALLUGO | S_IFMT))
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +
> > +	/* v1/v2 fields */
> > +	switch (dip->di_version) {
> > +	case 1:
> > +		if (dip->di_nlink != 0)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +
> > +		if (dip->di_mode == 0 && sc->ip)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +
> > +		if (dip->di_projid_lo != 0 || dip->di_projid_hi != 0)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> 
> We don't really support v1 inode format anymore - we convert it to
> version 2 automatically in xfs_inode_from_disk() so the in-memory
> inode is always v2 or v3, never v1. And when we write it back out,
> we write it as a v2 inode, never as a v1 inode.
> 
> Hence I'm not sure whether we should be worrying about scrubbing
> such inodes - they are going to be in an ever shrinking minority
> of filesystems. At minimum, they should always return "preen".

Ok.  I figured that we might end up at "v1 => preen" but decided to play
this straight until we got to review.

> > +	case 2:
> > +	case 3:
> > +		if (dip->di_onlink != 0)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +
> > +		if (dip->di_mode == 0 && sc->ip)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +
> > +		if (dip->di_projid_hi != 0 &&
> > +		    !xfs_sb_version_hasprojid32bit(&mp->m_sb))
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	default:
> > +		ASSERT(0);
> 
> If we don't understand the version, it's corrupt.

Yikes an assert.  Will replace that stat!

> > +		break;
> > +	}
> > +
> > +	/*
> > +	 * di_uid/di_gid -- -1 isn't invalid, but there's no way that
> > +	 * userspace could have created that.
> > +	 */
> > +	if (dip->di_uid == cpu_to_be32(-1U) ||
> > +	    dip->di_gid == cpu_to_be32(-1U))
> > +		xfs_scrub_ino_set_warning(sc, bp);
> > +
> > +	/* di_format */
> > +	switch (dip->di_format) {
> > +	case XFS_DINODE_FMT_DEV:
> > +		if (!S_ISCHR(mode) && !S_ISBLK(mode) &&
> > +		    !S_ISFIFO(mode) && !S_ISSOCK(mode))
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_LOCAL:
> > +		if (!S_ISDIR(mode) && !S_ISLNK(mode))
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_EXTENTS:
> > +		if (!S_ISREG(mode) && !S_ISDIR(mode) && !S_ISLNK(mode))
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_BTREE:
> > +		if (!S_ISREG(mode) && !S_ISDIR(mode))
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_UUID:
> > +	default:
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	}
> > +
> > +	/* di_size */
> > +	isize = be64_to_cpu(dip->di_size);
> > +	if (isize & (1ULL << 63))
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> 
> Should we be checking against the on disk format size, or the
> mounted filesystem maximum size (i.e. mp->m_super->s_maxbytes)?
> 32 or 64 bit systems are going to have different maximum valid file
> sizes..

It's perfectly valid to 'truncate -s $((2 ** 60)) foofile' so the only
thing we can really check for here is that the upper bit isn't set
(because the VFS does not check, but barfs on, files with that large of
a size).
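
As a minimal userspace sketch of that check (the helper name is
hypothetical, not something in the patch), the test is just whether the
on-disk 64-bit size would go negative when treated as a signed offset:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: returns true if a 64-bit
 * on-disk size has its top bit set, i.e. it would be negative when
 * interpreted as a signed file offset.
 */
bool size_has_sign_bit(uint64_t isize)
{
	return (isize & (1ULL << 63)) != 0;
}
```

A 2^60 byte sparse file passes this test, which is the point of the
truncate example above; only sizes of 2^63 and up are flagged.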

> Directories have a maximum bound size, too - the data space, leaf
> space and freespace space, each of which are 32GB in size, IIRC.
> 
> And symlinks have a different maximum size, too.

Fair enough, I'll expand the i_size checks, though ISTR the verifiers
now check that for us.

> > +	if (!S_ISDIR(mode) && !S_ISREG(mode) && !S_ISLNK(mode) && isize != 0)
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> 
> > +
> > +	/* di_nblocks */
> > +	if (flags2 & XFS_DIFLAG2_REFLINK) {
> > +		; /* nblocks can exceed dblocks */
> > +	} else if (flags & XFS_DIFLAG_REALTIME) {
> > +		if (be64_to_cpu(dip->di_nblocks) >=
> > +		    mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> 
> > That doesn't seem right. The file can be on either the data or the
> rt device, so the maximum file blocks is the size of one device or
> the other, not both combined.

di_nblocks is the sum of (data blocks + bmbt blocks + attr blocks),
right?  So in theory if you had a rt file with 1000 data blocks, 10 bmbt
blocks to map the data blocks, and 100 attr blocks then di_nblocks has
to be 1110.
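
To make the arithmetic concrete, here is a small userspace sketch of the
bound as the patch applies it.  The function and parameter names are
made up for illustration; only the comparisons mirror the quoted hunk:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: mirrors the di_nblocks bound in the quoted hunk.
 * dblocks/rblocks stand in for sb_dblocks/sb_rblocks.
 */
bool nblocks_in_bounds(uint64_t nblocks, uint64_t dblocks, uint64_t rblocks,
		       bool reflink, bool realtime)
{
	if (reflink)
		return true;	/* shared blocks: nblocks can exceed dblocks */
	if (realtime)		/* rt data plus bmbt/attr blocks on the data dev */
		return nblocks < dblocks + rblocks;
	return nblocks < dblocks;
}
```

With a 1000-block rt device and a 500-block data device, the 1110-block
file above is within bounds only when both device sizes are summed.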

> > +	} else {
> > +		if (be64_to_cpu(dip->di_nblocks) >= mp->m_sb.sb_dblocks)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +	}
> > +
> > +	/* di_extsize */
> > +	extsize = be32_to_cpu(dip->di_extsize);
> > +	if (flags & (XFS_DIFLAG_EXTSIZE | XFS_DIFLAG_EXTSZINHERIT)) {
> > +		if (extsize <= 0 || extsize > MAXEXTLEN)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +
> > +		if (!(flags & XFS_DIFLAG_REALTIME) &&
> > +		    extsize > mp->m_sb.sb_agblocks / 2)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +	} else {
> > +		if (extsize != 0)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +	}
> 
> There's more to validating the extentsize hints than this. From
> xfs_ioctl_setattr_check_extsize():
> 
> /*
>  * extent size hint validation is somewhat cumbersome. Rules are:
>  *
>  * 1. extent size hint is only valid for directories and regular files
>  * 2. FS_XFLAG_EXTSIZE is only valid for regular files
>  * 3. FS_XFLAG_EXTSZINHERIT is only valid for directories.
>  * 4. can only be changed on regular files if no extents are allocated
>  * 5. can be changed on directories at any time
>  * 6. extsize hint of 0 turns off hints, clears inode flags.
>  * 7. Extent size must be a multiple of the appropriate block size.
>  * 8. for non-realtime files, the extent size hint must be limited
>  *    to half the AG size to avoid alignment extending the extent beyond the
>  *    limits of the AG.
>  */
> 
> Maybe there's some commonality between these two functions...

Definitely, will refactor both.
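
A rough userspace sketch of what a shared validator might look like,
assuming the hint is expressed in filesystem blocks.  Every name below
is hypothetical, rules 4 and 5 (when the hint may be changed) are left
out because they only apply to the ioctl path, and the rt multiple check
is my reading of rule 7 rather than something stated in this thread:

```c
#include <stdbool.h>
#include <stdint.h>

enum ftype { FT_REG, FT_DIR, FT_OTHER };	/* hypothetical file types */

struct hint_geom {		/* hypothetical geometry summary */
	uint32_t agblocks;	/* blocks per AG */
	uint32_t rextsize;	/* rt extent size, in blocks */
	uint32_t maxextlen;	/* largest extent length, in blocks */
};

/* Returns true if the (flags, extsize) combination looks consistent. */
bool extsize_hint_ok(const struct hint_geom *g, enum ftype type,
		     bool f_extsize, bool f_inherit, bool realtime,
		     uint32_t extsize)
{
	/* Rule 6: a zero hint means hints are off, so no flags either. */
	if (extsize == 0)
		return !f_extsize && !f_inherit;
	/* A nonzero hint with neither flag set is inconsistent. */
	if (!f_extsize && !f_inherit)
		return false;
	/* Rules 1-3: EXTSIZE on regular files, EXTSZINHERIT on dirs. */
	if (f_extsize && type != FT_REG)
		return false;
	if (f_inherit && type != FT_DIR)
		return false;
	if (extsize > g->maxextlen)
		return false;
	/* Rule 7 (assumed reading): rt hints are rt-extent multiples. */
	if (realtime && g->rextsize != 0 && extsize % g->rextsize != 0)
		return false;
	/* Rule 8: non-rt hints are capped at half the AG size. */
	if (!realtime && extsize > g->agblocks / 2)
		return false;
	return true;
}
```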

> > +
> > +	/* di_flags */
> > +	if ((flags & XFS_DIFLAG_IMMUTABLE) && (flags & XFS_DIFLAG_APPEND))
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +	if ((flags & XFS_DIFLAG_FILESTREAM) && (flags & XFS_DIFLAG_REALTIME))
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> 
> What about project id flags?
> 
> Also, there are flags that are allowed only on regular files, and
> there are flags that are only allowed on directories. Those should
> probably also be checked for preening.

<nod> Some of these are checked by the dinode verifier, but there needs
to be a comment (or more comment) explaining that.

> > +
> > +	/* di_nextents */
> > +	nextents = be32_to_cpu(dip->di_nextents);
> > +	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> > +	switch (dip->di_format) {
> > +	case XFS_DINODE_FMT_EXTENTS:
> > +		if (nextents > fork_recs)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_BTREE:
> > +		if (nextents <= fork_recs)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_LOCAL:
> > +	case XFS_DINODE_FMT_DEV:
> > +	case XFS_DINODE_FMT_UUID:
> > +	default:
> > +		if (nextents != 0)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	}
> > +
> > +	/* di_anextents */
> > +	nextents = be16_to_cpu(dip->di_anextents);
> > +	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> > +	switch (dip->di_aformat) {
> > +	case XFS_DINODE_FMT_EXTENTS:
> > +		if (nextents > fork_recs)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_BTREE:
> > +		if (nextents <= fork_recs)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	case XFS_DINODE_FMT_LOCAL:
> > +	case XFS_DINODE_FMT_DEV:
> > +	case XFS_DINODE_FMT_UUID:
> > +	default:
> > +		if (nextents != 0)
> > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +		break;
> > +	}
> 
> Don't we need a check here first to see whether an attribute fork
> exists or not?

Do you mean the xfs_inode_fork, or something else?

XFS_DFORK_ASIZE returns zero if !XFS_DFORK_Q which in turn is based on
di_forkoff so we're really only checking that di_aformat makes sense
given the number of extents and the size of the attr fork area.

We're not actually checking anything in the attr fork; that's a
different scrubber.

> > +
> > +	/* di_forkoff */
> > +	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > +
> > +	/* di_aformat */
> > +	if (dip->di_aformat != XFS_DINODE_FMT_LOCAL &&
> > +	    dip->di_aformat != XFS_DINODE_FMT_EXTENTS &&
> > +	    dip->di_aformat != XFS_DINODE_FMT_BTREE)
> > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> 
> Shouldn't this come before we use dip->di_aformat in a switch
> statement?

Yes.  Good catch.

> Hmmm - aren't we missing the same checks for the data fork?

I believe you'll find them further up in the function.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 13/25] xfs: scrub inode btrees
  2017-10-05  2:08   ` Dave Chinner
@ 2017-10-05  5:47     ` Darrick J. Wong
  2017-10-05  7:22       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-05  5:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 01:08:10PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:11PM -0700, Darrick J. Wong wrote:
> > +/*
> > + * Set us up to scrub inode btrees.
> > + * If we detect a discrepancy between the inobt and the inode,
> > + * try again after forcing logged inode cores out to disk.
> > + */
> > +int
> > +xfs_scrub_setup_ag_iallocbt(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	return xfs_scrub_setup_ag_btree(sc, ip, sc->try_harder);
> > +}
> > +
> > +/* Inode btree scrubber. */
> > +
> > +/* Is this chunk worth checking? */
> > +STATIC bool
> > +xfs_scrub_iallocbt_chunk(
> > +	struct xfs_scrub_btree		*bs,
> > +	struct xfs_inobt_rec_incore	*irec,
> > +	xfs_agino_t			agino,
> > +	xfs_extlen_t			len)
> > +{
> > +	struct xfs_mount		*mp = bs->cur->bc_mp;
> > +	struct xfs_agf			*agf;
> > +	unsigned long long		rec_end;
> > +	xfs_agblock_t			eoag;
> > +	xfs_agblock_t			bno;
> > +
> > +	agf = XFS_BUF_TO_AGF(bs->sc->sa.agf_bp);
> > +	eoag = be32_to_cpu(agf->agf_length);

<nod>

> Probably should use the AGI for this.
> 
> > +	bno = XFS_AGINO_TO_AGBNO(mp, agino);
> > +	rec_end = (unsigned long long)bno + len;
> > +
> > +	if (bno >= mp->m_sb.sb_agblocks || bno >= eoag ||
> > +	    rec_end > mp->m_sb.sb_agblocks || rec_end > eoag) {
> 
> Same comment as last patch.

Yup yup.

> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> 
> I note there is no check on the length passed in for the inode
> chunk - should that be verified?

len is computed from constants, so in theory it should be safe since
if the geometry is all messed up then we likely wouldn't have been
able to mount.

OTOH it's pretty cheap to put in, so sure.

> > +
> > +/* Count the number of free inodes. */
> > +static unsigned int
> > +xfs_scrub_iallocbt_freecount(
> > +	xfs_inofree_t			freemask)
> > +{
> > +	int				bits = XFS_INODES_PER_CHUNK;
> > +	unsigned int			ret = 0;
> > +
> > +	while (bits--) {
> > +		if (freemask & 1)
> > +			ret++;
> > +		freemask >>= 1;
> > +	}
> 
> 
> Seems a little cumbersome. Perhaps a loop using xfs_next_bit()
> might be a bit faster, something like:
> 
> 	nextbit = xfs_next_bit(&freemask, 1, 0); 
> 	while (nextbit != -1) {
> 		ret++;
> 		nextbit = xfs_next_bit(&freemask, 1, nextbit + 1);
> 	}

<nod>  A pity there's no popcnt()...

> > +/* Check a particular inode with ir_free. */
> > +STATIC int
> > +xfs_scrub_iallocbt_check_cluster_freemask(
> > +	struct xfs_scrub_btree		*bs,
> > +	xfs_ino_t			fsino,
> > +	xfs_agino_t			chunkino,
> > +	xfs_agino_t			clusterino,
> > +	struct xfs_inobt_rec_incore	*irec,
> > +	struct xfs_buf			*bp)
> > +{
> > +	struct xfs_dinode		*dip;
> > +	struct xfs_mount		*mp = bs->cur->bc_mp;
> > +	bool				freemask_ok;
> > +	bool				inuse;
> > +	int				error;
> > +
> > +	dip = xfs_buf_offset(bp, clusterino * mp->m_sb.sb_inodesize);
> > +	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC ||
> > +	    (dip->di_version >= 3 &&
> > +	     be64_to_cpu(dip->di_ino) != fsino + clusterino)) {
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +		goto out;
> > +	}
> > +
> > +	freemask_ok = !!(irec->ir_free & XFS_INOBT_MASK(chunkino + clusterino));
> 
> No need for !!(...) for a bool type - the compiler will squash it
> down to 0/1 automatically.

<nod>

> > +	error = xfs_icache_inode_is_allocated(mp, bs->cur->bc_tp,
> > +			fsino + clusterino, &inuse);
> > +	if (error == -ENODATA) {
> > +		/* Not cached, just read the disk buffer */
> 
> I think that is wrong. xfs_icache_inode_is_allocated() returns
> -ENOENT if the inode is not in cache....

I changed it to ENODATA so that we can tell the difference between
inode not in cache (ENODATA) and inode racing with unlink (ENOENT).

(Patch was sent to the ML a while ago and I omitted it from this posting...)

> > +		freemask_ok ^= !!(dip->di_mode);
> > +		if (!bs->sc->try_harder && !freemask_ok)
> > +			return -EDEADLOCK;
> > +	} else if (error < 0) {
> > +		/* Inode is only half assembled, don't bother. */
> > +		freemask_ok = true;
> 
> Or we had an IO error looking it up. i.e. -EAGAIN is the "half
> assembled" state (i.e. in the XFS_INEW state) or the half
> *disassembled* state (i.e. XFS_IRECLAIMABLE), anything
> else is an error...

<nod>

> > +	} else {
> > +		/* Inode is all there. */
> > +		freemask_ok ^= inuse;
> 
> So inuse is returned from a mode check after iget succeeds. The mode
> isn't zeroed until  /after/ XFS_IRECLAIMABLE is set, but it's also
> set before XFS_INEW is cleared.  IOWs, how can
> xfs_icache_inode_is_allocated() report anything
> other than inuse == true here? If that's the case, what's the point
> of the mode check inside xfs_icache_inode_is_allocated()?

I think you're right about this, I'll have a look in the morning.

> > +	}
> > +	if (!freemask_ok)
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +out:
> > +	return 0;
> > +}
> > +
> > +/* Make sure the free mask is consistent with what the inodes think. */
> > +STATIC int
> > +xfs_scrub_iallocbt_check_freemask(
> > +	struct xfs_scrub_btree		*bs,
> > +	struct xfs_inobt_rec_incore	*irec)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct xfs_imap			imap;
> > +	struct xfs_mount		*mp = bs->cur->bc_mp;
> > +	struct xfs_dinode		*dip;
> > +	struct xfs_buf			*bp;
> > +	xfs_ino_t			fsino;
> > +	xfs_agino_t			nr_inodes;
> > +	xfs_agino_t			agino;
> > +	xfs_agino_t			chunkino;
> > +	xfs_agino_t			clusterino;
> > +	xfs_agblock_t			agbno;
> > +	int				blks_per_cluster;
> > +	uint16_t			holemask;
> > +	uint16_t			ir_holemask;
> > +	int				error = 0;
> > +
> > +	/* Make sure the freemask matches the inode records. */
> > +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> > +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> 
> Does this setup and loop work for the case where we have 64k
> filesystem blocks and so two or more inode chunks per filesystem
> block (i.e. ppc64)? 

I think the answer is yes, at worst case we end up processing a block's
worth of inodes at a time.  The last time I ran scrub on ppc64 (last
week) it worked fine.

> > +/* Scrub an inobt/finobt record. */
> > +STATIC int
> > +xfs_scrub_iallocbt_helper(
> > +	struct xfs_scrub_btree		*bs,
> > +	union xfs_btree_rec		*rec)
> > +{
> > +	struct xfs_mount		*mp = bs->cur->bc_mp;
> > +	struct xfs_agi			*agi;
> > +	struct xfs_inobt_rec_incore	irec;
> > +	uint64_t			holes;
> > +	xfs_agino_t			agino;
> > +	xfs_agblock_t			agbno;
> > +	xfs_extlen_t			len;
> > +	int				holecount;
> > +	int				i;
> > +	int				error = 0;
> > +	unsigned int			real_freecount;
> > +	uint16_t			holemask;
> > +
> > +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> > +
> > +	if (irec.ir_count > XFS_INODES_PER_CHUNK ||
> > +	    irec.ir_freecount > XFS_INODES_PER_CHUNK)
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	real_freecount = irec.ir_freecount +
> > +			(XFS_INODES_PER_CHUNK - irec.ir_count);
> > +	if (real_freecount != xfs_scrub_iallocbt_freecount(irec.ir_free))
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	agi = XFS_BUF_TO_AGI(bs->sc->sa.agi_bp);
> > +	agino = irec.ir_startino;
> > +	agbno = XFS_AGINO_TO_AGBNO(mp, irec.ir_startino);
> > +	if (agbno >= be32_to_cpu(agi->agi_length)) {
> 
> Validate as per every other agbno?

<nod>

> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +		goto out;
> > +	}
> > +
> > +	if ((agbno & (xfs_ialloc_cluster_alignment(mp) - 1)) ||
> > +	    (agbno & (xfs_icluster_size_fsb(mp) - 1)))
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> 
> What's the magic masking checks being done here? (comment?)

/* Make sure this record is aligned to cluster and inoalignmnt size. */
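
For anyone reading along, the masking trick only works when the
alignment is a power of two.  A tiny standalone sketch (the helper name
is made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative helper: true if bno is a multiple of align.
 * Only valid when align is a nonzero power of two, because then
 * (align - 1) is a mask of exactly the low bits that must be zero.
 */
bool is_aligned_pow2(uint32_t bno, uint32_t align)
{
	return (bno & (align - 1)) == 0;
}
```

The two tests in the hunk apply this once with the inode alignment and
once with the cluster size in blocks.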

> > +	/* Handle non-sparse inodes */
> > +	if (!xfs_inobt_issparse(irec.ir_holemask)) {
> > +		len = XFS_B_TO_FSB(mp,
> > +				XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize);
> > +		if (irec.ir_count != XFS_INODES_PER_CHUNK)
> > +			xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +		if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
> > +			goto out;
> > +		goto check_freemask;
> > +	}
> > +
> > +	/* Check each chunk of a sparse inode cluster. */
> > +	holemask = irec.ir_holemask;
> > +	holecount = 0;
> > +	len = XFS_B_TO_FSB(mp,
> > +			XFS_INODES_PER_HOLEMASK_BIT * mp->m_sb.sb_inodesize);
> > +	holes = ~xfs_inobt_irec_to_allocmask(&irec);
> > +	if ((holes & irec.ir_free) != holes ||
> > +	    irec.ir_freecount > irec.ir_count)
> > +		xfs_scrub_btree_set_corrupt(bs->sc, bs->cur, 0);
> > +
> > +	for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1,
> > +			i++, agino += XFS_INODES_PER_HOLEMASK_BIT) {
> 
> Urk. That's a bit hard to read.
> 
> > +		if (holemask & 1) {
> > +			holecount += XFS_INODES_PER_HOLEMASK_BIT;
> > +			continue;
> > +		}
> > +
> > +		if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
> > +			break;
> > +	}
> 
> How about
> 
> 	for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; i++) {
> 		if (holemask & 1) {
> 			holecount += XFS_INODES_PER_HOLEMASK_BIT;
> 		} else if (!xfs_scrub_iallocbt_chunk(bs, &irec, agino, len))
> 			break;
> 
> 		holemask >>= 1;
> 		agino += XFS_INODES_PER_HOLEMASK_BIT;
> 	}

Looks better than mine. :)
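
As a sanity check that the two loop shapes walk the holemask the same
way, here is a userspace sketch with stand-in constants and made-up
function names.  The chunk check is replaced by summing the aginos that
would have been checked, and the early break on a failed check is
omitted:

```c
#include <stdint.h>

#define HOLEMASK_BITS	16	/* stand-in for XFS_INOBT_HOLEMASK_BITS */
#define INODES_PER_BIT	4	/* stand-in for XFS_INODES_PER_HOLEMASK_BIT */

/* Original shape: all three loop variables advance in the for header. */
int holes_v1(uint16_t holemask, uint32_t agino, uint64_t *agino_sum)
{
	int holecount = 0;
	int i;

	*agino_sum = 0;
	for (i = 0; i < HOLEMASK_BITS; holemask >>= 1,
			i++, agino += INODES_PER_BIT) {
		if (holemask & 1) {
			holecount += INODES_PER_BIT;
			continue;
		}
		*agino_sum += agino;	/* stands in for checking the chunk */
	}
	return holecount;
}

/* Suggested shape: advance the mask and agino at the bottom of the body. */
int holes_v2(uint16_t holemask, uint32_t agino, uint64_t *agino_sum)
{
	int holecount = 0;
	int i;

	*agino_sum = 0;
	for (i = 0; i < HOLEMASK_BITS; i++) {
		if (holemask & 1)
			holecount += INODES_PER_BIT;
		else
			*agino_sum += agino;	/* stands in for the check */

		holemask >>= 1;
		agino += INODES_PER_BIT;
	}
	return holecount;
}
```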

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 16/25] xfs: scrub inodes
  2017-10-05  5:22     ` Darrick J. Wong
@ 2017-10-05  7:13       ` Dave Chinner
  2017-10-05 19:56         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-05  7:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 10:22:19PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 05, 2017 at 03:04:52PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:42:30PM -0700, Darrick J. Wong wrote:
> > > +	error = xfs_iget(mp, NULL, sc->sm->sm_ino, XFS_IGET_UNTRUSTED,
> > > +			0, &ips);
> > 
> > I think we also want XFS_IGET_DONTCACHE here, so we don't trash the
> > inode cache with inodes that we use once for scrub and never touch
> > again.
> 
> I thought about adding this, but if we let the inodes fall out of the
> cache now then we'll just have to load them back in for the bmap checks,
> right?

Well, I'm looking at ensuring that we don't blow out the memory
side of things. We've still got the inode buffer in the buffer
cache, so I don't see why we should double cache these things
and then leave both cached copies hanging around after we've
finished with them. Leave the buffer around because we do a fair few
checks with it, but don't use excessive icache memory and trash the
working set if we can avoid it...

> > > +xfs_scrub_checkpoint_log(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	int			error;
> > > +
> > > +	error = _xfs_log_force(mp, XFS_LOG_SYNC, NULL);
> > > +	if (error)
> > > +		return error;
> > > +	xfs_ail_push_all_sync(mp->m_ail);
> > > +	return 0;
> > > +}
> > 
> > Oooo, that's a nasty thing to do on busy systems with large dirty
> > logs. I hope this is a "last resort" kind of thing....
> 
> It is; we only do this if the inobt says there's an inode there and the
> inode verifiers fail.

Ok, so why would pushing the log and the AIL make the verifier then
succeed? How likely is this to occur on a busy system?

> > > +/* Set us up with an inode. */
> > 
> > What state are we trying to get the inode into here? We grab all the
> > various locks, but we can still have data changing via mmap pages
> > that are already faulted in and delalloc extents in the incore
> > extent list that aren't reflected on disk...
> > 
> > A comment explaining what we expect here would be nice.
> 
> /* 
>  * Grab total control of the inode metadata.  It doesn't matter here if
>  * the file data is still changing, we just want exclusive access to the
>  * metadata.
>  */

*nod*

> > > +	/* Got the inode, lock it and we're ready to go. */
> > > +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > +	xfs_ilock(sc->ip, sc->ilock_flags);
> > > +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> > > +	if (error)
> > > +		goto out_unlock;
> > > +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> > > +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> > 
> > Should the inode be joined to the transaction so that cancelling the
> > transaction unlocks the inode? Then the need for the ilock_flags
> > variable goes away....
> 
> This is the confluence of two semi-icky things: first, some of the
> scrubbers (particularly the dir and parent pointer scrubbers) will need
> to drop the ILOCK for short periods of time; later on, repair will want
> to keep the inode locked across all the repair transactions, so it makes
> more sense to control the lock and unlock directly.

Ok, I'll pass on this for now, see how the rest of the code falls
out.

> > > +	/* di_size */
> > > +	isize = be64_to_cpu(dip->di_size);
> > > +	if (isize & (1ULL << 63))
> > > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > 
> > Should we be checking against the on disk format size, or the
> > mounted filesystem maximum size (i.e. mp->m_super->s_maxbytes)?
> > 32 or 64 bit systems are going to have different maximum valid file
> > sizes..
> 
> > It's perfectly valid to 'truncate -s $((2 ** 60)) foofile' so the only

Ugh. We can't do IO past 16TB on 32 bit systems, so I'm kinda
surprised truncate doesn't have the same s_maxbytes restriction...

> thing we can really check for here is that the upper bit isn't set
> (because the VFS does not check, but barfs on, files with that large of
> a size).

xfs_max_file_offset() sets the max file offset to 2^63 - 1, so it
looks like the lack of checking in truncate is the problem here,
not the IO path.

> > Directories have a maximum bound size, too - the data space, leaf
> > space and freespace space, each of which are 32GB in size, IIRC.
> > 
> > And symlinks have a different maximum size, too.
> 
> Fair enough, I'll expand the i_size checks, though ISTR the verifiers
> now check that for us.

If they do, then just drop a comment in there to say what is checked
by the verifier.

> > > +	if (!S_ISDIR(mode) && !S_ISREG(mode) && !S_ISLNK(mode) && isize != 0)
> > > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > 
> > > +
> > > +	/* di_nblocks */
> > > +	if (flags2 & XFS_DIFLAG2_REFLINK) {
> > > +		; /* nblocks can exceed dblocks */
> > > +	} else if (flags & XFS_DIFLAG_REALTIME) {
> > > +		if (be64_to_cpu(dip->di_nblocks) >=
> > > +		    mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks)
> > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > 
> > > That doesn't seem right. The file can be on either the data or the
> > rt device, so the maximum file blocks is the size of one device or
> > the other, not both combined.
> 
> di_nblocks is the sum of (data blocks + bmbt blocks + attr blocks),
> right?

Yeah, forgot it was more than just data extents.

> So in theory if you had a rt file with 1000 data blocks, 10 bmbt
> blocks to map the data blocks, and 100 attr blocks then di_nblocks has
> to be 1110.

Yup, but the additional metadata on the data device is not going to
be anywhere near the size of the data device.

/me shrugs

I can't think of an easy way to get a maximum block count, so I
guess that'll have to do...

> > > +		if (nextents > fork_recs)
> > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > +		break;
> > > +	case XFS_DINODE_FMT_BTREE:
> > > +		if (nextents <= fork_recs)
> > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > +		break;
> > > +	case XFS_DINODE_FMT_LOCAL:
> > > +	case XFS_DINODE_FMT_DEV:
> > > +	case XFS_DINODE_FMT_UUID:
> > > +	default:
> > > +		if (nextents != 0)
> > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > +		break;
> > > +	}
> > > +
> > > +	/* di_anextents */
> > > +	nextents = be16_to_cpu(dip->di_anextents);
> > > +	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> > > +	switch (dip->di_aformat) {
> > > +	case XFS_DINODE_FMT_EXTENTS:
> > > +		if (nextents > fork_recs)
> > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > +		break;
> > > +	case XFS_DINODE_FMT_BTREE:
> > > +		if (nextents <= fork_recs)
> > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > +		break;
> > > +	case XFS_DINODE_FMT_LOCAL:
> > > +	case XFS_DINODE_FMT_DEV:
> > > +	case XFS_DINODE_FMT_UUID:
> > > +	default:
> > > +		if (nextents != 0)
> > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > +		break;
> > > +	}
> > 
> > Don't we need a check here first to see whether an attribute fork
> > exists or not?
> 
> Do you mean the xfs_inode_fork, or something else?

Something else. :P

> XFS_DFORK_ASIZE returns zero if !XFS_DFORK_Q which in turn is based on
> di_forkoff so we're really only checking that di_aformat makes sense
> given the number of extents and the size of the attr fork area.

Right, but if XFS_DFORK_ASIZE == 0, the dip->di_aformat *must* be
XFS_DINODE_FMT_EXTENTS. That's the only valid configuration when
there is no attribute fork present.

If there is an attribute fork present, then it can be XFS_DINODE_FMT_LOCAL,
EXTENT or BTREE, and then the extent count needs checking.
XFS_DINODE_FMT_DEV and XFS_DINODE_FMT_UUID are both invalid for the
attribute fork.
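
Roughly, as a userspace sketch.  The enum below mirrors the
XFS_DINODE_FMT_* values but is redeclared here so the snippet stands
alone, the function name is hypothetical, and the "no fork means zero
extents" line is my assumption from the di_forkoff check in the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for XFS_DINODE_FMT_*; redeclared for this sketch. */
enum { FMT_DEV, FMT_LOCAL, FMT_EXTENTS, FMT_BTREE, FMT_UUID };

#define BMBT_REC_SIZE	16	/* two 64-bit words per extent record */

/* asize is the attr fork area in bytes (0 means no attr fork). */
bool attr_fork_ok(uint32_t asize, int aformat, uint32_t anextents)
{
	uint32_t fork_recs = asize / BMBT_REC_SIZE;

	/* No attr fork: only EXTENTS is valid, and (assumed) no extents. */
	if (asize == 0)
		return aformat == FMT_EXTENTS && anextents == 0;

	switch (aformat) {
	case FMT_LOCAL:
		return anextents == 0;
	case FMT_EXTENTS:
		return anextents <= fork_recs;
	case FMT_BTREE:
		return anextents > fork_recs;
	default:		/* DEV and UUID are never valid here */
		return false;
	}
}
```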

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 13/25] xfs: scrub inode btrees
  2017-10-05  5:47     ` Darrick J. Wong
@ 2017-10-05  7:22       ` Dave Chinner
  2017-10-05 18:26         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-05  7:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Oct 04, 2017 at 10:47:14PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 05, 2017 at 01:08:10PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:42:11PM -0700, Darrick J. Wong wrote:
> > > +/* Count the number of free inodes. */
> > > +static unsigned int
> > > +xfs_scrub_iallocbt_freecount(
> > > +	xfs_inofree_t			freemask)
> > > +{
> > > +	int				bits = XFS_INODES_PER_CHUNK;
> > > +	unsigned int			ret = 0;
> > > +
> > > +	while (bits--) {
> > > +		if (freemask & 1)
> > > +			ret++;
> > > +		freemask >>= 1;
> > > +	}
> > 
> > 
> > Seems a little cumbersome. Perhaps a loop using xfs_next_bit()
> > might be a bit faster, something like:
> > 
> > 	nextbit = xfs_next_bit(&freemask, 1, 0); 
> > 	while (nextbit != -1) {
> > 		ret++;
> > 		nextbit = xfs_next_bit(&freemask, 1, nextbit + 1);
> > 	}
> 
> <nod>  A pity there's no popcnt()...

Oh, hweight64().
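
For illustration, a userspace comparison of the open-coded loop with a
single population count.  hweight64() is kernel-only, so this sketch
assumes GCC or Clang and uses __builtin_popcountll() as a stand-in; the
function names are made up:

```c
#include <stdint.h>

/* Open-coded count, same shape as the loop in the patch (64 bits). */
unsigned int freecount_loop(uint64_t freemask)
{
	int bits = 64;
	unsigned int ret = 0;

	while (bits--) {
		if (freemask & 1)
			ret++;
		freemask >>= 1;
	}
	return ret;
}

/* One-call version; the compiler builtin stands in for hweight64(). */
unsigned int freecount_popcount(uint64_t freemask)
{
	return (unsigned int)__builtin_popcountll(freemask);
}
```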

> > > +	error = xfs_icache_inode_is_allocated(mp, bs->cur->bc_tp,
> > > +			fsino + clusterino, &inuse);
> > > +	if (error == -ENODATA) {
> > > +		/* Not cached, just read the disk buffer */
> > 
> > I think that is wrong. xfs_icache_inode_is_allocated() returns
> > -ENOENT if the inode is not in cache....
> 
> I changed it to ENODATA so that we can tell the difference between
> inode not in cache (ENODATA) and inode racing with unlink (ENOENT).
> 
> (Patch was sent to the ML a while ago and I omitted it from this posting...)

Ah, if it's not committed upstream yet, include it in the next
posting, please :)

> > > +	/* Make sure the freemask matches the inode records. */
> > > +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> > > +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> > 
> > Does this setup and loop work for the case where we have 64k
> > filesystem blocks and so two or more inode chunks per filesystem
> > block (i.e. ppc64)? 
> 
> I think the answer is yes, at worst case we end up processing a block's
> worth of inodes at a time.  The last time I ran scrub on ppc64 (last
> week) it worked fine.

Hmmm - there's nothing to count how many inodes are scrubbed, is
there? Perhaps it would be good to count as we go so we know if
we've scrubbed all the inodes?

Hmmm - I might have missed it, but is there anywhere in this code
where we check the inode number in the inode that we have read
actually matches the agino we are attempting to validate?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 13/25] xfs: scrub inode btrees
  2017-10-05  7:22       ` Dave Chinner
@ 2017-10-05 18:26         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-05 18:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 06:22:45PM +1100, Dave Chinner wrote:
> On Wed, Oct 04, 2017 at 10:47:14PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 05, 2017 at 01:08:10PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:42:11PM -0700, Darrick J. Wong wrote:
> > > > +/* Count the number of free inodes. */
> > > > +static unsigned int
> > > > +xfs_scrub_iallocbt_freecount(
> > > > +	xfs_inofree_t			freemask)
> > > > +{
> > > > +	int				bits = XFS_INODES_PER_CHUNK;
> > > > +	unsigned int			ret = 0;
> > > > +
> > > > +	while (bits--) {
> > > > +		if (freemask & 1)
> > > > +			ret++;
> > > > +		freemask >>= 1;
> > > > +	}
> > > 
> > > 
> > > Seems a little cumbersome. Perhaps a loop using xfs_next_bit()
> > > might be a bit faster, something like:
> > > 
> > > 	nextbit = xfs_next_bit(&freemask, 1, 0); 
> > > 	while (nextbit != -1) {
> > > 		ret++;
> > > 		nextbit = xfs_next_bit(&freemask, 1, nextbit + 1);
> > > 	}
> > 
> > <nod>  A pity there's no popcnt()...
> 
> Oh, hweight64().
> 
> > > > +	error = xfs_icache_inode_is_allocated(mp, bs->cur->bc_tp,
> > > > +			fsino + clusterino, &inuse);
> > > > +	if (error == -ENODATA) {
> > > > +		/* Not cached, just read the disk buffer */
> > > 
> > > I think that is wrong. xfs_icache_inode_is_allocated() returns
> > > -ENOENT if the inode is not in cache....
> > 
> > I changed it to ENODATA so that we can tell the difference between
> > inode not in cache (ENODATA) and inode racing with unlink (ENOENT).
> > 
> > (Patch was sent to the ML a while ago and I omitted it from this posting...)
> 
> Ah, if it's not committed upstream yet, include it in the next
> posting, please :)
> 
> > > > +	/* Make sure the freemask matches the inode records. */
> > > > +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> > > > +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> > > 
> > > Does this setup and loop work for the case where we have 64k
> > > filesystem blocks and so two or more inode chunks per filesystem
> > > block (i.e. ppc64)? 
> > 
> > I think the answer is yes, at worst case we end up processing a block's
> > worth of inodes at a time.  The last time I ran scrub on ppc64 (last
> > week) it worked fine.
> 
> Hmmm - there's nothing to count how many inodes are scrubbed, is
> there? Perhaps it would be good to count as we go so we know if
> we've scrubbed all the inodes?

Userspace will count the number of inodes scrubbed (since it's
initiating all the scrub requests) and check it against statfs.

> Hmmm - I might have missed it, but is there anywhere in this code
> where we check the inode number in the inode that we have read
> actually matches the agino we are attempting to validate?

The dinode verifier checks di_ino, so if they don't match then xfs_iread
-> xfs_iget return -EFSCORRUPTED.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 16/25] xfs: scrub inodes
  2017-10-05  7:13       ` Dave Chinner
@ 2017-10-05 19:56         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-05 19:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 05, 2017 at 06:13:39PM +1100, Dave Chinner wrote:
> On Wed, Oct 04, 2017 at 10:22:19PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 05, 2017 at 03:04:52PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:42:30PM -0700, Darrick J. Wong wrote:
> > > > +	error = xfs_iget(mp, NULL, sc->sm->sm_ino, XFS_IGET_UNTRUSTED,
> > > > +			0, &ips);
> > > 
> > > I think we also want XFS_IGET_DONTCACHE here, so we don't trash the
> > > inode cache with inodes that we use once for scrub and never touch
> > > again.
> > 
> > I thought about adding this, but if we let the inodes fall out of the
> > cache now then we'll just have to load them back in for the bmap checks,
> > right?
> 
> Well, I'm looking at ensuring that we don't blow out the memory
> side of things. We've still got the inode buffer in the buffer
> cache, so I don't see why we should double cache these things
> and then leave both cached copies hanging around after we've
> finished with them. Leave the buffer around because we do a fair few
> checks with it, but don't use excessive icache memory and trash the
> working set if we can avoid it...

Sure, I'll give it a try and see what the performance impacts are.
(I /have/ been looking into OOM problems on artificially memory-constrained VMs.)

> > > > +xfs_scrub_checkpoint_log(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	int			error;
> > > > +
> > > > +	error = _xfs_log_force(mp, XFS_LOG_SYNC, NULL);
> > > > +	if (error)
> > > > +		return error;
> > > > +	xfs_ail_push_all_sync(mp->m_ail);
> > > > +	return 0;
> > > > +}
> > > 
> > > Oooo, that's a nasty thing to do on busy systems with large dirty
> > > logs. I hope this is a "last resort" kind of thing....
> > 
> > It is; we only do this if the inobt says there's an inode there and the
> > inode verifiers fail.
> 
> Ok, so why would pushing the log and the AIL make the verifier then
> succeed? How likely is this to occur on a busy system?

Hmm, looking through my notes the original reason for shoving on the log
is to force any dirty inode items to get checkpointed back to disk so
that we could read the raw buffers.

I think there's still a desire to be able to checkpoint the log in order
to resolve discrepancies via raw disk buffer reads ... but at this point
the only code that does this is:

(a) the code that checks inobt freemask against the on-disk inodes if
the inode isn't in the cache (we can't rely on the inobt to look up
uncached inodes in order to check inobt fields)

(b) AGF/AGI repair will want to ->verify_read the alleged btree root
blocks.  If the root blocks are new and have never been checkpointed,
they'll have a crc/lsn of zero, which causes verifier error.

(Granted this should never happen except when we're artificially forcing
repair to rebuild otherwise ok metadata as part of testing.)

(c) bnobt repair will want to be able to flush the log to empty out the
busy extent list because it works by freeing all the blocks not listed
by the rmap and the busy extent list can't handle overlapping entries.

Though now that I look at it, this function belongs in the ialloc patch,
and the if (force_log) xfs_scrub_checkpoint_log() call has somehow
migrated out to one of the repair patches.  Ugh.

> > > > +/* Set us up with an inode. */
> > > 
> > > What state are we trying to get the inode into here? We grab all the
> > > various locks, but we can still have data changing via mmap pages
> > > that are already faulted in and delalloc extents in the incore
> > > extent list that aren't reflected on disk...
> > > 
> > > A comment explaining what we expect here would be nice.
> > 
> > /* 
> >  * Grab total control of the inode metadata.  It doesn't matter here if
> >  * the file data is still changing, we just want exclusive access to the
> >  * metadata.
> >  */
> 
> *nod*
> 
> > > > +	/* Got the inode, lock it and we're ready to go. */
> > > > +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > > +	xfs_ilock(sc->ip, sc->ilock_flags);
> > > > +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> > > > +	if (error)
> > > > +		goto out_unlock;
> > > > +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> > > > +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> > > 
> > > Should the inode be joined to the transaction so that cancelling the
> > > transaction unlocks the inode? Then the need for the ilock_flags
> > > variable goes away....
> > 
> > This is the confluence of two semi-icky things: first, some of the
> > scrubbers (particularly the dir and parent pointer scrubbers) will need
> > to drop the ILOCK for short periods of time; later on, repair will want
> > to keep the inode locked across all the repair transactions, so it makes
> > more sense to control the lock and unlock directly.
> 
> Ok, I'll pass on this for now, see how the rest of the code falls
> out.
> 
> > > > +	/* di_size */
> > > > +	isize = be64_to_cpu(dip->di_size);
> > > > +	if (isize & (1ULL << 63))
> > > > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > 
> > > Should we be checking against the on disk format size, or the
> > > mounted filesystem maximum size (i.e. mp->m_super->s_maxbytes)?
> > > 32 or 64 bit systems are going to have different maximum valid file
> > > sizes..
> > 
> > It's perfectly valid to 'truncate -s $((2 ** 60)) foofile' so the only
> 
> Ugh. We can't do IO past 16TB on 32 bit systems, so I'm kinda
> surprised truncate doesn't have the same s_maxbytes restriction...
>
> > thing we can really check for here is that the upper bit isn't set
> > (because the VFS does not check, but barfs on, files with that large of
> > a size).
> 
> xfs_max_file_offset() sets the max file offset to 2^63 - 1, so it
> looks like the lack of checking in truncate is the problem here,
> not the IO path.

Imagine that you format an xfs on a x64 system, create a 20TB file, and
scrub/repair say it's ok.  Then you mount the same fs on your 2014-era
smartphone^W^Warm32 system.  Yes, the smartphone can't handle such a big
file, but is this fs corruption?  I don't think this qualifies for
OFLAG_CORRUPT.

Come to think of it, this is one of the scenarios for which
OFLAG_WARNING was intended -- not technically a violation of the disk
spec, but needs review anyway.

Ok, I've convinced myself. :)

	/*
	 * Warn if the running kernel can't handle the kinds of offsets
	 * needed to deal with the file size.  In other words, if the
	 * pagecache can't cache all the blocks in this file due to
	 * overly large offsets, flag the inode for admin review.
	 */
	if (isize >= mp->m_super->s_maxbytes)
		xfs_scrub_ino_set_warning(sc, bp);
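
To make the corrupt-vs-warning split concrete, here is a minimal standalone
sketch of the decision described above.  The enum and the function name are
hypothetical (the real code sets OFLAG bits in the scrub context), and
max_bytes merely stands in for mp->m_super->s_maxbytes of the running kernel:

```c
#include <stdint.h>

/* Hypothetical verdicts; the real scrub code sets OFLAG_* bits instead. */
enum isize_verdict { ISIZE_OK, ISIZE_WARNING, ISIZE_CORRUPT };

/*
 * Classify an on-disk file size.  max_bytes is an assumption standing in
 * for the s_maxbytes limit of whichever kernel has the fs mounted.
 */
static enum isize_verdict
classify_isize(uint64_t isize, uint64_t max_bytes)
{
	/* The top bit can never be set in a valid di_size. */
	if (isize & (1ULL << 63))
		return ISIZE_CORRUPT;

	/* Valid on disk, but this kernel cannot address all of the file. */
	if (isize >= max_bytes)
		return ISIZE_WARNING;

	return ISIZE_OK;
}
```

With this shape, a 20TB file is fine on a 64-bit kernel and only earns a
warning on a kernel whose limit is 16TB; neither case is reported as
corruption.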

> > > Directories have a maximum bound size, too - the data space, leaf
> > > space and freespace space, each of which are 32GB in size, IIRC.
> > > 
> > > And symlinks have a different maximum size, too.
> > 
> > Fair enough, I'll expand the i_size checks, though ISTR the verifiers
> > now check that for us.
> 
> If they do, then just drop a comment in there to say what is checked
> by the verifier.

Hmm, the dinode verifier checks for the upper bit and for zero-length
symlinks and directories.

Will add the symlink/directory isize checks.

> > > > +	if (!S_ISDIR(mode) && !S_ISREG(mode) && !S_ISLNK(mode) && isize != 0)
> > > > +		xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > 
> > > > +
> > > > +	/* di_nblocks */
> > > > +	if (flags2 & XFS_DIFLAG2_REFLINK) {
> > > > +		; /* nblocks can exceed dblocks */
> > > > +	} else if (flags & XFS_DIFLAG_REALTIME) {
> > > > +		if (be64_to_cpu(dip->di_nblocks) >=
> > > > +		    mp->m_sb.sb_dblocks + mp->m_sb.sb_rblocks)
> > > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > 
> > > That doesn't seem right. The file can be on either the data or the
> > > rt device, so the maximum file blocks is the size of one device or
> > > the other, not both combined.
> > 
> > di_nblocks is the sum of (data blocks + bmbt blocks + attr blocks),
> > right?
> 
> Yeah, forgot it was more than just data extents.
> 
> > So in theory if you had a rt file with 1000 data blocks, 10 bmbt
> > blocks to map the data blocks, and 100 attr blocks then di_nblocks has
> > to be 1110.
> 
> Yup, but the additional metadata on the data device is not going to
> be anywhere near the size of the data device.
> 
> /me shrugs
> 
> I can't think of an easy way to get a maximum block count, so I
> guess that'll have to do...

Yeah, I couldn't come up with a more precise check that doesn't involve
checking the data forks directly (which comes up later).
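
For reference, the arithmetic from the example above and the loose bound for
realtime files can be written out as a standalone sketch.  The struct and
helper names are hypothetical; only the sum and the comparison come from the
discussion:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical breakdown of what di_nblocks counts. */
struct nblocks_example {
	uint64_t data_blocks;	/* file data (rt device for a realtime file) */
	uint64_t bmbt_blocks;	/* bmap btree blocks (data device) */
	uint64_t attr_blocks;	/* attr fork blocks (data device) */
};

/* di_nblocks is the sum of all three. */
static uint64_t
expected_nblocks(const struct nblocks_example *ex)
{
	return ex->data_blocks + ex->bmbt_blocks + ex->attr_blocks;
}

/*
 * Loose upper bound for a realtime file: since the count spans both
 * devices, anything at or beyond dblocks + rblocks is flagged.
 */
static bool
rt_nblocks_plausible(uint64_t nblocks, uint64_t dblocks, uint64_t rblocks)
{
	return nblocks < dblocks + rblocks;
}
```

Feeding in 1000 data blocks, 10 bmbt blocks and 100 attr blocks gives 1110,
which is why the bound has to include both devices rather than just one.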

> > > > +		if (nextents > fork_recs)
> > > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > > +		break;
> > > > +	case XFS_DINODE_FMT_BTREE:
> > > > +		if (nextents <= fork_recs)
> > > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > > +		break;
> > > > +	case XFS_DINODE_FMT_LOCAL:
> > > > +	case XFS_DINODE_FMT_DEV:
> > > > +	case XFS_DINODE_FMT_UUID:
> > > > +	default:
> > > > +		if (nextents != 0)
> > > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > > +		break;
> > > > +	}
> > > > +
> > > > +	/* di_anextents */
> > > > +	nextents = be16_to_cpu(dip->di_anextents);
> > > > +	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> > > > +	switch (dip->di_aformat) {
> > > > +	case XFS_DINODE_FMT_EXTENTS:
> > > > +		if (nextents > fork_recs)
> > > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > > +		break;
> > > > +	case XFS_DINODE_FMT_BTREE:
> > > > +		if (nextents <= fork_recs)
> > > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > > +		break;
> > > > +	case XFS_DINODE_FMT_LOCAL:
> > > > +	case XFS_DINODE_FMT_DEV:
> > > > +	case XFS_DINODE_FMT_UUID:
> > > > +	default:
> > > > +		if (nextents != 0)
> > > > +			xfs_scrub_ino_set_corrupt(sc, ino, bp);
> > > > +		break;
> > > > +	}
> > > 
> > > Don't we need a check here first to see whether an attribute fork
> > > exists or not?
> > 
> > Do you mean the xfs_inode_fork, or something else?
> 
> Something else. :P
> 
> > XFS_DFORK_ASIZE returns zero if !XFS_DFORK_Q which in turn is based on
> > di_forkoff so we're really only checking that di_aformat makes sense
> > given the number of extents and the size of the attr fork area.
> 
> Right, but if XFS_DFORK_ASIZE == 0, the dip->di_aformat *must* be
> XFS_DINODE_FMT_EXTENTS. That's the only valid configuration when
> there is no attribute fork present.
> 
> If there is an attribute fork present, then it can be XFS_DINODE_FMT_LOCAL,
> EXTENT or BTREE, and then the extent count needs checking.
> XFS_DINODE_FMT_DEV and XFS_DINODE_FMT_UUID are both invalid for the
> attribute fork.

Ok.  Will rearrange this to check di_forkoff and di_aformat before
bothering with the di_anextents checks.
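
A rough standalone sketch of that rearranged check follows.  The enum and
function are hypothetical stand-ins rather than the real XFS_DINODE_FMT_*
definitions, and requiring a zero extent count when there is no attr fork is
my assumption, not something stated above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the XFS_DINODE_FMT_* values; numbering is illustrative. */
enum fork_fmt { FMT_DEV, FMT_LOCAL, FMT_EXTENTS, FMT_BTREE, FMT_UUID };

/*
 * asize:     bytes available to the attr fork (0 means no attr fork)
 * anextents: attr fork extent count from the inode core
 * fork_recs: number of extent records that fit in asize
 */
static bool
attr_fork_ok(enum fork_fmt aformat, uint32_t asize, uint32_t anextents,
	     uint32_t fork_recs)
{
	/* No attr fork: only the extents format is valid. */
	if (asize == 0)
		return aformat == FMT_EXTENTS && anextents == 0;

	switch (aformat) {
	case FMT_LOCAL:
		return anextents == 0;
	case FMT_EXTENTS:
		return anextents <= fork_recs;
	case FMT_BTREE:
		return anextents > fork_recs;
	default:
		/* DEV and UUID are never valid for the attr fork. */
		return false;
	}
}
```

Checking the fork offset first means the extent count comparison only runs
for formats that can actually hold extents.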

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 17/25] xfs: scrub inode block mappings
  2017-10-03 20:42 ` [PATCH 17/25] xfs: scrub inode block mappings Darrick J. Wong
@ 2017-10-06  2:51   ` Dave Chinner
  2017-10-06 17:00     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-06  2:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:36PM -0700, Darrick J. Wong wrote:
> +/* Set us up with an inode's bmap. */
> +STATIC int
> +__xfs_scrub_setup_inode_bmap(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip,
> +	bool				flush_data)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	error = xfs_scrub_get_inode(sc, ip);
> +	if (error)
> +		return error;
> +
> +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> +	xfs_ilock(sc->ip, sc->ilock_flags);
> +
> +	/*
> +	 * We don't want any ephemeral data fork updates sitting around
> +	 * while we inspect block mappings, so wait for directio to finish
> +	 * and flush dirty data if we have delalloc reservations.
> +	 */
> +	if (S_ISREG(VFS_I(sc->ip)->i_mode) && flush_data) {
> +		inode_dio_wait(VFS_I(sc->ip));
> +		error = filemap_write_and_wait(VFS_I(sc->ip)->i_mapping);
> +		if (error)
> +			goto out_unlock;
> +		error = invalidate_inode_pages2(VFS_I(sc->ip)->i_mapping);
> +		if (error)
> +			goto out_unlock;
> +	}

The same flush and invalidate is done in xfs_fs_map_blocks and
xfs_ioctl_setattr_dax_invalidate. We used to have helper functions
to do this, but we got rid of them because we removed all the
cases where such behaviour was necessary. Now we're adding them
back, perhaps we should have a helper for this again?


> +	/* Got the inode, lock it and we're ready to go. */
> +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> +	if (error)
> +		goto out_unlock;
> +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> +
> +	return 0;
> +out_unlock:
> +	xfs_iunlock(sc->ip, sc->ilock_flags);
> +	if (sc->ip != ip)
> +		iput(VFS_I(sc->ip));
> +	sc->ip = NULL;

Slightly tricky - how many places do we end up having to do this?
If it's more than one, perhaps we need an xfs_scrub_irele(sc, ip)
helper?
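
Something along these lines, perhaps.  This is only a standalone sketch with
mock types: the example_* names are hypothetical, and the refcount decrement
stands in for iput(VFS_I(sc->ip)):

```c
#include <stddef.h>

/* Mock types so the sketch compiles outside the kernel. */
struct example_inode {
	int refcount;
};

struct example_scrub_context {
	struct example_inode *ip;	/* inode being scrubbed */
};

/*
 * Release the scrub context's inode.  Only drop a reference if scrub
 * grabbed the inode itself, i.e. it is not the inode the caller already
 * holds open.
 */
static void
example_scrub_irele(struct example_scrub_context *sc,
		    struct example_inode *caller_ip)
{
	if (sc->ip == NULL)
		return;
	if (sc->ip != caller_ip)
		sc->ip->refcount--;	/* stands in for iput() */
	sc->ip = NULL;
}
```

Each out_unlock label would then shrink to an unlock followed by one call to
the helper.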

> +/*
> + * Inode fork block mapping (BMBT) scrubber.
> + * More complex than the others because we have to scrub
> + * all the extents regardless of whether or not the fork
> + * is in btree format.
> + */
> +
> +struct xfs_scrub_bmap_info {
> +	struct xfs_scrub_context	*sc;
> +	xfs_daddr_t			eofs;
> +	xfs_fileoff_t			lastoff;
> +	bool				is_rt;
> +	bool				is_shared;
> +	int				whichfork;
> +};
> +
> +/* Scrub a single extent record. */
> +STATIC int
> +xfs_scrub_bmap_extent(
> +	struct xfs_inode		*ip,
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_scrub_bmap_info	*info,
> +	struct xfs_bmbt_irec		*irec)
> +{
> +	struct xfs_scrub_ag		sa = { 0 };
> +	struct xfs_mount		*mp = info->sc->mp;
> +	struct xfs_buf			*bp = NULL;
> +	xfs_daddr_t			daddr;
> +	xfs_daddr_t			dlen;
> +	xfs_fsblock_t			bno;
> +	xfs_agnumber_t			agno;
> +	int				error = 0;
> +
> +	if (cur)
> +		xfs_btree_get_block(cur, 0, &bp);
> +
> +	if (irec->br_startoff < info->lastoff ||
> +	    irec->br_startblock == HOLESTARTBLOCK ||
> +	    isnullstartblock(irec->br_startblock))
> +		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> +				irec->br_startoff);

What are we checking here? That it's ordered correctly and not a
hole/delalloc record? If it is bad, shouldn't we just jump out here
because the following checks are likely to throw silly errors on
hole/delalloc mappings?

> +	/* Actual mapping, so check the block ranges. */
> +	if (info->is_rt) {
> +		daddr = XFS_FSB_TO_BB(mp, irec->br_startblock);
> +		agno = NULLAGNUMBER;
> +		bno = irec->br_startblock;
> +	} else {
> +		daddr = XFS_FSB_TO_DADDR(mp, irec->br_startblock);
> +		agno = XFS_FSB_TO_AGNO(mp, irec->br_startblock);
> +		if (agno >= mp->m_sb.sb_agcount) {
> +			xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> +				irec->br_startoff);
> +			goto out;
> +		}
> +		bno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
> +		if (bno >= mp->m_sb.sb_agblocks)
> +			xfs_scrub_fblock_set_corrupt(info->sc,
> +						     info->whichfork,
> +						     irec->br_startoff);

more verify_agbno()/verify_fsbno stuff.

> +	}
> +	dlen = XFS_FSB_TO_BB(mp, irec->br_blockcount);
> +	if (irec->br_blockcount <= 0 ||
> +	    irec->br_blockcount > MAXEXTLEN ||

irec->br_blockcount is unsigned (uint64_t).

Also needs to be checked against AG size.

> +	    daddr >= info->eofs ||
> +	    daddr + dlen > info->eofs)
> +		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> +				irec->br_startoff);
> +
> +	if (irec->br_state == XFS_EXT_UNWRITTEN &&
> +	    !xfs_sb_version_hasextflgbit(&mp->m_sb))

Superblock scrubber should reject any filesystem without the
extflgbit as corrupt - it's always set by mkfs - so I'm not sure we
need to check this here.
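
On the block count comments above: since br_blockcount is unsigned, the
"<= 0" test is really "== 0", and a data device extent should also fit
inside its AG.  A standalone sketch of what I mean, where the function name
is hypothetical and EXAMPLE_MAXEXTLEN assumes the usual 21-bit extent length
limit:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror MAXEXTLEN (21-bit extent length field). */
#define EXAMPLE_MAXEXTLEN	((1ULL << 21) - 1)

/*
 * agbno:      start block of the extent within its AG
 * blockcount: extent length in filesystem blocks (unsigned)
 * agblocks:   size of an AG in filesystem blocks
 */
static bool
extent_len_ok(uint64_t agbno, uint64_t blockcount, uint64_t agblocks)
{
	/* Unsigned, so zero is the only "non-positive" value. */
	if (blockcount == 0 || blockcount > EXAMPLE_MAXEXTLEN)
		return false;

	/* The extent must start inside the AG... */
	if (agbno >= agblocks)
		return false;

	/* ...and end inside it too; written this way to avoid overflow. */
	return blockcount <= agblocks - agbno;
}
```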

> +/* Scrub an inode fork's block mappings. */
> +STATIC int
> +xfs_scrub_bmap(
> +	struct xfs_scrub_context	*sc,
> +	int				whichfork)
> +{
> +	struct xfs_bmbt_irec		irec;
> +	struct xfs_scrub_bmap_info	info = {0};
> +	struct xfs_owner_info		oinfo;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_inode		*ip = sc->ip;
> +	struct xfs_ifork		*ifp;
> +	struct xfs_btree_cur		*cur;
> +	xfs_fileoff_t			endoff;
> +	xfs_extnum_t			idx;
> +	bool				found;
> +	int				error = 0;
> +
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +
> +	info.is_rt = whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip);
> +	info.eofs = XFS_FSB_TO_BB(mp, info.is_rt ? mp->m_sb.sb_rblocks :
> +					      mp->m_sb.sb_dblocks);
> +	info.whichfork = whichfork;
> +	info.is_shared = whichfork == XFS_DATA_FORK && xfs_is_reflink_inode(ip);
> +	info.sc = sc;
> +
> +	switch (whichfork) {
> +	case XFS_COW_FORK:
> +		/* Non-existent CoW forks are ignorable. */
> +		if (!ifp)
> +			goto out_unlock;
> +		/* No CoW forks on non-reflink inodes/filesystems. */
> +		if (!xfs_is_reflink_inode(ip)) {
> +			xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino, NULL);
> +			goto out_unlock;
> +		}
> +		break;
> +	case XFS_ATTR_FORK:
> +		if (!ifp)
> +			goto out_unlock;
> +		if (!xfs_sb_version_hasattr(&mp->m_sb) &&
> +		    !xfs_sb_version_hasattr2(&mp->m_sb))
> +			xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino, NULL);
> +		break;
> +	}
> +
> +	/* Check the fork values */
> +	switch (XFS_IFORK_FORMAT(ip, whichfork)) {
> +	case XFS_DINODE_FMT_UUID:
> +	case XFS_DINODE_FMT_DEV:
> +	case XFS_DINODE_FMT_LOCAL:
> +		/* No mappings to check. */
> +		goto out_unlock;
> +	case XFS_DINODE_FMT_EXTENTS:
> +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> +			xfs_scrub_fblock_set_corrupt(sc, whichfork, 0);
> +			goto out_unlock;
> +		}
> +		break;
> +	case XFS_DINODE_FMT_BTREE:
> +		ASSERT(whichfork != XFS_COW_FORK);

Corruption check, jump to out?

> +
> +		/* Scan the btree records. */
> +		cur = xfs_bmbt_init_cursor(mp, sc->tp, ip, whichfork);
> +		xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> +		error = xfs_scrub_btree(sc, cur, xfs_scrub_bmapbt_helper,
> +				&oinfo, &info);
> +		xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR :
> +						  XFS_BTREE_NOERROR);

FYI, I missed that the on-disk bmbt was scanned here the first time
I went through this code - I had to go back and work out why the
code only appeared to scrub the incore extent list. Can you wrap
this whole chunk of code into a helper named xfs_scrub_bmbt()
so it stands out that this is where the on disk btree is scrubbed?


> +		if (error == -EDEADLOCK)
> +			return error;

Ok, why don't we go to out_unlock here?

> +		else if (error)
> +			goto out_unlock;

But do for all other errors....

> +		break;
> +	default:
> +		xfs_scrub_fblock_set_corrupt(sc, whichfork, 0);
> +		goto out_unlock;
> +	}
> +
> +	/* Extent data is in memory, so scrub that. */
> +
> +	/* Find the offset of the last extent in the mapping. */
> +	error = xfs_bmap_last_offset(ip, &endoff, whichfork);
> +	if (!xfs_scrub_fblock_op_ok(sc, whichfork, 0, &error))
> +		goto out_unlock;
> +
> +	/* Scrub extent records. */
> +	info.lastoff = 0;
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +	for (found = xfs_iext_lookup_extent(ip, ifp, 0, &idx, &irec);
> +	     found != 0;
> +	     found = xfs_iext_get_extent(ifp, ++idx, &irec)) {
> +		if (xfs_scrub_should_terminate(sc, &error))
> +			break;
> +		if (isnullstartblock(irec.br_startblock))
> +			continue;
> +		if (irec.br_startoff >= endoff) {
> +			xfs_scrub_fblock_set_corrupt(sc, whichfork,
> +					irec.br_startoff);
> +			goto out_unlock;
> +		}
> +		error = xfs_scrub_bmap_extent(ip, NULL, &info, &irec);
> +		if (error == -EDEADLOCK)
> +			return error;
> +	}
> +
> +out_unlock:
> +	return error;

Hmmm - out_unlock doesn't unlock anything?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 18/25] xfs: scrub directory/attribute btrees
  2017-10-03 20:42 ` [PATCH 18/25] xfs: scrub directory/attribute btrees Darrick J. Wong
@ 2017-10-06  5:07   ` Dave Chinner
  2017-10-06 18:30     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-06  5:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Fengguang Wu

On Tue, Oct 03, 2017 at 01:42:42PM -0700, Darrick J. Wong wrote:
> +/* Find an entry at a certain level in a da btree. */
> +STATIC void *
> +xfs_scrub_da_btree_entry(
> +	struct xfs_scrub_da_btree	*ds,
> +	int				level,
> +	int				rec)
> +{
> +	char				*ents;
> +	void				*(*fn)(void *);
> +	size_t				sz;
> +	struct xfs_da_state_blk		*blk;
> +
> +	/* Dispatch the entry finding function. */
> +	blk = &ds->state->path.blk[level];
> +	switch (blk->magic) {
> +	case XFS_ATTR_LEAF_MAGIC:
> +	case XFS_ATTR3_LEAF_MAGIC:
> +		fn = (xfs_da_leaf_ents_fn)xfs_attr3_leaf_entryp;
> +		sz = sizeof(struct xfs_attr_leaf_entry);
> +		break;
> +	case XFS_DIR2_LEAFN_MAGIC:
> +	case XFS_DIR3_LEAFN_MAGIC:
> +		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->leaf_ents_p;
> +		sz = sizeof(struct xfs_dir2_leaf_entry);
> +		break;
> +	case XFS_DIR2_LEAF1_MAGIC:
> +	case XFS_DIR3_LEAF1_MAGIC:
> +		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->leaf_ents_p;
> +		sz = sizeof(struct xfs_dir2_leaf_entry);
> +		break;
> +	case XFS_DA_NODE_MAGIC:
> +	case XFS_DA3_NODE_MAGIC:
> +		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->node_tree_p;
> +		sz = sizeof(struct xfs_da_node_entry);
> +		break;
> +	default:
> +		return NULL;
> +	}
> +
> +	ents = fn(blk->bp->b_addr);
> +	return ents + (sz * rec);
> +}

This looks kinda unnecessarily abstracted.

	case XFS_ATTR_LEAF_MAGIC:
	case XFS_ATTR3_LEAF_MAGIC:
		ents = xfs_attr3_leaf_entryp(blk->bp->b_addr);
		return ents + (rec * sizeof(struct xfs_attr_leaf_entry));

	case XFS_DIR2_LEAF1_MAGIC:
	case XFS_DIR3_LEAF1_MAGIC:
	case XFS_DIR2_LEAFN_MAGIC:
	case XFS_DIR3_LEAFN_MAGIC:
		ents = ds->dargs.dp->d_ops->leaf_ents_p(blk->bp->b_addr);
		return ents + (rec * sizeof(struct xfs_dir2_leaf_entry));

	case XFS_DA_NODE_MAGIC:
	case XFS_DA3_NODE_MAGIC:
		ents = ds->dargs.dp->d_ops->node_tree_p(blk->bp->b_addr);
		return ents + (rec * sizeof(struct xfs_da_node_entry));


> +
> +/* Scrub a da btree hash (key). */
> +int
> +xfs_scrub_da_btree_hash(
> +	struct xfs_scrub_da_btree	*ds,
> +	int				level,
> +	__be32				*hashp)
> +{
> +	struct xfs_da_state_blk		*blks;
> +	struct xfs_da_node_entry	*btree;

*entry?

> +	xfs_dahash_t			hash;
> +	xfs_dahash_t			parent_hash;
> +
> +	/* Is this hash in order? */
> +	hash = be32_to_cpu(*hashp);
> +	if (hash < ds->hashes[level])
> +		xfs_scrub_da_set_corrupt(ds, level);
> +	ds->hashes[level] = hash;
> +
> +	if (level == 0)
> +		return 0;
> +
> +	/* Is this hash no larger than the parent hash? */
> +	blks = ds->state->path.blk;
> +	btree = xfs_scrub_da_btree_entry(ds, level - 1, blks[level - 1].index);

entry = ?

> +	parent_hash = be32_to_cpu(btree->hashval);
> +	if (parent_hash < hash)
> +		xfs_scrub_da_set_corrupt(ds, level);
> +
> +	return 0;
> +}
> +
> +/*
> + * Check a da btree pointer.  Returns true if it's ok to use this
> + * pointer.
> + */
> +STATIC bool
> +xfs_scrub_da_btree_ptr_ok(
> +	struct xfs_scrub_da_btree	*ds,
> +	int				level,
> +	xfs_dablk_t			blkno)
> +{
> +	if (blkno < ds->lowest || (ds->highest != 0 && blkno >= ds->highest)) {
> +		xfs_scrub_da_set_corrupt(ds, level);
> +		return false;
> +	}

Not sure what lowest and highest are here - the structure definition
is not commented. I /think/ it's the offset within the directory
address space for the leaf pointers (i.e. XFS_DIR2_LEAF_OFFSET ->
XFS_DIR2_FREE_OFFSET for directories), but I'm mostly guessing from
context here...

> +
> +	return true;
> +}
> +
> +/*
> + * The da btree scrubber can handle leaf1 blocks as a degenerate
> + * form of da btree.  Since the regular da code doesn't handle

degenerate form of LEAFN blocks?

> +
> +/* Check a block's sibling. */
> +STATIC int
> +xfs_scrub_da_btree_block_check_sibling(
> +	struct xfs_scrub_da_btree	*ds,
> +	int				level,
> +	int				direction,
> +	xfs_dablk_t			sibling)
> +{
> +	int				retval;
> +	int				error;
> +
> +	if (!sibling)
> +		return 0;
> +
> +	/* Move the alternate cursor back one block. */

Move the alternate cursor one block in the direction specified?

> +	memcpy(&ds->state->altpath, &ds->state->path,
> +			sizeof(ds->state->altpath));
> +	error = xfs_da3_path_shift(ds->state, &ds->state->altpath,
> +			direction, false, &retval);
> +	if (!xfs_scrub_da_op_ok(ds, level, &error))
> +		return error;
> +	if (retval) {
> +		xfs_scrub_da_set_corrupt(ds, level);
> +		return error;
> +	}
> +
> +	if (ds->state->altpath.blk[level].blkno != sibling)
> +		xfs_scrub_da_set_corrupt(ds, level);
> +	xfs_trans_brelse(ds->dargs.trans, ds->state->altpath.blk[level].bp);
> +	return error;
> +}
> +
> +/* Check a block's sibling pointers. */
> +STATIC int
> +xfs_scrub_da_btree_block_check_siblings(
> +	struct xfs_scrub_da_btree	*ds,
> +	int				level,
> +	struct xfs_da_blkinfo		*hdr)
> +{
> +	xfs_dablk_t			forw;
> +	xfs_dablk_t			back;
> +	int				error = 0;
> +
> +	forw = be32_to_cpu(hdr->forw);
> +	back = be32_to_cpu(hdr->back);
> +
> +	/* Top level blocks should not have sibling pointers. */
> +	if (level == 0) {
> +		if (forw != 0 || back != 0)
> +			xfs_scrub_da_set_corrupt(ds, level);
> +		return error;

Error is always zero here?

> +	}
> +
> +	/*
> +	 * Check back (left) and forw (right) pointers.  These functions
> +	 * absorb error codes for us.
> +	 */
> +	error = xfs_scrub_da_btree_block_check_sibling(ds, level, 0, back);
> +	if (error)
> +		goto out;
> +	error = xfs_scrub_da_btree_block_check_sibling(ds, level, 1, forw);
> +
> +out:
> +	memset(&ds->state->altpath, 0, sizeof(ds->state->altpath));
> +	return error;
> +}
> +
> +/* Load a dir/attribute block from a btree. */
> +STATIC int
> +xfs_scrub_da_btree_block(
> +	struct xfs_scrub_da_btree	*ds,
> +	int				level,
> +	xfs_dablk_t			blkno)
> +{
> +	struct xfs_da_state_blk		*blk;
> +	struct xfs_da_intnode		*node;
> +	struct xfs_da_node_entry	*btree;
> +	struct xfs_da3_blkinfo		*hdr3;
> +	struct xfs_da_args		*dargs = &ds->dargs;
> +	struct xfs_inode		*ip = ds->dargs.dp;
> +	xfs_ino_t			owner;
> +	int				*pmaxrecs;
> +	struct xfs_da3_icnode_hdr	nodehdr;
> +	int				error;
> +
> +	blk = &ds->state->path.blk[level];
> +	ds->state->path.active = level + 1;
> +
> +	/* Release old block. */
> +	if (blk->bp) {
> +		xfs_trans_brelse(dargs->trans, blk->bp);
> +		blk->bp = NULL;
> +	}
> +
> +	/* Check the pointer. */
> +	blk->blkno = blkno;
> +	if (!xfs_scrub_da_btree_ptr_ok(ds, level, blkno))
> +		goto out_nobuf;
> +
> +	/* Read the buffer. */
> +	error = xfs_da_read_buf(dargs->trans, dargs->dp, blk->blkno, -2,
> +			&blk->bp, dargs->whichfork,
> +			&xfs_scrub_da_btree_buf_ops);

Hmmm - this verifier only special cases LEAF1 blocks, no comments as
to why it treats everything else with the node verifier. Don't
we have to special case the attr leaf blocks here as well?

> +	if (!xfs_scrub_da_op_ok(ds, level, &error))
> +		goto out_nobuf;
> +
> +	/* It's ok for a directory not to have a da btree in it. */
> +	if (ds->dargs.whichfork == XFS_DATA_FORK && level == 0 &&
> +			blk->bp == NULL)
> +		goto out_nobuf;

What case is that? single block form? Need a magic number check
here if that's the case?

> +/* Visit all nodes and leaves of a da btree. */
> +int
> +xfs_scrub_da_btree(
> +	struct xfs_scrub_context	*sc,
> +	int				whichfork,
> +	xfs_scrub_da_btree_rec_fn	scrub_fn)
> +{
> +	struct xfs_scrub_da_btree	ds;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_da_state_blk		*blks;
> +	struct xfs_da_node_entry	*key;
> +	void				*rec;
> +	xfs_dablk_t			blkno;
> +	bool				is_attr;
> +	int				level;
> +	int				error;
> +
> +	memset(&ds, 0, sizeof(ds));

I almost missed this - had to go looking later for why the
ds.maxrecs[] started off at zero. Can we change this to be
initialised to zero at declaration like so:

	struct xfs_scrub_da_btree	ds = {};

> +	/* Skip short format data structures; no btree to scan. */
> +	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
> +	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
> +		return 0;
> +
> +	/* Set up initial da state. */
> +	is_attr = whichfork == XFS_ATTR_FORK;
> +	ds.dargs.geo = is_attr ? mp->m_attr_geo : mp->m_dir_geo;
> +	ds.dargs.dp = sc->ip;
> +	ds.dargs.whichfork = whichfork;
> +	ds.dargs.trans = sc->tp;
> +	ds.dargs.op_flags = XFS_DA_OP_OKNOENT;
> +	ds.state = xfs_da_state_alloc();
> +	ds.state->args = &ds.dargs;
> +	ds.state->mp = mp;
> +	ds.sc = sc;
> +	blkno = ds.lowest = is_attr ? 0 : ds.dargs.geo->leafblk;
> +	ds.highest = is_attr ? 0 : ds.dargs.geo->freeblk;
> +	level = 0;

Bit hard to read with all the ?: constructs. Can we make this:

	if (whichfork == XFS_ATTR_FORK) {
		ds.dargs.geo = ...
		ds.lowest = ..
		ds.highest = ...
	} else {
		....
	}
	......

	blkno = ds.lowest;
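
Fleshed out a little as a standalone sketch, with mock types, so the intent
of lowest/highest is visible.  The example_* names are hypothetical; the
values mirror the quoted setup code (attr forks start at block 0 with no
upper bound, directories span leafblk up to freeblk):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mock of the two geometry fields used by the setup code. */
struct example_geo {
	uint32_t leafblk;	/* first block of the dir leaf space */
	uint32_t freeblk;	/* first block of the dir free space */
};

struct example_range {
	uint32_t lowest;	/* first valid da btree block */
	uint32_t highest;	/* one past the last valid block; 0 = none */
};

/* Pick the valid da block range with a plain if/else. */
static struct example_range
da_block_range(bool is_attr, const struct example_geo *dir_geo)
{
	struct example_range r;

	if (is_attr) {
		r.lowest = 0;
		r.highest = 0;
	} else {
		r.lowest = dir_geo->leafblk;
		r.highest = dir_geo->freeblk;
	}
	return r;
}

/* Same test as the quoted xfs_scrub_da_btree_ptr_ok(), inverted. */
static bool
da_ptr_in_range(const struct example_range *r, uint32_t blkno)
{
	return blkno >= r->lowest &&
	       (r->highest == 0 || blkno < r->highest);
}
```

The traversal would then start from the lowest block of whichever range was
chosen.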

> +
> +	/* Find the root of the da tree, if present. */
> +	blks = ds.state->path.blk;
> +	error = xfs_scrub_da_btree_block(&ds, level, blkno);
> +	if (error)
> +		goto out_state;
> +	if (blks[level].bp == NULL)
> +		goto out_state;

So for a single block directory, we'll jump out here because it's
at block zero and there's nothing at mp->m_dir_geo.leafblk.
That means the loop will only ever handle LEAF1/LEAFN format
directory structures. Correct? (comment?)

> +	blks[level].index = 0;
> +	while (level >= 0 && level < XFS_DA_NODE_MAXDEPTH) {
> +		/* Handle leaf block. */
> +		if (blks[level].magic != XFS_DA_NODE_MAGIC) {
> +			/* End of leaf, pop back towards the root. */
> +			if (blks[level].index >= ds.maxrecs[level]) {
> +				if (level > 0)
> +					blks[level - 1].index++;
> +				ds.tree_level++;
> +				level--;
> +				continue;
> +			}
> +
> +			/* Dispatch record scrubbing. */
> +			rec = xfs_scrub_da_btree_entry(&ds, level,
> +					blks[level].index);
> +			error = scrub_fn(&ds, level, rec);
> +			if (error < 0 ||
> +			    error == XFS_BTREE_QUERY_RANGE_ABORT)

When would we get a XFS_BTREE_QUERY_RANGE_ABORT error?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 19/25] xfs: scrub directory metadata
  2017-10-03 20:42 ` [PATCH 19/25] xfs: scrub directory metadata Darrick J. Wong
@ 2017-10-06  7:07   ` Dave Chinner
  2017-10-06 19:45     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-06  7:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:49PM -0700, Darrick J. Wong wrote:
> +
> +/* Set us up to scrub a file's contents. */
> +int
> +xfs_scrub_setup_inode_contents(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip,
> +	unsigned int			resblks)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	error = xfs_scrub_get_inode(sc, ip);
> +	if (error)
> +		return error;
> +
> +	/* Got the inode, lock it and we're ready to go. */
> +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> +	xfs_ilock(sc->ip, sc->ilock_flags);
> +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> +	if (error)
> +		goto out_unlock;
> +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> +
> +	return 0;
> +out_unlock:
> +	xfs_iunlock(sc->ip, sc->ilock_flags);
> +	if (sc->ip != ip)
> +		iput(VFS_I(sc->ip));
> +	sc->ip = NULL;
> +	return error;
> +}

I've seen this get/lock/alloc code many times now - seems like scope
for factoring?

> +/* Scrub a directory entry. */
> +
> +struct xfs_scrub_dir_ctx {
> +	struct dir_context		dc;
> +	struct xfs_scrub_context	*sc;
> +};

These two character variable names make the code really hard to
understand, especially when....

> +
> +/* Check that an inode's mode matches a given DT_ type. */
> +STATIC int
> +xfs_scrub_dir_check_ftype(
> +	struct xfs_scrub_dir_ctx	*sdc,

... we now have sc, dc, and sdc all intertwined. Especially as some
of these functions seem to invert the calling convention of the rest
of the scrub code in that the scrub context is the primary object
that is passed around and everything is attached to the sc....

Brain hurts trying to keep all this straight here...

/me keeps blundering around until it becomes apparent that
dir_context has nothing to do with scrub, but is a VFS filldir
callback structure triggered through readdir.

Darrick, can you add some comments explaining how the dirent
scrubbing works?  It'd make it so much easier to understand if I
didn't have to reverse engineer the intent and design from the
patch. I'm going to skip this one for now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 17/25] xfs: scrub inode block mappings
  2017-10-06  2:51   ` Dave Chinner
@ 2017-10-06 17:00     ` Darrick J. Wong
  2017-10-07 23:10       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-06 17:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Oct 06, 2017 at 01:51:23PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:36PM -0700, Darrick J. Wong wrote:
> > +/* Set us up with an inode's bmap. */
> > +STATIC int
> > +__xfs_scrub_setup_inode_bmap(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip,
> > +	bool				flush_data)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error;
> > +
> > +	error = xfs_scrub_get_inode(sc, ip);
> > +	if (error)
> > +		return error;
> > +
> > +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > +	xfs_ilock(sc->ip, sc->ilock_flags);
> > +
> > +	/*
> > +	 * We don't want any ephemeral data fork updates sitting around
> > +	 * while we inspect block mappings, so wait for directio to finish
> > +	 * and flush dirty data if we have delalloc reservations.
> > +	 */
> > +	if (S_ISREG(VFS_I(sc->ip)->i_mode) && flush_data) {
> > +		inode_dio_wait(VFS_I(sc->ip));
> > +		error = filemap_write_and_wait(VFS_I(sc->ip)->i_mapping);
> > +		if (error)
> > +			goto out_unlock;
> > +		error = invalidate_inode_pages2(VFS_I(sc->ip)->i_mapping);
> > +		if (error)
> > +			goto out_unlock;
> > +	}
> 
> The same flush and invalidate is done in xfs_fs_map_blocks and
> xfs_ioctl_setattr_dax_invalidate. We used to have helper functions
> to do this, but we got rid of them because we removed all the
> cases where such behaviour was necessary. Now we're adding them
> back, perhaps we should have a helper for this again?

Hmmm... looking at all the filemap_write_and_wait callers in the kernel:

xfs_ioctl_setattr_dax_invalidate flushes & invalidates the page cache
xfs_vm_bmap flushes dirty pages only
xfs_reflink_unshare flushes a range of dirty pages
xfs_fs_map_blocks flushes & invalidates pagecache
xfs_setattr_size flushes between old and new EOF
xfs_getbmap flushes dirty pages
xfs_flush_unmap_range flushes and rips out the page cache

Looking at the one caller of invalidate_inode_pages* that I hadn't
already mentioned:

xfs_fs_commit_blocks invalidates a range of pagecache

And looking at the callers of truncate_pagecache_range:

xfs_file_iomap_end_delalloc truncates pagecache for a partial delalloc write

So you're right, the only usage patterns that match are the two that you
suggested.  However, I got to thinking last night, do we actually need
to invalidate pagecache if we're /only/ doing a readonly check of the
data fork mappings?  I see a good argument for doing so if we need to
rebuild the data maps, but since scrub never dirties anything, the page
cache needn't be nuked here.

I'm going to try changing this code only to invalidate if userspace set
IFLAG_REPAIR.
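
Roughly what I have in mind, as a userspace sketch rather than the real
patch.  Everything named fake_* is made up for illustration, and the
flag's bit value here is only a placeholder: we always flush dirty data
so the mappings are stable, but only throw away the page cache when a
repair was requested.

```c
#include <assert.h>
#include <stdbool.h>

#define FAKE_IFLAG_REPAIR	(1u << 0)	/* placeholder bit value */

/* Counts what we would have asked the page cache to do. */
struct fake_cache_ops {
	int	flushes;
	int	invalidations;
};

/* Flush for every regular file; invalidate only if repairing. */
static void
fake_prepare_pagecache(
	struct fake_cache_ops	*ops,
	unsigned int		sm_flags,
	bool			is_regular_file)
{
	if (!is_regular_file)
		return;

	/* Read-only checking still wants delalloc converted. */
	ops->flushes++;

	/* Only a repair might rewrite mappings under the page cache. */
	if (sm_flags & FAKE_IFLAG_REPAIR)
		ops->invalidations++;
}
```

A plain scrub call then leaves cached pages alone, which should make a
daily scrub less disruptive to running workloads.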

> > +	/* Got the inode, lock it and we're ready to go. */
> > +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> > +	if (error)
> > +		goto out_unlock;
> > +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> > +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> > +
> > +	return 0;
> > +out_unlock:
> > +	xfs_iunlock(sc->ip, sc->ilock_flags);
> > +	if (sc->ip != ip)
> > +		iput(VFS_I(sc->ip));
> > +	sc->ip = NULL;
> 
> Slightly tricky - how many places do we end up having to do this?
> If it's more than one, perhaps we need a xfs_scrub_irele(sc, ip)
> helper?

Three places, two of which are setup functions and the third is the
teardown function.  The teardown function is always called even if setup
fails, so I'll just purge those out and let xfs_scrub_teardown do this
once.  With a better comment.
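
A rough userspace model of that arrangement, assuming made-up stand-ins
(fake_inode, scrub_ctx, scrub_setup, scrub_teardown and scrub_run are
illustrative names, not the kernel's): setup only reports the error, and
the teardown that always runs is the single place that unlocks and drops
the inode reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the inode and the scrub context. */
struct fake_inode {
	int	refcount;
	bool	locked;
};

struct scrub_ctx {
	struct fake_inode	*ip;		/* inode being scrubbed */
	bool			ip_grabbed;	/* we hold our own reference */
};

/* Setup: grab and lock the inode; on failure just return the error. */
static int
scrub_setup(struct scrub_ctx *sc, struct fake_inode *ip, int alloc_error)
{
	ip->refcount++;
	sc->ip = ip;
	sc->ip_grabbed = true;
	ip->locked = true;

	/* No unlock or iput here on failure; teardown handles it. */
	return alloc_error;
}

/* Teardown: always called, releases whatever setup acquired. */
static void
scrub_teardown(struct scrub_ctx *sc)
{
	if (!sc->ip)
		return;
	sc->ip->locked = false;
	if (sc->ip_grabbed)
		sc->ip->refcount--;
	sc->ip = NULL;
	sc->ip_grabbed = false;
}

/* Driver: teardown runs exactly once whatever setup returned. */
static int
scrub_run(struct fake_inode *ip, int alloc_error)
{
	struct scrub_ctx	sc = { 0 };
	int			error;

	error = scrub_setup(&sc, ip, alloc_error);
	/* ... the actual checks would run here if !error ... */
	scrub_teardown(&sc);
	return error;
}
```

The point of the shape is that there is no second copy of the
unlock/iput sequence to get out of sync with the first.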

> > +/*
> > + * Inode fork block mapping (BMBT) scrubber.
> > + * More complex than the others because we have to scrub
> > + * all the extents regardless of whether or not the fork
> > + * is in btree format.
> > + */
> > +
> > +struct xfs_scrub_bmap_info {
> > +	struct xfs_scrub_context	*sc;
> > +	xfs_daddr_t			eofs;
> > +	xfs_fileoff_t			lastoff;
> > +	bool				is_rt;
> > +	bool				is_shared;
> > +	int				whichfork;
> > +};
> > +
> > +/* Scrub a single extent record. */
> > +STATIC int
> > +xfs_scrub_bmap_extent(
> > +	struct xfs_inode		*ip,
> > +	struct xfs_btree_cur		*cur,
> > +	struct xfs_scrub_bmap_info	*info,
> > +	struct xfs_bmbt_irec		*irec)
> > +{
> > +	struct xfs_scrub_ag		sa = { 0 };
> > +	struct xfs_mount		*mp = info->sc->mp;
> > +	struct xfs_buf			*bp = NULL;
> > +	xfs_daddr_t			daddr;
> > +	xfs_daddr_t			dlen;
> > +	xfs_fsblock_t			bno;
> > +	xfs_agnumber_t			agno;
> > +	int				error = 0;
> > +
> > +	if (cur)
> > +		xfs_btree_get_block(cur, 0, &bp);
> > +
> > +	if (irec->br_startoff < info->lastoff ||
> > +	    irec->br_startblock == HOLESTARTBLOCK ||
> > +	    isnullstartblock(irec->br_startblock))
> > +		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> > +				irec->br_startoff);
> 
> What are we checking here? that it's ordered correctly and not a
> hole/delalloc record? If it is bad, shouldn't we just jump out here
> because the following checks are likely to throw silly errors on
> hole/delalloc mappings?

Yes, I rewrote this section to use xfs_verify_fsbno and jump out if
corrupt.  I plan to make all the check functions skip the xref if
OFLAG_CORRUPT is set by the time we finish the record sanity checks.
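
In outline, and only as a sketch: the types and constants below are
simplified stand-ins (the real startblock encoding and xfs_verify_fsbno
are more involved), and the "xref" is just a counter.  The record-level
checks run first, and a bad record is flagged and skipped before any
cross-referencing happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; the real kernel definitions differ in detail. */
#define FAKE_HOLESTARTBLOCK	((uint64_t)-2)
#define FAKE_DELAYSTARTBLOCK	((uint64_t)-1)
#define FAKE_MAXEXTLEN		((uint64_t)0x1fffff)

struct fake_irec {
	uint64_t	startoff;	/* file offset, in blocks */
	uint64_t	startblock;	/* disk block or a magic value */
	uint64_t	blockcount;	/* length in blocks (unsigned) */
};

/* Returns true if the record itself looks sane. */
static bool
fake_irec_sane(const struct fake_irec *irec, uint64_t lastoff,
	       uint64_t fs_blocks)
{
	/* Records must be in ascending offset order. */
	if (irec->startoff < lastoff)
		return false;
	/* Holes and delalloc reservations are not real mappings. */
	if (irec->startblock == FAKE_HOLESTARTBLOCK ||
	    irec->startblock == FAKE_DELAYSTARTBLOCK)
		return false;
	/* The count is unsigned, so "<= 0" can only ever mean "== 0". */
	if (irec->blockcount == 0 || irec->blockcount > FAKE_MAXEXTLEN)
		return false;
	/* Overflow-safe check that the extent ends inside the device. */
	if (irec->startblock >= fs_blocks ||
	    irec->blockcount > fs_blocks - irec->startblock)
		return false;
	return true;
}

/* Flag a bad record and skip the cross-reference step entirely. */
static void
fake_scrub_extent(const struct fake_irec *irec, uint64_t lastoff,
		  uint64_t fs_blocks, bool *corrupt, int *xrefs_done)
{
	if (!fake_irec_sane(irec, lastoff, fs_blocks)) {
		*corrupt = true;
		return;
	}
	(*xrefs_done)++;	/* cross-referencing would happen here */
}
```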

> > +	/* Actual mapping, so check the block ranges. */
> > +	if (info->is_rt) {
> > +		daddr = XFS_FSB_TO_BB(mp, irec->br_startblock);
> > +		agno = NULLAGNUMBER;
> > +		bno = irec->br_startblock;
> > +	} else {
> > +		daddr = XFS_FSB_TO_DADDR(mp, irec->br_startblock);
> > +		agno = XFS_FSB_TO_AGNO(mp, irec->br_startblock);
> > +		if (agno >= mp->m_sb.sb_agcount) {
> > +			xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> > +				irec->br_startoff);
> > +			goto out;
> > +		}
> > +		bno = XFS_FSB_TO_AGBNO(mp, irec->br_startblock);
> > +		if (bno >= mp->m_sb.sb_agblocks)
> > +			xfs_scrub_fblock_set_corrupt(info->sc,
> > +						     info->whichfork,
> > +						     irec->br_startoff);
> 
> more verify_agbno()/verify_fsbno stuff.
> 
> > +	}
> > +	dlen = XFS_FSB_TO_BB(mp, irec->br_blockcount);
> > +	if (irec->br_blockcount <= 0 ||
> > +	    irec->br_blockcount > MAXEXTLEN ||
> 
> irec->br_blockcount is unsigned (uint64_t).
> 
> Also needs to be checked against AG size.
> 
> > +	    daddr >= info->eofs ||
> > +	    daddr + dlen > info->eofs)
> > +		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> > +				irec->br_startoff);
> > +
> > +	if (irec->br_state == XFS_EXT_UNWRITTEN &&
> > +	    !xfs_sb_version_hasextflgbit(&mp->m_sb))
> 
> Superblock scrubber should reject any filesystem without the
> extflgbit as corrupt - it's always set by mkfs - so I'm not sure we
> need to check this here.

What happens if scrub encounters a v4 filesystem without EXTFLGBIT?
The superblock scrubber only checks that the secondary superblocks are
consistent (geometry-wise) with sb 0, and mount doesn't prohibit
!EXTFLGBIT filesystems from mounting.  fallocate and friends even work,
albeit slower because we actually write zeroes to the disk in lieu of
setting the unwritten flag, apparently.

But, seeing as mkfs always sets EXTFLGBIT and v5 implies the feature
even if the bit isn't set; and there's no way to turn off the feature
bit (except unsupported things like xfs_db -x), are you suggesting that
we should simply end support for mounting !EXTFLGBIT v4 filesystems?

> > +/* Scrub an inode fork's block mappings. */
> > +STATIC int
> > +xfs_scrub_bmap(
> > +	struct xfs_scrub_context	*sc,
> > +	int				whichfork)
> > +{
> > +	struct xfs_bmbt_irec		irec;
> > +	struct xfs_scrub_bmap_info	info = {0};
> > +	struct xfs_owner_info		oinfo;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_inode		*ip = sc->ip;
> > +	struct xfs_ifork		*ifp;
> > +	struct xfs_btree_cur		*cur;
> > +	xfs_fileoff_t			endoff;
> > +	xfs_extnum_t			idx;
> > +	bool				found;
> > +	int				error = 0;
> > +
> > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > +
> > +	info.is_rt = whichfork == XFS_DATA_FORK && XFS_IS_REALTIME_INODE(ip);
> > +	info.eofs = XFS_FSB_TO_BB(mp, info.is_rt ? mp->m_sb.sb_rblocks :
> > +					      mp->m_sb.sb_dblocks);
> > +	info.whichfork = whichfork;
> > +	info.is_shared = whichfork == XFS_DATA_FORK && xfs_is_reflink_inode(ip);
> > +	info.sc = sc;
> > +
> > +	switch (whichfork) {
> > +	case XFS_COW_FORK:
> > +		/* Non-existent CoW forks are ignorable. */
> > +		if (!ifp)
> > +			goto out_unlock;
> > +		/* No CoW forks on non-reflink inodes/filesystems. */
> > +		if (!xfs_is_reflink_inode(ip)) {
> > +			xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino, NULL);
> > +			goto out_unlock;
> > +		}
> > +		break;
> > +	case XFS_ATTR_FORK:
> > +		if (!ifp)
> > +			goto out_unlock;
> > +		if (!xfs_sb_version_hasattr(&mp->m_sb) &&
> > +		    !xfs_sb_version_hasattr2(&mp->m_sb))
> > +			xfs_scrub_ino_set_corrupt(sc, sc->ip->i_ino, NULL);
> > +		break;
> > +	}
> > +
> > +	/* Check the fork values */
> > +	switch (XFS_IFORK_FORMAT(ip, whichfork)) {
> > +	case XFS_DINODE_FMT_UUID:
> > +	case XFS_DINODE_FMT_DEV:
> > +	case XFS_DINODE_FMT_LOCAL:
> > +		/* No mappings to check. */
> > +		goto out_unlock;
> > +	case XFS_DINODE_FMT_EXTENTS:
> > +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> > +			xfs_scrub_fblock_set_corrupt(sc, whichfork, 0);
> > +			goto out_unlock;
> > +		}
> > +		break;
> > +	case XFS_DINODE_FMT_BTREE:
> > +		ASSERT(whichfork != XFS_COW_FORK);
> 
> Corruption check, jump to out?

Oops, yes.

> > +
> > +		/* Scan the btree records. */
> > +		cur = xfs_bmbt_init_cursor(mp, sc->tp, ip, whichfork);
> > +		xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> > +		error = xfs_scrub_btree(sc, cur, xfs_scrub_bmapbt_helper,
> > +				&oinfo, &info);
> > +		xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR :
> > +						  XFS_BTREE_NOERROR);
> 
> FYI, I missed that the on-disk bmbt was scanned here the first time
> I went through this code - I had to go back and work out why the
> code only appeared to scrub the incore extent list. Can you wrap
> this whole chunk of code into a helper named xfs_scrub_bmbt()
> so it stands out that this is where the on disk btree is scrubbed?

Sure.  I'll update the xfs_scrub_bmap comment to point this out too.

> 
> > +		if (error == -EDEADLOCK)
> > +			return error;
> 
> Ok, why don't we go to out_unlock here?

<urk> bad code.

> 
> > +		else if (error)
> > +			goto out_unlock;
> 
> But do for all other errors....
> 
> > +		break;
> > +	default:
> > +		xfs_scrub_fblock_set_corrupt(sc, whichfork, 0);
> > +		goto out_unlock;
> > +	}
> > +
> > +	/* Extent data is in memory, so scrub that. */
> > +
> > +	/* Find the offset of the last extent in the mapping. */
> > +	error = xfs_bmap_last_offset(ip, &endoff, whichfork);
> > +	if (!xfs_scrub_fblock_op_ok(sc, whichfork, 0, &error))
> > +		goto out_unlock;
> > +
> > +	/* Scrub extent records. */
> > +	info.lastoff = 0;
> > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > +	for (found = xfs_iext_lookup_extent(ip, ifp, 0, &idx, &irec);
> > +	     found != 0;
> > +	     found = xfs_iext_get_extent(ifp, ++idx, &irec)) {
> > +		if (xfs_scrub_should_terminate(sc, &error))
> > +			break;
> > +		if (isnullstartblock(irec.br_startblock))
> > +			continue;
> > +		if (irec.br_startoff >= endoff) {
> > +			xfs_scrub_fblock_set_corrupt(sc, whichfork,
> > +					irec.br_startoff);
> > +			goto out_unlock;
> > +		}
> > +		error = xfs_scrub_bmap_extent(ip, NULL, &info, &irec);
> > +		if (error == -EDEADLOCK)
> > +			return error;
> > +	}
> > +
> > +out_unlock:
> > +	return error;
> 
> Hmmm - out_unlock doesn't unlock anything?

Heh, it never does.  Baaaaaaaaad label.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 18/25] xfs: scrub directory/attribute btrees
  2017-10-06  5:07   ` Dave Chinner
@ 2017-10-06 18:30     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-06 18:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Fengguang Wu

On Fri, Oct 06, 2017 at 04:07:34PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:42PM -0700, Darrick J. Wong wrote:
> > +/* Find an entry at a certain level in a da btree. */
> > +STATIC void *
> > +xfs_scrub_da_btree_entry(
> > +	struct xfs_scrub_da_btree	*ds,
> > +	int				level,
> > +	int				rec)
> > +{
> > +	char				*ents;
> > +	void				*(*fn)(void *);
> > +	size_t				sz;
> > +	struct xfs_da_state_blk		*blk;
> > +
> > +	/* Dispatch the entry finding function. */
> > +	blk = &ds->state->path.blk[level];
> > +	switch (blk->magic) {
> > +	case XFS_ATTR_LEAF_MAGIC:
> > +	case XFS_ATTR3_LEAF_MAGIC:
> > +		fn = (xfs_da_leaf_ents_fn)xfs_attr3_leaf_entryp;
> > +		sz = sizeof(struct xfs_attr_leaf_entry);
> > +		break;
> > +	case XFS_DIR2_LEAFN_MAGIC:
> > +	case XFS_DIR3_LEAFN_MAGIC:
> > +		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->leaf_ents_p;
> > +		sz = sizeof(struct xfs_dir2_leaf_entry);
> > +		break;
> > +	case XFS_DIR2_LEAF1_MAGIC:
> > +	case XFS_DIR3_LEAF1_MAGIC:
> > +		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->leaf_ents_p;
> > +		sz = sizeof(struct xfs_dir2_leaf_entry);
> > +		break;
> > +	case XFS_DA_NODE_MAGIC:
> > +	case XFS_DA3_NODE_MAGIC:
> > +		fn = (xfs_da_leaf_ents_fn)ds->dargs.dp->d_ops->node_tree_p;
> > +		sz = sizeof(struct xfs_da_node_entry);
> > +		break;
> > +	default:
> > +		return NULL;
> > +	}
> > +
> > +	ents = fn(blk->bp->b_addr);
> > +	return ents + (sz * rec);
> > +}
> 
> This looks kinda unnecessarily abstracted.
> 
> 	case XFS_ATTR_LEAF_MAGIC:
> 	case XFS_ATTR3_LEAF_MAGIC:
> 		ents = xfs_attr3_leaf_entryp(blk->bp->b_addr);
> 		return ents + (rec * sizeof(struct xfs_attr_leaf_entry));
> 
> 	case XFS_DIR2_LEAF1_MAGIC:
> 	case XFS_DIR3_LEAF1_MAGIC:
> 	case XFS_DIR2_LEAFN_MAGIC:
> 	case XFS_DIR3_LEAFN_MAGIC:
> 		ents = ds->dargs.dp->d_ops->leaf_ents_p(blk->bp->b_addr);
> 		return ents + (rec * sizeof(struct xfs_dir2_leaf_entry));
> 
> 	case XFS_DA_NODE_MAGIC:
> 	case XFS_DA3_NODE_MAGIC:
> 		ents = ds->dargs.dp->d_ops->node_tree_p(blk->bp->b_addr);
> 		return ents + (rec * sizeof(struct xfs_da_node_entry));
> 

Ok.

> 
> > +
> > +/* Scrub a da btree hash (key). */
> > +int
> > +xfs_scrub_da_btree_hash(
> > +	struct xfs_scrub_da_btree	*ds,
> > +	int				level,
> > +	__be32				*hashp)
> > +{
> > +	struct xfs_da_state_blk		*blks;
> > +	struct xfs_da_node_entry	*btree;
> 
> *entry?
> 
> > +	xfs_dahash_t			hash;
> > +	xfs_dahash_t			parent_hash;
> > +
> > +	/* Is this hash in order? */
> > +	hash = be32_to_cpu(*hashp);
> > +	if (hash < ds->hashes[level])
> > +		xfs_scrub_da_set_corrupt(ds, level);
> > +	ds->hashes[level] = hash;
> > +
> > +	if (level == 0)
> > +		return 0;
> > +
> > +	/* Is this hash no larger than the parent hash? */
> > +	blks = ds->state->path.blk;
> > +	btree = xfs_scrub_da_btree_entry(ds, level - 1, blks[level - 1].index);
> 
> entry = ?

Makes sense.

> > +	parent_hash = be32_to_cpu(btree->hashval);
> > +	if (parent_hash < hash)
> > +		xfs_scrub_da_set_corrupt(ds, level);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Check a da btree pointer.  Returns true if it's ok to use this
> > + * pointer.
> > + */
> > +STATIC bool
> > +xfs_scrub_da_btree_ptr_ok(
> > +	struct xfs_scrub_da_btree	*ds,
> > +	int				level,
> > +	xfs_dablk_t			blkno)
> > +{
> > +	if (blkno < ds->lowest || (ds->highest != 0 && blkno >= ds->highest)) {
> > +		xfs_scrub_da_set_corrupt(ds, level);
> > +		return false;
> > +	}
> 
> Not sure what lowest and highest are here - the structure definition
> is not commented. I /think/ it's the offset within the directory
> address space for the leaf pointers (i.e. XFS_DIR2_LEAF_OFFSET ->
> XFS_DIR2_FREE_OFFSET for directories), but I'm mostly guessing from
> context here...

Correct.  Will add:

/*
 * Lowest and highest directory block address in which we expect
 * to find dir/attr btree node blocks.  For a directory this
 * (presumably) means between LEAF_OFFSET and FREE_OFFSET; for
 * attributes there is no limit.
 */
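
Spelled out as a standalone sketch (fake_da_range and fake_da_ptr_ok are
illustrative names, and the block numbers in any caller are arbitrary),
the rule is: reject anything below lowest, and reject anything at or
above highest unless highest is zero, which means "no upper limit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the two bounds described above. */
struct fake_da_range {
	uint32_t	lowest;		/* first block we may visit */
	uint32_t	highest;	/* first block past the end; 0 = none */
};

/* Returns true if blkno is somewhere a da btree block may live. */
static bool
fake_da_ptr_ok(const struct fake_da_range *r, uint32_t blkno)
{
	if (blkno < r->lowest)
		return false;
	if (r->highest != 0 && blkno >= r->highest)
		return false;
	return true;
}
```

So a directory would use a bounded range and an attr fork would use
{ 0, 0 }, which accepts every block number.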

> > +
> > +	return true;
> > +}
> > +
> > +/*
> > + * The da btree scrubber can handle leaf1 blocks as a degenerate
> > + * form of da btree.  Since the regular da code doesn't handle
> 
> degenerate form of LEAFN blocks?

Oops, corrected.

> > +
> > +/* Check a block's sibling. */
> > +STATIC int
> > +xfs_scrub_da_btree_block_check_sibling(
> > +	struct xfs_scrub_da_btree	*ds,
> > +	int				level,
> > +	int				direction,
> > +	xfs_dablk_t			sibling)
> > +{
> > +	int				retval;
> > +	int				error;
> > +
> > +	if (!sibling)
> > +		return 0;
> > +
> > +	/* Move the alternate cursor back one block. */
> 
> Move the alternate cursor one block in the direction specified?

Yes, corrected.

> > +	memcpy(&ds->state->altpath, &ds->state->path,
> > +			sizeof(ds->state->altpath));
> > +	error = xfs_da3_path_shift(ds->state, &ds->state->altpath,
> > +			direction, false, &retval);
> > +	if (!xfs_scrub_da_op_ok(ds, level, &error))
> > +		return error;
> > +	if (retval) {
> > +		xfs_scrub_da_set_corrupt(ds, level);
> > +		return error;
> > +	}
> > +
> > +	if (ds->state->altpath.blk[level].blkno != sibling)
> > +		xfs_scrub_da_set_corrupt(ds, level);
> > +	xfs_trans_brelse(ds->dargs.trans, ds->state->altpath.blk[level].bp);
> > +	return error;
> > +}
> > +
> > +/* Check a block's sibling pointers. */
> > +STATIC int
> > +xfs_scrub_da_btree_block_check_siblings(
> > +	struct xfs_scrub_da_btree	*ds,
> > +	int				level,
> > +	struct xfs_da_blkinfo		*hdr)
> > +{
> > +	xfs_dablk_t			forw;
> > +	xfs_dablk_t			back;
> > +	int				error = 0;
> > +
> > +	forw = be32_to_cpu(hdr->forw);
> > +	back = be32_to_cpu(hdr->back);
> > +
> > +	/* Top level blocks should not have sibling pointers. */
> > +	if (level == 0) {
> > +		if (forw != 0 || back != 0)
> > +			xfs_scrub_da_set_corrupt(ds, level);
> > +		return error;
> 
> Error is always zero here?

Yes.

> > +	}
> > +
> > +	/*
> > +	 * Check back (left) and forw (right) pointers.  These functions
> > +	 * absorb error codes for us.
> > +	 */
> > +	error = xfs_scrub_da_btree_block_check_sibling(ds, level, 0, back);
> > +	if (error)
> > +		goto out;
> > +	error = xfs_scrub_da_btree_block_check_sibling(ds, level, 1, forw);
> > +
> > +out:
> > +	memset(&ds->state->altpath, 0, sizeof(ds->state->altpath));
> > +	return error;
> > +}
> > +
> > +/* Load a dir/attribute block from a btree. */
> > +STATIC int
> > +xfs_scrub_da_btree_block(
> > +	struct xfs_scrub_da_btree	*ds,
> > +	int				level,
> > +	xfs_dablk_t			blkno)
> > +{
> > +	struct xfs_da_state_blk		*blk;
> > +	struct xfs_da_intnode		*node;
> > +	struct xfs_da_node_entry	*btree;
> > +	struct xfs_da3_blkinfo		*hdr3;
> > +	struct xfs_da_args		*dargs = &ds->dargs;
> > +	struct xfs_inode		*ip = ds->dargs.dp;
> > +	xfs_ino_t			owner;
> > +	int				*pmaxrecs;
> > +	struct xfs_da3_icnode_hdr	nodehdr;
> > +	int				error;
> > +
> > +	blk = &ds->state->path.blk[level];
> > +	ds->state->path.active = level + 1;
> > +
> > +	/* Release old block. */
> > +	if (blk->bp) {
> > +		xfs_trans_brelse(dargs->trans, blk->bp);
> > +		blk->bp = NULL;
> > +	}
> > +
> > +	/* Check the pointer. */
> > +	blk->blkno = blkno;
> > +	if (!xfs_scrub_da_btree_ptr_ok(ds, level, blkno))
> > +		goto out_nobuf;
> > +
> > +	/* Read the buffer. */
> > +	error = xfs_da_read_buf(dargs->trans, dargs->dp, blk->blkno, -2,
> > +			&blk->bp, dargs->whichfork,
> > +			&xfs_scrub_da_btree_buf_ops);
> 
> Hmmm - this verifier only special cases LEAF1 blocks, with no comments
> as to why it treats everything else with the node verifier. Don't
> we have to special case the attr leaf blocks here as well?

The xfs_da3_node_buf_ops functions already know how to check DA*_NODE,
ATTR*_LEAF, and DIR*_LEAFN blocks; we're only adding DIR*_LEAF1 blocks
to the mix.

Added comment to xfs_scrub_da_btree_{read,write}_verify:

/*
 * xfs_da3_node_buf_ops already knows how to handle
 * DA*_NODE, ATTR*_LEAF, and DIR*_LEAFN blocks.
 */

> > +	if (!xfs_scrub_da_op_ok(ds, level, &error))
> > +		goto out_nobuf;
> > +
> > +	/* It's ok for a directory not to have a da btree in it. */
> > +	if (ds->dargs.whichfork == XFS_DATA_FORK && level == 0 &&
> > +			blk->bp == NULL)
> > +		goto out_nobuf;
> 
> What case is that? single block form? Need a magic number check
> here if that's the case?

It's the same case as the "didn't find a dabtree root block so just jump
out of dabtree checking entirely" case below.  Basically,
xfs_scrub_da_btree first asks xfs_scrub_da_btree_block to find the
block at offset ds.lowest; if _block doesn't find anything mapped there
then it returns a NULL bp, and the outer function sees the NULL bp and
itself jumps out.

Added comment:

/*
 * We didn't find a dir btree root block, which means that
 * there's no LEAF1/LEAFN tree (at least not where it's supposed
 * to be), so jump out now.
 */

> > +/* Visit all nodes and leaves of a da btree. */
> > +int
> > +xfs_scrub_da_btree(
> > +	struct xfs_scrub_context	*sc,
> > +	int				whichfork,
> > +	xfs_scrub_da_btree_rec_fn	scrub_fn)
> > +{
> > +	struct xfs_scrub_da_btree	ds;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_da_state_blk		*blks;
> > +	struct xfs_da_node_entry	*key;
> > +	void				*rec;
> > +	xfs_dablk_t			blkno;
> > +	bool				is_attr;
> > +	int				level;
> > +	int				error;
> > +
> > +	memset(&ds, 0, sizeof(ds));
> 
> I almost missed this - had to go looking later for why the
> ds.maxrecs[] started off at zero. Can we change this to be
> initialised to zero at declaration like so:
> 
> 	struct xfs_scrub_da_btree	ds = {};

Sure.

> > +	/* Skip short format data structures; no btree to scan. */
> > +	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
> > +	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
> > +		return 0;
> > +
> > +	/* Set up initial da state. */
> > +	is_attr = whichfork == XFS_ATTR_FORK;
> > +	ds.dargs.geo = is_attr ? mp->m_attr_geo : mp->m_dir_geo;
> > +	ds.dargs.dp = sc->ip;
> > +	ds.dargs.whichfork = whichfork;
> > +	ds.dargs.trans = sc->tp;
> > +	ds.dargs.op_flags = XFS_DA_OP_OKNOENT;
> > +	ds.state = xfs_da_state_alloc();
> > +	ds.state->args = &ds.dargs;
> > +	ds.state->mp = mp;
> > +	ds.sc = sc;
> > +	blkno = ds.lowest = is_attr ? 0 : ds.dargs.geo->leafblk;
> > +	ds.highest = is_attr ? 0 : ds.dargs.geo->freeblk;
> > +	level = 0;
> 
> bit hard to read with all the ?: constructs. Can we make this:
> 
> 	if (whichfork == XFS_ATTR_FORK) {
> 		ds.dargs.geo = ...
> 		ds.lowest = ..
> 		ds.highest = ...
> 	} else {
> 		....
> 	}
> 	......
> 
> 	blkno = ds.lowest;

Done.

> > +
> > +	/* Find the root of the da tree, if present. */
> > +	blks = ds.state->path.blk;
> > +	error = xfs_scrub_da_btree_block(&ds, level, blkno);
> > +	if (error)
> > +		goto out_state;
> > +	if (blks[level].bp == NULL)
> > +		goto out_state;
> 
> So for a single block directory, we'll jump out here because it's
> at block zero and there's nothing at mp->m_dir_geo.leafblk.
> That means the loop will only ever handle LEAF1/LEAFN format
> directory structures. Correct? (comment?)

Right.

/*
 * We didn't find a block at ds.lowest, which means that there's
 * no LEAF1/LEAFN tree (at least not where it's supposed to be),
 * so jump out now.
 */

> > +	blks[level].index = 0;
> > +	while (level >= 0 && level < XFS_DA_NODE_MAXDEPTH) {
> > +		/* Handle leaf block. */
> > +		if (blks[level].magic != XFS_DA_NODE_MAGIC) {
> > +			/* End of leaf, pop back towards the root. */
> > +			if (blks[level].index >= ds.maxrecs[level]) {
> > +				if (level > 0)
> > +					blks[level - 1].index++;
> > +				ds.tree_level++;
> > +				level--;
> > +				continue;
> > +			}
> > +
> > +			/* Dispatch record scrubbing. */
> > +			rec = xfs_scrub_da_btree_entry(&ds, level,
> > +					blks[level].index);
> > +			error = scrub_fn(&ds, level, rec);
> > +			if (error < 0 ||
> > +			    error == XFS_BTREE_QUERY_RANGE_ABORT)
> 
> When would we get a XFS_BTREE_QUERY_RANGE_ABORT error?

In theory the scrub_fn could return that to signal a non-error abort.
Between the dabtree and the btree scrubbers none of them actually do that,
so in theory this could be removed.
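
For reference, the convention in question looks something like the
sketch below.  The names (fake_walk, FAKE_ITER_ABORT and so on) are
invented for illustration, and whether the walker swallows the abort
and returns zero, as it does here, is my assumption for the sketch
rather than a statement about the patch.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_ITER_ABORT	1	/* hypothetical "stop, not an error" */

typedef int (*fake_rec_fn)(int rec, void *priv);

/*
 * Walk the records.  A negative return from the callback is an error
 * and is passed up; FAKE_ITER_ABORT stops the walk early without
 * being treated as a failure.
 */
static int
fake_walk(const int *recs, size_t nr, fake_rec_fn fn, void *priv)
{
	size_t	i;
	int	ret;

	for (i = 0; i < nr; i++) {
		ret = fn(recs[i], priv);
		if (ret < 0)
			return ret;
		if (ret == FAKE_ITER_ABORT)
			break;
	}
	return 0;
}

/* Example callback: count records, stop at zero, fail on negatives. */
static int
fake_count_until_zero(int rec, void *priv)
{
	int	*count = priv;

	if (rec < 0)
		return -22;
	if (rec == 0)
		return FAKE_ITER_ABORT;
	(*count)++;
	return 0;
}
```

If no callback ever returns the abort value, the second test in the
walker is dead code and can go.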

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 19/25] xfs: scrub directory metadata
  2017-10-06  7:07   ` Dave Chinner
@ 2017-10-06 19:45     ` Darrick J. Wong
  2017-10-06 22:16       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-06 19:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Oct 06, 2017 at 06:07:14PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:49PM -0700, Darrick J. Wong wrote:
> > +
> > +/* Set us up to scrub a file's contents. */
> > +int
> > +xfs_scrub_setup_inode_contents(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip,
> > +	unsigned int			resblks)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error;
> > +
> > +	error = xfs_scrub_get_inode(sc, ip);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Got the inode, lock it and we're ready to go. */
> > +	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > +	xfs_ilock(sc->ip, sc->ilock_flags);
> > +	error = xfs_scrub_trans_alloc(sc->sm, mp, &sc->tp);
> > +	if (error)
> > +		goto out_unlock;
> > +	sc->ilock_flags |= XFS_ILOCK_EXCL;
> > +	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
> > +
> > +	return 0;
> > +out_unlock:
> > +	xfs_iunlock(sc->ip, sc->ilock_flags);
> > +	if (sc->ip != ip)
> > +		iput(VFS_I(sc->ip));
> > +	sc->ip = NULL;
> > +	return error;
> > +}
> 
> I've seen this get/lock/alloc code many times now - seems like scope
> for factoring?

(Yeah, they're all gone now.)

> > +/* Scrub a directory entry. */
> > +
> > +struct xfs_scrub_dir_ctx {
> > +	struct dir_context		dc;
> > +	struct xfs_scrub_context	*sc;
> > +};
> 
> These two character variable names make the code really hard to
> understand, especially when....
> 
> > +
> > +/* Check that an inode's mode matches a given DT_ type. */
> > +STATIC int
> > +xfs_scrub_dir_check_ftype(
> > +	struct xfs_scrub_dir_ctx	*sdc,
> 
> ... we now have sc, dc, and sdc all intertwined. Especially as some
> of these functions seem to invert the calling convention of the rest
> of the scrub code in that the scrub context is the primary object
> that is passed around and everything is attached to the sc....

The change in calling conventions is due to the fact that we're using
the VFS filldir iterator, which doesn't have a facility for passing
through a (void *).  To reuse that code, we therefore must use this
annoying xfs_scrub_dir_ctx wrapper so that we can pass our own context
through to the _actor function.

Now, I /could/ make it a little clearer simply by doing this instead:

/*
 * Scrub a single directory entry.
 *
 * We use the VFS directory iterator (i.e. readdir) to call this
 * function for every directory entry in a directory.  Once we're here,
 * we check the inode number to make sure it's sane, then we check that
 * we can look up this filename.  Finally, we check the ftype.
 */
STATIC inline int
xfs_scrub_dir_entry(
	struct xfs_scrub_context	*sc,
	const char			*name,
	int				namelen,
	loff_t				pos,
	u64				ino,
	unsigned			type)
{
	/* do actual dirent scrubbing here */
}

/* Adapter for VFS filldir iterator function. */
STATIC int
xfs_scrub_readdir_actor(
	struct dir_context		*dir_iter,
	const char			*name,
	int				namelen,
	loff_t				pos,
	u64				ino,
	unsigned			type)
{
	struct xfs_scrub_dir_ctx	*sdc;

	sdc = container_of(dir_iter, struct xfs_scrub_dir_ctx, dir_iter);
	return xfs_scrub_dir_entry(sdc->sc, name, namelen, pos, ino, type);
}

At least that would contain the xfs_scrub_dir_ctx nastiness to a smaller
part of the code.  Better?
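
For anyone who hasn't run into the trick before, here is a userspace
sketch of the wrapper pattern.  All of the fake_* names are invented
for illustration and fake_container_of is a hand-rolled equivalent of
the kernel macro: the iterator only ever sees the embedded context, and
the adapter recovers the enclosing structure from its address.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define fake_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fake_dir_context;
typedef int (*fake_filldir_t)(struct fake_dir_context *ctx,
			      const char *name, unsigned long long ino);

/* What the iterator knows about: a callback and nothing else. */
struct fake_dir_context {
	fake_filldir_t	actor;
};

struct fake_scrub_context {
	int	entries_seen;
};

/* Wrapper: embeds the iterator context next to our own pointer. */
struct fake_scrub_dir_ctx {
	struct fake_dir_context		dir_iter;
	struct fake_scrub_context	*sc;
};

/* The real work takes the scrub context, like everything else. */
static int
fake_scrub_dir_entry(struct fake_scrub_context *sc, const char *name,
		     unsigned long long ino)
{
	(void)name;
	if (ino == 0)
		return -1;	/* inode number zero is never valid here */
	sc->entries_seen++;
	return 0;
}

/* Adapter: recover the wrapper from the embedded member. */
static int
fake_readdir_actor(struct fake_dir_context *dir_iter, const char *name,
		   unsigned long long ino)
{
	struct fake_scrub_dir_ctx	*sdc;

	sdc = fake_container_of(dir_iter, struct fake_scrub_dir_ctx,
				dir_iter);
	return fake_scrub_dir_entry(sdc->sc, name, ino);
}

/* Stand-in for readdir: calls the actor once per entry. */
static int
fake_readdir(struct fake_dir_context *ctx, const char *const *names,
	     const unsigned long long *inos, size_t nr)
{
	size_t	i;
	int	ret;

	for (i = 0; i < nr; i++) {
		ret = ctx->actor(ctx, names[i], inos[i]);
		if (ret)
			return ret;
	}
	return 0;
}
```

Note that the third argument to the macro has to name the embedded
member exactly, so the member and the macro call must be renamed
together.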

> Brain hurts trying to keep all this straight here...
> 
> /me keeps blundering around until it becomes apparent that
> dir_context has nothing to do with scrub, but is a VFS filldir
> callback structure triggered through readdir.

Correct.

> Darrick, can you add some comments explaining how the dirent
> scrubbing works?  It'd make it so much easier to understand if I
> didn't have to reverse engineer the intent and design from the
> patch. I'm going to skip this one for now...

Ok.  Down by the xfs_readdir call I will have:

/*
 * Look up every name in this directory by hash.
 *
 * Use the xfs_readdir function to call xfs_scrub_dir_actor on every
 * directory entry in this directory.  In _actor, we check the name,
 * inode number, and ftype (if applicable) of the entry.  xfs_readdir
 * uses the VFS filldir functions to provide iteration context.
 *
 * The VFS grabs a read or write lock via i_rwsem before it reads or
 * writes to a directory.  If we've gotten this far we've already
 * obtained IOLOCK_EXCL, which (since 4.10) is the same as getting a
 * write lock on i_rwsem.  Therefore, it is safe for us to drop the
 * ILOCK here in order to reuse the _readdir and _dir_lookup routines,
 * which do their own ILOCK locking.
 */

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 19/25] xfs: scrub directory metadata
  2017-10-06 19:45     ` Darrick J. Wong
@ 2017-10-06 22:16       ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2017-10-06 22:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Oct 06, 2017 at 12:45:39PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 06, 2017 at 06:07:14PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:42:49PM -0700, Darrick J. Wong wrote:
> > > +/* Scrub a directory entry. */
> > > +
> > > +struct xfs_scrub_dir_ctx {
> > > +	struct dir_context		dc;
> > > +	struct xfs_scrub_context	*sc;
> > > +};
> > 
> > These two character variable names make the code really hard to
> > understand, especially when....
> > 
> > > +
> > > +/* Check that an inode's mode matches a given DT_ type. */
> > > +STATIC int
> > > +xfs_scrub_dir_check_ftype(
> > > +	struct xfs_scrub_dir_ctx	*sdc,
> > 
> > ... we now have sc, dc, and sdc all intertwined. Especially as some
> > of these functions seem to invert the calling convention of the rest
> > of the scrub code in that the scrub context is the primary object
> > that is passed around and everything is attached to the sc....
> 
> The change in calling conventions is due to the fact that we're using
> the VFS filldir iterator, which doesn't have a facility for passing
> through a (void *).  To reuse that code, we therefore must use this
> annoying xfs_scrub_dir_ctx wrapper so that we can pass our own context
> through to the _actor function.
> 
> Now, I /could/ make it a little clearer simply by doing this instead:
> 
> /*
>  * Scrub a single directory entry.
>  *
>  * We use the VFS directory iterator (i.e. readdir) to call this
>  * function for every directory entry in a directory.  Once we're here,
>  * we check the inode number to make sure it's sane, then we check that
>  * we can look up this filename.  Finally, we check the ftype.
>  */
> STATIC inline int
> xfs_scrub_dir_entry(
> 	struct xfs_scrub_context	*sc,
> 	const char			*name,
> 	int				namelen,
> 	loff_t				pos,
> 	u64				ino,
> 	unsigned			type)
> {
> 	/* do actual dirent scrubbing here */
> }
> 
> /* Adapter for VFS filldir iterator function. */
> STATIC int
> xfs_scrub_readdir_actor(
> 	struct dir_context		*dir_iter,
> 	const char			*name,
> 	int				namelen,
> 	loff_t				pos,
> 	u64				ino,
> 	unsigned			type)
> {
> 	struct xfs_scrub_dir_ctx	*sdc;
> 
> 	sdc = container_of(dir_iter, struct xfs_scrub_dir_ctx, dir_iter);
> 	return xfs_scrub_dir_entry(sdc->sc, name, namelen, pos, ino, type);
> }
> 
> At least that would contain the xfs_scrub_dir_ctx nastiness to a smaller
> part of the code.  Better?

Yeah, that does help, and with the comments you've added it makes it
easier to wrap my feeble brain around. Thanks!
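
For anyone following along, the wrapper-plus-adapter pattern discussed
above can be tried outside the kernel.  This is only a userspace sketch:
iter_ctx, scrub_ctx, scrub_dir_ctx and both functions are hypothetical
stand-ins, not the VFS or XFS definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stands in for struct dir_context: all the iterator hands back to us. */
struct iter_ctx {
	long			pos;
};

/* Stands in for struct xfs_scrub_context. */
struct scrub_ctx {
	int			nr_entries;
};

/* Wrapper that carries our context alongside the iterator's. */
struct scrub_dir_ctx {
	struct iter_ctx		dir_iter;
	struct scrub_ctx	*sc;
};

/* The real work only ever sees the scrub context. */
static int
scrub_dir_entry(struct scrub_ctx *sc, const char *name)
{
	assert(sc != NULL && name != NULL);
	sc->nr_entries++;
	return 0;
}

/* Adapter: recover the wrapper from its embedded member, then delegate. */
static int
readdir_actor(struct iter_ctx *dir_iter, const char *name)
{
	struct scrub_dir_ctx	*sdc;

	sdc = container_of(dir_iter, struct scrub_dir_ctx, dir_iter);
	return scrub_dir_entry(sdc->sc, name);
}
```

Only the adapter needs to know the wrapper exists; everything else keeps
the usual "scrub context first" calling convention.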

-Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 17/25] xfs: scrub inode block mappings
  2017-10-06 17:00     ` Darrick J. Wong
@ 2017-10-07 23:10       ` Dave Chinner
  2017-10-08  3:54         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-07 23:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Oct 06, 2017 at 10:00:44AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 06, 2017 at 01:51:23PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:42:36PM -0700, Darrick J. Wong wrote:
> > > +	    daddr >= info->eofs ||
> > > +	    daddr + dlen > info->eofs)
> > > +		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> > > +				irec->br_startoff);
> > > +
> > > +	if (irec->br_state == XFS_EXT_UNWRITTEN &&
> > > +	    !xfs_sb_version_hasextflgbit(&mp->m_sb))
> > 
> > Superblock scrubber should reject any filesystem without the
> > extflgbit as corrupt - it's always set by mkfs - so I'm not sure we
> > need to check this here.
> 
> What happens if scrub encounters a v4 filesystem without EXTFLGBIT?

We don't support such filesystems (user level stale data exposure
security risk), but we have allowed them to continue to mount
because such filesystems did exist in the past (*cough* SGI DMF
database partitions *cough*). I'd suggest scrub should say
filesystems without the flag are bad or too old to be scrubbed
correctly.

> The superblock scrubber only checks that the secondary superblocks are
> consistent (geometry-wise) with sb 0, and mount doesn't prohibit
> !EXTFLGBIT filesystems from mounting.  fallocate and friends even work,
> albeit slower because we actually write zeroes to the disk in lieu of
> setting the unwritten flag, apparently.

Unwritten extents were enabled by default in 2003, and the mkfs flag
was dropped completely in 2007. So no filesystem made in the past
ten years should have this set.

> But, seeing as mkfs always sets EXTFLGBIT and v5 implies the feature
> even if the bit isn't set; and there's no way to turn off the feature
> bit (except unsupported things like xfs_db -x), are you suggesting that
> we should simply end support for mounting !EXTFLGBIT v4 filesystems?

I wasn't suggesting that, but perhaps we should.

> > > +
> > > +out_unlock:
> > > +	return error;
> > 
> > Hmmm - out_unlock doesn't unlock anything?
> 
> Heh, it never does.  Baaaaaaaaad label.

:P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 17/25] xfs: scrub inode block mappings
  2017-10-07 23:10       ` Dave Chinner
@ 2017-10-08  3:54         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-08  3:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sun, Oct 08, 2017 at 10:10:55AM +1100, Dave Chinner wrote:
> On Fri, Oct 06, 2017 at 10:00:44AM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 06, 2017 at 01:51:23PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:42:36PM -0700, Darrick J. Wong wrote:
> > > > +	    daddr >= info->eofs ||
> > > > +	    daddr + dlen > info->eofs)
> > > > +		xfs_scrub_fblock_set_corrupt(info->sc, info->whichfork,
> > > > +				irec->br_startoff);
> > > > +
> > > > +	if (irec->br_state == XFS_EXT_UNWRITTEN &&
> > > > +	    !xfs_sb_version_hasextflgbit(&mp->m_sb))
> > > 
> > > Superblock scrubber should reject any filesystem without the
> > > extflgbit as corrupt - it's always set by mkfs - so I'm not sure we
> > > need to check this here.
> > 
> > What happens if scrub encounters a v4 filesystem without EXTFLGBIT?
> 
> We don't support such filesystems (user level stale data exposure
> security risk), but we have allowed them to continue to mount
> because such filesystems did exist in the past (*cough* SGI DMF
> database partitions *cough*). I'd suggest scrub should say
> filesystems without the flag are bad or too old to be scrubbed
> correctly.

Ok, works for me.

> > The superblock scrubber only checks that the secondary superblocks are
> > consistent (geometry-wise) with sb 0, and mount doesn't prohibit
> > !EXTFLGBIT filesystems from mounting.  fallocate and friends even work,
> > albeit slower because we actually write zeroes to the disk in lieu of
> > setting the unwritten flag, apparently.
> 
> Unwritten extents were enabled by default in 2003, and the mkfs flag
> was dropped completely in 2007. So no filesystem made in the past
> ten years should have this set.
> 
> > But, seeing as mkfs always sets EXTFLGBIT and v5 implies the feature
> > even if the bit isn't set; and there's no way to turn off the feature
> > bit (except unsupported things like xfs_db -x), are you suggesting that
> > we should simply end support for mounting !EXTFLGBIT v4 filesystems?
> 
> I wasn't suggesting that, but perhaps we should.

Anyone want to send a patch to get people out of the woodwork?

:)

> > > > +
> > > > +out_unlock:
> > > > +	return error;
> > > 
> > > Hmmm - out_unlock doesn't unlock anything?
> > 
> > Heh, it never does.  Baaaaaaaaad label.

(Fixed.)

--D

> :P
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 20/25] xfs: scrub directory freespace
  2017-10-03 20:42 ` [PATCH 20/25] xfs: scrub directory freespace Darrick J. Wong
@ 2017-10-09  1:44   ` Dave Chinner
  2017-10-09 22:54     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-09  1:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:42:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Check the free space information in a directory.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/dir.c |  347 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 347 insertions(+)
> 
> 
> diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
> index e58252b..6ea06c3 100644
> --- a/fs/xfs/scrub/dir.c
> +++ b/fs/xfs/scrub/dir.c
> @@ -239,6 +239,348 @@ xfs_scrub_dir_rec(
>  	return error;
>  }
>  
> +/* Is this free entry either in the bestfree or smaller than all of them? */
> +static inline void
> +xfs_scrub_directory_check_free_entry(
> +	struct xfs_scrub_context	*sc,
> +	xfs_dablk_t			lblk,
> +	struct xfs_dir2_data_free	*bf,
> +	struct xfs_dir2_data_unused	*dup)
> +{
> +	struct xfs_dir2_data_free	*dfp;
> +	unsigned int			smallest;
> +
> +	smallest = -1U;

Urk. That's the same as "smallest = UINT_MAX", and so ......

> +	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
> +		if (dfp->offset &&
> +		    be16_to_cpu(dfp->length) == be16_to_cpu(dup->length))
> +			return;
> +		if (smallest < be16_to_cpu(dfp->length))
> +			smallest = be16_to_cpu(dfp->length);

.... how does this work? Shouldn't it be a ">" check here?
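
To make the concern concrete: starting from UINT_MAX, a "smallest <
length" test can never be true, so the minimum is never updated and the
final comparison never fires.  A userspace sketch of the check with the
comparison flipped follows.  The free_slot struct, FD_COUNT and
free_entry_ok() are hypothetical host-endian stand-ins for the on-disk
structures, not the patch itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FD_COUNT 3	/* stands in for XFS_DIR2_DATA_FD_COUNT */

/* Simplified, host-endian stand-in for struct xfs_dir2_data_free. */
struct free_slot {
	uint16_t	offset;		/* 0 means the slot is unused */
	uint16_t	length;
};

/*
 * Return true if a free entry of length @len is consistent with the
 * bestfree table: either it matches a used slot, or it is no larger
 * than the smallest slot in the table.
 */
static bool
free_entry_ok(const struct free_slot *bf, uint16_t len)
{
	unsigned int	smallest = UINT_MAX;
	size_t		i;

	assert(bf != NULL);
	for (i = 0; i < FD_COUNT; i++) {
		if (bf[i].offset && bf[i].length == len)
			return true;
		/* Track the minimum: shrink whenever we see a smaller slot. */
		if (smallest > bf[i].length)
			smallest = bf[i].length;
	}
	return len <= smallest;
}
```

With slots of length 100, 50 and 20, a free entry of length 60 is now
rejected, whereas the "<" version would have let it through.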


> +	}
> +
> +	if (be16_to_cpu(dup->length) > smallest)
> +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +}
> +
> +/* Check free space info in a directory data block. */
> +STATIC int
> +xfs_scrub_directory_data_bestfree(
> +	struct xfs_scrub_context	*sc,
> +	xfs_dablk_t			lblk,
> +	bool				is_block)
> +{
> +	struct xfs_dir2_data_unused	*dup;
> +	struct xfs_dir2_data_free	*dfp;
> +	struct xfs_buf			*bp;
> +	struct xfs_dir2_data_free	*bf;
> +	struct xfs_mount		*mp = sc->mp;
> +	char				*ptr;
> +	char				*endptr;
> +	u16				tag;
> +	int				newlen;
> +	int				offset;
> +	int				error;
> +
> +	if (is_block) {
> +		/* dir block format */
> +		if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET))
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
> +	} else {
> +		/* dir data format */
> +		error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, -1, &bp);
> +	}
> +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> +		goto out;
> +
> +	/* Do the bestfrees correspond to actual free space? */
> +	bf = sc->ip->d_ops->data_bestfree_p(bp->b_addr);

With the number of d_ops callouts in this code, a local dops
variable might be in order.

> +	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
> +		offset = be16_to_cpu(dfp->offset);
> +		if (offset == 0)
> +			continue;
> +		if (offset >= BBTOB(bp->b_length)) {
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +			continue;
> +		}

Not sure I like all the checks against and calculations using
bp->b_length in this function. It would be more correct to check
against geo->blksize.

> +		dup = (struct xfs_dir2_data_unused *)(bp->b_addr + offset);
> +		tag = be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup));
> +
> +		if (dup->freetag != cpu_to_be16(XFS_DIR2_DATA_FREE_TAG) ||
> +		    be16_to_cpu(dup->length) != be16_to_cpu(dfp->length) ||
> +		    tag != ((char *)dup - (char *)bp->b_addr))
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +	}

Also, count the number of best frees here.

> +
> +	/* Make sure the bestfrees are actually the best free spaces. */
> +	ptr = (char *)sc->ip->d_ops->data_entry_p(bp->b_addr);
> +	if (is_block) {
> +		struct xfs_dir2_block_tail	*btp;
> +
> +		btp = xfs_dir2_block_tail_p(sc->mp->m_dir_geo, bp->b_addr);

mp->m_dir_geo

> +		endptr = (char *)xfs_dir2_block_leaf_p(btp);
> +	} else
> +		endptr = (char *)bp->b_addr + BBTOB(bp->b_length);




> +	while (ptr < endptr) {
> +		dup = (struct xfs_dir2_data_unused *)ptr;
> +		/* Skip real entries */
> +		if (dup->freetag != cpu_to_be16(XFS_DIR2_DATA_FREE_TAG)) {
> +			struct xfs_dir2_data_entry	*dep;
> +
> +			dep = (struct xfs_dir2_data_entry *)ptr;
> +			newlen = sc->ip->d_ops->data_entsize(dep->namelen);
> +			if (newlen <= 0) {
> +				xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> +						lblk);
> +				goto out_buf;
> +			}
> +			ptr += newlen;
> +			if (endptr < ptr)
> +				xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> +					      lblk);
> +			continue;
> +		}
> +
> +		/* Spot check this free entry */
> +		tag = be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup));
> +		if (tag != ((char *)dup - (char *)bp->b_addr))
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +
> +		/*
> +		 * Either this entry is a bestfree or it's smaller than
> +		 * any of the bestfrees.
> +		 */
> +		xfs_scrub_directory_check_free_entry(sc, lblk, bf, dup);

So this checks if the entry is in the bestfree, but it doesn't
tell us if the bestfree array has the correct number of entries...

> +
> +		/* Move on. */
> +		newlen = be16_to_cpu(dup->length);
> +		if (newlen <= 0) {
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +			goto out_buf;
> +		}
> +		ptr += newlen;
> +		if (endptr < ptr)
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);

Count the number of free entries here.

> +	}

And now check that the number of bestfrees vs free entries is
valid. If there's more than 3 free entries in the block, the
bestfrees array should be full...
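
A rough userspace sketch of that count cross-check.  It assumes the
rule is "used bestfree slots == min(free entries, table size)", which is
one reading of the comment above; all names here are hypothetical and
the struct is a host-endian simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FD_COUNT 3	/* stands in for XFS_DIR2_DATA_FD_COUNT */

/* Simplified, host-endian stand-in for struct xfs_dir2_data_free. */
struct free_slot {
	uint16_t	offset;		/* 0 means the slot is unused */
	uint16_t	length;
};

/* Count the bestfree slots that actually point at something. */
static unsigned int
count_bestfrees(const struct free_slot *bf)
{
	unsigned int	nr = 0;
	size_t		i;

	assert(bf != NULL);
	for (i = 0; i < FD_COUNT; i++)
		if (bf[i].offset)
			nr++;
	return nr;
}

/*
 * Assumed rule: the table holds as many entries as there are free
 * regions in the block, capped at the table size.
 */
static bool
bestfree_count_ok(unsigned int nr_bestfrees, unsigned int nr_frees)
{
	unsigned int	expected;

	expected = nr_frees < FD_COUNT ? nr_frees : FD_COUNT;
	return nr_bestfrees == expected;
}
```

The first counter would be bumped in the bestfree loop and the second in
the walk over the data entries, with the comparison done once both loops
have finished.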

> +out_buf:
> +	xfs_trans_brelse(sc->tp, bp);
> +out:
> +	return error;
> +}
> +
> +/* Is this the longest free entry in the block? */
> +static inline void
> +xfs_scrub_directory_check_freesp(
> +	struct xfs_scrub_context	*sc,
> +	xfs_dablk_t			lblk,
> +	struct xfs_buf			*dbp,
> +	unsigned int			len)
> +{
> +	struct xfs_dir2_data_free	*bf;
> +	struct xfs_dir2_data_free	*dfp;
> +	unsigned int			longest = 0;
> +	int				offset;
> +
> +	bf = sc->ip->d_ops->data_bestfree_p(dbp->b_addr);
> +	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
> +		offset = be16_to_cpu(dfp->offset);
> +		if (!offset)
> +			continue;
> +		if (longest < be16_to_cpu(dfp->length))
> +			longest = be16_to_cpu(dfp->length);
> +	}
> +
> +	if (longest != len)
> +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +}

This needs a better explanation - it's called to check whether the
freespace length in the freespace index matches the longest
freespace in the data block bests array.

And from that, the data block bests array is supposed to be ordered
from largest to smallest, yes? As per __xfs_dir3_data_check():


	XFS_WANT_CORRUPTED_RETURN(mp, be16_to_cpu(bf[0].length) >=
						be16_to_cpu(bf[1].length));
	XFS_WANT_CORRUPTED_RETURN(mp, be16_to_cpu(bf[1].length) >=
						be16_to_cpu(bf[2].length));

So why doesn't this code just check the first entry in the array?

Hmmm, and now I've remembered that, xfs_scrub_directory_check_free_entry()
probably only needs to do a reverse scan just to find the smallest
non-zero entry...
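
If the ordering holds, both lookups get simpler.  A userspace sketch,
assuming the table is sorted by descending length as the checks quoted
above require; the struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FD_COUNT 3	/* stands in for XFS_DIR2_DATA_FD_COUNT */

/* Simplified, host-endian stand-in for struct xfs_dir2_data_free. */
struct free_slot {
	uint16_t	offset;		/* 0 means the slot is unused */
	uint16_t	length;
};

/* Same ordering rule as the XFS_WANT_CORRUPTED_RETURN checks. */
static bool
bestfree_sorted(const struct free_slot *bf)
{
	size_t		i;

	assert(bf != NULL);
	for (i = 1; i < FD_COUNT; i++)
		if (bf[i - 1].length < bf[i].length)
			return false;
	return true;
}

/* In a sorted table the longest free region is simply the first slot. */
static unsigned int
bestfree_longest(const struct free_slot *bf)
{
	assert(bf != NULL);
	return bf[0].length;
}

/* Scan backwards to find the smallest slot that is actually in use. */
static unsigned int
bestfree_smallest_used(const struct free_slot *bf)
{
	size_t		i;

	assert(bf != NULL);
	for (i = FD_COUNT; i-- > 0; )
		if (bf[i].offset)
			return bf[i].length;
	return 0;	/* empty table */
}
```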

> +/* Check free space info in a directory leaf1 block. */
> +STATIC int
> +xfs_scrub_directory_leaf1_bestfree(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_da_args		*args,
> +	xfs_dablk_t			lblk)
> +{
> +	struct xfs_dir2_leaf_tail	*ltp;
> +	struct xfs_buf			*dbp;
> +	struct xfs_buf			*bp;
> +	struct xfs_mount		*mp = sc->mp;
> +	__be16				*bestp;
> +	__u16				best;
> +	int				i;
> +	int				error;
> +
> +	/* Read the free space block */
> +	error = xfs_dir3_leaf_read(sc->tp, sc->ip, lblk, -1, &bp);
> +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> +		goto out;
> +
> +	/* Check all the entries. */
> +	ltp = xfs_dir2_leaf_tail_p(mp->m_dir_geo, bp->b_addr);
> +	bestp = xfs_dir2_leaf_bests_p(ltp);
> +	for (i = 0; i < be32_to_cpu(ltp->bestcount); i++, bestp++) {
> +		best = be16_to_cpu(*bestp);
> +		if (best == NULLDATAOFF)
> +			continue;
> +		error = xfs_dir3_data_read(sc->tp, sc->ip,
> +				i * args->geo->fsbcount, -1, &dbp);
> +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> +			continue;
> +		xfs_scrub_directory_check_freesp(sc, lblk, dbp, best);
> +		xfs_trans_brelse(sc->tp, dbp);
> +	}
> +out:
> +	return error;

This needs comments to explain what it is not checking because those
checks were done in the verifier. (i.e.  hash index does not overlap
the freespace index, stale entry count is valid).

hmmmm. More philosophical question: should we rerun the verifiers
in the scrubber manually to guarantee that we fully cover whatever
is in memory on cached and modified buffers?

> +
> +/* Check free space info in a directory freespace block. */
> +STATIC int
> +xfs_scrub_directory_free_bestfree(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_da_args		*args,
> +	xfs_dablk_t			lblk)
> +{
> +	struct xfs_dir3_icfree_hdr	freehdr;
> +	struct xfs_buf			*dbp;
> +	struct xfs_buf			*bp;
> +	__be16				*bestp;
> +	__be16				best;
> +	int				i;
> +	int				error;
> +
> +	/* Read the free space block */
> +	error = xfs_dir2_free_read(sc->tp, sc->ip, lblk, &bp);
> +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> +		goto out;
> +
> +	/* Check all the entries. */
> +	sc->ip->d_ops->free_hdr_from_disk(&freehdr, bp->b_addr);
> +	bestp = sc->ip->d_ops->free_bests_p(bp->b_addr);
> +	for (i = 0; i < freehdr.nvalid; i++, bestp++) {
> +		best = be16_to_cpu(*bestp);
> +		if (best == NULLDATAOFF)
> +			continue;
> +		error = xfs_dir3_data_read(sc->tp, sc->ip,
> +				(freehdr.firstdb + i) * args->geo->fsbcount,
> +				-1, &dbp);
> +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> +			continue;
> +		xfs_scrub_directory_check_freesp(sc, lblk, dbp, best);
> +		xfs_trans_brelse(sc->tp, dbp);
> +	}
> +out:
> +	return error;
> +}
> +
> +/* Check free space information in directories. */
> +STATIC int
> +xfs_scrub_directory_blocks(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_bmbt_irec		got;
> +	struct xfs_da_args		args;
> +	struct xfs_ifork		*ifp;
> +	struct xfs_mount		*mp = sc->mp;
> +	xfs_fileoff_t			leaf_lblk;
> +	xfs_fileoff_t			free_lblk;
> +	xfs_fileoff_t			lblk;
> +	xfs_extnum_t			idx;
> +	bool				found;
> +	int				is_block = 0;
> +	int				error;
> +
> +	/* Ignore local format directories. */
> +	if (sc->ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
> +	    sc->ip->i_d.di_format != XFS_DINODE_FMT_BTREE)
> +		return 0;
> +
> +	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
> +	lblk = XFS_B_TO_FSB(mp, XFS_DIR2_DATA_OFFSET);
> +	leaf_lblk = XFS_B_TO_FSB(mp, XFS_DIR2_LEAF_OFFSET);
> +	free_lblk = XFS_B_TO_FSB(mp, XFS_DIR2_FREE_OFFSET);
> +
> +	/* Is this a block dir? */
> +	args.dp = sc->ip;
> +	args.geo = mp->m_dir_geo;
> +	args.trans = sc->tp;
> +	error = xfs_dir2_isblock(&args, &is_block);
> +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> +		goto out;
> +
> +	/* Iterate all the data extents in the directory... */
> +	found = xfs_iext_lookup_extent(sc->ip, ifp, lblk, &idx, &got);
> +	while (found) {
> +		/* No more data blocks... */
> +		if (got.br_startoff >= leaf_lblk)
> +			break;

If it's a block dir and got.br_startoff > 0, then it's corrupt?

> +
> +		/* Check each data block's bestfree data */
> +		for (lblk = roundup((xfs_dablk_t)got.br_startoff,
> +				args.geo->fsbcount);
> +		     lblk < got.br_startoff + got.br_blockcount;
> +		     lblk += args.geo->fsbcount) {

This is not obvious as to why it works with discontiguous directory
blocks. I think it's because it grabs the aligned start block of
each directory block and then internally the blocks get mapped
correctly via the directory block read functions, but this
definitely needs a better comment explaining the iteration mechanism
being used here....

> +			error = xfs_scrub_directory_data_bestfree(sc, lblk,
> +					is_block);
> +			if (error)
> +				goto out;
> +		}
> +
> +		found = xfs_iext_get_extent(ifp, ++idx, &got);

As it is, I think this is going to check discontiguous directory
blocks multiple times. It's going to find each extent in a
discontiguous dir block, round it up to the next dirblock and
scan that next dirblock. It then finds the next block in the
current discontig block, rounds it up to the next dirblock, and
scans it again....

I think it would be much better to use xfs_iext_lookup_extent() here
to iterate by expected start block rather than iterating by extent
index.
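
A userspace sketch of what iterating by expected start block could look
like.  The extent array and lookup() only loosely model the incore
extent list and xfs_iext_lookup_extent(); every name here is
hypothetical, and the model assumes a directory block is visited when
its first filesystem block is mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mapped extent, in filesystem blocks. */
struct extent {
	uint64_t	startoff;
	uint64_t	blockcount;
};

static uint64_t
round_up(uint64_t x, uint64_t step)
{
	return ((x + step - 1) / step) * step;
}

/* Return the first extent that ends after @off, or NULL if none. */
static const struct extent *
lookup(const struct extent *ext, size_t nr_ext, uint64_t off)
{
	size_t		i;

	for (i = 0; i < nr_ext; i++)
		if (ext[i].startoff + ext[i].blockcount > off)
			return &ext[i];
	return NULL;
}

/*
 * Visit each directory block below @limit exactly once, stepping by
 * the directory block size (@fsbcount) rather than by extent index.
 * Returns the number of directory blocks visited.
 */
static size_t
walk_dir_blocks(const struct extent *ext, size_t nr_ext, uint64_t fsbcount,
		uint64_t limit, uint64_t *visited, size_t max_visited)
{
	const struct extent	*got;
	uint64_t		lblk = 0;
	size_t			nr = 0;

	assert(fsbcount > 0);
	while (lblk < limit) {
		got = lookup(ext, nr_ext, lblk);
		if (!got)
			break;
		if (got->startoff > lblk) {
			/* Hole: jump to the next aligned candidate. */
			lblk = round_up(got->startoff, fsbcount);
			continue;
		}
		if (nr < max_visited)
			visited[nr] = lblk;
		nr++;
		lblk += fsbcount;
	}
	return nr;
}
```

Because the cursor only ever advances in whole directory blocks, a
directory block that is split across several extents is still scanned
once, no matter how many extent records back it.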

> +	}
> +
> +	/* Look for a leaf1 block, which has free info. */
> +	if (xfs_iext_lookup_extent(sc->ip, ifp, leaf_lblk, &idx, &got) &&
> +	    got.br_startoff == leaf_lblk &&
> +	    got.br_blockcount == args.geo->fsbcount &&
> +	    !xfs_iext_get_extent(ifp, ++idx, &got)) {
> +		if (is_block) {
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> +			goto not_leaf1;

Can just abort the scrub at this point.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 21/25] xfs: scrub extended attributes
  2017-10-03 20:43 ` [PATCH 21/25] xfs: scrub extended attributes Darrick J. Wong
@ 2017-10-09  2:13   ` Dave Chinner
  2017-10-09 21:14     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-09  2:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:43:02PM -0700, Darrick J. Wong wrote:
> +/* Extended Attributes */
> +
> +struct xfs_scrub_xattr {
> +	struct xfs_attr_list_context	context;
> +	struct xfs_scrub_context	*sc;
> +};

A comment here explaining that we are using the listattr callback
infrastructure to scrub the xattr?  And, now that I've got to the
rest of the code, that we don't validate the pointers/values in the
attribute records when doing dabtree record check because we are
doing that indirectly afterwards by reading every attribute value.
And that this follows the pointers for remote attr blocks and reads
them, hence checking the remote attr is valid via verifiers?

And, with that out of the way, what about attributes that listent
skips? i.e. those with the flag that says they are not valid? We
don't check they exist or are valid, and their existence would be
a case for preening the xattr tree...

Otherwise this seems pretty straightforward...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 22/25] xfs: scrub symbolic links
  2017-10-03 20:43 ` [PATCH 22/25] xfs: scrub symbolic links Darrick J. Wong
@ 2017-10-09  2:17   ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2017-10-09  2:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:43:08PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create the infrastructure to scrub symbolic link data.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine - a simple one at last :P

-Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 24/25] xfs: scrub realtime bitmap/summary
  2017-10-03 20:43 ` [PATCH 24/25] xfs: scrub realtime bitmap/summary Darrick J. Wong
@ 2017-10-09  2:28   ` Dave Chinner
  2017-10-09 20:24     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-09  2:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:43:21PM -0700, Darrick J. Wong wrote:
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 154c3dd..d4d9bef 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -315,6 +315,11 @@ static inline bool xfs_sb_good_version(struct xfs_sb *sbp)
>  	return false;
>  }
>  
> +static inline bool xfs_sb_version_hasrealtime(struct xfs_sb *sbp)
> +{
> +	return sbp->sb_rblocks > 0;
> +}

How much can we rely on that? Do we allow a fs to mount with that
being > 0 but no rtdev= mount option?

> +/* Set us up with the realtime metadata locked. */
> +int
> +xfs_scrub_setup_rt(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	int				lockmode;
> +	int				error = 0;
> +
> +	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
> +		return -EINVAL;

I've forgotten what this means already :/

> +	error = xfs_scrub_setup_fs(sc, ip);
> +	if (error)
> +		return error;
> +
> +	lockmode = XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP;
> +	xfs_ilock(mp->m_rbmip, lockmode);
> +	xfs_trans_ijoin(sc->tp, mp->m_rbmip, lockmode);

Ok, so why do we join this inode to the transaction and not use
the sc->ilock_flags field to track how we've locked it?

> +
> +	return 0;
> +}
> +
> +/* Realtime bitmap. */
> +
> +/* Scrub a free extent record from the realtime bitmap. */
> +STATIC int
> +xfs_scrub_rtbitmap_helper(
> +	struct xfs_trans		*tp,
> +	struct xfs_rtalloc_rec		*rec,
> +	void				*priv)
> +{
> +	return 0;
> +}

Check the extent record returned is within the range of the rtdev
address space?
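
Something along these lines, as a userspace sketch.  The rt_rec struct
is a hypothetical stand-in for the record passed to the helper, and
dev_size is assumed to be the size of the realtime device in the same
units as the record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a free extent record from the rt bitmap. */
struct rt_rec {
	uint64_t	start;
	uint64_t	count;
};

/*
 * Return true if the record is non-empty and lies entirely inside a
 * device of @dev_size units.  The subtraction form avoids overflow in
 * start + count.
 */
static bool
rt_rec_in_range(const struct rt_rec *rec, uint64_t dev_size)
{
	assert(rec != NULL);
	if (rec->count == 0)
		return false;
	if (rec->start >= dev_size)
		return false;
	return rec->count <= dev_size - rec->start;
}
```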

> +
> +/* Scrub the realtime bitmap. */
> +int
> +xfs_scrub_rtbitmap(
> +	struct xfs_scrub_context	*sc)
> +{
> +	int				error;
> +
> +	error = xfs_rtalloc_query_all(sc->tp, xfs_scrub_rtbitmap_helper, NULL);
> +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
> +		goto out;
> +
> +out:
> +	return error;
> +}
> +
> +/* Scrub the realtime summary. */
> +int
> +xfs_scrub_rtsummary(
> +	struct xfs_scrub_context	*sc)
> +{
> +	/* XXX: implement this some day */
> +	return -ENOENT;
> +}

Alright, this is all just a stub that doesn't really do any real
scrubbing yet. I guess it's better than nothing in that it walks
the rtbitmap....

> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -241,6 +241,21 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>  		.setup	= xfs_scrub_setup_parent,
>  		.scrub	= xfs_scrub_parent,
>  	},
> +#ifdef CONFIG_XFS_RT
> +	{ /* realtime bitmap */
> +		.setup	= xfs_scrub_setup_rt,
> +		.scrub	= xfs_scrub_rtbitmap,
> +		.has	= xfs_sb_version_hasrealtime,
> +	},
> +	{ /* realtime summary */
> +		.setup	= xfs_scrub_setup_rt,
> +		.scrub	= xfs_scrub_rtsummary,
> +		.has	= xfs_sb_version_hasrealtime,
> +	},
> +#else
> +	{ NULL },
> +	{ NULL },
> +#endif

I think I'd prefer that you supply stub functions when
CONFIG_XFS_RT=n so this table doesn't require ifdefs.
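
A userspace sketch of the stub approach.  HAVE_RT stands in for
CONFIG_XFS_RT, the struct and function names are hypothetical, and
returning -ENOENT from the stubs is an assumption rather than something
settled in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct scrub_ctx;	/* opaque here; stands in for xfs_scrub_context */

struct scrub_meta_ops {
	int	(*setup)(struct scrub_ctx *sc);
	int	(*scrub)(struct scrub_ctx *sc);
};

#ifdef HAVE_RT
/* Real implementations live elsewhere when the feature is built in. */
int scrub_setup_rt(struct scrub_ctx *sc);
int scrub_rtbitmap(struct scrub_ctx *sc);
int scrub_rtsummary(struct scrub_ctx *sc);
#else
/* Feature compiled out: same symbols, but they just report "not here". */
static inline int scrub_setup_rt(struct scrub_ctx *sc)
{
	(void)sc;
	return -ENOENT;
}
static inline int scrub_rtbitmap(struct scrub_ctx *sc)
{
	(void)sc;
	return -ENOENT;
}
static inline int scrub_rtsummary(struct scrub_ctx *sc)
{
	(void)sc;
	return -ENOENT;
}
#endif

/* The dispatch table no longer needs any ifdefs. */
static const struct scrub_meta_ops meta_ops[] = {
	{ /* realtime bitmap */
		.setup	= scrub_setup_rt,
		.scrub	= scrub_rtbitmap,
	},
	{ /* realtime summary */
		.setup	= scrub_setup_rt,
		.scrub	= scrub_rtsummary,
	},
};

/* Run setup, then scrub, for one table entry. */
static int
run_scrub(size_t type, struct scrub_ctx *sc)
{
	int	error;

	assert(type < sizeof(meta_ops) / sizeof(meta_ops[0]));
	error = meta_ops[type].setup(sc);
	if (error)
		return error;
	return meta_ops[type].scrub(sc);
}
```

The table keeps the same shape in both configurations, so the slot
numbering cannot drift between builds.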

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 25/25] xfs: scrub quota information
  2017-10-03 20:43 ` [PATCH 25/25] xfs: scrub quota information Darrick J. Wong
@ 2017-10-09  2:51   ` Dave Chinner
  2017-10-09 20:03     ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-09  2:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Oct 03, 2017 at 01:43:27PM -0700, Darrick J. Wong wrote:
> +xfs_scrub_quota_to_dqtype(
> +	struct xfs_scrub_context	*sc)
> +{
> +	switch (sc->sm->sm_type) {
> +	case XFS_SCRUB_TYPE_UQUOTA:
> +		return XFS_DQ_USER;
> +	case XFS_SCRUB_TYPE_GQUOTA:
> +		return XFS_DQ_GROUP;
> +	case XFS_SCRUB_TYPE_PQUOTA:
> +		return XFS_DQ_PROJ;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +/* Set us up to scrub a quota. */
> +int
> +xfs_scrub_setup_quota(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	uint				dqtype;
> +
> +	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
> +		return -EINVAL;
> +
> +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> +	if (dqtype == 0)
> +		return -EINVAL;
> +	return 0;
> +}

Should this check whether the quota type is actually enabled, and
return ENOENT if it's not? i.e move the check out of
xfs_scrub_quota() and into the setup function?

> +
> +/* Quotas. */
> +
> +/* Scrub the fields in an individual quota item. */
> +STATIC void
> +xfs_scrub_quota_item(
> +	struct xfs_scrub_context	*sc,
> +	uint				dqtype,
> +	struct xfs_dquot		*dq,
> +	xfs_dqid_t			id)
> +{
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_disk_dquot		*d = &dq->q_core;
> +	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> +	xfs_fileoff_t			offset;
> +	unsigned long long		bsoft;
> +	unsigned long long		isoft;
> +	unsigned long long		rsoft;
> +	unsigned long long		bhard;
> +	unsigned long long		ihard;
> +	unsigned long long		rhard;
> +	unsigned long long		bcount;
> +	unsigned long long		icount;
> +	unsigned long long		rcount;
> +	xfs_ino_t			inodes;
> +
> +	/* Did we get the dquot we wanted? */
> +	offset = id * qi->qi_dqperchunk;
> +	if (id > be32_to_cpu(d->d_id) ||

Why is this a ">" check rather than "!="?

> +	    dqtype != (d->d_flags & XFS_DQ_ALLTYPES))
> +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
> +
> +	/* Check the limits. */
> +	bhard = be64_to_cpu(d->d_blk_hardlimit);
> +	ihard = be64_to_cpu(d->d_ino_hardlimit);
> +	rhard = be64_to_cpu(d->d_rtb_hardlimit);
> +
> +	bsoft = be64_to_cpu(d->d_blk_softlimit);
> +	isoft = be64_to_cpu(d->d_ino_softlimit);
> +	rsoft = be64_to_cpu(d->d_rtb_softlimit);
> +
> +	inodes = XFS_AGINO_TO_INO(mp, mp->m_sb.sb_agcount, 0);

Allocated inode counts should check against the filesystem inode
limit (mp->m_maxicount) rather than the physical last inode number
(which is wrong, anyway, for a small last AG).

> +
> +	/*
> +	 * Warn if the limits are larger than the fs.  Administrators
> +	 * can do this, though in production this seems suspect.
> +	 */
> +	if (bhard > mp->m_sb.sb_dblocks || bsoft > mp->m_sb.sb_dblocks)
> +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> +	if (ihard > inodes || isoft > inodes)
> +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> +	if (rhard > mp->m_sb.sb_rblocks || rsoft > mp->m_sb.sb_rblocks)
> +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);

Can you stack these so there's one per line? i.e.:

	if (bhard > mp->m_sb.sb_dblocks ||
	    bsoft > mp->m_sb.sb_dblocks)
		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);

> +
> +	/* Soft limit must be less than the hard limit. */
> +	if (bsoft > bhard || isoft > ihard || rsoft > rhard)
> +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);

Though with this check, I suspect you only need to check the hard
limits against their upper limits because if the hard limit is valid
and the soft is above then it's going to trigger corruption. Do we
need a warning as well in that case?
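
To spell out the reasoning: if the hard limit is within the upper bound
and the soft limit is not, the soft limit must exceed the hard limit, so
the corruption check already catches it.  A userspace sketch of the
reduced check follows; the struct, enum and function are hypothetical
and model only the comparisons discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum limit_verdict {
	LIMIT_OK,
	LIMIT_WARNING,	/* suspicious but allowed */
	LIMIT_CORRUPT,
};

/* One soft/hard limit pair, already converted to host endianness. */
struct limit_pair {
	uint64_t	soft;
	uint64_t	hard;
};

/*
 * Check one resource's limits against @max (e.g. the fs size).  Only
 * the hard limit is compared with @max; a soft limit above @max with a
 * sane hard limit shows up as soft > hard instead.
 */
static enum limit_verdict
check_limits(const struct limit_pair *l, uint64_t max)
{
	assert(l != NULL);
	if (l->soft > l->hard)
		return LIMIT_CORRUPT;
	if (l->hard > max)
		return LIMIT_WARNING;
	return LIMIT_OK;
}
```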

> +	/* Check the resource counts. */
> +	bcount = be64_to_cpu(d->d_bcount);
> +	icount = be64_to_cpu(d->d_icount);
> +	rcount = be64_to_cpu(d->d_rtbcount);
> +	inodes = percpu_counter_sum(&mp->m_icount);

Can we use different variable names for "inodes" here? One is the
maximum allowed, the other is currently allocated.

> +/* Scrub all of a quota type's items. */
> +int
> +xfs_scrub_quota(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_bmbt_irec		irec = { 0 };
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_inode		*ip;
> +	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> +	struct xfs_dquot		*dq;
> +	xfs_fileoff_t			max_dqid_off;
> +	xfs_fileoff_t			off = 0;
> +	xfs_dqid_t			id = 0;
> +	uint				dqtype;
> +	int				nimaps;
> +	int				error;
> +
> +	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> +		return -ENOENT;
> +
> +	mutex_lock(&qi->qi_quotaofflock);
> +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> +	if (!xfs_this_quota_on(sc->mp, dqtype)) {
> +		error = -ENOENT;
> +		goto out;

goto out_unlock_quota

> +	}
> +
> +	/* Attach to the quota inode and set sc->ip so that reporting works. */
> +	ip = xfs_quota_inode(sc->mp, dqtype);
> +	sc->ip = ip;
> +
> +	/* Look for problem extents. */
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
> +	while (1) {
> +		if (xfs_scrub_should_terminate(sc, &error))
> +			break;

goto out_unlock_inode

> +
> +		off = irec.br_startoff + irec.br_blockcount;
> +		nimaps = 1;
> +		error = xfs_bmapi_read(ip, off, -1, &irec, &nimaps,
> +				XFS_BMAPI_ENTIRE);
> +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, off, &error))
> +			goto out_unlock;

out_unlock_inode

> +		if (!nimaps)
> +			break;
> +		if (irec.br_startblock == HOLESTARTBLOCK)
> +			continue;
> +
> +		/*
> +		 * Unwritten extents or blocks mapped above the highest
> +		 * quota id shouldn't happen.
> +		 */
> +		if (isnullstartblock(irec.br_startblock) ||
> +		    irec.br_startoff > max_dqid_off ||
> +		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
> +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
> +	}
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	/* Check all the quota items. */
> +	while (id < ((xfs_dqid_t)-1ULL)) {
> +		if (xfs_scrub_should_terminate(sc, &error))
> +			break;
> +
> +		error = xfs_qm_dqget(mp, NULL, id, dqtype, XFS_QMOPT_DQNEXT,
> +				&dq);
> +		if (error == -ENOENT)
> +			break;
> +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK,
> +				id * qi->qi_dqperchunk, &error))
> +			goto out;

break

> +
> +		xfs_scrub_quota_item(sc, dqtype, dq, id);
> +
> +		id = be32_to_cpu(dq->q_core.d_id) + 1;
> +		xfs_qm_dqput(dq);
> +		if (!id)
> +			break;
> +	}

out_unlock_quota:
	sc->ip = NULL;
	mutex_unlock(&qi->qi_quotaofflock);
	return error;

out_unlock_inode:
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	goto out_unlock_quota;
}

> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 348e3c3..0849b3f 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -256,6 +256,24 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
>  	{ NULL },
>  	{ NULL },
>  #endif
> +#ifdef CONFIG_XFS_QUOTA
> +	{ /* user quota */
> +		.setup = xfs_scrub_setup_quota,
> +		.scrub = xfs_scrub_quota,
> +	},
> +	{ /* group quota */
> +		.setup = xfs_scrub_setup_quota,
> +		.scrub = xfs_scrub_quota,
> +	},
> +	{ /* project quota */
> +		.setup = xfs_scrub_setup_quota,
> +		.scrub = xfs_scrub_quota,
> +	},
> +#else
> +	{ NULL },
> +	{ NULL },
> +	{ NULL },
> +#endif
>  };

Again, I think stub functions are in order here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/25] xfs: scrub quota information
  2017-10-09  2:51   ` Dave Chinner
@ 2017-10-09 20:03     ` Darrick J. Wong
  2017-10-09 22:17       ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-09 20:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Oct 09, 2017 at 01:51:51PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:43:27PM -0700, Darrick J. Wong wrote:
> > +xfs_scrub_quota_to_dqtype(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	switch (sc->sm->sm_type) {
> > +	case XFS_SCRUB_TYPE_UQUOTA:
> > +		return XFS_DQ_USER;
> > +	case XFS_SCRUB_TYPE_GQUOTA:
> > +		return XFS_DQ_GROUP;
> > +	case XFS_SCRUB_TYPE_PQUOTA:
> > +		return XFS_DQ_PROJ;
> > +	default:
> > +		return 0;
> > +	}
> > +}
> > +
> > +/* Set us up to scrub a quota. */
> > +int
> > +xfs_scrub_setup_quota(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	uint				dqtype;
> > +
> > +	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
> > +		return -EINVAL;
> > +
> > +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> > +	if (dqtype == 0)
> > +		return -EINVAL;
> > +	return 0;
> > +}
> 
> Should this check whether the quota type is actually enabled, and
> return ENOENT if it's not? i.e move the check out of
> xfs_scrub_quota() and into the setup function?

I can add an xfs_this_quota_on check to the setup function, but don't we
need xfs_scrub_quota to lock qi_quotaofflock and then recheck that the
quota type is still enabled?

> > +/* Quotas. */
> > +
> > +/* Scrub the fields in an individual quota item. */
> > +STATIC void
> > +xfs_scrub_quota_item(
> > +	struct xfs_scrub_context	*sc,
> > +	uint				dqtype,
> > +	struct xfs_dquot		*dq,
> > +	xfs_dqid_t			id)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_disk_dquot		*d = &dq->q_core;
> > +	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> > +	xfs_fileoff_t			offset;
> > +	unsigned long long		bsoft;
> > +	unsigned long long		isoft;
> > +	unsigned long long		rsoft;
> > +	unsigned long long		bhard;
> > +	unsigned long long		ihard;
> > +	unsigned long long		rhard;
> > +	unsigned long long		bcount;
> > +	unsigned long long		icount;
> > +	unsigned long long		rcount;
> > +	xfs_ino_t			inodes;
> > +
> > +	/* Did we get the dquot we wanted? */
> > +	offset = id * qi->qi_dqperchunk;
> > +	if (id > be32_to_cpu(d->d_id) ||
> 
> Why is this a ">" check rather than "!="?

/*
 * We fed id and DQNEXT into the xfs_qm_dqget call, which means
 * that the actual dquot we got must either have the same id or
 * the next higher id.
 */
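
A minimal sketch of the "same id or next higher id" contract described in
the comment above.  The sorted table and both helper names are hypothetical
stand-ins, not the XFS dquot cache or xfs_qm_dqget itself; the point is only
that a lookup-next result may skip forward but must never come back lower
than the id that was asked for, which is why the check is ">" and not "!=".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a DQNEXT-style lookup: return the first id in
 * a sorted table that is >= want.  Returns false when no such id exists.
 */
static bool lookup_next_id(const uint32_t *ids, size_t n, uint32_t want,
			   uint32_t *found)
{
	assert(found != NULL);
	for (size_t i = 0; i < n; i++) {
		if (ids[i] >= want) {
			*found = ids[i];
			return true;
		}
	}
	return false;
}

/* A "next" lookup may skip forward, but must never move backwards. */
static bool returned_id_is_sane(uint32_t want, uint32_t found)
{
	return want <= found;
}
```

With a table holding ids 0, 5 and 9, asking for id 1 yields 5, which is
acceptable; a result of 5 for a request of 6 would be flagged.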

> 
> > +	    dqtype != (d->d_flags & XFS_DQ_ALLTYPES))
> > +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
> > +
> > +	/* Check the limits. */
> > +	bhard = be64_to_cpu(d->d_blk_hardlimit);
> > +	ihard = be64_to_cpu(d->d_ino_hardlimit);
> > +	rhard = be64_to_cpu(d->d_rtb_hardlimit);
> > +
> > +	bsoft = be64_to_cpu(d->d_blk_softlimit);
> > +	isoft = be64_to_cpu(d->d_ino_softlimit);
> > +	rsoft = be64_to_cpu(d->d_rtb_softlimit);
> > +
> > +	inodes = XFS_AGINO_TO_INO(mp, mp->m_sb.sb_agcount, 0);
> 
> Allocated inode counts should check against the filesystem inode
> limit (mp->m_maxicount) rather than the physical last inode number
> (which is wrong, anyway, for a small last AG).

Oops, ok.

> > +
> > +	/*
> > +	 * Warn if the limits are larger than the fs.  Administrators
> > +	 * can do this, though in production this seems suspect.
> > +	 */
> > +	if (bhard > mp->m_sb.sb_dblocks || bsoft > mp->m_sb.sb_dblocks)
> > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > +	if (ihard > inodes || isoft > inodes)
> > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > +	if (rhard > mp->m_sb.sb_rblocks || rsoft > mp->m_sb.sb_rblocks)
> > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> 
> Can you stack these so there's one per line? i.e.:

Will do.  I'll also change the ihard/isoft check here to check against
mp->m_maxicount directly.

> 	if (bhard > mp->m_sb.sb_dblocks ||
> 	    bsoft > mp->m_sb.sb_dblocks)
> 		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> 
> > +
> > +	/* Soft limit must be less than the hard limit. */
> > +	if (bsoft > bhard || isoft > ihard || rsoft > rhard)
> > +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
> 
> Though with this check, I suspect you only need to check the hard
> limits against their upper limits because if the hard limit is valid
> and the soft is above then it's going to trigger corruption. Do we
> need a warning as well in that case?

I don't follow here ... if the soft limit is above the hard limit, what
will trigger corruption?  The quota syscalls enforce that bhard > bsoft
when the admin tries to set new limits, but if the disk is corrupt such
that dblocks = 300, bhard = 250, and bsoft = 280, the
(bhard > dblocks || bsoft > dblocks) checks won't trigger.  That's why
there's an explicit bsoft > bhard check here.
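
The numbers above can be worked through with a small sketch.  The struct
and function names are hypothetical, and the sketch ignores any special
meaning a zero limit may have; it only shows that the two size comparisons
and the soft/hard comparison catch different problems.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one resource's quota limits. */
struct limits {
	uint64_t hard;
	uint64_t soft;
};

/* Suspicious but legal: a limit larger than the filesystem itself. */
static bool limits_exceed_fs(const struct limits *l, uint64_t fs_size)
{
	assert(l != NULL);
	return l->hard > fs_size || l->soft > fs_size;
}

/* Never legal: the soft limit is above the hard limit. */
static bool limits_inverted(const struct limits *l)
{
	assert(l != NULL);
	return l->soft > l->hard;
}
```

For dblocks = 300, bhard = 250 and bsoft = 280, limits_exceed_fs() stays
quiet while limits_inverted() fires.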

I found a softlimit/hardlimit comparison in a debugging ASSERT in
xfs_qm_adjust_dqtimers, but I didn't find anything that looked like a
direct comparison of softlimit <= hardlimit leading to a corruption
message in the xfs_qm_dqget path.

<shrug> I might be missing something here.

> > +	/* Check the resource counts. */
> > +	bcount = be64_to_cpu(d->d_bcount);
> > +	icount = be64_to_cpu(d->d_icount);
> > +	rcount = be64_to_cpu(d->d_rtbcount);
> > +	inodes = percpu_counter_sum(&mp->m_icount);
> 
> Can we use different variable names for "inodes" here? One is the
> maximum allowed, the other is currently allocated.

Renamed to fs_icount.

> > +/* Scrub all of a quota type's items. */
> > +int
> > +xfs_scrub_quota(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_bmbt_irec		irec = { 0 };
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_inode		*ip;
> > +	struct xfs_quotainfo		*qi = mp->m_quotainfo;
> > +	struct xfs_dquot		*dq;
> > +	xfs_fileoff_t			max_dqid_off;
> > +	xfs_fileoff_t			off = 0;
> > +	xfs_dqid_t			id = 0;
> > +	uint				dqtype;
> > +	int				nimaps;
> > +	int				error;
> > +
> > +	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
> > +		return -ENOENT;
> > +
> > +	mutex_lock(&qi->qi_quotaofflock);
> > +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> > +	if (!xfs_this_quota_on(sc->mp, dqtype)) {
> > +		error = -ENOENT;
> > +		goto out;
> 
> goto out_unlock_quota
> 
> > +	}
> > +
> > +	/* Attach to the quota inode and set sc->ip so that reporting works. */
> > +	ip = xfs_quota_inode(sc->mp, dqtype);
> > +	sc->ip = ip;
> > +
> > +	/* Look for problem extents. */
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
> > +	while (1) {
> > +		if (xfs_scrub_should_terminate(sc, &error))
> > +			break;
> 
> goto out_unlock_inode
> 
> > +
> > +		off = irec.br_startoff + irec.br_blockcount;
> > +		nimaps = 1;
> > +		error = xfs_bmapi_read(ip, off, -1, &irec, &nimaps,
> > +				XFS_BMAPI_ENTIRE);
> > +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, off, &error))
> > +			goto out_unlock;
> 
> out_unlock_inode
> 
> > +		if (!nimaps)
> > +			break;
> > +		if (irec.br_startblock == HOLESTARTBLOCK)
> > +			continue;
> > +
> > +		/*
> > +		 * Unwritten extents or blocks mapped above the highest
> > +		 * quota id shouldn't happen.
> > +		 */
> > +		if (isnullstartblock(irec.br_startblock) ||
> > +		    irec.br_startoff > max_dqid_off ||
> > +		    irec.br_startoff + irec.br_blockcount > max_dqid_off + 1)
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, off);
> > +	}
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +
> > +	/* Check all the quota items. */
> > +	while (id < ((xfs_dqid_t)-1ULL)) {
> > +		if (xfs_scrub_should_terminate(sc, &error))
> > +			break;
> > +
> > +		error = xfs_qm_dqget(mp, NULL, id, dqtype, XFS_QMOPT_DQNEXT,
> > +				&dq);
> > +		if (error == -ENOENT)
> > +			break;
> > +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK,
> > +				id * qi->qi_dqperchunk, &error))
> > +			goto out;
> 
> break
> 
> > +
> > +		xfs_scrub_quota_item(sc, dqtype, dq, id);
> > +
> > +		id = be32_to_cpu(dq->q_core.d_id) + 1;
> > +		xfs_qm_dqput(dq);
> > +		if (!id)
> > +			break;
> > +	}
> 
> out_unlock_quota:
> 	sc->ip = NULL;
> 	mutex_unlock(&qi->qi_quotaofflock);
> 	return error;
> 
> out_unlock_inode:
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 	goto out_unlock_quota;

Fixed.

> }
> 
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 348e3c3..0849b3f 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -256,6 +256,24 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
> >  	{ NULL },
> >  	{ NULL },
> >  #endif
> > +#ifdef CONFIG_XFS_QUOTA
> > +	{ /* user quota */
> > +		.setup = xfs_scrub_setup_quota,
> > +		.scrub = xfs_scrub_quota,
> > +	},
> > +	{ /* group quota */
> > +		.setup = xfs_scrub_setup_quota,
> > +		.scrub = xfs_scrub_quota,
> > +	},
> > +	{ /* project quota */
> > +		.setup = xfs_scrub_setup_quota,
> > +		.scrub = xfs_scrub_quota,
> > +	},
> > +#else
> > +	{ NULL },
> > +	{ NULL },
> > +	{ NULL },
> > +#endif
> >  };
> 
> Again, I think stub functions are in order here.

Ok, will do.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 24/25] xfs: scrub realtime bitmap/summary
  2017-10-09  2:28   ` Dave Chinner
@ 2017-10-09 20:24     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-09 20:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Oct 09, 2017 at 01:28:04PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:43:21PM -0700, Darrick J. Wong wrote:
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 154c3dd..d4d9bef 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -315,6 +315,11 @@ static inline bool xfs_sb_good_version(struct xfs_sb *sbp)
> >  	return false;
> >  }
> >  
> > +static inline bool xfs_sb_version_hasrealtime(struct xfs_sb *sbp)
> > +{
> > +	return sbp->sb_rblocks > 0;
> > +}
> 
> How much can we rely on that? do we allow a fs to mount with that
> being > 0 but no rtdev= mount option?
> 
> > +/* Set us up with the realtime metadata locked. */
> > +int
> > +xfs_scrub_setup_rt(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_inode		*ip)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				lockmode;
> > +	int				error = 0;
> > +
> > +	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
> > +		return -EINVAL;
> 
> I've forgotten what this means already :/

(I fixed all of these the first time you complained. :))

> > +	error = xfs_scrub_setup_fs(sc, ip);
> > +	if (error)
> > +		return error;
> > +
> > +	lockmode = XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP;
> > +	xfs_ilock(mp->m_rbmip, lockmode);
> > +	xfs_trans_ijoin(sc->tp, mp->m_rbmip, lockmode);
> 
> Ok, so why do we join this inode to the transaction and not use
> the sc->ilock_flags field to track how we've locked it?

I don't know why.

It might just be a forgotten leftover from when I started tracking
inodes in the scrub context.

> > +
> > +	return 0;
> > +}
> > +
> > +/* Realtime bitmap. */
> > +
> > +/* Scrub a free extent record from the realtime bitmap. */
> > +STATIC int
> > +xfs_scrub_rtbitmap_helper(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_rtalloc_rec		*rec,
> > +	void				*priv)
> > +{
> > +	return 0;
> > +}
> 
> Check the extent record returned is within the range of the rtdev
> address space?

I added:

if (rec->ar_startblock + rec->ar_blockcount <= rec->ar_startblock ||
    !xfs_verify_rtbno_ptr(sc->mp, rec->ar_startblock) ||
    !xfs_verify_rtbno_ptr(sc->mp, rec->ar_startblock +
		rec->ar_blockcount - 1))
	xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);

...back when I was reworking the scrub patches to add verify_agbno_ptr and
declutter the bnobt scrubbers.
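
A rough sketch of why the check above has three parts.  bno_in_range() is
a hypothetical stand-in for xfs_verify_rtbno_ptr and assumes the device is
simply a flat range of nblocks blocks; the real helper may check more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: true if block number bno lies inside a device of nblocks. */
static bool bno_in_range(uint64_t nblocks, uint64_t bno)
{
	return bno < nblocks;
}

/* Validate an extent [start, start + len) against the device size. */
static bool extent_is_sane(uint64_t nblocks, uint64_t start, uint64_t len)
{
	assert(nblocks > 0);

	/* Zero-length extents and wraparound both make start + len <= start. */
	if (start + len <= start)
		return false;

	/* Both the first and the last block must be addressable. */
	return bno_in_range(nblocks, start) &&
	       bno_in_range(nblocks, start + len - 1);
}
```

The first comparison rejects empty extents and unsigned overflow in one
test, so the "- 1" on the last block cannot underflow.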

> > +
> > +/* Scrub the realtime bitmap. */
> > +int
> > +xfs_scrub_rtbitmap(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	int				error;
> > +
> > +	error = xfs_rtalloc_query_all(sc->tp, xfs_scrub_rtbitmap_helper, NULL);
> > +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, 0, &error))
> > +		goto out;
> > +
> > +out:
> > +	return error;
> > +}
> > +
> > +/* Scrub the realtime summary. */
> > +int
> > +xfs_scrub_rtsummary(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	/* XXX: implement this some day */
> > +	return -ENOENT;
> > +}
> 
> Alright, this is all just a stub that doesn't really do any real
> scrubbing yet. I guess it's better than nothing in that it walks
> the rtbitmap....
> 
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -241,6 +241,21 @@ static const struct xfs_scrub_meta_ops meta_scrub_ops[] = {
> >  		.setup	= xfs_scrub_setup_parent,
> >  		.scrub	= xfs_scrub_parent,
> >  	},
> > +#ifdef CONFIG_XFS_RT
> > +	{ /* realtime bitmap */
> > +		.setup	= xfs_scrub_setup_rt,
> > +		.scrub	= xfs_scrub_rtbitmap,
> > +		.has	= xfs_sb_version_hasrealtime,
> > +	},
> > +	{ /* realtime summary */
> > +		.setup	= xfs_scrub_setup_rt,
> > +		.scrub	= xfs_scrub_rtsummary,
> > +		.has	= xfs_sb_version_hasrealtime,
> > +	},
> > +#else
> > +	{ NULL },
> > +	{ NULL },
> > +#endif
> 
> I think I'd prefer that you supply stub functions when
> CONFIG_XFS_RT=n so this table doesn't require ifdefs.

Ok.
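
One way the stub approach could look, as a sketch only.  CONFIG_DEMO_RT,
struct scrub_ops and the demo_* names are invented for illustration, and
returning -ENOENT from the stubs is an assumption rather than something
this thread settles.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct scrub_ctx;	/* hypothetical opaque scrub context */

struct scrub_ops {
	int (*setup)(struct scrub_ctx *sc);
	int (*scrub)(struct scrub_ctx *sc);
};

#ifdef CONFIG_DEMO_RT
/* Real implementations would live in the feature's own source file. */
int demo_setup_rt(struct scrub_ctx *sc);
int demo_scrub_rtbitmap(struct scrub_ctx *sc);
#else
/* Feature compiled out: same names, but every call reports "not here". */
static inline int demo_setup_rt(struct scrub_ctx *sc)
{
	(void)sc;
	return -ENOENT;
}

static inline int demo_scrub_rtbitmap(struct scrub_ctx *sc)
{
	(void)sc;
	return -ENOENT;
}
#endif

/* The table itself no longer needs #ifdef or NULL placeholder entries. */
static const struct scrub_ops demo_ops[] = {
	{ .setup = demo_setup_rt, .scrub = demo_scrub_rtbitmap },
};

/* Dispatch one scrub type; every slot is expected to be populated. */
static int demo_run(size_t type, struct scrub_ctx *sc)
{
	int error;

	assert(type < sizeof(demo_ops) / sizeof(demo_ops[0]));
	error = demo_ops[type].setup(sc);
	if (error)
		return error;
	return demo_ops[type].scrub(sc);
}
```

The conditional compilation moves into the header that declares the
functions, and the dispatcher never has to test for a NULL callback.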

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 21/25] xfs: scrub extended attributes
  2017-10-09  2:13   ` Dave Chinner
@ 2017-10-09 21:14     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-09 21:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Oct 09, 2017 at 01:13:12PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:43:02PM -0700, Darrick J. Wong wrote:
> > +/* Extended Attributes */
> > +
> > +struct xfs_scrub_xattr {
> > +	struct xfs_attr_list_context	context;
> > +	struct xfs_scrub_context	*sc;
> > +};
> 
> A comment here explaining that we are using the listattr callback
> infrastructure to scrub the xattr?

New comments:

/*
 * Check that an extended attribute key can be looked up by hash.
 *
 * We use the XFS attribute list iterator (i.e. xfs_attr_list_int_ilocked)
 * to call this function for every attribute key in an inode.  Once
 * we're here, we load the attribute value to see if any errors happen,
 * or if we get more or less data than we expected.
 */

...comes before the definition of xfs_scrub_xattr_listent, and...

/*
 * Look up every xattr in this file by name.
 *
 * Use the backend implementation of xfs_attr_list to call
 * xfs_scrub_xattr_listent on every attribute key in this inode.
 * In other words, we use the same iterator/callback mechanism
 * that listattr uses to scrub extended attributes, though in our
 * _listent function, we check the value of the attribute.
 *
 * The VFS only locks i_rwsem when modifying attrs, so keep all
 * three locks held because that's the only way to ensure we're
 * the only thread poking into the da btree.  We traverse the da
 * btree while holding a leaf buffer locked for the xattr name
 * iteration, which doesn't really follow the usual buffer
 * locking order.
 */

...comes right before the call to xfs_attr_list_int_ilocked.

> And, now that I've got to the rest of the code, that we don't validate
> the pointers/values in the attribute records when doing dabtree record
> check because we are doing that indirectly afterwards by reading every
> attribute value.

Correct.

> And that this follows the pointers for remote attr blocks and reads
> them, hence checking the remote attr is valid via verifiers?

Correct.

> And, with that out of the way, what about attributes that listent
> skips?

Oops, I forgot those.

> i.e. those with the flag that says they are not valid?  We don't check
> they exist or are valid, and their existence would be a case for
> preening the xattr tree...

Good point, I'll add those.

> Otherwise this seems pretty straightforward...

Woot.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 25/25] xfs: scrub quota information
  2017-10-09 20:03     ` Darrick J. Wong
@ 2017-10-09 22:17       ` Dave Chinner
  2017-10-09 23:08         ` Darrick J. Wong
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2017-10-09 22:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Oct 09, 2017 at 01:03:28PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 09, 2017 at 01:51:51PM +1100, Dave Chinner wrote:
> > On Tue, Oct 03, 2017 at 01:43:27PM -0700, Darrick J. Wong wrote:
> > > +xfs_scrub_quota_to_dqtype(
> > > +	struct xfs_scrub_context	*sc)
> > > +{
> > > +	switch (sc->sm->sm_type) {
> > > +	case XFS_SCRUB_TYPE_UQUOTA:
> > > +		return XFS_DQ_USER;
> > > +	case XFS_SCRUB_TYPE_GQUOTA:
> > > +		return XFS_DQ_GROUP;
> > > +	case XFS_SCRUB_TYPE_PQUOTA:
> > > +		return XFS_DQ_PROJ;
> > > +	default:
> > > +		return 0;
> > > +	}
> > > +}
> > > +
> > > +/* Set us up to scrub a quota. */
> > > +int
> > > +xfs_scrub_setup_quota(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_inode		*ip)
> > > +{
> > > +	uint				dqtype;
> > > +
> > > +	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
> > > +		return -EINVAL;
> > > +
> > > +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> > > +	if (dqtype == 0)
> > > +		return -EINVAL;
> > > +	return 0;
> > > +}
> > 
> > Should this check whether the quota type is actually enabled, and
> > return ENOENT if it's not? i.e move the check out of
> > xfs_scrub_quota() and into the setup function?
> 
> I can add an xfs_this_quota_on check to the setup function, but don't we
> need xfs_scrub_quota to lock qi_quotaofflock and then recheck that the
> quota type is still enabled?

The qi_quotaofflock is held across the entire scrub, right?

Ah, this is called before the qi_quotaofflock is held - is there
a teardown callback? If so, would it be better to lock
qi_quotaofflock in the setup, release it in the teardown? That way
we can check here in the setup code, and not have to double check
the user input in the scrub function itself?
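
A sketch of the lock-in-setup, unlock-in-teardown pairing being suggested.
Everything here is hypothetical: a counter stands in for qi_quotaofflock,
and whether the scrub core actually offers a suitable teardown hook is the
open question in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical quota state; lock_depth stands in for a held mutex. */
struct quota_state {
	bool	enabled;
	int	lock_depth;
};

struct scrub_ctx {
	struct quota_state	*qs;	/* non-NULL only while the lock is held */
};

/* Take the lock, then validate once; the answer stays valid until teardown. */
static int scrub_setup(struct scrub_ctx *sc, struct quota_state *qs)
{
	qs->lock_depth++;
	if (!qs->enabled) {
		qs->lock_depth--;	/* failed setup must not leak the lock */
		return -ENOENT;
	}
	sc->qs = qs;
	return 0;
}

/* Safe to call after a failed setup: it only drops a lock that was kept. */
static void scrub_teardown(struct scrub_ctx *sc)
{
	if (sc->qs == NULL)
		return;
	assert(sc->qs->lock_depth > 0);
	sc->qs->lock_depth--;
	sc->qs = NULL;
}
```

Because the enabled check happens under the lock, the scrub function in
between would not need to repeat it.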

> > > +	/*
> > > +	 * Warn if the limits are larger than the fs.  Administrators
> > > +	 * can do this, though in production this seems suspect.
> > > +	 */
> > > +	if (bhard > mp->m_sb.sb_dblocks || bsoft > mp->m_sb.sb_dblocks)
> > > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > > +	if (ihard > inodes || isoft > inodes)
> > > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > > +	if (rhard > mp->m_sb.sb_rblocks || rsoft > mp->m_sb.sb_rblocks)
> > > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > 
> > Can you stack these so there's one per line? i.e.:
> 
> Will do.  I'll also change the ihard/isoft check here to check against
> mp->m_maxicount directly.
> 
> > 	if (bhard > mp->m_sb.sb_dblocks ||
> > 	    bsoft > mp->m_sb.sb_dblocks)
> > 		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > 
> > > +
> > > +	/* Soft limit must be less than the hard limit. */
> > > +	if (bsoft > bhard || isoft > ihard || rsoft > rhard)
> > > +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
> > 
> > Though with this check, I suspect you only need to check the hard
> > limits against their upper limits because if the hard limit is valid
> > and the soft is above then it's going to trigger corruption. Do we
> > need a warning as well in that case?
> 
> I don't follow here ... if the soft limit is above the hard limit, what
> will trigger corruption? 

What I meant was this:

	if (bhard > mp->m_sb.sb_dblocks)
		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
	if (bsoft > bhard)
		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);

That is, if bhard is over the valid limit, throw a warning and we
now don't care about bsoft because we've already signalled to
userspace there's a problem.

If bsoft is now greater than bhard, regardless of whether it's over
the valid size, we signal a corruption. If bhard is already an
invalid number, then this corruption report also implies that bsoft
is over the limit...
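
The implication can be checked with a small sketch.  The flag names and
check_limits() are hypothetical; the two comparisons are the ones from the
fragment above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	FLAG_WARNING = 1 << 0,	/* suspicious, but an admin can set this */
	FLAG_CORRUPT = 1 << 1,	/* cannot be the result of a valid update */
};

/* Hypothetical, simplified view of one resource's quota limits. */
struct limits {
	uint64_t hard;
	uint64_t soft;
};

static unsigned int check_limits(const struct limits *l, uint64_t fs_size)
{
	unsigned int flags = 0;

	assert(l != NULL);
	if (l->hard > fs_size)
		flags |= FLAG_WARNING;
	if (l->soft > l->hard)
		flags |= FLAG_CORRUPT;
	return flags;
}
```

A soft limit above the filesystem size is reported either way: if the hard
limit is within bounds the soft limit must exceed it and is flagged as
corrupt, and if the hard limit is out of bounds the warning is already set.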

> The quota syscalls enforce that bhard > bsoft
> when the admin tries to set new limits, but if the disk is corrupt such
> that dblocks = 300, bhard = 250, and bsoft = 280, the
> (bhard > dblocks || bsoft > dblocks) checks won't trigger.  That's why
> there's an explicit bsoft > bhard check here.

Right, I wasn't suggesting removing it - I didn't make it very clear
what I meant.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 20/25] xfs: scrub directory freespace
  2017-10-09  1:44   ` Dave Chinner
@ 2017-10-09 22:54     ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-09 22:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Oct 09, 2017 at 12:44:29PM +1100, Dave Chinner wrote:
> On Tue, Oct 03, 2017 at 01:42:55PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Check the free space information in a directory.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/dir.c |  347 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 347 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
> > index e58252b..6ea06c3 100644
> > --- a/fs/xfs/scrub/dir.c
> > +++ b/fs/xfs/scrub/dir.c
> > @@ -239,6 +239,348 @@ xfs_scrub_dir_rec(
> >  	return error;
> >  }
> >  
> > +/* Is this free entry either in the bestfree or smaller than all of them? */
> > +static inline void
> > +xfs_scrub_directory_check_free_entry(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_dablk_t			lblk,
> > +	struct xfs_dir2_data_free	*bf,
> > +	struct xfs_dir2_data_unused	*dup)
> > +{
> > +	struct xfs_dir2_data_free	*dfp;
> > +	unsigned int			smallest;
> > +
> > +	smallest = -1U;
> 
> Urk. That's the same as "smallest = UINT_MAX", and so ......
> 
> > +	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
> > +		if (dfp->offset &&
> > +		    be16_to_cpu(dfp->length) == be16_to_cpu(dup->length))
> > +			return;
> > +		if (smallest < be16_to_cpu(dfp->length))
> > +			smallest = be16_to_cpu(dfp->length);
> 
> .... how does this work? Shouldn't it be a ">" check here?

Yes.  Thanks for catching that.  I might as well change the -1U above to
UINT_MAX while I'm at it.
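
For reference, a standalone sketch of the corrected minimum search.  The
function name and the plain unsigned int array are illustrative only; the
patch itself walks big-endian bestfree entries.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Return the smallest value in len[0..n-1], or UINT_MAX if n == 0. */
static unsigned int smallest_length(const unsigned int *len, size_t n)
{
	unsigned int smallest = UINT_MAX;

	assert(len != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* ">" lowers the running minimum; "<" could never fire here. */
		if (smallest > len[i])
			smallest = len[i];
	}
	return smallest;
}
```

With the original "<" comparison nothing can be larger than UINT_MAX, so
the running minimum would keep its initial value for any input.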

Though since you later point out that the bestfree array should be
sorted longest to shortest, we can make this function faster:

if (dup length < bestfrees[2].length)
	return;

for (bestfrees in reverse order) {
	if (dup offset == bestfree offset) {
		if (dup length != bestfree length)
			corrupt();
		return;
	}
}

corrupt(); /* should be in bestfree[] but isn't? */
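
The pseudocode above, transcribed into compilable C as a sketch.  struct
free_slot, BESTFREE_COUNT and free_region_ok() are hypothetical names, the
fields are host-endian for simplicity, and the table is assumed to be
sorted longest to shortest already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BESTFREE_COUNT 3

/* Hypothetical, host-endian stand-in for one bestfree slot. */
struct free_slot {
	uint16_t offset;
	uint16_t length;
};

/* Returns true if a free region is consistent with the bestfree table. */
static bool free_region_ok(const struct free_slot *bf,
			   uint16_t offset, uint16_t length)
{
	assert(bf != NULL);

	/* Shorter than the shortest tracked region: need not be listed. */
	if (length < bf[BESTFREE_COUNT - 1].length)
		return true;

	/* Otherwise it must be listed, with a matching length. */
	for (int i = BESTFREE_COUNT - 1; i >= 0; i--) {
		if (bf[i].offset == offset)
			return bf[i].length == length;
	}

	/* Long enough to be tracked, but missing from the table. */
	return false;
}
```

As a literal transcription, this also rejects an untracked region whose
length ties with the shortest bestfree entry, which may be stricter than
intended.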

> > +	}
> > +
> > +	if (be16_to_cpu(dup->length) > smallest)
> > +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +}
> > +
> > +/* Check free space info in a directory data block. */
> > +STATIC int
> > +xfs_scrub_directory_data_bestfree(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_dablk_t			lblk,
> > +	bool				is_block)
> > +{
> > +	struct xfs_dir2_data_unused	*dup;
> > +	struct xfs_dir2_data_free	*dfp;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_dir2_data_free	*bf;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	char				*ptr;
> > +	char				*endptr;
> > +	u16				tag;
> > +	int				newlen;
> > +	int				offset;
> > +	int				error;
> > +
> > +	if (is_block) {
> > +		/* dir block format */
> > +		if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET))
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
> > +	} else {
> > +		/* dir data format */
> > +		error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, -1, &bp);
> > +	}
> > +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> > +		goto out;
> > +
> > +	/* Do the bestfrees correspond to actual free space? */
> > +	bf = sc->ip->d_ops->data_bestfree_p(bp->b_addr);
> 
> With the number of d_ops callouts in this code, a local dops
> variable might be in order.

Ok.

> > +	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
> > +		offset = be16_to_cpu(dfp->offset);
> > +		if (offset == 0)
> > +			continue;
> > +		if (offset >= BBTOB(bp->b_length)) {
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +			continue;
> > +		}
> 
> Not sure I like all the checks against and calculations using
> bp->b_length in this function. It would be more correct to check
> against geo->blksize.

Ok.

> > +		dup = (struct xfs_dir2_data_unused *)(bp->b_addr + offset);
> > +		tag = be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup));
> > +
> > +		if (dup->freetag != cpu_to_be16(XFS_DIR2_DATA_FREE_TAG) ||
> > +		    be16_to_cpu(dup->length) != be16_to_cpu(dfp->length) ||
> > +		    tag != ((char *)dup - (char *)bp->b_addr))
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +	}
> 
> Also, count the number of best frees here.

Ok.

I will also check that the bestfree entries are sorted by length.

> > +
> > +	/* Make sure the bestfrees are actually the best free spaces. */
> > +	ptr = (char *)sc->ip->d_ops->data_entry_p(bp->b_addr);
> > +	if (is_block) {
> > +		struct xfs_dir2_block_tail	*btp;
> > +
> > +		btp = xfs_dir2_block_tail_p(sc->mp->m_dir_geo, bp->b_addr);
> 
> mp->m_dir_geo

Ok.

> > +		endptr = (char *)xfs_dir2_block_leaf_p(btp);
> > +	} else
> > +		endptr = (char *)bp->b_addr + BBTOB(bp->b_length);
> 
> 

Empty space here?

> 
> 
> > +	while (ptr < endptr) {
> > +		dup = (struct xfs_dir2_data_unused *)ptr;
> > +		/* Skip real entries */
> > +		if (dup->freetag != cpu_to_be16(XFS_DIR2_DATA_FREE_TAG)) {
> > +			struct xfs_dir2_data_entry	*dep;
> > +
> > +			dep = (struct xfs_dir2_data_entry *)ptr;
> > +			newlen = sc->ip->d_ops->data_entsize(dep->namelen);
> > +			if (newlen <= 0) {
> > +				xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > +						lblk);
> > +				goto out_buf;
> > +			}
> > +			ptr += newlen;
> > +			if (endptr < ptr)
> > +				xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK,
> > +					      lblk);
> > +			continue;
> > +		}
> > +
> > +		/* Spot check this free entry */
> > +		tag = be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup));
> > +		if (tag != ((char *)dup - (char *)bp->b_addr))
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +
> > +		/*
> > +		 * Either this entry is a bestfree or it's smaller than
> > +		 * any of the bestfrees.
> > +		 */
> > +		xfs_scrub_directory_check_free_entry(sc, lblk, bf, dup);
> 
> So this checks if the entry is in the bestfree, but it doesn't
> tell us if the bestfree array has the correct number of entries...
> 
> > +
> > +		/* Move on. */
> > +		newlen = be16_to_cpu(dup->length);
> > +		if (newlen <= 0) {
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +			goto out_buf;
> > +		}
> > +		ptr += newlen;
> > +		if (endptr < ptr)
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> 
> Count the number of free entries here.

Ok.

> > +	}
> 
> And now check that the number of bestfrees vs free entries is
> valid. If there's more than 3 free entries in the block, the
> bestfrees array should be full...

Ok, done.
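
A sketch of what the count comparison might look like, assuming a
three-slot bestfree table in which an offset of zero marks an unused slot
(as the loops in the patch do).  The names are hypothetical and the exact
rule that ends up in the next revision may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BESTFREE_SLOTS 3

/* Hypothetical, host-endian stand-in for one bestfree slot. */
struct free_slot {
	uint16_t offset;	/* 0 means the slot is unused */
	uint16_t length;
};

/* Count the bestfree slots that are actually in use. */
static unsigned int count_bestfrees(const struct free_slot *bf)
{
	unsigned int nr = 0;

	assert(bf != NULL);
	for (size_t i = 0; i < BESTFREE_SLOTS; i++) {
		if (bf[i].offset != 0)
			nr++;
	}
	return nr;
}

/* Compare the number of used slots with the free regions seen in the block. */
static bool bestfree_count_ok(unsigned int nr_bestfrees, unsigned int nr_frees)
{
	/* Every used slot must describe a real free region. */
	if (nr_bestfrees > nr_frees)
		return false;

	/* More free regions than slots: the table should be full. */
	if (nr_frees > BESTFREE_SLOTS && nr_bestfrees != BESTFREE_SLOTS)
		return false;

	return true;
}
```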

> > +out_buf:
> > +	xfs_trans_brelse(sc->tp, bp);
> > +out:
> > +	return error;
> > +}
> > +
> > +/* Is this the longest free entry in the block? */
> > +static inline void
> > +xfs_scrub_directory_check_freesp(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_dablk_t			lblk,
> > +	struct xfs_buf			*dbp,
> > +	unsigned int			len)
> > +{
> > +	struct xfs_dir2_data_free	*bf;
> > +	struct xfs_dir2_data_free	*dfp;
> > +	unsigned int			longest = 0;
> > +	int				offset;
> > +
> > +	bf = sc->ip->d_ops->data_bestfree_p(dbp->b_addr);
> > +	for (dfp = &bf[0]; dfp < &bf[XFS_DIR2_DATA_FD_COUNT]; dfp++) {
> > +		offset = be16_to_cpu(dfp->offset);
> > +		if (!offset)
> > +			continue;
> > +		if (longest < be16_to_cpu(dfp->length))
> > +			longest = be16_to_cpu(dfp->length);
> > +	}
> > +
> > +	if (longest != len)
> > +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +}
> 
> This needs a better explanation - it's called to check whether the
> freespace length in the freespace index matches the longest
> freespace in the data block bests array.
> 
> And from that, the data block bests array is supposed to be ordered
> from largest to smallest, yes? As per __xfs_dir3_data_check():
> 
> 
> 	XFS_WANT_CORRUPTED_RETURN(mp, be16_to_cpu(bf[0].length) >=
> 						be16_to_cpu(bf[1].length));
> 	XFS_WANT_CORRUPTED_RETURN(mp, be16_to_cpu(bf[1].length) >=
> 						be16_to_cpu(bf[2].length));
> 
> So why doesn't this code just check the first entry in the array?

I missed that detail, which means that I'll refactor both
check_free_entry and check_freesp to take advantage of that, having also
added a check that the bestfree array is sorted by length.
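
Roughly, the refactored pair of checks could reduce to the following
sketch.  The function names and the plain host-endian length array are
assumptions for illustration; the ordering rule is the one quoted from
__xfs_dir3_data_check above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The three bestfree lengths must be ordered longest to shortest. */
static bool bestfree_sorted(const uint16_t *len)
{
	assert(len != NULL);
	return len[0] >= len[1] && len[1] >= len[2];
}

/*
 * Once the table is known to be sorted, the longest free region is simply
 * the first entry, so no scan is needed to compare it with the length
 * recorded in the freespace index.
 */
static bool freesp_matches(const uint16_t *len, uint16_t indexed_len)
{
	return bestfree_sorted(len) && len[0] == indexed_len;
}
```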

> Hmmm, and now I've remembered that, xfs_scrub_directory_check_free_entry()
> probably only needs to do a reverse scan just to find the smallest
> non-zero entry...
> 
> > +/* Check free space info in a directory leaf1 block. */
> > +STATIC int
> > +xfs_scrub_directory_leaf1_bestfree(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_da_args		*args,
> > +	xfs_dablk_t			lblk)
> > +{
> > +	struct xfs_dir2_leaf_tail	*ltp;
> > +	struct xfs_buf			*dbp;
> > +	struct xfs_buf			*bp;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	__be16				*bestp;
> > +	__u16				best;
> > +	int				i;
> > +	int				error;
> > +
> > +	/* Read the free space block */
> > +	error = xfs_dir3_leaf_read(sc->tp, sc->ip, lblk, -1, &bp);
> > +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> > +		goto out;
> > +
> > +	/* Check all the entries. */
> > +	ltp = xfs_dir2_leaf_tail_p(mp->m_dir_geo, bp->b_addr);
> > +	bestp = xfs_dir2_leaf_bests_p(ltp);
> > +	for (i = 0; i < be32_to_cpu(ltp->bestcount); i++, bestp++) {
> > +		best = be16_to_cpu(*bestp);
> > +		if (best == NULLDATAOFF)
> > +			continue;
> > +		error = xfs_dir3_data_read(sc->tp, sc->ip,
> > +				i * args->geo->fsbcount, -1, &dbp);
> > +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> > +			continue;
> > +		xfs_scrub_directory_check_freesp(sc, lblk, dbp, best);
> > +		xfs_trans_brelse(sc->tp, dbp);
> > +	}
> > +out:
> > +	return error;
> 
> This needs comments to explain what it is not checking because those
> checks were done in the verifier. (i.e.  hash index does not overlap
> the freespace index, stale entry count is valid).

Ok.

/*
 * Read the free space block.  The verifier will check for hash
 * value ordering problems and check the stale entry count.
 */

> hmmmm. More philosophical question: should we rerun the verifiers
> in the scrubber manually to guarantee that we fully cover whatever
> is in memory on cached and modified buffers?

That's coming in part 4 when I wire up the buf_ops to the raw structure
verifier functions and call them from scrub.

(The order can be changed; this is part 1; cross referencing with other
metadata is part 2; repairs are part 3; and exposing the structure
verifiers to internal code is part 4.)

> > +
> > +/* Check free space info in a directory freespace block. */
> > +STATIC int
> > +xfs_scrub_directory_free_bestfree(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_da_args		*args,
> > +	xfs_dablk_t			lblk)
> > +{
> > +	struct xfs_dir3_icfree_hdr	freehdr;
> > +	struct xfs_buf			*dbp;
> > +	struct xfs_buf			*bp;
> > +	__be16				*bestp;
> > +	__u16				best;
> > +	int				i;
> > +	int				error;
> > +
> > +	/* Read the free space block */
> > +	error = xfs_dir2_free_read(sc->tp, sc->ip, lblk, &bp);
> > +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> > +		goto out;
> > +
> > +	/* Check all the entries. */
> > +	sc->ip->d_ops->free_hdr_from_disk(&freehdr, bp->b_addr);
> > +	bestp = sc->ip->d_ops->free_bests_p(bp->b_addr);
> > +	for (i = 0; i < freehdr.nvalid; i++, bestp++) {
> > +		best = be16_to_cpu(*bestp);
> > +		if (best == NULLDATAOFF)
> > +			continue;
> > +		error = xfs_dir3_data_read(sc->tp, sc->ip,
> > +				(freehdr.firstdb + i) * args->geo->fsbcount,
> > +				-1, &dbp);
> > +		if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> > +			continue;
> > +		xfs_scrub_directory_check_freesp(sc, lblk, dbp, best);
> > +		xfs_trans_brelse(sc->tp, dbp);
> > +	}
> > +out:
> > +	return error;
> > +}
> > +
> > +/* Check free space information in directories. */
> > +STATIC int
> > +xfs_scrub_directory_blocks(
> > +	struct xfs_scrub_context	*sc)
> > +{
> > +	struct xfs_bmbt_irec		got;
> > +	struct xfs_da_args		args;
> > +	struct xfs_ifork		*ifp;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	xfs_fileoff_t			leaf_lblk;
> > +	xfs_fileoff_t			free_lblk;
> > +	xfs_fileoff_t			lblk;
> > +	xfs_extnum_t			idx;
> > +	bool				found;
> > +	int				is_block = 0;
> > +	int				error;
> > +
> > +	/* Ignore local format directories. */
> > +	if (sc->ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
> > +	    sc->ip->i_d.di_format != XFS_DINODE_FMT_BTREE)
> > +		return 0;
> > +
> > +	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
> > +	lblk = XFS_B_TO_FSB(mp, XFS_DIR2_DATA_OFFSET);
> > +	leaf_lblk = XFS_B_TO_FSB(mp, XFS_DIR2_LEAF_OFFSET);
> > +	free_lblk = XFS_B_TO_FSB(mp, XFS_DIR2_FREE_OFFSET);
> > +
> > +	/* Is this a block dir? */
> > +	args.dp = sc->ip;
> > +	args.geo = mp->m_dir_geo;
> > +	args.trans = sc->tp;
> > +	error = xfs_dir2_isblock(&args, &is_block);
> > +	if (!xfs_scrub_fblock_op_ok(sc, XFS_DATA_FORK, lblk, &error))
> > +		goto out;
> > +
> > +	/* Iterate all the data extents in the directory... */
> > +	found = xfs_iext_lookup_extent(sc->ip, ifp, lblk, &idx, &got);
> > +	while (found) {
> > +		/* No more data blocks... */
> > +		if (got.br_startoff >= leaf_lblk)
> > +			break;
> 
> If it's a block dir and got.br_startoff > 0, then it's corrupt?

Or if got.br_blockcount != mp->m_dir_geo->fsbcount, right?

> > +
> > +		/* Check each data block's bestfree data */
> > +		for (lblk = roundup((xfs_dablk_t)got.br_startoff,
> > +				args.geo->fsbcount);
> > +		     lblk < got.br_startoff + got.br_blockcount;
> > +		     lblk += args.geo->fsbcount) {
> 
> This is not obvious as to why it works with discontiguous directory
> blocks. I think it's because it grabs the aligned start block of
> each directory block and then internally the blocks get mapped
> correctly via the directory block read functions, but this
> definitely needs a better comment explaining the iteration mechanism
> being used here....
> 
> > +			error = xfs_scrub_directory_data_bestfree(sc, lblk,
> > +					is_block);
> > +			if (error)
> > +				goto out;
> > +		}
> > +
> > +		found = xfs_iext_get_extent(ifp, ++idx, &got);
> 
> As it is, I think this is going to check discontiguous directory
> blocks multiple times. It's going to find each extent in a
> discontiguous dir block, round it up to the next dirblock and
> scan that next dirblock. It then finds the next block in the
> current discontig block, rounds it up to the next dirblock, and
> scans it again....
> 
> I think it would be much better to use xfs_iext_lookup_extent() here
> to iterate by expected start block rather than iterating by extent
> index.

Ok.  I'll add in a comment explaining what we're doing, and change the
directory lookup to round up to the next expected directory block
offset:

/*
 * Check each data block's bestfree data.
 *
 * Iterate all the fsbcount-aligned block offsets in
 * this directory.  The directory block reading code is
 * smart enough to do its own bmap lookups to handle
 * discontiguous directory blocks.  When we're done
 * with the extent record, re-query the bmap at the
 * next fsbcount-aligned offset to avoid redundant
 * block checks.
 */
for (lblk = roundup((xfs_dablk_t)got.br_startoff,
		args.geo->fsbcount);
     lblk < got.br_startoff + got.br_blockcount;
     lblk += args.geo->fsbcount) {
	error = xfs_scrub_directory_data_bestfree(sc, lblk,
			is_block);
	if (error)
		goto out;
}
lblk = roundup((xfs_dablk_t)got.br_startoff + got.br_blockcount,
		args.geo->fsbcount);
found = xfs_iext_lookup_extent(sc->ip, ifp, lblk, &idx, &got);

Likewise for the free block iteration.
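
As a rough userspace illustration of that iteration (not the kernel code), here is a minimal sketch. struct extent_rec, lookup_extent() and walk_dir_blocks() are hypothetical stand-ins for struct xfs_bmbt_irec, xfs_iext_lookup_extent() and the loop above; it assumes a sorted, non-overlapping extent array and records the visited offsets instead of reading directory blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct xfs_bmbt_irec fields used here. */
struct extent_rec {
	uint64_t	startoff;	/* file offset, in fs blocks */
	uint64_t	blockcount;	/* length, in fs blocks */
};

/* Round x up to the next multiple of y. */
static uint64_t
round_up(uint64_t x, uint64_t y)
{
	return ((x + y - 1) / y) * y;
}

/*
 * Hypothetical analogue of xfs_iext_lookup_extent(): return the first
 * extent that ends after @off, i.e. the one containing @off or the next
 * one past it.  The array must be sorted and non-overlapping.
 */
static const struct extent_rec *
lookup_extent(const struct extent_rec *ext, size_t nr, uint64_t off)
{
	for (size_t i = 0; i < nr; i++) {
		if (ext[i].startoff + ext[i].blockcount > off)
			return &ext[i];
	}
	return NULL;
}

/*
 * Record every fsbcount-aligned offset below @limit that falls inside a
 * mapped extent.  Stores at most @max offsets in @out, returns the count.
 */
static size_t
walk_dir_blocks(const struct extent_rec *ext, size_t nr, uint64_t fsbcount,
		uint64_t limit, uint64_t *out, size_t max)
{
	const struct extent_rec	*got;
	uint64_t		lblk;
	size_t			n = 0;

	assert(fsbcount > 0);
	got = lookup_extent(ext, nr, 0);
	while (got && got->startoff < limit) {
		for (lblk = round_up(got->startoff, fsbcount);
		     lblk < got->startoff + got->blockcount;
		     lblk += fsbcount) {
			if (n < max)
				out[n++] = lblk;
		}
		/* Re-query at the next aligned offset, not the next index. */
		lblk = round_up(got->startoff + got->blockcount, fsbcount);
		got = lookup_extent(ext, nr, lblk);
	}
	return n;
}
```

With fsbcount = 4 and extents (0,2), (2,2), (4,4), (10,2), this sketch records offsets 0 and 4: the second half of the split first block is skipped by the re-query, and the extent at offset 10 contains no aligned block start.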

> > +	}
> > +
> > +	/* Look for a leaf1 block, which has free info. */
> > +	if (xfs_iext_lookup_extent(sc->ip, ifp, leaf_lblk, &idx, &got) &&
> > +	    got.br_startoff == leaf_lblk &&
> > +	    got.br_blockcount == args.geo->fsbcount &&
> > +	    !xfs_iext_get_extent(ifp, ++idx, &got)) {
> > +		if (is_block) {
> > +			xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
> > +			goto not_leaf1;
> 
> Can just abort the scrub at this point.

Yes.  I'll also abort on corruption prior to the free_lblk scan.

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 25/25] xfs: scrub quota information
  2017-10-09 22:17       ` Dave Chinner
@ 2017-10-09 23:08         ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2017-10-09 23:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 10, 2017 at 09:17:51AM +1100, Dave Chinner wrote:
> On Mon, Oct 09, 2017 at 01:03:28PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 09, 2017 at 01:51:51PM +1100, Dave Chinner wrote:
> > > On Tue, Oct 03, 2017 at 01:43:27PM -0700, Darrick J. Wong wrote:
> > > > +xfs_scrub_quota_to_dqtype(
> > > > +	struct xfs_scrub_context	*sc)
> > > > +{
> > > > +	switch (sc->sm->sm_type) {
> > > > +	case XFS_SCRUB_TYPE_UQUOTA:
> > > > +		return XFS_DQ_USER;
> > > > +	case XFS_SCRUB_TYPE_GQUOTA:
> > > > +		return XFS_DQ_GROUP;
> > > > +	case XFS_SCRUB_TYPE_PQUOTA:
> > > > +		return XFS_DQ_PROJ;
> > > > +	default:
> > > > +		return 0;
> > > > +	}
> > > > +}
> > > > +
> > > > +/* Set us up to scrub a quota. */
> > > > +int
> > > > +xfs_scrub_setup_quota(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	struct xfs_inode		*ip)
> > > > +{
> > > > +	uint				dqtype;
> > > > +
> > > > +	if (sc->sm->sm_agno || sc->sm->sm_ino || sc->sm->sm_gen)
> > > > +		return -EINVAL;
> > > > +
> > > > +	dqtype = xfs_scrub_quota_to_dqtype(sc);
> > > > +	if (dqtype == 0)
> > > > +		return -EINVAL;
> > > > +	return 0;
> > > > +}
> > > 
> > > Should this check whether the quota type is actually enabled, and
> > > return ENOENT if it's not? i.e move the check out of
> > > xfs_scrub_quota() and into the setup function?
> > 
> > I can add an xfs_this_quota_on check to the setup function, but don't we
> > need xfs_scrub_quota to lock qi_quotaofflock and then recheck that the
> > quota type is still enabled?
> 
> The qi_quotaofflock is held across the entire scrub, right?
> 
> Ah, this is called before the qi_quotaofflock is held - is there
> a teardown callback? If so, would it be better to lock
> qi_quotaofflock in the setup, release it in the teardown? That way
> we can check here in the setup code, and not have to double check
> the user input in the scrub function itself?

There isn't a generic callback in the teardown.
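
Without a teardown hook, the pattern under discussion is roughly "check early in setup, then lock and recheck in the scrub function".  A minimal userspace sketch of that pattern follows; struct quota_state and both functions are hypothetical, a pthread mutex stands in for qi_quotaofflock, and a plain flag stands in for the xfs_this_quota_on test.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the quota state discussed above. */
struct quota_state {
	pthread_mutex_t	quotaoff_lock;	/* plays the role of qi_quotaofflock */
	bool		enabled;	/* is this quota type turned on? */
};

/* Setup: unlocked check so that requests for a disabled type fail early. */
static int
scrub_setup_quota(struct quota_state *qs)
{
	assert(qs != NULL);
	return qs->enabled ? 0 : -ENOENT;
}

/* Scrub: take the lock, then recheck, since quotaoff may have raced us. */
static int
scrub_quota(struct quota_state *qs)
{
	int	error = 0;

	assert(qs != NULL);
	pthread_mutex_lock(&qs->quotaoff_lock);
	if (!qs->enabled) {
		error = -ENOENT;
		goto out_unlock;
	}
	/* ... examine the quota records while quotaoff is held off ... */
out_unlock:
	pthread_mutex_unlock(&qs->quotaoff_lock);
	return error;
}
```

The recheck under the lock is what makes the early check in setup safe to treat as advisory only.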

> > > > +	/*
> > > > +	 * Warn if the limits are larger than the fs.  Administrators
> > > > +	 * can do this, though in production this seems suspect.
> > > > +	 */
> > > > +	if (bhard > mp->m_sb.sb_dblocks || bsoft > mp->m_sb.sb_dblocks)
> > > > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > > > +	if (ihard > inodes || isoft > inodes)
> > > > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > > > +	if (rhard > mp->m_sb.sb_rblocks || rsoft > mp->m_sb.sb_rblocks)
> > > > +		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > > 
> > > Can you stack these so there's one per line? i.e.:
> > 
> > Will do.  I'll also change the ihard/isoft check here to check against
> > mp->m_maxicount directly.
> > 
> > > 	if (bhard > mp->m_sb.sb_dblocks ||
> > > 	    bsoft > mp->m_sb.sb_dblocks)
> > > 		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> > > 
> > > > +
> > > > +	/* Soft limit must not exceed the hard limit. */
> > > > +	if (bsoft > bhard || isoft > ihard || rsoft > rhard)
> > > > +		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
> > > 
> > > Though with this check, I suspect you only need to check the hard
> > > limits against their upper limits because if the hard limit is valid
> > > and the soft is above then it's going to trigger corruption. Do we
> > > need a warning as well in that case?
> > 
> > I don't follow here ... if the soft limit is above the hard limit, what
> > will trigger corruption? 
> 
> What I meant was this:
> 
> 	if (bhard > mp->m_sb.sb_dblocks)
> 		xfs_scrub_fblock_set_warning(sc, XFS_DATA_FORK, offset);
> 	if (bsoft > bhard)
> 		xfs_scrub_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
> 
> That is, if bhard is over the valid limit, throw a warning and we
> now don't care about bsoft because we've already signalled to
> userspace there's a problem.
> 
> If bsoft is now greater than bhard, regardless of whether it's over
> the valid size, we signal a corruption. If bhard is already an
> invalid number, then this corruption report also implies that bsoft
> is over the limit...
> 
> > The quota syscalls enforce that bhard > bsoft
> > when the admin tries to set new limits, but if the disk is corrupt such
> > that dblocks = 300, bhard = 250, and bsoft = 280, the checks for
> > (bhard > dblocks || bsoft > dblocks) won't trigger.  That's why
> > there's an explicit bsoft > bhard check here.
> 
> Right, I wasn't suggesting removing it - I didn't make it very clear
> what I meant.

Ah, ok, I get it now, thanks.
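
To spell out the combination we converged on, here is a small standalone sketch.  check_limits() and the two flag names are hypothetical; in the patch the outcomes are reported through xfs_scrub_fblock_set_warning() and xfs_scrub_fblock_set_corrupt().  It only models the two comparisons from Dave's snippet and ignores any other special cases.

```c
#include <stdint.h>

/* Hypothetical result flags, for illustration only. */
#define SCRUB_WARNING	(1u << 0)
#define SCRUB_CORRUPT	(1u << 1)

/*
 * Check one hard/soft limit pair against the largest sensible value
 * (e.g. sb_dblocks for block limits).
 */
static unsigned int
check_limits(uint64_t hard, uint64_t soft, uint64_t max)
{
	unsigned int	flags = 0;

	/* A hard limit beyond the fs size is suspect but permitted. */
	if (hard > max)
		flags |= SCRUB_WARNING;

	/* A soft limit above the hard limit is always corruption. */
	if (soft > hard)
		flags |= SCRUB_CORRUPT;

	return flags;
}
```

With the numbers from my example (dblocks = 300, bhard = 250, bsoft = 280) this returns only SCRUB_CORRUPT, which is the case the explicit bsoft > bhard check exists to catch.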

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


end of thread, other threads:[~2017-10-09 23:08 UTC | newest]

Thread overview: 91+ messages
2017-10-03 20:40 [PATCH v11 00/25] xfs: online scrub support Darrick J. Wong
2017-10-03 20:40 ` [PATCH 01/25] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
2017-10-03 20:41 ` [PATCH 02/25] xfs: dispatch metadata scrub subcommands Darrick J. Wong
2017-10-03 20:41 ` [PATCH 03/25] xfs: probe the scrub ioctl Darrick J. Wong
2017-10-03 23:32   ` Dave Chinner
2017-10-04  0:02     ` Darrick J. Wong
2017-10-04  1:56       ` Dave Chinner
2017-10-04  3:14         ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 04/25] xfs: create helpers to record and deal with scrub problems Darrick J. Wong
2017-10-03 23:44   ` Dave Chinner
2017-10-04  0:56     ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 05/25] xfs: create helpers to scrub a metadata btree Darrick J. Wong
2017-10-03 23:49   ` Dave Chinner
2017-10-04  0:13     ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 06/25] xfs: scrub the shape of " Darrick J. Wong
2017-10-04  0:15   ` Dave Chinner
2017-10-04  3:51     ` Darrick J. Wong
2017-10-04  5:48       ` Dave Chinner
2017-10-04 17:48         ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 07/25] xfs: scrub btree keys and records Darrick J. Wong
2017-10-04 20:52   ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 08/25] xfs: create helpers to scan an allocation group Darrick J. Wong
2017-10-04  0:46   ` Dave Chinner
2017-10-04  3:58     ` Darrick J. Wong
2017-10-04  5:59       ` Dave Chinner
2017-10-04 17:51         ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 09/25] xfs: scrub the backup superblocks Darrick J. Wong
2017-10-04  0:57   ` Dave Chinner
2017-10-04  4:06     ` Darrick J. Wong
2017-10-04  6:13       ` Dave Chinner
2017-10-04 17:56         ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 10/25] xfs: scrub AGF and AGFL Darrick J. Wong
2017-10-04  1:31   ` Dave Chinner
2017-10-04  4:21     ` Darrick J. Wong
2017-10-04  6:28       ` Dave Chinner
2017-10-04 17:57         ` Darrick J. Wong
2017-10-03 20:41 ` [PATCH 11/25] xfs: scrub the AGI Darrick J. Wong
2017-10-04  1:43   ` Dave Chinner
2017-10-04  4:25     ` Darrick J. Wong
2017-10-04  6:43       ` Dave Chinner
2017-10-04 18:02         ` Darrick J. Wong
2017-10-04 22:16           ` Dave Chinner
2017-10-04 23:12             ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 12/25] xfs: scrub free space btrees Darrick J. Wong
2017-10-05  0:59   ` Dave Chinner
2017-10-05  1:13     ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 13/25] xfs: scrub inode btrees Darrick J. Wong
2017-10-05  2:08   ` Dave Chinner
2017-10-05  5:47     ` Darrick J. Wong
2017-10-05  7:22       ` Dave Chinner
2017-10-05 18:26         ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 14/25] xfs: scrub rmap btrees Darrick J. Wong
2017-10-05  2:56   ` Dave Chinner
2017-10-05  5:02     ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 15/25] xfs: scrub refcount btrees Darrick J. Wong
2017-10-05  2:59   ` Dave Chinner
2017-10-05  5:02     ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 16/25] xfs: scrub inodes Darrick J. Wong
2017-10-05  4:04   ` Dave Chinner
2017-10-05  5:22     ` Darrick J. Wong
2017-10-05  7:13       ` Dave Chinner
2017-10-05 19:56         ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 17/25] xfs: scrub inode block mappings Darrick J. Wong
2017-10-06  2:51   ` Dave Chinner
2017-10-06 17:00     ` Darrick J. Wong
2017-10-07 23:10       ` Dave Chinner
2017-10-08  3:54         ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 18/25] xfs: scrub directory/attribute btrees Darrick J. Wong
2017-10-06  5:07   ` Dave Chinner
2017-10-06 18:30     ` Darrick J. Wong
2017-10-03 20:42 ` [PATCH 19/25] xfs: scrub directory metadata Darrick J. Wong
2017-10-06  7:07   ` Dave Chinner
2017-10-06 19:45     ` Darrick J. Wong
2017-10-06 22:16       ` Dave Chinner
2017-10-03 20:42 ` [PATCH 20/25] xfs: scrub directory freespace Darrick J. Wong
2017-10-09  1:44   ` Dave Chinner
2017-10-09 22:54     ` Darrick J. Wong
2017-10-03 20:43 ` [PATCH 21/25] xfs: scrub extended attributes Darrick J. Wong
2017-10-09  2:13   ` Dave Chinner
2017-10-09 21:14     ` Darrick J. Wong
2017-10-03 20:43 ` [PATCH 22/25] xfs: scrub symbolic links Darrick J. Wong
2017-10-09  2:17   ` Dave Chinner
2017-10-03 20:43 ` [PATCH 23/25] xfs: scrub parent pointers Darrick J. Wong
2017-10-03 20:43 ` [PATCH 24/25] xfs: scrub realtime bitmap/summary Darrick J. Wong
2017-10-09  2:28   ` Dave Chinner
2017-10-09 20:24     ` Darrick J. Wong
2017-10-03 20:43 ` [PATCH 25/25] xfs: scrub quota information Darrick J. Wong
2017-10-09  2:51   ` Dave Chinner
2017-10-09 20:03     ` Darrick J. Wong
2017-10-09 22:17       ` Dave Chinner
2017-10-09 23:08         ` Darrick J. Wong
